Arrow Research search

Author name cluster

Zixu Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

4 papers
1 author row

Possible papers


AAAI Conference 2026 Conference Paper

HABIT: Chrono-Synergia Robust Progressive Learning Framework for Composed Image Retrieval

  • Zixu Li
  • Yupeng Hu
  • Zhiwei Chen
  • Shiqi Zhang
  • Qinlei Huang
  • Zhiheng Fu
  • Yinwei Wei

Composed Image Retrieval (CIR) is a flexible image retrieval paradigm that enables users to accurately locate the target image through a multimodal query composed of a reference image and modification text. Although this task has demonstrated promising applications in personalized search and recommendation systems, it encounters a severe challenge in practical scenarios known as the Noise Triplet Correspondence (NTC) problem. This issue primarily arises from the high cost and subjectivity involved in annotating triplet data. To address this problem, we identify two central challenges: the precise estimation of composed semantic discrepancy and the insufficient progressive adaptation to modification discrepancy. To tackle these challenges, we propose a cHrono-synergiA roBust progressIve learning framework for composed image reTrieval (HABIT), which consists of two core modules. First, the Mutual Knowledge Estimation Module quantifies sample cleanliness by calculating the Transition Rate of mutual information between the composed feature and the target image, thereby effectively identifying clean samples that align with the intended modification semantics. Second, the Dual-consistency Progressive Learning Module introduces a collaborative mechanism between the historical and current models, simulating human habit formation to retain good habits and calibrate bad habits, ultimately enabling robust learning in the presence of NTC. Extensive experiments conducted on two standard CIR datasets demonstrate that HABIT significantly outperforms most methods across various noise ratios, exhibiting superior robustness and retrieval performance.
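The Mutual Knowledge Estimation idea, ranking samples by how unstable their composed-to-target agreement is across training, can be sketched with a similarity-based proxy. The `transition_rate` and `select_clean` helpers below, the substitution of a cosine-similarity history for the paper's mutual-information estimate, and the fixed keep ratio are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def transition_rate(sim_hist):
    """Mean per-epoch change in each sample's composed-to-target agreement.

    sim_hist: array of shape (epochs, n_samples) holding, per epoch, a
    similarity score between the composed feature and the target image
    (a stand-in here for the paper's mutual-information signal).
    """
    sim_hist = np.asarray(sim_hist, dtype=float)
    return np.abs(np.diff(sim_hist, axis=0)).mean(axis=0)

def select_clean(sim_hist, ratio=0.5):
    """Treat the most stable samples (lowest transition rate) as clean."""
    rate = transition_rate(sim_hist)
    k = max(1, int(round(len(rate) * ratio)))
    return np.argsort(rate)[:k]
```

The intuition is that a mismatched triplet's agreement with its wrong target keeps fluctuating as the model improves, so the stable samples are the safer supervision signal.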

AAAI Conference 2026 Conference Paper

INTENT: Invariance and Discrimination-aware Noise Mitigation for Robust Composed Image Retrieval

  • Zhiwei Chen
  • Yupeng Hu
  • Zhiheng Fu
  • Zixu Li
  • Jiale Huang
  • Qinlei Huang
  • Yinwei Wei

Composed Image Retrieval (CIR) is a challenging image retrieval paradigm that enables users to retrieve target images based on multimodal queries consisting of reference images and modification texts. Although substantial progress has been made in recent years, existing methods assume that all samples are correctly matched. However, in real-world scenarios, due to high triplet annotation costs, CIR datasets inevitably contain annotation errors, resulting in incorrectly matched triplets. To address this issue, the problem of Noisy Triplet Correspondence (NTC) has attracted growing attention. We argue that noise in CIR can be categorized into two types: cross-modal correspondence noise and modality-inherent noise. The former arises from mismatches across modalities, whereas the latter originates from intra-modal background interference or visual factors irrelevant to the coarse-grained modification annotations. However, modality-inherent noise is often overlooked, and research on cross-modal correspondence noise remains nascent. To tackle the above issues, we propose the Invariance and discrimiNaTion-awarE Noise neTwork (INTENT), comprising two components, Visual Invariant Composition and Bi-Objective Discriminative Learning, each designed to handle one of the two noise types. The former applies causal intervention on the visual side via the Fast Fourier Transform (FFT) to generate intervened composed features, enforcing visual invariance and enabling the model to ignore modality-inherent noise during composition. The latter adopts collaborative optimization with both positive and negative samples, and constructs a scalable decision boundary that dynamically adjusts decisions based on the loyalty degree, enabling robust correspondence discrimination. Extensive experiments on two widely used benchmark datasets demonstrate the superiority and robustness of INTENT.
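The FFT-based intervention can be illustrated on a raw 2-D feature map: perturbing the amplitude spectrum while preserving the phase produces an "intervened" view in which background-like statistics change but spatial structure is kept. The `fft_intervene` helper and its single-scalar noise model are assumptions for illustration; the paper's actual intervention operates inside its composition network.

```python
import numpy as np

def fft_intervene(feat, noise_scale=0.1, seed=None):
    """Amplitude-only perturbation of a real 2-D feature map.

    Decomposes the spectrum into amplitude (style/background-like
    information) and phase (structure-like information), jitters only
    the amplitude, and transforms back. With noise_scale=0 this is an
    exact round trip.
    """
    rng = np.random.default_rng(seed)
    spec = np.fft.fft2(feat)
    amp, phase = np.abs(spec), np.angle(spec)
    amp = amp * (1.0 + noise_scale * rng.standard_normal(amp.shape))
    return np.real(np.fft.ifft2(amp * np.exp(1j * phase)))
```

Training the composer to produce similar outputs for the original and intervened views is one way to encourage the visual invariance the abstract describes.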

AAAI Conference 2026 Conference Paper

ReTrack: Evidence-Driven Dual-Stream Directional Anchor Calibration Network for Composed Video Retrieval

  • Zixu Li
  • Yupeng Hu
  • Zhiwei Chen
  • Qinlei Huang
  • Guozhi Qiu
  • Zhiheng Fu
  • Meng Liu

With the rapid growth of video data, Composed Video Retrieval (CVR) has emerged as a novel paradigm in video retrieval and is receiving increasing attention from researchers. Unlike unimodal video retrieval methods, the CVR task takes a multi-modal query consisting of a reference video and a piece of modification text as input. The modification text conveys the user's intended alterations to the reference video. Based on this input, the model aims to retrieve the most relevant target video. In the CVR task, there exists a substantial discrepancy in information density between video and text modalities. Traditional composition methods tend to bias the composed feature toward the reference video, which leads to suboptimal retrieval performance. This limitation is significant due to the presence of three core challenges: (1) modal contribution entanglement, (2) explicit optimization of composed features, and (3) retrieval uncertainty. To address these challenges, we propose the evidence-dRivEn dual-sTream diRectionAl anChor calibration networK (ReTrack). ReTrack is the first CVR framework that improves multi-modal query understanding by calibrating directional bias in composed features. It consists of three key modules: Semantic Contribution Disentanglement, Composition Geometry Calibration, and Reliable Evidence-driven Alignment. Specifically, ReTrack estimates the semantic contribution of each modality to calibrate the directional bias of the composed feature. It then uses the calibrated directional anchors to compute bidirectional evidence that drives reliable composed-to-target similarity estimation. Moreover, ReTrack exhibits strong generalization to the Composed Image Retrieval (CIR) task, achieving SOTA performance across three benchmark datasets in both CVR and CIR scenarios.
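The directional-calibration idea, detecting that a composed feature leans toward the reference modality and nudging it back toward the under-represented text direction, can be sketched in embedding space. The `rebalance` heuristic below, with cosine similarities standing in for ReTrack's learned semantic-contribution estimates and directional anchors, is purely illustrative.

```python
import numpy as np

def unit(v):
    """L2-normalize a vector."""
    return v / (np.linalg.norm(v) + 1e-8)

def rebalance(composed, ref, txt):
    """Nudge a reference-dominated composed feature toward the text direction.

    The cosine similarities below are a crude stand-in for learned
    per-modality semantic-contribution estimates.
    """
    c, r, t = unit(composed), unit(ref), unit(txt)
    w_ref = float(c @ r)  # how strongly the composed feature follows the reference
    w_txt = float(c @ t)  # how strongly it follows the modification text
    # Anchor direction that boosts whichever modality is under-represented.
    anchor = unit((1.0 - w_ref) * r + (1.0 - w_txt) * t)
    return unit(c + anchor)
```

When the composed feature is almost collinear with the reference embedding, the anchor is dominated by the text direction, pulling the result back toward the stated modification.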

AAAI Conference 2025 Conference Paper

ENCODER: Entity Mining and Modification Relation Binding for Composed Image Retrieval

  • Zixu Li
  • Zhiwei Chen
  • Haokun Wen
  • Zhiheng Fu
  • Yupeng Hu
  • Weili Guan

The objective of Composed Image Retrieval (CIR) is to identify a target image that meets the user's requirements based on a multimodal query (comprising the reference image and the modification text). Despite the notable success of existing approaches, they fail to adequately address the modification relation between visual entities and modification actions. This limitation is non-trivial due to three challenges: 1) irrelevant factor perturbation, 2) vague semantic boundaries, and 3) implicit modification relations. To address these challenges, we propose an Entity miNing and modifiCation relatiOn binDing nEtwoRk (ENCODER), designed to mine visual entities and modification actions, and then bind modification relations. First, we design the Latent Factor Filter (LFF) module to filter visual and textual latent factors related to modification semantics via a threshold gating mechanism. Second, we propose Entity-Action Binding (EAB), which comprises modality-shared Learnable Relation Queries (LRQ) capable of mining visual entities and modification actions, as well as learning implicit modification relations for entity-action binding. Finally, the Multi-scale Composition module is introduced to achieve multi-scale feature composition, guided by the entity-action binding. Extensive experiments on four benchmark datasets demonstrate the superiority of our proposed method.
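The threshold-gating idea behind the Latent Factor Filter can be sketched as a hard gate over latent factors: a factor is kept only when its similarity to a modification-semantics query clears a threshold. The `latent_factor_filter` name, the cosine-similarity score, and the fixed threshold `tau` are illustrative assumptions, not the paper's exact module.

```python
import numpy as np

def latent_factor_filter(factors, query, tau=0.3):
    """Hard threshold gate over latent factors.

    factors: (n_factors, dim) visual or textual latent factors.
    query:   (dim,) vector representing the modification semantics.
    Factors whose cosine similarity to the query is at most tau are
    zeroed out as irrelevant perturbations.
    """
    q = query / (np.linalg.norm(query) + 1e-8)
    f = factors / (np.linalg.norm(factors, axis=1, keepdims=True) + 1e-8)
    gate = (f @ q > tau).astype(float)  # 1 = keep, 0 = suppress
    return factors * gate[:, None]
```

A learned, soft (e.g. sigmoid) gate would be the differentiable analogue of this hard cut-off.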