Arrow Research

Author name cluster

Yanzhe Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

3 papers · 2 author rows

Possible papers (3)

AAAI 2026 · Conference Paper

UniAPO: Unified Multimodal Automated Prompt Optimization

  • Qipeng Zhu
  • Yanzhe Chen
  • Huasong Zhong
  • Jie Chen
  • Yan Li
  • Zhixin Zhang
  • Junping Zhang
  • Zhenheng Yang

Prompting is fundamental to unlocking the full potential of large language models. To automate and enhance this process, automatic prompt optimization (APO) has been developed, demonstrating effectiveness primarily in text-only input scenarios. However, extending existing APO methods to multimodal tasks—such as video-language generation—introduces two core challenges: (i) visual token inflation, where long visual-token sequences restrict context capacity and result in insufficient feedback signals; (ii) a lack of process-level supervision, as existing methods focus on outcome-level supervision and overlook intermediate supervision, limiting prompt optimization. We present UniAPO: Unified Multimodal Automated Prompt Optimization, the first framework tailored for multimodal APO. UniAPO adopts an EM-inspired optimization process that decouples feedback modeling and prompt refinement, making the optimization more stable and goal-driven. To further address the aforementioned challenges, we introduce a short-long term memory mechanism: historical feedback mitigates context limitations, while historical prompts provide directional guidance for effective prompt optimization. UniAPO achieves consistent gains across text, image, and video benchmarks, establishing a unified framework for efficient and transferable prompt optimization.

ICLR 2025 · Conference Paper

MAI: A Multi-turn Aggregation-Iteration Model for Composed Image Retrieval

  • Yanzhe Chen
  • Zhiwen Yang
  • Jinglin Xu
  • Yuxin Peng 0001

Multi-Turn Composed Image Retrieval (MTCIR) addresses a real-world scenario where users iteratively refine retrieval results by providing additional information until a target meeting all their requirements is found. Existing methods primarily achieve MTCIR through a "multiple single-turn" paradigm, wherein methods incorrectly converge on shortcuts that only utilize the most recent turn's image, ignoring attributes from historical turns. Consequently, retrieval failures occur when modification requests involve historical information. We argue that explicitly incorporating historical information into the modified text is crucial to addressing this issue. To this end, we build a new retrospective-based MTCIR dataset, FashionMT, wherein modification demands are highly associated with historical turns. We also propose a Multi-turn Aggregation-Iteration (MAI) model, emphasizing efficient aggregation of multimodal semantics and optimization of information propagation in multi-turn retrieval. Specifically, we propose a new Two-stage Semantic Aggregation (TSA) paradigm coupled with a Cyclic Combination Loss (CCL), achieving improved semantic consistency and modality alignment by progressively interacting the reference image with its caption and the modified text. In addition, we design a Multi-turn Iterative Optimization (MIO) mechanism that dynamically selects representative tokens and reduces redundancy during multi-turn iterations. Extensive experiments demonstrate that the proposed MAI model achieves substantial improvements over state-of-the-art methods.

AAAI 2024 · Conference Paper

FashionERN: Enhance-and-Refine Network for Composed Fashion Image Retrieval

  • Yanzhe Chen
  • Huasong Zhong
  • Xiangteng He
  • Yuxin Peng
  • Jiahuan Zhou
  • Lele Cheng

The goal of composed fashion image retrieval is to locate a target image based on a reference image and modified text. Recent methods utilize symmetric encoders (e.g., CLIP) pre-trained on large-scale non-fashion datasets. However, the input for this task exhibits an asymmetric nature, where the reference image contains rich content while the modified text is often brief. Therefore, methods employing symmetric encoders encounter a severe phenomenon: retrieval results dominated by reference images, leading to the oversight of modified text. We propose a Fashion Enhance-and-Refine Network (FashionERN) centered around two aspects: enhancing the text encoder and refining visual semantics. We introduce a Triple-branch Modifier Enhancement model, which injects relevant information from the reference image and aligns the modified text modality with the target image modality. Furthermore, we propose a Dual-guided Vision Refinement model that retains critical visual information through text-guided refinement and self-guided refinement processes. The combination of these two models significantly mitigates the reference dominance phenomenon, ensuring accurate fulfillment of modifier requirements. Comprehensive experiments demonstrate our approach's state-of-the-art performance on four commonly used datasets.