Arrow Research search

Author name cluster

Yi Xin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
1 author row

Possible papers (10)

AAAI Conference 2026 Conference Paper

Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models

  • Yifan Jia
  • Yuntao Du
  • Kailin Jiang
  • Yuyang Liang
  • Qihan Ren
  • Yi Xin
  • Rui Yang
  • Fenze Feng

Large Multimodal Models (LMMs) face notable challenges when encountering multimodal knowledge conflicts, particularly under retrieval-augmented generation (RAG) frameworks, where the contextual information from external sources may contradict the model’s internal parametric knowledge, leading to unreliable outputs. However, existing benchmarks fail to reflect such realistic conflict scenarios. Most focus solely on intra-memory conflicts, while context-memory and inter-context conflicts remain largely unaddressed. Furthermore, commonly used factual knowledge-based evaluations are often overlooked, and existing datasets lack a thorough investigation into conflict detection capabilities. To bridge this gap, we propose MMKC-Bench, a benchmark designed to evaluate factual knowledge conflicts in both context-memory and inter-context scenarios. MMKC-Bench encompasses four types of multimodal knowledge conflicts and includes 1,881 knowledge instances and 3,997 images across 32 broad types, collected through automated pipelines with human verification. We evaluate four representative series of LMMs on both model behavior analysis and conflict detection tasks. Our findings show that while current LMMs are capable of recognizing knowledge conflicts, they tend to favor internal parametric knowledge over external evidence. We hope MMKC-Bench will foster further research in multimodal knowledge conflict and enhance the development of multimodal RAG systems.
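
For intuition, here is a minimal sketch of how one might score whether a model sides with conflicting external context or with its parametric memory on a single instance. This is not the MMKC-Bench code; the instance fields, the `judge_conflict_behavior` helper, and the `query_lmm` callable are all hypothetical stand-ins.

```python
# Hypothetical scoring of a single context-memory conflict instance: the model
# is queried with conflicting context and its answer is matched against the
# original fact ("memory") vs. the edited, conflicting fact ("context").

def judge_conflict_behavior(query_lmm, instance):
    """instance: dict with illustrative fields 'question', 'context',
    'memory_answer' (original fact) and 'context_answer' (conflicting fact)."""
    answer = query_lmm(question=instance["question"], context=instance["context"])
    follows_context = instance["context_answer"].lower() in answer.lower()
    follows_memory = instance["memory_answer"].lower() in answer.lower()
    if follows_context and not follows_memory:
        return "context"   # adopted the external (conflicting) evidence
    if follows_memory and not follows_context:
        return "memory"    # stuck with internal parametric knowledge
    return "ambiguous"

# Toy model that always repeats its parametric belief:
fake_lmm = lambda question, context: "The landmark was completed in 1889."
print(judge_conflict_behavior(fake_lmm, {
    "question": "When was the landmark completed?",
    "context": "Recent records state the landmark was completed in 1875.",
    "memory_answer": "1889",
    "context_answer": "1875",
}))  # -> "memory"
```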

AAAI Conference 2026 Conference Paper

Beyond Conservation: Flexible Molecular Assembly with Unbalanced Diffusion Bridge

  • Rongchao Zhang
  • Yiwei Lou
  • Yu Huang
  • Yi Xin
  • Yongzhi Cao
  • Hanpin Wang

Molecular assembly (MA) has long been a fundamental task in chemistry and biology, with the potential to create new materials and enable novel functions beyond the molecular scale. However, its vast conformational search space poses substantial challenges, and current generative models remain limited in capturing molecular flexibility and preventing non-physical poses. In this paper, we propose AssemUDB, a diffusion bridge–based framework that learns transport mappings between two distinct flexible domains for molecular assembly generation. We reformulate the marginal matching constraint of diffusion bridges as a coupling distribution governed by unbalanced transport rather than imposing strict conservation. Subsequently, we employ a progressive process from structural relaxation in Euclidean space to assembly on the SE(3) manifold. This relaxation of marginal conservation grants the generative model greater flexibility and leads to more physically plausible atom placements. Comprehensive experiments demonstrate the superior performance of AssemUDB. Notably, we find that the method demonstrates performance comparable to, or even better than, mature tools such as PackMol for packing tasks.
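
For context on the "unbalanced transport rather than strict conservation" idea, the contrast below follows the generic unbalanced optimal transport formulation, in which hard marginal constraints are replaced by divergence penalties; it is an illustration of the general principle, not the paper's exact objective, and the cost c and weights λ are generic symbols.

```latex
% Balanced coupling: both marginals are conserved exactly.
\min_{\gamma \ge 0} \int c(x, y)\, \mathrm{d}\gamma(x, y)
\quad \text{s.t.} \quad \gamma_X = \mu, \;\; \gamma_Y = \nu.

% Unbalanced relaxation: conservation becomes a soft divergence penalty,
% so mass may be created or destroyed at a cost.
\min_{\gamma \ge 0} \int c(x, y)\, \mathrm{d}\gamma(x, y)
\;+\; \lambda_1\, \mathrm{KL}(\gamma_X \,\|\, \mu)
\;+\; \lambda_2\, \mathrm{KL}(\gamma_Y \,\|\, \nu).
```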

AAAI Conference 2026 Conference Paper

TIDE: Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation

  • Victor Shea-Jay Huang
  • Le Zhuo
  • Yi Xin
  • Zhaokai Wang
  • Fu-Yun Wang
  • Yuchi Wang
  • Renrui Zhang
  • Peng Gao

Diffusion Transformers (DiTs) are a powerful yet underexplored class of generative models compared to U-Net-based diffusion architectures. We propose TIDE—Temporal-aware sparse autoencoders for Interpretable Diffusion transformErs—a framework designed to extract sparse, interpretable activation features across timesteps in DiTs. TIDE effectively captures temporally-varying representations and reveals that DiTs naturally learn hierarchical semantics (e.g., 3D structure, object class, and fine-grained concepts) during large-scale pretraining. Experiments show that TIDE enhances interpretability and controllability while maintaining reasonable generation quality, enabling applications such as safe image editing and style transfer.
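
As a rough illustration of a "temporal-aware" sparse autoencoder, the sketch below trains a standard SAE on transformer activations with the diffusion timestep folded into the input, so learned features can vary across timesteps. This is an assumed construction, not the TIDE implementation; all sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class TimestepConditionedSAE(nn.Module):
    def __init__(self, d_model: int, n_features: int, d_time: int = 32):
        super().__init__()
        self.time_embed = nn.Embedding(1000, d_time)   # assumes <= 1000 timesteps
        self.encoder = nn.Linear(d_model + d_time, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts: torch.Tensor, t: torch.Tensor):
        # acts: (batch, d_model) activations from a chosen DiT block
        # t:    (batch,) integer diffusion timesteps
        z = torch.relu(self.encoder(torch.cat([acts, self.time_embed(t)], dim=-1)))
        return self.decoder(z), z

sae = TimestepConditionedSAE(d_model=1152, n_features=8192)
acts, t = torch.randn(4, 1152), torch.randint(0, 1000, (4,))
recon, z = sae(acts, t)
# Typical SAE objective: reconstruction plus an L1 sparsity penalty on the codes.
loss = torch.mean((recon - acts) ** 2) + 1e-3 * z.abs().mean()
```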

AAAI Conference 2025 Conference Paper

Exploit Your Latents: Coarse-Grained Protein Backmapping with Latent Diffusion Models

  • Rongchao Zhang
  • Yu Huang
  • Yiwei Lou
  • Yi Xin
  • Haixu Chen
  • Yongzhi Cao
  • Hanpin Wang

Coarse-grained (CG) molecular dynamics of proteins is a preferred approach to studying large molecules on extended time scales by condensing the entire atomic model into a limited number of pseudo-atoms while preserving the thermodynamic properties of the system. However, the significantly increased efficiency impedes the analysis of substantial physicochemical information, since high-resolution atomic details are sacrificed to accelerate simulation. In this paper, we propose LatCPB, a generative approach based on diffusion that enables high-resolution backmapping of CG proteins. Specifically, our model encodes an all-atom structure into discrete latent embeddings, which are aligned with learnable multimodal discrete priors to circumvent posterior collapse and maintain the discrete properties of the protein sequence. During generation, we further design a latent diffusion process within the continuous latent space to account for the potential stochasticity in the data. Moreover, LatCPB applies a contrastive learning strategy in latent space to separate the feature representations of different molecules and of different conformations of the same molecule, thus enhancing the comprehension of molecular representational diversity. Experimental results demonstrate that LatCPB backmaps CG proteins effectively and achieves outstanding performance.
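
The contrastive step described above can be illustrated with a standard InfoNCE loss over latent embeddings, pulling paired views (e.g. two conformations of the same molecule) together and pushing other molecules apart. This is a generic sketch under that assumption, not the LatCPB code; the pairing convention and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1):
    """z_a, z_b: (batch, dim) latents of paired views; row i of z_a is the
    positive for row i of z_b, all other rows act as negatives."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(z_a.size(0))       # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```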

AAAI Conference 2025 Conference Paper

Robust Logit Adjustment for Learning with Long-Tailed Noisy Data

  • MingCai Chen
  • Yuntao Du
  • Wenyu Jiang
  • Baoming Zhang
  • Shuai Feng
  • Yi Xin
  • Chongjun Wang

Learning with noisy labels (LNL) methods have enabled the deployment of machine learning systems with imperfectly labeled data. However, these methods often struggle to identify noise in the presence of long-tailed (LT) class distributions, where the memorization effect becomes class-dependent. Conversely, LT methods are suboptimal under label noise, as noise hinders access to accurate label frequency statistics. This study addresses long-tailed noisy data by bridging the methodological gap between LNL and LT approaches. We propose a direct solution, termed Robust Logit Adjustment, which estimates ground-truth labels through label refurbishment, thereby mitigating the impact of label noise. Simultaneously, our method incorporates the distribution of training-time corrected target labels into the LT logit adjustment technique, providing class-rebalanced supervision. Extensive experiments on both synthetic and real-world long-tailed noisy datasets demonstrate the superior performance of our method.
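
To make the rebalancing idea concrete, the sketch below applies standard logit adjustment (shifting logits by the log of a class prior) where the prior is estimated from refurbished rather than observed labels. It is an assumed illustration of the combination described above, not the authors' code; the soft-label format, `tau`, and the running-prior convention are assumptions.

```python
import torch
import torch.nn.functional as F

def adjusted_loss(logits, refurbished_targets, class_prior, tau: float = 1.0):
    """logits: (batch, C); refurbished_targets: (batch, C) soft corrected labels;
    class_prior: (C,) e.g. a running mean of refurbished_targets over training."""
    adjusted = logits + tau * torch.log(class_prior + 1e-12)   # class-rebalanced logits
    log_probs = F.log_softmax(adjusted, dim=-1)
    return torch.mean(torch.sum(-refurbished_targets * log_probs, dim=-1))

# Toy usage: a 3-class batch with a long-tailed corrected prior.
logits = torch.randn(4, 3)
targets = F.one_hot(torch.tensor([0, 0, 1, 2]), 3).float()
prior = torch.tensor([0.7, 0.2, 0.1])
print(adjusted_loss(logits, targets, prior))
```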

NeurIPS Conference 2025 Conference Paper

SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

  • Zhongwei Wan
  • Zhihao Dou
  • Che Liu
  • Yu Zhang
  • Dongfei Cui
  • Qinjian Zhao
  • Hui Shen
  • Jing Xiong

Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle significantly with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful, instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks, including MathVista, MathVision, MathVerse, and MMMU-Pro, using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.
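
The GRPO backbone referenced here normalizes each sampled response's reward against its own group of responses for the same prompt. The sketch below shows only that group-relative advantage computation under the assumption of a per-response scalar reward (e.g. correctness plus a reflection-quality bonus); it is not the SRPO implementation.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards for sampled responses.
    Each reward is standardized within its own group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [1.0, 1.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```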

AAAI Conference 2024 Conference Paper

MmAP: Multi-Modal Alignment Prompt for Cross-Domain Multi-Task Learning

  • Yi Xin
  • Junlong Du
  • Qiang Wang
  • Ke Yan
  • Shouhong Ding

Multi-Task Learning (MTL) is designed to train multiple correlated tasks simultaneously, thereby enhancing the performance of individual tasks. Typically, a multi-task network structure consists of a shared backbone and task-specific decoders. However, the complexity of the decoders increases with the number of tasks. To tackle this challenge, we integrate the decoder-free vision-language model CLIP, which exhibits robust zero-shot generalization capability. Recently, parameter-efficient transfer learning methods have been extensively explored with CLIP for adapting to downstream tasks, where prompt tuning showcases strong potential. Nevertheless, these methods solely fine-tune a single modality (text or visual), disrupting the modality structure of CLIP. In this paper, we first propose Multi-modal Alignment Prompt (MmAP) for CLIP, which aligns the text and visual modalities during the fine-tuning process. Building upon MmAP, we develop an innovative multi-task prompt learning framework. On the one hand, to maximize the complementarity of tasks with high similarity, we utilize a gradient-driven task grouping method that partitions tasks into several disjoint groups and assigns a group-shared MmAP to each group. On the other hand, to preserve the unique characteristics of each task, we assign a task-specific MmAP to each task. Comprehensive experiments on two large multi-task learning datasets demonstrate that our method achieves significant performance improvements compared to full fine-tuning while only utilizing approximately 0.09% of the trainable parameters.
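
A rough way to picture a prompt that ties the two modalities together is a single learnable source prompt projected into both a text prompt and a visual prompt, so the branches are tuned jointly rather than independently. The sketch below is an assumed illustration of that idea, not the released MmAP code; all dimensions and module names are illustrative.

```python
import torch
import torch.nn as nn

class MultiModalAlignmentPrompt(nn.Module):
    def __init__(self, prompt_len: int = 4, d_source: int = 64,
                 d_text: int = 512, d_visual: int = 768):
        super().__init__()
        self.source = nn.Parameter(torch.randn(prompt_len, d_source) * 0.02)
        self.to_text = nn.Linear(d_source, d_text)
        self.to_visual = nn.Linear(d_source, d_visual)

    def forward(self):
        # Both prompts are generated from the same learnable source, which is
        # what couples the text and visual branches during tuning.
        return self.to_text(self.source), self.to_visual(self.source)

text_prompt, visual_prompt = MultiModalAlignmentPrompt()()
# text_prompt would be prepended to CLIP's token embeddings and visual_prompt
# to the patch embeddings; only the prompt parameters above are trained.
```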

NeurIPS Conference 2024 Conference Paper

Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model

  • Mingyang Yi
  • Aoxue Li
  • Yi Xin
  • Zhenguo Li

Recently, the strong latent Diffusion Probabilistic Model (DPM) has been applied to high-quality Text-to-Image (T2I) generation (e.g., Stable Diffusion) by injecting the encoded target text prompt into the gradually denoised diffusion image generator. Despite the success of DPM in practice, the mechanism behind it remains to be explored. To fill this gap, we begin by examining the intermediate states during the gradual denoising generation process in DPM. The empirical observations indicate that the shape of the image is reconstructed within the first few denoising steps, and the image is then filled with details (e.g., texture). This phenomenon arises because the low-frequency (shape-relevant) signal of the noisy image is not corrupted until the final stage of the forward noising process, which corresponds to the initial stage of generation. Inspired by these observations, we proceed to explore the influence of each token in the text prompt during the two stages. After a series of experiments on T2I generation conditioned on a set of text prompts, we conclude that in the earlier generation stage the image is mostly decided by the special token [EOS] in the text prompt, and the information in the text prompt is already conveyed in this stage. After that, the diffusion model completes the details of the generated images using information from the images themselves. Finally, we propose to apply this observation to accelerate T2I generation by properly removing text guidance, which accelerates sampling by 25%+.
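
The acceleration idea suggested by this analysis can be sketched as classifier-free guidance that uses the text prompt only for the early, shape-forming steps, after which each remaining step costs a single network evaluation. The code below is an assumed illustration, not the paper's method as released; `denoiser`, `scheduler_step`, and `text_fraction` are illustrative stand-ins.

```python
import torch

def sample(denoiser, scheduler_step, x, timesteps, text_emb, null_emb,
           guidance_scale: float = 7.5, text_fraction: float = 0.4):
    """`denoiser(x, t, emb)` predicts noise; `scheduler_step(x, eps, t)` is one
    DDIM/DDPM update; text_fraction is the share of steps that keep the prompt."""
    cutoff = int(len(timesteps) * text_fraction)
    for i, t in enumerate(timesteps):
        if i < cutoff:
            eps_c = denoiser(x, t, text_emb)
            eps_u = denoiser(x, t, null_emb)
            eps = eps_u + guidance_scale * (eps_c - eps_u)   # guided early steps
        else:
            eps = denoiser(x, t, null_emb)                   # unguided late steps
        x = scheduler_step(x, eps, t)
    return x

# Toy run with stand-in components, just to show the call shape.
toy_denoiser = lambda x, t, emb: torch.zeros_like(x)
toy_step = lambda x, eps, t: x - 0.01 * eps
out = sample(toy_denoiser, toy_step, torch.randn(1, 4, 8, 8),
             list(range(50, 0, -1)), text_emb=None, null_emb=None)
```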

NeurIPS Conference 2024 Conference Paper

V-PETL Bench: A Unified Visual Parameter-Efficient Transfer Learning Benchmark

  • Yi Xin
  • Siqi Luo
  • Xuyang Liu
  • Yuntao Du
  • Haodi Zhou
  • Xinyu Cheng
  • Christina Lee
  • Junlong Du

Parameter-efficient transfer learning (PETL) methods show promise in adapting a pre-trained model to various downstream tasks while training only a few parameters. In the computer vision (CV) domain, numerous PETL algorithms have been proposed, but their direct employment or comparison remains inconvenient. To address this challenge, we construct a Unified Visual PETL Benchmark (V-PETL Bench) for the CV domain by selecting 30 diverse, challenging, and comprehensive datasets from image recognition, video action recognition, and dense prediction tasks. On these datasets, we systematically evaluate 25 dominant PETL algorithms and open-source a modular and extensible codebase for their fair evaluation. V-PETL Bench runs on NVIDIA A800 GPUs and requires approximately 310 GPU days. We release the complete benchmark, making PETL research more efficient and accessible. Additionally, V-PETL Bench will be continuously updated with new PETL algorithms and CV tasks.

AAAI Conference 2024 Conference Paper

VMT-Adapter: Parameter-Efficient Transfer Learning for Multi-Task Dense Scene Understanding

  • Yi Xin
  • Junlong Du
  • Qiang Wang
  • Zhiwen Lin
  • Ke Yan

Large-scale pre-trained models have achieved remarkable success in various computer vision tasks. A standard approach to leveraging these models is to fine-tune all model parameters for downstream tasks, which poses challenges in terms of computational and storage costs. Recently, inspired by Natural Language Processing (NLP), parameter-efficient transfer learning has been successfully applied to vision tasks. However, most existing techniques primarily focus on single-task adaptation, and despite limited research on multi-task adaptation, these methods often exhibit suboptimal training/inference efficiency. In this paper, we first propose a once-for-all Vision Multi-Task Adapter (VMT-Adapter), which achieves approximately O(1) training and inference efficiency with respect to the number of tasks. Concretely, VMT-Adapter shares knowledge from multiple tasks to enhance cross-task interaction while preserving task-specific knowledge via independent knowledge extraction modules. Notably, since the task-specific modules require few parameters, VMT-Adapter can handle an arbitrary number of tasks with a negligible increase in trainable parameters. We also propose VMT-Adapter-Lite, which further reduces the trainable parameters by learning shared parameters between the down- and up-projections. Extensive experiments on four dense scene understanding tasks demonstrate the superiority of VMT-Adapter(-Lite), achieving a 3.96% (1.34%) relative improvement compared to single-task full fine-tuning, while utilizing merely ~1% (0.36%) of the pre-trained model's trainable parameters.
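
The structure described above can be pictured as a bottleneck adapter whose down/up projections are shared across all tasks, with a tiny per-task module in between, so adding a task adds almost no parameters. The sketch below is an assumed illustration, not the released VMT-Adapter code; the bottleneck size, task count, and module layout are illustrative.

```python
import torch
import torch.nn as nn

class VMTAdapterSketch(nn.Module):
    def __init__(self, d_model: int = 768, bottleneck: int = 64, num_tasks: int = 4):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)    # shared across tasks
        self.up = nn.Linear(bottleneck, d_model)      # shared across tasks
        self.task_specific = nn.ModuleList(
            nn.Linear(bottleneck, bottleneck) for _ in range(num_tasks)
        )

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        h = torch.relu(self.down(x))
        h = h + self.task_specific[task_id](h)        # task-specific refinement
        return x + self.up(h)                         # residual adapter output

adapter = VMTAdapterSketch()
features = torch.randn(2, 196, 768)                   # e.g. ViT patch tokens
out = adapter(features, task_id=1)
```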