Arrow Research search

Author name cluster

Yan Xia

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
1 author row

Possible papers (14)

AAAI Conference 2026 Conference Paper

AUVIC: Adversarial Unlearning of Visual Concepts for Multi-modal Large Language Models

  • Haokun Chen
  • Jianing Li
  • Yao Zhang
  • Jinhe Bi
  • Yan Xia
  • Jindong Gu
  • Volker Tresp

Multimodal Large Language Models (MLLMs) achieve impressive performance once optimized on massive datasets. Such datasets often contain sensitive or copyrighted content, raising significant data privacy concerns. Regulatory frameworks mandating the 'right to be forgotten' drive the need for machine unlearning. This technique allows for the removal of target data without resource-consuming retraining. However, while well-studied for text, visual concept unlearning in MLLMs remains underexplored. A primary challenge is precisely removing a target visual concept without disrupting model performance on related entities. To address this, we introduce AUVIC, a novel visual concept unlearning framework for MLLMs. AUVIC applies adversarial perturbations to enable precise forgetting. This approach effectively isolates the target concept while avoiding unintended effects on similar entities. To evaluate our method, we construct VCUBench. It is the first benchmark designed to assess visual concept unlearning in group contexts. Experimental results demonstrate that AUVIC achieves state-of-the-art target forgetting rates while incurring minimal performance degradation on non-target concepts.
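
The abstract does not spell out AUVIC's objective, but the general recipe it gestures at (adversarial perturbations that isolate a target concept, plus a retain constraint on related entities) can be sketched. The following is a minimal, hypothetical illustration, assuming a generic `model(images, prompts, answers)` call that returns a scalar cross-entropy loss; the names and the exact losses are illustrative, not the paper's.

```python
# Hypothetical sketch of adversarial-perturbation-based concept unlearning.
# `model(images, prompts, answers)` is assumed to return a scalar loss;
# the objective below is illustrative, not the authors' method.
import torch

def unlearning_step(model, optimizer, forget_batch, retain_batch,
                    epsilon=8 / 255, pgd_steps=3, retain_weight=1.0):
    images, prompts, answers = forget_batch

    # 1) Craft perturbations that *reinforce* the target concept, so the
    #    forgetting update below also covers inputs close to the target.
    delta = torch.zeros_like(images, requires_grad=True)
    for _ in range(pgd_steps):
        loss = model(images + delta, prompts, answers)   # lower loss = concept recalled
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta - epsilon * grad.sign()).clamp(-epsilon, epsilon).detach()
        delta.requires_grad_(True)

    # 2) Forget: gradient ascent on the (adversarially reinforced) target concept,
    #    while a retain loss keeps behaviour on related entities intact.
    forget_loss = -model(images + delta.detach(), prompts, answers)
    r_images, r_prompts, r_answers = retain_batch
    retain_loss = model(r_images, r_prompts, r_answers)

    optimizer.zero_grad()
    (forget_loss + retain_weight * retain_loss).backward()
    optimizer.step()
```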

AAAI Conference 2026 Conference Paper

CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation

  • Guanghao Zhang
  • Tao Zhong
  • Yan Xia
  • Mushui Liu
  • Zhelun Yu
  • Haoyuan Li
  • Wanggui He
  • Dong She

While previous multimodal slow-thinking methods have demonstrated remarkable success in single-image understanding scenarios, their effectiveness becomes fundamentally constrained when extended to more complex multi-image comprehension tasks. This limitation stems from their predominant reliance on text-based intermediate reasoning processes. Humans, by contrast, when engaging in sophisticated multi-image analysis, typically perform two complementary cognitive operations: (1) continuous cross-image visual comparison through region-of-interest matching, and (2) dynamic memorization of critical visual concepts throughout the reasoning chain. Motivated by these observations, we propose the Complex Multi-Modal Chain-of-Thought (CMMCoT) framework, a multi-step reasoning framework that mimics human-like "slow thinking" for multi-image understanding. Our approach incorporates two key innovations: (1) the construction of interleaved multimodal multi-step reasoning chains, which utilize critical visual region tokens, extracted from intermediate reasoning steps, as supervisory signals. This mechanism not only facilitates comprehensive cross-modal understanding but also enhances model interpretability. (2) the introduction of a test-time memory augmentation module that expands the model's reasoning capacity during inference while preserving parameter efficiency. Furthermore, to facilitate research in this direction, we have curated a novel multi-image slow-thinking dataset. Extensive experiments demonstrate the effectiveness of our model.
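
The abstract only names the test-time memory augmentation module; one plausible minimal form, sketched here purely for illustration, is a cache of region tokens from earlier reasoning steps that later steps re-read via cross-attention. Class and method names are assumptions, not the authors' code.

```python
# Illustrative sketch of a test-time reasoning memory: visual region tokens
# from earlier steps are cached and re-read by cross-attention at later steps.
import torch
import torch.nn as nn

class ReasoningMemory(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.memory: list[torch.Tensor] = []          # cached region tokens

    def write(self, region_tokens: torch.Tensor) -> None:
        """Store critical visual region tokens from an intermediate step."""
        self.memory.append(region_tokens.detach())

    def read(self, hidden: torch.Tensor) -> torch.Tensor:
        """Let the current hidden states attend over everything stored so far."""
        if not self.memory:
            return hidden
        bank = torch.cat(self.memory, dim=1)           # (batch, total_tokens, dim)
        recalled, _ = self.attn(hidden, bank, bank)
        return hidden + recalled                       # residual memory read
```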

NeurIPS Conference 2025 Conference Paper

AnomalyCoT: A Multi-Scenario Chain-of-Thought Dataset for Multimodal Large Language Models

  • Jiaxi Cheng
  • Yuliang Xu
  • Shoupeng Wang
  • Tao Ma
  • Yuchen He
  • Jinghe Zhang
  • Sihang Cai
  • Jiawei Zhen

Industrial Anomaly Detection (IAD) is an indispensable quality control technology in modern production processes. Recently, owing to the outstanding visual comprehension and cross-domain knowledge transfer capabilities of multimodal large language models (MLLMs), existing studies have explored the application of MLLMs in the IAD domain and established some multimodal IAD datasets. However, although the latest datasets contain various fundamental IAD tasks, they formulate tasks in a general question-and-answer format lacking a rigorous reasoning process, and they are relatively limited in the diversity of scenarios, which restricts their reliability in practical applications. In this paper, we propose AnomalyCoT, a multimodal Chain-of-Thought (CoT) dataset for multi-scenario IAD tasks. It consists of 37,565 IAD samples with CoT data and is defined by challenging composite IAD tasks. Meanwhile, the CoT data for each sample provides precise coordinates of anomaly regions, thereby improving visual comprehension of defects across different types. AnomalyCoT is constructed through a systematic pipeline and involves multiple manual operations. Based on AnomalyCoT, we conducted a comprehensive evaluation of various mainstream MLLMs and fine-tuned representative models in different ways. The final results show that Gemini-2.0-flash achieved the best performance in the direct evaluation with an accuracy rate of 59.6%, while Llama 3.2-Vision achieved the best performance after LoRA fine-tuning with an accuracy rate of 94.0%. Among all the fine-tuned models, the average accuracy improvement reaches 36.5%, demonstrating the potential of integrating CoT datasets in future applications within the IAD field. The code and data are available at https://github.com/Zhaolutuan/AnomalyCoT.
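
Based only on the abstract (a CoT rationale plus precise anomaly-region coordinates per sample), a record in such a dataset might look roughly like the following. The field names and values are guesses for illustration, not the released schema.

```python
# Hypothetical layout of a single AnomalyCoT-style sample; all fields are
# assumptions inferred from the abstract, not the actual data format.
sample = {
    "image": "images/bottle/broken_large/000.png",      # hypothetical path
    "question": "Is this product defective? If so, what and where is the defect?",
    "chain_of_thought": [
        "Step 1: The object is a glass bottle inspected from above.",
        "Step 2: The rim shows an irregular contour around region [412, 88, 506, 171].",
        "Step 3: The irregularity is a missing fragment, i.e. a 'broken_large' defect.",
    ],
    "anomaly_bbox": [412, 88, 506, 171],                  # x1, y1, x2, y2 in pixels
    "answer": "Yes. The bottle has a large break on its rim.",
}
```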

NeurIPS Conference 2025 Conference Paper

L2RSI: Cross-view LiDAR-based Place Recognition for Large-scale Urban Scenes via Remote Sensing Imagery

  • Ziwei Shi
  • Xiaoran Zhang
  • Wenjing Xu
  • Yan Xia
  • Yu Zang
  • Siqi Shen
  • Cheng Wang

We tackle the challenge of LiDAR-based place recognition, which traditionally depends on costly and time-consuming prior 3D maps. To overcome this, we first construct the LiRSI-XA dataset, which encompasses approximately 110,000 remote sensing submaps and 13,000 LiDAR point cloud submaps captured in urban scenes, and propose a novel method, L2RSI, for cross-view LiDAR place recognition using high-resolution Remote Sensing Imagery. This approach enables large-scale localization capabilities at a reduced cost by leveraging readily available overhead images as map proxies. L2RSI addresses the dual challenges of cross-view and cross-modal place recognition by learning feature alignment between point cloud submaps and remote sensing submaps in the semantic domain. Additionally, we introduce a novel probability propagation method based on particle estimation to refine position predictions, effectively leveraging temporal and spatial information. This approach enables large-scale retrieval and cross-scene generalization without fine-tuning. Extensive experiments on LiRSI-XA demonstrate that, within a 100 km² retrieval range, L2RSI accurately localizes 83.27% of point cloud submaps within a 30 m radius for the top-1 retrieved location. Our project page is publicly available at https://shizw695.github.io/L2RSI/.
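
The "probability propagation based on particle estimation" mentioned in the abstract is, in general form, a particle-filter-style update: candidate positions are propagated with the vehicle's relative motion and re-weighted by retrieval similarity. A minimal generic sketch follows; `similarity_fn` and all parameters are assumptions, not the paper's implementation.

```python
# Generic particle-estimation sketch: propagate candidate positions with the
# odometry increment, re-weight by cross-modal retrieval similarity of the
# remote-sensing tile at each particle, resample, and report the mean.
import numpy as np

def propagate_and_refine(particles, weights, motion, similarity_fn, noise_std=5.0):
    # Motion update: shift every particle by the odometry increment plus noise.
    particles = particles + motion + np.random.normal(0.0, noise_std, particles.shape)

    # Measurement update: weight particles by how well the remote-sensing
    # submap at that position matches the current LiDAR submap descriptor.
    weights = weights * np.array([similarity_fn(p) for p in particles])
    weights = weights / weights.sum()

    # Systematic resampling keeps the particle set focused on likely positions.
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx]
    weights = np.full(len(particles), 1.0 / len(particles))

    estimate = particles.mean(axis=0)   # refined position prediction
    return particles, weights, estimate
```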

AAAI Conference 2025 Conference Paper

L4DR: LiDAR-4DRadar Fusion for Weather-Robust 3D Object Detection

  • Xun Huang
  • Ziyu Xu
  • Hai Wu
  • Jinlong Wang
  • Qiming Xia
  • Yan Xia
  • Jonathan Li
  • Kyle Gao

LiDAR-based 3D object detection is crucial for autonomous driving. However, due to the quality deterioration of LiDAR point clouds, it suffers from performance degradation in adverse weather conditions. Fusing LiDAR with the weather-robust 4D radar sensor is expected to solve this problem; however, it faces challenges of significant differences in terms of data quality and the degree of degradation in adverse weather. To address these issues, we introduce L4DR, a weather-robust 3D object detection method that effectively achieves LiDAR and 4D Radar fusion. Our L4DR proposes Multi-Modal Encoding (MME) and Foreground-Aware Denoising (FAD) modules to reconcile sensor gaps, which is the first exploration of the complementarity of early fusion between LiDAR and 4D radar. Additionally, we design an Inter-Modal and Intra-Modal (IM²) parallel feature extraction backbone coupled with a Multi-Scale Gated Fusion (MSGF) module to counteract the varying degrees of sensor degradation under adverse weather conditions. Experimental evaluation on the VoD dataset with simulated fog proves that L4DR is more adaptable to changing weather conditions. It delivers a significant performance increase under different fog levels, improving the 3D mAP by up to 20.0% over the traditional LiDAR-only approach. Moreover, the results on the K-Radar dataset validate the consistent performance improvement of L4DR in real-world adverse weather conditions.
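
The abstract names a Multi-Scale Gated Fusion module but gives no details; a plausible minimal form of gated LiDAR/4D-radar fusion at one scale is sketched below. Layer names, shapes, and the gating scheme are assumptions, not the paper's code.

```python
# Rough sketch of a gated fusion block: at each BEV scale, each modality gets
# a sigmoid gate (computed from both) deciding how much of it to admit.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate_lidar = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.gate_radar = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.out = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, lidar_feat: torch.Tensor, radar_feat: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([lidar_feat, radar_feat], dim=1)
        # In heavy fog the LiDAR gate can close and let radar features dominate.
        lidar_kept = self.gate_lidar(joint) * lidar_feat
        radar_kept = self.gate_radar(joint) * radar_feat
        return self.out(torch.cat([lidar_kept, radar_kept], dim=1))

# Such a block would be applied independently at every scale of a multi-scale
# feature pyramid; whether MSGF does exactly this is not stated in the abstract.
```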

IJCAI Conference 2024 Conference Paper

Bridging LiDAR Gaps: A Multi-LiDARs Domain Adaptation Dataset for 3D Semantic Segmentation

  • Shaoyang Chen
  • Bochun Yang
  • Yan Xia
  • Ming Cheng
  • Siqi Shen
  • Cheng Wang

We focus on the domain adaptation problem for 3D semantic segmentation, addressing the challenge of data variability in point clouds collected by different LiDARs. Existing benchmarks often mix different types of datasets, which blurs and complicates segmentation evaluations. Here, we introduce a Multi-LiDARs Domain Adaptation Segmentation (MLDAS) dataset, which contains point-wise semantically annotated point clouds captured simultaneously by a 128-beam LiDAR, a 64-beam LiDAR, and a 32-beam LiDAR. We select 31,875 scans from 2 representative scenarios: campus and urban street. Furthermore, we evaluate the current 3D segmentation unsupervised domain adaptation methods on the proposed dataset and propose Hierarchical Segmentation Network with Spatial Consistency (HSSC) as a novel knowledge transfer method to mitigate the domain gap significantly using spatial-temporal consistency constraints. Extensive experiments show that HSSC greatly improves the state-of-the-art cross-domain semantic segmentation methods. Our project is available at https://sychen320.github.io/projects/MLDAS.
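
The spatial-temporal consistency constraint is not specified in the abstract; one common way to realize a spatial consistency term, shown here only as an illustration, is to require matched points or voxels seen by two LiDARs to receive similar class distributions.

```python
# Illustrative spatial-consistency term (not the paper's code): predictions
# for points that fall in the same voxel across two LiDAR domains should agree.
import torch
import torch.nn.functional as F

def spatial_consistency_loss(logits_src: torch.Tensor, logits_tgt: torch.Tensor) -> torch.Tensor:
    """logits_* : (N, C) predictions for N matched points/voxels seen by both LiDARs."""
    log_p_src = F.log_softmax(logits_src, dim=-1)
    p_tgt = F.softmax(logits_tgt, dim=-1).detach()      # stop-gradient target
    return F.kl_div(log_p_src, p_tgt, reduction="batchmean")
```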

NeurIPS Conference 2024 Conference Paper

Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models

  • Wenshan Wu
  • Shaoguang Mao
  • Yadong Zhang
  • Yan Xia
  • Li Dong
  • Lei Cui
  • Furu Wei

Large language models (LLMs) have exhibited impressive performance in language comprehension and various reasoning tasks. However, their abilities in spatial reasoning, a crucial aspect of human cognition, remain relatively unexplored. Humans possess a remarkable ability to create mental images of unseen objects and actions through a process known as the Mind's Eye, enabling the imagination of the unseen world. Inspired by this cognitive capacity, we propose Visualization-of-Thought (VoT) prompting. VoT aims to elicit the spatial reasoning of LLMs by visualizing their reasoning traces, thereby guiding subsequent reasoning steps. We employed VoT for multi-hop spatial reasoning tasks, including natural language navigation, visual navigation, and visual tiling in 2D grid worlds. Experimental results demonstrated that VoT significantly enhances the spatial reasoning abilities of LLMs. Notably, VoT outperformed existing multimodal large language models (MLLMs) in these tasks. VoT works surprisingly well on LLMs, and its ability to generate mental images to facilitate spatial reasoning resembles the mind's eye process, suggesting its potential viability in MLLMs. Please find the dataset and codes on our project page.
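
In the spirit of the abstract (the model visualizes its reasoning trace after each step, and that visualization conditions the next step), a bare-bones prompting loop could look like the sketch below. The exact prompt wording and the `chat` callable are placeholders, not the paper's prompts.

```python
# Minimal sketch of a Visualization-of-Thought style loop: after each move the
# model renders the current state as an ASCII grid, which is fed back before
# the next step. `chat(text) -> str` stands in for any LLM call.
def vot_navigate(chat, task_description: str, max_steps: int = 10) -> str:
    history = (
        f"{task_description}\n"
        "After every move, visualize the current map as an ASCII grid, "
        "then state the next move. Say DONE when the goal is reached.\n"
    )
    for _ in range(max_steps):
        reply = chat(history)            # one reasoning step plus its visualization
        history += reply + "\n"
        if "DONE" in reply:
            break
    return history
```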

AAAI Conference 2024 Conference Paper

StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis

  • Yu Zhang
  • Rongjie Huang
  • Ruiqi Li
  • JinZheng He
  • Yan Xia
  • Feiyang Chen
  • Xinyu Duan
  • Baoxing Huai

Style transfer for out-of-domain (OOD) singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles (such as timbre, emotion, pronunciation, and articulation skills) derived from reference singing voice samples. However, the endeavor to model the intricate nuances of singing voice styles is an arduous task, as singing voices possess a remarkable degree of expressiveness. Moreover, existing SVS methods encounter a decline in the quality of synthesized singing voices in OOD scenarios, as they rest upon the assumption that the target vocal attributes are discernible during the training phase. To overcome these challenges, we propose StyleSinger, the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples. StyleSinger incorporates two critical approaches for enhanced effectiveness: 1) the Residual Style Adaptor (RSA) which employs a residual quantization module to capture diverse style characteristics in singing voices, and 2) the Uncertainty Modeling Layer Normalization (UMLN) to perturb the style attributes within the content representation during the training phase and thus improve the model generalization. Our extensive evaluations in zero-shot style transfer undeniably establish that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples. Access to singing voice samples can be found at https://stylesinger.github.io/.
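
The abstract describes UMLN as perturbing style attributes within the content representation during training; one generic way such a layer could be built, sketched here under stated assumptions rather than from the paper, is a style-conditioned layer norm whose predicted scale and shift are jittered while training.

```python
# Hedged sketch of an uncertainty-perturbed, style-conditioned layer norm:
# the style embedding predicts LayerNorm scale/shift, and during training the
# statistics are jittered so the model does not overfit to seen styles.
import torch
import torch.nn as nn

class UncertaintyStyleLayerNorm(nn.Module):
    def __init__(self, hidden: int, style_dim: int, noise_std: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(hidden, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(style_dim, 2 * hidden)
        self.noise_std = noise_std

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        """x: (batch, time, hidden) content states; style: (batch, style_dim) embedding."""
        scale, shift = self.to_scale_shift(style).chunk(2, dim=-1)
        if self.training:
            # Jitter the predicted style statistics to simulate unseen (OOD) styles.
            scale = scale + torch.randn_like(scale) * self.noise_std
            shift = shift + torch.randn_like(shift) * self.noise_std
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```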

NeurIPS Conference 2023 Conference Paper

Achieving Cross Modal Generalization with Multimodal Unified Representation

  • Yan Xia
  • Hai Huang
  • Jieming Zhu
  • Zhou Zhao

This paper introduces a novel task called Cross Modal Generalization (CMG), which addresses the challenge of learning a unified discrete representation from paired multimodal data during pre-training. Then, in downstream tasks, the model can achieve zero-shot generalization in other modalities when only one modality is labeled. Existing approaches in multimodal representation learning focus more on coarse-grained alignment or rely on the assumption that information from different modalities is completely aligned, which is impractical in real-world scenarios. To overcome this limitation, we propose Uni-Code, which contains two key contributions: the Dual Cross-modal Information Disentangling (DCID) module and the Multi-Modal Exponential Moving Average (MM-EMA). These methods facilitate bidirectional supervision between modalities and align semantically equivalent information in a shared discrete latent space, enabling fine-grained unified representation of multimodal sequences. During pre-training, we investigate various modality combinations, including audio-visual, audio-text, and the tri-modal combination of audio-visual-text. Extensive experiments on various downstream tasks, i.e., cross-modal event classification, localization, cross-modal retrieval, query-based video segmentation, and cross-dataset event localization, demonstrate the effectiveness of our proposed methods. The code is available at https://github.com/haihuangcode/CMG.
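
The shared discrete latent space with EMA updates builds on the standard VQ-VAE-style codebook mechanism; a generic sketch of such a codebook, fed by features from more than one modality, is shown below. This illustrates the common building block only, not the specific MM-EMA procedure.

```python
# Sketch of a shared discrete codebook updated by exponential moving averages
# (VQ-VAE-style), here fed by features from any modality mapped to one space.
import torch

class SharedEMACodebook:
    def __init__(self, num_codes: int, dim: int, decay: float = 0.99):
        self.codes = torch.randn(num_codes, dim)
        self.cluster_size = torch.zeros(num_codes)
        self.embed_sum = self.codes.clone()
        self.decay = decay

    def quantize(self, feats: torch.Tensor, update: bool = True) -> torch.Tensor:
        """feats: (N, dim) features from *any* modality projected to the shared space."""
        dists = torch.cdist(feats, self.codes)        # (N, num_codes)
        idx = dists.argmin(dim=1)
        if update:                                    # EMA update of code statistics
            onehot = torch.nn.functional.one_hot(idx, self.codes.size(0)).float()
            self.cluster_size = self.decay * self.cluster_size + (1 - self.decay) * onehot.sum(0)
            self.embed_sum = self.decay * self.embed_sum + (1 - self.decay) * onehot.T @ feats
            self.codes = self.embed_sum / self.cluster_size.clamp(min=1e-5).unsqueeze(1)
        return self.codes[idx]                        # quantized features, (N, dim)
```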

NeurIPS Conference 2023 Conference Paper

Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks

  • Haoyi Duan
  • Yan Xia
  • Zhou Mingze
  • Li Tang
  • Jieming Zhu
  • Zhou Zhao

In recent years, the deployment of large-scale pre-trained models in audio-visual downstream tasks has yielded remarkable outcomes. However, these models, primarily trained on single-modality unconstrained datasets, still encounter challenges in feature extraction for multi-modal tasks, leading to suboptimal performance. This limitation arises due to the introduction of irrelevant modality-specific information during encoding, which adversely affects the performance of downstream tasks. To address this challenge, this paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism. This mechanism leverages audio and visual modalities as soft prompts to dynamically adjust the parameters of pre-trained models based on the current multi-modal input features. Specifically, the DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders, allowing adaptive extraction of crucial information from the current modality across spatial, channel, and temporal dimensions, while preserving the frozen parameters of large-scale pre-trained models. Experimental evaluations demonstrate that our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, our model exhibits promising performance in challenging few-shot and zero-shot scenarios. The source code and pre-trained models are available at https://github.com/haoyi-duan/DG-SCT.
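
To make the prompting idea concrete, here is a minimal sketch of one of the three axes the abstract mentions: a small trainable block that lets the audio feature re-weight a frozen visual layer's feature maps along the channel dimension. Names and shapes are illustrative assumptions, not the released DG-SCT code.

```python
# Minimal audio-guided channel gate inserted alongside a frozen visual layer;
# only this small block is trainable.
import torch
import torch.nn as nn

class AudioGuidedChannelGate(nn.Module):
    def __init__(self, audio_dim: int, visual_channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(audio_dim, visual_channels), nn.Sigmoid())

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        """visual: (B, C, H, W) from a frozen encoder layer; audio: (B, audio_dim)."""
        g = self.gate(audio).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1) channel weights
        return visual * g                                   # audio-conditioned visual features
```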

NeurIPS Conference 2023 Conference Paper

Extensible Prompts for Language Models on Zero-shot Language Style Customization

  • Tao Ge
  • Hu Jing
  • Li Dong
  • Shaoguang Mao
  • Yan Xia
  • Xun Wang
  • Si-Qing Chen
  • Furu Wei

We propose eXtensible Prompt (X-Prompt) for prompting a large language model (LLM) beyond natural language (NL). X-Prompt instructs an LLM with not only NL but also an extensible vocabulary of imaginary words. Registering new imaginary words allows us to instruct the LLM to comprehend concepts that are difficult to describe with NL words, thereby making a prompt more descriptive. Also, these imaginary words are designed to be out-of-distribution (OOD) robust so that they can be (re)used like NL words in various prompts, distinguishing X-Prompt from soft prompts, which are designed to fit in-distribution data. We propose context-augmented learning (CAL) to learn imaginary words for general usability, enabling them to work properly in OOD (unseen) prompts. We experiment with X-Prompt on zero-shot language style customization as a case study. The promising results of X-Prompt demonstrate its potential to facilitate advanced interaction beyond the natural language interface, bridging the communication gap between humans and LLMs.
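
Mechanically, registering "imaginary words" amounts to adding new, trainable embedding rows next to a frozen vocabulary so that prompts can mix ordinary tokens with learned out-of-vocabulary ones. The sketch below is a generic illustration of that idea under assumed names, not the paper's implementation.

```python
# Generic sketch: extend a frozen embedding table with trainable "imaginary"
# rows; token ids >= base vocabulary size are routed to the new embeddings.
import torch
import torch.nn as nn

class ExtensibleEmbedding(nn.Module):
    def __init__(self, base_embedding: nn.Embedding, num_imaginary: int):
        super().__init__()
        self.base = base_embedding
        for p in self.base.parameters():
            p.requires_grad_(False)                       # frozen NL vocabulary
        self.imaginary = nn.Embedding(num_imaginary, base_embedding.embedding_dim)
        self.base_size = base_embedding.num_embeddings    # ids >= this are imaginary words

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        is_new = ids >= self.base_size
        out = self.base(ids.clamp(max=self.base_size - 1))
        if is_new.any():
            out[is_new] = self.imaginary(ids[is_new] - self.base_size)
        return out
```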