Arrow Research search

Author name cluster

Yan Xia

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
1 author row

Possible papers (14)

AAAI Conference 2026 Conference Paper

AUVIC: Adversarial Unlearning of Visual Concepts for Multi-modal Large Language Models

  • Haokun Chen
  • Jianing Li
  • Yao Zhang
  • Jinhe Bi
  • Yan Xia
  • Jindong Gu
  • Volker Tresp

Multimodal Large Language Models (MLLMs) achieve impressive performance once optimized on massive datasets. Such datasets often contain sensitive or copyrighted content, raising significant data privacy concerns. Regulatory frameworks mandating the 'right to be forgotten' drive the need for machine unlearning. This technique allows for the removal of target data without resource-consuming retraining. However, while well-studied for text, visual concept unlearning in MLLMs remains underexplored. A primary challenge is precisely removing a target visual concept without disrupting model performance on related entities. To address this, we introduce AUVIC, a novel visual concept unlearning framework for MLLMs. AUVIC applies adversarial perturbations to enable precise forgetting. This approach effectively isolates the target concept while avoiding unintended effects on similar entities. To evaluate our method, we construct VCUBench. It is the first benchmark designed to assess visual concept unlearning in group contexts. Experimental results demonstrate that AUVIC achieves state-of-the-art target forgetting rates while incurring minimal performance degradation on non-target concepts.
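
The abstract does not spell out AUVIC's objective, but the general recipe it gestures at (adversarial perturbations that isolate a target concept, plus a retain constraint on related entities) can be sketched. The following is a minimal, hypothetical illustration, assuming a generic `model(images, prompts, answers)` call that returns a scalar cross-entropy loss; the names and the exact losses are illustrative, not the paper's.

```python
# Hypothetical sketch of adversarial-perturbation-based concept unlearning.
# `model(images, prompts, answers)` is assumed to return a scalar loss;
# the objective below is illustrative, not the authors' method.
import torch

def unlearning_step(model, optimizer, forget_batch, retain_batch,
                    epsilon=8 / 255, pgd_steps=3, retain_weight=1.0):
    images, prompts, answers = forget_batch

    # 1) Craft perturbations that *reinforce* the target concept, so the
    #    forgetting update below also covers inputs close to the target.
    delta = torch.zeros_like(images, requires_grad=True)
    for _ in range(pgd_steps):
        loss = model(images + delta, prompts, answers)   # lower loss = concept recalled
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta - epsilon * grad.sign()).clamp(-epsilon, epsilon).detach()
        delta.requires_grad_(True)

    # 2) Forget: gradient ascent on the (adversarially reinforced) target concept,
    #    while a retain loss keeps behaviour on related entities intact.
    forget_loss = -model(images + delta.detach(), prompts, answers)
    r_images, r_prompts, r_answers = retain_batch
    retain_loss = model(r_images, r_prompts, r_answers)

    optimizer.zero_grad()
    (forget_loss + retain_weight * retain_loss).backward()
    optimizer.step()
```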

AAAI Conference 2026 Conference Paper

CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation

  • Guanghao Zhang
  • Tao Zhong
  • Yan Xia
  • Mushui Liu
  • Zhelun Yu
  • Haoyuan Li
  • Wanggui He
  • Dong She

While previous multimodal slow-thinking methods have demonstrated remarkable success in single-image understanding scenarios, their effectiveness becomes fundamentally constrained when extended to more complex multi-image comprehension tasks. This limitation stems from their predominant reliance on text-based intermediate reasoning processes. Humans, by contrast, when engaging in sophisticated multi-image analysis, typically perform two complementary cognitive operations: (1) continuous cross-image visual comparison through region-of-interest matching, and (2) dynamic memorization of critical visual concepts throughout the reasoning chain. Motivated by these observations, we propose the Complex Multi-Modal Chain-of-Thought (CMMCoT) framework, a multi-step reasoning framework that mimics human-like "slow thinking" for multi-image understanding. Our approach incorporates two key innovations: (1) the construction of interleaved multimodal multi-step reasoning chains, which utilize critical visual region tokens, extracted from intermediate reasoning steps, as supervisory signals. This mechanism not only facilitates comprehensive cross-modal understanding but also enhances model interpretability. (2) the introduction of a test-time memory augmentation module that expands the model's reasoning capacity during inference while preserving parameter efficiency. Furthermore, to facilitate research in this direction, we have curated a novel multi-image slow-thinking dataset. Extensive experiments demonstrate the effectiveness of our model.
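
The abstract only names the test-time memory augmentation module; one plausible minimal form, sketched here purely for illustration, is a cache of region tokens from earlier reasoning steps that later steps re-read via cross-attention. Class and method names are assumptions, not the authors' code.

```python
# Illustrative sketch of a test-time reasoning memory: visual region tokens
# from earlier steps are cached and re-read by cross-attention at later steps.
import torch
import torch.nn as nn

class ReasoningMemory(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.memory: list[torch.Tensor] = []          # cached region tokens

    def write(self, region_tokens: torch.Tensor) -> None:
        """Store critical visual region tokens from an intermediate step."""
        self.memory.append(region_tokens.detach())

    def read(self, hidden: torch.Tensor) -> torch.Tensor:
        """Let the current hidden states attend over everything stored so far."""
        if not self.memory:
            return hidden
        bank = torch.cat(self.memory, dim=1)           # (batch, total_tokens, dim)
        recalled, _ = self.attn(hidden, bank, bank)
        return hidden + recalled                       # residual memory read
```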

NeurIPS Conference 2025 Conference Paper

AnomalyCoT: A Multi-Scenario Chain-of-Thought Dataset for Multimodal Large Language Models

  • Jiaxi Cheng
  • Yuliang Xu
  • Shoupeng Wang
  • Tao Ma
  • Yuchen He
  • Jinghe Zhang
  • Sihang Cai
  • Jiawei Zhen

Industrial Anomaly Detection (IAD) is an indispensable quality control technology in modern production processes. Recently, owing to the outstanding visual comprehension and cross-domain knowledge transfer capabilities of multimodal large language models (MLLMs), existing studies have explored the application of MLLMs in the IAD domain and established some multimodal IAD datasets. However, although the latest datasets contain various fundamental IAD tasks, they formulate tasks in a general question-and-answer format lacking a rigorous reasoning process, and they are relatively limited in the diversity of scenarios, which restricts their reliability in practical applications. In this paper, we propose AnomalyCoT, a multimodal Chain-of-Thought (CoT) dataset for multi-scenario IAD tasks. It consists of 37,565 IAD samples with CoT data and is defined by challenging composite IAD tasks. Meanwhile, the CoT data for each sample provides precise coordinates of anomaly regions, thereby improving visual comprehension of defects across different types. AnomalyCoT is constructed through a systematic pipeline and involves multiple manual operations. Based on AnomalyCoT, we conducted a comprehensive evaluation of various mainstream MLLMs and fine-tuned representative models in different ways. The final results show that Gemini-2.0-flash achieved the best performance in the direct evaluation with an accuracy rate of 59.6%, while Llama 3.2-Vision achieved the best performance after LoRA fine-tuning with an accuracy rate of 94.0%. Among all the fine-tuned models, the average accuracy improvement reaches 36.5%, demonstrating the potential of integrating CoT datasets in future applications within the IAD field. The code and data are available at https://github.com/Zhaolutuan/AnomalyCoT.
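
Based only on the abstract (a CoT rationale plus precise anomaly-region coordinates per sample), a record in such a dataset might look roughly like the following. The field names and values are guesses for illustration, not the released schema.

```python
# Hypothetical layout of a single AnomalyCoT-style sample; all fields are
# assumptions inferred from the abstract, not the actual data format.
sample = {
    "image": "images/bottle/broken_large/000.png",      # hypothetical path
    "question": "Is this product defective? If so, what and where is the defect?",
    "chain_of_thought": [
        "Step 1: The object is a glass bottle inspected from above.",
        "Step 2: The rim shows an irregular contour around region [412, 88, 506, 171].",
        "Step 3: The irregularity is a missing fragment, i.e. a 'broken_large' defect.",
    ],
    "anomaly_bbox": [412, 88, 506, 171],                  # x1, y1, x2, y2 in pixels
    "answer": "Yes. The bottle has a large break on its rim.",
}
```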

NeurIPS Conference 2025 Conference Paper

L2RSI: Cross-view LiDAR-based Place Recognition for Large-scale Urban Scenes via Remote Sensing Imagery

  • Ziwei Shi
  • Xiaoran Zhang
  • Wenjing Xu
  • Yan Xia
  • Yu Zang
  • Siqi Shen
  • Cheng Wang

We tackle the challenge of LiDAR-based place recognition, which traditionally depends on costly and time-consuming prior 3D maps. To overcome this, we first construct the LiRSI-XA dataset, which encompasses approximately 110,000 remote sensing submaps and 13,000 LiDAR point cloud submaps captured in urban scenes, and propose a novel method, L2RSI, for cross-view LiDAR place recognition using high-resolution Remote Sensing Imagery. This approach enables large-scale localization capabilities at a reduced cost by leveraging readily available overhead images as map proxies. L2RSI addresses the dual challenges of cross-view and cross-modal place recognition by learning feature alignment between point cloud submaps and remote sensing submaps in the semantic domain. Additionally, we introduce a novel probability propagation method based on particle estimation to refine position predictions, effectively leveraging temporal and spatial information. This approach enables large-scale retrieval and cross-scene generalization without fine-tuning. Extensive experiments on LiRSI-XA demonstrate that, within a 100 km² retrieval range, L2RSI accurately localizes 83.27% of point cloud submaps within a 30 m radius for the top-1 retrieved location. Our project page is publicly available at https://shizw695.github.io/L2RSI/.
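
The "probability propagation based on particle estimation" mentioned in the abstract is, in general form, a particle-filter-style update: candidate positions are propagated with the vehicle's relative motion and re-weighted by retrieval similarity. A minimal generic sketch follows; `similarity_fn` and all parameters are assumptions, not the paper's implementation.

```python
# Generic particle-estimation sketch: propagate candidate positions with the
# odometry increment, re-weight by cross-modal retrieval similarity of the
# remote-sensing tile at each particle, resample, and report the mean.
import numpy as np

def propagate_and_refine(particles, weights, motion, similarity_fn, noise_std=5.0):
    # Motion update: shift every particle by the odometry increment plus noise.
    particles = particles + motion + np.random.normal(0.0, noise_std, particles.shape)

    # Measurement update: weight particles by how well the remote-sensing
    # submap at that position matches the current LiDAR submap descriptor.
    weights = weights * np.array([similarity_fn(p) for p in particles])
    weights = weights / weights.sum()

    # Systematic resampling keeps the particle set focused on likely positions.
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx]
    weights = np.full(len(particles), 1.0 / len(particles))

    estimate = particles.mean(axis=0)   # refined position prediction
    return particles, weights, estimate
```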

AAAI Conference 2025 Conference Paper

L4DR: LiDAR-4DRadar Fusion for Weather-Robust 3D Object Detection

  • Xun Huang
  • Ziyu Xu
  • Hai Wu
  • Jinlong Wang
  • Qiming Xia
  • Yan Xia
  • Jonathan Li
  • Kyle Gao

LiDAR-based 3D object detection is crucial for autonomous driving. However, due to the quality deterioration of LiDAR point clouds, it suffers from performance degradation in adverse weather conditions. Fusing LiDAR with the weather-robust 4D radar sensor is expected to solve this problem; however, it faces challenges of significant differences in terms of data quality and the degree of degradation in adverse weather. To address these issues, we introduce L4DR, a weather-robust 3D object detection method that effectively achieves LiDAR and 4D Radar fusion. Our L4DR proposes Multi-Modal Encoding (MME) and Foreground-Aware Denoising (FAD) modules to reconcile sensor gaps, which is the first exploration of the complementarity of early fusion between LiDAR and 4D radar. Additionally, we design an Inter-Modal and Intra-Modal (IM²) parallel feature extraction backbone coupled with a Multi-Scale Gated Fusion (MSGF) module to counteract the varying degrees of sensor degradation under adverse weather conditions. Experimental evaluation on the VoD dataset with simulated fog proves that L4DR is more adaptable to changing weather conditions. It delivers a significant performance increase under different fog levels, improving the 3D mAP by up to 20.0% over the traditional LiDAR-only approach. Moreover, the results on the K-Radar dataset validate the consistent performance improvement of L4DR in real-world adverse weather conditions.
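
The abstract names a Multi-Scale Gated Fusion module but gives no details; a plausible minimal form of gated LiDAR/4D-radar fusion at one scale is sketched below. Layer names, shapes, and the gating scheme are assumptions, not the paper's code.

```python
# Rough sketch of a gated fusion block: at each BEV scale, each modality gets
# a sigmoid gate (computed from both) deciding how much of it to admit.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate_lidar = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.gate_radar = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.out = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, lidar_feat: torch.Tensor, radar_feat: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([lidar_feat, radar_feat], dim=1)
        # In heavy fog the LiDAR gate can close and let radar features dominate.
        lidar_kept = self.gate_lidar(joint) * lidar_feat
        radar_kept = self.gate_radar(joint) * radar_feat
        return self.out(torch.cat([lidar_kept, radar_kept], dim=1))

# Such a block would be applied independently at every scale of a multi-scale
# feature pyramid; whether MSGF does exactly this is not stated in the abstract.
```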

IJCAI Conference 2024 Conference Paper

Bridging LiDAR Gaps: A Multi-LiDARs Domain Adaptation Dataset for 3D Semantic Segmentation

  • Shaoyang Chen
  • Bochun Yang
  • Yan Xia
  • Ming Cheng
  • Siqi Shen
  • Cheng Wang

We focus on the domain adaptation problem for 3D semantic segmentation, addressing the challenge of data variability in point clouds collected by different LiDARs. Existing benchmarks often mix different types of datasets, which blurs and complicates segmentation evaluations. Here, we introduce a Multi-LiDARs Domain Adaptation Segmentation (MLDAS) dataset, which contains point-wise semantically annotated point clouds captured simultaneously by a 128-beam LiDAR, a 64-beam LiDAR, and a 32-beam LiDAR. We select 31,875 scans from 2 representative scenarios: campus and urban street. Furthermore, we evaluate the current 3D segmentation unsupervised domain adaptation methods on the proposed dataset and propose Hierarchical Segmentation Network with Spatial Consistency (HSSC) as a novel knowledge transfer method to mitigate the domain gap significantly using spatial-temporal consistency constraints. Extensive experiments show that HSSC greatly improves the state-of-the-art cross-domain semantic segmentation methods. Our project is available at https://sychen320.github.io/projects/MLDAS.
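
The spatial-temporal consistency constraint is not specified in the abstract; one common way to realize a spatial consistency term, shown here only as an illustration, is to require matched points or voxels seen by two LiDARs to receive similar class distributions.

```python
# Illustrative spatial-consistency term (not the paper's code): predictions
# for points that fall in the same voxel across two LiDAR domains should agree.
import torch
import torch.nn.functional as F

def spatial_consistency_loss(logits_src: torch.Tensor, logits_tgt: torch.Tensor) -> torch.Tensor:
    """logits_* : (N, C) predictions for N matched points/voxels seen by both LiDARs."""
    log_p_src = F.log_softmax(logits_src, dim=-1)
    p_tgt = F.softmax(logits_tgt, dim=-1).detach()      # stop-gradient target
    return F.kl_div(log_p_src, p_tgt, reduction="batchmean")
```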

NeurIPS Conference 2024 Conference Paper

Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models

  • Wenshan Wu
  • Shaoguang Mao
  • Yadong Zhang
  • Yan Xia
  • Li Dong
  • Lei Cui
  • Furu Wei

Large language models (LLMs) have exhibited impressive performance in language comprehension and various reasoning tasks. However, their abilities in spatial reasoning, a crucial aspect of human cognition, remain relatively unexplored. Humans possess a remarkable ability to create mental images of unseen objects and actions through a process known as the Mind's Eye, enabling the imagination of the unseen world. Inspired by this cognitive capacity, we propose Visualization-of-Thought (VoT) prompting. VoT aims to elicit the spatial reasoning of LLMs by visualizing their reasoning traces, thereby guiding subsequent reasoning steps. We employed VoT for multi-hop spatial reasoning tasks, including natural language navigation, visual navigation, and visual tiling in 2D grid worlds. Experimental results demonstrated that VoT significantly enhances the spatial reasoning abilities of LLMs. Notably, VoT outperformed existing multimodal large language models (MLLMs) in these tasks. VoT works surprisingly well on LLMs, and its ability to generate mental images to facilitate spatial reasoning resembles the mind's eye process, suggesting its potential viability in MLLMs. Please find the dataset and codes on our project page.
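
In the spirit of the abstract (the model visualizes its reasoning trace after each step, and that visualization conditions the next step), a bare-bones prompting loop could look like the sketch below. The exact prompt wording and the `chat` callable are placeholders, not the paper's prompts.

```python
# Minimal sketch of a Visualization-of-Thought style loop: after each move the
# model renders the current state as an ASCII grid, which is fed back before
# the next step. `chat(text) -> str` stands in for any LLM call.
def vot_navigate(chat, task_description: str, max_steps: int = 10) -> str:
    history = (
        f"{task_description}\n"
        "After every move, visualize the current map as an ASCII grid, "
        "then state the next move. Say DONE when the goal is reached.\n"
    )
    for _ in range(max_steps):
        reply = chat(history)            # one reasoning step plus its visualization
        history += reply + "\n"
        if "DONE" in reply:
            break
    return history
```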

AAAI Conference 2024 Conference Paper

StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis

  • Yu Zhang
  • Rongjie Huang
  • Ruiqi Li
  • JinZheng He
  • Yan Xia
  • Feiyang Chen
  • Xinyu Duan
  • Baoxing Huai

Style transfer for out-of-domain (OOD) singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles (such as timbre, emotion, pronunciation, and articulation skills) derived from reference singing voice samples. However, the endeavor to model the intricate nuances of singing voice styles is an arduous task, as singing voices possess a remarkable degree of expressiveness. Moreover, existing SVS methods encounter a decline in the quality of synthesized singing voices in OOD scenarios, as they rest upon the assumption that the target vocal attributes are discernible during the training phase. To overcome these challenges, we propose StyleSinger, the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples. StyleSinger incorporates two critical approaches for enhanced effectiveness: 1) the Residual Style Adaptor (RSA) which employs a residual quantization module to capture diverse style characteristics in singing voices, and 2) the Uncertainty Modeling Layer Normalization (UMLN) to perturb the style attributes within the content representation during the training phase and thus improve the model generalization. Our extensive evaluations in zero-shot style transfer undeniably establish that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples. Access to singing voice samples can be found at https://stylesinger.github.io/.
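
The abstract describes UMLN as perturbing style attributes within the content representation during training; one generic way such a layer could be built, sketched here under stated assumptions rather than from the paper, is a style-conditioned layer norm whose predicted scale and shift are jittered while training.

```python
# Hedged sketch of an uncertainty-perturbed, style-conditioned layer norm:
# the style embedding predicts LayerNorm scale/shift, and during training the
# statistics are jittered so the model does not overfit to seen styles.
import torch
import torch.nn as nn

class UncertaintyStyleLayerNorm(nn.Module):
    def __init__(self, hidden: int, style_dim: int, noise_std: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(hidden, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(style_dim, 2 * hidden)
        self.noise_std = noise_std

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        """x: (batch, time, hidden) content states; style: (batch, style_dim) embedding."""
        scale, shift = self.to_scale_shift(style).chunk(2, dim=-1)
        if self.training:
            # Jitter the predicted style statistics to simulate unseen (OOD) styles.
            scale = scale + torch.randn_like(scale) * self.noise_std
            shift = shift + torch.randn_like(shift) * self.noise_std
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```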

NeurIPS Conference 2023 Conference Paper

Achieving Cross Modal Generalization with Multimodal Unified Representation

  • Yan Xia
  • Hai Huang
  • Jieming Zhu
  • Zhou Zhao

This paper introduces a novel task called Cross Modal Generalization (CMG), which addresses the challenge of learning a unified discrete representation from paired multimodal data during pre-training. Then, in downstream tasks, the model can achieve zero-shot generalization in other modalities when only one modality is labeled. Existing approaches in multimodal representation learning focus more on coarse-grained alignment or rely on the assumption that information from different modalities is completely aligned, which is impractical in real-world scenarios. To overcome this limitation, we propose Uni-Code, which contains two key contributions: the Dual Cross-modal Information Disentangling (DCID) module and the Multi-Modal Exponential Moving Average (MM-EMA). These methods facilitate bidirectional supervision between modalities and align semantically equivalent information in a shared discrete latent space, enabling fine-grained unified representation of multimodal sequences. During pre-training, we investigate various modality combinations, including audio-visual, audio-text, and the tri-modal combination of audio-visual-text. Extensive experiments on various downstream tasks, i.e., cross-modal event classification, localization, cross-modal retrieval, query-based video segmentation, and cross-dataset event localization, demonstrate the effectiveness of our proposed methods. The code is available at https://github.com/haihuangcode/CMG.
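
The shared discrete latent space with EMA updates builds on the standard VQ-VAE-style codebook mechanism; a generic sketch of such a codebook, fed by features from more than one modality, is shown below. This illustrates the common building block only, not the specific MM-EMA procedure.

```python
# Sketch of a shared discrete codebook updated by exponential moving averages
# (VQ-VAE-style), here fed by features from any modality mapped to one space.
import torch

class SharedEMACodebook:
    def __init__(self, num_codes: int, dim: int, decay: float = 0.99):
        self.codes = torch.randn(num_codes, dim)
        self.cluster_size = torch.zeros(num_codes)
        self.embed_sum = self.codes.clone()
        self.decay = decay

    def quantize(self, feats: torch.Tensor, update: bool = True) -> torch.Tensor:
        """feats: (N, dim) features from *any* modality projected to the shared space."""
        dists = torch.cdist(feats, self.codes)        # (N, num_codes)
        idx = dists.argmin(dim=1)
        if update:                                    # EMA update of code statistics
            onehot = torch.nn.functional.one_hot(idx, self.codes.size(0)).float()
            self.cluster_size = self.decay * self.cluster_size + (1 - self.decay) * onehot.sum(0)
            self.embed_sum = self.decay * self.embed_sum + (1 - self.decay) * onehot.T @ feats
            self.codes = self.embed_sum / self.cluster_size.clamp(min=1e-5).unsqueeze(1)
        return self.codes[idx]                        # quantized features, (N, dim)
```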

NeurIPS Conference 2023 Conference Paper

Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks

  • Haoyi Duan
  • Yan Xia
  • Zhou Mingze
  • Li Tang
  • Jieming Zhu
  • Zhou Zhao

In recent years, the deployment of large-scale pre-trained models in audio-visual downstream tasks has yielded remarkable outcomes. However, these models, primarily trained on single-modality unconstrained datasets, still encounter challenges in feature extraction for multi-modal tasks, leading to suboptimal performance. This limitation arises due to the introduction of irrelevant modality-specific information during encoding, which adversely affects the performance of downstream tasks. To address this challenge, this paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism. This mechanism leverages audio and visual modalities as soft prompts to dynamically adjust the parameters of pre-trained models based on the current multi-modal input features. Specifically, the DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders, allowing adaptive extraction of crucial information from the current modality across spatial, channel, and temporal dimensions, while preserving the frozen parameters of large-scale pre-trained models. Experimental evaluations demonstrate that our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, our model exhibits promising performance in challenging few-shot and zero-shot scenarios. The source code and pre-trained models are available at https://github.com/haoyi-duan/DG-SCT.
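
To make the prompting idea concrete, here is a minimal sketch of one of the three axes the abstract mentions: a small trainable block that lets the audio feature re-weight a frozen visual layer's feature maps along the channel dimension. Names and shapes are illustrative assumptions, not the released DG-SCT code.

```python
# Minimal audio-guided channel gate inserted alongside a frozen visual layer;
# only this small block is trainable.
import torch
import torch.nn as nn

class AudioGuidedChannelGate(nn.Module):
    def __init__(self, audio_dim: int, visual_channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(audio_dim, visual_channels), nn.Sigmoid())

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        """visual: (B, C, H, W) from a frozen encoder layer; audio: (B, audio_dim)."""
        g = self.gate(audio).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1) channel weights
        return visual * g                                   # audio-conditioned visual features
```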

NeurIPS Conference 2023 Conference Paper

Extensible Prompts for Language Models on Zero-shot Language Style Customization

  • Tao Ge
  • Hu Jing
  • Li Dong
  • Shaoguang Mao
  • Yan Xia
  • Xun Wang
  • Si-Qing Chen
  • Furu Wei

We propose eXtensible Prompt (X-Prompt) for prompting a large language model (LLM) beyond natural language (NL). X-Prompt instructs an LLM with not only NL but also an extensible vocabulary of imaginary words. Registering new imaginary words allows us to instruct the LLM to comprehend concepts that are difficult to describe with NL words, thereby making a prompt more descriptive. Also, these imaginary words are designed to be out-of-distribution (OOD) robust so that they can be (re)used like NL words in various prompts, distinguishing X-Prompt from soft prompts, which are designed to fit in-distribution data. We propose context-augmented learning (CAL) to learn imaginary words for general usability, enabling them to work properly in OOD (unseen) prompts. We experiment with X-Prompt on zero-shot language style customization as a case study. The promising results of X-Prompt demonstrate its potential to facilitate advanced interaction beyond the natural language interface, bridging the communication gap between humans and LLMs.
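
Mechanically, registering "imaginary words" amounts to adding new, trainable embedding rows next to a frozen vocabulary so that prompts can mix ordinary tokens with learned out-of-vocabulary ones. The sketch below is a generic illustration of that idea under assumed names, not the paper's implementation.

```python
# Generic sketch: extend a frozen embedding table with trainable "imaginary"
# rows; token ids >= base vocabulary size are routed to the new embeddings.
import torch
import torch.nn as nn

class ExtensibleEmbedding(nn.Module):
    def __init__(self, base_embedding: nn.Embedding, num_imaginary: int):
        super().__init__()
        self.base = base_embedding
        for p in self.base.parameters():
            p.requires_grad_(False)                       # frozen NL vocabulary
        self.imaginary = nn.Embedding(num_imaginary, base_embedding.embedding_dim)
        self.base_size = base_embedding.num_embeddings    # ids >= this are imaginary words

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        is_new = ids >= self.base_size
        out = self.base(ids.clamp(max=self.base_size - 1))
        if is_new.any():
            out[is_new] = self.imaginary(ids[is_new] - self.base_size)
        return out
```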