Arrow Research search

Author name cluster

Xiaochen Yuan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers
1 author row

Possible papers

8

AAAI Conference 2026 Conference Paper

Cross-view Anchor Graph Learning and Factorization for Incomplete Multi-view Clustering

  • Xinxin Wang
  • Yongshan Zhang
  • Xiaochen Yuan
  • Yicong Zhou

Graph-based incomplete multi-view clustering algorithms have attracted considerable attention due to their impressive clustering performance. However, existing methods primarily leverage intra-view correlations from observed views while neglecting explicit compensation relationships between different views. Moreover, these methods require post-processing to obtain labels, and the separate steps are not optimized jointly, which may lead to sub-optimal solutions. To address these issues, we propose a Cross-view Anchor Graph Learning and Factorization (AGLF) method. AGLF develops an Anchor Graph Completion (AGC) framework that explicitly learns the missing subgraph structures. Instead of requiring post-processing, AGC directly produces soft labels. By stacking the soft labels into a third-order tensor, it employs the tensor Schatten p-norm to enhance anchor graph learning and factorization. To significantly improve the quality of subgraph learning, AGLF incorporates compensation subgraphs from supplementary views into the AGC framework, enabling the construction of a better anchor graph for label learning. An optimization algorithm is devised to solve the objective function. Experimental results across various datasets demonstrate the effectiveness of our method.
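
For context on the regularizer named above: in the incomplete multi-view clustering literature, the tensor Schatten p-norm of a third-order tensor is commonly defined through the singular values of its frontal slices in the Fourier domain, as sketched below. The abstract does not spell out which variant AGLF adopts, so this is background notation rather than the paper's exact formulation.

    \|\mathcal{X}\|_{S_p}^{p} = \sum_{k=1}^{n_3} \sum_{i=1}^{\min(n_1,n_2)} \sigma_i\!\big(\bar{\mathcal{X}}^{(k)}\big)^{p}, \qquad 0 < p \le 1

Here \bar{\mathcal{X}}^{(k)} is the k-th frontal slice of \mathcal{X} after a Fourier transform along the third mode and \sigma_i(\cdot) is its i-th singular value. Choosing p < 1 approximates low tensor rank more tightly than the nuclear norm, which in this setting would encourage the per-view soft-label matrices stacked in the tensor to agree with one another.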

AAAI Conference 2026 Conference Paper

FaceShield: Explainable Face Anti-Spoofing with Multimodal Large Language Models

  • Hongyang Wang
  • Yichen Shi
  • Zhuofu Tao
  • Yuhao Gao
  • Liepiao Zhang
  • Xun Lin
  • Jun Feng
  • Xiaochen Yuan

Face anti-spoofing (FAS) is crucial for protecting facial recognition systems from presentation attacks. Previous methods approached this task as a classification problem, lacking interpretability and reasoning behind the predicted results. Recently, multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and decision-making in visual tasks. However, there is currently no universal and comprehensive MLLM and dataset specifically designed for the FAS task. To address this gap, we propose FaceShield, an MLLM for FAS, along with the corresponding pre-training and supervised fine-tuning (SFT) datasets, FaceShield-pre10K and FaceShield-sft45K. FaceShield is capable of determining the authenticity of faces, identifying types of spoofing attacks, providing reasoning for its judgments, and detecting attack areas. Specifically, we employ spoof-aware vision perception (SAVP), which incorporates both the original image and auxiliary information based on prior knowledge. We then use a prompt-guided vision token masking (PVTM) strategy to randomly mask vision tokens, thereby improving the model's generalization ability. We conducted extensive experiments on three benchmark datasets, demonstrating that FaceShield significantly outperforms previous deep learning models and general MLLMs on four FAS tasks, i.e., coarse-grained classification, fine-grained classification, reasoning, and attack localization.
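
The vision token masking mentioned above can be pictured with the minimal sketch below, which simply drops a random fraction of vision tokens before they reach the language model. The function name, shapes, and mask ratio are illustrative assumptions, and the prompt-guided selection policy itself is not described in the abstract, so this sketch masks uniformly at random.

    # Minimal sketch of random vision-token masking (hypothetical names and shapes).
    import torch

    def random_mask_vision_tokens(vision_tokens: torch.Tensor,
                                  mask_token: torch.Tensor,
                                  mask_ratio: float = 0.3) -> torch.Tensor:
        """vision_tokens: (batch, num_tokens, dim); mask_token: (dim,)."""
        b, n, _ = vision_tokens.shape
        keep = torch.rand(b, n, device=vision_tokens.device) >= mask_ratio  # True = keep this token
        return torch.where(keep.unsqueeze(-1), vision_tokens,
                           mask_token.expand_as(vision_tokens))

Training with such masking forces the model not to rely on any single facial region, which is one plausible reading of how PVTM improves generalization.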

JBHI Journal 2026 Journal Article

HDPL: Hypergraph-based Dynamic Prompting Learning for Incomplete Multimodal Medical Learning

  • Xiaomin Zhou
  • Guoheng Huang
  • Qin Zhao
  • Jianbin He
  • Xiaochen Yuan
  • Ming Li
  • Chi-Man Pun
  • Ling Guo

Multimodal learning has garnered significant attention in the medical field because it provides a more comprehensive perspective by utilizing various types of data, which aids more accurate decision-making. However, the complexity of medical data, coupled with missing modalities, severely hinders predictive accuracy. Existing methods for multimodal learning with missing modalities still face considerable challenges. For instance, approaches that construct multimodal shared feature spaces often incur high computational costs, while methods that infer missing modalities from complete ones may rely too heavily on the complete modalities, potentially skewing results. Pre-trained transformer methods address these issues but still have limitations, such as being able to handle only one missing modality at test time. This is partly because structured data, unlike sequential data, lacks inherent minimum semantic units or a natural order. Additionally, the positional encodings generated by such methods may introduce information interference when applied to structured data, leading to poor alignment with sequential data during modality fusion in transformer models. To tackle these challenges, we introduce HDPL: Hypergraph-based Dynamic Prompt Learning for Incomplete Multimodal Medical Learning, comprising three modules. The High-Order Hypergraph Embedding module identifies the minimal semantic units within structured data and utilizes hypergraph structures to extract high-dimensional features from clinical data. The Multimodal Medical Data Integrator module reduces the distance between corresponding embedding vectors in the shared modality-feature space, facilitating the integration of modalities in the transformer. The Dynamic Network Structure Optimization module dynamically adjusts the width and depth of the network, improving the overall performance of the model and partially alleviating the shortcomings caused by incomplete modalities. Through comprehensive experimentation, we demonstrate the efficiency and robustness of our model in dealing with missing modalities and reducing training burden. Our code and dataset are available at https://github.com/colorful823/HDPL.
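
As background for the hypergraph embedding module described above, a standard hypergraph convolution of the kind commonly used to extract features over hypergraph structures is shown below; the abstract does not state that HDPL uses exactly this operator, so treat it as an illustrative assumption.

    X^{(l+1)} = \sigma\!\left( D_v^{-1/2}\, H\, W\, D_e^{-1}\, H^{\top}\, D_v^{-1/2}\, X^{(l)}\, \Theta^{(l)} \right)

Here H is the vertex-hyperedge incidence matrix (e.g., clinical variables grouped into hyperedges), D_v and D_e are the vertex and hyperedge degree matrices, W is a diagonal matrix of hyperedge weights, and \Theta^{(l)} is a learnable projection. Each vertex aggregates information from every hyperedge it belongs to, which is what lets higher-order relations among structured clinical variables be encoded.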

AAAI Conference 2026 Conference Paper

PASA: Progressive-Adaptive Spectral Augmentation for Automated Auscultation in Data-Scarce Environments

  • Ying Wang
  • Guoheng Huang
  • Xueyuan Gong
  • Xinxin Wang
  • Xiaochen Yuan

Automated auscultation advances the detection of respiratory diseases, especially in resource-limited areas where traditional diagnostic methods are unavailable. However, the scarcity of auscultation datasets limits automation performance, prompting the need for data augmentation methods. Most existing methods neglect the differences among acoustic samples that call for personalized augmentation strategies. To address this, we propose Progressive-Adaptive Spectral Augmentation (PASA), one of the first paradigms to adaptively select the best augmentation strategy for each sample. PASA treats the augmentation selection problem as a Markov Decision Process (MDP), creating an alternating loop between the diagnostic model and augmentation selection. The agent selects the optimal augmentation operations and magnitudes via a task-specific design comprising state construction, action sampling, Hybrid Batch-Sample (HBS) strategy execution, and reward guidance. The HBS strategy initially applies uniform augmentation across mini-batches while collecting sample-specific performance statistics; once model performance stabilizes, it transitions to sample-level augmentation based on accumulated difficulty assessments. This two-phase design balances computational complexity with personalization. Extensive experiments across three benchmark datasets demonstrate that PASA outperforms state-of-the-art methods, pioneering an adaptive data augmentation paradigm for automated auscultation.
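
The Hybrid Batch-Sample schedule described above can be illustrated with the short sketch below: uniform batch-level augmentation during warm-up, then per-sample augmentation once performance stabilizes. All names, operations, and thresholds are assumptions for illustration, not the authors' implementation.

    # Hypothetical sketch of a two-phase Hybrid Batch-Sample (HBS) schedule.
    import random

    AUGMENTATIONS = ["time_mask", "freq_mask", "noise", "pitch_shift"]  # assumed spectral ops

    def hbs_select(sample_ids, difficulty, recent_losses, patience=5, tol=1e-3):
        """Return {sample_id: (operation, magnitude)} for one mini-batch."""
        stabilized = (len(recent_losses) >= patience and
                      max(recent_losses[-patience:]) - min(recent_losses[-patience:]) < tol)
        if not stabilized:
            # Phase 1: one operation applied uniformly to the whole mini-batch,
            # while per-sample statistics (losses, difficulty) are being collected.
            op = random.choice(AUGMENTATIONS)
            return {sid: (op, 0.5) for sid in sample_ids}
        # Phase 2: per-sample choices, with magnitude scaled by accumulated difficulty.
        return {sid: (random.choice(AUGMENTATIONS), min(1.0, difficulty.get(sid, 0.5)))
                for sid in sample_ids}

In the MDP reading, the accumulated statistics play the role of the state, the (operation, magnitude) pairs are the actions, and the change in the diagnostic model's validation performance supplies the reward.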

AAAI Conference 2026 Conference Paper

SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition

  • Qilang Ye
  • Yu Zhou
  • Lian He
  • Jie Zhang
  • Xuanming Guo
  • Jiayu Zhang
  • Mingkui Tan
  • Weicheng Xie

Large Language Models (LLMs) hold rich implicit knowledge and powerful transferability. In this paper, we explore combining LLMs with the human skeleton to perform action classification and description. However, when treating an LLM as a recognizer, two questions arise: 1) How can LLMs understand the skeleton? 2) How can LLMs distinguish among actions? To address these problems, we introduce a novel paradigm named learning Skeleton representation with visual-motion knowledge for Action Recognition (SUGAR). In our pipeline, we first utilize off-the-shelf large-scale video models as a knowledge base to generate visual and motion information related to actions. Then, we propose to supervise skeleton learning with this prior knowledge to yield discrete representations. Finally, we use the LLM, with its pre-trained weights untouched, to understand these representations and generate the desired action targets and descriptions. Notably, we present a Temporal Query Projection (TQP) module to continuously model skeleton signals with long sequences. Experiments on several skeleton-based action classification benchmarks demonstrate the efficacy of our SUGAR. Moreover, experiments in zero-shot scenarios show that SUGAR is more versatile than linear-based methods.

AAAI Conference 2024 Conference Paper

PVALane: Prior-Guided 3D Lane Detection with View-Agnostic Feature Alignment

  • Zewen Zheng
  • Xuemin Zhang
  • Yongqiang Mou
  • Xiang Gao
  • Chengxin Li
  • Guoheng Huang
  • Chi-Man Pun
  • Xiaochen Yuan

Monocular 3D lane detection is essential for a reliable autonomous driving system and has been developing rapidly in recent years. Existing popular methods mainly employ predefined 3D anchors for lane detection based on front-view (FV) space, aiming to mitigate the effects of view transformations. However, the perspective geometric distortion between FV and 3D space in this FV-based approach leads to extremely dense anchor designs, which ultimately produce confusing lane representations. In this paper, we introduce a novel prior-guided perspective on lane detection and propose an end-to-end framework named PVALane, which utilizes 2D prior knowledge to achieve precise and efficient 3D lane detection. Since 2D lane predictions provide strong priors for lane existence, PVALane exploits FV features to generate sparse prior anchors with potential lanes in 2D space. These dynamic prior anchors help PVALane achieve distinct lane representations and effectively improve precision thanks to the reduced lane search space. Additionally, by leveraging these prior anchors and representing lanes in both FV and bird's-eye-view (BEV) spaces, we effectively align and merge semantic and geometric information from FV and BEV features. Extensive experiments conducted on the OpenLane and ONCE-3DLanes datasets demonstrate the superior performance of our method compared to existing state-of-the-art approaches and exhibit excellent robustness.

JBHI Journal 2024 Journal Article

Quaternion Cross-Modality Spatial Learning for Multi-Modal Medical Image Segmentation

  • Junyang Chen
  • Guoheng Huang
  • Xiaochen Yuan
  • Guo Zhong
  • Zewen Zheng
  • Chi-Man Pun
  • Jian Zhu
  • Zhixin Huang

Recently, Deep Neural Networks (DNNs) have had a large impact on image processing, including medical image segmentation, and real-valued convolutions have been extensively utilized in multi-modal medical image segmentation to accurately segment lesions by learning from data. However, the weighted summation operation in such convolutions limits their ability to maintain the spatial dependence that is crucial for identifying different lesion distributions. In this paper, we propose a novel Quaternion Cross-modality Spatial Learning (Q-CSL) method, which explores spatial information while considering the linkage between multi-modal images. Specifically, we introduce quaternions to represent data and coordinates that contain spatial information. Additionally, we propose a Quaternion Spatial-association Convolution to learn the spatial information. Subsequently, the proposed De-level Quaternion Cross-modality Fusion (De-QCF) module excavates inner-space features and fuses cross-modality spatial dependencies. Our experimental results demonstrate that our approach performs well compared with competitive methods, with only 0.01061 M parameters and 9.95 G FLOPs.
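
For readers unfamiliar with the quaternion representation referenced above: a quaternion feature bundles four real channels into a single number q = q_0 + q_1 i + q_2 j + q_3 k, and quaternion convolution replaces the usual per-channel multiply-accumulate with the Hamilton product, written below. This is standard quaternion algebra; the exact layer design of the Quaternion Spatial-association Convolution is not given in the abstract.

    p \otimes q = (p_0 q_0 - p_1 q_1 - p_2 q_2 - p_3 q_3)
                + (p_0 q_1 + p_1 q_0 + p_2 q_3 - p_3 q_2)\,\mathbf{i}
                + (p_0 q_2 - p_1 q_3 + p_2 q_0 + p_3 q_1)\,\mathbf{j}
                + (p_0 q_3 + p_1 q_2 - p_2 q_1 + p_3 q_0)\,\mathbf{k}

Because the four channels (for example, one imaging modality per channel) share weights through this product, cross-channel dependencies are preserved by construction, which is the usual motivation for quaternion convolutions in multi-modal segmentation and one way to read the parameter savings reported above.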

JBHI Journal 2023 Journal Article

QGD-Net: A Lightweight Model Utilizing Pixels of Affinity in Feature Layer for Dermoscopic Lesion Segmentation

  • Jingchao Wang
  • Guoheng Huang
  • Guo Zhong
  • Xiaochen Yuan
  • Chi-Man Pun
  • Jie Deng

Pixels with location affinity, which can also be called “pixels of affinity,” have similar semantic information. Group convolution and dilated convolution can exploit them to improve model capability. However, group convolution does not utilize pixels of affinity between layers, and for dilated convolution, after multiple convolutions with the same dilation rate, the pixels utilized within each layer no longer share location affinity with each other. To address the limitation of group convolution, our proposed quaternion group convolution uses quaternion convolution to promote communication between channels, thereby utilizing pixels of affinity across channels. In quaternion group convolution, the feature layers are divided into four layers per group, ensuring that quaternion convolution can be performed. To address the limitation of dilated convolution, we propose the quaternion sawtooth wave-like dilated convolution module (QS module). The QS module uses quaternion convolution with sawtooth wave-like dilation rates to effectively leverage pixels that share location affinity both between and within layers, expanding the receptive field and ultimately enhancing model performance. In particular, we apply our quaternion group convolution within the QS module to design the quaternion group dilated neural network (QGD-Net). Extensive experiments on dermoscopic lesion segmentation using ISIC 2016 and ISIC 2017 indicate that our method significantly reduces model parameters while markedly improving segmentation precision. Our method also shows generalizability in retinal vessel segmentation.
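
To make the sawtooth wave-like dilation rates concrete, the receptive field of a stack of stride-1 dilated convolutions with kernel size k and dilation rates d_1, ..., d_L is given below; the specific rates used by QGD-Net are not stated in the abstract, so the example cycle mentioned afterwards is only an illustrative assumption.

    \text{RF} = 1 + (k-1)\sum_{l=1}^{L} d_l

Cycling the rates in a sawtooth pattern (for example 1, 2, 5, 1, 2, 5, ...) grows this receptive field while repeatedly returning to small rates, which keeps neighbouring, location-affine pixels involved and avoids the gridding effect of stacking one large rate.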