Author name cluster

Xu Cao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers

2 author rows

EAAI Journal 2026 Journal Article

Corrigendum to “Exploring cross-branch information for semi-supervised remote sensing object detection” [Eng. Appl. Artif. Intel. 162 (2025) EAAI_112378]

Shitian He
Huanxin Zou
Yingqian Wang
Xu Cao
Hao Chen
Ning Jing


EAAI Journal 2026 Journal Article

Corrigendum to “MambaRSIS: Context-aware multi-scale feature aggregation with selective state space model for remote sensing instance segmentation” [Eng. Appl. Artif. Intel. 160 (2025) EAAI_111993]

Liyuan Pan
Xu Cao
Huanxin Zou
Hao Chen
Shitian He
Yuqing Zhang
Xuanming Liu
Jiangshan Li


TMLR Journal 2026 Journal Article

The Clever Hans Mirage: A Comprehensive Survey on Spurious Correlations in Machine Learning

Wenqian Ye
Luyang Jiang
Eric Xie
Guangtao Zheng
Yunsheng Ma
Xu Cao
Dongliang Guo
Daiqing Qi

Back in the early 20th century, a horse named Hans appeared to perform arithmetic and other intellectual tasks during exhibitions in Germany, while it actually relied solely on involuntary cues in the body language from the human trainer. Modern machine learning models are no different. These models are known to be sensitive to spurious correlations between non-essential features of the inputs (e.g., background, texture, and secondary objects) and the corresponding labels. Such features and their correlations with the labels are known as spurious because they tend to change with shifts in real-world data distributions, which can negatively impact the model's generalization and robustness. In this paper, we provide a comprehensive survey of this emerging issue, along with a fine-grained taxonomy of existing state-of-the-art methods for addressing spurious correlations in machine learning models. Additionally, we summarize existing datasets, benchmarks, and metrics to facilitate future research. The paper concludes with a discussion of the broader impacts, the recent advancements, and future challenges in the era of generative AI, aiming to provide valuable insights for researchers in the related domains of the machine learning community.


IROS Conference 2025 Conference Paper

EAROL: Environmental Augmented Perception-Aware Planning and Robust Odometry via Downward-Mounted Tilted LiDAR

Xinkai Liang
Yigu Ge
Yangxi Shi
Haoyu Yang
Xu Cao
Hao Fang 0001

To address the challenges of localization drift and perception-planning coupling in unmanned aerial vehicles (UAVs) operating in open-top scenarios (e.g., collapsed buildings, roofless mazes), this paper proposes EAROL, a novel framework with a downward-mounted tilted LiDAR configuration (20° inclination), integrating a LiDAR-Inertial Odometry (LIO) system and a hierarchical trajectory-yaw optimization algorithm. The hardware innovation enables constraint enhancement via dense ground point cloud acquisition and forward environmental awareness for dynamic obstacle detection. A tightly-coupled LIO system, empowered by an Iterative Error-State Kalman Filter (IESKF) with dynamic motion compensation, achieves high-level 6-DoF localization accuracy in feature-sparse environments. The environment-augmented planner balances environmental exploration, target tracking precision, and energy efficiency. Physical experiments demonstrate 81% tracking error reduction, 22% improvement in perceptual coverage, and near-zero vertical drift across indoor maze and 60-meter-scale outdoor scenarios. This work proposes a hardware-algorithm co-design paradigm, offering a robust solution for UAV autonomy in post-disaster search and rescue missions. We will release our software and hardware as an open-source package for the community. Video: https://youtu.be/7av2ueLSiYw.


EAAI Journal 2025 Journal Article

Exploring cross-branch information for semi-supervised remote sensing object detection

Shitian He
Huanxin Zou
Yingqian Wang
Xu Cao
Hao Chen
Ning Jing

Semi-supervised object detection (SSOD) provides a promising solution to mitigate the annotation costs in remote sensing applications. Mainstream teacher-student based SSOD methods leverage unlabeled images through pseudo labeling, and their effectiveness is fundamentally limited by the inevitable noise in pseudo labels, particularly for remote sensing (RS) scenarios with complex backgrounds and dense, multi-scale and oriented objects. Current methods primarily focus on reducing pseudo label noise through category, scale and Intersection over Union information mining, as well as designing fine-grained confidence thresholding strategies. However, the inherent discrepancy between classification and localization reliability is neglected. In this study, by analyzing the characteristic discrepancies between the classification and localization branches, we propose an artificial intelligence (AI) methodological innovation named the cross-branch information incorporation method (i.e., CBI-SSOD), which utilizes these discrepancies to assist the training of the classification branch and thus improve the performance of SSOD methods. Specifically, our method presents two key AI innovations. First, we propose a pretext task to extract cross-branch information, which can improve the classification ability by reinforcing consistent predictions between the classification branch and the pretext task. Second, we propose a pseudo label reassignment approach to adjust the soft classification pseudo labels, and thus suppress pseudo label noise and improve the detection performance. Extensive experiments on the Dataset for Object Detection in Aerial Images (DOTAv1.0) and DOTAv1.5 datasets validate the effectiveness and superiority of our method, and demonstrate the practical engineering impact of our method on RS applications and interpretation systems.


NeurIPS Conference 2025 Conference Paper

Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs

Yifan Shen
Yuanzhe Liu
Jingyuan Zhu
Xu Cao
Xiaofeng Zhang
Yixiao He
Wenming Ye
James Rehg

Current Vision-Language Models (VLMs) struggle with fine-grained spatial reasoning, particularly when multi-step logic and precise spatial alignment are required. In this work, we introduce SpatialReasoner-R1, a vision-language reasoning model designed to address these limitations. To construct high-quality supervision for spatial reasoning, we design a Multi-Model Monte Carlo Tree Search (M3CTS) method that generates diverse, logically consistent Long Chain-of-Thought (LongCoT) reasoning trajectories. In addition, we propose a fine-grained Direct Preference Optimization (fDPO) method that introduces segment-specific preference granularity for descriptive grounding and logical reasoning, guided by a spatial reward mechanism that evaluates candidate responses based on visual consistency, spatial grounding, and logical coherence. Experimental results demonstrate that fDPO achieves relative performance gains of 4.1% and 9.0% over standard DPO on spatial qualitative and quantitative tasks, respectively. SpatialReasoner-R1, trained with fDPO, sets a new SoTA on SpatialRGPT-Bench, outperforming the strongest baseline by 9.4% in average accuracy, while maintaining competitive performance on general vision-language tasks.
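For orientation, the "standard DPO" baseline mentioned in this abstract can be sketched as a per-pair loss. This is a minimal illustration of the widely used DPO objective only. It does not reproduce fDPO's segment-specific weighting or the spatial reward mechanism, which the abstract does not specify. The function name, argument names, and the default `beta` are assumptions made for this example.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    All four log-probabilities are sequence-level values: the policy's and the
    frozen reference model's log-likelihood of the chosen and rejected responses.
    `beta` is a hypothetical default, not a value taken from the paper.
    """
    # Implicit reward margin: how much more the policy (relative to the
    # reference) prefers the chosen response over the rejected one.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)) computed as a numerically stable softplus.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2. The loss falls as the policy shifts probability toward the chosen response.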


IROS Conference 2025 Conference Paper

IMVPR: Implicit BEV-Enhanced Multi-View Aggregation for Visual Place Recognition

Xu Cao
Caibo Zhang
Ziming Liu
Xuchang Zhong
Hao Fang

Visual Place Recognition (VPR) is essential for robotics and autonomous driving, enabling localization by matching current observations with a database of known places. While monocular VPR methods rely on visual features, they are sensitive to environmental changes, and multimodal approaches using LiDAR or radar incur high costs and complexity. Multi-view camera configurations offer a cost-effective alternative by expanding perception range and providing richer structural information. In this work, we propose IMVPR, an implicit BEV-enhanced multi-view place recognition network that achieves consistent and parallel multi-view feature fusion and place descriptors aggregation. Unlike methods that explicitly construct BEV features, we introduce descriptor queries to implicitly represent 3D spatial locations, facilitating spatial point projection-based fusion. A cross-attention mechanism further enables end-to-end multi-view feature aggregation. We evaluate IMVPR on four scenes from the nuScenes dataset, including both in-domain and out-of-domain scenarios, demonstrating its superior accuracy and generalization compared to state-of-the-art methods, including multimodal approaches. Our results highlight the potential of multi-view vision-based methods as a scalable and robust solution for VPR.


JBHI Journal 2025 Journal Article

MaKAN-Mixer: Channel Interaction-Based Mamba Method for rPPG Extraction

Hengrui Zhang
Feiyang Liao
Gang Yuan
Haoyang Jin
Biao Xie
Xu Cao
Mingcui Fu
Jian Zheng

Remote photoplethysmography (rPPG) achieves non-contact heart rate monitoring by detecting subtle skin color variations in facial videos, offering significant potential in healthcare, fitness, and security applications. However, accurately extracting rPPG signals in complex environments, especially under variable lighting and motion artifacts, remains challenging. The main difficulties are capturing spatio-temporal dynamics and modeling long-term dependencies across channels. To address these limitations, we propose MaKAN-Mixer, a novel end-to-end network designed to enhance the robustness and accuracy of rPPG signal extraction. First, MaKAN-Mixer integrates a Hybrid of Eulerian Video Magnification and Temporal Shift Module Amplification (HETA) to amplify subtle physiological signals and enhance temporal information without relying on explicit region-of-interest (ROI) selection. Additionally, we propose the Mamba-KAN Fusion Module (MKFM), which leverages Mamba's ability to efficiently model long-term dependencies in temporal sequences. By incorporating the Kolmogorov-Arnold Network (KAN) for effective channel mixing, MKFM ensures the comprehensive fusion of relevant spatio-temporal features across different channels. Finally, we employ a KAN Feedforward Neural Network (KFN) to capture complex, nonlinear, and periodic physiological patterns, improving heart rate estimation. Extensive experiments conducted on four benchmark datasets demonstrate that MaKAN-Mixer achieves superior performance in both intra- and cross-dataset testing, exhibiting exceptional robustness in challenging scenarios, particularly with compressed video data and complex environments. In comparison to the best-performing existing method, which reported RMSE values of 0.78/0.47/4.57/6.81 on the four datasets, MaKAN-Mixer reduces the RMSE to 0.66/0.40/0.32/6.25, highlighting its effectiveness across diverse conditions. Furthermore, novel visualization techniques were employed for qualitative validation of the results, underscoring its potential for accurate, real-world rPPG monitoring.
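The RMSE figures quoted in this abstract compare estimated heart rates against reference values. The sketch below shows how such a score is typically computed. The function name and the sample values are invented for illustration, and the abstract does not state the unit (beats per minute is assumed here).

```python
import math


def rmse(predicted, reference):
    """Root-mean-square error between two equal-length sequences of numbers."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical per-video heart-rate estimates vs. ground truth (assumed bpm).
estimated_hr = [72.0, 75.0, 80.0]
reference_hr = [70.0, 75.0, 83.0]
score = rmse(estimated_hr, reference_hr)
```

Lower is better, which is why the abstract's comparison reads as a reduction in RMSE.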


EAAI Journal 2025 Journal Article

MambaRSIS: Context-aware multi-scale feature aggregation with selective state space model for remote sensing instance segmentation

Liyuan Pan
Xu Cao
Huanxin Zou
Hao Chen
Shitian He
Yuqing Zhang
Xuanming Liu
Jiangshan Li

Remote sensing instance segmentation aims to detect and assign pixel-level labels to each instance in remote sensing images, which holds critical engineering significance for both civil and military applications. While existing domain-specific methods have made progress, they still struggle with three persistent challenges: ineffective context modeling in cluttered backgrounds, information loss during multi-scale feature fusion, and blurred boundaries for densely clustered small objects. To address these limitations, we propose a novel remote sensing instance segmentation framework with three artificial intelligence (AI) methodological innovations: a Context Perception Module (CPM) for context modeling, a Context Guided Multi-Scale Feature Aggregation (CGFA) method for multi-scale feature fusion, and a Multi-Path Region Proposal Extractor (MPRPE) with boundary-refined segmentation. The CPM leverages the selective state space model (Mamba) to capture long-range contextual information, effectively addressing the issue of cluttered backgrounds in remote sensing images. The CGFA replaces the standard feature pyramid network architecture, which is limited by direct summation or concatenation, and preserves fine-grained spatial details with context guidance. The MPRPE and boundary-aware segmentation head mitigate the challenges of missed detection of small objects and blurred edge predictions, which arise from the clustered distribution of small objects and semantic ambiguity. Extensive experiments on the challenging iSAID and NWPU VHR-10 datasets validate the proposed method’s consistent improvements across metrics while demonstrating its practical engineering impact on remote sensing interpretation systems.


IROS Conference 2025 Conference Paper

On-Board Vision-Language Models (VLMs) for Personalized Motion Control of Autonomous Vehicles

Can Cui 0009
Zichong Yang
Yupeng Zhou
Juntong Peng
Sung-Yeon Park
Cong Zhang
Yunsheng Ma
Xu Cao

Personalized driving refers to an autonomous vehicle’s ability to adapt its driving behavior or control strategies to match individual users’ preferences and driving styles while maintaining safety and comfort standards. However, existing works either fail to capture every individual’s preference precisely or become computationally inefficient as the user base expands. Vision-Language Models (VLMs) offer promising solutions to this front through their natural language understanding and scene reasoning capabilities. In this work, we propose a lightweight yet effective on-board VLM framework that provides low-latency personalized driving performance while maintaining strong reasoning capabilities. Our solution incorporates a Retrieval-Augmented Generation (RAG)-based memory module that enables continuous learning of individual driving preferences through human feedback. Through comprehensive real-world vehicle experiments, our system has demonstrated the ability to provide safe, comfortable, and personalized driving experiences across various scenarios and significantly reduce takeover rates by up to 76.9%. To the best of our knowledge, this work represents the first personalized VLM motion control system in real-world autonomous vehicles. The demo video can be watched at https://tinyurl.com/4xsnz79n.


IROS Conference 2025 Conference Paper

Planning and Control for Active Morphing Tensegrity Aerial Vehicles in Confined Spaces

Siyuan Hao
Zichen Tao
Yun Gui
Songyuan Liu
Jiaxu Shi
Xu Cao
Qingkai Yang

Morphing quadrotors are capable of adapting to constrained environments through geometric reconfiguration. However, existing systems are limited by mechanical complexity and rigid links, which affect both safety and performance in such environments. In this paper, we propose a strut-actuated tensegrity aerial vehicle that integrates shape adaptation with collision resilience. By incorporating deformable struts and a cable network, our vehicle enables real-time morphological adjustments during flight while maintaining stability. We present a hierarchical planning framework that ensures the entire vehicle remains confined within an icosahedral space, thereby guaranteeing full-body safety. An on-manifold Model Predictive Controller (MPC) is employed to track these optimized trajectories and compensate for inertia shifts during shape deformation. Simulation results validate the effectiveness of the proposed framework, demonstrating its capability to navigate in restricted scenarios.


NeurIPS Conference 2025 Conference Paper

Toward Human Deictic Gesture Target Estimation

Xu Cao
Pranav Virupaksha
Sangmin Lee
Bolin Lai
Wenqi Jia
Jintai Chen
James Rehg

Humans have a remarkable ability to use co-speech deictic gestures, such as pointing and showing, to enrich verbal communication and support social interaction. These gestures are so fundamental that infants begin to use them even before they acquire spoken language, which highlights their central role in human communication. Understanding the intended targets of another individual's deictic gestures enables inference of their intentions, comprehension of their current actions, and prediction of upcoming behaviors. Despite its significance, gesture target estimation remains an underexplored task within the computer vision community. In this paper, we introduce GestureTarget, a novel task designed specifically for comprehensive evaluation of social deictic gesture semantic target estimation. To address this task, we propose TransGesture, a set of Transformer-based gesture target prediction models. Given an input image and the spatial location of a person, our models predict the intended target of their gesture within the scene. Critically, our gaze-aware joint cross attention fusion model demonstrates how incorporating gaze-following cues significantly improves gesture target mask prediction IoU by 6% and gesture existence prediction accuracy by 10%. Our results underscore the complexity and importance of integrating gaze cues into deictic gesture intention understanding, advocating for increased research attention to this emerging area. All data and code will be made publicly available upon acceptance. Code of TransGesture is available at GitHub.com/IrohXu/TransGesture.


UAI Conference 2023 Conference Paper

Mitigating Transformer Overconfidence via Lipschitz Regularization

Wenqian Ye
Yunsheng Ma
Xu Cao
Kun Tang

Though Transformers have achieved promising results in many computer vision tasks, they tend to be over-confident in predictions, as the standard Dot Product Self-Attention (DPSA) can barely preserve distance for the unbounded input domain. In this work, we fill this gap by proposing a novel Lipschitz Regularized Transformer (LRFormer). Specifically, we present a new similarity function with the distance within Banach Space to ensure the Lipschitzness and also regularize the term by a contractive Lipschitz Bound. The proposed method is analyzed with a theoretical guarantee, providing a rigorous basis for its effectiveness and reliability. Extensive experiments conducted on standard vision benchmarks demonstrate that our method outperforms the state-of-the-art single forward pass approaches in prediction, calibration, and uncertainty estimation.
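The concern about dot-product self-attention can be illustrated with a toy comparison. The sketch below is not LRFormer's similarity function, which the abstract describes only as a distance within Banach space with a contractive Lipschitz bound. Instead it swaps in a plain negative squared L2 distance as a stand-in distance-aware score. All function names and the sample vectors are assumptions for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_weights(query, keys):
    """Attention weights from scaled dot-product similarity."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


def l2_distance_weights(query, keys):
    """Attention weights from negative squared L2 distance (illustrative stand-in)."""
    scale = math.sqrt(len(query))
    scores = [-sum((q - k) ** 2 for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


query = [1.0, 0.0]
keys = [[1.0, 0.0], [10.0, 0.0]]  # first key equals the query, second is far away
dp = dot_product_weights(query, keys)
l2 = l2_distance_weights(query, keys)
```

With these inputs the dot-product score favors the distant, large-norm key, while the distance-based score favors the key identical to the query. This is one way to see why a dot product does not track distance when inputs are unbounded.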


IJCAI Conference 2022 Conference Paper

AggPose: Deep Aggregation Vision Transformer for Infant Pose Estimation

Xu Cao
Xiaoye Li
Liya Ma
Yi Huang
Xuan Feng
Zening Chen
Hongwu Zeng
Jianguo Cao

Movement and pose assessment of newborns lets experienced pediatricians predict neurodevelopmental disorders, allowing early intervention for related diseases. However, most of the newest AI approaches to human pose estimation focus on adults, and a public benchmark for infant pose estimation is lacking. In this paper, we fill this gap by proposing an infant pose dataset and the Deep Aggregation Vision Transformer for human pose estimation, which introduces a fast-trained full transformer framework without using convolution operations to extract features in the early stages. It generalizes Transformer + MLP to high-resolution deep layer aggregation within feature maps, thus enabling information fusion between different vision levels. We pre-train AggPose on the COCO pose dataset and apply it to our newly released large-scale infant pose estimation dataset. The results show that AggPose could effectively learn the multi-scale features among different resolutions and significantly improve the performance of infant pose estimation. We show that AggPose outperforms the hybrid models HRFormer and TokenPose on the infant pose estimation dataset. Moreover, our AggPose outperforms HRFormer by 0.8 AP on COCO val pose estimation on average. Our code is available at github.com/SZAR-LAB/AggPose.


IROS Conference 2021 Conference Paper

Semi-supervised Vein Segmentation of Ultrasound Images for Autonomous Venipuncture

Yu Chen
Yuxuan Wang
Bolin Lai
Zijie Chen
Xu Cao
Nanyang Ye 0001
Zhongyuan Ren
Junbo Zhao

Venipuncture is an indispensable procedure for both diagnosis and treatment. In this paper, unlike existing solutions that fully or partially rely on professional assistance, a compact robotic system integrating both novel hardware and software developments is introduced. The hardware consists of a set of units to facilitate the supporting, positioning, puncturing, and imaging functionalities. To achieve full automation, a novel deep learning framework, semi-ResNeXt-Unet, is proposed for semi-supervised vein segmentation from ultrasound images. The depth of the vein is calculated, enabling automated navigation of the puncturing unit. The algorithm is validated on 40 volunteers, and the proposed semi-ResNeXt-Unet improves the dice similarity coefficient (DSC) by 5.36%, decreases the centroid error by 1.38 pixels and decreases the failure rate by 5.60%, compared to fully-supervised ResNeXt-Unet.
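The dice similarity coefficient reported in this abstract measures overlap between a predicted mask and a reference mask. A minimal sketch on flattened binary masks follows. The function name, the sample masks, and the choice to score two empty masks as 1.0 are assumptions for this example, not details from the paper.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice similarity coefficient for two flat binary masks (sequences of 0/1)."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of pixels")
    # Pixels marked as vein in both masks.
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    if total == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0
    return 2.0 * intersection / total


# Hypothetical 2x2 masks flattened row by row.
predicted = [1, 1, 0, 0]
reference = [1, 0, 0, 0]
dsc = dice_coefficient(predicted, reference)
```

Here one pixel overlaps, the predicted mask has two foreground pixels and the reference has one, so the score is 2 × 1 / (2 + 1).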
