Arrow Research search

Author name cluster

Xiaohao Cai

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
2 author rows

Possible papers (7)

AAAI Conference 2026 Conference Paper

HiFi-Mamba: Dual-Stream W-Laplacian Enhanced Mamba for High-Fidelity MRI Reconstruction

  • Hongli Chen
  • Pengcheng Fang
  • Yuxia Chen
  • Yingxuan Ren
  • Jing Hao
  • Fangfang Tang
  • Xiaohao Cai
  • Shanshan Shan

Reconstructing high-fidelity MR images from undersampled k-space data remains a challenging problem in MRI. While Mamba variants for vision tasks offer promising long-range modeling capabilities with linear-time complexity, their direct application to MRI reconstruction inherits two key limitations: (1) insensitivity to high-frequency anatomical details; and (2) reliance on redundant multi-directional scanning. To address these limitations, we introduce High-Fidelity Mamba (HiFi-Mamba), a novel dual-stream Mamba-based architecture comprising stacked W-Laplacian (WL) and HiFi-Mamba blocks. Specifically, the WL block performs fidelity-preserving spectral decoupling, producing complementary low- and high-frequency streams. This separation enables the HiFi-Mamba block to focus on low-frequency structures, enhancing global feature modeling. Concurrently, the HiFi-Mamba block selectively integrates high-frequency features through adaptive state-space modulation, preserving comprehensive spectral details. To eliminate the scanning redundancy, the HiFi-Mamba block adopts a streamlined unidirectional traversal strategy that preserves long-range modeling capability with improved computational efficiency. Extensive experiments on standard MRI reconstruction benchmarks demonstrate that HiFi-Mamba consistently outperforms state-of-the-art CNN-based, Transformer-based, and other Mamba-based models in reconstruction accuracy while maintaining a compact and efficient model design.
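The fidelity-preserving spectral decoupling performed by the WL block can be pictured with a toy split: a simple box-filter low-pass plus its residual. This is an illustrative stand-in, not the paper's W-Laplacian operator, and the names `spectral_decouple` and `ksize` are hypothetical.

```python
import numpy as np

def spectral_decouple(img, ksize=5):
    """Toy fidelity-preserving split of an image into a low-frequency
    stream (box blur) and a high-frequency residual stream."""
    pad = ksize // 2
    padded = np.pad(img, pad, mode="reflect")
    low = np.zeros_like(img, dtype=float)
    # Box filter: average over a ksize x ksize neighbourhood.
    for dy in range(ksize):
        for dx in range(ksize):
            low += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    low /= ksize * ksize
    high = img - low          # residual carries edges and fine detail
    return low, high

img = np.random.rand(32, 32)
low, high = spectral_decouple(img)
# By construction the split is exactly invertible: low + high == img.
assert np.allclose(low + high, img)
```

The "fidelity-preserving" property here is simply that the two streams sum back to the input, so no spectral content is lost by the split itself.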

AAAI Conference 2026 Conference Paper

MOGO: Residual Quantized Hierarchical Causal Transformer for Real-Time and Infinite-Length 3D Human Motion Generation

  • Dongjie Fu
  • Tengjiao Sun
  • Pengcheng Fang
  • Xiaohao Cai
  • Hansung Kim

Recent advances in transformer-based text-to-motion generation have significantly improved motion quality. However, achieving both real-time performance and long-horizon scalability remains an open challenge. In this paper, we present MOGO (Motion Generation with One-pass), a novel autoregressive framework for efficient and scalable 3D human motion generation. MOGO consists of two key components. First, we introduce MoSA-VQ, a motion scale-adaptive residual vector quantization module that hierarchically discretizes motion sequences through learnable scaling parameters, enabling dynamic allocation of representation capacity and producing compact yet expressive multi-level representations. Second, we design the RQHC-Transformer, a residual quantized hierarchical causal transformer that decodes motion tokens in a single forward pass. Each transformer block aligns with one quantization level, allowing hierarchical abstraction and temporally coherent generation with strong semantic flow. Compared to diffusion- and LLM-based approaches, MOGO achieves lower inference latency while preserving high motion fidelity. Moreover, its hierarchical latent design enables seamless and controllable infinite-length motion generation, with stable transitions and the ability to adaptively incorporate updated control signals at arbitrary points in time. To further enhance generalization and interpretability, we introduce Textual Condition Alignment (TCA), which leverages large language models with Chain-of-Thought reasoning to bridge the gap between real-world prompts and training data. TCA not only improves zero-shot performance on unseen datasets but also enriches motion comprehension for in-distribution prompts through explicit intent decomposition. Extensive experiments on HumanML3D, KIT-ML, and the unseen CMP dataset demonstrate that MOGO outperforms prior methods in generation quality, inference efficiency, and temporal scalability.
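The residual vector quantization at the heart of MoSA-VQ can be sketched with fixed codebooks: each level quantizes whatever the previous levels left unexplained. In the paper the codebooks and scaling parameters are learned; here they are random, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_quantize(x, codebooks):
    """Hierarchical residual VQ: level l quantizes the residual left
    by levels 0..l-1, and the reconstruction sums all levels."""
    residual = x.copy()
    codes, recon = [], np.zeros_like(x)
    for cb in codebooks:
        # Nearest code per vector at this level.
        d = np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=-1)
        idx = d.argmin(axis=1)
        q = cb[idx]
        codes.append(idx)
        recon = recon + q
        residual = residual - q      # pass the leftover down a level
    return codes, recon

x = rng.normal(size=(8, 4))
# Toy fixed codebooks with shrinking scale; each includes a zero code so a
# level can "opt out" and never make the reconstruction worse.
codebooks = [np.vstack([np.zeros((1, 4)),
                        rng.normal(size=(15, 4)) * 0.7 ** l])
             for l in range(3)]
codes, recon = residual_quantize(x, codebooks)
assert np.linalg.norm(x - recon) <= np.linalg.norm(x)
```

The hierarchy is what RQHC-Transformer exploits: one transformer block per quantization level, coarse structure first, finer residual detail after.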

NeurIPS Conference 2025 Conference Paper

CALM: Culturally Self-Aware Language Models

  • Lingzhi Shen
  • Xiaohao Cai
  • Yunfei Long
  • Imran Razzak
  • Guanming Chen
  • Shoaib Jameel

Cultural awareness in language models is the capacity to understand and adapt to diverse cultural contexts. However, most existing approaches treat culture as static background knowledge, overlooking its dynamic and evolving nature. This limitation reduces their reliability in downstream tasks that demand genuine cultural sensitivity. In this work, we introduce CALM, a novel framework designed to endow language models with cultural self-awareness. CALM disentangles task semantics from explicit cultural concepts and latent cultural signals, shaping them into structured cultural clusters through contrastive learning. These clusters are then aligned via cross-attention to establish fine-grained interactions among related cultural features and are adaptively integrated through a Mixture-of-Experts mechanism along culture-specific dimensions. The resulting unified representation is fused with the model's original knowledge to construct a culturally grounded internal identity state, which is further enhanced through self-prompted reflective learning, enabling continual adaptation and self-correction. Extensive experiments conducted on multiple cross-cultural benchmark datasets demonstrate that CALM consistently outperforms state-of-the-art methods.
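The Mixture-of-Experts integration step can be sketched as softmax gating over per-culture expert projections. This is a minimal sketch of generic MoE routing, not CALM's actual module; `moe_mix`, the shapes, and the random weights are all assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_mix(x, expert_ws, gate_w):
    """Route a feature vector through E experts and blend their outputs
    with softmax gate weights computed from the same feature."""
    gates = softmax(x @ gate_w)                    # (E,) mixture weights
    outs = np.stack([x @ W for W in expert_ws])    # (E, d_out)
    return gates @ outs, gates

rng = np.random.default_rng(1)
d, d_out, E = 6, 4, 3
x = rng.normal(size=d)
expert_ws = [rng.normal(size=(d, d_out)) for _ in range(E)]
gate_w = rng.normal(size=(d, E))
y, gates = moe_mix(x, expert_ws, gate_w)
assert np.isclose(gates.sum(), 1.0)
```

In CALM's terms, each expert would correspond to a culture-specific dimension and `x` to the cluster-aligned representation being integrated.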

TMLR Journal 2025 Journal Article

Neural varifolds: an aggregate representation for quantifying the geometry of point clouds

  • Juheon Lee
  • Xiaohao Cai
  • Carola-Bibiane Schönlieb
  • Simon Masnou

Point clouds are popular 3D representations for real-life objects (such as in LiDAR and Kinect) due to their detailed and compact representation of surface-based geometry. Recent approaches characterise the geometry of point clouds by bringing deep learning based techniques together with geometric fidelity metrics such as optimal transportation costs (e.g., Chamfer and Wasserstein metrics). In this paper, we propose a new surface geometry characterisation within this realm, namely a neural varifold representation of point clouds. Here, the surface is represented as a measure/distribution over both point positions and tangent spaces of point clouds. The varifold representation quantifies not only the surface geometry of point clouds through the manifold-based representation, but also subtle geometric consistencies on the surface due to the combined product space. This study proposes neural varifold algorithms to compute the varifold norm between two point clouds using neural networks on point clouds and their neural tangent kernel representations. The proposed neural varifold is evaluated on three different sought-after tasks -- shape matching, few-shot shape classification, and shape reconstruction. Detailed evaluation and comparison to the state-of-the-art methods demonstrate that the proposed versatile neural varifold is superior in shape matching and few-shot shape classification, and is competitive for shape reconstruction.
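A discrete varifold distance between point clouds can be sketched with a product kernel: a Gaussian kernel on positions times an unoriented kernel on unit normals (standing in for tangent planes). The paper computes this with neural networks and neural tangent kernels; the plain kernels and names below are illustrative.

```python
import numpy as np

def varifold_inner(X, NX, Y, NY, sigma=0.5):
    """Inner product of two discrete varifolds: Gaussian kernel on
    point positions times an unoriented kernel on unit normals."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    k_pos = np.exp(-d2 / (2 * sigma ** 2))
    k_dir = (NX @ NY.T) ** 2        # squared, so the normal's sign cancels
    return (k_pos * k_dir).sum()

def varifold_dist(X, NX, Y, NY):
    sq = (varifold_inner(X, NX, X, NX)
          - 2 * varifold_inner(X, NX, Y, NY)
          + varifold_inner(Y, NY, Y, NY))
    return np.sqrt(max(sq, 0.0))    # clamp tiny negative round-off

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))
NX = rng.normal(size=(10, 3))
NX /= np.linalg.norm(NX, axis=1, keepdims=True)
Y = X + 0.3                         # a shifted copy of the cloud
assert varifold_dist(X, NX, X, NX) < 1e-6
```

The product space of positions and normals is what lets the norm penalize clouds that occupy the same region but disagree on surface orientation.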

ICRA Conference 2025 Conference Paper

Talk2Radar: Bridging Natural Language with 4D mmWave Radar for 3D Referring Expression Comprehension

  • Runwei Guan
  • Ruixiao Zhang 0001
  • Ningwei Ouyang
  • Jianan Liu
  • Ka Lok Man
  • Xiaohao Cai
  • Ming Xu 0011
  • Jeremy S. Smith

Embodied perception is essential for intelligent vehicles and robots in interactive environmental understanding. However, these advancements primarily focus on vision, with limited attention given to using 3D modeling sensors, restricting a comprehensive understanding of objects in response to prompts containing qualitative and quantitative queries. Recently, as a promising automotive sensor with affordable cost, 4D millimeter-wave radars provide denser point clouds than conventional radars and perceive both semantic and physical characteristics of objects, thereby enhancing the reliability of perception systems. To foster the development of natural language-driven context understanding in radar scenes for 3D visual grounding, we construct the first dataset, Talk2Radar, which bridges these two modalities for 3D Referring Expression Comprehension (REC). Talk2Radar contains 8,682 referring prompt samples with 20,558 referred objects. Moreover, we propose a novel model, T-RadarNet, for 3D REC on point clouds, achieving State-Of-The-Art (SOTA) performance on the Talk2Radar dataset compared to counterparts. Deformable-FPN and Gated Graph Fusion are meticulously designed for efficient point cloud feature modeling and cross-modal fusion between radar and text features, respectively. Comprehensive experiments provide deep insights into radar-based 3D REC. We release our project at https://github.com/GuanRunwei/Talk2Radar.
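Gated cross-modal fusion in the spirit of T-RadarNet's Gated Graph Fusion can be sketched as a per-channel gate, computed from the text feature, that decides how much radar versus text signal survives. This is a drastic simplification of the actual module; the function name, shapes, and random weights are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(radar_feat, text_feat, Wg):
    """Toy gated fusion: the text feature produces a gate in (0, 1)
    per channel, interpolating between the two modalities."""
    gate = sigmoid(text_feat @ Wg)
    return gate * radar_feat + (1.0 - gate) * text_feat

rng = np.random.default_rng(3)
d = 8
radar, text = rng.normal(size=d), rng.normal(size=d)
Wg = rng.normal(size=(d, d))
fused = gated_fusion(radar, text, Wg)
# Each fused channel is a convex blend of the two modality values.
assert np.all(fused >= np.minimum(radar, text) - 1e-9)
assert np.all(fused <= np.maximum(radar, text) + 1e-9)
```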

ECAI Conference 2024 Conference Paper

Detect Closer Surfaces That Can be Seen: New Modeling and Evaluation in Cross-Domain 3D Object Detection

  • Ruixiao Zhang 0001
  • Yihong Wu 0004
  • Juheon Lee
  • Xiaohao Cai
  • Adam Prügel-Bennett

The performance of domain adaptation technologies has not yet reached an ideal level in the current 3D object detection field for autonomous driving, which is mainly due to significant differences in the size of vehicles, as well as the environments they operate in when applied across domains. These factors together hinder the effective transfer and application of knowledge learned from specific datasets. Since the existing evaluation metrics are initially designed for evaluation on a single domain by calculating the 2D or 3D overlap between the prediction and ground-truth bounding boxes, they often suffer from the overfitting problem caused by the size differences among datasets. This raises a fundamental question related to the evaluation of the 3D object detection models’ cross-domain performance: Do we really need models to maintain excellent performance in their original 3D bounding boxes after being applied across domains? From a practical application perspective, one of our main focuses is actually on preventing collisions between vehicles and other obstacles, especially in cross-domain scenarios where correctly predicting the size of vehicles is much more difficult. In other words, as long as a model can accurately identify the closest surfaces to the ego vehicle, it is sufficient to effectively avoid obstacles. In this paper, we propose two metrics to measure 3D object detection models’ ability to detect the surfaces closer to the sensor on the ego vehicle, which can be used to evaluate their cross-domain performance more comprehensively and reasonably. Furthermore, we propose a refinement head, named EdgeHead, to guide models to focus more on the learnable closer surfaces, which can greatly improve the cross-domain performance of existing models not only under our new metrics, but also under the original BEV/3D metrics. Our code is available at https://github.com/Galaxy-ZRX/EdgeHead.
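The idea behind the closer-surface metrics can be sketched for axis-aligned BEV boxes with the ego vehicle at the origin: measure how far the ego is from each box's nearest face, then compare prediction against ground truth. The paper's metrics handle full rotated boxes; this simplification and the function names are ours.

```python
import numpy as np

def closest_face_dist(center, size):
    """Distance from the ego (at the origin) to the nearest face of an
    axis-aligned BEV box; 0 if the ego is inside the box."""
    half = np.asarray(size, float) / 2.0
    d = np.abs(np.asarray(center, float)) - half   # per-axis face distance
    return float(np.linalg.norm(np.maximum(d, 0.0)))

def closer_surface_error(pred_c, pred_s, gt_c, gt_s):
    """How far the predicted near surface is from the true near surface."""
    return abs(closest_face_dist(pred_c, pred_s)
               - closest_face_dist(gt_c, gt_s))

# GT box 10 m ahead and 4 m long: its near face is 8 m from the ego.
assert closest_face_dist((10, 0), (4, 2)) == 8.0
# A prediction that overestimates length by 2 m misplaces that face by 1 m.
assert closer_surface_error((10, 0), (6, 2), (10, 0), (4, 2)) == 1.0
```

Note how the second assertion captures the paper's point: the predicted box has the right center but the wrong size, and only the near-face error matters for collision avoidance.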

ECAI Conference 2023 Conference Paper

A Bilevel Formalism for the Peer-Reviewing Problem

  • Gennaro Auricchio
  • Ruixiao Zhang 0001
  • Jie Zhang 0008
  • Xiaohao Cai

Due to the large number of submissions that more and more conferences experience, finding an automated way to distribute the submitted papers well among reviewers has become necessary. We model the peer-reviewing matching problem as a bilevel programming (BP) formulation. Our model consists of a lower-level problem describing the reviewers’ perspective and an upper-level problem describing the editors’. Every reviewer is interested in minimizing their overall effort, while the editors are interested in finding an allocation that maximizes the quality of the reviews and best follows the reviewers’ preferences. To the best of our knowledge, the proposed model is the first one that formulates the peer-reviewing matching problem by considering two objective functions, one to describe the reviewers’ viewpoint and the other to describe the editors’ viewpoint. We demonstrate that both the upper-level and lower-level problems are feasible and that our BP model admits a solution under mild assumptions. After studying the properties of the solutions, we propose a heuristic to solve our model and compare its performance with the relevant state-of-the-art methods. Extensive numerical results show that our approach can find fairer solutions with competitive quality and less effort from the reviewers. (Our code website: https://github.com/Galaxy-ZRX/Bilevel-Review.)
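The paper proposes its own heuristic for the full bilevel model; as a far simpler baseline, the editors' upper-level objective alone can be sketched as a greedy quality-maximizing assignment under reviewer load caps. Function and parameter names here are hypothetical, and the greedy pass may leave papers unassigned when caps are tight.

```python
def assign_papers(quality, load_cap, reviews_per_paper=1):
    """Greedy heuristic for the editors' side: repeatedly hand each
    paper to the highest-quality available reviewer, respecting
    per-reviewer load caps. quality[r][p] scores reviewer r on paper p."""
    n_rev, n_pap = len(quality), len(quality[0])
    load = [0] * n_rev
    assignment = {p: [] for p in range(n_pap)}
    # All (score, reviewer, paper) pairs, best scores first.
    pairs = sorted(((quality[r][p], r, p)
                    for r in range(n_rev) for p in range(n_pap)),
                   reverse=True)
    for q, r, p in pairs:
        if load[r] < load_cap and len(assignment[p]) < reviews_per_paper:
            assignment[p].append(r)
            load[r] += 1
    return assignment

# Two reviewers, two papers: each reviewer gets the paper they score best.
quality = [[0.9, 0.1],
           [0.2, 0.8]]
assignment = assign_papers(quality, load_cap=1)
```

The bilevel model goes beyond this by letting the lower level (reviewers minimizing effort) constrain which allocations the editors may choose, which a one-sided greedy pass cannot capture.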