Author name cluster

Kai Xu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

42 papers

2 author rows

JBHI Journal 2026 Journal Article

A 3D Edge-Attention Denoising Diffusion Network for Prostate Segmentation in Puncture Biopsy

Haomin Kuang
Jiaxin Guo
Kai Xu
Yun-Hui Liu

Prostate cancer is the second most common cancer in men, and transrectal ultrasound (TRUS) guided biopsy is the standard method to diagnose prostate cancer. Accurate prostate segmentation in TRUS images is crucial for precise biopsy. Manual segmentation is laborious, while automated segmentation faces significant challenges due to the low signal-to-noise ratio, blurred boundaries, and presence of noise and artifacts. To address these issues, this paper proposes a 3D edge-attention denoising diffusion network, aiming to achieve high accuracy and generalizability for prostate segmentation in TRUS-guided biopsy. The proposed network incorporates an edge attention denoising U-Net (EAD U-Net) to extract and utilize desired edge information in TRUS images, improving the segmentation accuracy in challenging regions of the prostate. To reduce uncertainty and enhance network accuracy, we incorporate a Kalman fusion module, which utilizes the Kalman filter and all estimations from the EAD U-Net in the reverse process to obtain the optimal segmentation estimation. The proposed network was evaluated using 1834 3D ultrasound images from two open-source datasets. Comparative experiments with existing methods demonstrate that our method surpasses state-of-the-art techniques, proving its effectiveness in prostate segmentation from TRUS images. The proposed method achieved an average Dice similarity coefficient of 92.92% and 94.0%, and a 95th percentile Hausdorff distance of 1.07 mm and 0.77 mm on the two datasets, demonstrating the potential to facilitate accurate MRI-TRUS fusion guided prostate biopsy.
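The Dice similarity coefficient reported in this abstract is the standard overlap measure 2|A ∩ B| / (|A| + |B|) between a predicted mask A and a reference mask B. The sketch below is only an illustration of that definition on flattened binary masks; the function name and the toy masks are assumptions and are not taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example (hypothetical data): 3 foreground voxels each, 2 shared.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
score = dice_coefficient(pred, target)
print(f"Dice = {score:.4f}")
```

A 3D volume can be passed in after flattening it to a single list of voxels; the measure itself does not depend on the spatial layout.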


AAAI Conference 2026 Conference Paper

AnchorHOI: Zero-shot Generation of 4D Human-Object Interaction via Anchor-based Prior Distillation

Sisi Dai
Kai Xu

Despite significant progress in text-driven 4D human-object interaction (HOI) generation with supervised methods, the scalability remains limited by the scarcity of large-scale 4D HOI datasets. To overcome this, recent approaches attempt zero-shot 4D HOI generation with pre-trained image diffusion models. However, interaction cues are minimally distilled during the generation process, restricting their applicability across diverse scenarios. In this paper, we propose AnchorHOI, a novel framework that thoroughly exploits hybrid priors by incorporating video diffusion models beyond image diffusion models, advancing 4D HOI generation. Nevertheless, directly optimizing high-dimensional 4D HOI with such priors remains challenging, particularly for human pose and compositional motion. To address this challenge, AnchorHOI introduces an anchor-based prior distillation strategy, which constructs interaction-aware anchors and then leverages them to guide generation in a tractable two-step process. Specifically, two tailored anchors are designed for 4D HOI generation: anchor Neural Radiance Fields (NeRFs) for expressive interaction composition, and anchor keypoints for realistic motion synthesis. Extensive experiments demonstrate that AnchorHOI outperforms previous methods with superior diversity and generalization.


EAAI Journal 2026 Journal Article

Cross-modal correlation-guided hierarchical multiscale network for cloud removal of optical remote sensing imagery

Anling Wang
Kai Xu
Wenxin Wang
Chengcheng Fan

Synthetic aperture radar (SAR) provides significant complementary information for cloud removal in optical remote sensing imagery, enabling the recovery of large-scale missing regions. To achieve high-quality reconstruction, this study focuses on two key challenges: extracting valid complementary information from SAR data and maintaining a balance between global scene coherence and fine-grained detail restoration. Therefore, we propose a cross-modal correlation-guided hierarchical multiscale network, which synergizes multimodal fusion and multiscale optimization to restore scene details occluded by clouds. For cross-modal fusion, the correlation-propagated fusion module aims to propagate SAR-derived global correlations to cloud-contaminated optical images. Furthermore, the hierarchical multiscale image reconstruction combines location-driven feature aggregation, which aggregates information between adjacent scales, with the optimization capabilities of a deep supervision mechanism across multiple levels. Extensive experiments demonstrate that the proposed method surpasses current methods by over 0.4 in peak signal-to-noise ratio on both real and synthetic datasets, producing cloud-free results with clearer object details and a more harmonious overall appearance. Five-fold cross-validation and comparisons under varying cloud cover levels further validate the proposed method's strong generalization and robustness. Moreover, the validation of land cover classification after cloud removal highlights its practical applicability in real.


JBHI Journal 2026 Journal Article

Fundus Image Enhancement With Pyramid Conditional Flow

Kai Xu
Zhen Liang
Wenjun Wei
Huaian Chen
Yi Jin

Deep learning-based approaches, which learn pixel-to-pixel mapping from input to output images, have demonstrated exceptional performance in enhancing low-quality fundus images. However, due to the ambiguous definition of the ground-truth high-quality image, the pixel-to-pixel mapping encounters an ill-posed problem arising from the complex one-to-many relationship between low-quality fundus images and their corresponding high-quality versions. To address this problem, this work proposes PCFlow, the first normalizing flow method that learns the complex distributions of high-quality fundus images rather than a pixel-to-pixel mapping. Unlike existing natural image enhancement methods that aim to restore images with comfortable visual quality, PCFlow enhances fundus images by prioritizing clinically significant information. To this end, we design a condition module that utilizes retinal structure as a conditioning factor to constrain the optimization of PCFlow, and then build an invertible coupling layer that employs a pyramid structure for identifying each frequency component of retinal features. With the cooperation and interactions of these key components, the proposed PCFlow preserves the retinal structures and pathological characteristics essential for clinical applications. Extensive experiments on the real and synthetic fundus datasets demonstrate that our method achieves better performance.


AAAI Conference 2026 Conference Paper

Topology-Inspired Backward-Free Framework for Test-Time Adaptation in Medical Detection

Bin Pu
Xingguo Lv
Jiewen Yang
Kai Xu
Lei Zhao
Zuozhu Liu
Kenli Li

Recently, Test-Time Adaptation (TTA) has gained increasing attention in medical imaging due to its ability to improve model generalization under domain shifts without retraining. In particular, directly applying a well-trained model across various medical centers faces significant performance degradation caused by variations in equipment, operators, imaging conditions, and scanning skill levels of sonographers. Existing TTA methods either rely on parameter adaptation that increases computational cost or apply simple prediction fusion that ignores anatomical structure knowledge. To address these limitations, we propose a novel backward-free Topology-aware TTA framework named T^3 that integrates Structural Perception Modeling (SPM) and Box Regression Adaptation (BRA). SPM is implemented through an organ space heatmap generated via Gaussian kernel superposition. This heatmap encodes anatomical topology without requiring additional training or source data. BRA further improves localization and classification by fusing detection outputs based on the contribution of detected results to anatomically meaningful peak points from the heatmaps. Extensive experiments were conducted across six cross-domain scenarios, and the results demonstrate that our method achieves state-of-the-art cross-domain detection performance while maintaining high efficiency, offering a practical and robust solution for real-world medical diagnostic applications.
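The abstract describes an organ-space heatmap built by superposing Gaussian kernels. A minimal sketch of that general idea follows, assuming one isotropic Gaussian per detected box center on a small 2D grid; the function names, the grid size, `sigma`, and the example centers are all hypothetical, and the paper's actual construction may differ.

```python
import math


def gaussian_heatmap(height, width, centers, sigma=1.0):
    """Sum one isotropic Gaussian per (row, col) center over a height x width grid."""
    heat = [[0.0] * width for _ in range(height)]
    for cy, cx in centers:
        for y in range(height):
            for x in range(width):
                squared_dist = (y - cy) ** 2 + (x - cx) ** 2
                # Superposition: contributions from all centers are added.
                heat[y][x] += math.exp(-squared_dist / (2.0 * sigma ** 2))
    return heat


def peak_location(heat):
    """Return the (row, col) of the largest heatmap value (first one on ties)."""
    best = max(
        ((value, (y, x)) for y, row in enumerate(heat) for x, value in enumerate(row)),
        key=lambda item: item[0],
    )
    return best[1]


# Hypothetical detections: two box centers on a 5 x 9 grid.
heat = gaussian_heatmap(5, 9, centers=[(2, 2), (2, 6)], sigma=1.0)
print(peak_location(heat))
```

No training or gradient step is involved in building such a map, which is consistent with the backward-free setting the abstract describes.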


AAAI Conference 2026 Conference Paper

Unified Mixture-of-Experts Framework for Joint Cardiac and Vascular Ultrasound Analysis and Report Generation

Bin Pu
Jiewen Yang
Xingguo Lv
Kai Xu
Kenli Li

Echocardiography and vascular ultrasound are essential for comprehensive cardiovascular assessment, yet manual evaluation and report writing are labor-intensive, time-consuming, and require expertise from both cardiology and vascular surgery departments. Current automated report generation systems mainly focus on X-ray or CT, often neglecting echocardiographic modalities and critical quantitative parameters like aortic diameter and main pulmonary artery diameter, limiting their clinical utility. Moreover, the interdependence between cardiac and peripheral vascular health necessitates cross-departmental insights, which existing methods fail to incorporate. To address these limitations, we first propose a vision-language framework, named Echo-Cardiac-Vascular (ECV), for joint cardiac and vascular ultrasound report generation and parameter measurements. ECV introduces a Mixture-of-Experts vision encoder tailored for distinct ultrasound subtypes, a structured parameter measurement module for accurate quantification, and task-specific decoders that generate interpretable, multimodal diagnostic reports. Our framework, trained on 10K+ paired records, achieves high accuracy, improving diagnostic efficiency, consistency, and cross-disciplinary clinical applicability.


EAAI Journal 2025 Journal Article

A time and frequency convolutional Autoencoder for anomaly detection in industrial robots based on inertial measurement unit error calibration

Jianlong Li
Xiaoqin Liu
Xing Wu
Dongxiao Wang
Kai Xu
Yashan Li

In the realm of industrial robots, ensuring operational reliability and long-term autonomy hinges on the accurate detection of anomalies. However, sample differences due to noise, joint random errors, and sensor errors increase the challenge of robot anomaly detection. To address this problem, an unsupervised deep learning method based on inertial measurement unit (IMU) error calibration is proposed. Firstly, the attitude signals acquired by the IMU from the end of the robot were calibrated using Kalman filtering. The three-dimensional (3D) free acceleration was corrected based on the calibrated attitude signal, and the calibrated 3D free acceleration signal was used as a signal sample. Secondly, a time and frequency convolutional autoencoder model (TFCAE) is proposed. The distribution of the different component signals is fitted by stacking multiple encoder modules, and 3D-TFCAE is used as the reconstruction model for the 3D free acceleration signal. Then, the error sphere radius is calculated based on the reconstruction error of the 3D free acceleration signal. The error sphere radius is used as the anomaly detection threshold to realize the robust detection of different types of anomalies. The model was evaluated on a constructed anomaly dataset. This study contributes an innovative 3D-TFCAE architecture, integrating Kalman filtering with time-frequency feature fusion, markedly enhancing anomaly detection in complex signal environments. Experimental findings reveal that 3D-TFCAE significantly outperforms 18 baseline models, improving detection accuracy by about 20%–40%, offering an effective solution for high-precision anomaly detection in industrial robots. The code for this project is available at https://github.com/LJlong977/3DTFCAE.


ICLR Conference 2025 Conference Paper

EvA: Erasing Spurious Correlations with Activations

Qiyuan He
Kai Xu
Angela Yao

Spurious correlations often arise when models associate features strongly correlated with, but not causally related to, the label, e.g., an image classifier associates bodies of water with ducks. To mitigate spurious correlations, existing methods focus on learning unbiased representations or incorporating additional information about the correlations during training. This work removes spurious correlations by "Erasing with Activations" (EvA). EvA learns a class-specific spurious indicator on each channel for the fully connected layer of pretrained networks. By erasing spurious connections during re-weighting, EvA achieves state-of-the-art performance across diverse datasets (6.2% relative gain on BAR and 4.1% on Waterbirds). For biased datasets without any information about the spurious correlations, EvA can outperform previous methods (4.8% relative gain on Waterbirds) with 6 orders of magnitude less compute, highlighting its data and computational efficiency.


AAAI Conference 2025 Conference Paper

Hierarchically-Structured Open-Vocabulary Indoor Scene Synthesis with Pre-trained Large Language Model

Weilin Sun
Xinran Li
Manyi Li
Kai Xu
Xiangxu Meng
Lei Meng

Indoor scene synthesis aims to automatically produce plausible, realistic, and diverse 3D indoor scenes, especially given arbitrary user requirements. Recently, the promising generalization ability of pre-trained large language models (LLMs) has assisted in open-vocabulary indoor scene synthesis. However, the challenge lies in converting the LLM-generated outputs into reasonable and physically feasible scene layouts. In this paper, we propose to generate hierarchically structured scene descriptions with an LLM and then compute the scene layouts. Specifically, we train a hierarchy-aware network to infer the fine-grained relative positions between objects and design a divide-and-conquer optimization to solve for scene layouts. The advantages of using a hierarchically structured scene representation are two-fold. First, the hierarchical structure provides a rough grounding for object arrangement, which alleviates contradictory placements with dense relations and enhances the generalization ability of the network to infer fine-grained placements. Second, it naturally supports the divide-and-conquer optimization, by first arranging the sub-scenes and then the entire scene, to more effectively solve for a feasible layout. We conduct extensive comparison experiments and ablation studies with both qualitative and quantitative evaluations to validate the effectiveness of our key designs with the hierarchically structured scene representation. Our approach can generate more reasonable scene layouts that are better aligned with the user requirements and LLM descriptions. We also present open-vocabulary scene synthesis and interactive scene design results to show the strength of our approach in the applications.


AAAI Conference 2025 Conference Paper

Physical-aware Neural Radiance Fields for Efficient Exposure Correction

Kai Xu
Mingwen Shao
Yuanjian Qiao
Yan Wang

Neural Radiance Fields (NeRF) has achieved remarkable success in synthesizing impressive novel views. However, existing methods usually fail to handle scenes with adverse lighting conditions caused by external time variations and different camera settings, leading to poor visual quality. To address this challenge, we propose a physical-aware NeRF for efficient exposure correction, named PHY-NeRF. Specifically, we design Adaptive Lighting Particles inspired by the theory of light scattering and absorption, which can adjust the illumination intensity during volume rendering. Subsequently, we can handle scenes with different lighting conditions by jointly optimizing camera parameters and these lighting particles. Moreover, to promote natural brightness transitions, we devise a global illumination consistency module to control the lighting intensity across views at the feature level while completing more details. Benefiting from the above designs, our PHY-NeRF can tackle arbitrary low-light or overexposed scenes in an unsupervised manner. Extensive experiments show that our PHY-NeRF achieves state-of-the-art results in addressing adverse lighting problems while ensuring high rendering efficiency.


NeurIPS Conference 2025 Conference Paper

Rollout Roulette: A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods

Isha Puri
Shivchander Sudalairaj
Guangxuan Xu
Abhishek Bhandwaldar
Kai Xu
Akash Srivastava

Large language models (LLMs) have achieved significant performance gains via scaling up model sizes and/or data. However, recent evidence suggests diminishing returns from such approaches, motivating a pivot to scaling test-time compute. Existing deterministic inference-time scaling methods, usually with reward models, cast the task as a search problem, but suffer from a key limitation: early pruning. Due to inherently imperfect reward models, promising trajectories may be discarded prematurely, leading to suboptimal performance. We propose a novel inference-time scaling approach by adapting particle-based Monte Carlo methods. Our method maintains a diverse set of candidates and robustly balances exploration and exploitation. Our empirical evaluation demonstrates that our particle filtering methods have a 4–16x better scaling rate over deterministic search counterparts on both challenging mathematical and more general reasoning tasks. Using our approach, we show that Qwen2.5-Math-1.5B-Instruct surpasses GPT-4o accuracy in only 4 rollouts, while Qwen2.5-Math-7B-Instruct scales to o1 level accuracy in only 32 rollouts. Our work not only presents an effective method for inference-time scaling, but also connects rich literature in probabilistic inference with inference-time scaling of LLMs to develop more robust algorithms in future work.
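The contrast the abstract draws between deterministic pruning and particle-based sampling can be illustrated with a generic particle filter loop: extend every candidate, score it, then resample candidates in proportion to a softmax of their scores instead of keeping only the top ones. The sketch below is a toy under stated assumptions, not the paper's implementation: candidates are tuples of digits, `extend` appends a random digit in place of an LLM generation step, and `reward` is a digit sum standing in for a reward model. All names are hypothetical.

```python
import math
import random


def resample(particles, rewards, rng, temperature=1.0):
    """Draw a new population with probability proportional to softmax(reward / T)."""
    top = max(rewards)
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp((r - top) / temperature) for r in rewards]
    return rng.choices(particles, weights=weights, k=len(particles))


def particle_filter(initial, extend, reward, steps, n_particles, seed=0):
    """Alternate extension and stochastic resampling for a fixed number of steps."""
    rng = random.Random(seed)
    particles = [initial] * n_particles
    for _ in range(steps):
        particles = [extend(p, rng) for p in particles]
        rewards = [reward(p) for p in particles]
        # Low-reward candidates are down-weighted rather than pruned outright.
        particles = resample(particles, rewards, rng)
    return max(particles, key=reward)


# Toy stand-ins (assumptions): a "partial solution" is a tuple of digits.
def extend(partial, rng):
    return partial + (rng.randrange(10),)


def reward(partial):
    return float(sum(partial))


best = particle_filter((), extend, reward, steps=5, n_particles=8)
print(best, reward(best))
```

Because resampling is stochastic, a candidate with a mediocre intermediate score keeps a nonzero chance of surviving, which is the property the abstract sets against early pruning.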