Arrow Research search

Author name cluster

Xiao Sun

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

18 papers
2 author rows

Possible papers (18)

JBHI Journal 2026 Journal Article

Head-and-Neck Organs Segmentation in CT Based on Spatial Prior and Shape Description

  • Chengyang An
  • Tao Yang
  • Xiao Sun
  • Yu Qiao
  • Yubing Li
  • Jilan Jiang
  • Ling Zhu
  • Lisheng Wang

Accurate delineation of organs at risk (OARs) is critical for effective radiotherapy in head and neck cancer, and various deep learning methods have been proposed for this task. Although these methods can effectively segment large organs, they all struggle to segment small organs with high accuracy, due to the large number, complex spatial distribution, and diverse shapes of small organs in the head and neck region. To address this challenge, this paper proposes a novel segmentation framework that incorporates the spatial distribution information of all organs and shape priors of small organs into deep networks to constrain and enhance small-organ segmentation. First, a spatial guidance network (SG-Net) is proposed to generate spatial guidance maps (SGMs) of organs, emphasizing the boundaries of different organs and their spatial positional relationships, thereby providing useful spatial cues to constrain organ segmentation. Second, for small-volume organs, we specifically design a deep shape description module (DSDM) to extract organ-specific shape features from CT images and integrate them into the original deep features to enhance the features' sensitivity to shape constraints. Finally, a regularization term is employed to reduce excessive smoothing in the predicted probability maps of the deep network, preserving the shape details of small organs. With this framework, small-organ segmentation is significantly improved while the segmentation accuracy of large organs is maintained. Experimental results demonstrate its effectiveness for segmentation of small organs, with a significant improvement over state-of-the-art methods.

AAAI Conference 2026 Conference Paper

RacketVision: A Multiple Racket Sports Benchmark for Unified Ball and Racket Analysis

  • Linfeng Dong
  • Yuchen Yang
  • Hao Wu
  • Wei Wang
  • Yuenan Hou
  • Zhihang Zhong
  • Xiao Sun

We introduce RacketVision, a novel dataset and benchmark for advancing computer vision in sports analytics, covering table tennis, tennis, and badminton. The dataset is the first to provide large-scale, fine-grained annotations for racket pose alongside traditional ball positions, enabling research into complex human-object interactions. It is designed to tackle three interconnected tasks: fine-grained ball tracking, articulated racket pose estimation, and predictive ball trajectory forecasting. Our evaluation of established baselines reveals a critical insight for multi-modal fusion: while naively concatenating racket pose features degrades performance, a Cross-Attention mechanism is essential to unlock their value, leading to trajectory prediction results that surpass strong unimodal baselines. RacketVision provides a versatile resource and a strong starting point for future research in dynamic object tracking, conditional motion forecasting, and multi-modal analysis in sports.
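The fusion finding above lends itself to a small illustration. The sketch below shows one generic way to inject racket-pose features into ball-trajectory features with cross-attention rather than concatenation; the module name, feature dimensions, and residual layout are assumptions made for illustration, not the benchmark's reference code.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse racket-pose features into ball features via cross-attention
    (illustrative sketch; not the RacketVision baseline implementation)."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ball_feats: torch.Tensor, racket_feats: torch.Tensor) -> torch.Tensor:
        # ball_feats: (B, T, dim) queries; racket_feats: (B, T, dim) keys/values.
        fused, _ = self.attn(ball_feats, racket_feats, racket_feats)
        # Residual connection keeps the unimodal ball features as a fallback,
        # unlike naive concatenation, which forces the predictor to use both streams.
        return self.norm(ball_feats + fused)
```

Read this way, the attention weights let the trajectory head down-weight the racket stream when it is uninformative, which is one plausible reading of why cross-attention helps where concatenation hurts.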

AAAI Conference 2025 Conference Paper

GRPose: Learning Graph Relations for Human Image Generation with Pose Priors

  • Xiangchen Yin
  • Donglin Di
  • Lei Fan
  • Hao Li
  • Wei Chen
  • Xiaofei Gou
  • Yang Song
  • Xiao Sun

Recent methods using diffusion models have made significant progress in human image generation with various control signals such as pose priors. However, existing efforts are still struggling to generate high-quality images with consistent pose alignment, resulting in unsatisfactory output. In this paper, we propose a framework that delves into the graph relations of pose priors to provide control information for human image generation. The main idea is to establish a graph topological structure between the pose priors and latent representation of diffusion models to capture the intrinsic associations between different pose parts. A Progressive Graph Integrator (PGI) is designed to learn the spatial relationships of the pose priors with the graph structure, adopting a hierarchical strategy within an Adapter to gradually propagate information across different pose parts. Besides, a pose perception loss is introduced based on a pretrained pose estimation network to minimize the pose differences. Extensive qualitative and quantitative experiments conducted on the Human-Art and LAION-Human datasets clearly demonstrate that our model can achieve significant performance improvement over the latest benchmark models.

ICLR Conference 2025 Conference Paper

PEARL: Parallel Speculative Decoding with Adaptive Draft Length

  • Tianyu Liu
  • Yun Li
  • Qitan Lv
  • Kai Liu 0052
  • Jianchen Zhu
  • Winston Hu
  • Xiao Sun

Speculative decoding (SD), where an extra draft model is employed to provide multiple draft tokens first and then the original target model verifies these tokens in parallel, has shown great power for LLM inference acceleration. However, existing SD methods suffer from the mutual waiting problem, i.e., the target model gets stuck when the draft model is guessing tokens, and vice versa. This problem is directly incurred by the asynchronous execution of the draft model and the target model, and is exacerbated by the fixed draft length in speculative decoding. To address these challenges, we propose a conceptually simple, flexible, and general framework to boost speculative decoding, namely Parallel spEculative decoding with Adaptive dRaft Length (PEARL). Specifically, PEARL proposes pre-verify to verify the first draft token in advance during the drafting phase, and post-verify to generate more draft tokens during the verification phase. By applying the two strategies, PEARL parallelizes the drafting and verification phases and achieves adaptive draft length for different scenarios, which effectively alleviates the mutual waiting problem. Experiments on various text generation benchmarks demonstrate the effectiveness of PEARL, with speedups of up to 4.43x and 1.50x compared to auto-regressive decoding and vanilla speculative decoding, respectively.
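As a rough illustration of the drafting/verification interplay described above, here is a sequential emulation of one speculative-decoding round with a pre-verify check and early-exit acceptance. The `next_token` interface, the draft budget `gamma`, and the greedy acceptance rule are assumptions of this sketch, and actual PEARL overlaps the two phases in parallel rather than running them one after the other.

```python
def speculative_step(seq, draft_model, target_model, gamma=4):
    """Hedged sketch of one speculative-decoding round with a pre-verify step
    (sequential emulation; not the authors' parallel PEARL implementation).
    Both models are assumed to expose a greedy next_token(list_of_ids) -> id.
    """
    # Drafting phase: the draft model proposes up to `gamma` tokens.
    drafts, cur = [], list(seq)
    for _ in range(gamma):
        t = draft_model.next_token(cur)
        drafts.append(t)
        cur.append(t)

    # Pre-verify: check the first draft token (PEARL overlaps this with drafting,
    # so a wrong first guess does not waste a whole drafting round).
    first = target_model.next_token(seq)
    if first != drafts[0]:
        return seq + [first]

    # Verification phase: accept the longest draft prefix the target agrees with;
    # on the first mismatch, keep the target's own token as the correction.
    accepted = [drafts[0]]
    for t in drafts[1:]:
        expected = target_model.next_token(seq + accepted)
        if expected != t:
            accepted.append(expected)
            break
        accepted.append(t)
    return seq + accepted
```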

ICRA Conference 2025 Conference Paper

Safety-Critical Traffic Simulation with Adversarial Transfer of Driving Intentions

  • Zherui Huang
  • Xing Gao 0005
  • Guanjie Zheng
  • Licheng Wen
  • Xuemeng Yang
  • Xiao Sun

Traffic simulation, complementing real-world data with a long-tail distribution, allows for effective evaluation and enhancement of the ability of autonomous vehicles to handle accident-prone scenarios. However, simulating such safety-critical scenarios from log data, which typically contain regular scenarios, is nontrivial, especially when considering the dynamic adversarial interactions between the future motions of autonomous vehicles and surrounding traffic participants. To address this, this paper proposes an innovative and efficient strategy, termed IntSim, that explicitly decouples the driving intentions of surrounding actors from their motion planning for realistic and efficient safety-critical simulation. We formulate the adversarial transfer of driving intention as an optimization problem, facilitating extensive exploration of diverse attack behaviors and efficient solution convergence. Simultaneously, intention-conditioned motion planning benefits from powerful deep models and large-scale real-world data, permitting the simulation of realistic motion behaviors for actors. In particular, by adapting driving intentions to the environment, IntSim facilitates the flexible realization of dynamic adversarial interactions with autonomous vehicles. Finally, extensive open-loop and closed-loop experiments on real-world datasets, including nuScenes and Waymo, demonstrate that the proposed IntSim achieves state-of-the-art performance in simulating realistic safety-critical scenarios and further improves planners in handling such scenarios.

AAAI Conference 2024 Conference Paper

Aleth-NeRF: Illumination Adaptive NeRF with Concealing Field Assumption

  • Ziteng Cui
  • Lin Gu
  • Xiao Sun
  • Xianzheng Ma
  • Yu Qiao
  • Tatsuya Harada

The standard Neural Radiance Fields (NeRF) paradigm employs a viewer-centered methodology, entangling the aspects of illumination and material reflectance into emission solely from 3D points. This simplified rendering approach presents challenges in accurately modeling images captured under adverse lighting conditions, such as low light or over-exposure. Motivated by the ancient Greek emission theory that posits visual perception as a result of rays emanating from the eyes, we slightly refine the conventional NeRF framework to train NeRF under challenging light conditions and generate normal-light novel views unsupervisedly. We introduce the concept of a "Concealing Field", which assigns transmittance values to the surrounding air to account for illumination effects. In dark scenarios, we assume that object emissions maintain a standard lighting level but are attenuated as they traverse the air during the rendering process. The Concealing Field thus compels NeRF to learn reasonable density and colour estimations for objects even in dimly lit situations. Similarly, the Concealing Field can mitigate over-exposed emissions during the rendering stage. Furthermore, we present a comprehensive multi-view dataset captured under challenging illumination conditions for evaluation. Our code and proposed dataset are available at https://github.com/cuiziteng/Aleth-NeRF.

AAAI Conference 2024 Conference Paper

Bi-ViT: Pushing the Limit of Vision Transformer Quantization

  • Yanjing Li
  • Sheng Xu
  • Mingbao Lin
  • Xianbin Cao
  • Chuanjian Liu
  • Xiao Sun
  • Baochang Zhang

Vision transformer (ViT) quantization offers a promising prospect for deploying large pre-trained networks on resource-limited devices. Fully binarized ViTs (Bi-ViT), which push ViT quantization to its limit, remain largely unexplored and are still very challenging due to their unacceptable performance. Through extensive empirical analyses, we identify that the severe drop in ViT binarization is caused by attention distortion in self-attention, which technically stems from gradient vanishing and ranking disorder. To address these issues, we first introduce a learnable scaling factor to reactivate the vanished gradients and illustrate its effectiveness through theoretical and experimental analyses. We then propose a ranking-aware distillation method to rectify the disordered ranking in a teacher-student framework. Bi-ViT achieves significant improvements over popular DeiT and Swin backbones in terms of Top-1 accuracy and FLOPs. For example, with DeiT-Tiny and Swin-Tiny, our method significantly outperforms baselines by 22.1% and 21.4% respectively, while achieving 61.5x and 56.1x theoretical acceleration in terms of FLOPs compared with real-valued counterparts on ImageNet. Our code and models are available at https://github.com/YanjingLi0202/Bi-ViT/.
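To make the "learnable scaling factor" idea concrete, the toy module below applies a trainable scale around a sign-binarized tensor with a straight-through estimator. This is a generic binarization sketch under my own assumptions; Bi-ViT applies its scaling inside self-attention, which is not shown here.

```python
import torch
import torch.nn as nn

class ScaledBinarize(nn.Module):
    """Generic learnable-scale binarization sketch (illustrative only;
    not the Bi-ViT attention-specific formulation)."""

    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))  # learnable scaling factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        binary = torch.sign(x)
        # Straight-through estimator: the forward pass uses sign(x) while the
        # backward pass routes gradients through x, so alpha can keep them alive.
        binary = (binary - x).detach() + x
        return self.alpha * binary
```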

ICML Conference 2024 Conference Paper

Learning 1-Bit Tiny Object Detector with Discriminative Feature Refinement

  • Sheng Xu 0007
  • Mingze Wang
  • Yanjing Li
  • Mingbao Lin
  • Baochang Zhang 0001
  • David S. Doermann
  • Xiao Sun

1-bit detectors show impressive performance comparable to their real-valued counterparts when detecting commonly sized objects, while exhibiting significant performance degradation on tiny objects. The challenge stems from the fact that high-level features extracted by 1-bit convolutions are less able to reveal discriminative foreground features. To address this issue, we introduce a Discriminative Feature Refinement method for 1-bit Detectors (DFR-Det), aiming to enhance the discriminative ability of foreground representation for tiny objects in aerial images. This is accomplished by refining the feature representation using an information bottleneck (IB) to achieve a distinctive representation of tiny objects. Specifically, we introduce a new decoder with a foreground mask, aiming to enhance the discriminative ability of high-level features for the target while suppressing the background impact. Additionally, our decoder is simple but effective and can be easily mounted on existing detectors without extra burden added to the inference procedure. Extensive experiments on various tiny object detection (TOD) tasks demonstrate DFR-Det's superiority over state-of-the-art 1-bit detectors. For example, the 1-bit FCOS obtained with DFR-Det achieves 12.8% AP on the AI-TOD dataset, approaching the performance of the real-valued counterpart.

NeurIPS Conference 2024 Conference Paper

LucidAction: A Hierarchical and Multi-model Dataset for Comprehensive Action Quality Assessment

  • Linfeng Dong
  • Wei Wang
  • Yu Qiao
  • Xiao Sun

Action Quality Assessment (AQA) research confronts formidable obstacles due to limited, mono-modal datasets sourced from one-shot competitions, which hinder the generalizability and comprehensiveness of AQA models. To address these limitations, we present LucidAction, the first systematically collected multi-view AQA dataset structured on curriculum learning principles. LucidAction features a three-tier hierarchical structure, encompassing eight diverse sports events with four curriculum levels, facilitating sequential skill mastery and supporting a wide range of athletic abilities. The dataset encompasses multi-modal data, including multi-view RGB video, 2D and 3D pose sequences, enhancing the richness of information available for analysis. Leveraging a high-precision multi-view Motion Capture (MoCap) system ensures precise capture of complex movements. Meticulously annotated data, incorporating detailed penalties from professional gymnasts, ensures the establishment of robust and comprehensive ground truth annotations. Experimental evaluations employing diverse contrastive regression baselines on LucidAction elucidate the dataset's complexities. Through ablation studies, we investigate the advantages conferred by multi-modal data and fine-grained annotations, offering insights into improving AQA performance. The data and code will be openly released to support advancements in the AI sports field.

AAAI Conference 2024 Conference Paper

PREFER: Prompt Ensemble Learning via Feedback-Reflect-Refine

  • Chenrui Zhang
  • Lin Liu
  • Chuyuan Wang
  • Xiao Sun
  • Hongyu Wang
  • Jinpeng Wang
  • Mingchen Cai

As an effective tool for eliciting the power of Large Language Models (LLMs), prompting has recently demonstrated unprecedented abilities across a variety of complex tasks. To further improve the performance, prompt ensemble has attracted substantial interest for tackling the hallucination and instability of LLMs. However, existing methods usually adopt a two-stage paradigm, which requires a pre-prepared set of prompts with substantial manual effort, and is unable to perform directed optimization for different weak learners. In this paper, we propose a simple, universal, and automatic method named PREFER (Prompt Ensemble learning via Feedback-Reflect-Refine) to address the stated limitations. Specifically, given the fact that weak learners are supposed to focus on hard examples during boosting, PREFER builds a feedback mechanism for reflecting on the inadequacies of existing weak learners. Based on this, the LLM is required to automatically synthesize new prompts for iterative refinement. Moreover, to enhance stability of the prompt effect evaluation, we propose a novel prompt bagging method involving forward and backward thinking, which is superior to majority voting and is beneficial for both feedback and weight calculation in boosting. Extensive experiments demonstrate that our PREFER achieves state-of-the-art performance in multiple types of tasks by a significant margin. We have made our code publicly available.
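The feedback-reflect-refine loop described above can be read as a small boosting procedure over prompts. Below is a hedged sketch under assumed interfaces (an `llm` callable mapping a prompt string to text, and examples given as (question, is_correct) pairs); the seed prompt, weighting rule, and stopping condition are illustrative and not the paper's released code.

```python
def prefer_style_boosting(llm, examples, rounds=3):
    """Sketch of a feedback-reflect-refine prompt-boosting loop (assumptions:
    llm is a str -> str callable; examples is a list of (question, is_correct)
    pairs where is_correct checks a model answer)."""
    learners = []
    prompt = "Answer the question."                      # initial weak learner
    for _ in range(rounds):
        # Hard examples: cases the current prompt gets wrong.
        failures = [q for q, is_correct in examples
                    if not is_correct(llm(prompt + "\n" + q))]
        learners.append((prompt, 1.0 / (1.0 + len(failures))))  # crude weight
        if not failures:
            break
        # Feedback/reflect: ask the LLM to diagnose the prompt's weakness ...
        reflection = llm("Prompt:\n" + prompt
                         + "\nFailed questions:\n" + "\n".join(failures)
                         + "\nDescribe the prompt's main weakness.")
        # ... then refine: synthesize a new prompt targeting those failures.
        prompt = llm("Rewrite the prompt to address this weakness:\n" + reflection)
    return learners
```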

TMLR Journal 2023 Journal Article

Extreme Masking for Learning Instance and Distributed Visual Representations

  • Zhirong Wu
  • Zihang Lai
  • Xiao Sun
  • Stephen Lin

The paper presents a scalable approach for learning spatially distributed visual representations over individual tokens and a holistic instance representation simultaneously. We use self-attention blocks to represent spatially distributed tokens, followed by cross-attention blocks to aggregate the holistic instance. The core of the approach is the use of extremely large token masking (75%-90%) as the data augmentation for supervision. Our model, named ExtreMA, follows the plain BYOL approach where the instance representation from the unmasked subset is trained to predict that from the intact input. Instead of encouraging invariance across inputs, learning requires the model to capture informative variations in an image. The paper makes three contributions: 1) It presents random masking as a strong and computationally efficient data augmentation for siamese representation learning. 2) With multiple sampling per instance, extreme masking greatly speeds up learning and improves performance with more data. 3) ExtreMA obtains stronger linear probing performance than masked modeling methods, and better transfer performance than prior contrastive models.
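A minimal sketch of the masking step follows, assuming patch tokens of shape (B, N, D) and a uniform random mask; the ratio and indexing scheme are illustrative rather than the ExtreMA training code.

```python
import torch

def random_token_subset(tokens: torch.Tensor, mask_ratio: float = 0.85) -> torch.Tensor:
    """Keep a small random subset of tokens, as in extreme-masking augmentation
    (illustrative sketch; not the authors' implementation).

    tokens: (B, N, D) patch tokens; returns (B, keep, D) visible tokens.
    """
    B, N, _ = tokens.shape
    keep = max(1, int(round(N * (1.0 - mask_ratio))))
    # Random scores per token; the top-`keep` scores select the visible subset.
    keep_idx = torch.rand(B, N, device=tokens.device).topk(keep, dim=1).indices
    batch_idx = torch.arange(B, device=tokens.device).unsqueeze(1)
    return tokens[batch_idx, keep_idx]
```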

NeurIPS Conference 2023 Conference Paper

Q-DM: An Efficient Low-bit Quantized Diffusion Model

  • Yanjing Li
  • Sheng Xu
  • Xianbin Cao
  • Xiao Sun
  • Baochang Zhang

Denoising diffusion generative models are capable of generating high-quality data but suffer from a computationally costly generation process, due to iterative noise estimation using full-precision networks. As an intuitive solution, quantization can significantly reduce computational and memory consumption through low-bit parameters and operations. However, low-bit noise estimation networks in diffusion models (DMs) remain unexplored and perform much worse than their full-precision counterparts, as observed in our experimental studies. In this paper, we first identify that the bottlenecks of low-bit quantized DMs come from a large distribution oscillation on activations and accumulated quantization error caused by the multi-step denoising process. To address these issues, we develop a Timestep-aware Quantization (TaQ) method and a Noise-estimating Mimicking (NeM) scheme for low-bit quantized DMs (Q-DM) to effectively eliminate the oscillation and the accumulated error, respectively, leading to well-performing low-bit DMs. In this way, we propose an efficient Q-DM that obtains low-bit DMs by considering both the training and inference processes in the same framework. We evaluate our methods on the popular DDPM and DDIM models. Extensive experimental results show that our method achieves much better performance than the prior arts. For example, the 4-bit Q-DM theoretically accelerates the 1000-step DDPM by 7.8x and achieves an FID score of 5.17 on the unconditional CIFAR-10 dataset.

AAAI Conference 2021 Conference Paper

BoW Pooling: A Plug-and-Play Unit for Feature Aggregation of Point Clouds

  • Xiang Zhang
  • Xiao Sun
  • Zhouhui Lian

Point clouds provide a compact and flexible representation for 3D shapes and have recently attracted more and more attention due to increasing demands in practical applications. The major challenge of handling such irregular data is how to achieve permutation invariance of the points in the input. Most existing methods extract local descriptors that encode the geometry of local structures, followed by a symmetric function to form a global representation. Max pooling usually serves as the symmetric function and shows slight superiority over average pooling. We argue that some discriminative information is inevitably lost when applying max pooling across all local descriptors. In this paper, we propose BoW pooling, a plug-and-play unit to substitute for max pooling. Our BoW pooling analyzes the set of local descriptors statistically and generates a histogram that reflects how the primitives in the dictionary constitute the overall geometry. Extensive experiments demonstrate that the proposed BoW pooling effectively improves performance on point cloud classification, shape retrieval, and segmentation tasks and outperforms other existing symmetric functions.
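To make the pooling contrast concrete, here is a NumPy sketch of a hard-assignment bag-of-words histogram over local descriptors next to plain max pooling. The dictionary is assumed to be given (e.g., learned or clustered offline), and the hard assignment and normalization choices are illustrative rather than the paper's exact formulation.

```python
import numpy as np

def bow_pooling(local_desc: np.ndarray, dictionary: np.ndarray) -> np.ndarray:
    """Hard-assignment BoW histogram over local descriptors (illustrative sketch).

    local_desc: (N, D) descriptors for N points
    dictionary: (K, D) primitives ("visual words")
    Returns a normalized length-K histogram used in place of max pooling.
    """
    # Squared Euclidean distance from each descriptor to each dictionary word.
    d2 = ((local_desc[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=-1)
    assign = d2.argmin(axis=1)                              # nearest word per point
    hist = np.bincount(assign, minlength=dictionary.shape[0]).astype(np.float64)
    return hist / max(hist.sum(), 1.0)

def max_pooling(local_desc: np.ndarray) -> np.ndarray:
    """The usual symmetric function, kept here for comparison."""
    return local_desc.max(axis=0)
```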

NeurIPS Conference 2020 Conference Paper

ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training

  • Chia-Yu Chen
  • Jiamin Ni
  • Songtao Lu
  • Xiaodong Cui
  • Pin-Yu Chen
  • Xiao Sun
  • Naigang Wang
  • Swagath Venkataramani

Large-scale distributed training of Deep Neural Networks (DNNs) on state-of-the-art platforms is expected to be severely communication constrained. To overcome this limitation, numerous gradient compression techniques have been proposed and have demonstrated high compression ratios. However, most existing compression methods do not scale well to large distributed systems (due to gradient build-up) and/or lack evaluations on large datasets. To mitigate these issues, we propose a new compression technique, Scalable Sparsified Gradient Compression (ScaleCom), that (i) leverages similarity in the gradient distribution among learners to provide a commutative compressor and keep communication cost constant with respect to the number of workers, and (ii) includes a low-pass filter in local gradient accumulation to mitigate the impact of large-batch training and significantly improve scalability. Using theoretical analysis, we show that ScaleCom provides favorable convergence guarantees and is compatible with gradient all-reduce techniques. Furthermore, we experimentally demonstrate that ScaleCom has small overheads, directly reduces gradient traffic, and provides high compression rates (70-150X) and excellent scalability (up to 64-80 learners and 10X larger batch sizes over normal training) across a wide range of applications (image, language, and speech) without significant accuracy loss.
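The two ingredients named in the abstract, a low-pass-filtered local gradient accumulation and a sparse (top-k style) compressor with error feedback, can be sketched as below. The filter coefficient, top-k selection, and bookkeeping are my assumptions for illustration and not the paper's exact commutative compressor.

```python
import torch

def lowpass_topk_compress(grad: torch.Tensor, residual: torch.Tensor,
                          k_ratio: float = 0.01, beta: float = 0.9):
    """Hedged sketch of sparsified gradient compression with a low-pass-filtered
    local accumulation (illustrative; not ScaleCom's exact formulation).

    residual is a persistent per-worker buffer with the same shape as grad.
    Returns (indices, values) to transmit; residual is updated in place.
    """
    # Low-pass filter the local accumulation so it tracks a smoothed gradient.
    residual.mul_(beta).add_(grad)
    flat = residual.view(-1)
    k = max(1, int(k_ratio * flat.numel()))
    idx = flat.abs().topk(k).indices          # largest-magnitude entries
    vals = flat[idx].clone()
    flat[idx] = 0.0                           # error feedback: keep the rest locally
    return idx, vals
```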

NeurIPS Conference 2020 Conference Paper

Ultra-Low Precision 4-bit Training of Deep Neural Networks

  • Xiao Sun
  • Naigang Wang
  • Chia-Yu Chen
  • Jiamin Ni
  • Ankur Agrawal
  • Xiaodong Cui
  • Swagath Venkataramani
  • Kaoutar El Maghraoui

In this paper, we propose a number of novel techniques and numerical representation formats that enable, for the very first time, the precision of training systems to be aggressively scaled from 8 bits to 4 bits. To enable this advance, we explore a novel adaptive Gradient Scaling technique (GradScale) that addresses the challenges of insufficient range and resolution in quantized gradients, and we study the impact of quantization errors observed during model training. We theoretically analyze the role of bias in gradient quantization and propose solutions that mitigate the impact of this bias on model convergence. Finally, we examine our techniques on a spectrum of deep learning models in computer vision, speech, and NLP. In combination with previously proposed solutions for 4-bit quantization of weight and activation tensors, 4-bit training shows negligible loss in accuracy across application domains while enabling significant hardware acceleration (>7X over state-of-the-art FP16 systems).
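As a deliberately simplified reading of adaptive gradient scaling, the snippet below rescales a gradient tensor so that its largest magnitude fits a 4-bit integer grid before rounding. The per-tensor max rule and symmetric grid are assumptions of this sketch and not the exact GradScale procedure.

```python
import torch

def scale_and_quantize_grad(grad: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    """Simulated low-bit gradient quantization with adaptive scaling
    (illustrative sketch; not the paper's GradScale algorithm)."""
    qmax = 2 ** (num_bits - 1) - 1                       # e.g. 7 for 4 bits
    scale = grad.abs().max().clamp(min=1e-12) / qmax     # fit the observed range
    q = torch.clamp(torch.round(grad / scale), -qmax - 1, qmax)
    return q * scale                                      # dequantize for inspection
```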

NeurIPS Conference 2019 Conference Paper

Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks

  • Xiao Sun
  • Jungwook Choi
  • Chia-Yu Chen
  • Naigang Wang
  • Swagath Venkataramani
  • Vijayalakshmi (Viji) Srinivasan
  • Xiaodong Cui
  • Wei Zhang

Reducing the numerical precision of data and computation is extremely effective in accelerating deep learning training workloads. Towards this end, 8-bit floating point representations (FP8) were recently proposed for DNN training. However, their applicability was demonstrated on only a few selected models, and significant degradation is observed when popular networks such as MobileNet and Transformer are trained using FP8. This degradation is due to the inherent difference in precision requirements between the forward and backward passes of DNN training. Using theoretical insights, we propose a hybrid FP8 (HFP8) format and a DNN end-to-end distributed training procedure. Using HFP8, we demonstrate the successful training of deep learning models across a whole spectrum of applications including image classification, object detection, language, and speech without accuracy degradation. Finally, we demonstrate that, by using the new 8-bit format, we can directly quantize a pre-trained model down to 8 bits without losing accuracy by simply fine-tuning batch normalization statistics. These novel techniques enable a new generation of 8-bit hardware that is robust for building and deploying neural network models.
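A toy fake-quantization routine makes the "different precision needs in forward vs. backward" point tangible: cast forward tensors to a format with more mantissa bits and gradients to one with more exponent bits. The 4-exponent/3-mantissa vs. 5-exponent/2-mantissa split shown in the usage comment, and the rounding and clamping details, are assumptions of this sketch rather than a faithful HFP8 implementation.

```python
import torch

def fake_quant_float(x: torch.Tensor, exp_bits: int, man_bits: int) -> torch.Tensor:
    """Round-to-nearest simulation of a low-precision float format
    (toy sketch; ignores subnormals and format-specific bias choices)."""
    # Quantization step at each value's own binade.
    exp = torch.floor(torch.log2(x.abs().clamp(min=1e-38)))
    step = torch.exp2(exp - man_bits)
    q = torch.round(x / step) * step
    # Clamp to the largest magnitude representable with `exp_bits`.
    max_exp = 2 ** (exp_bits - 1) - 1
    max_val = (2.0 - 2.0 ** (-man_bits)) * (2.0 ** max_exp)
    return torch.clamp(q, -max_val, max_val)

# Hypothetical usage: more mantissa for forward tensors, more exponent range
# for backward gradients.
# w_q = fake_quant_float(weights, exp_bits=4, man_bits=3)
# g_q = fake_quant_float(grads,   exp_bits=5, man_bits=2)
```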