Arrow Research search

Author name cluster

Yue Sun

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

17 papers
2 author rows

Possible papers

17

AAAI Conference 2026 Conference Paper

EndoIR: Degradation-Agnostic All-in-One Endoscopic Image Restoration via Noise-Aware Routing Diffusion

  • Tong Chen
  • Xinyu Ma
  • Long Bai
  • Wenyang Wang
  • Yue Sun
  • Luping Zhou

Endoscopic images often suffer from diverse and co-occurring degradations such as low lighting, smoke, and bleeding, which obscure critical clinical details. Existing restoration methods are typically task-specific and often require prior knowledge of the degradation type, limiting their robustness in real-world clinical use. We propose EndoIR, an all-in-one, degradation-agnostic diffusion-based framework that restores multiple degradation types using a single model. EndoIR introduces a Dual-Domain Prompter that extracts joint spatial–frequency features, coupled with an adaptive embedding that encodes both shared and task-specific cues as conditioning for denoising. To mitigate feature confusion in conventional concatenation-based conditioning, we design a Dual-Stream Diffusion architecture that processes clean and degraded inputs separately, with a Rectified Fusion Block integrating them in a structured, degradation-aware manner. Furthermore, a Noise-Aware Routing Block improves efficiency by dynamically selecting only noise-relevant features during denoising. Experiments on the SegSTRONG-C and CEC datasets demonstrate that EndoIR achieves state-of-the-art performance across multiple degradation scenarios while using fewer parameters than strong baselines, and downstream segmentation experiments confirm its clinical utility.
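
The abstract does not detail the routing mechanism itself; purely as an illustration of the general idea of noise-aware top-k feature routing (the linear gate and all names below are hypothetical, not the paper's module), a minimal sketch:

```python
def noise_aware_route(features, noise_level, gate_w, k=2):
    """Toy top-k routing: a linear gate scores each feature channel from
    the current noise level, and only the k best-scoring channels pass
    through; the rest are zeroed, which is what saves compute in a real
    denoising network."""
    # One (bias, slope) pair per channel: score = w0 + w1 * noise_level.
    scores = [w0 + w1 * noise_level for (w0, w1) in gate_w]
    # Stable sort, so tied channels keep their original order.
    keep = set(sorted(range(len(features)), key=lambda i: scores[i], reverse=True)[:k])
    return [f if i in keep else 0.0 for i, f in enumerate(features)]
```

Because the gate depends on the noise level, different channels survive at different points in the denoising trajectory.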

AAAI Conference 2026 Conference Paper

HiFi-Mesh: High-Fidelity Efficient 3D Mesh Generation via Compact Autoregressive Dependence

  • Yanfeng Li
  • Tao Tan
  • Qinquan Gao
  • Zhiwen Cao
  • Xiaohong Liu
  • Yue Sun

High-fidelity 3D meshes can be tokenized into one-dimensional (1D) sequences and directly modeled using autoregressive approaches over faces and vertices. However, existing methods suffer from insufficient resource utilization, resulting in slow inference and the ability to handle only small-scale sequences, which severely constrains the expressible structural details. We introduce the Latent Autoregressive Network (LANE), which incorporates compact autoregressive dependencies in the generation process, achieving a 6× improvement in maximum generatable sequence length compared to existing methods. To further accelerate inference, we propose the Adaptive Computation Graph Reconfiguration (AdaGraph) strategy, which effectively overcomes the efficiency bottleneck of traditional serial inference through spatiotemporal decoupling in the generation process. Experimental validation demonstrates that LANE achieves superior performance across generation speed, structural detail, and geometric consistency, providing an effective solution for high-quality 3D mesh generation.

JBHI Journal 2026 Journal Article

M³SegNet: A Multi-Modal and Multi-Branch Framework for Nasopharyngeal Carcinoma Segmentation in Radiotherapy Planning

  • Junqiang Ma
  • Luyi Han
  • Henry H. Y. Tong
  • DengQiang Jia
  • Hui Xie
  • Anne W. M. Lee
  • Hing Ming Hung
  • Tao Tan

Accurate and simultaneous labeling of multiple structures, including gross tumor volumes, clinical target volumes, and organs at risk, is a fundamental multi-task requirement for radiotherapy planning in nasopharyngeal carcinoma. However, conventional manual labeling is labor-intensive and suffers from substantial inter-observer variability. This variability poses a significant challenge to the multi-modal interpretation of CT and MRI scans. Against this backdrop, automated approaches, particularly multi-modal and multi-task learning, are promising solutions. However, their clinical adoption is limited by three urgent needs: attention mechanisms that fuse multi-modal information at both local and global views, explicit incorporation of anatomical priors to regularize predictions, and a unified framework that enables concurrent segmentation of all desired structures. To overcome these limitations, we propose M³SegNet, a novel multi-modal and multi-branch framework that concurrently performs all clinically relevant segmentation tasks, integrating feature fusion and anatomical guidance. Our primary contributions are threefold. First, we introduce the Synergistic Global-Local Attention that extracts informative features from various imaging modalities (CT, T1-weighted, T2-weighted, and T1 contrast). Second, we propose an Anatomy-Aware Hierarchical Learning strategy that uses OAR spatial information to guide tumor segmentation. We also integrate Random Modality Dropout to enhance robustness against missing modalities. We validated M³SegNet on an internal 257-patient NPC dataset and confirmed its generalizability on three external datasets. In experiments, our framework significantly outperformed state-of-the-art methods. By providing a mechanism to leverage multi-modal information and anatomical priors, our M³SegNet offers a reliable, automated, and clinically translatable solution for NPC radiotherapy planning.

AAAI Conference 2026 Conference Paper

SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition

  • Qilang Ye
  • Yu Zhou
  • Lian He
  • Jie Zhang
  • Xuanming Guo
  • Jiayu Zhang
  • Mingkui Tan
  • Weicheng Xie

Large Language Models (LLMs) hold rich implicit knowledge and powerful transferability. In this paper, we explore the combination of LLMs with the human skeleton to perform action classification and description. However, when treating the LLM as a recognizer, two questions arise: 1) How can LLMs understand the skeleton? 2) How can LLMs distinguish among actions? To address these problems, we introduce a novel paradigm named learning Skeleton representation with visual-motion knowledge for Action Recognition (SUGAR). In our pipeline, we first utilize off-the-shelf large-scale video models as a knowledge base to generate visual and motion information related to actions. Then, we propose to supervise skeleton learning through this prior knowledge to yield discrete representations. Finally, we use the LLM with untouched pre-training weights to understand these representations and generate the desired action targets and descriptions. Notably, we present a Temporal Query Projection (TQP) module to continuously model skeleton signals with long sequences. Experiments on several skeleton-based action classification benchmarks demonstrate the efficacy of our SUGAR. Moreover, experiments in zero-shot scenarios show that SUGAR is more versatile than linear-based methods.
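
The abstract states that skeleton features are supervised into discrete representations for a frozen LLM but does not specify the tokenizer; one standard way to discretize continuous features is nearest-codebook (VQ-style) quantization. A minimal sketch under that assumption (the codebook and function names are hypothetical):

```python
def quantize(features, codebook):
    """Map each continuous feature vector to the index of its nearest
    codebook entry (squared Euclidean distance), yielding the discrete
    tokens that a frozen language model could consume as pseudo-words."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda i: sq_dist(f, codebook[i]))
            for f in features]
```

In a VQ setup the codebook itself is learned jointly with the encoder; here it is treated as given for clarity.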

JBHI Journal 2025 Journal Article

3MT-Net: A Multi-Modal Multi-Task Model for Breast Cancer and Pathological Subtype Classification Based on a Multicenter Study

  • Yaofei Duan
  • Patrick Cheong-Iao Pang
  • Ping He
  • Rongsheng Wang
  • Yue Sun
  • Chuntao Liu
  • Xiaorong Zhang
  • Xirong Yuan

Breast cancer poses a significant threat to women's health, and ultrasound plays a critical role in the assessment of breast lesions. This study introduces a prospective deep learning architecture, termed the “Multi-modal Multi-task Network” (3MT-Net), which integrates clinical data with B-mode and color Doppler ultrasound images. Specifically, an AM-CapsNet is employed to extract key features from ultrasound images, while a cascaded cross-attention mechanism is utilized to fuse clinical data. Moreover, an ensemble learning approach with an optimization algorithm is adopted to dynamically assign weights to different modalities, accommodating both high-dimensional and low-dimensional data. The 3MT-Net performs binary classification of benign versus malignant lesions and further classifies the pathological subtypes. Data were retrospectively collected from nine medical centers to ensure the broad applicability of the 3MT-Net. Two separate test sets were created and extensive experiments were conducted. Comparative analyses demonstrated that the 3MT-Net outperforms the industry-standard computer-aided detection product S-Detect by 1.4% to 3.8% in AUC.

JBHI Journal 2025 Journal Article

BRPDNet: A BioRegion Prompt Distillation Network for Physiological Monitoring

  • Zhengxuan Chen
  • Bin Huang
  • Kangyang Cao
  • Tao Tan
  • Bingsheng Huang
  • Chan-Tong Lam
  • Yue Sun

Physiological signal extraction from video data is challenging in dynamic and occluded environments, requiring both accuracy and real-time performance. Existing methods struggle to balance accuracy with model efficiency, particularly under partial facial occlusion or redundant signals. We propose BRPDNet, a novel framework for efficient physiological signal extraction that includes a BioRegion Prompt module for adaptive convolution and a Hyper Distillation module to reduce signal redundancy, ensuring high accuracy and robustness, especially in dynamic and occluded environments. Additionally, the teacher-student network structure enhances the model's adaptability to occlusions and reduces computational complexity without relying on explicit segmentation. Experimental results show that BRPDNet outperforms state-of-the-art models in accuracy, robustness, and efficiency across multiple datasets. For instance, BRPDNet achieves a Mean Absolute Error (MAE) of 1.55 beats per minute (bpm) and a Pearson Correlation Coefficient (PCC) of 0.76 on the PURE and UBFC-rPPG datasets with fewer parameters than existing models, ensuring efficient real-time performance.

JBHI Journal 2025 Journal Article

HRMamba: Fusing Luminance Information for Remote Physiological Measurement in Varied Lighting Conditions

  • Kaiwen Yang
  • Nuoer Long
  • Wei Ke
  • Chan-Tong Lam
  • Tao Tan
  • Zitong Yu
  • Yue Sun

Camera-based photoplethysmography (cbPPG) represents a non-invasive technique for capturing physiological parameters through facial videos, enabling the extraction of vital signs such as heart rate, respiration rate, and blood oxygen saturation without direct physical contact. Existing deep learning methods face two core challenges when dealing with cbPPG: firstly, extracting weak PPG signals from video segments with large spatial and temporal redundancy and understanding their periodic patterns in long contexts; secondly, accurately extracting PPG signals in complex lighting environments, especially in low-light conditions. To address these issues, this paper proposes an end-to-end method based on Mamba, named HRMamba. This method employs a temporal-difference Mamba to process temporal signals and combines a bidirectional state space to enable Mamba to robustly understand the scene and learn the periodic patterns of PPG. Furthermore, a luminance post-processing module is designed to extract luminance information from the video, without enhancing lighting or altering the original video data, and embed it into the PPG signal. Experimental results demonstrate that HRMamba achieves state-of-the-art performance, and the designed luminance post-processing module can be applied in various lighting environments, significantly enhancing performance in dark environments without degrading performance in normal-light scenes.

JBHI Journal 2025 Journal Article

UniMRISegNet: Universal 3D Network for Various Organs and Cancers Segmentation on Multi-Sequence MRI

  • Zhuoneng Zhang
  • Luyi Han
  • Tianyu Zhang
  • Zehui Lin
  • Qinquan Gao
  • Tong Tong
  • Yue Sun
  • Tao Tan

Three-dimensional organ and cancer segmentation based on multi-sequence MRI is crucial for assisting clinical diagnosis. However, current automated segmentation methods often focus on specific sequences, specific organs, and specific cancers, i.e., they lack generality. To address this issue, we propose a universal segmentation network for multi-sequence MRI (UniMRISegNet) that can segment multiple organs and cancers. UniMRISegNet features a shared encoder-decoder architecture equipped with contextual prompt generation (CPG) and prompt-conditioned dynamic convolution (PCDC) modules. The CPG module encodes sequence-specific, position-specific, and organ/cancer-specific text prompts as prior information to inform UniMRISegNet about the specific task to be executed. The PCDC module can adaptively generate model weights based on the assigned prompts, enhancing the segmentation capabilities of UniMRISegNet for specific tasks. To mitigate discrepancies between different sequences of the same organ and capture similarities between related sequences, we design a novel loss function called Semantic-Aware Cosine Similarity Loss (SACSL), which integrates the cosine similarity of text embeddings to reconcile discrepancies and similarities between MRI sequences of the same organ. We created a large-scale annotated multi-sequence, multi-organ, and multi-cancer segmentation workflow (MSOCS), and demonstrated that our UniMRISegNet outperforms other universal networks and single-task networks on MSOCS. Furthermore, the universal weights from MSOCS can be transferred to never-before-seen downstream tasks, achieving superior performance compared to training from scratch.
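
As a hedged illustration of the prompt-conditioned dynamic convolution idea (a toy 1-D version, not the paper's actual PCDC module; the linear weight generator and all names are hypothetical), a short sketch:

```python
def dynamic_conv1d(signal, prompt, proj, k=3):
    """Prompt-conditioned dynamic convolution, 1-D toy version: a linear
    'weight generator' maps the prompt embedding to the convolution
    kernel, so the same layer applies different weights per task."""
    # kernel[j] = sum_i prompt[i] * proj[i][j] -- the generated weights.
    kernel = [sum(p * row[j] for p, row in zip(prompt, proj)) for j in range(k)]
    pad = k // 2
    # Zero padding keeps the output the same length as the input.
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(signal))]
```

With a one-hot prompt, the generated kernel is simply one row of the projection, so each task effectively selects its own filter.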

IJCAI Conference 2025 Conference Paper

Unveiling the Power of Noise Priors: Enhancing Diffusion Models for Mobile Traffic Prediction

  • Zhi Sheng
  • Daisy Yuan
  • Jingtao Ding
  • Qi Yan
  • Xi Zheng
  • Yue Sun
  • Yong Li

Accurate prediction of mobile traffic, i.e., network traffic from cellular base stations, is crucial for optimizing network performance and supporting urban development. However, the non-stationary nature of mobile traffic, driven by human activity and environmental changes, leads to both regular patterns and abrupt variations. Diffusion models excel at capturing such complex temporal dynamics due to their ability to capture the inherent uncertainties. Most existing approaches prioritize designing novel denoising networks but often neglect the critical role of noise itself, potentially leading to sub-optimal performance. In this paper, we introduce a novel perspective by emphasizing the role of noise in the denoising process. Our analysis reveals that noise fundamentally shapes mobile traffic predictions, exhibiting distinct and consistent patterns. We propose NPDiff, a framework that decomposes noise into prior and residual components, with the prior derived from data dynamics, enhancing the model's ability to capture both regular and abrupt variations. NPDiff can seamlessly integrate with various diffusion-based prediction models, delivering predictions that are effective, efficient, and robust. Extensive experiments demonstrate that it achieves superior performance with an improvement of over 30%, offering a new perspective on leveraging diffusion models in this domain. We provide code and data at https://github.com/tsinghua-fib-lab/NPDiff.
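
NPDiff's exact construction is in the linked repository; purely as an illustration of the prior-plus-residual idea, here is a minimal sketch in which the noise prior is derived from the normalized periodic mean of the historical series (the blending weight `alpha` and all names are hypothetical, not the paper's formulation):

```python
import math

def decompose_noise(noise, history, period, alpha=0.7):
    """Split a noise sample into a data-driven prior and a residual.
    The prior repeats the normalized periodic mean of the history (one
    toy way to encode the 'regular patterns' in mobile traffic); the
    residual is whatever the prior does not explain."""
    # Per-phase mean over the history: one value per position in the cycle.
    pattern = [sum(history[i::period]) / len(history[i::period]) for i in range(period)]
    # Normalize to zero mean / unit scale so the prior lives on the same
    # scale as the Gaussian noise used during diffusion.
    mu = sum(pattern) / period
    sd = math.sqrt(sum((v - mu) ** 2 for v in pattern) / period) or 1.0
    prior = [alpha * (pattern[t % period] - mu) / sd for t in range(len(noise))]
    residual = [n - p for n, p in zip(noise, prior)]
    return prior, residual
```

By construction prior + residual reconstructs the original noise exactly, so the decomposition can be dropped into any diffusion sampler without changing its marginal behavior at `alpha = 0`.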

ECAI Conference 2024 Conference Paper

Diversity-Enhanced Learning for Unsupervised Syntactically Controlled Paraphrase Generation

  • Shaojuan Wu
  • Jitong Li
  • Yue Sun
  • Xiaowang Zhang
  • Zhiyong Feng 0002

Syntactically controlled paraphrase generation aims to produce diverse sentences that preserve the semantics of the given original sentence while conforming to a target syntactic structure. A key opportunity to enhance diversity is to make word substitutions during rephrasing under syntactic control. Existing unsupervised methods have made great progress in syntactic control, but their paraphrases rarely contain substitutions due to limitations of the training data. In this paper, we propose a Diversity syntactically controlled Paraphrase generation framework (DiPara), in which a novel training strategy is designed to obtain semantically faithful sentences while using the given sentence as the training target. As substituted words vary the syntactic structure around them, we propose a phrase-aware attention mechanism to capture the syntactic structure associated with the current word; to support this, a linearized triple sequence is introduced to represent the structure on its own. Experimental results on two datasets show that DiPara outperforms strong baselines; in particular, diversity (Self-BLEU4) is improved by 10.18% on ParaNMT-Small.

NeurIPS Conference 2021 Conference Paper

Towards Sample-efficient Overparameterized Meta-learning

  • Yue Sun
  • Adhyyan Narang
  • Ibrahim Gulluk
  • Samet Oymak
  • Maryam Fazel

An overarching goal in machine learning is to build a generalizable model with few samples. To this end, overparameterization has been the subject of immense interest to explain the generalization ability of deep nets even when the size of the dataset is smaller than that of the model. While the prior literature focuses on the classical supervised setting, this paper aims to demystify overparameterization for meta-learning. Here we have a sequence of linear-regression tasks and we ask: (1) Given earlier tasks, what is the optimal linear representation of features for a new downstream task? and (2) How many samples do we need to build this representation? This work shows that surprisingly, overparameterization arises as a natural answer to these fundamental meta-learning questions. Specifically, for (1), we first show that learning the optimal representation coincides with the problem of designing a task-aware regularization to promote inductive bias. We leverage this inductive bias to explain how the downstream task actually benefits from overparameterization, in contrast to prior works on few-shot learning. For (2), we develop a theory to explain how feature covariance can implicitly help reduce the sample complexity well below the degrees of freedom and lead to small estimation error. We then integrate these findings to obtain an overall performance guarantee for our meta-learning algorithm. Numerical experiments on real and synthetic data verify our insights on overparameterized meta-learning.
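
As a toy instance of the covariance-as-inductive-bias idea described above (my own illustration, not the paper's estimator: a ridge penalty weighted by per-feature prior variance learned from earlier tasks, solved by plain gradient descent):

```python
def task_aware_ridge(X, y, prior_var, lam=1.0, lr=0.005, steps=5000):
    """Linear regression with a task-aware diagonal regularizer:
        minimize ||Xw - y||^2 + lam * sum_j w_j^2 / prior_var[j].
    Directions with low variance across earlier tasks (prior_var must be
    positive) are penalized harder, steering the scarce-data solution
    toward the shared representation."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        resid = [sum(X[i][j] * w[j] for j in range(d)) - y[i] for i in range(n)]
        grad = [2.0 * sum(resid[i] * X[i][j] for i in range(n))
                + 2.0 * lam * w[j] / prior_var[j] for j in range(d)]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w
```

With a single sample and two perfectly correlated features, the fit loads almost entirely on the feature that varied across earlier tasks, which is the inductive-bias effect the abstract describes.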

JBHI Journal 2021 Journal Article

Using BI-RADS Stratifications as Auxiliary Information for Breast Masses Classification in Ultrasound Images

  • Jie Xing
  • Chao Chen
  • Qinyang Lu
  • Xun Cai
  • Aijun Yu
  • Yi Xu
  • Xiaoling Xia
  • Yue Sun

Breast Ultrasound (BUS) imaging has been recognized as an essential imaging modality for breast mass classification in China. Current deep learning (DL) based solutions for BUS classification feed ultrasound (US) images into deep convolutional neural networks (CNNs) to learn a hierarchical combination of features for discriminating malignant and benign masses. One existing problem in DL-based BUS classification is the lack of spatial and channel-wise feature weighting, which inevitably allows interference from redundant features and lowers sensitivity. In this study, we aim to incorporate the instructive information provided by the Breast Imaging Reporting and Data System (BI-RADS) within DL-based classification. A novel DL-based BI-RADS Vector-Attention Network (BVA Net), trained with both texture information and decoded information from BI-RADS stratifications, is proposed for the task. Three baseline models, pre-trained DenseNet-121, ResNet-50, and Residual-Attention Network (RA Net), were included for comparison. Experiments were conducted on a large-scale private main dataset and two public datasets, UDIAT and BUSI. On the main dataset, BVA Net outperformed the other models in terms of AUC (area under the receiver operating characteristic curve, 0.908), ACC (accuracy, 0.865), sensitivity (0.812), and precision (0.795). BVA Net also achieved high AUC (0.87 and 0.882) and ACC (0.859 and 0.843) on UDIAT and BUSI, respectively. Moreover, we propose an integrated classification method that combines BVA Net binary classification with BI-RADS stratification estimation. The integrated classification helped improve overall sensitivity while maintaining a high specificity.

JBHI Journal 2020 Journal Article

Adaptive-Guided-Coupling-Probability Level Set for Retinal Layer Segmentation

  • Yue Sun
  • Sijie Niu
  • Xizhan Gao
  • Jie Su
  • Jiwen Dong
  • Yuehui Chen
  • Li Wang

Quantitative assessment of retinal layer thickness in spectral-domain optical coherence tomography (SD-OCT) images is vital for clinicians to determine the degree of ophthalmic lesions. However, due to complex retinal tissues, high levels of speckle noise, and low intensity contrast, accurately recognizing the retinal layer structure remains a challenge. To overcome this problem, this paper proposes an adaptive-guided-coupling-probability level set method for retinal layer segmentation in SD-OCT images. Specifically, based on Bayes's theorem, each voxel's probability representation is composed of two probability terms in our method. The first term is constructed as a neighborhood Gaussian fitting distribution to characterize intensity information for each intra-retinal layer. The second is a boundary probability map generated by combining anatomical priors and adaptive thickness information to ensure that surfaces evolve within a proper range. Then, the voxel probability representation is introduced into the proposed segmentation framework, based on a coupling-probability level set, to detect layer boundaries. A total of 1792 retinal B-scan images from 4 SD-OCT cubes of healthy eyes, 5 cubes of abnormal eyes with central serous chorioretinopathy, and 5 cubes of abnormal eyes with age-related macular disease are used to evaluate the proposed method. The experiments demonstrate that the segmentation results obtained by the proposed method show good consistency with the ground truth, and that the proposed method outperforms six other methods in layer segmentation of uneven retinal SD-OCT images.

NeurIPS Conference 2019 Conference Paper

Escaping from saddle points on Riemannian manifolds

  • Yue Sun
  • Nicolas Flammarion
  • Maryam Fazel

We consider minimizing a nonconvex, smooth function $f$ on a Riemannian manifold $\mathcal{M}$. We show that a perturbed version of the gradient descent algorithm converges to a second-order stationary point for this problem (and hence is able to escape saddle points on the manifold). While the unconstrained problem is well-studied, our result is the first to prove such a rate for nonconvex, manifold-constrained problems. The rate of convergence depends as $1/\epsilon^2$ on the accuracy $\epsilon$, which matches a rate known only for unconstrained smooth minimization. The convergence rate also has a polynomial dependence on the parameters denoting the curvature of the manifold and the smoothness of the function.
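
A hedged Euclidean sketch of the perturbation idea (taking $\mathcal{M} = \mathbb{R}^n$ so the exponential map reduces to vector addition; the constants and the escape test below are simplified illustrations, not the paper's algorithm or guarantees):

```python
import random

def perturbed_gd(grad, x, lr=0.01, eps=1e-3, radius=0.05, rounds=10, inner=3000, seed=0):
    """Gradient descent that, whenever the gradient is tiny (a possible
    saddle), injects a small random perturbation and descends again.
    If the perturbation fails to move us far, the point is declared an
    approximate second-order stationary point and returned."""
    rng = random.Random(seed)

    def descend(z):
        # Plain gradient descent until the gradient is small.
        for _ in range(inner):
            g = grad(z)
            if max(abs(v) for v in g) < eps:
                break
            z = [zi - lr * gi for zi, gi in zip(z, g)]
        return z

    x = descend(x)
    for _ in range(rounds):
        # Small gradient: minimum or saddle? Perturb in a small ball and retry.
        x_try = descend([xi + rng.uniform(-radius, radius) for xi in x])
        if all(abs(a - b) < 10 * radius for a, b in zip(x, x_try)):
            return x_try  # perturbation did not escape: likely a local minimum
        x = x_try  # escaped a saddle; continue from the new basin
    return x
```

On $f(x, y) = x^2 + y^4 - y^2$, which has a strict saddle at the origin, plain gradient descent started at $(0, 0)$ never moves, while the perturbed version reaches one of the minima at $y = \pm 1/\sqrt{2}$.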

IROS Conference 2018 Conference Paper

Unmanned Aerial Auger for Underground Sensor Installation

  • Yue Sun
  • Adam Plowcha
  • Mark Nail
  • Sebastian G. Elbaum
  • Benjamin Terry
  • Carrick Detweiler

Using an Unmanned Aerial System (UAS) to autonomously deploy soil sensors enables their installation in otherwise hard-to-access locations. In this paper, we present a system that integrates a UAS and a digging mechanism which can carry, secure, and install a small sensor into soil effectively and efficiently. The integrated system includes 1) a low-profile, lightweight, inexpensive auger mechanism, 2) a sensor carrying and deploying mechanism with low power consumption, and 3) sensors and software that control and evaluate auger performance during digging. When tested on a suite of target soils at a target depth of 120 mm, the system achieved a success rate of 100% in indoor tests and 92.5% outdoors, verifying the potential of the approach.