Arrow Research

Author name cluster

Yan Zhong

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers
2 author rows

Possible papers

AAAI Conference 2026 Conference Paper

Exploring Category-level Articulated Object Pose Tracking on SE(3) Manifolds

  • Xianhui Meng
  • Yukang Huo
  • Li Zhang
  • Liu Liu
  • Haonan Jiang
  • Yan Zhong
  • Pingrui Zhang
  • Cewu Lu

Articulated objects are prevalent in daily life and robotic manipulation tasks. However, compared to rigid objects, pose tracking for articulated objects remains an underexplored problem due to their inherent kinematic constraints. To address this challenge, this work proposes a novel point-pair-based pose tracking framework, termed PPF-Tracker. The proposed framework first performs quasi-canonicalization of point clouds in the SE(3) Lie group space, and then models articulated objects using Point Pair Features (PPF) to predict pose voting parameters by leveraging the invariance properties of SE(3). Finally, semantic information of joint axes is incorporated to impose unified kinematic constraints across all parts of the articulated object. PPF-Tracker is systematically evaluated on both synthetic datasets and real-world scenarios, demonstrating strong generalization across diverse and challenging environments. Experimental results highlight the effectiveness and robustness of PPF-Tracker in multi-frame pose tracking of articulated objects. We believe this work can foster advances in robotics, embodied intelligence, and augmented reality.
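The framework's building block, the point pair feature, has a standard closed form: for two oriented points it is the tuple of the pair distance and three angles. Below is a minimal numpy sketch of that classic descriptor; PPF-Tracker's actual parameterization, quasi-canonicalization, and voting head may differ, so treat this as an illustration of the primitive only.

```python
import numpy as np

def point_pair_feature(p1, n1, p2, n2):
    """Classic 4D point pair feature for two oriented points:
    (||d||, angle(n1, d), angle(n2, d), angle(n1, n2))."""
    d = p2 - p1
    dist = np.linalg.norm(d)

    def angle(u, v):
        # Numerically safe angle between two vectors.
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
        return np.arccos(np.clip(cos, -1.0, 1.0))

    return np.array([dist, angle(n1, d), angle(n2, d), angle(n1, n2)])

# Example: two surface points with unit normals.
f = point_pair_feature(np.array([0., 0., 0.]), np.array([0., 0., 1.]),
                       np.array([0.1, 0., 0.]), np.array([0., 1., 0.]))
print(f)  # [0.1, 1.5708, 1.5708, 1.5708]
```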

AAAI Conference 2026 Conference Paper

MusicRec: Multi-modal Semantic-Enhanced Identifier with Collaborative Signals for Generative Recommendation

  • Yuqiu Zhao
  • Lei Shi
  • Yan Zhong
  • Feifei Kou
  • Pengfei Zhang
  • Jiwei Zhang
  • Mingying Xu
  • Yanchao Liu

Generative recommendation is an emerging paradigm that is reshaping the development of recommender systems. It aims to assign identifiers that capture richer semantic and collaborative information to items, and subsequently predict item identifiers via autoregressive generation using Large Language Models (LLMs). Existing approaches primarily tokenize item text into semantic IDs via RQ-VAE codebooks, or tokenize the features of each modality separately. However, existing tokenization methods face two major challenges: (1) Learning decoupled multi-modal features limits the quality of the semantic representation. (2) Ignoring collaborative signals from interaction history limits the comprehensiveness of identifiers. To address these limitations, we propose a multi-modal semantic-enhanced identifier with collaborative signals for generative recommendation, named MusicRec. In MusicRec, we propose a tokenization approach based on shared-specific modal fusion, enabling the generated identifiers to preserve semantic information more comprehensively from all modalities. In addition, we incorporate collaborative signals from user interactions to guide identifier generation, preserving collaborative patterns in the semantic representation space. Extensive experiments on three public datasets demonstrate that MusicRec achieves state-of-the-art performance compared to existing baseline methods.
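For context on the tokenization the abstract builds on: residual quantization (as in RQ-VAE) turns an item embedding into a short tuple of codebook indices, the "semantic ID". A toy numpy sketch with random, untrained codebooks (real ones are learned jointly with the encoder):

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Greedy residual quantization: at each level, pick the codeword
    nearest to the current residual; the chosen indices form the semantic ID."""
    residual = z.copy()
    ids = []
    for codebook in codebooks:                      # one codebook per level
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))
        ids.append(idx)
        residual = residual - codebook[idx]         # quantize what is left
    return ids, z - residual                        # ID tuple and reconstruction

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 32)) for _ in range(3)]  # 3 levels, 256 codes
item_embedding = rng.normal(size=32)
semantic_id, recon = residual_quantize(item_embedding, codebooks)
print(semantic_id)  # a 3-token identifier, e.g. [17, 203, 88]
```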

NeurIPS Conference 2025 Conference Paper

BADiff: Bandwidth Adaptive Diffusion Model

  • Xi Zhang
  • Hanwei Zhu
  • Yan Zhong
  • Jiamang Wang
  • Weisi Lin

In this work, we propose a novel framework to enable diffusion models to adapt their generation quality based on real-time network bandwidth constraints. Traditional diffusion models produce high-fidelity images by performing a fixed number of denoising steps, regardless of downstream transmission limitations. However, in practical cloud-to-device scenarios, limited bandwidth often necessitates heavy compression, leading to loss of fine textures and wasted computation. To address this, we introduce a joint end-to-end training strategy where the diffusion model is conditioned on a target quality level derived from the available bandwidth. During training, the model learns to adaptively modulate the denoising process, enabling early-stop sampling that maintains perceptual quality appropriate to the target transmission condition. Our method requires minimal architectural changes and leverages a lightweight quality embedding to guide the denoising trajectory. Experimental results demonstrate that our approach significantly improves the visual fidelity of bandwidth-adapted generations compared to naive early-stopping, offering a promising solution for efficient image delivery in bandwidth-constrained environments. Code is available at: https://github.com/xzhang9308/BADiff.
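A schematic of the sampling idea described here, with hypothetical names throughout: map the available bandwidth to a quality level, condition every denoising step on it, and stop once the implied step budget is spent. The model below is a dummy stand-in for a trained quality-conditioned denoiser, and the bandwidth-to-quality mapping is an assumption, not the paper's.

```python
import numpy as np

def sample_with_budget(model, shape, total_steps, bandwidth_kbps, rng):
    """Toy early-stop sampler: derive a quality level in [0, 1] from
    bandwidth, condition each denoising step on it, and stop early."""
    quality = min(bandwidth_kbps / 1000.0, 1.0)       # hypothetical mapping
    n_steps = max(1, int(total_steps * quality))      # early-stop budget
    x = rng.normal(size=shape)                        # start from noise
    for t in reversed(range(n_steps)):
        x = model(x, t, quality)                      # quality-conditioned step
    return x

# Dummy "denoiser" standing in for a trained, quality-conditioned model.
dummy_model = lambda x, t, q: x * 0.9
rng = np.random.default_rng(0)
img = sample_with_budget(dummy_model, (3, 64, 64), total_steps=50,
                         bandwidth_kbps=400, rng=rng)  # runs 40% of the steps
```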

AAAI Conference 2025 Conference Paper

MSSDA: Multi-Sub-Source Domain Adaptation for Diabetic Foot Neuropathy Recognition

  • Yan Zhong
  • Zhixin Yan
  • Yi Xie
  • Shibin Wu
  • Huaidong Zhang
  • Lin Shu
  • Peiru Zhou

Diabetic foot neuropathy (DFN) is a critical factor leading to diabetic foot ulcers, which are among the most common and severe complications of diabetes mellitus (DM) and are associated with high risks of amputation and mortality. Despite its significance, existing datasets do not directly derive from plantar data and lack continuous, long-term foot-specific information. To advance DFN research, we have collected a novel dataset comprising continuous plantar pressure data to recognize diabetic foot neuropathy. This dataset includes data from 94 DM patients with DFN and 41 DM patients without DFN. Moreover, traditional methods divide datasets by individuals, potentially leading to significant domain discrepancies in some feature spaces due to the absence of mid-domain data. In this paper, we propose an effective domain adaptation method to address this problem. We split the dataset based on convolutional feature statistics and select appropriate sub-source domains to enhance efficiency and avoid negative transfer. We then align the distributions of each source and target domain pair in specific feature spaces to minimize the domain gap. Comprehensive results validate the effectiveness of our method on both the newly proposed dataset for DFN recognition and an existing dataset.
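The abstract does not name the alignment objective; maximum mean discrepancy (MMD) is one standard way to "align the distributions of each source and target domain pair", shown here as a hedged illustration rather than the paper's exact loss.

```python
import numpy as np

def rbf_mmd2(X, Y, sigma=1.0):
    """Squared MMD between sample sets X and Y with an RBF kernel,
    a standard discrepancy minimized to align source/target features."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(128, 16))   # one sub-source domain
tgt = rng.normal(0.5, 1.0, size=(128, 16))   # target domain
print(rbf_mmd2(src, tgt))  # larger value = larger domain gap to close
```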

IJCAI Conference 2025 Conference Paper

Pre-defined Keypoints Promote Category-level Articulation Pose Estimation via Multi-Modal Alignment

  • Wenbo Xu
  • Li Zhang
  • Liu Liu
  • Yan Zhong
  • Haonan Jiang
  • Xue Wang
  • Rujing Wang

Articulations are essential in everyday interactions, yet traditional RGB-based pose estimation methods often struggle with issues such as lighting variations and shadows. To overcome these challenges, we propose a novel Pre-defined keypoint-based framework for category-level articulation pose estimation via multi-modal Alignment, coined PAGE. Specifically, we first propose a customized keypoint estimation method, aiming to avoid the divergent distance pattern between heuristically generated keypoints and visible points. In addition, to reduce the mutual information redundancy between point clouds and RGB images, we design a geometry-color alignment, which fuses the features after aligning the two modalities. This is followed by decoding the radius for each visible point, and applying our proposal integration scoring strategy to predict keypoints. Ultimately, the framework outputs the per-part 6D pose of the articulation. We conduct extensive experiments to evaluate PAGE across a variety of datasets, from synthetic to real-world scenarios, demonstrating its robustness and superior performance.
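Decoding a radius per visible point and integrating proposals amounts to a multilateration problem: find the keypoint whose distances to the visible points best match the decoded radii. A minimal least-squares sketch of that recovery step (not the paper's voting or scoring strategy):

```python
import numpy as np
from scipy.optimize import least_squares

def keypoint_from_radii(points, radii):
    """Recover a 3D keypoint from visible points and their decoded radii by
    multilateration: minimize sum_i (||k - p_i|| - r_i)^2 over k."""
    def residuals(k):
        return np.linalg.norm(points - k, axis=1) - radii
    k0 = points.mean(axis=0)                      # initialize at the centroid
    return least_squares(residuals, k0).x

rng = np.random.default_rng(0)
kp_true = np.array([0.2, -0.1, 0.5])
pts = rng.normal(size=(64, 3))                   # "visible" points
r = np.linalg.norm(pts - kp_true, axis=1)        # noise-free decoded radii
print(keypoint_from_radii(pts, r))               # ~ [0.2, -0.1, 0.5]
```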

AAAI Conference 2025 Conference Paper

R^2-Art: Category-Level Articulation Pose Estimation from Single RGB Image via Cascade Render Strategy

  • Li Zhang
  • Haonan Jiang
  • Yukang Huo
  • Yan Zhong
  • Jianan Wang
  • Xue Wang
  • Rujing Wang
  • Liu Liu

Human life is filled with articulated objects. Previous works for estimating the pose of category-level articulated objects rely on costly 3D point clouds or RGB-D images. In this paper, our goal is to estimate category-level articulation poses from a single RGB image, for which we propose R2-Art, a novel framework that estimates category-level Articulation poses from a single RGB image via a cascade Render strategy. Given an RGB image as input, R2-Art estimates per-part 6D poses for the articulation. Specifically, we design parallel regression branches tailored to generate camera-to-root translation and rotation. Using the predicted joint states, we perform point cloud (PC) prior transformation and deformation with a joint-centric modeling approach. For further refinement, a cascade render strategy is proposed for projecting the 3D deformed prior onto the 2D mask. Extensive experiments are provided to validate R2-Art on various datasets ranging from synthetic datasets to real-world scenarios, demonstrating the superior performance and robustness of R2-Art. We believe that this work has the potential to be applied in many fields including robotics, embodied intelligence, and augmented reality.
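A toy sketch of the general render-and-compare idea behind such refinement: project the deformed 3D prior through pinhole intrinsics into a binary mask and score it against the observed mask with IoU. The point rasterization and scoring here are simplistic stand-ins, not the paper's renderer.

```python
import numpy as np

def project_to_mask(points, K, hw):
    """Project 3D points (camera frame) with pinhole intrinsics K and
    rasterize them into a binary mask of size hw = (H, W)."""
    uvw = (K @ points.T).T
    uv = (uvw[:, :2] / uvw[:, 2:3]).round().astype(int)
    mask = np.zeros(hw, dtype=bool)
    ok = ((uv[:, 0] >= 0) & (uv[:, 0] < hw[1]) &
          (uv[:, 1] >= 0) & (uv[:, 1] < hw[0]) & (points[:, 2] > 0))
    mask[uv[ok, 1], uv[ok, 0]] = True
    return mask

def mask_iou(a, b):
    """IoU between rendered and observed masks: the signal a cascade
    refinement step could maximize over candidate poses."""
    return (a & b).sum() / max((a | b).sum(), 1)

K = np.array([[320., 0., 160.], [0., 320., 120.], [0., 0., 1.]])
rng = np.random.default_rng(0)
prior = rng.uniform(-0.1, 0.1, size=(500, 3)) + np.array([0., 0., 1.0])
rendered = project_to_mask(prior, K, (240, 320))
print(mask_iou(rendered, rendered))  # 1.0 for a perfect pose hypothesis
```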

AAAI Conference 2025 Conference Paper

SpikingSSMs: Learning Long Sequences with Sparse and Parallel Spiking State Space Models

  • Shuaijie Shen
  • Chao Wang
  • Renzhuo Huang
  • Yan Zhong
  • Qinghai Guo
  • Zhichao Lu
  • Jianguo Zhang
  • Luziwei Leng

Known as low-energy-consumption networks, spiking neural networks (SNNs) have gained considerable attention over the past decades. While SNNs are increasingly competitive with artificial neural networks (ANNs) on vision tasks, they are rarely used for long-sequence tasks, despite their intrinsic temporal dynamics. In this work, we develop spiking state space models (SpikingSSMs) for long-sequence learning by leveraging the sequence-learning abilities of state space models (SSMs). Inspired by dendritic neuron structure, we hierarchically integrate neuronal dynamics with the original SSM block while realizing sparse synaptic computation. Furthermore, to resolve the conflict between event-driven neuronal dynamics and parallel computing, we propose a lightweight surrogate dynamic network that accurately predicts the after-reset membrane potential and is compatible with learnable thresholds, enabling orders-of-magnitude acceleration in training speed compared with conventional iterative methods. On the Long Range Arena benchmark, SpikingSSM achieves performance competitive with state-of-the-art SSMs while realizing on average 90% network sparsity. On language modeling, our network significantly surpasses existing spiking large language models (spikingLLMs) on the WikiText-103 dataset with only a third of the model size, demonstrating its potential as a backbone architecture for low-computation-cost LLMs.
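The conflict between event-driven dynamics and parallel computing comes from the reset: each step's membrane potential depends on whether the previous step spiked. The sequential LIF loop below makes that dependence explicit; it is this iteration that the proposed surrogate dynamic network is said to approximate in parallel.

```python
import numpy as np

def lif_iterative(currents, tau=2.0, v_th=1.0):
    """Sequential LIF simulation: each step's membrane potential depends on
    the previous step's reset, which blocks naive parallelization over time."""
    v, spikes = 0.0, []
    for i in currents:                       # strict time-step dependence
        v = v * (1.0 - 1.0 / tau) + i        # leaky integration
        s = float(v >= v_th)                 # fire when threshold is crossed
        spikes.append(s)
        v = v * (1.0 - s)                    # hard reset after a spike
    return np.array(spikes)

rng = np.random.default_rng(0)
print(lif_iterative(rng.uniform(0.0, 1.0, size=20)))  # binary spike train
```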

ICML Conference 2025 Conference Paper

SToFM: a Multi-scale Foundation Model for Spatial Transcriptomics

  • Suyuan Zhao
  • Yizhen Luo
  • Ganbo Yang
  • Yan Zhong
  • Hao Zhou 0012
  • Zaiqing Nie

Spatial Transcriptomics (ST) technologies provide biologists with rich insights into single-cell biology by preserving the spatial context of cells. Building foundation models for ST can significantly enhance the analysis of vast and complex data sources, unlocking new perspectives on the intricacies of biological tissues. However, modeling ST data is inherently challenging due to the need to extract multi-scale information from tissue slices containing vast numbers of cells. This process requires integrating macro-scale tissue morphology, micro-scale cellular microenvironments, and gene-scale expression profiles. To address this challenge, we propose SToFM, a multi-scale Spatial Transcriptomics Foundation Model. SToFM first performs multi-scale information extraction on each ST slice to construct a set of ST sub-slices that aggregate macro-, micro-, and gene-scale information. Then an SE(2) Transformer is used to obtain high-quality cell representations from the sub-slices. Additionally, we construct SToCorpus-88M, the largest high-resolution spatial transcriptomics corpus for pretraining. SToFM achieves outstanding performance on a variety of downstream tasks, such as tissue region semantic segmentation and cell type annotation, demonstrating its comprehensive understanding of ST data through capturing and integrating multi-scale information.
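One simple way to realize "a set of ST sub-slices" is to cluster cell coordinates into spatially coherent groups. The KMeans sketch below is an assumed stand-in for the paper's multi-scale extraction step, not its actual procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def make_sub_slices(cell_xy, n_sub_slices=16):
    """Partition one ST slice into spatially coherent sub-slices by
    clustering cell coordinates; each sub-slice is a set of cell indices."""
    labels = KMeans(n_clusters=n_sub_slices, n_init=10,
                    random_state=0).fit_predict(cell_xy)
    return [np.where(labels == c)[0] for c in range(n_sub_slices)]

rng = np.random.default_rng(0)
coords = rng.uniform(0, 1000, size=(5000, 2))   # 5k cells on one slice
sub_slices = make_sub_slices(coords)
print(len(sub_slices), sub_slices[0][:5])       # 16 groups of cell indices
```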

NeurIPS Conference 2025 Conference Paper

Theory-Driven Label-Specific Representation for Incomplete Multi-View Multi-Label Learning

  • Quanjiang Li
  • Tianxiang Xu
  • Tingjin Luo
  • Yan Zhong
  • Yang Li
  • Yiyun Zhou
  • Chenping Hou

Multi-view multi-label learning typically suffers from dual data incompleteness due to limitations in feature storage and annotation costs. The interplay of heterogeneous features, numerous labels, and missing information significantly degrades model performance. To tackle these complex yet highly practical challenges, we propose a Theory-Driven Label-Specific Representation (TDLSR) framework. By constructing the view-specific sample topology and a prototype association graph, we develop a proximity-aware imputation mechanism while deriving class representatives that capture label-correlation semantics. To obtain semantically distinct view representations, we introduce principles of information shift, interaction, and orthogonality, which promote the disentanglement of representation information and mitigate message distortion and redundancy. Besides, label semantic-guided feature learning is employed to identify discriminative shared and specific representations and refine the label preference across views. Moreover, we theoretically investigate the characteristics of the representation learning and the generalization performance. Finally, extensive experiments on public datasets and real-world applications validate the effectiveness of TDLSR.
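As a hedged illustration of proximity-aware imputation: fill a sample's missing view using its nearest neighbors found in an observed view. TDLSR's actual mechanism (sample topology plus a prototype association graph) is richer than this kNN sketch.

```python
import numpy as np

def knn_impute_view(observed_view, target_view, missing_mask, k=5):
    """Impute missing rows of target_view using neighbors found in
    observed_view: a simple proximity-aware fill, not TDLSR's exact scheme."""
    filled = target_view.copy()
    have = ~missing_mask
    for i in np.where(missing_mask)[0]:
        d = np.linalg.norm(observed_view[have] - observed_view[i], axis=1)
        nn = np.where(have)[0][np.argsort(d)[:k]]   # k nearest observed rows
        filled[i] = target_view[nn].mean(axis=0)
    return filled

rng = np.random.default_rng(0)
v1 = rng.normal(size=(100, 8))                      # fully observed view
v2 = rng.normal(size=(100, 6))                      # view with missing rows
miss = rng.random(100) < 0.2                        # 20% of rows missing in v2
print(knn_impute_view(v1, v2, miss).shape)          # (100, 6)
```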

IJCAI Conference 2024 Conference Paper

Large Language Model-Enhanced Algorithm Selection: Towards Comprehensive Algorithm Representation

  • Xingyu Wu
  • Yan Zhong
  • Jibin Wu
  • Bingbing Jiang
  • Kay Chen Tan

Algorithm selection, a critical process in automated machine learning, aims to identify the most suitable algorithm for solving a specific problem prior to execution. Mainstream algorithm selection techniques rely heavily on problem features, while the role of algorithm features remains largely unexplored. Due to the intrinsic complexity of algorithms, effective methods for universally extracting algorithm information are lacking. This paper takes a significant step towards bridging this gap by introducing Large Language Models (LLMs) into algorithm selection for the first time. By comprehending the code text, the LLM not only captures the structural and semantic aspects of the algorithm but also demonstrates contextual awareness and an understanding of library functions. The high-dimensional algorithm representation extracted by the LLM, after passing through a feature selection module, is combined with the problem representation and fed to a similarity calculation module. The selected algorithm is determined by the matching degree between the given problem and the candidate algorithms. Extensive experiments validate the performance superiority of the proposed model and the efficacy of each key module. Furthermore, we present a theoretical upper bound on model complexity, showcasing the influence of the algorithm representation and feature selection modules. This provides valuable theoretical guidance for the practical implementation of our method.
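A schematic of the matching stage as described: apply a feature-selection mask to the LLM-extracted algorithm representations and the problem representation, then pick the algorithm with the highest cosine similarity. The random vectors below stand in for real LLM embeddings, and the mask is arbitrary rather than a learned selector.

```python
import numpy as np

def select_algorithm(problem_vec, algo_vecs, feature_mask):
    """Pick the algorithm whose (feature-selected) representation is most
    similar to the problem representation, by cosine similarity."""
    p = problem_vec[feature_mask]
    scores = []
    for a in algo_vecs:
        v = a[feature_mask]
        scores.append(p @ v / (np.linalg.norm(p) * np.linalg.norm(v) + 1e-12))
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(0)
dim = 64
problem = rng.normal(size=dim)                    # problem representation
algos = [rng.normal(size=dim) for _ in range(5)]  # stand-ins for LLM code embeddings
mask = rng.random(dim) < 0.5                      # toy feature-selection mask
best, scores = select_algorithm(problem, algos, mask)
print(best)                                       # index of the matched algorithm
```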

NeurIPS Conference 2024 Conference Paper

Rethinking 3D Convolution in $\ell_p$-norm Space

  • Li Zhang
  • Yan Zhong
  • Jianan Wang
  • Zhe Min
  • Rujing Wang
  • Liu Liu

Convolution is a fundamental operation in 3D backbones. However, under certain conditions, the feature extraction ability of traditional convolution methods may be weakened. In this paper, we introduce a new convolution method based on the $\ell_p$-norm. For theoretical support, we prove the universal approximation theorem for $\ell_p$-norm based convolution, and analyze the robustness and feasibility of $\ell_p$-norms in 3D point cloud tasks. Concretely, $\ell_{\infty}$-norm based convolution is prone to feature loss, $\ell_2$-norm based convolution is essentially a linear transformation of traditional convolution, and $\ell_1$-norm based convolution is an economical and effective feature extractor. We propose customized optimization strategies to accelerate the training of $\ell_1$-norm based networks and enhance their performance. Besides, a theoretical guarantee for convergence is given via a regret argument. We apply our methods to classic networks and conduct related experiments. Experimental results indicate that our approach exhibits performance competitive with traditional CNNs, with lower energy consumption and instruction latency.
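The core substitution is replacing the inner product of standard convolution with a negative $\ell_p$ distance between patch and kernel. A 2D numpy sketch of the $\ell_1$ case follows; the paper targets 3D point-cloud convolutions, but the substitution itself is the same.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def l1_conv2d(x, w):
    """2D 'convolution' where each patch response is the negative l1
    distance to the kernel, instead of the usual inner product."""
    kh, kw = w.shape
    patches = sliding_window_view(x, (kh, kw))      # (H-kh+1, W-kw+1, kh, kw)
    return -np.abs(patches - w).sum(axis=(-1, -2))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
w = rng.normal(size=(3, 3))
out = l1_conv2d(x, w)
print(out.shape)  # (6, 6); response is maximal where the patch matches w
```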