Arrow Research search

Author name cluster

Lei Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

42 papers
2 author rows

Possible papers (42)

JBHI Journal 2026 Journal Article

A Segmentation-Guided Feature Alignment and Fusion Network for Glioma IDH Genotyping

  • Minghui Chen
  • Guohua Zhao
  • Lei Yang
  • Haowen Zhu
  • Hongwei Xu
  • Huiqin Jiang
  • Ling Ma

Isocitrate dehydrogenase (IDH) is a pivotal molecular marker for glioma diagnosis, prognosis, and treatment planning. Multi-modal deep learning methods, which integrate features from multiple magnetic resonance imaging (MRI) sequences, have become a powerful solution for non-invasive IDH genotyping. However, existing methods still have limitations in feature extraction and fusion, which constrains their robustness. In this work, we propose a novel segmentation-guided feature alignment and fusion network (SFAF-Net) for glioma IDH genotyping, with three key innovations: 1) The Segmentation-guided Feature Alignment (SFA) module leverages tumor segmentation supervision to facilitate cross-modal feature alignment; 2) The Redundancy-Attenuated Fusion (RAF) module implements similarity-based selective fusion of modality pairs to reduce feature redundancy; 3) A randomized modality dropout mechanism within RAF enhances model robustness against input variations. Comprehensive experiments conducted on public and private datasets demonstrate that SFAF-Net outperforms state-of-the-art methods across diverse MRI sequences. Moreover, SFAF-Net supports an arbitrary number of input sequences, enabling flexible adaptation to diverse clinical scanning protocols in personalized diagnosis.
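
Of the three components, the randomized modality dropout inside RAF is the most self-contained. Below is a minimal sketch of such a dropout step under the assumption that each MRI sequence contributes one feature map; the function and tensor names are illustrative, not from the paper's code.

```python
import torch

def modality_dropout(features, p_drop=0.3, training=True):
    """Randomly zero out whole modality feature maps, keeping at least one.

    features: list of tensors, one per MRI sequence, each (B, C, H, W).
    A generic sketch of randomized modality dropout, not the exact
    RAF-module logic from the paper.
    """
    if not training:
        return features
    n = len(features)
    # Sample a keep/drop mask over modalities; force at least one survivor.
    mask = torch.rand(n) > p_drop
    if not mask.any():
        mask[torch.randint(n, (1,))] = True
    return [f if keep else torch.zeros_like(f) for f, keep in zip(features, mask)]

# Toy usage: four MRI sequences (e.g., T1, T1c, T2, FLAIR) as feature maps.
feats = [torch.randn(2, 16, 8, 8) for _ in range(4)]
out = modality_dropout(feats)
print([bool(f.abs().sum() > 0) for f in out])
```

Training with whole-modality dropout of this kind is plausibly also what lets the network accept an arbitrary subset of sequences at inference time, as the abstract claims.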

AAAI Conference 2026 Conference Paper

Cross-Modal Unlearning via Influential Neuron Path Editing in Multimodal Large Language Models

  • Kunhao Li
  • Wenhao Li
  • Di Wu
  • Lei Yang
  • Jun Bai
  • Ju Jia
  • Jason Xue

Multimodal Large Language Models (MLLMs) extend foundation models to real-world applications by integrating inputs such as text and vision. However, their broad knowledge capacity raises growing concerns about privacy leakage, toxicity mitigation, and intellectual property violations. Machine Unlearning (MU) offers a practical solution by selectively forgetting targeted knowledge while preserving overall model utility. When applied to MLLMs, existing neuron-editing-based MU approaches face two fundamental challenges: (i) forgetting becomes inconsistent across modalities because existing point-wise attribution methods fail to capture the structured, layer-by-layer information flow that connects different modalities; and (ii) general knowledge performance declines when sensitive neurons that also support important reasoning paths are pruned, as this disrupts the model’s ability to generalize. To alleviate these limitations, we propose a multimodal influential neuron path editor (MIP-Editor) for MU. Our approach introduces modality-specific attribution scores to identify influential neuron paths responsible for encoding forget-set knowledge and applies influential-path-aware neuron-editing via representation misdirection. This strategy also enables effective and coordinated forgetting across modalities while preserving the model's general capabilities. Experimental results demonstrate that MIP-Editor achieves a superior unlearning performance on multimodal tasks, with a maximum forgetting rate of 87.75% and up to 54.26% improvement in general knowledge retention. On textual tasks, MIP-Editor achieves up to 80.65% forgetting and preserves 77.90% of general performance.
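
As background for the scoring primitive behind such neuron editing, the sketch below ranks neurons by a gradient-times-activation attribution collected on a forget set. This is exactly the point-wise scoring the paper argues is insufficient on its own (its contribution is tracing layer-by-layer influential paths), so treat it as context; all names are hypothetical.

```python
import torch

def neuron_attribution(acts, grads):
    """Point-wise gradient-times-activation attribution per neuron.

    acts, grads: (num_samples, num_neurons) activations and gradients
    collected on the forget set for one modality. A simplified stand-in
    for the paper's modality-specific path attribution.
    """
    return (acts * grads).abs().mean(dim=0)

def top_influential(acts, grads, k=5):
    # Neurons with the highest attribution would be candidates for editing.
    scores = neuron_attribution(acts, grads)
    return torch.topk(scores, k).indices

acts, grads = torch.randn(32, 100), torch.randn(32, 100)
print(top_influential(acts, grads))
```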

AAAI Conference 2026 Conference Paper

FedLAGC: Towards High Performance System-Heterogeneous Federated Learning via Layer-Adaptive Submodel Extraction and Gradient Correction

  • Qing Hu
  • Tianchi Liao
  • Shuyi Wu
  • Lei Yang
  • Chuan Chen

Federated learning (FL) has emerged as a promising paradigm for collaborative model training while preserving data privacy. However, many existing FL methods implicitly assume that clients have sufficient computational and storage resources, making them less applicable in real-world scenarios with severe system heterogeneity. To address this, submodel extraction has recently gained attention as a promising strategy to tailor the global model to resource-constrained clients. Despite this progress, existing methods often suffer from noticeable performance gaps across clients and structural inconsistency in the extracted models, leading to degraded global performance and increased communication overhead. In this work, we propose FedLAGC, a novel federated framework that jointly tackles performance imbalance and communication inefficiency through Layer-Adaptive submodel extraction and Gradient Correction. Specifically, FedLAGC constructs client-specific submodels by selecting structurally important parameters according to layer-wise importance scores, ensuring both resource adaptiveness and architectural consistency. Additionally, we propose a lightweight correction mechanism that captures historical optimization drift, helping to align local updates with the global direction and reduce redundant communication. We provide a rigorous convergence analysis of FedLAGC for system-heterogeneous federated learning under non-convex objectives. Extensive experiments on CIFAR-10 and CIFAR-100 with ResNet-18 and ResNet-34 under various system and data heterogeneity settings demonstrate the significant superiority of FedLAGC (up to 24% accuracy improvement and 3.66× communication efficiency) over state-of-the-art methods.
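
A minimal sketch of layer-adaptive extraction under a client capacity budget, assuming row-wise L1 magnitude as the importance score; the paper's actual layer-wise scores and structural-consistency constraints are not reproduced here, and all names are illustrative.

```python
import torch

def extract_submodel_masks(layer_weights, capacity):
    """Build per-layer binary masks keeping the highest-importance rows.

    layer_weights: dict name -> 2D weight tensor.
    capacity: fraction of each layer's output units a client can afford.
    A hedged sketch of layer-adaptive submodel extraction, not FedLAGC's
    exact selection rule.
    """
    masks = {}
    for name, w in layer_weights.items():
        importance = w.abs().sum(dim=1)          # row-wise L1 importance
        k = max(1, int(capacity * w.shape[0]))   # units this client keeps
        keep = torch.topk(importance, k).indices
        mask = torch.zeros(w.shape[0], dtype=torch.bool)
        mask[keep] = True
        masks[name] = mask
    return masks

weights = {"fc1": torch.randn(64, 32), "fc2": torch.randn(10, 64)}
print({n: int(m.sum()) for n, m in extract_submodel_masks(weights, 0.5).items()})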

AAAI Conference 2026 System Paper

PHOTONS: Pose-Free Human-Centric Photo-Realistic Real-Time Novel View Synthesis from Sparse Views

  • Yongyang Cheng
  • Boqin Qin
  • Zhao Hui
  • Xu Chen
  • Tao Zhang
  • Shang Sun
  • Haiquan Kang
  • Xiaojie Xu

We present PHOTONS (Pose-Free Human-Centric Photo-Realistic Real-Time Novel View Synthesis from Sparse Views), a real-time framework for novel view synthesis without requiring camera calibration. Our method reconstructs consistent 3D Gaussian point clouds and synthesizes 2K photo-realistic novel views from an arbitrary number (>=2) of freely placed cameras. PHOTONS faithfully renders dynamic human bodies amid complex backgrounds, including interactive object manipulation and fine-grained details (e.g., hair strands), while maintaining 25 FPS throughput on a commodity GPU such as an NVIDIA RTX 4090. By combining pose-free spatial point cloud reconstruction with Gaussian parameter estimation, our method demonstrates strong resilience to occlusions and camera perturbations. Additionally, we develop a 3D stereo system that drastically reduces setup complexity compared to existing solutions. Experiments on public and custom datasets show that PHOTONS outperforms state-of-the-art methods in both efficiency and visual quality.

AAAI Conference 2025 Conference Paper

A Novel Diffusion Model for Pairwise Geoscience Data Generation with Unbalanced Training Dataset

  • Junhuan Yang
  • Yuzhou Zhang
  • Yi Sheng
  • Youzuo Lin
  • Lei Yang

Recently, the advent of generative AI technologies has had a transformational impact on our daily lives, yet its adoption in scientific domains remains in its early stages. Data scarcity is a major, well-known barrier in data-driven scientific computing, so physics-guided generative AI holds significant promise. In scientific computing, most tasks study the conversion between multiple data modalities that describe a physical phenomenon, for example, spatial and waveform in seismic imaging, time and frequency in signal processing, and temporal and spectral in climate modeling; as such, multi-modal pairwise data generation is required, rather than the single-modal generation usually applied to natural images (e.g., faces, scenery). Moreover, in real-world applications the available data are commonly unbalanced across modalities; for example, the spatial data (i.e., velocity maps) in seismic imaging can be easily simulated, but real-world seismic waveforms are largely lacking. While the most recent efforts enable powerful diffusion models to generate multi-modal data, how to leverage the unbalanced available data is still unclear. In this work, we use seismic imaging in subsurface geophysics as a vehicle to present "UB-Diff", a novel diffusion model for multi-modal paired scientific data generation. One major innovation is a one-in-two-out encoder-decoder network structure, which ensures that pairwise data are obtained from a co-latent representation. This co-latent representation is then used by the diffusion process for pairwise data generation. Experimental results on the OpenFWI dataset show that UB-Diff significantly outperforms existing techniques in terms of Fréchet Inception Distance (FID) score and pairwise evaluation, indicating the generation of reliable and useful multi-modal pairwise data.
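
The one-in-two-out idea can be shown with a toy autoencoder: a single encoder produces the co-latent, and two decoder heads emit the paired modalities. Dimensions and layer choices below are placeholders, not the UB-Diff architecture; in the actual method the diffusion model would then be trained on the co-latent z.

```python
import torch
import torch.nn as nn

class OneInTwoOut(nn.Module):
    """Toy one-in-two-out autoencoder: a shared co-latent feeds two decoders,
    one per modality (e.g., velocity map and waveform). Illustrative sizes."""

    def __init__(self, in_dim=128, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU())
        self.decode_a = nn.Linear(latent_dim, in_dim)  # modality A head
        self.decode_b = nn.Linear(latent_dim, in_dim)  # modality B head

    def forward(self, x):
        z = self.encoder(x)            # co-latent shared by both modalities
        return self.decode_a(z), self.decode_b(z)

model = OneInTwoOut()
a, b = model(torch.randn(4, 128))
print(a.shape, b.shape)
```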

ICLR Conference 2025 Conference Paper

Discrete Distribution Networks

  • Lei Yang

We introduce a novel generative model, the Discrete Distribution Networks (DDN), that approximates the data distribution using hierarchical discrete distributions. We posit that since the features within a network inherently capture distributional information, enabling the network to generate multiple samples simultaneously, rather than a single output, may offer an effective way to represent distributions. Therefore, DDN fits the target distribution, including continuous ones, by generating multiple discrete sample points. To capture finer details of the target data, DDN selects the output that is closest to the Ground Truth (GT) from the coarse results generated in the first layer. This selected output is then fed back into the network as a condition for the second layer, thereby generating new outputs more similar to the GT. As the number of DDN layers increases, the representational space of the outputs expands exponentially, and the generated samples become increasingly similar to the GT. This hierarchical output pattern of discrete distributions endows DDN with unique properties: more general zero-shot conditional generation and 1D latent representation. We demonstrate the efficacy of DDN and its intriguing properties through experiments on CIFAR-10 and FFHQ. The code is available at https://discrete-distribution-networks.github.io/
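
The select-and-feed-back loop is concrete enough to sketch. Below, each "layer" emits k candidates conditioned on the previous selection, and the candidate closest to the ground truth is passed on; plain linear layers stand in for the real generators, and this shows the training-time path only.

```python
import torch

def ddn_training_path(gt, layers, k=8):
    """Greedy closest-to-GT selection through a stack of layers.

    Each layer maps the current condition to k candidate samples; the one
    nearest the ground truth is fed to the next layer. A toy rendering of
    DDN's hierarchical selection, not the paper's networks.
    """
    cond = torch.zeros_like(gt)
    for layer in layers:
        candidates = layer(cond).view(k, -1)                # k discrete samples
        dists = (candidates - gt.view(1, -1)).pow(2).sum(dim=1)
        cond = candidates[dists.argmin()].view_as(gt)       # pick nearest to GT
    return cond

dim, k = 16, 8
gt = torch.randn(dim)
layers = [torch.nn.Linear(dim, k * dim) for _ in range(3)]
print(((ddn_training_path(gt, layers, k) - gt) ** 2).mean())
```

The sequence of k-way choices, one per layer, is presumably what yields the compact 1D discrete latent representation the abstract mentions.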

NeurIPS Conference 2025 Conference Paper

Entropy-Calibrated Label Distribution Learning

  • Yunan Lu
  • Bowen Xue
  • Xiuyi Jia
  • Lei Yang

Label Distribution Learning (LDL) has emerged as a powerful framework for estimating complete conditional label distributions, providing crucial reliability for risk-sensitive decision-making tasks. While existing LDL algorithms exhibit competent performance under the conventional LDL performance evaluation methods, two key limitations remain: (1) current algorithms systematically underperform on samples with low-entropy label distributions, which can be particularly valuable for decision making, and (2) the conventional performance evaluation methods are inherently biased due to the numerical imbalance of samples. In this paper, through empirical and theoretical analyses, we find that excessive cohesion between anchor vectors contributes significantly to the observed entropy bias phenomenon in LDL algorithms. Accordingly, we propose an inter-anchor angular regularization term that mitigates cohesion among anchor vectors by penalizing over-small angles. Besides, to alleviate the numerical imbalance of high-entropy samples in the test set, we propose an entropy-calibrated aggregation strategy that obtains the overall model performance by evaluating performance on the low-entropy and high-entropy subsets of the test set separately. Finally, we conduct extensive experiments on various real-world datasets to demonstrate the effectiveness of our proposal.
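
One plausible form of the inter-anchor angular regularizer is a hinge on pairwise cosine similarity: anchor pairs whose similarity exceeds a margin (i.e., whose angle is over-small) are penalized. The margin and hinge form below are assumptions, not the paper's exact term.

```python
import torch

def angular_regularizer(anchors, margin=0.5):
    """Penalize pairs of anchor vectors whose angle is too small.

    anchors: (num_labels, dim). Cosine similarity above `margin` incurs a
    linear penalty; self-similarity on the diagonal is excluded.
    """
    a = torch.nn.functional.normalize(anchors, dim=1)
    cos = a @ a.t()
    off_diag = cos - torch.eye(len(a))           # zero out self-similarity
    return torch.clamp(off_diag - margin, min=0).sum()

anchors = torch.randn(5, 16, requires_grad=True)
loss = angular_regularizer(anchors)
loss.backward()                                   # usable as an added loss term
print(loss.item())
```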

IROS Conference 2025 Conference Paper

sEMG-Based Continuous Motion Prediction for Shoulder Exoskeleton Control Using the VGANet Model

  • Tongxin Jiang
  • Fuhai Zhang
  • Lei Yang
  • Tianyang Wu
  • Yili Fu

Wearable exoskeleton robots play a crucial role in promoting upper limb function recovery. To enhance human-robot interaction and achieve precise control, continuous prediction of limb joint angles is required. This paper proposes a decoupled network model (VGANet) based on Variable Graph Convolutional Networks (V-GCN) and Temporal External Attention (TEA) for motion prediction in upper limb rehabilitation training. By establishing a mapping relationship between surface electromyography (sEMG) signals and upper limb movements, the model can predict future joint angles based on real-time sEMG signals. Experimental results demonstrate that this method can achieve continuous motion prediction for the shoulder joint and has been successfully applied to the control system of exoskeleton robots, providing an effective solution for the intelligent development of rehabilitation exoskeletons.

NeurIPS Conference 2025 Conference Paper

Towards a Pairwise Ranking Model with Orderliness and Monotonicity for Label Enhancement

  • Yunan Lu
  • Xixi Zhang
  • Yaojin Lin
  • Weiwei Li
  • Lei Yang
  • Xiuyi Jia

Label distribution in recent years has been applied in a diverse array of complex decision-making tasks. To address the limited availability of label distributions, label enhancement has been established as an effective learning paradigm that aims to automatically infer label distributions from readily available multi-label data, e.g., logical labels. Recently, numerous works have demonstrated that label ranking is significantly beneficial to label enhancement. However, these works still exhibit deficiencies in representing the probabilistic relationships between label distributions and label rankings, or fail to accommodate scenarios where multiple labels are equally important for a given instance. Therefore, we propose PROM, a pairwise ranking model with orderliness and monotonicity, to explain the probabilistic relationship between label distributions and label rankings. Specifically, we propose the monotonicity and orderliness assumptions for the probabilities of different ranking relationships and derive the mass functions for PROM, which are theoretically ensured to preserve monotonicity and orderliness. Further, we propose a generative label enhancement algorithm based on PROM, which directly learns a label distribution predictor from the readily available multi-label data. Finally, extensive experiments demonstrate the efficacy of our proposed model.

IROS Conference 2025 Conference Paper

ULRVT II: A Novel Upper Limb Rehabilitation Robot with Joint Synergy Control and Evaluation for Virtual Training

  • Lei Yang
  • Fuhai Zhang
  • Tianyang Wu
  • Tongxin Jiang
  • Yili Fu

Global population aging has led to a sharp increase in patients with upper limb motor dysfunction. Robot-assisted virtual training, as a novel solution, can offer safe and precise assistance for upper limb rehabilitation. However, it remains a critical challenge to compensate for virtual interaction forces and realize joint synergy movement. In this paper, we design an upper limb rehabilitation robot for virtual training (ULRVT II), a cable-driven exoskeleton with high compatibility controlled by a joint synergy method. Moreover, we establish a rehabilitation platform with a virtual training environment and an evaluation system for experimental validation. Tests of joint synergy performance and virtual training are carried out to show the effectiveness of our robot.

NeurIPS Conference 2025 Conference Paper

V2X-Radar: A Multi-modal Dataset with 4D Radar for Cooperative Perception

  • Lei Yang
  • Xinyu Zhang
  • Jun Li
  • Chen Wang
  • Jiaqi Ma
  • Zhiying Song
  • Tong Zhao
  • Ziying Song

Modern autonomous vehicle perception systems often struggle with occlusions and limited perception range. Previous studies have demonstrated the effectiveness of cooperative perception in extending the perception range and overcoming occlusions, thereby enhancing the safety of autonomous driving. In recent years, a series of cooperative perception datasets have emerged; however, these datasets primarily focus on cameras and LiDAR, neglecting 4D Radar, a sensor used in single-vehicle autonomous driving to provide robust perception in adverse weather conditions. In this paper, to bridge the gap created by the absence of 4D Radar datasets in cooperative perception, we present V2X-Radar, the first large-scale, real-world multi-modal dataset featuring 4D Radar. The V2X-Radar dataset is collected using a connected vehicle platform and an intelligent roadside unit equipped with 4D Radar, LiDAR, and multi-view cameras. The collected data encompass sunny and rainy weather conditions, spanning daytime, dusk, and nighttime, as well as various typical challenging scenarios. The dataset consists of 20K LiDAR frames, 40K camera images, and 20K frames of 4D Radar data, including 350K annotated boxes across five categories. To support various research domains, we have established V2X-Radar-C for cooperative perception, V2X-Radar-I for roadside perception, and V2X-Radar-V for single-vehicle perception. Furthermore, we provide comprehensive benchmarks across these three sub-datasets.

AAAI Conference 2024 Conference Paper

IT3D: Improved Text-to-3D Generation with Explicit View Synthesis

  • Yiwen Chen
  • Chi Zhang
  • Xiaofeng Yang
  • Zhongang Cai
  • Gang Yu
  • Lei Yang
  • Guosheng Lin

Recent strides in Text-to-3D techniques have been propelled by distilling knowledge from powerful large text-to-image diffusion models (LDMs). Nonetheless, existing Text-to-3D approaches often grapple with challenges such as over-saturation, inadequate detailing, and unrealistic outputs. This study presents a novel strategy that leverages explicitly synthesized multi-view images to address these issues. Our approach involves the utilization of image-to-image pipelines, empowered by LDMs, to generate posed high-quality images based on the renderings of coarse 3D models. Although the generated images mostly alleviate the aforementioned issues, challenges such as view inconsistency and significant content variance persist due to the inherent generative nature of large diffusion models, posing extensive difficulties in leveraging these images effectively. To overcome this hurdle, we advocate integrating a discriminator alongside a novel Diffusion-GAN dual training strategy to guide the training of 3D models. For the incorporated discriminator, the synthesized multi-view images are considered real data, while the renderings of the optimized 3D models function as fake data. We conduct a comprehensive set of experiments that demonstrate the effectiveness of our method over baseline approaches.
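
The discriminator side of the Diffusion-GAN dual training reduces to a standard GAN loss with the roles the abstract assigns: LDM-synthesized multi-view images as real data, renderings of the current 3D model as fake. The sketch below uses a toy discriminator and illustrative shapes; it is not IT3D's exact loss.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(disc, synthesized_views, rendered_views):
    """Binary GAN loss: synthesized multi-view images play 'real',
    renderings of the optimized 3D model play 'fake'. `disc` is any
    image classifier returning one logit per image."""
    real_logits = disc(synthesized_views)
    fake_logits = disc(rendered_views.detach())   # 3D model not updated here
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits, torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake

disc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 1))
print(discriminator_loss(disc, torch.randn(4, 3, 8, 8), torch.randn(4, 3, 8, 8)))
```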

AAAI Conference 2024 Conference Paper

Learning Dense Correspondence for NeRF-Based Face Reenactment

  • Songlin Yang
  • Wei Wang
  • Yushi Lan
  • Xiangyu Fan
  • Bo Peng
  • Lei Yang
  • Jing Dong

Face reenactment is challenging due to the need to establish dense correspondence between various face representations for motion transfer. Recent studies have utilized Neural Radiance Field (NeRF) as the fundamental representation, which further enhanced the performance of multi-view face reenactment in photo-realism and 3D consistency. However, establishing dense correspondence between different face NeRFs is non-trivial, because implicit representations lack ground-truth correspondence annotations like mesh-based 3D parametric models (e.g., 3DMM) with index-aligned vertexes. Although aligning 3DMM space with NeRF-based face representations can realize motion control, it is sub-optimal due to 3DMM's limited face-only modeling and low identity fidelity. Therefore, we are inspired to ask: Can we learn the dense correspondence between different NeRF-based face representations without a 3D parametric model prior? To address this challenge, we propose a novel framework, which adopts tri-planes as the fundamental NeRF representation and decomposes face tri-planes into three components: canonical tri-planes, identity deformations, and motion. In terms of motion control, our key contribution is proposing a Plane Dictionary (PlaneDict) module, which efficiently maps the motion conditions to a linear weighted addition of learnable orthogonal plane bases. To the best of our knowledge, our framework is the first method that achieves one-shot multi-view face reenactment without a 3D parametric model prior. Extensive experiments demonstrate that we produce better results in fine-grained motion control and identity preservation than previous methods.
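
The core PlaneDict mapping, a motion condition turned into a linear weighted addition of plane bases, can be sketched as follows. Sizes are toy, and the orthogonality of the bases would need its own constraint (omitted here); this is a reading of the abstract, not the paper's module.

```python
import torch
import torch.nn as nn

class PlaneDictSketch(nn.Module):
    """Map a motion condition to weights over a dictionary of plane bases,
    then form the motion plane as their weighted sum."""

    def __init__(self, cond_dim=32, n_bases=8, plane_hw=16):
        super().__init__()
        self.bases = nn.Parameter(torch.randn(n_bases, plane_hw, plane_hw))
        self.to_weights = nn.Linear(cond_dim, n_bases)

    def forward(self, cond):                       # cond: (B, cond_dim)
        w = self.to_weights(cond)                  # (B, n_bases) mixing weights
        return torch.einsum("bn,nhw->bhw", w, self.bases)

module = PlaneDictSketch()
print(module(torch.randn(4, 32)).shape)            # (4, 16, 16)
```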

AAAI Conference 2024 Conference Paper

Let All Be Whitened: Multi-Teacher Distillation for Efficient Visual Retrieval

  • Zhe Ma
  • Jianfeng Dong
  • Shouling Ji
  • Zhenguang Liu
  • Xuhong Zhang
  • Zonghui Wang
  • Sifeng He
  • Feng Qian

Visual retrieval aims to search for the most relevant visual items, e.g., images and videos, from a candidate gallery with a given query item. Accuracy and efficiency are two competing objectives in retrieval tasks. Instead of crafting a new method pursuing further improvement on accuracy, in this paper we propose a multi-teacher distillation framework Whiten-MTD, which is able to transfer knowledge from off-the-shelf pre-trained retrieval models to a lightweight student model for efficient visual retrieval. Furthermore, we discover that the similarities obtained by different retrieval models are diversified and incommensurable, which makes it challenging to jointly distill knowledge from multiple models. Therefore, we propose to whiten the output of teacher models before fusion, which enables effective multi-teacher distillation for retrieval models. Whiten-MTD is conceptually simple and practically effective. Extensive experiments on two landmark image retrieval datasets and one video retrieval dataset demonstrate the effectiveness of our proposed method, and its good balance of retrieval performance and efficiency. Our source code is released at https://github.com/Maryeon/whiten_mtd.
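
To see why whitening helps fusion, consider ZCA-whitening each teacher's embeddings to zero mean and identity covariance before averaging; teachers with wildly different scales become commensurable. ZCA is one standard whitening transform and an assumption here, not necessarily the exact operator in Whiten-MTD.

```python
import torch

def zca_whiten(x, eps=1e-5):
    """ZCA-whiten a batch of teacher embeddings (N, D): zero mean,
    (approximately) identity covariance."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = x.t() @ x / (x.shape[0] - 1)
    eigvals, eigvecs = torch.linalg.eigh(cov)
    w = eigvecs @ torch.diag((eigvals + eps).rsqrt()) @ eigvecs.t()
    return x @ w

# Two 'teachers' with incommensurable output scales become comparable.
teachers = [torch.randn(64, 32) * s for s in (0.1, 10.0)]
fused = torch.stack([zca_whiten(t) for t in teachers]).mean(dim=0)
print(fused.std().item())
```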

JBHI Journal 2024 Journal Article

MSDE-Net: A Multi-Scale Dual-Encoding Network for Surgical Instrument Segmentation

  • Lei Yang
  • Yuge Gu
  • Guibin Bian
  • Yanhong Liu

Minimally invasive surgery, which relies on surgical robots and microscopes, demands precise image segmentation to ensure safe and efficient procedures. Nevertheless, achieving accurate segmentation of surgical instruments remains challenging due to the complexity of the surgical environment. To tackle this issue, this paper introduces a novel multiscale dual-encoding segmentation network, termed MSDE-Net, designed to automatically and precisely segment surgical instruments. The proposed MSDE-Net leverages a dual-branch encoder comprising a convolutional neural network (CNN) branch and a transformer branch to effectively extract both local and global features. Moreover, an attention fusion block (AFB) is introduced to ensure effective information complementarity between the dual-branch encoding paths. Additionally, a multilayer context fusion block (MCF) is proposed to enhance the network's capacity to simultaneously extract global and local features. Finally, to extend the scope of global feature information under larger receptive fields, a multi-receptive field fusion (MRF) block is incorporated. Through comprehensive experimental evaluations on two publicly available datasets for surgical instrument segmentation, the proposed MSDE-Net demonstrates superior performance compared to existing methods.

IJCAI Conference 2024 Conference Paper

RoboFusion: Towards Robust Multi-Modal 3D Object Detection via SAM

  • Ziying Song
  • Guoxing Zhang
  • Lin Liu
  • Lei Yang
  • Shaoqing Xu
  • Caiyan Jia
  • Feiyang Jia
  • Li Wang

Multi-modal 3D object detectors are dedicated to exploring secure and reliable perception systems for autonomous driving (AD). Although achieving state-of-the-art (SOTA) performance on clean benchmark datasets, they tend to overlook the complexity and harsh conditions of real-world environments. With the emergence of visual foundation models (VFMs), opportunities and challenges are presented for improving the robustness and generalization of multi-modal 3D object detection in AD. Therefore, we propose RoboFusion, a robust framework that leverages VFMs like SAM to tackle out-of-distribution (OOD) noise scenarios. We first adapt the original SAM for AD scenarios, named SAM-AD. To align SAM or SAM-AD with multi-modal methods, we then introduce AD-FPN for upsampling the image features extracted by SAM. We employ wavelet decomposition to denoise the depth-guided images, further reducing noise and weather interference. Finally, we employ self-attention mechanisms to adaptively reweight the fused features, enhancing informative features while suppressing excess noise. In summary, RoboFusion significantly reduces noise by leveraging the generalization and robustness of VFMs, thereby enhancing the resilience of multi-modal 3D object detection. Consequently, RoboFusion achieves SOTA performance in noisy scenarios, as demonstrated by the KITTI-C and nuScenes-C benchmarks. Code is available at https://github.com/adept-thu/RoboFusion.
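
The wavelet-denoising step can be illustrated generically with PyWavelets: decompose, soft-threshold the detail bands, reconstruct. The wavelet, level, and threshold below are arbitrary choices for illustration, not values from the paper.

```python
import numpy as np
import pywt

def wavelet_denoise(image, wavelet="db2", level=2, thresh=0.1):
    """Soft-threshold the detail coefficients of a 2D wavelet decomposition.

    A generic wavelet-denoising sketch in the spirit of the paper's
    depth-guided image denoising step.
    """
    coeffs = pywt.wavedec2(image, wavelet, level=level)
    denoised = [coeffs[0]]  # keep the approximation band untouched
    for detail in coeffs[1:]:
        denoised.append(tuple(pywt.threshold(d, thresh, mode="soft")
                              for d in detail))
    return pywt.waverec2(denoised, wavelet)

noisy = np.random.rand(64, 64) + 0.1 * np.random.randn(64, 64)
print(wavelet_denoise(noisy).shape)
```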

NeurIPS Conference 2023 Conference Paper

FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing

  • Mingyuan Zhang
  • Huirong Li
  • Zhongang Cai
  • Jiawei Ren
  • Lei Yang
  • Ziwei Liu

Text-driven motion generation has achieved substantial progress with the emergence of diffusion models. However, existing methods still struggle to generate complex motion sequences that correspond to fine-grained descriptions, depicting detailed and accurate spatio-temporal actions. This lack of fine controllability limits the usage of motion generation to a larger audience. To tackle these challenges, we present FineMoGen, a diffusion-based motion generation and editing framework that can synthesize fine-grained motions, with spatio-temporal composition according to user instructions. Specifically, FineMoGen builds upon a diffusion model with a novel transformer architecture dubbed Spatio-Temporal Mixture Attention (SAMI). SAMI optimizes the generation of the global attention template from two perspectives: 1) explicitly modeling the constraints of spatio-temporal composition; and 2) utilizing sparsely-activated mixture-of-experts to adaptively extract fine-grained features. To facilitate a large-scale study on this new fine-grained motion generation task, we contribute the HuMMan-MoGen dataset, which consists of 2,968 videos and 102,336 fine-grained spatio-temporal descriptions. Extensive experiments validate that FineMoGen exhibits superior motion generation quality over state-of-the-art methods. Notably, FineMoGen further enables zero-shot motion editing capabilities with the aid of modern large language models (LLMs), which faithfully manipulate motion sequences with fine-grained instructions.
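
As background for the sparsely-activated mixture-of-experts inside SAMI, here is a generic top-k-routed MoE layer; it shows the routing pattern only, not the SAMI attention block itself, and all sizes are toy.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Sparsely-activated mixture-of-experts: route each token to its
    top-k experts and mix their outputs by the gate weights."""

    def __init__(self, dim=64, n_experts=4, k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.gate = nn.Linear(dim, n_experts)
        self.k = k

    def forward(self, x):                              # x: (B, dim)
        weights, idx = self.gate(x).topk(self.k, dim=1)
        weights = weights.softmax(dim=1)               # (B, k) mixing weights
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e               # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

print(SparseMoE()(torch.randn(8, 64)).shape)
```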

JBHI Journal 2023 Journal Article

MSA-GCN: A Multi-information Selection Aggregation Graph Convolutional Network for Breast Tumor Grading

  • Kang Li
  • Suya Han
  • Lei Yang
  • Zizhao Sun
  • Zhan Yu
  • Hongwei Xu
  • Ling Ma
  • Jianbo Gao

Physicians typically combine multi-modal data to make a graded diagnosis of breast tumors. However, most existing breast tumor grading methods rely solely on image information, resulting in limited grading accuracy. This paper proposes a Multi-information Selection Aggregation Graph Convolutional Network (MSA-GCN) for breast tumor grading. Firstly, to fully utilize phenotypic data reflecting the clinical and pathological characteristics of tumors, an automatic combination screening and weight encoder is proposed for phenotypic data, which can construct a population graph with improved structural information. Then, a graph structure is designed through similarity learning to reflect the correlation between patient image features. Finally, a multi-information selection aggregation mechanism is employed in the graph convolution model to extract the effective features of multi-modal data and enhance the classification performance of the model. The proposed method is evaluated on different clinical datasets from the Digital Database for Screening Mammography (DDSM) and INbreast. The average classification accuracies are 90.74% and 85.35%, respectively, surpassing the performance of existing methods. In conclusion, our method effectively fuses image and non-image information, leading to a significant improvement in the accuracy of breast tumor grading.
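
A plain-NumPy reading of the population-graph construction: phenotype agreement gated by an RBF image-feature similarity, thresholded into an adjacency matrix. The learned weight encoder from the paper is replaced here by exact categorical matching, so this is a deliberate simplification.

```python
import numpy as np

def population_adjacency(phenotypes, image_feats, sigma=4.0, thresh=0.01):
    """Build a toy population graph over patients.

    phenotypes: (N, P) categorical codes; image_feats: (N, D).
    Edge weight = fraction of matching phenotype entries, scaled by an
    RBF similarity of image features, then thresholded.
    """
    pheno_agree = (phenotypes[:, None, :] == phenotypes[None, :, :]).mean(axis=2)
    diff = image_feats[:, None, :] - image_feats[None, :, :]
    img_sim = np.exp(-np.square(diff).sum(axis=2) / (2 * sigma**2))
    adj = pheno_agree * img_sim
    return np.where(adj > thresh, adj, 0.0)

phen = np.random.randint(0, 3, size=(10, 4))
feats = np.random.randn(10, 8)
print(population_adjacency(phen, feats).shape)
```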

NeurIPS Conference 2023 Conference Paper

PrimDiffusion: Volumetric Primitives Diffusion for 3D Human Generation

  • Zhaoxi Chen
  • Fangzhou Hong
  • Haiyi Mei
  • Guangcong Wang
  • Lei Yang
  • Ziwei Liu

We present PrimDiffusion, the first diffusion-based framework for 3D human generation. Devising diffusion models for 3D human generation is difficult due to the intensive computational cost of 3D representations and the articulated topology of 3D humans. To tackle these challenges, our key insight is operating the denoising diffusion process directly on a set of volumetric primitives, which models the human body as a number of small volumes with radiance and kinematic information. This volumetric primitives representation marries the capacity of volumetric representations with the efficiency of primitive-based rendering. Our PrimDiffusion framework has three appealing properties: 1) compact and expressive parameter space for the diffusion model, 2) flexible representation that incorporates human prior, and 3) decoder-free rendering for efficient novel-view and novel-pose synthesis. Extensive experiments validate that PrimDiffusion outperforms state-of-the-art methods in 3D human generation. Notably, compared to GAN-based methods, our PrimDiffusion supports real-time rendering of high-quality 3D humans at a resolution of 512×512 once the denoising process is done. We also demonstrate the flexibility of our framework on training-free conditional generation such as texture transfer and 3D inpainting.

NeurIPS Conference 2023 Conference Paper

RenderMe-360: A Large Digital Asset Library and Benchmarks Towards High-fidelity Head Avatars

  • Dongwei Pan
  • Long Zhuo
  • Jingtan Piao
  • Huiwen Luo
  • Wei Cheng
  • Yuxin Wang
  • Siming Fan
  • Shengqi Liu

Synthesizing high-fidelity head avatars is a central problem for computer vision and graphics. While head avatar synthesis algorithms have advanced rapidly, the best ones still face great obstacles in real-world scenarios. One of the vital causes is the inadequate datasets -- 1) current public datasets can only support researchers to explore high-fidelity head avatars in one or two task directions; 2) these datasets usually contain digital head assets with limited data volume, and narrow distribution over different attributes, such as expressions, ages, and accessories. In this paper, we present RenderMe-360, a comprehensive 4D human head dataset to drive advance in head avatar algorithms across different scenarios. It contains massive data assets, with 243+ million complete head frames and over 800k video sequences from 500 different identities captured by multi-view cameras at 30 FPS. It is a large-scale digital library for head avatars with three key attributes: 1) High Fidelity: all subjects are captured in 360 degrees via 60 synchronized, high-resolution 2K cameras. 2) High Diversity: The collected subjects vary from different ages, eras, ethnicities, and cultures, providing abundant materials with distinctive styles in appearance and geometry. Moreover, each subject is asked to perform various dynamic motions, such as expressions and head rotations, which further extend the richness of assets. 3) Rich Annotations: the dataset provides annotations with different granularities: cameras' parameters, background matting, scan, 2D/3D facial landmarks, FLAME fitting, and text description. Based on the dataset, we build a comprehensive benchmark for head avatar research, with 16 state-of-the-art methods performed on five main tasks: novel view synthesis, novel expression synthesis, hair rendering, hair editing, and talking head generation. Our experiments uncover the strengths and flaws of state-of-the-art methods. RenderMe-360 opens the door for future exploration in modern head avatars. All of the data, code, and models will be publicly available at https://renderme-360.github.io/.

NeurIPS Conference 2023 Conference Paper

SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation

  • Zhongang Cai
  • Wanqi Yin
  • Ailing Zeng
  • Chen Wei
  • Qingping SUN
  • Wang Yanjun
  • Hui En Pang
  • Haiyi Mei

Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods still depend largely on a confined set of training datasets. In this work, we investigate scaling up EHPS towards the first generalist foundation model (dubbed SMPLer-X), with up to ViT-Huge as the backbone and training with up to 4.5M instances from diverse data sources. With big data and the large model, SMPLer-X exhibits strong performance across diverse test benchmarks and excellent transferability to even unseen environments. 1) For the data scaling, we perform a systematic investigation on 32 EHPS datasets, including a wide range of scenarios that a model trained on any single dataset cannot handle. More importantly, capitalizing on insights obtained from the extensive benchmarking process, we optimize our training scheme and select datasets that lead to a significant leap in EHPS capabilities. 2) For the model scaling, we take advantage of vision transformers to study the scaling law of model sizes in EHPS. Moreover, our finetuning strategy turns SMPLer-X into specialist models, allowing them to achieve further performance boosts. Notably, our foundation model SMPLer-X consistently delivers state-of-the-art results on seven benchmarks such as AGORA (107.2 mm NMVE), UBody (57.4 mm PVE), EgoBody (63.6 mm PVE), and EHF (62.3 mm PVE without finetuning).

NeurIPS Conference 2023 Conference Paper

Towards Robust and Expressive Whole-body Human Pose and Shape Estimation

  • Hui En Pang
  • Zhongang Cai
  • Lei Yang
  • Qingyi Tao
  • Zhonghua Wu
  • Tianwei Zhang
  • Ziwei Liu

Whole-body pose and shape estimation aims to jointly predict different behaviors (e.g., pose, hand gesture, facial expression) of the entire human body from a monocular image. Existing methods often exhibit suboptimal performance due to the complexity of in-the-wild scenarios. We argue that the prediction accuracy of these models is significantly affected by the quality of the bounding box, e.g., scale, alignment. The natural discrepancy between the ideal bounding box annotations and model detection results is particularly detrimental to the performance of whole-body pose and shape estimation. In this paper, we propose a novel framework to enhance the robustness of whole-body pose and shape estimation. Our framework incorporates three new modules to address the above challenges from three perspectives: (1) a Localization Module enhances the model's awareness of the subject's location and semantics within the image space; (2) a Contrastive Feature Extraction Module encourages the model to be invariant to robust augmentations by incorporating a contrastive loss and positive samples; (3) a Pixel Alignment Module ensures the reprojected mesh from the predicted camera and body model parameters are more accurate and pixel-aligned. We perform comprehensive experiments to demonstrate the effectiveness of our proposed framework on body, hands, face and whole-body benchmarks.

AAAI Conference 2023 Conference Paper

TransVCL: Attention-Enhanced Video Copy Localization Network with Flexible Supervision

  • Sifeng He
  • Yue He
  • Minlong Lu
  • Chen Jiang
  • Xudong Yang
  • Feng Qian
  • Xiaobo Zhang
  • Lei Yang

Video copy localization aims to precisely localize all the copied segments within a pair of untrimmed videos in video retrieval applications. Previous methods typically start from a frame-to-frame similarity matrix generated by cosine similarity between frame-level features of the input video pair, and then detect and refine the boundaries of copied segments on the similarity matrix under temporal constraints. In this paper, we propose TransVCL: an attention-enhanced video copy localization network, which is optimized directly from initial frame-level features and trained end-to-end with three main components: a customized Transformer for feature enhancement, a correlation and softmax layer for similarity matrix generation, and a temporal alignment module for copied segment localization. In contrast to previous methods demanding a handcrafted similarity matrix, TransVCL incorporates long-range temporal information between the feature sequence pair using self- and cross-attention layers. With the joint design and optimization of the three components, the similarity matrix can be learned to present more discriminative copied patterns, leading to significant improvements over previous methods on segment-level labeled datasets (VCSL and VCDB). Besides the state-of-the-art performance in the fully supervised setting, the attention architecture enables TransVCL to further exploit unlabeled or simply video-level labeled data. Additional experiments supplementing video-level labeled datasets including SVD and FIVR reveal the high flexibility of TransVCL from full supervision to semi-supervision (with or without video-level annotation). Code is publicly available at https://github.com/transvcl/TransVCL.
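
The correlation-and-softmax layer has a direct reading: L2-normalize the frame features, take the cosine correlation matrix, and apply a row-wise softmax. The temperature below is an assumption; TransVCL's learned feature enhancement and alignment module are not shown.

```python
import torch
import torch.nn.functional as F

def similarity_matrix(feats_a, feats_b, temperature=0.1):
    """Frame-to-frame similarity between two videos from frame features.

    feats_a: (Ta, D), feats_b: (Tb, D). Cosine correlation followed by a
    row-wise softmax over the second video's frames.
    """
    a = F.normalize(feats_a, dim=1)
    b = F.normalize(feats_b, dim=1)
    corr = a @ b.t()                       # (Ta, Tb) cosine similarities
    return F.softmax(corr / temperature, dim=1)

sim = similarity_matrix(torch.randn(30, 256), torch.randn(45, 256))
print(sim.shape, sim.sum(dim=1)[:3])       # each row sums to 1
```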

NeurIPS Conference 2022 Conference Paper

Benchmarking and Analyzing 3D Human Pose and Shape Estimation Beyond Algorithms

  • Hui En Pang
  • Zhongang Cai
  • Lei Yang
  • Tianwei Zhang
  • Ziwei Liu

3D human pose and shape estimation (a.k.a. "human mesh recovery") has achieved substantial progress. Researchers mainly focus on the development of novel algorithms, while less attention has been paid to other critical factors involved. This could lead to less optimal baselines, hindering the fair and faithful evaluations of newly designed methodologies. To address this problem, this work presents the first comprehensive benchmarking study from three under-explored perspectives beyond algorithms. 1) Datasets. An analysis of 31 datasets reveals the distinct impacts of data samples: datasets featuring critical attributes (i.e., diverse poses, shapes, camera characteristics, backbone features) are more effective. Strategic selection and combination of high-quality datasets can yield a significant boost to model performance. 2) Backbones. Experiments with 10 backbones, ranging from CNNs to transformers, show that knowledge learnt from a proximity task is readily transferable to human mesh recovery. 3) Training strategies. Proper augmentation techniques and loss designs are crucial. With the above findings, we achieve a PA-MPJPE of 47.3 mm on the 3DPW test set with a relatively simple model. More importantly, we provide strong baselines for fair comparisons of algorithms, and recommendations for building effective training configurations in the future. The codebase is available at https://github.com/smplbody/hmr-benchmarks.

JMLR Journal 2021 Journal Article

A Fast Globally Linearly Convergent Algorithm for the Computation of Wasserstein Barycenters

  • Lei Yang
  • Jia Li
  • Defeng Sun
  • Kim-Chuan Toh

We consider the problem of computing a Wasserstein barycenter for a set of discrete probability distributions with finite supports, which finds many applications in areas such as statistics, machine learning and image processing. When the support points of the barycenter are pre-specified, this problem can be modeled as a linear programming (LP) problem whose size can be extremely large. To handle this large-scale LP, we analyse the structure of its dual problem, which is conceivably more tractable and can be reformulated as a well-structured convex problem with 3 kinds of block variables and a coupling linear equality constraint. We then adapt a symmetric Gauss-Seidel based alternating direction method of multipliers (sGS-ADMM) to solve the resulting dual problem and establish its global convergence and global linear convergence rate. As a critical component for efficient computation, we also show how all the subproblems involved can be solved exactly and efficiently. This makes our method suitable for computing a Wasserstein barycenter on a large-scale data set, without introducing an entropy regularization term as is commonly practiced. In addition, our sGS-ADMM can be used as a subroutine in an alternating minimization method to compute a barycenter when its support points are not pre-specified. Numerical results on synthetic data sets and image data sets demonstrate that our method is highly competitive for solving large-scale Wasserstein barycenter problems, in comparison to two existing representative methods and the commercial software Gurobi.
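
For reference, the fixed-support barycenter LP the abstract refers to can be written as follows (the notation here is ours, not necessarily the paper's): given distributions $\mu^{(1)},\dots,\mu^{(N)}$ with mass vectors $a^{(t)}$, fixed barycenter support $\{x_i\}_{i=1}^{m}$, cost matrices $D^{(t)}$ between the barycenter support and each $\mu^{(t)}$'s support, and convex combination weights $\gamma_t$,

$$
\min_{w,\;\Pi^{(1)},\dots,\Pi^{(N)}} \;\sum_{t=1}^{N} \gamma_t \,\big\langle D^{(t)}, \Pi^{(t)} \big\rangle
\quad \text{s.t.} \quad \Pi^{(t)}\mathbf{1} = w,\;\; \big(\Pi^{(t)}\big)^{\top}\mathbf{1} = a^{(t)},\;\; \Pi^{(t)} \ge 0,\;\; w \in \Delta_m,
$$

where $w$ is the barycenter's mass vector and each $\Pi^{(t)}$ is a transport plan. Per the abstract, it is the dual of this LP, with its block structure and single coupling linear constraint, that the sGS-ADMM exploits.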

NeurIPS Conference 2021 Conference Paper

SimiGrad: Fine-Grained Adaptive Batching for Large Scale Training using Gradient Similarity Measurement

  • Heyang Qin
  • Samyam Rajbhandari
  • Olatunji Ruwase
  • Feng Yan
  • Lei Yang
  • Yuxiong He

Large scale training requires massive parallelism to finish the training within a reasonable amount of time. To support massive parallelism, large batch training is the key enabler but often at the cost of generalization performance. Existing works explore adaptive batching or hand-tuned static large batching, in order to strike a balance between computational efficiency and performance. However, these methods can provide only coarse-grained adaptation (e.g., at an epoch level) due to the intrinsically expensive calculation or hand tuning requirements. In this paper, we propose a fully automated and lightweight adaptive batching methodology to enable fine-grained batch size adaptation (e.g., at a mini-batch level) that can achieve state-of-the-art performance with record-breaking batch sizes. The core component of our method is a lightweight yet efficient representation of the critical gradient noise information. We open-source the proposed methodology by providing a plugin tool that supports mainstream machine learning frameworks. Extensive evaluations on popular benchmarks (e.g., CIFAR10, ImageNet, and BERT-Large) demonstrate that the proposed methodology outperforms state-of-the-art methodologies using adaptive batching approaches or hand-tuned static strategies in both performance and batch size. Particularly, we achieve a new state-of-the-art batch size of 78k in BERT-Large pretraining with a SQuAD score of 90.69, compared to 90.58 reported in the previous state-of-the-art with a 59k batch size.
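
The gradient-similarity signal at the heart of such methods can be sketched directly: split a batch into two halves and measure the cosine similarity of their gradients. High similarity suggests low gradient noise (the batch could grow); low similarity suggests the opposite. This is a minimal reading of the signal, not SimiGrad's actual controller, which lives in its open-source plugin.

```python
import torch

def gradient_similarity(grads_half_a, grads_half_b):
    """Cosine similarity between the gradients of two micro-batch halves.

    grads_half_a / grads_half_b: lists of per-parameter gradient tensors,
    one list per half of the batch.
    """
    a = torch.cat([g.flatten() for g in grads_half_a])
    b = torch.cat([g.flatten() for g in grads_half_b])
    return torch.nn.functional.cosine_similarity(a, b, dim=0)

ga = [torch.randn(10, 10), torch.randn(10)]
gb = [g + 0.1 * torch.randn_like(g) for g in ga]
print(gradient_similarity(ga, gb).item())   # close to 1: halves agree
```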

AAAI Conference 2018 Conference Paper

Accelerated Training for Massive Classification via Dynamic Class Selection

  • Xingcheng Zhang
  • Lei Yang
  • Junjie Yan
  • Dahua Lin

Massive classification, a classification task defined over a vast number of classes (hundreds of thousands or even millions), has become an essential part of many real-world systems, such as face recognition. Existing methods, including the deep networks that achieved remarkable success in recent years, were mostly devised for problems with a moderate number of classes. They would meet with substantial difficulties, e.g., excessive memory demand and computational cost, when applied to massive problems. We present a new method to tackle this problem. This method can efficiently and accurately identify a small number of "active classes" for each mini-batch, based on a set of dynamic class hierarchies constructed on the fly. We also develop an adaptive allocation scheme thereon, which leads to a better tradeoff between performance and cost. On several large-scale benchmarks, our method significantly reduces the training cost and memory demand, while maintaining competitive performance.
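
The "active classes" idea in its simplest form: restrict the softmax to the classes present in the mini-batch plus a small set of strongly responding classes. The toy below scores against the full weight matrix to pick that set, which is exactly the cost the paper's dynamic class hierarchies are designed to avoid, so read it as the interface rather than the method; all names are illustrative.

```python
import torch

def active_class_logits(features, weight, batch_labels, n_extra=100):
    """Compute logits over a small set of 'active classes' only.

    features: (B, D), weight: (num_classes, D), batch_labels: (B,).
    Active set = classes in the mini-batch plus the classes whose weight
    vectors respond most strongly to the batch on average.
    """
    scores = features @ weight.t()                        # full scan: toy only
    hard = scores.mean(dim=0).topk(n_extra).indices       # strongest classes
    active = torch.unique(torch.cat([batch_labels, hard]))  # sorted class ids
    remapped = torch.searchsorted(active, batch_labels)   # labels in active set
    return features @ weight[active].t(), remapped

feats, W = torch.randn(8, 64), torch.randn(100000, 64)
labels = torch.randint(0, 100000, (8,))
logits, y = active_class_logits(feats, W, labels)
print(logits.shape, y.shape)                              # softmax over ~108 classes
```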

JMLR Journal 2018 Journal Article

Online Bootstrap Confidence Intervals for the Stochastic Gradient Descent Estimator

  • Yixin Fang
  • Jinfeng Xu
  • Lei Yang

In many applications involving large datasets or online learning, stochastic gradient descent (SGD) is a scalable algorithm to compute parameter estimates and has gained increasing popularity due to its numerical convenience and memory efficiency. While the asymptotic properties of SGD-based estimators have been well established, statistical inference such as interval estimation remains much unexplored. The classical bootstrap is not directly applicable if the data are not stored in memory. The plug-in method is not applicable when there is no explicit formula for the covariance matrix of the estimator. In this paper, we propose an online bootstrap procedure for the estimation of confidence intervals, which, upon the arrival of each observation, updates the SGD estimate as well as a number of randomly perturbed SGD estimates. The proposed method is easy to implement in practice. We establish its theoretical properties for a general class of models that includes linear regressions, generalized linear models, M-estimators and quantile regressions as special cases. The finite-sample performance and numerical utility are evaluated by simulation studies and real data applications.
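
A minimal sketch for the linear-regression special case: alongside the running SGD estimate, maintain B perturbed copies whose gradient steps are scaled by independent Exp(1) weights (mean 1, variance 1), and read confidence intervals off their spread. The step-size schedule, perturbation law, and percentile CI construction are common choices, not necessarily the paper's.

```python
import numpy as np

def online_bootstrap_sgd(stream, dim, n_boot=200,
                         lr=lambda t: 0.5 / (t + 1) ** 0.75):
    """Online bootstrap for linear regression via SGD.

    Each arriving (x, y) updates the main estimate once, and each of the
    n_boot perturbed estimates with an independent Exp(1) weight on its
    gradient; the perturbed draws' spread estimates the sampling error.
    """
    rng = np.random.default_rng(0)
    theta = np.zeros(dim)
    boots = np.zeros((n_boot, dim))
    for t, (x, y) in enumerate(stream):
        grad = lambda th: (th @ x - y) * x          # squared-loss gradient
        theta -= lr(t) * grad(theta)
        w = rng.exponential(1.0, size=n_boot)       # random multiplicative weights
        boots -= lr(t) * w[:, None] * np.array([grad(b) for b in boots])
    ci = np.percentile(boots, [2.5, 97.5], axis=0)  # simple percentile CI
    return theta, ci

rng = np.random.default_rng(1)
true = np.array([1.0, -2.0])
data = [(x, x @ true + 0.1 * rng.standard_normal())
        for x in rng.standard_normal((2000, 2))]
theta, ci = online_bootstrap_sgd(data, dim=2)
print(theta, ci)
```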

JMLR Journal 2016 Journal Article

Model-free Variable Selection in Reproducing Kernel Hilbert Space

  • Lei Yang
  • Shaogao Lv
  • Junhui Wang

Variable selection is popular in high-dimensional data analysis to identify the truly informative variables. Many variable selection methods have been developed under various model assumptions. Whereas success has been widely reported in the literature, their performance largely depends on the validity of the assumed models, such as the linear or additive models. This article introduces a model-free variable selection method via learning the gradient functions. The idea is based on the equivalence between whether a variable is informative and whether its corresponding gradient function is substantially non-zero. The proposed variable selection method is then formulated in a framework of learning gradients in a flexible reproducing kernel Hilbert space. The key advantage of the proposed method is that it requires no explicit model assumption and allows for general variable effects. Its asymptotic estimation and selection consistencies are studied, which establish the convergence rate of the estimated sparse gradients and assure that the truly informative variables are correctly identified in probability. The effectiveness of the proposed method is also supported by a variety of simulated examples and two real-life examples.
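
In symbols, the selection rule the abstract describes is roughly the following (our notation, with details that may differ from the paper): estimate the gradient functions $\hat g_1,\dots,\hat g_p$ of the regression function in an RKHS, then declare variable $j$ informative when its estimated gradient is substantially non-zero, e.g.

$$
\hat S = \big\{\, j : \|\hat g_j\|_{n} > \tau_n \,\big\}, \qquad
\|\hat g_j\|_{n}^2 = \frac{1}{n}\sum_{i=1}^{n} \hat g_j(x_i)^2,
$$

where $\tau_n$ is a data-driven threshold. The exact empirical norm and thresholding scheme used in the paper may differ; this records only the informative-variable/non-zero-gradient equivalence the abstract states.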

ICRA Conference 2014 Conference Paper

Method of improving WiFi SLAM based on spatial and temporal coherence

  • Shao-Wen Yang
  • Sharon Xue Yang
  • Lei Yang

The paper addresses the revisiting (loop closing) problem of simultaneous localization and mapping (SLAM) by investigating spatio-temporal coherence in inertial and perceptual inputs to improve the robustness and convergence of SLAM. The basic idea is to find coherent subsequences of confidence in the trajectory to guard against error-prone correspondences. This is achieved by leveraging fuzzy matching based on local trajectory structure and measurement similarity. Our approach does not rely on any global features or propagation modeling, which can be unreliable in the presence of gross errors and result in divergence. Beyond WiFi SLAM, our approach is also capable of improving generic SLAM problems by leveraging spatio-temporal coherence. The experiments show that our approach can significantly reduce the ambiguity in WiFi fingerprinting, and subsequently lead to performance improvements in terms of mapping and localization.