Arrow Research search

Author name cluster

Yuan Du

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
2 author rows

Possible papers

7

AAAI Conference 2026 Conference Paper

Decomposing the Neurons: Activation Sparsity via Mixture of Experts for Continual Test Time Adaptation

  • Rongyu Zhang
  • Aosong Cheng
  • Yulin Luo
  • Gaole Dai
  • Huanrui Yang
  • Jiaming Liu
  • Ran Xu
  • Li Du

Continual Test-Time Adaptation (CTTA), which aims to adapt the pre-trained model to ever-evolving target domains, emerges as an important task for vision models. As current vision models appear to be heavily biased towards texture, continuously adapting the model from one domain distribution to another can result in serious catastrophic forgetting. Drawing inspiration from the the encoding characteristics of neuron activation in neural networks, we propose the Mixture-of-Activation-Sparsity-Experts (MoASE) for the CTTA task. Given the distinct reaction of neurons with low and high activation to domain-specific and agnostic features, MoASE decomposes the neural activation into high-activation and low-activation components in each expert with a Spatial Differentiable Dropout (SDD). Based on the decomposition, we devise a Domain-Aware Router (DAR) that utilizes domain information to adaptively weight experts that process the post-SDD sparse activations, and the Activation Sparsity Gate (ASG) that adaptively assigns feature selection thresholds of the SDD for different experts for more precise feature decomposition. Finally, we introduce a Homeostatic-Proximal (HP) loss to maintain update consistency between the teacher and student experts to prevent error accumulation. Extensive experiments substantiate that MoASE achieves state-of-the-art performance in both classification and segmentation tasks.

AAAI Conference 2026 Conference Paper

MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation

  • Rongyu Zhang
  • Menghang Dong
  • Yuan Zhang
  • Liang Heng
  • Xiaowei Chi
  • Gaole Dai
  • Li Du
  • Dan Wang

Vision-Language-Action (VLA) models enable robotic systems to perform embodied tasks but face deployment challenges due to the high computational demands of the dense Large Language Models (LLMs), with existing early-exit-based sparsification methods often overlooking the critical semantic role of final layers in downstream tasks. Aligning with the recent breakthrough of the Shallow Brain Hypothesis (SBH) in neuroscience and the mixture of experts in model sparsification, we conceptualize each LLM layer as an expert and propose a Mixture-of-LayEr Vision Language Action model (MoLe-VLA or simply MoLe) architecture for dynamic LLM layer activation. Specifically, we introduce a Spatial-Temporal Aware Router (STAR) for MoLe to selectively activate only parts of the layers based on the robot’s current state, mimicking the brain's distinct signal pathways specialized for cognition and causal reasoning. Additionally, to compensate for the cognition ability of LLM lost during the layer-skipping, we devise a Cognitive self-Knowledge Distillation (CogKD) to enhance the understanding of task demands and generate task-relevant action sequences by leveraging cognition features. Extensive experiments in RLBench simulations and real-world environments demonstrate the superiority of MoLe-VLA in both efficiency and performance, improving the mean success rate by 9.7% across ten simulation tasks while accelerating inference by 36.8% over OpenVLA.

IJCAI Conference 2025 Conference Paper

FBQuant: FeedBack Quantization for Large Language Models

  • Yijiang Liu
  • Hengyu Fang
  • Liulu He
  • Rongyu Zhang
  • Yichuan Bai
  • Yuan Du
  • Li Du

Deploying Large Language Models (LLMs) on edge devices is increasingly important, as it eliminates reliance on network connections, reduces expensive API calls, and enhances user privacy. However, on-device deployment is challenging due to the limited computational resources of edge devices. In particular, the key bottleneck stems from memory bandwidth constraints related to weight loading. Weight-only quantization effectively reduces memory access, yet often induces significant accuracy degradation. Recent efforts to incorporate sub-branches have shown promise for mitigating quantization errors, but these methods either lack robust optimization strategies or rely on suboptimal objectives. To address these gaps, we propose FeedBack Quantization (FBQuant), a novel approach inspired by negative feedback mechanisms in automatic control. FBQuant inherently ensures that the reconstructed weights remain bounded by the quantization process, thereby reducing the risk of overfitting. To further offset the additional latency introduced by sub-branches, we develop an efficient CUDA kernel that decreases 60% of extra inference time. Comprehensive experiments demonstrate the efficiency and effectiveness of FBQuant across various LLMs. Notably, for 3-bit Llama2-7B, FBQuant improves zero-shot accuracy by 1. 2%.

AAAI Conference 2025 Conference Paper

PAT: Pruning-Aware Tuning for Large Language Models

  • Yijiang Liu
  • Huanrui Yang
  • Youxin Chen
  • Rongyu Zhang
  • Miao Wang
  • Yuan Du
  • Li Du

Large language models (LLMs) excel in language tasks, especially with supervised fine-tuning after pre-training. However, their substantial memory and computational requirements hinder practical applications. Structural pruning, which reduces less significant weight dimensions, is one solution. Yet, traditional post-hoc pruning often leads to significant performance loss, with limited recovery from further fine-tuning due to reduced capacity. Since the model fine-tuning refines the general and chaotic knowledge in pre-trained models, we aim to incorporate structural pruning with the fine-tuning, and propose the Pruning-Aware Tuning (PAT) paradigm to eliminate model redundancy while preserving the model performance to the maximum extend. Specifically, we insert the innovative Hybrid Sparsification Modules (HSMs) between the Attention and FFN components to accordingly sparsify the upstream and downstream linear modules. The HSM comprises a lightweight operator and a globally shared trainable mask. The lightweight operator maintains a training overhead comparable to that of LoRA, while the trainable mask unifies the channels to be sparsified, ensuring structural pruning. Additionally, we propose the Identity Loss which decouples the transformation and scaling properties of the HSMs to enhance training robustness. Extensive experiments demonstrate that PAT excels in both performance and efficiency. For example, our Llama2-7b model with a 25% pruning ratio achieves 1.33x speedup while outperforming the LoRA-finetuned model by up to 1.26% in accuracy with a similar training cost.

AAAI Conference 2024 Conference Paper

Efficient Deweahter Mixture-of-Experts with Uncertainty-Aware Feature-Wise Linear Modulation

  • Rongyu Zhang
  • Yulin Luo
  • Jiaming Liu
  • Huanrui Yang
  • Zhen Dong
  • Denis Gudovskiy
  • Tomoyuki Okuno
  • Yohei Nakata

The Mixture-of-Experts (MoE) approach has demonstrated outstanding scalability in multi-task learning including low-level upstream tasks such as concurrent removal of multiple adverse weather effects. However, the conventional MoE architecture with parallel Feed Forward Network (FFN) experts leads to significant parameter and computational overheads that hinder its efficient deployment. In addition, the naive MoE linear router is suboptimal in assigning task-specific features to multiple experts which limits its further scalability. In this work, we propose an efficient MoE architecture with weight sharing across the experts. Inspired by the idea of linear feature modulation (FM), our architecture implicitly instantiates multiple experts via learnable activation modulations on a single shared expert block. The proposed Feature Modulated Expert (FME) serves as a building block for the novel Mixture-of-Feature-Modulation-Experts (MoFME) architecture, which can scale up the number of experts with low overhead. We further propose an Uncertainty-aware Router (UaR) to assign task-specific features to different FM modules with well-calibrated weights. This enables MoFME to effectively learn diverse expert functions for multiple tasks. The conducted experiments on the multi-deweather task show that our MoFME outperforms the state-of-the-art in the image restoration quality by 0.1-0.2 dB while saving more than 74% of parameters and 20% inference time over the conventional MoE counterpart. Experiments on the downstream segmentation and classification tasks further demonstrate the generalizability of MoFME to real open-world applications.

ICML Conference 2024 Conference Paper

SFC: Achieve Accurate Fast Convolution under Low-precision Arithmetic

  • Liulu He
  • Yufei Zhao
  • Rui Gao
  • Yuan Du
  • Li Du

Fast convolution algorithms, including Winograd and FFT, can efficiently accelerate convolution operations in deep models. However, these algorithms depend on high-precision arithmetic to maintain inference accuracy, which conflicts with the model quantization. To resolve this conflict and further improve the efficiency of quantized convolution, we proposes SFC, a new algebra transform for fast convolution by extending the Discrete Fourier Transform (DFT) with symbolic computing, in which only additions are required to perform the transformation at specific transform points, avoiding the calculation of irrational number and reducing the requirement for precision. Additionally, we enhance convolution efficiency by introducing correction terms to convert invalid circular convolution outputs of the Fourier method into effective ones. The numerical error analysis is presented for the first time in this type of work and proves that our algorithms can provide a 3. 68× multiplication reduction for 3×3 convolution, while the Winograd algorithm only achieves a 2. 25× reduction with similarly low numerical errors. Experiments carried out on benchmarks and FPGA show that our new algorithms can further improve the computation efficiency of quantized models while maintaining accuracy, surpassing both the quantization-alone method and existing works on fast convolution quantization.

ICRA Conference 2022 Conference Paper

Prototype-Voxel Contrastive Learning for LiDAR Point Cloud Panoptic Segmentation

  • Minzhe Liu
  • Qiang Zhou
  • Hengshuang Zhao
  • Jianing Li 0001
  • Yuan Du
  • Kurt Keutzer
  • Li Du
  • Shanghang Zhang

LiDAR point cloud panoptic segmentation, including both semantic and instance segmentation, plays a critical role in meticulous scene understanding for autonomous driving. Existing 3D voxelized approaches either utilize 3D sparse convolution that only focuses on local scene understanding, or add extra and time-consuming PointNet branch to capture global feature structures. To address these limitations, we propose an end-to-end Prototype-Voxel Contrastive Learning (PVCL) framework for learning stable and discriminative semantic representations, which includes voxel-level and prototype-level contrastive learning (CL). The voxel-level CL decreases intra-class distance and increases inter-class distance among sample representations, while the prototype-level CL further reduces the dependence of CL on negative sampling and avoids the influence of outliers from the same class, enabling PVCL to be more effective for outdoor point cloud panoptic segmentation. Extensive experiments are conducted on the public point cloud panoptic segmentation datasets, Semantic-KITTI and nuScenes, where evaluations and ablation studies demonstrate PVCL achieves superior performance compared with the state-of-the-art. Our approach ranks the top on the public leaderboard of Semantic-KITTI at the time of submission, and surpasses the published 2nd rank, EfficientLPS, by 1. 7% in PQ.