Arrow Research search

Author name cluster

Li Shen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

98 papers
2 author rows

Possible papers

98

TMLR Journal 2026 Journal Article

Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning

  • Mingyu Cao
  • Gen Li
  • Jie Ji
  • Jiaqi Zhang
  • AJAY JAISWAL
  • Li Shen
  • Xiaolong Ma
  • Shiwei Liu

Mixture-of-Experts (MoE) has garnered significant attention for its ability to scale up neural networks while utilizing the same or even fewer active parameters. However, MoE does not alleviate the massive memory requirements of networks, which limits their practicality in real-world applications, especially in the era of large language models (LLMs). While recent work explores the possibility of removing entire layers of MoE to reduce memory, the performance degradation is still notable. In this paper, we propose ConDense-MoE (CD-MoE), which, instead of dropping the entire MoE layer, condenses the large, sparse MoE layer into a smaller, denser layer with only a few experts activated for all tokens, while maintaining hardware friendliness. Our approach is specifically designed for fine-grained MoE with shared experts, where Feed-Forward Networks are split into many small experts, with certain experts isolated to serve as shared experts that are always activated, such as DeepSeekMoE and QwenMoE. We demonstrate the effectiveness of our method. Specifically, for the DeepSeekMoE-16B model, our approach maintains 90% of the average accuracy while reducing memory usage by 27.5% and increasing inference speed by 1.26 times. Moreover, we show that by applying lightweight expert fine-tuning—only to the condensed layers—and using 5 hours on a single 80G A100 GPU, we can successfully recover 98% of the original performance.
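The condensation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name, the routing-statistics selection criterion, and the uniform averaging of kept experts are all assumptions.

```python
import numpy as np

def condense_moe(expert_outputs, routing_probs, shared_ids, k=2):
    """Hypothetical CD-MoE-style condensation: keep the always-on shared
    experts plus the k experts with the highest average routing
    probability, then activate them densely for every token (no router).

    expert_outputs: array (num_experts, num_tokens, dim)
    routing_probs:  array (num_tokens, num_experts)
    """
    avg_prob = routing_probs.mean(axis=0)                    # (num_experts,)
    ranked = [e for e in np.argsort(avg_prob)[::-1] if e not in shared_ids]
    kept = list(shared_ids) + ranked[:k]
    # Dense forward (illustrative): average the kept experts' outputs.
    dense_out = np.mean([expert_outputs[e] for e in kept], axis=0)
    return kept, dense_out
```

Because only a few experts remain active for all tokens, the router and the unused experts can be dropped entirely, which is where the memory and latency savings claimed in the abstract would come from.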

TMLR Journal 2026 Journal Article

Subspace based Federated Unlearning

  • Guanghao Li
  • Li Shen
  • Yan Sun
  • Yue Hu
  • Han Hu
  • Dacheng Tao

Federated learning (FL) enables collaborative machine learning among multiple clients while preserving user data privacy by preventing the exchange of local data. However, when users request to leave the FL system, the trained FL model may still retain information about their contributions. To comply with the right to be forgotten, federated unlearning has been proposed, which aims to remove a designated client's influence from the FL model. Existing federated unlearning methods typically rely on storing historical parameter updates, which may be impractical in resource-constrained FL settings. In this paper, we propose a Subspace-based Federated Unlearning method (SFU) that addresses this challenge without requiring additional storage. SFU updates the model via gradient ascent constrained within a subspace, specifically the orthogonal complement of the gradient descent directions derived from the remaining clients. By projecting the ascending gradient of the target client onto this subspace, SFU can mitigate the contribution of the target client while maintaining model performance on the remaining clients. SFU is communication-efficient, requiring only one round of local training per client to transmit gradient information to the server for model updates. Extensive empirical evaluations on multiple datasets demonstrate that SFU achieves competitive unlearning performance while preserving model utility. Compared to representative baseline methods, SFU consistently shows promising results under various experimental settings.
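The abstract's projection step can be illustrated with a small numerical sketch; the function name and the QR-based orthogonalization below are illustrative assumptions, not the authors' code.

```python
import numpy as np

def sfu_unlearning_step(target_grad, remaining_grads, lr=0.1):
    """Hypothetical SFU-style update: gradient *ascent* along the target
    client's gradient, restricted to the orthogonal complement of the
    subspace spanned by the remaining clients' descent directions."""
    B = np.stack(remaining_grads, axis=1)        # (d, k) descent directions
    Q, _ = np.linalg.qr(B)                       # orthonormal basis of span(B)
    # Remove the component of the target gradient inside that subspace.
    projected = target_grad - Q @ (Q.T @ target_grad)
    # Ascending along `projected` erases the target client's influence
    # while, to first order, not disturbing the remaining clients' loss.
    return lr * projected                        # parameter delta
```

By construction the returned delta is orthogonal to every remaining client's gradient, which is what lets a method like this unlearn without storing historical parameter updates.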

NeurIPS Conference 2025 Conference Paper

Ada-R1: Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization

  • Haotian Luo
  • Haiying He
  • Yibo Wang
  • Jinluan Yang
  • Rui Liu
  • Naiqiang Tan
  • Xiaochun Cao
  • Dacheng Tao

Long-thought reasoning models have recently achieved strong performance on complex reasoning tasks, but often incur substantial inference overhead, making efficiency a critical concern. Our empirical analysis reveals that the benefit of using Long-CoT varies across problems: while some problems require elaborate reasoning, others show no improvement, or even degraded accuracy. This motivates adaptive reasoning strategies that tailor reasoning depth to the input. However, prior work primarily reduces redundancy within long reasoning paths, limiting exploration of more efficient strategies beyond the Long-CoT paradigm. To address this, we propose a novel two-stage framework for adaptive and efficient reasoning. First, we construct a hybrid reasoning model by merging long and short CoT models to enable diverse reasoning styles. Second, we apply bi-level preference training to guide the model to select suitable reasoning styles (group-level) and to prefer concise, correct reasoning within each style group (instance-level). Experiments demonstrate that our method significantly reduces inference costs compared to baseline approaches while maintaining performance. Notably, on five mathematical datasets, the average reasoning length is reduced by more than 50%, highlighting the potential of adaptive strategies to optimize reasoning efficiency in large language models.

NeurIPS Conference 2025 Conference Paper

Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler

  • Zixuan Hu
  • Li Shen
  • Zhenyi Wang
  • Yongxian Wei
  • Dacheng Tao

Harmful fine-tuning poses critical safety risks to fine-tuning-as-a-service for large language models. Existing defense strategies preemptively build robustness via attack simulation but suffer from fundamental limitations: (i) the infeasibility of extending attack simulations beyond bounded threat models due to the inherent difficulty of anticipating unknown attacks, and (ii) limited adaptability to varying attack settings, as simulation fails to capture their variability and complexity. To address these challenges, we propose Bayesian Data Scheduler (BDS), an adaptive tuning-stage defense strategy with no need for attack simulation. BDS formulates harmful fine-tuning defense as a Bayesian inference problem, learning the posterior distribution of each data point's safety attribute, conditioned on the fine-tuning and alignment datasets. The fine-tuning process is then constrained by weighting data with their safety attributes sampled from the posterior, thus mitigating the influence of harmful data. By leveraging the post hoc nature of Bayesian inference, the posterior is conditioned on the fine-tuning dataset, enabling BDS to tailor its defense to the specific dataset, thereby achieving adaptive defense. Furthermore, we introduce a neural scheduler based on amortized Bayesian learning, enabling efficient transfer to new data without retraining. Comprehensive results across diverse attack and defense settings demonstrate the state-of-the-art performance of our approach. Code is available at https://github.com/Egg-Hu/Bayesian-Data-Scheduler.

ECAI Conference 2025 Conference Paper

AIRES: A General Framework for Efficient Intrinsic Rewards Based on Attention Mechanisms

  • Xin Liu
  • Jie Tan
  • Li Shen
  • Xu Wang
  • Guoli Wu
  • Xiaoguang Ren
  • Huadong Dai

Efficient exploration in high-dimensional observation spaces remains a critical challenge in deep reinforcement learning, particularly in scenarios with sparse extrinsic rewards. A promising approach is to encourage exploration by estimating intrinsic rewards based on the novelty of observations. However, there is a gap between the observed novelty and the actual effectiveness of exploration, as both environmental stochasticity and the agent’s actions may influence observations. To accurately evaluate the novelty contributed by agent exploration in intrinsic rewards, we propose the AIRES (Attention-driven Intrinsic Reward for Exploration Strategy) framework. AIRES leverages the attention mechanisms to analyze the relationship within trajectory sequences generated by agent-environment interactions, employing attention weights to quantify the relevance of observations to actions. By applying attention weights to intrinsic rewards, the novelty brought by agent exploration is enhanced and the impact of environmental stochasticity is reduced. Extensive experiments demonstrate that AIRES significantly enhances the performance of prominent intrinsic reward methods, establishing it as a robust and scalable solution for efficient exploration.

NeurIPS Conference 2025 Conference Paper

AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs

  • Di He
  • Songjun Tu
  • AJAY JAISWAL
  • Li Shen
  • Ganzhao Yuan
  • Shiwei Liu
  • Lu Yin

Weight decay is a standard regularization technique for training large language models (LLMs). While it is common to assign a uniform decay rate to every layer, this approach overlooks the structural diversity of LLMs and the varying spectral properties across modules. In this paper, we introduce AlphaDecay, a simple yet effective method that adaptively assigns different weight decay strengths to each module of an LLM. Our approach is guided by Heavy-Tailed Self-Regularization (HT-SR) theory, which analyzes the empirical spectral density (ESD) of weight correlation matrices to quantify “heavy-tailedness.” Modules exhibiting more pronounced heavy-tailed ESDs, reflecting stronger feature learning, are assigned weaker decay, while modules with lighter-tailed spectra receive stronger decay. Our method leverages tailored weight decay assignments to balance the module-wise differences in spectral properties, leading to improved performance. Extensive pre-training tasks with various model sizes from 60M to 1B demonstrate that AlphaDecay achieves better perplexity and generalization than conventional uniform decay and other adaptive decay baselines. The code is available at https://github.com/hed-ucas/AlphaDecay.
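A minimal sketch of the module-wise assignment, under two loud assumptions: the tail exponent is estimated with a crude Hill-style estimator on the eigenvalues of W^T W, and the mapping from exponent to decay strength is a simple linear rescaling. The paper's actual HT-SR machinery may differ on both points.

```python
import numpy as np

def hill_alpha(W, k_frac=0.1):
    """Crude Hill-style estimate of the ESD tail exponent of W^T W.
    Smaller alpha = heavier tail = (per HT-SR) stronger feature learning."""
    evals = np.sort(np.linalg.eigvalsh(W.T @ W))[::-1]
    k = max(2, int(len(evals) * k_frac))
    tail = evals[:k]
    return 1.0 + k / np.sum(np.log(tail / tail[-1]))

def assign_decay(alphas, wd_min=1e-3, wd_max=1e-1):
    """Linearly map per-module alphas into [wd_min, wd_max]:
    heavier-tailed modules (small alpha) get the weakest decay."""
    a = np.asarray(alphas, dtype=float)
    t = (a - a.min()) / max(a.max() - a.min(), 1e-12)
    return wd_min + t * (wd_max - wd_min)
```

In practice the per-module values would be fed into an optimizer's per-parameter-group weight decay settings rather than a single global decay rate.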

NeurIPS Conference 2025 Conference Paper

Analytic Energy-Guided Policy Optimization for Offline Reinforcement Learning

  • Jifeng Hu
  • Sili Huang
  • Zhejian Yang
  • Shengchao Hu
  • Li Shen
  • Hechang Chen
  • Lichao Sun
  • Yi Chang

Conditional decision generation with diffusion models has shown powerful competitiveness in reinforcement learning (RL). Recent studies reveal the relation between energy-function-guidance diffusion models and constrained RL problems. The main challenge lies in estimating the intermediate energy, which is intractable due to the log-expectation formulation during the generation process. To address this issue, we propose Analytic Energy-guided Policy Optimization (AEPO). Specifically, we first provide a theoretical analysis and the closed-form solution of the intermediate guidance when the diffusion model obeys the conditional Gaussian transformation. Then, we analyze the posterior Gaussian distribution in the log-expectation formulation and obtain the target estimation of the log-expectation under mild assumptions. Finally, we train an intermediate energy neural network to approximate this target estimation. We evaluate AEPO on more than 30 offline RL tasks; extensive experiments illustrate that our method surpasses numerous representative baselines on the D4RL offline reinforcement learning benchmarks.

TMLR Journal 2025 Journal Article

Are Large Language Models Really Robust to Word-Level Perturbations?

  • Haoyu Wang
  • Guozheng Ma
  • Cong Yu
  • Ning Gui
  • Linrui Zhang
  • Zhiqi Huang
  • Suwei Ma
  • Yongzhe Chang

The swift advancement in the scale and capabilities of Large Language Models (LLMs) positions them as promising tools for a variety of downstream tasks. In addition to pursuing better performance and avoiding violent feedback on certain prompts, much attention has been drawn to the robustness of LLMs to ensure their responsible use. However, existing evaluation methods mostly rely on traditional question answering datasets with predefined supervised labels, potentially ignoring the superior generation capabilities of contemporary LLMs. To investigate the robustness of LLMs while exercising their generation ability, we propose a novel rational evaluation pipeline that leverages reward models as diagnostic tools to evaluate long conversations generated by LLMs from more challenging open questions, which we refer to as the Reward Model for Reasonable Robustness Evaluation (TREvaL). Longer conversations manifest a language model's comprehensive grasp of language and its proficiency in understanding questions, a capability not entirely encompassed by individual words or letters. Our extensive empirical experiments demonstrate that TREvaL identifies the lack of robustness in current LLMs. Notably, we are surprised to discover that robustness tends to decrease as fine-tuning (SFT and RLHF) is conducted, calling for more attention to robustness during the alignment process.

NeurIPS Conference 2025 Conference Paper

CHPO: Constrained Hybrid-action Policy Optimization for Reinforcement Learning

  • Ao Zhou
  • Jiayi Guan
  • Li Shen
  • Fan Lu
  • Sanqing Qu
  • Junqiao Zhao
  • Ziqiao Wang
  • Ya Wu

Constrained hybrid-action reinforcement learning (RL) promises to learn a safe policy within a parameterized action space, which is particularly valuable for safety-critical applications involving discrete-continuous hybrid action spaces. However, existing hybrid-action RL algorithms primarily focus on reward maximization, which faces significant challenges for tasks involving both cost constraints and hybrid action spaces. In this work, we propose a novel Constrained Hybrid-action Policy Optimization algorithm (CHPO) to address the problems of constrained hybrid-action RL. Concretely, we rethink the limitations of hybrid-action RL in handling safe tasks with parameterized action spaces and reframe the objective of constrained hybrid-action RL by introducing the concept of Constrained Parameterized-action Markov Decision Process (CPMDP). Subsequently, we present a constrained hybrid-action policy optimization algorithm to confront the constrained hybrid-action problems and conduct theoretical analyses demonstrating that the CHPO converges to the optimal solution while satisfying safety constraints. Finally, extensive experiments demonstrate that the CHPO achieves competitive performance across multiple experimental tasks.

EAAI Journal 2025 Journal Article

Code-switching finetuning: Bridging multilingual pretrained language models for enhanced cross-lingual performance

  • Changtong Zan
  • Liang Ding
  • Li Shen
  • Yu Cao
  • Weifeng Liu

In recent years, the development of pre-trained models has significantly propelled advancements in natural language processing. However, multilingual sequence-to-sequence pretrained language models (Seq2Seq PLMs) are pretrained on a wide range of languages (e.g., 25 languages), yet often finetuned for specific bilingual tasks (e.g., English–German), leading to domain and task discrepancies between the pretraining and finetuning stages, which may result in sub-optimal downstream performance. In this study, we first illustratively reveal such domain and task discrepancies, and then conduct an in-depth investigation into the side effects that these discrepancies may have on both training dynamics and downstream performance. To alleviate those side effects, we introduce a simple and effective code-switching restoration task (namely code-switching finetuning) into the standard pretrain-finetune pipeline. Specifically, in the first stage, we recast the downstream data into the self-supervised format used for pretraining, in which the denoising signal is the code-switched cross-lingual phrase. Then, the model is finetuned on the downstream task as usual in the second stage. Experiments spanning both natural language generation (12 supervised translations, 30 zero-shot translations, and 2 cross-lingual summarization tasks) and understanding (7 cross-lingual natural language inference tasks) tasks demonstrate that our model consistently and significantly surpasses the standard finetuning strategy. Analyses show that our method introduces negligible computational cost and reduces cross-lingual representation gaps. We have made the code publicly available at: https://github.com/zanchangtong/CSR4mBART.

NeurIPS Conference 2025 Conference Paper

Continual Model Merging without Data: Dual Projections for Balancing Stability and Plasticity

  • Enneng Yang
  • Anke Tang
  • Li Shen
  • Guibing Guo
  • Xingwei Wang
  • Xiaochun Cao
  • Jie Zhang

Model merging integrates multiple expert models with diverse capabilities into a unified framework, facilitating collaborative learning. However, most existing methods assume simultaneous access to all models, which is often impractical in real-world scenarios where models are received sequentially. While some studies have investigated continual model merging (CMM)--which involves sequentially merging multiple models--the challenge of balancing prior knowledge (stability) and incorporating new tasks (plasticity) remains unresolved. This paper, for the first time, formally defines the stability and plasticity of CMM from the perspective of orthogonal projection. Subsequently, we analyze the relationships among the spaces spanned by task data, historical gradients, and accumulated gradients. Building on this, we propose a data-free Dual Orthogonal Projection (DOP) method, which eliminates data dependence and mitigates interference between the merged model and models for old and new tasks by projecting their parameter differences onto their respective approximate data spaces. Finally, to solve potential conflicts between stability and plasticity, we reformulate DOP as a multi-objective optimization problem and employ a multi-gradient descent algorithm to obtain a Pareto-optimal solution. Extensive experiments across multiple architectures and task configurations validate that our approach significantly outperforms state-of-the-art CMM methods.

AAAI Conference 2025 Conference Paper

Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models

  • Wenbin Wang
  • Liang Ding
  • Minyan Zeng
  • Xiabin Zhou
  • Li Shen
  • Yong Luo
  • Wei Yu
  • Dacheng Tao

Multimodal large language models (MLLMs) have experienced significant advancements recently, but still struggle to recognize and interpret intricate details in high-resolution (HR) images effectively. While state-of-the-art (SOTA) MLLMs claim to process images at 4K resolution, existing MLLM benchmarks only support up to 2K, leaving the capabilities of SOTA models on true HR images largely untested. Furthermore, existing methods for enhancing HR image perception in MLLMs rely on computationally expensive visual instruction tuning. To address these limitations, we introduce HR-Bench, the first deliberately designed benchmark to rigorously evaluate MLLM performance on 4K & 8K images. Through extensive experiments, we demonstrate that while downsampling HR images leads to vision information loss, leveraging complementary modalities, e.g., text, can effectively compensate for this loss. Building upon this insight, we propose Divide, Conquer and Combine, a novel training-free framework for enhancing MLLM perception of HR images. Our method follows a three-staged approach: 1) Divide: recursively partitioning the HR image into patches and merging similar patches to minimize computational overhead, 2) Conquer: leveraging the MLLM to generate accurate textual descriptions for each image patch, and 3) Combine: utilizing the generated text descriptions to enhance the MLLM's understanding of the overall HR image. Extensive experiments show that: 1) the SOTA MLLM achieves 63% accuracy, which is markedly lower than the 87% accuracy achieved by humans on HR-Bench; 2) our method brings consistent and significant improvements (a relative increase of +6% on HR-Bench and +8% on general multimodal benchmarks).

NeurIPS Conference 2025 Conference Paper

Effective Policy Learning for Multi-Agent Online Coordination Beyond Submodular Objectives

  • Qixin Zhang
  • Yan Sun
  • Can Jin
  • Xikun Zhang
  • Yao Shu
  • Puning Zhao
  • Li Shen
  • Dacheng Tao

In this paper, we present two effective policy learning algorithms for the multi-agent online coordination (MA-OC) problem. The first one, MA-SPL, not only achieves the optimal $(1-\frac{c}{e})$-approximation guarantee for the MA-OC problem with submodular objectives but also handles the unexplored $\alpha$-weakly DR-submodular and $(\gamma, \beta)$-weakly submodular scenarios, where $c$ is the curvature of the investigated submodular functions, $\alpha$ denotes the diminishing-return (DR) ratio, and the tuple $(\gamma, \beta)$ represents the submodularity ratios. Subsequently, to reduce the reliance on the unknown parameters $\alpha, \gamma, \beta$ inherent in the MA-SPL algorithm, we introduce a second online algorithm named MA-MPL. This MA-MPL algorithm is entirely parameter-free and maintains the same approximation ratio as MA-SPL. The core of both algorithms is a novel continuous-relaxation technique termed policy-based continuous extension. Compared with the well-established multi-linear extension, a notable advantage of this new policy-based continuous extension is its ability to provide a lossless rounding scheme for any set function, thereby enabling us to tackle challenging weakly submodular objective functions. Finally, extensive simulations are conducted to demonstrate the effectiveness of our proposed algorithms.

NeurIPS Conference 2025 Conference Paper

Efficient Federated Learning against Byzantine Attacks and Data Heterogeneity via Aggregating Normalized Gradients

  • Shiyuan Zuo
  • Xingrun Yan
  • Rongfei Fan
  • Li Shen
  • Puning Zhao
  • Jie Xu
  • Han Hu

Federated Learning (FL) enables multiple clients to collaboratively train models without sharing raw data, but is vulnerable to Byzantine attacks and data heterogeneity, which can severely degrade performance. Existing Byzantine-robust approaches tackle data heterogeneity, but incur high computational overhead during gradient aggregation, thereby slowing down the training process. To address this issue, we propose a simple yet effective Federated Normalized Gradients Algorithm (Fed-NGA), which performs aggregation by merely computing the weighted mean of the normalized gradients from each client. This approach yields a favorable time complexity of $\mathcal{O}(pM)$, where $p$ is the model dimension and $M$ is the number of clients. We rigorously prove that Fed-NGA is robust to both Byzantine faults and data heterogeneity. For non-convex loss functions, Fed-NGA achieves convergence to a neighborhood of stationary points under general assumptions, and further attains zero optimality gap under some mild conditions, which is an outcome rarely achieved in existing literature. In both cases, the convergence rate is $\mathcal{O}(1/T^{\frac{1}{2} - \delta})$, where $T$ denotes the number of iterations and $\delta \in (0, 1/2)$. Experimental results on benchmark datasets confirm the superior time efficiency and convergence performance of Fed-NGA over existing methods.
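The aggregation rule is simple enough to sketch directly; the function name and the uniform default weights below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def fed_nga_aggregate(client_grads, weights=None):
    """Hypothetical Fed-NGA-style aggregation: weighted mean of *normalized*
    gradients. Normalizing first caps every client's influence at its
    weight, so a Byzantine client cannot dominate by scaling its update."""
    grads = [np.asarray(g, dtype=float) for g in client_grads]
    M = len(grads)
    w = np.full(M, 1.0 / M) if weights is None else np.asarray(weights, float)
    # O(pM): one normalization plus one weighted sum over M p-dim gradients.
    normed = [g / max(np.linalg.norm(g), 1e-12) for g in grads]
    return sum(wi * gi for wi, gi in zip(w, normed))
```

Scaling any single client's gradient by an arbitrary positive factor leaves the aggregate unchanged, which is the intuition behind the Byzantine-robustness claim.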

JMLR Journal 2025 Journal Article

FusionBench: A Unified Library and Comprehensive Benchmark for Deep Model Fusion

  • Anke Tang
  • Li Shen
  • Yong Luo
  • Enneng Yang
  • Han Hu
  • Lefei Zhang
  • Bo Du
  • Dacheng Tao

Deep model fusion is an emerging technique that unifies the predictions or parameters of several deep neural networks into a single better-performing model in a cost-effective and data-efficient manner. Although a variety of deep model fusion techniques have been introduced, their evaluations tend to be inconsistent and often inadequate to validate their effectiveness and robustness. We present FusionBench, the first benchmark and a unified library designed specifically for deep model fusion. Our benchmark consists of multiple tasks, each with different settings of models and datasets. This variety allows us to compare fusion methods across different scenarios and model scales. Additionally, FusionBench serves as a unified library for easy implementation and testing of new fusion techniques. FusionBench is open source and actively maintained, with community contributions encouraged.

IJCAI Conference 2025 Conference Paper

Hypernetwork Aggregation for Decentralized Personalized Federated Learning

  • Weishi Li
  • Yong Peng
  • Mengyao Du
  • Fuhui Sun
  • Xiaoyan Wang
  • Li Shen

Personalized Federated Learning (PFL) meets each user’s personalized needs but still faces high communication costs due to large data transmission volumes and frequent communication. Decentralized PFL (DPFL) discards the central server of PFL, reducing communication pressure and the risk of server failure by using peer-to-peer communication. Nevertheless, DPFL still suffers from significant communication pressure due to the transmission of a large number of model parameters, especially with numerous nodes. To address these issues, we propose a novel personalized framework, DFedHP, in which each client utilizes a hypernetwork to generate the shared part of the model parameters and trains the personalized parameters separately. The number of parameters in a hypernetwork is much smaller than in a typical local network, so hypernetwork aggregation reduces communication costs and the risk of privacy leakage. Furthermore, DFedHP can seamlessly integrate into existing DPFL algorithms as a plugin to boost their efficacy. Finally, extensive experiments in various data-heterogeneous environments demonstrate that DFedHP can reduce communication costs, accelerate convergence, and improve generalization performance compared with state-of-the-art (SOTA) baselines.

AAAI Conference 2025 Conference Paper

Image-to-video Adaptation with Outlier Modeling and Robust Self-learning

  • Junbao Zhuo
  • Shuhui Wang
  • Zhenghan Chen
  • Li Shen
  • Qingming Huang
  • Huimin Ma

The image-to-video adaptation task seeks to effectively harness both labeled images and unlabeled videos to achieve effective video recognition. The modality gap between images and videos and the domain discrepancy across the two domains are the two essential challenges in this task. Existing methods reduce the domain discrepancy via closed-set domain adaptation techniques, resulting in inaccurate domain alignment because outlier target frames exist. To tackle this issue, we extend the vanilla classifier with outlier classes, where each outlier class is responsible for capturing outlier frames of a specific class via a batch nuclear norm maximization loss. We further propose a new loss that treats source images outside class c as instances of the outlier class specific to c. As for the modality gap, existing methods usually utilize pseudo labels obtained from an image-level adapted model to learn a video-level model, and rare efforts are dedicated to handling the noise in these pseudo labels. We propose a new metric based on label propagation consistency to select samples for training a better video-level model. Experiments on 3 benchmarks validate the effectiveness of our method.

NeurIPS Conference 2025 Conference Paper

Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation

  • Fei Wang
  • Li Shen
  • Liang Ding
  • Chao Xue
  • Ye Liu
  • Changxing Ding

Large Language Models (LLMs) excel at natural language processing tasks, but their massive size leads to high computational and storage demands. Recent works have sought to reduce model size through layer-wise structured pruning, but they tend to overlook preserving the capabilities of the pruned layers. In this work, we re-examine structured pruning paradigms and uncover several key limitations: 1) notable performance degradation due to direct layer removal, 2) incompetent linear weighted layer aggregation, and 3) the lack of effective post-training recovery mechanisms. To address these limitations, we propose CoMe, comprising a progressive layer pruning framework with a Concatenation-based Merging technique and a hierarchical distillation post-training process. Specifically, we introduce a channel sensitivity metric that utilizes activation intensity and weight norms for fine-grained channel selection. Subsequently, we employ a concatenation-based layer merging method to fuse the most critical channels in adjacent layers, enabling progressive model size reduction. Finally, we propose a hierarchical distillation protocol that leverages the correspondences between the original and pruned model layers established during pruning, enabling efficient knowledge transfer. Experiments on seven benchmarks show that CoMe achieves state-of-the-art performance; when pruning 30% of LLaMA-2-7B's parameters, the pruned model retains 83% of its original average accuracy.
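A toy sketch of the channel-selection step, assuming the sensitivity metric is simply the product of mean activation magnitude and the consuming weight column's norm; the abstract names both cues but does not specify the exact formula, so this combination is an assumption.

```python
import numpy as np

def channel_sensitivity(activations, weight):
    """Hypothetical channel-sensitivity score combining the two cues named
    in the abstract: activation intensity and weight norms.

    activations: (num_samples, d_in) inputs to a layer
    weight:      (d_out, d_in) weight matrix consuming those channels
    """
    act_intensity = np.abs(activations).mean(axis=0)   # (d_in,)
    col_norms = np.linalg.norm(weight, axis=0)         # (d_in,)
    return act_intensity * col_norms

def select_channels(activations, weight, keep_ratio=0.5):
    """Indices of the most critical channels to keep when
    concatenation-merging adjacent layers."""
    scores = channel_sensitivity(activations, weight)
    k = max(1, int(len(scores) * keep_ratio))
    return np.sort(np.argsort(scores)[::-1][:k])
```

The selected indices would then drive which channels of two adjacent layers are concatenated into the merged layer, with the rest discarded.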

NeurIPS Conference 2025 Conference Paper

Merging on the Fly Without Retraining: A Sequential Approach to Scalable Continual Model Merging

  • Anke Tang
  • Enneng Yang
  • Li Shen
  • Yong Luo
  • Han Hu
  • Lefei Zhang
  • Bo Du
  • Dacheng Tao

Deep model merging represents an emerging research direction that combines multiple fine-tuned models to harness their specialized capabilities across different tasks and domains. Current model merging techniques focus on merging all available models simultaneously, with weight interpolation-based methods being the predominant approach. However, these conventional approaches are not well-suited for scenarios where models become available sequentially, and they often suffer from high memory requirements and potential interference between tasks. In this study, we propose a training-free projection-based continual merging method that processes models sequentially through orthogonal projections of weight matrices and adaptive scaling mechanisms. Our method operates by projecting new parameter updates onto subspaces orthogonal to existing merged parameter updates while using an adaptive scaling mechanism to maintain stable parameter distances, enabling efficient sequential integration of task-specific knowledge. Our approach maintains constant memory complexity with respect to the number of models, minimizes interference between tasks through orthogonal projections, and retains the performance of previously merged models through adaptive task vector scaling. Extensive experiments on CLIP-ViT models demonstrate that our method achieves a 5-8% average accuracy improvement while maintaining robust performance under different task orderings. Code is publicly available at https://github.com/tanganke/opcm.
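The sequential merge the abstract describes can be sketched on flattened weight deltas. This is a simplified illustration: the orthogonal projection matches the stated idea, but the "adaptive scaling" here is reduced to restoring the update's norm, which the actual method may implement differently.

```python
import numpy as np

def continual_merge(merged, new_update, history):
    """Hypothetical projection-based continual merge of one new task
    vector (flattened fine-tuned-minus-base weights) into `merged`."""
    if history:
        # Orthonormal basis of the subspace spanned by past updates.
        Q, _ = np.linalg.qr(np.stack(history, axis=1))
        proj = new_update - Q @ (Q.T @ new_update)   # orthogonal component
    else:
        proj = new_update
    # "Adaptive scaling" (simplified): keep the parameter distance stable
    # by restoring the new update's original norm.
    n = np.linalg.norm(proj)
    if n > 1e-12:
        proj = proj * (np.linalg.norm(new_update) / n)
    history.append(new_update)
    return merged + proj
```

Note that memory stays constant in the number of models only if the history is itself compressed (e.g., into a fixed-rank basis); the growing list here is for clarity of the projection step.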

NeurIPS Conference 2025 Conference Paper

Mix Data or Merge Models? Balancing the Helpfulness, Honesty, and Harmlessness of Large Language Model via Model Merging

  • Jinluan Yang
  • Dingnan Jin
  • Anke Tang
  • Li Shen
  • Didi Zhu
  • Zhengyu Chen
  • Ziyu Zhao
  • Daixin Wang

Achieving balanced alignment of large language models (LLMs) in terms of Helpfulness, Honesty, and Harmlessness (3H optimization) constitutes a cornerstone of responsible AI. Existing methods like data mixture strategies face limitations, including heavy reliance on expert knowledge and conflicting optimization signals. While model merging offers parameter-level conflict-resolution strategies through integrating specialized models' parameters, its potential for 3H optimization remains underexplored. This paper systematically compares the effectiveness of model merging and data mixture methods in constructing 3H-aligned LLMs for the first time, revealing previously overlooked collaborative and conflict relationships among the 3H dimensions and discussing the advantages and drawbacks of data mixture (data-level) and model merging (parameter-level) methods in mitigating the conflicts for balanced 3H optimization. Specifically, we propose a novel Reweighting Enhanced task Singular Merging method, RESM, which uses outlier weighting and sparsity-aware rank selection strategies to address the challenges of preference noise accumulation and layer sparsity adaptation inherent in 3H-aligned LLM merging. Extensive evaluations verify the effectiveness and robustness of RESM compared to previous data mixture (2%-5% gain) and model merging (1%-3% gain) methods in achieving balanced LLM alignment.

NeurIPS Conference 2025 Conference Paper

MixPrompt: Efficient Mixed Prompting for Multimodal Semantic Segmentation

  • Zhiwei Hao
  • Zhongyu Xiao
  • Jianyuan Guo
  • Li Shen
  • Yong Luo
  • Han Hu
  • Dan Zeng

Recent advances in multimodal semantic segmentation show that incorporating auxiliary inputs—such as depth or thermal images—can significantly improve performance over single-modality (RGB-only) approaches. However, most existing solutions rely on parallel backbone networks and complex fusion modules, greatly increasing model size and computational demands. Inspired by prompt tuning in large language models, we introduce MixPrompt: a prompting-based framework that integrates auxiliary modalities into a pretrained RGB segmentation model without modifying its architecture. MixPrompt uses a lightweight prompting module to extract and fuse information from auxiliary inputs into the main RGB backbone. This module is initialized using the early layers of a pretrained RGB feature extractor, ensuring a strong starting point. At each backbone layer, MixPrompt aligns RGB and auxiliary features in multiple low-rank subspaces, maximizing information use with minimal parameter overhead. An information mixing scheme enables cross-subspace interaction for further performance gains. During training, only the prompting module and segmentation head are updated, keeping the RGB backbone frozen for parameter efficiency. Experiments across NYU Depth V2, SUN-RGBD, MFNet, and DELIVER datasets show that MixPrompt achieves improvements of 4.3, 1.1, 0.4, and 1.1 mIoU, respectively, over two-branch baselines, while using nearly half the parameters. MixPrompt also outperforms recent prompting-based methods under similar compute budgets.

NeurIPS Conference 2025 Conference Paper

Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

  • Huanjin Yao
  • Jiaxing Huang
  • Wenhao Wu
  • Jingyi Zhang
  • Yibo Wang
  • Shunyu Liu
  • Yingjie Wang
  • YuXin Song

In this work, we aim to develop an MLLM that understands and solves questions by learning to create each intermediate step of the reasoning involved until the final answer. To this end, we propose Collective Monte Carlo Tree Search (CoMCTS), a new learning-to-reason method for MLLMs, which introduces the concept of collective learning into "tree search" for effective and efficient reasoning-path searching and learning. The core idea of CoMCTS is to leverage collective knowledge from multiple models to collaboratively conjecture, search and identify effective reasoning paths toward correct answers via four iterative operations including Expansion, Simulation and Error Positioning, Backpropagation, and Selection. Using CoMCTS, we construct Mulberry-260k, a multimodal dataset with a tree of rich, explicit and well-defined reasoning nodes for each question. With Mulberry-260k, we perform collective SFT to train our model, Mulberry, a series of MLLMs with o1-like step-by-step Reasoning and Reflection capabilities. Extensive experiments demonstrate the superiority of our proposed methods on various benchmarks. Code is available at https://github.com/HJYao00/Mulberry.

NeurIPS Conference 2025 Conference Paper

On the Empirical Power of Goodness-of-Fit Tests in Watermark Detection

  • Weiqing He
  • Xiang Li
  • Tianqi Shang
  • Li Shen
  • Weijie Su
  • Qi Long

Large language models (LLMs) raise concerns about content authenticity and integrity because they can generate human-like text at scale. Text watermarks, which embed detectable statistical signals into generated text, offer a provable way to verify content origin. Many detection methods rely on pivotal statistics that are i.i.d. under human-written text, making goodness-of-fit (GoF) tests a natural tool for watermark detection. However, GoF tests remain largely underexplored in this setting. In this paper, we systematically evaluate eight GoF tests across three popular watermarking schemes, using three open-source LLMs, two datasets, various generation temperatures, and multiple post-editing methods. We find that general GoF tests can improve both the detection power and robustness of watermark detectors. Notably, we observe that text repetition, common in low-temperature settings, gives GoF tests a unique advantage not exploited by existing methods. Our results highlight that classic GoF tests are a simple yet powerful and underused tool for watermark detection in LLMs.
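The detection idea can be sketched with a minimal example: under the null (human-written text), pivotal statistics of several watermarking schemes are i.i.d. Uniform(0,1), so a goodness-of-fit test against the uniform law serves as a detector. The statistics below are simulated stand-ins (Beta-skewed values playing the role of watermarked text), not any real scheme's output, and the KS statistic is computed by hand to stay self-contained.

```python
import numpy as np

def ks_uniform(x):
    """One-sample Kolmogorov-Smirnov statistic against Uniform(0,1)."""
    x = np.sort(x)
    n = len(x)
    ecdf_hi = np.arange(1, n + 1) / n   # ECDF just after each point
    ecdf_lo = np.arange(0, n) / n       # ECDF just before each point
    return max(np.max(ecdf_hi - x), np.max(x - ecdf_lo))

rng = np.random.default_rng(1)
human = rng.uniform(0.0, 1.0, size=500)      # null: uniform pivotal stats
watermarked = rng.beta(3.0, 1.0, size=500)   # simulated skew toward 1

d_human = ks_uniform(human)
d_marked = ks_uniform(watermarked)
print(d_human, d_marked)  # the watermarked sample deviates far more
```

A large KS statistic relative to its null distribution flags the text as watermarked; the paper evaluates eight such GoF statistics rather than this single one.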

NeurIPS Conference 2025 Conference Paper

Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation

  • Yibo Wang
  • Tiansheng Huang
  • Li Shen
  • Huanjin Yao
  • Haotian Luo
  • Rui Liu
  • Naiqiang Tan
  • Jiaxing Huang

Harmful fine-tuning attacks introduce significant security risks to fine-tuning services. Mainstream defenses aim to vaccinate the model such that a later harmful fine-tuning attack is less effective. However, our evaluation results show that such defenses are fragile: with a few fine-tuning steps, the model can still learn the harmful knowledge. To this end, we conduct further experiments and find that an embarrassingly simple solution, adding purely random perturbations to the fine-tuned model, can recover the model from harmful behaviors, though it leads to a degradation in the model's fine-tuning performance. To address the degradation of fine-tuning performance, we further propose Panacea, which optimizes an adaptive perturbation that will be applied to the model after fine-tuning. Panacea maintains the model's safety alignment performance without compromising downstream fine-tuning performance. Comprehensive experiments are conducted on different harmful ratios, fine-tuning tasks and mainstream LLMs, where the average harmful scores are reduced by up to 21.2%, while maintaining fine-tuning performance. As a by-product, we analyze the adaptive perturbation and show that different layers in various LLMs have distinct safety coefficients. Source code available at https://github.com/w-yibo/Panacea.

AAMAS Conference 2025 Conference Paper

Prompt Tuning with Diffusion for Few-Shot Pre-trained Policy Generalization

  • Shengchao Hu
  • Wanru Zhao
  • Weixiong Lin
  • Li Shen
  • Ya Zhang
  • Dacheng Tao

Offline reinforcement learning (RL) methods harness previous experiences to derive an optimal policy, forming the foundation for pretrained large-scale models (PLMs). When adapting to novel tasks, PLMs leverage expert trajectories as prompts to accelerate adaptation. While various prompt-tuning techniques aim to improve prompt quality, their effectiveness is often limited by initialization constraints, restricting exploration and potentially leading to suboptimal solutions. To eliminate dependence on the initial prompt, we reframe prompt-tuning as conditional generative modeling, where prompts are generated from random noise. Our proposed Prompt Diffuser employs a conditional diffusion model to generate high-quality prompts. Central to our framework is trajectory reconstruction and the seamless integration of downstream task guidance during training. Experimental results validate Prompt Diffuser’s effectiveness, demonstrating strong performance in meta-RL tasks.

TMLR Journal 2025 Journal Article

QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning

  • Yilun Kong
  • Hangyu Mao
  • Zhao Qi
  • Bin Zhang
  • Jingqing Ruan
  • Li Shen
  • Yongzhe Chang
  • Xueqian Wang

Prompt engineering has demonstrated remarkable success in enhancing the performance of large language models (LLMs) across diverse tasks. However, most existing prompt optimization methods only focus on the task-level performance, overlooking the importance of query-preferred prompts, which leads to suboptimal performances. Additionally, these methods rely heavily on frequent interactions with LLMs to obtain feedback for guiding the optimization process, incurring substantial redundant interaction costs. In this paper, we introduce Query-dependent Prompt Optimization (QPO), which leverages multi-loop offline reinforcement learning to iteratively fine-tune a small pretrained language model to generate optimal prompts tailored to the input queries, thus significantly improving the prompting effect on the large target LLM. We derive insights from offline prompting demonstration data, which already exists in large quantities as a by-product of benchmarking diverse prompts on open-sourced tasks, thereby circumventing the expenses of online interactions. Furthermore, we continuously augment the offline dataset with the generated prompts in each loop, as the prompts from the fine-tuned model are supposed to outperform the source prompts in the original dataset. These iterative loops bootstrap the model towards generating optimal prompts. Experiments on various LLM scales and diverse NLP and math tasks demonstrate the efficacy and cost-efficiency of our method in both zero-shot and few-shot scenarios.

NeurIPS Conference 2025 Conference Paper

R1-ShareVL: Incentivizing Reasoning Capabilities of Multimodal Large Language Models via Share-GRPO

  • Huanjin Yao
  • Qixiang Yin
  • Jingyi Zhang
  • Min Yang
  • Yibo Wang
  • Wenhao Wu
  • Fei Su
  • Li Shen

In this work, we aim to incentivize the reasoning ability of Multimodal Large Language Models (MLLMs) via reinforcement learning (RL) and develop an effective approach that mitigates the sparse reward and advantage vanishing issues during RL. To this end, we propose Share-GRPO, a novel RL approach that tackles these issues by exploring and sharing diverse reasoning trajectories over an expanded question space. Specifically, Share-GRPO first expands the question space for a given question via data transformation techniques, then encourages the MLLM to effectively explore diverse reasoning trajectories over the expanded question space and shares the discovered reasoning trajectories across the expanded questions during RL. In addition, Share-GRPO also shares reward information during advantage computation, which estimates solution advantages hierarchically across and within question variants, allowing more accurate estimation of relative advantages and improving the stability of policy training. Extensive evaluations over 6 widely-used reasoning benchmarks showcase the superior performance of our method. Code is available at https://github.com/HJYao00/R1-ShareVL.

NeurIPS Conference 2025 Conference Paper

Robust Policy Expansion for Offline-to-Online RL under Diverse Data Corruption

  • Longxiang He
  • Deheng Ye
  • Junbo Tan
  • Xueqian Wang
  • Li Shen

Pretraining a policy on offline data followed by fine-tuning through online interactions, known as Offline-to-Online Reinforcement Learning (O2O RL), has emerged as a promising paradigm for real-world RL deployment. However, both offline datasets and online interactions in practical environments are often noisy or even maliciously corrupted, severely degrading the performance of O2O RL. Existing works primarily focus on mitigating the conservatism of offline policies via online exploration, while the robustness of O2O RL under data corruption, including states, actions, rewards, and dynamics, remains unexplored. In this work, we observe that data corruption induces heavy-tailed behavior in the policy, thereby substantially degrading the efficiency of online exploration. To address this issue, we incorporate Inverse Probability Weighting (IPW) into the online exploration policy to alleviate heavy-tailedness, and propose a novel, simple yet effective method termed RPEX: Robust Policy EXpansion. Extensive experimental results on D4RL datasets demonstrate that RPEX achieves SOTA O2O performance across a wide range of data corruption scenarios.

NeurIPS Conference 2025 Conference Paper

RobustMerge: Parameter-Efficient Model Merging for MLLMs with Direction Robustness

  • Fanhu Zeng
  • Haiyang Guo
  • Fei Zhu
  • Li Shen
  • Hao Tang

Fine-tuning pre-trained models with custom data leads to numerous expert models on specific tasks. Merging models into one universal model to empower multi-task ability refraining from data leakage has gained popularity. With the expansion in data and model size, parameter-efficient tuning becomes the common practice for obtaining task-specific models efficiently. However, few methods are dedicated to efficient merging, and existing methods designed for full fine-tuning merging fail under efficient merging. To address the issue, we analyze from low-rank decomposition and reveal that direction robustness during merging is crucial for merging efficient modules. We furthermore uncover that compensating for the gap between stark singular values contributes to direction robustness. Therefore, we propose RobustMerge, a training-free parameter-efficient merging method with complementary parameter adaptation to maintain direction robustness. Specifically, we (1) prune parameters and scale coefficients from inter-parameter relations for singular values to maintain direction stability away from task interference, and (2) perform cross-task normalization to enhance unseen task generalization. We establish a benchmark consisting of diverse multimodal tasks, on which we conduct experiments to certify the outstanding performance and generalizability of our method. Additional studies and extensive analyses further showcase the effectiveness.

NeurIPS Conference 2025 Conference Paper

RoMa: A Robust Model Watermarking Scheme for Protecting IP in Diffusion Models

  • Yingsha Xie
  • Rui Min
  • Zeyu Qin
  • Fei Ma
  • Li Shen
  • Fei Yu
  • Xiaochun Cao

Preserving intellectual property (IP) within a pre-trained diffusion model is critical for protecting the model's copyright and preventing unauthorized model deployment. In this regard, model watermarking is a common practice for IP protection that embeds traceable information within models and allows for further verification. Nevertheless, existing watermarking schemes often face challenges due to their vulnerability to fine-tuning, limiting their practical application in general pre-training and fine-tuning paradigms. Inspired by using mode connectivity to analyze model performance between a pair of connected models, we investigate watermark vulnerability by leveraging Linear Mode Connectivity (LMC) as a proxy to analyze the fine-tuning dynamics of watermark performance. Our results show that existing watermarked models tend to converge to sharp minima in the loss landscape, thus making them vulnerable to fine-tuning. To tackle this challenge, we propose RoMa, a Robust Model watermarking scheme that improves the robustness of watermarks against fine-tuning. Specifically, RoMa decomposes watermarking into two components, including Embedding Functionality, which preserves reliable watermark detection capability, and Path-specific Smoothness, which enhances the smoothness along the watermark-connected path to improve robustness. Extensive experiments on benchmark datasets MS-COCO-2017 and CUB-200-2011 demonstrate that RoMa significantly improves watermark robustness against fine-tuning while maintaining generation quality, outperforming baselines. The code is available at https://github.com/xiekks/RoMa.

NeurIPS Conference 2025 Conference Paper

Self-Verification Provably Prevents Model Collapse in Recursive Synthetic Training

  • Shi Fu
  • Yingjie Wang
  • Yuzhu Chen
  • Li Shen
  • Dacheng Tao

Large generative models are increasingly trained on synthetic data from earlier generations, raising concerns about model collapse, a progressive performance decline consistently observed in empirical studies. However, theoretical understanding of recursive training dynamics and their failure modes remains limited. In this work, we theoretically show that recursive training inherently leads to exponential error growth unless mitigated by sufficient real data. Addressing the growing scarcity of real data, we introduce a self-verification mechanism enabling models to filter their outputs based on internal confidence scores without external validation. Through rigorous analysis, we derive finite-sample error bounds demonstrating that self-verification alone can prevent collapse, even in fully synthetic training regimes. Our theoretical framework extends to large language models (LLMs), characterizing the conditions under which recursive training can maintain stability without performance degradation.
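The self-verification mechanism described above amounts to filtering a model's own synthetic outputs by an internal confidence score before the next training round. The sketch below uses simulated quality and confidence values (both hypothetical, purely for illustration) to show why confidence-gated filtering raises the quality of the retained training pool.

```python
import numpy as np

# Simulated synthetic-data pool: each sample has a latent quality and a
# noisy self-assessed confidence score correlated with that quality.
# Both quantities are stand-ins, not outputs of any real model.
rng = np.random.default_rng(0)
n = 10_000
quality = rng.uniform(0.0, 1.0, n)                # latent sample quality
confidence = quality + rng.normal(0.0, 0.1, n)    # noisy self-assessment

# Self-verification: keep only samples the model is confident about
keep = confidence > 0.7
print(quality[keep].mean() > quality.mean())      # retained pool is cleaner
```

Because confidence tracks quality, the retained pool has markedly higher average quality than the unfiltered pool, which is the intuition behind the paper's finite-sample bounds on fully synthetic training.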

NeurIPS Conference 2025 Conference Paper

Stochastic Regret Guarantees for Online Zeroth- and First-Order Bilevel Optimization

  • Parvin Nazari
  • Bojian Hou
  • Davoud Ataee Tarzanagh
  • Li Shen
  • George Michailidis

Online bilevel optimization (OBO) is a powerful framework for machine learning problems where both outer and inner objectives evolve over time, requiring dynamic updates. Current OBO approaches rely on deterministic window-smoothed regret minimization, which may not accurately reflect system performance when functions change rapidly. In this work, we introduce a novel search direction and show that both first- and zeroth-order (ZO) stochastic OBO algorithms leveraging this direction achieve sublinear stochastic bilevel regret without window smoothing. Beyond these guarantees, our framework enhances efficiency by: (i) reducing oracle dependence in hypergradient estimation, (ii) updating inner and outer variables alongside the linear system solution, and (iii) employing ZO-based estimation of Hessians, Jacobians, and gradients. Experiments on online parametric loss tuning and black-box adversarial attacks validate our approach.
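The zeroth-order estimation in point (iii) relies on the standard two-point finite-difference trick: a gradient is approximated from function values alone by averaging directional differences along random directions. This is a generic sketch of that estimator (not the paper's full OBO algorithm), verified here on a quadratic whose true gradient is known.

```python
import numpy as np

def zo_gradient(f, x, mu=1e-4, n_dirs=1000, rng=None):
    """Two-point zeroth-order gradient estimate: average finite
    differences of f along random Gaussian directions. Used when only
    function evaluations (no gradients) are available."""
    rng = rng or np.random.default_rng(0)
    grad = np.zeros_like(x)
    for _ in range(n_dirs):
        u = rng.standard_normal(x.size)
        grad += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return grad / n_dirs

# Sanity check on a quadratic: the true gradient of f(x) = ||x||^2 is 2x
f = lambda x: float(x @ x)
x = np.array([1.0, -2.0, 0.5])
g = zo_gradient(f, x, n_dirs=4000)
```

More directions reduce the estimator's variance at the cost of more function evaluations, which is exactly the oracle-efficiency trade-off the paper's framework targets.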

NeurIPS Conference 2025 Conference Paper

Tackling Continual Offline RL through Selective Weights Activation on Aligned Spaces

  • Jifeng Hu
  • Sili Huang
  • Li Shen
  • Zhejian Yang
  • Shengchao Hu
  • Shisong Tang
  • Hechang Chen
  • Lichao Sun

Continual offline reinforcement learning (CORL) has shown impressive ability in diffusion-based continual learning systems by modeling the joint distributions of trajectories. However, most research only focuses on limited continual task settings where the tasks have the same observation and action space, which deviates from the realistic demands of training agents in various environments. In view of this, we propose Vector-Quantized Continual Diffuser, named VQ-CD, to break the barrier of different spaces between various tasks. Specifically, our method contains two complementary sections, where the quantized spaces alignment provides a unified basis for the selective weights activation. In the quantized spaces alignment, we leverage vector quantization to align the different state and action spaces of various tasks, facilitating continual training in the same space. Then, we propose to leverage a unified diffusion model, coupled with an inverse dynamics model, to master all tasks by selectively activating different weights according to the task-related sparse masks. Finally, we conduct extensive experiments on 15 continual learning (CL) tasks, including conventional CL task settings (identical state and action spaces) and general CL task settings (various state and action spaces). Compared with 17 baselines, our method reaches the SOTA performance.
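The space-alignment step rests on ordinary vector quantization: a continuous state or action is replaced by its nearest codebook entry, giving all tasks a shared discrete representation. The toy sketch below uses a random codebook for illustration; the paper learns the codebook jointly with the diffusion model.

```python
import numpy as np

# Toy codebook: 16 code vectors of dimension 4 (random for illustration;
# in VQ-style methods the codebook is learned).
rng = np.random.default_rng(0)
codebook = rng.standard_normal((16, 4))

def quantize(z):
    """Return the index and embedding of the nearest codebook vector."""
    dists = np.linalg.norm(codebook - z, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

z = rng.standard_normal(4)       # a continuous state/action vector
idx, zq = quantize(z)            # its discrete code and embedding
```

Once every task's states and actions are mapped through such a codebook, a single model can be trained in the shared quantized space regardless of the tasks' original dimensionalities.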

NeurIPS Conference 2025 Conference Paper

Unveiling the Power of Multiple Gossip Steps: A Stability-Based Generalization Analysis in Decentralized Training

  • Qinglun Li
  • Yingqi Liu
  • Miao Zhang
  • Xiaochun Cao
  • Quanjun Yin
  • Li Shen

Decentralized training removes the centralized server, making it a communication-efficient approach that can significantly improve training efficiency, but it often suffers from degraded performance compared to centralized training. Multi-Gossip Steps (MGS) serve as a simple yet effective bridge between decentralized and centralized training, significantly reducing the performance gap observed in experiments. However, the theoretical reasons for its effectiveness and whether this gap can be fully eliminated by MGS remain open questions. In this paper, we derive upper bounds on the generalization error and excess error of MGS using stability analysis, systematically answering these two key questions. 1). Optimization Error Reduction: MGS reduces the optimization error bound at an exponential rate, thereby exponentially tightening the generalization error bound and enabling convergence to better solutions. 2). Gap to Centralization: Even as MGS approaches infinity, a non-negligible gap in generalization error remains compared to centralized mini-batch SGD ($\mathcal{O}(T^{\frac{c\beta}{c\beta +1}}/{n m})$ in centralized and $\mathcal{O}(T^{\frac{2c\beta}{2c\beta +2}}/{n m^{\frac{1}{2c\beta +2}}})$ in decentralized). Furthermore, we provide the first unified analysis of how factors like learning rate, data heterogeneity, node count, per-node sample size, and communication topology impact the generalization of MGS under non-convex settings without the bounded gradients assumption, filling a critical theoretical gap in decentralized training. Finally, promising experiments on CIFAR datasets support our theoretical findings.
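A gossip step averages each node's parameters with its neighbors' via a doubly stochastic mixing matrix W; repeating the step drives all nodes toward the global mean at a rate set by W's spectral gap. The sketch below uses a toy ring topology and scalar parameters, purely to illustrate why more gossip steps shrink node disagreement.

```python
import numpy as np

# Ring topology with 6 nodes: each node keeps half its own value and
# takes a quarter from each neighbor (a doubly stochastic mixing matrix).
n = 6
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

params = np.arange(n, dtype=float)   # one scalar parameter per node
mean = params.mean()

def gossip(x, steps):
    """Apply the mixing matrix `steps` times (Multi-Gossip Steps)."""
    for _ in range(steps):
        x = W @ x
    return x

# Max disagreement with the global mean shrinks as gossip steps increase
spread = [np.abs(gossip(params, k) - mean).max() for k in (0, 1, 5, 20)]
print(spread)
```

The contraction is geometric in the number of steps, matching the exponential tightening in point 1), while the mean itself is preserved exactly, so no gossip budget changes what the nodes agree on, only how fast they agree.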

NeurIPS Conference 2025 Conference Paper

Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought

  • Chao Huang
  • Benfeng Wang
  • Wei Wang
  • Jie Wen
  • Chengliang Liu
  • Li Shen
  • Xiaochun Cao

Recent advancements in reasoning capability of Multimodal Large Language Models (MLLMs) demonstrate its effectiveness in tackling complex visual tasks. However, existing MLLM-based Video Anomaly Detection (VAD) methods remain limited to shallow anomaly descriptions without deep reasoning. In this paper, we propose a new task named Video Anomaly Reasoning (VAR), which aims to enable deep analysis and understanding of anomalies in the video by requiring MLLMs to think explicitly before answering. To this end, we propose Vad-R1, an end-to-end MLLM-based framework for VAR. Specifically, we design a Perception-to-Cognition Chain-of-Thought (P2C-CoT) that simulates the human process of recognizing anomalies, guiding the MLLMs to reason about anomalies step-by-step. Based on the structured P2C-CoT, we construct Vad-Reasoning, a dedicated dataset for VAR. Furthermore, we propose an improved reinforcement learning algorithm AVA-GRPO, which explicitly incentivizes the anomaly reasoning capability of MLLMs through a self-verification mechanism with limited annotations. Experimental results demonstrate that Vad-R1 achieves superior performance, outperforming both open-source and proprietary models on VAD and VAR tasks.

NeurIPS Conference 2025 Conference Paper

Value-Guided Decision Transformer: A Unified Reinforcement Learning Framework for Online and Offline Settings

  • Hongling Zheng
  • Li Shen
  • Yong Luo
  • Deheng Ye
  • Shuhan Xu
  • Bo Du
  • Jialie Shen
  • Dacheng Tao

The Conditional Sequence Modeling (CSM) paradigm, benefiting from the transformer's powerful distribution modeling capabilities, has demonstrated considerable promise in Reinforcement Learning (RL) tasks. However, much of the work has focused on applying CSM to single online or offline settings, with the general architecture rarely explored. Additionally, existing methods primarily focus on deterministic trajectory modeling, overlooking the randomness of state transitions and the diversity of future trajectory distributions. Fortunately, value-based methods offer a viable solution for CSM, further bridging the potential gap between offline and online RL. In this paper, we propose Value-Guided Decision Transformer (VDT), which leverages value functions to perform advantage-weighting and behavior regularization on the Decision Transformer (DT), guiding the policy toward upper-bound optimal decisions during the offline training phase. In the online tuning phase, VDT further integrates value-based policy improvement with behavior cloning under the CSM architecture through limited interaction and data collection, achieving performance improvement within minimal timesteps. The predictive capability of value functions for future returns is also incorporated into the sampling process. Our method achieves competitive performance on various standard RL benchmarks, providing a feasible solution for developing CSM architectures in general scenarios. Code is available here.

NeurIPS Conference 2024 Conference Paper

A Huber Loss Minimization Approach to Mean Estimation under User-level Differential Privacy

  • Puning Zhao
  • Lifeng Lai
  • Li Shen
  • Qingming Li
  • Jiafei Wu
  • Zhe Liu

Privacy protection of users' entire contribution of samples is important in distributed systems. The most effective approach is the two-stage scheme, which finds a small interval first and then gets a refined estimate by clipping samples into the interval. However, the clipping operation induces bias, which is serious if the sample distribution is heavy-tailed. Besides, users with large local sample sizes can make the sensitivity much larger, thus the method is not suitable for imbalanced users. Motivated by these challenges, we propose a Huber loss minimization approach to mean estimation under user-level differential privacy. The connecting points of Huber loss can be adaptively adjusted to deal with imbalanced users. Moreover, it avoids the clipping operation, thus significantly reducing the bias compared with the two-stage approach. We provide a theoretical analysis of our approach, which gives the noise strength needed for privacy protection, as well as the bound of mean squared error. The result shows that the new method is much less sensitive to the imbalance of user-wise sample sizes and the tail of sample distributions. Finally, we perform numerical experiments to validate our theoretical analysis.
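The robustness argument above can be made concrete with a small sketch: minimizing the Huber loss clips each sample's influence on the mean estimate, so heavy-tailed contamination moves it far less than the naive average. This is an illustrative simplification only; the paper's user-level DP method additionally adds calibrated noise and adapts the connecting points per user, both omitted here.

```python
import numpy as np

def huber_mean(x, delta=1.0, iters=100):
    """Estimate the mean by minimizing the Huber loss with simple
    gradient steps. The Huber gradient is linear inside [-delta, delta]
    and clipped outside, which bounds each sample's influence."""
    mu = np.median(x)                      # robust starting point
    for _ in range(iters):
        g = np.clip(x - mu, -delta, delta)
        mu = mu + 0.5 * g.mean()
    return mu

rng = np.random.default_rng(0)
data = rng.standard_normal(1000)
data[:10] += 50.0                          # heavy-tailed contamination
naive = data.mean()
robust = huber_mean(data)
print(naive, robust)   # the Huber estimate resists the outliers
```

Unlike the two-stage clip-into-an-interval scheme the abstract contrasts against, no hard clipping of samples is performed, so the bias under heavy tails stays small.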

NeurIPS Conference 2024 Conference Paper

A-FedPD: Aligning Dual-Drift is All Federated Primal-Dual Learning Needs

  • Yan Sun
  • Li Shen
  • Dacheng Tao

As a popular paradigm for juggling data privacy and collaborative training, federated learning (FL) is flourishing as a way to distributively process large-scale heterogeneous datasets on edge clients. Due to bandwidth limitations and security considerations, it ingeniously splits the original problem into multiple subproblems to be solved in parallel, which gives primal-dual solutions great practical value in FL. In this paper, we review the recent development of classical federated primal-dual methods and point out a serious common defect of such methods in non-convex scenarios, which we term a "dual drift", caused by dual hysteresis of longstanding inactive clients under partial participation training. To further address this problem, we propose a novel Aligned Federated Primal Dual (A-FedPD) method, which constructs virtual dual updates to align global consensus and local dual variables for long-unparticipating local clients. Meanwhile, we provide a comprehensive analysis of the optimization and generalization efficiency of the A-FedPD method on smooth non-convex objectives, which confirms its high efficiency and practicality. Extensive experiments are conducted on several classical FL setups to validate the effectiveness of our proposed method.

NeurIPS Conference 2024 Conference Paper

Decomposed Prompt Decision Transformer for Efficient Unseen Task Generalization

  • Hongling Zheng
  • Li Shen
  • Yong Luo
  • Tongliang Liu
  • Jialie Shen
  • Dacheng Tao

Multi-task offline reinforcement learning aims to develop a unified policy for diverse tasks without requiring real-time interaction with the environment. Recent work explores sequence modeling, leveraging the scalability of the transformer architecture as a foundation for multi-task learning. Given the variations in task content and complexity, formulating policies becomes a challenging endeavor, requiring careful parameter sharing and adept management of conflicting gradients to extract rich cross-task knowledge from multiple tasks and transfer it to unseen tasks. In this paper, we propose the Decomposed Prompt Decision Transformer (DPDT) that adopts a two-stage paradigm to efficiently learn prompts for unseen tasks in a parameter-efficient manner. We incorporate parameters from pre-trained language models (PLMs) to initialize DPDT, thereby providing rich prior knowledge encoded in language models. During the decomposed prompt tuning phase, we learn both cross-task and task-specific prompts on training tasks to achieve prompt decomposition. In the test time adaptation phase, the cross-task prompt, serving as a good initialization, is further optimized on unseen tasks through test time adaptation, enhancing the model's performance on these tasks. Empirical evaluation on a series of Meta-RL benchmarks demonstrates the superiority of our approach. The project is available at https://github.com/ruthless-man/DPDT.

AAAI Conference 2024 Conference Paper

Evaluate Geometry of Radiance Fields with Low-Frequency Color Prior

  • Qihang Fang
  • Yafei Song
  • Keqiang Li
  • Li Shen
  • Huaiyu Wu
  • Gang Xiong
  • Liefeng Bo

A radiance field is an effective representation of 3D scenes, which has been widely adopted in novel-view synthesis and 3D reconstruction. It is still an open and challenging problem to evaluate the geometry, i.e., the density field, as the ground-truth is almost impossible to obtain. One alternative indirect solution is to transform the density field into a point-cloud and compute its Chamfer Distance with the scanned ground-truth. However, many widely-used datasets have no point-cloud ground-truth since the scanning process along with the equipment is expensive and complicated. To this end, we propose a novel metric, named Inverse Mean Residual Color (IMRC), which can evaluate the geometry only with the observation images. Our key insight is that the better the geometry, the lower-frequency the computed color field. From this insight, given a reconstructed density field and observation images, we design a closed-form method to approximate the color field with low-frequency spherical harmonics, and compute the inverse mean residual color. Then the higher the IMRC, the better the geometry. Qualitative and quantitative experimental results verify the effectiveness of our proposed IMRC metric. We also benchmark several state-of-the-art methods using IMRC to promote future related research. Our code is available at https://github.com/qihangGH/IMRC.

NeurIPS Conference 2024 Conference Paper

Fairness-Aware Estimation of Graphical Models

  • Zhuoping Zhou
  • Davoud Ataee Tarzanagh
  • Bojian Hou
  • Qi Long
  • Li Shen

This paper examines the issue of fairness in the estimation of graphical models (GMs), particularly Gaussian, Covariance, and Ising models. These models play a vital role in understanding complex relationships in high-dimensional data. However, standard GMs can result in biased outcomes, especially when the underlying data involves sensitive characteristics or protected groups. To address this, we introduce a comprehensive framework designed to reduce bias in the estimation of GMs related to protected attributes. Our approach involves the integration of the pairwise graph disparity error and a tailored loss function into a nonsmooth multi-objective optimization problem, striving to achieve fairness across different sensitive groups while maintaining the effectiveness of the GMs. Experimental evaluations on synthetic and real-world datasets demonstrate that our framework effectively mitigates bias without undermining GMs' performance.

NeurIPS Conference 2024 Conference Paper

Is Mamba Compatible with Trajectory Optimization in Offline Reinforcement Learning?

  • Yang Dai
  • Oubo Ma
  • Longfei Zhang
  • Xingxing Liang
  • Shengchao Hu
  • Mengzhu Wang
  • Shouling Ji
  • Jincai Huang

Transformer-based trajectory optimization methods have demonstrated exceptional performance in offline Reinforcement Learning (offline RL). Yet, they pose challenges due to substantial parameter size and limited scalability, which is particularly critical in sequential decision-making scenarios where resources are constrained, such as in robots and drones with limited computational power. Mamba, a promising new linear-time sequence model, offers performance on par with transformers while delivering substantially fewer parameters on long sequences. As it remains unclear whether Mamba is compatible with trajectory optimization, this work aims to conduct comprehensive experiments to explore the potential of Decision Mamba (dubbed DeMa) in offline RL from the aspect of data structures and essential components with the following insights: (1) Long sequences impose a significant computational burden without contributing to performance improvements since DeMa's focus on sequences diminishes approximately exponentially. Consequently, we introduce a Transformer-like DeMa as opposed to an RNN-like DeMa. (2) For the components of DeMa, we identify the hidden attention mechanism as a critical factor in its success, which can also work well with other residual structures and does not require position embedding. Extensive evaluations demonstrate that our specially designed DeMa is compatible with trajectory optimization and surpasses previous methods, outperforming Decision Transformer (DT) with higher performance while using 30% fewer parameters in Atari, and exceeding DT with only a quarter of the parameters in MuJoCo.

IJCAI Conference 2024 Conference Paper

MuEP: A Multimodal Benchmark for Embodied Planning with Foundation Models

  • Kanxue Li
  • Baosheng Yu
  • Qi Zheng
  • Yibing Zhan
  • Yuhui Zhang
  • Tianle Zhang
  • Yijun Yang
  • Yue Chen

Foundation models have demonstrated significant emergent abilities, holding great promise for enhancing embodied agents' reasoning and planning capacities. However, the absence of a comprehensive benchmark for evaluating embodied agents with multimodal observations in complex environments remains a notable gap. In this paper, we present MuEP, a comprehensive Multimodal benchmark for Embodied Planning. MuEP facilitates the evaluation of multimodal and multi-turn interactions of embodied agents in complex scenes, incorporating fine-grained evaluation metrics that provide insights into the performance of embodied agents throughout each task. Furthermore, we evaluate embodied agents with recent state-of-the-art foundation models, including large language models (LLMs) and large multimodal models (LMMs), on the proposed benchmark. Experimental results show that foundation models based on textual representations of environments usually outperform their visual counterparts, suggesting a gap in embodied planning abilities with multimodal observations. We also find that control language generation is an indispensable ability beyond common-sense knowledge for accurate embodied task completion. We hope the proposed MuEP benchmark can contribute to the advancement of embodied AI with foundation models.

AAAI Conference 2024 Conference Paper

Neural Network Approximation for Pessimistic Offline Reinforcement Learning

  • Di Wu
  • Yuling Jiao
  • Li Shen
  • Haizhao Yang
  • Xiliang Lu

Deep reinforcement learning (RL) has shown remarkable success in specific offline decision-making scenarios, yet its theoretical guarantees are still under development. Existing works on offline RL theory primarily emphasize a few trivial settings, such as linear MDP or general function approximation with strong assumptions and independent data, which lack guidance for practical use. The coupling of deep learning and Bellman residuals makes this problem challenging, in addition to the difficulty of data dependence. In this paper, we establish a non-asymptotic estimation error of pessimistic offline RL using general neural network approximation with C-mixing data regarding the structure of networks, the dimension of datasets, and the concentrability of data coverage, under mild assumptions. Our result shows that the estimation error consists of two parts: the first converges to zero at a desired rate on the sample size with partially controllable concentrability, and the second becomes negligible if the residual constraint is tight. This result demonstrates the explicit efficiency of deep adversarial offline RL frameworks. We utilize the empirical process tool for C-mixing sequences and the neural network approximation theory for the Hölder class to achieve this. We also develop methods to bound the Bellman estimation error caused by function approximation with empirical Bellman constraint perturbations. Additionally, we present a result that lessens the curse of dimensionality using data with low intrinsic dimensionality and function classes with low complexity. Our estimation provides valuable insights into the development of deep offline RL and guidance for algorithm model design.

TMLR Journal 2024 Journal Article

Revisiting Discrete Soft Actor-Critic

  • haibin zhou
  • Tong Wei
  • Zichuan Lin
  • Junyou Li
  • Junliang Xing
  • Yuanchun Shi
  • Li Shen
  • Chao Yu

We study the adaptation of Soft Actor-Critic (SAC), which is considered a state-of-the-art reinforcement learning (RL) algorithm, from continuous action space to discrete action space. We revisit vanilla discrete SAC and provide an in-depth understanding of its Q value underestimation and performance instability issues when applied to discrete settings. We thereby propose Stable Discrete SAC (SDSAC), an algorithm that leverages entropy-penalty and double average Q-learning with Q-clip to address these issues. Extensive experiments on typical benchmarks with discrete action space, including Atari games and a large-scale MOBA game, show the efficacy of our proposed method. Our code is at: https://github.com/coldsummerday/SD-SAC.git.

NeurIPS Conference 2024 Conference Paper

Uncovering, Explaining, and Mitigating the Superficial Safety of Backdoor Defense

  • Rui Min
  • Zeyu Qin
  • Nevin L. Zhang
  • Li Shen
  • Minhao Cheng

Backdoor attacks pose a significant threat to Deep Neural Networks (DNNs) as they allow attackers to manipulate model predictions with backdoor triggers. To address these security vulnerabilities, various backdoor purification methods have been proposed to purify compromised models. Typically, these purified models exhibit low Attack Success Rates (ASR), rendering them resistant to backdoored inputs. However, does achieving a low ASR through current safety purification methods truly eliminate learned backdoor features from the pretraining phase? In this paper, we provide an affirmative answer to this question by thoroughly investigating the Post-Purification Robustness of current backdoor purification methods. We find that current safety purification methods are vulnerable to the rapid re-learning of backdoor behavior, even when further fine-tuning of purified models is performed using a very small number of poisoned samples. Based on this, we further propose the practical Query-based Reactivation Attack (QRA) which could effectively reactivate the backdoor by merely querying purified models. We find the failure to achieve satisfactory post-purification robustness stems from the insufficient deviation of purified models from the backdoored model along the backdoor-connected path. To improve the post-purification robustness, we propose a straightforward tuning defense, Path-Aware Minimization (PAM), which promotes deviation along backdoor-connected paths with extra model updates. Extensive experiments demonstrate that PAM significantly improves post-purification robustness while maintaining a good clean accuracy and low ASR. Our work provides a new perspective on understanding the effectiveness of backdoor safety tuning and highlights the importance of faithfully assessing the model's safety.

TMLR Journal 2024 Journal Article

Visual Prompt Based Personalized Federated Learning

  • Guanghao Li
  • Wansen Wu
  • Yan Sun
  • Li Shen
  • Baoyuan Wu
  • Dacheng Tao

As a popular paradigm of distributed learning, personalized federated learning (PFL) allows personalized models to improve generalization ability and robustness by utilizing knowledge from all distributed clients. Most existing PFL algorithms tackle personalization in a model-centric way, such as personalized layer partition, model regularization, and model interpolation, which all fail to take into account the data characteristics of distributed clients. In this paper, we propose a novel PFL framework for image classification tasks, dubbed pFedPT, that leverages personalized visual prompts to implicitly represent local data distribution information of clients and provides that information to the aggregation model to help with classification tasks. Specifically, in each round of pFedPT training, each client generates a local personalized prompt related to local data distribution. Then, the local model is trained on the input composed of raw data and a visual prompt to learn the distribution information contained in the prompt. During model testing, the aggregated model obtains client-specific knowledge of the data distributions based on the prompts, which can be seen as an adaptive fine-tuning of the aggregation model to improve model performances on different clients. Furthermore, the visual prompt can be added as an orthogonal method to implement personalization on the client for existing FL methods to boost their performance. Experiments on the CIFAR10 and CIFAR100 datasets show that pFedPT outperforms several state-of-the-art (SOTA) PFL algorithms by a large margin in various settings. The code is available at: https://github.com/hkgdifyu/pFedPT.

AAAI Conference 2023 Conference Paper

AdaTask: A Task-Aware Adaptive Learning Rate Approach to Multi-Task Learning

  • Enneng Yang
  • Junwei Pan
  • Ximei Wang
  • Haibin Yu
  • Li Shen
  • Xihua Chen
  • Lei Xiao
  • Jie Jiang

Multi-task learning (MTL) models have demonstrated impressive results in computer vision, natural language processing, and recommender systems. Even though many approaches have been proposed, how well these approaches balance different tasks on each parameter still remains unclear. In this paper, we propose to measure the task dominance degree of a parameter by the total updates of each task on this parameter. Specifically, we compute the total updates by the exponentially decaying Average of the squared Updates (AU) on a parameter from the corresponding task. Based on this novel metric, we observe that many parameters in existing MTL methods, especially those in the higher shared layers, are still dominated by one or several tasks. The dominance of AU is mainly due to the dominance of accumulative gradients from one or several tasks. Motivated by this, we propose a Task-wise Adaptive learning rate approach, AdaTask in short, to separate the accumulative gradients and hence the learning rate of each task for each parameter in adaptive learning rate approaches (e.g., AdaGrad, RMSProp, and Adam). Comprehensive experiments on computer vision and recommender system MTL datasets demonstrate that AdaTask significantly improves the performance of dominated tasks, resulting in SOTA average task-wise performance. Analysis on both synthetic and real-world datasets shows that AdaTask balances parameters in every shared layer well.
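The core idea of task-separated accumulators can be sketched in a few lines. This is a minimal RMSProp-style illustration of the principle described in the abstract, not the authors' code; `adatask_rmsprop_step` and its exact update form are assumptions.

```python
import numpy as np

def adatask_rmsprop_step(param, task_grads, state, lr=0.01, beta=0.99, eps=1e-8):
    """One update on a shared parameter with task-separated accumulators.

    Instead of one RMSProp accumulator over the summed gradient, each task k
    keeps its own exponentially decaying average of its squared gradients, so
    one task's large gradients cannot shrink the effective learning rate that
    the other tasks see on the same parameter.
    """
    update = np.zeros_like(param)
    for k, g in enumerate(task_grads):
        state[k] = beta * state[k] + (1 - beta) * g * g  # per-task accumulator
        update += g / (np.sqrt(state[k]) + eps)          # task-wise adaptive step
    return param - lr * update, state

p = np.array([1.0, -2.0])
grads = [np.array([10.0, 0.0]), np.array([0.0, 0.1])]  # heavily imbalanced tasks
state = [np.zeros_like(p), np.zeros_like(p)]
p, state = adatask_rmsprop_step(p, grads, state)
# Both coordinates move by the same magnitude (0.1) despite the 100x gradient
# gap, since each task's step is normalized by its own accumulator.
```

A single shared accumulator would instead let the 10.0 gradient dominate the normalization on both coordinates, which is the dominance effect the paper measures with AU.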

NeurIPS Conference 2023 Conference Paper

An Efficient Dataset Condensation Plugin and Its Application to Continual Learning

  • Enneng Yang
  • Li Shen
  • Zhenyi Wang
  • Tongliang Liu
  • Guibing Guo

Dataset condensation (DC) distills a large real-world dataset into a small synthetic dataset, with the goal of training a network from scratch on the latter that performs similarly to the former. State-of-the-art (SOTA) DC methods have achieved satisfactory results through techniques such as accuracy, gradient, training trajectory, or distribution matching. However, these works all perform matching in the high-dimension pixel spaces, ignoring that natural images are usually locally connected and have lower intrinsic dimensions, resulting in low condensation efficiency. In this work, we propose a simple-yet-efficient dataset condensation plugin that matches the raw and synthetic datasets in a low-dimensional manifold. Specifically, our plugin condenses raw images into two low-rank matrices instead of parameterized image matrices. Our plugin can be easily incorporated into existing DC methods, thereby containing richer raw dataset information at limited storage costs to improve the downstream applications' performance. We verify on multiple public datasets that when the proposed plugin is combined with SOTA DC methods, the performance of the network trained on synthetic data is significantly improved compared to traditional DC methods. Moreover, when applying the DC methods as a plugin to continual learning tasks, we observed that our approach effectively mitigates catastrophic forgetting of old tasks under limited memory buffer constraints and avoids the problem of raw data privacy leakage.

YNIMG Journal 2023 Journal Article

Brain-wide genome-wide colocalization study for integrating genetics, transcriptomics and brain morphometry in Alzheimer's disease

  • Jingxuan Bao
  • Junhao Wen
  • Zixuan Wen
  • Shu Yang
  • Yuhan Cui
  • Zhijian Yang
  • Guray Erus
  • Andrew J. Saykin

Alzheimer's disease (AD) is one of the most common neurodegenerative diseases. However, the AD mechanism has not yet been fully elucidated to date, hindering the development of effective therapies. In our work, we perform a brain imaging genomics study to link genetics, single-cell gene expression data, tissue-specific gene expression data, brain imaging-derived volumetric endophenotypes, and disease diagnosis to discover potential underlying neurobiological pathways for AD. To do so, we perform brain-wide genome-wide colocalization analyses to integrate multidimensional imaging genomic biobank data. Specifically, we use (1) the individual-level imputed genotyping data and magnetic resonance imaging (MRI) data from the UK Biobank, (2) the summary statistics of the genome-wide association study (GWAS) from multiple European ancestry cohorts, and (3) the tissue-specific cis-expression quantitative trait loci (cis-eQTL) summary statistics from the GTEx project. We apply a Bayes factor colocalization framework and mediation analysis to these multi-modal imaging genomic data. As a result, we derive the brain regional level GWAS summary statistics for 145 brain regions with 482,831 single nucleotide polymorphisms (SNPs) followed by posthoc functional annotations. Our analysis yields the discovery of a potential AD causal pathway from a systems biology perspective: the SNP chr10:124165615:G>A (rs6585827) mutation upregulates the expression of BTBD16 gene in oligodendrocytes, a specialized glial cells, in the brain cortex, leading to a reduced risk of volumetric loss in the entorhinal cortex, resulting in the protective effect on AD. We substantiate our findings with multiple evidence from existing imaging, genetic and genomic studies in AD literature. Our study connects genetics, molecular and cellular signatures, regional brain morphologic endophenotypes, and AD diagnosis, providing new insights into the mechanistic understanding of the disease. Our findings can provide valuable guidance for subsequent therapeutic target identification and drug discovery in AD.

YNIMG Journal 2023 Journal Article

Cortical encoding of rhythmic kinematic structures in biological motion

  • Li Shen
  • Xiqian Lu
  • Xiangyong Yuan
  • Ruichen Hu
  • Ying Wang
  • Yi Jiang

Biological motion (BM) perception is of great survival value to human beings. The critical characteristics of BM information lie in kinematic cues containing rhythmic structures. However, how rhythmic kinematic structures of BM are dynamically represented in the brain and contribute to visual BM processing remains largely unknown. Here, we probed this issue in three experiments using electroencephalogram (EEG). We found that neural oscillations of observers entrained to the hierarchical kinematic structures of the BM sequences (i.e., step-cycle and gait-cycle for point-light walkers). Notably, only the cortical tracking of the higher-level rhythmic structure (i.e., gait-cycle) exhibited a BM processing specificity, manifested by enhanced neural responses to upright over inverted BM stimuli. This effect could be extended to different motion types and tasks, with its strength positively correlated with the perceptual sensitivity to BM stimuli at the right temporal brain region dedicated to visual BM processing. Modeling results further suggest that the neural encoding of spatiotemporally integrative kinematic cues, in particular the opponent motions of bilateral limbs, drives the selective cortical tracking of BM information. These findings underscore the existence of a cortical mechanism that encodes periodic kinematic features of body movements, which underlies the dynamic construction of visual BM perception.

NeurIPS Conference 2023 Conference Paper

Defending against Data-Free Model Extraction by Distributionally Robust Defensive Training

  • Zhenyi Wang
  • Li Shen
  • Tongliang Liu
  • Tiehang Duan
  • Yanjun Zhu
  • Donglin Zhan
  • David Doermann
  • Mingchen Gao

Data-Free Model Extraction (DFME) aims to clone a black-box model without knowing its original training data distribution, making it much easier for attackers to steal commercial models. Defense against DFME faces several challenges: (i) effectiveness; (ii) efficiency; (iii) no prior on the attacker's query data distribution and strategy. However, existing defense methods: (1) are highly computation and memory inefficient; or (2) need strong assumptions about attack data distribution; or (3) can only delay the attack or prove a model theft after the model stealing has happened. In this work, we propose a Memory and Computation efficient defense approach, named MeCo, to prevent DFME from happening while maintaining the model utility simultaneously by distributionally robust defensive training on the target victim model. Specifically, we randomize the input so that it: (1) causes a mismatch of the knowledge distillation loss for attackers; (2) disturbs the zeroth-order gradient estimation; (3) changes the label prediction for the attack query data. Therefore, the attacker can only extract misleading information from the black-box model. Extensive experiments on defending against both decision-based and score-based DFME demonstrate that MeCo can significantly reduce the effectiveness of existing DFME methods and substantially improve running efficiency.

NeurIPS Conference 2023 Conference Paper

Dynamic Sparsity Is Channel-Level Sparsity Learner

  • Lu Yin
  • Gen Li
  • Meng Fang
  • Li Shen
  • Tianjin Huang
  • Zhangyang "Atlas" Wang
  • Vlado Menkovski
  • Xiaolong Ma

Sparse training has received an upsurging interest in machine learning due to its tantalizing saving potential for both the entire training process as well as the inference. Dynamic sparse training (DST) as a leading approach can train deep neural networks at high sparsity from scratch to match the performance of their dense counterparts. However, most if not all DST prior arts demonstrate their effectiveness on unstructured sparsity with highly irregular sparse patterns, which receives limited support in common hardware. This limitation hinders the usage of DST in practice. In this paper, we propose Channel-aware dynamic sparse (Chase), that for the first time seamlessly translates the promise of unstructured dynamic sparsity to GPU-friendly channel-level sparsity (not fine-grained N:M or group sparsity) during one end-to-end training process, without any ad-hoc operations. The resulting small sparse networks can be directly accelerated by commodity hardware, without using any particularly sparsity-aware hardware accelerators. This appealing outcome is partially motivated by a hidden phenomenon of dynamic sparsity: off-the-shelf unstructured DST implicitly involves biased parameter reallocation across channels, with a large fraction of channels (up to 60%) being sparser than others. By progressively identifying and removing these channels during training, our approach transfers unstructured sparsity to channel-wise sparsity. Our experimental results demonstrate that Chase achieves 1.7x inference throughput speedup on common GPU devices without compromising accuracy with ResNet-50 on ImageNet. We release our code at https://github.com/luuyin/chase.
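The "hidden phenomenon" the abstract exploits—some channels of an unstructured sparse mask end up far sparser than others—is easy to measure. The sketch below shows one plausible way to find such channels; the helper names and the threshold rule are hypothetical, not taken from the Chase code.

```python
import numpy as np

def channel_sparsity(weight_mask):
    """Fraction of zeroed weights per output channel of a conv mask (O, I, kH, kW)."""
    flat = weight_mask.reshape(weight_mask.shape[0], -1)
    return 1.0 - flat.mean(axis=1)

def prunable_channels(weight_mask, threshold=0.9):
    """Channels whose unstructured sparsity already exceeds `threshold`.

    Progressively removing such channels during training is the mechanism by
    which biased, per-channel reallocation of unstructured sparsity can be
    converted into hardware-friendly channel-level sparsity.
    """
    return np.where(channel_sparsity(weight_mask) > threshold)[0]

# Toy mask for a conv layer with 4 output channels and 3x3x3 kernels.
mask = np.ones((4, 3, 3, 3))
mask[1] = 0                    # channel 1 fully pruned by DST
mask[2, :, :2, :] = 0          # channel 2 mostly pruned (2/3 of weights zeroed)
print(prunable_channels(mask, threshold=0.5))  # → [1 2]
```

In an actual DST loop this check would run periodically, with the removed channels' parameter budget reallocated elsewhere, as the abstract describes.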

AAAI Conference 2023 Conference Paper

Evaluating Model-Free Reinforcement Learning toward Safety-Critical Tasks

  • Linrui Zhang
  • Qin Zhang
  • Li Shen
  • Bo Yuan
  • Xueqian Wang
  • Dacheng Tao

Safety comes first in many real-world applications involving autonomous agents. Despite a large number of reinforcement learning (RL) methods focusing on safety-critical tasks, there is still a lack of high-quality evaluation of those algorithms that adheres to safety constraints at each decision step under complex and unknown dynamics. In this paper, we revisit prior work in this scope from the perspective of state-wise safe RL and categorize them as projection-based, recovery-based, and optimization-based approaches, respectively. Furthermore, we propose Unrolling Safety Layer (USL), a joint method that combines safety optimization and safety projection. This novel technique explicitly enforces hard constraints via the deep unrolling architecture and enjoys structural advantages in navigating the trade-off between reward improvement and constraint satisfaction. To facilitate further research in this area, we reproduce related algorithms in a unified pipeline and incorporate them into SafeRL-Kit, a toolkit that provides off-the-shelf interfaces and evaluation utilities for safety-critical tasks. We then perform a comparative study of the involved algorithms on six benchmarks ranging from robotic control to autonomous driving. The empirical results provide an insight into their applicability and robustness in learning zero-cost-return policies without task-dependent handcrafting. The project page is available at https://sites.google.com/view/saferlkit.

NeurIPS Conference 2023 Conference Paper

Fair Canonical Correlation Analysis

  • Zhuoping Zhou
  • Davoud Ataee Tarzanagh
  • Bojian Hou
  • Boning Tong
  • Jia Xu
  • Yanbo Feng
  • Qi Long
  • Li Shen

This paper investigates fairness and bias in Canonical Correlation Analysis (CCA), a widely used statistical technique for examining the relationship between two sets of variables. We present a framework that alleviates unfairness by minimizing the correlation disparity error associated with protected attributes. Our approach enables CCA to learn global projection matrices from all data points while ensuring that these matrices yield comparable correlation levels to group-specific projection matrices. Experimental evaluation on both synthetic and real-world datasets demonstrates the efficacy of our method in reducing correlation disparity error without compromising CCA accuracy.
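The "correlation disparity error" can be read, in simplified form, as the gap between the canonical correlations that a single global projection achieves on each protected group. The sketch below computes that gap for one canonical pair; the function names and this exact one-pair formulation are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def canonical_correlation(X, Y, wx, wy):
    """Correlation between the projections X @ wx and Y @ wy (one canonical pair)."""
    u, v = X @ wx, Y @ wy
    return float(np.corrcoef(u, v)[0, 1])

def correlation_disparity(X, Y, groups, wx, wy):
    """Max pairwise gap in canonical correlation across protected groups,
    for one shared (global) pair of projection vectors."""
    corrs = [canonical_correlation(X[groups == g], Y[groups == g], wx, wy)
             for g in np.unique(groups)]
    return max(corrs) - min(corrs)

# Toy data: the X-Y relationship is inverted for group 1, so a single global
# projection serves the two groups very unequally.
X = np.arange(8, dtype=float).reshape(-1, 1)
Y = X.copy()
Y[4:] *= -1.0
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1])
w = np.array([1.0])
print(correlation_disparity(X, Y, groups, w, w))  # ≈ 2.0 (corr +1 vs corr -1)
```

A fairness-aware CCA in the abstract's sense would trade some overall correlation for driving this disparity toward zero.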

AAAI Conference 2023 Conference Paper

FedABC: Targeting Fair Competition in Personalized Federated Learning

  • Dui Wang
  • Li Shen
  • Yong Luo
  • Han Hu
  • Kehua Su
  • Yonggang Wen
  • Dacheng Tao

Federated learning aims to collaboratively train models without accessing their client's local private data. The data may be Non-IID across clients, thus resulting in poor performance. Recently, personalized federated learning (PFL) has achieved great success in handling Non-IID data by enforcing regularization in local optimization or improving the model aggregation scheme on the server. However, most of the PFL approaches do not take into account the unfair competition issue caused by the imbalanced data distribution and lack of positive samples for some classes in each client. To address this issue, we propose a novel and generic PFL framework termed Federated Averaging via Binary Classification, dubbed FedABC. In particular, we adopt the "one-vs-all" training strategy in each client to alleviate the unfair competition between classes by constructing a personalized binary classification problem for each class. This may aggravate the class imbalance challenge and thus a novel personalized binary classification loss that incorporates both the under-sampling and hard sample mining strategies is designed. Extensive experiments are conducted on two popular datasets under different settings, and the results demonstrate that our FedABC can significantly outperform the existing counterparts.

TMLR Journal 2023 Journal Article

FedDAG: Federated DAG Structure Learning

  • Erdun Gao
  • Junjia Chen
  • Li Shen
  • Tongliang Liu
  • Mingming Gong
  • Howard Bondell

To date, most directed acyclic graphs (DAGs) structure learning approaches require data to be stored in a central server. However, due to the consideration of privacy protection, data owners gradually refuse to share their personalized raw data to avoid private information leakage, making this task more troublesome by cutting off the first step. Thus, a puzzle arises: how do we discover the underlying DAG structure from decentralized data? In this paper, focusing on the additive noise models (ANMs) assumption of data generation, we take the first step in developing a gradient-based learning framework named FedDAG, which can learn the DAG structure without directly touching the local data and also can naturally handle the data heterogeneity. Our method benefits from a two-level structure of each local model. The first level structure learns the edges and directions of the graph and communicates with the server to get the model information from other clients during the learning procedure, while the second level structure approximates the mechanisms among variables and personally updates on its own data to accommodate the data heterogeneity. Moreover, FedDAG formulates the overall learning task as a continuous optimization problem by taking advantage of an equality acyclicity constraint, which can be solved by gradient descent methods to boost the searching efficiency. Extensive experiments on both synthetic and real-world datasets verify the efficacy of the proposed method.

NeurIPS Conference 2023 Conference Paper

Federated Learning with Manifold Regularization and Normalized Update Reaggregation

  • Xuming An
  • Li Shen
  • Han Hu
  • Yong Luo

Federated Learning (FL) is an emerging collaborative machine learning framework where multiple clients train the global model without sharing their own datasets. In FL, the model inconsistency caused by the local data heterogeneity across clients results in the near-orthogonality of client updates, which leads to the global update norm reduction and slows down the convergence. Most previous works focus on eliminating the difference of parameters (or gradients) between the local and global models, which may fail to reflect the model inconsistency due to the complex structure of the machine learning model and the Euclidean space's limitation in meaningful geometric representations. In this paper, we propose FedMRUR by adopting the manifold model fusion scheme and a new global optimizer to alleviate the negative impacts. Concretely, FedMRUR adopts a hyperbolic graph manifold regularizer enforcing that the representations of the data in the local and global models be close to each other in a low-dimensional subspace. Because the machine learning model has the graph structure, the distance in hyperbolic space can reflect the model bias better than the Euclidean distance. In this way, FedMRUR exploits the manifold structures of the representations to significantly reduce the model inconsistency. FedMRUR also aggregates the client updates norms as the global update norm, which can appropriately enlarge each client's contribution to the global update, thereby mitigating the norm reduction introduced by the near-orthogonality of client updates. Furthermore, we theoretically prove that our algorithm can achieve a linear speedup property $\mathcal{O}(\frac{1}{\sqrt{SKT}})$ for the non-convex setting under partial client participation, where $S$ is the number of participating clients, $K$ is the local interval and $T$ is the total number of communication rounds. Experiments demonstrate that FedMRUR can achieve a new state-of-the-art (SOTA) accuracy with less communication.

NeurIPS Conference 2023 Conference Paper

FlatMatch: Bridging Labeled Data and Unlabeled Data with Cross-Sharpness for Semi-Supervised Learning

  • Zhuo Huang
  • Li Shen
  • Jun Yu
  • Bo Han
  • Tongliang Liu

Semi-Supervised Learning (SSL) has been an effective way to leverage abundant unlabeled data with extremely scarce labeled data. However, most SSL methods are commonly based on instance-wise consistency between different data transformations. Therefore, the label guidance on labeled data is hard to propagate to unlabeled data. Consequently, the learning process on labeled data is much faster than on unlabeled data, which is likely to fall into a local minimum that does not favor unlabeled data, leading to sub-optimal generalization performance. In this paper, we propose FlatMatch which minimizes a cross-sharpness measure to ensure consistent learning performance between the two datasets. Specifically, we increase the empirical risk on labeled data to obtain a worst-case model which is a failure case needing to be enhanced. Then, by leveraging the richness of unlabeled data, we penalize the prediction difference (i.e., cross-sharpness) between the worst-case model and the original model so that the learning direction is beneficial to generalization on unlabeled data. Therefore, we can calibrate the learning process without being limited to insufficient label information. As a result, the mismatched learning performance can be mitigated, further enabling the effective exploitation of unlabeled data and improving SSL performance. Through comprehensive validation, we show FlatMatch achieves state-of-the-art results in many SSL settings.

TMLR Journal 2023 Journal Article

Fusion of Global and Local Knowledge for Personalized Federated Learning

  • Tiansheng Huang
  • Li Shen
  • Yan Sun
  • Weiwei Lin
  • Dacheng Tao

Personalized federated learning, as a variant of federated learning, trains customized models for clients using their heterogeneously distributed data. However, it is still inconclusive about how to design personalized models with better representation of shared global knowledge and personalized pattern. To bridge the gap, we in this paper explore personalized models with low-rank and sparse decomposition. Specifically, we employ proper regularization to extract a low-rank global knowledge representation (GKR), so as to distill global knowledge into a compact representation. Subsequently, we employ a sparse component over the obtained GKR to fuse the personalized pattern into the global knowledge. As a solution, we propose a two-stage proximal-based algorithm named Federated learning with mixed Sparse and Low-Rank representation (FedSLR) to efficiently search for the mixed models. Theoretically, under proper assumptions, we show that the GKR trained by FedSLR can at least sub-linearly converge to a stationary point of the regularized problem, and that the sparse component being fused can converge to its stationary point under proper settings. Extensive experiments also demonstrate the superior empirical performance of FedSLR. Moreover, FedSLR reduces the number of parameters, and lowers the down-link communication complexity, which are all desirable for federated learning algorithms. Source code is available at https://github.com/huangtiansheng/fedslr.

NeurIPS Conference 2023 Conference Paper

Learning Better with Less: Effective Augmentation for Sample-Efficient Visual Reinforcement Learning

  • Guozheng Ma
  • Linrui Zhang
  • Haoyu Wang
  • Lu Li
  • Zilin Wang
  • Zhen Wang
  • Li Shen
  • Xueqian Wang

Data augmentation (DA) is a crucial technique for enhancing the sample efficiency of visual reinforcement learning (RL) algorithms. Notably, employing simple observation transformations alone can yield outstanding performance without extra auxiliary representation tasks or pre-trained encoders. However, it remains unclear which attributes of DA account for its effectiveness in achieving sample-efficient visual RL. To investigate this issue and further explore the potential of DA, this work conducts comprehensive experiments to assess the impact of DA's attributes on its efficacy and provides the following insights and improvements: (1) For individual DA operations, we reveal that both ample spatial diversity and slight hardness are indispensable. Building on this finding, we introduce Random PadResize (Rand PR), a new DA operation that offers abundant spatial diversity with minimal hardness. (2) For multi-type DA fusion schemes, the increased DA hardness and unstable data distribution result in the current fusion schemes being unable to achieve higher sample efficiency than their corresponding individual operations. Taking the non-stationary nature of RL into account, we propose a RL-tailored multi-type DA fusion scheme called Cycling Augmentation (CycAug), which performs periodic cycles of different DA operations to increase type diversity while maintaining data distribution consistency. Extensive evaluations on the DeepMind Control suite and CARLA driving simulator demonstrate that our methods achieve superior sample efficiency compared with the prior state-of-the-art methods.

AAAI Conference 2023 Conference Paper

Offline Quantum Reinforcement Learning in a Conservative Manner

  • Zhihao Cheng
  • Kaining Zhang
  • Li Shen
  • Dacheng Tao

Recently, to reap the quantum advantage, empowering reinforcement learning (RL) with quantum computing has attracted much attention, which is dubbed quantum RL (QRL). However, current QRL algorithms employ an online learning scheme, i.e., the policy that is run on a quantum computer needs to interact with the environment to collect experiences, which could be expensive and dangerous for practical applications. In this paper, we aim to solve this problem in an offline learning manner. More specifically, we develop the first offline quantum RL (offline QRL) algorithm, named CQ2L (Conservative Quantum Q-learning), which learns from offline samples and does not require any interaction with the environment. CQ2L utilizes variational quantum circuits (VQCs), which are improved with data re-uploading and scaling parameters, to represent the Q-value functions of agents. To suppress the overestimation of Q-values resulting from offline data, we first employ a double Q-learning framework to reduce the overestimation bias; then a penalty term that encourages generating conservative Q-values is designed. We conduct abundant experiments to demonstrate that the proposed method CQ2L can successfully solve offline QRL tasks on which its online counterpart fails.

AAMAS Conference 2023 Conference Paper

Provably Efficient Convergence of Primal-Dual Actor-Critic with Nonlinear Function Approximation

  • Jing Dong
  • Li Shen
  • Yinggan Xu
  • Baoxiang Wang

We study the convergence of the actor-critic algorithm with nonlinear function approximation under a nonconvex-nonconcave primal-dual formulation. Stochastic gradient descent ascent is applied with an adaptive proximal term for robust learning rates. We show the first efficient convergence result for primal-dual actor-critic, with a convergence rate of $O(\sqrt{\ln(N d G^2)/N})$ under Markovian sampling, where $G$ is the element-wise maximum of the gradient, $N$ is the number of iterations, and $d$ is the dimension of the gradient. Our result is presented with only the Polyak-Łojasiewicz (PL) condition for the dual variable, which is easy to verify and applicable to a wide range of RL scenarios.

NeurIPS Conference 2023 Conference Paper

Stability and Generalization of the Decentralized Stochastic Gradient Descent Ascent Algorithm

  • Miaoxi Zhu
  • Li Shen
  • Bo Du
  • Dacheng Tao

The growing size of available data has attracted increasing interest in solving minimax problems in a decentralized manner for various machine learning tasks. Previous theoretical research has primarily focused on the convergence rate and communication complexity of decentralized minimax algorithms, with little attention given to their generalization. In this paper, we investigate the primal-dual generalization bound of the decentralized stochastic gradient descent ascent (D-SGDA) algorithm using the approach of algorithmic stability under both convex-concave and nonconvex-nonconcave settings. Our theory refines the algorithmic stability in a decentralized manner and demonstrates that the decentralized structure does not destroy the stability and generalization of D-SGDA, implying that it can generalize as well as the vanilla SGDA in certain situations. Our results analyze the impact of different topologies on the generalization bound of the D-SGDA algorithm beyond trivial factors such as sample sizes, learning rates, and iterations. We also evaluate the optimization error and balance it with the generalization gap to obtain the optimal population risk of D-SGDA in the convex-concave setting. Additionally, we perform several numerical experiments which validate our theoretical findings.

NeurIPS Conference 2023 Conference Paper

Towards Stable Backdoor Purification through Feature Shift Tuning

  • Rui Min
  • Zeyu Qin
  • Li Shen
  • Minhao Cheng

It has been widely observed that deep neural networks (DNN) are vulnerable to backdoor attacks where attackers could manipulate the model behavior maliciously by tampering with a small set of training samples. Although a line of defense methods has been proposed to mitigate this threat, they either require complicated modifications to the training process or rely heavily on the specific model architecture, which makes them hard to deploy in real-world applications. Therefore, in this paper, we instead start with fine-tuning, one of the most common and easy-to-deploy backdoor defenses, through comprehensive evaluations against diverse attack scenarios. Observations made through initial experiments show that, in contrast to the promising defensive results at high poisoning rates, vanilla tuning methods completely fail at low poisoning rate scenarios. Our analysis shows that with a low poisoning rate, the entanglement between backdoor and clean features undermines the effect of tuning-based defenses. Therefore, it is necessary to disentangle the backdoor and clean features in order to improve backdoor purification. To address this, we introduce Feature Shift Tuning (FST), a method for tuning-based backdoor purification. Specifically, FST encourages feature shifts by actively deviating the classifier weights from the originally compromised weights. Extensive experiments demonstrate that our FST provides consistently stable performance under different attack settings. Without complex parameter adjustments, FST also achieves much lower tuning costs, only $10$ epochs. Our codes are available at https://github.com/AISafety-HKUST/stable_backdoor_purification.

NeurIPS Conference 2023 Conference Paper

Understanding How Consistency Works in Federated Learning via Stage-wise Relaxed Initialization

  • Yan Sun
  • Li Shen
  • Dacheng Tao

Federated learning (FL) is a distributed paradigm that coordinates massive local clients to collaboratively train a global model via stage-wise local training processes on heterogeneous datasets. Previous works have implicitly observed that FL suffers from the "client-drift" problem, which is caused by the inconsistent optimum across local clients. However, a solid theoretical analysis explaining the impact of this local inconsistency is still lacking. To alleviate the negative impact of the "client drift" and explore its substance in FL, in this paper we first design an efficient FL algorithm, FedInit, which employs a personalized relaxed initialization state at the beginning of each local training stage. Specifically, FedInit initializes the local state by moving away from the current global state towards the reverse direction of the latest local state. This relaxed initialization helps to revise the local divergence and enhance the local consistency level. Moreover, to further understand how inconsistency disrupts performance in FL, we introduce excess risk analysis and study the divergence term to investigate the test error of the proposed FedInit method. Our studies show that on non-convex objectives, the optimization error is not sensitive to this local inconsistency, while it mainly affects the generalization error bound in FedInit. Extensive experiments are conducted to validate this conclusion. Our proposed FedInit achieves state-of-the-art (SOTA) results compared to several advanced benchmarks without any additional costs. Meanwhile, stage-wise relaxed initialization can also be incorporated into current advanced algorithms to achieve higher performance in the FL paradigm.

NeurIPS Conference 2022 Conference Paper

Boosting the Transferability of Adversarial Attacks with Reverse Adversarial Perturbation

  • Zeyu Qin
  • Yanbo Fan
  • Yi Liu
  • Li Shen
  • Yong Zhang
  • Jue Wang
  • Baoyuan Wu

Deep neural networks (DNNs) have been shown to be vulnerable to adversarial examples, which can produce erroneous predictions by injecting imperceptible perturbations. In this work, we study the transferability of adversarial examples, which is significant due to its threat to real-world applications where model architecture or parameters are usually unknown. Many existing works reveal that adversarial examples are likely to overfit the surrogate model that they are generated from, limiting their transfer attack performance against different target models. To mitigate the overfitting of the surrogate model, we propose a novel attack method, dubbed reverse adversarial perturbation (RAP). Specifically, instead of minimizing the loss of a single adversarial point, we advocate seeking adversarial examples located in a region of uniformly low loss, by injecting the worst-case perturbation (the reverse adversarial perturbation) at each step of the optimization procedure. The adversarial attack with RAP is formulated as a min-max bi-level optimization problem. By integrating RAP into the iterative process for attacks, our method can find more stable adversarial examples which are less sensitive to changes of the decision boundary, mitigating the overfitting of the surrogate model. Comprehensive experimental comparisons demonstrate that RAP can significantly boost adversarial transferability. Furthermore, RAP can be naturally combined with many existing black-box attack techniques to further boost transferability. When attacking a real-world image recognition system, Google Cloud Vision API, we obtain a 22% performance improvement of targeted attacks over the compared method. Our codes are available at https://github.com/SCLBD/Transfer_attack_RAP.

IJCAI Conference 2022 Conference Paper

Few-Shot Adaptation of Pre-Trained Networks for Domain Shift

  • Wenyu Zhang
  • Li Shen
  • Wanyue Zhang
  • Chuan-Sheng Foo

Deep networks are prone to performance degradation when there is a domain shift between the source (training) data and target (test) data. Recent test-time adaptation methods update batch normalization layers of pre-trained source models deployed in new target environments with streaming data. Although these methods can adapt on-the-fly without first collecting a large target domain dataset, their performance is dependent on streaming conditions such as mini-batch size and class-distribution which can be unpredictable in practice. In this work, we propose a framework for few-shot domain adaptation to address the practical challenges of data-efficient adaptation. Specifically, we propose a constrained optimization of feature normalization statistics in pre-trained source models supervised by a small target domain support set. Our method is easy to implement and improves source model performance with as little as one sample per class for classification tasks. Extensive experiments on 5 cross-domain classification and 4 semantic segmentation datasets show that our proposed method achieves more accurate and reliable performance than test-time adaptation, while not being constrained by streaming conditions.

NeurIPS Conference 2022 Conference Paper

Make Sharpness-Aware Minimization Stronger: A Sparsified Perturbation Approach

  • Peng Mi
  • Li Shen
  • Tianhe Ren
  • Yiyi Zhou
  • Xiaoshuai Sun
  • Rongrong Ji
  • Dacheng Tao

Deep neural networks often suffer from poor generalization caused by complex and non-convex loss landscapes. One of the popular solutions is Sharpness-Aware Minimization (SAM), which smooths the loss landscape via minimizing the maximized change of training loss when adding a perturbation to the weight. However, we find the indiscriminate perturbation of SAM on all parameters is suboptimal, and it also results in excessive computation, i.e., double the overhead of common optimizers like Stochastic Gradient Descent (SGD). In this paper, we propose an efficient and effective training scheme coined Sparse SAM (SSAM), which achieves sparse perturbation via a binary mask. To obtain the sparse mask, we provide two solutions based on Fisher information and dynamic sparse training, respectively. In addition, we theoretically prove that SSAM can converge at the same rate as SAM, i.e., $O(\log T/\sqrt{T})$. Sparse SAM not only has the potential for training acceleration but also smooths the loss landscape effectively. Extensive experimental results on CIFAR10, CIFAR100, and ImageNet-1K confirm the superior efficiency of our method over SAM, and the performance is preserved or even better with a perturbation of merely 50% sparsity. Code is available at https://github.com/Mi-Peng/Sparse-Sharpness-Aware-Minimization.
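The sparse-perturbation idea can be sketched in a few lines: one SAM-style step where the ascent perturbation is restricted by a binary mask. This is a minimal NumPy illustration with hypothetical function names; the paper's actual mask construction (Fisher information or dynamic sparse training) is not reproduced here.

```python
import numpy as np

def ssam_step(w, grad_fn, mask, rho=0.05, lr=0.1):
    """One sketched Sparse-SAM update.

    w       : parameter vector
    grad_fn : callable returning the loss gradient at a point
    mask    : binary vector; 1 = coordinate receives the perturbation
    """
    g = grad_fn(w)
    # SAM ascent direction, zeroed outside the sparse mask
    eps = rho * g / (np.linalg.norm(g) + 1e-12) * mask
    # Descend using the gradient evaluated at the (sparsely) perturbed point
    g_sam = grad_fn(w + eps)
    return w - lr * g_sam
```

With an all-ones mask this reduces to the usual SAM step; a 50%-sparse mask halves the coordinates that must be perturbed, which is the source of the potential acceleration mentioned above.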

NeurIPS Conference 2022 Conference Paper

MissDAG: Causal Discovery in the Presence of Missing Data with Continuous Additive Noise Models

  • Erdun Gao
  • Ignavier Ng
  • Mingming Gong
  • Li Shen
  • Wei Huang
  • Tongliang Liu
  • Kun Zhang
  • Howard Bondell

State-of-the-art causal discovery methods usually assume that the observational data is complete. However, the missing data problem is pervasive in many practical scenarios such as clinical trials, economics, and biology. One straightforward way to address the missing data problem is first to impute the data using off-the-shelf imputation methods and then apply existing causal discovery methods. However, such a two-step method may suffer from suboptimality, as the imputation algorithm may introduce bias for modeling the underlying data distribution. In this paper, we develop a general method, which we call MissDAG, to perform causal discovery from data with incomplete observations. Focusing mainly on the assumptions of ignorable missingness and the identifiable additive noise models (ANMs), MissDAG maximizes the expected likelihood of the visible part of observations under the expectation-maximization (EM) framework. In the E-step, in cases where computing the posterior distributions of parameters in closed-form is not feasible, Monte Carlo EM is leveraged to approximate the likelihood. In the M-step, MissDAG leverages the density transformation to model the noise distributions with simpler and specific formulations by virtue of the ANMs and uses a likelihood-based causal discovery algorithm with directed acyclic graph constraint. We demonstrate the flexibility of MissDAG for incorporating various causal discovery algorithms and its efficacy through extensive simulations and real data experiments.

IJCAI Conference 2022 Conference Paper

Penalized Proximal Policy Optimization for Safe Reinforcement Learning

  • Linrui Zhang
  • Li Shen
  • Long Yang
  • Shixiang Chen
  • Xueqian Wang
  • Bo Yuan
  • Dacheng Tao

Safe reinforcement learning aims to learn the optimal policy while satisfying safety constraints, which is essential in real-world applications. However, current algorithms still struggle for efficient policy updates with hard constraint satisfaction. In this paper, we propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem. Specifically, P3O utilizes a simple yet effective penalty approach to eliminate cost constraints and removes the trust-region constraint by the clipped surrogate objective. We theoretically prove the exactness of the penalized method with a finite penalty factor and provide a worst-case analysis for approximate error when evaluated on sample trajectories. Moreover, we extend P3O to more challenging multi-constraint and multi-agent scenarios which are less studied in previous work. Extensive experiments show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotive tasks.

IJCAI Conference 2022 Conference Paper

Robust Weight Perturbation for Adversarial Training

  • Chaojian Yu
  • Bo Han
  • Mingming Gong
  • Li Shen
  • Shiming Ge
  • Du Bo
  • Tongliang Liu

Overfitting widely exists in adversarial robust training of deep networks. An effective remedy is adversarial weight perturbation, which injects the worst-case weight perturbation during network training by maximizing the classification loss on adversarial examples. Adversarial weight perturbation helps reduce the robust generalization gap; however, it also undermines the robustness improvement. A criterion that regulates the weight perturbation is therefore crucial for adversarial training. In this paper, we propose such a criterion, namely Loss Stationary Condition (LSC) for constrained perturbation. With LSC, we find that it is essential to conduct weight perturbation on adversarial data with small classification loss to eliminate robust overfitting. Weight perturbation on adversarial data with large classification loss is not necessary and may even lead to poor robustness. Based on these observations, we propose a robust perturbation strategy to constrain the extent of weight perturbation. The perturbation strategy prevents deep networks from overfitting while avoiding the side effect of excessive weight perturbation, significantly improving the robustness of adversarial training. Extensive experiments demonstrate the superiority of the proposed method over the state-of-the-art adversarial training methods.

NeurIPS Conference 2022 Conference Paper

Streaming Radiance Fields for 3D Video Synthesis

  • Lingzhi Li
  • Zhen Shen
  • Zhongshu Wang
  • Li Shen
  • Ping Tan

We present an explicit-grid based method for efficiently reconstructing streaming radiance fields for novel view synthesis of real world dynamic scenes. Instead of training a single model that combines all the frames, we formulate the dynamic modeling problem with an incremental learning paradigm in which per-frame model difference is trained to complement the adaption of a base model on the current frame. By exploiting the simple yet effective tuning strategy with narrow bands, the proposed method realizes a feasible framework for handling video sequences on-the-fly with high training efficiency. The storage overhead induced by using explicit grid representations can be significantly reduced through the use of model difference based compression. We also introduce an efficient strategy to further accelerate model optimization for each frame. Experiments on challenging video sequences demonstrate that our approach is capable of achieving a training speed of 15 seconds per frame with competitive rendering quality, which attains $1000\times$ speedup over the state-of-the-art implicit methods.

JMLR Journal 2022 Journal Article

Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration

  • Congliang Chen
  • Li Shen
  • Fangyu Zou
  • Wei Liu

Adam is one of the most influential adaptive stochastic algorithms for training deep neural networks, yet it has been shown to be divergent even in the simple convex setting via a few simple counterexamples. Many attempts, such as decreasing the adaptive learning rate, adopting a big batch size, incorporating a temporal decorrelation technique, seeking an analogous surrogate, etc., have been made to promote Adam-type algorithms to converge. In contrast with existing approaches, we introduce an alternative easy-to-check sufficient condition, which merely depends on the parameters of the base learning rate and combinations of historical second-order moments, to guarantee the global convergence of generic Adam for solving large-scale non-convex stochastic optimization. This observation, coupled with this sufficient condition, gives much deeper interpretations of the divergence of Adam. On the other hand, in practice, mini-Adam and distributed Adam are widely used without any theoretical guarantee. We further give an analysis of how the batch size or the number of nodes in a distributed system affects the convergence of Adam, which theoretically shows that mini-batch and distributed Adam can be linearly accelerated by using a larger mini-batch size or a larger number of nodes. Finally, we apply generic Adam and mini-batch Adam with the sufficient condition to solving the counterexample and training several neural networks on various real-world datasets. Experimental results are exactly in accord with our theoretical analysis.

TIST Journal 2021 Journal Article

Quantized Adam with Error Feedback

  • Congliang Chen
  • Li Shen
  • Haozhi Huang
  • Wei Liu

In this article, we present a distributed variant of an adaptive stochastic gradient method for training deep neural networks in the parameter-server model. To reduce the communication cost among the workers and server, we incorporate two types of quantization schemes, i.e., gradient quantization and weight quantization, into the proposed distributed Adam. In addition, to reduce the bias introduced by quantization operations, we propose an error-feedback technique to compensate for the quantized gradient. Theoretically, in the stochastic nonconvex setting, we show that the distributed adaptive gradient method with gradient quantization and error feedback converges to the first-order stationary point, and that the distributed adaptive gradient method with weight quantization and error feedback converges to the point related to the quantized level under both the single-worker and multi-worker modes. Last, we apply the proposed distributed adaptive gradient methods to train deep neural networks. Experimental results demonstrate the efficacy of our methods.
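The error-feedback mechanism described above is simple to illustrate: the residual lost to quantization is carried forward and added to the next gradient before quantizing, so the compression error does not compound over iterations. A minimal NumPy sketch, using an assumed toy 1-bit quantizer rather than the paper's quantization schemes; function names are hypothetical.

```python
import numpy as np

def quantize_sign(v):
    # Toy 1-bit quantizer: sign of each entry, scaled by the mean magnitude.
    # An assumed scheme, used only to illustrate the error-feedback idea.
    return np.sign(v) * np.mean(np.abs(v))

def ef_compress(grad, error):
    """Error feedback: quantize (gradient + carried residual), return the
    quantized message and the new residual to carry into the next step."""
    corrected = grad + error
    q = quantize_sign(corrected)
    new_error = corrected - q  # the part lost to quantization
    return q, new_error
```

By construction `q + new_error` equals the corrected signal exactly, which is why the residuals telescope across steps instead of accumulating bias.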

NeurIPS Conference 2021 Conference Paper

Sparse Training via Boosting Pruning Plasticity with Neuroregeneration

  • Shiwei Liu
  • Tianlong Chen
  • Xiaohan Chen
  • Zahra Atashgahi
  • Lu Yin
  • Huanyu Kou
  • Li Shen
  • Mykola Pechenizkiy

Works on the lottery ticket hypothesis (LTH) and single-shot network pruning (SNIP) have recently drawn considerable attention to post-training pruning (iterative magnitude pruning) and before-training pruning (pruning at initialization). The former method suffers from an extremely large computation cost and the latter usually struggles with insufficient performance. In comparison, during-training pruning, a class of pruning methods that simultaneously enjoys training/inference efficiency and comparable performance, has so far been less explored. To better understand during-training pruning, we quantitatively study the effect of pruning throughout training from the perspective of pruning plasticity (the ability of the pruned networks to recover the original performance). Pruning plasticity can help explain several other empirical observations about neural network pruning in the literature. We further find that pruning plasticity can be substantially improved by injecting a brain-inspired mechanism called neuroregeneration, i.e., regenerating the same number of connections as pruned. We design a novel gradual magnitude pruning (GMP) method, named gradual pruning with zero-cost neuroregeneration (GraNet), that advances the state of the art. Perhaps most impressively, its sparse-to-sparse version for the first time boosts sparse-to-sparse training performance over various dense-to-sparse methods with ResNet-50 on ImageNet without extending the training time. We release all codes at https://github.com/Shiweiliuiiiiiii/GraNet.

YNIMG Journal 2020 Journal Article

A multi-model deep convolutional neural network for automatic hippocampus segmentation and classification in Alzheimer’s disease

  • Manhua Liu
  • Fan Li
  • Hao Yan
  • Kundong Wang
  • Yixin Ma
  • Li Shen
  • Mingqing Xu

Alzheimer's disease (AD) is a progressive and irreversible brain degenerative disorder. Mild cognitive impairment (MCI) is a clinical precursor of AD. Although some treatments can delay its progression, no effective cures are available for AD. Accurate early-stage diagnosis of AD is vital for the prevention and intervention of the disease progression. Hippocampus is one of the first affected brain regions in AD. To help AD diagnosis, the shape and volume of the hippocampus are often measured using structural magnetic resonance imaging (MRI). However, these features encode limited information and may suffer from segmentation errors. Additionally, the extraction of these features is independent of the classification model, which could result in sub-optimal performance. In this study, we propose a multi-model deep learning framework based on convolutional neural network (CNN) for joint automatic hippocampal segmentation and AD classification using structural MRI data. Firstly, a multi-task deep CNN model is constructed for jointly learning hippocampal segmentation and disease classification. Then, we construct a 3D Densely Connected Convolutional Networks (3D DenseNet) to learn features of the 3D patches extracted based on the hippocampal segmentation results for the classification task. Finally, the learned features from the multi-task CNN and DenseNet models are combined to classify disease status. Our method is evaluated on the baseline T1-weighted structural MRI data collected from 97 AD, 233 MCI, 119 Normal Control (NC) subjects in the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. The proposed method achieves a dice similarity coefficient of 87.0% for hippocampal segmentation. In addition, the proposed method achieves an accuracy of 88.9% and an AUC (area under the ROC curve) of 92.5% for classifying AD vs. NC subjects, and an accuracy of 76.2% and an AUC of 77.5% for classifying MCI vs. NC subjects. Our empirical study also demonstrates that the proposed multi-model method outperforms the single-model methods and several other competing methods.

AAAI Conference 2020 Conference Paper

Adaptive Activation Network and Functional Regularization for Efficient and Flexible Deep Multi-Task Learning

  • Yingru Liu
  • Xuewen Yang
  • Dongliang Xie
  • Xin Wang
  • Li Shen
  • Haozhi Huang
  • Niranjan Balasubramanian

Multi-task learning (MTL) is a common paradigm that seeks to improve the generalization performance of task learning by training related tasks simultaneously. However, it is still a challenging problem to search for a flexible and accurate architecture that can be shared among multiple tasks. In this paper, we propose a novel deep learning model called Task Adaptive Activation Network (TAAN) that can automatically learn the optimal network architecture for MTL. The main principle of TAAN is to derive flexible activation functions for different tasks from the data, with the other parameters of the network fully shared. We further propose two functional regularization methods that improve the MTL performance of TAAN. The improved performance of both TAAN and the regularization methods is demonstrated by comprehensive experiments.

IJCAI Conference 2019 Conference Paper

Discrete Trust-aware Matrix Factorization for Fast Recommendation

  • Guibing Guo
  • Enneng Yang
  • Li Shen
  • Xiaochun Yang
  • Xiaodong He

Trust-aware recommender systems have received much attention recently for their abilities to capture the influence among connected users. However, they suffer from the efficiency issue due to large amount of data and time-consuming real-valued operations. Although existing discrete collaborative filtering may alleviate this issue to some extent, it is unable to accommodate social influence. In this paper we propose a discrete trust-aware matrix factorization (DTMF) model to take dual advantage of both social relations and discrete techniques for fast recommendation. Specifically, we map the latent representations of users and items into a joint Hamming space by recovering the rating and trust interactions between users and items. We adopt a sophisticated discrete coordinate descent (DCD) approach to optimize our proposed model. In addition, experiments on two real-world datasets demonstrate the superiority of our approach against other state-of-the-art approaches in terms of ranking accuracy and efficiency.

JBHI Journal 2019 Journal Article

Mining Directional Drug Interaction Effects on Myopathy Using the FAERS Database

  • Danai Chasioti
  • Xiaohui Yao
  • Pengyue Zhang
  • Samuel Lerner
  • Sara K. Quinney
  • Xia Ning
  • Lang Li
  • Li Shen

Mining high-order drug-drug interaction (DDI) induced adverse drug effects from electronic health record databases is an emerging area, and very few studies have explored the relationships between high-order drug combinations. We investigate a novel pharmacovigilance problem for mining directional DDI effects on myopathy using the FDA Adverse Event Reporting System (FAERS) database. Our paper provides information on the risk of myopathy associated with adding new drugs on the already prescribed medication, and visualizes the identified directional DDI patterns as user-friendly graphical representation. We utilize the Apriori algorithm to extract frequent drug combinations from the FAERS database. We use odds ratio to estimate the risk of myopathy associated with directional DDI. We create a tree-structured graph to visualize the findings for easy interpretation. Our method confirmed myopathy association with previously reported HMG-CoA reductase inhibitors like rosuvastatin, fluvastatin, simvastatin, and atorvastatin. New, previously unidentified but mechanistically plausible associations with myopathy were also observed, such as the DDI between pamidronate and levofloxacin. Additional top findings are gadolinium-based imaging agents, which however are often used in myopathy diagnosis. Other DDIs with no obvious mechanism are also reported, such as that of sulfamethoxazole with trimethoprim and potassium chloride. This study shows the feasibility to estimate high-order directional DDIs in a fast and accurate manner. The results of the analysis could become a useful tool in the specialists’ hands through an easy-to-understand graphic visualization.

NeurIPS Conference 2018 Conference Paper

Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks

  • Jie Hu
  • Li Shen
  • Samuel Albanie
  • Gang Sun
  • Andrea Vedaldi

While the use of bottom-up local operators in convolutional neural networks (CNNs) matches well some of the statistics of natural images, it may also prevent such models from capturing contextual long-range feature interactions. In this work, we propose a simple, lightweight approach for better context exploitation in CNNs. We do so by introducing a pair of operators: gather, which efficiently aggregates feature responses from a large spatial extent, and excite, which redistributes the pooled information to local features. The operators are cheap, both in terms of number of added parameters and computational complexity, and can be integrated directly in existing architectures to improve their performance. Experiments on several datasets show that gather-excite can bring benefits comparable to increasing the depth of a CNN at a fraction of the cost. For example, we find ResNet-50 with gather-excite operators is able to outperform its 101-layer counterpart on ImageNet with no additional learnable parameters. We also propose a parametric gather-excite operator pair which yields further performance gains, relate it to the recently-introduced Squeeze-and-Excitation Networks, and analyse the effects of these changes to the CNN feature activation statistics.
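The gather/excite operator pair described above can be sketched in a parameter-free form. This is a minimal NumPy illustration under assumed simplifications (global average pooling as the gather, a sigmoid gate as the excite, no learned parameters); the paper also proposes parametric variants not shown here.

```python
import numpy as np

def gather_excite(x):
    """Parameter-free gather-excite sketch on a feature map x of shape (C, H, W)."""
    # Gather: aggregate each channel's response over its full spatial extent
    pooled = x.mean(axis=(1, 2), keepdims=True)   # (C, 1, 1)
    # Excite: redistribute the pooled context as a multiplicative sigmoid gate
    gate = 1.0 / (1.0 + np.exp(-pooled))
    return x * gate
```

Because the gate is computed from a single pooled value per channel, the added cost is negligible relative to the convolutional layers it augments.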

AAAI Conference 2017 Conference Paper

Adaptive Proximal Average Approximation for Composite Convex Minimization

  • Li Shen
  • Wei Liu
  • Junzhou Huang
  • Yu-Gang Jiang
  • Shiqian Ma

We propose a fast first-order method to solve multi-term nonsmooth composite convex minimization problems by employing a recent proximal average approximation technique and a novel adaptive parameter tuning technique. Thanks to this powerful parameter tuning technique, the proximal gradient step can be performed with a much larger stepsize in the algorithm implementation compared with the prior PA-APG method (Yu 2013), which is the core to enable significant improvements in practical performance. Moreover, by choosing the approximation parameter adaptively, the proposed method is shown to enjoy the O(1/k) iteration complexity theoretically without needing any extra computational cost, while the PA-APG method requires many more iterations for convergence. The preliminary experimental results on overlapping group Lasso and graph-guided fused Lasso problems confirm our theoretical claims well, and indicate that the proposed method is almost five times faster than the state-of-the-art PA-APG method and therefore suitable for optimization requiring higher precision.
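
The proximal average idea underlying both PA-APG and the adaptive method can be illustrated concretely: instead of computing the (often intractable) proximal map of a weighted sum of nonsmooth terms, one takes the weighted average of the individual proximal maps. A hedged sketch with hypothetical helper names, using l1 soft-thresholding as the example term:

```python
import numpy as np

def prox_l1(v, t):
    """Proximal map of t * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_average(v, t, proxes, weights):
    """Proximal average: weighted mean of individual proximal maps,
    used as a cheap surrogate for the prox of the weighted sum of terms."""
    return sum(w + 0.0 for w in []) + sum(w * p(v, t) for w, p in zip(weights, proxes))

v = np.array([2.0, -0.5, 0.1])
step = prox_average(v, 1.0, [prox_l1, prox_l1], [0.5, 0.5])  # equals prox_l1 here since both terms coincide
```

In an accelerated proximal gradient loop, a step like `prox_average` replaces the exact proximal step; the abstract's contribution is tuning the approximation parameter adaptively so that a larger stepsize remains admissible.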

AAAI Conference 2016 Conference Paper

Co-Occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks

  • Wentao Zhu
  • Cuiling Lan
  • Junliang Xing
  • Wenjun Zeng
  • Yanghao Li
  • Li Shen
  • Xiaohui Xie

Skeleton based action recognition distinguishes human actions using the trajectories of skeleton joints, which provide a very good representation for describing actions. Considering that recurrent neural networks (RNNs) with Long Short-Term Memory (LSTM) can learn feature representations and model long-term temporal dependencies automatically, we propose an end-to-end fully connected deep LSTM network for skeleton based action recognition. Inspired by the observation that the co-occurrences of the joints intrinsically characterize human actions, we take the skeleton as the input at each time slot and introduce a novel regularization scheme to learn the co-occurrence features of skeleton joints. To train the deep LSTM network effectively, we propose a new dropout algorithm which simultaneously operates on the gates, cells, and output responses of the LSTM neurons. Experimental results on three human action recognition datasets consistently demonstrate the effectiveness of the proposed model.

IJCAI Conference 2015 Conference Paper

Adaptive Sharing for Image Classification

  • Li Shen
  • Gang Sun
  • Zhouchen Lin
  • Qingming Huang
  • Enhua Wu

In this paper, we formulate the image classification problem in a multi-task learning framework. We propose a novel method to adaptively share information among tasks (classes). Different from imposing strong assumptions or discovering specific structures, the key insight in our method is to selectively extract and exploit the shared information among classes while capturing respective disparities simultaneously. It is achieved by estimating a composite of two sets of parameters with different regularization. Besides applying it for learning classifiers on pre-computed features, we also integrate the adaptive sharing with deep neural networks, whose discriminative power can be augmented by encoding class relationship. We further develop two strategies for solving the optimization problems in the two scenarios. Empirical results demonstrate that our method can significantly improve the classification performance by transferring knowledge appropriately.

JBHI Journal 2014 Journal Article

Automatic Motion Analysis System for Pyloric Flow in Ultrasonic Videos

  • Chaojie Chen
  • Yuanyuan Wang
  • Jinhua Yu
  • Zhuyu Zhou
  • Li Shen
  • Ya-Qing Chen

Ultrasonography has been widely used to evaluate duodenogastric reflux (DGR). But to the best of our knowledge, no automatic analysis system has been developed to realize quantitative computer-aided analysis. In this paper, we propose a system to perform automatic detection of DGR in ultrasonic image sequences by applying automatic motion analysis. The motion field is estimated based on image velocimetry. Then, an intelligent motion analysis is applied. For the DGR detection, the motion and structural information is combined to analyze the transpyloric motion of the fluid. In order to test the performance of the proposed system, we designed experiments with real and synthetic ultrasonic data. The proposed system achieved good performance in DGR detection. The automatic results agreed with the gold standard in analyzing the fluid motion. The proposed system is expected to be a promising tool for the study and evaluation of DGR.

NeurIPS Conference 2012 Conference Paper

High-Order Multi-Task Feature Learning to Identify Longitudinal Phenotypic Markers for Alzheimer's Disease Progression Prediction

  • Hua Wang
  • Feiping Nie
  • Heng Huang
  • Jingwen Yan
  • Sungeun Kim
  • Shannon Risacher
  • Andrew Saykin
  • Li Shen

Alzheimer disease (AD) is a neurodegenerative disorder characterized by progressive impairment of memory and other cognitive functions. Regression analysis has been studied to relate neuroimaging measures to cognitive status. However, whether these measures have further predictive power to infer a trajectory of cognitive performance over time is still an under-explored but important topic in AD research. We propose a novel high-order multi-task learning model to address this issue. The proposed model explores the temporal correlations existing in data features and regression tasks by the structured sparsity-inducing norms. In addition, the sparsity of the model enables the selection of a small number of MRI measures while maintaining high prediction accuracy. The empirical studies, using the baseline MRI and serial cognitive data of the ADNI cohort, have yielded promising results.
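
The structured sparsity-inducing norms mentioned above commonly include the l2,1 norm, which couples the regression tasks so that each feature is selected or discarded jointly across all time points. A minimal sketch of that norm (illustrative only, not the paper's formulation):

```python
import numpy as np

def l21_norm(W):
    """l2,1 norm of a (features x tasks) weight matrix:
    the sum over features of the l2 norm of each row.
    Penalizing this drives entire rows (features) to zero across all tasks,
    which is what enables selecting a small number of MRI measures."""
    return np.sqrt((W ** 2).sum(axis=1)).sum()

W = np.array([[3.0, 4.0],    # feature kept: contributes ||(3, 4)||_2 = 5
              [0.0, 0.0]])   # feature jointly dropped from both tasks
print(l21_norm(W))  # 5.0
```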