Author name cluster

Tianlong Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

56 papers

1 author row

TMLR Journal 2026 Journal Article

$\texttt{LucidAtlas}$: Learning Uncertainty-Aware, Covariate-Disentangled, Individualized Atlas Representations

Yining Jiao
Sreekalyani Bhamidi
Carlton Jude ZDANSKI
Huaizhi Qu
Julia S Kimbell
Andrew Prince
Cameron P Worden
Samuel Kirse

Interpreting how covariates influence spatially structured biological variation — for example, how pediatric airway geometry changes along the airway and across a growing population — remains a key challenge in developing models suitable for clinical application. We present $\texttt{LucidAtlas}$, a versatile framework for modeling and interpreting spatially varying information with associated covariates. To address the limitations of neural additive models when analyzing dependent covariates, we introduce a marginalization approach that enables accurate explanations of how combinations of covariates shape the learned atlas. $\texttt{LucidAtlas}$ integrates covariate interpretation, spatial representation, individualized prediction, population distribution analysis, and out-of-distribution detection into a single interpretable model. We validate its effectiveness on a synthetic spatiotemporal dataset, the OASIS brain volume dataset, and a pediatric airway shape dataset. Our findings underscore the critical role of by-construction interpretable models in advancing scientific discovery. The implementation is publicly available at https://github.com/****.

AAAI Conference 2026 Conference Paper

COIN: Uncertainty-Guarding Selective Question Answering for Foundation Models with Provable Risk Guarantees

Zhiyuan Wang
Jinhao Duan
Qingni Wang
Xiaofeng Zhu
Tianlong Chen
Xiaoshuang Shi
Kaidi Xu

Uncertainty quantification (UQ) in foundation models is crucial for identifying and mitigating hallucinations in automatically generated text. However, heuristic UQ approaches lack statistical guarantees for key metrics such as the false discovery rate (FDR) in selective prediction tasks. Previous research adopts the split conformal prediction (SCP) framework to ensure desired coverage of admissible answers by constructing data-driven prediction sets, yet these sets typically contain incorrect candidates, undermining their practical effectiveness. To address this, we introduce COIN, an uncertainty-guarding selection framework that calibrates statistically valid uncertainty thresholds to filter a single generated answer per question under user-specified FDR constraints. COIN estimates the empirical error rate on the calibration set and applies confidence interval methods such as Clopper–Pearson to establish a high-probability upper bound on the true error rate (i.e., FDR). This enables the selection of the largest threshold that ensures FDR control on test data while significantly increasing sample retention. We demonstrate COIN's robustness in risk control, strong test-time power in retaining admissible answers, and predictive efficiency under limited calibration data across both general and multimodal text generation tasks. Furthermore, we show that employing alternative UQ and upper bound construction strategies can further boost COIN's power performance, which underscores its extensibility and adaptability to diverse application scenarios.
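
The calibration step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `clopper_pearson_upper` and `select_threshold`, the calibration format (one uncertainty score and one correctness flag per answer), and the plain scan over candidate thresholds are all assumptions made for illustration, and the sketch does not claim to reproduce the paper's guarantees.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_upper(k, n, delta):
    """One-sided Clopper-Pearson upper bound on an error rate.

    k errors observed among n accepted answers; the bound holds with
    probability at least 1 - delta. Solved by bisection on p.
    """
    if n == 0 or k >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if binom_cdf(k, n, mid) > delta:
            lo = mid  # the bound lies above mid
        else:
            hi = mid
    return hi


def select_threshold(uncertainties, correct, alpha, delta):
    """Largest uncertainty threshold whose error-rate bound is <= alpha.

    An answer is accepted when its uncertainty is <= the threshold.
    Returns None if no candidate threshold satisfies the constraint.
    """
    best = None
    for t in sorted(set(uncertainties)):
        accepted = [c for u, c in zip(uncertainties, correct) if u <= t]
        errors = sum(1 for c in accepted if not c)
        if clopper_pearson_upper(errors, len(accepted), delta) <= alpha:
            best = t  # thresholds are scanned in increasing order
    return best
```

Here `alpha` plays the role of the user-specified FDR level and `delta` the allowed failure probability of the bound; both names are hypothetical.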

AAAI Conference 2026 Conference Paper

Model Editing as a Double-Edged Sword: Steering Agent Behavior Toward Beneficence or Harm

Baixiang Huang
Zhen Tan
Haoran Wang
Zijie Liu
Dawei Li
Ali Payani
Huan Liu
Tianlong Chen

Agents based on Large Language Models (LLMs) have demonstrated strong capabilities across a wide range of tasks. However, deploying LLM-based agents in high-stakes domains comes with significant safety and ethical risks. Unethical behavior by these agents can directly result in serious real-world consequences, including physical harm and financial loss. To efficiently steer the ethical behavior of agents, we frame agent behavior steering as a model editing task, which we term Behavior Editing. Model editing is an emerging area of research that enables precise and efficient modifications to LLMs while preserving their overall capabilities. To systematically study and evaluate this approach, we introduce BehaviorBench, a multi-tier benchmark grounded in psychological moral theories. This benchmark supports both the evaluation and editing of agent behaviors across a variety of scenarios, with each tier introducing more complex and ambiguous scenarios. We first demonstrate that Behavior Editing can dynamically steer agents toward the target behavior within specific scenarios. Moreover, Behavior Editing enables not only scenario-specific local adjustments but also more extensive shifts in an agent’s global moral alignment. We demonstrate that Behavior Editing can be used to promote ethical and benevolent behavior or, conversely, to induce harmful or malicious behavior. Through extensive evaluations of agents built on frontier LLMs, BehaviorBench validates the effectiveness of behavior editing across a wide range of models and scenarios. Our findings offer key insights into a new paradigm for steering agent behavior, highlighting both the promise and perils of Behavior Editing.

AAAI Conference 2026 Conference Paper

OR-R1: Automating Modeling and Solving of Operations Research Optimization Problem via Test-Time Reinforcement Learning

Zezhen Ding
Zhen Tan
Jiheng Zhang
Tianlong Chen

Optimization modeling and solving are fundamental to the application of Operations Research (OR) in real-world decision making, yet the process of translating natural language problem descriptions into formal models and solver code remains highly expertise-intensive. While recent advances in large language models (LLMs) have opened new opportunities for automation, the generalization ability and data efficiency of existing LLM-based methods are still limited, as most require vast amounts of annotated or synthetic data, resulting in high costs and scalability barriers. In this work, we present OR-R1, a data-efficient training framework for automated optimization modeling and solving. OR-R1 first employs supervised fine-tuning (SFT) to help the model acquire the essential reasoning patterns for problem formulation and code generation from limited labeled data. In addition, it improves the capability and consistency through Test-Time Group Relative Policy Optimization (TGRPO). This two-stage design enables OR-R1 to leverage both scarce labeled and abundant unlabeled data for effective learning. Experiments show that OR-R1 achieves state-of-the-art performance with an average solving accuracy of 67.7%, using only 1/10 the synthetic data required by prior methods such as ORLM, exceeding ORLM’s solving accuracy by up to 4.2%. Remarkably, OR-R1 outperforms ORLM by over 2.4% with just 100 synthetic samples. Furthermore, TGRPO contributes an additional 3.1%–6.4% improvement in accuracy, significantly narrowing the gap between single-attempt (Pass@1) and multi-attempt (Pass@8) performance from 13% to 7%. Extensive evaluations across diverse real-world benchmarks demonstrate that OR-R1 provides a robust, scalable, and cost-effective solution for automated OR optimization problem modeling and solving, lowering the expertise and data barriers for industrial OR applications.

AAAI Conference 2026 Conference Paper

Vulnerability-Aware Robust Multimodal Adversarial Training

Junrui Zhang
Xinyu Zhao
Jie Peng
Chenjie Wang
Jianmin Ji
Tianlong Chen

Multimodal learning has shown significant superiority on various tasks by integrating multiple modalities. However, the interdependencies among modalities increase the susceptibility of multimodal models to adversarial attacks. Existing methods mainly focus on attacks on specific modalities or indiscriminately attack all modalities. In this paper, we find that these approaches ignore the differences between modalities in their contribution to final robustness, resulting in suboptimal robustness performance. To bridge this gap, we introduce Vulnerability-Aware Robust Multimodal Adversarial Training (VARMAT), a probe-in-training adversarial training method that improves multimodal robustness by identifying the vulnerability of each modality. To be specific, VARMAT first explicitly quantifies the vulnerability of each modality, grounded in a first-order approximation of the attack objective (Probe). Then, we propose a targeted regularization term that penalizes modalities with high vulnerability, guiding robust learning while maintaining task accuracy (Training). We demonstrate the enhanced robustness of our method across multiple multimodal datasets involving diverse modalities. Finally, we achieve robustness improvements of 12.73%, 22.21%, and 11.19% on three multimodal datasets, revealing a significant blind spot in multimodal adversarial training.

NeurIPS Conference 2025 Conference Paper

$\texttt{BetaConform}$: Efficient MAP Estimation of LLM Ensemble Judgment Performance with Prior Transfer

Huaizhi Qu
Inyoung Choi
Zhen Tan
Song Wang
Sukwon Yun
Qi Long
Faizan Siddiqui
Kwonjoon Lee

LLM ensembles are widely used for LLM judges. However, how to estimate their accuracy, especially in an efficient way, is unknown. In this paper, we present a principled $\textit{maximum a posteriori}$ (MAP) framework for an economical and precise estimation of the performance of LLM ensemble judgment. We first propose a mixture of Beta-Binomial distributions to model the judgment distribution, revising from the vanilla Binomial distribution. Next, we introduce a conformal prediction-driven approach that enables adaptive stopping during iterative sampling to balance accuracy with efficiency. Furthermore, we design a prior transfer mechanism that utilizes learned distributions on open-source datasets to improve estimation on a target dataset when only scarce annotations are available. Finally, we present $\texttt{BetaConform}$, a framework that integrates our distribution assumption, adaptive stopping, and the prior transfer mechanism to deliver a theoretically guaranteed distribution estimation of LLM ensemble judgment with minimum labeled samples. $\texttt{BetaConform}$ is also validated empirically. For instance, with only $10$ samples from the TruthfulQA dataset, for a Llama ensembled judge, $\texttt{BetaConform}$ gauges its performance with an error margin as small as $3.37\%$.
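
As a rough illustration of the distributional assumption named in this abstract, the following Python sketch evaluates the probability mass function of a mixture of Beta-Binomial distributions. The function names, the `(weight, a, b)` component format, and the reading of `k` as the number of correct judges out of `n` are assumptions for illustration only; the abstract does not specify the authors' parameterization.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(k, n, a, b):
    """P(K = k) when K ~ Binomial(n, p) and p ~ Beta(a, b)."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_comb + log_beta(k + a, n - k + b) - log_beta(a, b))


def mixture_pmf(k, n, components):
    """Mixture of Beta-Binomials; components is a list of (weight, a, b)."""
    return sum(w * beta_binomial_pmf(k, n, a, b) for w, a, b in components)
```

Unlike a single Binomial, whose success probability is fixed, each component lets the success probability vary according to a Beta prior, and the mixture can place mass on more than one mode.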

TMLR Journal 2025 Journal Article

A Survey on Model MoErging: Recycling and Routing Among Specialized Experts for Collaborative Learning

Prateek Yadav
Colin Raffel
Mohammed Muqeeth
Lucas Caccia
Haokun Liu
Tianlong Chen
Mohit Bansal
Leshem Choshen

The availability of performant pre-trained models has led to a proliferation of fine-tuned expert models that are specialized to a particular domain or task. Model MoErging methods aim to recycle expert models to create an aggregate system with improved performance or generalization. A key component of MoErging methods is the creation of a router that decides which expert model(s) to use for a particular input or application. The promise, effectiveness, and large design space of MoErging has spurred the development of many new methods over the past few years. This rapid pace of development has made it challenging to compare different MoErging methods, which are rarely compared to one another and are often validated in different experimental setups. To remedy such gaps, we present a comprehensive survey of MoErging methods that includes a novel taxonomy for cataloging key design choices and clarifying suitable applications for each method. Apart from surveying MoErging research, we inventory software tools and applications that make use of MoErging. We additionally discuss related fields of study such as model merging, multitask learning, and mixture-of-experts models. Taken as a whole, our survey provides a unified overview of existing MoErging methods and creates a solid foundation for future work in this burgeoning field.

AAAI Conference 2025 Conference Paper

BrainMAP: Learning Multiple Activation Pathways in Brain Networks

Song Wang
Zhenyu Lei
Zhen Tan
Jiaqi Ding
Xinyu Zhao
Yushun Dong
Guorong Wu
Tianlong Chen

Functional Magnetic Resonance Imaging (fMRI) is commonly employed to study human brain activity, since it offers insight into the relationship between functional fluctuations and human behavior. To enhance analysis and comprehension of brain activity, Graph Neural Networks (GNNs) have been widely applied to the analysis of functional connectivities (FC) derived from fMRI data, due to their ability to capture the synergistic interactions among brain regions. However, in the human brain, performing complex tasks typically involves the activation of certain pathways, which could be represented as paths across graphs. As such, conventional GNNs struggle to learn from these pathways due to the long-range dependencies of multiple pathways. To address these challenges, we introduce a novel framework, BrainMAP, to learn multiple pathways in brain networks. BrainMAP leverages sequential models to identify long-range correlations among sequentialized brain regions and incorporates an aggregation module based on Mixture of Experts (MoE) to learn from multiple pathways. Our comprehensive experiments highlight BrainMAP's superior performance. Furthermore, our framework enables explanatory analyses of crucial brain regions involved in tasks.

NeurIPS Conference 2025 Conference Paper

BrainMoE: Cognition Joint Embedding via Mixture-of-Expert Towards Robust Brain Foundation Model

Ziquan Wei
Tingting Dan
Tianlong Chen
Guorong Wu

Given the large scale of public functional Magnetic Resonance Imaging (fMRI), e.g., UK Biobank (UKB) and Human Connectome Projects (HCP), brain foundation models are emerging. Although the amount of samples under rich environmental variables is unprecedented, existing brain foundation models learn from fMRI derived from a narrow range of cognitive states stimulated by similar environments, causing the limited robustness demonstrated in various applications and datasets acquired with different pipelines and limited sample size. By capitalizing on the variety of cognitive status as subjects performing explicit tasks, we present the mixture of brain experts, namely BrainMoE, pre-training on tasking fMRI with rich behavioral tasks in addition to resting fMRI for a robust brain foundation model. Brain experts are designed to produce embeddings for different behavioral tasks related to cognition. Afterward, these cognition embeddings are mixed by a cognition adapter via cross-attention so that BrainMoE can handle orthogonal embeddings and be robust on those boutique downstream datasets. We have pre-trained two existing self-regressive architectures and one new supervised architecture as brain experts on 68,251 fMRI scans among UKB and HCP, containing 12 different cognitive states. Then, BrainMoE is evaluated on a variety of applications, including sex, age prediction, human behavior recognition, disease early diagnosis of Autism, Parkinson's disease, Alzheimer's disease, and Schizophrenia, and fMRI-EEG multimodal applications, where promising results in eight datasets from three different pipelines indicate great potential to facilitate current neuroimaging applications in clinical routines.

AAAI Conference 2025 Conference Paper

Breaking the Resource Monopoly from Industries: Sustainable and Reliable LLM Serving by Recycling Outdated and Resource-Constrained GPUs

Tianlong Chen

In recent years, Large Language Model (LLM) agents, exemplified by models like ChatGPT and PaLM, have showcased remarkable prowess in various tasks, owing to their vast number of parameters and emergent in-context learning capabilities. People expect the wide usage of LLM serving at edge hardware, personal devices, and organization/enterprise IT infrastructures to revolutionize global access to information, communication, automation, and creativity. However, due to the extremely large scale of LLM parameters (LLaMA 3.1 contains 405 billion 2- or 4-byte floating-point numbers), LLM serving is facing significant sustainability pressure due to its requirements on the latest high-embodied carbon hardware (e.g., GPUs, HBMs, memory, storage, and network hardware) and the high operational carbon emissions, leading to a significant and alarming increase in carbon emissions and a high barrier to widespread deployment and practical application in various scenarios. Companies, organizations, and institutes usually have a complete general-purpose IT infrastructure, which consists of a large amount of computing, memory, storage, and network hardware. Although these general-purpose IT infrastructures are far more than enough for existing application executions, deploying and executing LLMs across a broad spectrum of serving platforms can be challenging due to resource limitations. Purchasing the latest hardware including GPUs (e.g., Nvidia H100 or H200) will lead to considerable issues including 1) serious embodied carbon emissions during the new hardware production, 2) no explicitly lower operational carbon emissions with essential modeling and optimizations, 3) high economic and financial pressures, and 4) potentially tremendous waste of existing hardware resources.
Therefore, it is becoming essential to explore how to use existing hardware, especially outdated hardware, to collectively improve the environmental sustainability, efficiency, and reliability of LLM serving. A few pioneering examples include Microsoft’s Project Natick, Google’s TPU Pod Optimization, Alibaba’s Cloud Server Repurposing, and Facebook’s Network Hardware Reuse. In this talk, I will traverse my series of contributions with promising new directions, particularly emphasizing modularized LLM architecture (Part 1), in-storage sustainable computing (Part 2), and reliable serving against software and hardware attacks (Part 3).

AAAI Conference 2025 Conference Paper

DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis

Pan Wang
Qiang Zhou
Yawen Wu
Tianlong Chen
Jingtong Hu

Multimodal Sentiment Analysis (MSA) leverages heterogeneous modalities, such as language, vision, and audio, to enhance the understanding of human sentiment. While existing models often focus on extracting shared information across modalities or directly fusing heterogeneous modalities, such approaches can introduce redundancy and conflicts due to equal treatment of all modalities and the mutual transfer of information between modality pairs. To address these issues, we propose a Disentangled-Language-Focused (DLF) multimodal representation learning framework, which incorporates a feature disentanglement module to separate modality-shared and modality-specific information. To further reduce redundancy and enhance language-targeted features, four geometric measures are introduced to refine the disentanglement process. A Language-Focused Attractor (LFA) is further developed to strengthen language representation by leveraging complementary modality-specific information through a language-guided cross-attention mechanism. The framework also employs hierarchical predictions to improve overall accuracy. Extensive experiments on two popular MSA datasets, CMU-MOSI and CMU-MOSEI, demonstrate the significant performance gains achieved by the proposed DLF framework. Comprehensive ablation studies further validate the effectiveness of the feature disentanglement module, language-focused attractor, and hierarchical predictions.

NeurIPS Conference 2025 Conference Paper

IndustryEQA: Pushing the Frontiers of Embodied Question Answering in Industrial Scenarios

Yifan Li
Yuhang Chen
Anh Dao
Lichi Li
Zhongyi Cai
Zhen Tan
Tianlong Chen
Yu Kong

Existing Embodied Question Answering (EQA) benchmarks primarily focus on household environments, often overlooking safety-critical aspects and reasoning processes pertinent to industrial settings. This drawback limits the evaluation of agent readiness for real-world industrial applications. To bridge this gap, we introduce IndustryEQA, the first benchmark dedicated to evaluating embodied agent capabilities within safety-critical industrial warehouse scenarios. Built upon the NVIDIA Isaac Sim platform, IndustryEQA provides high-fidelity episodic memory videos featuring diverse industrial assets, dynamic human agents, and carefully designed hazardous situations inspired by real-world safety guidelines. The benchmark includes rich annotations covering six categories: equipment safety, human safety, object recognition, attribute recognition, temporal understanding, and spatial understanding. It also provides extra reasoning evaluation based on these categories. Specifically, it comprises 971 question-answer pairs generated from small warehouse scenarios and 373 pairs from large ones, incorporating scenarios with and without humans. We further propose a comprehensive evaluation framework, including various baseline models, to assess their general perception and reasoning abilities in industrial environments. IndustryEQA aims to steer EQA research towards developing more robust, safety-aware, and practically applicable embodied agents for complex industrial environments.

AAAI Conference 2025 Conference Paper

Mapping from Meaning: Addressing the Miscalibration of Prompt-Sensitive Language Models

Kyle Cox
Jiawei Xu
Yikun Han
Rong Xu
Tianhao Li
Chi-Yang Hsu
Tianlong Chen
Walter Gerych

An interesting behavior in large language models (LLMs) is prompt sensitivity. When provided with different but semantically equivalent versions of the same prompt, models may produce very different distributions of answers. This suggests that the uncertainty reflected in a model's output distribution for one prompt may not reflect the model's uncertainty about the meaning of the prompt. We model prompt sensitivity as a type of generalization error, and show that sampling across the semantic concept space with paraphrasing perturbations improves uncertainty calibration without compromising accuracy. Additionally, we introduce a new metric for uncertainty decomposition in black-box LLMs that improves upon entropy-based decomposition by modeling semantic continuities in natural language generation. We show that this decomposition metric can be used to quantify how much LLM uncertainty is attributed to prompt sensitivity. Our work introduces a new way to improve uncertainty calibration in prompt-sensitive language models, and provides evidence that some LLMs fail to exhibit consistent general reasoning about the meanings of their inputs.

NeurIPS Conference 2025 Conference Paper

Mozart: Modularized and Efficient MoE Training on 3.5D Wafer-Scale Chiplet Architectures

Shuqing Luo
Ye Han
Pingzhi Li
Jiayin Qin
Jie Peng
Yang Zhao
Yu Cao
Tianlong Chen

Mixture-of-Experts (MoE) architecture offers enhanced efficiency for Large Language Models (LLMs) with modularized computation, yet its inherent sparsity poses significant hardware deployment challenges, including memory locality issues, communication overhead, and inefficient computing resource utilization. Inspired by the modular organization of the human brain, we propose $\texttt{Mozart}$, a novel algorithm-hardware co-design framework tailored for efficient training of MoE-based LLMs on 3.5D wafer-scale chiplet architectures. On the algorithm side, $\texttt{Mozart}$ exploits the inherent modularity of chiplets and introduces: (1) an expert allocation strategy that enables efficient on-package all-to-all communication, and (2) a fine-grained scheduling mechanism that improves communication-computation overlap through streaming tokens and experts. On the architecture side, $\texttt{Mozart}$ adaptively co-locates heterogeneous modules on specialized chiplets with a 2.5D NoP-Tree topology and hierarchical memory structure. Evaluation across three popular MoE models demonstrates significant efficiency gains, enabling more effective parallelization and resource utilization for large-scale modularized MoE-LLMs.

NeurIPS Conference 2025 Conference Paper

Multi-Agent Debate for LLM Judges with Adaptive Stability Detection

Tianyu Hu
Zhen Tan
Song Wang
Huaizhi Qu
Tianlong Chen

With advancements in reasoning capabilities, Large Language Models (LLMs) are increasingly employed for automated judgment tasks. While LLMs-as-Judges offer promise in automating evaluations, current approaches often rely on simplistic aggregation methods (e.g., majority voting), which can fail even when individual agents provide correct answers. To address this, we propose a multi-agent debate judge framework where agents collaboratively reason and iteratively refine their responses. We formalize the debate process mathematically, analyzing agent interactions and proving that debate amplifies correctness compared to static ensembles. To enhance efficiency, we introduce a stability detection mechanism that models the judges' collective correct rate dynamics using a time-varying mixture of Beta-Binomial distributions and employs an adaptive stopping criterion based on distributional similarity (Kolmogorov-Smirnov statistic). Experiments across multiple benchmarks and models demonstrate that our framework improves judgment accuracy over majority voting while maintaining computational efficiency.
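
The adaptive stopping idea in this abstract can be sketched in Python as follows, under loudly stated assumptions: each debate round is summarized by a discrete distribution over a shared support, stability is declared once the Kolmogorov-Smirnov statistic between consecutive rounds stays below a tolerance, and the names `ks_statistic` and `stable_round` and the parameters `tol` and `patience` are hypothetical. The abstract does not give the authors' exact criterion.

```python
from itertools import accumulate


def ks_statistic(pmf_a, pmf_b):
    """Largest absolute gap between the CDFs of two discrete distributions."""
    if len(pmf_a) != len(pmf_b):
        raise ValueError("distributions must share the same support")
    cdf_a = accumulate(pmf_a)
    cdf_b = accumulate(pmf_b)
    return max(abs(x - y) for x, y in zip(cdf_a, cdf_b))


def stable_round(round_pmfs, tol=0.05, patience=2):
    """Index of the first round at which the distribution has stayed within
    `tol` (KS statistic) of the previous round for `patience` rounds in a row,
    or None if that never happens."""
    streak = 0
    for r in range(1, len(round_pmfs)):
        if ks_statistic(round_pmfs[r - 1], round_pmfs[r]) < tol:
            streak += 1
            if streak >= patience:
                return r
        else:
            streak = 0  # a large shift resets the stability count
    return None
```

Requiring several consecutive small shifts, rather than one, is a design choice of this sketch that guards against stopping on a single quiet round.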

YNIMG Journal 2025 Journal Article

Novelty modulates proactive and reactive cognitive control modes: Evidence from ERP and EEG data

Qianqian Li
Tianlong Chen
Lixia Wang
Hongshan Gu
Bi Ying Hu
Chuanhua Gu
Zongkui Zhou

NeurIPS Conference 2025 Conference Paper

One Token Embedding Is Enough to Deadlock Your Large Reasoning Model

Mohan Zhang
Yihua Zhang
Jinghan Jia
Zhangyang "Atlas" Wang
Sijia Liu
Tianlong Chen

Modern large reasoning models (LRMs) exhibit impressive multi-step problem-solving via chain-of-thought (CoT) reasoning. However, this iterative thinking mechanism introduces a new vulnerability surface. We present the Deadlock Attack, a resource exhaustion method that hijacks an LRM's generative control flow by training a malicious adversarial embedding to induce perpetual reasoning loops. Specifically, the optimized embedding encourages transitional tokens (e.g., “Wait”, “But”) after reasoning steps, preventing the model from concluding its answer. A key challenge we identify is the continuous-to-discrete projection gap: naïve projections of adversarial embeddings to token sequences nullify the attack. To overcome this, we introduce a backdoor implantation strategy, enabling reliable activation through specific trigger tokens. Our method achieves a 100% attack success rate across four advanced LRMs (Phi-RM, Nemotron-Nano, R1-Qwen, R1-Llama) and three math reasoning benchmarks, forcing models to generate up to their maximum token limits. The attack is also stealthy (in terms of causing negligible utility loss on benign user inputs) and remains robust against existing strategies trying to mitigate the overthinking issue. Our findings expose a critical and underexplored security vulnerability in LRMs from the perspective of reasoning (in)efficiency.

TMLR Journal 2025 Journal Article

Pushing the Limits of Sparsity: A Bag of Tricks for Extreme Pruning

Andy Li
Aiden Durrant
Milan Markovic
Tianjin Huang
Souvik Kundu
Tianlong Chen
Lu Yin
Georgios Leontidis

Pruning of deep neural networks has been an effective technique for reducing model size while preserving most of the performance of dense networks, crucial for deploying models on memory and power-constrained devices. While recent sparse learning methods have shown promising performance up to moderate sparsity levels such as 95% and 98%, accuracy quickly deteriorates when pushing sparsities to extreme levels due to unique challenges such as fragile gradient flow. In this work, we explore network performance beyond the commonly studied sparsities, and develop techniques that encourage stable training without accuracy collapse even at extreme sparsities, including 99.90%, 99.95% and 99.99% on ResNet architectures. We propose three complementary techniques that enhance sparse training through different mechanisms: 1) Dynamic ReLU phasing, where DyReLU initially allows for richer parameter exploration before being gradually replaced by standard ReLU, 2) weight sharing which reuses parameters within a residual layer while maintaining the same number of learnable parameters, and 3) cyclic sparsity, where both sparsity levels and sparsity patterns evolve dynamically throughout training to better encourage parameter exploration. We evaluate our method, which we term Extreme Adaptive Sparse Training (EAST), at extreme sparsities using ResNet-34 and ResNet-50 on CIFAR-10, CIFAR-100, and ImageNet, achieving competitive or improved performance compared to existing methods, with notable gains at extreme sparsity levels.

AAAI Conference 2025 Conference Paper

Sparse Transfer Learning Accelerates and Enhances Certified Robustness: A Comprehensive Study

Zhangheng Li
Tianlong Chen
Linyi Li
Bo Li
Zhangyang Wang

Certified robustness is a critical measure for assessing the reliability of machine learning systems. Traditionally, the computational burden associated with certifying the robustness of machine learning models has posed a substantial challenge, particularly with the continuous expansion of model sizes. In this paper, we introduce an innovative approach to expedite the verification process for L2-norm certified robustness through sparse transfer learning. Our approach is both efficient and effective. It leverages verification results obtained from pre-training tasks and applies sparse updates to these results. To enhance performance, we incorporate dynamic sparse mask selection and introduce a novel stability-based regularizer called DiffStab. Empirical results demonstrate that our method accelerates the verification process for downstream tasks by as much as 70-80%, with only slight reductions in certified accuracy compared to dense parameter updates. We further validate that this performance improvement is even more pronounced in the few-shot transfer learning scenario.

AAAI Conference 2025 Conference Paper

Tuning-Free Accountable Intervention for LLM Deployment – a Metacognitive Approach

Zhen Tan
Jie Peng
Song Wang
Lijie Hu
Tianlong Chen
Huan Liu

Large Language Models (LLMs) have brought significant advances across various NLP tasks through few-shot or zero-shot prompting, bypassing the need for parameter tuning. However, the "black-box" nature behind their massive parameter sizes increases the "hallucination" concerns, especially in high-stakes applications (e.g., healthcare), where decision mistakes can lead to severe consequences. In contrast, human decision-making relies on complex cognitive processes, such as the ability to sense and adaptively correct mistakes through conceptual understanding. Drawing inspiration from human cognition, we propose an innovative metacognitive approach, CLEAR, to equip LLMs with capabilities for self-aware error identification and correction. Our framework constructs concept-specific sparse subnetworks that indicate decision processes. This provides a novel interface for model intervention after deployment. The benefits include: (i) at inference time, our metacognitive LLMs can self-consciously identify potential mispredictions with minimum human involvement, (ii) the model can self-correct its errors efficiently without additional tuning, and (iii) the correction procedure is not only self-explanatory but also user-friendly, enhancing model interpretability and accessibility. With these metacognitive features, our approach pioneers a new path toward the trustworthiness of LLMs.

PDF Details DOI

AAAI Conference 2025 Conference Paper

Visual Prompting Upgrades Neural Network Sparsification: A Data-Model Perspective

Can Jin
Tianjin Huang
Yihua Zhang
Mykola Pechenizkiy
Sijia Liu
Shiwei Liu
Tianlong Chen

The rapid development of large-scale deep learning models calls into question the affordability of hardware platforms, necessitating pruning to reduce their computational and memory footprints. The resulting sparse neural networks have demonstrated numerous benefits, such as low complexity and undamaged generalization. Most of the prominent pruning strategies are invented from a model-centric perspective, focusing on searching and preserving crucial weights by analyzing network topologies. However, the role of data and its interplay with model-centric pruning has remained relatively unexplored. In this research, we introduce a novel data-model co-design perspective: to promote superior weight sparsity by learning important model topology and adequate input data in a synergetic manner. Specifically, customized Visual Prompts are mounted to upgrade neural Network sparsification in our proposed VPNs framework. As a pioneering effort, this paper conducts systematic investigations about the impact of different visual prompts on model pruning and suggests an effective joint optimization approach. Extensive experiments with 3 network architectures and 8 datasets evidence the substantial performance improvements from VPNs over existing state-of-the-art pruning algorithms. Furthermore, we find that subnetworks discovered by VPNs from pre-trained models enjoy better transferability across diverse downstream scenarios. These insights shed light on new promising possibilities of data-model co-designs for vision model sparsification.

PDF Details DOI

EAAI Journal 2025 Journal Article

Word-Sequence Entropy: Towards uncertainty estimation in free-form medical question answering applications and beyond

Zhiyuan Wang
Jinhao Duan
Chenxi Yuan
Qingyu Chen
Tianlong Chen
Yue Zhang
Ren Wang
Xiaoshuang Shi

NeurIPS Conference 2024 Conference Paper

$\texttt{Model-GLUE}$: Democratized LLM Scaling for A Large Model Zoo in the Wild

Xinyu Zhao
Guoheng Sun
Ruisi Cai
Yukun Zhou
Pingzhi Li
Peihao Wang
Bowen Tan
Yexiao He

As Large Language Models (LLMs) excel across tasks and specialized domains, scaling LLMs based on existing models has gained significant attention, which is challenged by potential performance drop when combining disparate models. Various techniques have been proposed to aggregate pre-trained LLMs, including model merging, Mixture-of-Experts, and stacking. Despite their merits, a comprehensive comparison and synergistic application of them to a diverse model zoo is yet to be adequately addressed. In light of this research gap, this paper introduces $\texttt{Model-GLUE}$, a holistic LLM scaling guideline. First, our work starts with a benchmarking of existing LLM scaling techniques, especially selective merging, and variants of mixture. Utilizing the insights from the benchmark results, we formulate a strategy for the selection and aggregation of a heterogeneous model zoo characterizing different architectures and initialization. Our methodology involves clustering mergeable models, selecting a merging strategy, and integrating model clusters through model-level mixture. Finally, evidenced by our experiments on a diverse Llama-2-based model zoo, $\texttt{Model-GLUE}$ shows an average performance enhancement of 5.61%, achieved without additional training. Codes are available at https://github.com/Model-GLUE/Model-GLUE.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Flex-MoE: Modeling Arbitrary Modality Combination via the Flexible Mixture-of-Experts

Sukwon Yun
Inyoung Choi
Jie Peng
Yangfan Wu
Jingxuan Bao
Qiyiwen Zhang
Jiayi Xin
Qi Long

Multimodal learning has gained increasing importance across various fields, offering the ability to integrate data from diverse sources such as images, text, and personalized records, which are frequently observed in medical domains. However, in scenarios where some modalities are missing, many existing frameworks struggle to accommodate arbitrary modality combinations, often relying heavily on a single modality or complete data. This oversight of potential modality combinations limits their applicability in real-world situations. To address this challenge, we propose Flex-MoE (Flexible Mixture-of-Experts), a new framework designed to flexibly incorporate arbitrary modality combinations while maintaining robustness to missing data. The core idea of Flex-MoE is to first address missing modalities using a new missing modality bank that integrates observed modality combinations with the corresponding missing ones. This is followed by a uniquely designed Sparse MoE framework. Specifically, Flex-MoE first trains experts using samples with all modalities to inject generalized knowledge through the generalized router ($\mathcal{G}$-Router). The $\mathcal{S}$-Router then specializes in handling fewer modality combinations by assigning the top-1 gate to the expert corresponding to the observed modality combination. We evaluate Flex-MoE on the ADNI dataset, which encompasses four modalities in the Alzheimer's Disease domain, as well as on the MIMIC-IV dataset. The results demonstrate the effectiveness of Flex-MoE, highlighting its ability to model arbitrary modality combinations in diverse missing modality scenarios. Code is available at: https://github.com/UNITES-Lab/flex-moe.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

GDeR: Safeguarding Efficiency, Balancing, and Robustness via Prototypical Graph Pruning

Guibin Zhang
Haonan Dong
Yuchen Zhang
Zhixun Li
Dingshuo Chen
Kai Wang
Tianlong Chen
Yuxuan Liang

Training high-quality deep models necessitates vast amounts of data, resulting in overwhelming computational and memory demands. Recently, data pruning, distillation, and coreset selection have been developed to streamline data volume by retaining, synthesizing, or selecting a small yet informative subset from the full set. Among these methods, data pruning incurs the least additional training cost and offers the most practical acceleration benefits. However, it is the most vulnerable, often suffering significant performance degradation with imbalanced or biased data schema, thus raising concerns about its accuracy and reliability in on-device deployment. Therefore, there is a looming need for a new data pruning paradigm that maintains the efficiency of previous practices while ensuring balance and robustness. Unlike the fields of computer vision and natural language processing, where mature solutions have been developed to address these issues, graph neural networks (GNNs) continue to struggle with increasingly large-scale, imbalanced, and noisy datasets, lacking a unified dataset pruning solution. To achieve this, we introduce a novel dynamic soft-pruning method, GDeR, designed to update the training "basket" during the process using trainable prototypes. GDeR first constructs a well-modeled graph embedding hypersphere and then samples representative, balanced, and unbiased subsets from this embedding space, which achieves the goal we call Graph Training Debugging. Extensive experiments on four datasets across three GNN backbones demonstrate that GDeR (I) achieves or surpasses the performance of the full dataset with $30\%\sim50\%$ fewer training samples, (II) attains up to a $2.81\times$ lossless training speedup, and (III) outperforms state-of-the-art pruning methods in imbalanced training and noisy training scenarios by $0.3\%\sim4.3\%$ and $3.6\%\sim7.8\%$, respectively.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

GTBench: Uncovering the Strategic Reasoning Capabilities of LLMs via Game-Theoretic Evaluations

Jinhao Duan
Renming Zhang
James Diffenderfer
Bhavya Kailkhura
Lichao Sun
Elias Stengel-Eskin
Mohit Bansal
Tianlong Chen

As Large Language Models (LLMs) are integrated into critical real-world applications, their strategic and logical reasoning abilities are increasingly crucial. This paper evaluates LLMs' reasoning abilities in competitive environments through game-theoretic tasks, e.g., board and card games that require pure logic and strategic reasoning to compete with opponents. We first propose GTBench, a language-driven environment composing 10 widely-recognized tasks, across a comprehensive game taxonomy: complete versus incomplete information, dynamic versus static, and probabilistic versus deterministic scenarios. Then, we (1) characterize the game-theoretic reasoning of LLMs; and (2) perform LLM-vs.-LLM competitions as reasoning evaluation. We observe that (1) LLMs have distinct behaviors regarding various gaming scenarios; for example, LLMs fail in complete and deterministic games yet they are competitive in probabilistic gaming scenarios; (2) most open-source LLMs, e.g., CodeLlama-34b-Instruct and Llama-2-70b-chat, are less competitive than commercial LLMs, e.g., GPT-4, in complex games, yet the recently released Llama-3-70b-Instruct makes up for this shortcoming. In addition, code-pretraining greatly benefits strategic reasoning, while advanced reasoning methods such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) do not always help. We further characterize the game-theoretic properties of LLMs, such as equilibrium and Pareto Efficiency in repeated games. Detailed error profiles are provided for a better understanding of LLMs' behavior. We hope our research provides standardized protocols and serves as a foundation to spur further explorations in the strategic reasoning of LLMs.

PDF Details DOI

AAAI Conference 2024 Conference Paper

Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention

Zhen Tan
Tianlong Chen
Zhenyu Zhang
Huan Liu

Large Language Models (LLMs) have achieved unprecedented breakthroughs in various natural language processing domains. However, the enigmatic "black-box" nature of LLMs remains a significant challenge for interpretability, hampering transparent and accountable applications. While past approaches, such as attention visualization, pivotal subnetwork extraction, and concept-based analyses, offer some insight, they often focus on either local or global explanations within a single dimension, occasionally falling short in providing comprehensive clarity. In response, we propose a novel methodology anchored in sparsity-guided techniques, aiming to provide a holistic interpretation of LLMs. Our framework, termed SparseCBM, innovatively integrates sparsity to elucidate three intertwined layers of interpretation: input, subnetwork, and concept levels. In addition, the newly introduced dimension of interpretable inference-time intervention facilitates dynamic adjustments to the model during deployment. Through rigorous empirical evaluations on real-world datasets, we demonstrate that SparseCBM delivers a profound understanding of LLM behaviors, setting it apart in both interpreting and ameliorating model inaccuracies. Codes are provided in supplements.

PDF Details DOI

TMLR Journal 2024 Journal Article

Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation

Vaidehi Patil
Yi-Lin Sung
Peter Hase
Jie Peng
Tianlong Chen
Mohit Bansal

Large Language Models (LLMs) trained on massive datasets may inadvertently acquire sensitive information such as personal details and potentially harmful content. This risk is further heightened in multimodal LLMs (aka MLLMs) as they integrate information from multiple modalities (image and text). Adversaries can exploit this stored knowledge by crafting inputs across modalities to extract sensitive details. Evaluating how effectively MLLMs can forget such information (targeted unlearning) necessitates the creation of high-quality, well-annotated image-text pairs. While significant research has addressed the creation of datasets for unlearning within LLMs, it has primarily concentrated on the text modality. Creation of analogous datasets for multimodal data and models remains an understudied area. To address this gap, we first introduce a multimodal unlearning benchmark, UnLOK-VQA (Unlearning Outside Knowledge VQA), as well as an "attack-and-defense" framework to evaluate methods for deleting specific multimodal knowledge from MLLMs. Our dataset generation process involves an automated pipeline to create samples of varied proximity levels to the target data point for evaluation of generalization and specificity, followed by manual filtering to retain only the high-quality data points. We use this process to extend a visual question-answering dataset for evaluating multimodal information deletion. Next, we present a comprehensive unlearning evaluation involving an attack-and-defense framework consisting of four white-box and three black-box attacks against six unlearning defense objectives. We also design a white-box attack based on the interpretability of hidden states in LLMs motivated by past work. Our experimental results demonstrate that multimodal extraction attacks (with an attack success rate of 45.5%) are more successful than either image-only (32%) or text-only attacks (39%). The best overall defense mechanism, which removes answer information from internal model hidden states, reduces the success rate of multimodal attack to 15.7%. Furthermore, our findings suggest that larger models exhibit greater resilience to attacks, implying that model scaling could be a valuable strategy for enhancing robustness and developing safer models. UnLOK-VQA thus facilitates a comprehensive evaluation of unlearning in MLLMs and serves as a challenging benchmark for future research in unlearning.

TMLR Journal 2023 Journal Article

Can Pruning Improve Certified Robustness of Neural Networks?

Zhangheng Li
Tianlong Chen
Linyi Li
Bo Li
Zhangyang Wang

With the rapid development of deep learning, the sizes of deep neural networks are getting larger beyond the affordability of hardware platforms. Given the fact that neural networks are often over-parameterized, one effective way to reduce such computational overhead is neural network pruning, by removing redundant parameters from trained neural networks. It has been recently observed that pruning can not only reduce computational overhead but also can improve empirical robustness of deep neural networks (NNs), potentially owing to removing spurious correlations while preserving the predictive accuracies. This paper for the first time demonstrates that pruning can generally improve $L_\infty$ certified robustness for ReLU-based NNs under the complete verification setting. Using the popular Branch-and-Bound (BaB) framework, we find that pruning can enhance the estimated bound tightness of certified robustness verification, by alleviating linear relaxation and sub-domain split problems. We empirically verify our findings with off-the-shelf pruning methods and further present a new stability-based pruning method tailored for reducing neuron instability, that outperforms existing pruning methods in enhancing certified robustness. Our experiments show that by appropriately pruning an NN, its certified accuracy can be boosted up to 8.2% under standard training, and up to 24.5% under adversarial training on the CIFAR10 dataset. We additionally observe the possible existence of certified lottery tickets in our experiments that can match both standard and certified robust accuracies of the original dense models across different datasets. Our findings offer a new angle to study the intriguing interaction between sparsity and robustness, i.e., interpreting the interaction of sparsity and certified robustness via neuron stability. Codes will be fully released.

NeurIPS Conference 2023 Conference Paper

H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

Zhenyu Zhang
Ying Sheng
Tianyi Zhou
Tianlong Chen
Lianmin Zheng
Ruisi Cai
Zhao Song
Yuandong Tian

Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the $\mathsf{KV}$ $\mathsf{cache}$, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the $\mathsf{KV}$ $\mathsf{cache}$ which significantly reduces its memory footprint. Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores. We call these tokens Heavy Hitters ($\mathsf{H_2}$). Through a comprehensive investigation, we find that ($i$) the emergence of $\mathsf{H_2}$ is natural and strongly correlates with the frequent co-occurrence of tokens in the text, and ($ii$) removing them results in significant performance degradation. Based on these insights, we propose Heavy Hitter Oracle ($\mathsf{H_2O}$), a $\mathsf{KV}$ $\mathsf{cache}$ eviction policy that dynamically retains a balance of recent and $\mathsf{H_2}$ tokens. We formulate the $\mathsf{KV}$ $\mathsf{cache}$ eviction as a dynamic submodular problem and prove (under mild assumptions) a theoretical guarantee for our novel eviction algorithm which could help guide future work. We validate the accuracy of our algorithm with OPT, LLaMA, and GPT-NeoX across a wide range of tasks. Our implementation of $\mathsf{H_2O}$ with 20\% heavy hitters improves the throughput over three leading inference systems DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen by up to $29\times$, $29\times$, and $3\times$ on OPT-6.7B and OPT-30B. With the same batch size, $\mathsf{H_2O}$ can reduce the latency by up to $1.9\times$.
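
As a rough illustration of the eviction idea described in the abstract above (keep a mix of recent tokens and heavy hitters), the following is a minimal sketch and not the authors' implementation. It assumes that heavy hitters are ranked by attention weight accumulated over decoding steps; the function names `h2o_keep_indices` and `update_scores` and the parameters `num_recent` and `num_heavy` are hypothetical.

```python
from typing import List


def update_scores(acc_scores: List[float], attn_row: List[float]) -> List[float]:
    """Add the newest query's attention weights to the running per-token totals."""
    return [s + a for s, a in zip(acc_scores, attn_row)]


def h2o_keep_indices(acc_scores: List[float], num_recent: int, num_heavy: int) -> List[int]:
    """Return the sorted cache positions to keep (illustrative policy only).

    acc_scores[i] is the attention mass accumulated so far by cached token i
    (an assumed ranking criterion). The last `num_recent` positions are always
    kept; among the older positions, the `num_heavy` with the largest
    accumulated score are kept as heavy hitters.
    """
    n = len(acc_scores)
    recent_start = max(0, n - num_recent)
    recent = list(range(recent_start, n))
    older = list(range(recent_start))
    # Highest accumulated score first; ties go to the earlier position.
    heavy = sorted(older, key=lambda i: (-acc_scores[i], i))[:num_heavy]
    return sorted(heavy + recent)


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.7, 0.05, 0.2, 0.3]
    # Keeps positions 0 and 2 as heavy hitters plus the two most recent positions.
    print(h2o_keep_indices(scores, num_recent=2, num_heavy=2))
```

In this sketch the cache budget is simply `num_recent + num_heavy`; how that budget is split between the two groups is a tunable choice here, not something the abstract specifies.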

AAAI Conference 2023 Conference Paper

Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training

Zhenglun Kong
Haoyu Ma
Geng Yuan
Mengshu Sun
Yanyue Xie
Peiyan Dong
Xin Meng
Xuan Shen

Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage at both training and inference time limit their generalization. Previous compression algorithms usually start from the pre-trained dense models and only focus on efficient inference, while time-consuming training is still unavoidable. In contrast, this paper points out that the million-scale training data is redundant, which is the fundamental reason for the tedious training. To address the issue, this paper aims to introduce sparsity into data and proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT. Specifically, we leverage a hierarchical data redundancy reduction scheme, by exploring the sparsity under three levels: number of training examples in the dataset, number of patches (tokens) in each example, and number of connections between tokens that lie in attention weights. With extensive experiments, we demonstrate that our proposed technique can noticeably accelerate training for various ViT architectures while maintaining accuracy. Remarkably, under certain ratios, we are able to improve the ViT accuracy rather than compromising it. For example, we can achieve 15.2% speedup with 72.6% (+0.4) Top-1 accuracy on Deit-T, and 15.7% speedup with 79.9% (+0.1) Top-1 accuracy on Deit-S. This proves the existence of data redundancy in ViT. Our code is released at https://github.com/ZLKong/Tri-Level-ViT

PDF Details DOI

NeurIPS Conference 2023 Conference Paper

The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter

Ajay Jaiswal
Shiwei Liu
Tianlong Chen
Zhangyang "Atlas" Wang

Large pre-trained transformers are $\textit{show-stealers}$ in modern-day deep learning, and it becomes crucial to comprehend the parsimonious patterns that exist within them as they grow in scale. With exploding parameter counts, the Lottery Ticket Hypothesis (LTH) and its variants have lost their pragmatism in sparsifying them due to the high computation and memory bottleneck of the repetitive $\textit{train-prune-retrain}$ routine of iterative magnitude pruning (IMP), which worsens with increasing model size. In this paper, we comprehensively study $\textit{induced sparse patterns}$ across multiple large pre-trained vision and language transformers. We propose the existence of $\textbf{essential sparsity}$, defined with a $\textbf{sharp dropping point}$ beyond which the performance declines much faster w.r.t. the rise of sparsity level, when we directly remove weights with the smallest magnitudes in $\textbf{one-shot}$. We also present an intriguing emerging phenomenon of $\textbf{abrupt sparsification}$ during the pre-training of BERT, i.e., BERT suddenly becomes heavily sparse in pre-training after certain iterations. Moreover, our observations also indicate a $\textbf{counter-intuitive}$ finding that BERT trained with a larger amount of pre-training data tends to have a better ability to condense knowledge in comparatively fewer parameters. Lastly, we investigate the effect of the pre-training loss on essential sparsity and discover that self-supervised learning (SSL) objectives trigger stronger emergent sparsification properties than supervised learning (SL). All our codes will be publicly available.
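
The one-shot step mentioned in the abstract above (directly removing the weights with the smallest magnitudes) can be sketched as follows. This is a minimal pure-Python illustration on a flat list of weights, not the authors' code; the function name `one_shot_magnitude_prune` and the `sparsity` parameter are hypothetical.

```python
from typing import List


def one_shot_magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    num_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:num_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


if __name__ == "__main__":
    # Half of the weights (the two smallest in magnitude) are set to zero.
    print(one_shot_magnitude_prune([0.5, -0.1, 0.05, -2.0], sparsity=0.5))
```

Locating a sharp dropping point would additionally require evaluating the pruned model at a sweep of sparsity levels, which is outside the scope of this sketch.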

NeurIPS Conference 2022 Conference Paper

A Comprehensive Study on Large-Scale Graph Training: Benchmarking and Rethinking

Keyu Duan
Zirui Liu
Peihao Wang
Wenqing Zheng
Kaixiong Zhou
Tianlong Chen
Xia Hu
Zhangyang Wang

Large-scale graph training is a notoriously challenging problem for graph neural networks (GNNs). Due to the nature of evolving graph structures into the training process, vanilla GNNs usually fail to scale up, limited by the GPU memory space. Up to now, though numerous scalable GNN architectures have been proposed, we still lack a comprehensive survey and fair benchmark of this reservoir to find the rationale for designing scalable GNNs. To this end, we first systematically formulate the representative methods of large-scale graph training into several branches and further establish a fair and consistent benchmark for them by a greedy hyperparameter searching. In addition, regarding efficiency, we theoretically evaluate the time and space complexity of various branches and empirically compare them w.r.t. GPU memory usage, throughput, and convergence. Furthermore, we analyze the pros and cons for various branches of scalable GNNs and then present a new ensembling training manner, named EnGCN, to address the existing issues. Remarkably, our proposed method has achieved new state-of-the-art (SOTA) performance on large-scale datasets. Our code is available at https://github.com/VITA-Group/Large_Scale_GCN_Benchmarking.

NeurIPS Conference 2022 Conference Paper

Advancing Model Pruning via Bi-level Optimization

Yihua Zhang
Yuguang Yao
Parikshit Ram
Pu Zhao
Tianlong Chen
Mingyi Hong
Yanzhi Wang
Sijia Liu

The deployment constraints in practical applications necessitate the pruning of large-scale deep learning models, i.e., promoting their weight sparsity. As illustrated by the Lottery Ticket Hypothesis (LTH), pruning also has the potential of improving their generalization ability. At the core of LTH, iterative magnitude pruning (IMP) is the predominant pruning method to successfully find ‘winning tickets’. Yet, the computation cost of IMP grows prohibitively as the targeted pruning ratio increases. To reduce the computation overhead, various efficient ‘one-shot’ pruning methods have been developed, but these schemes are usually unable to find winning tickets as good as IMP. This raises the question of how to close the gap between pruning accuracy and pruning efficiency. To tackle it, we pursue the algorithmic advancement of model pruning. Specifically, we formulate the pruning problem from a fresh and novel viewpoint, bi-level optimization (BLO). We show that the BLO interpretation provides a technically-grounded optimization base for an efficient implementation of the pruning-retraining learning paradigm used in IMP. We also show that the proposed bi-level optimization-oriented pruning method (termed BiP) is a special class of BLO problems with a bi-linear problem structure. By leveraging such bi-linearity, we theoretically show that BiP can be solved as easily as first-order optimization, thus inheriting the computation efficiency. Through extensive experiments on both structured and unstructured pruning with 5 model architectures and 4 data sets, we demonstrate that BiP can find better winning tickets than IMP in most cases, and is computationally as efficient as the one-shot pruning schemes, demonstrating $2-7\times$ speedup over IMP for the same level of model accuracy and sparsity.

TMLR Journal 2022 Journal Article

Adversarial Feature Augmentation and Normalization for Visual Recognition

Tianlong Chen
Yu Cheng
Zhe Gan
Jianfeng Wang
Lijuan Wang
Jingjing Liu
Zhangyang Wang

Recent advances in computer vision take advantage of adversarial data augmentation to improve the generalization of classification models. Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings, instead of relying on computationally-expensive pixel-level perturbations. We propose $\textbf{A}$dversarial $\textbf{F}$eature $\textbf{A}$ugmentation and $\textbf{N}$ormalization (A-FAN), which ($i$) first augments visual recognition models with adversarial features that integrate flexible scales of perturbation strengths, ($ii$) then extracts adversarial feature statistics from batch normalization, and re-injects them into clean features through feature normalization. We validate the proposed approach across diverse visual recognition tasks with representative backbone networks, including ResNets and EfficientNets for classification, Faster-RCNN for detection, and Deeplab V3+ for segmentation. Extensive experiments show that A-FAN yields consistent generalization improvement over strong baselines across various datasets for classification, detection, and segmentation tasks, such as CIFAR-10, CIFAR-100, ImageNet, Pascal VOC2007, Pascal VOC2012, COCO2017, and Cityscapes. Comprehensive ablation studies and detailed analyses also demonstrate that adding perturbations to specific modules and layers of classification/detection/segmentation backbones yields optimal performance. Codes and pre-trained models are available in: https://github.com/VITA-Group/CV_A-FAN.

NeurIPS Conference 2022 Conference Paper

Augmentations in Hypergraph Contrastive Learning: Fabricated and Generative

Tianxin Wei
Yuning You
Tianlong Chen
Yang Shen
Jingrui He
Zhangyang Wang

This paper targets improving the generalizability of hypergraph neural networks in the low-label regime, through applying the contrastive learning approach from images/graphs (we refer to it as HyperGCL). We focus on the following question: How to construct contrastive views for hypergraphs via augmentations? We provide the solutions in two folds. First, guided by domain knowledge, we fabricate two schemes to augment hyperedges with higher-order relations encoded, and adopt three vertex augmentation strategies from graph-structured data. Second, in search of more effective views in a data-driven manner, we for the first time propose a hypergraph generative model to generate augmented views, and then an end-to-end differentiable pipeline to jointly learn hypergraph augmentations and model parameters. Our technical innovations are reflected in designing both fabricated and generative augmentations of hypergraphs. The experimental findings include: (i) Among fabricated augmentations in HyperGCL, augmenting hyperedges provides the most numerical gains, implying that higher-order information in structures is usually more downstream-relevant; (ii) Generative augmentations do better in preserving higher-order information to further benefit generalizability; (iii) HyperGCL also boosts robustness and fairness in hypergraph representation learning. Codes are released at https://github.com/weitianxin/HyperGCL.

TMLR Journal 2022 Journal Article

Can You Win Everything with A Lottery Ticket?

Tianlong Chen
Zhenyu Zhang
Jun Wu
Randy Huang
Sijia Liu
Shiyu Chang
Zhangyang Wang

The $\textit{lottery ticket hypothesis}$ (LTH) has been demonstrated to yield independently trainable and highly sparse neural networks (a.k.a. $\textit{winning tickets}$), whose test set accuracies can be surprisingly on par or even better than dense models. However, accuracy is far from the only evaluation metric, and perhaps not always the most important one. Hence it might be myopic to conclude that a sparse subnetwork can replace its dense counterpart, even if the accuracy is preserved. Spurred by that, we perform the first comprehensive assessment of lottery tickets from diverse aspects beyond test accuracy, including $\textit{(i)}$ generalization to distribution shifts, $\textit{(ii)}$ prediction uncertainty, $\textit{(iii)}$ interpretability, and $\textit{(iv)}$ geometry of loss landscapes. With extensive experiments across datasets (CIFAR-10, CIFAR-100, and ImageNet), model architectures, as well as tens of sparsification methods, we thoroughly characterize the trade-off between model sparsity and the all-dimension model capabilities. We find that an appropriate sparsity (e.g., $20\%\sim99.08\%$) can yield the winning ticket to perform comparably or even better $\textbf{in all above four aspects}$, although some aspects (generalization to certain distribution shifts, and uncertainty) appear more sensitive to the sparsification than others. We term it as a $\texttt{LTH-PASS}$. Overall, our results endorse choosing a good sparse subnetwork of a larger dense model, over directly training a small dense model of similar parameter counts. We hope that our study can offer more in-depth insights on pruning, for researchers and engineers who seek to incorporate sparse neural networks for user-facing deployments. Codes are available in: https://github.com/VITA-Group/LTH-Pass.

JMLR Journal 2022 Journal Article

Learning to Optimize: A Primer and A Benchmark

Tianlong Chen
Xiaohan Chen
Wuyang Chen
Howard Heaton
Jialin Liu
Zhangyang Wang
Wotao Yin

Learning to optimize (L2O) is an emerging approach that leverages machine learning to develop optimization methods, aiming at reducing the laborious iterations of hand engineering. It automates the design of an optimization method based on its performance on a set of training problems. This data-driven procedure generates methods that can efficiently solve problems similar to those in training. In sharp contrast, the typical and traditional designs of optimization methods are theory-driven, so they obtain performance guarantees over the classes of problems specified by the theory. The difference makes L2O suitable for repeatedly solving a particular optimization problem over a specific distribution of data, while it typically fails on out-of-distribution problems. The practicality of L2O depends on the type of target optimization, the chosen architecture of the method to learn, and the training procedure. This new paradigm has motivated a community of researchers to explore L2O and report their findings. This article is poised to be the first comprehensive survey and benchmark of L2O for continuous optimization. We set up taxonomies, categorize existing works and research directions, present insights, and identify open challenges. We benchmarked many existing L2O approaches on a few representative optimization problems. For reproducible research and fair benchmarking purposes, we released our software implementation and data in the package Open-L2O at https://github.com/VITA-Group/Open-L2O.

NeurIPS Conference 2022 Conference Paper

M³ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design

Hanxue Liang
Zhiwen Fan
Rishov Sarkar
Ziyu Jiang
Tianlong Chen
Kai Zou
Yu Cheng
Cong Hao

Multi-task learning (MTL) encapsulates multiple learned tasks in a single model and often lets those tasks learn better jointly. Multi-tasking models have become successful and often essential for many sophisticated systems such as autonomous driving and indoor robots. However, when deploying MTL onto those real-world systems that are often resource-constrained or latency-sensitive, two prominent challenges arise: (i) during training, simultaneously optimizing all tasks is often difficult due to gradient conflicts across tasks, and the challenge is amplified when a growing number of tasks have to be squeezed into one compact model; (ii) at inference, current MTL regimes have to activate nearly the entire model even to just execute a single task. Yet most real systems demand only one or two tasks at each moment, while flexibly switching between tasks per need: therefore such “all tasks activated” inference is also highly inefficient and non-scalable in practice. In this paper, we present a model-accelerator co-design framework to enable efficient on-device MTL, that tackles both training and inference bottlenecks. Our framework, dubbed M³ViT, customizes mixture-of-experts (MoE) layers into a vision transformer (ViT) backbone for MTL, and sparsely activates task-specific experts during training, which effectively disentangles the parameter spaces to avoid different tasks’ training conflicts. Then at inference with any task of interest, the same design allows for activating only the task-corresponding sparse “expert” pathway, instead of the full model. Our new model design is further enhanced by hardware-level innovations, in particular, a novel computation reordering scheme tailored for memory-constrained MTL that achieves zero-overhead switching between tasks and can scale to any number of experts. Extensive experiments on PASCAL-Context and NYUD-v2 datasets at both software and hardware levels are conducted to demonstrate the effectiveness of the proposed design. 
When executing the practical scenario of single-task inference, M³ViT achieves higher accuracies than encoder-focused MTL methods, while reducing inference FLOPs by 88%. When implemented on a hardware platform of one Xilinx ZCU104 FPGA, our co-design framework reduces the memory requirement by 2.40×, while achieving energy efficiency (as the product of latency and power) up to 9.23× higher than a comparable FPGA baseline.

NeurIPS Conference 2022 Conference Paper

Old can be Gold: Better Gradient Flow can Make Vanilla-GCNs Great Again

Ajay Jaiswal
Peihao Wang
Tianlong Chen
Justin Rousseau
Ying Ding
Zhangyang Wang

Despite the enormous success of Graph Convolutional Networks (GCNs) in modeling graph-structured data, most of the current GCNs are shallow due to the notoriously challenging problems of over-smoothening and information squashing along with conventional difficulty caused by vanishing gradients and over-fitting. Previous works have been primarily focused on the study of over-smoothening and over-squashing phenomena in training deep GCNs. Surprisingly, in comparison with CNNs/RNNs, very limited attention has been given to understanding how healthy gradient flow can benefit the trainability of deep GCNs. In this paper, firstly, we provide a new perspective of gradient flow to understand the substandard performance of deep GCNs and hypothesize that by facilitating healthy gradient flow, we can significantly improve their trainability, as well as achieve state-of-the-art (SOTA) level performance from vanilla-GCNs. Next, we argue that blindly adopting the Glorot initialization for GCNs is not optimal, and derive a topology-aware isometric initialization scheme for vanilla-GCNs based on the principles of isometry. Additionally, contrary to ad-hoc addition of skip-connections, we propose to use gradient-guided dynamic rewiring of vanilla-GCNs with skip connections. Our dynamic rewiring method uses the gradient flow within each layer during training to introduce on-demand skip-connections adaptively. We provide extensive empirical evidence across multiple datasets that our methods improve gradient flow in deep vanilla-GCNs and significantly boost their performance to comfortably compete with and outperform many fancy state-of-the-art methods. Codes are available at: https://github.com/VITA-Group/GradientGCN.

AAAI Conference 2022 Conference Paper

Playing Lottery Tickets with Vision and Language

Zhe Gan
Yen-Chun Chen
Linjie Li
Tianlong Chen
Yu Cheng
Shuohang Wang
Jingjing Liu
Lijuan Wang

Large-scale pre-training has recently revolutionized vision-and-language (VL) research. Models such as LXMERT and UNITER have significantly lifted the state of the art over a wide range of VL tasks. However, the large number of parameters in such models hinders their application in practice. In parallel, work on the lottery ticket hypothesis (LTH) has shown that deep neural networks contain small matching subnetworks that can achieve on par or even better performance than the dense networks when trained in isolation. In this work, we perform the first empirical study to assess whether such trainable subnetworks also exist in pre-trained VL models. We use UNITER as the main testbed (also test on LXMERT and ViLT), and consolidate 7 representative VL tasks for experiments, including visual question answering, visual commonsense reasoning, visual entailment, referring expression comprehension, image-text retrieval, GQA, and NLVR2. Through comprehensive analysis, we summarize our main findings as follows. (i) It is difficult to find subnetworks that strictly match the performance of the full model. However, we can find “relaxed” winning tickets at 50%-70% sparsity that maintain 99% of the full accuracy. (ii) Subnetworks found by task-specific pruning transfer reasonably well to the other tasks, while those found on the pre-training tasks at 60%/70% sparsity transfer universally, matching 98%/96% of the full accuracy on average over all the tasks. (iii) Besides UNITER, other models such as LXMERT and ViLT can also play lottery tickets. However, the highest sparsity we can achieve for ViLT is far lower than LXMERT and UNITER (30% vs. 70%). (iv) LTH also remains relevant when using other training methods (e.g., adversarial training).

TMLR Journal 2022 Journal Article

Queried Unlabeled Data Improves and Robustifies Class-Incremental Learning

Tianlong Chen
Sijia Liu
Shiyu Chang
Lisa Amini
Zhangyang Wang

Class-incremental learning (CIL) suffers from the notorious dilemma between learning newly added classes and preserving previously learned class knowledge. This catastrophic forgetting issue could be mitigated by storing historical data for replay, which, however, would cause memory overheads as well as imbalanced prediction updates. To address this dilemma, we propose to leverage "free" external unlabeled data querying in continual learning. We first present a CIL with Queried Unlabeled Data (CIL-QUD) scheme, where we only store a handful of past training samples as anchors and use them to query relevant unlabeled examples each time. Along with new and past stored data, the queried unlabeled data are effectively utilized through learning-without-forgetting (LwF) regularizers and class-balance training. Besides preserving model generalization over past and current tasks, we next study the problem of adversarial robustness for CIL-QUD. Inspired by the recent success of learning robust models with unlabeled data, we explore a new robustness-aware CIL setting, where the learned adversarial robustness has to resist forgetting and be transferred as new tasks come in continually. While existing options easily fail, we show queried unlabeled data can continue to benefit, and seamlessly extend CIL-QUD into its robustified version, RCIL-QUD. Extensive experiments demonstrate that CIL-QUD achieves substantial accuracy gains on CIFAR-10 and CIFAR-100, compared to previous state-of-the-art CIL approaches. Moreover, RCIL-QUD establishes the first strong milestone for robustness-aware CIL. Codes are available at https://github.com/VITA-Group/CIL-QUD.

NeurIPS Conference 2022 Conference Paper

Randomized Channel Shuffling: Minimal-Overhead Backdoor Attack Detection without Clean Datasets

Ruisi Cai
Zhenyu Zhang
Tianlong Chen
Xiaohan Chen
Zhangyang Wang

Deep neural networks (DNNs) typically require massive data to train on, which is a hurdle for numerous practical domains. Facing the data shortfall, one viable option is to acquire domain-specific training data from external uncensored sources, such as open webs or third-party data collectors. However, the quality of such acquired data is often not rigorously scrutinized, and one cannot easily rule out the risk of "poisoned" examples being included in such unreliable datasets, resulting in unreliable trained models which pose potential risks to many high-stake applications. While existing options usually suffer from high computational costs or assumptions on clean data access, this paper attempts to detect backdoors for potential victim models with minimal prior knowledge. In particular, provided with a trained model, users are assumed to (1) have no prior knowledge of whether it is already poisoned, or what the target class/percentage of samples is poisoned, and (2) have no access to a clean sample set from the same training distribution, nor any trusted model trained on such clean data. To tackle this challenging scenario, we first observe the contrasting channel-level statistics between the backdoor trigger and clean image features, and consequently, how they can be differentiated by progressive channel shuffling. We then propose the randomized channel shuffling method for backdoor-targeted class detection, which requires only a few feed-forward passes. It thus incurs minimal overheads and demands no clean sample nor prior knowledge. We further explore a “full” clean data-free setting, where neither the target class detection nor the trigger recovery can access the clean data. Extensive experiments are conducted with three datasets (CIFAR-10, GTSRB, Tiny ImageNet), three architectures (AlexNet, ResNet-20, SENet-18), and three attacks (BadNets, clean label attack, and WaNet). Results consistently endorse the effectiveness of our proposed technique in backdoor model detection, with margins of 0.291 to 0.640 AUROC over the current state of the art. Codes are available at https://github.com/VITA-Group/Random-Shuffling-BackdoorDetect.

NeurIPS Conference 2022 Conference Paper

Sparse Winning Tickets are Data-Efficient Image Recognizers

Mukund Varma T
Xuxi Chen
Zhenyu Zhang
Tianlong Chen
Subhashini Venugopalan
Zhangyang Wang

Improving the performance of deep networks in data-limited regimes has warranted much attention. In this work, we empirically show that “winning tickets” (small sub-networks) obtained via magnitude pruning based on the lottery ticket hypothesis, apart from being sparse, are also effective recognizers in data-limited regimes. Based on extensive experiments, we find that in low data regimes (datasets of 50-100 examples per class), sparse winning tickets substantially outperform the original dense networks. This approach, when combined with augmentations or fine-tuning from a self-supervised backbone network, shows further improvements in performance by as much as 16% (absolute) on low-sample datasets and long-tailed classification. Further, sparse winning tickets are more robust to synthetic noise and distribution shifts compared to their dense counterparts. Our analysis of winning tickets on small datasets indicates that, though sparse, the networks retain density in the initial layers and their representations are more generalizable. Code is available at https://github.com/VITA-Group/DataEfficientLTH.
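As a rough illustration of the magnitude pruning mentioned in the abstract above, the following sketch builds a binary mask that zeroes the smallest-magnitude weights. It is a simplification under stated assumptions: the weights are a flat Python list, pruning happens in a single round, and the function name `magnitude_mask` is hypothetical. It is not the authors' pipeline, which is available in the linked repository.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that removes the `sparsity` fraction of smallest-magnitude weights.

    Hypothetical helper for illustration; real pipelines prune tensors layer-wise
    or globally and usually repeat prune/retrain rounds.
    """
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by ascending absolute value; the first n_prune are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


# Toy example: prune half of six weights.
weights = [0.9, -0.05, 0.3, -0.7, 0.01, 0.2]
mask = magnitude_mask(weights, sparsity=0.5)
sparse_weights = [w * m for w, m in zip(weights, mask)]
```

In lottery-ticket experiments the surviving weights are typically reset to their initial (or early-training) values before retraining; that step is omitted here.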

NeurIPS Conference 2021 Conference Paper

Chasing Sparsity in Vision Transformers: An End-to-End Exploration

Tianlong Chen
Yu Cheng
Zhe Gan
Lu Yuan
Lei Zhang
Zhangyang Wang

Vision transformers (ViTs) have recently received explosive popularity, but their enormous model sizes and training costs remain daunting. Conventional post-training pruning often incurs higher training budgets. In contrast, this paper aims to trim down both the training memory overhead and the inference complexity, without sacrificing the achievable accuracy. We carry out the first-of-its-kind comprehensive exploration, taking a unified approach of integrating sparsity in ViTs "from end to end". Specifically, instead of training full ViTs, we dynamically extract and train sparse subnetworks, while sticking to a fixed small parameter budget. Our approach jointly optimizes model parameters and explores connectivity throughout training, ending up with one sparse network as the final output. The approach is seamlessly extended from unstructured to structured sparsity, the latter by guiding the prune-and-grow of self-attention heads inside ViTs. We further co-explore data and architecture sparsity for additional efficiency gains by plugging in a novel learnable token selector to adaptively determine the currently most vital patches. Extensive results on ImageNet with diverse ViT backbones validate the effectiveness of our proposals, which obtain significantly reduced computational cost and almost unimpaired generalization. Perhaps most surprisingly, we find that the proposed sparse (co-)training can sometimes $\textit{improve the ViT accuracy}$ rather than compromising it, making sparsity a tantalizing "free lunch". For example, our sparsified DeiT-Small at ($5\%$, $50\%$) sparsity for (data, architecture) improves top-1 accuracy by $\mathbf{0.28\%}$, and meanwhile enjoys $\mathbf{49.32\%}$ FLOPs and $\mathbf{4.40\%}$ running time savings. Our codes are available at https://github.com/VITA-Group/SViTE.

NeurIPS Conference 2021 Conference Paper

Data-Efficient GAN Training Beyond (Just) Augmentations: A Lottery Ticket Perspective

Tianlong Chen
Yu Cheng
Zhe Gan
Jingjing Liu
Zhangyang Wang

Training generative adversarial networks (GANs) with limited real image data generally results in deteriorated performance and collapsed models. To conquer this challenge, we are inspired by the latest observation, that one can discover independently trainable and highly sparse subnetworks (a.k.a. lottery tickets) from GANs. Treating this as an inductive prior, we suggest a brand-new angle towards data-efficient GAN training: by first identifying the lottery ticket from the original GAN using the small training set of real images; and then focusing on training that sparse subnetwork by re-using the same set. We find our coordinated framework to offer orthogonal gains to existing real image data augmentation methods, and we additionally present a new feature-level augmentation that can be applied together with them. Comprehensive experiments endorse the effectiveness of our proposed framework, across various GAN architectures (SNGAN, BigGAN, and StyleGAN-V2) and diverse datasets (CIFAR-10, CIFAR-100, Tiny-ImageNet, ImageNet, and multiple few-shot generation datasets). Codes are available at: https://github.com/VITA-Group/Ultra-Data-Efficient-GAN-Training.

NeurIPS Conference 2021 Conference Paper

Improving Contrastive Learning on Imbalanced Data via Open-World Sampling

Ziyu Jiang
Tianlong Chen
Ting Chen
Zhangyang Wang

Contrastive learning approaches have achieved great success in learning visual representations with few labels of the target classes. That implies a tantalizing possibility of scaling them up beyond a curated “seed” benchmark, to incorporating more unlabeled images from internet-scale external sources to enhance their performance. However, in practice, a larger amount of unlabeled data will require more computing resources due to the bigger model size and longer training needed. Moreover, open-world unlabeled data usually follows an implicit long-tail class or attribute distribution, and many of the samples do not belong to the target classes. Blindly leveraging all unlabeled data hence can lead to data imbalance as well as distraction issues. This motivates us to seek a principled approach to strategically select unlabeled data from an external source, in order to learn generalizable, balanced and diverse representations for relevant classes. In this work, we present an open-world unlabeled data sampling framework called Model-Aware K-center (MAK), which follows three simple principles: (1) tailness, which encourages sampling of examples from tail classes, by sorting the empirical contrastive loss expectation (ECLE) of samples over random data augmentations; (2) proximity, which rejects the out-of-distribution outliers that may distract training; and (3) diversity, which ensures diversity in the set of sampled examples. Empirically, using ImageNet-100-LT (without labels) as the seed dataset and two “noisy” external data sources, we demonstrate that MAK can consistently improve both the overall representation quality and the class balancedness of the learned features, as evaluated via linear classifier evaluation on full-shot and few-shot settings. The code is available at: https://github.com/VITA-Group/MAK.

NeurIPS Conference 2021 Conference Paper

Sanity Checks for Lottery Tickets: Does Your Winning Ticket Really Win the Jackpot?

Xiaolong Ma
Geng Yuan
Xuan Shen
Tianlong Chen
Xuxi Chen
Xiaohan Chen
Ning Liu
Minghai Qin

There have been long-standing controversies and inconsistencies over the experiment setup and criteria for identifying the "winning ticket" in the literature. To reconcile these, we revisit the definition of the lottery ticket hypothesis, with comprehensive and more rigorous conditions. Under our new definition, we show concrete evidence to clarify whether the winning ticket exists across the major DNN architectures and/or applications. Through extensive experiments, we perform quantitative analysis on the correlations between winning tickets and various experimental factors, and empirically study the patterns of our observations. We find that the key training hyperparameters, such as learning rate and training epochs, as well as the architecture characteristics such as capacities and residual connections, are all highly correlated with whether and when the winning tickets can be identified. Based on our analysis, we summarize a guideline for parameter settings with regard to specific architecture characteristics, which we hope will catalyze research progress on the topic of the lottery ticket hypothesis. Our codes are publicly available at: https://github.com/boone891214/sanity-check-LTH.

NeurIPS Conference 2021 Conference Paper

Sparse Training via Boosting Pruning Plasticity with Neuroregeneration

Shiwei Liu
Tianlong Chen
Xiaohan Chen
Zahra Atashgahi
Lu Yin
Huanyu Kou
Li Shen
Mykola Pechenizkiy

Works on the lottery ticket hypothesis (LTH) and single-shot network pruning (SNIP) have recently drawn considerable attention to post-training pruning (iterative magnitude pruning) and before-training pruning (pruning at initialization). The former method suffers from an extremely large computation cost and the latter usually struggles with insufficient performance. In comparison, during-training pruning, a class of pruning methods that simultaneously enjoys training/inference efficiency and comparable performance, has so far been less explored. To better understand during-training pruning, we quantitatively study the effect of pruning throughout training from the perspective of pruning plasticity (the ability of the pruned networks to recover the original performance). Pruning plasticity can help explain several other empirical observations about neural network pruning in the literature. We further find that pruning plasticity can be substantially improved by injecting a brain-inspired mechanism called neuroregeneration, i.e., to regenerate the same number of connections as pruned. We design a novel gradual magnitude pruning (GMP) method, named gradual pruning with zero-cost neuroregeneration (GraNet), that advances the state of the art. Perhaps most impressively, its sparse-to-sparse version for the first time boosts the sparse-to-sparse training performance over various dense-to-sparse methods with ResNet-50 on ImageNet without extending the training time. We release all codes at https://github.com/Shiweiliuiiiiiii/GraNet.
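The idea of regenerating as many connections as are pruned can be sketched as a single prune-and-regrow step that keeps the number of active connections constant. The abstract does not specify the selection criteria, so this sketch assumes magnitude-based pruning and gradient-magnitude-based regrowth with zero initialization; the function name `prune_and_regrow` and the toy values are hypothetical, and the actual GraNet schedule is in the linked repository.

```python
def prune_and_regrow(weights, grads, mask, k):
    """One prune-and-regrow step that keeps the count of active connections fixed.

    Assumptions (not taken from the abstract): prune by smallest weight magnitude,
    regrow by largest gradient magnitude, and start regrown weights at zero.
    """
    active = [i for i, m in enumerate(mask) if m == 1]
    inactive = [i for i, m in enumerate(mask) if m == 0]
    k = min(k, len(active), len(inactive))
    # Prune: drop the k active connections with the smallest weight magnitude.
    to_prune = sorted(active, key=lambda i: abs(weights[i]))[:k]
    # Regrow: revive the k previously inactive connections with the largest gradient magnitude.
    to_grow = sorted(inactive, key=lambda i: abs(grads[i]), reverse=True)[:k]
    new_weights, new_mask = list(weights), list(mask)
    for i in to_prune:
        new_mask[i] = 0
        new_weights[i] = 0.0
    for i in to_grow:
        new_mask[i] = 1
        new_weights[i] = 0.0  # regrown connections start from zero
    return new_weights, new_mask


# Toy example with four active connections out of six.
weights = [0.8, 0.0, -0.1, 0.0, 0.5, 0.02]
grads = [0.1, 0.9, 0.2, 0.05, 0.3, 0.4]
mask = [1, 0, 1, 0, 1, 1]
new_weights, new_mask = prune_and_regrow(weights, grads, mask, k=1)
```

Candidates for regrowth are taken only from connections that were inactive before the step, so a connection pruned in this step cannot be revived immediately.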

NeurIPS Conference 2021 Conference Paper

You are caught stealing my winning lottery ticket! Making a lottery ticket claim its ownership

Xuxi Chen
Tianlong Chen
Zhenyu Zhang
Zhangyang Wang

Despite tremendous success in many application scenarios, the training and inference costs of using deep learning are also rapidly increasing over time. The lottery ticket hypothesis (LTH) emerges as a promising framework to leverage a special sparse subnetwork (i.e., $\textit{winning ticket}$) instead of a full model for both training and inference, that can lower both costs without sacrificing the performance. The main resource bottleneck of LTH is however the extraordinary cost to find the sparse mask of the winning ticket. That makes the found winning ticket become a valuable asset to the owners, highlighting the necessity of protecting its copyright. Our setting adds a new dimension to the recently soaring interest in protecting against the intellectual property (IP) infringement of deep models and verifying their ownerships, since they take owners' massive/unique resources to develop or train. While existing methods explored encrypted weights or predictions, we investigate a unique way to leverage sparse topological information to perform $\textit{lottery verification}$, by developing several graph-based signatures that can be embedded as credentials. By further combining trigger set-based methods, our proposal can work in both white-box and black-box verification scenarios. Through extensive experiments, we demonstrate the effectiveness of lottery verification in diverse models (ResNet-20, ResNet-18, ResNet-50) on CIFAR-10 and CIFAR-100. Specifically, our verification is shown to be robust to removal attacks such as model fine-tuning and pruning, as well as several ambiguity attacks. Our codes are available at https://github.com/VITA-Group/NO-stealing-LTH.

NeurIPS Conference 2020 Conference Paper

Graph Contrastive Learning with Augmentations

Yuning You
Tianlong Chen
Yongduo Sui
Ting Chen
Zhangyang Wang
Yang Shen

Generalizable, transferrable, and robust representation learning on graph-structured data remains a challenge for current graph neural networks (GNNs). Unlike what has been developed for convolutional neural networks (CNNs) for image data, self-supervised learning and pre-training are less explored for GNNs. In this paper, we propose a graph contrastive learning (GraphCL) framework for learning unsupervised representations of graph data. We first design four types of graph augmentations to incorporate various priors. We then systematically study the impact of various combinations of graph augmentations on multiple datasets, in four different settings: semi-supervised, unsupervised, and transfer learning as well as adversarial attacks. The results show that, even without tuning augmentation extents nor using sophisticated GNN architectures, our GraphCL framework can produce graph representations of similar or better generalizability, transferrability, and robustness compared to state-of-the-art methods. We also investigate the impact of parameterized graph augmentation extents and patterns, and observe further performance gains in preliminary experiments. Our codes are available at https://github.com/Shen-Lab/GraphCL.

NeurIPS Conference 2020 Conference Paper

Once-for-All Adversarial Training: In-Situ Tradeoff between Robustness and Accuracy for Free

Haotao Wang
Tianlong Chen
Shupeng Gui
TingKuei Hu
Ji Liu
Zhangyang Wang

Adversarial training and its many variants substantially improve deep network robustness, yet at the cost of compromising standard accuracy. Moreover, the training process is heavy and hence it becomes impractical to thoroughly explore the trade-off between accuracy and robustness. This paper asks this new question: how to quickly calibrate a trained model in-situ, to examine the achievable trade-offs between its standard and robust accuracies, without (re-)training it many times? Our proposed framework, Once-for-all Adversarial Training (OAT), is built on an innovative model-conditional training framework, with a controlling hyper-parameter as the input. The trained model could be adjusted among different standard and robust accuracies “for free” at testing time. As an important knob, we exploit dual batch normalization to separate standard and adversarial feature statistics, so that they can be learned in one model without degrading performance. We further extend OAT to a Once-for-all Adversarial Training and Slimming (OATS) framework, that allows for the joint trade-off among accuracy, robustness and runtime efficiency. Experiments show that, without any re-training nor ensembling, OAT/OATS achieve similar or even superior performance compared to dedicatedly trained models at various configurations. Our codes and pretrained models are available at: https://github.com/VITA-Group/Once-for-All-Adversarial-Training.

NeurIPS Conference 2020 Conference Paper

Robust Pre-Training by Adversarial Contrastive Learning

Ziyu Jiang
Tianlong Chen
Ting Chen
Zhangyang Wang

Recent work has shown that, when integrated with adversarial training, self-supervised pre-training can lead to state-of-the-art robustness. In this work, we improve robustness-aware self-supervised pre-training by learning representations that are consistent under both data augmentations and adversarial perturbations. Our approach leverages a recent contrastive learning framework, which learns representations by maximizing feature consistency under differently augmented views. This fits particularly well with the goal of adversarial robustness, as one cause of adversarial fragility is the lack of feature invariance, i.e., small input perturbations can result in undesirable large changes in features or even predicted labels. We explore various options to formulate the contrastive task, and demonstrate that by injecting adversarial perturbations, contrastive pre-training can lead to models that are both label-efficient and robust. We empirically evaluate the proposed Adversarial Contrastive Learning (ACL) and show it can consistently outperform existing methods. For example, on the CIFAR-10 dataset, ACL outperforms the previous state-of-the-art unsupervised robust pre-training approach by 2.99% on robust accuracy and 2.14% on standard accuracy. We further demonstrate that ACL pre-training can improve semi-supervised adversarial training, even when only a few labeled examples are available. Our codes and pre-trained models have been released at: https://github.com/VITA-Group/Adversarial-Contrastive-Learning.
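To make "maximizing feature consistency under differently augmented views" concrete, here is a small sketch of a temperature-scaled contrastive loss over paired feature vectors. The abstract does not name the specific loss, so the NT-Xent-style form, the function names, and the toy features below are assumptions for illustration. In an adversarial variant one of the two views would come from a perturbed input, which this sketch does not generate.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(views_a, views_b, temperature=0.5):
    """Average NT-Xent-style loss; views_a[i] and views_b[i] form a positive pair."""
    feats = views_a + views_b
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same sample
        logits = {k: cosine(feats[i], feats[k]) / temperature
                  for k in range(2 * n) if k != i}
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        total += log_denominator - logits[pos]
    return total / (2 * n)


# Toy features: consistent views stay close to their anchors, inconsistent ones are swapped.
anchors = [[1.0, 0.0], [0.0, 1.0]]
consistent_views = [[1.0, 0.1], [0.1, 1.0]]
inconsistent_views = [[0.1, 1.0], [1.0, 0.1]]
loss_consistent = contrastive_loss(anchors, consistent_views)
loss_inconsistent = contrastive_loss(anchors, inconsistent_views)
```

The loss is lower when each pair of views maps to nearby features, which is the consistency property the abstract describes.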

NeurIPS Conference 2020 Conference Paper

The Lottery Ticket Hypothesis for Pre-trained BERT Networks

Tianlong Chen
Jonathan Frankle
Shiyu Chang
Sijia Liu
Yang Zhang
Zhangyang Wang
Michael Carbin

In natural language processing (NLP), enormous pre-trained models like BERT have become the standard starting point for training on a range of downstream tasks, and similar trends are emerging in other areas of deep learning. In parallel, work on the lottery ticket hypothesis has shown that models for NLP and computer vision contain smaller matching subnetworks capable of training in isolation to full accuracy and transferring to other tasks. In this work, we combine these observations to assess whether such trainable, transferrable subnetworks exist in pre-trained BERT models. For a range of downstream tasks, we indeed find matching subnetworks at 40% to 90% sparsity. We find these subnetworks at (pre-trained) initialization, a deviation from prior NLP research where they emerge only after some amount of training. Subnetworks found on the masked language modeling task (the same task used to pre-train the model) transfer universally; those found on other tasks transfer in a limited fashion if at all. As large-scale pre-training becomes an increasingly central paradigm in deep learning, our results demonstrate that the main lottery ticket observations remain relevant in this context. Codes available at https://github.com/VITA-Group/BERT-Tickets.

NeurIPS Conference 2020 Conference Paper

Training Stronger Baselines for Learning to Optimize

Tianlong Chen
Weiyi Zhang
Zhou Jingyang
Shiyu Chang
Sijia Liu
Lisa Amini
Zhangyang Wang

Learning to optimize (L2O) is gaining increased attention because classical optimizers require laborious, problem-specific design and hyperparameter tuning. However, there are significant performance and practicality gaps between manually designed optimizers and existing L2O models. Specifically, learned optimizers are applicable to only a limited class of problems, often exhibit instability, and generalize poorly. As research efforts focus on increasingly sophisticated L2O models, we argue for an orthogonal, under-explored theme: improved training techniques for L2O models. We first present a progressive, curriculum-based training scheme, which gradually increases the optimizer unroll length to mitigate the well-known L2O dilemma of truncation bias (shorter unrolling) versus gradient explosion (longer unrolling). Secondly, we present an off-policy imitation learning based approach to guide the L2O learning, by learning from the behavior of analytical optimizers. We evaluate our improved training techniques with a variety of state-of-the-art L2O models and immediately boost their performance, without making any change to their model structures. We demonstrate that, using our improved training techniques, one of the earliest and simplest L2O models can be trained to outperform even the latest and most complex L2O models on a number of tasks. Our results demonstrate a greater potential of L2O yet to be unleashed, and prompt a reconsideration of recent L2O model progress. Our codes are publicly available at: https://github.com/VITA-Group/L2O-Training-Techniques.

NeurIPS Conference 2019 Conference Paper

Learning to Optimize in Swarms

Yue Cao
Tianlong Chen
Zhangyang Wang
Yang Shen

Learning to optimize has emerged as a powerful framework for various optimization and machine learning tasks. Current such "meta-optimizers" often learn in the space of continuous optimization algorithms that are point-based and uncertainty-unaware. To overcome these limitations, we propose a meta-optimizer that learns in the algorithmic space of both point-based and population-based optimization algorithms. The meta-optimizer targets a meta-loss function consisting of both cumulative regret and entropy. Specifically, we learn and interpret the update formula through a population of LSTMs embedded with sample- and feature-level attentions. Meanwhile, we estimate the posterior directly over the global optimum and use an uncertainty measure to help guide the learning process. Empirical results over non-convex test functions and the protein-docking application demonstrate that this new meta-optimizer outperforms existing competitors. The codes are publicly available at: https://github.com/Shen-Lab/LOIS