Arrow Research search

Author name cluster

Yingyan Lin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
1 author row

Possible papers (10)

NeurIPS Conference 2023 Conference Paper

ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer

  • Haoran You
  • Huihong Shi
  • Yipin Guo
  • Yingyan Lin

Vision Transformers (ViTs) have shown impressive performance and have become a unified backbone for multiple vision tasks. However, both the attention mechanism and multi-layer perceptrons (MLPs) in ViTs are not sufficiently efficient due to dense multiplications, leading to costly training and inference. To this end, we propose to reparameterize pre-trained ViTs with a mixture of multiplication primitives, e.g., bitwise shifts and additions, towards a new type of multiplication-reduced model, dubbed ShiftAddViT, which aims to achieve end-to-end inference speedups on GPUs without requiring training from scratch. Specifically, all MatMuls among queries, keys, and values are reparameterized using additive kernels, after mapping queries and keys to binary codes in Hamming space. The remaining MLPs or linear layers are then reparameterized with shift kernels. We utilize TVM to implement and optimize those customized kernels for practical hardware deployment on GPUs. We find that such a reparameterization of (quadratic or linear) attention maintains model accuracy, while inevitably leading to accuracy drops when applied to MLPs. To marry the best of both worlds, we further propose a new mixture-of-experts (MoE) framework to reparameterize MLPs by taking multiplication or its primitives, e.g., multiplication and shift, as experts, and designing a new latency-aware load-balancing loss. Such a loss helps to train a generic router that assigns a dynamic number of input tokens to different experts according to their latency. In principle, the faster the experts run, the more input tokens they are assigned. Extensive experiments on various 2D/3D Transformer-based vision tasks consistently validate the effectiveness of our proposed ShiftAddViT, achieving up to 5.18× latency reductions on GPUs and 42.9% energy savings, while maintaining accuracy comparable to original or efficient ViTs. Code and models are available at https://github.com/GATECH-EIC/ShiftAddViT.
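
To make the MoE routing idea concrete, here is a minimal numpy sketch of a latency-aware load-balancing objective, assuming the target token share of each expert is simply proportional to its measured throughput (1/latency); the exact loss form, expert set, and latency figures are illustrative and not taken from the paper.

```python
import numpy as np

# Minimal sketch of a latency-aware load-balancing objective for MoE-style
# reparameterized MLPs: the target share of tokens per expert is taken to be
# proportional to its throughput (1 / latency), so faster experts are pushed
# to receive more tokens. This only illustrates the idea, not the exact loss.
def latency_aware_balance_loss(router_probs, expert_latency_ms):
    target = 1.0 / np.asarray(expert_latency_ms, dtype=float)
    target = target / target.sum()            # desired token fraction per expert
    load = router_probs.mean(axis=0)          # actual average assignment per expert
    return float(np.sum((load - target) ** 2))

scores = np.random.default_rng(0).random((196, 2))       # router scores, 196 tokens, 2 experts
probs = scores / scores.sum(axis=1, keepdims=True)       # normalize to per-token probabilities
print(latency_aware_balance_loss(probs, expert_latency_ms=[0.4, 1.2]))
```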

AAAI Conference 2022 Conference Paper

Early-Bird GCNs: Graph-Network Co-optimization towards More Efficient GCN Training and Inference via Drawing Early-Bird Lottery Tickets

  • Haoran You
  • Zhihan Lu
  • Zijian Zhou
  • Yonggan Fu
  • Yingyan Lin

Graph Convolutional Networks (GCNs) have emerged as the state-of-the-art deep learning model for representation learning on graphs. However, it remains notoriously challenging to train GCNs and run inference on them over large graph datasets, limiting their application to large real-world graphs and hindering the exploration of deeper and more sophisticated GCN graphs. This is because as the graph size grows, the sheer number of node features and the large adjacency matrix can easily explode the required memory and data movements. To tackle the aforementioned challenges, we explore the possibility of drawing lottery tickets when sparsifying GCN graphs, i.e., subgraphs that largely shrink the adjacency matrix yet are capable of achieving accuracy comparable to or even better than their full graphs. Specifically, we for the first time discover the existence of graph early-bird (GEB) tickets that emerge at the very early stage when sparsifying GCN graphs, and propose a simple yet effective detector to automatically identify the emergence of such GEB tickets. Furthermore, we advocate graph-model co-optimization and develop a generic efficient GCN early-bird training framework dubbed GEBT that can significantly boost the efficiency of GCN training by (1) drawing joint early-bird tickets between the GCN graphs and models and (2) enabling simultaneous sparsification of both the GCN graphs and models. Experiments on various GCN models and datasets consistently validate our GEB finding and the effectiveness of our GEBT, e.g., our GEBT achieves up to 80.2% ∼ 85.6% and 84.6% ∼ 87.5% savings of GCN training and inference costs, respectively, while offering comparable or even better accuracy compared to state-of-the-art methods. Our source code and supplementary material are available at https://github.com/RICE-EIC/Early-Bird-GCN.
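
As an illustration of the early-bird detection idea, the following numpy sketch compares consecutive graph sparsification masks and declares a GEB ticket once the masks stop changing; the mask construction, distance measure, window, and tolerance are illustrative assumptions, not the paper's exact detector.

```python
import numpy as np

def topk_mask(scores, keep_ratio):
    """Binary mask keeping the largest-magnitude entries (a stand-in for graph sparsification)."""
    k = max(1, int(keep_ratio * scores.size))
    thresh = np.sort(np.abs(scores).ravel())[-k]
    return (np.abs(scores) >= thresh).astype(np.int8)

def mask_distance(m1, m2):
    """Normalized Hamming distance between two sparsification masks."""
    return np.mean(m1 != m2)

def geb_emerged(mask_history, window=3, tol=0.02):
    """Declare an early-bird ticket once recent masks barely change."""
    if len(mask_history) <= window:
        return False
    recent = mask_history[-(window + 1):]
    return all(mask_distance(a, b) < tol for a, b in zip(recent, recent[1:]))

# Toy usage: adjacency "scores" drift less and less as training goes on.
rng = np.random.default_rng(0)
scores = rng.normal(size=(100, 100))
history = []
for epoch in range(10):
    scores += 0.1 * rng.normal(scale=1.0 / (epoch + 1) ** 2, size=scores.shape)
    history.append(topk_mask(scores, keep_ratio=0.1))
    if geb_emerged(history):
        print("GEB ticket detected at epoch", epoch)
        break
```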

TMLR Journal 2022 Journal Article

Max-Affine Spline Insights Into Deep Network Pruning

  • Haoran You
  • Randall Balestriero
  • Zhihan Lu
  • Yutong Kou
  • Huihong Shi
  • Shunyao Zhang
  • Shang Wu
  • Yingyan Lin

State-of-the-art (SOTA) approaches to deep network (DN) training overparametrize the model and then prune a posteriori to obtain a "winning ticket" subnetwork that can achieve high accuracy. Using a recently developed spline interpretation of DNs, we obtain novel insights into how pruning affects a DN's mapping. In particular, under the lens of spline operators, we are able to pinpoint the impact of pruning on the DN's underlying input-space partition and per-region affine mappings, opening new avenues for understanding why and when pruned DNs are able to maintain high performance. We also discover that a DN's spline mapping exhibits an early-bird (EB) phenomenon whereby the spline's partition converges at early training stages, bridging the recently developed DN spline theory and the lottery ticket hypothesis of DNs. We finally leverage this new insight to develop a principled and efficient pruning strategy whose goal is to prune isolated groups of nodes that contribute redundantly to the formation of the spline partition. Extensive experiments on four networks and three datasets validate that our new spline-based DN pruning approach reduces training FLOPs by up to 3.5x while achieving similar or even better accuracy than current state-of-the-art methods. Code is available at https://github.com/RICE-EIC/Spline-EB.
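
A minimal sketch of the spline viewpoint, assuming a single ReLU layer probed with random inputs: each unit's on/off pattern over the probe set encodes how it partitions input space, and units with nearly identical patterns contribute redundantly to the partition, making them natural pruning candidates. The probe set, tolerance, and grouping rule are illustrative, not the paper's exact criterion.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16))               # random probe inputs
W = rng.normal(size=(32, 16))                # one ReLU layer with 32 units
b = rng.normal(size=32)
W[5] = W[3] + 0.01 * rng.normal(size=16)     # plant a nearly duplicate unit
b[5] = b[3]

codes = (X @ W.T + b > 0)                    # (256, 32) on/off partition codes

def redundant_pairs(codes, tol=0.05):
    """Pairs of units whose partition codes disagree on < tol of the probes."""
    n = codes.shape[1]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if np.mean(codes[:, i] != codes[:, j]) < tol]

print("redundant unit pairs:", redundant_pairs(codes))   # expect (3, 5)
```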

AAAI Conference 2022 Conference Paper

MIA-Former: Efficient and Robust Vision Transformers via Multi-Grained Input-Adaptation

  • Zhongzhi Yu
  • Yonggan Fu
  • Sicheng Li
  • Chaojian Li
  • Yingyan Lin

Vision transformers (ViTs) have recently demonstrated great success in various computer vision tasks, motivating a tremendously increased interest in their deployment into many real-world IoT applications. However, powerful ViTs are often too computationally expensive to be fitted onto real-world resource-constrained devices, due to (1) their quadratically increased complexity with the number of input tokens and (2) their overparameterized self-attention heads and model depth. In parallel, different images are of varying complexity and their different regions can contain various levels of visual information, e.g., a sky background is not as informative as a foreground object in object classification tasks, indicating that treating all regions/tokens equally in terms of model complexity is unnecessary, while such opportunities for trimming down ViTs' complexity have not been fully explored. To this end, we propose a Multi-grained Input-Adaptive Vision TransFormer framework dubbed MIA-Former that can input-adaptively adjust the structure of ViTs at three coarse-to-fine-grained granularities (i.e., model depth and the number of model heads/tokens). In particular, our MIA-Former adopts a low-cost network trained with a hybrid supervised and reinforcement training method to skip unnecessary layers, heads, and tokens in an input-adaptive manner, reducing the overall computational cost. Furthermore, an interesting side effect of our MIA-Former is that its resulting ViTs are naturally equipped with improved robustness against adversarial attacks over their static counterparts, because MIA-Former's multi-grained dynamic control improves model diversity in a manner similar to ensembling and thus increases the difficulty of adversarial attacks against all its sub-models. Extensive experiments and ablation studies validate that the proposed MIA-Former framework can (1) effectively allocate computation budgets adaptive to the difficulty of input images, achieving state-of-the-art (SOTA) accuracy-efficiency trade-offs, e.g., 20% computation savings with the same or even higher accuracy compared with SOTA dynamic transformer models, and (2) boost ViTs' robust accuracy under various adversarial attacks over their vanilla counterparts by 2.4% and 3.0%, respectively. Our code is available at https://github.com/RICE-EIC/MIA-Former.
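
The control flow of multi-grained input adaptation can be sketched as below, assuming a hypothetical single linear "controller" that maps a pooled input embedding to keep/skip decisions for layers, heads, and tokens; MIA-Former's actual controller and its hybrid supervised/reinforcement training are not reproduced here.

```python
import numpy as np

# Minimal sketch of input-adaptive structure control at three granularities.
# A tiny linear controller turns a cheap summary of the input into binary
# keep/skip decisions; the real controller is a trained network.
rng = np.random.default_rng(0)
num_layers, num_heads, num_tokens = 12, 6, 197

ctrl_w = rng.normal(scale=0.1, size=(num_layers + num_heads + num_tokens, 64))

def decide(pooled_embedding, threshold=0.0):
    logits = ctrl_w @ pooled_embedding
    keep = logits > threshold
    layer_keep = keep[:num_layers]
    head_keep = keep[num_layers:num_layers + num_heads]
    token_keep = keep[num_layers + num_heads:]
    return layer_keep, head_keep, token_keep

layer_keep, head_keep, token_keep = decide(rng.normal(size=64))
print(layer_keep.sum(), "layers,", head_keep.sum(), "heads,",
      token_keep.sum(), "tokens kept for this input")
```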

NeurIPS Conference 2021 Conference Paper

Drawing Robust Scratch Tickets: Subnetworks with Inborn Robustness Are Found within Randomly Initialized Networks

  • Yonggan Fu
  • Qixuan Yu
  • Yang Zhang
  • Shang Wu
  • Xu Ouyang
  • David Cox
  • Yingyan Lin

Deep Neural Networks (DNNs) are known to be vulnerable to adversarial attacks, i.e., an imperceptible perturbation to the input can mislead DNNs trained on clean images into making erroneous predictions. To tackle this, adversarial training is currently the most effective defense method, by augmenting the training set with adversarial samples generated on the fly. Interestingly, we discover for the first time that there exist subnetworks with inborn robustness, matching or surpassing the robust accuracy of the adversarially trained networks with comparable model sizes, within randomly initialized networks without any model training, indicating that adversarial training on model weights is not indispensable towards adversarial robustness. We name such subnetworks Robust Scratch Tickets (RSTs), which are also by nature efficient. Distinct from the popular lottery ticket hypothesis, neither the original dense networks nor the identified RSTs need to be trained. To validate and understand this fascinating finding, we further conduct extensive experiments to study the existence and properties of RSTs under different models, datasets, sparsity patterns, and attacks, drawing insights regarding the relationship between DNNs' robustness and their initialization/overparameterization. Furthermore, we identify the poor adversarial transferability between RSTs of different sparsity ratios drawn from the same randomly initialized dense network, and propose a Random RST Switch (R2S) technique, which randomly switches between different RSTs, as a novel defense method built on top of RSTs. We believe our findings about RSTs have opened up a new perspective to study model robustness and extend the lottery ticket hypothesis.
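
The following numpy sketch shows the structural idea of a scratch ticket, a binary mask over frozen random weights, together with the Random RST Switch defense that picks one of several masks per query. The scores here are random placeholders; in practice they would be optimized (for example, edge-popup style) against an adversarial objective, which this sketch does not attempt.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 128))          # frozen random weights, never trained

def scratch_ticket(scores, sparsity):
    """Keep the top-(1 - sparsity) weights by score; the weights stay random."""
    k = int((1.0 - sparsity) * scores.size)
    thresh = np.sort(scores.ravel())[-k]
    return (scores >= thresh).astype(np.float32)

# Placeholder scores; a real RST search would optimize these.
scores = rng.random(w.shape)
tickets = {s: scratch_ticket(scores, s) for s in (0.5, 0.7, 0.9)}

def r2s_forward(x):
    """Random RST Switch (R2S) sketch: pick one ticket at random per query."""
    mask = tickets[rng.choice(list(tickets))]
    return x @ (w * mask).T

print(r2s_forward(rng.normal(size=(1, 128))).shape)
```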

NeurIPS Conference 2021 Conference Paper

Locality Sensitive Teaching

  • Zhaozhuo Xu
  • Beidi Chen
  • Chaojian Li
  • Weiyang Liu
  • Le Song
  • Yingyan Lin
  • Anshumali Shrivastava

The emergence of the Internet-of-Things (IoT) sheds light on applying machine teaching (MT) algorithms for online personalized education on home devices. This direction became even more promising during the COVID-19 pandemic, when in-person education became infeasible. However, as one of the most influential and practical MT paradigms, iterative machine teaching (IMT) is prohibitively expensive on IoT devices due to its inefficient and unscalable algorithms. IMT is a paradigm in which a teacher feeds examples iteratively and intelligently based on the learner's status. In each iteration, current IMT algorithms greedily traverse the whole training set to find an example for the learner, which is computationally expensive in practice. We propose a novel teaching framework, Locality Sensitive Teaching (LST), based on locality sensitive sampling, to overcome these challenges. LST has provable near-constant time complexity, which is exponentially better than the existing baseline. With up to 425.12x speedups and 99.76% energy savings over IMT, LST is the first algorithm that enables energy- and time-efficient machine teaching on IoT devices. Owing to LST's substantial efficiency and scalability, it is readily applicable in real-world education scenarios.
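
A minimal sketch of locality sensitive sampling with random hyperplanes: the candidate pool is hashed once, and teaching examples near a query direction are retrieved by bucket lookup instead of a full scan. The hash family, bit count, and single-table lookup are illustrative assumptions rather than LST's exact construction.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
pool = rng.normal(size=(10_000, 32))     # candidate teaching examples
planes = rng.normal(size=(16, 32))       # 16 signed random projections per example

def lsh_code(v):
    """Hash vector(s) to a tuple of sign bits under the random hyperplanes."""
    bits = (planes @ v.T > 0).astype(np.uint8)
    return tuple(bits.ravel()) if v.ndim == 1 else [tuple(b) for b in bits.T]

# Build the hash table once over the whole pool.
table = defaultdict(list)
for idx, code in enumerate(lsh_code(pool)):
    table[code].append(idx)

def candidates(query):
    """Examples falling in the same bucket as the query (may be empty)."""
    return table.get(lsh_code(query), [])

print(len(candidates(pool[0])), "candidates retrieved for a sample query")
```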

AAAI Conference 2020 Conference Paper

Fractional Skipping: Towards Finer-Grained Dynamic CNN Inference

  • Jianghao Shen
  • Yue Wang
  • Pengfei Xu
  • Yonggan Fu
  • Zhangyang Wang
  • Yingyan Lin

While increasingly deep networks are still generally desired for achieving state-of-the-art performance, for many specific inputs a simpler network might already suffice. Existing works exploit this observation by learning to skip convolutional layers in an input-dependent manner. However, we argue that their binary decision scheme, i.e., either fully executing or completely bypassing one layer for a specific input, can be enhanced by introducing finer-grained, “softer” decisions. We therefore propose a Dynamic Fractional Skipping (DFS) framework. The core idea of DFS is to hypothesize layer-wise quantization (to different bitwidths) as intermediate “soft” choices between fully utilizing and skipping a layer. For each input, DFS dynamically assigns a bitwidth to both the weights and activations of each layer, where fully executing and skipping can be viewed as two “extremes” (i.e., full bitwidth and zero bitwidth). In this way, DFS can “fractionally” exploit a layer’s expressive power during input-adaptive inference, enabling finer-grained accuracy-computational cost trade-offs. It presents a unified view linking input-adaptive layer skipping and input-adaptive hybrid quantization. Extensive experimental results demonstrate the superior trade-off between computational cost and model expressive power (accuracy) achieved by DFS. Further visualizations also indicate a smooth and consistent transition in DFS behaviors, especially in the learned choices between layer skipping and different quantizations as the total computational budget varies, validating our hypothesis that layer quantization can be viewed as an intermediate variant of layer skipping. Our source code and supplementary material are available at https://github.com/Torment123/DFS.
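
The "bitwidth as a soft skipping knob" idea can be illustrated with the toy residual layer below; a hypothetical uniform quantizer stands in for DFS's quantized execution, bitwidth 0 is treated as skipping the layer, and the bitwidth is passed in by hand rather than chosen by DFS's learned gating.

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric quantization to `bits` bits (illustrative, bits >= 2)."""
    scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

def fractional_layer(x, w, bits):
    """Toy residual layer whose compute precision is chosen per input.

    bits == 0 skips the layer entirely; larger bitwidths "fractionally"
    execute it. DFS picks `bits` with a learned gate, not a hand-passed value.
    """
    if bits == 0:
        return x
    return x + quantize(x, bits) @ quantize(w, bits).T

rng = np.random.default_rng(0)
x, w = rng.normal(size=(1, 32)), rng.normal(scale=0.1, size=(32, 32))
for bits in (0, 4, 8, 32):
    print(bits, "bits ->", round(float(np.linalg.norm(fractional_layer(x, w, bits))), 4))
```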

NeurIPS Conference 2020 Conference Paper

FracTrain: Fractionally Squeezing Bit Savings Both Temporally and Spatially for Efficient DNN Training

  • Yonggan Fu
  • Haoran You
  • Yang Zhao
  • Yue Wang
  • Chaojian Li
  • Kailash Gopalakrishnan
  • Zhangyang Wang
  • Yingyan Lin

Recent breakthroughs in deep neural networks (DNNs) have fueled a tremendous demand for intelligent edge devices featuring on-site learning, while the practical realization of such systems remains a challenge due to the limited resources available at the edge and the massive training costs required for state-of-the-art (SOTA) DNNs. As reducing precision is one of the most effective knobs for boosting training time/energy efficiency, there has been a growing interest in low-precision DNN training. In this paper, we explore an orthogonal direction: how to fractionally squeeze out more training cost savings from the most redundant bit level, progressively along the training trajectory and dynamically per input. Specifically, we propose FracTrain, which integrates (i) progressive fractional quantization, which gradually increases the precision of activations, weights, and gradients so that they do not reach the precision of SOTA static quantized DNN training until the final training stage, and (ii) dynamic fractional quantization, which assigns precisions to both the activations and gradients of each layer in an input-adaptive manner, for only "fractionally" updating layer parameters. Extensive simulations and ablation studies (six models, four datasets, and three training settings including standard, adaptation, and fine-tuning) validate the effectiveness of FracTrain in reducing computational cost and hardware-quantified energy/latency of DNN training while achieving comparable or better (-0.12% ~ +1.87%) accuracy. For example, when training ResNet-74 on CIFAR-10, FracTrain achieves 77.6% and 53.5% computational cost and training latency savings, respectively, compared with the best SOTA baseline, while achieving comparable (-0.07%) accuracy. Our codes are available at: https://github.com/RICE-EIC/FracTrain.
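
As an illustration of the progressive ("temporal") half of FracTrain, the sketch below returns per-stage bitwidths for activations, weights, and gradients that only reach the final static precision in the last stage; the concrete bitwidths and stage boundaries are made-up placeholders, not the paper's settings.

```python
# Minimal sketch of a progressive fractional quantization schedule, assuming
# an illustrative four-stage ramp; bitwidths and boundaries are hyperparameters
# in practice, not the values shown here.
def pfq_schedule(epoch, total_epochs):
    """Return (activation, weight, gradient) bitwidths for the current epoch."""
    progress = epoch / total_epochs
    if progress < 0.25:
        return 3, 3, 6
    if progress < 0.50:
        return 4, 4, 6
    if progress < 0.75:
        return 6, 6, 8
    return 8, 8, 8          # only the final stage matches static low-precision training

for epoch in (0, 30, 60, 90):
    print(epoch, pfq_schedule(epoch, total_epochs=120))
```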

NeurIPS Conference 2020 Conference Paper

ShiftAddNet: A Hardware-Inspired Deep Network

  • Haoran You
  • Xiaohan Chen
  • Yongan Zhang
  • Chaojian Li
  • Sicheng Li
  • Zihao Liu
  • Zhangyang Wang
  • Yingyan Lin

Multiplication (e.g., convolution) is arguably a cornerstone of modern deep neural networks (DNNs). However, intensive multiplications incur expensive resource costs that challenge DNNs' deployment on resource-constrained edge devices, driving several attempts at multiplication-less deep networks. This paper presents ShiftAddNet, whose main inspiration is drawn from a common practice in energy-efficient hardware implementation: multiplication can instead be performed with additions and logical bit-shifts. We leverage this idea to explicitly parameterize deep networks in this way, yielding a new type of deep network that involves only bit-shift and additive weight layers. This hardware-inspired ShiftAddNet immediately leads to both energy-efficient inference and training, without compromising expressive capacity compared to standard DNNs. The two complementary operation types (bit-shift and add) additionally enable finer-grained control of the model's learning capacity, leading to a more flexible trade-off between accuracy and (training) efficiency, as well as improved robustness to quantization and pruning. We conduct extensive experiments and ablation studies, all backed up by our FPGA-based ShiftAddNet implementation and energy measurements. Compared to existing DNNs or other multiplication-less models, ShiftAddNet aggressively reduces over 80% of the hardware-quantified energy cost of DNN training and inference, while offering comparable or better accuracies. Codes and pre-trained models are available at https://github.com/RICE-EIC/ShiftAddNet.
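
A minimal numpy sketch of the two multiplication-free primitives: a shift layer whose weights are rounded to signed powers of two (so each multiply becomes a bit-shift) and an additive layer computing negative L1 distances in the spirit of adder networks. Shapes and the stacking order are illustrative; ShiftAddNet's actual CUDA/FPGA kernels and training rules are not reproduced.

```python
import numpy as np

def shift_layer(x, w, eps=1e-8):
    """Weights rounded to signed powers of two, so multiplies become bit-shifts."""
    w_shift = np.sign(w) * np.exp2(np.round(np.log2(np.abs(w) + eps)))
    return x @ w_shift.T

def add_layer(x, w):
    """Additive layer: negative L1 distance between inputs and weight vectors."""
    return -np.abs(x[:, None, :] - w[None, :, :]).sum(axis=-1)

# Toy composition of the two multiplication-free primitives.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64))
w_shift = rng.normal(scale=0.1, size=(32, 64))
w_add = rng.normal(size=(16, 32))
out = add_layer(shift_layer(x, w_shift), w_add)   # shape (4, 16)
print(out.shape)
```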

NeurIPS Conference 2019 Conference Paper

E2-Train: Training State-of-the-art CNNs with Over 80% Energy Savings

  • Yue Wang
  • Ziyu Jiang
  • Xiaohan Chen
  • Pengfei Xu
  • Yang Zhao
  • Yingyan Lin
  • Zhangyang Wang

Convolutional neural networks (CNNs) have been increasingly deployed to edge devices. Hence, many efforts have been made towards efficient CNN inference on resource-constrained platforms. This paper attempts to explore an orthogonal direction: how to conduct more energy-efficient training of CNNs, so as to enable on-device training? We strive to reduce the energy cost during training, by dropping unnecessary computations, from three complementary levels: stochastic mini-batch dropping on the data level; selective layer update on the model level; and sign prediction for low-cost, low-precision back-propagation, on the algorithm level. Extensive simulations and ablation studies, with real energy measurements from an FPGA board, confirm the superiority of our proposed strategies and demonstrate remarkable energy savings for training. For example, when training ResNet-74 on CIFAR-10, we achieve aggressive energy savings of >90% and >60%, while incurring a top-1 accuracy loss of only about 2% and 1.2%, respectively. When training ResNet-110 on CIFAR-100, an over 84% training energy saving is achieved without degrading inference accuracy.
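
Of the three levels, the data-level knob (stochastic mini-batch dropping) is the simplest to sketch: each mini-batch is skipped with probability p, so on average only a (1 - p) fraction of the gradient computations is paid for. The training loop below is a toy stand-in; the selective layer update and sign-prediction back-propagation knobs are not shown.

```python
import random

def train_with_smd(batches, train_step, drop_prob=0.5, seed=0):
    """Stochastic mini-batch dropping: skip each batch with probability drop_prob."""
    rng = random.Random(seed)
    used = 0
    for batch in batches:
        if rng.random() < drop_prob:
            continue                    # drop this mini-batch entirely
        train_step(batch)               # placeholder for the real forward/backward pass
        used += 1
    return used

used = train_with_smd(range(100), train_step=lambda b: None, drop_prob=0.5)
print(f"trained on {used}/100 mini-batches")
```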