Arrow Research search

Author name cluster

Geng Yuan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

28 papers
2 author rows

Possible papers (28)

JBHI Journal 2026 Journal Article

MTS-LOF: Medical Time-Series Representation Learning via Occlusion-Invariant Features

  • Huayu Li
  • Ana S. Carreon-Rascon
  • Xiwen Chen
  • Geng Yuan
  • Ao Li

Medical time series data are indispensable in healthcare, providing critical insights for disease diagnosis, treatment planning, and patient management. The exponential growth in data complexity, driven by advanced sensor technologies, has presented challenges related to data labeling. Self-supervised learning (SSL) has emerged as a transformative approach to address these challenges, eliminating the need for extensive human annotation. In this study, we introduce a novel framework for Medical Time Series Representation Learning, known as MTS-LOF. MTS-LOF leverages the strengths of Joint-Embedding SSL and Masked Autoencoder (MAE) methods, offering a unique approach to representation learning for medical time series data. By combining these techniques, MTS-LOF enhances the potential of healthcare applications by providing more sophisticated, context-rich representations. Additionally, MTS-LOF employs a multi-masking strategy to facilitate occlusion-invariant feature learning. This approach allows the model to create multiple views of the data by masking portions of it. By minimizing the discrepancy between the representations of these masked patches and the fully visible patches, MTS-LOF learns to capture rich contextual information within medical time series datasets. The results of experiments conducted on diverse medical time series datasets demonstrate the superiority of MTS-LOF over other methods. These findings hold promise for significantly enhancing healthcare applications by improving representation learning. Furthermore, our work delves into the integration of Joint-Embedding SSL and MAE techniques, shedding light on the intricate interplay between temporal and structural dependencies in healthcare data. This understanding is crucial, as it allows us to grasp the complexities of healthcare data analysis.
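
To make the multi-masking idea concrete, here is a minimal sketch (not the authors' MTS-LOF implementation): a toy transformer encoder produces one representation from the fully visible patch sequence and several from randomly occluded views, and the loss pulls each occluded view toward the full view. The TinyEncoder module, mask ratio, and cosine objective are illustrative assumptions.

```python
# Sketch of occlusion-invariant learning via multi-masking (assumed form).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    def __init__(self, patch_dim=32, d_model=64):
        super().__init__()
        self.proj = nn.Linear(patch_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, patches, keep_mask=None):
        x = self.proj(patches)                      # (B, N, d_model)
        if keep_mask is not None:                   # zero out occluded patches
            x = x * keep_mask.unsqueeze(-1)
        return self.encoder(x).mean(dim=1)          # pooled representation

def multi_mask_loss(encoder, patches, num_views=4, mask_ratio=0.5):
    with torch.no_grad():                           # full view acts as the target
        target = encoder(patches)
    loss = 0.0
    for _ in range(num_views):
        keep = (torch.rand(patches.shape[:2]) > mask_ratio).float()
        view = encoder(patches, keep_mask=keep)
        loss = loss + (1 - F.cosine_similarity(view, target, dim=-1)).mean()
    return loss / num_views

encoder = TinyEncoder()
series_patches = torch.randn(8, 20, 32)             # batch of 8 series, 20 patches each
print(multi_mask_loss(encoder, series_patches).item())
```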

IJCAI Conference 2025 Conference Paper

FairSMOE: Mitigating Multi-Attribute Fairness Problem with Sparse Mixture-of-Experts

  • Changdi Yang
  • Zheng Zhan
  • Ci Zhang
  • Yifan Gong
  • Yize Li
  • Zichong Meng
  • Jun Liu
  • Xuan Shen

Real-world datasets usually contain multiple attributes, making it essential to ensure fairness across all of them simultaneously. However, different attributes may vary in difficulty, and no existing approach has effectively addressed this issue, so an attribute-adaptive strategy is needed to achieve fairness for all attributes. Multi-task Learning (MTL) leverages shared information to optimize multiple tasks concurrently, while Sparsely-Gated Mixture-of-Experts (SMoE) can dynamically allocate computational resources to the tasks that need them most. In this work, we formulate the multi-attribute fairness issue as an MTL problem and employ SMoE to achieve desirable performance across all attributes simultaneously. We first analyze the feasibility of this formulation and find it promising, but observe that vanilla SMoE can suffer from an over-utilization problem that causes sub-optimal performance. We then propose an innovative SMoE framework for multi-attribute fair image classification, which further improves multi-attribute fairness by redesigning the MoE layer and the routing policy with fairness in mind. Extensive experiments demonstrate the effectiveness of the approach: with a DeiT-Small backbone, we achieve 77.25% and 86.01% accuracy on the ISIC2019 and CelebA datasets, respectively, with multi-attribute Predictive Quality Disparity (PQD) scores of 0.801 and 0.787, beating the current state-of-the-art methods Muffin, InfoFair, and MultiFair.
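
For readers unfamiliar with SMoE, the sketch below shows a generic sparsely-gated mixture-of-experts layer with top-k routing; FairSMOE's fairness-aware redesign of the MoE layer and routing policy is not reproduced here, and all module sizes are illustrative assumptions.

```python
# Minimal sparsely-gated MoE layer with top-k routing (generic SMoE sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=64, num_experts=4, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                           nn.Linear(d_model, d_model)) for _ in range(num_experts)])

    def forward(self, x):                                  # x: (B, d_model)
        scores = self.gate(x)                              # (B, num_experts)
        topv, topi = scores.topk(self.k, dim=-1)           # route each sample to k experts
        weights = F.softmax(topv, dim=-1)                  # (B, k)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                sel = topi[:, slot] == e                   # samples routed to expert e
                if sel.any():
                    out[sel] += weights[sel, slot].unsqueeze(-1) * expert(x[sel])
        return out

moe = SparseMoE()
print(moe(torch.randn(16, 64)).shape)                      # torch.Size([16, 64])
```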

NeurIPS Conference 2025 Conference Paper

Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning

  • Qitao Tan
  • Jun Liu
  • Zheng Zhan
  • Caiwen Ding
  • Yanzhi Wang
  • Xiaolong Ma
  • Jaewoo Lee
  • Jin Lu

Large language models (LLMs) excel across various tasks, but standard first-order (FO) fine-tuning demands considerable memory, significantly limiting real-world deployment. Recently, zeroth-order (ZO) optimization has stood out as a promising memory-efficient training paradigm: it avoids backward passes and relies solely on forward passes for gradient estimation, making it attractive for resource-constrained scenarios. However, ZO methods lag far behind FO methods in both convergence speed and accuracy. To bridge the gap, we introduce a novel layer-wise divergence analysis that uncovers the distinct update patterns of FO and ZO optimization. Building on these findings, and aiming to match the learning capacity of FO methods, we propose \textbf{Di}vergence-driven \textbf{Z}eroth-\textbf{O}rder (\textbf{DiZO}) optimization. DiZO conducts divergence-driven layer adaptation by incorporating projections into ZO updates, generating updates of diverse magnitudes precisely scaled to the optimization needs of each layer. Our results demonstrate that DiZO significantly reduces the iterations needed for convergence without sacrificing throughput, cutting training GPU hours by up to 48\% on various datasets. Moreover, DiZO consistently outperforms representative ZO baselines in fine-tuning RoBERTa-large, the OPT series, and the Llama series on downstream tasks and, in some cases, even surpasses memory-intensive FO fine-tuning. Our code is released at \url{https://github.com/Skilteee/DiZO}.
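
The forward-only gradient estimation that ZO methods rely on can be sketched as follows: a MeZO-style two-point estimator that reuses a shared random seed so the probe direction never needs to be stored. This is the generic ZO-SGD step, not DiZO itself; DiZO's divergence-driven layer-wise projections would sit on top of this update.

```python
# Two-point zeroth-order gradient estimate using only forward passes (sketch).
import torch

def zo_step(model, loss_fn, batch, lr=1e-4, eps=1e-3, seed=0):
    params = [p for p in model.parameters() if p.requires_grad]

    def perturb(scale):
        gen = torch.Generator().manual_seed(seed)        # same noise every call
        for p in params:
            z = torch.randn(p.shape, generator=gen)
            p.data.add_(scale * eps * z)

    with torch.no_grad():
        perturb(+1); loss_plus = loss_fn(model, batch)    # L(theta + eps*z)
        perturb(-2); loss_minus = loss_fn(model, batch)   # L(theta - eps*z)
        perturb(+1)                                       # restore original weights
        grad_scalar = (loss_plus - loss_minus) / (2 * eps)
        gen = torch.Generator().manual_seed(seed)
        for p in params:
            z = torch.randn(p.shape, generator=gen)
            p.data.add_(-lr * grad_scalar * z)            # descend along the probe direction
    return loss_plus.item()

# Toy usage with a hypothetical linear classifier.
model = torch.nn.Linear(10, 2)
batch = (torch.randn(4, 10), torch.randint(0, 2, (4,)))
loss_fn = lambda m, b: torch.nn.functional.cross_entropy(m(b[0]), b[1])
print(zo_step(model, loss_fn, batch))
```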

ICLR Conference 2025 Conference Paper

Mutual Effort for Efficiency: A Similarity-based Token Pruning for Vision Transformers in Self-Supervised Learning

  • Sheng Li 0019
  • Qitao Tan
  • Yue Dai 0005
  • Zhenglun Kong
  • Tianyu Wang
  • Jun Liu 0075
  • Ao Li 0004
  • Ninghao Liu 0001

Self-supervised learning (SSL) offers a compelling solution to the challenge of extensive labeled data requirements in traditional supervised learning. With the proven success of Vision Transformers (ViTs) in supervised tasks, there is increasing interest in adapting them for SSL frameworks. However, despite SSL's ability to achieve high accuracy without labeled data, its high computational demands pose substantial challenges, particularly on resource-limited platforms such as edge devices. Recent studies in supervised learning have shown that token pruning can reduce training costs by removing less informative tokens without compromising accuracy. However, SSL's dual-branch encoders make traditional single-branch pruning strategies less effective, as they fail to account for the critical cross-branch similarity information, leading to reduced accuracy in SSL. To this end, we introduce SimPrune, a novel token pruning strategy designed for ViTs in SSL. SimPrune leverages cross-branch similarity information to efficiently prune tokens, retaining essential semantic information across dual branches. Additionally, we incorporate a difficulty-aware pruning strategy to further enhance SimPrune's effectiveness. Experimental results show that our proposed approach effectively reduces training computation while maintaining accuracy. Specifically, our approach offers 24\% savings in training costs compared to the SSL baseline, without sacrificing accuracy.
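
A minimal sketch of the cross-branch idea, under the assumption that tokens with the lowest agreement between the two augmented views are the ones to drop; the actual SimPrune criterion and its difficulty-aware schedule are not reproduced here.

```python
# Similarity-based token pruning across two SSL branches (assumed simplification).
import torch
import torch.nn.functional as F

def prune_tokens(tokens_a, tokens_b, keep_ratio=0.7):
    """tokens_a, tokens_b: (B, N, D) token embeddings from the two branches."""
    sim = F.cosine_similarity(tokens_a, tokens_b, dim=-1)          # (B, N)
    num_keep = max(1, int(tokens_a.shape[1] * keep_ratio))
    keep_idx = sim.topk(num_keep, dim=1).indices                   # most consistent tokens
    gather = keep_idx.unsqueeze(-1).expand(-1, -1, tokens_a.shape[-1])
    return tokens_a.gather(1, gather), tokens_b.gather(1, gather)

a, b = torch.randn(2, 16, 64), torch.randn(2, 16, 64)
pa, pb = prune_tokens(a, b)
print(pa.shape)                                                    # torch.Size([2, 11, 64])
```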

AAAI Conference 2025 Conference Paper

Toward Adaptive Large Language Models Structured Pruning via Hybrid-grained Weight Importance Assessment

  • Jun Liu
  • Zhenglun Kong
  • Pu Zhao
  • Changdi Yang
  • Xuan Shen
  • Hao Tang
  • Geng Yuan
  • Wei Niu

Structured pruning for large language models (LLMs) has garnered significant academic interest due to its ability to efficiently compress and accelerate LLMs by eliminating redundant weight groups at a coarse-grained granularity. Current structured pruning methods for LLMs typically depend on a single granularity for assessing weight importance, resulting in notable performance degradation on downstream tasks. Intriguingly, our empirical investigations reveal that unstructured pruning, which achieves better performance retention by pruning weights at a finer granularity, \emph{i.e.}, individual weights, yields significantly different sparse LLM structures than structured pruning. This suggests that both holistic and individual assessments of weight importance are essential for LLM pruning. Building on this insight, we introduce the Hybrid-grained Weight Importance Assessment (HyWIA), a novel method that merges fine-grained and coarse-grained evaluations of weight importance for the pruning of LLMs. Leveraging an attention mechanism, HyWIA adaptively determines the optimal blend of granularity in weight importance assessments in an end-to-end pruning manner. Extensive experiments on LLaMA-V1/V2, Vicuna, Baichuan, and Bloom across various benchmarks demonstrate the effectiveness of HyWIA in pruning LLMs. For example, HyWIA surpasses the cutting-edge LLM-Pruner by an average margin of 2.82% in accuracy across seven downstream tasks when pruning LLaMA-7B by 50%.
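
A minimal sketch of blending fine-grained (per-weight) and coarse-grained (per-channel) importance scores before structured pruning; HyWIA learns this blend with an attention mechanism end-to-end, whereas the fixed `blend` coefficient below is an illustrative assumption.

```python
# Hybrid (fine + coarse) importance scoring for structured pruning (sketch).
import torch

def hybrid_channel_importance(weight, blend=0.5):
    """weight: (out_channels, in_features). Returns one score per output channel."""
    fine_per_channel = weight.abs().sum(dim=1)            # aggregate per-weight magnitudes
    coarse_per_channel = weight.norm(dim=1)               # holistic per-channel norm
    return blend * fine_per_channel + (1 - blend) * coarse_per_channel

w = torch.randn(16, 64)
scores = hybrid_channel_importance(w)
prune_idx = scores.topk(4, largest=False).indices          # channels to remove at 25% pruning
print(prune_idx.tolist())
```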

ICML Conference 2024 Conference Paper

Advancing Dynamic Sparse Training by Exploring Optimization Opportunities

  • Jie Ji
  • Gen Li 0012
  • Lu Yin 0006
  • Minghai Qin
  • Geng Yuan
  • Linke Guo
  • Shiwei Liu 0003
  • Xiaolong Ma

Dynamic Sparse Training (DST) is an effective approach for addressing the substantial training resource requirements posed by the ever-increasing size of Deep Neural Networks (DNNs). Characterized by its dynamic "train-prune-grow" schedule during training, DST implicitly develops a bi-level structure for training the weights while discovering a subnetwork topology. However, such a structure is consistently overlooked by current DST algorithms as a further optimization opportunity; these algorithms instead optimize only the weights while determining masks heuristically. In this paper, we extensively study DST algorithms and argue that the training scheme of DST naturally forms a bi-level problem in which the updating of weight and mask is interdependent. Based on this observation, we introduce a novel efficient training framework called BiDST, which, for the first time, introduces bi-level optimization methodology into the dynamic sparse training domain. Unlike traditional partial-heuristic DST schemes, which suffer from sub-optimal search efficiency for masks and miss the opportunity to fully explore the topological space of neural networks, BiDST excels at discovering excellent sparse patterns by optimizing mask and weight simultaneously, resulting in up to 2.62% higher accuracy, 2.1$\times$ faster execution speed, and 25$\times$ reduced overhead. Code is available at https://github.com/jjsrf/BiDST-ICML2024.
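
For context, the sketch below shows one heuristic "prune-and-grow" mask update of the kind generic DST schemes use (magnitude-based pruning plus gradient-based growth). This is the partial-heuristic baseline behavior that BiDST replaces with joint bi-level optimization of mask and weights, which is not reproduced here.

```python
# One heuristic prune-and-grow mask update, as in generic DST schemes (sketch).
import torch

def prune_and_grow(weight, grad, mask, update_fraction=0.1):
    n_active = int(mask.sum().item())
    n_swap = max(1, int(update_fraction * n_active))

    # Prune: drop the smallest-magnitude active weights.
    active_scores = torch.where(mask.bool(), weight.abs(),
                                torch.full_like(weight, float("inf")))
    drop_idx = active_scores.view(-1).topk(n_swap, largest=False).indices
    mask.view(-1)[drop_idx] = 0.0

    # Grow: activate the inactive positions with the largest gradient magnitude.
    inactive_scores = torch.where(mask.bool(), torch.full_like(grad, -1.0), grad.abs())
    grow_idx = inactive_scores.view(-1).topk(n_swap).indices
    mask.view(-1)[grow_idx] = 1.0
    weight.view(-1)[grow_idx] = 0.0            # newly grown weights start at zero
    return mask

w, g = torch.randn(64, 64), torch.randn(64, 64)
m = (torch.rand(64, 64) < 0.2).float()         # ~80% sparse mask
m = prune_and_grow(w, g, m)
print(int(m.sum().item()))                     # active count is unchanged
```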

NeurIPS Conference 2024 Conference Paper

Fast and Memory-Efficient Video Diffusion Using Streamlined Inference

  • Zheng Zhan
  • Yushu Wu
  • Yifan Gong
  • Zichong Meng
  • Zhenglun Kong
  • Changdi Yang
  • Geng Yuan
  • Pu Zhao

The rapid progress in artificial intelligence-generated content (AIGC), especially with diffusion models, has significantly advanced the development of high-quality video generation. However, current video diffusion models exhibit demanding computational requirements and high peak memory usage, especially for generating longer and higher-resolution videos. These limitations greatly hinder the practical application of video diffusion models on standard hardware platforms. To tackle this issue, we present a novel, training-free framework named Streamlined Inference, which leverages the temporal and spatial properties of video diffusion models. Our approach integrates three core components: Feature Slicer, Operator Grouping, and Step Rehash. Specifically, Feature Slicer effectively partitions input features into sub-features and Operator Grouping processes each sub-feature with a group of consecutive operators, resulting in significant memory reduction without sacrificing quality or speed. Step Rehash further exploits the similarity between adjacent steps in diffusion and accelerates inference by skipping unnecessary steps. Extensive experiments demonstrate that our approach significantly reduces peak memory and computational overhead, making it feasible to generate high-quality videos on a single consumer GPU (e.g., reducing the peak memory of Animatediff from 42GB to 11GB, with faster inference on a 2080Ti).
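
The step-skipping idea behind Step Rehash can be sketched roughly as follows: when the model outputs of the two most recent steps are nearly identical, the cached output is reused instead of running the network again. The similarity threshold, the toy update rule, and the stand-in model are all illustrative assumptions, not the paper's sampler.

```python
# Step-skipping by reusing cached outputs when adjacent steps are similar (sketch).
import torch
import torch.nn.functional as F

@torch.no_grad()
def denoise_with_rehash(model, x, timesteps, sim_threshold=0.99):
    prev, prev_prev, skipped_last = None, None, False
    for t in timesteps:
        similar = (prev is not None and prev_prev is not None and
                   F.cosine_similarity(prev.flatten(), prev_prev.flatten(), dim=0)
                   > sim_threshold)
        if similar and not skipped_last:
            eps, skipped_last = prev, True             # rehash: reuse cached output
        else:
            eps, skipped_last = model(x, t), False     # full forward pass
        prev_prev, prev = prev, eps
        x = x - 0.1 * eps                              # toy update rule, not a real sampler
    return x

# Toy usage with a stand-in "model" (hypothetical; a real video diffusion
# network would go here).
model = lambda x, t: 0.01 * x
out = denoise_with_rehash(model, torch.randn(1, 3, 8, 8), range(20))
print(out.shape)
```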

ICLR Conference 2024 Conference Paper

Waxing-and-Waning: a Generic Similarity-based Framework for Efficient Self-Supervised Learning

  • Sheng Li 0019
  • Chao Wu 0006
  • Ao Li 0004
  • Yanzhi Wang 0001
  • Xulong Tang
  • Geng Yuan

Deep Neural Networks (DNNs), essential for diverse applications such as visual recognition and eldercare, often require a large amount of labeled data for training, making widespread deployment of DNNs a challenging task. Self-supervised learning (SSL) emerges as a promising approach, which leverages inherent patterns within data through diverse augmentations to train models without explicit labels. However, while SSL has shown notable advancements in accuracy, its high computation cost remains a daunting impediment, particularly for resource-constrained platforms. To address this problem, we introduce SimWnW, a similarity-based efficient self-supervised learning framework. By strategically removing less important regions in augmented images and feature maps, SimWnW not only reduces computation costs but also eliminates irrelevant features that might slow down the learning process, thereby accelerating model convergence. The experimental results show that SimWnW effectively reduces computation costs in self-supervised model training without compromising accuracy. Specifically, SimWnW yields up to 54\% and 51\% computation savings in training from scratch and transfer learning tasks, respectively.

IJCAI Conference 2023 Conference Paper

Data Level Lottery Ticket Hypothesis for Vision Transformers

  • Xuan Shen
  • Zhenglun Kong
  • Minghai Qin
  • Peiyan Dong
  • Geng Yuan
  • Xin Meng
  • Hao Tang
  • Xiaolong Ma

The conventional lottery ticket hypothesis (LTH) claims that there exist a sparse subnetwork within a dense neural network and a proper random initialization method, called the winning ticket, such that the subnetwork can be trained from scratch to perform almost as well as the dense counterpart. Meanwhile, the LTH has scarcely been evaluated for vision transformers (ViTs). In this paper, we first show that the conventional winning ticket is hard to find at the weight level of ViTs by existing methods. Then, we generalize the LTH for ViTs to input data consisting of image patches, inspired by the input dependence of ViTs. That is, there exists a subset of input image patches such that a ViT can be trained from scratch using only this subset of patches and achieve accuracy similar to ViTs trained using all image patches. We call this subset of input patches the winning tickets, which represent a significant amount of the information in the input data. We use a ticket selector to generate the winning tickets based on the informativeness of patches for various types of ViT, including DeiT, LV-ViT, and Swin Transformers. The experiments show a clear difference between the performance of models trained with winning tickets and with randomly selected subsets, which verifies our proposed theory. We further elaborate on the analogy between our proposed Data-LTH-ViTs and the conventional LTH to verify the integrity of our theory. The source code is available at https://github.com/shawnricecake/vit-lottery-ticket-input.
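
A minimal sketch of selecting a patch-level "winning ticket": images are split into patches and the most informative ones are kept. Patch variance is used here as a stand-in informativeness score; the paper's ticket selector is learned, so this scoring rule is an assumption for illustration only.

```python
# Select a subset of image patches by an informativeness score (sketch).
import torch

def select_patch_ticket(images, patch_size=16, keep_ratio=0.5):
    B, C, H, W = images.shape
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch_size * patch_size)
    scores = patches.var(dim=-1)                               # stand-in informativeness
    num_keep = int(patches.shape[1] * keep_ratio)
    keep_idx = scores.topk(num_keep, dim=1).indices
    gather = keep_idx.unsqueeze(-1).expand(-1, -1, patches.shape[-1])
    return patches.gather(1, gather), keep_idx

imgs = torch.randn(4, 3, 224, 224)
ticket, idx = select_patch_ticket(imgs)
print(ticket.shape)                                            # torch.Size([4, 98, 768])
```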

NeurIPS Conference 2023 Conference Paper

HotBEV: Hardware-oriented Transformer-based Multi-View 3D Detector for BEV Perception

  • Peiyan Dong
  • Zhenglun Kong
  • Xin Meng
  • Pinrui Yu
  • Yifan Gong
  • Geng Yuan
  • Hao Tang
  • Yanzhi Wang

The bird's-eye-view (BEV) perception plays a critical role in autonomous driving systems, involving the accurate and efficient detection and tracking of objects from a top-down perspective. To achieve real-time decision-making in self-driving scenarios, low-latency computation is essential. While recent approaches to BEV detection have focused on improving detection precision using Lift-Splat-Shoot (LSS)-based or transformer-based schemas, the substantial computational and memory burden of these approaches increases the risk of system crashes when multiple on-vehicle tasks run simultaneously. Unfortunately, there is a dearth of literature on efficient BEV detector paradigms, let alone on achieving realistic speedups. Unlike existing works that focus on reducing computation costs, this paper focuses on developing an efficient model design that prioritizes actual on-device latency. To achieve this goal, we propose a latency-aware design methodology that considers key hardware properties, such as memory access cost and degree of parallelism. Given the prevalence of GPUs as the main computation platform for autonomous driving systems, we develop a theoretical latency prediction model and introduce efficient building operators. By leveraging these operators and following an effective local-to-global visual modeling process, we propose a hardware-oriented backbone that is also optimized for strong feature capturing and fusing. Using these insights, we present a new hardware-oriented framework for efficient yet accurate camera-view BEV detectors. Experiments show that HotBEV achieves a 2\%$\sim$23\% NDS gain and a 2\%$\sim$7.8\% mAP gain with 1.1$\times$$\sim$3.4$\times$ speedups compared to existing works on the V100; on multiple GPU devices such as the GTX 2080 and the low-end GTX 1080, HotBEV is 1.1$\times$$\sim$6.3$\times$ faster than other methods.

NeurIPS Conference 2023 Conference Paper

PackQViT: Faster Sub-8-bit Vision Transformers via Full and Packed Quantization on the Mobile

  • Peiyan Dong
  • Lei Lu
  • Chao Wu
  • Cheng Lyu
  • Geng Yuan
  • Hao Tang
  • Yanzhi Wang

While Vision Transformers (ViTs) have undoubtedly made impressive strides in computer vision (CV), their intricate network structures necessitate substantial computation and memory resources. A decision-making process for CV tasks typically entails performing computations with low latency, which is a tricky problem for ViT models. Model quantization is a widely used technique to optimize the hardware efficiency of deep neural networks, and full quantization under sub-8-bit precision, in particular, is a promising solution to significantly reduce inference latency. Unfortunately, current commodity hardware, such as CPUs and GPUs, still struggles to efficiently execute these sub-8-bit quantized networks, as their SIMD instructions only support a granularity of 8 bits or wider. Also, there is a scarcity of literature that presents a full quantization paradigm for ViTs. In this paper, we propose an activation-aware, fully sub-8-bit quantization-aware training (QAT) framework called PackQViT for efficient yet accurate ViT acceleration on mobile devices to facilitate real-time AI-powered decision-making. Specifically, in revisiting data activation within the ViT dataflow, two characteristics are relevant to the quantization strategy and precision: the long-tailed distribution and systematic channel-wise outliers. In response, we employ either log2 quantization or clipping to address the long-tailed distribution, and we incorporate outlier-aware training for residual-link quantization to regulate the various channel-wise outliers more consistently. Notably, because of the systematic fixed pattern, the outlier-aware training approach can predict the channel indices and regularized scales of outliers in advance, thus avoiding runtime data-adaptive selection during inference. Furthermore, we employ Int-$2^{n}$-Softmax, Int-LayerNorm, and Integer GELU to enable an integer-only computation flow. Finally, we develop a SIMD-based 4-bit packed multiplier to achieve end-to-end ViT acceleration on mobile phones. Compared to prior studies on ViT quantization using 8-bit precision, PackQViT improves accuracy by 0.4\% to 17.9\% for various widely used ViTs on the ImageNet dataset; under 4-bit precision, PackQViT demonstrates 0.4%$\sim$2.8% higher accuracy. Compared to the baseline multiplier, our implementations on a Realme GT Android smartphone with a Snapdragon 870 SoC CPU achieve a 2.6x$\sim$3.7x speedup under the 8-bit scenario and a 3.8x$\sim$5.9x speedup under 4-bit, which ensures practical real-time performance.
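
One ingredient named above, log2 (power-of-two) quantization for long-tailed activations, can be sketched as follows; the full QAT pipeline, outlier-aware training, and the packed 4-bit SIMD multiplier are not reproduced, and the simple max-based scaling shown here is an assumption.

```python
# Power-of-two (log2) quantization of a tensor of activations (sketch).
import torch

def log2_quantize(x, bits=4):
    scale = x.abs().max().clamp_min(1e-8)            # normalize so |x| / scale <= 1
    mag = (x.abs() / scale).clamp_min(1e-8)
    # Round each magnitude to the nearest power of two expressible with `bits` bits.
    exp = torch.round(torch.log2(mag)).clamp(min=-(2 ** bits - 1), max=0)
    return torch.sign(x) * scale * torch.pow(2.0, exp)

x = torch.randn(5) * 0.3
print(x)
print(log2_quantize(x))                              # values snapped to signed powers of two
```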

AAAI Conference 2023 Conference Paper

Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training

  • Zhenglun Kong
  • Haoyu Ma
  • Geng Yuan
  • Mengshu Sun
  • Yanyue Xie
  • Peiyan Dong
  • Xin Meng
  • Xuan Shen

Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage at both training and inference time limit their generalization. Previous compression algorithms usually start from the pre-trained dense models and only focus on efficient inference, while time-consuming training is still unavoidable. In contrast, this paper points out that the million-scale training data is redundant, which is the fundamental reason for the tedious training. To address the issue, this paper aims to introduce sparsity into data and proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT. Specifically, we leverage a hierarchical data redundancy reduction scheme, by exploring the sparsity under three levels: number of training examples in the dataset, number of patches (tokens) in each example, and number of connections between tokens that lie in attention weights. With extensive experiments, we demonstrate that our proposed technique can noticeably accelerate training for various ViT architectures while maintaining accuracy. Remarkably, under certain ratios, we are able to improve the ViT accuracy rather than compromising it. For example, we can achieve 15.2% speedup with 72.6% (+0.4) Top-1 accuracy on Deit-T, and 15.7% speedup with 79.9% (+0.1) Top-1 accuracy on Deit-S. This proves the existence of data redundancy in ViT. Our code is released at https://github.com/ZLKong/Tri-Level-ViT.

ICLR Conference 2023 Conference Paper

Self-Ensemble Protection: Training Checkpoints Are Good Data Protectors

  • Sizhe Chen
  • Geng Yuan
  • Xinwen Cheng
  • Yifan Gong 0004
  • Minghai Qin
  • Yanzhi Wang 0001
  • Xiaolin Huang

As data becomes increasingly vital, a company would be very cautious about releasing data, because competitors could use it to train high-performance models, thereby posing a tremendous threat to the company's commercial competence. To prevent good models from being trained on the data, we could add imperceptible perturbations to it. Since such perturbations aim at hurting the entire training process, they should reflect the vulnerability of DNN training, rather than that of a single model. Based on this new idea, we seek perturbed examples that are always unrecognized (never correctly classified) in training. In this paper, we uncover them via the gradients of model checkpoints, forming the proposed self-ensemble protection (SEP), which is very effective because (1) learning on examples ignored during normal training tends to yield DNNs that ignore normal examples; (2) checkpoints' cross-model gradients are close to orthogonal, meaning they are as diverse as DNNs with different architectures. That is, the remarkable performance of our ensemble requires only the computation of training one model. In extensive experiments with 9 baselines on 3 datasets and 5 architectures, SEP is verified to be a new state of the art; e.g., our small $\ell_\infty=2/255$ perturbations reduce the accuracy of a CIFAR-10 ResNet18 from 94.56% to 14.68%, compared to 41.35% for the best-known method. Code is available at https://github.com/Sizhe-Chen/SEP.
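
A minimal sketch of the protection idea under simplifying assumptions: the perturbation is optimized to minimize the loss of every checkpoint in the ensemble, so the perturbed examples appear "already learned" and contribute little during training. The step sizes, iteration count, and toy checkpoints are illustrative, not the paper's settings.

```python
# Error-minimizing perturbations crafted against an ensemble of checkpoints (sketch).
import torch
import torch.nn.functional as F

def self_ensemble_perturb(checkpoints, x, y, eps=2/255, steps=10, alpha=0.5/255):
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = sum(F.cross_entropy(m(x + delta), y) for m in checkpoints) / len(checkpoints)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta -= alpha * grad.sign()              # descend the loss (error-minimizing)
            delta.clamp_(-eps, eps)                   # keep the perturbation imperceptible
    return (x + delta).detach()

# Toy usage with hypothetical checkpoints of the same small model.
checkpoints = [torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
               for _ in range(3)]
x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
protected = self_ensemble_perturb(checkpoints, x, y)
print((protected - x).abs().max().item() <= 2/255 + 1e-6)   # perturbation stays in budget
```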

ICLR Conference 2023 Conference Paper

SmartFRZ: An Efficient Training Framework using Attention-Based Layer Freezing

  • Sheng Li 0019
  • Geng Yuan
  • Yue Dai 0005
  • Youtao Zhang
  • Yanzhi Wang 0001
  • Xulong Tang

There has been a proliferation of artificial intelligence applications, where model training is key to providing high-quality services for these applications. However, the model training process is both time-intensive and energy-intensive, inevitably affecting users' demand for application efficiency. Layer freezing, an efficient model training technique, has been proposed to improve training efficiency. Although existing layer freezing methods demonstrate great potential to reduce model training costs, they still have shortcomings such as limited generalizability and compromised accuracy. For instance, existing layer freezing methods either require the freeze configurations to be manually defined before training, which does not adapt to different networks, or use heuristic freezing criteria that cannot guarantee decent accuracy in different scenarios. Therefore, a generic and smart layer freezing method that can automatically perform ``in-situation'' layer freezing for different networks during the training process is still missing. To this end, we propose a generic and efficient training framework (SmartFRZ). The core technique in SmartFRZ is attention-guided layer freezing, which can automatically select the appropriate layers to freeze without compromising accuracy. Experimental results show that SmartFRZ effectively reduces the amount of computation in training, achieves significant training acceleration, and outperforms state-of-the-art layer freezing approaches.
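
The freezing mechanics can be sketched as below, with a deliberately simple criterion (relative weight change between two checks) standing in for SmartFRZ's attention-based predictor; the threshold and snapshot bookkeeping are assumptions for illustration.

```python
# Freeze layers whose weights have stopped changing noticeably (simplified sketch).
import torch
import torch.nn as nn

def maybe_freeze(model, prev_snapshots, threshold=1e-3):
    for name, module in model.named_children():
        params = list(module.parameters())
        if not params or not params[0].requires_grad:
            continue                                          # empty or already frozen
        flat = torch.cat([p.detach().flatten() for p in params])
        prev = prev_snapshots.get(name)
        prev_snapshots[name] = flat.clone()
        if prev is not None and (flat - prev).norm() / flat.norm() < threshold:
            for p in params:
                p.requires_grad_(False)                       # freeze this layer
            print(f"froze layer {name}")

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
snapshots = {}
maybe_freeze(model, snapshots)      # first call only records snapshots
maybe_freeze(model, snapshots)      # no update happened in between, so layers get frozen
```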

AAAI Conference 2023 Conference Paper

Towards Real-Time Segmentation on the Edge

  • Yanyu Li
  • Changdi Yang
  • Pu Zhao
  • Geng Yuan
  • Wei Niu
  • Jiexiong Guan
  • Hao Tang
  • Minghai Qin

Research in real-time segmentation has mainly focused on desktop GPUs. However, autonomous driving and many other applications rely on real-time segmentation on the edge, and current methods are far from that goal. In addition, recent advances in vision transformers inspire us to re-design the network architecture for dense prediction tasks. In this work, we propose to combine self-attention blocks with lightweight convolutions to form new building blocks, and we employ latency constraints to search for an efficient sub-network. We train an MLP latency model on generated architecture configurations and their latencies measured on mobile devices, so that we can predict the latency of subnets during the search phase. To the best of our knowledge, we are the first to achieve over 74% mIoU on Cityscapes with semi-real-time inference (over 15 FPS) on the mobile GPU of an off-the-shelf phone.

NeurIPS Conference 2022 Conference Paper

EfficientFormer: Vision Transformers at MobileNet Speed

  • Yanyu Li
  • Geng Yuan
  • Yang Wen
  • Ju Hu
  • Georgios Evangelidis
  • Sergey Tulyakov
  • Yanzhi Wang
  • Jian Ren

Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design choices, e.g., the attention mechanism, ViT-based models are generally several times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computation complexity of ViT through network architecture search or hybrid designs with MobileNet blocks, yet the inference speed is still unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs. Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. Finally, we perform latency-driven slimming to obtain a series of final models dubbed EfficientFormer. Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices. Our fastest model, EfficientFormer-L1, achieves $79.2\%$ top-1 accuracy on ImageNet-1K with only $1.6$ ms inference latency on iPhone 12 (compiled with CoreML), which runs as fast as MobileNetV2$\times 1.4$ ($1.6$ ms, $74.7\%$ top-1), and our largest model, EfficientFormer-L7, obtains $83.3\%$ accuracy with only $7.0$ ms latency. Our work proves that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance.

NeurIPS Conference 2022 Conference Paper

Layer Freezing & Data Sieving: Missing Pieces of a Generic Framework for Sparse Training

  • Geng Yuan
  • Yanyu Li
  • Sheng Li
  • Zhenglun Kong
  • Sergey Tulyakov
  • Xulong Tang
  • Yanzhi Wang
  • Jian Ren

Recently, sparse training has emerged as a promising paradigm for efficient deep learning on edge devices. Current research mainly devotes its efforts to reducing training costs by further increasing model sparsity. However, increasing sparsity is not always ideal, since it inevitably introduces severe accuracy degradation at extremely high sparsity levels. This paper intends to explore other possible directions to effectively and efficiently reduce sparse training costs while preserving accuracy. To this end, we investigate two techniques, namely layer freezing and data sieving. First, the layer freezing approach has shown its success in dense model training and fine-tuning, yet it has never been adopted in the sparse training domain. Nevertheless, the unique characteristics of sparse training may hinder the incorporation of layer freezing techniques. Therefore, we analyze the feasibility and potential of using layer freezing in sparse training and find it has the potential to save considerable training costs. Second, we propose a data sieving method for dataset-efficient training, which further reduces training costs by ensuring that only a partial dataset is used throughout the entire training process. We show that both techniques can be well incorporated into the sparse training algorithm to form a generic framework, which we dub SpFDE. Our extensive experiments demonstrate that SpFDE can significantly reduce training costs while preserving accuracy along three dimensions: weight sparsity, layer freezing, and data sieving. Our code and models will be released.

IJCAI Conference 2022 Conference Paper

Pruning-as-Search: Efficient Neural Architecture Search via Channel Pruning and Structural Reparameterization

  • Yanyu Li
  • Pu Zhao
  • Geng Yuan
  • Xue Lin
  • Yanzhi Wang
  • Xin Chen

Neural architecture search (NAS) and network pruning are widely studied efficient AI techniques, but neither is yet perfect. NAS performs exhaustive candidate architecture search, incurring tremendous search cost. Although (structured) pruning can simply shrink model dimensions, it remains unclear how to decide the per-layer sparsity automatically and optimally. In this work, we revisit the problem of layer-width optimization and propose Pruning-as-Search (PaS), an end-to-end channel pruning method that searches out the desired sub-network automatically and efficiently. Specifically, we add a depth-wise binary convolution to learn pruning policies directly through gradient descent. By combining structural reparameterization with PaS, we successfully search out a new family of VGG-like and lightweight networks, which allow arbitrary width for each layer instead of each stage. Experimental results show that our proposed architecture outperforms prior art by around 1.0% top-1 accuracy at similar inference speed on the ImageNet-1000 classification task. Furthermore, we demonstrate the effectiveness of our width search on complex tasks including instance segmentation and image translation. Code and models are released.
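
A minimal sketch of the learnable pruning-gate idea: a per-channel score is binarized in the forward pass and trained with a straight-through estimator, in the spirit of the depth-wise binary convolution described above (an assumed simplification, not the released implementation).

```python
# Per-channel binary pruning gates trained with a straight-through estimator (sketch).
import torch
import torch.nn as nn

class BinaryGate(nn.Module):
    """A 1x1 depth-wise gate per channel; forward is binary, backward is identity."""
    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1, 1))

    def forward(self, x):
        hard = (self.alpha > 0).float()                 # binary keep/prune decision
        gate = hard + self.alpha - self.alpha.detach()  # straight-through gradient path
        return x * gate

gate = BinaryGate(8)
feat = torch.randn(2, 8, 16, 16)
out = gate(feat)
out.sum().backward()
print(gate.alpha.grad.shape)                            # gradients flow to the gate scores
```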

IJCAI Conference 2022 Conference Paper

Real-Time Portrait Stylization on the Edge

  • Yanyu Li
  • Xuan Shen
  • Geng Yuan
  • Jiexiong Guan
  • Wei Niu
  • Hao Tang
  • Bin Ren
  • Yanzhi Wang

In this work we demonstrate real-time portrait stylization, specifically, translating self-portrait into cartoon or anime style on mobile devices. We propose a latency-driven differentiable architecture search method, maintaining realistic generative quality. With our framework, we obtain 10× computation reduction on the generative model and achieve real-time video stylization on off-the-shelf smartphone using mobile GPUs.

NeurIPS Conference 2022 Conference Paper

SparCL: Sparse Continual Learning on the Edge

  • Zifeng Wang
  • Zheng Zhan
  • Yifan Gong
  • Geng Yuan
  • Wei Niu
  • Tong Jian
  • Bin Ren
  • Stratis Ioannidis

Existing work in continual learning (CL) focuses on mitigating catastrophic forgetting, i.e., model performance deterioration on past tasks when learning a new task. However, the training efficiency of a CL system is under-investigated, which limits the real-world application of CL systems under resource-limited scenarios. In this work, we propose a novel framework called Sparse Continual Learning (SparCL), which is the first study that leverages sparsity to enable cost-effective continual learning on edge devices. SparCL achieves both training acceleration and accuracy preservation through the synergy of three aspects: weight sparsity, data efficiency, and gradient sparsity. Specifically, we propose task-aware dynamic masking (TDM) to learn a sparse network throughout the entire CL process, dynamic data removal (DDR) to remove less informative training data, and dynamic gradient masking (DGM) to sparsify the gradient updates. Each of them not only improves efficiency, but also further mitigates catastrophic forgetting. SparCL consistently improves the training efficiency of existing state-of-the-art (SOTA) CL methods, requiring up to 23X fewer training FLOPs, and, surprisingly, further improves the SOTA accuracy by up to 1.7%. SparCL also outperforms competitive baselines obtained by adapting SOTA sparse training methods to the CL setting in both efficiency and accuracy. We also evaluate the effectiveness of SparCL on a real mobile phone, further indicating the practical potential of our method.
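
Of the three components, dynamic gradient masking is the easiest to sketch: after the backward pass, only the largest-magnitude gradient entries of each tensor are kept. The keep ratio and per-tensor thresholding below are assumptions, not SparCL's exact DGM rule.

```python
# Keep only the top-magnitude gradient entries after backward (sketch of gradient masking).
import torch

def mask_gradients(model, keep_ratio=0.2):
    for p in model.parameters():
        if p.grad is None:
            continue
        g = p.grad.flatten()
        k = max(1, int(keep_ratio * g.numel()))
        threshold = g.abs().topk(k).values.min()
        p.grad.mul_((p.grad.abs() >= threshold).float())

model = torch.nn.Linear(32, 8)
model(torch.randn(4, 32)).sum().backward()
mask_gradients(model)
nonzero = sum(int((p.grad != 0).sum()) for p in model.parameters())
total = sum(p.numel() for p in model.parameters())
print(nonzero, "/", total)              # roughly keep_ratio of gradient entries survive
```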

AAAI Conference 2021 System Paper

A Compression-Compilation Co-Design Framework Towards Real-Time Object Detection on Mobile Devices

  • Yuxuan Cai
  • Geng Yuan
  • Hongjia Li
  • Wei Niu
  • Yanyu Li
  • Xulong Tang
  • Bin Ren
  • Yanzhi Wang

The rapid development and wide utilization of object detection techniques have raised requirements for both the accuracy and speed of object detectors. In this work, we propose a compression-compilation co-design framework to achieve real-time YOLOv4 inference on mobile devices. We propose a novel fine-grained structured pruning scheme that maintains high accuracy while achieving high hardware parallelism. Our pruned YOLOv4 achieves 48.9 mAP and a 17 FPS inference speed on an off-the-shelf Samsung Galaxy S20 smartphone, which is 5.5× faster than the original state-of-the-art detector YOLOv4.

IJCAI Conference 2021 Conference Paper

A Compression-Compilation Framework for On-mobile Real-time BERT Applications

  • Wei Niu
  • Zhenglun Kong
  • Geng Yuan
  • Weiwen Jiang
  • Jiexiong Guan
  • Caiwen Ding
  • Pu Zhao
  • Sijia Liu

Transformer-based deep learning models have increasingly demonstrated high accuracy on many natural language processing (NLP) tasks. In this paper, we propose a compression-compilation co-design framework that can guarantee the identified model meets both the resource and real-time specifications of mobile devices. Our framework applies a compiler-aware neural architecture optimization method (CANAO), which can generate the optimal compressed model that balances both accuracy and latency. We are able to achieve up to a 7.8x speedup compared with TensorFlow-Lite with only minor accuracy loss. We present two types of BERT applications on mobile devices: Question Answering (QA) and Text Generation. Both can be executed in real time with latency as low as 45ms. Videos demonstrating the framework can be found at https://www.youtube.com/watch?v=_WIRvK_2PZI.

ICML Conference 2021 Conference Paper

Lottery Ticket Preserves Weight Correlation: Is It Desirable or Not?

  • Ning Liu 0007
  • Geng Yuan
  • Zhengping Che
  • Xuan Shen
  • Xiaolong Ma
  • Qing Jin
  • Jian Ren 0005
  • Jian Tang 0008

In deep model compression, the recent finding of the "Lottery Ticket Hypothesis" (LTH) pointed out that there could exist a winning ticket (i.e., a properly pruned sub-network together with the original weight initialization) that can achieve performance competitive with the original dense network. However, it is not easy to observe such winning property in many scenarios, for example, when a relatively large learning rate is used, even though a large learning rate benefits training the original dense model. In this work, we investigate the underlying condition and rationale behind the winning property, and find that the underlying reason is largely attributed to the correlation between initialized weights and final-trained weights when the learning rate is not sufficiently large. Thus, the existence of the winning property is correlated with insufficient DNN pretraining, and is unlikely to occur for a well-trained DNN. To overcome this limitation, we propose a "pruning & fine-tuning" method that consistently outperforms lottery ticket sparse training under the same pruning algorithm and the same total training epochs. Extensive experiments over multiple deep models (VGG, ResNet, MobileNet-v2) on different datasets have been conducted to justify our proposals.

NeurIPS Conference 2021 Conference Paper

MEST: Accurate and Fast Memory-Economic Sparse Training Framework on the Edge

  • Geng Yuan
  • Xiaolong Ma
  • Wei Niu
  • Zhengang Li
  • Zhenglun Kong
  • Ning Liu
  • Yifan Gong
  • Zheng Zhan

Recently, a new trend of exploring sparsity for accelerating neural network training has emerged, embracing the paradigm of training on the edge. This paper proposes a novel Memory-Economic Sparse Training (MEST) framework targeting accurate and fast execution on edge devices. The proposed MEST framework consists of enhancements by Elastic Mutation (EM) and Soft Memory Bound (&S) that ensure superior accuracy at high sparsity ratios. Different from existing works on sparse training, this work reveals the importance of sparsity schemes on the performance of sparse training in terms of accuracy as well as training speed on real edge devices. On top of that, the paper proposes to employ data efficiency for further acceleration of sparse training. Our results suggest that unforgettable examples can be identified in situ even during the dynamic exploration of sparsity masks in the sparse training process, and can therefore be removed for further training speedup on edge devices. Compared with state-of-the-art (SOTA) works on accuracy, our MEST increases Top-1 accuracy significantly on ImageNet when using the same unstructured sparsity scheme. Systematic evaluations of accuracy, training speed, and memory footprint are conducted, where the proposed MEST framework consistently outperforms representative SOTA works. We further explore the impact of model sparsity, sparsity schemes, and sparse training algorithms on the number of removable training examples. Our code is publicly available at: https://github.com/boone891214/MEST.
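
A minimal sketch of identifying removable "unforgettable" examples in situ: examples that stay correctly classified for several consecutive epochs are flagged for removal from the training pool. The streak-based criterion and patience value are assumptions for illustration, not MEST's exact rule.

```python
# Track per-example prediction streaks and flag removable examples (sketch).
import torch

class ForgettingTracker:
    def __init__(self, num_examples, patience=3):
        self.correct_streak = torch.zeros(num_examples, dtype=torch.long)
        self.patience = patience

    def update(self, example_ids, correct):
        """example_ids: LongTensor of dataset indices; correct: BoolTensor, same shape."""
        self.correct_streak[example_ids] = torch.where(
            correct, self.correct_streak[example_ids] + 1,
            torch.zeros_like(self.correct_streak[example_ids]))

    def removable(self):
        return (self.correct_streak >= self.patience).nonzero(as_tuple=True)[0]

tracker = ForgettingTracker(num_examples=10)
for _ in range(4):                                   # examples 0-4 always correct, 5-9 never
    tracker.update(torch.arange(10), torch.arange(10) < 5)
print(tracker.removable().tolist())                  # [0, 1, 2, 3, 4]
```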

NeurIPS Conference 2021 Conference Paper

Sanity Checks for Lottery Tickets: Does Your Winning Ticket Really Win the Jackpot?

  • Xiaolong Ma
  • Geng Yuan
  • Xuan Shen
  • Tianlong Chen
  • Xuxi Chen
  • Xiaohan Chen
  • Ning Liu
  • Minghai Qin

There have been long-standing controversies and inconsistencies over the experimental setup and criteria for identifying the "winning ticket" in the literature. To reconcile these, we revisit the definition of the lottery ticket hypothesis, with comprehensive and more rigorous conditions. Under our new definition, we show concrete evidence to clarify whether the winning ticket exists across the major DNN architectures and/or applications. Through extensive experiments, we perform quantitative analysis of the correlations between winning tickets and various experimental factors, and empirically study the patterns of our observations. We find that the key training hyperparameters, such as learning rate and training epochs, as well as architecture characteristics such as capacity and residual connections, are all highly correlated with whether and when winning tickets can be identified. Based on our analysis, we summarize a guideline for parameter settings with regard to specific architecture characteristics, which we hope will catalyze research progress on the topic of the lottery ticket hypothesis. Our code is publicly available at: https://github.com/boone891214/sanity-check-LTH.

IJCAI Conference 2021 Conference Paper

Towards Fast and Accurate Multi-Person Pose Estimation on Mobile Devices

  • Xuan Shen
  • Geng Yuan
  • Wei Niu
  • Xiaolong Ma
  • Jiexiong Guan
  • Zhengang Li
  • Bin Ren
  • Yanzhi Wang

The rapid development of autonomous driving, abnormal behavior detection, and behavior recognition creates an increasing demand for multi-person pose estimation-based applications, especially on mobile platforms. However, to achieve high accuracy, state-of-the-art methods tend to have large model sizes and complex post-processing algorithms, which incur intense computation and long end-to-end latency. To solve this problem, we propose an architecture optimization and weight pruning framework to accelerate inference of multi-person pose estimation on mobile devices. With our optimization framework, we achieve up to 2.51X faster model inference speed with higher accuracy compared to representative lightweight multi-person pose estimators.

AAAI Conference 2021 Conference Paper

YOLObile: Real-Time Object Detection on Mobile Devices via Compression-Compilation Co-Design

  • Yuxuan Cai
  • Hongjia Li
  • Geng Yuan
  • Wei Niu
  • Yanyu Li
  • Xulong Tang
  • Bin Ren
  • Yanzhi Wang

The rapid development and wide utilization of object detection techniques have drawn attention to both the accuracy and speed of object detectors. However, current state-of-the-art object detection works are either accuracy-oriented, using a large model but leading to high latency, or speed-oriented, using a lightweight model but sacrificing accuracy. In this work, we propose the YOLObile framework, which achieves real-time object detection on mobile devices via compression-compilation co-design. A novel block-punched pruning scheme is proposed for any kernel size. To improve computational efficiency on mobile devices, a GPU-CPU collaborative scheme is adopted along with advanced compiler-assisted optimizations. Experimental results indicate that our pruning scheme achieves a 14× compression rate of YOLOv4 with 49.0 mAP. Under our YOLObile framework, we achieve 17 FPS inference speed using the GPU on a Samsung Galaxy S20. By incorporating our proposed GPU-CPU collaborative scheme, the inference speed is increased to 19.1 FPS, outperforming the original YOLOv4 by a 5× speedup. Source code is at: https://github.com/nightsnack/YOLObile.

AAAI Conference 2018 Conference Paper

Towards Ultra-High Performance and Energy Efficiency of Deep Learning Systems: An Algorithm-Hardware Co-Optimization Framework

  • Yanzhi Wang
  • Caiwen Ding
  • Zhe Li
  • Geng Yuan
  • Siyu Liao
  • Xiaolong Ma
  • Bo Yuan
  • Xuehai Qian

Hardware accelerations of deep learning systems have been extensively investigated in industry and academia. The aim of this paper is to achieve ultra-high energy efficiency and performance for hardware implementations of deep neural networks (DNNs). An algorithm-hardware co-optimization framework is developed, which is applicable to different DNN types, sizes, and application scenarios. The algorithm part adopts general block-circulant matrices to achieve a fine-grained tradeoff between accuracy and compression ratio. It applies to both fully-connected and convolutional layers and contains a mathematically rigorous proof of the effectiveness of the method. The proposed algorithm reduces computational complexity per layer from O(n^2) to O(n log n) and storage complexity from O(n^2) to O(n), for both training and inference. The hardware part consists of highly efficient Field Programmable Gate Array (FPGA)-based implementations using effective reconfiguration, batch processing, deep pipelining, resource re-using, and hierarchical control. Experimental results demonstrate that the proposed framework achieves at least a 152X speedup and a 71X energy efficiency gain compared with the IBM TrueNorth processor under the same test accuracy. It achieves at least a 31X energy efficiency gain compared with the reference FPGA-based work.
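
The O(n^2) to O(n log n) reduction comes from the fact that a circulant block can be multiplied with a vector via the FFT. Below is a NumPy sketch of block-circulant matrix-vector multiplication (illustration only, not the paper's FPGA implementation); block sizes and the test tensors are arbitrary.

```python
# Block-circulant matrix-vector multiplication via FFT (sketch).
import numpy as np

def circulant_matvec(first_col, x):
    """Multiply the circulant matrix whose first column is first_col by x in O(n log n)."""
    return np.real(np.fft.ifft(np.fft.fft(first_col) * np.fft.fft(x)))

def block_circulant_matvec(blocks, x, block_size):
    """blocks[i][j] holds the defining (first-column) vector of the (i, j) circulant block."""
    rows, cols = len(blocks), len(blocks[0])
    y = np.zeros(rows * block_size)
    for i in range(rows):
        for j in range(cols):
            xj = x[j * block_size:(j + 1) * block_size]
            y[i * block_size:(i + 1) * block_size] += circulant_matvec(blocks[i][j], xj)
    return y

# Check a single circulant block against its explicit dense form.
rng = np.random.default_rng(0)
c, v = rng.standard_normal(8), rng.standard_normal(8)
dense = np.array([np.roll(c, k) for k in range(8)]).T       # dense[i, j] = c[(i - j) mod 8]
print(np.allclose(dense @ v, circulant_matvec(c, v)))       # True

# Block-circulant layer of shape (16, 16) stored as 2x2 blocks of size 8.
blocks = [[rng.standard_normal(8) for _ in range(2)] for _ in range(2)]
print(block_circulant_matvec(blocks, rng.standard_normal(16), 8).shape)   # (16,)
```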