Arrow Research

Author name cluster

Song Han

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

25 papers
1 author row

Possible papers

25

EAAI Journal 2026 Journal Article

Cross-layer feature consistency and dual-transformer residual framework for underwater image enhancement

  • Xinbin Li
  • Lei Cheng
  • Song Han
  • Jing Yang
  • Hui Dang
  • Muge Li

Underwater imaging suffers from complex degradations (e.g., color casts, blur, and haze) due to light scattering in water, limiting its utility in engineering applications such as marine exploration and underwater robotics. To address this, we propose the Cross-layer Feature Consistency-guided Dual-Transformer Reconstruction Framework (CFC-DTRF). In terms of artificial intelligence contribution, this work introduces a novel multi-stage framework that leverages feature-consistency supervision to jointly constrain feature and pixel domains, effectively disentangling content and color degradations through dedicated transformers. The framework integrates two innovative modules: a Sliding-Window Content-Attention Transformer (SWCA-Transformer) for detail preservation and a Multi-Scale Color-Attention Transformer (MSCA-Transformer) for color correction, enhancing restoration fidelity with computational efficiency. For engineering applications, this method significantly improves underwater image quality for practical tasks like environmental monitoring and robotic navigation. Extensive experiments show that CFC-DTRF outperforms state-of-the-art methods in content preservation and color accuracy. The code of the proposed CFC-DTRF is available at https://github.com/ChengLeiYSU/CFC-DTRF.
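
For illustration, a minimal sketch of the joint feature- and pixel-domain supervision the abstract describes, in PyTorch; the shared `feat_extractor` and the weighting `lam` are assumptions standing in for the paper's cross-layer features, not the authors' implementation:

```python
import torch.nn.functional as F

def consistency_loss(pred, target, feat_extractor, lam=0.5):
    # Pixel-domain term: match the restored image to the reference directly.
    pixel_term = F.l1_loss(pred, target)
    # Feature-domain term: match the two images in an encoder's feature space
    # (a stand-in for the paper's cross-layer feature consistency).
    feat_term = F.l1_loss(feat_extractor(pred), feat_extractor(target))
    return pixel_term + lam * feat_term
```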

EAAI Journal 2025 Journal Article

A novel ensemble method based on residual convolutional neural network with attention module for transient stability assessment considering operational variability

  • Wensheng Liu
  • Song Han
  • Na Rong

Data-driven methods have been extensively applied in the field of power system transient stability assessment (TSA) owing to their robust capabilities to excavate valuable features. However, TSA methods still face significant challenges in predictive accuracy and generalization ability under variable operating conditions with fluctuating loads or power generation. To address this, a data-driven ensemble TSA method that integrates the convolutional block attention module (CBAM) with a residual network (ResNet) is proposed to enhance prediction accuracy. Meanwhile, the traditional cross-entropy loss function is replaced by the focal loss function, aiming to reduce the misclassification of unstable samples. Moreover, a rapid updating strategy integrating active learning and fine-tuning techniques is suggested. It can renew the classifier quickly with limited labeled samples and less time when the network topology changes substantially and makes the pre-trained TSA model unavailable, thus ensuring optimal performance on the new topology. Finally, case studies conducted on the New England 10-machine 39-bus system and the Western Electricity Coordinating Council (WECC) 29-machine 179-bus system validate the effectiveness and robustness of the proposed TSA method. The proposed method achieves accuracies of 99.56% on the 10-machine system and 99.47% on the 29-machine system, respectively, demonstrating its superiority.
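
The focal-loss substitution mentioned above is straightforward to reproduce; a minimal PyTorch sketch (the gamma and alpha values are common defaults, not necessarily the paper's settings):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # Standard cross-entropy, kept per-sample.
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)              # model's probability for the true class
    # (1 - pt)^gamma down-weights easy samples, focusing training on the
    # hard (e.g., misclassified unstable) cases.
    return (alpha * (1.0 - pt) ** gamma * ce).mean()

# Drop-in replacement for nn.CrossEntropyLoss() in the training loop:
# loss = focal_loss(model(batch), labels)   # logits: (N, 2) stable/unstable
```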

EAAI Journal 2025 Journal Article

Adversarial black-box attack and defense for convolutional neural network-based power quality disturbance classification

  • Xiudong Zhang
  • Congmei Jiang
  • Mingbiao Yu
  • Xiankui Wen
  • Jing Zhang
  • Na Rong
  • Song Han

Correctly identifying power quality disturbance (PQD) is crucial for the proper functioning of power systems. Deep learning (DL) techniques have been widely used for PQD classification due to their excellent performance. However, DL models are susceptible to adversarial attacks, posing a serious security threat to DL-based PQD classification systems. This issue has received limited attention in current research. In this study, we first utilize a convolutional neural network (CNN) to recognize various types of PQD signals. To evaluate model robustness, we introduce a black-box attack method for PQD classification based on the variance-tuning momentum iterative fast gradient sign method (VMI-FGSM). VMI-FGSM integrates a variance tuning method into the iterative process of the momentum iterative fast gradient sign method (MI-FGSM), thereby producing more transferable adversarial PQD signals. To defend against such attacks, we propose a perturbation removal defense based on a generative adversarial network (PRD-GAN). This approach is capable of removing perturbations from adversarial PQD signals before they are recognized by the target classification model. Experiments demonstrate that VMI-FGSM produces adversarial perturbations that are nearly identical to those of the advanced MI-FGSM, but its adversarial examples are significantly more effective at misleading the target CNN model. Furthermore, the proposed PRD-GAN effectively reconstructs adversarial PQD signals into clean forms under various black-box attack intensities and outperforms the multi-level denoising autoencoder (ML-DAE) in defense performance due to its superior reconstruction capability.
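
A simplified sketch of the variance-tuned momentum iteration that VMI-FGSM adds on top of MI-FGSM, assuming a differentiable PyTorch classifier `model` over PQD signals; the step sizes and sampling radius are illustrative, not the paper's configuration:

```python
import torch

def vmi_fgsm(model, x, y, eps=0.03, steps=10, mu=1.0, beta=1.5, n_samples=5):
    loss_fn = torch.nn.CrossEntropyLoss()
    alpha = eps / steps
    x_adv = x.clone().detach()
    g = torch.zeros_like(x)   # momentum accumulator
    v = torch.zeros_like(x)   # variance-tuning term
    for _ in range(steps):
        x_adv.requires_grad_(True)
        grad = torch.autograd.grad(loss_fn(model(x_adv), y), x_adv)[0]
        # Momentum update on the variance-corrected gradient (L1-normalized).
        g = mu * g + (grad + v) / (grad + v).abs().mean()
        # Variance tuning: average gradients at random neighbors, minus grad.
        acc = torch.zeros_like(x)
        for _ in range(n_samples):
            x_near = (x_adv + torch.empty_like(x).uniform_(-beta * eps, beta * eps)).detach()
            x_near.requires_grad_(True)
            acc += torch.autograd.grad(loss_fn(model(x_near), y), x_near)[0]
        v = acc / n_samples - grad
        # Signed step, then project back into the L_inf ball around x.
        x_adv = (x_adv + alpha * g.sign()).detach()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)
    return x_adv
```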

NeurIPS Conference 2025 Conference Paper

Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search

  • Yuxian Gu
  • Qinghao Hu
  • Haocheng Xi
  • Junyu Chen
  • Shang Yang
  • Song Han
  • Han Cai

We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks while delivering up to 53.6× generation throughput speedup and 6.1× prefilling speedup. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models, such as DeepSeek-V3-Small and Moonlight, despite their larger scale with 15B total and 2.2B activated parameters.
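
The core PostNAS trick of freezing the pre-trained MLPs while leaving attention blocks trainable reduces, in code, to toggling requires_grad; a sketch assuming PyTorch-style parameter names containing "mlp" (an assumption about the checkpoint's naming, to be adapted per model):

```python
import torch.nn as nn

def freeze_mlps(model: nn.Module, mlp_keyword: str = "mlp"):
    # Freeze MLP weights; only attention-side parameters keep gradients,
    # so architecture exploration trains attention blocks alone.
    for name, param in model.named_parameters():
        if mlp_keyword in name:
            param.requires_grad = False
    return [n for n, p in model.named_parameters() if p.requires_grad]
```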

NeurIPS Conference 2025 Conference Paper

Radial Attention: $\mathcal{O}(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation

  • Xingyang Li
  • Muyang Li
  • Tianle Cai
  • Haocheng Xi
  • Shuo Yang
  • Yujun Lin
  • Lvmin Zhang
  • Songlin Yang

Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as the spatial and temporal distance between tokens increases, akin to the physical decay of signals or waves over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism with $\mathcal{O}(n \log n)$ complexity that translates energy decay into exponentially decaying compute density, which is significantly more efficient than standard $\mathcal{O}(n^2)$ dense attention and more expressive than linear attention. Specifically, Radial Attention employs a simple, static attention mask where each token attends to spatially nearby tokens, with the attention window size shrinking with temporal distance. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9× speedup over the original dense attention. With minimal tuning, it enables video generation up to 4× longer while reducing training costs by up to 4.4× compared to direct fine-tuning and accelerating inference by up to 3.7× compared to dense attention inference. Code is released at https://github.com/mit-han-lab/radial-attention.
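
A hedged sketch of a static mask in this spirit: a full spatial window within a frame, halved as temporal distance grows. The exact decay schedule below is an assumption for illustration, not the paper's mask:

```python
import torch

def radial_mask(n_frames, tokens_per_frame, base_window=64, decay=1):
    # Frame index and within-frame spatial index for every token.
    n = n_frames * tokens_per_frame
    frame = torch.arange(n) // tokens_per_frame
    pos = torch.arange(n) % tokens_per_frame
    dt = (frame[:, None] - frame[None, :]).abs()   # temporal distance
    dx = (pos[:, None] - pos[None, :]).abs()       # spatial distance
    # Window halves for every `decay` frames of temporal distance
    # (illustrative schedule; the clamp avoids 2**k overflow for long videos).
    window = base_window // (2 ** (dt // decay).clamp(max=30))
    return dx <= window                            # True = may attend
```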

NeurIPS Conference 2025 Conference Paper

Scaling RL to Long Videos

  • Yukang Chen
  • Wei Huang
  • Baifeng Shi
  • Qinghao Hu
  • Hanrong Ye
  • Ligeng Zhu
  • Zhijian Liu
  • Pavlo Molchanov

We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 104K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In our experiments, LongVILA-R1-7B achieves strong performance on video benchmarks, reaching 65.1% and 71.1% accuracy on VideoMME without and with subtitles, respectively, and consistently outperforming LongVILA-7B across multiple benchmarks. Moreover, LongVILA-R1-7B supports processing up to 8,192 video frames per video and configurable FPS settings. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. In addition, we publicly release our training system, which supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames). Code and models are available at https://github.com/NVlabs/Long-RL.

NeurIPS Conference 2025 Conference Paper

Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

  • Shuo Yang
  • Haocheng Xi
  • Yilong Zhao
  • Muyang Li
  • Jintao Zhang
  • Han Cai
  • Yujun Lin
  • Xiuyu Li

Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing only critical tokens, sparse attention reduces computational costs and offers a promising acceleration approach. However, we identify that existing methods fail to approach optimal generation quality under the same computation budget for two reasons: (1) Inaccurate critical token identification: current methods cluster tokens based on position rather than semantics, leading to imprecise aggregated representations. (2) Excessive computation waste: critical tokens are scattered among non-critical ones, leading to wasted computation on GPUs, which are optimized for processing contiguous tokens. In this paper, we propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste, achieving a Pareto frontier trade-off between generation quality and efficiency. The core of SVG2 is semantic-aware permutation, which clusters and reorders tokens based on semantic similarity using k-means. This approach ensures both a precise cluster representation, improving identification accuracy, and a densified layout of critical tokens, enabling efficient computation without padding. Additionally, SVG2 integrates Top-p dynamic budget control and customized kernel implementations, achieving up to $2.30\times$ and $1.89\times$ speedup while maintaining a PSNR of up to $30$ and $26$ on HunyuanVideo and Wan 2.1, respectively. Our code is open-sourced at https://github.com/svg-project/Sparse-VideoGen.
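
The clustering-and-reordering step can be sketched with off-the-shelf k-means; this is a simplified view (sklearn standing in for whatever SVG2 actually uses, with an illustrative cluster count):

```python
import numpy as np
from sklearn.cluster import KMeans

def semantic_permutation(tokens, n_clusters=32):
    # tokens: (n_tokens, dim) array of token features.
    labels = KMeans(n_clusters=n_clusters).fit_predict(tokens)
    perm = np.argsort(labels, kind="stable")   # lay clusters out contiguously
    inverse = np.argsort(perm)                 # undo the reorder afterwards
    return perm, inverse

# x_perm = tokens[perm]          # critical tokens now form dense blocks
# ... run block-sparse attention on x_perm ...
# out = y_perm[inverse]          # restore the original token order
```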

NeurIPS Conference 2025 Conference Paper

Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning

  • Chaofan Lin
  • Jiaming Tang
  • Shuo Yang
  • Hanshuo Wang
  • Tian Tang
  • Boyu Tian
  • Ion Stoica
  • Song Han

Leveraging attention sparsity to accelerate long-context large language models (LLMs) has recently become important. However, most existing sparse attention algorithms use a fixed token budget in their computations. This static decision raises critical issues in real-world deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we reveal the key insight that applying the idea of top-$p$ sampling (a.k.a. nucleus sampling) to sparse attention enables efficient and adaptive budget decisions. Based on this, we propose Twilight, a framework that endows any existing sparse attention algorithm with adaptive budget decisions without sacrificing accuracy. Empirical results show that Twilight can adaptively prune up to 98% of tokens with nearly no accuracy loss in both mid- and long-context scenarios, leading to a $1.4\times$ speedup over state-of-the-art sparse attention mechanisms.
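
Top-$p$ selection over post-softmax attention weights is the mechanism in question; a minimal PyTorch sketch (not Twilight's actual kernel):

```python
import torch

def top_p_key_mask(attn_probs, p=0.95):
    # attn_probs: (..., n_keys) post-softmax attention weights per query.
    sorted_probs, idx = attn_probs.sort(dim=-1, descending=True)
    cum = sorted_probs.cumsum(dim=-1)
    # Keep keys until cumulative mass reaches p (the crossing key included),
    # so the budget adapts per query instead of being a fixed count.
    keep_sorted = (cum - sorted_probs) < p
    mask = torch.zeros_like(attn_probs, dtype=torch.bool)
    mask.scatter_(-1, idx, keep_sorted)
    return mask   # True = this key survives pruning for that query
```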

TMLR Journal 2025 Journal Article

Wolf: Dense Video Captioning with a World Summarization Framework

  • Boyi Li
  • Ligeng Zhu
  • Ran Tian
  • Shuhan Tan
  • Yuxiao Chen
  • Yao Lu
  • Yin Cui
  • Sushant Veer

We propose Wolf, a WOrLd summarization Framework for accurate video captioning. Wolf is an automated captioning framework that adopts a mixture-of-experts approach, leveraging complementary strengths of Vision Language Models (VLMs). By utilizing both image and video models, our framework captures different levels of information and summarizes them efficiently. Our approach can be applied to enhance video understanding, auto-labeling, and captioning. To evaluate caption quality, we introduce CapScore, an LLM-based metric to assess the similarity and quality of generated captions compared to the ground truth captions. We further build four human-annotated datasets in three domains: autonomous driving, general scenes, and robotics, to facilitate comprehensive comparisons. We show that Wolf achieves superior captioning performance compared to state-of-the-art approaches from the research community (VILA1.5, CogAgent) and commercial solutions (Gemini-Pro-1.5, GPT-4V). For instance, in comparison with GPT-4V, Wolf improves CapScore (caption quality) by 55.6% and CapScore (caption similarity) by 77.4% on challenging driving videos. Finally, we establish a benchmark for video captioning and introduce a leaderboard, aiming to accelerate advancements in video understanding, captioning, and data alignment.

NeurIPS Conference 2025 Conference Paper

WorldModelBench: Judging Video Generation Models As World Models

  • Dacheng Li
  • Yunhao Fang
  • Yukang Chen
  • Shuo Yang
  • Shiyi Cao
  • Justin Wong
  • Michael Luo
  • Xiaolong Wang

Video generation models have rapidly progressed, positioning themselves as video world models capable of supporting decision-making applications like robotics and autonomous driving. However, current benchmarks fail to rigorously evaluate these claims, focusing only on general video quality and ignoring factors important to world models such as physics adherence. To bridge this gap, we propose WorldModelBench, a benchmark designed to evaluate the world modeling capabilities of video generation models in application-driven domains. WorldModelBench offers two key advantages: (1) Sensitivity to nuanced world modeling violations: by incorporating instruction-following and physics-adherence dimensions, WorldModelBench detects subtle violations, such as irregular changes in object size that breach the mass conservation law, issues overlooked by prior benchmarks. (2) Alignment with large-scale human preferences: we crowd-source 67K human labels to accurately measure 14 frontier models. Using our high-quality human labels, we further fine-tune an accurate judger to automate the evaluation procedure; with only 2B parameters, it achieves 9.9% lower error in predicting world modeling violations than GPT-4o. In addition, we demonstrate that training to align with human annotations by maximizing the rewards from the judger noticeably improves world modeling capability. The dataset is hosted on HuggingFace at https://huggingface.co/datasets/Efficient-Large-Model/worldmodelbench. The code to run evaluation is available at https://github.com/WorldModelBench-Team/WorldModelBench.

NeurIPS Conference 2024 Conference Paper

BitDelta: Your Fine-Tune May Only Be Worth One Bit

  • James Liu
  • Guangxuan Xiao
  • Kai Li
  • Jason D. Lee
  • Song Han
  • Tri Dao
  • Tianle Cai

Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. Given the higher computational demand of pre-training, it is intuitive to assume that fine-tuning adds less new information to the model, and is thus more compressible. We explore this assumption by decomposing the weights of fine-tuned models into their pre-trained components and an additional delta. We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance. This interesting finding not only highlights the potential redundancy of information added during fine-tuning, but also has significant implications for the multi-tenant serving and multi-tenant storage of fine-tuned models. By enabling the use of a single high-precision base model accompanied by multiple 1-bit deltas, BitDelta dramatically reduces GPU memory requirements by more than 10x, thus reducing per-user generation latency by more than 10x in multi-tenant settings. We validate BitDelta through experiments across Llama-2, Mistral and MPT model families, and on models up to 70B parameters, showcasing minimal performance degradation in all tested settings.
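
The core compression step reduces to a sign tensor plus one scale per weight tensor; a minimal sketch using the mean-|delta| initialization (BitDelta additionally calibrates the scales by distillation against the fine-tuned model, omitted here):

```python
import torch

def bitdelta_compress(w_base, w_finetuned):
    delta = w_finetuned - w_base
    scale = delta.abs().mean()      # one floating-point scale per weight tensor
    sign = delta.sign()             # storable as 1 bit per element
    return sign, scale

def bitdelta_decompress(w_base, sign, scale):
    # One shared base plus a per-tenant 1-bit delta reconstructs each model.
    return w_base + scale * sign
```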

EAAI Journal 2023 Journal Article

Frequency stability prediction of renewable energy penetrated power systems using CoAtNet and SHAP values

  • Peili Liu
  • Song Han
  • Na Rong

As the complexity of power systems increases, traditional model-driven methods for online frequency stability prediction (FSP) encounter constraints in both accuracy and efficiency. To enhance the accuracy and efficiency of FSP, a data-driven method using CoAtNet and SHAP values is proposed. By leveraging the combination of convolution and attention mechanisms, CoAtNet addresses the limitation of traditional deep learning approaches, which may not extract data features comprehensively. Moreover, selecting all features as input to a deep-learning model may cause a substantial computational burden, making it impractical for CoAtNet to perform FSP on large-scale power systems. To address this, this paper develops a SHAP values-based feature selection method to select the most effective features as input. This process greatly reduces the numerical complexity while maintaining high prediction performance. Additionally, the marginally stable situation of the system frequency is ignored by most researchers. A frequency security index identifying marginally stable situations is therefore employed to generate the data labels, which are classified as “absolute security”, “relative security”, and “insecurity”. Finally, in comparison simulations, the proposed model outperforms other models with accuracies of 98.80% on the modified IEEE 39-bus system and 99.04% on the modified ACTIVSg500 system.
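
The SHAP-based feature selection step can be sketched with the shap library; the model type, background-sample choice, and top_k here are assumptions, and the returned array layout varies across shap versions:

```python
import numpy as np
import shap  # pip install shap

def select_features_by_shap(model, background, X, top_k=20):
    # Explain predictions on X against a small background sample.
    explainer = shap.DeepExplainer(model, background)
    sv = explainer.shap_values(X)              # per-class arrays (or one array)
    sv = np.stack(sv) if isinstance(sv, list) else sv[np.newaxis]
    importance = np.abs(sv).mean(axis=(0, 1))  # average over classes and samples
    return np.argsort(importance)[::-1][:top_k]  # indices of the top features
```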

NeurIPS Conference 2022 Conference Paper

Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models

  • Muyang Li
  • Ji Lin
  • Chenlin Meng
  • Stefano Ermon
  • Song Han
  • Jun-Yan Zhu

During image editing, existing deep generative models tend to re-synthesize the entire output from scratch, including the unedited regions. This leads to a significant waste of computation, especially for minor editing operations. In this work, we present Spatially Sparse Inference (SSI), a general-purpose technique that selectively performs computation for edited regions and accelerates various generative models, including both conditional GANs and diffusion models. Our key observation is that users tend to make gradual changes to the input image. This motivates us to cache and reuse the feature maps of the original image. Given an edited image, we sparsely apply the convolutional filters to the edited regions while reusing the cached features for the unedited regions. Based on our algorithm, we further propose Sparse Incremental Generative Engine (SIGE) to convert the computation reduction to latency reduction on off-the-shelf hardware. With 1.2%-area edited regions, our method reduces the computation of DDIM by $7.5\times$ and GauGAN by $18\times$ while preserving the visual fidelity. With SIGE, we accelerate the inference time of DDIM by $3.0\times$ on RTX 3090 and $6.6\times$ on Apple M1 Pro CPU, and GauGAN by $4.2\times$ on RTX 3090 and $14\times$ on Apple M1 Pro CPU.
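
A simplified sketch of the cache-and-reuse idea for a single stride-1, size-preserving convolution: recompute only the edited bounding box and splice it into the cached output. Exact equivalence needs halo pixels at the crop border, which SIGE handles and this sketch deliberately ignores:

```python
import torch

def sparse_conv_forward(conv, x_edited, cached_out, edit_mask):
    # Bounding box of the edited pixels (edit_mask: (H, W) boolean).
    ys, xs = edit_mask.nonzero(as_tuple=True)
    y0, y1 = ys.min().item(), ys.max().item() + 1
    x0, x1 = xs.min().item(), xs.max().item() + 1
    # Recompute the convolution only inside the box (conv must be
    # stride-1 and size-preserving for the shapes to line up).
    new_patch = conv(x_edited[:, :, y0:y1, x0:x1])
    out = cached_out.clone()          # reuse cached features everywhere else
    out[:, :, y0:y1, x0:x1] = new_patch
    return out
```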

NeurIPS Conference 2022 Conference Paper

On-Device Training Under 256KB Memory

  • Ji Lin
  • Ligeng Zhu
  • Wei-Ming Chen
  • Wei-Chen Wang
  • Chuang Gan
  • Song Han

On-device training enables a model to adapt to new data collected from sensors by fine-tuning a pre-trained model. Users can benefit from customized AI models without having to transfer the data to the cloud, protecting privacy. However, the training memory consumption is prohibitive for IoT devices with tiny memory resources. We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory. On-device training faces two unique challenges: (1) the quantized graphs of neural networks are hard to optimize due to low bit-precision and the lack of normalization; (2) the limited hardware resources (memory and computation) do not allow full backpropagation. To cope with the optimization difficulty, we propose Quantization-Aware Scaling to calibrate the gradient scales and stabilize 8-bit quantized training. To reduce the memory footprint, we propose Sparse Update to skip the gradient computation of less important layers and sub-tensors. The algorithm innovation is implemented by a lightweight training system, Tiny Training Engine, which prunes the backward computation graph to support sparse updates and offloads the runtime auto-differentiation to compile time. Our framework is the first practical solution for on-device transfer learning of visual recognition on tiny IoT devices (e.g., a microcontroller with only 256KB SRAM), using less than 1/1000 of the memory of PyTorch and TensorFlow while matching the accuracy. Our study enables IoT devices not only to perform inference but also to continuously adapt to new data for on-device lifelong learning. A video demo can be found here: https://youtu.be/XaDCO8YtmBw.
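
Sparse Update can be approximated in a few lines by freezing everything and re-enabling gradients only for biases and the last few layers; the actual method selects layers and sub-tensors by measured importance, so the depth heuristic here is an assumption:

```python
import torch.nn as nn

def sparse_update(model: nn.Module, n_trainable_layers: int = 2):
    for p in model.parameters():
        p.requires_grad = False
    layers = [m for m in model.modules() if isinstance(m, (nn.Conv2d, nn.Linear))]
    for m in layers:
        if m.bias is not None:
            m.bias.requires_grad = True        # bias updates are nearly free
    for m in layers[-n_trainable_layers:]:
        m.weight.requires_grad = True          # full update for final layers
```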

TCS Journal 2021 Journal Article

A fast algorithm for source-wise round-trip spanners

  • Chun Jiang Zhu
  • Song Han
  • Kam-Yiu Lam

In this paper, we study fast constructions of source-wise round-trip spanners in weighted directed graphs. For a source vertex set $S \subseteq V$ in a graph $G(V, E)$, an $S$-sourcewise round-trip spanner of $G$ of stretch $k$ is a subgraph $H$ of $G$ such that for every pair of vertices $(u, v) \in S \times V$, their round-trip distance in $H$ is at most $k$ times their round-trip distance in $G$. We show that for a graph $G(V, E)$ with $n$ vertices and $m$ edges, an $s$-sized source vertex set $S \subseteq V$, and an integer $k > 1$, there exists an algorithm that in time $O(ms^{1/k}\log^5 n)$ constructs an $S$-sourcewise round-trip spanner of stretch $O(k \log n)$ with $O(ns^{1/k}\log^2 n)$ edges with high probability. Compared to the fast algorithms for constructing all-pairs round-trip spanners [26, 12], our algorithm improves the running time and the number of edges in the spanner when $k$ is super-constant. Compared with the existing algorithm for constructing source-wise round-trip spanners [36], our algorithm significantly improves their construction time $\Omega(\min\{ms, n^{\omega}\})$ (where $\omega \in [2, 2.373)$ is the matrix multiplication exponent) to nearly linear $O(ms^{1/k}\log^5 n)$, at the expense of an extra $O(\log n)$ factor in the stretch. As an important building block of the algorithm, we develop a graph partitioning algorithm to partition $G$ into clusters of bounded radius, and prove that for every $(u, v) \in S \times V$ at small round-trip distance, the probability of separating them into different clusters is small. The algorithm takes the size of $S$ as input and does not need knowledge of $S$ itself. With this algorithm and a reachability vertex-size estimation algorithm, we show that the recursive algorithm for constructing standard round-trip spanners [26] can be adapted to the source-wise setting. We rigorously prove the correctness and computational complexity of the adapted algorithms. Finally, we show how to remove the dependence on the edge weight in the source-wise case.
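
For concreteness, the round-trip metric the spanner must preserve, plus a brute-force stretch check, with networkx (a verification aid, not the paper's construction algorithm):

```python
import networkx as nx

def round_trip_distance(G, u, v):
    # d(u <-> v) = d(u -> v) + d(v -> u) in a weighted digraph.
    return (nx.dijkstra_path_length(G, u, v, weight="weight")
            + nx.dijkstra_path_length(G, v, u, weight="weight"))

def check_stretch(G, H, S, k):
    # An S-sourcewise round-trip spanner H of stretch k must satisfy this
    # for every pair in S x V (assumes all pairs are mutually reachable).
    return all(round_trip_distance(H, u, v) <= k * round_trip_distance(G, u, v)
               for u in S for v in G.nodes if v != u)
```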

NeurIPS Conference 2021 Conference Paper

Delayed Gradient Averaging: Tolerate the Communication Latency for Federated Learning

  • Ligeng Zhu
  • Hongzhou Lin
  • Yao Lu
  • Yujun Lin
  • Song Han

Federated Learning is an emerging direction in distributed machine learning that enables jointly training a model without sharing the data. Since the data is distributed across many edge devices through wireless / long-distance connections, federated learning suffers from inevitably high communication latency. However, latency issues are underexamined in the current literature [15], and existing approaches such as FedAvg [27] become less efficient as the latency increases. To overcome this problem, we propose Delayed Gradient Averaging (DGA), which delays the averaging step to improve efficiency and allows local computation in parallel with communication. We theoretically prove that DGA attains a similar convergence rate to FedAvg, and empirically show that our algorithm can tolerate high network latency without compromising accuracy. Specifically, we benchmark the training speed on various vision (CIFAR, ImageNet) and language tasks (Shakespeare), with both IID and non-IID partitions, and show that DGA brings a 2.55$\times$ to 4.07$\times$ speedup. Moreover, we built a 16-node Raspberry Pi cluster and show that DGA can consistently speed up real-world federated learning applications.
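
A sketch of the delayed-averaging rule as the abstract describes it: keep stepping on local gradients, and when the average from `delay` steps ago arrives, swap the stale local contribution for the global one. The `all_reduce_avg` callable stands in for an asynchronous all-reduce, and the exact correction rule here is an illustrative reading of delayed averaging, not the paper's pseudocode:

```python
from collections import deque

def dga_step(params, local_grads, inflight, all_reduce_avg, lr=0.1, delay=4):
    # Apply the local gradient immediately; never wait on the network.
    for p, g in zip(params, local_grads):
        p.data -= lr * g
    inflight.append(local_grads)            # this gradient is now "in transit"
    # Once the average of the gradient sent `delay` steps ago arrives,
    # replace our stale local contribution with the global average.
    if len(inflight) > delay:
        stale = inflight.popleft()
        avg = all_reduce_avg(stale)         # stand-in for an async all-reduce
        for p, g_loc, g_avg in zip(params, stale, avg):
            p.data += lr * g_loc - lr * g_avg

# inflight = deque()   # one queue per worker, reused across steps
```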

NeurIPS Conference 2021 Conference Paper

Memory-efficient Patch-based Inference for Tiny Deep Learning

  • Ji Lin
  • Wei-Ming Chen
  • Han Cai
  • Chuang Gan
  • Song Han

Tiny deep learning on microcontroller units (MCUs) is challenging due to the limited memory size. We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs: the first several blocks have an order of magnitude larger memory usage than the rest of the network. To alleviate this issue, we propose a generic patch-by-patch inference scheduling, which operates only on a small spatial region of the feature map and significantly cuts down the peak memory. However, naive implementation brings overlapping patches and computation overhead. We further propose receptive field redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead. Manually redistributing the receptive field is difficult. We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2. Patch-based inference effectively reduces the peak memory usage of existing networks by 4-8×. Co-designed with neural networks, MCUNetV2 sets a record ImageNet accuracy on MCU (71.8%) and achieves >90% accuracy on the visual wake words dataset under only 32kB SRAM. MCUNetV2 also unblocks object detection on tiny devices, achieving 16.9% higher mAP on Pascal VOC compared to the state-of-the-art result. Our study largely addressed the memory bottleneck in tinyML and paved the way for various vision applications beyond image classification.
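
Patch-by-patch execution of the memory-heavy initial stage can be sketched directly; this assumes `stage` preserves spatial size and channel count and that the input divides evenly into patches, all simplifications:

```python
import torch

def patch_based_stage(stage, x, n_patches=2, overlap=8):
    _, _, H, W = x.shape
    hs, ws = H // n_patches, W // n_patches
    out = torch.empty_like(x)
    for i in range(n_patches):
        for j in range(n_patches):
            y0, x0 = i * hs, j * ws
            cy0, cx0 = max(y0 - overlap, 0), max(x0 - overlap, 0)
            # Crop with a halo so each patch sees its receptive field.
            crop = x[:, :, cy0:y0 + hs + overlap, cx0:x0 + ws + overlap]
            res = stage(crop)   # peak memory scales with the patch, not the image
            out[:, :, y0:y0 + hs, x0:x0 + ws] = \
                res[:, :, y0 - cy0:y0 - cy0 + hs, x0 - cx0:x0 - cx0 + ws]
    return out   # feed into the later (memory-light) blocks as one tensor
```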

NeurIPS Conference 2020 Conference Paper

Differentiable Augmentation for Data-Efficient GAN Training

  • Shengyu Zhao
  • Zhijian Liu
  • Ji Lin
  • Jun-Yan Zhu
  • Song Han

The performance of generative adversarial networks (GANs) heavily deteriorates given a limited amount of training data. This is mainly because the discriminator is memorizing the exact training set. To combat it, we propose Differentiable Augmentation (DiffAugment), a simple method that improves the data efficiency of GANs by imposing various types of differentiable augmentations on both real and fake samples. Previous attempts to directly augment the training data manipulate the distribution of real images, yielding little benefit; DiffAugment enables us to adopt the differentiable augmentation for the generated samples, effectively stabilizes training, and leads to better convergence. Experiments demonstrate consistent gains of our method over a variety of GAN architectures and loss functions for both unconditional and class-conditional generation. With DiffAugment, we achieve a state-of-the-art FID of 6.80 with an IS of 100.8 on ImageNet 128×128 and 2-4× reductions of FID given 1,000 images on FFHQ and LSUN. Furthermore, with only 20% training data, we can match the top performance on CIFAR-10 and CIFAR-100. Finally, our method can generate high-fidelity images using only 100 images without pre-training, while being on par with existing transfer learning algorithms. Code is available at https://github.com/mit-han-lab/data-efficient-gans.
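
The mechanism is simply that the same differentiable transform is applied to real and fake samples inside both losses, so generator gradients flow through the augmentation; a sketch with two illustrative augmentations (not DiffAugment's exact policy set):

```python
import torch

def diff_augment(x, brightness=0.3, shift_frac=0.125):
    # Per-sample brightness jitter (differentiable: plain arithmetic).
    x = x + (torch.rand(x.size(0), 1, 1, 1, device=x.device) - 0.5) * brightness
    # Random wrap-around translation (differentiable: pure indexing).
    pad = max(int(x.size(2) * shift_frac), 1)
    dx, dy = torch.randint(-pad, pad + 1, (2,))
    return torch.roll(x, shifts=(int(dx), int(dy)), dims=(2, 3))

# The same transform is applied to real and fake samples in BOTH losses:
# d_loss = d_criterion(D(diff_augment(real)), D(diff_augment(G(z)).detach()))
# g_loss = g_criterion(D(diff_augment(G(z))))
```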

NeurIPS Conference 2020 Conference Paper

MCUNet: Tiny Deep Learning on IoT Devices

  • Ji Lin
  • Wei-Ming Chen
  • Yujun Lin
  • John Cohn
  • Chuang Gan
  • Song Han

Machine learning on tiny IoT devices based on microcontroller units (MCUs) is appealing but challenging: the memory of microcontrollers is 2-3 orders of magnitude smaller even than mobile phones. We propose MCUNet, a framework that jointly designs the efficient neural architecture (TinyNAS) and the lightweight inference engine (TinyEngine), enabling ImageNet-scale inference on microcontrollers. TinyNAS adopts a two-stage neural architecture search approach that first optimizes the search space to fit the resource constraints, then specializes the network architecture in the optimized search space. TinyNAS can automatically handle diverse constraints (i.e., device, latency, energy, memory) under low search costs. TinyNAS is co-designed with TinyEngine, a memory-efficient inference library to expand the search space and fit a larger model. TinyEngine adapts the memory scheduling according to the overall network topology rather than layer-wise optimization, reducing the memory usage by 3.4× and accelerating the inference by 1.7-3.3× compared to TF-Lite Micro [3] and CMSIS-NN [28]. MCUNet is the first to achieve >70% ImageNet top-1 accuracy on an off-the-shelf commercial microcontroller, using 3.5× less SRAM and 5.7× less Flash compared to quantized MobileNetV2 and ResNet-18. On visual & audio wake words tasks, MCUNet achieves state-of-the-art accuracy and runs 2.4-3.4× faster than MobileNetV2 and ProxylessNAS-based solutions with 3.7-4.1× smaller peak SRAM. Our study suggests that the era of always-on tiny machine learning on IoT devices has arrived.

NeurIPS Conference 2020 Conference Paper

TinyTL: Reduce Memory, Not Parameters for Efficient On-Device Learning

  • Han Cai
  • Chuang Gan
  • Ligeng Zhu
  • Song Han

Efficient on-device learning requires a small memory footprint at training time to fit the tight memory constraint. Existing work solves this problem by reducing the number of trainable parameters. However, this does not directly translate to memory saving, since the major bottleneck is the activations, not the parameters. In this work, we present Tiny-Transfer-Learning (TinyTL) for memory-efficient on-device learning. TinyTL freezes the weights while learning only the memory-efficient bias modules, so there is no need to store the intermediate activations. To maintain the adaptation capacity, we introduce a new memory-efficient bias module, the lite residual module, to refine the feature extractor by learning small residual feature maps, adding only 3.8% memory overhead. Extensive experiments show that TinyTL significantly saves memory (up to 6.5×) with little accuracy loss compared to fine-tuning the full network. Compared to fine-tuning the last layer, TinyTL provides significant accuracy improvements (up to 33.8%) with little memory overhead. Furthermore, combined with feature extractor adaptation, TinyTL provides 7.5-12.9× memory saving without sacrificing accuracy compared to fine-tuning the full Inception-V3. Code is released at https://github.com/mit-han-lab/tinyML/tree/master/tinyTL.
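
The bias-only update at TinyTL's core is a one-liner in PyTorch: frozen weights mean the activations needed for weight gradients never have to be stored. This sketch omits the lite residual modules and assumes standard ".bias" parameter naming:

```python
import torch.nn as nn

def tinytl_freeze(model: nn.Module):
    # Weight gradients need the layer inputs (activations); bias gradients
    # do not. Training biases only is what removes the activation storage.
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith(".bias")
```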

AAAI Conference 2019 Conference Paper

Communication-Optimal Distributed Dynamic Graph Clustering

  • Chun Jiang Zhu
  • Tan Zhu
  • Kam-Yiu Lam
  • Song Han
  • Jinbo Bi

We consider the problem of clustering graph nodes over large-scale dynamic graphs, such as citation networks, images and web networks, when graph updates such as node/edge insertions/deletions are observed distributively. We propose communication-efficient algorithms for two well-established communication models namely the message passing and the blackboard models. Given a graph with n nodes that is observed at s remote sites over time [1, t], the two proposed algorithms have communication costs Õ(ns) and Õ(n + s) (Õ hides a polylogarithmic factor), almost matching their lower bounds, Ω(ns) and Ω(n + s), respectively, in the message passing and the blackboard models. More importantly, we prove that at each time point in [1, t] our algorithms generate clustering quality nearly as good as that of centralizing all updates up to that time and then applying a standard centralized clustering algorithm. We conducted extensive experiments on both synthetic and real-life datasets which confirmed the communication efficiency of our approach over baseline algorithms while achieving comparable clustering results.

NeurIPS Conference 2019 Conference Paper

Deep Leakage from Gradients

  • Ligeng Zhu
  • Zhijian Liu
  • Song Han

Passing gradients is a widely used scheme in modern multi-node learning systems (e.g., distributed training, collaborative learning). For a long time, people believed that gradients are safe to share, i.e., that the training data will not be leaked by gradient sharing. However, in this paper we show that private training data can be obtained from the publicly shared gradients. The leakage takes only a few gradient steps to process and recovers the original training data rather than look-alike alternatives. We name this leakage deep leakage from gradients and practically validate the effectiveness of our algorithm on both computer vision and natural language processing tasks. We empirically show that our attack is much stronger than previous approaches, and thereby raise awareness of the need to rethink gradient safety. We also discuss several possible strategies to defend against this deep leakage.
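
The attack itself is a small optimization loop: fit dummy data and labels so their gradients match the shared ones; a sketch close to the described recipe (L-BFGS and the iteration count are typical choices, but illustrative here):

```python
import torch

def deep_leakage(model, shared_grads, x_shape, n_classes, iters=100):
    dummy_x = torch.randn(x_shape, requires_grad=True)
    dummy_y = torch.randn(x_shape[0], n_classes, requires_grad=True)
    opt = torch.optim.LBFGS([dummy_x, dummy_y])

    def closure():
        opt.zero_grad()
        log_probs = model(dummy_x).log_softmax(dim=-1)
        loss = -(dummy_y.softmax(dim=-1) * log_probs).sum(dim=-1).mean()
        grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
        # Match the dummy gradients to the victim's shared gradients.
        match = sum(((g - s) ** 2).sum() for g, s in zip(grads, shared_grads))
        match.backward()
        return match

    for _ in range(iters):
        opt.step(closure)
    return dummy_x.detach(), dummy_y.softmax(dim=-1).detach()
```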

NeurIPS Conference 2019 Conference Paper

Park: An Open Platform for Learning-Augmented Computer Systems

  • Hongzi Mao
  • Parimarjan Negi
  • Akshay Narayan
  • Hanrui Wang
  • Jiacheng Yang
  • Haonan Wang
  • Ryan Marcus
  • ravichandra addanki

We present Park, a platform for researchers to experiment with Reinforcement Learning (RL) for computer systems. Using RL to improve the performance of systems has great potential, but it also differs in many ways from, for example, using RL for games. Thus, in this work we first discuss the unique challenges of RL for systems, and then propose Park, an open, extensible platform that makes it easier for ML researchers to work on systems problems. Currently, Park consists of 12 real-world system-centric optimization problems with one common, easy-to-use interface. Finally, we present the performance of existing RL approaches over these 12 problems and outline potential areas of future work.

NeurIPS Conference 2019 Conference Paper

Point-Voxel CNN for Efficient 3D Deep Learning

  • Zhijian Liu
  • Haotian Tang
  • Yujun Lin
  • Song Han

We present Point-Voxel CNN (PVCNN) for efficient, fast 3D deep learning. Previous work processes 3D data using either voxel-based or point-based NN models. However, both approaches are computationally inefficient. The computation cost and memory footprint of voxel-based models grow cubically with the input resolution, making it memory-prohibitive to scale up the resolution. As for point-based networks, up to 80% of the time is wasted on dealing with the sparse data, which have rather poor memory locality, not on the actual feature extraction. In this paper, we propose PVCNN, which represents the 3D input data in points to reduce the memory consumption, while performing the convolutions in voxels to reduce the irregular, sparse data access and improve locality. Our PVCNN model is both memory- and computation-efficient. Evaluated on semantic and part segmentation datasets, it achieves much higher accuracy than the voxel-based baseline with 10× GPU memory reduction; it also outperforms the state-of-the-art point-based models with a 7× measured speedup on average. Remarkably, the narrower version of PVCNN achieves a 2× speedup over PointNet (an extremely efficient model) on part and scene segmentation benchmarks with much higher accuracy. We validate the general effectiveness of PVCNN on 3D object detection: by replacing the primitives in Frustum PointNet with PVConv, it outperforms Frustum PointNet++ by 2.4% mAP on average with a 1.5× measured speedup and GPU memory reduction.
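
A toy version of the point-voxel idea: a cheap per-point MLP branch plus a regular 3D convolution on a coarse grid, fused by addition. Sum-pooling voxelization and nearest-voxel devoxelization here are simplifications of the paper's averaging and trilinear interpolation:

```python
import torch
import torch.nn as nn

class SimplePVConv(nn.Module):
    def __init__(self, c_in, c_out, r=16):
        super().__init__()
        self.r = r
        self.point_mlp = nn.Conv1d(c_in, c_out, 1)           # per-point MLP
        self.voxel_conv = nn.Conv3d(c_in, c_out, 3, padding=1)

    def forward(self, feats, coords):
        # feats: (B, C, N); coords: (B, N, 3), normalized to [0, 1).
        B, C, N = feats.shape
        idx = (coords * self.r).long().clamp(0, self.r - 1)  # voxel per point
        flat = idx[..., 0] * self.r * self.r + idx[..., 1] * self.r + idx[..., 2]
        grid = feats.new_zeros(B, C, self.r ** 3)
        grid.scatter_add_(2, flat.unsqueeze(1).expand(-1, C, -1), feats)
        grid = grid.view(B, C, self.r, self.r, self.r)       # voxelize (sum)
        vox = self.voxel_conv(grid).view(B, -1, self.r ** 3) # regular 3D conv
        devox = vox.gather(2, flat.unsqueeze(1).expand(-1, vox.size(1), -1))
        return devox + self.point_mlp(feats)                 # fuse both branches
```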

NeurIPS Conference 2015 Conference Paper

Learning both Weights and Connections for Efficient Neural Network

  • Song Han
  • Jeff Pool
  • John Tran
  • William Dally

Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems. Also, conventional networks fix the architecture before training starts; as a result, training cannot improve the architecture. To address these limitations, we describe a method to reduce the storage and computation required by neural networks by an order of magnitude without affecting their accuracy, by learning only the important connections. Our method prunes redundant connections using a three-step method. First, we train the network to learn which connections are important. Next, we prune the unimportant connections. Finally, we retrain the network to fine-tune the weights of the remaining connections. On the ImageNet dataset, our method reduced the number of parameters of AlexNet by a factor of 9×, from 61 million to 6.7 million, without incurring accuracy loss. Similar experiments with VGG-16 found that the total number of parameters can be reduced by 13×, from 138 million to 10.3 million, again with no loss of accuracy.
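
The prune step of the train-prune-retrain pipeline is a magnitude threshold plus a mask that keeps pruned weights at zero during retraining; a minimal sketch (per-tensor sparsity is an illustrative choice):

```python
import torch

def magnitude_prune(model, sparsity=0.9):
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() > 1:                          # prune weights, keep biases
            k = max(int(p.numel() * sparsity), 1)
            threshold = p.abs().flatten().kthvalue(k).values
            masks[name] = (p.abs() > threshold).float()
            p.data *= masks[name]                # step 2: prune
    return masks

# Step 3 (retrain): after each optimizer step, keep pruned weights at zero:
# for name, p in model.named_parameters():
#     if name in masks:
#         p.data *= masks[name]
```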