Arrow Research search

Author name cluster

Kang Zhao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

17 papers
2 author rows

Possible papers

17

NeurIPS Conference 2025 Conference Paper

A Simple Linear Patch Revives Layer-Pruned Large Language Models

  • Xinrui Chen
  • Haoli Bai
  • Tao Yuan
  • Ruikang Liu
  • Kang Zhao
  • Xianzhi Yu
  • Lu Hou
  • Tian Guan

Layer pruning has emerged as a widely used technique for compressing large language models (LLMs). However, existing layer pruning approaches often incur substantial performance degradation. We attribute the majority of this degradation to a single yet previously overlooked issue: the mismatch of activation magnitudes at the pruning interface. The pre-interface activations exhibit significantly different scales from the post-interface ones, causing a distributional shift that propagates through the remaining layers. To address this issue, we introduce LinearPatch, a lightweight and plug-and-play technique that fuses two operations into one matrix multiply at the pruning interface: (i) a Hadamard transformation that suppresses massive outliers at particular tokens and (ii) a channel-wise scaling that aligns activation statistics. On LLaMA-3-8B, LinearPatch preserves up to 94.15% of the original model's performance when pruning 5 out of 32 layers, outperforming the previous state of the art by 4%. The patch can be further refined with 5K unlabeled samples via memory-efficient offline distillation, pushing the retention to 95.16% within only 30 minutes on a single GPU. Code is available at https://github.com/chenxinrui-tsinghua/LinearPatch.
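To make the interface fix concrete, below is a minimal NumPy sketch of the idea as the abstract describes it: a Hadamard rotation that spreads token outliers across channels, followed by a channel-wise rescaling, fused into a single d x d matrix applied where the pruned layers used to sit. The statistics used for the scale and the order of composition are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester's construction (n must be a power of two)."""
    assert n & (n - 1) == 0
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def build_linear_patch(pre_acts, post_acts):
    """Hypothetical fused patch: rotate activations with a Hadamard transform, then
    rescale each channel so pre-interface statistics roughly match post-interface ones.
    pre_acts / post_acts are (tokens x channels) calibration activations."""
    d = pre_acts.shape[-1]
    H = hadamard(d)
    rotated_pre = pre_acts @ H.T                    # outliers get spread across channels
    scale = post_acts.std(axis=0) / (rotated_pre.std(axis=0) + 1e-6)
    return H.T * scale                              # equals H.T @ diag(scale); applied as x @ patch
```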

NeurIPS Conference 2025 Conference Paper

EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation

  • Xiaofeng Wang
  • Kang Zhao
  • Feng Liu
  • Jiayu Wang
  • Guosheng Zhao
  • Xiaoyi Bao
  • Zheng Zhu
  • Yingya Zhang

Video generation has emerged as a promising tool for world simulation, leveraging visual data to replicate real-world environments. Within this context, egocentric video generation, which centers on the human perspective, holds significant potential for enhancing applications in virtual reality, augmented reality, and gaming. However, the generation of egocentric videos presents substantial challenges due to the dynamic nature of first-person viewpoints, the intricate diversity of actions, and the complex variety of scenes encountered. Existing datasets are inadequate for addressing these challenges effectively. To bridge this gap, we present EgoVid-5M, the first high-quality dataset specifically curated for egocentric video generation. EgoVid-5M encompasses over 5 million egocentric video clips and is enriched with detailed action annotations, including fine-grained kinematic control and high-level textual descriptions. To ensure the integrity and usability of the dataset, we implement a sophisticated data cleansing pipeline designed to maintain frame consistency, action coherence, and motion smoothness under egocentric conditions. Furthermore, we introduce EgoDreamer, which is capable of generating egocentric videos driven simultaneously by action descriptions and kinematic control signals. The EgoVid-5M dataset, associated action annotations, and all data cleansing metadata will be released for the advancement of research in egocentric video generation.

ICML Conference 2025 Conference Paper

FlatQuant: Flatness Matters for LLM Quantization

  • Yuxuan Sun
  • Ruikang Liu
  • Haoli Bai
  • Han Bao
  • Kang Zhao
  • Yuening Li
  • Jiaxin Hu
  • Xianzhi Yu

Recently, quantization has been widely used for the compression and acceleration of large language models (LLMs). Due to the outliers in LLMs, it is crucial to flatten weights and activations to minimize quantization error with equally spaced quantization points. Prior research explores various pre-quantization transformations to suppress outliers, such as per-channel scaling and the Hadamard transformation. However, we observe that these transformed weights and activations can still exhibit steep and dispersed distributions. In this paper, we propose FlatQuant (Fast and Learnable Affine Transformation), a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies an optimal affine transformation for each linear layer, calibrated in hours via a lightweight objective. To reduce the runtime overhead of the affine transformations, we apply a Kronecker product of two lightweight matrices and fuse all operations in FlatQuant into a single kernel. Extensive experiments demonstrate that FlatQuant sets a new state of the art for quantization. For example, it achieves less than a 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%. Additionally, it provides up to 2.3x prefill speedup and 1.7x decoding speedup compared to the FP16 model. Code is available at: https://github.com/ruikangliu/FlatQuant.
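The Kronecker trick the abstract mentions is easy to illustrate. The toy sketch below (with made-up sizes) shows why factoring a d x d affine transform as a Kronecker product of two small matrices keeps the overhead low: applying the two factors to a reshaped activation is mathematically identical to multiplying by the full matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 8, 16                       # hidden size d = d1 * d2 = 128 (illustrative)
A = rng.standard_normal((d1, d1))    # two small learnable factors
B = rng.standard_normal((d2, d2))
x = rng.standard_normal(d1 * d2)     # one activation vector

# Full affine transform: a d x d matrix built as a Kronecker product (expensive to store and apply).
full = np.kron(A, B) @ x

# Equivalent cheap form: reshape to (d1, d2) and apply the two small factors.
fast = (A @ x.reshape(d1, d2) @ B.T).reshape(-1)

assert np.allclose(full, fast)       # same result at a fraction of the FLOPs and memory
```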

AAAI Conference 2025 Conference Paper

FreeMask: Rethinking the Importance of Attention Masks for Zero-Shot Video Editing

  • Lingling Cai
  • Kang Zhao
  • Hangjie Yuan
  • Yingya Zhang
  • Shiwei Zhang
  • Kejie Huang

Text-to-video diffusion models have made remarkable advancements. Because these models can generate temporally coherent videos, research on zero-shot video editing built on them has expanded rapidly. To enhance editing quality, structural controls are frequently employed in video editing, and among these techniques, cross-attention mask control stands out for its effectiveness and efficiency. However, when cross-attention masks are naively applied to video editing, they can introduce artifacts such as blurring and flickering. Our experiments uncover a critical factor overlooked in previous video editing research: cross-attention masks are not consistently clear but vary with model structure and denoising timestep. To address this issue, we propose a metric, Mask Matching Cost (MMC), that quantifies this variability, and introduce FreeMask, a method for selecting optimal masks tailored to specific video editing tasks. Using MMC-selected masks, we further improve the masked fusion mechanism across attention features, e.g., the temporal-, cross-, and self-attention modules. Our approach integrates seamlessly into existing zero-shot video editing frameworks and improves their performance, requiring no control assistance or parameter fine-tuning while enabling adaptive decoupling of unedited semantic layouts with mask precision control. Extensive experiments demonstrate that FreeMask achieves superior semantic fidelity, temporal consistency, and editing quality compared to state-of-the-art methods.

ICML Conference 2024 Conference Paper

Accelerating Transformer Pre-training with 2:4 Sparsity

  • Yuezhou Hu
  • Kang Zhao
  • Weiyu Huang
  • Jianfei Chen 0001
  • Jun Zhu 0001

Training large transformers is slow, but recent innovations in GPU architecture give us an advantage: NVIDIA Ampere GPUs can execute a fine-grained 2:4 sparse matrix multiplication twice as fast as its dense equivalent. In light of this property, we comprehensively investigate the feasibility of accelerating the feed-forward networks (FFNs) of transformers during pre-training. First, we define a “flip rate” to monitor the stability of a 2:4 training process. Using this metric, we propose three techniques to preserve accuracy: modifying the sparse-refined straight-through estimator by applying a masked decay term to gradients, determining a feasible decay factor in the warm-up stage, and enhancing the model's quality with a dense fine-tuning procedure near the end of pre-training. In addition, we devise two techniques to practically accelerate training: calculating transposable 2:4 masks by convolution, and accelerating gated activation functions by reducing GPU L2 cache misses. Experiments show that our 2:4 sparse training algorithm achieves convergence similar to dense training on several transformer pre-training tasks, while delivering measurable acceleration across different transformer block shapes. Our toolkit is available at https://github.com/huyz2023/2by4-pretrain.
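As a rough illustration of two ingredients named in the abstract, the sketch below prunes a weight tensor to the 2:4 pattern (keep the two largest magnitudes in every group of four) and computes a flip rate between consecutive masks. Both functions are simplified stand-ins; the transposable masks and masked decay are not shown.

```python
import numpy as np

def prune_2_4(w):
    """Keep the 2 largest-magnitude weights in every contiguous group of 4."""
    groups = w.reshape(-1, 4)
    smallest = np.argsort(np.abs(groups), axis=1)[:, :2]   # two smallest per group
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, smallest, False, axis=1)
    return (groups * mask).reshape(w.shape), mask.reshape(w.shape)

def flip_rate(prev_mask, curr_mask):
    """Fraction of positions whose kept/pruned status changed between two steps
    (one plausible reading of the paper's stability metric)."""
    return float(np.mean(prev_mask != curr_mask))
```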

AAAI Conference 2024 Conference Paper

AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis

  • Dongze Li
  • Kang Zhao
  • Wei Wang
  • Bo Peng
  • Yingya Zhang
  • Jing Dong
  • Tieniu Tan

Audio-driven talking head synthesis is a promising topic with wide applications in digital humans, filmmaking, and virtual reality. Recent NeRF-based approaches have shown superior quality and fidelity compared to previous studies. However, for few-shot talking head generation, a practical scenario where only a few seconds of talking video are available for one identity, two limitations emerge: 1) existing methods either have no base model serving as a facial prior for fast convergence, or ignore the importance of audio when building the prior; 2) most of them overlook the degree of correlation between different face regions and audio, e.g., the mouth is audio-related while the ear is audio-independent. In this paper, we present Audio Enhanced Neural Radiance Field (AE-NeRF) to tackle these issues; it can generate realistic portraits of a new speaker from a few-shot dataset. Specifically, we introduce an Audio Aware Aggregation module into the feature fusion stage of the reference scheme, where the weights are determined by the audio similarity between the reference and target images. An Audio-Aligned Face Generation strategy is then proposed to model the audio-related and audio-independent regions separately with a dual-NeRF framework. Extensive experiments show that AE-NeRF surpasses the state of the art in image fidelity, audio-lip synchronization, and generalization ability, even with a limited training set or few training iterations.
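A tiny sketch of the audio-aware weighting described above: reference features are blended with weights derived from how similar each reference clip's audio is to the target audio. The cosine-similarity-plus-softmax weighting is an assumption made for illustration, not the module's actual design.

```python
import numpy as np

def audio_aware_aggregate(ref_feats, ref_audio, target_audio):
    """ref_feats: (R, D) reference features; ref_audio: (R, A); target_audio: (A,)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    sims = np.array([cos(a, target_audio) for a in ref_audio])
    w = np.exp(sims - sims.max())
    w /= w.sum()                                   # softmax over audio similarities
    return np.tensordot(w, ref_feats, axes=1)      # audio-weighted sum of reference features
```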

ICML Conference 2024 Conference Paper

Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization

  • Haocheng Xi
  • Yuxiang Chen
  • Kang Zhao
  • Kai Jun Teh
  • Jianfei Chen 0001
  • Jun Zhu 0001

Pretraining transformers is generally time-consuming. Fully quantized training (FQT) is a promising approach to speeding it up. However, most FQT methods adopt a quantize-compute-dequantize procedure, which often leads to suboptimal speedup and significant performance degradation when used in transformers, due to high memory-access overheads and low-precision computation. In this work, we propose Jetfire, an efficient and accurate INT8 training method specific to transformers. Our method features an INT8 data flow to optimize memory access and a per-block quantization method to maintain the accuracy of pretrained transformers. Extensive experiments demonstrate that our INT8 FQT method achieves accuracy comparable to the FP16 training baseline and outperforms existing INT8 training works for transformers. Moreover, for a standard transformer block, our method offers an end-to-end training speedup of 1.42x and a 1.49x memory reduction compared to the FP16 baseline.
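Per-block quantization, the accuracy-preserving half of the method, can be sketched in a few lines: each tile of a matrix gets its own INT8 scale, so an outlier only degrades its own block. This is a generic illustration, not Jetfire's kernel (which also keeps the data flow itself in INT8).

```python
import numpy as np

def quantize_per_block(x, block=32):
    """Symmetric INT8 quantization with one scale per (block x block) tile."""
    h, w = x.shape
    assert h % block == 0 and w % block == 0
    q = np.empty((h, w), dtype=np.int8)
    scales = np.empty((h // block, w // block))
    for i in range(0, h, block):
        for j in range(0, w, block):
            tile = x[i:i + block, j:j + block]
            s = np.abs(tile).max() / 127.0 + 1e-12
            q[i:i + block, j:j + block] = np.round(tile / s).astype(np.int8)
            scales[i // block, j // block] = s
    return q, scales                 # dequantize a tile as q_tile * its scale
```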

ECAI Conference 2024 Conference Paper

SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM

  • Quandong Wang
  • Yuxuan Yuan
  • Xiaoyu Yang 0005
  • Ruike Zhang
  • Kang Zhao
  • Wei Liu 0302
  • Jian Luan 0001
  • Daniel Povey

While Large Language Models (LLMs) have achieved remarkable success in various fields, the efficiency of training and inference remains a major challenge. To address this issue, we propose SUBLLM, short for Subsampling-Upsampling-Bypass Large Language Model, an architecture that extends the core decoder-only framework with subsampling, upsampling, and bypass modules. The subsampling modules shorten the sequence, the upsampling modules restore the sequence length, and the bypass modules improve convergence. Compared to LLaMA, SUBLLM exhibits significant improvements in training and inference speed as well as memory usage, while maintaining competitive few-shot performance. During training, SUBLLM increases speed by 26% and cuts memory by 10GB per GPU; during inference, it boosts speed by up to 37% and reduces memory by 1GB per GPU. The training and inference speedups reach 34% and 52%, respectively, when the context window is expanded to 8192. Our code is available at https://github.com/XiaoMi/subllm.
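The subsample-process-upsample-bypass structure can be caricatured in a few lines. The sketch below uses a fixed stride for subsampling and nearest-neighbor upsampling purely for illustration; SUBLLM's actual modules are learned.

```python
import numpy as np

def subllm_block_sketch(x, core_layers, keep_every=2):
    """x: (seq_len, hidden) token states; core_layers: function run on the shortened sequence."""
    n = x.shape[0]
    kept = np.arange(0, n, keep_every)
    h = core_layers(x[kept])                        # expensive layers see fewer tokens
    up = np.repeat(h, keep_every, axis=0)[:n]       # naive upsampling back to seq_len
    return x + up                                   # bypass (residual) path aids convergence
```

With keep_every=2 the core layers process roughly half the tokens, which is where savings of the kind the abstract reports would come from.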

TMLR Journal 2023 Journal Article

Complementary Sparsity: Accelerating Sparse CNNs with High Accuracy on General-Purpose Computing Platforms

  • Kang Zhao
  • Yijun Tan
  • Kai Han
  • Ting Hu
  • Hanting Chen
  • Tao Yuan
  • Yunhe Wang
  • Jun Yao

Model sparsity is a promising approach to reducing the parameters or FLOPs of convolutional neural networks (CNNs). Compared to unstructured or coarse-grained structured sparsity, fine-grained structured sparsity, e.g., the N:M sparse pattern, can achieve a better balance between accuracy and efficiency on general computing platforms such as CPUs and GPUs. In particular, 2:4 sparsity can accelerate CNN inference by 2x with a negligible accuracy drop. However, N:M sparsity requires dedicated hardware circuits on the GPU and hardly achieves significant speedups on common GPUs. To accelerate CNNs with general-purpose computing resources while retaining model accuracy as much as possible, this paper proposes complementary sparsity (CS). In CS, only one weight is retained among weights spaced at the same distance. On the one hand, CS offers high mask flexibility, which naturally favors high model accuracy; moreover, we propose a CS-specific sparse training method to improve the accuracy of CS-based CNNs under high parameter sparsities (>75%). On the other hand, CS itself is memory-access balanced and robust to pattern hyperparameters, which can be exploited to speed up CS-based convolution on CPUs and common GPUs. We thus propose a CS convolution parallel computing algorithm that suits common GPUs without sparse tensor cores. Experimental results show that, compared to other sparsity patterns, the proposed CS achieves the optimal trade-off between accuracy and latency on CPUs and common GPUs. Code will be available at https://gitee.com/mindspore/models/tree/master/research/cv/CS.

NeurIPS Conference 2023 Conference Paper

FaceComposer: A Unified Model for Versatile Facial Content Creation

  • Jiayu Wang
  • Kang Zhao
  • Yifeng Ma
  • Shiwei Zhang
  • Yingya Zhang
  • Yujun Shen
  • Deli Zhao
  • Jingren Zhou

This work presents FaceComposer, a unified generative model that accomplishes a variety of facial content creation tasks, including text-conditioned face synthesis, text-guided face editing, and face animation. Based on the latent diffusion framework, FaceComposer follows the paradigm of compositional generation and employs diverse face-specific conditions, e.g., the Identity Feature and the Projected Normalized Coordinate Code, to unleash the model's creativity as much as possible. To support text control and animation, we clean up several existing face image datasets and collect around 500 hours of talking-face videos, forming a high-quality, large-scale multi-modal face database. A temporal self-attention module is incorporated into the U-Net structure, which allows learning the denoising process on a mixture of images and videos. Extensive experiments suggest that our approach not only achieves comparable or better performance than the state of the art on each single task, but also facilitates combined tasks in a single forward pass, demonstrating its potential as a foundation generative model in the face domain. We further develop an interface so that users can enjoy our one-step service to create, edit, and animate their own characters. Code, dataset, model, and interface will be made publicly available.

NeurIPS Conference 2022 Conference Paper

Accelerating Sparse Convolution with Column Vector-Wise Sparsity

  • Yijun Tan
  • Kai Han
  • Kang Zhao
  • Xianzhi Yu
  • Zidong Du
  • Yunji Chen
  • Yunhe Wang
  • Jun Yao

Weight sparsity is a promising approach to reducing the model size and computation cost of convolutional neural networks (CNNs). Nevertheless, non-zero weights often distribute randomly in sparse CNN models, making it very difficult to obtain actual speedup on common hardware (e.g., GPUs) over their dense counterparts. Existing acceleration solutions either require hardware modifications for irregular memory-access support or rely on a partially structured sparsity pattern, and neither approach achieves fruitful speedup on convolution layers. In this work, we propose an algorithm-software co-designed sparse convolution based on a novel out-vector-wise (OVW) sparse pattern. Building on the insight that vertical vector integrity can preserve continuous memory access in IM2COL, the OVW pattern treats a V×1 vector as an entirety. To reduce the error caused by sparsity, we propose an equivalent transformation process, i.e., clustering-based channel permutation, to gather similar rows together. Experimental evaluations demonstrate that our method achieves 1.7x and 3.2x speedups over the SOTA solution and the dense convolution of ResNet50 on NVIDIA V100 at 75% sparsity, respectively, with only negligible accuracy loss. Moreover, whereas the SOTA solution achieves speedups only at 60% sparsity or more, our method begins to obtain speedups at only 10% sparsity.
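A hedged sketch of vector-wise pruning in the spirit of the OVW pattern: the weight matrix is split into V×1 vertical vectors and whole vectors with the smallest norms are zeroed, so the surviving non-zeros stay contiguous in memory. The grouping and ranking criterion here are assumptions, and the clustering-based channel permutation is not shown.

```python
import numpy as np

def ovw_prune(w, V=4, sparsity=0.75):
    """Zero entire V x 1 vertical vectors of a 2-D weight matrix, ranked by L2 norm."""
    rows, cols = w.shape
    assert rows % V == 0
    blocks = w.reshape(rows // V, V, cols)        # group rows into V-length vertical vectors
    norms = np.linalg.norm(blocks, axis=1)        # (rows // V, cols) vector norms
    k = max(1, int(norms.size * sparsity))
    thresh = np.partition(norms.ravel(), k - 1)[k - 1]
    mask = (norms > thresh)[:, None, :]           # drop the smallest-norm vectors
    return (blocks * mask).reshape(rows, cols)
```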

AAAI Conference 2021 Conference Paper

Distribution Adaptive INT8 Quantization for Training CNNs

  • Kang Zhao
  • Sida Huang
  • Pan Pan
  • Yinghan Li
  • Yingya Zhang
  • Zhenyu Gu
  • Yinghui Xu

Research has demonstrated that low bit-width (e.g., INT8) quantization can be employed to accelerate the inference process. This makes gradient quantization very promising, since backward propagation requires approximately twice as much computation as the forward pass. Because of the variability and uncertainty of gradient distributions, many methods have been proposed to attain training stability. However, most of them ignore channel-wise gradient distributions and the impact of gradients with different magnitudes, resulting in degraded final accuracy. In this paper, we propose a novel INT8 quantization training framework for convolutional neural networks to address these issues. Specifically, we adopt Gradient Vectorized Quantization to quantize the gradient, based on the observation that layer-wise gradients contain multiple distributions along the channel dimension. Then, a Magnitude-aware Clipping Strategy is introduced that takes the magnitudes of gradients into account when minimizing the quantization error, and we present a theoretical derivation to solve for the quantization parameters of different distributions. Experimental results on a broad range of computer vision tasks, such as image classification, object detection, and video classification, demonstrate that the proposed Distribution Adaptive INT8 Quantization training method achieves almost lossless training accuracy for different backbones, including ResNet, MobileNetV2, InceptionV3, VGG, and AlexNet, and is superior to state-of-the-art techniques. Moreover, we implement an INT8 kernel that accelerates the training iteration by more than 200% on the latest Turing architecture, i.e., our method excels in both training accuracy and speed.
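The two ingredients named in the abstract, channel-wise gradient quantization and magnitude-aware clipping, can be mocked up as below. The brute-force clip search weighted by gradient magnitude is a stand-in for the paper's analytical derivation.

```python
import numpy as np

def quantize_grad_per_channel(g, num_bits=8, n_clip_candidates=16):
    """g: gradient tensor with the channel dimension first. Returns a fake-quantized copy."""
    qmax = 2 ** (num_bits - 1) - 1
    out = np.empty_like(g)
    for c in range(g.shape[0]):
        ch = g[c].ravel()
        best_q, best_err = ch, np.inf
        for frac in np.linspace(1.0 / n_clip_candidates, 1.0, n_clip_candidates):
            clip = frac * np.abs(ch).max() + 1e-12
            q = np.clip(np.round(ch / clip * qmax), -qmax, qmax) * clip / qmax
            err = np.sum(np.abs(ch) * (ch - q) ** 2)   # larger gradients weigh more in the error
            if err < best_err:
                best_q, best_err = q, err
        out[c] = best_q.reshape(g[c].shape)
    return out
```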

IS Journal 2021 Journal Article

Learning Embeddings Based on Global Structural Similarity in Heterogeneous Networks

  • Wanting Wen
  • Daniel D. Zeng
  • Jie Bai
  • Kang Zhao
  • Ziqiang Li

With different types of nodes and edges, heterogeneous networks have higher levels of structural diversity than homogeneous networks. This article proposes an unsupervised representation learning model, named gs2vec, to address the structural diversity of a node being connected to other types of nodes via different types of edges in heterogeneous networks. The model measures a node's structural roles based on its numbers of neighboring nodes of different types. It also measures such structural roles beyond the immediate neighborhood of each node by incorporating the structural roles of nodes up to k hops away. Experiments on synthetic and empirical datasets show that gs2vec outperforms state-of-the-art network representation learning models in heterogeneous network analysis tasks such as node classification and node clustering.
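The structural-role idea is simple enough to sketch: describe each node by counts of neighbor types, per hop, out to k hops. The exact feature construction and the embedding step in gs2vec are not reproduced here.

```python
from collections import Counter

def structural_signature(adj, node_types, k=2):
    """adj: dict node -> iterable of neighbors; node_types: dict node -> type label."""
    signatures = {}
    for node in adj:
        frontier, seen = {node}, {node}
        counts = Counter()
        for hop in range(1, k + 1):
            nxt = {m for n in frontier for m in adj[n]} - seen
            for m in nxt:
                counts[(hop, node_types[m])] += 1   # key: (hop distance, neighbor type)
            seen |= nxt
            frontier = nxt
        signatures[node] = counts
    return signatures
```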

AAAI Conference 2015 Conference Paper

On Unconstrained Quasi-Submodular Function Optimization

  • Jincheng Mei
  • Kang Zhao
  • Bao-Liang Lu

With the extensive application of submodularity, generalizations of it are constantly being proposed. However, most of them are tailored to special problems. In this paper, we focus on quasi-submodularity, a universal generalization that satisfies weaker properties than submodularity but still enjoys favorable performance in optimization. Analogous to the diminishing-returns property of submodularity, we first define a corresponding property called single sub-crossing; we then propose two algorithms for unconstrained quasi-submodular function minimization and maximization, respectively. The proposed algorithms return reduced lattices in O(n) iterations and guarantee that the objective function value strictly and monotonically increases or decreases after each iteration. Moreover, all local and global optima are contained in the reduced lattices. Experimental results verify the effectiveness and efficiency of the proposed algorithms on lattice reduction.

IS Journal 2015 Journal Article

System Informatics: From Methodology to Applications

  • Kang Zhao
  • Yao Xie
  • Kwok-Leung Tsui
  • Qingming Wei
  • Wenpo Huang
  • Wei Jiang
  • Yanting Li
  • Sugon Cho

This installment of Trends & Controversies provides an array of perspectives on the latest research in system informatics. Kang Zhao, Yao Xie, and Kwok-Leung Tsui introduce the work in "System Informatics: From Methodology to Applications." On the methodology side, in "Projection-Based Process Monitoring and Empirical Divergence," Wenpo Huang, Wei Jiang, Qingming Wei, and Yanting Li propose a framework of projection-based methods, and in "One-Class Classification Methods for Process Monitoring and Diagnosis," Sugon Cho and Seoung Bum Kim discuss how a data analytics algorithm can be used as a control chart. On the application side, "IoT-Enabled System Informatics for Service Decision Making," by Kaibo Liu and Jianjun Shi, reviews current trends and future opportunities for IoT, with a special focus on issues related to the big data collected by multiple sensors. "Quantifying the Risk Level of Functional Chips in DRAM Wafers," by Young-Seon Jeong, Byunghoon Kim, Seung Hoon Tong, In-Kap Chang, and Myong K. Jeong, not only identifies research challenges and opportunities for decision making with massive data in semiconductor manufacturing but also quantifies the risk level of functional chips in DRAM wafers. Finally, "Flight Operations Monitoring through Cluster Analysis: A Case Study," by Florent Charruaud and Lishuai Li, describes a new method called cluster-based anomaly detection to help airline safety experts monitor daily flights and detect anomalies.

AAAI Conference 2014 Conference Paper

Locality Preserving Hashing

  • Kang Zhao
  • Hongtao Lu
  • Jincheng Mei

Hashing has recently attracted considerable attention for large-scale similarity search. However, learning compact codes with good performance is still a challenge. In many cases, real-world data lies on a low-dimensional manifold embedded in a high-dimensional ambient space. To capture meaningful neighbors, a compact hashing representation should be able to uncover the intrinsic geometric structure of the manifold, e.g., the neighborhood relationships between subregions. Most existing hashing methods only consider this issue while mapping data points into certain projected dimensions. When generating the binary codes, they either directly quantize the projected values with a threshold or use an orthogonal matrix to refine the initial projection matrix; both treat projection and quantization separately and thus do not preserve the locality structure throughout the learning process. In this paper, we propose a novel hashing algorithm called Locality Preserving Hashing to address these problems. Specifically, we learn a set of locality-preserving projections within a joint optimization framework that minimizes the average projection distance and the quantization loss simultaneously. Experimental comparisons with other state-of-the-art methods on two large-scale datasets demonstrate the effectiveness and efficiency of our method.
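The joint objective described above, average projected distance between neighbors plus quantization loss, can be written down directly; a minimal evaluation is below. The trade-off weight and any constraints on the projections are assumptions here.

```python
import numpy as np

def lph_objective(X, W, neighbor_pairs, lam=1.0):
    """X: (n, d) data; W: (d, bits) projections; neighbor_pairs: list of (i, j) index pairs."""
    P = X @ W                                     # projected data
    B = np.sign(P)                                # binary codes in {-1, +1}
    i, j = np.asarray(neighbor_pairs).T
    locality = np.mean(np.sum((P[i] - P[j]) ** 2, axis=1))      # neighbors stay close after projection
    quantization = np.mean(np.sum((B - P) ** 2, axis=1))        # codes stay near their projections
    return locality + lam * quantization
```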

IS Journal 2014 Journal Article

User Recommendations in Reciprocal and Bipartite Social Networks--An Online Dating Case Study

  • Kang Zhao
  • Xi Wang
  • Mo Yu
  • Bo Gao

Many social networks in our daily life are bipartite networks built on reciprocity. How can we recommend other users to a user so that the user is both interested in and attractive to those who are recommended? We propose a new collaborative-filtering model to improve user recommendations in bipartite and reciprocal social networks. The model considers a user's taste in picking others and attractiveness in being picked by others. A case study of an online dating network shows that the approach performs well in recommending both initial and reciprocal contacts.
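A toy version of the reciprocal idea: score a candidate pair by combining each side's predicted interest in the other, so a recommendation only ranks highly if it works in both directions. The similarity measure and the harmonic-mean combination are illustrative choices, not the paper's model.

```python
import numpy as np

def directional_interest(M, a, b):
    """How interested user a is likely to be in user b: compare b's received-contact
    profile (column M[:, b]) with those of the users a has already contacted.
    M[i, j] = 1 if user i contacted user j."""
    contacted = np.flatnonzero(M[a])
    if contacted.size == 0:
        return 0.0
    def cos(x, y):
        d = np.linalg.norm(x) * np.linalg.norm(y)
        return float(x @ y / d) if d else 0.0
    return float(np.mean([cos(M[:, b], M[:, w]) for w in contacted]))

def reciprocal_score(M, u, v):
    """Harmonic mean of both directions, so the match must appeal to both sides."""
    i_uv, i_vu = directional_interest(M, u, v), directional_interest(M, v, u)
    return 2 * i_uv * i_vu / (i_uv + i_vu) if (i_uv + i_vu) else 0.0
```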