Arrow Research

Author name cluster

Haotong Qin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

29 papers
2 author rows

Possible papers (29)

AAAI Conference 2026 Conference Paper

Activation Manipulation Attack: Penetrating and Harmful Jailbreak Attack Against Large Vision-Language Models

  • Haojie Hao
  • Jiakai Wang
  • Aishan Liu
  • Yuqing Ma
  • Haotong Qin
  • Yuanfang Guo
  • Xianglong Liu

Recently, Large Vision-Language Models (LVLMs) have been demonstrated to be vulnerable to jailbreak attacks, highlighting the urgent need for further research to comprehensively identify and mitigate these threats. Unfortunately, existing jailbreak studies primarily focus on coarse-grained input manipulation to elicit specific responses, overlooking the exploitation of internal representations, i.e., intermediate activations, which constrains their ability to penetrate alignment safeguards and generate harmful responses. To tackle this issue, we propose the Activation Manipulation (ActMan) Attack framework, which performs fine-grained activation manipulations inspired by the perception and cognition stages of human decision-making, enhancing both the penetration capability and harmfulness of attacks. To improve penetration capability, we introduce a Deceptive Visual Camouflage module inspired by the masking effect in human perception. This module uses a benign activation-guided attention redirection strategy to conceal abnormal activation patterns, thereby suppressing LVLM's defense detection during early-stage decoding. To enhance harmfulness, we design a Malicious Semantic Induction module drawing from the framing effect in human cognition, which reconstructs jailbreak instructions using malicious activation guidance to change LVLM’s risk assessment during late-stage decoding, thereby amplifying the harmfulness of model responses. Extensive experiments on six mainstream LVLMs demonstrate that our method remarkably outperforms state-of-the-art baselines, achieving an average relative ASR improvement of 12.06%.
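The attack's core primitive is editing intermediate activations during decoding. The sketch below shows only that generic mechanism — a forward hook that shifts a layer's hidden states along a fixed direction — assuming a HuggingFace-style decoder exposing `model.model.layers`; the steering vector, strength, and module path are illustrative assumptions, not the ActMan method itself.

```python
# Minimal sketch of activation manipulation via a forward hook (illustrative
# only; not the ActMan implementation).
import torch

def make_steering_hook(direction: torch.Tensor, strength: float = 4.0):
    """Return a hook that shifts a layer's hidden states along `direction`."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + strength * direction.to(hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

def steer_layer(model, layer_idx: int, direction: torch.Tensor, strength: float = 4.0):
    """Attach the hook to one decoder layer; caller removes the handle later."""
    layer = model.model.layers[layer_idx]   # assumed HuggingFace-style module path
    return layer.register_forward_hook(make_steering_hook(direction, strength))
```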

AAAI Conference 2026 Conference Paper

First-Order Error Matters: Accurate Compensation for Quantized Large Language Models

  • Xingyu Zheng
  • Haotong Qin
  • Yuye Li
  • Haoran Chu
  • Jiakai Wang
  • Jinyang Guo
  • Michele Magno
  • Xianglong Liu

Post-training quantization (PTQ) offers an efficient approach to compressing large language models (LLMs), significantly reducing memory access and computational costs. Existing compensation-based weight calibration methods often rely on a second-order Taylor expansion to model quantization error, under the assumption that the first-order term is negligible in well-trained full-precision models. However, we reveal that the progressive compensation process introduces accumulated first-order deviations between latent weights and their full-precision counterparts, making this assumption fundamentally flawed. To address this, we propose FOEM, a novel PTQ method that explicitly incorporates first-order gradient terms to improve quantization error compensation. FOEM approximates gradients by performing a first-order Taylor expansion around the pre-quantization weights. This yields an approximation based on the difference between latent and full-precision weights as well as the Hessian matrix. When substituted into the theoretical solution, the formulation eliminates the need to explicitly compute the Hessian, thereby avoiding the high computational cost and limited generalization of backpropagation-based gradient methods. This design introduces only minimal additional computational overhead. Extensive experiments across a wide range of models and benchmarks demonstrate that FOEM consistently outperforms the classical GPTQ method. In 3-bit weight-only quantization, FOEM reduces the perplexity of Llama3-8B by 17.3% and increases the 5-shot MMLU accuracy from 53.8% achieved by GPTAQ to 56.1%. Moreover, FOEM can be seamlessly combined with advanced techniques such as SpinQuant, delivering additional gains under the challenging W4A4KV4 setting and further narrowing the performance gap with full-precision baselines, surpassing existing state-of-the-art methods.
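For context, compensation-based calibration in the GPTQ family quantizes weights column by column and spreads each column's error over the remaining columns via the inverse Hessian. The sketch below shows that classic loop with a first-order drift term added where the abstract suggests it matters; the exact placement of the drift term is an assumption for illustration, not the FOEM formulation.

```python
# GPTQ-style column-wise compensation with a hedged first-order correction.
# `hinv` is the inverse Hessian from calibration data; the drift term uses the
# gap between latent (compensated) and original full-precision weights.
import torch

def quantize_rtn(w: torch.Tensor, scale: float) -> torch.Tensor:
    return torch.clamp(torch.round(w / scale), -8, 7) * scale  # 4-bit RTN

def compensate(W_fp: torch.Tensor, hinv: torch.Tensor, scale: float) -> torch.Tensor:
    W = W_fp.clone()                        # latent weights, updated in place
    for j in range(W.shape[1]):
        drift = W[:, j] - W_fp[:, j]        # accumulated first-order deviation
        q = quantize_rtn(W[:, j], scale)
        # Second-order (GPTQ) error plus the assumed first-order drift term.
        err = (W[:, j] - q + drift) / hinv[j, j]
        W[:, j] = q
        W[:, j + 1:] -= err[:, None] * hinv[j, j + 1:][None, :]
    return W
```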

AAAI Conference 2026 Conference Paper

TR-DQ: Time-Rotation Diffusion Quantization

  • Yihua Shao
  • Deyang Lin
  • Minxi Yan
  • Siyu Chen
  • Fanhu Zeng
  • Minwen Liao
  • Ao Ma
  • Ziyang Yan

Diffusion models have been widely adopted in image and video generation. However, their complex network architecture leads to high inference overhead in the generation process. Existing diffusion quantization methods primarily focus on the quantization of the model structure while ignoring the impact of time-step variation during sampling. At the same time, most current approaches fail to account for significant activations that cannot be eliminated, resulting in substantial performance degradation after quantization. To address these issues, we propose Time-Rotation Diffusion Quantization (TR-DQ), a novel quantization method incorporating time-step and rotation-based optimization. TR-DQ first divides the sampling process based on time-steps and applies a rotation matrix to smooth activations and weights dynamically. For different time-steps, a dedicated hyperparameter is introduced for adaptive timing modeling, which enables dynamic quantization across different time-steps. Additionally, we also explore the compression potential of Classifier-Free Guidance (CFG-wise) to establish a foundation for subsequent work. TR-DQ achieves state-of-the-art (SOTA) performance on image and video generation tasks, with a 1.38-1.89× speedup and 1.97-2.58× memory reduction in inference compared to existing quantization methods.
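The rotation trick the abstract builds on is standard in recent quantization work (e.g., QuaRot/SpinQuant): for an orthogonal matrix Q, rotating activations and weights in opposite directions leaves the layer output unchanged while spreading outliers across channels. A minimal demonstration of that invariance, without TR-DQ's time-step-wise schedule:

```python
# Orthogonal rotation preserves the linear layer's output:
# (x @ Q) @ (Q.T @ W) == x @ W, while smoothing outlier channels.
import torch

def random_orthogonal(n: int, seed: int = 0) -> torch.Tensor:
    g = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(n, n, generator=g))
    return q

d = 64
W = torch.randn(d, d)
x = torch.randn(2, d)
Q = random_orthogonal(d)
out_ref = x @ W
out_rot = (x @ Q) @ (Q.T @ W)          # identical up to float rounding
assert torch.allclose(out_ref, out_rot, atol=1e-4)
```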

NeurIPS Conference 2025 Conference Paper

$\text{S}^2$Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation

  • Weilun Feng
  • Haotong Qin
  • Chuanguang Yang
  • Xiangqi Li
  • Han Yang
  • Yuqi Li
  • Zhulin An
  • Libo Huang

Diffusion transformers have emerged as the mainstream paradigm for video generation models. However, the use of up to billions of parameters incurs significant computational costs. Quantization offers a promising solution by reducing memory usage and accelerating inference. Nonetheless, we observe that the joint modeling of spatial and temporal information in video diffusion models (V-DMs) leads to extremely long token sequences, which introduces high calibration variance and learning challenges. To address these issues, we propose **$S^2$Q-VDiT**, a post-training quantization framework for V-DMs that leverages **S**alient data and **S**parse token distillation. During the calibration phase, we identify that quantization performance is highly sensitive to the choice of calibration data. To mitigate this, we introduce *Hessian-aware Salient Data Selection*, which constructs high-quality calibration datasets by considering both diffusion and quantization characteristics unique to V-DMs. To tackle the learning challenges, we further analyze the sparse attention patterns inherent in V-DMs. Based on this observation, we propose *Attention-guided Sparse Token Distillation*, which exploits token-wise attention distributions to emphasize tokens that are more influential to the model's output. Under W4A6 quantization, $S^2$Q-VDiT achieves lossless performance while delivering $3.9\times$ model compression and $1.3\times$ inference acceleration. Code will be available at https://github.com/wlfeng0509/s2q-vdit.
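One plausible form of attention-guided token weighting in a distillation loss is sketched below: the per-token feature error is weighted by the attention mass each token receives in the teacher. The specific weighting scheme is an assumption for illustration, not the paper's exact objective.

```python
# Token-weighted distillation: emphasize tokens with high teacher attention mass.
import torch
import torch.nn.functional as F

def token_weighted_distill(student: torch.Tensor, teacher: torch.Tensor,
                           attn: torch.Tensor) -> torch.Tensor:
    """student/teacher: (B, T, D) features; attn: (B, H, T, T) teacher attention."""
    weight = attn.mean(dim=1).sum(dim=1)               # (B, T): mass received per token
    weight = weight / weight.sum(dim=1, keepdim=True)  # normalize per sequence
    per_token = F.mse_loss(student, teacher, reduction="none").mean(dim=-1)  # (B, T)
    return (weight * per_token).sum(dim=1).mean()
```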

ICLR Conference 2025 Conference Paper

ARB-LLM: Alternating Refined Binarizations for Large Language Models

  • Zhiteng Li
  • Xianglong Yan
  • Tianao Zhang
  • Haotong Qin
  • Dong Xie
  • Jiang Tian
  • Zhongchao Shi
  • Linghe Kong

Large Language Models (LLMs) have greatly pushed forward advancements in natural language processing, yet their high memory and computational demands hinder practical deployment. Binarization, as an effective compression technique, can shrink model weights to just 1 bit, significantly reducing the high demands on computation and memory. However, current binarization methods struggle to narrow the distribution gap between binarized and full-precision weights, while also overlooking the column deviation in LLM weight distribution. To tackle these issues, we propose ARB-LLM, a novel 1-bit post-training quantization (PTQ) technique tailored for LLMs. To narrow the distribution shift between binarized and full-precision weights, we first design an alternating refined binarization (ARB) algorithm to progressively update the binarization parameters, which significantly reduces the quantization error. Moreover, considering the pivotal role of calibration data and the column deviation in LLM weights, we further extend ARB to ARB-X and ARB-RC. In addition, we refine the weight partition strategy with a column-group bitmap (CGB), which further enhances performance. Equipping ARB-X and ARB-RC with CGB, we obtain ARB-LLM$_{\text{X}}$ and ARB-LLM$_{\text{RC}}$ respectively, which significantly outperform state-of-the-art (SOTA) binarization methods for LLMs. As a binary PTQ method, our ARB-LLM$_{\text{RC}}$ is the first to surpass FP16 models of the same size. Code: https://github.com/ZHITENGLI/ARB-LLM.
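The alternating refinement idea can be read as alternating least squares on $\lVert W - \alpha B - \mu \rVert$: fix two of the scale $\alpha$, row mean $\mu$, and binary matrix $B$, and re-fit the third. A toy sketch of that view (ARB-X/ARB-RC's calibration data and column corrections are omitted):

```python
# Alternating refinement of a row-wise binarization W ~ alpha * B + mu.
import torch

def alternating_binarize(W: torch.Tensor, iters: int = 5):
    mu = torch.zeros(W.shape[0], 1)
    for _ in range(iters):
        B = torch.sign(W - mu)
        B[B == 0] = 1.0
        alpha = ((W - mu) * B).mean(dim=1, keepdim=True)  # optimal scale for fixed B, mu
        mu = (W - alpha * B).mean(dim=1, keepdim=True)    # re-fit the row mean
    return alpha, B, mu

W = torch.randn(8, 64)
alpha, B, mu = alternating_binarize(W)
print((W - (alpha * B + mu)).norm())  # residual shrinks as the alternation proceeds
```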

ICLR Conference 2025 Conference Paper

BinaryDM: Accurate Weight Binarization for Efficient Diffusion Models

  • Xingyu Zheng
  • Xianglong Liu 0001
  • Haotong Qin
  • Xudong Ma
  • Mingyuan Zhang
  • Haojie Hao
  • Jiakai Wang
  • Zixiang Zhao

With the advancement of diffusion models (DMs) and the substantially increased computational requirements, quantization emerges as a practical solution to obtain compact and efficient low-bit DMs. However, the highly discrete representation leads to severe accuracy degradation, hindering the quantization of diffusion models to ultra-low bit-widths. This paper proposes a novel weight binarization approach for DMs, namely BinaryDM, pushing binarized DMs to be accurate and efficient by improving the representation and optimization. From the representation perspective, we present an Evolvable-Basis Binarizer (EBB) to enable a smooth evolution of DMs from full-precision to accurately binarized. EBB enhances information representation in the initial stage through the flexible combination of multiple binary bases and applies regularization to evolve into efficient single-basis binarization. The evolution only occurs in the head and tail of the DM architecture to retain the stability of training. From the optimization perspective, a Low-rank Representation Mimicking (LRM) is applied to assist the optimization of binarized DMs. The LRM mimics the representations of full-precision DMs in low-rank space, alleviating the direction ambiguity of the optimization process caused by fine-grained alignment. Comprehensive experiments demonstrate that BinaryDM achieves significant accuracy and efficiency gains compared to SOTA quantization methods of DMs under ultra-low bit-widths. With 1-bit weight and 4-bit activation (W1A4), BinaryDM achieves as low as 7.74 FID and saves the performance from collapse (baseline FID 10.87). As the first binarization method for diffusion models, W1A4 BinaryDM achieves impressive 15.2x OPs and 29.2x model size savings, showcasing its substantial potential for edge deployment.

ICML Conference 2025 Conference Paper

DA-KD: Difficulty-Aware Knowledge Distillation for Efficient Large Language Models

  • Changyi He
  • Yifu Ding
  • Jinyang Guo
  • Ruihao Gong
  • Haotong Qin
  • Xianglong Liu 0001

Although knowledge distillation (KD) is an effective approach to improve the performance of a smaller LLM (i.e., the student model) by transferring knowledge from a large LLM (i.e., the teacher model), it still suffers from high training cost. Existing LLM distillation methods ignore the difficulty difference among different samples, making the distillation of easy samples unnecessary. This leads to high distillation cost. In this paper, we propose a difficulty-aware knowledge distillation (DA-KD) framework for efficient knowledge distillation, in which we dynamically adjust the distillation dataset based on the difficulty of samples. We further observe that the existing KD loss cannot perform well when most samples in the distillation dataset are difficult, because of unstable optimization and the neglect of hard samples. Therefore, we also propose a new KD loss called bidirectional discrepancy loss (BDL) for effective KD. Extensive experiments demonstrate that our DA-KD framework is effective and efficient. Without bells and whistles, DA-KD can outperform existing state-of-the-art KD methods by 2% with half the training cost and even surpass the teacher model with 4.7$\times$ compression.
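A minimal sketch of the two ingredients follows: per-sample difficulty filtering, and a bidirectional loss. The 50/50 forward/reverse KL mix below is an assumed stand-in for the paper's BDL, not its exact form.

```python
# Difficulty-aware distillation sketch: keep hard samples, distill bidirectionally.
import torch
import torch.nn.functional as F

def bidirectional_kl(student_logits, teacher_logits, tau: float = 1.0):
    sp = F.log_softmax(student_logits / tau, dim=-1)
    tp = F.log_softmax(teacher_logits / tau, dim=-1)
    fwd = F.kl_div(sp, tp, log_target=True, reduction="batchmean")
    rev = F.kl_div(tp, sp, log_target=True, reduction="batchmean")
    return 0.5 * (fwd + rev)   # assumed symmetric mix

def select_hard(per_sample_loss: torch.Tensor, keep_ratio: float = 0.5):
    """Indices of the hardest samples by the student's per-sample loss."""
    k = max(1, int(keep_ratio * per_sample_loss.numel()))
    return per_sample_loss.topk(k).indices
```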

NeurIPS Conference 2025 Conference Paper

Exploring Semantic-constrained Adversarial Example with Instruction Uncertainty Reduction

  • Jin Hu
  • Jiakai Wang
  • Linna Jing
  • Haolin Li
  • Haodong Liu
  • Haotong Qin
  • Aishan Liu
  • Ke Xu

Recently, semantically constrained adversarial examples (SemanticAE), which are directly generated from natural language instructions, have become a promising avenue for future research due to their flexible attacking forms, but have not been thoroughly explored yet. To generate SemanticAEs, current methods fall short of satisfactory attacking ability as the key underlying factors of semantic uncertainty in human instructions, such as $\textit{referring diversity}$, $\textit{descriptive incompleteness}$, and $\textit{boundary ambiguity}$, have not been fully investigated. To tackle the issues, this paper develops a multi-dimensional $\textbf{ins}$truction $\textbf{u}$ncertainty $\textbf{r}$eduction ($\textbf{InSUR}$) framework to generate more satisfactory SemanticAE, $\textit{i.e.}$, transferable, adaptive, and effective. Specifically, in the dimension of the sampling method, we propose the residual-driven attacking direction stabilization to alleviate the unstable adversarial optimization caused by the diversity of language references. By coarsely predicting the language-guided sampling process, the optimization process will be stabilized by the designed ResAdv-DDIM sampler, therefore releasing the transferable and robust adversarial capability of multi-step diffusion models. In task modeling, we propose the context-encoded attacking scenario constraint to supplement the missing knowledge from incomplete human instructions. Guidance masking and renderer integration are proposed to regulate the constraints of 2D/3D SemanticAE, activating stronger scenario-adapted attacks. Moreover, in the dimension of generator evaluation, we propose the semantic-abstracted attacking evaluation enhancement by clarifying the evaluation boundary based on the label taxonomy, facilitating the development of more effective SemanticAE generators. Extensive experiments demonstrate the superiority of the transfer attack performance of InSUR. Besides, it is worth highlighting that we realize the reference-free generation of semantically constrained 3D adversarial examples by utilizing language-guided 3D generation models for the first time.

IROS Conference 2025 Conference Paper

FSDP: Fast and Safe Data-Driven Overtaking Trajectory Planning for Head-to-Head Autonomous Racing Competitions

  • Cheng Hu
  • Jihao Huang
  • Wule Mao
  • Yonghao Fu
  • Xuemin Chi
  • Haotong Qin
  • Nicolas Baumann
  • Zhitao Liu

Generating overtaking trajectories in autonomous racing is a challenging task, as the trajectory must satisfy the vehicle's dynamics and ensure safety and real-time performance when running on resource-constrained hardware. This work proposes the Fast and Safe Data-Driven Planner (FSDP) to address this challenge. Sparse Gaussian predictions are introduced to improve both the computational efficiency and accuracy of opponent predictions. Furthermore, the proposed approach employs a bi-level quadratic programming framework to generate an overtaking trajectory leveraging the opponent predictions. The first level uses polynomial fitting to generate a rough trajectory, from which reference states and control inputs are derived for the second level. The second level formulates a model predictive control optimization problem in the Frenet frame, generating a trajectory that satisfies both kinematic feasibility and safety. Experimental results on the F1TENTH platform show that our method outperforms the state of the art, achieving an 8.93% higher overtaking success rate, allowing the maximum opponent speed, ensuring a smoother ego trajectory, and reducing computation time by 74.04% compared to the Predictive Spliner method. The code is available at: https://github.com/ZJU-DDRX/FSDP.
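The first level of the bi-level scheme is plain curve fitting. A toy sketch with made-up waypoints (the second-level Frenet-frame MPC is omitted):

```python
# Rough reference trajectory via polynomial fitting; waypoints are illustrative.
import numpy as np

s = np.array([0.0, 5.0, 10.0, 15.0, 20.0])   # arc length along the track [m]
d = np.array([0.0, 0.4, 1.2, 0.6, 0.0])      # lateral offset for the overtake [m]

coeffs = np.polyfit(s, d, deg=4)             # rough lateral profile d(s)
s_ref = np.linspace(0.0, 20.0, 50)
d_ref = np.polyval(coeffs, s_ref)            # reference states for the MPC level
```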

AAAI Conference 2025 Conference Paper

MPQ-DM: Mixed Precision Quantization for Extremely Low Bit Diffusion Models

  • Weilun Feng
  • Haotong Qin
  • Chuanguang Yang
  • Zhulin An
  • Libo Huang
  • Boyu Diao
  • Fei Wang
  • Renshuai Tao

Diffusion models have received wide attention in generation tasks. However, the expensive computation cost prevents the application of diffusion models in resource-constrained scenarios. Quantization emerges as a practical solution that significantly saves storage and computation by reducing the bit-width of parameters. However, existing quantization methods for diffusion models still cause severe degradation in performance, especially under extremely low bit-widths (2-4 bit). The primary decrease in performance comes from the significant discretization of activation values at low-bit quantization. Too few activation candidates are unfriendly for quantizing weight channels with significant outliers, and the discretized features prevent stable learning over different time steps of the diffusion model. This paper presents MPQ-DM, a Mixed-Precision Quantization method for Diffusion Models. The proposed MPQ-DM mainly relies on two techniques: (1) To mitigate the quantization error caused by outlier-severe weight channels, we propose an Outlier-Driven Mixed Quantization (OMQ) technique that uses kurtosis to quantify outlier-salient channels and applies optimized intra-layer mixed-precision bit-width allocation to recover accuracy performance within the target efficiency. (2) To robustly learn representations across time steps, we construct a Time-Smoothed Relation Distillation (TRD) scheme between the quantized diffusion model and its full-precision counterpart, transferring discrete and continuous latents to a unified relation space to reduce representation inconsistency. Comprehensive experiments demonstrate that MPQ-DM achieves significant accuracy gains under extremely low bit-widths compared with SOTA quantization methods. MPQ-DM achieves a 58% FID decrease under the W2A4 setting compared with the baseline, while all other methods collapse.
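As a rough illustration of the OMQ idea, the sketch below scores each output channel by the kurtosis of its weights and grants the heaviest-tailed channels a higher bit-width; the bit menu and the 25% fraction are illustrative assumptions, not the paper's allocation rule.

```python
# Kurtosis-driven intra-layer mixed-precision bit allocation (illustrative).
import torch

def channel_kurtosis(W: torch.Tensor) -> torch.Tensor:
    """Kurtosis of each output channel's weights; W is (out_channels, in_features)."""
    mu = W.mean(dim=1, keepdim=True)
    var = W.var(dim=1, unbiased=False)
    return ((W - mu) ** 4).mean(dim=1) / (var ** 2 + 1e-8)

def allocate_bits(W: torch.Tensor, base_bits=2, hi_bits=4, frac_hi=0.25):
    """Give the most heavy-tailed (outlier-prone) channels a higher bit-width."""
    k = channel_kurtosis(W)
    bits = torch.full((W.shape[0],), float(base_bits))
    n_hi = max(1, int(frac_hi * W.shape[0]))
    bits[k.topk(n_hi).indices] = float(hi_bits)
    return bits
```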

AAAI Conference 2025 Conference Paper

ODDN: Addressing Unpaired Data Challenges in Open-World Deepfake Detection on Online Social Networks

  • Renshuai Tao
  • Manyi Le
  • Chuangchuang Tan
  • Huan Liu
  • Haotong Qin
  • Yao Zhao

Despite significant advances in deepfake detection, handling varying image quality, especially due to different compressions on online social networks (OSNs), remains challenging. Current methods succeed by leveraging correlations between paired images, whether raw or compressed. However, in open-world scenarios, paired data is scarce, with compressed images readily available but corresponding raw versions difficult to obtain. This imbalance, where unpaired data vastly outnumbers paired data, often leads to reduced detection performance, as existing methods struggle without corresponding raw images. To overcome this issue, we propose a novel approach named the open-world deepfake detection network (ODDN), which comprises two core modules: open-world data aggregation (ODA) and compression-discard gradient correction (CGC). ODA effectively aggregates correlations between compressed and raw samples through both fine-grained and coarse-grained analyses for paired and unpaired data, respectively. CGC incorporates a compression-discard gradient correction to further enhance performance across diverse compression methods in OSN. This technique optimizes the training gradient to ensure the model remains insensitive to compression variations. Extensive experiments conducted on 17 popular deepfake datasets demonstrate the superiority of the ODDN over SOTA baselines.

ICML Conference 2025 Conference Paper

Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers

  • Weilun Feng
  • Chuanguang Yang
  • Haotong Qin
  • Xiangqi Li
  • Yu Wang
  • Zhulin An
  • Libo Huang
  • Boyu Diao

Diffusion transformers (DiT) have demonstrated exceptional performance in video generation. However, their large number of parameters and high computational complexity limit their deployment on edge devices. Quantization can reduce storage requirements and accelerate inference by lowering the bit-width of model parameters. Yet, existing quantization methods for image generation models do not generalize well to video generation tasks. We identify two primary challenges: the loss of information during quantization and the misalignment between optimization objectives and the unique requirements of video generation. To address these challenges, we present Q-VDiT, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the Token-aware Quantization Estimator (TQE), which compensates for quantization errors in both the token and feature dimensions. From the optimization perspective, we introduce Temporal Maintenance Distillation (TMD), which preserves the spatiotemporal correlations between frames and enables the optimization of each frame with respect to the overall video context. Our W3A6 Q-VDiT achieves a scene consistency score of 23.40, setting a new benchmark and outperforming the current state-of-the-art quantization methods by 1.9$\times$.

ICML Conference 2025 Conference Paper

SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

  • Wei Huang 0042
  • Haotong Qin
  • Yangdong Liu
  • Yawei Li 0001
  • Qinshuo Liu
  • Xianglong Liu 0001
  • Luca Benini
  • Michele Magno

Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). However, while uniform-precision quantization is computationally efficient, it often compromises model performance. To address this, we propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths group-wise with high accuracy. Our approach leverages the observation that important weights follow a structured distribution and introduces two key components: 1) Salience-Determined Bit Allocation adaptively assigns bit-widths to groups within each layer based on their salience; and 2) Salience-Weighted Quantizer Calibration optimizes quantizer parameters by incorporating element-level salience, retaining essential information. With its structured group-wise partitioning, SliM-LLM provides a hardware-friendly solution that matches the efficiency of uniform quantization methods while significantly improving accuracy. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths. For example, a 2-bit quantized LLaMA-7B model reduces memory usage by nearly 6x compared to the floating-point baseline, decreases perplexity by 48% compared to state-of-the-art gradient-free PTQ methods, and maintains GPU inference speed. Additionally, the extended version, SliM-LLM+, which incorporates gradient-based quantization, further reduces perplexity by 35.1%. Our code is available at https://github.com/Aaronhuang-778/SliM-LLM.
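A hedged sketch of group-wise salience-driven allocation: group salience is approximated below by weight energy times the mean squared calibration activation (a common proxy, not necessarily the paper's metric), and the highest-salience groups receive an extra bit.

```python
# Group-wise salience scoring and bit assignment (illustrative proxy metric).
import torch

def group_salience(W: torch.Tensor, act_sq_mean: torch.Tensor, group_size=128):
    """W: (out, in); act_sq_mean: (in,) mean squared calibration activations.
    Assumes in_features is divisible by group_size."""
    sal = (W ** 2) * act_sq_mean[None, :]             # element-level salience proxy
    groups = sal.reshape(W.shape[0], -1, group_size)  # (out, n_groups, group_size)
    return groups.sum(dim=(0, 2))                     # one score per input group

def assign_group_bits(scores: torch.Tensor, lo=2, hi=3, frac_hi=0.5):
    bits = torch.full_like(scores, float(lo))
    n_hi = int(frac_hi * scores.numel())
    bits[scores.topk(n_hi).indices] = float(hi)       # average stays near the budget
    return bits
```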

NeurIPS Conference 2024 Conference Paper

2DQuant: Low-bit Post-Training Quantization for Image Super-Resolution

  • Kai Liu
  • Haotong Qin
  • Yong Guo
  • Xin Yuan
  • Linghe Kong
  • Guihai Chen
  • Yulun Zhang

Low-bit quantization has become widespread for compressing image super-resolution (SR) models for edge deployment, which allows advanced SR models to enjoy compact low-bit parameters and efficient integer/bitwise constructions for storage compression and inference acceleration, respectively. However, it is notorious that low-bit quantization degrades the accuracy of SR models compared to their full-precision (FP) counterparts. Despite several efforts to alleviate the degradation, the transformer-based SR model still suffers severe degradation due to its distinctive activation distribution. In this work, we present a dual-stage low-bit post-training quantization (PTQ) method for image super-resolution, namely 2DQuant, which achieves efficient and accurate SR under low-bit quantization. The proposed method first investigates the weights and activations and finds that their distributions are characterized by coexisting symmetry and asymmetry, with long tails. Specifically, we propose Distribution-Oriented Bound Initialization (DOBI), using different searching strategies to search a coarse bound for quantizers. To obtain refined quantizer parameters, we further propose Distillation Quantization Calibration (DQC), which employs a distillation approach to make the quantized model learn from its FP counterpart. Through extensive experiments on different bits and scaling factors, the performance of DOBI can reach the state of the art (SOTA), while after stage two our method surpasses existing PTQ methods in both metrics and visual effects. 2DQuant gains an increase in PSNR as high as 4.52 dB on Set5 (x2) compared with SOTA when quantized to 2-bit, and enjoys a 3.60x compression ratio and 5.08x speedup ratio. The code and models are available at https://github.com/Kai-Liu001/2DQuant.
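A coarse bound search in the spirit of DOBI can be as simple as scanning clipping percentiles and keeping the pair with the lowest quantization MSE on calibration data; the percentile grid below is an illustrative choice, not the paper's search strategy.

```python
# Grid search over clipping bounds that minimizes quantization MSE.
import torch

def search_bounds(x: torch.Tensor, bits: int = 2, n_cand: int = 32):
    qmax = 2 ** bits - 1
    best, best_err = None, float("inf")
    for p in torch.linspace(0.90, 1.0, n_cand):
        lo, hi = torch.quantile(x, 1 - p), torch.quantile(x, p)
        scale = (hi - lo) / qmax
        q = torch.clamp(torch.round((x - lo) / scale), 0, qmax) * scale + lo
        err = ((q - x) ** 2).mean().item()
        if err < best_err:
            best, best_err = (lo.item(), hi.item()), err
    return best

x = torch.randn(4096)
print(search_bounds(x, bits=2))   # coarse bounds to refine in a second stage
```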

ICML Conference 2024 Conference Paper

Accurate LoRA-Finetuning Quantization of LLMs via Information Retention

  • Haotong Qin
  • Xudong Ma
  • Xingyu Zheng
  • Xiaoyang Li
  • Yang Zhang 0088
  • Shouda Liu
  • Jie Luo 0004
  • Xianglong Liu 0001

The LoRA-finetuning quantization of LLMs has been extensively studied to obtain accurate yet compact LLMs for deployment on resource-constrained hardware. However, existing methods cause the quantized LLM to severely degrade and even fail to benefit from the finetuning of LoRA. This paper proposes a novel IR-QLoRA for pushing quantized LLMs with LoRA to be highly accurate through information retention. The proposed IR-QLoRA mainly relies on two technologies derived from the perspective of unified information: (1) statistics-based Information Calibration Quantization allows the quantized parameters of the LLM to retain original information accurately; (2) finetuning-based Information Elastic Connection makes LoRA utilize elastic representation transformation with diverse information. Comprehensive experiments show that IR-QLoRA can significantly improve accuracy across LLaMA and LLaMA2 families under 2-4 bit-widths, e.g., 4-bit LLaMA-7B achieves a 1.4% improvement on MMLU compared with the state-of-the-art methods. The significant performance gain requires only a tiny 0.31% additional time consumption, revealing the satisfactory efficiency of our IR-QLoRA. We highlight that IR-QLoRA enjoys excellent versatility, compatible with various frameworks (e.g., NormalFloat and Integer quantization), and brings general accuracy gains. The code is available at https://github.com/htqin/ir-qlora.

NeurIPS Conference 2024 Conference Paper

BiDM: Pushing the Limit of Quantization for Diffusion Models

  • Xingyu Zheng
  • Xianglong Liu
  • Yichen Bian
  • Xudong Ma
  • Yulun Zhang
  • Jiakai Wang
  • Jinyang Guo
  • Haotong Qin

Diffusion models (DMs) have been significantly developed and widely used in various applications due to their excellent generative qualities. However, the expensive computation and massive parameters of DMs hinder their practical use in resource-constrained scenarios. As one of the effective compression approaches, quantization allows DMs to achieve storage saving and inference acceleration by reducing bit-width while maintaining generation performance. However, as the most extreme quantization form, 1-bit binarization causes the generation performance of DMs to face severe degradation or even collapse. This paper proposes a novel method, namely BiDM, for fully binarizing weights and activations of DMs, pushing quantization to the 1-bit limit. From a temporal perspective, we introduce the Timestep-friendly Binary Structure (TBS), which uses learnable activation binarizers and cross-timestep feature connections to address the highly timestep-correlated activation features of DMs. From a spatial perspective, we propose Space Patched Distillation (SPD) to address the difficulty of matching binary features during distillation, focusing on the spatial locality of image generation tasks and noise estimation networks. As the first work to fully binarize DMs, the W1A1 BiDM on the LDM-4 model for LSUN-Bedrooms 256$\times$256 achieves a remarkable FID of 22.74, significantly outperforming the current state-of-the-art general binarization methods with an FID of 59.44 and invalid generative samples, and achieves excellent savings of up to 28.0 times in storage and 52.7 times in OPs.

ICML Conference 2024 Conference Paper

BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

  • Wei Huang 0042
  • Yangdong Liu
  • Haotong Qin
  • Ying Li 0122
  • Shiming Zhang
  • Xianglong Liu 0001
  • Michele Magno
  • Xiaojuan Qi 0001

Pretrained large language models (LLMs) exhibit exceptional general language processing capabilities but come with significant demands on memory and computational resources. As a powerful compression technology, binarization can extremely reduce model weights to a mere 1 bit, lowering the expensive computation and memory requirements. However, existing quantization techniques fall short of maintaining LLM performance under ultra-low bit-widths. In response to this challenge, we present BiLLM, a groundbreaking 1-bit post-training quantization scheme tailored for pretrained LLMs. Based on the weight distribution of LLMs, BiLLM first identifies and structurally selects salient weights, and minimizes the compression loss through an effective binary residual approximation strategy. Moreover, considering the bell-shaped distribution of the non-salient weights, we propose an optimal splitting search to group and binarize them accurately. BiLLM, for the first time, achieves high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families and evaluation metrics, outperforming SOTA quantization methods for LLMs by significant margins. Moreover, BiLLM enables the binarization of a 7-billion-parameter LLM within 0.5 hours on a single GPU, demonstrating satisfactory time efficiency. Our code is available at https://github.com/Aaronhuang-778/BiLLM.
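The binary residual approximation is easy to state: binarize once, then binarize what is left over. A minimal sketch (the salience grouping and bell-shaped splitting search are omitted):

```python
# Two-term binary residual approximation: W ~ a1*B1 + a2*B2.
import torch

def residual_binarize(W: torch.Tensor) -> torch.Tensor:
    a1 = W.abs().mean(dim=1, keepdim=True)
    B1 = torch.sign(W); B1[B1 == 0] = 1.0
    R = W - a1 * B1                         # residual after the first pass
    a2 = R.abs().mean(dim=1, keepdim=True)
    B2 = torch.sign(R); B2[B2 == 0] = 1.0
    return a1 * B1 + a2 * B2

W = torch.randn(4, 8)
one_term = W.abs().mean() * torch.sign(W)
print((W - residual_binarize(W)).norm(), (W - one_term).norm())  # residual term helps
```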

NeurIPS Conference 2024 Conference Paper

Binarized Diffusion Model for Image Super-Resolution

  • Zheng Chen
  • Haotong Qin
  • Yong Guo
  • Xiongfei Su
  • Xin Yuan
  • Linghe Kong
  • Yulun Zhang

Advanced diffusion models (DMs) perform impressively in image super-resolution (SR), but the high memory and computational costs hinder their deployment. Binarization, an ultra-compression algorithm, offers the potential for effectively accelerating DMs. Nonetheless, due to the model structure and the multi-step iterative attribute of DMs, existing binarization methods result in significant performance degradation. In this paper, we introduce a novel binarized diffusion model, BI-DiffSR, for image SR. First, for the model structure, we design a UNet architecture optimized for binarization. We propose the consistent-pixel-downsample (CP-Down) and consistent-pixel-upsample (CP-Up) to maintain dimension consistency and facilitate full-precision information transfer. Meanwhile, we design the channel-shuffle-fusion (CS-Fusion) to enhance feature fusion in the skip connection. Second, for the activation difference across timesteps, we design the timestep-aware redistribution (TaR) and activation function (TaA). The TaR and TaA dynamically adjust the distribution of activations based on different timesteps, improving the flexibility and representation ability of the binarized module. Comprehensive experiments demonstrate that our BI-DiffSR outperforms existing binarization methods. Code is released at: https://github.com/zhengchen1999/BI-DiffSR.

ICML Conference 2024 Conference Paper

Compressing Large Language Models by Joint Sparsification and Quantization

  • Jinyang Guo
  • Jianyu Wu
  • Zining Wang
  • Jiaheng Liu
  • Ge Yang
  • Yifu Ding
  • Ruihao Gong
  • Haotong Qin

In this paper, we introduce a novel model compression technique named Joint Sparsification and Quantization (JSQ), explicitly tailored for large language models (LLMs). Traditional methods employ either sparsification or quantization individually to compress LLMs, leading to performance degradation at high compression ratios. In contrast, our JSQ approach integrates sparsification and quantization cohesively. As sparsification tends to preserve outliers that are harmful to quantization, we introduce a novel sparsity metric that serves as a bridge between sparsification and quantization. Moreover, outliers in LLMs are known to have a significant impact yet are harmful to compression. Current solutions are highly coupled with the quantization process, which is not helpful to sparsification. To this end, we also introduce a search-based activation editor to automatically eliminate relatively useless outliers. Comprehensive experiments across various datasets and architectures affirm the efficacy of our JSQ framework. Notably, our JSQ achieves 7.96$\times$ computation reduction without crashing for the representative model LLaMA. This accomplishment stands in stark contrast to the limitations of most state-of-the-art LLM compression methods, which typically fail under such extreme compression ratios. Our code is released at https://github.com/uanu2002/JSQ.

ICML Conference 2024 Conference Paper

Flexible Residual Binarization for Image Super-Resolution

  • Yulun Zhang 0001
  • Haotong Qin
  • Zixiang Zhao
  • Xianglong Liu 0001
  • Martin Danelljan
  • Fisher Yu 0001

Binarized image super-resolution (SR) has attracted much research attention due to its potential to drastically reduce parameters and operations. However, most binary SR works binarize network weights directly, which hinders high-frequency information extraction. Furthermore, as a pixel-wise reconstruction task, binarization often results in heavy representation content distortion. To address these issues, we propose a flexible residual binarization (FRB) method for image SR. We first propose a second-order residual binarization (SRB), to counter the information loss caused by binarization. In addition to the primary weight binarization, we also binarize the reconstruction error, which is added as a residual term in the prediction. Furthermore, to narrow the representation content gap between the binarized and full-precision networks, we propose Distillation-guided Binarization Training (DBT). We uniformly align the contents of different bit widths by constructing a normalized attention form. Finally, we generalize our method by applying our FRB to binarize convolution and Transformer-based SR networks, resulting in two binary baselines: FRBC and FRBT. We conduct extensive experiments and comparisons with recent leading binarization methods. Our proposed baselines, FRBC and FRBT, achieve superior performance both quantitatively and visually. The code and model will be released.

ICML Conference 2024 Conference Paper

Image Fusion via Vision-Language Model

  • Zixiang Zhao
  • Lilun Deng
  • Haowen Bai
  • Yukun Cui
  • Zhipeng Zhang
  • Yulun Zhang 0001
  • Haotong Qin
  • Dongdong Chen

Image fusion integrates essential information from multiple images into a single composite, enhancing structures, textures, and refining imperfections. Existing methods predominantly focus on pixel-level and semantic visual features for recognition, but often overlook the deeper text-level semantic information beyond vision. Therefore, we introduce a novel fusion paradigm named image Fusion via vIsion-Language Model (FILM), for the first time, utilizing explicit textual information from source images to guide the fusion process. Specifically, FILM generates semantic prompts from images and inputs them into ChatGPT for comprehensive textual descriptions. These descriptions are fused within the textual domain and guide the visual information fusion, enhancing feature extraction and contextual understanding, directed by textual semantic information via cross-attention. FILM has shown promising results in four image fusion tasks: infrared-visible, medical, multi-exposure, and multi-focus image fusion. We also propose a vision-language dataset containing ChatGPT-generated paragraph descriptions for the eight image fusion datasets across four fusion tasks, facilitating future research in vision-language model-based image fusion. Code and dataset are available at https://github.com/Zhaozixiang1228/IF-FILM.

NeurIPS Conference 2024 Conference Paper

LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment

  • Ge Yang
  • Changyi He
  • Jinyang Guo
  • Jianyu Wu
  • Yifu Ding
  • Aishan Liu
  • Haotong Qin
  • Pengliang Ji

Although large language models (LLMs) have demonstrated their strong intelligence ability, the high demand for computation and storage hinders their practical application. To this end, many model compression techniques have been proposed to increase the efficiency of LLMs. However, current research only validates methods on limited models, datasets, and metrics, and still lacks a comprehensive evaluation under more general scenarios. It thus remains an open question which model compression approach to use in a specific case. To mitigate this gap, we present the Large Language Model Compression Benchmark (LLMCBench), a rigorously designed benchmark with an in-depth analysis of LLM compression algorithms. We first analyze the actual model production requirements and carefully design evaluation tracks and metrics. Then, we conduct extensive experiments and comparisons using multiple mainstream LLM compression approaches. Finally, we perform an in-depth analysis based on the evaluation and provide useful insights for LLM compression design. We hope our LLMCBench can contribute insightful suggestions for LLM compression algorithm design and serve as a foundation for future research.

ICML Conference 2023 Conference Paper

BiBench: Benchmarking and Analyzing Network Binarization

  • Haotong Qin
  • Mingyuan Zhang
  • Yifu Ding
  • Aoyu Li
  • Zhongang Cai
  • Ziwei Liu 0002
  • Fisher Yu 0001
  • Xianglong Liu 0001

Network binarization emerges as one of the most promising compression approaches, offering extraordinary computation and memory savings by minimizing the bit-width. However, recent research has shown that applying existing binarization algorithms to diverse tasks, architectures, and hardware in realistic scenarios is still not straightforward. Common challenges of binarization, such as accuracy degradation and efficiency limitation, suggest that its attributes are not fully understood. To close this gap, we present BiBench, a rigorously designed benchmark with in-depth analysis for network binarization. We first carefully scrutinize the requirements of binarization in actual production and define evaluation tracks and metrics for a comprehensive and fair investigation. Then, we evaluate and analyze a series of milestone binarization algorithms that function at the operator level and with extensive influence. Our benchmark reveals that 1) the binarized operator has a crucial impact on the performance and deployability of binarized networks; 2) the accuracy of binarization varies significantly across different learning tasks and neural architectures; 3) binarization has demonstrated promising efficiency potential on edge devices despite the limited hardware support. The results and analysis also lead to a promising paradigm for accurate and efficient binarization. We believe that BiBench will contribute to the broader adoption of binarization and serve as a foundation for future research. The code for our BiBench is released at https://github.com/htqin/BiBench.
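The operator-level object that BiBench benchmarks is, at its core, a sign binarizer trained with a straight-through estimator (STE). A self-contained sketch of that shared building block:

```python
# Sign binarization with a clipped straight-through estimator gradient.
import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)
    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).to(grad_out.dtype)  # pass-through inside [-1, 1]

x = torch.randn(5, requires_grad=True)
BinarizeSTE.apply(x).sum().backward()
print(x.grad)   # 1 where |x| <= 1, else 0
```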

NeurIPS Conference 2023 Conference Paper

BiMatting: Efficient Video Matting via Binarization

  • Haotong Qin
  • Lei Ke
  • Xudong Ma
  • Martin Danelljan
  • Yu-Wing Tai
  • Chi-Keung Tang
  • Xianglong Liu
  • Fisher Yu

Real-time video matting on edge devices faces significant computational resource constraints, limiting the widespread use of video matting in applications such as online conferences and short-form video production. Binarization is a powerful compression approach that greatly reduces computation and memory consumption by using 1-bit parameters and bitwise operations. However, binarization of the video matting model is not a straightforward process, and our empirical analysis has revealed two primary bottlenecks: severe representation degradation of the encoder and massive redundant computations of the decoder. To address these issues, we propose BiMatting, an accurate and efficient video matting model using binarization. Specifically, we construct shrinkable and dense topologies of the binarized encoder block to enhance the extracted representation. We sparsify the binarized units to reduce the low-information decoding computation. Through extensive experiments, we demonstrate that BiMatting outperforms other binarized video matting models, including state-of-the-art (SOTA) binarization methods, by a significant margin. Our approach even performs comparably to the full-precision counterpart in visual quality. Furthermore, BiMatting achieves remarkable savings of 12.4$\times$ and 21.6$\times$ in computation and storage, respectively, showcasing its potential and advantages in real-world resource-constrained scenarios. Our code and models are released at https://github.com/htqin/BiMatting.

NeurIPS Conference 2023 Conference Paper

QuantSR: Accurate Low-bit Quantization for Efficient Image Super-Resolution

  • Haotong Qin
  • Yulun Zhang
  • Yifu Ding
  • Yifan Liu
  • Xianglong Liu
  • Martin Danelljan
  • Fisher Yu

Low-bit quantization in image super-resolution (SR) has attracted copious attention in recent research due to its ability to reduce parameters and operations significantly. However, many quantized SR models suffer from accuracy degradation compared to their full-precision counterparts, especially at ultra-low bit widths (2-4 bits), limiting their practical applications. To address this issue, we propose a novel quantized image SR network, called QuantSR, which achieves accurate and efficient SR processing under low-bit quantization. To overcome the representation homogeneity caused by quantization in the network, we introduce the Redistribution-driven Learnable Quantizer (RLQ). This is accomplished through an inference-agnostic efficient redistribution design, which adds additional information in both forward and backward passes to improve the representation ability of quantized networks. Furthermore, to achieve flexible inference and break the upper limit of accuracy, we propose the Depth-dynamic Quantized Architecture (DQA). Our DQA allows for the trade-off between efficiency and accuracy during inference through weight sharing. Our comprehensive experiments show that QuantSR outperforms existing state-of-the-art quantized SR networks in terms of accuracy while also providing more competitive computational efficiency. In addition, we demonstrate the scheme's satisfactory architecture generality by providing QuantSR-C and QuantSR-T for convolution and Transformer versions, respectively. Our code and models are released at https://github.com/htqin/QuantSR.

ICLR Conference 2022 Conference Paper

BiBERT: Accurate Fully Binarized BERT

  • Haotong Qin
  • Yifu Ding
  • Mingyuan Zhang
  • Qinghua Yan
  • Aishan Liu
  • Qingqing Dang
  • Ziwei Liu 0002
  • Xianglong Liu 0001

The large pre-trained BERT has achieved remarkable performance on Natural Language Processing (NLP) tasks but is also computation and memory expensive. As one of the powerful compression approaches, binarization extremely reduces the computation and memory consumption by utilizing 1-bit parameters and bitwise operations. Unfortunately, the full binarization of BERT (i.e., 1-bit weight, embedding, and activation) usually suffers a significant performance drop, and few studies address this problem. In this paper, with theoretical justification and empirical analysis, we identify that the severe performance drop can be mainly attributed to information degradation and optimization direction mismatch in the forward and backward propagation, respectively, and propose BiBERT, an accurate fully binarized BERT, to eliminate the performance bottlenecks. Specifically, BiBERT introduces an efficient Bi-Attention structure for maximizing representation information statistically and a Direction-Matching Distillation (DMD) scheme to optimize the fully binarized BERT accurately. Extensive experiments show that BiBERT outperforms both the straightforward baseline and existing state-of-the-art quantized BERTs with ultra-low bit activations by convincing margins on the NLP benchmark. As the first fully binarized BERT, our method yields impressive 56.3 times and 31.2 times savings on FLOPs and model size, demonstrating the vast advantages and potential of the fully binarized BERT model in real-world resource-constrained scenarios.

IJCAI Conference 2022 Conference Paper

BiFSMN: Binary Neural Network for Keyword Spotting

  • Haotong Qin
  • Xudong Ma
  • Yifu Ding
  • Xiaoyang Li
  • Yang Zhang
  • Yao Tian
  • Zejun Ma
  • Jie Luo

Deep neural networks, such as the Deep-FSMN, have been widely studied for keyword spotting (KWS) applications. However, computational resources for these networks are significantly constrained since they usually run on-call on edge devices. In this paper, we present BiFSMN, an accurate and extremely efficient binary neural network for KWS. We first construct a High-frequency Enhancement Distillation scheme for the binarization-aware training, which emphasizes the high-frequency information from the full-precision network's representation that is more crucial for the optimization of the binarized network. Then, to allow instant and adaptive accuracy-efficiency trade-offs at runtime, we also propose a Thinnable Binarization Architecture to further liberate the acceleration potential of the binarized network from the topology perspective. Moreover, we implement a Fast Bitwise Computation Kernel for BiFSMN on ARMv8 devices which fully utilizes registers and increases instruction throughput to push the limit of deployment efficiency. Extensive experiments show that BiFSMN outperforms existing binarization methods by convincing margins on various datasets and is even comparable with the full-precision counterpart (e.g., less than 3% drop on Speech Commands V1-12). We highlight that, benefiting from the thinnable architecture and the optimized 1-bit implementation, BiFSMN can achieve an impressive 22.3x speedup and 15.5x storage saving on real-world edge hardware.

ICLR Conference 2021 Conference Paper

BiPointNet: Binary Neural Network for Point Clouds

  • Haotong Qin
  • Zhongang Cai
  • Mingyuan Zhang
  • Yifu Ding
  • Haiyu Zhao
  • Shuai Yi
  • Xianglong Liu 0001
  • Hao Su

To alleviate the resource constraint for real-time point cloud applications that run on edge devices, in this paper we present BiPointNet, the first model binarization approach for efficient deep learning on point clouds. We discover that the immense performance drop of binarized models for point clouds mainly stems from two challenges: aggregation-induced feature homogenization that leads to a degradation of information entropy, and scale distortion that hinders optimization and invalidates scale-sensitive structures. With theoretical justifications and in-depth analysis, our BiPointNet introduces Entropy-Maximizing Aggregation (EMA) to modulate the distribution before aggregation for the maximum information entropy, and Layer-wise Scale Recovery (LSR) to efficiently restore feature representation capacity. Extensive experiments show that BiPointNet outperforms existing binarization methods by convincing margins, at the level even comparable with the full precision counterpart. We highlight that our techniques are generic, guaranteeing significant improvements on various fundamental tasks and mainstream backbones. Moreover, BiPointNet gives an impressive 14.7× speedup and 18.9× storage saving on real-world resource-constrained devices.
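The entropy argument is concrete enough to demonstrate numerically: max-pooling the signs of many i.i.d. features collapses to +1 almost surely, while shifting the pre-aggregation distribution restores a balanced, maximum-entropy output. The Gaussian assumption below is only for the closed-form shift; it is a toy version of the EMA idea, not the paper's layer-wise derivation.

```python
# Entropy collapse under max aggregation, and an entropy-restoring shift.
import torch

n = 1024                                  # points aggregated per channel
x = torch.randn(4096, n)                  # toy pre-aggregation features

naive = torch.sign(x).max(dim=1).values
print((naive > 0).float().mean())         # ~1.0: aggregation homogenizes signs

# Choose delta so P(all n points < 0) = 0.5, i.e. per-point P(<0) = 0.5 ** (1/n).
p = 0.5 ** (1.0 / n)
delta = torch.distributions.Normal(0.0, 1.0).icdf(torch.tensor(p))
balanced = torch.sign(x - delta).max(dim=1).values
print((balanced > 0).float().mean())      # ~0.5: maximum-entropy pooled output
```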

IJCAI Conference 2021 Conference Paper

Hardware-friendly Deep Learning by Network Quantization and Binarization

  • Haotong Qin

Quantization is emerging as an efficient approach to promote hardware-friendly deep learning and run deep neural networks on resource-limited hardware. However, it still causes a significant decrease in network accuracy. We summarize the challenges of quantization into two categories: Quantization for Diverse Architectures and Quantization on Complex Scenes. Our studies focus mainly on applying quantization to various architectures and scenes and pushing the limit of quantization to extremely compress and accelerate networks. This comprehensive research on quantization will achieve more powerful, more efficient, and more flexible hardware-friendly deep learning, making it better suited to more real-world applications.