Arrow Research search

Author name cluster

Yifu Ding

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

11 papers
2 author rows

Possible papers (11)

AAAI Conference 2026 · Conference Paper

CMedBench: A Comprehensive Benchmark for Efficient Medical Large Language Models

  • Shengbo Gao
  • Jinyang Guo
  • Lixian Su
  • Yifu Ding
  • Shiqiao Gu
  • Aishan Liu
  • Yuqing Ma
  • Zhiwang Zhang

Large Language Models (LLMs) hold significant potential for enhancing healthcare applications, yet their deployment is hindered by high computational and memory demands. Model compression techniques offer solutions to reduce these demands, but their impact on medical LLMs remains underexplored. In this paper, we introduce CMedBench, the first comprehensive benchmark for evaluating compressed LLMs in medical contexts. CMedBench assesses five core dimensions: Medical Knowledge Ability, Medical Application Ability, Trustworthiness Maintenance, Compression Cross Combination, and Computational Efficiency. Through extensive empirical studies, we analyze the trade-offs between model efficiency and clinical performance across diverse models, datasets, and compression strategies. Our findings highlight critical limitations in current evaluation practices and provide a robust framework for aligning compression strategies with medical requirements. CMedBench serves as a vital resource for researchers and practitioners, guiding the development of efficient, trustworthy, and clinically effective LLMs for healthcare applications.

ICML Conference 2025 · Conference Paper

DA-KD: Difficulty-Aware Knowledge Distillation for Efficient Large Language Models

  • Changyi He
  • Yifu Ding
  • Jinyang Guo
  • Ruihao Gong
  • Haotong Qin
  • Xianglong Liu 0001

Although knowledge distillation (KD) is an effective approach to improve the performance of a smaller LLM (i.e., the student model) by transferring knowledge from a large LLM (i.e., the teacher model), it still suffers from high training cost. Existing LLM distillation methods ignore the difficulty differences among samples, so easy samples are distilled unnecessarily and the overall distillation cost stays high. In this paper, we propose a difficulty-aware knowledge distillation (DA-KD) framework for efficient knowledge distillation, in which we dynamically adjust the distillation dataset based on the difficulty of samples. We further observe that existing KD losses perform poorly when most samples in the distillation dataset are difficult, owing to unstable optimization and the neglect of hard samples. Therefore, we also propose a new KD loss called bidirectional discrepancy loss (BDL) for effective KD. Extensive experiments demonstrate that our DA-KD framework is effective and efficient. Without bells and whistles, DA-KD outperforms existing state-of-the-art KD methods by 2% with half the training cost and even surpasses the teacher model at 4.7$\times$ compression.
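A minimal sketch of the general idea, not the authors' released code: the student is distilled only on samples it still finds difficult, and the distillation objective combines forward and reverse KL terms. The threshold, the difficulty proxy, and the symmetric-KL form are illustrative assumptions; the paper's actual BDL loss and difficulty measure may differ.

```python
import torch
import torch.nn.functional as F

def bidirectional_kd_loss(student_logits, teacher_logits):
    # Symmetric sum of forward and reverse KL between teacher and student
    # token distributions; an illustrative stand-in, not the paper's exact BDL.
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    fwd = F.kl_div(s_logp, t_logp, log_target=True, reduction="batchmean")
    rev = F.kl_div(t_logp, s_logp, log_target=True, reduction="batchmean")
    return fwd + rev

def filter_by_difficulty(batches, student, teacher, threshold=0.5):
    # Keep only batches the student still finds hard, so easy samples are
    # skipped in later distillation rounds (hypothetical selection rule).
    kept = []
    for batch in batches:
        with torch.no_grad():
            difficulty = bidirectional_kd_loss(
                student(**batch).logits, teacher(**batch).logits
            ).item()
        if difficulty > threshold:
            kept.append(batch)
    return kept
```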

IJCAI Conference 2025 · Conference Paper

Unlocking the Potential of Lightweight Quantized Models for Deepfake Detection

  • Renshuai Tao
  • Ziheng Qin
  • Yifu Ding
  • Chuangchuang Tan
  • Jiakai Wang
  • Wei Wang

Deepfake detection is increasingly crucial due to the rapid rise of AI-generated content. Existing methods achieve high performance by relying on computationally intensive large models, making real-time detection on resource-constrained edge devices challenging. Given that deepfake detection is a binary classification task, there is clear potential for model compression and acceleration. In this paper, we propose a low-bit quantization framework for lightweight and efficient deepfake detection. The Connected Quantized Block extracts common forgery features via the quantized path and retains method-specific textures through the shortcut connections. Additionally, the Shifted Logarithmic Redistribution Quantizer mitigates information loss in near-zero domains by unfolding the unbalanced activations, enabling finer quantization granularity. Comprehensive experiments demonstrate that this new framework reduces computational cost by 10.8x and storage requirements by 12.4x while maintaining high detection performance, even surpassing SOTA methods while using less than 5% of their FLOPs, paving the way for efficient deepfake detection in resource-limited scenarios.

NeurIPS Conference 2025 · Conference Paper

VORTA: Efficient Video Diffusion via Routing Sparse Attention

  • Wenhao Sun
  • Rong-Cheng Tu
  • Yifu Ding
  • Jingyi Liao
  • Zhao Jin
  • Shunyu Liu
  • Dacheng Tao

Video diffusion transformers have achieved remarkable progress in high-quality video generation, but remain computationally expensive due to the quadratic complexity of attention over high-dimensional video sequences. Recent acceleration methods improve efficiency by exploiting the local sparsity of attention scores, yet they often struggle to accelerate long-range computation. To address this problem, we propose VORTA, an acceleration framework with two novel components: 1) a sparse attention mechanism that efficiently captures long-range dependencies, and 2) a routing strategy that adaptively replaces full 3D attention with specialized sparse attention variants. VORTA achieves a $1.76\times$ end-to-end speedup without loss of quality on VBench. Furthermore, it can seamlessly integrate with various other acceleration methods, such as model caching and step distillation, reaching up to a $14.41\times$ speedup with negligible performance degradation. VORTA demonstrates its efficiency and enhances the practicality of video diffusion transformers in real-world settings. Code and weights are available at https://github.com/wenhao728/VORTA.

ICML Conference 2024 · Conference Paper

Compressing Large Language Models by Joint Sparsification and Quantization

  • Jinyang Guo
  • Jianyu Wu
  • Zining Wang
  • Jiaheng Liu
  • Ge Yang
  • Yifu Ding
  • Ruihao Gong
  • Haotong Qin

In this paper, we introduce a novel model compression technique named Joint Sparsification and Quantization (JSQ), explicitly tailored for large language models (LLMs). Traditional methods employ either sparsification or quantization individually to compress LLMs, leading to performance degradation at high compression ratios. In contrast, our JSQ approach integrates sparsification and quantization cohesively. As sparsification tends to preserve outliers that are harmful to quantization, we introduce a novel sparsity metric that serves as a bridge between sparsification and quantization. Moreover, outliers in LLMs are known to have a significant impact on performance yet are harmful to compression, and current outlier-handling solutions are tightly coupled with the quantization process, which does not help sparsification. To this end, we also introduce a search-based activation editor to automatically eliminate relatively useless outliers. Comprehensive experiments across various datasets and architectures affirm the efficacy of our JSQ framework. Notably, JSQ achieves a 7.96$\times$ computation reduction on the representative LLaMA model without performance collapse. This accomplishment stands in stark contrast to the limitations of most state-of-the-art LLM compression methods, which typically fail under such extreme compression ratios. Our code is released at https://github.com/uanu2002/JSQ.
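As a rough illustration of what joint sparsification and quantization of a single weight matrix can look like (magnitude pruning followed by per-channel uniform quantization), the sketch below is generic and does not implement JSQ's sparsity metric or its search-based activation editor; the `sparsity` and `bits` defaults are hypothetical.

```python
import torch

def prune_and_quantize(weight: torch.Tensor, sparsity: float = 0.5, bits: int = 4):
    # weight is assumed to be a 2-D matrix of shape (out_features, in_features).
    # 1) Sparsification: zero out the smallest-magnitude weights.
    k = max(1, int(weight.numel() * sparsity))
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold
    sparse_w = weight * mask

    # 2) Quantization: symmetric uniform quantizer per output channel.
    qmax = 2 ** (bits - 1) - 1
    scale = sparse_w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(sparse_w / scale), -qmax - 1, qmax)
    return q * scale, mask  # dequantized weights and the pruning mask
```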

NeurIPS Conference 2024 · Conference Paper

LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment

  • Ge Yang
  • Changyi He
  • Jinyang Guo
  • Jianyu Wu
  • Yifu Ding
  • Aishan Liu
  • Haotong Qin
  • Pengliang Ji

Although large language models (LLMs) have demonstrated strong capabilities, their high demand for computation and storage hinders practical application. To this end, many model compression techniques have been proposed to increase the efficiency of LLMs. However, current research only validates these methods on limited models, datasets, and metrics, and still lacks a comprehensive evaluation under more general scenarios, so it remains unclear which model compression approach to use in a specific case. To mitigate this gap, we present the Large Language Model Compression Benchmark (LLMCBench), a rigorously designed benchmark with an in-depth analysis of LLM compression algorithms. We first analyze the actual model production requirements and carefully design evaluation tracks and metrics. Then, we conduct extensive experiments and comparisons using multiple mainstream LLM compression approaches. Finally, we perform an in-depth analysis based on the evaluation and provide useful insights for LLM compression design. We hope LLMCBench can contribute insightful suggestions for LLM compression algorithm design and serve as a foundation for future research.

ICML Conference 2023 · Conference Paper

BiBench: Benchmarking and Analyzing Network Binarization

  • Haotong Qin
  • Mingyuan Zhang
  • Yifu Ding
  • Aoyu Li
  • Zhongang Cai
  • Ziwei Liu 0002
  • Fisher Yu 0001
  • Xianglong Liu 0001

Network binarization emerges as one of the most promising compression approaches, offering extraordinary computation and memory savings by minimizing the bit-width. However, recent research has shown that applying existing binarization algorithms to diverse tasks, architectures, and hardware in realistic scenarios is still not straightforward. Common challenges of binarization, such as accuracy degradation and limited efficiency gains, suggest that its attributes are not fully understood. To close this gap, we present BiBench, a rigorously designed benchmark with in-depth analysis for network binarization. We first carefully scrutinize the requirements of binarization in actual production and define evaluation tracks and metrics for a comprehensive and fair investigation. Then, we evaluate and analyze a series of milestone binarization algorithms that operate at the operator level and have been broadly influential. Our benchmark reveals that 1) the binarized operator has a crucial impact on the performance and deployability of binarized networks; 2) the accuracy of binarization varies significantly across different learning tasks and neural architectures; 3) binarization has demonstrated promising efficiency potential on edge devices despite the limited hardware support. The results and analysis also lead to a promising paradigm for accurate and efficient binarization. We believe that BiBench will contribute to the broader adoption of binarization and serve as a foundation for future research. The code for BiBench is released at https://github.com/htqin/BiBench.
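Since the benchmark works at the operator level, a generic example of the kind of binarized operator it evaluates may help: sign binarization of weights and activations with a clipped straight-through estimator and a per-layer scaling factor. This is a textbook-style sketch, not any specific algorithm covered by BiBench.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a clipped straight-through estimator."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Pass gradients only where |x| <= 1 (the usual clipped STE).
        return grad_out * (x.abs() <= 1).to(grad_out.dtype)

class BinaryLinear(nn.Linear):
    """1-bit weights and activations with a per-layer scaling factor."""
    def forward(self, x):
        bw = BinarizeSTE.apply(self.weight)
        bx = BinarizeSTE.apply(x)
        scale = self.weight.abs().mean()  # recover magnitude lost by binarization
        out = F.linear(bx, bw) * scale
        return out if self.bias is None else out + self.bias
```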

NeurIPS Conference 2023 · Conference Paper

QuantSR: Accurate Low-bit Quantization for Efficient Image Super-Resolution

  • Haotong Qin
  • Yulun Zhang
  • Yifu Ding
  • Yifan Liu
  • Xianglong Liu
  • Martin Danelljan
  • Fisher Yu

Low-bit quantization in image super-resolution (SR) has attracted copious attention in recent research due to its ability to reduce parameters and operations significantly. However, many quantized SR models suffer from accuracy degradation compared to their full-precision counterparts, especially at ultra-low bit widths (2-4 bits), limiting their practical applications. To address this issue, we propose a novel quantized image SR network, called QuantSR, which achieves accurate and efficient SR processing under low-bit quantization. To overcome the representation homogeneity caused by quantization in the network, we introduce the Redistribution-driven Learnable Quantizer (RLQ). This is accomplished through an inference-agnostic efficient redistribution design, which adds additional information in both forward and backward passes to improve the representation ability of quantized networks. Furthermore, to achieve flexible inference and break the upper limit of accuracy, we propose the Depth-dynamic Quantized Architecture (DQA). Our DQA allows for a trade-off between efficiency and accuracy during inference through weight sharing. Our comprehensive experiments show that QuantSR outperforms existing state-of-the-art quantized SR networks in terms of accuracy while also providing more competitive computational efficiency. In addition, we demonstrate the scheme's architectural generality by providing QuantSR-C and QuantSR-T for convolutional and Transformer architectures, respectively. Our code and models are released at https://github.com/htqin/QuantSR.
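For readers unfamiliar with learnable quantizers, the sketch below shows a generic uniform activation quantizer with a learnable clipping range and a straight-through estimator (in the spirit of PACT/LSQ). It is not the RLQ module from the paper; the bit-width and the post-ReLU input assumption are illustrative.

```python
import torch
import torch.nn as nn

class LearnableQuantizer(nn.Module):
    """Uniform low-bit quantizer with a learnable clipping range.
    A generic sketch, not the RLQ module described in the paper."""
    def __init__(self, bits: int = 4):
        super().__init__()
        self.levels = 2 ** bits - 1
        self.alpha = nn.Parameter(torch.tensor(1.0))  # learnable clip range

    def forward(self, x):
        # Clip activations to [0, alpha]; assumes a post-ReLU (non-negative) input.
        x = torch.minimum(torch.clamp(x, min=0.0), self.alpha)
        step = self.alpha / self.levels
        q = torch.round(x / step) * step   # uniform quantization
        # Straight-through estimator: quantized forward pass, identity backward.
        return x + (q - x).detach()
```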

ICLR Conference 2022 · Conference Paper

BiBERT: Accurate Fully Binarized BERT

  • Haotong Qin
  • Yifu Ding
  • Mingyuan Zhang
  • Qinghua Yan
  • Aishan Liu
  • Qingqing Dang
  • Ziwei Liu 0002
  • Xianglong Liu 0001

The large pre-trained BERT has achieved remarkable performance on Natural Language Processing (NLP) tasks but is also expensive in computation and memory. As a powerful compression approach, binarization drastically reduces computation and memory consumption by utilizing 1-bit parameters and bitwise operations. Unfortunately, the full binarization of BERT (i.e., 1-bit weight, embedding, and activation) usually suffers a significant performance drop, and few studies address this problem. In this paper, with theoretical justification and empirical analysis, we identify that the severe performance drop can be mainly attributed to information degradation and optimization direction mismatch in the forward and backward propagation, respectively, and propose BiBERT, an accurate fully binarized BERT, to eliminate these performance bottlenecks. Specifically, BiBERT introduces an efficient Bi-Attention structure for maximizing representation information statistically and a Direction-Matching Distillation (DMD) scheme to optimize the fully binarized BERT accurately. Extensive experiments show that BiBERT outperforms both the straightforward baseline and existing state-of-the-art quantized BERTs with ultra-low bit activations by convincing margins on NLP benchmarks. As the first fully binarized BERT, our method yields impressive 56.3× and 31.2× savings in FLOPs and model size, demonstrating the vast advantages and potential of a fully binarized BERT model in real-world resource-constrained scenarios.

IJCAI Conference 2022 · Conference Paper

BiFSMN: Binary Neural Network for Keyword Spotting

  • Haotong Qin
  • Xudong Ma
  • Yifu Ding
  • Xiaoyang Li
  • Yang Zhang
  • Yao Tian
  • Zejun Ma
  • Jie Luo

Deep neural networks such as Deep-FSMN have been widely studied for keyword spotting (KWS) applications. However, computational resources for these networks are significantly constrained since they usually run on-call on edge devices. In this paper, we present BiFSMN, an accurate and extremely efficient binary neural network for KWS. We first construct a High-frequency Enhancement Distillation scheme for binarization-aware training, which emphasizes the high-frequency information from the full-precision network's representation that is more crucial for the optimization of the binarized network. Then, to allow instant and adaptive accuracy-efficiency trade-offs at runtime, we also propose a Thinnable Binarization Architecture to further liberate the acceleration potential of the binarized network from the topology perspective. Moreover, we implement a Fast Bitwise Computation Kernel for BiFSMN on ARMv8 devices which fully utilizes registers and increases instruction throughput to push the limit of deployment efficiency. Extensive experiments show that BiFSMN outperforms existing binarization methods by convincing margins on various datasets and is even comparable with the full-precision counterpart (e.g., less than a 3% drop on Speech Commands V1-12). We highlight that, benefiting from the thinnable architecture and the optimized 1-bit implementation, BiFSMN achieves an impressive 22.3× speedup and 15.5× storage saving on real-world edge hardware.

ICLR Conference 2021 · Conference Paper

BiPointNet: Binary Neural Network for Point Clouds

  • Haotong Qin
  • Zhongang Cai
  • Mingyuan Zhang
  • Yifu Ding
  • Haiyu Zhao
  • Shuai Yi
  • Xianglong Liu 0001
  • Hao Su

To alleviate the resource constraints of real-time point cloud applications that run on edge devices, in this paper we present BiPointNet, the first model binarization approach for efficient deep learning on point clouds. We discover that the immense performance drop of binarized models for point clouds mainly stems from two challenges: aggregation-induced feature homogenization that leads to a degradation of information entropy, and scale distortion that hinders optimization and invalidates scale-sensitive structures. With theoretical justifications and in-depth analysis, our BiPointNet introduces Entropy-Maximizing Aggregation (EMA) to modulate the distribution before aggregation for the maximum information entropy, and Layer-wise Scale Recovery (LSR) to efficiently restore feature representation capacity. Extensive experiments show that BiPointNet outperforms existing binarization methods by convincing margins, at a level even comparable with the full-precision counterpart. We highlight that our techniques are generic, guaranteeing significant improvements on various fundamental tasks and mainstream backbones. Moreover, BiPointNet gives an impressive 14.7× speedup and 18.9× storage saving on real-world resource-constrained devices.