Arrow Research search

Author name cluster

Ao Ren

Papers possibly associated with this exact author name in Arrow. This page groups case-insensitive exact name matches; it is not a full identity-disambiguation profile.

3 papers
1 author row

Possible papers

AAAI 2026 Conference Paper

D²Prune: Sparsifying Large Language Models via Dual Taylor Expansion and Attention Distribution Awareness

  • Lang Xiong
  • Ning Liu
  • Ao Ren
  • Yuheng Bai
  • Haining Fang
  • Binyan Zhang
  • Zhe Jiang
  • Yujuan Tan

Large language models (LLMs) face significant deployment challenges due to their massive computational demands. While pruning offers a promising compression solution, existing methods suffer from two critical limitations: (1) they neglect activation distribution shifts between calibration data and test data, resulting in inaccurate error estimation; and (2) they overlook the long-tail distribution of activations in the attention module. To address these limitations, this paper proposes a novel pruning method, D²Prune. First, we propose a dual Taylor expansion that jointly models weight and activation perturbations for precise error estimation, enabling more accurate pruning-mask selection and weight updates and minimizing error during pruning. Second, we propose an attention-aware dynamic update strategy that preserves the long-tail attention pattern by jointly minimizing the KL divergence between attention distributions and the reconstruction error. Extensive experiments show that D²Prune consistently outperforms SOTA methods across various LLMs (e.g., OPT-125M, LLaMA2/3, Qwen3). Moreover, the dynamic attention update mechanism generalizes well to ViT-based vision models such as DeiT, achieving superior accuracy on ImageNet-1K.
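
As a rough illustration of the kind of activation-aware saliency scoring the abstract describes (not the authors' dual-Taylor method, which is not specified here), the sketch below scores each weight by its magnitude times per-channel activation statistics from calibration data, with a spread term as a crude stand-in for the calibration/test activation shift the paper argues existing methods ignore. All shapes and the `sparsity` parameter are assumptions for the example.

```python
# Minimal sketch of activation-aware magnitude pruning for one linear layer.
# This is NOT D²Prune's dual Taylor expansion; it only illustrates scoring
# weights by the joint effect of weight magnitude and activation statistics.
import torch

def prune_linear(weight: torch.Tensor,      # (out_features, in_features)
                 calib_acts: torch.Tensor,  # (n_samples, in_features)
                 sparsity: float = 0.5) -> torch.Tensor:
    # Per-input-channel activation scale (second moment) from calibration data.
    act_scale = calib_acts.pow(2).mean(dim=0).sqrt()      # (in_features,)
    # Per-input-channel spread: a crude proxy for activation distribution shift.
    act_shift = calib_acts.std(dim=0)                     # (in_features,)
    # Saliency: weight magnitude times (scale + spread) of its input channel.
    saliency = weight.abs() * (act_scale + act_shift)     # (out, in)
    # Keep the top (1 - sparsity) fraction of weights in each output row.
    k = int(weight.shape[1] * (1.0 - sparsity))
    topk = saliency.topk(k, dim=1).indices
    mask = torch.zeros_like(weight, dtype=torch.bool)
    mask.scatter_(1, topk, True)
    return weight * mask

W = torch.randn(8, 16)
X = torch.randn(100, 16)
W_sparse = prune_linear(W, X, sparsity=0.5)
print((W_sparse == 0).float().mean())  # ~0.5
```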

AAAI 2026 Conference Paper

HitKV: Activation Frequency Knows Which Tokens Are Important

  • Sanle Zhao
  • Yujuan Tan
  • Yu Jing
  • Zhuoxin Bai
  • Yue Niu
  • Jiayi Guo
  • Zongjie Wang
  • Ao Ren

The demand for long-context processing in large language models (LLMs) continues to escalate alongside rapid advances in their capabilities. However, the intermediate attention keys and values (the KV cache), stored to avoid re-computation, grow linearly with sequence length, far exceeding the memory capacity of consumer-grade GPUs. Consequently, many studies have proposed KV cache compression methods that evict unimportant tokens based on variants of attention scoring; these methods typically retain the KV pairs of the top-k scoring tokens under a fixed memory budget. However, they still face several limitations. First, they disregard the activation frequency of tokens, i.e., the number of times a token achieves a top-k score in the attention distributions of subsequent tokens; score-based methods may therefore incorrectly evict tokens with high activation frequency but low final scores. Second, activation frequency exhibits different distribution patterns across layers and tasks, and neglecting these differences hurts model performance and task adaptability. Our analysis of actual token activation frequencies and their characteristics across layers and task types reveals opportunities to address these issues. In this paper, we propose HitKV, which uses hit rates to directly characterize token activation frequencies, enabling adaptive layer-aware and task-aware KV cache eviction under a uniform memory-allocation strategy; HitKV can also be easily integrated into layer-specific memory-allocation methods. Experimental results demonstrate that HitKV maintains model performance while preserving only 3% of the KV cache, produces high-quality outputs on long-text generation tasks, and delivers a 4× throughput improvement over baselines.
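
The core idea, counting how often each cached token lands in the top-k of later queries' attention distributions and evicting the least-hit tokens, is easy to make concrete. The sketch below is a minimal illustration of that general idea, not HitKV's implementation; the `k` and `budget` values are assumptions for the example.

```python
# Minimal sketch of frequency-based KV cache eviction: a token's "hit rate"
# is how often it appears in the top-k of subsequent queries' attention
# distributions. Illustrative only; not HitKV's actual algorithm.
import torch

def hit_rate_keep_indices(attn: torch.Tensor,  # (n_queries, n_keys), rows sum to 1
                          k: int = 8,
                          budget: int = 64) -> torch.Tensor:
    n_keys = attn.shape[1]
    hits = torch.zeros(n_keys)
    for row in attn:                      # each query's attention distribution
        topk = row.topk(min(k, n_keys)).indices
        hits[topk] += 1                   # token "hit": in this query's top-k
    # Keep the `budget` tokens with the highest activation frequency.
    keep = hits.topk(min(budget, n_keys)).indices
    return keep.sort().values             # preserve original token order

# Toy causal attention matrix: 32 queries over 32 cached tokens.
scores = torch.randn(32, 32)
causal = torch.tril(torch.ones(32, 32, dtype=torch.bool))
attn = torch.softmax(scores.masked_fill(~causal, float("-inf")), dim=-1)
print(hit_rate_keep_indices(attn, k=4, budget=16))
```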

NeurIPS 2022 Conference Paper

Measuring Data Reconstruction Defenses in Collaborative Inference Systems

  • Mengda Yang
  • Ziang Li
  • Juan Wang
  • Hongxin Hu
  • Ao Ren
  • Xiaoyang Xu
  • Wenzhe Yi

Collaborative inference systems speed up prediction in edge-cloud scenarios, where local devices and the cloud work together to run a complex deep-learning model. However, such edge-cloud collaborative inference systems are vulnerable to emerging reconstruction attacks, in which malicious cloud service providers can recover edge-side users' private data. Several defense countermeasures have recently been introduced against such attacks, but little is known about their robustness. In this paper, we take the first step toward measuring the robustness of these state-of-the-art defenses with respect to reconstruction attacks. Specifically, we show that latent privacy features are still retained in the obfuscated representations. Motivated by this observation, we design a technique called Sensitive Feature Distillation (SFD) to restore sensitive information from protected feature representations. Our experiments show that SFD can break through defense mechanisms in model-partitioning scenarios, demonstrating the inadequacy of existing defenses as privacy-preserving techniques against reconstruction attacks. We hope our findings inspire further work on improving the robustness of defense mechanisms against reconstruction attacks in collaborative inference systems.
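
To make the threat model concrete, below is a generic feature-inversion setup in the spirit of the attacks the paper measures: an attacker trains a decoder on shadow data to map intercepted intermediate representations back to inputs. This is not the paper's SFD method, and every architecture, shape, and hyperparameter here is an assumption for illustration.

```python
# Generic feature-inversion attack sketch for a split (edge/cloud) model.
# The attacker observes intermediate features and learns to invert them.
# Illustrative threat-model sketch only; not the paper's SFD technique.
import torch
import torch.nn as nn

edge = nn.Sequential(                       # stand-in for the edge-side layers
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
)
decoder = nn.Sequential(                    # attacker's inversion network
    nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
)
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)

for _ in range(100):                        # training on attacker's shadow data
    x = torch.rand(8, 3, 32, 32)            # shadow images in [0, 1]
    with torch.no_grad():
        feats = edge(x)                     # intercepted intermediate features
    x_hat = decoder(feats)                  # attempted input reconstruction
    loss = nn.functional.mse_loss(x_hat, x) # reconstruction objective
    opt.zero_grad(); loss.backward(); opt.step()
```

A defense obfuscates `feats` before transmission; the paper's measurement asks whether such a decoder (or a stronger distillation-based variant) can still recover the private input despite the obfuscation.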