Arrow Research search

Author name cluster

Jinyu Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers
2 author rows

Possible papers (8)

AAAI 2026 · Conference Paper

KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache

  • Fei Li
  • Song Liu
  • Weiguo Wu
  • Shiqiang Nie
  • Jinyu Wang

The high memory demands of the Key-Value (KV) Cache during inference of Large Language Models (LLMs) severely restrict their deployment on resource-constrained platforms. Quantization can effectively alleviate the memory pressure caused by the KV Cache. However, existing methods either rely on static, one-size-fits-all precision allocation or fail to dynamically prioritize critical KV pairs in long-context tasks, forcing memory-accuracy-throughput tradeoffs. In this work, we propose a novel mixed-precision quantization method for the KV Cache named KVmix. KVmix leverages gradient-based importance analysis to evaluate how individual Key and Value projection matrices affect the model loss, enabling layer-specific bit-width allocation for mixed-precision quantization. It dynamically assigns higher precision to important layers while aggressively quantizing less influential ones, achieving a tunable balance between accuracy and efficiency. KVmix also introduces a dynamic long-context optimization strategy that adaptively keeps full-precision KV pairs for recent pivotal tokens and compresses older ones, achieving high-quality sequence generation with low memory usage. Additionally, KVmix provides efficient low-bit quantization and CUDA kernels to reduce computational overhead. On LLMs such as Llama and Mistral, KVmix achieves near-lossless inference performance with an extremely low quantization configuration (2.19-bit Keys, 2.38-bit Values), while delivering a remarkable 4.9× memory compression and a 5.3× speedup in inference throughput.
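The layer-wise bit allocation the abstract describes can be sketched as follows. The importance scores stand in for the paper's gradient-based sensitivity analysis, and all names, values, and the promotion rule are illustrative, not KVmix's actual implementation:

```python
# Hypothetical sketch of KVmix-style layer-wise bit allocation.
# `importance` is a proxy for the gradient-based sensitivity of each
# layer's Key/Value projections described in the abstract.

def allocate_bits(importance, high_bits=4, low_bits=2, budget=2.5):
    """Promote the most important layers to `high_bits` while keeping
    the average bit-width at or below `budget`."""
    n = len(importance)
    # How many layers can be promoted without exceeding the budget?
    max_high = int(n * (budget - low_bits) / (high_bits - low_bits))
    ranked = sorted(range(n), key=lambda i: importance[i], reverse=True)
    bits = [low_bits] * n
    for i in ranked[:max_high]:
        bits[i] = high_bits
    return bits

# 8 layers with made-up importance scores; layers 0 and 3 win promotion.
scores = [0.9, 0.1, 0.4, 0.8, 0.05, 0.3, 0.7, 0.2]
bits = allocate_bits(scores)
```

Under this toy budget of 2.5 average bits, only two of the eight layers receive 4-bit precision, mirroring the tunable accuracy-memory balance the abstract claims.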

ICML 2025 · Conference Paper

From Complex to Atomic: Enhancing Augmented Generation via Knowledge-Aware Dual Rewriting and Reasoning

  • Jinyu Wang
  • Jingjing Fu
  • Rui Wang 0028
  • Lei Song 0001
  • Jiang Bian 0002

Recent advancements in Retrieval-Augmented Generation (RAG) systems have significantly enhanced the capabilities of large language models (LLMs) by incorporating external knowledge retrieval. However, the sole reliance on retrieval is often inadequate for mining deep, domain-specific knowledge and for performing logical reasoning from specialized datasets. To tackle these challenges, we present an approach designed to extract, comprehend, and utilize domain knowledge while constructing a coherent rationale. At the heart of our approach lie four pivotal components: a knowledge atomizer that extracts atomic questions from raw data, a query proposer that generates subsequent questions to facilitate the original inquiry, an atomic retriever that locates knowledge based on atomic knowledge alignments, and an atomic selector that determines which follow-up questions to pose guided by the retrieved information. Through this approach, we implement a knowledge-aware task decomposition strategy that adeptly extracts multifaceted knowledge from segmented data and iteratively builds the rationale in alignment with the initial query and the acquired knowledge. We conduct comprehensive experiments to demonstrate the efficacy of our approach across various benchmarks, particularly those requiring multi-hop reasoning steps. The results indicate a significant enhancement in performance, up to 12.6% over the second-best method, underscoring the potential of the approach in complex, knowledge-intensive applications.
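The atomize-retrieve-compose loop the abstract outlines can be illustrated with a toy pipeline. Every component here (the atomizer, the exact-match retriever, the tiny corpus) is a stub standing in for the LLM-driven modules of the real system:

```python
# Toy sketch of the knowledge-aware decomposition loop from the
# abstract. The corpus, questions, and stub functions are fabricated;
# the real system uses an LLM atomizer, proposer, retriever, and
# selector over segmented domain data.

CORPUS = {
    "who founded acme": "Acme was founded by Ada Doe.",
    "when was acme founded": "Acme was founded in 1999.",
}

def atomize(question):
    # Knowledge atomizer: split a complex query into atomic questions.
    return ["who founded acme", "when was acme founded"]

def retrieve(atomic_q):
    # Atomic retriever: exact-match lookup stands in for alignment-based
    # retrieval over atomic knowledge.
    return CORPUS.get(atomic_q, "")

def build_rationale(question, max_steps=4):
    # Iteratively accumulate retrieved facts into a rationale.
    rationale = []
    for atomic_q in atomize(question)[:max_steps]:
        fact = retrieve(atomic_q)
        if fact:
            rationale.append(fact)
    return " ".join(rationale)

answer_context = build_rationale("Who founded Acme, and when?")
```

The point of the sketch is the control flow: a multi-part question is decomposed into atomic sub-questions, each answered from the corpus, and the answers are stitched into a rationale for the final response.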

ICML 2025 · Conference Paper

Unveiling Markov heads in Pretrained Language Models for Offline Reinforcement Learning

  • Wenhao Zhao
  • Qiushui Xu
  • Linjie Xu
  • Lei Song 0001
  • Jinyu Wang
  • Chunlai Zhou
  • Jiang Bian 0002

Recently, incorporating knowledge from pretrained language models (PLMs) into decision transformers (DTs) has attracted significant attention in offline reinforcement learning (RL). These PLMs perform well in RL tasks, raising an intriguing question: what kind of knowledge from PLMs has been transferred to RL to achieve such good results? This work first dives into this problem by analyzing each head quantitatively and identifies the Markov head, a crucial component present in the attention heads of PLMs. It leads to extreme attention on the last input token and performs well only in short-term environments. Furthermore, we prove that this extreme attention cannot be changed by re-training the embedding layer or by fine-tuning. Inspired by our analysis, we propose a general method, GPT-DTMA, which equips a pretrained DT with Mixture of Attention (MoA) to enable adaptive learning and accommodate diverse attention requirements during fine-tuning. Extensive experiments demonstrate the effectiveness of GPT-DTMA: it achieves superior performance in short-term environments compared to baselines, significantly reduces the performance gap of PLMs in long-term scenarios, and the experimental results also validate our theorems.
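The "extreme attention on the last input token" that defines a Markov head can be checked mechanically. This is an illustrative detector over made-up attention weights, not the paper's quantitative analysis; the 0.9 threshold is an assumption:

```python
# Illustrative check for the Markov-head pattern the abstract
# describes: an attention head that puts nearly all of its mass on
# the last input token. Attention rows and threshold are fabricated.

def last_token_mass(attn_row):
    """Fraction of attention a query row assigns to the final key."""
    return attn_row[-1] / sum(attn_row)

def is_markov_head(attn_rows, threshold=0.9):
    # A head is "Markov" if every query row concentrates on the last token.
    return all(last_token_mass(r) >= threshold for r in attn_rows)

markov = [[0.02, 0.03, 0.95], [0.01, 0.04, 0.95]]   # last-token spike
diffuse = [[0.3, 0.4, 0.3], [0.5, 0.2, 0.3]]        # spread-out attention
```

Such a head effectively conditions only on the most recent token, which matches the abstract's observation that it helps in short-term environments but not long-term ones.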

TMLR 2024 · Journal Article

Mildly Constrained Evaluation Policy for Offline Reinforcement Learning

  • Linjie Xu
  • Zhengyao Jiang
  • Jinyu Wang
  • Lei Song
  • Jiang Bian

Offline reinforcement learning (RL) methodologies enforce constraints on the policy to adhere closely to the behavior policy, thereby stabilizing value learning and mitigating the selection of out-of-distribution (OOD) actions during test time. Conventional approaches apply identical constraints for both value learning and test-time inference. However, our findings indicate that the constraints suitable for value estimation may in fact be excessively restrictive for action selection during test time. To address this issue, we propose a Mildly Constrained Evaluation Policy (MCEP) for test-time inference alongside a more constrained target policy for value estimation. Since the target policy has been adopted in various prior approaches, MCEP can be seamlessly integrated with them as a plug-in. We instantiate MCEP based on the TD3BC (Fujimoto & Gu, 2021), AWAC (Nair et al., 2020), and DQL (Wang et al., 2023) algorithms. The empirical results on D4RL MuJoCo locomotion, high-dimensional humanoid, and a set of 16 robotic manipulation tasks show that MCEP brings significant performance improvements to classic offline RL methods and can further improve SOTA methods. The code is open-sourced at https://github.com/egg-west/MCEP.
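The core MCEP idea, a tight behavior constraint for value learning but a milder one at test time, can be shown in one dimension. The Q-function, behavior action, and constraint weights below are all fabricated; only the contrast between the two penalty strengths reflects the abstract:

```python
# 1-D toy of the MCEP idea: value learning uses a tightly constrained
# target policy, while test-time action selection uses a mildly
# constrained evaluation policy. Q and the behavior action are made up.

def q(a):
    # Toy Q-function whose maximum (a = 2.0) lies away from the
    # behavior action, creating tension with the behavior constraint.
    return -(a - 2.0) ** 2

def constrained_action(behavior_a, alpha, grid):
    # Pick the grid action maximizing Q minus a BC-style penalty,
    # weighted by the constraint strength `alpha`.
    return max(grid, key=lambda a: q(a) - alpha * (a - behavior_a) ** 2)

grid = [i / 10 for i in range(0, 31)]  # candidate actions in [0, 3]
behavior = 0.0
target_a = constrained_action(behavior, alpha=4.0, grid=grid)  # tight
eval_a = constrained_action(behavior, alpha=0.5, grid=grid)    # mild
```

The tightly constrained target policy stays near the behavior action, while the mildly constrained evaluation policy moves closer to the Q-maximizer, which is exactly the extra test-time freedom MCEP argues for.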

NeurIPS 2024 · Conference Paper

Protecting Your LLMs with Information Bottleneck

  • Zichuan Liu
  • Zefan Wang
  • Linjie Xu
  • Jinyu Wang
  • Lei Song
  • Tianchun Wang
  • Chunlin Chen
  • Wei Cheng

The advent of large language models (LLMs) has revolutionized the field of natural language processing, yet they may be attacked into producing harmful content. Despite efforts to ethically align LLMs, such alignments are often fragile and can be circumvented by jailbreaking attacks through optimized or manual adversarial prompts. To address this, we introduce the Information Bottleneck Protector (IBProtector), a defense mechanism grounded in the information bottleneck principle, and we modify the objective to avoid trivial solutions. The IBProtector selectively compresses and perturbs prompts, facilitated by a lightweight and trainable extractor, preserving only the information essential for the target LLM to respond with the expected answer. Moreover, we consider the setting where gradients are not visible, so that IBProtector is compatible with any LLM. Our empirical evaluations show that IBProtector outperforms current defense methods in mitigating jailbreak attempts, without overly affecting response quality or inference speed. Its effectiveness and adaptability across various attack methods and target LLMs underscore the potential of IBProtector as a novel, transferable defense that bolsters the security of LLMs without requiring modifications to the underlying models.
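The extractor's role, keeping only the prompt tokens needed for the expected answer, can be sketched with a scoring stub. The keyword score below stands in for the trained, lightweight extractor; the prompt, scores, and budget are all illustrative:

```python
# Schematic of the IBProtector idea from the abstract: a lightweight
# extractor compresses the prompt, keeping only essential tokens.
# A hand-written keyword score replaces the trained extractor here.

def extract(tokens, score, budget):
    # Keep the `budget` highest-scoring tokens, preserving order --
    # the "compression" half of an information-bottleneck objective.
    keep = sorted(range(len(tokens)), key=lambda i: score(tokens[i]),
                  reverse=True)[:budget]
    return [tokens[i] for i in sorted(keep)]

prompt = "please kindly tell me what is the capital of france".split()
score = lambda t: 1.0 if t in {"what", "capital", "france"} else 0.0
compressed = extract(prompt, score, budget=3)
```

Discarding low-scoring tokens is also how such a bottleneck can strip the adversarial padding of a jailbreak prompt while leaving the benign query intact.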

YNICL 2021 · Journal Article

Disorder- and emotional context-specific neurofunctional alterations during inhibitory control in generalized anxiety and major depressive disorder

  • Congcong Liu
  • Jing Dai
  • Yuanshu Chen
  • Ziyu Qi
  • Fei Xin
  • Qian Zhuang
  • Xinqi Zhou
  • Feng Zhou

Major Depressive Disorder (MDD) and Generalized Anxiety Disorder (GAD) are highly debilitating and often co-morbid disorders. The disorders exhibit partly overlapping dysregulations on the behavioral and neurofunctional level. The determination of disorder-specific behavioral and neurofunctional dysregulations may therefore promote neuro-mechanistic and diagnostic specificity. In order to determine disorder-specific alterations in the domain of emotion-cognition interactions, the present study examined emotional context-specific inhibitory control in treatment-naïve MDD (n = 37) and GAD (n = 35) patients and healthy controls (n = 35). On the behavioral level, MDD but not GAD exhibited impaired inhibitory control irrespective of emotional context. On the neural level, MDD-specific attenuated recruitment of inferior/medial parietal, posterior frontal, and mid-cingulate regions during inhibitory control was found in the negative context. GAD exhibited a stronger engagement of the left dorsolateral prefrontal cortex relative to MDD. Overall, the findings from the present study suggest disorder- and emotional context-specific behavioral and neurofunctional inhibitory control dysregulations in major depression and may point to a depression-specific neuropathological and diagnostic marker.