Arrow Research search

Author name cluster

Huiqiang Jiang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers
2 author rows

Possible papers

8

NeurIPS Conference 2025 Conference Paper

Chain-of-Model Learning for Language Model

  • Xiaohua Wang
  • Kaitao Song
  • Xu Tan
  • Huiqiang Jiang
  • Chengruidong Zhang
  • Yongliang Shen
  • Cen Lu
  • Zihao Li

In this paper, we propose a novel learning paradigm, termed Chain-of-Model (CoM), which incorporates the causal relationship into the hidden states of each layer as a chain style. thereby introducing great scaling efficiency in model training and inference flexibility in deployment. We introduce the concept of Chain-of-Representation (CoR), which formulates the hidden states at each layer as a combination of multiple sub-representations (i. e. , chains). In each layer, each chain from the output representations can only view all of its preceding chains in the input representations. Consequently, the model built upon CoM framework can progressively scale up the model size by increasing the chains based on the previous models (i. e. , chains), and offer multiple sub-models at varying sizes for elastic inference by using different chain numbers. Based on this principle, we devise Chain-of-Language-Model (CoLM), which incorporates the idea of CoM into each layer of Transformer architecture. Based on CoLM, we further introduce CoLM-Air by introducing a KV sharing mechanism, that computes all keys and values within the first chain and then shares across all chains. This design demonstrates additional extensibility, such as enabling seamless LM switching, prefilling acceleration and so on. Experimental results demonstrate our CoLM family can achieve comparable performance to the standard Transformer, while simultaneously enabling greater flexiblity, such as progressive scaling to improve training efficiency and offer multiple varying model sizes for elastic inference, paving a a new way toward building language models.

NeurIPS Conference 2025 Conference Paper

GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

  • Qianhui Wu
  • Kanzhi Cheng
  • Rui Yang
  • Chaoyun Zhang
  • Jianwei Yang
  • Huiqiang Jiang
  • Jian Mu
  • Baolin Peng

One of the principal challenges in building VLM-powered GUI agents is visual grounding—localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment due to lack of explicit spatial supervision; inability to handle ambiguous supervision targets, as single-point predictions penalize valid variations; and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models like Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated `` token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. In line with this, we further design a grounding verifier to evaluate and select the most plausible action region from the candidates proposed for action execution. Extensive experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B achieves scores of 40. 7 with Qwen2-VL and 44. 6 with Qwen2. 5-VL as backbones, outperforming UI-TARS-72B (38. 1) on ScreenSpot-Pro, with significantly fewer parameters and training data. Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head (~100M parameters for 7B model) while keeping the VLM backbone frozen is sufficient to achieve performance comparable to previous state-of-the-art models, highlighting that GUI-Actor can endow the underlying VLM with effective grounding capabilities without compromising its general-purpose strengths. Project page: https: //aka. ms/GUI-Actor

ICML Conference 2025 Conference Paper

MMInference: Accelerating Pre-filling for Long-Context Visual Language Models via Modality-Aware Permutation Sparse Attention

  • Yucheng Li
  • Huiqiang Jiang
  • Chengruidong Zhang
  • Qianhui Wu
  • Xufang Luo
  • Surin Ahn
  • Amir H. Abdi
  • Dongsheng Li 0002

The integration of long-context capabilities with visual understanding unlocks unprecedented potential for Vision Language Models (VLMs). However, the quadratic attention complexity during the pre-filling phase remains a significant obstacle to real-world deployment. To overcome this limitation, we introduce MMInference (Multimodality Million tokens Inference), a dynamic sparse attention method that accelerates the prefilling stage for long-context multi-modal inputs. First, our analysis reveals that the temporal and spatial locality of video input leads to a unique sparse pattern, the Grid pattern. Simultaneously, VLMs exhibit markedly different sparse distributions across different modalities. We introduce a permutation-based method to leverage the unique Grid pattern and handle modality boundary issues. By offline search the optimal sparse patterns for each head, MMInference constructs the sparse distribution dynamically based on the input. We also provide optimized GPU kernels for efficient sparse computations. Notably, MMInference integrates seamlessly into existing VLM pipelines without any model modifications or fine-tuning. Experiments on multi-modal benchmarks-including Video QA, Captioning, VisionNIAH, and Mixed-Modality NIAH-with state-of-the-art long-context VLMs (LongVila, LlavaVideo, VideoChat-Flash, Qwen2. 5-VL) show that MMInference accelerates the pre-filling stage by up to 8. 3x at 1M tokens while maintaining accuracy. Our code is available at https: //ama. ms/MMInference.

NeurIPS Conference 2025 Conference Paper

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

  • Di Liu
  • Meng Chen
  • Baotong Lu
  • Huiqiang Jiang
  • Zhenhua Han
  • Qianxi Zhang
  • Qi Chen
  • Chengruidong Zhang

Transformer-based Large Language Models (LLMs) have become increasingly important. However, scaling LLMs to longer contexts incurs slow inference speed and high GPU memory consumption for caching key-value (KV) vectors. This paper presents RetrievalAttention, a training-free approach to both accelerate the decoding phase and reduce GPU memory consumption by pre-building KV vector indexes for fixed contexts and maintaining them in CPU memory for efficient retrieval. Unlike conventional KV cache methods, RetrievalAttention integrate approximate nearest neighbor search (ANNS) indexes into attention computation. We observe that off-the-shelf ANNS techniques often fail due to the out-of-distribution (OOD) nature of query and key vectors in attention mechanisms. RetrievalAttention overcomes this with an attention-aware vector index. Our evaluation shows RetrievalAttention achieves near full attention accuracy while accessing only 1-3\% of the data, significantly reducing inference costs. Remarkably, RetrievalAttention enables LLMs with 8B parameters to handle 128K tokens on a single NVIDIA RTX4090 (24GB), achieving a decoding speed of 0. 107 seconds per token.

ICLR Conference 2025 Conference Paper

SCBench: A KV Cache-Centric Analysis of Long-Context Methods

  • Yucheng Li
  • Huiqiang Jiang
  • Qianhui Wu
  • Xufang Luo
  • Surin Ahn
  • Chengruidong Zhang
  • Amir H. Abdi
  • Dongsheng Li 0002

Long-context Large Language Models (LLMs) have enabled numerous downstream applications but also introduced significant challenges related to computational and memory efficiency. To address these challenges, optimizations for long-context inference have been developed, centered around the KV cache. However, existing benchmarks often evaluate in single-request, neglecting the full lifecycle of the KV cache in real-world use. This oversight is particularly critical, as KV cache reuse has become widely adopted in LLMs inference frameworks, such as vLLM and SGLang, as well as by LLM providers, including OpenAI, Microsoft, Google, and Anthropic. To address this gap, we introduce SCBENCH (SharedContextBENCH), a comprehensive benchmark for evaluating long-context methods from a KV cache centric perspective: 1) KV cache generation, 2) KV cache compression, 3) KV cache retrieval, and 4) KV cache loading. Specifically, SCBench uses test examples with shared context, ranging 12 tasks with two shared context modes, covering four categories of long-context capabilities: string retrieval, semantic retrieval, global information, and multi-task. With SCBench, we provide an extensive KV cache-centric analysis of eight categories long-context solutions, including Gated Linear RNNs (Codestal-Mamba), Mamba-Attention hybrids (Jamba-1.5-Mini), and efficient methods such as sparse attention, KV cache dropping, quantization, retrieval, loading, and prompt compression. The evaluation is conducted on six Transformer-based long-context LLMs: Llama-3.1-8B/70B, Qwen2.5-72B/32B, Llama-3-8B-262K, and GLM-4-9B. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling computation perform robustly. Dynamic sparsity yields more expressive KV caches than static patterns, and layer-level sparsity in hybrid architectures reduces memory usage with strong performance. Additionally, we identify attention distribution shift issues in long-generation scenarios.

ICLR Conference 2025 Conference Paper

SeCom: On Memory Construction and Retrieval for Personalized Conversational Agents

  • Zhuoshi Pan
  • Qianhui Wu
  • Huiqiang Jiang
  • Xufang Luo
  • Hao Cheng 0002
  • Dongsheng Li 0002
  • Yuqing Yang 0001
  • Chin-Yew Lin

To deliver coherent and personalized experiences in long-term conversations, existing approaches typically perform retrieval augmented response generation by constructing memory banks from conversation history at either the turn-level, session-level, or through summarization techniques. In this paper, we explore the impact of different memory granularities and present two key findings: (1) Both turn-level and session-level memory units are suboptimal, affecting not only the quality of final responses, but also the accuracy of the retrieval process. (2) The redundancy in natural language introduces noise, hindering precise retrieval. We demonstrate that *LLMLingua-2*, originally designed for prompt compression to accelerate LLM inference, can serve as an effective denoising method to enhance memory retrieval accuracy. Building on these insights, we propose **SeCom**, a method that constructs a memory bank with topical segments by introducing a conversation **Se**gmentation model, while performing memory retrieval based on **Com**pressed memory units. Experimental results show that **SeCom** outperforms turn-level, session-level, and several summarization-based methods on long-term conversation benchmarks such as *LOCOMO* and *Long-MT-Bench+*. Additionally, the proposed conversation segmentation method demonstrates superior performance on dialogue segmentation datasets such as *DialSeg711*, *TIAGE*, and *SuperDialSeg*.

TMLR Journal 2025 Journal Article

Zoomer: Adaptive Image Focus Optimization for Black-box MLLM

  • Jiaxu Qian
  • Chendong Wang
  • Yifan Yang
  • Chaoyun Zhang
  • Huiqiang Jiang
  • Xufang Luo
  • Yu Kang
  • Qingwei Lin

Multimodal large language models (MLLMs) such as GPT-4o, Gemini Pro, and Claude 3.5 have enabled unified reasoning over text and visual inputs, yet they often hallucinate in real-world scenarios—especially when small objects or fine spatial context are involved. We pinpoint two core causes of this failure: the absence of region-adaptive attention and inflexible token budgets that force uniform downsampling, leading to critical information loss. To overcome these limitations, we introduce Zoomer a visual prompting framework that delivers token-efficient, detail-preserving image representations for black-box MLLMs. Zoomer integrates (1) a prompt-aware emphasis module to highlight semantically relevant regions, (2) a spatial-preserving orchestration schema to maintain object relationships, and (3) a budget-aware strategy to optimally allocate tokens between global context and local details. Extensive experiments on nine benchmarks and three commercial MLLMs demonstrate that Zoomer boosts accuracy by up to 27% while cutting image token usage by up to 67\%. Our approach establishes a principled methodology for robust, resource-aware multimodal understanding in settings where model internals are inaccessible.

NeurIPS Conference 2024 Conference Paper

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

  • Huiqiang Jiang
  • Yucheng Li
  • Chengruidong Zhang
  • Qianhui Wu
  • Xufang Luo
  • Surin Ahn
  • Zhenhua Han
  • Amir H. Abdi

The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i. e. , the pre-filling stage) on a single A100 GPU. Existing methods for speeding up prefilling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Milliontokens Inference), a sparse calculation method designed to accelerate pre-filling of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices-the A-shape, Vertical-Slash, and Block-Sparse-that can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparseindices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency in the pre-filling stage of longcontext LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. Byevaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM-4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy. Our code is available at https: //aka. ms/MInference.