Arrow Research

Author name cluster

Qianhui Wu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
2 author rows

Possible papers (7)

NeurIPS 2025 Conference Paper

GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

  • Qianhui Wu
  • Kanzhi Cheng
  • Rui Yang
  • Chaoyun Zhang
  • Jianwei Yang
  • Huiqiang Jiang
  • Jian Mu
  • Baolin Peng

One of the principal challenges in building VLM-powered GUI agents is visual grounding: localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment due to the lack of explicit spatial supervision; inability to handle ambiguous supervision targets, as single-point predictions penalize valid variations; and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models like Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated `<ACTOR>` token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. In line with this, we further design a grounding verifier to evaluate and select the most plausible action region from the proposed candidates for action execution. Extensive experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B achieves scores of 40.7 with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones, outperforming UI-TARS-72B (38.1) on ScreenSpot-Pro, with significantly fewer parameters and less training data. Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head (~100M parameters for a 7B model) while keeping the VLM backbone frozen is sufficient to achieve performance comparable to previous state-of-the-art models, highlighting that GUI-Actor can endow the underlying VLM with effective grounding capabilities without compromising its general-purpose strengths. Project page: https://aka.ms/GUI-Actor
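
As a rough illustration of the attention-based action head described above, here is a minimal PyTorch sketch (not GUI-Actor's actual code): a query derived from the dedicated action token's hidden state attends over visual patch embeddings, and patches whose weight clears a threshold form candidate action regions. All names, shapes, and the threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ActionHeadSketch(nn.Module):
    """Toy attention head: scores each visual patch against an action query."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)  # projects the action-token state
        self.k_proj = nn.Linear(d_model, d_model)  # projects patch embeddings
        self.scale = d_model ** -0.5

    def forward(self, action_state, patch_states):
        # action_state: (batch, d_model); patch_states: (batch, n_patches, d_model)
        q = self.q_proj(action_state).unsqueeze(1)   # (batch, 1, d_model)
        k = self.k_proj(patch_states)                # (batch, n_patches, d_model)
        scores = (q * k).sum(-1) * self.scale        # (batch, n_patches)
        return scores.softmax(dim=-1)                # attention weights over patches

head = ActionHeadSketch(d_model=64)
attn = head(torch.randn(1, 64), torch.randn(1, 196, 64))  # e.g. a 14x14 patch grid
candidates = (attn > 2.0 / 196).nonzero()  # patches well above uniform weight
```

Per the abstract, a separate grounding verifier would then score the candidate regions and select one for execution.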

ICML 2025 Conference Paper

MMInference: Accelerating Pre-filling for Long-Context Visual Language Models via Modality-Aware Permutation Sparse Attention

  • Yucheng Li
  • Huiqiang Jiang
  • Chengruidong Zhang
  • Qianhui Wu
  • Xufang Luo
  • Surin Ahn
  • Amir H. Abdi
  • Dongsheng Li 0002

The integration of long-context capabilities with visual understanding unlocks unprecedented potential for Vision Language Models (VLMs). However, the quadratic attention complexity during the pre-filling phase remains a significant obstacle to real-world deployment. To overcome this limitation, we introduce MMInference (Multimodality Million tokens Inference), a dynamic sparse attention method that accelerates the pre-filling stage for long-context multi-modal inputs. First, our analysis reveals that the temporal and spatial locality of video input leads to a unique sparse pattern, the Grid pattern. Simultaneously, VLMs exhibit markedly different sparse distributions across different modalities. We introduce a permutation-based method to leverage the unique Grid pattern and handle modality boundary issues. By searching offline for the optimal sparse pattern for each head, MMInference constructs the sparse distribution dynamically based on the input. We also provide optimized GPU kernels for efficient sparse computations. Notably, MMInference integrates seamlessly into existing VLM pipelines without any model modifications or fine-tuning. Experiments on multi-modal benchmarks (including Video QA, Captioning, VisionNIAH, and Mixed-Modality NIAH) with state-of-the-art long-context VLMs (LongVILA, LlavaVideo, VideoChat-Flash, Qwen2.5-VL) show that MMInference accelerates the pre-filling stage by up to 8.3x at 1M tokens while maintaining accuracy. Our code is available at https://aka.ms/MMInference.
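
The permutation idea can be sketched as follows; this is a hedged toy in NumPy (index bookkeeping only, no real kernel, and not MMInference's API): tokens are reordered by their grid phase so that a scattered Grid attention pattern becomes block-contiguous, which a standard block-sparse kernel can then cover efficiently.

```python
import numpy as np

def grid_permutation(n_frames: int, tokens_per_frame: int, stride: int) -> np.ndarray:
    """Reorder frame-major token indices by grid phase (index mod stride), so
    tokens that interact in a Grid attention pattern become contiguous."""
    idx = np.arange(n_frames * tokens_per_frame)
    return np.concatenate([idx[idx % stride == r] for r in range(stride)])

perm = grid_permutation(n_frames=4, tokens_per_frame=6, stride=3)
inverse = np.argsort(perm)  # restores the original token order after attention
# After permuting Q/K/V rows by `perm`, same-phase interactions fall into dense
# blocks that a block-sparse attention kernel can compute efficiently.
```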

ICLR 2025 Conference Paper

SCBench: A KV Cache-Centric Analysis of Long-Context Methods

  • Yucheng Li
  • Huiqiang Jiang
  • Qianhui Wu
  • Xufang Luo
  • Surin Ahn
  • Chengruidong Zhang
  • Amir H. Abdi
  • Dongsheng Li 0002

Long-context Large Language Models (LLMs) have enabled numerous downstream applications but also introduced significant challenges related to computational and memory efficiency. To address these challenges, optimizations for long-context inference have been developed, centered around the KV cache. However, existing benchmarks often evaluate only single requests, neglecting the full lifecycle of the KV cache in real-world use. This oversight is particularly critical, as KV cache reuse has become widely adopted in LLM inference frameworks such as vLLM and SGLang, as well as by LLM providers, including OpenAI, Microsoft, Google, and Anthropic. To address this gap, we introduce SCBench (SharedContextBench), a comprehensive benchmark for evaluating long-context methods from a KV cache-centric perspective: 1) KV cache generation, 2) KV cache compression, 3) KV cache retrieval, and 4) KV cache loading. Specifically, SCBench uses test examples with shared context, spanning 12 tasks with two shared-context modes and covering four categories of long-context capabilities: string retrieval, semantic retrieval, global information, and multi-task. With SCBench, we provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs (Codestral-Mamba), Mamba-Attention hybrids (Jamba-1.5-Mini), and efficient methods such as sparse attention, KV cache dropping, quantization, retrieval, loading, and prompt compression. The evaluation is conducted on six Transformer-based long-context LLMs: Llama-3.1-8B/70B, Qwen2.5-72B/32B, Llama-3-8B-262K, and GLM-4-9B. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling computation performs robustly. Dynamic sparsity yields more expressive KV caches than static patterns, and layer-level sparsity in hybrid architectures reduces memory usage with strong performance. Additionally, we identify attention distribution shift issues in long-generation scenarios.
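
To make the "full lifecycle" framing concrete, here is a hedged sketch of the shared-context evaluation loop such a benchmark implies. `encode`, `compress_kv`, and `decode_with_cache` are hypothetical stand-ins, not SCBench APIs.

```python
def evaluate_shared_context(model, shared_context, queries, compress_kv=None):
    """Shared-context lifecycle: build the KV cache once, reuse it per query.

    Single-request benchmarks stop after the first query; it is the reuse
    across queries that stresses compression, retrieval, and loading methods.
    """
    kv_cache = model.encode(shared_context)    # 1) KV cache generation
    if compress_kv is not None:
        kv_cache = compress_kv(kv_cache)       # 2) KV cache compression
    answers = []
    for query in queries:                      # 3) retrieval and 4) loading happen
        answers.append(model.decode_with_cache(kv_cache, query))  # inside decode
    return answers
```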

ICLR 2025 Conference Paper

SeCom: On Memory Construction and Retrieval for Personalized Conversational Agents

  • Zhuoshi Pan
  • Qianhui Wu
  • Huiqiang Jiang
  • Xufang Luo
  • Hao Cheng 0002
  • Dongsheng Li 0002
  • Yuqing Yang 0001
  • Chin-Yew Lin

To deliver coherent and personalized experiences in long-term conversations, existing approaches typically perform retrieval-augmented response generation by constructing memory banks from conversation history at either the turn level or the session level, or through summarization techniques. In this paper, we explore the impact of different memory granularities and present two key findings: (1) both turn-level and session-level memory units are suboptimal, affecting not only the quality of final responses but also the accuracy of the retrieval process; (2) the redundancy in natural language introduces noise, hindering precise retrieval. We demonstrate that *LLMLingua-2*, originally designed for prompt compression to accelerate LLM inference, can serve as an effective denoising method to enhance memory retrieval accuracy. Building on these insights, we propose **SeCom**, a method that constructs a memory bank with topical segments by introducing a conversation **Se**gmentation model, while performing memory retrieval based on **Com**pressed memory units. Experimental results show that **SeCom** outperforms turn-level, session-level, and several summarization-based methods on long-term conversation benchmarks such as *LOCOMO* and *Long-MT-Bench+*. Additionally, the proposed conversation segmentation method demonstrates superior performance on dialogue segmentation datasets such as *DialSeg711*, *TIAGE*, and *SuperDialSeg*.
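
A minimal sketch of the construct-then-retrieve pipeline described above, with `segment_model`, `compressor`, and `embed` as hypothetical placeholders (the paper uses its own segmentation model and LLMLingua-2 as the compressor):

```python
import numpy as np

def build_memory_bank(turns, segment_model, compressor, embed):
    """Segment conversation turns into topical units, then compress and embed them."""
    segments = segment_model(turns)                  # list[str] of topical segments
    memory = [compressor(s) for s in segments]       # denoised memory units
    vectors = np.stack([embed(m) for m in memory])   # (n_units, dim)
    return memory, vectors

def retrieve(query, memory, vectors, embed, k=3):
    """Return the k memory units most cosine-similar to the query."""
    q = embed(query)
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-8)
    return [memory[i] for i in np.argsort(-sims)[:k]]
```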

NeurIPS 2024 Conference Paper

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

  • Huiqiang Jiang
  • Yucheng Li
  • Chengruidong Zhang
  • Qianhui Wu
  • Xufang Luo
  • Surin Ahn
  • Zhenhua Han
  • Amir H. Abdi

The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up pre-filling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Million-tokens Inference), a sparse calculation method designed to accelerate the pre-filling of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices (the A-shape, Vertical-Slash, and Block-Sparse patterns) that can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency in the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM-4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy. Our code is available at https://aka.ms/MInference.
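
As a toy illustration of one of the named patterns, the sketch below builds a Vertical-Slash mask: a few global "vertical" key columns plus diagonal "slash" bands, restricted to the causal triangle. The chosen indices are arbitrary for illustration; the paper estimates them per attention head at run time.

```python
import numpy as np

def vertical_slash_mask(seq_len, vertical_cols, slash_offsets):
    """Boolean causal mask keeping selected key columns and diagonal bands."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    mask[:, vertical_cols] = True          # vertical: globally attended key columns
    rows = np.arange(seq_len)
    for off in slash_offsets:              # slash: diagonals at fixed offsets
        cols = rows - off
        ok = cols >= 0
        mask[rows[ok], cols[ok]] = True
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    return mask & causal

m = vertical_slash_mask(8, vertical_cols=[0, 1], slash_offsets=[0, 1])
# A sparse kernel computes attention only where `m` is True, so per-row cost
# stays near-constant instead of growing linearly with sequence length.
```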

AAAI 2020 Conference Paper

Enhanced Meta-Learning for Cross-Lingual Named Entity Recognition with Minimal Resources

  • Qianhui Wu
  • Zijia Lin
  • Guoxin Wang
  • Hui Chen
  • Börje F. Karlsson
  • Biqing Huang
  • Chin-Yew Lin

For languages with no annotated resources, transferring knowledge from rich-resource languages is an effective solution for named entity recognition (NER). While all existing methods directly transfer a source-learned model to a target language, in this paper we propose to fine-tune the learned model with a few similar examples given a test case, which could benefit the prediction by leveraging the structural and semantic information conveyed in such similar examples. To this end, we present a meta-learning algorithm to find a good model parameter initialization that can quickly adapt to a given test case, and propose to construct multiple pseudo-NER tasks for meta-training by computing sentence similarities. To further improve the model's generalization ability across different languages, we introduce a masking scheme and augment the loss function with an additional maximum term during meta-training. We conduct extensive experiments on cross-lingual named entity recognition with minimal resources over five target languages. The results show that our approach significantly outperforms existing state-of-the-art methods across the board.
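
The "initialization that adapts fast" objective can be sketched with a first-order meta-learning update (a Reptile-style simplification, not the paper's exact MAML-style algorithm); each task batch stands in for a pseudo-NER task built from similar sentences.

```python
import copy
import torch

def reptile_step(model, loss_fn, task_batches, inner_lr=1e-2, meta_lr=1e-1, inner_steps=3):
    """One first-order meta-update: adapt a copy per pseudo task, then move the
    shared initialization toward the average of the adapted weights."""
    init = copy.deepcopy(model.state_dict())
    deltas = {k: torch.zeros_like(v) for k, v in init.items()}
    for batch in task_batches:                   # each batch = one pseudo-NER task
        model.load_state_dict(init)
        opt = torch.optim.SGD(model.parameters(), lr=inner_lr)
        for _ in range(inner_steps):             # fast adaptation to this task
            opt.zero_grad()
            loss_fn(model, batch).backward()
            opt.step()
        for k, v in model.state_dict().items():
            deltas[k] += v - init[k]
    model.load_state_dict(
        {k: init[k] + meta_lr * deltas[k] / len(task_batches) for k in init}
    )  # an initialization from which a few gradient steps fit a new test case
```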

IJCAI 2020 Conference Paper

UniTrans: Unifying Model Transfer and Data Transfer for Cross-Lingual Named Entity Recognition with Unlabeled Data

  • Qianhui Wu
  • Zijia Lin
  • Börje F. Karlsson
  • Biqing Huang
  • Jian-Guang Lou

Prior work in cross-lingual named entity recognition (NER) with no or little labeled data falls into two primary categories: model transfer-based and data transfer-based methods. In this paper, we find that the two method types can complement each other, in the sense that the former can exploit context information via language-independent features but sees no task-specific information in the target language, while the latter generally generates pseudo target-language training data via translation, but its exploitation of context information is weakened by inaccurate translations. Moreover, prior work rarely leverages unlabeled data in the target language, which can be effortlessly collected and potentially contains valuable information for improved results. To handle both problems, we propose a novel approach termed UniTrans to Unify both model and data Transfer for cross-lingual NER, and furthermore leverage the available information from unlabeled target-language data via enhanced knowledge distillation. We evaluate our proposed UniTrans over four target languages on benchmark datasets. Our experimental results show that it substantially outperforms the existing state-of-the-art methods.
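
The enhanced knowledge distillation step can be sketched as two teachers (the model-transfer model and the one trained on translated data) producing soft label distributions on unlabeled target-language text, which a student then fits. The mixing rule and weights below are illustrative assumptions, not the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher1_logits, teacher2_logits, alpha=0.5, T=2.0):
    """KL divergence from a mixture of two teachers to the student.

    All logits: (batch, seq_len, n_labels). teacher1 = model-transfer teacher,
    teacher2 = data-transfer teacher; alpha mixes their softened distributions.
    """
    with torch.no_grad():
        p1 = F.softmax(teacher1_logits / T, dim=-1)
        p2 = F.softmax(teacher2_logits / T, dim=-1)
        target = alpha * p1 + (1 - alpha) * p2           # combined soft labels
    log_q = F.log_softmax(student_logits / T, dim=-1)
    # T^2 keeps gradient magnitudes comparable to a hard-label loss
    return F.kl_div(log_q, target, reduction="batchmean") * (T * T)
```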