Arrow Research

Author name cluster

Aurick Qiao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

4 papers
2 author rows

Possible papers


NeurIPS 2025 Conference Paper

Efficiently Scaling LLM Reasoning Programs with Certaindex

  • Yichao Fu
  • Junda Chen
  • Siqi Zhu
  • Fu Fu
  • Zhongdongming Dai
  • Yonghao Zhuang
  • Yian Ma
  • Aurick Qiao

Test-time reasoning algorithms such as chain-of-thought, self-consistency, and MCTS enhance LLM problem-solving but can wastefully generate many tokens without improving accuracy. At the same time, we observe that these algorithms exhibit answer stabilization: their intermediate solutions often cease to change after a certain point, and further investment of compute does not change their final answer. To quantify this phenomenon, we introduce Certaindex, an algorithm-agnostic metric of this evolving stability that signals when further computation is unlikely to alter the final result. Certaindex is lightweight, can accelerate reasoning program inference via early exit, and further enables dynamic token allocation, gang scheduling, and other optimizations when integrated with real-world LLM serving systems. To quantify real-world benefits, we built Certaindex into the scheduler of Dynasor, our reasoning-aware LLM serving system, and demonstrate up to 50% compute savings and 3.3$\times$ higher throughput on real workloads with no accuracy drop. Our code is available at https://github.com/hao-ai-lab/Dynasor.git
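
As a rough illustration of the early-exit idea in this abstract, the sketch below scores answer stability as agreement among recent intermediate answers. The agreement proxy and the helper names are assumptions for illustration only, not the paper's actual Certaindex definition.

```python
from collections import Counter

def stability_score(intermediate_answers, window=8):
    """Fraction of the last `window` intermediate answers that agree with
    the modal answer. An assumed agreement-based proxy for the answer
    stabilization described above, not the paper's Certaindex metric."""
    recent = intermediate_answers[-window:]
    if not recent:
        return 0.0
    _, modal_count = Counter(recent).most_common(1)[0]
    return modal_count / len(recent)

def should_early_exit(intermediate_answers, threshold=0.9):
    # Stop spending tokens once the evolving answer has stabilized.
    return stability_score(intermediate_answers) >= threshold
```

For self-consistency, for example, `intermediate_answers` would be the running list of sampled final answers; once nine of the last ten samples agree, further samples are unlikely to change the majority vote.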

NeurIPS 2025 Conference Paper

SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications

  • Gabriele Oliaro
  • Zhihao Jia
  • Daniel Campos
  • Aurick Qiao

Speculative decoding is widely adopted to reduce latency in large language model (LLM) inference by leveraging smaller draft models capable of handling diverse user tasks. However, emerging AI applications, such as LLM-based agents, present unique workload characteristics: instead of diverse independent requests, agentic frameworks typically submit repetitive inference requests, such as multi-agent pipelines performing similar subtasks or self-refinement loops iteratively enhancing outputs. These workloads result in long and highly predictable sequences, which current speculative decoding methods do not effectively exploit. To address this gap, we introduce \emph{SuffixDecoding}, a novel method that utilizes efficient suffix trees to cache long token sequences from prompts and previous outputs. By adaptively speculating more tokens when acceptance likelihood is high and fewer when it is low, SuffixDecoding effectively exploits opportunities for longer speculations while conserving computation when those opportunities are limited. Evaluations on agentic benchmarks, including SWE-Bench and Text-to-SQL, demonstrate that SuffixDecoding achieves speedups of up to 3.9$\times$, outperforming state-of-the-art methods: 2.2$\times$ faster than model-based approaches like EAGLE-2/3 and 1.6$\times$ faster than model-free approaches such as Token Recycling. SuffixDecoding is open-sourced.
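
A minimal sketch of the caching idea this abstract describes, assuming a flat dictionary keyed by fixed-length suffixes in place of the paper's suffix trees; the class and method names are illustrative.

```python
class SuffixCache:
    """Toy stand-in for SuffixDecoding's suffix trees: cache continuations
    of previously seen token sequences, keyed by a fixed-length suffix."""

    def __init__(self, suffix_len=4):
        self.suffix_len = suffix_len
        self.continuations = {}  # suffix tuple -> longest observed continuation

    def insert(self, tokens):
        n = self.suffix_len
        for i in range(len(tokens) - n):
            key = tuple(tokens[i:i + n])
            cont = tokens[i + n:]
            # Keep the longest continuation observed for each suffix.
            if len(cont) > len(self.continuations.get(key, [])):
                self.continuations[key] = cont

    def speculate(self, context, max_tokens=32):
        # Speculate many tokens when the recent context has been seen before,
        # and none when it has not, mirroring the adaptive behavior above.
        key = tuple(context[-self.suffix_len:])
        return self.continuations.get(key, [])[:max_tokens]

cache = SuffixCache()
cache.insert([1, 2, 3, 4, 5, 6, 7])      # e.g. a previous agent output
print(cache.speculate([9, 1, 2, 3, 4]))  # -> [5, 6, 7]
```

Real suffix trees make the longest-suffix lookup cheap for arbitrary match lengths; the fixed-length key here trades that generality for brevity.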

NeurIPS 2024 Conference Paper

Efficient LLM Scheduling by Learning to Rank

  • Yichao Fu
  • Siqi Zhu
  • Runlong Su
  • Aurick Qiao
  • Ion Stoica
  • Hao Zhang

In Large Language Model (LLM) inference, the output length of an LLM request is typically regarded as not known a priori. Consequently, most LLM serving systems employ a simple first-come-first-serve (FCFS) scheduling strategy, leading to Head-Of-Line (HOL) blocking and reduced throughput and service quality. In this paper, we reexamine this assumption: we show that, although predicting the exact generation length of each request is infeasible, it is possible to predict the relative ranks of output lengths in a batch of requests, using learning to rank. The ranking information offers valuable guidance for scheduling requests. Building on this insight, we develop a novel scheduler for LLM inference and serving that can approximate the shortest-job-first (SJF) schedule better than existing approaches. We integrate this scheduler with the state-of-the-art LLM serving system and show significant performance improvement in several important applications: 2.8x lower latency in chatbot serving and 6.5x higher throughput in synthetic data generation. Our code is available at https://github.com/hao-ai-lab/vllm-ltr.git
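
Once a ranker is available, the scheduling idea reduces to a sort. The sketch below is an assumed interface, not the vllm-ltr implementation: any learned scorer that orders requests by relative output length can approximate SJF, since exact lengths are never needed.

```python
def rank_schedule(requests, predict_rank_score):
    """Approximate shortest-job-first by sorting a batch on a predicted
    rank score, where lower scores mean shorter expected outputs.
    `predict_rank_score` is a stand-in for a learned ranking model."""
    return sorted(requests, key=predict_rank_score)

# Illustrative only: prompt length as a crude stand-in for a trained ranker.
batch = ["write a 2000-word essay on scheduling", "hi", "summarize: LLMs ..."]
print(rank_schedule(batch, predict_rank_score=len))
```

Because only relative order matters, such a ranker can be trained with pairwise or listwise learning-to-rank losses rather than regressing exact token counts.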

ICML 2019 Conference Paper

Fault Tolerance in Iterative-Convergent Machine Learning

  • Aurick Qiao
  • Bryon Aragam
  • Bingjing Zhang
  • Eric P. Xing

Machine learning (ML) training algorithms often possess an inherent self-correcting behavior due to their iterative-convergent nature. Recent systems exploit this property to achieve adaptability and efficiency in unreliable computing environments by relaxing the consistency of execution and allowing calculation errors to be self-corrected during training. However, the behavior of such systems is well understood only for specific types of calculation errors, such as those caused by staleness, reduced precision, or asynchronicity, and for specific algorithms, such as stochastic gradient descent. In this paper, we develop a general framework to quantify the effects of calculation errors on iterative-convergent algorithms. We then use this framework to derive a worst-case upper bound on the cost of arbitrary perturbations to model parameters during training and to design new strategies for checkpoint-based fault tolerance. Our system, SCAR, can reduce the cost of partial failures by 78%–95% when compared with traditional checkpoint-based fault tolerance across a variety of ML models and training algorithms, providing near-optimal performance in recovering from failures.
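
A minimal sketch of the checkpoint-based recovery idea, assuming parameters sharded across workers as a dict of arrays; the function name and the dict layout are illustrative assumptions, and SCAR's actual strategies are developed in the paper.

```python
import numpy as np

def recover_partial_failure(params, checkpoint, failed_keys):
    """Restore only the lost parameter partitions from the last checkpoint,
    treating the stale values as a bounded perturbation that the
    iterative-convergent training will self-correct, rather than rolling
    every worker back to the checkpoint."""
    return {
        name: checkpoint[name].copy() if name in failed_keys else value
        for name, value in params.items()
    }

params = {f"shard{i}": np.full(4, float(i)) for i in range(3)}
checkpoint = {k: v.copy() for k, v in params.items()}
params["shard1"] = params["shard1"] + 1.0  # training progress since checkpoint
params = recover_partial_failure(params, checkpoint, failed_keys={"shard1"})
```

Only shard1 reverts to its checkpointed value; the other shards keep their post-checkpoint progress, which is what makes partial recovery cheaper than a full rollback.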