Author name cluster

Jiahao Qiu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
2 author rows

Possible papers

TMLR Journal 2026 Journal Article

A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

  • Huan-ang Gao
  • Jiayi Geng
  • Wenyue Hua
  • Mengkang Hu
  • Xinzhe Juan
  • Hongzhang Liu
  • Shilong Liu
  • Jiahao Qiu

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks but remain fundamentally static, unable to adapt their internal parameters to novel tasks, evolving knowledge domains, or dynamic interaction contexts. As LLMs are increasingly deployed in open-ended, interactive environments, this static nature has become a critical bottleneck, necessitating agents that can adaptively reason, act, and evolve in real time. This paradigm shift, from scaling static models to developing self-evolving agents, has sparked growing interest in architectures and methods enabling continual learning and adaptation from data, interactions, and experiences. This survey provides the first systematic and comprehensive review of self-evolving agents, organizing the field around three foundational dimensions: what to evolve, when to evolve, and how to evolve. We examine evolutionary mechanisms across agent components (e.g., models, memory, tools, architecture), categorize adaptation methods by stages (e.g., intra-test-time, inter-test-time), and analyze the algorithmic and architectural designs that guide evolutionary adaptation (e.g., scalar rewards, textual feedback, single-agent and multi-agent systems). Additionally, we analyze evaluation metrics and benchmarks tailored for self-evolving agents, highlight applications in domains such as coding, education, and healthcare, and identify critical challenges and research directions in safety, scalability, and co-evolutionary dynamics. By providing a structured framework for understanding and designing self-evolving agents, this survey establishes a roadmap for advancing more adaptive, capable, robust, and versatile agentic systems in both research and real-world deployments, and ultimately sheds light on the realization of Artificial Super Intelligence (ASI), where agents evolve autonomously and perform beyond human-level intelligence across a wide array of tasks.

ICLR Conference 2025 Conference Paper

Collab: Controlled Decoding using Mixture of Agents for LLM Alignment

  • Souradip Chakraborty
  • Sujay Bhatt
  • Udari Madhushani
  • Soumya Suvra Ghosal
  • Jiahao Qiu
  • Mengdi Wang 0001
  • Dinesh Manocha
  • Furong Huang

Alignment of Large Language Models (LLMs) is crucial for safe and trustworthy deployment in applications. Reinforcement learning from human feedback (RLHF) has emerged as an effective technique to align LLMs to human preferences and broader utilities, but it requires updating billions of model parameters, which is computationally expensive. Controlled Decoding, by contrast, provides a mechanism for aligning a model at inference time without retraining. However, single-agent decoding approaches often struggle to adapt to diverse tasks due to the complexity and variability inherent in these tasks. To strengthen test-time performance with respect to the target task, we propose a mixture-of-agents-based decoding strategy that leverages existing off-the-shelf aligned LLM policies. Treating each prior policy as an agent in the spirit of mixture-of-agents collaboration, we develop a decoding method that allows for inference-time alignment through a token-level selection strategy among multiple agents. For each token, the most suitable LLM is dynamically chosen from a pool of models based on a long-term utility metric. This policy-switching mechanism ensures optimal model selection at each step, enabling efficient collaboration and alignment among LLMs during decoding. Theoretical analysis of our proposed algorithm establishes optimal performance with respect to the target task, represented via a target reward, for the given off-the-shelf models. We conduct comprehensive empirical evaluations with open-source aligned models on diverse tasks and preferences, which demonstrate the merits of this approach over single-agent decoding baselines. Notably, COLLAB surpasses the current SoTA decoding strategy, achieving an improvement of up to 1.56x in average reward and 71.89% in GPT-4 based win-tie rate.
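
The token-level selection described in this abstract can be illustrated with a toy sketch. This is not the authors' implementation: the agents, the `toy_utility` function, and the name `switch_decode` are hypothetical stand-ins, and the utility here is a simple hand-written rule rather than a learned long-term utility estimate.

```python
from typing import Callable, List, Sequence

# An agent proposes the next token for a given prefix (stand-in for an aligned LLM policy).
Agent = Callable[[List[str]], str]
# Hypothetical stand-in for a long-term utility estimate of appending a token to a prefix.
Utility = Callable[[List[str], str], float]


def switch_decode(
    agents: Sequence[Agent],
    utility: Utility,
    prompt: List[str],
    max_tokens: int,
) -> List[str]:
    """At every step, ask each agent for a token and keep the highest-utility proposal."""
    prefix = list(prompt)
    for _ in range(max_tokens):
        proposals = [agent(prefix) for agent in agents]
        # Ties are broken in favor of the earlier agent, because max() keeps the first maximum.
        best = max(proposals, key=lambda tok: utility(prefix, tok))
        prefix.append(best)
    return prefix


def polite_agent(prefix: List[str]) -> str:
    return "please"


def terse_agent(prefix: List[str]) -> str:
    return "ok"


def toy_utility(prefix: List[str], token: str) -> float:
    # Assumed toy rule: penalize repeating the previous token.
    return 0.0 if prefix and prefix[-1] == token else 1.0


result = switch_decode([polite_agent, terse_agent], toy_utility, ["hi"], 4)
print(result)
```

In this toy run the decoder alternates between the two agents, because the assumed utility penalizes immediate repetition.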

NeurIPS Conference 2025 Conference Paper

ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs

  • Jiaru Zou
  • Ling Yang
  • Jingwen Gu
  • Jiahao Qiu
  • Ke Shen
  • Jingrui He
  • Mengdi Wang

Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are primarily trained on model final output responses and struggle to evaluate intermediate thinking trajectories robustly, especially in the emerging setting of trajectory-response outputs generated by frontier reasoning models like Deepseek-R1. In this work, we introduce ReasonFlux-PRM, a novel trajectory-aware PRM explicitly designed to evaluate the trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. We adapt ReasonFlux-PRM to support reward supervision under both offline and online settings, including (i) selecting high-quality model distillation data for downstream supervised fine-tuning of smaller models, (ii) providing dense process-level rewards for policy optimization during reinforcement learning, and (iii) enabling reward-guided Best-of-N test-time scaling. Empirical results on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond demonstrate that ReasonFlux-PRM-7B selects higher quality data than strong PRMs (e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling. We also release an efficient ReasonFlux-PRM-1.5B for resource-constrained applications and edge deployment. Our code and models are released at https://github.com/Gen-Verse/ReasonFlux.

NeurIPS Conference 2024 Conference Paper

Fast Best-of-N Decoding via Speculative Rejection

  • Hanshi Sun
  • Momin Haider
  • Ruiqi Zhang
  • Huitao Yang
  • Jiahao Qiu
  • Ming Yin
  • Mengdi Wang
  • Peter Bartlett

The safe and effective deployment of Large Language Models (LLMs) involves a critical step called alignment, which ensures that the model's responses are in accordance with human preferences. Prevalent alignment techniques, such as DPO, PPO, and their variants, align LLMs by changing the pre-trained model weights during a phase called post-training. While predominant, these post-training methods add substantial complexity before LLMs can be deployed. Inference-time alignment methods avoid the complex post-training step and instead bias the generation towards responses that are aligned with human preferences. The best-known inference-time alignment method, called Best-of-N, is as effective as the state-of-the-art post-training procedures. Unfortunately, Best-of-N requires vastly more resources at inference time than standard decoding strategies, which makes it computationally non-viable. In this work, we introduce Speculative Rejection, a computationally viable inference-time alignment algorithm. It generates high-scoring responses according to a given reward model, like Best-of-N does, while being 16 to 32 times more computationally efficient.
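
The idea of saving compute by discarding weak candidates before they are fully generated can be sketched as follows. This is a minimal illustration under stated assumptions, not the Speculative Rejection algorithm from the paper: the pruning schedule (`keep_fraction`, `check_every`), the integer "tokens", and the sum-based reward are all hypothetical choices made for the example.

```python
import random
from typing import Callable, List, Tuple


def early_rejection_decode(
    step: Callable[[List[int]], int],
    reward: Callable[[List[int]], float],
    n: int,
    length: int,
    keep_fraction: float = 0.5,
    check_every: int = 2,
) -> Tuple[List[int], int]:
    """Grow n candidates token by token, periodically dropping low-reward partial ones.

    Returns the best surviving candidate and the number of tokens generated in total.
    """
    candidates: List[List[int]] = [[] for _ in range(n)]
    tokens_generated = 0
    for t in range(1, length + 1):
        for cand in candidates:
            cand.append(step(cand))
            tokens_generated += 1
        # Assumed schedule: every `check_every` steps keep only the top fraction.
        if t % check_every == 0 and len(candidates) > 1:
            candidates.sort(key=reward, reverse=True)
            keep = max(1, int(len(candidates) * keep_fraction))
            candidates = candidates[:keep]
    return max(candidates, key=reward), tokens_generated


rng = random.Random(0)
best, cost = early_rejection_decode(
    step=lambda prefix: rng.randint(0, 9),      # toy "language model"
    reward=lambda seq: float(sum(seq)),         # toy "reward model"
    n=8,
    length=6,
)
full_cost = 8 * 6  # plain Best-of-N would generate every candidate to full length
print(cost, full_cost)
```

With these toy settings the pruned run generates 28 tokens instead of the 48 that plain Best-of-N would need; the savings reported in the abstract come from the paper's own method and setup, not from this sketch.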

ICML Conference 2024 Conference Paper

MaxMin-RLHF: Alignment with Diverse Human Preferences

  • Souradip Chakraborty
  • Jiahao Qiu
  • Hui Yuan 0002
  • Alec Koppel
  • Dinesh Manocha
  • Furong Huang
  • Amrit Singh Bedi
  • Mengdi Wang 0001

Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a single reward model derived from preference data. However, the single reward model overlooks the rich diversity of human preferences inherent in data collected from multiple users. In this work, we first derive an impossibility result of alignment with single-reward RLHF, thereby highlighting its insufficiency in representing diverse human preferences. Next, we propose to learn a mixture of reward models via an expectation-maximization algorithm and solve a MaxMin alignment objective inspired by the Egalitarian principle in social choice theory to better honor diverse human preferences. We present comprehensive experimental results on small-scale (GPT-2) and large-scale (Tulu2-7B) language models and show the efficacy of the proposed approach in the presence of diversity among human preferences. We remark that our findings in this work are not limited to language models but also extend to reinforcement learning in general.
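
The difference between optimizing an average reward and a MaxMin (egalitarian) objective can be shown with a toy selection problem. This sketch assumes the per-group expected rewards are already known; the policy names and numbers are invented for illustration, and it omits the expectation-maximization step and any policy optimization described in the abstract.

```python
from typing import Dict, List


def average_choice(scores: Dict[str, List[float]]) -> str:
    """Pick the policy with the highest mean reward across user groups."""
    return max(scores, key=lambda p: sum(scores[p]) / len(scores[p]))


def maxmin_choice(scores: Dict[str, List[float]]) -> str:
    """Pick the policy whose worst-off user group is best served."""
    return max(scores, key=lambda p: min(scores[p]))


# Hypothetical expected rewards per policy, one entry per user group.
scores = {
    "policy_a": [1.0, 0.2],   # great for group 0, poor for group 1
    "policy_b": [0.55, 0.5],  # moderate for both groups
}

print(average_choice(scores), maxmin_choice(scores))
```

Here the average criterion favors `policy_a` while the MaxMin criterion favors `policy_b`, which is the kind of minority-preference case the abstract is concerned with.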

AAAI Conference 2024 Conference Paper

Tree Search-Based Evolutionary Bandits for Protein Sequence Optimization

  • Jiahao Qiu
  • Hui Yuan
  • Jinghong Zhang
  • Wentao Chen
  • Huazheng Wang
  • Mengdi Wang

While modern biotechnologies allow synthesizing new proteins and function measurements at scale, efficiently exploring a protein sequence space and engineering it remains a daunting task due to the vast sequence space of any given protein. Protein engineering is typically conducted through an iterative process of adding mutations to the wild-type or lead sequences, recombination of mutations, and running new rounds of screening. To enhance the efficiency of such a process, we propose a tree search-based bandit learning method, which expands a tree starting from the initial sequence with the guidance of a bandit machine learning model. Under simplified assumptions and a Gaussian Process prior, we provide theoretical analysis and a Bayesian regret bound, demonstrating that the method can efficiently discover a near-optimal design. The full algorithm is compatible with a suite of randomized tree search heuristics, machine learning models, pre-trained embeddings, and bandit techniques. We test various instances of the algorithm across benchmark protein datasets using simulated screens. Experimental results demonstrate that the algorithm is sample-efficient, diversity-promoting, and able to find top designs using reasonably small mutation counts.

ICRA Conference 2022 Conference Paper

FusionNet: Coarse-to-Fine Extrinsic Calibration Network of LiDAR and Camera with Hierarchical Point-pixel Fusion

  • Guangming Wang 0001
  • Jiahao Qiu
  • Yanfeng Guo
  • Hesheng Wang 0001

In this paper, we propose a novel network, FusionNet, which can estimate the extrinsic calibration matrix between LiDAR and a monocular RGB camera with high accuracy and robustness. FusionNet is a coarse-to-fine method, providing an online and end-to-end solution that can automatically detect and correct the decalibration without any specially designed targets or environments. First, the network applies deep-learning-based techniques to extract the features of LiDAR point clouds and RGB images. Then a novel method is adopted to fuse the features obtained from the different sensors by projecting LiDAR features onto RGB feature maps, searching for the RGB features with the projected points as centers, and concatenating the extracted RGB features with LiDAR features. To increase the accuracy, we apply a coarse-to-fine method in the network, transforming LiDAR points and estimating the extrinsic calibration matrices from the coarse scale to the fine scale. The network is trained on random artificial decalibration matrices. Compared to existing approaches, our method does not need to train additional iterative networks, yet it can still adapt to different ranges of decalibration.
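
The projection of LiDAR points onto the image plane, which the fusion step relies on, can be sketched with a standard pinhole camera model. This is a generic illustration rather than FusionNet's code: the function name `project_points`, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the sample values are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def project_points(
    points: Sequence[Point3],
    rotation: Sequence[Sequence[float]],  # 3x3 extrinsic rotation (LiDAR -> camera)
    translation: Sequence[float],         # 3-vector extrinsic translation
    fx: float, fy: float, cx: float, cy: float,  # assumed pinhole intrinsics
) -> List[Tuple[float, float]]:
    """Transform LiDAR points into the camera frame and project them to pixel coordinates."""
    pixels: List[Tuple[float, float]] = []
    for x, y, z in points:
        xc = rotation[0][0] * x + rotation[0][1] * y + rotation[0][2] * z + translation[0]
        yc = rotation[1][0] * x + rotation[1][1] * y + rotation[1][2] * z + translation[1]
        zc = rotation[2][0] * x + rotation[2][1] * y + rotation[2][2] * z + translation[2]
        if zc <= 0:
            continue  # point is behind the camera and cannot be projected
        pixels.append((fx * xc / zc + cx, fy * yc / zc + cy))
    return pixels


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixels = project_points(
    points=[(1.0, 2.0, 4.0), (0.0, 0.0, -1.0)],
    rotation=identity,
    translation=[0.0, 0.0, 0.0],
    fx=100.0, fy=100.0, cx=50.0, cy=40.0,
)
print(pixels)
```

An error in the extrinsic rotation or translation shifts where each point lands in the image, which is why a calibration network can learn to correct the decalibration from misaligned projections.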