Arrow Research search

Author name cluster

Fengli Xu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
2 author rows

Possible papers

10

AAAI Conference 2026 Conference Paper

AgentSwift: Efficient LLM Agent Design via Value-Guided Hierarchical Search

  • Yu Li
  • Lehui Li
  • Zhihao Wu
  • Qingmin Liao
  • Jianye Hao
  • Kun Shao
  • Fengli Xu

Large language model (LLM) agents have demonstrated strong capabilities across diverse domains, yet automated agent design remains a significant challenge. Current automated agent design approaches are often constrained by limited search spaces that primarily optimize workflows but fail to integrate crucial human-designed components like memory, planning, and tool use. Furthermore, these methods are hampered by high evaluation costs, as evaluating even a single new agent on a benchmark can require tens of dollars. The difficulty of this exploration is further exacerbated by inefficient search strategies that struggle to navigate the large design space effectively, making the discovery of novel agents a slow and resource-intensive process. To address these challenges, we propose AgentSwift, a novel framework for automated agent design. We formalize a hierarchical search space that jointly models agentic workflow and composable functional components. This structure moves beyond optimizing workflows alone by co-optimizing functional components, which enables the discovery of more complex and effective agent architectures. To make exploration within this expansive space feasible, we mitigate high evaluation costs by training a value model on a high-quality dataset, generated via a novel strategy combining combinatorial coverage and balanced Bayesian sampling for low-cost evaluation. Guiding the entire process is a hierarchical Monte Carlo Tree Search (MCTS) strategy, which is informed by uncertainty to efficiently navigate the search space. Evaluated across a comprehensive set of seven benchmarks spanning embodied, math, web, tool, and game domains, AgentSwift discovers agents that achieve an average performance gain of 8.34\% over both existing automated agent search methods and manually designed agents. Moreover, our framework exhibits steeper and more stable search trajectories. By enabling the efficient, automated composition of workflow with functional components, AgentSwift provides a scalable methodology to explore complex agent designs. Our framework serves as a launchpad for researchers to rapidly prototype and discover powerful agent architectures without the impediment of prohibitive evaluation costs.

AAAI Conference 2026 Conference Paper

ResMAS: Resilience Optimization in LLM-based Multi-agent Systems

  • Zhilun Zhou
  • Zihan Liu
  • Jiahe Liu
  • Qingyu Shao
  • Yihan Wang
  • Kun Shao
  • Depeng Jin
  • Fengli Xu

Large Language Model-based Multi-Agent Systems (LLM-based MAS), where multiple LLM agents collaborate to solve complex tasks, have shown impressive performance in many areas. However, MAS are typically distributed across different devices or environments, making them vulnerable to perturbations such as agent failures. While existing works have studied the adversarial attacks and corresponding defense strategies, they mainly focus on reactively detecting and mitigating attacks after they occur rather than proactively designing inherently resilient systems. In this work, we study the resilience of LLM-based MAS under perturbations and find that both the communication topology and prompt design significantly influence system resilience. Motivated by these findings, we propose ResMAS: a two-stage framework for enhancing MAS resilience. First, we train a reward model to predict the MAS’s resilience, based on which we train a topology generator to automatically design resilient topology for specific tasks through reinforcement learning. Second, we introduce a topology-aware prompt optimization method that refines each agent’s prompt based on its connections and interactions with other agents. Extensive experiments across a range of tasks show that our approach substantially improves MAS resilience under various constraints. Moreover, our framework demonstrates strong generalization ability to new tasks and models, highlighting its potential for building resilient MASs.

NeurIPS Conference 2025 Conference Paper

AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems

  • Yu Shang
  • Peijie Liu
  • Yuwei Yan
  • Zijing Wu
  • Leheng Sheng
  • Yuanqing Yu
  • Chumeng Jiang
  • An Zhang

The emergence of agentic recommender systems powered by Large Language Models (LLMs) represents a paradigm shift in personalized recommendations, leveraging LLMs’ advanced reasoning and role-playing capabilities to enable autonomous, adaptive decision-making. Unlike traditional recommendation approaches, agentic recommender systems can dynamically gather and interpret user-item interactions from complex environments, generating robust recommendation strategies that generalize across diverse scenarios. However, the field currently lacks standardized evaluation protocols to systematically assess these methods. To address this critical gap, we propose: (1) an interactive textual recommendation simulator incorporating rich user and item metadata and three typical evaluation scenarios (classic, evolving-interest, and cold-start recommendation tasks); (2) a unified modular framework for developing agentic recommender systems; and (3) the first comprehensive benchmark comparing over 10 classical and agentic recommendation methods. Our findings demonstrate the superiority of agentic systems and establish actionable design guidelines for their core components. The benchmark environment has been rigorously validated through an open challenge and remains publicly available with a maintained leaderboard at https: //tsinghua-fib-lab. github. io/AgentSocietyChallenge/pages/overview. html. The benchmark is available at: https: //huggingface. co/datasets/SGJQovo/AgentRecBench.

ICLR Conference 2025 Conference Paper

AgentSquare: Automatic LLM Agent Search in Modular Design Space

  • Yu Shang
  • Yu Li 0022
  • Keyu Zhao
  • Likai Ma
  • Jiahe Liu
  • Fengli Xu
  • Yong Li 0008

Recent advancements in Large Language Models (LLMs) have led to a rapid growth of agentic systems capable of handling a wide range of complex tasks. However, current research largely relies on manual, task-specific design, limiting their adaptability to novel tasks. In this paper, we introduce a new research problem: Modularized LLM Agent Search (MoLAS). We propose a modular design space that abstracts existing LLM agent designs into four fundamental modules with uniform IO interface: Planning, Reasoning, Tool Use, and Memory. Building on this design space, we present a novel LLM agent search framework called AgentSquare, which introduces two core mechanisms, i.e., module evolution and recombination, to efficiently search for optimized LLM agents. To further accelerate the process, we design a performance predictor that uses in-context surrogate models to skip unpromising agent designs. Extensive experiments across six benchmarks, covering the diverse scenarios of web, embodied, tool use and game applications, show that AgentSquare substantially outperforms hand-crafted agents, achieving an average performance gain of 17.2% against best-known human designs. Moreover, AgentSquare can generate interpretable design insights, enabling a deeper understanding of agentic architecture and its impact on task performance. We believe that the modular design space and AgentSquare search framework offer a platform for fully exploiting the potential of prior successful designs and consolidate the collective efforts of research community. Code repo is available at https://github.com/tsinghua-fib-lab/AgentSquare.

ICML Conference 2025 Conference Paper

Token Signature: Predicting Chain-of-Thought Gains with Token Decoding Feature in Large Language Models

  • Peijie Liu 0001
  • Fengli Xu
  • Yong Li 0008

Chain-of-Thought (CoT) technique has proven effective in improving the performance of large language models (LLMs) on complex reasoning tasks. However, the performance gains are inconsistent across different tasks, and the underlying mechanism remains a long-standing research question. In this work, we make a preliminary observation that the monotonicity of token probability distributions may be correlated with the gains achieved through CoT reasoning. Leveraging this insight, we propose two indicators based on the token probability distribution to assess CoT effectiveness across different tasks. By combining instance-level indicators with logistic regression model, we introduce Dynamic CoT, a method that dynamically select between CoT and direct answer. Furthermore, we extend Dynamic CoT to closed-source models by transferring decision strategies learned from open-source models. Our indicators for assessing CoT effectiveness achieve an accuracy of 89. 2%, and Dynamic CoT reduces token consumption by more than 35% while maintaining high accuracy. Overall, our work offers a novel perspective on the underlying mechanisms of CoT reasoning and provides a framework for its more efficient deployment.

TIST Journal 2024 Journal Article

Demand-driven Urban Facility Visit Prediction

  • Yunke Zhang
  • Tong Li
  • Yuan Yuan
  • Fengli Xu
  • Fan Yang
  • Funing Sun
  • Yong Li

Predicting citizens’ visiting behaviors to urban facilities is instrumental for city governors and planners to detect inequalities in urban opportunities and optimize the distribution of facilities and resources. Previous works predict facility visits simply using observed visit behavior, yet citizens’ intrinsic demands for facilities are not characterized explicitly, causing potential incorrect learned relations in the prediction results. In this article, to make up for this deficiency, we present a demand-driven urban facility visit prediction method that decomposes citizens’ visits to facilities into their unobservable demands and their capability to fulfill them. Demands are expressed as the function of regional demographic attributes by a neural network, and the fulfillment capability is determined by the urban region’s spatial accessibility to facilities. Extensive evaluations of datasets of three large cities confirm the efficiency and rationality of our model. Our method outperforms the best state-of-the-art model by 8.28% on average in facility visit prediction tasks. Further analyses demonstrate the reasonableness of recovered facility demands and their relationship with citizen demographics. For instance, senior citizens tend to have higher medical demands but lower shopping demands. Meanwhile, estimated capabilities and accessibilities provide deeper insights into the decaying accessibility with respect to spatial distance and facilities’ diverse functions in the urban environment. Our findings shed light on demand-driven urban data mining and demand-based urban facility planning.

NeurIPS Conference 2024 Conference Paper

HLM-Cite: Hybrid Language Model Workflow for Text-based Scientific Citation Prediction

  • Qianyue Hao
  • Jingyang Fan
  • Fengli Xu
  • Jian Yuan
  • Yong Li

Citation networks are critical infrastructures of modern science, serving as intricate webs of past literature and enabling researchers to navigate the knowledge production system. To mine information hiding in the link space of such networks, predicting which previous papers (candidates) will a new paper (query) cite is a critical problem that has long been studied. However, an important gap remains unaddressed: the roles of a paper's citations vary significantly, ranging from foundational knowledge basis to superficial contexts. Distinguishing these roles requires a deeper understanding of the logical relationships among papers, beyond simple edges in citation networks. The emergence of large language models (LLMs) with textual reasoning capabilities offers new possibilities for discerning these relationships, but there are two major challenges. First, in practice, a new paper may select its citations from gigantic existing papers, where the combined texts far exceed the context length of LLMs. Second, logical relationships between papers are often implicit, and directly prompting an LLM to predict citations may lead to results based primarily on surface-level textual similarities, rather than the deeper logical reasoning required. In this paper, we introduce the novel concept of core citation, which identifies the critical references that go beyond superficial mentions. Thereby, we elevate the citation prediction task from a simple binary classification to a more nuanced problem: distinguishing core citations from both superficial citations and non-citations. To address this, we propose $\textbf{HLM-Cite}$, a $\textbf{H}$ybrid $\textbf{L}$anguage $\textbf{M}$odel workflow for citation prediction, which combines embedding and generative LMs. We design a curriculum finetune procedure to adapt a pretrained text embedding model to coarsely retrieve high-likelihood core citations from vast candidate sets and then design an LLM agentic workflow to rank the retrieved papers through one-shot reasoning, revealing the implicit relationships among papers. With the two-stage pipeline, we can scale the candidate sets to 100K papers, vastly exceeding the size handled by existing methods. We evaluate HLM-Cite on a dataset across 19 scientific fields, demonstrating a 17. 6\% performance improvement comparing SOTA methods. Our code is open-source at https: //github. com/tsinghua-fib-lab/H-LM for reproducibility.

TIST Journal 2022 Journal Article

Hierarchical Multi-agent Model for Reinforced Medical Resource Allocation with Imperfect Information

  • Qianyue Hao
  • Fengli Xu
  • Lin Chen
  • Pan Hui
  • Yong Li

With the advent of the COVID-19 pandemic, the shortage in medical resources became increasingly more evident. Therefore, efficient strategies for medical resource allocation are urgently needed. However, conventional rule-based methods employed by public health experts have limited capability in dealing with the complex and dynamic pandemic-spreading situation. In addition, model-based optimization methods such as dynamic programming (DP) fail to work since we cannot obtain a precise model in real-world situations most of the time. Model-free reinforcement learning (RL) is a powerful tool for decision-making; however, three key challenges exist in solving this problem via RL: (1) complex situations and countless choices for decision-making in the real world; (2) imperfect information due to the latency of pandemic spreading; and (3) limitations on conducting experiments in the real world since we cannot set up pandemic outbreaks arbitrarily. In this article, we propose a hierarchical RL framework with several specially designed components. We design a decomposed action space with a corresponding training algorithm to deal with the countless choices, ensuring efficient and real-time strategies. We design a recurrent neural network–based framework to utilize the imperfect information obtained from the environment. We also design a multi-agent voting method, which modifies the decision-making process considering the randomness during model training and, thus, improves the performance. We build a pandemic-spreading simulator based on real-world data, serving as the experimental platform. We then conduct extensive experiments. The results show that our method outperforms all baselines, which reduces infections and deaths by 14.25% on average without the multi-agent voting method and up to 15.44% with it.

AAAI Conference 2021 Conference Paper

AttnMove: History Enhanced Trajectory Recovery via Attentional Network

  • Tong Xia
  • Yunhan Qi
  • Jie Feng
  • Fengli Xu
  • Funing Sun
  • Diansheng Guo
  • Yong Li

A considerable amount of mobility data has been accumulated due to the proliferation of location-based service. Nevertheless, compared with mobility data from transportation systems like the GPS module in taxis, this kind of data is commonly sparse in terms of individual trajectories in the sense that users do not access mobile services and contribute their data all the time. Consequently, the sparsity inevitably weakens the practical value of the data even it has a high user penetration rate. To solve this problem, we propose a novel attentional neural network-based model, named AttnMove, to densify individual trajectories by recovering unobserved locations at a fine-grained spatial-temporal resolution. To tackle the challenges posed by sparsity, we design various intraand inter- trajectory attention mechanisms to better model the mobility regularity of users and fully exploit the periodical pattern from long-term history. We evaluate our model on two real-world datasets, and extensive results demonstrate the performance gain compared with the state-of-the-art methods. This also shows that, by providing high-quality mobility data, our model can benefit a variety of mobility-oriented down-stream applications.

NeurIPS Conference 2021 Conference Paper

Automorphic Equivalence-aware Graph Neural Network

  • Fengli Xu
  • Quanming Yao
  • Pan Hui
  • Yong Li

Distinguishing the automorphic equivalence of nodes in a graph plays an essential role in many scientific domains, e. g. , computational biologist and social network analysis. However, existing graph neural networks (GNNs) fail to capture such an important property. To make GNN aware of automorphic equivalence, we first introduce a localized variant of this concept --- ego-centered automorphic equivalence (Ego-AE). Then, we design a novel variant of GNN, i. e. , GRAPE, that uses learnable AE-aware aggregators to explicitly differentiate the Ego-AE of each node's neighbors with the aids of various subgraph templates. While the design of subgraph templates can be hard, we further propose a genetic algorithm to automatically search them from graph data. Moreover, we theoretically prove that GRAPE is expressive in terms of generating distinct representations for nodes with different Ego-AE features, which fills in a fundamental gap of existing GNN variants. Finally, we empirically validate our model on eight real-world graph data, including social network, e-commerce co-purchase network, and citation network, and show that it consistently outperforms existing GNNs. The source code is public available at https: //github. com/tsinghua-fib-lab/GRAPE.