Author name cluster

Xin Cong

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
2 author rows

Possible papers

NeurIPS Conference 2025 Conference Paper

Generalizing Experience for Language Agents with Hierarchical MetaFlows

  • Shengda Fan
  • Xin Cong
  • Zhong Zhang
  • Yuepeng Fu
  • Yesai Wu
  • Hao Wang
  • Xinyu Zhang
  • Enrui Hu

Recent efforts to employ large language models (LLMs) as agents have demonstrated promising results in a wide range of multi-step agent tasks. However, existing agents lack an effective experience reuse approach to leverage historically completed tasks. In this paper, we propose a novel experience reuse framework, MetaFlowLLM, which constructs a hierarchical experience tree from historically completed tasks. Each node in this experience tree is represented as a MetaFlow, which contains a static execution workflow and the subtasks that agents must complete dynamically. Then, we propose a Hierarchical MetaFlow Merging algorithm to construct the hierarchical experience tree. When accomplishing a new task, MetaFlowLLM first retrieves the most relevant MetaFlow node from the experience tree and then executes it accordingly. To effectively generate valid MetaFlows from historical data, we further propose a reinforcement learning pipeline to train the MetaFlowGen. Extensive experimental results on AppWorld and WorkBench demonstrate that, when integrated with MetaFlowLLM, existing agents (e.g., ReAct, Reflexion) gain substantial performance improvements while reducing execution costs. Notably, MetaFlowLLM achieves average success rate improvements of 32.3% on AppWorld and 6.2% on WorkBench.

ICLR Conference 2025 Conference Paper

Learning Evolving Tools for Large Language Models

  • Guoxin Chen
  • Zhong Zhang 0004
  • Xin Cong
  • Fangda Guo
  • Yesai Wu
  • Yankai Lin 0001
  • Wenzheng Feng
  • Yasheng Wang

Tool learning enables large language models (LLMs) to interact with external tools and APIs, greatly expanding the application scope of LLMs. However, due to the dynamic nature of external environments, these tools and APIs may become outdated over time, preventing LLMs from correctly invoking tools. Existing research primarily focuses on static environments and overlooks this issue, limiting the adaptability of LLMs in real-world applications. In this paper, we propose ToolEVO, a novel framework designed to enhance the adaptive and reflective capabilities of LLMs against tool variability. By leveraging Monte Carlo Tree Search, ToolEVO facilitates active exploration and interaction of LLMs within dynamic environments, allowing for autonomous self-reflection and self-updating of tool usage based on environmental feedback. Additionally, we introduce ToolQA-D, a benchmark specifically designed to evaluate the impact of tool variability. Extensive experiments demonstrate the effectiveness and stability of our approach, highlighting the importance of adaptability to tool variability for effective tool learning.

ICLR Conference 2025 Conference Paper

Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance

  • Yaxi Lu
  • Shenzhi Yang
  • Cheng Qian 0008
  • Guirong Chen
  • Qinyu Luo
  • Yesai Wu
  • Huadong Wang
  • Xin Cong

Agents powered by large language models have shown remarkable abilities in solving complex tasks. However, most agent systems remain reactive, limiting their effectiveness in scenarios requiring foresight and autonomous decision-making. In this paper, we tackle the challenge of developing proactive agents capable of anticipating and initiating tasks without explicit human instructions. We propose a novel data-driven approach for this problem. Firstly, we collect real-world human activities to generate proactive task predictions. These predictions are then labeled by human annotators as either accepted or rejected. The labeled data is used to train a reward model that simulates human judgment and serves as an automatic evaluator of the proactiveness of LLM agents. Building on this, we develop a comprehensive data generation pipeline to create a diverse dataset, ProactiveBench, containing 6,790 events. Finally, we demonstrate that fine-tuning models with the proposed ProactiveBench can significantly elicit the proactiveness of LLM agents. Experimental results show that our fine-tuned model achieves an F1-Score of 66.47% in proactively offering assistance, outperforming all open-source and closed-source models. These results highlight the potential of our method in creating more proactive and effective agent systems, paving the way for future advancements in human-agent collaboration.

ICLR Conference 2025 Conference Paper

Rational Decision-Making Agent with Learning Internal Utility Judgment

  • Yining Ye
  • Xin Cong
  • Shizuo Tian
  • Yujia Qin
  • Chong Liu
  • Yankai Lin 0001
  • Zhiyuan Liu 0001
  • Maosong Sun 0001

With remarkable advancements, large language models (LLMs) have attracted significant efforts to develop LLM-based agents capable of executing intricate multi-step decision-making tasks. Existing approaches predominantly build upon an external performance measure to guide the decision-making process, but the reliance on the external performance measure as a prior is problematic in real-world scenarios, where such a prior may be unavailable, flawed, or even erroneous. For genuine autonomous decision-making for LLM-based agents, it is imperative to develop rationality from their posterior experiences to judge the utility of each decision independently. In this work, we propose RaDAgent (Rational Decision-Making Agent), which fosters the development of its rationality through an iterative framework involving Experience Exploration and Utility Learning. Within this framework, Elo-based Utility Learning is devised to assign Elo scores to individual decision steps to judge their utilities via pairwise comparisons. Consequently, these Elo scores guide the decision-making process to derive optimal outcomes. Experimental results on the Game of 24, WebShop, ToolBench and RestBench datasets demonstrate RaDAgent’s superiority over baselines, achieving about 7.8% improvement on average. Besides, RaDAgent can also reduce costs (ChatGPT API calls), highlighting its effectiveness and efficiency.
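
For readers unfamiliar with Elo scoring, the following is a minimal sketch of the standard Elo rating update applied to one pairwise comparison. It is not RaDAgent's implementation: the function names, the K-factor of 32, and the 400-point scale are the conventional chess defaults and are assumptions here, since the abstract does not state how the scores are computed.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise comparison.

    score_a is 1.0 if A was judged better, 0.0 if B was, 0.5 for a tie.
    k (assumed 32 here) controls how far a single comparison moves the ratings.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


if __name__ == "__main__":
    # Two hypothetical decision steps start with equal ratings; step A wins.
    print(elo_update(1000.0, 1000.0, score_a=1.0))
```

Under this update the total rating of the two compared items is conserved, so repeated comparisons only redistribute score toward the steps that win more often.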

ICLR Conference 2025 Conference Paper

WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models

  • Shengda Fan
  • Xin Cong
  • Yuepeng Fu
  • Zhong Zhang 0004
  • Shuyan Zhang
  • Yuanwei Liu
  • Yesai Wu
  • Yankai Lin 0001

Recent advancements in large language models (LLMs) have driven a revolutionary paradigm shift in process automation from Robotic Process Automation to Agentic Process Automation by automating the workflow orchestration procedure based on LLMs. However, existing LLMs (even the advanced OpenAI GPT-4o) are limited in their ability to achieve satisfactory workflow orchestration. To address this limitation, we present WorkflowLLM, a data-centric framework elaborately designed to enhance the capability of LLMs in workflow orchestration. It first constructs a large-scale fine-tuning dataset, WorkflowBench, with 106,763 samples, covering 1,503 APIs from 83 applications across 28 categories. Specifically, the construction process can be divided into three phases: (1) Data Collection: we collect real-world workflow data from Apple Shortcuts and RoutineHub, transcribing them into Python-style code. We further equip them with generated hierarchical thought via GPT-4o-mini. (2) Query Expansion: we prompt GPT-4o-mini to generate more task queries to enrich the diversity and complexity of workflows. (3) Workflow Generation: we leverage an annotator model trained on collected data to generate workflows for synthesized queries. Finally, we merge the synthetic samples that pass quality confirmation with the collected samples to obtain the WorkflowBench. Based on WorkflowBench, we fine-tune Llama-3.1-8B to obtain WorkflowLlama. Our experiments show that WorkflowLlama demonstrates a strong capacity to orchestrate complex workflows, while also achieving notable generalization performance on previously unseen APIs. Additionally, WorkflowBench exhibits robust zero-shot generalization capabilities on an out-of-distribution task planning dataset, T-Eval. Our data and code are available at https://github.com/OpenBMB/WorkflowLLM.

ICLR Conference 2024 Conference Paper

AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors

  • Weize Chen
  • Yusheng Su
  • Jingwei Zuo
  • Cheng Yang 0002
  • Chenfei Yuan
  • Chi-Min Chan
  • Heyang Yu
  • Yaxi Lu

Autonomous agents empowered by Large Language Models (LLMs) have undergone significant improvements, enabling them to generalize across a broad spectrum of tasks. However, in real-world scenarios, cooperation among individuals is often required to enhance the efficiency and effectiveness of task accomplishment. Hence, inspired by human group dynamics, we propose a multi-agent framework AgentVerse that can effectively orchestrate a collaborative group of expert agents as a greater-than-the-sum-of-its-parts system. Our experiments demonstrate that AgentVerse can proficiently deploy multi-agent groups that outperform a single agent. Extensive experiments on text understanding, reasoning, coding, tool utilization, and embodied AI confirm the effectiveness of AgentVerse. Moreover, our analysis of agent interactions within AgentVerse reveals the emergence of specific collaborative behaviors, contributing to heightened group efficiency. We will release our codebase, AgentVerse, to further facilitate multi-agent research.

TMLR Journal 2024 Journal Article

Exploring Format Consistency for Instruction Tuning

  • Shihao Liang
  • Runchu Tian
  • Kunlun Zhu
  • Yujia Qin
  • Huadong Wang
  • Xin Cong
  • Zhiyuan Liu
  • Xiaojiang Liu

Instruction tuning has emerged as a promising approach to enhancing large language models in following human instructions. It is shown that increasing the diversity and number of instructions in the training data can consistently enhance generalization performance, which facilitates a recent endeavor to collect various instructions and integrate existing instruction tuning datasets into larger collections. However, different users have their unique ways of expressing instructions, and there often exist variations across different datasets in the instruction styles and formats, i.e., format inconsistency. In this work, a framework named Unified Instruction Tuning (UIT) is proposed, which calls OpenAI APIs for automatic format transfer among different instruction tuning datasets such as PromptSource, FLAN and CrossFit. With the framework, we (1) demonstrate the necessity of maintaining format consistency in instruction tuning; (2) improve the generalization performance on unseen instructions on T5-LM-xl; (3) provide a novel perplexity-based denoising method to reduce the noise of automatic format transfer to make the UIT framework more practical and a smaller offline model based on GPT-J that achieves comparable format transfer capability to OpenAI APIs to reduce costs in practice. Further analysis regarding variations of targeted formats and other effects is intended. The code and trained models will soon be available.

ICLR Conference 2024 Conference Paper

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

  • Yujia Qin
  • Shihao Liang
  • Yining Ye
  • Kunlun Zhu
  • Lan Yan
  • Yaxi Lu
  • Yankai Lin 0001
  • Xin Cong

Despite the advancements of open-source large language models (LLMs), e.g., LLaMA, they remain significantly limited in tool-use capabilities, i.e., using external tools (APIs) to fulfill human instructions. The reason is that current instruction tuning largely focuses on basic language tasks but ignores the tool-use domain. This is in contrast to the excellent tool-use capabilities of state-of-the-art (SOTA) closed-source LLMs, e.g., ChatGPT. To bridge this gap, we introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT. Specifically, the construction can be divided into three stages: (i) API collection: we collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub; (ii) instruction generation: we prompt ChatGPT to generate diverse instructions involving these APIs, covering both single-tool and multi-tool scenarios; (iii) solution path annotation: we use ChatGPT to search for a valid solution path (chain of API calls) for each instruction. To enhance the reasoning capabilities of LLMs, we develop a novel depth-first search-based decision tree algorithm. It enables LLMs to evaluate multiple reasoning traces and expand the search space. Moreover, to evaluate the tool-use capabilities of LLMs, we develop an automatic evaluator: ToolEval. Based on ToolBench, we fine-tune LLaMA to obtain an LLM ToolLLaMA, and equip it with a neural API retriever to recommend appropriate APIs for each instruction. Experiments show that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and exhibits comparable performance to ChatGPT. Our ToolLLaMA also demonstrates strong zero-shot generalization ability in an out-of-distribution tool-use dataset: APIBench.

ECAI Conference 2023 Conference Paper

Uncertain Relational Hypergraph Attention Networks for Document-Level Event Factuality Identification

  • Jiawei Sheng
  • Xin Cong
  • Jiangxia Cao
  • Shu Guo
  • Chen Li 0046
  • Lihong Wang
  • Tingwen Liu
  • Hongbo Xu

Document-level event factuality identification (DocEFI) is an important task in event knowledge acquisition, which aims to detect whether an event actually occurs or not from the perspective of the document. Unlike the sentence-level task, a document can have multiple sentences with different event factualities, leading to event factuality conflicts in DocEFI. Existing studies attempt to aggregate local event factuality by exploiting document structures, but they mostly consider textual components in the document separately, degrading complicated correlations therein. To address the above issues, this paper proposes a novel approach, namely UR-HAT, to improve DocEFI with uncertain relational hypergraph attention networks. Particularly, we reframe a document graph as a hypergraph, and establish beneficial n-ary correlations among textual nodes with relational hyperedges, which helps to globally consider local factuality features to resolve event factuality conflicts. To better discern the importance of event factuality features, we further represent textual nodes with uncertain Gaussian distributions, and propose novel uncertain relational hypergraph attention networks to refine textual nodes with the document hypergraph. In addition, we select factuality-related keywords as nodes to enrich event factuality features. Experimental results demonstrate the effectiveness of our proposed method, which outperforms previous methods on two widely used benchmark datasets.