Arrow Research

Author name cluster

Xinrun Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

33 papers
2 author rows

Possible papers (33)

AAAI Conference 2026 Conference Paper

GDBA Revisited: Unleashing the Power of Guided Local Search for Distributed Constraint Optimization

  • Yanchen Deng
  • Xinrun Wang
  • Bo An

Local search is an important class of incomplete algorithms for solving Distributed Constraint Optimization Problems (DCOPs), but it often converges to poor local optima. While the Generalized Distributed Breakout Algorithm (GDBA) provides a comprehensive rule set to escape premature convergence, its empirical benefits remain marginal on general-valued problems. In this work, we systematically examine GDBA and identify three factors that potentially lead to its inferior performance: over-aggressive constraint violation conditions, unbounded penalty accumulation, and uncoordinated penalty updates. To address these issues, we propose Distributed Guided Local Search (DGLS), a novel GLS framework for DCOPs that incorporates an adaptive violation condition to selectively penalize constraints with high cost, a penalty evaporation mechanism to control the magnitude of penalization, and a synchronization scheme for coordinated penalty updates. We theoretically show that the penalty values are bounded and that agents play a potential game in DGLS. Extensive empirical results on various benchmarks demonstrate the clear superiority of DGLS over state-of-the-art baselines. Compared to Damped Max-sum with high damping factors, DGLS achieves competitive performance on general-valued problems and outperforms it by significant margins on structured problems in terms of anytime results.
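The penalty machinery DGLS builds on can be illustrated compactly. Below is a minimal sketch of a guided-local-search penalty update with an evaporation step; the function names, constants, and the simple threshold test are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a guided-local-search penalty update with evaporation.
# Names and constants are illustrative, not DGLS's actual implementation.

def update_penalties(penalties, costs, threshold, evaporation=0.9, increment=1.0):
    """Penalize high-cost constraints, then evaporate all penalties.

    penalties: dict constraint -> current penalty
    costs:     dict constraint -> cost under the current assignment
    threshold: adaptive violation condition; only constraints whose cost
               exceeds it are penalized
    """
    for c, cost in costs.items():
        if cost > threshold:                 # selective violation condition
            penalties[c] = penalties.get(c, 0.0) + increment
    for c in penalties:                      # evaporation keeps penalties bounded
        penalties[c] *= evaporation
    return penalties

def augmented_cost(costs, penalties, weight=0.5):
    """Objective actually minimized by local search: cost plus weighted penalties."""
    return sum(costs.values()) + weight * sum(penalties.values())
```

Under this update, a penalty incremented by c every round evolves as p ← ρ(p + c) with evaporation factor ρ, so it stays below ρc/(1−ρ), which mirrors the boundedness claim in the abstract.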

TMLR Journal 2026 Journal Article

LLM-Based World Models Can Make Decisions Solely, But Rigorous Evaluations are Needed

  • Chang Yang
  • Xinrun Wang
  • Junzhe Jiang
  • Qinggang Zhang
  • Xiao Huang

World models have emerged as a key module in decision making, where MuZero and Dreamer achieve remarkable successes in complex tasks. Recent work leverages Large Language Models (LLMs) as general world simulators to simulate the dynamics of the world due to their generalizability. LLMs also serve as the world model for deliberative reasoning in Reasoning via Planning (RAP) and Tree of Thought (ToT). However, the world model is either evaluated as a general world simulator or as a functional module of the agent, i.e., predicting transitions to assist the planning. This paper argues that LLM-based world models can make decisions solely, but rigorous evaluations are needed. We first present two key observations to showcase how LLM-based world models can make decisions solely, and then present three key observations to demonstrate why the current evaluation framework for LLM-based world models is insufficient. We then present our suggested evaluation framework: policy verification, action proposal, and policy planning, where the world model alone is used for decision making. Finally, we leverage the 31 diverse environments from (Wang et al., 2023; 2024) and curate a rule-based policy for each environment for diverse evaluations. The key findings include: i) GPT-4o significantly outperforms GPT-4o-mini on the three main tasks, especially for tasks that require domain knowledge, e.g., scientific tasks; ii) the performance of LLM-based world models depends predominantly on their performance in key steps, while the total number of steps required for task completion is a weak indicator of task difficulty, with critical bottleneck steps playing a more decisive role; and iii) combining the world models' functionalities for decision making tends to increase performance instability in our experiments, which can partially obscure the performance gap between stronger and weaker models.
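To make the "policy verification" setting concrete, here is a minimal sketch in which the LLM serves only as the transition model while a fixed rule-based policy acts; `llm_predict_next_state` and `is_goal` are hypothetical stand-ins, not the paper's API.

```python
# Minimal sketch of policy verification with an LLM world model: the LLM only
# predicts transitions; the policy is fixed and rule-based.
# llm_predict_next_state and is_goal are hypothetical stand-ins.

def verify_policy(policy, initial_state, llm_predict_next_state, is_goal, max_steps=50):
    """Return True if the rule-based policy reaches the goal inside the
    world simulated entirely by the LLM."""
    state = initial_state
    for _ in range(max_steps):
        if is_goal(state):
            return True
        action = policy(state)                         # fixed rule-based policy
        state = llm_predict_next_state(state, action)  # LLM acts as the dynamics model
    return is_goal(state)
```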

TMLR Journal 2026 Journal Article

Nondeterministic Polynomial-time Problem Challenge: An Ever-Scaling Reasoning Benchmark for LLMs

  • Chang Yang
  • Ruiyu Wang
  • Junzhe Jiang
  • Qi Jiang
  • Qinggang Zhang
  • Yanchen Deng
  • Shuxin Li
  • Shuyue Hu

Reasoning is a fundamental capability of large language models (LLMs). Due to the rapid progress of LLMs, current benchmarks face two main issues: i) they can be crushed in a short time (less than one year), and ii) they may be easily hacked. To handle these issues, we propose ever-scalingness as a principle for building benchmarks that scale over complexity, instances, oversight, and coverage. This paper presents the Nondeterministic Polynomial-time Problem Challenge (NPPC), an ever-scaling reasoning benchmark for LLMs. Specifically, NPPC has three main modules: i) npgym, which provides a unified interface to 25 well-known NP-complete problems and can generate any number of instances at any level of complexity; ii) npsolver, which provides a unified interface to evaluate problem instances with both online and offline models via APIs and local deployments, respectively; and iii) npeval, which provides comprehensive, ready-to-use tools to analyze the performance of LLMs over different problems, the number of tokens, the aha moments, the reasoning errors, and the solution errors. Extensive experiments over widely-used LLMs demonstrate: i) NPPC can successfully decrease the performance of advanced LLMs to below 10%, demonstrating that NPPC is not crushed by current models; ii) DeepSeek-R1, Claude-3.7-Sonnet, and o1/o3-mini are the most powerful LLMs, where DeepSeek-R1 outperforms Claude-3.7-Sonnet and o1/o3-mini on most of the NP-complete problems considered; and iii) the numbers of tokens and aha moments in the advanced LLMs, e.g., Claude-3.7-Sonnet and DeepSeek-R1, are observed to first increase and then decrease as the problem instances become more difficult. Through continuous scaling analysis, NPPC can provide critical insights into LLMs' reasoning capabilities, exposing fundamental limitations and suggesting future directions for further improvements.
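As a flavor of what an ever-scaling generator-plus-verifier pairing looks like, here is a hypothetical npgym-style sketch for 3-SAT; the function names and signatures are assumptions, not NPPC's actual API. The key property is that instances can be generated at any difficulty while verification stays polynomial-time.

```python
# Hypothetical npgym-style sketch: an NP-complete problem exposes an instance
# generator (with difficulty knobs) and a polynomial-time verifier.
import random

def generate_3sat(n_vars, n_clauses, seed=None):
    """Random 3-SAT instance; difficulty scales with n_vars and the
    clause-to-variable ratio. A clause is a list of signed variable indices."""
    rng = random.Random(seed)
    return [
        [rng.choice([-1, 1]) * rng.randint(1, n_vars) for _ in range(3)]
        for _ in range(n_clauses)
    ]

def verify_3sat(clauses, assignment):
    """Polynomial-time check of a candidate assignment (dict var -> bool)."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )
```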

AAAI Conference 2026 Conference Paper

The Avengers: A Routing Recipe for Collective Intelligence in Language Models

  • Yiqun Zhang
  • Hao Li
  • Chenxu Wang
  • Linyao Chen
  • Qiaosheng Zhang
  • Peng Ye
  • Shi Feng
  • Xinrun Wang

Proprietary models are increasingly dominating the race for ever-larger language models. Can open-source, smaller models remain competitive across a broad range of tasks? In this paper, we present the Avengers---a lightweight framework that leverages the collective intelligence of these smaller models. The Avengers builds upon four lightweight operations: (i) embedding: encode queries using a text embedding model; (ii) clustering: group queries based on their semantic similarity; (iii) scoring: score each model's performance within each cluster; and (iv) voting: improve outputs via repeated sampling and voting. At inference time, each query is embedded and assigned to its nearest cluster. The top-performing model(s) within that cluster are selected to generate the response with repeated sampling. Remarkably, with 10 open-source models (~7B parameters each), the Avengers surpasses GPT-4o, 4.1, and 4.5 in average performance across 15 diverse datasets spanning mathematics, coding, logical reasoning, general knowledge, and affective tasks. In particular, it surpasses GPT-4.1 on mathematics tasks by 18.21% and on code tasks by 7.46%. Furthermore, the Avengers delivers superior out-of-distribution generalization and remains robust across various embedding models, clustering algorithms, ensemble strategies, levels of data efficiency, and values of its sole parameter---the number of clusters.
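The embed-cluster-score-route recipe is simple enough to sketch. The snippet below uses scikit-learn k-means as a stand-in for the clustering step and assumes a precomputed per-query score matrix; it illustrates the idea, not the released implementation, and assumes every cluster is non-empty.

```python
# Minimal sketch of the embed -> cluster -> score -> route recipe.
import numpy as np
from sklearn.cluster import KMeans

def fit_router(train_embeddings, per_query_model_scores, n_clusters=64):
    """Cluster training-query embeddings and record each model's mean score
    per cluster. per_query_model_scores: array (n_queries, n_models)."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(train_embeddings)
    n_models = per_query_model_scores.shape[1]
    cluster_scores = np.zeros((n_clusters, n_models))
    for k in range(n_clusters):
        cluster_scores[k] = per_query_model_scores[km.labels_ == k].mean(axis=0)
    return km, cluster_scores

def route(query_embedding, km, cluster_scores, top_k=1):
    """Send a new query to the best-scoring model(s) of its nearest cluster."""
    k = km.predict(query_embedding.reshape(1, -1))[0]
    return np.argsort(cluster_scores[k])[::-1][:top_k]   # model indices
```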

ICLR Conference 2025 Conference Paper

AgentStudio: A Toolkit for Building General Virtual Agents

  • Longtao Zheng
  • Zhiyuan Huang
  • Zhenghai Xue
  • Xinrun Wang
  • Bo An 0001
  • Shuicheng Yan

General virtual agents need to handle multimodal observations, master complex action spaces, and self-improve in dynamic, open-domain environments. However, existing environments are often domain-specific and require complex setups, which limits agent development and evaluation in real-world settings. As a result, current evaluations lack in-depth analyses that decompose fundamental agent capabilities. We introduce AgentStudio, a trinity of environments, tools, and benchmarks to address these issues. AgentStudio provides a lightweight, interactive environment with highly generic observation and action spaces, e.g., video observations and GUI/API actions. It integrates tools for creating online benchmark tasks, annotating GUI elements, and labeling actions in videos. Based on our environment and tools, we curate an online task suite that benchmarks both GUI interactions and function calling with efficient auto-evaluation. We also reorganize existing datasets and collect new ones using our tools to establish three datasets: GroundUI, IDMBench, and CriticBench. These datasets evaluate fundamental agent abilities, including GUI grounding, learning from videos, and success detection, pointing to the desiderata for robust, general, and open-ended virtual agents.

ICML Conference 2025 Conference Paper

Cradle: Empowering Foundation Agents towards General Computer Control

  • Weihao Tan
  • Wentao Zhang 0007
  • Xinrun Xu
  • Haochong Xia
  • Ziluo Ding
  • Boyu Li 0003
  • Bohan Zhou
  • Junpeng Yue

Despite their success in specific scenarios, existing foundation agents still struggle to generalize across various virtual scenarios, mainly due to the dramatically different encapsulations of environments with manually designed observation and action spaces. To handle this issue, we propose the General Computer Control (GCC) setting to restrict foundation agents to interact with software through the most unified and standardized interface, i.e., using screenshots as input and keyboard and mouse actions as output. We introduce Cradle, a modular and flexible LMM-powered framework, as a preliminary attempt towards GCC. Enhanced by six key modules, Information Gathering, Self-Reflection, Task Inference, Skill Curation, Action Planning, and Memory, Cradle is able to understand input screenshots and output executable code for low-level keyboard and mouse control after high-level planning and information retrieval, so that Cradle can interact with any software and complete long-horizon complex tasks without relying on any built-in APIs. Experimental results show that Cradle exhibits remarkable generalizability and impressive performance across four previously unexplored commercial video games (Red Dead Redemption 2, Cities: Skylines, Stardew Valley and Dealer’s Life 2), five software applications (Chrome, Outlook, Feishu, Meitu and CapCut), and a comprehensive benchmark, OSWorld. With a unified interface to interact with any software, Cradle greatly extends the reach of foundation agents, paving the way for generalist agents.
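Cradle's decision cycle can be summarized as a skeleton. The module names below follow the paper, but the `lmm` object and its methods are hypothetical stand-ins for prompted LMM calls, not Cradle's actual interfaces.

```python
# Illustrative skeleton of Cradle's perceive -> reflect -> plan loop over
# screenshots; every method on the hypothetical `lmm` object stands in for a
# prompted LMM call.

def cradle_step(screenshot, memory, lmm):
    """One decision cycle; returns executable keyboard/mouse control code."""
    info = lmm.gather_information(screenshot)        # Information Gathering
    reflection = lmm.self_reflect(info, memory)      # Self-Reflection
    task = lmm.infer_task(info, reflection, memory)  # Task Inference
    skills = lmm.curate_skills(task, memory)         # Skill Curation
    code = lmm.plan_action(task, skills, memory)     # Action Planning
    memory.store(info, reflection, task, code)       # Memory
    return code                                      # executed to drive keyboard/mouse
```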

ICML Conference 2024 Conference Paper

Configurable Mirror Descent: Towards a Unification of Decision Making

  • Pengdeng Li
  • Shuxin Li 0001
  • Chang Yang
  • Xinrun Wang
  • Shuyue Hu
  • Xiao Huang 0001
  • Hau Chan
  • Bo An 0001

Decision-making problems, categorized as single-agent, e.g., Atari, cooperative multi-agent, e.g., Hanabi, competitive multi-agent, e.g., Hold’em poker, and mixed cooperative and competitive, e.g., football, are ubiquitous in the real world. Although various methods have been proposed to address the specific decision-making categories, these methods typically evolve independently and cannot generalize to other categories. Therefore, a fundamental question for decision-making is: Can we develop a single algorithm to tackle ALL categories of decision-making problems? There are several main challenges to address this question: i) different decision-making categories involve different numbers of agents and different relationships between agents, ii) different categories have different solution concepts and evaluation measures, and iii) there is no comprehensive benchmark covering all the categories. This work presents a preliminary attempt to address the question with three main contributions. i) We propose the generalized mirror descent (GMD), a generalization of MD variants, which considers multiple historical policies and works with a broader class of Bregman divergences. ii) We propose the configurable mirror descent (CMD), where a meta-controller is introduced to dynamically adjust the hyper-parameters in GMD conditional on the evaluation measures. iii) We construct the GameBench with 15 academic-friendly games across different decision-making categories. Extensive experiments demonstrate that CMD achieves empirically competitive or better outcomes compared to baselines while providing the capability of exploring diverse dimensions of decision making.
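For reference, the textbook mirror descent update that GMD generalizes is shown below (a single historical policy and a fixed Bregman divergence; the multi-history, broader-divergence form is the paper's contribution and is not reproduced here):

```latex
% Textbook mirror descent (MD) update: one historical policy \pi_t and a
% fixed Bregman divergence D_\psi.
\pi_{t+1} = \operatorname*{arg\,min}_{\pi \in \Pi}
  \Big\{ \eta \,\langle \nabla \ell(\pi_t),\, \pi \rangle
       + D_{\psi}(\pi, \pi_t) \Big\},
\quad\text{where}\quad
D_{\psi}(x,y) = \psi(x) - \psi(y) - \langle \nabla\psi(y),\, x - y \rangle .
```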

AAAI Conference 2024 Conference Paper

EarnHFT: Efficient Hierarchical Reinforcement Learning for High Frequency Trading

  • Molei Qin
  • Shuo Sun
  • Wentao Zhang
  • Haochong Xia
  • Xinrun Wang
  • Bo An

High-frequency trading (HFT) uses computer algorithms to make trading decisions on short time scales (e.g., second-level) and is widely used in the Cryptocurrency (Crypto) market (e.g., Bitcoin). Reinforcement learning (RL) in financial research has shown stellar performance on many quantitative trading tasks. However, most methods focus on low-frequency trading, e.g., day-level, and cannot be directly applied to HFT because of two challenges. First, RL for HFT involves dealing with extremely long trajectories (e.g., 2.4 million steps per month), which are hard to optimize and evaluate. Second, the dramatic price fluctuations and market trend changes of Crypto make existing algorithms fail to maintain satisfactory performance. To tackle these challenges, we propose an Efficient hieArchical Reinforcement learNing method for High Frequency Trading (EarnHFT), a novel three-stage hierarchical RL framework for HFT. In stage I, we compute a Q-teacher, i.e., the optimal action value based on dynamic programming, to enhance the performance and training efficiency of second-level RL agents. In stage II, we construct a pool of diverse RL agents for different market trends, distinguished by return rates: hundreds of RL agents are trained with different preferences over return rates, and only a tiny fraction of them are selected into the pool based on their profitability. In stage III, we train a minute-level router that dynamically picks a second-level agent from the pool to achieve stable performance across different markets. Through extensive experiments across various market trends on Crypto markets in a high-fidelity simulation trading environment, we demonstrate that EarnHFT significantly outperforms six state-of-the-art baselines on six popular financial criteria, exceeding the runner-up by 30% in profitability.
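The stage-I idea, computing optimal action values by backward dynamic programming when the full price series is known offline, can be sketched in a few lines. The binary position, single fee parameter, and array layout below are simplifying assumptions for illustration, not EarnHFT's actual formulation.

```python
# Minimal sketch of a "Q-teacher": with the full price series known offline,
# the optimal action value at each step is exact via backward induction.
import numpy as np

def q_teacher(prices, fee=1e-4):
    """Q[t, pos, a]: optimal value at step t, holding position pos (0/1),
    taking action a (0/1 = target position)."""
    T = len(prices)
    Q = np.zeros((T, 2, 2))
    for t in range(T - 2, -1, -1):
        ret = (prices[t + 1] - prices[t]) / prices[t]  # next-step return
        for pos in (0, 1):
            for a in (0, 1):
                pnl = a * ret - fee * abs(a - pos)     # fee paid on position change
                Q[t, pos, a] = pnl + Q[t + 1, a].max() # optimal continuation
    return Q
```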

AAMAS Conference 2024 Conference Paper

Grasper: A Generalist Pursuer for Pursuit-Evasion Problems

  • Pengdeng Li
  • Shuxin Li
  • Xinrun Wang
  • Jakub Černý
  • Youzhi Zhang
  • Stephen McAleer
  • Hau Chan
  • Bo An

Pursuit-evasion games (PEGs) model interactions between a team of pursuers and an evader in graph-based environments such as urban street networks. Recent advancements have demonstrated the effectiveness of the pre-training and fine-tuning paradigm in Policy-Space Response Oracles (PSRO) to improve scalability in solving large-scale PEGs. However, these methods primarily focus on specific PEGs with fixed initial conditions that may vary substantially in real-world scenarios, which significantly hinders the applicability of the traditional methods. To address this issue, we introduce Grasper, a GeneRAlist purSuer for Pursuit-Evasion pRoblems, capable of efficiently generating pursuer policies tailored to specific PEGs. Our contributions are threefold. First, we present a novel architecture that offers high-quality solutions for diverse PEGs, comprising critical components such as (i) a graph neural network (GNN) to encode PEGs into hidden vectors, and (ii) a hypernetwork to generate pursuer policies based on these hidden vectors. Second, we develop an efficient three-stage training method involving (i) a pre-pretraining stage for learning robust PEG representations through self-supervised graph learning techniques such as the graph masked auto-encoder (GraphMAE), (ii) a pre-training stage utilizing heuristic-guided multi-task pre-training (HMP), where heuristic-derived reference policies (e.g., obtained via Dijkstra’s algorithm) regularize pursuer policies, and (iii) a fine-tuning stage that employs PSRO to generate pursuer policies on designated PEGs. Finally, we perform extensive experiments on synthetic and real-world maps, showcasing Grasper’s significant superiority over baselines in terms of solution quality and generalizability. We demonstrate that Grasper provides a versatile approach for solving pursuit-evasion problems across a broad range of scenarios, enabling practical deployment in real-world situations.
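The GNN-plus-hypernetwork pairing can be illustrated with a minimal PyTorch sketch: a game embedding (as a GNN would produce) is mapped to the weights of a small policy head. All sizes and layer choices are illustrative assumptions, not Grasper's architecture.

```python
# Minimal sketch of a hypernetwork that emits a per-game policy head from a
# game embedding. Sizes are illustrative.
import torch
import torch.nn as nn

class PolicyHypernet(nn.Module):
    def __init__(self, embed_dim=64, obs_dim=32, n_actions=8):
        super().__init__()
        self.obs_dim, self.n_actions = obs_dim, n_actions
        # generate an (obs_dim x n_actions) weight matrix plus a bias per game
        self.gen = nn.Linear(embed_dim, obs_dim * n_actions + n_actions)

    def forward(self, game_embedding, obs):
        params = self.gen(game_embedding)
        W = params[: self.obs_dim * self.n_actions].view(self.obs_dim, self.n_actions)
        b = params[self.obs_dim * self.n_actions :]
        return torch.softmax(obs @ W + b, dim=-1)   # pursuer policy for this game
```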

AAAI Conference 2024 Conference Paper

Market-GAN: Adding Control to Financial Market Data Generation with Semantic Context

  • Haochong Xia
  • Shuo Sun
  • Xinrun Wang
  • Bo An

Financial simulators play an important role in enhancing forecasting accuracy, managing risks, and fostering strategic financial decision-making. Despite the development of financial market simulation methodologies, existing frameworks often struggle to adapt to specialized simulation contexts. We pinpoint the challenges as: i) current financial datasets do not contain context labels; ii) current techniques are not designed to generate financial data with context as control, which demands greater precision compared to other modalities; and iii) the inherent difficulty of generating context-aligned, high-fidelity data given the non-stationary, noisy nature of financial data. To address these challenges, our contributions are: i) we propose the Contextual Market Dataset with market dynamics, stock ticker, and history state as context, leveraging a market dynamics modeling method that combines linear regression and clustering to extract market dynamics; ii) we present Market-GAN, a novel architecture incorporating a Generative Adversarial Network (GAN) for controllable generation with context, an autoencoder for learning low-dimensional features, and supervisors for knowledge transfer; and iii) we introduce a two-stage training scheme to ensure that Market-GAN captures the intrinsic market distribution with multiple objectives. In the pretraining stage, with the use of the autoencoder and supervisors, we prepare the generator with a better initialization for the adversarial training stage. We propose a set of holistic evaluation metrics that consider alignment, fidelity, data usability on downstream tasks, and market facts. We evaluate Market-GAN with Dow Jones Industrial Average data from 2000 to 2023 and showcase its superior performance in comparison to 4 state-of-the-art time-series generative models.

AAMAS Conference 2024 Conference Paper

Reinforcement Nash Equilibrium Solver

  • Xinrun Wang
  • Chang Yang
  • Shuxin Li
  • Pengdeng Li
  • Xiao Huang
  • Hau Chan
  • Bo An

Nash Equilibrium (NE) is the canonical solution concept of game theory, providing an elegant tool for understanding rational behavior. Computing NE in two- or multi-player general-sum games is PPAD-complete. Therefore, in this work, we propose the REinforcement Nash Equilibrium Solver (RENES), which trains a single policy to modify games of different sizes and applies the solvers to the modified games, where the obtained solution is evaluated on the original games. Specifically, our contributions are threefold. i) We represent the games as α-rank response graphs and leverage a graph neural network (GNN) to handle games of different sizes as inputs; ii) we use tensor decomposition, e.g., canonical polyadic (CP) decomposition, to keep the dimension of the modifying actions fixed across games of different sizes; iii) we train the modifying strategy with the widely-used proximal policy optimization (PPO) and apply the solvers to the modified games, where the obtained solution is evaluated on the original games. Extensive experiments on large-scale normal-form games show that our method can further improve the approximation of NE for different solvers, i.e., α-rank, CE, FP and PRD, and can generalize to unseen games.

IJCAI Conference 2024 Conference Paper

Reinforcement Nash Equilibrium Solver

  • Xinrun Wang
  • Chang Yang
  • Shuxin Li
  • Pengdeng Li
  • Xiao Huang
  • Hau Chan
  • Bo An

Nash Equilibrium (NE) is the canonical solution concept of game theory, providing an elegant tool for understanding rational behavior. Though a mixed-strategy NE exists in any game with finite players and actions, computing NE in two- or multi-player general-sum games is PPAD-complete. Various alternative solutions, e.g., Correlated Equilibrium (CE), and learning methods, e.g., fictitious play (FP), have been proposed to approximate NE. For convenience, we call these methods ``inexact solvers'', or ``solvers'' for short. However, the alternative solutions differ from NE, and the learning methods generally fail to converge to NE. Therefore, in this work, we propose the REinforcement Nash Equilibrium Solver (RENES), which trains a single policy to modify games of different sizes and applies the solvers to the modified games, where the obtained solution is evaluated on the original games. Specifically, our contributions are threefold. i) We represent the games as alpha-rank response graphs and leverage a graph neural network (GNN) to handle games of different sizes as inputs; ii) we use tensor decomposition, e.g., canonical polyadic (CP) decomposition, to keep the dimension of the modifying actions fixed across games of different sizes; iii) we train the modifying strategy with the widely-used proximal policy optimization (PPO) and apply the solvers to the modified games, where the obtained solution is evaluated on the original games. Extensive experiments on large-scale normal-form games show that our method can further improve the approximation of NE for different solvers, i.e., alpha-rank, CE, FP and PRD, and can generalize to unseen games.

IJCAI Conference 2024 Conference Paper

Self-adaptive PSRO: Towards an Automatic Population-based Game Solver

  • Pengdeng Li
  • Shuxin Li
  • Chang Yang
  • Xinrun Wang
  • Xiao Huang
  • Hau Chan
  • Bo An

Policy-Space Response Oracles (PSRO) as a general algorithmic framework has achieved state-of-the-art performance in learning equilibrium policies of two-player zero-sum games. However, the hand-crafted hyperparameter value selection in most of the existing works requires extensive domain knowledge, forming the main barrier to applying PSRO to different games. In this work, we make the first attempt to investigate the possibility of self-adaptively determining the optimal hyperparameter values in the PSRO framework. Our contributions are three-fold: (1) Using several hyperparameters, we propose a parametric PSRO that unifies the gradient descent ascent (GDA) and different PSRO variants. (2) We propose the self-adaptive PSRO (SPSRO) by casting the hyperparameter value selection of the parametric PSRO as a hyperparameter optimization (HPO) problem where our objective is to learn an HPO policy that can self-adaptively determine the optimal hyperparameter values during the running of the parametric PSRO. (3) To overcome the poor performance of online HPO methods, we propose a novel offline HPO approach to optimize the HPO policy based on the Transformer architecture. Experiments on various two-player zero-sum games demonstrate the superiority of SPSRO over different baselines.

ICLR Conference 2024 Conference Paper

Solving Homogeneous and Heterogeneous Cooperative Tasks with Greedy Sequential Execution

  • Shanqi Liu
  • Dong Xing
  • Pengjie Gu
  • Xinrun Wang
  • Bo An 0001
  • Yong Liu

Cooperative multi-agent reinforcement learning (MARL) is extensively used for solving complex cooperative tasks, and value decomposition methods are a prevalent approach for this domain. However, these methods have not been successful in addressing both homogeneous and heterogeneous tasks simultaneously, which is a crucial aspect of the practical application of cooperative agents. On one hand, value decomposition methods demonstrate superior performance in homogeneous tasks. Nevertheless, they tend to produce agents with similar policies, which is unsuitable for heterogeneous tasks. On the other hand, solutions based on personalized observation or assigned roles are well-suited for heterogeneous tasks. However, they often lead to a trade-off situation where the agent's performance in homogeneous scenarios is negatively affected due to the aggregation of distinct policies. An alternative approach is to adopt sequential execution policies, which offer a flexible form for learning both types of tasks. However, learning sequential execution policies poses challenges in terms of credit assignment, and the limited information about subsequently executed agents can lead to sub-optimal solutions, which is known as the relative over-generalization problem. To tackle these issues, this paper proposes Greedy Sequential Execution (GSE) as a solution to learn the optimal policy that covers both scenarios. In the proposed GSE framework, we introduce an individual utility function into the framework of value decomposition to consider the complex interactions between agents. This function is capable of representing both the homogeneous and heterogeneous optimal policies. Furthermore, we utilize the greedy marginal contribution calculated by the utility function as the credit value of the sequential execution policy to address the credit assignment and relative over-generalization problems. We evaluate GSE in both homogeneous and heterogeneous scenarios. The results demonstrate that GSE achieves significant improvements in performance across multiple domains, especially in scenarios involving both homogeneous and heterogeneous tasks.
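A minimal sketch of the sequential-execution idea with greedy marginal credit: agents act one by one, each maximizing a learned joint utility given the actions already chosen, and each agent's credit is its marginal gain. The `utility_fn` below is a hypothetical stand-in for the paper's learned individual utility function.

```python
# Minimal sketch of greedy sequential execution with marginal-contribution
# credit; utility_fn is a hypothetical learned joint utility.

def greedy_sequential_actions(agents, action_space, utility_fn):
    """Return ({agent: action}, {agent: credit}); credit is marginal gain."""
    joint, credits = {}, {}
    for agent in agents:                      # fixed execution order
        base = utility_fn(joint)
        best_a = max(action_space, key=lambda a: utility_fn({**joint, agent: a}))
        joint[agent] = best_a
        credits[agent] = utility_fn(joint) - base   # greedy marginal contribution
    return joint, credits
```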

ICLR Conference 2024 Conference Paper

Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control

  • Longtao Zheng
  • Rundong Wang
  • Xinrun Wang
  • Bo An 0001

Building agents with large language models (LLMs) for computer control is a burgeoning research area, where the agent receives computer states and performs actions to complete complex tasks. Previous computer agents have demonstrated the benefits of in-context learning (ICL); however, their performance is hindered by several issues. First, the limited context length of LLMs and complex computer states restrict the number of exemplars, as a single webpage can consume the entire context. Second, the exemplars in current methods, such as high-level plans and multi-choice questions, cannot represent complete trajectories, leading to suboptimal performance in long-horizon tasks. Third, existing computer agents rely on task-specific exemplars and overlook the similarity among tasks, resulting in poor generalization to novel tasks. To address these challenges, we introduce Synapse, a computer agent featuring three key components: i) state abstraction, which filters out task-irrelevant information from raw states, allowing more exemplars within the limited context, ii) trajectory-as-exemplar prompting, which prompts the LLM with complete trajectories of the abstracted states and actions to improve multi-step decision-making, and iii) exemplar memory, which stores the embeddings of exemplars and retrieves them via similarity search for generalization to novel tasks. We evaluate Synapse on MiniWoB++, a standard task suite, and Mind2Web, a real-world website benchmark. In MiniWoB++, Synapse achieves a 99.2% average success rate (a 10% relative improvement) across 64 tasks using demonstrations from only 48 tasks. Notably, Synapse is the first ICL method to solve the book-flight task in MiniWoB++. Synapse also exhibits a 56% relative improvement in average step success rate over the previous state-of-the-art prompting scheme in Mind2Web.
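The exemplar memory amounts to embedding-keyed nearest-neighbour retrieval, sketched below with cosine similarity; the embeddings passed in are assumed to come from some text-embedding model, and the class layout is illustrative rather than Synapse's implementation.

```python
# Minimal sketch of exemplar memory with similarity search: store trajectory
# exemplars keyed by a task embedding, retrieve nearest neighbours for a new task.
import numpy as np

class ExemplarMemory:
    def __init__(self):
        self.keys, self.trajectories = [], []

    def add(self, task_embedding, trajectory):
        self.keys.append(task_embedding / np.linalg.norm(task_embedding))
        self.trajectories.append(trajectory)

    def retrieve(self, query_embedding, k=3):
        q = query_embedding / np.linalg.norm(query_embedding)
        sims = np.stack(self.keys) @ q                # cosine similarity
        top = np.argsort(sims)[::-1][:k]
        return [self.trajectories[i] for i in top]    # prompt exemplars
```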

AAAI Conference 2024 Conference Paper

Transition-Informed Reinforcement Learning for Large-Scale Stackelberg Mean-Field Games

  • Pengdeng Li
  • Runsheng Yu
  • Xinrun Wang
  • Bo An

Many real-world scenarios including fleet management and Ad auctions can be modeled as Stackelberg mean-field games (SMFGs) where a leader aims to incentivize a large number of homogeneous self-interested followers to maximize her utility. Existing works focus on cases with a small number of heterogeneous followers, e.g., 5-10, and suffer from scalability issues when the number of followers increases. There are three major challenges in solving large-scale SMFGs: i) classical methods based on solving differential equations fail as they require exact dynamics parameters, ii) learning by interacting with the environment is data-inefficient, and iii) the complex interaction between the leader and followers makes the learning performance unstable. We address these challenges through transition-informed reinforcement learning. Our main contributions are threefold: i) we first propose an RL framework, the Stackelberg mean-field update, to learn the leader's policy without priors of the environment, ii) to improve the data efficiency and accelerate the learning process, we then propose Transition-Informed Reinforcement Learning (TIRL), which leverages the instantiated empirical Fokker-Planck equation, and iii) we develop a regularized TIRL that employs various regularizers to alleviate the sensitivity of the learning performance to the initialization of the leader's policy. Extensive experiments on fleet management and food gathering demonstrate that our approach can scale up to 100,000 followers and significantly outperform existing baselines.

ICLR Conference 2024 Conference Paper

True Knowledge Comes from Practice: Aligning Large Language Models with Embodied Environments via Reinforcement Learning

  • Weihao Tan
  • Wentao Zhang 0007
  • Shanqi Liu
  • Longtao Zheng
  • Xinrun Wang
  • Bo An 0001

Despite the impressive performance across numerous tasks, large language models (LLMs) often fail in solving simple decision-making tasks due to the misalignment of the knowledge in LLMs with environments. On the contrary, reinforcement learning (RL) agents learn policies from scratch, which keeps them aligned with their environments but makes it difficult to incorporate prior knowledge for efficient exploration. To narrow the gap, we propose TWOSOME, a novel general online framework that deploys LLMs as decision-making agents to efficiently interact and align with embodied environments via RL, without requiring any prepared datasets or prior knowledge of the environments. Firstly, we query the joint probabilities of each valid action with LLMs to form behavior policies. Then, to enhance the stability and robustness of the policies, we propose two normalization methods and summarize four prompt design principles. Finally, we design a novel parameter-efficient training architecture where the actor and critic share one frozen LLM equipped with low-rank adapters (LoRA) updated by PPO. We conduct extensive experiments to evaluate TWOSOME. i) TWOSOME exhibits significantly better sample efficiency and performance compared to the conventional RL method PPO and the prompt-tuning method SayCan, in both the classical decision-making environment Overcooked and the simulated household environment VirtualHome. ii) Benefiting from LLMs' open-vocabulary feature, TWOSOME shows superior generalization ability to unseen tasks. iii) Under our framework, there is no significant loss of the LLMs' original ability during online PPO finetuning.
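Forming the behavior policy from action likelihoods can be sketched as follows, using token-length normalization (one flavor of the normalizations the paper motivates). `action_token_logprobs`, the per-action lists of token log-probabilities under the prompt, is a hypothetical input produced by scoring each action's text with the LLM.

```python
# Minimal sketch: turn per-action token log-probabilities into a distribution
# over valid actions only, with length normalization and a stable softmax.
import math

def action_policy(valid_actions, action_token_logprobs):
    """valid_actions: list of action strings;
    action_token_logprobs: list of per-token log-prob lists, one per action."""
    scores = [sum(lps) / len(lps) for lps in action_token_logprobs]  # length-normalize
    m = max(scores)
    exp = [math.exp(s - m) for s in scores]                          # stable softmax
    z = sum(exp)
    return {a: e / z for a, e in zip(valid_actions, exp)}
```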

ICML Conference 2023 Conference Paper

Controlling Type Confounding in Ad Hoc Teamwork with Instance-wise Teammate Feedback Rectification

  • Dong Xing
  • Pengjie Gu
  • Qian Zheng
  • Xinrun Wang
  • Shanqi Liu
  • Longtao Zheng
  • Bo An 0001
  • Gang Pan 0001

Ad hoc teamwork requires an agent to cooperate with unknown teammates without prior coordination. Many works propose to abstract teammate instances into a high-level representation of types and then pre-train the best response for each type. However, most of them do not consider the distribution of teammate instances within a type. This could expose the agent to the hidden risk of type confounding. In the worst case, the best response for an abstract teammate type could be the worst response for all specific instances of that type. This work addresses the issue through the lens of causal inference. We first theoretically demonstrate that this phenomenon is due to the spurious correlation brought by the uncontrolled teammate distribution. Then, we propose our solution, CTCAT, which disentangles such correlation through an instance-wise teammate feedback rectification. This operation reweights the interactions of teammate instances within a shared type to reduce the influence of type confounding. The effect of CTCAT is evaluated in multiple domains, including classic ad hoc teamwork tasks and real-world scenarios. Results show that CTCAT is robust to the influence of type confounding, a practical issue that directly threatens the robustness of trained agents but went unnoticed in previous works.

ICLR Conference 2023 Conference Paper

Enhancing Meta Learning via Multi-Objective Soft Improvement Functions

  • Runsheng Yu
  • Weiyu Chen
  • Xinrun Wang
  • James T. Kwok

Meta-learning tries to leverage information from similar learning tasks. In the commonly-used bilevel optimization formulation, the shared parameter is learned in the outer loop by minimizing the average loss over all tasks. However, the converged solution may be compromised in that it only focuses on optimizing a small subset of tasks. To alleviate this problem, we consider meta-learning as a multi-objective optimization (MOO) problem, in which each task is an objective. However, existing MOO solvers need to access all the objectives' gradients in each iteration and cannot scale to the huge number of tasks in typical meta-learning settings. To alleviate this problem, we propose a scalable gradient-based solver with the use of mini-batches. We provide theoretical guarantees on the Pareto optimality or Pareto stationarity of the converged solution. Empirical studies on various machine learning settings demonstrate that the proposed method is efficient and achieves better performance than the baselines, particularly in improving the performance of the poorly-performing tasks and thus alleviating the compromising phenomenon.

NeurIPS Conference 2023 Conference Paper

Offline RL with Discrete Proxy Representations for Generalizability in POMDPs

  • Pengjie Gu
  • Xinyu Cai
  • Dong Xing
  • Xinrun Wang
  • Mengchen Zhao
  • Bo An

Offline Reinforcement Learning (RL) has demonstrated promising results in various applications by learning policies from previously collected datasets, reducing the need for online exploration and interactions. However, real-world scenarios usually involve partial observability, which brings crucial challenges to the deployment of offline RL methods: i) the policy trained on data with full observability is not robust against masked observations during execution, and ii) the information about which parts of the observations are masked is usually unknown during training. To address these challenges, we present Offline RL with DiscrEte pRoxy representations (ORDER), a probabilistic framework that leverages novel state representations to improve robustness against diverse masked observabilities. Specifically, we propose a discrete representation of the states and use a proxy representation to recover the states from masked, partially observable trajectories. The training of ORDER can be compactly described as the following three steps: i) learning the discrete state representations on data with full observations, ii) training the decision module based on the discrete representations, and iii) training the proxy discrete representations on data with various partial observations, aligning them with the discrete representations. We conduct extensive experiments to evaluate ORDER, showcasing its effectiveness in offline RL for diverse partially observable scenarios and highlighting the significance of discrete proxy representations for generalization performance. ORDER is a flexible framework that can employ any offline RL algorithm, and we hope that ORDER can pave the way for deploying RL policies against various partial observabilities in the real world.

ICLR Conference 2023 Conference Paper

Population-size-Aware Policy Optimization for Mean-Field Games

  • Pengdeng Li
  • Xinrun Wang
  • Shuxin Li 0001
  • Hau Chan
  • Bo An 0001

In this work, we attempt to bridge the two fields of finite-agent and infinite-agent games by studying how the optimal policies of agents evolve with the number of agents (population size) in mean-field games, an agent-centric perspective in contrast to existing works that focus typically on the convergence of the empirical distribution of the population. To this end, the premise is to obtain the optimal policies of a set of finite-agent games with different population sizes. However, deriving the closed-form solution for each game is theoretically intractable, training a distinct policy for each game is computationally intensive, and directly applying a policy trained on one game to other games is sub-optimal. We address these challenges through Population-size-Aware Policy Optimization (PAPO). Our contributions are three-fold. First, to efficiently generate policies for games with different population sizes, we propose PAPO, which unifies two natural options (augmentation and hypernetwork) and achieves significantly better performance. PAPO consists of three components: i) a population-size encoding which transforms the original value of the population size into an equivalent encoding to avoid training collapse, ii) a hypernetwork to generate a distinct policy for each game conditioned on the population size, and iii) the population size as an additional input to the generated policy. Next, we construct a multi-task-based training procedure to efficiently train the neural networks of PAPO by sampling data from multiple games with different population sizes. Finally, extensive experiments on multiple environments show the significant superiority of PAPO over baselines, and the analysis of the evolution of the generated policies further deepens our understanding of the two fields of finite-agent and infinite-agent games.
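The population-size encoding can be illustrated with a small sketch. The concrete choice below (log scaling plus sinusoidal features) is an illustrative assumption, not PAPO's encoding; the point is that raw integers spanning several orders of magnitude are mapped to a bounded vector before reaching the hypernetwork.

```python
# Minimal sketch of a population-size encoding: map a raw integer N to a
# bounded feature vector for the hypernetwork (encoding choice is illustrative).
import numpy as np

def encode_population_size(n, dim=8):
    x = np.log(n)                                  # compress the raw scale
    freqs = np.exp(np.linspace(0.0, 3.0, dim // 2))
    return np.concatenate([np.sin(x * freqs), np.cos(x * freqs)])
```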

TMLR Journal 2023 Journal Article

PRUDEX-Compass: Towards Systematic Evaluation of Reinforcement Learning in Financial Markets

  • Shuo Sun
  • Molei Qin
  • Xinrun Wang
  • Bo An

The financial markets, which involve more than $90 trillion in market capitalization, attract the attention of innumerable investors around the world. Recently, reinforcement learning in financial markets (FinRL) has emerged as a promising direction for training agents to make profitable investment decisions. However, the evaluation of most FinRL methods only focuses on profit-related measures and ignores many critical axes, which is far from satisfactory for financial practitioners looking to deploy these methods in real-world financial markets. Therefore, we introduce PRUDEX-Compass, which has 6 axes, i.e., Profitability, Risk-control, Universality, Diversity, rEliability, and eXplainability, with a total of 17 measures for a systematic evaluation. Specifically, i) we propose AlphaMix+ as a strong FinRL baseline, which leverages mixture-of-experts (MoE) and risk-sensitive approaches to make diversified risk-aware investment decisions, ii) we evaluate 8 FinRL methods on 4 long-term real-world datasets of influential financial markets to demonstrate the usage of our PRUDEX-Compass, and iii) PRUDEX-Compass, together with the 4 real-world datasets, standard implementations of the 8 FinRL methods, and a portfolio management environment, is released as a public resource to facilitate the design and comparison of new FinRL methods. We hope that PRUDEX-Compass can not only shed light on future FinRL research to prevent untrustworthy results from stagnating FinRL before successful industry deployment but also provide a new challenging algorithm evaluation scenario for the reinforcement learning (RL) community.

AAAI Conference 2023 Conference Paper

Solving Large-Scale Pursuit-Evasion Games Using Pre-trained Strategies

  • Shuxin Li
  • Xinrun Wang
  • Youzhi Zhang
  • Wanqi Xue
  • Jakub Černý
  • Bo An

Pursuit-evasion games on graphs model the coordination of police forces chasing a fleeing felon in real-world urban settings, using the standard framework of imperfect-information extensive-form games (EFGs). In recent years, solving EFGs has been largely dominated by the Policy-Space Response Oracle (PSRO) methods due to their modularity, scalability, and favorable convergence properties. However, even these methods quickly reach their limits when facing large combinatorial strategy spaces of the pursuit-evasion games. To improve their efficiency, we integrate the pre-training and fine-tuning paradigm into the core module of PSRO -- the repeated computation of the best response. First, we pre-train the pursuer's policy base model against many different strategies of the evader. Then we proceed with the PSRO loop and fine-tune the pre-trained policy to attain the pursuer's best responses. The empirical evaluation shows that our approach significantly outperforms the baselines in terms of speed and scalability, and can solve even games on street maps of megalopolises with tens of thousands of crossroads -- a scale beyond the effective reach of previous methods.

NeurIPS Conference 2023 Conference Paper

TradeMaster: A Holistic Quantitative Trading Platform Empowered by Reinforcement Learning

  • Shuo Sun
  • Molei Qin
  • Wentao Zhang
  • Haochong Xia
  • Chuqiao Zong
  • Jie Ying
  • Yonggang Xie
  • Lingxuan Zhao

The financial markets, which involve over $90 trillion in market capitalization, attract the attention of innumerable profit-seeking investors globally. The recent explosion of reinforcement learning in financial trading (RLFT) research has shown stellar performance on many quantitative trading tasks. However, it is still challenging to deploy reinforcement learning (RL) methods into real-world financial markets due to the highly composite nature of this domain, which entails design choices and interactions between components that collect financial data, conduct feature engineering, build market environments, make investment decisions, evaluate model behaviors, and offer user interfaces. Despite the availability of abundant financial data and advanced RL techniques, a remarkable gap still exists between the potential and realized utilization of RL in financial trading. In particular, orchestrating an RLFT project lifecycle poses challenges in engineering (i.e., hard to build), benchmarking (i.e., hard to compare), and usability (i.e., hard to optimize, maintain, and use). To overcome these challenges, we introduce TradeMaster, a holistic open-source RLFT platform that serves as i) a software toolkit, ii) an empirical benchmark, and iii) a user interface. Our ultimate goal is to provide infrastructure for transparent and reproducible RLFT research and facilitate its real-world deployment with industry impact. TradeMaster will be updated continuously and welcomes contributions from both the RL and finance communities.

IJCAI Conference 2021 Conference Paper

CFR-MIX: Solving Imperfect Information Extensive-Form Games with Combinatorial Action Space

  • Shuxin Li
  • Youzhi Zhang
  • Xinrun Wang
  • Wanqi Xue
  • Bo An

In many real-world scenarios, a team of agents must coordinate with each other to compete against an opponent. The challenge of solving this type of game is that the team's joint action space grows exponentially with the number of agents, which results in the inefficiency of existing algorithms, e.g., Counterfactual Regret Minimization (CFR). To address this problem, we propose a new framework of CFR: CFR-MIX. Firstly, we propose a new strategy representation that represents a joint action strategy using the individual strategies of all agents and a consistency relationship to maintain the cooperation between agents. To compute the equilibrium with individual strategies under the CFR framework, we transform the consistency relationship between strategies into a consistency relationship between cumulative regret values. Furthermore, we propose a novel decomposition method over cumulative regret values to guarantee the consistency relationship between the cumulative regret values. Finally, we introduce our new algorithm CFR-MIX, which employs a mixing layer to estimate the cumulative regret values of joint actions as a non-linear combination of the cumulative regret values of individual actions. Experimental results show that CFR-MIX significantly outperforms existing algorithms on various games.
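For background, the primitive that CFR-MIX decomposes, regret matching over cumulative regrets, takes only a few lines; the mixing-network estimation of joint-action regrets is the paper's contribution and is not reproduced here.

```python
# Background sketch: regret matching, the per-information-set update at the
# heart of CFR-style algorithms.
import numpy as np

def regret_matching(cumulative_regret):
    """Strategy proportional to positive cumulative regret (uniform if none)."""
    r = np.maximum(np.asarray(cumulative_regret, dtype=float), 0.0)
    total = r.sum()
    if total <= 0.0:
        return np.full(r.shape, 1.0 / r.size)
    return r / total
```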

IJCAI Conference 2021 Conference Paper

Neural Regret-Matching for Distributed Constraint Optimization Problems

  • Yanchen Deng
  • Runsheng Yu
  • Xinrun Wang
  • Bo An

Distributed constraint optimization problems (DCOPs) are a powerful model for multi-agent coordination and optimization, where information and controls are distributed among multiple agents by nature. Sampling-based algorithms are important incomplete techniques for solving medium-scale DCOPs. However, they use tables to exactly store all the information (e.g., costs, confidence bounds) to facilitate sampling, which limits their scalability. This paper tackles the limitation by incorporating deep neural networks in solving DCOPs for the first time and presents a neural-based sampling scheme built upon regret-matching. In the algorithm, each agent trains a neural network to approximate the regret related to its local problem and performs sampling according to the estimated regret. Furthermore, to ensure exploration, we propose a regret rounding scheme that rounds small regret values up to positive numbers. We theoretically show the regret bound of our algorithm, and extensive evaluations indicate that our algorithm can scale up to large-scale DCOPs and significantly outperform the state-of-the-art methods.

NeurIPS Conference 2021 Conference Paper

RMIX: Learning Risk-Sensitive Policies for Cooperative Reinforcement Learning Agents

  • Wei Qiu
  • Xinrun Wang
  • Runsheng Yu
  • Rundong Wang
  • Xu He
  • Bo An
  • Svetlana Obraztsova
  • Zinovi Rabinovich

Current value-based multi-agent reinforcement learning methods optimize individual Q values to guide individuals' behaviours via centralized training with decentralized execution (CTDE). However, such expected, i.e., risk-neutral, Q values are not sufficient even with CTDE due to the randomness of rewards and the uncertainty in environments, which causes the failure of these methods to train coordinating agents in complex environments. To address these issues, we propose RMIX, a novel cooperative MARL method with the Conditional Value at Risk (CVaR) measure over the learned distributions of individuals' Q values. Specifically, we first learn the return distributions of individuals to analytically calculate CVaR for decentralized execution. Then, to handle the temporal nature of the stochastic outcomes during executions, we propose a dynamic risk level predictor for risk level tuning. Finally, we optimize the CVaR policies with CVaR values used to estimate the target in the TD error during centralized training, and the CVaR values are used as auxiliary local rewards to update the local distribution via the Quantile Regression loss. Empirically, we show that our method outperforms many state-of-the-art methods on various multi-agent risk-sensitive navigation scenarios and challenging StarCraft II cooperative tasks, demonstrating enhanced coordination and improved sample efficiency.
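Computing CVaR from a learned quantile representation of the return distribution is the standard estimator sketched below; the uniform-quantile layout is an assumption, and the details of RMIX's dynamic risk level predictor are omitted.

```python
# Minimal sketch of CVaR at level alpha from N equally spaced quantile
# estimates of a return distribution (as in quantile-regression RL).
import numpy as np

def cvar_from_quantiles(quantile_values, alpha=0.1):
    """Mean of the worst alpha-fraction of returns."""
    q = np.sort(np.asarray(quantile_values, dtype=float))
    k = max(1, int(np.ceil(alpha * len(q))))
    return q[:k].mean()
```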

IJCAI Conference 2021 Conference Paper

Solving Large-Scale Extensive-Form Network Security Games via Neural Fictitious Self-Play

  • Wanqi Xue
  • Youzhi Zhang
  • Shuxin Li
  • Xinrun Wang
  • Bo An
  • Chai Kiat Yeo

Securing networked infrastructures is important in the real world. The problem of deploying security resources to protect against an attacker in networked domains can be modeled as Network Security Games (NSGs). Unfortunately, existing approaches, including deep learning-based approaches, are inefficient at solving large-scale extensive-form NSGs. In this paper, we propose a novel learning paradigm, NSG-NFSP, to solve large-scale extensive-form NSGs based on Neural Fictitious Self-Play (NFSP). Our main contributions include: i) reforming the best response (BR) policy network in NFSP to be a mapping from action-state pairs to action-values, making the calculation of the BR possible in NSGs; ii) converting the average policy network of an NFSP agent into a metric-based classifier, helping the agent assign distributions only over legal actions rather than all actions; iii) enabling NFSP with high-level actions, which benefits training efficiency and stability in NSGs; and iv) leveraging the information contained in the graphs of NSGs by learning efficient graph node embeddings. Our algorithm significantly outperforms state-of-the-art algorithms in both scalability and solution quality.

ICLR Conference 2020 Conference Paper

Learning Expensive Coordination: An Event-Based Deep RL Approach

  • Zhenyu Shi
  • Runsheng Yu
  • Xinrun Wang
  • Rundong Wang
  • Youzhi Zhang 0001
  • Hanjiang Lai
  • Bo An 0001

Existing works in deep Multi-Agent Reinforcement Learning (MARL) mainly focus on coordinating cooperative agents to complete certain tasks jointly. However, in many cases of the real world, agents are self-interested such as employees in a company and clubs in a league. Therefore, the leader, i.e., the manager of the company or the league, needs to provide bonuses to followers for efficient coordination, which we call expensive coordination. The main difficulties of expensive coordination are that i) the leader has to consider the long-term effect and predict the followers' behaviors when assigning bonuses and ii) the complex interactions between followers make the training process hard to converge, especially when the leader's policy changes with time. In this work, we address this problem through an event-based deep RL approach. Our main contributions are threefold. (1) We model the leader's decision-making process as a semi-Markov Decision Process and propose a novel multi-agent event-based policy gradient to learn the leader's long-term policy. (2) We exploit the leader-follower consistency scheme to design a follower-aware module and a follower-specific attention module to predict the followers' behaviors and make accurate response to their behaviors. (3) We propose an action abstraction-based policy gradient algorithm to reduce the followers' decision space and thus accelerate the training process of followers. Experiments in resource collections, navigation, and the predator-prey game reveal that our approach outperforms the state-of-the-art methods dramatically.

ICAPS Conference 2020 Conference Paper

We Mind Your Well-Being: Preventing Depression in Uncertain Social Networks by Sequential Interventions

  • Aye Phyu Phyu Aung
  • Xinrun Wang
  • Bo An 0001
  • Xiaoli Li 0001

Mental health has become a major concern; the WHO estimates that more than 350 million people worldwide are affected by depression. Studies have shown that interventions and social support can reduce stress and depression. However, counselling centers do not have enough resources to provide counselling and social support to all the participants of interest. This paper helps social support organizations (e.g., university counselling centers) sequentially select participants for interventions. Unfortunately, previous works do not consider emotion propagation from other neighbours of the influencees or the initial uncertainties of mental states and influence. Moreover, they fail to scale up to problems with a large number of participants due to the huge state space. Our contributions in this paper are fourfold. Firstly, we propose a new model that addresses the sequential intervention of participants while considering the propagation of emotions, and formulate it as a Partially Observable Markov Decision Process (POMDP) to handle uncertainties about participants' mental states and the influence between them. Secondly, we apply reasoning to refine beliefs to improve solution quality given the lack of initial information on mental state values. Thirdly, we improve scalability by abstracting states, reducing their number by representing mental states with an abstracted discrete set. We further improve scalability through multi-level partitioning to obtain smaller POMDPs. Finally, we conduct extensive experiments on both synthetic and real networks to show that our algorithm significantly improves scalability with comparable solution quality relative to state-of-the-art algorithms.

IJCAI Conference 2019 Conference Paper

Who Should Pay the Cost: A Game-theoretic Model for Government Subsidized Investments to Improve National Cybersecurity

  • Xinrun Wang
  • Bo An
  • Hau Chan

Due to recent cyber attacks, cybersecurity is becoming more critical in modern society. A single attack (e.g., the WannaCry ransomware attack) can cause as much as $4 billion in damage. However, cybersecurity investment by companies is far from satisfactory. Therefore, governments (e.g., in the UK) launch grants and subsidies to help companies boost their cybersecurity to create a safer national cyber environment. The allocation problem is hard due to limited subsidies, the interdependence between self-interested companies, and the presence of a strategic cyber attacker. To tackle the government's allocation problem, we introduce a Stackelberg game-theoretic model where the government first commits to an allocation and the companies/users and attacker simultaneously determine their protection and attack (pure or mixed) strategies, respectively. For the pure-strategy case, while there may not be a feasible allocation in general, we prove that computing an optimal allocation is NP-hard and propose a linear reverse convex program for the case where the attacker can attack all users. For the mixed-strategy case, we show that there is a polynomial-time algorithm to find an optimal allocation when the attacker has a single-attack capability. We then provide a heuristic algorithm, based on best-response-gradient dynamics, to find an effective allocation in the general setting. Experimentally, we show that our heuristic is effective and outperforms other baselines on synthetic and real data.

AAAI Conference 2018 Conference Paper

Catching Captain Jack: Efficient Time and Space Dependent Patrols to Combat Oil-Siphoning in International Waters

  • Xinrun Wang
  • Bo An
  • Martin Strobel
  • Fookwai Kong

Pirate syndicates capturing tankers to siphon oil, causing an estimated cost of $5 billion a year, have become a serious security issue for maritime traffic. In response to the threat, coast guards and navies deploy patrol boats to protect international oil trade. However, given the vast area of the sea and the highly time- and space-dependent behaviors of both players, it remains a significant challenge to find efficient ways to deploy patrol resources. In this paper, we address the research challenges and provide four key contributions. First, we construct a Stackelberg model of the oil-siphoning problem based on incident reports of actual attacks; second, we propose a compact formulation and a constraint generation algorithm, which tackle the exponential growth of the defender’s and attacker’s strategy spaces, respectively, to compute efficient strategies for security agencies; third, to further improve scalability, we propose an abstraction method, which exploits the intrinsic similarity of the defender’s strategy space, to solve extremely large-scale games; finally, we evaluate our approaches through extensive simulations and a detailed case study with real ship traffic data. The results demonstrate that our approach achieves a dramatic improvement in scalability with modest influence on solution quality and can scale up to realistic-sized problems.

AAMAS Conference 2017 Conference Paper

Stop Nuclear Smuggling Through Efficient Container Inspection

  • Xinrun Wang
  • Qingyu Guo
  • Bo An

Since 2003, the U.S. government has spent $850 million on the Megaport Initiative, which aims at stopping nuclear smuggling in international container shipping through advanced inspection facilities including Non-Intrusive Inspection (NII) and the Mobile Radiation Detection and Identification System (MRDIS). Unfortunately, it remains a significant challenge to efficiently inspect more than 11.7 million containers imported to the U.S. due to limited inspection resources. Moreover, existing work on container inspection neglects the sophisticated behavior of the smuggler, who can surveil the inspector’s strategy and decide the optimal (sequential) smuggling plan. This paper is the first to tackle this challenging container inspection problem: we propose a novel Container Inspection Model (CIM), which models the interaction between the inspector and the smuggler as a leader-follower Stackelberg game and formulates the smuggler’s sequential decision behavior as a Markov Decision Process (MDP). The special structure of the CIM results in a non-convex optimization problem, which cannot be addressed by existing approaches. We make several key contributions, including: i) a linear relaxation approximation with a guarantee of solution quality, which reformulates the model as a bilinear optimization problem; ii) an algorithm inspired by the Multiparametric Disaggregation Technique (MDT) to solve the reformulated bilinear optimization; and iii) a novel iterative algorithm to further improve scalability. Extensive experimental evaluation shows that our approach can scale up to realistic-sized problems with solutions robust enough to outperform heuristic baselines significantly.