Arrow Research search

Author name cluster

Stephen McAleer

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
1 author row

Possible papers

14

AAMAS Conference 2025 Conference Paper

Ensemble Value Functions for Efficient Exploration in Multi-Agent Reinforcement Learning

  • Lukas Schäfer
  • Oliver Slumbers
  • Stephen McAleer
  • Yali Du
  • Stefano V. Albrecht
  • David Mguni

Multi-agent reinforcement learning (MARL) requires agents to explore within a vast joint action space to find joint actions that lead to coordination. Existing value-based MARL algorithms commonly rely on random exploration, such as 𝜖-greedy, to explore the environment which is not systematic and inefficient at identifying effective actions in multi-agent problems. Additionally, the concurrent training of the policies of multiple agents during training can render the optimisation non-stationary. This can lead to unstable value estimates, highly variant gradients, and ultimately hinder coordination between agents. To address these challenges, we propose ensemble value functions for multi-agent exploration (EMAX). EMAX is a framework to seamlessly extend value-based MARL algorithms. EMAX leverages an ensemble of value functions for each agent to guide their exploration, reduce the variance of their optimisation, and makes their policies more robust to miscoordination. EMAX achieves these benefits by (1) systematically guiding the exploration of agents with a UCB policy towards parts of the environment that require multiple agents to coordinate. (2) EMAX computes average value estimates across the ensemble as target values to reduce the variance of gradients and make optimisation more stable. (3) During evaluation, EMAX selects actions following a majority vote across the ensemble to reduce the likelihood of miscoordination. We first instantiate independent DQN with EMAX and evaluate it in 11 general-sum tasks with sparse rewards. We show that EMAX improves final evaluation returns by 185% across all tasks. We then evaluate EMAX on top of IDQN, VDN and QMIX in 21 common-reward tasks, and show that EMAX improves sample efficiency and final evaluation returns across all tasks over all three vanilla algorithms by 60%, 47%, and 538%, respectively.

AAAI Conference 2024 Conference Paper

Automated Design of Affine Maximizer Mechanisms in Dynamic Settings

  • Michael Curry
  • Vinzenz Thoma
  • Darshan Chakrabarti
  • Stephen McAleer
  • Christian Kroer
  • Tuomas Sandholm
  • Niao He
  • Sven Seuken

Dynamic mechanism design is a challenging extension to ordinary mechanism design in which the mechanism designer must make a sequence of decisions over time in the face of possibly untruthful reports of participating agents. Optimizing dynamic mechanisms for welfare is relatively well understood. However, there has been less work on optimizing for other goals (e.g., revenue), and without restrictive assumptions on valuations, it is remarkably challenging to characterize good mechanisms. Instead, we turn to automated mechanism design to find mechanisms with good performance in specific problem instances. We extend the class of affine maximizer mechanisms to MDPs where agents may untruthfully report their rewards. This extension results in a challenging bilevel optimization problem in which the upper problem involves choosing optimal mechanism parameters, and the lower problem involves solving the resulting MDP. Our approach can find truthful dynamic mechanisms that achieve strong performance on goals other than welfare, and can be applied to essentially any problem setting---without restrictions on valuations---for which RL can learn optimal policies.

AAMAS Conference 2024 Conference Paper

Grasper: A Generalist Pursuer for Pursuit-Evasion Problems

  • Pengdeng Li
  • Shuxin Li
  • Xinrun Wang
  • Jakub Černý
  • Youzhi Zhang
  • Stephen McAleer
  • Hau Chan
  • Bo An

Pursuit-evasion games (PEGs) model interactions between a team of pursuers and an evader in graph-based environments such as urban street networks. Recent advancements have demonstrated the effectiveness of the pre-training and fine-tuning paradigm in Policy-Space Response Oracles (PSRO) to improve scalability in solving large-scale PEGs. However, these methods primarily focus on specific PEGs with fixed initial conditions that may vary substantially in real-world scenarios, which significantly hinders the applicability of the traditional methods. To address this issue, we introduce Grasper, a GeneRAlist purSuer for Pursuit-Evasion pRoblems, capable of efficiently generating pursuer policies tailored to specific PEGs. Our contributions are threefold: First, we present a novel architecture that offers high-quality solutions for diverse PEGs, comprising critical components such as (i) a graph neural network (GNN) to encode PEGs into hidden vectors, and (ii) a hypernetwork to generate pursuer policies based on these hidden vectors. As a second contribution, we develop an efficient three-stage training method involving (i) a pre-pretraining stage for learning robust PEG representations through self-supervised graph learning techniques like graph masked auto-encoder (Graph- MAE), (ii) a pre-training stage utilizing heuristic-guided multi-task pre-training (HMP) where heuristic-derived reference policies (e. g. , through Dijkstra’s algorithm) regularize pursuer policies, and (iii) a fine-tuning stage that employs PSRO to generate pursuer policies on designated PEGs. Finally, we perform extensive experiments on synthetic and real-world maps, showcasing Grasper’s significant superiority over baselines in terms of solution quality and generalizability. We demonstrate that Grasper provides a versatile ∗Equal contribution. †Corresponding author. This work is licensed under a Creative Commons Attribution International 4. 0 License. Proc. of the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2024), N. Alechina, V. Dignum, M. Dastani, J. S. Sichman (eds.), May 6 – 10, 2024, Auckland, New Zealand. © 2024 International Foundation for Autonomous Agents and Multiagent Systems (www. ifaamas. org). approach for solving pursuit-evasion problems across a broad range of scenarios, enabling practical deployment in real-world situations.

IJCAI Conference 2024 Conference Paper

Policy Space Response Oracles: A Survey

  • Ariyan Bighashdel
  • Yongzhao Wang
  • Stephen McAleer
  • Rahul Savani
  • Frans A. Oliehoek

Game theory provides a mathematical way to study the interaction between multiple decision makers. However, classical game-theoretic analysis is limited in scalability due to the large number of strategies, precluding direct application to more complex scenarios. This survey provides a comprehensive overview of a framework for large games, known as Policy Space Response Oracles (PSRO), which holds promise to improve scalability by focusing attention on sufficient subsets of strategies. We first motivate PSRO and provide historical context. We then focus on the strategy exploration problem for PSRO: the challenge of assembling effective subsets of strategies that still represent the original game well with minimum computational cost. We survey current research directions for enhancing the efficiency of PSRO, and explore the applications of PSRO across various domains. We conclude by discussing open questions and future research.

IJCAI Conference 2024 Conference Paper

Scalable Mechanism Design for Multi-Agent Path Finding

  • Paul Friedrich
  • Yulun Zhang
  • Michael Curry
  • Ludwig Dierks
  • Stephen McAleer
  • Jiaoyang Li
  • Tuomas Sandholm
  • Sven Seuken

Multi-Agent Path Finding (MAPF) involves determining paths for multiple agents to travel simultaneously and collision-free through a shared area toward given goal locations. This problem is computationally complex, especially when dealing with large numbers of agents, as is common in realistic applications like autonomous vehicle coordination. Finding an optimal solution is often computationally infeasible, making the use of approximate, suboptimal algorithms essential. Adding to the complexity, agents might act in a self-interested and strategic way, possibly misrepresenting their goals to the MAPF algorithm if it benefits them. Although the field of mechanism design offers tools to align incentives, using these tools without careful consideration can fail when only having access to approximately optimal outcomes. In this work, we introduce the problem of scalable mechanism design for MAPF and propose three strategyproof mechanisms, two of which even use approximate MAPF algorithms. We test our mechanisms on realistic MAPF domains with problem sizes ranging from dozens to hundreds of agents. We find that they improve welfare beyond a simple baseline.

NeurIPS Conference 2023 Conference Paper

Computing Optimal Equilibria and Mechanisms via Learning in Zero-Sum Extensive-Form Games

  • Brian Zhang
  • Gabriele Farina
  • Ioannis Anagnostides
  • Federico Cacciamani
  • Stephen McAleer
  • Andreas Haupt
  • Andrea Celli
  • Nicola Gatti

We introduce a new approach for computing optimal equilibria via learning in games. It applies to extensive-form settings with any number of players, including mechanism design, information design, and solution concepts such as correlated, communication, and certification equilibria. We observe that optimal equilibria are minimax equilibrium strategies of a player in an extensive-form zero-sum game. This reformulation allows to apply techniques for learning in zero-sum games, yielding the first learning dynamics that converge to optimal equilibria, not only in empirical averages, but also in iterates. We demonstrate the practical scalability and flexibility of our approach by attaining state-of-the-art performance in benchmark tabular games, and by computing an optimal mechanism for a sequential auction design problem using deep reinforcement learning.

NeurIPS Conference 2023 Conference Paper

Language Models can Solve Computer Tasks

  • Geunwoo Kim
  • Pierre Baldi
  • Stephen McAleer

Agents capable of carrying out general tasks on a computer can improve efficiency and productivity by automating repetitive tasks and assisting in complex problem-solving. Ideally, such agents should be able to solve new computer tasks presented to them through natural language commands. However, previous approaches to this problem require large amounts of expert demonstrations and task-specific reward functions, both of which are impractical for new tasks. In this work, we show that a pre-trained large language model (LLM) agent can execute computer tasks guided by natural language using a simple prompting scheme where the agent \textbf{R}ecursively \textbf{C}riticizes and \textbf{I}mproves its output (RCI). The RCI approach significantly outperforms existing LLM methods for automating computer tasks and surpasses supervised learning (SL) and reinforcement learning (RL) approaches on the MiniWoB++ benchmark. We compare multiple LLMs and find that RCI with the InstructGPT-3+RLHF LLM is state-of-the-art on MiniWoB++, using only a handful of demonstrations per task rather than tens of thousands, and without a task-specific reward function. Furthermore, we demonstrate RCI prompting's effectiveness in enhancing LLMs' reasoning abilities on a suite of natural language reasoning tasks, outperforming chain of thought (CoT) prompting with external feedback. We find that RCI combined with CoT performs better than either separately. Our code can be found here: https: //github. com/posgnu/rci-agent.

NeurIPS Conference 2023 Conference Paper

Policy Space Diversity for Non-Transitive Games

  • Jian Yao
  • Weiming Liu
  • Haobo Fu
  • Yaodong Yang
  • Stephen McAleer
  • Qiang Fu
  • Wei Yang

Policy-Space Response Oracles (PSRO) is an influential algorithm framework for approximating a Nash Equilibrium (NE) in multi-agent non-transitive games. Many previous studies have been trying to promote policy diversity in PSRO. A major weakness with existing diversity metrics is that a more diverse (according to their diversity metrics) population does not necessarily mean (as we proved in the paper) a better approximation to a NE. To alleviate this problem, we propose a new diversity metric, the improvement of which guarantees a better approximation to a NE. Meanwhile, we develop a practical and well-justified method to optimize our diversity metric using only state-action samples. By incorporating our diversity regularization into the best response solving of PSRO, we obtain a new PSRO variant, \textit{Policy Space Diversity} PSRO (PSD-PSRO). We present the convergence property of PSD-PSRO. Empirically, extensive experiments on single-state games, Leduc, and Goofspiel demonstrate that PSD-PSRO is more effective in producing significantly less exploitable policies than state-of-the-art PSRO variants.

NeurIPS Conference 2023 Conference Paper

Team-PSRO for Learning Approximate TMECor in Large Team Games via Cooperative Reinforcement Learning

  • Stephen McAleer
  • Gabriele Farina
  • Gaoyue Zhou
  • Mingzhi Wang
  • Yaodong Yang
  • Tuomas Sandholm

Recent algorithms have achieved superhuman performance at a number of two-player zero-sum games such as poker and go. However, many real-world situations are multi-player games. Zero-sum two-team games, such as bridge and football, involve two teams where each member of the team shares the same reward with every other member of that team, and each team has the negative of the reward of the other team. A popular solution concept in this setting, called TMECor, assumes that teams can jointly correlate their strategies before play, but are not able to communicate during play. This setting is harder than two-player zero-sum games because each player on a team has different information and must use their public actions to signal to other members of the team. Prior works either have game-theoretic guarantees but only work in very small games, or are able to scale to large games but do not have game-theoretic guarantees. In this paper we introduce two algorithms: Team-PSRO, an extension of PSRO from two-player games to team games, and Team-PSRO Mix-and-Match which improves upon Team PSRO by better using population policies. In Team-PSRO, in every iteration both teams learn a joint best response to the opponent's meta-strategy via reinforcement learning. As the reinforcement learning joint best response approaches the optimal best response, Team-PSRO is guaranteed to converge to a TMECor. In experiments on Kuhn poker and Liar's Dice, we show that a tabular version of Team-PSRO converges to TMECor, and a version of Team PSRO using deep cooperative reinforcement learning beats self-play reinforcement learning in the large game of Google Research Football.

NeurIPS Conference 2022 Conference Paper

Towards Human-Level Bimanual Dexterous Manipulation with Reinforcement Learning

  • Yuanpei Chen
  • Tianhao Wu
  • Shengjie Wang
  • Xidong Feng
  • Jiechuan Jiang
  • Zongqing Lu
  • Stephen McAleer
  • Hao Dong

Achieving human-level dexterity is an important open problem in robotics. However, tasks of dexterous hand manipulation even at the baby level are challenging to solve through reinforcement learning (RL). The difficulty lies in the high degrees of freedom and the required cooperation among heterogeneous agents (e. g. , joints of fingers). In this study, we propose the Bimanual Dexterous Hands Benchmark (Bi-DexHands), a simulator that involves two dexterous hands with tens of bimanual manipulation tasks and thousands of target objects. Tasks in Bi-DexHands are first designed to match human-level motor skills according to literature in cognitive science, and then are built in Issac Gym; this enables highly efficient RL trainings, reaching 30, 000+ FPS by only one single NVIDIA RTX 3090. We provide a comprehensive benchmark for popular RL algorithms under different settings; this includes multi-agent RL, offline RL, multi-task RL, and meta RL. Our results show that PPO type on-policy algorithms can learn to solve simple manipulation tasks that are equivalent up to 48-month human baby (e. g. , catching a flying object, opening a bottle), while multi-agent RL can further help to learn manipulations that require skilled bimanual cooperation (e. g. , lifting a pot, stacking blocks). Despite the success on each individual task, when it comes to mastering multiple manipulation skills, existing RL algorithms fail to work in most of the multi-task and the few-shot learning tasks, which calls for more future development from the RL community. Our project is open-sourced at https: //github. com/PKU-MARL/DexterousHands.

NeurIPS Conference 2021 Conference Paper

Neural Auto-Curricula in Two-Player Zero-Sum Games

  • Xidong Feng
  • Oliver Slumbers
  • Ziyu Wan
  • Bo Liu
  • Stephen McAleer
  • Ying Wen
  • Jun Wang
  • Yaodong Yang

When solving two-player zero-sum games, multi-agent reinforcement learning (MARL) algorithms often create populations of agents where, at each iteration, a new agent is discovered as the best response to a mixture over the opponent population. Within such a process, the update rules of "who to compete with" (i. e. , the opponent mixture) and "how to beat them" (i. e. , finding best responses) are underpinned by manually developed game theoretical principles such as fictitious play and Double Oracle. In this paper, we introduce a novel framework—Neural Auto-Curricula (NAC)—that leverages meta-gradient descent to automate the discovery of the learning update rule without explicit human design. Specifically, we parameterise the opponent selection module by neural networks and the best-response module by optimisation subroutines, and update their parameters solely via interaction with the game engine, where both players aim to minimise their exploitability. Surprisingly, even without human design, the discovered MARL algorithms achieve competitive or even better performance with the state-of-the-art population-based game solvers (e. g. , PSRO) on Games of Skill, differentiable Lotto, non-transitive Mixture Games, Iterated Matching Pennies, and Kuhn Poker. Additionally, we show that NAC is able to generalise from small games to large games, for example training on Kuhn Poker and outperforming PSRO on Leduc Poker. Our work inspires a promising future direction to discover general MARL algorithms solely from data.

PRL Workshop 2021 Workshop Paper

Obtaining Approximately Admissible Heuristic Functions through Deep Reinforcement Learning and A* Search

  • Forest Agostinelli
  • Stephen McAleer
  • Alexander Shmakov
  • Roy Fox
  • Marco Valtorta
  • Biplav Srivastava
  • Pierre Baldi

real world applications would ensure that artificial intelligence agents can solve problems in the most efficient way Deep reinforcement learning has been shown to be able to possible, or close to the most efficient way possible, which train deep neural networks to implement effective heuristic could significantly reduce the resources consumed by such functions that can be used with A* search to solve probagents. lems with large state spaces. However, these learned heuristic Obtaining an admissible heuristic function often requires functions are not guaranteed to be admissible. We introduce domain-specific knowledge. For example, pattern databases approximately admissible conversion, an algorithm that can convert any inadmissible heuristic function into a heuristic (PDBs) (Culberson and Schaeffer 1998) have been sucfunction that is admissible in the vast majority of cases with cessful at finding optimal solutions to puzzles such as the no domain-specific heuristic information. We apply approxiRubik’s cube (Korf 1997), 15-puzzle, and 24-puzzle (Korf mately admissible conversion to heuristic functions parameand Felner 2002; Felner, Korf, and Hanan 2004). Howterized by deep neural networks and show that these heuristic ever, ensuring that these PDBs produce admissible heurisfunctions can be used to find optimal solutions, or bounded tics requires knowledge about how the puzzle pieces intersuboptimal solutions, even when doing a batched version of act. There has been previous research on using deep neuA* search. We test our method on the 15-puzzle and 24ral networks to learn heuristic functions (Chen and Wei puzzle and obtain a heuristic function that is empirically ad2011; Wang et al. 2019; Ferber, Helmert, and Hoffmann missible over 99. 99% of the time and that finds optimal so2020) including the DeepCubeA algorithm (McAleer et al. lutions for 100% of all test configurations. To the best of our knowledge, this is the first demonstration that approximately 2019; Agostinelli et al. 2019) which used deep reinforceadmissible heuristics can be obtained using deep neural netment learning and weighted A* search (Pohl 1970) to solve works in a domain independent fashion. the aforementioned puzzles. However, the heuristic functions produced by DeepCubeA are not admissible. In this paper, we define an approximately admissible

NeurIPS Conference 2021 Conference Paper

XDO: A Double Oracle Algorithm for Extensive-Form Games

  • Stephen McAleer
  • JB Lanier
  • Kevin A Wang
  • Pierre Baldi
  • Roy Fox

Policy Space Response Oracles (PSRO) is a reinforcement learning (RL) algorithm for two-player zero-sum games that has been empirically shown to find approximate Nash equilibria in large games. Although PSRO is guaranteed to converge to an approximate Nash equilibrium and can handle continuous actions, it may take an exponential number of iterations as the number of information states (infostates) grows. We propose Extensive-Form Double Oracle (XDO), an extensive-form double oracle algorithm for two-player zero-sum games that is guaranteed to converge to an approximate Nash equilibrium linearly in the number of infostates. Unlike PSRO, which mixes best responses at the root of the game, XDO mixes best responses at every infostate. We also introduce Neural XDO (NXDO), where the best response is learned through deep RL. In tabular experiments on Leduc poker, we find that XDO achieves an approximate Nash equilibrium in a number of iterations an order of magnitude smaller than PSRO. Experiments on a modified Leduc poker game and Oshi-Zumo show that tabular XDO achieves a lower exploitability than CFR with the same amount of computation. We also find that NXDO outperforms PSRO and NFSP on a sequential multidimensional continuous-action game. NXDO is the first deep RL method that can find an approximate Nash equilibrium in high-dimensional continuous-action sequential games.

NeurIPS Conference 2020 Conference Paper

Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games

  • Stephen McAleer
  • JB Lanier
  • Roy Fox
  • Pierre Baldi

Finding approximate Nash equilibria in zero-sum imperfect-information games is challenging when the number of information states is large. Policy Space Response Oracles (PSRO) is a deep reinforcement learning algorithm grounded in game theory that is guaranteed to converge to an approximate Nash equilibrium. However, PSRO requires training a reinforcement learning policy at each iteration, making it too slow for large games. We show through counterexamples and experiments that DCH and Rectified PSRO, two existing approaches to scaling up PSRO, fail to converge even in small games. We introduce Pipeline PSRO (P2SRO), the first scalable PSRO-based method for finding approximate Nash equilibria in large zero-sum imperfect-information games. P2SRO is able to parallelize PSRO with convergence guarantees by maintaining a hierarchical pipeline of reinforcement learning workers, each training against the policies generated by lower levels in the hierarchy. We show that unlike existing methods, P2SRO converges to an approximate Nash equilibrium, and does so faster as the number of parallel workers increases, across a variety of imperfect information games. We also introduce an open-source environment for Barrage Stratego, a variant of Stratego with an approximate game tree complexity of 10^50. P2SRO is able to achieve state-of-the-art performance on Barrage Stratego and beats all existing bots. Experiment code is available at https: //github. com/JBLanier/pipeline-psro.