Arrow Research search

Author name cluster

Christopher Amato

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

68 papers
2 author rows

Possible papers (68)

AAAI Conference 2026 Conference Paper

LLM Collaboration with Multi-Agent Reinforcement Learning

  • Shuo Liu
  • Zeyu Liang
  • Xueguang Lyu
  • Christopher Amato

A large amount of work has been done in Multi-Agent Systems (MAS) for modeling and solving problems with multiple interacting agents. However, most LLMs are pretrained independently and not specifically optimized for coordination. Existing LLM fine-tuning frameworks rely on individual rewards, which require complex reward designs for each agent to encourage collaboration. To address these challenges, we model LLM collaboration as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. We develop a multi-agent, multi-turn algorithm, Multi-Agent Group Relative Policy Optimization (MAGRPO), to solve it, building on current RL approaches for LLMs as well as MARL techniques. Our experiments on LLM writing and coding collaboration demonstrate that fine-tuning MAS with MAGRPO enables agents to generate high-quality responses efficiently through effective cooperation. Our approach opens the door to using MARL methods for LLM collaboration and highlights the associated challenges.
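
The group-relative advantage at the core of GRPO-style updates is easy to illustrate. The sketch below is a simplified stand-in for that step, not the authors' MAGRPO; the function and reward values are illustrative assumptions.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Normalize each sampled rollout's reward against its group.

    GRPO-style methods score a group of rollouts of the same prompt and
    use the within-group z-score as the advantage, avoiding a learned
    value network.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: 4 joint rollouts of a cooperating agent pair, scored by a
# shared task reward (hypothetical numbers).
adv = group_relative_advantages([0.2, 0.9, 0.4, 0.7])
print(adv)  # positive for above-average rollouts, negative otherwise
```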

ICML Conference 2025 Conference Paper

Adversarial Inception Backdoor Attacks against Reinforcement Learning

  • Ethan Rathbun
  • Alina Oprea
  • Christopher Amato

Recent works have demonstrated the vulnerability of Deep Reinforcement Learning (DRL) algorithms against training-time, backdoor poisoning attacks. The objectives of these attacks are twofold: induce pre-determined, adversarial behavior in the agent upon observing a fixed trigger during deployment while allowing the agent to solve its intended task during training. Prior attacks assume arbitrary control over the agent’s rewards, inducing values far outside the environment’s natural constraints. This results in brittle attacks that fail once the proper reward constraints are enforced. Thus, in this work we propose a new class of backdoor attacks against DRL which are the first to achieve state-of-the-art performance under strict reward constraints. These “inception” attacks manipulate the agent’s training data – inserting the trigger into prior observations and replacing high-return actions with those of the targeted adversarial behavior. We formally define these attacks and prove they achieve both adversarial objectives against arbitrary Markov Decision Processes (MDPs). Using this framework, we devise an online inception attack which achieves a 100% attack success rate on multiple environments under constrained rewards while minimally impacting the agent’s task performance.
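
To make the data-manipulation step concrete, here is a minimal, hypothetical sketch of stamping a trigger into a stored observation and relabeling its action; the paper's online inception attack is considerably more involved, and all names here are illustrative.

```python
import numpy as np

def inception_poison(obs, target_action, trigger_mask, trigger_value):
    """Stamp a fixed trigger into a stored observation and relabel the
    action with the adversary's target, leaving the reward untouched.

    A toy sketch in the spirit of the paper's "inception" attacks; the
    real attack also swaps high-return actions and runs online during
    training under reward constraints.
    """
    poisoned = obs.copy()
    poisoned[trigger_mask] = trigger_value
    return poisoned, target_action

obs = np.zeros((4, 4))                      # toy observation
mask = np.zeros((4, 4), dtype=bool)
mask[0, :2] = True                          # trigger: two corner pixels
p_obs, p_act = inception_poison(obs, target_action=3,
                                trigger_mask=mask, trigger_value=1.0)
```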

TMLR Journal 2025 Journal Article

Leveraging Fully-Observable Solutions for Improved Partially-Observable Offline Reinforcement Learning

  • Chulabhaya Wijesundara
  • Andrea Baisero
  • Gregory David Castanon
  • Alan S Carlin
  • Robert Platt
  • Christopher Amato

Offline reinforcement learning (RL) is a popular learning framework for control problems where online interactions with the environment are expensive, risky, or otherwise impractical. Existing offline RL methods commonly assume full observability of the state, and therefore there is a lack of offline RL methods that are specialized for the more general case of partially-observable control. To address this gap, we propose Cross-Observability Conservative Q-Learning (CO-CQL), an offline RL algorithm for partially-observable control that leverages fully-observable expert policies in an asymmetric learning setting. To motivate the use of fully-observable experts for partially-observable control, we formalize Cross-Observability Optimality Ratio (COOR), a theoretical measure of cross-observability that quantifies the benefit of learning asymmetrically from a fully-observable expert, and Cross-Observability Approximation Ratio (COAR), an estimation of COOR computable from trained policies. Our empirical evaluation on a wide variety of partially-observable challenges demonstrates that CO-CQL is able to exploit the guidance of fully-observable experts to outperform other state-of-the-art offline algorithms.

AAMAS Conference 2025 Conference Paper

Leveraging Fully-Observable Solutions for Improved Partially-Observable Offline Reinforcement Learning

  • Chulabhaya Wijesundara
  • Andrea Baisero
  • Gregory Castañón
  • Alan Carlin
  • Robert Platt
  • Christopher Amato

Offline reinforcement learning (RL) is valuable in settings where online interactions with an environment are impractical. While such settings are often partially-observable, existing offline RL methods typically focus on fully-observable (FO) Markov decision processes (MDPs) rather than partially-observable MDPs (POMDPs). To help close that gap, we present an offline RL algorithm for POMDPs that leverages expert policies from simpler, fully-observable versions of environments in an asymmetric learning setting. We provide theoretical grounding for how overlap between MDPs and POMDPs can be exploited to improve learning in the partially-observable setting, and our experiments empirically demonstrate that our method significantly improves performance compared to existing state-of-the-art MDP offline RL algorithms.

AAMAS Conference 2025 Conference Paper

On Stateful Value Factorization in Multi-Agent Reinforcement Learning

  • Enrico Marchesini
  • Andrea Baisero
  • Rupali Bhati
  • Christopher Amato

Value factorization is a popular paradigm for designing scalable multi-agent reinforcement learning algorithms. However, current factorization methods make choices without full justification that may limit their performance. For example, the theory in prior work uses stateless (i.e., history) functions, while the practical implementations use state information—making the motivating theory a mismatch for the implementation. Also, methods have built off of previous approaches, inheriting their architectures without exploring other, potentially better ones. To address these concerns, we formally analyze the theory of using the state instead of the history in current methods—reconnecting theory and practice. We then introduce DuelMIX, a factorization algorithm that learns distinct per-agent utility estimators to improve performance and achieve full expressiveness. Experiments on StarCraft II micromanagement and Box Pushing tasks demonstrate the benefits of our intuitions.
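
For readers unfamiliar with stateful value factorization, the sketch below shows a generic QMIX-family mixer whose state-conditioned, non-negative weights combine per-agent utilities into a joint value. It is a simplified illustration of the family the paper analyzes, not the authors' DuelMIX; layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """QMIX-family stateful mixer: combines per-agent utilities Q_i into
    Q_tot with weights generated from the global state and constrained
    to be non-negative, so the joint argmax decentralizes."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents = n_agents
        self.w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.b1 = nn.Linear(state_dim, embed_dim)
        self.w2 = nn.Linear(state_dim, embed_dim)
        self.b2 = nn.Linear(state_dim, 1)

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        bs = agent_qs.size(0)
        w1 = torch.abs(self.w1(state)).view(bs, self.n_agents, -1)
        h = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1)
                       + self.b1(state).unsqueeze(1))
        w2 = torch.abs(self.w2(state)).view(bs, -1, 1)
        return (torch.bmm(h, w2) + self.b2(state).unsqueeze(1)).squeeze(-1)

mixer = MonotonicMixer(n_agents=3, state_dim=10)
q_tot = mixer(torch.randn(8, 3), torch.randn(8, 10))  # shape (8, 1)
```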

AAMAS Conference 2025 Conference Paper

Safe Entropic Agents under Team Constraints

  • Ayhan Alp Aydeniz
  • Enrico Marchesini
  • Robert Loftin
  • Christopher Amato
  • Kagan Tumer

Safety is a critical concern in multiagent reinforcement learning (MARL), yet typical safety-aware methods constrain agent behaviors, limiting exploration—essential for discovering effective cooperation. Existing approaches mainly enforce individual constraints, overlooking potential benefits of joint (team) constraints. We analyze team constraints theoretically and practically, introducing entropic exploration for constrained MARL (E2C). E2C maximizes observation entropy to encourage exploration while ensuring safety at the individual and team levels. Experiments across diverse domains demonstrate that E2C matches or outperforms common baselines in task performance while reducing unsafe behaviors by up to 50%.

TIST Journal 2025 Journal Article

Verifying Online Safety Properties for Safe Deep Reinforcement Learning

  • Luca Marzari
  • Ferdinando Cicalese
  • Alessandro Farinelli
  • Christopher Amato
  • Enrico Marchesini

Ensuring safety in reinforcement learning (RL) is critical for deploying agents in real-world applications. During training, current safe RL approaches often rely on indicator cost functions that provide sparse feedback, resulting in two key limitations: (i) poor sample efficiency due to the lack of safety information in neighboring states, and (ii) dependence on cost-value functions, leading to brittle convergence and suboptimal performance. After training, safety is guaranteed via formal verification (FV) methods for deep neural networks, whose computational complexity hinders their application during training. We address the limitations of using cost functions via verification by proposing a safe RL method based on a violation value—the risk associated with policy decisions in a portion of the state space. Our approach verifies safety properties (i.e., state-action pairs) that may lead to unsafe behavior, and quantifies the size of the state space where properties are violated. This violation value is then used to penalize the agent during training to encourage safer policy behavior. Given the NP-hard nature of FV, we propose an efficient, sample-based approximation with probabilistic guarantees to compute the violation value. Extensive experiments on standard benchmarks and real-world robotic navigation tasks show that violation-augmented approaches significantly improve safety by reducing the number of unsafe states encountered while achieving superior performance compared to existing methods.
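
A rough sketch of the sample-based idea: estimate the violation value of a property by sampling its input region and checking the policy's outputs, rather than running exact (NP-hard) verification. `policy`, `sample_state`, and `is_unsafe` are hypothetical placeholders for the trained network, a sampler over the property's input region, and the property's output check.

```python
import numpy as np

def violation_value(policy, sample_state, is_unsafe, n_samples=1000):
    """Monte-Carlo estimate of the fraction of a state region where the
    policy's action would violate a safety property."""
    violations = 0
    for _ in range(n_samples):
        s = sample_state()
        violations += bool(is_unsafe(s, policy(s)))
    return violations / n_samples

# Toy usage: property "never take action 0 when x < 0" on a toy policy.
rng = np.random.default_rng(0)
est = violation_value(policy=lambda s: int(s[1] > 0),
                      sample_state=lambda: rng.uniform(-1, 1, size=2),
                      is_unsafe=lambda s, a: s[0] < 0 and a == 0)
print(est)  # roughly 0.25 for this toy setup
```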

AAMAS Conference 2024 Conference Paper

Entropy Seeking Constrained Multiagent Reinforcement Learning

  • Ayhan Alp Aydeniz
  • Enrico Marchesini
  • Christopher Amato
  • Kagan Tumer

Multiagent Reinforcement Learning (MARL) has been successfully applied to domains requiring close coordination among many agents. However, real-world tasks require safety specifications that are not generally considered by MARL algorithms. In this work, we introduce an Entropy Seeking Constrained (ESC) approach aiming to learn safe cooperative policies for multiagent systems. Unlike previous methods, ESC considers safety specifications while maximizing state-visitation entropy, addressing the exploration issues of constraint-based solutions.

ICRA Conference 2024 Conference Paper

Robot Navigation in Unseen Environments using Coarse Maps

  • Chengguang Xu
  • Christopher Amato
  • Lawson L. S. Wong

Metric occupancy maps are widely used in autonomous robot navigation systems. However, when a robot is deployed in an unseen environment, building an accurate metric map is time-consuming. Can an autonomous robot directly navigate in previously unseen environments using coarse maps? In this work, we propose the Coarse Map Navigator (CMN), a navigation framework that can perform robot navigation in unseen environments using different coarse maps. To do so, CMN addresses two challenges: (1) novel and realistic visual observations; (2) error and misalignment on coarse maps. To tackle novel visual observations in unseen environments, CMN learns a deep perception model that maps the visual input from various pixel spaces to the local occupancy grid space. To tackle the error and misalignment on coarse maps, CMN extends the Bayesian filter and maintains a belief directly on coarse maps using the predicted local occupancy grids as observations. Using the latest belief, CMN extracts a global heuristic vector that guides the planner to find a local navigation action. Empirical results demonstrate that CMN achieves high navigation success rates in unseen environments, significantly outperforming baselines, and is robust to different coarse maps.
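
The belief-update idea can be sketched as a discrete Bayes filter over coarse-map cells. The code below is a toy illustration under strong simplifying assumptions (observation likelihoods supplied as a precomputed array, motion modeled as a convolution), not the CMN implementation.

```python
import numpy as np
from scipy.signal import convolve2d

def coarse_map_belief_update(belief, likelihood, motion_kernel):
    """One predict/update step of a discrete Bayes filter whose state
    space is the cells of a coarse map. In CMN's terms, `likelihood`
    would come from comparing the predicted local occupancy grid to the
    map around each cell."""
    predicted = convolve2d(belief, motion_kernel, mode="same")  # predict
    posterior = predicted * likelihood                          # update
    return posterior / posterior.sum()

belief = np.full((6, 6), 1 / 36)                 # uniform prior over cells
kernel = np.array([[0.0, 0.1, 0.0],
                   [0.1, 0.6, 0.1],
                   [0.0, 0.1, 0.0]])             # mostly stay, some drift
likelihood = np.random.default_rng(1).uniform(0.1, 1.0, (6, 6))
belief = coarse_map_belief_update(belief, likelihood, kernel)
```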

AAMAS Conference 2024 Conference Paper

Shield Decentralization for Safe Reinforcement Learning in General Partially Observable Multi-Agent Environments

  • Daniel Melcer
  • Christopher Amato
  • Stavros Tripakis

As reinforcement learning (RL) is increasingly used in safety-critical systems, it is important to restrict RL agents to only take safe actions. Shielding is a promising approach to this task; however, in multi-agent domains, shielding has previously been restricted to environments where all agents observe the same information. Most real-world tasks do not satisfy this strong assumption. We discuss the theoretical foundations of multi-agent shielding in environments with general partial observability and develop a novel shielding method which is effective in such domains. Through a series of experiments, we show that agents that use our shielding method are able to safely and successfully solve a variety of RL tasks, including tasks in which prior methods cannot be applied.
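
Independent of the paper's decentralization contribution, the basic shielding loop looks like the hypothetical sketch below: the learner ranks actions and the shield removes unsafe ones before execution. In the paper's setting, `safe_actions` would be produced by a per-agent shield that sees only that agent's partial observations.

```python
def shielded_action(q_values, safe_actions):
    """Pick the highest-value action among those the shield allows.

    A minimal single-agent sketch of shielding: the learner proposes,
    the shield filters."""
    allowed = [(q, a) for a, q in enumerate(q_values) if a in safe_actions]
    if not allowed:
        raise RuntimeError("shield blocked every action")
    return max(allowed)[1]

# Toy usage: action 2 has the best Q-value but is unsafe.
assert shielded_action([0.1, 0.5, 0.9], safe_actions={0, 1}) == 1
```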

RLC Conference 2024 Conference Paper

Shield Decomposition for Safe Reinforcement Learning in General Partially Observable Multi-Agent Environments

  • Daniel Melcer
  • Christopher Amato
  • Stavros Tripakis

As Reinforcement Learning is increasingly used in safety-critical systems, it is important to restrict RL agents to only take safe actions. Shielding is a promising approach to this task; however, in multi-agent domains, shielding has previously been restricted to environments where all agents observe the same information. Most real-world tasks do not satisfy this strong assumption. We discuss the theoretical foundations of multi-agent shielding in environments with general partial observability and develop a novel shielding method which is effective in such domains. Through a series of experiments, we show that agents that use our shielding method are able to safely and successfully solve a variety of RL tasks, including tasks in which prior methods cannot be applied.

RLJ Journal 2024 Journal Article

Shield Decomposition for Safe Reinforcement Learning in General Partially Observable Multi-Agent Environments

  • Daniel Melcer
  • Christopher Amato
  • Stavros Tripakis

As Reinforcement Learning is increasingly used in safety-critical systems, it is important to restrict RL agents to only take safe actions. Shielding is a promising approach to this task; however, in multi-agent domains, shielding has previously been restricted to environments where all agents observe the same information. Most real-world tasks do not satisfy this strong assumption. We discuss the theoretical foundations of multi-agent shielding in environments with general partial observability and develop a novel shielding method which is effective in such domains. Through a series of experiments, we show that agents that use our shielding method are able to safely and successfully solve a variety of RL tasks, including tasks in which prior methods cannot be applied.

NeurIPS Conference 2024 Conference Paper

SleeperNets: Universal Backdoor Poisoning Attacks Against Reinforcement Learning Agents

  • Ethan Rathbun
  • Christopher Amato
  • Alina Oprea

Reinforcement learning (RL) is an actively growing field that is seeing increased usage in real-world, safety-critical applications -- making it paramount to ensure the robustness of RL algorithms against adversarial attacks. In this work we explore a particularly stealthy form of training-time attacks against RL -- backdoor poisoning. Here the adversary intercepts the training of an RL agent with the goal of reliably inducing a particular action when the agent observes a pre-determined trigger at inference time. We uncover theoretical limitations of prior work by proving their inability to generalize across domains and MDPs. Motivated by this, we formulate a novel poisoning attack framework which interlinks the adversary's objectives with those of finding an optimal policy -- guaranteeing attack success in the limit. Using insights from our theoretical analysis we develop "SleeperNets" as a universal backdoor attack which exploits a newly proposed threat model and leverages dynamic reward poisoning techniques. We evaluate our attack in 6 environments spanning multiple domains and demonstrate significant improvements in attack success over existing methods, while preserving benign episodic return.

ICLR Conference 2023 Conference Paper

Improving Deep Policy Gradients with Value Function Search

  • Enrico Marchesini
  • Christopher Amato

Deep Policy Gradient (PG) algorithms employ value networks to drive the learning of parameterized policies and reduce the variance of the gradient estimates. However, value function approximation gets stuck in local optima and struggles to fit the actual return, limiting the variance reduction efficacy and leading policies to sub-optimal performance. This paper focuses on improving value approximation and analyzing the effects on Deep PG primitives such as value prediction, variance reduction, and correlation of gradient estimates with the true gradient. To this end, we introduce a Value Function Search that employs a population of perturbed value networks to search for a better approximation. Our framework does not require additional environment interactions, gradient computations, or ensembles, providing a computationally inexpensive approach to enhance the supervised learning task on which value networks train. Crucially, we show that improving Deep PG primitives results in improved sample efficiency and policies with higher returns using common continuous control benchmark domains.
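
A minimal sketch of the population-search idea, assuming a PyTorch value network and a batch of (state, return) pairs: perturb copies of the network with Gaussian noise and keep the best fit, using no extra environment interactions or gradient steps. Names and hyperparameters are illustrative, not the paper's.

```python
import copy
import torch
import torch.nn as nn

def value_function_search(value_net, states, returns, pop_size=8, sigma=0.01):
    """Search a population of Gaussian perturbations of the value
    network and keep the member that best fits the observed returns."""
    def fit(net):
        with torch.no_grad():
            return ((net(states).squeeze(-1) - returns) ** 2).mean().item()

    best, best_fit = value_net, fit(value_net)
    for _ in range(pop_size):
        cand = copy.deepcopy(value_net)
        with torch.no_grad():
            for p in cand.parameters():        # Gaussian parameter noise
                p.add_(sigma * torch.randn_like(p))
        f = fit(cand)
        if f < best_fit:
            best, best_fit = cand, f
    return best

net = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))
net = value_function_search(net, torch.randn(64, 4), torch.randn(64))
```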

JAIR Journal 2023 Journal Article

On Centralized Critics in Multi-Agent Reinforcement Learning

  • Xueguang Lyu
  • Andrea Baisero
  • Yuchen Xiao
  • Brett Daley
  • Christopher Amato

Centralized Training for Decentralized Execution, where agents are trained offline in a centralized fashion and execute online in a decentralized manner, has become a popular approach in Multi-Agent Reinforcement Learning (MARL). In particular, it has become popular to develop actor-critic methods that train decentralized actors with a centralized critic, where the centralized critic is allowed access to global information of the entire system, including the true system state. Such centralized critics are possible given offline information and are not used for online execution. While these methods perform well in a number of domains and have become a de facto standard in MARL, using a centralized critic in this context has yet to be sufficiently analyzed theoretically or empirically. In this paper, we therefore formally analyze centralized and decentralized critic approaches, and analyze the effect of using state-based critics in partially observable environments. We derive theories contrary to the common intuition: critic centralization is not strictly beneficial, and using state values can be harmful. We further prove that, in particular, state-based critics can introduce unexpected bias and variance compared to history-based critics. Finally, we demonstrate how the theory applies in practice by comparing different forms of critics on a wide range of common multi-agent benchmarks. The experiments show practical issues such as the difficulty of representation learning with partial observability, which highlights why the theoretical problems are often overlooked in the literature.

IROS Conference 2023 Conference Paper

On-Robot Bayesian Reinforcement Learning for POMDPs

  • Hai Nguyen 0001
  • Sammie Katt
  • Yuchen Xiao
  • Christopher Amato

Robot learning is often difficult due to the expense of gathering data. The need for large amounts of data can, and should, be tackled with effective algorithms and leveraging expert information on robot dynamics. Bayesian reinforcement learning (BRL), thanks to its sample efficiency and ability to exploit prior knowledge, is uniquely positioned as such a solution method. Unfortunately, the application of BRL has been limited due to the difficulties of representing expert knowledge as well as solving the subsequent inference problem. This paper advances BRL for robotics by proposing a specialized framework for physical systems. In particular, we capture this knowledge in a factored representation, then demonstrate the posterior factorizes in a similar shape, and ultimately formalize the model in a Bayesian framework. We then introduce a sample-based online solution method, based on Monte-Carlo tree search and particle filtering, specialized to solve the resulting model. This approach can, for example, utilize typical low-level robot simulators and handle uncertainty over unknown dynamics of the environment. We empirically demonstrate its efficiency by performing on-robot learning in two human-robot interaction tasks with uncertainty about human behavior, achieving near-optimal performance after only a handful of real-world episodes. A video of learned policies is at https://youtu.be/H9xp60ngOes.

AAMAS Conference 2023 Conference Paper

Safe Deep Reinforcement Learning by Verifying Task-Level Properties

  • Enrico Marchesini
  • Luca Marzari
  • Alessandro Farinelli
  • Christopher Amato

Cost functions are commonly employed in Safe Deep Reinforcement Learning (DRL). However, the cost is typically encoded as an indicator function due to the difficulty of quantifying the risk of policy decisions in the state space. Such an encoding requires the agent to visit numerous unsafe states to learn a cost-value function to drive the learning process toward safety, increasing the number of unsafe interactions and decreasing sample efficiency. In this paper, we investigate an alternative approach that uses domain knowledge to quantify the risk in the proximity of such states by defining a violation metric. This metric is computed by verifying task-level properties, shaped as input-output conditions, and it is used as a penalty to bias the policy away from unsafe states without learning an additional value function. We investigate the benefits of using the violation metric in standard Safe DRL benchmarks and robotic mapless navigation tasks. The navigation experiments bridge the gap between Safe DRL and robotics, introducing a framework that allows rapid testing on real robots. Our experiments show that policies trained with the violation penalty achieve higher performance over Safe DRL baselines and significantly reduce the number of visited unsafe states.

ICML Conference 2023 Conference Paper

Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning

  • Brett Daley
  • Martha White
  • Christopher Amato
  • Marlos C. Machado

Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging. Classically, off-policy bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio after each action via eligibility traces. Many off-policy algorithms rely on this mechanism, along with differing protocols for cutting the IS ratios (traces) to combat the variance of the IS estimator. Unfortunately, once a trace has been cut, the effect cannot be easily reversed. This has led to the development of credit-assignment strategies that account for multiple past experiences at a time. These trajectory-aware methods have not been extensively analyzed, and their theoretical justification remains uncertain. In this paper, we propose a multistep operator that unifies per-decision and trajectory-aware methods. We prove convergence conditions for our operator in the tabular setting, establishing the first guarantees for several existing methods as well as many new ones. Finally, we introduce Recency-Bounded Importance Sampling (RBIS), which leverages trajectory awareness to perform robustly across $\lambda$-values in an off-policy control task.
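
For contrast with the trajectory-aware operators the paper studies, here is a classic per-decision sketch in which each future TD error is discounted through a product of cut importance-sampling ratios (Retrace-style clipping at 1); RBIS instead bounds the product of recent ratios. Indexing conventions vary across the literature, so this is an illustrative variant, not the paper's operator.

```python
import numpy as np

def per_decision_lambda_return(rewards, values, rhos, gamma=0.99, lam=0.9):
    """Backward recursion for an off-policy lambda-return where future
    TD errors accumulate through clipped per-decision IS ratios."""
    T = len(rewards)
    g = values[T]                       # bootstrap from the final value
    out = np.empty(T)
    for t in reversed(range(T)):
        c = lam * min(rhos[t], 1.0)     # cut the trace at 1
        td = rewards[t] + gamma * values[t + 1] - values[t]
        g = values[t] + td + gamma * c * (g - values[t + 1])
        out[t] = g
    return out

rng = np.random.default_rng(2)
print(per_decision_lambda_return(rng.normal(size=5), rng.normal(size=6),
                                 rng.uniform(0.5, 1.5, size=5)))
```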

AAAI Conference 2022 Conference Paper

A Deeper Understanding of State-Based Critics in Multi-Agent Reinforcement Learning

  • Xueguang Lyu
  • Andrea Baisero
  • Yuchen Xiao
  • Christopher Amato

Centralized Training for Decentralized Execution, where training is done in a centralized offline fashion, has become a popular solution paradigm in Multi-Agent Reinforcement Learning. Many such methods take the form of actor-critic with state-based critics, since centralized training allows access to the true system state, which can be useful during training despite not being available at execution time. State-based critics have become a common empirical choice, albeit one which has had limited theoretical justification or analysis. In this paper, we show that state-based critics can introduce bias in the policy gradient estimates, potentially undermining the asymptotic guarantees of the algorithm. We also show that, even if the state-based critics do not introduce any bias, they can still result in a larger gradient variance, contrary to the common intuition. Finally, we show the effects of the theories in practice by comparing different forms of centralized critics on a wide range of common benchmarks, and detail how various environmental properties are related to the effectiveness of different types of critics.

UAI Conference 2022 Conference Paper

Asymmetric DQN for partially observable reinforcement learning

  • Andrea Baisero
  • Brett Daley
  • Christopher Amato

Offline training in simulated partially observable environments allows reinforcement learning methods to exploit privileged state information through a mechanism known as asymmetry. Such privileged information has the potential to greatly improve the optimal convergence properties, if used appropriately. However, current research in asymmetric reinforcement learning is often heuristic in nature, with few connections to underlying theory or theoretical guarantees, and is primarily tested through empirical evaluation. In this work, we develop the theory of Asymmetric Policy Iteration, an exact model-based dynamic programming solution method, and then apply relaxations which eventually result in Asymmetric DQN, a model-free deep reinforcement learning algorithm. Our theoretical findings are complemented and validated by empirical experimentation performed in environments which exhibit significant amounts of partial observability, and require both information gathering strategies and memorization.

NeurIPS Conference 2022 Conference Paper

Asynchronous Actor-Critic for Multi-Agent Reinforcement Learning

  • Yuchen Xiao
  • Weihao Tan
  • Christopher Amato

Synchronizing decisions across multiple agents in realistic settings is problematic since it requires agents to wait for other agents to terminate and communicate about termination reliably. Ideally, agents should learn and execute asynchronously instead. Such asynchronous methods also allow temporally extended actions that can take different amounts of time based on the situation and action executed. Unfortunately, current policy gradient methods are not applicable in asynchronous settings, as they assume that agents synchronously reason about action selection at every time step. To allow asynchronous learning and decision-making, we formulate a set of asynchronous multi-agent actor-critic methods that allow agents to directly optimize asynchronous policies in three standard training paradigms: decentralized learning, centralized learning, and centralized training for decentralized execution. Empirical results (in simulation and hardware) in a variety of realistic domains demonstrate the superiority of our approaches in large multi-agent problems and validate the effectiveness of our algorithms for learning high-quality and asynchronous solutions.

AAMAS Conference 2022 Conference Paper

BADDr: Bayes-Adaptive Deep Dropout RL for POMDPs

  • Sammie Katt
  • Hai Nguyen
  • Frans A. Oliehoek
  • Christopher Amato

While reinforcement learning (RL) has made great advances in scalability, exploration and partial observability are still active research topics. In contrast, Bayesian RL (BRL) provides a principled answer to both state estimation and the exploration-exploitation trade-off, but struggles to scale. To tackle this challenge, BRL frameworks with various prior assumptions have been proposed, with varied success. This work presents a representation-agnostic formulation of BRL under partial observability, unifying the previous models under one theoretical umbrella. To demonstrate its practical significance we also propose a novel derivation, Bayes-Adaptive Deep Dropout RL (BADDr), based on dropout networks. Under this parameterization, in contrast to previous work, the belief over the state and dynamics is a more scalable inference problem. We choose actions through Monte-Carlo tree search and empirically show that our method is competitive with state-of-the-art BRL methods on small domains while being able to solve much larger ones.

NeurIPS Conference 2022 Conference Paper

Shield Decentralization for Safe Multi-Agent Reinforcement Learning

  • Daniel Melcer
  • Christopher Amato
  • Stavros Tripakis

Learning safe solutions is an important but challenging problem in multi-agent reinforcement learning (MARL). Shielded reinforcement learning is one approach for preventing agents from choosing unsafe actions. Current shielded reinforcement learning methods for MARL make strong assumptions about communication and full observability. In this work, we extend the formalization of the shielded reinforcement learning problem to a decentralized multi-agent setting. We then present an algorithm for decomposition of a centralized shield, allowing shields to be used in such decentralized, communication-free environments. Our results show that agents equipped with decentralized shields perform comparably to agents with centralized shields in several tasks, allowing shielding to be used in environments with decentralized training and execution for the first time.

AAMAS Conference 2022 Conference Paper

Unbiased Asymmetric Reinforcement Learning under Partial Observability

  • Andrea Baisero
  • Christopher Amato

In partially observable reinforcement learning, offline training gives access to latent information which is not available during online training and/or execution, such as the system state. Asymmetric actor-critic methods exploit such information by training a history-based policy via a state-based critic. However, many asymmetric methods lack theoretical foundation, and are only evaluated on limited domains. We examine the theory of asymmetric actor-critic methods which use state-based critics, and expose fundamental issues which undermine the validity of a common variant, and limit its ability to address partial observability. We propose an unbiased asymmetric actor-critic variant which is able to exploit state information while remaining theoretically sound, maintaining the validity of the policy gradient theorem, and introducing no bias and relatively low variance into the training process. An empirical evaluation performed on domains which exhibit significant partial observability confirms our analysis, demonstrating that unbiased asymmetric actor-critic converges to better policies and/or faster than symmetric and biased asymmetric baselines.
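
The structural fix can be summarized as conditioning the critic on both the agent's history and the latent state available offline, while the actor sees only the history. A bare-bones, hypothetical sketch (layer sizes and names are assumptions):

```python
import torch
import torch.nn as nn

class AsymmetricCritic(nn.Module):
    """Critic that conditions on both the latent state (offline-only)
    and an embedding of the agent's action-observation history.

    A minimal sketch of the asymmetric actor-critic setup analyzed in
    the paper; the actor itself would consume only the history."""
    def __init__(self, state_dim, history_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + history_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, history_embedding):
        return self.net(torch.cat([state, history_embedding], dim=-1))

critic = AsymmetricCritic(state_dim=8, history_dim=16)
v = critic(torch.randn(32, 8), torch.randn(32, 16))  # shape (32, 1)
```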

AAMAS Conference 2021 Conference Paper

Contrasting Centralized and Decentralized Critics in Multi-Agent Reinforcement Learning

  • Xueguang Lyu
  • Yuchen Xiao
  • Brett Daley
  • Christopher Amato

Centralized Training for Decentralized Execution, where agents are trained offline using centralized information but execute in a decentralized manner online, has gained popularity in the multi-agent reinforcement learning community. In particular, actor-critic methods with a centralized critic and decentralized actors are a common instance of this idea. However, the implications of using a centralized critic in this context are not fully discussed and understood even though it is the standard choice of many algorithms. We therefore formally analyze centralized and decentralized critic approaches, providing a deeper understanding of the implications of critic choice. Because our theory makes unrealistic assumptions, we also empirically compare the centralized and decentralized critic methods over a wide set of environments to validate our theories and to provide practical advice. We show that there exist misconceptions regarding centralized critics in the current literature and show that the centralized critic design is not strictly beneficial, but rather both centralized and decentralized critics have different pros and cons that should be taken into account by algorithm designers.

ICRA Conference 2021 Conference Paper

End-to-end grasping policies for human-in-the-loop robots via deep reinforcement learning

  • Mohammadreza Sharif
  • Deniz Erdogmus
  • Christopher Amato
  • Taskin Padir

State-of-the-art human-in-the-loop robot grasping suffers greatly from Electromyography (EMG) inference robustness issues. As a workaround, researchers have been looking into integrating EMG with other signals, often in an ad hoc manner. In this paper, we present a method for end-to-end training of a policy for human-in-the-loop robot grasping on real reaching trajectories. For this purpose we use Reinforcement Learning (RL) and Imitation Learning (IL) in DEXTRON (DEXTerity enviRONment), a stochastic simulation environment with real human trajectories that are augmented and selected using a Monte Carlo (MC) simulation method. We also offer a success model which, once trained on the expert policy data and the RL policy roll-out transitions, can provide transparency into how the deep policy works and when it is likely to fail.

IJCAI Conference 2021 Conference Paper

Reconciling Rewards with Predictive State Representations

  • Andrea Baisero
  • Christopher Amato

Predictive state representations (PSRs) are models of controlled non-Markov observation sequences which exhibit the same generative process governing POMDP observations without relying on an underlying latent state. In that respect, a PSR is indistinguishable from the corresponding POMDP. However, PSRs notoriously ignore the notion of rewards, which undermines the general utility of PSR models for control, planning, or reinforcement learning. Therefore, we describe a sufficient and necessary accuracy condition which determines whether a PSR is able to accurately model POMDP rewards, we show that rewards can be approximated even when the accuracy condition is not satisfied, and we find that a non-trivial number of POMDPs taken from a well-known third-party repository do not satisfy the accuracy condition. We propose reward-predictive state representations (R-PSRs), a generalization of PSRs which accurately models both observations and rewards, and develop value iteration for R-PSRs. We show that there is a mismatch between optimal POMDP policies and the optimal PSR policies derived from approximate rewards. On the other hand, optimal R-PSR policies perfectly match optimal POMDP policies, reconfirming R-PSRs as accurate state-less generative models of observations and rewards.

AAMAS Conference 2021 Conference Paper

Safe Multi-Agent Reinforcement Learning via Shielding

  • Ingy ElSayed-Aly
  • Suda Bharadwaj
  • Christopher Amato
  • Rüdiger Ehlers
  • Ufuk Topcu
  • Lu Feng

Multi-agent reinforcement learning (MARL) has been increasingly used in a wide range of safety-critical applications, which require guaranteed safety (e.g., no unsafe states are ever visited) during the learning process. Unfortunately, current MARL methods do not have safety guarantees. Therefore, we present two shielding approaches for safe MARL. In centralized shielding, we synthesize a single shield to monitor all agents’ joint actions and correct any unsafe action if necessary. In factored shielding, we synthesize multiple shields based on a factorization of the joint state space observed by all agents; the set of shields monitors agents concurrently and each shield is only responsible for a subset of agents at each step. Experimental results show that both approaches can guarantee the safety of agents during learning without compromising the quality of learned policies; moreover, factored shielding is more scalable in the number of agents than centralized shielding.

ICRA Conference 2020 Conference Paper

Learning Multi-Robot Decentralized Macro-Action-Based Policies via a Centralized Q-Net

  • Yuchen Xiao
  • Joshua Hoffman
  • Tian Xia
  • Christopher Amato

In many real-world multi-robot tasks, high-quality solutions often require a team of robots to perform asynchronous actions under decentralized control. Decentralized multi-agent reinforcement learning methods have difficulty learning decentralized policies because of the environment appearing to be non-stationary due to other agents also learning at the same time. In this paper, we address this challenge by proposing a macro-action-based decentralized multi-agent double deep recurrent Q-net (MacDec-MADDRQN) which trains each decentralized Q-net using a centralized Q-net for action selection. A generalized version of MacDec-MADDRQN with two separate training environments, called Parallel-MacDec-MADDRQN, is also presented to leverage either centralized or decentralized exploration. The advantages and the practical nature of our methods are demonstrated by achieving near-centralized results in simulation and having real robots accomplish a warehouse tool delivery task in an efficient way.

AAAI Conference 2020 Short Paper

Multi-Agent/Robot Deep Reinforcement Learning with Macro-Actions (Student Abstract)

  • Yuchen Xiao
  • Joshua Hoffman
  • Tian Xia
  • Christopher Amato

We consider the challenges of learning multi-agent/robot macro-action-based deep Q-nets, including how to properly update each macro-action value and accurately maintain macro-action-observation trajectories. We address these challenges by first proposing two fundamental frameworks for learning a macro-action-value function and a joint macro-action-value function. Furthermore, we present two new approaches for learning decentralized macro-action-based policies, which involve a new double Q-update rule that facilitates the learning of decentralized Q-nets by using a centralized Q-net for action selection. Our approaches are evaluated both in simulation and on real robots.

IROS Conference 2020 Conference Paper

To Ask or Not to Ask: A User Annoyance Aware Preference Elicitation Framework for Social Robots

  • Balint Gucsi
  • Danesh S. Tarapore
  • William Yeoh 0001
  • Christopher Amato
  • Long Tran-Thanh

In this paper we investigate how social robots can efficiently gather user preferences without exceeding the allowed user annoyance threshold. To do so, we use a Gazebo-based simulated office environment with a TIAGo Steel robot. We formulate the user annoyance aware preference elicitation problem as a combination of tensor completion and knapsack problems. We then test our approach on the aforementioned simulated environment and demonstrate that it can accurately estimate user preferences.
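
A toy sketch of the knapsack side of the formulation, assuming each candidate question carries a hypothetical information score (which in the paper would come from the tensor-completion model) and an annoyance cost: greedily ask by information-per-annoyance until the budget is spent.

```python
def select_questions(questions, annoyance_budget):
    """Greedy knapsack heuristic: ask the questions with the best
    information-per-annoyance ratio within the allowed budget."""
    chosen, spent = [], 0.0
    ranked = sorted(questions, key=lambda q: q["info"] / q["cost"],
                    reverse=True)
    for q in ranked:
        if spent + q["cost"] <= annoyance_budget:
            chosen.append(q["id"])
            spent += q["cost"]
    return chosen

qs = [{"id": "coffee", "info": 0.9, "cost": 0.5},
      {"id": "lights", "info": 0.4, "cost": 0.1},
      {"id": "music",  "info": 0.3, "cost": 0.6}]
print(select_questions(qs, annoyance_budget=0.7))  # ['lights', 'coffee']
```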

AAMAS Conference 2019 Conference Paper

Bayesian Reinforcement Learning in Factored POMDPs

  • Sammie Katt
  • Frans A. Oliehoek
  • Christopher Amato

Model-based Bayesian Reinforcement Learning (BRL) provides a principled solution to dealing with the exploration-exploitation trade-off, but such methods typically assume a fully observable environment. The few Bayesian RL methods that are applicable in partially observable domains, such as the Bayes-Adaptive POMDP (BA-POMDP), scale poorly. To address this issue, we introduce the Factored BA-POMDP model (FBA-POMDP), a framework that is able to learn a compact model of the dynamics by exploiting the underlying structure of a POMDP. The FBA-POMDP framework casts the problem as a planning task, for which we adapt the Monte-Carlo Tree Search planning algorithm and develop a belief tracking method to approximate the joint posterior over the state and model variables. Our empirical results show that this method outperforms a number of BRL baselines and is able to learn efficiently when the factorization is known, as well as learn both the factorization and the model parameters simultaneously.

AAAI Conference 2019 Conference Paper

Learning to Teach in Cooperative Multiagent Reinforcement Learning

  • Shayegan Omidshafiei
  • Dong-Ki Kim
  • Miao Liu
  • Gerald Tesauro
  • Matthew Riemer
  • Christopher Amato
  • Murray Campbell
  • Jonathan P. How

Collective human knowledge has clearly benefited from the fact that innovations by individuals are taught to others through communication. Similar to human social groups, agents in distributed learning systems would likely benefit from communication to share knowledge and teach skills. The problem of teaching to improve agent learning has been investigated by prior works, but these approaches make assumptions that prevent application of teaching to general multiagent problems, or require domain expertise for problems they can apply to. This learning to teach problem has inherent complexities related to measuring long-term impacts of teaching that compound the standard multiagent coordination challenges. In contrast to existing works, this paper presents the first general framework and algorithm for intelligent agents to learn to teach in a multiagent environment. Our algorithm, Learning to Coordinate and Teach Reinforcement (LeCTR), addresses peer-to-peer teaching in cooperative multiagent reinforcement learning. Each agent in our approach learns both when and what to advise, then uses the received advice to improve local learning. Importantly, these roles are not fixed; these agents learn to assume the role of student and/or teacher at the appropriate moments, requesting and providing advice in order to improve teamwide performance and learning. Empirical comparisons against state-of-the-art teaching methods show that our teaching agents not only learn significantly faster, but also learn to coordinate in tasks where existing methods fail.

JAIR Journal 2019 Journal Article

Modeling and Planning with Macro-Actions in Decentralized POMDPs

  • Christopher Amato
  • George Konidaris
  • Leslie P. Kaelbling
  • Jonathan P. How

Decentralized partially observable Markov decision processes (Dec-POMDPs) are general models for decentralized multi-agent decision making under uncertainty. However, they typically model a problem at a low level of granularity, where each agent's actions are primitive operations lasting exactly one time step. We address the case where each agent has macro-actions: temporally extended actions that may require different amounts of time to execute. We model macro-actions as options in a Dec-POMDP, focusing on actions that depend only on information directly available to the agent during execution. Therefore, we model systems where coordination decisions only occur at the level of deciding which macro-actions to execute. The core technical difficulty in this setting is that the options chosen by each agent no longer terminate at the same time. We extend three leading Dec-POMDP algorithms for policy generation to the macro-action case, and demonstrate their effectiveness in both standard benchmarks and a multi-robot coordination problem. The results show that our new algorithms retain agent coordination while allowing high-quality solutions to be generated for significantly longer horizons and larger state-spaces than previous Dec-POMDP methods. Furthermore, in the multi-robot domain, we show that, in contrast to most existing methods that are specialized to a particular problem class, our approach can synthesize control policies that exploit opportunities for coordination while balancing uncertainty, sensor information, and information about other agents.

ICRA Conference 2019 Conference Paper

Online Planning for Target Object Search in Clutter under Partial Observability

  • Yuchen Xiao
  • Sammie Katt
  • Andreas ten Pas
  • Shengjian Chen
  • Christopher Amato

The problem of finding and grasping a target object in a cluttered, uncertain environment, target object search, is a common and important problem in robotics. One key challenge is the uncertainty of locating and recognizing each object in a cluttered environment due to noisy perception and occlusions. Furthermore, the uncertainty in localization makes manipulation difficult and uncertain. To cope with these challenges, we formulate the target object search task as a partially observable Markov decision process (POMDP), enabling the robot to reason about perceptual and manipulation uncertainty while searching. To further address the manipulation difficulty, we propose Parameterized Action Partially Observable Monte-Carlo Planning (PA-POMCP), an algorithm that evaluates manipulation actions by taking into account the effect of the robot's current belief on the success of the action execution. In addition, a novel run-time initial belief generator and a state value estimator are introduced in this paper to facilitate the PA-POMCP algorithm. Our experiments show that our methods solve the target object search task in settings where simpler methods either take more object movements or fail.

NeurIPS Conference 2019 Conference Paper

Reconciling λ-Returns with Experience Replay

  • Brett Daley
  • Christopher Amato

Modern deep reinforcement learning methods have departed from the incremental learning required for eligibility traces, rendering the implementation of the λ-return difficult in this context. In particular, off-policy methods that utilize experience replay remain problematic because their random sampling of minibatches is not conducive to the efficient calculation of λ-returns. Yet replay-based methods are often the most sample efficient, and incorporating λ-returns into them is a viable way to achieve new state-of-the-art performance. Towards this, we propose the first method to enable practical use of λ-returns in arbitrary replay-based methods without relying on other forms of decorrelation such as asynchronous gradient updates. By promoting short sequences of past transitions into a small cache within the replay memory, adjacent λ-returns can be efficiently precomputed by sharing Q-values. Computation is not wasted on experiences that are never sampled, and stored λ-returns behave as stable temporal-difference (TD) targets that replace the target network. Additionally, our method grants the unique ability to observe TD errors prior to sampling; for the first time, transitions can be prioritized by their true significance rather than by a proxy to it. Furthermore, we propose the novel use of the TD error to dynamically select λ-values that facilitate faster learning. We show that these innovations can enhance the performance of DQN when playing Atari 2600 games, even under partial observability. While our work specifically focuses on λ-returns, these ideas are applicable to any multi-step return estimator.
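
The caching trick is easiest to see as one backward pass over a short contiguous sequence, as in the hedged sketch below: each λ-return reuses its successor and the shared bootstrapped Q-values. Array names and the DQN-style max-bootstrap are illustrative assumptions.

```python
import numpy as np

def cached_lambda_returns(rewards, q_next_max, dones, gamma=0.99, lam=0.8):
    """Precompute lambda-returns for a short cache of adjacent
    transitions with one backward recursion:
        G_t = r_t + gamma * ((1 - lam) * max_a Q(s_{t+1}, a) + lam * G_{t+1})
    sharing the bootstrapped Q-values between adjacent returns."""
    T = len(rewards)
    returns = np.empty(T)
    g = q_next_max[-1]                        # bootstrap at the cache edge
    for t in reversed(range(T)):
        if dones[t]:
            g = rewards[t]
        else:
            g = rewards[t] + gamma * ((1 - lam) * q_next_max[t] + lam * g)
        returns[t] = g
    return returns

r = np.array([0.0, 1.0, 0.0, 2.0])
q = np.array([0.5, 0.4, 0.6, 0.3])            # max_a Q(s_{t+1}, a)
print(cached_lambda_returns(r, q, dones=np.zeros(4, dtype=bool)))
```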

IJCAI Conference 2018 Conference Paper

Decision-Making Under Uncertainty in Multi-Agent and Multi-Robot Systems: Planning and Learning

  • Christopher Amato

Multi-agent planning and learning methods are becoming increasingly important in today's interconnected world. Methods for real-world domains, such as robotics, must consider uncertainty and limited communication in order to generate high-quality, robust solutions. This paper discusses our work on developing principled models to represent these problems and planning and learning methods that can scale to realistic multi-agent and multi-robot tasks.

ICRA Conference 2018 Conference Paper

Near-Optimal Adversarial Policy Switching for Decentralized Asynchronous Multi-Agent Systems

  • Trong Nghia Hoang
  • Yuchen Xiao
  • Kavinayan Sivakumar
  • Christopher Amato
  • Jonathan P. How

A key challenge in multi-robot and multi-agent systems is generating solutions that are robust to other self-interested or even adversarial parties who actively try to prevent the agents from achieving their goals. The practicality of existing works addressing this challenge is limited to only small-scale synchronous decision-making scenarios or a single agent planning its best response against a single adversary with fixed, procedurally characterized strategies. In contrast, this paper considers a more realistic class of problems where a team of asynchronous agents with limited observation and communication capabilities need to compete against multiple strategic adversaries with changing strategies. This problem necessitates agents that can coordinate to detect changes in adversary strategies and plan the best response accordingly. Our approach first optimizes a set of stratagems that represent these best responses. These optimized stratagems are then integrated into a unified policy that can detect and respond when the adversaries change their strategies. The near-optimality of the proposed framework is established theoretically as well as demonstrated empirically in simulation and hardware.

IJCAI Conference 2017 Conference Paper

COG-DICE: An Algorithm for Solving Continuous-Observation Dec-POMDPs

  • Madison Clark-Turner
  • Christopher Amato

The decentralized partially observable Markov decision process (Dec-POMDP) is a powerful model for representing multi-agent problems with decentralized behavior. Unfortunately, current Dec-POMDP solution methods cannot solve problems with continuous observations, which are common in many real-world domains. To that end, we present a framework for representing and generating Dec-POMDP policies that explicitly include continuous observations. We apply our algorithm to a novel tagging problem and an extended version of a common benchmark, where it generates policies that meet or exceed the values of equivalent discretized domains without the need for finding an adequate discretization.

ICML Conference 2017 Conference Paper

Deep Decentralized Multi-task Multi-Agent Reinforcement Learning under Partial Observability

  • Shayegan Omidshafiei
  • Jason Pazis
  • Christopher Amato
  • Jonathan P. How
  • John Vian

Many real-world tasks involve multiple agents with partial observability and limited communication. Learning is challenging in these settings due to local viewpoints of agents, which perceive the world as non-stationary due to concurrently-exploring teammates. Approaches that learn specialized policies for individual tasks face problems when applied to the real world: not only do agents have to learn and store distinct policies for each task, but in practice identities of tasks are often non-observable, making these approaches inapplicable. This paper formalizes and addresses the problem of multi-task multi-agent reinforcement learning under partial observability. We introduce a decentralized single-task learning approach that is robust to concurrent interactions of teammates, and present an approach for distilling single-task policies into a unified policy that performs well across multiple related tasks, without explicit provision of task identity.

IROS Conference 2017 Conference Paper

Learning for multi-robot cooperation in partially observable stochastic environments with macro-actions

  • Miao Liu 0001
  • Kavinayan Sivakumar
  • Shayegan Omidshafiei
  • Christopher Amato
  • Jonathan P. How

This paper presents a data-driven approach for multi-robot coordination in partially-observable domains based on Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) and macro-actions (MAs). Dec-POMDPs provide a general framework for cooperative sequential decision making under uncertainty and MAs allow temporally extended and asynchronous action execution. To date, most methods assume the underlying Dec-POMDP model is known a priori or a full simulator is available during planning time. Previous methods which aim to address these issues suffer from local optimality and sensitivity to initial conditions. Additionally, few hardware demonstrations involving a large team of heterogeneous robots and with long planning horizons exist. This work addresses these gaps by proposing an iterative sampling-based Expectation-Maximization algorithm (iSEM) to learn policies using only trajectory data containing observations, MAs, and rewards. Our experiments show the algorithm is able to achieve better solution quality than the state-of-the-art learning-based methods. We implement two variants of multi-robot Search and Rescue (SAR) domains (with and without obstacles) on hardware to demonstrate the learned policies can effectively control a team of distributed robots to cooperate in a partially observable stochastic environment.

ICML Conference 2017 Conference Paper

Learning in POMDPs with Monte Carlo Tree Search

  • Sammie Katt
  • Frans A. Oliehoek
  • Christopher Amato

The POMDP is a powerful framework for reasoning under outcome and information uncertainty, but constructing an accurate POMDP model is difficult. Bayes-Adaptive Partially Observable Markov Decision Processes (BA-POMDPs) extend POMDPs to allow the model to be learned during execution. BA-POMDPs are a Bayesian RL approach that, in principle, allows for an optimal trade-off between exploitation and exploration. Unfortunately, BA-POMDPs are currently impractical to solve for any non-trivial domain. In this paper, we extend the Monte-Carlo Tree Search method POMCP to BA-POMDPs and show that the resulting method, which we call BA-POMCP, is able to tackle problems that previous solution methods have been unable to solve. Additionally, we introduce several techniques that exploit the BA-POMDP structure to improve the efficiency of BA-POMCP along with proofs of their convergence.
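
One ingredient of BA-POMCP-style planning is root sampling: draw a concrete model from the Dirichlet posterior once per simulation rather than propagating count vectors through the tree. The fragment below illustrates only that step, under toy assumptions (a small discrete MDP, counts as Dirichlet parameters).

```python
import numpy as np

rng = np.random.default_rng(3)

def root_sample_model(counts):
    """Draw one transition model from the Dirichlet posterior encoded
    by visit counts; counts[s, a] holds Dirichlet parameters over next
    states. Each planning simulation would then run in this sampled
    model."""
    S, A, _ = counts.shape
    return np.array([[rng.dirichlet(counts[s, a]) for a in range(A)]
                     for s in range(S)])

counts = np.ones((3, 2, 3))        # uniform prior: 3 states, 2 actions
counts[0, 1, 2] += 10.0            # evidence that (s0, a1) leads to s2
model = root_sample_model(counts)  # model[s, a] is a next-state distribution
```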

ICRA Conference 2017 Conference Paper

Scalable accelerated decentralized multi-robot policy search in continuous observation spaces

  • Shayegan Omidshafiei
  • Christopher Amato
  • Miao Liu 0001
  • Michael Everett
  • Jonathan P. How
  • John Vian

This paper presents the first ever approach for solving continuous-observation Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) and their semi-Markovian counterparts, Dec-POSMDPs. This contribution is especially important in robotics, where a vast number of sensors provide continuous observation data. A continuous-observation policy representation is introduced using Stochastic Kernel-based Finite State Automata (SK-FSAs). An SK-FSA search algorithm titled Entropy-based Policy Search using Continuous Kernel Observations (EPSCKO) is introduced and applied to the first ever continuous-observation Dec-POMDP/Dec-POSMDP domain, where it significantly outperforms state-of-the-art discrete approaches. This methodology is equally applicable to Dec-POMDPs and Dec-POSMDPs, though the empirical analysis presented focuses on Dec-POSMDPs due to their higher scalability. To improve convergence, an entropy injection policy search acceleration approach for both continuous and discrete observation cases is also developed and shown to improve convergence rates without degrading policy quality.

ICRA Conference 2017 Conference Paper

Semantic-level decentralized multi-robot decision-making using probabilistic macro-observations

  • Shayegan Omidshafiei
  • Shih-Yuan Liu
  • Michael Everett
  • Brett T. Lopez
  • Christopher Amato
  • Miao Liu 0001
  • Jonathan P. How
  • John Vian

Robust environment perception is essential for decision-making on robots operating in complex domains. Intelligent task execution requires principled treatment of uncertainty sources in a robot's observation model. This is important not only for low-level observations (e.g., accelerometer data), but also for high-level observations such as semantic object labels. This paper formalizes the concept of macro-observations in Decentralized Partially Observable Semi-Markov Decision Processes (Dec-POSMDPs), allowing scalable semantic-level multi-robot decision making. A hierarchical Bayesian approach is used to model noise statistics of low-level classifier outputs, while simultaneously allowing sharing of domain noise characteristics between classes. Classification accuracy of the proposed macro-observation scheme, called Hierarchical Bayesian Noise Inference (HBNI), is shown to exceed existing methods. The macro-observation scheme is then integrated into a Dec-POSMDP planner, with hardware experiments running onboard a team of dynamic quadrotors in a challenging domain where noise-agnostic filtering fails. To the best of our knowledge, this is the first demonstration of a real-time, convolutional neural net-based classification framework running fully onboard a team of quadrotors in a multi-robot decision-making domain.

ICRA Conference 2016 Conference Paper

Graph-based Cross Entropy method for solving multi-robot decentralized POMDPs

  • Shayegan Omidshafiei
  • Ali-Akbar Agha-Mohammadi
  • Christopher Amato
  • Shih-Yuan Liu
  • Jonathan P. How
  • John Vian

This paper introduces a probabilistic algorithm for multi-robot decision-making under uncertainty, which can be posed as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP). Dec-POMDPs are inherently synchronous decision-making frameworks which require significant computational resources to be solved, making them infeasible for many real-world robotics applications. The Decentralized Partially Observable Semi-Markov Decision Process (Dec-POSMDP) was recently introduced as an extension of the Dec-POMDP that uses high-level macro-actions to allow large-scale, asynchronous decision-making. However, existing Dec-POSMDP solution methods have limited scalability or perform poorly as the problem size grows. This paper proposes a cross-entropy based Dec-POSMDP algorithm motivated by the combinatorial optimization literature. The algorithm is applied to a constrained package delivery domain, where it significantly outperforms existing Dec-POSMDP solution methods.
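
The underlying optimizer is the generic cross-entropy method, sketched below on a continuous toy objective; in the paper the samples parameterize graph-structured controllers and evaluation is a Dec-POSMDP rollout, both stubbed out here as assumptions.

```python
import numpy as np

def cross_entropy_search(evaluate, dim, iters=20, pop=64, elite_frac=0.125):
    """Generic cross-entropy optimization: sample candidates from a
    Gaussian, refit the Gaussian to the top-scoring elites, repeat."""
    rng = np.random.default_rng(4)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        x = rng.normal(mu, sigma, size=(pop, dim))
        scores = [evaluate(c) for c in x]
        elites = x[np.argsort(scores)[-n_elite:]]          # keep the best
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mu

# Toy objective standing in for a policy-rollout evaluation.
best = cross_entropy_search(lambda c: -np.sum((c - 0.7) ** 2), dim=5)
print(best)  # converges toward 0.7 in every coordinate
```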

AAAI Conference 2016 Conference Paper

Learning for Decentralized Control of Multiagent Systems in Large, Partially-Observable Stochastic Environments

  • Miao Liu
  • Christopher Amato
  • Emily Anesta
  • John Griffith
  • Jonathan How

Decentralized partially observable Markov decision processes (Dec-POMDPs) provide a general framework for multiagent sequential decision-making under uncertainty. Although Dec-POMDPs are typically intractable to solve for real-world problems, recent research on macro-actions (i.e., temporally-extended actions) has significantly increased the size of problems that can be solved. However, current methods assume the underlying Dec-POMDP model is known a priori or a full simulator is available during planning time. To accommodate more realistic scenarios, when such information is not available, this paper presents a policy-based reinforcement learning approach, which learns the agent policies based solely on trajectories generated by previous interaction with the environment (e.g., demonstrations). We show that our approach is able to generate valid macro-action controllers and develop an expectation-maximization (EM) algorithm (called Policy-based EM or PoEM), which has convergence guarantees for batch learning. Our experiments show PoEM is a scalable learning method that can learn optimal policies and improve upon hand-coded “expert” solutions.
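
The latent-variable structure that makes EM applicable here can be illustrated with a Baum-Welch-style reduction: treat the controller node as a hidden state, observations as inputs that drive node transitions, and actions as emissions. The sketch below learns node-transition and action-emission tables from synthetic logged trajectories; it omits PoEM's return weighting and macro-action handling, and all sizes and data are invented.

```python
import numpy as np

nQ, nO, nA = 2, 2, 2                                # nodes, obs, actions
rng = np.random.default_rng(9)
trans = rng.dirichlet(np.ones(nQ), size=(nQ, nO))   # trans[q, o, q']
emit = rng.dirichlet(np.ones(nA), size=nQ)          # emit[q, a]

# synthetic logged trajectories of (observation, action) pairs
trajs = [[(int(rng.integers(nO)), int(rng.integers(nA)))
          for _ in range(10)] for _ in range(40)]

for _ in range(25):                                 # EM iterations
    tr_num = np.full((nQ, nO, nQ), 1e-6)            # smoothed counts
    em_num = np.full((nQ, nA), 1e-6)
    for traj in trajs:
        Tn = len(traj)
        alpha = np.zeros((Tn, nQ))
        beta = np.ones((Tn, nQ))
        alpha[0] = emit[:, traj[0][1]] / nQ         # uniform node prior
        for t in range(1, Tn):                      # forward pass
            o, a = traj[t]
            alpha[t] = (alpha[t - 1] @ trans[:, o, :]) * emit[:, a]
        for t in range(Tn - 2, -1, -1):             # backward pass
            o, a = traj[t + 1]
            beta[t] = trans[:, o, :] @ (emit[:, a] * beta[t + 1])
        z = alpha[-1].sum()                         # trajectory likelihood
        for t in range(Tn - 1):                     # expected transitions
            o, a = traj[t + 1]
            xi = (alpha[t][:, None] * trans[:, o, :]
                  * (emit[:, a] * beta[t + 1])[None, :]) / z
            tr_num[:, o, :] += xi
        gamma = alpha * beta / z                    # node responsibilities
        for t, (o, a) in enumerate(traj):
            em_num[:, a] += gamma[t]
    trans = tr_num / tr_num.sum(axis=2, keepdims=True)
    emit = em_num / em_num.sum(axis=1, keepdims=True)

print(emit.round(2))                                # learned P(a | node)
```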

JAIR Journal 2016 Journal Article

Optimally Solving Dec-POMDPs as Continuous-State MDPs

  • Jilles Steeve Dibangoye
  • Christopher Amato
  • Olivier Buffet
  • François Charpillet

Decentralized partially observable Markov decision processes (Dec-POMDPs) provide a general model for decision-making under uncertainty in decentralized settings, but are difficult to solve optimally (NEXP-Complete). As a new way of solving these problems, we introduce the idea of transforming a Dec-POMDP into a continuous-state deterministic MDP with a piecewise-linear and convex value function. This approach makes use of the fact that planning can be accomplished in a centralized offline manner, while execution can still be decentralized. This new Dec-POMDP formulation, which we call an occupancy MDP, allows powerful POMDP and continuous-state MDP methods to be used for the first time. To provide scalability, we refine this approach by combining heuristic search and compact representations that exploit the structure present in multi-agent domains, without losing the ability to converge to an optimal solution. In particular, we introduce a feature-based heuristic search value iteration (FB-HSVI) algorithm that relies on feature-based compact representations, point-based updates and efficient action selection. A theoretical analysis demonstrates that FB-HSVI terminates in finite time with an optimal solution. We include an extensive empirical analysis using well-known benchmarks, thereby demonstrating that our approach provides significant scalability improvements compared to the state of the art.
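
The occupancy-state recursion at the heart of this transformation is compact enough to sketch. The toy code below (not FB-HSVI itself) shows how, once a joint decision rule is fixed, the distribution over hidden states and joint histories evolves deterministically; the tiny transition and observation tables are invented.

```python
import numpy as np
from collections import defaultdict

def next_occupancy(occ, rule, T, O):
    """occ: {(state, history): prob}; rule: history -> joint action;
    T[s, a, s']: transition probs; O[a, s', o]: joint-observation probs."""
    nxt = defaultdict(float)
    for (s, hist), prob in occ.items():
        a = rule(hist)
        for s2 in range(T.shape[2]):
            for o in range(O.shape[2]):
                q = prob * T[s, a, s2] * O[a, s2, o]
                if q > 0:
                    nxt[(s2, hist + ((a, o),))] += q
    return dict(nxt)

# an invented 2-state, 2-action, 2-observation problem
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
O = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.6, 0.4], [0.1, 0.9]]])
occ = {(0, ()): 1.0}                  # known start state, empty history
occ = next_occupancy(occ, rule=lambda h: 0, T=T, O=O)
print(occ)                            # deterministic successor occupancy
```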

ICRA Conference 2015 Conference Paper

Decentralized control of Partially Observable Markov Decision Processes using belief space macro-actions

  • Shayegan Omidshafiei
  • Ali-Akbar Agha-Mohammadi
  • Christopher Amato
  • Jonathan P. How

The focus of this paper is on solving multi-robot planning problems in continuous spaces with partial observability. Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) are general models for multi-robot coordination problems, but representing and solving Dec-POMDPs is often intractable for large problems. To allow for a high-level representation that is natural for multi-robot problems and scalable to large discrete and continuous problems, this paper extends the Dec-POMDP model to the Decentralized Partially Observable Semi-Markov Decision Process (Dec-POSMDP). The Dec-POSMDP formulation allows asynchronous decision-making by the robots, which is crucial in multi-robot domains. We also present an algorithm for solving this Dec-POSMDP which is much more scalable than previous methods since it can incorporate closed-loop belief space macro-actions in planning. These macro-actions are automatically constructed to produce robust solutions. The proposed method's performance is evaluated on a complex multi-robot package delivery problem under uncertainty, showing that our approach can naturally represent multi-robot problems and provide high-quality solutions for large-scale problems.
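
One way to picture a belief-space macro-action is as a local closed-loop policy paired with a termination condition on the belief. The sketch below is an invented encoding of that idea, not the paper's automatic construction; it only shows why agents can decide asynchronously at macro-action boundaries.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MacroAction:
    policy: Callable        # belief -> primitive action (closed-loop)
    terminates: Callable    # belief -> bool (termination condition)

def run_macro(macro, belief, step, max_steps=100):
    """Run a macro-action to termination; step(belief, action) is a
    stand-in for acting and belief updating in the environment."""
    total, t = 0.0, 0
    while not macro.terminates(belief) and t < max_steps:
        belief, r = step(belief, macro.policy(belief))
        total, t = total + r, t + 1
    return belief, total

# toy usage with a scalar "belief" that drifts toward the goal
m = MacroAction(policy=lambda b: "forward", terminates=lambda b: b > 0.8)
print(run_macro(m, 0.2, step=lambda b, a: (b + 0.3, 1.0)))
```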

IJCAI Conference 2015 Conference Paper

Exploiting Separability in Multiagent Planning with Continuous-State MDPs (Extended Abstract)

  • Jilles Steeve Dibangoye
  • Christopher Amato
  • Olivier Buffet
  • François Charpillet

Decentralized partially observable Markov decision processes (Dec-POMDPs) provide a general model for decision-making under uncertainty in cooperative decentralized settings, but are difficult to solve optimally (NEXP-Complete). As a new way of solving these problems, we recently introduced a method for transforming a Dec-POMDP into a continuous-state deterministic MDP with a piecewise-linear and convex value function. This new Dec-POMDP formulation, which we call an occupancy MDP, allows powerful POMDP and continuous-state MDP methods to be used for the first time. However, scalability remains limited when the number of agents or problem variables becomes large. In this paper, we show that, under certain separability conditions of the optimal value function, the scalability of this approach can increase considerably. This separability is present when there is locality of interaction between agents, which can be exploited to improve performance. Unlike most previous methods, the novel continuous-state MDP algorithm retains optimality and convergence guarantees. Results show that the extension using separability can scale to a large number of agents and domain variables while maintaining optimality.
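
The payoff of separability is that joint maximization factorizes. The toy check below, with invented local value tables, shows the exponential joint maximization collapsing into independent per-agent maximizations when the value is a sum of local terms.

```python
import numpy as np

rng = np.random.default_rng(2)
n_actions = 4
q1 = rng.random(n_actions)      # invented local value terms, agent 1
q2 = rng.random(n_actions)      # invented local value terms, agent 2

# maximizing over all n_actions**2 joint actions ...
joint = max(q1[a1] + q2[a2]
            for a1 in range(n_actions) for a2 in range(n_actions))
# ... matches independent per-agent maximization under separability
assert np.isclose(joint, q1.max() + q2.max())
print(joint)
```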

RLDM Conference 2015 Conference Abstract

Learning for Multiagent Decentralized Control in Large Partially Observable Stochastic Environments

  • Miao Liu
  • Christopher Amato
  • Emily Anesta
  • John Griffith
  • Jonathan How

This paper presents a probabilistic framework for learning decentralized control policies for cooperative multiagent systems operating in a large partially observable stochastic environment based on batch data (trajectories). In decentralized domains, because of communication limitations, the agents cannot share their entire belief states, so execution must proceed based on local information. Decentralized partially observable Markov decision processes (Dec-POMDPs) provide a general framework for modeling multiagent sequential decision making processes in the presence of uncertainty. Although Dec-POMDPs are typically intractable to solve for real-world problems, recent research on macro-actions in Dec-POMDPs has significantly increased the size of problems that can be solved. However, existing methods are confined to tree-based policies in finite-horizon problems, and assume the underlying POMDP models are known a priori. To accommodate more realistic scenarios when the full POMDP model is unavailable and the planning horizon is unbounded, this paper presents a policy-based reinforcement learning approach to learn the macro-action policies represented by Mealy machines. Based on trajectories of macro-actions, observations, and rewards generated by interacting with the environment with hand-coded policies (demonstrations) and random exploration, an expectation-maximization (EM) algorithm is proposed to learn the decentralized macro-action policies, leading to a new framework called POEM (Policy-based EM), which has convergence guarantees for batch learning. The performance of POEM is demonstrated on two domains, including a benchmark navigation-among-movable-obstacle problem and a newly designed large search and rescue problem. Our empirical study shows POEM is a scalable batch learning method that can learn optimal policies and achieve policy improvement over hand-coded (suboptimal) policies for missions in partially observable stochastic environments.

ICRA Conference 2015 Conference Paper

Planning for decentralized control of multiple robots under uncertainty

  • Christopher Amato
  • George Konidaris 0001
  • Gabriel Cruz
  • Christopher A. Maynor
  • Jonathan P. How
  • Leslie Pack Kaelbling

This paper presents a probabilistic framework for synthesizing control policies for general multi-robot systems that is based on decentralized partially observable Markov decision processes (Dec-POMDPs). Dec-POMDPs are a general model of decision-making where a team of agents must cooperate to optimize a shared objective in the presence of uncertainty. Dec-POMDPs also consider communication limitations, so execution is decentralized. While Dec-POMDPs are typically intractable to solve for real-world problems, recent research on the use of macro-actions in Dec-POMDPs has significantly increased the size of problems that can be practically solved. We show that, in contrast to most existing methods that are specialized to a particular problem class, our approach can synthesize control policies that exploit any opportunities for coordination that are present in the problem, while balancing uncertainty, sensor information, and information about other agents. We use three variants of a warehouse task to show that a single planner of this type can generate cooperative behavior using task allocation, direct communication, and signaling, as appropriate. This demonstrates that our algorithmic framework can automatically optimize control and communication policies for complex multi-robot systems.

AAAI Conference 2015 Conference Paper

Scalable Planning and Learning for Multiagent POMDPs

  • Christopher Amato
  • Frans Oliehoek

Online, sample-based planning algorithms for POMDPs have shown great promise in scaling to problems with large state spaces, but they become intractable for large action and observation spaces. This is particularly problematic in multiagent POMDPs where the action and observation space grows exponentially with the number of agents. To combat this intractability, we propose a novel scalable approach based on sample-based planning and factored value functions that exploits structure present in many multiagent settings. This approach applies not only in the planning case, but also in the Bayesian reinforcement learning setting. Experimental results show that we are able to provide high quality solutions to large multiagent planning and learning problems.
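
A concrete instance of exploiting factored value functions is joint-action selection on a coordination graph. The sketch below, with invented payoff factors on a three-agent chain, uses variable elimination to find the maximizing joint action without enumerating all joint actions, and verifies the result by brute force.

```python
import numpy as np

rng = np.random.default_rng(4)
nA = 3
f12 = rng.random((nA, nA))        # payoff factor over agents 1, 2
f23 = rng.random((nA, nA))        # payoff factor over agents 2, 3

g2 = f23.max(axis=1)              # eliminate agent 3: g2[a2] = max_a3 f23
g1 = f12.max(axis=0)              # eliminate agent 1: g1[a2] = max_a1 f12
a2 = int(np.argmax(g1 + g2))
a1 = int(np.argmax(f12[:, a2]))
a3 = int(np.argmax(f23[a2, :]))

brute = max(f12[i, j] + f23[j, k]
            for i in range(nA) for j in range(nA) for k in range(nA))
assert np.isclose(f12[a1, a2] + f23[a2, a3], brute)
print(a1, a2, a3)
```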

IJCAI Conference 2015 Conference Paper

Stick-Breaking Policy Learning in Dec-POMDPs

  • Miao Liu
  • Christopher Amato
  • Xuejun Liao
  • Lawrence Carin
  • Jonathan P. How

Expectation maximization (EM) has recently been shown to be an efficient algorithm for learning finite-state controllers (FSCs) in large decentralized POMDPs (Dec-POMDPs). However, current methods use fixed-size FSCs and often converge to maxima that are far from the optimal value. This paper represents the local policy of each agent using variable-sized FSCs that are constructed using a stick-breaking prior, leading to a new framework called decentralized stick-breaking policy representation (Dec-SBPR). This approach learns the controller parameters with a variational Bayesian algorithm without having to assume that the Dec-POMDP model is available. The performance of Dec-SBPR is demonstrated on several benchmark problems, showing that the algorithm scales to large problems while outperforming other state-of-the-art methods.
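
The stick-breaking construction itself is only a few lines. The sketch below, with invented hyperparameters, draws a truncated set of weights in which most mass falls on the first few controller nodes, which is what lets the controller size be inferred rather than fixed (Dec-SBPR embeds this prior in a variational Bayes learner).

```python
import numpy as np

def stick_breaking(alpha, n_max, rng):
    """Draw weights over up to n_max controller nodes: each beta draw
    breaks off a fraction of the remaining stick."""
    betas = rng.beta(1.0, alpha, size=n_max)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining

rng = np.random.default_rng(5)
w = stick_breaking(alpha=2.0, n_max=10, rng=rng)
# truncated at n_max, so the weights sum to slightly less than 1
print(w.round(3), w.sum().round(3))
```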

IJCAI Conference 2013 Conference Paper

Optimally Solving Dec-POMDPs as Continuous-State MDPs

  • Jilles Steeve Dibangoye
  • Christopher Amato
  • Olivier Buffet
  • François Charpillet

Optimally solving decentralized partially observable Markov decision processes (Dec-POMDPs) is a hard combinatorial problem. Current algorithms search through the space of full histories for each agent. Because of the doubly exponential growth in the number of policies in this space as the planning horizon increases, these methods quickly become intractable. However, in real world problems, computing policies over the full history space is often unnecessary. True histories experienced by the agents often lie near a structured, low-dimensional manifold embedded into the history space. We show that by transforming a Dec-POMDP into a continuous-state MDP, we are able to find and exploit these low-dimensional representations. Using this novel transformation, we can then apply powerful techniques for solving POMDPs and continuous-state MDPs. By combining a general search algorithm and dimension reduction based on feature selection, we introduce a novel approach to optimally solve problems with significantly longer planning horizons than previous methods.

RLDM Conference 2013 Conference Abstract

Scalable Bayesian Reinforcement Learning for Multiagent POMDPs

  • Christopher Amato
  • Frans Oliehoek
  • Eric Shyu

Bayesian methods for reinforcement learning (RL) allow model uncertainty to be considered explicitly and offer a principled way of dealing with the exploration/exploitation tradeoff. However, for multiagent systems there have been few such approaches, and none of them apply to problems with state uncertainty. In this paper, we fill this gap by proposing a Bayesian RL framework for multiagent partially observable Markov decision processes that is able to take advantage of structure present in many problems. In this framework, a team of agents operates in a centralized fashion, but has uncertainty about the model of the environment. Fitting many real-world situations, we consider the case where agents learn the appropriate models while acting in an online fashion. Because it can quickly become intractable to choose the optimal action in naive versions of this online learning problem, we propose a more scalable approach based on sample-based search and factored value functions for the set of agents. Experimental results show that we are able to provide high quality solutions to large problems even with a large amount of initial model uncertainty.
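
The Bayesian RL ingredient can be illustrated in a stripped-down, single-agent, fully observable form: keep Dirichlet counts over transition outcomes, sample a model from the posterior, and plan in the sample. Everything below (sizes, rewards, the stand-in environment) is invented; the paper's framework applies this idea to multiagent POMDPs with factored value functions and sample-based search.

```python
import numpy as np

nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(6)
counts = np.ones((nS, nA, nS))            # Dirichlet(1, ..., 1) prior
R = rng.random((nS, nA))                  # rewards assumed known

def sample_model():
    """Draw one transition model from the Dirichlet posterior."""
    return np.array([[rng.dirichlet(counts[s, a]) for a in range(nA)]
                     for s in range(nS)])

def greedy_action(s, T, iters=50):
    """Plan in the sampled model with a short value iteration."""
    V = np.zeros(nS)
    for _ in range(iters):
        V = (R + gamma * T @ V).max(axis=1)
    return int(np.argmax(R[s] + gamma * T[s] @ V))

s = 0
for _ in range(200):                      # act, observe, update posterior
    a = greedy_action(s, sample_model())
    s2 = int(rng.integers(nS))            # stand-in environment response
    counts[s, a, s2] += 1.0
    s = s2
print(counts.sum(axis=2))                 # visit counts per (s, a)
```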

UAI Conference 2012 Conference Paper

Scaling Up Decentralized MDPs Through Heuristic Search

  • Jilles Dibangoye
  • Christopher Amato
  • Arnaud Doniec

Decentralized partially observable Markov decision processes (Dec-POMDPs) are rich models for cooperative decision-making under uncertainty, but are often intractable to solve optimally (NEXP-complete). The transition and observation independent Dec-MDP is a general subclass that has been shown to have complexity in NP, but optimal algorithms for this subclass are still inefficient in practice. In this paper, we first provide an updated proof that an optimal policy does not depend on the histories of the agents, but only the local observations. We then present a new algorithm based on heuristic search that is able to expand search nodes by using constraint optimization. We show experimental results comparing our approach with the state-of-the-art Dec-MDP and Dec-POMDP solvers. These results show a reduction in computation time and an increase in scalability by multiple orders of magnitude in a number of benchmarks.

AAMAS Conference 2011 Conference Paper

Adaptive Decision Support for Structured Organizations: A Case for OrgPOMDPs

  • Pradeep Varakantham
  • Nathan Schurr
  • Alan Carlin
  • Christopher Amato

In today's world, organizations are faced with increasingly large and complex problems that require decision-making under uncertainty. Current methods for optimizing such decisions fall short of handling the problem scale and time constraints. We argue that this is due to existing methods not exploiting the inherent structure of the organizations which solve these problems. We propose a new model called the OrgPOMDP (Organizational POMDP), which is based on the partially observable Markov decision process (POMDP). This new model combines two powerful representations for modeling large scale problems: hierarchical modeling and factored representations. In this paper we make three key contributions: (a) Introduce the OrgPOMDP model; (b) Present an algorithm to solve OrgPOMDP problems efficiently; and (c) Apply OrgPOMDPs to scenarios in an existing large organization, the Air and Space Operations Center (AOC). We conduct experiments and show that our OrgPOMDP approach results in greater scalability and greatly reduced runtime. In fact, as the size of the problem increases, we soon reach a point at which the OrgPOMDP approach continues to provide solutions while traditional POMDP methods cannot. We also provide an empirical evaluation to highlight the benefits of an organization implementing an OrgPOMDP policy.

IJCAI Conference 2011 Conference Paper

Scaling Up Optimal Heuristic Search in Dec-POMDPs via Incremental Expansion

  • Matthijs T. J. Spaan
  • Frans A. Oliehoek
  • Christopher Amato

Planning under uncertainty for multiagent systems can be formalized as a decentralized partially observable Markov decision process. We advance the state of the art for optimal solution of this model, building on the Multiagent A* heuristic search method. A key insight is that we can avoid the full expansion of a search node that generates a number of children that is doubly exponential in the node's depth. Instead, we incrementally expand the children only when a next child might have the highest heuristic value. We target a subsequent bottleneck by introducing a more memory-efficient representation for our heuristic functions. Proof is given that the resulting algorithm is correct and experiments demonstrate a significant speedup over the state of the art, allowing for optimal solutions over longer horizons for many benchmark problems.
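
The incremental-expansion trick transfers to any best-first search whose nodes have huge, ordered child sets. The sketch below is a generic lazy best-first loop, not the paper's MAA* variant: it keeps one child iterator per node and materializes the next child only when the parent resurfaces at the top of the open list. The toy usage grows bit-strings.

```python
import heapq
import itertools

def lazy_best_first(root, children, heuristic, is_goal):
    """children(n) yields children in non-increasing heuristic order;
    heuristic(n) is an admissible upper bound (maximization)."""
    tie = itertools.count()                       # heap tie-breaker
    heap = [(-heuristic(root), next(tie), root, children(root))]
    while heap:
        bound, _, node, it = heapq.heappop(heap)
        if is_goal(node):
            return node
        child = next(it, None)
        if child is not None:
            # re-insert the parent: later children may still be best
            heapq.heappush(heap, (bound, next(tie), node, it))
            heapq.heappush(heap, (-heuristic(child), next(tie),
                                  child, children(child)))
    return None

# toy usage: grow bit-strings to length 3, maximizing the number of ones
children = lambda n: iter([n + "1", n + "0"]) if len(n) < 3 else iter([])
heuristic = lambda n: n.count("1") + (3 - len(n))
print(lazy_best_first("", children, heuristic, lambda n: len(n) == 3))
```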

AAAI Conference 2010 Conference Paper

Finite-State Controllers Based on Mealy Machines for Centralized and Decentralized POMDPs

  • Christopher Amato
  • Blai Bonet
  • Shlomo Zilberstein

Existing controller-based approaches for centralized and decentralized POMDPs are based on automata with output known as Moore machines. In this paper, we show that several advantages can be gained by utilizing another type of automata, the Mealy machine. Mealy machines are more powerful than Moore machines, provide a richer structure that can be exploited by solution methods, and can be easily incorporated into current controller-based approaches. To demonstrate this, we adapted some existing controller-based algorithms to use Mealy machines and obtained results on a set of benchmark domains. The Mealy-based approach always outperformed the Moore-based approach and often outperformed the state-of-the-art algorithms for both centralized and decentralized POMDPs. These findings provide fresh and general insights for the improvement of existing algorithms and the development of new ones.
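
The Moore/Mealy distinction is easy to see in code. In the toy controllers below (invented, not the adapted solvers), the Moore machine's action depends only on its current node, while the Mealy machine's action also conditions on the last observation, so a single Mealy node can already react to observations.

```python
from dataclasses import dataclass

@dataclass
class MooreController:
    action: dict            # node -> action
    transition: dict        # (node, obs) -> node
    node: int = 0
    def step(self, obs):
        self.node = self.transition[(self.node, obs)]
        return self.action[self.node]

@dataclass
class MealyController:
    action: dict            # (node, obs) -> action
    transition: dict        # (node, obs) -> node
    node: int = 0
    def step(self, obs):
        a = self.action[(self.node, obs)]
        self.node = self.transition[(self.node, obs)]
        return a

# one Mealy node reacting to the observation; a Moore machine would
# need a distinct node per desired output
mealy = MealyController(action={(0, "hot"): "cool", (0, "cold"): "heat"},
                        transition={(0, "hot"): 0, (0, "cold"): 0})
print(mealy.step("hot"), mealy.step("cold"))
```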

AAMAS Conference 2010 Conference Paper

Heuristic Search for Identical Payoff Bayesian Games

  • Frans Oliehoek
  • Matthijs Spaan
  • Jilles Dibangoye
  • Christopher Amato

Bayesian games can be used to model single-shot decision problems in which agents only possess incomplete information about other agents, and hence are important for multiagent coordination under uncertainty. Moreover, they can be used to represent different stages of sequential multiagent decision problems, such as POSGs and DEC-POMDPs, and appear as an operation in many methods for multiagent planning under uncertainty. In this paper we are interested in coordinating teams of cooperative agents. While many such problems can be formulated as Bayesian games with identical payoffs, little work has been done to improve solution methods. To help address this situation, we provide a branch and bound algorithm that optimally solves identical payoff Bayesian games. Our results show a marked improvement over previous methods, obtaining speedups of up to 3 orders of magnitude for synthetic random games, and reaching 10 orders of magnitude speedups for games in a DEC-POMDP context. This not only allows Bayesian games to be solved more efficiently, but can also improve multiagent planning techniques such as top-down and bottom-up algorithms for decentralized POMDPs.
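
A toy branch-and-bound over joint type-to-action policies makes the approach concrete. The sketch below uses a deliberately loose upper bound that lets every undecided type pair pick its best joint action; the paper's algorithm uses much stronger bounds and orderings, and all the game data here is random.

```python
import itertools
import numpy as np

rng = np.random.default_rng(7)
nT, nA = 3, 2                                 # types / actions per agent
p = rng.random((nT, nT)); p /= p.sum()        # joint type distribution
u = rng.random((nT, nT, nA, nA))              # shared (identical) payoff

pairs = list(itertools.product(range(nT), range(nT)))
best = [-np.inf]

def search(i, pol1, pol2, value):
    if i == len(pairs):
        best[0] = max(best[0], value)
        return
    # upper bound: undecided type pairs get their best joint action
    bound = value + sum(p[t1, t2] * u[t1, t2].max() for t1, t2 in pairs[i:])
    if bound <= best[0]:
        return                                # prune the branch
    t1, t2 = pairs[i]
    for a1 in range(nA):
        if t1 in pol1 and pol1[t1] != a1:
            continue                          # one action per type
        for a2 in range(nA):
            if t2 in pol2 and pol2[t2] != a2:
                continue
            search(i + 1, {**pol1, t1: a1}, {**pol2, t2: a2},
                   value + p[t1, t2] * u[t1, t2, a1, a2])

search(0, {}, {}, 0.0)
# brute-force check over all joint policies confirms optimality
brute = max(sum(p[t1, t2] * u[t1, t2, q1[t1], q2[t2]] for t1, t2 in pairs)
            for q1 in itertools.product(range(nA), repeat=nT)
            for q2 in itertools.product(range(nA), repeat=nT))
assert np.isclose(best[0], brute)
print(best[0])
```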

AAMAS Conference 2010 Conference Paper

High-level Reinforcement Learning in Strategy Games

  • Christopher Amato
  • Guy Shani

Video games provide a rich testbed for artificial intelligence methods. In particular, creating automated opponents that perform well in strategy games is a difficult task. For instance, human players rapidly discover and exploit the weaknesses of hard coded strategies. To build better strategies, we suggest a reinforcement learning approach for learning a policy that switches between high-level strategies. These strategies are chosen based on different game situations and a fixed opponent strategy. Our learning agents are able to rapidly adapt to fixed opponents and improve deficiencies in the hard coded strategies, as the results demonstrate.
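
The learner described here is essentially tabular Q-learning whose "actions" are hard-coded strategies. The sketch below invents a toy situation signal and reward just to show the loop's shape; none of the names come from the paper.

```python
import numpy as np

rng = np.random.default_rng(8)
n_situations, n_strategies = 5, 3
Q = np.zeros((n_situations, n_strategies))
alpha, gamma, eps = 0.1, 0.95, 0.1

def play(situation, strategy):
    """Stand-in for running a hard-coded strategy against the fixed
    opponent; returns (reward, next situation). Purely illustrative."""
    reward = rng.normal(loc=strategy * 0.1 + situation * 0.05)
    return reward, int(rng.integers(n_situations))

s = 0
for _ in range(5000):
    a = int(rng.integers(n_strategies)) if rng.random() < eps \
        else int(Q[s].argmax())
    r, s2 = play(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
    s = s2
print(Q.argmax(axis=1))               # learned strategy per situation
```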

ICAPS Conference 2009 Conference Paper

Incremental Policy Generation for Finite-Horizon DEC-POMDPs

  • Christopher Amato
  • Jilles Dibangoye
  • Shlomo Zilberstein

Solving multiagent planning problems modeled as DEC-POMDPs is an important challenge. These models are often solved by using dynamic programming, but the high resource usage of current approaches results in limited scalability. To improve the efficiency of dynamic programming algorithms, we propose a new backup algorithm that is based on a reachability analysis of the state space. This method, which we call incremental policy generation, can be used to produce an optimal solution for any possible initial state or further scalability can be achieved by making use of a known start state. When incorporated into the optimal dynamic programming algorithm, our experiments show that planning horizon can be increased due to a marked reduction in resource consumption. This approach also fits nicely with approximate dynamic programming algorithms. To demonstrate this, we incorporate it into the state-of-the-art PBIP algorithm and show significant performance gains. The results suggest that the performance of other dynamic programming algorithms for DEC-POMDPs could be similarly improved by integrating the incremental policy generation approach.

JAAMAS Journal 2009 Journal Article

Optimizing fixed-size stochastic controllers for POMDPs and decentralized POMDPs

  • Christopher Amato
  • Daniel S. Bernstein
  • Shlomo Zilberstein

POMDPs and their decentralized multiagent counterparts, DEC-POMDPs, offer a rich framework for sequential decision making under uncertainty. Their high computational complexity, however, presents an important research challenge. One way to address the intractable memory requirements of current algorithms is based on representing agent policies as finite-state controllers. Using this representation, we propose a new approach that formulates the problem as a nonlinear program, which defines an optimal policy of a desired size for each agent. This new formulation allows a wide range of powerful nonlinear programming algorithms to be used to solve POMDPs and DEC-POMDPs. Although solving the NLP optimally is often intractable, the results we obtain using an off-the-shelf optimization method are competitive with state-of-the-art POMDP algorithms and outperform state-of-the-art DEC-POMDP algorithms. Our approach is easy to implement and it opens up promising research directions for solving POMDPs and DEC-POMDPs using nonlinear programming methods.
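
A miniature instance of the controller-as-nonlinear-program idea, assuming SciPy is available: for a one-node stochastic controller on an invented two-state POMDP, the node's action distribution is the decision variable, the controller's value comes from solving the induced linear system, and a generic NLP solver maximizes the initial-belief value under simplex constraints. The paper's formulation instead includes value variables and Bellman equality constraints for every node and state.

```python
import numpy as np
from scipy.optimize import minimize

gamma = 0.9
T = np.array([[[0.9, 0.1], [0.2, 0.8]],   # T[s, a, s'] (invented)
              [[0.3, 0.7], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],                 # R[s, a] (invented)
              [0.0, 1.0]])
b0 = np.array([0.5, 0.5])                 # initial belief

def value(pi):
    # with a single controller node, V solves (I - gamma * T_pi) V = R_pi
    T_pi = np.einsum('a,sap->sp', pi, T)
    V = np.linalg.solve(np.eye(2) - gamma * T_pi, R @ pi)
    return b0 @ V

res = minimize(lambda pi: -value(pi), x0=np.array([0.5, 0.5]),
               bounds=[(0, 1), (0, 1)],
               constraints={'type': 'eq', 'fun': lambda pi: pi.sum() - 1})
print(res.x.round(3), -res.fun)           # action distribution and value
```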

IJCAI Conference 2007 Conference Paper

Solving POMDPs Using Quadratically Constrained Linear Programs

  • Christopher Amato
  • Daniel S. Bernstein
  • Shlomo Zilberstein

Developing scalable algorithms for solving partially observable Markov decision processes (POMDPs) is an important challenge. One approach that effectively addresses the intractable memory requirements of POMDP algorithms is based on representing POMDP policies as finite-state controllers. In this paper, we illustrate some fundamental disadvantages of existing techniques that use controllers. We then propose a new approach that formulates the problem as a quadratically constrained linear program (QCLP), which defines an optimal controller of a desired size. This representation allows a wide range of powerful nonlinear programming algorithms to be used to solve POMDPs. Although QCLP optimization techniques guarantee only local optimality, the results we obtain using an existing optimization method show significant solution improvement over the state-of-the-art techniques. The results open up promising research directions for solving large POMDPs using nonlinear programming methods.