Arrow Research search

Author name cluster

Peter Stone

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

198 papers
1 author row

Possible papers

AAAI Conference 2026 Conference Paper

Out-of-Distribution Generalization with a SPARC: Racing 100 Unseen Vehicles with a Single Policy

  • Bram Grooten
  • Patrick MacAlpine
  • Kaushik Subramanian
  • Peter Stone
  • Peter R. Wurman

Generalization to unseen environments is a significant challenge in the field of robotics and control. In this work, we focus on contextual reinforcement learning, where agents act within environments with varying contexts, such as self-driving cars or quadrupedal robots that need to operate in different terrains or weather conditions than they were trained for. We tackle the critical task of generalizing to out-of-distribution (OOD) settings, without access to explicit context information at test time. Recent work has addressed this problem by training a context encoder and a history adaptation module in separate stages. While promising, this two-phase approach is cumbersome to implement and train. We simplify the methodology and introduce SPARC: single-phase adaptation for robust control. We test SPARC on varying contexts within the high-fidelity racing simulator Gran Turismo 7 and wind-perturbed MuJoCo environments, and find that it achieves reliable and robust OOD generalization.

JAAMAS Journal 2026 Journal Article

The RoboCup Soccer Server and CMUnited Clients: Implemented Infrastructure for MAS Research

  • Itsuki Noda
  • Peter Stone

The RoboCup Soccer Server and associated client code is a growing body of software infrastructure that enables a wide variety of multiagent systems research. The Soccer Server is a multiagent environment that supports 22 independent agents interacting in a complex, real-time environment. AI researchers have been using the Soccer Server to pursue research in a wide variety of areas, including real-time multiagent planning, real-time communication methods, collaborative sensing, and multiagent learning. This article describes the current Soccer Server and the champion CMUnited soccer-playing agents, both of which are publicly available and used by a growing research community. It also describes the ongoing development of FUSS, a new, flexible simulation environment for multiagent research in a variety of multiagent domains.

IS Journal 2025 Journal Article

Artificial Intelligence: Looking Forward 15 Years

  • Peter Stone

Almost 10 years ago, I co-authored a report that predicted the effects of Artificial Intelligence on daily life in the year 2030. This article reflects on and evaluates our predictions from a decade ago and looks forward another decade and a half. While there are good reasons for both excitement and apprehension, it remains within our hands, as a society, to ensure that the benefits of AI outweigh the risks.

RLC Conference 2025 Conference Paper

Benchmarking Massively Parallelized Multi-Task Reinforcement Learning for Robotics Tasks

  • Viraj Joshi
  • Zifan Xu
  • Bo Liu
  • Peter Stone
  • Amy Zhang

Multi-task Reinforcement Learning (MTRL) has emerged as a critical training paradigm for applying reinforcement learning (RL) to a set of complex real-world robotic tasks, which demands a generalizable and robust policy. At the same time, massively parallelized training has gained popularity, not only for significantly accelerating data collection through GPU-accelerated simulation but also for enabling diverse data collection across multiple tasks by simulating heterogeneous scenes in parallel. However, existing MTRL research has largely been limited to off-policy methods like SAC in the low-parallelization regime. MTRL could capitalize on the higher asymptotic performance of on-policy algorithms, whose batches require data from the current policy, and as a result, take advantage of massive parallelization offered by GPU-accelerated simulation. To bridge this gap, we introduce a massively parallelized Multi-Task Benchmark for robotics (MTBench), an open-sourced benchmark featuring a broad distribution of 50 manipulation tasks and 20 locomotion tasks, implemented using the GPU-accelerated simulator IsaacGym. MTBench also includes four base RL algorithms combined with seven state-of-the-art MTRL algorithms and architectures, providing a unified framework for evaluating their performance. Our extensive experiments highlight the superior speed of evaluating MTRL approaches using MTBench, while also uncovering unique challenges that arise from combining massive parallelism with MTRL. Code is available at https://github.com/Viraj-Joshi/MTBench.

RLJ Journal 2025 Journal Article

Benchmarking Massively Parallelized Multi-Task Reinforcement Learning for Robotics Tasks

  • Viraj Joshi
  • Zifan Xu
  • Bo Liu
  • Peter Stone
  • Amy Zhang

Multi-task Reinforcement Learning (MTRL) has emerged as a critical training paradigm for applying reinforcement learning (RL) to a set of complex real-world robotic tasks, which demands a generalizable and robust policy. At the same time, massively parallelized training has gained popularity, not only for significantly accelerating data collection through GPU-accelerated simulation but also for enabling diverse data collection across multiple tasks by simulating heterogeneous scenes in parallel. However, existing MTRL research has largely been limited to off-policy methods like SAC in the low-parallelization regime. MTRL could capitalize on the higher asymptotic performance of on-policy algorithms, whose batches require data from the current policy, and as a result, take advantage of massive parallelization offered by GPU-accelerated simulation. To bridge this gap, we introduce a massively parallelized Multi-Task Benchmark for robotics (MTBench), an open-sourced benchmark featuring a broad distribution of 50 manipulation tasks and 20 locomotion tasks, implemented using the GPU-accelerated simulator IsaacGym. MTBench also includes four base RL algorithms combined with seven state-of-the-art MTRL algorithms and architectures, providing a unified framework for evaluating their performance. Our extensive experiments highlight the superior speed of evaluating MTRL approaches using MTBench, while also uncovering unique challenges that arise from combining massive parallelism with MTRL. Code is available at https://github.com/Viraj-Joshi/MTBench.

AAAI Conference 2025 Conference Paper

Deep Reinforcement Learning for Robotics: A Survey of Real-World Successes

  • Chen Tang
  • Ben Abbatematteo
  • Jiaheng Hu
  • Rohan Chandra
  • Roberto Martín-Martín
  • Peter Stone

Reinforcement learning (RL), particularly its combination with deep neural networks referred to as deep RL (DRL), has shown tremendous promise across a wide range of applications, suggesting its potential for enabling the development of sophisticated robotic behaviors. Robotics problems, however, pose fundamental difficulties for the application of RL, stemming from the complexity and cost of interacting with the physical world. These challenges notwithstanding, recent advances have enabled DRL to succeed at some real-world robotic tasks. However, state-of-the-art DRL solutions’ maturity varies significantly across robotic applications. In this talk, I will review the current progress of DRL in real-world robotic applications based on our recent survey paper (with Tang, Abbatematteo, Hu, Chandra, and Martín-Martín), with a particular focus on evaluating the real-world successes achieved with DRL in realizing several key robotic competencies, including locomotion, navigation, stationary manipulation, mobile manipulation, human-robot interaction, and multi-robot interaction. The analysis aims to identify the key factors underlying those exciting successes, reveal underexplored areas, and provide an overall characterization of the status of DRL in robotics. I will also highlight several important avenues for future work, emphasizing the need for stable and sample-efficient real-world RL paradigms, holistic approaches for discovering and integrating various competencies to tackle complex long-horizon, open-world tasks, and principled development and evaluation procedures. The talk is designed to offer insights for RL practitioners and roboticists toward harnessing RL’s power to create generally capable real-world robotic systems.

NeurIPS Conference 2025 Conference Paper

Dyn-O: Building Structured World Models with Object-Centric Representations

  • Zizhao Wang
  • Kaixin Wang
  • Li Zhao
  • Peter Stone
  • Jiang Bian

World models aim to capture the dynamics of the environment, enabling agents to predict and plan for future states. In most scenarios of interest, the dynamics are highly centered on interactions among objects within the environment. This motivates the development of world models that operate on object-centric rather than monolithic representations, with the goal of more effectively capturing environment dynamics and enhancing compositional generalization. However, the development of object-centric world models has largely been explored in environments with limited visual complexity (such as basic geometries). It remains underexplored whether such models can be effective in more challenging settings. In this paper, we fill this gap by introducing Dyn-O, an enhanced structured world model built upon object-centric representations. Compared to prior work in object-centric representations, Dyn-O improves in both learning representations and modeling dynamics. On the challenging Procgen games, we demonstrate that our method can learn object-centric world models directly from pixel observations, outperforming DreamerV3 in rollout prediction accuracy. Furthermore, by decoupling object-centric features into dynamic-agnostic and dynamic-aware components, we enable finer-grained manipulation of these features and generate more diverse imagined trajectories. The code of Dyn-O can be found at: https://github.com/wangzizhao/dyn-O.

NeurIPS Conference 2025 Conference Paper

Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia

  • Chandler Smith
  • Marwa Abdulhai
  • Manfred Díaz
  • Marko Tesic
  • Rakshit Trivedi
  • Sasha Vezhnevets
  • Lewis Hammond
  • Jesse Clifton

Large Language Model (LLM) agents have demonstrated impressive capabilities for social interaction and are increasingly being deployed in situations where they might engage with both human and artificial agents. These interactions represent a critical frontier for LLM-based agents, yet existing evaluation methods fail to measure how well these capabilities generalize to novel social situations. In this paper, we introduce a method for evaluating the ability of LLM-based agents to cooperate in zero-shot, mixed-motive environments using Concordia, a natural language multi-agent simulation environment. Our method measures general cooperative intelligence by testing an agent's ability to identify and exploit opportunities for mutual gain across diverse partners and contexts. We present empirical results from the NeurIPS 2024 Concordia Contest, where agents were evaluated on their ability to achieve mutual gains across a suite of diverse scenarios ranging from negotiation to collective action problems. Our findings reveal significant gaps between current agent capabilities and the robust generalization required for reliable cooperation, particularly in scenarios demanding persuasion and norm enforcement.

EWRL Workshop 2025 Workshop Paper

Generalization with a SPARC: Single-Phase Adaptation for Reinforcement Learning in Contextual Environments

  • Bram Grooten
  • Patrick MacAlpine
  • Kaushik Subramanian
  • Peter R. Wurman
  • Peter Stone

Generalization to unseen environments is a significant challenge in the field of robotics and control. In this work, we focus on contextual reinforcement learning, where the agent acts within environments with varying contexts, such as self-driving cars or quadrupedal robots that need to operate in different terrains or weather conditions than they were trained for. We tackle the critical task of generalizing to out-of-distribution (OOD) contexts, without access to explicit context information at test time. Recent work has addressed this problem by training a context encoder and a history adaptation module in separate stages. While promising, this two-phase approach is cumbersome to implement and train. We simplify the methodology and introduce SPARC, a single-phase adaptation method for reinforcement learning in contextual environments. We evaluate SPARC on varying contexts within MuJoCo environments and the high-fidelity racing simulator Gran Turismo 7 and find that it achieves competitive or superior performance on OOD generalization.

RLJ Journal 2025 Journal Article

Offline Action-Free Learning of Ex-BMDPs by Comparing Diverse Datasets

  • Alexander Levine
  • Peter Stone
  • Amy Zhang

While sequential decision-making environments often involve high-dimensional observations, not all features of these observations are relevant for control. In particular, the observation space may capture factors of the environment which are not controllable by the agent, but which add complexity to the observation space. The need to ignore these "noise" features in order to operate in a tractably-small state space poses a challenge for efficient policy learning. Due to the abundance of video data available in many such environments, task-independent representation learning from action-free offline data offers an attractive solution. However, recent work has highlighted theoretical limitations in action-free learning under the Exogenous Block MDP (Ex-BMDP) model, where temporally-correlated noise features are present in the observations. To address these limitations, we identify a realistic setting where representation learning in Ex-BMDPs becomes tractable: when action-free video data from multiple agents with differing policies are available. Concretely, this paper introduces CRAFT (Comparison-based Representations from Action-Free Trajectories), a sample-efficient algorithm leveraging differences in controllable feature dynamics across agents to learn representations. We provide theoretical guarantees for CRAFT's performance and demonstrate its feasibility on a toy example, offering a foundation for practical methods in similar settings.

RLC Conference 2025 Conference Paper

Offline Action-Free Learning of Ex-BMDPs by Comparing Diverse Datasets

  • Alexander Levine
  • Peter Stone
  • Amy Zhang

While sequential decision-making environments often involve high-dimensional observations, not all features of these observations are relevant for control. In particular, the observation space may capture factors of the environment which are not controllable by the agent, but which add complexity to the observation space. The need to ignore these "noise" features in order to operate in a tractably-small state space poses a challenge for efficient policy learning. Due to the abundance of video data available in many such environments, task-independent representation learning from action-free offline data offers an attractive solution. However, recent work has highlighted theoretical limitations in action-free learning under the Exogenous Block MDP (Ex-BMDP) model, where temporally-correlated noise features are present in the observations. To address these limitations, we identify a realistic setting where representation learning in Ex-BMDPs becomes tractable: when action-free video data from multiple agents with differing policies are available. Concretely, this paper introduces CRAFT (Comparison-based Representations from Action-Free Trajectories), a sample-efficient algorithm leveraging differences in controllable feature dynamics across agents to learn representations. We provide theoretical guarantees for CRAFT's performance and demonstrate its feasibility on a toy example, offering a foundation for practical methods in similar settings.

RLC Conference 2025 Conference Paper

ProtoCRL: Prototype-based Network for Continual Reinforcement Learning

  • Michela Proietti
  • Peter R. Wurman
  • Peter Stone
  • Roberto Capobianco

The purpose of continual reinforcement learning is to train an agent on a sequence of tasks such that it learns the ones that appear later in the sequence while retaining the ability to perform the tasks that appeared earlier. Experience replay is a popular method used to make the agent remember previous tasks, but its effectiveness strongly relies on the selection of experiences to store. Kompella et al. (2023) proposed organizing the experience replay buffer into partitions, each storing transitions leading to a rare but crucial event, such that these key experiences get revisited more often during training. However, the method is sensitive to the manual selection of event states. To address this issue, we introduce ProtoCRL, a prototype-based architecture leveraging a variational Gaussian mixture model to automatically discover effective event states and build the associated partitions in the experience replay buffer. The proposed approach is tested on a sequence of MiniGrid environments, demonstrating the agent's ability to adapt and learn new skills incrementally.

RLJ Journal 2025 Journal Article

ProtoCRL: Prototype-based Network for Continual Reinforcement Learning

  • Michela Proietti
  • Peter R. Wurman
  • Peter Stone
  • Roberto Capobianco

The purpose of continual reinforcement learning is to train an agent on a sequence of tasks such that it learns the ones that appear later in the sequence while retaining the ability to perform the tasks that appeared earlier. Experience replay is a popular method used to make the agent remember previous tasks, but its effectiveness strongly relies on the selection of experiences to store. Kompella et al. (2023) proposed organizing the experience replay buffer into partitions, each storing transitions leading to a rare but crucial event, such that these key experiences get revisited more often during training. However, the method is sensitive to the manual selection of event states. To address this issue, we introduce ProtoCRL, a prototype-based architecture leveraging a variational Gaussian mixture model to automatically discover effective event states and build the associated partitions in the experience replay buffer. The proposed approach is tested on a sequence of MiniGrid environments, demonstrating the agent's ability to adapt and learn new skills incrementally.

NeurIPS Conference 2025 Conference Paper

RLZero: Direct Policy Inference from Language Without In-Domain Supervision

  • Harshit Sushil Sikchi
  • Siddhant Agarwal
  • Pranaya Jajoo
  • Samyak Parajuli
  • Caleb Chuck
  • Max Rudolph
  • Peter Stone
  • Amy Zhang

The reward hypothesis states that all goals and purposes can be understood as the maximization of a received scalar reward signal. However, in practice, defining such a reward signal is notoriously difficult, as humans are often unable to predict the optimal behavior corresponding to a reward function. Natural language offers an intuitive alternative for instructing reinforcement learning (RL) agents, yet previous language-conditioned approaches either require costly supervision or test-time training given a language instruction. In this work, we present a new approach that uses a pretrained RL agent trained using only unlabeled, offline interactions—without task-specific supervision or labeled trajectories—to get zero-shot test-time policy inference from arbitrary natural language instructions. We introduce a framework comprising three steps: imagine, project, and imitate. First, the agent imagines a sequence of observations corresponding to the provided language description using video generative models. Next, these imagined observations are projected into the target environment domain. Finally, an agent pretrained in the target environment with unsupervised RL instantly imitates the projected observation sequence through a closed-form solution. To the best of our knowledge, our method, RLZero, is the first approach to show direct language-to-behavior generation abilities on a variety of tasks and environments without any in-domain supervision. We further show that components of RLZero can be used to generate policies zero-shot from cross-embodied videos, such as those available on YouTube, even for complex embodiments like humanoids.

RLC Conference 2024 Conference Paper

A Super-human Vision-based Reinforcement Learning Agent for Autonomous Racing in Gran Turismo

  • Miguel Vasco
  • Takuma Seno
  • Kenta Kawamoto
  • Kaushik Subramanian
  • Peter R. Wurman
  • Peter Stone

Racing autonomous cars faster than the best human drivers has been a longstanding grand challenge for the fields of Artificial Intelligence and robotics. Recently, an end-to-end deep reinforcement learning agent met this challenge in a high-fidelity racing simulator, Gran Turismo. However, this agent relied on global features that require instrumentation external to the car. This paper introduces, to the best of our knowledge, the first super-human car racing agent whose sensor input is purely local to the car, namely pixels from an ego-centric camera view and quantities that can be sensed from on-board the car, such as the car's velocity. By leveraging global features only at training time, the learned agent is able to outperform the best human drivers in time trial (one car on the track at a time) races using only local input features. The resulting agent is evaluated in Gran Turismo 7 on multiple tracks and cars. Detailed ablation experiments demonstrate the agent's strong reliance on visual inputs, making it the first vision-based super-human car racing agent.

RLJ Journal 2024 Journal Article

A Super-human Vision-based Reinforcement Learning Agent for Autonomous Racing in Gran Turismo

  • Miguel Vasco
  • Takuma Seno
  • Kenta Kawamoto
  • Kaushik Subramanian
  • Peter R. Wurman
  • Peter Stone

Racing autonomous cars faster than the best human drivers has been a longstanding grand challenge for the fields of Artificial Intelligence and robotics. Recently, an end-to-end deep reinforcement learning agent met this challenge in a high-fidelity racing simulator, Gran Turismo. However, this agent relied on global features that require instrumentation external to the car. This paper introduces, to the best of our knowledge, the first super-human car racing agent whose sensor input is purely local to the car, namely pixels from an ego-centric camera view and quantities that can be sensed from on-board the car, such as the car's velocity. By leveraging global features only at training time, the learned agent is able to outperform the best human drivers in time trial (one car on the track at a time) races using only local input features. The resulting agent is evaluated in Gran Turismo 7 on multiple tracks and cars. Detailed ablation experiments demonstrate the agent's strong reliance on visual inputs, making it the first vision-based super-human car racing agent.

AAAI Conference 2024 Conference Paper

Building Minimal and Reusable Causal State Abstractions for Reinforcement Learning

  • Zizhao Wang
  • Caroline Wang
  • Xuesu Xiao
  • Yuke Zhu
  • Peter Stone

Two desiderata of reinforcement learning (RL) algorithms are the ability to learn from relatively little experience and the ability to learn policies that generalize to a range of problem specifications. In factored state spaces, one approach towards achieving both goals is to learn state abstractions, which only keep the necessary variables for learning the tasks at hand. This paper introduces Causal Bisimulation Modeling (CBM), a method that learns the causal relationships in the dynamics and reward functions for each task to derive a minimal, task-specific abstraction. CBM leverages and improves implicit modeling to train a high-fidelity causal dynamics model that can be reused for all tasks in the same environment. Empirical validation on two manipulation environments and four tasks reveals that CBM's learned implicit dynamics models identify the underlying causal relationships and state abstractions more accurately than explicit ones. Furthermore, the derived state abstractions allow a task learner to achieve near-oracle levels of sample efficiency and outperform baselines on all tasks.

JMLR Journal 2024 Journal Article

Data-Efficient Policy Evaluation Through Behavior Policy Search

  • Josiah P. Hanna
  • Yash Chandak
  • Philip S. Thomas
  • Martha White
  • Peter Stone
  • Scott Niekum

We consider the task of evaluating a policy for a Markov decision process (MDP). The standard unbiased technique for evaluating a policy is to deploy the policy and observe its performance. We show that the data collected from deploying a different policy, commonly called the behavior policy, can be used to produce unbiased estimates with lower mean squared error than this standard technique. We derive an analytic expression for a minimal variance behavior policy -- a behavior policy that minimizes the mean squared error of the resulting estimates. Because this expression depends on terms that are unknown in practice, we propose a novel policy evaluation sub-problem, behavior policy search: searching for a behavior policy that reduces mean squared error. We present two behavior policy search algorithms and empirically demonstrate their effectiveness in lowering the mean squared error of policy performance estimates.

NeurIPS Conference 2024 Conference Paper

Discovering Creative Behaviors through DUPLEX: Diverse Universal Features for Policy Exploration

  • Borja G. Leon
  • Francesco Riccio
  • Kaushik Subramanian
  • Peter R. Wurman
  • Peter Stone

The ability to approach the same problem from different angles is a cornerstone of human intelligence that leads to robust solutions and effective adaptation to problem variations. In contrast, current RL methodologies tend to lead to policies that settle on a single solution to a given problem, making them brittle to problem variations. Replicating human flexibility in reinforcement learning agents is the challenge that we explore in this work. We tackle this challenge by extending state-of-the-art approaches to introduce DUPLEX, a method that explicitly defines a diversity objective with constraints and makes robust estimates of policies’ expected behavior through successor features. The trained agents can (i) learn a diverse set of near-optimal policies in complex highly-dynamic environments and (ii) exhibit competitive and diverse skills in out-of-distribution (OOD) contexts. Empirical results indicate that DUPLEX improves over previous methods and successfully learns competitive driving styles in a hyper-realistic simulator (i.e., Gran Turismo™ 7) as well as diverse and effective policies in several multi-context robotics MuJoCo simulations with OOD gravity forces and height limits. To the best of our knowledge, our method is the first to achieve diverse solutions in complex driving simulators and OOD robotic contexts. DUPLEX agents demonstrating diverse behaviors can be found at https://ai.sony/publications/Discovering-Creative-Behaviors-through-DUPLEX-Diverse-Universal-Features-for-Policy-Exploration/.

NeurIPS Conference 2024 Conference Paper

Disentangled Unsupervised Skill Discovery for Efficient Hierarchical Reinforcement Learning

  • Jiaheng Hu
  • Zizhao Wang
  • Peter Stone
  • Roberto Martín-Martín

A hallmark of intelligent agents is the ability to learn reusable skills purely from unsupervised interaction with the environment. However, existing unsupervised skill discovery methods often learn entangled skills where one skill variable simultaneously influences many entities in the environment, making downstream skill chaining extremely challenging. We propose Disentangled Unsupervised Skill Discovery (DUSDi), a method for learning disentangled skills that can be efficiently reused to solve downstream tasks. DUSDi decomposes skills into disentangled components, where each skill component only affects one factor of the state space. Importantly, these skill components can be concurrently composed to generate low-level actions, and efficiently chained to tackle downstream tasks through hierarchical Reinforcement Learning. DUSDi defines a novel mutual-information-based objective to enforce disentanglement between the influences of different skill components, and utilizes value factorization to optimize this objective efficiently. Evaluated in a set of challenging environments, DUSDi successfully learns disentangled skills, and significantly outperforms previous skill discovery methods when it comes to applying the learned skills to solve downstream tasks.

EWRL Workshop 2024 Workshop Paper

Image-Based Dataset Representations for Predicting Learning Performance in Offline RL

  • Enrique Mateos-Melero
  • Miguel Iglesias Alcázar
  • Raquel Fuentetaja
  • Peter Stone
  • Fernando Fernández

In this paper, we address the challenge of predicting learning performance in offline Reinforcement Learning (RL). It is a crucial task to ensure the learned policy performs reliably in the real world and to avoid unsafe or costly interactions. We introduce a new approach that utilizes Convolutional Neural Networks (CNNs) to analyze offline RL datasets, represented as images. Our model predicts the performance of policies learned from these datasets within a specific RL framework, including the selected algorithm and hyperparameters. We explore the model's transferability across different scenarios with alterations in state space size or transition functions. Furthermore, we demonstrate an application of our model in optimizing offline RL datasets. Leveraging genetic algorithms, we navigate through potential dataset subsets to identify a reduced version that enhances policy learning efficiency. This optimized dataset reduces training time while achieving comparable or superior performance to the complete dataset.

AAAI Conference 2024 Conference Paper

Learning Optimal Advantage from Preferences and Mistaking It for Reward

  • W. Bradley Knox
  • Stephane Hatgis-Kessell
  • Sigurdur Orn Adalgeirsson
  • Serena Booth
  • Anca Dragan
  • Peter Stone
  • Scott Niekum

We consider algorithms for learning reward functions from human preferences over pairs of trajectory segments, as used in reinforcement learning from human feedback (RLHF). Most recent work assumes that human preferences are generated based only upon the reward accrued within those segments, or their partial return. Recent work casts doubt on the validity of this assumption, proposing an alternative preference model based upon regret. We investigate the consequences of assuming preferences are based upon partial return when they actually arise from regret. We argue that the learned function is an approximation of the optimal advantage function, not a reward function. We find that if a specific pitfall is addressed, this incorrect assumption is not particularly harmful, resulting in a highly shaped reward function. Nonetheless, this incorrect usage of the approximation of the optimal advantage function is less desirable than the appropriate and simpler approach of greedy maximization of it. From the perspective of the regret preference model, we also provide a clearer interpretation of fine tuning contemporary large language models with RLHF. This paper overall provides insight regarding why learning under the partial return preference model tends to work so well in practice, despite it conforming poorly to how humans give preferences.

AAAI Conference 2024 Conference Paper

Minimum Coverage Sets for Training Robust Ad Hoc Teamwork Agents

  • Muhammad Rahman
  • Jiaxun Cui
  • Peter Stone

Robustly cooperating with unseen agents and human partners presents significant challenges due to the diverse cooperative conventions these partners may adopt. Existing Ad Hoc Teamwork (AHT) methods address this challenge by training an agent with a population of diverse teammate policies obtained through maximizing specific diversity metrics. However, prior heuristic-based diversity metrics do not always maximize the agent's robustness in all cooperative problems. In this work, we first propose that maximizing an AHT agent's robustness requires it to emulate policies in the minimum coverage set (MCS), the set of best-response policies to any partner policies in the environment. We then introduce the L-BRDiv algorithm that generates a set of teammate policies that, when used for AHT training, encourage agents to emulate policies from the MCS. L-BRDiv works by solving a constrained optimization problem to jointly train teammate policies for AHT training and approximating AHT agent policies that are members of the MCS. We empirically demonstrate that L-BRDiv produces more robust AHT agents than state-of-the-art methods in a broader range of two-player cooperative problems without the need for extensive hyperparameter tuning for its objectives. Our study shows that L-BRDiv outperforms the baseline methods by prioritizing discovering distinct members of the MCS instead of repeatedly finding redundant policies.

TMLR Journal 2024 Journal Article

Models of human preference for learning reward functions

  • W. Bradley Knox
  • Stephane Hatgis-Kessell
  • Serena Booth
  • Scott Niekum
  • Peter Stone
  • Alessandro G Allievi

The utility of reinforcement learning is limited by the alignment of reward functions with the interests of human stakeholders. One promising method for alignment is to learn the reward function from human-generated preferences between pairs of trajectory segments, a type of reinforcement learning from human feedback (RLHF). These human preferences are typically assumed to be informed solely by partial return, the sum of rewards along each segment. We find this assumption to be flawed and propose modeling human preferences instead as informed by each segment’s regret, a measure of a segment’s deviation from optimal decision-making. Given infinitely many preferences generated according to regret, we prove that we can identify a reward function equivalent to the reward function that generated those preferences, and we prove that the previous partial return model lacks this identifiability property in multiple contexts. We empirically show that our proposed regret preference model outperforms the partial return preference model with finite training data in otherwise the same setting. Additionally, we find that our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to policies that are better human-aligned. Overall, this work establishes that the choice of preference model is impactful, and our proposed regret preference model provides an improvement upon a core assumption of recent research. We have open sourced our experimental code, the human preferences dataset we gathered, and our training and preference elicitation interfaces for gathering such a dataset.

RLC Conference 2024 Conference Paper

Multistep Inverse Is Not All You Need

  • Alexander Levine
  • Peter Stone
  • Amy Zhang

In real-world control settings, the observation space is often unnecessarily high-dimensional and subject to time-correlated noise. However, the *controllable* dynamics of the system are often far simpler than the dynamics of the raw observations. It is therefore desirable to learn an encoder to map the observation space to a simpler space of control-relevant variables. In this work, we consider the Ex-BMDP model, first proposed by Efroni et al. (2022), which formalizes control problems where observations can be factorized into an action-dependent latent state which evolves deterministically, and action-independent time-correlated noise. Lamb et al. (2022) propose the "AC-State" method for learning an encoder to extract a complete action-dependent latent state representation from the observations in such problems. AC-State is a *multistep-inverse* method, in that it uses the encoding of the first and last state in a path to predict the *first* action in the path. However, we identify cases where AC-State will fail to learn a correct latent representation of the agent-controllable factor of the state. We therefore propose a new algorithm, ACDF, which combines multistep-inverse prediction with a latent forward model. ACDF is guaranteed to correctly infer an action-dependent latent state encoder for a large class of Ex-BMDP models. We demonstrate the effectiveness of ACDF on tabular Ex-BMDPs through numerical simulations, as well as on high-dimensional environments using neural-network-based encoders. Code is available at https://github.com/midi-lab/acdf.

RLJ Journal 2024 Journal Article

Multistep Inverse Is Not All You Need

  • Alexander Levine
  • Peter Stone
  • Amy Zhang

In real-world control settings, the observation space is often unnecessarily high-dimensional and subject to time-correlated noise. However, the *controllable* dynamics of the system are often far simpler than the dynamics of the raw observations. It is therefore desirable to learn an encoder to map the observation space to a simpler space of control-relevant variables. In this work, we consider the Ex-BMDP model, first proposed by Efroni et al. (2022), which formalizes control problems where observations can be factorized into an action-dependent latent state which evolves deterministically, and action-independent time-correlated noise. Lamb et al. (2022) propose the "AC-State" method for learning an encoder to extract a complete action-dependent latent state representation from the observations in such problems. AC-State is a *multistep-inverse* method, in that it uses the encoding of the first and last state in a path to predict the *first* action in the path. However, we identify cases where AC-State will fail to learn a correct latent representation of the agent-controllable factor of the state. We therefore propose a new algorithm, ACDF, which combines multistep-inverse prediction with a latent forward model. ACDF is guaranteed to correctly infer an action-dependent latent state encoder for a large class of Ex-BMDP models. We demonstrate the effectiveness of ACDF on tabular Ex-BMDPs through numerical simulations, as well as on high-dimensional environments using neural-network-based encoders. Code is available at https://github.com/midi-lab/acdf.

NeurIPS Conference 2024 Conference Paper

N-agent Ad Hoc Teamwork

  • Caroline Wang
  • Arrasy Rahman
  • Ishan Durugkar
  • Elad Liebman
  • Peter Stone

Current approaches to learning cooperative multi-agent behaviors assume relatively restrictive settings. In standard fully cooperative multi-agent reinforcement learning, the learning algorithm controls *all* agents in the scenario, while in ad hoc teamwork, the learning algorithm usually assumes control over only a *single* agent in the scenario. However, many cooperative settings in the real world are much less restrictive. For example, in an autonomous driving scenario, a company might train its cars with the same learning algorithm, yet once on the road, these cars must cooperate with cars from another company. Towards expanding the class of scenarios that cooperative learning methods may optimally address, we introduce $N$*-agent ad hoc teamwork* (NAHT), where a set of autonomous agents must interact and cooperate with dynamically varying numbers and types of teammates. This paper formalizes the problem, and proposes the *Policy Optimization with Agent Modelling* (POAM) algorithm. POAM is a policy gradient, multi-agent reinforcement learning approach to the NAHT problem, that enables adaptation to diverse teammate behaviors by learning representations of teammate behaviors. Empirical evaluation on tasks from the multi-agent particle environment and StarCraft II shows that POAM improves cooperative task returns compared to baseline approaches, and enables out-of-distribution generalization to unseen teammates.

AAMAS Conference 2024 Conference Paper

Overview of t-DGR: A Trajectory-Based Deep Generative Replay Method for Continual Learning in Decision Making

  • William Yue
  • Bo Liu
  • Peter Stone

Deep generative replay has emerged as a promising approach for continual learning in decision-making tasks. This approach addresses the problem of catastrophic forgetting by leveraging the generation of trajectories from previously encountered tasks to augment the current dataset. However, existing deep generative replay methods for continual learning rely on autoregressive models, which suffer from compounding errors in the generated trajectories. In this extended abstract, we summarize a simple, scalable, and non-autoregressive method for continual learning in decision-making tasks using a generative model that generates task samples conditioned on the trajectory timestep. We evaluate our method on Continual World benchmarks and find that our approach achieves state-of-the-art performance on the average success rate metric among continual learning methods. Code and a preprint of a complete paper with full details are available at https://github.com/WilliamYue37/t-DGR.

AAMAS Conference 2024 Conference Paper

Relaxed Exploration Constrained Reinforcement Learning

  • Shahaf S. Shperberg
  • Bo Liu
  • Peter Stone

This research introduces a novel setting for reinforcement learning with constraints, termed Relaxed Exploration Constrained Reinforcement Learning (RECRL). Similar to standard constrained reinforcement learning (CRL), the objective in RECRL is to discover a policy that maximizes the environmental return while adhering to a predefined set of constraints. However, in some real-world settings, it is possible to train the agent in a setting that does not require strict adherence to the constraints, as long as the agent adheres to them once deployed. To model such settings, we introduce RECRL, which explicitly incorporates an initial training phase where the constraints are relaxed, enabling the agent to explore the environment more freely. Subsequently, during deployment, the agent is obligated to fully satisfy all constraints. To address RECRL problems, we introduce a curriculum-based approach called CLiC, designed to enhance the exploration of existing CRL algorithms during the training phase and facilitate convergence towards a policy that satisfies the full set of constraints by the end of training. Empirical evaluations demonstrate that CLiC yields policies with significantly higher returns during deployment compared to training solely under the strict set of constraints. The code is available at https://github.com/Shperb/RECRL.

AAAI Conference 2024 Conference Paper

Reward (Mis)design for Autonomous Driving (Abstract Reprint)

  • W. Bradley Knox
  • Alessandro Allievi
  • Holger Banzhaf
  • Felix Schmitt
  • Peter Stone

This article considers the problem of diagnosing certain common errors in reward design. Its insights are also applicable to the design of cost functions and performance metrics more generally. To diagnose common errors, we develop 8 simple sanity checks for identifying flaws in reward functions. We survey research that is published in top-tier venues and focuses on reinforcement learning (RL) for autonomous driving (AD). Specifically, we closely examine the reported reward function in each publication and present these reward functions in a complete and standardized format in the appendix. Wherever we have sufficient information, we apply the 8 sanity checks to each surveyed reward function, revealing near-universal flaws in reward design for AD that might also exist pervasively across reward design for other tasks. Lastly, we explore promising directions that may aid the design of reward functions for AD in subsequent research, following a process of inquiry that can be adapted to other domains.

NeurIPS Conference 2024 Conference Paper

SkiLD: Unsupervised Skill Discovery Guided by Factor Interactions

  • Zizhao Wang
  • Jiaheng Hu
  • Caleb Chuck
  • Stephen Chen
  • Roberto Martín-Martín
  • Amy Zhang
  • Scott Niekum
  • Peter Stone

Unsupervised skill discovery carries the promise that an intelligent agent can learn reusable skills through autonomous, reward-free interactions with environments. Existing unsupervised skill discovery methods learn skills by encouraging distinguishable behaviors that cover diverse states. However, in complex environments with many state factors (e.g., household environments with many objects), learning skills that cover all possible states is impossible, and naively encouraging state diversity often leads to simple skills that are not ideal for solving downstream tasks. This work introduces Skill Discovery from Local Dependencies (SkiLD), which leverages state factorization as a natural inductive bias to guide the skill learning process. The key intuition guiding SkiLD is that skills that induce **diverse interactions** between state factors are often more valuable for solving downstream tasks. To this end, SkiLD develops a novel skill learning objective that explicitly encourages the mastering of skills that effectively induce different interactions within an environment. We evaluate SkiLD in several domains with challenging, long-horizon sparse reward tasks including a realistic simulated household robot domain, where SkiLD successfully learns skills with clear semantic meaning and shows superior performance compared to existing unsupervised reinforcement learning methods that only maximize state coverage.

AAMAS Conference 2023 Conference Paper

D-Shape: Demonstration-Shaped Reinforcement Learning via Goal-Conditioning

  • Caroline Wang
  • Garrett Warnell
  • Peter Stone

While combining imitation learning (IL) and reinforcement learning (RL) is a promising way to address poor sample efficiency in autonomous behavior acquisition, methods that do so typically assume that the requisite behavior demonstrations are provided by an expert that behaves optimally with respect to a task reward. If, however, suboptimal demonstrations are provided, a fundamental challenge appears in that the demonstration-matching objective of IL conflicts with the return-maximization objective of RL. This paper introduces D-Shape, a new method for combining IL and RL that uses ideas from reward shaping and goal-conditioned RL to resolve the above conflict. D-Shape allows learning from suboptimal demonstrations while retaining the ability to find the optimal policy with respect to the task reward. We experimentally validate D-Shape in sparse-reward gridworld domains, showing that it both improves over RL in terms of sample efficiency and converges consistently to the optimal policy in the presence of suboptimal demonstrations.

AAAI Conference 2023 Conference Paper

DM²: Decentralized Multi-Agent Reinforcement Learning via Distribution Matching

  • Caroline Wang
  • Ishan Durugkar
  • Elad Liebman
  • Peter Stone

Current approaches to multi-agent cooperation rely heavily on centralized mechanisms or explicit communication protocols to ensure convergence. This paper studies the problem of distributed multi-agent learning without resorting to centralized components or explicit communication. It examines the use of distribution matching to facilitate the coordination of independent agents. In the proposed scheme, each agent independently minimizes the distribution mismatch to the corresponding component of a target visitation distribution. The theoretical analysis shows that under certain conditions, each agent minimizing its individual distribution mismatch allows the convergence to the joint policy that generated the target distribution. Further, if the target distribution is from a joint policy that optimizes a cooperative task, the optimal policy for a combination of this task reward and the distribution matching reward is the same joint policy. This insight is used to formulate a practical algorithm (DM^2), in which each individual agent matches a target distribution derived from concurrently sampled trajectories from a joint expert policy. Experimental validation on the StarCraft domain shows that combining (1) a task reward, and (2) a distribution matching reward for expert demonstrations for the same task, allows agents to outperform a naive distributed baseline. Additional experiments probe the conditions under which expert demonstrations need to be sampled to obtain the learning benefits.

NeurIPS Conference 2023 Conference Paper

ELDEN: Exploration via Local Dependencies

  • Zizhao Wang
  • Jiaheng Hu
  • Peter Stone
  • Roberto Martín-Martín

Tasks with large state spaces and sparse rewards present a longstanding challenge to reinforcement learning. In these tasks, an agent needs to explore the state space efficiently until it finds a reward. To deal with this problem, the community has proposed to augment the reward function with intrinsic reward, a bonus signal that encourages the agent to visit interesting states. In this work, we propose a new way of defining interesting states for environments with factored state spaces and complex chained dependencies, where an agent's actions may change the value of one entity that, in turn, may affect the value of another entity. Our insight is that, in these environments, interesting states for exploration are states where the agent is uncertain whether (as opposed to how) entities such as the agent or objects have some influence on each other. We present ELDEN, Exploration via Local DepENdencies, a novel intrinsic reward that encourages the discovery of new interactions between entities. ELDEN utilizes a novel scheme, the partial derivative of the learned dynamics, to model the local dependencies between entities accurately and in a computationally efficient way. The uncertainty of the predicted dependencies is then used as an intrinsic reward to encourage exploration toward new interactions. We evaluate the performance of ELDEN on four different domains with complex dependencies, ranging from 2D grid worlds to 3D robotic tasks. In all domains, ELDEN correctly identifies local dependencies and learns successful policies, significantly outperforming previous state-of-the-art exploration methods.

TMLR Journal 2023 Journal Article

Event Tables for Efficient Experience Replay

  • Varun Raj Kompella
  • Thomas Walsh
  • Samuel Barrett
  • Peter R. Wurman
  • Peter Stone

Experience replay (ER) is a crucial component of many deep reinforcement learning (RL) systems. However, uniform sampling from an ER buffer can lead to slow convergence and unstable asymptotic behaviors. This paper introduces Stratified Sampling from Event Tables (SSET), which partitions an ER buffer into Event Tables, each capturing important subsequences of optimal behavior. We prove a theoretical advantage over the traditional monolithic buffer approach and combine SSET with an existing prioritized sampling strategy to further improve learning speed and stability. Empirical results in challenging MiniGrid domains, benchmark RL environments, and a high-fidelity car racing simulator demonstrate the advantages and versatility of SSET over existing ER buffer sampling

NeurIPS Conference 2023 Conference Paper

f-Policy Gradients: A General Framework for Goal-Conditioned RL using f-Divergences

  • Siddhant Agarwal
  • Ishan Durugkar
  • Peter Stone
  • Amy Zhang

Goal-Conditioned Reinforcement Learning (RL) problems often have access to sparse rewards where the agent receives a reward signal only when it has achieved the goal, making policy optimization a difficult problem. Several works augment this sparse reward with a learned dense reward function, but this can lead to sub-optimal policies if the reward is misaligned. Moreover, recent works have demonstrated that effective shaping rewards for a particular problem can depend on the underlying learning algorithm. This paper introduces a novel way to encourage exploration called $f$-Policy Gradients, or $f$-PG. $f$-PG minimizes the $f$-divergence between the agent's state visitation distribution and the goal, which we show can lead to an optimal policy. We derive gradients for various $f$-divergences to optimize this objective. Our learning paradigm provides dense learning signals for exploration in sparse reward settings. We further introduce an entropy-regularized policy optimization objective, which we call $state$-MaxEnt RL (or $s$-MaxEnt RL), as a special case of our objective. We show that several metric-based shaping rewards like L2 can be used with $s$-MaxEnt RL, providing a common ground to study such metric-based shaping rewards with efficient exploration. We find that $f$-PG has better performance compared to standard policy gradient methods on a challenging gridworld as well as the Point Maze and FetchReach environments. More information is available on our website: https://agarwalsiddhant10.github.io/projects/fpg.html.

NeurIPS Conference 2023 Conference Paper

FAMO: Fast Adaptive Multitask Optimization

  • Bo Liu
  • Yihao Feng
  • Peter Stone
  • Qiang Liu

One of the grand enduring goals of AI is to create generalist agents that can learn multiple different tasks from diverse data via multitask learning (MTL). However, in practice, applying gradient descent (GD) on the average loss across all tasks may yield poor multitask performance due to severe under-optimization of certain tasks. Previous approaches that manipulate task gradients for a more balanced loss decrease require storing and computing all task gradients ($\mathcal{O}(k)$ space and time where $k$ is the number of tasks), limiting their use in large-scale scenarios. In this work, we introduce Fast Adaptive Multitask Optimization (FAMO), a dynamic weighting method that decreases task losses in a balanced way using $\mathcal{O}(1)$ space and time. We conduct an extensive set of experiments covering multi-task supervised and reinforcement learning problems. Our results indicate that FAMO achieves comparable or superior performance to state-of-the-art gradient manipulation techniques while offering significant improvements in space and computational efficiency. Code is available at https://github.com/Cranial-XIX/FAMO.

NeurIPS Conference 2023 Conference Paper

LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

  • Bo Liu
  • Yifeng Zhu
  • Chongkai Gao
  • Yihao Feng
  • Qiang Liu
  • Yuke Zhu
  • Peter Stone

Lifelong learning offers a promising paradigm for building a generalist agent that learns and adapts over its lifespan. Unlike traditional lifelong learning problems in image and text domains, which primarily involve the transfer of declarative knowledge of entities and concepts, lifelong learning in decision-making (LLDM) also necessitates the transfer of procedural knowledge, such as actions and behaviors. To advance research in LLDM, we introduce LIBERO, a novel benchmark of lifelong learning for robot manipulation. Specifically, LIBERO highlights five key research topics in LLDM: 1) how to efficiently transfer declarative knowledge, procedural knowledge, or the mixture of both; 2) how to design effective policy architectures and 3) effective algorithms for LLDM; 4) the robustness of a lifelong learner with respect to task ordering; and 5) the effect of model pretraining for LLDM. We develop an extendible procedural generation pipeline that can in principle generate infinitely many tasks. For benchmarking purposes, we create four task suites (130 tasks in total) that we use to investigate the above-mentioned research topics. To support sample-efficient learning, we provide high-quality human-teleoperated demonstration data for all tasks. Our extensive experiments present several insightful or even unexpected discoveries: sequential finetuning outperforms existing lifelong learning methods in forward transfer, no single visual encoder architecture excels at all types of knowledge transfer, and naive supervised pretraining can hinder agents' performance in the subsequent LLDM.

AAAI Conference 2023 Conference Paper

Metric Residual Network for Sample Efficient Goal-Conditioned Reinforcement Learning

  • Bo Liu
  • Yihao Feng
  • Qiang Liu
  • Peter Stone

Goal-conditioned reinforcement learning (GCRL) has a wide range of potential real-world applications, including manipulation and navigation problems in robotics. Especially in such robotics tasks, sample efficiency is of the utmost importance for GCRL since, by default, the agent is only rewarded when it reaches its goal. While several methods have been proposed to improve the sample efficiency of GCRL, one relatively under-studied approach is the design of neural architectures to support sample efficiency. In this work, we introduce a novel neural architecture for GCRL that achieves significantly better sample efficiency than the commonly-used monolithic network architecture. The key insight is that the optimal action-value function must satisfy the triangle inequality in a specific sense. Furthermore, we introduce the metric residual network (MRN) that deliberately decomposes the action-value function into the negated summation of a metric plus a residual asymmetric component. MRN provably approximates any optimal action-value function, thus making it a fitting neural architecture for GCRL. We conduct comprehensive experiments across 12 standard benchmark environments in GCRL. The empirical results demonstrate that MRN uniformly outperforms other state-of-the-art GCRL neural architectures in terms of sample efficiency. The code is available at https://github.com/Cranial-XIX/metric-residual-network.

AAMAS Conference 2023 Conference Paper

Relaxed Exploration Constrained Reinforcement Learning

  • Shahaf S. Shperberg
  • Bo Liu
  • Peter Stone

This extended abstract introduces a novel setting of reinforcement learning with constraints, called Relaxed Exploration Constrained Reinforcement Learning (RECRL). As in standard constrained reinforcement learning (CRL), the aim is to find a policy that maximizes environmental return subject to a set of constraints. However, in RECRL there is an initial training phase in which the constraints are relaxed, thus the agent can explore the environment more freely. When training is done, the agent is deployed in the environment and is required to fully satisfy all constraints. As an initial approach to RECRL problems, we introduce a curriculum-based approach, named CLiC, that can be applied to existing CRL algorithms to improve their exploration during the training phase while allowing them to gradually converge to a policy that satisfies the full set of constraints. Empirical evaluation shows that CLiC produces policies with a higher return during deployment than policies learned when training is done using only the strict set of constraints.

AIJ Journal 2023 Journal Article

Reward (Mis)design for autonomous driving

  • W. Bradley Knox
  • Alessandro Allievi
  • Holger Banzhaf
  • Felix Schmitt
  • Peter Stone

This article considers the problem of diagnosing certain common errors in reward design. Its insights are also applicable to the design of cost functions and performance metrics more generally. To diagnose common errors, we develop 8 simple sanity checks for identifying flaws in reward functions. We survey research that is published in top-tier venues and focuses on reinforcement learning (RL) for autonomous driving (AD). Specifically, we closely examine the reported reward function in each publication and present these reward functions in a complete and standardized format in the appendix. Wherever we have sufficient information, we apply the 8 sanity checks to each surveyed reward function, revealing near-universal flaws in reward design for AD that might also exist pervasively across reward design for other tasks. Lastly, we explore promising directions that may aid the design of reward functions for AD in subsequent research, following a process of inquiry that can be adapted to other domains.

AAAI Conference 2023 Conference Paper

The Perils of Trial-and-Error Reward Design: Misdesign through Overfitting and Invalid Task Specifications

  • Serena Booth
  • W. Bradley Knox
  • Julie Shah
  • Scott Niekum
  • Peter Stone
  • Alessandro Allievi

In reinforcement learning (RL), a reward function that aligns exactly with a task's true performance metric is often necessarily sparse. For example, a true task metric might encode a reward of 1 upon success and 0 otherwise. The sparsity of these true task metrics can make them hard to learn from, so in practice they are often replaced with alternative dense reward functions. These dense reward functions are typically designed by experts through an ad hoc process of trial and error. In this process, experts manually search for a reward function that improves performance with respect to the task metric while also enabling an RL algorithm to learn faster. This process raises the question of whether the same reward function is optimal for all algorithms, i.e., whether the reward function can be overfit to a particular algorithm. In this paper, we study the consequences of this wide yet unexamined practice of trial-and-error reward design. We first conduct computational experiments that confirm that reward functions can be overfit to learning algorithms and their hyperparameters. We then conduct a controlled observation study which emulates expert practitioners' typical experiences of reward design, in which we similarly find evidence of reward function overfitting. We also find that experts' typical approach to reward design (adopting a myopic strategy and weighing the relative goodness of each state-action pair) leads to misdesign through invalid task specifications, since RL algorithms use cumulative reward rather than rewards for individual state-action pairs as an optimization target. Code, data: github.com/serenabooth/reward-design-perils

NeurIPS Conference 2022 Conference Paper

BOME! Bilevel Optimization Made Easy: A Simple First-Order Approach

  • Bo Liu
  • Mao Ye
  • Stephen Wright
  • Peter Stone
  • Qiang Liu

Bilevel optimization (BO) is useful for solving a variety of important machine learning problems including but not limited to hyperparameter optimization, meta-learning, continual learning, and reinforcement learning. Conventional BO methods need to differentiate through the low-level optimization process with implicit differentiation, which requires expensive calculations related to the Hessian matrix. There has been a recent quest for first-order methods for BO, but the methods proposed to date tend to be complicated and impractical for large-scale deep learning applications. In this work, we propose a simple first-order BO algorithm that depends only on first-order gradient information, requires no implicit differentiation, and is practical and efficient for large-scale non-convex functions in deep learning. We provide non-asymptotic convergence analysis of the proposed method to stationary points for non-convex objectives and present empirical results that show its superior practical performance.

IS Journal 2022 Journal Article

Challenges and Opportunities of Applying Reinforcement Learning to Autonomous Racing

  • Peter R. Wurman
  • Peter Stone
  • Michael Spranger

Simulated motorsports are an exciting environment in which to explore the power and limitations of deep reinforcement learning. Racing requires precise control of a vehicle that is operating at its traction limits while competing wheel-to-wheel with other drivers. We recently demonstrated an agent that can beat the best drivers in the world at the racing game Gran Turismo. In this article, we briefly discuss some of the lessons learned and some of the remaining open research challenges.

IJCAI Conference 2022 Conference Paper

Dynamic Sparse Training for Deep Reinforcement Learning

  • Ghada Sokar
  • Elena Mocanu
  • Decebal Constantin Mocanu
  • Mykola Pechenizkiy
  • Peter Stone

Deep reinforcement learning (DRL) agents are trained through trial-and-error interactions with the environment. This leads to a long training time for dense neural networks to achieve good performance. Hence, prohibitive computation and memory resources are consumed. Recently, learning efficient DRL agents has received increasing attention. Yet, current methods focus on accelerating inference time. In this paper, we introduce for the first time a dynamic sparse training approach for deep reinforcement learning to accelerate the training process. The proposed approach trains a sparse neural network from scratch and dynamically adapts its topology to the changing data distribution during training. Experiments on continuous control tasks show that our dynamic sparse agents achieve higher performance than the equivalent dense methods, reduce the parameter count and floating-point operations (FLOPs) by 50%, and have a faster learning speed that enables reaching the performance of dense agents with 40−50% reduction in the training steps.

NeurIPS Conference 2022 Conference Paper

Value Function Decomposition for Iterative Design of Reinforcement Learning Agents

  • James MacGlashan
  • Evan Archer
  • Alisa Devlic
  • Takuma Seno
  • Craig Sherstan
  • Peter Wurman
  • Peter Stone

Designing reinforcement learning (RL) agents is typically a difficult process that requires numerous design iterations. Learning can fail for a multitude of reasons and standard RL methods provide too few tools to provide insight into the exact cause. In this paper, we show how to integrate value decomposition into a broad class of actor-critic algorithms and use it to assist in the iterative agent-design process. Value decomposition separates a reward function into distinct components and learns value estimates for each. These value estimates provide insight into an agent's learning and decision-making process and enable new training methods to mitigate common problems. As a demonstration, we introduce SAC-D, a variant of soft actor-critic (SAC) adapted for value decomposition. SAC-D maintains similar performance to SAC, while learning a larger set of value predictions. We also introduce decomposition-based tools that exploit this information, including a new reward influence metric, which measures each reward component's effect on agent decision-making. Using these tools, we provide several demonstrations of decomposition's use in identifying and addressing problems in the design of both environments and agents. Value decomposition is broadly applicable and easy to incorporate into existing algorithms and workflows, making it a powerful tool in an RL practitioner's toolbox.

NeurIPS Conference 2021 Conference Paper

Adversarial Intrinsic Motivation for Reinforcement Learning

  • Ishan Durugkar
  • Mauricio Tec
  • Scott Niekum
  • Peter Stone

Learning with an objective to minimize the mismatch with a reference distribution has been shown to be useful for generative modeling and imitation learning. In this paper, we investigate whether one such objective, the Wasserstein-1 distance between a policy's state visitation distribution and a target distribution, can be utilized effectively for reinforcement learning (RL) tasks. Specifically, this paper focuses on goal-conditioned reinforcement learning where the idealized (unachievable) target distribution has full measure at the goal. This paper introduces a quasimetric specific to Markov Decision Processes (MDPs) and uses this quasimetric to estimate the above Wasserstein-1 distance. It further shows that the policy that minimizes this Wasserstein-1 distance is the policy that reaches the goal in as few steps as possible. Our approach, termed Adversarial Intrinsic Motivation (AIM), estimates this Wasserstein-1 distance through its dual objective and uses it to compute a supplemental reward function. Our experiments show that this reward function changes smoothly with respect to transitions in the MDP and directs the agent's exploration to find the goal efficiently. Additionally, we combine AIM with Hindsight Experience Replay (HER) and show that the resulting algorithm accelerates learning significantly on several simulated robotics tasks when compared to other rewards that encourage exploration or accelerate learning.

JAIR Journal 2021 Journal Article

Agent-Based Markov Modeling for Improved COVID-19 Mitigation Policies

  • Roberto Capobianco
  • Varun Kompella
  • James Ault
  • Guni Sharon
  • Stacy Jong
  • Spencer Fox
  • Lauren Meyers
  • Peter R. Wurman

The year 2020 saw the COVID-19 virus lead to one of the worst global pandemics in history. As a result, governments around the world have been faced with the challenge of protecting public health while keeping the economy running to the greatest extent possible. Epidemiological models provide insight into the spread of these types of diseases and predict the effects of possible intervention policies. However, to date, even the most data-driven intervention policies rely on heuristics. In this paper, we study how reinforcement learning (RL) and Bayesian inference can be used to optimize mitigation policies that minimize economic impact without overwhelming hospital capacity. Our main contributions are (1) a novel agent-based pandemic simulator which, unlike traditional models, is able to model fine-grained interactions among people at specific locations in a community; (2) an RL-based methodology for optimizing fine-grained mitigation policies within this simulator; and (3) a Hidden Markov Model for predicting infected individuals based on partial observations regarding test results, presence of symptoms, and past physical contacts. This article is part of the special track on AI and COVID-19.

NeurIPS Conference 2021 Conference Paper

Conflict-Averse Gradient Descent for Multi-task learning

  • Bo Liu
  • Xingchao Liu
  • Xiaojie Jin
  • Peter Stone
  • Qiang Liu

The goal of multi-task learning is to enable more efficient learning than single-task learning by sharing model structures for a diverse set of tasks. A standard multi-task learning objective is to minimize the average loss across all tasks. While straightforward, using this objective often results in much worse final performance for each task than learning them independently. A major challenge in optimizing a multi-task model is conflicting gradients, where gradients of different task objectives are not well aligned so that following the average gradient direction can be detrimental to specific tasks' performance. Previous work has proposed several heuristics to manipulate the task gradients for mitigating this problem. But most of them lack a convergence guarantee and/or could converge to any Pareto-stationary point. In this paper, we introduce Conflict-Averse Gradient descent (CAGrad) which minimizes the average loss function, while leveraging the worst local improvement of individual tasks to regularize the algorithm trajectory. CAGrad balances the objectives automatically and still provably converges to a minimum over the average loss. It includes the regular gradient descent (GD) and the multiple gradient descent algorithm (MGDA) in the multi-objective optimization (MOO) literature as special cases. On a series of challenging multi-task supervised learning and reinforcement learning tasks, CAGrad achieves improved performance over prior state-of-the-art multi-objective gradient manipulation methods.
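The conflicting-gradient problem described above can be made concrete with a small diagnostic. The sketch below is not CAGrad and does not reproduce the paper's update rule; it only checks, for hypothetical task gradients, which task pairs point in opposing directions and which tasks a plain average-gradient descent step would hurt to first order. The function name and the example vectors are assumptions made for illustration.

```python
import numpy as np


def gradient_conflict(task_grads):
    """Return (average gradient, conflicting task pairs, tasks hurt by the average step)."""
    grads = np.asarray(task_grads, dtype=float)
    avg = grads.mean(axis=0)

    # Two tasks conflict when their gradients have a negative inner product.
    conflicts = [
        (i, j)
        for i in range(len(grads))
        for j in range(i + 1, len(grads))
        if grads[i] @ grads[j] < 0
    ]

    # A descent step along -avg changes task i's loss by about -lr * (g_i . avg),
    # so a negative inner product means that task's loss goes up.
    hurt = [i for i, g in enumerate(grads) if g @ avg < 0]
    return avg, conflicts, hurt


# Hypothetical gradients for two tasks sharing two parameters.
avg, conflicts, hurt = gradient_conflict([[1.0, 0.0], [-3.0, 1.0]])
print(avg, conflicts, hurt)
```

In this assumed example the larger gradient of the second task dominates the average, so the averaged step increases the first task's loss, which is the failure mode the abstract attributes to average-loss optimization.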

AAAI Conference 2021 System Paper

Demonstration of the EMPATHIC Framework for Task Learning from Implicit Human Feedback

  • Yuchen Cui
  • Qiping Zhang
  • Sahil Jain
  • Alessandro Allievi
  • Peter Stone
  • Scott Niekum
  • W. Bradley Knox

Reactions such as gestures, facial expressions, and vocalizations are an abundant, naturally occurring channel of information that humans provide during interactions. An agent could leverage an understanding of such implicit human feedback to improve its task performance at no cost to the human. This approach contrasts with common agent teaching methods based on demonstrations, critiques, or other guidance that need to be attentively and intentionally provided. In this work, we demonstrate a novel data-driven framework for learning from implicit human feedback, EMPATHIC. This two-stage method consists of (1) mapping implicit human feedback to relevant task statistics such as rewards, optimality, and advantage; and (2) using such a mapping to learn a task. We instantiate the first stage and three second-stage evaluations of the learned mapping. To do so, we collect a dataset of human facial reactions while participants observe an agent execute a sub-optimal policy for a prescribed training task. We train a deep neural network on this data and demonstrate its ability to (1) infer relative reward ranking of events in the training task from prerecorded human facial reactions; (2) improve the policy of an agent in the training task using live human facial reactions; and (3) transfer to a novel domain in which it evaluates robot manipulation trajectories. In the video, we focus on demonstrating the online learning capability of our instantiation of EMPATHIC.

AAAI Conference 2021 Conference Paper

Expected Value of Communication for Planning in Ad Hoc Teamwork

  • William Macke
  • Reuth Mirsky
  • Peter Stone

A desirable goal for autonomous agents is to be able to coordinate on the fly with previously unknown teammates. Known as “ad hoc teamwork”, enabling such a capability has been receiving increasing attention in the research community. One of the central challenges in ad hoc teamwork is quickly recognizing the current plans of other agents and planning accordingly. In this paper, we focus on the scenario in which teammates can communicate with one another, but only at a cost. Thus, they must carefully balance plan recognition based on observations vs. that based on communication. This paper proposes a new metric, the Expected Divergence Point (EDP), for evaluating how similar two policies that a teammate may be following are. We then present a novel planning algorithm for ad hoc teamwork, determining which query to ask and planning accordingly. We demonstrate the effectiveness of this algorithm in a range of increasingly general communication in ad hoc teamwork problems.

AAAI Conference 2021 Conference Paper

Goal Blending for Responsive Shared Autonomy in a Navigating Vehicle

  • Yu-Sian Jiang
  • Garrett Warnell
  • Peter Stone

Human-robot shared autonomy techniques for vehicle navigation hold promise for reducing a human driver’s workload, ensuring safety, and improving navigation efficiency. However, because typical techniques achieve these improvements by effectively removing human control at critical moments, these approaches often exhibit poor responsiveness to human commands, especially in cluttered environments. In this paper, we propose a novel goal-blending shared autonomy (GBSA) system, which aims to improve responsiveness in shared autonomy systems by blending human and robot input during the selection of local navigation goals as opposed to low-level motor (servo-level) commands. We validate the proposed approach by performing a human study involving an intelligent wheelchair and compare GBSA to a representative servo-level shared control system that uses a policy-blending approach. The results of both quantitative performance analysis and a subjective survey show that GBSA exhibits significantly better system responsiveness and induces higher user satisfaction than the existing approach.

NeurIPS Conference 2021 Conference Paper

Machine versus Human Attention in Deep Reinforcement Learning Tasks

  • Suna (Sihang) Guo
  • Ruohan Zhang
  • Bo Liu
  • Yifeng Zhu
  • Dana Ballard
  • Mary Hayhoe
  • Peter Stone

Deep reinforcement learning (RL) algorithms are powerful tools for solving visuomotor decision tasks. However, the trained models are often difficult to interpret, because they are represented as end-to-end deep neural networks. In this paper, we shed light on the inner workings of such trained models by analyzing the pixels that they attend to during task execution, and comparing them with the pixels attended to by humans executing the same tasks. To this end, we investigate the following two questions that, to the best of our knowledge, have not been previously studied. 1) How similar are the visual representations learned by RL agents and humans when performing the same task? and, 2) How do similarities and differences in these learned representations explain RL agents' performance on these tasks? Specifically, we compare the saliency maps of RL agents against visual attention models of human experts when learning to play Atari games. Further, we analyze how hyperparameters of the deep RL algorithm affect the learned representations and saliency maps of the trained agents. The insights provided have the potential to inform novel algorithms for closing the performance gap between human experts and RL agents.

AAMAS Conference 2021 Conference Paper

Multiagent Epidemiologic Inference through Realtime Contact Tracing

  • Guni Sharon
  • James Ault
  • Peter Stone
  • Varun Kompella
  • Roberto Capobianco

This paper addresses an epidemiologic inference problem where, given realtime observation of test results, presence of symptoms, and physical contacts, the most likely infected individuals need to be inferred. The inference problem is modeled as a hidden Markov model where infection probabilities are updated at every time step and evolve between time steps. We suggest a unique inference approach that avoids storing the given observations explicitly. Theoretical justification for the proposed model is provided under specific simplifying assumptions. To complement these theoretical results, a comprehensive experimental study is performed using a custom-built agent-based simulator that models inter-agent contacts. The reported results show the effectiveness of the proposed inference model when considering more realistic scenarios – where the simplifying assumptions do not hold. When pairing the proposed inference model with a simple testing and quarantine policy, promising trends are obtained where the epidemic progression is significantly slowed down while quarantining a bounded number of individuals.
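The idea of updating infection probabilities at every time step without keeping the raw observations can be sketched as a generic two-state (healthy/infected) recursive filter. This is not the paper's model: the transmission probability, the test sensitivity and specificity, and the function names are all hypothetical values and names chosen for illustration.

```python
def predict(p_infected, p_transmit):
    """Evolve the belief between time steps: a healthy person may become infected."""
    return p_infected + (1.0 - p_infected) * p_transmit


def update(p_infected, test_positive, sensitivity=0.9, specificity=0.95):
    """Bayes update of the infection belief given one test result."""
    if test_positive:
        like_infected, like_healthy = sensitivity, 1.0 - specificity
    else:
        like_infected, like_healthy = 1.0 - sensitivity, specificity
    numerator = like_infected * p_infected
    return numerator / (numerator + like_healthy * (1.0 - p_infected))


# Only the current belief is stored; each observation is folded in and discarded.
belief = 0.01                                 # assumed prior
belief = predict(belief, p_transmit=0.05)     # assumed exposure through a contact
belief = update(belief, test_positive=True)   # a positive test arrives
print(belief)
```

Because each step depends only on the previous belief and the newest observation, memory use stays constant per individual regardless of how long the observation history is.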

JAAMAS Journal 2021 Journal Article

Recent advances in leveraging human guidance for sequential decision-making tasks

  • Ruohan Zhang
  • Faraz Torabi
  • Peter Stone

A longstanding goal of artificial intelligence is to create artificial agents capable of learning to perform tasks that require sequential decision making. Importantly, while it is the artificial agent that learns and acts, it is still up to humans to specify the particular task to be performed. Classical task-specification approaches typically involve humans providing stationary reward functions or explicit demonstrations of the desired tasks. However, there has recently been a great deal of research energy invested in exploring alternative ways in which humans may guide learning agents that may, e.g., be more suitable for certain tasks or require less human effort. This survey provides a high-level overview of five recent machine learning frameworks that primarily rely on human guidance apart from pre-specified reward functions or conventional, step-by-step action demonstrations. We review the motivation, assumptions, and implementation of each framework, and we discuss possible future research directions.

AAMAS Conference 2021 Conference Paper

Scalable Multiagent Driving Policies for Reducing Traffic Congestion

  • Jiaxun Cui
  • William Macke
  • Harel Yedidsion
  • Aastha Goyal
  • Daniel Urieli
  • Peter Stone

Traffic congestion is a major challenge in modern urban settings. The industry-wide development of autonomous and automated vehicles (AVs) motivates the question of how AVs can contribute to congestion reduction. Past research has shown that in small-scale mixed traffic scenarios with both AVs and human-driven vehicles, a small fraction of AVs executing a controlled multiagent driving policy can mitigate congestion. In this paper, we scale up existing approaches and develop new multiagent driving policies for AVs in scenarios with greater complexity. We start by showing that a congestion metric used by past research is manipulable in open road network scenarios where vehicles dynamically join and leave the road. We then propose using a different metric that is robust to manipulation and reflects open network traffic efficiency. Next, we propose a modular transfer reinforcement learning approach, and use it to scale up a multiagent driving policy to outperform human-like traffic and existing approaches in a simulated realistic scenario, which is an order of magnitude larger than past scenarios (hundreds instead of tens of vehicles). Additionally, our modular transfer learning approach saves up to 80% of the training time in our experiments, by focusing its data collection on key locations in the network. Finally, we show for the first time a distributed multiagent policy that improves congestion over human-driven traffic. The distributed approach is more realistic and practical, as it relies solely on existing sensing and actuation capabilities, and does not require adding new communication infrastructure.

AAAI Conference 2021 Conference Paper

Temporal-Logic-Based Reward Shaping for Continuing Reinforcement Learning Tasks

  • Yuqian Jiang
  • Suda Bharadwaj
  • Bo Wu
  • Rishi Shah
  • Ufuk Topcu
  • Peter Stone

In continuing tasks, average-reward reinforcement learning may be a more appropriate problem formulation than the more common discounted reward formulation. As usual, learning an optimal policy in this setting typically requires a large amount of training experiences. Reward shaping is a common approach for incorporating domain knowledge into reinforcement learning in order to speed up convergence to an optimal policy. However, to the best of our knowledge, the theoretical properties of reward shaping have thus far only been established in the discounted setting. This paper presents the first reward shaping framework for average-reward learning and proves that, under standard assumptions, the optimal policy under the original reward function can be recovered. In order to avoid the need for manual construction of the shaping function, we introduce a method for utilizing domain knowledge expressed as a temporal logic formula. The formula is automatically translated to a shaping function that provides additional reward throughout the learning process. We evaluate the proposed method on three continuing tasks. In all cases, shaping speeds up the average-reward learning rate without any reduction in the performance of the learned policy compared to relevant baselines.
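For context, the discounted-setting result mentioned above is usually stated for potential-based shaping, where the shaping term is gamma * phi(s') - phi(s). The sketch below illustrates only that classic discounted form, not the paper's average-reward framework or its temporal-logic translation; the potential values, rewards, and function names are hypothetical.

```python
def shaped_reward(reward, phi_s, phi_next, gamma=0.99):
    """Original reward plus the potential-based shaping term gamma*phi(s') - phi(s)."""
    return reward + gamma * phi_next - phi_s


def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t, computed backwards over the trajectory."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


gamma = 0.9
rewards = [0.0, 0.0, 1.0]           # sparse reward along one trajectory
potentials = [0.0, 1.0, 2.0, 3.0]   # assumed phi(s_0), ..., phi(s_3)

shaped = [
    shaped_reward(r, potentials[t], potentials[t + 1], gamma)
    for t, r in enumerate(rewards)
]

# The shaping terms telescope: the two returns differ only by
# gamma**T * phi(s_T) - phi(s_0), which does not depend on the actions taken in between.
difference = discounted_return(shaped, gamma) - discounted_return(rewards, gamma)
print(difference, gamma ** 3 * potentials[3] - potentials[0])
```

The telescoping difference is why this form of shaping leaves the ranking of policies unchanged in the discounted setting while still giving the learner denser feedback.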

AAMAS Conference 2021 Conference Paper

The Seeing-Eye Robot Grand Challenge: Rethinking Automated Care

  • Reuth Mirsky
  • Peter Stone

Automated care systems are becoming more tangible than ever: recent breakthroughs in robotics and machine learning can be used to address the need for automated care created by the increasing aging population. However, such systems require overcoming several technological, ethical, and social challenges. One inspirational manifestation of these challenges can be observed in the training of seeing-eye dogs for visually impaired people. A seeing-eye dog is not just trained to obey its owner, but also to “intelligently disobey”: if it is given an unsafe command from its handler, it is taught to disobey it or even insist on a different course of action. This paper proposes the challenge of building a seeing-eye robot, as a thought-provoking use-case that helps identify the challenges to be faced when creating behaviors for robot assistants in general. Through this challenge, this paper delineates the prerequisites that an automated care system will need to have in order to perform intelligent disobedience and to serve as a true agent for its handler.

IJCAI Conference 2020 Conference Paper

A Penny for Your Thoughts: The Value of Communication in Ad Hoc Teamwork

  • Reuth Mirsky
  • William Macke
  • Andy Wang
  • Harel Yedidsion
  • Peter Stone

In ad hoc teamwork, multiple agents need to collaborate without having knowledge about their teammates or their plans a priori. A common assumption in this research area is that the agents cannot communicate. However, just as two random people may speak the same language, autonomous teammates may also happen to share a communication protocol. This paper considers how such a shared protocol can be leveraged, introducing a means to reason about Communication in Ad Hoc Teamwork (CAT). The goal of this work is enabling improved ad hoc teamwork by judiciously leveraging the ability of the team to communicate. We situate our study within a novel CAT scenario, involving tasks with multiple steps, where teammates' plans are unveiled over time. In this context, the paper proposes methods to reason about the timing and value of communication and introduces an algorithm for an ad hoc agent to leverage these methods. Finally, we introduce a new multiagent domain, the tool fetching domain, and we study how varying this domain's properties affects the usefulness of communication. Empirical results show the benefits of explicit reasoning about communication content and timing in ad hoc teamwork.

NeurIPS Conference 2020 Conference Paper

An Imitation from Observation Approach to Transfer Learning with Dynamics Mismatch

  • Siddharth Desai
  • Ishan Durugkar
  • Haresh Karnan
  • Garrett Warnell
  • Josiah Hanna
  • Peter Stone

We examine the problem of transferring a policy learned in a source environment to a target environment with different dynamics, particularly in the case where it is critical to reduce the amount of interaction with the target environment during learning. This problem is particularly important in sim-to-real transfer because simulators inevitably model real-world dynamics imperfectly. In this paper, we show that one existing solution to this transfer problem, grounded action transformation, is closely related to the problem of imitation from observation (IfO): learning behaviors that mimic the observations of behavior demonstrations. After establishing this relationship, we hypothesize that recent state-of-the-art approaches from the IfO literature can be effectively repurposed for grounded transfer learning. To validate our hypothesis, we derive a new algorithm, generative adversarial reinforced action transformation (GARAT), based on adversarial imitation from observation techniques. We run experiments in several domains with mismatched dynamics, and find that agents trained with GARAT achieve higher returns in the target environment compared to existing black-box transfer methods.

IJCAI Conference 2020 Conference Paper

Balancing Individual Preferences and Shared Objectives in Multiagent Reinforcement Learning

  • Ishan Durugkar
  • Elad Liebman
  • Peter Stone

In multiagent reinforcement learning scenarios, it is often the case that independent agents must jointly learn to perform a cooperative task. This paper focuses on such a scenario in which agents have individual preferences regarding how to accomplish the shared task. We consider a framework for this setting which balances individual preferences against task rewards using a linear mixing scheme. In our theoretical analysis we establish that agents can reach an equilibrium that leads to optimal shared task reward even when they consider individual preferences which aren't fully aligned with this task. We then empirically show, somewhat counter-intuitively, that there exist mixing schemes that outperform a purely task-oriented baseline. We further consider empirically how to optimize the mixing scheme.

JMLR Journal 2020 Journal Article

Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey

  • Sanmit Narvekar
  • Bei Peng
  • Matteo Leonetti
  • Jivko Sinapov
  • Matthew E. Taylor
  • Peter Stone

Reinforcement learning (RL) is a popular paradigm for addressing sequential decision tasks in which the agent has only limited environmental feedback. Despite many advances over the past three decades, learning in many domains still requires a large amount of interaction with the environment, which can be prohibitively expensive in realistic scenarios. To address this problem, transfer learning has been applied to reinforcement learning such that experience gained in one task can be leveraged when starting to learn the next, harder task. More recently, several lines of research have explored how tasks, or data samples themselves, can be sequenced into a curriculum for the purpose of learning a problem that may otherwise be too difficult to learn from scratch. In this article, we present a framework for curriculum learning (CL) in reinforcement learning, and use it to survey and classify existing CL methods in terms of their assumptions, capabilities, and goals. Finally, we use our framework to find open problems and suggest directions for future RL curriculum learning research.

NeurIPS Conference 2020 Conference Paper

Firefly Neural Architecture Descent: a General Approach for Growing Neural Networks

  • Lemeng Wu
  • Bo Liu
  • Peter Stone
  • Qiang Liu

We propose firefly neural architecture descent, a general framework for progressively and dynamically growing neural networks to jointly optimize the networks' parameters and architectures. Our method works in a steepest descent fashion, which iteratively finds the best network within a functional neighborhood of the original network that includes a diverse set of candidate network structures. By using Taylor approximation, the optimal network structure in the neighborhood can be found with a greedy selection procedure. We show that firefly descent can flexibly grow networks both wider and deeper, and can be applied to learn accurate but resource-efficient neural architectures that avoid catastrophic forgetting in continual learning. Empirically, firefly descent achieves promising results on both neural architecture search and continual learning. In particular, on a challenging continual image classification task, it learns networks that are smaller in size but have higher average accuracy than those learned by the state-of-the-art methods.

JAIR Journal 2020 Journal Article

Jointly Improving Parsing and Perception for Natural Language Commands through Human-Robot Dialog

  • Jesse Thomason
  • Aishwarya Padmakumar
  • Jivko Sinapov
  • Nick Walker
  • Yuqian Jiang
  • Harel Yedidsion
  • Justin Hart
  • Peter Stone

In this work, we present methods for using human-robot dialog to improve language understanding for a mobile robot agent. The agent parses natural language to underlying semantic meanings and uses robotic sensors to create multi-modal models of perceptual concepts like red and heavy. The agent can be used for showing navigation routes, delivering objects to people, and relocating objects from one location to another. We use dialog clarification questions both to understand commands and to generate additional parsing training data. The agent employs opportunistic active learning to select questions about how words relate to objects, improving its understanding of perceptual concepts. We evaluated this agent on Amazon Mechanical Turk. After training on data induced from conversations, the agent reduced the number of dialog questions it asked while receiving higher usability ratings. Additionally, we demonstrated the agent on a robotic platform, where it learned new perceptual concepts on the fly while completing a real-world task.

JAIR Journal 2020 Journal Article

The PETLON Algorithm to Plan Efficiently for Task-Level-Optimal Navigation

  • Shih-Yun Lo
  • Shiqi Zhang
  • Peter Stone

Intelligent mobile robots have recently become able to operate autonomously in large-scale indoor environments for extended periods of time. In this process, mobile robots need the capabilities of both task and motion planning. Task planning in such environments involves sequencing the robot’s high-level goals and subgoals, and typically requires reasoning about the locations of people, rooms, and objects in the environment, and their interactions to achieve a goal. One of the prerequisites for optimal task planning that is often overlooked is having an accurate estimate of the actual distance (or time) a robot needs to navigate from one location to another. State-of-the-art motion planning algorithms, though often computationally complex, are designed exactly for this purpose of finding routes through constrained spaces. In this article, we focus on integrating task and motion planning (TMP) to achieve task-level-optimal planning for robot navigation while maintaining manageable computational efficiency. To this end, we introduce TMP algorithm PETLON (Planning Efficiently for Task-Level-Optimal Navigation), including two configurations with different trade-offs over computational expenses between task and motion planning, for everyday service tasks using a mobile robot. Experiments have been conducted both in simulation and on a mobile robot using object delivery tasks in an indoor office environment. The key observation from the results is that PETLON is more efficient than a baseline approach that pre-computes motion costs of all possible navigation actions, while still producing plans that are optimal at the task level. We provide results with two different task planning paradigms in the implementation of PETLON, and offer TMP practitioners guidelines for the selection of task planners from an engineering perspective.

IJCAI Conference 2019 Conference Paper

Ad Hoc Teamwork With Behavior Switching Agents

  • Manish Ravula
  • Shani Alkoby
  • Peter Stone

As autonomous AI agents proliferate in the real world, they will increasingly need to cooperate with each other to achieve complex goals without always being able to coordinate in advance. This kind of cooperation, in which agents have to learn to cooperate on the fly, is called ad hoc teamwork. Many previous works investigating this setting assumed that teammates behave according to one of many predefined types that is fixed throughout the task. This assumption of stationarity in behaviors is a strong assumption which cannot be guaranteed in many real-world settings. In this work, we relax this assumption and investigate settings in which teammates can change their types during the course of the task. This adds complexity to the planning problem, as an agent now needs to recognize that a change has occurred in addition to figuring out the new type of the teammate it is interacting with. In this paper, we present a novel Convolutional-Neural-Network-based Change Point Detection (CPD) algorithm for ad hoc teamwork. When evaluating our algorithm on the modified predator prey domain, we find that it outperforms existing Bayesian CPD algorithms.

AAMAS Conference 2019 Conference Paper

Adversarial Imitation Learning from State-only Demonstrations

  • Faraz Torabi
  • Garrett Warnell
  • Peter Stone

Imitation from observation (IfO) is the problem of learning directly from state-only demonstrations without having access to the demonstrator’s actions. The lack of action information both distinguishes IfO from most of the literature in imitation learning, and also sets it apart as a method that may enable agents to learn from a large set of previously inapplicable resources such as internet videos. In this paper, we propose a new IfO approach based on generative adversarial networks called generative adversarial imitation from observation (GAIfO). We demonstrate that our approach performs comparably to classical imitation learning approaches (which have access to the demonstrator’s actions) and significantly outperforms existing imitation from observation methods in high-dimensional simulation environments.

JAAMAS Journal 2019 Journal Article

Agents teaching agents: a survey on inter-agent transfer learning

  • Felipe Leno Da Silva
  • Garrett Warnell
  • Peter Stone

Abstract While recent work in reinforcement learning (RL) has led to agents capable of solving increasingly complex tasks, the issue of high sample complexity is still a major concern. This issue has motivated the development of additional techniques that augment RL methods in an attempt to increase task learning speed. In particular, inter-agent teaching—endowing agents with the ability to respond to instructions from others—has been responsible for many of these developments. RL agents that can leverage instruction from a more competent teacher have been shown to be able to learn tasks significantly faster than agents that cannot take advantage of such instruction. That said, the inter-agent teaching paradigm presents many new challenges due to, among other factors, differences between the agents involved in the teaching interaction. As a result, many inter-agent teaching methods work only in restricted settings and have proven difficult to generalize to new domains or scenarios. In this article, we propose two frameworks that provide a comprehensive view of the challenges associated with inter-agent teaching. We highlight state-of-the-art solutions, open problems, prospective applications, and argue that new research in this area should be developed in the context of the proposed frameworks.

AAMAS Conference 2019 Conference Paper

Escape Room: A Configurable Testbed for Hierarchical Reinforcement Learning

  • Jacob Menashe
  • Peter Stone

Recent successes in Reinforcement Learning have encouraged a fast-growing network of RL researchers and a number of breakthroughs in RL research. As the RL community and body of work grows, so does the need for widely applicable benchmarks that can fairly and effectively evaluate a variety of RL algorithms. In this paper we present the Escape Room Domain (ERD), a new flexible, scalable, and fully implemented testing domain for Hierarchical RL that bridges the “moderate complexity” gap left behind by existing alternatives. ERD is open-source and freely available through GitHub, and conforms to widely-used public testing interfaces for simple integration and testing with a variety of public RL agent implementations.

IJCAI Conference 2019 Conference Paper

Imitation Learning from Video by Leveraging Proprioception

  • Faraz Torabi
  • Garrett Warnell
  • Peter Stone

Classically, imitation learning algorithms have been developed for idealized situations, e.g., the demonstrations are often required to be collected in the exact same environment and usually include the demonstrator's actions. Recently, however, the research community has begun to address some of these shortcomings by offering algorithmic solutions that enable imitation learning from observation (IfO), e.g., learning to perform a task from visual demonstrations that may be in a different environment and do not include actions. Motivated by the fact that agents often also have access to their own internal states (i.e., proprioception), we propose and study an IfO algorithm that leverages this information in the policy learning process. The proposed architecture learns policies over proprioceptive state representations and compares the resulting trajectories visually to the demonstration data. We experimentally test the proposed technique on several MuJoCo domains and show that it outperforms other imitation from observation algorithms by a large margin.

RLDM Conference 2019 Conference Abstract

Learning Curriculum Policies for Reinforcement Learning

  • Sanmit Narvekar
  • Peter Stone

Curriculum learning in reinforcement learning is a training methodology that seeks to speed up learning of a difficult target task, by first training on a series of simpler tasks and transferring the knowledge acquired to the target task. Automatically choosing a sequence of such tasks (i.e., a curriculum) is an open problem that has been the subject of much recent work in this area. In this paper, we build upon a recent method for curriculum design, which formulates the curriculum sequencing problem as a Markov Decision Process. We extend this model to handle multiple transfer learning algorithms, and show for the first time that a curriculum policy over this MDP can be learned from experience. We explore various representations that make this possible, and evaluate our approach by learning curriculum policies for multiple agents in two different domains. The results show that our method produces curricula that can train agents to perform on a target task as fast or faster than existing methods.

AAMAS Conference 2019 Conference Paper

Learning Curriculum Policies for Reinforcement Learning

  • Sanmit Narvekar
  • Peter Stone

Curriculum learning in reinforcement learning is a training methodology that seeks to speed up learning of a difficult target task, by first training on a series of simpler tasks and transferring the knowledge acquired to the target task. Automatically choosing a sequence of such tasks (i.e., a curriculum) is an open problem that has been the subject of much recent work in this area. In this paper, we build upon a recent method for curriculum design, which formulates the curriculum sequencing problem as a Markov Decision Process. We extend this model to handle multiple transfer learning algorithms, and show for the first time that a curriculum policy over this MDP can be learned from experience. We explore various representations that make this possible, and evaluate our approach by learning curriculum policies for multiple agents in two different domains. The results show that our method produces curricula that can train agents to perform on a target task as fast or faster than existing methods.

IJCAI Conference 2019 Conference Paper

Leveraging Human Guidance for Deep Reinforcement Learning Tasks

  • Ruohan Zhang
  • Faraz Torabi
  • Lin Guan
  • Dana H. Ballard
  • Peter Stone

Reinforcement learning agents can learn to solve sequential decision tasks by interacting with the environment. Human knowledge of how to solve these tasks can be incorporated using imitation learning, where the agent learns to imitate human demonstrated decisions. However, human guidance is not limited to the demonstrations. Other types of guidance could be more suitable for certain tasks and require less human effort. This survey provides a high-level overview of five recent learning frameworks that primarily rely on human guidance other than conventional, step-by-step action demonstrations. We review the motivation, assumption, and implementation of each framework. We then discuss possible future research directions.

AAMAS Conference 2019 Conference Paper

Marginal Cost Pricing with a Fixed Error Factor in Traffic Networks

  • Guni Sharon
  • Stephen D. Boyles
  • Shani Alkoby
  • Peter Stone

It is well known that charging marginal cost tolls (MCT) from self-interested agents participating in a congestion game leads to optimal system performance, i.e., minimal total latency. However, it is not generally possible to calculate the correct marginal cost tolls precisely, and it is not known what the impact is of charging incorrect tolls. This uncertainty could lead to reluctance to adopt such schemes in practice. This paper studies the impact of charging MCT with some fixed factor error on the system’s performance. We prove that under-estimating MCT results in a system performance that is at least as good as that obtained by not applying tolls at all. This result might encourage adoption of MCT schemes with conservative MCT estimations. Furthermore, we prove that no local extrema can exist in the function mapping the error value, r, to the system’s performance, T(r). This result implies that accurately calibrating MCT for a given network can be done by identifying an extremum in T(r) which, consequently, must be the global optimum. Experimental results from simulating several large-scale, real-life traffic networks are presented and provide further support for our theoretical findings.
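
The effect of a fixed error factor can be illustrated on a toy network. The sketch below is not taken from the paper: it assumes the textbook two-link Pigou network (one unit of flow, link 1 with latency x, link 2 with constant latency 1) and a non-negative error factor r that scales the true marginal cost toll. The function names are hypothetical.

```python
def equilibrium_flow(r):
    """Flow on the congestible link when its toll is r times the true marginal cost toll.

    Assumes r >= 0 and one unit of total flow.
    """
    # Link 1 has latency x, so its marginal cost toll is x * d(x)/dx = x.
    # Link 2 has constant latency 1 and therefore carries no toll.
    # Agents equalize perceived costs: x + r * x = 1, capped at the total flow of 1.
    return min(1.0, 1.0 / (1.0 + r))


def total_latency(r):
    """System performance T(r): total latency summed over both links."""
    x = equilibrium_flow(r)
    return x * x + (1.0 - x) * 1.0


for r in (0.0, 0.5, 1.0, 1.5):
    print(r, round(total_latency(r), 4))
```

In this toy network T(0) = 1 (no tolls) and T(1) = 0.75 (exact tolls), and any under-estimate with 0 < r < 1 lands between the two, which is consistent with the claim in the abstract for this one example only.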

IJCAI Conference 2019 Conference Paper

Recent Advances in Imitation Learning from Observation

  • Faraz Torabi
  • Garrett Warnell
  • Peter Stone

Imitation learning is the process by which one agent tries to learn how to perform a certain task using information generated by another, often more-expert agent performing that same task. Conventionally, the imitator has access to both state and action information generated by an expert performing the task (e.g., the expert may provide a kinesthetic demonstration of object placement using a robotic arm). However, requiring the action information prevents imitation learning from a large number of existing valuable learning resources such as online videos of humans performing tasks. To overcome this issue, the specific problem of imitation from observation (IfO) has recently garnered a great deal of attention, in which the imitator only has access to the state information (e.g., video frames) generated by the expert. In this paper, we provide a literature review of methods developed for IfO, and then point out some open research problems and potential future work.

AAMAS Conference 2019 Conference Paper

Reducing Sampling Error in Policy Gradient Learning

  • Josiah P. Hanna
  • Peter Stone

This paper studies a class of reinforcement learning algorithms known as policy gradient methods. Policy gradient methods optimize the performance of a policy by estimating the gradient of the expected return with respect to the policy parameters. One of the core challenges of applying policy gradient methods is obtaining an accurate estimate of this gradient. Most policy gradient methods rely on Monte Carlo sampling to estimate this gradient. When only a limited number of environment steps can be collected, Monte Carlo policy gradient estimates may suffer from sampling error – samples receive more or less weight than they will in expectation. In this paper, we introduce the Sampling Error Corrected policy gradient estimator that corrects the inaccurate Monte Carlo weights. Our approach treats the observed data as if it were generated by a different policy than the policy that actually generated the data. It then uses importance sampling between the two – in the process correcting the inaccurate Monte Carlo weights. Under a limiting set of assumptions we can show that this gradient estimator will have lower variance than the Monte Carlo gradient estimator. We show experimentally that our approach improves the learning speed of two policy gradient methods compared to standard Monte Carlo sampling even when the theoretical assumptions fail to hold.
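
The reweighting idea can be sketched on a much smaller problem than the one the paper addresses. The example below is an illustration under stated assumptions, not the authors' estimator: it estimates an expected reward (not a gradient) in a one-state problem, and it takes the "different policy" to be the empirical action frequencies in the sample. All names are hypothetical.

```python
from collections import Counter


def monte_carlo_estimate(rewards):
    """Plain Monte Carlo: every sample gets weight 1/n."""
    return sum(rewards) / len(rewards)


def corrected_estimate(policy, actions, rewards):
    """Importance-weight each sample by policy(a) / empirical_frequency(a).

    The empirical frequencies stand in for the policy that appears to have
    generated the data (an assumption of this sketch).
    """
    n = len(actions)
    counts = Counter(actions)
    total = 0.0
    for action, reward in zip(actions, rewards):
        empirical = counts[action] / n
        total += (policy[action] / empirical) * reward
    return total / n


# A uniform policy over two actions; action 0 pays 1, action 1 pays 0.
policy = {0: 0.5, 1: 0.5}
actions = [0, 0, 0, 1]          # action 0 is over-represented in this small sample
rewards = [1.0, 1.0, 1.0, 0.0]

print(monte_carlo_estimate(rewards))
print(corrected_estimate(policy, actions, rewards))
```

Here the true expected reward is 0.5. The plain average is pulled up to 0.75 by the over-sampled action, while the reweighted estimate recovers 0.5 because rewards are deterministic in this toy setup.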

AAAI Conference 2019 Conference Paper

Selecting Compliant Agents for Opt-in Micro-Tolling

  • Josiah P. Hanna
  • Guni Sharon
  • Stephen D. Boyles
  • Peter Stone

This paper examines the impact of tolls on social welfare in the context of a transportation network in which only a portion of the agents are subject to tolls. More specifically, this paper addresses the question: which subset of agents provides the most system benefit if they are compliant with an approximate marginal cost tolling scheme? Since previous work suggests this problem is NP-hard, we examine a heuristic approach. Our experimental results on three real-world traffic scenarios suggest that evaluating the marginal impact of a given agent serves as a particularly strong heuristic for selecting an agent to be compliant. Results from using this heuristic for selecting 7.6% of the agents to be compliant achieved an increase of up to 10.9% in social welfare over not tolling at all. The presented heuristic approach and conclusions can help practitioners target specific agents to participate in an opt-in tolling scheme.

AAMAS Conference 2019 Conference Paper

Teaching Social Behavior through Human Reinforcement for Ad hoc Teamwork - The STAR Framework: Extended Abstract

  • Shani Alkoby
  • Avilash Rath
  • Peter Stone

As AI technology continues to develop, more and more agents will become capable of long term autonomy alongside people. Thus, a recent line of research has studied the problem of teaching autonomous agents the concept of ethics and human social norms. Most existing work considers the case of an individual agent attempting to learn a predefined set of rules. In reality, however, social norms are not always pre-defined and are very difficult to represent algorithmically. Moreover, the basic idea behind the social norms concept is ensuring that one’s actions do not negatively influence others’ utilities, which is inherently a multiagent concept. Thus, here we investigate a way to teach agents, as a team, how to act according to human social norms. In this research, we introduce the STAR framework used to teach an ad hoc team of agents to act in accordance with human social norms. Using a hybrid team (agents and people), when taking an action considered to be socially unacceptable, the agents receive negative feedback from the human teammate(s) who has(have) an awareness of the team’s norms. We view STAR as an important step towards teaching agents to act more consistently with respect to human morality.

RLDM Conference 2019 Conference Abstract

Utilizing Background Music in Person-Agent Interaction

  • Elad Liebman
  • Peter Stone

Numerous studies have demonstrated that mood affects emotional and cognitive processing. Previous work has established that music-induced mood can measurably alter people’s behavior in different contexts. Recent work suggests that this impact also holds in social and cooperative settings. In this study we further establish how background information (and specifically music) can affect people’s decision making in inter-social tasks, and show that this information can be effectively incorporated in an agent’s world representation in order to better predict people’s behavior. For this purpose, we devised an experiment in which people drove a simulated car through an intersection while listening to music. The intersection was not empty, as another simulated vehicle, controlled autonomously, was also crossing the intersection in a different direction. Our results corroborate that music indeed alters people’s behavior with respect to this social task. Furthermore, we show that explicitly modeling this impact is possible, and can lead to improved performance of the autonomous agent.

AAMAS Conference 2018 Conference Paper

A Stitch in Time - Autonomous Model Management via Reinforcement Learning

  • Elad Liebman
  • Eric Zavesky
  • Peter Stone

Concept drift - a change, either sudden or gradual, in the underlying properties of data - is one of the most prevalent challenges to maintaining high-performing learned models over time in autonomous systems. In the face of concept drift, one can hope that the old model is sufficiently representative of the new data despite the concept drift, one can discard the old data and retrain a new model with (often limited) new data, or one can use transfer learning methods to combine the old data with the new to create an updated model. Which of these three options is chosen affects not only near-term decisions, but also future needs to transfer or retrain. In this paper, we thus model response to concept drift as a sequential decision making problem and formally frame it as a Markov Decision Process. Our reinforcement learning approach to the problem shows promising results on one synthetic and two real-world datasets.

AAAI Conference 2018 Short Paper

Adversarial Goal Generation for Intrinsic Motivation

  • Ishan Durugkar
  • Peter Stone

Generally in Reinforcement Learning the goal, or reward signal, is given by the environment and cannot be controlled by the agent. We propose to introduce an intrinsic motivation module that will select a reward function for the agent to learn to achieve. We will use a Universal Value Function Approximator (Schaul et al. 2015), that takes as input both the state and the parameters of this reward function as the goal to predict the value function (or action-value function) to generalize across these goals. This module will be trained to generate goals such that the agent’s learning is maximized. Thus, this is also a method for automatic curriculum learning.

AIJ Journal 2018 Journal Article

Autonomous agents modelling other agents: A comprehensive survey and open problems

  • Stefano V. Albrecht
  • Peter Stone

Much research in artificial intelligence is concerned with the development of autonomous agents that can interact effectively with other agents. An important aspect of such agents is the ability to reason about the behaviours of other agents, by constructing models which make predictions about various properties of interest (such as actions, goals, beliefs) of the modelled agents. A variety of modelling approaches now exist which vary widely in their methodology and underlying assumptions, catering to the needs of the different sub-communities within which they were developed and reflecting the different practical uses for which they are intended. The purpose of the present article is to provide a comprehensive survey of the salient modelling methods which can be found in the literature. The article concludes with a discussion of open problems which may form the basis for fruitful future research.

IJCAI Conference 2018 Conference Paper

Behavioral Cloning from Observation

  • Faraz Torabi
  • Garrett Warnell
  • Peter Stone

Humans often learn how to perform tasks via imitation: they observe others perform a task, and then very quickly infer the appropriate actions to take based on their observations. While extending this paradigm to autonomous agents is a well-studied problem in general, there are two particular aspects that have largely been overlooked: (1) that the learning is done from observation only (i.e., without explicit action information), and (2) that the learning is typically done very quickly. In this work, we propose a two-phase, autonomous imitation learning technique called behavioral cloning from observation (BCO), that aims to provide improved performance with respect to both of these aspects. First, we allow the agent to acquire experience in a self-supervised fashion. This experience is used to develop a model which is then utilized to learn a particular task by observing an expert perform that task without the knowledge of the specific actions taken. We experimentally compare BCO to imitation learning methods, including the state-of-the-art, generative adversarial imitation learning (GAIL) technique, and we show comparable task performance in several different simulation domains while exhibiting increased learning speed after expert trajectories become available.
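
The two phases described in this abstract can be sketched in a tabular toy setting. This is a minimal illustration, not the paper's implementation: the 1-D chain environment, the count-based inverse dynamics model, and the majority-vote cloning step are all assumptions made for the example, and every name is hypothetical.

```python
import random
from collections import Counter, defaultdict


def step(state, action):
    """Hypothetical deterministic chain: action is -1 or +1, states clipped to [0, 5]."""
    return max(0, min(5, state + action))


def learn_inverse_dynamics(num_samples=500, seed=0):
    """Phase 1: explore randomly and record which action explains each transition."""
    rng = random.Random(seed)
    counts = defaultdict(Counter)
    for _ in range(num_samples):
        state = rng.randint(0, 5)
        action = rng.choice([-1, 1])
        counts[(state, step(state, action))][action] += 1
    # Keep the most frequent action for every observed (state, next_state) pair.
    return {pair: c.most_common(1)[0][0] for pair, c in counts.items()}


def clone_from_observation(inverse_model, demo_states):
    """Phase 2: label a state-only demonstration with inferred actions, then clone it."""
    votes = defaultdict(Counter)
    for state, next_state in zip(demo_states, demo_states[1:]):
        inferred = inverse_model.get((state, next_state))
        if inferred is not None:
            votes[state][inferred] += 1
    return {state: c.most_common(1)[0][0] for state, c in votes.items()}


inverse_model = learn_inverse_dynamics()
# The expert walks right along the chain; only its states are observed.
policy = clone_from_observation(inverse_model, demo_states=[0, 1, 2, 3, 4, 5])
print(policy)
```

The cloned policy maps each demonstrated state to the action the inverse model inferred, without the demonstration ever containing an action label.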

AAAI Conference 2018 Conference Paper

Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces

  • Garrett Warnell
  • Nicholas Waytowich
  • Vernon Lawhern
  • Peter Stone

While recent advances in deep reinforcement learning have allowed autonomous learning agents to succeed at a variety of complex tasks, existing algorithms generally require a lot of training data. One way to increase the speed at which agents are able to learn to perform tasks is by leveraging the input of human trainers. Although such input can take many forms, real-time, scalar-valued feedback is especially useful in situations where it proves difficult or impossible for humans to provide expert demonstrations. Previous approaches have shown the usefulness of human input provided in this fashion (e.g., the TAMER framework), but they have thus far not considered high-dimensional state spaces or employed the use of deep learning. In this paper, we do both: we propose Deep TAMER, an extension of the TAMER framework that leverages the representational power of deep neural networks in order to learn complex tasks in just a short amount of time with a human trainer. We demonstrate Deep TAMER’s success by using it and just 15 minutes of human-provided feedback to train an agent that performs better than humans on the Atari game of BOWLING - a task that has proven difficult for even state-of-the-art reinforcement learning methods.

AAAI Conference 2018 Conference Paper

DyETC: Dynamic Electronic Toll Collection for Traffic Congestion Alleviation

  • Haipeng Chen
  • Bo An
  • Guni Sharon
  • Josiah Hanna
  • Peter Stone
  • Chunyan Miao
  • Yeng Soh

To alleviate traffic congestion in urban areas, electronic toll collection (ETC) systems are deployed all over the world. Despite the merits, tolls are usually pre-determined and fixed from day to day, which fail to consider traffic dynamics and thus have limited regulation effect when traffic conditions are abnormal. In this paper, we propose a novel dynamic ETC (DyETC) scheme which adjusts tolls to traffic conditions in real time. The DyETC problem is formulated as a Markov decision process (MDP), the solution of which is very challenging due to its 1) multi-dimensional state space, 2) multi-dimensional, continuous and bounded action space, and 3) time-dependent state and action values. Due to the complexity of the formulated MDP, existing methods cannot be applied to our problem. Therefore, we develop a novel algorithm, PG-β, which makes three improvements to traditional policy gradient method by proposing 1) time-dependent value and policy functions, 2) Beta distribution policy function and 3) state abstraction. Experimental results show that, compared with existing ETC schemes, DyETC increases traffic volume by around 8%, and reduces travel time by around 14.6% during rush hour. Considering the total traffic volume in a traffic network, this contributes to a substantial increase to social welfare.

AAAI Conference 2018 Conference Paper

Guiding Exploratory Behaviors for Multi-Modal Grounding of Linguistic Descriptions

  • Jesse Thomason
  • Jivko Sinapov
  • Raymond Mooney
  • Peter Stone

A major goal of grounded language learning research is to enable robots to connect language predicates to a robot’s physical interactive perception of the world. Coupling object exploratory behaviors such as grasping, lifting, and looking with multiple sensory modalities (e.g., audio, haptics, and vision) enables a robot to ground non-visual words like “heavy” as well as visual words like “red”. A major limitation of existing approaches to multi-modal language grounding is that a robot has to exhaustively explore training objects with a variety of actions when learning a new such language predicate. This paper proposes a method for guiding a robot’s behavioral exploration policy when learning a novel predicate based on known grounded predicates and the novel predicate’s linguistic relationship to them. We demonstrate our approach on two datasets in which a robot explored large sets of objects and was tasked with learning to recognize whether novel words applied to those objects.

AAMAS Conference 2018 Conference Paper

Link-based Parameterized Micro-tolling Scheme for Optimal Traffic Management

  • Hamid Mirzaei
  • Guni Sharon
  • Stephen Boyles
  • Tony Givargis
  • Peter Stone

In the micro-tolling paradigm, different toll values are assigned to different links within a congestible traffic network. Self-interested agents then select minimal cost routes, where cost is a function of the travel time and tolls paid. A centralized system manager sets toll values with the objective of inducing a user equilibrium that maximizes the total utility over all agents. A recently proposed algorithm for computing such tolls, denoted ∆-tolling, was shown to yield up to 32% reduction in total travel time in simulated traffic scenarios compared to when there are no tolls. ∆-tolling includes two global parameters: β which is a proportionality parameter, and R which influences the rate of change of toll values across all links. This paper introduces a generalization of ∆-tolling which accounts for different β and R values on each link in the network. While this enhanced ∆-tolling algorithm requires setting significantly more parameters, we show that they can be tuned effectively via policy gradient reinforcement learning. Experimental results from several traffic scenarios indicate that Enhanced ∆-tolling reduces total travel time by up to 28% compared to the original ∆-tolling algorithm, and by up to 45% compared to not tolling.
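
The role of per-link β and R values can be sketched as follows. The abstract does not state the update rule, so this example assumes an exponentially smoothed form, toll ← R · β · delay + (1 − R) · toll, where delay is the link's current travel time minus its free-flow travel time. That form, the function name, and the sample numbers are assumptions of this sketch.

```python
def update_tolls(tolls, delays, beta, rate):
    """Apply one smoothed toll update per link, with link-specific parameters.

    tolls:  current toll per link
    delays: current travel time minus free-flow travel time per link
    beta:   per-link proportionality parameter
    rate:   per-link R in [0, 1]; larger values react faster to new delays
    """
    return {
        link: rate[link] * beta[link] * delays[link] + (1.0 - rate[link]) * tolls[link]
        for link in tolls
    }


# Two hypothetical links with different parameters.
tolls = {"a": 0.0, "b": 2.0}
delays = {"a": 10.0, "b": 0.0}
beta = {"a": 4.0, "b": 1.0}
rate = {"a": 0.5, "b": 0.25}

new_tolls = update_tolls(tolls, delays, beta, rate)
print(new_tolls)
```

Setting every entry of `beta` and `rate` to the same value recovers a scheme with two global parameters; the per-link dictionaries are what a learning method would tune in the generalized setting.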

IJCAI Conference 2018 Conference Paper

Multi-modal Predicate Identification using Dynamically Learned Robot Controllers

  • Saeid Amiri
  • Suhua Wei
  • Shiqi Zhang
  • Jivko Sinapov
  • Jesse Thomason
  • Peter Stone

Intelligent robots frequently need to explore the objects in their working environments. Modern sensors have enabled robots to learn object properties via perception of multiple modalities. However, object exploration in the real world poses a challenging trade-off between information gains and exploration action costs. Mixed observability Markov decision process (MOMDP) is a framework for planning under uncertainty, while accounting for both fully and partially observable components of the state. Robot perception frequently has to face such mixed observability. This work enables a robot equipped with an arm to dynamically construct query-oriented MOMDPs for multi-modal predicate identification (MPI) of objects. The robot's behavioral policy is learned from two datasets collected using real robots. Our approach enables a robot to explore object properties in a way that is significantly faster while improving accuracies in comparison to existing methods that rely on hand-coded exploration strategies.

AIJ Journal 2018 Journal Article

Overlapping layered learning

  • Patrick MacAlpine
  • Peter Stone

Layered learning is a hierarchical machine learning paradigm that enables learning of complex behaviors by incrementally learning a series of sub-behaviors. A key feature of layered learning is that higher layers directly depend on the learned lower layers. In its original formulation, lower layers were frozen prior to learning higher layers. This article considers a major extension to the paradigm that allows learning certain behaviors independently, and then later stitching them together by learning at the “seams” where their influences overlap. The UT Austin Villa 2014 RoboCup 3D simulation team, using such overlapping layered learning, learned a total of 19 layered behaviors for a simulated soccer-playing robot, organized both in series and in parallel. To the best of our knowledge this is more than three times the number of layered behaviors in any prior layered learning system. Furthermore, the complete learning process is repeated on four additional robot body types, showcasing its generality as a paradigm for efficient behavior learning. The resulting team won the RoboCup 2014 championship with an undefeated record, scoring 52 goals and conceding none. This article includes a detailed experimental analysis of the team's performance and the overlapping layered learning approach that led to its success.

AAMAS Conference 2018 Conference Paper

PETLON: Planning Efficiently for Task-Level-Optimal Navigation

  • Shih-Yun Lo
  • Shiqi Zhang
  • Peter Stone

Intelligent mobile robots have recently become able to operate autonomously in large-scale indoor environments for extended periods of time. Task planning in such environments involves sequencing the robot’s high-level goals and subgoals, and typically requires reasoning about the locations of people, rooms, and objects in the environment, and their interactions to achieve a goal. One of the prerequisites for optimal task planning that is often overlooked is having an accurate estimate of the actual distance (or time) a robot needs to navigate from one location to another. State-of-the-art motion planners, though often computationally complex, are designed exactly for this purpose of finding routes through constrained spaces. In this work, we focus on integrating task and motion planning (TMP) to achieve task-level optimal planning for robot navigation while maintaining manageable computational efficiency. To this end, we introduce TMP algorithm PETLON (Planning Efficiently for Task-Level-Optimal Navigation) for everyday service tasks using a mobile robot. PETLON is more efficient than planning approaches that pre-compute motion costs of all possible navigation actions, while still producing plans that are optimal at the task level.

AAAI Conference 2018 Conference Paper

Traffic Optimization for a Mixture of Self-Interested and Compliant Agents

  • Guni Sharon
  • Michael Albert
  • Tarun Rambha
  • Stephen Boyles
  • Peter Stone

This paper focuses on two commonly used path assignment policies for agents traversing a congested network: self-interested routing, and system-optimum routing. In the self-interested routing policy each agent selects a path that optimizes its own utility, while the system-optimum routing agents are assigned paths with the goal of maximizing system performance. This paper considers a scenario where a centralized network manager wishes to optimize utilities over all agents, i.e., implement a system-optimum routing policy. In many real-life scenarios, however, the system manager is unable to influence the route assignment of all agents due to limited influence on route choice decisions. Motivated by such scenarios, a computationally tractable method is presented that computes the minimal number of agents that the system manager needs to influence (compliant agents) in order to achieve system optimal performance. Moreover, this methodology can also determine whether a given set of compliant agents is sufficient to achieve system optimum and compute the optimal route assignment for the compliant agents to do so. Experimental results are presented showing that in several large-scale, realistic traffic networks optimal flow can be achieved with as low as 13% of the agents being compliant and up to 54%.

AAMAS Conference 2017 Conference Paper

Agent Behaviors for Joining and Leaving a Flock

  • Katie Genter
  • Peter Stone

Each individual bird in a flock of birds updates its behavior based on the behaviors of its neighbors. Previous work has considered how a small set of algorithmically controlled influencing agents, or robot birds, can influence the flock to behave in a particular way — such as to avoid airports or wind farms. These robot birds are assumed to be seen by the flock as ordinary birds, and hence are able to influence their neighbors. However, we are aware of no previous work that has considered the issues related to robot birds joining and leaving flocks of natural birds. Due to the influence the robot birds have on the flock as soon as members of the flock become neighbors, joining and leaving are not straightforward. In this abstract, we discuss simple approaches for robot birds to use when joining and leaving flocks of natural birds.

AAAI Conference 2017 Conference Paper

Automated Design of Robust Mechanisms

  • Michael Albert
  • Vincent Conitzer
  • Peter Stone

We introduce a new class of mechanisms, robust mechanisms, that is an intermediary between ex-post mechanisms and Bayesian mechanisms. This new class of mechanisms allows the mechanism designer to incorporate imprecise estimates of the distribution over bidder valuations in a way that provides strong guarantees that the mechanism will perform at least as well as ex-post mechanisms, while in many cases performing better. We further extend this class to mechanisms that are with high probability incentive compatible and individually rational, ε-robust mechanisms. Using techniques from automated mechanism design and robust optimization, we provide an algorithm polynomial in the number of bidder types to design robust and ε-robust mechanisms. We show experimentally that this new class of mechanisms can significantly outperform traditional mechanism design techniques when the mechanism designer has an estimate of the distribution and the bidder’s valuation is correlated with an externally verifiable signal.

AAAI Conference 2017 Conference Paper

Automatic Curriculum Graph Generation for Reinforcement Learning Agents

  • Maxwell Svetlik
  • Matteo Leonetti
  • Jivko Sinapov
  • Rishi Shah
  • Nick Walker
  • Peter Stone

In recent years, research has shown that transfer learning methods can be leveraged to construct curricula that sequence a series of simpler tasks such that performance on a final target task is improved. A major limitation of existing approaches is that such curricula are handcrafted by humans that are typically domain experts. To address this limitation, we introduce a method to generate a curriculum based on task descriptors and a novel metric of transfer potential. Our method automatically generates a curriculum as a directed acyclic graph (as opposed to a linear sequence as done in existing work). Experiments in both discrete and continuous domains show that our method produces curricula that improve the agent’s learning performance when compared to the baseline condition of learning on the target task from scratch.

RLDM Conference 2017 Conference Abstract

Autonomous Task Sequencing for Customized Curriculum Design in Reinforcement Learning

  • Sanmit Narvekar
  • Jivko Sinapov
  • Peter Stone

Transfer learning is a method where an agent reuses knowledge learned in a source task to improve learning on a target task. Recent work has shown that transfer learning can be extended to the idea of curriculum learning, where the agent incrementally accumulates knowledge over a sequence of tasks (i.e., a curriculum). In most existing work, such curricula have been constructed manually. Furthermore, they are fixed ahead of time, and do not adapt to the progress or abilities of the agent. In this paper, we formulate the design of a curriculum as a Markov Decision Process, which directly models the accumulation of knowledge as an agent interacts with tasks, and propose a method that approximates an execution of an optimal policy in this MDP to produce an agent-specific curriculum. We use our approach to automatically sequence tasks for 3 agents with varying sensing and action capabilities in an experimental domain, and show that our method produces curricula customized for each agent that improve performance relative to learning from scratch or using a different agent’s curriculum. This paper was accepted to IJCAI 2017. Upon publication, the full version will be available at: http://www.cs.utexas.edu/users/pstone/Papers/bib2html-links/IJCAI17-Narvekar.pdf

IJCAI Conference 2017 Conference Paper

Autonomous Task Sequencing for Customized Curriculum Design in Reinforcement Learning

  • Sanmit Narvekar
  • Jivko Sinapov
  • Peter Stone

Transfer learning is a method where an agent reuses knowledge learned in a source task to improve learning on a target task. Recent work has shown that transfer learning can be extended to the idea of curriculum learning, where the agent incrementally accumulates knowledge over a sequence of tasks (i.e., a curriculum). In most existing work, such curricula have been constructed manually. Furthermore, they are fixed ahead of time, and do not adapt to the progress or abilities of the agent. In this paper, we formulate the design of a curriculum as a Markov Decision Process, which directly models the accumulation of knowledge as an agent interacts with tasks, and propose a method that approximates an execution of an optimal policy in this MDP to produce an agent-specific curriculum. We use our approach to automatically sequence tasks for 3 agents with varying sensing and action capabilities in an experimental domain, and show that our method produces curricula customized for each agent that improve performance relative to learning from scratch or using a different agent's curriculum.

AAMAS Conference 2017 Conference Paper

Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation

  • Josiah P. Hanna
  • Peter Stone
  • Scott Niekum

For an autonomous agent, executing a poor policy may be costly or even dangerous. For such agents, it is desirable to determine confidence interval lower bounds on the performance of any given policy without executing said policy. Current methods for exact high confidence off-policy evaluation that use importance sampling require a substantial amount of data to achieve a tight lower bound. Existing model-based methods only address the problem in discrete state spaces. Since exact bounds are intractable for many domains, we trade off strict guarantees of safety for more data-efficient approximate bounds. In this context, we propose two bootstrapping off-policy evaluation methods which use learned MDP transition models in order to estimate lower confidence bounds on policy performance with limited data in both continuous and discrete state spaces. Since direct use of a model may introduce bias, we derive a theoretical upper bound on model bias for when the model transition function is estimated with i.i.d. trajectories. This bound broadens our understanding of the conditions under which model-based methods have high bias. Finally, we empirically evaluate our proposed methods and analyze the settings in which different bootstrapping off-policy confidence interval methods succeed and fail.

AAAI Conference 2017 Short Paper

Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation

  • Josiah Hanna
  • Peter Stone
  • Scott Niekum

In many reinforcement learning applications, it is desirable to determine confidence interval lower bounds on the performance of any given policy without executing said policy. In this context, we propose two bootstrapping off-policy evaluation methods which use learned MDP transition models in order to estimate lower confidence bounds on policy performance with limited data. We empirically evaluate the proposed methods on standard policy evaluation tasks.

AAAI Conference 2017 Conference Paper

Dynamically Constructed (PO)MDPs for Adaptive Robot Planning

  • Shiqi Zhang
  • Piyush Khandelwal
  • Peter Stone

To operate in human-robot coexisting environments, intelligent robots need to simultaneously reason with commonsense knowledge and plan under uncertainty. Markov decision processes (MDPs) and partially observable MDPs (POMDPs) are good at planning under uncertainty toward maximizing long-term rewards; P-LOG, a declarative programming language under Answer Set semantics, is strong in commonsense reasoning. In this paper, we present a novel algorithm called iCORPP to dynamically reason about and construct (PO)MDPs using P-LOG. iCORPP successfully shields exogenous domain attributes from (PO)MDPs, which limits computational complexity and enables (PO)MDPs to adapt to the value changes these attributes produce. We conduct a number of experimental trials using two example problems in simulation and demonstrate iCORPP on a real robot. Results show significant improvements compared to competitive baselines.

AAAI Conference 2017 Conference Paper

Grounded Action Transformation for Robot Learning in Simulation

  • Josiah Hanna
  • Peter Stone

Robot learning in simulation is a promising alternative to the prohibitive sample cost of learning in the physical world. Unfortunately, policies learned in simulation often perform worse than hand-coded policies when applied on the physical robot. Grounded simulation learning (GSL) promises to address this issue by altering the simulator to better match the real world. This paper proposes a new algorithm for GSL – Grounded Action Transformation – and applies it to learning of humanoid bipedal locomotion. Our approach results in a 43.27% improvement in forward walk velocity compared to a state-of-the-art hand-coded walk. We further evaluate our methodology in controlled experiments using a second, higher-fidelity simulator in place of the real world. Our results contribute to a deeper understanding of grounded simulation learning and demonstrate its effectiveness for learning robot control policies.

AAAI Conference 2017 Short Paper

Grounded Action Transformation for Robot Learning in Simulation

  • Josiah Hanna
  • Peter Stone

Robot learning in simulation is a promising alternative to the prohibitive sample cost of learning in the physical world. Unfortunately, policies learned in simulation often perform worse than hand-coded policies when applied on the physical robot. This paper proposes a new algorithm for learning in simulation – Grounded Action Transformation – and applies it to learning of humanoid bipedal locomotion. Our approach results in a 43.27% improvement in forward walk velocity compared to a state-of-the-art hand-coded walk.

RLDM Conference 2017 Conference Abstract

Grounded Semantic Networks for Learning Shared Communication Protocols

  • Matthew Hausknecht
  • Peter Stone

Cooperative multiagent learning poses the challenge of coordinating independent agents. A powerful method to achieve coordination is allowing agents to communicate. We present the Grounded Semantic Network, an approach for learning a task-dependent communication protocol grounded in the observation space and reward function of the task. We show that the grounded semantic network effectively learns a communication protocol that is useful for achieving cooperation between agents. Analyzing the messages transmitted between agents reveals that the agents’ policies are highly influenced by the communication received from teammates. Further analysis highlights the limitations of the grounded semantic network, identifying the characteristics of domains that it can and cannot solve.

RLDM Conference 2017 Conference Abstract

Hierarchical State Abstraction Synthesis for Discrete Models of Continuous Domains

  • Jacob Menashe
  • Peter Stone

Reinforcement Learning (RL) is a paradigm for enabling autonomous learning wherein rewards are used to influence an agent’s action choices in various states. As the number of states and actions available to an agent increases, it becomes increasingly difficult for the agent to quickly learn the optimal action for any given state. One approach to mitigating the detrimental effects of large state spaces is to represent collections of states together as encompassing “abstract states”. State abstraction itself leads to a host of new challenges for an agent. One such challenge is that of automatically identifying new abstractions that balance generality and specificity; the agent must identify both the key similarities and differences between states that are relevant to the agent’s goals, while ignoring unnecessary details from the environment. We call this problem of identifying abstract states the Abstraction Synthesis Problem. In this work we propose the Recursive Cluster-based Abstraction Synthesis Technique (RCAST), a new method for abstraction synthesis. We provide the algorithmic details of RCAST and its subroutines, and compare the general properties of RCAST with those of alternative abstraction synthesis algorithms. Finally, we show that RCAST enables RL agents to quickly and accurately identify helpful abstractions in a variety of RL domains with minimal need for expert configuration.

AIJ Journal 2017 Journal Article

Intrinsically motivated model learning for developing curious robots

  • Todd Hester
  • Peter Stone

Reinforcement Learning (RL) agents are typically deployed to learn a specific, concrete task based on a pre-defined reward function. However, in some cases an agent may be able to gain experience in the domain prior to being given a task. In such cases, intrinsic motivation can be used to enable the agent to learn a useful model of the environment that is likely to help it learn its eventual tasks more efficiently. This paradigm fits robots particularly well, as they need to learn about their own dynamics and affordances which can be applied to many different tasks. This article presents the texplore with Variance-And-Novelty-Intrinsic-Rewards algorithm (texplore-vanir), an intrinsically motivated model-based RL algorithm. The algorithm learns models of the transition dynamics of a domain using random forests. It calculates two different intrinsic motivations from this model: one to explore where the model is uncertain, and one to acquire novel experiences that the model has not yet been trained on. This article presents experiments demonstrating that the combination of these two intrinsic rewards enables the algorithm to learn an accurate model of a domain with no external rewards and that the learned model can be used afterward to perform tasks in the domain. While learning the model, the agent explores the domain in a developing and curious way, progressively learning more complex skills. In addition, the experiments show that combining the agent's intrinsic rewards with external task rewards enables the agent to learn faster than using external rewards alone. We also present results demonstrating the applicability of this approach to learning on robots.

AIJ Journal 2017 Journal Article

Making friends on the fly: Cooperating with new teammates

  • Samuel Barrett
  • Avi Rosenfeld
  • Sarit Kraus
  • Peter Stone

Robots are being deployed in an increasing variety of environments for longer periods of time. As the number of robots grows, they will increasingly need to interact with other robots. Additionally, the number of companies and research laboratories producing these robots is increasing, leading to the situation where these robots may not share a common communication or coordination protocol. While standards for coordination and communication may be created, we expect that robots will need to additionally reason intelligently about their teammates with limited information. This problem motivates the area of ad hoc teamwork in which an agent may potentially cooperate with a variety of teammates in order to achieve a shared goal. This article focuses on a limited version of the ad hoc teamwork problem in which an agent knows the environmental dynamics and has had past experiences with other teammates, though these experiences may not be representative of the current teammates. To tackle this problem, this article introduces a new general-purpose algorithm, PLASTIC, that reuses knowledge learned from previous teammates or provided by experts to quickly adapt to new teammates. This algorithm is instantiated in two forms: 1) PLASTIC-Model – which builds models of previous teammates' behaviors and plans behaviors online using these models and 2) PLASTIC-Policy – which learns policies for cooperating with previous teammates and selects among these policies online. We evaluate PLASTIC on two benchmark tasks: the pursuit domain and robot soccer in the RoboCup 2D simulation domain. Recognizing that a key requirement of ad hoc teamwork is adaptability to previously unseen agents, the tests use more than 40 previously unknown teams on the first task and 7 previously unknown teams on the second. 
While PLASTIC assumes that there is some degree of similarity between the current and past teammates' behaviors, no steps are taken in the experimental setup to make sure this assumption holds. The teammates were created by a variety of independent developers and were not designed to share any similarities. Nonetheless, the results show that PLASTIC was able to identify and exploit similarities between its current and past teammates' behaviors, allowing it to quickly adapt to new teammates.

AAMAS Conference 2017 Conference Paper

Mechanism Design with Unknown Correlated Distributions: Can We Learn Optimal Mechanisms?

  • Michael Albert
  • Vincent Conitzer
  • Peter Stone

Due to Cremer and McLean (1985), it is well known that in a setting where bidders’ values are correlated, an auction designer can extract the full social surplus as revenue. However, this result strongly relies on the assumption of a common prior distribution between the mechanism designer and the bidders. A natural question to ask is, can a mechanism designer distinguish between a set of possible distributions, or failing that, use a finite number of samples from the true distribution to learn enough about the distribution to recover the Cremer and McLean result? We show that if a bidder’s distribution is one of a countably infinite sequence of potential distributions that converges to an independent private values distribution, then there is no mechanism that can guarantee revenue more than ε greater than the optimal mechanism over the independent private value mechanism, even with sampling from the true distribution. We also show that any mechanism over this infinite sequence can guarantee at most a (|Θ| + 1)/(2 + ε) approximation, where |Θ| is the number of bidder types, to the revenue achievable by a mechanism where the designer knows the bidder’s distribution. Finally, as a positive result, we show that for any distribution where full surplus extraction as revenue is possible, a mechanism exists that guarantees revenue arbitrarily close to full surplus for sufficiently close distributions. Intuitively, our results suggest that a high degree of correlation will be essential in the effective application of correlated mechanism design techniques to settings with uncertain distributions.

AAMAS Conference 2017 Conference Paper

Multi-Robot Human Guidance: Human Experiments and Multiple Concurrent Requests

  • Piyush Khandelwal
  • Peter Stone

In the multi-robot human guidance problem, a centralized controller makes use of multiple robots to provide navigational assistance to a human in order to reach a goal location. Previous work used Markov Decision Processes (MDPs) to construct a formalization for this problem [13], and evaluated this framework in an abstract setting only, i.e., without experiments using high-fidelity simulators or real humans. Additionally, it was unable to handle multiple concurrent requests and did not consider buildings with multiple floors. The main contribution of this paper is the introduction of an extended MDP framework for the multi-robot human guidance problem, and its application using a realistic 3D simulation environment and a real multi-robot system. The MDP formulation presented in this paper includes support for planning for multiple guidance requests concurrently as well as requests that require a human to traverse multiple floors. We evaluate this system using real humans controlling simulated avatars, and provide a video demonstration of the system implemented on real robots.

AAMAS Conference 2017 Conference Paper

Multirobot Symbolic Planning under Temporal Uncertainty

  • Shiqi Zhang
  • Yuqian Jiang
  • Guni Sharon
  • Peter Stone

Multirobot symbolic planning (MSP) aims at computing plans, each in the form of a sequence of actions, for a team of robots to achieve their individual goals while minimizing overall cost. Solving MSP problems requires modeling limited domain resources (e.g., corridors that allow at most one robot at a time) and the possibility of action synergy (e.g., multiple robots going through a door after a single door-opening action). However, the temporal uncertainty that propagates over actions, such as delays caused by obstacles in navigation actions, makes it challenging to plan for resource sharing and realizing synergy in a team of robots. This paper, for the first time, introduces the problem of MSP under temporal uncertainty (MSPTU). We present a novel, iterative inter-dependent planning (IIDP) algorithm, including two configurations (simple and enhanced), for solving general MSPTU problems. We then focus on multirobot navigation tasks, presenting a full instantiation of IIDP that includes a new algorithm for computing conditional plan cost under temporal uncertainty and a novel shifted-Poisson distribution for accumulating temporal uncertainty over actions. The algorithms have been implemented both in simulation and on real robots. We observed a significant reduction in overall cost compared to baselines in which robots do not communicate or model temporal uncertainty.

IS Journal 2017 Journal Article

Multirobot Systems

  • Tsz-Chiu Au
  • Bikramjit Banerjee
  • Prithviraj Dasgupta
  • Peter Stone

The guest editors describe the six articles appearing in this special issue on multirobot systems.

AAMAS Conference 2017 Conference Paper

Real-time Adaptive Tolling Scheme for Optimized Social Welfare in Traffic Networks

  • Guni Sharon
  • Josiah P. Hanna
  • Tarun Rambha
  • Michael W. Levin
  • Michael Albert
  • Stephen D. Boyles
  • Peter Stone

Connected and autonomous vehicle technology has advanced rapidly in recent years. These technologies create possibilities for advanced AI-based traffic management techniques. Developing such techniques is an important challenge and opportunity for the AI community as it requires synergy between experts in game theory, multiagent systems, behavioral science, and flow optimization. This paper takes a step in this direction by considering traffic flow optimization through setting and broadcasting of dynamic and adaptive tolls. Previous tolling schemes either were not adaptive in real-time, were not scalable to large networks, or did not optimize traffic flow over an entire network. Moreover, previous schemes made strong assumptions on observable demands, road capacities and user homogeneity. This paper introduces ∆-tolling, a novel tolling scheme that is adaptive in real-time and able to scale to large networks. We provide theoretical evidence showing that under certain assumptions ∆-tolling is equal to Marginal-Cost Tolling, which provably leads to the system optimum, and empirical evidence showing that ∆-tolling increases social welfare (by up to 33%) in two traffic simulators with markedly different modeling assumptions.

AAMAS Conference 2017 Conference Paper

Reasoning about Hypothetical Agent Behaviours and their Parameters

  • Stefano V. Albrecht
  • Peter Stone

Agents can achieve effective interaction with previously unknown other agents by maintaining beliefs over a set of hypothetical behaviours, or types, that these agents may have. A current limitation in this method is that it does not recognise parameters within type specifications, because types are viewed as black-box mappings from interaction histories to probability distributions over actions. In this work, we propose a general method which allows an agent to reason about both the relative likelihood of types and the values of any bounded continuous parameters within types. The method maintains individual parameter estimates for each type and selectively updates the estimates for some types after each observation. We propose different methods for the selection of types and the estimation of parameter values. The proposed methods are evaluated in detailed experiments, showing that updating the parameter estimates of a single type after each observation can be sufficient to achieve good performance.

AAMAS Conference 2017 Conference Paper

Three Years of the RoboCup Standard Platform League Drop-In Player Competition: Creating and Maintaining a Large Scale Ad Hoc Teamwork Robotics Competition (JAAMAS Extended Abstract)

  • Katie Genter
  • Tim Laue
  • Peter Stone

The Standard Platform League is one of the main competitions at the annual RoboCup world championships. In this competition, teams of five humanoid robots play soccer against each other. In 2013, the league began a new competition which serves as a testbed for cooperation without pre-coordination: the Drop-in Player Competition. Instead of homogeneous robot teams that are each programmed by the same people and hence implicitly pre-coordinated, this competition features ad hoc teams, i.e., teams that consist of robots originating from different RoboCup teams and as such running different software. In the article advertised by this extended abstract, we provide an overview of this competition, including its motivation, rules, and how these rules have changed across three iterations of the competition. We also present and analyze the strategies utilized by various drop-in players as well as the results of the first three competitions. The article concludes by suggesting improvements for future competitive evaluations of ad hoc teamwork. To the best of our knowledge, the three Drop-in Player Competitions described in the article are the largest annual ad hoc teamwork robotic experiment to date. Across three years, the competition saw 56 entries from 30 different organizations and consisted of 510 minutes of game time that resulted in approximately 85 robot hours.

AIJ Journal 2016 Journal Article

A synthesis of automated planning and reinforcement learning for efficient, robust decision-making

  • Matteo Leonetti
  • Luca Iocchi
  • Peter Stone

Automated planning and reinforcement learning are characterized by complementary views on decision making: the former relies on previous knowledge and computation, while the latter on interaction with the world, and experience. Planning allows robots to carry out different tasks in the same domain, without the need to acquire knowledge about each one of them, but relies strongly on the accuracy of the model. Reinforcement learning, on the other hand, does not require previous knowledge, and allows robots to robustly adapt to the environment, but often necessitates an infeasible amount of experience. We present Domain Approximation for Reinforcement LearnING (DARLING), a method that takes advantage of planning to constrain the behavior of the agent to reasonable choices, and of reinforcement learning to adapt to the environment, and increase the reliability of the decision making process. We demonstrate the effectiveness of the proposed method on a service robot, carrying out a variety of tasks in an office building. We find that when the robot makes decisions by planning alone on a given model it often fails, and when it makes decisions by reinforcement learning alone it often cannot complete its tasks in a reasonable amount of time. When employing DARLING, even when seeded with the same model that was used for planning alone, however, the robot can quickly learn a behavior to carry out all the tasks, improves over time, and adapts to the environment as it changes.

AAMAS Conference 2016 Conference Paper

Adding Influencing Agents to a Flock

  • Katie Genter
  • Peter Stone

Many different animals, including birds and fish, exhibit a collective behavior known as flocking. Flocking behavior is believed by biologists to emerge from relatively simple local control rules utilized by each individual in a flock. Specifically, each individual adjusts its behavior based on the behaviors of its closest neighbors. In our work we consider the possibility of adding a small set of influencing agents, which are under our control, to a flock. Specifically, we advance existing work on adding influencing agents into a flock and begin to consider the case in which influencing agents must join a flock in motion. Following ad hoc teamwork methodology, we assume that we are given knowledge of, but no direct control over, the rest of the flock. As such, we use the influencing agents to alter the flock’s behavior — for example by encouraging all of the individuals to face the same direction or by altering the trajectory of the flock. In this paper we define several new methods for adding influencing agents into the flock and compare them against existing methods.

AAMAS Conference 2016 Conference Paper

An MDP-Based Winning Approach to Autonomous Power Trading: Formalization and Empirical Analysis

  • Daniel Urieli
  • Peter Stone

With the efforts of moving to sustainable and reliable energy supply, electricity markets are undergoing far-reaching changes. Due to the high cost of failure in the real world, it is important to test new market structures in simulation. This is the focus of the Power Trading Agent Competition (Power TAC), which proposes autonomous electricity broker agents as a means for stabilizing the electricity grid. This paper focuses on the question: how should an autonomous electricity broker agent act in competitive electricity markets to maximize its profit? We formalize the electricity trading problem as a continuous, high-dimensional Markov Decision Process (MDP), which is computationally intractable to solve. Our formalization provides a guideline for approximating the MDP’s solution, and for extending existing solutions. We show that a previous champion broker can be viewed as approximating the solution using a lookahead policy. We present TacTex’15, which improves upon this previous approximation and achieves state-of-the-art performance in competitions and controlled experiments. Using thousands of experiments against 2015 finalist brokers, we analyze TacTex’15’s performance and the reasons for its success. We find that lookahead policies can be effective, but their performance can be sensitive to errors in the transition function prediction, specifically demand-prediction.

AAAI Conference 2016 Conference Paper

Autonomous Electricity Trading Using Time-of-Use Tariffs in a Competitive Market

  • Daniel Urieli
  • Peter Stone

This paper studies the impact of Time-Of-Use (TOU) tariffs in a competitive electricity marketplace. Specifically, it focuses on the question of how an autonomous broker agent should optimize TOU tariffs in a competitive retail market, and what the impact of such tariffs is on the economy. We formalize the problem of TOU tariff optimization and propose an algorithm for approximating its solution. We extensively experiment with our algorithm in a large-scale, detailed electricity retail markets simulation of the Power Trading Agent Competition (Power TAC) and: 1) find that our algorithm results in 15% peak-demand reduction, 2) find that its peak-flattening results in greater profit and/or profit-share for the broker and allows it to win against the 1st and 2nd place brokers from the Power TAC 2014 finals, and 3) analyze several economic implications of using TOU tariffs in competitive retail markets.

AAMAS Conference 2016 Conference Paper

Autonomous Learning Agents: Layered Learning and Ad Hoc Teamwork

  • Peter Stone

In order to achieve long-term autonomy in the real world, fully autonomous agents need to be able to learn, both to improve their behaviors in a complex, dynamically changing world, and to enable interaction with previously unfamiliar agents. This talk begins by presenting layered learning, a hierarchical machine learning paradigm that enables learning of complex behaviors by incrementally learning a series of sub-behaviors. Layered learning was the key deciding factor in UT Austin Villa’s recent RoboCup 3D simulation league championship. The talk then introduces ad hoc teamwork as an emerging multiagent learning challenge. Ad hoc teamwork is based on the premise that as autonomous agents become capable of long-term autonomy, they will increasingly need to band together for cooperative activities with previously unfamiliar teammates. In such “ad hoc” team settings, team strategies cannot be developed a priori. Rather, an agent must learn to cooperate with new teammates on the fly. This talk reports on both theoretical and empirical ad hoc teamwork results, including from recent “pick up” RoboCup robot soccer competitions.

IJCAI Conference 2016 Conference Paper

Learning Multi-Modal Grounded Linguistic Semantics by Playing "I Spy"

  • Jesse Thomason
  • Jivko Sinapov
  • Maxwell Svetlik
  • Peter Stone
  • Raymond J. Mooney

Grounded language learning bridges words like red and square with robot perception. The vast majority of existing work in this space limits robot perception to vision. In this paper, we build perceptual models that use haptic, auditory, and proprioceptive data acquired through robot exploratory behaviors to go beyond vision. Our system learns to ground natural language words describing objects using supervision from an interactive human-robot I Spy game. In this game, the human and robot take turns describing one object among several, then trying to guess which object the other has described. All supervision labels were gathered from human participants physically present to play this game with a robot. We demonstrate that our multi-modal system for grounding natural language outperforms a traditional, vision-only grounding framework by comparing the two on the "I Spy" task. We also provide a qualitative analysis of the groundings learned in the game, visualizing what words are understood better with multi-modal sensory information as well as identifying learned word meanings that correlate with physical object properties (e.g., "small" negatively correlates with object weight).

IJCAI Conference 2016 Conference Paper

Learning to Order Objects Using Haptic and Proprioceptive Exploratory Behaviors

  • Jivko Sinapov
  • Priyanka Khante
  • Maxwell Svetlik
  • Peter Stone

This paper proposes a novel framework that enables a robot to learn ordinal object relations. While most related work focuses on classifying objects into discrete categories, such approaches cannot learn object properties (e.g., weight, height, size) that are context-specific and relative to other objects. To address this problem, we propose that a robot should learn to order objects based on ordinal object relations. In our experiments, the robot explored a set of 32 objects that can be ordered by three properties: height, weight, and width. Next, the robot used unsupervised learning to discover multiple ways that the objects can be ordered based on the haptic and proprioceptive perceptions detected while exploring the objects. The robot's model was then presented with labeled object series, allowing it to ground the three ordinal relations in terms of how similar they are to the orders discovered during the unsupervised stage. Finally, the grounded models were used to recognize whether new object series were ordered by any of the three properties as well as to correctly insert additional objects into an existing series.

IJCAI Conference 2016 Conference Paper

Robot Scavenger Hunt: A Standardized Framework for Evaluating Intelligent Mobile Robots

  • Shiqi Zhang
  • Dongcai Lu
  • Xiaoping Chen
  • Peter Stone

In recent years, many different types of intelligent mobile robots have been developed in research and industrial labs. Although there are significant differences in both hardware and software across these robots, many of them share a common set of AI capabilities, e.g., planning, learning, vision, and natural language processing. At the same time, almost all of them are equipped with traditional robotic capabilities such as mapping, localization, and navigation. However, to date it has been difficult to compare and contrast their capabilities in any controlled way. The main goal of the Robot Scavenger Hunt is to provide a standardized framework that includes a set of standardized tasks for evaluating the AI and robotic capabilities of medium-sized intelligent mobile robots. Compared to existing benchmarks, e.g., RoboCup@Home, Robot Scavenger Hunt aims at evaluations in larger spaces (multi-floor buildings vs. rooms) over longer periods of time (hours vs. minutes) while interacting with real human residents.

AAMAS Conference 2016 Conference Paper

Source Task Creation for Curriculum Learning

  • Sanmit Narvekar
  • Jivko Sinapov
  • Matteo Leonetti
  • Peter Stone

Transfer learning in reinforcement learning has been an active area of research over the past decade. In transfer learning, training on a source task is leveraged to speed up or otherwise improve learning on a target task. This paper presents the more ambitious problem of curriculum learning in reinforcement learning, in which the goal is to design a sequence of source tasks for an agent to train on, such that final performance or learning speed is improved. We take the position that each stage of such a curriculum should be tailored to the current ability of the agent in order to promote learning new behaviors. Thus, as a first step towards creating a curriculum, the trainer must be able to create novel, agent-specific source tasks. We explore how such a space of useful tasks can be created using a parameterized model of the domain and observed trajectories on the target task. We experimentally show that these methods can be used to form components of a curriculum and that such a curriculum can be used successfully for transfer learning in 2 challenging multiagent reinforcement learning domains.

JAAMAS Journal 2016 Journal Article

Special issue on multiagent interaction without prior coordination: guest editorial

  • Stefano V. Albrecht
  • Somchaya Liemhetcharat
  • Peter Stone

Abstract This special issue of the Journal of Autonomous Agents and Multi-Agent Systems sought research articles on the emerging topic of multiagent interaction without prior coordination. Topics of interest included empirical and theoretical investigations of issues arising from assumptions of prior coordination, as well as solutions in the form of novel models and algorithms for effective multiagent interaction without prior coordination.

JAAMAS Journal 2016 Journal Article

Three years of the RoboCup standard platform league drop-in player competition

  • Katie Genter
  • Tim Laue
  • Peter Stone

Abstract The Standard Platform League is one of the main competitions at the annual RoboCup world championships. In this competition, teams of five humanoid robots play soccer against each other. In 2013, the league began a new competition which serves as a testbed for cooperation without pre-coordination: the Drop-in Player Competition. Instead of homogeneous robot teams that are each programmed by the same people and hence implicitly pre-coordinated, this competition features ad hoc teams, i.e., teams that consist of robots originating from different RoboCup teams and as such running different software. In this article, we provide an overview of this competition, including its motivation, rules, and how these rules have changed across three iterations of the competition. We then present and analyze the strategies utilized by various drop-in players as well as the results of the first three competitions before suggesting improvements for future competitive evaluations of ad hoc teamwork. To the best of our knowledge, these three competitions are the largest annual ad hoc teamwork robotic experiment to date. Across three years, the competition has seen 56 entries from 30 different organizations and consisted of 510 min of game time that resulted in approximately 85 robot hours.

IS Journal 2016 Journal Article

UT Austin Villa: Project-Driven Research in AI and Robotics

  • Katie Genter
  • Patrick MacAlpine
  • Jacob Menashe
  • Josiah Hanna
  • Elad Liebman
  • Sanmit Narvekar
  • Ruohan Zhang
  • Peter Stone

UT Austin Villa is a robot soccer team that has competed in the annual RoboCup soccer competitions since 2003. The team has won several championships and has inspired research contributions spanning many topics in robotics and artificial intelligence. This article summarizes some of these research contributions and provides a snapshot into the current development status of the team. Educational uses of the team's code bases are also presented.

AAAI Conference 2015 Conference Paper

Cooperating with Unknown Teammates in Complex Domains: A Robot Soccer Case Study of Ad Hoc Teamwork

  • Samuel Barrett
  • Peter Stone

Many scenarios require that robots work together as a team in order to effectively accomplish their tasks. However, pre-coordinating these teams may not always be possible given the growing number of companies and research labs creating these robots. Therefore, it is desirable for robots to be able to reason about ad hoc teamwork and adapt to new teammates on the fly. Past research on ad hoc teamwork has focused on relatively simple domains, but this paper demonstrates that agents can reason about ad hoc teamwork in complex scenarios. To handle these complex scenarios, we introduce a new algorithm, PLASTIC–Policy, that builds on an existing ad hoc teamwork approach. Specifically, PLASTIC–Policy learns policies to cooperate with past teammates and reuses these policies to quickly adapt to new teammates. This approach is tested in the 2D simulation soccer league of RoboCup using the half field offense task.

AAAI Conference 2015 Conference Paper

CORPP: Commonsense Reasoning and Probabilistic Planning, as Applied to Dialog with a Mobile Robot

  • Shiqi Zhang
  • Peter Stone

In order to be fully robust and responsive to a dynamically changing real-world environment, intelligent robots will need to engage in a variety of simultaneous reasoning modalities. In particular, in this paper we consider their needs to i) reason with commonsense knowledge, ii) model their nondeterministic action outcomes and partial observability, and iii) plan toward maximizing long-term rewards. On one hand, Answer Set Programming (ASP) is good at representing and reasoning with commonsense and default knowledge, but is ill-equipped to plan under probabilistic uncertainty. On the other hand, Partially Observable Markov Decision Processes (POMDPs) are strong at planning under uncertainty toward maximizing long-term rewards, but are not designed to incorporate commonsense knowledge and inference. This paper introduces the CORPP algorithm, which combines P-log, a probabilistic extension of ASP, with POMDPs to integrate commonsense reasoning with planning under uncertainty. Our approach is fully implemented and tested on a shopping request identification problem both in simulation and on a real robot. Compared with existing approaches using P-log or POMDPs individually, we observe significant improvements in both efficiency and accuracy.

RLDM Conference 2015 Conference Abstract

Decision Mechanisms Underlying Mood-Congruent Emotional Classification

  • Elad Liebman
  • Peter Stone
  • Corey White

Numerous studies have demonstrated that an individual’s mood can affect their emotional processing. The goal of the present study was to use a sequential sampling model of simple decisions, the drift-diffusion model (DDM), to explore which components of the decision process underlie mood-congruent bias in emotional decision making. DDM assumes that decisions are made by a noisy process that accumulates information over time from a starting point toward one of two response criteria, or boundaries. This model can be fitted to response times and choice probabilities to determine whether classification bias reflects a change in the emotional evaluation of the stimuli, or rather a change in a priori bias for one response over the other. In our experiment, participants decided whether words were emotionally positive or negative while listening to music that was chosen to induce positive or negative mood. The behavioral results show that the music manipulation was effective, as participants were biased to label words positive in the positive music condition. The drift-diffusion model shows that this bias was driven by a change in the starting point of evidence accumulation, which indicates an a priori response bias. In contrast, there was no evidence that music affected how participants evaluated the emotional content of the stimuli, which would have been reflected by a change in the drift rates. This result has implications for future studies of emotional classification and mood, which we discuss.
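
The distinction the abstract draws between a starting-point shift and a drift-rate change can be illustrated with a small simulation. The following is only a minimal sketch of a generic two-boundary drift-diffusion process, not the authors' fitting procedure; the function names (`simulate_ddm`, `upper_rate`) and all parameter values are assumptions made for illustration.

```python
import random


def simulate_ddm(drift, start, boundary=1.0, noise=1.0, dt=0.001,
                 rng=None, max_steps=100_000):
    """Simulate one decision: evidence starts at `start` (between 0 and
    `boundary`) and accumulates noisily until it crosses either bound."""
    rng = rng or random.Random()
    x = start
    sqrt_dt = dt ** 0.5
    for step in range(1, max_steps + 1):
        # Euler-Maruyama step: deterministic drift plus Gaussian noise.
        x += drift * dt + noise * sqrt_dt * rng.gauss(0.0, 1.0)
        if x >= boundary:
            return "upper", step * dt   # e.g. a "positive" response
        if x <= 0.0:
            return "lower", step * dt   # e.g. a "negative" response
    return "none", max_steps * dt       # no boundary reached in time


def upper_rate(drift, start, trials=500, seed=0):
    """Fraction of simulated trials that end at the upper boundary."""
    rng = random.Random(seed)
    hits = sum(simulate_ddm(drift, start, rng=rng)[0] == "upper"
               for _ in range(trials))
    return hits / trials
```

With zero drift, moving the starting point toward the upper boundary raises `upper_rate` even though the evidence itself is unchanged, which is the kind of a priori response bias the abstract attributes to the music manipulation. A nonzero `drift` would instead correspond to a change in how the stimulus is evaluated.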

AIJ Journal 2015 Journal Article

Framing reinforcement learning from human reward: Reward positivity, temporal discounting, episodicity, and performance

  • W. Bradley Knox
  • Peter Stone

Several studies have demonstrated that reward from a human trainer can be a powerful feedback signal for control-learning algorithms. However, the space of algorithms for learning from such human reward has hitherto not been explored systematically. Using model-based reinforcement learning from human reward, this article investigates the problem of learning from human reward through six experiments, focusing on the relationships between reward positivity, which is how generally positive a trainer's reward values are; temporal discounting, the extent to which future reward is discounted in value; episodicity, whether task learning occurs in discrete learning episodes instead of one continuing session; and task performance, the agent's performance on the task the trainer intends to teach. This investigation is motivated by the observation that an agent can pursue different learning objectives, leading to different resulting behaviors. We search for learning objectives that lead the agent to behave as the trainer intends. We identify and empirically support a “positive circuits” problem with low discounting (i.e., high discount factors) for episodic, goal-based tasks that arises from an observed bias among humans towards giving positive reward, resulting in an endorsement of myopic learning for such domains. We then show that converting simple episodic tasks to be non-episodic (i.e., continuing) reduces and in some cases resolves issues present in episodic tasks with generally positive reward and, relatedly, enables highly successful learning with non-myopic valuation in multiple user studies. The primary learning algorithm introduced in this article, which we call “vi-tamer”, is the first algorithm to successfully learn non-myopically from reward generated by a human trainer; we also empirically show that such non-myopic valuation facilitates higher-level understanding of the task.
Anticipating the complexity of real-world problems, we perform further studies (one with a failure state added) that compare (1) learning when states are updated asynchronously with local bias (i.e., states quickly reachable from the agent's current state are updated more often than other states) to (2) learning with the fully synchronous sweeps across each state in the vi-tamer algorithm. With these locally biased updates, we find that the general positivity of human reward creates problems even for continuing tasks, revealing a distinct research challenge for future work.

IJCAI Conference 2015 Conference Paper

Learning to Interpret Natural Language Commands through Human-Robot Dialog

  • Jesse Thomason
  • Shiqi Zhang
  • Raymond J Mooney
  • Peter Stone

Intelligent robots frequently need to understand requests from naive users through natural language. Previous approaches either cannot account for language variation, e.g., keyword search, or require gathering large annotated corpora, which can be expensive and cannot adapt to new variation. We introduce a dialog agent for mobile robots that understands human instructions through semantic parsing, actively resolves ambiguities using a dialog manager, and incrementally learns from human-robot conversations by inducing training data from user paraphrases. Our dialog agent is implemented and tested both on a web interface with hundreds of users via Mechanical Turk and on a mobile robot over several days, tasked with understanding navigation and delivery requests through natural language in an office environment. In both contexts, we observe significant improvements in user satisfaction after learning from conversations.

AAAI Conference 2015 Conference Paper

Placing Influencing Agents in a Flock

  • Katie Genter
  • Peter Stone

Flocking is an emergent behavior exhibited by many different animal species, including birds and fish. In our work we consider adding a small set of influencing agents, which are under our control, into a flock. Following ad hoc teamwork methodology, we assume that we are given knowledge of, but no direct control over, the rest of the flock. In our ongoing work highlighted in this abstract, we are specifically considering the problem of where to initially place influencing agents that we add to such a flock. We use these influencing agents to influence the flock to behave in a particular way, for example, to fly in a particular orientation or to fly in a particular pattern so as to avoid an obstacle.

RLDM Conference 2015 Conference Abstract

Practical RL: Representation, Interaction, Synthesis, and Mortality (PRISM)

  • Peter Stone

When scaling up Reinforcement Learning (RL) to large continuous domains with imperfect representations and hierarchical structure, we often try applying algorithms that are proven to converge in small finite domains, and then just hope for the best. This talk will advocate instead designing algorithms that adhere to the constraints, and indeed take advantage of the opportunities, that might come with the problem at hand. Drawing on several different research threads within the Learning Agents Research Group at UT Austin, I will touch on four types of issues that arise from these constraints and opportunities: 1) Representation - choosing the algorithm for the problem’s representation and adapting the representation to fit the algorithm; 2) Interaction - with other agents and with human trainers; 3) Synthesis - of different algorithms for the same problem and of different concepts in the same algorithm; and 4) Mortality - dealing with the constraint that when the environment is large relative to the number of action opportunities available, one cannot explore exhaustively. Within this context, I will focus on two specific RL approaches, namely the TEXPLORE algorithm for real-time sample-efficient reinforcement learning for robots; and layered learning, a hierarchical machine learning paradigm that enables learning of complex behaviors by incrementally learning a series of sub-behaviors. TEXPLORE has been implemented and tested on a full-size fully autonomous robot car, and layered learning was the key deciding factor in our RoboCup 2014 3D simulation league championship.

AAAI Conference 2015 Conference Paper

SCRAM: Scalable Collision-avoiding Role Assignment with Minimal-Makespan for Formational Positioning

  • Patrick MacAlpine
  • Eric Price
  • Peter Stone

Teams of mobile robots often need to divide up subtasks efficiently. In spatial domains, a key criterion for doing so may depend on distances between robots and the subtasks’ locations. This paper considers a specific such criterion, namely how to assign interchangeable robots, represented as point masses, to a set of target goal locations within an open two-dimensional space such that the makespan (time for all robots to reach their target locations) is minimized while also preventing collisions among robots. We present scalable (computable in polynomial time) role assignment algorithms that we classify as being SCRAM (Scalable Collision-avoiding Role Assignment with Minimal-makespan). SCRAM role assignment algorithms use a graph theoretic approach to map agents to target goal locations such that our objectives for both minimizing the makespan and avoiding agent collisions are met. A system using SCRAM role assignment was originally designed to allow for decentralized coordination among physically realistic simulated humanoid soccer-playing robots in the partially observable, non-deterministic, noisy, dynamic, and limited communication setting of the RoboCup 3D simulation league. In its current form, SCRAM role assignment generalizes well to many realistic and real-world multiagent systems, and scales to thousands of agents.
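
To make the makespan objective in this abstract concrete, here is a minimal brute-force sketch. It is not one of the SCRAM algorithms: it enumerates every assignment (factorial time, so only usable for a handful of robots), ignores collision avoidance entirely, and assumes all robots move at the same unit speed so that makespan equals the longest assigned distance. The function name `min_makespan_assignment` is hypothetical.

```python
from itertools import permutations
from math import dist


def min_makespan_assignment(robots, targets):
    """Return (assignment, makespan), where assignment[i] is the index of the
    target given to robot i and makespan is the longest robot-to-target
    distance under that assignment. Assumes len(robots) == len(targets)."""
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(len(targets))):
        # Makespan of this assignment: the slowest robot determines it.
        cost = max(dist(robots[i], targets[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm), best_cost
```

For example, with robots at `(0, 0)` and `(4, 0)` and targets at `(5, 0)` and `(1, 0)`, the sketch sends each robot to its nearer target, since the alternative would force one robot to travel five units.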

AAAI Conference 2015 Conference Paper

UT Austin Villa 2014: RoboCup 3D Simulation League Champion via Overlapping Layered Learning

  • Patrick MacAlpine
  • Mike Depinet
  • Peter Stone

Layered learning is a hierarchical machine learning paradigm that enables learning of complex behaviors by incrementally learning a series of sub-behaviors. A key feature of layered learning is that higher layers directly depend on the learned lower layers. In its original formulation, lower layers were frozen prior to learning higher layers. This paper considers an extension to the paradigm that allows learning certain behaviors independently, and then later stitching them together by learning at the “seams” where their influences overlap. The UT Austin Villa 2014 RoboCup 3D simulation team, using such overlapping layered learning, learned a total of 19 layered behaviors for a simulated soccer-playing robot, organized both in series and in parallel. To the best of our knowledge this is more than three times the number of layered behaviors in any prior layered learning system. Furthermore, the complete learning process is repeated on four different robot body types, showcasing its generality as a paradigm for efficient behavior learning. The resulting team won the RoboCup 2014 championship with an undefeated record, scoring 52 goals and conceding none. This paper includes a detailed experimental analysis of the team’s performance and the overlapping layered learning approach that led to its success.

IJCAI Conference 2015 Conference Paper

When Security Games Go Green: Designing Defender Strategies to Prevent Poaching and Illegal Fishing

  • Fei Fang
  • Peter Stone
  • Milind Tambe

Building on the successful applications of Stackelberg Security Games (SSGs) to protect infrastructure, researchers have begun focusing on applying game theory to green security domains such as protection of endangered animals and fish stocks. Previous efforts in these domains optimize defender strategies based on the standard Stackelberg assumption that the adversaries become fully aware of the defender’s strategy before taking action. Unfortunately, this assumption is inappropriate since adversaries in green security domains often lack the resources to fully track the defender strategy. This paper (i) introduces Green Security Games (GSGs), a novel game model for green security domains with a generalized Stackelberg assumption; (ii) provides algorithms to plan effective sequential defender strategies — such planning was absent in previous work; (iii) proposes a novel approach to learn adversary models that further improves defender performance; and (iv) provides detailed experimental analysis of proposed approaches.

AAAI Conference 2014 Conference Paper

TacTex’13: A Champion Adaptive Power Trading Agent

  • Daniel Urieli
  • Peter Stone

Sustainable energy systems of the future will no longer be able to rely on the current paradigm that energy supply follows demand. Many of the renewable energy resources do not produce power on demand, and therefore there is a need for new market structures that motivate sustainable behaviors by participants. The Power Trading Agent Competition (Power TAC) is a new annual competition that focuses on the design and operation of future retail power markets, specifically in smart grid environments with renewable energy production, smart metering, and autonomous agents acting on behalf of customers and retailers. It uses a rich, open-source simulation platform that is based on real-world data and state-of-the-art customer models. Its purpose is to help researchers understand the dynamics of customer and retailer decision-making, as well as the robustness of proposed market designs. This paper introduces TACTEX’13, the champion agent from the inaugural competition in 2013. TACTEX’13 learns and adapts to the environment in which it operates, by heavily relying on reinforcement learning and prediction methods. This paper describes the constituent components of TACTEX’13 and examines its success through analysis of competition results and subsequent controlled experiments.

RLDM Conference 2013 Conference Abstract

Communicating with Unknown Teammates

  • Samuel Barrett
  • Noa Agmon
  • Noam Hazon
  • Sarit Kraus
  • Peter Stone

Teamwork is central to many tasks, and past research has introduced a number of methods for coordinating teams of agents. However, with the growing number of sources of agents, it is likely that an agent will encounter teammates that do not share its coordination method. Therefore, it is desirable for agents to adapt to these teammates, forming an effective ad hoc team. Past ad hoc teamwork research has focused on cases where the agents do not directly communicate. This paper tackles the problem of communication in ad hoc teams, introducing a minimal version of the multiagent, multi-armed bandit problem with limited communication between the agents. The theoretical results in this paper prove that this problem setting can be solved in polynomial time when the agent knows the set of possible teammates. Furthermore, the empirical results show that an agent can cooperate with a variety of teammates not created by the authors even when its models of these teammates are imperfect.

RLDM Conference 2013 Conference Abstract

DJ-MC: A Reinforcement-Learning Framework for a Music Playlist Recommender System (Ex-

  • Elad Liebman
  • Peter Stone

In recent years, there has been growing focus on the study of automated recommender systems. Music recommendation systems serve as a prominent domain for such works, both from an academic and a commercial perspective. To our knowledge, most of these systems focus on predicting the preference of individual songs independently based on a learned model of a listener. However, a relatively well known fact in music cognition is that music is experienced in temporal context and in sequence. In this work we present a reinforcement-learning based framework for music recommendation that does not recommend songs individually but rather song sequences, or playlists, based on a learned model of preferences for both individual songs and song transitions. To reduce exploration time, we initialize a model based on user feedback. This model is subsequently updated by reinforcement. We show our algorithm outperforms a more naive approach both on synthetic data and on a real song database.

RLDM Conference 2013 Conference Abstract

Learning Objectives for Numeric Human Feedback

  • W. Bradley Knox
  • Peter Stone

Several studies have demonstrated that human-generated reward can be a powerful feedback signal for control-learning algorithms. However, the algorithmic space for learning from human reward has hitherto not been explored systematically. Using model-based reinforcement learning from human reward, this article experimentally investigates the problem of learning from human reward, focusing on the relationships between reward positivity, temporal discounting, whether the task is episodic or continuing, and task performance. We identify and empirically verify a “positive circuits” problem with low discounting (i.e., high discount factors) for episodic, goal-based tasks that arises from an observed bias among humans towards giving positive reward, resulting in an endorsement of myopic learning for such domains. We then show that converting simple episodic tasks to be non-episodic (i.e., continuing) reduces and in some cases resolves issues present in episodic tasks with generally positive reward and, relatedly, enables highly successful learning with non-myopic valuation in multiple user studies. The primary learning algorithm introduced in this article, which we call “VI-TAMER”, is the first algorithm to successfully learn non-myopically from human-generated reward; we also empirically show that such non-myopic valuation facilitates higher-level understanding of the task.

JAAMAS Journal 2013 Journal Article

Multiagent learning in the presence of memory-bounded agents

  • Doran Chakraborty
  • Peter Stone

Abstract In recent years, great strides have been made towards creating autonomous agents that can learn via interaction with their environment. When considering just an individual agent, it is often appropriate to model the world as being stationary, meaning that the same action from the same state will always yield the same (possibly stochastic) effects. However, in the presence of other independent agents, the environment is not stationary: an action’s effects may depend on the actions of the other agents. This non-stationarity poses the primary challenge of multiagent learning and comprises the main reason that it is best considered distinctly from single agent learning. The multiagent learning problem is often studied in the stylized settings provided by repeated matrix games. The goal of this article is to introduce a novel multiagent learning algorithm for such a setting, called Convergence with Model Learning and Safety (or CMLeS), that achieves a new set of objectives which have not been previously achieved. Specifically, CMLeS is the first multiagent learning algorithm to achieve the following three objectives: (1) converges to following a Nash equilibrium joint-policy in self-play; (2) achieves close to the best response when interacting with a set of memory-bounded agents whose memory size is upper bounded by a known value; and (3) ensures an individual return that is very close to its security value when interacting with any other set of agents. Our presentation of CMLeS is backed by a rigorous theoretical analysis, including an analysis of sample complexity wherever applicable.

AIJ Journal 2013 Journal Article

Teaching and leading an ad hoc teammate: Collaboration without pre-coordination

  • Peter Stone
  • Gal A. Kaminka
  • Sarit Kraus
  • Jeffrey S. Rosenschein
  • Noa Agmon

As autonomous agents proliferate in the real world, both in software and robotic settings, they will increasingly need to band together for cooperative activities with previously unfamiliar teammates. In such ad hoc team settings, team strategies cannot be developed a priori. Rather, an agent must be prepared to cooperate with many types of teammates: it must collaborate without pre-coordination. This article defines two aspects of collaboration in two-player teams, involving either simultaneous or sequential decision making. In both cases, the ad hoc agent is more knowledgeable of the environment, and attempts to influence the behavior of its teammate such that they will attain the optimal possible joint utility.

AAAI Conference 2013 Conference Paper

Teamwork with Limited Knowledge of Teammates

  • Samuel Barrett
  • Peter Stone
  • Sarit Kraus
  • Avi Rosenfeld

While great strides have been made in multiagent teamwork, existing approaches typically assume extensive information exists about teammates and how to coordinate actions. This paper addresses how robust teamwork can still be created even if limited or no information exists about a specific group of teammates, as in the ad hoc teamwork scenario. The main contribution of this paper is the first empirical evaluation of an agent cooperating with teammates not created by the authors, where the agent is not provided expert knowledge of its teammates. For this purpose, we develop a general-purpose teammate modeling method and test the resulting ad hoc team agent’s ability to collaborate with more than 40 unknown teams of agents to accomplish a benchmark task. These agents were designed by people other than the authors without these designers planning for the ad hoc teamwork setting. A secondary contribution of the paper is a new transfer learning algorithm, TwoStageTransfer, that can improve results when the ad hoc team agent does have some limited observations of its current teammates.

AAMAS Conference 2012 Conference Paper

An Analysis Framework for Ad Hoc Teamwork Tasks

  • Samuel Barrett
  • Peter Stone

In multiagent team settings, the agents are often given a protocol for coordinating their actions. When such a protocol is not available, agents must engage in ad hoc teamwork to effectively cooperate with one another. A fully general ad hoc team agent needs to be capable of collaborating with a wide range of potential teammates on a varying set of joint tasks. This paper presents a framework for analyzing ad hoc team problems that sheds light on the current state of research and suggests avenues for future research. In addition, this paper shows how previous theoretical results can aid ad hoc agents in a set of testbed domains.

AAAI Conference 2012 Conference Paper

Design and Optimization of an Omnidirectional Humanoid Walk: A Winning Approach at the RoboCup 2011 3D Simulation Competition

  • Patrick MacAlpine
  • Samuel Barrett
  • Daniel Urieli
  • Victor Vu
  • Peter Stone

This paper presents the design and learning architecture for an omnidirectional walk used by a humanoid robot soccer agent acting in the RoboCup 3D simulation environment. The walk, which was originally designed for and tested on an actual Nao robot before being employed in the 2011 RoboCup 3D simulation competition, was the crucial component in the UT Austin Villa team winning the competition in 2011. To the best of our knowledge, this is the first time that robot behavior has been conceived and constructed on a real robot for the end purpose of being used in simulation. The walk is based on a double linear inverted pendulum model, and multiple sets of its parameters are optimized via a novel framework. The framework optimizes parameters for different tasks in conjunction with one another, a little-understood problem with substantial practical significance. Detailed experiments show that the UT Austin Villa agent significantly outperforms all the other agents in the competition with the optimized walk being the key to its success.

AAMAS Conference 2012 Conference Paper

Leading Ad Hoc Agents in Joint Action Settings with Multiple Teammates

  • Noa Agmon
  • Peter Stone

The growing use of autonomous agents in practice may require agents to cooperate as a team in situations where they have limited prior knowledge about one another, cannot communicate directly, or do not share the same world models. These situations raise the need to design ad hoc team members, i.e., agents that will be able to cooperate without coordination in order to reach an optimal team behavior. This paper considers the problem of leading $N$-agent teams by an agent toward their optimal joint utility, where the agents compute their next actions based only on their most recent observations of their teammates' actions. We show that compared to previous results in two-agent teams, in larger teams the agent might not be able to lead the team to the action with maximal joint utility, thus its optimal strategy is to lead the team to the best possible reachable cycle of joint actions. We describe a graphical model of the problem and a polynomial time algorithm for solving it. We then consider other variations of the problem, including leading teams of agents where they base their actions on longer history of past observations, leading a team by more than one ad hoc agent, and leading a teammate while the ad hoc agent is uncertain of its behavior.

AAMAS Conference 2012 Conference Paper

Reinforcement Learning from Simultaneous Human and MDP Reward

  • W. Bradley Knox
  • Peter Stone

As computational agents are increasingly used beyond research labs, their success will depend on their ability to learn new skills and adapt to their dynamic, complex environments. If human users - without programming skills - can transfer their task knowledge to agents, learning can accelerate dramatically, reducing costly trials. The TAMER framework guides the design of agents whose behavior can be shaped through signals of approval and disapproval, a natural form of human feedback. More recently, TAMER+RL was introduced to enable human feedback to augment a traditional reinforcement learning (RL) agent that learns from a Markov decision process's (MDP) reward signal. We address limitations of prior work on TAMER and TAMER+RL, contributing in two critical directions. First, the four successful techniques for combining human reward with RL from prior TAMER+RL work are tested on a second task, and these techniques' sensitivities to parameter changes are analyzed. Together, these examinations yield more general and prescriptive conclusions to guide others who wish to incorporate human knowledge into an RL algorithm. Second, TAMER+RL has thus far been limited to a sequential setting, in which training occurs before learning from MDP reward. In this paper, we introduce a novel algorithm that shares the same spirit as TAMER+RL but learns simultaneously from both reward sources, enabling the human feedback to come at any time during the reinforcement learning process. We call this algorithm simultaneous TAMER+RL. To enable simultaneous learning, we introduce a new technique that appropriately determines the magnitude of the human model's influence on the RL algorithm throughout time and state-action space.

AAMAS Conference 2012 Conference Paper

Role Selection in Ad Hoc Teamwork

  • Katie Genter
  • Noa Agmon
  • Peter Stone

An ad hoc team setting is one in which teammates must work together to obtain a common goal, but without any prior agreement regarding how to work together. In this work we introduce a role-based approach for ad hoc teamwork, in which each teammate is inferred to be following a specialized role that accomplishes a specific task or exhibits a particular behavior. In such cases, the role an ad hoc agent should select depends both on its own capabilities and on the roles currently selected by other team members. We present methods for evaluating the influence of the ad hoc agent’s role selection on the team’s utility and we examine empirically how to choose the best suited method for role assignment in a complex environment. Finally, we show that an appropriate assignment method can be determined from a limited amount of data and used successfully in new tasks that the team has not encountered before.

AAMAS Conference 2012 Conference Paper

UT Austin Villa 2011: A Champion Agent in the RoboCup 3D Soccer Simulation Competition

  • Patrick MacAlpine
  • Daniel Urieli
  • Samuel Barrett
  • Shivaram Kalyanakrishnan
  • Francisco Barrera
  • Adrian Lopez-Mobilia
  • Nicolae Ştiurcă
  • Victor Vu

This paper presents the architecture and key components of a simulated humanoid robot soccer team, UT Austin Villa, which was designed to compete in the RoboCup 3D simulation competition. These key components include (1) an omnidirectional walk engine and associated walk parameter optimization framework, (2) an inverse kinematics based kicking architecture, and (3) a dynamic role assignment and positioning system. UT Austin Villa won the RoboCup 2011 3D simulation competition in convincing fashion by winning all 24 games it played. During the course of the competition the team scored 136 goals while conceding none. We analyze the effect of each component in isolation and show through extensive experiments that the complete team significantly outperforms all the other teams from the competition.

AAMAS Conference 2011 Conference Paper

A Particle Filter for Bid Estimation in Ad Auctions with Periodic Ranking Observations

  • David Pardoe
  • Peter Stone

Keyword auctions are becoming increasingly important in today's electronic marketplaces. One of their most challenging aspects is the limited amount of information revealed about other advertisers. In this paper, we present a particle filter that can be used to estimate the bids of other advertisers given a periodic ranking of their bids. This particle filter makes use of models of the bidding behavior of other advertisers, and so we also show how such models can be learned from past bidding data. In experiments in the Ad Auction scenario of the Trading Agent Competition, the combination of this particle filter and bidder modeling outperforms all other bid estimation methods tested.

AAAI Conference 2011 Conference Paper

Ad Hoc Teamwork in Variations of the Pursuit Domain

  • Samuel Barrett
  • Peter Stone

In multiagent team settings, the agents are often given a protocol for coordinating their actions. When such a protocol is not available, agents must engage in ad hoc teamwork to effectively cooperate with one another. A fully general ad hoc team agent needs to be capable of collaborating with a wide range of potential teammates on a varying set of joint tasks. This paper extends previous research in a new direction with the introduction of an efficient method for reasoning about the value of information. Then, we show how previous theoretical results can aid ad hoc agents in a set of testbed pursuit domains.

AAMAS Conference 2011 Conference Paper

Batch Reservations in Autonomous Intersection Management

  • Neda Shahidi
  • Tsz-Chiu Au
  • Peter Stone

The recent robot car competitions and demonstrations have convincingly shown that fully autonomous vehicles are feasible with current or near-future intelligent vehicle technology. Looking ahead to the time when such autonomous cars will be common, Dresner and Stone proposed a new intersection control protocol called Autonomous Intersection Management (AIM) and showed that by leveraging the capacities of autonomous vehicles we can devise a reservation-based intersection control protocol that is much more efficient than traffic signals and stop signs. Their proposed protocol, however, handles reservation requests one at a time and does not prioritize reservations according to their relative importance and vehicles' waiting times, causing potentially large inequalities in granting reservations. For example, at an intersection between a main street and an alley, vehicles from the alley can take a very long time to get reservations to enter the intersection. In this research, we introduce a prioritization scheme to prevent uneven reservation assignments in unbalanced traffic. Our experimental results show that our prioritizing scheme outperforms previous intersection control protocols in unbalanced traffic.

AAAI Conference 2011 Conference Paper

Comparing Agents’ Success against People in Security Domains

  • Raz Lin
  • Sarit Kraus
  • Noa Agmon
  • Samuel Barrett
  • Peter Stone

The interaction of people with autonomous agents has become increasingly prevalent. Some of these settings include security domains, where people can be characterized as uncooperative, hostile, manipulative, and tending to take advantage of the situation for their own needs. This makes it challenging to design proficient agents to interact with people in such environments. Evaluating the success of the agents automatically before evaluating them with people or deploying them could alleviate this challenge and result in better designed agents. In this paper we show how Peer Designed Agents (PDAs) – computer agents developed by human subjects – can be used as a method for evaluating autonomous agents in security domains. Such evaluation can reduce the effort and costs involved in evaluating autonomous agents interacting with people to validate their efficacy. Our experiments included more than 70 human subjects and 40 PDAs developed by students. The study provides empirical support that PDAs can be used to compare the proficiency of autonomous agents when matched with people in security domains.

AAMAS Conference 2011 Conference Paper

Empirical Evaluation of Ad Hoc Teamwork in the Pursuit Domain

  • Samuel Barrett
  • Peter Stone
  • Sarit Kraus

The concept of creating autonomous agents capable of exhibiting ad hoc teamwork was recently introduced as a challenge to the AI, and specifically to the multiagent systems community. An agent capable of ad hoc teamwork is one that can effectively cooperate with multiple potential teammates on a set of collaborative tasks. Previous research has investigated theoretically optimal ad hoc teamwork strategies in restrictive settings. This paper presents the first empirical study of ad hoc teamwork in a more open, complex teamwork domain. Specifically, we evaluate a range of effective algorithms for on-line behavior generation on the part of a single ad hoc team agent that must collaborate with a range of possible teammates in the pursuit domain.

AAAI Conference 2011 Conference Paper

Enforcing Liveness in Autonomous Traffic Management

  • Tsz-Chiu Au
  • Neda Shahidi
  • Peter Stone

Looking ahead to the time when autonomous cars will be common, Dresner and Stone proposed a multiagent systems-based intersection control protocol called Autonomous Intersection Management (AIM). They showed that by leveraging the capacities of autonomous vehicles it is possible to dramatically reduce the time wasted in traffic, and therefore also fuel consumption and air pollution. The proposed protocol, however, handles reservation requests one at a time and does not prioritize reservations according to their relative priorities and waiting times, causing potentially large inequalities in granting reservations. For example, at an intersection between a main street and an alley, vehicles from the alley can take an excessively long time to get reservations to enter the intersection, causing a waste of time and fuel. The same is true in a network of intersections, in which gridlock may occur and cause traffic congestion. In this paper, we introduce the batch processing of reservations in AIM to enforce liveness properties in intersections and analyze the conditions under which no vehicle will get stuck in traffic. Our experimental results show that our prioritizing schemes outperform previous intersection control protocols in unbalanced traffic.

AAAI Conference 2011 Conference Paper

Multiagent Patrol Generalized to Complex Environmental Conditions

  • Noa Agmon
  • Daniel Urieli
  • Peter Stone

The problem of multiagent patrol has gained considerable attention during the past decade, with the immediate applicability of the problem being one of its main sources of interest. In this paper we concentrate on frequency-based patrol, in which the agents’ goal is to optimize a frequency criterion, namely, minimizing the time between visits to a set of interest points. We consider multiagent patrol in environments with complex environmental conditions that affect the cost of traveling from one point to another. For example, in marine environments, the travel time of ships depends on parameters such as wind, water currents, and waves. We demonstrate that in such environments there is a need to consider a new multiagent patrol strategy which divides the given area into parts in which more than one agent is active, for improving frequency. We show that in general graphs this problem is intractable, therefore we focus on simplified (yet realistic) cyclic graphs with possible inner edges. Although the problem remains generally intractable in such graphs, we provide a heuristic algorithm that is shown to significantly improve point-visit frequency compared to other patrol strategies. For evaluation of our work we used a custom developed ship simulator that realistically models ship movement constraints such as engine force and drag and reaction of the ship to environmental changes.

AAMAS Conference 2011 Conference Paper

On Optimizing Interdependent Skills: A Case Study in Simulated 3D Humanoid Robot Soccer

  • Daniel Urieli
  • Patrick MacAlpine
  • Shivaram Kalyanakrishnan
  • Yinon Bentor
  • Peter Stone

In several realistic domains an agent's behavior is composed of multiple interdependent skills. For example, consider a humanoid robot that must play soccer, as is the focus of this paper. In order to succeed, it is clear that the robot needs to walk quickly, turn sharply, and kick the ball far. However, these individual skills are ineffective if the robot falls down when switching from walking to turning, or if it cannot position itself behind the ball for a kick. This paper presents a learning architecture for a humanoid robot soccer agent that has been fully deployed and tested within the RoboCup 3D simulation environment. First, we demonstrate that individual skills such as walking and turning can be parameterized and optimized to match the best performance statistics reported in the literature. These results are achieved through effective use of the CMA-ES optimization algorithm. Next, we describe a framework for optimizing skills in conjunction with one another, a little-understood problem with substantial practical significance. Over several phases of learning, a total of roughly 100-150 parameters are optimized. Detailed experiments show that an agent thus optimized performs comparably with the top teams from the RoboCup 2010 competitions, while taking relatively few man-hours for development.

AAAI Conference 2011 Conference Paper

Role-Based Ad Hoc Teamwork

  • Katie Genter
  • Noa Agmon
  • Peter Stone

An ad hoc team setting is one in which teammates must work together to obtain a common goal, but without any prior agreement regarding how to work together. In this abstract we present a role-based approach for ad hoc teamwork, in which each teammate is inferred to be following a specialized role that accomplishes a specific task or exhibits a particular behavior. In such cases, the role an ad hoc agent should select depends both on its own capabilities and on the roles currently selected by the other team members. We present methods for evaluating the influence of the ad hoc agent’s role selection on the team’s utility and we examine empirically how to select the best suited method for role assignment in a complex environment. Finally, we show that an appropriate assignment method can be determined from a limited amount of data and used successfully in similar new tasks that the team has not encountered before.

AAMAS Conference 2011 Conference Paper

Ship Patrol: Multiagent Patrol under Complex Environmental Conditions

  • Noa Agmon
  • Daniel Urieli
  • Peter Stone

In the problem of multiagent patrol, a team of agents is required to repeatedly visit a target area in order to monitor possible changes in state. The growing popularity of this problem comes mainly from its immediate applicability to a wide variety of domains. In this paper we concentrate on frequency-based patrol, in which the agents' goal is to optimize a frequency criterion, namely, minimizing the time between visits to a set of interest points. In situations with varying environmental conditions, the influence of changes in the conditions on the cost of travel may be immense. For example, in marine environments, the travel time of ships depends on parameters such as wind, water currents, and waves. Such environments raise the need to consider a new multiagent patrol strategy which divides the given area into regions in which more than one agent is active, for improving frequency. We prove that in general graphs this problem is intractable, therefore we focus on simplified (yet realistic) cyclic graphs with possible inner edges. Although the problem remains generally intractable in such graphs, we provide a heuristic algorithm that is shown to significantly improve point-visit frequency compared to other patrol strategies.

AAAI Conference 2010 Conference Paper

Ad Hoc Autonomous Agent Teams: Collaboration without Pre-Coordination

  • Peter Stone
  • Gal Kaminka
  • Sarit Kraus
  • Jeffrey Rosenschein

As autonomous agents proliferate in the real world, both in software and robotic settings, they will increasingly need to band together for cooperative activities with previously unfamiliar teammates. In such ad hoc team settings, team strategies cannot be developed a priori. Rather, an agent must be prepared to cooperate with many types of teammates: it must collaborate without pre-coordination. This paper challenges the AI community to develop theory and to implement prototypes of ad hoc team agents. It defines the concept of ad hoc team agents, specifies an evaluation paradigm, and provides examples of possible theoretical and empirical approaches to the challenge. The goal is to encourage progress towards this ambitious, newly realistic, and increasingly important research goal.

AAMAS Conference 2010 Conference Paper

Combining Manual Feedback with Subsequent MDP Reward Signals for Reinforcement Learning

  • W. Bradley Knox
  • Peter Stone

As learning agents move from research labs to the real world, it is increasingly important that human users, including those without programming skills, be able to teach agents desired behaviors. Recently, the TAMER framework was introduced for designing agents that can be interactively shaped by human trainers who give only positive and negative feedback signals. Past work on TAMER showed that shaping can greatly reduce the sample complexity required to learn a good policy, can enable lay users to teach agents the behaviors they desire, and can allow agents to learn within a Markov Decision Process (MDP) in the absence of a coded reward function. However, TAMER does not allow this human training to be combined with autonomous learning based on such a coded reward function. This paper leverages the fast learning exhibited within the TAMER framework to hasten a reinforcement learning (RL) algorithm's climb up the learning curve, effectively demonstrating that human reinforcement and MDP reward can be used in conjunction with one another by an autonomous agent. We tested eight plausible TAMER+RL methods for combining a previously learned human reinforcement function, $\hat{H}$, with MDP reward in a reinforcement learning algorithm. This paper identifies which of these methods are most effective and analyzes their strengths and weaknesses. Results from these TAMER+RL algorithms indicate better final performance and better cumulative performance than either a TAMER agent or an RL agent alone.

AAMAS Conference 2010 Conference Paper

MARIOnET: Motion Acquisition for Robots through Iterative Online Evaluative Training

  • Adam Setapen
  • Michael Quinlan
  • Peter Stone

As robots become more commonplace, the tools to facilitate knowledge transfer from human to robot will be vital, especially for non-technical users. While some ongoing work considers the role of human reinforcement in intelligent algorithms, the burden of learning is often placed solely on the computer. These approaches neglect the expressive capabilities of humans, especially regarding our ability to quickly refine motor skills. Thus, when designing autonomous robots that interact with humans, not only is it important to leverage machine learning, but it is also very useful to have the tools in place to facilitate the transfer of knowledge between man and machine. We introduce such a tool for enabling a human to transfer motion learning capabilities to a robot. In this paper, we propose a general framework for Motion Acquisition in Robots through Iterative Online Evaluative Training (MARIOnET). Specifically, MARIOnET represents a direct and real-time interface between a human in a motion-capture suit and a robot, with a training process that provides a convenient human interface and requires no technical knowledge. In our framework, the learning happens exclusively by the human - not the robot. However, the robot provides a natural interface for interaction, and is able to store and reuse trained behaviors autonomously in the future. Our approach exploits the ability at which humans are able to learn and refine fine-motor skills. Implemented on two robots (one quadruped and one biped), our results indicate that both technical and non-technical users are able to harness MARIOnET to quickly improve a robot's performance of a task requiring fine-motor skills.

AAMAS Conference 2010 Conference Paper

Online Model Learning in Adversarial Markov Decision Processes

  • Doran Chakraborty
  • Peter Stone

Consider, for example, the well-known game of Roshambo, or rock-paper-scissors, in which two players select one of three actions simultaneously. One may know that the adversary will base its next action on some bounded sequence of the past joint actions, but may be unaware of its exact strategy. For example, one may notice that every time it selects $P$, the adversary selects $S$ in the next step; or perhaps whenever it selects $R$ in three of the last four steps, the adversary selects $P$ $90\%$ of the time in the next step. The challenge is that to begin with, neither the adversary function that maps action histories to future actions (may be stochastic), nor even how far back it looks back in the action history (other than an upper bound) may be known. At a high level, this paper is concerned with automatically building such predictive models of an adversary's future actions as a function of past interactions.

AAMAS Conference 2010 Conference Paper

TacTex09: A Champion Bidding Agent for Ad Auctions

  • David Pardoe
  • Doran Chakraborty
  • Peter Stone

In the Trading Agent Competition Ad Auctions Game, agents compete to sell products by bidding to have their ads shown in a search engine's sponsored search results. We report on the winning agent from the first (2009) competition, TacTex. TacTex operates by estimating the full game state from limited information, using these estimates to make predictions, and then optimizing its actions (daily bids, ads, and spending limits) with respect to these predictions. We present a full description of TacTex along with analysis of its performance in both the competition and controlled experiments.

AAMAS Conference 2010 Conference Paper

To Teach or not to Teach? Decision Making Under Uncertainty in Ad Hoc Teams

  • Peter Stone
  • Sarit Kraus

In typical multiagent teamwork settings, the teammates are either programmed together, or are otherwise provided with standard communication languages and coordination protocols. In contrast, this paper presents an ad hoc team setting in which the teammates are not pre-coordinated, yet still must work together in order to achieve their common goal(s). We represent a specific instance of this scenario, in which a teammate has limited action capabilities and a fixed and known behavior, as a multiagent cooperative $k$-armed bandit. In addition to motivating and studying this novel ad hoc teamwork scenario, the paper contributes to the $k$-armed bandits literature by characterizing the conditions under which certain actions are potentially optimal, and by presenting a polynomial dynamic programming algorithm that solves for the optimal action when the arm payoffs come from a discrete distribution.

AAMAS Conference 2010 Conference Paper

Training a Tetris Agent via Interactive Shaping: A Demonstration of the TAMER Framework

  • W. Bradley Knox
  • Peter Stone

As computational learning agents continue to improve their ability to learn sequential decision-making tasks, a central but largely unfulfilled goal is to deploy these agents in real-world domains in which they interact with humans and make decisions that affect our lives. People will want such interactive agents to be able to perform tasks for which the agent's original developers could not prepare it. Thus it will be imperative to develop agents that can learn from natural methods of communication. The teaching technique of shaping is one such method. In this context, we define shaping as training an agent through signals of positive and negative reinforcement. In a shaping scenario, a human trainer observes an agent and reinforces its behavior through push-buttons, spoken word ("yes" or "no"), facial expression, or any other signal that can be converted to a scalar signal of approval or disapproval. We treat shaping as a specific mode of knowledge transfer, distinct from (and probably complementary to) other natural methods of communication, including programming by demonstration and advice-giving. The key challenge before us is to create agents that can be shaped effectively.

AAMAS Conference 2009 Conference Paper

An Empirical Analysis of Value Function-Based and Policy Search Reinforcement Learning

  • Shivaram Kalyanakrishnan
  • Peter Stone

In several agent-oriented scenarios in the real world, an autonomous agent that is situated in an unknown environment must learn through a process of trial and error to take actions that result in long-term benefit. Reinforcement Learning (or sequential decision making) is a paradigm well-suited to this requirement. Value function-based methods and policy search methods are contrasting approaches to solve reinforcement learning tasks. While both classes of methods benefit from independent theoretical analyses, these often fail to extend to the practical situations in which the methods are deployed. We conduct an empirical study to examine the strengths and weaknesses of these approaches by introducing a suite of test domains that can be varied for problem size, stochasticity, function approximation, and partial observability. Our results indicate clear patterns in the domain characteristics for which each class of methods excels. We investigate whether their strengths can be combined, and develop an approach to achieve that purpose. The effectiveness of this approach is also demonstrated on the challenging benchmark task of robot soccer Keepaway. We highlight several lines of inquiry that emanate from this study.

JAAMAS Journal 2009 Journal Article

Critical factors in the empirical performance of temporal difference and evolutionary methods for reinforcement learning

  • Shimon Whiteson
  • Matthew E. Taylor
  • Peter Stone

Temporal difference and evolutionary methods are two of the most common approaches to solving reinforcement learning problems. However, there is little consensus on their relative merits and there have been few empirical studies that directly compare their performance. This article aims to address this shortcoming by presenting results of empirical comparisons between Sarsa and NEAT, two representative methods, in mountain car and keepaway, two benchmark reinforcement learning tasks. In each task, the methods are evaluated in combination with both linear and nonlinear representations to determine their best configurations. In addition, this article tests two specific hypotheses about the critical factors contributing to these methods’ relative performance: (1) that sensor noise reduces the final performance of Sarsa more than that of NEAT, because Sarsa’s learning updates are not reliable in the absence of the Markov property and (2) that stochasticity, by introducing noise in fitness estimates, reduces the learning speed of NEAT more than that of Sarsa. Experiments in variations of mountain car and keepaway designed to isolate these factors confirm both these hypotheses.

AAMAS Conference 2009 Conference Paper

Generalized Model Learning for Reinforcement Learning in Factored Domains

  • Todd Hester
  • Peter Stone

Improving the sample efficiency of reinforcement learning algorithms to scale up to larger and more realistic domains is a current research challenge in machine learning. Model-based methods use experiential data more efficiently than model-free approaches but often require exhaustive exploration to learn an accurate model of the domain. We present an algorithm, Reinforcement Learning with Decision Trees (RL-DT), that uses supervised learning techniques to learn the model by generalizing the relative effect of actions across states. Specifically, RL-DT uses decision trees to model the relative effects of actions in the domain. The agent explores the environment exhaustively in early episodes when its model is inaccurate. Once it believes it has developed an accurate model, it exploits its model, taking the optimal action at each step. The combination of the learning approach with the targeted exploration policy enables fast learning of the model. The sample efficiency of the algorithm is evaluated empirically in comparison to five other algorithms across three domains. RL-DT consistently accrues high cumulative rewards in comparison with the other algorithms tested.

JMLR Journal 2009 Journal Article

Transfer Learning for Reinforcement Learning Domains: A Survey

  • Matthew E. Taylor
  • Peter Stone

The reinforcement learning paradigm is a popular way to address problems that have only limited environmental feedback, rather than correctly labeled examples, as is common in other machine learning contexts. While significant progress has been made to improve learning in a single task, the idea of transfer learning has only recently been applied to reinforcement learning tasks. The core idea of transfer is that experience gained in learning to perform one task can help improve learning performance in a related, but different, task. In this article we present a framework that classifies transfer learning methods in terms of their capabilities and goals, and then use it to survey the existing literature, as well as to suggest future directions for transfer learning work.

AAMAS Conference 2008 Conference Paper

Autonomous Transfer for Reinforcement Learning

  • Matthew Taylor
  • Gregory Kuhlmann
  • Peter Stone

Recent work in transfer learning has succeeded in making reinforcement learning algorithms more efficient by incorporating knowledge from previous tasks. However, such methods typically must be provided either a full model of the tasks or an explicit relation mapping one task into the other. An autonomous agent may not have access to such high-level information, but would be able to analyze its experience to find similarities between tasks. In this paper we introduce Modeling Approximate State Transitions by Exploiting Regression (MASTER), a method for automatically learning a mapping from one task to another through an agent’s experience. We empirically demonstrate that such learned relationships can significantly improve the speed of a reinforcement learning algorithm in a series of Mountain Car tasks. Additionally, we demonstrate that our method may also assist with the difficult problem of task selection for transfer.

AAMAS Conference 2008 Conference Paper

Mitigating Catastrophic Failure at Intersections of Autonomous Vehicles

  • Kurt Dresner
  • Peter Stone

Fully autonomous vehicles promise enormous gains in safety, efficiency, and economy. Before such gains can be realized, safety and reliability concerns must be addressed. We have previously introduced a system for managing such vehicles at intersections that is capable of handling more vehicles and causing fewer delays than traffic lights and stop signs [2]. While the system is safe under normal operating conditions, we have not discussed the possibility or implications of unforeseen mechanical failures. Because the system orchestrates such precarious “close calls,” the tolerance for such errors is small. In this paper, we introduce safety features of the system designed to deal with these types of failures, and perform a basic failure mode analysis, demonstrating that without these features, the system is unsuitable for deployment due to a propensity for catastrophic failure modes.

AAMAS Conference 2008 Conference Paper

Replacing the Stop Sign: Unmanaged Intersection Control for Autonomous Vehicles

  • Mark Van Middlesworth
  • Kurt Dresner
  • Peter Stone

As computers replace humans as the drivers of automobiles, our current traffic management mechanisms will give way to hyper-efficient protocols designed to exploit the capabilities of fully autonomous vehicles. We have introduced such a system for coordinating large numbers of autonomous vehicles at intersections [2, 3]. Our experiments suggest that this system could alleviate many of the dangers and delays associated with intersections by allowing vehicles to “call ahead” to an agent stationed at the intersection and reserve time and space for their traversal. Unfortunately, such a system is not cost-effective at small intersections. In this paper, we propose an intersection control mechanism for autonomous vehicles designed specifically for low-traffic intersections where the previous system would not be practical. Our mechanism is based on purely peer-to-peer communication and thus requires no infrastructure at the intersection. We present experimental results demonstrating that our system, while not suited to large, busy intersections, can significantly outperform traditional stop signs at small intersections.

AAMAS Conference 2008 Conference Paper

The Utility of Temporal Abstraction in Reinforcement Learning

  • Nicholas Jong
  • Todd Hester
  • Peter Stone

The hierarchical structure of real-world problems has motivated extensive research into temporal abstractions for reinforcement learning, but precisely how these abstractions allow agents to improve their learning performance is not well understood. This paper investigates the connection between temporal abstraction and an agent’s exploration policy, which determines how the agent’s performance improves over time. Experimental results with standard methods for incorporating temporal abstractions show that these methods benefit learning only in limited contexts. The primary contribution of this paper is a clearer understanding of how hierarchical decompositions interact with reinforcement learning algorithms, with important consequences for the manual design or automatic discovery of action hierarchies.

IJCAI Conference 2007 Conference Paper

  • Peter Stone

One goal of Artificial Intelligence is to enable the creation of robust, fully autonomous agents that can coexist with us in the real world. Such agents will need to be able to learn, both in order to correct and circumvent their inevitable imperfections, and to keep up with a dynamically changing world. They will also need to be able to interact with one another, whether they share common goals, they pursue independent goals, or their goals are in direct conflict. This paper presents current research directions in machine learning, multiagent reasoning, and robotics, and advocates their unification within concrete application domains. Ideally, new theoretical results in each separate area will inform practical implementations while innovations from concrete multiagent applications will drive new theoretical pursuits, and together these synergistic research approaches will lead us towards the goal of fully autonomous agents.

IJCAI Conference 2007 Conference Paper

  • Kurt Dresner
  • Peter Stone

In modern urban settings, automobile traffic and collisions lead to endless frustration as well as significant loss of life, property, and productivity. Recent advances in artificial intelligence suggest that autonomous vehicle navigation may soon be a reality. In previous work, we have demonstrated that a reservation-based approach can efficiently and safely govern interactions of multiple autonomous vehicles at intersections. Such an approach alleviates many traditional problems associated with intersections, in terms of both safety and efficiency. However, the system relies on all vehicles being equipped with the requisite technology - a restriction that would make implementing such a system in the real world extremely difficult. In this paper, we extend this system to allow for incremental deployability. The modified system is able to accommodate traditional human-operated vehicles using existing infrastructure. Furthermore, we show that as the number of autonomous vehicles on the road increases, traffic delays decrease monotonically toward the levels exhibited in our previous work. Finally, we develop a method for switching between various human-usable configurations while the system is running, in order to facilitate an even smoother transition. The work is fully implemented and tested in our custom simulator, and we present detailed experimental results attesting to its effectiveness.

IJCAI Conference 2007 Conference Paper

  • Jonathan Wildstrom
  • Peter Stone
  • Emmett Witchel
  • Mike Dahlin

As computer systems continue to increase in complexity, the need for AI-based solutions is becoming more urgent. For example, high-end servers that can be partitioned into logical subsystems and repartitioned on the fly are now becoming available. This development raises the possibility of reconfiguring distributed systems online to optimize for dynamically changing workloads. However, it also introduces the need to decide when and how to reconfigure. This paper presents one approach to solving this online reconfiguration problem. In particular, we learn to identify, from only low-level system statistics, which of a set of possible configurations will lead to better performance under the current unknown workload. This approach requires no instrumentation of the system's middleware or operating systems. We introduce an agent that is able to learn this model and use it to switch configurations online as the workload varies. Our agent is fully implemented and tested on a publicly available multi-machine, multi-process distributed system (the online transaction processing benchmark TPC-W). We demonstrate that our adaptive configuration is able to outperform any single fixed configuration in the set over a variety of workloads, including gradual changes and abrupt workload spikes.

IJCAI Conference 2007 Conference Paper

  • Bikramjit Banerjee
  • Peter Stone

We present a reinforcement learning game player that can interact with a General Game Playing system and transfer knowledge learned in one game to expedite learning in many other games. We use the technique of value-function transfer where general features are extracted from the state space of a previous game and matched with the completely different state space of a new game. To capture the underlying similarity of vastly disparate state spaces arising from different games, we use a game-tree lookahead structure for features. We show that such feature-based value function transfer learns superior policies faster than a reinforcement learning agent that does not use knowledge transfer. Furthermore, knowledge transfer using lookahead features can capture opponent-specific value-functions, i.e., it can exploit an opponent's weaknesses to learn faster than a reinforcement learner that uses lookahead with minimax (pessimistic) search against the same opponent.

IJCAI Conference 2007 Conference Paper

  • Mohan Sridharan
  • Peter Stone

A central goal of robotics and AI is to be able to deploy an agent to act autonomously in the real world over an extended period of time. It is commonly asserted that in order to do so, the agent must be able to learn to deal with unexpected environmental conditions. However, an ability to learn is not sufficient. For true extended autonomy, an agent must also be able to recognize when to abandon its current model in favor of learning a new one, and how to learn in its current situation. This paper presents a fully implemented example of such autonomy in the context of color map learning on a vision-based mobile robot for the purpose of image segmentation. Past research established the ability of a robot to learn a color map in a single fixed lighting condition when manually given a 'curriculum,' an action sequence designed to facilitate learning. This paper introduces algorithms that enable a robot to i) devise its own curriculum; and ii) recognize when the lighting conditions have changed sufficiently to warrant learning a new color map.

AAMAS Conference 2007 Conference Paper

Adapting in Agent-Based Markets: A Study from TAC SCM

  • David Pardoe
  • Peter Stone

An agent attempting to model market conditions may benefit from considering how various combinations of competitor strategies would impact these conditions. We give an illustration using a prediction task faced by our agent for the Supply Chain Management scenario of the Trading Agent Competition (TAC SCM). We present the learning approach taken, evaluate its effectiveness, and then explore methods of improving predictions through combining multiple sources of data reflecting various combinations of competitor behaviors.

AAMAS Conference 2007 Conference Paper

Batch Reinforcement Learning in a Complex Domain

  • Shivaram Kalyanakrishnan
  • Peter Stone

Temporal difference reinforcement learning algorithms are perfectly suited to autonomous agents because they learn directly from an agent's experience based on sequential actions in the environment. However, their most common algorithmic variants are relatively inefficient in their use of experience data, which in many agent-based settings can be scarce. In particular, they make just one learning "update" for each atomic experience. Batch reinforcement learning algorithms, on the other hand, aim to achieve greater data efficiency by saving experience data and using it in aggregate to make updates to the learned policy. Their success has been demonstrated in the past on simple domains like grid worlds and low-dimensional control applications like pole balancing. In this paper, we compare and contrast batch reinforcement learning algorithms with on-line algorithms based on their empirical performance in a complex, continuous, noisy, multiagent domain, namely RoboCup soccer Keepaway. We find that the two batch methods we consider, Experience Replay and Fitted Q Iteration, both yield significant gains in sample complexity, while achieving high asymptotic performance.

AAMAS Conference 2007 Conference Paper

IFSA: Incremental Feature-Set Augmentation for Reinforcement Learning Tasks

  • Mazda Ahmadi
  • Matthew E. Taylor
  • Peter Stone

Reinforcement learning is a popular and successful framework for many agent-related problems because only limited environmental feedback is necessary for learning. While many algorithms exist to learn effective policies in such problems, learning is often used to solve real world problems, which typically have large state spaces, and therefore suffer from the "curse of dimensionality." One effective method for speeding up reinforcement learning algorithms is to leverage expert knowledge. In this paper, we propose a method for dynamically augmenting the agent's feature set in order to speed up value-function-based reinforcement learning. The domain expert divides the feature set into a series of subsets such that a novel problem concept can be learned from each successive subset. Domain knowledge is also used to order the feature subsets in order of their importance for learning. Our algorithm uses the ordered feature subsets to learn tasks significantly faster than if the entire feature set is used from the start. Incremental Feature-Set Augmentation (IFSA) is fully implemented and tested in three different domains: Gridworld, Blackjack and RoboCup Soccer Keepaway. All experiments show that IFSA can significantly speed up learning and motivates the applicability of this novel RL method.

AAMAS Conference 2007 Conference Paper

Model-Based Function Approximation in Reinforcement Learning

  • Nicholas K. Jong
  • Peter Stone

Reinforcement learning promises a generic method for adapting agents to arbitrary tasks in arbitrary stochastic environments, but applying it to new real-world problems remains difficult, a few impressive success stories notwithstanding. Most interesting agent-environment systems have large state spaces, so performance depends crucially on efficient generalization from a small amount of experience. Current algorithms rely on model-free function approximation, which estimates the long-term values of states and actions directly from data and assumes that actions have similar values in similar states. This paper proposes model-based function approximation, which combines two forms of generalization by assuming that in addition to having similar values in similar states, actions also have similar effects. For one family of generalization schemes known as averagers, computation of an approximate value function from an approximate model is shown to be equivalent to the computation of the exact value function for a finite model derived from data. This derivation both integrates two independent sources of generalization and permits the extension of model-based techniques developed for finite problems. Preliminary experiments with a novel algorithm, AMBI (Approximate Models Based on Instances), demonstrate that this approach yields faster learning on some standard benchmark problems than many contemporary algorithms.

AIJ Journal 2007 Journal Article

Multiagent learning is not the answer. It is the question

  • Peter Stone

The article by Shoham, Powers, and Grenager called “If multi-agent learning is the answer, what is the question?” does a great job of laying out the current state of the art and open issues at the intersection of game theory and artificial intelligence (AI). However, from the AI perspective, the term “multiagent learning” applies more broadly than can be usefully framed in game theoretic terms. In this larger context, how (and perhaps whether) multiagent learning can be usefully applied in complex domains is still a large open question.

AAMAS Conference 2007 Conference Paper

Towards Reinforcement Learning Representation Transfer

  • Matthew E. Taylor
  • Peter Stone

Transfer learning problems are typically framed as leveraging knowledge learned on a source task to improve learning on a related, but different, target task. Current transfer methods are able to successfully transfer knowledge between agents in different reinforcement learning tasks, reducing the time needed to learn the target. However, the complementary task of representation transfer, i.e., transferring knowledge between agents with different internal representations, has not been well explored. The goal in both types of transfer problems is the same: reduce the time needed to learn the target with transfer, relative to learning the target without transfer. This work introduces one such representation transfer algorithm which is implemented in a complex multiagent domain. Experiments demonstrate that transferring the learned knowledge between different representations is both possible and beneficial.

JMLR Journal 2007 Journal Article

Transfer Learning via Inter-Task Mappings for Temporal Difference Learning

  • Matthew E. Taylor
  • Peter Stone
  • Yaxin Liu

Temporal difference (TD) learning (Sutton and Barto, 1998) has become a popular reinforcement learning technique in recent years. TD methods, relying on function approximators to generalize learning to novel situations, have had some experimental successes and have been shown to exhibit some desirable properties in theory, but the most basic algorithms have often been found slow in practice. This empirical result has motivated the development of many methods that speed up reinforcement learning by modifying a task for the learner or helping the learner better generalize to novel situations. This article focuses on generalizing across tasks, thereby speeding up learning, via a novel form of transfer using handcoded task relationships. We compare learning on a complex task with three function approximators, a cerebellar model arithmetic computer (CMAC), an artificial neural network (ANN), and a radial basis function (RBF), and empirically demonstrate that directly transferring the action-value function can lead to a dramatic speedup in learning with all three. Using transfer via inter-task mapping (TVITM), agents are able to learn one task and then markedly reduce the time it takes to learn a more complex task. Our algorithms are fully implemented and tested in the RoboCup soccer Keepaway domain. This article contains and extends material published in two conference papers (Taylor and Stone, 2005; Taylor et al., 2005).

AAMAS Conference 2007 Conference Paper

Transfer via Inter-Task Mappings in Policy Search Reinforcement Learning

  • Matthew E. Taylor
  • Shimon Whiteson
  • Peter Stone

The ambitious goal of transfer learning is to accelerate learning on a target task after training on a different, but related, source task. While many past transfer methods have focused on transferring value-functions, this paper presents a method for transferring policies across tasks with different state and action spaces. In particular, this paper utilizes transfer via inter-task mappings for policy search methods (TVITM-PS) to construct a transfer functional that translates a population of neural network policies trained via policy search from a source task to a target task. Empirical results in robot soccer Keepaway and Server Job Scheduling show that TVITM-PS can markedly reduce learning time when full inter-task mappings are available. The results also demonstrate that TVITM-PS still succeeds when given only incomplete inter-task mappings. Furthermore, we present a novel method for learning such mappings when they are not available, and give results showing they perform comparably to hand-coded mappings.

JMLR Journal 2006 Journal Article

Evolutionary Function Approximation for Reinforcement Learning

  • Shimon Whiteson
  • Peter Stone

Temporal difference methods are theoretically grounded and empirically effective methods for addressing reinforcement learning problems. In most real-world reinforcement learning tasks, TD methods require a function approximator to represent the value function. However, using function approximators requires manually making crucial representational decisions. This paper investigates evolutionary function approximation, a novel approach to automatically selecting function approximator representations that enable efficient individual learning. This method evolves individuals that are better able to learn. We present a fully implemented instantiation of evolutionary function approximation which combines NEAT, a neuroevolutionary optimization technique, with Q-learning, a popular TD method. The resulting NEAT+Q algorithm automatically discovers effective representations for neural network function approximators. This paper also presents on-line evolutionary computation, which improves the on-line performance of evolutionary computation by borrowing selection mechanisms used in TD methods to choose individual actions and using them in evolutionary computation to select policies for evaluation. We evaluate these contributions with extended empirical studies in two domains: 1) the mountain car task, a standard reinforcement learning benchmark on which neural network function approximators have previously performed poorly and 2) server job scheduling, a large probabilistic domain drawn from the field of autonomic computing. The results demonstrate that evolutionary function approximation can significantly improve the performance of TD methods and on-line evolutionary computation can significantly improve evolutionary methods. This paper also presents additional tests that offer insight into what factors can make neural network function approximation difficult in practice.

EAAI Journal 2004 Journal Article

Adaptive job routing and scheduling

  • Shimon Whiteson
  • Peter Stone

Computer systems are rapidly becoming so complex that maintaining them with human support staffs will be prohibitively expensive and inefficient. In response, visionaries have begun proposing that computer systems be imbued with the ability to configure themselves, diagnose failures, and ultimately repair themselves in response to these failures. However, despite convincing arguments that such a shift would be desirable, as of yet there has been little concrete progress made towards this goal. These challenges are naturally suited to machine learning methods. Hence, this article presents a new network simulator designed to study the application of machine learning methods from a system-wide perspective. Also, learning-based methods for addressing the problems of job routing and CPU scheduling in the simulated networks are introduced. Experimental results verify that methods using machine learning outperform reasonable heuristic and hand-coded approaches on example networks designed to capture many of the complexities that exist in real systems.

AAAI Conference 2004 Conference Paper

Machine Learning for Fast Quadrupedal Locomotion

  • Nate Kohl
  • Peter Stone

For a robot, the ability to get from one place to another is one of the most basic skills. However, locomotion on legged robots is a challenging multidimensional control problem. This paper presents a machine learning approach to legged locomotion, with all training done on the physical robots. The main contributions are a specification of our fully automated learning environment and a detailed empirical comparison of four different machine learning algorithms for learning quadrupedal locomotion. The resulting learned walk is considerably faster than all previously reported hand-coded walks for the same robot platform.

AIJ Journal 1999 Journal Article

Task decomposition, dynamic role assignment, and low-bandwidth communication for real-time strategic teamwork

  • Peter Stone
  • Manuela Veloso

Multi-agent domains consisting of teams of agents that need to collaborate in an adversarial environment offer challenging research opportunities. In this article, we introduce periodic team synchronization (PTS) domains as time-critical environments in which agents act autonomously with low communication, but in which they can periodically synchronize in a full-communication setting. The two main contributions of this article are a flexible team agent structure and a method for inter-agent communication. First, the team agent structure allows agents to capture and reason about team agreements. We achieve collaboration between agents through the introduction of formations. A formation decomposes the task space defining a set of roles. Homogeneous agents can flexibly switch roles within formations, and agents can change formations dynamically, according to pre-defined triggers to be evaluated at run-time. This flexibility increases the performance of the overall team. Our teamwork structure further includes pre-planning for frequently occurring situations. Second, the communication method is designed for use during the low-communication periods in PTS domains. It overcomes the obstacles to inter-agent communication in multi-agent environments with unreliable, single-channel, high-cost, low-bandwidth communication. We fully implemented both the flexible teamwork structure and the communication method in the domain of simulated robotic soccer, and conducted controlled empirical experiments to verify their effectiveness. In addition, our simulator team made it to the semi-finals of the RoboCup-97 competition, in which 29 teams participated. It achieved a total score of 67–9 over six different games, and successfully demonstrated its flexible teamwork structure and inter-agent communication.

IJCAI Conference 1997 Conference Paper

The RoboCup Synthetic Agent Challenge

  • Hiroaki Kitano
  • Milind Tambe
  • Peter Stone
  • Manuela Veloso
  • Silvia Coradeschi
  • Eiichi Osawa
  • Hitoshi Matsubara
  • Itsuki Noda

RoboCup Challenge offers a set of challenges for intelligent agent researchers using a friendly competition in a dynamic, real-time, multiagent domain. While RoboCup in general envisions longer range challenges over the next few decades, RoboCup Challenge presents three specific challenges for the next two years: (i) learning of individual agents and teams; (ii) multi-agent team planning and plan-execution in service of teamwork; and (iii) opponent modeling. RoboCup Challenge provides a novel opportunity for machine learning, planning, and multi-agent researchers: it not only supplies a concrete domain to evaluate their techniques, but also challenges researchers to evolve these techniques to face key constraints fundamental to this domain: real-time, uncertainty, and teamwork.

NeurIPS Conference 1995 Conference Paper

Beating a Defender in Robotic Soccer: Memory-Based Learning of a Continuous Function

  • Peter Stone
  • Manuela Veloso

Learning how to adjust to an opponent's position is critical to the success of having intelligent agents collaborating towards the achievement of specific tasks in unfriendly environments. This paper describes our work on a memory-based technique for choosing an action based on a continuous-valued state attribute indicating the position of an opponent. We investigate the question of how an agent performs in nondeterministic variations of the training situations. Our experiments indicate that when the random variations fall within some bound of the initial training, the agent performs better with some initial training rather than from a tabula-rasa.