Arrow Research search

Author name cluster

Peter Stone 0001

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

95 papers
1 author row

Possible papers

IROS Conference 2025 Conference Paper

Dyna-LfLH: Learning Agile Navigation in Dynamic Environments from Learned Hallucination

  • Saad Abdul Ghani
  • Zizhao Wang
  • Peter Stone 0001
  • Xuesu Xiao

This paper introduces Dynamic Learning from Learned Hallucination (Dyna-LfLH), a self-supervised method for training motion planners to navigate environments with dense and dynamic obstacles. Classical planners struggle with dense, unpredictable obstacles due to limited computation, while learning-based planners face challenges in acquiring high-quality demonstrations for imitation learning or dealing with exploration inefficiencies in reinforcement learning. Building on Learning from Hallucination (LfH), which synthesizes training data from past successful navigation experiences in simpler environments, Dyna-LfLH incorporates dynamic obstacles by generating them through a learned latent distribution. This enables efficient and safe motion planner training. We evaluate Dyna-LfLH on a ground robot in both simulated and real environments, achieving up to a 25% improvement in success rate compared to baselines.
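The core idea of the abstract above is to generate dynamic obstacles by sampling from a learned latent distribution. As a rough, hypothetical illustration (not the authors' implementation), the sketch below draws a latent code from a Gaussian prior and "decodes" it into a constant-velocity obstacle trajectory with a fixed linear readout; the real decoder would be learned.

```python
import random

def sample_latent(dim, rng):
    # Draw a latent code from a standard Gaussian prior
    # (stand-in for the learned latent distribution in Dyna-LfLH).
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def decode_obstacle_trajectory(z, horizon=10):
    # Hypothetical "decoder": maps a latent code to a 2-D obstacle
    # trajectory. The paper trains this mapping; a fixed linear
    # readout keeps the sketch self-contained.
    x, y = z[0], z[1]
    vx, vy = 0.1 * z[2], 0.1 * z[3]
    return [(x + t * vx, y + t * vy) for t in range(horizon)]

rng = random.Random(0)
z = sample_latent(4, rng)
traj = decode_obstacle_trajectory(z, horizon=10)
```

Sampling many such latents would yield a population of synthetic dynamic obstacles for self-supervised planner training.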

ICRA Conference 2025 Conference Paper

FLaRe: Achieving Masterful and Adaptive Robot Policies with Large-Scale Reinforcement Learning Fine-Tuning

  • Jiaheng Hu
  • Rose Hendrix
  • Ali Farhadi
  • Aniruddha Kembhavi
  • Roberto Martín-Martín
  • Peter Stone 0001
  • Kuo-Hao Zeng
  • Kiana Ehsani

In recent years, the Robotics field has initiated several efforts toward building generalist robot policies through large-scale multi-task Behavior Cloning. However, direct deployments of these policies have led to unsatisfactory performance, where the policy struggles with unseen states and tasks. How can we break through the performance plateau of these models and elevate their capabilities to new heights? In this paper, we propose FLaRe, a large-scale Reinforcement Learning fine-tuning framework that integrates robust pre-trained representations, large-scale training, and gradient stabilization techniques. Our method aligns pre-trained policies towards task completion, achieving state-of-the-art (SoTA) performance both on previously demonstrated and on entirely novel tasks and embodiments. Specifically, on a set of long-horizon mobile manipulation tasks, FLaRe achieves an average success rate of 79. 5% in unseen environments, with absolute improvements of $+23. 6 \%$ in simulation and $+30. 7 \%$ on real robots over prior SoTA methods. By utilizing only sparse rewards, our approach can enable generalizing to new capabilities beyond the pretraining data with minimal human effort. Moreover, we demonstrate rapid adaptation to new embodiments and behaviors with less than a day of fine-tuning. Videos, code, and appendix can be found on the project website at robot-flare.github.io

IROS Conference 2025 Conference Paper

GACL: Grounded Adaptive Curriculum Learning with Active Task and Performance Monitoring

  • Linji Wang
  • Zifan Xu
  • Peter Stone 0001
  • Xuesu Xiao

Curriculum learning has emerged as a promising approach for training complex robotics tasks, yet current applications predominantly rely on manually designed curricula, which demand significant engineering effort and can suffer from subjective and suboptimal human design choices. While automated curriculum learning has shown success in simple domains like grid worlds and games where task distributions can be easily specified, robotics tasks present unique challenges: they require handling complex task spaces while maintaining relevance to target domain distributions that are only partially known through limited samples. To this end, we propose Grounded Adaptive Curriculum Learning (GACL), a framework specifically designed for robotics curriculum learning with three key innovations: (1) a task representation that consistently handles complex robot task design, (2) an active performance tracking mechanism that allows adaptive curriculum generation appropriate for the robot’s current capabilities, and (3) a grounding approach that maintains target domain relevance through alternating sampling between reference and synthetic tasks. We validate GACL on wheeled navigation in constrained environments and quadruped locomotion in challenging 3D confined spaces, achieving 6.8% and 6.1% higher success rates, respectively, than state-of-the-art methods in each domain.
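The grounding mechanism described above alternates sampling between reference and synthetic tasks. A minimal sketch of that alternation, with an assumed task format and a hypothetical `synthesize_task` generator standing in for the curriculum's task synthesis:

```python
import random

def sample_curriculum_task(step, reference_tasks, synthesize_task, mix_every=2):
    # Grounding by alternation: every `mix_every`-th step draws a task
    # from the (partially known) reference distribution; otherwise a
    # synthetic task is generated for the current curriculum stage.
    if step % mix_every == 0:
        return ("reference", random.choice(reference_tasks))
    return ("synthetic", synthesize_task(step))

refs = ["narrow_corridor", "cluttered_room"]
tasks = [sample_curriculum_task(t, refs, lambda s: f"gen_{s}") for t in range(4)]
```

Interleaving reference samples is what keeps the generated curriculum anchored to the target domain distribution.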

ICML Conference 2025 Conference Paper

Hyperspherical Normalization for Scalable Deep Reinforcement Learning

  • Hojoon Lee
  • Youngdo Lee
  • Takuma Seno
  • Donghu Kim
  • Peter Stone 0001
  • Jaegul Choo

Scaling up the model size and computation has brought consistent performance improvements in supervised learning. However, this lesson often fails to apply to reinforcement learning (RL) because training the model on non-stationary data easily leads to overfitting and unstable optimization. In response, we introduce SimbaV2, a novel RL architecture designed to stabilize optimization by (i) constraining the growth of weight and feature norm by hyperspherical normalization; and (ii) using a distributional value estimation with reward scaling to maintain stable gradients under varying reward magnitudes. Using the soft actor-critic as a base algorithm, SimbaV2 scales up effectively with larger models and greater compute, achieving state-of-the-art performance on 57 continuous control tasks across 4 domains.
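The first SimbaV2 ingredient named above, hyperspherical normalization, amounts to projecting vectors onto the unit hypersphere so their norms cannot grow during training. A minimal sketch (vectors as plain lists; the actual architecture applies this to network weights and features):

```python
import math

def hypersphere_normalize(v, eps=1e-8):
    # Project a vector onto the unit hypersphere, bounding its norm
    # so it cannot drift upward over the course of training.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]

feat = [3.0, 4.0]
unit = hypersphere_normalize(feat)  # norm becomes 1, direction preserved
```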

IROS Conference 2025 Conference Paper

L3M+P: Lifelong Planning with Large Language Models

  • Krish Agarwal
  • Yuqian Jiang
  • Jiaheng Hu
  • Bo Liu 0042
  • Peter Stone 0001

By combining classical planning methods with large language models (LLMs), recent research such as LLM+P has enabled agents to plan for general tasks given in natural language. However, scaling these methods to general-purpose service robots remains challenging: (1) classical planning algorithms generally require a detailed and consistent specification of the environment, which is not always readily available; and (2) existing frameworks mainly focus on isolated planning tasks, whereas robots are often meant to serve in long-term continuous deployments, and therefore must maintain a dynamic memory of the environment which can be updated with multi-modal inputs and extracted as planning knowledge for future tasks. To address these two issues, this paper introduces L3M+P (Lifelong LLM+P), a framework that uses an external knowledge graph as a representation of the world state. The graph can be updated from multiple sources of information, including sensory input and natural language interactions with humans. L3M+P enforces rules for the expected format of the absolute world state graph to maintain consistency between graph updates. At planning time, given a natural language description of a task, L3M+P retrieves context from the knowledge graph and generates a problem definition for classical planners. Evaluated on household robot simulators and on a real-world service robot, L3M+P achieves significant improvement over baseline methods both on accurately registering natural language state changes and on correctly generating plans, thanks to the knowledge graph retrieval and verification.
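At planning time, L3M+P turns knowledge-graph context into a problem definition for a classical planner. The toy sketch below serializes graph triples into a PDDL problem string; the triple and goal formats here are assumptions for illustration, not the paper's actual graph schema.

```python
def graph_to_pddl_problem(facts, goal, name="serve-task", domain="household"):
    # Minimal illustration: serialize knowledge-graph triples
    # (subject, relation, object) into a PDDL problem's :init section
    # and a single goal literal.
    objs = sorted({s for s, _, o in facts} | {o for _, _, o in facts})
    init = " ".join(f"({r} {s} {o})" for s, r, o in facts)
    return (f"(define (problem {name}) (:domain {domain})\n"
            f"  (:objects {' '.join(objs)})\n"
            f"  (:init {init})\n"
            f"  (:goal ({goal[1]} {goal[0]} {goal[2]})))")

facts = [("cup", "on", "table"), ("robot", "at", "kitchen")]
pddl = graph_to_pddl_problem(facts, goal=("cup", "in", "sink"))
```

A real pipeline would also retrieve only the task-relevant subgraph and verify updates for consistency, as the abstract describes.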

ICLR Conference 2025 Conference Paper

Learning a Fast Mixing Exogenous Block MDP using a Single Trajectory

  • Alexander Levine 0001
  • Peter Stone 0001
  • Amy Zhang 0001

In order to train agents that can quickly adapt to new objectives or reward functions, efficient unsupervised representation learning in sequential decision-making environments can be important. Frameworks such as the Exogenous Block Markov Decision Process (Ex-BMDP) have been proposed to formalize this representation-learning problem (Efroni et al., 2022b). In the Ex-BMDP framework, the agent's high-dimensional observations of the environment have two latent factors: a controllable factor, which evolves deterministically within a small state space according to the agent's actions, and an exogenous factor, which represents time-correlated noise, and can be highly complex. The goal of the representation learning problem is to learn an encoder that maps from observations into the controllable latent space, as well as the dynamics of this space. Efroni et al. (2022b) has shown that this is possible with a sample complexity that depends only on the size of the controllable latent space, and not on the size of the noise factor. However, this prior work has focused on the episodic setting, where the controllable latent state resets to a specific start state after a finite horizon. By contrast, if the agent can only interact with the environment in a single continuous trajectory, prior works have not established sample-complexity bounds. We propose STEEL, the first provably sample-efficient algorithm for learning the controllable dynamics of an Ex-BMDP from a single trajectory, in the function approximation setting. STEEL has a sample complexity that depends only on the sizes of the controllable latent space and the encoder function class, and (at worst linearly) on the mixing time of the exogenous noise factor. We prove that STEEL is correct and sample-efficient, and demonstrate STEEL on two toy problems. Code is available at: https://github.com/midi-lab/steel.

ICLR Conference 2025 Conference Paper

Longhorn: State Space Models are Amortized Online Learners

  • Bo Liu 0042
  • Rui Wang
  • Lemeng Wu
  • Yihao Feng
  • Peter Stone 0001
  • Qiang Liu 0001

The most fundamental capability of modern AI methods such as Large Language Models (LLMs) is the ability to predict the next token in a long sequence of tokens, known as “sequence modeling.” Although the Transformer is the current dominant approach to sequence modeling, its quadratic computational cost with respect to sequence length is a significant drawback. State-space models (SSMs) offer a promising alternative due to their linear decoding efficiency and high parallelizability during training. However, existing SSMs often rely on seemingly ad hoc linear recurrence designs. In this work, we explore SSM design through the lens of online learning, conceptualizing SSMs as meta-modules for specific online learning problems. This approach links SSM design to formulating precise online learning objectives, with state transition rules derived from optimizing these objectives. Based on this insight, we introduce a novel deep SSM architecture based on the implicit update for optimizing an online regression objective. Our experimental results show that our models outperform state-of-the-art SSMs, including the Mamba model, on standard sequence modeling benchmarks and language modeling tasks.
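The "implicit update for an online regression objective" mentioned above can be made concrete in the scalar case. For the loss 0.5·(k·s − v)², the implicit (proximal) step solves s_new = s − lr·k·(k·s_new − v) in closed form. This is only a one-dimensional illustration of the recurrence style; the paper's SSM uses vector-valued states.

```python
def implicit_regression_step(state, key, value, lr):
    # Implicit (proximal) update for 0.5 * (key * s - value)^2:
    # solving  s_new = state - lr * key * (key * s_new - value)
    # for s_new gives the closed form below. Unlike an explicit
    # gradient step, this update is stable for any lr > 0.
    return (state + lr * key * value) / (1.0 + lr * key * key)

s = 0.0
for _ in range(50):                       # repeated (key, value) pairs
    s = implicit_regression_step(s, key=2.0, value=6.0, lr=0.5)
# s approaches value / key = 3.0
```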

IROS Conference 2025 Conference Paper

Multi-Agent Inverse Reinforcement Learning in Real World Unstructured Pedestrian Crowds

  • Rohan Chandra
  • Haresh Karnan
  • Negar Mehr
  • Peter Stone 0001
  • Joydeep Biswas

Social robot navigation in crowded public spaces such as university campuses, restaurants, grocery stores, and hospitals is an increasingly important area of research. One of the core strategies for achieving this goal is to understand humans’ intent (the underlying psychological factors that govern their motion) by learning how humans assign rewards to their actions, typically via inverse reinforcement learning (IRL). Despite significant progress in IRL, learning reward functions of multiple agents simultaneously in dense unstructured pedestrian crowds has remained intractable due to the tightly coupled social interactions that occur in these scenarios, e.g., passing, intersections, swerving, and weaving. In this paper, we present a new multi-agent maximum entropy inverse reinforcement learning algorithm for real-world unstructured pedestrian crowds. Key to our approach is a simple but effective mathematical trick, which we call the "tractability-rationality trade-off," that achieves tractability at the cost of a slight reduction in accuracy. We compare our approach to classical single-agent MaxEnt IRL as well as state-of-the-art trajectory prediction methods on several datasets, including ETH, UCY, SCAND, JRDB, and a new dataset called Speedway, collected at a busy intersection on a university campus and focusing on dense, complex agent interactions. Our key findings show that, on the dense Speedway dataset, our approach ranks 1st among the top 7 baselines with >2× improvement over single-agent IRL, and is competitive with state-of-the-art large transformer-based encoder-decoder models on sparser datasets such as ETH/UCY (ranking 3rd among the top 7 baselines).

ICRA Conference 2025 Conference Paper

PRESTO: Fast Motion Planning Using Diffusion Models Based on Key-Configuration Environment Representation

  • Mingyo Seo
  • Yoonyoung Cho
  • Yoonchang Sung
  • Peter Stone 0001
  • Yuke Zhu
  • Beomjoon Kim

We introduce a learning-guided motion planning framework that generates seed trajectories using a diffusion model for trajectory optimization. Given a workspace, our method approximates the configuration space (C-space) obstacles through an environment representation consisting of a sparse set of task-related key configurations, which is then used as a conditioning input to the diffusion model. The diffusion model integrates regularization terms that encourage smooth, collision-free trajectories during training, and trajectory optimization refines the generated seed trajectories to correct any colliding segments. Our experimental results demonstrate that high-quality trajectory priors, learned through our C-space-grounded diffusion model, enable the efficient generation of collision-free trajectories in narrow-passage environments, outperforming previous learning- and planning-based baselines. Videos and additional materials can be found on the project page: https://kiwi-sherbet.github.io/PRESTO.

ICML Conference 2025 Conference Paper

Proto Successor Measure: Representing the Behavior Space of an RL Agent

  • Siddhant Agarwal
  • Harshit Sikchi
  • Peter Stone 0001
  • Amy Zhang 0001

Having explored an environment, intelligent agents should be able to transfer their knowledge to most downstream tasks within that environment without additional interactions. Referred to as "zero-shot learning", this ability remains elusive for general-purpose reinforcement learning algorithms. While recent works have attempted to produce zero-shot RL agents, they make assumptions about the nature of the tasks or the structure of the MDP. We present Proto Successor Measure: the basis set for all possible behaviors of a Reinforcement Learning Agent in a dynamical system. We prove that any possible behavior (represented using visitation distributions) can be represented using an affine combination of these policy-independent basis functions. Given a reward function at test time, we simply need to find the right set of linear weights to combine these bases corresponding to the optimal policy. We derive a practical algorithm to learn these basis functions using reward-free interaction data from the environment and show that our approach can produce the near-optimal policy at test time for any given reward function without additional environmental interactions. Project page: agarwalsiddhant10.github.io/projects/psm.html.
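The key claim above is that any behavior's visitation distribution is an affine combination of policy-independent bases, so test-time adaptation reduces to choosing weights that maximize the inner product with the reward. A toy sketch with hand-picked bases and enumerated candidate weightings (the paper optimizes the weights rather than enumerating):

```python
def policy_value(visitation, reward):
    # Value of a behavior = inner product of its state-visitation
    # distribution with the reward vector.
    return sum(d * r for d, r in zip(visitation, reward))

def best_affine_combo(bases, candidate_weights, reward):
    # Among candidate affine weightings of the basis visitation
    # vectors, pick the one with highest value for the test-time
    # reward. Weights must sum to 1 (affine combination).
    best = None
    for w in candidate_weights:
        assert abs(sum(w) - 1.0) < 1e-9
        vis = [sum(wi * b[s] for wi, b in zip(w, bases))
               for s in range(len(bases[0]))]
        val = policy_value(vis, reward)
        if best is None or val > best[0]:
            best = (val, w)
    return best

bases = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]   # toy visitation bases
reward = [0.0, 0.0, 1.0]                      # reward only on state 2
val, w = best_affine_combo(bases, [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)], reward)
```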

ICRA Conference 2025 Conference Paper

Reinforcement Learning Within the Classical Robotics Stack: A Case Study in Robot Soccer

  • Adam Labiosa
  • Zhihan Wang
  • Siddhant Agarwal
  • William Cong
  • Geethika Hemkumar
  • Abhinav Narayan Harish
  • Benjamin Hong
  • Josh Kelle

Robot decision-making in partially observable, real-time, dynamic, and multi-agent environments remains a difficult and unsolved challenge. Model-free reinforcement learning (RL) is a promising approach to learning decision-making in such domains; however, end-to-end RL in complex environments is often intractable. To address this challenge in the RoboCup Standard Platform League (SPL) domain, we developed a novel architecture integrating RL within a classical robotics stack, while employing a multi-fidelity sim2real approach and decomposing behavior into learned sub-behaviors with heuristic selection. Our architecture led to victory in the 2024 RoboCup SPL Challenge Shield Division. In this work, we fully describe our system's architecture and empirically analyze key design decisions that contributed to its success. Our approach demonstrates how RL-based behaviors can be integrated into complete robot behavior architectures.
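The "heuristic selection over learned sub-behaviors" described above can be pictured as a small rule-based dispatcher sitting on top of RL policies. The sub-behavior names and conditions below are illustrative assumptions, not the paper's actual behavior set:

```python
def select_sub_behavior(ball_visible, dist_to_ball, have_kickoff):
    # Heuristic selector over learned sub-behaviors, in the spirit of
    # decomposing a soccer policy into RL-trained skills dispatched by
    # hand-written rules (names here are hypothetical).
    if not ball_visible:
        return "search_for_ball"
    if dist_to_ball > 1.0:
        return "approach_ball"
    return "kickoff_set_play" if have_kickoff else "dribble_and_shoot"

choice = select_sub_behavior(ball_visible=True, dist_to_ball=0.4,
                             have_kickoff=False)
```

Keeping the selector classical means each learned sub-behavior can be trained, debugged, and swapped independently.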

ICLR Conference 2025 Conference Paper

SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning

  • Hojoon Lee
  • Dongyoon Hwang
  • Donghu Kim
  • Hyunseung Kim
  • Jun Jet Tai
  • Kaushik Subramanian
  • Peter R. Wurman
  • Jaegul Choo

Recent advances in CV and NLP have been largely driven by scaling up the number of network parameters, despite traditional theories suggesting that larger networks are prone to overfitting. These large networks avoid overfitting by integrating components that induce a simplicity bias, guiding models toward simple and generalizable solutions. However, in deep RL, designing and scaling up networks have been less explored. Motivated by this opportunity, we present SimBa, an architecture designed to scale up parameters in deep RL by injecting a simplicity bias. SimBa consists of three components: (i) an observation normalization layer that standardizes inputs with running statistics, (ii) a residual feedforward block to provide a linear pathway from the input to output, and (iii) a layer normalization to control feature magnitudes. By scaling up parameters with SimBa, the sample efficiency of various deep RL algorithms—including off-policy, on-policy, and unsupervised methods—is consistently improved. Moreover, solely by integrating SimBa architecture into SAC, it matches or surpasses state-of-the-art deep RL methods with high computational efficiency across DMC, MyoSuite, and HumanoidBench. These results demonstrate SimBa's broad applicability and effectiveness across diverse RL algorithms and environments.
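Of the three SimBa components, the first (observation normalization with running statistics) is easy to sketch in isolation using Welford's online algorithm; the residual block and layer normalization are omitted here.

```python
import math

class RunningNorm:
    # Observation normalization with running statistics (Welford's
    # online algorithm): standardizes each input using the mean and
    # variance of everything seen so far.
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def normalize(self, x, eps=1e-8):
        var = self.m2 / max(self.n - 1, 1)
        return (x - self.mean) / math.sqrt(var + eps)

norm = RunningNorm()
for obs in [10.0, 12.0, 14.0]:
    norm.update(obs)
z = norm.normalize(12.0)   # 12.0 is the running mean, so z is ~0
```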

ICRA Conference 2024 Conference Paper

Asynchronous Task Plan Refinement for Multi-Robot Task and Motion Planning

  • Yoonchang Sung
  • Rahul Shome
  • Peter Stone 0001

This paper explores general multi-robot task and motion planning, where multiple robots in close proximity manipulate objects while satisfying constraints and a given goal. In particular, we formulate the plan refinement problem—which, given a task plan, finds valid assignments of variables corresponding to solution trajectories—as a hybrid constraint satisfaction problem. The proposed algorithm follows several design principles that yield the following features: (1) efficient solution finding due to sequential heuristics and implicit time and roadmap representations, and (2) maximized feasible solution space obtained by introducing minimally necessary coordination-induced constraints and not relying on prevalent simplifications that exist in the literature. The evaluation results demonstrate the planning efficiency of the proposed algorithm, outperforming the synchronous approach in terms of makespan.

ICRA Conference 2024 Conference Paper

Dexterous Legged Locomotion in Confined 3D Spaces with Reinforcement Learning

  • Zifan Xu
  • Amir Hossain Raj
  • Xuesu Xiao
  • Peter Stone 0001

Recent advances of locomotion controllers utilizing deep reinforcement learning (RL) have yielded impressive results in terms of achieving rapid and robust locomotion across challenging terrain, such as rugged rocks, non-rigid ground, and slippery surfaces. However, while these controllers primarily address challenges underneath the robot, relatively little research has investigated legged mobility through confined 3D spaces, such as narrow tunnels or irregular voids, which impose all-around constraints. The cyclic gait patterns resulted from existing RL-based methods to learn parameterized locomotion skills characterized by motion parameters, such as velocity and body height, may not be adequate to navigate robots through challenging confined 3D spaces, requiring both agile 3D obstacle avoidance and robust legged locomotion. Instead, we propose to learn locomotion skills end-to-end from goal-oriented navigation in confined 3D spaces. To address the inefficiency of tracking distant navigation goals, we introduce a hierarchical locomotion controller that combines a classical planner tasked with planning waypoints to reach a faraway global goal location, and an RL-based policy trained to follow these waypoints by generating low-level motion commands. This approach allows the policy to explore its own locomotion skills within the entire solution space and facilitates smooth transitions between local goals, enabling long-term navigation towards distant goals. In simulation, our hierarchical approach succeeds at navigating through demanding confined 3D environments, outperforming both pure end-to-end learning approaches and parameterized locomotion skills. We further demonstrate the successful real-world deployment of our simulation-trained controller on a real robot.

ICRA Conference 2024 Conference Paper

Rethinking Social Robot Navigation: Leveraging the Best of Two Worlds

  • Amir Hossain Raj
  • Zichao Hu
  • Haresh Karnan
  • Rohan Chandra
  • Amirreza Payandeh
  • Luisa Mao
  • Peter Stone 0001
  • Joydeep Biswas

Empowering robots to navigate in a socially compliant manner is essential for the acceptance of robots moving in human-inhabited environments. Previously, roboticists have developed geometric navigation systems with decades of empirical validation to achieve safety and efficiency. However, the many complex factors of social compliance make geometric navigation systems hard to adapt to social situations, where no amount of tuning enables them to be both safe (people are too unpredictable) and efficient (the frozen robot problem). With recent advances in deep learning approaches, the common reaction has been to entirely discard these classical navigation systems and start from scratch, building a completely new learning-based social navigation planner. In this work, we find that this reaction is unnecessarily extreme: using a large-scale real-world social navigation dataset, SCAND, we find that geometric systems can produce trajectory plans that align with the human demonstrations in a large number of social situations. We, therefore, ask if we can rethink the social robot navigation problem by leveraging the advantages of both geometric and learning-based methods. We validate this hybrid paradigm through a proof-of-concept experiment, in which we develop a hybrid planner that switches between geometric and learning-based planning. Our experiments on both SCAND and two physical robots show that the hybrid planner can achieve better social compliance compared to using either the geometric or learning-based approach alone.
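The proof-of-concept hybrid planner above switches between geometric and learning-based planning. A minimal sketch of such a switch, where the scene format, the crowd-density criterion, and the threshold are all assumptions for illustration:

```python
def hybrid_plan(scene, geometric_planner, learned_planner,
                crowd_threshold=3):
    # Hybrid paradigm: use the well-validated geometric planner in
    # sparse scenes, and hand control to the learned social planner
    # when the scene is crowded enough that geometric planning tends
    # to freeze or behave unsafely.
    if scene["num_pedestrians"] >= crowd_threshold:
        return "learned", learned_planner(scene)
    return "geometric", geometric_planner(scene)

geo = lambda s: "shortest_path"
soc = lambda s: "socially_compliant_path"
mode, plan = hybrid_plan({"num_pedestrians": 5}, geo, soc)
```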

ICLR Conference 2024 Conference Paper

Sample Efficient Myopic Exploration Through Multitask Reinforcement Learning with Diverse Tasks

  • Ziping Xu
  • Zifan Xu
  • Runxuan Jiang
  • Peter Stone 0001
  • Ambuj Tewari

Multitask Reinforcement Learning (MTRL) approaches have gained increasing attention for their wide applications in many important Reinforcement Learning (RL) tasks. However, while recent advancements in MTRL theory have focused on improved statistical efficiency by assuming a shared structure across tasks, exploration--a crucial aspect of RL--has been largely overlooked. This paper addresses this gap by showing that when an agent is trained on a sufficiently diverse set of tasks, a generic policy-sharing algorithm with a myopic exploration design such as $\epsilon$-greedy, which is inefficient in general, can be sample-efficient for MTRL. To the best of our knowledge, this is the first theoretical demonstration of the "exploration benefits" of MTRL. It may also shed light on the enigmatic success of the wide applications of myopic exploration in practice. To validate the role of diversity, we conduct experiments on synthetic robotic control environments, where the diverse task set aligns with the task selection by automatic curriculum learning, which is empirically shown to improve sample efficiency.
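For reference, the "myopic exploration design" the abstract analyzes, $\epsilon$-greedy, is this standard rule (a generic sketch, not code from the paper):

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    # Myopic exploration: with probability epsilon take a uniformly
    # random action, otherwise exploit the greedy action. The paper's
    # result is that this simple rule can be sample-efficient when
    # training spans a sufficiently diverse set of tasks.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng)
```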

ICRA Conference 2024 Conference Paper

Wait, That Feels Familiar: Learning to Extrapolate Human Preferences for Preference-Aligned Path Planning

  • Haresh Karnan
  • Elvin Yang
  • Garrett Warnell
  • Joydeep Biswas
  • Peter Stone 0001

Autonomous mobility tasks such as last-mile delivery require reasoning about operator-indicated preferences over terrains on which the robot should navigate to ensure both robot safety and mission success. However, coping with out-of-distribution data from novel terrains or appearance changes due to lighting variations remains a fundamental problem in visual terrain-adaptive navigation. Existing solutions either require labor-intensive manual data re-collection and labeling or use hand-coded reward functions that may not align with operator preferences. In this work, we posit that operator preferences for visually novel terrains, which the robot should adhere to, can often be extrapolated from established terrain preferences within the inertial-proprioceptive-tactile domain. Leveraging this insight, we introduce Preference extrApolation for Terrain-awarE Robot Navigation (PATERN), a novel framework for extrapolating operator terrain preferences for visual navigation. PATERN learns to map inertial-proprioceptive-tactile measurements from the robot’s observations to a representation space and performs nearest-neighbor search in this space to estimate operator preferences over novel terrains. Through physical robot experiments in outdoor environments, we assess PATERN’s capability to extrapolate preferences and generalize to novel terrains and challenging lighting conditions. Compared to baseline approaches, our findings indicate that PATERN robustly generalizes to diverse terrains and varied lighting conditions, while navigating in a preference-aligned manner.
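The nearest-neighbor step described above can be sketched with a 1-NN lookup: a novel terrain inherits the preference of the closest known terrain in the representation space. Features here are plain lists and the bank entries are made up; the actual method learns the representation from inertial-proprioceptive-tactile data.

```python
def extrapolate_preference(novel_feature, bank):
    # 1-nearest-neighbor search: the novel terrain inherits the
    # preference label of the closest known terrain. `bank` holds
    # (feature_vector, preference) pairs.
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, pref = min(bank, key=lambda item: dist2(item[0], novel_feature))
    return pref

bank = [([0.1, 0.2], "preferred"),     # e.g. smooth pavement
        ([0.9, 0.8], "avoid")]         # e.g. rough gravel
label = extrapolate_preference([0.85, 0.75], bank)
```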

IROS Conference 2023 Conference Paper

A Novel Control Law for Multi-Joint Human-Robot Interaction Tasks While Maintaining Postural Coordination

  • Keya Ghonasgi
  • Reuth Mirsky
  • Adrian M. Haith
  • Peter Stone 0001
  • Ashish D. Deshpande

Exoskeleton robots are capable of safe torque-controlled interactions with a wearer while moving their limbs through predefined trajectories. However, affecting and assisting the wearer's movements while incorporating their inputs (effort and movements) effectively during an interaction remains an open problem due to the complex and variable nature of human motion. In this paper, we present a control algorithm that leverages task-specific movement behaviors to control robot torques during unstructured interactions by implementing a force field that imposes a desired joint angle coordination behavior. This control law, built using principal component analysis (PCA), is implemented and tested with the Harmony exoskeleton. We show that the proposed control law is versatile enough to allow for the imposition of different coordination behaviors with varying levels of impedance stiffness. We also test the feasibility of our method for unstructured human-robot interaction. Specifically, we demonstrate that participants in a human-subject experiment are able to effectively perform reaching tasks while the exoskeleton imposes the desired joint coordination under different movement speeds and interaction modes. Survey results further suggest that the proposed control law may offer a reduction in cognitive or motor effort. This control law opens up the possibility of using the exoskeleton to train participants to accomplish complex multi-joint motor tasks while maintaining postural coordination.
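The PCA step underlying the control law above extracts the dominant joint-coordination direction from recorded movements. A self-contained sketch using power iteration on the sample covariance (the force field built from this direction is not sketched here, and the toy joint-angle data below is fabricated for illustration):

```python
def first_principal_component(samples, iters=200):
    # Power iteration on the sample covariance matrix to recover the
    # dominant coordination direction across joints.
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    centered = [[s[j] - mean[j] for j in range(d)] for s in samples]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1)
            for j in range(d)] for i in range(d)]
    v = [1.0] * d                      # power-iteration start vector
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Two joints moving in near-fixed proportion (a coordinated reach)
angles = [[0.0, 0.0], [0.5, 0.52], [1.0, 0.98], [1.5, 1.51]]
pc = first_principal_component(angles)   # roughly [0.707, 0.707]
```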

ICRA Conference 2023 Conference Paper

Benchmarking Reinforcement Learning Techniques for Autonomous Navigation

  • Zifan Xu
  • Bo Liu 0042
  • Xuesu Xiao
  • Anirudh Nair
  • Peter Stone 0001

Deep reinforcement learning (RL) has brought many successes for autonomous robot navigation. However, there still exist important limitations that prevent real-world use of RL-based navigation systems. For example, most learning approaches lack safety guarantees, and learned navigation systems may not generalize well to unseen environments. Despite a variety of recent learning techniques to tackle these challenges in general, the lack of an open-source benchmark and reproducible learning methods specifically for autonomous navigation makes it difficult for roboticists to choose what learning methods to use for their mobile robots and for learning researchers to identify current shortcomings of general learning methods for autonomous navigation. In this paper, we identify four major desiderata of applying deep RL approaches for autonomous navigation: (D1) reasoning under uncertainty, (D2) safety, (D3) learning from limited trial-and-error data, and (D4) generalization to diverse and novel environments. Then, we explore four major classes of learning techniques with the purpose of achieving one or more of the four desiderata: memory-based neural network architectures (D1), safe RL (D2), model-based RL (D2, D3), and domain randomization (D4). By deploying these learning techniques in a new open-source large-scale navigation benchmark and real-world environments, we perform a comprehensive study aimed at establishing to what extent these techniques can achieve these desiderata for RL-based navigation systems.

UAI Conference 2023 Conference Paper

Composing Efficient, Robust Tests for Policy Selection

  • Dustin Morrill
  • Thomas J. Walsh 0001
  • Daniel Hernandez
  • Peter R. Wurman
  • Peter Stone 0001

Modern reinforcement learning systems produce many high-quality policies throughout the learning process. However, to choose which policy to actually deploy in the real world, they must be tested under an intractable number of environmental conditions. We introduce RPOSST, an algorithm to select a small set of test cases from a larger pool based on a relatively small number of sample evaluations. RPOSST treats the test case selection problem as a two-player game and optimizes a solution with provable $k$-of-$N$ robustness, bounding the error relative to a test that used all the test cases in the pool. Empirical results demonstrate that RPOSST finds a small set of test cases that identify high quality policies in a toy one-shot game, poker datasets, and a high-fidelity racing simulator.

ICRA Conference 2023 Conference Paper

Learning Perceptual Hallucination for Multi-Robot Navigation in Narrow Hallways

  • Jin Soo Park
  • Xuesu Xiao
  • Garrett Warnell
  • Harel Yedidsion
  • Peter Stone 0001

While current systems for autonomous robot navigation can produce safe and efficient motion plans in static environments, they usually generate suboptimal behaviors when multiple robots must navigate together in confined spaces. For example, when two robots meet each other in a narrow hallway, they may either turn around to find an alternative route or collide with each other. This paper presents a new approach to navigation that allows two robots to pass each other in a narrow hallway without colliding, stopping, or waiting. Our approach, Perceptual Hallucination for Hallway Passing (PHHP), learns to synthetically generate virtual obstacles (i.e., perceptual hallucination) to facilitate passing in narrow hallways by multiple robots that utilize otherwise standard autonomous navigation systems. Our experiments on physical robots in a variety of hallways show improved performance compared to multiple baselines.
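Mechanically, perceptual hallucination amounts to injecting virtual obstacles into the robot's costmap so the otherwise-unmodified planner is steered to one side of the hallway. A minimal sketch with a plain list-of-lists occupancy grid (the grid format and cost value are assumptions; the real system learns where to place the obstacles):

```python
def hallucinate_obstacle(grid, cells, cost=100):
    # Inject virtual (hallucinated) obstacles into an occupancy grid
    # so a standard planner routes around them. `cells` are
    # (row, col) indices to mark as occupied.
    out = [row[:] for row in grid]          # don't mutate the input
    for r, c in cells:
        out[r][c] = cost
    return out

free = [[0] * 4 for _ in range(3)]          # 3x4 empty hallway grid
blocked = hallucinate_obstacle(free, [(1, 1), (1, 2)])
```

The planner downstream never distinguishes hallucinated cells from real ones, which is why no change to the navigation stack itself is needed.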

ICLR Conference 2023 Conference Paper

MACTA: A Multi-agent Reinforcement Learning Approach for Cache Timing Attacks and Detection

  • Jiaxun Cui
  • Xiaomeng Yang
  • Mulong Luo
  • Geunbae Lee
  • Peter Stone 0001
  • Hsien-Hsin S. Lee
  • Benjamin Lee
  • G. Edward Suh

Security vulnerabilities in computer systems raise serious concerns as computers process an unprecedented amount of private and sensitive data today. Cache timing attacks (CTA) pose an important practical threat as they can effectively breach many protection mechanisms in today’s systems. However, the current detection techniques for cache timing attacks heavily rely on heuristics and expert knowledge, which can lead to brittleness and the inability to adapt to new attacks. To mitigate the CTA threat, we propose MACTA, a multi-agent reinforcement learning (MARL) approach that leverages population-based training to train both attackers and detectors. Following best practices, we develop a realistic simulated MARL environment, MA-AUTOCAT, which enables training and evaluation of cache-timing attackers and detectors. Our empirical results suggest that MACTA is an effective solution without any manual input from security experts. MACTA detectors can generalize to a heuristic attack not exposed in training with a 97.8% detection rate and reduce the attack bandwidth of adaptive attackers by 20% on average. In the meantime, MACTA attackers are qualitatively more effective than other attacks studied, and the average evasion rate of MACTA attackers against an unseen state-of-the-art detector can reach up to 99%. Furthermore, we found that agents equipped with a Transformer encoder can learn effective policies in situations when agents with multi-layer perceptron encoders do not in this environment, suggesting the potential of Transformer structures in CTA problems.

IROS Conference 2023 Conference Paper

Symbolic State Space Optimization for Long Horizon Mobile Manipulation Planning

  • Xiaohan Zhang 0002
  • Yifeng Zhu
  • Yan Ding 0002
  • Yuqian Jiang
  • Yuke Zhu
  • Peter Stone 0001
  • Shiqi Zhang 0001

In existing task and motion planning (TAMP) research, it is a common assumption that experts manually specify the state space for task-level planning. A well-developed state space enables the desirable distribution of limited computational resources between task planning and motion planning. However, developing such task-level state spaces can be non-trivial in practice. In this paper, we consider a long horizon mobile manipulation domain including repeated navigation and manipulation. We propose Symbolic State Space Optimization (S3O) for computing a set of abstracted locations and their 2D geometric groundings for generating task-motion plans in such domains. Our approach has been extensively evaluated in simulation and demonstrated on a real mobile manipulator working on clearing up dining tables. Results show the superiority of the proposed method over TAMP baselines in task completion rate and execution time.

ICAPS Conference 2023 Conference Paper

Task Phasing: Automated Curriculum Learning from Demonstrations

  • Vaibhav Bajaj
  • Guni Sharon
  • Peter Stone 0001

Applying reinforcement learning (RL) to sparse reward domains is notoriously challenging due to insufficient guiding signals. Common RL techniques for addressing such domains include (1) learning from demonstrations and (2) curriculum learning. While these two approaches have been studied in detail, they have rarely been considered together. This paper aims to do so by introducing a principled task-phasing approach that uses demonstrations to automatically generate a curriculum sequence. Using inverse RL from (suboptimal) demonstrations we define a simple initial task. Our task phasing approach then provides a framework to gradually increase the complexity of the task all the way to the target task, while retuning the RL agent in each phasing iteration. Two approaches for phasing are considered: (1) gradually increasing the proportion of time steps an RL agent is in control, and (2) phasing out a guiding informative reward function. We present conditions that guarantee the convergence of these approaches to an optimal policy. Experimental results on 3 sparse reward domains demonstrate that our task-phasing approaches outperform state-of-the-art approaches with respect to asymptotic performance.

EUMAS Conference 2022 Conference Paper

A Survey of Ad Hoc Teamwork Research

  • Reuth Mirsky
  • Ignacio Carlucho
  • Arrasy Rahman
  • Elliot Fosong
  • William Macke
  • Mohan Sridharan
  • Peter Stone 0001
  • Stefano V. Albrecht

Ad hoc teamwork is the research problem of designing agents that can collaborate with new teammates without prior coordination. This survey makes a two-fold contribution: First, it provides a structured description of the different facets of the ad hoc teamwork problem. Second, it discusses the progress that has been made in the field so far, and identifies the immediate and long-term open problems that need to be addressed in ad hoc teamwork.

ICRA Conference 2022 Conference Paper

Adversarial Imitation Learning from Video Using a State Observer

  • Haresh Karnan
  • Faraz Torabi
  • Garrett Warnell
  • Peter Stone 0001

The imitation learning research community has recently made significant progress towards the goal of enabling artificial agents to imitate behaviors from video demonstrations alone. However, current state-of-the-art approaches developed for this problem exhibit high sample complexity due, in part, to the high-dimensional nature of video observations. Towards addressing this issue, we introduce here a new algorithm called Visual Generative Adversarial Imitation from Observation using a State Observer (VGAIfO-SO). At its core, VGAIfO-SO seeks to address sample inefficiency using a novel, self-supervised state observer, which provides estimates of lower-dimensional proprioceptive state representations from high-dimensional images. We show experimentally in several continuous control environments that VGAIfO-SO is more sample efficient than other IfO algorithms at learning from video-only demonstrations and can sometimes even achieve performance close to the Generative Adversarial Imitation from Observation (GAIfO) algorithm that has privileged access to the demonstrator's proprioceptive state information.

ICML Conference 2022 Conference Paper

Causal Dynamics Learning for Task-Independent State Abstraction

  • Zizhao Wang
  • Xuesu Xiao
  • Zifan Xu
  • Yuke Zhu
  • Peter Stone 0001

Learning dynamics models accurately is an important goal for Model-Based Reinforcement Learning (MBRL), but most MBRL methods learn a dense dynamics model which is vulnerable to spurious correlations and therefore generalizes poorly to unseen states. In this paper, we introduce Causal Dynamics Learning for Task-Independent State Abstraction (CDL), which first learns a theoretically proved causal dynamics model that removes unnecessary dependencies between state variables and the action, thus generalizing well to unseen states. A state abstraction can then be derived from the learned dynamics, which not only improves sample efficiency but also applies to a wider range of tasks than existing state abstraction methods. Evaluated on two simulated environments and downstream tasks, both the dynamics model and policies learned by the proposed method generalize well to unseen states and the derived state abstraction improves sample efficiency compared to learning without it.

IROS Conference 2022 Conference Paper

Quantifying Changes in Kinematic Behavior of a Human-Exoskeleton Interactive System

  • Keya Ghonasgi
  • Reuth Mirsky
  • Adrian M. Haith
  • Peter Stone 0001
  • Ashish D. Deshpande

While human-robot interaction studies are becoming more common, quantification of the effects of repeated interaction with an exoskeleton remains unexplored. We draw upon existing literature in human skill assessment and present extrinsic and intrinsic performance metrics that quantify how the human-exoskeleton system's behavior changes over time. Specifically, in this paper, we present a new performance metric that provides insight into the system's kinematics associated with ‘successful’ movements, resulting in a richer characterization of changes in the system's behavior. A human subject study is carried out wherein participants learn to play a challenging and dynamic reaching game over multiple attempts, while donning an upper-body exoskeleton. The results demonstrate that repeated practice leads to learning over time, as identified through the improvement of extrinsic performance. Changes in the newly developed kinematics-based measure further illuminate how the participant's intrinsic behavior is altered over the training period. Thus, we are able to quantify the changes in the human-exoskeleton system's behavior observed in relation to learning.

ICRA Conference 2022 Conference Paper

Skeletal Feature Compensation for Imitation Learning with Embodiment Mismatch

  • Eddy Hudson
  • Garrett Warnell
  • Faraz Torabi
  • Peter Stone 0001

Learning from demonstrations in the wild (e.g., YouTube videos) is a tantalizing goal in imitation learning. However, for this goal to be achieved, imitation learning algorithms must deal with the fact that the demonstrators and learners may have bodies that differ from one another. This condition — “embodiment mismatch” — is ignored by many recent imitation learning algorithms. Our proposed imitation learning technique, SILEM (Skeletal feature compensation for Imitation Learning with Embodiment Mismatch), addresses a particular type of embodiment mismatch by introducing a learned affine transform to compensate for differences in the skeletal features obtained from the learner and expert. We create toy domains based on PyBullet's HalfCheetah and Ant to assess SILEM's benefits for this type of embodiment mismatch. We also provide qualitative and quantitative results on more realistic problems — teaching simulated humanoid agents, including Atlas from Boston Dynamics, to walk by observing human demonstrations.

IROS Conference 2022 Conference Paper

VI-IKD: High-Speed Accurate Off-Road Navigation using Learned Visual-Inertial Inverse Kinodynamics

  • Haresh Karnan
  • Kavan Singh Sikand
  • Pranav Atreya
  • Sadegh Rabiee
  • Xuesu Xiao
  • Garrett Warnell
  • Peter Stone 0001
  • Joydeep Biswas

One of the key challenges in high-speed off-road navigation on ground vehicles is that the kinodynamics of the vehicle-terrain interaction can differ dramatically depending on the terrain. Previous approaches to addressing this challenge have considered learning an inverse kinodynamics (IKD) model, conditioned on inertial information of the vehicle to sense the kinodynamic interactions. In this paper, we hypothesize that to enable accurate high-speed off-road navigation using a learned IKD model, in addition to inertial information from the past, one must also anticipate the kinodynamic interactions of the vehicle with the terrain in the future. To this end, we introduce Visual-Inertial Inverse Kinodynamics (VI-IKD), a novel learning based IKD model that is conditioned on visual information from a terrain patch ahead of the robot in addition to past inertial information, enabling it to anticipate kinodynamic interactions in the future. We validate the effectiveness of VI-IKD in accurate high-speed off-road navigation experimentally on a 1/5-scale UT-AlphaTruck off-road autonomous vehicle in both indoor and outdoor environments and show that compared to other state-of-the-art approaches, VI-IKD enables more accurate and robust off-road navigation on a variety of different terrains at speeds of up to 3.5 m/s.

ICRA Conference 2022 Conference Paper

Visually Grounded Task and Motion Planning for Mobile Manipulation

  • Xiaohan Zhang 0002
  • Yifeng Zhu
  • Yan Ding 0002
  • Yuke Zhu
  • Peter Stone 0001
  • Shiqi Zhang 0001

Task and motion planning (TAMP) algorithms aim to help robots achieve task-level goals, while maintaining motion-level feasibility. This paper focuses on TAMP domains that involve robot behaviors that take extended periods of time (e.g., long-distance navigation). In this paper, we develop a visual grounding approach to help robots probabilistically evaluate action feasibility, and introduce a TAMP algorithm, called GROP, that optimizes both feasibility and efficiency. We have collected a dataset that includes 96,000 simulated trials of a robot conducting mobile manipulation tasks, and then used the dataset to learn to ground symbolic spatial relationships for action feasibility evaluation. Compared with competitive TAMP baselines, GROP exhibited a higher task-completion rate while maintaining lower or comparable action costs. In addition to these extensive experiments in simulation, GROP is fully implemented and tested on a real robot system.

ICRA Conference 2022 Conference Paper

VOILA: Visual-Observation-Only Imitation Learning for Autonomous Navigation

  • Haresh Karnan
  • Garrett Warnell
  • Xuesu Xiao
  • Peter Stone 0001

While imitation learning for vision-based autonomous mobile robot navigation has recently received a great deal of attention in the research community, existing approaches typically require state-action demonstrations that were gathered using the deployment platform. However, what if one cannot easily outfit their platform to record these demonstration signals or, worse yet, the demonstrator does not have access to the platform at all? Is imitation learning for vision-based autonomous navigation even possible in such scenarios? In this work, we hypothesize that the answer is yes and that recent ideas from the Imitation from Observation (IfO) literature can be brought to bear such that a robot can learn to navigate using only ego-centric video collected by a demonstrator, even in the presence of viewpoint mismatch. To this end, we introduce a new algorithm, Visual-Observation-only Imitation Learning for Autonomous navigation (VOILA), that can successfully learn navigation policies from a single video demonstration collected from a physically different agent. We evaluate VOILA in the AirSim simulator and show that VOILA not only successfully imitates the expert, but that it also learns navigation policies that can generalize to novel environments. Further, we demonstrate the effectiveness of VOILA in a real-world setting by showing that it allows a wheeled Jackal robot to successfully imitate a human walking in an environment while recording video with a handheld mobile phone camera.

ICRA Conference 2021 Conference Paper

A Scavenger Hunt for Service Robots

  • Harel Yedidsion
  • Jennifer Suriadinata
  • Zifan Xu
  • Stefan Debruyn
  • Peter Stone 0001

Creating robots that can perform general-purpose service tasks in a human-populated environment has been a longstanding grand challenge for AI and Robotics research. One particularly valuable skill that is relevant to a wide variety of tasks is the ability to locate and retrieve objects upon request. This paper models this skill as a Scavenger Hunt (SH) game, which we formulate as a variation of the NP-hard stochastic traveling purchaser problem. In this problem, the goal is to find a set of objects as quickly as possible, given probability distributions of where they may be found. We investigate the performance of several solution algorithms for the SH problem, both in simulation and on a real mobile robot. We use Reinforcement Learning (RL) to train an agent to plan a minimal cost path, and show that the RL agent can outperform a range of heuristic algorithms, achieving near optimal performance. In order to stimulate research on this problem, we introduce a publicly available software stack and associated website that enable users to upload scavenger hunts which robots can download, perform, and learn from to continually improve their performance on future hunts.

ICRA Conference 2021 Conference Paper

Agile Robot Navigation through Hallucinated Learning and Sober Deployment

  • Xuesu Xiao
  • Bo Liu 0042
  • Peter Stone 0001

Learning from Hallucination (LfH) is a recent machine learning paradigm for autonomous navigation, which uses training data collected in completely safe environments and adds numerous imaginary obstacles to make the environment densely constrained, to learn navigation planners that produce feasible navigation even in highly constrained (more dangerous) spaces. However, LfH requires hallucinating the robot perception during deployment to match with the hallucinated training data, which creates a need for sometimes-infeasible prior knowledge and tends to generate very conservative planning. In this work, we propose a new LfH paradigm that does not require runtime hallucination—a feature we call "sober deployment"—and can therefore adapt to more realistic navigation scenarios. This novel Hallucinated Learning and Sober Deployment (HLSD) paradigm is tested in a benchmark testbed of 300 simulated navigation environments with a wide range of difficulty levels, and in the real-world. In most cases, HLSD outperforms both the original LfH method and a classical navigation planner.

ICRA Conference 2021 Conference Paper

APPLI: Adaptive Planner Parameter Learning From Interventions

  • Zizhao Wang
  • Xuesu Xiao
  • Bo Liu 0042
  • Garrett Warnell
  • Peter Stone 0001

While classical autonomous navigation systems can typically move robots from one point to another safely and in a collision-free manner, these systems may fail or produce suboptimal behavior in certain scenarios. The current practice in such scenarios is to manually re-tune the system’s parameters, e.g., max speed, sampling rate, inflation radius, to optimize performance. This practice requires expert knowledge and may jeopardize performance in the originally good scenarios. Meanwhile, it is relatively easy for a human to identify those failure or suboptimal cases and provide a teleoperated intervention to correct the failure or suboptimal behavior. In this work, we seek to learn from those human interventions to improve navigation performance. In particular, we propose Adaptive Planner Parameter Learning from Interventions (APPLI), in which multiple sets of navigation parameters are learned during training and applied based on a confidence measure to the underlying navigation system during deployment. In our physical experiments, the robot achieves better performance compared to the planner with static default parameters, and even dynamic parameters learned from a full human demonstration. We also show APPLI’s generalizability in another unseen physical test course, and a suite of 300 simulated navigation environments.

ICRA Conference 2021 Conference Paper

APPLR: Adaptive Planner Parameter Learning from Reinforcement

  • Zifan Xu
  • Gauraang Dhamankar
  • Anirudh Nair
  • Xuesu Xiao
  • Garrett Warnell
  • Bo Liu 0042
  • Zizhao Wang
  • Peter Stone 0001

Classical navigation systems typically operate using a fixed set of hand-picked parameters (e.g., maximum speed, sampling rate, inflation radius) and require heavy expert re-tuning in order to work in new environments. To mitigate this requirement, it has been proposed to learn parameters for different contexts in a new environment using human demonstrations collected via teleoperation. However, learning from human demonstration limits deployment to the training environment, and limits overall performance to that of a potentially-suboptimal demonstrator. In this paper, we introduce APPLR, Adaptive Planner Parameter Learning from Reinforcement, which allows existing navigation systems to adapt to new scenarios by using a parameter selection scheme discovered via reinforcement learning (RL) in a wide variety of simulation environments. We evaluate APPLR on a robot in both simulated and physical experiments, and show that it can outperform both a fixed set of hand-tuned parameters and also a dynamic parameter tuning scheme learned from human demonstration.

IROS Conference 2021 Conference Paper

Capturing Skill State in Curriculum Learning for Human Skill Acquisition

  • Keya Ghonasgi
  • Reuth Mirsky
  • Sanmit Narvekar
  • Bharath Masetty
  • Adrian M. Haith
  • Peter Stone 0001
  • Ashish D. Deshpande

Humans learn complex motor skills with practice and training. Though the learning process is not fully understood, several theories from motor learning, neuroscience, education, and game design suggest that curriculum-based training may be the key to efficient skill acquisition. However, designing such a curriculum and understanding its effects on learning are challenging problems. In this paper, we define the Human-skill Curriculum Markov Decision Process (H-CMDP) to systematize the design of training protocols. We also identify a vocabulary of performance features to enable approximation of a human’s skill level across a variety of cognitive and motor tasks. A novel task domain is introduced as a testbed to evaluate the effectiveness of our approach. Human subject experiments show that (1) participants can learn to improve their performance in tasks within this domain, (2) the learning is quantifiable via our performance features, and (3) the domain is flexible enough to create distinct levels of difficulty. The long-term goal of this work is to systematize the process of curriculum-based training toward the design of protocols for robot-mediated rehabilitation.

ICML Conference 2021 Conference Paper

Coach-Player Multi-agent Reinforcement Learning for Dynamic Team Composition

  • Bo Liu 0042
  • Qiang Liu 0001
  • Peter Stone 0001
  • Animesh Garg
  • Yuke Zhu
  • Anima Anandkumar

In real-world multi-agent systems, agents with different capabilities may join or leave without altering the team’s overarching goals. Coordinating teams with such dynamic composition is challenging: the optimal team strategy varies with the composition. We propose COPA, a coach-player framework to tackle this problem. We assume the coach has a global view of the environment and coordinates the players, who only have partial views, by distributing individual strategies. Specifically, we 1) adopt the attention mechanism for both the coach and the players; 2) propose a variational objective to regularize learning; and 3) design an adaptive communication method to let the coach decide when to communicate with the players. We validate our methods on a resource collection task, a rescue game, and the StarCraft micromanagement tasks. We demonstrate zero-shot generalization to new team compositions. Our method achieves comparable or better performance than the setting where all players have a full view of the environment. Moreover, we see that the performance remains high even when the coach communicates as little as 13% of the time using the adaptive communication strategy.

IROS Conference 2021 Conference Paper

DEALIO: Data-Efficient Adversarial Learning for Imitation from Observation

  • Faraz Torabi
  • Garrett Warnell
  • Peter Stone 0001

In imitation learning from observation (IfO), a learning agent seeks to imitate a demonstrating agent using only observations of the demonstrated behavior without access to the control signals generated by the demonstrator. Recent methods based on adversarial imitation learning have led to state-of-the-art performance on IfO problems, but they typically suffer from high sample complexity due to a reliance on data-inefficient, model-free reinforcement learning algorithms. This issue makes them impractical to deploy in real-world settings, where gathering samples can incur high costs in terms of time, energy, and risk. In this work, we hypothesize that we can incorporate ideas from model-based reinforcement learning with adversarial methods for IfO in order to increase the data efficiency of these methods without sacrificing performance. Specifically, we consider time-varying linear Gaussian policies, and propose a method that integrates the linear-quadratic regulator with path integral policy improvement into an existing adversarial IfO framework. The result is a more data-efficient IfO algorithm with better performance, which we show empirically in four simulation domains: using far fewer interactions with the environment, the proposed method exhibits similar or better performance than the existing technique.

ICRA Conference 2021 Conference Paper

Efficient Real-Time Inference in Temporal Convolution Networks

  • Piyush Khandelwal
  • James MacGlashan
  • Peter R. Wurman
  • Peter Stone 0001

It has been recently demonstrated that Temporal Convolution Networks (TCNs) provide state-of-the-art results in many problem domains where the input data is a time-series. TCNs typically incorporate information from a long history of inputs (the receptive field) into a single output using many convolution layers. Real-time inference using a trained TCN can be challenging on devices with limited compute and memory, especially if the receptive field is large. This paper introduces the RT-TCN algorithm that reuses the output of prior convolution operations to minimize the computational requirements and persistent memory footprint of a TCN during real-time inference. We also show that when a TCN is trained using time slices of the input time-series, it can be executed continually in real time using RT-TCN. In addition, we provide TCN architecture guidelines that ensure that real-time inference can be performed within memory and computational constraints.
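The reuse idea described in the abstract — caching past values so each new time step costs only one kernel application rather than recomputing the whole receptive field — can be illustrated for a single dilated causal convolution layer. This is a minimal sketch under assumed conventions (kernel taps ordered oldest to newest), not the RT-TCN implementation; in a stacked TCN, each layer would buffer the previous layer's streaming outputs the same way.

```python
import numpy as np

class StreamingCausalConv:
    """One dilated causal convolution layer that caches recent inputs in a
    ring-style buffer, so streaming inference costs one dot product per step
    (an illustrative sketch of output reuse, not the paper's RT-TCN code)."""

    def __init__(self, kernel, dilation=1):
        self.kernel = np.asarray(kernel, dtype=float)  # taps, oldest -> newest
        self.dilation = dilation
        # The buffer must span the layer's receptive field: (k-1)*d + 1 samples.
        self.buf = np.zeros((len(self.kernel) - 1) * dilation + 1)

    def step(self, x):
        """Consume one new sample and emit one output sample."""
        self.buf = np.roll(self.buf, -1)
        self.buf[-1] = x
        # Pick every dilation-th sample ending at the newest one,
        # ordered oldest -> newest to match the kernel taps.
        taps = self.buf[::-1][::self.dilation][::-1]
        return float(np.dot(self.kernel, taps))
```

For example, with kernel `[1, 1]` and dilation 1, feeding the stream 1, 2, 3 yields outputs 1, 3, 5 (each output is the sum of the current and previous sample), while dilation 2 sums the current sample and the one two steps back.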

IROS Conference 2021 Conference Paper

From Agile Ground to Aerial Navigation: Learning from Learned Hallucination

  • Zizhao Wang
  • Xuesu Xiao
  • Alexander J. Nettekoven
  • Kadhiravan Umasankar
  • Anika Singh
  • Sriram Bommakanti
  • Ufuk Topcu
  • Peter Stone 0001

This paper presents a self-supervised Learning from Learned Hallucination (LfLH) method to learn fast and reactive motion planners for ground and aerial robots to navigate through highly constrained environments. The recent Learning from Hallucination (LfH) paradigm for autonomous navigation executes motion plans by random exploration in completely safe obstacle-free spaces, uses hand-crafted hallucination techniques to add imaginary obstacles to the robot’s perception, and then learns motion planners to navigate in realistic, highly-constrained, dangerous spaces. However, current hand-crafted hallucination techniques need to be tailored for specific robot types (e.g., a differential drive ground vehicle), and use approximations heavily dependent on certain assumptions (e.g., a short planning horizon). In this work, instead of manually designing hallucination functions, LfLH learns to hallucinate obstacle configurations, where the motion plans from random exploration in open space are optimal, in a self-supervised manner. LfLH is robust to different robot types and does not make assumptions about the planning horizon. Evaluated in both simulated and physical environments with a ground and an aerial robot, LfLH outperforms or performs comparably to previous hallucination approaches, along with sampling- and optimization-based classical methods.

IROS Conference 2021 Conference Paper

Team Orienteering Coverage Planning with Uncertain Reward

  • Bo Liu 0042
  • Xuesu Xiao
  • Peter Stone 0001

Many municipalities and large organizations have fleets of vehicles that need to be coordinated for tasks such as garbage collection or infrastructure inspection. Motivated by this need, this paper focuses on the common subproblem in which a team of vehicles needs to plan coordinated routes to patrol an area over iterations while minimizing temporally and spatially dependent costs. In particular, at a specific location (e.g., a vertex on a graph), we assume the cost accumulates over time and its growth rate is a random variable with a fixed but unknown mean, and the cost is reset to zero whenever any vehicle visits the vertex (representing the robot "servicing" the vertex). We formulate this problem in graph terminology and call it Team Orienteering Coverage Planning with Uncertain Reward (TOCPUR). We propose to solve TOCPUR by simultaneously estimating the accumulated cost at every vertex on the graph and solving a novel variant of the Team Orienteering Problem (TOP) iteratively, which we call the Team Orienteering Coverage Problem (TOCP). We provide the first mixed integer programming formulation for the TOCP, as a significant adaptation of the original TOP. We introduce a new benchmark consisting of hundreds of randomly generated graphs for comparing different methods. We show the proposed solution outperforms both the exact TOP solution and a greedy algorithm. In addition, we provide a demo of our method on a team of three physical robots in a real-world environment. The code is publicly available at https://github.com/Cranial-XIX/TOCPUR.git.
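The cost model described in the abstract — per-vertex costs that grow at a random rate with a fixed but unknown mean, and reset to zero whenever a vehicle visits the vertex — can be sketched as a small simulation. All names below are illustrative assumptions, and the exponential growth distribution is an arbitrary choice for the sketch, not specified by the paper.

```python
import numpy as np

def simulate_costs(mean_rates, routes, rng=None):
    """Illustrative sketch of the TOCPUR cost process (not the paper's code).

    mean_rates : per-vertex mean of the random cost growth per step
    routes     : routes[t] is the set of vertices some vehicle visits at step t
    Returns the total cost accumulated across all vertices and time steps.
    """
    rng = rng or np.random.default_rng(0)
    cost = np.zeros(len(mean_rates))
    total = 0.0
    for visited in routes:
        # Cost grows by a random amount whose mean is fixed but, to the
        # planner, unknown (here: exponential with the given mean).
        cost += rng.exponential(mean_rates)
        for v in visited:
            cost[v] = 0.0  # a vehicle "services" the vertex, resetting its cost
        total += cost.sum()
    return total
```

One consequence the sketch makes concrete: a patrol that services every vertex at every step incurs zero accumulated cost, while leaving vertices unvisited lets their costs compound, which is exactly the trade-off the team's routes must optimize.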

ICRA Conference 2021 Conference Paper

Towards Safe Motion Planning in Human Workspaces: A Robust Multi-agent Approach

  • Shih-Yun Lo
  • Benito Fernandez
  • Peter Stone 0001
  • Andrea Thomaz

It is becoming increasingly feasible for robots to share a workspace with humans. However, for them to do so safely while maintaining agile performance, they need the ability to smoothly handle the dynamics and uncertainty caused by human motions. Markov Decision Processes (MDPs) serve as a common framework to formulate robot planning problems. However, because of its single-agent formulation, such a planner cannot account for human reactions when evaluating robot actions. The robot can thus suffer from unsafe motions and move in ways that are hard for nearby humans to understand. To resolve this, we instead model robot planning in human workspaces as a Stochastic Game, and contribute a robust planning algorithm, which enables the robot to account for its prediction errors in human responses to prevent collisions without losing agility, in contrast to traditional maximin optimization techniques, by applying the maximin operation only at "critical states". We validate the approach under partial knowledge of pedestrian behaviors, and show that our approach encounters zero collisions despite imperfect prediction, while improving path efficiency compared to baselines.

ICRA Conference 2021 Conference Paper

Watch Where You're Going! Gaze and Head Orientation as Predictors for Social Robot Navigation

  • Blake Holman
  • Abrar Anwar
  • Akash Singh
  • Mauricio Tec
  • Justin W. Hart
  • Peter Stone 0001

Mobile robots deployed in human-populated environments must be able to safely and comfortably navigate in close proximity to people. Head orientation and gaze are both mechanisms which help people to interpret where other people intend to walk, which in turn enables them to coordinate their movement. Head orientation has previously been leveraged to develop classifiers which are able to predict the goal of a person’s walking motion. Gaze is believed to generally precede head orientation, with a person quickly moving their eyes to a target and then following it with a turn of their head. This study leverages state-of-the-art virtual reality technology to place participants into a simulated environment in which their gaze and motion can be observed. The results of this study indicate that position, velocity, head orientation, and gaze can all be used as predictive features of the goal of a person’s walking motion. The results also indicate that gaze both precedes head orientation and can be used to predict the goal of a person’s walking motion at a higher level of accuracy earlier in their walking trajectory. These findings can be leveraged in the design of social navigation systems for mobile robots.

IROS Conference 2020 Conference Paper

Deep R-Learning for Continual Area Sweeping

  • Rishi Shah
  • Yuqian Jiang
  • Justin W. Hart
  • Peter Stone 0001

Coverage path planning is a well-studied problem in robotics in which a robot must plan a path that passes through every point in a given area repeatedly, usually with a uniform frequency. To address the scenario in which some points need to be visited more frequently than others, this problem has been extended to non-uniform coverage planning. This paper considers the variant of non-uniform coverage in which the robot does not know the distribution of relevant events beforehand and must nevertheless learn to maximize the rate of detecting events of interest. This continual area sweeping problem has been previously formalized in a way that makes strong assumptions about the environment, and to date only a greedy approach has been proposed. We generalize the continual area sweeping formulation to include fewer environmental constraints, and propose a novel approach based on reinforcement learning in a Semi-Markov Decision Process. This approach is evaluated in an abstract simulation and in a high fidelity Gazebo simulation. These evaluations show significant improvement upon the existing approach in general settings, which is especially relevant in the growing area of service robotics. We also present a video demonstration on a real service robot.
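
The semi-MDP approach above rests on average-reward reinforcement learning; a minimal tabular R-learning sketch conveys the core update (the paper's method is a deep, semi-Markov variant, and all names here are illustrative, not the authors' code):

```python
import numpy as np

def r_learning(transitions, n_states, n_actions, alpha=0.1, beta=0.01):
    """Tabular R-learning (average-reward RL), a simplified stand-in for
    the deep variant described in the paper. `transitions` is a list of
    (s, a, r, s_next, greedy) tuples; `greedy` flags whether the action
    taken was greedy with respect to the current Q estimate."""
    Q = np.zeros((n_states, n_actions))
    rho = 0.0  # running estimate of the average reward rate
    for s, a, r, s_next, greedy in transitions:
        # average-reward TD error: reward relative to the average rate
        td = r - rho + Q[s_next].max() - Q[s, a]
        Q[s, a] += alpha * td
        if greedy:  # update rho only on greedy steps, as in R-learning
            rho += beta * (r + Q[s_next].max() - Q[s].max() - rho)
    return Q, rho
```

Here `rho` plays the role of the event-detection rate being maximized: rather than discounting, the agent learns values relative to the average reward per step.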

ICML Conference 2020 Conference Paper

Reducing Sampling Error in Batch Temporal Difference Learning

  • Brahma S. Pavse
  • Ishan Durugkar
  • Josiah P. Hanna
  • Peter Stone 0001

Temporal difference (TD) learning is one of the main foundations of modern reinforcement learning. This paper studies the use of TD(0), a canonical TD algorithm, to estimate the value function of a given policy from a batch of data. In this batch setting, we show that TD(0) may converge to an inaccurate value function because the update following an action is weighted according to the number of times that action occurred in the batch – not the true probability of the action under the given policy. To address this limitation, we introduce Policy Sampling Error Corrected TD(0) (PSEC-TD(0)). PSEC-TD(0) first estimates the empirical distribution of actions in each state in the batch and then uses importance sampling to correct for the mismatch between the empirical weighting and the correct weighting for updates following each action. We refine the concept of a certainty-equivalence estimate and argue that PSEC-TD(0) is a more data efficient estimator than TD(0) for a fixed batch of data. Finally, we conduct an empirical evaluation of PSEC-TD(0) on three batch value function learning tasks, with a hyperparameter sensitivity analysis, and show that PSEC-TD(0) produces value function estimates with lower mean squared error than TD(0).
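
The correction described above can be sketched in a few lines of tabular code (a toy illustration under the paper's stated idea; function and variable names are ours, not the authors'):

```python
from collections import defaultdict

def psec_td0(batch, policy, gamma=0.99, alpha=0.05, sweeps=20):
    """Sketch of PSEC-TD(0): reweight each batch TD(0) update by
    pi(a|s) / pi_D(a|s), where pi_D is the empirical action distribution
    in the batch. `policy[s][a]` gives pi(a|s); `batch` is a list of
    (s, a, r, s_next) with s_next=None at episode end."""
    # Empirical action distribution pi_D(a|s) from state-action counts
    sa_count, s_count = defaultdict(int), defaultdict(int)
    for s, a, r, s_next in batch:
        sa_count[(s, a)] += 1
        s_count[s] += 1
    V = defaultdict(float)
    for _ in range(sweeps):
        for s, a, r, s_next in batch:
            pi_d = sa_count[(s, a)] / s_count[s]
            w = policy[s][a] / pi_d          # PSEC correction weight
            target = r + (gamma * V[s_next] if s_next is not None else 0.0)
            V[s] += alpha * w * (target - V[s])
    return dict(V)
```

On a batch where one action was over-sampled relative to the evaluation policy, the correction pulls the estimate toward the policy's true value, whereas the uncorrected loop stays biased toward the empirical action frequencies.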

IROS Conference 2020 Conference Paper

Reinforced Grounded Action Transformation for Sim-to-Real Transfer

  • Haresh Karnan
  • Siddharth Desai
  • Josiah P. Hanna
  • Garrett Warnell
  • Peter Stone 0001

Robots can learn to do complex tasks in simulation, but often, learned behaviors fail to transfer well to the real world due to simulator imperfections (the "reality gap"). Some existing solutions to this sim-to-real problem, such as Grounded Action Transformation (GAT), use a small amount of real-world experience to minimize the reality gap by "grounding" the simulator. While very effective in certain scenarios, GAT is not robust on problems that use complex function approximation techniques to model a policy. In this paper, we introduce Reinforced Grounded Action Transformation (RGAT), a new sim-to-real technique that uses Reinforcement Learning (RL) not only to update the target policy in simulation, but also to perform the grounding step itself. This novel formulation allows for end-to-end training during the grounding step, which, compared to GAT, produces a better grounded simulator. Moreover, we show experimentally in several MuJoCo domains that our approach leads to successful transfer for policies modeled using neural networks.

IROS Conference 2020 Conference Paper

Stochastic Grounded Action Transformation for Robot Learning in Simulation

  • Siddharth Desai
  • Haresh Karnan
  • Josiah P. Hanna
  • Garrett Warnell
  • Peter Stone 0001

Robot control policies learned in simulation often do not transfer well to the real world. Many existing solutions to this sim-to-real problem, such as the Grounded Action Transformation (GAT) algorithm, seek to correct for, or "ground", these differences by matching the simulator to the real world. However, the efficacy of these approaches is limited if they do not explicitly account for stochasticity in the target environment. In this work, we analyze the problems associated with grounding a deterministic simulator in a stochastic real-world environment, and we present examples where GAT fails to transfer a good policy due to stochastic transitions in the target domain. In response, we introduce the Stochastic Grounded Action Transformation (SGAT) algorithm, which models this stochasticity when grounding the simulator. We find experimentally, for both simulated and physical target domains, that SGAT can find policies that are robust to stochasticity in the target domain.

ICML Conference 2019 Conference Paper

Importance Sampling Policy Evaluation with an Estimated Behavior Policy

  • Josiah P. Hanna
  • Scott Niekum
  • Peter Stone 0001

We consider the problem of off-policy evaluation in Markov decision processes. Off-policy evaluation is the task of evaluating the expected return of one policy with data generated by a different, behavior policy. Importance sampling is a technique for off-policy evaluation that re-weights off-policy returns to account for differences in the likelihood of the returns between the two policies. In this paper, we study importance sampling with an estimated behavior policy where the behavior policy estimate comes from the same set of data used to compute the importance sampling estimate. We find that this estimator often lowers the mean squared error of off-policy evaluation compared to importance sampling with the true behavior policy or using a behavior policy that is estimated from a separate data set. Intuitively, estimating the behavior policy in this way corrects for error due to sampling in the action-space. Our empirical results also extend to other popular variants of importance sampling and show that estimating a non-Markovian behavior policy can further lower large-sample mean squared error even when the true behavior policy is Markovian.
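
The effect described above can be seen in a minimal one-state sketch (illustrative names, not the paper's code): ordinary importance sampling where the behavior policy is estimated by action counts from the very batch being evaluated corrects the action-sampling error that the true behavior policy leaves in.

```python
from collections import defaultdict

def is_estimate(trajectories, pi_e, pi_b=None):
    """Ordinary importance-sampling estimate of pi_e's expected return.
    If pi_b is None, the behavior policy is estimated from the same data
    by state-action counts (the paper's key idea). Each trajectory is a
    list of (s, a, r); pi_e and pi_b map (s, a) to a probability."""
    if pi_b is None:
        sa, s_ct = defaultdict(int), defaultdict(int)
        for traj in trajectories:
            for s, a, _ in traj:
                sa[(s, a)] += 1
                s_ct[s] += 1
        pi_b = lambda s, a: sa[(s, a)] / s_ct[s]  # empirical behavior policy
    total = 0.0
    for traj in trajectories:
        w, ret = 1.0, 0.0
        for s, a, r in traj:
            w *= pi_e(s, a) / pi_b(s, a)  # cumulative importance weight
            ret += r
        total += w * ret
    return total / len(trajectories)
```

With a uniform evaluation policy over two actions and a batch that happened to sample one action three times out of four, the estimated-behavior-policy version recovers the true value exactly, while using the true (uniform) behavior policy leaves the sampling error in the estimate.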

ICRA Conference 2019 Conference Paper

Improving Grounded Natural Language Understanding through Human-Robot Dialog

  • Jesse Thomason
  • Aishwarya Padmakumar
  • Jivko Sinapov
  • Nick Walker 0001
  • Yuqian Jiang
  • Harel Yedidsion
  • Justin W. Hart
  • Peter Stone 0001

Natural language understanding for robotics can require substantial domain- and platform-specific engineering. For example, for mobile robots to pick-and-place objects in an environment to satisfy human commands, we can specify the language humans use to issue such commands, and connect concept words like "red can" to physical object properties. One way to alleviate this engineering for a new domain is to enable robots in human environments to adapt dynamically, continually learning new language constructions and perceptual concepts. In this work, we present an end-to-end pipeline for translating natural language commands to discrete robot actions, and use clarification dialogs to jointly improve language parsing and concept grounding. We train and evaluate this agent in a virtual setting on Amazon Mechanical Turk, and we transfer the learned agent to a physical robot platform to demonstrate it in the real world.

ICAPS Conference 2019 Conference Paper

Open-World Reasoning for Service Robots

  • Yuqian Jiang
  • Nick Walker 0001
  • Justin W. Hart
  • Peter Stone 0001

A service robot accepting verbal commands from a human operator is likely to encounter requests that reference objects not currently represented in its knowledge base. In domestic or office settings, the construction of a complete knowledge base would be cumbersome and unlikely to succeed in most real-world deployments. The world that such a robot operates in is thus “open” in the sense that some objects that it must act on in the real world are not described in its internal representation. However, when an operator gives a command referencing an object that the robot has not yet observed (and thus not incorporated into its knowledge base), we can think of the object as being hypothetical to the robot. This paper presents a novel method for closing the robot’s world model for planning purposes by introducing hypothetical objects into the robot’s knowledge base, reasoning about these hypothetical objects, and acting on these hypotheses in the real world. We use our implementation of this method on a domestic service robot as an illustrative demonstration to explore how it works in practice.

IROS Conference 2019 Conference Paper

Task-Motion Planning with Reinforcement Learning for Adaptable Mobile Service Robots

  • Yuqian Jiang
  • Fangkai Yang
  • Shiqi Zhang 0001
  • Peter Stone 0001

Task-motion planning (TMP) addresses the problem of efficiently generating executable and low-cost task plans in a discrete space such that the (initially unknown) action costs are determined by motion plans in a corresponding continuous space. A task-motion plan for a mobile service robot that behaves in a highly dynamic domain can be sensitive to domain uncertainty and changes, leading to suboptimal behaviors or execution failures. In this paper, we propose a novel framework, TMP-RL, which is an integration of TMP and reinforcement learning (RL), to solve the problem of robust TMP in dynamic and uncertain domains. The robot first generates a low-cost, feasible task-motion plan by iteratively planning in the discrete space and updating relevant action costs evaluated by the motion planner in continuous space. During execution, the robot learns via model-free RL to further improve its task-motion plans. RL enables adaptability to the current domain, but can be costly in terms of experience; using TMP, which does not rely on experience, can jump-start the learning process before executing in the real world. TMP-RL is evaluated in a mobile service robot domain where the robot navigates in an office area, showing significantly improved adaptability to unseen domain dynamics over TMP and task planning (TP)-RL methods.

IROS Conference 2018 Conference Paper

PRISM: Pose Registration for Integrated Semantic Mapping

  • Justin W. Hart
  • Rishi Shah
  • Sean Kirmani
  • Nick Walker 0001
  • Kathryn Baldauf
  • Nathan John
  • Peter Stone 0001

Many robotics applications involve navigating to positions specified in terms of their semantic significance. A robot operating in a hotel may need to deliver room service to a named room. In a hospital, it may need to deliver medication to a patient's room. The Building-Wide Intelligence Project at UT Austin has been developing a fleet of autonomous mobile robots, called BWIBots, which perform tasks in the computer science department. Tasks include guiding a person, delivering a message, or bringing an object to a location such as an office, lecture hall, or classroom. The process of constructing a map that a robot can use for navigation has been simplified by modern SLAM algorithms. The attachment of semantics to map data, however, remains a tedious manual process of labeling locations in otherwise automatically generated maps. This paper introduces a system called PRISM to automate a step in this process by enabling a robot to localize door signs - a semantic markup intended to aid the human occupants of a building - and to annotate these locations in its map.

ICML Conference 2017 Conference Paper

Data-Efficient Policy Evaluation Through Behavior Policy Search

  • Josiah P. Hanna
  • Philip S. Thomas
  • Peter Stone 0001
  • Scott Niekum

We consider the task of evaluating a policy for a Markov decision process (MDP). The standard unbiased technique for evaluating a policy is to deploy the policy and observe its performance. We show that the data collected from deploying a different policy, commonly called the behavior policy, can be used to produce unbiased estimates with lower mean squared error than this standard technique. We derive an analytic expression for the optimal behavior policy — the behavior policy that minimizes the mean squared error of the resulting estimates. Because this expression depends on terms that are unknown in practice, we propose a novel policy evaluation sub-problem, behavior policy search: searching for a behavior policy that reduces mean squared error. We present a behavior policy search algorithm and empirically demonstrate its effectiveness in lowering the mean squared error of policy performance estimates.
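
The advantage of a well-chosen behavior policy can be illustrated in a one-step (bandit) setting. The sketch below, a toy illustration rather than the paper's algorithm (which searches for such a policy rather than assuming it), compares the empirical MSE of importance-sampled estimates under the on-policy choice and under a behavior policy shifted toward high-|reward| actions; all names are ours:

```python
import random

def is_mse(pi_e, pi_b, rewards, n=20000, seed=0):
    """Empirical MSE of single-sample importance-sampled estimates of
    pi_e's expected reward in a one-step problem, with actions drawn
    from pi_b. `rewards` maps each action to its (deterministic) reward."""
    rng = random.Random(seed)
    actions = list(rewards)
    true_val = sum(pi_e[a] * rewards[a] for a in actions)
    errs = []
    for _ in range(n):
        a = rng.choices(actions, weights=[pi_b[x] for x in actions])[0]
        # importance-weighted return minus the true value
        errs.append((pi_e[a] / pi_b[a]) * rewards[a] - true_val)
    return sum(e * e for e in errs) / n

rewards = {'a': 1.0, 'b': 0.0}
pi_e = {'a': 0.5, 'b': 0.5}
# A behavior policy tilted toward the high-|reward| action; softened so
# every action keeps support (a requirement for unbiasedness).
pi_b = {'a': 0.95, 'b': 0.05}
```

Sampling from `pi_b` concentrates data where it matters for the estimate, so the off-policy estimator's MSE is lower than the standard on-policy (deploy-and-observe) estimator's.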

IROS Conference 2017 Conference Paper

Leveraging commonsense reasoning and multimodal perception for robot spoken dialog systems

  • Dongcai Lu
  • Shiqi Zhang 0001
  • Peter Stone 0001
  • Xiaoping Chen

Probabilistic graphical models, such as partially observable Markov decision processes (POMDPs), have been used in stochastic spoken dialog systems to handle the inherent uncertainty in speech recognition and language understanding. Such dialog systems suffer from the fact that only a relatively small number of domain variables are allowed in the model, so as to ensure the generation of good-quality dialog policies. At the same time, the non-language perception modalities on robots, such as vision-based facial expression recognition and Lidar-based distance detection, can hardly be integrated into this process. In this paper, we use a probabilistic commonsense reasoner to “guide” our POMDP-based dialog manager, and present a principled, multimodal dialog management (MDM) framework that allows the robot's dialog belief state to be seamlessly updated by both observations of human spoken language, and exogenous events such as the change of human facial expressions. The MDM approach has been implemented and evaluated both in simulation and on a real mobile robot using guidance tasks.

EUMAS Conference 2017 Invited Paper

Multiagent Learning Paradigms

  • Karl Tuyls
  • Peter Stone 0001

“Perhaps a thing is simple if you can describe it fully in several different ways, without immediately knowing that you are describing the same thing” – Richard Feynman. This article examines multiagent learning from several paradigmatic perspectives, aiming to bring them together within one framework. We aim to provide a general definition of multiagent learning and lay out the essential characteristics of the various paradigms in a systematic manner by dissecting multiagent learning into its main components. We show how these various paradigms are related and describe similar learning processes, but from varying perspectives, e.g., an individual (cognitive) learner vs. a population of (simple) learning agents.

ICML Conference 2016 Conference Paper

On the Analysis of Complex Backup Strategies in Monte Carlo Tree Search

  • Piyush Khandelwal
  • Elad Liebman
  • Scott Niekum
  • Peter Stone 0001

Over the past decade, Monte Carlo Tree Search (MCTS) and specifically Upper Confidence Bound in Trees (UCT) have proven to be quite effective in large probabilistic planning domains. In this paper, we focus on how values are backpropagated in the MCTS tree, and apply complex return strategies from the Reinforcement Learning (RL) literature to MCTS, producing four new MCTS variants. We demonstrate that in some probabilistic planning benchmarks from the International Planning Competition (IPC), selecting an MCTS variant with a backup strategy different from Monte Carlo averaging can lead to substantially better results. We also propose a hypothesis for why different backup strategies lead to different performance in particular environments, and manipulate a carefully structured grid-world domain to provide empirical evidence supporting our hypothesis.
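
One family of complex returns of the kind studied above is the λ-return, which interpolates between bootstrapping on stored node values and the plain Monte Carlo return. A toy backup sketch (our names and structure, not the authors' implementation); `lam=1.0` recovers standard Monte Carlo averaging:

```python
def lambda_backup(path, rewards, values, counts, gamma=1.0, lam=0.8):
    """Back one simulated episode up an MCTS tree with a lambda-return
    target. `path` lists node ids root-to-leaf, `rewards[i]` is the
    reward observed after visiting path[i]; `values`/`counts` are dicts
    holding the per-node running statistics."""
    G = 0.0
    for node, r in zip(reversed(path), reversed(rewards)):
        G = r + gamma * G                       # return from this node down
        counts[node] = counts.get(node, 0) + 1
        v = values.get(node, 0.0)
        values[node] = v + (G - v) / counts[node]  # running mean of targets
        # mix the stored value estimate into the return passed to the parent
        G = (1 - lam) * values[node] + lam * G
```

With `lam < 1`, each node's stored value partially replaces the noisy rollout return before it propagates upward, which is the kind of variance/bias trade-off the paper analyzes across environments.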

IROS Conference 2015 Conference Paper

Benchmarking robot cooperation without pre-coordination in the RoboCup Standard Platform League drop-in player competition

  • Katie Genter
  • Tim Laue 0001
  • Peter Stone 0001

The Standard Platform League is one of the main competitions of the annual RoboCup world championships. In this competition, teams of five humanoid robots play soccer against each other. In 2014, the league added a new sub-competition which serves as a testbed for cooperation without pre-coordination: the Drop-in Player Competition. Instead of homogeneous robot teams that are each programmed by the same people and hence implicitly pre-coordinated, this competition features ad hoc teams, i.e., teams that consist of robots originating from different RoboCup teams and that are each running different software. In this paper, we provide an overview of this competition, including its motivation and rules. We then present and analyze the results of the 2014 competition, which gathered robots from 23 teams, involved at least 50 human participants, and consisted of fifteen 20-minute games for a total playing time of 300 minutes. We also suggest improvements for future iterations, many of which will be evaluated at RoboCup 2015.

ECAI Conference 2014 Conference Paper

Communicating with Unknown Teammates

  • Samuel Barrett
  • Noa Agmon
  • Noam Hazon
  • Sarit Kraus
  • Peter Stone 0001

Past research has investigated a number of methods for coordinating teams of agents, but with the growing number of sources of agents, it is likely that agents will encounter teammates that do not share their coordination methods. Therefore, it is desirable for agents to adapt to these teammates, forming an effective ad hoc team. Past ad hoc teamwork research has focused on cases where the agents do not directly communicate. However when teammates do communicate, it can provide a valuable channel for coordination. Therefore, this paper tackles the problem of communication in ad hoc teams, introducing a minimal version of the multiagent, multiarmed bandit problem with limited communication between the agents. The theoretical results in this paper prove that this problem setting can be solved in polynomial time when the agent knows the set of possible teammates. Furthermore, the empirical results show that an agent can cooperate with a variety of teammates following unknown behaviors even when its models of these teammates are imperfect.

ICAPS Conference 2014 Conference Paper

Planning in Action Language BC while Learning Action Costs for Mobile Robots

  • Piyush Khandelwal
  • Fangkai Yang
  • Matteo Leonetti
  • Vladimir Lifschitz
  • Peter Stone 0001

The action language BC provides an elegant way of formalizing dynamic domains which involve indirect effects of actions and recursively defined fluents. In complex robot task planning domains, it may be necessary for robots to plan with incomplete information, and reason about indirect or recursive action effects. In this paper, we demonstrate how BC can be used for robot task planning to solve these issues. Additionally, action costs are incorporated with planning to produce optimal plans, and we estimate these costs from experience making planning adaptive. This paper presents the first application of BC on a real robot in a realistic domain, which involves human-robot interaction for knowledge acquisition, optimal plan generation to minimize navigation time, and learning for adaptive planning.

IROS Conference 2014 Conference Paper

The RoboCup 2013 drop-in player challenges: Experiments in ad hoc teamwork

  • Patrick MacAlpine
  • Katie Genter
  • Samuel Barrett
  • Peter Stone 0001

As the prevalence of autonomous agents grows, so does the number of interactions between these agents. Therefore, it is desirable for these agents to be capable of banding together with previously unknown teammates towards a common goal: to collaborate without pre-coordination. While past research on ad hoc teamwork has focused mainly on theoretical treatments and empirical studies in relatively simple domains, the long-term vision has been to enable robots and other autonomous agents to exhibit the sort of flexibility and adaptability on complex tasks that people do, for example when they play games of “pick-up” basketball or soccer. This paper introduces a series of pick-up robot soccer experiments that were carried out in three different leagues at the international RoboCup competition in 2013. In all cases, agents from different labs were put on teams with no pre-coordination. This paper introduces the structure of these experiments, describes the strategies used by UT Austin Villa in each challenge, and analyzes the results. The paper's main contribution is the introduction of a new large-scale ad hoc teamwork testbed that can serve as a starting point for future experimental ad hoc teamwork research.

IROS Conference 2012 Conference Paper

Evasion planning for autonomous vehicles at intersections

  • Tsz-Chiu Au
  • Chien-Liang Fok
  • Sriram Vishwanath
  • Christine Julien 0001
  • Peter Stone 0001

Autonomous intersection management (AIM) is a new intersection control protocol that exploits the capabilities of autonomous vehicles to control traffic at intersections more effectively than traffic signals and stop signs. A key assumption of this protocol is that vehicles can always follow their trajectories. But mechanical failures can occur in real life, causing vehicles to deviate from their trajectories. A previous approach for handling mechanical failure was to prevent vehicles from entering the intersection after the failure. However, this approach cannot prevent collisions among vehicles already in the intersection or too close to stop because (1) the lack of coordination among vehicles can cause collisions during the execution of evasive actions; and (2) the intersection may not have enough room for evasive actions. In this paper, we propose a preemptive approach that pre-computes evasion plans for several common types of mechanical failures before vehicles enter an intersection. This preemptive approach is necessary because there are situations in which vehicles cannot evade without pre-allocation of space for evasion. We present a modified AIM protocol and demonstrate the effectiveness of evasion plan execution on a miniature autonomous intersection testbed.

ICRA Conference 2012 Conference Paper

On coordination in practical multi-robot patrol

  • Noa Agmon
  • Chien-Liang Fok
  • Yehuda Emaliah
  • Peter Stone 0001
  • Christine Julien 0001
  • Sriram Vishwanath

Multi-robot patrol is a fundamental application of multi-robot systems. While much theoretical work exists providing an understanding of the optimal patrol strategy for teams of coordinated homogeneous robots, little work exists on building and evaluating the performance of such systems in practice. In this paper, we evaluate the performance of multi-robot patrol in a practical outdoor distributed robotic system, and evaluate the effect of different coordination schemes on the performance of the robotic team. The multi-robot patrol algorithms evaluated vary in the level of robot coordination: no coordination, loose coordination, and tight coordination. In addition, we evaluate versions of these algorithms that distribute state information, either individual state or entire team state (global-view state). Our experiments show that while tight coordination is theoretically optimal, it does not perform best in practice. Instead, uncoordinated patrol performs best in terms of average waypoint visitation frequency, though loosely coordinated patrol that shares only individual state performed best in terms of worst-case frequency. Both are significantly better than a loosely coordinated algorithm based on sharing global-view state. We respond to this discrepancy between theory and practice, caused primarily by robot heterogeneity, by extending the theory to account for such heterogeneity, and find that the new theory accounts for the empirical results.

ICRA Conference 2012 Conference Paper

RTMBA: A Real-Time Model-Based Reinforcement Learning Architecture for robot control

  • Todd Hester
  • Michael J. Quinlan
  • Peter Stone 0001

Reinforcement Learning (RL) is a paradigm for learning decision-making tasks that could enable robots to learn and adapt to their situation on-line. For an RL algorithm to be practical for robotic control tasks, it must learn in very few samples, while continually taking actions in real-time. Existing model-based RL methods learn in relatively few samples, but typically take too much time between each action for practical on-line learning. In this paper, we present a novel parallel architecture for model-based RL that runs in real-time by 1) taking advantage of sample-based approximate planning methods and 2) parallelizing the acting, model learning, and planning processes in a novel way such that the acting process is sufficiently fast for typical robot control cycles. We demonstrate that algorithms using this architecture perform nearly as well as methods using the typical sequential architecture when both are given unlimited time, and greatly out-perform these methods on tasks that require real-time actions such as controlling an autonomous vehicle.

ICRA Conference 2012 Conference Paper

Setpoint scheduling for autonomous vehicle controllers

  • Tsz-Chiu Au
  • Michael J. Quinlan
  • Peter Stone 0001

This paper considers the problem of controlling an autonomous vehicle to arrive at a specific position on a road at a given time and velocity. This ability is particularly useful for a recently introduced autonomous intersection management protocol, called AIM, which has been shown to lead to lower delays than traffic signals and stop signs. Specifically, we introduce a setpoint scheduling algorithm for generating setpoints for the PID controllers for the brake and throttle actuators of an autonomous vehicle. The algorithm constructs a feasible setpoint schedule such that the vehicle arrives at the position at the correct time and velocity. Our experimental results show that the algorithm outperforms a heuristic-based setpoint scheduler that does not provide any guarantee about the arrival time and velocity.

IROS Conference 2012 Conference Paper

Video: RoboCup robot soccer history 1997 - 2011

  • Manuela Veloso
  • Peter Stone 0001

RoboCup is an international initiative to foster inter-disciplinary research and education in robotics, artificial intelligence, computer science, and engineering. We focus on the challenges of multi-robot systems, where robots cooperate with each other and when needed with humans to achieve goals in complex and uncertain environments, such as robot soccer, as RoboCupSoccer, robot rescue, as RoboCupRescue, and the wide spectrum of robot applications in daily life, as RoboCup@Home. We also include sponsored demonstrations that explore possible new scientific challenges, such as collaborative logistics. Furthermore, we are committed to contribute to the education of children in robotics: RoboCupJunior provides an exciting introduction to science and engineering for children. Overall, RoboCup is a large vibrant community, composed of university faculty and student researchers and engineers, school teachers, children, and parents. RoboCup serves as a substrate to a wide variety of academic enterprises, ranging from courses and class projects to undergraduate, Masters, and PhD research theses. RoboCup has an international annual event consisting of robot competitions and a symposium. RoboCup has consistently grown, from a few hundred participants in 1997 to close to 3,000 in 2011.

IROS Conference 2011 Conference Paper

Autonomous Intersection Management: Multi-intersection optimization

  • Matthew J. Hausknecht
  • Tsz-Chiu Au
  • Peter Stone 0001

Advances in autonomous vehicles and intelligent transportation systems indicate a rapidly approaching future in which intelligent vehicles will automatically handle the process of driving. However, increasing the efficiency of today's transportation infrastructure will require intelligent traffic control mechanisms that work hand in hand with intelligent vehicles. To this end, Dresner and Stone proposed a new intersection control mechanism called Autonomous Intersection Management (AIM) and showed in simulation that by studying the problem from a multiagent perspective, intersection control can be made more efficient than existing control mechanisms such as traffic signals and stop signs. We extend their study beyond the case of an individual intersection and examine the unique implications and abilities afforded by using AIM-based agents to control a network of interconnected intersections. We examine different navigation policies by which autonomous vehicles can dynamically alter their planned paths, observe an instance of Braess' paradox, and explore the new possibility of dynamically reversing the flow of traffic along lanes in response to minute-by-minute traffic conditions. Studying this multiagent system in simulation, we quantify the substantial improvements in efficiency imparted by these agent-based traffic control methods.

EWRL Workshop 2011 Invited Paper

Invited Talk: PRISM - Practical RL: Representation, Interaction, Synthesis, and Mortality

  • Peter Stone 0001

When scaling up RL to large continuous domains with imperfect representations and hierarchical structure, we often try applying algorithms that are proven to converge in small finite domains, and then just hope for the best. This talk will advocate instead designing algorithms that adhere to the constraints, and indeed take advantage of the opportunities, that might come with the problem at hand. Drawing on several different research threads within the Learning Agents Research Group at UT Austin, I will discuss four types of issues that arise from these constraints and opportunities: 1) Representation – choosing the algorithm for the problem’s representation and adapting the representation to fit the algorithm; 2) Interaction – with other agents and with human trainers; 3) Synthesis – of different algorithms for the same problem and of different concepts in the same algorithm; and 4) Mortality – the opportunity to improve learning based on past experience and the constraint that one can’t explore exhaustively.

IROS Conference 2010 Conference Paper

Bringing simulation to life: A mixed reality autonomous intersection

  • Michael J. Quinlan
  • Tsz-Chiu Au
  • Jesse Zhu
  • Nicolae Stiurca
  • Peter Stone 0001

Fully autonomous vehicles are technologically feasible with the current generation of hardware, as demonstrated by recent robot car competitions. Dresner and Stone proposed a new intersection control protocol called Autonomous Intersection Management (AIM) and showed that with autonomous vehicles it is possible to make intersection control much more efficient than the traditional control mechanisms such as traffic signals and stop signs. The protocol, however, has only been tested in simulation and has not been evaluated with real autonomous vehicles. To realistically test the protocol, we implemented a mixed reality platform on which an autonomous vehicle can interact with multiple virtual vehicles in a simulation at a real intersection in real time. From this platform we validated realistic parameters for our autonomous vehicle to safely traverse an intersection in AIM. We present several techniques to improve efficiency and show that the AIM protocol can still outperform traffic signals and stop signs even if the cars are not as precisely controllable as has been assumed in previous studies.

ICRA Conference 2010 Conference Paper

Generalized model learning for Reinforcement Learning on a humanoid robot

  • Todd Hester
  • Michael J. Quinlan
  • Peter Stone 0001

Reinforcement learning (RL) algorithms have long been promising methods for enabling an autonomous robot to improve its behavior on sequential decision-making tasks. The obvious enticement is that the robot should be able to improve its own behavior without the need for detailed step-by-step programming. However, for RL to reach its full potential, the algorithms must be sample efficient: they must learn competent behavior from very few real-world trials. From this perspective, model-based methods, which use experiential data more efficiently than model-free approaches, are appealing. But they often require exhaustive exploration to learn an accurate model of the domain. In this paper, we present an algorithm, Reinforcement Learning with Decision Trees (RL-DT), that uses decision trees to learn the model by generalizing the relative effect of actions across states. The agent explores the environment until it believes it has a reasonable policy. The combination of the learning approach with the targeted exploration policy enables fast learning of the model. We compare RL-DT against standard model-free and model-based learning methods, and demonstrate its effectiveness on an Aldebaran Nao humanoid robot scoring goals in a penalty kick scenario.
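The key idea the abstract describes, generalizing the relative effect of actions across states, can be illustrated with a toy sketch. The paper fits decision trees over state features; the sketch below substitutes the degenerate single-leaf case (a per-action average of observed state deltas) to show why modeling s' − s rather than absolute next states lets the model extrapolate to unvisited states. All names here are illustrative, not from the paper.

```python
from collections import defaultdict

# Model the *relative* effect of an action (s' - s) rather than absolute
# next states, so one learned effect generalizes across states. RL-DT fits
# decision trees over state features; this sketch uses a plain per-action
# average, i.e. a tree with a single leaf.

class RelativeEffectModel:
    def __init__(self):
        self.deltas = defaultdict(list)  # action -> observed (s' - s) values

    def update(self, s, a, s_next):
        self.deltas[a].append(s_next - s)

    def predict(self, s, a):
        d = self.deltas[a]
        return s + round(sum(d) / len(d)) if d else s

model = RelativeEffectModel()
# Experience gathered only in states 0..4 of a 1-D corridor.
for s in range(5):
    model.update(s, "right", s + 1)
    model.update(s, "left", max(s - 1, 0))

# The relative model still predicts sensibly in state 20, never visited.
print(model.predict(20, "right"))  # -> 21
```

Because the model is expressed over action effects, experience from five states transfers directly to the whole corridor, which is the sample-efficiency argument the abstract makes.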

IROS Conference 2009 Conference Paper

Improving particle filter performance using SSE instructions

  • Peter Djeu
  • Michael J. Quinlan
  • Peter Stone 0001

Robotics researchers are often faced with real-time constraints, and for that reason algorithmic and implementation-level optimization can dramatically increase the overall performance of a robot. In this paper we illustrate how a substantial run-time gain can be achieved by taking advantage of the extended instruction sets found in modern processors, in particular the SSE1 and SSE2 instruction sets. We present an SSE version of Monte Carlo Localization that results in an impressive 9x speedup over an optimized scalar implementation. In the process, we discuss SSE implementations of atan, atan2 and exp that achieve up to a 4x speedup in these mathematical operations alone.
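The paper's speedups come from SSE intrinsics in C, which Python cannot express, but the *kind* of fast transcendental approximation used in such kernels can be shown. Below is Schraudolph's classic bit-manipulation exp approximation, which writes a scaled argument directly into the exponent field of an IEEE-754 double; it is a sketch of the accuracy/speed trade-off such implementations make, not the paper's actual SSE code.

```python
import math
import struct

# Schraudolph-style fast exp: write int(A*x + C) into the high 32 bits of
# an IEEE-754 double, so the scaled argument lands in the exponent field.
EXP_A = 1048576 / math.log(2)   # 2^20 / ln 2 scales x into exponent bits
EXP_C = 1072632447              # exponent bias term minus a correction

def fast_exp(x):
    hi = int(EXP_A * x + EXP_C)
    # Pack (low word = 0, high word = hi) and reinterpret as a double.
    return struct.unpack("<d", struct.pack("<ii", 0, hi))[0]

for x in (-1.0, 0.0, 0.5, 1.0, 2.0):
    rel_err = abs(fast_exp(x) - math.exp(x)) / math.exp(x)
    print(f"exp({x:+.1f}) ~ {fast_exp(x):.4f}  (rel. err {rel_err:.1%})")
```

The approximation stays within a few percent over this range, which is often acceptable for particle weighting where only relative likelihoods matter, and it replaces a transcendental call with one multiply, one add, and a bit copy.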

ICRA Conference 2008 Conference Paper

Maximum likelihood estimation of sensor and action model functions on a mobile robot

  • Daniel Stronger
  • Peter Stone 0001

In order for a mobile robot to accurately interpret its sensations and predict the effects of its actions, it must have accurate models of its sensors and actuators. These models are typically tuned manually, a brittle and laborious process. Autonomous model learning is a promising alternative to manual calibration, but previous work has assumed the presence of an accurate action or sensor model in order to train the other model. This paper presents an adaptation of the Expectation-Maximization (EM) algorithm to enable a mobile robot to learn both its action and sensor model functions, starting without an accurate version of either. The resulting algorithm is validated experimentally both on a Sony Aibo ERS-7 robot and in simulation.
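The alternating structure of such an EM-style self-calibration can be sketched in a toy 1-D linear setting: commanded velocities pass through an unknown action gain, and positions are observed through an unknown sensor gain. Holding one model fixed while refitting the other recovers a consistent pair; with no external ground truth only the product of the gains is pinned down, which is why methods like this learn models that are accurate relative to each other. All gains and names below are invented for illustration and are far simpler than the paper's model functions.

```python
import random

random.seed(0)
A_TRUE, S_TRUE = 0.8, 2.0                    # hidden action / sensor gains
u = [random.uniform(-1, 1) for _ in range(200)]
x = [0.0]
for ut in u:
    x.append(x[-1] + A_TRUE * ut)            # true (hidden) positions
z = [S_TRUE * xi + random.gauss(0, 0.01) for xi in x]

A_hat, S_hat = 1.0, 1.0                      # start with neither calibrated
for _ in range(10):
    # E-like step: infer positions using the current sensor model.
    x_hat = [zi / S_hat for zi in z]
    # M-like step 1: refit the action gain to the inferred motion.
    dx = [x_hat[t + 1] - x_hat[t] for t in range(len(u))]
    A_hat = sum(d * ut for d, ut in zip(dx, u)) / sum(ut * ut for ut in u)
    # M-like step 2: refit the sensor gain to action-predicted positions.
    x_pred, acc = [], 0.0
    for ut in u:
        acc += A_hat * ut
        x_pred.append(acc)
    num = sum(zi * xi for zi, xi in zip(z[1:], x_pred))
    den = sum(xi * xi for xi in x_pred)
    S_hat = num / den

print(A_hat * S_hat)  # close to A_TRUE * S_TRUE = 1.6
```

The individual estimates depend on the starting point, but the product A_hat * S_hat converges to the true product, mirroring the scale ambiguity inherent in calibrating both models from each other.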

ICRA Conference 2008 Conference Paper

Negative information and line observations for Monte Carlo localization

  • Todd Hester
  • Peter Stone 0001

Localization is a very important problem in robotics and is critical to many tasks performed on a mobile robot. In order to localize well in environments with few landmarks, a robot must make full use of all the information provided to it. This paper moves towards this goal by studying the effects of incorporating line observations and negative information into the localization algorithm. We extend the general Monte Carlo localization algorithm to utilize observations of lines such as carpet edges. We also make use of the information available when the robot expects to see a landmark but does not, by incorporating negative information into the algorithm. We compare our implementations of these ideas to previous similar approaches and demonstrate the effectiveness of these improvements through localization experiments performed both on a Sony AIBO ERS-7 robot and in simulation.
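The negative-information idea can be shown with a minimal 1-D particle filter sketch: when the robot looks toward a landmark's known position and fails to see it, particles from which the landmark *should* have been visible are penalized. The corridor, sensing range, and miss probability below are invented for illustration and are much simpler than the paper's vision-based implementation.

```python
import random

random.seed(1)
DOOR = 5.0          # known landmark position
SENSE_RANGE = 2.0   # landmark visible when within this distance
P_MISS = 0.1        # chance of missing a genuinely visible landmark

particles = [random.uniform(0.0, 10.0) for _ in range(1000)]
weights = [1.0] * len(particles)

# Observation: the robot saw no door.
for i, p in enumerate(particles):
    if abs(p - DOOR) < SENSE_RANGE:
        weights[i] *= P_MISS   # these particles expected to see it
# Particles far from the door keep weight 1: non-detection is consistent.

total = sum(weights)
near_door_mass = sum(w for p, w in zip(particles, weights)
                     if abs(p - DOOR) < SENSE_RANGE) / total
print(f"posterior mass near door: {near_door_mass:.3f}")
```

A positive detection updates weights the usual way; the point of negative information is that the *absence* of an expected observation is itself evidence, sharply reducing posterior mass near the landmark here even though nothing was seen at all.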

ICRA Conference 2008 Conference Paper

Person recognition on a Segway Robot: A video of UT Austin Villa Robocup@Home 2007 finals demonstration

  • W. Bradley Knox
  • Juhyun Lee
  • Peter Stone 0001

This video shows a Segway robot from the University of Texas at Austin competing in the finals of the 2007 RoboCup@Home competition, which featured home assistant robots performing various challenging tasks. This demonstration combines a few tasks which will likely be performed by a future home assistant robot. The robot learns a human's appearance, follows the human with his back turned, distinguishes the human from a similarly clothed stranger, and adapts when it notices that the human has changed his clothing. For this task, we introduce a novel two-classifier architecture, using the subject's face as a primary identifying characteristic and his shirt as a secondary characteristic.


ICRA Conference 2008 Conference Paper

Person tracking on a mobile robot with heterogeneous inter-characteristic feedback

  • Juhyun Lee
  • Peter Stone 0001

For a mobile robot that interacts with humans such as a home assistant or a tour guide robot, tracking a particular person among multiple persons is a fundamental, yet challenging task. Uniquely identifying characteristics such as a person's face, may not be visible consistently enough to be used as the sole form of identification. Rather, it may be useful to also track more frequently visible, but perhaps less uniquely identifying characteristics such as a person's clothes. After learning various characteristics of a person, the tracking system is required to autonomously update itself with additional training data, since the learned features may change over space and time due to the mobile nature of the robot. In this paper, we introduce a novel algorithm for merging multiple, heterogeneous sub-classifiers designed to track and associate different characteristics of a person being tracked. These heterogeneous classifiers give feedback to each other by identifying additional online training data for one another, thus improving the performance of each classifier and the accuracy of the overall system. Our algorithm has been fully implemented and tested on a Segway base.
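The feedback loop between sub-classifiers that the abstract describes resembles a co-training pattern: when the high-precision cue (the face) confidently identifies the subject, those frames become fresh training data for the weaker but more frequently visible cue (the shirt). The sketch below is a deliberately tiny stand-in; frames, thresholds, and colors are all invented for illustration.

```python
# Each frame: did the face classifier see the subject (True), a stranger
# (False), or no face at all (None)? The shirt model is trained online
# from the face classifier's confident frames.
face_is_subject = [True, True, None, None, None, False]
shirt_color     = ["red", "red", "red", "red", "blue", "green"]

learned_shirts = set()
labels = []
for face_ok, shirt in zip(face_is_subject, shirt_color):
    if face_ok is True:
        learned_shirts.add(shirt)   # primary cue trains the secondary one
        labels.append(True)
    elif face_ok is False:
        labels.append(False)        # a stranger, whatever the shirt says
    else:
        labels.append(shirt in learned_shirts)  # fall back on the shirt

print(labels)
```

Frames where the face is hidden are still labeled correctly via the shirt model the face classifier trained, which is the cross-classifier feedback the paper's architecture generalizes to heterogeneous characteristics.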

ICRA Conference 2007 Conference Paper

A Comparison of Two Approaches for Vision and Self-Localization on a Mobile Robot

  • Daniel Stronger
  • Peter Stone 0001

This paper considers two approaches to the problem of vision and self-localization on a mobile robot. In the first approach, the perceptual processing is primarily bottom-up, with visual object recognition entirely preceding localization. In the second, significant top-down information is incorporated, with vision and localization being intertwined. That is, the processing of vision is highly dependent on the robot's estimate of its location. The two approaches are implemented and tested on a Sony Aibo ERS-7 robot, localizing as it walks through a color-coded test-bed domain. This paper's contributions are an exposition of two different approaches to vision and localization on a mobile robot, an empirical comparison of the two methods, and a discussion of the relative advantages of each method.

IROS Conference 2007 Conference Paper

Global action selection for illumination invariant color modeling

  • Mohan Sridharan
  • Peter Stone 0001

A major challenge in the path of widespread use of mobile robots is the ability to function autonomously, learning useful features from the environment and using them to adapt to environmental changes. We propose an algorithm for mobile robots equipped with color cameras that allows for smooth operation under illumination changes. The robot uses image statistics and the environmental structure to autonomously detect and adapt to both major and minor illumination changes. Furthermore, the robot autonomously plans an action sequence that maximizes color learning opportunities while minimizing localization errors. Our approach is fully implemented and tested on the Sony AIBO robots.

ICRA Conference 2006 Conference Paper

A Multi-robot System for Continuous Area Sweeping Tasks

  • Mazda Ahmadi
  • Peter Stone 0001

As mobile robots become increasingly autonomous over extended periods of time, opportunities arise for their use on repetitive tasks. We define and implement behaviors for a class of such tasks that we call continuous area sweeping tasks. A continuous area sweeping task is one in which a group of robots must repeatedly visit all points in a fixed area, possibly with nonuniform frequency, as specified by a task-dependent cost function. Examples of problems that need continuous area sweeping are trash removal in a large building and routine surveillance. In our previous work we have introduced a single-robot approach to this problem. In this paper, we extend that approach to multi-robot scenarios. The focus of this paper is adaptive and decentralized task assignment in continuous area sweeping problems, with the aim of ensuring stability in environments with dynamic factors, such as robot malfunctions or the addition of new robots to the team. Our proposed negotiation-based approach is fully implemented and tested both in simulation and on physical robots.
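The nonuniform-frequency objective can be sketched with a toy single-robot loop: each cell accumulates "staleness" at a task-dependent rate, and the robot repeatedly visits the cell whose staleness is highest, so higher-rate cells get swept more often. The grid and rates are invented for illustration, and the paper's multi-robot negotiation layer is omitted.

```python
rates = {"lobby": 3.0, "hall": 1.0, "office": 1.0}   # importance per cell
staleness = {c: 0.0 for c in rates}
visits = {c: 0 for c in rates}

for _ in range(100):                 # 100 sweep steps
    for c in rates:
        staleness[c] += rates[c]     # everything gets staler each step
    target = max(staleness, key=staleness.get)
    staleness[target] = 0.0          # visiting a cell resets its staleness
    visits[target] += 1

print(visits)  # the high-rate cell ends up visited most often
```

Because a visit resets staleness to zero, the steady-state visit frequency of each cell tracks its accumulation rate, which is the behavior a task-dependent cost function like trash accumulation calls for.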

ICAPS Conference 2006 Conference Paper

Predictive Planning for Supply Chain Management

  • David Pardoe
  • Peter Stone 0001

Supply chains are ubiquitous in the manufacturing of many complex products. Traditionally, supply chains have been created through the intricate interactions of human representatives of the various companies involved. However, recent advances in planning, scheduling, and autonomous agent technologies have sparked an interest, both in academia and in industry, in automating the process. The Trading Agent Competition Supply Chain Management (TAC SCM) scenario provides a unique testbed for studying and prototyping supply chain management agents by providing a competitive environment in which independently created agents can be tested against each other over the course of many simulations. This paper presents the features of TAC SCM from a planning and scheduling perspective and introduces TacTex-05, the champion agent from the 2005 competition. TacTex-05 takes a predictive approach to its many planning and scheduling decisions by estimating future resource availability and constraints. This paper focuses on these aspects of the agent and isolates their impact with controlled empirical tests.

ICRA Conference 2005 Conference Paper

Practical Vision-Based Monte Carlo Localization on a Legged Robot

  • Mohan Sridharan
  • Gregory Kuhlmann
  • Peter Stone 0001

Mobile robot localization, the ability of a robot to determine its global position and orientation, continues to be a major research focus in robotics. In most past cases, such localization has been studied on wheeled robots with range finding sensors such as sonar or lasers. In this paper, we consider the more challenging scenario of a legged robot localizing with a limited field-of-view camera as its primary sensory input. We begin with a baseline implementation adapted from the literature that provides a reasonable level of competence, but that exhibits some weaknesses in real-world tests. We propose a series of practical enhancements designed to improve the robot’s sensory and actuator models that enable our robots to achieve a 50% improvement in localization accuracy over the baseline implementation. We go on to demonstrate how the accuracy improvement is even more dramatic when the robot is subjected to large unmodeled movements. These enhancements are each individually straightforward, but together they provide a roadmap for avoiding potential pitfalls when implementing Monte Carlo Localization on vision-based and/or legged robots.

IROS Conference 2005 Conference Paper

Real-time vision on a mobile robot platform

  • Mohan Sridharan
  • Peter Stone 0001

Computer vision is a broad and significant ongoing research challenge, even when performed on an individual image or on streaming video from a high-quality stationary camera with abundant computational resources. When faced with streaming video from a lower-quality, rapidly moving camera and limited computational resources, the challenge increases. We present our implementation of a vision system on a mobile robot platform that uses a camera image as the primary sensory input. Having to perform all processing, including segmentation and object detection, in real-time on-board the robot, eliminates the possibility of using some state-of-the-art methods that otherwise might apply. We describe the methods that we developed to achieve a practical vision system within these constraints. Our approach is fully implemented and tested on a team of Sony AIBO robots.

ICRA Conference 2005 Conference Paper

Simultaneous Calibration of Action and Sensor Models on a Mobile Robot

  • Daniel Stronger
  • Peter Stone 0001

This paper presents a technique for the Simultaneous Calibration of Action and Sensor Models (SCASM) on a mobile robot. While previous approaches to calibration make use of an independent source of feedback, SCASM is unsupervised, in that it does not receive any well-calibrated feedback about its location. Starting with only an inaccurate action model, it learns accurate relative action and sensor models. Furthermore, SCASM is fully autonomous, in that it operates with no human supervision. SCASM is fully implemented and tested on a Sony Aibo ERS-7 robot.

ICRA Conference 2004 Conference Paper

Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion

  • Nate Kohl
  • Peter Stone 0001

This paper presents a machine learning approach to optimizing a quadrupedal trot gait for forward speed. Given a parameterized walk designed for a specific robot, we propose using a form of policy gradient reinforcement learning to automatically search the set of possible parameters with the goal of finding the fastest possible walk. We implement and test our approach on a commercially available quadrupedal robot platform, namely the Sony Aibo robot. After about three hours of learning, all on the physical robots and with no human intervention other than to change the batteries, the robots achieved a gait faster than any previously known for the Aibo, significantly outperforming a variety of existing hand-coded and learned solutions.
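The finite-difference flavor of policy gradient described here can be sketched as follows: evaluate a batch of small random perturbations of the current parameter vector, estimate from their scores which direction improves speed in each dimension, and step that way. The 2-D "speed" function standing in for a real timed Aibo walk is invented for illustration, and the fixed-size sign step is a simplification of the paper's gradient step.

```python
import random

random.seed(2)

def walk_speed(params):
    # Hypothetical stand-in for timing the robot's walk; peaks at (0.6, 0.3).
    a, b = params
    return 1.0 - (a - 0.6) ** 2 - (b - 0.3) ** 2

params = [0.0, 0.0]
EPS, STEP, N_POLICIES = 0.05, 0.1, 20

for _ in range(50):
    # Evaluate a batch of randomly perturbed policies (each dimension
    # perturbed by -EPS, 0, or +EPS, as in finite-difference methods).
    trials = []
    for _ in range(N_POLICIES):
        d = [random.choice((-EPS, 0.0, EPS)) for _ in params]
        trials.append((d, walk_speed([p + di for p, di in zip(params, d)])))
    # Per dimension: compare average score of +EPS vs -EPS perturbations.
    for i in range(len(params)):
        plus  = [s for d, s in trials if d[i] > 0]
        minus = [s for d, s in trials if d[i] < 0]
        if plus and minus:
            grad = sum(plus) / len(plus) - sum(minus) / len(minus)
            params[i] += STEP if grad > 0 else -STEP

print([round(p, 2) for p in params], round(walk_speed(params), 3))
```

Each "iteration" here corresponds to one batch of timed walks on the physical robot, which is why sample efficiency (few policies per batch, coarse perturbations) mattered so much in the original three-hour experiment.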

ICAPS Conference 1994 Conference Paper

The Need for Different Domain-independent Heuristics

  • Peter Stone 0001
  • Manuela Veloso
  • Jim Blythe

PRODIGY’s planning algorithm uses domain-independent search heuristics. In this paper, we support our claim that there is no single search heuristic that performs more efficiently than others for all problems or in all domains. The paper presents three different domain-independent search heuristics of increasing complexity. We run PRODIGY with these heuristics in a series of artificial domains (introduced in (Barrett & Weld 1994)) where in fact one of the heuristics performs more efficiently than the others. However, we introduce an additional simple domain where the apparently worst heuristic outperforms the other two. The results we obtained in our empirical experiments lead to the main conclusion of this paper: planning algorithms need to use different search heuristics in different domains. We conclude the paper by advocating the need to learn the correspondence between particular domain characteristics and specific search heuristics for planning efficiently in complex domains.