Author name cluster

James MacGlashan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers
2 author rows

Possible papers

NeurIPS Conference 2022 Conference Paper

Value Function Decomposition for Iterative Design of Reinforcement Learning Agents

  • James MacGlashan
  • Evan Archer
  • Alisa Devlic
  • Takuma Seno
  • Craig Sherstan
  • Peter Wurman
  • Peter Stone

Designing reinforcement learning (RL) agents is typically a difficult process that requires numerous design iterations. Learning can fail for a multitude of reasons and standard RL methods provide too few tools to provide insight into the exact cause. In this paper, we show how to integrate value decomposition into a broad class of actor-critic algorithms and use it to assist in the iterative agent-design process. Value decomposition separates a reward function into distinct components and learns value estimates for each. These value estimates provide insight into an agent's learning and decision-making process and enable new training methods to mitigate common problems. As a demonstration, we introduce SAC-D, a variant of soft actor-critic (SAC) adapted for value decomposition. SAC-D maintains similar performance to SAC, while learning a larger set of value predictions. We also introduce decomposition-based tools that exploit this information, including a new reward influence metric, which measures each reward component's effect on agent decision-making. Using these tools, we provide several demonstrations of decomposition's use in identifying and addressing problems in the design of both environments and agents. Value decomposition is broadly applicable and easy to incorporate into existing algorithms and workflows, making it a powerful tool in an RL practitioner's toolbox.
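The idea of separating a reward function into components can be illustrated with a minimal sketch. This is not SAC-D and does not implement the paper's influence metric; it only shows the linearity property that decomposition relies on: when the total reward is the sum of its components, the discounted return of the total equals the sum of the per-component discounted returns. The component names and reward values below are hypothetical.

```python
from typing import Dict, List


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Discounted sum of a reward sequence, accumulated from the last step backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def decomposed_returns(
    component_rewards: Dict[str, List[float]], gamma: float
) -> Dict[str, float]:
    """One discounted return per reward component."""
    return {name: discounted_return(rs, gamma) for name, rs in component_rewards.items()}


# Hypothetical two-component reward over a 3-step episode (illustrative values only).
components = {
    "progress": [1.0, 1.0, 1.0],
    "collision_penalty": [0.0, -2.0, 0.0],
}
gamma = 0.5

per_component = decomposed_returns(components, gamma)

# The undecomposed reward at each step is the sum of the components at that step.
total_rewards = [sum(step) for step in zip(*components.values())]
total = discounted_return(total_rewards, gamma)

print(per_component)
print(total, sum(per_component.values()))
```

Inspecting the per-component values shows which part of the reward dominates the total, which is the kind of diagnostic the abstract describes.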

ICRA Conference 2021 Conference Paper

Efficient Real-Time Inference in Temporal Convolution Networks

  • Piyush Khandelwal
  • James MacGlashan
  • Peter R. Wurman
  • Peter Stone

It has been recently demonstrated that Temporal Convolution Networks (TCNs) provide state-of-the-art results in many problem domains where the input data is a time-series. TCNs typically incorporate information from a long history of inputs (the receptive field) into a single output using many convolution layers. Real-time inference using a trained TCN can be challenging on devices with limited compute and memory, especially if the receptive field is large. This paper introduces the RT-TCN algorithm that reuses the output of prior convolution operations to minimize the computational requirements and persistent memory footprint of a TCN during real-time inference. We also show that when a TCN is trained using time slices of the input time-series, it can be executed continually in real time using RT-TCN. In addition, we provide TCN architecture guidelines that ensure that real-time inference can be performed within memory and computational constraints.
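The general idea of reusing earlier convolution outputs can be sketched with a streaming causal convolution. This is not the RT-TCN algorithm from the paper, only an illustration under simplifying assumptions: scalar inputs, one channel, zero padding, and a hypothetical class name `StreamingCausalConv1D`. Each layer keeps a short buffer of its own recent inputs, so a new timestep costs one small dot product per layer instead of recomputing the whole receptive field.

```python
from collections import deque
from typing import Deque, List, Sequence


class StreamingCausalConv1D:
    """Single-channel causal convolution that processes one sample per call."""

    def __init__(self, weights: Sequence[float], dilation: int = 1) -> None:
        self.weights = list(weights)
        self.dilation = dilation
        # The buffer spans exactly the inputs this layer needs for one output.
        span = (len(self.weights) - 1) * dilation + 1
        self.buffer: Deque[float] = deque([0.0] * span, maxlen=span)

    def step(self, x: float) -> float:
        self.buffer.append(x)  # the oldest sample falls out automatically
        # Every dilation-th entry is a tap; the last tap is the newest sample.
        taps = list(self.buffer)[:: self.dilation]
        return sum(w * v for w, v in zip(self.weights, taps))


def batch_causal_conv1d(
    xs: Sequence[float], weights: Sequence[float], dilation: int = 1
) -> List[float]:
    """Reference implementation that recomputes every output from the full history."""
    k = len(weights)
    out = []
    for t in range(len(xs)):
        acc = 0.0
        for i, w in enumerate(weights):
            idx = t - (k - 1 - i) * dilation
            if idx >= 0:  # indices before the start act as zero padding
                acc += w * xs[idx]
        out.append(acc)
    return out


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Two stacked layers: layer 2 consumes layer 1's output one sample at a time.
layer1 = StreamingCausalConv1D([1.0, 2.0], dilation=1)
layer2 = StreamingCausalConv1D([1.0, -1.0], dilation=2)
streamed = [layer2.step(layer1.step(x)) for x in xs]

batch = batch_causal_conv1d(batch_causal_conv1d(xs, [1.0, 2.0], 1), [1.0, -1.0], 2)
print(streamed == batch)
```

The persistent memory per layer is the buffer length, `(kernel_size - 1) * dilation + 1`, which grows with the dilation rather than with the full receptive field of the stack.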

AAAI Conference 2020 Conference Paper

Gamma-Nets: Generalizing Value Estimation over Timescale

  • Craig Sherstan
  • Shibhansh Dohare
  • James MacGlashan
  • Johannes Günther
  • Patrick M. Pilarski

Temporal abstraction is a key requirement for agents making decisions over long time horizons, a fundamental challenge in reinforcement learning. There are many reasons why value estimates at multiple timescales might be useful; recent work has shown that value estimates at different timescales can be the basis for creating more advanced discounting functions and for driving representation learning. Further, predictions at many different timescales serve to broaden an agent's model of its environment. One predictive approach of interest within an online learning setting is general value functions (GVFs), which represent models of an agent's world as a collection of predictive questions each defined by a policy, a signal to be predicted, and a prediction timescale. In this paper we present Γ-nets, a method for generalizing value function estimation over timescale, allowing a given GVF to be trained and queried for arbitrary timescales so as to greatly increase the predictive ability and scalability of a GVF-based model. The key to our approach is to use timescale as one of the value estimator's inputs. As a result, the prediction target for any timescale is available at every timestep and we are free to train on any number of timescales. We first provide two demonstrations by 1) predicting a square wave and 2) predicting sensorimotor signals on a robot arm using a linear function approximator. Next, we empirically evaluate Γ-nets in the deep reinforcement learning setting using policy evaluation on a set of Atari video games. Our results show that Γ-nets can be effective for predicting arbitrary timescales, with only a small cost in accuracy as compared to learning estimators for fixed timescales. Γ-nets provide a method for accurately and compactly making predictions at many timescales without requiring a priori knowledge of the task, making it a valuable contribution to ongoing work on model-based planning, representation learning, and lifelong learning algorithms.
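The claim that a target is available for any timescale at every timestep can be illustrated with a minimal sketch. This is not the Γ-net architecture or training procedure; it only assumes a value estimator that takes the discount factor as an input (the hypothetical `value_fn(observation, gamma)`), and builds one-step bootstrapped targets for several discount factors from a single transition. The closed-form estimator used for the check is an assumption that holds only for a signal that is constantly 1.

```python
from typing import Callable, List, Sequence

# Hypothetical estimator signature: (observation, gamma) -> value estimate.
ValueFn = Callable[[float, float], float]


def td_targets(
    cumulant: float,
    next_obs: float,
    gammas: Sequence[float],
    value_fn: ValueFn,
) -> List[float]:
    """One-step bootstrapped target for each requested discount factor."""
    return [cumulant + g * value_fn(next_obs, g) for g in gammas]


def constant_signal_value(obs: float, gamma: float) -> float:
    """Exact discounted sum of a signal that is always 1: 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


# A single transition yields a target for every discount factor we ask for.
gammas = [0.0, 0.5, 0.9]
targets = td_targets(1.0, 0.0, gammas, constant_signal_value)

for g, target in zip(gammas, targets):
    print(g, target, constant_signal_value(0.0, g))
```

Because the exact value function is a fixed point of the one-step target, each target matches the estimator's own prediction here; with a learned estimator, the difference between the two would be the training error for that discount factor.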

RLDM Conference 2019 Conference Abstract

Gamma-nets: Generalizing Value Functions over Timescale

  • Craig Sherstan
  • James MacGlashan
  • Patrick M. Pilarski

Predictive representations of state connect an agent’s behavior (policy) to observable outcomes, providing a powerful representation for decision making. General value functions (GVFs) represent models of an agent’s world as a collection of predictive questions. A GVF is expressed by: a policy, a prediction target, and a timescale, e.g., “If a robot drives forward how much current will its motors draw over the next 3s?” Traditionally, predictions for a given timescale must be specified by the engineer and predictions for each timescale learned independently. Here we present γ-nets, a method for generalizing value function estimation over timescale, allowing a given GVF to be trained and queried for any fixed timescale. The key to our approach is to use timescale as one of the estimator inputs. The prediction target for any fixed timescale is then available at every timestep and we are free to train on any number of timescales. We present preliminary results on a test signal and a robot arm. This work contributes new insights into creating expressive and tractable predictive models for decision-making agents that operate in real-time, long-lived environments.

AAMAS Conference 2017 Conference Paper

Curriculum Design for Machine Learners in Sequential Decision Tasks

  • Bei Peng
  • James MacGlashan
  • Robert Loftin
  • Michael L. Littman
  • David L. Roberts
  • Matthew E. Taylor

Existing machine-learning work has shown that algorithms can benefit from curricula: learning first on simple examples before moving to more difficult examples. While most existing work on curriculum learning focuses on developing automatic methods to iteratively select training examples with increasing difficulty tailored to the current ability of the learner, relatively little attention has been paid to the ways in which humans design curricula. We argue that a better understanding of the human-designed curricula could give us insights into the development of new machine-learning algorithms and interfaces that can better accommodate machine- or human-created curricula. Our work addresses this emerging and vital area empirically, taking an important step to characterize the nature of human-designed curricula relative to the space of possible curricula and the performance benefits that may (or may not) occur.

RLDM Conference 2017 Conference Abstract

Generalized Inverse Reinforcement Learning

  • Nakul Gopalan
  • Amy Greenwald
  • Michael Littman
  • James MacGlashan

Inverse Reinforcement Learning (IRL) is used to teach behaviors to agents, by having them learn a reward function from example trajectories. The underlying assumption is usually that these trajectories represent optimal behavior. However, it is not always possible for a user to provide examples of optimal trajectories. This problem has been tackled previously by labeling trajectories with a score that indicates good and bad behaviors. In this work, we formalize the IRL problem in a generalized framework that allows for learning from failed demonstrations. In our framework, users can score entire trajectories as well as individual state-action pairs. This allows the agent to learn preferred behaviors from a relatively small number of trajectories. We expect this framework to be especially useful in robotics domains, where the user can collect fewer trajectories at the cost of labeling bad state-action pairs, which might be easier than maneuvering a robot to collect additional (entire) trajectories.

ICML Conference 2017 Conference Paper

Interactive Learning from Policy-Dependent Human Feedback

  • James MacGlashan
  • Mark K. Ho
  • Robert Tyler Loftin
  • Bei Peng
  • Guan Wang
  • David L. Roberts
  • Matthew E. Taylor
  • Michael L. Littman

This paper investigates the problem of interactively learning behaviors communicated by a human teacher using positive and negative feedback. Much previous work on this problem has made the assumption that people provide feedback for decisions that is dependent on the behavior they are teaching and is independent from the learner’s current policy. We present empirical results that show this assumption to be false: whether human trainers give positive or negative feedback for a decision is influenced by the learner’s current policy. Based on this insight, we introduce Convergent Actor-Critic by Humans (COACH), an algorithm for learning from policy-dependent feedback that converges to a local optimum. Finally, we demonstrate that COACH can successfully learn multiple behaviors on a physical robot.

ICAPS Conference 2017 Conference Paper

Planning with Abstract Markov Decision Processes

  • Nakul Gopalan
  • Marie desJardins
  • Michael L. Littman
  • James MacGlashan
  • Shawn Squire
  • Stefanie Tellex
  • John Winder
  • Lawson L. S. Wong

Robots acting in human-scale environments must plan under uncertainty in large state–action spaces and face constantly changing reward functions as requirements and goals change. Planning under uncertainty in large state–action spaces requires hierarchical abstraction for efficient computation. We introduce a new hierarchical planning framework called Abstract Markov Decision Processes (AMDPs) that can plan in a fraction of the time needed for complex decision making in ordinary MDPs. AMDPs provide abstract states, actions, and transition dynamics in multiple layers above a base-level “flat” MDP. AMDPs decompose problems into a series of subtasks with both local reward and local transition functions used to create policies for subtasks. The resulting hierarchical planning method is independently optimal at each level of abstraction, and is recursively optimal when the local reward and transition functions are correct. We present empirical results showing significantly improved planning speed, while maintaining solution quality, in the Taxi domain and in a mobile-manipulation robotics problem. Furthermore, our approach allows specification of a decision-making model for a mobile-manipulation problem on a Turtlebot, spanning from low-level control actions operating on continuous variables all the way up through high-level object manipulation tasks.

ICRA Conference 2017 Conference Paper

Reducing errors in object-fetching interactions through social feedback

  • David Whitney
  • Eric Rosen
  • James MacGlashan
  • Lawson L. S. Wong
  • Stefanie Tellex

Fetching items is an important problem for a social robot. It requires a robot to interpret a person's language and gesture and use these noisy observations to infer what item to deliver. If the robot could ask questions, it would help the robot be faster and more accurate in its task. Existing approaches either do not ask questions, or rely on fixed question-asking policies. To address this problem, we propose a model that makes assumptions about cooperation between agents to perform richer signal extraction from observations. This work defines a mathematical framework for an item-fetching domain that allows a robot to increase the speed and accuracy of its ability to interpret a person's requests by reasoning about its own uncertainty as well as processing implicit information (implicatures). We formalize the item-delivery domain as a Partially Observable Markov Decision Process (POMDP), and approximately solve this POMDP in real time. Our model improves speed and accuracy of fetching tasks by asking relevant clarifying questions only when necessary. To measure our model's improvements, we conducted a real world user study with 16 participants. Our method achieved greater accuracy and a faster interaction time compared to state-of-the-art baselines. Our model is 2.17 seconds faster (25% faster) than a state-of-the-art baseline, while being 2.1% more accurate.

AAMAS Conference 2016 Conference Paper

A Need for Speed: Adapting Agent Action Speed to Improve Task Learning from Non-Expert Humans

  • Bei Peng
  • James MacGlashan
  • Robert Loftin
  • Michael L. Littman
  • David L. Roberts
  • Matthew E. Taylor

As robots become pervasive in human environments, it is important to enable users to effectively convey new skills without programming. Most existing work on Interactive Reinforcement Learning focuses on interpreting and incorporating non-expert human feedback to speed up learning; we aim to design a better representation of the learning agent that is able to elicit more natural and effective communication between the human trainer and the learner, while treating human feedback as discrete communication that depends probabilistically on the trainer’s target policy. This work entails a user study where participants train a virtual agent to accomplish tasks by giving reward and/or punishment in a variety of simulated environments. We present results from 60 participants to show how a learner can ground natural language commands and adapt its action execution speed to learn more efficiently from human trainers. The agent’s action execution speed can be successfully modulated to encourage more explicit feedback from a human trainer in areas of the state space where there is high uncertainty. Our results show that our novel adaptive speed agent dominates different fixed speed agents on several measures of performance. Additionally, we investigate the impact of instructions on user performance and user preference in training conditions.

NeurIPS Conference 2016 Conference Paper

Showing versus doing: Teaching by demonstration

  • Mark Ho
  • Michael Littman
  • James MacGlashan
  • Fiery Cushman
  • Joseph Austerweil

People often learn from others' demonstrations, and classic inverse reinforcement learning (IRL) algorithms have brought us closer to realizing this capacity in machines. In contrast, teaching by demonstration has been less well studied computationally. Here, we develop a novel Bayesian model for teaching by demonstration. Stark differences arise when demonstrators are intentionally teaching a task versus simply performing a task. In two experiments, we show that human participants systematically modify their teaching behavior consistent with the predictions of our model. Further, we show that even standard IRL algorithms benefit when learning from behaviors that are intentionally pedagogical. We conclude by discussing IRL algorithms that can take advantage of intentional pedagogy.

IJCAI Conference 2015 Conference Paper

Between Imitation and Intention Learning

  • James MacGlashan
  • Michael L. Littman

Research in learning from demonstration can generally be grouped into either imitation learning or intention learning. In imitation learning, the goal is to imitate the observed behavior of an expert and is typically achieved using supervised learning techniques. In intention learning, the goal is to learn the intention that motivated the expert’s behavior and to use a planning algorithm to derive behavior. Imitation learning has the advantage of learning a direct mapping from states to actions, which bears a small computational cost. Intention learning has the advantage of behaving well in novel states, but may bear a large computational cost by relying on planning algorithms in complex tasks. In this work, we introduce receding horizon inverse reinforcement learning, in which the planning horizon induces a continuum between these two learning paradigms. We present empirical results on multiple domains that demonstrate that performing IRL with a small, but non-zero, receding planning horizon greatly decreases the computational cost of planning while maintaining superior generalization performance compared to imitation learning.

ICAPS Conference 2015 Conference Paper

Goal-Based Action Priors

  • David Abel
  • D. Ellis Hershkowitz
  • Gabriel Barth-Maron
  • Stephen Brawner
  • Kevin O'Farrell
  • James MacGlashan
  • Stefanie Tellex

Robots that interact with people must flexibly respond to requests by planning in stochastic state spaces that are often too large to solve for optimal behavior. In this work, we develop a framework for goal and state dependent action priors that can be used to prune away irrelevant actions based on the robot’s current goal, thereby greatly accelerating planning in a variety of complex stochastic environments. Our framework allows these goal-based action priors to be specified by an expert or to be learned from prior experience in related problems. We evaluate our approach in the video game Minecraft, whose complexity makes it an effective robot simulator. We also evaluate our approach in a robot cooking domain that is executed on a two-handed manipulator robot. In both cases, goal-based action priors enhance baseline planners by dramatically reducing the time taken to find a near-optimal plan.

IJCAI Conference 2015 Conference Paper

Portable Option Discovery for Automated Learning Transfer in Object-Oriented Markov Decision Processes

  • Nicholay Topin
  • Nicholas Haltmeyer
  • Shawn Squire
  • John Winder
  • Marie desJardins
  • James MacGlashan

We introduce a novel framework for option discovery and learning transfer in complex domains that are represented as object-oriented Markov decision processes (OO-MDPs) [Diuk et al., 2008]. Our framework, Portable Option Discovery (POD), extends existing option discovery methods, and enables transfer across related but different domains by providing an unsupervised method for finding a mapping between object-oriented domains with different state spaces. The framework also includes heuristic approaches for increasing the efficiency of the mapping process. We present the results of applying POD to Pickett and Barto’s [2002] PolicyBlocks and MacGlashan’s [2013] Option-Based Policy Transfer in two application domains. We show that our approach can discover options effectively, transfer options among different domains, and improve learning performance with low computational overhead.

AAAI Conference 2014 Conference Paper

A Strategy-Aware Technique for Learning Behaviors from Discrete Human Feedback

  • Robert Loftin
  • James MacGlashan
  • Bei Peng
  • Matthew Taylor
  • Michael Littman
  • Jeff Huang
  • David Roberts

This paper introduces two novel algorithms for learning behaviors from human-provided rewards. The primary novelty of these algorithms is that instead of treating the feedback as a numeric reward signal, they interpret feedback as a form of discrete communication that depends on both the behavior the trainer is trying to teach and the teaching strategy used by the trainer. For example, some human trainers use a lack of feedback to indicate whether actions are correct or incorrect, and interpreting this lack of feedback accurately can significantly improve learning speed. Results from user studies show that humans use a variety of training strategies in practice and both algorithms can learn a contextual bandit task faster than algorithms that treat the feedback as numeric. Simulated trainers are also employed to evaluate the algorithms in both contextual bandit and sequential decision-making tasks with similar results.