Arrow Research search

Author name cluster

Scott Niekum

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

71 papers
2 author rows

Possible papers

ICLR Conference 2025 Conference Paper

An Optimal Discriminator Weighted Imitation Perspective for Reinforcement Learning

  • Haoran Xu 0003
  • Shuozhe Li
  • Harshit Sikchi
  • Scott Niekum
  • Amy Zhang 0001

We introduce Iterative Dual Reinforcement Learning (IDRL), a new method that takes an optimal discriminator-weighted imitation view of solving RL. Our method is motivated by a simple experiment in which we find that training a discriminator using the offline dataset plus an additional expert dataset, and then performing discriminator-weighted behavior cloning, gives strong results on various types of datasets. That optimal discriminator weight is quite similar to the learned visitation distribution ratio in Dual-RL; however, we find that current Dual-RL methods do not correctly estimate that ratio. In IDRL, we propose a correction method to iteratively approach the optimal visitation distribution ratio in the offline dataset given no additional expert dataset. During each iteration, IDRL removes zero-weight suboptimal transitions using the learned ratio from the previous iteration and runs Dual-RL on the remaining subdataset. This can be seen as replacing the behavior visitation distribution with the optimized visitation distribution from the previous iteration, which theoretically gives a curriculum of improved visitation distribution ratios that are closer to the optimal discriminator weight. We verify the effectiveness of IDRL on various kinds of offline datasets, including D4RL datasets and more realistic corrupted demonstrations. IDRL beats strong Primal-RL and Dual-RL baselines in terms of both performance and stability, on all datasets.
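
As a concrete illustration of the discriminator-weighted imitation view described above, here is a minimal sketch of the motivating experiment: train a discriminator to separate expert from offline transitions, then weight behavior cloning by the optimal-discriminator ratio D/(1-D). All shapes and names (`disc`, `policy`) are hypothetical stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 17, 6  # hypothetical dimensions

disc = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
                     nn.Linear(256, 1))          # D(s, a) -> logit
policy = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                       nn.Linear(256, act_dim))  # deterministic policy, for brevity

def discriminator_loss(expert_sa, offline_sa):
    # Expert transitions labeled 1, offline-dataset transitions labeled 0.
    bce = nn.functional.binary_cross_entropy_with_logits
    logits_e, logits_o = disc(expert_sa), disc(offline_sa)
    return (bce(logits_e, torch.ones_like(logits_e)) +
            bce(logits_o, torch.zeros_like(logits_o)))

def weighted_bc_loss(offline_s, offline_a):
    # At optimality, D/(1-D) approximates the visitation ratio
    # d_expert/d_offline; use it to weight behavior cloning on offline data.
    sa = torch.cat([offline_s, offline_a], dim=-1)
    with torch.no_grad():
        d = torch.sigmoid(disc(sa))
        w = (d / (1.0 - d).clamp_min(1e-6)).squeeze(-1)
    per_sample = ((policy(offline_s) - offline_a) ** 2).mean(dim=-1)
    return (w * per_sample).mean()
```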

RLJ Journal 2025 Journal Article

Fast Adaptation with Behavioral Foundation Models

  • Harshit Sikchi
  • Andrea Tirinzoni
  • Ahmed Touati
  • Yingchen Xu
  • Anssi Kanervisto
  • Scott Niekum
  • Amy Zhang
  • Alessandro Lazaric

Unsupervised zero-shot reinforcement learning (RL) has emerged as a powerful paradigm for pretraining behavioral foundation models (BFMs), enabling agents to solve a wide range of downstream tasks specified via reward functions in a zero-shot fashion, i.e., without additional test-time learning or planning. This is achieved by learning self-supervised task embeddings alongside corresponding near-optimal behaviors and incorporating an inference procedure to directly retrieve the latent task embedding and associated policy for any given reward function. Despite promising results, zero-shot policies are often suboptimal due to errors induced by the unsupervised training process, the embedding, and the inference procedure. In this paper, we focus on devising fast adaptation strategies to improve the zero-shot performance of BFMs in a few steps of online interaction with the environment, while avoiding any performance drop during the adaptation process. Notably, we demonstrate that existing BFMs learn a set of skills containing more performant policies than those identified by their inference procedure, making them well-suited for fast adaptation. Motivated by this observation, we propose both actor-critic and actor-only fast adaptation strategies that search in the low-dimensional task-embedding space of the pre-trained BFM to rapidly improve the performance of its zero-shot policies on any downstream task. Notably, our approach mitigates the initial “unlearning” phase commonly observed when fine-tuning pre-trained RL models. We evaluate our fast adaptation strategies on top of four state-of-the-art zero-shot RL methods in multiple navigation and locomotion domains. Our results show that they achieve 10-40% improvement over their zero-shot performance in a few tens of episodes, outperforming existing baselines.
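
An actor-only variant of the adaptation idea above can be sketched as a simple search over the BFM's task-embedding space; keeping the best-known embedding at every step is what avoids the "unlearning" dip. `bfm.policy(z)`, `rollout_return`, and the Gymnasium-style environment API are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

def rollout_return(policy, env, episodes):
    # Average undiscounted return of `policy` (a callable obs -> action)
    # in a Gymnasium-style environment.
    total = 0.0
    for _ in range(episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            obs, r, term, trunc, _ = env.step(policy(obs))
            total += r
            done = term or trunc
    return total / episodes

def adapt(bfm, env, z_init, n_iters=20, pop=8, sigma=0.1, episodes=2):
    # Hill-climb in the low-dimensional task-embedding space, never
    # discarding the best embedding found so far.
    rng = np.random.default_rng(0)
    best_z = np.asarray(z_init, dtype=np.float64)
    best_ret = rollout_return(bfm.policy(best_z), env, episodes)
    for _ in range(n_iters):
        for z in best_z + sigma * rng.standard_normal((pop, best_z.size)):
            ret = rollout_return(bfm.policy(z), env, episodes)
            if ret > best_ret:
                best_z, best_ret = z, ret
    return best_z, best_ret
```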

RLC Conference 2025 Conference Paper

Fast Adaptation with Behavioral Foundation Models

  • Harshit Sikchi
  • Andrea Tirinzoni
  • Ahmed Touati
  • Yingchen Xu
  • Anssi Kanervisto
  • Scott Niekum
  • Amy Zhang
  • Alessandro Lazaric

Unsupervised zero-shot reinforcement learning (RL) has emerged as a powerful paradigm for pretraining behavioral foundation models (BFMs), enabling agents to solve a wide range of downstream tasks specified via reward functions in a zero-shot fashion, i.e., without additional test-time learning or planning. This is achieved by learning self-supervised task embeddings alongside corresponding near-optimal behaviors and incorporating an inference procedure to directly retrieve the latent task embedding and associated policy for any given reward function. Despite promising results, zero-shot policies are often suboptimal due to errors induced by the unsupervised training process, the embedding, and the inference procedure. In this paper, we focus on devising fast adaptation strategies to improve the zero-shot performance of BFMs in a few steps of online interaction with the environment, while avoiding any performance drop during the adaptation process. Notably, we demonstrate that existing BFMs learn a set of skills containing more performant policies than those identified by their inference procedure, making them well-suited for fast adaptation. Motivated by this observation, we propose both actor-critic and actor-only fast adaptation strategies that search in the low-dimensional task-embedding space of the pre-trained BFM to rapidly improve the performance of its zero-shot policies on any downstream task. Notably, our approach mitigates the initial “unlearning” phase commonly observed when fine-tuning pre-trained RL models. We evaluate our fast adaptation strategies on top of four state-of-the-art zero-shot RL methods in multiple navigation and locomotion domains. Our results show that they achieve 10-40% improvement over their zero-shot performance in a few tens of episodes, outperforming existing baselines.

RLJ Journal 2025 Journal Article

High-Confidence Policy Improvement from Human Feedback

  • Hon Tik Tse
  • Philip S. Thomas
  • Scott Niekum

Reinforcement learning from human feedback (RLHF) aims to learn or fine-tune policies via human preference data when a ground-truth reward function is not known. However, conventional RLHF methods provide no performance guarantees and have an unacceptably high probability of returning poorly performing policies. We propose Policy Optimization and Safety Test for Policy Improvement (POSTPI), an algorithm that provides high-confidence policy performance guarantees without direct knowledge of the ground-truth reward function, given only a preference dataset. The user of the algorithm may select any initial policy $\pi_\text{init}$ and confidence level $1 - \delta$, and POSTPI will ensure that the probability it returns a policy with performance worse than $\pi_\text{init}$ under the unobserved ground-truth reward function is at most $\delta$. We show theory as well as empirical results in the Safety Gymnasium suite that demonstrate that POSTPI reliably provides the desired guarantee.
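
The safety-test pattern the abstract describes can be sketched with a one-sided Student's t-based bound: return the candidate policy only if a $(1 - \delta)$ confidence lower bound on its improvement over $\pi_\text{init}$ is nonnegative. The per-sample improvement estimates `diffs` (e.g., performance differences under reward functions consistent with the preference data) are assumed given; this is a generic sketch, not the paper's exact bound.

```python
import numpy as np
from scipy import stats

def safety_test(diffs, delta=0.05):
    # One-sided (1 - delta) lower confidence bound on mean improvement.
    diffs = np.asarray(diffs, dtype=np.float64)
    n = diffs.size
    sem = diffs.std(ddof=1) / np.sqrt(n)
    lower = diffs.mean() - stats.t.ppf(1.0 - delta, df=n - 1) * sem
    return lower >= 0.0  # True: return candidate; False: fall back to pi_init
```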

RLC Conference 2025 Conference Paper

High-Confidence Policy Improvement from Human Feedback

  • Hon Tik Tse
  • Philip S. Thomas
  • Scott Niekum

Reinforcement learning from human feedback (RLHF) aims to learn or fine-tune policies via human preference data when a ground-truth reward function is not known. However, conventional RLHF methods provide no performance guarantees and have an unacceptably high probability of returning poorly performing policies. We propose Policy Optimization and Safety Test for Policy Improvement (POSTPI), an algorithm that provides high-confidence policy performance guarantees without direct knowledge of the ground-truth reward function, given only a preference dataset. The user of the algorithm may select any initial policy $\pi_\text{init}$ and confidence level $1 - \delta$, and POSTPI will ensure that the probability it returns a policy with performance worse than $\pi_\text{init}$ under the unobserved ground-truth reward function is at most $\delta$. We show theory as well as empirical results in the Safety Gymnasium suite that demonstrate that POSTPI reliably provides the desired guarantee.

ICLR Conference 2025 Conference Paper

Null Counterfactual Factor Interactions for Goal-Conditioned Reinforcement Learning

  • Caleb Chuck
  • Fan Feng
  • Carl Qi
  • Chang Shi
  • Siddhant Agarwal
  • Amy Zhang 0001
  • Scott Niekum

Hindsight relabeling is a powerful tool for overcoming sparsity in goal-conditioned reinforcement learning (GCRL), especially in certain domains such as navigation and locomotion. However, hindsight relabeling can struggle in object-centric domains. For example, suppose that the goal space consists of a robotic arm pushing a particular target block to a goal location. In this case, hindsight relabeling will give high rewards to any trajectory that does not interact with the block. However, these behaviors are only useful when the object is already at the goal---an extremely rare case in practice. A dataset dominated by these kinds of trajectories can complicate learning and lead to failures. In object-centric domains, one key intuition is that meaningful trajectories are often characterized by object-object interactions such as pushing the block with the gripper. To leverage this intuition, we introduce Hindsight Relabeling using Interactions (HInt), which combines interactions with hindsight relabeling to improve the sample efficiency of downstream RL. However, interactions do not have a consensus statistical definition that is tractable for downstream GCRL. Therefore, we propose a definition of interactions based on the concept of _null counterfactual_: a cause object is interacting with a target object if, in a world where the cause object did not exist, the target object would have different transition dynamics. We leverage this definition to infer interactions in Null Counterfactual Interaction Inference (NCII), which uses a "nulling" operation with a learned model to simulate absences and infer interactions. We demonstrate that NCII is able to achieve significantly improved interaction inference accuracy in both simple linear dynamics domains and dynamic robotic domains in Robosuite, Robot Air Hockey, and Franka Kitchen. Furthermore, we demonstrate that HInt improves sample efficiency by up to $4\times$ on goal-conditioned tasks in these domains.
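
The "nulling" operation lends itself to a compact sketch: query a learned factored dynamics model with the cause object present and masked out, and flag an interaction when the target factor's prediction changes. The model interface, zero-masking convention, and threshold below are hypothetical.

```python
import torch

def interacts(model, state_factors, cause_idx, target_idx, tau=0.1):
    # state_factors: (batch, n_factors, d) object-centric state; `model` maps
    # it to per-factor next-state predictions of the same shape.
    with torch.no_grad():
        pred_full = model(state_factors)
        nulled = state_factors.clone()
        nulled[:, cause_idx] = 0.0            # simulate the cause's absence
        pred_null = model(nulled)
        gap = (pred_full[:, target_idx] - pred_null[:, target_idx]).norm(dim=-1)
    return gap > tau  # True where the cause object affects the target's dynamics
```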

RLC Conference 2025 Conference Paper

Pareto Optimal Learning from Preferences with Hidden Context

  • Ryan Bahlous-Boldi
  • Li Ding
  • Lee Spector
  • Scott Niekum

Ensuring AI models align with human values is essential for their safety and functionality. Reinforcement learning from human feedback (RLHF) leverages human preferences to achieve this alignment. However, when preferences are sourced from diverse populations, point estimates of reward can result in suboptimal performance or be unfair to specific groups. We propose Pareto Optimal Preference Learning (POPL), which enables pluralistic alignment by framing discrepant group preferences as objectives with potential trade-offs, aiming for policies that are Pareto-optimal on the preference dataset. POPL utilizes lexicase selection, an iterative process that selects diverse and Pareto-optimal solutions. Our theoretical and empirical evaluations demonstrate that POPL surpasses baseline methods in learning sets of reward functions and policies, effectively catering to distinct groups without access to group numbers or membership labels. We verify the performance of POPL on a stateless preference learning setting, a Minigrid RL domain, Metaworld robotics benchmarks, as well as large language model (LLM) fine-tuning. We illustrate that POPL can also serve as a foundation for techniques optimizing specific notions of group fairness, ensuring safe and equitable AI model alignment.
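
Lexicase selection, the core subroutine POPL builds on, is easy to sketch: filter candidates through preference "cases" in a random order, keeping only those within epsilon of the best on each case. The `scores[i, j]` layout (candidate i's agreement with preference j) is a hypothetical convention, not the paper's code.

```python
import numpy as np

def lexicase_select(scores, epsilon=0.0, rng=None):
    # scores: (n_candidates, n_cases) array; returns one selected index.
    rng = rng if rng is not None else np.random.default_rng()
    survivors = np.arange(scores.shape[0])
    for case in rng.permutation(scores.shape[1]):
        best = scores[survivors, case].max()
        survivors = survivors[scores[survivors, case] >= best - epsilon]
        if survivors.size == 1:
            break
    return int(rng.choice(survivors))
```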

RLJ Journal 2025 Journal Article

Pareto Optimal Learning from Preferences with Hidden Context

  • Ryan Bahlous-Boldi
  • Li Ding
  • Lee Spector
  • Scott Niekum

Ensuring AI models align with human values is essential for their safety and functionality. Reinforcement learning from human feedback (RLHF) leverages human preferences to achieve this alignment. However, when preferences are sourced from diverse populations, point estimates of reward can result in suboptimal performance or be unfair to specific groups. We propose Pareto Optimal Preference Learning (POPL), which enables pluralistic alignment by framing discrepant group preferences as objectives with potential trade-offs, aiming for policies that are Pareto-optimal on the preference dataset. POPL utilizes lexicase selection, an iterative process that selects diverse and Pareto-optimal solutions. Our theoretical and empirical evaluations demonstrate that POPL surpasses baseline methods in learning sets of reward functions and policies, effectively catering to distinct groups without access to group numbers or membership labels. We verify the performance of POPL on a stateless preference learning setting, a Minigrid RL domain, Metaworld robotics benchmarks, as well as large language model (LLM) fine-tuning. We illustrate that POPL can also serve as a foundation for techniques optimizing specific notions of group fairness, ensuring safe and equitable AI model alignment.

RLC Conference 2025 Conference Paper

Reinforcement Learning from Human Feedback with High-Confidence Safety Guarantees

  • Yaswanth Chittepu
  • Blossom Metevier
  • Will Schwarzer
  • Austin Hoag
  • Scott Niekum
  • Philip S. Thomas

Existing approaches to language model alignment often treat safety as a tradeoff against helpfulness, which can lead to unacceptable actions in sensitive domains. To ensure reliable performance in such settings, we propose High-Confidence Safe Reinforcement Learning from Human Feedback (HC-RLHF), a method that provides high-confidence safety guarantees while maximizing helpfulness. Similar to previous methods, HC-RLHF explicitly decouples human preferences regarding helpfulness and harmlessness (safety) and trains separate reward and cost models, respectively. It then employs a two-step process to find safe solutions. In the first step, it optimizes the reward function while ensuring that a specific upper-confidence bound on the cost constraint is satisfied. In the second step, the trained model undergoes a safety test to verify whether its performance satisfies a separate upper-confidence bound on the cost constraint. We provide a theoretical analysis of HC-RLHF, including a proof that it will not return an unsafe solution with a probability greater than a user-specified threshold. For our empirical analysis, we apply HC-RLHF to align three different language models (Qwen2-1.5B, Qwen2.5-3B, and LLaMa-3.2-3B) with human preferences. Our results demonstrate that HC-RLHF produces safe models with high probability while also improving helpfulness and harmlessness compared to previous methods.
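
The second step (the safety test) follows the same high-confidence template as other entries on this page; a minimal sketch under assumed inputs: accept the trained model only if a $(1 - \delta)$ upper confidence bound on expected cost stays below the limit. `costs` (held-out cost-model scores) and the t-based bound are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy import stats

def passes_safety_test(costs, cost_limit, delta=0.05):
    # One-sided (1 - delta) upper confidence bound on the mean cost.
    costs = np.asarray(costs, dtype=np.float64)
    n = costs.size
    sem = costs.std(ddof=1) / np.sqrt(n)
    ucb = costs.mean() + stats.t.ppf(1.0 - delta, df=n - 1) * sem
    return ucb <= cost_limit  # False: reject the candidate as possibly unsafe
```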

RLJ Journal 2025 Journal Article

Reinforcement Learning from Human Feedback with High-Confidence Safety Guarantees

  • Yaswanth Chittepu
  • Blossom Metevier
  • Will Schwarzer
  • Austin Hoag
  • Scott Niekum
  • Philip S. Thomas

Existing approaches to language model alignment often treat safety as a tradeoff against helpfulness which can lead to unacceptable actions in sensitive domains. To ensure reliable performance in such settings, we propose High-Confidence Safe Reinforcement Learning from Human Feedback (HC-RLHF), a method that provides high-confidence safety guarantees while maximizing helpfulness. Similar to previous methods, HC-RLHF explicitly decouples human preferences regarding helpfulness and harmlessness (safety) and trains separate reward and cost models, respectively. It then employs a two-step process to find safe solutions. In the first step, it optimizes the reward function while ensuring that a specific upper-confidence bound on the cost constraint is satisfied. In the second step, the trained model undergoes a safety test to verify whether its performance satisfies a separate upper-confidence bound on the cost constraint. We provide a theoretical analysis of HC-RLHF, including a proof that it will not return an unsafe solution with a probability greater than a user-specified threshold. For our empirical analysis, we apply HC-RLHF to align three different language models (Qwen2-1.5B, Qwen2.5-3B, and LLaMa-3.2-3B) with human preferences. Our results demonstrate that HC-RLHF produces safe models with high probability while also improving helpfulness and harmlessness compared to previous methods.

NeurIPS Conference 2025 Conference Paper

RLZero: Direct Policy Inference from Language Without In-Domain Supervision

  • Harshit Sushil Sikchi
  • Siddhant Agarwal
  • Pranaya Jajoo
  • Samyak Parajuli
  • Caleb Chuck
  • Max Rudolph
  • Peter Stone
  • Amy Zhang
  • Scott Niekum

The reward hypothesis states that all goals and purposes can be understood as the maximization of a received scalar reward signal. However, in practice, defining such a reward signal is notoriously difficult, as humans are often unable to predict the optimal behavior corresponding to a reward function. Natural language offers an intuitive alternative for instructing reinforcement learning (RL) agents, yet previous language-conditioned approaches either require costly supervision or test-time training given a language instruction. In this work, we present a new approach that uses a pretrained RL agent trained using only unlabeled, offline interactions—without task-specific supervision or labeled trajectories—to get zero-shot test-time policy inference from arbitrary natural language instructions. We introduce a framework comprising three steps: imagine, project, and imitate. First, the agent imagines a sequence of observations corresponding to the provided language description using video generative models. Next, these imagined observations are projected into the target environment domain. Finally, an agent pretrained in the target environment with unsupervised RL instantly imitates the projected observation sequence through a closed-form solution. To the best of our knowledge, our method, RLZero, is the first approach to show direct language-to-behavior generation abilities on a variety of tasks and environments without any in-domain supervision. We further show that components of RLZero can be used to generate policies zero-shot from cross-embodied videos, such as those available on YouTube, even for complex embodiments like humanoids.

ICLR Conference 2024 Conference Paper

Contrastive Preference Learning: Learning from Human Feedback without Reinforcement Learning

  • Joey Hejna
  • Rafael Rafailov
  • Harshit Sikchi
  • Chelsea Finn
  • Scott Niekum
  • W. Bradley Knox
  • Dorsa Sadigh

Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular paradigm for aligning models with human intent. Typically, RLHF algorithms operate in two phases: first, use human preferences to learn a reward function and second, align the model by optimizing the learned reward via reinforcement learning (RL). This paradigm assumes that human preferences are distributed according to reward, but recent work suggests that they instead follow the *regret* under the user's optimal policy. Thus, learning a reward function from feedback is not only based on a flawed assumption of human preference, but also leads to unwieldy optimization challenges that stem from policy gradients or bootstrapping in the RL phase. Because of these optimization challenges, contemporary RLHF methods restrict themselves to contextual bandit settings (e.g., as in large language models) or limit observation dimensionality (e.g., state-based robotics). We overcome these limitations by introducing a new family of algorithms for optimizing behavior from human feedback using the *regret*-based model of human preferences. Using the principle of maximum entropy, we derive Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions, circumventing the need for RL. CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs. This enables CPL to elegantly scale to high-dimensional and sequential RLHF problems while being simpler than prior methods.
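
The contrastive objective can be sketched directly from the abstract: score each segment by a discounted sum of log-probabilities under the policy (a stand-in for regret-based advantage) and apply a logistic preference loss, with no reward model and no RL step. The temperature `alpha` and batch layout are assumptions, and this omits the paper's conservative variant.

```python
import torch
import torch.nn.functional as F

def cpl_loss(logp_pref, logp_dispref, gamma=0.99, alpha=0.1):
    # logp_*: (batch, T) per-step log pi(a_t | s_t) for the preferred and
    # dispreferred segments of each comparison.
    T = logp_pref.shape[1]
    disc = gamma ** torch.arange(T, dtype=logp_pref.dtype)
    score_pos = alpha * (disc * logp_pref).sum(dim=1)
    score_neg = alpha * (disc * logp_dispref).sum(dim=1)
    # Maximize P(preferred beats dispreferred) under a Bradley-Terry model.
    return -F.logsigmoid(score_pos - score_neg).mean()
```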

JMLR Journal 2024 Journal Article

Data-Efficient Policy Evaluation Through Behavior Policy Search

  • Josiah P. Hanna
  • Yash Chandak
  • Philip S. Thomas
  • Martha White
  • Peter Stone
  • Scott Niekum

We consider the task of evaluating a policy for a Markov decision process (MDP). The standard unbiased technique for evaluating a policy is to deploy the policy and observe its performance. We show that the data collected from deploying a different policy, commonly called the behavior policy, can be used to produce unbiased estimates with lower mean squared error than this standard technique. We derive an analytic expression for a minimal variance behavior policy -- a behavior policy that minimizes the mean squared error of the resulting estimates. Because this expression depends on terms that are unknown in practice, we propose a novel policy evaluation sub-problem, behavior policy search: searching for a behavior policy that reduces mean squared error. We present two behavior policy search algorithms and empirically demonstrate their effectiveness in lowering the mean squared error of policy performance estimates.
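
For context, the ordinary importance-sampling estimator whose mean squared error behavior policy search targets looks like this; `pi_e`/`pi_b` give action probabilities under the evaluation and behavior policies (hypothetical callables).

```python
import numpy as np

def is_estimate(trajectories, pi_e, pi_b):
    # Each trajectory is a list of (state, action, reward) tuples.
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for s, a, r in traj:
            weight *= pi_e(s, a) / pi_b(s, a)  # cumulative importance ratio
            ret += r
        estimates.append(weight * ret)
    # Unbiased for J(pi_e) whenever pi_b covers pi_e; behavior policy search
    # chooses pi_b to shrink the variance of this mean.
    return float(np.mean(estimates))
```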

ICLR Conference 2024 Conference Paper

Dual RL: Unification and New Methods for Reinforcement and Imitation Learning

  • Harshit Sikchi
  • Qinqing Zheng
  • Amy Zhang 0001
  • Scott Niekum

The goal of reinforcement learning (RL) is to find a policy that maximizes the expected cumulative return. It has been shown that this objective can be represented as an optimization problem of state-action visitation distribution under linear constraints. The dual problem of this formulation, which we refer to as *dual RL*, is unconstrained and easier to optimize. In this work, we first cast several state-of-the-art offline RL and offline imitation learning (IL) algorithms as instances of dual RL approaches with shared structures. Such unification allows us to identify the root cause of the shortcomings of prior methods. For offline IL, our analysis shows that prior methods are based on a restrictive coverage assumption that greatly limits their performance in practice. To fix this limitation, we propose a new discriminator-free method ReCOIL that learns to imitate from arbitrary off-policy data to obtain near-expert performance. For offline RL, our analysis frames a recent offline RL method XQL in the dual framework, and we further propose a new method $f$-DVL that provides alternative choices to the Gumbel regression loss that fixes the known training instability issue of XQL. The performance improvements by both of our proposed methods, ReCOIL and $f$-DVL, in IL and RL are validated on an extensive suite of simulated robot locomotion and manipulation tasks.

AAMAS Conference 2024 Conference Paper

Gaze Supervision for Mitigating Causal Confusion in Driving Agents

  • Abhijat Biswas
  • Badal Arun Pardhi
  • Caleb Chuck
  • Jarrett Holtz
  • Scott Niekum
  • Henny Admoni
  • Alessandro Allievi

Imitation Learning (IL) algorithms show promise in learning human-level driving behavior, but they often suffer from "causal confusion," a phenomenon where the lack of explicit inference of the underlying causal structure can result in misattribution of the relative importance of scene elements, especially pronounced in complex scenarios like urban driving with abundant information per time step. Our key idea is that while driving, human drivers naturally exhibit an easily obtained, continuous signal that is highly correlated with causal elements of the state space: eye gaze. We collect human driver demonstrations in a CARLA-based VR driving simulator, allowing us to capture eye gaze in the same simulation environment commonly used in prior work. Further, we propose a method to use gaze-based supervision to mitigate causal confusion in driving IL agents — exploiting the relative importance of gazed-at and not-gazed-at scene elements for driving decision-making. We present quantitative results demonstrating the promise of gaze-based supervision for improving the driving performance of IL agents.

TMLR Journal 2024 Journal Article

Granger Causal Interaction Skill Chains

  • Caleb Chuck
  • Kevin Black
  • Aditya Arjun
  • Yuke Zhu
  • Scott Niekum

Reinforcement Learning (RL) has demonstrated promising results in learning policies for complex tasks, but it often suffers from low sample efficiency and limited transferability. Hierarchical RL (HRL) methods aim to address the difficulty of learning long-horizon tasks by decomposing policies into skills, abstracting states, and reusing skills in new tasks. However, many HRL methods require some initial task success to discover useful skills, which paradoxically may be very unlikely without access to useful skills. On the other hand, reward-free HRL methods often need to learn far too many skills to achieve proper coverage in high-dimensional domains. In contrast, we introduce the Chain of Interaction Skills (COInS) algorithm, which focuses on *controllability* in factored domains to identify a small number of task-agnostic skills that allow for a high degree of control of the factored state. COInS uses learned detectors to identify interactions between state factors and then trains a chain of skills to control each of these factors successively. We evaluate COInS on a robotic pushing task with obstacles—a challenging domain where other RL and HRL methods fall short. We also demonstrate the transferability of skills learned by COInS, using variants of Breakout, a common RL benchmark, and show 2-3x improvement in both sample efficiency and final performance compared to standard RL baselines.

RLJ Journal 2024 Journal Article

Learning Action-based Representations Using Invariance

  • Max Rudolph
  • Caleb Chuck
  • Kevin Black
  • Misha Lvovsky
  • Scott Niekum
  • Amy Zhang

Robust reinforcement learning agents using high-dimensional observations must be able to identify relevant state features amidst many exogenous distractors. A representation that captures controllability identifies these state elements by determining what affects agent control. While methods such as inverse dynamics and mutual information capture controllability for a limited number of timesteps, capturing long-horizon elements remains a challenging problem. Myopic controllability can capture the moment right before an agent crashes into a wall, but not the control-relevance of the wall while the agent is still some distance away. To address this, we introduce action-bisimulation encoding, a method inspired by the bisimulation invariance pseudometric, that extends single-step controllability with a recursive invariance constraint. By doing this, action-bisimulation learns a multi-step controllability metric that smoothly discounts distant state features that are relevant for control. We demonstrate that action-bisimulation pretraining on reward-free, uniformly random data improves sample efficiency in several environments, including a photorealistic 3D simulation domain, Habitat. Additionally, we provide theoretical analysis and qualitative results demonstrating the information captured by action-bisimulation.
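
One way to read the recursive invariance constraint is as a bisimulation-style fixed point: embedding distances should match a single-step controllability distance plus a discounted distance between successor embeddings. The loss below is a speculative sketch of that recursion, not the paper's exact objective; `ctrl_dist` (e.g., derived from an inverse-dynamics model) and the target-network setup are assumptions.

```python
import torch
import torch.nn.functional as F

def action_bisim_loss(encoder, target_encoder, s1, s2, ns1, ns2,
                      ctrl_dist, gamma=0.99):
    # ctrl_dist: (batch,) single-step controllability distance between s1, s2.
    z1, z2 = encoder(s1), encoder(s2)
    with torch.no_grad():  # bootstrap the multi-step metric from a target net
        nz1, nz2 = target_encoder(ns1), target_encoder(ns2)
        target = ctrl_dist + gamma * (nz1 - nz2).norm(dim=-1)
    return F.mse_loss((z1 - z2).norm(dim=-1), target)
```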

RLC Conference 2024 Conference Paper

Learning Action-based Representations Using Invariance

  • Max Rudolph
  • Caleb Chuck
  • Kevin Black
  • Misha Lvovsky
  • Scott Niekum
  • Amy Zhang

Robust reinforcement learning agents using high-dimensional observations must be able to identify relevant state features amidst many exogenous distractors. A representation that captures controllability identifies these state elements by determining what affects agent control. While methods such as inverse dynamics and mutual information capture controllability for a limited number of timesteps, capturing long-horizon elements remains a challenging problem. Myopic controllability can capture the moment right before an agent crashes into a wall, but not the control-relevance of the wall while the agent is still some distance away. To address this, we introduce action-bisimulation encoding, a method inspired by the bisimulation invariance pseudometric, that extends single-step controllability with a recursive invariance constraint. By doing this, action-bisimulation learns a multi-step controllability metric that smoothly discounts distant state features that are relevant for control. We demonstrate that action-bisimulation pretraining on reward-free, uniformly random data improves sample efficiency in several environments, including a photorealistic 3D simulation domain, Habitat. Additionally, we provide theoretical analysis and qualitative results demonstrating the information captured by action-bisimulation.

AAAI Conference 2024 Conference Paper

Learning Optimal Advantage from Preferences and Mistaking It for Reward

  • W. Bradley Knox
  • Stephane Hatgis-Kessell
  • Sigurdur Orn Adalgeirsson
  • Serena Booth
  • Anca Dragan
  • Peter Stone
  • Scott Niekum

We consider algorithms for learning reward functions from human preferences over pairs of trajectory segments, as used in reinforcement learning from human feedback (RLHF). Most recent work assumes that human preferences are generated based only upon the reward accrued within those segments, or their partial return. Recent work casts doubt on the validity of this assumption, proposing an alternative preference model based upon regret. We investigate the consequences of assuming preferences are based upon partial return when they actually arise from regret. We argue that the learned function is an approximation of the optimal advantage function, not a reward function. We find that if a specific pitfall is addressed, this incorrect assumption is not particularly harmful, resulting in a highly shaped reward function. Nonetheless, this incorrect usage of the approximation of the optimal advantage function is less desirable than the appropriate and simpler approach of greedy maximization of it. From the perspective of the regret preference model, we also provide a clearer interpretation of fine-tuning contemporary large language models with RLHF. This paper overall provides insight regarding why learning under the partial return preference model tends to work so well in practice, despite it conforming poorly to how humans give preferences.

TMLR Journal 2024 Journal Article

Models of human preference for learning reward functions

  • W. Bradley Knox
  • Stephane Hatgis-Kessell
  • Serena Booth
  • Scott Niekum
  • Peter Stone
  • Alessandro G Allievi

The utility of reinforcement learning is limited by the alignment of reward functions with the interests of human stakeholders. One promising method for alignment is to learn the reward function from human-generated preferences between pairs of trajectory segments, a type of reinforcement learning from human feedback (RLHF). These human preferences are typically assumed to be informed solely by partial return, the sum of rewards along each segment. We find this assumption to be flawed and propose modeling human preferences instead as informed by each segment’s regret, a measure of a segment’s deviation from optimal decision-making. Given infinitely many preferences generated according to regret, we prove that we can identify a reward function equivalent to the reward function that generated those preferences, and we prove that the previous partial return model lacks this identifiability property in multiple contexts. We empirically show that our proposed regret preference model outperforms the partial return preference model with finite training data in otherwise the same setting. Additionally, we find that our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to policies that are better human-aligned. Overall, this work establishes that the choice of preference model is impactful, and our proposed regret preference model provides an improvement upon a core assumption of recent research. We have open sourced our experimental code, the human preferences dataset we gathered, and our training and preference elicitation interfaces for gathering such a dataset.
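
The two competing preference models can be written side by side as Bradley-Terry likelihoods over segment statistics; the sign flip is the essential difference, since lower regret should be preferred. Inputs are hypothetical per-segment quantities, not the paper's exact parameterization.

```python
import numpy as np

def pref_prob_partial_return(ret_a, ret_b):
    # Partial-return model: preference driven by summed segment rewards.
    return 1.0 / (1.0 + np.exp(-(ret_a - ret_b)))

def pref_prob_regret(regret_a, regret_b):
    # Regret model: deviation from optimal decision-making; lower is better.
    return 1.0 / (1.0 + np.exp(-(regret_b - regret_a)))
```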

NeurIPS Conference 2024 Conference Paper

Predicting Future Actions of Reinforcement Learning Agents

  • Stephen Chung
  • Scott Niekum
  • David Krueger

As reinforcement learning agents become increasingly deployed in real-world scenarios, predicting future agent actions and events during deployment is important for facilitating better human-agent interaction and preventing catastrophic outcomes. This paper experimentally evaluates and compares the effectiveness of future action and event prediction for three types of RL agents: explicitly planning, implicitly planning, and non-planning. We employ two approaches: the inner state approach, which involves predicting based on the inner computations of the agents (e.g., plans or neuron activations), and a simulation-based approach, which involves unrolling the agent in a learned world model. Our results show that the plans of explicitly planning agents are significantly more informative for prediction than the neuron activations of the other types. Furthermore, using internal plans proves more robust to model quality compared to simulation-based approaches when predicting actions, while the results for event prediction are more mixed. These findings highlight the benefits of leveraging inner states and simulations to predict future agent actions and events, thereby improving interaction and safety in real-world deployments.

NeurIPS Conference 2024 Conference Paper

Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms

  • Rafael Rafailov
  • Yaswanth Chittepu
  • Ryan Park
  • Harshit Sikchi
  • Joey Hejna
  • W. B. Knox
  • Chelsea Finn
  • Scott Niekum

Reinforcement Learning from Human Feedback (RLHF) has been crucial to the recent success of Large Language Models (LLMs); however, it is often a complex and brittle process. In the classical RLHF framework, a reward model is first trained to represent human preferences, which is in turn used by an online reinforcement learning (RL) algorithm to optimize the LLM. A prominent issue with such methods is reward over-optimization or reward hacking, where the performance as measured by the learned proxy reward model increases, but the true model quality plateaus or even deteriorates. Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO), have emerged as alternatives to the classical RLHF pipeline. However, despite not training a separate proxy reward model or using RL, they still commonly deteriorate from over-optimization. While the so-called reward hacking phenomenon is not well-defined for DAAs, we still uncover similar trends: at higher KL-budgets, DAA algorithms exhibit similar degradation patterns to their classic RLHF counterparts. In particular, we find that DAA methods deteriorate not only across a wide range of KL-budgets, but also often before even a single epoch of the dataset is completed. Through extensive empirical experimentation, this work formulates the reward over-optimization or hacking problem for DAAs and explores its consequences across objectives, training regimes, and model scales.

ICLR Conference 2024 Conference Paper

Score Models for Offline Goal-Conditioned Reinforcement Learning

  • Harshit Sikchi
  • Rohan Chitnis
  • Ahmed Touati
  • Alborz Geramifard
  • Amy Zhang 0001
  • Scott Niekum

Offline Goal-Conditioned Reinforcement Learning (GCRL) is tasked with learning to achieve multiple goals in an environment purely from offline datasets using sparse reward functions. Offline GCRL is pivotal for developing generalist agents capable of leveraging pre-existing datasets to learn diverse and reusable skills without hand-engineering reward functions. However, contemporary approaches to GCRL based on supervised learning and contrastive learning are often suboptimal in the offline setting. An alternative perspective on GCRL optimizes for occupancy matching, but necessitates learning a discriminator, which subsequently serves as a pseudo-reward for downstream RL. Inaccuracies in the learned discriminator can cascade, negatively influencing the resulting policy. We present a novel approach to GCRL under a new lens of mixture-distribution matching, leading to our discriminator-free method: SMORe. The key insight is combining the occupancy matching perspective of GCRL with a convex dual formulation to derive a learning objective that can better leverage suboptimal offline data. SMORe learns *scores* or unnormalized densities representing the importance of taking an action at a state for reaching a particular goal. SMORe is principled and our extensive experiments on the fully offline GCRL benchmark composed of robot manipulation and locomotion tasks, including high-dimensional observations, show that SMORe can outperform state-of-the-art baselines by a significant margin.

NeurIPS Conference 2024 Conference Paper

SkiLD: Unsupervised Skill Discovery Guided by Factor Interactions

  • Zizhao Wang
  • Jiaheng Hu
  • Caleb Chuck
  • Stephen Chen
  • Roberto Martín-Martín
  • Amy Zhang
  • Scott Niekum
  • Peter Stone

Unsupervised skill discovery carries the promise that an intelligent agent can learn reusable skills through autonomous, reward-free interactions with environments. Existing unsupervised skill discovery methods learn skills by encouraging distinguishable behaviors that cover diverse states. However, in complex environments with many state factors (e.g., household environments with many objects), learning skills that cover all possible states is impossible, and naively encouraging state diversity often leads to simple skills that are not ideal for solving downstream tasks. This work introduces Skill Discovery from Local Dependencies (SkiLD), which leverages state factorization as a natural inductive bias to guide the skill learning process. The key intuition guiding SkiLD is that skills that induce *diverse interactions* between state factors are often more valuable for solving downstream tasks. To this end, SkiLD develops a novel skill learning objective that explicitly encourages the mastering of skills that effectively induce different interactions within an environment. We evaluate SkiLD in several domains with challenging, long-horizon sparse reward tasks including a realistic simulated household robot domain, where SkiLD successfully learns skills with clear semantic meaning and shows superior performance compared to existing unsupervised reinforcement learning methods that only maximize state coverage.

TMLR Journal 2023 Journal Article

A Ranking Game for Imitation Learning

  • Harshit Sikchi
  • Akanksha Saran
  • Wonjoon Goo
  • Scott Niekum

We propose a new framework for imitation learning---treating imitation as a two-player ranking-based game between a policy and a reward. In this game, the reward agent learns to satisfy pairwise performance rankings between behaviors, while the policy agent learns to maximize this reward. In imitation learning, near-optimal expert data can be difficult to obtain, and even in the limit of infinite data it cannot imply a total ordering over trajectories as preferences can. On the other hand, learning from preferences alone is challenging, as a large number of preferences are required to infer a high-dimensional reward function, though preference data is typically much easier to collect than expert demonstrations. The classical inverse reinforcement learning (IRL) formulation learns from expert demonstrations but provides no mechanism to incorporate learning from offline preferences, and vice versa. We instantiate the proposed ranking-game framework with a novel ranking loss, giving an algorithm that can simultaneously learn from expert demonstrations and preferences, gaining the advantages of both modalities. Our experiments show that the proposed method achieves state-of-the-art sample efficiency and can solve previously unsolvable tasks in the Learning from Observation (LfO) setting.
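
The reward player's update can be sketched as a supervised ranking step: push the reward to score "better" behavior above "worse" behavior by a margin. A hinge loss is used here as one simple instantiation; the paper's specific ranking loss, margin, and feature layout are not reproduced.

```python
import torch
import torch.nn.functional as F

def ranking_loss(reward_net, worse_sa, better_sa, margin=1.0):
    # worse_sa / better_sa: (batch, d) state-action features whose ranking
    # (e.g., agent vs. expert rollouts, or preference data) is known.
    r_worse = reward_net(worse_sa).squeeze(-1)
    r_better = reward_net(better_sa).squeeze(-1)
    return F.relu(margin - (r_better - r_worse)).mean()
```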

EWRL Workshop 2023 Workshop Paper

Dual RL: Unification and New Methods for Reinforcement and Imitation Learning

  • Harshit Sikchi
  • Qinqing Zheng
  • Amy Zhang
  • Scott Niekum

The goal of reinforcement learning (RL) is to maximize the expected cumulative return. It has been shown that this objective can be represented by an optimization problem of the state-action visitation distribution under linear constraints. The dual problem of this formulation, which we refer to as dual RL, is unconstrained and easier to optimize. We show that several state-of-the-art off-policy deep reinforcement learning (RL) algorithms, in both online and offline RL and imitation learning (IL) settings, can be viewed as dual RL approaches in a unified framework. This unification provides a common ground to study and identify the components that contribute to the success of these methods and also reveals the common shortcomings across methods with new insights for improvement. Our analysis shows that prior off-policy imitation learning methods are based on an unrealistic coverage assumption and are minimizing a particular $f$-divergence between the visitation distributions of the learned policy and the expert policy. We propose a new method using a simple modification to the dual RL framework that allows for performant imitation learning with arbitrary off-policy data to obtain near-expert performance, without learning a discriminator. Further, by framing a recent SOTA offline RL method XQL in the dual RL framework, we propose alternative choices to replace the Gumbel regression loss, which achieve improved performance and resolve the training instability issue of XQL.

AAAI Conference 2023 Conference Paper

The Perils of Trial-and-Error Reward Design: Misdesign through Overfitting and Invalid Task Specifications

  • Serena Booth
  • W. Bradley Knox
  • Julie Shah
  • Scott Niekum
  • Peter Stone
  • Alessandro Allievi

In reinforcement learning (RL), a reward function that aligns exactly with a task's true performance metric is often necessarily sparse. For example, a true task metric might encode a reward of 1 upon success and 0 otherwise. The sparsity of these true task metrics can make them hard to learn from, so in practice they are often replaced with alternative dense reward functions. These dense reward functions are typically designed by experts through an ad hoc process of trial and error. In this process, experts manually search for a reward function that improves performance with respect to the task metric while also enabling an RL algorithm to learn faster. This process raises the question of whether the same reward function is optimal for all algorithms, i.e., whether the reward function can be overfit to a particular algorithm. In this paper, we study the consequences of this wide yet unexamined practice of trial-and-error reward design. We first conduct computational experiments that confirm that reward functions can be overfit to learning algorithms and their hyperparameters. We then conduct a controlled observation study which emulates expert practitioners' typical experiences of reward design, in which we similarly find evidence of reward function overfitting. We also find that experts' typical approach to reward design---of adopting a myopic strategy and weighing the relative goodness of each state-action pair---leads to misdesign through invalid task specifications, since RL algorithms use cumulative reward rather than rewards for individual state-action pairs as an optimization target. Code, data: github.com/serenabooth/reward-design-perils

ICLR Conference 2022 Conference Paper

Fairness Guarantees under Demographic Shift

  • Stephen Giguere 0001
  • Blossom Metevier
  • Bruno Castro da Silva
  • Yuriy Brun
  • Philip S. Thomas
  • Scott Niekum

Recent studies have demonstrated that using machine learning for social applications can lead to injustice in the form of racist, sexist, and otherwise unfair and discriminatory outcomes. To address this challenge, recent machine learning algorithms have been designed to limit the likelihood such unfair behaviors will occur. However, these approaches typically assume the data used for training is representative of what will be encountered once the model is deployed, thus limiting their usefulness. In particular, if certain subgroups of the population become more or less probable after the model is deployed (a phenomenon we call demographic shift), the fairness assurances provided by prior algorithms are often invalid. We consider the impact of demographic shift and present a class of algorithms, called Shifty algorithms, that provide high-confidence behavioral guarantees that hold under demographic shift. Shifty is the first technique of its kind and demonstrates an effective strategy for designing algorithms to overcome the challenges demographic shift poses. We evaluate Shifty-ttest, an implementation of Shifty based on Student’s t-test, and, using a real-world data set of university entrance exams and subsequent student success, show that the models output by our algorithm avoid unfair bias under demographic shift, unlike existing methods. Our experiments demonstrate that our algorithm’s high-confidence fairness guarantees are valid in practice and that our algorithm is an effective tool for training models that are fair when demographic shift occurs.

IROS Conference 2022 Conference Paper

Understanding Acoustic Patterns of Human Teachers Demonstrating Manipulation Tasks to Robots

  • Akanksha Saran
  • Kush Desai
  • Mai Lee Chang
  • Rudolf Lioutikov
  • Andrea Thomaz
  • Scott Niekum

Humans use audio signals in the form of spoken language or verbal reactions effectively when teaching new skills or tasks to other humans. While demonstrations allow humans to teach robots in a natural way, learning from trajectories alone does not leverage other available modalities including audio from human teachers. To effectively utilize audio cues accompanying human demonstrations, first it is important to understand what kind of information is present and conveyed by such cues. This work characterizes audio from human teachers demonstrating multi-step manipulation tasks to a situated Sawyer robot along three dimensions: (1) duration of speech used, (2) expressiveness in speech or prosody, and (3) semantic content of speech. We analyze these features for four different independent variables and find that teachers convey similar semantic content via spoken words for different conditions of (1) demonstration types, (2) audio usage instructions, (3) subtasks, and (4) errors during demonstrations. However, differentiating properties of speech in terms of duration and expressiveness are present for the four independent variables, highlighting that human audio carries rich information, potentially beneficial for technological advancement of robot learning from demonstration methods.

JMLR Journal 2021 Journal Article

A Review of Robot Learning for Manipulation: Challenges, Representations, and Algorithms

  • Oliver Kroemer
  • Scott Niekum
  • George Konidaris

A key challenge in intelligent robotics is creating robots that are capable of directly interacting with the world around them to achieve their goals. The last decade has seen substantial growth in research on the problem of robot manipulation, which aims to exploit the increasing availability of affordable robot arms and grippers to create robots capable of directly interacting with the world to achieve their goals. Learning will be central to such autonomous systems, as the real world contains too much variation for a robot to expect to have an accurate model of its environment, the objects in it, or the skills required to manipulate them, in advance. We aim to survey a representative subset of that research which uses machine learning for manipulation. We describe a formalization of the robot manipulation learning problem that synthesizes existing research into a single coherent framework and highlight the many remaining research opportunities and challenges.

NeurIPS Conference 2021 Conference Paper

Adversarial Intrinsic Motivation for Reinforcement Learning

  • Ishan Durugkar
  • Mauricio Tec
  • Scott Niekum
  • Peter Stone

Learning with an objective to minimize the mismatch with a reference distribution has been shown to be useful for generative modeling and imitation learning. In this paper, we investigate whether one such objective, the Wasserstein-1 distance between a policy's state visitation distribution and a target distribution, can be utilized effectively for reinforcement learning (RL) tasks. Specifically, this paper focuses on goal-conditioned reinforcement learning where the idealized (unachievable) target distribution has full measure at the goal. This paper introduces a quasimetric specific to Markov Decision Processes (MDPs) and uses this quasimetric to estimate the above Wasserstein-1 distance. It further shows that the policy that minimizes this Wasserstein-1 distance is the policy that reaches the goal in as few steps as possible. Our approach, termed Adversarial Intrinsic Motivation (AIM), estimates this Wasserstein-1 distance through its dual objective and uses it to compute a supplemental reward function. Our experiments show that this reward function changes smoothly with respect to transitions in the MDP and directs the agent's exploration to find the goal efficiently. Additionally, we combine AIM with Hindsight Experience Replay (HER) and show that the resulting algorithm accelerates learning significantly on several simulated robotics tasks when compared to other rewards that encourage exploration or accelerate learning.
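
A rough sketch of the dual estimate: learn a potential `f` that is large at the goal and small over visited states, penalized so it cannot change by more than about 1 across a single transition (a stand-in for the quasimetric smoothness constraint), then use its differences as the supplemental reward. The shapes, penalty form, and weight below are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))  # potential

def aim_potential_loss(goal_s, s, next_s, lam=10.0):
    # Dual objective: separate goal states from visited states under a
    # per-transition smoothness penalty on f.
    step = (f(next_s) - f(s)).squeeze(-1)
    smooth = (step.abs() - 1.0).clamp_min(0.0).pow(2).mean()
    return -(f(goal_s).mean() - f(s).mean()) + lam * smooth

def aim_reward(s, next_s):
    # Supplemental reward: progress of the potential along the transition.
    with torch.no_grad():
        return (f(next_s) - f(s)).squeeze(-1)
```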

AAAI Conference 2021 System Paper

Demonstration of the EMPATHIC Framework for Task Learning from Implicit Human Feedback

  • Yuchen Cui
  • Qiping Zhang
  • Sahil Jain
  • Alessandro Allievi
  • Peter Stone
  • Scott Niekum
  • W. Bradley Knox

Reactions such as gestures, facial expressions, and vocalizations are an abundant, naturally occurring channel of information that humans provide during interactions. An agent could leverage an understanding of such implicit human feedback to improve its task performance at no cost to the human. This approach contrasts with common agent teaching methods based on demonstrations, critiques, or other guidance that need to be attentively and intentionally provided. In this work, we demonstrate a novel data-driven framework for learning from implicit human feedback, EMPATHIC. This two-stage method consists of (1) mapping implicit human feedback to relevant task statistics such as rewards, optimality, and advantage; and (2) using such a mapping to learn a task. We instantiate the first stage and three second-stage evaluations of the learned mapping. To do so, we collect a dataset of human facial reactions while participants observe an agent execute a sub-optimal policy for a prescribed training task. We train a deep neural network on this data and demonstrate its ability to (1) infer relative reward ranking of events in the training task from prerecorded human facial reactions; (2) improve the policy of an agent in the training task using live human facial reactions; and (3) transfer to a novel domain in which it evaluates robot manipulation trajectories. In the video, we focus on demonstrating the online learning capability of our instantiation of EMPATHIC.

AAMAS Conference 2021 Conference Paper

Efficiently Guiding Imitation Learning Agents with Human Gaze

  • Akanksha Saran
  • Ruohan Zhang
  • Elaine S. Short
  • Scott Niekum

Human gaze is known to be an intention-revealing signal in human demonstrations of tasks. In this work, we use gaze cues from human demonstrators to enhance the performance of agents trained via three popular imitation learning methods — behavioral cloning (BC), behavioral cloning from observation (BCO), and Trajectory-ranked Reward EXtrapolation (T-REX). Based on similarities between the attention of reinforcement learning agents and human gaze, we propose a novel approach for utilizing gaze data in a computationally efficient manner, as part of an auxiliary loss function, which guides a network to have higher activations in image regions where the human’s gaze fixated. This work is a step towards augmenting any existing convolutional imitation learning agent’s training with auxiliary gaze data. Our auxiliary coverage-based gaze loss (CGL) guides learning toward a better reward function or policy, without adding any additional learnable parameters and without requiring gaze data at test time. We find that our proposed approach improves performance by 95% for BC, 343% for BCO, and 390% for T-REX, averaged over 20 different Atari games. We also find that compared to a prior state-of-the-art imitation learning method assisted by human gaze (AGIL), our method achieves better performance, and is more efficient in terms of learning with fewer demonstrations. We further interpret trained CGL agents with a saliency map visualization method to explain their performance. Finally, we show that CGL can help alleviate a well-known causal confusion problem in imitation learning.
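
A coverage-style gaze loss of the kind described can be sketched as a KL term that penalizes gazed-at regions the network's activation map fails to cover; it adds no learnable parameters and needs no gaze at test time. The map shapes and KL direction are assumptions, not the paper's exact formulation.

```python
import torch

def gaze_coverage_loss(activation_map, gaze_map, eps=1e-8):
    # Both maps: (batch, H, W), nonnegative; normalize to distributions.
    a = activation_map.flatten(1)
    g = gaze_map.flatten(1)
    a = a / (a.sum(dim=1, keepdim=True) + eps)
    g = g / (g.sum(dim=1, keepdim=True) + eps)
    # KL(gaze || activations): large where humans looked but the net did not.
    return (g * ((g + eps).log() - (a + eps).log())).sum(dim=1).mean()
```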

ICRA Conference 2021 Conference Paper

ScrewNet: Category-Independent Articulation Model Estimation From Depth Images Using Screw Theory

  • Ajinkya Jain
  • Rudolf Lioutikov
  • Caleb Chuck
  • Scott Niekum

Robots in human environments will need to interact with a wide variety of articulated objects such as cabinets, drawers, and dishwashers while assisting humans in performing day-to-day tasks. Existing methods either require objects to be textured or need to know the articulation model category a priori for estimating the model parameters for an articulated object. We propose ScrewNet, a novel approach that estimates an object’s articulation model directly from depth images without requiring a priori knowledge of the articulation model category. ScrewNet uses screw theory to unify the representation of different articulation types and perform category-independent articulation model estimation. We evaluate our approach on two benchmarking datasets and three real-world objects and compare its performance with a current state-of-the-art method. Results demonstrate that ScrewNet can successfully estimate the articulation models and their parameters for novel objects across articulation model categories, with better accuracy on average than the prior state-of-the-art method.

IROS Conference 2021 Conference Paper

Self-Supervised Online Reward Shaping in Sparse-Reward Environments

  • Farzan Memarian
  • Wonjoon Goo
  • Rudolf Lioutikov
  • Scott Niekum
  • Ufuk Topcu

We introduce Self-supervised Online Reward Shaping (SORS), which aims to improve the sample efficiency of any RL algorithm in sparse-reward environments by automatically densifying rewards. The proposed framework alternates between classification-based reward inference and policy update steps—the original sparse reward provides a self-supervisory signal for reward inference by ranking trajectories that the agent observes, while the policy update is performed with the newly inferred, typically dense reward function. We introduce theory that shows that, under certain conditions, this alteration of the reward function will not change the optimal policy of the original MDP, while potentially increasing learning speed significantly. Experimental results on several sparse-reward environments demonstrate that, across multiple domains, the proposed algorithm is not only significantly more sample efficient than a standard RL baseline using sparse rewards, but, at times, also achieves similar sample efficiency compared to when hand-designed dense reward functions are used.
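
The reward-inference step can be sketched as a pairwise ranking objective: order observed trajectories by their sparse returns and train a dense reward so that summed predicted reward reproduces that ordering (a T-REX-style classification loss). The network and data layout below are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))

def sors_loss(better_states, worse_states):
    # better/worse_states: (T, d) states from two trajectories, where the
    # first trajectory achieved a higher sparse return.
    r_better = reward_net(better_states).sum()
    r_worse = reward_net(worse_states).sum()
    # Classify the ordering from summed dense reward (Bradley-Terry form).
    return -F.logsigmoid(r_better - r_worse)
```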

NeurIPS Conference 2021 Conference Paper

SOPE: Spectrum of Off-Policy Estimators

  • Christina Yuan
  • Yash Chandak
  • Stephen Giguere
  • Philip S. Thomas
  • Scott Niekum

Many sequential decision making problems are high-stakes and require off-policy evaluation (OPE) of a new policy using historical data collected using some other policy. One of the most common OPE techniques that provides unbiased estimates is trajectory-based importance sampling (IS). However, due to the high variance of trajectory IS estimates, importance sampling methods based on state-action visitation distributions (SIS) have recently been adopted. Unfortunately, while SIS often provides lower variance estimates for long horizons, estimating the state-action distribution ratios can be challenging and lead to biased estimates. In this paper, we present a new perspective on this bias-variance trade-off and show the existence of a spectrum of estimators whose endpoints are SIS and IS. Additionally, we establish a spectrum for doubly-robust and weighted versions of these estimators. We provide empirical evidence that estimators in this spectrum can be used to trade-off between the bias and variance of IS and SIS and can achieve lower mean-squared error than both IS and SIS.

IJCAI Conference 2021 Conference Paper

Understanding the Relationship between Interactions and Outcomes in Human-in-the-Loop Machine Learning

  • Yuchen Cui
  • Pallavi Koppol
  • Henny Admoni
  • Scott Niekum
  • Reid Simmons
  • Aaron Steinfeld
  • Tesca Fitzgerald

Human-in-the-loop Machine Learning (HIL-ML) is a widely adopted paradigm for instilling human knowledge in autonomous agents. Many design choices influence the efficiency and effectiveness of such interactive learning processes, particularly the interaction type through which the human teacher may provide feedback. While different interaction types (demonstrations, preferences, etc.) have been proposed and evaluated in the HIL-ML literature, there has been little discussion of how these compare or how they should be selected to best address a particular learning problem. In this survey, we propose an organizing principle for HIL-ML that provides a way to analyze the effects of interaction types on human performance and training data. We also identify open problems in understanding the effects of interaction types.

NeurIPS Conference 2021 Conference Paper

Universal Off-Policy Evaluation

  • Yash Chandak
  • Scott Niekum
  • Bruno da Silva
  • Erik Learned-Miller
  • Emma Brunskill
  • Philip S. Thomas

When faced with sequential decision-making problems, it is often useful to be able to predict what would happen if decisions were made using a new policy. Those predictions must often be based on data collected under some previously used decision-making rule. Many previous methods enable such off-policy (or counterfactual) estimation of the expected value of a performance measure called the return. In this paper, we take the first steps towards a 'universal off-policy estimator' (UnO), one that provides off-policy estimates and high-confidence bounds for any parameter of the return distribution. We use UnO for estimating and simultaneously bounding the mean, variance, quantiles/median, inter-quantile range, CVaR, and the entire cumulative distribution of returns. Finally, we also discuss UnO's applicability in various settings, including fully observable, partially observable (i.e., with unobserved confounders), Markovian, non-Markovian, stationary, smoothly non-stationary, and discrete distribution shifts.
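
The core plug-in idea, estimating the whole return CDF with importance weights and reading any parameter off it, can be sketched in a few lines of numpy. The names are ours, and the high-confidence bounds the paper builds on top are omitted:

    import numpy as np

    def off_policy_cdf(returns, is_weights):
        """Off-policy estimate of the return CDF under the evaluation
        policy: F(x) = (1/n) * sum_i rho_i * 1{G_i <= x}, with rho_i the
        full-trajectory importance weight of trajectory i."""
        returns, is_weights = np.asarray(returns), np.asarray(is_weights)
        order = np.argsort(returns)
        g = returns[order]
        F = np.clip(np.cumsum(is_weights[order]) / len(returns), 0.0, 1.0)
        return g, F

    def quantile_from_cdf(g, F, q):
        """Smallest observed return at which the estimated CDF reaches q."""
        return g[min(np.searchsorted(F, q), len(g) - 1)]

From (g, F) one can then read off the mean, median, inter-quantile range, or CVaR of the return distribution, which is exactly the sense in which a single estimator serves many parameters.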

ICML Conference 2021 Conference Paper

Value Alignment Verification

  • Daniel S. Brown
  • Jordan Schneider
  • Anca D. Dragan
  • Scott Niekum

As humans interact with autonomous agents to perform increasingly complicated, potentially risky tasks, it is important to be able to efficiently evaluate an agent’s performance and correctness. In this paper we formalize and theoretically analyze the problem of efficient value alignment verification: how to efficiently test whether the behavior of another agent is aligned with a human’s values? The goal is to construct a kind of "driver’s test" that a human can give to any agent which will verify value alignment via a minimal number of queries. We study alignment verification problems with both idealized humans that have an explicit reward function as well as problems where they have implicit values. We analyze verification of exact value alignment for rational agents, propose and test heuristics for value alignment verification in gridworlds and a continuous autonomous driving domain, and prove that there exist sufficient conditions such that we can verify epsilon-alignment in any environment via a constant-query-complexity alignment test.

NeurIPS Conference 2020 Conference Paper

Bayesian Robust Optimization for Imitation Learning

  • Daniel Brown
  • Scott Niekum
  • Marek Petrik

One of the main challenges in imitation learning is determining what action an agent should take when outside the state distribution of the demonstrations. Inverse reinforcement learning (IRL) can enable generalization to new states by learning a parameterized reward function, but these approaches still face uncertainty over the true reward function and corresponding optimal policy. Existing safe imitation learning approaches based on IRL deal with this uncertainty using a maxmin framework that optimizes a policy under the assumption of an adversarial reward function, whereas risk-neutral IRL approaches optimize a policy for either the mean or the MAP reward function. While completely ignoring risk can lead to overly aggressive and unsafe policies, optimizing in a fully adversarial sense is also problematic as it can lead to overly conservative policies that perform poorly in practice. To provide a bridge between these two extremes, we propose Bayesian Robust Optimization for Imitation Learning (BROIL). BROIL leverages Bayesian reward function inference and a user-specified risk tolerance to efficiently optimize a robust policy that balances expected return and conditional value at risk. Our empirical results show that BROIL provides a natural way to interpolate between return-maximizing and risk-minimizing behaviors and outperforms existing risk-sensitive and risk-neutral inverse reinforcement learning algorithms.
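
The objective the abstract describes is easy to state concretely: given posterior samples of the reward function, score a policy by a convex combination of its expected return and its conditional value at risk across those samples. A minimal sketch (parameter names are ours; BROIL itself optimizes this objective over policies rather than merely scoring fixed candidates):

    import numpy as np

    def broil_score(returns_per_reward, alpha=0.95, lam=0.5):
        """Score a fixed policy by lam * E[V] + (1 - lam) * CVaR_alpha(V),
        where V is the policy's expected return and both statistics are
        taken over posterior samples of the reward function."""
        v = np.asarray(returns_per_reward)
        var = np.quantile(v, 1.0 - alpha)     # lower-tail value at risk
        cvar = v[v <= var].mean()             # mean of the worst (1-alpha) tail
        return lam * v.mean() + (1.0 - lam) * cvar

Sweeping lam from 1 to 0 traces exactly the interpolation the abstract mentions, from return-maximizing to risk-minimizing behavior.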

IJCAI Conference 2020 Conference Paper

Human Gaze Assisted Artificial Intelligence: A Review

  • Ruohan Zhang
  • Akanksha Saran
  • Bo Liu
  • Yifeng Zhu
  • Sihang Guo
  • Scott Niekum
  • Dana Ballard
  • Mary Hayhoe

Human gaze reveals a wealth of information about internal cognitive state. Thus, gaze-related research has significantly increased in computer vision, natural language processing, decision learning, and robotics in recent years. We provide a high-level overview of the research efforts in these fields, including collecting human gaze data sets, modeling gaze behaviors, and utilizing gaze information in various applications, with the goal of enhancing communication between these research areas. We discuss future challenges and potential applications that work towards a common goal of human-centered artificial intelligence.

IROS Conference 2020 Conference Paper

Hypothesis-Driven Skill Discovery for Hierarchical Deep Reinforcement Learning

  • Caleb Chuck
  • Supawit Chockchowwat
  • Scott Niekum

Deep reinforcement learning (DRL) is capable of learning high-performing policies on a variety of complex high-dimensional tasks, ranging from video games to robotic manipulation. However, standard DRL methods often suffer from poor sample efficiency, partially because they aim to be entirely problem-agnostic. In this work, we introduce a novel approach to exploration and hierarchical skill learning that derives its sample efficiency from intuitive assumptions it makes about the behavior of objects both in the physical world and simulations which mimic physics. Specifically, we propose the Hypothesis Proposal and Evaluation (HyPE) algorithm, which discovers objects from raw pixel data, generates hypotheses about the controllability of observed changes in object state, and learns a hierarchy of skills to test these hypotheses. We demonstrate that HyPE can dramatically improve the sample efficiency of policy learning in two different domains: a simulated robotic block-pushing domain, and a popular benchmark task: Breakout. In these domains, HyPE learns high-scoring policies an order of magnitude faster than several state-of-the-art reinforcement learning methods.

IROS Conference 2020 Conference Paper

Learning Hybrid Object Kinematics for Efficient Hierarchical Planning Under Uncertainty

  • Ajinkya Jain
  • Scott Niekum

Sudden changes in the dynamics of robotic tasks, such as contact with an object or the latching of a door, are often viewed as inconvenient discontinuities that make manipulation difficult. However, when these transitions are well-understood, they can be leveraged to reduce uncertainty or aid manipulation; for example, wiggling a screw to determine whether it is fully inserted. Current model-free reinforcement learning approaches require large amounts of data to learn to leverage such dynamics, scale poorly as problem complexity grows, and do not transfer well to significantly different problems. By contrast, hierarchical POMDP planning-based methods scale well via plan decomposition, work well on novel problems, and directly consider uncertainty, but often rely on precise hand-specified models and task decompositions. To combine the advantages of these opposing paradigms, we propose a new method, MICAH, which given unsegmented data of an object's motion under applied actions, (1) detects changepoints in the object motion model using action-conditional inference, (2) estimates the individual local motion models with their parameters, and (3) converts them into a hybrid automaton that is compatible with hierarchical POMDP planning. We show that model learning under MICAH is more accurate and robust to noise than prior approaches. Further, we combine MICAH with a hierarchical POMDP planner to demonstrate that the learned models are rich enough to be used for performing manipulation tasks under uncertainty that require the objects to be used in novel ways not encountered during training.

ICML Conference 2020 Conference Paper

Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences

  • Daniel S. Brown
  • Russell Coleman
  • Ravi Srinivasan
  • Scott Niekum

Bayesian reward learning from demonstrations enables rigorous safety and uncertainty analysis when performing imitation learning. However, Bayesian reward learning methods are typically computationally intractable for complex control problems. We propose Bayesian Reward Extrapolation (Bayesian REX), a highly efficient Bayesian reward learning algorithm that scales to high-dimensional imitation learning problems by pre-training a low-dimensional feature encoding via self-supervised tasks and then leveraging preferences over demonstrations to perform fast Bayesian inference. Bayesian REX can learn to play Atari games from demonstrations without access to the game score, and can generate 100,000 samples from the posterior over reward functions in only 5 minutes on a personal laptop. Bayesian REX also results in imitation learning performance that is competitive with or better than state-of-the-art methods that only learn point estimates of the reward function. Finally, Bayesian REX enables efficient high-confidence policy evaluation without having access to samples of the reward function. These high-confidence performance bounds can be used to rank the performance and risk of a variety of evaluation policies and provide a way to detect reward hacking behaviors.
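
The fast inference step can be sketched as Metropolis-Hastings over linear reward weights on top of the frozen feature embedding, with a Bradley-Terry likelihood on the preference pairs. A hedged numpy sketch, where the names and the unit-norm constraint reflect our reading of the abstract:

    import numpy as np

    def bayesian_rex_mcmc(Phi, prefs, n_samples=5000, step=0.05, beta=1.0):
        """Metropolis-Hastings over linear reward weights w, where a
        trajectory's return is w @ Phi[i] for a fixed, pre-trained feature
        embedding Phi (n_trajs x d). prefs is a list of (i, j) pairs meaning
        trajectory i is preferred to trajectory j."""
        rng = np.random.default_rng(0)
        d = Phi.shape[1]

        def log_lik(w):
            r = Phi @ w
            # Bradley-Terry: P(i > j) = sigmoid(beta * (r_i - r_j))
            return sum(-np.log1p(np.exp(-beta * (r[i] - r[j])))
                       for i, j in prefs)

        w = rng.normal(size=d); w /= np.linalg.norm(w)
        ll, chain = log_lik(w), []
        for _ in range(n_samples):
            w2 = w + step * rng.normal(size=d)
            w2 /= np.linalg.norm(w2)      # unit norm fixes the reward scale
            ll2 = log_lik(w2)
            if np.log(rng.random()) < ll2 - ll:
                w, ll = w2, ll2
            chain.append(w)
        return np.array(chain)

Because the likelihood only touches the low-dimensional embedding, each MCMC step is a handful of dot products, which is what makes the reported sampling speed plausible.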

ICML Conference 2019 Conference Paper

Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations

  • Daniel S. Brown
  • Wonjoon Goo
  • Prabhat Nagarajan
  • Scott Niekum

A critical flaw of existing inverse reinforcement learning (IRL) methods is their inability to significantly outperform the demonstrator. This is because IRL typically seeks a reward function that makes the demonstrator appear near-optimal, rather than inferring the underlying intentions of the demonstrator that may have been poorly executed in practice. In this paper, we introduce a novel reward-learning-from-observation algorithm, Trajectory-ranked Reward EXtrapolation (T-REX), that extrapolates beyond a set of (approximately) ranked demonstrations in order to infer high-quality reward functions from a set of potentially poor demonstrations. When combined with deep reinforcement learning, T-REX outperforms state-of-the-art imitation learning and IRL methods on multiple Atari and MuJoCo benchmark tasks and achieves performance that is often more than twice that of the best demonstration. We also demonstrate that T-REX is robust to ranking noise and can accurately extrapolate intention by simply watching a learner noisily improve at a task over time.
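
The ranking-based reward learning at the heart of T-REX reduces to a pairwise cross-entropy loss. The paper trains a deep network over raw observations; the same loss on a linear reward over precomputed trajectory features fits in a short numpy sketch (all names are ours):

    import numpy as np

    def trex_reward(Phi, ranked_pairs, lr=0.1, iters=2000, seed=0):
        """Learn linear reward weights from ranked demonstrations by
        minimizing the pairwise ranking (cross-entropy) loss
          -log sigmoid(R(tau_hi) - R(tau_lo))
        over pairs (hi, lo) where hi is the higher-ranked trajectory.
        Phi: (n_trajs, d) cumulative state features per trajectory."""
        rng = np.random.default_rng(seed)
        w = rng.normal(scale=0.1, size=Phi.shape[1])
        for _ in range(iters):
            grad = np.zeros_like(w)
            for hi, lo in ranked_pairs:
                diff = Phi[hi] - Phi[lo]
                p = 1.0 / (1.0 + np.exp(-w @ diff))   # P(tau_hi > tau_lo)
                grad += (p - 1.0) * diff              # d(-log p)/dw
            w -= lr * grad / len(ranked_pairs)
        return w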

ICML Conference 2019 Conference Paper

Importance Sampling Policy Evaluation with an Estimated Behavior Policy

  • Josiah P. Hanna
  • Scott Niekum
  • Peter Stone 0001

We consider the problem of off-policy evaluation in Markov decision processes. Off-policy evaluation is the task of evaluating the expected return of one policy with data generated by a different, behavior policy. Importance sampling is a technique for off-policy evaluation that re-weights off-policy returns to account for differences in the likelihood of the returns between the two policies. In this paper, we study importance sampling with an estimated behavior policy where the behavior policy estimate comes from the same set of data used to compute the importance sampling estimate. We find that this estimator often lowers the mean squared error of off-policy evaluation compared to importance sampling with the true behavior policy or using a behavior policy that is estimated from a separate data set. Intuitively, estimating the behavior policy in this way corrects for error due to sampling in the action-space. Our empirical results also extend to other popular variants of importance sampling and show that estimating a non-Markovian behavior policy can further lower large-sample mean squared error even when the true behavior policy is Markovian.
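
In the tabular case, the estimator the abstract describes is simple: fit the behavior policy by empirical action frequencies from the very data being evaluated, then run ordinary importance sampling against that estimate. A minimal sketch (the paper also covers weighted, per-decision, and non-Markovian variants):

    import numpy as np
    from collections import Counter

    def regression_is(trajs, pi_e, gamma=1.0):
        """Ordinary IS with the behavior policy *estimated from the same
        data* (tabular case). trajs: list of [(s, a, r), ...];
        pi_e(a, s): evaluation policy probability of a in s."""
        sa, s_cnt = Counter(), Counter()
        for traj in trajs:
            for s, a, r in traj:
                sa[(s, a)] += 1; s_cnt[s] += 1
        pib_hat = lambda a, s: sa[(s, a)] / s_cnt[s]

        est = 0.0
        for traj in trajs:
            rho, g = 1.0, 0.0
            for t, (s, a, r) in enumerate(traj):
                rho *= pi_e(a, s) / pib_hat(a, s)
                g += gamma**t * r
            est += rho * g
        return est / len(trajs)

The counterintuitive point the paper makes is that pib_hat, despite being noisier than the true behavior policy, corrects for which actions happened to be sampled, often lowering mean squared error.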

RLDM Conference 2019 Conference Abstract

Learning from Suboptimal Demonstrations: Inverse Reinforcement Learning from Ranked Observations

  • Daniel S Brown
  • Wonjoon Goo
  • Scott Niekum

A critical flaw of existing imitation learning and inverse reinforcement learning methods is their inability, often by design, to significantly outperform the demonstrator. This is a consequence of the general reliance of these algorithms upon some form of mimicry, such as feature-count matching or behavioral cloning, rather than inferring the underlying intentions of the demonstrator that may have been poorly executed in practice. In this paper, we introduce a novel reward-learning-from-observation algorithm, Trajectory-ranked Reward EXtrapolation (T-REX), that extrapolates beyond a set of ranked suboptimal demonstrations in order to infer a high-quality reward function. We leverage the pairwise preferences induced from the ranked demonstrations to perform reward learning without requiring an MDP solver. By learning a state-based reward function that assigns greater return to higher-ranked trajectories than lower-ranked trajectories, we transform a typically expensive, and often intractable, inverse reinforcement learning problem into one of standard binary classification. Moreover, by learning a reward function that is solely a function of state, we are able to learn from observations alone, eliminating the need for action labels. We combine our learned reward function with deep reinforcement learning and show that our approach results in performance that is better than the best-performing demonstration on multiple Atari and MuJoCo benchmark tasks. In comparison, state-of-the-art imitation learning algorithms fail to exceed the average performance of the demonstrator.

AAAI Conference 2019 Conference Paper

Machine Teaching for Inverse Reinforcement Learning: Algorithms and Applications

  • Daniel S. Brown
  • Scott Niekum

Inverse reinforcement learning (IRL) infers a reward function from demonstrations, allowing for policy improvement and generalization. However, despite much recent interest in IRL, little work has been done to understand the minimum set of demonstrations needed to teach a specific sequential decision-making task. We formalize the problem of finding maximally informative demonstrations for IRL as a machine teaching problem where the goal is to find the minimum number of demonstrations needed to specify the reward equivalence class of the demonstrator. We extend previous work on algorithmic teaching for sequential decision-making tasks by showing a reduction to the set cover problem which enables an efficient approximation algorithm for determining the set of maximally informative demonstrations. We apply our proposed machine teaching algorithm to two novel applications: providing a lower bound on the number of queries needed to learn a policy using active IRL and developing a novel IRL algorithm that can learn more efficiently from informative demonstrations than a standard IRL approach.
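
Once demonstration selection is reduced to set cover, the standard greedy approximation applies directly. A generic sketch, with the universe standing in for the constraints (e.g., reward half-spaces) that demonstrations must cover:

    def greedy_set_cover(universe, candidates):
        """Classic greedy set-cover approximation: universe is the set of
        constraints to cover; candidates maps each demonstration to the
        subset of constraints it covers."""
        uncovered, chosen = set(universe), []
        while uncovered:
            # pick the demonstration covering the most remaining constraints
            best = max(candidates, key=lambda d: len(candidates[d] & uncovered))
            if not candidates[best] & uncovered:
                raise ValueError("universe cannot be covered")
            chosen.append(best)
            uncovered -= candidates[best]
        return chosen

Greedy set cover carries the usual logarithmic approximation guarantee, which is the sense in which the reduction yields an efficient approximation algorithm.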

ICRA Conference 2019 Conference Paper

One-Shot Learning of Multi-Step Tasks from Observation via Activity Localization in Auxiliary Video

  • Wonjoon Goo
  • Scott Niekum

Due to burdensome data requirements, learning from demonstration often falls short of its promise to allow users to quickly and naturally program robots. Demonstrations are inherently ambiguous and incomplete, making correct generalization to unseen situations difficult without a large number of demonstrations in varying conditions. By contrast, humans are often able to learn complex tasks from a single demonstration (typically observations without action labels) by leveraging context learned over a lifetime. Inspired by this capability, our goal is to enable robots to perform one-shot learning of multi-step tasks from observation by leveraging auxiliary video data as context. Our primary contribution is a novel system that achieves this goal by: (1) using a single user-segmented demonstration to define the primitive actions that comprise a task, (2) localizing additional examples of these actions in unsegmented auxiliary videos via a meta-learning-based approach, (3) using these additional examples to learn a reward function for each action, and (4) performing reinforcement learning on top of the inferred reward functions to learn action policies that can be combined to accomplish the task. We empirically demonstrate that a robot can learn multi-step tasks more effectively when provided auxiliary video, and that performance greatly improves when localizing individual actions, compared to learning from unsegmented videos.

ICRA Conference 2019 Conference Paper

Uncertainty-Aware Data Aggregation for Deep Imitation Learning

  • Yuchen Cui
  • David Isele
  • Scott Niekum
  • Kikuo Fujimura

Estimating statistical uncertainties allows autonomous agents to communicate their confidence during task execution and is important for applications in safety-critical domains such as autonomous driving. In this work, we present the uncertainty-aware imitation learning (UAIL) algorithm for improving end-to-end control systems via data aggregation. UAIL applies Monte Carlo Dropout to estimate uncertainty in the control output of end-to-end systems, using states where it is uncertain to selectively acquire new training data. In contrast to prior data aggregation algorithms that force human experts to visit sub-optimal states at random, UAIL can anticipate its own mistakes and switch control to the expert in order to prevent visiting a series of sub-optimal states. Our experimental results from simulated driving tasks demonstrate that our proposed uncertainty estimation method can be leveraged to reliably predict infractions. Our analysis shows that UAIL outperforms existing data aggregation algorithms on a series of benchmark tasks.
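
The uncertainty estimate itself is standard Monte Carlo Dropout: keep dropout active at test time, run several stochastic forward passes, and treat the spread of the outputs as confidence. A small numpy sketch of the idea (the tiny network and all names are ours, not the paper's architecture):

    import numpy as np

    def mc_dropout_control(x, W1, b1, W2, b2, p=0.1, T=50, seed=0):
        """Run T stochastic forward passes of a small control MLP with
        dropout *kept on at test time*; return the mean control output
        and its standard deviation as an uncertainty estimate."""
        rng = np.random.default_rng(seed)
        outs = []
        for _ in range(T):
            h = np.maximum(W1 @ x + b1, 0.0)        # ReLU hidden layer
            mask = rng.random(h.shape) > p          # Bernoulli dropout mask
            h = h * mask / (1.0 - p)
            outs.append(W2 @ h + b2)
        outs = np.array(outs)
        return outs.mean(0), outs.std(0)

    rng = np.random.default_rng(1)
    W1, b1 = rng.normal(size=(32, 4)), np.zeros(32)
    W2, b2 = rng.normal(size=(2, 32)), np.zeros(2)
    mu, sigma = mc_dropout_control(np.ones(4), W1, b1, W2, b2)

A data-aggregation loop in the spirit of the abstract would hand control back to the expert whenever sigma exceeds a chosen threshold, rather than querying at random states.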

RLDM Conference 2019 Conference Abstract

Using Natural Language for Reward Shaping in Reinforcement Learning

  • Prasoon Goyal
  • Scott Niekum

Recent reinforcement learning (RL) approaches have shown strong performance in complex domains such as Atari games, but are often highly sample inefficient. A common approach to reduce interaction time with the environment is to use reward shaping, which involves carefully designing reward functions that provide the agent intermediate rewards for progress towards the goal. However, designing appropriate shaping rewards is known to be difficult as well as time-consuming. In this work, we address this problem by using natural language instructions to perform reward shaping. We propose the LanguagE-Action Reward Network (LEARN), a framework that maps free-form natural language instructions to intermediate rewards based on actions taken by the agent. LEARN takes a trajectory and a language instruction as input, and is trained to predict whether the actions in the trajectory are related to the instruction. For instance, if the instruction is “Climb down the ladder and go to the left”, then trajectories containing actions down and left with high frequency are more related to the instruction compared to trajectories without these actions. We use Amazon Mechanical Turk to collect language data to train LEARN. Given a language instruction describing the task in an MDP, we feed the trajectory executed by the agent so far and the language instruction to LEARN, and use the output as a reward signal for the RL agent. These intermediate language-based rewards can seamlessly be integrated into any standard RL algorithm. We experiment with Montezuma’s Revenge from the Arcade Learning Environment, a popular benchmark in RL. Our experiments on a diverse set of 15 tasks demonstrate that, for the same number of interactions with the environment, language-based rewards lead to successful completion of the task 60% more often on average, compared to learning without language. Analysis of our framework shows that LEARN successfully infers associations between words and actions.

IJCAI Conference 2019 Conference Paper

Using Natural Language for Reward Shaping in Reinforcement Learning

  • Prasoon Goyal
  • Scott Niekum
  • Raymond J. Mooney

Recent reinforcement learning (RL) approaches have shown strong performance in complex domains, such as Atari games, but are highly sample inefficient. A common approach to reduce interaction time with the environment is to use reward shaping, which involves carefully designing reward functions that provide the agent intermediate rewards for progress towards the goal. Designing such rewards remains a challenge, though. In this work, we use natural language instructions to perform reward shaping. We propose a framework that maps free-form natural language instructions to intermediate rewards that can be seamlessly integrated into any standard reinforcement learning algorithm. We experiment with Montezuma's Revenge from the Atari video games domain, a popular benchmark in RL. Our experiments on a diverse set of 15 tasks demonstrate that, for the same number of interactions with the environment, language-based rewards lead to successful task completion 60% more often, averaged across all tasks, compared to learning without language.
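
The integration point is the key detail: the language-relatedness score simply becomes an additive shaping term on the environment reward. Below, a toy stand-in for the trained network, matching action words in the instruction against the agent's recent action frequencies, illustrates the plumbing; everything here is a deliberate simplification, since LEARN itself is a learned model:

    import numpy as np

    def toy_relatedness(action_history, instruction, vocab):
        """Toy stand-in for a trained language-reward network: cosine
        similarity between recent action frequencies and the action
        words appearing in the instruction."""
        freq, lang = np.zeros(len(vocab)), np.zeros(len(vocab))
        for a in action_history:
            freq[vocab[a]] += 1
        for word in instruction.lower().split():
            if word in vocab:
                lang[vocab[word]] += 1
        denom = np.linalg.norm(freq) * np.linalg.norm(lang)
        return freq @ lang / denom if denom else 0.0

    # Inside any RL loop, the shaped reward is then simply
    #   r_shaped = r_env + lam * toy_relatedness(recent_actions, instr, vocab)
    vocab = {"left": 0, "right": 1, "up": 2, "down": 3}
    print(toy_relatedness(["down", "down", "left"],
                          "Climb down the ladder and go to the left", vocab))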

ICRA Conference 2018 Conference Paper

Active Reward Learning from Critiques

  • Yuchen Cui
  • Scott Niekum

Learning from demonstration algorithms, such as Inverse Reinforcement Learning, aim to provide a natural mechanism for programming robots, but can often require a prohibitive number of demonstrations to capture important subtleties of a task. Rather than requesting additional demonstrations blindly, active learning methods leverage uncertainty to query the user for action labels at states with high expected information gain. However, this approach can still require a large number of labels to adequately reduce uncertainty and may also be unintuitive, as users are not accustomed to determining optimal actions in a single out-of-context state. To address these shortcomings, we propose a novel trajectory-based active Bayesian inverse reinforcement learning algorithm that (1) queries the user for critiques of automatically generated trajectories, rather than asking for demonstrations or action labels, (2) utilizes trajectory segmentation to expedite the critique/labeling process, and (3) predicts the user's critiques to generate the most highly informative trajectory queries. We evaluated our algorithm in simulated domains, finding it to compare favorably to prior work and a randomized baseline.

AAAI Conference 2018 Conference Paper

Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning

  • Daniel Brown
  • Scott Niekum

In the field of reinforcement learning there has been recent progress towards safety and high-confidence bounds on policy performance. However, to our knowledge, no practical methods exist for determining high-confidence policy performance bounds in the inverse reinforcement learning setting, where the true reward function is unknown and only samples of expert behavior are given. We propose a sampling method based on Bayesian inverse reinforcement learning that uses demonstrations to determine practical high-confidence upper bounds on the α-worst-case difference in expected return between any evaluation policy and the optimal policy under the expert’s unknown reward function. We evaluate our proposed bound on both a standard grid navigation task and a simulated driving task and achieve tighter and more accurate bounds than a feature count-based baseline. We also give examples of how our proposed bound can be utilized to perform risk-aware policy selection and risk-aware policy improvement. Because our proposed bound requires several orders of magnitude fewer demonstrations than existing high-confidence bounds, it is the first practical method that allows agents that learn from demonstration to express confidence in the quality of their learned policy.
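
The quantity being bounded can be sketched directly: draw reward functions from the Bayesian IRL posterior, compute the expected value difference between the optimal and the evaluation policy under each sample, and take the empirical α-quantile. A minimal sketch (names are ours; finite-sample corrections are omitted):

    import numpy as np

    def alpha_worst_case_bound(evd_samples, alpha=0.95):
        """Sample-based alpha-worst-case policy loss: evd_samples holds
        EVD(R) = V*_R - V^eval_R for reward functions R drawn from the
        posterior; the empirical alpha-quantile upper-bounds the loss
        with posterior probability alpha."""
        return np.quantile(np.asarray(evd_samples), alpha)

Risk-aware policy selection then amounts to preferring the candidate policy with the smallest such bound.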

IROS Conference 2018 Conference Paper

Human Gaze Following for Human-Robot Interaction

  • Akanksha Saran
  • Srinjoy Majumdar
  • Elaine Schaertl Short
  • Andrea Thomaz
  • Scott Niekum

Gaze provides subtle informative cues to aid fluent interactions among people. Incorporating human gaze predictions can signify how engaged a person is while interacting with a robot and allow the robot to predict a human's intentions or goals. We propose a novel approach to predict human gaze fixations relevant for human-robot interaction tasks, both referential and mutual gaze, in real time on a robot. We use a deep learning approach which tracks a human's gaze from a robot's perspective in real time. The approach builds on prior work which uses a deep network to predict the referential gaze of a person from a single 2D image. Our work uses an interpretable part of the network, a gaze heat map, and incorporates contextual task knowledge such as location of relevant objects, to predict referential gaze. We find that the gaze heat map statistics also capture differences between mutual and referential gaze conditions, which we use to predict whether a person is facing the robot's camera or not. We highlight the challenges of following a person's gaze on a robot in real time and show improved performance for referential gaze and mutual gaze prediction.

ICRA Conference 2018 Conference Paper

Incremental Task Modification via Corrective Demonstrations

  • Reymundo A. Gutierrez
  • Vivian Chu
  • Andrea Thomaz
  • Scott Niekum

In realistic environments, fully specifying a task model such that a robot can perform a task in all situations is impractical. In this work, we present Incremental Task Modification via Corrective Demonstrations (ITMCD), a novel algorithm that allows a robot to update a learned model by making use of corrective demonstrations from an end-user in its environment. We propose three different types of model updates that make structural changes to a finite state automaton (FSA) representation of the task by first converting the FSA into a state transition auto-regressive hidden Markov model (STARHMM). The STARHMM's probabilistic properties are then used to perform approximate Bayesian model selection to choose the best model update, if any. We evaluate ITMCD's model selection in a simulated block-sorting domain and the full algorithm on a real-world pouring task. The simulation results show our approach can choose new task models that sufficiently incorporate new demonstrations while remaining as simple as possible. The results from the pouring task show that ITMCD performs well when the modeled segments of the corrective demonstrations closely comply with the original task model.

AAAI Conference 2018 Conference Paper

Safe Reinforcement Learning via Shielding

  • Mohammed Alshiekh
  • Roderick Bloem
  • Rüdiger Ehlers
  • Bettina Könighofer
  • Scott Niekum
  • Ufuk Topcu

Reinforcement learning algorithms discover policies that maximize reward, but do not necessarily guarantee safety during learning or execution phases. We introduce a new approach to learn optimal policies while enforcing properties expressed in temporal logic. To this end, given the temporal logic specification that is to be obeyed by the learning system, we propose to synthesize a reactive system called a shield. The shield monitors the actions from the learner and corrects them only if the chosen action causes a violation of the specification. We discuss which requirements a shield must meet to preserve the convergence guarantees of the learner. Finally, we demonstrate the versatility of our approach on several challenging reinforcement learning scenarios.
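
At runtime, a shield is a thin wrapper around action selection: pass safe actions through, override unsafe ones. A toy sketch in which a hand-written one-step safety check stands in for a shield synthesized from a temporal logic specification and the environment dynamics:

    class Shield:
        """Minimal shield: forwards the learner's action when it satisfies
        the safety check, otherwise substitutes a guaranteed-safe action."""

        def __init__(self, is_safe, safe_fallback):
            self.is_safe = is_safe              # (state, action) -> bool
            self.safe_fallback = safe_fallback  # state -> safe action

        def filter(self, state, action):
            if self.is_safe(state, action):
                return action
            return self.safe_fallback(state)

    # Example: a 1-D corridor of cells 0..9; never step off either end.
    shield = Shield(is_safe=lambda s, a: 0 <= s + a <= 9,
                    safe_fallback=lambda s: 0)   # "stay put" is always safe
    print(shield.filter(9, +1))  # learner's move would exit -> overridden to 0

Because the shield intervenes only on violating actions, the learner still receives its own exploration signal almost everywhere, which is what lets the convergence guarantees discussed in the paper survive.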

AAMAS Conference 2017 Conference Paper

Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation

  • Josiah P. Hanna
  • Peter Stone
  • Scott Niekum

For an autonomous agent, executing a poor policy may be costly or even dangerous. For such agents, it is desirable to determine confidence interval lower bounds on the performance of any given policy without executing said policy. Current methods for exact high confidence off-policy evaluation that use importance sampling require a substantial amount of data to achieve a tight lower bound. Existing model-based methods only address the problem in discrete state spaces. Since exact bounds are intractable for many domains we trade off strict guarantees of safety for more data-efficient approximate bounds. In this context, we propose two bootstrapping off-policy evaluation methods which use learned MDP transition models in order to estimate lower confidence bounds on policy performance with limited data in both continuous and discrete state spaces. Since direct use of a model may introduce bias, we derive a theoretical upper bound on model bias for when the model transition function is estimated with i.i.d. trajectories. This bound broadens our understanding of the conditions under which model-based methods have high bias. Finally, we empirically evaluate our proposed methods and analyze the settings in which different bootstrapping off-policy confidence interval methods succeed and fail.

AAAI Conference 2017 Short Paper

Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation

  • Josiah Hanna
  • Peter Stone
  • Scott Niekum

In many reinforcement learning applications, it is desirable to determine confidence interval lower bounds on the performance of any given policy without executing said policy. In this context, we propose two bootstrapping off-policy evaluation methods which use learned MDP transition models in order to estimate lower confidence bounds on policy performance with limited data. We empirically evaluate the proposed methods on standard policy evaluation tasks.

IROS Conference 2017 Conference Paper

Classification error correction: A case study in brain-computer interfacing

  • Hasan A. Poonawala
  • Mohammed Alshiekh
  • Scott Niekum
  • Ufuk Topcu

Classification techniques are useful for processing complex signals into labels with semantic value. For example, they can be used to interpret brain signals generated by humans corresponding to a finite set of commands for a physical device. The classifier, however, may interpret the signal as a command that is different from the intended one. This error in classification leads to poor performance in tasks where the class labels are used to learn some information or to control a physical device. We propose a computationally efficient algorithm to identify which class labels may be misclassified out of a sequence of class labels, when these labels are used in a given learning or control task. The algorithm is based on inference methods using Markov random fields. We apply the algorithm to goal-learning and tracking using brain-computer interfacing (BCI), in which signals from the brain are commonly processed using classification techniques. We demonstrate that the proposed algorithm reduces the time taken to identify the goal state in control experiments.

ICML Conference 2017 Conference Paper

Data-Efficient Policy Evaluation Through Behavior Policy Search

  • Josiah P. Hanna
  • Philip S. Thomas
  • Peter Stone 0001
  • Scott Niekum

We consider the task of evaluating a policy for a Markov decision process (MDP). The standard unbiased technique for evaluating a policy is to deploy the policy and observe its performance. We show that the data collected from deploying a different policy, commonly called the behavior policy, can be used to produce unbiased estimates with lower mean squared error than this standard technique. We derive an analytic expression for the optimal behavior policy — the behavior policy that minimizes the mean squared error of the resulting estimates. Because this expression depends on terms that are unknown in practice, we propose a novel policy evaluation sub-problem, behavior policy search: searching for a behavior policy that reduces mean squared error. We present a behavior policy search algorithm and empirically demonstrate its effectiveness in lowering the mean squared error of policy performance estimates.
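
In the bandit special case, the analytic answer coincides with the classical optimal importance-sampling proposal: sample action a in proportion to pi_e(a) * |r(a)|. The demo below (our toy numbers, not from the paper) shows the variance reduction the abstract claims; the paper derives the MDP analogue and searches for it from data:

    import numpy as np

    rng = np.random.default_rng(0)
    pi_e = np.array([0.5, 0.5])        # evaluation policy over 2 actions
    r = np.array([0.1, 1.0])           # deterministic per-action rewards

    def is_mse(pi_b, n=1000, reps=2000):
        """MSE of the IS estimate of pi_e's value when sampling from pi_b."""
        ests = []
        for _ in range(reps):
            a = rng.choice(2, size=n, p=pi_b)
            ests.append(np.mean(pi_e[a] / pi_b[a] * r[a]))
        return np.mean((np.array(ests) - pi_e @ r) ** 2)

    pi_star = pi_e * np.abs(r); pi_star /= pi_star.sum()
    print("on-policy MSE: ", is_mse(pi_e))     # sampling from pi_e itself
    print("optimal-b MSE: ", is_mse(pi_star))  # near-zero mean squared error

With all rewards positive, each importance-weighted sample under pi_star equals the true value exactly, so the estimator's variance collapses; this is the intuition behind searching for a better behavior policy.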

IROS Conference 2017 Conference Paper

Viewpoint selection for visual failure detection

  • Akanksha Saran
  • Branka Lakic
  • Srinjoy Majumdar
  • Jürgen Hess 0003
  • Scott Niekum

The visual difference between outcomes in many robotics tasks is often subtle, such as the tip of a screw being near a hole versus in the hole. Furthermore, these small differences are often only observable from certain viewpoints or may even require information from multiple viewpoints to fully verify. We introduce and compare three approaches to selecting viewpoints for verifying successful execution of tasks: (1) a random forest-based method that discovers highly informative fine-grained visual features, (2) SVM models trained on features extracted from pre-trained convolutional neural networks, and (3) an active, hybrid approach that uses the above methods for two-stage multi-viewpoint classification. These approaches are experimentally validated on an IKEA furniture assembly task and a quadrotor surveillance domain.

ICML Conference 2016 Conference Paper

On the Analysis of Complex Backup Strategies in Monte Carlo Tree Search

  • Piyush Khandelwal
  • Elad Liebman
  • Scott Niekum
  • Peter Stone 0001

Over the past decade, Monte Carlo Tree Search (MCTS) and specifically Upper Confidence Bound in Trees (UCT) have proven to be quite effective in large probabilistic planning domains. In this paper, we focus on how values are backpropagated in the MCTS tree, and apply complex return strategies from the Reinforcement Learning (RL) literature to MCTS, producing 4 new MCTS variants. We demonstrate that in some probabilistic planning benchmarks from the International Planning Competition (IPC), selecting an MCTS variant with a backup strategy different from Monte Carlo averaging can lead to substantially better results. We also propose a hypothesis for why different backup strategies lead to different performance in particular environments, and manipulate a carefully structured grid-world domain to provide empirical evidence supporting our hypothesis.

ICRA Conference 2015 Conference Paper

Active articulation model estimation through interactive perception

  • Karol Hausman
  • Scott Niekum
  • Sarah Osentoski
  • Gaurav S. Sukhatme

We introduce a particle filter-based approach to representing and actively reducing uncertainty over articulated motion models. The presented method provides a probabilistic model that integrates visual observations with feedback from manipulation actions to best characterize a distribution of possible articulation models. We evaluate several action selection methods to efficiently reduce the uncertainty about the articulation model. The full system is experimentally evaluated using a PR2 mobile manipulator. Our experiments demonstrate that the proposed system allows for intelligent reasoning about sparse, noisy data in a number of common manipulation scenarios.

ICRA Conference 2015 Conference Paper

Online Bayesian changepoint detection for articulated motion models

  • Scott Niekum
  • Sarah Osentoski
  • Christopher G. Atkeson
  • Andrew G. Barto

We introduce CHAMP, an algorithm for online Bayesian changepoint detection in settings where it is difficult or undesirable to integrate over the parameters of candidate models. CHAMP is used in combination with several articulation models to detect changes in articulated motion of objects in the world, allowing a robot to infer physically-grounded task information. We focus on three settings where a changepoint model is appropriate: objects with intrinsic articulation relationships that can change over time, object-object contact that results in quasi-static articulated motion, and assembly tasks where each step changes articulation relationships. We experimentally demonstrate that this system can be used to infer various types of information from demonstration data including causal manipulation models, human-robot grasp correspondences, and skill verification tests.

NeurIPS Conference 2015 Conference Paper

Policy Evaluation Using the Ω-Return

  • Philip Thomas
  • Scott Niekum
  • Georgios Theocharous
  • George Konidaris

We propose the Ω-return as an alternative to the λ-return currently used by the TD(λ) family of algorithms. The benefit of the Ω-return is that it accounts for the correlation of different length returns. Because it is difficult to compute exactly, we suggest one way of approximating the Ω-return. We provide empirical studies that suggest that it is superior to the λ-return and γ-return for a variety of problems.
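
For reference, the λ-return that these alternatives target is the exponentially weighted average of n-step returns, in LaTeX notation:

    G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{\,n-1} G_t^{(n)},
    \qquad
    G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} + \gamma^{n} \hat{v}(s_{t+n})

The Ω-return replaces the fixed λ^(n-1) weights with weights chosen to account for the covariance among the overlapping n-step returns, which the λ-return ignores.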

IROS Conference 2012 Conference Paper

Learning and generalization of complex tasks from unstructured demonstrations

  • Scott Niekum
  • Sarah Osentoski
  • George Konidaris 0001
  • Andrew G. Barto

We present a novel method for segmenting demonstrations, recognizing repeated skills, and generalizing complex tasks from unstructured demonstrations. This method combines many of the advantages of recent automatic segmentation methods for learning from demonstration into a single principled, integrated framework. Specifically, we use the Beta Process Autoregressive Hidden Markov Model and Dynamic Movement Primitives to learn and generalize a multi-step task on the PR2 mobile manipulator and to demonstrate the potential of our framework to learn a large library of skills over time.

NeurIPS Conference 2011 Conference Paper

Clustering via Dirichlet Process Mixture Models for Portable Skill Discovery

  • Scott Niekum
  • Andrew Barto

Skill discovery algorithms in reinforcement learning typically identify single states or regions in state space that correspond to task-specific subgoals. However, such methods do not directly address the question of how many distinct skills are appropriate for solving the tasks that the agent faces. This can be highly inefficient when many identified subgoals correspond to the same underlying skill, but are all used individually as skill goals. Furthermore, skills created in this manner are often only transferable to tasks that share identical state spaces, since corresponding subgoals across tasks are not merged into a single skill goal. We show that these problems can be overcome by clustering subgoal data defined in an agent-space and using the resulting clusters as templates for skill termination conditions. Clustering via a Dirichlet process mixture model is used to discover a minimal, sufficient collection of portable skills.
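
In practice, this clustering step can be approximated with off-the-shelf tools: scikit-learn's BayesianGaussianMixture with a Dirichlet-process prior infers the effective number of components (skills) rather than fixing it in advance. A hedged sketch using synthetic subgoal data (the paper uses its own Dirichlet-process inference; this is merely a convenient modern approximation):

    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    # Hypothetical subgoal observations in a shared 2-D agent-space.
    rng = np.random.default_rng(0)
    subgoals = np.vstack([rng.normal([0, 0], 0.1, (30, 2)),
                          rng.normal([3, 3], 0.1, (30, 2))])

    # A truncated Dirichlet-process mixture: give the model more
    # components than needed and let it prune the surplus, so the
    # number of distinct skills is inferred from the data.
    dpmm = BayesianGaussianMixture(
        n_components=10,
        weight_concentration_prior_type="dirichlet_process",
    ).fit(subgoals)
    labels = dpmm.predict(subgoals)
    print(np.unique(labels))   # effective number of skill templates found

Each resulting cluster would then serve as the template for one skill's termination condition, as the abstract describes.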

NeurIPS Conference 2011 Conference Paper

TDγ: Re-evaluating Complex Backups in Temporal Difference Learning

  • George Konidaris
  • Scott Niekum
  • Philip Thomas

We show that the λ-return target used in the TD(λ) family of algorithms is the maximum likelihood estimator for a specific model of how the variance of an n-step return estimate increases with n. We introduce the γ-return estimator, an alternative target based on a more accurate model of variance, which defines the TDγ family of complex-backup temporal difference learning algorithms. We derive TDγ, the γ-return equivalent of the original TD(λ) algorithm, which eliminates the λ parameter but can only perform updates at the end of an episode and requires time and space proportional to the episode length. We then derive a second algorithm, TDγ(C), with a capacity parameter C. TDγ(C) requires C times more time and memory than TD(λ) and is incremental and online. We show that TDγ outperforms TD(λ) for any setting of λ on 4 out of 5 benchmark domains, and that TDγ(C) performs as well as or better than TDγ for intermediate settings of C.