Arrow Research search

Author name cluster

Emma Brunskill

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

96 papers
2 author rows

Possible papers (96)

AAAI Conference 2026 Conference Paper

Assessing the Quality of AI-Generated Exams: A Large-Scale Field Study

  • Calvin Isley
  • Joshua Gilbert
  • Evangelos Kassos
  • Michaela Kocher
  • Allen Nie
  • Emma Brunskill
  • Ben Domingue
  • Jake Hofman

While large language models (LLMs) challenge conventional methods of teaching and learning, they present an exciting opportunity to improve efficiency and scale high-quality instruction. One promising application is the generation of customized exams, tailored to specific course content. There has been significant recent excitement about automatically generating questions using artificial intelligence, but comparatively little work has evaluated the psychometric quality of these items in real-world educational settings. Filling this gap is an important step toward understanding generative AI's role in effective test design. In this study, we introduce and evaluate an iterative refinement strategy for question generation, repeatedly producing, assessing, and improving questions through cycles of LLM-generated critique and revision. We evaluate the quality of these AI-generated questions in a large-scale field study involving 91 classes---covering computer science, mathematics, chemistry, and more---in dozens of colleges across the United States, comprising nearly 1700 students. Our analysis, based on item response theory (IRT), suggests that for students in our sample the AI-generated questions performed comparably to expert-created questions designed for standardized exams. Our results illustrate the power of AI to make high-quality assessments more readily available, benefiting both teachers and students.
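
For readers unfamiliar with item response theory, a minimal sketch of the two-parameter logistic (2PL) model, a standard IRT variant (the abstract does not specify which parameterization the study used), looks like this:

```python
import numpy as np

def irt_2pl_prob(theta, a, b):
    """Two-parameter logistic (2PL) IRT model: probability that a student
    with ability `theta` answers correctly an item with discrimination `a`
    and difficulty `b`."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# A student of average ability (theta = 0) on a moderately hard item:
print(irt_2pl_prob(theta=0.0, a=1.2, b=0.5))  # ~0.354
```

Fitting such item parameters to observed student responses is what allows AI-generated items to be compared with expert-written ones on a common scale.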

AAAI Conference 2025 Conference Paper

Cost-Aware Near-Optimal Policy Learning

  • Joy He-Yueya
  • Jonathan Lee
  • Matthew Jörke
  • Emma Brunskill

It is often of interest to learn a context-sensitive decision policy, such as in contextual multi-armed bandit processes. To quantify the efficiency of a machine learning algorithm for such settings, probably approximately correct (PAC) bounds, which bound the number of samples required, or cumulative regret guarantees, are typically used. However, real-world settings often have limited resources for experimentation, and decisions/interventions may differ in the amount of resources required (e.g., money or time). Therefore, it is of interest to consider how to design an experiment strategy that reduces the experimental budget needed to learn a near-optimal contextual policy. Unlike reinforcement learning or bandit approaches that embed costs into the reward function, we focus on reducing resource use in learning a near-optimal policy without resource constraints. We introduce two resource-aware algorithms for the contextual bandit setting and prove their soundness. Simulations based on real-world datasets demonstrate that our algorithms significantly reduce the resources needed to learn a near-optimal decision policy compared to previous resource-unaware methods.

ICLR Conference 2024 Conference Paper

Adaptive Instrument Design for Indirect Experiments

  • Yash Chandak
  • Shiv Shankar
  • Vasilis Syrgkanis
  • Emma Brunskill

Indirect experiments provide a valuable framework for estimating treatment effects in situations where conducting randomized control trials (RCTs) is impractical or unethical. Unlike RCTs, indirect experiments estimate treatment effects by leveraging (conditional) instrumental variables, enabling estimation through encouragement and recommendation rather than strict treatment assignment. However, the sample efficiency of such estimators depends not only on the inherent variability in outcomes but also on the varying compliance levels of users with the instrumental variables and the choice of estimator being used, especially when dealing with numerous instrumental variables. While adaptive experiment design has a rich literature for \textit{direct} experiments, in this paper we take the initial steps towards enhancing sample efficiency for \textit{indirect} experiments by adaptively designing a data collection policy over instrumental variables. Our main contribution is a practical computational procedure that utilizes influence functions to search for an optimal data collection policy, minimizing the mean-squared error of the desired (non-linear) estimator. Through experiments conducted in various domains inspired by real-world applications, we showcase how our method can significantly improve the sample efficiency of indirect experiments.

TMLR Journal 2024 Journal Article

Estimating Optimal Policy Value in Linear Contextual Bandits Beyond Gaussianity

  • Jonathan Lee
  • Weihao Kong
  • Aldo Pacchiano
  • Vidya Muthukumar
  • Emma Brunskill

In many bandit problems, the maximal reward achievable by a policy is often unknown in advance. We consider the problem of estimating the optimal policy value in the sublinear data regime before the optimal policy is even learnable. We refer to this as $V^*$ estimation. It was previously shown that fast $V^*$ estimation is possible but only in disjoint linear bandits with Gaussian covariates. Whether this is possible for more realistic context distributions has remained an open and important question for tasks such as model selection. In this paper, we first provide lower bounds showing that this general problem is hard. However, under stronger assumptions, we give an algorithm and analysis proving that $\widetilde{\mathcal{O}}(\sqrt{d})$ sublinear estimation of $V^*$ is indeed information-theoretically possible, where $d$ is the dimension. We subsequently introduce a practical and computationally efficient algorithm that estimates a problem-specific upper bound on $V^*$, valid for general distributions and tight for Gaussian context distributions. We prove our algorithm requires only $\widetilde{\mathcal{O}}(\sqrt{d})$ samples to estimate the upper bound. We use this upper bound in conjunction with the estimator to derive novel and improved guarantees for several applications in bandit model selection and testing for treatment effects. We present promising experimental benefits on a semi-synthetic simulation using historical data on warfarin treatment dosage outcomes.

NeurIPS Conference 2024 Conference Paper

OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators

  • Allen Nie
  • Yash Chandak
  • Christina J. Yuan
  • Anirudhan Badrinath
  • Yannis Flet-Berliac
  • Emma Brunskill

Offline policy evaluation (OPE) allows us to evaluate and estimate a new sequential decision-making policy's performance by leveraging historical interaction data collected from other policies. Evaluating a new policy online without a confident estimate of its performance can lead to costly, unsafe, or hazardous outcomes, especially in education and healthcare. Several OPE estimators have been proposed in the last decade, many of which have hyperparameters and require training. Unfortunately, it remains unclear how to choose the best OPE algorithm for each task and domain. In this paper, we propose a new algorithm that adaptively blends a set of OPE estimators given a dataset without relying on an explicit selection using a statistical procedure. We prove that our estimator is consistent and satisfies several desirable properties for policy evaluation. Additionally, we demonstrate that when compared to alternative approaches, our estimator can be used to select higher-performing policies in healthcare and robotics. Our work contributes to improving ease of use for a general-purpose, estimator-agnostic, off-policy evaluation framework for offline RL.
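
As a rough illustration of the blending idea, the sketch below combines several OPE estimates with weights inversely proportional to their bootstrap variance; this weighting rule is an assumption for illustration, not necessarily the paper's exact statistical procedure:

```python
import numpy as np

def blend_ope_estimates(estimates, bootstrap_vars):
    """Combine per-estimator OPE values with weights inversely
    proportional to each estimator's bootstrap variance (a simple
    stand-in for a re-weighted aggregate of multiple estimators)."""
    estimates = np.asarray(estimates, dtype=float)
    weights = 1.0 / (np.asarray(bootstrap_vars, dtype=float) + 1e-12)
    weights /= weights.sum()
    return float(weights @ estimates)

# e.g. importance sampling, doubly robust, and fitted-Q estimates:
print(blend_ope_estimates([4.1, 3.8, 3.9], [0.50, 0.05, 0.10]))  # ~3.85
```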

NeurIPS Conference 2023 Conference Paper

Experiment Planning with Function Approximation

  • Aldo Pacchiano
  • Jonathan Lee
  • Emma Brunskill

We study the problem of experiment planning with function approximation in contextual bandit problems. In settings where there is a significant overhead to deploying adaptive algorithms---for example, when the execution of the data collection policies is required to be distributed, or a human in the loop is needed to implement these policies---producing in advance a set of policies for data collection is paramount. We study the setting where a large dataset of contexts but not rewards is available and may be used by the learner to design an effective data collection strategy. Although when rewards are linear this problem has been well studied, results are still missing for more complex reward models. In this work we propose two experiment planning strategies compatible with function approximation. The first is an eluder planning and sampling procedure that can recover optimality guarantees depending on the eluder dimension of the reward function class. For the second, we show that a uniform sampler achieves competitive optimality rates in the setting where the number of actions is small. We conclude by introducing a statistical gap that fleshes out the fundamental differences between planning and adaptive learning, and we provide results for planning with model selection.

AAAI Conference 2023 Conference Paper

Model-Based Offline Reinforcement Learning with Local Misspecification

  • Kefan Dong
  • Yannis Flet-Berliac
  • Allen Nie
  • Emma Brunskill

We present a model-based offline reinforcement learning policy performance lower bound that explicitly captures dynamics model misspecification and distribution mismatch, and we propose an empirical algorithm for optimal offline policy selection. Theoretically, we prove a novel safe policy improvement theorem by establishing pessimism approximations to the value function. Our key insight is to jointly consider selecting over dynamics models and policies: as long as a dynamics model can accurately represent the dynamics of the state-action pairs visited by a given policy, it is possible to approximate the value of that particular policy. We analyze our lower bound in the LQR setting and also show competitive performance to previous lower bounds on policy selection across a set of D4RL tasks.

NeurIPS Conference 2023 Conference Paper

Proportional Response: Contextual Bandits for Simple and Cumulative Regret Minimization

  • Sanath Kumar Krishnamurthy
  • Ruohan Zhan
  • Susan Athey
  • Emma Brunskill

In many applications, e.g., healthcare and e-commerce, the goal of a contextual bandit may be to learn an optimal treatment assignment policy at the end of the experiment, that is, to minimize simple regret. However, this objective remains understudied. We propose a new family of computationally efficient bandit algorithms for the stochastic contextual bandit setting, where a tuning parameter determines the weight placed on cumulative regret minimization (where we establish near-optimal minimax guarantees) versus simple regret minimization (where we establish state-of-the-art guarantees). Our algorithms work with any function class, are robust to model misspecification, and can be used in continuous arm settings. This flexibility comes from constructing and relying on "conformal arm sets" (CASs). CASs provide a set of arms for every context, encompassing the context-specific optimal arm with a certain probability across the context distribution. Our positive results on simple and cumulative regret guarantees are contrasted with a negative result, which shows that no algorithm can achieve instance-dependent simple regret guarantees while simultaneously achieving minimax optimal cumulative regret guarantees.
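
A toy picture of a conformal arm set, with the calibration of the slack left abstract (the paper's construction is more careful about the coverage guarantee):

```python
import numpy as np

def conformal_arm_set(context, reward_model, arms, slack):
    """Keep every arm whose predicted reward is within `slack` of the best
    prediction for this context; `slack` would be calibrated so the set
    contains the context-specific optimal arm with the desired probability."""
    preds = np.array([reward_model(context, a) for a in arms])
    return [a for a, p in zip(arms, preds) if p >= preds.max() - slack]

toy_model = lambda x, a: -abs(a - x)  # toy reward model, peaked at a = x
print(conformal_arm_set(0.3, toy_model, arms=[0.0, 0.25, 0.5, 1.0], slack=0.2))
# -> [0.25, 0.5]
```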

NeurIPS Conference 2023 Conference Paper

Supervised Pretraining Can Learn In-Context Reinforcement Learning

  • Jonathan Lee
  • Annie Xie
  • Aldo Pacchiano
  • Yash Chandak
  • Chelsea Finn
  • Ofir Nachum
  • Emma Brunskill

Large transformer models trained on diverse datasets have shown a remarkable ability to learn in-context, achieving high few-shot performance on tasks they were not explicitly trained to solve. In this paper, we study the in-context learning capabilities of transformers in decision-making problems, i.e., reinforcement learning (RL) for bandits and Markov decision processes. To do so, we introduce and study the Decision-Pretrained Transformer (DPT), a supervised pretraining method where a transformer predicts an optimal action given a query state and an in-context dataset of interactions from a diverse set of tasks. While simple, this procedure produces a model with several surprising capabilities. We find that the trained transformer can solve a range of RL problems in-context, exhibiting both exploration online and conservatism offline, despite not being explicitly trained to do so. The model also generalizes beyond the pretraining distribution to new tasks and automatically adapts its decision-making strategies to unknown structure. Theoretically, we show DPT can be viewed as an efficient implementation of Bayesian posterior sampling, a provably sample-efficient RL algorithm. We further leverage this connection to provide guarantees on the regret of the in-context algorithm yielded by DPT, and prove that it can learn faster than algorithms used to generate the pretraining data. These results suggest a promising yet simple path towards instilling strong in-context decision-making abilities in transformers.
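
Schematically, DPT pretraining is ordinary supervised learning. The sketch below assumes a `model(context, query_state)` interface returning action logits, which is an illustrative simplification of the transformer described in the abstract:

```python
import torch
import torch.nn.functional as F

def dpt_pretraining_step(model, optimizer, batch):
    """One supervised pretraining step: given an in-context dataset of
    interactions and a query state, train the transformer with
    cross-entropy to predict the optimal action at the query state."""
    context, query_state, optimal_action = batch
    logits = model(context, query_state)            # (batch, num_actions)
    loss = F.cross_entropy(logits, optimal_action)  # optimal_action: (batch,)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```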

NeurIPS Conference 2023 Conference Paper

Waypoint Transformer: Reinforcement Learning via Supervised Learning with Intermediate Targets

  • Anirudhan Badrinath
  • Yannis Flet-Berliac
  • Allen Nie
  • Emma Brunskill

Despite the recent advancements in offline reinforcement learning via supervised learning (RvS) and the success of the decision transformer (DT) architecture in various domains, DTs have fallen short in several challenging benchmarks. The root cause of this underperformance lies in their inability to seamlessly connect segments of suboptimal trajectories. To overcome this limitation, we present a novel approach to enhance RvS methods by integrating intermediate targets. We introduce the Waypoint Transformer (WT), an architecture that builds upon the DT framework and is conditioned on automatically generated waypoints. The results show a significant increase in the final return compared to existing RvS methods, with performance on par with or greater than existing state-of-the-art temporal difference learning-based methods. Additionally, the performance and stability improvements are largest in the most challenging environments and data configurations, including AntMaze Large Play/Diverse and Kitchen Mixed/Partial.

AAAI Conference 2022 Conference Paper

Constraint Sampling Reinforcement Learning: Incorporating Expertise for Faster Learning

  • Tong Mu
  • Georgios Theocharous
  • David Arbour
  • Emma Brunskill

Online reinforcement learning (RL) algorithms are often difficult to deploy in complex human-facing applications as they may learn slowly and have poor early performance. To address this, we introduce a practical algorithm for incorporating human insight to speed learning. Our algorithm, Constraint Sampling Reinforcement Learning (CSRL), incorporates prior domain knowledge as constraints/restrictions on the RL policy. It takes in multiple potential policy constraints to maintain robustness to misspecification of individual constraints while leveraging helpful ones to learn quickly. Given a base RL learning algorithm (e.g., UCRL, DQN, Rainbow), we propose an upper confidence with elimination scheme that leverages the relationship between the constraints and their observed performance to adaptively switch among them. We instantiate our algorithm with DQN-type algorithms and UCRL as base algorithms, and evaluate our algorithm in four environments, including three simulators based on real data: recommendations, educational activity sequencing, and HIV treatment sequencing. In all cases, CSRL learns a good policy faster than baselines.
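
A compressed sketch of an upper-confidence elimination rule of the flavor described; the Hoeffding-style radii are an assumption for illustration, and the paper's exact bonuses may differ:

```python
import numpy as np

def surviving_constraints(mean_reward, count, t, delta=0.05):
    """Keep constraint i unless its upper confidence bound falls below
    the best constraint's lower confidence bound."""
    n = len(mean_reward)
    radius = np.sqrt(np.log(2 * n * t / delta) / (2 * count))
    ucb, lcb = mean_reward + radius, mean_reward - radius
    return [i for i in range(n) if ucb[i] >= lcb.max()]

means = np.array([0.90, 0.50, 0.85])
counts = np.array([200, 200, 50])
print(surviving_constraints(means, counts, t=400))  # -> [0, 2]
```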

NeurIPS Conference 2022 Conference Paper

Data-Efficient Pipeline for Offline Reinforcement Learning with Limited Data

  • Allen Nie
  • Yannis Flet-Berliac
  • Deon Jordan
  • William Steenbergen
  • Emma Brunskill

Offline reinforcement learning (RL) can be used to improve future performance by leveraging historical data. There exist many different algorithms for offline RL, and it is well recognized that these algorithms, and their hyperparameter settings, can lead to decision policies with substantially differing performance. This prompts the need for pipelines that allow practitioners to systematically perform algorithm-hyperparameter selection for their setting. Critically, in most real-world settings, this pipeline must only involve the use of historical data. Inspired by statistical model selection methods for supervised learning, we introduce a task- and method-agnostic pipeline for automatically training, comparing, selecting, and deploying the best policy when the provided dataset is limited in size. In particular, our work highlights the importance of performing multiple data splits to produce more reliable algorithm-hyperparameter selection. While this is a common approach in supervised learning, to our knowledge, this has not been discussed in detail in the offline RL setting. We show it can have substantial impacts when the dataset is small. Compared to alternate approaches, our proposed pipeline outputs higher-performing deployed policies from a broad range of offline policy learning algorithms and across various simulation domains in healthcare, education, and robotics. This work contributes toward the development of a general-purpose meta-algorithm for automatic algorithm-hyperparameter selection for offline RL.
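
A minimal sketch of the repeated-split selection loop; `train_fn` and `ope_fn` are assumed stand-ins for an offline policy learning algorithm and an off-policy evaluator, respectively:

```python
import numpy as np

def select_offline_rl_config(dataset, configs, train_fn, ope_fn,
                             n_splits=10, train_frac=0.8, seed=0):
    """Rank algorithm-hyperparameter configs by their average off-policy
    evaluation score across repeated random train/validation splits."""
    rng = np.random.default_rng(seed)
    scores = {c: [] for c in configs}
    for _ in range(n_splits):
        idx = rng.permutation(len(dataset))
        cut = int(train_frac * len(dataset))
        train = [dataset[i] for i in idx[:cut]]
        valid = [dataset[i] for i in idx[cut:]]
        for c in configs:
            policy = train_fn(c, train)
            scores[c].append(ope_fn(policy, valid))
    return max(configs, key=lambda c: float(np.mean(scores[c])))
```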

NeurIPS Conference 2022 Conference Paper

Factored DRO: Factored Distributionally Robust Policies for Contextual Bandits

  • Tong Mu
  • Yash Chandak
  • Tatsunori B. Hashimoto
  • Emma Brunskill

While there has been extensive work on learning from offline data for contextual multi-armed bandit settings, existing methods typically assume there is no environment shift: that the learned policy will operate in the same environmental process as that of data collection. However, this assumption may limit the use of these methods for many practical situations where there may be distribution shifts. In this work we propose Factored Distributionally Robust Optimization (Factored-DRO), which is able to separately handle distribution shifts in the context distribution and shifts in the reward generating process. Prior work that either ignores potential shifts in the context, or considers them jointly, can lead to performance that is too conservative, especially under certain forms of reward feedback. Our Factored-DRO objective mitigates this by considering the shifts separately, and our proposed estimators are consistent and converge asymptotically. We also introduce a practical algorithm and demonstrate promising empirical results in environments based on real-world datasets, such as voting outcomes and scene classification.

NeurIPS Conference 2022 Conference Paper

Giving Feedback on Interactive Student Programs with Meta-Exploration

  • Evan Liu
  • Moritz Stephan
  • Allen Nie
  • Chris Piech
  • Emma Brunskill
  • Chelsea Finn

Developing interactive software, such as websites or games, is a particularly engaging way to learn computer science. However, teaching and giving feedback on such software is time-consuming — standard approaches require instructors to manually grade student-implemented interactive programs. As a result, online platforms that serve millions, like Code.org, are unable to provide any feedback on assignments for implementing interactive programs, which critically hinders students’ ability to learn. One approach toward automatic grading is to learn an agent that interacts with a student’s program and explores states indicative of errors via reinforcement learning. However, existing work on this approach only provides binary feedback of whether a program is correct or not, while students require finer-grained feedback on the specific errors in their programs to understand their mistakes. In this work, we show that exploring to discover errors can be cast as a meta-exploration problem. This enables us to construct a principled objective for discovering errors and an algorithm for optimizing this objective, which provides fine-grained feedback. We evaluate our approach on a set of over 700K real anonymized student programs from a Code.org interactive assignment. Our approach provides feedback with 94.3% accuracy, improving over existing approaches by 17.7% and coming within 1.5% of human-level accuracy. Project web page: https://ezliu.github.io/dreamgrader.

NeurIPS Conference 2022 Conference Paper

Off-Policy Evaluation for Action-Dependent Non-stationary Environments

  • Yash Chandak
  • Shiv Shankar
  • Nathaniel Bastian
  • Bruno da Silva
  • Emma Brunskill
  • Philip S. Thomas

Methods for sequential decision-making are often built upon a foundational assumption that the underlying decision process is stationary. This limits the application of such methods because real-world problems are often subject to changes due to external factors (\textit{passive} non-stationarity), changes induced by interactions with the system itself (\textit{active} non-stationarity), or both (\textit{hybrid} non-stationarity). In this work, we take the first steps towards the fundamental challenge of on-policy and off-policy evaluation amidst structured changes due to active, passive, or hybrid non-stationarity. Towards this goal, we make a \textit{higher-order stationarity} assumption such that non-stationarity results in changes over time, but the way changes happen is fixed. We propose OPEN, an algorithm that uses a double application of counterfactual reasoning and a novel importance-weighted instrument-variable regression to obtain both a lower-bias and a lower-variance estimate of the structure in the changes of a policy's past performances. Finally, we show promising results on how OPEN can be used to predict future performances for several domains inspired by real-world applications that exhibit non-stationarity.

UAI Conference 2022 Conference Paper

Offline policy optimization with eligible actions

  • Yao Liu 0009
  • Yannis Flet-Berliac
  • Emma Brunskill

Offline policy optimization could have a large impact on many real-world decision-making problems, as online learning may be infeasible in many applications. Importance sampling and its variants are a commonly used type of estimator in offline policy evaluation, and such estimators typically do not require assumptions on the properties and representational capabilities of value function or decision process model function classes. In this paper, we identify an important overfitting phenomenon in optimizing the importance weighted return, in which it may be possible for the learned policy to essentially avoid making aligned decisions for part of the initial state space. We propose an algorithm to avoid this overfitting through a new per-state-neighborhood normalization constraint, and provide a theoretical justification of the proposed algorithm. We also show the limitations of previous attempts at this approach. We test our algorithm in a healthcare-inspired simulator, a logged dataset collected from real hospitals, and continuous control tasks. These experiments show the proposed method yields less overfitting and better test performance compared to state-of-the-art batch reinforcement learning algorithms.

NeurIPS Conference 2022 Conference Paper

Oracle Inequalities for Model Selection in Offline Reinforcement Learning

  • Jonathan N Lee
  • George Tucker
  • Ofir Nachum
  • Bo Dai
  • Emma Brunskill

In offline reinforcement learning (RL), a learner leverages prior logged data to learn a good policy without interacting with the environment. A major challenge in applying such methods in practice is the lack of both theoretically principled and practical tools for model selection and evaluation. To address this, we study the problem of model selection in offline RL with value function approximation. The learner is given a nested sequence of model classes to minimize squared Bellman error and must select among these to achieve a balance between approximation and estimation error of the classes. We propose the first model selection algorithm for offline RL that achieves minimax rate-optimal oracle inequalities up to logarithmic factors. The algorithm, ModBE, takes as input a collection of candidate model classes and a generic base offline RL algorithm. By successively eliminating model classes using a novel one-sided generalization test, ModBE returns a policy with regret scaling with the complexity of the minimally complete model class. In addition to its theoretical guarantees, it is conceptually simple and computationally efficient, amounting to solving a series of square loss regression problems and then comparing relative square loss between classes. We conclude with several numerical simulations showing it is capable of reliably selecting a good model class.
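
A compressed sketch of the elimination loop, with the one-sided statistical test reduced to a fixed threshold for illustration (the paper's test is more careful):

```python
def modbe_select(model_classes, fit_bellman_regression, threshold):
    """Given nested model classes (smallest first), fit a square-loss
    Bellman regression in each and return the smallest class whose loss
    is not significantly worse than the next larger class's loss."""
    losses = [fit_bellman_regression(cls) for cls in model_classes]
    for k in range(len(model_classes) - 1):
        if losses[k] - losses[k + 1] <= threshold:
            return model_classes[k]  # smaller class suffices
    return model_classes[-1]
```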

NeurIPS Conference 2021 Conference Paper

Design of Experiments for Stochastic Contextual Linear Bandits

  • Andrea Zanette
  • Kefan Dong
  • Jonathan N Lee
  • Emma Brunskill

In the stochastic linear contextual bandit setting there exist several minimax procedures for exploration with policies that are reactive to the data being acquired. In practice, there can be a significant engineering overhead to deploy these algorithms, especially when the dataset is collected in a distributed fashion or when a human in the loop is needed to implement a different policy. Exploring with a single non-reactive policy is beneficial in such cases. Assuming some batch contexts are available, we design a single stochastic policy to collect a good dataset from which a near-optimal policy can be extracted. We present a theoretical analysis as well as numerical experiments on both synthetic and real-world datasets.

NeurIPS Conference 2021 Conference Paper

Play to Grade: Testing Coding Games as Classifying Markov Decision Process

  • Allen Nie
  • Emma Brunskill
  • Chris Piech

Contemporary coding education often presents students with the task of developing programs that have user interaction and complex dynamic systems, such as mouse-based games. While pedagogically compelling, there are no contemporary autonomous methods for providing feedback. Notably, interactive programs are impossible to grade by traditional unit tests. In this paper we formalize the challenge of providing feedback to interactive programs as a task of classifying Markov Decision Processes (MDPs). Each student's program fully specifies an MDP where the agent needs to operate and decide, under reasonable generalization, if the dynamics and reward model of the input MDP should be categorized as correct or broken. We demonstrate that by designing a cooperative objective between an agent and an autoregressive model, we can use the agent to sample differential trajectories from the input MDP that allow a classifier to determine membership: Play to Grade. Our method enables an automatic feedback system for interactive code assignments. We release a dataset of 711,274 anonymized student submissions to a single assignment with hand-coded bug labels to support future research.

NeurIPS Conference 2021 Conference Paper

Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning

  • Andrea Zanette
  • Martin J Wainwright
  • Emma Brunskill

Actor-critic methods are widely used in offline reinforcement learning practice, but are not so well understood theoretically. We propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle, leading to several key advantages compared to the state of the art. The algorithm can operate when the Bellman evaluation operator is closed with respect to the action value function of the actor's policies; this is a more general setting than the low-rank MDP model. Despite the added generality, the procedure is computationally tractable as it involves the solution of a sequence of second-order programs. We prove an upper bound on the suboptimality gap of the policy returned by the procedure that depends on the data coverage of any arbitrary, possibly data dependent comparator policy. The achievable guarantee is complemented with a minimax lower bound that is matching up to logarithmic factors.

NeurIPS Conference 2021 Conference Paper

Reinforcement Learning with State Observation Costs in Action-Contingent Noiselessly Observable Markov Decision Processes

  • HyunJi Alex Nam
  • Scott Fleming
  • Emma Brunskill

Many real-world problems that require making optimal sequences of decisions under uncertainty involve costs when the agent wishes to obtain information about its environment. We design and analyze algorithms for reinforcement learning (RL) in Action-Contingent Noiselessly Observable MDPs (ACNO-MDPs), a special class of POMDPs in which the agent can choose to either (1) fully observe the state at a cost and then act; or (2) act without any immediate observation information, relying on past observations to infer the underlying state. ACNO-MDPs arise frequently in important real-world application domains like healthcare, in which clinicians must balance the value of information gleaned from medical tests (e.g., blood-based biomarkers) with the costs of gathering that information (e.g., the costs of labor and materials required to administer such tests). We develop a PAC RL algorithm for tabular ACNO-MDPs that provides substantially tighter bounds, compared to generic POMDP-RL algorithms, on the total number of episodes exhibiting worse than near-optimal performance. For continuous-state ACNO-MDPs, we propose a novel method of incorporating observation information that, when coupled with modern RL algorithms, yields significantly faster learning compared to other POMDP-RL algorithms in several simulated environments.

NeurIPS Conference 2021 Conference Paper

Universal Off-Policy Evaluation

  • Yash Chandak
  • Scott Niekum
  • Bruno da Silva
  • Erik Learned-Miller
  • Emma Brunskill
  • Philip S. Thomas

When faced with sequential decision-making problems, it is often useful to be able to predict what would happen if decisions were made using a new policy. Those predictions must often be based on data collected under some previously used decision-making rule. Many previous methods enable such off-policy (or counterfactual) estimation of the expected value of a performance measure called the return. In this paper, we take the first steps towards a 'universal off-policy estimator' (UnO)---one that provides off-policy estimates and high-confidence bounds for any parameter of the return distribution. We use UnO for estimating and simultaneously bounding the mean, variance, quantiles/median, inter-quantile range, CVaR, and the entire cumulative distribution of returns. Finally, we also discuss UnO's applicability in various settings, including fully observable, partially observable (i.e., with unobserved confounders), Markovian, non-Markovian, stationary, smoothly non-stationary, and discrete distribution shifts.
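
The common thread is an importance-weighted estimate of the entire return CDF, from which any distributional parameter can then be read off. A simplified sketch (no clipping or confidence bands, and a crude unweighted tail mean for CVaR):

```python
import numpy as np

def off_policy_cdf(returns, rho):
    """Self-normalized importance-weighted empirical CDF of returns;
    `rho` holds per-trajectory importance ratios pi_e/pi_b."""
    order = np.argsort(returns)
    g, w = np.asarray(returns, float)[order], np.asarray(rho, float)[order]
    return g, np.cumsum(w) / np.sum(w)

def cvar_from_cdf(g, F, alpha):
    """Mean of the worst alpha-tail, read off the estimated CDF
    (unweighted within the tail, a simplification)."""
    tail = F <= alpha
    return g[tail].mean() if tail.any() else g[0]

g, F = off_policy_cdf([1.0, 3.0, 2.0, 0.5], rho=[0.5, 2.0, 1.0, 0.5])
print(cvar_from_cdf(g, F, alpha=0.25))  # -> 0.75
```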

AAAI Conference 2020 Conference Paper

Being Optimistic to Be Conservative: Quickly Learning a CVaR Policy

  • Ramtin Keramati
  • Christoph Dann
  • Alex Tamkin
  • Emma Brunskill

While maximizing expected return is the goal in most reinforcement learning approaches, risk-sensitive objectives such as conditional value at risk (CVaR) are more suitable for many high-stakes applications. However, relatively little is known about how to explore to quickly learn policies with good CVaR. In this paper, we present the first algorithm for sample-efficient learning of CVaR-optimal policies in Markov decision processes based on the optimism in the face of uncertainty principle. This method relies on a novel optimistic version of the distributional Bellman operator that moves probability mass from the lower to the upper tail of the return distribution. We prove asymptotic convergence and optimism of this operator for the tabular policy evaluation case. We further demonstrate that our algorithm finds CVaR-optimal policies substantially faster than existing baselines in several simulated environments with discrete and continuous state spaces.
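
A toy version of the mass-shifting step behind the optimistic distributional operator, on a discrete return distribution with atoms sorted in ascending order (the size of the shift, set here by `bonus`, would come from the algorithm's uncertainty bonus):

```python
import numpy as np

def optimistic_shift(probs, bonus):
    """Move up to `bonus` probability mass from the lowest-return atoms
    of a discrete return distribution onto its top atom."""
    p = np.array(probs, dtype=float)
    remaining = bonus
    for i in range(len(p) - 1):
        take = min(p[i], remaining)
        p[i] -= take
        p[-1] += take
        remaining -= take
        if remaining <= 0:
            break
    return p

print(optimistic_shift([0.25, 0.25, 0.25, 0.25], bonus=0.3))
# -> [0.   0.2  0.25 0.55]
```

Because CVaR averages the lower tail, the shifted distribution always yields an optimistic (higher) CVaR estimate.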

ICML Conference 2020 Conference Paper

Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions

  • Omer Gottesman
  • Joseph Futoma
  • Yao Liu 0009
  • Sonali Parbhoo
  • Leo A. Celi
  • Emma Brunskill
  • Finale Doshi

Off-policy evaluation in reinforcement learning offers the chance of using observational data to improve future outcomes in domains such as healthcare and education, but safe deployment in high stakes settings requires ways of assessing its validity. Traditional measures such as confidence intervals may be insufficient due to noise, limited data and confounding. In this paper we develop a method that could serve as a hybrid human-AI system, to enable human experts to analyze the validity of policy evaluation estimates. This is accomplished by highlighting observations in the data whose removal will have a large effect on the OPE estimate, and formulating a set of rules for choosing which ones to present to domain experts for validation. We develop methods to compute exactly the influence functions for fitted Q-evaluation with two different function classes: kernel-based and linear least squares, as well as importance sampling methods. Experiments on medical simulations and real-world intensive care unit data demonstrate that our method can be used to identify limitations in the evaluation process and make evaluation more robust.

ICML Conference 2020 Conference Paper

Learning Near Optimal Policies with Low Inherent Bellman Error

  • Andrea Zanette
  • Alessandro Lazaric
  • Mykel J. Kochenderfer
  • Emma Brunskill

We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning under the notion of low inherent Bellman error, a condition normally employed to show convergence of approximate value iteration. First we relate this condition to other common frameworks and show that it is strictly more general than the low rank (or linear) MDP assumption of prior work. Second we provide an algorithm with a high probability regret bound $\widetilde O\big(\sum_{t=1}^H d_t \sqrt{K} + \sum_{t=1}^H \sqrt{d_t}\,\mathcal{I}\,K\big)$ where $H$ is the horizon, $K$ is the number of episodes, $\mathcal{I}$ is the value of the inherent Bellman error and $d_t$ is the feature dimension at timestep $t$. In addition, we show that the result is unimprovable beyond constants and logs by showing a matching lower bound. This has two important consequences: 1) it shows that exploration is possible using only \emph{batch assumptions} with an algorithm that achieves the optimal statistical rate for the setting we consider, which is more general than prior work on low-rank MDPs; 2) the lack of closedness (measured by the inherent Bellman error) is only amplified by $\sqrt{d_t}$ despite working in the online setting. Finally, the algorithm reduces to the celebrated \textsc{LinUCB} when $H=1$ but with a different choice of the exploration parameter that allows handling misspecified contextual linear bandits. While computational tractability questions remain open for the MDP setting, this enriches the class of MDPs with a linear representation for the action-value function where statistically efficient reinforcement learning is possible.

NeurIPS Conference 2020 Conference Paper

Off-policy Policy Evaluation For Sequential Decisions Under Unobserved Confounding

  • Hongseok Namkoong
  • Ramtin Keramati
  • Steve Yadlowsky
  • Emma Brunskill

When observed decisions depend only on observed features, off-policy policy evaluation (OPE) methods for sequential decision problems can estimate the performance of evaluation policies before deploying them. However, this assumption is frequently violated due to unobserved confounders, unrecorded variables that impact both the decisions and their outcomes. We assess robustness of OPE methods under unobserved confounding by developing worst-case bounds on the performance of an evaluation policy. When unobserved confounders can affect every decision in an episode, we demonstrate that even small amounts of per-decision confounding can heavily bias OPE methods. Fortunately, in a number of important settings found in healthcare, policy-making, and technology, unobserved confounders may directly affect only one of the many decisions made, and influence future decisions/rewards only through the directly affected decision. Under this less pessimistic model of one-decision confounding, we propose an efficient loss-minimization-based procedure for computing worst-case bounds, and prove its statistical consistency. On simulated healthcare examples---management of sepsis and interventions for autistic children---where this is a reasonable model, we demonstrate that our method invalidates non-robust results and provides meaningful certificates of robustness, allowing reliable selection of policies under unobserved confounding.

NeurIPS Conference 2020 Conference Paper

Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration

  • Andrea Zanette
  • Alessandro Lazaric
  • Mykel J. Kochenderfer
  • Emma Brunskill

There has been growing progress on theoretical analyses for provably efficient learning in MDPs with linear function approximation, but much of the existing work has made strong assumptions to enable exploration by conventional exploration frameworks. Typically these assumptions are stronger than what is needed to find good solutions in the batch setting. In this work, we show how under a more standard notion of low inherent Bellman error, typically employed in least-square value iteration-style algorithms, we can provide strong PAC guarantees on learning a near optimal value function provided that the linear space is sufficiently ``explorable''. We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near optimal policy for any (linear) reward function, which is revealed only once learning has completed. If this reward function is also estimated from the samples gathered during pure exploration, our results also provide same-order PAC guarantees on the performance of the resulting policy for this setting.

NeurIPS Conference 2020 Conference Paper

Provably Good Batch Off-Policy Reinforcement Learning Without Great Exploration

  • Yao Liu
  • Adith Swaminathan
  • Alekh Agarwal
  • Emma Brunskill

Batch reinforcement learning (RL) is important to apply RL algorithms to many high stakes tasks. Doing batch RL in a way that yields a reliable new policy in large domains is challenging: a new decision policy may visit states and actions outside the support of the batch data, and function approximation and optimization with limited samples can further increase the potential of learning policies with overly optimistic estimates of their future performance. Some recent approaches to address these concerns have shown promise, but can still be overly optimistic in their expected outcomes. Theoretical work that provides strong guarantees on the performance of the output policy relies on a strong concentrability assumption, which makes it unsuitable for cases where the ratio between state-action distributions of behavior policy and some candidate policies is large. This is because, in the traditional analysis, the error bound scales up with this ratio. We show that using \emph{pessimistic value estimates} in the low-data regions in Bellman optimality and evaluation back-up can yield more adaptive and stronger guarantees when the concentrability assumption does not hold. In certain settings, they can find the approximately best policy within the state-action space explored by the batch data, without requiring a priori assumptions of concentrability. We highlight the necessity of our pessimistic update and the limitations of previous algorithms and analyses by illustrative MDP examples and demonstrate an empirical comparison of our algorithm and other state-of-the-art batch RL baselines in standard benchmarks.
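
A tabular caricature of a pessimistic backup; the count-based penalty and the clipping at zero (which assumes nonnegative returns) are assumptions for illustration, since the paper works with function approximation:

```python
import numpy as np

def pessimistic_q_backup(Q, r, P, counts, gamma=0.99, c=1.0):
    """One pessimistic Bellman optimality backup for a tabular MDP:
    subtract a count-based penalty so poorly covered (s, a) pairs
    receive conservative value estimates.
    Shapes: Q, r, counts are (S, A); P is (S, A, S)."""
    penalty = c / np.sqrt(np.maximum(counts, 1))
    V = Q.max(axis=1)                         # (S,)
    return np.maximum(r + gamma * P @ V - penalty, 0.0)
```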

JMLR Journal 2020 Journal Article

Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning

  • Peter Henderson
  • Jieru Hu
  • Joshua Romoff
  • Emma Brunskill
  • Dan Jurafsky
  • Joelle Pineau

Accurate reporting of energy and carbon usage is essential for understanding the potential climate impacts of machine learning research. We introduce a framework that makes this easier by providing a simple interface for tracking realtime energy consumption and carbon emissions, as well as generating standardized online appendices. Utilizing this framework, we create a leaderboard for energy efficient reinforcement learning algorithms to incentivize responsible research in this area as an example for other areas of machine learning. Finally, based on case studies using our framework, we propose strategies for mitigation of carbon emissions and reduction of energy consumption. By making accounting easier, we hope to further the sustainable development of machine learning experiments and spur more research into energy efficient algorithms.
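
To make the "simple interface" concrete, here is a hypothetical usage sketch; the module, class, and method names below are placeholders for illustration, not the released framework's actual API:

```python
# Hypothetical interface, for illustration only.
from energy_tracker import EnergyTracker  # placeholder module name

def train_with_energy_accounting(train_fn, logdir):
    """Wrap a training run so that real-time energy consumption and
    estimated carbon emissions are logged alongside the results."""
    tracker = EnergyTracker(logdir=logdir)  # samples CPU/GPU power draw
    with tracker:                           # start/stop monitoring
        result = train_fn()
    tracker.write_appendix()                # standardized online appendix
    return result
```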

ICML Conference 2020 Conference Paper

Understanding the Curse of Horizon in Off-Policy Evaluation via Conditional Importance Sampling

  • Yao Liu 0009
  • Pierre-Luc Bacon
  • Emma Brunskill

Off-policy policy estimators that use importance sampling (IS) can suffer from high variance in long-horizon domains, and there has been particular excitement over new IS methods that leverage the structure of Markov decision processes. We analyze the variance of the most popular approaches through the viewpoint of conditional Monte Carlo. Surprisingly, we find that in finite horizon MDPs there is no strict variance reduction of per-decision importance sampling or marginalized importance sampling, comparing with vanilla importance sampling. We then provide sufficient conditions under which the per-decision or marginalized estimators will provably reduce the variance over importance sampling with finite horizons. For the asymptotic (in terms of horizon $T$) case, we develop upper and lower bounds on the variance of those estimators which yield sufficient conditions under which there exists an exponential vs. polynomial gap between the variance of importance sampling and that of the per-decision or stationary/marginalized estimators. These results help advance our understanding of whether and when new types of IS estimators will improve the accuracy of off-policy estimation.
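
For a single trajectory, the two estimators under comparison differ only in how much of the importance-weight product each reward keeps:

```python
import numpy as np

def is_estimates(rhos, rewards, gamma=1.0):
    """Ordinary vs. per-decision importance sampling for one trajectory.
    rhos[t] = pi_e(a_t | s_t) / pi_b(a_t | s_t). Ordinary IS multiplies
    every reward by the full-horizon weight; per-decision IS multiplies
    the reward at time t only by the weights up to time t."""
    w = np.cumprod(rhos)
    disc = gamma ** np.arange(len(rewards))
    ordinary = w[-1] * np.sum(disc * np.asarray(rewards))
    per_decision = np.sum(disc * w * np.asarray(rewards))
    return ordinary, per_decision

print(is_estimates(rhos=[2.0, 0.5, 1.0], rewards=[1.0, 1.0, 1.0]))
# -> (3.0, 4.0): same expectation over trajectories, different variance
```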

NeurIPS Conference 2019 Conference Paper

Almost Horizon-Free Structure-Aware Best Policy Identification with a Generative Model

  • Andrea Zanette
  • Mykel Kochenderfer
  • Emma Brunskill

This paper focuses on the problem of computing an $\epsilon$-optimal policy in a discounted Markov Decision Process (MDP) provided that we can access the reward and transition function through a generative model. We propose an algorithm that is initially agnostic to the MDP but that can leverage the specific MDP structure, expressed in terms of variances of the rewards and next-state value function, and gaps in the optimal action-value function to reduce the sample complexity needed to find a good policy, precisely highlighting the contribution of each state-action pair to the final sample complexity. A key feature of our analysis is that it removes all horizon dependencies in the sample complexity of suboptimal actions except for the intrinsic scaling of the value function and a constant additive term.

RLDM Conference 2019 Conference Abstract

Being Optimistic to Be Conservative: Efficient Exploration for Bandits and Conditional Value at Risk

  • Alex Tamkin
  • Emma Brunskill

Traditionally, the multi-armed bandits literature has used an arm’s expected value as a proxy for its quality. However, expected reward is a poor objective in high-stakes domains like medicine or finance, where agents are more sensitive to worst-case outcomes. In this paper, we consider the multi-armed bandit setting with a popular risk-sensitive objective called the Conditional Value at Risk (CVaR). We devise an optimism-based algorithm for this setting and show that the growth of the CVaR-regret is logarithmic in the number of samples and grows with the reciprocal of the risk level, alpha. We also present a set of experiments demonstrating the empirical performance of our bounds. Together, these results show that one can find a risk-sensitive policy almost as quickly as one that is risk-neutral: one need only pay a factor inversely proportional to the desired risk level.
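
For reference, the objective in question, at risk level $\alpha \in (0, 1]$, is

```latex
\mathrm{CVaR}_\alpha(X) \;=\; \frac{1}{\alpha}\int_0^\alpha F_X^{-1}(u)\,\mathrm{d}u
\;=\; \mathbb{E}\!\left[X \mid X \le \mathrm{VaR}_\alpha(X)\right],
```

where the second equality holds for continuous $X$ and $\mathrm{VaR}_\alpha(X) = F_X^{-1}(\alpha)$. Taking $\alpha = 1$ recovers the risk-neutral mean, consistent with the abstract's claim that the cost of risk sensitivity is a factor scaling with $1/\alpha$.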

ICML Conference 2019 Conference Paper

Combining parametric and nonparametric models for off-policy evaluation

  • Omer Gottesman
  • Yao Liu 0009
  • Scott Sussex
  • Emma Brunskill
  • Finale Doshi

We consider a model-based approach to perform batch off-policy evaluation in reinforcement learning. Our method takes a mixture-of-experts approach to combine parametric and non-parametric models of the environment such that the final value estimate has the least expected error. We do so by first estimating the local accuracy of each model and then using a planner to select which model to use at every time step as to minimize the return error estimate along entire trajectories. Across a variety of domains, our mixture-based approach outperforms the individual models alone as well as state-of-the-art importance sampling-based estimators.

RLDM Conference 2019 Conference Abstract

Directed Exploration for Reinforcement Learning with Function Approximation

  • Zhaohan Guo
  • Emma Brunskill

Efficient exploration is necessary to achieve good sample efficiency for reinforcement learning in general. From small, tabular settings such as gridworlds to large, continuous and sparse reward settings such as robotic object manipulation tasks, exploration through adding an uncertainty bonus to the reward function has been shown to be effective when the uncertainty is able to accurately drive exploration towards promising states. However, reward bonuses can still be inefficient since they are non-stationary, which means that we must wait for function approximators to catch up and converge again when uncertainties change. We propose the idea of directed exploration: learning a goal-conditioned policy, where goals are simply other states, and using it to directly try to reach states with large uncertainty. The goal-conditioned policy is independent of uncertainty and is thus stationary. We show in our experiments how directed exploration is more efficient at exploration and more robust to how the uncertainty is computed than adding bonuses to rewards.

RLDM Conference 2019 Conference Abstract

Fake It Till You Make It: Learning-Compatible Performance Support

  • Jonathan Bragg
  • Emma Brunskill

A longstanding goal of artificial intelligence (AI) is to develop technologies that augment or assist humans. Current approaches to developing agents that can assist humans focus on adapting behavior of the assistant, and do not consider the potential for assistants to support human learning. We argue that in many cases it is worthwhile to provide assistance in a manner that also promotes task learning or skill maintenance. We term such assistance Learning-Compatible Performance Support (LCPS), and provide methods that greatly improve learning outcomes while still providing high levels of performance support. We demonstrate the effectiveness of our approach in multiple domains, including a complex flight control task.

UAI Conference 2019 Conference Paper

Fake It Till You Make It: Learning-Compatible Performance Support

  • Jonathan Bragg
  • Emma Brunskill

A longstanding goal of artificial intelligence is to develop technologies that augment or assist humans. Current approaches to developing agents that can assist humans focus on adapting behavior of the assistant, and do not consider the potential for assistants to support human learning. We argue that in many cases it is worthwhile to provide assistance in a manner that also promotes task learning or skill maintenance. We term such assistance Learning-Compatible Performance Support, and present the Stochastic Q Bumpers algorithm for greatly improving learning outcomes while still providing high levels of performance support. We demonstrate the effectiveness of our approach in multiple domains, including a complex flight control task.

RLDM Conference 2019 Conference Abstract

High-Probability Guarantees for Offline Contextual Bandits

  • Blossom Metevier
  • Sarah Brockman
  • Yuriy Brun
  • Emma Brunskill
  • Philip Thomas

We present an offline contextual bandit algorithm designed to satisfy a broad family of fairness constraints. Unlike previous work, our algorithm accepts multiple user-specified and problem-specific definitions of fairness, including novel ones. Empirically, we evaluate our algorithm on applications related to an intelligent tutoring system (using data we collected via a user study) and criminal recidivism (using data released by ProPublica). In each setting our algorithm always produces fair policies that achieve rewards competitive with unsafe policies constructed by other offline and online contextual bandit algorithms.

NeurIPS Conference 2019 Conference Paper

Limiting Extrapolation in Linear Approximate Value Iteration

  • Andrea Zanette
  • Alessandro Lazaric
  • Mykel Kochenderfer
  • Emma Brunskill

We study linear approximate value iteration (LAVI) with a generative model. While linear models may accurately represent the optimal value function using a few parameters, several empirical and theoretical studies show the combination of least-squares projection with the Bellman operator may be expansive, thus leading LAVI to amplify errors over iterations and eventually diverge. We introduce an algorithm that approximates value functions by combining Q-values estimated at a set of \textit{anchor} states. Our algorithm tries to balance the generalization and compactness of linear methods with the small amplification of errors typical of interpolation methods. We prove that if the features at any state can be represented as a convex combination of features at the anchor points, then errors are propagated linearly over iterations (instead of exponentially) and our method achieves a polynomial sample complexity bound in the horizon and the number of anchor points. These findings are confirmed in preliminary simulations in a number of simple problems where a traditional least-square LAVI method diverges.
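
The representability condition described above can be written as requiring every state's features to lie in the convex hull of the anchor features:

```latex
\phi(s) \;=\; \sum_{i=1}^{K} \lambda_i(s)\,\phi(s_i),
\qquad \lambda_i(s) \ge 0,
\qquad \sum_{i=1}^{K} \lambda_i(s) = 1,
```

so the approximated value at any state is the same convex combination of the $K$ anchor Q-values, which is what keeps the per-iteration error amplification linear rather than exponential.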

UAI Conference 2019 Conference Paper

Off-Policy Policy Gradient with Stationary Distribution Correction

  • Yao Liu 0009
  • Adith Swaminathan
  • Alekh Agarwal
  • Emma Brunskill

We study the problem of off-policy policy optimization in Markov decision processes, and develop a novel off-policy policy gradient method. Prior off-policy policy gradient approaches have generally ignored the mismatch between the distribution of states visited under the behavior policy used to collect data, and what would be the distribution of states under the learned policy. Here we build on recent progress for estimating the ratio of the state distributions under behavior and evaluation policies for policy evaluation, and present an off-policy policy gradient optimization technique that can account for this mismatch in distributions. We present an illustrative example of why this is important and a theoretical convergence guarantee for our approach. Empirically, we compare our method in simulations to several strong baselines which do not correct for this mismatch, significantly improving in the quality of the policy discovered.
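
Schematically, the corrected gradient reweights the usual off-policy policy gradient by the state-distribution ratio that prior approaches dropped:

```latex
\nabla_\theta J(\pi_\theta) \;\approx\;
\mathbb{E}_{(s,a)\sim d^{\beta}}\!\left[
\frac{d^{\pi_\theta}(s)}{d^{\beta}(s)}\,
\frac{\pi_\theta(a \mid s)}{\beta(a \mid s)}\,
Q^{\pi_\theta}(s,a)\,\nabla_\theta \log \pi_\theta(a \mid s)
\right],
```

where $d^{\beta}$ and $d^{\pi_\theta}$ are the state distributions under the behavior and learned policies; omitting the first ratio yields exactly the mismatch the abstract describes.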

RLDM Conference 2019 Conference Abstract

Off-Policy Policy Gradient with Stationary Distribution Correction

  • Yao Liu
  • Adith Swaminathan
  • Alekh Agarwal
  • Emma Brunskill

The ability to use data about prior decisions, and their outcomes, to make counterfactual inferences about how alternative decision policies might perform, is a cornerstone of intelligent behavior and has substantial practical importance. We focus on the problem of performing such counterfactual inferences in the context of sequential decision making in a Markov decision process, and consider how to perform off-policy policy optimization using a policy gradient method. Policy gradient methods have had great recent success when used in online reinforcement learning, can often be a nice way to encode inductive bias, and are able to tackle continuous action domains. Prior off-policy policy gradient approaches have generally ignored the mismatch between the distribution of states visited under the behavior policy used to collect data, and what would be the distribution of states under a new target policy. Here we build on recent progress for estimating the ratio of the Markov chain stationary distribution of states in policy evaluation, and present an off-policy policy gradient optimization technique that can account for this mismatch in distributions. We present an illustrative example of why this is important, and empirical simulations to suggest the benefits of this approach. We hope this is a step towards practical algorithms that can efficiently leverage prior data in order to inform better future decision policies.

NeurIPS Conference 2019 Conference Paper

Offline Contextual Bandits with High Probability Fairness Guarantees

  • Blossom Metevier
  • Stephen Giguere
  • Sarah Brockman
  • Ari Kobren
  • Yuriy Brun
  • Emma Brunskill
  • Philip Thomas

We present RobinHood, an offline contextual bandit algorithm designed to satisfy a broad family of fairness constraints. Our algorithm accepts multiple fairness definitions and allows users to construct their own unique fairness definitions for the problem at hand. We provide a theoretical analysis of RobinHood, which includes a proof that it will not return an unfair solution with probability greater than a user-specified threshold. We validate our algorithm on three applications: a tutoring system in which we conduct a user study and consider multiple unique fairness definitions; a loan approval setting (using the Statlog German credit data set) in which well-known fairness definitions are applied; and criminal recidivism (using data released by ProPublica). In each setting, our algorithm is able to produce fair policies that achieve performance competitive with other offline and online contextual bandit algorithms.

RLDM Conference 2019 Conference Abstract

PLOTS: Procedure Learning from Observations using Subtask Structure

  • Tong Mu
  • Karan Goel
  • Emma Brunskill

In many cases an intelligent agent may want to learn how to mimic a single observed demonstrated trajectory. In this work we consider how to perform such procedural learning from observation, which could help to enable agents to better use the enormous set of video data on observation sequences. Our approach exploits the properties of this setting to incrementally build an open loop action plan that can yield the desired subsequence, and can be used in both Markov and partially observable Markov domains. In addition, procedures commonly involve repeated extended temporal action subsequences. Our method optimistically explores actions to leverage potential repeated structure in the procedure. In comparing to some state-of-the-art approaches we find that our explicit procedural learning from observation method is about 100 times faster than policy-gradient based approaches that learn a stochastic policy and is faster than model based approaches as well. We also find that performing optimistic action selection yields substantial speed ups when latent dynamical structure is present.

AAMAS Conference 2019 Conference Paper

PLOTS: Procedure Learning from Observations using subTask Structure

  • Tong Mu
  • Karan Goel
  • Emma Brunskill

In many cases an intelligent agent may want to learn how to mimic a single observed demonstrated trajectory. In this work we consider how to perform such procedural learning from observation, which could help to enable agents to better use the enormous set of video data on observation sequences. Our approach exploits the properties of this setting to incrementally build an open loop action plan that can yield the desired subsequence, and can be used in both Markov and partially observable Markov domains. In addition, procedures commonly involve repeated extended temporal action subsequences. Our method optimistically explores actions to leverage potential repeated structure in the procedure. In comparing to some state-of-the-art approaches we find that our explicit procedural learning from observation method is about 100 times faster than policy-gradient based approaches that learn a stochastic policy and is faster than model based approaches as well. We also find that performing optimistic action selection yields substantial speed ups when latent dynamical structure is present.

ICML Conference 2019 Conference Paper

Policy Certificates: Towards Accountable Reinforcement Learning

  • Christoph Dann
  • Lihong Li 0001
  • Wei Wei
  • Emma Brunskill

The performance of a reinforcement learning algorithm can vary drastically during learning because of exploration. Existing algorithms provide little information about the quality of their current policy before executing it, and thus have limited use in high-stakes applications like healthcare. We address this lack of accountability by proposing that algorithms output policy certificates. These certificates bound the sub-optimality and return of the policy in the next episode, allowing humans to intervene when the certified quality is not satisfactory. We further introduce two new algorithms with certificates and present a new framework for theoretical analysis that guarantees the quality of their policies and certificates. For tabular MDPs, we show that computing certificates can even improve the sample-efficiency of optimism-based exploration. As a result, one of our algorithms is the first to achieve minimax-optimal PAC bounds up to lower-order terms, and this algorithm also matches (and in some settings slightly improves upon) existing minimax regret bounds.

ICML Conference 2019 Conference Paper

Separable value functions across time-scales

  • Joshua Romoff
  • Peter Henderson 0002
  • Ahmed Touati
  • Yann Ollivier
  • Joelle Pineau
  • Emma Brunskill

In many finite horizon episodic reinforcement learning (RL) settings, it is desirable to optimize for the undiscounted return - in settings like Atari, for instance, the goal is to collect the most points while staying alive in the long run. Yet, it may be mathematically difficult (or even intractable) to learn with this target. As such, temporal discounting is often applied to optimize over a shorter effective planning horizon. This comes at the cost of potentially biasing the optimization target away from the undiscounted goal. In settings where this bias is unacceptable - where the system must optimize for longer horizons at higher discounts - the target of the value function approximator may increase in variance, leading to difficulties in learning. We present an extension of temporal difference (TD) learning, which we call TD($\Delta$), that breaks down a value function into a series of components based on the differences between value functions with smaller discount factors. The separation of a longer horizon value function into these components has useful properties in scalability and performance. We discuss these properties and show theoretical and empirical improvements over standard TD learning in certain settings.
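
The decomposition can be made concrete with a small tabular sketch: each component W_z estimates the difference V_{gamma_z} - V_{gamma_{z-1}} between value functions at adjacent discount factors, and the full value is recovered by summing components. The recursion below is a plain-Python illustration of this idea; the step size, discount schedule, and data layout are assumptions, and the paper's exact algorithm may differ.

    import numpy as np

    def td_delta_update(W, s, r, s_next, gammas, alpha=0.1):
        # W: array of shape (len(gammas), n_states); gammas sorted ascending.
        # Component 0 is an ordinary TD(0) estimate of V_{gamma_0}.
        target0 = r + gammas[0] * W[0, s_next]
        W[0, s] += alpha * (target0 - W[0, s])
        for z in range(1, len(gammas)):
            v_prev = W[:z, s_next].sum()          # V_{gamma_{z-1}}(s')
            target = (gammas[z] - gammas[z - 1]) * v_prev + gammas[z] * W[z, s_next]
            W[z, s] += alpha * (target - W[z, s])
        return W

    # The full value at the largest discount is recovered as W[:, s].sum().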

ICML Conference 2019 Conference Paper

Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds

  • Andrea Zanette
  • Emma Brunskill

Strong worst-case performance bounds for episodic reinforcement learning exist, but fortunately in practice RL algorithms often perform much better than such bounds would predict. Algorithms and theory that provide strong problem-dependent bounds could help illuminate the key features of what makes an RL problem hard and reduce the barrier to using RL algorithms in practice. As a step towards this, we derive an algorithm and analysis for finite horizon discrete MDPs with state-of-the-art worst-case regret bounds and substantially tighter bounds if the RL environment has special features, without a priori knowledge of the environment from the algorithm. As a result of our analysis, we also help address an open learning theory question \cite{jiang2018open} about episodic MDPs with a constant upper-bound on the sum of rewards, providing a regret bound that is a function of the number of episodes with no dependence on the horizon.

ICML Conference 2018 Conference Paper

Decoupling Gradient-Like Learning Rules from Representations

  • Philip S. Thomas
  • Christoph Dann
  • Emma Brunskill

In machine learning, learning often corresponds to changing the parameters of a parameterized function. A learning rule is an algorithm or mathematical expression that specifies precisely how the parameters should be changed. When creating a machine learning system, we must make two decisions: what representation should be used (i.e., what parameterized function should be used) and what learning rule should be used to search through the resulting set of representable functions. In this paper we focus on gradient-like learning rules, wherein these two decisions are coupled in a subtle (and often unintentional) way. That is, using the same learning rule with two different representations that can represent the same sets of functions can result in two different outcomes. After arguing that this coupling is undesirable, particularly when using neural networks, we present a method for partially decoupling these two decisions for a broad class of gradient-like learning rules that span unsupervised learning, reinforcement learning, and supervised learning.

IJCAI Conference 2018 Conference Paper

Importance Sampling for Fair Policy Selection

  • Shayan Doroudi
  • Philip S. Thomas
  • Emma Brunskill

We consider the problem of off-policy policy selection in reinforcement learning: using historical data generated from running one policy to compare two or more policies. We show that approaches based on importance sampling can be unfair---they can select the worse of two policies more often than not. We then give an example that shows importance sampling is systematically unfair in a practically relevant setting; namely, we show that it unreasonably favors shorter trajectory lengths. We then present sufficient conditions to theoretically guarantee fairness. Finally, we provide a practical importance sampling-based estimator to help mitigate the unfairness due to varying trajectory lengths.
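
For context, the ordinary importance sampling estimator the paper critiques weights each trajectory's return by the likelihood ratio of the evaluation and behavior policies. A minimal sketch, with a toy data layout assumed:

    import numpy as np

    def is_estimate(trajectories, pi_e, pi_b):
        # trajectories: list of [(s, a, r), ...] generated under pi_b;
        # pi_e(s, a) and pi_b(s, a) return action probabilities.
        values = []
        for traj in trajectories:
            rho = np.prod([pi_e(s, a) / pi_b(s, a) for s, a, _ in traj])
            values.append(rho * sum(r for _, _, r in traj))
        return float(np.mean(values))

Because longer trajectories multiply more per-step ratios, their weights are more variable, which is one route to the systematic preference for shorter trajectories described above.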

ICML Conference 2018 Conference Paper

Problem Dependent Reinforcement Learning Bounds Which Can Identify Bandit Structure in MDPs

  • Andrea Zanette
  • Emma Brunskill

In order to make good decisions under uncertainty, an agent must learn from observations. Two of the most common frameworks for doing so are Contextual Bandits and Markov Decision Processes (MDPs). In this paper, we study whether there exist algorithms for the more general framework (MDP) that automatically provide the best performance bounds for the specific problem at hand without user intervention and without modifying the algorithm. In particular, we find that a very minor variant of a recently proposed reinforcement learning algorithm for MDPs already matches the best possible regret bound $\tilde O (\sqrt{SAT})$ in the dominant term if deployed on a tabular Contextual Bandit problem, despite the agent being agnostic to this setting.

NeurIPS Conference 2018 Conference Paper

Representation Balancing MDPs for Off-policy Policy Evaluation

  • Yao Liu
  • Omer Gottesman
  • Aniruddh Raghu
  • Matthieu Komorowski
  • Aldo Faisal
  • Finale Doshi-Velez
  • Emma Brunskill

We study the problem of off-policy policy evaluation (OPPE) in RL. In contrast to prior work, we consider how to estimate both the individual policy value and the average policy value accurately. We draw inspiration from recent work in causal reasoning, and propose a new finite sample generalization error bound for value estimates from MDP models. Using this upper bound as an objective, we develop a learning algorithm for an MDP model with a balanced representation, and show that our approach can yield substantially lower MSE in common synthetic benchmarks and an HIV treatment simulation domain.

EWRL Workshop 2018 Workshop Paper

Sample Efficient Learning with Feature Selection for Factored MDPs

  • Zhaohan Guo
  • Emma Brunskill

In reinforcement learning, state is often represented by feature vectors. Prior sample complexity bounds scale with the complexity of all features. However, not all features may be necessary for learning a good policy. Therefore it is of significant interest to understand if the sample complexity can scale with the complexity of the necessary features instead of all features. We answer this in the affirmative for at least one important case of interest: factored Markov Decision Processes. We show that it is possible to eliminate unnecessary features by using directed exploration and leveraging the negative information from failing to reach desired states. Under mild assumptions, this is sufficient to show there exists an RL algorithm whose sample complexity scales with the cardinality of the parent sets of the necessary features, rather than the parent sets of all features. This yields an exponential improvement in sample complexity bounds when the maximum cardinality of the parent sets of the necessary features is smaller than for all features.

EWRL Workshop 2018 Workshop Paper

When Simple Exploration is Sample Efficient: Identifying Sufficient Conditions for Random Exploration to Yield PAC RL Algorithms

  • Yao Liu
  • Emma Brunskill

Efficient exploration is one of the key challenges for reinforcement learning (RL) algorithms. Most traditional sample efficiency bounds require strategic exploration. Recently, many deep RL algorithms with simple heuristic exploration strategies, which have few formal guarantees, have achieved surprising success in many domains. These results pose an important question about understanding such exploration strategies (e.g., e-greedy), as well as understanding what characterizes the difficulty of exploration in MDPs. In this work we propose problem-specific sample complexity bounds for Q-learning with random walk exploration that rely on several structural properties. We also link our theoretical results to some empirical benchmark domains, to illustrate whether our bound gives polynomial sample complexity in these domains and how that relates to the empirical performance.
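
The kind of learner analyzed here can be illustrated by tabular Q-learning whose behavior is a pure random walk over actions; the environment interface below is hypothetical:

    import numpy as np

    def random_walk_q_learning(env, n_states, n_actions, episodes,
                               alpha=0.1, gamma=0.95, seed=0):
        # env is an assumed interface: env.reset() -> s,
        # env.step(a) -> (s_next, r, done).
        Q = np.zeros((n_states, n_actions))
        rng = np.random.default_rng(seed)
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                a = int(rng.integers(n_actions))      # random-walk exploration
                s_next, r, done = env.step(a)
                target = r + (0.0 if done else gamma * Q[s_next].max())
                Q[s, a] += alpha * (target - Q[s, a])
                s = s_next
        return Q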

UAI Conference 2017 Conference Paper

Importance Sampling for Fair Policy Selection

  • Shayan Doroudi
  • Philip S. Thomas
  • Emma Brunskill

We consider the problem of off-policy policy selection in reinforcement learning: using historical data generated from running one policy to compare two or more policies. We show that approaches based on importance sampling can be unfair—they can select the worse of two policies more often than not. We give two examples where the unfairness of importance sampling could be practically concerning. We then present sufficient conditions to theoretically guarantee fairness and a related notion of safety. Finally, we provide a practical importance sampling-based estimator to help mitigate one of the systematic sources of unfairness resulting from using importance sampling for policy selection.

AAAI Conference 2017 Conference Paper

Importance Sampling with Unequal Support

  • Philip Thomas
  • Emma Brunskill

Importance sampling is often used in machine learning when training and testing data come from different distributions. In this paper we propose a new variant of importance sampling that can reduce the variance of importance sampling-based estimates by orders of magnitude when the supports of the training and testing distributions differ. After motivating and presenting our new importance sampling estimator, we provide a detailed theoretical analysis that characterizes both its bias and variance relative to the ordinary importance sampling estimator (in various settings, which include cases where ordinary importance sampling is biased, while our new estimator is not, and vice versa). We conclude with an example of how our new importance sampling estimator can be used to improve estimates of how well a new treatment policy for diabetes will work for an individual, using only data from when the individual used a previous treatment policy.

RLDM Conference 2017 Conference Abstract

Model Selection for Off-Policy Policy Evaluation

  • Yao Liu
  • Philip Thomas
  • Emma Brunskill

In this work we study the off-policy policy evaluation problem: predicting the value of a policy using data collected from other policies. This is crucial for many applications where we cannot deploy a new policy directly due to safety or cost. We consider the model selection problem for off-policy estimators when we have models from different sources. Traditional off-policy policy evaluation methods can be divided into importance sampling estimators and model-based estimators, which respectively suffer from high variance and bias. Recent work, such as the doubly robust and MAGIC estimators, shows that we can benefit from combining importance sampling with a model's value estimates. However, these methods all assume a single model is available. When we have several different models, as is common in complex domains, it may be hard to select the best one, and we may lose the potential benefit of the others. We present an example showing that selecting a model by simply minimizing the error notion used in a previous estimator (MAGIC) can settle on the wrong model, which suggests that selecting the best model for off-policy policy evaluation is non-trivial and worth further exploration. We propose two new estimators of model bias and a cross-validation procedure to help choose a model, and show preliminary results.

NeurIPS Conference 2017 Conference Paper

Regret Minimization in MDPs with Options without Prior Knowledge

  • Ronan Fruit
  • Matteo Pirotta
  • Alessandro Lazaric
  • Emma Brunskill

The option framework integrates temporal abstraction into the reinforcement learning model through the introduction of macro-actions (i.e., options). Recent works leveraged the mapping of Markov decision processes (MDPs) with options to semi-MDPs (SMDPs) and introduced SMDP-versions of exploration-exploitation algorithms (e.g., RMAX-SMDP and UCRL-SMDP) to analyze the impact of options on the learning performance. Nonetheless, the PAC-SMDP sample complexity of RMAX-SMDP can hardly be translated into equivalent PAC-MDP theoretical guarantees, while UCRL-SMDP requires prior knowledge of the parameters characterizing the distributions of the cumulative reward and duration of each option, which are hardly available in practice. In this paper, we remove this limitation by combining the SMDP view together with the inner Markov structure of options into a novel algorithm whose regret performance matches UCRL-SMDP's up to an additive regret term. We show scenarios where this term is negligible and the advantage of temporal abstraction is preserved. We also report preliminary empirical results supporting the theoretical findings.

RLDM Conference 2017 Conference Abstract

Sample Efficient Policy Search for Optimal Stopping Domains

  • Karan Goel
  • Christoph Dann
  • Rika Antonova
  • Emma Brunskill

Arising naturally in many fields, optimal stopping problems consider the question of deciding when to stop an observation-generating process. Classical examples include house-selling, the problem of deciding whether to sell a house given a bid and a history of past offers, and the secretary problem of deciding whether to hire an applicant or not, given that future applicants may be of higher quality. We examine the problem of simultaneously learning and planning in optimal stopping domains with unknown dynamics, when data is collected directly from the environment, as is common in real-world applications, rather than from a simulator. We propose Gather Full, Search and Execute, a simple and flexible model-free policy search method that leverages problem structure to improve efficiency of data reuse. Using a simple policy evaluation trick, GFSE evaluates every policy in an input policy class using all of the collected data and outputs a policy that has a near-optimal value in the policy class. To achieve this, we bound the sample complexity of GFSE to guarantee that policy value estimates are uniformly close to their true values with high probability. Our results tighten existing PAC bounds for general Partially Observable Markov Decision Processes (POMDPs) to achieve logarithmic dependence on horizon length for our setting, in contrast to the exponential horizon length dependence for learning in general POMDPs. We demonstrate the benefit of our method against prevalent model-based and model-free approaches on a simulated student tutoring domain, and a ticket purchase domain with real airline pricing data.
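
The data-reuse trick can be sketched concretely: if trajectories are gathered by never stopping, then any stopping policy's return on a trajectory is determined by the first step at which it would have stopped, so every policy in the class can be evaluated on every trajectory. A toy illustration, with the data layout and interfaces assumed rather than taken from the paper:

    def evaluate_stopping_policy(policy, trajectories):
        # trajectories: full "never stop" rollouts, each a list of
        # (obs, stop_reward) pairs; policy(history) -> True to stop now.
        total = 0.0
        for traj in trajectories:
            history = []
            reward = traj[-1][1]          # forced to stop at the final step
            for obs, stop_reward in traj:
                history.append(obs)
                if policy(list(history)):
                    reward = stop_reward  # reward this policy would have earned
                    break
            total += reward
        return total / len(trajectories)

A GFSE-style method would then return the argmax of these estimates over the input policy class.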

IJCAI Conference 2017 Conference Paper

Sample Efficient Policy Search for Optimal Stopping Domains

  • Karan Goel
  • Christoph Dann
  • Emma Brunskill

Optimal stopping problems consider the question of deciding when to stop an observation-generating process in order to maximize a return. We examine the problem of simultaneously learning and planning in such domains, when data is collected directly from the environment. We propose GFSE, a simple and flexible model-free policy search method that reuses data for sample efficiency by leveraging problem structure. We bound the sample complexity of our approach to guarantee uniform convergence of policy value estimates, tightening existing PAC bounds to achieve logarithmic dependence on horizon length for our setting. We also examine the benefit of our method against prevalent model-based and model-free approaches on 3 domains taken from diverse fields.

RLDM Conference 2017 Conference Abstract

UBEV - A More Practical Algorithm for Episodic RL with Near-Optimal PAC and Regret Guarantees

  • Christoph Dann
  • Tor Lattimore
  • Emma Brunskill

We present UBEV, a simple and efficient reinforcement learning algorithm for fixed-horizon episodic Markov decision processes. The main contribution is a proof that UBEV enjoys a sample-complexity bound that holds for all accuracy levels simultaneously with high probability, and matches the lower bound except for logarithmic terms and one factor of the horizon. A consequence of the fact that our sample-complexity bound holds for all accuracy levels is that the new algorithm achieves a sub-linear regret of $O(\sqrt{SAT})$, which is the first time the dependence on the size of the state space has provably appeared inside the square root. A brief empirical evaluation shows that UBEV is practically superior to existing algorithms with known sample-complexity guarantees.

NeurIPS Conference 2017 Conference Paper

Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning

  • Christoph Dann
  • Tor Lattimore
  • Emma Brunskill

Statistical performance bounds for reinforcement learning (RL) algorithms can be critical for high-stakes applications like healthcare. This paper introduces a new framework for theoretically measuring the performance of such algorithms called Uniform-PAC, which is a strengthening of the classical Probably Approximately Correct (PAC) framework. In contrast to the PAC framework, the uniform version may be used to derive high probability regret guarantees and so forms a bridge between the two setups that has been missing in the literature. We demonstrate the benefits of the new framework for finite-state episodic MDPs with a new algorithm that is Uniform-PAC and simultaneously achieves optimal regret and PAC guarantees except for a factor of the horizon.
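
In symbols, the Uniform-PAC condition can be restated roughly as follows, with notation assumed for illustration: with probability at least $1-\delta$, the number of $\varepsilon$-suboptimal episodes is bounded for every accuracy level simultaneously,

    \Pr\Bigl(\forall \varepsilon > 0:\;
        \bigl|\{\, k : V^* - V^{\pi_k} > \varepsilon \,\}\bigr|
        \;\le\; F_{\mathrm{UPAC}}\bigl(1/\varepsilon,\ \log(1/\delta)\bigr) \Bigr)
    \;\ge\; 1 - \delta.

A classical PAC bound fixes a single $\varepsilon$ in advance; quantifying over all $\varepsilon$ inside one high-probability event is what allows high-probability regret guarantees to be read off from the same statement.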

NeurIPS Conference 2017 Conference Paper

Using Options and Covariance Testing for Long Horizon Off-Policy Policy Evaluation

  • Zhaohan Guo
  • Philip Thomas
  • Emma Brunskill

Evaluating a policy by deploying it in the real world can be risky and costly. Off-policy policy evaluation (OPE) algorithms use historical data collected from running a previous policy to evaluate a new policy, which provides a means for evaluating a policy without requiring it to ever be deployed. Importance sampling is a popular OPE method because it is robust to partial observability and works with continuous states and actions. However, the amount of historical data required by importance sampling can scale exponentially with the horizon of the problem: the number of sequential decisions that are made. We propose using policies over temporally extended actions, called options, and show that combining these policies with importance sampling can significantly improve performance for long-horizon problems. In addition, we can take advantage of special cases that arise due to options-based policies to further improve the performance of importance sampling. We further generalize these special cases to a general covariance testing rule that can be used to decide which weights to drop in an IS estimate, and derive a new IS algorithm called Incremental Importance Sampling that can provide significantly more accurate estimates for a broad class of domains.

RLDM Conference 2017 Conference Abstract

Using Options for Long-Horizon Off-Policy Evaluation

  • Zhaohan Guo
  • Philip Thomas
  • Emma Brunskill

Evaluating a policy by deploying it in the real world can be risky and costly. Off-policy evaluation (OPE) algorithms use historical data collected from running a previous policy to evaluate a new policy, which provides a means for evaluating a policy without requiring it to ever be deployed. Importance sampling is a popular OPE method because it is robust to partial observability and works with continuous states and actions. However, we show that the amount of historical data required by importance sampling can scale exponentially with the horizon of the problem: the number of sequential decisions that are made. We propose using policies over temporally extended actions, called options, to address this long-horizon problem. We show theoretically and experimentally that combining importance sampling with options-based policies can significantly improve performance for long-horizon problems.

AAAI Conference 2017 Conference Paper

Where to Add Actions in Human-in-the-Loop Reinforcement Learning

  • Travis Mandel
  • Yun-En Liu
  • Emma Brunskill
  • Zoran Popović

In order for reinforcement learning systems to learn quickly in vast action spaces such as the space of all possible pieces of text or the space of all images, leveraging human intuition and creativity is key. However, a human-designed action space is likely to be initially imperfect and limited; furthermore, humans may improve at creating useful actions with practice or new information. Therefore, we propose a framework in which a human adds actions to a reinforcement learning system over time to boost performance. In this setting, however, it is key that we use human effort as efficiently as possible, and one significant danger is that humans waste effort adding actions at places (states) that aren’t very important. Therefore, we propose Expected Local Improvement (ELI), an automated method which selects states at which to query humans for a new action. We evaluate ELI on a variety of simulated domains adapted from the literature, including domains with over a million actions and domains where the simulated experts change over time. We find ELI demonstrates excellent empirical performance, even in settings where the synthetic “experts” are quite poor.

ICML Conference 2016 Conference Paper

Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning

  • Philip S. Thomas
  • Emma Brunskill

In this paper we present a new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy. The ability to evaluate a policy from historical data is important for applications where the deployment of a bad policy can be dangerous or costly. We show empirically that our algorithm produces estimates that often have orders of magnitude lower mean squared error than existing methods—it makes more efficient use of the available data. Our new estimator is based on two advances: an extension of the doubly robust estimator (Jiang & Li, 2015), and a new way to mix between model based and importance sampling based estimates.
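
The doubly robust estimator being extended combines a model's value estimates with importance-sampling corrections; per trajectory it looks roughly like the sketch below, where the data layout, the discount, and the $\hat q$/$\hat v$ interfaces are assumptions:

    def doubly_robust(traj, pi_e, pi_b, q_hat, v_hat, gamma=1.0):
        # traj: [(s, a, r), ...] generated under pi_b; q_hat(s, a) and v_hat(s)
        # are model-based value estimates for the evaluation policy pi_e.
        est, rho = 0.0, 1.0
        for t, (s, a, r) in enumerate(traj):
            rho_prev = rho
            rho *= pi_e(s, a) / pi_b(s, a)    # cumulative importance weight
            est += gamma**t * (rho * r - (rho * q_hat(s, a) - rho_prev * v_hat(s)))
        return est

Averaging over trajectories gives the policy-value estimate; the paper's second advance is then a new way to blend such corrected estimates with a purely model-based estimate.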

IJCAI Conference 2016 Conference Paper

Efficient Bayesian Clustering for Reinforcement Learning

  • Travis Mandel
  • Yun-En Liu
  • Emma Brunskill
  • Zoran Popovic

A fundamental artificial intelligence challenge is how to design agents that intelligently trade off exploration and exploitation while quickly learning about an unknown environment. However, in order to learn quickly, we must somehow generalize experience across states. One promising approach is to use Bayesian methods to simultaneously cluster dynamics and control exploration; unfortunately, these methods tend to require computationally intensive MCMC approximation techniques which lack guarantees. We propose Thompson Clustering for Reinforcement Learning (TCRL), a family of Bayesian clustering algorithms for reinforcement learning that leverage structure in the state space to remain computationally efficient while controlling both exploration and generalization. TCRL-Theoretic achieves near-optimal Bayesian regret bounds while consistently improving over a standard Bayesian exploration approach. TCRL-Relaxed is guaranteed to converge to acting optimally, and empirically outperforms state-of-the-art Bayesian clustering algorithms across a variety of simulated domains, even in cases where no states are similar.

ICML Conference 2016 Conference Paper

Energetic Natural Gradient Descent

  • Philip S. Thomas
  • Bruno Castro da Silva
  • Christoph Dann
  • Emma Brunskill

We propose a new class of algorithms for minimizing or maximizing functions of parametric probabilistic models. These new algorithms are natural gradient algorithms that leverage more information than prior methods by using a new metric tensor in place of the commonly used Fisher information matrix. This new metric tensor is derived by computing directions of steepest ascent where the distance between distributions is measured using an approximation of energy distance (as opposed to Kullback-Leibler divergence, which produces the Fisher information matrix), and so we refer to our new ascent direction as the energetic natural gradient.
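
Structurally, any natural-gradient method preconditions the ordinary gradient by the inverse of a metric tensor; here the Fisher information matrix would be replaced by an energy-distance-based tensor, whose computation is the paper's contribution and is simply taken as given in this sketch:

    import numpy as np

    def natural_gradient_step(theta, grad, G, lr=0.01):
        # G: metric tensor (Fisher for classic natural gradient; an
        # energy-distance approximation in this paper). Solve G d = grad
        # rather than forming the inverse of G explicitly.
        direction = np.linalg.solve(G, grad)
        return theta + lr * direction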

IJCAI Conference 2016 Conference Paper

Latent Contextual Bandits and their Application to Personalized Recommendations for New Users

  • Li Zhou
  • Emma Brunskill

Personalized recommendations for new users, also known as the cold-start problem, can be formulated as a contextual bandit problem. Existing contextual bandit algorithms generally rely on features alone to capture user variability. Such methods are inefficient in learning new users' interests. In this paper we propose Latent Contextual Bandits. We consider both the benefit of leveraging a set of learned latent user classes for new users, and how we can learn such latent classes from prior users. We show that our approach achieves a better regret bound than existing algorithms. We also demonstrate the benefit of our approach using a large real world dataset and a preliminary user study.

EWRL Workshop 2016 Workshop Paper

Magical Policy Search: Data Efficient Reinforcement Learning with Guarantees of Global Optimality

  • Philip Thomas
  • Emma Brunskill

We present a batch policy search algorithm that has several desirable properties: it has few parameters that require expert tuning, it can leverage approximate models of the environment, it can seamlessly handle continuous states and actions, (informally speaking) it is guaranteed to converge to a globally optimal policy even in partially observable environments, and in our simulations it outperforms a state-of-the-art baseline. The primary limitation of our algorithm is its high computational complexity—each policy improvement step involves the optimization of a known (not necessarily convex) function.

AAAI Conference 2016 Conference Paper

Offline Evaluation of Online Reinforcement Learning Algorithms

  • Travis Mandel
  • Yun-En Liu
  • Emma Brunskill
  • Zoran Popović

In many real-world reinforcement learning problems, we have access to an existing dataset and would like to use it to evaluate various learning approaches. Typically, one would prefer not to deploy a fixed policy, but rather an algorithm that learns to improve its behavior as it gains more experience. Therefore, we seek to evaluate how a proposed algorithm learns in our environment, meaning we need to evaluate how an algorithm would have gathered experience if it were run online. In this work, we develop three new evaluation approaches which guarantee that, given some history, algorithms are fed samples from the distribution that they would have encountered if they were run online. Additionally, we are the first to propose an approach that is provably unbiased given finite data, eliminating bias due to the length of the evaluation. Finally, we compare the sample-efficiency of these approaches on multiple datasets, including one from a real-world deployment of an educational game.

IJCAI Conference 2016 Conference Paper

Questimator: Generating Knowledge Assessments for Arbitrary Topics

  • Qi Guo
  • Chinmay Kulkarni
  • Aniket Kittur
  • Jeffrey P. Bigham
  • Emma Brunskill

Formative assessments allow learners to quickly identify knowledge gaps. In traditional educational settings, expert instructors can create assessments, but in informal learning environments it is difficult for novice learners to self-assess because they don't know what they don't know. This paper introduces Questimator, an automated system that generates multiple-choice assessment questions for any topic contained within Wikipedia. Given a topic, Questimator traverses the Wikipedia graph to find and rank related topics, and uses article text to form questions, answers and distractor options. In a study with 833 participants from Mechanical Turk, we found that participants' scores on Questimator-generated quizzes correlated well with their scores on existing online quizzes on topics ranging from philosophy to economics. Questimator also generates questions with discriminatory power comparable to existing online quizzes. Our results suggest Questimator may be useful for assessing learning in topics for which there is not an existing quiz.

RLDM Conference 2015 Conference Abstract

Concurrent PAC RL

  • Zhaohan Guo
  • Emma Brunskill

In many real-world situations an agent may make decisions across many separate reinforcement learning tasks in parallel, yet there has been very little work on concurrent RL. Building on the efficient exploration RL literature, we introduce two new concurrent RL algorithms and bound their sample complexity. We show that under some mild conditions, both when the agent is known to be acting in many copies of the same MDP, and when they are not the same but are taken from a finite set, we can gain order linear improvement in the sample complexity over not sharing information. This is quite exciting as a linear speedup is the most one might hope to gain. Our preliminary simulations also confirm this result empirically.

RLDM Conference 2015 Conference Abstract

Nonstationary Evaluation for Reinforcement Learning

  • Travis Mandel
  • Yun-En Liu
  • Emma Brunskill
  • Zoran Popović

In many real-world reinforcement learning problems, we have access to an existing dataset and would like to use it to evaluate various decision making approaches. Typically one uses offline policy evaluation techniques, where the goal is to evaluate how a fixed policy would perform using the available data. However, one rarely deploys a fixed policy, but instead deploys an algorithm that learns to improve its behavior as it gains experience. Therefore, we seek to evaluate how a proposed algorithm learns in our environment, or in other words, evaluate a policy that changes over time in response to data, a problem known as nonstationary evaluation. This problem has received significant attention in the bandit and contextual bandit frameworks, however no unbiased nonstationary estimators have been proposed for the more general case of reinforcement learning. In this work, we develop two new unbiased nonstationary evaluation approaches for reinforcement learning, discuss their trade-offs, and compare their data-efficiency on a real educational game dataset.

RLDM Conference 2015 Conference Abstract

Quickly Learning to Make Good Decisions

  • Emma Brunskill

A fundamental goal of artificial intelligence is to create agents that learn to make good decisions as they interact with a stochastic environment. Some of the most exciting and valuable potential applications involve systems that interact directly with humans, such as intelligent tutoring systems or medical support software. In these cases, minimizing the amount of experience needed by an algorithm to learn to make good decisions is highly important, as each decision, good or bad, is impacting a real person. I will describe our research on tackling this challenge, including transfer learning across sequential decision making tasks, as well as its relevance to improving educational tools.

NeurIPS Conference 2015 Conference Paper

Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning

  • Christoph Dann
  • Emma Brunskill

Recently, there has been significant progress in understanding reinforcement learning in discounted infinite-horizon Markov decision processes (MDPs) by deriving tight sample complexity bounds. However, in many real-world applications, an interactive learning agent operates for a fixed or bounded period of time, for example tutoring students for exams or handling customer service requests. Such scenarios can often be better treated as episodic fixed-horizon MDPs, for which only looser bounds on the sample complexity exist. A natural notion of sample complexity in this setting is the number of episodes required to guarantee a certain performance with high probability (PAC guarantee). In this paper, we derive an upper PAC bound of order O(|S|²|A|H² log(1/δ)/ɛ²) and a lower PAC bound Ω(|S||A|H² log(1/(δ+c))/ɛ²) (ignoring log-terms) that match up to log-terms and an additional linear dependency on the number of states |S|. The lower bound is the first of its kind for this setting. Our upper bound leverages Bernstein's inequality to improve on previous bounds for episodic finite-horizon MDPs which have a time-horizon dependency of at least H³.

RLDM Conference 2015 Conference Abstract

The Online Discovery Problem and Its Application to Lifelong Reinforcement Learning

  • Emma Brunskill
  • Lihong Li

We study lifelong reinforcement learning where the agent extracts knowledge from solving a sequence of tasks to speed learning in future ones. We first formulate and study a related online discovery problem, which can be of independent interest, and propose an optimal algorithm with matching upper and lower bounds. These results are then applied to create a robust, continuous lifelong reinforcement learning algorithm with formal learning guarantees, applicable to a much wider range of scenarios, as verified in simulations.

AAAI Conference 2015 Conference Paper

The Queue Method: Handling Delay, Heuristics, Prior Data, and Evaluation in Bandits

  • Travis Mandel
  • Yun-En Liu
  • Emma Brunskill
  • Zoran Popović

Current algorithms for the standard multi-armed bandit problem have good empirical performance and optimal regret bounds. However, real-world problems often differ from the standard formulation in several ways. First, feedback may be delayed instead of arriving immediately. Second, the real world often contains structure which suggests heuristics, which we wish to incorporate while retaining strong theoretical guarantees. Third, we may wish to make use of an arbitrary prior dataset without negatively impacting performance. Fourth, we may wish to efficiently evaluate algorithms using a previously collected dataset. Surprisingly, these seemingly-disparate problems can be addressed using algorithms inspired by a recently-developed queueing technique. We present the Stochastic Delayed Bandits (SDB) algorithm as a solution to these four problems, which takes black-box bandit algorithms (including heuristic approaches) as input while achieving good theoretical guarantees. We present empirical results from both synthetic simulations and real-world data drawn from an educational game. Our results show that SDB outperforms state-of-the-art approaches to handling delay, heuristics, prior data, and evaluation.
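
The queueing technique alluded to can be sketched as follows: logged, delayed, or prior rewards sit in per-arm FIFO queues, and the black-box bandit algorithm is fed from a queue whenever one is non-empty, falling back to a real pull otherwise. This is a simplification with hypothetical interfaces; SDB itself adds further machinery to retain its guarantees.

    from collections import deque

    def run_with_queues(algo, queues, env_pull, horizon):
        # queues: dict arm -> deque of logged/delayed rewards for that arm.
        # algo: black-box bandit exposing select_arm() and update(arm, reward).
        for _ in range(horizon):
            arm = algo.select_arm()
            if queues[arm]:
                reward = queues[arm].popleft()   # replay stored feedback
            else:
                reward = env_pull(arm)           # fall back to a real pull
            algo.update(arm, reward)

    # Example: prior logged data as per-arm queues.
    queues_example = {0: deque([0.1, 0.5]), 1: deque([0.9])}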

ICML Conference 2014 Conference Paper

Online Stochastic Optimization under Correlated Bandit Feedback

  • Mohammad Gheshlaghi Azar
  • Alessandro Lazaric
  • Emma Brunskill

In this paper we consider the problem of online stochastic optimization of a locally smooth function under bandit feedback. We introduce the high-confidence tree (HCT) algorithm, a novel anytime $\mathcal{X}$-armed bandit algorithm, and derive regret bounds matching the performance of state-of-the-art algorithms in terms of the dependency on number of steps and the near-optimality dimension. The main advantage of HCT is that it handles the challenging case of correlated bandit feedback (reward), whereas existing methods require rewards to be conditionally independent. HCT also improves on the state-of-the-art in terms of the memory requirement, as well as requiring a weaker smoothness assumption on the mean-reward function in comparison with the existing anytime algorithms. Finally, we discuss how HCT can be applied to the problem of policy search in reinforcement learning and we report preliminary empirical results.

ICML Conference 2014 Conference Paper

PAC-inspired Option Discovery in Lifelong Reinforcement Learning

  • Emma Brunskill
  • Lihong Li 0001

A key goal of AI is to create lifelong learning agents that can leverage prior experience to improve performance on later tasks. In reinforcement-learning problems, one way to summarize prior experience for future use is through options, which are temporally extended actions (subpolicies) for how to behave. Options can then be used to potentially accelerate learning in new reinforcement learning tasks. In this work, we provide the first formal analysis of the sample complexity, a measure of learning speed, of reinforcement learning with options. This analysis helps shed light on some interesting prior empirical results on when and how options may accelerate learning. We then quantify the benefit of options in reducing sample complexity of a lifelong learning agent. Finally, the new theoretical insights inspire a novel option-discovery algorithm that aims at minimizing overall sample complexity in lifelong reinforcement learning.

UAI Conference 2013 Conference Paper

Sample Complexity of Multi-task Reinforcement Learning

  • Emma Brunskill
  • Lihong Li 0001

Transferring knowledge across a sequence of reinforcement-learning tasks is challenging, and has a number of important applications. Though there is encouraging empirical evidence that transfer can improve performance in subsequent reinforcement-learning tasks, there has been very little theoretical analysis. In this paper, we introduce a new multi-task algorithm for a sequence of reinforcement-learning tasks when each task is sampled independently from (an unknown) distribution over a finite set of Markov decision processes whose parameters are initially unknown. For this setting, we prove under certain assumptions that the per-task sample complexity of exploration is reduced significantly due to transfer compared to standard single-task algorithms. Our multi-task algorithm also has the desired characteristic that it is guaranteed not to exhibit negative transfer: in the worst case its per-task sample complexity is comparable to the corresponding single-task algorithm.

RLDM Conference 2013 Conference Abstract

Sample Complexity of Multi-task Reinforcement Learning

  • Emma Brunskill
  • Lihong Li

A key aspect of human intelligence is our ability to leverage prior experience to improve our learning in future related tasks. Often these tasks themselves involve reinforcement learning, and an important goal in artificial intelligence is to create autonomous agents that perform better as they do a series of similar RL tasks. Though there is encouraging empirical evidence that leveraging past knowledge can improve agent performance in subsequent reinforcement learning tasks, there has been very little theoretical analysis. Towards addressing this gap, we introduce a new algorithm for acting in a sequence of reinforcement learning tasks when each task is sampled from (an unknown) distribution over a finite set of (unknown) Markov decision processes. In this setting, we prove under certain assumptions that the per-task sample complexity, the number of samples on which the agent may perform suboptimally, decreases significantly due to leveraging prior learned knowledge compared to standard single-task algorithms. Our multi-task RL algorithm also has the desired characteristic that it is guaranteed not to exhibit negative transfer: up to log factors, its per-task sample complexity is never worse than the corresponding single-task algorithm.

NeurIPS Conference 2013 Conference Paper

Sequential Transfer in Multi-armed Bandit with Finite Set of Models

  • Mohammad Gheshlaghi Azar
  • Alessandro Lazaric
  • Emma Brunskill

Learning from prior tasks and transferring that experience to improve future performance is critical for building lifelong learning agents. Although results in supervised and reinforcement learning show that transfer may significantly improve the learning performance, most of the literature on transfer is focused on batch learning tasks. In this paper we study the problem of sequential transfer in online learning, notably in the multi-armed bandit framework, where the objective is to minimize the cumulative regret over a sequence of tasks by incrementally transferring knowledge from prior tasks. We introduce a novel bandit algorithm based on a method-of-moments approach for the estimation of the possible tasks and derive regret bounds for it.

AAMAS Conference 2012 Conference Paper

Bayes-Optimal Reinforcement Learning for Discrete Uncertainty Domains

  • Emma Brunskill

An important subclass of reinforcement learning problems are those that exhibit only discrete uncertainty: the agent's environment is known to be sampled from a finite set of possible worlds. In contrast to generic reinforcement learning problems, it is possible to efficiently compute the Bayes-optimal policy for many discrete uncertainty RL domains. We demonstrate empirically that the Bayes-optimal policy can result in substantially and significantly improved performance relative to a state-of-the-art probably approximately correct RL algorithm. Our second contribution is to bound the error of using slightly noisy estimates of the discrete set of possible Markov decision process parameters during learning. We suggest that this is an important and probable situation, given that such models will often be constructed from finite sets of noisy, real-world data. We demonstrate good empirical performance on a simulated machine repair problem when using noisy parameter estimates.

UAI Conference 2012 Conference Paper

Incentive Decision Processes

  • Sashank J. Reddi
  • Emma Brunskill

We consider Incentive Decision Processes, where a principal seeks to reduce its costs due to another agent's behavior, by offering incentives to the agent for alternate behavior. We focus on the case where a principal interacts with a greedy agent whose preferences are hidden and static. Though IDPs can be directly modeled as partially observable Markov decision processes (POMDPs), we show that it is possible to directly reduce or approximate the IDP as a polynomially-sized MDP: when this representation is approximate, we prove the resulting policy is boundedly-optimal for the original IDP. Our empirical simulations demonstrate the performance benefit of our algorithms over simpler approaches, and also demonstrate that our approximate representation results in a significantly faster algorithm whose performance is extremely close to the optimal policy for the original IDP.

UAI Conference 2010 Conference Paper

RAPID: A Reachable Anytime Planner for Imprecisely-sensed Domains

  • Emma Brunskill
  • Stuart Russell 0001

Despite the intractability of generic optimal partially observable Markov decision process planning, there exist important problems that have highly structured models. Previous researchers have used this insight to construct more efficient algorithms for factored domains, and for domains with topological structure in the flat state dynamics model. In our work, motivated by findings from the education community relevant to automated tutoring, we consider problems that exhibit a form of topological structure in the factored dynamics model. Our Reachable Anytime Planner for Imprecisely-sensed Domains (RAPID) leverages this structure to efficiently compute a good initial envelope of reachable states under the optimal MDP policy in time linear in the number of state variables. RAPID performs partially-observable planning over the limited envelope of states, and slowly expands the state space considered as time allows. RAPID performs well on a large tutoring-inspired problem simulation with 122 state variables, corresponding to a flat state space of over $10^{30}$ states.

ICAPS Conference 2010 Conference Paper

When Policies Can Be Trusted: Analyzing a Criteria to Identify Optimal Policies in MDPs with Unknown Model Parameters

  • Emma Brunskill

Computing a good policy in stochastic uncertain environments with unknown dynamics and reward model parameters is a challenging task. In a number of domains, ranging from space robotics to epilepsy management, it may be possible to have an initial training period when suboptimal performance is permitted. For such problems it is important to be able to identify when this training period is complete, and the computed policy can be used with high confidence in its future performance. A simple principled criterion for identifying when training is complete is when the error bounds on the value estimates of the current policy are sufficiently small that the optimal policy is fixed, with high probability. We present an upper bound on the amount of training data required to identify the optimal policy as a function of the unknown separation gap between the optimal and the next-best policy values. We illustrate with several small problems that by estimating this gap in an online manner, the number of training samples to provably reach optimality can be significantly lower than predicted offline using a Probably Approximately Correct framework that requires an input epsilon parameter.

JMLR Journal 2009 Journal Article

Provably Efficient Learning with Typed Parametric Models

  • Emma Brunskill
  • Bethany R. Leffler
  • Lihong Li
  • Michael L. Littman
  • Nicholas Roy

To quickly achieve good performance, reinforcement-learning algorithms for acting in large continuous-valued domains must use a representation that is both sufficiently powerful to capture important domain characteristics, and yet simultaneously allows generalization, or sharing, among experiences. Our algorithm balances this tradeoff by using a stochastic, switching, parametric dynamics representation. We argue that this model characterizes a number of significant, real-world domains, such as robot navigation across varying terrain. We prove that this representational assumption allows our algorithm to be probably approximately correct with a sample complexity that scales polynomially with all problem-specific quantities including the state-space dimension. We also explicitly incorporate the error introduced by approximate planning in our sample complexity bounds, in contrast to prior Probably Approximately Correct (PAC) Markov Decision Processes (MDP) approaches, which typically assume the estimated MDP can be solved exactly. Our experimental results on constructing plans for driving to work using real car trajectory data, as well as a small robot experiment on navigating varying terrain, demonstrate that our dynamics representation enables us to capture real-world dynamics in a sufficient manner to produce good performance.

ICRA Conference 2009 Conference Paper

Where to go: Interpreting natural directions using global inference

  • Yuan Wei
  • Emma Brunskill
  • Thomas Kollar
  • Nicholas Roy

An important component of human-robot interaction is that people need to be able to instruct robots to move to other locations using naturally given directions. When giving directions, people often make mistakes such as labelling errors (e.g., left vs. right) and errors of omission (skipping important decision points in a sequence). Furthermore, people often use multiple levels of granularity in specifying directions, referring to locations using single object landmarks, multiple landmarks in a given location, or identifying large regions as a single location. The challenge is to identify the correct path to a destination from a sequence of noisy, possibly erroneous directions. In our work we cast this problem as probabilistic inference: given a set of directions, an agent should automatically find the path with the geometry and physical appearance to maximize the likelihood of those directions. We use a specific variant of a Markov Random Field (MRF) to represent our model, and gather multi-granularity representation information using existing large tagged datasets. On a dataset of route directions collected in a large third-floor university building, we found that our algorithm correctly inferred the true final destination in 47 out of the 55 cases successfully followed by human volunteers. These results suggest that our algorithm is performing well relative to human users. In the future this work will be included in a broader system for autonomously constructing environmental representations that support natural human-robot interaction for direction giving.

UAI Conference 2008 Conference Paper

CORL: A Continuous-state Offset-dynamics Reinforcement Learner

  • Emma Brunskill
  • Bethany R. Leffler
  • Lihong Li 0001
  • Michael L. Littman
  • Nicholas Roy

Continuous state spaces and stochastic, switching dynamics characterize a number of rich, real-world domains, such as robot navigation across varying terrain. We describe a reinforcement-learning algorithm for learning in these domains and prove for certain environments the algorithm is probably approximately correct with a sample complexity that scales polynomially with the state-space dimension. Unfortunately, no optimal planning techniques exist in general for such problems; instead we use fitted value iteration to solve the learned MDP, and include the error due to approximate planning in our bounds. Finally, we report an experiment using a robotic car driving over varying terrain to demonstrate that these dynamics representations adequately capture real-world dynamics and that our algorithm can be used to efficiently solve such problems.

IROS Conference 2007 Conference Paper

Collision detection in legged locomotion using supervised learning

  • Finale Doshi
  • Emma Brunskill
  • Alexander C. Shkolnik
  • Thomas Kollar
  • Khashayar Rohanimanesh
  • Russ Tedrake
  • Nicholas Roy

We propose a fast approach for detecting collision-free swing-foot trajectories for legged locomotion over extreme terrains. Instead of simulating the swing trajectories and checking for collisions along them, our approach uses machine learning techniques to predict whether a swing trajectory is collision-free. Using a set of local terrain features, we apply supervised learning to train a classifier to predict collisions. Both in simulation and on a real quadruped platform, our results show that our classifiers can improve the accuracy of collision detection compared to a real-time geometric approach without significantly increasing the computation time.
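
The supervised setup can be sketched with generic tools: label candidate swing trajectories as colliding or not (e.g., from simulation), featurize the local terrain, and fit a classifier that replaces the slower geometric check at runtime. Everything below (the features, the learner, the synthetic data) is a stand-in, not the paper's pipeline:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def terrain_features(patch):
        # Hypothetical local features of a heightmap patch under the swing path.
        return np.array([patch.mean(), patch.std(), patch.max() - patch.min()])

    rng = np.random.default_rng(0)
    patches = [rng.normal(size=(8, 8)) for _ in range(200)]   # stand-in terrain patches
    labels = rng.integers(0, 2, size=200)                     # stand-in collision labels

    X = np.stack([terrain_features(p) for p in patches])
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
    # At planning time, clf.predict on a candidate trajectory's features
    # replaces the slower geometric collision check.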

AAAI Conference 2007 Short Paper

Continuous State POMDPs for Object Manipulation Tasks

  • Emma Brunskill

My research focus is on using continuous state partially observable Markov decision processes (POMDPs) to perform object manipulation tasks using a robotic arm. During object manipulation, object dynamics can be extremely complex, non-linear and challenging to specify. To avoid modeling the full complexity of possible dynamics, I instead use a model which switches between a discrete number of simple dynamics models. By learning these models and extending Porta’s continuous state POMDP framework (Porta et al. 2006) to incorporate this switching dynamics model, we hope to handle tasks that involve absolute and relative dynamics within a single framework. This dynamics model may be applicable not only to object manipulation tasks, but also to a number of other problems, such as robot navigation. By using an explicit model of uncertainty, I hope to create solutions to object manipulation tasks that more robustly handle the noisy sensory information received by physical robots.

IROS Conference 2007 Conference Paper

Topological mapping using spectral clustering and classification

  • Emma Brunskill
  • Thomas Kollar
  • Nicholas Roy

In this work we present an online method for generating topological maps from raw sensor information. We first describe an algorithm to automatically decompose a map into submap segments using a graph partitioning technique known as spectral clustering. We then describe how to train a classifier to recognize graph submaps from laser signatures using the AdaBoost machine learning algorithm. We demonstrate that we can perform topological mapping by incrementally segmenting the world as the robot moves through its environment, and that we can close the loop when the learned classifier recognizes that the robot has returned to a previously visited location.
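
The map-decomposition step can be illustrated with an off-the-shelf spectral clustering call on an affinity matrix over map locations; the affinity construction and the subsequent AdaBoost recognition stage are not shown, and the random matrix here is only a placeholder:

    import numpy as np
    from sklearn.cluster import SpectralClustering

    rng = np.random.default_rng(0)
    A = rng.random((30, 30))           # placeholder affinities between 30 map nodes
    A = (A + A.T) / 2                  # symmetrize
    np.fill_diagonal(A, 0)

    labels = SpectralClustering(n_clusters=4, affinity="precomputed",
                                random_state=0).fit_predict(A)
    # labels[i] identifies the submap segment assigned to map node i.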

ICRA Conference 2005 Conference Paper

SLAM using Incremental Probabilistic PCA and Dimensionality Reduction

  • Emma Brunskill
  • Nicholas Roy

The recent progress in robot mapping (or SLAM) algorithms has focused on estimating either point features (such as landmarks) or grid-based representations. Both of these representations generally scale with the size of the environment, not the complexity of the environment. Many thousand parameters may be required even when the structure of the environment can be represented using a few geometric primitives with many fewer parameters. We describe a novel SLAM model called IPSLAM. Our algorithm clusters sensor data into line segments using the Probabilistic PCA algorithm, which provides a data likelihood model that can be used within a SLAM algorithm for the simultaneous estimation of map and robot pose parameters. Unlike previous work in extracting line-based representations from point-based maps, IPSLAM builds non-point-based maps directly from the sensor data. We demonstrate our algorithm on mapping part of the MIT Stata Centre.