Arrow Research search

Author name cluster

Doina Precup

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

197 papers
2 author rows

Possible papers (197)

AAAI Conference 2026 Conference Paper

Bootstrapping Personalized Insulin Therapy via Model-Based Reinforcement Learning: An In Silico Study

  • Sumana Basu
  • Flemming Kondrup
  • Adriana Romero-Soriano
  • Doina Precup

Personalized insulin therapy for individuals with Type 1 Diabetes via closed‑loop artificial pancreas systems requires rapid adaptation of dosing strategies to each patient's unique insulin response. However, learning patient‑specific policies from scratch demands extensive exploration, which is often impractical. In this work, we study a framework that integrates insulin-response-informed transfer learning with model-based reinforcement learning for insulin dosing. We first train an LSTM‑based insulin responsiveness predictor on virtual patients, using their glucose, insulin, and meal history to forecast future glucose levels. Analysis of insulin responsiveness of in-silico patients uncovers natural insulin‑response groups characterized by similar sensitivity and dynamics profiles. For a new patient, we identify a representative model from their response group and use it to generate synthetic trajectories. These trajectories are integrated into an enhanced H-step Deep Dyna-Q algorithm, enabling accelerated policy optimization through model-based planning. The dynamics model trained entirely in simulation achieves 91.31% accuracy in predicting blood glucose ranges on the Ohio Type 1 Diabetes dataset, indicating strong zero-shot generalization. Additionally, we find that bootstrapping a new patient with a physiologically-matched reference model accelerates convergence of effective dosing policies across in-silico cohorts of children, adolescents, and adults. These findings suggest that leveraging response-group-specific synthetic experience can expedite personalized insulin therapy, offering a promising pathway towards clinical validation.
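
For readers unfamiliar with the Dyna-Q family the abstract builds on, the sketch below is the classic tabular Dyna-Q loop (real-experience update, model update, simulated planning updates). It is only the textbook skeleton, not the paper's enhanced H-step Deep Dyna-Q, and all names and hyperparameters in it are illustrative.

```python
import random
from collections import defaultdict

# Classic tabular Dyna-Q: each real transition updates Q and a learned model, and the
# model then replays simulated transitions for extra planning updates.
alpha, gamma, n_planning = 0.1, 0.99, 10
Q = defaultdict(float)        # Q[(state, action)]
model = {}                    # model[(state, action)] = (reward, next_state)

def q_update(s, a, r, s_next, actions):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def dyna_q_step(s, a, r, s_next, actions):
    q_update(s, a, r, s_next, actions)                 # learn from real experience
    model[(s, a)] = (r, s_next)                        # update the learned model
    for _ in range(n_planning):                        # plan with simulated experience
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        q_update(ps, pa, pr, ps_next, actions)

# Toy usage with integer states and two actions.
dyna_q_step(s=0, a=1, r=0.5, s_next=1, actions=(0, 1))
```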

NeurIPS Conference 2025 Conference Paper

Capturing Individual Human Preferences with Reward Features

  • Andre Barreto
  • Vincent Dumoulin
  • Yiran Mao
  • Mark Rowland
  • Nicolas Perez-Nieves
  • Bobak Shahriari
  • Yann Dauphin
  • Doina Precup

Reinforcement learning from human feedback usually models preferences using a reward function that does not distinguish between people. We argue that this is unlikely to be a good design choice in contexts with high potential for disagreement, like in the training of large language models. We formalise and analyse the problem of learning a reward model that can be specialised to a user. Using the principle of empirical risk minimisation, we derive a probably approximately correct (PAC) bound showing the dependency of the approximation error on the number of training examples, as usual, and also on the number of human raters who provided feedback on them. Based on our theoretical findings, we discuss how to best collect pairwise preference data and argue that adaptive reward models should be beneficial when there is considerable disagreement among users. We also propose a concrete architecture for an adaptive reward model. Our approach leverages the observation that individual preferences can be captured as a linear combination of a set of general reward features. We show how to learn such features and subsequently use them to quickly adapt the reward model to a specific individual, even if their preferences are not reflected in the training data. We present experiments with large language models illustrating our theoretical results and comparing the proposed architecture with a non-adaptive baseline. Consistent with our analysis, the benefits provided by our model increase with the number of raters and the heterogeneity of their preferences. We also show that our model compares favourably to adaptive counterparts, including those performing in-context personalisation.
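
To make the adaptation step concrete, here is a minimal sketch of fitting per-user weights over a fixed set of reward features with a Bradley-Terry style logistic loss. The feature extractor, data shapes, and optimizer settings are assumptions for illustration, not the architecture proposed in the paper.

```python
import torch

def adapt_user_weights(phi_chosen, phi_rejected, steps=200, lr=0.1):
    """Fit per-user weights w so that r(x) = w @ phi(x) prefers the chosen responses.

    phi_chosen, phi_rejected: (n_pairs, d) tensors of precomputed reward features
    from some frozen feature extractor (a hypothetical stand-in; the paper learns
    the features themselves)."""
    d = phi_chosen.shape[1]
    w = torch.zeros(d, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        # Bradley-Terry style logistic loss on the reward difference of each pair.
        margin = (phi_chosen - phi_rejected) @ w
        loss = torch.nn.functional.softplus(-margin).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()

# Toy usage: adapt to a user who provided two preference pairs over 8-dim features.
w_user = adapt_user_weights(torch.randn(2, 8), torch.randn(2, 8))
```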

TMLR Journal 2025 Journal Article

Incorporating Spatial Information into Goal-Conditioned Hierarchical Reinforcement Learning via Graph Representations

  • Shuyuan Zhang
  • Zihan Wang
  • Xiao-Wen Chang
  • Doina Precup

The integration of graphs with Goal-conditioned Hierarchical Reinforcement Learning (GCHRL) has recently gained attention, as intermediate goals (subgoals) can be effectively sampled from graphs that naturally represent the overall task structure in most RL tasks. However, existing approaches typically rely on domain-specific knowledge to construct these graphs, limiting their applicability to new tasks. Other graph-based approaches create graphs dynamically during exploration but struggle to fully utilize them, because they have difficulty passing the information in the graphs on to newly visited states. Additionally, current GCHRL methods face challenges such as sample inefficiency and poor subgoal representation. This paper proposes a solution to these issues by developing a graph encoder-decoder to evaluate unseen states. Our proposed method, Graph-Guided sub-Goal representation Generation RL (G4RL), can be incorporated into any existing GCHRL method when operating in environments with primarily symmetric and reversible transitions, enhancing performance across this class of problems. We show that the graph encoder-decoder can be effectively implemented using a network trained on the state graph generated during exploration. Empirical results indicate that leveraging high- and low-level intrinsic rewards from the graph encoder-decoder significantly enhances the performance of state-of-the-art GCHRL approaches, at a small additional computational cost, in both dense and sparse reward environments.

ICLR Conference 2025 Conference Paper

Langevin Soft Actor-Critic: Efficient Exploration through Uncertainty-Driven Critic Learning

  • Haque Ishfaq
  • Guangyuan Wang
  • Sami Nur Islam
  • Doina Precup

Existing actor-critic algorithms, which are popular for continuous control reinforcement learning (RL) tasks, suffer from poor sample efficiency due to the lack of a principled exploration mechanism. Motivated by the success of Thompson sampling for efficient exploration in RL, we propose a novel model-free RL algorithm, \emph{Langevin Soft Actor Critic} (LSAC), which prioritizes enhancing critic learning through uncertainty estimation over policy optimization. LSAC employs three key innovations: approximate Thompson sampling through distributional Langevin Monte Carlo (LMC) based $Q$ updates, parallel tempering for exploring multiple modes of the posterior of the $Q$ function, and diffusion synthesized state-action samples regularized with $Q$ action gradients. Our extensive experiments demonstrate that LSAC outperforms or matches the performance of mainstream model-free RL algorithms for continuous control tasks. Notably, LSAC marks the first successful application of LMC-based Thompson sampling in continuous control tasks with continuous action spaces.
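
A minimal sketch of the Langevin Monte Carlo style update the abstract refers to: ordinary gradient descent on the critic loss with injected Gaussian noise scaled by the step size, so repeated updates approximately sample from a posterior over Q-network parameters rather than converging to a point estimate. It omits the distributional critic, parallel tempering, and diffusion-based components that LSAC adds; the names and hyperparameters below are illustrative.

```python
import math
import torch

def lmc_step(q_net, loss, lr=1e-3, temperature=1e-4):
    """One Langevin Monte Carlo step on the critic: gradient descent plus Gaussian
    noise scaled by the step size, turning optimization into approximate sampling."""
    grads = torch.autograd.grad(loss, list(q_net.parameters()))
    noise_scale = math.sqrt(2.0 * lr * temperature)
    with torch.no_grad():
        for p, g in zip(q_net.parameters(), grads):
            p.add_(-lr * g + noise_scale * torch.randn_like(p))

# Toy usage: a small critic and a stand-in TD-style regression loss.
q = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
state_action = torch.randn(32, 4)     # stand-in for a concatenated state-action batch
td_target = torch.randn(32, 1)        # stand-in for bootstrapped TD targets
lmc_step(q, torch.nn.functional.mse_loss(q(state_action), td_target))
```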

ICLR Conference 2025 Conference Paper

MaestroMotif: Skill Design from Artificial Intelligence Feedback

  • Martin Klissarov
  • Mikael Henaff
  • Roberta Raileanu
  • Shagun Sodhani
  • Pascal Vincent
  • Amy Zhang 0001
  • Pierre-Luc Bacon
  • Doina Precup

Describing skills in natural language has the potential to provide an accessible way to inject human knowledge about decision-making into an AI system. We present MaestroMotif, a method for AI-assisted skill design, which yields high-performing and adaptable agents. MaestroMotif leverages the capabilities of Large Language Models (LLMs) to effectively create and reuse skills. It first uses an LLM's feedback to automatically design rewards corresponding to each skill, starting from their natural language description. Then, it employs an LLM's code generation abilities, together with reinforcement learning, for training the skills and combining them to implement complex behaviors specified in language. We evaluate MaestroMotif using a suite of complex tasks in the NetHack Learning Environment (NLE), demonstrating that it surpasses existing approaches in both performance and usability.

NeurIPS Conference 2025 Conference Paper

Plasticity as the Mirror of Empowerment

  • David Abel
  • Michael Bowling
  • Andre Barreto
  • Will Dabney
  • Shi Dong
  • Steven Hansen
  • Anna Harutyunyan
  • Khimya Khetarpal

Agents are minimally entities that are influenced by their past observations and act to influence future observations. This latter capacity is captured by empowerment, which has served as a vital framing concept across artificial intelligence and cognitive science. This former capacity, however, is equally foundational: In what ways, and to what extent, can an agent be influenced by what it observes? In this paper, we ground this concept in a universal agent-centric measure that we refer to as plasticity, and reveal a fundamental connection to empowerment. Following a set of desiderata on a suitable definition, we define plasticity using a new information-theoretic quantity we call the generalized directed information. We show that this new quantity strictly generalizes the directed information introduced by Massey (1990) while preserving all of its desirable properties. Under this definition, we find that plasticity is well thought of as the mirror of empowerment: The two concepts are defined using the same measure, with only the direction of influence reversed. Our main result establishes a tension between the plasticity and empowerment of an agent, suggesting that agent design needs to be mindful of both characteristics. We explore the implications of these findings, and suggest that plasticity, empowerment, and their relationship are essential to understanding agency.

ICML Conference 2025 Conference Paper

Rejecting Hallucinated State Targets during Planning

  • Mingde Zhao 0001
  • Tristan Sylvain
  • Romain Laroche
  • Doina Precup
  • Yoshua Bengio

In planning processes of computational decision-making agents, generative or predictive models are often used as "generators" to propose "targets" representing sets of expected or desirable states. Unfortunately, learned models inevitably hallucinate infeasible targets that can cause delusional behaviors and safety concerns. We first investigate the kinds of infeasible targets that generators can hallucinate. Then, we devise a strategy to identify and reject infeasible targets by learning a target feasibility evaluator. To ensure that the evaluator is robust and non-delusional, we adopt a design combining an off-policy-compatible learning rule, a distributional architecture, and data augmentation based on hindsight relabeling. Attached to a planning agent, the evaluator learns by observing the agent’s interactions with the environment and the targets produced by its generator, without the need to change the agent or its generator. Our controlled experiments show significant reductions in delusional behaviors and performance improvements for various kinds of existing agents.

ICLR Conference 2025 Conference Paper

Selective Unlearning via Representation Erasure Using Domain Adversarial Training

  • Nazanin Mohammadi Sepahvand
  • Eleni Triantafillou
  • Hugo Larochelle
  • Doina Precup
  • James J. Clark
  • Daniel M. Roy 0001
  • Gintare Karolina Dziugaite

When deploying machine learning models in the real world, we often face the challenge of “unlearning” specific data points or subsets after training. Inspired by Domain-Adversarial Training of Neural Networks (DANN), we propose a novel algorithm, SURE, for targeted unlearning. SURE treats the process as a domain adaptation problem, where the “forget set” (data to be removed) and a validation set from the same distribution form two distinct domains. We train a domain classifier to discriminate between representations from the forget and validation sets. Using a gradient reversal strategy similar to DANN, we perform gradient updates to the representations to “fool” the domain classifier and thus obfuscate representations belonging to the forget set. Simultaneously, gradient descent is applied to the retain set (original training data minus the forget set) to preserve its classification performance. Unlike other unlearning approaches whose training objectives are built on model outputs, SURE directly manipulates the representations. This is key to ensuring robustness against a set of attacks more powerful than those currently considered in the literature, which aim to detect which examples were unlearned through access to learned embeddings. Our thorough experiments reveal that SURE achieves a better unlearning-quality-to-utility trade-off than other standard unlearning techniques for deep neural networks.
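
The gradient reversal strategy borrowed from DANN can be sketched as follows: an autograd function that is the identity in the forward pass and flips the gradient sign in the backward pass, so the encoder feeding the domain classifier is updated to fool it. This is a generic DANN-style sketch with hypothetical module names, not SURE's actual implementation.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) gradients in the backward pass,
    so the encoder feeding the domain classifier is updated to fool it."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Toy usage: a domain classifier (forget set vs. validation set) on reversed features.
encoder = torch.nn.Linear(16, 8)
domain_clf = torch.nn.Linear(8, 2)
features = encoder(torch.randn(4, 16))
logits = domain_clf(grad_reverse(features))
loss = torch.nn.functional.cross_entropy(logits, torch.tensor([0, 1, 0, 1]))
loss.backward()   # the encoder's gradients arrive sign-flipped
```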

AAMAS Conference 2025 Conference Paper

Soft Condorcet Optimization for Ranking of General Agents

  • Marc Lanctot
  • Kate Larson
  • Michael Kaisers
  • Quentin Berthet
  • Ian Gemp
  • Manfred Diaz
  • Roberto-Rafael Maura-Rivero
  • Yoram Bachrach

Driving progress of AI models and agents requires comparing their performance on standardized benchmarks; for general agents, individual performances must be aggregated across a potentially wide variety of different tasks. In this paper, we describe a novel ranking scheme inspired by social choice frameworks, called Soft Condorcet Optimization (SCO), to compute the optimal ranking of agents: the one that makes the fewest mistakes in predicting the agent comparisons in the evaluation data. This optimal ranking is the maximum likelihood estimate when evaluation data (which we view as votes) are interpreted as noisy samples from a ground truth ranking, a solution to Condorcet’s original voting system criteria. SCO ratings are maximal for Condorcet winners when they exist, which we show is not necessarily true for the classical rating system Elo. We propose three optimization algorithms to compute SCO ratings and evaluate their empirical performance. When serving as an approximation to the Kemeny-Young voting method, SCO rankings are on average 0 to 0.043 away from the optimal ranking in normalized Kendall-tau distance across 865 preference profiles from the PrefLib open ranking archive. In a simulated noisy tournament setting, SCO achieves accurate approximations to the ground truth ranking and the best among several baselines when 59% or more of the preference data is missing. Finally, SCO ranking provides the best approximation to the optimal ranking, measured on held-out test sets, in a problem containing 52,958 human players across 31,049 games of the classic seven-player game of Diplomacy.
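
One plausible reading of the "fewest mistakes" objective, sketched below under stated assumptions (a sigmoid-softened count of wrongly predicted pairwise comparisons, minimized by gradient descent on scalar ratings); the paper's exact loss and its three optimization algorithms may differ.

```python
import torch

def soft_pairwise_loss(ratings, winners, losers, temperature=1.0):
    """Softened count of pairwise prediction mistakes: each vote saying agent i beat
    agent j contributes sigmoid((r_j - r_i) / temperature), close to 1 when the
    ratings get that comparison wrong and close to 0 when they get it right."""
    return torch.sigmoid((ratings[losers] - ratings[winners]) / temperature).sum()

# Toy usage: three agents; the votes say agent 0 beat agent 1 twice, agent 1 beat agent 2 once.
ratings = torch.zeros(3, requires_grad=True)
winners, losers = torch.tensor([0, 0, 1]), torch.tensor([1, 1, 2])
opt = torch.optim.SGD([ratings], lr=0.5)
for _ in range(200):
    opt.zero_grad()
    soft_pairwise_loss(ratings, winners, losers).backward()
    opt.step()
print(ratings.argsort(descending=True))   # ratings should now rank agent 0 first
```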

ICLR Conference 2025 Conference Paper

Training Language Models to Self-Correct via Reinforcement Learning

  • Aviral Kumar
  • Vincent Zhuang
  • Rishabh Agarwal
  • Yi Su
  • John D. Co-Reyes
  • Avi Singh
  • Kate Baumli
  • Shariq Iqbal

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model's own responses or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction strategy that is effective at test time as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.

NeurIPS Conference 2025 Conference Paper

Uncovering a Universal Abstract Algorithm for Modular Addition in Neural Networks

  • Gavin McCracken
  • Gabriela Moisescu-Pareja
  • Vincent Létourneau
  • Doina Precup
  • Jonathan Love

We propose a testable universality hypothesis, asserting that seemingly disparate neural network solutions observed in the simple task of modular addition actually reflect a common abstract algorithm. While prior work interpreted variations in neuron-level representations as evidence for distinct algorithms, we demonstrate---through multi-level analyses spanning neurons, neuron clusters, and entire networks---that multilayer perceptrons and transformers universally implement the abstract algorithm we call the approximate Chinese Remainder Theorem. Crucially, we introduce approximate cosets and show that neurons activate exclusively on them. Furthermore, our theory works for deep neural networks (DNNs). It predicts that universally learned solutions in DNNs with trainable embeddings or more than one hidden layer require only $\mathcal{O}(\log n)$ features, a result we empirically confirm. This work thus provides the first theory‑backed interpretation of \textit{multilayer} networks solving modular addition. It advances generalizable interpretability and opens a testable universality hypothesis for group multiplication beyond modular addition.
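
For reference, the exact Chinese Remainder Theorem that the "approximate CRT" above generalizes: a residue modulo n can be reconstructed from its residues modulo pairwise coprime factors of n. A small worked example for modular addition, separate from the paper's analysis:

```python
from math import prod

def crt(residues, moduli):
    """Reconstruct x mod prod(moduli) from x mod each (pairwise coprime) modulus."""
    n = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        n_m = n // m
        x += r * n_m * pow(n_m, -1, m)   # pow(..., -1, m): modular inverse (Python 3.8+)
    return x % n

# Modular addition via residues: compute (17 + 25) mod 15 using 15 = 3 * 5.
a, b, moduli = 17, 25, (3, 5)
residues = [(a + b) % m for m in moduli]
assert crt(residues, moduli) == (a + b) % 15
```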

RLC Conference 2025 Conference Paper

Understanding Behavioral Metric Learning: A Large-Scale Study on Distracting Reinforcement Learning Environments

  • Ziyan Luo
  • Tianwei Ni
  • Pierre-Luc Bacon
  • Doina Precup
  • Xujie Si

A key approach to state abstraction is approximating behavioral metrics (notably, bisimulation metrics) in the observation space and embedding these learned distances in the representation space. While promising for robustness to task-irrelevant noise, as shown in prior work, accurately estimating these metrics remains challenging, requiring various design choices that create gaps between theory and practice. Prior evaluations focus mainly on final returns, leaving the quality of learned metrics and the source of performance gains unclear. To systematically assess how metric learning works in deep reinforcement learning (RL), we evaluate five recent approaches, unified conceptually as isometric embeddings with varying design choices. We benchmark them with baselines across 20 state-based and 14 pixel-based tasks, spanning 370 task configurations with diverse noise settings. Beyond final returns, we introduce the evaluation of a denoising factor to quantify the encoder's ability to filter distractions. To further isolate the effect of metric learning, we propose and evaluate an isolated metric estimation setting, in which the encoder is influenced solely by the metric loss. Finally, we release an open-source, modular codebase to improve reproducibility and support future research on metric learning in deep RL.

RLJ Journal 2025 Journal Article

Understanding Behavioral Metric Learning: A Large-Scale Study on Distracting Reinforcement Learning Environments

  • Ziyan Luo
  • Tianwei Ni
  • Pierre-Luc Bacon
  • Doina Precup
  • Xujie Si

A key approach to state abstraction is approximating behavioral metrics (notably, bisimulation metrics) in the observation space and embedding these learned distances in the representation space. While promising for robustness to task-irrelevant noise, as shown in prior work, accurately estimating these metrics remains challenging, requiring various design choices that create gaps between theory and practice. Prior evaluations focus mainly on final returns, leaving the quality of learned metrics and the source of performance gains unclear. To systematically assess how metric learning works in deep reinforcement learning (RL), we evaluate five recent approaches, unified conceptually as isometric embeddings with varying design choices. We benchmark them with baselines across 20 state-based and 14 pixel-based tasks, spanning 370 task configurations with diverse noise settings. Beyond final returns, we introduce the evaluation of a denoising factor to quantify the encoder's ability to filter distractions. To further isolate the effect of metric learning, we propose and evaluate an isolated metric estimation setting, in which the encoder is influenced solely by the metric loss. Finally, we release an open-source, modular codebase to improve reproducibility and support future research on metric learning in deep RL.

EWRL Workshop 2024 Workshop Paper

A Look at Value-Based Decision-Time vs. Background Planning Methods Across Different Settings

  • Safa Alver
  • Doina Precup

In model-based reinforcement learning (RL), an agent can leverage a learned model to improve its way of behaving in different ways. Two of the prevalent ways to do this are through decision-time and background planning methods. In this study, we are interested in understanding how the value-based versions of these two planning methods will compare against each other across different settings. Towards this goal, we first consider the simplest instantiations of value-based decision-time and background planning methods and provide theoretical results on which one will perform better in the regular RL and transfer learning settings. Then, we consider the modern instantiations of them and provide hypotheses on which one will perform better in the same settings. Finally, we perform illustrative experiments to validate these theoretical results and hypotheses. Overall, our findings suggest that even though value-based versions of the two planning methods perform on par in their simplest instantiations, the modern instantiations of value-based decision-time planning methods can perform on par or better than the modern instantiations of value-based background planning methods in both the regular RL and transfer learning settings.

EWRL Workshop 2024 Workshop Paper

Adaptive Exploration for Data-Efficient General Value Function Evaluations

  • Arushi Jain
  • Josiah P. Hanna
  • Doina Precup

General Value Functions (GVFs) (Sutton et al., 2011) represent predictive knowledge in reinforcement learning. Each GVF computes the expected return for a given policy, based on a unique reward. Existing methods relying on fixed behavior policies or pre-collected data often face data efficiency issues when learning multiple GVFs in parallel using off-policy methods. To address this, we introduce GVFExplorer, which adaptively learns a single behavior policy that efficiently collects data for evaluating multiple GVFs in parallel. Our method optimizes the behavior policy by minimizing the total variance in return across GVFs, thereby reducing the required environmental interactions. We use an existing temporal-difference-style variance estimator to approximate the return variance. We prove that each behavior policy update decreases the overall mean squared error in GVF predictions. We empirically show our method's performance in tabular and nonlinear function approximation settings, including Mujoco environments, with stationary and non-stationary reward signals, optimizing data usage and reducing prediction errors across multiple GVFs.

NeurIPS Conference 2024 Conference Paper

Adaptive Exploration for Data-Efficient General Value Function Evaluations

  • Arushi Jain
  • Josiah P. Hanna
  • Doina Precup

General Value Functions (GVFs) (Sutton et al., 2011) represent predictive knowledge in reinforcement learning. Each GVF computes the expected return for a given policy, based on a unique reward. Existing methods relying on fixed behavior policies or pre-collected data often face data efficiency issues when learning multiple GVFs in parallel using off-policy methods. To address this, we introduce GVFExplorer, which adaptively learns a single behavior policy that efficiently collects data for evaluating multiple GVFs in parallel. Our method optimizes the behavior policy by minimizing the total variance in return across GVFs, thereby reducing the required environmental interactions. We use an existing temporal-difference-style variance estimator to approximate the return variance. We prove that each behavior policy update decreases the overall mean squared error in GVF predictions. We empirically show our method's performance in tabular and nonlinear function approximation settings, including Mujoco environments, with stationary and non-stationary reward signals, optimizing data usage and reducing prediction errors across multiple GVFs.

EWRL Workshop 2024 Workshop Paper

An Attentive Approach for Building Partial Reasoning Agents from Pixels

  • Safa Alver
  • Doina Precup

We study the problem of building reasoning agents that are able to generalize in an effective manner. Towards this goal, we propose an end-to-end approach for building model-based reinforcement learning agents that dynamically focus their reasoning on the relevant aspects of the environment: after automatically identifying the distinct aspects of the environment, these agents dynamically filter out the relevant ones and then pass them to their simulator to perform partial reasoning. Unlike existing approaches, our approach works with pixel-based inputs and it allows for interpreting the focal points of the agent. Our quantitative analyses show that the proposed approach allows for effective generalization in high-dimensional domains with raw observational inputs. We also perform ablation analyses to validate our design choices. Finally, we demonstrate through qualitative analyses that our approach actually allows for building agents that focus their reasoning on the relevant aspects of the environment.

TMLR Journal 2024 Journal Article

An Attentive Approach for Building Partial Reasoning Agents from Pixels

  • Safa Alver
  • Doina Precup

We study the problem of building reasoning agents that are able to generalize in an effective manner. Towards this goal, we propose an end-to-end approach for building model-based reinforcement learning agents that dynamically focus their reasoning on the relevant aspects of the environment: after automatically identifying the distinct aspects of the environment, these agents dynamically filter out the relevant ones and then pass them to their simulator to perform partial reasoning. Unlike existing approaches, our approach works with pixel-based inputs and it allows for interpreting the focal points of the agent. Our quantitative analyses show that the proposed approach allows for effective generalization in high-dimensional domains with raw observational inputs. We also perform ablation analyses to validate our design choices. Finally, we demonstrate through qualitative analyses that our approach actually allows for building agents that focus their reasoning on the relevant aspects of the environment.

ICML Conference 2024 Conference Paper

Code as Reward: Empowering Reinforcement Learning with VLMs

  • David Venuto
  • Mohammad Sami Nur Islam
  • Martin Klissarov
  • Doina Precup
  • Sherry Yang 0001
  • Ankit Anand

Pre-trained Vision-Language Models (VLMs) are able to understand visual concepts, describe and decompose complex tasks into sub-tasks, and provide feedback on task completion. In this paper, we aim to leverage these capabilities to support the training of reinforcement learning (RL) agents. In principle, VLMs are well suited for this purpose, as they can naturally analyze image-based observations and provide feedback (reward) on learning progress. However, inference in VLMs is computationally expensive, so querying them frequently to compute rewards would significantly slow down the training of an RL agent. To address this challenge, we propose a framework named Code as Reward (VLM-CaR). VLM-CaR produces dense reward functions from VLMs through code generation, thereby significantly reducing the computational burden of querying the VLM directly. We show that the dense rewards generated through our approach are very accurate across a diverse set of discrete and continuous environments, and can be more effective in training RL policies than the original sparse environment rewards.

ICLR Conference 2024 Conference Paper

Consciousness-Inspired Spatio-Temporal Abstractions for Better Generalization in Reinforcement Learning

  • Mingde Zhao 0001
  • Safa Alver
  • Harm van Seijen
  • Romain Laroche
  • Doina Precup
  • Yoshua Bengio

Inspired by human conscious planning, we propose Skipper, a model-based reinforcement learning framework utilizing spatio-temporal abstractions to generalize better in novel situations. It automatically decomposes the given task into smaller, more manageable subtasks, and thus enables sparse decision-making and focused computation on the relevant parts of the environment. The decomposition relies on the extraction of an abstracted proxy problem represented as a directed graph, in which vertices and edges are learned end-to-end from hindsight. Our theoretical analyses provide performance guarantees under appropriate assumptions and establish where our approach is expected to be helpful. Generalization-focused experiments validate Skipper’s significant advantage in zero-shot generalization, compared to some existing state-of-the-art hierarchical planning methods.

UAI Conference 2024 Conference Paper

Discrete Probabilistic Inference as Control in Multi-path Environments

  • Tristan Deleu
  • Padideh Nouri
  • Nikolay Malkin
  • Doina Precup
  • Yoshua Bengio

We consider the problem of sampling from a discrete and structured distribution as a sequential decision problem, where the objective is to find a stochastic policy such that objects are sampled at the end of this sequential process proportionally to some predefined reward. While we could use maximum entropy Reinforcement Learning (MaxEnt RL) to solve this problem for some distributions, it has been shown that in general, the distribution over states induced by the optimal policy may be biased in cases where there are multiple ways to generate the same object. To address this issue, Generative Flow Networks (GFlowNets) learn a stochastic policy that samples objects proportionally to their reward by approximately enforcing a conservation of flows across the whole Markov Decision Process (MDP). In this paper, we extend recent methods correcting the reward in order to guarantee that the marginal distribution induced by the optimal MaxEnt RL policy is proportional to the original reward, regardless of the structure of the underlying MDP. We also prove that some flow-matching objectives found in the GFlowNet literature are in fact equivalent to well-established MaxEnt RL algorithms with a corrected reward. Finally, we study empirically the performance of multiple MaxEnt RL and GFlowNet algorithms on multiple problems involving sampling from discrete distributions.

NeurIPS Conference 2024 Conference Paper

Efficient Reinforcement Learning by Discovering Neural Pathways

  • Samin Yeasar Arnob
  • Riyasat Ohib
  • Sergey Plis
  • Amy Zhang
  • Alessandro Sordoni
  • Doina Precup

Reinforcement learning (RL) algorithms have been very successful at tackling complex control problems, such as AlphaGo or fusion control. However, current research mainly emphasizes solution quality, often achieved by using large models trained on large amounts of data, and does not account for the financial, environmental, and societal costs associated with developing and deploying such models. Modern neural networks are often overparameterized, and a significant number of parameters can be pruned without meaningful loss in performance, resulting in more efficient use of the model's capacity (cf. the lottery ticket hypothesis). We present a methodology for identifying sub-networks within a larger network in reinforcement learning (RL). We call such sub-networks neural pathways. We show empirically that even very small learned sub-networks, using less than 5% of the large network's parameters, can provide very good quality solutions. We also demonstrate the training of multiple pathways within the same networks in a multitask setup, where each pathway is encouraged to tackle a separate task. We evaluate our approach empirically on several continuous control tasks, in both online and offline training.
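
As a generic stand-in for selecting a small sub-network (this sketch is not the paper's pathway-discovery procedure), global magnitude pruning keeping roughly 5% of parameters can look like this; the threshold rule and model are illustrative assumptions.

```python
import torch

def magnitude_mask(model, keep_fraction=0.05):
    """Per-parameter 0/1 masks keeping the largest-magnitude weights globally,
    a generic stand-in for selecting a small sub-network ("pathway")."""
    all_weights = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    k = max(1, int(keep_fraction * all_weights.numel()))
    threshold = torch.topk(all_weights, k).values.min()
    return {name: (p.detach().abs() >= threshold).float()
            for name, p in model.named_parameters()}

# Toy usage: zero out roughly 95% of a small policy network's parameters.
net = torch.nn.Sequential(torch.nn.Linear(8, 128), torch.nn.ReLU(), torch.nn.Linear(128, 4))
masks = magnitude_mask(net)
with torch.no_grad():
    for name, p in net.named_parameters():
        p.mul_(masks[name])
```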

IJCAI Conference 2024 Conference Paper

Finding Increasingly Large Extremal Graphs with AlphaZero and Tabu Search

  • Abbas Mehrabian
  • Ankit Anand
  • Hyunjik Kim
  • Nicolas Sonnerat
  • Matej Balog
  • Gheorghe Comanici
  • Tudor Berariu
  • Andrew Lee

This work proposes a new learning-to-search benchmark and uses AI to discover new mathematical knowledge related to an open conjecture of Erdos (1975) in extremal graph theory. The problem is to find graphs with a given size (number of nodes) that maximize the number of edges without having 3- or 4-cycles. We formulate this as a sequential decision-making problem and compare AlphaZero, a neural network-guided tree search, with tabu search, a heuristic local search method. Using either method, by introducing a curriculum---jump-starting the search for larger graphs using good graphs found at smaller sizes---we improve the state-of-the-art lower bounds for several sizes. We also propose a flexible graph-generation environment and a permutation-invariant network architecture for learning to search in the space of graphs.
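
The combinatorial constraint in question is easy to check directly: a simple graph has no 3-cycles iff the trace of A^3 is zero, and no 4-cycles iff no pair of distinct vertices shares two or more neighbours. A small verification sketch (not part of the paper's search environment):

```python
import numpy as np

def is_c3_c4_free(adj):
    """Check that a simple undirected graph (0/1 adjacency matrix) has no 3- or 4-cycles."""
    A = np.asarray(adj)
    if np.trace(A @ A @ A) != 0:       # closed walks of length 3 exist iff there is a triangle
        return False
    common = A @ A                     # common[i, j] = number of common neighbours of i and j
    np.fill_diagonal(common, 0)
    return not (common >= 2).any()     # two common neighbours between i != j give a 4-cycle

# The 5-cycle has girth 5, so it satisfies the constraint with 5 edges on 5 nodes.
C5 = np.zeros((5, 5), dtype=int)
for i in range(5):
    C5[i, (i + 1) % 5] = C5[(i + 1) % 5, i] = 1
assert is_c3_c4_free(C5)
```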

EWRL Workshop 2024 Workshop Paper

Functional Acceleration for Policy Mirror Descent

  • Veronica Chelu
  • Doina Precup

We apply functional acceleration to the Policy Mirror Descent (PMD) general family of algorithms, which cover a wide range of novel and fundamental methods in Reinforcement Learning (RL). Leveraging duality, we propose a momentum-based PMD update. By taking the functional route, our approach is independent of the policy parametrization and applicable to large-scale optimization, covering previous applications of momentum at the level of policy parameters as a special case. We theoretically analyze several properties of this approach and complement with a numerical ablation study, which serves to illustrate the policy optimization dynamics on the value polytope, relative to different algorithmic design choices in this space. We further characterize numerically several features of the problem setting relevant for functional acceleration, and lastly, we investigate the impact of approximation on their learning mechanics.

NeurIPS Conference 2024 Conference Paper

Learning Successor Features the Simple Way

  • Raymond Chua
  • Arna Ghosh
  • Christos Kaplanis
  • Blake A. Richards
  • Doina Precup

In Deep Reinforcement Learning (RL), it is a challenge to learn representations that do not exhibit catastrophic forgetting or interference in non-stationary environments. Successor Features (SFs) offer a potential solution to this challenge. However, canonical techniques for learning SFs from pixel-level observations often lead to representation collapse, wherein representations degenerate and fail to capture meaningful variations in the data. More recent methods for learning SFs can avoid representation collapse, but they often involve complex losses and multiple learning phases, reducing their efficiency. We introduce a novel, simple method for learning SFs directly from pixels. Our approach uses a combination of a Temporal-difference (TD) loss and a reward prediction loss, which together capture the basic mathematical definition of SFs. We show that our approach matches or outperforms existing SF learning techniques in both 2D (Minigrid) and 3D (Miniworld) mazes, for both single and continual learning scenarios. As well, our technique is efficient, and can reach higher levels of performance in less time than other approaches. Our work provides a new, streamlined technique for learning SFs directly from pixel observations, with no pretraining required.
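
A minimal sketch of the two losses the abstract names, under illustrative assumptions (vector observations instead of pixels, separate feature and SF heads): a TD loss driving psi(s, a) toward phi(s) + gamma * psi(s', a'), plus a reward-prediction loss fitting r ≈ w · phi(s). This is not the paper's architecture.

```python
import torch
import torch.nn.functional as F

gamma, d_phi, n_actions, obs_dim = 0.99, 32, 4, 8
phi_net = torch.nn.Sequential(torch.nn.Linear(obs_dim, 64), torch.nn.ReLU(),
                              torch.nn.Linear(64, d_phi))              # features phi(s)
psi_net = torch.nn.Sequential(torch.nn.Linear(obs_dim, 64), torch.nn.ReLU(),
                              torch.nn.Linear(64, d_phi * n_actions))  # SFs psi(s, .)
w = torch.nn.Parameter(torch.zeros(d_phi))                             # reward weights

def sf_losses(s, a, r, s_next, a_next):
    phi = phi_net(s)
    psi = psi_net(s).view(-1, n_actions, d_phi)
    psi_sa = psi[torch.arange(len(a)), a]
    with torch.no_grad():                                 # bootstrapped TD target
        psi_next = psi_net(s_next).view(-1, n_actions, d_phi)
        target = phi + gamma * psi_next[torch.arange(len(a_next)), a_next]
    td_loss = F.mse_loss(psi_sa, target)                  # psi ~ phi + gamma * psi'
    reward_loss = F.mse_loss(phi @ w, r)                  # r ~ w . phi(s)
    return td_loss + reward_loss

# Toy batch of 16 transitions.
B = 16
loss = sf_losses(torch.randn(B, obs_dim), torch.randint(0, n_actions, (B,)),
                 torch.randn(B), torch.randn(B, obs_dim), torch.randint(0, n_actions, (B,)))
loss.backward()
```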

ICML Conference 2024 Conference Paper

Mixtures of Experts Unlock Parameter Scaling for Deep RL

  • Johan S. Obando-Ceron
  • Ghada Sokar
  • Timon Willi
  • Clare Lyle
  • Jesse Farebrother
  • Jakob N. Foerster
  • Gintare Karolina Dziugaite
  • Doina Precup

The recent rapid progress in (self) supervised learning models is in large part predicted by empirical scaling laws: a model’s performance scales proportionally to its size. Analogous scaling laws remain elusive for reinforcement learning domains, however, where increasing the parameter count of a model often hurts its final performance. In this paper, we demonstrate that incorporating Mixture-of-Expert (MoE) modules, and in particular Soft MoEs (Puigcerver et al., 2023), into value-based networks results in more parameter-scalable models, evidenced by substantial performance increases across a variety of training regimes and model sizes. This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.

RLJ Journal 2024 Journal Article

More Efficient Randomized Exploration for Reinforcement Learning via Approximate Sampling

  • Haque Ishfaq
  • Yixin Tan
  • Yu Yang
  • Qingfeng Lan
  • Jianfeng Lu
  • A. Rupam Mahmood
  • Doina Precup
  • Pan Xu

Thompson sampling (TS) is one of the most popular exploration techniques in reinforcement learning (RL). However, most TS algorithms with theoretical guarantees are difficult to implement and not generalizable to Deep RL. While approximate sampling-based exploration schemes are promising, most existing algorithms are specific to linear Markov Decision Processes (MDP) with suboptimal regret bounds, or only use the most basic samplers such as Langevin Monte Carlo. In this work, we propose an algorithmic framework that incorporates different approximate sampling methods with the recently proposed Feel-Good Thompson Sampling (FGTS) approach (Zhang, 2022; Dann et al., 2021), which was previously known to be intractable. When applied to linear MDPs, our regret analysis yields the best known dependency of regret on dimensionality, surpassing existing randomized algorithms. Additionally, we provide explicit sampling complexity for each employed sampler. Empirically, we show that in tasks where deep exploration is necessary, our proposed algorithms that combine FGTS and approximate sampling perform significantly better compared to other strong baselines. On several challenging games from the Atari 57 suite, our algorithms achieve performance that is either better than or on par with other strong baselines from the deep RL literature.

RLC Conference 2024 Conference Paper

More Efficient Randomized Exploration for Reinforcement Learning via Approximate Sampling

  • Haque Ishfaq
  • Yixin Tan
  • Yu Yang
  • Qingfeng Lan
  • Jianfeng Lu
  • A. Rupam Mahmood
  • Doina Precup
  • Pan Xu

Thompson sampling (TS) is one of the most popular exploration techniques in reinforcement learning (RL). However, most TS algorithms with theoretical guarantees are difficult to implement and not generalizable to Deep RL. While approximate sampling-based exploration schemes are promising, most existing algorithms are specific to linear Markov Decision Processes (MDP) with suboptimal regret bounds, or only use the most basic samplers such as Langevin Monte Carlo. In this work, we propose an algorithmic framework that incorporates different approximate sampling methods with the recently proposed Feel-Good Thompson Sampling (FGTS) approach (Zhang, 2022; Dann et al., 2021), which was previously known to be intractable. When applied to linear MDPs, our regret analysis yields the best known dependency of regret on dimensionality, surpassing existing randomized algorithms. Additionally, we provide explicit sampling complexity for each employed sampler. Empirically, we show that in tasks where deep exploration is necessary, our proposed algorithms that combine FGTS and approximate sampling perform significantly better compared to other strong baselines. On several challenging games from the Atari 57 suite, our algorithms achieve performance that is either better than or on par with other strong baselines from the deep RL literature.

ICML Conference 2024 Conference Paper

Nash Learning from Human Feedback

  • Rémi Munos
  • Michal Valko
  • Daniele Calandriello
  • Mohammad Gheshlaghi Azar
  • Mark Rowland 0001
  • Daniel Guo 0001
  • Yunhao Tang
  • Matthieu Geist

Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Traditionally, RLHF involves the initial step of learning a reward model from pairwise human feedback, i.e., expressed as preferences between pairs of text generations. Subsequently, the LLM’s policy is fine-tuned to maximize the reward through a reinforcement learning algorithm. In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a pairwise preference model, which is conditioned on two inputs (instead of a single input in the case of a reward model) given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF). In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iteration converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. We illustrate the effectiveness of our approach by presenting experimental results on a text summarization task. We believe NLHF offers a compelling avenue for fine-tuning LLMs and enhancing the alignment of LLMs with human preferences.

NeurIPS Conference 2024 Conference Paper

Offline Multitask Representation Learning for Reinforcement Learning

  • Haque Ishfaq
  • Thanh Nguyen-Tang
  • Songtao Feng
  • Raman Arora
  • Mengdi Wang
  • Ming Yin
  • Doina Precup

We study offline multitask representation learning in reinforcement learning (RL), where a learner is provided with an offline dataset from different tasks that share a common representation and is asked to learn the shared representation. We theoretically investigate offline multitask low-rank RL, and propose a new algorithm called MORL for offline multitask representation learning. Furthermore, we examine downstream RL in reward-free, offline and online scenarios, where a new task is introduced to the agent that shares the same representation as the upstream offline tasks. Our theoretical results demonstrate the benefits of using the learned representation from the upstream offline task instead of directly learning the representation of the low-rank model.

NeurIPS Conference 2024 Conference Paper

Parseval Regularization for Continual Reinforcement Learning

  • Wesley Chung
  • Lynn Cherif
  • David Meger
  • Doina Precup

Plasticity loss, trainability loss, and primacy bias have been identified as issues arising when training deep neural networks on sequences of tasks---referring to the increased difficulty in training on new tasks. We propose to use Parseval regularization, which maintains orthogonality of weight matrices, to preserve useful optimization properties and improve training in a continual reinforcement learning setting. We show that it provides significant benefits to RL agents on a suite of gridworld, CARL and MetaWorld tasks. We conduct comprehensive ablations to identify the source of its benefits and investigate the effect on certain metrics associated with network trainability, including weight matrix rank, weight norms, and policy entropy.
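
A minimal sketch of a Parseval-style orthogonality penalty, the Frobenius norm of W W^T - I summed over weight matrices and added to the usual RL loss; the coefficient and model are illustrative, and the paper's exact regularization scheme may differ.

```python
import torch

def parseval_penalty(model, beta=1e-3):
    """Sum of ||W W^T - I||_F^2 over 2-D weight matrices, pushing each weight matrix's
    rows toward orthonormality."""
    penalty = 0.0
    for name, p in model.named_parameters():
        if p.ndim == 2 and "weight" in name:
            gram = p @ p.t()
            penalty = penalty + ((gram - torch.eye(p.shape[0], device=p.device)) ** 2).sum()
    return beta * penalty

# Toy usage: add the penalty to an ordinary task loss before backpropagating.
policy = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.Tanh(), torch.nn.Linear(64, 4))
task_loss = policy(torch.randn(32, 8)).pow(2).mean()    # stand-in for an RL loss
(task_loss + parseval_penalty(policy)).backward()
```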

JMLR Journal 2024 Journal Article

Policy Gradient Methods in the Presence of Symmetries and State Abstractions

  • Prakash Panangaden
  • Sahand Rezaei-Shoshtari
  • Rosie Zhao
  • David Meger
  • Doina Precup

Reinforcement learning (RL) on high-dimensional and complex problems relies on abstraction for improved efficiency and generalization. In this paper, we study abstraction in the continuous-control setting, and extend the definition of Markov decision process (MDP) homomorphisms to the setting of continuous state and action spaces. We derive a policy gradient theorem on the abstract MDP for both stochastic and deterministic policies. Our policy gradient results allow for leveraging approximate symmetries of the environment for policy optimization. Based on these theorems, we propose a family of actor-critic algorithms that are able to learn the policy and the MDP homomorphism map simultaneously, using the lax bisimulation metric. Finally, we introduce a series of environments with continuous symmetries to further demonstrate the ability of our algorithm for action abstraction in the presence of such symmetries. We demonstrate the effectiveness of our method on our environments, as well as on challenging visual control tasks from the DeepMind Control Suite. Our method's ability to utilize MDP homomorphisms for representation learning leads to improved performance, and the visualizations of the latent space clearly demonstrate the structure of the learned abstraction.

ICLR Conference 2024 Conference Paper

Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo

  • Haque Ishfaq
  • Qingfeng Lan
  • Pan Xu 0002
  • A. Rupam Mahmood
  • Doina Precup
  • Anima Anandkumar
  • Kamyar Azizzadenesheli

We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL). One of the key shortcomings of existing Thompson sampling algorithms is the need to perform a Gaussian approximation of the posterior distribution, which is not a good surrogate in most practical settings. We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo, an efficient type of Markov Chain Monte Carlo (MCMC) method. Our method only needs to perform noisy gradient descent updates to learn the exact posterior distribution of the Q function, which makes our approach easy to deploy in deep RL. We provide a rigorous theoretical analysis for the proposed method and demonstrate that, in the linear Markov decision process (linear MDP) setting, it has a regret bound of $\tilde{O}(d^{3/2}H^{3/2}\sqrt{T})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $T$ is the total number of steps. We apply this approach to deep RL, by using Adam optimizer to perform gradient updates. Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.

NeurIPS Conference 2024 Conference Paper

QGFN: Controllable Greediness with Action Values

  • Elaine Lau
  • Stephen Z. Lu
  • Ling Pan
  • Doina Precup
  • Emmanuel Bengio

Generative Flow Networks (GFlowNets; GFNs) are a family of energy-based generative methods for combinatorial objects, capable of generating diverse and high-utility samples. However, consistently biasing GFNs towards producing high-utility samples is non-trivial. In this work, we leverage connections between GFNs and reinforcement learning (RL) and propose to combine the GFN policy with an action-value estimate, $Q$, to create greedier sampling policies which can be controlled by a mixing parameter. We show that several variants of the proposed method, QGFN, are able to improve on the number of high-reward samples generated in a variety of tasks without sacrificing diversity.
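
One plausible instance of the controllable mixing described above (an assumption for illustration, since the paper studies several variants): with probability p act greedily with respect to Q, otherwise sample from the GFlowNet policy.

```python
import torch

def mixed_action(gfn_logits, q_values, p=0.3):
    """Sample from the GFlowNet policy, except that with probability p the action is
    chosen greedily with respect to the action-value estimate Q."""
    if torch.rand(()) < p:
        return int(q_values.argmax())
    return int(torch.multinomial(torch.softmax(gfn_logits, dim=-1), 1))

# Toy usage over 5 candidate actions; larger p biases sampling toward high-Q actions.
gfn_logits, q_values = torch.randn(5), torch.randn(5)
actions = [mixed_action(gfn_logits, q_values) for _ in range(10)]
```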

NeurIPS Conference 2024 Conference Paper

ReactZyme: A Benchmark for Enzyme-Reaction Prediction

  • Chenqing Hua
  • Bozitao Zhong
  • Sitao Luan
  • Liang Hong
  • Guy Wolf
  • Doina Precup
  • Shuangjia Zheng

Enzymes, with their specific catalyzed reactions, are necessary for all aspects of life, enabling diverse biological processes and adaptations. Predicting enzyme functions is essential for understanding biological pathways, guiding drug development, enhancing bioproduct yields, and facilitating evolutionary studies. Addressing the inherent complexities, we introduce a new approach to annotating enzymes based on their catalyzed reactions. This method provides detailed insights into specific reactions and is adaptable to newly discovered reactions, diverging from traditional classifications by protein family or expert-derived reaction classes. We employ machine learning algorithms to analyze enzyme reaction datasets, delivering a much more refined view on the functionality of enzymes. Our evaluation leverages the largest enzyme-reaction dataset to date, derived from the SwissProt and Rhea databases with entries up to January 8, 2024. We frame the enzyme-reaction prediction as a retrieval problem, aiming to rank enzymes by their catalytic ability for specific reactions. With our model, we can recruit proteins for novel reactions and predict reactions in novel proteins, facilitating enzyme discovery and function annotation (https://github.com/WillHua127/ReactZyme).

NeurIPS Conference 2023 Conference Paper

A Definition of Continual Reinforcement Learning

  • David Abel
  • Andre Barreto
  • Benjamin Van Roy
  • Doina Precup
  • Hado P. van Hasselt
  • Satinder Singh

In a standard view of the reinforcement learning problem, an agent’s goal is to efficiently identify a policy that maximizes long-term reward. However, this perspective is based on a restricted view of learning as finding a solution, rather than treating learning as endless adaptation. In contrast, continual reinforcement learning refers to the setting in which the best agents never stop learning. Despite the importance of continual reinforcement learning, the community lacks a simple definition of the problem that highlights its commitments and makes its primary concepts precise and clear. To this end, this paper is dedicated to carefully defining the continual reinforcement learning problem. We formalize the notion of agents that “never stop learning” through a new mathematical language for analyzing and cataloging agents. Using this new language, we define a continual learning agent as one that can be understood as carrying out an implicit search process indefinitely, and continual reinforcement learning as the setting in which the best agents are all continual learning agents. We provide two motivating examples, illustrating that traditional views of multi-task reinforcement learning and continual supervised learning are special cases of our definition. Collectively, these definitions and perspectives formalize many intuitive concepts at the heart of learning, and open new research pathways surrounding continual learning agents.

EWRL Workshop 2023 Workshop Paper

Acceleration in Policy Optimization

  • Veronica Chelu
  • Tom Zahavy
  • Arthur Guez
  • Doina Precup
  • Sebastian Flennerhag

We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) through predictive and adaptive directions of (functional) policy ascent. Leveraging the connection between policy iteration and policy gradient methods, we view policy optimization algorithms as iteratively solving a sequence of surrogate objectives, local lower bounds on the original objective. We define optimism as predictive modelling of the future behavior of a policy, and hindsight adaptation as taking immediate and anticipatory corrective actions to mitigate accumulating errors from overshooting predictions or delayed responses to change. We use this shared lens to jointly express other well-known algorithms, including model-based policy improvement based on forward search, and optimistic meta-learning algorithms. We show connections with Anderson acceleration, Nesterov's accelerated gradient, extra-gradient methods, and linear extrapolation in the update rule. We analyze properties of the formulation, design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration, in an illustrative task.

NeurIPS Conference 2023 Conference Paper

For SALE: State-Action Representation Learning for Deep Reinforcement Learning

  • Scott Fujimoto
  • Wei-Di Chang
  • Edward Smith
  • Shixiang (Shane) Gu
  • Doina Precup
  • David Meger

In reinforcement learning (RL), representation learning is a proven tool for complex image-based tasks, but is often overlooked for environments with low-level states, such as physical control problems. This paper introduces SALE, a novel approach for learning embeddings that model the nuanced interaction between state and action, enabling effective representation learning from low-level states. We extensively study the design space of these embeddings and highlight important design considerations. We integrate SALE and an adaptation of checkpoints for RL into TD3 to form the TD7 algorithm, which significantly outperforms existing continuous control algorithms. On OpenAI gym benchmark tasks, TD7 has an average performance gain of 276.7% and 50.7% over TD3 at 300k and 5M time steps, respectively, and works in both the online and offline settings.

ICML Conference 2023 Conference Paper

Multi-Environment Pretraining Enables Transfer to Action Limited Datasets

  • David Venuto
  • Sherry Yang 0001
  • Pieter Abbeel
  • Doina Precup
  • Igor Mordatch
  • Ofir Nachum

Using massive datasets to train large-scale models has emerged as a dominant approach for broad generalization in natural language and vision applications. In reinforcement learning, however, a key challenge is that available data of sequential decision making is often not annotated with actions - for example, videos of game-play are much more available than sequences of frames paired with their logged game controls. We propose to circumvent this challenge by combining large but sparsely-annotated datasets from a target environment of interest with fully-annotated datasets from various other source environments. Our method, Action Limited PreTraining (ALPT), leverages the generalization capabilities of inverse dynamics modelling (IDM) to label missing action data in the target environment. We show that utilizing even one additional environment dataset of labelled data during IDM pretraining gives rise to substantial improvements in generating action labels for unannotated sequences. We evaluate our method on benchmark game-playing environments and show that we can significantly improve game performance and generalization capability compared to other approaches, using annotated datasets equivalent to only $12$ minutes of gameplay. Highlighting the power of IDM, we show that these benefits remain even when target and source environments share no common actions.

AAAI Conference 2023 Conference Paper

On the Challenges of Using Reinforcement Learning in Precision Drug Dosing: Delay and Prolongedness of Action Effects

  • Sumana Basu
  • Marc-André Legault
  • Adriana Romero-Soriano
  • Doina Precup

Drug dosing is an important application of AI, which can be formulated as a Reinforcement Learning (RL) problem. In this paper, we identify two major challenges of using RL for drug dosing: delayed and prolonged effects of administering medications, which break the Markov assumption of the RL framework. We focus on prolongedness and define PAE-POMDP (Prolonged Action Effect-Partially Observable Markov Decision Process), a subclass of POMDPs in which the Markov assumption does not hold specifically due to prolonged effects of actions. Motivated by the pharmacology literature, we propose a simple and effective approach to converting drug dosing PAE-POMDPs into MDPs, enabling the use of existing RL algorithms to solve such problems. We validate the proposed approach on a toy task, and a challenging glucose control task, for which we devise a clinically-inspired reward function. Our results demonstrate that: (1) the proposed method to restore the Markov assumption leads to significant improvements over a vanilla baseline; (2) the approach is competitive with recurrent policies which may inherently capture the prolonged effect of actions; (3) it is remarkably more time and memory efficient than the recurrent baseline and hence more suitable for real-time dosing control systems; and (4) it exhibits favourable qualitative behavior in our policy analysis.

NeurIPS Conference 2023 Conference Paper

Prediction and Control in Continual Reinforcement Learning

  • Nishanth Anand
  • Doina Precup

Temporal difference (TD) learning is often used to update the estimate of the value function which is used by RL agents to extract useful policies. In this paper, we focus on value function estimation in continual reinforcement learning. We propose to decompose the value function into two components which update at different timescales: a permanent value function, which holds general knowledge that persists over time, and a transient value function, which allows quick adaptation to new situations. We establish theoretical results showing that our approach is well suited for continual learning and draw connections to the complementary learning systems (CLS) theory from neuroscience. Empirically, this approach improves performance significantly on both prediction and control problems.
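A tabular sketch of the two-timescale idea: the transient table adapts quickly to current TD errors while the permanent table slowly consolidates the combined estimate. Learning rates and the consolidation rule here are illustrative assumptions, not the paper's exact updates.

```python
import numpy as np

def two_timescale_td(transitions, n_states, gamma=0.99,
                     alpha_fast=0.5, alpha_slow=0.01):
    """Sketch of a permanent + transient value decomposition: the transient
    table adapts quickly to the current task, the permanent table slowly
    consolidates what the combined estimate has learned."""
    v_perm = np.zeros(n_states)    # slowly changing, general knowledge
    v_trans = np.zeros(n_states)   # fast-adapting residual

    def value(s):
        return v_perm[s] + v_trans[s]

    for s, r, s_next, done in transitions:
        target = r + (0.0 if done else gamma * value(s_next))
        td_error = target - value(s)
        v_trans[s] += alpha_fast * td_error                 # quick adaptation
        v_perm[s] += alpha_slow * (value(s) - v_perm[s])    # slow consolidation
    return v_perm, v_trans
```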

EWRL Workshop 2023 Workshop Paper

Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo

  • Haque Ishfaq
  • Qingfeng Lan
  • Pan Xu
  • A. Rupam Mahmood
  • Doina Precup
  • Anima Anandkumar
  • Kamyar Azizzadenesheli

We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL). One of the key shortcomings of existing Thompson sampling algorithms is the need to perform a Gaussian approximation of the posterior distribution, which is not a good surrogate in most practical settings. We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo, an efficient type of Markov Chain Monte Carlo (MCMC) method. Our method only needs to perform noisy gradient descent updates to learn the exact posterior distribution of the Q function, which makes our approach easy to deploy in deep RL. We provide a rigorous theoretical analysis for the proposed method and demonstrate that, in the linear Markov decision process (linear MDP) setting, it has a regret bound of $\tilde{O}(d^{3/2}H^{5/2}\sqrt{T})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $T$ is the total number of steps. We apply this approach to deep RL, by using Adam optimizer to perform gradient updates. Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
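The core update is ordinary noisy gradient descent: a gradient step on the TD loss plus Gaussian noise scaled so that the iterates approximately sample the posterior over Q functions. The sketch below is generic stochastic gradient Langevin dynamics with illustrative step size and temperature; acting greedily with respect to the sampled Q then gives Thompson-sampling-style exploration.

```python
import numpy as np

def langevin_q_update(theta, grad_loss, step_size=1e-3, temperature=1.0,
                      rng=np.random.default_rng()):
    """One Langevin Monte Carlo step on Q-function parameters: a gradient step
    on the TD loss plus properly scaled Gaussian noise, so that repeated
    updates draw approximate samples from the posterior over Q functions.
    Step size and temperature are illustrative."""
    noise = rng.normal(size=theta.shape)
    return (theta - step_size * grad_loss(theta)
            + np.sqrt(2.0 * step_size / temperature) * noise)
```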

JMLR Journal 2023 Journal Article

Temporal Abstraction in Reinforcement Learning with the Successor Representation

  • Marlos C. Machado
  • Andre Barreto
  • Doina Precup
  • Michael Bowling

Reasoning at multiple levels of temporal abstraction is one of the key attributes of intelligence. In reinforcement learning, this is often modeled through temporally extended courses of actions called options. Options allow agents to make predictions and to operate at different levels of abstraction within an environment. Nevertheless, approaches based on the options framework often start with the assumption that a reasonable set of options is known beforehand. When this is not the case, there are no definitive answers for which options one should consider. In this paper, we argue that the successor representation, which encodes states based on the pattern of state visitation that follows them, can be seen as a natural substrate for the discovery and use of temporal abstractions. To support our claim, we take a big picture view of recent results, showing how the successor representation can be used to discover options that facilitate either temporally-extended exploration or planning. We cast these results as instantiations of a general framework for option discovery in which the agent’s representation is used to identify useful options, which are then used to further improve its representation. This results in a virtuous, never-ending, cycle in which both the representation and the options are constantly refined based on each other. Beyond option discovery itself, we also discuss how the successor representation allows us to augment a set of options into a combinatorially large counterpart without additional learning. This is achieved through the combination of previously learned options. Our empirical evaluation focuses on options discovered for temporally-extended exploration and on the use of the successor representation to combine them. Our results shed light on important design decisions involved in the definition of options and demonstrate the synergy of different methods based on the successor representation, such as eigenoptions and the option keyboard.
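In the tabular case, the successor representation the article builds on has a closed form, shown below; its eigenvectors are one signal that eigenoption-style discovery can use. This is a generic illustration, not code from the article.

```python
import numpy as np

def successor_representation(P_pi, gamma=0.99):
    """Tabular successor representation of a fixed policy:
    Psi = (I - gamma * P_pi)^{-1}, where P_pi[s, s'] is the state-to-state
    transition matrix under the policy. Each row encodes the discounted
    expected visitation pattern that follows a state."""
    n = P_pi.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * P_pi)

# Eigenvectors of this matrix can serve as intrinsic reward signals for
# option discovery, e.g. in eigenoption-style methods.
```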

NeurIPS Conference 2023 Conference Paper

When Do Graph Neural Networks Help with Node Classification? Investigating the Homophily Principle on Node Distinguishability

  • Sitao Luan
  • Chenqing Hua
  • Minkai Xu
  • Qincheng Lu
  • Jiaqi Zhu
  • Xiao-Wen Chang
  • Jie Fu
  • Jure Leskovec

The homophily principle, i.e., that nodes with the same labels are more likely to be connected, has been believed to be the main reason for the performance superiority of Graph Neural Networks (GNNs) over Neural Networks on node classification tasks. Recent research suggests that, even in the absence of homophily, the advantage of GNNs still exists as long as nodes from the same class share similar neighborhood patterns. However, this argument only considers intra-class Node Distinguishability (ND) but neglects inter-class ND, which provides an incomplete understanding of homophily on GNNs. In this paper, we first demonstrate such deficiency with examples and argue that an ideal situation for ND is to have smaller intra-class ND than inter-class ND. To formulate this idea and study ND deeply, we propose Contextual Stochastic Block Model for Homophily (CSBM-H) and define two metrics, Probabilistic Bayes Error (PBE) and negative generalized Jeffreys divergence, to quantify ND. With the metrics, we visualize and analyze how graph filters, node degree distributions and class variances influence ND, and investigate the combined effect of intra- and inter-class ND. Besides, we discovered the mid-homophily pitfall, which occurs widely in graph datasets. Furthermore, we verified that, in real-world tasks, the superiority of GNNs is indeed closely related to both intra- and inter-class ND regardless of homophily levels. Grounded in this observation, we propose a new hypothesis-testing based performance metric beyond homophily, which is non-linear, feature-based and can provide a statistical threshold value for the superiority of GNNs. Experiments indicate that it is significantly more effective than the existing homophily metrics in revealing the advantages and disadvantages of graph-aware models on both synthetic and benchmark real-world datasets.

TMLR Journal 2022 Journal Article

Behind the Machine’s Gaze: Neural Networks with Biologically-inspired Constraints Exhibit Human-like Visual Attention

  • Leo Schwinn
  • Doina Precup
  • Bjoern Eskofier
  • Dario Zanca

By and large, existing computational models of visual attention tacitly assume perfect vision and full access to the stimulus and thereby deviate from foveated biological vision. Moreover, modeling top-down attention is generally reduced to the integration of semantic features without incorporating the signal of a high-level visual task that has been shown to partially guide human attention. We propose the Neural Visual Attention (NeVA) algorithm to generate visual scanpaths in a top-down manner. With our method, we explore the ability of neural networks on which we impose a biologically-inspired foveated vision constraint to generate human-like scanpaths without directly training for this objective. The loss of a neural network performing a downstream visual task (i.e., classification or reconstruction) flexibly provides top-down guidance to the scanpath. Extensive experiments show that our method outperforms state-of-the-art unsupervised human attention models in terms of similarity to human scanpaths. Additionally, the flexibility of the framework allows us to quantitatively investigate the role of different tasks in the generated visual behaviors. Finally, we demonstrate the superiority of the approach in a novel experiment that investigates the utility of scanpaths in real-world applications with imperfect viewing conditions.

ICLR Conference 2022 Conference Paper

Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates

  • Safa Alver
  • Doina Precup

We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks with no or very little new data. Specifically, we consider the framework of generalized policy evaluation and improvement, in which the rewards for all tasks of interest are assumed to be expressible as a linear combination of a fixed set of features. We show theoretically that, under certain assumptions, having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance on all possible downstream tasks which are typically more complex than the ones on which the agent was trained. Based on this theoretical analysis, we propose a simple algorithm that iteratively constructs this set of policies. In addition to empirically validating our theoretical results, we compare our approach with recently proposed diverse policy set construction methods and show that, while others fail, our approach is able to build a behavior basis that enables instantaneous transfer to all possible downstream tasks. We also show empirically that having access to a set of independent policies can better bootstrap the learning process on downstream tasks where the new reward function cannot be described as a linear combination of the features. Finally, we demonstrate how this policy set can be useful in a lifelong reinforcement learning setting.

NeurIPS Conference 2022 Conference Paper

Continuous MDP Homomorphisms and Homomorphic Policy Gradient

  • Sahand Rezaei-Shoshtari
  • Rosie Zhao
  • Prakash Panangaden
  • David Meger
  • Doina Precup

Abstraction has been widely studied as a way to improve the efficiency and generalization of reinforcement learning algorithms. In this paper, we study abstraction in the continuous-control setting. We extend the definition of MDP homomorphisms to encompass continuous actions in continuous state spaces. We derive a policy gradient theorem on the abstract MDP, which allows us to leverage approximate symmetries of the environment for policy optimization. Based on this theorem, we propose an actor-critic algorithm that is able to learn the policy and the MDP homomorphism map simultaneously, using the lax bisimulation metric. We demonstrate the effectiveness of our method on benchmark tasks in the DeepMind Control Suite. Our method's ability to utilize MDP homomorphisms for representation learning leads to improved performance when learning from pixel observations.

ICLR Conference 2022 Conference Paper

COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation

  • Jongmin Lee 0004
  • Cosmin Paduraru
  • Daniel J. Mankowitz
  • Nicolas Heess
  • Doina Precup
  • Kee-Eung Kim
  • Arthur Guez

We consider the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset. This problem setting is appealing in many real-world scenarios, where direct interaction with the environment is costly or risky, and where the resulting policy should comply with safety constraints. However, it is challenging to compute a policy that guarantees satisfying the cost constraints in the offline RL setting, since the off-policy evaluation inherently has an estimation error. In this paper, we present an offline constrained RL algorithm that optimizes the policy in the space of the stationary distribution. Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction. Experimental results show that COptiDICE attains better policies in terms of constraint satisfaction and return-maximization, outperforming baseline algorithms.

ICML Conference 2022 Conference Paper

Improving Robustness against Real-World and Worst-Case Distribution Shifts through Decision Region Quantification

  • Leo Schwinn
  • Leon Bungert
  • An Nguyen
  • René Raab
  • Falk Pulsmeyer
  • Doina Precup
  • Björn M. Eskofier
  • Dario Zanca

The reliability of neural networks is essential for their use in safety-critical applications. Existing approaches generally aim at improving the robustness of neural networks to either real-world distribution shifts (e.g., common corruptions and perturbations, spatial transformations, and natural adversarial examples) or worst-case distribution shifts (e.g., optimized adversarial examples). In this work, we propose the Decision Region Quantification (DRQ) algorithm to improve the robustness of any differentiable pre-trained model against both real-world and worst-case distribution shifts in the data. DRQ analyzes the robustness of local decision regions in the vicinity of a given data point to make more reliable predictions. We theoretically motivate the DRQ algorithm by showing that it effectively smooths spurious local extrema in the decision surface. Furthermore, we propose an implementation using targeted and untargeted adversarial attacks. An extensive empirical evaluation shows that DRQ increases the robustness of adversarially and non-adversarially trained models against real-world and worst-case distribution shifts on several computer vision benchmark datasets.

JAIR Journal 2022 Journal Article

Low-Rank Representation of Reinforcement Learning Policies

  • Bogdan Mazoure
  • Thang Doan
  • Tianyu Li
  • Vladimir Makarenkov
  • Joelle Pineau
  • Doina Precup
  • Guillaume Rabusseau

We propose a general framework for policy representation for reinforcement learning tasks. This framework involves finding a low-dimensional embedding of the policy on a reproducing kernel Hilbert space (RKHS). The usage of RKHS based methods allows us to derive strong theoretical guarantees on the expected return of the reconstructed policy. Such guarantees are typically lacking in black-box models, but are very desirable in tasks requiring stability and convergence guarantees. We conduct several experiments on classic RL domains. The results confirm that the policies can be robustly represented in a low-dimensional space while the embedded policy incurs almost no decrease in returns.

IJCAI Conference 2022 Conference Paper

On the Expressivity of Markov Reward (Extended Abstract)

  • David Abel
  • Will Dabney
  • Anna Harutyunyan
  • Mark K. Ho
  • Michael L. Littman
  • Doina Precup
  • Satinder Singh

Reward is the driving force for reinforcement-learning agents. We here set out to understand the expressivity of Markov reward as a way to capture tasks that we would want an agent to perform. We frame this study around three new abstract notions of "task": (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories. Our main results prove that while reward can express many of these tasks, there exist instances of each task type that no Markov reward function can capture. We then provide a set of polynomial-time algorithms that construct a Markov reward function that allows an agent to perform each task type, and correctly determine when no such reward function exists.

ICLR Conference 2022 Conference Paper

Policy Gradients Incorporating the Future

  • David Venuto
  • Elaine Lau
  • Doina Precup
  • Ofir Nachum

Reasoning about the future -- understanding how decisions in the present time affect outcomes in the future -- is one of the central challenges for reinforcement learning (RL), especially in highly-stochastic or partially observable environments. While predicting the future directly is hard, in this work we introduce a method that allows an agent to ``look into the future'' without explicitly predicting it. Namely, we propose to allow an agent, during its training on past experience, to observe what \emph{actually} happened in the future at that time, while enforcing an information bottleneck to avoid the agent overly relying on this privileged information. Coupled with recent advances in variational inference and a latent-variable autoregressive model, this gives our agent the ability to utilize rich and \emph{useful} information about the future trajectory dynamics in addition to the present. Our method, Policy Gradients Incorporating the Future (PGIF), is easy to implement and versatile, being applicable to virtually any policy gradient algorithm. We apply our proposed method to a number of off-the-shelf RL algorithms and show that PGIF is able to achieve higher reward faster in a variety of online and offline RL domains, as well as sparse-reward and partially observable environments.

ICML Conference 2022 Conference Paper

Proving Theorems using Incremental Learning and Hindsight Experience Replay

  • Eser Aygün
  • Ankit Anand
  • Laurent Orseau
  • Xavier Glorot
  • Stephen Marcus McAleer
  • Vlad Firoiu
  • Lei M. Zhang
  • Doina Precup

Traditional automated theorem proving systems for first-order logic depend on speed-optimized search and many handcrafted heuristics designed to work over a wide range of domains. Machine learning approaches in the literature either depend on these traditional provers to bootstrap themselves, by leveraging these heuristics, or can struggle due to limited existing proof data. The latter issue can be explained by the lack of a smooth difficulty gradient in theorem proving datasets; large gaps in difficulty between different theorems can make training harder or even impossible. In this paper, we adapt the idea of hindsight experience replay from reinforcement learning to the automated theorem proving domain, so as to use the intermediate data generated during unsuccessful proof attempts. We build a first-order logic prover by disabling all the smart clause-scoring heuristics of the state-of-the-art E prover and replacing them with a clause-scoring neural network learned by using hindsight experience replay in an incremental learning setting. Clauses are represented as graphs and presented to transformer networks with spectral features. We show that provers trained in this way can outperform previous machine learning approaches and compete with the state of the art heuristic-based theorem prover E in its best configuration, on the popular benchmarks MPTP2078, M2k and Mizar40. The proofs generated by our algorithm are also almost always significantly shorter than E’s proofs.

NeurIPS Conference 2022 Conference Paper

Revisiting Heterophily For Graph Neural Networks

  • Sitao Luan
  • Chenqing Hua
  • Qincheng Lu
  • Jiaqi Zhu
  • Mingde Zhao
  • Shuyuan Zhang
  • Xiao-Wen Chang
  • Doina Precup

Graph Neural Networks (GNNs) extend basic Neural Networks (NNs) by using graph structures based on the relational inductive bias (homophily assumption). While GNNs have been commonly believed to outperform NNs in real-world tasks, recent work has identified a non-trivial set of datasets where their performance compared to NNs is not satisfactory. Heterophily has been considered the main cause of this empirical observation and numerous works have been put forward to address it. In this paper, we first revisit the widely used homophily metrics and point out that their consideration of only graph-label consistency is a shortcoming. Then, we study heterophily from the perspective of post-aggregation node similarity and define new homophily metrics, which are potentially advantageous compared to existing ones. Based on this investigation, we prove that some harmful cases of heterophily can be effectively addressed by local diversification operation. Then, we propose the Adaptive Channel Mixing (ACM), a framework to adaptively exploit aggregation, diversification and identity channels to extract richer localized information in each baseline GNN layer. ACM is more powerful than the commonly used uni-channel framework for node classification tasks on heterophilic graphs. When evaluated on 10 benchmark node classification tasks, ACM-augmented baselines consistently achieve significant performance gain, exceeding state-of-the-art GNNs on most tasks without incurring significant computational burden. (Code: https://github.com/SitaoLuan/ACM-GNN)

JAIR Journal 2022 Journal Article

Towards Continual Reinforcement Learning: A Review and Perspectives

  • Khimya Khetarpal
  • Matthew Riemer
  • Irina Rish
  • Doina Precup

In this article, we aim to provide a literature review of different formulations and approaches to continual reinforcement learning (RL), also known as lifelong or non-stationary RL. We begin by discussing our perspective on why RL is a natural fit for studying continual learning. We then provide a taxonomy of different continual RL formulations by mathematically characterizing two key properties of non-stationarity, namely, the scope and driver of non-stationarity. This offers a unified view of various formulations. Next, we review and present a taxonomy of continual RL approaches. We go on to discuss evaluation of continual RL agents, providing an overview of benchmarks used in the literature and important metrics for understanding agent performance. Finally, we highlight open problems and challenges in bridging the gap between the current state of continual RL and findings in neuroscience. While still in its early days, the study of continual RL has the promise to develop better incremental reinforcement learners that can function in increasingly realistic applications where non-stationarity plays a vital role. These include applications such as those in the fields of healthcare, education, logistics, and robotics.

UAI Conference 2022 Conference Paper

Towards painless policy optimization for constrained MDPs

  • Arushi Jain
  • Sharan Vaswani
  • Reza Babanezhad 0001
  • Csaba Szepesvári
  • Doina Precup

We study policy optimization in an infinite horizon, $\gamma$-discounted constrained Markov decision process (CMDP). Our objective is to return a policy that achieves large expected reward with a small constraint violation. We consider the online setting with linear function approximation and assume global access to the corresponding features. We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms in terms of their primal and dual regret on online linear optimization problems. We instantiate this framework to use coin-betting algorithms and propose the \textbf{Coin Betting Politex (CBP)} algorithm. Assuming that the action-value functions are $\epsilon_{\text{\tiny{b}}}$-close to the span of the $d$-dimensional state-action features and no sampling errors, we prove that $T$ iterations of CBP result in an $O\left(\frac{1}{(1 - \gamma)^3 \sqrt{T}} + \frac{\epsilon_{\text{\tiny{b}}} \sqrt{d}}{(1 - \gamma)^2} \right)$ reward sub-optimality and an $O\left(\frac{1}{(1 - \gamma)^2 \sqrt{T}} + \frac{\epsilon_{\text{\tiny{b}}} \sqrt{d}}{1 - \gamma} \right)$ constraint violation. Importantly, unlike gradient descent-ascent and other recent methods, CBP does not require extensive hyperparameter tuning. Via experiments on synthetic and Cartpole environments, we demonstrate the effectiveness and robustness of CBP.

ICML Conference 2022 Conference Paper

Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error

  • Scott Fujimoto
  • David Meger
  • Doina Precup
  • Ofir Nachum
  • Shixiang Gu

In this work, we study the use of the Bellman equation as a surrogate objective for value prediction accuracy. While the Bellman equation is uniquely solved by the true value function over all state-action pairs, we find that the Bellman error (the difference between both sides of the equation) is a poor proxy for the accuracy of the value function. In particular, we show that (1) due to cancellations from both sides of the Bellman equation, the magnitude of the Bellman error is only weakly related to the distance to the true value function, even when considering all state-action pairs, and (2) in the finite data regime, the Bellman equation can be satisfied exactly by infinitely many suboptimal solutions. This means that the Bellman error can be minimized without improving the accuracy of the value function. We demonstrate these phenomena through a series of propositions, illustrative toy examples, and empirical analysis in standard benchmark domains.
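A tiny numeric example (not from the paper) illustrates how weak the relationship can be: adding a constant offset c to the true value function of a two-state chain produces a value error of size c but a Bellman error of only (1 - gamma) * c.

```python
import numpy as np

gamma = 0.99
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])        # deterministic 2-state cycle under a fixed policy
r = np.array([1.0, 0.0])
v_true = np.linalg.solve(np.eye(2) - gamma * P, r)

c = 100.0
v_hat = v_true + c                # badly wrong values, shifted by a constant
bellman_error = v_hat - (r + gamma * P @ v_hat)

print(np.abs(v_hat - v_true).max())   # value error: 100.0
print(np.abs(bellman_error).max())    # Bellman error: about (1 - 0.99) * 100 = 1.0
```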

NeurIPS Conference 2021 Conference Paper

A Consciousness-Inspired Planning Agent for Model-Based Reinforcement Learning

  • Mingde Zhao
  • Zhen Liu
  • Sitao Luan
  • Shuyuan Zhang
  • Doina Precup
  • Yoshua Bengio

We present an end-to-end, model-based deep reinforcement learning agent which dynamically attends to relevant parts of its state during planning. The agent uses a bottleneck mechanism over a set-based representation to force the number of entities to which the agent attends at each planning step to be small. In experiments, we investigate the bottleneck mechanism with several sets of customized environments featuring different challenges. We consistently observe that the design allows the planning agents to generalize their learned task-solving abilities in compatible unseen environments by attending to the relevant objects, leading to better out-of-distribution generalization performance.

ICML Conference 2021 Conference Paper

A Deep Reinforcement Learning Approach to Marginalized Importance Sampling with the Successor Representation

  • Scott Fujimoto
  • David Meger
  • Doina Precup

Marginalized importance sampling (MIS), which measures the density ratio between the state-action occupancy of a target policy and that of a sampling distribution, is a promising approach for off-policy evaluation. However, current state-of-the-art MIS methods rely on complex optimization tricks and succeed mostly on simple toy problems. We bridge the gap between MIS and deep reinforcement learning by observing that the density ratio can be computed from the successor representation of the target policy. The successor representation can be trained through deep reinforcement learning methodology and decouples the reward optimization from the dynamics of the environment, making the resulting algorithm stable and applicable to high-dimensional domains. We evaluate the empirical performance of our approach on a variety of challenging Atari and MuJoCo environments.
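In the tabular case, the connection rests on a standard identity: the discounted state occupancy of the target policy is d_pi = (1 - gamma) * mu0^T (I - gamma P_pi)^{-1}, i.e., a linear function of the successor representation, and dividing by the sampling distribution yields the density ratio. The sketch below illustrates only this identity; the paper learns the deep, state-action analogue.

```python
import numpy as np

def occupancy_from_sr(P_pi, mu0, gamma=0.99):
    """Discounted state occupancy of a target policy expressed through its
    successor representation. P_pi is the state transition matrix under the
    policy and mu0 the initial-state distribution. Dividing the result by an
    estimate of the sampling distribution gives the MIS density ratio
    (tabular sketch only)."""
    sr = np.linalg.inv(np.eye(P_pi.shape[0]) - gamma * P_pi)
    return (1.0 - gamma) * mu0 @ sr
```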

NeurIPS Conference 2021 Conference Paper

Flexible Option Learning

  • Martin Klissarov
  • Doina Precup

Temporal abstraction in reinforcement learning (RL) offers the promise of improving generalization and knowledge transfer in complex environments, by propagating information more efficiently over time. Although option learning was initially formulated in a way that allows updating many options simultaneously, using off-policy, intra-option learning (Sutton, Precup & Singh, 1999), many of the recent hierarchical reinforcement learning approaches only update a single option at a time: the option currently executing. We revisit and extend intra-option learning in the context of deep reinforcement learning, in order to enable updating all options consistent with current primitive action choices, without introducing any additional estimates. Our method can therefore be naturally adopted in most hierarchical RL frameworks. When we combine our approach with the option-critic algorithm for option discovery, we obtain significant improvements in performance and data-efficiency across a wide variety of domains.

NeurIPS Conference 2021 Conference Paper

Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation

  • Emmanuel Bengio
  • Moksh Jain
  • Maksym Korablyov
  • Doina Precup
  • Yoshua Bengio

This paper is about the problem of learning a stochastic policy for generating an object (like a molecular graph) from a sequence of actions, such that the probability of generating an object is proportional to a given positive reward for that object. Whereas standard return maximization tends to converge to a single return-maximizing sequence, there are cases where we would like to sample a diverse set of high-return solutions. These arise, for example, in black-box function optimization when few rounds are possible, each with large batches of queries, where the batches should be diverse, e.g., in the design of new molecules. One can also see this as a problem of approximately converting an energy function to a generative distribution. While MCMC methods can achieve that, they are expensive and generally only perform local exploration. Instead, training a generative policy amortizes the cost of search during training and yields fast generation. Using insights from Temporal Difference learning, we propose GFlowNet, based on a view of the generative process as a flow network, making it possible to handle the tricky case where different trajectories can yield the same final state, e.g., there are many ways to sequentially add atoms to generate some molecular graph. We cast the set of trajectories as a flow and convert the flow consistency equations into a learning objective, akin to the casting of the Bellman equations into Temporal Difference methods. We prove that any global minimum of the proposed objectives yields a policy which samples from the desired distribution, and demonstrate the improved performance and diversity of GFlowNet on a simple domain where there are many modes to the reward function, and on a molecule synthesis task.
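The flow-consistency idea can be written per state: the total flow entering a state must match the reward received there (zero for non-terminal states) plus the total flow leaving it. The sketch below computes one state's squared log-residual, assuming the per-edge flows are supplied by some model; the smoothing constant is illustrative.

```python
import numpy as np

def flow_matching_loss(in_flows, out_flows, reward, eps=1e-8):
    """Sketch of a per-state flow-consistency term: inflow should equal the
    terminal reward plus outflow. The residual is taken in log space for
    numerical stability; eps is an illustrative smoothing constant."""
    inflow = np.log(eps + np.sum(in_flows))
    outflow = np.log(eps + reward + np.sum(out_flows))
    return (inflow - outflow) ** 2
```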

NeurIPS Conference 2021 Conference Paper

Gradient Starvation: A Learning Proclivity in Neural Networks

  • Mohammad Pezeshki
  • Oumar Kaba
  • Yoshua Bengio
  • Aaron C. Courville
  • Doina Precup
  • Guillaume Lajoie

We identify and formalize a fundamental gradient descent phenomenon resulting in a learning proclivity in over-parameterized neural networks. Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task, despite the presence of other predictive features that fail to be discovered. This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks. Using tools from Dynamical Systems theory, we identify simple properties of learning dynamics during gradient descent that lead to this imbalance, and prove that such a situation can be expected given certain statistical structure in training data. Based on our proposed formalism, we develop guarantees for a novel regularization method aimed at decoupling feature learning dynamics, improving accuracy and robustness in cases hindered by gradient starvation. We illustrate our findings with simple and real-world out-of-distribution (OOD) generalization experiments.

ICML Conference 2021 Conference Paper

Locally Persistent Exploration in Continuous Control Tasks with Sparse Rewards

  • Susan Amin
  • Maziar Gomrokchi
  • Hossein Aboutalebi
  • Harsh Satija
  • Doina Precup

A major challenge in reinforcement learning is the design of exploration strategies, especially for environments with sparse reward structures and continuous state and action spaces. Intuitively, if the reinforcement signal is very scarce, the agent should rely on some form of short-term memory in order to cover its environment efficiently. We propose a new exploration method, based on two intuitions: (1) the choice of the next exploratory action should depend not only on the (Markovian) state of the environment, but also on the agent’s trajectory so far, and (2) the agent should utilize a measure of spread in the state space to avoid getting stuck in a small region. Our method leverages concepts often used in statistical physics to provide explanations for the behavior of simplified (polymer) chains in order to generate persistent (locally self-avoiding) trajectories in state space. We discuss the theoretical properties of locally self-avoiding walks and their ability to provide a kind of short-term memory through a decaying temporal correlation within the trajectory. We provide empirical evaluations of our approach in a simulated 2D navigation task, as well as higher-dimensional MuJoCo continuous control locomotion tasks with sparse rewards.

NeurIPS Conference 2021 Conference Paper

On the Expressivity of Markov Reward

  • David Abel
  • Will Dabney
  • Anna Harutyunyan
  • Mark K. Ho
  • Michael Littman
  • Doina Precup
  • Satinder Singh

Reward is the driving force for reinforcement-learning agents. This paper is dedicated to understanding the expressivity of reward as a way to capture tasks that we would want an agent to perform. We frame this study around three new abstract notions of “task” that might be desirable: (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories. Our main results prove that while reward can express many of these tasks, there exist instances of each task type that no Markov reward function can capture. We then provide a set of polynomial-time algorithms that construct a Markov reward function that allows an agent to optimize tasks of each of these three types, and correctly determine when no such reward function exists. We conclude with an empirical study that corroborates and illustrates our theoretical findings.

ICML Conference 2021 Conference Paper

Preferential Temporal Difference Learning

  • Nishanth V. Anand
  • Doina Precup

Temporal-Difference (TD) learning is a general and very useful tool for estimating the value function of a given policy, which in turn is required to find good policies. Generally speaking, TD learning updates states whenever they are visited. When the agent lands in a state, its value can be used to compute the TD-error, which is then propagated to other states. However, it may be interesting, when computing updates, to take into account other information than whether a state is visited or not. For example, some states might be more important than others (such as states which are frequently seen in a successful trajectory). Or, some states might have unreliable value estimates (for example, due to partial observability or lack of data), making their values less desirable as targets. We propose an approach to re-weighting states used in TD updates, both when they are the input and when they provide the target for the update. We prove that our approach converges with linear function approximation and illustrate its desirable empirical behaviour compared to other TD-style methods.
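A heavily simplified sketch of the input-side re-weighting: a preference beta(s) in [0, 1] scales how strongly each state's value is updated. The paper additionally re-weights states when they serve as bootstrapping targets, which is omitted here; this is illustrative only, not the paper's exact update.

```python
def preferential_td_sketch(transitions, beta, n_states, gamma=0.99, alpha=0.1):
    """Rough sketch: beta[s] in [0, 1] scales how strongly state s is updated
    by the TD error; low-preference states are barely moved. Target-side
    re-weighting from the paper is omitted."""
    v = [0.0] * n_states
    for s, r, s_next, done in transitions:
        target = r + (0.0 if done else gamma * v[s_next])
        v[s] += alpha * beta[s] * (target - v[s])
    return v
```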

ICML Conference 2021 Conference Paper

Randomized Exploration in Reinforcement Learning with General Value Function Approximation

  • Haque Ishfaq
  • Qiwen Cui
  • Viet Nguyen
  • Alex Ayoub
  • Zhuoran Yang
  • Zhaoran Wang 0001
  • Doina Precup
  • Lin F. Yang

We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm as well as the optimism principle. Unlike existing upper-confidence-bound (UCB) based approaches, which are often computationally intractable, our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises. To attain optimistic value function estimation without resorting to a UCB-style bonus, we introduce an optimistic reward sampling procedure. When the value functions can be represented by a function class $\mathcal{F}$, our algorithm achieves a worst-case regret bound of $\tilde{O}(\mathrm{poly}(d_EH)\sqrt{T})$ where $T$ is the time elapsed, $H$ is the planning horizon and $d_E$ is the \emph{eluder dimension} of $\mathcal{F}$. In the linear setting, our algorithm reduces to LSVI-PHE, a variant of RLSVI, that enjoys an $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret. We complement the theory with an empirical evaluation across known difficult exploration tasks.
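A rough sketch of the perturbation idea in the linear setting: fit several ridge-regression Q estimates on independently noise-perturbed targets and act optimistically with respect to their pointwise maximum. The noise scale, ensemble size, and regularizer below are illustrative assumptions, not the paper's exact LSVI-PHE procedure.

```python
import numpy as np

def perturbed_lsvi_step(features, targets, n_samples=10, noise_std=1.0,
                        reg=1.0, rng=np.random.default_rng()):
    """Sketch of exploration via perturbed regression targets: fit several
    ridge-regression Q estimates, each on targets perturbed with i.i.d.
    Gaussian noise, then take the pointwise maximum as an optimistic estimate."""
    d = features.shape[1]
    A = features.T @ features + reg * np.eye(d)
    q_samples = []
    for _ in range(n_samples):
        noisy = targets + rng.normal(scale=noise_std, size=targets.shape)
        w = np.linalg.solve(A, features.T @ noisy)
        q_samples.append(features @ w)
    return np.max(q_samples, axis=0)   # optimistic Q values on the given features
```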

KER Journal 2021 Journal Article

Safe option-critic: learning safety in the option-critic architecture

  • Arushi Jain
  • Khimya Khetarpal
  • Doina Precup

Designing hierarchical reinforcement learning algorithms that exhibit safe behaviour is not only vital for practical applications but also facilitates a better understanding of an agent’s decisions. We tackle this problem in the options framework (Sutton, Precup & Singh, 1999), a particular way to specify temporally abstract actions which allow an agent to use sub-policies with start and end conditions. We consider a behaviour safe if it avoids regions of state space with high uncertainty in the outcomes of actions. We propose an optimization objective that learns safe options by encouraging the agent to visit states with higher behavioural consistency. The proposed objective results in a trade-off between maximizing the standard expected return and minimizing the effect of model uncertainty in the return. We propose a policy gradient algorithm to optimize the constrained objective function. We examine the quantitative and qualitative behaviours of the proposed approach in a tabular grid world, continuous-state puddle world, and three games from the Arcade Learning Environment: Ms. Pacman, Amidar, and Q*Bert. Our approach achieves a reduction in the variance of return, boosts performance in environments with intrinsic variability in the reward structure, and compares favourably both with primitive actions and with risk-neutral options.

AAAI Conference 2021 Conference Paper

Self-Supervised Attention-Aware Reinforcement Learning

  • Haiping Wu
  • Khimya Khetarpal
  • Doina Precup

Visual saliency has emerged as a major visualization tool for interpreting deep reinforcement learning (RL) agents. However, much of the existing research uses it as an analyzing tool rather than an inductive bias for policy learning. In this work, we use visual attention as an inductive bias for RL agents. We propose a novel self-supervised attention learning approach which can 1. learn to select regions of interest without explicit annotations, and 2. act as a plug-in for existing deep RL methods to improve the learning performance. We empirically show that the self-supervised attention-aware deep RL methods outperform the baselines in the context of both the rate of convergence and performance. Furthermore, the proposed self-supervised attention is not tied to specific policies, nor restricted to a specific scene. We posit that the proposed approach is a general self-supervised attention module for multi-task learning and transfer learning, and empirically validate the generalization ability of the proposed method. Finally, we show that our method learns meaningful object keypoints highlighting improvements both qualitatively and quantitatively.

NeurIPS Conference 2021 Conference Paper

Temporally Abstract Partial Models

  • Khimya Khetarpal
  • Zafarali Ahmed
  • Gheorghe Comanici
  • Doina Precup

Humans and animals have the ability to reason and make predictions about different courses of action at many time scales. In reinforcement learning, option models (Sutton, Precup & Singh, 1999; Precup, 2000) provide the framework for this kind of temporally abstract prediction and reasoning. Natural intelligent agents are also able to focus their attention on courses of action that are relevant or feasible in a given situation, sometimes termed affordable actions. In this paper, we define a notion of affordances for options, and develop temporally abstract partial option models, that take into account the fact that an option might be affordable only in certain situations. We analyze the trade-offs between estimation and approximation error in planning and learning when using such models, and identify some interesting special cases. Additionally, we empirically demonstrate the ability to learn both affordances and partial option models online resulting in improved sample efficiency and planning time in the Taxi domain.

AAAI Conference 2021 Conference Paper

Variance Penalized On-Policy and Off-Policy Actor-Critic

  • Arushi Jain
  • Gandharv Patil
  • Ayush Jain
  • Khimya Khetarpal
  • Doina Precup

Reinforcement learning algorithms are typically geared towards optimizing the expected return of an agent. However, in many practical applications, low variance in the return is desired to ensure the reliability of an algorithm. In this paper, we propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return. Previous work uses the second moment of return to estimate the variance indirectly. Instead, we use a much simpler recently proposed direct variance estimator which updates the estimates incrementally using temporal difference methods. Using the variance-penalized criterion, we guarantee the convergence of our algorithm to locally optimal policies for finite state action Markov decision processes. We demonstrate the utility of our algorithm in tabular and continuous MuJoCo domains. Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return.

AAAI Conference 2020 Conference Paper

Algorithmic Improvements for Deep Reinforcement Learning Applied to Interactive Fiction

  • Vishal Jain
  • William Fedus
  • Hugo Larochelle
  • Doina Precup
  • Marc G. Bellemare

Text-based games are a natural challenge domain for deep reinforcement learning algorithms. Their state and action spaces are combinatorially large, their reward function is sparse, and they are partially observable: the agent is informed of the consequences of its actions through textual feedback. In this paper we emphasize this latter point and consider the design of a deep reinforcement learning agent that can play from feedback alone. Our design recognizes and takes advantage of the structural characteristics of text-based games. We first propose a contextualisation mechanism, based on accumulated reward, which simplifies the learning problem and mitigates partial observability. We then study different methods that rely on the notion that most actions are ineffectual in any given situation, following Zahavy et al.’s idea of an admissible action. We evaluate these techniques in a series of text-based games of increasing difficulty based on the TextWorld framework, as well as the iconic game ZORK. Empirically, we find that these techniques improve the performance of a baseline deep reinforcement learning agent applied to text-based games.

NeurIPS Conference 2020 Conference Paper

An Equivalence between Loss Functions and Non-Uniform Sampling in Experience Replay

  • Scott Fujimoto
  • David Meger
  • Doina Precup

Prioritized Experience Replay (PER) is a deep reinforcement learning technique in which agents learn from transitions sampled with non-uniform probability proportionate to their temporal-difference error. We show that any loss function evaluated with non-uniformly sampled data can be transformed into another uniformly sampled loss function with the same expected gradient. Surprisingly, we find in some environments PER can be replaced entirely by this new loss function without impact to empirical performance. Furthermore, this relationship suggests a new branch of improvements to PER by correcting its uniformly sampled loss function equivalent. We demonstrate the effectiveness of our proposed modifications to PER and the equivalent loss function in several MuJoCo and Atari environments.
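The core equivalence can be checked numerically in a few lines: sampling index i with probability p_i and using an unweighted loss has the same expected gradient as sampling uniformly and scaling each sample's loss by N * p_i. This is a sketch of the general statement with synthetic per-transition gradients, not the paper's full construction.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
grads = rng.normal(size=N)                 # synthetic per-transition loss gradients
priorities = rng.random(N)
p = priorities / priorities.sum()          # PER-style sampling distribution

expected_grad_per = np.sum(p * grads)              # sample i ~ p, unweighted loss
expected_grad_uniform = np.mean(N * p * grads)     # sample uniformly, reweighted loss

print(np.isclose(expected_grad_per, expected_grad_uniform))   # True
```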

NeurIPS Conference 2020 Conference Paper

Forethought and Hindsight in Credit Assignment

  • Veronica Chelu
  • Doina Precup
  • Hado P. van Hasselt

We address the problem of credit assignment in reinforcement learning and explore fundamental questions regarding the way in which an agent can best use additional computation to propagate new information, by planning with internal models of the world to improve its predictions. Particularly, we work to understand the gains and peculiarities of planning employed as forethought via forward models or as hindsight operating with backward models. We establish the relative merits, limitations and complementary properties of both planning mechanisms in carefully constructed scenarios. Further, we investigate the best use of models in planning, primarily focusing on the selection of states in which predictions should be (re)-evaluated. Lastly, we discuss the issue of model estimation and highlight a spectrum of methods that stretch from environment dynamics predictors to planner-aware models.

AAAI Conference 2020 Short Paper

Gifting in Multi-Agent Reinforcement Learning (Student Abstract)

  • Andrei Lupu
  • Doina Precup

This work performs a first study on multi-agent reinforcement learning with deliberate reward passing between agents. We empirically demonstrate that such mechanics can greatly improve the learning progression in a resource appropriation setting and provide a preliminary discussion of the complex effects of gifting on the learning dynamics.

ICML Conference 2020 Conference Paper

Interference and Generalization in Temporal Difference Learning

  • Emmanuel Bengio
  • Joelle Pineau
  • Doina Precup

We study the link between generalization and interference in temporal-difference (TD) learning. Interference is defined as the inner product of two different gradients, representing their alignment; this quantity emerges as being of interest from a variety of observations about neural networks, parameter sharing and the dynamics of learning. We find that TD easily leads to low-interference, under-generalizing parameters, while the effect seems reversed in supervised learning. We hypothesize that the cause can be traced back to the interplay between the dynamics of interference and bootstrapping. This is supported empirically by several observations: the negative relationship between the generalization gap and interference in TD, the negative effect of bootstrapping on interference and the local coherence of targets, and the contrast between the propagation rate of information in TD(0) versus TD($\lambda$) and regression tasks such as Monte-Carlo policy evaluation. We hope that these new findings can guide the future discovery of better bootstrapping methods.
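The quantity studied is straightforward to compute: interference between two examples is the inner product of their loss gradients with respect to the shared parameters, as in the sketch below.

```python
import numpy as np

def interference(grad_a, grad_b):
    """Interference between two examples: the inner product of their loss
    gradients with respect to the shared parameters. Positive values mean an
    update on one example also reduces the loss on the other; negative values
    mean the updates conflict."""
    return float(np.dot(grad_a.ravel(), grad_b.ravel()))
```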

ICML Conference 2020 Conference Paper

Invariant Causal Prediction for Block MDPs

  • Amy Zhang 0001
  • Clare Lyle
  • Shagun Sodhani
  • Angelos Filos
  • Marta Kwiatkowska
  • Joelle Pineau
  • Yarin Gal
  • Doina Precup

Generalization across environments is critical to the successful application of reinforcement learning (RL) algorithms to real-world challenges. In this work we propose a method for learning state abstractions which generalize to novel observation distributions in the multi-environment RL setting. We prove that for certain classes of environments, this approach outputs, with high probability, a state abstraction corresponding to the causal feature set with respect to the return. We give empirical evidence that analogous methods for the nonlinear setting can also attain improved generalization over single- and multi-task baselines. Lastly, we provide bounds on model generalization error in the multi-environment setting, in the process showing a connection between causal variable identification and the state abstraction framework for MDPs.

NeurIPS Conference 2020 Conference Paper

On Efficiency in Hierarchical Reinforcement Learning

  • Zheng Wen
  • Doina Precup
  • Morteza Ibrahimi
  • Andre Barreto
  • Benjamin Van Roy
  • Satinder Singh

Hierarchical Reinforcement Learning (HRL) approaches promise to provide more efficient solutions to sequential decision making problems, both in terms of statistical as well as computational efficiency. While this has been demonstrated empirically over time in a variety of tasks, theoretical results quantifying the benefits of such methods are still few and far between. In this paper, we discuss the kind of structure in a Markov decision process which gives rise to efficient HRL methods. Specifically, we formalize the intuition that HRL can exploit well repeating "subMDPs", with similar reward and transition structure. We show that, under reasonable assumptions, a model-based Thompson sampling-style HRL algorithm that exploits this structure is statistically efficient, as established through a finite-time regret bound. We also establish conditions under which planning with structure-induced options is near-optimal and computationally efficient.

AAAI Conference 2020 Conference Paper

Options of Interest: Temporal Abstraction with Interest Functions

  • Khimya Khetarpal
  • Martin Klissarov
  • Maxime Chevalier-Boisvert
  • Pierre-Luc Bacon
  • Doina Precup

Temporal abstraction refers to the ability of an agent to use behaviours of controllers which act for a limited, variable amount of time. The options framework describes such behaviours as consisting of a subset of states in which they can initiate, an internal policy and a stochastic termination condition. However, much of the subsequent work on option discovery has ignored the initiation set, because of difficulty in learning it from data. We provide a generalization of initiation sets suitable for general function approximation, by defining an interest function associated with an option. We derive a gradient-based learning algorithm for interest functions, leading to a new interest-option-critic architecture. We investigate how interest functions can be leveraged to learn interpretable and reusable temporal abstractions. We demonstrate the efficacy of the proposed approach through quantitative and qualitative results, in both discrete and continuous environments.

NeurIPS Conference 2020 Conference Paper

Reward Propagation Using Graph Convolutional Networks

  • Martin Klissarov
  • Doina Precup

Potential-based reward shaping provides an approach for designing good reward functions, with the purpose of speeding up learning. However, automatically finding potential functions for complex environments is a difficult problem (in fact, of the same difficulty as learning a value function from scratch). We propose a new framework for learning potential functions by leveraging ideas from graph representation learning. Our approach relies on Graph Convolutional Networks which we use as a key ingredient in combination with the probabilistic inference view of reinforcement learning. More precisely, we leverage Graph Convolutional Networks to perform message passing from rewarding states. The propagated messages can then be used as potential functions for reward shaping to accelerate learning. We verify empirically that our approach can achieve considerable improvements in both small and high-dimensional control problems.
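Once a potential function is available, the shaping term takes the standard potential-based form F(s, s') = gamma * Phi(s') - Phi(s). The sketch below applies a given potential; in the paper the potential itself comes from GCN message passing from rewarding states, which is not reproduced here.

```python
def shaped_reward(reward, potential_s, potential_s_next, gamma=0.99):
    """Potential-based shaping: add F(s, s') = gamma * Phi(s') - Phi(s) to the
    environment reward. Here the potentials are simply taken as input."""
    return reward + gamma * potential_s_next - potential_s
```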

IJCAI Conference 2020 Conference Paper

SVRG for Policy Evaluation with Fewer Gradient Evaluations

  • Zilun Peng
  • Ahmed Touati
  • Pascal Vincent
  • Doina Precup

Stochastic variance-reduced gradient (SVRG) is an optimization method originally designed for tackling machine learning problems with a finite sum structure. SVRG was later shown to work for policy evaluation, a problem in reinforcement learning in which one aims to estimate the value function of a given policy. SVRG makes use of gradient estimates at two scales. At the slower scale, SVRG computes a full gradient over the whole dataset, which could lead to prohibitive computation costs. In this work, we show that two variants of SVRG for policy evaluation could significantly diminish the number of gradient calculations while preserving a linear convergence speed. More importantly, our theoretical result implies that one does not need to use the entire dataset in every epoch of SVRG when it is applied to policy evaluation with linear function approximation. Our experiments demonstrate large computational savings provided by the proposed methods.
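For reference, the generic SVRG scheme the abstract starts from looks as follows: each epoch computes a full gradient at a snapshot, then takes inner steps whose stochastic gradients are corrected by the snapshot. This is a generic sketch with hypothetical callables for the per-sample and full gradients; the paper's contribution is precisely the variants that avoid the full-dataset gradient, which are not shown here.

```python
import numpy as np

def svrg(grad_i, full_grad, w0, n_data, n_epochs=10, inner_steps=100,
         lr=0.01, rng=np.random.default_rng()):
    """Generic SVRG sketch: at each epoch compute a full gradient at a
    snapshot, then take inner steps whose stochastic gradients are corrected
    by the snapshot, removing most of their variance. grad_i(w, i) and
    full_grad(w) are user-supplied gradient oracles."""
    w = w0.copy()
    for _ in range(n_epochs):
        w_snap = w.copy()
        mu = full_grad(w_snap)                         # full gradient at the snapshot
        for _ in range(inner_steps):
            i = rng.integers(n_data)
            g = grad_i(w, i) - grad_i(w_snap, i) + mu  # variance-reduced gradient
            w -= lr * g
    return w
```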

NeurIPS Conference 2020 Conference Paper

Value-driven Hindsight Modelling

  • Arthur Guez
  • Fabio Viola
  • Theophane Weber
  • Lars Buesing
  • Steven Kapturowski
  • Doina Precup
  • David Silver
  • Nicolas Heess

Value estimation is a critical component of the reinforcement learning (RL) paradigm. The question of how to effectively learn value predictors from data is one of the major problems studied by the RL community, and different approaches exploit structure in the problem domain in different ways. Model learning can make use of the rich transition structure present in sequences of observations, but this approach is usually not sensitive to the reward function. In contrast, model-free methods directly leverage the quantity of interest from the future, but receive a potentially weak scalar signal (an estimate of the return). We develop an approach for representation learning in RL that sits in between these two extremes: we propose to learn what to model in a way that can directly help value prediction. To this end, we determine which features of the future trajectory provide useful information to predict the associated return. This provides tractable prediction targets that are directly relevant for a task, and can thus accelerate learning the value function. The idea can be understood as reasoning, in hindsight, about which aspects of the future observations could help past value prediction. We show how this can help dramatically even in simple policy evaluation settings. We then test our approach at scale in challenging domains, including on 57 Atari 2600 games.

ICML Conference 2020 Conference Paper

What can I do here? A Theory of Affordances in Reinforcement Learning

  • Khimya Khetarpal
  • Zafarali Ahmed
  • Gheorghe Comanici
  • David Abel
  • Doina Precup

Reinforcement learning algorithms usually assume that all actions are always available to an agent. However, both people and animals understand the general link between the features of their environment and the actions that are feasible. Gibson (1977) coined the term "affordances" to describe the fact that certain states enable an agent to do certain actions, in the context of embodied agents. In this paper, we develop a theory of affordances for agents who learn and plan in Markov Decision Processes. Affordances play a dual role in this case. On one hand, they allow faster planning, by reducing the number of actions available in any given situation. On the other hand, they facilitate more efficient and precise learning of transition models from data, especially when such models require function approximation. We establish these properties through theoretical results as well as illustrative examples. We also propose an approach to learn affordances and use it to estimate transition models that are simpler and generalize better.

RLDM Conference 2019 Conference Abstract

A Top-down, Bottom-up Attention Model for Reinforcement Learning

  • Mehraveh Salehi
  • Eser Aygün
  • Shibl Mourad
  • Doina Precup

Reinforcement Learning (RL) agents typically have to process massive amounts of sensory data in order to execute a specific task. However, a large portion of the sensory input may not be directly related to the task at hand. Here, inspired by the human brain’s attention system, we develop a novel augmented attention mechanism for RL agents, which enables them to adaptively select the most relevant information from the input. In order to evaluate the proposed algorithms, we use an attention-demanding grid-world environment and compare our model’s performance against two other attentive agents and one naive agent. We demonstrate that our proposed augmented attention model outperforms other agents both in terms of scalability and ability to perform transfer learning.

NeurIPS Conference 2019 Conference Paper

Break the Ceiling: Stronger Multi-scale Deep Graph Convolutional Networks

  • Sitao Luan
  • Mingde Zhao
  • Xiao-Wen Chang
  • Doina Precup

Recently, neural network based approaches have achieved significant progress for solving large, complex, graph-structured problems. Nevertheless, the advantages of multi-scale information and deep architectures have not been sufficiently exploited. In this paper, we first analyze key factors constraining the expressive power of existing Graph Convolutional Networks (GCNs), including the activation function and shallow learning mechanisms. Then, we generalize spectral graph convolution and deep GCN in block Krylov subspace forms, upon which we devise two architectures, both scalable in depth while making use of multi-scale information differently. On several node classification tasks, the proposed architectures achieve state-of-the-art performance.

AAAI Conference 2019 Conference Paper

Combined Reinforcement Learning via Abstract Representations

  • Vincent Francois-Lavet
  • Yoshua Bengio
  • Doina Precup
  • Joelle Pineau

In the quest for efficient and robust reinforcement learning methods, both model-free and model-based approaches offer advantages. In this paper we propose a new way of explicitly bridging both approaches via a shared low-dimensional learned encoding of the environment, meant to capture summarizing abstractions. We show that the modularity brought by this approach leads to good generalization while being computationally efficient, with planning happening in a smaller latent state space. In addition, this approach recovers a sufficient low-dimensional representation of the environment, which opens up new strategies for interpretable AI, exploration and transfer learning.

NeurIPS Conference 2019 Conference Paper

Hindsight Credit Assignment

  • Anna Harutyunyan
  • Will Dabney
  • Thomas Mesnard
  • Mohammad Gheshlaghi Azar
  • Bilal Piot
  • Nicolas Heess
  • Hado van Hasselt
  • Gregory Wayne

We consider the problem of efficient credit assignment in reinforcement learning. In order to efficiently and meaningfully utilize new data, we propose to explicitly assign credit to past decisions based on the likelihood of them having led to the observed outcome. This approach uses new information in hindsight, rather than employing foresight. Somewhat surprisingly, we show that value functions can be rewritten through this lens, yielding a new family of algorithms. We study the properties of these algorithms, and empirically show that they successfully address important credit assignment challenges, through a set of illustrative tasks.

AAAI Conference 2019 Short Paper

Learning Options with Interest Functions

  • Khimya Khetarpal
  • Doina Precup

Learning temporal abstractions which are partial solutions to a task and could be reused for solving other tasks is an ingredient that can help agents to plan and learn efficiently. In this work, we tackle this problem in the options framework. We aim to autonomously learn options which are specialized in different state space regions by proposing a notion of interest functions, which generalizes initiation sets from the options framework for function approximation. We build on the option-critic framework to derive policy gradient theorems for interest functions, leading to a new interest-option-critic architecture.
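One way to picture an interest function is as a soft, state-dependent reweighting of the policy over options, so that each option is mostly selected in the state-space region it specializes in. The toy sketch below encodes only that intuition and is an assumption on our part, not the interest-option-critic update rules.

import numpy as np

# Toy sketch (hypothetical names): an interest value in [0, 1] per option reweights
# the policy over options in the current state, and the result is renormalized.
def option_distribution(pi_omega, interest):
    weighted = pi_omega * interest          # elementwise reweighting by interest
    return weighted / weighted.sum()        # renormalize to a distribution

pi_omega = np.array([0.5, 0.3, 0.2])        # base policy over 3 options in state s
interest = np.array([0.9, 0.1, 0.5])        # learned interest of each option in s
print(option_distribution(pi_omega, interest))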

AAAI Conference 2019 Conference Paper

Leveraging Observations in Bandits: Between Risks and Benefits

  • Andrei Lupu
  • Audrey Durand
  • Doina Precup

Imitation learning has been widely used to speed up learning in novice agents, by allowing them to leverage existing data from experts. Allowing an agent to be influenced by external observations can benefit the learning process, but it also puts the agent at risk of following sub-optimal behaviours. In this paper, we study this problem in the context of bandits. More specifically, we consider that an agent (learner) is interacting with a bandit-style decision task, but can also observe a target policy interacting with the same environment. The learner observes only the target’s actions, not the rewards obtained. We introduce a new bandit optimism modifier that uses conditional optimism contingent on the actions of the target in order to guide the agent’s exploration. We analyze the effect of this modification on the well-known Upper Confidence Bound algorithm by proving that it preserves a regret upper-bound of order O(ln T), even in the presence of a very poor target, and we derive the dependency of the expected regret on the general target policy. We provide empirical results showing both great benefits as well as certain limitations inherent to observational learning in the multi-armed bandit setting. Experiments are conducted using targets satisfying theoretical assumptions with high probability, thus narrowing the gap between theory and application.
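A hedged sketch of the general idea, not the paper's exact modifier: a UCB learner whose index receives extra optimism for arms the observed target keeps choosing. The class name, the boost parameter, and the way target counts enter the index are illustrative assumptions.

import numpy as np

class ObservationalUCB:
    """UCB with an additional optimism term driven by the target's observed actions."""
    def __init__(self, n_arms, boost=0.5):
        self.counts = np.zeros(n_arms)         # learner's own pull counts
        self.means = np.zeros(n_arms)          # empirical means of observed rewards
        self.target_counts = np.zeros(n_arms)  # how often the target chose each arm
        self.boost = boost
        self.t = 0

    def observe_target(self, arm):
        self.target_counts[arm] += 1

    def select(self):
        self.t += 1
        if np.any(self.counts == 0):
            return int(np.argmin(self.counts))           # pull each arm once first
        ucb = self.means + np.sqrt(2 * np.log(self.t) / self.counts)
        target_freq = self.target_counts / max(1.0, self.target_counts.sum())
        return int(np.argmax(ucb + self.boost * target_freq))  # conditional optimism

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]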

RLDM Conference 2019 Conference Abstract

Off-Policy Deep Reinforcement Learning without Exploration

  • Scott Fujimoto
  • David Meger
  • Doina Precup

Many practical applications of reinforcement learning constrain agents to learn from a fixed batch of data which has already been gathered, without offering further possibility for data collection. In this paper, we demonstrate that due to errors introduced by extrapolation, standard off-policy deep reinforcement learning algorithms, such as DQN and DDPG, are incapable of learning with data uncorrelated to the distribution under the current policy, making them ineffective for this fixed batch setting. We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data. We present the first continuous control deep reinforcement learning algorithm which can learn effectively from arbitrary, fixed batch data, and empirically demonstrate the quality of its behavior in several tasks.

ICML Conference 2019 Conference Paper

Off-Policy Deep Reinforcement Learning without Exploration

  • Scott Fujimoto
  • David Meger
  • Doina Precup

Many practical applications of reinforcement learning constrain agents to learn from a fixed batch of data which has already been gathered, without offering further possibility for data collection. In this paper, we demonstrate that due to errors introduced by extrapolation, standard off-policy deep reinforcement learning algorithms, such as DQN and DDPG, are incapable of learning with data uncorrelated to the distribution under the current policy, making them ineffective for this fixed batch setting. We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data. We present the first continuous control deep reinforcement learning algorithm which can learn effectively from arbitrary, fixed batch data, and empirically demonstrate the quality of its behavior in several tasks.
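The paper's algorithm targets continuous control with a generative model of the batch; purely as intuition for the batch-constrained idea, here is a simplified discrete-action sketch in which the greedy choice is restricted to actions that actually appear in the batch for a given state. The function name and the min_count threshold are illustrative assumptions.

import numpy as np

# Simplified, hedged sketch of a batch-constrained greedy step in a discrete setting.
def batch_constrained_greedy(Q, batch_action_counts, state, min_count=1):
    supported = np.flatnonzero(batch_action_counts[state] >= min_count)
    if supported.size == 0:                  # no data for this state: fall back to plain greedy
        return int(np.argmax(Q[state]))
    return int(supported[np.argmax(Q[state, supported])])

Q = np.array([[1.0, 5.0, 0.2]])              # Q-values for a single state, 3 actions
batch_counts = np.array([[4, 0, 7]])         # action 1 never appears in the batch for this state
print(batch_constrained_greedy(Q, batch_counts, state=0))  # chooses among actions {0, 2} only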

RLDM Conference 2019 Conference Abstract

Off-Policy Policy Gradient Theorem with Logarithmic Mappings

  • Riashat Islam
  • Zafarali Ahmed
  • Pierre-Luc Bacon
  • Doina Precup

The policy gradient theorem (Sutton et al. 2000) is a fundamental result in reinforcement learning, and a class of continuous control tasks relies on policy gradient methods. However, most of these algorithms rely on samples collected on-policy. While several approaches have been proposed for off-policy policy gradients, there is still no off-policy policy gradient theorem that can readily be adapted for deep reinforcement learning tasks. Off-policy gradient methods are often difficult to use in practice due to the need for an importance sampling correction, which can have unbounded variance. In this paper, we propose a derivation of an off-policy policy gradient theorem which can completely avoid high-variance importance sampling corrections. Towards this goal, we establish a policy gradient theorem based on a non-linear Bellman equation (logarithmic mappings of the value function). We show that using logarithmic mappings of the policy gradient objective yields a lower bound on the policy gradient, while avoiding importance sampling, and we derive the gradient estimate for experiences sampled under a behaviour policy. We further develop an off-policy actor-critic algorithm and suggest that the proposed off-policy gradient can be used for deep reinforcement learning tasks, for both discrete and continuous action spaces.

RLDM Conference 2019 Conference Abstract

Option discovery by aiming to predict

  • Veronica Chelu
  • Doina Precup

We approach the task of knowledge acquisition and option discovery of a reinforcement learning agent using predictive representations about the dynamics of the environment with respect to its behaviour. We are interested in designing agents capable of acquiring diverse competencies through interaction with an unknown environment in an unsupervised setting, undefined by extrinsic rewards. We assume a setting in which the agent is constantly exploring the environment, making predictions and learning off-policy from a single stream of experience about the consequences of multiple possible courses of action. We hypothesize that its aim should be to make the world more predictable by empowering itself to achieve its most likely predictions, self-defined as intrinsic goals. We illustrate that this approach induces a set of predictive option models and show their usefulness in speeding up planning, compared to their primitive counterparts, for different objectives defined as combinations of signals that the agent might be interested in during its lifetime.

ICML Conference 2019 Conference Paper

Per-Decision Option Discounting

  • Anna Harutyunyan
  • Peter Vrancx
  • Philippe Hamel
  • Ann Nowé
  • Doina Precup

In order to solve complex problems an agent must be able to reason over a sufficiently long horizon. Temporal abstraction, commonly modeled through options, offers the ability to reason at many timescales, but the horizon length is still determined by the discount factor of the underlying Markov Decision Process. We propose a modification to the options framework that naturally scales the agent’s horizon with option length. We show that the proposed option-step discount controls a bias-variance trade-off, with larger discounts (counter-intuitively) leading to less estimation variance.
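One way to read the proposal, offered only as a hedged illustration rather than the paper's estimator: apply the discount once per option decision instead of once per primitive time step, so that longer options stretch the effective horizon measured in primitive steps.

# Illustrative assumption: option_rewards[k] is the total reward accumulated while
# the k-th option executed; the discount is applied per option decision.
def option_step_return(option_rewards, gamma=0.99):
    return sum((gamma ** k) * r_k for k, r_k in enumerate(option_rewards))

# Two options lasting 10 and 50 primitive steps contribute with discounts gamma^0
# and gamma^1, regardless of how long each option ran.
print(option_step_return([3.0, 12.0]))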

RLDM Conference 2019 Conference Abstract

Per-Decision Option Discounting

  • Anna Harutyunyan
  • Peter Vrancx
  • Philippe Hamel
  • Ann Nowe
  • Doina Precup

In order to solve complex problems an agent must be able to reason over a sufficiently long horizon. Temporal abstraction, commonly modeled through options, offers the ability to reason at many timescales, but the horizon length is still determined by the discount factor of the underlying Markov Decision Process. We propose a modification to the options framework that allows the agent’s horizon to grow naturally as its actions become more complex and extended in time. We show that the proposed option-step discount controls a bias-variance trade-off, with larger discounts (counter-intuitively) leading to less estimation variance.

RLDM Conference 2019 Conference Abstract

Recurrent Temporal Difference

  • Pierre Thodoroff
  • Nishanth V Anand
  • Lucas Caccia
  • Doina Precup
  • Joelle Pineau

In sequential modelling, exponential smoothing is one of the most widely used techniques to maintain temporal consistency in estimates. In this work, we propose Recurrent Learning, a method that estimates the value function in reinforcement learning using exponential smoothing along the trajectory. Most algorithms in reinforcement learning estimate the value function at every time step as a point estimate, without explicitly enforcing temporal coherence or considering previous estimates. This can lead to temporally inconsistent behaviours, particularly in tabular and discrete settings. In other words, we propose to smooth the value function of the current state using the estimates of states that occur earlier in the trajectory. Intuitively, states that are temporally close to each other should have similar values. The λ-return [1, 2] enforces temporal coherence through the trajectory implicitly, whereas we propose a method to enforce it explicitly. However, exponential averaging can be biased if a sharp change (non-stationarity) is encountered in the trajectory, such as falling off a cliff. A common technique to alleviate this issue is to make the exponential smoothing factor βt state- or time-dependent. The key ingredient in recurrent neural networks (LSTM [3] and GRU [4]) is the gating mechanism (a state-dependent βt) used to update the hidden cell; the capacity to ignore information allows the cell to focus only on what is important. In this work, we explore a new method that attempts to learn a state-dependent smoothing factor β. To summarize, the contributions of this paper are: (1) a new way to estimate the value function in reinforcement learning by exploiting the estimates along the trajectory; (2) a learning rule for a state-dependent β; and (3) a set of experiments in continuous settings evaluating its strengths and weaknesses.
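A minimal sketch of the smoothing idea, with the combination rule and names as illustrative assumptions rather than the paper's exact update: blend each point estimate with the smoothed estimate of the previous state in the trajectory, using a factor β that may depend on the state.

import numpy as np

def smoothed_estimates(values, betas):
    """values[t]: point estimate V(s_t); betas[t]: smoothing factor in [0, 1]."""
    smoothed = np.empty_like(values)
    smoothed[0] = values[0]
    for t in range(1, len(values)):
        # beta near 1 trusts the current estimate; beta near 0 trusts the trajectory so far.
        smoothed[t] = betas[t] * values[t] + (1 - betas[t]) * smoothed[t - 1]
    return smoothed

values = np.array([0.0, 0.1, 0.2, 5.0, 5.1])   # sharp change at t = 3 (e.g. falling off a cliff)
betas  = np.array([1.0, 0.5, 0.5, 1.0, 0.5])   # a large beta lets the estimate reset after the change
print(smoothed_estimates(values, betas))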

RLDM Conference 2019 Conference Abstract

Safe Hierarchical Policy Optimization using Constrained Return Variance in Options

  • Arushi Jain
  • Doina Precup

The standard reinforcement learning (RL) objective of maximizing the mean return does not assure reliable and repeatable behavior of an agent in safety-critical applications like autonomous driving and robotics. Often, the objective function is penalized with the variance in return in order to limit unexpected behavior of the agent in the environment. While learning options end-to-end has already been accomplished, in this work we introduce a novel Bellman-style direct approach to estimating the variance in return of hierarchical policies in the option-critic architecture (Bacon et al., 2017). Penalizing the mean return with the variance enables learning safer trajectories that avoid inconsistently behaving regions. We present a derivation in the policy gradient style with the new safe objective function, which provides updates for the option parameters in an online fashion.

RLDM Conference 2019 Conference Abstract

Temporal Abstraction in Cooperative Multi-Agent Systems

  • Jhelum Chakravorty
  • Sumana Basu
  • Doina Precup

In this work we introduce temporal abstraction in cooperative multi-agent systems (or teams), which are essentially decentralized Markov Decision Processes (Dec-MDPs) or decentralized Partially Observable MDPs (Dec-POMDPs). We believe that, as in the case of single-agent systems, the options framework gives rise to faster convergence to the optimal value, thus facilitating transfer learning. The decentralized nature of dynamic teams leads to the curse of dimensionality, which impedes scalability. Partial observability requires careful analysis of the information structure involving private and public (common) knowledge. The POMDP structure entails a growing history of agents' observations and actions, which leads to intractability. This calls for a proper design of the belief to circumvent such a growing history by leveraging Bayesian updates, and consequently requires a judicious choice of Bayesian inference to approximate the posterior. Moreover, with temporal abstraction, the option policies of the agents have stochastic termination, which adds intricacy to the hierarchical reinforcement learning problem. We study both planning and learning in the team option-critic framework. We propose the Distributed Option Critic (DOC) algorithm, which leverages the notion of the common information approach and distributed policy gradients. We employ the former to formulate a centralized (coordinated) system equivalent to the original decentralized system and to define the belief for the coordinated system; the latter is exploited in DOC for the policy improvements of the independent agents. We assume a fictitious coordinator who observes the information shared by all agents, updates a belief on the joint states in a Bayesian manner, chooses options and whispers them to the agents. The agents in turn use their private information to choose actions pertaining to the option assigned to them. Finally, the option value of the cooperative game is learnt using the distributed option-critic architecture.

NeurIPS Conference 2019 Conference Paper

The Option Keyboard: Combining Skills in Reinforcement Learning

  • Andre Barreto
  • Diana Borsa
  • Shaobo Hou
  • Gheorghe Comanici
  • Eser Aygün
  • Philippe Hamel
  • Daniel Toyama
  • Jonathan Hunt

The ability to combine known skills to create new ones may be crucial in the solution of complex reinforcement learning problems that unfold over extended periods. We argue that a robust way of combining skills is to define and manipulate them in the space of pseudo-rewards (or "cumulants"). Based on this premise, we propose a framework for combining skills using the formalism of options. We show that every deterministic option can be unambiguously represented as a cumulant defined in an extended domain. Building on this insight and on previous results on transfer learning, we show how to approximate options whose cumulants are linear combinations of the cumulants of known options. This means that, once we have learned options associated with a set of cumulants, we can instantaneously synthesise options induced by any linear combination of them, without any learning involved. We describe how this framework provides a hierarchical interface to the environment whose abstract actions correspond to combinations of basic skills. We demonstrate the practical benefits of our approach in a resource management problem and a navigation task involving a quadrupedal simulated robot.

ICRA Conference 2019 Conference Paper

Uncertainty Aware Learning from Demonstrations in Multiple Contexts using Bayesian Neural Networks

  • Sanjay Thakur
  • Herke van Hoof
  • Juan Camilo Gamboa Higuera
  • Doina Precup
  • David Meger

Diversity of environments is a key challenge that causes learned robotic controllers to fail due to discrepancies between the training and evaluation conditions. Training from demonstrations in various conditions can mitigate - but not completely prevent - such failures. Learned controllers such as neural networks typically do not have a notion of uncertainty that makes it possible to diagnose a mismatch between training and testing conditions and potentially intervene. In this work, we propose to use Bayesian Neural Networks, which have such a notion of uncertainty. We show that uncertainty can be leveraged to consistently detect situations in high-dimensional simulated and real robotic domains in which the performance of the learned controller would be sub-par. Also, we show that such an uncertainty-based solution allows making an informed decision about when to invoke a fallback strategy. One fallback strategy is to request more data. We empirically show that providing data only when requested results in increased data-efficiency.

RLDM Conference 2019 Conference Abstract

Using No-Regret To Solve Two Player Zero Sum Discounted Stochastic Games

  • Paul Pereira
  • Doina Precup

Ever since two-player zero-sum stochastic games were introduced by Shapley in 1953, researchers have been combining algorithms for computing the value of a zero-sum game with RL algorithms, in order to compute the unique value of these games and identify optimal stationary strategies for both agents. Minimax-Q [Littman 1994] combines Q-Learning with the fact that a zero-sum game can be solved directly as an LP, whereas [Vrieze O. J. 1982] uses fictitious play to update the policies of the agents. The state-of-the-art algorithms for approximating the value of a zero-sum game are variants of no-regret algorithms, for example Optimistic Multiplicative Weights, a variant of the no-regret algorithm Multiplicative Weights. In order to begin combining these newer algorithms with RL algorithms, we propose an algorithm, similar to the one introduced in [Vrieze O. J. 1982], that uses a no-regret algorithm at every state to update the policies of the agents.
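For reference, the per-state building block mentioned above is a no-regret update such as Multiplicative Weights (Hedge); the sketch below shows a single update of one player's mixed strategy. How such updates are interleaved with value estimation across states is the contribution of the abstract and is not reproduced here.

import numpy as np

def multiplicative_weights_step(weights, payoffs, eta=0.1):
    # Increase the weight of actions that obtained higher payoff against the opponent.
    new_w = weights * np.exp(eta * payoffs)
    return new_w / new_w.sum()

weights = np.ones(3) / 3                      # uniform initial mixed strategy
payoffs = np.array([1.0, -0.5, 0.2])          # payoff of each action in the current state
print(multiplicative_weights_step(weights, payoffs))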

RLDM Conference 2019 Conference Abstract

Value Preserving State-Action Abstractions

  • David Abel
  • Nate Umbanhowar
  • Dilip Arumugam
  • Doina Precup
  • Michael L. Littman

We here introduce combinations of state abstractions and options that preserve representation of near-optimal policies. We define φ-relative options, a general formalism for analyzing the value loss of options paired with a state abstraction, and prove that there exist classes of φ-relative options that preserve near-optimal behavior in any MDP.

ICML Conference 2018 Conference Paper

Convergent TREE BACKUP and RETRACE with Function Approximation

  • Ahmed Touati
  • Pierre-Luc Bacon
  • Doina Precup
  • Pascal Vincent

Off-policy learning is key to scaling up reinforcement learning, as it allows learning about a target policy from the experience generated by a different behavior policy. Unfortunately, it has been challenging to combine off-policy learning with function approximation and multi-step bootstrapping in a way that leads to both stable and efficient algorithms. In this work, we show that the Tree Backup and Retrace algorithms are unstable with linear function approximation, both in theory and in practice with specific examples. Based on our analysis, we then derive stable and efficient gradient-based algorithms using a quadratic convex-concave saddle-point formulation. By exploiting the problem structure proper to these algorithms, we are able to provide convergence guarantees and finite-sample bounds. The applicability of our new analysis also goes beyond Tree Backup and Retrace and allows us to provide new convergence rates for the GTD and GTD2 algorithms without having recourse to projections or Polyak averaging.

AAAI Conference 2018 Conference Paper

Deep Reinforcement Learning That Matters

  • Peter Henderson
  • Riashat Islam
  • Philip Bachman
  • Joelle Pineau
  • Doina Precup
  • David Meger

In recent years, significant progress has been made in solving challenging problems across various domains using deep reinforcement learning (RL). Reproducing existing work and accurately judging the improvements offered by novel methods is vital to sustaining this progress. Unfortunately, reproducing results for state-of-the-art deep RL methods is seldom straightforward. In particular, non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results tough to interpret. Without significance metrics and tighter standardization of experimental reporting, it is difficult to determine whether improvements over the prior state-of-the-art are meaningful. In this paper, we investigate challenges posed by reproducibility, proper experimental techniques, and reporting procedures. We illustrate the variability in reported metrics and results when comparing against common baselines and suggest guidelines to make future results in deep RL more reproducible. We aim to spur discussion about how to ensure continued progress in the field by minimizing wasted effort stemming from results that are non-reproducible and easily misinterpreted.

AAMAS Conference 2018 Conference Paper

Eligibility Traces for Options

  • Ayush Jain
  • Doina Precup

Temporally extended actions not only represent knowledge in the hierarchical setup in reinforcement learning, they also improve exploration while reducing the complexity of choosing actions. The option framework provides a concrete way to implement and reason about temporal abstraction. This work attempts to test the utility of eligibility traces with options and find good ways of doing multi-step intra-option updates. Three algorithms, based on the off-policy methods importance sampling, tree-backup and Retrace, are proposed for using eligibility traces with options.

AAAI Conference 2018 Short Paper

Imitation Upper Confidence Bound for Bandits on a Graph

  • Andrei Lupu
  • Doina Precup

We consider a graph of interconnected agents implementing a common policy and each playing a bandit problem with identical reward distributions. We restrict the information propagated in the graph such that agents can uniquely observe each other's actions. We propose an extension of the Upper Confidence Bound (UCB) algorithm to this setting and empirically demonstrate that our solution improves the performance over UCB according to multiple metrics and within various graph configurations.

AAAI Conference 2018 Conference Paper

Learning Predictive State Representations From Non-Uniform Sampling

  • Yuri Grinberg
  • Hossein Aboutalebi
  • Melanie Lyman-Abramovitch
  • Borja Balle
  • Doina Precup

Predictive state representations (PSR) have emerged as a powerful method for modelling partially observable environments. PSR learning algorithms can build models for predicting all observable variables, or predicting only some of them conditioned on others (e.g., actions or exogenous variables). In the latter case, which we call conditional modelling, the accuracy of different estimates of the conditional probabilities for a fixed dataset can vary significantly, due to the limited sampling of certain conditions. This can have negative consequences on the PSR parameter estimation process, which are not taken into account by the current state-of-the-art PSR spectral learning algorithms. In this paper, we examine closely conditional modelling within the PSR framework. We first establish a new positive but surprisingly non-trivial result: a conditional model can never be larger than the complete model. Then, we address the core shortcoming of existing PSR spectral learning methods for conditional models by incorporating an additional step in the process, which can be seen as a type of matrix denoising. We further refine this objective by adding penalty terms for violations of the system dynamics matrix structure, which improves the PSR predictive performance. Empirical evaluations on both synthetic and real datasets highlight the advantages of the proposed approach.

AAAI Conference 2018 Conference Paper

Learning Robust Options

  • Daniel Mankowitz
  • Timothy Mann
  • Pierre-Luc Bacon
  • Doina Precup
  • Shie Mannor

Robust reinforcement learning aims to produce policies that have strong guarantees even in the face of environments/transition models whose parameters have strong uncertainty. Existing work uses value-based methods and the usual primitive action setting. In this paper, we propose robust methods for learning temporally abstract actions, in the framework of options. We present a Robust Options Policy Iteration (ROPI) algorithm with convergence guarantees, which learns options that are robust to model uncertainty. We utilize ROPI to learn robust options with the Robust Options Deep Q Network (RO-DQN) that solves multiple tasks and mitigates model misspecification due to model uncertainty. We present experimental results which suggest that policy iteration with linear features may have an inherent form of robustness when using coarse feature representations. In addition, we present experimental results which demonstrate that robustness helps policy iteration implemented on top of deep neural networks to generalize over a much broader range of dynamics than non-robust policy iteration.

NeurIPS Conference 2018 Conference Paper

Learning Safe Policies with Expert Guidance

  • Jessie Huang
  • Fa Wu
  • Doina Precup
  • Yang Cai

We propose a framework for ensuring safe behavior of a reinforcement learning agent when the reward function may be difficult to specify. In order to do this, we rely on the existence of demonstrations from expert policies, and we provide a theoretical framework for the agent to optimize in the space of rewards consistent with its existing knowledge. We propose two methods to solve the resulting optimization: an exact ellipsoid-based method and a method in the spirit of the "follow-the-perturbed-leader" algorithm. Our experiments demonstrate the behavior of our algorithm in both discrete and continuous problems. The trained agent safely avoids states with potential negative effects while imitating the behavior of the expert in the other states.

AAAI Conference 2018 Conference Paper

Learning With Options That Terminate Off-Policy

  • Anna Harutyunyan
  • Peter Vrancx
  • Pierre-Luc Bacon
  • Doina Precup
  • Ann Nowé

A temporally abstract action, or an option, is specified by a policy and a termination condition: the policy guides the option behavior, and the termination condition roughly determines its length. Generally, learning with longer options (like learning with multi-step returns) is known to be more efficient. However, if the option set for the task is not ideal, and cannot express the primitive optimal policy well, shorter options offer more flexibility and can yield a better solution. Thus, the termination condition puts learning efficiency at odds with solution quality. We propose to resolve this dilemma by decoupling the behavior and target terminations, just like it is done with policies in off-policy learning. To this end, we give a new algorithm, Q(β), that learns the solution with respect to any termination condition, regardless of how the options actually terminate. We derive Q(β) by casting learning with options into a common framework with well-studied multi-step off-policy learning. We validate our algorithm empirically, and show that it holds up to its motivating claims.

EWRL Workshop 2018 Workshop Paper

Leveraging Observational Learning for Exploration in Bandits

  • Audrey Durand
  • Andrei Lupu
  • Doina Precup

Imitation learning has been widely used to speed up learning in novice agents, by allowing them to leverage existing data from experts. In this paper, we study this problem in the context of bandits. More specifically, we consider that an agent (learner) is interacting with a bandit-style decision task, but can also observe a target policy interacting with the same environment. The learner observes only the target’s actions, not the rewards obtained. Our goal is to leverage the target data in order to guide the agent’s exploration. We propose a method that builds on the Upper Confidence Bound algorithm by using conditional optimism contingent on the actions of the target. We provide a regret upper-bound of order O(ln T) in two-action settings and derive the dependency of the expected regret on the general target policy. We provide empirical results showing both great benefits as well as certain limitations of this type of imitation learning in the multi-armed bandit setting.

AAAI Conference 2018 Conference Paper

OptionGAN: Learning Joint Reward-Policy Options Using Generative Adversarial Inverse Reinforcement Learning

  • Peter Henderson
  • Wei-Di Chang
  • Pierre-Luc Bacon
  • David Meger
  • Joelle Pineau
  • Doina Precup

Reinforcement learning has shown promise in learning policies that can solve complex problems. However, manually specifying a good reward function can be difficult, especially for intricate tasks. Inverse reinforcement learning offers a useful paradigm to learn the underlying reward function directly from expert demonstrations. Yet in reality, the corpus of demonstrations may contain trajectories arising from a diverse set of underlying reward functions rather than a single one. Thus, in inverse reinforcement learning, it is useful to consider such a decomposition. The options framework in reinforcement learning is specifically designed to decompose policies in a similar light. We therefore extend the options framework and propose a method to simultaneously recover reward options in addition to policy options. We leverage adversarial methods to learn joint reward-policy options using only observed expert states. We show that this approach works well in both simple and complex continuous control tasks and shows significant performance increases in one-shot transfer learning.

NeurIPS Conference 2018 Conference Paper

Temporal Regularization for Markov Decision Process

  • Pierre Thodoroff
  • Audrey Durand
  • Joelle Pineau
  • Doina Precup

Several applications of Reinforcement Learning suffer from instability due to high variance. This is especially prevalent in high dimensional domains. Regularization is a commonly used technique in machine learning to reduce variance, at the cost of introducing some bias. Most existing regularization techniques focus on spatial (perceptual) regularization. Yet in reinforcement learning, due to the nature of the Bellman equation, there is an opportunity to also exploit temporal regularization based on smoothness in value estimates over trajectories. This paper explores a class of methods for temporal regularization. We formally characterize the bias induced by this technique using Markov chain concepts. We illustrate the various characteristics of temporal regularization via a sequence of simple discrete and continuous MDPs, and show that the technique provides improvement even in high-dimensional Atari games.
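As a hedged illustration of one simple member of this class of methods (not the paper's general operators): mix the TD target for the current state with the value of the previous state along the trajectory, trading a little bias for lower variance. The parameter names below are assumptions.

def temporally_regularized_td(V, s_prev, s, r, s_next, alpha=0.1, gamma=0.99, beta=0.2):
    td_target = r + gamma * V[s_next]
    target = (1 - beta) * td_target + beta * V[s_prev]   # pull towards the previous state's value
    V[s] += alpha * (target - V[s])
    return V

V = [0.0, 0.0, 0.0]
print(temporally_regularized_td(V, s_prev=0, s=1, r=1.0, s_next=2))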

AAAI Conference 2018 Conference Paper

When Waiting Is Not an Option: Learning Options With a Deliberation Cost

  • Jean Harb
  • Pierre-Luc Bacon
  • Martin Klissarov
  • Doina Precup

Recent work has shown that temporally extended actions (options) can be learned fully end-to-end as opposed to being specified in advance. While the problem of how to learn options is increasingly well understood, the question of what good options should be has remained elusive. We formulate our answer to what good options should be in the bounded rationality framework (Simon, 1957) through the notion of deliberation cost. We then derive practical gradient-based learning algorithms to implement this objective. Our results in the Arcade Learning Environment (ALE) show increased performance and interpretability.

IJCAI Conference 2017 Conference Paper

Approximate Value Iteration with Temporally Extended Actions (Extended Abstract)

  • Timothy A. Mann
  • Shie Mannor
  • Doina Precup

The options framework provides a concrete way to implement and reason about temporally extended actions. Existing literature has demonstrated the value of planning with options empirically, but there is a lack of theoretical analysis formalizing when planning with options is more efficient than planning with primitive actions. We provide a general analysis of the convergence rate of a popular Approximate Value Iteration (AVI) algorithm called Fitted Value Iteration (FVI) with options. Our analysis reveals that longer duration options and a pessimistic estimate of the value function both lead to faster convergence. Furthermore, options can improve convergence even when they are suboptimal and sparsely distributed throughout the state space. Next we consider generating useful options for planning based on a subset of landmark states. This suggests a new algorithm, Landmark-based AVI (LAVI), that represents the value function only at landmark states. We analyze OFVI and LAVI using the proposed landmark-based options and compare the two algorithms. Our theoretical and experimental results demonstrate that options can play an important role in AVI by decreasing approximation error and inducing fast convergence.

RLDM Conference 2017 Conference Abstract

Asynchronous Advantage Option-Critic with Deliberation Cost

  • Jean Harb
  • Pierre-Luc Bacon
  • Doina Precup

Learning temporally extended actions is a long-standing problem that has recently been tackled by different types of frameworks (Bacon et al. [2016], Vezhnevets et al. [2016], Kulkarni et al. [2016], Tessler et al. [2016]). The option-critic architecture (OC) is a framework that allows an agent to learn options in an end-to-end manner, while optimizing the expected return. In this work, we introduce an asynchronous variant of OC where it trains on multiple processes at once and accumulates the experience to train a single network, like in A3C (Mnih et al. [2016]). Option-critic has the problem that options collapse to lasting only a single time-step. This problem arises from the fact that there’s no reason for the options to be temporally extended. The agent sees termination only as giving it more choices, which cannot be worse than staying in the same option. In the worst case, it will simply choose to go back into the same option. In reality, there is a cost to terminating. Computation resources are limited, and when terminating, the agent must take time and computation to decide which option to execute. A deliberation cost would indicate that there’s a cost to terminating an option. We introduce a deliberation cost which can be simply implemented into option-critic, which allows the agent to learn temporally extended options as it now sees termination associated with a negative reward. We perform experiments on a few games of the Arcade Learning Environment (Atari 2600 games) and show the learning capacity of the asynchronous version of option-critic and the effects of different deliberation costs.

RLDM Conference 2017 Conference Abstract

Independently Controllable Features

  • Emmanuel Bengio
  • Valentin Thomas
  • Joelle Pineau
  • Doina Precup
  • Yoshua Bengio

Finding features that disentangle the different causes of variation in real data is a difficult task, that has nonetheless received considerable attention in static domains like natural images. Interactive environments, in which an agent can deliberately take actions, offer an opportunity to tackle this task better, because the agent can experiment with different actions and observe their effects. We introduce the idea that in interactive environments, latent factors that control the variation in observed data can be identified by figuring out what the agent can control. We propose a naive method to find factors that explain or measure the effect of the actions of a learner, and test it in illustrative experiments.

AAAI Conference 2017 Conference Paper

The Option-Critic Architecture

  • Pierre-Luc Bacon
  • Jean Harb
  • Doina Precup

Temporal abstraction is key to scaling up learning and planning in reinforcement learning. While planning with temporally extended actions is well understood, creating such abstractions autonomously from data has remained challenging. We tackle this problem in the framework of options [Sutton, Precup & Singh, 1999; Precup, 2000]. We derive policy gradient theorems for options and propose a new option-critic architecture capable of learning both the internal policies and the termination conditions of options, in tandem with the policy over options, and without the need to provide any additional rewards or subgoals. Experimental results in both discrete and continuous environments showcase the flexibility and efficiency of the framework.

RLDM Conference 2017 Conference Abstract

Unifying Multi-Step Methods through Matrix Splitting

  • Pierre-Luc Bacon
  • Doina Precup

We show that multi-step reinforcement learning methods can be analyzed through a new perspective by using the notion of matrix splitting, a notion originally developed in the literature on linear iterative solvers. This new perspective allows us to better understand how seemingly different concepts, such as temporally extended actions in the options framework and TD(λ)-style bootstrapping, relate to each other. Mapping out existing algorithms on this spectrum also allows us to identify new variants and opens the door towards designing new algorithms.
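To make the matrix-splitting view concrete, recall that evaluating a fixed policy amounts to solving (I - gamma P) v = r; any splitting I - gamma P = M - N yields the iteration v <- M^{-1}(N v + r). The sketch below uses the simplest choice M = I, which recovers the one-step Bellman update, and the abstract's point is that other choices of M and N correspond to other multi-step methods.

import numpy as np

gamma = 0.9
P = np.array([[0.9, 0.1], [0.2, 0.8]])      # transition matrix under a fixed policy
r = np.array([1.0, 0.0])                    # expected one-step rewards

A = np.eye(2) - gamma * P
M = np.eye(2)                               # splitting A = M - N with M = I
N = M - A                                   # so N = gamma * P

v = np.zeros(2)
for _ in range(500):
    v = np.linalg.solve(M, N @ v + r)       # fixed-point iteration induced by the splitting

print(v, np.linalg.solve(A, r))             # the iteration matches the exact solution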

RLDM Conference 2017 Conference Abstract

Using locally self-avoiding random walks for exploration in reinforcement learning

  • Maziar Gomrokchi
  • Doina Precup

Reinforcement learning algorithms depend crucially on good exploration strategies in order to be able to quickly cover an environment and obtain good estimates of value functions and policies. However, finding good exploration strategies has remained a very difficult, open research problem in the context of sequential decision making. In this paper, we propose a new exploration method for reinforcement learning algorithms, based on two intuitions: (1) the choice of the next exploratory action to take should depend not just on the (Markovian) state of the environment, but also on the trajectory so far; (2) trajectories should aim to fill as quickly as possible the existing environment. Our method is based on the mechanism of locally self-avoiding random walks, often used in physics to describe the behavior of polymer chains. We establish theoretical results showing the advantage of locally self-avoiding walks, in comparison with simple random walks, in the context of exploration for reinforcement learning. We corroborate these results with experiments illustrating the increased efficiency of such exploration methods, compared to traditional randomization-based methods.
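A toy sketch of the mechanism (the memory length and tie-breaking rule are illustrative assumptions): a walker on a graph that avoids its most recently visited states whenever it can, which tends to spread coverage faster than a simple random walk.

import random
from collections import deque

def locally_self_avoiding_walk(neighbors, start, steps=20, memory=3, seed=0):
    rng = random.Random(seed)
    recent = deque([start], maxlen=memory)   # short memory of recently visited states
    path, state = [start], start
    for _ in range(steps):
        options = neighbors[state]
        fresh = [s for s in options if s not in recent]
        state = rng.choice(fresh if fresh else options)   # fall back if every neighbour is recent
        recent.append(state)
        path.append(state)
    return path

# Small cycle graph: the walk tends to keep moving around rather than backtracking.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(locally_self_avoiding_walk(neighbors, start=0))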

ICML Conference 2016 Conference Paper

Differentially Private Policy Evaluation

  • Borja Balle
  • Maziar Gomrokchi
  • Doina Precup

We present the first differentially private algorithms for reinforcement learning, which apply to the task of evaluating a fixed policy. We establish two approaches for achieving differential privacy, provide a theoretical analysis of the privacy and utility of the two algorithms, and show promising results on simple empirical examples.

AAAI Conference 2016 Conference Paper

Incremental Stochastic Factorization for Online Reinforcement Learning

  • Andre Barreto
  • Rafael Beirigo
  • Joelle Pineau
  • Doina Precup

A construct that has been receiving attention recently in reinforcement learning is stochastic factorization (SF), a particular case of non-negative factorization (NMF) in which the matrices involved are stochastic. The idea is to use SF to approximate the transition matrices of a Markov decision process (MDP). This is useful for two reasons. First, learning the factors of the SF instead of the transition matrices can reduce significantly the number of parameters to be estimated. Second, it has been shown that SF can be used to reduce the number of operations needed to compute an MDP’s value function. Recently, an algorithm called expectation-maximization SF (EMSF) has been proposed to compute a SF directly from transitions sampled from an MDP. In this paper we take a closer look at EMSF. First, by exploiting the assumptions underlying the algorithm, we show that it is possible to reduce it to simple multiplicative update rules similar to the ones that helped popularize NMF. Second, we analyze the optimization process underlying EMSF and find that it minimizes a modified version of the Kullback-Leibler divergence that is particularly well-suited for learning a SF from data sampled from an arbitrary distribution. Third, we build on this improved understanding of EMSF to draw an interesting connection with NMF and probabilistic latent semantic analysis. We also exploit the simplified update rules to introduce a new version of EMSF that generalizes and significantly improves its precursor. This new algorithm provides a practical mechanism to control the trade-off between memory usage and computing time, essentially freeing the space complexity of EMSF from its dependency on the number of sample transitions. The algorithm can also compute its approximation incrementally, which makes it possible to use it concomitantly with the collection of data. This feature makes the new version of EMSF particularly suitable for online reinforcement learning. Empirical results support the utility of the proposed algorithm.

IJCAI Conference 2016 Conference Paper

Learning Multi-Step Predictive State Representations

  • Lucas Langer
  • Borja Balle
  • Doina Precup

Recent years have seen the development of efficient and provably correct spectral algorithms for learning models of partially observable environments arising in many applications. But despite the high hopes raised by this new class of algorithms, their practical impact is still below expectations. One reason for this is the difficulty in adapting spectral methods to exploit structural constraints about different target environments which can be known beforehand. A natural structure intrinsic to many dynamical systems is a multi-resolution behaviour where interesting phenomena occur at different time scales during the evolution of the system. In this paper we introduce the multi-step predictive state representation (M-PSR) and an associated learning algorithm that finds and leverages frequent patterns of observations at multiple scales in dynamical systems with discrete observations. We perform experiments on robot exploration tasks in a wide variety of environments and conclude that the use of M-PSRs improves over the classical PSR for varying amounts of data, environment sizes, and number of observation symbols.

JMLR Journal 2016 Journal Article

Practical Kernel-Based Reinforcement Learning

  • André M.S. Barreto
  • Doina Precup
  • Joelle Pineau

Kernel-based reinforcement learning (KBRL) stands out among approximate reinforcement learning algorithms for its strong theoretical guarantees. By casting the learning problem as a local kernel approximation, KBRL provides a way of computing a decision policy which converges to a unique solution and is statistically consistent. Unfortunately, the model constructed by KBRL grows with the number of sample transitions, resulting in a computational cost that precludes its application to large-scale or on-line domains. In this paper we introduce an algorithm that turns KBRL into a practical reinforcement learning tool. Kernel-based stochastic factorization (KBSF) builds on a simple idea: when a transition probability matrix is represented as the product of two stochastic matrices, one can swap the factors of the multiplication to obtain another transition matrix, potentially much smaller than the original, which retains some fundamental properties of its precursor. KBSF exploits such an insight to compress the information contained in KBRL's model into an approximator of fixed size. This makes it possible to build an approximation considering both the difficulty of the problem and the associated computational cost. KBSF's computational complexity is linear in the number of sample transitions, which is the best one can do without discarding data. Moreover, the algorithm's simple mechanics allow for a fully incremental implementation that makes the amount of memory used independent of the number of sample transitions. The result is a kernel-based reinforcement learning algorithm that can be applied to large-scale problems in both off-line and on-line regimes. We derive upper bounds for the distance between the value functions computed by KBRL and KBSF using the same data. We also prove that it is possible to control the magnitude of the variables appearing in our bounds, which means that, given enough computational resources, we can make KBSF's value function as close as desired to the value function that would be computed by KBRL using the same set of sample transitions. The potential of our algorithm is demonstrated in an extensive empirical study in which KBSF is applied to difficult tasks based on real-world data. Not only does KBSF solve problems that had never been solved before, but it also significantly outperforms other state-of-the-art reinforcement learning algorithms on the tasks studied.

EWRL Workshop 2016 Workshop Paper

Using Policy Gradients to Account for Changes in Behavior Policies under Off-policy Control

  • Lucas Lehnert
  • Doina Precup

Off-policy learning refers to the problem of learning the value function of a behaviour, or policy, while selecting actions with a different policy. Gradient-based off-policy learning algorithms, such as GTD (Sutton et al., 2009b) and TDC/GQ (Sutton et al., 2009a), converge when selecting actions with a fixed policy even when using function approximation and incremental updates. In control problems, the behaviour policy is adapted over time. One key challenge in off-policy control is that adapting the policy results in changing the distribution of subsequent transitions the algorithm will see. We present the first off-policy gradient-based learning algorithm that accounts for how an adjustment of the policy at the current time step affects the distribution of future transition samples. We derive the algorithm in the style of policy gradients and show that our method compares favourably to existing approaches when used for off-policy control with linear function approximation.

IJCAI Conference 2015 Conference Paper

An Expectation-Maximization Algorithm to Compute a Stochastic Factorization From Data

  • Andre M. S. Barreto
  • Rafael L. Beirigo
  • Joelle Pineau
  • Doina Precup

When a transition probability matrix is represented as the product of two stochastic matrices, swapping the factors of the multiplication yields another transition matrix that retains some fundamental characteristics of the original. Since the new matrix can be much smaller than its precursor, replacing the former with the latter can lead to significant savings in terms of computational effort. This strategy, dubbed the “stochastic-factorization trick,” can be used to compute the stationary distribution of a Markov chain, to determine the fundamental matrix of an absorbing chain, and to compute a decision policy via dynamic programming or reinforcement learning. In this paper we show that the stochastic-factorization trick can also provide benefits in terms of the number of samples needed to estimate a transition matrix. We introduce a probabilistic interpretation of a stochastic factorization and build on the resulting model to develop an algorithm to compute the factorization directly from data. If the transition matrix can be well approximated by a low-order stochastic factorization, estimating its factors instead of the original matrix reduces significantly the number of parameters to be estimated. Thus, when compared to estimating the transition matrix directly via maximum likelihood, the proposed method is able to compute approximations of roughly the same quality using less data. We illustrate the effectiveness of the proposed algorithm by using it to help a reinforcement learning agent learn how to play the game of blackjack.
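A small numerical illustration of the trick described above: if an n x n transition matrix factors as P = DK, with D (n x m) and K (m x n) both stochastic, then swapping the factors gives an m x m stochastic matrix KD that retains fundamental characteristics of P. The matrices below are toy values chosen by hand.

import numpy as np

D = np.array([[1.0, 0.0],
              [0.7, 0.3],
              [0.0, 1.0]])                  # 3 x 2, rows sum to 1
K = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.5, 0.5]])             # 2 x 3, rows sum to 1

P = D @ K                                   # original 3 x 3 transition matrix
P_small = K @ D                             # swapped 2 x 2 transition matrix

print(P.sum(axis=1), P_small.sum(axis=1))   # both are row-stochastic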

JAIR Journal 2015 Journal Article

Approximate Value Iteration with Temporally Extended Actions

  • Timothy A. Mann
  • Shie Mannor
  • Doina Precup

Temporally extended actions have proven useful for reinforcement learning, but their duration also makes them valuable for efficient planning. The options framework provides a concrete way to implement and reason about temporally extended actions. Existing literature has demonstrated the value of planning with options empirically, but there is a lack of theoretical analysis formalizing when planning with options is more efficient than planning with primitive actions. We provide a general analysis of the convergence rate of a popular Approximate Value Iteration (AVI) algorithm called Fitted Value Iteration (FVI) with options. Our analysis reveals that longer duration options and a pessimistic estimate of the value function both lead to faster convergence. Furthermore, options can improve convergence even when they are suboptimal and sparsely distributed throughout the state-space. Next we consider the problem of generating useful options for planning based on a subset of landmark states. This suggests a new algorithm, Landmark-based AVI (LAVI), that represents the value function only at the landmark states. We analyze both FVI and LAVI using the proposed landmark-based options and compare the two algorithms. Our experimental results in three different domains demonstrate the key properties from the analysis. Our theoretical and experimental results demonstrate that options can play an important role in AVI by decreasing approximation error and inducing fast convergence.

NeurIPS Conference 2015 Conference Paper

Basis refinement strategies for linear value function approximation in MDPs

  • Gheorghe Comanici
  • Doina Precup
  • Prakash Panangaden

We provide a theoretical framework for analyzing basis function construction for linear value function approximation in Markov Decision Processes (MDPs). We show that important existing methods, such as Krylov bases and Bellman-error-based methods are a special case of the general framework we develop. We provide a general algorithmic framework for computing basis function refinements which “respect” the dynamics of the environment, and we derive approximation error bounds that apply for any algorithm respecting this general framework. We also show how, using ideas related to bisimulation metrics, one can translate basis refinement into a process of finding “prototypes” that are diverse enough to represent the given MDP.

RLDM Conference 2015 Conference Abstract

Conditional computation in neural networks using a decision-theoretic approach

  • Pierre-Luc Bacon
  • Emmanuel Bengio
  • Joelle Pineau
  • Doina Precup

Deep learning has become the state-of-the-art tool in many applications, but the evaluation and training of such models is very time-consuming and expensive. Dropout has been used in order to make the computations sparse (by not involving all units), as well as to regularize the models. In typical dropout, nodes are dropped uniformly at random. Our goal is to use reinforcement learning in order to design better, more informed dropout policies, which are data-dependent. We cast the problem of learning activation-dependent dropout policies as a reinforcement learning problem. We propose a reward function motivated by information theory, which captures the idea of wanting to have parsimonious activations while maintaining prediction accuracy. We develop policy gradient algorithms for learning policies that optimize this loss function and present encouraging empirical results showing that this approach improves the speed of computation without significantly impacting the quality of the approximation.

NeurIPS Conference 2015 Conference Paper

Data Generation as Sequential Decision Making

  • Philip Bachman
  • Doina Precup

We connect a broad class of generative models through their shared reliance on sequential decision making. Motivated by this view, we develop extensions to an existing model, and then explore the idea further in the context of data imputation -- perhaps the simplest setting in which to investigate the relation between unconditional and conditional generative modelling. We formulate data imputation as an MDP and develop models capable of representing effective policies for it. We construct the models using neural networks and train them using a form of guided policy search. Our models generate predictions through an iterative process of feedback and refinement. We show that this approach can learn effective policies for imputation problems of varying difficulty and across multiple datasets.

RLDM Conference 2015 Conference Abstract

Learning and Planning with Timing Information in Markov Decision Processes

  • Pierre-Luc Bacon
  • Borja Balle
  • Doina Precup

We consider the problem of learning and planning in Markov decision processes with temporally extended actions represented in the options framework. We propose to use predictions about the duration of extended actions to represent the state and show that this leads to a compact predictive state representation model independent of the set of primitive actions. Then we develop a consistent and efficient spectral learning algorithm for such models. Using just the timing information to represent states allows for faster improvement in the planning performance. We illustrate our approach with experiments in both synthetic and robot navigation domains.

UAI Conference 2015 Conference Paper

Learning and Planning with Timing Information in Markov Decision Processes

  • Pierre-Luc Bacon
  • Borja Balle
  • Doina Precup

We consider the problem of learning and planning in Markov decision processes with temporally extended actions represented in the options framework. We propose to use predictions about the duration of extended actions to represent the state and show that this leads to a compact predictive state representation model independent of the set of primitive actions. Then we develop a consistent and efficient spectral learning algorithm for such models. Using just the timing information to represent states allows for faster improvement in the planning performance. We illustrate our approach with experiments in both synthetic and robot navigation domains.

EWRL Workshop 2015 Workshop Paper

Learning Policies for Data Imputation with Guided Policy Search

  • Philip Bachman
  • Doina Precup

We explore the relationship between directed generative models and reinforcement learning by developing a new approach to data imputation that combines ideas from both areas. We address data imputation by defining an MDP for which we construct policies parametrized by (reasonably) large neural networks. We then show how to train these policies using a form of (self) Guided Policy Search (Levine & Koltun, 2013a), which leads to maximizing a variational bound on the quality of the imputations made by our policies. Empirically, our policies perform well over a range of conditions.

AAAI Conference 2015 Conference Paper

Representation Discovery for MDPs Using Bisimulation Metrics

  • Sherry Ruan
  • Gheorghe Comanici
  • Prakash Panangaden
  • Doina Precup

We provide a novel, flexible, iterative refinement algorithm to automatically construct an approximate state space representation for Markov Decision Processes (MDPs). Our approach leverages bisimulation metrics, which have been used in prior work to generate features to represent the state space of MDPs. We address a drawback of this approach, which is the expensive computation of the bisimulation metrics. We propose an algorithm to generate an iteratively improving sequence of state space partitions. Partial metric computations guide the representation search and provide much lower space and computational complexity, while maintaining strong convergence properties. We provide theoretical results guaranteeing convergence as well as experimental illustrations of the accuracy and savings (in time and memory usage) of the new algorithm, compared to traditional bisimulation metric computation.

ICML Conference 2015 Conference Paper

Variational Generative Stochastic Networks with Collaborative Shaping

  • Philip Bachman
  • Doina Precup

We develop an approach to training generative models based on unrolling a variational auto-encoder into a Markov chain, and shaping the chain’s trajectories using a technique inspired by recent work in Approximate Bayesian computation. We show that the global minimizer of the resulting objective is achieved when the generative model reproduces the target distribution. To allow finer control over the behavior of the models, we add a regularization term inspired by techniques used for regularizing certain types of policy search in reinforcement learning. We present empirical results on the MNIST and TFD datasets which show that our approach offers state-of-the-art performance, both quantitatively and from a qualitative point of view.

ICML Conference 2014 Conference Paper

A new Q(lambda) with interim forward view and Monte Carlo equivalence

  • Richard S. Sutton
  • A. Rupam Mahmood
  • Doina Precup
  • Hado van Hasselt

Q-learning, the most popular of reinforcement learning algorithms, has always included an extension to eligibility traces to enable more rapid learning and improved asymptotic performance on non-Markov problems. The lambda parameter smoothly shifts on-policy algorithms such as TD(lambda) and Sarsa(lambda) from a pure bootstrapping form (lambda=0) to a pure Monte Carlo form (lambda=1). In off-policy algorithms, including Q(lambda), GQ(lambda), and off-policy LSTD(lambda), the lambda parameter is intended to play the same role, but does not; on every exploratory action these algorithms bootstrap regardless of the value of lambda, and as a result they fail to approximate Monte Carlo learning when lambda=1. It may seem that this is inevitable for any online off-policy algorithm; if updates are made on each step on which the target policy is followed, then how could just the right updates be ‘un-made’ upon deviation from the target policy? In this paper, we introduce a new version of Q(lambda) that does exactly that, without significantly increased algorithmic complexity. En route to our new Q(lambda), we introduce a new derivation technique based on the forward-view/backward-view analysis familiar from TD(lambda) but extended to apply at every time step rather than only at the end of episodes. We apply this technique to derive first a new off-policy version of TD(lambda), called PTD(lambda), and then our new Q(lambda), called PQ(lambda).
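
For reference, the forward-view λ-return that this family of algorithms targets (written here in standard notation, which is assumed rather than quoted from the paper) interpolates between one-step bootstrapping and the Monte Carlo return:

```latex
% Forward-view lambda-return (standard definition; notation assumed)
\begin{align*}
G_t^{(n)} &= r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{\,n-1} r_{t+n} + \gamma^{n}\,\hat{v}(s_{t+n}),\\
G_t^{\lambda} &= (1-\lambda)\sum_{n=1}^{\infty} \lambda^{\,n-1}\, G_t^{(n)}.
\end{align*}
```

At λ=0 this reduces to the one-step TD target, and as λ→1 it approaches the full Monte Carlo return; it is exactly this λ=1 equivalence that the earlier off-policy variants discussed above fail to preserve on exploratory actions.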

UAI Conference 2014 Conference Paper

Bisimulation Metrics are Optimal Value Functions

  • Norman Ferns
  • Doina Precup

Bisimulation is a notion of behavioural equivalence on the states of a transition system. Its definition has been extended to Markov decision processes, where it can be used to aggregate states. A bisimulation metric is a quantitative analog of bisimulation that measures how similar states are from the perspective of long-term behavior. Bisimulation metrics have been used to establish approximation bounds for state aggregation and other forms of value function approximation. In this paper, we prove that a bisimulation metric defined on the state space of a Markov decision process is the optimal value function of an optimal coupling of two copies of the original model. We prove the result in the general case of continuous state spaces. This result has important implications in understanding the complexity of computing such metrics, and opens up the possibility of more efficient computational methods.

NeurIPS Conference 2014 Conference Paper

Learning with Pseudo-Ensembles

  • Philip Bachman
  • Ouais Alsharif
  • Doina Precup

We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We examine the relationship of pseudo-ensembles, which involve perturbation in model-space, to standard ensemble methods and existing notions of robustness, which focus on perturbation in observation-space. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
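
As a rough illustration of the general idea (not the authors' exact regularizer), the sketch below samples several dropout-perturbed children of one toy parent network and penalizes their disagreement on unlabeled inputs; the architecture, noise level, and penalty are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy parent network: one hidden layer with a ReLU nonlinearity.
W1, b1 = rng.normal(size=(20, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

def child_forward(x, drop_prob=0.5):
    """One pseudo-ensemble child: the parent with a random dropout mask applied."""
    h = np.maximum(x @ W1 + b1, 0.0)
    mask = rng.random(h.shape) > drop_prob        # noise process perturbing the parent
    h = h * mask / (1.0 - drop_prob)              # inverted-dropout scaling
    return h @ W2 + b2

def consistency_penalty(x, n_children=8):
    """Penalize the variance of child outputs (a simple robustness-style regularizer)."""
    outs = np.stack([child_forward(x) for _ in range(n_children)])
    return outs.var(axis=0).mean()

x_unlabeled = rng.normal(size=(32, 20))
print("pseudo-ensemble disagreement:", consistency_penalty(x_unlabeled))
```

In a semi-supervised setup, a term like this can be added on unlabeled data alongside the usual supervised loss on labeled data.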

NeurIPS Conference 2014 Conference Paper

Optimizing Energy Production Using Policy Search and Predictive State Representations

  • Yuri Grinberg
  • Doina Precup
  • Michel Gendreau

We consider the challenging practical problem of optimizing the power production of a complex of hydroelectric power plants, which involves control over three continuous action variables, uncertainty in the amount of water inflows and a variety of constraints that need to be satisfied. We propose a policy-search-based approach coupled with predictive modelling to address this problem. This approach has some key advantages compared to other alternatives, such as dynamic programming: the policy representation and search algorithm can conveniently incorporate domain knowledge; the resulting policies are easy to interpret, and the algorithm is naturally parallelizable. Our algorithm obtains a policy which outperforms the solution found by dynamic programming both quantitatively and qualitatively.

ICML Conference 2014 Conference Paper

Sample-based approximate regularization

  • Philip Bachman
  • Amir Massoud Farahmand
  • Doina Precup

We introduce a method for regularizing linearly parameterized functions using general derivative-based penalties, which relies on sampling as well as finite-difference approximations of the relevant derivatives. We call this approach sample-based approximate regularization (SAR). We provide theoretical guarantees on the fidelity of such regularizers, compared to those they approximate, and prove that the approximations converge efficiently. We also examine the empirical performance of SAR on several datasets.
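
A minimal sketch of the idea as the abstract describes it, under illustrative assumptions (the feature map, sampling distribution, and step size below are placeholders): for a linearly parameterized function f_w(x) = w·φ(x), sample input points and penalize finite-difference estimates of its derivative.

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(x):
    """Illustrative feature map for a linearly parameterized function f_w(x) = w . phi(x)."""
    return np.stack([np.ones_like(x), x, np.sin(3 * x)], axis=-1)

def sar_penalty(w, n_samples=200, eps=1e-3):
    """Sample-based approximate regularizer: average squared central finite-difference
    estimate of df/dx at points sampled from the input domain."""
    xs = rng.uniform(-1.0, 1.0, size=n_samples)
    f = lambda x: phi(x) @ w
    deriv = (f(xs + eps) - f(xs - eps)) / (2 * eps)
    return np.mean(deriv ** 2)

w = rng.normal(size=3)
print("SAR-style derivative penalty:", sar_penalty(w))
```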

RLDM Conference 2013 Conference Abstract

Approximate Policy Iteration with Demonstration Data

  • Beomjoon Kim
  • Amir-massoud Farahmand
  • Joelle Pineau
  • Doina Precup

We propose an algorithm to solve uncertain sequential decision-making problems that utilizes two different types of data sources. The first is the data available in the conventional reinforcement learning setup: an agent interacts with the environment and receives a sequence of state transition samples alongside the corresponding reward signal. The second data source, which differentiates the setup of this work from the usual reinforcement learning framework, is in the form of expert’s demonstrations, that is, a set of states with the expert’s suggested actions. Benefitting from both sources of data, which are available in many real-world application domains, allows the agent to perform well even with few data points. The algorithm is couched in the framework of Approximate Policy Iteration. Its approximate policy evaluation step is formulated as a convex optimization problem in which the expert demonstration data act as a set of linear constraints. In a real robotic navigation task, we show that the algorithm outperforms both pure approximate policy iteration and supervised learning.

ICML Conference 2013 Conference Paper

Average Reward Optimization Objective In Partially Observable Domains

  • Yuri Grinberg
  • Doina Precup

We consider the problem of average reward optimization in domains with partial observability, within the modeling framework of linear predictive state representations (PSRs). The key to average-reward computation is to have a well-defined stationary behavior of a system, so the required averages can be computed. If, additionally, the stationary behavior varies smoothly with changes in policy parameters, average-reward control through policy search also becomes a possibility. In this paper, we show that PSRs have a well-behaved stationary distribution, which is a rational function of policy parameters. Based on this result, we define a related reward process particularly suitable for average reward optimization, and analyze its properties. We show that in such a predictive state reward process, the average reward is a rational function of the policy parameters, whose complexity depends on the dimension of the underlying linear PSR. This result suggests that average reward-based policy search methods can be effective when the dimension of the system is small, even when the system representation in the POMDP framework requires many hidden states. We provide illustrative examples of this type.

NeurIPS Conference 2013 Conference Paper

Bellman Error Based Feature Generation using Random Projections on Sparse Spaces

  • Mahdi Milani Fard
  • Yuri Grinberg
  • Amir-massoud Farahmand
  • Joelle Pineau
  • Doina Precup

This paper addresses the problem of automatic generation of features for value function approximation in reinforcement learning. Bellman Error Basis Functions (BEBFs) have been shown to improve the error of policy evaluation with function approximation, with a convergence rate similar to that of value iteration. We propose a simple, fast and robust algorithm based on random projections, which generates BEBFs for sparse feature spaces. We provide a finite sample analysis of the proposed method, and prove that projections logarithmic in the dimension of the original space guarantee a contraction in the error. Empirical results demonstrate the strength of this method in domains in which choosing a good state representation is challenging.
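
A sketch of the basic loop suggested by the abstract, with made-up data and dimensions: project sparse features through a random matrix, fit the current TD (Bellman) residual in the projected space, and append the fitted function as a new basis feature. The regression step and feature sizes here are illustrative, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(2)
n, D, d, gamma = 500, 1000, 25, 0.95

# Toy sparse features for sampled transitions (s, r, s').
Phi = (rng.random((n, D)) < 0.01).astype(float)
Phi_next = (rng.random((n, D)) < 0.01).astype(float)
rewards = rng.normal(size=n)

def add_bebf(value_weights, basis):
    """One step: regress the TD residual of the current value estimate on randomly
    projected features, and append the fitted values as a new basis function."""
    td_error = rewards + gamma * (Phi_next @ value_weights) - Phi @ value_weights
    proj = rng.normal(size=(D, d)) / np.sqrt(d)        # random projection of the sparse space
    Z = Phi @ proj
    coef, *_ = np.linalg.lstsq(Z, td_error, rcond=None)
    return np.column_stack([basis, Z @ coef])

basis = Phi @ rng.normal(size=(D, 5)) / np.sqrt(5)     # a few initial projected features
basis = add_bebf(np.zeros(D), basis)
print("basis shape after adding one Bellman-error feature:", basis.shape)
```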

RLDM Conference 2013 Conference Abstract

CAPI: Generalized Classification-based Approximate Policy Iteration

  • Amir-massoud Farahmand
  • Doina Precup
  • André Barreto
  • Mohammad Ghavamzadeh

Efficient methods for tackling large reinforcement learning problems usually exploit regularities, or intrinsic structures, of the problem in hand. Most current methods benefit from the regularities of either value function or policy, but not both. In this paper, we introduce a general classification-based approximate policy iteration (CAPI) framework, which can benefit from both types of regularities. This framework has two main components: a generic user-specified value function estimator and a weighted classifier that learns a policy based on the estimated value function. The result is a flexible and sample-efficient class of algorithms. We also use a particular instantiation of CAPI to design an adaptive treatment strategy for HIV-infected patients. Comparison with a state-of-the-art purely value-based reinforcement learning algorithm, Tree-based Fitted Q-Iteration, shows that benefitting from the regularity of both policy and value function can lead to better performance.

NeurIPS Conference 2013 Conference Paper

Learning from Limited Demonstrations

  • Beomjoon Kim
  • Amir-massoud Farahmand
  • Joelle Pineau
  • Doina Precup

We propose an approach to learning from demonstration (LfD) which leverages expert data, even if the expert examples are very few or inaccurate. We achieve this by integrating LfD in an approximate policy iteration algorithm. The key idea of our approach is that expert examples are used to generate linear constraints on the optimization, in a similar fashion to large-margin classification. We prove an upper bound on the true Bellman error of the approximation computed by the algorithm at each iteration. We show empirically that the algorithm outperforms pure policy iteration, as well as DAgger (a state-of-the-art LfD algorithm) and supervised learning, in a variety of scenarios, including when very few and/or imperfect demonstrations are available. Our experiments include simulations as well as a real robotic navigation task.

RLDM Conference 2013 Conference Abstract

Options in reinforcement learning: The state of the art

  • Doina Precup

Temporal abstraction is the ability to reason about, and plan with, courses of action taking place at multiple time scales. The options framework is a natural way of providing this capability to reinforcement learning systems. In this talk, I will review the options framework, the well-established algorithms for learning and planning with options, and several applications ranging from robotics to neuroscience. I will then discuss option discovery, the key remaining open problem in this area. I will describe some new intuitions and on-going work on efficient ways of constructing options.

EWRL Workshop 2012 Conference Paper

An Empirical Analysis of Off-policy Learning in Discrete MDPs

  • Cosmin Paduraru
  • Doina Precup
  • Joelle Pineau
  • Gheorghe Comanici

Off-policy evaluation is the problem of evaluating a decision-making policy using data collected under a different behaviour policy. While several methods are available for addressing off-policy evaluation, little work has been done on identifying the best methods. In this paper, we conduct an in-depth comparative study of several off-policy evaluation methods in non-bandit, finite-horizon MDPs, using randomly generated MDPs, as well as a Mallard population dynamics model [Anderson, 1975]. We find that un-normalized importance sampling can exhibit prohibitively large variance in problems involving look-ahead longer than a few time steps, and that dynamic programming methods perform better than Monte-Carlo style methods.
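
For orientation, a small sketch of the two estimator families typically contrasted in studies like this one: unnormalized (ordinary) per-trajectory importance sampling and its weighted (self-normalized) counterpart. The toy policies, horizon, and reward model are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n_actions, horizon, n_traj = 3, 10, 2000
behavior = np.full(n_actions, 1.0 / n_actions)      # uniform behaviour policy
target = np.array([0.7, 0.2, 0.1])                  # policy we want to evaluate

ratios, returns = [], []
for _ in range(n_traj):
    actions = rng.choice(n_actions, size=horizon, p=behavior)
    rewards = rng.normal(loc=actions.astype(float))  # toy reward: mean equals action index
    ratios.append(np.prod(target[actions] / behavior[actions]))  # trajectory importance ratio
    returns.append(rewards.sum())                    # undiscounted finite-horizon return

ratios, returns = np.array(ratios), np.array(returns)
is_estimate = np.mean(ratios * returns)              # unnormalized importance sampling
wis_estimate = np.sum(ratios * returns) / np.sum(ratios)  # weighted (self-normalized) version
print(f"IS: {is_estimate:.3f}   WIS: {wis_estimate:.3f}")
```

With longer horizons the product of per-step ratios blows up, which is the variance problem the abstract refers to.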

NeurIPS Conference 2012 Conference Paper

On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization

  • Andre Barreto
  • Doina Precup
  • Joelle Pineau

The ability to learn a policy for a sequential decision problem with continuous state space using on-line data is a long-standing challenge. This paper presents a new reinforcement-learning algorithm, called iKBSF, which extends the benefits of kernel-based learning to the on-line scenario. As a kernel-based method, the proposed algorithm is stable and has good convergence properties. However, unlike other similar algorithms, iKBSF's space complexity is independent of the number of sample transitions, and as a result it can process an arbitrary amount of data. We present theoretical results showing that iKBSF can approximate (to any level of accuracy) the value function that would be learned by an equivalent batch non-parametric kernel-based reinforcement learning approximator. In order to show the effectiveness of the proposed algorithm in practice, we apply iKBSF to the challenging three-pole balancing task, where the ability to process a large number of transitions is crucial for achieving a high success rate.

NeurIPS Conference 2012 Conference Paper

Value Pursuit Iteration

  • Amir Farahmand
  • Doina Precup

Value Pursuit Iteration (VPI) is an approximate value iteration algorithm that finds a close to optimal policy for reinforcement learning and planning problems with large state spaces. VPI has two main features: First, it is a nonparametric algorithm that finds a good sparse approximation of the optimal value function given a dictionary of features. The algorithm is almost insensitive to the number of irrelevant features. Second, after each iteration of VPI, the algorithm adds a set of functions based on the currently learned value function to the dictionary. This increases the representation power of the dictionary in a way that is directly relevant to the goal of having a good approximation of the optimal value function. We theoretically study VPI and provide a finite-sample error upper bound for it.

EWRL Workshop 2011 Conference Paper

A Framework for Computing Bounds for the Return of a Policy

  • Cosmin Paduraru
  • Doina Precup
  • Joelle Pineau

We present a framework for computing bounds for the return of a policy in finite-horizon, continuous-state Markov Decision Processes with bounded state transitions. The state transition bounds can be based on either prior knowledge alone, or on a combination of prior knowledge and data. Our framework uses a piecewise-constant representation of the return bounds and a backwards iteration process. We instantiate this framework for a previously investigated type of prior knowledge – namely, Lipschitz continuity of the transition function. In this context, we show that the existing bounds of Fonteneau et al. (2009, 2010) can be expressed as a particular instantiation of our framework, by bounding the immediate rewards using Lipschitz continuity and choosing a particular form for the regions in the piecewise-constant representation. We also show how different instantiations of our framework can improve upon their bounds.

EWRL Workshop 2011 Conference Paper

Automatic Construction of Temporally Extended Actions for MDPs Using Bisimulation Metrics

  • Pablo Samuel Castro
  • Doina Precup

Temporally extended actions are usually effective in speeding up reinforcement learning. In this paper we present a mechanism for automatically constructing such actions, expressed as options [24], in a finite Markov Decision Process (MDP). To do this, we compute a bisimulation metric [7] between the states in a small MDP and the states in a large MDP, which we want to solve. The shape of this metric is then used to completely define a set of options for the large MDP. We demonstrate empirically that our approach is able to improve the speed of reinforcement learning, and is generally not sensitive to parameter tuning.

AAMAS Conference 2011 Conference Paper

Basis Function Discovery using Spectral Clustering and Bisimulation Metrics

  • Gheorghe Comanici
  • Doina Precup

We study the problem of automatically generating features for function approximation in reinforcement learning. We build on the work of Mahadevan and his colleagues, who pioneered the use of spectral clustering methods for basis function construction. Their methods work on top of a graph that captures state adjacency. Instead, we use bisimulation metrics in order to provide state distances for spectral clustering. The advantage of these metrics is that they incorporate reward information in a natural way, in addition to the state transition information. We provide theoretical bounds on the quality of the obtained approximation, which justify the importance of incorporating reward information. We also demonstrate empirically that the approximation quality improves when bisimulation metrics are used instead of the state adjacency graph in the basis function construction process.
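
A sketch of the pipeline this abstract describes, with a placeholder distance matrix standing in for actual bisimulation metrics: convert state distances into an affinity matrix and run spectral clustering on it, then use cluster indicators as basis functions. The kernel width, cluster count, and distances are illustrative.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(4)

# Placeholder for a matrix of pairwise bisimulation distances between states.
n_states = 30
d = rng.random((n_states, n_states))
d = (d + d.T) / 2.0
np.fill_diagonal(d, 0.0)

# Turn distances into similarities and cluster; the one-hot cluster indicators
# can then serve as basis functions for value function approximation.
sigma = d.mean()
affinity = np.exp(-(d ** 2) / (2 * sigma ** 2))
labels = SpectralClustering(n_clusters=4, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
basis = np.eye(4)[labels]
print("basis functions:", basis.shape)
```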

AAAI Conference 2011 Conference Paper

Basis Function Discovery Using Spectral Clustering and Bisimulation Metrics

  • Gheorghe Comanici
  • Doina Precup

We study the problem of automatically generating features for function approximation in reinforcement learning. We build on the work of Mahadevan and his colleagues, who pioneered the use of spectral clustering methods for basis function construction. Their methods work on top of a graph that captures state adjacency. Instead, we use bisimulation metrics in order to provide state distances for spectral clustering. The advantage of these metrics is that they incorporate reward information in a natural way, in addition to the state transition information. We provide theoretical bounds on the quality of the obtained approximation, which justify the importance of incorporating reward information. We also demonstrate empirically that the approximation quality improves when bisimulation metrics are used instead of the state adjacency graph in the basis function construction process.

AAMAS Conference 2011 Conference Paper

Horde: A Scalable Real-time Architecture for Learning Knowledge from Unsupervised Sensorimotor Interaction

  • Richard S. Sutton
  • Joseph Modayil
  • Michael Delp
  • Thomas Degris
  • Patrick M. Pilarski
  • Adam White
  • Doina Precup

Maintaining accurate world knowledge in a complex and changing environment is a perennial problem for robots and other artificial intelligence systems. Our architecture for addressing this problem, called Horde, consists of a large number of independent reinforcement learning sub-agents, or demons. Each demon is responsible for answering a single predictive or goal-oriented question about the world, thereby contributing in a factored, modular way to the system's overall knowledge. The questions are in the form of a value function, but each demon has its own policy, reward function, termination function, and terminal-reward function unrelated to those of the base problem. Learning proceeds in parallel by all demons simultaneously so as to extract the maximal training information from whatever actions are taken by the system as a whole. Gradient-based temporal-difference learning methods are used to learn efficiently and reliably with function approximation in this off-policy setting. Horde runs in constant time and memory per time step, and is thus suitable for learning online in real-time applications such as robotics. We present results using Horde on a multi-sensored mobile robot to successfully learn goal-oriented behaviors and long-term predictions from off-policy experience. Horde is a significant incremental step towards a real-time architecture for efficient learning of general knowledge from unsupervised sensorimotor interaction.

NeurIPS Conference 2011 Conference Paper

Reinforcement Learning using Kernel-Based Stochastic Factorization

  • Andre Barreto
  • Doina Precup
  • Joelle Pineau

Kernel-based reinforcement-learning (KBRL) is a method for learning a decision policy from a set of sample transitions which stands out for its strong theoretical guarantees. However, the size of the approximator grows with the number of transitions, which makes the approach impractical for large problems. In this paper we introduce a novel algorithm to improve the scalability of KBRL. We resort to a special decomposition of a transition matrix, called stochastic factorization, to fix the size of the approximator while at the same time incorporating all the information contained in the data. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster but still converges to a unique solution. We derive a theoretical upper bound for the distance between the value functions computed by KBRL and KBSF. The effectiveness of our method is illustrated with computational experiments on four reinforcement-learning problems, including a difficult task in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures in epileptic rat brains. We empirically demonstrate that the proposed approach is able to compress the information contained in KBRL's model. Also, on the tasks studied, KBSF outperforms two of the most prominent reinforcement-learning algorithms, namely least-squares policy iteration and fitted Q-iteration.

AAAI Conference 2010 Conference Paper

Activity and Gait Recognition with Time-Delay Embeddings

  • Jordan Frank
  • Shie Mannor
  • Doina Precup

Activity recognition based on data from mobile wearable devices is becoming an important application area for machine learning. We propose a novel approach based on a combination of feature extraction using time-delay embedding and supervised learning. The computational requirements are considerably lower than existing approaches, so the processing can be done in real time on a low-powered portable device such as a mobile phone. We evaluate the performance of our algorithm on a large, noisy data set comprising over 50 hours of data from six different subjects, including activities such as running and walking up or down stairs. We also demonstrate the ability of the system to accurately classify an individual from a set of 25 people, based only on the characteristics of their walking gait. The system requires very little parameter tuning, and can be trained with small amounts of data.
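
A minimal sketch of the time-delay embedding step mentioned in the abstract: each sample of a scalar sensor stream is mapped to a vector of lagged values, which can then be fed to any supervised classifier. The lag, embedding dimension, and synthetic accelerometer trace below are illustrative choices, not the paper's settings.

```python
import numpy as np

def delay_embed(signal, dim=5, lag=4):
    """Map a 1-D signal to rows [x_t, x_{t-lag}, ..., x_{t-(dim-1)*lag}]."""
    start = (dim - 1) * lag
    return np.stack([signal[start - k * lag: len(signal) - k * lag]
                     for k in range(dim)], axis=1)

# Toy accelerometer-like trace: a noisy walking-cadence sinusoid.
t = np.arange(0, 20, 0.02)
rng = np.random.default_rng(5)
signal = np.sin(2 * np.pi * 2.0 * t) + 0.1 * rng.normal(size=t.size)

features = delay_embed(signal)
print(features.shape)   # (n_samples, 5) embedding vectors for a downstream classifier
```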

AAMAS Conference 2010 Conference Paper

Optimal Policy Switching Algorithms for Reinforcement Learning

  • Gheorghe Comanici
  • Doina Precup

We address the problem of single-agent, autonomous sequential decision making. We assume that some controllers or behavior policies are given as prior knowledge, and the task of the agent is to learn how to switch between these policies. We formulate the problem using the framework of reinforcement learning and options (Sutton, Precup & Singh, 1999; Precup, 2000). We derive gradient-based algorithms for learning the termination conditions of options, with the goal of optimizing the expected long-term return. We incorporate the proposed approach into policy-gradient methods with linear function approximation.

AAMAS Conference 2010 Conference Paper

Using bisimulation for policy transfer in MDPs

  • Pablo Castro
  • Doina Precup

Knowledge transfer is a powerful approach to solve Markov Decision Processes. In this paper, we present approaches that use bisimulation-style metrics (Ferns, Panangaden & Precup, 2004) to compute the similarity of states in a large problem to states in smaller problems, which might have already been solved. We propose algorithms that decide what actions to transfer from the small to the large problem, given this information. We also show that this approach can be used, even more successfully, when using temporally extended actions (Sutton, Precup & Singh, 1999). We present theoretical guarantees on the quality of the transferred policy, as well as promising empirical results.

AAAI Conference 2010 Conference Paper

Using Bisimulation for Policy Transfer in MDPs

  • Pablo Castro
  • Doina Precup

Knowledge transfer has been suggested as a useful approach for solving large Markov Decision Processes. The main idea is to compute a decision-making policy in one environment and use it in a different environment, provided the two are "close enough". In this paper, we use bisimulation-style metrics (Ferns et al., 2004) to guide knowledge transfer. We propose algorithms that decide what actions to transfer from the policy computed on a small MDP task to a large task, given the bisimulation distance between states in the two tasks. We demonstrate the inherent "pessimism" of bisimulation metrics and present variants of this metric aimed to overcome this pessimism, leading to improved action transfer. We also show that using this approach for transferring temporally extended actions (Sutton et al., 1999) is more successful than using it exclusively with primitive actions. We present theoretical guarantees on the quality of the transferred policy, as well as promising empirical results.

IJCAI Conference 2009 Conference Paper

  • Pablo Samuel Castro
  • Prakash Panangaden
  • Doina Precup

We explore equivalence relations between states in Markov Decision Processes and Partially Observable Markov Decision Processes. We focus on two different equivalence notions: bisimulation [Givan et al., 2003] and a notion of trace equivalence, under which states are considered equivalent if they generate the same conditional probability distributions over observation sequences (where the conditioning is on action sequences). We show that the relationship between these two equivalence notions changes depending on the amount and nature of the partial observability. We also present an alternate characterization of bisimulation based on trajectory equivalence.

IJCAI Conference 2009 Conference Paper

  • Robert West
  • Joelle Pineau
  • Doina Precup

Computing the semantic distance between real-world concepts is crucial for many intelligent applications. We present a novel method that leverages data from ‘Wikispeedia’, an online game played on Wikipedia; players have to reach an article from another, unrelated article, only by clicking links in the articles encountered. In order to automatically infer semantic distances between everyday concepts, our method effectively extracts the common sense displayed by humans during play, and is thus more desirable, from a cognitive point of view, than purely corpus-based methods. We show that our method significantly outperforms Latent Semantic Analysis in a psychometric evaluation of the quality of learned semantic distances.

NeurIPS Conference 2009 Conference Paper

Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

  • Shalabh Bhatnagar
  • Doina Precup
  • David Silver
  • Richard Sutton
  • Hamid Maei
  • Csaba Szepesvári

We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD($\lambda$), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al. (2009a, b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman-error, and algorithms that perform stochastic gradient-descent on this function. In this paper, we generalize their work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms for any finite Markov decision process and any smooth value function approximator, under usual stochastic approximation conditions. The computational complexity per iteration scales linearly with the number of parameters of the approximator. The algorithms are incremental and are guaranteed to converge to locally optimal solutions.
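
For orientation, the linear-case gradient-TD (TDC-style) updates that this work generalizes to smooth nonlinear approximators are shown below, in standard notation (linear value estimate $\theta^\top\phi(s)$ with an auxiliary weight vector $w$); this is the precursor, not the nonlinear algorithm introduced in the paper itself.

```latex
% Linear TDC updates (the linear-case precursor generalized by this paper)
\begin{align*}
\delta_t &= r_{t+1} + \gamma\,\theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t,\\
\theta_{t+1} &= \theta_t + \alpha_t\bigl[\delta_t\,\phi_t - \gamma\,\phi_{t+1}\,(\phi_t^\top w_t)\bigr],\\
w_{t+1} &= w_t + \beta_t\bigl(\delta_t - \phi_t^\top w_t\bigr)\phi_t .
\end{align*}
```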

NeurIPS Conference 2008 Conference Paper

Bounding Performance Loss in Approximate MDP Homomorphisms

  • Jonathan Taylor
  • Doina Precup
  • Prakash Panangaden

We define a metric for measuring behavior similarity between states in a Markov decision process (MDP), in which action similarity is taken into account. We show that the kernel of our metric corresponds exactly to the classes of states defined by MDP homomorphisms (Ravindran & Barto, 2003). We prove that the difference in the optimal value function of different states can be upper-bounded by the value of this metric, and that the bound is tighter than that provided by bisimulation metrics (Ferns et al. 2004, 2005). Our results hold both for discrete and for continuous actions. We provide an algorithm for constructing approximate homomorphisms, by using this metric to identify states that can be grouped together, as well as actions that can be matched. Previous research on this topic is based mainly on heuristics.

IJCAI Conference 2007 Conference Paper

  • Pablo S. Castro
  • Doina Precup

A key problem in reinforcement learning is finding a good balance between the need to explore the environment and the need to gain rewards by exploiting existing knowledge. Much research has been devoted to this topic, and many of the proposed methods are aimed simply at ensuring that enough samples are gathered to estimate the value function well. In contrast, in 1959 Bellman proposed constructing a representation in which the states of the original system are paired with knowledge about the current model. Hence, knowledge about the possible Markov models of the environment is represented and maintained explicitly. Unfortunately, this approach is intractable except for bandit problems (where it gives rise to Gittins indices, an optimal exploration method). In this paper, we explore ideas for making this method computationally tractable. We maintain a model of the environment as a Markov Decision Process. We sample finite-length trajectories from the infinite tree using ideas based on sparse sampling. Finding the values of the nodes of this sparse subtree can then be expressed as an optimization problem, which we solve using Linear Programming. We illustrate this approach on a few domains and compare it with other exploration algorithms.

IJCAI Conference 2007 Conference Paper

  • Rupert Brooks
  • Tal Arbel
  • Doina Precup

Image alignment refers to finding the best transformation from a fixed reference image to a new image of a scene. This process is often guided by similarity measures between images, computed based on the image data. However, in time-critical applications state-of-the-art methods for computing similarity are too slow. Instead of using all the image data to compute similarity, one can use a subset of pixels to improve the speed, but often this comes at the cost of reduced accuracy. This makes the problem of image alignment a natural application domain for deliberation control using anytime algorithms. However, almost no research has been done in this direction. In this paper, we present anytime versions for the computation of two common image similarity measures: mean squared difference and mutual information. Off-line, we learn a performance profile specific to each measure, which is then used on-line to select the appropriate amount of pixels to process at each optimization step. When tested against existing techniques, our method achieves comparable quality and robustness with significantly less computation.

IJCAI Conference 2007 Conference Paper

  • Marc G. Bellemare
  • Doina Precup

Markov models have been a keystone in Artificial Intelligence for many decades. However, they remain unsatisfactory when the environment modelled is partially observable. There are pathological examples where no history of fixed length is sufficient for accurate prediction or decision making. On the other hand, working with a hidden state (like in Hidden Markov Models or Partially Observable Markov Decision Processes) has a high computational cost. In order to circumvent this problem, we suggest the use of a context-based model. Our approach replaces strict transition probabilities by influences on transitions. The method proposed provides a trade-off between a fully and partially observable model. We also discuss the capacity of our framework to model hierarchical knowledge and abstraction. Simple examples are given in order to show the advantages of the algorithm.

ICRA Conference 2007 Conference Paper

A formal framework for robot learning and control under model uncertainty

  • Robin Jaulmes
  • Joelle Pineau
  • Doina Precup

While the partially observable Markov decision process (POMDP) provides a formal framework for the problem of robot control under uncertainty, it typically assumes a known and stationary model of the environment. In this paper, we study the problem of finding an optimal policy for controlling a robot in a partially observable domain, where the model is not perfectly known, and may change over time. We present an algorithm called MEDUSA which incrementally learns a POMDP model using queries, while still optimizing a reward function. We demonstrate the effectiveness of the approach for a simple scenario, where a robot seeking a person has minimal a priori knowledge of its own sensor model, as well as of where the person is located.

UAI Conference 2006 Conference Paper

Methods for Computing State Similarity in Markov Decision Processes

  • Norman Ferns
  • Pablo Samuel Castro
  • Doina Precup
  • Prakash Panangaden

A popular approach to solving large probabilistic systems relies on aggregating states based on a measure of similarity. Many approaches in the literature are heuristic. A number of recent methods rely instead on metrics based on the notion of bisimulation, or behavioral equivalence between states (Givan et al., 2001, 2003; Ferns et al., 2004). An integral component of such metrics is the Kantorovich metric between probability distributions. However, while this metric enables many satisfying theoretical properties, it is costly to compute in practice. In this paper, we use techniques from network optimization and statistical sampling to overcome this problem. We obtain in this manner a variety of distance functions for MDP state aggregation, which differ in the tradeoff between time and space complexity, as well as the quality of the aggregation. We provide an empirical evaluation of these trade-offs.
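
The Kantorovich (optimal-transport) distance at the heart of these metrics can be computed as a small transportation linear program; below is a sketch using scipy, with a made-up ground distance between three states. The exact solvers and sampling schemes studied in the paper are more sophisticated than this naive LP.

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich(p, q, ground_dist):
    """Kantorovich distance between discrete distributions p, q over n states,
    given an n x n ground distance matrix, via the transportation LP."""
    n = len(p)
    c = ground_dist.reshape(-1)                 # cost of moving mass from i to j
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0        # sum_j x_ij = p_i
        A_eq[n + i, i::n] = 1.0                 # sum_i x_ij = q_j
    b_eq = np.concatenate([p, q])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])
ground = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
print(kantorovich(p, q, ground))   # 1.0: each half of the mass shifts one step
```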

UAI Conference 2005 Conference Paper

Metrics for Markov Decision Processes with Infinite State Spaces

  • Norman Ferns
  • Prakash Panangaden
  • Doina Precup

We present metrics for measuring state similarity in Markov decision processes (MDPs) with infinitely many states, including MDPs with continuous state spaces. Such metrics provide a stable quantitative analogue of the notion of bisimulation for MDPs, and are suitable for use in MDP approximation. We show that the optimal value function associated with a discounted infinite horizon planning task varies continuously with respect to our metric distances.

NeurIPS Conference 2005 Conference Paper

Off-policy Learning with Options and Recognizers

  • Doina Precup
  • Cosmin Paduraru
  • Anna Koop
  • Richard Sutton
  • Satinder Singh

We introduce a new algorithm for off-policy temporal-difference learning with function approximation that has lower variance and requires less knowledge of the behavior policy than prior methods. We develop the notion of a recognizer, a filter on actions that distorts the behavior policy to produce a related target policy with low-variance importance-sampling corrections. We also consider target policies that are deviations from the state distribution of the behavior policy, such as potential temporally abstract options, which further reduces variance. This paper introduces recognizers and their potential advantages, then develops a full algorithm for linear function approximation and proves that its updates are in the same direction as on-policy TD updates, which implies asymptotic convergence. Even though our algorithm is based on importance sampling, we prove that it requires absolutely no knowledge of the behavior policy for the case of state-aggregation function approximators. Off-policy learning is learning about one way of behaving while actually behaving in another way. For example, Q-learning is an off-policy learning method because it learns about the optimal policy while taking actions in a more exploratory fashion, e.g., according to an $\epsilon$-greedy policy. Off-policy learning is of interest because only one way of selecting actions can be used at any time, but we would like to learn about many different ways of behaving from the single resultant stream of experience. For example, the options framework for temporal abstraction involves considering a variety of different ways of selecting actions. For each such option one would like to learn a model of its possible outcomes suitable for planning and other uses. Such option models have been proposed as fundamental building blocks of grounded world knowledge (Sutton, Precup & Singh, 1999; Sutton, Rafols & Koop, 2005). Using off-policy learning, one would be able to learn predictive models for many options at the same time from a single stream of experience. Unfortunately, off-policy learning using temporal-difference methods has proven problematic when used in conjunction with function approximation. Function approximation is essential in order to handle the large state spaces that are inherent in many problem domains. Q-learning, for example, has been proven to converge to an optimal policy in the tabular case, but is unsound and may diverge in the case of linear function approximation (Baird, 1996). Precup, Sutton, and Dasgupta (2001) introduced and proved convergence for the first off-policy learning algorithm with linear function approximation. They addressed the problem of learning the expected value of a target policy based on experience generated using a different behavior policy. They used importance sampling techniques to reduce the off-policy case to the on-policy case, where existing convergence theorems apply (Tsitsiklis & Van Roy, 1997; Tadic, 2001). There are two important difficulties with that approach. First, the behavior policy needs to be stationary and known, because it is needed to compute the importance sampling corrections. Second, the importance sampling weights are often ill-conditioned. In the worst case, the variance could be infinite and convergence would not occur. The conditions required to prevent this were somewhat awkward and, even when they applied and asymptotic convergence was assured, the variance could still be high and convergence could be slow.
In this paper we address both of these problems in the context of off-policy learning for options. We introduce the notion of a recognizer. Rather than specifying an explicit target policy (for instance, the policy of an option), about which we want to make predictions, a recognizer specifies a condition on the actions that are selected. For example, a recognizer for the temporally extended action of picking up a cup would not specify which hand is to be used, or what the motion should be at all different positions of the cup. The recognizer would recognize a whole variety of directions of motion and poses as part of picking up the cup. The advantage of this strategy is not that one might prefer a multitude of different behaviors, but that the behavior may be based on a variety of different strategies, all of which are relevant, and we would like to learn from any of them. In general, a recognizer is a function that recognizes or accepts a space of different ways of behaving and thus can learn from a wider range of data. Recognizers have two advantages over direct specification of a target policy: 1) they are a natural and easy way to specify a target policy for which importance sampling will be well conditioned, and 2) they do not require the behavior policy to be known. The latter is important because in many cases we may have little knowledge of the behavior policy, or a stationary behavior policy may not even exist. We show that for the case of state aggregation, even if the behavior policy is unknown, convergence to a good model is achieved. Non-sequential example: the benefits of using recognizers in off-policy learning can be most easily seen in a non-sequential context with a single continuous action. Suppose you are given a sequence of sample actions $a_i \in [0, 1]$, selected i.i.d. according to a probability density $b: [0, 1] \to \mathbb{R}^+$ (the behavior density). For example, suppose the behavior density is of the oscillatory form shown as a red line in Figure 1 of the paper. For each action $a_i$ we observe a corresponding outcome $z_i$, a random variable whose distribution depends only on $a_i$. Thus the behavior density induces an outcome density. The on-policy problem is to estimate the mean $m_b$ of the outcome density. This problem can be solved simply by averaging the sample outcomes: $\hat{m}_b = \frac{1}{n}\sum_{i=1}^{n} z_i$. The off-policy problem is to use this same data to learn what the mean would be if actions were selected in some way other than $b$, for example, if the actions were restricted to a designated range, such as between 0.7 and 0.9. There are two natural ways to pose this off-policy problem. The most straightforward way is to be equally interested in all actions within the designated region. One professes to be interested in actions selected according to a target density $\pi: [0, 1] \to \mathbb{R}^+$, which in the example would be 5.0 between 0.7 and 0.9, and zero elsewhere, as in the dashed line in Figure 1 of the paper. [Figure 1 of the paper shows the probability density functions of the target policy with and without a recognizer, and the empirical variances (averaged over 200 sample variances) with and without a recognizer.]
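
A small simulation of the non-sequential example sketched above: actions drawn from a behaviour density, a recognizer that accepts actions in [0.7, 0.9], and a self-normalized importance-weighted estimate of the mean outcome under the recognizer-induced target density. The behaviour density and outcome model are invented for illustration; only the recognizer-ratio idea is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000

# Behaviour density on [0, 1]: a Beta(2, 5) stands in for the paper's oscillatory b.
actions = rng.beta(2.0, 5.0, size=n)
outcomes = 3.0 * actions + rng.normal(scale=0.2, size=n)    # toy outcome model z_i given a_i

# Recognizer: accept actions in the designated range [0.7, 0.9].
c = ((actions >= 0.7) & (actions <= 0.9)).astype(float)

# The recognizer-induced target density is c(a) b(a) / mu with mu = E_b[c(a)], so the
# importance ratio pi(a)/b(a) = c(a)/mu requires no knowledge of b itself.
mu = c.mean()
rho = c / mu
estimate = np.sum(rho * outcomes) / np.sum(rho)             # self-normalized estimate
print("estimated mean outcome under the recognizer target:", round(estimate, 3))
```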

UAI Conference 2004 Conference Paper

Metrics for Finite Markov Decision Processes

  • Norman Ferns
  • Prakash Panangaden
  • Doina Precup

We present metrics for measuring the similarity of states in a finite Markov decision process (MDP). The formulation of our metrics is based on the notion of bisimulation for MDPs, with an aim towards solving discounted infinite horizon reinforcement learning tasks. Such metrics can be used to aggregate states, as well as to better structure other value function approximators (e.g., memory-based or nearest-neighbor approximators). We provide bounds that relate our metric distances to the optimal values of states in the given MDP.
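
One common fixed-point form of such a bisimulation-style metric is given below; the constants $c_R, c_T$ (with $c_R + c_T \le 1$) weight the reward and transition terms, and $W_d$ denotes the Kantorovich distance between transition distributions under the metric $d$ itself. The notation is a standard rendering of this line of work, not a verbatim quotation.

```latex
% Bisimulation-style metric on MDP states as a fixed point (one common formulation)
d(s, t) \;=\; \max_{a \in A}\Bigl( c_R\,\bigl|\, r(s,a) - r(t,a) \,\bigr|
          \;+\; c_T\, W_d\bigl(P(\cdot \mid s, a),\, P(\cdot \mid t, a)\bigr) \Bigr),
\qquad c_R + c_T \le 1 .
```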

NeurIPS Conference 2002 Conference Paper

A Convergent Form of Approximate Policy Iteration

  • Theodore Perkins
  • Doina Precup

We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces $\epsilon$-soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.

NeurIPS Conference 1998 Conference Paper

Improved Switching among Temporally Abstract Actions

  • Richard Sutton
  • Satinder Singh
  • Doina Precup
  • Balaraman Ravindran

In robotics and other control applications it is commonplace to have a pre-existing set of controllers for solving subtasks, perhaps hand-crafted or previously learned or planned, and still face a difficult problem of how to choose and switch among the controllers to solve an overall task as well as possible. In this paper we present a framework based on Markov decision processes and semi-Markov decision processes for phrasing this problem, a basic theorem regarding the improvement in performance that can be obtained by switching flexibly between given controllers, and example applications of the theorem. In particular, we show how an agent can plan with these high-level controllers and then use the results of such planning to find an even better plan, by modifying the existing controllers, with negligible additional cost and no re-planning. In one of our examples, the complexity of the problem is reduced from 24 billion state-action pairs to less than a million state-controller pairs. In many applications, solutions to parts of a task are known, either because they were hand-crafted by people or because they were previously learned or planned. For example, in robotics applications, there may exist controllers for moving joints to positions, picking up objects, controlling eye movements, or navigating along hallways. More generally, an intelligent system may have available to it several temporally extended courses of action to choose from. In such cases, a key challenge is to take full advantage of the existing temporally extended actions, to choose or switch among them effectively, and to plan at their level rather than at the level of individual actions. Recently, several researchers have begun to address these challenges within the framework of reinforcement learning and Markov decision processes (e.g., Singh, 1992; Kaelbling, 1993; Dayan & Hinton, 1993; Thrun and Schwartz, 1995; Sutton, 1995; Dietterich, 1998; Parr & Russell, 1998; McGovern, Sutton & Fagg, 1997). Common to much of this recent work is the modeling of a temporally extended action as a policy (controller) and a condition for terminating, which we together refer to as an option (Sutton, Precup & Singh, 1998). In this paper we consider the problem of effectively combining given options into one overall policy, generalizing prior work by Kaelbling (1993). Sections 1-3 introduce the framework; our new results are in Sections 4 and 5.

NeurIPS Conference 1997 Conference Paper

Learning to Schedule Straight-Line Code

  • J. Moss
  • Paul Utgoff
  • John Cavazos
  • Doina Precup
  • Darko Stefanovic
  • Carla Brodley
  • David Scheeff

Program execution speed on modern computers is sensitive, by a factor of two or more, to the order in which instructions are presented to the processor. To realize potential execution efficiency, an optimizing compiler must employ a heuristic algorithm for instruction scheduling. Such algorithms are painstakingly hand-crafted, which is expensive and time-consuming. We show how to cast the instruction scheduling problem as a learning task, obtaining the heuristic scheduling algorithm automatically. Our focus is the narrower problem of scheduling straight-line code (also called basic blocks of instructions). Our empirical results show that just a few features are adequate for quite good performance at this task for a real modern processor, and that any of several supervised learning methods perform nearly optimally with respect to the features used.