Arrow Research Search

Author name cluster

Arthur Guez

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

21 papers
2 author rows

Possible papers (21)

EWRL Workshop 2023 Workshop Paper

Acceleration in Policy Optimization

  • Veronica Chelu
  • Tom Zahavy
  • Arthur Guez
  • Doina Precup
  • Sebastian Flennerhag

We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) through predictive and adaptive directions of (functional) policy ascent. Leveraging the connection between policy iteration and policy gradient methods, we view policy optimization algorithms as iteratively solving a sequence of surrogate objectives, local lower bounds on the original objective. We define optimism as predictive modelling of the future behavior of a policy, and hindsight adaptation as taking immediate and anticipatory corrective actions to mitigate accumulating errors from overshooting predictions or delayed responses to change. We use this shared lens to jointly express other well-known algorithms, including model-based policy improvement based on forward search, and optimistic meta-learning algorithms. We show connections with Anderson acceleration, Nesterov's accelerated gradient, extra-gradient methods, and linear extrapolation in the update rule. We analyze properties of the formulation, design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration, in an illustrative task.
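
As an illustration of the extrapolation idea discussed in this abstract (and not the paper's meta-gradient algorithm), the sketch below applies a Nesterov-style optimistic step to a softmax policy on a hypothetical bandit; all names and hyperparameters are placeholders.

```python
# Illustrative sketch only: a Nesterov-style "optimistic" policy update that
# computes the gradient at an extrapolated (predicted) iterate. This is not
# the paper's meta-gradient algorithm; the bandit setup is hypothetical.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def policy_grad(logits, rewards):
    # Gradient of E_pi[r] w.r.t. softmax logits: pi_i * (r_i - E_pi[r]).
    pi = softmax(logits)
    return pi * (rewards - pi @ rewards)

rng = np.random.default_rng(0)
rewards = rng.normal(size=5)            # hypothetical 5-armed bandit
logits = np.zeros(5)
prev_step = np.zeros(5)
lr, momentum = 0.5, 0.9

for _ in range(200):
    lookahead = logits + momentum * prev_step      # optimistic extrapolation
    prev_step = momentum * prev_step + lr * policy_grad(lookahead, rewards)
    logits += prev_step

print(softmax(logits) @ rewards, rewards.max())    # expected return vs. best arm
```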

ICLR Conference 2022 Conference Paper

COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation

  • Jongmin Lee 0004
  • Cosmin Paduraru
  • Daniel J. Mankowitz
  • Nicolas Heess
  • Doina Precup
  • Kee-Eung Kim
  • Arthur Guez

We consider the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset. This problem setting is appealing in many real-world scenarios, where direct interaction with the environment is costly or risky, and where the resulting policy should comply with safety constraints. However, it is challenging to compute a policy that guarantees satisfying the cost constraints in the offline RL setting, since the off-policy evaluation inherently has an estimation error. In this paper, we present an offline constrained RL algorithm that optimizes the policy in the space of the stationary distribution. Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction. Experimental results show that COptiDICE attains better policies in terms of constraint satisfaction and return-maximization, outperforming baseline algorithms.
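
A minimal sketch of how such a cost constraint can be handled with a Lagrange multiplier under off-policy importance weights is given below; it omits COptiDICE's actual stationary-distribution correction estimation, and all names are assumptions.

```python
# Minimal sketch of a Lagrangian treatment of the cost constraint (not the
# DICE-style stationary-distribution estimation itself). We assume we already
# have per-sample correction weights w(s, a) approximating d_pi / d_D for the
# offline dataset, plus per-sample rewards and costs.
import numpy as np

def dual_step(w, rewards, costs, lam, cost_limit, lam_lr=0.01):
    est_return = np.mean(w * rewards)   # off-policy estimate of expected return
    est_cost = np.mean(w * costs)       # off-policy estimate of expected cost
    # Dual ascent: raise lambda when the constraint E[cost] <= cost_limit is
    # violated, and let it decay towards zero when it is satisfied.
    lam = max(0.0, lam + lam_lr * (est_cost - cost_limit))
    # Primal objective the policy (or the weights) would be trained to maximize.
    objective = est_return - lam * (est_cost - cost_limit)
    return lam, objective
```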

NeurIPS Conference 2022 Conference Paper

Large-Scale Retrieval for Reinforcement Learning

  • Peter Humphreys
  • Arthur Guez
  • Olivier Tieleman
  • Laurent Sifre
  • Theophane Weber
  • Timothy Lillicrap

Effective decision making involves flexibly relating past experiences and relevant contextual information to a novel situation. In deep reinforcement learning (RL), the dominant paradigm is for an agent to amortise information that helps decision-making into its network weights via gradient descent on training losses. Here, we pursue an alternative approach in which agents can utilise large-scale context-sensitive database lookups to support their parametric computations. This allows agents to directly learn in an end-to-end manner to utilise relevant information to inform their outputs. In addition, new information can be attended to by the agent, without retraining, by simply augmenting the retrieval dataset. We study this approach for offline RL in 9x9 Go, a challenging game for which the vast combinatorial state space privileges generalisation over direct matching to past experiences. We leverage fast, approximate nearest neighbor techniques in order to retrieve relevant data from a set of tens of millions of expert demonstration states. Attending to this information provides a significant boost to prediction accuracy and game-play performance over simply using these demonstrations as training trajectories, providing a compelling demonstration of the value of large-scale retrieval in offline RL agents.
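
The retrieval step itself reduces to a nearest-neighbour lookup over precomputed state embeddings; the sketch below uses a brute-force cosine top-k as a stand-in for the fast approximate index used at scale, with hypothetical shapes and names.

```python
# Brute-force stand-in for the approximate nearest-neighbour lookup; at the
# scale described (tens of millions of states) a dedicated ANN index would be
# used instead. All shapes and names here are hypothetical.
import numpy as np

def retrieve(query, database, k=4):
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    scores = db @ q                                  # cosine similarity
    topk = np.argpartition(-scores, k)[:k]
    return topk[np.argsort(-scores[topk])]           # nearest-neighbour indices

rng = np.random.default_rng(0)
database = rng.normal(size=(10_000, 64))             # precomputed state embeddings
query = rng.normal(size=64)                          # current state embedding
neighbours = retrieve(query, database)
# The agent then attends over database[neighbours] (and any stored metadata,
# e.g. expert moves) alongside its own representation of the current state.
```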

ICLR Conference 2022 Conference Paper

Policy improvement by planning with Gumbel

  • Ivo Danihelka
  • Arthur Guez
  • Julian Schrittwieser
  • David Silver 0001

AlphaZero is a powerful reinforcement learning algorithm based on approximate policy iteration and tree search. However, AlphaZero can fail to improve its policy network if it does not visit all actions at the root of the search tree. To address this issue, we propose a policy improvement algorithm based on sampling actions without replacement. Furthermore, we use the idea of policy improvement to replace the more heuristic mechanisms by which AlphaZero selects and uses actions, both at root nodes and at non-root nodes. Our new algorithms, Gumbel AlphaZero and Gumbel MuZero, respectively without and with model-learning, match the state of the art on Go, chess, and Atari, and significantly improve prior performance when planning with few simulations.
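
The "sampling actions without replacement" step can be realised with the Gumbel-Top-k trick, sketched below; the full method additionally uses sequential halving and completed Q-values to construct the improved policy, which this sketch omits.

```python
# Sampling k root actions without replacement via the Gumbel-Top-k trick.
# This is only the sampling step; Gumbel AlphaZero/MuZero additionally use
# sequential halving and "completed" Q-values, which are omitted here.
import numpy as np

def gumbel_top_k(logits, k, rng):
    # Adding i.i.d. Gumbel(0, 1) noise to the logits and taking the top-k
    # indices draws k actions without replacement from softmax(logits).
    gumbels = rng.gumbel(size=logits.shape)
    return np.argsort(-(logits + gumbels))[:k], gumbels

rng = np.random.default_rng(0)
prior_logits = rng.normal(size=10)        # policy-network logits at the root
actions, gumbels = gumbel_top_k(prior_logits, k=4, rng=rng)
# Search then visits only these k actions; re-ranking them by
# logits + gumbels + sigma(Q) yields the improved action selection.
```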

ICML Conference 2022 Conference Paper

Retrieval-Augmented Reinforcement Learning

  • Anirudh Goyal
  • Abram L. Friesen
  • Andrea Banino
  • Theophane Weber
  • Nan Rosemary Ke
  • Adrià Puigdomènech Badia
  • Arthur Guez
  • Mehdi Mirza

Most deep reinforcement learning (RL) algorithms distill experience into parametric behavior policies or value functions via gradient updates. While effective, this approach has several disadvantages: (1) it is computationally expensive, (2) it can take many updates to integrate experiences into the parametric model, (3) experiences that are not fully integrated do not appropriately influence the agent’s behavior, and (4) behavior is limited by the capacity of the model. In this paper we explore an alternative paradigm in which we train a network to map a dataset of past experiences to optimal behavior. Specifically, we augment an RL agent with a retrieval process (parameterized as a neural network) that has direct access to a dataset of experiences. This dataset can come from the agent’s past experiences, expert demonstrations, or any other relevant source. The retrieval process is trained to retrieve information from the dataset that may be useful in the current context, to help the agent achieve its goal faster and more efficiently. The proposed method facilitates learning agents that at test time can condition their behavior on the entire dataset and not only the current state, or current trajectory. We integrate our method into two different RL agents: an offline DQN agent and an online R2D2 agent. In offline multi-task problems, we show that the retrieval-augmented DQN agent avoids task interference and learns faster than the baseline DQN agent. On Atari, we show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores. We run extensive ablations to measure the contributions of the components of our proposed method.
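
One simple way to let an agent condition on retrieved experiences is attention over their embeddings, as sketched below; this is illustrative only, since the paper trains the retrieval process itself as a neural network, and the shapes and names here are assumptions.

```python
# Illustrative sketch: fold retrieved experience embeddings into the agent's
# computation with scaled dot-product attention. The paper learns the
# retrieval process end-to-end; this is only a schematic stand-in.
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attend(agent_state, retrieved):
    # Query = agent state; keys = values = retrieved experience embeddings.
    scores = retrieved @ agent_state / np.sqrt(retrieved.shape[-1])
    summary = softmax(scores) @ retrieved
    return np.concatenate([agent_state, summary])   # input to policy/value heads

rng = np.random.default_rng(0)
agent_state = rng.normal(size=32)
retrieved = rng.normal(size=(8, 32))                 # 8 retrieved experiences
features = attend(agent_state, retrieved)
```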

ICML Conference 2021 Conference Paper

Counterfactual Credit Assignment in Model-Free Reinforcement Learning

  • Thomas Mesnard
  • Theophane Weber
  • Fabio Viola
  • Shantanu Thakoor
  • Alaa Saade
  • Anna Harutyunyan
  • Will Dabney
  • Tom Stepleton

Credit assignment in reinforcement learning is the problem of measuring an action’s influence on future rewards. In particular, this requires separating skill from luck, i.e., disentangling the effect of an action on rewards from that of external factors and subsequent actions. To achieve this, we adapt the notion of counterfactuals from causality theory to a model-free RL setup. The key idea is to condition value functions on future events, by learning to extract relevant information from a trajectory. We formulate a family of policy gradient algorithms that use these future-conditional value functions as baselines or critics, and show that they are provably low variance. To avoid the potential bias from conditioning on future information, we constrain the hindsight information to not contain information about the agent’s actions. We demonstrate the efficacy and validity of our algorithm on a number of illustrative and challenging problems.
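
A minimal sketch of the future-conditional baseline idea follows: the baseline also sees features extracted from the rest of the trajectory, so it can explain away luck. The paper's additional constraint that these features carry no information about the agent's actions is noted but not implemented, and all names and sizes are placeholders.

```python
# Schematic future-conditional ("hindsight") baseline for a policy gradient.
# The paper also constrains the hindsight features to carry no information
# about the agent's actions, which keeps the estimator unbiased; that
# constraint is not implemented in this sketch.
import torch
import torch.nn as nn

state_dim, hindsight_dim = 8, 4                      # hypothetical sizes
baseline_net = nn.Sequential(nn.Linear(state_dim + hindsight_dim, 64),
                             nn.ReLU(), nn.Linear(64, 1))

def pg_loss(log_probs, returns, states, hindsight_feats):
    # Baseline conditions on the state AND on features of the trajectory's
    # future, letting it absorb outcome variation due to external factors.
    baseline = baseline_net(torch.cat([states, hindsight_feats], dim=-1)).squeeze(-1)
    advantages = returns - baseline.detach()
    policy_loss = -(log_probs * advantages).mean()
    baseline_loss = (returns - baseline).pow(2).mean()
    return policy_loss + baseline_loss
```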

ICML Conference 2021 Conference Paper

Muesli: Combining Improvements in Policy Optimization

  • Matteo Hessel
  • Ivo Danihelka
  • Fabio Viola
  • Arthur Guez
  • Simon Schmitt
  • Laurent Sifre
  • Theophane Weber
  • David Silver 0001

We propose a novel policy update that combines regularized policy optimization with model learning as an auxiliary loss. The update (henceforth Muesli) matches MuZero’s state-of-the-art performance on Atari. Notably, Muesli does so without using deep search: it acts directly with a policy network and has computation speed comparable to model-free baselines. The Atari results are complemented by extensive ablations, and by additional results on continuous control and 9x9 Go.
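
A schematic of the kind of composite loss described, a policy term, a regulariser keeping the policy near a reference, and a model-learning term used only as an auxiliary loss, is sketched below; this is not the exact Muesli objective, and the names and weights are placeholders.

```python
# Schematic composite loss in the spirit described above, not the exact Muesli
# objective (Muesli builds a clipped-advantage CMPO target policy and learns a
# MuZero-style model). Names and weights are placeholders.
import torch
import torch.nn.functional as F

def composite_loss(policy_logits, target_policy, ref_logits,
                   predicted_reward, observed_reward,
                   reg_weight=1.0, aux_weight=1.0):
    log_pi = F.log_softmax(policy_logits, dim=-1)
    # (1) regularized policy optimization: move pi towards a target distribution
    policy_term = -(target_policy * log_pi).sum(dim=-1).mean()
    # (2) keep pi close to a reference policy (e.g. the previous policy)
    reg_term = F.kl_div(log_pi, F.softmax(ref_logits, dim=-1), reduction="batchmean")
    # (3) model learning used purely as an auxiliary loss on the representation
    aux_term = F.mse_loss(predicted_reward, observed_reward)
    return policy_term + reg_weight * reg_term + aux_weight * aux_term
```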

ICLR Conference 2021 Conference Paper

On the role of planning in model-based deep reinforcement learning

  • Jessica B. Hamrick
  • Abram L. Friesen
  • Feryal M. P. Behbahani
  • Arthur Guez
  • Fabio Viola
  • Sims Witherspoon
  • Thomas W. Anthony 0001
  • Lars Buesing

Model-based planning is often thought to be necessary for deep, careful reasoning and generalization in artificial agents. While recent successes of model-based reinforcement learning (MBRL) with deep function approximation have strengthened this hypothesis, the resulting diversity of model-based methods has also made it difficult to track which components drive success and why. In this paper, we seek to disentangle the contributions of recent methods by focusing on three questions: (1) How does planning benefit MBRL agents? (2) Within planning, what choices drive performance? (3) To what extent does planning improve generalization? To answer these questions, we study the performance of MuZero (Schrittwieser et al., 2019), a state-of-the-art MBRL algorithm with strong connections and overlapping components with many other MBRL algorithms. We perform a number of interventions and ablations of MuZero across a wide range of environments, including control tasks, Atari, and 9x9 Go. Our results suggest the following: (1) Planning is most useful in the learning process, both for policy updates and for providing a more useful data distribution. (2) Using shallow trees with simple Monte-Carlo rollouts is as performant as more complex methods, except in the most difficult reasoning tasks. (3) Planning alone is insufficient to drive strong generalization. These results indicate where and how to utilize planning in reinforcement learning settings, and highlight a number of open questions for future MBRL research.
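
Finding (2) above refers to shallow search with cheap Monte-Carlo rollouts; the sketch below shows a depth-one planner of that kind over a hypothetical simulator interface.

```python
# Minimal sketch of the "shallow tree + simple Monte-Carlo rollouts" baseline
# referenced in finding (2). The env.simulate and rollout_policy interfaces
# are hypothetical placeholders for a (learned or exact) simulator.
import numpy as np

def shallow_rollout_planner(env, state, rollout_policy, n_rollouts=4, depth=10,
                            gamma=0.99, rng=np.random.default_rng()):
    values = np.zeros(env.num_actions)
    for action in range(env.num_actions):
        returns = []
        for _ in range(n_rollouts):
            s, r = env.simulate(state, action)   # depth-1 expansion of the root
            total, discount = r, gamma
            for _ in range(depth):               # cheap Monte-Carlo rollout
                a = rollout_policy(s, rng)
                s, r = env.simulate(s, a)
                total += discount * r
                discount *= gamma
            returns.append(total)
        values[action] = np.mean(returns)
    return int(np.argmax(values))
```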

NeurIPS Conference 2020 Conference Paper

Value-driven Hindsight Modelling

  • Arthur Guez
  • Fabio Viola
  • Theophane Weber
  • Lars Buesing
  • Steven Kapturowski
  • Doina Precup
  • David Silver
  • Nicolas Heess

Value estimation is a critical component of the reinforcement learning (RL) paradigm. The question of how to effectively learn value predictors from data is one of the major problems studied by the RL community, and different approaches exploit structure in the problem domain in different ways. Model learning can make use of the rich transition structure present in sequences of observations, but this approach is usually not sensitive to the reward function. In contrast, model-free methods directly leverage the quantity of interest from the future, but receive a potentially weak scalar signal (an estimate of the return). We develop an approach for representation learning in RL that sits in between these two extremes: we propose to learn what to model in a way that can directly help value prediction. To this end, we determine which features of the future trajectory provide useful information to predict the associated return. This provides tractable prediction targets that are directly relevant for a task, and can thus accelerate learning the value function. The idea can be understood as reasoning, in hindsight, about which aspects of the future observations could help past value prediction. We show how this can help dramatically even in simple policy evaluation settings. We then test our approach at scale in challenging domains, including on 57 Atari 2600 games.
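
The idea can be sketched with three learned components: features phi of the future trajectory, a hindsight value function that consumes them, and a model that predicts phi from the present so the prediction can feed the regular value estimate. The sketch below is schematic, not the paper's exact architecture or losses, and the sizes are placeholders.

```python
# Schematic sketch of hindsight modelling (not the paper's exact architecture
# or losses): learn features phi of the future trajectory that help a
# hindsight value function, then learn to predict phi from the present state
# and feed that prediction into the ordinary value estimate.
import torch
import torch.nn as nn

state_dim, future_dim, phi_dim = 16, 32, 4           # hypothetical sizes
phi = nn.Linear(future_dim, phi_dim)                 # features of the future
hindsight_value = nn.Linear(state_dim + phi_dim, 1)
phi_model = nn.Linear(state_dim, phi_dim)            # predicts phi in foresight
value = nn.Linear(state_dim + phi_dim, 1)

def losses(state, future_traj, ret):
    z = phi(future_traj)
    hindsight_loss = (hindsight_value(torch.cat([state, z], -1)).squeeze(-1)
                      - ret).pow(2).mean()           # which future features help?
    model_loss = (phi_model(state) - z.detach()).pow(2).mean()
    value_loss = (value(torch.cat([state, phi_model(state).detach()], -1))
                  .squeeze(-1) - ret).pow(2).mean()
    return hindsight_loss + model_loss + value_loss
```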

ICML Conference 2019 Conference Paper

An Investigation of Model-Free Planning

  • Arthur Guez
  • Mehdi Mirza
  • Karol Gregor
  • Rishabh Kabra
  • Sébastien Racanière
  • Theophane Weber
  • David Raposo
  • Adam Santoro

The field of reinforcement learning (RL) is facing increasingly challenging domains with combinatorial complexity. For an RL agent to address these challenges, it is essential that it can plan effectively. Prior work has typically utilized an explicit model of the environment, combined with a specific planning algorithm (such as tree search). More recently, a new family of methods has been proposed that learns how to plan, by providing the structure for planning via an inductive bias in the function approximator (such as a tree-structured neural network), trained end-to-end by a model-free RL algorithm. In this paper, we go even further, and demonstrate empirically that an entirely model-free approach, without special structure beyond standard neural network components such as convolutional networks and LSTMs, can learn to exhibit many of the characteristics typically associated with a model-based planner. We measure our agent’s effectiveness at planning in terms of its ability to generalize across a combinatorial and irreversible state space, its data efficiency, and its ability to utilize additional thinking time. We find that our agent has many of the characteristics that one might expect to find in a planning algorithm. Furthermore, it exceeds the state-of-the-art in challenging combinatorial domains such as Sokoban and outperforms other model-free approaches that utilize strong inductive biases toward planning.

ICML Conference 2018 Conference Paper

Learning to Search with MCTSnets

  • Arthur Guez
  • Theophane Weber
  • Ioannis Antonoglou
  • Karen Simonyan
  • Oriol Vinyals
  • Daan Wierstra
  • Rémi Munos
  • David Silver 0001

Planning problems are among the most important and well-studied problems in artificial intelligence. They are most typically solved by tree search algorithms that simulate ahead into the future, evaluate future states, and back-up those evaluations to the root of a search tree. Among these algorithms, Monte-Carlo tree search (MCTS) is one of the most general, powerful and widely used. A typical implementation of MCTS uses cleverly designed rules, optimised to the particular characteristics of the domain. These rules control where the simulation traverses, what to evaluate in the states that are reached, and how to back-up those evaluations. In this paper we instead learn where, what and how to search. Our architecture, which we call an MCTSnet, incorporates simulation-based search inside a neural network, by expanding, evaluating and backing-up a vector embedding. The parameters of the network are trained end-to-end using gradient-based optimisation. When applied to small searches in the well-known planning problem Sokoban, the learned search algorithm significantly outperformed MCTS baselines.

NeurIPS Conference 2017 Conference Paper

Imagination-Augmented Agents for Deep Reinforcement Learning

  • Sébastien Racanière
  • Theophane Weber
  • David Reichert
  • Lars Buesing
  • Arthur Guez
  • Danilo Jimenez Rezende
  • Adrià Puigdomènech Badia
  • Oriol Vinyals

We introduce Imagination-Augmented Agents (I2As), a novel architecture for deep reinforcement learning combining model-free and model-based aspects. In contrast to most existing model-based reinforcement learning and planning methods, which prescribe how a model should be used to arrive at a policy, I2As learn to interpret predictions from a trained environment model to construct implicit plans in arbitrary ways, by using the predictions as additional context in deep policy networks. I2As show improved data efficiency, performance, and robustness to model misspecification compared to several strong baselines.
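
The mechanism can be sketched as: roll a learned model forward a few steps per candidate action, encode each imagined trajectory, and concatenate these summaries with the model-free features before the policy and value heads. The interfaces in the sketch below are hypothetical, not the paper's implementation.

```python
# Schematic sketch of the imagination-augmented pathway; model.predict,
# model.rollout_policy, encoder and model_free_net are hypothetical interfaces.
import numpy as np

def imagine(model, state, first_action, horizon=3):
    # Roll the learned (possibly imperfect) model forward a few steps.
    s, a, trajectory = state, first_action, []
    for _ in range(horizon):
        s, r = model.predict(s, a)
        trajectory.append(np.append(s, r))
        a = model.rollout_policy(s)
    return np.concatenate(trajectory)

def i2a_features(model, encoder, model_free_net, state, num_actions):
    # One imagined rollout per candidate action, each summarised by an encoder,
    # concatenated with the ordinary model-free features.
    rollouts = [encoder(imagine(model, state, a)) for a in range(num_actions)]
    return np.concatenate([model_free_net(state)] + rollouts)
```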

ICML Conference 2017 Conference Paper

The Predictron: End-To-End Learning and Planning

  • David Silver 0001
  • Hado van Hasselt
  • Matteo Hessel
  • Tom Schaul
  • Arthur Guez
  • Tim Harley
  • Gabriel Dulac-Arnold
  • David P. Reichert

One of the key challenges of artificial intelligence is to learn models that are effective in the context of planning. In this document we introduce the predictron architecture. The predictron consists of a fully abstract model, represented by a Markov reward process, that can be rolled forward multiple “imagined” planning steps. Each forward pass of the predictron accumulates internal rewards and values over multiple planning depths. The predictron is trained end-to-end so as to make these accumulated values accurately approximate the true value function. We applied the predictron to procedurally generated random mazes and a simulator for the game of pool. The predictron yielded significantly more accurate predictions than conventional deep neural network architectures.
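
The core accumulation can be written in a few lines: roll the abstract Markov reward process forward k steps, accumulating internal rewards and discounts, and close with the value of the final abstract state. The sketch below is schematic; in the real predictron these components are learned end-to-end and the depths are lambda-weighted.

```python
# Schematic k-step "preturn" accumulation. In the real predictron the abstract
# model, value function, and the lambda-weighting over depths are all learned
# end-to-end; here they are placeholder callables.
def k_step_preturn(s0, model, value_fn, k):
    # model(s) -> (next_abstract_state, internal_reward, internal_discount)
    g, discount, s = 0.0, 1.0, s0
    for _ in range(k):
        s, r, gamma = model(s)
        g += discount * r
        discount *= gamma
    return g + discount * value_fn(s)

# Preturns at several depths are then averaged (or lambda-weighted) and
# regressed towards the observed return.
```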

NeurIPS Conference 2016 Conference Paper

Learning values across many orders of magnitude

  • Hado van Hasselt
  • Arthur Guez
  • Matteo Hessel
  • Volodymyr Mnih
  • David Silver

Most learning algorithms are not invariant to the scale of the signal that is being approximated. We propose to adaptively normalize the targets used in the learning updates. This is important in value-based reinforcement learning, where the magnitude of appropriate value approximations can change over time when we update the policy of behavior. Our main motivation is prior work on learning to play Atari games, where the rewards were clipped to a predetermined range. This clipping facilitates learning across many different games with a single learning algorithm, but a clipped reward function can result in qualitatively different behavior. Using adaptive normalization we can remove this domain-specific heuristic without diminishing overall performance.
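
The adaptive normalization the authors describe (their "Pop-Art" scheme) keeps running statistics of the targets and rescales the network's output layer so that its unnormalized predictions are preserved when the statistics change. A minimal numpy sketch follows, with W and b standing in for the final linear layer's parameters.

```python
# Minimal sketch of adaptive target normalization that preserves outputs
# (the authors' Pop-Art scheme). W and b stand for the weights and bias of the
# network's final linear layer, given here as numpy arrays.
import numpy as np

class AdaptiveNormalizer:
    def __init__(self, beta=1e-3):
        self.mu, self.nu, self.beta = 0.0, 1.0, beta   # running 1st/2nd moments

    @property
    def sigma(self):
        return float(np.sqrt(max(self.nu - self.mu ** 2, 1e-8)))

    def update(self, target, W, b):
        old_mu, old_sigma = self.mu, self.sigma
        self.mu = (1 - self.beta) * self.mu + self.beta * target
        self.nu = (1 - self.beta) * self.nu + self.beta * target ** 2
        # Rescale the output layer so that sigma * (W x + b) + mu is unchanged
        # by the update of mu and sigma ("preserve outputs").
        W *= old_sigma / self.sigma
        b[:] = (old_sigma * b + old_mu - self.mu) / self.sigma

    def normalize(self, target):
        return (target - self.mu) / self.sigma       # learning target for the net
```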

NeurIPS Conference 2014 Conference Paper

Bayes-Adaptive Simulation-based Search with Value Function Approximation

  • Arthur Guez
  • Nicolas Heess
  • David Silver
  • Peter Dayan

Bayes-adaptive planning offers a principled solution to the exploration-exploitation trade-off under model uncertainty. It finds the optimal policy in belief space, which explicitly accounts for the expected effect on future rewards of reductions in uncertainty. However, the Bayes-adaptive solution is typically intractable in domains with large or continuous state spaces. We present a tractable method for approximating the Bayes-adaptive solution by combining simulation-based search with a novel value function approximation technique that generalises over belief space. Our method outperforms prior approaches in both discrete bandit tasks and simple continuous navigation and control tasks.

RLDM Conference 2013 Conference Abstract

A normative theory of approach-avoidance conflicts during dynamic foraging in humans

  • Arthur Guez
  • Ritwik Niyogi
  • Dominik Bach
  • Marc Guitart-Masip
  • Raymond Dolan
  • Peter Dayan

We propose a normative model of the behaviour of human subjects playing a dynamic foraging game containing a time-stochastic threat. The game is intended to capture the essence of the conflict between approach and avoidance. The realistic nature of the task makes planning challenging; we therefore rely on recent innovations in model-based methods to approximate the optimal policy. We observe that our optimal model captures many aspects of the behaviour, but there remain discrepancies between real and simulated data that will be used to elucidate the nature of the suboptimalities induced by the conflict. We hope to use elaborations of the model to capture the variance in the behaviour across groups of normal subjects and patients.

RLDM Conference 2013 Conference Abstract

Towards a practical Bayes-optimal agent

  • Arthur Guez
  • David Silver
  • Peter Dayan

Only rich and sophisticated statistical models are adequate for agents that must learn to navigate complex environments. However, it has not been clear how methods for planning can take advantage of models, such as those incorporating Bayesian non-parametric devices, that are sufficiently intricate as to demand approximate sampling schemes. We show that Bayes-Adaptive planning can be combined in a principled way with approximate sampling, and demonstrate the power of the resulting method in a challenging task involving safe exploration which defeats myopic methods such as Thompson Sampling. This highlights the importance of propagating beliefs in realistic cases involving trade-offs between exploration and exploitation. The next challenge is to employ function approximation to represent the belief-state value to improve search efficiency further and thus enable longer search horizons.

NeurIPS Conference 2012 Conference Paper

Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

  • Arthur Guez
  • David Silver
  • Peter Dayan

Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayes-optimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems, because it avoids expensive applications of Bayes' rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration.
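
The efficiency trick mentioned, lazily sampling models rather than applying Bayes' rule inside the tree, amounts to drawing one model from the posterior at the root of each simulation and using it throughout that simulation. The interfaces in the sketch below are hypothetical stand-ins for the posterior and the search machinery.

```python
# Schematic root-sampling loop: each simulation draws one model from the
# current posterior and keeps it for the whole simulated trajectory, so no
# Bayes-rule updates are needed inside the tree. The posterior and search
# objects are hypothetical interfaces.
import numpy as np

def root_sampled_search(posterior, root_state, search, n_simulations=1000,
                        rng=np.random.default_rng()):
    for _ in range(n_simulations):
        model = posterior.sample(rng)         # lazy: one sampled MDP per simulation
        search.simulate(root_state, model)    # ordinary MCTS simulation under it
    return search.best_action(root_state)
```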

ICRA Conference 2010 Conference Paper

Multi-tasking SLAM

  • Arthur Guez
  • Joelle Pineau

The problem of simultaneous localization and mapping (SLAM) is one of the most studied in the robotics literature. Most existing approaches, however, focus on scenarios where localization and mapping are the only tasks on the robot's agenda. In many real-world scenarios, a robot may be called on to perform other tasks simultaneously, in addition to localization and mapping. These can include target-following (or avoidance), search-and-rescue, point-to-point navigation, refueling, and so on. This paper proposes a framework that balances localization, mapping, and other planning objectives, thus allowing robots to solve sequential decision tasks under map and pose uncertainty. Our approach combines a SLAM algorithm with an online POMDP approach to solve diverse navigation tasks, without prior training, in an unknown environment.