Arrow Research

Author name cluster

Veronica Chelu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers
1 author row

Possible papers (6)

EWRL Workshop 2024 Workshop Paper

Functional Acceleration for Policy Mirror Descent

  • Veronica Chelu
  • Doina Precup

We apply functional acceleration to the general family of Policy Mirror Descent (PMD) algorithms, which covers a wide range of novel and fundamental methods in Reinforcement Learning (RL). Leveraging duality, we propose a momentum-based PMD update. By taking the functional route, our approach is independent of the policy parametrization and applicable to large-scale optimization, and it covers previous applications of momentum at the level of policy parameters as a special case. We theoretically analyze several properties of this approach and complement the analysis with a numerical ablation study, which illustrates the policy optimization dynamics on the value polytope under different algorithmic design choices in this space. We further characterize numerically several features of the problem setting relevant for functional acceleration, and lastly investigate the impact of approximation on the learning mechanics of these methods.
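As context for the update form involved, a minimal sketch assuming a standard PMD formulation with mirror map Φ, Bregman divergence D_Φ, and step size η_k, followed by one illustrative way momentum could enter at the functional level by extrapolating action values. The momentum coefficient β and the extrapolated form are assumptions for illustration, not the paper's exact update:

    % Standard PMD step:
    \pi_{k+1} = \arg\max_{\pi \in \Pi} \; \langle Q^{\pi_k}, \pi \rangle - \tfrac{1}{\eta_k} D_{\Phi}(\pi, \pi_k)
    % Illustrative momentum via functional extrapolation (assumed form):
    \bar{Q}_k = Q^{\pi_k} + \beta \left( Q^{\pi_k} - Q^{\pi_{k-1}} \right), \qquad
    \pi_{k+1} = \arg\max_{\pi \in \Pi} \; \langle \bar{Q}_k, \pi \rangle - \tfrac{1}{\eta_k} D_{\Phi}(\pi, \pi_k)

Because the extrapolation acts on Q^π rather than on policy parameters, such an update is independent of the parametrization, which is the sense in which parameter-level momentum can appear as a special case.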

EWRL Workshop 2023 Workshop Paper

Acceleration in Policy Optimization

  • Veronica Chelu
  • Tom Zahavy
  • Arthur Guez
  • Doina Precup
  • Sebastian Flennerhag

We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) through predictive and adaptive directions of (functional) policy ascent. Leveraging the connection between policy iteration and policy gradient methods, we view policy optimization algorithms as iteratively solving a sequence of surrogate objectives: local lower bounds on the original objective. We define optimism as predictive modelling of the future behavior of a policy, and hindsight adaptation as taking immediate and anticipatory corrective actions to mitigate accumulating errors from overshooting predictions or delayed responses to change. We use this shared lens to jointly express other well-known algorithms, including model-based policy improvement based on forward search and optimistic meta-learning algorithms. We show connections with Anderson acceleration, Nesterov's accelerated gradient, extra-gradient methods, and linear extrapolation in the update rule. We analyze properties of this formulation, design an optimistic policy gradient algorithm that adapts via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration in an illustrative task.
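For reference, the two classical extrapolation schemes the abstract connects to, written as generic ascent updates on an objective J with step size α. These are the standard textbook forms, not the paper's meta-gradient-adapted algorithm:

    % Extra-gradient: take a lookahead (predictive) step, then update from the lookahead gradient:
    \theta_{t+1/2} = \theta_t + \alpha \nabla J(\theta_t), \qquad
    \theta_{t+1} = \theta_t + \alpha \nabla J(\theta_{t+1/2})
    % Optimistic gradient: extrapolate linearly using the previous gradient:
    \theta_{t+1} = \theta_t + \alpha \left( 2 \nabla J(\theta_t) - \nabla J(\theta_{t-1}) \right)

In both schemes the predictive step anticipates where the gradient is heading, which is the role optimism plays in the paper's framing.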

AAAI Conference 2022 Conference Paper

A Generalized Bootstrap Target for Value-Learning, Efficiently Combining Value and Feature Predictions

  • Anthony GX-Chen
  • Veronica Chelu
  • Blake A. Richards
  • Joelle Pineau

Estimating value functions is a core component of reinforcement learning algorithms. Temporal difference (TD) learning algorithms use bootstrapping, i.e., they update the value function toward a learning target using value estimates at subsequent time-steps. Alternatively, the value function can be updated toward a learning target constructed by separately predicting successor features (SF), a policy-dependent model, and linearly combining them with instantaneous rewards. We focus on the bootstrapping targets used when estimating value functions, and propose a new backup target, the η-return mixture, which implicitly combines value-predictive knowledge (used by TD methods) with (successor) feature-predictive knowledge, with a parameter η capturing how much to rely on each. We illustrate that incorporating predictive knowledge through an ηγ-discounted SF model makes more efficient use of sampled experience than either extreme, i.e., bootstrapping entirely on the value function estimate, or bootstrapping on the product of separately estimated successor features and instantaneous reward models. We empirically show that this approach leads to faster policy evaluation and better control performance, for tabular and nonlinear function approximations, indicating scalability and generality.
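The two extremes named in the abstract can be written down directly; a schematic η-weighted mixture between them is shown below. The paper's η-return mixture combines them implicitly via the ηγ-discounted SF model, so the plain convex combination here illustrates the trade-off rather than the exact target:

    % Value-predictive extreme (pure TD bootstrap):
    y_t^{TD} = r_t + \gamma \, \hat{V}(s_{t+1})
    % Feature-predictive extreme (successor features \psi combined linearly with a reward model w):
    y_t^{SF} = r_t + \gamma \, \hat{\psi}(s_{t+1})^{\top} w
    % Schematic \eta-mixture, with \eta capturing how much to rely on each:
    y_t^{\eta} = \eta \, y_t^{TD} + (1 - \eta) \, y_t^{SF}

Setting η = 1 recovers ordinary TD bootstrapping and η = 0 the pure SF-times-reward target, matching the two extremes compared in the experiments.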

AAAI Conference 2022 Conference Paper

Learning Expected Emphatic Traces for Deep RL

  • Ray Jiang
  • Shangtong Zhang
  • Veronica Chelu
  • Adam White
  • Hado van Hasselt

Off-policy sampling and experience replay are key to improving the sample efficiency and scalability of model-free temporal difference learning methods. Combined with function approximation, such as neural networks, they form what is known as the deadly triad, which is potentially unstable. Recently, it has been shown that stability and good performance at scale can be achieved by combining emphatic weightings and multi-step updates. This approach, however, is generally limited to sampling complete trajectories in sequence in order to compute the required emphatic weighting. In this paper we investigate how to combine emphatic weightings with non-sequential, off-line data sampled from a replay buffer. We develop a multi-step emphatic weighting that can be combined with replay, and a time-reversed n-step TD learning algorithm to learn the required emphatic weighting. We show that these state weightings reduce variance compared with prior approaches, while providing convergence guarantees. We tested the approach at scale on Atari 2600 video games and observed that the new X-ETD(n) agent improved over baseline agents, highlighting both the scalability and broad applicability of our approach.
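For context, the emphatic weighting in question originates in the follow-on trace of standard ETD(λ) (Sutton, Mahmood, and White, 2016), which is computed along complete trajectories. The recursion below is this classical quantity, not the paper's learned, replay-compatible expectation:

    % Follow-on trace F_t and emphasis M_t, with interest i(S_t) and importance-sampling ratio \rho_t:
    F_t = i(S_t) + \gamma \, \rho_{t-1} F_{t-1}
    M_t = \lambda \, i(S_t) + (1 - \lambda) \, F_t

Because F_t depends on the entire preceding trajectory through ρ_{t-1} F_{t-1}, it cannot be evaluated on states drawn out of order from a replay buffer, which is the limitation the paper's expected emphatic traces are designed to remove.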

NeurIPS Conference 2020 Conference Paper

Forethought and Hindsight in Credit Assignment

  • Veronica Chelu
  • Doina Precup
  • Hado P. van Hasselt

We address the problem of credit assignment in reinforcement learning and explore fundamental questions regarding how an agent can best use additional computation to propagate new information, by planning with internal models of the world to improve its predictions. In particular, we work to understand the gains and peculiarities of planning employed as forethought via forward models, or as hindsight, operating with backward models. We establish the relative merits, limitations, and complementary properties of both planning mechanisms in carefully constructed scenarios. Further, we investigate the best use of models in planning, primarily focusing on the selection of states in which predictions should be (re)evaluated. Lastly, we discuss the issue of model estimation and highlight a spectrum of methods that stretch from environment dynamics predictors to planner-aware models.
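A minimal tabular sketch of the two planning mechanisms, under assumed toy interfaces. Here forward_model and backward_model are hypothetical callables introduced for illustration; the paper's models and state-selection strategies are richer than this:

    # Forethought: a forward model simulates where state s leads,
    # and V(s) is updated toward the simulated bootstrap target.
    def forethought_update(V, s, forward_model, gamma=0.99, alpha=0.1):
        r, s_next = forward_model(s)  # predicted reward and successor state (assumed interface)
        V[s] += alpha * (r + gamma * V[s_next] - V[s])

    # Hindsight: a backward model proposes predecessors of s,
    # and the new information at s is propagated back to each of them.
    def hindsight_update(V, s, backward_model, gamma=0.99, alpha=0.1):
        for s_prev, r in backward_model(s):  # predicted (predecessor, reward) pairs (assumed interface)
            V[s_prev] += alpha * (r + gamma * V[s] - V[s_prev])

The asymmetry is visible even at this level: forethought refreshes the value of the current state, while hindsight spreads a newly updated value back to the states that could have led there.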

RLDM Conference 2019 Conference Abstract

Option discovery by aiming to predict

  • Veronica Chelu
  • Doina Precup

We approach the task of knowledge acquisition and option discovery for a reinforcement learning agent using predictive representations of the environment's dynamics with respect to the agent's behaviour. We are interested in designing agents capable of acquiring diverse competencies through interaction with an unknown environment in an unsupervised setting, without extrinsic rewards. We assume a setting in which the agent constantly explores the environment, making predictions and learning off-policy from a single stream of experience about the consequences of multiple possible courses of action. We hypothesize that its aim should be to make the world more predictable by empowering itself to achieve its most likely predictions, self-defined as intrinsic goals. We illustrate that this approach induces a set of predictive option models, and show their usefulness as a planning speedup over their primitive counterparts for different objectives, defined as combinations of signals the agent might be interested in during its lifetime.
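A highly schematic sketch of the hypothesized loop, with every interface assumed for illustration (predict_outcomes, train_option_towards, and learn_option_model are hypothetical names; the abstract specifies the idea, not this API):

    def discovery_step(agent, state):
        # Predictive representations: likelihoods of possible outcomes of behaviour from here.
        outcome_probs = agent.predict_outcomes(state)
        # The most likely prediction becomes a self-defined intrinsic goal.
        goal = max(outcome_probs, key=outcome_probs.get)
        # Learn an option whose policy empowers the agent to achieve that goal...
        option = agent.train_option_towards(goal)
        # ...and a predictive model of the option, later usable to speed up planning.
        agent.option_models[goal] = agent.learn_option_model(option)
        return option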