Arrow Research search

Author name cluster

Audrunas Gruslys

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers
2 author rows

Possible papers (6)

ICML 2023 · Conference Paper

Quantile Credit Assignment

  • Thomas Mesnard
  • Wenqi Chen
  • Alaa Saade
  • Yunhao Tang
  • Mark Rowland 0001
  • Theophane Weber
  • Clare Lyle
  • Audrunas Gruslys

In reinforcement learning, the credit assignment problem is to distinguish luck from skill, that is, separate the inherent randomness in the environment from the controllable effects of the agent’s actions. This paper proposes two novel algorithms, Quantile Credit Assignment (QCA) and Hindsight QCA (HQCA), which incorporate distributional value estimation to perform credit assignment. QCA uses a network that predicts the quantiles of the return distribution, whereas HQCA additionally incorporates information about the future. Both QCA and HQCA have the appealing interpretation of leveraging an estimate of the quantile level of the return (interpreted as the level of "luck") in order to derive a "luck-dependent" baseline for policy gradient methods. We show theoretically that this approach gives an unbiased policy gradient estimate that can yield significant variance reductions over a standard value estimate baseline. QCA and HQCA significantly outperform prior state-of-the-art methods on a range of extremely difficult credit assignment problems.
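
A minimal sketch of the luck-dependent baseline idea, not the paper's implementation: given quantile estimates of the return distribution and an estimated quantile level ("luck") for the current outcome, the baseline is the predicted return at that level. The names, values, and interpolation scheme below are illustrative assumptions.

```python
import numpy as np

def luck_dependent_baseline(tau_hat, predicted_quantiles):
    """Baseline = predicted return quantile at the estimated "luck" level tau_hat.

    predicted_quantiles are assumed sorted, at levels (k + 0.5) / K as in
    quantile regression. In QCA/HQCA the luck level is itself predicted by a
    network (from environment randomness, or future information for HQCA);
    here it is simply an input.
    """
    K = len(predicted_quantiles)
    levels = (np.arange(K) + 0.5) / K
    return np.interp(tau_hat, levels, predicted_quantiles)

G = 7.2                                          # hypothetical realized return
quantiles = np.array([1.0, 3.0, 5.0, 7.0, 9.0])  # hypothetical critic quantiles
tau_hat = 0.8                                    # hypothetical estimated luck level
advantage = G - luck_dependent_baseline(tau_hat, quantiles)
# The policy gradient then weights grad(log pi(a|s)) by this advantage, as in
# REINFORCE with a baseline, but the baseline adapts to how lucky the outcome was.
```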

ICML 2020 · Conference Paper

Fast Computation of Nash Equilibria in Imperfect Information Games

  • Rémi Munos
  • Julien Pérolat
  • Jean-Baptiste Lespiau
  • Mark Rowland 0001
  • Bart De Vylder
  • Marc Lanctot
  • Finbarr Timbers
  • Daniel Hennes

We introduce and analyze a class of algorithms, called Mirror Ascent against an Improved Opponent (MAIO), for computing Nash equilibria in two-player zero-sum games, both in normal form and in sequential form with imperfect information. These algorithms update each player's policy with a mirror-ascent step that maximizes the value of playing against an improved opponent. The improved opponent can be a best response, a greedy policy, or a policy improved by policy gradient or by any other reinforcement learning or search technique. We establish convergence of the last iterate to the set of Nash equilibria and show that the speed of convergence depends on the amount of improvement offered by the improved policies. In addition, we show that under certain conditions, using a best response as the improved policy yields an exponential convergence rate.
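
A toy rendering of the scheme for a normal-form game, assuming the entropic mirror map (so the mirror-ascent step is a multiplicative-weights update) and an exact best response as the improved opponent. The step-size schedule is an illustrative choice, and the paper's conditions for last-iterate convergence are not reproduced here.

```python
import numpy as np

def maio_style_solver(A, eta=0.5, steps=5000):
    """Mirror ascent against a best-response opponent in a two-player
    zero-sum matrix game where the row player receives x^T A y."""
    m, n = A.shape
    x, y = np.ones(m) / m, np.ones(n) / n
    for t in range(steps):
        j = int(np.argmin(x @ A))   # improved opponent for the row player: column BR to x
        i = int(np.argmax(A @ y))   # improved opponent for the column player: row BR to y
        step = eta / np.sqrt(t + 1)                    # illustrative decaying step size
        x = x * np.exp(step * A[:, j]); x /= x.sum()   # entropic mirror-ascent step
        y = y * np.exp(-step * A[i, :]); y /= y.sum()
    return x, y

# Matching pennies: the iterates should hover near the unique Nash
# equilibrium, which plays both actions with probability 0.5.
x, y = maio_style_solver(np.array([[1.0, -1.0], [-1.0, 1.0]]))
```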

AAAI 2018 · Conference Paper

Deep Q-learning From Demonstrations

  • Todd Hester
  • Matej Vecerik
  • Olivier Pietquin
  • Marc Lanctot
  • Tom Schaul
  • Bilal Piot
  • Dan Horgan
  • John Quan

Deep reinforcement learning (RL) has achieved several high-profile successes in difficult decision-making problems. However, these algorithms typically require a huge amount of data before they reach reasonable performance; in fact, their performance during learning can be extremely poor. This may be acceptable in a simulator, but it severely limits the applicability of deep RL to many real-world tasks, where the agent must learn in the real environment. In this paper we study a setting where the agent may access data from previous control of the system. We present an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages even relatively small amounts of demonstration data to massively accelerate learning, and that automatically assesses the necessary ratio of demonstration data while learning, thanks to a prioritized replay mechanism. DQfD works by combining temporal difference updates with supervised classification of the demonstrator's actions. We show that DQfD has better initial performance than Prioritized Dueling Double Deep Q-Networks (PDD DQN): it starts with better scores on the first million steps on 41 of 42 games, and on average it takes PDD DQN 83 million steps to catch up to DQfD's performance. DQfD learns to outperform the best demonstration given in 14 of 42 games. In addition, DQfD leverages human demonstrations to achieve state-of-the-art results for 11 games. Finally, we show that DQfD performs better than three related algorithms for incorporating demonstration data into DQN.
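
The supervised component can be made concrete with the large-margin classification loss on demonstration transitions, max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E). The NumPy sketch below shows only this supervised term, with illustrative values; it is not the full DQfD loss.

```python
import numpy as np

def large_margin_loss(q_values, demo_action, margin=0.8):
    """Supervised term on a demonstration step:
    max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E),
    with l(a_E, a) = margin for a != a_E and 0 otherwise. This pushes the
    demonstrated action's value above all other actions by at least `margin`."""
    l = np.full_like(q_values, margin)
    l[demo_action] = 0.0
    return np.max(q_values + l) - q_values[demo_action]

q = np.array([1.2, 0.4, 1.1])               # hypothetical Q(s, .) estimates
loss = large_margin_loss(q, demo_action=2)  # 0.9: action 2 is not yet separated by the margin
# In DQfD this term is combined with 1-step and n-step TD losses (plus L2
# regularization), with demonstrations kept permanently in a prioritized replay buffer.
```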

AAMAS 2018 · Conference Paper

Value-Decomposition Networks For Cooperative Multi-Agent Learning Based On Team Reward

  • Peter Sunehag
  • Guy Lever
  • Audrunas Gruslys
  • Wojciech Marian Czarnecki
  • Vinicius Zambaldi
  • Max Jaderberg
  • Marc Lanctot
  • Nicolas Sonnerat

We study the problem of cooperative multi-agent reinforcement learning with a single joint reward signal. This class of learning problems is difficult because of the often large combined action and observation spaces. In the fully centralized and decentralized approaches, we find the problem of spurious rewards and a phenomenon we call the “lazy agent” problem, which arises due to partial observability. We address these problems by training individual agents with a novel value-decomposition network architecture, which learns to decompose the team value function into agent-wise value functions.
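
A minimal sketch of the additive decomposition, under the assumption that each agent exposes a vector of action values for its own observation: because the team value is a sum of agent-wise terms, the greedy joint action decomposes into independent per-agent argmaxes, which is what enables decentralized execution.

```python
import numpy as np

def vdn_joint_q(per_agent_qs, joint_action):
    """Team value as a sum of agent-wise values: Q_tot(s, a) = sum_i Q_i(o_i, a_i)."""
    return sum(q[a] for q, a in zip(per_agent_qs, joint_action))

def greedy_joint_action(per_agent_qs):
    """Additivity lets each agent maximize its own Q_i independently."""
    return [int(np.argmax(q)) for q in per_agent_qs]

# Two hypothetical agents with three actions each:
qs = [np.array([0.1, 0.7, 0.2]), np.array([0.5, 0.4, 0.9])]
a = greedy_joint_action(qs)   # [1, 2]
q_tot = vdn_joint_q(qs, a)    # 0.7 + 0.9 = 1.6; trained against the single team reward
```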

NeurIPS 2017 · Conference Paper

A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning

  • Marc Lanctot
  • Vinicius Zambaldi
  • Audrunas Gruslys
  • Angeliki Lazaridou
  • Karl Tuyls
  • Julien Pérolat
  • David Silver
  • Thore Graepel

There has been a resurgence of interest in multiagent reinforcement learning (MARL), due partly to the recent success of deep neural networks. The simplest form of MARL is independent reinforcement learning (InRL), where each agent treats all of its experience as part of its (non-stationary) environment. In this paper, we first observe that policies learned using InRL can overfit to the other agents' policies during training, failing to sufficiently generalize during execution. We introduce a new metric, joint-policy correlation, to quantify this effect. We describe a meta-algorithm for general MARL, based on approximate best responses to mixtures of policies generated using deep reinforcement learning, and empirical game-theoretic analysis to compute meta-strategies for policy selection. The meta-algorithm generalizes previous algorithms such as InRL, iterated best response, double oracle, and fictitious play. Then, we propose a scalable implementation which reduces the memory requirement using decoupled meta-solvers. Finally, we demonstrate the generality of the resulting policies in three partially observable settings: gridworld coordination problems, emergent language games, and poker.
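
A toy instance of the meta-algorithm on a zero-sum matrix game, shown below: populations are pure strategies, the oracle is an exact best response to the opponents' meta-mixture, and the meta-solver is uniform over the population, which recovers fictitious play, one of the special cases the abstract mentions. The deep-RL version replaces the oracle with approximate best responses and uses game-theoretic meta-solvers.

```python
import numpy as np

def psro_matrix_game(A, iterations=200):
    """Meta-algorithm loop on a zero-sum matrix game (row payoff x^T A y).
    Uniform meta-solver + exact best-response oracle = fictitious play."""
    m, n = A.shape
    pop_row, pop_col = [0], [0]   # policy populations (pure-strategy indices)
    for _ in range(iterations):
        # Meta-strategies over the current populations (here: uniform).
        sigma_row = np.bincount(pop_row, minlength=m) / len(pop_row)
        sigma_col = np.bincount(pop_col, minlength=n) / len(pop_col)
        # Oracle step: each player adds a best response to the opponents' mixture.
        pop_row.append(int(np.argmax(A @ sigma_col)))
        pop_col.append(int(np.argmin(sigma_row @ A)))
    return sigma_row, sigma_col

# Rock-paper-scissors: the meta-strategies approach the uniform Nash equilibrium.
A = np.array([[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]])
sigma_row, sigma_col = psro_matrix_game(A)
```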

NeurIPS 2016 · Conference Paper

Memory-Efficient Backpropagation Through Time

  • Audrunas Gruslys
  • Rémi Munos
  • Ivo Danihelka
  • Marc Lanctot
  • Alex Graves

We propose a novel approach to reducing the memory consumption of the backpropagation through time (BPTT) algorithm when training recurrent neural networks (RNNs). Our approach uses dynamic programming to balance a trade-off between caching intermediate results and recomputing them. The algorithm fits tightly within almost any user-set memory budget while finding an execution policy that minimizes computational cost. Since computational devices have limited memory capacity, maximizing computational performance under a fixed memory budget is a practical use case. We provide asymptotic computational upper bounds for various regimes. The algorithm is particularly effective for long sequences: for sequences of length 1000, it saves 95% of memory usage while taking only one third more time per iteration than standard BPTT.
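
The caching/recomputation trade-off can be illustrated with the recursive-halving special case below, which keeps O(log T) cached states at the cost of roughly one extra forward pass; the paper's dynamic program instead chooses checkpoints optimally for any user-set memory budget. `step` and `step_vjp` are hypothetical callables for the forward transition and its vector-Jacobian product.

```python
def run_forward(step, state, lo, hi):
    """Replay forward steps lo..hi-1 from a cached state."""
    for t in range(lo, hi):
        state = step(t, state)
    return state

def backprop_segment(step, step_vjp, state_lo, grad_hi, lo, hi):
    """Backpropagate through steps [lo, hi) while caching only one state per
    recursion level: recompute the midpoint state, recurse over the second
    half, then the first half."""
    if hi - lo == 1:
        return step_vjp(lo, state_lo, grad_hi)
    mid = (lo + hi) // 2
    state_mid = run_forward(step, state_lo, lo, mid)   # recompute, cache one checkpoint
    grad_mid = backprop_segment(step, step_vjp, state_mid, grad_hi, mid, hi)
    return backprop_segment(step, step_vjp, state_lo, grad_mid, lo, mid)

# Toy check with scalar states: state_{t+1} = w_t * state_t, so
# d(state_T)/d(state_0) = prod(w_t).
ws = [0.5, 2.0, 3.0, 0.25]
step = lambda t, s: ws[t] * s
step_vjp = lambda t, s, g: ws[t] * g
grad = backprop_segment(step, step_vjp, 1.0, 1.0, 0, len(ws))  # 0.75
```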