Accelerating Distributed Deep Reinforcement Learning

Andrew Tan; Vishal Satish; Michael Luo

RLDM 2019

Conference Abstract Accepted abstract Artificial Intelligence · Decision Making · Machine Learning · Reinforcement Learning

Abstract

Recent advances in the field of Reinforcement Learning (RL) have allowed agents to accomplish complex tasks with human-level performance, such as beating the world champion at Go. However, these agents require immense amounts of training data, and capturing this data is both time-consuming and computationally expensive. One proposed solution to speed up this process is to distribute it among many workers, which may span multiple machines. This has led to distributed RL algorithms such as IMPALA and A3C, along with distributed frameworks such as Ray. Although increasing the amount of compute can reduce learning time, this is not sustainable, as it can become extremely expensive for large tasks. Thus there is a growing need for sample- and timestep-efficient distributed RL algorithms that can reach the same performance as earlier methods with less data and fewer timesteps. Furthermore, compute is often not used efficiently, so there is also a need for more optimized algorithms that make better use of the provided hardware. To tackle these problems, we explore combining and improving the best parts of pre-existing distributed RL algorithms into a single algorithm that performs better than any pre-existing algorithm alone, similar to Rainbow. We start with IMPALA and propose an asynchronous Proximal Policy Optimization (PPO) loss for IMPALA that is able to learn in fewer timesteps than the original vanilla policy gradient. We also add a distributed replay buffer to improve sample efficiency and integrate an auto-encoder into the policy graph to reduce the input to a smaller latent space, which reduces policy computation and also helps with timestep efficiency. Finally, at the systems level we implement parallel data loading to improve GPU utilization. With all these changes, we find that our final improved IMPALA can solve Atari Pong in 1.7 million timesteps in under 3 minutes with 128 CPU workers and 2 GPU learners.
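
The abstract does not spell out the asynchronous PPO loss, so the following is only a minimal sketch of the standard PPO clipped surrogate objective that such a loss would presumably build on. The function name `ppo_clipped_loss`, the parameter `clip_eps`, and the reading of `old_logp` as the log-probabilities recorded by the (possibly stale) actor policy are assumptions made for illustration, not the authors' implementation.

```python
import math


def ppo_clipped_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate loss (to be minimized) over a batch.

    new_logp:   log-probabilities of the taken actions under the learner policy
    old_logp:   log-probabilities under the policy that collected the data
                (assumed here to be the stale actor policy in an async setup)
    advantages: advantage estimates for the same actions
    clip_eps:   clipping range; 0.2 is a common default, not a value from the paper
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio between learner policy and data-collecting policy.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Pessimistic (lower) bound of the unclipped and clipped objectives.
        total += min(ratio * adv, clipped * adv)
    # Negate because optimizers minimize, while the surrogate is maximized.
    return -total / len(advantages)
```

When the learner and actor policies agree, the ratio is 1 and the loss reduces to the negated mean advantage. When they diverge, clipping bounds how much a single batch of stale samples can move the policy, which is one plausible reason a PPO-style loss suits asynchronous data collection.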

Context

Venue: Multidisciplinary Conference on Reinforcement Learning and Decision Making
Paper id: 296725958736435115