Author name cluster

Philipp Moritz

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

3 papers
2 author rows

Possible papers (3)

ICML 2018 · Conference Paper

RLlib: Abstractions for Distributed Reinforcement Learning

  • Eric Liang
  • Richard Liaw
  • Robert Nishihara
  • Philipp Moritz
  • Roy Fox
  • Ken Goldberg
  • Joseph E. Gonzalez
  • Michael I. Jordan

Reinforcement learning (RL) algorithms involve the deep nesting of highly irregular computation patterns, each of which typically exhibits opportunities for distributed computation. We argue for distributing RL components in a composable way by adapting algorithms for top-down hierarchical control, thereby encapsulating parallelism and resource requirements within short-running compute tasks. We demonstrate the benefits of this principle through RLlib: a library that provides scalable software primitives for RL. These primitives enable a broad range of algorithms to be implemented with high performance, scalability, and substantial code reuse. RLlib is available as part of the open source Ray project at http://rllib.io/.
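
The "short-running compute tasks" in the abstract refer to Ray's task model, on which RLlib is built. A minimal, hypothetical sketch of fanning rollouts out as Ray remote tasks (the rollout stub, its arguments, and the random-walk body are illustrative placeholders, not RLlib's actual API):

```python
# Hedged sketch: parallel rollout collection with Ray remote tasks.
# Only ray.init, @ray.remote, .remote(), and ray.get are real Ray APIs;
# everything else here is a placeholder.
import random

import ray

ray.init()

@ray.remote
def rollout(policy_weights, seed):
    """Short-running task: simulate one episode and return its samples.

    A real worker would build an environment and a policy from
    policy_weights; a random-walk stub keeps the sketch self-contained.
    """
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(100)]  # stand-in for (s, a, r) samples

weights = {"layer0": [0.0] * 4}  # stand-in for real policy parameters
# Fan out eight rollouts in parallel, then gather them for one training step.
batches = ray.get([rollout.remote(weights, seed) for seed in range(8)])
print(len(batches), "rollouts collected")
```

Encapsulating each rollout in its own task is what lets a higher-level algorithm compose rollouts, optimization, and replay hierarchically, which is the composability argument the abstract makes.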

EWRL 2015 · Workshop Paper

Generalized Advantage Estimation for Policy Gradients

  • John Schulman
  • Philipp Moritz
  • Sergey Levine
  • Pieter Abbeel

Value functions provide an elegant solution to the delayed reward problem in reinforcement learning, but it is difficult to accurately estimate and approximate them when the state space is high-dimensional. As a result, policy gradient methods that use Monte Carlo estimation are often preferred over methods that approximate the value function. We propose a method for using an approximate value function to help estimate the advantage function and obtain better policy gradient estimates, even when the value function is inaccurate. These estimators use a timescale parameter that makes an explicit tradeoff between bias and variance, and they empirically achieve faster policy improvement than Monte Carlo estimation and the actor-critic method, which can be viewed as limiting cases of these estimators. We present experimental results on a standard cart-pole benchmark task, as well as a number of highly challenging 3D locomotion tasks, where we show that our approach can learn complex gaits using neural network function approximators with over 10^4 parameters for both the policy and the value function.
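
The "timescale parameter" is the λ of GAE(γ, λ): the advantage estimate is a geometrically weighted sum of temporal-difference residuals δ_t = r_t + γV(s_{t+1}) − V(s_t), with λ = 0 recovering the one-step actor-critic estimate and λ = 1 the Monte Carlo return minus a value baseline. A sketch of that estimator (function name and episode layout are illustrative choices, not from the paper):

```python
# Hedged sketch of generalized advantage estimation for one episode.
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE(gamma, lam) advantages for a single episode.

    values holds V(s_0)..V(s_T) including a terminal bootstrap value,
    so len(values) == len(rewards) + 1. lam trades bias for variance:
    lam=0 gives the one-step TD estimate, lam=1 the Monte Carlo return
    minus the value baseline.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD residuals
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running  # discounted sum of residuals
        advantages[t] = running
    return advantages

# Example: a 5-step episode with constant reward and a rough value guess.
rewards = [1.0] * 5
values = np.linspace(4.0, 0.0, 6)  # includes terminal bootstrap V(s_5) = 0
print(gae_advantages(rewards, values))
```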

ICML 2015 · Conference Paper

Trust Region Policy Optimization

  • John Schulman
  • Sergey Levine
  • Pieter Abbeel
  • Michael I. Jordan
  • Philipp Moritz

In this article, we describe a method for optimizing control policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified scheme, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
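
The "theoretically-justified scheme" is a surrogate-objective maximization under a KL-divergence trust region. As commonly written (with \hat{A}_t the estimated advantage and \delta the trust-region radius; notation assumed from the published paper, not this abstract):

```latex
% TRPO step: maximize the surrogate advantage subject to an
% average-KL trust-region constraint of radius \delta.
\max_{\theta} \; \mathbb{E}_t\!\left[
  \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
  \,\hat{A}_t \right]
\quad \text{subject to} \quad
\mathbb{E}_t\!\left[ D_{\mathrm{KL}}\!\big(
  \pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t) \,\big\|\,
  \pi_{\theta}(\cdot \mid s_t) \big) \right] \le \delta
```

In practice the constraint is handled with a conjugate-gradient step followed by a line search, which is what makes the method tractable for large neural-network policies.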