Arrow Research search
Back to ICLR

ICLR 2023

Behavior Proximal Policy Optimization

Conference Paper Accepted Paper Artificial Intelligence ยท Machine Learning

Abstract

Offline reinforcement learning (RL) is a challenging setting where existing off-policy actor-critic methods perform poorly due to overestimating of out-of-distribution state-action pairs. Thus, various additional augmentations are proposed to keep the learned policy close to the offline dataset (or the behavior policy). In this work, starting from the analysis of offline monotonic policy improvement, we reach a surprising conclusion that online on-policy algorithms are naturally able to solve offline RL. Specifically, the inherent conservatism of these on-policy algorithms is exactly what the offline RL method needs to overcome the overestimation. Based on this, we propose Behavior Proximal Policy Optimization (BPPO), which solves offline RL without any extra constraint or regularization introduced compared to PPO. Extensive experiments on the D4RL benchmark empirically show this extremely succinct method outperforms state-of-the-art offline RL algorithms. Our implementation is available at https://github.com/Dragon-Zhuang/BPPO.

Authors

Keywords

  • Offline Reinforcement Learning
  • Monotonic Policy Improvement

Context

Venue
International Conference on Learning Representations
Archive span
2013-2025
Indexed papers
10294
Paper id
507481293823870485