AAAI 2022
Policy Optimization with Stochastic Mirror Descent
Abstract
Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the VRMPO algorithm: a sample-efficient policy gradient method based on stochastic mirror descent. In VRMPO, a novel variance-reduced policy gradient estimator is presented to improve sample efficiency. We prove that the proposed VRMPO needs only O(ε⁻³) sample trajectories to achieve an ε-approximate first-order stationary point, which matches the best-known sample complexity for policy optimization. Extensive empirical results demonstrate that VRMPO outperforms state-of-the-art policy gradient methods in various settings.
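The abstract combines two ideas: a mirror-descent policy update and a recursive variance-reduced gradient estimator. Below is a minimal, illustrative Python sketch of how such pieces typically fit together; it is not the authors' implementation. The stand-in `policy_gradient` function, the anchor period, the batch size, and the squared-Euclidean mirror map are all assumptions made for the sake of a runnable example.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_gradient(theta, batch):
    """Stand-in mini-batch policy-gradient estimate (illustrative only).
    In practice this would be a REINFORCE/GPOMDP-style estimate computed
    from sampled trajectories under the policy parameterized by theta,
    with importance weights when re-evaluating at a previous iterate."""
    noise = batch.mean(axis=0)
    return -theta + noise  # gradient of a toy concave objective plus noise

def mirror_step(theta, v, lr):
    """Mirror-ascent step. Under the squared-Euclidean mirror map
    Phi(x) = ||x||^2 / 2 this reduces to plain gradient ascent;
    other mirror maps yield non-Euclidean update geometries."""
    return theta + lr * v

theta = np.zeros(4)
theta_prev, v = None, None
for t in range(100):
    batch = rng.normal(scale=0.1, size=(16, 4))
    if v is None or t % 10 == 0:
        # Anchor iteration: fresh gradient estimate (a larger batch
        # would typically be used here; we reuse the same size for brevity).
        v = policy_gradient(theta, batch)
    else:
        # SARAH/SPIDER-style recursive variance-reduced update,
        # v_t = g(theta_t; B) - g(theta_{t-1}; B) + v_{t-1},
        # where both terms use the same mini-batch B.
        v = policy_gradient(theta, batch) - policy_gradient(theta_prev, batch) + v
    theta_prev = theta
    theta = mirror_step(theta, v, lr=0.05)
```

With the squared-Euclidean mirror map the update coincides with vanilla (variance-reduced) policy gradient ascent; the mirror-descent formulation matters when a non-Euclidean mirror map is chosen to match the geometry of the policy parameter space.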
Authors
Keywords
No keywords are indexed for this paper.
Context
- Venue: AAAI Conference on Artificial Intelligence
- Archive span: 1980-2026
- Indexed papers: 28718
- Paper id: 554622051500534640