AAAI 2022
Policy Optimization with Stochastic Mirror Descent
Abstract
Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the VRMPO algorithm: a sample-efficient policy gradient method based on stochastic mirror descent. In VRMPO, a novel variance-reduced policy gradient estimator is presented to improve sample efficiency. We prove that the proposed VRMPO needs only O(ε⁻³) sample trajectories to achieve an ε-approximate first-order stationary point, which matches the best-known sample complexity for policy optimization. Extensive empirical results demonstrate that VRMPO outperforms state-of-the-art policy gradient methods in various settings.
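The abstract combines two ideas: a mirror-descent policy update and a recursive variance-reduced gradient estimator. Below is a minimal, illustrative Python sketch of how such pieces typically fit together; it is not the authors' implementation. The stand-in `policy_gradient` function, the anchor period, the batch size, and the squared-Euclidean mirror map are all assumptions made for the sake of a runnable example.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_gradient(theta, batch):
    """Stand-in mini-batch policy-gradient estimate (illustrative only).
    In practice this would be a REINFORCE/GPOMDP-style estimate computed
    from sampled trajectories under the policy parameterized by theta,
    with importance weights when re-evaluating at a previous iterate."""
    noise = batch.mean(axis=0)
    return -theta + noise  # gradient of a toy concave objective plus noise

def mirror_step(theta, v, lr):
    """Mirror-ascent step. Under the squared-Euclidean mirror map
    Phi(x) = ||x||^2 / 2 this reduces to plain gradient ascent;
    other mirror maps yield non-Euclidean update geometries."""
    return theta + lr * v

theta = np.zeros(4)
theta_prev, v = None, None
for t in range(100):
    batch = rng.normal(scale=0.1, size=(16, 4))
    if v is None or t % 10 == 0:
        # Anchor iteration: fresh gradient estimate (a larger batch
        # would typically be used here; we reuse the same size for brevity).
        v = policy_gradient(theta, batch)
    else:
        # SARAH/SPIDER-style recursive variance-reduced update,
        # v_t = g(theta_t; B) - g(theta_{t-1}; B) + v_{t-1},
        # where both terms use the same mini-batch B.
        v = policy_gradient(theta, batch) - policy_gradient(theta_prev, batch) + v
    theta_prev = theta
    theta = mirror_step(theta, v, lr=0.05)
```

With the squared-Euclidean mirror map the update coincides with vanilla (variance-reduced) policy gradient ascent; the mirror-descent formulation matters when a non-Euclidean mirror map is chosen to match the geometry of the policy parameter space.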
Authors
Keywords
No keywords are indexed for this paper.
Context
- Venue: AAAI Conference on Artificial Intelligence
- Archive span: 1980-2026
- Indexed papers: 28718
- Paper id: 554622051500534640