Task-Agnostic Exploration via Policy Gradient of a Non-Parametric State Entropy Estimate

Mirco Mutti; Lorenzo Pratissoli; Marcello Restelli

AAAI 2021

Conference Paper AAAI Technical Track on Machine Learning III Artificial Intelligence

Abstract

In a reward-free environment, what is a suitable intrinsic objective for an agent to pursue so that it can learn an optimal task-agnostic exploration policy? In this paper, we argue that the entropy of the state distribution induced by finite-horizon trajectories is a sensible target. Specifically, we present a novel and practical policy-search algorithm, Maximum Entropy POLicy optimization (MEPOL), to learn a policy that maximizes a non-parametric, k-nearest neighbors estimate of the state distribution entropy. In contrast to known methods, MEPOL is completely model-free, as it needs neither to estimate the state distribution of any policy nor to model the transition dynamics. We then empirically show that MEPOL allows learning a maximum-entropy exploration policy in high-dimensional, continuous-control domains, and that this policy facilitates learning meaningful reward-based tasks downstream.
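
To make the idea of a non-parametric, k-nearest neighbors entropy estimate concrete, the sketch below implements the classic Kozachenko-Leonenko estimator on a batch of sampled states. It is an illustration only, not the MEPOL implementation: the function names (`digamma_int`, `knn_entropy`), the choice `k=4`, and the synthetic "states" are all assumptions, and the estimator and policy-gradient machinery used in the paper may differ in form. The sketch assumes all sampled states are distinct, so that every neighbor distance is positive.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma function evaluated at a positive integer n."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def knn_entropy(states, k=4):
    """Kozachenko-Leonenko k-NN estimate of differential entropy (in nats).

    `states` is a list of equal-length tuples of floats (hypothetical
    samples of visited states). Assumes distinct points.
    """
    n = len(states)
    d = len(states[0])
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of samples")

    # Log-volume of the d-dimensional unit ball: pi^(d/2) / Gamma(d/2 + 1).
    log_unit_ball = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1)

    log_radius_sum = 0.0
    for i, s in enumerate(states):
        # Brute-force distances to all other samples (O(n^2) overall).
        dists = sorted(
            math.dist(s, other) for j, other in enumerate(states) if j != i
        )
        # Distance from sample i to its k-th nearest neighbor.
        log_radius_sum += math.log(dists[k - 1])

    return (
        digamma_int(n)
        - digamma_int(k)
        + log_unit_ball
        + d * log_radius_sum / n
    )


if __name__ == "__main__":
    random.seed(0)
    # States spread over the unit square vs. states confined to a small corner.
    wide = [(random.random(), random.random()) for _ in range(300)]
    narrow = [(0.1 * x, 0.1 * y) for x, y in wide]
    print("wide coverage:  ", knn_entropy(wide))
    print("narrow coverage:", knn_entropy(narrow))
```

A policy whose trajectories cover the state space more uniformly yields larger k-th neighbor distances and therefore a higher estimate, which is the quantity an entropy-maximizing exploration objective would push upward. The brute-force neighbor search is chosen for readability; a spatial index would be the usual choice for larger batches.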
