PRL Workshop 2022
- Roberto Cipollone
- Giuseppe De Giacomo
- Marco Favorito
- Luca Iocchi
- Fabio Patrizi
A major limitation of Reinforcement Learning (RL) algorithms, which restricts their applicability in many practical domains, is the large number of samples required to learn an optimal policy. To improve learning efficiency, we consider a hierarchy of abstraction layers, in which the Markov Decision Process (MDP) underlying the target domain can be abstracted at multiple levels by other MDPs. Each abstract model in the hierarchy is a coarser representation of the one below it, which captures the relevant dynamics at a finer resolution. This paper proposes a novel form of Reward Shaping defined in terms of the solutions obtained at the abstract levels. Theoretical guarantees about optimality and an experimental validation of learning efficiency are discussed in the paper. Our technique has minimal requirements on the design of the abstract models and is tolerant to modeling errors in the abstractions, thus making the proposed method of practical interest.

Reinforcement Learning (RL) agents have no model available to predict the outcomes of their actions. While this has allowed RL algorithms to be widely applicable, the lack of a model also demands a significant number of interactions with the environment before an optimal policy can be estimated. Indeed, most of the successes achieved by RL in recent years come from the digital world (e.g., video games and simulated environments), where large amounts of samples can be easily generated. Still, even in these cases, such a large number of samples might not be available, as simulation costs may be very high. As a result, applications of RL in real environments, such as real robots, are still very rare.

Many RL tasks are goal-oriented, in the sense that a set of environment states is designated as target configurations. Complex tasks induce sparse goals and, as a consequence, sparse rewards. This is known to be a challenging scenario for RL, as it increases the number of samples that must be collected. Unfortunately, sparse goal states are very common: they may arise from simple tasks over large state spaces (such as reaching specific locations in a complex environment), or from complex behaviors even in modest environments (such as the successful completion of a desired sequence (Brafman, De Giacomo, and Patrizi 2018; Icarte et al. 2018)).

From Hierarchical RL approaches, it is known that abstractions play a fundamental role in subtask decomposition and efficient exploration. The technique proposed in this work exploits abstractions of Markov Decision Processes (MDPs) to allow learning algorithms to effectively explore the ground¹ environment, while guaranteeing convergence to an optimal policy.

The abstraction of some ground MDP M is an MDP Mφ whose states represent sets of states of M. A simple example is that of an agent moving in a map. States of M could describe the agent's position in terms of continuous coordinates, orientation, and other configurations. States of the abstraction Mφ, instead, may be coarser descriptions, obtained, for example, through discretization or by projecting out some state variables. Such a compression ultimately corresponds to partitioning the concrete state space, and it implicitly defines a mapping from concrete to abstract states. Importantly, the action spaces of M and Mφ may differ, as each model includes the actions that are most appropriate for its representation.

The core intuition is that, by first learning the optimal policy ρφ of the abstract MDP, we obtain a value estimate V∗φ that can be exploited to guide learning on the ground model M. Technically, we adopt a variant of Reward Shaping (RS), generated from V∗φ, which provides rewards that are consistent with the correspondence between states at the ground and abstract levels. In this way, when learning in the concrete model M, the agent is biased to first visit states corresponding to the abstract ones preferred by ρφ, thus trying, in a sense, to replicate the behavior of ρφ at the ground level. For such an exploration bias to be effective, it is essential that the transitions of Mφ are good proxies for the dynamics of M. We characterize this relation by identifying conditions under which the optimal policy of the ground MDP with the computed rewards converges to a near-optimal exploration policy. We call such a model the biased MDP.
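To illustrate how an abstract value function can drive exploration at the ground level, the sketch below implements one standard instance of this idea: potential-based reward shaping (Ng, Harada, and Russell 1999), with the potential given by V∗φ composed with the abstraction mapping φ. This is only an illustrative sketch, not the paper's exact formulation; the identifiers phi, v_star_abstract, and shaped_reward are hypothetical.

```python
# Illustrative sketch (not the paper's exact RS variant): potential-based
# reward shaping where the potential of a ground state is the optimal value
# of its abstract counterpart. All identifiers are hypothetical.

def shaped_reward(reward, s, s_next, phi, v_star_abstract, gamma=0.99):
    """Return the ground reward augmented with the shaping term.

    phi             -- mapping from ground states of M to abstract states of Mφ
    v_star_abstract -- dict: abstract state -> optimal abstract value V∗φ
    gamma           -- discount factor of the ground MDP
    """
    # F(s, s') = γ·Φ(s') − Φ(s), with Φ = V∗φ ∘ φ: transitions that move
    # toward ground states whose abstract images have higher value get a bonus.
    return reward + gamma * v_star_abstract[phi(s_next)] - v_star_abstract[phi(s)]


# Example abstraction: continuous (x, y) positions discretized into unit grid
# cells, one possible state partition of the kind described above.
phi = lambda s: (int(s[0]), int(s[1]))
```

Potential-based shaping of this form is known not to alter the optimal policy of the ground MDP (Ng, Harada, and Russell 1999), which is why such a bias can speed up exploration without, by itself, sacrificing convergence guarantees.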
An important difference with respect to previous works is that, since the proposed approach focuses on the definition of a novel RS mechanism, it is very general and may be combined with different RL algorithms.

¹ We follow the nomenclature from (Li, Walsh, and Littman 2006).