Learning With Options That Terminate Off-Policy

Anna Harutyunyan; Peter Vrancx; Pierre-Luc Bacon; Doina Precup; Ann Nowé

AAAI 2018

Conference Paper AAAI Technical Track: Machine Learning Artificial Intelligence

Abstract

A temporally abstract action, or an option, is specified by a policy and a termination condition: the policy guides the option behavior, and the termination condition roughly determines its length. Generally, learning with longer options (like learning with multi-step returns) is known to be more efficient. However, if the option set for the task is not ideal, and cannot express the primitive optimal policy well, shorter options offer more flexibility and can yield a better solution. Thus, the termination condition puts learning efficiency at odds with solution quality. We propose to resolve this dilemma by decoupling the behavior and target terminations, just like it is done with policies in off-policy learning. To this end, we give a new algorithm, Q(β), that learns the solution with respect to any termination condition, regardless of how the options actually terminate. We derive Q(β) by casting learning with options into a common framework with well-studied multi-step off-policy learning. We validate our algorithm empirically, and show that it holds up to its motivating claims.
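
To make the definition of an option concrete, here is a minimal Python sketch that represents an option as a policy paired with a termination function and runs it until it terminates. It is not the Q(β) algorithm, whose details are not given on this page; it only illustrates how the termination condition governs an option's length. The names `Option`, `run_option`, and `chain_step`, and the toy chain environment, are assumptions introduced for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, Tuple

State = int
Action = int


@dataclass
class Option:
    """An option: a policy plus a termination condition (hypothetical container)."""
    policy: Callable[[State], Action]      # maps a state to a primitive action
    termination: Callable[[State], float]  # beta(s): probability of terminating in s


def chain_step(state: State, action: Action) -> State:
    """Toy deterministic environment (an assumption): move along an integer chain."""
    return state + action


def run_option(
    option: Option,
    state: State,
    step: Callable[[State, Action], State],
    rng: random.Random,
    max_steps: int = 10_000,
) -> Tuple[State, int]:
    """Follow the option's policy until its termination condition fires.

    Returns the state where the option stopped and the number of primitive
    steps taken. max_steps is a safety cap for options that never terminate.
    """
    duration = 0
    while duration < max_steps:
        state = step(state, option.policy(state))
        duration += 1
        # Terminate in the new state with probability beta(state).
        if rng.random() < option.termination(state):
            break
    return state, duration
```

With a constant termination probability β, the duration sampled by this sketch is geometrically distributed with mean 1/β, so a smaller β yields longer options and a larger β yields shorter, more flexible ones. This is the trade-off the abstract describes.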
