RLDM 2013

Learning Objectives for Numeric Human Feedback

Conference Abstract · Accepted abstract · Artificial Intelligence · Decision Making · Machine Learning · Reinforcement Learning

Abstract

Several studies have demonstrated that human-generated reward can be a powerful feedback signal for control-learning algorithms. However, the algorithmic space for learning from human reward has hitherto not been explored systematically. Using model-based reinforcement learning from human reward, this article experimentally investigates the problem of learning from human reward, focusing on the relationships between reward positivity, temporal discounting, whether the task is episodic or continuing, and task performance. We identify and empirically verify a “positive circuits” problem with low discounting (i.e., high discount factors) for episodic, goal-based tasks that arises from an observed bias among humans towards giving positive reward, resulting in an endorsement of myopic learning for such domains. We then show that converting simple episodic tasks to be non-episodic (i.e., continuing) reduces and in some cases resolves issues present in episodic tasks with generally positive reward and, relatedly, enables highly successful learning with non-myopic valuation in multiple user studies. The primary learning algorithm introduced in this article, which we call “VI-TAMER”, is the first algorithm to successfully learn non-myopically from human-generated reward; we also empirically show that such non-myopic valuation facilitates higher-level understanding of the task.
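The “positive circuits” problem described in the abstract can be illustrated with a toy example. The sketch below is not the VI-TAMER algorithm and uses no data from the paper; it runs plain value iteration on a hypothetical five-state corridor with a hand-written, always-positive reward function standing in for a learned model of human reward. The corridor, the reward values (1.0 for moving toward the goal, 0.5 for moving away), and all function names are assumptions made for illustration only.

```python
# Toy illustration (assumed setup, not the paper's experiments):
# a 5-state corridor with the goal at the right end.
N_STATES = 5
GOAL = N_STATES - 1
ACTIONS = (-1, +1)  # move left, move right

# Hypothetical stand-in for a learned human-reward model. Both values are
# positive, mimicking a human bias toward giving positive reward.
REWARD = {-1: 0.5, +1: 1.0}


def step(state, action, continuing):
    """Return (next_state, reached_goal) for one move in the corridor."""
    nxt = min(max(state + action, 0), GOAL)
    reached_goal = nxt == GOAL
    if reached_goal and continuing:
        nxt = 0  # continuing task: the goal sends the agent back to the start
    return nxt, reached_goal


def q_value(values, state, action, gamma, continuing):
    """One-step lookahead: reward plus discounted value of the next state."""
    nxt, reached_goal = step(state, action, continuing)
    terminal = reached_goal and not continuing  # episodic: no value after goal
    future = 0.0 if terminal else values[nxt]
    return REWARD[action] + gamma * future


def greedy_policy(gamma, continuing, sweeps=5000):
    """Run value iteration, then return the greedy action for states 0..GOAL-1."""
    values = [0.0] * N_STATES
    for _ in range(sweeps):
        values = [
            0.0 if s == GOAL
            else max(q_value(values, s, a, gamma, continuing) for a in ACTIONS)
            for s in range(N_STATES)
        ]
    return [
        max(ACTIONS, key=lambda a: q_value(values, s, a, gamma, continuing))
        for s in range(GOAL)
    ]


def reaches_goal(gamma, continuing, max_steps=50):
    """Follow the greedy policy from state 0 and report whether the goal is hit."""
    policy = greedy_policy(gamma, continuing)
    state = 0
    for _ in range(max_steps):
        state, reached_goal = step(state, policy[state], continuing)
        if reached_goal:
            return True
    return False


if __name__ == "__main__":
    print("episodic,   gamma=0.99:", reaches_goal(0.99, continuing=False))
    print("episodic,   gamma=0.0 :", reaches_goal(0.0, continuing=False))
    print("continuing, gamma=0.99:", reaches_goal(0.99, continuing=True))
```

Under these assumed rewards, the episodic agent with a high discount factor circles next to the goal, because ending the episode cuts off a stream of positive reward. The myopic agent (discount factor 0) and the continuing-task agent both reach the goal, which mirrors the qualitative pattern the abstract reports.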

Authors

Context

Venue
Multidisciplinary Conference on Reinforcement Learning and Decision Making
Paper id
11255121716287613