RLDM 2013
Learning Objectives for Numeric Human Feedback
Abstract
Several studies have demonstrated that human-generated reward can be a powerful feedback signal for control-learning algorithms. However, the algorithmic space for learning from human reward has hitherto not been explored systematically. Using model-based reinforcement learning from human reward, this article experimentally investigates the problem of learning from human reward, focusing on the relationships between reward positivity, temporal discounting, whether the task is episodic or continuing, and task performance. We identify and empirically verify a “positive circuits” problem with low discounting (i.e., high discount factors) for episodic, goal-based tasks that arises from an observed bias among humans towards giving positive reward, resulting in an endorsement of myopic learning for such domains. We then show that converting simple episodic tasks to be non-episodic (i.e., continuing) reduces and in some cases resolves issues present in episodic tasks with generally positive reward and, relatedly, enables highly successful learning with non-myopic valuation in multiple user studies. The primary learning algorithm introduced in this article, which we call “VI-TAMER”, is the first algorithm to successfully learn non-myopically from human-generated reward; we also empirically show that such non-myopic valuation facilitates higher-level understanding of the task.
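The “positive circuits” problem can be illustrated with a small worked calculation. The sketch below is not VI-TAMER and uses no data from the paper. It assumes a hypothetical task in which every step toward the goal earns a constant positive reward `r_goal`, the goal is reached after `n` steps, and an alternative behavior circles forever while earning a smaller positive reward `r_loop`. All names and numbers are illustrative assumptions.

```python
def discounted_sum(reward, gamma, steps=None):
    """Discounted return of a constant per-step reward.

    steps=None means the reward stream never ends (requires 0 <= gamma < 1).
    """
    if steps is None:
        return reward / (1.0 - gamma)
    return reward * (1.0 - gamma ** steps) / (1.0 - gamma)


def prefers_goal(gamma, r_goal=1.0, r_loop=0.5, n=5, episodic=True):
    """Return True if heading to the goal has a higher return than circling.

    Hypothetical setup: in the episodic task, reward stops once the goal is
    reached after n steps. In the continuing task, the agent is assumed to be
    placed back at the start and keeps earning r_goal on every step.
    """
    goal_return = discounted_sum(r_goal, gamma, n if episodic else None)
    loop_return = discounted_sum(r_loop, gamma)  # the circuit never terminates
    return goal_return > loop_return


# Myopic valuation (gamma = 0) compares immediate rewards only.
print(prefers_goal(gamma=0.0, episodic=True))
# A high discount factor in the episodic task favors the endless circuit.
print(prefers_goal(gamma=0.99, episodic=True))
# The same discount factor in the continuing task favors the goal again.
print(prefers_goal(gamma=0.99, episodic=False))
```

Under these assumptions, the episodic goal path collects a finite sum of positive rewards while the circuit collects an infinite one, so a sufficiently high discount factor makes avoiding the goal look better. Removing the episode boundary removes that asymmetry, which is consistent with the abstract's claim that continuing tasks permit non-myopic valuation.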
Context
- Venue: Multidisciplinary Conference on Reinforcement Learning and Decision Making
- Archive span: 2013-2025
- Indexed papers: 1004
- Paper id: 11255121716287613