RLDM 2015 Conference Abstracts
- Ashley Edwards
- Michael Littman
- Charles Isbell
Reward engineering is the problem of expressing a target task for an agent in the form of rewards for a Markov decision process. To be useful for learning, it is important that these encodings be robust to structural changes in the underlying domain; that is, that the specification remain unchanged for any domain in some target class. We identify problems that are difficult to express robustly via the standard model of discounted rewards. In response, we examine the idea of decomposing a reward function into separate components, each with its own discount factor. We describe a method for finding robust parameters through the concept of task engineering, which additionally modifies the discount factors. We present a method for optimizing behavior in this setting and show that it could provide a more robust language than standard approaches.
Poster T22*: Multi-Objective Markov Decision Processes for Decision Support
Dan Lizotte*, University of Western Ontario; Eric Laber, North Carolina State University
We present a new data analysis framework, Multi-Objective Markov Decision Processes for Decision Support, for developing sequential decision support systems. The framework extends the Multi-Objective Markov Decision Process with the ability to provide support tailored to different decision-makers with different preferences about which objectives are most important to them. We present an extension of fitted-Q iteration for multiple objectives that can compute recommended actions in this context; in doing so, we identify and address several conceptual and computational challenges. Finally, we demonstrate how our model could be applied to provide decision support for choosing treatments for schizophrenia, using data from the Clinical Antipsychotic Trials of Intervention Effectiveness.
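To illustrate the core idea of preference-tailored recommendations, here is a minimal sketch (not the authors' implementation): the abstract describes an extension of fitted-Q iteration, whereas this stand-in uses a simple tabular Q-learning backup over a vector-valued Q-table, with a decision-maker's preference weights `w` used to scalarize values when choosing actions. The state/action sizes, rewards, and weights are all hypothetical.

```python
import numpy as np

# Hypothetical problem sizes for illustration only.
n_states, n_actions, n_objectives = 4, 2, 2
Q = np.zeros((n_states, n_actions, n_objectives))  # vector-valued Q-table

def update(s, a, r_vec, s_next, alpha=0.1, gamma=0.9, w=None):
    """One Q-learning backup on the vector-valued table.

    The greedy action at s_next is chosen under the preference weights w,
    so different decision-makers induce different learned policies.
    """
    w = np.ones(n_objectives) / n_objectives if w is None else w
    a_next = np.argmax(Q[s_next] @ w)  # greedy w.r.t. scalarized value
    Q[s, a] += alpha * (r_vec + gamma * Q[s_next, a_next] - Q[s, a])

def recommend(s, w):
    """Recommend the action maximizing the w-weighted value at state s."""
    return int(np.argmax(Q[s] @ w))
```

With two objectives rewarded by different actions, `recommend` returns different actions for decision-makers with opposite preference weights, which is the tailoring behavior the framework is built around.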
Poster T23*: Reinforcement learning based on impulsively biased time scale and its neural substrate in OCD
Yuki Sakai*, KPUM; Saori Tanaka, ATR; Yoshinari Abe, KPUM; Seiji Nishida, KPUM; Takashi Nakamae, KPUM; Kei Yamada, KPUM; Kenji Doya, OIST; Kenji Fukui, KPUM; Jin Narumoto, KPUM
Obsessive-compulsive disorder (OCD) is a common neuropsychiatric disorder with a lifetime prevalence of 2-3%, characterized by persistent intrusive thoughts (obsessions) and repetitive actions (compulsions). Howard Hughes, as depicted in the movie ‘The Aviator,’ suffered from severe OCD in his last years. He could not stop washing his hands and died alone in a hotel room because of his anxiety about bacterial contamination. As his case illustrates, OCD seriously impairs patients’ daily lives. Patients with OCD impulsively engage in compulsive behavior to reduce obsession-related anxiety despite its profound effects on their lives. Serotonergic dysfunction and hyperactivity in ventral-striatal circuitry are thought to be central to the neuropathophysiology of OCD. Since cumulative evidence in humans and animals suggests that serotonergic dysfunction and related alterations in ventral-striatal activity underlie impulsive behavior, both in a ‘prospective’ manner (underestimation of future reward) and in a ‘retrospective’ manner (impaired association of aversive outcomes with past actions), we hypothesized that OCD is a disorder of an ‘impulsively biased time scale’. Here, we conducted behavioral and fMRI experiments to investigate the mechanism of impulsive action selection in OCD. In the fMRI experiment during prospective decision making (experiment (i)), ventral-striatal activity in patients with OCD was significantly more strongly correlated with impulsive short-term reward prediction, similar to our previous findings in healthy subjects at low serotonin levels.
In experiment (ii), we conducted a monetary choice task that is difficult to solve in a prospective way and observed significantly slower associative learning in OCD when actions were followed by a delayed punishment. These results suggest that impulsive action selection, characterized in both prospective and retrospective manners, underlies disadvantageous compulsive behavior in OCD.
Poster T24*: Direct Predictive Collaborative Control of a Prosthetic Arm
Craig Sherstan, University of Alberta; Joseph Modayil, University of Alberta; Patrick Pilarski*, University of Alberta
We have developed an online learning system for the collaborative control of an assistive device. Collaborative control is a complex setting requiring a human user and a learning system (automation) to cooperate towards achieving the user’s goals. There are many control domains where the number of controllable functions available to a user surpasses what the user can attend to at a given moment. Such domains may benefit from having automation assist the user by controlling those unattended functions. How exactly this interaction between user decision making and automated decision making should occur is not clear, nor is it clear to what degree automation is beneficial or desired. We should expect such answers to vary from domain to domain, and possibly from moment to moment. One domain of interest is the control of powered prosthetic arms by amputees. Upper-limb amputees are extremely limited in the number of inputs they can provide to a prosthetic device and typically control only one joint at a time, with the ability to toggle between joints. Control of modern prostheses is often considered by users to be laborious and non-intuitive. To address these difficulties, we have developed a collaborative control framework called Direct Predictive Collaborative Control (DPCC), which uses a reinforcement learning technique known as general value functions to make temporal predictions about user behavior.
These predictions are directly mapped to the control of unattended actuators to produce movement synergies. We evaluate DPCC during the human control of a powered multi-joint arm. We show that DPCC improves a user’s ability to perform coordinated movement tasks. Additionally, we demonstrate that this method can be used without the need for a specific training environment, learning only from the user’s behavior. To our knowledge, this is also the first demonstration of the combined use of the new True Online TD(lambda) algorithm with general value functions for online control.
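As a rough illustration of the learning machinery named in this abstract, the sketch below implements a single general value function with linear features, updated by the standard true online TD(lambda) equations of van Seijen and Sutton. This is not the authors' DPCC system: the feature vectors, cumulant signal, and parameter values are stand-ins, and DPCC's mapping from predictions to actuator control is omitted.

```python
import numpy as np

class GVF:
    """One general value function: predicts the discounted sum of a
    cumulant signal, learned online with true online TD(lambda)."""

    def __init__(self, n_features, alpha=0.1, gamma=0.9, lam=0.9):
        self.w = np.zeros(n_features)  # learned weight vector
        self.z = np.zeros(n_features)  # eligibility trace
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.v_old = 0.0               # previous prediction (for the true-online correction)

    def update(self, x, cumulant, x_next):
        """One true online TD(lambda) step on the transition x -> x_next."""
        a, g, l = self.alpha, self.gamma, self.lam
        v = self.w @ x
        v_next = self.w @ x_next
        delta = cumulant + g * v_next - v
        # Dutch trace update, then the weight update with its correction terms.
        self.z = g * l * self.z + (1.0 - a * g * l * (self.z @ x)) * x
        self.w += a * (delta + v - self.v_old) * self.z - a * (v - self.v_old) * x
        self.v_old = v_next

    def predict(self, x):
        return self.w @ x
```

As a sanity check, with a constant cumulant of 1 and gamma = 0.9 on a fixed state, the prediction converges to 1 / (1 - 0.9) = 10, the discounted sum of the signal.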