RLDM 2017 Conference Abstract
Connecting Instructors, Learning Scientists, and Reinforcement Learning Researchers via Collaborative Dynamic Personalized Experimentation
- Joseph Williams
- Anna Rafferty
- Andrew Ang
- Dustin Tingley
- Walter Lasecki
- Juho Kim
The shift to digital educational resources provides new opportunities to advance psychology and education research, in tandem with improving instruction using theory and data, by using reinforcement learning to conduct dynamic experiments and turn results into real-time improvements to online resources. To realize this potential, this paper explores how randomized experiments can support mutually beneficial instructor-researcher collaborations. We developed the Collaborative Dynamic Experimentation (CDE) framework to address two key tensions. First, to enable researchers to embed experiments in online lessons while maintaining instructors' editorial control, we support collaborative experiment authoring. Second, to enable instructors to use data for rapid improvement while preserving statistically valid data for researchers, we apply the Thompson Sampling algorithm for multi-armed bandits. We worked with an on-campus instructor to implement a proof-of-concept CDE system that ran experiments within their online calculus quizzes. Qualitative results from this deployment provided insight into how the CDE framework can facilitate alignment of research and practice.

To extend this approach beyond education to any online experiment, we present a software requirements specification for implementing digital experiments, which provides an abstraction for using reinforcement learning algorithms to adapt experiments in real time. It defines data structures and APIs that allow the policy assigning experimental conditions to users to be modified dynamically, trading off exploration with exploitation (delivering the best conditions, and personalizing their delivery). The conditions of an experiment correspond to an action space (which can be dynamically expanded via the API, accommodating algorithms for infinitely-armed bandits), the dependent measures correspond to reward functions, and characteristics of users correspond to contextual variables (bandits) or a state space (MDPs, POMDPs).
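To illustrate the abstraction described above, the following is a minimal sketch (not the authors' implementation) of Beta-Bernoulli Thompson Sampling over an expandable set of experimental conditions; the condition names and the `add_condition` method are hypothetical, standing in for the dynamic action-space API:

```python
import random


class ThompsonSamplingAssigner:
    """Beta-Bernoulli Thompson Sampling over experimental conditions.

    Each condition (arm) keeps a Beta(successes + 1, failures + 1) posterior
    over its binary reward (e.g., whether a student answered correctly).
    """

    def __init__(self, conditions):
        # Successes/failures per condition, starting from a uniform Beta(1, 1) prior.
        self.posteriors = {c: [1, 1] for c in conditions}

    def add_condition(self, condition):
        # Dynamically expand the action space with a new experimental condition.
        self.posteriors.setdefault(condition, [1, 1])

    def assign(self):
        # Sample a plausible reward rate from each posterior; assign the condition
        # with the highest sample (exploration and exploitation in one step).
        draws = {c: random.betavariate(a, b) for c, (a, b) in self.posteriors.items()}
        return max(draws, key=draws.get)

    def update(self, condition, reward):
        # reward is 1 (e.g., a correct quiz answer) or 0.
        if reward:
            self.posteriors[condition][0] += 1
        else:
            self.posteriors[condition][1] += 1


# Hypothetical usage: two hint variants in a quiz, with a third added mid-experiment.
assigner = ThompsonSamplingAssigner(["hint_A", "hint_B"])
condition = assigner.assign()
assigner.update(condition, reward=1)
assigner.add_condition("hint_C")  # action space expanded via the API
```

As more reward data accrues, the sampled rates concentrate on the better conditions, so students are increasingly routed to effective material while the experiment continues to collect data on the alternatives.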