Arrow Research

Author name cluster

Adam White

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

47 papers
2 author rows

Possible papers

47

RLJ Journal 2025 Journal Article

Deep Reinforcement Learning with Gradient Eligibility Traces

  • Esraa Elelimy
  • Brett Daley
  • Andrew Patterson
  • Marlos C. Machado
  • Adam White
  • Martha White

Achieving fast and stable off-policy learning in deep reinforcement learning (RL) is challenging. Most existing methods rely on semi-gradient temporal-difference (TD) methods for their simplicity and efficiency, but are consequently susceptible to divergence. While more principled approaches like Gradient TD (GTD) methods have strong convergence guarantees, they have rarely been used in deep RL. Recent work introduced the generalized Projected Bellman Error ($\overline{\text{PBE}}$), enabling GTD methods to work efficiently with nonlinear function approximation. However, this work is limited to one-step methods, which are slow at credit assignment and require a large number of samples. In this paper, we extend the generalized $\overline{\text{PBE}}$ objective to support multistep credit assignment based on the $\lambda$-return and derive three gradient-based methods that optimize this new objective. We provide both a forward-view formulation compatible with experience replay and a backward-view formulation compatible with streaming algorithms. Finally, we evaluate the proposed algorithms and show that they outperform both PPO and StreamQ in MuJoCo and MinAtar environments, respectively.
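
For context, the $\lambda$-return that underlies such multistep extensions is typically defined by the standard recursion (textbook notation, not necessarily the paper's):

    $G_t^\lambda = R_{t+1} + \gamma \left[(1-\lambda)\, \hat{v}(S_{t+1}) + \lambda\, G_{t+1}^\lambda\right]$

so that $\lambda = 0$ recovers the one-step TD target and $\lambda = 1$ recovers the full Monte Carlo return.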

RLC Conference 2025 Conference Paper

Deep Reinforcement Learning with Gradient Eligibility Traces

  • Esraa Elelimy
  • Brett Daley
  • Andrew Patterson
  • Marlos C. Machado
  • Adam White
  • Martha White

Achieving fast and stable off-policy learning in deep reinforcement learning (RL) is challenging. Most existing methods rely on semi-gradient temporal-difference (TD) methods for their simplicity and efficiency, but are consequently susceptible to divergence. While more principled approaches like Gradient TD (GTD) methods have strong convergence guarantees, they have rarely been used in deep RL. Recent work introduced the generalized Projected Bellman Error ($\overline{\text{PBE}}$), enabling GTD methods to work efficiently with nonlinear function approximation. However, this work is limited to one-step methods, which are slow at credit assignment and require a large number of samples. In this paper, we extend the generalized $\overline{\text{PBE}}$ objective to support multistep credit assignment based on the $\lambda$-return and derive three gradient-based methods that optimize this new objective. We provide both a forward-view formulation compatible with experience replay and a backward-view formulation compatible with streaming algorithms. Finally, we evaluate the proposed algorithms and show that they outperform both PPO and StreamQ in MuJoCo and MinAtar environments, respectively.

RLC Conference 2025 Conference Paper

Investigating the Utility of Mirror Descent in Off-policy Actor-Critic

  • Samuel Neumann
  • Jiamin He
  • Adam White
  • Martha White

Many policy gradient methods prevent drastic changes to policies during learning. This is commonly achieved through a Kullback-Leibler (KL) divergence term. Recent work has established a theoretical connection between this heuristic and Mirror Descent (MD), offering insight into the empirical successes of existing policy gradient and actor-critic algorithms. This insight has further motivated the development of novel algorithms that better adhere to the principles of MD, alongside a growing body of theoretical research on policy mirror descent. In this study, we examine the empirical feasibility of MD-based policy updates in off-policy actor-critic. Specifically, we introduce principled MD adaptations of three widely used actor-critic algorithms and systematically evaluate their empirical effectiveness. Our findings indicate that, while MD-style policy updates do not seem to exhibit significant practical advantages over conventional approaches to off-policy actor-critic, they can somewhat mitigate sensitivity to step size selection with widely used deep-learning optimizers.
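
As a point of reference, a generic policy mirror descent update with the KL divergence as the Bregman divergence takes roughly the following form (an illustration of the general scheme in our notation, not the specific adaptations studied in the paper):

    $\pi_{k+1} = \arg\max_{\pi} \; \mathbb{E}_{s \sim d,\, a \sim \pi}\!\left[ A^{\pi_k}(s, a) \right] \;-\; \frac{1}{\eta}\, \mathbb{E}_{s \sim d}\!\left[ \mathrm{KL}\!\left( \pi(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s) \right) \right]$

where $\eta$ plays the role of a step size; this is the connection to KL-regularized policy updates that the abstract alludes to.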

RLJ Journal 2025 Journal Article

Investigating the Utility of Mirror Descent in Off-policy Actor-Critic

  • Samuel Neumann
  • Jiamin He
  • Adam White
  • Martha White

Many policy gradient methods prevent drastic changes to policies during learning. This is commonly achieved through a Kullback-Leibler (KL) divergence term. Recent work has established a theoretical connection between this heuristic and Mirror Descent (MD), offering insight into the empirical successes of existing policy gradient and actor-critic algorithms. This insight has further motivated the development of novel algorithms that better adhere to the principles of MD, alongside a growing body of theoretical research on policy mirror descent. In this study, we examine the empirical feasibility of MD-based policy updates in off-policy actor-critic. Specifically, we introduce principled MD adaptations of three widely used actor-critic algorithms and systematically evaluate their empirical effectiveness. Our findings indicate that, while MD-style policy updates do not seem to exhibit significant practical advantages over conventional approaches to off-policy actor-critic, they can somewhat mitigate sensitivity to step size selection with widely used deep-learning optimizers.

RLJ Journal 2025 Journal Article

Modelling human exploration with light-weight meta reinforcement learning algorithms

  • Thomas D. Ferguson
  • Alona Fyshe
  • Adam White

Learning in non-stationary environments can be difficult. Although many algorithmic approaches have been developed, methods often struggle with different forms of non-stationarity such as gradually changing versus suddenly changing contexts. Luckily, humans can learn effectively under a variety of conditions, so studying human learning could be revealing. In the present work, we investigated whether a stateless variant of the IDBD algorithm (Mahmood et al., 2012; Sutton, 1992), which has previously shown success in bandit-like tasks (Linke et al., 2020), can model human exploration. We compared stateless IDBD to two algorithms that are frequently used to model human exploration (a standard Q-learning algorithm and a Kalman filter algorithm). We examined the ability of these three algorithms to fit human choices and to replicate human learning within three different bandits: (1) non-stationary volatile, which changed suddenly, (2) non-stationary drifting, which changed gradually, and (3) stationary. In these three bandits, we found that stateless IDBD provided the best fit of the human data and was best able to replicate different aspects of human learning. We also found that when fit to the human data, differences in the hyperparameters of stateless IDBD across the three bandits may explain how humans learn effectively across contexts. Our results demonstrate that stateless IDBD can account for different types of non-stationarity and model human exploration effectively. Our findings highlight that taking inspiration from algorithms used with artificial agents may both provide further insights into human learning and inspire the development of algorithms for use in artificial agents.
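
For readers unfamiliar with IDBD, the sketch below shows a stateless (per-arm) variant of the step-size adaptation idea in Python; the exact update used in the paper may differ, so treat the meta-step-size name and initialization as assumptions.

    import numpy as np

    class StatelessIDBD:
        """Per-arm value estimate with an IDBD-style adapted step size (illustrative sketch)."""
        def __init__(self, n_arms, meta_step=0.01, init_log_alpha=np.log(0.1)):
            self.w = np.zeros(n_arms)                     # value estimate per arm
            self.beta = np.full(n_arms, init_log_alpha)   # log step size per arm
            self.h = np.zeros(n_arms)                     # decayed memory of recent updates
            self.meta_step = meta_step

        def update(self, arm, reward):
            delta = reward - self.w[arm]                  # prediction error for the pulled arm
            self.beta[arm] += self.meta_step * delta * self.h[arm]  # meta-gradient step on log step size
            alpha = np.exp(self.beta[arm])
            self.w[arm] += alpha * delta                  # value update with the adapted step size
            self.h[arm] = self.h[arm] * max(0.0, 1.0 - alpha) + alpha * delta
            return delta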

RLC Conference 2025 Conference Paper

Modelling human exploration with light-weight meta reinforcement learning algorithms

  • Thomas D. Ferguson
  • Alona Fyshe
  • Adam White

Learning in non-stationary environments can be difficult. Although many algorithmic approaches have been developed, methods often struggle with different forms of non-stationarity such as gradually changing versus suddenly changing contexts. Luckily, humans can learn effectively under a variety of conditions, so studying human learning could be revealing. In the present work, we investigated whether a stateless variant of the IDBD algorithm (Mahmood et al., 2012; Sutton, 1992), which has previously shown success in bandit-like tasks (Linke et al., 2020), can model human exploration. We compared stateless IDBD to two algorithms that are frequently used to model human exploration (a standard Q-learning algorithm and a Kalman filter algorithm). We examined the ability of these three algorithms to fit human choices and to replicate human learning within three different bandits: (1) non-stationary volatile, which changed suddenly, (2) non-stationary drifting, which changed gradually, and (3) stationary. In these three bandits, we found that stateless IDBD provided the best fit of the human data and was best able to replicate different aspects of human learning. We also found that when fit to the human data, differences in the hyperparameters of stateless IDBD across the three bandits may explain how humans learn effectively across contexts. Our results demonstrate that stateless IDBD can account for different types of non-stationarity and model human exploration effectively. Our findings highlight that taking inspiration from algorithms used with artificial agents may both provide further insights into human learning and inspire the development of algorithms for use in artificial agents.

YNICL Journal 2025 Journal Article

Predicting language outcome after stroke using machine learning: in search of the big data benefit

  • Margarita Saranti
  • Douglas Neville
  • Adam White
  • Pia Rotshtein
  • Thomas M.H. Hope
  • Cathy J. Price
  • Howard Bowman

Accurate prediction of post-stroke language outcomes using machine learning offers the potential to enhance clinical treatment and rehabilitation for aphasic patients. This study of 758 English speaking stroke patients from the PLORAS project explores the impact of sample size on the performance of logistic regression and a deep learning (ResNet-18) model in predicting language outcomes from neuroimaging and impairment-relevant tabular data. We assessed the performance of both models on two key language tasks from the Comprehensive Aphasia Test: Spoken Picture Description and Naming, using a learning curve approach. Contrary to expectations, the simpler logistic regression model performed comparably or better than the deep learning model (with overlapping confidence intervals), with both models showing an accuracy plateau around 80% for sample sizes larger than 300 patients. Principal Component Analysis revealed that the dimensionality of the neuroimaging data could be reduced to as few as 20 (or even 2) dominant components without significant loss in accuracy, suggesting that classification may be driven by simple patterns such as lesion size. The study highlights both the potential limitations of current dataset size in achieving further accuracy gains and the need for larger datasets to capture more complex patterns, as some of our results indicate that we might not have reached an absolute classification performance ceiling. Overall, these findings provide insights into the practical use of machine learning for predicting aphasia outcomes and the potential benefits of much larger datasets in enhancing model performance.

NeurIPS Conference 2024 Conference Paper

A Method for Evaluating Hyperparameter Sensitivity in Reinforcement Learning

  • Jacob Adkins
  • Michael Bowling
  • Adam White

The performance of modern reinforcement learning algorithms critically relies on tuning ever-increasing numbers of hyperparameters. Often, small changes in a hyperparameter can lead to drastic changes in performance, and different environments require very different hyperparameter settings to achieve the state-of-the-art performance reported in the literature. We currently lack a scalable and widely accepted approach to characterizing these complex interactions. This work proposes a new empirical methodology for studying, comparing, and quantifying the sensitivity of an algorithm’s performance to hyperparameter tuning for a given set of environments. We then demonstrate the utility of this methodology by assessing the hyperparameter sensitivity of several commonly used normalization variants of PPO. The results suggest that several algorithmic performance improvements may, in fact, be a result of an increased reliance on hyperparameter tuning.
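
The abstract does not give the exact metric, but one natural way to quantify this kind of sensitivity is to compare performance under per-environment tuning against performance under a single cross-environment setting; the sketch below illustrates that idea (function and variable names are ours, not the paper's).

    import numpy as np

    def sensitivity_gap(perf):
        """perf: array of shape (n_hyper_settings, n_envs) of normalized performance.

        Returns the gap between tuning per environment and using one setting everywhere.
        A larger gap suggests heavier reliance on hyperparameter tuning.
        """
        per_env_best = perf.max(axis=0).mean()   # best setting chosen separately in each env
        single_best = perf.mean(axis=1).max()    # one setting chosen for all envs at once
        return per_env_best - single_best

    perf = np.random.rand(20, 5)   # 20 hyperparameter settings evaluated on 5 environments
    print(sensitivity_gap(perf))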

PRL Workshop 2024 Workshop Paper

A New View on Planning in Online Reinforcement Learning

  • Kevin Roice
  • Parham Mohammad Panahi
  • Scott M. Jordan
  • Adam White
  • Martha White

This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated many steps. In this paper, we avoid this limitation by constraining background planning to a set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning and avoids learning the transition dynamics entirely. We show that our GSP algorithm can propagate value from an abstract space in a manner that helps a variety of base learners learn significantly faster in different domains.

TMLR Journal 2024 Journal Article

AGaLiTe: Approximate Gated Linear Transformers for Online Reinforcement Learning

  • Subhojeet Pramanik
  • Esraa Elelimy
  • Marlos C. Machado
  • Adam White

In this paper we investigate transformer architectures designed for partially observable online reinforcement learning. The self-attention mechanism in the transformer architecture is capable of capturing long-range dependencies and it is the main reason behind its effectiveness in processing sequential data. Nevertheless, despite their success, transformers have two significant drawbacks that still limit their applicability in online reinforcement learning: (1) in order to remember all past information, the self-attention mechanism requires access to the whole history to be provided as context. (2) The inference cost in transformers is expensive. In this paper, we introduce recurrent alternatives to the transformer self-attention mechanism that offer context-independent inference cost, leverage long-range dependencies effectively, and perform well in online reinforcement learning tasks. We quantify the impact of the different components of our architecture in a diagnostic environment and assess performance gains in 2D and 3D pixel-based partially-observable environments (e.g. T-Maze, Mystery Path, Craftax, and Memory Maze). Compared with a state-of-the-art architecture, GTrXL, inference in our approach is at least 40% cheaper while reducing memory use more than 50%. Our approach performs similarly to or better than GTrXL, improving more than 37% upon GTrXL performance in harder tasks.

RLC Conference 2024 Conference Paper

Cross-environment Hyperparameter Tuning for Reinforcement Learning

  • Andrew Patterson
  • Samuel Neumann
  • Raksha Kumaraswamy
  • Martha White
  • Adam White

This paper introduces a new benchmark, the Cross-environment Hyperparameter Setting Benchmark, that allows comparison of RL algorithms across environments using only a single hyperparameter setting, encouraging algorithmic development which is insensitive to hyperparameters. We demonstrate that the benchmark is robust to statistical noise and obtains qualitatively similar results across repeated applications, even when using a small number of samples. This robustness makes the benchmark computationally cheap to apply, allowing statistically sound insights at low cost. We provide two example instantiations of the CHS, on a set of six small control environments (SC-CHS) and on the entire DM Control suite of 28 environments (DMC-CHS). Finally, to demonstrate the applicability of the CHS to modern RL algorithms on challenging environments, we provide a novel empirical study of an open question in the continuous control literature. We show, with high confidence, that there is no meaningful difference in performance between Ornstein-Uhlenbeck noise and uncorrelated Gaussian noise for exploration with the DDPG algorithm on the DMC-CHS.

RLJ Journal 2024 Journal Article

Cross-environment Hyperparameter Tuning for Reinforcement Learning

  • Andrew Patterson
  • Samuel Neumann
  • Raksha Kumaraswamy
  • Martha White
  • Adam White

This paper introduces a new benchmark, the Cross-environment Hyperparameter Setting Benchmark, that allows comparison of RL algorithms across environments using only a single hyperparameter setting, encouraging algorithmic development which is insensitive to hyperparameters. We demonstrate that the benchmark is robust to statistical noise and obtains qualitatively similar results across repeated applications, even when using a small number of samples. This robustness makes the benchmark computationally cheap to apply, allowing statistically sound insights at low cost. We provide two example instantiations of the CHS, on a set of six small control environments (SC-CHS) and on the entire DM Control suite of 28 environments (DMC-CHS). Finally, to demonstrate the applicability of the CHS to modern RL algorithms on challenging environments, we provide a novel empirical study of an open question in the continuous control literature. We show, with high confidence, that there is no meaningful difference in performance between Ornstein-Uhlenbeck noise and uncorrelated Gaussian noise for exploration with the DDPG algorithm on the DMC-CHS.

JMLR Journal 2024 Journal Article

Empirical Design in Reinforcement Learning

  • Andrew Patterson
  • Samuel Neumann
  • Martha White
  • Adam White

Empirical design in reinforcement learning is no small task. Running good experiments requires attention to detail and at times significant computational resources. While compute resources available per dollar have continued to grow rapidly, so has the scale of typical experiments in reinforcement learning. It is now common to benchmark agents with millions of parameters against dozens of tasks, each using the equivalent of 30 days of experience. The scale of these experiments often conflicts with the need for statistical evidence, especially when comparing algorithms. Recent studies have highlighted how popular algorithms are sensitive to hyperparameter settings and implementation details, and that common empirical practice leads to weak statistical evidence (Machado et al., 2018; Henderson et al., 2018). This manuscript represents both a call to action, and a comprehensive resource for how to do good experiments in reinforcement learning. In particular, we cover: the statistical assumptions underlying common performance measures, how to properly characterize performance variation and stability, hypothesis testing, special considerations for comparing multiple agents, baseline and illustrative example construction, and how to deal with hyperparameters and experimenter bias. Throughout we highlight common mistakes found in the literature and the statistical consequences of those in example experiments. The objective of this document is to provide answers on how we can use our unprecedented compute to do good science in reinforcement learning, as well as stay alert to potential pitfalls in our empirical design.
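
As one small concrete example of the kind of statistical tooling such guidance covers, the sketch below computes a percentile-bootstrap confidence interval for mean performance over independent runs (a generic illustration, not a procedure taken from the manuscript).

    import numpy as np

    def bootstrap_ci(run_returns, n_boot=10_000, alpha=0.05, seed=None):
        """Percentile bootstrap CI for the mean of per-run performance values."""
        rng = np.random.default_rng(seed)
        runs = np.asarray(run_returns)
        means = np.array([rng.choice(runs, size=runs.size, replace=True).mean()
                          for _ in range(n_boot)])
        lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        return runs.mean(), (lo, hi)

    # e.g. final returns from 30 independent runs of one agent
    print(bootstrap_ci(np.random.normal(100, 15, size=30)))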

JMLR Journal 2024 Journal Article

Goal-Space Planning with Subgoal Models

  • Chunlok Lo
  • Kevin Roice
  • Parham Mohammad Panahi
  • Scott M. Jordan
  • Adam White
  • Gabor Mihucz
  • Farzane Aminmansour
  • Martha White

This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated many steps. In this paper, we avoid this limitation by constraining background planning to a given set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning, and avoids learning the transition dynamics entirely. We show that our GSP algorithm can propagate value from an abstract space in a manner that helps a variety of base learners learn significantly faster in different domains.

RLJ Journal 2024 Journal Article

Harnessing Discrete Representations for Continual Reinforcement Learning

  • Edan Jacob Meyer
  • Adam White
  • Marlos C. Machado

Reinforcement learning (RL) agents make decisions using nothing but observations from the environment, and consequently, rely heavily on the representations of those observations. Though some recent breakthroughs have used vector-based categorical representations of observations, often referred to as discrete representations, there is little work explicitly assessing the significance of such a choice. In this work, we provide a thorough empirical investigation of the advantages of discrete representations in the context of world-model learning, model-free RL, and ultimately continual RL problems, where we find discrete representations to have the greatest impact. We find that, when compared to traditional continuous representations, world models learned over discrete representations accurately model more of the world with less capacity, and that agents trained with discrete representations learn better policies with less data. In the context of continual RL, these benefits translate into faster adapting agents. Additionally, our analysis suggests that it is the binary and sparse nature, rather than the “discreteness” of discrete representations that leads to these improvements.

RLC Conference 2024 Conference Paper

Harnessing Discrete Representations for Continual Reinforcement Learning

  • Edan Jacob Meyer
  • Adam White
  • Marlos C. Machado

Reinforcement learning (RL) agents make decisions using nothing but observations from the environment, and consequently, rely heavily on the representations of those observations. Though some recent breakthroughs have used vector-based categorical representations of observations, often referred to as discrete representations, there is little work explicitly assessing the significance of such a choice. In this work, we provide a thorough empirical investigation of the advantages of discrete representations in the context of world-model learning, model-free RL, and ultimately continual RL problems, where we find discrete representations to have the greatest impact. We find that, when compared to traditional continuous representations, world models learned over discrete representations accurately model more of the world with less capacity, and that agents trained with discrete representations learn better policies with less data. In the context of continual RL, these benefits translate into faster adapting agents. Additionally, our analysis suggests that it is the binary and sparse nature, rather than the “discreteness” of discrete representations that leads to these improvements.

RLC Conference 2024 Conference Paper

Investigating the Interplay of Prioritized Replay and Generalization

  • Parham Mohammad Panahi
  • Andrew Patterson
  • Martha White
  • Adam White

Experience replay, the reuse of past data to improve sample efficiency, is ubiquitous in reinforcement learning. Though a variety of smart sampling schemes have been introduced to improve performance, uniform sampling by far remains the most common approach. One exception is Prioritized Experience Replay (PER), where sampling is done proportionally to TD errors, inspired by the success of prioritized sweeping in dynamic programming. The original work on PER showed improvements in Atari, but follow-up results were mixed. In this paper, we investigate several variations on PER, to attempt to understand where and when PER may be useful. Our findings in prediction tasks reveal that while PER can improve value propagation in tabular settings, behavior is significantly different when combined with neural networks. Certain mitigations, like delaying target network updates to control generalization and using estimates of expected TD errors in PER to avoid chasing stochasticity, can avoid large spikes in error with PER and neural networks but generally do not outperform uniform replay. In control tasks, none of the prioritized variants consistently outperform uniform replay. We present new insight into the interaction between prioritization, bootstrapping, and neural networks and propose several improvements for PER in tabular settings and noisy domains.
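
For reference, the core of proportional PER as originally proposed (Schaul et al.) samples transitions with probability proportional to a power of their TD error and corrects the resulting bias with importance-sampling weights; a minimal sketch (no sum-tree, O(n) sampling, names are ours):

    import numpy as np

    class ProportionalPER:
        """Minimal proportional prioritized replay, for illustration only."""
        def __init__(self, capacity, alpha=0.6, beta=0.4, eps=1e-6):
            self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps
            self.data, self.priorities = [], []

        def add(self, transition, td_error):
            if len(self.data) >= self.capacity:
                self.data.pop(0); self.priorities.pop(0)
            self.data.append(transition)
            self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

        def sample(self, batch_size):
            p = np.array(self.priorities); p = p / p.sum()
            idx = np.random.choice(len(self.data), size=batch_size, p=p)
            weights = (len(self.data) * p[idx]) ** (-self.beta)   # importance-sampling correction
            weights /= weights.max()
            return [self.data[i] for i in idx], idx, weights

        def update_priorities(self, idx, td_errors):
            for i, d in zip(idx, td_errors):
                self.priorities[i] = (abs(d) + self.eps) ** self.alpha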

RLJ Journal 2024 Journal Article

Investigating the Interplay of Prioritized Replay and Generalization

  • Parham Mohammad Panahi
  • Andrew Patterson
  • Martha White
  • Adam White

Experience replay, the reuse of past data to improve sample efficiency, is ubiquitous in reinforcement learning. Though a variety of smart sampling schemes have been introduced to improve performance, uniform sampling by far remains the most common approach. One exception is Prioritized Experience Replay (PER), where sampling is done proportionally to TD errors, inspired by the success of prioritized sweeping in dynamic programming. The original work on PER showed improvements in Atari, but follow-up results were mixed. In this paper, we investigate several variations on PER, to attempt to understand where and when PER may be useful. Our findings in prediction tasks reveal that while PER can improve value propagation in tabular settings, behavior is significantly different when combined with neural networks. Certain mitigations$-$like delaying target network updates to control generalization and using estimates of expected TD errors in PER to avoid chasing stochasticity$-$can avoid large spikes in error with PER and neural networks but generally do not outperform uniform replay. In control tasks, none of the prioritized variants consistently outperform uniform replay. We present new insight into the interaction between prioritization, bootstrapping, and neural networks and propose several improvements for PER in tabular settings and noisy domains.

AIJ Journal 2024 Journal Article

Investigating the properties of neural network representations in reinforcement learning

  • Han Wang
  • Erfan Miahi
  • Martha White
  • Marlos C. Machado
  • Zaheer Abbas
  • Raksha Kumaraswamy
  • Vincent Liu
  • Adam White

In this paper we investigate the properties of representations learned by deep reinforcement learning systems. Much of the early work on representations for reinforcement learning focused on designing fixed-basis architectures to achieve properties thought to be desirable, such as orthogonality and sparsity. In contrast, the idea behind deep reinforcement learning methods is that the agent designer should not encode representational properties, but rather that the data stream should determine the properties of the representation—good representations emerge under appropriate training schemes. In this paper we bring these two perspectives together, empirically investigating the properties of representations that support transfer in reinforcement learning. We introduce and measure six representational properties over more than 25,000 agent-task settings. We consider Deep Q-learning agents with different auxiliary losses in a pixel-based navigation environment, with source and transfer tasks corresponding to different goal locations. We develop a method to better understand why some representations work better for transfer, through a systematic approach varying task similarity and measuring and correlating representation properties with transfer performance. We demonstrate the generality of the methodology by investigating representations learned by a Rainbow agent that successfully transfers across Atari 2600 game modes.

ICML Conference 2024 Conference Paper

Position: Application-Driven Innovation in Machine Learning

  • David Rolnick
  • Alán Aspuru-Guzik
  • Sara Beery
  • Bistra Dilkina
  • Priya L. Donti
  • Marzyeh Ghassemi
  • Hannah Kerner
  • Claire Monteleoni

In this position paper, we argue that application-driven research has been systemically under-valued in the machine learning community. As applications of machine learning proliferate, innovative algorithms inspired by specific real-world challenges have become increasingly important. Such work offers the potential for significant impact not merely in domains of application but also in machine learning itself. In this paper, we describe the paradigm of application-driven research in machine learning, contrasting it with the more standard paradigm of methods-driven research. We illustrate the benefits of application-driven machine learning and how this approach can productively synergize with methods-driven work. Despite these benefits, we find that reviewing, hiring, and teaching practices in machine learning often hold back application-driven innovation. We outline how these processes may be improved.

YNICL Journal 2024 Journal Article

Predicting recovery following stroke: Deep learning, multimodal data and feature selection using explainable AI

  • Adam White
  • Margarita Saranti
  • Artur d’Avila Garcez
  • Thomas M.H. Hope
  • Cathy J. Price
  • Howard Bowman

Machine learning offers great potential for automated prediction of post-stroke symptoms and their response to rehabilitation. Major challenges for this endeavour include the very high dimensionality of neuroimaging data, the relatively small size of the datasets available for learning and interpreting the predictive features, as well as, how to effectively combine neuroimaging and tabular data (e.g. demographic information and clinical characteristics). This paper evaluates several solutions based on two strategies. The first is to use 2D images that summarise MRI scans. The second is to select key features that improve classification accuracy. Additionally, we introduce the novel approach of training a convolutional neural network (CNN) on images that combine regions-of-interests (ROIs) extracted from MRIs, with symbolic representations of tabular data. We evaluate a series of CNN architectures (both 2D and a 3D) that are trained on different representations of MRI and tabular data, to predict whether a composite measure of post-stroke spoken picture description ability is in the aphasic or non-aphasic range. MRI and tabular data were acquired from 758 English speaking stroke survivors who participated in the PLORAS study. Each participant was assigned to one of five different groups that were matched for initial severity of symptoms, recovery time, left lesion size and the months or years post-stroke that spoken description scores were collected. Training and validation were carried out on the first four groups. The fifth (lock-box/test set) group was used to test how well model accuracy generalises to new (unseen) data. The classification accuracy for a baseline logistic regression was 0.678 based on lesion size alone, rising to 0.757 and 0.813 when initial symptom severity and recovery time were successively added. The highest classification accuracy (0.854), area under the curve (0.899) and F1 score (0.901) were observed when 8 regions of interest were extracted from each MRI scan and combined with lesion size, initial severity and recovery time in a 2D Residual Neural Network (ResNet). This was also the best model when data were limited to the 286 participants with moderate or severe initial aphasia (with area under curve = 0.865), a group that would be considered more difficult to classify. Our findings demonstrate how imaging and tabular data can be combined to achieve high post-stroke classification accuracy, even when the dataset is small in machine learning terms. We conclude by proposing how the current models could be improved to achieve even higher levels of accuracy using images from hospital scanners.

NeurIPS Conference 2024 Conference Paper

Real-Time Recurrent Learning using Trace Units in Reinforcement Learning

  • Esraa Elelimy
  • Adam White
  • Michael Bowling
  • Martha White

Recurrent Neural Networks (RNNs) are used to learn representations in partially observable environments. For agents that learn online and continually interact with the environment, it is desirable to train RNNs with real-time recurrent learning (RTRL); unfortunately, RTRL is prohibitively expensive for standard RNNs. A promising direction is to use linear recurrent architectures (LRUs), where dense recurrent weights are replaced with a complex-valued diagonal, making RTRL efficient. In this work, we build on these insights to provide a lightweight but effective approach for training RNNs in online RL. We introduce Recurrent Trace Units (RTUs), a small modification on LRUs that we nonetheless find to have significant performance benefits over LRUs when trained with RTRL. We find RTUs significantly outperform GRUs and Transformers across several partially observable environments while using significantly less computation.
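
To illustrate why a (complex-valued) diagonal recurrence makes real-time recurrent learning tractable, the toy sketch below propagates the exact per-parameter sensitivities of a diagonal linear recurrence in O(n) per step; the actual RTU parameterization in the paper is more involved, so treat this purely as an illustration of the RTRL bookkeeping.

    import numpy as np

    n, d = 8, 3                                                   # hidden size, input size
    lam = 0.9 * np.exp(1j * np.random.uniform(0, np.pi, n))      # complex diagonal recurrence, |lam| < 1
    W = (np.random.randn(n, d) + 1j * np.random.randn(n, d)) * 0.1

    h = np.zeros(n, dtype=complex)
    dh_dlam = np.zeros(n, dtype=complex)   # sensitivity of h_t w.r.t. each lam_i (diagonal => one number each)

    for t in range(100):
        x = np.random.randn(d)
        # RTRL update: d h_t / d lam = h_{t-1} + lam * d h_{t-1} / d lam   (elementwise)
        dh_dlam = h + lam * dh_dlam
        # forward step: h_t = lam * h_{t-1} + W x_t
        h = lam * h + W @ x
    # a loss gradient w.r.t. lam can then be accumulated online from dh_dlam, with no backprop through time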

AAAI Conference 2024 Conference Paper

Reward-Respecting Subtasks for Model-Based Reinforcement Learning (Abstract Reprint)

  • Richard S. Sutton
  • Marlos C. Machado
  • G. Zacharias Holland
  • David Szepesvari
  • Finbarr Timbers
  • Brian Tanner
  • Adam White

To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress with state abstraction, but temporal abstraction has rarely been used, despite extensively developed theory based on the options framework. One reason for this is that the space of possible options is immense, and the methods previously proposed for option discovery do not take into account how the option models will be used in planning. Options are typically discovered by posing subsidiary tasks, such as reaching a bottleneck state or maximizing the cumulative sum of a sensory signal other than reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. In most previous work, the subtasks ignore the reward on the original problem, whereas we propose subtasks that use the original reward plus a bonus based on a feature of the state at the time the option terminates. We show that option models obtained from such reward-respecting subtasks are much more likely to be useful in planning than eigenoptions, shortest path options based on bottleneck states, or reward-respecting options generated by the option-critic. Reward-respecting subtasks strongly constrain the space of options and thereby also provide a partial solution to the problem of option discovery. Finally, we show how values, policies, options, and models can all be learned online and off-policy using standard algorithms and general value functions.
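
Read schematically, the abstract describes subtask returns of roughly the following form, where the option keeps the original rewards and adds a stopping bonus that depends on one feature of the terminating state (our notation, as an illustration only):

    $G_{t:T} \;=\; \sum_{k=t}^{T-1} \gamma^{\,k-t} R_{k+1} \;+\; \gamma^{\,T-t}\, b\!\left(x_i(S_T)\right)$

where $T$ is the option's stopping time and $b$ is a bonus computed from the single state feature $x_i$.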

RLC Conference 2024 Conference Paper

The Cliff of Overcommitment with Policy Gradient Step Sizes

  • Scott M. Jordan
  • Samuel Neumann
  • James E. Kostas
  • Adam White
  • Philip S. Thomas

Policy gradient methods form the basis for many successful reinforcement learning algorithms, but their success depends heavily on selecting an appropriate step size and many other hyperparameters. While many adaptive step size methods exist, none are both free of hyperparameter tuning and able to converge quickly to an optimal policy. It is unclear why these methods are insufficient, so we aim to uncover what needs to be addressed to make an effective adaptive step size for policy gradient methods. Through extensive empirical investigation, the results reveal that when the step size is above optimal, the policy overcommits to sub-optimal actions leading to longer training times. These findings suggest the need for a new kind of policy optimization that can prevent or recover from entropy collapses.

RLJ Journal 2024 Journal Article

The Cliff of Overcommitment with Policy Gradient Step Sizes

  • Scott M. Jordan
  • Samuel Neumann
  • James E. Kostas
  • Adam White
  • Philip S. Thomas

Policy gradient methods form the basis for many successful reinforcement learning algorithms, but their success depends heavily on selecting an appropriate step size and many other hyperparameters. While many adaptive step size methods exist, none are both free of hyperparameter tuning and able to converge quickly to an optimal policy. It is unclear why these methods are insufficient, so we aim to uncover what needs to be addressed to make an effective adaptive step size for policy gradient methods. Through extensive empirical investigation, the results reveal that when the step size is above optimal, the policy overcommits to sub-optimal actions leading to longer training times. These findings suggest the need for a new kind of policy optimization that can prevent or recover from entropy collapses.

TMLR Journal 2023 Journal Article

Agent-State Construction with Auxiliary Inputs

  • Ruo Yu Tao
  • Adam White
  • Marlos C. Machado

In many, if not every realistic sequential decision-making task, the decision-making agent is not able to model the full complexity of the world. The environment is often much larger and more complex than the agent, a setting also known as partial observability. In such settings, the agent must leverage more than just the current sensory inputs; it must construct an agent state that summarizes previous interactions with the world. Currently, a popular approach for tackling this problem is to learn the agent-state function via a recurrent network from the agent's sensory stream as input. Many impressive reinforcement learning applications have instead relied on environment-specific functions to aid the agent's inputs for history summarization. These augmentations are done in multiple ways, from simple approaches like concatenating observations to more complex ones such as uncertainty estimates. Although ubiquitous in the field, these additional inputs, which we term auxiliary inputs, are rarely emphasized, and it is not clear what their role or impact is. In this work we explore this idea further, and relate these auxiliary inputs to prior classic approaches to state construction. We present a series of examples illustrating the different ways of using auxiliary inputs for reinforcement learning. We show that these auxiliary inputs can be used to discriminate between observations that would otherwise be aliased, leading to more expressive features that smoothly interpolate between different states. Finally, we show that this approach is complementary to state-of-the-art methods such as recurrent neural networks and truncated back-propagation through time, and acts as a heuristic that facilitates longer temporal credit assignment, leading to better performance.
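
The simplest auxiliary input mentioned above, concatenating recent observations, can be written in a few lines; the window length k below is an arbitrary illustrative choice.

    from collections import deque
    import numpy as np

    def make_agent_state_fn(obs_dim, k=4):
        """Agent state = concatenation of the k most recent observations (zero-padded at the start)."""
        history = deque([np.zeros(obs_dim)] * k, maxlen=k)
        def agent_state(obs):
            history.append(np.asarray(obs, dtype=float))
            return np.concatenate(list(history))   # shape (k * obs_dim,), fed to the learner
        return agent_state

    state_fn = make_agent_state_fn(obs_dim=3)
    print(state_fn(np.array([0.1, 0.2, 0.3])).shape)   # (12,)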

AIJ Journal 2023 Journal Article

Reward-respecting subtasks for model-based reinforcement learning

  • Richard S. Sutton
  • Marlos C. Machado
  • G. Zacharias Holland
  • David Szepesvari
  • Finbarr Timbers
  • Brian Tanner
  • Adam White

To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress with state abstraction, but temporal abstraction has rarely been used, despite extensively developed theory based on the options framework. One reason for this is that the space of possible options is immense, and the methods previously proposed for option discovery do not take into account how the option models will be used in planning. Options are typically discovered by posing subsidiary tasks, such as reaching a bottleneck state or maximizing the cumulative sum of a sensory signal other than reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. In most previous work, the subtasks ignore the reward on the original problem, whereas we propose subtasks that use the original reward plus a bonus based on a feature of the state at the time the option terminates. We show that option models obtained from such reward-respecting subtasks are much more likely to be useful in planning than eigenoptions, shortest path options based on bottleneck states, or reward-respecting options generated by the option-critic. Reward respecting subtasks strongly constrain the space of options and thereby also provide a partial solution to the problem of option discovery. Finally, we show how values, policies, options, and models can all be learned online and off-policy using standard algorithms and general value functions.

JMLR Journal 2022 Journal Article

A Generalized Projected Bellman Error for Off-policy Value Estimation in Reinforcement Learning

  • Andrew Patterson
  • Adam White
  • Martha White

Many reinforcement learning algorithms rely on value estimation; however, the most widely used algorithms, namely temporal difference algorithms, can diverge under both off-policy sampling and nonlinear function approximation. Many algorithms have been developed for off-policy value estimation based on the linear mean squared projected Bellman error (MSPBE) and are sound under linear function approximation. Extending these methods to the nonlinear case has been largely unsuccessful. Recently, several methods have been introduced that approximate a different objective, the mean-squared Bellman error (MSBE), which naturally facilitates nonlinear approximation. In this work, we build on these insights and introduce a new generalized MSPBE that extends the linear MSPBE to the nonlinear setting. We show how this generalized objective unifies previous work and obtain new bounds for the value error of the solutions of the generalized objective. We derive an easy-to-use, but sound, algorithm to minimize the generalized objective, and show that it is more stable across runs, is less sensitive to hyperparameters, and performs favorably across four control domains with neural network function approximation.
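
For orientation, the linear MSPBE that this work generalizes is usually written as

    $\text{MSPBE}(\mathbf{w}) \;=\; \left\lVert \Pi \left( T_\pi v_{\mathbf{w}} - v_{\mathbf{w}} \right) \right\rVert_{d}^{2}$

where $T_\pi$ is the Bellman operator, $\Pi$ is the projection onto the span of the linear features, and $d$ weights states by the state distribution; the paper's contribution is a generalization of this objective that remains well defined with nonlinear function approximation.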

AAAI Conference 2022 Conference Paper

Learning Expected Emphatic Traces for Deep RL

  • Ray Jiang
  • Shangtong Zhang
  • Veronica Chelu
  • Adam White
  • Hado van Hasselt

Off-policy sampling and experience replay are key for improving sample efficiency and scaling model-free temporal difference learning methods. When combined with function approximation, such as neural networks, the result is known as the deadly triad and is potentially unstable. Recently, it has been shown that stability and good performance at scale can be achieved by combining emphatic weightings and multi-step updates. This approach, however, is generally limited to sampling complete trajectories in order, to compute the required emphatic weighting. In this paper we investigate how to combine emphatic weightings with non-sequential, off-line data sampled from a replay buffer. We develop a multi-step emphatic weighting that can be combined with replay, and a time-reversed n-step TD learning algorithm to learn the required emphatic weighting. We show that these state weightings reduce variance compared with prior approaches, while providing convergence guarantees. We tested the approach at scale on Atari 2600 video games, and observed that the new X-ETD(n) agent improved over baseline agents, highlighting both the scalability and broad applicability of our approach.

NeurIPS Conference 2021 Conference Paper

Continual Auxiliary Task Learning

  • Matthew McLeod
  • Chunlok Lo
  • Matthew Schlegel
  • Andrew Jacobsen
  • Raksha Kumaraswamy
  • Martha White
  • Adam White

Learning auxiliary tasks, such as multiple predictions about the world, can provide many benefits to reinforcement learning systems. A variety of off-policy learning algorithms have been developed to learn such predictions, but as yet there is little work on how to adapt the behavior to gather useful data for those off-policy predictions. In this work, we investigate a reinforcement learning system designed to learn a collection of auxiliary tasks, with a behavior policy learning to take actions to improve those auxiliary predictions. We highlight the inherent non-stationarity in this continual auxiliary task learning problem, for both prediction learners and the behavior learner. We develop an algorithm based on successor features that facilitates tracking under non-stationary rewards, and prove the separation into learning successor features and rewards provides convergence rate improvements. We conduct an in-depth study into the resulting multi-prediction learning system.

JAIR Journal 2021 Journal Article

General Value Function Networks

  • Matthew Schlegel
  • Andrew Jacobsen
  • Zaheer Abbas
  • Andrew Patterson
  • Adam White
  • Martha White

State construction is important for learning in partially observable environments. A general purpose strategy for state construction is to learn the state update using a Recurrent Neural Network (RNN), which updates the internal state using the current internal state and the most recent observation. This internal state provides a summary of the observed sequence, to facilitate accurate predictions and decision-making. At the same time, specifying and training RNNs is notoriously tricky, particularly as the common strategy to approximate gradients back in time, called truncated Back-prop Through Time (BPTT), can be sensitive to the truncation window. Further, domain-expertise—which can usually help constrain the function class and so improve trainability—can be difficult to incorporate into complex recurrent units used within RNNs. In this work, we explore how to use multi-step predictions to constrain the RNN and incorporate prior knowledge. In particular, we revisit the idea of using predictions to construct state and ask: does constraining (parts of) the state to consist of predictions about the future improve RNN trainability? We formulate a novel RNN architecture, called a General Value Function Network (GVFN), where each internal state component corresponds to a prediction about the future represented as a value function. We first provide an objective for optimizing GVFNs, and derive several algorithms to optimize this objective. We then show that GVFNs are more robust to the truncation level, in many cases only requiring one-step gradient updates.

JAIR Journal 2020 Journal Article

Adapting Behavior via Intrinsic Reward: A Survey and Empirical Study

  • Cam Linke
  • Nadia M. Ady
  • Martha White
  • Thomas Degris
  • Adam White

Learning about many things can provide numerous benefits to a reinforcement learning system. For example, learning many auxiliary value functions, in addition to optimizing the environmental reward, appears to improve both exploration and representation learning. The question we tackle in this paper is how to sculpt the stream of experience—how to adapt the learning system’s behavior—to optimize the learning of a collection of value functions. A simple answer is to compute an intrinsic reward based on the statistics of each auxiliary learner, and use reinforcement learning to maximize that intrinsic reward. Unfortunately, implementing this simple idea has proven difficult, and thus has been the focus of decades of study. It remains unclear which of the many possible measures of learning would work well in a parallel learning setting where environmental reward is extremely sparse or absent. In this paper, we investigate and compare different intrinsic reward mechanisms in a new bandit-like parallel-learning testbed. We discuss the interaction between reward and prediction learners and highlight the importance of introspective prediction learners: those that increase their rate of learning when progress is possible, and decrease when it is not. We provide a comprehensive empirical comparison of 14 different rewards, including well-known ideas from reinforcement learning and active learning. Our results highlight a simple but seemingly powerful principle: intrinsic rewards based on the amount of learning can generate useful behavior, if each individual learner is introspective.

RLDM Conference 2019 Conference Abstract

A Value Function Basis for Nexting and Multi-step Prediction

  • Andrew Jacobsen
  • Vincent Liu
  • Adam White
  • Martha White

Humans and animals continuously make short-term cumulative predictions about their sensory-input stream, an ability referred to by psychologists as nexting. This ability has been recreated in a mobile robot by learning thousands of value function predictions in parallel. In practice, however, there are limitations on the number of things that an autonomous agent can learn. In this paper, we investigate inferring new predictions from a minimal set of learned General Value Functions. We show that linearly weighting such a collection of value function predictions enables us to make accurate multi-step predictions, and provide a closed-form solution to estimate this linear weighting. Similarly, we provide a closed-form solution to estimate value functions with arbitrary discount parameters γ.

RLDM Conference 2019 Conference Abstract

Investigating Curiosity for Multi-Prediction Learning

  • Cameron Linke
  • Nadia M Ady
  • Martha White
  • Adam White

This paper investigates a computational analog of curiosity to drive behavior adaption in learning systems with multiple prediction objectives. The primary goal is to learn multiple independent predictions in parallel from data produced by some decision making policy—learning for the sake of learning. We can frame this as a reinforcement learning problem, where a decision maker’s objective is to provide training data for each of the prediction learners, with reward based on each learner’s progress. Despite the variety of potential rewards—mainly from the literature on curiosity and intrinsic motivation—there has been little systematic investigation into suitable curiosity rewards in a pure exploration setting. In this paper, we formalize this pure exploration problem as a multi-arm bandit, enabling different learning scenarios to be simulated by different types of targets for each arm and enabling careful study of the large suite of potential curiosity rewards. We test 15 different analogs of well-known curiosity reward schemes, and compare their performance across a wide array of prediction problems. This investigation elucidates issues with several curiosity rewards for this pure exploration setting, and highlights a promising direction using a simple curiosity reward based on the use of step-size adapted learners.

AAAI Conference 2019 Conference Paper

Meta-Descent for Online, Continual Prediction

  • Andrew Jacobsen
  • Matthew Schlegel
  • Cameron Linke
  • Thomas Degris
  • Adam White
  • Martha White

This paper investigates different vector step-size adaptation approaches for non-stationary online, continual prediction problems. Vanilla stochastic gradient descent can be considerably improved by scaling the update with a vector of appropriately chosen step-sizes. Many methods, including AdaGrad, RMSProp, and AMSGrad, keep statistics about the learning process to approximate a second order update—a vector approximation of the inverse Hessian. Another family of approaches uses meta-gradient descent to adapt the step-size parameters to minimize prediction error. These meta-descent strategies are promising for non-stationary problems, but have not been as extensively explored as quasi-second order methods. We first derive a general, incremental meta-descent algorithm, called AdaGain, designed to be applicable to a much broader range of algorithms, including those with semi-gradient updates or even those with accelerations, such as RMSProp. We provide an empirical comparison of methods from both families. We conclude that methods from both families can perform well, but in non-stationary prediction problems the meta-descent methods exhibit advantages. Our method is particularly robust across several prediction problems, and is competitive with the state-of-the-art method on a large-scale, time-series prediction problem on real data from a mobile robot.

IJCAI Conference 2019 Conference Paper

Planning with Expectation Models

  • Yi Wan
  • Muhammad Zaheer
  • Adam White
  • Martha White
  • Richard S. Sutton

Distribution and sample models are two popular model choices in model-based reinforcement learning (MBRL). However, learning these models can be intractable, particularly when the state and action spaces are large. Expectation models, on the other hand, are relatively easier to learn due to their compactness and have also been widely used for deterministic environments. For stochastic environments, it is not obvious how expectation models can be used for planning as they only partially characterize a distribution. In this paper, we propose a sound way of using approximate expectation models for MBRL. In particular, we 1) show that planning with an expectation model is equivalent to planning with a distribution model if the state value function is linear in state features, 2) analyze two common parametrization choices for approximating the expectation: linear and non-linear expectation models, 3) propose a sound model-based policy evaluation algorithm and present its convergence results, and 4) empirically demonstrate the effectiveness of the proposed planning algorithm.
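
The key equivalence mentioned in the abstract rests on linearity of expectation: if the value estimate is linear in the state features, planning with the expected next feature vector gives the same backup as planning with the full next-state distribution,

    $\hat v(s) = \mathbf{w}^{\top} \mathbf{x}(s) \;\;\Rightarrow\;\; \mathbb{E}\!\left[ \hat v(S_{t+1}) \mid S_t = s, A_t = a \right] = \mathbf{w}^{\top}\, \mathbb{E}\!\left[ \mathbf{x}(S_{t+1}) \mid S_t = s, A_t = a \right]$

so an expectation model over features suffices for policy evaluation in that case.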

AAMAS Conference 2019 Conference Paper

Prediction in Intelligence: An Empirical Comparison of Off-policy Algorithms on Robots

  • Banafsheh Rafiee
  • Sina Ghiassian
  • Adam White
  • Richard S. Sutton

The ability to continually make predictions about the world may be central to intelligence. Off-policy learning and general value functions (GVFs) are well-established algorithmic techniques for learning about many signals while interacting with the world. In the past couple of years, many ambitious works have used off-policy GVF learning to improve control performance in both simulation and robotic control tasks. Many of these works use semi-gradient temporal-difference (TD) learning algorithms, like Q-learning, which are potentially divergent. In the last decade, several TD learning algorithms have been proposed that are convergent and computationally efficient, but not much is known about how they perform in practice, especially on robots. In this work, we perform an empirical comparison of modern off-policy GVF learning algorithms on three different robot platforms, providing insights into their strengths and weaknesses. We also discuss the challenges of conducting fair comparative studies of off-policy learning on robots and develop a new evaluation methodology that is successful and applicable to a relatively complicated robot domain.

NeurIPS Conference 2018 Conference Paper

Context-dependent upper-confidence bounds for directed exploration

  • Raksha Kumaraswamy
  • Matthew Schlegel
  • Adam White
  • Martha White

Directed exploration strategies for reinforcement learning are critical for learning an optimal policy in a minimal number of interactions with the environment. Many algorithms use optimism to direct exploration, either through visitation estimates or upper confidence bounds, as opposed to data-inefficient strategies like ε-greedy that use random, undirected exploration. Most data-efficient exploration methods require significant computation, typically relying on a learned model to guide exploration. Least-squares methods have the potential to provide some of the data-efficiency benefits of model-based approaches—because they summarize past interactions—with the computation closer to that of model-free approaches. In this work, we provide a novel, computationally efficient, incremental exploration strategy, leveraging this property of least-squares temporal difference learning (LSTD). We derive upper confidence bounds on the action-values learned by LSTD, with context-dependent (or state-dependent) noise variance. Such context-dependent noise focuses exploration on a subset of variable states, and allows for reduced exploration in other states. We empirically demonstrate that our algorithm can converge more quickly than other incremental exploration strategies using confidence estimates on action-values.
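
Although the abstract does not spell out the bound, confidence bounds of this kind typically take a form familiar from linear bandits, an estimated action-value plus a term that grows with the (context-dependent) uncertainty of the linear estimate; schematically, and purely as an illustration in our notation,

    $\text{UCB}(s, a) \;=\; \hat q(s, a) \;+\; c \,\sqrt{ \boldsymbol{\phi}(s, a)^{\top} \mathbf{C}\, \boldsymbol{\phi}(s, a) }$

where $\boldsymbol{\phi}(s,a)$ are the features, $\mathbf{C}$ is an estimate of the covariance of the learned weights, and $c$ trades off exploration.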

IJCAI Conference 2018 Conference Paper

Organizing Experience: a Deeper Look at Replay Mechanisms for Sample-Based Planning in Continuous State Domains

  • Yangchen Pan
  • Muhammad Zaheer
  • Adam White
  • Andrew Patterson
  • Martha White

Model-based strategies for control are critical to obtain sample efficient learning. Dyna is a planning paradigm that naturally interleaves learning and planning, by simulating one-step experience to update the action-value function. This elegant planning strategy has been mostly explored in the tabular setting. The aim of this paper is to revisit sample-based planning, in stochastic and continuous domains with learned models. We first highlight the flexibility afforded by a model over Experience Replay (ER). Replay-based methods can be seen as stochastic planning methods that repeatedly sample from a buffer of recent agent-environment interactions and perform updates to improve data efficiency. We show that a model, as opposed to a replay buffer, is particularly useful for specifying which states to sample from during planning, such as predecessor states that propagate information in reverse from a state more quickly. We introduce a semi-parametric model learning approach, called Reweighted Experience Models (REMs), that makes it simple to sample next states or predecessors. We demonstrate that REM-Dyna exhibits similar advantages over replay-based methods in learning in continuous state problems, and that the performance gap grows when moving to stochastic domains of increasing size.
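
A stripped-down sketch of the planning loop the abstract describes is given below. The model interface (`sample_state`, `sample_next`, `sample_predecessor`) and the plain TD update are placeholders chosen for illustration, not REM's actual semi-parametric estimator.

```python
import numpy as np

def dyna_planning_steps(w, model, features, n_planning=10, gamma=0.99, alpha=0.1):
    """Illustrative Dyna-style planning with a learned model.

    `model` is assumed to expose:
      sample_state()         -> a state to plan from
      sample_next(s)         -> (r, s_next) simulated one-step experience
      sample_predecessor(s)  -> a state believed to lead into s
    Sampling predecessors lets value information propagate backwards from a
    recently updated state, which a plain replay buffer cannot target in the
    same way.
    """
    for _ in range(n_planning):
        s = model.sample_state()
        # Occasionally plan from a predecessor of s to push values backwards.
        if np.random.rand() < 0.5:
            s = model.sample_predecessor(s)
        r, s_next = model.sample_next(s)
        phi, phi_next = features(s), features(s_next)
        delta = r + gamma * (w @ phi_next) - (w @ phi)
        w = w + alpha * delta * phi
    return w
```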

AAAI Conference 2017 Conference Paper

Accelerated Gradient Temporal Difference Learning

  • Yangchen Pan
  • Adam White
  • Martha White

The family of temporal difference (TD) methods spans a spectrum from computationally frugal linear methods like TD(λ) to data-efficient least-squares methods. Least-squares methods make the best use of available data by directly computing the TD solution and thus do not require tuning a typically highly sensitive learning rate parameter, but require quadratic computation and storage. Recent algorithmic developments have yielded several sub-quadratic methods that use an approximation to the least squares TD solution, but incur bias. In this paper, we propose a new family of accelerated gradient TD (ATD) methods that (1) provide similar data efficiency benefits to least-squares methods, at a fraction of the computation and storage, (2) significantly reduce parameter sensitivity compared to linear TD methods, and (3) are asymptotically unbiased. We illustrate these claims with a proof of convergence in expectation and experiments on several benchmark domains and a large-scale industrial energy allocation domain.
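
One plausible reading of an accelerated-gradient-TD-style update, written as a batch sketch for clarity, combines a rank-k pseudoinverse of the estimated A matrix with a small full-rank step. The incremental low-rank maintenance and exact regularization used in the paper are not reproduced here; the details below are assumptions.

```python
import numpy as np

def atd_style_update(w, A_hat, delta, e, alpha=0.1, eta=1e-3, k=10):
    """Hedged sketch of an accelerated-gradient-TD-style step.
    A rank-k pseudoinverse of the estimated A matrix plus a small identity
    term gives some of the conditioning benefits of least-squares methods at
    reduced cost.  The specific form here is an assumption for illustration."""
    U, s, Vt = np.linalg.svd(A_hat)
    s_inv = np.where(np.arange(len(s)) < k, 1.0 / np.maximum(s, 1e-8), 0.0)
    A_pinv_k = (Vt.T * s_inv) @ U.T               # rank-k pseudoinverse of A_hat
    return w + (alpha * A_pinv_k + eta * np.eye(len(w))) @ (delta * e)
```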

AAMAS Conference 2016 Conference Paper

A Greedy Approach to Adapting the Trace Parameter for Temporal Difference Learning

  • Martha White
  • Adam White

One of the main obstacles to broad application of reinforcement learning methods is the parameter sensitivity of our core learning algorithms. In many large-scale applications, online computation and function approximation represent key strategies in scaling up reinforcement learning algorithms. In this setting, we have effective and reasonably well understood algorithms for adapting the learning-rate parameter, online during learning. Such meta-learning approaches can improve robustness of learning and enable specialization to the current task, improving learning speed. For temporal-difference learning algorithms, which we study here, there is yet another parameter, λ, that similarly impacts learning speed and stability in practice. Unfortunately, unlike the learning-rate parameter, λ parametrizes the objective function that temporal-difference methods optimize. Different choices of λ produce different fixed-point solutions, and thus adapting λ online and characterizing the optimization is substantially more complex than adapting the learning-rate parameter. There is no meta-learning method for λ that achieves (1) incremental updating, (2) compatibility with function approximation, and (3) stability of learning under both on- and off-policy sampling. In this paper we contribute a novel objective function for optimizing λ as a function of state rather than time. We derive a new incremental, linear-complexity λ-adaptation algorithm that does not require offline batch updating or access to a model of the world, and present a suite of experiments illustrating the practicality of our new algorithm in three different settings. Taken together, our contributions represent a concrete step towards black-box application of temporal-difference learning methods in real world problems.
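
The mechanical change the abstract argues for, making λ a function of state rather than a single constant, is small; the sketch below shows it inside an accumulating-trace TD update. The callable `lambda_of` and this particular trace convention are assumptions for illustration; the paper's greedy objective for choosing λ(s) is not reproduced here.

```python
import numpy as np

def td_lambda_state_dependent(w, e, phi, phi_next, r, lambda_of, gamma=0.99, alpha=0.1):
    """TD(λ)-style step with a state-dependent trace parameter λ(s).
    `lambda_of(phi)` is a hypothetical callable returning λ for the current
    state's features; only the plumbing is shown, not how λ(s) is learned."""
    delta = r + gamma * (w @ phi_next) - (w @ phi)
    e = gamma * lambda_of(phi) * e + phi      # trace decays by a per-state λ
    w = w + alpha * delta * e
    return w, e
```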

EWRL Workshop 2016 Workshop Paper

Accelerated Gradient Temporal Difference Learning

  • Yangchen Pan
  • Adam White
  • Martha White

The family of temporal difference (TD) methods spans a spectrum from computationally frugal linear methods like TD(λ) to data-efficient least-squares methods. Least-squares methods make the best use of available data by directly computing the TD solution, and thus do not require tuning a typically highly sensitive learning rate parameter, but require quadratic computation and storage. Recent algorithmic developments have yielded several sub-quadratic methods that use an approximation to the least squares TD solution, but incur bias. In this paper, we propose a new family of accelerated gradient TD (ATD) methods that (1) provide similar data efficiency benefits to least-squares methods, at a fraction of the computation and storage, (2) significantly reduce parameter sensitivity compared to linear TD methods, and (3) are asymptotically unbiased. We illustrate these claims with a proof of convergence in expectation and experiments on several benchmark domains, and a large-scale industrial energy allocation domain.

RLDM Conference 2015 Conference Abstract

Investigating the trace decay parameter in on-policy and off-policy reinforcement learning

  • Adam White
  • Martha White

This paper investigates how varying the trace decay parameter for gradient temporal difference learning affects the speed of learning and stability in off-policy reinforcement learning. Gradient temporal difference algorithms incorporate importance sampling ratios into the eligibility trace memories, and these ratios can be large and destabilize learning, particularly when the behavior policy and target policy are severely mismatched. Because the trace decay parameter influences the length of the memory, it can have a dramatic effect on stability under off-policy updating. There has been some prior investigation into adapting the trace decay parameter in the on-policy setting. These insights provide useful heuristics, but on their own, cannot mitigate the variance issues that can arise in the off-policy setting due to policy mismatch. In this paper, we empirically compare several heuristics for setting the trace decay parameter in an on-policy Markov chain domain and in an off-policy domain designed to produce instability for temporal difference methods. We demonstrate that previous intuitions for setting the trace decay parameter remain useful, but require a shift in focus to balance efficient learning while guarding against off-policy instability.
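
The variance mechanism the abstract refers to can be seen directly in a generic off-policy trace update of the kind gradient-TD methods use. The sketch below is only an illustration, not the exact algorithms compared in the abstract: the importance sampling ratio ρ multiplies the trace, so long memories (large λ) under severe policy mismatch can inflate it.

```python
import numpy as np

def off_policy_trace_step(w, e, phi, phi_next, r, rho, lam, gamma=0.99, alpha=0.1):
    """Generic off-policy TD(λ)-style step.  rho = pi(a|s) / mu(a|s) is the
    importance sampling ratio; it scales the eligibility trace, so large
    ratios combined with a long trace memory can destabilize learning."""
    e = rho * (gamma * lam * e + phi)
    delta = r + gamma * (w @ phi_next) - (w @ phi)
    w = w + alpha * delta * e
    return w, e
```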

RLDM Conference 2013 Conference Abstract

Nexting and State Discovery in Robot Microworlds

  • Joseph Modayil
  • Adam White
  • Ashique Mahmood
  • Darlinton Prauchner
  • Richard Sutton

We describe our recent work in reinforcement learning robots and its relationship to psychological ideas. We have recently shown how a robot can learn and make thousands of short-term predictions about its future stimuli, based on thousands of features, on-line and in real time. This is similar to the psychological phenomena of “nexting,” in which animals learn to predict what sensory events will happen next, and sensory preconditioning. Our methodology is to study computational nexting in simple animal-like robots living in tightly controlled, small environments. This parallels a long tradition in artificial intelligence of studying “microworlds”: small simulated worlds, such as games and blocks worlds, that include important issues in a simplified form. Our use of robot microworlds is also analogous to the tightly controlled environments used when studying learning and brain function in the natural sciences. In ongoing and future work, we are exploring how nexting can provide a criterion for the discovery of state representations—memories or traces of past stimuli and actions that are helpful for making accurate predictions.

AAMAS Conference 2011 Conference Paper

Horde: A Scalable Real-time Architecture for Learning Knowledge from Unsupervised Sensorimotor Interaction

  • Richard S. Sutton
  • Joseph Modayil
  • Michael Delp
  • Thomas Degris
  • Patrick M. Pilarski
  • Adam White
  • Doina Precup

Maintaining accurate world knowledge in a complex and changing environment is a perennial problem for robots and other artificial intelligence systems. Our architecture for addressing this problem, called Horde, consists of a large number of independent reinforcement learning sub-agents, or demons. Each demon is responsible for answering a single predictive or goal-oriented question about the world, thereby contributing in a factored, modular way to the system's overall knowledge. The questions are in the form of a value function, but each demon has its own policy, reward function, termination function, and terminal-reward function unrelated to those of the base problem. Learning proceeds in parallel by all demons simultaneously so as to extract the maximal training information from whatever actions are taken by the system as a whole. Gradient-based temporal-difference learning methods are used to learn efficiently and reliably with function approximation in this off-policy setting. Horde runs in constant time and memory per time step, and is thus suitable for learning online in real-time applications such as robotics. We present results using Horde on a multi-sensored mobile robot to successfully learn goal-oriented behaviors and long-term predictions from off-policy experience. Horde is a significant incremental step towards a real-time architecture for efficient learning of general knowledge from unsupervised sensorimotor interaction.
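
The architectural idea is simple to sketch: many independent predictors, each with its own question (policy, cumulant, termination), all updated from the same stream of experience. The toy rendering below uses a plain semi-gradient TD update standing in for the gradient-TD learners Horde actually uses, and the `Demon` fields are assumptions for illustration.

```python
import numpy as np

class Demon:
    """Toy GVF 'demon': one predictive question about the sensorimotor stream.
    Fields are illustrative; Horde itself uses gradient-TD learners here."""
    def __init__(self, n_features, cumulant, gamma, alpha=0.1):
        self.w = np.zeros(n_features)
        self.cumulant = cumulant      # function of the observation, e.g. a sensor reading
        self.gamma = gamma            # termination / horizon for this question
        self.alpha = alpha

    def update(self, phi, phi_next, obs_next):
        z = self.cumulant(obs_next)
        delta = z + self.gamma * (self.w @ phi_next) - (self.w @ phi)
        self.w += self.alpha * delta * phi

def horde_step(demons, phi, phi_next, obs_next):
    # Every demon learns in parallel from the same transition,
    # whatever behavior generated it.
    for d in demons:
        d.update(phi, phi_next, obs_next)
```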

NeurIPS Conference 2010 Conference Paper

Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains

  • Martha White
  • Adam White

The reinforcement learning community has explored many approaches to obtaining value estimates and models to guide decision making; these approaches, however, do not usually provide a measure of confidence in the estimate. Accurate estimates of an agent’s confidence are useful for many applications, such as biasing exploration and automatically adjusting parameters to reduce dependence on parameter-tuning. Computing confidence intervals on reinforcement learning value estimates, however, is challenging because data generated by the agent-environment interaction rarely satisfies traditional assumptions. Samples of value estimates are dependent, likely non-normally distributed and often limited, particularly in early learning when confidence estimates are pivotal. In this work, we investigate how to compute robust confidences for value estimates in continuous Markov decision processes. We illustrate how to use bootstrapping to compute confidence intervals online under a changing policy (previously not possible) and prove validity under a few reasonable assumptions. We demonstrate the applicability of our confidence estimation algorithms with experiments on exploration, parameter estimation and tracking.
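
A minimal percentile-bootstrap sketch in the spirit of the abstract is shown below. It treats a window of recent value estimates as the sample and resamples it, which is only the basic mechanism; the paper's online procedure, its handling of dependent and non-normal data under a changing policy, and its validity conditions are not reproduced here.

```python
import numpy as np

def bootstrap_value_interval(recent_estimates, n_boot=1000, alpha=0.05, rng=None):
    """Percentile bootstrap interval for a value estimate from a window of
    recent (possibly dependent) sample estimates.  Illustrative only."""
    rng = rng or np.random.default_rng()
    x = np.asarray(recent_estimates)
    means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```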

JMLR Journal 2009 Journal Article

RL-Glue: Language-Independent Software for Reinforcement-Learning Experiments

  • Brian Tanner
  • Adam White

RL-Glue is a standard, language-independent software package for reinforcement-learning experiments. The standardization provided by RL-Glue facilitates code sharing and collaboration. Code sharing reduces the need to re-engineer tasks and experimental apparatus, both common barriers to comparatively evaluating new ideas in the context of the literature. Our software features a minimalist interface and works with several languages and computing platforms. RL-Glue compatibility can be extended to any programming language that supports network socket communication. RL-Glue has been used to teach classes, to run international competitions, and is currently used by several other open-source software and hardware projects.
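
The value of a language-independent interface is easiest to see in miniature. The sketch below is a generic agent/environment contract in the spirit of such glue code; it is not RL-Glue's actual API or wire protocol, which marshals these calls over network sockets so that agent, environment, and experiment can live in different languages and processes.

```python
from typing import Any, Protocol, Tuple

class Agent(Protocol):
    def start(self, observation: Any) -> Any: ...                 # returns first action
    def step(self, reward: float, observation: Any) -> Any: ...   # returns next action
    def end(self, reward: float) -> None: ...

class Environment(Protocol):
    def start(self) -> Any: ...                                   # returns first observation
    def step(self, action: Any) -> Tuple[float, Any, bool]: ...   # reward, observation, terminal

def run_episode(agent: Agent, env: Environment, max_steps: int = 1000) -> float:
    """Drive one episode through the shared interface; neither side needs to
    know how the other is implemented."""
    total = 0.0
    obs = env.start()
    action = agent.start(obs)
    for _ in range(max_steps):
        reward, obs, terminal = env.step(action)
        total += reward
        if terminal:
            agent.end(reward)
            break
        action = agent.step(reward, obs)
    return total
```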