Arrow Research search

Author name cluster

William Dabney

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
1 author row

Possible papers (7)

RLDM 2015 · Conference Abstract

RLPy: A Value-Function-Based Reinforcement Learning Framework for Education and Research

  • Alborz Geramifard
  • Christoph Dann
  • Robert Klein
  • William Dabney
  • Jonathan How

RLPy (http://acl.mit.edu/rlpy) is an open-source reinforcement learning (RL) package with a focus on linear function approximation for value-based techniques and planning problems with discrete actions. The aim of this package is to: a) boost the RL education process, and b) enable crisp, easy-to-debug experimentation with existing and new methods. RLPy achieves these goals by providing a rich library of fine-grained, easily exchangeable components for learning agents (e.g., policies or representations of value functions). Developed in Python, RLPy allows fast prototyping, yet harnesses the power of state-of-the-art numerical libraries such as scipy and parallelization to scale to large problems. Furthermore, RLPy is self-contained: the package includes code profiling, domain visualizations, and data analysis. Finally, RLPy is available under the Modified BSD License, which allows integration with third-party software with little legal entanglement.

JMLR 2015 · Journal Article

RLPy: A Value-Function-Based Reinforcement Learning Framework for Education and Research

  • Alborz Geramifard
  • Christoph Dann
  • Robert H. Klein
  • William Dabney
  • Jonathan P. How

RLPy is an object-oriented reinforcement learning software package with a focus on value-function-based methods using linear function approximation and discrete actions. The framework was designed for both educational and research purposes. It provides a rich library of fine-grained, easily exchangeable components for learning agents (e.g., policies or representations of value functions), facilitating recently increased specialization in reinforcement learning. RLPy is written in Python to allow fast prototyping, but is also suitable for large-scale experiments through its built-in support for optimized numerical libraries and parallelization. Code profiling, domain visualizations, and data analysis are integrated in a self-contained package available under the Modified BSD License at github.com/rlpy/rlpy. All of these properties allow users to compare various reinforcement learning algorithms with little effort.
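To make the component decomposition the two RLPy abstracts describe concrete, here is a self-contained sketch of the same design idea: domain, representation, policy, and agent as fine-grained, exchangeable pieces. All class names and signatures below are hypothetical illustrations, not RLPy's actual API.

```python
import random

# Illustration of the domain / representation / policy / agent decomposition.
# Hypothetical names throughout; this is not RLPy's actual API.

class ChainDomain:
    """Tiny 5-state chain; reward 1.0 for reaching the right end."""
    n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right

    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        self.s = max(0, self.s - 1) if a == 0 else min(self.n_states - 1, self.s + 1)
        done = self.s == self.n_states - 1
        return self.s, (1.0 if done else 0.0), done

class TabularRepresentation:
    """Value-function representation: a plain Q-table here, but any linear
    feature map could be swapped in without touching the agent."""
    def __init__(self, domain):
        self.q = [[0.0] * domain.n_actions for _ in range(domain.n_states)]

class EpsilonGreedyPolicy:
    def __init__(self, rep, epsilon=0.1):
        self.rep, self.epsilon = rep, epsilon

    def choose(self, s):
        qs = self.rep.q[s]
        if random.random() < self.epsilon:
            return random.randrange(len(qs))
        return max(range(len(qs)), key=qs.__getitem__)

class QLearningAgent:
    def __init__(self, rep, policy, alpha=0.5, gamma=0.95):
        self.rep, self.policy, self.alpha, self.gamma = rep, policy, alpha, gamma

    def learn(self, s, a, r, s2, done):
        target = r + (0.0 if done else self.gamma * max(self.rep.q[s2]))
        self.rep.q[s][a] += self.alpha * (target - self.rep.q[s][a])

# Experiment loop: components are wired together once and remain exchangeable.
domain = ChainDomain()
rep = TabularRepresentation(domain)
agent = QLearningAgent(rep, EpsilonGreedyPolicy(rep))
for _ in range(200):
    s, done = domain.reset(), False
    while not done:
        a = agent.policy.choose(s)
        s2, r, done = domain.step(a)
        agent.learn(s, a, r, s2, done)
        s = s2
```

Because the agent only touches the representation and policy interfaces, swapping in a different feature map or exploration strategy leaves the experiment loop unchanged, which is the kind of fine-grained exchangeability the abstracts describe.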

AAAI 2014 · Conference Paper

Natural Temporal Difference Learning

  • William Dabney
  • Philip Thomas

In this paper we investigate the application of natural gradient descent to Bellman error based reinforcement learning algorithms. This combination is interesting because natural gradient descent is invariant to the parameterization of the value function. This invariance property means that natural gradient descent adapts its update directions to correct for poorly conditioned representations. We present and analyze quadratic and linear time natural temporal difference learning algorithms, and prove that they are covariant. We conclude with experiments which suggest that the natural algorithms can match or outperform their non-natural counterparts using linear function approximation, and drastically improve upon their non-natural counterparts when using non-linear function approximation.
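As a rough illustration of the invariance argument, a natural TD update preconditions the usual TD step by the inverse of a metric matrix estimated from the features. The quadratic-time sketch below is an illustrative variant in that spirit; the metric estimate, step-sizes, and ridge term are assumptions, not a transcription of the paper's algorithms.

```python
import numpy as np

def natural_td0_step(w, G, phi, r, phi_next,
                     alpha=0.1, beta=0.01, gamma=0.99, ridge=1e-6):
    """One illustrative natural TD(0) update for linear values v(s) = phi . w.
    G is a running estimate of the metric E[phi phi^T]; preconditioning the
    semi-gradient by G^{-1} keeps the update direction consistent under
    invertible linear re-codings of the features, so poorly conditioned
    representations no longer distort the step."""
    delta = r + gamma * (phi_next @ w) - phi @ w        # TD error
    G = (1 - beta) * G + beta * np.outer(phi, phi)      # update metric estimate
    direction = np.linalg.solve(G + ridge * np.eye(len(w)), delta * phi)
    return w + alpha * direction, G

# Typical initialization for d-dimensional features:
d = 8
w, G = np.zeros(d), np.eye(d)
```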

RLDM 2013 · Conference Abstract

Performance Metrics for Reinforcement Learning Algorithms

  • William Dabney
  • Philip Thomas
  • Andrew Barto

Due to the continued growth of the field of reinforcement learning (RL), the number of RL algorithms has increased to the point where an individual researcher cannot experiment with all of them. To facilitate the decision of which algorithms to invest time in, we, as a field, need methods that thoroughly and accurately describe the performance of algorithms. Two approaches for evaluating RL algorithms are commonly used, neither of which fully accomplishes these goals. The first approach is to manually test each algorithm with a collection of parameter values and report the best results found. This can introduce an unintended bias in an algorithm’s favor because researchers have more insight when selecting parameter values for methods with which they are familiar. The second approach is to perform a large parameter optimization for each algorithm and to report the best results found. This approach does not accurately capture the difficulty of finding good parameter values. The fundamental problem with both approaches is that the robustness of the algorithm to its parameter values is ignored. In the first approach this results in biased evaluations. On the other hand, in the second approach it causes only the combination of the RL algorithm and parameter optimization to be evaluated, which allows the parameter optimization to compensate for weaknesses in the RL algorithm. By this standard, directly searching for fixed policies in the parameter optimization will yield the best algorithm possible. We propose a performance metric for RL algorithms that tells a much larger part of the story of an algorithm’s performance and robustness to its parameter values. It allows us as RL researchers to be better informed about the performance of our algorithms, and to report results that are also more informative to our audiences. The key insight is to measure performance in terms of expected percentage of fixed, deterministic policies that the algorithm outperforms.
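The proposed metric lends itself to a simple Monte-Carlo estimate: sample fixed deterministic policies uniformly, evaluate each, and report the fraction the algorithm beats. The sketch below assumes a hypothetical evaluate_policy routine that returns a policy's mean return on the task.

```python
import random

def fraction_outperformed(algorithm_return, evaluate_policy,
                          n_states, n_actions, n_samples=1000):
    """Monte-Carlo estimate of the expected percentage of fixed,
    deterministic policies that a learning run with this return outperforms.
    `evaluate_policy` is a hypothetical stand-in for running a policy on the
    task and returning its mean return."""
    beaten = 0
    for _ in range(n_samples):
        # A fixed deterministic policy: one action chosen per state.
        policy = [random.randrange(n_actions) for _ in range(n_states)]
        if algorithm_return > evaluate_policy(policy):
            beaten += 1
    return beaten / n_samples
```

Averaging this quantity over sampled parameter settings of the algorithm, rather than reporting only the best setting found, is what lets the measure reflect robustness to parameter values.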

NeurIPS 2013 · Conference Paper

Projected Natural Actor-Critic

  • Philip Thomas
  • William Dabney
  • Stephen Giguere
  • Sridhar Mahadevan

Natural actor-critics are a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes. In this paper we address a drawback of natural actor-critics that limits their real-world applicability: their lack of safety guarantees. We present a principled algorithm for performing natural gradient descent over a constrained domain. In the context of reinforcement learning, this allows for natural actor-critic algorithms that are guaranteed to remain within a known safe region of policy space. While deriving our class of constrained natural actor-critic algorithms, which we call Projected Natural Actor-Critics (PNACs), we also elucidate the relationship between natural gradient descent and mirror descent.
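A minimal sketch of the step the abstract describes: precondition the policy gradient by an estimated Fisher metric, take the natural step, then project back onto a known safe region. The box-shaped safe region and Euclidean projection below are simplifying assumptions for illustration, not the paper's construction.

```python
import numpy as np

def projected_natural_step(theta, fisher, grad, alpha, lo, hi, ridge=1e-8):
    """One illustrative projected natural gradient ascent step.
    `fisher` is an estimate of the Fisher information at theta; `lo`/`hi`
    define a hypothetical box-shaped safe region of policy space."""
    natural_grad = np.linalg.solve(fisher + ridge * np.eye(len(theta)), grad)
    theta = theta + alpha * natural_grad     # unconstrained natural step
    return np.clip(theta, lo, hi)            # Euclidean projection onto the box
```

For a box constraint, clipping is the exact Euclidean projection; the mirror-descent relationship mentioned in the abstract concerns principled projected steps of this preconditioned form.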

AAAI 2012 · Conference Paper

Adaptive Step-Size for Online Temporal Difference Learning

  • William Dabney
  • Andrew Barto

The step-size, often denoted as α, is a key parameter for most incremental learning algorithms. Its importance is especially pronounced when performing online temporal difference (TD) learning with function approximation. Several methods have been developed to adapt the step-size online. These range from straightforward back-off strategies to adaptive algorithms based on gradient descent. We derive an adaptive upper bound on the step-size parameter to guarantee that online TD learning with linear function approximation will not diverge. We then empirically evaluate algorithms using this upper bound as a heuristic for adapting the step-size parameter online. We compare performance with related work including HL(λ) and Autostep. Our results show that this adaptive upper bound heuristic outperforms all existing methods without requiring any meta-parameters. This effectively eliminates the need to tune the learning rate of temporal difference learning with linear function approximation.
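To show where such a bound slots into the update, here is a TD(λ) sketch in which the step-size is capped online. The specific cap below, α ≤ 1 / |e·(φ − γφ′)|, is one plausible instantiation of a non-divergence bound of the kind the abstract describes; treat it as an assumption rather than the paper's exact expression.

```python
import numpy as np

def td_lambda_adaptive(w, e, phi, r, phi_next, alpha, gamma=0.99, lam=0.8):
    """One online TD(lambda) update with an adaptively capped step-size.
    The cap alpha <= 1 / |e . (phi - gamma * phi_next)| is an assumed form
    of a non-divergence bound, not necessarily the paper's exact result."""
    e = gamma * lam * e + phi                       # accumulating eligibility trace
    delta = r + gamma * (phi_next @ w) - phi @ w    # TD error
    denom = abs(e @ (phi - gamma * phi_next))
    if denom > 0.0:
        alpha = min(alpha, 1.0 / denom)             # tighten the upper bound
    return w + alpha * delta * e, e, alpha
```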

IJCAI 2007 · Conference Paper

Utile Distinctions for Relational Reinforcement Learning

  • William Dabney
  • Amy McGovern

We introduce an approach to autonomously creating state space abstractions for an online reinforcement learning agent using a relational representation. Our approach uses a tree-based function approximation derived from McCallum's [1995] UTree algorithm. We have extended this approach to use a relational representation where relational observations are represented by attributed graphs [McGovern et al., 2003]. We address the challenges introduced by a relational representation by using stochastic sampling to manage the search space [Srinivasan, 1999] and temporal sampling to manage autocorrelation [Jensen and Neville, 2002]. Relational UTree incorporates Iterative Tree Induction [Utgoff et al., 1997] to allow it to adapt to changing environments. We empirically demonstrate that Relational UTree performs better than similar relational learning methods [Finney et al., 2002; Driessens et al., 2001] in a blocks world domain. We also demonstrate that Relational UTree can learn to play a sub-task of the game of Go called Tsume-Go [Ramon et al., 2001].
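To ground the tree-based function approximation described above, here is a toy sketch of the UTree idea: a decision tree whose internal nodes test features of the observation and whose leaves store Q-values. The relational tests on attributed graphs are reduced to simple triple predicates here, and all names are illustrative, not the paper's implementation.

```python
class Leaf:
    """A leaf of the value tree holds one Q-value per action."""
    def __init__(self, n_actions):
        self.q = [0.0] * n_actions

class Split:
    """An internal node tests a (boolean) feature of the observation."""
    def __init__(self, test, if_true, if_false):
        self.test, self.if_true, self.if_false = test, if_true, if_false

def find_leaf(node, obs):
    """Route an observation down the tree to the leaf holding its Q-values."""
    while isinstance(node, Split):
        node = node.if_true if node.test(obs) else node.if_false
    return node

# A toy relational observation: an attributed graph flattened into
# (subject, relation, object) triples, e.g. a blocks-world snapshot.
obs = {("b", "on", "a"), ("a", "on", "table")}

def something_on_a(observation):
    return any(rel == "on" and obj == "a" for (_, rel, obj) in observation)

tree = Split(something_on_a, Leaf(n_actions=4), Leaf(n_actions=4))
leaf = find_leaf(tree, obs)   # routes to the if_true leaf for this observation
```

In UTree-style methods, new splits are introduced when a statistical test finds that a candidate distinction separates experiences with reliably different values; the stochastic and temporal sampling cited in the abstract manage that search over relational candidates.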