Author name cluster

Richard Sutton

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

31 papers

1 author row

AAAI Conference 2020 Conference Paper

Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning

Kristopher De Asis
Alan Chan
Silviu Pitis
Richard Sutton
Daniel Graves

We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a fixed number of future time steps. To learn the value function for horizon h, these algorithms bootstrap from the value function for horizon h−1, or some shorter horizon. Because no value function bootstraps from itself, fixed-horizon methods are immune to the stability problems that plague other off-policy TD methods using function approximation (also known as “the deadly triad”). Although fixed-horizon methods require the storage of additional value functions, this gives the agent additional predictive power, while the added complexity can be substantially reduced via parallel updates, shared weights, and n-step bootstrapping. We show how to use fixed-horizon value functions to solve reinforcement learning problems competitively with methods such as Q-learning that learn conventional value functions. We also prove convergence of fixed-horizon temporal difference methods with linear and general function approximation. Taken together, our results establish fixed-horizon TD methods as a viable new way of avoiding the stability problems of the deadly triad.

AAAI Conference 2018 Conference Paper

Multi-Step Reinforcement Learning: A Unifying Algorithm

Kristopher De Asis
J. Hernandez-Garcia
G. Holland
Richard Sutton

Unifying seemingly disparate algorithmic ideas to produce better performing algorithms has been a longstanding goal in reinforcement learning. As a primary example, TD(λ) elegantly unifies one-step TD prediction with Monte Carlo methods through the use of eligibility traces and the trace-decay parameter. Currently, there are a multitude of algorithms that can be used to perform TD control, including Sarsa, Q-learning, and Expected Sarsa. These methods are often studied in the one-step case, but they can be extended across multiple time steps to achieve better performance. Each of these algorithms is seemingly distinct, and no one dominates the others for all problems. In this paper, we study a new multi-step action-value algorithm called Q(σ) that unifies and generalizes these existing algorithms, while subsuming them as special cases. A new parameter, σ, is introduced to allow the degree of sampling performed by the algorithm at each step during its backup to be continuously varied, with Sarsa existing at one extreme (full sampling), and Expected Sarsa existing at the other (pure expectation). Q(σ) is generally applicable to both on- and off-policy learning, but in this work we focus on experiments in the on-policy case. Our results show that an intermediate value of σ, which results in a mixture of the existing algorithms, performs better than either extreme. The mixture can also be varied dynamically which can result in even greater performance.

RLDM Conference 2017 Conference Abstract

Policy Iteration for Discounted Reinforcement Learning Problems in Continuous Time and

Jaeyoung Lee
Richard Sutton

Recent advances in various fields regarding decision making, especially regarding reinforcement learning (RL), have revealed the interdisciplinary connections among their findings. For example, actor and critic in computational RL are shown to play the same roles of dorsal and ventral striatum; goal-directed and habitual learning is strongly relevant to model-based and model-free computational RL, respectively. Among the different methodologies in those fields, theoretical approach in machine learning community has established the well-defined computational RL framework in discrete domain and a dynamic programming method known as policy iteration (PI), both of which served as the fundamentals in computational RL methods. The main focus of this work is to develop such RL framework and a series of PI methods in continuous domain, with its environment modeled by an ordinary differential equation (ODE). Similar to the discrete case, the PI methods are designed to recursively find the best decision-making strategy by iterating policy evaluation (as a role of critic) and policy improvement (as a role of actor). Each proposed one is either model-free corresponding to habitual learning, or partially model-free (or partially model-based) corresponding to somewhere between goal-directed (model-based) and habitual (model-free) learning. This work also provides theoretical background and perhaps, the basic principles to RL algorithms with a real physical task which is usually modeled by ODEs. In detail, we propose on-policy PI and then four off-policy PI methods—the two off-policy methods are the ideal PI forms of advantage updating and Q-learning, and the other two are extensions of the existing off-policy PI methods; compared to PI in optimal control, ours do not require an initial stabilizing policy. The mathematical properties of admissibility, monotone improvement, and convergence are all rigorously proven; simulation examples are provided to support the theory.

RLDM Conference 2015 Conference Abstract

A Deeper Look at Planning as Learning from Replay

Harm van Seijen
Richard Sutton

In reinforcement learning, the notions of experience replay, and of planning as learning from replayed experience, have long been used to find good policies with minimal training data. Replay can be seen either as model-based reinforcement learning, where the store of past experiences serves as the model, or as a way to avoid a conventional model of the environment altogether. In this paper, we look more deeply at how replay blurs the line between model-based and model-free methods. Specifically, we show for the first time an exact equivalence between the sequence of value functions found by a model- based policy-evaluation method and by a model-free method with replay. We then use insights gained from these relationships to design a new reinforcement learning algorithm for linear function approximation. This method, which we call forgetful LSTD(λ), improves upon regular LSTD(λ) because it extends more naturally to online control, and improves upon linear Dyna because it is a multi-step method, enabling it to perform well even in non-Markov problems or, equivalently, in problems with significant function approximation.

RLDM Conference 2015 Conference Abstract

Combining Approximate Planning and Learning in a Cascade

Joseph Modayil
Kavosh Asadi
Richard Sutton

A core competence of an intelligent agent is the ability to learn an approximate model of the world and then plan with it. Planning is computationally intensive, but arguably necessary for rapidly find- ing good behavior. It is also possible to find good behavior directly from experience, using model-free reinforcement-learning methods which, because they are computationally cheaper, can use a larger repre- sentation with more informative features. Our first result is an empirical demonstration that model-free learning with a larger representation can perform better asymptotically than planning with a smaller repre- sentation. This motivates exploring agent architectures that combine planning (with a small representation) and learning (with a large representation) to get the benefits of both. In this paper we explore a combination in which planning proceeds oblivious to learning, and then learning, in parallel, adds to the approximate value function found by planning. We call this combination a cascade. We show empirically that our cas- cade obtains both benefits in the Mountain-Car and Puddle-World problems. We also prove formally that the cascade’s asymptotic performance is equal to that of model-free learning under mild conditions in a pre- diction (policy evaluation) setting. Finally, another way in which learning may be advantaged over planning is that it can use eligibility traces. We show empirically that in this case the cascade is superior even if planning and learning share the same representation.

EWRL Workshop 2015 Workshop Paper

Emphatic Temporal-Difference Learning

Ashique Rupam Mahmood
Huizhen Yu
Martha White
Richard Sutton

Emphatic algorithms are temporal-difference learning algorithms that change their effective state distribution by selectively emphasizing and de-emphasizing their updates on different time steps. Recent works by Sutton, Mahmood and White (2015), and Yu (2015) show that by varying the emphasis in a particular way, these algorithms become stable and convergent under off-policy training with linear function approximation. This paper serves as a unified summary of the available results from both works. In addition, we demonstrate the empirical benefits from the flexibility of emphatic algorithms, including state-dependent discounting, state-dependent bootstrapping, and the user-specified allocation of function approximation resources.

RLDM Conference 2015 Conference Abstract

Off-policy learning with linear function approximation based on weighted importance sam- pling

Ashique Rupam Mahmood
Richard Sutton

An important branch of reinforcement learning is off-policy learning where the agent behaves according to one policy but learns about a different policy. Many modern algorithms for model-free rein- forcement learning have incorporated off-policy learning together with parametric function approximation and sophisticated techniques such as eligibility traces. They all use a Monte Carlo technique known as importance sampling as a core component. The ordinary importance sampling estimator typically has high variance, and consequently, off-policy learning algorithms often exhibit poor performance. In Monte Carlo estimation, this problem is overcome using a variant of importance sampling called weighted importance sampling which often has much lower variance. However, weighted importance sampling has been ne- glected in off-policy learning due to the difficulty of combining it with parametric function approximation and hence not been utilized in modern reinforcement learning algorithms. In this work, we provide the key ideas on how off-policy learning algorithms for linear function approximation can be developed based on weighted importance sampling. We work with two different forms of methods for linear function approx- imation, methods of least squares resulting in an ideal but computationally expensive form and methods based on stochastic gradient descent providing a computationally congenial approximation. We empiri- cally demonstrate that the new algorithms can achieve substantial performance gain over the state-of-the-art off-policy algorithms and hence retain the benefits of weighted importance sampling.

RLDM Conference 2015 Conference Abstract

Progress Toward the Shared Control of a Prosthetic Arm

Ann Edwards
Michael Dawson
Jacqueline Hebert
Craig Sherstan
Richard Sutton
K Chan
Patrick Pilarski

State-of-the-art myoelectric prostheses typically have a greater number of functions than the pos- sible number of control signals, requiring amputees to manually switch through a fixed list of options to select the desired function. For this reason, the control of powered prosthetic arms is often considered com- plex and non-intuitive. Previous studies have demonstrated that techniques from reinforcement learning, and in particular General Value Functions (GVFs), can be applied to develop temporally extended predic- tions about signals related to prosthetic arm movement. In particular, we have shown that it is possible to learn and maintain predictions about which joint of a robotic arm a user intends to use next, and use this information to create and update an adaptive switching list. In this work, we extend previous studies by demonstrating the real-time use of adaptive switching by an amputee in a simple control task with a myo- electric arm. We also present results from a non-amputee subject controlling a myoelectric arm in a more complex task, providing evidence for the scalability of the learning system. Our results suggest that, com- pared with a fixed-list switching method, adaptive switching can significantly decrease the amount of time and the number of switches required for the control of a robotic arm and potentially reduce the cognitive burden on myoelectric arm users. Furthermore, we anticipate the future blending of human and machine decision making for the shared control of a robotic arm. Given high enough prediction certainty, a robotic arm could begin to switch autonomously between joints, reducing the time and effort required by amputees to activate complex prosthetic motions.

NeurIPS Conference 2014 Conference Paper

Universal Option Models

Hengshuai Yao
Csaba Szepesvari
Richard Sutton
Joseph Modayil
Shalabh Bhatnagar

We consider the problem of learning models of options for real-time abstract planning, in the setting where reward functions can be specified at any time and their expected returns must be efficiently computed. We introduce a new model for an option that is independent of any reward function, called the {\it universal option model (UOM)}. We prove that the UOM of an option can construct a traditional option model given a reward function, and the option-conditional return is computed directly by a single dot-product of the UOM with the reward function. We extend the UOM to linear function approximation, and we show it gives the TD solution of option returns and value functions of policies over options. We provide a stochastic approximation algorithm for incrementally learning UOMs from data and prove its consistency. We demonstrate our method in two domains. The first domain is document recommendation, where each user query defines a new reward function and a document's relevance is the expected return of a simulated random-walk through the document's references. The second domain is a real-time strategy game, where the controller must select the best game unit to accomplish dynamically-specified tasks. Our experiments show that UOMs are substantially more efficient in evaluating option returns and policies than previously known methods.

NeurIPS Conference 2014 Conference Paper

Weighted importance sampling for off-policy learning with linear function approximation

A. Rupam Mahmood
Hado van Hasselt
Richard Sutton

Importance sampling is an essential component of off-policy model-free reinforcement learning algorithms. However, its most effective variant, \emph{weighted} importance sampling, does not carry over easily to function approximation and, because of this, it is not utilized in existing off-policy learning algorithms. In this paper, we take two steps toward bridging this gap. First, we show that weighted importance sampling can be viewed as a special case of weighting the error of individual training samples, and that this weighting has theoretical and empirical benefits similar to those of weighted importance sampling. Second, we show that these benefits extend to a new weighted-importance-sampling version of off-policy LSTD(lambda). We show empirically that our new WIS-LSTD(lambda) algorithm can result in much more rapid and reliable convergence than conventional off-policy LSTD(lambda) (Yu 2010, Bertsekas & Yu 2009).

RLDM Conference 2013 Conference Abstract

Magnitude and Timing during Extinction and Reacquisition of Conditioned Nictitating Mem- brane Movements in the Rabbit (Oryctolagus cuniculus)

E James Kehoe
Elliot Ludvig
Richard Sutton

The temporal difference (TD) algorithm provides a method for continuously revising predictions about future events, usually in the face of uncertainty. Classical conditioning of the eyelid response in hu- mans and animals provides a useful platform for examining the biological implementation of TD learning processes. In particular, the current predictive value of one event for another can be observed in the acqui- sition of an eyelid closure initiated during the warning period provided by “conditioned stimulus” (CS, i. e. , soft tone) for a subsequent “unconditioned stimulus” (US, i. e. , stimulation of the trigeminal nerve near the eye). The present experiment manipulated the reliability of the predictive relationship between the CS and US. Observations of the duration and magnitude of CRs revealed that a TD model of eyelid conditioning must simultaneously accommodate the following three constraints: (1) the degree of temporal uncertainty in predictions of the US as evidenced by the duration of the CR that overlaps the period in which the US occurs; (2) the stability in the CR’s duration across a wide range of uncertainty in the reliability of CS→US pairings; (3) large changes in CR magnitude to manipulations of the reliability of CS→US pairings

RLDM Conference 2013 Conference Abstract

Nexting and State Discovery in Robot Microworlds

Joseph Modayil
Adam White
Ashique Mahmood
Darlinton Prauchner
Richard Sutton

We describe our recent work in reinforcement learning robots and its relationship to psychological ideas. We have recently shown how a robot can learn and make thousands of short-term predictions about its future stimuli, based on thousands of features, on-line and in real time. This is similar to the psychological phenomena of “nexting, ” in which animals learn to predict what sensory events will happen next, and sensory preconditioning. Our methodology is to study computational nexting in simple animal-like robots living in tightly controlled, small environments. This parallels a long tradition in artificial intelligence of studying ”microworlds–small simulated worlds, such as games and blocks worlds, that include important issues in a simplified form. Our use of robot microworlds is also analogous to the tightly controlled environments used when studying learning and brain function in the natural sciences. In ongoing and future work, we are exploring how nexting can provide a criteria for the discovery of state representations—memories or traces of past stimuli and actions that are helpful for making accurate predictions.

RLDM Conference 2013 Conference Abstract

Temporal-Difference Learning to Assist Human Decision Making during the Control of an Ar- tificial Limb

Ann Edwards
Alexandra Kearney
Michael Dawson
Rehabilitation Hospital
Richard Sutton
Patrick Pilarski

In this work we explore the use of reinforcement learning (RL) to help with human decision making, combining state-of-the-art RL algorithms with an application to prosthetics. Managing human- machine interaction is a problem of considerable scope, and the simplification of human-robot interfaces is especially important in the domains of biomedical technology and rehabilitation medicine. For example, amputees who control artificial limbs are often required to quickly switch between a number of control actions or modes of operation in order to operate their devices. We suggest that by learning to anticipate (predict) a user’s behaviour, artificial limbs could take on an active role in a human’s control decisions so as to reduce the burden on their users. Recently, we showed that RL in the form of general value functions (GVFs) could be used to accurately detect a user’s control intent prior to their explicit control choices. In the present work, we explore the use of temporal-difference learning and GVFs to predict when users will switch their control influence between the different motor functions of a robot arm. Experiments were performed using a multi-function robot arm that was controlled by muscle signals from a user’s body (similar to conventional artificial limb control). Our approach was able to acquire and maintain forecasts about a user’s switching decisions in real time. It also provides an intuitive and reward-free way for users to correct or reinforce the decisions made by the machine learning system. We expect that when a system is certain enough about its predictions, it can begin to take over switching decisions from the user to streamline control and potentially decrease the time and effort needed to complete tasks. This preliminary study therefore suggests a way to naturally integrate human and machine-based decision making systems.

NeurIPS Conference 2009 Conference Paper

Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

Shalabh Bhatnagar
Doina Precup
David Silver
Richard Sutton
Hamid Maei
Csaba Szepesvári

We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD($\lambda$), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i. e. , the parameters of the approximator may diverge). Sutton et al (2009a, b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman-error, and algorithms that perform stochastic gradient-descent on this function. In this paper, we generalize their work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms for any finite Markov decision process and any smooth value function approximator, under usual stochastic approximation conditions. The computational complexity per iteration scales linearly with the number of parameters of the approximator. The algorithms are incremental and are guaranteed to converge to locally optimal solutions.

NeurIPS Conference 2009 Conference Paper

Multi-Step Dyna Planning for Policy Evaluation and Control

Hengshuai Yao
Shalabh Bhatnagar
Dongcui Diao
Richard Sutton
Csaba Szepesvári

We extend Dyna planning architecture for policy evaluation and control in two significant aspects. First, we introduce a multi-step Dyna planning that projects the simulated state/feature many steps into the future. Our multi-step Dyna is based on a multi-step model, which we call the {\em $\lambda$-model}. The $\lambda$-model interpolates between the one-step model and an infinite-step model, and can be learned efficiently online. Second, we use for Dyna control a dynamic multi-step model that is able to predict the results of a sequence of greedy actions and track the optimal policy in the long run. Experimental results show that Dyna using the multi-step model evaluates a policy faster than using single-step models; Dyna control algorithms using the dynamic tracking model are much faster than model-free algorithms; further, multi-step Dyna control algorithms enable the policy and value function to converge much faster to their optima than single-step Dyna algorithms.

NeurIPS Conference 2008 Conference Paper

A computational model of hippocampal function in trace conditioning

Elliot Ludvig
Richard Sutton
Eric Verbeek
E. Kehoe

We present a new reinforcement-learning model for the role of the hippocampus in classical conditioning, focusing on the differences between trace and delay conditioning. In the model, all stimuli are represented both as unindividuated wholes and as a series of temporal elements with varying delays. These two stimulus representations interact, producing different patterns of learning in trace and delay conditioning. The model proposes that hippocampal lesions eliminate long-latency temporal elements, but preserve short-latency temporal elements. For trace conditioning, with no contiguity between stimulus and reward, these long-latency temporal elements are vital to learning adaptively timed responses. For delay conditioning, in contrast, the continued presence of the stimulus supports conditioned responding, and the short-latency elements suppress responding early in the stimulus. In accord with the empirical data, simulated hippocampal damage impairs trace conditioning, but not delay conditioning, at medium-length intervals. With longer intervals, learning is impaired in both procedures, and, with shorter intervals, in neither. In addition, the model makes novel predictions about the response topography with extended stimuli or post-training lesions. These results demonstrate how temporal contiguity, as in delay conditioning, changes the timing problem faced by animals, rendering it both easier and less susceptible to disruption by hippocampal lesions.

NeurIPS Conference 2008 Conference Paper

A Convergent $O(n)$ Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation

Richard Sutton
Hamid Maei
Csaba Szepesvári

We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, target policy, and exciting behavior policy, and whose complexity scales linearly in the number of parameters. We consider an i. i. d. \ policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L_2 norm. Our analysis proves that its expected update is in the direction of the gradient, assuring convergence under the usual stochastic approximation conditions to the same least-squares solution as found by the LSTD, but without its quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods.

IJCAI Conference 2007 Conference Paper

David Silver
Richard Sutton
Martin M
uuml; ller

We explore an application to the game of Go of a reinforcement learning approach based on a linear evaluation function and large numbers of binary features. This strategy has proved effective in game playing programs and other reinforcement learning applications. We apply this strategy to Go by creating over a million features based on templates for small fragments of the board, and then use temporal difference learning and self-play. This method identifies hundreds of low level shapes with recognisable significance to expert Go players, and provides quantitive estimates of their values. We analyse the relative contributions to performance of templates of different types and sizes. Our results show that small, translation-invariant templates are surprisingly effective. We assess the performance of our program by playing against the Average Liberty Player and a variety of computer opponents on the 9x9 Computer Go Server. Our linear evaluation function appears to outperform all other static evaluation functions that do not incorporate substantial domain knowledge.

NeurIPS Conference 2007 Conference Paper

Incremental Natural Actor-Critic Algorithms

Shalabh Bhatnagar
Mohammad Ghavamzadeh
Mark Lee
Richard Sutton

We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic rein- forcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their com- patibility with function approximation methods, which are needed to handle large or in(cid: 2)nite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further re- duce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal differ- ence learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the (cid: 2)rst convergence proofs and the (cid: 2)rst fully incremental algorithms.

NeurIPS Conference 2006 Conference Paper

iLSTD: Eligibility Traces and Convergence Analysis

Alborz Geramifard
Michael Bowling
Martin Zinkevich
Richard Sutton

We present new theoretical and empirical results with the iLSTD algorithm for policy evaluation in reinforcement learning with linear function approximation. iLSTD is an incremental method for achieving results similar to LSTD, the dataefficient, least-squares version of temporal difference learning, without incurring the full cost of the LSTD computation. LSTD is O(n2 ), where n is the number of parameters in the linear function approximator, while iLSTD is O(n). In this paper, we generalize the previous iLSTD algorithm and present three new results: (1) the first convergence proof for an iLSTD algorithm; (2) an extension to incorporate eligibility traces without changing the asymptotic computational complexity; and (3) the first empirical results with an iLSTD algorithm for a problem (mountain car) with feature vectors large enough (n = 10, 000) to show substantial computational advantages over LSTD.

NeurIPS Conference 2005 Conference Paper

Off-policy Learning with Options and Recognizers

Doina Precup
Cosmin Paduraru
Anna Koop
Richard Sutton
Satinder Singh

We introduce a new algorithm for off-policy temporal-difference learning with function approximation that has lower variance and requires less knowledge of the behavior policy than prior methods. We develop the notion of a recognizer, a filter on actions that distorts the behavior policy to produce a related target policy with low-variance importance-sampling corrections. We also consider target policies that are deviations from the state distribution of the behavior policy, such as potential temporally abstract options, which further reduces variance. This paper introduces recognizers and their potential advantages, then develops a full algorithm for linear function approximation and proves that its updates are in the same direction as on-policy TD updates, which implies asymptotic convergence. Even though our algorithm is based on importance sampling, we prove that it requires absolutely no knowledge of the behavior policy for the case of state-aggregation function approximators. Off-policy learning is learning about one way of behaving while actually behaving in another way. For example, Q-learning is an off- policy learning method because it learns about the optimal policy while taking actions in a more exploratory fashion, e. g. , according to an -greedy policy. Off-policy learning is of interest because only one way of selecting actions can be used at any time, but we would like to learn about many different ways of behaving from the single resultant stream of experience. For example, the options framework for temporal abstraction involves considering a variety of different ways of selecting actions. For each such option one would like to learn a model of its possible outcomes suitable for planning and other uses. Such option models have been proposed as fundamental building blocks of grounded world knowledge (Sutton, Precup & Singh, 1999; Sutton, Rafols & Koop, 2005). Using off-policy learning, one would be able to learn predictive models for many options at the same time from a single stream of experience. Unfortunately, off-policy learning using temporal-difference methods has proven problematic when used in conjunction with function approximation. Function approximation is essential in order to handle the large state spaces that are inherent in many problem do- mains. Q-learning, for example, has been proven to converge to an optimal policy in the tabular case, but is unsound and may diverge in the case of linear function approximation (Baird, 1996). Precup, Sutton, and Dasgupta (2001) introduced and proved convergence for the first off-policy learning algorithm with linear function approximation. They addressed the problem of learning the expected value of a target policy based on experience generated using a different behavior policy. They used importance sampling techniques to reduce the off-policy case to the on-policy case, where existing convergence theorems apply (Tsitsiklis & Van Roy, 1997; Tadic, 2001). There are two important difficulties with that approach. First, the behavior policy needs to be stationary and known, because it is needed to compute the importance sampling corrections. Second, the importance sampling weights are often ill-conditioned. In the worst case, the variance could be infinite and convergence would not occur. The conditions required to prevent this were somewhat awkward and, even when they applied and asymptotic convergence was assured, the variance could still be high and convergence could be slow. In this paper we address both of these problems in the context of off-policy learning for options. We introduce the notion of a recognizer. Rather than specifying an explicit target policy (for instance, the policy of an option), about which we want to make predictions, a recognizer specifies a condition on the actions that are selected. For example, a recognizer for the temporally extended action of picking up a cup would not specify which hand is to be used, or what the motion should be at all different positions of the cup. The recognizer would recognize a whole variety of directions of motion and poses as part of picking the cup. The advantage of this strategy is not that one might prefer a multitude of different behaviors, but that the behavior may be based on a variety of different strategies, all of which are relevant, and we would like to learn from any of them. In general, a recognizer is a function that recognizes or accepts a space of different ways of behaving and thus, can learn from a wider range of data. Recognizers have two advantages over direct specification of a target policy: 1) they are a natural and easy way to specify a target policy for which importance sampling will be well conditioned, and 2) they do not require the behavior policy to be known. The latter is important because in many cases we may have little knowledge of the behavior policy, or a stationary behavior policy may not even exist. We show that for the case of state aggregation, even if the behavior policy is unknown, convergence to a good model is achieved. 1 Non-sequential example The benefits of using recognizers in off-policy learning can be most easily seen in a nonsequential context with a single continuous action. Suppose you are given a sequence of sample actions ai [0, 1], selected i. i. d. according to probability density b: [0, 1] + (the behavior density). For example, suppose the behavior density is of the oscillatory form shown as a red line in Figure 1. For each each action, ai, we observe a corresponding outcome, zi, a random variable whose distribution depends only on ai. Thus the behavior density induces an outcome density. The on-policy problem is to estimate the mean mb of the outcome density. This problem can be solved simply by averaging the sample outcomes: mb = (1/n) n=1 zi. The off-policy problem is to use this same data to learn what ^ i the mean would be if actions were selected in some way other than b, for example, if the actions were restricted to a designated range, such as between 0. 7 and 0. 9. There are two natural ways to pose this off-policy problem. The most straightforward way is to be equally interested in all actions within the designated region. One professes to be interested in actions selected according to a target density: [0, 1] +, which in the example would be 5. 0 between 0. 7 and 0. 9, and zero elsewhere, as in the dashed line in 12 Probability density functions Target policy with recognizer Target policy w/o recognizer 1. 5 Empirical variances (average of 200 sample variances) 1 without recognizer. 5 with recognizer 100 200 300 400 500

NeurIPS Conference 2005 Conference Paper

Temporal Abstraction in Temporal-difference Networks

Eddie Rafols
Anna Koop
Richard Sutton

We present a generalization of temporal-difference networks to include temporally abstract options on the links of the question network. Temporal-difference (TD) networks have been proposed as a way of representing and learning a wide variety of predictions about the interaction between an agent and its environment. These predictions are compositional in that their targets are defined in terms of other predictions, and subjunctive in that that they are about what would happen if an action or sequence of actions were taken. In conventional TD networks, the inter-related predictions are at successive time steps and contingent on a single action; here we generalize them to accommodate extended time intervals and contingency on whole ways of behaving. Our generalization is based on the options framework for temporal abstraction. The primary contribution of this paper is to introduce a new algorithm for intra-option learning in TD networks with function approximation and eligibility traces. We present empirical examples of our algorithm's effectiveness and of the greater representational expressiveness of temporallyabstract TD networks. The primary distinguishing feature of temporal-difference (TD) networks (Sutton & Tanner, 2005) is that they permit a general compositional specification of the goals of learning. The goals of learning are thought of as predictive questions being asked by the agent in the learning problem, such as "What will I see if I step forward and look right? " or "If I open the fridge, will I see a bottle of beer? " Seeing a bottle of beer is of course a complicated perceptual act. It might be thought of as obtaining a set of predictions about what would happen if certain reaching and grasping actions were taken, about what would happen if the bottle were opened and turned upside down, and of what the bottle would look like if viewed from various angles. To predict seeing a bottle of beer is thus to make a prediction about a set of other predictions. The target for the overall prediction is a composition in the mathematical sense of the first prediction with each of the other predictions. TD networks are the first framework for representing the goals of predictive learning in a compositional, machine-accessible form. Each node of a TD network represents an individual question--something to be predicted--and has associated with it a value representing an answer to the question--a prediction of that something. The questions are represented by a set of directed links between nodes. If node 1 is linked to node 2, then node 1 rep- resents a question incorporating node 2's question; its value is a prediction about node 2's prediction. Higher-level predictions can be composed in several ways from lower ones, producing a powerful, structured representation language for the targets of learning. The compositional structure is not just in a human designer's head; it is expressed in the links and thus is accessible to the agent and its learning algorithm. The network of these links is referred to as the question network. An entirely separate set of directed links between the nodes is used to compute the values (predictions, answers) associated with each node. These links collectively are referred to as the answer network. The computation in the answer network is compositional in a conventional way--node values are computed from other node values. The essential insight of TD networks is that the notion of compositionality should apply to questions as well as to answers. A secondary distinguishing feature of TD networks is that the predictions (node values) at each moment in time can be used as a representation of the state of the world at that time. In this way they are an instance of the idea of predictive state representations (PSRs) introduced by Littman, Sutton and Singh (2002), Jaeger (2000), and Rivest and Schapire (1987). Representing a state by its predictions is a potentially powerful strategy for state abstraction (Rafols et al. , 2005). We note that the questions used in all previous work with PSRs are defined in terms of concrete actions and observations, not other predictions. They are not compositional in the sense that TD-network questions are. The questions we have discussed so far are subjunctive, meaning that they are conditional on a certain way of behaving. We predict what we would see if we were to step forward and look right, or if we were to open the fridge. The questions in conventional TD networks are subjunctive, but they are conditional only on primitive actions or open-loop sequences of primitive actions (as are conventional PSRs). It is natural to generalize this, as we have in the informal examples above, to questions that are conditional on closed-loop temporally extended ways of behaving. For example, opening the fridge is a complex, high-level action. The arm must be lifted to the door, the hand shaped for grasping the handle, etc. To ask questions like "if I were to go to the coffee room, would I see John? " would require substantial temporal abstraction in addition to state abstraction. The options framework (Sutton, Precup & Singh, 1999) is a straightforward way of talking about temporally extended ways of behaving and about predictions of their outcomes. In this paper we extend the options framework so that it can be applied to TD networks. Significant extensions of the original options framework are needed. Novel features of our option-extended TD networks are that they 1) predict components of option outcomes rather than full outcome probability distributions, 2) learn according to the first intra-option method to use eligibility traces (see Sutton & Barto, 1998), and 3) include the possibility of options whose `policies' are indifferent to which of several actions are selected. 1 The options framework In this section we present the essential elements of the options framework (Sutton, Precup & Singh, 1999) that we will need for our extension of TD networks. In this framework, an agent and an environment interact at discrete time steps t = 1, 2, 3. .. . In each state st S, the agent selects an action at A, determining the next state st+1. 1 An action is a way of behaving for one time step; the options framework lets us talk about temporally extended ways of behaving. An individual option consists of three parts. The first is the initiation set, I S, the subset of states in which the option can be started. The second component of an option is its policy, : S A [0, 1], specifying how the agent behaves when 1 Although the options framework includes rewards, we omit them here because we are concerned only with prediction, not control. following the option. Finally, a termination function, : S A [0, 1], specifies how the option ends: (s) denotes the probability of terminating when in state s. The option is thus completely and formally defined by the 3-tuple (I, , ). 2 Conventional TD networks In this section we briefly present the details of the structure and the learning algorithm comprising TD networks as introduced by Sutton and Tanner (2005). TD networks address a prediction problem in which the agent may not have direct access to the state of the environment. Instead, at each time step the agent receives an observation ot O dependent on the state. The experience stream thus consists of a sequence of alternating actions and observations, o1, a1, o2, a2, o3. The TD network consists of a set of nodes, each representing a single scalar prediction, interlinked by the question and answer networks as suggested previously. For a network 1 n of n nodes, the vector of all predictions at time step t is denoted yt = (yt, .. ., yt )T. The predictions are estimates of the expected value of some scalar quantity, typically of a bit, in which case they can be interpreted as estimates of probabilities. The predictions are updated at each time step according to a vector-valued function u with modifiable parameter W, which is often taken to be of a linear form: yt = u(yt-1, at-1, ot, Wt ) = (Wt xt ), (1) where xt m is an m-vector of features created from (yt-1, at-1, ot ), Wt is an n m matrix (whose elements are sometimes referred to as weights), and is the n-vector form of either the identity function or the S-shaped logistic function (s) = 1+1 -s. The e feature vector is an arbitrary vector-valued function of yt-1, at-1, and ot. For example, in the simplest case the feature vector is a unit basis vector with the location of the one communicating the current state. In a partially observable environment, the feature vector may be a combination of the agent's action, observations, and predictions from the previous time step. The overall update u defines the answer network. The question network consists of a set of target functions, z i: O n, and condition i y functions, ci: A n [0, 1]n. We define zt = z i (ot+1, ~t+1 ) as the target for prediction i2 i i yt. Similarly, we define ct = c (at, yt ) as the condition at time t. The learning algorithm i for each component wtj of Wt can then be written i zi c yt ij i i wt+1 = wtj + t - yt i, (2) t i wt j where is a positive step-size parameter. Note that the targets here are functions of the observation and predictions exactly one time step later, and that the conditions are functions of a single primitive action. This is what makes this algorithm suitable only for learning about one-step TD relationships. By chaining together multiple nodes, Sutton and Tanner (2005) used it to predict k steps ahead, for various particular values of k, and to predict the outcome of specific action sequences (as in PSRs, e. g. , Littman et al. , 2002; Singh et al. , 2004). Now we consider the extension to temporally abstract actions. 3 Option-extended TD networks In this section we present our intra-option learning algorithm for TD networks with options and eligibility traces. As suggested earlier, each node's outgoing link in the question The quantity ~ is almost the same as y, and we encourage the reader to think of them as identical y here. The difference is that ~ is calculated by weights that are one step out of date as compared to y, y i. e. , ~t = u(yt-1, at-1, ot, Wt-1 ) (cf. equation 1). y 2 network will now correspond to an option applying over possibly many steps. The policy of the ith node's option corresponds to the condition function ci, which we think of as a recognizer for the option. It inspects each action taken to assess whether the option is being followed: ci = 1 if the agent is acting consistently with the option policy and ci = 0 othert t wise (intermediate values are also possible). When an agent ceases to act consistently with the option policy, we say that the option has diverged. The possibility of recognizing more than one action as consistent with the option is a significant generalization of the original idea of options. If no actions are recognized as acceptable in a state, then the option cannot be followed and thus cannot be initiated. Here we take the set of states with at least one recognized action to be the initiation set of the option. The option-termination function generalizes naturally to TD networks. Each node i is i given a corresponding termination function, i: O n [0, 1], where t = i (ot+1, yt ) i is the probability of terminating at time t. 3 t = 1 indicates that the option has terminated i at time t; t = 0 indicates that it has not, and intermediate values of correspond to soft i or stochastic termination conditions. If an option terminates, then zt acts as the target, but if the option is ongoing without termination, then the node's own next value, yt+1, should ~i be the target. The termination function specifies which of the two targets (or mixture of the two targets) is used to produce a form of TD error for each node i: i ii ii i t = t zt + (1 - t )yt+1 - yt. ~ (3) Our option-extended algorithm incorporates eligibility traces (see Sutton & Barto, 1998) as short-term memory variables organized in an n m matrix E, paralleling the weight matrix. The traces are a record of the effect that each weight could have had on each node's prediction during the time the agent has been acting consistently with the node's option. The components eij of the eligibility matrix are updated by, i yt ij ij i i (4) et = ct et-1 (1 - t ) + i wt j where 0 1 is the trace-decay parameter familiar from the TD() learning algorithm. Because of the ci factor, all of a node's traces will be immediately reset to zero whenever t the agent deviates from the node's option's policy. If the agent follows the policy and the option does not terminate, then the trace decays by and increments by the gradient in the way typical of eligibility traces. If the policy is followed and the option does terminate, then the trace will be reset to zero on the immediately following time step, and a new trace will start building. Finally, our algorithm updates the weights on each time step by ij i i wt+1 = wtj + t eij. t

NeurIPS Conference 2004 Conference Paper

Temporal-Difference Networks

Richard Sutton
Brian Tanner

We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions. Rather than relating a single pre- diction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predic- tions in the set at a later time. TD networks can represent and apply TD learning to a much wider class of predictions than has previously been possible. Using a random-walk example, we show that these networks can be used to learn to predict by a fixed interval, which is not possi- ble with conventional TD methods. Secondly, we show that if the inter- predictive relationships are made conditional on action, then the usual learning-efficiency advantage of TD methods over Monte Carlo (super- vised learning) methods becomes particularly pronounced. Thirdly, we demonstrate that TD networks can learn predictive state representations that enable exact solution of a non-Markov problem. A very broad range of inter-predictive temporal relationships can be expressed in these net- works. Overall we argue that TD networks represent a substantial ex- tension of the abilities of TD methods and bring us closer to the goal of representing world knowledge in entirely predictive, grounded terms. Temporal-difference (TD) learning is widely used in reinforcement learning methods to learn moment-to-moment predictions of total future reward (value functions). In this set- ting, TD learning is often simpler and more data-efficient than other methods. But the idea of TD learning can be used more generally than it is in reinforcement learning. TD learn- ing is a general method for learning predictions whenever multiple predictions are made of the same event over time, value functions being just one example. The most pertinent of the more general uses of TD learning have been in learning models of an environment or task domain (Dayan, 1993; Kaelbling, 1993; Sutton, 1995; Sutton, Precup & Singh, 1999). In these works, TD learning is used to predict future values of many observations or state variables of a dynamical system. The essential idea of TD learning can be described as "learning a guess from a guess". In all previous work, the two guesses involved were predictions of the same quantity at two points in time, for example, of the discounted future reward at successive time steps. In this paper we explore a few of the possibilities that open up when the second guess is allowed to be different from the first. To be more precise, we must make a distinction between the extensive definition of a predic- tion, expressing its desired relationship to measurable data, and its TD definition, express- ing its desired relationship to other predictions. In reinforcement learning, for example, state values are extensively defined as an expectation of the discounted sum of future re- wards, while they are TD defined as the solution to the Bellman equation (a relationship to the expectation of the value of successor states, plus the immediate reward). It's the same prediction, just defined or expressed in different ways. In past work with TD methods, the TD relationship was always between predictions with identical or very similar extensive semantics. In this paper we retain the TD idea of learning predictions based on others, but allow the predictions to have different extensive semantics. 1 The Learning-to-predict Problem The problem we consider in this paper is a general one of learning to predict aspects of the interaction between a decision making agent and its environment. At each of a series of discrete time steps t, the environment generates an observation ot O, and the agent takes an action at A. Whereas A is an arbitrary discrete set, we assume without loss of gener- ality that ot can be represented as a vector of bits. The action and observation events occur in sequence, o1, a1, o2, a2, o3, with each event of course dependent only on those pre- ceding it. This sequence will be called experience. We are interested in predicting not just each next observation but more general, action-conditional functions of future experience, as discussed in the next section. In this paper we use a random-walk problem with seven states, with left and right actions available in every state: 1 0 0 0 0 0 1 1 2 3 4 5 6 7 The observation upon arriving in a state consists of a special bit that is 1 only at the two ends of the walk and, in the first two of our three experiments, seven additional bits explicitly indicating the state number (only one of them is 1). This is a continuing task: reaching an end state does not end or interrupt experience. Although the sequence depends determinis- tically on action, we assume that the actions are selected randomly with equal probability so that the overall system can be viewed as a Markov chain. The TD networks introduced in this paper can represent a wide variety of predictions, far more than can be represented by a conventional TD predictor. In this paper we take just a few steps toward more general predictions. In particular, we consider variations of the problem of prediction by a fixed interval. This is one of the simplest cases that cannot otherwise be handled by TD methods. For the seven-state random walk, we will predict the special observation bit some numbers of discrete steps in advance, first unconditionally and then conditioned on action sequences.

NeurIPS Conference 2001 Conference Paper

Predictive Representations of State

Michael Littman
Richard Sutton

We show that states of a dynamical system can be usefully repre(cid: 173) sented by multi-step, action-conditional predictions of future ob(cid: 173) servations. State representations that are grounded in data in this way may be easier to learn, generalize better, and be less depen(cid: 173) dent on accurate prior models than, for example, POMDP state representations. Building on prior work by Jaeger and by Rivest and Schapire, in this paper we compare and contrast a linear spe(cid: 173) cialization of the predictive approach with the state representa(cid: 173) tions used in POMDPs and in k-order Markov models. Ours is the first specific formulation of the predictive idea that includes both stochasticity and actions (controls). We show that any system has a linear predictive state representation with number of predictions no greater than the number of states in its minimal POMDP model. In predicting or controlling a sequence of observations, the concepts of state and state estimation inevitably arise. There have been two dominant approaches. The generative-model approach, typified by research on partially observable Markov de(cid: 173) cision processes (POMDPs), hypothesizes a structure for generating observations and estimates its state and state dynamics. The history-based approach, typified by k-order Markov methods, uses simple functions of past observations as state, that is, as the immediate basis for prediction and control. (The data flow in these two ap(cid: 173) proaches are diagrammed in Figure 1. ) Of the two, the generative-model approach is more general. The model's internal state gives it temporally unlimited memory(cid: 173) the ability to remember an event that happened arbitrarily long ago--whereas a history-based approach can only remember as far back as its history extends. The bane of generative-model approaches is that they are often strongly dependent on a good model of the system's dynamics. Most uses of POMDPs, for example, assume a perfect dynamics model and attempt only to estimate state. There are algorithms for simultaneously estimating state and dynamics (e. g. , Chrisman, 1992), analogous to the Baum-Welch algorithm for the uncontrolled case (Baum et al. , 1970), but these are only effective at tuning parameters that are already approximately cor(cid: 173) rect (e. g. , Shatkay & Kaelbling, 1997). observations (and actions)

NeurIPS Conference 1999 Conference Paper

Policy Gradient Methods for Reinforcement Learning with Function Approximation

Richard Sutton
David McAllester
Satinder Singh
Yishay Mansour

Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and deter(cid: 173) mining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, indepen(cid: 173) dent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams's REINFORCE method and actor-critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy. Large applications of reinforcement learning (RL) require the use of generalizing func(cid: 173) tion approximators such neural networks, decision-trees, or instance-based methods. The dominant approach for the last decade has been the value-function approach, in which all function approximation effort goes into estimating a value function, with the action-selection policy represented implicitly as the "greedy" policy with respect to the estimated values (e. g. , as the policy that selects in each state the action with highest estimated value). The value-function approach has worked well in many appli(cid: 173) cations, but has several limitations. First, it is oriented toward finding deterministic policies, whereas the optimal policy is often stochastic, selecting different actions with specific probabilities (e. g. , see Singh, Jaakkola, and Jordan, 1994). Second, an arbi(cid: 173) trarily small change in the estimated value of an action can cause it to be, or not be, selected. Such discontinuous changes have been identified as a key obstacle to estab(cid: 173) lishing convergence assurances for algorithms following the value-function approach (Bertsekas and Tsitsiklis, 1996). For example, Q-Iearning, Sarsa, and dynamic pro(cid: 173) gramming methods have all been shown unable to converge to any policy for simple MDPs and simple function approximators (Gordon, 1995, 1996; Baird, 1995; Tsit(cid: 173) siklis and van Roy, 1996; Bertsekas and Tsitsiklis, 1996). This can occur even if the best approximation is found at each step before changing the policy, and whether the notion of "best" is in the mean-squared-error sense or the slightly different senses of residual-gradient, temporal-difference, and dynamic-programming methods. In this paper we explore an alternative approach to function approximation in RL.

NeurIPS Conference 1998 Conference Paper

Improved Switching among Temporally Abstract Actions

Richard Sutton
Satinder Singh
Doina Precup
Balaraman Ravindran

In robotics and other control applications it is commonplace to have a pre(cid: 173) existing set of controllers for solving subtasks, perhaps hand-crafted or previously learned or planned, and still face a difficult problem of how to choose and switch among the controllers to solve an overall task as well as possible. In this paper we present a framework based on Markov decision processes and semi-Markov decision processes for phrasing this problem, a basic theorem regarding the improvement in performance that can be ob(cid: 173) tained by switching flexibly between given controllers, and example appli(cid: 173) cations of the theorem. In particular, we show how an agent can plan with these high-level controllers and then use the results of such planning to find an even better plan, by modifying the existing controllers, with negligible additional cost and no re-planning. In one of our examples, the complexity of the problem is reduced from 24 billion state-action pairs to less than a million state-controller pairs. In many applications, solutions to parts of a task are known, either because they were hand(cid: 173) crafted by people or because they were previously learned or planned. For example, in robotics applications, there may exist controllers for moving joints to positions, picking up objects, controlling eye movements, or navigating along hallways. More generally, an intelli(cid: 173) gent system may have available to it several temporally extended courses of action to choose from. In such cases, a key challenge is to take full advantage of the existing temporally ex(cid: 173) tended actions, to choose or switch among them effectively, and to plan at their level rather than at the level of individual actions. Recently, several researchers have begun to address these challenges within the framework of reinforcement learning and Markov decision processes (e. g. , Singh, 1992; Kaelbling, 1993; Dayan & Hinton, 1993; Thrun and Schwartz, 1995; Sutton, 1995; Dietterich, 1998; Parr & Russell, 1998; McGovern, Sutton & Fagg, 1997). Common to much of this recent work is the modeling of a temporally extended action as a policy (controller) and a condition for terminating, which we together refer to as an option (Sutton, Precup & Singh, 1998). In this paper we consider the problem of effectively combining given options into one overall policy, generalizing prior work by Kaelbling (1993). Sections 1-3 introduce the framework; our new results are in Sections 4 and 5. Improved Switching among Temporally Abstract Actions

NeurIPS Conference 1998 Conference Paper

Learning Instance-Independent Value Functions to Enhance Local Search

Robert Moll
Andrew Barto
Theodore Perkins
Richard Sutton

Reinforcement learning methods can be used to improve the performance of local search algorithms for combinatorial optimization by learning an evaluation function that predicts the outcome of search. The eval(cid: 173) uation function is therefore able to guide search to low-cost solutions better than can the original cost function. We describe a reinforcement learning method for enhancing local search that combines aspects of pre(cid: 173) vious work by Zhang and Dietterich (1995) and Boyan and Moore (1997, Boyan 1998). In an off-line learning phase, a value function is learned that is useful for guiding search for multiple problem sizes and instances. We illustrate our technique by developing several such functions for the Dial-A-Ride Problem. Our learning-enhanced local search algorithm ex(cid: 173) hibits an improvement of more then 30% over a standard local search algorithm.

NeurIPS Conference 1997 Conference Paper

Multi-time Models for Temporally Abstract Planning

Doina Precup
Richard Sutton

Planning Doina Precup, Richard S. Sutton University of Massachusetts Amherst, MA 01003 {dprecuplrich}@cs. umass. edu

NeurIPS Conference 1995 Conference Paper

Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding

Richard Sutton

On large problems, reinforcement learning systems must use parame(cid: 173) terized function approximators such as neural networks in order to gen(cid: 173) eralize between similar situations and actions. In these cases there are no strong theoretical results on the accuracy of convergence, and com(cid: 173) putational results have been mixed. In particular, Boyan and Moore reported at last year's meeting a series of negative results in attempting to apply dynamic programming together with function approximation to simple control problems with continuous state spaces. In this paper, we present positive results for all the control tasks they attempted, and for one that is significantly larger. The most important differences are that we used sparse-coarse-coded function approximators (CMACs) whereas they used mostly global function approximators, and that we learned online whereas they learned offline. Boyan and Moore and others have suggested that the problems they encountered could be solved by using actual outcomes ("rollouts"), as in classical Monte Carlo methods, and as in the TD(). ) algorithm when). = 1. However, in our experiments this always resulted in substantially poorer perfor(cid: 173) mance. We conclude that reinforcement learning can work robustly in conjunction with function approximators, and that there is little justification at present for avoiding the case of general ). . 1 Reinforcement Learning and Function Approximation Reinforcement learning is a broad class of optimal control methods based on estimating value functions from experience, simulation, or search (Barto, Bradtke &; Singh, 1995; Sutton, 1988; Watkins, 1989). Many of these methods, e. g. , dynamic programming and temporal-difference learning, build their estimates in part on the basis of other Generalization in Reinforcement Learning

NeurIPS Conference 1991 Conference Paper

Iterative Construction of Sparse Polynomial Approximations

Terence Sanger
Richard Sutton
Christopher Matheus

Christopher J. Matheus

NeurIPS Conference 1990 Conference Paper

Integrated Modeling and Control Based on Reinforcement Learning and Dynamic Programming

Richard Sutton

This is a summary of results with Dyna, a class of architectures for intel(cid: 173) ligent systems based on approximating dynamic programming methods. Dyna architectures integrate trial-and-error (reinforcement) learning and execution-time planning into a single process operating alternately on the world and on a learned forward model of the world. We describe and show results for two Dyna architectures, Dyna-AHC and Dyna-Q. Using a navigation task, results are shown for a simple Dyna-AHC system which simultaneously learns by trial and error, learns a world model, and plans optimal routes using the evolving world model. We show that Dyna-Q architectures (based on Watkins's Q-Iearning) are easy to adapt for use in changing environments. 1 Introduction to Dyna Dyna architectures (Sutton, 1990) use learning algorithms to approximate the con(cid: 173) ventional optimal control technique known as dynamic programming (DP) (Bell(cid: 173) man, 1957; Bertsekas, 1987). DP itself is not a learning method, but rather a computational method for determining optimal behavior given a complete model of the task to be solved. It is very similar to state-space search, but differs in that it is more incremental and never considers actual action sequences explicitly, only single actions at a time. This makes DP more amenable to incremental planning at execution time, and also makes it more suitable for stochastic or incompletely modeled environments, as it need not consider the extremely large number of se(cid: 173) quences possible in an uncertain environment. Learned world models are likely to be stochastic and uncertain, making DP approaches particularly promising for