Arrow Research search

Author name cluster

Jonathan Baxter

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

11 papers
2 author rows

Possible papers

11

JMLR Journal 2004 Journal Article

Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

  • Evan Greensmith
  • Peter L. Bartlett
  • Jonathan Baxter

Policy gradient methods for reinforcement learning avoid some of the undesirable properties of the value function approaches, such as policy degradation (Baxter and Bartlett, 2001). However, the variance of the performance gradient estimates obtained from the simulation is sometimes excessive. In this paper, we consider variance reduction methods that were developed for Monte Carlo estimates of integrals. We study two commonly used policy gradient techniques, the baseline and actor-critic methods, from this perspective. Both can be interpreted as additive control variate variance reduction methods. We consider the expected average reward performance measure, and we focus on the GPOMDP algorithm for estimating performance gradients in partially observable Markov decision processes controlled by stochastic reactive policies. We give bounds for the estimation error of the gradient estimates for both baseline and actor-critic algorithms, in terms of the sample size and mixing properties of the controlled system. For the baseline technique, we compute the optimal baseline, and show that the popular approach of using the average reward to define the baseline can be suboptimal. For actor-critic algorithms, we show that using the true value function as the critic can be suboptimal. We also discuss algorithms for estimating the optimal baseline and approximate value function. [abs] [ pdf ]

NeurIPS Conference 2001 Conference Paper

Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

  • Evan Greensmith
  • Peter Bartlett
  • Jonathan Baxter

We consider the use of two additive control variate methods to reduce the variance of performance gradient estimates in reinforcement learn- ing problems. The first approach we consider is the baseline method, in which a function of the current state is added to the discounted value estimate. We relate the performance of these methods, which use sam- ple paths, to the variance of estimates based on iid data. We derive the baseline function that minimizes this variance, and we show that the vari- ance for any baseline is the sum of the optimal variance and a weighted squared distance to the optimal baseline. We show that the widely used average discounted value baseline (where the reward is replaced by the difference between the reward and its expectation) is suboptimal. The second approach we consider is the actor-critic method, which uses an approximate value function. We give bounds on the expected squared error of its estimates. We show that minimizing distance to the true value function is suboptimal in general; we provide an example for which the true value function gives an estimate with positive variance, but the op- timal value function gives an unbiased estimate with zero variance. Our bounds suggest algorithms to estimate the gradient of the performance of parameterized baseline or value functions. We present preliminary exper- iments that illustrate the performance improvements on a simple control problem. 1 Introduction, Background, and Preliminary Results In reinforcement learning problems, the aim is to select a controller that will maximize the average reward in some environment. We model the environment as a partially ob- servable Markov decision process (POMDP). Gradient ascent methods (e. g. , [7, 12, 15]) estimate the gradient of the average reward, usually using Monte Carlo techniques to cal- (cid: 3)Most of this work was performed while the authors were with the Research School of Information Sciences and Engineering at the Australian National University. culate an average over a sample path of the controlled POMDP. However such estimates tend to have a high variance, which means many steps are needed to obtain a good esti- mate. GPOMDP [4] is an algorithm for generating an estimate of the gradient in this way. Compared with other approaches, it is suitable for large systems, when the time between visits to a state is large but the mixing time of the controlled POMDP is short. However, it can suffer from the problem of producing high variance estimates. In this paper, we investi- gate techniques for variance reduction in GPOMDP. One generic approach to reducing the variance of Monte Carlo estimates of integrals is to use an additive control variate (see, for example, [6]). Suppose we wish to estimate the integral of f: X! R, and we know the integral of another function ’: X! R. Since RX f = RX (f (cid: 0) ’) +RX ’, the integral of f (cid: 0) ’ can be estimated instead. Obviously if ’ = f then the variance is zero. More generally, Var(f (cid: 0) ’) = Var(f ) (cid: 0) 2Cov(f; ’) + Var(’), so that if (cid: 30) and f are strongly correlated, the variance of the estimate is reduced. In this paper, we consider two approaches of this form. The first (Section 2) is the technique of adding a baseline. We find the optimal baseline and we show that the additional variance of a suboptimal baseline can be expressed as a weighted squared distance from the optimal baseline. Constant baselines, which do not depend on the state or observations, have been widely used [13, 15, 9, 11]. In particular, the expectation over all states of the discounted value of the state is a popular constant baseline (where, for example, the reward at each step is replaced by the difference between the reward and the expected reward). We give bounds on the estimation variance that show that, perhaps surprisingly, this may not be the best choice. The second approach (Section 3) is the use of an approximate value function. Such actor- critic methods have been investigated extensively [3, 1, 14, 10]. Generally the idea is to minimize some notion of distance between the fixed value function and the true value function. In this paper we show that this may not be the best approach: selecting the fixed value function to be equal to the true value function is not always the best choice. Even more surprisingly, we give an example for which the use of a fixed value function that is different from the true value function reduces the variance to zero, for no increase in bias. We give a bound on the expected squared error (that is, including the estimation variance) of the gradient estimate produced with a fixed value function. Our results suggest new algorithms to learn the optimum baseline, and to learn a fixed value function that minimizes the bound on the error of the estimate. In Section 5, we describe the results of preliminary experiments, which show that these algorithms give performance improvements. POMDP with Reactive, Parameterized Policy A partially observable Markov decision process (POMDP) consists of a state space, S, a control space, U, an observation space, Y, a set of transition probability matrices fP(u): u 2 Ug, each with components pij (u) for i; j 2 S; u 2 U, an observation pro- cess (cid: 23): S! PY, where PY is the space of probability distributions over Y, and a reward function r: S! R. We assume that S; U; Y are finite, although all our re- sults extend easily to infinite U and Y, and with more restrictive assumptions can be extended to infinite S. A reactive, parameterized policy for a POMDP is a set of map- pings f(cid: 22)((cid: 1); (cid: 18)): Y! PU j(cid: 18) 2 RKg. Together with the POMDP, this defines the con- trolled POMDP (S; U; Y; P; (cid: 23); r; (cid: 22)). The joint state, observation and control process, fXt; Yt; Utg, is Markov. The state process, fXtg, is also Markov, with transition prob- abilities pij ((cid: 18)) = Py2Y; u2U (cid: 23)y(i)(cid: 22)u(y; (cid: 18))pij (u), where (cid: 23)y(i) denotes the probability of observation y given the state i, and (cid: 22)u(y; (cid: 18)) denotes the probability of action u given pa- rameters (cid: 18) and observation y. The Markov chain M((cid: 18)) = (S; P((cid: 18))) then describes the behaviour of the process fXtg. Assumption 1 The controlled POMDP (S; U; Y; P; (cid: 23); r; (cid: 22)) satisfies: For all (cid: 18) 2 RK there exists a unique stationary distribution satisfying (cid: 25) 0((cid: 18)) P((cid: 18)) = (cid: 25)0((cid: 18)). There is an R < 1 such that, for all i 2 S, jr(i)j (cid: 20) R. There is a B < 1 such that, for all u 2 U, y 2 Y and (cid: 18) 2 RK the deriva- tives @(cid: 22)u(y; (cid: 18))=@(cid: 18)k (1 (cid: 20) k (cid: 20) K) exist, and the vector of these derivatives satisfies kr(cid: 22)u(y; (cid: 18))=(cid: 22)u(y; (cid: 18))k (cid: 20) B, where k (cid: 1) k denotes the Euclidean norm on RK. implies that this limit exists, and does not depend on the start state X0. The aim is to select a policy to maximize this quantity. Define the discounted value function, J(cid: 12)(i; (cid: 18)) def= We consider the average reward, (cid: 17)((cid: 18)) def= limT! 1 Eh 1 t=0 r(Xt)i. Assumption 1 X0 = i i. Throughout the rest of the paper, dependences limT! 1 EhPT (cid: 0)1 Var(A) = Eh(A (cid: 0) E [A])2i, where a2 denotes a0a, and a0 denotes the transpose of the upon (cid: 18) are assumed, and dropped in the notation. For a random vector A, we denote t=0 (cid: 12)tr(Xt)(cid: 12)(cid: 12)(cid: 12)

NeurIPS Conference 1999 Conference Paper

Boosting Algorithms as Gradient Descent

  • Llew Mason
  • Jonathan Baxter
  • Peter Bartlett
  • Marcus Frean

We provide an abstract characterization of boosting algorithms as gradient decsent on cost-functionals in an inner-product function space. We prove convergence of these functional-gradient-descent algorithms under quite weak conditions. Following previous theo(cid: 173) retical results bounding the generalization performance of convex combinations of classifiers in terms of general cost functions of the margin, we present a new algorithm (DOOM II) for performing a gradient descent optimization of such cost functions. Experiments on several data sets from the UC Irvine repository demonstrate that DOOM II generally outperforms AdaBoost, especially in high noise situations, and that the overfitting behaviour of AdaBoost is predicted by our cost functions.

NeurIPS Conference 1998 Conference Paper

Direct Optimization of Margins Improves Generalization in Combined Classifiers

  • Llew Mason
  • Peter Bartlett
  • Jonathan Baxter

Cumulative training margin dis(cid: 173) tributions for AdaBoost versus our "Direct Optimization Of Margins" (DOOM) algorithm. The dark curve is AdaBoost, the light curve is DOOM. DOOM sacrifices significant training er(cid: 173) ror for improved test error (hori(cid: 173) zontal marks on margin= 0 line)_ -1 -0. 8 -0. 6 -0. 4 -0. 2 0 0. 2 0. 4 0. 6 0. 8

NeurIPS Conference 1997 Conference Paper

The Canonical Distortion Measure in Feature Space and 1-NN Classification

  • Jonathan Baxter
  • Peter Bartlett

We prove that the Canonical Distortion Measure (CDM) [2, 3] is the optimal distance measure to use for I nearest-neighbour (l-NN) classifi(cid: 173) cation, and show that it reduces to squared Euclidean distance in feature space for function classes that can be expressed as linear combinations of a fixed set of features. PAC-like bounds are given on the sample(cid: 173) complexity required to learn the CDM. An experiment is presented in which a neural network CDM was learnt for a Japanese OCR environ(cid: 173) ment and then used to do I-NN classification.

NeurIPS Conference 1995 Conference Paper

Learning Model Bias

  • Jonathan Baxter

In this paper the problem of learning appropriate domain-specific bias is addressed. It is shown that this can be achieved by learning many related tasks from the same domain, and a theorem is given bounding the number tasks that must be learnt. A corollary of the theorem is that if the tasks are known to possess a common inter(cid: 173) nal representation or preprocessing then the number of examples required per task for good generalisation when learning n tasks si(cid: 173) multaneously scales like O(a + ~), where O(a) is a bound on the minimum number of examples requred to learn a single task, and O( a + b) is a bound on the number of examples required to learn each task independently. An experiment providing strong qualita(cid: 173) tive support for the theoretical results is reported.