Arrow Research search

Author name cluster

Geoffrey Gordon

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

30 papers
1 author row

Possible papers

30

AAAI Conference 2020 Conference Paper

Learning General Latent-Variable Graphical Models with Predictive Belief Propagation

  • Borui Wang
  • Geoffrey Gordon

Learning general latent-variable probabilistic graphical models is a key theoretical challenge in machine learning and artificial intelligence. All previous methods, including the EM algorithm and the spectral algorithms, face severe limitations that largely restrict their applicability and affect their performance. In order to overcome these limitations, in this paper we introduce a novel formulation of message-passing inference over junction trees named predictive belief propagation, and propose a new learning and inference algorithm for general latent-variable graphical models based on this formulation. Our proposed algorithm reduces the hard parameter learning problem into a sequence of supervised learning problems, and unifies the learning of different kinds of latent graphical models into a single learning framework, which is local-optima-free and statistically consistent. We then give a proof of the correctness of our algorithm and show in experiments on both synthetic and real datasets that our algorithm significantly outperforms both the EM algorithm and the spectral algorithm while also being orders of magnitude faster to compute.

NeurIPS Conference 2019 Conference Paper

Learning Neural Networks with Adaptive Regularization

  • Han Zhao
  • Yao-Hung Hubert Tsai
  • Russ Salakhutdinov
  • Geoffrey Gordon

Feed-forward neural networks can be understood as a combination of an intermediate representation and a linear hypothesis. While most previous works aim to diversify the representations, we explore the complementary direction by performing an adaptive and data-dependent regularization motivated by the empirical Bayes method. Specifically, we propose to construct a matrix-variate normal prior (on weights) whose covariance matrix has a Kronecker product structure. This structure is designed to capture the correlations in neurons through backpropagation. Under the assumption of this Kronecker factorization, the prior encourages neurons to borrow statistical strength from one another. Hence, it leads to an adaptive and data-dependent regularization when training networks on small datasets. To optimize the model, we present an efficient block coordinate descent algorithm with analytical solutions. Empirically, we demonstrate that the proposed method helps networks converge to local optima with smaller stable ranks and spectral norms. These properties suggest better generalizations and we present empirical results to support this expectation. We also verify the effectiveness of the approach on multiclass classification and multitask regression problems with various network structures. Our code is publicly available at https://github.com/yaohungt/Adaptive-Regularization-Neural-Network.
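
A minimal sketch of the kind of matrix-variate normal (Kronecker-structured) penalty the abstract describes, assuming fixed row and column covariances S and T; in the paper these covariances are adapted by empirical Bayes, so the fixed-covariance setup and function name here are illustrative only.

```python
import numpy as np

def matrix_normal_penalty(W, S, T):
    # Negative log-density (up to constants) of a matrix-variate normal prior
    # MN(0, S, T) on an m-by-n weight matrix W, i.e. vec(W) ~ N(0, T kron S):
    #   0.5 * tr(S^{-1} W T^{-1} W^T) + 0.5 * (n * log|S| + m * log|T|)
    m, n = W.shape
    Sinv, Tinv = np.linalg.inv(S), np.linalg.inv(T)
    quad = 0.5 * np.trace(Sinv @ W @ Tinv @ W.T)
    logdet = 0.5 * (n * np.linalg.slogdet(S)[1] + m * np.linalg.slogdet(T)[1])
    return quad + logdet
```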

NeurIPS Conference 2019 Conference Paper

Towards modular and programmable architecture search

  • Renato Negrinho
  • Matthew Gormley
  • Geoffrey Gordon
  • Darshan Patil
  • Nghia Le
  • Daniel Ferreira

Neural architecture search methods are able to find high performance deep learning architectures with minimal effort from an expert. However, current systems focus on specific use-cases (e.g., convolutional image classifiers and recurrent language models), making them unsuitable for general use-cases that an expert might wish to write. Hyperparameter optimization systems are general-purpose but lack the constructs needed for easy application to architecture search. In this work, we propose a formal language for encoding search spaces over general computational graphs. The language constructs allow us to write modular, composable, and reusable search space encodings and to reason about search space design. We use our language to encode search spaces from the architecture search literature. The language allows us to decouple the implementations of the search space and the search algorithm, allowing us to expose search spaces to search algorithms through a consistent interface. Our experiments show the ease with which we can experiment with different combinations of search spaces and search algorithms without having to implement each combination from scratch. We release an implementation of our language with this paper.

NeurIPS Conference 2018 Conference Paper

Adversarial Multiple Source Domain Adaptation

  • Han Zhao
  • Shanghang Zhang
  • Guanhang Wu
  • José M. F. Moura
  • Joao Costeira
  • Geoffrey Gordon

While domain adaptation has been actively researched, most algorithms focus on the single-source-single-target adaptation setting. In this paper we propose new generalization bounds and algorithms under both classification and regression settings for unsupervised multiple source domain adaptation. Our theoretical analysis naturally leads to an efficient learning strategy using adversarial neural networks: we show how to interpret it as learning feature representations that are invariant to the multiple domain shifts while still being discriminative for the learning task. To this end, we propose multisource domain adversarial networks (MDAN) that approach domain adaptation by optimizing task-adaptive generalization bounds. To demonstrate the effectiveness of MDAN, we conduct extensive experiments showing superior adaptation performance on both classification and regression problems: sentiment analysis, digit classification, and vehicle counting.
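
As a loose illustration of combining per-source losses, here is a hedged sketch of a soft-max (log-sum-exp) aggregation of task and adversarial domain losses across sources; the weighting constants, names, and exact form are assumptions rather than the paper's objective.

```python
import numpy as np

def smoothed_multisource_objective(task_losses, domain_losses, beta=1.0, mu=1.0):
    # Hypothetical aggregation: each source contributes (task loss + mu * domain loss),
    # and the per-source costs are combined with a soft maximum (log-sum-exp),
    # so the worst-adapted source dominates as beta grows.
    costs = np.asarray(task_losses) + mu * np.asarray(domain_losses)
    return np.log(np.sum(np.exp(beta * costs))) / beta
```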

AAAI Conference 2018 Conference Paper

An Efficient, Expressive and Local Minima-Free Method for Learning Controlled Dynamical Systems

  • Ahmed Hefny
  • Carlton Downey
  • Geoffrey Gordon

We propose a framework for modeling and estimating the state of controlled dynamical systems, where an agent can affect the system through actions and receives partial observations. Based on this framework, we propose Predictive State Representation with Random Fourier Features (RFF-PSR). A key property in RFF-PSRs is that the state estimate is represented by a conditional distribution of future observations given future actions. RFF-PSRs combine this representation with moment-matching, kernel embedding and local optimization to achieve a method that enjoys several favorable qualities: It can represent controlled environments which can be affected by actions, it has an efficient and theoretically justified learning algorithm, it uses a non-parametric representation that has expressive power to represent continuous nonlinear dynamics. We provide a detailed formulation, a theoretical analysis and an experimental evaluation that demonstrates the effectiveness of our method.
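
For reference, a standard random Fourier feature map (approximating an RBF kernel) of the kind RFF-PSRs build on; parameter names and defaults below are illustrative.

```python
import numpy as np

def rff_features(X, n_features=200, gamma=1.0, seed=0):
    # Random Fourier features approximating k(x, y) = exp(-gamma * ||x - y||^2):
    # z(x) = sqrt(2/D) * cos(W^T x + b), with W ~ N(0, 2*gamma*I), b ~ U[0, 2pi].
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)
```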

NeurIPS Conference 2018 Conference Paper

Dual Policy Iteration

  • Wen Sun
  • Geoffrey Gordon
  • Byron Boots
  • J. Bagnell

Recently, a novel class of Approximate Policy Iteration (API) algorithms have demonstrated impressive practical performance (e.g., ExIt from [1], AlphaGo-Zero from [2]). This new family of algorithms maintains, and alternately optimizes, two policies: a fast, reactive policy (e.g., a deep neural network) deployed at test time, and a slow, non-reactive policy (e.g., Tree Search), that can plan multiple steps ahead. The reactive policy is updated under supervision from the non-reactive policy, while the non-reactive policy is improved with guidance from the reactive policy. In this work we study this Dual Policy Iteration (DPI) strategy in an alternating optimization framework and provide a convergence analysis that extends existing API theory. We also develop a special instance of this framework which reduces the update of non-reactive policies to model-based optimal control using learned local models, and provides a theoretically sound way of unifying model-free and model-based RL approaches with unknown dynamics. We demonstrate the efficacy of our approach on various continuous control Markov Decision Processes.

NeurIPS Conference 2018 Conference Paper

Learning Beam Search Policies via Imitation Learning

  • Renato Negrinho
  • Matthew Gormley
  • Geoffrey Gordon

Beam search is widely used for approximate decoding in structured prediction problems. Models often use a beam at test time but ignore its existence at train time, and therefore do not explicitly learn how to use the beam. We develop a unifying meta-algorithm for learning beam search policies using imitation learning. In our setting, the beam is part of the model and not just an artifact of approximate decoding. Our meta-algorithm captures existing learning algorithms and suggests new ones. It also lets us show novel no-regret guarantees for learning beam search policies.
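
For readers unfamiliar with beam decoding, a generic sketch of the beam search loop follows; the `expand` scoring function and all names are hypothetical, and the paper's contribution concerns how such scores are learned, not the loop itself.

```python
from typing import Callable, List, Tuple

def beam_search(expand: Callable[[List[str]], List[Tuple[str, float]]],
                start: str, steps: int, beam_width: int) -> List[Tuple[float, List[str]]]:
    # expand(prefix) -> list of (next_token, score_increment) candidates
    beam = [(0.0, [start])]
    for _ in range(steps):
        candidates = []
        for total, seq in beam:
            for token, inc in expand(seq):
                candidates.append((total + inc, seq + [token]))
        if not candidates:
            break
        # Keep only the beam_width highest-scoring partial sequences.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beam
```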

NeurIPS Conference 2017 Conference Paper

Linear Time Computation of Moments in Sum-Product Networks

  • Han Zhao
  • Geoffrey Gordon

Bayesian online algorithms for Sum-Product Networks (SPNs) need to update their posterior distribution after seeing one single additional instance. To do so, they must compute moments of the model parameters under this distribution. The best existing method for computing such moments scales quadratically in the size of the SPN, although it scales linearly for trees. This unfortunate scaling makes Bayesian online algorithms prohibitively expensive, except for small or tree-structured SPNs. We propose an optimal linear-time algorithm that works even when the SPN is a general directed acyclic graph (DAG), which significantly broadens the applicability of Bayesian online algorithms for SPNs. There are three key ingredients in the design and analysis of our algorithm: 1) For each edge in the graph, we construct a linear time reduction from the moment computation problem to a joint inference problem in SPNs. 2) Using the property that each SPN computes a multilinear polynomial, we give an efficient procedure for polynomial evaluation by differentiation without expanding the network that may contain exponentially many monomials. 3) We propose a dynamic programming method to further reduce the computation of the moments of all the edges in the graph from quadratic to linear. We demonstrate the usefulness of our linear time algorithm by applying it to develop a linear-time assumed density filter (ADF) for SPNs.
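
A minimal sketch of the idea that an SPN's network polynomial and its partial derivatives can be evaluated in one forward and one backward sweep over the DAG, without expanding the (possibly exponentially many) monomials; the tiny SPN below is invented for illustration, and the paper's moment computation layers a dynamic program on top of such sweeps.

```python
import numpy as np

# Tiny SPN (a DAG) over binary variables X1, X2, in topological order.
# 'leaf' nodes take indicator values; 'sum' nodes hold weighted children;
# 'prod' nodes multiply their children.
nodes = {
    'x1': ('leaf', None), 'nx1': ('leaf', None),
    'x2': ('leaf', None), 'nx2': ('leaf', None),
    's1': ('sum', [('x1', 0.6), ('nx1', 0.4)]),
    's2': ('sum', [('x2', 0.3), ('nx2', 0.7)]),
    'p1': ('prod', ['s1', 's2']),
    'p2': ('prod', ['s1', 'nx2']),
    'root': ('sum', [('p1', 0.8), ('p2', 0.2)]),
}
order = ['x1', 'nx1', 'x2', 'nx2', 's1', 's2', 'p1', 'p2', 'root']

def forward(leaf_vals):
    val = dict(leaf_vals)
    for n in order:
        kind, spec = nodes[n]
        if kind == 'sum':
            val[n] = sum(w * val[c] for c, w in spec)
        elif kind == 'prod':
            val[n] = np.prod([val[c] for c in spec])
    return val

def backward(val):
    # d(root value)/d(node value) for every node, in one reverse sweep.
    d = {n: 0.0 for n in order}
    d['root'] = 1.0
    for n in reversed(order):
        kind, spec = nodes[n]
        if kind == 'sum':
            for c, w in spec:
                d[c] += d[n] * w
        elif kind == 'prod':
            for c in spec:
                d[c] += d[n] * np.prod([val[o] for o in spec if o != c])
    return d

val = forward({'x1': 1.0, 'nx1': 0.0, 'x2': 1.0, 'nx2': 0.0})
d = backward(val)
# Derivative of the network polynomial w.r.t. the weight on edge (n -> c) is d[n] * val[c],
# e.g. for the edge root -> p1:
dS_dw = d['root'] * val['p1']
```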

NeurIPS Conference 2017 Conference Paper

Predictive State Recurrent Neural Networks

  • Carlton Downey
  • Ahmed Hefny
  • Byron Boots
  • Geoffrey Gordon
  • Boyue Li

We present a new model, Predictive State Recurrent Neural Networks (PSRNNs), for filtering and prediction in dynamical systems. PSRNNs draw on insights from both Recurrent Neural Networks (RNNs) and Predictive State Representations (PSRs), and inherit advantages from both types of models. Like many successful RNN architectures, PSRNNs use (potentially deeply composed) bilinear transfer functions to combine information from multiple sources. We show that such bilinear functions arise naturally from state updates in Bayes filters like PSRs, in which observations can be viewed as gating belief states. We also show that PSRNNs can be learned effectively by combining Backpropagation Through Time (BPTT) with an initialization derived from a statistically consistent learning algorithm for PSRs called two-stage regression (2SR). Finally, we show that PSRNNs can be factorized using tensor decomposition, reducing model size and suggesting interesting connections to existing multiplicative architectures such as LSTMs and GRUs. We apply PSRNNs to 4 datasets, and show that we outperform several popular alternative approaches to modeling dynamical systems in all cases.
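
A minimal sketch of the bilinear state update the abstract refers to, with made-up dimensions; the actual PSRNN initializes the tensor by two-stage regression and its normalization may differ in detail.

```python
import numpy as np

def psrnn_step(W, o, q, eps=1e-8):
    # One bilinear predictive-state update: contract the 3-mode tensor W with the
    # current observation features o and state q, then renormalize the state.
    q_next = np.einsum('ijk,j,k->i', W, o, q)
    return q_next / (np.linalg.norm(q_next) + eps)

# Hypothetical sizes: state_dim = 5, obs_dim = 3.
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3, 5))
q = np.ones(5) / np.sqrt(5)
q = psrnn_step(W, rng.standard_normal(3), q)
```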

NeurIPS Conference 2016 Conference Paper

A Unified Approach for Learning the Parameters of Sum-Product Networks

  • Han Zhao
  • Pascal Poupart
  • Geoffrey Gordon

We present a unified approach for learning the parameters of Sum-Product networks (SPNs). We prove that any complete and decomposable SPN is equivalent to a mixture of trees where each tree corresponds to a product of univariate distributions. Based on the mixture model perspective, we characterize the objective function when learning SPNs based on the maximum likelihood estimation (MLE) principle and show that the optimization problem can be formulated as a signomial program. We construct two parameter learning algorithms for SPNs by using sequential monomial approximations (SMA) and the concave-convex procedure (CCCP), respectively. The two proposed methods naturally admit multiplicative updates, hence effectively avoiding the projection operation. With the help of the unified framework, we also show that, in the case of SPNs, CCCP leads to the same algorithm as Expectation Maximization (EM) despite the fact that they are different in general.

NeurIPS Conference 2015 Conference Paper

Supervised Learning for Dynamical System Learning

  • Ahmed Hefny
  • Carlton Downey
  • Geoffrey Gordon

Recently there has been substantial interest in spectral methods for learning dynamical systems. These methods are popular since they often offer a good tradeoff between computational and statistical efficiency. Unfortunately, they can be difficult to use and extend in practice: e.g., they can make it difficult to incorporate prior information such as sparsity or structure. To address this problem, we present a new view of dynamical system learning: we show how to learn dynamical systems by solving a sequence of ordinary supervised learning problems, thereby allowing users to incorporate prior knowledge via standard techniques such as L1 regularization. Many existing spectral methods are special cases of this new framework, using linear regression as the supervised learner. We demonstrate the effectiveness of our framework by showing examples where nonlinear regression or lasso let us learn better state representations than plain linear regression does; the correctness of these instances follows directly from our general analysis.
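
A loose sketch of the two-stage (instrumental) regression view under simplifying assumptions: linear features, ridge regression as the supervised learner, and no control inputs. Matrix names are hypothetical.

```python
import numpy as np

def ridge(X, Y, lam=1e-2):
    # Ridge regression weights mapping rows of X to rows of Y.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def two_stage_regression(H, F, Fp, lam=1e-2):
    # H  : features of the past (history), one row per time step
    # F  : features of the future
    # Fp : features of the shifted / "extended" future
    W1  = ridge(H, F, lam)              # stage 1: denoise future stats via history
    W1p = ridge(H, Fp, lam)
    W2  = ridge(H @ W1, H @ W1p, lam)   # stage 2: linear operator on predictive states
    return W1, W2
```

Swapping `ridge` for lasso or a nonlinear regressor is what the abstract means by incorporating prior knowledge through the choice of supervised learner.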

AAAI Conference 2011 Conference Paper

An Online Spectral Learning Algorithm for Partially Observable Nonlinear Dynamical Systems

  • Byron Boots
  • Geoffrey Gordon

Recently, a number of researchers have proposed spectral algorithms for learning models of dynamical systems—for example, Hidden Markov Models (HMMs), Partially Observable Markov Decision Processes (POMDPs), and Transformed Predictive State Representations (TPSRs). These algorithms are attractive since they are statistically consistent and not subject to local optima. However, they are batch methods: they need to store their entire training data set in memory at once and operate on it as a large matrix, and so they cannot scale to extremely large data sets (either many examples or many features per example). In turn, this restriction limits their ability to learn accurate models of complex systems. To overcome these limitations, we propose a new online spectral algorithm, which uses tricks such as incremental Singular Value Decomposition (SVD) and random projections to scale to much larger data sets and more complex systems than previous methods. We demonstrate the new method on an inertial measurement prediction task and a high-bandwidth video mapping task and we illustrate desirable behaviors such as “closing the loop,” where the latent state representation changes suddenly as the learner recognizes that it has returned to a previously known place.

AAMAS Conference 2010 Conference Paper

Closing the Learning-Planning Loop with Predictive State Representations

  • Byron Boots
  • Sajid Siddiqi
  • Geoffrey Gordon

A central problem in artificial intelligence is to plan to maximize future reward under uncertainty in a partially observable environment. Models of such environments include Partially Observable Markov Decision Processes (POMDPs) as well as their generalizations, Predictive State Representations (PSRs) and Observable Operator Models (OOMs). POMDPs model the state of the world as a latent variable; in contrast, PSRs and OOMs represent state by tracking occurrence probabilities of a set of future events (called tests or characteristic events) conditioned on past events (called histories or indicative events). Unfortunately, exact planning algorithms such as value iteration are intractable for most realistic POMDPs due to the curse of history and the curse of dimensionality. However, PSRs and OOMs hold the promise of mitigating both of these curses: first, many successful approximate planning techniques designed to address these problems in POMDPs can easily be adapted to PSRs and OOMs. Second, PSRs and OOMs are often more compact than their corresponding POMDPs (i.e., need fewer state dimensions), mitigating the curse of dimensionality. Finally, since tests and histories are observable quantities, it has been suggested that PSRs and OOMs should be easier to learn than POMDPs; with a successful learning algorithm, we can look for a model which ignores all but the most important components of state, reducing dimensionality still further. In this paper we take an important step toward realizing the above hopes. In particular, we propose and demonstrate a fast and statistically consistent spectral algorithm which learns the parameters of a PSR directly from sequences of action-observation pairs. We then close the loop from observations to actions by planning in the learned model and recovering a policy which is near-optimal in the original environment. Closing the loop is a much more stringent test than simply checking short-term prediction accuracy, since the quality of an optimized policy depends strongly on the accuracy of the model: inaccurate models typically lead to useless plans.

NeurIPS Conference 2010 Conference Paper

Predictive State Temporal Difference Learning

  • Byron Boots
  • Geoffrey Gordon

We propose a new approach to value function approximation which combines linear temporal difference reinforcement learning with subspace identification. In practical applications, reinforcement learning (RL) is complicated by the fact that state is either high-dimensional or partially observable. Therefore, RL methods are designed to work with features of state rather than state itself, and the success or failure of learning is often determined by the suitability of the selected features. By comparison, subspace identification (SSID) methods are designed to select a feature set which preserves as much information as possible about state. In this paper we connect the two approaches, looking at the problem of reinforcement learning with a large set of features, each of which may only be marginally useful for value function approximation. We introduce a new algorithm for this situation, called Predictive State Temporal Difference (PSTD) learning. As in SSID for predictive state representations, PSTD finds a linear compression operator that projects a large set of features down to a small set that preserves the maximum amount of predictive information. As in RL, PSTD then uses a Bellman recursion to estimate a value function. We discuss the connection between PSTD and prior approaches in RL and SSID. We prove that PSTD is statistically consistent, perform several experiments that illustrate its properties, and demonstrate its potential on a difficult optimal stopping problem.

AAMAS Conference 2008 Conference Paper

No-Regret Learning and a Mechanism for Distributed Multi-Agent Planning

  • Jan-P. Calliess
  • Geoffrey Gordon

We develop a novel mechanism for coordinated, distributed multiagent planning. We consider problems stated as a collection of single-agent planning problems coupled by common soft constraints on resource consumption. (Resources may be real or fictitious, the latter introduced as a tool for factoring the problem). A key idea is to recast the distributed planning problem as learning in a repeated game between the original agents and a newly introduced group of adversarial agents who influence prices for the resources. The adversarial agents benefit from arbitrage: that is, their incentive is to uncover violations of the resource usage constraints and, by selfishly adjusting prices, encourage the original agents to avoid plans that cause such violations. If all agents employ no-regret learning algorithms in the course of this repeated interaction, we are able to show that our mechanism can achieve design goals such as social optimality (efficiency), budget balance, and Nash-equilibrium convergence to within an error which approaches zero as the agents gain experience. In particular, the agents’ average plans converge to a socially optimal solution for the original planning task. We present experiments in a simulated network routing domain demonstrating our method’s ability to reliably generate sound plans.

NeurIPS Conference 2007 Conference Paper

A Constraint Generation Approach to Learning Stable Linear Dynamical Systems

  • Byron Boots
  • Geoffrey Gordon
  • Sajid Siddiqi

Stability is a desirable characteristic for linear dynamical systems, but it is often ignored by algorithms that learn these systems from data. We propose a novel method for learning stable linear dynamical systems: we formulate an approximation of the problem as a convex program, start with a solution to a relaxed version of the program, and incrementally add constraints to improve stability. Rather than continuing to generate constraints until we reach a feasible solution, we test stability at each step; because the convex program is only an approximation of the desired problem, this early stopping rule can yield a higher-quality solution. We apply our algorithm to the task of learning dynamic textures from image sequences as well as to modeling biosurveillance drug-sales data. The constraint generation approach leads to noticeable improvement in the quality of simulated sequences. We compare our method to those of Lacy and Bernstein [1, 2], with positive results in terms of accuracy, quality of simulated sequences, and efficiency.
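
A hedged sketch of a constraint-generation loop in this spirit, assuming cvxpy for the convex subproblems; the paper's exact constraint construction, objective, and early-stopping rule differ in detail.

```python
import numpy as np
import cvxpy as cp

def learn_stable_lds(X, Y, max_iters=20):
    # Rows of X are states x_t and rows of Y are x_{t+1}, so the estimated dynamics
    # matrix is A.T; its eigenvalues (hence spectral radius) match those of A.
    n = X.shape[1]
    A = cp.Variable((n, n))
    constraints = []
    for _ in range(max_iters):
        prob = cp.Problem(cp.Minimize(cp.sum_squares(X @ A - Y)), constraints)
        prob.solve()
        A_hat = A.value
        if np.max(np.abs(np.linalg.eigvals(A_hat))) <= 1.0:
            return A_hat                      # stable: stop early
        U, s, Vt = np.linalg.svd(A_hat)
        u, v = U[:, 0], Vt[0]
        constraints.append(u @ A @ v <= 1.0)  # cut off the most unstable direction
    return A.value
```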

NeurIPS Conference 2006 Conference Paper

Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General-Sum Stochastic Games

  • Chris Murray
  • Geoffrey Gordon

In real-world planning problems, we must reason not only about our own goals, but about the goals of other agents with which we may interact. Often these agents' goals are neither completely aligned with our own nor directly opposed to them. Instead there are opportunities for cooperation: by joining forces, the agents can all achieve higher utility than they could separately. But, in order to cooperate, the agents must negotiate a mutually acceptable plan from among the many possible ones, and each agent must trust that the others will follow their parts of the deal. Research in multi-agent planning has often avoided the problem of making sure that all agents have an incentive to follow a proposed joint plan. On the other hand, while game theoretic algorithms handle incentives correctly, they often don't scale to large planning problems. In this paper we attempt to bridge the gap between these two lines of research: we present an efficient game-theoretic approximate planning algorithm, along with a negotiation protocol which encourages agents to compute and agree on joint plans that are fair and optimal in a sense defined below. We demonstrate our algorithm and protocol on two simple robotic planning problems.

NeurIPS Conference 2006 Conference Paper

No-regret Algorithms for Online Convex Programs

  • Geoffrey Gordon

Online convex programming has recently emerged as a powerful primitive for designing machine learning algorithms. For example, OCP can be used for learning a linear classifier, dynamically rebalancing a binary search tree, finding the shortest path in a graph with unknown edge lengths, solving a structured classification problem, or finding a good strategy in an extensive-form game. Several researchers have designed no-regret algorithms for OCP. But, compared to algorithms for special cases of OCP such as learning from expert advice, these algorithms are not very numerous or flexible. In learning from expert advice, one tool which has proved particularly valuable is the correspondence between no-regret algorithms and convex potential functions: by reasoning about these potential functions, researchers have designed algorithms with a wide variety of useful guarantees such as good performance when the target hypothesis is sparse. Until now, there has been no such recipe for the more general OCP problem, and therefore no ability to tune OCP algorithms to take advantage of properties of the problem or data. In this paper we derive a new class of no-regret learning algorithms for OCP. These Lagrangian Hedging algorithms are based on a general class of potential functions, and are a direct generalization of known learning rules like weighted majority and external-regret matching. In addition to proving regret bounds, we demonstrate our algorithms learning to play one-card poker.
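
External-regret matching, which the abstract cites as a special case of Lagrangian Hedging, can be sketched in a few lines; the game and opponent below are toy examples.

```python
import numpy as np

def regret_matching(payoff, opp_strategy, T=5000, seed=0):
    # External-regret matching for the row player of a matrix game: play each
    # action with probability proportional to its positive cumulative regret.
    rng = np.random.default_rng(seed)
    n_actions = payoff.shape[0]
    regrets = np.zeros(n_actions)
    avg_play = np.zeros(n_actions)
    for _ in range(T):
        pos = np.maximum(regrets, 0.0)
        p = pos / pos.sum() if pos.sum() > 0 else np.full(n_actions, 1.0 / n_actions)
        a = rng.choice(n_actions, p=p)
        j = rng.choice(payoff.shape[1], p=opp_strategy)
        u = payoff[:, j]              # counterfactual payoffs for this round
        regrets += u - u[a]
        avg_play += p
    return avg_play / T

# Example: matching pennies against a uniformly mixing opponent.
payoff = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(regret_matching(payoff, opp_strategy=np.array([0.5, 0.5])))
```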

NeurIPS Conference 2004 Conference Paper

Planning for Markov Decision Processes with Sparse Stochasticity

  • Maxim Likhachev
  • Sebastian Thrun
  • Geoffrey Gordon

Planning algorithms designed for deterministic worlds, such as A* search, usually run much faster than algorithms designed for worlds with uncertain action outcomes, such as value iteration. Real-world planning problems often exhibit uncertainty, which forces us to use the slower algorithms to solve them. Many real-world planning problems exhibit sparse uncertainty: there are long sequences of deterministic actions which accomplish tasks like moving sensor platforms into place, interspersed with a small number of sensing actions which have uncertain outcomes. In this paper we describe a new planning algorithm, called MCP (short for MDP Compression Planning), which combines A* search with value iteration for solving the Stochastic Shortest Path problem in MDPs with sparse stochasticity. We present experiments which show that MCP can run substantially faster than competing planners in domains with sparse uncertainty; these experiments are based on a simulation of a ground robot cooperating with a helicopter to fill in a partial map and move to a goal location.

IJCAI Conference 2003 Conference Paper

A Learning Algorithm for Localizing People Based on Wireless Signal Strength that Uses Labeled and Unlabeled Data

  • Mary Berna
  • Brennan Sellner
  • Brad Lisien
  • Sebastian Thrun
  • Geoffrey Gordon
  • Frank Pfenning

This paper summarizes a probabilistic approach for localizing people through the signal strengths of a wireless IEEE 802.11b network. Our approach uses data labeled by ground truth position to learn a probabilistic mapping from locations to wireless signals, represented by piecewise linear Gaussians. It then uses sequences of wireless signal data (without position labels) to acquire motion models of individual people, which further improves the localization accuracy. The approach has been implemented and evaluated in an office environment.

NeurIPS Conference 2003 Conference Paper

Applying Metric-Trees to Belief-Point POMDPs

  • Joelle Pineau
  • Geoffrey Gordon
  • Sebastian Thrun

Recent developments in grid-based and point-based approximation algorithms for POMDPs have greatly improved the tractability of POMDP planning. These approaches operate on sets of belief points by individually learning a value function for each point. In reality, belief points exist in a highly-structured metric simplex, but current POMDP algorithms do not exploit this property. This paper presents a new metric-tree algorithm which can be used in the context of POMDP planning to sort belief points spatially, and then perform fast value function updates over groups of points. We present results showing that this approach can reduce computation in point-based POMDP algorithms for a wide range of problems.

NeurIPS Conference 2003 Conference Paper

ARA*: Anytime A* with Provable Bounds on Sub-Optimality

  • Maxim Likhachev
  • Geoffrey Gordon
  • Sebastian Thrun

In real world planning problems, time for deliberation is often limited. Anytime planners are well suited for these problems: they find a feasible solution quickly and then continually work on improving it until time runs out. In this paper we propose an anytime heuristic search, ARA*, which tunes its performance bound based on available search time. It starts by finding a suboptimal solution quickly using a loose bound, then tightens the bound progressively as time allows. Given enough time it finds a provably optimal solution. While improving its bound, ARA* reuses previous search efforts and, as a result, is significantly more efficient than other anytime search methods. In addition to our theoretical analysis, we demonstrate the practical utility of ARA* with experiments on a simulated robot kinematic arm and a dynamic path planning problem for an outdoor rover.
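
The core mechanism is weighted A* with an inflated heuristic, run repeatedly with a decreasing inflation factor. The sketch below restarts each search from scratch for brevity, whereas ARA* reuses previous search effort; the toy graph and heuristic are made up.

```python
import heapq

def weighted_astar(graph, h, start, goal, eps):
    # A* with an inflated heuristic; the returned cost is at most eps times optimal.
    g = {start: 0.0}
    frontier = [(eps * h[start], start)]
    closed = set()
    while frontier:
        _, s = heapq.heappop(frontier)
        if s == goal:
            return g[goal]
        if s in closed:
            continue
        closed.add(s)
        for nbr, c in graph.get(s, []):
            if g[s] + c < g.get(nbr, float('inf')):
                g[nbr] = g[s] + c
                heapq.heappush(frontier, (g[nbr] + eps * h[nbr], nbr))
    return float('inf')

# Toy graph (node -> [(neighbor, cost)]) and an admissible heuristic.
graph = {'s': [('a', 1.0), ('b', 4.0)], 'a': [('g', 5.0)], 'b': [('g', 1.0)]}
h = {'s': 2.0, 'a': 3.0, 'b': 1.0, 'g': 0.0}
for eps in (3.0, 2.0, 1.0):   # anytime loop: tighten the suboptimality bound over time
    print(eps, weighted_astar(graph, h, 's', 'g', eps))
```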

NeurIPS Conference 2003 Conference Paper

Auction Mechanism Design for Multi-Robot Coordination

  • Curt Bererton
  • Geoffrey Gordon
  • Sebastian Thrun

The design of cooperative multi-robot systems is a highly active research area in robotics. Two lines of research in particular have generated interest: the solution of large, weakly coupled MDPs, and the design and implementation of market architectures. We propose a new algorithm which joins together these two lines of research. For a class of coupled MDPs, our algorithm automatically designs a market architecture which causes a decentralized multi-robot system to converge to a consistent policy. We can show that this policy is the same as the one which would be produced by a particular centralized planning algorithm. We demonstrate the new algorithm on three simulation examples: multi-robot towing, multi-robot path planning with a limited fuel resource, and coordinating behaviors in a game of paint ball.

NeurIPS Conference 2003 Conference Paper

Model Uncertainty in Classical Conditioning

  • Aaron Courville
  • Geoffrey Gordon
  • David Touretzky
  • Nathaniel Daw

We develop a framework based on Bayesian model averaging to explain how animals cope with uncertainty about contingencies in classical conditioning experiments. Traditional accounts of conditioning fit parameters within a fixed generative model of reinforcer delivery; uncertainty over the model structure is not considered. We apply the theory to explain the puzzling relationship between second-order conditioning and conditioned inhibition, two similar conditioning regimes that nonetheless result in strongly divergent behavioral outcomes. According to the theory, second-order conditioning results when limited experience leads animals to prefer a simpler world model that produces spurious correlations; conditioned inhibition results when a more complex model is justified by additional experience.

NeurIPS Conference 2002 Conference Paper

Exponential Family PCA for Belief Compression in POMDPs

  • Nicholas Roy
  • Geoffrey Gordon

Standard value function approaches to finding policies for Partially Observable Markov Decision Processes (POMDPs) are intractable for large models. The intractability of these algorithms is due to a great extent to their generating an optimal policy over the entire belief space. However, in real POMDP problems most belief states are unlikely, and there is a structured, low-dimensional manifold of plausible beliefs embedded in the high-dimensional belief space. We introduce a new method for solving large-scale POMDPs by taking advantage of belief space sparsity. We reduce the dimensionality of the belief space by exponential family Principal Components Analysis [1], which allows us to turn the sparse, high-dimensional belief space into a compact, low-dimensional representation in terms of learned features of the belief state. We then plan directly on the low-dimensional belief features. By planning in a low-dimensional space, we can find policies for POMDPs that are orders of magnitude larger than can be handled by conventional techniques. We demonstrate the use of this algorithm on a synthetic problem and also on a mobile robot navigation task.
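
A gradient-descent sketch of exponential-family PCA with an exponential link (a Poisson-style Bregman loss), to illustrate the low-dimensional belief factorization; the paper uses a more careful optimization scheme, and the specific link, loss, and step sizes here are assumptions.

```python
import numpy as np

def epca_exponential_link(B, k, iters=5000, lr=1e-3, seed=0):
    # Factorize beliefs B (rows = belief vectors) as B ~= exp(U @ V) by gradient
    # descent on the Bregman loss sum(exp(Theta) - B * Theta), Theta = U @ V.
    rng = np.random.default_rng(seed)
    n, m = B.shape
    U = 0.01 * rng.standard_normal((n, k))   # low-dimensional belief features
    V = 0.01 * rng.standard_normal((k, m))   # basis in natural-parameter space
    for _ in range(iters):
        R = np.exp(U @ V) - B                # dLoss/dTheta
        gU, gV = R @ V.T, U.T @ R
        U -= lr * gU
        V -= lr * gV
    return U, V
```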

NeurIPS Conference 2002 Conference Paper

Generalized² Linear² Models

  • Geoffrey Gordon

We introduce the Generalized² Linear² Model, a statistical estimator which combines features of nonlinear regression and factor analysis. A (GL)²M approximately decomposes a rectangular matrix X into a simpler representation f(g(A)h(B)). Here A and B are low-rank matrices, while f, g, and h are link functions. (GL)²Ms include many useful models as special cases, including principal components analysis, exponential-family PCA, the infomax formulation of independent components analysis, linear regression, and generalized linear models. They also include new and interesting special cases, one of which we describe below. We also present an iterative procedure which optimizes the parameters of a (GL)²M. This procedure reduces to well-known algorithms for some of the special cases listed above; for other special cases, it is new.

NeurIPS Conference 2000 Conference Paper

Reinforcement Learning with Function Approximation Converges to a Region

  • Geoffrey Gordon

Many algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(0) and V(0); the latter algorithm was used in the well-known TD-Gammon program.

NeurIPS Conference 1995 Conference Paper

Stable Fitted Reinforcement Learning

  • Geoffrey Gordon

We describe the reinforcement learning problem, motivate algorithms which seek an approximation to the Q function, and present new convergence results for two such algorithms.