NeurIPS Conference 1997 Conference Paper
This paper is concerned with the problem of Reinforcement Learn(cid: 173) ing (RL) for continuous state space and time stocha. stic control problems. We state the Harnilton-Jacobi-Bellman equation satis(cid: 173) fied by the value function and use a Finite-Difference method for designing a convergent approximation scheme. Then we propose a RL algorithm based on this scheme and prove its convergence to the optimal solution. 1 Introduction to RL in the continuous, stochastic case The objective of RL is to find -thanks to a reinforcement signal- an optimal strategy for solving a dynamical control problem. Here we sudy the continuous time, con(cid: 173) tinuous state-space stochastic case, which covers a wide variety of control problems including target, viability, optimization problems (see [FS93], [KP95])}or which a formalism is the following. The evolution of the current state x(t) E 0 (the state(cid: 173) space, with 0 open subset of IRd ), depends on the control u(t) E U (compact subset) by a stochastic differential equation, called the state dynamics: dx = f(x(t), u(t))dt + a(x(t), u(t))dw (1) where f is the local drift and a. dw (with w a brownian motion of dimension rand (j a d x r-matrix) the stochastic part (which appears for several reasons such as lake of precision, noisy influence, random fluctuations) of the diffusion process. For initial state x and control u(t), (1) leads to an infinity of possible traj~tories x(t). For some trajectory x(t) (see figure I). , let T be its exit time from 0 (with the convention that if x(t) always stays in 0, then T = 00). Then, we define the functional J of initial state x and control u(. ) as the expectation for all trajectories of the discounted cumulative reinforcement: J(x; u(. )) = Ex, u(. ) {loT '/r(x(t), u(t))dt +, ,{ R(X(T))}