RLDM 2019 Conference Abstract
Recurrent Temporal Difference
- Pierre Thodoroff
- Nishanth V Anand
- Lucas Caccia
- Doina Precup
- Joelle Pineau
In sequential modelling, exponential smoothing is one of the most widely used techniques for maintaining temporal consistency in estimates. In this work, we propose Recurrent Learning, a method that estimates the value function in reinforcement learning using exponential smoothing along the trajectory. Most reinforcement learning algorithms estimate the value function at every time step as a point estimate, without explicitly enforcing temporal coherence or considering previous estimates. This can lead to temporally inconsistent behaviour, particularly in tabular and discrete settings. In other words, we propose to smooth the value function of the current state using the estimates of states that occur earlier in the trajectory. Intuitively, states that are temporally close to each other should have similar values. The λ-return [1, 2] enforces temporal coherence along the trajectory implicitly, whereas we propose a method that enforces it explicitly.

However, exponential averaging can be biased when a sharp change (non-stationarity) is encountered in the trajectory, such as falling off a cliff. A common technique to alleviate this issue is to make βt, the exponential smoothing factor, state or time dependent. The key ingredient in recurrent neural networks (LSTM [3] and GRU [4]) is the gating mechanism (a state-dependent βt) used to update the hidden cell. The capacity to ignore information allows the cell to focus on only the important information. In this work, we explore a new method that attempts to learn a state-dependent smoothing factor β.

To summarize, the contributions of the paper are as follows:
+ Propose a new way to estimate the value function in reinforcement learning by exploiting the estimates along the trajectory.
+ Derive a learning rule for a state-dependent β.
+ Perform a set of experiments in continuous settings to evaluate its strengths and weaknesses.
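The smoothing described above can be sketched as a simple recurrence: the smoothed value of the current state mixes its point estimate with the smoothed estimate of the previous state in the trajectory, with a state-dependent gate βt controlling how much of the past to keep. The following is a minimal illustrative sketch, not the paper's exact formulation; the function name and the precise form of the recurrence are assumptions consistent with the description (βt = 1 discards the history, which is useful after a sharp change such as falling off a cliff).

```python
import numpy as np

def recurrent_value_estimate(values, betas):
    """Exponentially smooth point value estimates along one trajectory.

    values: point estimates V(s_t) for each state s_t in the trajectory.
    betas:  state-dependent smoothing factors beta_t in [0, 1];
            beta_t close to 1 trusts the current point estimate
            (ignoring the past), beta_t close to 0 leans on the
            smoothed estimates of earlier states.
    """
    smoothed = np.empty(len(values), dtype=float)
    smoothed[0] = values[0]  # nothing earlier in the trajectory to smooth with
    for t in range(1, len(values)):
        # Mix the current point estimate with the previous smoothed value.
        smoothed[t] = betas[t] * values[t] + (1.0 - betas[t]) * smoothed[t - 1]
    return smoothed
```

With βt fixed at 1 the sketch reduces to the usual per-state point estimate, while intermediate values of βt pull temporally adjacent states toward similar values, which is the temporal-coherence effect the abstract argues for.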