Arrow Research

Author name cluster

Pierre Thodoroff

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

2 papers
1 author row

Possible papers

RLDM Conference 2019 · Conference Abstract

Recurrent Temporal Difference

  • Pierre Thodoroff
  • Nishanth V Anand
  • Lucas Caccia
  • Doina Precup
  • Joelle Pineau

In sequential modelling, exponential smoothing is one of the most widely used techniques for maintaining temporal consistency in estimates. In this work, we propose Recurrent Learning, a method that estimates the value function in reinforcement learning using exponential smoothing along the trajectory. Most reinforcement learning algorithms estimate the value function at every time step as a point estimate, without explicitly enforcing temporal coherence or considering previous estimates. This can lead to temporally inconsistent behaviors, particularly in tabular and discrete settings. In other words, we propose to smooth the value function of the current state using the estimates of states that occur earlier in the trajectory. Intuitively, states that are temporally close to each other should have similar values. The λ-return [1, 2] enforces temporal coherence along the trajectory implicitly, whereas we propose a method that enforces it explicitly. However, exponential averaging can be biased if a sharp change (non-stationarity) is encountered along the trajectory, such as falling off a cliff. A common technique to alleviate this issue is to make the exponential smoothing factor βt state- or time-dependent. The key ingredient of recurrent neural networks (LSTM [3] and GRU [4]) is the gating mechanism (a state-dependent βt) used to update the hidden cell; the capacity to ignore information allows the cell to focus only on what is important. In this work we explore a new method that attempts to learn a state-dependent smoothing factor β. To summarize, the contributions of the paper are as follows:

  • Propose a new way to estimate the value function in reinforcement learning by exploiting the estimates along the trajectory.
  • Derive a learning rule for a state-dependent β.
  • Perform a set of experiments in continuous settings to evaluate its strengths and weaknesses.
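The smoothing described in this abstract can be read as a simple recurrence over the trajectory. The snippet below is a minimal NumPy sketch of that idea only; the function name, the hand-picked β values, and the toy trajectory are illustrative assumptions, not the paper's learned gating rule. It shows how a constant βt lags behind a sharp change, while a state-dependent βt can discard the stale history at exactly that step.

```python
import numpy as np

def smoothed_value_estimates(values, betas):
    """Exponentially smooth per-step value estimates along one trajectory.

    values: point value estimates V(s_0), ..., V(s_T)
    betas:  per-step smoothing factors in [0, 1]; beta_t = 1 keeps the raw
            point estimate, smaller beta_t leans more on the smoothed
            estimate carried over from the previous state.
    """
    values = np.asarray(values, dtype=float)
    betas = np.asarray(betas, dtype=float)
    smoothed = np.empty_like(values)
    smoothed[0] = values[0]
    for t in range(1, len(values)):
        # Blend the current point estimate with the smoothed estimate of
        # the state that occurred just before it in the trajectory.
        smoothed[t] = betas[t] * values[t] + (1.0 - betas[t]) * smoothed[t - 1]
    return smoothed

# Toy trajectory with a sharp change (e.g. falling off a cliff) at t = 3.
values = [1.0, 1.1, 0.9, -5.0, -5.2]
# Constant beta: the smoothed estimate reacts slowly to the change.
print(smoothed_value_estimates(values, betas=[1.0, 0.5, 0.5, 0.5, 0.5]))
# State-dependent beta that opens the "gate" at the change: stale history
# is ignored at that step, which is the role of the learned gating.
print(smoothed_value_estimates(values, betas=[1.0, 0.5, 0.5, 1.0, 0.5]))
```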

NeurIPS Conference 2018 · Conference Paper

Temporal Regularization for Markov Decision Process

  • Pierre Thodoroff
  • Audrey Durand
  • Joelle Pineau
  • Doina Precup

Several applications of Reinforcement Learning suffer from instability due to high variance. This is especially prevalent in high-dimensional domains. Regularization is a commonly used technique in machine learning to reduce variance, at the cost of introducing some bias. Most existing regularization techniques focus on spatial (perceptual) regularization. Yet in reinforcement learning, due to the nature of the Bellman equation, there is an opportunity to also exploit temporal regularization based on smoothness in value estimates over trajectories. This paper explores a class of methods for temporal regularization. We formally characterize the bias induced by this technique using Markov chain concepts. We illustrate the various characteristics of temporal regularization via a sequence of simple discrete and continuous MDPs, and show that the technique provides improvement even in high-dimensional Atari games.
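Below is a rough tabular sketch of how such a temporal term can be mixed into a TD-style update. The function, its parameter names, and the blending rule are assumptions for illustration, not the paper's exact regularized Bellman operator: the bootstrapped target for a state is simply pulled toward the estimate of the state visited just before it, trading some bias for lower variance.

```python
import numpy as np

def temporally_regularized_td(V, trajectory, alpha=0.1, gamma=0.99, beta=0.2):
    """Tabular TD(0) update with a simple temporal-regularization term.

    The usual bootstrapped target is blended with the value of the previous
    state along the trajectory; beta = 0 recovers plain TD(0).
    """
    prev_state = None
    for state, reward, next_state in trajectory:
        td_target = reward + gamma * V[next_state]
        if prev_state is not None:
            # Smooth the target using the estimate of the preceding state.
            td_target = (1.0 - beta) * td_target + beta * V[prev_state]
        V[state] += alpha * (td_target - V[state])
        prev_state = state
    return V

# Toy usage on a 3-state chain: transitions are (state, reward, next_state).
V = np.zeros(3)
trajectory = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 2)]
for _ in range(200):
    V = temporally_regularized_td(V, trajectory)
print(V)
```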