Author name cluster

Odalric Maillard

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers

1 author row

AAAI Conference 2023 Conference Paper

Bilinear Exponential Family of MDPs: Frequentist Regret Bound with Tractable Exploration & Planning

Reda Ouhamma
Debabrota Basu
Odalric Maillard

We study the problem of episodic reinforcement learning in continuous state-action spaces with unknown rewards and transitions. Specifically, we consider the setting where the rewards and transitions are modeled using parametric bilinear exponential families. We propose an algorithm, that a) uses penalized maximum likelihood estimators to learn the unknown parameters, b) injects a calibrated Gaussian noise in the parameter of rewards to ensure exploration, and c) leverages linearity of the bilinear exponential family transitions with respect to an underlying RKHS to perform tractable planning. We provide a frequentist regret upper-bound for our algorithm which, in the case of tabular MDPs, is order-optimal with respect to H and K, where H is the episode length and K is the number of episodes. Our analysis improves the existing bounds for the bilinear exponential family of MDPs by square root of H and removes the handcrafted clipping deployed in existing RLSVI-type algorithms.

PDF Details DOI

EWRL Workshop 2022 Workshop Paper

Bilinear Exponential Family of MDPs: Frequentist Regret Bound with Tractable Exploration \& Planning

Reda Ouhamma
Debabrota Basu
Odalric Maillard

We study the problem of episodic reinforcement learning in continuous state-action spaces with unknown rewards and transitions. Specifically, we consider the setting where the rewards and transitions are modeled using parametric bilinear exponential families. We propose an algorithm, BEF-RLSVI, that a) uses penalized maximum likelihood estimators to learn the unknown parameters, b) injects a calibrated Gaussian noise in the parameter of rewards to ensure exploration, and c) leverages linearity of the exponential family with respect to an underlying RKHS to perform tractable planning. We further provide a frequentist regret analysis of BEF-RLSVI that yields an upper bound of Õ( √ d3H3K), where d is the dimension of the parameters, H is the episode length, and K is the number of episodes. Our analysis improves the existing bounds for the bilinear exponential family of MDPs by √ H and removes the handcrafted clipping deployed in existing RLSVI-type algorithms. Our regret bound is order-optimal with respect to H and K.

PDF Details

EWRL Workshop 2022 Workshop Paper

Risk-aware linear bandits with convex loss

Patrick Saux
Odalric Maillard

In decision-making problems such as the multi-armed bandit, an agent learns sequentially by optimizing a certain feedback. While the mean reward criterion has been extensively studied, other measures that reflect an aversion to adverse outcomes, such as mean-variance or conditional value-at-risk (CVaR), can be of interest for critical applications (healthcare, agriculture). Algorithms have been proposed for such risk-aware measures under bandit feedback without contextual information. In this work, we study contextual bandits where such risk measures can be elicited as linear functions of the contexts through the minimization of a convex loss. A typical example that fits within this framework is the expectile measure, which is obtained as the solution of an asymmetric least-square problem. Using the method of mixtures for supermartingales, we derive confidence sequences for the estimation of such risk measures. We then propose an optimistic UCB algorithm to learn optimal risk-aware actions, with regret guarantees similar to those of generalized linear bandits. This approach requires solving a convex problem at each round of the algorithm, which we can relax by allowing only approximated solution obtained by online gradient descent, at the cost of slightly higher regret. We conclude by evaluating the resulting algorithms on numerical experiments.

PDF Details

RLDM Conference 2019 Conference Abstract

Upper Confidence Reinforcement Learning Algorithms Exploiting State-Action Similarities

Mahsa Asadi
Odalric Maillard
Mohammad Sadegh Talebi

Leveraging an equivalence property on the set of states of state-action pairs in an Markov De- cision Process (MDP) has been suggested by many authors. We take the study of equivalence classes of state-action pairs of a MDP to the reinforcement learning (RL) setup, when transition distributions are no longer assumed to be known, in a never-ending discrete MDP with average reward criterion. We study pow- erful similarities between state-action pairs related to optimal transport. We first introduce a variant of the UCRL algorithm called C-UCRL, which highlights the clear benefit √ of leveraging this equivalence structure when it is known ahead of time: the regret bound scales as Õ(D KCT ), where C is the number of classes of equivalent state-action pairs and K bounds the size of the support of the transitions. A non-trivial ques- tion is whether this benefit can still be observed when the structure is unknown and must be learned while minimizing the regret. We propose a sound clustering technique that provably learns the unknown classes. It is then empirically validated that learning the structure can be beneficial in a fully-blown RL problem.

PDF Details

NeurIPS Conference 2010 Conference Paper

LSTD with Random Projections

Mohammad Ghavamzadeh
Alessandro Lazaric
Odalric Maillard
Rémi Munos

We consider the problem of reinforcement learning in high-dimensional spaces when the number of features is bigger than the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a high-dimensional space. We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm.

PDF Details

NeurIPS Conference 2010 Conference Paper

Scrambled Objects for Least-Squares Regression

Odalric Maillard
Rémi Munos

We consider least-squares regression using a randomly generated subspace G P\subset F of finite dimension P, where F is a function space of infinite dimension, e. g. ~L 2([0, 1]^d). G P is defined as the span of P random features that are linear combinations of the basis functions of F weighted by random Gaussian i. i. d. ~coefficients. In particular, we consider multi-resolution random combinations at all scales of a given mother function, such as a hat function or a wavelet. In this latter case, the resulting Gaussian objects are called {\em scrambled wavelets} and we show that they enable to approximate functions in Sobolev spaces H^s([0, 1]^d). As a result, given N data, the least-squares estimate \hat g built from P scrambled wavelets has excess risk ||f^* - \hat g|| \P^2 = O(||f^ ||^2_{H^s([0, 1]^d)}(\log N)/P + P(\log N )/N) for target functions f^ \in H^s([0, 1]^d) of smoothness order s>d/2. An interesting aspect of the resulting bounds is that they do not depend on the distribution \P from which the data are generated, which is important in a statistical regression setting considered here. Randomization enables to adapt to any possible distribution. We conclude by describing an efficient numerical implementation using lazy expansions with numerical complexity \tilde O(2^d N^{3/2}\log N + N^2), where d is the dimension of the input space.

PDF Details

NeurIPS Conference 2009 Conference Paper

Compressed Least-Squares Regression

Odalric Maillard
Rémi Munos

We consider the problem of learning, from K input data, a regression function in a function space of high dimension N using projections onto a random subspace of lower dimension M. From any linear approximation algorithm using empirical risk minimization (possibly penalized), we provide bounds on the excess risk of the estimate computed in the projected subspace (compressed domain) in terms of the excess risk of the estimate built in the high-dimensional space (initial domain). We apply the analysis to the ordinary Least-Squares regression and show that by choosing M=O(\sqrt{K}), the estimation error (for the quadratic loss) of the ``Compressed Least Squares Regression is O(1/\sqrt{K}) up to logarithmic factors. We also discuss the numerical complexity of several algorithms (both in initial and compressed domains) as a function of N, K, and M.

PDF Details