Author name cluster

Claire Vernade

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

23 papers

2 author rows

EWRL Workshop 2025 Workshop Paper

Barycenter Policy Design for Multiple Policy Evaluation

Simon Weissmann
Till Freihaut
Claire Vernade
Giorgia Ramponi
Leif Döring

A growing challenge in reinforcement learning is to efficiently explore the action space to evaluate multiple target policies using importance sampling. When target policies share similarities, leveraging these resemblances in the behavior policy is crucial for sample efficiency. However, formally defining and algorithmically utilizing such similarities remains an open problem. This article introduces a behavior policy design, examining how different criteria for selecting a behavior policy influence importance sampling estimator properties. We evaluate the resulting behavior policies in downstream tasks, particularly in best policy selection problems. Additionally, we demonstrate how effectively leveraging similarities among target policies results in a more nuanced behavior policy design and enhances regret bounds for best policy selection. To facilitate rigorous analysis, the article is formulated within the stochastic bandit framework.

PDF

EWRL Workshop 2025 Workshop Paper

Efficient Risk-sensitive Planning via Entropic Risk Measures

Alexandre Marthe
Samuel Bounan
Aurélien Garivier
Claire Vernade

Risk-sensitive planning aims to identify policies maximizing some tail-focused metrics in Markov Decision Processes (MDPs), such as (Conditional) Values at Risk. In general, such optimization problems can be very costly as it is known that only Entropic Risk Measures (EntRM) can be efficiently optimized through dynamic programming. We show that EntRM can serve as an approximation of other metrics of interest and we propose a dual optimization problem that requires to compute the set of optimal policies for EntRM across all parameter values. We prove that this optimality front can be computed effectively thanks to a novel structural analysis of the smoothness properties of entropic risks. Empirical results demonstrate that our approach achieves strong performance in risk-sensitive decision-making scenarios.

PDF

EWRL Workshop 2025 Workshop Paper

Non-Stationary Lipschitz Bandits

Nicolas Nguyen
Solenne Gaucher
Claire Vernade

We study the problem of non-stationary Lipschitz bandits, where the number of actions is infinite and the reward function, satisfying a Lipschitz assumption, can change arbitrarily over time. We design an algorithm that adaptively tracks the recently introduced notion of significant shifts, defined by large deviations of the cumulative reward function. To detect such reward changes, our algorithm leverages a hierarchical discretization of the action space. Without requiring any prior knowledge of the non-stationarity, our algorithm achieves a minimax-optimal dynamic regret bound of $\mathcal{\widetilde{O}}(\widetilde{L}^{1/3}T^{2/3})$, where $\tilde{L}$ is the number of significant shifts and $T$ the horizon. This result provides the first optimal guarantee in this setting.

PDF

NeurIPS Conference 2025 Conference Paper

Non-Stationary Lipschitz Bandits

Nicolas Nguyen
Solenne Gaucher
Claire Vernade

We study the problem of non-stationary Lipschitz bandits, where the number of actions is infinite and the reward function, satisfying a Lipschitz assumption, can change arbitrarily over time. We design an algorithm that adaptively tracks the recently introduced notion of significant shifts, defined by large deviations of the cumulative reward function. To detect such reward changes, our algorithm leverages a hierarchical discretization of the action space. Without requiring any prior knowledge of the non-stationarity, our algorithm achieves a minimax-optimal dynamic regret bound of $\mathcal{\widetilde{O}}(\tilde{L}^{1/3}T^{2/3})$, where $\tilde{L}$ is the number of significant shifts and $T$ the horizon. This result provides the first optimal guarantee in this setting.

PDF Details

ICML Conference 2025 Conference Paper

Partially Observable Reinforcement Learning with Memory Traces

Onno Eberhard
Michael Muehlebach
Claire Vernade

Partially observable environments present a considerable computational challenge in reinforcement learning due to the need to consider long histories. Learning with a finite window of observations quickly becomes intractable as the window length grows. In this work, we introduce memory traces. Inspired by eligibility traces, these are compact representations of the history of observations in the form of exponential moving averages. We prove sample complexity bounds for the problem of offline on-policy evaluation that quantify the return errors achieved with memory traces for the class of Lipschitz continuous value estimates. We establish a close connection to the window approach, and demonstrate that, in certain environments, learning with memory traces is significantly more sample efficient. Finally, we underline the effectiveness of memory traces empirically in online reinforcement learning experiments for both value prediction and control.

Details

EWRL Workshop 2025 Workshop Paper

Partially Observable Reinforcement Learning with Memory Traces

Onno Eberhard
Michael Muehlebach
Claire Vernade

Partially observable environments present a considerable computational challenge in reinforcement learning due to the need to consider long histories. Learning with a finite window of observations quickly becomes intractable as the window length grows. In this work, we introduce *memory traces*. Inspired by eligibility traces, these are compact representations of the history of observations in the form of exponential moving averages. We prove sample complexity bounds for the problem of offline on-policy evaluation that quantify the return errors achieved with memory traces for the class of Lipschitz continuous value estimates. We establish a close connection to the window approach, and demonstrate that, in certain environments, learning with memory traces is significantly more sample efficient. Finally, we underline the effectiveness of memory traces empirically in online reinforcement learning experiments for both value prediction and control.

PDF

NeurIPS Conference 2025 Conference Paper

Put CASH on Bandits: A Max K-Armed Problem for Automated Machine Learning

Amir Rezaei Balef
Claire Vernade
Katharina Eggensperger

The Combined Algorithm Selection and Hyperparameter optimization (CASH) is a challenging resource allocation problem in the field of AutoML. We propose MaxUCB, a max $k$-armed bandit method to trade off exploring different model classes and conducting hyperparameter optimization. MaxUCB is specifically designed for the light-tailed and bounded reward distributions arising in this setting and, thus, provides an efficient alternative compared to classic max $k$-armed bandit methods assuming heavy-tailed reward distributions. We theoretically and empirically evaluate our method on four standard AutoML benchmarks, demonstrating superior performance over prior approaches. We make our code and data available at https: //github. com/amirbalef/CASH_with_Bandits

PDF Details

NeurIPS Conference 2025 Conference Paper

Quantization-Free Autoregressive Action Transformer

Ziyad Sheebaelhamd
Michael Tschannen
Michael Muehlebach
Claire Vernade

Current transformer-based imitation learning approaches introduce discrete action representations and train an autoregressive transformer decoder on the resulting latent code. However, the initial quantization breaks the continuous structure of the action space thereby limiting the capabilities of the generative model. We propose a quantization-free method instead that leverages Generative Infinite-Vocabulary Transformers (GIVT) as a direct, continuous policy parametrization for autoregressive transformers. This simplifies the imitation learning pipeline while achieving state-of-the-art performance on a variety of popular simulated robotics tasks. We enhance our policy roll-outs by carefully studying sampling algorithms, further improving the results.

PDF Details

RLC Conference 2024 Conference Paper

Non-stationary Bandits and Meta-Learning with a Small Set of Optimal Arms

Javad Azizi
Thang Duong
Yasin Abbasi-Yadkori
András György
Claire Vernade
Mohammad Ghavamzadeh

We study a sequential decision problem where the learner faces a sequence of $K$-armed bandit tasks. The task boundaries might be known (the bandit meta-learning setting), or unknown (the non-stationary bandit setting). For a given integer $M\le K$, the learner aims to compete with the best subset of arms of size $M$. We design an algorithm based on a reduction to bandit submodular maximization and show that for $T$ time steps comprised of $N$ tasks, in the regime of large $N$ and small number of optimal arms $M$, its regret in both settings is smaller than the simple baseline of $\tilde{O}(\sqrt{KNT})$ that can be obtained by using standard algorithms designed for non-stationary bandit problems. For the bandit meta-learning problem with fixed task length $\tau$, we show that the regret of the algorithm is bounded as $\tilde{O}(NM\sqrt{M \tau}+ (M^4 K N^2)^{1/3} \tau)$. Under some additional assumptions on the identifiability of the optimal arms in each task, we show a bandit meta-learning algorithm with an improved $\tilde{O}(N\sqrt{M \tau}+ (NMK^{1/2})^{1/2}\tau^{3/4} )$ regret, where the order of the leading term (the first term) is optimal up to logarithmic factors, and the algorithm does not need the knowledge of $M, N$, and $T$ in advance.

PDF Details

RLJ Journal 2024 Journal Article

Non-stationary Bandits and Meta-Learning with a Small Set of Optimal Arms

Javad Azizi
Thang Duong
Yasin Abbasi-Yadkori
András György
Claire Vernade
Mohammad Ghavamzadeh

PDF Details

NeurIPS Conference 2023 Conference Paper

Beyond Average Return in Markov Decision Processes

Alexandre Marthe
Aurélien Garivier
Claire Vernade

What are the functionals of the reward that can be computed and optimized exactly in Markov Decision Processes? In the finite-horizon, undiscounted setting, Dynamic Programming (DP) can only handle these operations efficiently for certain classes of statistics. We summarize the characterization of these classes for policy evaluation, and give a new answer for the planning problem. Interestingly, we prove that only generalized means can be optimized exactly, even in the more general framework of Distributional Reinforcement Learning (DistRL). DistRL permits, however, to evaluate other functionals approximately. We provide error bounds on the resulting estimators, and discuss the potential of this approach as well as its limitations. These results contribute to advancing the theory of Markov Decision Processes by examining overall characteristics of the return, and particularly risk-conscious strategies.

PDF Details

EWRL Workshop 2023 Workshop Paper

Beyond Average Reward in Markov Decision Processes

Alexandre Marthe
Aurélien Garivier
Claire Vernade

What are the functionals of the reward that can be computed and optimized exactly in Markov Decision Processes? In the finite-horizon, undiscounted setting, Dynamic Programming (DP) can only handle these operations efficiently for certain classes of statistics. We summarize the characterization of these classes for policy evaluation, and give a new answer for the planning problem. Interestingly, we prove that only generalized means can be optimized exactly. It is possible, however, to evaluate other functionals approximately using Distributional Reinforcement Learning. We prove error bounds on the resulting estimators and discuss the potential and limitations of this approach. These results contribute to advancing the theory of Markov Decision Processes by examining overall characteristics of the return, and particularly risk-conscious strategies.

PDF

EWRL Workshop 2023 Workshop Paper

Lifelong Best-Arm Identification with Misspecified Priors

Nicolas Nguyen
Claire Vernade

We address the problem of lifelong fixed-budget best-arm identification (BAI), which arises in realistic sequential A/B testing scenarios where the value of each arm is correlated across test phases. We propose a hierarchical Gaussian generative model and develop a Bayesian fixed-budget BAI algorithm. Our main contribution is to investigate the impact of prior misspecification on the missidentification probability along the learning trajectory through an upper bound on a novel risk metric. We conduct extensive empirical evaluations of our algorithm against state-of-the-art methods on various types of martingales with different dependency structures. Our results show that our approach outperforms other algorithms across a wide range of settings.

PDF

TMLR Journal 2023 Journal Article

POMRL: No-Regret Learning-to-Plan with Increasing Horizons

Khimya Khetarpal
Claire Vernade
Brendan O'Donoghue
Satinder Singh
Tom Zahavy

We study the problem of planning under model uncertainty in an online meta-reinforcement learning (RL) setting where an agent is presented with a sequence of related tasks with limited interactions per task. The agent can use its experience in each task and across tasks to estimate both the transition model and the distribution over tasks. We propose an algorithm to meta-learn the underlying relatedness across tasks, utilize it to plan in each task, and upper-bound the regret of the planning loss. Our bound suggests that the average regret over tasks decreases as the number of tasks increases and as the tasks are more similar. In the classical single-task setting, it is known that the planning horizon should depend on the estimated model's accuracy, that is, on the number of samples within task. We generalize this finding to meta-RL and study this dependence of planning horizons on the number of tasks. Based on our theoretical findings, we derive heuristics for selecting slowly increasing discount factors, and we validate its significance empirically.

PDF Details

ICLR Conference 2022 Conference Paper

EigenGame Unloaded: When playing games is better than optimizing

Ian Gemp
Brian McWilliams
Claire Vernade
Thore Graepel

We build on the recently proposed EigenGame that views eigendecomposition as a competitive game. EigenGame's updates are biased if computed using minibatches of data, which hinders convergence and more sophisticated parallelism in the stochastic setting. In this work, we propose an unbiased stochastic update that is asymptotically equivalent to EigenGame, enjoys greater parallelism allowing computation on datasets of larger sample sizes, and outperforms EigenGame in experiments. We present applications to finding the principal components of massive datasets and performing spectral clustering of graphs. We analyze and discuss our proposed update in the context of EigenGame and the shift in perspective from optimization to games.

Details

ICLR Conference 2021 Conference Paper

EigenGame: PCA as a Nash Equilibrium

Ian Gemp
Brian McWilliams
Claire Vernade
Thore Graepel

We present a novel view on principal components analysis as a competitive game in which each approximate eigenvector is controlled by a player whose goal is to maximize their own utility function. We analyze the properties of this PCA game and the behavior of its gradient based updates. The resulting algorithm---which combines elements from Oja's rule with a generalized Gram-Schmidt orthogonalization---is naturally decentralized and hence parallelizable through message passing. We demonstrate the scalability of the algorithm with experiments on large image datasets and neural network activations. We discuss how this new view of PCA as a differentiable game can lead to further algorithmic developments and insights.

Details

ICML Conference 2020 Conference Paper

Linear bandits with Stochastic Delayed Feedback

Claire Vernade
Alexandra Carpentier
Tor Lattimore
Giovanni Zappella
Beyza Ermis
Michael Brückner

Stochastic linear bandits are a natural and well-studied model for structured exploration/exploitation problems and are widely used in applications such as on-line marketing and recommendation. One of the main challenges faced by practitioners hoping to apply existing algorithms is that usually the feedback is randomly delayed and delays are only partially observable. For example, while a purchase is usually observable some time after the display, the decision of not buying is never explicitly sent to the system. In other words, the learner only observes delayed positive events. We formalize this problem as a novel stochastic delayed linear bandit and propose OTFLinUCB and OTFLinTS, two computationally efficient algorithms able to integrate new information as it becomes available and to deal with the permanently censored feedback. We prove optimal O(d\sqrt{T}) bounds on the regret of the first algorithm and study the dependency on delay-dependent parameters. Our model, assumptions and results are validated by experiments on simulated and real data.

Details

ICML Conference 2020 Conference Paper

Non-Stationary Delayed Bandits with Intermediate Observations

Claire Vernade
András György 0001
Timothy A. Mann

Online recommender systems often face long delays in receiving feedback, especially when optimizing for some long-term metrics. While mitigating the effects of delays in learning is well-understood in stationary environments, the problem becomes much more challenging when the environment changes. In fact, if the timescale of the change is comparable to the delay, it is impossible to learn about the environment, since the available observations are already obsolete. However, the arising issues can be addressed if intermediate signals are available without delay, such that given those signals, the long-term behavior of the system is stationary. To model this situation, we introduce the problem of stochastic, non-stationary, delayed bandits with intermediate observations. We develop a computationally efficient algorithm based on UCRL, and prove sublinear regret guarantees for its performance. Experimental results demonstrate that our method is able to learn in non-stationary delayed environments where existing methods fail.

Details

ICML Conference 2020 Conference Paper

Stochastic bandits with arm-dependent delays

Anne Gael Manegueu
Claire Vernade
Alexandra Carpentier
Michal Valko

Significant work has been recently dedicated to the stochastic delayed bandits because of its relevance in applications. The applicability of existing algorithms is however restricted by the fact that strong assumptions are often made on the delay distributions, such as full observability, restrictive shape constraints, or uniformity over arms. In this work, we weaken them significantly and only assume that there is a bound on the tail of the delay. In particular, we cover the important case where the delay distributions vary across arms, and the case where the delays are heavy-tailed. Addressing these difficulties, we propose a simple but efficient UCB-based algorithm called the PatientBandits. We provide both problemsdependent and problems-independent bounds on the regret as well as performance lower bounds.

Details

NeurIPS Conference 2019 Conference Paper

Weighted Linear Bandits for Non-Stationary Environments

Yoan Russac
Claire Vernade
Olivier Cappé

We consider a stochastic linear bandit model in which the available actions correspond to arbitrary context vectors whose associated rewards follow a non-stationary linear regression model. In this setting, the unknown regression parameter is allowed to vary in time. To address this problem, we propose D-LinUCB, a novel optimistic algorithm based on discounted linear regression, where exponential weights are used to smoothly forget the past. This involves studying the deviations of the sequential weighted least-squares estimator under generic assumptions. As a by-product, we obtain novel deviation results that can be used beyond non-stationary environments. We provide theoretical guarantees on the behavior of D-LinUCB in both slowly-varying and abruptly-changing environments. We obtain an upper bound on the dynamic regret that is of order d B T^{1/3}T^{2/3}, where B T is a measure of non-stationarity (d and T being, respectively, dimension and horizon). This rate is known to be optimal. We also illustrate the empirical performance of D-LinUCB and compare it with recently proposed alternatives in simulated environments.

PDF Details

IJCAI Conference 2017 Conference Paper

Bernoulli Rank-1 Bandits for Click Feedback

Sumeet Katariya
Branislav Kveton
Csaba Szepesvári
Claire Vernade
Zheng Wen

The probability that a user will click a search result depends both on its relevance and its position on the results page. The position based model explains this behavior by ascribing to every item an attraction probability, and to every position an examination probability. To be clicked, a result must be both attractive and examined. The probabilities of an item-position pair being clicked thus form the entries of a rank-1 matrix. We propose the learning problem of a Bernoulli rank-1 bandit where at each step, the learning agent chooses a pair of row and column arms, and receives the product of their Bernoulli-distributed values as a reward. This is a special case of the stochastic rank-1 bandit problem considered in recent work that proposed an elimination based algorithm Rank1Elim, and showed that Rank1Elim's regret scales linearly with the number of rows and columns on "benign" instances. These are the instances where the minimum of the average row and column rewards mu is bounded away from zero. The issue with Rank1Elim is that it fails to be competitive with straightforward bandit strategies as mu tends to 0. In this paper we propose Rank1ElimKL, which replaces the crude confidence intervals of Rank1Elim with confidence intervals based on Kullback-Leibler (KL) divergences. With the help of a novel result concerning the scaling of KL divergences we prove that with this change, our algorithm will be competitive no matter the value of mu. Experiments with synthetic data confirm that on benign instances the performance of Rank1ElimKL is significantly better than that of even Rank1Elim. Similarly, experiments with models derived from real-data confirm that the improvements are significant across the board, regardless of whether the data is benign or not.

PDF Details

UAI Conference 2017 Conference Paper

Stochastic Bandit Models for Delayed Conversions

Claire Vernade
Olivier Cappé
Vianney Perchet

Online advertising and product recommendation are important domains of applications for multi-armed bandit methods. In these fields, the reward that is immediately available is most often only a proxy for the actual outcome of interest, which we refer to as a conversion. For instance, in web advertising, clicks can be observed within a few seconds after an ad display but the corresponding sale –if any– will take hours, if not days to happen. This paper proposes and investigates a new stochastic multi-armed bandit model in the framework proposed by Chapelle (2014) –based on empirical studies in the field of web advertising– in which each action may trigger a future reward that will then happen with a stochastic delay. We assume that the probability of conversion associated with each action is unknown while the distribution of the conversion delay is known, distinguishing between the (idealized) case where the conversion events may be observed whatever their delay and the more realistic setting in which late conversions are censored. We provide performance lower bounds as well as two simple but efficient algorithms based on the UCB and KLUCB frameworks. The latter algorithm, which is preferable when conversion rates are low, is based on a Poissonization argument, of independent interest in other settings where aggregation of Bernoulli observations with different success probabilities is required.

Details

NeurIPS Conference 2016 Conference Paper

Multiple-Play Bandits in the Position-Based Model

Paul Lagrée
Claire Vernade
Olivier Cappe

Sequentially learning to place items in multi-position displays or lists is a task that can be cast into the multiple-play semi-bandit setting. However, a major concern in this context is when the system cannot decide whether the user feedback for each item is actually exploitable. Indeed, much of the content may have been simply ignored by the user. The present work proposes to exploit available information regarding the display position bias under the so-called Position-based click model (PBM). We first discuss how this model differs from the Cascade model and its variants considered in several recent works on multiple-play bandits. We then provide a novel regret lower bound for this model as well as computationally efficient algorithms that display good empirical and theoretical performance.

PDF Details