Arrow Research

Author name cluster

Peter Auer

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

27 papers
2 author rows

Possible papers (27)

NeurIPS 2025 · Conference Paper

Improved Best-of-Both-Worlds Regret for Bandits with Delayed Feedback

  • Ofir Schlisselberg
  • Tal Lancewicki
  • Peter Auer
  • Yishay Mansour

We study the multi-armed bandit problem with adversarially chosen delays in the Best-of-Both-Worlds (BoBW) framework, which aims to achieve near-optimal performance in both stochastic and adversarial environments. While prior work has made progress toward this goal, existing algorithms suffer from significant gaps to the known lower bounds, especially in the stochastic setting. Our main contribution is a new algorithm that, up to logarithmic factors, matches the known lower bounds in each setting individually. In the adversarial case, our algorithm achieves regret of $\widetilde{O}(\sqrt{KT} + \sqrt{D})$, which is optimal up to logarithmic terms, where $T$ is the number of rounds, $K$ is the number of arms, and $D$ is the cumulative delay. In the stochastic case, we provide a regret bound that scales as $\sum_{i: \Delta_i>0}(\log T/\Delta_i) + \frac{1}{K}\sum_i \Delta_i \sigma_{\max}$, where $\Delta_i$ is the suboptimality gap of arm $i$ and $\sigma_{\max}$ is the maximum number of missing observations. To the best of our knowledge, this is the first BoBW algorithm to simultaneously match the lower bounds in both the stochastic and adversarial regimes. Moreover, even beyond the BoBW setting, our stochastic regret bound is the first to match the known lower bound under adversarial delays, improving the second term over the best known result by a factor of $K$.
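
As a concrete illustration of the setting (not of the paper's algorithm), the sketch below simulates the delayed-feedback protocol: the reward of the arm pulled at round $t$ only becomes visible $d_t$ rounds later, and $D = \sum_t d_t$ is the cumulative delay entering the bounds above. All names here are hypothetical.

```python
import random

def delayed_bandit_run(means, delays, T, policy):
    """Hypothetical simulator for bandit feedback that arrives late:
    the reward of the arm pulled at round t is revealed to the learner
    only at round t + delays[t]; sum(delays) is the cumulative delay D
    from the regret bounds above."""
    pending = []    # (arrival_round, arm, reward) not yet revealed
    observed = []   # feedback the learner is allowed to use
    total = 0.0
    for t in range(T):
        observed += [(a, r) for (s, a, r) in pending if s <= t]
        pending = [(s, a, r) for (s, a, r) in pending if s > t]
        arm = policy(observed)                   # learner's choice
        r = float(random.random() < means[arm])  # Bernoulli reward
        total += r
        pending.append((t + delays[t], arm, r))
    return total
```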

EWRL 2024 · Workshop Paper

Understanding the Gaps in Satisficing Bandits

  • Chloé Rouyer
  • Ronald Ortner
  • Peter Auer

In this work, we consider a variation of the stochastic multi-armed bandit problem in which the learner is not necessarily trying to compete with the best arm, whose performance is not known ahead of time, but is satisfied with playing any arm that performs above a known satisficing threshold $S$. Michel et al. (2023) considered the \textit{satisficing regret} as the corresponding performance measure, which scales in terms of the gaps between the expected performance of an insufficient arm and the threshold $S$, rather than in terms of its gap with the best arm. While Michel et al. propose an algorithm that achieves time-independent satisficing regret, their results suffer when arms are too close to the threshold. Is this dependency unavoidable? The first contribution of our work is to provide an alternative and more general lower bound for the $K$-armed satisficing bandit problem, which highlights how the position of the threshold relative to the arms affects the bound. Then, we introduce an algorithm robust against unbalanced gaps, which enjoys a nearly matching time-independent upper bound. We also propose an alternative definition of the satisficing regret, which might be better tailored to measuring algorithm performance in these difficult instances, and we derive a lower bound for this regret. Finally, we include experiments to compare these different regret measures and our proposed algorithms empirically.
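
To make the performance measure concrete, here is one plausible reading of the satisficing regret described above as a short sketch (the function name and example are ours, not the paper's):

```python
def satisficing_regret(mu, S, plays):
    """Charge each pull of an insufficient arm (mean below the known
    threshold S) its gap to S; pulls of satisficing arms cost nothing.
    `mu` maps arm -> true mean, `plays` is the sequence of pulled arms."""
    return sum(max(0.0, S - mu[a]) for a in plays)

# With S = 0.5, only the two pulls of the 0.3-arm are penalized:
print(satisficing_regret({0: 0.7, 1: 0.3}, 0.5, [0, 1, 1, 0]))  # ~0.4
```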

IJCAI 2023 · Conference Paper

Autonomous Exploration for Navigating in MDPs Using Blackbox RL Algorithms

  • Pratik Gajane
  • Peter Auer
  • Ronald Ortner

We consider the problem of navigating in a Markov decision process where extrinsic rewards are either absent or ignored. In this setting, the objective is to learn policies to reach all the states that are reachable within a given number of steps (in expectation) from a starting state. We introduce a novel meta-algorithm which can use any online reinforcement learning algorithm (with appropriate regret guarantees) as a black box. Our algorithm demonstrates a method for transforming the output of online algorithms to a batch setting. We prove an upper bound on the sample complexity of our algorithm in terms of the regret bound of the black-box RL algorithm it uses. Furthermore, we provide experimental results to validate the effectiveness of our algorithm and the correctness of our theoretical results.

UAI 2019 · Conference Paper

Variational Regret Bounds for Reinforcement Learning

  • Ronald Ortner
  • Pratik Gajane
  • Peter Auer

We consider undiscounted reinforcement learning in Markov decision processes (MDPs) where \textit{both} the reward functions and the state-transition probabilities may vary (gradually or abruptly) over time. For this problem setting, we propose an algorithm and provide performance guarantees for the regret evaluated against the optimal non-stationary policy. The upper bound on the regret is given in terms of the total variation in the MDP. This is the first variational regret bound for the general reinforcement learning setting.

EWRL 2012 · Workshop Paper

Evaluation and Analysis of the Performance of the EXP3 Algorithm in Stochastic Environments

  • Yevgeny Seldin
  • Csaba Szepesvári
  • Peter Auer
  • Yasin Abbasi-Yadkori

EXP3 is a popular algorithm for adversarial multiarmed bandits, suggested and analyzed in this setting by Auer et al. [2002b]. Recently there was an increased interest in the performance of this algorithm in the stochastic setting, due to its new applications to stochastic multiarmed bandits with side information [Seldin et al., 2011] and to multiarmed bandits in the mixed stochastic-adversarial setting [Bubeck and Slivkins, 2012]. We present an empirical evaluation and improved analysis of the performance of the EXP3 algorithm in the stochastic setting, as well as a modification of the EXP3 algorithm capable of achieving “logarithmic” regret in stochastic environments.
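
For reference, a minimal sketch of the standard EXP3 update analyzed by Auer et al. [2002b]: exponential weights over arms, uniform exploration $\gamma$, and importance-weighted reward estimates. The `pull` callback is a stand-in for the environment.

```python
import math, random

def exp3(K, T, gamma, pull):
    """Standard EXP3 for rewards in [0, 1]: mix the weight distribution
    with uniform exploration, then feed an unbiased (importance-weighted)
    reward estimate back into the exponential weights."""
    w = [1.0] * K
    for _ in range(T):
        total = sum(w)
        p = [(1 - gamma) * wi / total + gamma / K for wi in w]
        arm = random.choices(range(K), weights=p)[0]
        x = pull(arm)                    # observed reward in [0, 1]
        xhat = x / p[arm]                # unbiased estimate of x_arm
        w[arm] *= math.exp(gamma * xhat / K)
    return w
```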

UAI 2012 · Conference Paper

PAC-Bayesian Inequalities for Martingales

  • Yevgeny Seldin
  • François Laviolette
  • Nicolò Cesa-Bianchi
  • John Shawe-Taylor
  • Peter Auer

We present a set of high-probability inequalities that control the concentration of weighted averages of multiple (possibly uncountably many) simultaneously evolving and interdependent martingales. Our results extend the PAC-Bayesian analysis in learning theory from the i.i.d. setting to martingales, opening the way for its application in reinforcement learning and other interactive learning domains, as well as many other domains in probability theory and statistics where martingales are encountered. We also present a comparison inequality that bounds the expectation of a convex function of a martingale difference sequence shifted to the $[0, 1]$ interval by the expectation of the same function of independent Bernoulli variables. This inequality is applied to derive a tighter analog of the Hoeffding–Azuma inequality.
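
For orientation, the classical inequality whose analog the comparison result tightens is the Hoeffding–Azuma bound, stated here in its standard form for a martingale difference sequence $X_1, \dots, X_n$ with $|X_i| \le c_i$ almost surely:

```latex
\Pr\left( \sum_{i=1}^{n} X_i \ge t \right)
  \le \exp\left( - \frac{t^2}{2 \sum_{i=1}^{n} c_i^2} \right)
```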

EWRL 2011 · Invited Paper

Invited Talk: UCRL and Autonomous Exploration

  • Peter Auer

After reviewing the main ingredients of the UCRL algorithm and its analysis for online reinforcement learning — exploration vs. exploitation, optimism in the face of uncertainty, consistency with observations and upper confidence bounds, regret analysis — I show how these techniques can also be used to derive PAC-MDP bounds which match the best currently available bounds for the discounted and the undiscounted setting. As is typical for reinforcement learning, the analysis for the undiscounted setting is significantly more involved. In the second part of my talk I consider a model for autonomous exploration, where an agent learns about its environment and how to navigate in it. Whereas evaluating autonomous exploration is typically difficult, in the presented setting rigorous performance bounds can be derived. For that we present an algorithm that optimistically explores, by repeatedly choosing the apparently closest unknown state — as indicated by an optimistic policy — for further exploration. Acknowledgements: This is joint work with Shiau Hong Lim. The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement 231495 (CompLACS).

NeurIPS 2011 · Conference Paper

PAC-Bayesian Analysis of Contextual Bandits

  • Yevgeny Seldin
  • Peter Auer
  • John Shawe-Taylor
  • Ronald Ortner
  • François Laviolette

We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The scaling of our regret bound with the number of states (contexts) $N$ goes as $\sqrt{N I_{\rho_t}(S; A)}$, where $I_{\rho_t}(S; A)$ is the mutual information between states and actions (the side information) used by the algorithm at round $t$. If the algorithm uses all the side information, the regret bound scales as $\sqrt{N \ln K}$, where $K$ is the number of actions (arms). However, if the side information $I_{\rho_t}(S; A)$ is not fully used, the regret bound is significantly tighter. In the extreme case, when $I_{\rho_t}(S; A) = 0$, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows us to provide the algorithm with a large amount of side information, let the algorithm decide which side information is relevant for the task, and penalize the algorithm only for the side information it actually uses. We also present an algorithm for multiarmed bandits with side information whose computational complexity is linear in the number of actions.
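
Since the bound is driven by the mutual information $I_{\rho_t}(S; A)$, a small illustrative computation of that quantity for a joint state-action distribution may help (the code is ours, not from the paper):

```python
import math

def mutual_information(rho):
    """I(S; A) in nats for a joint distribution rho[s][a] over states
    and actions, the quantity appearing in the regret bound above."""
    S, A = len(rho), len(rho[0])
    p_s = [sum(row) for row in rho]
    p_a = [sum(rho[s][a] for s in range(S)) for a in range(A)]
    return sum(rho[s][a] * math.log(rho[s][a] / (p_s[s] * p_a[a]))
               for s in range(S) for a in range(A) if rho[s][a] > 0)

# Independent states and actions give I(S; A) = 0, the regime where
# the bound's dependence on N drops from linear to logarithmic:
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
```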

JMLR 2010 · Journal Article

Near-optimal Regret Bounds for Reinforcement Learning

  • Thomas Jaksch
  • Ronald Ortner
  • Peter Auer

For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: an MDP has diameter $D$ if for any pair of states $s, s'$ there is a policy which moves from $s$ to $s'$ in at most $D$ steps (on average). We present a reinforcement learning algorithm with total regret $\widetilde{O}(DS\sqrt{AT})$ after $T$ steps for any unknown MDP with $S$ states, $A$ actions per state, and diameter $D$. A corresponding lower bound of $\Omega(\sqrt{DSAT})$ on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in $T$. Finally, we also consider a setting where the MDP is allowed to change a fixed number of $\ell$ times. We present a modification of our algorithm that is able to deal with this setting and show a regret bound of $\widetilde{O}(\ell^{1/3} T^{2/3} DS\sqrt{A})$.
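
The diameter is defined via expected travel times under a best policy, which in general requires a stochastic-shortest-path computation; in the special case of a deterministic MDP it collapses to ordinary shortest paths, as in this illustrative sketch:

```python
from collections import deque

def diameter_deterministic(succ):
    """Diameter of a deterministic MDP, given succ[s] = list of
    next-states (one entry per action). Here the minimal expected
    travel time between two states is just the BFS distance, so
    D = max over ordered state pairs of that distance. Assumes the
    MDP is communicating (every state reachable from every other)."""
    n, D = len(succ), 0
    for s in range(n):
        dist, q = {s: 0}, deque([s])
        while q:
            u = q.popleft()
            for v in succ[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        D = max(D, max(dist.values()))
    return D

# Two states joined by one deterministic action each way: D = 1.
print(diameter_deterministic([[1], [0]]))  # 1
```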

NeurIPS 2008 · Conference Paper

Near-optimal Regret Bounds for Reinforcement Learning

  • Peter Auer
  • Thomas Jaksch
  • Ronald Ortner

For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: an MDP has diameter $D$ if for any pair of states $s_1, s_2$ there is a policy which moves from $s_1$ to $s_2$ in at most $D$ steps (on average). We present a reinforcement learning algorithm with total regret $\widetilde{O}(DS\sqrt{AT})$ after $T$ steps for any unknown MDP with $S$ states, $A$ actions per state, and diameter $D$. This bound holds with high probability. We also present a corresponding lower bound of $\Omega(\sqrt{DSAT})$ on the total regret of any learning algorithm. Both bounds demonstrate the utility of the diameter as a structural parameter of the MDP.

NeurIPS 2006 · Conference Paper

Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

  • Peter Auer
  • Ronald Ortner

We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm's online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy.
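
As background, here is the optimism principle in its original bandit form, a minimal UCB1-style sketch; UCRL lifts this idea to MDPs, and this is not the UCRL algorithm itself:

```python
import math

def ucb1(K, T, pull):
    """UCB1-style index policy for stochastic bandits with rewards in
    [0, 1]: play each arm once, then always pull the arm maximizing
    empirical mean plus an upper-confidence term."""
    counts, sums = [0] * K, [0.0] * K
    for t in range(T):
        if t < K:
            arm = t  # initialization: one pull per arm
        else:
            arm = max(range(K), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
    return sums, counts
```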

JMLR 2002 · Journal Article

Using Confidence Bounds for Exploitation-Exploration Trade-offs

  • Peter Auer

We show how a standard tool from statistics --- namely confidence bounds --- can be used to elegantly deal with situations which exhibit an exploitation-exploration trade-off. Our technique for designing and analyzing algorithms for such situations is general and can be applied when an algorithm has to make exploitation-versus-exploration decisions based on uncertain information provided by a random process. We apply our technique to two models with such an exploitation-exploration trade-off. For the adversarial bandit problem with shifting, our new algorithm suffers only $O(\sqrt{ST})$ regret with high probability over $T$ trials with $S$ shifts. Such a regret bound was previously known only in expectation. The second model we consider is associative reinforcement learning with linear value functions. For this model our technique improves the regret from $O(T^{3/4})$ to $O(T^{1/2})$.

FOCS 2000 · Conference Paper

Using Upper Confidence Bounds for Online Learning

  • Peter Auer

We show how a standard tool from statistics, namely confidence bounds, can be used to elegantly deal with situations which exhibit an exploitation/exploration trade-off. Our technique for designing and analyzing algorithms for such situations is very general and can be applied when an algorithm has to make exploitation-versus-exploration decisions based on uncertain information provided by a random process. We consider two models with such an exploitation/exploration trade-off. For the adversarial bandit problem our new algorithm suffers only $\widetilde{O}(T^{1/2})$ regret over $T$ trials, which improves significantly over the previously best $\widetilde{O}(T^{2/3})$ regret. We also extend our results for the adversarial bandit problem to shifting bandits. The second model we consider is associative reinforcement learning with linear value functions. For this model our technique improves the regret from $\widetilde{O}(T^{3/4})$ to $\widetilde{O}(T^{1/2})$.

NeurIPS 1995 · Conference Paper

Exponentially many local minima for single neurons

  • Peter Auer
  • Mark Herbster
  • Manfred K. Warmuth

We show that for a single neuron with the logistic function as the transfer function the number of local minima of the error function based on the square loss can grow exponentially in the dimension.
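
For concreteness, the error function whose local minima are counted can be written in a few lines (illustrative sketch only):

```python
import math

def squared_error(w, xs, ys):
    """E(w) = sum_i (sigmoid(w . x_i) - y_i)^2: the square-loss error
    of a single neuron with the logistic transfer function."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return sum((sigmoid(dot(w, x)) - y) ** 2 for x, y in zip(xs, ys))
```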

FOCS 1995 · Conference Paper

Gambling in a Rigged Casino: The Adversarial Multi-Arm Bandit Problem

  • Peter Auer
  • Nicolò Cesa-Bianchi
  • Yoav Freund
  • Robert E. Schapire

In the multi-armed bandit problem, a gambler must decide which arm of $K$ non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines. In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of $T$ plays, we prove that the expected per-round payoff of our algorithm approaches that of the best arm at the rate $O(T^{-1/3})$, and we give an improved rate of convergence when the best arm has fairly low payoff. We also consider a setting in which the player has a team of "experts" advising him on which arm to play; here, we give a strategy that will guarantee expected payoff close to that of the best expert. Finally, we apply our result to the problem of learning to play an unknown repeated matrix game against an all-powerful adversary.

FOCS 1995 · Conference Paper

Tracking the Best Disjunction

  • Peter Auer
  • Manfred K. Warmuth

N. Littlestone developed a simple deterministic on-line learning algorithm for learning $k$-literal disjunctions. This algorithm (called Winnow) keeps one weight for each of the $n$ variables and does multiplicative updates to its weights. We develop a randomized version of Winnow and prove bounds for an adaptation of the algorithm for the case when the disjunction may change over time. In this case a possible target disjunction schedule $T$ is a sequence of disjunctions (one per trial), and the shift size is the total number of literals that are added or removed from the disjunctions as one progresses through the sequence. We develop an algorithm that predicts nearly as well as the best disjunction schedule for an arbitrary sequence of examples. The algorithm that allows us to track the predictions of the best disjunction is hardly more complex than the original version; however, the amortized analysis needed for obtaining worst-case mistake bounds requires new techniques. In some cases our lower bounds show that the upper bounds of our algorithm have the right constant in front of the leading term in the mistake bound, and almost the right constant in front of the second leading term. By combining the tracking capability with existing applications of Winnow we are able to enhance these applications to the shifting case as well.
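
For reference, a compact sketch of the static Winnow update that the tracking algorithm builds on (the variant with multiplicative demotion; the randomized, shifting version described above is more involved):

```python
def winnow(examples, n, alpha=2.0):
    """Littlestone-style Winnow for learning disjunctions over n boolean
    variables: one weight per variable, threshold n, and multiplicative
    updates on mistakes only."""
    w = [1.0] * n
    mistakes = 0
    for x, y in examples:  # x is a 0/1 vector of length n, y in {0, 1}
        yhat = int(sum(wi * xi for wi, xi in zip(w, x)) >= n)
        if yhat != y:
            mistakes += 1
            if y == 1:   # promotion: scale up weights of active variables
                w = [wi * alpha if xi else wi for wi, xi in zip(w, x)]
            else:        # demotion: scale them down
                w = [wi / alpha if xi else wi for wi, xi in zip(w, x)]
    return w, mistakes
```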