Arrow Research

Author name cluster

Peter Auer

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

27 papers
2 author rows

Possible papers (27)

NeurIPS 2025 · Conference Paper

Improved Best-of-Both-Worlds Regret for Bandits with Delayed Feedback

  • Ofir Schlisselberg
  • Tal Lancewicki
  • Peter Auer
  • Yishay Mansour

We study the multi-armed bandit problem with adversarially chosen delays in the Best-of-Both-Worlds (BoBW) framework, which aims to achieve near-optimal performance in both stochastic and adversarial environments. While prior work has made progress toward this goal, existing algorithms suffer from significant gaps to the known lower bounds, especially in the stochastic setting. Our main contribution is a new algorithm that, up to logarithmic factors, matches the known lower bounds in each setting individually. In the adversarial case, our algorithm achieves regret of $\widetilde{O}(\sqrt{KT} + \sqrt{D})$, which is optimal up to logarithmic terms, where $T$ is the number of rounds, $K$ is the number of arms, and $D$ is the cumulative delay. In the stochastic case, we provide a regret bound that scales as $\sum_{i: \Delta_i>0}(\log T/\Delta_i) + \frac{1}{K}\sum_i \Delta_i \sigma_{\max}$, where $\Delta_i$ is the suboptimality gap of arm $i$ and $\sigma_{\max}$ is the maximum number of missing observations. To the best of our knowledge, this is the first BoBW algorithm to simultaneously match the lower bounds in both the stochastic and adversarial regimes. Moreover, even beyond the BoBW setting, our stochastic regret bound is the first to match the known lower bound under adversarial delays, improving the second term over the best known result by a factor of $K$.
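
As a concrete illustration of the setting (not of the paper's algorithm), the sketch below simulates the delayed-feedback protocol: the reward of the arm pulled at round $t$ only becomes visible $d_t$ rounds later, and $D = \sum_t d_t$ is the cumulative delay entering the bounds above. All names here are hypothetical.

```python
import random

def delayed_bandit_run(means, delays, T, policy):
    """Hypothetical simulator for bandit feedback that arrives late:
    the reward of the arm pulled at round t is revealed to the learner
    only at round t + delays[t]; sum(delays) is the cumulative delay D
    from the regret bounds above."""
    pending = []    # (arrival_round, arm, reward) not yet revealed
    observed = []   # feedback the learner is allowed to use
    total = 0.0
    for t in range(T):
        observed += [(a, r) for (s, a, r) in pending if s <= t]
        pending = [(s, a, r) for (s, a, r) in pending if s > t]
        arm = policy(observed)                   # learner's choice
        r = float(random.random() < means[arm])  # Bernoulli reward
        total += r
        pending.append((t + delays[t], arm, r))
    return total
```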

EWRL 2024 · Workshop Paper

Understanding the Gaps in Satisficing Bandits

  • Chloé Rouyer
  • Ronald Ortner
  • Peter Auer

In this work, we consider a variation of the stochastic multi-armed bandit problem in which the learner is not necessarily trying to compete with the best arm, whose performance is not known ahead of time, but is satisfied with playing any arm that performs above a known satisficing threshold $S$. Michel et al. (2023) considered the \textit{satisficing regret} as the corresponding performance measure, which scales in terms of the gaps between the expected performance of an insufficient arm and the threshold $S$, rather than in terms of its gap with the best arm. While Michel et al. propose an algorithm that achieves time-independent satisficing regret, their results suffer when arms are too close to the threshold. Is this dependency unavoidable? The first contribution of our work is to provide an alternative and more general lower bound for the $K$-armed satisficing bandit problem, which highlights how the position of the threshold relative to the arms affects the bound. Then, we introduce an algorithm robust against unbalanced gaps, which enjoys a nearly matching time-independent upper bound. We also propose an alternative definition of the satisficing regret, which might be better tailored to measuring algorithm performance in these difficult instances, and we derive a lower bound for this regret. Finally, we include experiments to compare these different regret measures and our proposed algorithms empirically.
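
To make the performance measure concrete, here is one plausible reading of the satisficing regret described above as a short sketch (the function name and example are ours, not the paper's):

```python
def satisficing_regret(mu, S, plays):
    """Charge each pull of an insufficient arm (mean below the known
    threshold S) its gap to S; pulls of satisficing arms cost nothing.
    `mu` maps arm -> true mean, `plays` is the sequence of pulled arms."""
    return sum(max(0.0, S - mu[a]) for a in plays)

# With S = 0.5, only the two pulls of the 0.3-arm are penalized:
print(satisficing_regret({0: 0.7, 1: 0.3}, 0.5, [0, 1, 1, 0]))  # ~0.4
```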

IJCAI 2023 · Conference Paper

Autonomous Exploration for Navigating in MDPs Using Blackbox RL Algorithms

  • Pratik Gajane
  • Peter Auer
  • Ronald Ortner

We consider the problem of navigating in a Markov decision process where extrinsic rewards are either absent or ignored. In this setting, the objective is to learn policies to reach all the states that are reachable within a given number of steps (in expectation) from a starting state. We introduce a novel meta-algorithm which can use any online reinforcement learning algorithm (with appropriate regret guarantees) as a black box. Our algorithm demonstrates a method for transforming the output of online algorithms to a batch setting. We prove an upper bound on the sample complexity of our algorithm in terms of the regret bound of the black-box RL algorithm it uses. Furthermore, we provide experimental results to validate the effectiveness of our algorithm and the correctness of our theoretical results.

UAI 2019 · Conference Paper

Variational Regret Bounds for Reinforcement Learning

  • Ronald Ortner
  • Pratik Gajane
  • Peter Auer

We consider undiscounted reinforcement learning in Markov decision processes (MDPs) where \textit{both} the reward functions and the state-transition probabilities may vary (gradually or abruptly) over time. For this problem setting, we propose an algorithm and provide performance guarantees for the regret evaluated against the optimal non-stationary policy. The upper bound on the regret is given in terms of the total variation in the MDP. This is the first variational regret bound for the general reinforcement learning setting.

EWRL 2012 · Workshop Paper

Evaluation and Analysis of the Performance of the EXP3 Algorithm in Stochastic Environments

  • Yevgeny Seldin
  • Csaba Szepesvári
  • Peter Auer
  • Yasin Abbasi-Yadkori

EXP3 is a popular algorithm for adversarial multiarmed bandits, suggested and analyzed in this setting by Auer et al. [2002b]. Recently there was an increased interest in the performance of this algorithm in the stochastic setting, due to its new applications to stochastic multiarmed bandits with side information [Seldin et al., 2011] and to multiarmed bandits in the mixed stochastic-adversarial setting [Bubeck and Slivkins, 2012]. We present an empirical evaluation and improved analysis of the performance of the EXP3 algorithm in the stochastic setting, as well as a modification of the EXP3 algorithm capable of achieving “logarithmic” regret in stochastic environments.
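
For reference, a minimal sketch of the standard EXP3 update analyzed by Auer et al. [2002b]: exponential weights over arms, uniform exploration $\gamma$, and importance-weighted reward estimates. The `pull` callback is a stand-in for the environment.

```python
import math, random

def exp3(K, T, gamma, pull):
    """Standard EXP3 for rewards in [0, 1]: mix the weight distribution
    with uniform exploration, then feed an unbiased (importance-weighted)
    reward estimate back into the exponential weights."""
    w = [1.0] * K
    for _ in range(T):
        total = sum(w)
        p = [(1 - gamma) * wi / total + gamma / K for wi in w]
        arm = random.choices(range(K), weights=p)[0]
        x = pull(arm)                    # observed reward in [0, 1]
        xhat = x / p[arm]                # unbiased estimate of x_arm
        w[arm] *= math.exp(gamma * xhat / K)
    return w
```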

UAI 2012 · Conference Paper

PAC-Bayesian Inequalities for Martingales

  • Yevgeny Seldin
  • François Laviolette
  • Nicolò Cesa-Bianchi
  • John Shawe-Taylor
  • Peter Auer

We present a set of high-probability inequalities that control the concentration of weighted averages of multiple (possibly uncountably many) simultaneously evolving and interdependent martingales. Our results extend the PAC-Bayesian analysis in learning theory from the i.i.d. setting to martingales, opening the way for its application in reinforcement learning and other interactive learning domains, as well as many other domains in probability theory and statistics where martingales are encountered. We also present a comparison inequality that bounds the expectation of a convex function of a martingale difference sequence shifted to the $[0, 1]$ interval by the expectation of the same function of independent Bernoulli variables. This inequality is applied to derive a tighter analog of the Hoeffding–Azuma inequality.
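
For orientation, the classical inequality whose analog the comparison result tightens is the Hoeffding–Azuma bound, stated here in its standard form for a martingale difference sequence $X_1, \dots, X_n$ with $|X_i| \le c_i$ almost surely:

```latex
\Pr\left( \sum_{i=1}^{n} X_i \ge t \right)
  \le \exp\left( - \frac{t^2}{2 \sum_{i=1}^{n} c_i^2} \right)
```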

EWRL 2011 · Invited Paper

Invited Talk: UCRL and Autonomous Exploration

  • Peter Auer

After reviewing the main ingredients of the UCRL algorithm and its analysis for online reinforcement learning — exploration vs. exploitation, optimism in the face of uncertainty, consistency with observations and upper confidence bounds, regret analysis — I show how these techniques can also be used to derive PAC-MDP bounds which match the best currently available bounds for the discounted and the undiscounted setting. As is typical for reinforcement learning, the analysis for the undiscounted setting is significantly more involved. In the second part of my talk I consider a model for autonomous exploration, where an agent learns about its environment and how to navigate in it. Whereas evaluating autonomous exploration is typically difficult, in the presented setting rigorous performance bounds can be derived. For that we present an algorithm that optimistically explores, by repeatedly choosing the apparently closest unknown state — as indicated by an optimistic policy — for further exploration. Acknowledgements: This is joint work with Shiau Hong Lim. The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement 231495 (CompLACS).

NeurIPS 2011 · Conference Paper

PAC-Bayesian Analysis of Contextual Bandits

  • Yevgeny Seldin
  • Peter Auer
  • John Shawe-Taylor
  • Ronald Ortner
  • François Laviolette

We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The scaling of our regret bound with the number of states (contexts) $N$ goes as $\sqrt{N I_{\rho_t}(S; A)}$, where $I_{\rho_t}(S; A)$ is the mutual information between states and actions (the side information) used by the algorithm at round $t$. If the algorithm uses all the side information, the regret bound scales as $\sqrt{N \ln K}$, where $K$ is the number of actions (arms). However, if the side information $I_{\rho_t}(S; A)$ is not fully used, the regret bound is significantly tighter. In the extreme case, when $I_{\rho_t}(S; A) = 0$, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows us to provide the algorithm with a large amount of side information, let the algorithm decide which side information is relevant for the task, and penalize the algorithm only for the side information it actually uses. We also present an algorithm for multiarmed bandits with side information whose computational complexity is linear in the number of actions.
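
Since the bound is driven by the mutual information $I_{\rho_t}(S; A)$, a small illustrative computation of that quantity for a joint state-action distribution may help (the code is ours, not from the paper):

```python
import math

def mutual_information(rho):
    """I(S; A) in nats for a joint distribution rho[s][a] over states
    and actions, the quantity appearing in the regret bound above."""
    S, A = len(rho), len(rho[0])
    p_s = [sum(row) for row in rho]
    p_a = [sum(rho[s][a] for s in range(S)) for a in range(A)]
    return sum(rho[s][a] * math.log(rho[s][a] / (p_s[s] * p_a[a]))
               for s in range(S) for a in range(A) if rho[s][a] > 0)

# Independent states and actions give I(S; A) = 0, the regime where
# the bound's dependence on N drops from linear to logarithmic:
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
```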

JMLR 2010 · Journal Article

Near-optimal Regret Bounds for Reinforcement Learning

  • Thomas Jaksch
  • Ronald Ortner
  • Peter Auer

For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: an MDP has diameter $D$ if for any pair of states $s, s'$ there is a policy which moves from $s$ to $s'$ in at most $D$ steps (on average). We present a reinforcement learning algorithm with total regret $\widetilde{O}(DS\sqrt{AT})$ after $T$ steps for any unknown MDP with $S$ states, $A$ actions per state, and diameter $D$. A corresponding lower bound of $\Omega(\sqrt{DSAT})$ on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in $T$. Finally, we also consider a setting where the MDP is allowed to change a fixed number of $\ell$ times. We present a modification of our algorithm that is able to deal with this setting and show a regret bound of $\widetilde{O}(\ell^{1/3} T^{2/3} DS\sqrt{A})$.
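
The diameter is defined via expected travel times under a best policy, which in general requires a stochastic-shortest-path computation; in the special case of a deterministic MDP it collapses to ordinary shortest paths, as in this illustrative sketch:

```python
from collections import deque

def diameter_deterministic(succ):
    """Diameter of a deterministic MDP, given succ[s] = list of
    next-states (one entry per action). Here the minimal expected
    travel time between two states is just the BFS distance, so
    D = max over ordered state pairs of that distance. Assumes the
    MDP is communicating (every state reachable from every other)."""
    n, D = len(succ), 0
    for s in range(n):
        dist, q = {s: 0}, deque([s])
        while q:
            u = q.popleft()
            for v in succ[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        D = max(D, max(dist.values()))
    return D

# Two states joined by one deterministic action each way: D = 1.
print(diameter_deterministic([[1], [0]]))  # 1
```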

NeurIPS 2008 · Conference Paper

Near-optimal Regret Bounds for Reinforcement Learning

  • Peter Auer
  • Thomas Jaksch
  • Ronald Ortner

For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: an MDP has diameter $D$ if for any pair of states $s_1, s_2$ there is a policy which moves from $s_1$ to $s_2$ in at most $D$ steps (on average). We present a reinforcement learning algorithm with total regret $\widetilde{O}(DS\sqrt{AT})$ after $T$ steps for any unknown MDP with $S$ states, $A$ actions per state, and diameter $D$. This bound holds with high probability. We also present a corresponding lower bound of $\Omega(\sqrt{DSAT})$ on the total regret of any learning algorithm. Both bounds demonstrate the utility of the diameter as a structural parameter of the MDP.

NeurIPS 2006 · Conference Paper

Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

  • Peter Auer
  • Ronald Ortner

We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm's online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy.
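
As background, here is the optimism principle in its original bandit form, a minimal UCB1-style sketch; UCRL lifts this idea to MDPs, and this is not the UCRL algorithm itself:

```python
import math

def ucb1(K, T, pull):
    """UCB1-style index policy for stochastic bandits with rewards in
    [0, 1]: play each arm once, then always pull the arm maximizing
    empirical mean plus an upper-confidence term."""
    counts, sums = [0] * K, [0.0] * K
    for t in range(T):
        if t < K:
            arm = t  # initialization: one pull per arm
        else:
            arm = max(range(K), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
    return sums, counts
```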

JMLR 2002 · Journal Article

Using Confidence Bounds for Exploitation-Exploration Trade-offs

  • Peter Auer

We show how a standard tool from statistics --- namely confidence bounds --- can be used to elegantly deal with situations which exhibit an exploitation-exploration trade-off. Our technique for designing and analyzing algorithms for such situations is general and can be applied when an algorithm has to make exploitation-versus-exploration decisions based on uncertain information provided by a random process. We apply our technique to two models with such an exploitation-exploration trade-off. For the adversarial bandit problem with shifting, our new algorithm suffers only $O(\sqrt{ST})$ regret with high probability over $T$ trials with $S$ shifts. Such a regret bound was previously known only in expectation. The second model we consider is associative reinforcement learning with linear value functions. For this model our technique improves the regret from $O(T^{3/4})$ to $O(T^{1/2})$.

FOCS 2000 · Conference Paper

Using Upper Confidence Bounds for Online Learning

  • Peter Auer

We show how a standard tool from statistics, namely confidence bounds, can be used to elegantly deal with situations which exhibit an exploitation/exploration trade-off. Our technique for designing and analyzing algorithms for such situations is very general and can be applied when an algorithm has to make exploitation-versus-exploration decisions based on uncertain information provided by a random process. We consider two models with such an exploitation/exploration trade-off. For the adversarial bandit problem our new algorithm suffers only $\widetilde{O}(T^{1/2})$ regret over $T$ trials, which improves significantly over the previously best $\widetilde{O}(T^{2/3})$ regret. We also extend our results for the adversarial bandit problem to shifting bandits. The second model we consider is associative reinforcement learning with linear value functions. For this model our technique improves the regret from $\widetilde{O}(T^{3/4})$ to $\widetilde{O}(T^{1/2})$.

NeurIPS 1995 · Conference Paper

Exponentially many local minima for single neurons

  • Peter Auer
  • Mark Herbster
  • Manfred K. Warmuth

We show that for a single neuron with the logistic function as the transfer function the number of local minima of the error function based on the square loss can grow exponentially in the dimension.
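
For concreteness, the error function whose local minima are counted can be written in a few lines (illustrative sketch only):

```python
import math

def squared_error(w, xs, ys):
    """E(w) = sum_i (sigmoid(w . x_i) - y_i)^2: the square-loss error
    of a single neuron with the logistic transfer function."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return sum((sigmoid(dot(w, x)) - y) ** 2 for x, y in zip(xs, ys))
```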

FOCS 1995 · Conference Paper

Gambling in a Rigged Casino: The Adversarial Multi-Arm Bandit Problem

  • Peter Auer
  • Nicolò Cesa-Bianchi
  • Yoav Freund
  • Robert E. Schapire

In the multi-armed bandit problem, a gambler must decide which arm of $K$ non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines. In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of $T$ plays, we prove that the expected per-round payoff of our algorithm approaches that of the best arm at the rate $O(T^{-1/3})$, and we give an improved rate of convergence when the best arm has fairly low payoff. We also consider a setting in which the player has a team of "experts" advising him on which arm to play; here, we give a strategy that will guarantee expected payoff close to that of the best expert. Finally, we apply our result to the problem of learning to play an unknown repeated matrix game against an all-powerful adversary.

FOCS 1995 · Conference Paper

Tracking the Best Disjunction

  • Peter Auer
  • Manfred K. Warmuth

N. Littlestone developed a simple deterministic on-line learning algorithm for learning $k$-literal disjunctions. This algorithm (called Winnow) keeps one weight for each of the $n$ variables and does multiplicative updates to its weights. We develop a randomized version of Winnow and prove bounds for an adaptation of the algorithm for the case when the disjunction may change over time. In this case a possible target disjunction schedule $T$ is a sequence of disjunctions (one per trial), and the shift size is the total number of literals that are added or removed from the disjunctions as one progresses through the sequence. We develop an algorithm that predicts nearly as well as the best disjunction schedule for an arbitrary sequence of examples. The algorithm that allows us to track the predictions of the best disjunction is hardly more complex than the original version; however, the amortized analysis needed for obtaining worst-case mistake bounds requires new techniques. In some cases our lower bounds show that the upper bounds of our algorithm have the right constant in front of the leading term in the mistake bound, and almost the right constant in front of the second leading term. By combining the tracking capability with existing applications of Winnow we are able to enhance these applications to the shifting case as well.
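
For reference, a compact sketch of the static Winnow update that the tracking algorithm builds on (the variant with multiplicative demotion; the randomized, shifting version described above is more involved):

```python
def winnow(examples, n, alpha=2.0):
    """Littlestone-style Winnow for learning disjunctions over n boolean
    variables: one weight per variable, threshold n, and multiplicative
    updates on mistakes only."""
    w = [1.0] * n
    mistakes = 0
    for x, y in examples:  # x is a 0/1 vector of length n, y in {0, 1}
        yhat = int(sum(wi * xi for wi, xi in zip(w, x)) >= n)
        if yhat != y:
            mistakes += 1
            if y == 1:   # promotion: scale up weights of active variables
                w = [wi * alpha if xi else wi for wi, xi in zip(w, x)]
            else:        # demotion: scale them down
                w = [wi / alpha if xi else wi for wi, xi in zip(w, x)]
    return w, mistakes
```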