Arrow Research search

Author name cluster

Yishay Mansour

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

173 papers
2 author rows

Possible papers

173

AAAI Conference 2025 Conference Paper

Batch Ensemble for Variance Dependent Regret in Stochastic Bandits

  • Asaf Cassel
  • Orin Levy
  • Yishay Mansour

Efficiently trading off exploration and exploitation is one of the key challenges in online Reinforcement Learning (RL). Most works achieve this by carefully estimating the model uncertainty and following the so-called optimistic model. Inspired by practical ensemble methods, in this work we propose a simple and novel batch ensemble scheme that provably achieves near-optimal regret for stochastic Multi-Armed Bandits (MAB). Crucially, our algorithm has just a single parameter, namely the number of batches, and its value does not depend on distributional properties such as the scale and variance of the losses. We complement our theoretical results by demonstrating the effectiveness of our algorithm on synthetic benchmarks.

EWRL Workshop 2025 Workshop Paper

Convergence and Sample Complexity of First-Order Methods for Agnostic Reinforcement Learning

  • Uri Sherman
  • Tomer Koren
  • Yishay Mansour

We study reinforcement learning (RL) in the agnostic policy learning setting, where the goal is to find a policy whose performance is competitive with the best policy in a given class of interest $\Pi$---crucially, without assuming that $\Pi$ contains the optimal policy. We propose a general policy learning framework that reduces this problem to first-order optimization in a non-Euclidean space, leading to new algorithms as well as shedding light on the convergence properties of existing ones. Specifically, under the assumption that $\Pi$ is convex and satisfies a variational gradient dominance (VGD) condition---an assumption known to be strictly weaker than more standard completeness and coverability conditions---we obtain sample complexity upper bounds for three policy learning algorithms: \emph{(i)} Steepest Descent Policy Optimization, derived from a constrained steepest descent method for non-convex optimization; \emph{(ii)} the classical Conservative Policy Iteration algorithm \citep{kakade2002approximately} reinterpreted through the lens of the Frank-Wolfe method, which leads to improved convergence results; and \emph{(iii)} an on-policy instantiation of the well-studied Policy Mirror Descent algorithm. Finally, we empirically evaluate the VGD condition across several standard environments, demonstrating the practical relevance of our key assumption.

ICML Conference 2025 Conference Paper

Convergence of Policy Mirror Descent Beyond Compatible Function Approximation

  • Uri Sherman
  • Tomer Koren
  • Yishay Mansour

Modern policy optimization methods roughly follow the policy mirror descent (PMD) algorithmic template, for which there are by now numerous theoretical convergence results. However, most of these either target tabular environments, or can be applied effectively only when the class of policies being optimized over satisfies strong closure conditions, which is typically not the case when working with parametric policy classes in large-scale environments. In this work, we develop a theoretical framework for PMD for general policy classes where we replace the closure conditions with a generally weaker variational gradient dominance assumption, and obtain upper bounds on the rate of convergence to the best-in-class policy. Our main result leverages a novel notion of smoothness with respect to a local norm induced by the occupancy measure of the current policy, and casts PMD as a particular instance of smooth non-convex optimization in non-Euclidean space.
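The PMD template the abstract refers to can be illustrated in the simplest tabular case. The sketch below is a generic tabular policy mirror descent with a KL mirror map (exponentiated Q-value updates) on a small random MDP; it is not the paper's function-approximation framework, and the toy MDP, step size `eta`, and iteration count are hypothetical.

```python
import numpy as np

# Tabular policy mirror descent (PMD) sketch with a KL mirror map.
# Illustrates the generic PMD template only; the random MDP below is hypothetical.

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = next-state distribution
r = rng.random((S, A))                        # rewards in [0, 1]

def q_values(pi):
    """Exact Q^pi via the Bellman equation."""
    r_pi = np.einsum("sa,sa->s", pi, r)                  # expected reward per state
    P_pi = np.einsum("sa,sat->st", pi, P)                # state transitions under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)  # V^pi
    return r + gamma * P @ v                             # Q[s, a]

pi = np.full((S, A), 1.0 / A)  # start from the uniform policy
eta = 1.0
for _ in range(100):
    Q = q_values(pi)
    pi = pi * np.exp(eta * Q)           # mirror (multiplicative-weights) update
    pi /= pi.sum(axis=1, keepdims=True)

# After enough iterations, PMD concentrates on greedy actions in each state.
print(np.allclose(pi.argmax(axis=1), q_values(pi).argmax(axis=1)))
```

With exact Q-values the KL-based update reduces to multiplicative weights over actions, which is the standard tabular instance of the PMD template.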

AAAI Conference 2025 Conference Paper

Delay as Payoff in MAB

  • Ofir Schlisselberg
  • Ido Cohen
  • Tal Lancewicki
  • Yishay Mansour

In this paper, we investigate a variant of the classical stochastic Multi-armed Bandit (MAB) problem, where the payoff received by an agent (either cost or reward) is both delayed, and directly corresponds to the magnitude of the delay. This setting faithfully models many real-world scenarios, such as the time it takes for a data packet to traverse a network given a choice of route (where delay serves as the agent's cost), or a user's time spent on a web page given a choice of content (where delay serves as the agent's reward). Our main contributions are tight upper and lower bounds for both the cost and reward settings. For the case that delays serve as costs, which we are the first to consider, we prove optimal regret that scales as $\sum_{i:\Delta_i > 0}(\log T)/\Delta_i + d^*$, where $T$ is the maximal number of steps, $\Delta_i$ are the sub-optimality gaps, and $d^*$ is the minimal expected delay amongst arms. For the case that delays serve as rewards, we show optimal regret of $\sum_{i:\Delta_i > 0}(\log T)/\Delta_i + \bar{d}$, where $\bar{d}$ is the second maximal expected delay. These improve over the regret in the general delay-dependent payoff setting, which scales as $\sum_{i:\Delta_i > 0}(\log T)/\Delta_i + D$, where $D$ is the maximum possible delay. Our regret bounds highlight the difference between the cost and reward scenarios, showing that the improvement in the cost scenario is more significant than for the reward. Finally, we accompany our theoretical results with an empirical evaluation.
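The cost setting described above can be simulated in a few lines: each pull of an arm incurs a random delay, and that delay is itself the cost. The sketch below is only an illustration of the problem statement with a naive explore-then-commit baseline, not the paper's algorithm; the exponential delay distributions and arm means are hypothetical.

```python
import random

# Toy simulation of the delay-as-payoff (cost) setting: pulling arm i incurs a
# random delay, and the delay itself is the cost paid. The arm means and the
# explore-then-commit baseline are hypothetical illustrations.

random.seed(0)
means = [1.0, 2.0, 3.0]  # expected delay (= expected cost) per arm

def pull(arm):
    """Return the delay incurred by pulling `arm` (exponentially distributed)."""
    return random.expovariate(1.0 / means[arm])

T = 10_000
# Naive baseline: sample each arm 100 times, then commit to the empirically
# fastest (cheapest) arm for the remaining horizon.
estimates = [sum(pull(a) for _ in range(100)) / 100 for a in range(3)]
best = min(range(3), key=lambda a: estimates[a])
avg_cost = sum(pull(best) for _ in range(T)) / T
print(best, avg_cost)  # commits to the low-delay arm; average cost near its mean
```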

ICML Conference 2025 Conference Paper

Dueling Convex Optimization with General Preferences

  • Aadirupa Saha
  • Tomer Koren
  • Yishay Mansour

We address the problem of convex optimization with dueling feedback, where the goal is to minimize a convex function given a weaker form of dueling feedback. Each query consists of two points and the dueling feedback returns a (noisy) single-bit binary comparison of the function values of the two queried points. The translation of the function values to the single comparison bit is through a transfer function. This problem has been addressed previously for some restricted classes of transfer functions, but here we consider a very general transfer function class which includes all functions that admit a series expansion about the origin. Our main contribution is an efficient algorithm with convergence rate of $O(\epsilon^{-4p})$ for smooth convex functions, and an optimal rate of $\widetilde O(\epsilon^{-2p})$ when the objective is both smooth and strongly convex, where $p$ is the minimal degree (with a non-zero coefficient) in the transfer’s series expansion about the origin.
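The dueling feedback model in this abstract is easy to state concretely: a query is a pair of points, and the oracle returns one noisy bit whose bias depends on the difference of function values through the transfer function. The sketch below is a minimal hypothetical instance (a quadratic objective and a clipped linear transfer, i.e. $p = 1$), not the paper's algorithm.

```python
import random

# Minimal sketch of a dueling feedback oracle: each query is a pair (x, y),
# and the oracle returns a noisy bit indicating which point has the smaller
# function value, with bias governed by a transfer function sigma.
# The objective f and the transfer sigma are hypothetical examples.

random.seed(0)

def f(x):
    return (x - 2.0) ** 2  # convex objective, minimized at x = 2

def sigma(z):
    """Odd, bounded transfer function with non-zero linear term (p = 1)."""
    return max(-0.4, min(0.4, z / 4))

def duel(x, y):
    """Return 1 if x is (noisily) judged better than y, else 0."""
    p = 0.5 + sigma(f(y) - f(x))
    return 1 if random.random() < p else 0

# Averaging many duels recovers the comparison direction:
wins = sum(duel(1.9, 3.0) for _ in range(2000)) / 2000
print(wins > 0.5)  # x = 1.9 has smaller loss, so it should win most duels
```

A degenerate transfer (e.g. $\sigma(z) = z^3/4$, so $p = 3$) flattens the bias near ties, which is what drives the $\epsilon^{-4p}$ and $\epsilon^{-2p}$ rates in the abstract.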

NeurIPS Conference 2025 Conference Paper

Improved Best-of-Both-Worlds Regret for Bandits with Delayed Feedback

  • Ofir Schlisselberg
  • Tal Lancewicki
  • Peter Auer
  • Yishay Mansour

We study the multi-armed bandit problem with adversarially chosen delays in the Best-of-Both-Worlds (BoBW) framework, which aims to achieve near-optimal performance in both stochastic and adversarial environments. While prior work has made progress toward this goal, existing algorithms suffer from significant gaps to the known lower bounds, especially in the stochastic setting. Our main contribution is a new algorithm that, up to logarithmic factors, matches the known lower bounds in each setting individually. In the adversarial case, our algorithm achieves regret of $\widetilde{O}(\sqrt{KT} + \sqrt{D})$, which is optimal up to logarithmic terms, where $T$ is the number of rounds, $K$ is the number of arms, and $D$ is the cumulative delay. In the stochastic case, we provide a regret bound which scales as $\sum_{i: \Delta_i>0}(\log T/\Delta_i) + \frac{1}{K}\sum \Delta_i \sigma_{\max}$, where $\Delta_i$ is the suboptimality gap of arm $i$ and $\sigma_{\max}$ is the maximum number of missing observations. To the best of our knowledge, this is the first BoBW algorithm to simultaneously match the lower bounds in both stochastic and adversarial regimes. Moreover, even beyond the BoBW setting, our stochastic regret bound is the first to match the known lower bound under adversarial delays, improving the second term over the best known result by a factor of $K$.

NeurIPS Conference 2025 Conference Paper

Individual Regret in Cooperative Stochastic Multi-Armed Bandits

  • Idan Barnea
  • Tal Lancewicki
  • Yishay Mansour

We study the regret in stochastic Multi-Armed Bandits (MAB) with multiple agents that communicate over an arbitrary connected communication graph. We analyze a variant of the Cooperative Successive Elimination algorithm, $\texttt{Coop-SE}$, and show an individual regret bound of ${O}(\mathcal{R} / m + A^2 + A \sqrt{\log T})$ and a nearly matching lower bound. Here $A$ is the number of actions, $T$ the time horizon, $m$ the number of agents, and $\mathcal{R} = \sum_{\Delta_i > 0}\log(T)/\Delta_i$ is the optimal single-agent regret, where $\Delta_i$ is the sub-optimality gap of action $i$. Our work is the first to show an individual regret bound in cooperative stochastic MAB that is independent of the graph's diameter. When considering communication networks, there are additional considerations beyond regret, such as message size and number of communication rounds. First, we show that our regret bound holds even if we restrict the messages to be of logarithmic size. Second, for a logarithmic number of communication rounds, we obtain a regret bound of ${O}(\mathcal{R} / m + A \log T)$.
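The single-agent building block behind the cooperative algorithm above is classical Successive Elimination. The sketch below shows that single-agent core only (the cooperative message-passing machinery is omitted); the Bernoulli arm means and horizon are hypothetical.

```python
import math
import random

# Single-agent Successive Elimination sketch: pull surviving arms round-robin
# and eliminate any arm whose upper confidence bound falls below another arm's
# lower confidence bound. Arm means below are hypothetical.

random.seed(0)
means = [0.9, 0.6, 0.5]  # Bernoulli reward means
n_arms = len(means)
active = list(range(n_arms))
counts = [0] * n_arms
sums = [0.0] * n_arms
T = 20_000

for t in range(T):
    arm = active[t % len(active)]  # round-robin over surviving arms
    sums[arm] += 1.0 if random.random() < means[arm] else 0.0
    counts[arm] += 1
    if all(counts[a] > 0 for a in active):
        rad = {a: math.sqrt(2 * math.log(T) / counts[a]) for a in active}
        ucb = {a: sums[a] / counts[a] + rad[a] for a in active}
        lcb = {a: sums[a] / counts[a] - rad[a] for a in active}
        active = [a for a in active if ucb[a] >= max(lcb.values())]

print(active)  # with these gaps, only the best arm (index 0) survives
```

Each suboptimal arm $i$ is eliminated after roughly $(\log T)/\Delta_i^2$ pulls, which yields the $\sum_{\Delta_i > 0} \log(T)/\Delta_i$ single-agent regret $\mathcal{R}$ appearing in the bound above.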

ICML Conference 2025 Conference Paper

Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback

  • Tal Lancewicki
  • Yishay Mansour

We study online finite-horizon Markov Decision Processes with adversarially changing loss and aggregate bandit feedback (a.k.a. full-bandit feedback). Under this type of feedback, the agent observes only the total loss incurred over the entire trajectory, rather than the individual losses at each intermediate step within the trajectory. We introduce the first Policy Optimization algorithms for this setting. In the known-dynamics case, we achieve the first optimal regret bound of $\tilde \Theta(H^2\sqrt{SAK})$, where $K$ is the number of episodes, $H$ is the episode horizon, $S$ is the number of states, and $A$ is the number of actions. In the unknown-dynamics case, we establish a regret bound of $\tilde O(H^3 S \sqrt{AK})$, significantly improving the best known result by a factor of $H^2 S^5 A^2$.

STOC Conference 2025 Conference Paper

On Differentially Private Linear Algebra

  • Haim Kaplan
  • Yishay Mansour
  • Shay Moran
  • Uri Stemmer
  • Nitzan Tur

We introduce efficient differentially private (DP) algorithms for several linear algebraic tasks, including solving linear equalities over arbitrary fields, linear inequalities over the reals, and computing affine spans and convex hulls. As an application, we obtain efficient DP algorithms for learning halfspaces and affine subspaces. Our algorithms addressing equalities are strongly polynomial, whereas those addressing inequalities are weakly polynomial. Furthermore, this distinction is inevitable: no DP algorithm for linear programming can be strongly polynomial-time efficient.

NeurIPS Conference 2025 Conference Paper

Principled Model Routing for Unknown Mixtures of Source Domains

  • Christoph Dann
  • Yishay Mansour
  • Teodor Vanislavov Marinov
  • Mehryar Mohri

The rapid proliferation of domain-specialized machine learning models presents a challenge: while individual models excel in specific domains, their performance varies significantly across diverse applications. This makes selecting the optimal model when faced with an unknown mixture of tasks, especially with limited or no data to estimate the mixture, a difficult problem. We address this challenge by formulating it as a multiple-source domain adaptation (MSA) problem. We introduce a novel, scalable algorithm that effectively routes each input to the best-suited model from a pool of available models. Our approach provides a strong performance guarantee: remarkably, for any mixture domain, the accuracy achieved by the best source model is maintained. This guarantee is established through a theoretical bound on the regret for new domains, expressed as a convex combination of the best regrets in the source domains, plus a concentration term that diminishes as the amount of source data increases. While our primary contributions are theoretical and algorithmic, we also present empirical results demonstrating the effectiveness of our approach.

NeurIPS Conference 2025 Conference Paper

Probably Approximately Precision and Recall Learning

  • Lee Cohen
  • Yishay Mansour
  • Shay Moran
  • Han Shao

Precision and Recall are fundamental metrics in machine learning tasks where both accurate predictions and comprehensive coverage are essential, such as in multi-label learning, language generation, medical studies, and recommender systems. A key challenge in these settings is the prevalence of one-sided feedback, where only positive examples are observed during training—e.g., in multi-label tasks like tagging people in Facebook photos, we may observe only a few tagged individuals, without knowing who else appears in the image. To address learning under such partial feedback, we introduce a Probably Approximately Correct (PAC) framework in which hypotheses are set functions that map each input to a set of labels, extending beyond single-label predictions and generalizing classical binary, multi-class, and multi-label models. Our results reveal sharp statistical and algorithmic separations from standard settings: classical methods such as Empirical Risk Minimization provably fail, even for simple hypothesis classes. We develop new algorithms that learn from positive data alone, achieving optimal sample complexity in the realizable case, and establishing multiplicative—rather than additive—approximation guarantees in the agnostic case, where achieving additive regret is impossible.

NeurIPS Conference 2025 Conference Paper

Regret Bounds for Adversarial Contextual Bandits with General Function Approximation and Delayed Feedback

  • Orin Levy
  • Liad Erez
  • Alon Peled-Cohen
  • Yishay Mansour

We present regret minimization algorithms for the contextual multi-armed bandit (CMAB) problem over $K$ actions in the presence of delayed feedback, a scenario where loss observations arrive with delays chosen by an adversary. As a preliminary result, assuming direct access to a finite policy class $\Pi$ we establish an optimal expected regret bound of $O(\sqrt{KT \log |\Pi|} + \sqrt{D \log |\Pi|})$ where $D$ is the sum of delays. For our main contribution, we study the general function approximation setting over a (possibly infinite) contextual loss function class $\mathcal{F}$ with access to an online least-square regression oracle $\mathcal{O}$ over $\mathcal{F}$. In this setting, we achieve an expected regret bound of $O(\sqrt{KTR_T(\mathcal{O})} + \sqrt{ d_{\max} D \beta})$ assuming FIFO order, where $d_{\max}$ is the maximal delay, $R_T(\mathcal{O})$ is an upper bound on the oracle's regret and $\beta$ is a stability parameter associated with the oracle. We complement this general result by presenting a novel stability analysis of a Hedge-based version of Vovk's aggregating forecaster as an oracle implementation for least-square regression over a finite function class $\mathcal{F}$ and show that its stability parameter $\beta$ is bounded by $\log |\mathcal{F}|$, resulting in an expected regret bound of $O(\sqrt{KT \log |\mathcal{F}|} + \sqrt{d_{\max} D \log |\mathcal{F}|})$ which is a $\sqrt{d_{\max}}$ factor away from the lower bound of $\Omega(\sqrt{KT \log |\mathcal{F}|} + \sqrt{D \log |\mathcal{F}|})$ that we also present.

EWRL Workshop 2024 Workshop Paper

Efficient Rate Optimal Regret for Adversarial Contextual MDPs Using Online Function Approximation

  • Orin Levy
  • Alon Cohen
  • Asaf Cassel
  • Yishay Mansour

We present the OMG-CMDP! algorithm for regret minimization in adversarial Contextual MDPs. The algorithm operates under the minimal assumptions of realizable function class and access to online least squares and log loss regression oracles. Our algorithm is efficient (assuming efficient online regression oracles), simple and robust to approximation errors. It enjoys an $\widetilde{O}(H^2 \sqrt{TH|S||A| (\mathcal{R}_{TH}(\mathcal{O}) + H \log(1/\delta))})$ regret guarantee, with $T$ being the number of episodes, $S$ the state space, $A$ the action space, $H$ the horizon. In addition, $\mathcal{R}_{TH}(\mathcal{O})$ is the sum of the square and log-loss regression oracles' regret, used to approximate the context-dependent rewards and dynamics, respectively. To the best of our knowledge, our algorithm is the first efficient rate optimal regret minimization algorithm for adversarial CMDPs that operates under the minimal standard assumption of online function approximation.

ICML Conference 2024 Conference Paper

Eluder-based Regret for Stochastic Contextual MDPs

  • Orin Levy
  • Asaf B. Cassel
  • Alon Cohen
  • Yishay Mansour

We present the E-UC$^3$RL algorithm for regret minimization in Stochastic Contextual Markov Decision Processes (CMDPs). The algorithm operates under the minimal assumptions of realizable function class and access to offline least squares and log loss regression oracles. Our algorithm is efficient (assuming efficient offline regression oracles) and enjoys a regret guarantee of $\widetilde{O}(H^3 \sqrt{T |S| |A| d_{\mathrm{E}}(\mathcal{P}) \log (|\mathcal{F}| |\mathcal{P}|/\delta)})$, with $T$ being the number of episodes, $S$ the state space, $A$ the action space, $H$ the horizon, $\mathcal{P}$ and $\mathcal{F}$ are finite function classes used to approximate the context-dependent dynamics and rewards, respectively, and $d_{\mathrm{E}}(\mathcal{P})$ is the Eluder dimension of $\mathcal{P}$ w.r.t. the Hellinger distance. To the best of our knowledge, our algorithm is the first efficient and rate-optimal regret minimization algorithm for CMDPs that operates under the general offline function approximation setting. In addition, we extend the Eluder dimension to general bounded metrics, which may be of independent interest.

NeurIPS Conference 2024 Conference Paper

Fast Rates for Bandit PAC Multiclass Classification

  • Liad Erez
  • Alon Cohen
  • Tomer Koren
  • Yishay Mansour
  • Shay Moran

We study multiclass PAC learning with bandit feedback, where inputs are classified into one of $K$ possible labels and feedback is limited to whether or not the predicted labels are correct. Our main contribution is in designing a novel learning algorithm for the agnostic $(\varepsilon, \delta)$-PAC version of the problem, with sample complexity of $O\big( (\operatorname{poly}(K) + 1 / \varepsilon^2) \log (|\mathcal{H}| / \delta) \big)$ for any finite hypothesis class $\mathcal{H}$. In terms of the leading dependence on $\varepsilon$, this improves upon existing bounds for the problem, that are of the form $O(K/\varepsilon^2)$. We also provide an extension of this result to general classes and establish similar sample complexity bounds in which $\log |\mathcal{H}|$ is replaced by the Natarajan dimension. This matches the optimal rate in the full-information version of the problem and resolves an open question studied by Daniely, Sabato, Ben-David, and Shalev-Shwartz (2011) who demonstrated that the multiplicative price of bandit feedback in realizable PAC learning is $\Theta(K)$. We complement this by revealing a stark contrast with the agnostic case, where the price of bandit feedback is only $O(1)$ as $\varepsilon \to 0$. Our algorithm utilizes a stochastic optimization technique to minimize a log-barrier potential based on Frank-Wolfe updates for computing a low-variance exploration distribution over the hypotheses, and is made computationally efficient provided access to an ERM oracle over $\mathcal{H}$.

NeurIPS Conference 2024 Conference Paper

How to Boost Any Loss Function

  • Richard Nock
  • Yishay Mansour

Boosting is a highly successful ML-born optimization setting in which one is required to computationally efficiently learn arbitrarily good models based on access to a weak learner oracle, providing classifiers performing at least slightly differently from random guessing. A key difference with gradient-based optimization is that boosting's original model does not require access to first-order information about a loss, yet the decades-long history of boosting has quickly evolved it into a first-order optimization setting -- sometimes even wrongfully *defining* it as such. Owing to recent progress extending gradient-based optimization to use only a loss' zeroth ($0^{th}$) order information to learn, this begs the question: which loss functions can be efficiently optimized with boosting, and what information is really needed for boosting to meet the *original* boosting blueprint's requirements? We provide a constructive formal answer essentially showing that *any* loss function can be optimized with boosting, and thus boosting can achieve a feat not yet known to be possible in the classical $0^{th}$ order setting, since loss functions are not required to be convex, nor differentiable or Lipschitz -- and in fact not required to be continuous either. Some tools we use are rooted in quantum calculus, the mathematical field -- not to be confounded with quantum computation -- that studies calculus without passing to the limit, and thus without using first-order information.

EWRL Workshop 2024 Workshop Paper

Individual Regret in Cooperative Stochastic Multi-Armed Bandits over Communication Graph

  • Idan Barnea
  • Tal Lancewicki
  • Yishay Mansour

We study the regret in stochastic Multi-Armed Bandits (MAB) with multiple agents that communicate over an arbitrary connected communication graph. We show a near-optimal individual regret bound of $\tilde{O}(\sqrt{AT/m}+A)$, where $A$ is the number of actions, $T$ the time horizon, and $m$ the number of agents. In particular, assuming a sufficient number of agents, we achieve a regret bound of $\tilde{O}(A)$, which is independent of the sub-optimality gaps and depends only logarithmically on the time horizon. To the best of our knowledge, our study is the first to show an individual regret bound in cooperative stochastic MAB that is independent of the graph's diameter and applicable to non-fully-connected communication graphs.

NeurIPS Conference 2024 Conference Paper

Learning-Augmented Algorithms with Explicit Predictors

  • Marek Eliáš
  • Haim Kaplan
  • Yishay Mansour
  • Shay Moran

Recent advances in algorithmic design show how to utilize predictions obtained by machine learning models from past and present data. These approaches have demonstrated an enhancement in performance when the predictions are accurate, while also ensuring robustness by providing worst-case guarantees when predictions fail. In this paper we focus on online problems; prior research in this context was focused on a paradigm where the algorithms are oblivious of the predictors' design, treating them as a black box. In contrast, in this work, we unpack the predictor and integrate the learning problem it gives rise to into the algorithmic challenge. In particular we allow the predictor to learn as it receives larger parts of the input, with the ultimate goal of designing online learning algorithms specifically tailored for the algorithmic task at hand. Adopting this perspective, we focus on a number of fundamental problems, including caching and scheduling, which have been well-studied in the black-box setting. For each of the problems, we introduce new algorithms that take advantage of explicit and carefully designed learning rules. These pairings of online algorithms with corresponding learning rules yield improvements in overall performance in comparison with previous work.

AAAI Conference 2024 Conference Paper

Principal-Agent Reward Shaping in MDPs

  • Omer Ben-Porat
  • Yishay Mansour
  • Michal Moshkovitz
  • Boaz Taitler

Principal-agent problems arise when one party acts on behalf of another, leading to conflicts of interest. The economic literature has extensively studied principal-agent problems, and recent work has extended this to more complex scenarios such as Markov Decision Processes (MDPs). In this paper, we further explore this line of research by investigating how reward shaping under budget constraints can improve the principal's utility. We study a two-player Stackelberg game where the principal and the agent have different reward functions, and the agent chooses an MDP policy for both players. The principal offers an additional reward to the agent, and the agent picks their policy selfishly to maximize their reward, which is the sum of the original and the offered reward. Our results establish the NP-hardness of the problem and offer polynomial approximation algorithms for two classes of instances: Stochastic trees and deterministic decision processes with a finite horizon.

ICML Conference 2024 Conference Paper

Rate-Optimal Policy Optimization for Linear Markov Decision Processes

  • Uri Sherman
  • Alon Cohen
  • Tomer Koren
  • Yishay Mansour

We study regret minimization in online episodic linear Markov Decision Processes, and propose a policy optimization algorithm that is computationally efficient, and obtains rate optimal $\widetilde O (\sqrt K)$ regret where $K$ denotes the number of episodes. Our work is the first to establish the optimal rate (in terms of $K$) of convergence in the stochastic setting with bandit feedback using a policy optimization based approach, and the first to establish the optimal rate in the adversarial setup with full information feedback, for which no algorithm with an optimal rate guarantee was previously known.

EWRL Workshop 2024 Workshop Paper

Rate-Optimal Policy Optimization for Linear Markov Decision Processes

  • Uri Sherman
  • Alon Cohen
  • Tomer Koren
  • Yishay Mansour

We study regret minimization in online episodic linear Markov Decision Processes, and propose a policy optimization algorithm that is computationally efficient, and obtains rate optimal $\widetilde O (\sqrt K)$ regret where $K$ denotes the number of episodes. Our work is the first to establish the optimal rate (in terms of~$K$) of convergence in the stochastic setting with bandit feedback using a policy optimization based approach, and the first to establish the optimal rate in the adversarial setup with full information feedback, for which no algorithm with an optimal rate guarantee was previously known.

EWRL Workshop 2024 Workshop Paper

Regret Guarantees for Adversarial Contextual Bandits with Delayed Feedback

  • Liad Erez
  • Orin Levy
  • Yishay Mansour

In this paper we present regret minimization algorithms for the contextual multi-armed bandit (CMAB) problem in the presence of delayed feedback, a scenario where reward observations arrive with delays chosen by an adversary. We study two fundamental frameworks in terms of the function classes used to derive regret bounds for CMAB. Firstly, for a finite policy class $ \Pi $, we establish an optimal regret bound of $ O \left( \sqrt{KT \log |\Pi|} + \sqrt{D \log |\Pi|} \right) $, where $ K $ is the number of arms, $ T $ is the number of rounds, and $ D $ is the sum of delays. Secondly, assuming a finite contextual reward function class $ \mathcal{F} $ and access to an online least-square regression oracle $\mathcal{O}$ over $\mathcal{F}$, we achieve a regret bound of $\widetilde{O}(\sqrt{KT\cdot (\mathcal{R}_T(\mathcal{O})+\log (\delta^{-1}))} + \eta D + d_m)$ that holds with probability at least $1-\delta$, where $d_m$ is the maximal delay, $\mathcal{R}_T(\mathcal{O})$ is an upper bound on the oracle's regret and $\eta$ is a stability parameter associated with the oracle.

NeurIPS Conference 2023 Conference Paper

Black-Box Differential Privacy for Interactive ML

  • Haim Kaplan
  • Yishay Mansour
  • Shay Moran
  • Kobbi Nissim
  • Uri Stemmer

In this work we revisit an interactive variant of joint differential privacy, recently introduced by Naor et al. [2023], and generalize it towards handling online processes in which existing privacy definitions seem too restrictive. We study basic properties of this definition and demonstrate that it satisfies suitable variants of group privacy, composition, and post-processing. In order to demonstrate the advantages of this privacy definition compared to traditional forms of differential privacy, we consider the basic setting of online classification. We show that any (possibly non-private) learning rule can be effectively transformed to a private learning rule with only a polynomial overhead in the mistake bound. This demonstrates a stark difference with traditional forms of differential privacy, such as the one studied by Golowich and Livni [2021], where only a double exponential overhead in the mistake bound is known (via an information theoretic upper bound).

ICML Conference 2023 Conference Paper

Concurrent Shuffle Differential Privacy Under Continual Observation

  • Jay Tenenbaum
  • Haim Kaplan
  • Yishay Mansour
  • Uri Stemmer

We introduce the concurrent shuffle model of differential privacy. In this model we have multiple concurrent shufflers permuting messages from different, possibly overlapping, batches of users. Similarly to the standard (single) shuffler model, the privacy requirement is that the concatenation of all shuffled messages should be differentially private. We study the private continual summation problem (a.k.a. the counter problem) and show that the concurrent shuffle model allows for significantly improved error compared to a standard (single) shuffler model. Specifically, we give a summation algorithm with error $\tilde{O}(n^{1/(2k+1)})$ with $k$ concurrent shufflers on a sequence of length $n$. Furthermore, we prove that this bound is tight for any $k$, even if the algorithm can choose the sizes of the batches adaptively. For $k=\log n$ shufflers, the resulting error is polylogarithmic, much better than $\tilde{\Theta}(n^{1/3})$ which we show is the smallest possible with a single shuffler. We use our online summation algorithm to get algorithms with improved regret bounds for the contextual linear bandit problem. In particular we get optimal $\tilde{O}(\sqrt{n})$ regret with $k= \tilde{\Omega}(\log n)$ concurrent shufflers.

ICML Conference 2023 Conference Paper

Efficient Rate Optimal Regret for Adversarial Contextual MDPs Using Online Function Approximation

  • Orin Levy
  • Alon Cohen
  • Asaf B. Cassel
  • Yishay Mansour

We present the OMG-CMDP! algorithm for regret minimization in adversarial Contextual MDPs. The algorithm operates under the minimal assumptions of realizable function class and access to online least squares and log loss regression oracles. Our algorithm is efficient (assuming efficient online regression oracles), simple and robust to approximation errors. It enjoys an $\widetilde{O}(H^{2.5} \sqrt{T|S||A| ( \mathcal{R}_{TH}(\mathcal{O}) + H \log(\delta^{-1}) )})$ regret guarantee, with $T$ being the number of episodes, $S$ the state space, $A$ the action space, $H$ the horizon and $\mathcal{R}_{TH}(\mathcal{O}) = \mathcal{R}_{TH}(\mathcal{O}_{sq}^\mathcal{F}) + \mathcal{R}_{TH}(\mathcal{O}_{log}^\mathcal{P})$ is the sum of the square and log-loss regression oracles' regret, used to approximate the context-dependent rewards and dynamics, respectively. To the best of our knowledge, our algorithm is the first efficient rate optimal regret minimization algorithm for adversarial CMDPs that operates under the minimal standard assumption of online function approximation.

NeurIPS Conference 2023 Conference Paper

Eliciting User Preferences for Personalized Multi-Objective Decision Making through Comparative Feedback

  • Han Shao
  • Lee Cohen
  • Avrim Blum
  • Yishay Mansour
  • Aadirupa Saha
  • Matthew Walter

In this work, we propose a multi-objective decision making framework that accommodates different user preferences over objectives, where preferences are learned via policy comparisons. Our model consists of a known Markov decision process with a vector-valued reward function, with each user having an unknown preference vector that expresses the relative importance of each objective. The goal is to efficiently compute a near-optimal policy for a given user. We consider two user feedback models. We first address the case where a user is provided with two policies and returns their preferred policy as feedback. We then move to a different user feedback model, where a user is instead provided with two small weighted sets of representative trajectories and selects the preferred one. In both cases, we suggest an algorithm that finds a nearly optimal policy for the user using a number of comparison queries that scales quasilinearly in the number of objectives.

NeurIPS Conference 2023 Conference Paper

Finding Safe Zones of Markov Decision Processes Policies

  • Lee Cohen
  • Yishay Mansour
  • Michal Moshkovitz

Given a policy of a Markov Decision Process, we define a SafeZone as a subset of states, such that most of the policy's trajectories are confined to this subset. The quality of a SafeZone is parameterized by the number of states and the escape probability, i.e., the probability that a random trajectory will leave the subset. SafeZones are especially interesting when they have a small number of states and low escape probability. We study the complexity of finding optimal SafeZones, and show that in general, the problem is computationally hard. For this reason, we concentrate on finding approximate SafeZones. Our main result is a bi-criteria approximation learning algorithm with a factor of almost $2$ approximation for both the escape probability and SafeZone size, using a polynomial size sample complexity.
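The escape probability of a candidate SafeZone can be estimated by straightforward Monte-Carlo rollouts. A hedged sketch (our own illustration, not the paper's bi-criteria approximation algorithm; `policy_step` is a hypothetical stand-in for one step of the fixed policy in the MDP):

```python
import random

def escape_probability(policy_step, start_state, safe_set, horizon,
                       trials=2000, rng=None):
    """Monte-Carlo estimate of a candidate SafeZone's escape probability:
    the fraction of sampled trajectories that leave `safe_set` within the
    horizon. `policy_step(state, rng)` returns the next state under the
    fixed policy."""
    rng = rng or random.Random(0)
    escapes = 0
    for _ in range(trials):
        s = start_state
        for _ in range(horizon):
            s = policy_step(s, rng)
            if s not in safe_set:
                escapes += 1
                break  # this trajectory has left the SafeZone
    return escapes / trials
```

A SafeZone is good exactly when this estimate is small while `safe_set` itself is small.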

ICML Conference 2023 Conference Paper

Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation

  • Uri Sherman
  • Tomer Koren
  • Yishay Mansour

We study reinforcement learning with linear function approximation and adversarially changing cost functions, a setup that has mostly been considered under simplifying assumptions such as full information feedback or exploratory conditions. We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback, featuring a combination of mirror-descent and least squares policy evaluation in an auxiliary MDP used to compute exploration bonuses. Our algorithm obtains an $\widetilde O(K^{6/7})$ regret bound, improving significantly over previous state-of-the-art of $\widetilde O (K^{14/15})$ in this setting. In addition, we present a version of the same algorithm under the assumption a simulator of the environment is available to the learner (but otherwise no exploratory assumptions are made), and prove it obtains state-of-the-art regret of $\widetilde O (K^{2/3})$.

AAAI Conference 2023 Conference Paper

Learning Revenue Maximization Using Posted Prices for Stochastic Strategic Patient Buyers

  • Eitan-Hai Mashiah
  • Idan Attias
  • Yishay Mansour

We consider a seller faced with buyers who have the ability to delay their decision, which we call patience. Each buyer's type is composed of value and patience, and it is sampled i.i.d. from a distribution. The seller, using posted prices, would like to maximize her revenue from selling to the buyer. In this paper, we formalize this setting and characterize the resulting Stackelberg equilibrium, where the seller first commits to her strategy, and then the buyers best respond. Following this, we show how to compute both the optimal pure and mixed strategies. We then consider a learning setting, where the seller does not have access to the distribution over buyer's types. Our main results are the following. We derive a sample complexity bound for the learning of an approximate optimal pure strategy, by computing the fat-shattering dimension of this setting. Moreover, we provide a general sample complexity bound for the approximate optimal mixed strategy. We also consider an online setting and derive a vanishing regret bound with respect to both the optimal pure strategy and the optimal mixed strategy.

NeurIPS Conference 2023 Conference Paper

Multiclass Boosting: Simple and Intuitive Weak Learning Criteria

  • Nataly Brukhim
  • Amit Daniely
  • Yishay Mansour
  • Shay Moran

We study a generalization of boosting to the multiclass setting. We introduce a weak learning condition for multiclass classification that captures the original notion of weak learnability as being “slightly better than random guessing”. We give a simple and efficient boosting algorithm, that does not require realizability assumptions and its sample and oracle complexity bounds are independent of the number of classes. In addition, we utilize our new boosting technique in several theoretical applications within the context of List PAC Learning. First, we establish an equivalence to weak PAC learning. Furthermore, we present a new result on boosting for list learners, as well as provide a novel proof for the characterization of multiclass PAC learning and List PAC learning. Notably, our technique gives rise to simplified algorithms and analysis compared to previous works.

AAAI Conference 2023 Conference Paper

Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP

  • Orin Levy
  • Yishay Mansour

We present regret minimization algorithms for stochastic contextual MDPs under a minimum reachability assumption, using access to an offline least squares regression oracle. We analyze three different settings: where the dynamics is known, where the dynamics is unknown but independent of the context, and the most challenging setting where the dynamics is unknown and context-dependent. For the latter, our algorithm obtains a regret bound (up to poly-logarithmic factors) of order $(H + 1/p_{\min}) H |S|^{3/2} \sqrt{|A| T \log(\max\{|\mathcal{G}|, |\mathcal{P}|\}/\delta)}$ with probability $1-\delta$, where $\mathcal{P}$ and $\mathcal{G}$ are finite and realizable function classes used to approximate the dynamics and rewards respectively, $p_{\min}$ is the minimum reachability parameter, $S$ is the set of states, $A$ the set of actions, $H$ the horizon, and $T$ the number of episodes. To our knowledge, our approach is the first optimistic approach applied to contextual MDPs with general function approximation (i.e., without additional knowledge regarding the function class, such as it being linear, etc.). We present a lower bound of $\Omega(\sqrt{TH|S||A| \ln(|\mathcal{G}|)/\ln(|A|)})$ on the expected regret, which holds even in the case of known dynamics. Lastly, we discuss an extension of our results to CMDPs without minimum reachability, which obtains a regret of order $T^{3/4}$.

ICML Conference 2023 Conference Paper

Random Classification Noise does not defeat All Convex Potential Boosters Irrespective of Model Choice

  • Yishay Mansour
  • Richard Nock
  • Robert C. Williamson

A landmark negative result of Long and Servedio has had a considerable impact on research and development in boosting algorithms, around the now famous tagline that "noise defeats all convex boosters". In this paper, we appeal to the half-century+ founding theory of losses for class probability estimation, an extension of Long and Servedio’s results and a new general convex booster to demonstrate that the source of their negative result is in fact the model class, linear separators. Losses or algorithms are neither to blame. This leads us to a discussion on an otherwise praised aspect of ML, parameterisation.

ICML Conference 2023 Conference Paper

Regret Minimization and Convergence to Equilibria in General-sum Markov Games

  • Liad Erez
  • Tal Lancewicki
  • Uri Sherman
  • Tomer Koren
  • Yishay Mansour

An abundance of recent impossibility results establish that regret minimization in Markov games with adversarial opponents is both statistically and computationally intractable. Nevertheless, none of these results preclude the possibility of regret minimization under the assumption that all parties adopt the same learning procedure. In this work, we present the first (to our knowledge) algorithm for learning in general-sum Markov games that provides sublinear regret guarantees when executed by all agents. The bounds we obtain are for $\textit{swap regret}$, and thus, along the way, imply convergence to a $\textit{correlated}$ equilibrium. Our algorithm is decentralized, computationally efficient, and does not require any communication between agents. Our key observation is that online learning via policy optimization in Markov games essentially reduces to a form of $\textit{weighted}$ regret minimization, with $\textit{unknown}$ weights determined by the path length of the agents’ policy sequence. Consequently, controlling the path length leads to weighted regret objectives for which sufficiently adaptive algorithms provide sublinear regret guarantees.

ICML Conference 2023 Conference Paper

Reinforcement Learning Can Be More Efficient with Multiple Rewards

  • Christoph Dann
  • Yishay Mansour
  • Mehryar Mohri

Reward design is one of the most critical and challenging aspects when formulating a task as a reinforcement learning (RL) problem. In practice, it often takes several attempts of reward specification and learning with it in order to find one that leads to sample-efficient learning of the desired behavior. Instead, in this work, we study whether directly incorporating multiple alternate reward formulations of the same task in a single agent can lead to faster learning. We analyze multi-reward extensions of action-elimination algorithms and prove more favorable instance-dependent regret bounds compared to their single-reward counterparts, both in multi-armed bandits and in tabular Markov decision processes. Our bounds scale for each state-action pair with the inverse of the largest gap among all reward functions. This suggests that learning with multiple rewards can indeed be more sample-efficient, as long as the rewards agree on an optimal policy. We further prove that when rewards do not agree, multi-reward action elimination in multi-armed bandits still learns a policy that is good across all reward functions.

NeurIPS Conference 2022 Conference Paper

A Characterization of Semi-Supervised Adversarially Robust PAC Learnability

  • Idan Attias
  • Steve Hanneke
  • Yishay Mansour

We study the problem of learning an adversarially robust predictor to test time attacks in the semi-supervised PAC model. We address the question of how many labeled and unlabeled examples are required to ensure learning. We show that having enough unlabeled data (the size of a labeled sample that a fully-supervised method would require), the labeled sample complexity can be arbitrarily smaller compared to previous works, and is sharply characterized by a different complexity measure. We prove nearly matching upper and lower bounds on this sample complexity. This shows that there is a significant benefit in semi-supervised robust learning even in the worst-case distribution-free model, and establishes a gap between supervised and semi-supervised label complexities which is known not to hold in standard non-robust PAC learning.

NeurIPS Conference 2022 Conference Paper

Benign Underfitting of Stochastic Gradient Descent

  • Tomer Koren
  • Roi Livni
  • Yishay Mansour
  • Uri Sherman

We study to what extent may stochastic gradient descent (SGD) be understood as a ``conventional'' learning rule that achieves generalization performance by obtaining a good fit to training data. We consider the fundamental stochastic convex optimization framework, where (one pass, $\textit{without}$-replacement) SGD is classically known to minimize the population risk at rate $O(1/\sqrt n)$, and prove that, surprisingly, there exist problem instances where the SGD solution exhibits both empirical risk and generalization gap of $\Omega(1)$. Consequently, it turns out that SGD is not algorithmically stable in $\textit{any}$ sense, and its generalization ability cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis). We then continue to analyze the closely related $\textit{with}$-replacement SGD, for which we show that an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate. Finally, we interpret our main results in the context of without-replacement SGD for finite-sum convex optimization problems, and derive upper and lower bounds for the multi-epoch regime that significantly improve upon previously known results.
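The one-pass, without-replacement SGD variant analyzed here is simple to state in code. A sketch under our own naming, where `grad(w, x)` is assumed to return the per-example gradient:

```python
import random

def one_pass_sgd(grad, data, w0, lr, rng=None):
    """One-pass, without-replacement SGD: shuffle the sample once, then take
    a single gradient step per example, as in the classical O(1/sqrt(n))
    population-risk analysis."""
    rng = rng or random.Random(0)
    order = list(range(len(data)))
    rng.shuffle(order)  # without replacement: each example visited once
    w = w0
    for i in order:
        w = w - lr * grad(w, data[i])
    return w
```

The with-replacement variant discussed in the abstract would instead draw each index uniformly at random at every step.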

EWRL Workshop 2022 Workshop Paper

Cooperative Online Learning in Stochastic and Adversarial MDPs

  • Tal Lancewicki
  • Aviv Rosenberg
  • Yishay Mansour

We study cooperative online learning in stochastic and adversarial Markov decision process (MDP). That is, in each episode, m agents interact with an MDP simultaneously and share information in order to minimize their individual regret. We consider environments with two types of randomness: fresh – where each agent’s trajectory is sampled i.i.d., and non-fresh – where the realization is shared by all agents (but each agent’s trajectory is also affected by its own actions). More precisely, with non-fresh randomness the realization of every cost and transition is fixed at the start of each episode, and agents that take the same action in the same state at the same time observe the same cost and next state. We thoroughly analyze all relevant settings, highlight the challenges and differences between the models, and prove nearly-matching regret lower and upper bounds. To our knowledge, we are the first to consider cooperative reinforcement learning (RL) with either non-fresh randomness or in adversarial MDPs.

ICML Conference 2022 Conference Paper

Cooperative Online Learning in Stochastic and Adversarial MDPs

  • Tal Lancewicki
  • Aviv Rosenberg 0002
  • Yishay Mansour

We study cooperative online learning in stochastic and adversarial Markov decision process (MDP). That is, in each episode, $m$ agents interact with an MDP simultaneously and share information in order to minimize their individual regret. We consider environments with two types of randomness: fresh – where each agent’s trajectory is sampled i.i.d., and non-fresh – where the realization is shared by all agents (but each agent’s trajectory is also affected by its own actions). More precisely, with non-fresh randomness the realization of every cost and transition is fixed at the start of each episode, and agents that take the same action in the same state at the same time observe the same cost and next state. We thoroughly analyze all relevant settings, highlight the challenges and differences between the models, and prove nearly-matching regret lower and upper bounds. To our knowledge, we are the first to consider cooperative reinforcement learning (RL) with either non-fresh randomness or in adversarial MDPs.

NeurIPS Conference 2022 Conference Paper

Fair Wrapping for Black-box Predictions

  • Alexander Soen
  • Ibrahim M. Alabdulmohsin
  • Sanmi Koyejo
  • Yishay Mansour
  • Nyalleng Moorosi
  • Richard Nock
  • Ke Sun
  • Lexing Xie

We introduce a new family of techniques to post-process (``wrap") a black-box classifier in order to reduce its bias. Our technique builds on the recent analysis of improper loss functions whose optimization can correct any twist in prediction, unfairness being treated as a twist. In the post-processing, we learn a wrapper function which we define as an $\alpha$-tree, which modifies the prediction. We provide two generic boosting algorithms to learn $\alpha$-trees. We show that our modification has appealing properties in terms of composition of $\alpha$-trees, generalization, interpretability, and KL divergence between modified and original predictions. We exemplify the use of our technique in three fairness notions: conditional value-at-risk, equality of opportunity, and statistical parity; and provide experiments on several readily available datasets.

ICML Conference 2022 Conference Paper

FriendlyCore: Practical Differentially Private Aggregation

  • Eliad Tsfadia
  • Edith Cohen
  • Haim Kaplan
  • Yishay Mansour
  • Uri Stemmer

Differentially private algorithms for common metric aggregation tasks, such as clustering or averaging, often have limited practicality due to their complexity or to the large number of data points that is required for accurate results. We propose a simple and practical tool $\mathsf{FriendlyCore}$ that takes a set of points ${\cal D}$ from an unrestricted (pseudo) metric space as input. When ${\cal D}$ has effective diameter $r$, $\mathsf{FriendlyCore}$ returns a “stable” subset ${\cal C} \subseteq {\cal D}$ that includes all points, except possibly a few outliers, and is guaranteed to have diameter $r$. $\mathsf{FriendlyCore}$ can be used to preprocess the input before privately aggregating it, potentially simplifying the aggregation or boosting its accuracy. Surprisingly, $\mathsf{FriendlyCore}$ is light-weight with no dependence on the dimension. We empirically demonstrate its advantages in boosting the accuracy of mean estimation and clustering tasks such as $k$-means and $k$-GMM, outperforming tailored methods.

ICML Conference 2022 Conference Paper

Guarantees for Epsilon-Greedy Reinforcement Learning with Function Approximation

  • Christoph Dann
  • Yishay Mansour
  • Mehryar Mohri
  • Ayush Sekhari
  • Karthik Sridharan

Myopic exploration policies such as epsilon-greedy, softmax, or Gaussian noise fail to explore efficiently in some reinforcement learning tasks and yet, they perform well in many others. In fact, in practice, they are often selected as the top choices, due to their simplicity. But, for what tasks do such policies succeed? Can we give theoretical guarantees for their favorable performance? These crucial questions have been scarcely investigated, despite the prominent practical importance of these policies. This paper presents a theoretical analysis of such policies and provides the first regret and sample-complexity bounds for reinforcement learning with myopic exploration. Our results apply to value-function-based algorithms in episodic MDPs with bounded Bellman Eluder dimension. We propose a new complexity measure called myopic exploration gap, denoted by $\alpha$, that captures a structural property of the MDP, the exploration policy and the given value function class. We show that the sample-complexity of myopic exploration scales quadratically with the inverse of this quantity, $1/\alpha^2$. We further demonstrate through concrete examples that myopic exploration gap is indeed favorable in several tasks where myopic exploration succeeds, due to the corresponding dynamics and reward structure.
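For concreteness, the epsilon-greedy rule discussed above can be sketched in a few lines (the function name and value-estimate representation are our own):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=None):
    """Myopic epsilon-greedy exploration: with probability epsilon pick a
    uniformly random action, otherwise the greedy argmax of the current
    value estimates."""
    rng = rng or random.Random()
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit
```

The analysis in the paper asks when such a simple rule, with no optimism or bonus terms, suffices for efficient learning.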

JMLR Journal 2022 Journal Article

Improved Generalization Bounds for Adversarially Robust Learning

  • Idan Attias
  • Aryeh Kontorovich
  • Yishay Mansour

We consider a model of robust learning in an adversarial environment. The learner gets uncorrupted training data with access to possible corruptions that may be affected by the adversary during testing. The learner's goal is to build a robust classifier, which will be tested on future adversarial examples. The adversary is limited to $k$ possible corruptions for each input. We model the learner-adversary interaction as a zero-sum game. This model is closely related to the adversarial examples model of Schmidt et al. (2018); Madry et al. (2017). Our main results consist of generalization bounds for the binary and multiclass classification, as well as the real-valued case (regression). For the binary classification setting, we both tighten the generalization bound of Feige et al. (2015), and are also able to handle infinite hypothesis classes. The sample complexity is improved from $O(\frac{1}{\epsilon^4}\log(\frac{|H|}{\delta}))$ to $O\big(\frac{1}{\epsilon^2}\big(kVC(H)\log^{\frac{3}{2}+\alpha}(kVC(H))+\log(\frac{1}{\delta})\big)\big)$ for any $\alpha > 0$. Additionally, we extend the algorithm and generalization bound from the binary to the multiclass and real-valued cases. Along the way, we obtain results on fat-shattering dimension and Rademacher complexity of $k$-fold maxima over function classes; these may be of independent interest. For binary classification, the algorithm of Feige et al. (2015) uses a regret minimization algorithm and an ERM oracle as a black box; we adapt it for the multiclass and regression settings. The algorithm provides us with near-optimal policies for the players on a given training sample.

AAAI Conference 2022 Conference Paper

Learning Adversarial Markov Decision Processes with Delayed Feedback

  • Tal Lancewicki
  • Aviv Rosenberg
  • Yishay Mansour

Reinforcement learning typically assumes that agents observe feedback for their actions immediately, but in many real-world applications (like recommendation systems) feedback is observed in delay. This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs and unrestricted delayed feedback. That is, the costs and trajectory of episode $k$ are revealed to the learner only in the end of episode $k + d^k$, where the delays $d^k$ are neither identical nor bounded, and are chosen by an oblivious adversary. We present novel algorithms based on policy optimization that achieve near-optimal high-probability regret of $\sqrt{K + D}$ under full-information feedback, where $K$ is the number of episodes and $D = \sum_k d^k$ is the total delay. Under bandit feedback, we prove similar $\sqrt{K + D}$ regret assuming the costs are stochastic, and $(K + D)^{2/3}$ regret in the general case. We are the first to consider regret minimization in the important setting of MDPs with delayed feedback.

EWRL Workshop 2022 Workshop Paper

Learning Efficiently Function Approximation for Contextual MDP

  • Orin Levy
  • Yishay Mansour

We study learning contextual MDPs using function approximation for both the rewards and the dynamics. We consider both the case where the dynamics depends on the context and the case where it is independent of the context. For both models we derive polynomial sample and time complexity (assuming an efficient ERM oracle). Our methodology gives a general reduction from learning contextual MDPs to supervised learning.

AAAI Conference 2022 Conference Paper

Modeling Attrition in Recommender Systems with Departing Bandits

  • Omer Ben-Porat
  • Lee Cohen
  • Liu Leqi
  • Zachary C. Lipton
  • Yishay Mansour

Traditionally, when recommender systems are formalized as multi-armed bandits, the policy of the recommender system influences the rewards accrued, but not the length of interaction. However, in real-world systems, dissatisfied users may depart (and never come back). In this work, we propose a novel multi-armed bandit setup that captures such policy-dependent horizons. Our setup consists of a finite set of user types, and multiple arms with Bernoulli payoffs. Each (user type, arm) tuple corresponds to an (unknown) reward probability. Each user’s type is initially unknown and can only be inferred through their response to recommendations. Moreover, if a user is dissatisfied with their recommendation, they might depart the system. We first address the case where all users share the same type, demonstrating that a recent UCB-based algorithm is optimal. We then move forward to the more challenging case, where users are divided among two types. While naive approaches cannot handle this setting, we provide an efficient learning algorithm that achieves $\tilde{O}(\sqrt{T})$ regret, where $T$ is the number of users.
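As background for the UCB-based approach mentioned above, the generic UCB1 index balances an arm's empirical mean against a confidence bonus. A minimal sketch (standard background, not the paper's departing-bandits algorithm; names are ours):

```python
import math

def ucb1_index(mean, count, t, c=2.0):
    """Generic UCB1 index: empirical mean reward plus a confidence bonus
    that shrinks as an arm is pulled more often."""
    if count == 0:
        return float("inf")  # force at least one pull of every arm
    return mean + math.sqrt(c * math.log(t) / count)

def ucb1_choose(means, counts, t):
    """Pick the arm with the largest UCB1 index at round t."""
    return max(range(len(means)),
               key=lambda a: ucb1_index(means[a], counts[a], t))
```

The departing-bandits twist is that a bad pull can end the user's horizon entirely, which is what the paper's analysis has to account for.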

NeurIPS Conference 2022 Conference Paper

Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

  • Tiancheng Jin
  • Tal Lancewicki
  • Haipeng Luo
  • Yishay Mansour
  • Aviv Rosenberg

The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed in delay. This paper studies online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only in the end of episode $k + d^k$, where the delay $d^k$ can be changing over episodes and chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.

EWRL Workshop 2022 Workshop Paper

Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

  • Tiancheng Jin
  • Tal Lancewicki
  • Haipeng Luo
  • Yishay Mansour
  • Aviv Rosenberg

The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed in delay. This paper studies online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only in the end of episode $k + d^k$, where the delay $d^k$ can be changing over episodes and chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.

JMLR Journal 2022 Journal Article

Nonstochastic Bandits with Composite Anonymous Feedback

  • Nicolò Cesa-Bianchi
  • Tommaso Cesari
  • Roberto Colomboni
  • Claudio Gentile
  • Yishay Mansour

We investigate a nonstochastic bandit setting in which the loss of an action is not immediately charged to the player, but rather spread over the subsequent rounds in an adversarial way. The instantaneous loss observed by the player at the end of each round is then a sum of many loss components of previously played actions. This setting encompasses as a special case the easier task of bandits with delayed feedback, a well-studied framework where the player observes the delayed losses individually. Our first contribution is a general reduction transforming a standard bandit algorithm into one that can operate in the harder setting: We bound the regret of the transformed algorithm in terms of the stability and regret of the original algorithm. Then, we show that the transformation of a suitably tuned FTRL with Tsallis entropy has a regret of order $\sqrt{(d+1)KT}$, where $d$ is the maximum delay, $K$ is the number of arms, and $T$ is the time horizon. Finally, we show that our results cannot be improved in general by exhibiting a matching (up to a log factor) lower bound on the regret of any algorithm operating in this setting.
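The FTRL-with-Tsallis-entropy update referenced above, in its standard 1/2-Tsallis (Tsallis-INF) form, samples arm $i$ with probability $p_i = 4/(\eta(L_i - x))^2$, where $x$ is a normalizer making the probabilities sum to one. A sketch assuming that standard form (background only, not the paper's transformed, composite-feedback algorithm):

```python
import math

def tsallis_inf_distribution(cum_losses, eta, iters=100):
    """Sampling distribution of FTRL with 1/2-Tsallis entropy:
    p_i = 4 / (eta * (L_i - x))^2, with the normalizer x < min_i L_i
    found by binary search so that the p_i sum to one."""
    k = len(cum_losses)
    m = min(cum_losses)
    lo = m - 2.0 * math.sqrt(k) / eta  # here each p_i <= 1/k, so sum <= 1
    hi = m - 1e-12                     # here the best arm's p_i explodes
    for _ in range(iters):
        x = 0.5 * (lo + hi)
        s = sum(4.0 / (eta * (L - x)) ** 2 for L in cum_losses)
        if s > 1.0:
            hi = x
        else:
            lo = x
    x = 0.5 * (lo + hi)
    p = [4.0 / (eta * (L - x)) ** 2 for L in cum_losses]
    total = sum(p)
    return [pi / total for pi in p]  # renormalize away residual search error
```

Smaller cumulative loss yields larger sampling probability, and the polynomial (rather than exponential) decay is the signature of the Tsallis regularizer.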

EWRL Workshop 2022 Workshop Paper

Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP

  • Orin Levy
  • Yishay Mansour

We present regret minimization algorithms for stochastic contextual MDPs under a minimum reachability assumption, using access to an offline least squares regression oracle. We analyze three different settings: where the dynamics is known, where the dynamics is unknown but independent of the context, and the most challenging setting where the dynamics is unknown and context-dependent. For the latter, our algorithm obtains a regret bound of $\tilde{O}\big(\max\{H, 1/p_{\min}\} H |S|^{3/2} \sqrt{|A| T \log(\max\{|\mathcal{G}|, |\mathcal{P}|\}/\delta)}\big)$ with probability $1 - \delta$, where $\mathcal{P}$ and $\mathcal{G}$ are finite and realizable function classes used to approximate the dynamics and rewards respectively, $p_{\min}$ is the minimum reachability parameter, $S$ is the set of states, $A$ the set of actions, $H$ the horizon, and $T$ the number of episodes. To our knowledge, our approach is the first optimistic approach applied to contextual MDPs with general function approximation (i.e., without additional knowledge regarding the function class, such as it being linear, etc.). In addition, we present a lower bound of $\Omega(\sqrt{TH|S||A| \ln(|\mathcal{G}|)/\ln(|A|)})$ on the expected regret which holds even in the case of known dynamics.

ICML Conference 2021 Conference Paper

Adversarial Dueling Bandits

  • Aadirupa Saha
  • Tomer Koren
  • Yishay Mansour

We introduce the problem of regret minimization in Adversarial Dueling Bandits. As in classic Dueling Bandits, the learner has to repeatedly choose a pair of items and observe only a relative binary ‘win-loss’ feedback for this pair, but here this feedback is generated from an arbitrary preference matrix, possibly chosen adversarially. Our main result is an algorithm whose $T$-round regret compared to the \emph{Borda-winner} from a set of $K$ items is $\tilde{O}(K^{1/3}T^{2/3})$, as well as a matching $\Omega(K^{1/3}T^{2/3})$ lower bound. We also prove a similar high probability regret bound. We further consider a simpler \emph{fixed-gap} adversarial setup, which bridges between two extreme preference feedback models for dueling bandits: stationary preferences and an arbitrary sequence of preferences. For the fixed-gap adversarial setup we give an $\smash{ \tilde{O}((K/\Delta^2)\log{T}) }$ regret algorithm, where $\Delta$ is the gap in Borda scores between the best item and all other items, and show a lower bound of $\Omega(K/\Delta^2)$ indicating that our dependence on the main problem parameters $K$ and $\Delta$ is tight (up to logarithmic factors). Finally, we corroborate the theoretical results with empirical evaluations.
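The Borda score against which regret is measured here has a simple closed form given the preference matrix. A minimal sketch (the matrix layout and function name are our own convention):

```python
def borda_scores(pref):
    """Borda score of item i: its average probability of beating each of the
    other items, given a preference matrix pref[i][j] = P(i beats j)."""
    k = len(pref)
    return [sum(pref[i][j] for j in range(k) if j != i) / (k - 1)
            for i in range(k)]
```

The Borda winner is then simply the item maximizing this score; in the adversarial setting of the paper, the preference matrix may change every round.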

NeurIPS Conference 2021 Conference Paper

Agnostic Reinforcement Learning with Low-Rank MDPs and Rich Observations

  • Ayush Sekhari
  • Christoph Dann
  • Mehryar Mohri
  • Yishay Mansour
  • Karthik Sridharan

There have been many recent advances on provably efficient Reinforcement Learning (RL) in problems with rich observation spaces. However, all these works share a strong realizability assumption about the optimal value function of the true MDP. Such realizability assumptions are often too strong to hold in practice. In this work, we consider the more realistic setting of agnostic RL with rich observation spaces and a fixed class of policies $\Pi$ that may not contain any near-optimal policy. We provide an algorithm for this setting whose error is bounded in terms of the rank $d$ of the underlying MDP. Specifically, our algorithm enjoys a sample complexity bound of $\widetilde{O}\left((H^{4d} K^{3d} \log |\Pi|)/\epsilon^2\right)$ where $H$ is the length of episodes, $K$ is the number of actions and $\epsilon>0$ is the desired sub-optimality. We also provide a nearly matching lower bound for this agnostic setting that shows that the exponential dependence on rank is unavoidable, without further assumptions.

NeurIPS Conference 2021 Conference Paper

Differentially Private Multi-Armed Bandits in the Shuffle Model

  • Jay Tenenbaum
  • Haim Kaplan
  • Yishay Mansour
  • Uri Stemmer

We give an $(\varepsilon, \delta)$-differentially private algorithm for the Multi-Armed Bandit (MAB) problem in the shuffle model with a distribution-dependent regret of $O\left(\left(\sum_{a: \Delta_a>0}\frac{\log T}{\Delta_a}\right)+\frac{k\sqrt{\log\frac{1}{\delta}}\log T}{\varepsilon}\right)$, and a distribution-independent regret of $O\left(\sqrt{kT\log T}+\frac{k\sqrt{\log\frac{1}{\delta}}\log T}{\varepsilon}\right)$, where $T$ is the number of rounds, $\Delta_a$ is the suboptimality gap of the action $a$, and $k$ is the total number of actions. Our upper bound almost matches the regret of the best known algorithms for the centralized model, and significantly outperforms the best known algorithm in the local model.

ICML Conference 2021 Conference Paper

Differentially-Private Clustering of Easy Instances

  • Edith Cohen
  • Haim Kaplan
  • Yishay Mansour
  • Uri Stemmer
  • Eliad Tsfadia

Clustering is a fundamental problem in data analysis. In differentially private clustering, the goal is to identify k cluster centers without disclosing information on individual data points. Despite significant research progress, the problem had so far resisted practical solutions. In this work we aim at providing simple, implementable differentially private clustering algorithms when the data is "easy," e.g., when there exists a significant separation between the clusters. For the easy instances we consider, we have a simple implementation based on utilizing non-private clustering algorithms and combining them privately. We are able to get improved sample complexity bounds in some cases of Gaussian mixtures and k-means. We complement our theoretical algorithms with experiments on simulated data.

NeurIPS Conference 2021 Conference Paper

Dueling Bandits with Team Comparisons

  • Lee Cohen
  • Ulrike Schmidt-Kraepelin
  • Yishay Mansour

We introduce the dueling teams problem, a new online-learning setting in which the learner observes noisy comparisons of disjoint pairs of $k$-sized teams from a universe of $n$ players. The goal of the learner is to minimize the number of duels required to identify, with high probability, a Condorcet winning team, i.e., a team which wins against any other disjoint team (with probability at least $1/2$). Noisy comparisons are linked to a total order on the teams. We formalize our model by building upon the dueling bandits setting (Yue et al. 2012) and provide several algorithms, both for stochastic and deterministic settings. For the stochastic setting, we provide a reduction to the classical dueling bandits setting, yielding an algorithm that identifies a Condorcet winning team within $\mathcal{O}((n + k \log (k)) \frac{\max(\log\log n, \log k)}{\Delta^2})$ duels, where $\Delta$ is a gap parameter. For deterministic feedback, we additionally present a gap-independent algorithm that identifies a Condorcet winning team within $\mathcal{O}(nk\log(k)+k^5)$ duels.

ICML Conference 2021 Conference Paper

Dueling Convex Optimization

  • Aadirupa Saha
  • Tomer Koren
  • Yishay Mansour

We address the problem of convex optimization with preference (dueling) feedback. As in the traditional optimization objective, the goal is to find the optimal point with the least possible query complexity, however without the luxury of even zeroth-order feedback. Instead, the learner can only observe a single noisy bit of win-loss feedback for a pair of queried points, based on their function values. The problem is of great practical relevance, as in many real-world scenarios, such as recommender systems or learning from customer preferences, the system feedback is often restricted to just one binary bit of preference information. We consider the problem of online convex optimization (OCO) solely by actively querying $\{0, 1\}$ noisy-comparison feedback on pairs of decision points, with the objective of finding a near-optimal point (function minimizer) with the least possible number of queries. For the non-stationary OCO setup, where the underlying convex function may change over time, we prove an impossibility result towards achieving the above objective. We then focus on the stationary OCO problem, and our main contribution lies in designing a normalized-gradient-descent-based algorithm for finding an $\epsilon$-best optimal point. Our algorithm is shown to yield a convergence rate of $\tilde O\big(\frac{d\beta}{\epsilon \nu^2}\big)$ ($\nu$ being the noise parameter) when the underlying function is $\beta$-smooth. Further, we show an improved convergence rate of just $\tilde O\big(\frac{d\beta}{\alpha \nu^2} \log \frac{1}{\epsilon}\big)$ when the function is additionally $\alpha$-strongly convex.

NeurIPS Conference 2021 Conference Paper

Minimax Regret for Stochastic Shortest Path

  • Alon Cohen
  • Yonathan Efroni
  • Yishay Mansour
  • Aviv Rosenberg

We study the Stochastic Shortest Path (SSP) problem in which an agent has to reach a goal state in minimum total expected cost. In the learning formulation of the problem, the agent has no prior knowledge about the costs and dynamics of the model. She repeatedly interacts with the model for $K$ episodes, and has to minimize her regret. In this work we show that the minimax regret for this setting is $\widetilde O(\sqrt{ (B_\star^2 + B_\star) |S| |A| K})$ where $B_\star$ is a bound on the expected cost of the optimal policy from any state, $S$ is the state space, and $A$ is the action space. This matches the $\Omega (\sqrt{ B_\star^2 |S| |A| K})$ lower bound of Rosenberg et al. [2020] for $B_\star \ge 1$, and improves their regret bound by a factor of $\sqrt{|S|}$. For $B_\star < 1$ we prove a matching lower bound of $\Omega (\sqrt{ B_\star |S| |A| K})$. Our algorithm is based on a novel reduction from SSP to finite-horizon MDPs. To that end, we provide an algorithm for the finite-horizon setting whose leading term in the regret depends polynomially on the expected cost of the optimal policy and only logarithmically on the horizon.

NeurIPS Conference 2021 Conference Paper

Optimal Rates for Random Order Online Optimization

  • Uri Sherman
  • Tomer Koren
  • Yishay Mansour

We study online convex optimization in the random order model, recently proposed by Garber et al. (2020), where the loss functions may be chosen by an adversary, but are then presented to the online algorithm in a uniformly random order. Focusing on the scenario where the cumulative loss function is (strongly) convex, yet individual loss functions are smooth but might be non-convex, we give algorithms that achieve the optimal bounds and significantly outperform the results of Garber et al. (2020), completely removing the dimension dependence and improving their scaling with respect to the strong convexity parameter. Our analysis relies on novel connections between algorithmic stability and generalization for sampling without replacement, analogous to those studied in the with-replacement i.i.d. setting, as well as on a refined average stability analysis of stochastic gradient descent.

NeurIPS Conference 2021 Conference Paper

Oracle-Efficient Regret Minimization in Factored MDPs with Unknown Structure

  • Aviv Rosenberg
  • Yishay Mansour

We study regret minimization in non-episodic factored Markov decision processes (FMDPs), where all existing algorithms make the strong assumption that the factored structure of the FMDP is known to the learner in advance. In this paper, we provide the first algorithm that learns the structure of the FMDP while minimizing the regret. Our algorithm is based on the optimism in the face of uncertainty principle, combined with a simple statistical method for structure learning, and can be implemented efficiently given oracle-access to an FMDP planner. Moreover, we give a variant of our algorithm that remains efficient even when the oracle is limited to non-factored actions, which is the case with almost all existing approximate planners. Finally, we leverage our techniques to prove a novel lower bound for the known structure case, closing the gap to the regret bound of Chen et al. [2021].

NeurIPS Conference 2021 Conference Paper

ROI Maximization in Stochastic Online Decision-Making

  • Nicolò Cesa-Bianchi
  • Tom Cesari
  • Yishay Mansour
  • Vianney Perchet

We introduce a novel theoretical framework for Return On Investment (ROI) maximization in repeated decision-making. Our setting is motivated by the use case of companies that regularly receive proposals for technological innovations and want to quickly decide whether they are worth implementing. We design an algorithm for learning ROI-maximizing decision-making policies over a sequence of innovation proposals. Our algorithm provably converges to an optimal policy in class $\Pi$ at a rate of order $\min\big\{1/(N\Delta^2), N^{-1/3}\big\}$, where $N$ is the number of innovations and $\Delta$ is the suboptimality gap in $\Pi$. A significant hurdle of our formulation, which sets it apart from other online learning problems such as bandits, is that running a policy does not provide an unbiased estimate of its performance.

ICML Conference 2021 Conference Paper

Stochastic Multi-Armed Bandits with Unrestricted Delay Distributions

  • Tal Lancewicki
  • Shahar Segal
  • Tomer Koren
  • Yishay Mansour

We study the stochastic Multi-Armed Bandit (MAB) problem with random delays in the feedback received by the algorithm. We consider two settings: the {\it reward dependent} delay setting, where realized delays may depend on the stochastic rewards, and the {\it reward-independent} delay setting. Our main contribution is algorithms that achieve near-optimal regret in each of the settings, with an additional additive dependence on the quantiles of the delay distribution. Our results do not make any assumptions on the delay distributions: in particular, we do not assume they come from any parametric family of distributions and allow for unbounded support and expectation; we further allow for the case of infinite delays where the algorithm might occasionally not observe any feedback.

IJCAI Conference 2021 Conference Paper

Stochastic Shortest Path with Adversarially Changing Costs

  • Aviv Rosenberg
  • Yishay Mansour

Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost. In this paper we present the adversarial SSP model that also accounts for adversarial changes in the costs over time, while the underlying transition function remains unchanged. Formally, an agent interacts with an SSP environment for $K$ episodes, the cost function changes arbitrarily between episodes, and the transitions are unknown to the agent. We develop the first algorithms for adversarial SSPs and prove high probability regret bounds of $\sqrt{K}$ assuming all costs are strictly positive, and sub-linear regret in the general case. We are the first to consider this natural setting of adversarial SSP and obtain sub-linear regret for it.

NeurIPS Conference 2020 Conference Paper

Adversarially Robust Streaming Algorithms via Differential Privacy

  • Avinatan Hasidim
  • Haim Kaplan
  • Yishay Mansour
  • Yossi Matias
  • Uri Stemmer

A streaming algorithm is said to be adversarially robust if its accuracy guarantees are maintained even when the data stream is chosen maliciously, by an adaptive adversary. We establish a connection between adversarial robustness of streaming algorithms and the notion of differential privacy. This connection allows us to design new adversarially robust streaming algorithms that outperform the current state-of-the-art constructions for many interesting regimes of parameters.

AAAI Conference 2020 Conference Paper

Apprenticeship Learning via Frank-Wolfe

  • Tom Zahavy
  • Alon Cohen
  • Haim Kaplan
  • Yishay Mansour

We consider the applications of the Frank-Wolfe (FW) algorithm for Apprenticeship Learning (AL). In this setting, we are given a Markov Decision Process (MDP) without an explicit reward function. Instead, we observe an expert that acts according to some policy, and the goal is to find a policy whose feature expectations are closest to those of the expert policy. We formulate this problem as finding the projection of the feature expectations of the expert on the feature expectations polytope – the convex hull of the feature expectations of all the deterministic policies in the MDP. We show that this formulation is equivalent to the AL objective and that solving this problem using the FW algorithm is equivalent to the well-known Projection method of Abbeel and Ng (2004). This insight allows us to analyze AL with tools from the convex optimization literature and derive tighter convergence bounds for AL. Specifically, we show that a variation of the FW method that is based on taking “away steps” achieves a linear rate of convergence when applied to AL and that a stochastic version of the FW algorithm can be used to avoid precise estimation of feature expectations. We also experimentally show that this version outperforms the FW baseline. To the best of our knowledge, this is the first work that shows linear convergence rates for AL.
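
The projection viewpoint can be illustrated on a toy problem: classical Frank-Wolfe projecting a point onto the probability simplex, where the linear minimization step simply picks the best vertex. This is a schematic analogue of projecting onto the feature-expectations polytope, not the paper's AL procedure; the step-size schedule and iteration count are illustrative.

```python
def frank_wolfe_projection(target, iters=2000):
    """Frank-Wolfe for min ||x - target||^2 over the probability simplex.

    Each step solves a linear minimization over the simplex (the vertex
    e_j minimizing the gradient) and moves with step size 2/(t+2).
    Toy sketch only; not the paper's apprenticeship-learning algorithm.
    """
    n = len(target)
    x = [1.0 / n] * n  # start at the barycenter of the simplex
    for t in range(iters):
        grad = [2.0 * (xi - yi) for xi, yi in zip(x, target)]
        j = min(range(n), key=lambda i: grad[i])  # best vertex e_j
        gamma = 2.0 / (t + 2)                     # classical FW step size
        x = [(1 - gamma) * xi for xi in x]        # convex combination with e_j
        x[j] += gamma
    return x
```

Because every iterate is a convex combination of simplex vertices, the output stays on the simplex by construction, mirroring how FW iterates in AL stay inside the feature-expectations polytope.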

AAAI Conference 2020 Conference Paper

Designing Committees for Mitigating Biases

  • Michal Feldman
  • Yishay Mansour
  • Noam Nisan
  • Sigal Oren
  • Moshe Tennenholtz

It is widely observed that individuals prefer to interact with others who are more similar to them (this phenomenon is termed homophily). This similarity manifests itself in various ways such as beliefs, values and education. Thus, it should not come as a surprise that when people make hiring choices, for example, their similarity to the candidate plays a role in their choice. In this paper, we suggest that putting the decision in the hands of a committee instead of a single person can reduce this bias. We study a novel model of voting in which a committee of experts is constructed to reduce the biases of its members. We first present voting rules that optimally reduce the biases of a given committee. Our main results include the design of committees, for several settings, that are able to reach a nearly optimal (unbiased) choice. We also provide a thorough analysis of the trade-offs between the committee size and the obtained error. Our model is inherently different from the well-studied models of voting that focus on aggregation of preferences or on aggregation of information due to the introduction of similarity biases.

ICML Conference 2020 Conference Paper

Near-optimal Regret Bounds for Stochastic Shortest Path

  • Aviv Rosenberg 0002
  • Alon Cohen
  • Yishay Mansour
  • Haim Kaplan

Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost. In the learning formulation of the problem, the agent is unaware of the environment dynamics (i.e., the transition function) and has to repeatedly play for a given number of episodes, while learning the problem’s optimal solution. Unlike other well-studied models in reinforcement learning (RL), the length of an episode is not predetermined (or bounded) and is influenced by the agent’s actions. Recently, Tarbouriech et al. (2019) studied this problem in the context of regret minimization, and provided an algorithm whose regret bound is inversely proportional to the square root of the minimum instantaneous cost. In this work we remove this dependence on the minimum cost—we give an algorithm that guarantees a regret bound of $\widetilde{O}(B^{3/2} S \sqrt{A K})$, where $B$ is an upper bound on the expected cost of the optimal policy, $S$ is the number of states, $A$ is the number of actions and $K$ is the total number of episodes. We additionally show that any learning algorithm must have at least $\Omega(B \sqrt{S A K})$ regret in the worst case.

IJCAI Conference 2020 Conference Paper

Online Revenue Maximization for Server Pricing

  • Shant Boodaghians
  • Federico Fusco
  • Stefano Leonardi
  • Yishay Mansour
  • Ruta Mehta

Efficient and truthful mechanisms to price time on remote servers/machines have been the subject of much work in recent years due to the importance of the cloud market. This paper considers online revenue maximization for a unit-capacity server, when jobs are non-preemptive, in the Bayesian setting: at each time step, one job arrives, with parameters drawn from an underlying distribution. We design an efficiently computable truthful posted price mechanism, which maximizes revenue in expectation and in retrospect, up to additive error. The prices are posted prior to learning the agent's type, and the computed pricing scheme is deterministic. We also show the pricing mechanism is robust to learning the job distribution from samples, where polynomially many samples suffice to obtain near-optimal prices.

NeurIPS Conference 2020 Conference Paper

Prediction with Corrupted Expert Advice

  • Idan Amir
  • Idan Attias
  • Tomer Koren
  • Yishay Mansour
  • Roi Livni

We revisit the fundamental problem of prediction with expert advice, in a setting where the environment is benign and generates losses stochastically, but the feedback observed by the learner is subject to a moderate adversarial corruption. We prove that a variant of the classical Multiplicative Weights algorithm with decreasing step sizes achieves constant regret in this setting and performs optimally in a wide range of environments, regardless of the magnitude of the injected corruption. Our results reveal a surprising disparity between the often comparable Follow the Regularized Leader (FTRL) and Online Mirror Descent (OMD) frameworks: we show that for experts in the corrupted stochastic regime, the regret performance of OMD is in fact strictly inferior to that of FTRL.

NeurIPS Conference 2020 Conference Paper

Private Learning of Halfspaces: Simplifying the Construction and Reducing the Sample Complexity

  • Haim Kaplan
  • Yishay Mansour
  • Uri Stemmer
  • Eliad Tsfadia

We present a differentially private learner for halfspaces over a finite grid $G$ in $\mathbb{R}^d$ with sample complexity $\approx d^{2.5}\cdot 2^{\log^*|G|}$, which improves the state-of-the-art result of [Beimel et al., COLT 2019] by a $d^2$ factor. The building block for our learner is a new differentially private algorithm for approximately solving the linear feasibility problem: Given a feasible collection of $m$ linear constraints of the form $Ax\geq b$, the task is to {\em privately} identify a solution $x$ that satisfies {\em most} of the constraints. Our algorithm is iterative, where each iteration determines the next coordinate of the constructed solution $x$.

NeurIPS Conference 2020 Conference Paper

Reinforcement Learning with Feedback Graphs

  • Christoph Dann
  • Yishay Mansour
  • Mehryar Mohri
  • Ayush Sekhari
  • Karthik Sridharan

We study RL in the tabular MDP setting where the agent receives additional observations per step in the form of transition samples. Such additional observations can be provided in many tasks by auxiliary sensors or by leveraging prior knowledge about the environment (e.g., when certain actions yield similar outcomes). We formalize this setting using a feedback graph over state-action pairs and show that model-based algorithms can incorporate additional observations for more sample-efficient learning. We give a regret bound that predominantly depends on the size of the maximum acyclic subgraph of the feedback graph, in contrast with a polynomial dependency on the number of states and actions in the absence of side observations. Finally, we highlight fundamental challenges for leveraging a small dominating set of the feedback graph, as compared to the well-studied bandit setting, and propose a new algorithm that can use such a dominating set to learn a near-optimal policy faster.

NeurIPS Conference 2020 Conference Paper

Sample Complexity of Uniform Convergence for Multicalibration

  • Eliran Shabat
  • Lee Cohen
  • Yishay Mansour

There is a growing interest in societal concerns in machine learning systems, especially in fairness. Multicalibration gives a comprehensive methodology to address group fairness. In this work, we address the multicalibration error and decouple it from the prediction error. The importance of decoupling the fairness metric (multicalibration) and the accuracy (prediction error) is due to the inherent trade-off between the two, and the societal decision regarding the “right trade-off” (as often imposed by regulators). Our work gives sample complexity bounds for uniform convergence guarantees of the multicalibration error, which implies that regardless of the accuracy, we can guarantee that the empirical and (true) multicalibration errors are close. We emphasize that our results: (1) are more general than previous bounds, as they apply to both agnostic and realizable settings, and do not rely on a specific type of algorithm (such as differentially private); (2) improve over previous multicalibration sample complexity bounds; and (3) imply uniform convergence guarantees for the classical calibration error.

UAI Conference 2020 Conference Paper

Unknown mixing times in apprenticeship and reinforcement learning

  • Tom Zahavy
  • Alon Cohen
  • Haim Kaplan
  • Yishay Mansour

We derive and analyze learning algorithms for apprenticeship learning, policy evaluation and policy gradient for average reward criteria. Existing algorithms explicitly require an upper bound on the mixing time. In contrast, we build on ideas from Markov chain theory and derive sampling algorithms that do not require such an upper bound. For these algorithms, we provide theoretical bounds on their sample-complexity and running time.

ICML Conference 2019 Conference Paper

Adversarial Online Learning with noise

  • Alon Resler
  • Yishay Mansour

We present and study models of adversarial online learning where the feedback observed by the learner is noisy, and the feedback is either full information feedback or bandit feedback. Specifically, we consider binary losses XORed with the noise, which is a Bernoulli random variable. We consider both a constant noise rate and a variable noise rate. Our main results are tight regret bounds for learning with noise in the adversarial online learning model.

JMLR Journal 2019 Journal Article

Delay and Cooperation in Nonstochastic Bandits

  • Nicolò Cesa-Bianchi
  • Claudio Gentile
  • Yishay Mansour

We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than $d$ hops to arrive, where $d$ is a delay parameter. We introduce Exp3-Coop, a cooperative version of the Exp3 algorithm and prove that with $K$ actions and $N$ agents the average per-agent regret after $T$ rounds is at most of order $\sqrt{\bigl(d+1 + \tfrac{K}{N}\alpha_{\le d}\bigr)(T\ln K)}$, where $\alpha_{\le d}$ is the independence number of the $d$-th power of the communication graph $G$. We then show that for any connected graph, for $d=\sqrt{K}$ the regret bound is $K^{1/4}\sqrt{T}$, strictly better than the minimax regret $\sqrt{KT}$ for noncooperating agents. More informed choices of $d$ lead to bounds which are arbitrarily close to the full information minimax regret $\sqrt{T\ln K}$ when $G$ is dense. When $G$ has sparse components, we show that a variant of Exp3-Coop, allowing agents to choose their parameters according to their centrality in $G$, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.
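
Exp3-Coop builds on the classical single-agent Exp3 algorithm. As a point of reference, a minimal Exp3 sketch with importance-weighted loss estimates follows; this is the textbook single-agent version with an illustrative fixed step size, not the cooperative variant from the paper.

```python
import math
import random

def exp3(losses, eta):
    """Classical Exp3 for adversarial bandits.

    losses: T x K matrix of losses in [0, 1]; the learner only observes the
    loss of the arm it pulls each round. eta is the (fixed) step size.
    Returns the learner's total realized loss. Textbook sketch only.
    """
    K = len(losses[0])
    weights = [1.0] * K
    total_loss = 0.0
    for round_losses in losses:
        z = sum(weights)
        probs = [w / z for w in weights]
        arm = random.choices(range(K), weights=probs)[0]
        loss = round_losses[arm]       # bandit feedback: one entry observed
        total_loss += loss
        est = loss / probs[arm]        # unbiased importance-weighted estimate
        weights[arm] *= math.exp(-eta * est)
    return total_loss
```

With the usual step size $\eta = \sqrt{2\ln K/(KT)}$, the algorithm concentrates on the better arm and its realized loss falls well below the uniform-play baseline on a benign instance.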

ICML Conference 2019 Conference Paper

Differentially Private Learning of Geometric Concepts

  • Haim Kaplan
  • Yishay Mansour
  • Yossi Matias
  • Uri Stemmer

We present differentially private efficient algorithms for learning unions of polygons in the plane (which are not necessarily convex). Our algorithms achieve $(\alpha, \beta)$-PAC learning and $(\epsilon, \delta)$-differential privacy using a sample of size $\tilde{O}\left(\frac{1}{\alpha\epsilon}k\log d\right)$, where the domain is $[d]\times[d]$ and $k$ is the number of edges in the union of polygons.

NeurIPS Conference 2019 Conference Paper

Graph-based Discriminators: Sample Complexity and Expressiveness

  • Roi Livni
  • Yishay Mansour

A basic question in learning theory is to identify whether two distributions are identical when we have access only to examples sampled from the distributions. This basic task is considered, for example, in the context of Generative Adversarial Networks (GANs), where a discriminator is trained to distinguish between a real-life distribution and a synthetic distribution. Classically, we use a hypothesis class $H$ and claim that the two distributions are distinct if for some $h\in H$ the expected value on the two distributions is (significantly) different. Our starting point is the following fundamental question: is having the hypothesis depend on more than a single random example beneficial? To address this challenge we define $k$-ary based discriminators, which have a family of Boolean $k$-ary functions $\mathcal{G}$. Each function $g\in \mathcal{G}$ naturally defines a hyper-graph, indicating whether a given hyper-edge exists. A function $g\in \mathcal{G}$ distinguishes between two distributions if the expected value of $g$, on a $k$-tuple of i.i.d. examples, on the two distributions is (significantly) different. We study the expressiveness of families of $k$-ary functions, compared to the classical hypothesis class $H$, which corresponds to $k=1$. We show a separation in expressiveness between $(k+1)$-ary and $k$-ary functions. This demonstrates the great benefit of having $k\geq 2$ as distinguishers. For $k\geq 2$ we introduce a notion similar to the VC-dimension and show that it controls the sample complexity. We proceed to provide upper and lower bounds as a function of our extended notion of VC-dimension.

NeurIPS Conference 2019 Conference Paper

Individual Regret in Cooperative Nonstochastic Multi-Armed Bandits

  • Yogev Bar-On
  • Yishay Mansour

We study agents communicating over an underlying network by exchanging messages, in order to optimize their individual regret in a common nonstochastic multi-armed bandit problem. We derive regret minimization algorithms that guarantee for each agent $v$ an individual expected regret of $\widetilde{O}\left(\sqrt{\left(1+\frac{K}{\left|\mathcal{N}\left(v\right)\right|}\right)T}\right)$, where $T$ is the number of time steps, $K$ is the number of actions and $\mathcal{N}\left(v\right)$ is the set of neighbors of agent $v$ in the communication graph. We present algorithms both for the case that the communication graph is known to all the agents, and for the case that the graph is unknown. When the graph is unknown, each agent knows only the set of its neighbors and an upper bound on the total number of agents. The individual regret between the models differs only by a logarithmic factor. Our work resolves an open problem from [Cesa-Bianchi et al., 2019b].

ICML Conference 2019 Conference Paper

Learning Linear-Quadratic Regulators Efficiently with only √T Regret

  • Alon Cohen
  • Tomer Koren
  • Yishay Mansour

We present the first computationally-efficient algorithm with $\widetilde{O}(\sqrt{T})$ regret for learning in Linear Quadratic Control systems with unknown dynamics. By that, we resolve an open question of Abbasi-Yadkori and Szepesvari (2011) and Dean, Mania, Matni, Recht, and Tu (2018).

NeurIPS Conference 2019 Conference Paper

Learning to Screen

  • Alon Cohen
  • Avinatan Hassidim
  • Haim Kaplan
  • Yishay Mansour
  • Shay Moran

Imagine a large firm with multiple departments that plans a large recruitment. Candidates arrive one-by-one, and for each candidate the firm decides, based on her data (CV, skills, experience, etc), whether to summon her for an interview. The firm wants to recruit the best candidates while minimizing the number of interviews. We model such scenarios as an assignment problem between items (candidates) and categories (departments): the items arrive one-by-one in an online manner, and upon processing each item the algorithm decides, based on its value and the categories it can be matched with, whether to retain or discard it (this decision is irrevocable). The goal is to retain as few items as possible while guaranteeing that the set of retained items contains an optimal matching. We consider two variants of this problem: (i) in the first variant it is assumed that the $n$ items are drawn independently from an unknown distribution $D$. (ii) In the second variant it is assumed that before the process starts, the algorithm has access to a training set of $n$ items drawn independently from the same unknown distribution (e.g., data of candidates from previous recruitment seasons). We give tight bounds on the minimum possible number of retained items in each of these variants. These results demonstrate that one can retain exponentially fewer items in the second variant (with the training set). Our algorithms and analysis utilize ideas and techniques from statistical learning theory and from discrete algorithms.

ICML Conference 2019 Conference Paper

Online Convex Optimization in Adversarial Markov Decision Processes

  • Aviv Rosenberg 0002
  • Yishay Mansour

We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes, and the transition function is not known to the learner. We show an $\tilde{O}(L|X|\sqrt{|A|T})$ regret bound, where $T$ is the number of episodes, $X$ is the state space, $A$ is the action space, and $L$ is the length of each episode. Our online algorithm is implemented using entropic regularization, which allows us to extend the original adversarial MDP model to handle convex performance criteria (different ways to aggregate the losses of a single episode), as well as to improve previous regret bounds.

NeurIPS Conference 2019 Conference Paper

Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function

  • Aviv Rosenberg
  • Yishay Mansour

We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes. The transition function is fixed but unknown to the learner, and the learner only observes bandit feedback (not the entire loss function). For this problem we develop no-regret algorithms that perform asymptotically as well as the best stationary policy in hindsight. Assuming that all states are reachable with probability $\beta > 0$ under any policy, we give a regret bound of $\tilde{O} ( L|X|\sqrt{|A|T} / \beta )$, where $T$ is the number of episodes, $X$ is the state space, $A$ is the action space, and $L$ is the length of each episode. When this assumption is removed we give a regret bound of $\tilde{O} ( L^{3/2} |X| |A|^{1/4} T^{3/4})$, that holds for an arbitrary transition function. To our knowledge these are the first algorithms that in our setting handle both bandit feedback and an unknown transition function.

ICML Conference 2018 Conference Paper

Online Linear Quadratic Control

  • Alon Cohen
  • Avinatan Hassidim
  • Tomer Koren
  • Nevena Lazic
  • Yishay Mansour
  • Kunal Talwar

We study the problem of controlling linear time-invariant systems with known noisy dynamics and adversarially chosen quadratic losses. We present the first efficient online learning algorithms in this setting that guarantee $O(\sqrt{T})$ regret under mild assumptions, where $T$ is the time horizon. Our algorithms rely on a novel SDP relaxation for the steady-state distribution of the system. Crucially, and in contrast to previously proposed relaxations, the feasible solutions of our SDP all correspond to “strongly stable” policies that mix exponentially fast to a steady state.

IJCAI Conference 2018 Conference Paper

Planning and Learning with Stochastic Action Sets

  • Craig Boutilier
  • Alon Cohen
  • Avinatan Hassidim
  • Yishay Mansour
  • Ofer Meshi
  • Martin Mladenov
  • Dale Schuurmans

In many practical uses of reinforcement learning (RL) the set of actions available at a given state is a random variable, with realizations governed by an exogenous stochastic process. Somewhat surprisingly, the foundations for such sequential decision processes have remained unaddressed. In this work, we formalize and investigate MDPs with stochastic action sets (SAS-MDPs) to provide these foundations. We show that optimal policies and value functions in this model have a structure that admits a compact representation. From an RL perspective, we show that Q-learning with sampled action sets is sound. In model-based settings, we consider two important special cases: when individual actions are available with independent probabilities, and a sampling-based model for unknown distributions. We develop polynomial-time value and policy iteration methods for both cases, and provide a polynomial-time linear programming solution for the first case.

EWRL Workshop 2018 Workshop Paper

Planning in Hierarchical Reinforcement Learning: Guarantees for Using Local Policies

  • Tom Zahavy
  • Avinatan Hassidim
  • Haim Kaplan
  • Yishay Mansour

We provide theoretical guarantees for reward decomposition in deterministic MDPs with collectible rewards. Reward decomposition is a special case of hierarchical reinforcement learning, in which we view the reward as a sum, and we assemble a policy from policies for its components. Our approach builds on formulating this problem as a maximum traveling salesman problem with discounted reward. In particular, we focus on approximate solutions that are local, i.e., solutions that only observe information about the current state. Local policies are easy to implement and do not require substantial computational resources since they do not perform learning nor planning. Local deterministic policies, like Nearest Neighbor (NN), are being used in practice for hierarchical reinforcement learning, in particular, for 3D navigation. We propose three stochastic policies and prove that they guarantee better performance than any deterministic policy in the worst case. We then show experimentally that these policies outperform NN in deterministic MDPs with optimal options, and also in stochastic MDPs and during learning.
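
The Nearest Neighbor (NN) baseline mentioned above is simple enough to sketch. The toy below is our own illustration, not the paper's setup: the 2-D points, Euclidean metric, and function name are assumptions. It greedily visits the closest remaining collectible:

```python
import math

def nn_route(start, rewards):
    """Nearest-Neighbor local policy: from the current position,
    always move to the closest remaining collectible reward."""
    pos, remaining, route = start, list(rewards), []
    while remaining:
        nxt = min(remaining, key=lambda p: math.dist(pos, p))
        route.append(nxt)
        remaining.remove(nxt)
        pos = nxt
    return route
```

A local policy like this needs only the current state and the set of remaining rewards, which is exactly why it requires no learning or planning.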

AAAI Conference 2017 Conference Paper

Label Efficient Learning by Exploiting Multi-Class Output Codes

  • Maria Balcan
  • Travis Dick
  • Yishay Mansour

We present a new perspective on the popular multi-class algorithmic techniques of one-vs-all and error correcting output codes. Rather than studying the behavior of these techniques for supervised learning, we establish a connection between the success of these methods and the existence of label-efficient learning procedures. We show that in both the realizable and agnostic cases, if output codes are successful at learning from labeled data, they implicitly assume structure on how the classes are related. By making that structure explicit, we design learning algorithms to recover the classes with low label complexity. We provide results for the commonly studied cases of one-vs-all learning and when the codewords of the classes are well separated. We additionally consider the more challenging case where the codewords are not well separated, but satisfy a boundary features condition that captures the natural intuition that every bit of the codewords should be significant.

NeurIPS Conference 2017 Conference Paper

Multi-Armed Bandits with Metric Movement Costs

  • Tomer Koren
  • Roi Livni
  • Yishay Mansour

We consider the non-stochastic Multi-Armed Bandit problem in a setting where there is a fixed and known metric on the action space that determines a cost for switching between any pair of actions. The loss of the online learner has two components: the first is the usual loss of the selected actions, and the second is an additional loss due to switching between actions. Our main contribution gives a tight characterization of the expected minimax regret in this setting, in terms of a complexity measure $\mathcal{C}$ of the underlying metric which depends on its covering numbers. In finite metric spaces with $k$ actions, we give an efficient algorithm that achieves regret of the form $\widetilde{O}(\max\set{\mathcal{C}^{1/3}T^{2/3}, \sqrt{kT}})$, and show that this is the best possible. Our regret bound generalizes previously known regret bounds for some special cases: (i) the unit-switching cost regret $\widetilde{\Theta}(\max\set{k^{1/3}T^{2/3}, \sqrt{kT}})$ where $\mathcal{C}=\Theta(k)$, and (ii) the interval metric with regret $\widetilde{\Theta}(\max\set{T^{2/3}, \sqrt{kT}})$ where $\mathcal{C}=\Theta(1)$. For infinite metric spaces with Lipschitz loss functions, we derive a tight regret bound of $\widetilde{\Theta}(T^{\frac{d+1}{d+2}})$ where $d \ge 1$ is the Minkowski dimension of the space, which is known to be tight even when there are no switching costs.

NeurIPS Conference 2017 Conference Paper

Submultiplicative Glivenko-Cantelli and Uniform Convergence of Revenues

  • Noga Alon
  • Moshe Babaioff
  • Yannai A. Gonczarowski
  • Yishay Mansour
  • Shay Moran
  • Amir Yehudayoff

In this work we derive a variant of the classic Glivenko-Cantelli Theorem, which asserts uniform convergence of the empirical Cumulative Distribution Function (CDF) to the CDF of the underlying distribution. Our variant allows for tighter convergence bounds for extreme values of the CDF. We apply our bound in the context of revenue learning, which is a well-studied problem in economics and algorithmic game theory. We derive sample-complexity bounds on the uniform convergence rate of the empirical revenues to the true revenues, assuming a bound on the k'th moment of the valuations, for any (possibly fractional) k > 1. For uniform convergence in the limit, we give a complete characterization and a zero-one law: if the first moment of the valuations is finite, then uniform convergence almost surely occurs; conversely, if the first moment is infinite, then uniform convergence almost never occurs.

EWRL Workshop 2016 Workshop Paper

Automatic Representation for Life-Time Value Recommender Systems

  • Assaf Hallak
  • Elad Yom-Tov
  • Yishay Mansour

Recommender systems are embedded in almost every commercial site, proposing users items which are likely to draw their interest. While most systems maximize the immediate gain, a better notion of success would be the lifetime value (LTV) of the user-system interaction. The LTV approach instead considers the future implications of the item recommendations, and seeks to maximize the cumulative gain over time. The Reinforcement Learning (RL) framework is the standard formulation for optimizing cumulative success over time, but RL is rarely used in practice due to its complicated representation, optimization and validation techniques. In this paper we propose a new architecture for combining RL with recommendation systems which obviates the need for hand-tuned features, thus automating the state-space representation construction process. We analyze the practical difficulties in this formulation and test our solutions on real-world recommendation data.

NeurIPS Conference 2016 Conference Paper

Online Pricing with Strategic and Patient Buyers

  • Michal Feldman
  • Tomer Koren
  • Roi Livni
  • Yishay Mansour
  • Aviv Zohar

We consider a seller with an unlimited supply of a single good, who is faced with a stream of $T$ buyers. Each buyer has a window of time in which she would like to purchase, and would buy at the lowest price in that window, provided that this price is lower than her private value (and otherwise, would not buy at all). In this setting, we give an algorithm that attains $O(T^{2/3})$ regret over any sequence of $T$ buyers with respect to the best fixed price in hindsight, and prove that no algorithm can perform better in the worst case.
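
The $O(T^{2/3})$ guarantee above is for the paper's algorithm; purely as a loose illustration of the posted-price bandit setting, here is a toy EXP3-style learner over a discretized price grid. The grid, learning rate, myopic-buyer model, and all names are our assumptions, and this sketch is not the paper's method:

```python
import math, random

def posted_price_exp3(values, prices, eta=0.1, seed=0):
    """Toy posted-price bandit: EXP3 over a fixed grid of prices; each
    buyer purchases iff her value is at least the posted price, and only
    the resulting revenue (bandit feedback) is observed."""
    rng = random.Random(seed)
    cum = [0.0] * len(prices)          # cumulative estimated rewards
    revenue = 0.0
    pmax = max(prices)
    for v in values:
        m = max(cum)
        w = [math.exp(eta * (c - m)) for c in cum]
        s = sum(w)
        probs = [x / s for x in w]
        i = rng.choices(range(len(prices)), probs)[0]
        r = prices[i] if v >= prices[i] else 0.0
        revenue += r
        cum[i] += (r / pmax) / probs[i]  # importance-weighted reward estimate
    return revenue
```

Note that patient, strategic buyers (the paper's focus) would wait for low prices, which is precisely what breaks naive bandit baselines like this one.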

ICML Conference 2015 Conference Paper

Classification with Low Rank and Missing Data

  • Elad Hazan
  • Roi Livni
  • Yishay Mansour

We consider classification and regression tasks where we have missing data and assume that the (clean) data resides in a low rank subspace. Finding a hidden subspace is known to be computationally hard. Nevertheless, using a non-proper formulation we give an efficient agnostic algorithm that classifies as well as the best linear classifier coupled with the best low-dimensional subspace in which the data resides. A direct implication is that our algorithm can linearly (and non-linearly through kernels) classify provably as well as the best classifier that has access to the full data.

AAAI Conference 2015 Conference Paper

Learning Valuation Distributions from Partial Observation

  • Avrim Blum
  • Yishay Mansour
  • Jamie Morgenstern

Auction theory traditionally assumes that bidders’ valuation distributions are known to the auctioneer, such as in the celebrated, revenue-optimal Myerson auction (Myerson 1981). However, this theory does not describe how the auctioneer comes to possess this information. Recent work (Cole and Roughgarden 2014) showed that an approximation based on a finite sample of independent draws from each bidder’s distribution is sufficient to produce a near-optimal auction. In this work, we consider the problem of learning bidders’ valuation distributions from much weaker forms of observations. Specifically, we consider a setting where there is a repeated, sealed-bid auction with n bidders, but all we observe for each round is who won, but not how much they bid or paid. We can also participate (i.e., submit a bid) ourselves, and observe when we win. From this information, our goal is to (approximately) recover the inherently recoverable part of the underlying bid distributions. We also consider extensions where different subsets of bidders participate in each round, and where bidders’ valuations have a common-value component added to their independent private values.

ICML Conference 2014 Conference Paper

Thompson Sampling for Complex Online Problems

  • Aditya Gopalan
  • Shie Mannor
  • Yishay Mansour

We consider stochastic multi-armed bandit problems with complex actions over a set of basic arms, where the decision maker plays a complex action rather than a basic arm in each round. The reward of the complex action is some function of the basic arms’ rewards, and the feedback observed may not necessarily be the reward per-arm. For instance, when the complex actions are subsets of the arms, we may only observe the maximum reward over the chosen subset. Thus, feedback across complex actions may be coupled due to the nature of the reward function. We prove a frequentist regret bound for Thompson sampling in a very general setting involving parameter, action and observation spaces and a likelihood function over them. The bound holds for discretely-supported priors over the parameter space and without additional structural properties such as closed-form posteriors, conjugate prior structure or independence across arms. The regret bound scales logarithmically with time but, more importantly, with an improved constant that non-trivially captures the coupling across complex actions due to the structure of the rewards. As applications, we derive improved regret bounds for classes of complex bandit problems involving selecting subsets of arms, including the first nontrivial regret bounds for nonlinear MAX reward feedback from subsets. Using particle filters for computing posterior distributions which lack an explicit closed-form, we present numerical results for the performance of Thompson sampling for subset-selection and job scheduling problems.
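
A minimal sketch of Thompson sampling with a discretely supported prior, in the spirit of the setting above. The Bernoulli reward model, the grid-of-parameter-vectors representation, and all names are our illustrative assumptions, not the paper's construction:

```python
import math, random

def thompson_discrete(arm_probs, grid, T=500, seed=0):
    """Thompson sampling with a discretely supported prior: each candidate
    parameter vector in `grid` carries a posterior weight proportional to
    its likelihood on the data observed so far."""
    rng = random.Random(seed)
    log_post = [0.0] * len(grid)                 # log-posterior over models
    pulls = [0] * len(arm_probs)
    for _ in range(T):
        m = max(log_post)
        w = [math.exp(lp - m) for lp in log_post]
        k = rng.choices(range(len(grid)), w)[0]  # sample a model
        a = max(range(len(arm_probs)), key=lambda i: grid[k][i])  # its best arm
        r = 1 if rng.random() < arm_probs[a] else 0
        pulls[a] += 1
        for j, theta in enumerate(grid):         # Bayesian update by likelihood
            p = min(max(theta[a], 1e-9), 1 - 1e-9)
            log_post[j] += r * math.log(p) + (1 - r) * math.log(1 - p)
    return pulls
```

Because the prior support is finite, the posterior is maintained exactly, with no need for conjugacy or closed-form updates, mirroring the generality emphasized in the abstract.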

AAMAS Conference 2013 Conference Paper

An Empirical Study of Trading Agent Robustness

  • Shai Hertz
  • Mariano Schain
  • Yishay Mansour

We study the empirical behavior of trading agents participating in the Ad-Auction game of the Trading Agent Competition (TAC-AA). Aiming to understand the applicability of optimal trading strategies in synthesized environments to real-life settings, we investigate the robustness of the agents to deviations from the game’s specified environment. Our results indicate that most agents, especially the top-scoring ones, are surprisingly robust. In addition, using the game logs, we derive for each agent a strategic fingerprint and show that it almost uniquely identifies it. Finally, we show that although the Machine Learning modeling in TAC-AA is inherently inaccurate, further improvement in modeling accuracy is likely to have only a limited contribution to the overall performance of TAC-AA agents.

RLDM Conference 2013 Conference Abstract

Complex Bandit Problems and Thompson Sampling

  • Aditya Gopalan
  • Shie Mannor
  • Yishay Mansour

We study stochastic multi-armed bandit settings with complex actions derived from the basic bandit arms, e.g., subsets or partitions of basic arms. The decision maker is faced with selecting at each round a complex action instead of a basic arm. We allow the reward of the complex action to be some function of the basic arms’ rewards, and so the feedback observed may not necessarily be the reward per-arm. For instance, when the complex actions are subsets of bandit arms, we may only observe the maximum reward over the chosen subset. Feedback from playing (complex) actions can thus be indicative of rewards from other actions, and leveraging this coupled feedback becomes important to the decision maker in order to learn efficiently. We propose applying Thompson Sampling -- a Bayesian-inspired algorithm for the standard multi-armed bandit -- for minimizing regret in complex bandit problems. We derive the first general, frequentist regret bound for Thompson sampling in complex bandit settings, that holds without specific structural assumptions on the prior used by the algorithm. The regret bound exhibits the standard logarithmic scaling with time but with a non-trivial multiplicative constant that encodes the coupled information structure of the complex bandit. As applications, we show improved regret bounds (compared to treating the complex actions as independent) for a class of complex, subset-selection bandit problems. Using particle filters for computing posterior distributions that often lack an explicit closed-form, we apply Thompson sampling algorithms for subset selection and job-scheduling problems and present numerical results.

ICML Conference 2013 Conference Paper

Exploiting Ontology Structures and Unlabeled Data for Learning

  • Maria-Florina Balcan
  • Avrim Blum
  • Yishay Mansour

We present and analyze a theoretical model designed to understand and explain the effectiveness of ontologies for learning multiple related tasks from primarily unlabeled data. We present both information-theoretic results as well as efficient algorithms. We show in this model that an ontology, which specifies the relationships between multiple outputs, in some cases is sufficient to completely learn a classification using a large unlabeled data source.

NeurIPS Conference 2013 Conference Paper

From Bandits to Experts: A Tale of Domination and Independence

  • Noga Alon
  • Nicolò Cesa-Bianchi
  • Claudio Gentile
  • Yishay Mansour

We consider the partial observability model for multi-armed bandits, introduced by Mannor and Shamir (2011). Our main result is a characterization of regret in the directed observability model in terms of the dominating and independence numbers of the observability graph. We also show that in the undirected case, the learner can achieve optimal regret without even accessing the observability graph before selecting an action. Both results are shown using variants of the Exp3 algorithm operating on the observability graph in a time-efficient manner.
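
The core importance-weighting idea behind Exp3-style algorithms on an observability graph can be sketched as follows. The exact estimators and the dominating-set machinery in the paper differ; the set-based graph encoding, parameters, and names here are our assumptions:

```python
import math, random

def exp3_graph(loss_rounds, graph, eta=0.05, seed=0):
    """Exp3-style learner with graph feedback: playing action j reveals the
    losses of all actions in graph[j]; each revealed loss is importance-
    weighted by the total probability of it being observed this round."""
    rng = random.Random(seed)
    n = len(graph)
    cum = [0.0] * n                      # cumulative estimated losses
    total = 0.0
    for loss in loss_rounds:
        m = min(cum)
        w = [math.exp(-eta * (c - m)) for c in cum]
        s = sum(w)
        p = [x / s for x in w]
        j = rng.choices(range(n), p)[0]
        total += loss[j]
        for i in graph[j]:               # losses observed this round
            q = sum(p[a] for a in range(n) if i in graph[a])
            cum[i] += loss[i] / q        # unbiased importance-weighted estimate
    return total
```

With a complete graph this recovers full-information Hedge, and with self-loops only it recovers standard Exp3; the interesting regimes lie in between, where the graph's combinatorial structure governs the regret.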

NeurIPS Conference 2012 Conference Paper

Learning Multiple Tasks using Shared Hypotheses

  • Koby Crammer
  • Yishay Mansour

In this work we consider a setting where we have a very large number of related tasks with few examples from each individual task. Rather than either learning each task individually (and having a large generalization error) or learning all the tasks together using a single hypothesis (and suffering a potentially large inherent error), we consider learning a small pool of {\em shared hypotheses}. Each task is then mapped to a single hypothesis in the pool (hard association). We derive VC dimension generalization bounds for our model, based on the number of tasks, shared hypothesis and the VC dimension of the hypotheses class. We conducted experiments with both synthetic problems and sentiment of reviews, which strongly support our approach.

FOCS Conference 2011 Conference Paper

Welfare and Profit Maximization with Production Costs

  • Avrim Blum
  • Anupam Gupta 0001
  • Yishay Mansour
  • Ankit Sharma 0001

Combinatorial Auctions are a central problem in Algorithmic Mechanism Design: pricing and allocating goods to buyers with complex preferences in order to maximize some desired objective (e. g. , social welfare, revenue, or profit). The problem has been well-studied in the case of limited supply (one copy of each item), and in the case of digital goods (the seller can produce additional copies at no cost). Yet in the case of resources -- oil, labor, computing cycles, etc. -- neither of these abstractions is just right: additional supplies of these resources can be found, but at increasing difficulty (marginal cost) as resources are depleted. In this work, we initiate the study of the algorithmic mechanism design problem of combinatorial pricing under increasing marginal cost. The goal is to sell these goods to buyers with unknown and arbitrary combinatorial valuation functions to maximize either the social welfare, or the seller's profit, specifically we focus on the setting of posted item prices with buyers arriving online. We give algorithms that achieve constant factor approximations for a class of natural cost functions - linear, low-degree polynomial, logarithmic - and that give logarithmic approximations for more general increasing marginal cost functions (along with a necessary additive loss). We show that these bounds are essentially best possible for these settings.

NeurIPS Conference 2010 Conference Paper

Learning Bounds for Importance Weighting

  • Corinna Cortes
  • Yishay Mansour
  • Mehryar Mohri

This paper presents an analysis of importance weighting for learning from finite samples and gives a series of theoretical and algorithmic results. We point out simple cases where importance weighting can fail, which suggests the need for an analysis of the properties of this technique. We then give both upper and lower bounds for generalization with bounded importance weights and, more significantly, give learning guarantees for the more common case of unbounded importance weights under the weak assumption that the second moment is bounded, a condition related to the Rényi divergence of the training and test distributions. These results are based on a series of novel and general bounds we derive for unbounded loss functions, which are of independent interest. We use these bounds to guide the definition of an alternative reweighting algorithm and report the results of experiments demonstrating its benefits. Finally, we analyze the properties of normalized importance weights which are also commonly used.
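
The basic importance-weighting estimator analyzed above can be sketched directly; this toy (all names ours) passes the source and target densities as callables and reweights each sample's loss by $p(x)/q(x)$:

```python
def iw_risk_estimate(xs, losses, p, q):
    """Importance-weighted risk estimate: reweight losses of samples drawn
    from the source density q to estimate the expected loss under the
    target density p. Weights p(x)/q(x) may be unbounded when q has
    lighter tails than p, which is the case the paper's bounds address."""
    n = len(xs)
    return sum((p(x) / q(x)) * l for x, l in zip(xs, losses)) / n
```

The estimator is unbiased whenever q's support covers p's; the quality of the finite-sample guarantee then hinges on the second moment of the weights, per the abstract.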

UAI Conference 2009 Conference Paper

Multiple Source Adaptation and the Rényi Divergence

  • Yishay Mansour
  • Mehryar Mohri
  • Afshin Rostamizadeh

This paper presents a novel theoretical study of the general problem of multiple source adaptation using the notion of Rényi divergence. Our results build on our previous work [12], but significantly broaden the scope of that work in several directions. We extend previous multiple source loss guarantees based on distribution weighted combinations to arbitrary target distributions P, not necessarily mixtures of the source distributions, analyze both known and unknown target distribution cases, and prove a lower bound. We further extend our bounds to deal with the case where the learner receives an approximate distribution for each source instead of the exact one, and show that similar loss guarantees can be achieved depending on the divergence between the approximate and true distributions. We also analyze the case where the labeling functions of the source domains are somewhat different. Finally, we report the results of experiments with both an artificial data set and a sentiment analysis task, showing the performance benefits of the distribution weighted combinations and the quality of our bounds based on the Rényi divergence.

STOC Conference 2009 Conference Paper

On the convergence of regret minimization dynamics in concave games

  • Eyal Even-Dar
  • Yishay Mansour
  • Uri Nadav

We study a general sub-class of concave games which we call socially concave games. We show that if each player follows any no-external regret minimization procedure then the dynamics will converge in the sense that both the average action vector will converge to a Nash equilibrium and that the utility of each player will converge to her utility in that Nash equilibrium. We show that many natural games are indeed socially concave games. Specifically, we show that linear Cournot competition and linear resource allocation games are socially-concave games, and therefore our convergence result applies to them. In addition, we show that a simple best response dynamics might diverge for linear resource allocation games, and is known to diverge for linear Cournot competition. For the TCP congestion games we show that "near" the equilibrium the games are socially-concave, and using our general methodology we show the convergence of a specific regret minimization dynamics.

NeurIPS Conference 2008 Conference Paper

Domain Adaptation with Multiple Sources

  • Yishay Mansour
  • Mehryar Mohri
  • Afshin Rostamizadeh

This paper presents a theoretical analysis of the problem of adaptation with multiple sources. For each source domain, the distribution over the input points as well as a hypothesis with error at most \epsilon are given. The problem consists of combining these hypotheses to derive a hypothesis with small error with respect to the target domain. We present several theoretical results relating to this problem. In particular, we prove that standard convex combinations of the source hypotheses may in fact perform very poorly and that, instead, combinations weighted by the source distributions benefit from favorable theoretical guarantees. Our main result shows that, remarkably, for any fixed target function, there exists a distribution weighted combining rule that has a loss of at most \epsilon with respect to any target mixture of the source distributions. We further generalize the setting from a single target function to multiple consistent target functions and show the existence of a combining rule with error at most 3\epsilon. Finally, we report empirical results for a multiple source adaptation problem with a real-world dataset.
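
The distribution-weighted combining rule from the abstract admits a direct sketch (function names are ours; source densities and hypotheses are passed as callables): each source hypothesis $h_i$ is weighted at point $x$ by $\lambda_i D_i(x)$, normalized over the sources.

```python
def distribution_weighted(hyps, densities, lambdas):
    """Distribution-weighted combining rule: h(x) =
    sum_i lambda_i * D_i(x) * h_i(x) / sum_i lambda_i * D_i(x).
    A source's hypothesis gets more weight where that source's
    density is high, unlike a fixed convex combination."""
    def h(x):
        num = sum(l * d(x) * g(x) for l, d, g in zip(lambdas, densities, hyps))
        den = sum(l * d(x) for l, d in zip(lambdas, densities))
        return num / den
    return h
```

This pointwise reweighting is exactly what lets the combined hypothesis track whichever source is locally reliable, in contrast to the standard convex combinations the paper shows can fail.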

IJCAI Conference 2007 Conference Paper

The Value of Observation for Monitoring Dynamic Systems

  • Eyal Even-Dar
  • Sham M. Kakade
  • Yishay Mansour

We consider the fundamental problem of monitoring (i.e., tracking) the belief state in a dynamic system, when the model is only approximately correct and when the initial belief state might be unknown. In this general setting where the model is (perhaps only slightly) mis-specified, monitoring (and consequently planning) may be impossible as errors might accumulate over time. We provide a new characterization, the \emph{value of observation}, which allows us to bound the error accumulation. The value of observation is a parameter that governs how much information the observation provides. For instance, in Partially Observable MDPs, when the parameter is 1 the POMDP is an MDP, while for an unobservable Markov Decision Process the parameter is 0. Thus, the new parameter characterizes a spectrum from MDPs to unobservable MDPs depending on the amount of information conveyed in the observations.

JMLR Journal 2007 Journal Article

From External to Internal Regret

  • Avrim Blum
  • Yishay Mansour

External regret compares the performance of an online algorithm, selecting among N actions, to the performance of the best of those actions in hindsight. Internal regret compares the loss of an online algorithm to the loss of a modified online algorithm, which consistently replaces one action by another. In this paper we give a simple generic reduction that, given an algorithm for the external regret problem, converts it to an efficient online algorithm for the internal regret problem. We provide methods that work both in the full information model, in which the loss of every action is observed at each time step, and the partial information (bandit) model, where at each time step only the loss of the selected action is observed. The importance of internal regret in game theory is due to the fact that in a general game, if each player has sublinear internal regret, then the empirical frequencies converge to a correlated equilibrium. For external regret we also derive a quantitative regret bound for a very general setting of regret, which includes an arbitrary set of modification rules (that possibly modify the online algorithm) and an arbitrary set of time selection functions (each giving different weight to each time step). The regret for a given time selection and modification rule is the difference between the cost of the online algorithm and the cost of the modified online algorithm, where the costs are weighted by the time selection function. This can be viewed as a generalization of the previously-studied sleeping experts setting.
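
The external-to-internal reduction can be sketched concretely: run one external-regret learner per action, stack their output distributions into a row-stochastic matrix Q, and play the stationary distribution p = pQ, charging learner i the losses scaled by p[i]. This toy uses Hedge as the base learner and power iteration for the fixed point; the full-information setting, parameter values, and names are our assumptions:

```python
import math

def internal_from_external(loss_rounds, n, eta=0.1):
    """External-to-internal reduction (sketch): one Hedge copy per action.
    Each round, the copies' distributions form the rows of a matrix Q; we
    play the stationary distribution p = pQ (found by power iteration),
    and copy i is charged the loss vector scaled by p[i]."""
    cum = [[0.0] * n for _ in range(n)]    # cumulative losses per Hedge copy
    total = 0.0
    for loss in loss_rounds:
        Q = []
        for i in range(n):
            m = min(cum[i])
            w = [math.exp(-eta * (c - m)) for c in cum[i]]
            s = sum(w)
            Q.append([x / s for x in w])
        p = [1.0 / n] * n                  # power iteration for p = pQ
        for _ in range(100):
            p = [sum(p[i] * Q[i][j] for i in range(n)) for j in range(n)]
        total += sum(p[j] * loss[j] for j in range(n))
        for i in range(n):                 # charge copy i its scaled losses
            for j in range(n):
                cum[i][j] += p[i] * loss[j]
    return total
```

The fixed point exists because every Q built from exponential weights has strictly positive entries, so the induced Markov chain has a unique stationary distribution.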

STOC Conference 2007 Conference Paper

The communication complexity of uncoupled nash equilibrium procedures

  • Sergiu Hart
  • Yishay Mansour

We study the question of how long it takes players to reach a Nash equilibrium in uncoupled setups, where each player initially knows only his own payoff function. We derive lower bounds on the communication complexity of reaching a Nash equilibrium, i.e., on the number of bits that need to be transmitted, and thus also on the required number of steps. Specifically, we show lower bounds that are exponential in the number of players in each one of the following cases: (1) reaching a pure Nash equilibrium; (2) reaching a pure Nash equilibrium in a Bayesian setting; and (3) reaching a mixed Nash equilibrium. We then show that, in contrast, the communication complexity of reaching a correlated equilibrium is polynomial in the number of players.

JMLR Journal 2006 Journal Article

Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

  • Eyal Even-Dar
  • Shie Mannor
  • Yishay Mansour

We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that given $n$ arms, it suffices to pull the arms a total of $O((n/\epsilon^2)\log(1/\delta))$ times to find an $\epsilon$-optimal arm with probability of at least $1-\delta$. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures in reinforcement learning algorithms. We describe a framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability). We provide model-based and model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over $\epsilon$-greedy Q-learning.
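
A minimal action-elimination sketch for the bandit case: sample all surviving arms in rounds and drop any arm whose upper confidence bound falls below the best arm's lower confidence bound. The exact confidence radius and constants in the paper differ; this version and its names are our assumptions:

```python
import math, random

def successive_elimination(arm_probs, eps=0.1, delta=0.05, seed=0):
    """Action elimination (sketch): pull every surviving arm once per
    round; eliminate arm a when mean[a] + rad < best_mean - rad, where
    rad is a union-bound confidence radius shrinking with the round t."""
    rng = random.Random(seed)
    n = len(arm_probs)
    alive = set(range(n))
    means = [0.0] * n
    t = 0
    while len(alive) > 1:
        t += 1
        for a in alive:                        # one Bernoulli pull per arm
            r = 1 if rng.random() < arm_probs[a] else 0
            means[a] += (r - means[a]) / t     # incremental empirical mean
        rad = math.sqrt(math.log(4 * t * t * n / delta) / (2 * t))
        best = max(means[a] for a in alive)
        alive = {a for a in alive if means[a] + rad >= best - rad}
        if rad < eps / 2:                      # stopping condition
            break
    return max(alive, key=lambda a: means[a])
```

The stopping condition mirrors the abstract's point: once the confidence radius is below $\epsilon/2$, any surviving arm is $\epsilon$-optimal with high probability.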

FOCS Conference 2005 Conference Paper

Agnostically Learning Halfspaces

  • Adam Tauman Kalai
  • Adam R. Klivans
  • Yishay Mansour
  • Rocco A. Servedio

We give the first algorithm that (under distributional assumptions) efficiently learns halfspaces in the notoriously difficult agnostic framework of Kearns, Schapire, & Sellie, where a learner is given access to labeled examples drawn from a distribution, without restriction on the labels (e.g. adversarial noise). The algorithm constructs a hypothesis whose error rate on future examples is within an additive $\epsilon$ of the optimal halfspace, in time $\mathrm{poly}(n)$ for any constant $\epsilon > 0$, under the uniform distribution over $\{-1, 1\}^n$ or the unit sphere in $\mathbb{R}^n$, as well as under any log-concave distribution over $\mathbb{R}^n$. It also agnostically learns Boolean disjunctions in time $2^{\tilde{O}(\sqrt{n})}$ with respect to any distribution. The new algorithm, essentially $L_1$ polynomial regression, is a noise-tolerant arbitrary-distribution generalization of the "low degree" Fourier algorithm of Linial, Mansour, & Nisan. We also give a new algorithm for PAC learning halfspaces under the uniform distribution on the unit sphere with the current best bounds on the tolerable rate of "malicious noise".

JMLR Journal 2005 Journal Article

Concentration Bounds for Unigram Language Models

  • Evgeny Drukh
  • Yishay Mansour

We show several high-probability concentration bounds for learning unigram language models. One interesting quantity is the probability of all words appearing exactly $k$ times in a sample of size $m$. A standard estimator for this quantity is the Good-Turing estimator. The existing analysis of its error shows a high-probability bound of approximately $O(k/m^{1/2})$. We improve its dependency on $k$ to $O(k^{1/4}/m^{1/2} + k/m)$. We also analyze the empirical frequencies estimator, showing that with high probability its error is bounded by approximately $O(1/k + k^{1/2}/m)$. We derive a combined estimator, which has an error of approximately $O(m^{-2/5})$, for any $k$. A standard measure for the quality of a learning algorithm is its expected per-word log-loss. The leave-one-out method can be used for estimating the log-loss of the unigram model. We show that its error has a high-probability bound of approximately $O(1/m^{1/2})$, for any underlying distribution. We also bound the log-loss a priori, as a function of various parameters of the distribution.
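
The Good-Turing estimator discussed above is a one-liner; this sketch (names ours) estimates the total probability mass of words appearing exactly $k$ times in the sample as $(k+1) N_{k+1} / m$, where $N_j$ counts the distinct words with count $j$:

```python
from collections import Counter

def good_turing(sample, k):
    """Good-Turing estimate of the total probability mass of words that
    appear exactly k times: (k+1) * N_{k+1} / m, where N_j is the number
    of distinct words with count j and m is the sample size."""
    counts = Counter(sample)
    freq_of_freq = Counter(counts.values())
    return (k + 1) * freq_of_freq[k + 1] / len(sample)
```

With $k = 0$ this gives the classic "missing mass" estimate, the probability of seeing a previously unseen word.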

FOCS Conference 2005 Conference Paper

Mechanism Design via Machine Learning

  • Maria-Florina Balcan
  • Avrim Blum
  • Jason D. Hartline
  • Yishay Mansour

We use techniques from sample-complexity in machine learning to reduce problems of incentive-compatible mechanism design to standard algorithmic questions, for a wide variety of revenue-maximizing pricing problems. Our reductions imply that for these problems, given an optimal (or $\beta$-approximation) algorithm for the standard algorithmic problem, we can convert it into a $(1+\epsilon)$-approximation (or $\beta(1+\epsilon)$-approximation) for the incentive-compatible mechanism design problem, so long as the number of bidders is sufficiently large as a function of an appropriate measure of complexity of the comparison class of solutions. We apply these results to the problem of auctioning a digital good, the attribute auction problem, and to the problem of item-pricing in unlimited-supply combinatorial auctions. From a learning perspective, these settings present several challenges: in particular the loss function is discontinuous and asymmetric, and the range of bidders' valuations may be large.

NeurIPS Conference 2004 Conference Paper

Experts in a Markov Decision Process

  • Eyal Even-Dar
  • Sham Kakade
  • Yishay Mansour

We consider an MDP setting in which the reward function is allowed to change during each time step of play (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well an agent can do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which have regret bounds with no dependence on the size of state space. Instead, these bounds depend only on a certain horizon time of the process and logarithmically on the number of actions. We also show that in the case that the dynamics change over time, the problem becomes computationally hard.

1 Introduction

There is an inherent tension between the objectives in an expert setting and those in a reinforcement learning setting. In the experts problem, during every round a learner chooses one of n decision-making experts and incurs the loss of the chosen expert. The setting is typically an adversarial one, where Nature provides the examples to a learner. The standard objective here is a myopic, backwards-looking one: in retrospect, we desire that our performance is not much worse than had we chosen any single expert on the sequence of examples provided by Nature. In contrast, a reinforcement learning setting typically makes the much stronger assumption of a fixed environment, typically a Markov decision process (MDP), and the forward-looking objective is to maximize some measure of the future reward with respect to this fixed environment. The motivation of this work is to understand how to efficiently incorporate the benefits of existing experts algorithms into a more adversarial reinforcement learning setting, where certain aspects of the environment could change over time. A naive way to implement an experts algorithm is to simply associate an expert with each fixed policy.
The running time of such algorithms is polynomial in the number of experts and the regret (the difference from the optimal reward) is logarithmic in the number of experts. For our setting the number of policies is huge, namely #actions^#states, which renders the naive experts approach computationally infeasible. Furthermore, straightforward applications of standard regret algorithms produce regret bounds which are logarithmic in the number of policies, so they have linear dependence on the number of states. We might hope for a more effective regret bound which has no dependence on the size of the state space (which is typically large). The setting we consider is one in which the dynamics of the environment are known to the learner, but the reward function can change over time. We assume that after each time step the learner has complete knowledge of the previous reward functions (over the entire environment), but does not know the future reward functions. As a motivating example one can consider taking a long road trip over some period of time T. The dynamics, namely the roads, are fixed, but the road conditions may change frequently. By listening to the radio, one can get (effectively) instant updates of the road and traffic conditions. Here, the task is to minimize the cost during the period of time T. Note that at each time step we select one road segment, suffer a certain delay, and need to plan ahead with respect to our current position. This example is similar to an adversarial shortest path problem considered in Kalai and Vempala [2003]. In fact Kalai and Vempala [2003] address the computational difficulty of handling a large number of experts under certain linear assumptions on the reward functions.
However, their algorithm is not directly applicable to our setting, because in our setting decisions must be made with respect to the current state of the agent (and the reward could be changing frequently), while in their setting decisions are made only with respect to a single state. McMahan et al. [2003] also considered a similar setting: they also assume that the reward function is chosen by an adversary and that the dynamics are fixed. However, they assume that the cost functions come from a finite set (but are not observable) and the goal is to find a min-max solution for the related stochastic game. In this work, we provide efficient ways to incorporate existing best-experts algorithms into the MDP setting. Furthermore, our loss bounds (compared to the best constant policy) have no dependence on the number of states and depend only on a certain horizon time of the environment and log(#actions). There are two sensible extensions of our setting. The first is one where we allow Nature to change the dynamics of the environment over time. Here, we show that it becomes NP-hard to develop a low-regret algorithm even for an oblivious adversary. The second is one in which the agent only observes the rewards for the states it actually visits (a generalization of the multi-armed bandits problem). We leave this interesting direction for future work.
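The experts subroutine this paper builds on can be sketched with the classical multiplicative-weights (Hedge) update. This is a toy illustration of the building block only, not the paper's MDP algorithm; the function name and learning-rate value are ours.

```python
import math

def hedge(losses, eta=0.5):
    """Classical multiplicative-weights (Hedge) experts algorithm.

    losses: list of rounds, each a list of per-expert losses in [0, 1].
    Returns the sequence of probability distributions played.
    """
    n = len(losses[0])
    weights = [1.0] * n
    dists = []
    for round_losses in losses:
        total = sum(weights)
        dists.append([w / total for w in weights])
        # Exponentially down-weight experts that suffered loss this round.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, round_losses)]
    return dists

# Expert 0 suffers no loss, expert 1 suffers loss 1 each round:
# the distribution concentrates on expert 0 over time.
dists = hedge([[0.0, 1.0]] * 10)
```

The naive reduction the abstract dismisses would instantiate one such expert per stationary policy, which is why the regret would pick up a log(#actions^#states) = #states · log(#actions) factor.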

JMLR Journal 2003 Journal Article

Learning Rates for Q-learning

  • Eyal Even Dar
  • Yishay Mansour

In this paper we derive convergence rates for Q-learning. We show an interesting relationship between the convergence rate and the learning rate used in Q-learning. For a polynomial learning rate, one which is 1/t^ω at time t where ω ∈ (1/2, 1), we show that the convergence rate is polynomial in 1/(1-γ), where γ is the discount factor. In contrast we show that for a linear learning rate, one which is 1/t at time t, the convergence rate has an exponential dependence on 1/(1-γ). In addition we show a simple example that proves this exponential behavior is inherent for linear learning rates.
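The gap between polynomial and linear learning rates is visible even on a trivial one-state, one-action chain. The following sketch is our own toy construction, not the paper's analysis, which covers general MDPs:

```python
def q_update_sequence(rewards, omega, gamma=0.9):
    """Single-state, single-action Q-learning with learning rate
    alpha_t = 1 / t**omega (omega = 1 recovers the linear rate 1/t).

    The "next state" is the same state, so the target is r + gamma * q.
    """
    q = 0.0
    for t, r in enumerate(rewards, start=1):
        alpha = 1.0 / t ** omega
        q += alpha * (r + gamma * q - q)  # standard Q-learning update
    return q

# Constant reward 1 with gamma = 0.9 has fixed point 1/(1 - gamma) = 10.
q_poly = q_update_sequence([1.0] * 5000, omega=0.6)  # omega in (1/2, 1)
q_lin = q_update_sequence([1.0] * 5000, omega=1.0)   # linear rate 1/t
```

After the same number of steps, the polynomial-rate iterate sits much closer to the fixed point 10 than the linear-rate iterate, mirroring the polynomial-versus-exponential dependence on 1/(1-γ) in the abstract.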

NeurIPS Conference 2001 Conference Paper

Convergence of Optimistic and Incremental Q-Learning

  • Eyal Even-Dar
  • Yishay Mansour

We show the convergence of two deterministic variants of Q-learning. The first is the widely used optimistic Q-learning, which initializes the Q-values to large initial values and then follows a greedy policy with respect to the Q-values. We show that setting the initial value sufficiently large guarantees convergence to an ε-optimal policy. The second is a new algorithm, incremental Q-learning, which gradually promotes the values of actions that are not taken. We show that incremental Q-learning converges, in the limit, to the optimal policy. Our incremental Q-learning algorithm can be viewed as a derandomization of ε-greedy Q-learning.
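The optimistic-initialization idea can be illustrated on a deterministic two-armed problem: a large shared initial value forces the greedy rule to try every action before settling. This is a minimal sketch with sample-average updates, not the paper's full MDP analysis; function and parameter names are ours.

```python
def optimistic_greedy(arm_means, q_init, steps):
    """Greedy action selection with optimistically initialized Q-values.

    arm_means: true (deterministic) reward of each arm.
    q_init: shared optimistic initial value, assumed larger than any reward.
    """
    n = len(arm_means)
    q = [float(q_init)] * n
    counts = [0] * n
    for _ in range(steps):
        a = max(range(n), key=lambda i: q[i])  # greedy w.r.t. current Q
        counts[a] += 1
        # Incremental sample-average update toward the observed reward.
        q[a] += (arm_means[a] - q[a]) / counts[a]
    return q, counts

# Both arms start at 10, so each gets tried once; thereafter the greedy
# rule locks onto the better arm without any explicit randomization.
q, counts = optimistic_greedy([0.2, 0.8], q_init=10, steps=100)
```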

NeurIPS Conference 1999 Conference Paper

Approximate Planning in Large POMDPs via Reusable Trajectories

  • Michael Kearns
  • Yishay Mansour
  • Andrew Ng

We consider the problem of reliably choosing a near-best strategy from a restricted class of strategies Π in a partially observable Markov decision process (POMDP). We assume we are given the ability to simulate the POMDP, and study what might be called the sample complexity -- that is, the amount of data one must generate in the POMDP in order to choose a good strategy. We prove upper bounds on the sample complexity showing that, even for infinitely large and arbitrarily complex POMDPs, the amount of data needed can be finite, and depends only linearly on the complexity of the restricted strategy class Π, and exponentially on the horizon time. This latter dependence can be eased in a variety of ways, including the application of gradient and local search algorithms. Our measure of complexity generalizes the classical supervised learning notion of VC dimension to the settings of reinforcement learning and planning.

NeurIPS Conference 1999 Conference Paper

Boosting with Multi-Way Branching in Decision Trees

  • Yishay Mansour
  • David McAllester

It is known that decision tree learning can be viewed as a form of boosting. However, existing boosting theorems for decision tree learning allow only binary-branching trees and the generalization to multi-branching trees is not immediate. Practical decision tree algorithms, such as CART and C4.5, implement a trade-off between the number of branches and the improvement in tree quality as measured by an index function. Here we give a boosting justification for a particular quantitative trade-off curve. Our main theorem states, in essence, that if we require an improvement proportional to the log of the number of branches then top-down greedy construction of decision trees remains an effective boosting algorithm.

NeurIPS Conference 1999 Conference Paper

Policy Gradient Methods for Reinforcement Learning with Function Approximation

  • Richard Sutton
  • David McAllester
  • Satinder Singh
  • Yishay Mansour

Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams's REINFORCE method and actor-critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy. Large applications of reinforcement learning (RL) require the use of generalizing function approximators such as neural networks, decision trees, or instance-based methods. The dominant approach for the last decade has been the value-function approach, in which all function approximation effort goes into estimating a value function, with the action-selection policy represented implicitly as the "greedy" policy with respect to the estimated values (e.g., as the policy that selects in each state the action with highest estimated value). The value-function approach has worked well in many applications, but has several limitations. First, it is oriented toward finding deterministic policies, whereas the optimal policy is often stochastic, selecting different actions with specific probabilities (e.g., see Singh, Jaakkola, and Jordan, 1994). Second, an arbitrarily small change in the estimated value of an action can cause it to be, or not be, selected. Such discontinuous changes have been identified as a key obstacle to establishing convergence assurances for algorithms following the value-function approach (Bertsekas and Tsitsiklis, 1996). For example, Q-learning, Sarsa, and dynamic programming methods have all been shown unable to converge to any policy for simple MDPs and simple function approximators (Gordon, 1995, 1996; Baird, 1995; Tsitsiklis and van Roy, 1996; Bertsekas and Tsitsiklis, 1996). This can occur even if the best approximation is found at each step before changing the policy, and whether the notion of "best" is in the mean-squared-error sense or the slightly different senses of residual-gradient, temporal-difference, and dynamic-programming methods. In this paper we explore an alternative approach to function approximation in RL.
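The policy-gradient idea can be sketched in a one-step setting with a softmax policy: Williams-style REINFORCE, updating parameters along the sampled gradient of expected reward. This toy bandit is our own construction, not the paper's general compatible-function-approximation theorem.

```python
import math
import random

def reinforce_bandit(means, steps=2000, lr=0.1, seed=0):
    """REINFORCE on a one-step problem with a softmax policy.

    means: deterministic reward of each action (keeps the sketch simple).
    Returns the learned policy parameters theta.
    """
    rng = random.Random(seed)
    theta = [0.0] * len(means)
    for _ in range(steps):
        exps = [math.exp(t) for t in theta]
        z = sum(exps)
        probs = [e / z for e in exps]
        a = rng.choices(range(len(means)), weights=probs)[0]
        r = means[a]
        # grad of log pi(a) w.r.t. theta is (indicator of a) - probs;
        # scale by the sampled reward, as in REINFORCE.
        for i in range(len(theta)):
            theta[i] += lr * r * ((1.0 if i == a else 0.0) - probs[i])
    return theta

# The parameter of the higher-reward action grows, so the softmax
# policy shifts its probability mass toward it.
theta = reinforce_bandit([0.1, 1.0])
```

Because the policy itself is parameterized, an arbitrarily small parameter change only shifts action probabilities slightly, avoiding the discontinuous action-selection changes the passage above identifies as the obstacle for value-function methods.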

FOCS Conference 1998 Conference Paper

Jitter Control in QoS Networks

  • Yishay Mansour
  • Boaz Patt-Shamir

We study jitter control in networks guaranteeing quality of service (QoS). Jitter measures variability of delivery times in packet streams. We propose on-line algorithms that control jitter and compare their performance to the best possible (by an off-line algorithm) for any given arrival sequence. For delay jitter, where the goal is to minimize the difference between delay times of different packets, we give an on-line algorithm using buffer size of 2B which guarantees the same delay-jitter as an off-line algorithm using buffer space B. We show that 2B space is the minimum space required by any on-line algorithm to provide delay-jitter related to the best possible delay-jitter using B buffer space. We also show that the guarantees made by our online algorithm hold even for distributed implementations, where the total buffer space is distributed along the path of the connection, provided that the input stream satisfies a certain simple property. For rate jitter, where the goal is to minimize the difference between inter-arrival times, we develop an on-line algorithm using a buffer of size 2B+h for any h ≥ 1, and compare its jitter to the jitter of an optimal off-line algorithm using buffer size B. Our algorithm guarantees that the difference is bounded by a term proportional to B/h. We also prove that 2B space is necessary for on-line algorithms with non-trivial guarantees for rate-jitter control.
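The core buffering idea behind delay-jitter control can be shown with a toy smoother: hold back the first few packets, then release at the sender's rate. This is only an illustration of why buffering flattens delays, not the paper's 2B-space algorithm; names and the prefetch policy are ours.

```python
def fixed_rate_release(arrivals, prefetch):
    """Toy delay-jitter smoothing.

    Packet i is sent at time i but arrives at arrivals[i] (FIFO, so
    arrivals are non-decreasing). Buffer until the first `prefetch`
    packets are in, then release one packet per step, never before it
    has arrived. Returns end-to-end delays (release time - send time);
    the delay jitter is max(delays) - min(delays).
    """
    start = arrivals[prefetch - 1] + 1  # first release after prefetch fills
    # Packet i is scheduled at start + i, delayed if it hasn't arrived yet.
    releases = [max(start + i, a) for i, a in enumerate(arrivals)]
    return [rel - i for i, rel in enumerate(releases)]

# Bursty arrivals for packets sent at times 0..4; with enough buffered
# packets the release schedule equalizes the end-to-end delays.
delays = fixed_rate_release([2, 3, 3, 6, 7], prefetch=3)
```

The paper's contribution is quantifying exactly how much buffer space (2B on-line versus B off-line) this kind of smoothing fundamentally requires.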

FOCS Conference 1995 Conference Paper

Competitive Access Time via Dynamic Storage Rearrangement (Preliminary Version)

  • Amos Fiat
  • Yishay Mansour
  • Adi Rosén
  • Orli Waarts

We model the problem of storing items in some warehouse (modeled as an undirected graph) where a server has to visit items over time, with the goal of minimizing the total distance traversed by the server. Special cases of this problem include the management of a real industrial stacker-crane warehouse, automatic robot-run warehouses, disk track optimization to minimize access time, managing two-dimensional memory (bubble memory and mass storage systems), doubly linked list management, and the process migration problem. The static version of this problem assumes some known probability distribution on the access patterns. We initiate the study of the dynamic version of the problem, where the robot may rearrange the warehouse to deal efficiently with future events. We require no statistical assumptions on the access pattern, and give competitive algorithms that rearrange the warehouse over time to deal efficiently with the true access patterns. We give non-trivial upper bounds for the general problem, along with some interesting lower bounds. In addition, we model realistic data access patterns on disk storage by considering two practically significant scenarios: access to some database via dynamically changing alternative indices and access patterns derived from root-to-leaf traversals of some (unknown) tree structure. In both cases we give greatly improved competitive ratios.

FOCS Conference 1995 Conference Paper

Efficient Algorithms for Learning to Play Repeated Games Against Computationally Bounded Adversaries

  • Yoav Freund
  • Michael J. Kearns
  • Yishay Mansour
  • Dana Ron
  • Ronitt Rubinfeld
  • Robert E. Schapire

We examine the problem of learning to play various games optimally against resource-bounded adversaries, with an explicit emphasis on the computational efficiency of the learning algorithm. We are especially interested in providing efficient algorithms for games other than penny-matching (in which payoff is received for matching the adversary's action in the current round), and for adversaries other than the classically studied finite automata. In particular, we examine games and adversaries for which the learning algorithm's past actions may strongly affect the adversary's future willingness to "cooperate" (that is, permit high payoff), and therefore require carefully planned actions on the part of the learning algorithm. For example, in the game we call contract, both sides play 0 or 1 on each round, but our side receives payoff only if we play 1 in synchrony with the adversary; unlike penny-matching, playing 0 in synchrony with the adversary pays nothing. The name of the game is derived from the example of signing a contract, which becomes valid only if both parties sign (play 1).

NeurIPS Conference 1995 Conference Paper

Implementation Issues in the Fourier Transform Algorithm

  • Yishay Mansour
  • Sigal Sahar

The Fourier transform of boolean functions has come to play an important role in proving many important learnability results. We aim to demonstrate that the Fourier transform techniques are also a useful and practical algorithm in addition to being a powerful theoretical tool. We describe the more prominent changes we have introduced to the algorithm, ones that were crucial and without which the performance of the algorithm would severely deteriorate. One of the benefits we present is the confidence level for each prediction which measures the likelihood the prediction is correct.

FOCS Conference 1995 Conference Paper

Simple Learning Algorithms for Decision Trees and Multivariate Polynomials

  • Nader H. Bshouty
  • Yishay Mansour

In this paper we develop a new approach for learning decision trees and multivariate polynomials via interpolation of multivariate polynomials. This new approach yields simple learning algorithms for multivariate polynomials and decision trees over finite fields under any constant bounded product distribution. The output hypothesis is a (single) multivariate polynomial that is an ε-approximation of the target under any constant bounded product distribution. The new approach demonstrates the learnability of many classes under any constant bounded product distribution and using membership queries, such as j-disjoint DNF and multivariate polynomials with bounded degree over any field. The technique shows how to interpolate multivariate polynomials with bounded term size from membership queries only. This in particular gives a learning algorithm for O(log n)-depth decision trees from membership queries only and a new learning algorithm for any multivariate polynomial over sufficiently large fields from membership queries only. We show that our results for learning from membership queries only are the best possible.

FOCS Conference 1989 Conference Paper

Constant Depth Circuits, Fourier Transform, and Learnability

  • Nati Linial
  • Yishay Mansour
  • Noam Nisan

Boolean functions in AC^0 are studied using the harmonic analysis of the cube. The main result is that an AC^0 Boolean function has almost all of its power spectrum on the low-order coefficients. This result implies the following properties of functions in AC^0: functions in AC^0 have low average sensitivity; they can be approximated well by a real polynomial of low degree; they cannot be pseudorandom function generators; and their correlation with any polylog-wise independent probability distribution is small. An O(n^polylog(n))-time algorithm for learning functions in AC^0 is obtained. The algorithm observes the behavior of an AC^0 function on O(n^polylog(n)) randomly chosen inputs and derives a good approximation for the Fourier transform of the function. This allows it to predict with high probability the value of the function on other randomly chosen inputs.
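The learning procedure the abstract describes, often called the low-degree algorithm, estimates each low-order Fourier coefficient by averaging over random inputs. A minimal sketch (function and parameter names are ours; the ±1 convention encodes boolean values):

```python
import itertools
import random

def estimate_fourier_coeffs(f, n, degree, samples=2000, seed=0):
    """Estimate Fourier coefficients f_hat(S) = E_x[f(x) * chi_S(x)]
    for all subsets S of [n] with |S| <= degree, from uniform random
    samples x in {-1, +1}^n.
    """
    rng = random.Random(seed)
    xs = [[rng.choice([-1, 1]) for _ in range(n)] for _ in range(samples)]
    coeffs = {}
    for k in range(degree + 1):
        for S in itertools.combinations(range(n), k):
            est = 0.0
            for x in xs:
                chi = 1
                for i in S:
                    chi *= x[i]  # chi_S(x) = product of x_i for i in S
                est += f(x) * chi
            coeffs[S] = est / samples
    return coeffs

# Parity of the first two bits has all its spectrum on S = {0, 1}.
coeffs = estimate_fourier_coeffs(lambda x: x[0] * x[1], n=3, degree=2)
```

Prediction then uses the sign of the truncated Fourier expansion built from these estimates; the main theorem guarantees the low-order coefficients carry almost all the weight for AC^0 functions.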

FOCS Conference 1989 Conference Paper

Polynomial End-To-End Communication (Extended Abstract)

  • Baruch Awerbuch
  • Yishay Mansour
  • Nir Shavit

A dynamic communication network is one in which links may repeatedly fail and recover. In such a network, although it is impossible to establish a path of unfailed links, reliable communication is possible if there is no cut of permanently failed links between a sender and receiver. The authors consider for such a network the basic task of end-to-end communication, that is, delivery in finite time of data items generated online at the sender, to the receiver, in order and without duplication or omission. The best known previous solutions to this problem had exponential complexity. Moreover, it has been conjectured that a polynomial solution is impossible. The authors disprove this conjecture, presenting the first polynomial end-to-end protocol. The protocol uses methods adopted from shared-memory algorithms and introduces novel techniques for fast load balancing in communication networks.

FOCS Conference 1989 Conference Paper

The Complexity of Approximating the Square Root (Extended Summary)

  • Yishay Mansour
  • Baruch Schieber
  • Prasoon Tiwari

The authors prove upper and lower bounds for approximately computing the square root using a given set of operations. The bounds are extended to hold for approximating the kth root, for any fixed k. Several tools from approximation theory are used to prove the lower bound. These include the Markoff inequality, Chebyshev polynomials, and a theorem that relates the degree of a rational function to its deviation from the approximated function over a given interval. The lower bound can be generalized to other algebraic functions. The upper bound can be generalized to obtain an O(1)-step straight-line program for evaluating any rational function with integer coefficients at a given integer point.

FOCS Conference 1988 Conference Paper

Results on learnability and the Vapnik-Chervonenkis dimension (Extended Abstract)

  • Nati Linial
  • Yishay Mansour
  • Ronald L. Rivest

The problem of learning a concept from examples in a distribution-free model is considered. The notion of dynamic sampling, wherein the number of examples examined can increase with the complexity of the target concept, is introduced. This method is used to establish the learnability of various concept classes with an infinite Vapnik-Chervonenkis (VC) dimension. An important variation on the problem of learning from examples, called approximating from examples, is also discussed. The problem of computing the VC dimension of a finite concept set defined on a finite domain is considered.

FOCS Conference 1987 Conference Paper

Interactive Proof Systems: Provers that never Fail and Random Selection (Extended Abstract)

  • Oded Goldreich 0001
  • Yishay Mansour
  • Michael Sipser

An interactive proof system with Perfect Completeness (resp. Perfect Soundness) for a language L is an interactive proof (for L) in which for every x ∈ L (resp. x ∉ L) the verifier always accepts (resp. always rejects). Zachos and Fuerer showed that any language having a bounded interactive proof has one with perfect completeness. We extend their result and show that any language having a (possibly unbounded) interactive proof system has one with perfect completeness. On the other hand, only languages in NP have interactive proofs with perfect soundness. We present two proofs of the main result. One proof extends Lautemann's proof that BPP is in the polynomial-time hierarchy. The other proof uses a new protocol for proving approximate lower bounds and "random selection". The problem of random selection consists of a verifier selecting at random, with uniform probability distribution, an element from an arbitrary set held by the prover. Previous protocols known for approximate lower bounds do not solve the random selection problem. Interestingly, random selection can be implemented by an unbounded Arthur-Merlin game but cannot be implemented by a two-iteration game.