Arrow Research search

Author name cluster

Robert E. Schapire

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

53 papers
2 author rows

Possible papers

53

ICML Conference 2024 Conference Paper

Provable Interactive Learning with Hindsight Instruction Feedback

  • Dipendra Misra
  • Aldo Pacchiano
  • Robert E. Schapire

We study interactive learning in a setting where the agent has to generate a response (e. g. , an action or trajectory) given a context and an instruction. In contrast, to typical approaches that train the system using reward or expert supervision on response, we study learning with hindsight labeling where a teacher provides an instruction that is most suitable for the agent’s generated response. This hindsight labeling of instruction is often easier to provide than providing expert supervision of the optimal response which may require expert knowledge or can be impractical to elicit. We initiate the theoretical analysis of interactive learning with hindsight labeling. We first provide a lower bound showing that in general, the regret of any algorithm must scale with the size of the agent’s response space. Next, we study a specialized setting where the underlying instruction-response distribution can be decomposed as a low-rank matrix. We introduce an algorithm called LORIL for this setting and show that it is a no-regret algorithm with the regret scaling with $\sqrt{T}$ and depends on the intrinsic rank but does not depend on the agent’s response space. We provide experiments showing the performance of LORIL in practice for 2 domains.

NeurIPS Conference 2023 Conference Paper

A Unified Model and Dimension for Interactive Estimation

  • Nataly Brukhim
  • Miro Dudik
  • Aldo Pacchiano
  • Robert E. Schapire

We study an abstract framework for interactive learning called interactive estimation in which the goal is to estimate a target from its ``similarity'' to points queried by the learner. We introduce a combinatorial measure called Dissimilarity dimension which largely captures learnability in our model. We present a simple, general, and broadly-applicable algorithm, for which we obtain both regret and PAC generalization bounds that are polynomial in the new dimension. We show that our framework subsumes and thereby unifies two classic learning models: statistical-query learning and structured bandits. We also delineate how the Dissimilarity dimension is related to well-known parameters for both frameworks, in some cases yielding significantly improved analyses.

NeurIPS Conference 2022 Conference Paper

Provably sample-efficient RL with side information about latent dynamics

  • Yao Liu
  • Dipendra Misra
  • Miro Dudik
  • Robert E. Schapire

We study reinforcement learning (RL) in settings where observations are high-dimensional, but where an RL agent has access to abstract knowledge about the structure of the state space, as is the case, for example, when a robot is tasked to go to a specific room in a building using observations from its own camera, while having access to the floor plan. We formalize this setting as transfer reinforcement learning from an "abstract simulator, " which we assume is deterministic (such as a simple model of moving around the floor plan), but which is only required to capture the target domain's latent-state dynamics approximately up to unknown (bounded) perturbations (to account for environment stochasticity). Crucially, we assume no prior knowledge about the structure of observations in the target domain except that they can be used to identify the latent states (but the decoding map is unknown). Under these assumptions, we present an algorithm, called TASID, that learns a robust policy in the target domain, with sample complexity that is polynomial in the horizon, and independent of the number of states, which is not possible without access to some prior knowledge. In synthetic experiments, we verify various properties of our algorithm and show that it empirically outperforms transfer RL algorithms that require access to "full simulators" (i. e. , those that also simulate observations).

NeurIPS Conference 2021 Conference Paper

Bayesian decision-making under misspecified priors with applications to meta-learning

  • Max Simchowitz
  • Christopher Tosh
  • Akshay Krishnamurthy
  • Daniel J. Hsu
  • Thodoris Lykouris
  • Miro Dudik
  • Robert E. Schapire

Thompson sampling and other Bayesian sequential decision-making algorithms are among the most popular approaches to tackle explore/exploit trade-offs in (contextual) bandits. The choice of prior in these algorithms offers flexibility to encode domain knowledge but can also lead to poor performance when misspecified. In this paper, we demonstrate that performance degrades gracefully with misspecification. We prove that the expected reward accrued by Thompson sampling (TS) with a misspecified prior differs by at most $\tilde{O}(H^2 \epsilon)$ from TS with a well-specified prior, where $\epsilon$ is the total-variation distance between priors and $H$ is the learning horizon. Our bound does not require the prior to have any parametric form. For priors with bounded support, our bound is independent of the cardinality or structure of the action space, and we show that it is tight up to universal constants in the worst case. Building on our sensitivity analysis, we establish generic PAC guarantees for algorithms in the recently studied Bayesian meta-learning setting and derive corollaries for various families of priors. Our results generalize along two axes: (1) they apply to a broader family of Bayesian decision-making algorithms, including a Monte-Carlo implementation of the knowledge gradient algorithm (KG), and (2) they apply to Bayesian POMDPs, the most general Bayesian decision-making setting, encompassing contextual bandits as a special case. Through numerical simulations, we illustrate how prior misspecification and the deployment of one-step look-ahead (as in KG) can impact the convergence of meta-learning in multi-armed and contextual bandits with structured and correlated priors.

ICML Conference 2021 Conference Paper

Interactive Learning from Activity Description

  • Khanh Nguyen
  • Dipendra Misra
  • Robert E. Schapire
  • Miroslav Dudík
  • Patrick Shafto

We present a novel interactive learning protocol that enables training request-fulfilling agents by verbally describing their activities. Unlike imitation learning (IL), our protocol allows the teaching agent to provide feedback in a language that is most appropriate for them. Compared with reward in reinforcement learning (RL), the description feedback is richer and allows for improved sample complexity. We develop a probabilistic framework and an algorithm that practically implements our protocol. Empirical results in two challenging request-fulfilling problems demonstrate the strengths of our approach: compared with RL baselines, it is more sample-efficient; compared with IL baselines, it achieves competitive success rates without requiring the teaching agent to be able to demonstrate the desired behavior using the learning agent’s actions. Apart from empirical evaluation, we also provide theoretical guarantees for our algorithm under certain assumptions about the teacher and the environment.

NeurIPS Conference 2021 Conference Paper

Multiclass Boosting and the Cost of Weak Learning

  • Nataly Brukhim
  • Elad Hazan
  • Shay Moran
  • Indraneel Mukherjee
  • Robert E. Schapire

Boosting is an algorithmic approach which is based on the idea of combining weak and moderately inaccurate hypotheses to a strong and accurate one. In this work we study multiclass boosting with a possibly large number of classes or categories. Multiclass boosting can be formulated in various ways. Here, we focus on an especially natural formulation in which the weak hypotheses are assumed to belong to an ''easy-to-learn'' base class, and the weak learner is an agnostic PAC learner for that class with respect to the standard classification loss. This is in contrast with other, more complicated losses as have often been considered in the past. The goal of the overall boosting algorithm is then to learn a combination of weak hypotheses by repeatedly calling the weak learner. We study the resources required for boosting, especially how theydepend on the number of classes $k$, for both the booster and weak learner. We find that the boosting algorithm itself only requires $O(\log k)$samples, as we show by analyzing a variant of AdaBoost for oursetting. In stark contrast, assuming typical limits on the number of weak-learner calls, we prove that the number of samples required by a weak learner is at least polynomial in $k$, exponentially more than thenumber of samples needed by the booster. Alternatively, we prove that the weak learner's accuracy parametermust be smaller than an inverse polynomial in $k$, showing that the returned weakhypotheses must be nearly the best in their class when $k$ is large. We also prove a trade-off between number of oracle calls and theresources required of the weak learner, meaning that the fewer calls to theweak learner the more that is demanded on each call.

FOCS Conference 2019 Conference Paper

Adversarial Bandits with Knapsacks

  • Nicole Immorlica
  • Karthik Abinav Sankararaman
  • Robert E. Schapire
  • Aleksandrs Slivkins

We consider Bandits with Knapsacks (henceforth, BwK), a general model for multi-armed bandits under supply/budget constraints. In particular, a bandit algorithm needs to solve a well-known knapsack problem: find an optimal packing of items into a limited-size knapsack. The BwK problem is a common generalization of numerous motivating examples, which range from dynamic pricing to repeated auctions to dynamic ad allocation to network routing and scheduling. While the prior work on BwK focused on the stochastic version, we pioneer the other extreme in which the outcomes can be chosen adversarially. This is a considerably harder problem, compared to both the stochastic version and the "classic" adversarial bandits, in that regret minimization is no longer feasible. Instead, the objective is to minimize the competitive ratio: the ratio of the benchmark reward to algorithm's reward. We design an algorithm with competitive ratio O(log T) relative to the best fixed distribution over actions, where T is the time horizon; we also prove a matching lower bound. The key conceptual contribution is a new perspective on the stochastic version of the problem. We suggest a new algorithm for the stochastic version, which builds on the framework of regret minimization in repeated games and admits a substantially simpler analysis compared to prior work. We then analyze this algorithm for the adversarial version, and use it as a subroutine to solve the latter. Our algorithm is the first "black-box reduction" from bandits to BwK: it takes an arbitrary bandit algorithm and uses it as a subroutine. We use this reduction to derive several extensions.

ICML Conference 2018 Conference Paper

Learning Deep ResNet Blocks Sequentially using Boosting Theory

  • Furong Huang
  • Jordan T. Ash
  • John Langford 0001
  • Robert E. Schapire

We prove a multi-channel telescoping sum boosting theory for the ResNet architectures which simultaneously creates a new technique for boosting over features (in contrast with labels) and provides a new algorithm for ResNet-style architectures. Our proposed training algorithm, BoostResNet, is particularly suitable in non-differentiable architectures. Our method only requires the relatively inexpensive sequential training of $T$ “shallow ResNets”. We prove that the training error decays exponentially with the depth $T$ if the weak module classifiers that we train perform slightly better than some weak baseline. In other words, we propose a weak learning condition and prove a boosting theory for ResNet under the weak learning condition. A generalization error bound based on margin theory is proved and suggests that ResNet could be resistant to overfitting using a network with $l_1$ norm bounded weights.

ICML Conference 2018 Conference Paper

Practical Contextual Bandits with Regression Oracles

  • Dylan J. Foster
  • Alekh Agarwal
  • Miroslav Dudík
  • Haipeng Luo
  • Robert E. Schapire

A major challenge in contextual bandits is to design general-purpose algorithms that are both practically useful and theoretically well-founded. We present a new technique that has the empirical and computational advantages of realizability-based approaches combined with the flexibility of agnostic methods. Our algorithms leverage the availability of a regression oracle for the value-function class, a more realistic and reasonable oracle than the classification oracles over policies typically assumed by agnostic methods. Our approach generalizes both UCB and LinUCB to far more expressive possible model classes and achieves low regret under certain distributional assumptions. In an extensive empirical evaluation, we find that our approach typically matches or outperforms both realizability-based and agnostic baselines.

ICML Conference 2017 Conference Paper

Contextual Decision Processes with low Bellman rank are PAC-Learnable

  • Nan Jiang 0008
  • Akshay Krishnamurthy
  • Alekh Agarwal
  • John Langford 0001
  • Robert E. Schapire

This paper studies systematic exploration for reinforcement learning (RL) with rich observations and function approximation. We introduce contextual decision processes (CDPs), that unify most prior RL settings. Our first contribution is a complexity measure, the Bellman rank, that we show enables tractable learning of near-optimal behavior in CDPs and is naturally small for many well-studied RL models. Our second contribution is a new RL algorithm that does systematic exploration to learn near-optimal behavior in CDPs with low Bellman rank. The algorithm requires a number of samples that is polynomial in all relevant parameters but independent of the number of unique contexts. Our approach uses Bellman error minimization with optimistic exploration and provides new insights into efficient exploration for RL with function approximation.

FOCS Conference 2017 Conference Paper

Oracle-Efficient Online Learning and Auction Design

  • Miroslav Dudík
  • Nika Haghtalab
  • Haipeng Luo
  • Robert E. Schapire
  • Vasilis Syrgkanis
  • Jennifer Wortman Vaughan

We consider the design of computationally efficient online learning algorithms in an adversarial setting in which the learner has access to an offline optimization oracle. We present an algorithm called Generalized Followthe- Perturbed-Leader and provide conditions under which it is oracle-efficient while achieving vanishing regret. Our results make significant progress on an open problem raised by Hazan and Koren [1], who showed that oracle-efficient algorithms do not exist in full generality and asked whether one can identify conditions under which oracle-efficient online learning may be possible. Our auction-design framework considers an auctioneer learning an optimal auction for a sequence of adversarially selected valuations with the goal of achieving revenue that is almost as good as the optimal auction in hindsight, among a class of auctions. We give oracle-efficient learning results for: (1) VCG auctions with bidder-specific reserves in singleparameter settings, (2) envy-free item-pricing auctions in multiitem settings, and (3) the level auctions of Morgenstern and Roughgarden [2] for single-item settings. The last result leads to an approximation of the overall optimal Myerson auction when bidders' valuations are drawn according to a fast-mixing Markov process, extending prior work that only gave such guarantees for the i. i. d. setting. We also derive various extensions, including: (1) oracleefficient algorithms for the contextual learning setting in which the learner has access to side information (such as bidder demographics), (2) learning with approximate oracles such as those based on Maximal-in-Range algorithms, and (3) no-regret bidding algorithms in simultaneous auctions, which resolve an open problem of Daskalakis and Syrgkanis [3].

ICML Conference 2016 Conference Paper

Efficient Algorithms for Adversarial Contextual Learning

  • Vasilis Syrgkanis
  • Akshay Krishnamurthy
  • Robert E. Schapire

We provide the first oracle efficient sublinear regret algorithms for adversarial versions of the contextual bandit problem. In this problem, the learner repeatedly makes an action on the basis of a context and receives reward for the chosen action, with the goal of achieving reward competitive with a large class of policies. We analyze two settings: i) in the transductive setting the learner knows the set of contexts a priori, ii) in the small separator setting, there exists a small set of contexts such that any two policies behave differently on one of the contexts in the set. Our algorithms fall into the Follow-The-Perturbed-Leader family (Kalai and Vempala, 2005) and achieve regret O(T^3/4\sqrtK\log(N)) in the transductive setting and O(T^2/3 d^3/4 K\sqrt\log(N)) in the separator setting, where T is the number of rounds, K is the number of actions, N is the number of baseline policies, and d is the size of the separator. We actually solve the more general adversarial contextual semi-bandit linear optimization problem, whilst in the full information setting we address the even more general contextual combinatorial optimization. We provide several extensions and implications of our algorithms, such as switching regret and efficient learning with predictable sequences.

ICML Conference 2014 Conference Paper

Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits

  • Alekh Agarwal
  • Daniel J. Hsu
  • Satyen Kale
  • John Langford 0001
  • Lihong Li 0001
  • Robert E. Schapire

We present a new algorithm for the contextual bandit learning problem, where the learner repeatedly takes one of K \emphactions in response to the observed \emphcontext, and observes the \emphreward only for that action. Our method assumes access to an oracle for solving fully supervised cost-sensitive classification problems and achieves the statistically optimal regret guarantee with only \otil(\sqrtKT) oracle calls across all T rounds. By doing so, we obtain the most practical contextual bandit learning algorithm amongst approaches that work for general policy classes. We conduct a proof-of-concept experiment which demonstrates the excellent computational and statistical performance of (an online variant of) our algorithm relative to several strong baselines.

ICML Conference 2014 Conference Paper

Towards Minimax Online Learning with Unknown Time Horizon

  • Haipeng Luo
  • Robert E. Schapire

We consider online learning when the time horizon is unknown. We apply a minimax analysis, beginning with the fixed horizon case, and then moving on to two unknown-horizon settings, one that assumes the horizon is chosen randomly according to some distribution, and the other which allows the adversary full control over the horizon. For the random horizon setting with restricted losses, we derive a fully optimal minimax algorithm. And for the adversarial horizon setting, we prove a nontrivial lower bound which shows that the adversary obtains strictly more power than when the horizon is fixed and known. Based on the minimax solution of the random horizon setting, we then propose a new adaptive algorithm which “pretends” that the horizon is drawn from a distribution from a special family, but no matter how the actual horizon is chosen, the worst-case regret is of the optimal rate. Furthermore, our algorithm can be combined and applied in many ways, for instance, to online convex optimization, follow the perturbed leader, exponential weights algorithm and first order bounds. Experiments show that our algorithm outperforms many other existing algorithms in an online linear optimization setting.

JMLR Journal 2013 Journal Article

A Theory of Multiclass Boosting

  • Indraneel Mukherjee
  • Robert E. Schapire

Boosting combines weak classifiers to form highly accurate predictors. Although the case of binary classification is well understood, in the multiclass setting, the “correct” requirements on the weak classifier, or the notion of the most efficient boosting algorithms are missing. In this paper, we create a broad and general framework, within which we make precise and identify the optimal requirements on the weak-classifier, as well as design the most effective, in a certain sense, boosting algorithms that assume such requirements. [abs] [ pdf ][ bib ] &copy JMLR 2013. ( edit, beta )

JMLR Journal 2013 Journal Article

The Rate of Convergence of AdaBoost

  • Indraneel Mukherjee
  • Cynthia Rudin
  • Robert E. Schapire

The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss”. Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost's computed parameter vector will be at most $\varepsilon$ more than that of any parameter vector of $\ell_1$-norm bounded by $B$ in a number of rounds that is at most a polynomial in $B$ and $1/\varepsilon$. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within $C/\varepsilon$ iterations, AdaBoost achieves a value of the exponential loss that is at most $\varepsilon$ more than the best possible value, where $C$ depends on the data set. We show that this dependence of the rate on $\varepsilon$ is optimal up to constant factors, that is, at least $\Omega(1/\varepsilon)$ rounds are necessary to achieve within $\varepsilon$ of the optimal exponential loss. [abs] [ pdf ][ bib ] &copy JMLR 2013. ( edit, beta )

UAI Conference 2010 Conference Paper

Combining Spatial and Telemetric Features for Learning Animal Movement Models

  • Berk Kapicioglu
  • Robert E. Schapire
  • Martin Wikelski
  • Tamara Broderick

We introduce a new graphical model for tracking radio-tagged animals and learning their movement patterns. The model provides a principled way to combine radio telemetry data with an arbitrary set of userdefined, spatial features. We describe an efficient stochastic gradient algorithm for fitting model parameters to data and demonstrate its effectiveness via asymptotic analysis and synthetic experiments. We also apply our model to real datasets, and show that it outperforms the most popular radio telemetry software package used in ecology. We conclude that integration of different data sources under a single statistical framework, coupled with appropriate parameter and state estimation procedures, produces both accurate location estimates and an interpretable statistical model of animal movement.

TCS Journal 2010 Journal Article

Learning with continuous experts using drifting games

  • Indraneel Mukherjee
  • Robert E. Schapire

We consider the problem of learning to predict as well as the best in a group of experts making continuous predictions. We assume the learning algorithm has prior knowledge of the maximum number of mistakes of the best expert. We propose a new master strategy that achieves the best known performance for on-line learning with continuous experts in the mistake bounded model. Our ideas are based on drifting games, a generalization of boosting and on-line learning algorithms. We prove new lower bounds based on the drifting games framework which, though not as tight as previous bounds, have simpler proofs and do not require an enormous number of experts. We also extend previous lower bounds to show that our upper bounds are exactly tight for sufficiently many experts. A surprising consequence of our work is that continuous experts are only as powerful as experts making binary or no prediction in each round.

JMLR Journal 2009 Journal Article

Margin-based Ranking and an Equivalence between AdaBoost and RankBoost

  • Cynthia Rudin
  • Robert E. Schapire

We study boosting algorithms for learning to rank. We give a general margin-based bound for ranking based on covering numbers for the hypothesis space. Our bound suggests that algorithms that maximize the ranking margin will generalize well. We then describe a new algorithm, smooth margin ranking, that precisely converges to a maximum ranking-margin solution. The algorithm is a modification of RankBoost, analogous to "approximate coordinate ascent boosting." Finally, we prove that AdaBoost and RankBoost are equally good for the problems of bipartite ranking and classification in terms of their asymptotic behavior on the training set. Under natural conditions, AdaBoost achieves an area under the ROC curve that is equally as good as RankBoost's; furthermore, RankBoost, when given a specific intercept, achieves a misclassification error that is as good as AdaBoost's. This may help to explain the empirical observations made by Cortes and Mohri, and Caruana and Niculescu-Mizil, about the excellent performance of AdaBoost as a bipartite ranking algorithm, as measured by the area under the ROC curve. [abs] [ pdf ][ bib ] &copy JMLR 2009. ( edit, beta )

ICML Conference 2008 Conference Paper

Apprenticeship learning using linear programming

  • Umar Syed
  • Michael H. Bowling
  • Robert E. Schapire

In apprenticeship learning, the goal is to learn a policy in a Markov decision process that is at least as good as a policy demonstrated by an expert. The difficulty arises in that the MDP's true reward function is assumed to be unknown. We show how to frame apprenticeship learning as a linear programming problem, and show that using an off-the-shelf LP solver to solve this problem results in a substantial improvement in running time over existing methods---up to two orders of magnitude faster in our experiments. Additionally, our approach produces stationary policies, while all existing methods for apprenticeship learning output policies that are "mixed", i.e. randomized combinations of stationary policies. The technique used is general enough to convert any mixed policy to a stationary policy.

UAI Conference 2007 Conference Paper

Imitation Learning with a Value-Based Prior

  • Umar Syed
  • Robert E. Schapire

The goal of imitation learning is for an apprentice to learn how to behave in a stochastic environment by observing a mentor demonstrating the correct behavior. Accurate prior knowledge about the correct behavior can reduce the need for demonstrations from the mentor. We present a novel approach to encoding prior knowledge about the correct behavior, where we assume that this prior knowledge takes the form of a Markov Decision Process (MDP) that is used by the apprentice as a rough and imperfect model of the mentor's behavior. Specifically, taking a Bayesian approach, we treat the value of a policy in this modeling MDP as the log prior probability of the policy. In other words, we assume a priori that the mentor's behavior is likely to be a high value policy in the modeling MDP, though quite possibly different from the optimal policy. We describe an efficient algorithm that, given a modeling MDP and a set of demonstrations by a mentor, provably converges to a stationary point of the log posterior of the mentor's policy, where the posterior is computed with respect to the "value based" prior. We also present empirical evidence that this prior does in fact speed learning of the mentor's policy, and is an improvement in our experiments over similar previous methods.

JMLR Journal 2007 Journal Article

Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling

  • Miroslav Dudík
  • Steven J. Phillips
  • Robert E. Schapire

We present a unified and complete account of maximum entropy density estimation subject to constraints represented by convex potential functions or, alternatively, by convex regularization. We provide fully general performance guarantees and an algorithm with a complete convergence proof. As special cases, we easily derive performance guarantees for many known regularization types, including l 1, l 2, l 2 2, and l 2 + l 2 2 style regularization. We propose an algorithm solving a large and general subclass of generalized maximum entropy problems, including all discussed in the paper, and prove its convergence. Our approach generalizes and unifies techniques based on information geometry and Bregman divergences as well as those based more directly on compactness. Our work is motivated by a novel application of maximum entropy to species distribution modeling, an important problem in conservation biology and ecology. In a set of experiments on real-world data, we demonstrate the utility of maximum entropy in this setting. We explore effects of different feature types, sample sizes, and regularization levels on the performance of maxent, and discuss interpretability of the resulting models. [abs] [ pdf ][ bib ] &copy JMLR 2007. ( edit, beta )

ICML Conference 2006 Conference Paper

Algorithms for portfolio management based on the Newton method

  • Amit Agarwal 0008
  • Elad Hazan
  • Satyen Kale
  • Robert E. Schapire

We experimentally study on-line investment algorithms first proposed by Agarwal and Hazan and extended by Hazan et al. which achieve almost the same wealth as the best constant-rebalanced portfolio determined in hindsight. These algorithms are the first to combine optimal logarithmic regret bounds with efficient deterministic computability. They are based on the Newton method for offline optimization which, unlike previous approaches, exploits second order information. After analyzing the algorithm using the potential function introduced by Agarwal and Hazan, we present extensive experiments on actual financial data. These experiments confirm the theoretical advantage of our algorithms, which yield higher returns and run considerably faster than previous algorithms with optimal regret. Additionally, we perform financial analysis using mean-variance calculations and the Sharpe ratio.

ICML Conference 2006 Conference Paper

How boosting the margin can also boost classifier complexity

  • Lev Reyzin
  • Robert E. Schapire

Boosting methods are known not to usually overfit training data even as the size of the generated classifiers becomes large. Schapire et al. attempted to explain this phenomenon in terms of the margins the classifier achieves on training examples. Later, however, Breiman cast serious doubt on this explanation by introducing a boosting algorithm, arc-gv, that can generate a higher margins distribution than AdaBoost and yet performs worse. In this paper, we take a close look at Breiman's compelling but puzzling results. Although we can reproduce his main finding, we find that the poorer performance of arc-gv can be explained by the increased complexity of the base classifiers it uses, an explanation supported by our experiments and entirely consistent with the margins theory. Thus, we find maximizing the margins is desirable, but not necessarily at the expense of other factors, especially base-classifier complexity.

JMLR Journal 2004 Journal Article

The Dynamics of AdaBoost: Cyclic Behavior and Convergence of Margins

  • Cynthia Rudin
  • Ingrid Daubechies
  • Robert E. Schapire

In order to study the convergence properties of the AdaBoost algorithm, we reduce AdaBoost to a nonlinear iterated map and study the evolution of its weight vectors. This dynamical systems approach allows us to understand AdaBoost's convergence properties completely in certain cases; for these cases we find stable cycles, allowing us to explicitly solve for AdaBoost's output. Using this unusual technique, we are able to show that AdaBoost does not always converge to a maximum margin combined classifier, answering an open question. In addition, we show that "non-optimal" AdaBoost (where the weak learning algorithm does not necessarily choose the best weak classifier at each iteration) may fail to converge to a maximum margin classifier, even if "optimal" AdaBoost produces a maximum margin. Also, we show that if AdaBoost cycles, it cycles among "support vectors", i.e., examples that achieve the same smallest margin. [abs] [ pdf ]

JMLR Journal 2003 Journal Article

An Efficient Boosting Algorithm for Combining Preferences

  • Yoav Freund
  • Raj Iyer
  • Robert E. Schapire
  • Yoram Singer

We study the problem of learning to accurately rank a set of objects by combining a given collection of ranking or preference functions. This problem of combining preferences arises in several applications, such as that of combining the results of different search engines, or the "collaborative-filtering" problem of ranking movies for a user based on the movie rankings provided by other users. In this work, we begin by presenting a formal framework for this general problem. We then describe and analyze an efficient algorithm called RankBoost for combining preferences based on the boosting approach to machine learning. We give theoretical results describing the algorithm's behavior both on the training data, and on new test data not seen during training. We also describe an efficient implementation of the algorithm for a particular restricted but common case. We next discuss two experiments we carried out to assess the performance of RankBoost. In the first experiment, we used the algorithm to combine different web search strategies, each of which is a query expansion for a given domain. The second experiment is a collaborative-filtering task for making movie recommendations. [abs] [ pdf ][ ps.gz ][ ps ]

UAI Conference 2002 Conference Paper

Advances in Boosting

  • Robert E. Schapire

Boosting is a general method of generating many simple classification rules and combining them into a single, highly accurate rule. In this talk, I will review the AdaBoost boosting algorithm and some of its underlying theory, and then look at how this theory has helped us to face some of the challenges of applying AdaBoost in two domains: In the first of these, we used boosting for predicting and modeling the uncertainty of prices in complicated, interacting auctions. The second application was to the classification of caller utterances in a telephone spoken-dialogue system where we faced two challenges: the need to incorporate prior knowledge to compensate for initially insufficient data; and a later need to filter the large stream of unlabeled examples being collected to select the ones whose labels are likely to be most informative.

JMLR Journal 2000 Journal Article

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers

  • Erin L. Allwein
  • Robert E. Schapire
  • Yoram Singer

We present a unifying framework for studying the solution of multiclass categorization problems by reducing them to multiple binary problems that are then solved using a margin-based binary learning algorithm. The proposed framework unifies some of the most popular approaches in which each class is compared against all others, or in which all pairs of classes are compared to each other, or in which output codes with error-correcting properties are used. We propose a general method for combining the classifiers generated on the binary problems, and we prove a general empirical multiclass loss bound given the empirical loss of the individual binary learning algorithms. The scheme and the corresponding bounds apply to many popular classification learning algorithms including support-vector machines, AdaBoost, regression, logistic regression and decision-tree algorithms. We also give a multiclass generalization error analysis for general output codes with AdaBoost as the binary learner. Experimental results with SVM and AdaBoost show that our scheme provides a viable alternative to the most commonly used multiclass algorithms.

FOCS Conference 1995 Conference Paper

Efficient Algorithms for Learning to Play Repeated Games Against Computationally Bounded Adversaries

  • Yoav Freund
  • Michael J. Kearns
  • Yishay Mansour
  • Dana Ron
  • Ronitt Rubinfeld
  • Robert E. Schapire

We examine the problem of learning to play various games optimally against resource-bounded adversaries, with an explicit emphasis on the computational efficiency of the learning algorithm. We are especially interested in providing efficient algorithms for games other than penny-matching (in which payoff is received for matching the adversary's action in the current round), and for adversaries other than the classically studied finite automata. In particular, we examine games and adversaries for which the learning algorithm's past actions may strongly affect the adversary's future willingness to "cooperate" (that is, permit high payoff), and therefore require carefully planned actions on the part of the learning algorithm. For example, in the game we call contract, both sides play O or 1 on each round, but our side receives payoff only if we play 1 in synchrony with the adversary; unlike penny-matching, playing O in synchrony with the adversary pays nothing. The name of the game is derived from the example of signing a contract, which becomes valid only if both parties sign (play 1).

FOCS Conference 1995 Conference Paper

Gambling in a Rigged Casino: The Adversarial Multi-Arm Bandit Problem

  • Peter Auer
  • Nicolò Cesa-Bianchi
  • Yoav Freund
  • Robert E. Schapire

In the multi-armed bandit problem, a gambler must decide which arm of K non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines. In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of T plays, we prove that the expected per-round payoff of our algorithm approaches that of the best arm at the rate O(T/sup -1/3/), and we give an improved rate of convergence when the best arm has fairly low payoff. We also consider a setting in which the player has a team of "experts" advising him on which arm to play; here, we give a strategy that will guarantee expected payoff close to that of the best expert. Finally, we apply our result to the problem of learning to play an unknown repeated matrix game against an all-powerful adversary.

FOCS Conference 1990 Conference Paper

Efficient Distribution-free Learning of Probabilistic Concepts (Extended Abstract)

  • Michael J. Kearns
  • Robert E. Schapire

A model of machine learning in which the concept to be learned may exhibit uncertain or probabilistic behavior is investigated. Such probabilistic concepts (or p-concepts) may arise in situations such as weather prediction, where the measured variables and their accuracy are insufficient to determine the outcome with certainty. It is required that learning algorithms be both efficient and general in the sense that they perform well for a wide class of p-concepts and for any distribution over the domain. Many efficient algorithms for learning natural classes of p-concepts are given, and an underlying theory of learning p-concepts is developed in detail. >

FOCS Conference 1990 Conference Paper

Exact Identification of Circuits Using Fixed Points of Amplification Functions (Extended Abstract)

  • Sally A. Goldman
  • Michael J. Kearns
  • Robert E. Schapire

A technique for exactly identifying certain classes of read-once Boolean formulas is introduced. The method is based on sampling the input-output behavior of the target formula on a probability distribution which is determined by the fixed point of the formula's amplification function (defined as the probability that a 1 is output by the formula when each input bit is 1 independently with probability p). By performing various statistical tests on easily sampled variants of the fixed-point distribution, it is possible to infer efficiently all structural information about any logarithmic-depth target family (with high probability). The results are used to prove the existence of short universal identification sequences for large classes of formulas. Extensions of the algorithms to handle high rates of noise and to learn formulas of unbounded depth in L. G. Valiant's (1984) model with respect to specific distributions are described. >

FOCS Conference 1989 Conference Paper

Learning Binary Relations and Total Orders (Extended Abstract)

  • Sally A. Goldman
  • Ronald L. Rivest
  • Robert E. Schapire

The problem of designing polynomial prediction algorithms for learning binary relations is studied for an online model in which the instances are drawn by the learner, by a helpful teacher, by an adversary, or according to a probability distribution on the instance space. The relation is represented as an n*m binary matrix, and results are presented when the matrix is restricted to have at most k distinct row types, and when it is constrained by requiring that the predicate form a total order. >

FOCS Conference 1989 Conference Paper

The Strength of Weak Learnability (Extended Abstract)

  • Robert E. Schapire

The problem of improving the accuracy of a hypothesis output by a learning algorithm in the distribution-free learning model is considered. A concept class is learnable (or strongly learnable) if, given access to a source of examples from the unknown concept, the learner with high probability is able to output a hypothesis that is correct on all but an arbitrarily small fraction of the instances. The concept class is weakly learnable if the learner can produce a hypothesis that forms only slightly better than random guessing. It is shown that these two notions of learnability are equivalent. An explicit method is described for directly converting a weak learning algorithm into one that achieves arbitrarily high accuracy. This construction may have practical applications as a tool for efficiently converting a mediocre learning algorithm into one that performs extremely well. In addition, the construction has some interesting theoretical consequences. >

FOCS Conference 1987 Conference Paper

Diversity-Based Inference of Finite Automata (Extended Abstract)

  • Ronald L. Rivest
  • Robert E. Schapire

We present a new procedure for inferring the structure of a finitestate automaton (FSA) from its input/output behavior, using access to the automaton to perform experiments. Our procedure uses a new representation for FSA's, based on the notion of equivalence between testa. We call the number of such equivalence classes the diversity of the automaton; the diversity may be as small as the logarithm of the number of states of the automaton. The size of our representation of the FSA, and the running time of our procedure (in some case provably, in others conjecturally) is polynomial in the diversity and ln(1/ε), where ε is a given upper bound on the probability that our procedure returns an incorrect result. (Since our procedure uses randomization to perform experiments, there is a certain controllable chance that it will return an erroneous result.) We also present some evidence for the practical efficiency of our approach. For example, our procedure is able to infer the structure of an automaton based on Rubik's Cube (which has approximately 1019 states) in about 2 minutes on a DEC Micro Vax. This automaton is many orders of magnitude larger than possible with previous techniques, which would require time proportional at least to the number of global states. (Note that in this example, only a small fraction (10-14) of the global states were even visited.)