Author name cluster

Ofer Dekel

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

31 papers

2 author rows

NeurIPS Conference 2018 Conference Paper

Learning SMaLL Predictors

Vikas Garg
Ofer Dekel
Lin Xiao

We introduce a new framework for learning in severely resource-constrained settings. Our technique delicately amalgamates the representational richness of multiple linear predictors with the sparsity of Boolean relaxations, and thereby yields classifiers that are compact, interpretable, and accurate. We provide a rigorous formalism of the learning problem, and establish fast convergence of the ensuing algorithm via relaxation to a minimax saddle point objective. We supplement the theoretical foundations of our work with an extensive empirical evaluation.

UAI Conference 2018 Conference Paper

Sparse Multi-Prototype Classification

Vikas K. Garg 0001
Lin Xiao
Ofer Dekel

We introduce a new class of sparse multiprototype classifiers, designed to combine the computational advantages of sparse predictors with the non-linear power of prototype-based classification techniques. This combination makes sparse multiprototype models especially well-suited for resource-constrained computational platforms, such as IoT devices. We cast our supervised learning problem as a convex-concave saddle point problem and design a provably-fast algorithm to solve it. We complement our theoretical analysis with an empirical study that demonstrates the merits of our methodology.

ICML Conference 2017 Conference Paper

Adaptive Neural Networks for Efficient Inference

Tolga Bolukbasi
Joseph Wang 0001
Ofer Dekel
Venkatesh Saligrama

We present an approach to adaptively utilize deep neural networks in order to reduce the evaluation time on new examples without loss of accuracy. Rather than attempting to redesign or approximate existing networks, we propose two schemes that adaptively utilize networks. We first pose an adaptive network evaluation scheme, where we learn a system to adaptively choose the components of a deep network to be evaluated for each example. By allowing examples correctly classified using early layers of the system to exit, we avoid the computational time associated with full evaluation of the network. We extend this to learn a network selection system that adaptively selects the network to be evaluated for each example. We show that computational time can be dramatically reduced by exploiting the fact that many examples can be correctly classified using relatively efficient networks and that complex, computationally costly networks are only necessary for a small fraction of examples. We pose a global objective for learning an adaptive early exit or network selection policy and solve it by reducing the policy learning problem to a layer-by-layer weighted binary classification problem. Empirically, these approaches yield dramatic reductions in computational cost, with up to a 2.8x speedup on state-of-the-art networks from the ImageNet image recognition challenge with minimal ($<1\%$) loss of top-5 accuracy.

NeurIPS Conference 2017 Conference Paper

Online Learning with a Hint

Ofer Dekel
Arthur Flajolet
Nika Haghtalab
Patrick Jaillet

We study a variant of online linear optimization where the player receives a hint about the loss function at the beginning of each round. The hint is given in the form of a vector that is weakly correlated with the loss vector on that round. We show that the player can benefit from such a hint if the set of feasible actions is sufficiently round. Specifically, if the set is strongly convex, the hint can be used to guarantee a regret of $O(\log T)$, and if the set is $q$-uniformly convex for $q \in (2, 3)$, the hint can be used to guarantee a regret of $o(\sqrt{T})$. In contrast, we establish $\Omega(\sqrt{T})$ lower bounds on regret when the set of feasible actions is a polyhedron.

NeurIPS Conference 2015 Conference Paper

Bandit Smooth Convex Optimization: Improving the Bias-Variance Tradeoff

Ofer Dekel
Ronen Eldan
Tomer Koren

Bandit convex optimization is one of the fundamental problems in the field of online learning. The best algorithm for the general bandit convex optimization problem guarantees a regret of $\widetilde{O}(T^{5/6})$, while the best known lower bound is $\Omega(T^{1/2})$. Many attempts have been made to bridge the huge gap between these bounds. A particularly interesting special case of this problem assumes that the loss functions are smooth. In this case, the best known algorithm guarantees a regret of $\widetilde{O}(T^{2/3})$. We present an efficient algorithm for the bandit smooth convex optimization problem that guarantees a regret of $\widetilde{O}(T^{5/8})$. Our result rules out an $\Omega(T^{2/3})$ lower bound and takes a significant step towards the resolution of this open problem.

STOC Conference 2014 Conference Paper

Bandits with switching costs: $T^{2/3}$ regret

Ofer Dekel
Jian Ding
Tomer Koren
Yuval Peres

NeurIPS Conference 2014 Conference Paper

The Blinded Bandit: Learning with Adaptive Feedback

Ofer Dekel
Elad Hazan
Tomer Koren

We study an online learning setting where the player is temporarily deprived of feedback each time it switches to a different action. Such a model of adaptive feedback naturally occurs in scenarios where the environment reacts to the player's actions and requires some time to recover and stabilize after the algorithm switches actions. This motivates a variant of the multi-armed bandit problem, which we call the blinded multi-armed bandit, in which no feedback is given to the algorithm whenever it switches arms. We develop efficient online learning algorithms for this problem and prove that they guarantee the same asymptotic regret as the optimal algorithms for the standard multi-armed bandit problem. This result stands in stark contrast to another recent result, which states that adding a switching cost to the standard multi-armed bandit makes it substantially harder to learn, and provides a direct comparison of how feedback and loss contribute to the difficulty of an online learning problem. We also extend our results to the general prediction framework of bandit linear optimization, again attaining near-optimal regret bounds.

ICML Conference 2013 Conference Paper

Better Rates for Any Adversarial Deterministic MDP

Ofer Dekel
Elad Hazan

We consider regret minimization in adversarial deterministic Markov Decision Processes (ADMDPs) with bandit feedback. We devise a new algorithm that pushes the state-of-the-art forward in two ways: First, it attains a regret of $O(T^{2/3})$ with respect to the best fixed policy in hindsight, whereas the previous best regret bound was $O(T^{3/4})$. Second, the algorithm and its analysis are compatible with any feasible ADMDP graph topology, while all previous approaches required additional restrictions on the graph topology.

NeurIPS Conference 2013 Conference Paper

Online Learning with Switching Costs and Other Adaptive Adversaries

Nicolò Cesa-Bianchi
Ofer Dekel
Ohad Shamir

We study the power of different types of adaptive (nonoblivious) adversaries in the setting of prediction with expert advice, under both full-information and bandit feedback. We measure the player's performance using a new notion of regret, also known as policy regret, which better captures the adversary's adaptiveness to the player's behavior. In a setting where losses are allowed to drift, we characterize, in a nearly complete manner, the power of adaptive adversaries with bounded memories and switching costs. In particular, we show that with switching costs, the attainable rate with bandit feedback is $T^{2/3}$. Interestingly, this rate is significantly worse than the $\sqrt{T}$ rate attainable with switching costs in the full-information case. Via a novel reduction from experts to bandits, we also show that a bounded memory adversary can force $T^{2/3}$ regret even in the full-information case, proving that switching costs are easier to control than bounded memory adversaries. Our lower bounds rely on a new stochastic adversary strategy that generates loss processes with strong dependencies.

UAI Conference 2012 Conference Paper

Deterministic MDPs with Adversarial Rewards and Bandit Feedback

Raman Arora
Ofer Dekel
Ambuj Tewari

We consider a Markov decision process with deterministic state transition dynamics, adversarially generated rewards that change arbitrarily from round to round, and a bandit feedback model in which the decision maker only observes the rewards it receives. In this setting, we present a novel and efficient online decision making algorithm named MarcoPolo. Under mild assumptions on the structure of the transition dynamics, we prove that MarcoPolo enjoys a regret of $O(T^{3/4}\sqrt{\log T})$ against the best deterministic policy in hindsight. Specifically, our analysis does not rely on the stringent unichain assumption, which dominates much of the previous work on this topic.

ICML Conference 2012 Conference Paper

Online Bandit Learning against an Adaptive Adversary: from Regret to Policy Regret

Ofer Dekel
Ambuj Tewari
Raman Arora

JMLR Journal 2012 Journal Article

Optimal Distributed Online Prediction Using Mini-Batches

Ofer Dekel
Ran Gilad-Bachrach
Ohad Shamir
Lin Xiao

Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem.

JMLR Journal 2012 Journal Article

Selective Sampling and Active Learning from Single and Multiple Teachers

Ofer Dekel
Claudio Gentile
Karthik Sridharan

We present a new online learning algorithm in the selective sampling framework, where labels must be actively queried before they are revealed. We prove bounds on the regret of our algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. Our bounds both generalize and strictly improve over previous bounds in similar settings. Additionally, our selective sampling algorithm can be converted into an efficient statistical active learning algorithm. We extend our algorithm and analysis to the multiple-teacher setting, where the algorithm can choose which subset of teachers to query for each label. Finally, we demonstrate the effectiveness of our techniques on a real-world Internet search problem.

ICML Conference 2011 Conference Paper

Bundle Selling by Online Estimation of Valuation Functions

Daniel Vainsencher
Ofer Dekel
Shie Mannor

ICML Conference 2011 Conference Paper

Optimal Distributed Online Prediction

Ofer Dekel
Ran Bachrach
Ohad Shamir
Lin Xiao

NeurIPS Conference 2009 Conference Paper

Distribution-Calibrated Hierarchical Classification

Ofer Dekel

While many advances have already been made on the topic of hierarchical classification learning, we take a step back and examine how a hierarchical classification problem should be formally defined. We pay particular attention to the fact that many arbitrary decisions go into the design of the label taxonomy that is provided with the training data, and that this taxonomy is often unbalanced. We correct this problem by using the data distribution to calibrate the hierarchical classification loss function. This distribution-based correction must be done with care, to avoid introducing unmanageable statistical dependencies into the learning problem. This leads us off the beaten path of binomial-type estimation and into the uncharted waters of geometric-type estimation. We present a new calibrated definition of statistical risk for hierarchical classification, an unbiased geometric estimator for this risk, and a new algorithmic reduction from hierarchical classification to cost-sensitive classification.

ICML Conference 2009 Conference Paper

Good learners for evil teachers

Ofer Dekel
Ohad Shamir

NeurIPS Conference 2008 Conference Paper

From Online to Batch Learning with Cutoff-Averaging

Ofer Dekel

We present "cutoff averaging", a technique for converting any conservative online learning algorithm into a batch learning algorithm. Most online-to-batch conversion techniques work well with certain types of online learning algorithms and not with others, whereas cutoff averaging explicitly tries to adapt to the characteristics of the online algorithm being converted. An attractive property of our technique is that it preserves the efficiency of the original online algorithm, making it appropriate for large-scale learning problems. We provide a statistical analysis of our technique and back our theoretical claims with experimental results.

SODA Conference 2008 Conference Paper

Incentive compatible regression learning

Ofer Dekel
Felix A. Fischer
Ariel D. Procaccia

ICML Conference 2008 Conference Paper

Learning to classify with missing and corrupted features

Ofer Dekel
Ohad Shamir

After a classifier is trained using a machine learning algorithm and put to use in a real world system, it often faces noise which did not appear in the training data. Particularly, some subset of features may be missing or may become corrupted. We present two novel machine learning techniques that are robust to this type of classification-time noise. First, we solve an approximation to the learning problem using linear programming. We analyze the tightness of our approximation and prove statistical risk bounds for this approach. Second, we define the online-learning variant of our problem, address this variant using a modified Perceptron, and obtain a statistical learning algorithm using an online-to-batch technique. We conclude with a set of experiments that demonstrate the effectiveness of our algorithms.

JMLR Journal 2007 Journal Article

Online Learning of Multiple Tasks with a Shared Loss

Ofer Dekel
Philip M. Long
Yoram Singer

We study the problem of learning multiple tasks in parallel within the online learning framework. On each online round, the algorithm receives an instance for each of the parallel tasks and responds by predicting the label of each instance. We consider the case where the predictions made on each round all contribute toward a common goal. The relationship between the various tasks is defined by a global loss function, which evaluates the overall quality of the multiple predictions made on each round. Specifically, each individual prediction is associated with its own loss value, and then these multiple loss values are combined into a single number using the global loss function. We focus on the case where the global loss function belongs to the family of absolute norms, and present several online learning algorithms for the induced problem. We prove worst-case relative loss bounds for all of our algorithms, and demonstrate the effectiveness of our approach on a large-scale multiclass-multilabel text categorization problem.

JMLR Journal 2006 Journal Article

Online Passive-Aggressive Algorithms

Koby Crammer
Ofer Dekel
Joseph Keshet
Shai Shalev-Shwartz
Yoram Singer

We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.

NeurIPS Conference 2006 Conference Paper

Support Vector Machines on a Budget

Ofer Dekel
Yoram Singer

The standard Support Vector Machine formulation does not provide its user with the ability to explicitly control the number of support vectors used to define the generated classifier. We present a modified version of SVM that allows the user to set a budget parameter B and focuses on minimizing the loss attained by the B worst-classified examples while ignoring the remaining examples. This idea can be used to derive sparse versions of both L1-SVM and L2-SVM. Technically, we obtain these new SVM variants by replacing the 1-norm in the standard SVM formulation with various interpolation-norms. We also adapt the SMO optimization algorithm to our setting and report on some preliminary experimental results.

NeurIPS Conference 2005 Conference Paper

Data-Driven Online to Batch Conversions

Ofer Dekel
Yoram Singer

Online learning algorithms are typically fast, memory efficient, and simple to implement. However, many common learning problems fit more naturally in the batch learning setting. The power of online learning algorithms can be exploited in batch settings by using online-to-batch conversion techniques which build a new batch algorithm from an existing online algorithm. We first give a unified overview of three existing online-to-batch conversion techniques which do not use training data in the conversion process. We then build upon these data-independent conversions to derive and analyze data-driven conversions. Our conversions find hypotheses with a small risk by explicitly minimizing data-dependent generalization bounds. We experimentally demonstrate the usefulness of our approach and in particular show that the data-driven conversions consistently outperform the data-independent conversions.

JMLR Journal 2005 Journal Article

Smooth ε-Insensitive Regression by Loss Symmetrization

Ofer Dekel
Shai Shalev-Shwartz
Yoram Singer

We describe new loss functions for regression problems along with an accompanying algorithmic framework which utilizes these functions. These loss functions are derived by symmetrization of margin-based losses commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. The resulting symmetric logistic loss can be viewed as a smooth approximation to the ε-insensitive hinge loss used in support vector regression. We describe and analyze two parametric families of batch learning algorithms for minimizing these symmetric losses. The first family employs an iterative log-additive update which can be viewed as a regression counterpart to recent boosting algorithms. The second family utilizes an iterative additive update step. We also describe and analyze online gradient descent (GD) and exponentiated gradient (EG) algorithms for the symmetric logistic loss. A byproduct of our work is a new simple form of regularization for boosting-based classification and regression algorithms. Our regression framework also has implications on classification algorithms, namely, a new additive update boosting algorithm for classification. We demonstrate the merits of our algorithms in a series of experiments.

NeurIPS Conference 2005 Conference Paper

The Forgetron: A Kernel-Based Perceptron on a Fixed Budget

Ofer Dekel
Shai Shalev-Shwartz
Yoram Singer

The Perceptron algorithm, despite its simplicity, often performs well on online classification tasks. The Perceptron becomes especially effective when it is used in conjunction with kernels. However, a common difficulty encountered when implementing kernel-based online algorithms is the amount of memory required to store the online hypothesis, which may grow unboundedly. In this paper we present and analyze the Forgetron algorithm for kernel-based online learning on a fixed memory budget. To our knowledge, this is the first online learning algorithm which, on one hand, maintains a strict limit on the number of examples it stores while, on the other hand, entertains a relative mistake bound. In addition to the formal results, we also present experiments with real datasets which underscore the merits of our approach.

ICML Conference 2004 Conference Paper

Large margin hierarchical classification

Ofer Dekel
Joseph Keshet
Yoram Singer

NeurIPS Conference 2004 Conference Paper

The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

Ofer Dekel
Shai Shalev-Shwartz
Yoram Singer

Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight.

NeurIPS Conference 2003 Conference Paper

Log-Linear Models for Label Ranking

Ofer Dekel
Yoram Singer
Christopher Manning

Label ranking is the task of inferring a total order over a predefined set of labels for each given instance. We present a general framework for batch learning of label ranking functions from supervised data. We assume that each instance in the training data is associated with a list of preferences over the label-set; however, we do not assume that this list is either complete or consistent. This enables us to accommodate a variety of ranking problems. In contrast to the general form of the supervision, our goal is to learn a ranking function that induces a total order over the entire set of labels. Special cases of our setting are multilabel categorization and hierarchical classification. We present a general boosting-based learning algorithm for the label ranking problem and prove a lower bound on the progress of each boosting iteration. The applicability of our approach is demonstrated with a set of experiments on a large-scale text corpus.

NeurIPS Conference 2003 Conference Paper

Online Passive-Aggressive Algorithms

Shai Shalev-Shwartz
Koby Crammer
Ofer Dekel
Yoram Singer

We present a unified view for online classification, regression, and uniclass problems. This view leads to a single algorithmic framework for the three problems. We prove worst case loss bounds for various algorithms for both the realizable case and the non-realizable case. A conversion of our main online algorithm to the setting of batch learning is also discussed. The end result is new algorithms and accompanying loss bounds for the hinge-loss.

NeurIPS Conference 2002 Conference Paper

Multiclass Learning by Probabilistic Embeddings

Ofer Dekel
Yoram Singer

We describe a new algorithmic framework for learning multiclass categorization problems. In this framework a multiclass predictor is composed of a pair of embeddings that map both instances and labels into a common space. In this space each instance is assigned the label it is nearest to. We outline and analyze an algorithm, termed Bunching, for learning the pair of embeddings from labeled data. A key construction in the analysis of the algorithm is the notion of probabilistic output codes, a generalization of error correcting output codes (ECOC). Furthermore, the method of multiclass categorization using ECOC is shown to be an instance of Bunching. We demonstrate the advantage of Bunching over ECOC by comparing their performance on numerous categorization problems.