Arrow Research search

Author name cluster

Manfred K. Warmuth

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

67 papers
2 author rows

Possible papers

67

NeurIPS Conference 2024 Conference Paper

Hyperbolic Embeddings of Supervised Models

  • Richard Nock
  • Ehsan Amid
  • Frank Nielsen
  • Alexander Soen
  • Manfred K. Warmuth

Models of hyperbolic geometry have been successfully used in ML for two main tasks: embedding models in unsupervised learning ( e. g. hierarchies) and embedding data. To our knowledge, there are no approaches that provide embeddings for supervised models; even when hyperbolic geometry provides convenient properties for expressing popular hypothesis classes, such as decision trees (and ensembles). In this paper, we propose a full-fledged solution to the problem in three independent contributions. The first linking the theory of losses for class probability estimation to hyperbolic embeddings in Poincar\'e disk model. The second resolving an issue for a clean, unambiguous embedding of (ensembles of) decision trees in this model. The third showing how to smoothly tweak the Poincar\'e hyperbolic distance to improve its encoding and visualization properties near the border of the disk, a crucial region for our application, while keeping hyperbolicity. This last step has substantial independent interest as it is grounded in a generalization of Leibniz-Newton's fundamental Theorem of calculus.

AAAI Conference 2024 Conference Paper

Optimal Transport with Tempered Exponential Measures

  • Ehsan Amid
  • Frank Nielsen
  • Richard Nock
  • Manfred K. Warmuth

In the field of optimal transport, two prominent subfields face each other: (i) unregularized optimal transport, ``a-la-Kantorovich'', which leads to extremely sparse plans but with algorithms that scale poorly, and (ii) entropic-regularized optimal transport, ``a-la-Sinkhorn-Cuturi'', which gets near-linear approximation algorithms but leads to maximally un-sparse plans. In this paper, we show that an extension of the latter to tempered exponential measures, a generalization of exponential families with indirect measure normalization, gets to a very convenient middle ground, with both very fast approximation algorithms and sparsity, which is under control up to sparsity patterns. In addition, our formulation fits naturally in the unbalanced optimal transport problem setting.

JMLR Journal 2022 Journal Article

Unbiased estimators for random design regression

  • Michał Dereziński
  • Manfred K. Warmuth
  • Daniel Hsu

In linear regression we wish to estimate the optimum linear least squares predictor for a distribution over $d$-dimensional input points and real-valued responses, based on a small sample. Under standard random design analysis, where the sample is drawn i.i.d. from the input distribution, the least squares solution for that sample can be viewed as the natural estimator of the optimum. Unfortunately, this estimator almost always incurs an undesirable bias coming from the randomness of the input points, which is a significant bottleneck in model averaging. In this paper we show that it is possible to draw a non-i.i.d. sample of input points such that, regardless of the response model, the least squares solution is an unbiased estimator of the optimum. Moreover, this sample can be produced efficiently by augmenting a previously drawn i.i.d. sample with an additional set of $d$ points, drawn jointly according to a certain determinantal point process constructed from the input distribution rescaled by the squared volume spanned by the points. Motivated by this, we develop a theoretical framework for studying volume-rescaled sampling, and in the process prove a number of new matrix expectation identities. We use them to show that for any input distribution and $\epsilon>0$ there is a random design consisting of $O(d\log d+ d/\epsilon)$ points from which an unbiased estimator can be constructed whose expected square loss over the entire distribution is bounded by $1+\epsilon$ times the loss of the optimum. We provide efficient algorithms for constructing such unbiased estimators in a number of practical settings. In one such setting, we let the input distribution be uniform over a large dataset of $n\gg d$ points. Here, we obtain the first unbiased least squares estimator that can be constructed in time nearly-linear in the data size, resulting in strong guarantees for model averaging. We achieve these computational gains by introducing a new algorithmic technique, called distortion-free intermediate sampling, which is the first method to enable sampling from determinantal point processes in time polynomial in the sample size. [abs] [ pdf ][ bib ] &copy JMLR 2022. ( edit, beta )

AAAI Conference 2020 Conference Paper

An Implicit Form of Krasulina’s k-PCA Update without the Orthonormality Constraint

  • Ehsan Amid
  • Manfred K. Warmuth

We shed new insights on the two commonly used updates for the online k-PCA problem, namely, Krasulina’s and Oja’s updates. We show that Krasulina’s update corresponds to a projected gradient descent step on the Stiefel manifold of orthonormal k-frames, while Oja’s update amounts to a gradient descent step using the unprojected gradient. Following these observations, we derive a more implicit form of Krasulina’s k-PCA update, i. e. a version that uses the information of the future gradient as much as possible. Most interestingly, our implicit Krasulina update avoids the costly QR-decomposition step by bypassing the orthonormality constraint. A related update, called the Sanger’s rule, can be seen as an explicit approximation of our implicit update. We show that the new update in fact corresponds to an online EM step applied to a probabilistic k-PCA model. The probabilistic view of the update allows us to combine multiple models in a distributed setting. We show experimentally that the implicit Krasulina update yields superior convergence while being significantly faster. We also give strong evidence that the new update can benefit from parallelism and is more stable w. r. t. tuning of the learning rate.

UAI Conference 2020 Conference Paper

Divergence-Based Motivation for Online EM and Combining Hidden Variable Models

  • Ehsan Amid
  • Manfred K. Warmuth

Expectation-Maximization (EM) is a prominent approach for parameter estimation of hidden (aka latent) variable models. Given the full batch of data, EM forms an upper-bound of the negative log-likelihood of the model at each iteration and updates to the minimizer of this upper-bound. We first provide a “model level” interpretation of the EM upper-bound as a sum of relative entropy divergences to a set of singleton models induced by the batch of observations. Our alternative motivation unifies the “observation level” and the “model level” view of the EM. As a result, we formulate an online version of the EM algorithm by adding an analogous inertia term which is a relative entropy divergence to the old model. Our motivation is more widely applicable than the previous approaches and leads to simple online updates for mixture of exponential distributions, hidden Markov models, and the first known online update for Kalman filters. Additionally, the finite sample form of the inertia term lets us derive online updates when there is no closed-form solution. Finally, we extend the analysis to the distributed setting where we motivate a systematic way of combining multiple hidden variable models. Experimentally, we validate the results on synthetic as well as real-world datasets.

ICML Conference 2019 Conference Paper

Adaptive Scale-Invariant Online Algorithms for Learning Linear Models

  • Michal Kempka
  • Wojciech Kotlowski
  • Manfred K. Warmuth

We consider online learning with linear models, where the algorithm predicts on sequentially revealed instances (feature vectors), and is compared against the best linear function (comparator) in hindsight. Popular algorithms in this framework, such as Online Gradient Descent (OGD), have parameters (learning rates), which ideally should be tuned based on the scales of the features and the optimal comparator, but these quantities only become available at the end of the learning process. In this paper, we resolve the tuning problem by proposing online algorithms making predictions which are invariant under arbitrary rescaling of the features. The algorithms have no parameters to tune, do not require any prior knowledge on the scale of the instances or the comparator, and achieve regret bounds matching (up to a logarithmic factor) that of OGD with optimally tuned separate learning rates per dimension, while retaining comparable runtime performance.

I&C Journal 2019 Journal Article

Mistake bounds on the noise-free multi-armed bandit game

  • Atsuyoshi Nakamura
  • David P. Helmbold
  • Manfred K. Warmuth

We study the { 0, 1 } -loss version of adaptive adversarial multi-armed bandit problems with α ( ≥ 1 ) lossless arms. For the problem, we show a tight bound K − α − Θ ( 1 / T ) on the minimax expected number of mistakes (1-losses), where K is the number of arms and T is the number of rounds.

NeurIPS Conference 2019 Conference Paper

Robust Bi-Tempered Logistic Loss Based on Bregman Divergences

  • Ehsan Amid
  • Manfred K. Warmuth
  • Rohan Anil
  • Tomer Koren

We introduce a temperature into the exponential function and replace the softmax output layer of the neural networks by a high-temperature generalization. Similarly, the logarithm in the loss we use for training is replaced by a low-temperature logarithm. By tuning the two temperatures, we create loss functions that are non-convex already in the single layer case. When replacing the last layer of the neural networks by our bi-temperature generalization of the logistic loss, the training becomes more robust to noise. We visualize the effect of tuning the two temperatures in a simple setting and show the efficacy of our method on large datasets. Our methodology is based on Bregman divergences and is superior to a related two-temperature method that uses the Tsallis divergence.

NeurIPS Conference 2018 Conference Paper

Leveraged volume sampling for linear regression

  • Michal Derezinski
  • Manfred K. Warmuth
  • Daniel Hsu

Suppose an n x d design matrix in a linear regression problem is given, but the response for each point is hidden unless explicitly requested. The goal is to sample only a small number k << n of the responses, and then produce a weight vector whose sum of squares loss over all points is at most 1+epsilon times the minimum. When k is very small (e. g. , k=d), jointly sampling diverse subsets of points is crucial. One such method called "volume sampling" has a unique and desirable property that the weight vector it produces is an unbiased estimate of the optimum. It is therefore natural to ask if this method offers the optimal unbiased estimate in terms of the number of responses k needed to achieve a 1+epsilon loss approximation. Surprisingly we show that volume sampling can have poor behavior when we require a very accurate approximation -- indeed worse than some i. i. d. sampling techniques whose estimates are biased, such as leverage score sampling. We then develop a new rescaled variant of volume sampling that produces an unbiased estimate which avoids this bad behavior and has at least as good a tail bound as leverage score sampling: sample size k=O(d log d + d/epsilon) suffices to guarantee total loss at most 1+epsilon times the minimum with high probability. Thus, we improve on the best previously known sample size for an unbiased estimator, k=O(d^2/epsilon). Our rescaling procedure leads to a new efficient algorithm for volume sampling which is based on a "determinantal rejection sampling" technique with potentially broader applications to determinantal point processes. Other contributions include introducing the combinatorics needed for rescaled volume sampling and developing tail bounds for sums of dependent random matrices which arise in the process.

JMLR Journal 2018 Journal Article

Reverse Iterative Volume Sampling for Linear Regression

  • Michał Dereziński
  • Manfred K. Warmuth

We study the following basic machine learning task: Given a fixed set of input points in $\mathbb{R}^d$ for a linear regression problem, we wish to predict a hidden response value for each of the points. We can only afford to attain the responses for a small subset of the points that are then used to construct linear predictions for all points in the dataset. The performance of the predictions is evaluated by the total square loss on all responses (the attained as well as the remaining hidden ones). We show that a good approximate solution to this least squares problem can be obtained from just dimension $d$ many responses by using a joint sampling technique called volume sampling. Moreover, the least squares solution obtained for the volume sampled subproblem is an unbiased estimator of optimal solution based on all $n$ responses. This unbiasedness is a desirable property that is not shared by other common subset selection techniques. Motivated by these basic properties, we develop a theoretical framework for studying volume sampling, resulting in a number of new matrix expectation equalities and statistical guarantees which are of importance not only to least squares regression but also to numerical linear algebra in general. Our methods also lead to a regularized variant of volume sampling, and we propose the first efficient algorithm for volume sampling which makes this technique a practical tool in the machine learning toolbox. Finally, we provide experimental evidence which confirms our theoretical findings. [abs] [ pdf ][ bib ] &copy JMLR 2018. ( edit, beta )

NeurIPS Conference 2017 Conference Paper

Online Dynamic Programming

  • Holakou Rahmanian
  • Manfred K. Warmuth

We consider the problem of repeatedly solving a variant of the same dynamic programming problem in successive trials. An instance of the type of problems we consider is to find a good binary search tree in a changing environment. At the beginning of each trial, the learner probabilistically chooses a tree with the n keys at the internal nodes and the n + 1 gaps between keys at the leaves. The learner is then told the frequencies of the keys and gaps and is charged by the average search cost for the chosen tree. The problem is online because the frequencies can change between trials. The goal is to develop algorithms with the property that their total average search cost (loss) in all trials is close to the total loss of the best tree chosen in hindsight for all trials. The challenge, of course, is that the algorithm has to deal with exponential number of trees. We develop a general methodology for tackling such problems for a wide class of dynamic programming algorithms. Our framework allows us to extend online learning algorithms like Hedge and Component Hedge to a significantly wider class of combinatorial objects than was possible before.

NeurIPS Conference 2017 Conference Paper

Unbiased estimates for linear regression via volume sampling

  • Michal Derezinski
  • Manfred K. Warmuth

Given a full rank matrix X with more columns than rows consider the task of estimating the pseudo inverse $X^+$ based on the pseudo inverse of a sampled subset of columns (of size at least the number of rows). We show that this is possible if the subset of columns is chosen proportional to the squared volume spanned by the rows of the chosen submatrix (ie, volume sampling). The resulting estimator is unbiased and surprisingly the covariance of the estimator also has a closed form: It equals a specific factor times $X^+X^{+\top}$. Pseudo inverse plays an important part in solving the linear least squares problem, where we try to predict a label for each column of $X$. We assume labels are expensive and we are only given the labels for the small subset of columns we sample from $X$. Using our methods we show that the weight vector of the solution for the sub problem is an unbiased estimator of the optimal solution for the whole problem based on all column labels. We believe that these new formulas establish a fundamental connection between linear least squares and volume sampling. We use our methods to obtain an algorithm for volume sampling that is faster than state-of-the-art and for obtaining bounds for the total loss of the estimated least-squares solution on all labeled columns.

JMLR Journal 2016 Journal Article

Online PCA with Optimal Regret

  • Jiazhong Nie
  • Wojciech Kotlowski
  • Manfred K. Warmuth

We investigate the online version of Principle Component Analysis (PCA), where in each trial $t$ the learning algorithm chooses a $k$-dimensional subspace, and upon receiving the next instance vector $\x_t$, suffers the compression loss, which is the squared Euclidean distance between this instance and its projection into the chosen subspace. When viewed in the right parameterization, this compression loss is linear, i.e. it can be rewritten as $\text{tr}(\mathbf{W}_t\x_t\x_t^\top)$, where $\mathbf{W}_t$ is the parameter of the algorithm and the outer product $\x_t\x_t^\top$ (with $\|\x_t\|\le 1$) is the instance matrix. In this paper generalize PCA to arbitrary positive definite instance matrices $\mathbf{X}_t$ with the linear loss $\text{tr}(\mathbf{W}_t\X_t)$. We evaluate online algorithms in terms of their worst-case regret, which is a bound on the additional total loss of the online algorithm on all instances matrices over the compression loss of the best $k$-dimensional subspace (chosen in hindsight). We focus on two popular online algorithms for generalized PCA: the Gradient Descent (GD) and Matrix Exponentiated Gradient (MEG) algorithms. We show that if the regret is expressed as a function of the number of trials, then both algorithms are optimal to within a constant factor on worst-case sequences of positive definite instances matrices with trace norm at most one (which subsumes the original PCA problem with outer products). This is surprising because MEG is believed be suboptimal in this case. We also show that when considering regret bounds as a function of a loss budget, then MEG remains optimal and strictly outperforms GD when the instance matrices are trace norm bounded. Next, we consider online PCA when the adversary is allowed to present the algorithm with positive semidefinite instance matrices whose largest eigenvalue is bounded (rather than their trace which is the sum of their eigenvalues). Again we can show that MEG is optimal and strictly better than GD in this setting. [abs] [ pdf ][ bib ] &copy JMLR 2016. ( edit, beta )

TCS Journal 2014 Journal Article

Combining initial segments of lists

  • Manfred K. Warmuth
  • Wouter M. Koolen
  • David P. Helmbold

We propose a new way to build a combined list from K base lists, each containing N items. A combined list consists of top segments of various sizes from each base list so that the total size of all top segments equals N. A sequence of item requests is processed and the goal is to minimize the total number of misses. That is, we seek to build a combined list that contains all the frequently requested items. We first consider the special case of disjoint base lists. There, we design an efficient algorithm that computes the best combined list for a given sequence of requests. In addition, we develop a randomized online algorithm whose expected number of misses is close to that of the best combined list chosen in hindsight. We prove lower bounds that show that the expected number of misses of our randomized algorithm is close to the optimum. In the presence of duplicate items, we show that computing the best combined list is NP-hard. We show that our algorithms still apply to a linearized notion of loss in this case. We expect that this new way of aggregating lists will find many ranking applications.

TCS Journal 2014 Journal Article

Kernelization of matrix updates, when and how?

  • Manfred K. Warmuth
  • Wojciech Kotłowski
  • Shuisheng Zhou

We define what it means for a learning algorithm to be kernelizable in the case when the instances are vectors, asymmetric matrices and symmetric matrices, respectively. We can characterize kernelizability in terms of an invariance of the algorithm to certain orthogonal transformations. If we assume that the algorithm's action relies on a linear prediction, then we can show that in each case, the linear parameter vector must be a certain linear combination of the instances. We give a number of examples of how to apply our methods. In particular we show how to kernelize multiplicative updates for symmetric instance matrices.

NeurIPS Conference 2014 Conference Paper

The limits of squared Euclidean distance regularization

  • Michal Derezinski
  • Manfred K. Warmuth

Some of the simplest loss functions considered in Machine Learning are the square loss, the logistic loss and the hinge loss. The most common family of algorithms, including Gradient Descent (GD) with and without Weight Decay, always predict with a linear combination of the past instances. We give a random construction for sets of examples where the target linear weight vector is trivial to learn but any algorithm from the above family is drastically sub-optimal. Our lower bound on the latter algorithms holds even if the algorithms are enhanced with an arbitrary kernel function. This type of result was known for the square loss. However, we develop new techniques that let us prove such hardness results for any loss function satisfying some minimal requirements on the loss function (including the three listed above). We also show that algorithms that regularize with the squared Euclidean distance are easily confused by random features. Finally, we conclude by discussing related open problems regarding feed forward neural networks. We conjecture that our hardness results hold for any training algorithm that is based on the squared Euclidean distance regularization (i. e. Back-propagation with the Weight Decay heuristic).

NeurIPS Conference 2012 Conference Paper

Putting Bayes to sleep

  • Dmitry Adamskiy
  • Manfred K. Warmuth
  • Wouter Koolen

We consider sequential prediction algorithms that are given the predictions from a set of models as inputs. If the nature of the data is changing over time in that different models predict well on different segments of the data, then adaptivity is typically achieved by mixing into the weights in each round a bit of the initial prior (kind of like a weak restart). However, what if the favored models in each segment are from a small subset, i. e. the data is likely to be predicted well by models that predicted well before? Curiously, fitting such ''sparse composite models'' is achieved by mixing in a bit of all the past posteriors. This self-referential updating method is rather peculiar, but it is efficient and gives superior performance on many natural data sets. Also it is important because it introduces a long-term memory: any model that has done well in the past can be recovered quickly. While Bayesian interpretations can be found for mixing in a bit of the initial prior, no Bayesian interpretation is known for mixing in past posteriors. We build atop the ''specialist'' framework from the online learning literature to give the Mixing Past Posteriors update a proper Bayesian foundation. We apply our method to a well-studied multitask learning problem and obtain a new intriguing efficient update that achieves a significantly better bound.

NeurIPS Conference 2011 Conference Paper

Learning Eigenvectors for Free

  • Wouter Koolen
  • Wojciech Kotlowski
  • Manfred K. Warmuth

We extend the classical problem of predicting a sequence of outcomes from a finite alphabet to the matrix domain. In this extension, the alphabet of $n$ outcomes is replaced by the set of all dyads, i. e. outer products $\u\u^\top$ where $\u$ is a vector in $\R^n$ of unit length. Whereas in the classical case the goal is to learn (i. e. sequentially predict as well as) the best multinomial distribution, in the matrix case we desire to learn the density matrix that best explains the observed sequence of dyads. We show how popular online algorithms for learning a multinomial distribution can be extended to learn density matrices. Intuitively, learning the $n^2$ parameters of a density matrix is much harder than learning the $n$ parameters of a multinomial distribution. Completely surprisingly, we prove that the worst-case regrets of certain classical algorithms and their matrix generalizations are identical. The reason is that the worst-case sequence of dyads share a common eigensystem, i. e. the worst case regret is achieved in the classical case. So these matrix algorithms learn the eigenvectors without any regret.

NeurIPS Conference 2010 Conference Paper

Repeated Games against Budgeted Adversaries

  • Jacob Abernethy
  • Manfred K. Warmuth

We study repeated zero-sum games against an adversary on a budget. Given that an adversary has some constraint on the sequence of actions that he plays, we consider what ought to be the player's best mixed strategy with knowledge of this budget. We show that, for a general class of normal-form games, the minimax strategy is indeed efficiently computable and relies on a random playout" technique. We give three diverse applications of this algorithmic template: a cost-sensitive "Hedge" setting, a particular problem in Metrical Task Systems, and the design of combinatorial prediction markets. "

JMLR Journal 2009 Journal Article

Learning Permutations with Exponential Weights

  • David P. Helmbold
  • Manfred K. Warmuth

We give an algorithm for the on-line learning of permutations. The algorithm maintains its uncertainty about the target permutation as a doubly stochastic weight matrix, and makes predictions using an efficient method for decomposing the weight matrix into a convex combination of permutations. The weight matrix is updated by multiplying the current matrix entries by exponential factors, and an iterative procedure is needed to restore double stochasticity. Even though the result of this procedure does not have a closed form, a new analysis approach allows us to prove an optimal (up to small constant factors) bound on the regret of our algorithm. This regret bound is significantly better than that of either Kalai and Vempala's more efficient Follow the Perturbed Leader algorithm or the computationally expensive method of explicitly representing each permutation as an expert. [abs] [ pdf ][ bib ] &copy JMLR 2009. ( edit, beta )

JMLR Journal 2008 Journal Article

Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

  • Manfred K. Warmuth
  • Dima Kuzmin

We design an online algorithm for Principal Component Analysis. In each trial the current instance is centered and projected into a probabilistically chosen low dimensional subspace. The regret of our online algorithm, that is, the total expected quadratic compression loss of the online algorithm minus the total quadratic compression loss of the batch algorithm, is bounded by a term whose dependence on the dimension of the instances is only logarithmic. We first develop our methodology in the expert setting of online learning by giving an algorithm for learning as well as the best subset of experts of a certain size. This algorithm is then lifted to the matrix setting where the subsets of experts correspond to subspaces. The algorithm represents the uncertainty over the best subspace as a density matrix whose eigenvalues are bounded. The running time is O ( n 2 ) per trial, where n is the dimension of the instances. [abs] [ pdf ][ bib ] &copy JMLR 2008. ( edit, beta )

NeurIPS Conference 2007 Conference Paper

Boosting Algorithms for Maximizing the Soft Margin

  • Gunnar Rätsch
  • Manfred K. Warmuth
  • Karen Glocer

Gunnar R¨atsch Friedrich Miescher Laboratory Max Planck Society T¨ubingen, Germany We present a novel boosting algorithm, called SoftBoost, designed for sets of bi- nary labeled examples that are not necessarily separable by convex combinations of base hypotheses. Our algorithm achieves robustness by capping the distribu- tions on the examples. Our update of the distribution is motivated by minimizing a relative entropy subject to the capping constraints and constraints on the edges of the obtained base hypotheses. The capping constraints imply a soft margin in the dual optimization problem. Our algorithm produces a convex combination of hypotheses whose soft margin is within δ of its maximum. We employ relative en- tropy projection methods to prove an O( ln N δ2 ) iteration bound for our algorithm, where N is number of examples. We compare our algorithm with other approaches including LPBoost, Brown- Boost, and SmoothBoost. We show that there exist cases where the number of iter- ations required by LPBoost grows linearly in N instead of the logarithmic growth for SoftBoost. In simulation studies we show that our algorithm converges about as fast as LPBoost, faster than BrownBoost, and much faster than SmoothBoost. In a benchmark comparison we illustrate the competitiveness of our approach.

ICML Conference 2007 Conference Paper

Online kernel PCA with entropic matrix updates

  • Dima Kuzmin
  • Manfred K. Warmuth

A number of updates for density matrices have been developed recently that are motivated by relative entropy minimization problems. The updates involve a softmin calculation based on matrix logs and matrix exponentials. We show that these updates can be kernelized. This is important because the bounds provable for these algorithms are logarithmic in the feature dimension (provided that the 2-norm of feature vectors is bounded by a constant). The main problem we focus on is the kernelization of an online PCA algorithm which belongs to this family of updates.

JMLR Journal 2007 Journal Article

Unlabeled Compression Schemes for Maximum Classes

  • Dima Kuzmin
  • Manfred K. Warmuth

Maximum concept classes of VC dimension d over n domain points have size n C ≤ d, and this is an upper bound on the size of any concept class of VC dimension d over n points. We give a compression scheme for any maximum class that represents each concept by a subset of up to d unlabeled domain points and has the property that for any sample of a concept in the class, the representative of exactly one of the concepts consistent with the sample is a subset of the domain of the sample. This allows us to compress any sample of a concept in the class to a subset of up to d unlabeled sample points such that this subset represents a concept consistent with the entire original sample. Unlike the previously known compression scheme for maximum classes (Floyd and Warmuth, 1995) which compresses to labeled subsets of the sample of size equal d, our new scheme is tight in the sense that the number of possible unlabeled compression sets of size at most d equals the number of concepts in the class. [abs] [ pdf ][ bib ] &copy JMLR 2007. ( edit, beta )

NeurIPS Conference 2006 Conference Paper

Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

  • Manfred K. Warmuth
  • Dima Kuzmin

We design an on-line algorithm for Principal Component Analysis. In each trial the current instance is projected onto a probabilistically chosen low dimensional subspace. The total expected quadratic approximation error equals the total quadratic approximation error of the best subspace chosen in hindsight plus some additional term that grows linearly in dimension of the subspace but logarithmically in the dimension of the instances.

ICML Conference 2006 Conference Paper

Totally corrective boosting algorithms that maximize the margin

  • Manfred K. Warmuth
  • Jun Liao
  • Gunnar Rätsch

We consider boosting algorithms that maintain a distribution over a set of examples. At each iteration a weak hypothesis is received and the distribution is updated. We motivate these updates as minimizing the relative entropy subject to linear constraints. For example AdaBoost constrains the edge of the last hypothesis w.r.t. the updated distribution to be at most γ = 0. In some sense, AdaBoost is "corrective" w.r.t. the last hypothesis. A cleaner boosting method is to be "totally corrective": the edges of all past hypotheses are constrained to be at most γ, where γ is suitably adapted.Using new techniques, we prove the same iteration bounds for the totally corrective algorithms as for their corrective versions. Moreover with adaptive γ, the algorithms provably maximizes the margin. Experimentally, the totally corrective versions return smaller convex combinations of weak hypotheses than the corrective ones and are competitive with LPBoost, a totally corrective boosting algorithm with no regularization, for which there is no iteration bound known.

NeurIPS Conference 2005 Conference Paper

A Bayes Rule for Density Matrices

  • Manfred K. Warmuth

The classical Bayes rule computes the posterior model probability from the prior probability and the data likelihood. We generalize this rule to the case when the prior is a density matrix (symmetric positive definite and trace one) and the data likelihood a covariance matrix. The classical Bayes rule is retained as the special case when the matrices are diagonal. In the classical setting, the calculation of the probability of the data is an expected likelihood, where the expectation is over the prior distribution. In the generalized setting, this is replaced by an expected variance calculation where the variance is computed along the eigenvectors of the prior density matrix and the expectation is over the eigenvalues of the density matrix (which form a probability vector). The variances along any direction is determined by the covariance matrix. Curiously enough this expected variance calculation is a quantum measurement where the co-variance matrix specifies the instrument and the prior density matrix the mixture state of the particle. We motivate both the classical and the generalized Bayes rule with a minimum relative entropy principle, where the Kullbach-Leibler version gives the classical Bayes rule and Umegaki's quantum relative entropy the new Bayes rule for density matrices.

JMLR Journal 2005 Journal Article

Efficient Margin Maximizing with Boosting

  • Gunnar Rätsch
  • Manfred K. Warmuth

AdaBoost produces a linear combination of base hypotheses and predicts with the sign of this linear combination. The linear combination may be viewed as a hyperplane in feature space where the base hypotheses form the features. It has been observed that the generalization error of the algorithm continues to improve even after all examples are on the correct side of the current hyperplane. The improvement is attributed to the experimental observation that the distances (margins) of the examples to the separating hyperplane are increasing even after all examples are on the correct side. We introduce a new version of AdaBoost, called AdaBoost* ν, that explicitly maximizes the minimum margin of the examples up to a given precision. The algorithm incorporates a current estimate of the achievable margin into its calculation of the linear coefficients of the base hypotheses. The bound on the number of iterations needed by the new algorithms is the same as the number needed by a known version of AdaBoost that must have an explicit estimate of the achievable margin as a parameter. We also illustrate experimentally that our algorithm requires considerably fewer iterations than other algorithms that aim to maximize the margin. [abs] [ pdf ][ bib ] &copy JMLR 2005. ( edit, beta )

JMLR Journal 2005 Journal Article

Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection

  • Koji Tsuda
  • Gunnar Rätsch
  • Manfred K. Warmuth

We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to linear constraints. The updates generalize the exponentiated gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the derivation and the analyses of the original EG update and AdaBoost generalize to the non-diagonal case. We apply the resulting matrix exponentiated gradient (MEG) update and DefiniteBoost to the problem of learning a kernel matrix from distance measurements. [abs] [ pdf ][ bib ] &copy JMLR 2005. ( edit, beta )

NeurIPS Conference 2004 Conference Paper

Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

  • Koji Tsuda
  • Gunnar Rätsch
  • Manfred K. Warmuth

We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann diver- gence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: On-line learning with a simple square loss and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponen- tials to preserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the non-diagonal case. We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements. 1 Introduction Most learning algorithms have been developed to learn a vector of parameters from data. However, an increasing number of papers are now dealing with more structured parame- ters. More specifically, when learning a similarity or a distance function among objects, the parameters are defined as a symmetric positive definite matrix that serves as a kernel (e. g. [14, 11, 13]). Learning is typically formulated as a parameter updating procedure to optimize a loss function. The gradient descent update [6] is one of the most commonly used algorithms, but it is not appropriate when the parameters form a positive definite matrix, because the updated parameter is not necessarily positive definite. Xing et al. [14] solved this problem by always correcting the updated matrix to be positive. However no bound has been proven for this update-and-correction approach. In this paper, we introduce the Matrix Exponentiated Gradient update which works as follows: First, the matrix logarithm of the current parameter matrix is computed. Then a step is taken in the direction of the steepest descent. Finally, the parameter matrix is updated to the exponential of the modified log-matrix. Our update preserves symmetry and positive definiteness because the matrix exponential maps any symmetric matrix to a positive definite matrix. Bregman divergences play a central role in the motivation and the analysis of on-line learn- ing algorithms [5]. A learning problem is essentially defined by a loss function, and a di- vergence that measures the discrepancy between parameters. More precisely, the updates are motivated by minimizing the sum of the loss function and the Bregman divergence, where the loss function is multiplied by a positive learning rate. Different divergences lead to radically different updates [6]. For example, the gradient descent is derived from the squared Euclidean distance, and the exponentiated gradient from the Kullback-Leibler di- vergence. We use the von Neumann divergence (also called quantum relative entropy) for measuring the discrepancy between two positive definite matrices [8]. We derive a new Matrix Exponentiated Gradient update from this divergence (which is a Bregman diver- gence for positive definite matrices). Finally we prove relative loss bounds using the von Neumann divergence as a measure of progress. Also the following related key problem has received a lot of attention recently [14, 11, 13]: Find a symmetric positive definite matrix that satisfies a number of symmetric linear inequality constraints. The new DefiniteBoost algorithm greedily chooses the most violated constraint and performs an approximated Bregman projection. In the diagonal case, we recover AdaBoost [9]. We also show how the convergence proof of AdaBoost generalizes to the non-diagonal case. 2 von Neumann Divergence or Quantum Relative Entropy If F is a real convex differentiable function on the parameter domain (symmetric d d positive definite matrices) and f (W): = F(W), then the Bregman divergence between two parameters W and W is defined as F(W, W) = F(W) - F(W) - tr[(W - W)f(W)]. When choosing F(W) = tr(W log W -W), then f(W) = log W and the corresponding Bregman divergence becomes the von Neumann divergence [8]: F(W, W) = tr(W log W - W log W - W + W). (1) In this paper, we are primarily interested in the normalized case (when tr(W) = 1). In this case, the positive symmetric definite matrices are related to density matrices commonly used in Statistical Physics and the divergence simplifies to F(W, W) = tr(W log W - W log W). If W = i ivivi is our notation for the eigenvalue decomposition, then we can rewrite the normalized divergence as ~ ~ F(W, W) = i ln ~ i + i ln j(~ vi vj )2. i i, j So this divergence quantifies the difference in the eigenvalues as well as the eigenvectors. 3 On-line Learning In this section, we present a natural extension of the Exponentiated Gradient (EG) up- date [6] to an update for symmetric positive definite matrices. At the t-th trial, the algorithm receives a symmetric instance matrix Xt Rdd. It then produces a prediction ^ yt = tr(WtXt) based on the algorithm's current symmetric positive definite parameter matrix Wt. Finally it incurs for instance1 a quadratic loss (^ yt - yt)2, 1For the sake of simplicity, we use the simple quadratic loss: L W t (W) = (tr(Xt ) - yt)2. For the general update, the gradient Lt(Wt) is exponentiated in the update (4) and this gradient must be symmetric. Following [5], more general loss functions (based on Bregman divergences) are amenable to our techniques. and updates its parameter matrix Wt. In the update we aim to solve the following problem: Wt+1 = argmin W F(W, Wt) + (tr(WXt) - yt)2, (2) where the convex function F defines the Bregman divergence. Setting the derivative with respect to W to zero, we have f (Wt+1) - f(Wt) + [(tr(Wt+1Xt) - yt)2] = 0. (3) The update rule is derived by solving (3) with respect to Wt+1, but it is not solvable in closed form. A common way to avoid this problem is to approximate tr(Wt+1Xt) by tr(WtXt) [5]. Then, we have the following update: Wt+1 = f -1(f (Wt) - 2(^yt - yt)Xt). In our case, F(W) = tr(W log W - W) and thus f(W) = log W and f-1(W) = exp W. We also augment (2) with the constraint tr(W) = 1, leading to the following Matrix Exponential Gradient (MEG) Update: 1 Wt+1 = exp(log W Z t - 2(^yt - yt)Xt), (4) t where the normalization factor Zt is tr[exp(log Wt - 2(^yt - yt)Xt)]. Note that in the above update, the exponent log Wt - 2(^yt - yt)Xt is an arbitrary symmetric matrix and the matrix exponential converts this matrix back into a symmetric positive definite matrix. A numerically stable version of the MEG update is given in Section 3. 2. 3. 1 Relative Loss Bounds We now begin with the definitions needed for the relative loss bounds. Let S = (X1, y1), .. ., (XT, yT ) denote a sequence of examples, where the instance matrices Xt Rdd are symmetric and the labels yt R. For any symmetric positive semi-definite ma- trix U with tr(U) = 1, define its total loss as LU(S) = T (tr(UX t=1 t ) - yt)2. The total loss of the on-line algorithm is LMEG(S) = T (tr(W t=1 tXt) - yt)2. We prove a bound on the relative loss LMEG(S) -LU(S) that holds for any U. The proof generalizes a sim- ilar bound for the Exponentiated Gradient update (Lemmas 5. 8 and 5. 9 of [6]). The relative loss bound is derived in two steps: Lemma 3. 1 bounds the relative loss for an individual trial and Lemma 3. 2 for a whole sequence (Proofs are given in the full paper). Lemma 3. 1 Let Wt be any symmetric positive definite matrix. Let Xt be any symmetric matrix whose smallest and largest eigenvalues satisfy max - min r. Assume Wt+1 is produced from Wt by the MEG update and let U be any symmetric positive semi-definite matrix. Then for any constants a and b such that 0 a(yt - tr(WtXt))2 - b(yt - tr(UXt))2 (U, Wt) - (U, Wt+1) (5) In the proof, we use the Golden-Thompson inequality [3], i. e. , tr[exp(A + B)] tr[exp(A) exp(B)] for symmetric matrices A and B. We also needed to prove the fol- lowing generalization of Jensen's inequality to matrices: exp(1A + 2(I - A)) exp(1)A + exp(2)(I - A) for finite 1, 2 R and any symmetric matrix A with 0 c 1 1 LMEG(S) 1 + LU(S) + + r2(U, W 2 2 c 1). (6) Proof For the maximum tightness of (5), a should be chosen as a = = 2b/(2 + r2b). Let b = c/r2, and thus a = 2c/(r2(2 + c)). Then (5) is rewritten as 2c (y 2 + c t - tr(WtXt))2 - c(yt - tr(UXt))2 r2((U, Wt) - (U, Wt+1)) Adding the bounds for t = 1, , T, we get 2c L 2 + c MEG(S) - cLU(S) r2((U, W1) - (U, Wt+1)) r2(U, W1), which is equivalent to (6). Assuming LU(S) max and (U, W1) dmax, the bound (6) is tightest when c = r 2dmax/ max. Then we have LMEG(S) - LU(S) r2 maxdmax + r2(U, W 2 1). 3. 2 Numerically stable MEG update The MEG update is numerically unstable when the eigenvalues of Wt are around zero. However we can "unwrap" Wt+1 as follows: 1 t Wt+1 = exp(c (^ y ~ tI + log W1 s Z - 2 - ys)Xs), (7) t s=1 where the constant ~ Zt normalizes the trace of Wt+1 to one. As long as the eigen values of W1 are not too small then the computation of log Wt is stable. Note that the update is inde- pendent of the choice of ct R. We incrementally maintain an eigenvalue decomposition of the matrix in the exponent (O(n3) per iteration): t VttVTt = ctI + log W1 - 2 (^ys - ys)Xs), s=1 where the constant ct is chosen so that the maximum eigenvalue of the above is zero. Now Wt+1 = Vt exp(t)VTt /tr(exp(t)). 4 Bregman Projection and DefiniteBoost In this section, we address the following Bregman projection problem2 W = argmin W F (W, W1), tr(W) = 1, tr(WCj ) 0, for j = 1, .. ., n, (8) where the symmetric positive definite matrix W1 of trace one is the initial parameter ma- trix, and C1, .. ., Cn are arbitrary symmetric matrices. Prior knowledge about W is en- coded in the constraints, and the matrix closest to W1 is chosen among the matrices satis- fying all constraints. Tsuda and Noble [13] employed this approach for learning a kernel matrix among graph nodes, and this method can be potentially applied to learn a kernel matrix in other settings (e. g. [14, 11]). The problem (8) is a projection of W1 to the intersection of convex regions defined by the constraints. It is well known that the Bregman projection into the intersection of convex regions can be solved by sequential projections to each region [1]. In the original papers only asymptotic convergence was shown. More recently a connection [4, 7] was made to the AdaBoost algorithm which has an improved convergence analysis [2, 9]. We generalize the latter algorithm and its analysis to symmetric positive definite matrices and call the new algorithm DefiniteBoost. As in the original setting, only approximate projections (Figure 1) are required to show fast convergence. 2Note that if is large then the on-line update (2) becomes a Bregman projection subject to a single equality constraint tr(WXt) = yt. Approximate Figure 1: In (exact) Bregman projections, the intersection Projection of convex sets (i. e. , two lines here) is found by iterating pro- jections to each set. We project only approximately, so the projected point does not satisfy the current constraint. Nev- ertheless, global convergence to the optimal solution is guar- anteed via our proofs. Exact Projection Before presenting the algorithm, let us derive the dual problem of (8) by means of Lagrange multipliers, n = argmin log trexp(logW1- jCj), j 0. (9) j=1 See [13] for a detailed derivation of the dual problem. When (8) is feasible, the opti- mal solution is described as W = 1 exp(log W C ) = Z( 1 j ), where Z( ) - nj=1 j tr[exp(log W1 - n C j=1 j j )]. 4. 1 Exact Bregman Projections First, let us present the exact Bregman projection algorithm to solve (8). We start from the initial parameter W1. At the t-th step, the most unsatisfied constraint is chosen, jt = argmaxj=1, ,n tr(WtCj). Let us use Ct as the short notation for Cj. Then, the t following Bregman projection with respect to the chosen constraint is solved. Wt+1 = argmin (W, W W t), tr(W) = 1, tr(WCt) 0. (10) By means of a Lagrange multiplier, the dual problem is described as t = argmin tr[exp(log Wt - Ct)], 0. (11) Using the solution of the dual problem, Wt is updated as 1 Wt+1 = exp(log W Z t - tCt) (12) t(t) where the normalization factor is Zt(t) = tr[exp(log Wt -tCt)]. Note that we can use the same numerically stable update as in the previous section. 4. 2 Approximate Bregman Projections The solution of (11) cannot be obtained in closed form. However, one can use the following approximate solution: 1 1 + r t/max t t = log, (13) max t - min t 1 + rt/min t when the eigenvalues of Ct lie in the interval [min t, max t ] and rt = tr(WtCt). Since the most unsatisfied constraint is chosen, rt 0 and thus t 0. Although the projection is done only approximately, 3 the convergence of the dual objective (9) can be shown using the following upper bound. 3The approximate Bregman projection (with t as in (13) can also be motivated as an online algorithm based on an entropic loss and learning rate one (following Section 3 and [4]). Theorem 4. 1 The dual objective (9) is bounded as n T tr explogW1- jCj (rt) (14) j=1 t=1 max min t -t rt max max t -min t rt t -min t where (rt) = 1 - 1. max - t min t The dual objective is monotonically decreasing, because (rt) 1. Also, since rt corre- sponds to the maximum value among all constraint violations {rj}nj=1, we have (rt) = 1 only if rt = 0. Thus the dual objective continues to decrease until all constraints are satisfied. 4. 3 Relation to Boosting When all matrices are diagonal, the DefiniteBoost degenerates to AdaBoost [9]: Let {xi, yi}di=1 be the training samples, where xi Rm and yi {-1, 1}. Let h1(x), .. ., hn(x) [-1, 1] be the weak hypotheses. For the j-th hypothesis hj(x), let us define Cj = diag(y1hj(x1), .. ., ydhj(xd)). Since |yhj(x)| 1, max/min t = 1 for any t. Setting W1 = I/d, the dual objective (14) is rewritten as d n 1 exp d -yi jhj(xi), i=1 j=1 which is equivalent to the exponential loss function used in AdaBoost. Since Cj and W1 are diagonal, the matrix Wt stays diagonal after the update. If wti = [Wt]ii, the updating formula (12) becomes the AdaBoost update: wt+1, i = wti exp(-tyiht(xi))/Zt(t). The approximate solution of t (13) is described as t = 1 log 1+rt, where r 2 1-r t is the weighted t training error of the t-th hypothesis, i. e. rt = d w i=1 tiyiht(xi). 5 Experiments on Learning Kernels In this section, our technique is applied to learning a kernel matrix from a set of distance measurements. This application is not on-line per se, but it shows nevertheless that the theoretical bounds can be reasonably tight on natural data. When K is a d d kernel matrix among d objects, then the Kij characterizes the similarity between objects i and j. In the feature space, Kij corresponds to the inner product between object i and j, and thus the Euclidean distance can be computed from the entries of the kernel matrix [10]. In some cases, the kernel matrix is not given explicitly, but only a set of distance measurements is available. The data are represented either as (i) quantitative distance values (e. g. , the distance between i and j is 0. 75), or (ii) qualitative evaluations (e. g. , the distance between i and j is small) [14, 13]. Our task is to obtain a positive definite kernel matrix which fits well to the given distance data. On-line kernel learning In the first experiment, we consider the on-line learning scenario in which only one distance example is shown to the learner at each time step. The distance example at time t is described as {at, bt, yt}, which indicates that the squared Euclidean distance between objects at and bt is yt. Let us define a time-developing sequence of kernel matrices as {Wt}Tt=1, and the corresponding points in the feature space as {xti}di=1 (i. e. [Wt]ab = xtaxtb). Then, the total loss incurred by this sequence is T T 2 2 xta = (tr(W t - xtbt - yt tXt) - yt)2, t=1 t=1 1. 8 0. 45 1. 6 0. 4 1. 4 0. 35 1. 2 0. 3 1 0. 25 0. 8 Total Loss 0. 2 0. 6 Classification Error 0. 4 0. 15 0. 2 0. 1 0 0. 05 0 0. 5 1 1. 5 2 2. 5 3 0 0. 5 1 1. 5 2 2. 5 3 5 Iterations 5 Iterations x 10 x 10 Figure 2: Numerical results of on-line learning. (Left) total loss against the number of iterations. The dashed line shows the loss bound. (Right) classification error of the nearest neighbor classifier using the learned kernel. The dashed line shows the error by the target kernel. where Xt is a symmetric matrix whose (at, at) and (bt, bt) elements are 0. 5, (at, bt) and (bt, at) elements are -0. 5, and all the other elements are zero. We consider a controlled experiment in which the distance examples are created from a known target kernel matrix. We used a 52 52 kernel matrix among gyrB proteins of bacteria (d = 52). This data contains three bacteria species (see [12] for details). Each distance example is created by randomly choosing one element of the target kernel. The initial parameter was set as W1 = I/d. When the comparison matrix U is set to the target matrix, LU (S) = 0 and max = 0, because all the distance examples are derived from the target matrix. Therefore we choose learning rate = 2, which minimizes the relative loss bound of Lemma 3. 2. The total loss of the kernel matrix sequence obtained by the matrix exponential update is shown in Figure 2 (left). In the plot, we have also shown the relative loss bound. The bound seems to give a reasonably tight performance guarantee--it is about twice the actual total loss. To evaluate the learned kernel matrix, the prediction accuracy of bacteria species by the nearest neighbor classifier is calculated (Figure 2, right), where the 52 proteins are randomly divided into 50% training and 50% testing data. The value shown in the plot is the test error averaged over 10 different divisions. It took a large number of iterations ( 2 105) for the error rate to converge to the level of the target kernel. In practice one can often increase the learning rate for faster convergence, but here we chose the small rate suggested by our analysis to check the tightness of the bound. Kernel learning by Bregman projection Next, let us consider a batch learning sce- nario where we have a set of qualitative distance evaluations (i. e. inequality constraints). Given n pairs of similar objects {aj, bj}nj=1, the inequality constraints are constructed as xaj - xbj, j = 1, .. ., n, where is a predetermined constant. If Xj is de- fined as in the previous section and Cj = Xj - I, the inequalities are then rewritten as tr(WCj) 0, j = 1, .. ., n. The largest and smallest eigenvalues of any Cj are 1 - and -, respectively. As in the previous section, distance examples are generated from the target kernel matrix between gyrB proteins. Setting = 0. 2/d, we collected all object pairs whose distance in the feature space is less than to yield 980 inequalities (n = 980). Figure 3 (left) shows the convergence of the dual objective function as proven in Theo- rem 4. 1. The convergence was much faster than the previous experiment, because, in the batch setting, one can choose the most unsatisfied constraint, and optimize the step size as well. Figure 3 (right) shows the classification error of the nearest neighbor classifier. As opposed to the previous experiment, the error rate is higher than that of the target kernel matrix, because substantial amount of information is lost by the conversion to inequality constraints. 55 0. 8 50 0. 7 45 0. 6 40 0. 5 35 0. 4 Dual Obj 30 0. 3 Classification Error 25 0. 2 20 0. 1 15 0 0 50 100 150 200 250 300 0 50 100 150 200 250 300 Iterations Iterations Figure 3: Numerical results of Bregman projection. (Left) convergence of the dual objective function. (Right) classification error of the nearest neighbor classifier using the learned kernel. 6 Conclusion We motivated and analyzed a new update for symmetric positive matrices using the von Neumann divergence. We showed that the standard bounds for on-line learning and Boost- ing generalize to the case when the parameters are a symmetric positive definite matrix (of trace one) instead of a probability vector. As in quantum physics, the eigenvalues act as probabilities. Acknowledgment We would like to thank B. Sch olkopf, M. Kawanabe, J. Liao and W. S. Noble for fruitful discussions. M. W. was supported by NSF grant CCR 9821087 and UC Discovery grant LSIT02-10110. K. T. and G. R. gratefully acknowledge partial support from the PASCAL Network of Excellence (EU #506778). Part of this work was done while all three authors were visiting the National ICT Australia in Canberra.

NeurIPS Conference 2003 Conference Paper

Boosting versus Covering

  • Kohei Hatano
  • Manfred K. Warmuth

We investigate improvements of AdaBoost that can exploit the fact that the weak hypotheses are one-sided, i. e. either all its positive (or negative) predictions are correct. In particular, for any set of m labeled examples consistent with a disjunction of k literals (which are one-sided in this case), AdaBoost constructs a consistent hypothesis by using O(k2 log m) iterations. On the other hand, a greedy set covering algorithm finds a consistent hypothesis of size O(k log m). Our primary question is whether there is a simple boosting algorithm that performs as well as the greedy set covering. We first show that InfoBoost, a modification of AdaBoost pro- posed by Aslam for a different purpose, does perform as well as the greedy set covering algorithm. We then show that AdaBoost requires Ω(k2 log m) iterations for learning k-literal disjunctions. We achieve this with an adversary construction and as well as in simple experiments based on artificial data. Further we give a vari- ant called SemiBoost that can handle the degenerate case when the given examples all have the same label. We conclude by showing that SemiBoost can be used to produce small conjunctions as well.

JMLR Journal 2003 Journal Article

Path Kernels and Multiplicative Updates

  • Eiji Takimoto
  • Manfred K. Warmuth

Kernels are typically applied to linear algorithms whose weight vector is a linear combination of the feature vectors of the examples. On-line versions of these algorithms are sometimes called "additive updates" because they add a multiple of the last feature vector to the current weight vector. In this paper we have found a way to use special convolution kernels to efficiently implement "multiplicative" updates. The kernels are defined by a directed graph. Each edge contributes an input. The inputs along a path form a product feature and all such products build the feature vector associated with the inputs. We also have a set of probabilities on the edges so that the outflow from each vertex is one. We then discuss multiplicative updates on these graphs where the prediction is essentially a kernel computation and the update contributes a factor to each edge. After adding the factors to the edges, the total outflow out of each vertex is not one any more. However some clever algorithms re-normalize the weights on the paths so that the total outflow out of each vertex is one again. Finally, we show that if the digraph is built from a regular expressions, then this can be used for speeding up the kernel and re-normalization computations. We reformulate a large number of multiplicative update algorithms using path kernels and characterize the applicability of our method. The examples include efficient algorithms for learning disjunctions and a recent algorithm that predicts as well as the best pruning of a series parallel digraphs. [abs] [ pdf ][ ps.gz ][ ps ]

NeurIPS Conference 2002 Conference Paper

Adaptive Caching by Refetching

  • Robert Gramacy
  • Manfred K. Warmuth
  • Scott Brandt
  • Ismail Ari

We are constructing caching policies that have 13-20% lower miss rates than the best of twelve baseline policies over a large variety of request streams. This represents an improvement of 49–63% over Least Recently Used, the most commonly implemented policy. We achieve this not by designing a specific new policy but by using on-line Machine Learning algorithms to dynamically shift between the standard policies based on their observed miss rates. A thorough experimental evaluation of our techniques is given, as well as a discussion of what makes caching an interesting on-line learning problem.

TCS Journal 2002 Journal Article

Predicting nearly as well as the best pruning of a planar decision graph

  • Eiji Takimoto
  • Manfred K. Warmuth

We design efficient on-line algorithms that predict nearly as well as the best pruning of a planar decision graph. We assume that the graph has no cycles. As in the previous work on decision trees, we implicitly maintain one weight for each of the prunings (exponentially many). The method works for a large class of algorithms that update its weights multiplicatively. It can also be used to design algorithms that predict nearly as well as the best convex combination of prunings.

JMLR Journal 2002 Journal Article

Tracking a Small Set of Experts by Mixing Past Posteriors

  • Olivier Bousquet
  • Manfred K. Warmuth

In this paper, we examine on-line learning problems in which the target concept is allowed to change over time. In each trial a master algorithm receives predictions from a large set of n experts. Its goal is to predict almost as well as the best sequence of such experts chosen off-line by partitioning the training sequence into k +1 sections and then choosing the best expert for each section. We build on methods developed by Herbster and Warmuth and consider an open problem posed by Freund where the experts in the best partition are from a small pool of size m. Since k >> m, the best expert shifts back and forth between the experts of the small pool. We propose algorithms that solve this open problem by mixing the past posteriors maintained by the master algorithm. We relate the number of bits needed for encoding the best partition to the loss bounds of the algorithms. Instead of paying log n for choosing the best expert in each section we first pay log ( n choose m ) bits in the bounds for identifying the pool of m experts and then log m bits per new section. In the bounds we also pay twice for encoding the boundaries of the sections.

NeurIPS Conference 2001 Conference Paper

Active Learning in the Drug Discovery Process

  • Manfred K. Warmuth
  • Gunnar Rätsch
  • Michael Mathieson
  • Jun Liao
  • Christian Lemmen

We investigate the following data mining problem from Computational Chemistry: From a large data set of compounds, find those that bind to a target molecule in as few iterations of biological testing as possible. In each iteration a comparatively small batch of compounds is screened for binding to the target. We apply active learning techniques for selecting the successive batches. One selection strategy picks unlabeled examples closest to the maximum margin hyperplane. Another produces many weight vectors by running perceptrons over multiple permutations of the data. Each weight vector prediction and we pick the unlabeled examples for which votes with its the prediction is most evenly split between. For a third selec- tion strategy note that each unlabeled example bisects the version space of consistent weight vectors. We estimate the volume on both sides of the split by bouncing a billiard through the version space and select un- labeled examples that cause the most even split of the version space. We demonstrate that on two data sets provided by DuPont Pharmaceu- ticals that all three selection strategies perform comparably well and are much better than selecting random batches for testing. and

NeurIPS Conference 2001 Conference Paper

On the Convergence of Leveraging

  • Gunnar Rätsch
  • Sebastian Mika
  • Manfred K. Warmuth

We give an unified convergence analysis of ensemble learning meth- ods including e. g. AdaBoost, Logistic Regression and the Least-Square- Boost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotical convergence results for all these methods. Our analysis includes ` 1 -norm regularized cost functions leading to a clean and general way to regularize ensemble learning. 1 Introduction We show convergence rates of ensemble learning methods such as AdaBoost [10], Logistic Regression (LR) [11, 5] and the Least-Square (LS) regression algorithm called LS-Boost [12]. These algorithms have in common that they iteratively call a base learning algorithm L (also called weak learner) on a weighted training sample. The base learner is expected to return in each iteration t a hypothesis ^ h t from some hypothesis set of weak hypotheses H that has small weighted training error. This is the weighted number of false predictions in classification and weighted estimation error in regression. These hypotheses are then linearly combined to form the final hypothesis f ^ (x) =

JMLR Journal 2001 Journal Article

Tracking the Best Linear Predictor

  • Mark Herbster
  • Manfred K. Warmuth

In most on-line learning research the total on-line loss of the algorithm is compared to the total loss of the best off-line predictor u from a comparison class of predictors. We call such bounds static bounds. The interesting feature of these bounds is that they hold for an arbitrary sequence of examples. Recently some work has been done where the predictor u t at each trial t is allowed to change with time, and the total on-line loss of the algorithm is compared to the sum of the losses of u t at each trial plus the total "cost" for shifting to successive predictors. This is to model situations in which the examples change over time, and different predictors from the comparison class are best for different segments of the sequence of examples. We call such bounds shifting bounds. They hold for arbitrary sequences of examples and arbitrary sequences of predictors. Naturally shifting bounds are much harder to prove. The only known bounds are for the case when the comparison class consists of a sequences of experts or boolean disjunctions. In this paper we develop the methodology for lifting known static bounds to the shifting case. In particular we obtain bounds when the comparison class consists of linear neurons (linear combinations of experts). Our essential technique is to project the hypothesis of the static algorithm at the end of each trial into a suitably chosen convex region. This keeps the hypothesis of the algorithm well-behaved and the static bounds can be converted to shifting bounds.

NeurIPS Conference 1998 Conference Paper

Batch and On-Line Parameter Estimation of Gaussian Mixtures Based on the Joint Entropy

  • Yoram Singer
  • Manfred K. Warmuth

We describe a new iterative method for parameter estimation of Gaus(cid: 173) sian mixtures. The new method is based on a framework developed by Kivinen and Warmuth for supervised on-line learning. In contrast to gra(cid: 173) dient descent and EM, which estimate the mixture's covariance matrices, the proposed method estimates the inverses of the covariance matrices. Furthennore, the new parameter estimation procedure can be applied in both on-line and batch settings. We show experimentally that it is typi(cid: 173) cally faster than EM, and usually requires about half as many iterations as EM.

NeurIPS Conference 1998 Conference Paper

Linear Hinge Loss and Average Margin

  • Claudio Gentile
  • Manfred K. Warmuth

We describe a unifying method for proving relative loss bounds for on(cid: 173) line linear threshold classification algorithms, such as the Perceptron and the Winnow algorithms. For classification problems the discrete loss is used, i. e. , the total number of prediction mistakes. We introduce a con(cid: 173) tinuous loss function, called the "linear hinge loss", that can be employed to derive the updates of the algorithms. We first prove bounds w. r. t. the linear hinge loss and then convert them to the discrete loss. We intro(cid: 173) duce a notion of "average margin" of a set of examples. We show how relative loss bounds based on the linear hinge loss can be converted to relative loss bounds i. t. o. the discrete loss using the average margin.

I&C Journal 1997 Journal Article

Exponentiated Gradient versus Gradient Descent for Linear Predictors

  • Jyrki Kivinen
  • Manfred K. Warmuth

We consider two algorithms for on-line prediction based on a linear model. The algorithms are the well-known gradient descent (GD) algorithm and a new algorithm, which we call EG±. They both maintain a weight vector using simple updates. For the GD algorithm, the update is based on subtracting the gradient of the squared error made on a prediction. The EG±algorithm uses the components of the gradient in the exponents of factors that are used in updating the weight vector multiplicatively. We present worst-case loss bounds for EG±and compare them to previously known bounds for the GD algorithm. The bounds suggest that the losses of the algorithms are in general incomparable, but EG±has a much smaller loss if only few components of the input are relevant for the predictions. We have performed experiments which show that our worst-case upper bounds are quite tight already on simple artificial data.

NeurIPS Conference 1997 Conference Paper

Relative Loss Bounds for Multidimensional Regression Problems

  • Jyrki Kivinen
  • Manfred K. Warmuth

We study on-line generalized linear regression with multidimensional outputs, i. e. , neural networks with multiple output nodes but no hidden nodes. We allow at the final layer transfer functions such as the soft(cid: 173) max function that need to consider the linear activations to all the output neurons. We use distance functions of a certain kind in two completely independent roles in deriving and analyzing on-line learning algorithms for such tasks. We use one distance function to define a matching loss function for the (possibly multidimensional) transfer function, which al(cid: 173) lows us to generalize earlier results from one-dimensional to multidimen(cid: 173) sional outputs. We use another distance function as a tool for measuring progress made by the on-line updates. This shows how previously stud(cid: 173) ied algorithms such as gradient descent and exponentiated gradient fit into a common framework. We evaluate the performance of the algo(cid: 173) rithms using relative loss bounds that compare the loss of the on-line algoritm to the best off-line predictor from the relevant model class, thus completely eliminating probabilistic assumptions about the data.

NeurIPS Conference 1996 Conference Paper

Training Algorithms for Hidden Markov Models using Entropy Based Distance Functions

  • Yoram Singer
  • Manfred K. Warmuth

We present new algorithms for parameter estimation of HMMs. By adapting a framework used for supervised learning, we construct iterative algorithms that maximize the likelihood of the observations while also attempting to stay "close" to the current estimated parameters. We use a bound on the relative entropy between the two HMMs as a distance mea(cid: 173) sure between them. The result is new iterative training algorithms which are similar to the EM (Baum-Welch) algorithm for training HMMs. The proposed algorithms are composed of a step similar to the expectation step of Baum-Welch and a new update of the parameters which replaces the maximization (re-estimation) step. The algorithm takes only negligi(cid: 173) bly more time per iteration and an approximated version uses the same expectation step as Baum-Welch. We evaluate experimentally the new algorithms on synthetic and natural speech pronunciation data. For sparse models, i. e. models with relatively small number of non-zero parameters, the proposed algorithms require significantly fewer iterations. 1 Preliminaries We use the numbers from 0 to N to name the states of an HMM. State 0 is a special initial state and state N is a special final state. Any state sequence, denoted by s, starts with the initial state but never returns to it and ends in the final state. Observations symbols are also numbers in {I, .. ., M} and observation sequences are denoted by x. A discrete output hidden Markov model (HMM) is parameterized by two matrices A and B. The first matrix is of dimension [N, N] and ai, j (0: 5: i: 5: N - 1, 1: 5: j: 5: N) denotes the probability of moving from state i to state j. The second matrix is of dimension [N + 1, M] and bi, k is the probability of outputting symbol k at state i. The set of parameters of an HMM is denoted by 0 = (A, B). (The initial state distribution vector is represented by the first row of A. ) An HMM is a probabilistic generator of sequences. It starts in the initial state O. It then iteratively does the following until the final state is reached. If i is the current state then a next state j is chosen according to the transition probabilities out of the current state (row i of matrix A). After arriving at state j a symbol is output according to the output probabilities of that state (row j of matrix B). Let P(x, slO) denote the probability (likelihood) that an HMM 0 generates the observation sequence x on the path s starting at state 0 and ending at state N: P(x, sllsl = Ixl + 1, So = 0, slSI = N, 0) ~ I1~~ll as. _t, s. bs. ,x •. For the sake of brevity we omit the conditions on s and x. Throughout the paper we assume that the HMMs are absorbing, that is from every state there is a path to the final state with a 642 Y. Singer and M. K. Warmuth non-zero probability. Similar parameter estimation algorithms can be derived for ergodic HMMs. Absorbing HMMs induce a probability over all state-observation sequences, i. e. Ex, s P(x, s18) = 1. The likelihood of an observation sequence x is obtained by summing over all possible hidden paths (state sequences), P(xI8) = Es P(x, sI8). To obtain the likelihood for a set X of observations we simply mUltiply the likelihood values for the individual sequences. We seek an HMM 8 that maximizes the likelihood for a given set of observations X, or equivalently, maximizes the log-likelihood, LL(XI8) = r: h EXEX In P(xI8). To simplify our notation we denote the generic parameter in 8 by Oi, where i ranges from 1 to the total number of parameters in A and B (There might be less if some are clamped to zero). We denote the total number of parameters of 8 by I and leave the (fixed) correspondence between the Oi and the entries of A and B unspecified. The indices are naturally partitioned into classes corresponding to the rows of the matrices. We denote by [i] the class of parameters to which Oi belongs and by O[i) the vector of all OJ S. t. j E [i]. If j E [i] then both Oi and OJ are parameters from the same row of one of the two matrices. Whenever it is clear from the context, we will use [i] to denote both a class of parameters and the row number (i. e. state) associated with the class. We now can rewrite P(x, s18) as nf=l O~'(X, S), where ni(x, s) is the number of times parameter i is used along the path s with observation sequence x. (Note that this value does not depend on the actual parameters 8. ) We next compute partial derivatives ofthe likelihood and the log-likelihood using this notation.

NeurIPS Conference 1995 Conference Paper

Exponentially many local minima for single neurons

  • Peter Auer
  • Mark Herbster
  • Manfred K. Warmuth

We show that for a single neuron with the logistic function as the transfer function the number of local minima of the error function based on the square loss can grow exponentially in the dimension.

FOCS Conference 1995 Conference Paper

Tracking the Best Disjunction

  • Peter Auer
  • Manfred K. Warmuth

N. Littlestone developed a simple deterministic on-line learning algorithm for learning k-literal disjunctions. This algorithm (called Winnow) keeps one weight for each of the n variables and does multiplicative updates to its weights. We develop a randomized version of Winnow and prove bounds for an adaptation of the algorithm for the case when the disjunction may change over time. In this case a possible target disjunction schedule T is a sequence of disjunctions (one per trial) and the shift size is the total number of literals that are added/removed from the disjunctions as one progresses through the sequence. We develop an algorithm that predicts nearly as well as the best disjunction schedule for an arbitrary sequence of examples. This algorithm that allows us to track the predictions of the best disjunction is hardly more complex than the original version. However the amortized analysis needed for obtaining worst-case mistake bounds requires new techniques. In some cases our lower bounds show that the upper bounds of our algorithm have the right constant in front of the leading term in the mistake bound and almost the right constant in front of the second leading term. By combining the tracking capability with existing applications of Winnow we are able to enhance these applications to the shifting case as well.

NeurIPS Conference 1995 Conference Paper

Worst-case Loss Bounds for Single Neurons

  • David Helmbold
  • Jyrki Kivinen
  • Manfred K. Warmuth

We analyze and compare the well-known Gradient Descent algo(cid: 173) rithm and a new algorithm, called the Exponentiated Gradient algorithm, for training a single neuron with an arbitrary transfer function. Both algorithms are easily generalized to larger neural networks, and the generalization of Gradient Descent is the stan(cid: 173) dard back-propagation algorithm. In this paper we prove worst(cid: 173) case loss bounds for both algorithms in the single neuron case. Since local minima make it difficult to prove worst-case bounds for gradient-based algorithms, we must use a loss function that prevents the formation of spurious local minima. We define such a matching loss function for any strictly increasing differentiable transfer function and prove worst-case loss bound for any such transfer function and its corresponding matching loss. For exam(cid: 173) ple, the matching loss for the identity function is the square loss and the matching loss for the logistic sigmoid is the entropic loss. The different structure of the bounds for the two algorithms indi(cid: 173) cates that the new algorithm out-performs Gradient Descent when the inputs contain a large number of irrelevant components. 310 D. P. HELMBOLD, J. KIVINEN, M. K. WARMUTH

I&C Journal 1991 Journal Article

Equivalence of models for polynomial learnability

  • David Haussler
  • Michael Kearns
  • Nick Littlestone
  • Manfred K. Warmuth

In this paper we consider several variants of Valiant's learnability model that have appeared in the literature. We give conditions under which these models are equivalent in terms of the polynomially learnable concept classes they define. These equivalences allow comparisons of most of the existing theorems in Valiant-style learnability and show that several simplifying assumptions on polynomial learning algorithms can be made without loss of generality. We also give a useful reduction of learning problems to the problem of finding consistent hypotheses, and give comparisons and equivalences between Valiant's model and the prediction learning models of Haussler, Littlestone, and Warmuth (in “29th Annual IEEE Symposium on Foundations of Computer Science, ” 1988).

I&C Journal 1989 Journal Article

Parallel approximation algorithms for bin packing

  • Richard J. Anderson
  • Ernst W. Mayr
  • Manfred K. Warmuth

We study the parallel complexity of polynomial heuristics for the bin packing problem. We show that some well-known (and simple) methods like first-fit-decreasing areP-complete, and it is hence very unlikely that they can be efficiently parallelized. On the other hand, we exhibit an optimalNCalgorithm that achieves the same performance bound as does FFD. Finally, we discuss parallelization of polynomial approximation algorithms for bin packing based on discretization.

FOCS Conference 1989 Conference Paper

The Weighted Majority Algorithm

  • Nick Littlestone
  • Manfred K. Warmuth

The construction of prediction algorithms in a situation in which a learner faces a sequence of trials, with a prediction to be made in each, and the goal of the learner is to make few mistakes is studied. It is assumed that the learner has reason to believe that one of some pool of known algorithms will perform well but does not know which one. A simple and effective method, based on weighted voting, is introduced for constructing a compound algorithm in such a circumstance. It is called the weighted majority algorithm and is shown to be robust with respect to errors in the data. Various versions of the weighted majority algorithm are discussed, and error bounds for them that are closely related to the error bounds of the best algorithms of the pool are proved. >

FOCS Conference 1988 Conference Paper

Predicting {0, 1}-Functions on Randomly Drawn Points (Extended Abstract)

  • David Haussler
  • Nick Littlestone
  • Manfred K. Warmuth

The authors consider the problem of predicting (0, 1)-valued functions on R/sup n/ and smaller domains, based on their values on randomly drawn points. Their model is related to L. G. Valiant's learnability model (1984), but does not require the hypotheses used for prediction to be represented in any specified form. The authors first disregard computational complexity and show how to construct prediction strategies that are optimal to within a constant factor for any reasonable class F of target functions. These prediction strategies use the 1-inclusion graph structure from N. Alon et al. 's work on geometric range queries (1987) to minimize the probability of incorrect prediction. They then turn to computationally efficient algorithms. For indicator functions of axis-parallel rectangles and halfspaces in R/sup n/, they demonstrate how their techniques can be applied to construct computational efficient prediction strategies that are optimal to within a constant factor. They compare the general performance of prediction strategies derived by their method to those derived from existing methods in Valiant's learnability theory. >

TCS Journal 1986 Journal Article

Manipulating derivation forests by scheduling techniques

  • Jakob Gonczarowski
  • Manfred K. Warmuth

This paper continues the investigation into k-parallel rewritting begun in (Gonczarowski/Shamir, 1985) and (Gonczarowski/Warmuth, 1985). This rewritting mechanism is a generalization of context-free rewriting; instead of applying a single production (or, alternatively, arbitrarily many productions) at each derivation step, exactly k productions are applied. In the works mentioned above, polynomial dynamic programming algorithms were presented which require constant k and propagating grammars. We solve the various membership problems left open in (Gonczarowski/Warmuth, 1985), for arbitrary grammars, using Scheduling Theory. We develop various kinds of pumping, obtaining bounds on the sizes of k-derivations and k-derivation forests. A polynomial dynamic programming membership algorithm is presented for arbitrary (i. e. , possibly nonpropagating) grammars, for fixed k. If k is a variable of the problem, then membership is in NP and, hence, by (Gonczarowski/Warmuth, 1985), NP-complete. For unary alphabets, the latter problem is polynomial. Similarly, membership is polynomial in the size of k if only k is variable.

TCS Journal 1985 Journal Article

Applications of scheduling theory to formal language theory

  • Jakob Gonczarowski
  • Manfred K. Warmuth

Context-free grammars are extended to the case where it is required that at each derivation step, a fixed number k of nonterminals must be rewritten in parallel. This way of rewriting constitutes a ‘missing link’ between context-free rewriting, where only one nonterminal is rewritten in each step (k=1), and E0L rewriting, where always all nonterminals are rewritten. We approach the study of these families by investigating their computational complexity. In both the E0L and CF case, as well as for the case k=2, simple dynamic programming membership algorithms exist (see [13, 22, 29]). We solve the general problem using results from Scheduling Theory. Rewriting k symbols at each derivation step corresponds to scheduling the corresponding derivation forest on k processors. Using Scheduling Theory techniques, we present dynamic programming membership algorithms that run in polynomial time, for constant k. On the other hand, it is shown that membership is NP-hard if k is a variable of the problem, even when the grammar is fixed. An analogous NP-hardness result is shown for the case where the k symbols to be rewritten are required to be adjacent.