Arrow Research search

Author name cluster

Koby Crammer

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

45 papers
2 author rows

Possible papers

45

ICLR Conference 2022 Conference Paper

Weighted Training for Cross-Task Learning

  • Shuxiao Chen
  • Koby Crammer
  • Hangfeng He 0001
  • Dan Roth 0001
  • Weijie J. Su

In this paper, we introduce Target-Aware Weighted Training (TAWT), a weighted training algorithm for cross-task learning based on minimizing a representation-based task distance between the source and target tasks. We show that TAWT is easy to implement, is computationally efficient, requires little hyperparameter tuning, and enjoys non-asymptotic learning-theoretic guarantees. The effectiveness of TAWT is corroborated through extensive experiments with BERT on four sequence tagging tasks in natural language processing (NLP), including part-of-speech (PoS) tagging, chunking, predicate detection, and named entity recognition (NER). As a byproduct, the proposed representation-based task distance allows one to reason in a theoretically principled way about several critical aspects of cross-task learning, such as the choice of the source data and the impact of fine-tuning.

NeurIPS Conference 2018 Conference Paper

Efficient Loss-Based Decoding on Graphs for Extreme Classification

  • Itay Evron
  • Edward Moroshko
  • Koby Crammer

In extreme classification problems, learning algorithms are required to map instances to labels from an extremely large label set. We build on a recent extreme classification framework with logarithmic time and space (LTLS), and on a general approach for error correcting output coding (ECOC) with loss-based decoding, and introduce a flexible and efficient approach accompanied by theoretical bounds. Our framework employs output codes induced by graphs, for which we show how to perform efficient loss-based decoding to potentially improve accuracy. In addition, our framework offers a tradeoff between accuracy, model size and prediction time. We show how to find the sweet spot of this tradeoff using only the training data. Our experimental study demonstrates the validity of our assumptions and claims, and shows that our method is competitive with state-of-the-art algorithms.
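
As a rough illustration of the loss-based decoding idea, the sketch below decodes with a hinge loss over an arbitrary small code matrix; the paper's actual contribution, efficient decoding over graph-induced codes with logarithmic time and space, is not reproduced here.

```python
import numpy as np

def loss_based_decode(margins, code_matrix):
    """ECOC loss-based decoding: choose the class whose codeword incurs
    the smallest total hinge loss against the binary learners' margins.

    margins: (L,) real-valued outputs of L binary learners.
    code_matrix: (num_classes, L) codewords with entries in {-1, +1}.
    """
    losses = np.maximum(0.0, 1.0 - code_matrix * margins).sum(axis=1)
    return int(np.argmin(losses))
```

With a code matrix of three classes over three binary learners, the decoder picks the row whose signs best agree (with margin) with the learners' outputs.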

NeurIPS Conference 2017 Conference Paper

Rotting Bandits

  • Nir Levine
  • Koby Crammer
  • Shie Mannor

The Multi-Armed Bandits (MAB) framework highlights the trade-off between acquiring new knowledge (Exploration) and leveraging available knowledge (Exploitation). In the classical MAB problem, a decision maker must choose an arm at each time step, upon which she receives a reward. The decision maker's objective is to maximize her cumulative expected reward over the time horizon. The MAB problem has been studied extensively, specifically under the assumption of the arms' rewards distributions being stationary, or quasi-stationary, over time. We consider a variant of the MAB framework, which we termed Rotting Bandits, where each arm's expected reward decays as a function of the number of times it has been pulled. We are motivated by many real-world scenarios such as online advertising, content recommendation, crowdsourcing, and more. We present algorithms, accompanied by simulations, and derive theoretical guarantees.
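
A toy illustration of the rotting setting, using a simple sliding-window greedy heuristic rather than the paper's algorithms: each arm's expected reward decays in its pull count, and the learner tracks the mean of its recently observed rewards. The decay functions, window size, and noise level below are all illustrative assumptions.

```python
import random

def swa_rotting_bandit(decay_fns, horizon, window=10, seed=0):
    """Greedy sliding-window heuristic for rotting bandits: pick the arm
    whose last `window` observed rewards have the highest mean.

    decay_fns: list of functions n -> expected reward after n pulls.
    """
    rng = random.Random(seed)
    pulls = [0] * len(decay_fns)
    recent = [[] for _ in decay_fns]
    total = 0.0
    for t in range(horizon):
        if t < len(decay_fns):
            a = t                               # pull each arm once first
        else:
            a = max(range(len(decay_fns)),
                    key=lambda i: sum(recent[i]) / len(recent[i]))
        reward = decay_fns[a](pulls[a]) + rng.gauss(0, 0.01)  # noisy decayed reward
        pulls[a] += 1
        recent[a].append(reward)
        recent[a] = recent[a][-window:]         # keep only the recent window
        total += reward
    return pulls, total
```

Against one constant arm and one initially-better but decaying arm, the heuristic rides the decaying arm until its recent mean drops below the constant arm's, then switches.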

NeurIPS Conference 2015 Conference Paper

Linear Multi-Resource Allocation with Semi-Bandit Feedback

  • Tor Lattimore
  • Koby Crammer
  • Csaba Szepesvari

We study an idealised sequential resource allocation problem. In each time step the learner chooses an allocation of several resource types between a number of tasks. Assigning more resources to a task increases the probability that it is completed. The problem is challenging because the alignment of the tasks to the resource types is unknown and the feedback is noisy. Our main contribution is the new setting and an algorithm with nearly-optimal regret analysis. Along the way we draw connections to the problem of minimising regret for stochastic linear bandits with heteroscedastic noise. We also present some new results for stochastic linear bandits on the hypercube that significantly outperform existing work, especially in the sparse case.

AAAI Conference 2015 Conference Paper

Outlier-Robust Convex Segmentation

  • Itamar Katz
  • Koby Crammer

We derive a convex optimization problem for the task of segmenting sequential data, which explicitly treats the presence of outliers. We describe two algorithms for solving this problem, one exact and one a novel top-down approach, and we derive consistency results for the case of two segments and no outliers. Robustness to outliers is evaluated on two real-world tasks related to speech segmentation. Our algorithms outperform baseline segmentation algorithms.

JMLR Journal 2015 Journal Article

Second-Order Non-Stationary Online Learning for Regression

  • Edward Moroshko
  • Nina Vaits
  • Koby Crammer

The goal of a learner in standard online learning is to keep the cumulative loss not much larger than that of the best-performing function from some fixed class. Numerous algorithms were shown to bring this gap arbitrarily close to zero, compared with the best function chosen off-line. Nevertheless, many real-world applications, such as adaptive filtering, are non-stationary in nature, and the best prediction function may drift over time. We introduce two novel algorithms for online regression, designed to work well in non-stationary environments. Our first algorithm performs adaptive resets to forget the history, while the second is last-step min-max optimal in the context of drift. We analyze both algorithms in the worst-case regret framework and show that they maintain an average loss close to that of the best slowly changing sequence of linear functions, as long as the cumulative drift is sublinear. In addition, in the stationary case, when no drift occurs, our algorithms suffer logarithmic regret, as do previous algorithms. Our bounds improve over existing ones, and simulations demonstrate the usefulness of these algorithms compared with other state-of-the-art approaches.

ICML Conference 2014 Conference Paper

Concept Drift Detection Through Resampling

  • Maayan Harel
  • Shie Mannor
  • Ran El-Yaniv
  • Koby Crammer

Detecting changes in data-streams is an important part of enhancing learning quality in dynamic environments. We devise a procedure for detecting concept drifts in data-streams that relies on analyzing the empirical loss of learning algorithms. Our method is based on obtaining statistics from the loss distribution by reusing the data multiple times via resampling. We present theoretical guarantees for the proposed procedure based on the stability of the underlying learning algorithms. Experimental results show that the detection method has high recall and precision, and performs well in the presence of noise.
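
The resampling idea can be illustrated with a basic permutation test on window losses. This is a simplification: the paper's procedure obtains richer statistics from the loss distribution by reusing the data multiple times, and its guarantees rest on the stability of the learning algorithm; the sketch below only tests whether the mean loss on a new window is significantly higher than on a reference window.

```python
import random

def drift_pvalue(losses_ref, losses_new, n_resamples=500, seed=0):
    """Permutation test in the spirit of resampling-based drift detection:
    p-value for the mean loss on the new window exceeding the reference."""
    rng = random.Random(seed)
    observed = (sum(losses_new) / len(losses_new)
                - sum(losses_ref) / len(losses_ref))
    pooled = losses_ref + losses_new
    count = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)                      # break any real ordering
        ref = pooled[:len(losses_ref)]
        new = pooled[len(losses_ref):]
        if sum(new) / len(new) - sum(ref) / len(ref) >= observed:
            count += 1
    return (count + 1) / (n_resamples + 1)       # smoothed p-value
```

A large jump in empirical loss yields a tiny p-value (drift flagged), while identical loss windows yield a p-value near one.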

NeurIPS Conference 2014 Conference Paper

Learning Multiple Tasks in Parallel with a Shared Annotator

  • Haim Cohen
  • Koby Crammer

We introduce a new multi-task framework, in which $K$ online learners are sharing a single annotator with limited bandwidth. On each round, each of the $K$ learners receives an input, and makes a prediction about the label of that input. Then, a shared (stochastic) mechanism decides which of the $K$ inputs will be annotated. The learner that receives the feedback (label) may update its prediction rule, and we proceed to the next round. We develop an online algorithm for multi-task binary classification that learns in this setting, and bound its performance in the worst-case setting. Additionally, we show that our algorithm can be used to solve two bandit problems: contextual bandits, and dueling bandits with context, both of which are allowed to decouple exploration and exploitation. An empirical study with OCR data, vowel prediction (VJ project) and document classification shows that our algorithm outperforms other algorithms, one of which uses uniform allocation, and essentially obtains more accuracy for the same annotator labour.

UAI Conference 2014 Conference Paper

Optimal Resource Allocation with Semi-Bandit Feedback

  • Tor Lattimore
  • Koby Crammer
  • Csaba Szepesvári

We study a sequential resource allocation problem involving a fixed number of recurring jobs. At each time-step the manager should distribute available resources among the jobs in order to maximise the expected number of completed jobs. Allocating more resources to a given job increases the probability that it completes, but with a cut-off. Specifically, we assume a linear model where the probability increases linearly until it equals one, after which allocating additional resources is wasteful. We assume the difficulty of each job is unknown and present the first algorithm for this problem and prove upper and lower bounds on its regret. Despite its apparent simplicity, the problem has a rich structure: we show that an appropriate optimistic algorithm can improve its learning speed dramatically beyond the results one normally expects for similar problems as the problem becomes resource-laden.
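
The assumed reward model is easy to state in code. The sketch below is only the environment (the linear-with-cut-off completion probabilities and the semi-bandit feedback), not the optimistic learning algorithm the paper analyzes; the names and interfaces are illustrative.

```python
import random

def completion_prob(theta, allocation):
    """Linear completion model with a cut-off: the probability a job
    completes is min(1, theta * resources); allocating beyond 1/theta
    is wasteful. theta (job difficulty) is unknown to the learner."""
    return min(1.0, theta * allocation)

def run_round(thetas, allocation, rng):
    """Semi-bandit feedback: the manager observes, per job, whether it
    completed this round."""
    return [rng.random() < completion_prob(th, a)
            for th, a in zip(thetas, allocation)]
```

A learner interacting with `run_round` sees only the per-job completion indicators, from which it must estimate each `theta` while allocating a fixed resource budget.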

ICML Conference 2014 Conference Paper

Prediction with Limited Advice and Multiarmed Bandits with Paid Observations

  • Yevgeny Seldin
  • Peter L. Bartlett
  • Koby Crammer
  • Yasin Abbasi-Yadkori

We study two problems of online learning under restricted information access. In the first problem, prediction with limited advice, we consider a game of prediction with expert advice, where on each round of the game we query the advice of a subset of M out of N experts. We present an algorithm that achieves O(√((N/M) T ln N)) regret on T rounds of this game. The second problem, the multiarmed bandit with paid observations, is a variant of the adversarial N-armed bandit game, where on round t of the game we can observe the reward of any number of arms, but each observation has a cost c. We present an algorithm that achieves O((c N ln N)^(1/3) T^(2/3) + √(T ln N)) regret on T rounds of this game in the worst case. Furthermore, we present a number of refinements that treat arm- and time-dependent observation costs and achieve lower regret under benign conditions. We present lower bounds that show that, apart from the logarithmic factors, the worst-case regret bounds cannot be improved.

IJCAI Conference 2013 Conference Paper

Hartigan's K-Means Versus Lloyd's K-Means — Is It Time for a Change?

  • Noam Slonim
  • Ehud Aharoni
  • Koby Crammer

Hartigan’s method for k-means clustering holds several potential advantages compared to the classical and prevalent optimization heuristic known as Lloyd’s algorithm. E.g., it was recently shown that the set of local minima of Hartigan’s algorithm is a subset of those of Lloyd’s method. We develop a closed-form expression that allows one to apply Hartigan’s method for k-means clustering with any Bregman divergence, and further strengthens the case for preferring Hartigan’s algorithm over Lloyd’s algorithm. Specifically, we characterize a range of problems with various noise levels of the inputs, for which any random partition represents a local minimum for Lloyd’s algorithm, while Hartigan’s algorithm easily converges to the correct solution. Extensive experiments on synthetic and real-world data further support our theoretical analysis.
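
Hartigan's method itself is short to state: points are moved one at a time, with the affected centroids updated immediately (unlike Lloyd's batch updates), whenever the move strictly lowers the within-cluster sum of squares. A plain squared-Euclidean sketch follows; the Bregman-divergence generalization from the paper is not shown.

```python
import numpy as np

def hartigan_kmeans(X, labels, k, max_passes=50):
    """Hartigan-style k-means with squared Euclidean distance.

    Moving x from cluster a (size n_a) to b (size n_b) changes the
    objective by n_b/(n_b+1)*||x-c_b||^2 - n_a/(n_a-1)*||x-c_a||^2;
    the move is taken whenever this change is negative."""
    n = len(X)
    counts = np.bincount(labels, minlength=k).astype(float)
    sums = np.zeros((k, X.shape[1]))
    for i in range(n):
        sums[labels[i]] += X[i]
    for _ in range(max_passes):
        moved = False
        for i in range(n):
            a = labels[i]
            if counts[a] <= 1:                  # never empty a cluster
                continue
            ca = sums[a] / counts[a]
            gain = counts[a] / (counts[a] - 1) * np.sum((X[i] - ca) ** 2)
            best_b, best_cost = a, gain
            for b in range(k):
                if b == a:
                    continue
                cb = sums[b] / counts[b]
                cost = counts[b] / (counts[b] + 1) * np.sum((X[i] - cb) ** 2)
                if cost < best_cost:
                    best_b, best_cost = b, cost
            if best_b != a:                     # centroids update immediately
                sums[a] -= X[i]; counts[a] -= 1
                sums[best_b] += X[i]; counts[best_b] += 1
                labels[i] = best_b
                moved = True
        if not moved:
            break
    return labels
```

Starting from a deliberately bad partition of two well-separated pairs, the single-point moves recover the correct clustering.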

IJCAI Conference 2013 Conference Paper

Multi Class Learning with Individual Sparsity

  • Ben Zion Vatashsky
  • Koby Crammer

Multi-class problems are everywhere. Given an input, the goal is to predict one of a few possible classes. Most previous work reduced learning to minimizing the empirical loss over some training set plus an additional regularization term, promoting simple models or encoding some other prior knowledge. Many learning regularizations promote sparsity, that is, small models or a small number of features, as in group LASSO. Yet such models do not always represent the classes well. In some problems, for each class there is a small set of features that represents it well, yet the union of these sets is not small. We propose to use other regularizations that promote this type of sparsity, analyze the generalization properties of such formulations, and show empirically that these regularizations not only perform well, but also promote such sparsity structure.

JMLR Journal 2012 Journal Article

Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training

  • Zhuang Wang
  • Koby Crammer
  • Slobodan Vucetic

Online algorithms that process one example at a time are advantageous when dealing with very large data or with data streams. Stochastic Gradient Descent (SGD) is such an algorithm and it is an attractive choice for online Support Vector Machine (SVM) training due to its simplicity and effectiveness. When equipped with kernel functions, similarly to other SVM learning algorithms, SGD is susceptible to the curse of kernelization that causes unbounded linear growth in model size and update time with data size. This may render SGD inapplicable to large data sets. We address this issue by presenting a class of Budgeted SGD (BSGD) algorithms for large-scale kernel SVM training which have constant space and constant time complexity per update. Specifically, BSGD keeps the number of support vectors bounded during training through several budget maintenance strategies. We treat the budget maintenance as a source of gradient error, and show that the gap between the BSGD and the optimal SVM solutions depends on the model degradation due to budget maintenance. To minimize the gap, we study greedy budget maintenance methods based on removal, projection, and merging of support vectors. We propose budgeted versions of several popular online SVM algorithms that belong to the SGD family. We further derive BSGD algorithms for multi-class SVM training. Comprehensive empirical results show that BSGD achieves higher accuracy than the state-of-the-art budgeted online algorithms and accuracy comparable to non-budgeted algorithms, while achieving impressive computational efficiency both in time and space during training and prediction.
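
The flavour of budget maintenance can be sketched with the simplest strategy, removal (the projection and merging strategies and the multi-class variants are not shown). The step-size schedule below is Pegasos-style, one member of the SGD family the paper considers; the class and parameter names are illustrative.

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    """Gaussian (RBF) kernel."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

class BudgetedKernelSGD:
    """Kernel SGD for the hinge loss with a hard budget B on support
    vectors: when the budget is exceeded, the support vector with the
    smallest coefficient magnitude is removed."""

    def __init__(self, budget=15, lam=0.01, gamma=1.0):
        self.B, self.lam, self.gamma = budget, lam, gamma
        self.sv, self.alpha, self.t = [], [], 0

    def decision(self, x):
        return sum(a * rbf(x, s, self.gamma)
                   for s, a in zip(self.sv, self.alpha))

    def partial_fit(self, x, y):
        self.t += 1
        eta = 1.0 / (self.lam * self.t)                 # Pegasos-style step
        self.alpha = [a * (1.0 - eta * self.lam) for a in self.alpha]
        if y * self.decision(x) < 1.0:                  # margin violation
            self.sv.append(x)
            self.alpha.append(eta * y)
            if len(self.sv) > self.B:                   # budget maintenance
                i = min(range(len(self.alpha)),
                        key=lambda j: abs(self.alpha[j]))
                self.sv.pop(i)
                self.alpha.pop(i)

# Two well-separated Gaussian blobs, three passes over the stream.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2, 0], 0.3, (40, 2)),
               rng.normal([-2, 0], 0.3, (40, 2))])
y = np.array([1.0] * 40 + [-1.0] * 40)
clf = BudgetedKernelSGD(budget=15)
for _ in range(3):
    for xi, yi in zip(X, y):
        clf.partial_fit(xi, yi)
acc = np.mean([(1.0 if clf.decision(xi) > 0 else -1.0) == yi
               for xi, yi in zip(X, y)])
```

The model size never exceeds the budget, so per-update time and space stay constant regardless of how long the stream runs.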

JMLR Journal 2012 Journal Article

Confidence-Weighted Linear Classification for Text Categorization

  • Koby Crammer
  • Mark Dredze
  • Fernando Pereira

Confidence-weighted online learning is a generalization of margin-based learning of linear classifiers in which the margin constraint is replaced by a probabilistic constraint based on a distribution over classifier weights that is updated online as examples are observed. The distribution captures a notion of confidence on classifier weights, and in some cases it can also be interpreted as replacing a single learning rate by adaptive per-weight rates. Confidence-weighted learning was motivated by the statistical properties of natural-language classification tasks, where most of the informative features are relatively rare. We investigate several versions of confidence-weighted learning that use a Gaussian distribution over weight vectors, updated at each observed example to achieve high probability of correct classification for the example. Empirical evaluation on a range of text-categorization tasks shows that our algorithms improve over other state-of-the-art online and batch methods, learn faster in the online setting, and lead to better classifier combination for a type of distributed training commonly used in cloud computing.

NeurIPS Conference 2012 Conference Paper

Learning Multiple Tasks using Shared Hypotheses

  • Koby Crammer
  • Yishay Mansour

In this work we consider a setting where we have a very large number of related tasks with few examples from each individual task. Rather than either learning each task individually (and having a large generalization error) or learning all the tasks together using a single hypothesis (and suffering a potentially large inherent error), we consider learning a small pool of {\em shared hypotheses}. Each task is then mapped to a single hypothesis in the pool (hard association). We derive VC dimension generalization bounds for our model, based on the number of tasks, the number of shared hypotheses, and the VC dimension of the hypothesis class. We conducted experiments with both synthetic problems and sentiment analysis of reviews, which strongly support our approach.

NeurIPS Conference 2012 Conference Paper

Volume Regularization for Binary Classification

  • Koby Crammer
  • Tal Wagner

We introduce a large-volume box classification for binary prediction, which maintains a subset of weight vectors, and specifically axis-aligned boxes. Our learning algorithm seeks a box of large volume that contains ``simple'' weight vectors, most of which are accurate on the training set. Two versions of the learning process are cast as convex optimization problems, and it is shown how to solve them efficiently. The formulation yields a natural PAC-Bayesian performance bound and it is shown to minimize a quantity directly aligned with it. The algorithm outperforms SVM and the recently proposed AROW algorithm on a majority of $30$ NLP datasets and binarized USPS optical character recognition datasets.

IJCAI Conference 2011 Conference Paper

Active Online Classification via Information Maximization

  • Noam Slonim
  • Elad Yom-Tov
  • Koby Crammer

We propose an online classification approach for co-occurrence data which is based on a simple information theoretic principle. We further show how to properly estimate the uncertainty associated with each prediction of our scheme, and demonstrate how to exploit these uncertainty estimates: first, to abstain from highly uncertain predictions; and second, within an active learning framework, to preserve classification accuracy while substantially reducing training-set size. Our method is highly efficient in terms of run-time and memory footprint requirements. Experimental results in the domain of text classification demonstrate that the classification accuracy of our method is superior or comparable to other state-of-the-art online classification algorithms.

NeurIPS Conference 2010 Conference Paper

Learning via Gaussian Herding

  • Koby Crammer
  • Daniel Lee

We introduce a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. In particular, we show how to effectively herd a Gaussian weight vector distribution by trading off velocity constraints with a loss function. By uniformly bounding this loss function, we demonstrate how to solve the resulting optimization analytically. We compare the resulting algorithms on a variety of real world datasets, and demonstrate how these algorithms achieve state-of-the-art robust performance, especially with high label noise in the training data.

NeurIPS Conference 2010 Conference Paper

New Adaptive Algorithms for Online Classification

  • Francesco Orabona
  • Koby Crammer

We propose a general framework for online learning in classification problems with time-varying potential functions in the adversarial setting. This framework allows one to design and prove relative mistake bounds for any generic loss function. The mistake bounds can be specialized for the hinge loss, allowing one to recover and improve the bounds of known online classification algorithms. By optimizing the general bound we derive a new online classification algorithm, called NAROW, that hybridly uses adaptive and fixed second-order information. We analyze the properties of the algorithm and illustrate its performance using synthetic datasets.

NeurIPS Conference 2009 Conference Paper

Adaptive Regularization of Weight Vectors

  • Koby Crammer
  • Alex Kulesza
  • Mark Dredze

We present AROW, a new online learning algorithm that combines several useful properties: large margin training, confidence weighting, and the capacity to handle non-separable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive a mistake bound, similar in form to the second order perceptron bound, which does not assume separability. We also relate our algorithm to recent confidence-weighted online learning techniques and empirically show that AROW achieves state-of-the-art performance and notable robustness in the case of non-separable data.
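
The AROW update has a compact closed form: a Gaussian (mean mu, covariance Sigma) over weight vectors is adjusted only on margin violations, moving the mean toward correcting the example while shrinking variance along its direction. The sketch below follows the full-covariance form with regularization parameter r.

```python
import numpy as np

def arow_step(mu, Sigma, x, y, r=1.0):
    """One AROW update. mu, Sigma: mean and covariance of the Gaussian
    over weight vectors; r trades off how quickly confidence is gained."""
    margin = y * (mu @ x)
    v = x @ Sigma @ x                        # predictive variance of margin
    loss = max(0.0, 1.0 - margin)            # hinge trigger
    if loss > 0.0:
        beta = 1.0 / (v + r)
        alpha = loss * beta
        Sx = Sigma @ x
        mu = mu + alpha * y * Sx             # mean moves to correct x
        Sigma = Sigma - beta * np.outer(Sx, Sx)  # variance shrinks along x
    return mu, Sigma

# From the zero-mean, unit-covariance prior, one positive example on
# the first axis halves the variance along that axis only.
mu, Sigma = np.zeros(2), np.eye(2)
mu, Sigma = arow_step(mu, Sigma, np.array([1.0, 0.0]), 1.0)
```

Note how the second coordinate, never observed, keeps its full variance: confidence is gained only in directions the data has exercised.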

ICML Conference 2008 Conference Paper

Confidence-weighted linear classification

  • Mark Dredze
  • Koby Crammer
  • Fernando C. N. Pereira

We introduce confidence-weighted linear classifiers, which add parameter confidence information to linear classifiers. Online learners in this setting update both classifier parameters and the estimate of their confidence. The particular online algorithms we study here maintain a Gaussian distribution over parameter vectors and update the mean and covariance of the distribution with each instance. Empirical evaluation on a range of NLP tasks shows that our algorithm improves over other state-of-the-art online and batch methods, learns faster in the online setting, and lends itself to better classifier combination after parallel training.

NeurIPS Conference 2008 Conference Paper

Exact Convex Confidence-Weighted Learning

  • Koby Crammer
  • Mark Dredze
  • Fernando Pereira

Confidence-weighted (CW) learning [6], an online learning method for linear classifiers, maintains a Gaussian distribution over weight vectors, with a covariance matrix that represents uncertainty about weights and correlations. Confidence constraints ensure that a weight vector drawn from the hypothesis distribution correctly classifies examples with a specified probability. Within this framework, we derive a new convex form of the constraint and analyze it in the mistake bound model. Empirical evaluation with both synthetic and text data shows our version of CW learning achieves lower cumulative and out-of-sample errors than commonly used first-order and second-order online methods.

JMLR Journal 2008 Journal Article

Learning from Multiple Sources

  • Koby Crammer
  • Michael Kearns
  • Jennifer Wortman

We consider the problem of learning accurate models from multiple sources of "nearby" data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations.

NeurIPS Conference 2007 Conference Paper

Learning Bounds for Domain Adaptation

  • John Blitzer
  • Koby Crammer
  • Alex Kulesza
  • Fernando Pereira
  • Jennifer Wortman

Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to a different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization.

NeurIPS Conference 2006 Conference Paper

Analysis of Representations for Domain Adaptation

  • Shai Ben-David
  • John Blitzer
  • Koby Crammer
  • Fernando Pereira

Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. In many situations, though, we have labeled training data for a source domain, and we wish to learn a classifier which performs well on a target domain with a different distribution. Under what conditions can we adapt a classifier trained on the source domain for use in the target domain? Intuitively, a good feature representation is a crucial factor in the success of domain adaptation. We formalize this intuition theoretically with a generalization bound for domain adaptation. Our theory illustrates the tradeoffs inherent in designing a representation for domain adaptation and gives a new justification for a recently proposed model. It also points toward a promising new model for domain adaptation: one which explicitly minimizes the difference between the source and target domains, while at the same time maximizing the margin of the training set.
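
The kind of guarantee formalized here can be stated schematically as follows (notation simplified; epsilon_S and epsilon_T are source and target errors, and d_H is a divergence between the two domains under the hypothesis class):

```latex
% Schematic form of a domain-adaptation bound: for every h in H,
%   target error <= source error + domain divergence + joint error of the best hypothesis
\epsilon_T(h) \;\le\; \epsilon_S(h) \;+\; d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) \;+\; \lambda,
\qquad \lambda = \min_{h' \in \mathcal{H}} \big[\, \epsilon_S(h') + \epsilon_T(h') \,\big]
```

The "promising new model" in the abstract targets the middle term: a representation that shrinks the divergence while keeping the margin (and hence the first term) under control.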

UAI Conference 2006 Conference Paper

Discriminative Learning via Semidefinite Probabilistic Models

  • Koby Crammer
  • Amir Globerson

Discriminative linear models are a popular tool in machine learning. These can be generally divided into two types: The first is linear classifiers, such as support vector machines, which are well studied and provide state-of-the-art results. One shortcoming of these models is that their output (known as the 'margin') is not calibrated, and cannot be translated naturally into a distribution over the labels. Thus, it is difficult to incorporate such models as components of larger systems, unlike probabilistic based approaches. The second type of approach constructs class conditional distributions using a nonlinearity (e.g. log-linear models), but is occasionally worse in terms of classification error. We propose a supervised learning method which combines the best of both approaches. Specifically, our method provides a distribution over the labels, which is a linear function of the model parameters. As a consequence, differences between probabilities are linear functions, a property which most probabilistic models (e.g. log-linear) do not have. Our model assumes that classes correspond to linear subspaces (rather than to half spaces). Using a relaxed projection operator, we construct a measure which evaluates the degree to which a given vector 'belongs' to a subspace, resulting in a distribution over labels. Interestingly, this view is closely related to similar concepts in quantum detection theory. The resulting models can be trained either to maximize the margin or to optimize average likelihood measures. The corresponding optimization problems are semidefinite programs which can be solved efficiently. We illustrate the performance of our algorithm on real world datasets, and show that it outperforms 2nd order kernel methods.

NeurIPS Conference 2006 Conference Paper

Learning from Multiple Sources

  • Koby Crammer
  • Michael Kearns
  • Jennifer Wortman

We consider the problem of learning accurate models from multiple sources of "nearby" data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields results for classification and regression generally, and for density estimation within the exponential family. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest.

JMLR Journal 2006 Journal Article

Online Passive-Aggressive Algorithms

  • Koby Crammer
  • Ofer Dekel
  • Joseph Keshet
  • Shai Shalev-Shwartz
  • Yoram Singer

We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.
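
The binary-classification member of the family has a one-line closed-form update: stay passive when the margin constraint holds, otherwise move the weight vector just enough to satisfy it. The sketch below implements the PA-I variant with aggressiveness parameter C.

```python
import numpy as np

def pa_update(w, x, y, C=1.0):
    """One Passive-Aggressive (PA-I) step for binary classification:
    if the hinge loss is positive, take the smallest step that would
    satisfy the margin constraint, capped by C."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))     # hinge loss
    if loss > 0.0:
        tau = min(C, loss / np.dot(x, x))       # closed-form step size
        w = w + tau * y * x
    return w

# Online run over a linearly separable stream.
rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(200):
    x = rng.normal(size=2)
    y = 1.0 if x[0] + x[1] > 0 else -1.0
    w = pa_update(w, x, y)
```

From the zero vector, a unit example with unit label moves the weight exactly onto the margin; over the stream, w aligns with the true separating direction.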

NeurIPS Conference 2005 Conference Paper

Learning from Data of Variable Quality

  • Koby Crammer
  • Michael Kearns
  • Jennifer Wortman

We initiate the study of learning from multiple sources of limited data, each of which may be corrupted at a different rate. We develop a complete theory of which data sources should be used for two fundamental problems: estimating the bias of a coin, and learning a classifier in the presence of label noise. In both cases, efficient algorithms are provided for computing the optimal subset of data.
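
The underlying trade-off can be sketched for the coin-bias problem: pooling more sources lowers the estimation variance but imports their corruption. The bound below (worst corruption rate in the pool plus a 1/sqrt(total samples) deviation term) is a deliberately crude stand-in for the paper's analysis; the function and its interface are illustrative.

```python
import math

def choose_sources(sources):
    """Pick a prefix of the sources, sorted by corruption rate, that
    minimizes a simple bias-plus-deviation bound.

    sources: list of (n_samples, corruption_rate) pairs.
    Returns the chosen sources and the value of the bound."""
    ordered = sorted(sources, key=lambda s: s[1])
    best_k, best_bound, total = 0, float("inf"), 0
    for k, (n, rho) in enumerate(ordered, 1):
        total += n
        bound = rho + 1.0 / math.sqrt(total)   # bias term + deviation term
        if bound < best_bound:
            best_k, best_bound = k, bound
    return ordered[:best_k], best_bound
```

With two small, nearly clean sources and one heavily corrupted one, the rule pools the clean pair (variance reduction wins) and excludes the corrupted source (its bias dominates).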

NeurIPS Conference 2004 Conference Paper

A Temporal Kernel-Based Model for Tracking Hand Movements from Neural Activities

  • Lavi Shpigelman
  • Koby Crammer
  • Rony Paz
  • Eilon Vaadia
  • Yoram Singer

We devise and experiment with a dynamical kernel-based system for tracking hand movements from neural activity. The state of the system corresponds to the hand location, velocity, and acceleration, while the system's input are the instantaneous spike rates. The system's state dy- namics is defined as a combination of a linear mapping from the previous estimated state and a kernel-based mapping tailored for modeling neural activities. In contrast to generative models, the activity-to-state mapping is learned using discriminative methods by minimizing a noise-robust loss function. We use this approach to predict hand trajectories on the basis of neural activity in motor cortex of behaving monkeys and find that the proposed approach is more accurate than both a static approach based on support vector regression and the Kalman filter. 1 Introduction The paper focuses on the problem of tracking hand movements, which constitute smooth spatial trajectories, from spike trains of a neural population. We do so by devising a dynam- ical system which employs a tailored kernel for spike trains along with a linear mapping corresponding to the states' dynamics. Consider a situation where a subject performs free hand movements during a task that requires accurate space and time precision. In the lab, it may be a constrained reaching task while in real life it may be an every day task such as eating. We wish to track the hand position given only spike trains from a recorded neural population. The rationale of such an undertaking is two fold. First, this task can be viewed as a stem towards the development of a Brain Machine Interface (BMI) which gradually and rapidly become a possible future solution for the motor disabled patients. 
Recent studies of BMIs [13, 3, 10] (being on-line and feedback enabled) show that a relatively small number of cortical units can be used to move a cursor or a robot effectively, even without genera- tion of hand movements and that training of the subjects improves the overall success of the BMIs. Second, an open loop (off-line) movement decoding (see e. g. [7, 1, 15, 11, 8]), while inappropriate for BMIs, is computationally less expensive, easier to implement and allows repeated analysis thus providing a handle to understandings of neural computations in the brain. Early studies [6] showed that the direction of arm movement is reflected by the population vector of preferred directions weighted by current firing ra tes, suggesting that intended movement is encoded in the firing rate which, in turn, is modulated by the angle between a unit's preferred direction (PD) and the intended direction. This linear regression approach is still prevalent and is applied, with some variation of the learning methods, in closed and open loop settings. There is relatively little work on the development of dedicated nonlinear methods. Both movement and neural activity are dynamic and can therefore be naturally modeled by dynamical systems. Filtering methods often employ generative probabilistic models such as the well known Kalman filter [16] or more neurally specialized models [1] in which a cortical unit's spike count is generated by a probability function of its underlying firing rate which is tuned to movement parameters. The movement, being a smooth trajectory, is modeled as a linear transition with (typically additive Gaussian) noise. These methods have the advantage of being aware of the smooth nature of movement and provide models of what neurons are tuned to. However, the requirement of describing a neural population's firing probability as a function of movement state is hard to satisfy without making costly assumptions. 
The most prominent is the assumption of statistical independence of cells given the movement. Kernel-based methods have been shown to achieve state-of-the-art results in many application domains. Discriminative kernel methods, such as Support Vector Regression (SVR), forgo the task of modeling neuronal tuning functions. Furthermore, the construction of kernel-induced feature spaces lends itself to efficient implementation of distance measures over spike trains that are better suited to comparing two neural population trajectories than the Euclidean distance in the original space of spike counts per bin [11, 5]. However, SVR is a "static" method that does not take into account the smooth dynamics of the predicted movement trajectory, which imposes a statistical dependency between consecutive examples. This paper introduces a kernel-based regression method that incorporates linear dynamics of the predicted trajectories. In Sec. 2 we formally describe the problem setting. We introduce the movement tracking model and the associated learning framework in Sec. 3. The resulting learning problem yields a new kernel for linear dynamical systems. We provide an efficient calculation of this kernel and describe our dual-space optimization method for solving the learning problem. The experimental method is presented in Sec. 4. Results underscoring the merits of our algorithm are provided in Sec. 5, and conclusions are given in Sec. 6.

2 Problem Setting

Our training set contains m trials. Each trial (typically indexed by i or j) consists of a pair (Y^i, O^i) of movement and neural recordings. Y^i = {y^i_t}_{t=1}^{t^i_end} is a time series of movement state values, where y^i_t ∈ R^d is the movement state vector at time t in trial i. We are interested in reconstructing position; however, for better modeling, y^i_t may be a vector of position, velocity and acceleration (as is the case in Sec. 4). This trajectory is observed during model learning and is the inference target.
O^i = {o^i_t}_{t=1}^{t^i_end} is a time series of neural spike counts, where o^i_t ∈ R^q is a vector of spike counts from q cortical units at time t. We wish to learn a function z^i_t = f(O^i_{1:t}) that is a good estimate (in a sense formalized in the sequel) of the movement y^i_t. Thus, f is a causal filtering method. We confine ourselves to a causal setting since we plan to apply the proposed method in a closed-loop scenario where real-time output is required. The partition into separate trajectories is a natural one in a setting where a session is divided into many trials, each consisting of one attempt at accomplishing the basic task (such as reaching movements to displayed targets). In tasks that involve no hitting of objects, hand movements are typically smooth. End-point movement in small time steps is loosely approximated as having constant acceleration. On the other hand, neural spike counts (which are typically measured in bins of 50-100 ms) vary greatly from one time step to the next. In summary, our goal is to devise a dynamic mapping from sequences of neural activities ending at a given time to the instantaneous hand movement characterization (location, velocity, and acceleration).

3 Movement Tracking Algorithm

Our regression method is defined as follows: given a series O ∈ R^{q × t_end} of observations and, possibly, an initial state y_0, the predicted trajectory Z ∈ R^{d × t_end} is

    z_t = A z_{t-1} + W φ(o_t),   t_end ≥ t > 0,   (1)

where z_0 = y_0, A ∈ R^{d×d} is a matrix describing linear movement dynamics and W ∈ R^{d×q} is a weight matrix. φ(o_t) is a feature vector of the observed spike trains at time t and is later replaced by a kernel operator (in the dual formulation to follow). Thus, the state transition is a linear transformation of the previous state with the addition of a nonlinear effect of the observation. Note that unfolding the recursion in Eq. (1) yields z_t = A^t y_0 + Σ_{k=1}^t A^{t-k} W φ(o_k).
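The recursion in Eq. (1) is easy to state in code. Below is a minimal sketch with toy dimensions and an identity feature map standing in for φ; it is an illustration, not the authors' implementation:

```python
# Sketch of Eq. (1): z_t = A z_{t-1} + W phi(o_t), with z_0 = y_0.
# Matrices are stored as lists of rows; phi is a stand-in feature map.

def mat_vec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vec_add(u, v):
    return [a + b for a, b in zip(u, v)]

def track(A, W, phi, observations, y0):
    """Unroll z_t = A z_{t-1} + W phi(o_t) over one trial."""
    z = list(y0)
    trajectory = []
    for o in observations:
        z = vec_add(mat_vec(A, z), mat_vec(W, phi(o)))
        trajectory.append(z)
    return trajectory

# Toy 1-D example: stable dynamics A = [[0.5]], identity feature map.
# A single early spike-count "impulse" then decays exponentially.
traj = track([[0.5]], [[1.0]], lambda o: o, [[1.0], [0.0], [0.0]], [0.0])
```

With stable dynamics the contribution of each observation decays geometrically, which is exactly the unfolded form z_t = A^t y_0 + Σ_k A^{t-k} W φ(o_k).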
Assuming that A describes stable dynamics (the eigenvalues of A are less than 1 in magnitude), the current prediction depends, in an exponentially decaying manner, on the previous observations. We further assume that A is fixed and wish to learn W (we describe our choice of A in Sec. 4). In addition, o_t may also encompass a series of previous spike counts in a window ending at time t (as is the case in Sec. 4). Also, note that this model (in its non-kernelized version) has an algebraic form which is similar to the Kalman filter (to which we compare our results later).

Primal Learning Problem: The optimization problem presented here is identical to the standard SVR learning problem (see, for example, [12]) with the exception that z^i_t is defined as in Eq. (1), while in standard SVR, z_t = W φ(o_t) (i.e. without the linear dynamics). Given a training set of fully observed trials {(Y^i, O^i)}_{i=1}^m, we define the learning problem to be

    min_W (1/2) ||W||^2 + c Σ_{i=1}^m Σ_{t=1}^{t^i_end} Σ_{s=1}^d |z^i_{t,s} - y^i_{t,s}|_ε,   (2)

where ||W||^2 = Σ_{a,b} (W_{ab})^2 (the Frobenius norm). The second term is a sum of training errors (over all trials, times and movement dimensions); |·|_ε is the ε-insensitive loss, defined as |v|_ε = max{0, |v| - ε}. The first term is a regularization term that promotes small weights, and c is a fixed constant providing a tradeoff between the regularization term and the training error. Note that to compensate for different units and scales of the movement dimensions one could either define a different ε_s and c_s for each dimension of the movement or, conversely, scale the s-th movement dimension. The tracking method, combined with the optimization specified here, defines the complete algorithm. We name this method the Discriminative Dynamic Tracker, or DDT in short.

A Dual Solution: The derivation of the dual of the learning problem defined in Eq. (2) is rather mundane (e.g. [12]) and is thus omitted. Briefly, we replace the ε-loss with pairs of slack variables.
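A concrete reading of the objective in Eq. (2) can be sketched in a few lines (a hypothetical illustration with a single trial and toy values, not the paper's code):

```python
def eps_insensitive(v, eps=0.1):
    """The eps-insensitive loss |v|_eps = max(0, |v| - eps):
    errors inside the eps-tube cost nothing."""
    return max(0.0, abs(v) - eps)

def ddt_objective(W, z, y, c=1.0, eps=0.1):
    """Regularized eps-insensitive training error in the spirit of Eq. (2).

    W: weight matrix (list of rows); z, y: predicted and observed
    trajectories as lists of state vectors (one trial, for simplicity).
    """
    frobenius_sq = sum(w * w for row in W for w in row)
    err = sum(eps_insensitive(zs - ys, eps)
              for zt, yt in zip(z, y) for zs, ys in zip(zt, yt))
    return 0.5 * frobenius_sq + c * err

# One error of 0.2 (costing 0.1 after the tube) and one inside the tube.
obj = ddt_objective(W=[[1.0, 0.0]], z=[[0.3], [0.0]], y=[[0.5], [0.05]])
```

The tube width ε controls which residuals count as errors at all, while c trades those errors off against the Frobenius-norm regularizer.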
We then write the Lagrangian of the primal problem and replace z^i_t with its (less standard) definition. We then differentiate the Lagrangian with respect to the slack variables and W and obtain a dual optimization problem. We present the dual problem in a top-down manner, starting with the general form and finishing with a kernel definition. The form of the dual is

    max_{α, α*} -(1/2) (α - α*)^T G (α - α*) + (α - α*)^T y - (α + α*)^T ε,   s.t. α, α* ∈ [0, c]^ℓ.   (3)

Note that the above expression conforms to the dual form of SVR. Let ℓ equal the size of the movement space (d) multiplied by the total number of time steps in all the training trajectories. α, α* ∈ R^ℓ are vectors of Lagrange multipliers, y ∈ R^ℓ is a column concatenation of all the training-set movement trajectories (y^1_1)^T ... (y^m_{t^m_end})^T, ε = [ε, ..., ε]^T ∈ R^ℓ, and G ∈ R^{ℓ×ℓ} is a Gram matrix (v^T denotes transposition). One obvious difference between our setting and standard SVR lies in the size of the vectors and Gram matrix. In addition, a major difference is the definition of G. We define G here in a hierarchical manner. Let i, j ∈ {1, ..., m} be trajectory (trial) indexes. G is built from blocks G^{ij}, which are in turn made from basic blocks K^{ij}_{tq}:

    G = [G^{11} ... G^{1m}; ... ; G^{m1} ... G^{mm}],
    G^{ij} = [K^{ij}_{11} ... K^{ij}_{1 t^j_end}; ... ; K^{ij}_{t^i_end 1} ... K^{ij}_{t^i_end t^j_end}],

where block G^{ij} refers to a pair of trials (i and j), and each basic block K^{ij}_{tq} refers to a pair of time steps t and q in trajectories i and j respectively; t^i_end and t^j_end are the time lengths of trials i and j. Basic blocks are defined as

    K^{ij}_{tq} = Σ_{r=1}^t Σ_{s=1}^q A^{t-r} k^{ij}_{rs} (A^{q-s})^T,   (4)

where k^{ij}_{rs} = k(o^i_r, o^j_s) is a (freely chosen) basic kernel between the two neural observations o^i_r and o^j_s at times r and s in trials i and j respectively. For an explanation of kernel operators we refer the reader to [14] and mention that the kernel operator can be viewed as computing φ(o^i_r) · φ(o^j_s), where φ is a fixed mapping into some inner-product space.
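In one dimension (d = 1, so A and the blocks K_tq reduce to scalars), Eq. (4) and the dynamic-programming recursion derived later in this section can be sketched and cross-checked on toy values. This is hypothetical illustration code, not the paper's; `k[r][s]` stands in for the freely chosen base kernel k(o_r, o_s):

```python
def block_direct(a, k, t, q):
    """Eq. (4) in 1-D: K_tq = sum_{r=1..t} sum_{s=1..q} a^(t-r) k_rs a^(q-s).

    k is 0-indexed storage: k[r-1][s-1] holds k_rs."""
    return sum(a ** (t - r) * k[r - 1][s - 1] * a ** (q - s)
               for r in range(1, t + 1) for s in range(1, q + 1))

def blocks_dp(a, k, T, Q):
    """All blocks with O(1) work per entry via the recursion
    K_tq = a K_(t-1)q + K_t(q-1) a - a K_(t-1)(q-1) a + k_tq."""
    K = [[0.0] * (Q + 1) for _ in range(T + 1)]   # zero boundary row/column
    for t in range(1, T + 1):
        for q in range(1, Q + 1):
            K[t][q] = (a * K[t - 1][q] + K[t][q - 1] * a
                       - a * K[t - 1][q - 1] * a + k[t - 1][q - 1])
    return K

k = [[1.0, 2.0, 0.5], [0.0, 1.0, 1.0], [2.0, 0.5, 1.0]]
K = blocks_dp(0.5, k, 3, 3)
# every K[t][q] agrees with the direct double sum of Eq. (4)
```

The zero boundary row and column make the general recursion reproduce the initial conditions K_11 = k_11, K_1q = K_1(q-1) a + k_1q and K_t1 = a K_(t-1)1 + k_t1 automatically.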
The choice of kernel (being the choice of feature space) reflects a modeling decision that specifies how similarities between neural patterns are measured. The resulting dual form of the tracker is z_t = Σ_k (α_k - α*_k) G_{tk}, where G_t is the Gram-matrix row of the new example. It is therefore clear from Eq. (4) that the linear dynamic characteristics of DDT result in a Gram matrix whose entries depend on previous observations. This dependency decays exponentially as the time difference between events in the trajectories grows. Note that the dual optimization problem in Eq. (3) can be solved by any standard quadratic-programming tool. Also note that direct calculation of G is inefficient; we describe an efficient method in the sequel.

Efficient Calculation of the Gram Matrix: Simple, straightforward calculation of the Gram matrix is time consuming. To illustrate this, suppose each trial is of length t^i_end = n; then calculation of each basic block would take O(n^2) summation steps. We now describe a dynamic-programming procedure that computes each basic block in a constant number of operations. Omitting the indexing over trials to ease notation, we are interested in calculating the basic block K_tq. First, define B_tq = Σ_{k=1}^t k_{kq} A^{t-k}. The basic block K_tq can be recursively calculated in three different ways:

    K_tq = K_{t(q-1)} A^T + B_tq   (5)
    K_tq = A K_{(t-1)q} + (B_qt)^T   (6)
    K_tq = A K_{(t-1)(q-1)} A^T + (B_qt)^T + B_tq - k_tq I.   (7)

Thus, by adding Eq. (5) to Eq. (6) and subtracting Eq. (7) we get

    K_tq = A K_{(t-1)q} + K_{t(q-1)} A^T - A K_{(t-1)(q-1)} A^T + k_tq I.

B_tq (and the entailed summation) is eliminated in exchange for a 2D dynamic program with initial conditions K_11 = k_11 I, K_1q = K_{1(q-1)} A^T + k_1q I, K_t1 = A K_{(t-1)1} + k_t1 I.

Table 1: Mean R2, MAE & MSE (across datasets, folds, hands and directions) for each algorithm.

                   R2                   MAE                  MSE
    Algorithm      pos.  vel.  accl.   pos.  vel.  accl.   pos.  vel.  accl.
    Kalman filter  0.64  0.58  0.30    0.40  0.15  0.37    0.78  0.27  1.16
    DDT-linear     0.59  0.49  0.17    0.63  0.41  0.58    0.97  0.50  1.23
    SVR-Spikernel  0.61  0.64  0.37    0.44  0.14  0.34    0.76  0.20  0.98
    DDT-Spikernel  0.73  0.67  0.40    0.37  0.14  0.34    0.50  0.16  0.91

Figure 1: Correlation coefficients (R2 of predicted and observed hand positions): comparisons of the DDT-Spikernel versus the Kalman filter (left), DDT-linear (center) and SVR-Spikernel (right). Each data point is the pair of R2 values obtained by the DDT-Spikernel and by another method in one fold of one of the datasets, for one of the two axes of movement (circle/square) and one of the hands (filled/unfilled). Results above the diagonals are cases where the DDT-Spikernel outperforms.

Suggested Optimization Method: One possible way to solve the optimization problem (essentially, a modification of the method described in [4] for classification) is to sequentially solve a reduced problem with respect to a single constraint at a time. Define

    ψ_i = |Σ_j (α_j - α*_j) G_{ij} - y_i|_ε - min_{α_i, α*_i ∈ [0,c]} |Σ_j (α_j - α*_j) G_{ij} - y_i|_ε.

Then ψ_i is the amount of ε-insensitive error that can be corrected for example i by keeping all α_j^(*) (j ≠ i) constant and changing α_i^(*). Optimality is reached by iteratively choosing the example with the largest ψ_i and changing its α_i^(*) within the [0, c] limits to minimize the error for this example.

4 Experimental Setting

The data used in this work was recorded from the primary motor cortex of a rhesus monkey (Macaca mulatta, ~4.5 kg). The monkey sat in a dark chamber, and up to 8 electrodes were introduced into the MI area of each hemisphere. The electrode signals were amplified, filtered and sorted.
The data used in this report was recorded on 8 different days and includes hand positions, sampled at 500 Hz, and spike times of single units (isolated by signal fit to a series of windows) and of multi-units (detection by threshold crossing) sampled at 1 ms precision. The monkey used two planar-movement manipulanda to control two cursors on the screen to perform a center-out reaching task. Each trial began when the monkey centered both cursors on a central circle. Either cursor could turn green, indicating the hand to be used in the trial. Then one of eight targets appeared ('go signal'), the center circle disappeared and the monkey had to move and reach the target to receive a liquid reward. The number of multi-unit channels ranged from 5 to 15, the number of single units was 20-27 and the average total was 34 units per dataset. The average spike rate per channel was 8.2 spikes/sec. More information on the recordings can be found in [9].

Figure 2: Comparison of R2 performance between algorithms. Each algorithm is represented by a vertex. The weight of an edge between two algorithms is the fraction of tests in which the algorithm on top achieves a higher R2 score than the other. A bold edge indicates a fraction higher than 95%. Graphs from left to right are for position, velocity, and acceleration respectively.

The results that we present here refer to prediction of instantaneous hand movements during the period from 'go signal' to 'target reach' times of both hands in successful trials. Note that some of the trials required movement of the left hand while keeping the right hand steady, and vice versa.
Therefore, although we considered only movement periods of the trials, we had to predict both movement and non-movement for each hand. The cumulative time length of all the datasets was about 67 minutes. Since the correlation between the movements of the two hands tends to zero, we predicted movement for each hand separately, choosing the movement space to be [x, y, vx, vy, ax, ay]^T for each of the hands (preliminary results using only [x, y, vx, vy]^T were less accurate). We preprocessed the spike trains into spike counts in running windows of 100 ms (the choice of window size is based on previous experience [11]). Hand position, velocity and acceleration were calculated using the 500 Hz recordings. Both spike counts and hand movement were then sampled at steps of 100 ms (preliminary results with a step size of 50 ms were negligibly different for all algorithms). A labeled example (y^i_t, o^i_t) for time t in trial i consisted of the previous 10 bins of population spike counts and the state, as a 6D vector, for the left or right hand. Two such consecutive examples would then have 9 time bins of spike-count overlap. For example, the number of cortical units q in the first dataset was 43 (27 single and 16 multiple) and the total length of all the trials used in that dataset is 529 seconds. Hence in that session there are 5290 consecutive examples, where each is a 43×10 matrix of spike counts along with two 6D vectors of end-point movement. In order to run our algorithm we had to choose base kernels and their parameters, A and c (and a scaling parameter, to be introduced below). We used the Spikernel [11], a kernel designed to be used with spike rate patterns, and the simple dot product (i.e. linear regression). Kernel parameters and c were chosen (and subsequently held fixed) by 5-fold cross-validation over half of the first dataset only. We compared DDT with the Spikernel and with the linear kernel to standard SVR using the Spikernel, and to the Kalman filter.
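The preprocessing just described (spike counts in running 100 ms windows, examples built from the previous 10 bins with 9 bins of overlap) can be sketched as follows; the function names and toy spike times are hypothetical:

```python
def bin_spikes(spike_times, t_end, bin_ms=100):
    """Count the spikes of one unit in consecutive bins of bin_ms milliseconds."""
    n_bins = t_end // bin_ms
    counts = [0] * n_bins
    for t in spike_times:
        b = int(t // bin_ms)
        if b < n_bins:
            counts[b] += 1
    return counts

def window_examples(counts, width=10):
    """Sliding windows of `width` bins; consecutive windows overlap by width - 1 bins."""
    return [counts[i - width:i] for i in range(width, len(counts) + 1)]

# One unit, one second of toy spike times (ms): 10 bins of 100 ms each.
counts = bin_spikes([5, 50, 150, 420, 830], t_end=1000)
windows = window_examples(counts, width=10)
```

In the paper's setting each example stacks such 10-bin windows across all q units (e.g. a 43×10 matrix in the first dataset); the sketch shows the single-unit case only.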
We also obtained tracking results using both DDT and SVR with the standard exponential kernel. These results were slightly less accurate on average than with the Spikernel and are therefore omitted here. The Kalman filter was learned assuming the standard state-space model (y_t = A y_{t-1} + η_t, o_t = H y_t + ξ_t, where η_t, ξ_t are white Gaussian noise with appropriate correlation matrices), as in [16]. y belonged to the same 6D state space as described earlier. To ease the comparison, the same matrix A that was learned for the Kalman filter was used in our algorithm (though we show that it is not optimal for DDT), multiplied by a scaling parameter. This parameter was selected to produce the best position results on the training set; the selected value is 0.8. The figures that we show in Sec. 5 are of test results in 5-fold cross-validation on the rest of the data. Each of the 8 remaining datasets was divided into 5 folds; 4/5 were used for training (with the parameters obtained previously) and the remaining 1/5 as a test set.

Figure 3: Effect of the scaling parameter on R2, MAE, MSE and the number of support vectors.

Figure 4: Sample of tracking with the DDT-Spikernel and the SVR-Spikernel.

This process was repeated 5 times for each hand. Altogether we had 8 sets × 5 folds × 2 hands = 80 folds.

5 Results

We begin by showing average results across all datasets, folds, hands and X/Y directions for the four algorithms that are compared. Table 1 shows mean correlation coefficients (R2, between recorded and predicted movement values), mean ε-insensitive absolute errors (MAE) and mean square errors (MSE). R2 is a standard performance measure, MAE is the error minimized by DDT (subject to the regularization term) and MSE is minimized by the Kalman filter.
Under all the above measures the DDT-Spikernel outperforms the rest, with the SVR-Spikernel and the Kalman filter alternating in second place. To understand whether the performance differences are significant, we look at the distribution of position (X and Y) R2 values in each of the separate tests (160 altogether). Figure 1 shows scatter plots of R2 results for position predictions. Each plot compares the DDT-Spikernel (on the Y axis) with one of the other three algorithms (on the X axis). It is clear that, in spite of large differences in accuracy across datasets, the algorithm pairs achieve similar success, with the DDT-Spikernel achieving a better R2 score in almost all cases. To summarize the significance of the R2 differences, we computed the number of tests in which one algorithm achieved a higher R2 value than another (for all pairs, in each of the position, velocity and acceleration categories). The results of this tournament between the algorithms are presented in Figure 2 as winning percentages. The graphs produce a ranking of the algorithms, and the percentages are the significances of the ranking between pairs. The DDT-Spikernel is significantly better than the rest in tracking position. The matrix A in use is not optimal for our algorithm, and the choice of the scaling parameter scales its effect. When the scaling parameter is 0 we get the standard SVR algorithm (without state dynamics). To illustrate its effect, we present in Figure 3 the mean (over 5 folds, X/Y directions and hands) R2 results on the first dataset as a function of the scaling parameter. It is clear that the value chosen to minimize position error is not optimal for minimizing velocity and acceleration errors. Another important effect of the scaling parameter is on the number of support patterns in the learned model, which drops considerably (by about one third) when the effect of the dynamics is increased.
This means that more training points fall strictly within the ε-tube in training, suggesting that the kernel which tacitly results from the dynamical model is better suited to the problem. Lastly, we show a sample of test tracking results for the DDT-Spikernel and SVR-Spikernel in Figure 4. Note that the acceleration values are not smooth and are therefore least aided by the dynamics of the model. However, adding acceleration to the model improves the prediction of position.

6 Conclusion

We described and reported experiments with a dynamical system that combines a linear state mapping with a nonlinear observation-to-state mapping. The estimation of the system's parameters is transformed to a dual representation and yields a novel kernel for temporal modeling. When a linear kernel is used, the DDT system has a form similar to the Kalman filter as t → ∞. However, the system's parameters are set so as to minimize the regularized ε-insensitive ℓ1 loss between state trajectories. DDT also bears similarity to SVR, which employs the same loss yet without the state dynamics. Our experiments indicate that by combining a kernel-induced feature space, linear state dynamics, and a robust loss, we are able to improve trajectory prediction accuracy and outperform common approaches. Our next step toward an accurate brain-machine interface for predicting hand movements is the development of a learning procedure for the state dynamics mapping A, and further development of neurally motivated and compact representations.

Acknowledgments: This study was partly supported by a Center of Excellence grant (8006/00) administered by the ISF, by BMBF-DIP, by the U.S.-Israel BSF and by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. L.S. is supported by a Horowitz fellowship.

JMLR Journal 2003 Journal Article

A Family of Additive Online Algorithms for Category Ranking

  • Koby Crammer
  • Yoram Singer

We describe a new family of topic-ranking algorithms for multi-labeled documents. The motivation for the algorithms stems from recent advances in online learning algorithms. The algorithms are simple to implement and are also time and memory efficient. We provide a unified analysis of the family of algorithms in the mistake bound model. We then discuss experiments with the proposed family of topic-ranking algorithms on the Reuters-21578 corpus and the new corpus released by Reuters in 2000. On both corpora, the algorithms we present achieve state-of-the-art results and outperform topic-ranking adaptations of Rocchio's algorithm and of the Perceptron algorithm.

NeurIPS Conference 2003 Conference Paper

Online Classification on a Budget

  • Koby Crammer
  • Jaz Kandola
  • Yoram Singer

Online algorithms for classification often require vast amounts of memory and computation time when employed in conjunction with kernel functions. In this paper we describe and analyze a simple approach for an on-the-fly reduction of the number of past examples used for prediction. Experiments performed with real datasets show that using the proposed algorithmic approach with a single epoch is competitive with the support vector machine (SVM), although the latter, being a batch algorithm, accesses each training example multiple times.

1 Introduction and Motivation

Kernel-based methods are widely used for data modeling and prediction because of their conceptual simplicity and outstanding performance on many real-world tasks. The support vector machine (SVM) is a well-known algorithm for finding kernel-based linear classifiers with maximal margin [7]. The kernel trick can be used to provide an effective method to deal with very high dimensional feature spaces as well as to model complex input phenomena via embedding into inner-product spaces. However, despite generalization error being upper bounded by a function of the margin of a linear classifier, it is notoriously difficult to implement such classifiers efficiently. Empirically this often translates into very long training times. A number of alternative algorithms exist for finding a maximal margin hyperplane, many of which have been inspired by Rosenblatt's Perceptron algorithm [6], an online learning algorithm for linear classifiers. The work on SVMs has inspired a number of modifications and enhancements to the original Perceptron algorithm. These incorporate the notion of margin into the learning and prediction processes while exhibiting good empirical performance in practice.
Examples of such algorithms include the Relaxed Online Maximum Margin Algorithm (ROMMA) [4], the Approximate Maximal Margin Classification Algorithm (ALMA) [2], and the Margin Infused Relaxed Algorithm (MIRA) [1], all of which can be used in conjunction with kernel functions. A notable limitation of kernel-based methods is their computational complexity, since the amount of memory they require to store the so-called support patterns grows linearly with the number of prediction errors. A number of attempts have been made to speed up the training and testing of SVMs by enforcing a sparsity condition. In this paper we devise an online algorithm that is not only sparse but also generalizes well. To achieve this goal our algorithm employs an insertion and deletion process. Informally, it can be thought of as revising the weight vector after each example on which a prediction mistake has been made. Once such an event occurs, the algorithm adds the new erroneous example (the insertion phase), and then immediately searches for past examples that appear to be redundant given the recent addition (the deletion phase). As we describe later, making this adjustment to the algorithm allows us to modify the standard online proof techniques so as to provide a bound on the total number of examples the algorithm keeps. This paper is organized as follows. In Sec. 2 we formalize the problem setting and provide a brief outline of our method for obtaining a sparse set of support patterns in an online setting. In Sec. 3 we present both theoretical and algorithmic details of our approach and provide a bound on the number of support patterns that constitute the cache. Sec. 4 provides experimental details, evaluated on three real-world datasets, to illustrate the performance and merits of our sparse online algorithm. We end the paper with conclusions and ideas for future work.

2 Problem Setting and Algorithms

This work focuses on online additive algorithms for classification tasks.
In such problems we are typically given a stream of instance-label pairs (x_1, y_1), ..., (x_t, y_t), .... We assume that each instance is a vector x_t ∈ R^n and each label belongs to a finite set Y. In this and the next section we assume that Y = {-1, +1}, but we relax this assumption in Sec. 4, where we describe experiments with datasets consisting of more than two labels. When dealing with the task of predicting new labels, thresholded linear classifiers of the form h(x) = sign(w · x) are commonly employed. The vector w is typically represented as a weighted linear combination of the examples, namely w = Σ_t α_t y_t x_t, where α_t ≥ 0. The instances for which α_t > 0 are referred to as support patterns. Under this assumption, the output of the classifier solely depends on inner products of the form x · x_t, so kernel functions can easily be employed simply by replacing the standard scalar product with a function K(·, ·) which satisfies Mercer's conditions [7]. The resulting classification rule takes the form h(x) = sign(w · x) = sign(Σ_t α_t y_t K(x, x_t)). The majority of additive online algorithms for classification, for example the well-known Perceptron [6], share a common algorithmic structure. These online algorithms typically work in rounds. On the t-th round, an online algorithm receives an instance x_t, computes the inner products s_t = Σ_{i<t} α_i y_i K(x_i, x_t) and predicts according to the sign of s_t; after the label y_t is revealed, the algorithm may update the classifier by setting α_t > 0. The various online algorithms differ in the way the values of the parameters β_t, α_t and c_t are set. A notable example of an online algorithm is the Perceptron [6], for which we set β_t = 0, α_t = 1 and c_t = 1.
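The shared structure just described, specialized to the Perceptron (β_t = 0, α_t = 1 on every mistake), can be sketched as follows; the linear kernel and toy stream are illustrative only:

```python
def kernel_perceptron(stream, K):
    """On each round: score s_t with the cached support patterns, and on a
    mistake add (x_t, y_t) with alpha_t = 1."""
    support = []          # list of (alpha, y, x) triples
    mistakes = 0
    for x, y in stream:
        s = sum(a * yi * K(xi, x) for a, yi, xi in support)
        if y * s <= 0:                    # mistake (beta_t = 0)
            support.append((1.0, y, x))   # alpha_t = 1
            mistakes += 1
    return support, mistakes

dot = lambda u, v: sum(a * b for a, b in zip(u, v))
stream = [((1.0, 0.0), 1), ((0.0, 1.0), -1), ((2.0, 0.1), 1), ((0.1, 2.0), -1)]
support, mistakes = kernel_perceptron(stream, dot)
# two early mistakes become support patterns; the later points need no update
```

Swapping `dot` for a Mercer kernel K changes only the similarity computation; the memory cost, as the text notes, grows with the number of mistakes.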
More recent algorithms, such as the Relaxed Online Maximum Margin Algorithm (ROMMA) [4], the Approximate Maximal Margin Classification Algorithm (ALMA) [2] and the Margin Infused Relaxed Algorithm (MIRA) [1], can also be described in this framework, although the constants β_t, α_t and c_t are not as simple as the ones employed by the Perceptron algorithm. An important computational consideration needs to be made when employing kernel functions for machine learning tasks. This is because the amount of memory required to store the so-called support patterns grows linearly with the number of prediction errors. In this paper we shift the focus to the problem of devising online algorithms which are budget-conscious, in that they attempt to keep the number of support patterns small. The approach is attractive for at least two reasons. Firstly, both the training time and the classification time can be reduced significantly if we store only a fraction of the potential support patterns. Secondly, a classifier with a small number of support patterns is intuitively "simpler", and hence is likely to exhibit better generalization properties than a complex classifier with a large number of support patterns. (See for instance [7] for formal results connecting the number of support patterns to the generalization error.)

Input: Tolerance β.
Initialize: Set ∀t α_t = 0; w_0 = 0; C_0 = ∅.
Loop: For t = 1, 2, ..., T
  • Get a new instance x_t ∈ R^n.
  • Predict ŷ_t = sign(x_t · w_{t-1}).
  • Get a new label y_t.
  • If y_t (x_t · w_{t-1}) ≤ β, update:
    1. Insert: C_t ← C_{t-1} ∪ {t}.
    2. Set α_t = 1.
    3. Compute w_t ← w_{t-1} + y_t α_t x_t.
    4. DistillCache(C_t, w_t, (α_1, ..., α_t)).
Output: H(x) = sign(w_T · x).

Figure 1: The aggressive Perceptron algorithm with a variable-size cache.

Input: C, w, (α_1, ..., α_t).
Loop:
  • Choose i ∈ C such that β ≤ y_i ((w - α_i y_i x_i) · x_i).
  • If no such i exists then return.
  • Remove the example i:
    1. Update w ← w - α_i y_i x_i.
    2. Set α_i = 0.
    3. Remove: C ← C \ {i}.
Return: C, w, (α_1, ..., α_t).

Figure 2: DistillCache.

In Sec. 3 we present a formal analysis and the algorithmic details of our approach. Let us now provide a general overview of how to restrict the number of support patterns in an online setting. Denote by C_t the indices of patterns which constitute the classification vector w_t. That is, i ∈ C_t if and only if α_i > 0 on round t when x_t is received. The online classification algorithms discussed above keep enlarging C_t: once an example is added to C_t, it is never deleted. However, as the online algorithm receives more examples, the performance of the classifier improves, and some of the past examples may have become redundant and hence can be removed. Put another way, old examples may have been inserted into the cache simply due to the lack of support patterns in the early rounds. As more examples are observed, the old examples may be replaced with new examples whose location is closer to the decision boundary induced by the online classifier. We thus add a new stage to the online algorithm in which we discard a few old examples from the cache C_t. We suggest a modification of the online algorithm structure as follows. Whenever y_t Σ_{i<t} α_i y_i K(x_t, x_i) ≤ β_t, then after adding x_t to w and inserting the index t into C_t, we scan the cache C_t for seemingly redundant examples by examining the margin conditions of old examples in C_t. If such an example is found, we discard it from both the classifier and the cache by updating w_t ← w_t - α_i y_i x_i and setting C_t ← C_t \ {i}. The pseudocode for this "budget-conscious" version of the aggressive Perceptron algorithm [3] is given in Fig. 1. We say that the algorithm employs a variable-size cache, since we do not explicitly limit the number of support patterns, though we do attempt to discard as many patterns as possible from the cache. A similar modification to that described for the aggressive Perceptron can be made to all of the online classification algorithms outlined above. In particular, we use a modification of the MIRA [1] algorithm in our experiments.
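A compact executable sketch of Figs. 1 and 2 (insertion on a margin violation, then cache distillation) is given below. It is a hypothetical illustration: the linear kernel lets w be kept explicitly, the cache is scanned oldest-first, and the toy data are chosen so that one deletion actually fires:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def scale_add(w, c, x):
    return tuple(wi + c * xi for wi, xi in zip(w, x))

def distill(cache, w, examples, beta):
    """Fig. 2 sketch: drop cached examples whose margin stays >= beta even
    without their own contribution to w (scanned in insertion order)."""
    changed = True
    while changed:
        changed = False
        for i in list(cache):
            x, y, alpha = examples[i]
            w_without = scale_add(w, -alpha * y, x)
            if y * dot(w_without, x) >= beta:
                w = w_without
                cache.remove(i)
                changed = True
    return w

def budget_perceptron(data, beta):
    """Fig. 1 sketch: aggressive Perceptron with a variable-size cache."""
    w = (0.0,) * len(data[0][0])
    cache, examples = [], {}
    for t, (x, y) in enumerate(data):
        if y * dot(w, x) <= beta:          # margin violation: insert t
            examples[t] = (x, y, 1.0)      # alpha_t = 1
            cache.append(t)
            w = scale_add(w, y, x)
            w = distill(cache, w, examples, beta)
    return w, cache

data = [((1.0, 0.0), 1), ((0.9, 0.0), 1), ((0.5, 1.0), 1)]
w, cache = budget_perceptron(data, beta=1.0)
# after the third insertion, the first example becomes redundant
# and is distilled out of the cache
```

The interesting moment is the third insertion: the combined weight of the two later examples alone already gives the first example a margin of at least β, so DistillCache removes it, exactly the redundancy the text describes.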

NeurIPS Conference 2003 Conference Paper

Online Passive-Aggressive Algorithms

  • Shai Shalev-Shwartz
  • Koby Crammer
  • Ofer Dekel
  • Yoram Singer

We present a unified view of online classification, regression, and uniclass problems. This view leads to a single algorithmic framework for the three problems. We prove worst-case loss bounds for various algorithms for both the realizable case and the non-realizable case. A conversion of our main online algorithm to the setting of batch learning is also discussed. The end result is new algorithms and accompanying loss bounds for the hinge loss.
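The abstract does not spell out the update rule; for orientation, the binary-classification passive-aggressive step, as it is commonly stated (a sketch from general knowledge, not lifted from the paper), stays put when the hinge loss is zero and otherwise moves w just enough to reach unit margin:

```python
def pa_update(w, x, y):
    """Passive-aggressive update for binary classification:
    passive if the hinge loss is zero, otherwise the smallest change
    to w that makes y (w . x) = 1."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)                  # hinge loss
    if loss == 0.0:
        return list(w)                             # passive step
    tau = loss / sum(xi * xi for xi in x)          # aggressive step size
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w = pa_update([0.0, 0.0], [1.0, 1.0], 1)   # loss 1 on the zero vector
```

The single closed-form step per example is what makes the worst-case (regret-style) loss bounds mentioned in the abstract tractable.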

JMLR Journal 2003 Journal Article

Ultraconservative Online Algorithms for Multiclass Problems

  • Koby Crammer
  • Yoram Singer

In this paper we study a paradigm to generalize online classification algorithms for binary classification problems to multiclass problems. The particular hypotheses we investigate maintain one prototype vector per class. Given an input instance, a multiclass hypothesis computes a similarity-score between each prototype and the input instance and sets the predicted label to be the index of the prototype achieving the highest similarity. To design and analyze the learning algorithms in this paper we introduce the notion of ultraconservativeness. Ultraconservative algorithms are algorithms that update only the prototypes attaining similarity-scores which are higher than the score of the correct label's prototype. We start by describing a family of additive ultraconservative algorithms where each algorithm in the family updates its prototypes by finding a feasible solution for a set of linear constraints that depend on the instantaneous similarity-scores. We then discuss a specific online algorithm that seeks a set of prototypes which have a small norm. The resulting algorithm, which we term MIRA (for Margin Infused Relaxed Algorithm) is ultraconservative as well. We derive mistake bounds for all the algorithms and provide further analysis of MIRA using a generalized notion of the margin for multiclass problems. We discuss the form the algorithms take in the binary case and show that all the algorithms from the first family reduce to the Perceptron algorithm while MIRA provides a new Perceptron-like algorithm with a margin-dependent learning rate. We then return to multiclass problems and describe an analogous multiplicative family of algorithms with corresponding mistake bounds. We end the formal part by deriving and analyzing a multiclass version of Li and Long's ROMMA algorithm. We conclude with a discussion of experimental results that demonstrate the merits of our algorithms.
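A minimal sketch of the prototype-per-class hypothesis and one additive ultraconservative update, using the uniform variant in which the correction is split evenly among the offending prototypes. The uniform split is one illustrative choice from the family the abstract describes, not its only member.

```python
import numpy as np

def predict(W, x):
    """Multiclass prototype hypothesis: label of the most similar prototype."""
    return int(np.argmax(W @ x))

def ultraconservative_step(W, x, y):
    """One additive ultraconservative update (uniform variant): only
    prototypes scoring at least as high as the correct one are touched."""
    scores = W @ x
    E = [r for r in range(W.shape[0]) if r != y and scores[r] >= scores[y]]
    if E:
        W[y] += x                      # pull correct prototype toward x
        for r in E:
            W[r] -= x / len(E)         # push offending prototypes away
    return W

W = np.zeros((3, 2))
W = ultraconservative_step(W, np.array([1.0, 0.0]), y=0)
```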

NeurIPS Conference 2002 Conference Paper

Kernel Design Using Boosting

  • Koby Crammer
  • Joseph Keshet
  • Yoram Singer

The focus of the paper is the problem of learning kernel operators from empirical data. We cast the kernel design problem as the construction of an accurate kernel from simple (and less accurate) base kernels. We use the boosting paradigm to perform the kernel construction process. To do so, we modify the booster so as to accommodate kernel operators. We also devise an efficient weak-learner for simple kernels that is based on generalized eigenvector decomposition. We demonstrate the effectiveness of our approach on synthetic data and on the USPS dataset. On the USPS dataset, the performance of the Perceptron algorithm with learned kernels is systematically better than a fixed RBF kernel.

1 Introduction and Problem Setting

The last decade brought a voluminous amount of work on the design, analysis and experimentation of kernel machines. Algorithms based on kernels can be used for various machine learning tasks such as classification, regression, ranking, and principal component analysis. The most prominent learning algorithm that employs kernels is the Support Vector Machine (SVM) [1, 2], designed for classification and regression. A key component in a kernel machine is a kernel operator which computes, for any pair of instances, their inner-product in some abstract vector space. Intuitively and informally, a kernel operator is a means for measuring similarity between instances. Almost all of the work that employed kernel operators concentrated on various machine learning problems that involved a predefined kernel. A typical approach when using kernels is to choose a kernel before learning starts. Examples of popular predefined kernels are the Radial Basis Functions and the polynomial kernels (see for instance [1]). Despite the simplicity of modifying a learning algorithm into a "kernelized" version, the success of such algorithms is not yet well understood.
More recently, special efforts have been devoted to crafting kernels for specific tasks such as text categorization [3] and protein classification problems [4]. Our work attempts to give a computational alternative to predefined kernels by learning kernel operators from data.

We start with a few definitions. Let X be an instance space. A kernel is an inner-product operator K : X × X → ℝ. An explicit way to describe K is via a mapping φ : X → H from X to an inner-product space H such that K(x, x′) = φ(x) · φ(x′). Given a kernel operator and a finite set of instances S = {x_i, y_i}_{i=1}^m, the kernel matrix (a.k.a. the Gram matrix) is the matrix of all possible inner-products of pairs from S, K_{i,j} = K(x_i, x_j). We therefore refer to the general form of K as the kernel operator and to the application of the kernel operator to a set of pairs of instances as the kernel matrix.

The specific setting of kernel design we consider assumes that we have access to a base kernel learner and we are given a target kernel K* manifested as a kernel matrix on a set of examples. Upon calling the base kernel learner it returns a kernel operator denoted K_j. The goal thereafter is to find a weighted combination of kernels K̂(x, x′) = Σ_j α_j K_j(x, x′) that is similar, in a sense that will be defined shortly, to the target kernel, K̂ ∼ K*. Cristianini et al. [5] in their pioneering work on kernel target alignment employed as the notion of similarity the inner-product between the kernel matrices, ⟨K, K′⟩_F = Σ_{i,j=1}^m K(x_i, x_j) K′(x_i, x_j). Given this definition, they defined the kernel-similarity, or alignment, to be the above inner-product normalized by the norm of each kernel, Â(S, K̂, K*) = ⟨K̂, K*⟩_F / √(⟨K̂, K̂⟩_F ⟨K*, K*⟩_F), where S is, as above, a finite sample of m instances. Put another way, the kernel alignment Cristianini et al.
employed is the cosine of the angle between the kernel matrices, where each matrix is "flattened" into a vector of dimension m². Therefore, this definition implies that the alignment is bounded above by 1 and attains this value iff the two kernel matrices are identical. Given a (column) vector of m labels y, where y_i ∈ {−1, +1} is the label of the instance x_i, Cristianini et al. used the outer-product of y as the target kernel, K* = y yᵀ. Therefore, an optimal alignment is achieved if K̂(x_i, x_j) = y_i y_j. Clearly, if such a kernel is used for classifying instances from X, then the kernel itself suffices to construct an excellent classifier f : X → {−1, +1} by setting f(x) = sign(y_i K(x_i, x)), where (x_i, y_i) is any instance-label pair. Cristianini et al. then devised a procedure that works with both labelled and unlabelled examples to find a Gram matrix which attains a good alignment with K* on the labelled part of the matrix. While this approach can clearly construct powerful kernels, a few problems arise from the notion of kernel alignment they employed. For instance, a kernel operator such that sign(K(x_i, x_j)) is equal to y_i y_j but whose magnitude |K(x_i, x_j)| is not necessarily 1 might achieve a poor alignment score even though it constitutes a classifier whose empirical loss is zero. Furthermore, the task of finding a good kernel when it is not always possible to find a kernel whose sign on each pair of instances equals the product of the labels (termed the soft-margin case in [5, 6]) becomes rather tricky. We thus propose a different approach which attempts to overcome some of the difficulties above. Like Cristianini et al. we assume that we are given a set of labelled instances S = {(x_i, y_i) | x_i ∈ X, y_i ∈ {−1, +1}, i = 1, ..., m}. We are also given a set of unlabelled examples S̃ = {x̃_i}_{i=1}^{m̃}. If such a set is not provided, we can simply use the labelled instances (without the labels themselves) as the set S̃.
The set S̃ is used for constructing the primitive kernels that are combined to constitute the learned kernel K̂. The labelled set is used to form the target kernel matrix, and its instances are used for evaluating the learned kernel K̂. This approach, known as transductive learning, was suggested in [5, 6] for kernel alignment tasks when the distribution of the instances in the test data differs from that of the training data. This setting is particularly handy for datasets where the test data was collected under a different scheme than the training data. We next discuss the notion of kernel goodness employed in this paper. This notion builds on the objective function that several variants of boosting algorithms maintain [7, 8]. We therefore first discuss in brief the form of boosting algorithms for kernels.

2 Using Boosting to Combine Kernels

Numerous interpretations of AdaBoost and its variants cast the boosting process as a procedure that attempts to minimize, or make small, a continuous bound on the classification error (see for instance [9, 7] and the references therein). A recent work by Collins et al. [8] unifies the boosting process for two popular loss functions, the exponential-loss (denoted henceforth as ExpLoss) and logarithmic-loss (denoted as LogLoss), that bound the empirical

[Figure: pseudocode for the boosting-based kernel combination. Input: labelled and unlabelled sets of examples, S = {(x_i, y_i)}_{i=1}^m; Initialize: K ← 0 (all-zeros matrix); For t = 1, 2, ..., T: ...]
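The alignment quantity defined above is straightforward to compute. A minimal sketch, using as target the yyᵀ construction mentioned in the text (the three-point label vector is just an illustrative example):

```python
import numpy as np

def alignment(K1, K2):
    """Empirical kernel alignment of Cristianini et al.: cosine of the
    angle between two kernel matrices flattened into vectors."""
    f = float(np.sum(K1 * K2))                 # Frobenius inner product
    return f / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

y = np.array([1.0, 1.0, -1.0])
K_target = np.outer(y, y)                      # ideal target kernel y y^T
```

A matrix is perfectly aligned with itself (alignment 1), and any other kernel matrix scores strictly less against this target.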

NeurIPS Conference 2002 Conference Paper

Margin Analysis of the LVQ Algorithm

  • Koby Crammer
  • Ran Gilad-Bachrach
  • Amir Navot
  • Naftali Tishby

Prototype-based algorithms are commonly used to reduce the computational complexity of Nearest-Neighbour (NN) classifiers. In this paper we discuss theoretical and algorithmic aspects of such algorithms. On the theory side, we present margin-based generalization bounds which suggest that these kinds of classifiers can be more accurate than the 1-NN rule. Furthermore, we derive a training algorithm that selects a good set of prototypes using large margin principles. We also show that the 20-year-old Learning Vector Quantization (LVQ) algorithm emerges naturally from our framework.
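For context, the classic LVQ1 rule that the paper relates to its margin framework moves the nearest prototype toward a correctly labelled example and away from a mislabelled one. A minimal sketch; the learning rate and single-step API are illustrative choices, not the paper's algorithm.

```python
import numpy as np

def lvq1_step(prototypes, labels, x, y, lr=0.1):
    """One step of the classic LVQ1 rule: move the nearest prototype
    toward x if its label matches y, away from x otherwise."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    j = int(np.argmin(dists))                  # nearest prototype
    sign = 1.0 if labels[j] == y else -1.0
    prototypes[j] += sign * lr * (x - prototypes[j])
    return prototypes
```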

JMLR Journal 2001 Journal Article

On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines (Kernel Machines Section)

  • Koby Crammer
  • Yoram Singer

In this paper we describe the algorithmic implementation of multiclass kernel-based vector machines. Our starting point is a generalization of the notion of margin to multiclass problems. Using this notion we cast multiclass categorization problems as a constrained optimization problem with a quadratic objective function. Unlike most previous approaches, which typically decompose a multiclass problem into multiple independent binary classification tasks, our notion of margin yields a direct method for training multiclass predictors. By using the dual of the optimization problem we are able to incorporate kernels with a compact set of constraints and decompose the dual problem into multiple optimization problems of reduced size. We describe an efficient fixed-point algorithm for solving the reduced optimization problems and prove its convergence. We then discuss technical details that yield significant running time improvements for large datasets. Finally, we describe various experiments with our approach comparing it to previously studied kernel-based methods. Our experiments indicate that for multiclass problems we attain state-of-the-art accuracy.

NeurIPS Conference 2001 Conference Paper

Pranking with Ranking

  • Koby Crammer
  • Yoram Singer

We discuss the problem of ranking instances. In our framework each instance is associated with a rank or a rating, which is an integer from 1 to k. Our goal is to find a rank-prediction rule that assigns each instance a rank which is as close as possible to the instance's true rank. We describe a simple and efficient online algorithm, analyze its performance in the mistake bound model, and prove its correctness. We describe two sets of experiments, with synthetic data and with the EachMovie dataset for collaborative filtering. In the experiments we performed, our algorithm outperforms online algorithms for regression and classification applied to ranking.
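A minimal sketch of the rank-prediction rule and update in the style the abstract describes: a weight vector plus ordered thresholds, where a mistake corrects every threshold the score falls on the wrong side of. The exact indexing and data structures here are illustrative assumptions, not a transcription of the paper's pseudocode.

```python
import numpy as np

def prank_predict(w, b, x):
    """Predict the smallest rank r with w @ x < b[r-1]; the last
    threshold is +inf, so ranks run 1..k."""
    score = w @ x
    return int(np.argmax(score < b)) + 1

def prank_step(w, b, x, y):
    """One update: fix every threshold the score sits on the wrong
    side of, and move w by the summed corrections."""
    score = w @ x
    taus = np.zeros(len(b) - 1)
    for r in range(len(b) - 1):              # real thresholds b[0..k-2]
        yr = 1.0 if y > r + 1 else -1.0      # desired side of threshold r+1
        if (score - b[r]) * yr <= 0:
            taus[r] = yr
    w = w + taus.sum() * x
    b[:-1] -= taus
    return w, b
```

Starting from zeros, one update on an example of rank 3 (out of k = 3) pushes both thresholds down and the score up, so the example is ranked correctly afterwards.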

NeurIPS Conference 2000 Conference Paper

Improved Output Coding for Classification Using Continuous Relaxation

  • Koby Crammer
  • Yoram Singer

Output coding is a general method for solving multiclass problems by reducing them to multiple binary classification problems. Previous research on output coding has employed, almost solely, predefined discrete codes. We describe an algorithm that improves the performance of output codes by relaxing them to continuous codes. The relaxation procedure is cast as an optimization problem and is reminiscent of the quadratic program for support vector machines. We describe experiments with the proposed algorithm, comparing it to standard discrete output codes. The experimental results indicate that continuous relaxations of output codes often improve the generalization performance, especially for short codes.
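For context, the discrete output-code baseline that the paper relaxes predicts by matching the binary classifiers' outputs against the rows of a code matrix. A minimal Hamming-decoding sketch; the one-vs-rest code matrix is just an illustrative choice of code.

```python
import numpy as np

def decode(code_matrix, binary_preds):
    """Discrete output-code decoding: pick the class whose code row
    agrees with the most binary predictions (min Hamming distance)."""
    agreements = (code_matrix == binary_preds).sum(axis=1)
    return int(np.argmax(agreements))

# one-vs-rest code for 3 classes over 3 binary problems
M = np.array([[+1, -1, -1],
              [-1, +1, -1],
              [-1, -1, +1]])
```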