Arrow Research search

Author name cluster

Ralf Herbrich

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

26 papers
2 author rows

Possible papers (26)

ICML Conference 2024 Conference Paper

Energy-Efficient Gaussian Processes Using Low-Precision Arithmetic

  • Nicolas Alder
  • Ralf Herbrich

The widespread use of artificial intelligence requires finding energy-efficient paradigms for the field. We propose to reduce the energy consumption of Gaussian process regression using low-precision floating-point representations. We explore how low-precision representations impact the results of Gaussian process regression and how data set properties, implementation approach, model performance, and energy consumption interact. Our findings show that a well-conditioned kernel matrix allows reducing the energy consumption by up to 89.01% for 98.08% of arithmetic operations with little to no impact on model performance. Our findings are relevant whenever one needs to invert a symmetric full-rank matrix.
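
The central numerical kernel here is the solve against the kernel (Gram) matrix. As a rough illustration of the precision question only (not the paper's implementation, which studies finer-grained low-precision formats and measures energy directly), here is a sketch comparing float32 and float64 Cholesky-based GP regression; all names and settings are illustrative:

```python
# Sketch: GP regression posterior mean at two floating-point precisions.
# Illustrative only -- the paper studies custom low-precision formats and
# energy consumption, not merely float32 vs float64.
import numpy as np

def gp_posterior_mean(X, y, X_star, lengthscale=1.0, noise=1e-2, dtype=np.float64):
    X, y, X_star = X.astype(dtype), y.astype(dtype), X_star.astype(dtype)
    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / dtype(lengthscale) ** 2)
    K = rbf(X, X) + dtype(noise) * np.eye(len(X), dtype=dtype)
    L = np.linalg.cholesky(K)              # fails if K is too ill-conditioned
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return rbf(X_star, X) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
X_star = np.linspace(-3, 3, 5)[:, None]

m64 = gp_posterior_mean(X, y, X_star, dtype=np.float64)
m32 = gp_posterior_mean(X, y, X_star, dtype=np.float32)
print("max |float64 - float32| =", np.abs(m64 - m32).max())
```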

ICML Conference 2024 Conference Paper

Hieros: Hierarchical Imagination on Structured State Space Sequence World Models

  • Paul Mattes
  • Rainer Schlosser
  • Ralf Herbrich

One of the biggest challenges to modern deep reinforcement learning (DRL) algorithms is sample efficiency. Many approaches learn a world model in order to train an agent entirely in imagination, eliminating the need for direct environment interaction during training. However, these methods often suffer from either a lack of imagination accuracy, exploration capabilities, or runtime efficiency. We propose HIEROS, a hierarchical policy that learns time-abstracted world representations and imagines trajectories at multiple time scales in latent space. HIEROS uses an S5 layer-based world model, which predicts next world states in parallel during training and iteratively during environment interaction. Due to the special properties of S5 layers, our method can train in parallel and predict next world states iteratively during imagination. This allows for more efficient training than RNN-based world models and more efficient imagination than Transformer-based world models. We show that our approach outperforms the state of the art in terms of mean and median normalized human score on the Atari 100k benchmark, and that our proposed world model is able to predict complex dynamics very accurately. We also show that HIEROS displays superior exploration capabilities compared to existing approaches.

NeurIPS Conference 2022 Conference Paper

On the detrimental effect of invariances in the likelihood for variational inference

  • Richard Kurle
  • Ralf Herbrich
  • Tim Januschowski
  • Yuyang (Bernie) Wang
  • Jan Gasthaus

Variational Bayesian posterior inference often requires simplifying approximations such as mean-field parametrisation to ensure tractability. However, prior work has associated the variational mean-field approximation for Bayesian neural networks with underfitting in the case of small datasets or large model sizes. In this work, we show that invariances in the likelihood function of over-parametrised models contribute to this phenomenon because these invariances complicate the structure of the posterior by introducing discrete and/or continuous modes which cannot be well approximated by Gaussian mean-field distributions. In particular, we show that the mean-field approximation has an additional gap in the evidence lower bound compared to a purpose-built posterior that takes into account the known invariances. Importantly, this invariance gap is not constant; it vanishes as the approximation reverts to the prior. We proceed by first considering translation invariances in a linear model with a single data point in detail. We show that, while the true posterior can be constructed from a mean-field parametrisation, this is achieved only if the objective function takes into account the invariance gap. Then, we transfer our analysis of the linear model to neural networks. Our analysis provides a framework for future work to explore solutions to the invariance problem.
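
To make the invariance argument concrete, consider a hypothetical over-parametrised toy model f(x) = (w1 + w2)·x: the likelihood sees the weights only through their sum, so the posterior contains a continuous ridge that no factorised Gaussian fits well. A minimal sketch under that assumption (an illustration, not an example from the paper):

```python
# Toy translation invariance: in f(x) = (w1 + w2) * x the likelihood
# depends on w1, w2 only through their sum, so the posterior is (up to
# a weak prior) constant along lines w1 + w2 = c -- a continuous ridge
# that a diagonal (mean-field) Gaussian cannot represent.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(20)
y = 1.5 * x + 0.1 * rng.standard_normal(20)

def log_posterior(w1, w2, prior_var=100.0, noise_var=0.01):
    resid = y - (w1 + w2) * x
    return (-0.5 * (resid ** 2).sum() / noise_var
            - 0.5 * (w1 ** 2 + w2 ** 2) / prior_var)

# Identical likelihood along the ridge; only the broad prior differs slightly.
print(log_posterior(1.5, 0.0), log_posterior(0.0, 1.5), log_posterior(0.75, 0.75))
```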

AAAI Conference 2010 Conference Paper

Collaborative Expert Portfolio Management

  • David Stern
  • Horst Samulowitz
  • Ralf Herbrich
  • Thore Graepel
  • Luca Pulina
  • Armando Tacchella

We consider the task of assigning experts from a portfolio of specialists in order to solve a set of tasks. We apply a Bayesian model which combines collaborative filtering with a feature-based description of tasks and experts to yield a general framework for managing a portfolio of experts. The model learns an embedding of tasks and experts into a latent space in which affinity is measured by the inner product. The model can be trained incrementally and can track non-stationary data, following potentially changing expert and task characteristics. The approach allows us to use a principled decision theoretic framework for expert selection, allowing the user to choose a utility function that best suits their objectives. The model component for taking into account the performance feedback data is pluggable, allowing flexibility. We apply the model to manage a portfolio of algorithms to solve hard combinatorial problems. This is a well studied area and we demonstrate a large improvement on the state of the art in one domain (constraint solving); in a second domain (combinatorial auctions) we created a portfolio that performed significantly better than any single algorithm.
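
The scoring rule at the heart of the model is an inner product between latent embeddings. A schematic sketch with invented expert names and fixed vectors standing in for the learned (in the paper, Bayesian) embeddings:

```python
# Schematic of inner-product affinity between a task and each expert in a
# shared latent space; embeddings here are illustrative fixed vectors,
# not learned quantities.
import numpy as np

task_embedding = np.array([0.9, -0.2, 0.4])          # latent position of one task
expert_embeddings = {                                 # latent positions of experts
    "solver_A": np.array([1.0, 0.0, 0.3]),
    "solver_B": np.array([-0.5, 0.8, 0.1]),
    "solver_C": np.array([0.7, -0.4, 0.6]),
}

# Affinity = inner product; assign the expert with the highest predicted affinity.
scores = {name: float(task_embedding @ emb) for name, emb in expert_embeddings.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```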

NeurIPS Conference 2007 Conference Paper

TrueSkill Through Time: Revisiting the History of Chess

  • Pierre Dangauthier
  • Ralf Herbrich
  • Tom Minka
  • Thore Graepel

We extend the Bayesian skill rating system TrueSkill to infer entire time series of skills of players by smoothing through time instead of filtering. The skill of each participating player, say, every year is represented by a latent skill variable which is affected by the relevant game outcomes that year, and coupled with the skill variables of the previous and subsequent year. Inference in the resulting factor graph is carried out by approximate message passing (EP) along the time series of skills. As before, the system tracks the uncertainty about player skills, explicitly models draws, can deal with any number of competing entities and can infer individual skills from team results. We extend the system to estimate player-specific draw margins. Based on these models we present an analysis of the skill curves of important players in the history of chess over the past 150 years. Results include plots of players' lifetime skill development as well as the ability to compare the skills of different players across time. Our results indicate that a) the overall playing strength has increased over the past 150 years, and b) that modelling a player's ability to force a draw provides significantly better predictive power.
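
As a drastically simplified picture of smoothing a skill time series, one can treat skill as a Gaussian random walk with direct noisy yearly observations and apply forward-backward (Kalman) smoothing; the actual system instead propagates EP messages through game-outcome factors. This sketch conveys only the year-to-year coupling:

```python
# Minimal scalar Kalman smoother as a stand-in for smoothing a skill
# time series: skill_t = skill_{t-1} + process noise, obs_t = skill_t + noise.
# TrueSkill Through Time passes EP messages through game-outcome factors
# instead; this only illustrates coupling consecutive years.
import numpy as np

def smooth(obs, q=0.5, r=1.0):
    n = len(obs)
    mf, vf = np.zeros(n), np.zeros(n)          # filtered means / variances
    m, v = 0.0, 10.0                           # broad prior on initial skill
    for t in range(n):                         # forward (filtering) pass
        v += q                                 # predict through random walk
        k = v / (v + r)                        # Kalman gain
        m, v = m + k * (obs[t] - m), (1 - k) * v
        mf[t], vf[t] = m, v
    ms = mf.copy()
    for t in range(n - 2, -1, -1):             # backward (smoothing) pass
        g = vf[t] / (vf[t] + q)
        ms[t] = mf[t] + g * (ms[t + 1] - mf[t])
    return ms

obs = np.array([0.0, 0.4, 1.1, 0.9, 1.6, 2.0])  # noisy yearly "skill" estimates
print(np.round(smooth(obs), 2))
```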

JMLR Journal 2005 Journal Article

Generalization Bounds for the Area Under the ROC Curve

  • Shivani Agarwal
  • Thore Graepel
  • Ralf Herbrich
  • Sariel Har-Peled
  • Dan Roth

We study generalization properties of the area under the ROC curve (AUC), a quantity that has been advocated as an evaluation criterion for the bipartite ranking problem. The AUC is a different quantity than the error rate used for evaluation in classification problems; consequently, existing generalization bounds for the classification error rate cannot be used to draw conclusions about the AUC. In this paper, we define the expected accuracy of a ranking function (analogous to the expected error rate of a classification function), and derive distribution-free probabilistic bounds on the deviation of the empirical AUC of a ranking function (observed on a finite data sequence) from its expected accuracy. We derive both a large deviation bound, which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on a test sequence, and a uniform convergence bound, which serves to bound the expected accuracy of a learned ranking function in terms of its empirical AUC on a training sequence. Our uniform convergence bound is expressed in terms of a new set of combinatorial parameters that we term the bipartite rank-shatter coefficients; these play the same role in our result as do the standard VC-dimension related shatter coefficients (also known as the growth function) in uniform convergence results for the classification error rate. A comparison of our result with a recent uniform convergence result derived by Freund et al. (2003) for a quantity closely related to the AUC shows that the bound provided by our result can be considerably tighter.
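
The empirical AUC itself has a simple pairwise form: the fraction of (positive, negative) pairs that the scoring function ranks correctly, with ties counted as one half. A direct sketch:

```python
# Empirical AUC: fraction of (positive, negative) pairs ranked correctly,
# counting ties as 1/2. O(m*n) pairwise version for clarity; a sort-based
# version would be O((m+n) log(m+n)).
import numpy as np

def empirical_auc(pos_scores, neg_scores):
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return float(((pos > neg) + 0.5 * (pos == neg)).mean())

print(empirical_auc([0.9, 0.7, 0.8], [0.3, 0.7, 0.1]))  # 8.5 / 9 pairs
```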

JMLR Journal 2005 Journal Article

Kernel Methods for Measuring Independence

  • Arthur Gretton
  • Ralf Herbrich
  • Alexander Smola
  • Olivier Bousquet
  • Bernhard Schölkopf

We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. These quantities are both based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent. We also show that the kernel mutual information is an upper bound near independence on the Parzen window estimate of the mutual information. Analogous results apply for two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis.
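
For flavour, here is a biased empirical estimate of the closely related Hilbert-Schmidt Independence Criterion (HSIC) with Gaussian kernels; it is built from the same ingredients (kernel Gram matrices and centring) but is not the paper's constrained covariance or kernel mutual information:

```python
# Biased empirical HSIC estimate with Gaussian (RBF) kernels: a kernel
# dependence statistic in the same family as the paper's functionals,
# zero in expectation only under independence.
import numpy as np

def rbf_gram(z, sigma=1.0):
    d2 = (z[:, None] - z[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    n = len(x)
    K, L = rbf_gram(x, sigma), rbf_gram(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
print("independent:", hsic(x, rng.standard_normal(200)))
print("dependent:  ", hsic(x, x ** 2))           # uncorrelated but dependent
```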

NeurIPS Conference 2004 Conference Paper

A Large Deviation Bound for the Area Under the ROC Curve

  • Shivani Agarwal
  • Thore Graepel
  • Ralf Herbrich
  • Dan Roth

The area under the ROC curve (AUC) has been advocated as an evaluation criterion for the bipartite ranking problem. We study large deviation properties of the AUC; in particular, we derive a distribution-free large deviation bound for the AUC which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. A comparison of our result with a corresponding large deviation result for the classification error rate suggests that the test sample size required to obtain an ε-accurate estimate of the expected accuracy of a ranking function with δ-confidence is larger than that required to obtain an ε-accurate estimate of the expected error rate of a classification function with the same confidence. A simple application of the union bound allows the large deviation bound to be extended to learned ranking functions chosen from finite function classes.

NeurIPS Conference 2003 Conference Paper

Invariant Pattern Recognition by Semi-Definite Programming Machines

  • Thore Graepel
  • Ralf Herbrich

Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm - the Semidefinite Programming Machine (SDPM) - which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in dual variables and allows us to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods.

NeurIPS Conference 2003 Conference Paper

Semi-Definite Programming by Perceptron Learning

  • Thore Graepel
  • Ralf Herbrich
  • Andriy Kharechko
  • John Shawe-Taylor

We present a modified version of the perceptron learning algorithm (PLA) which solves semidefinite programs (SDPs) in polynomial time. The algorithm is based on the following three observations: (i) semidefinite programs are linear programs with infinitely many (linear) constraints; (ii) every linear program can be solved by a sequence of constraint satisfaction problems with linear constraints; (iii) in general, the perceptron learning algorithm solves a constraint satisfaction problem with linear constraints in finitely many updates. Combining the PLA with a probabilistic rescaling algorithm (which, on average, increases the size of the feasible region) results in a probabilistic algorithm for solving SDPs that runs in polynomial time. We present preliminary results which demonstrate that the algorithm works, but is not competitive with state-of-the-art interior point methods.
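
Observation (iii) is the classical perceptron guarantee: for a strictly feasible homogeneous system a_i·w > 0, repeatedly adding a violated constraint vector to w terminates after finitely many updates. A minimal sketch of that building block alone (not the full SDP solver, which adds the rescaling step):

```python
# Perceptron as a feasibility solver for homogeneous linear constraints
# a_i . w > 0: add any violated constraint vector to w until none remain.
# This is observation (iii); the paper combines it with rescaling to
# obtain a polynomial-time SDP solver.
import numpy as np

def perceptron_feasibility(A, max_updates=10_000):
    w = np.zeros(A.shape[1])
    for _ in range(max_updates):
        violated = A @ w <= 0
        if not violated.any():
            return w                       # strictly feasible point found
        w = w + A[np.argmax(violated)]     # update with first violated row
    raise RuntimeError("update budget exhausted")

rng = np.random.default_rng(2)
w_true = rng.standard_normal(5)
A = rng.standard_normal((100, 5))
s = A @ w_true
A = A[np.abs(s) > 0.5] * np.sign(s[np.abs(s) > 0.5])[:, None]  # strictly feasible system
w = perceptron_feasibility(A)
print(bool((A @ w > 0).all()))
```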

JMLR Journal 2002 Journal Article

Algorithmic Luckiness

  • Ralf Herbrich
  • Robert C. Williamson

Classical statistical learning theory studies the generalisation performance of machine learning algorithms rather indirectly. One of the main detours is that algorithms are studied in terms of the hypothesis class that they draw their hypotheses from. In this paper, motivated by the luckiness framework of Shawe-Taylor et al. (1998), we study learning algorithms more directly and in a way that allows us to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits the margin, the sparsity of the resultant weight vector, and the degree of clustering of the training data in feature space.

NeurIPS Conference 2002 Conference Paper

Fast Sparse Gaussian Process Methods: The Informative Vector Machine

  • Ralf Herbrich
  • Neil Lawrence
  • Matthias Seeger

We present a framework for sparse Gaussian process (GP) methods which uses forward selection with criteria based on information-theoretic principles, previously suggested for active learning. Our goal is not only to learn d-sparse predictors (which can be evaluated in O(d) rather than O(n), d ≪ n, n the number of training points), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most O(n · d²), and in large real-world classification experiments we show that it can match prediction performance of the popular support vector machine (SVM), yet can be significantly faster in training. In contrast to the SVM, our approximation produces estimates of predictive probabilities ('error bars'), allows for Bayesian model selection and is less complex in implementation.

NeurIPS Conference 2001 Conference Paper

Algorithmic Luckiness

  • Ralf Herbrich
  • Robert Williamson

In contrast to standard statistical learning theory which studies uniform bounds on the expected error we present a framework that exploits the specific learning algorithm used. Motivated by the luckiness framework [8] we are also able to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits both the margin and the distribution of the data in feature space.

JMLR Journal 2001 Journal Article

Bayes Point Machines (Kernel Machines Section)

  • Ralf Herbrich
  • Thore Graepel
  • Colin Campbell

Kernel-classifiers comprise a powerful class of non-linear decision functions for binary classification. The support vector machine is an example of a learning algorithm for kernel classifiers that singles out the consistent classifier with the largest margin, i.e. minimal real-valued output on the training sample, within the set of consistent hypotheses, the so-called version space. We suggest the Bayes point machine as a well-founded improvement which approximates the Bayes-optimal decision by the centre of mass of version space. We present two algorithms to stochastically approximate the centre of mass of version space: a billiard sampling algorithm and a sampling algorithm based on the well known perceptron algorithm. It is shown how both algorithms can be extended to allow for soft-boundaries in order to admit training errors. Experimentally, we find that - for the zero training error case - Bayes point machines consistently outperform support vector machines on both surrogate data and real-world benchmark data sets. In the soft-boundary/soft-margin case, the improvement over support vector machines is shown to be reduced. Finally, we demonstrate that the real-valued output of single Bayes points on novel test points is a valid confidence measure and leads to a steady decrease in generalisation error when used as a rejection criterion.
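
A caricature of the perceptron-based sampler: each perceptron run on a random permutation of separable training data lands somewhere in version space, and averaging the normalised solutions approximates its centre of mass. A toy sketch under that reading (zero training error case only):

```python
# Toy Bayes-point approximation: each randomly permuted perceptron run
# lands somewhere in version space; averaging the normalised solutions
# approximates its centre of mass. A caricature of the paper's
# perceptron-based sampler, for linearly separable data only.
import numpy as np

def perceptron(X, y, passes=100):
    w = np.zeros(X.shape[1])
    for _ in range(passes):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
                mistakes += 1
        if mistakes == 0:
            return w / np.linalg.norm(w)
    raise RuntimeError("not separable within budget")

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 3))
m = X @ np.array([1.0, -2.0, 0.5])
X, y = X[np.abs(m) > 0.5], np.sign(m[np.abs(m) > 0.5])   # ensure a clear margin

samples = []
for _ in range(50):                       # one version-space sample per permutation
    idx = rng.permutation(len(X))
    samples.append(perceptron(X[idx], y[idx]))
w_bp = np.mean(samples, axis=0)           # approximate Bayes point
print(bool((np.sign(X @ w_bp) == y).all()))  # still consistent: version space is convex
```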

NeurIPS Conference 2000 Conference Paper

A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

  • Ralf Herbrich
  • Thore Graepel

We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of fewer training examples than input dimensions, sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets.

NeurIPS Conference 2000 Conference Paper

From Margin to Sparsity

  • Thore Graepel
  • Ralf Herbrich
  • Robert Williamson

We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself.
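
For reference, Novikoff's theorem bounds the number of perceptron mistakes, and hence the number of terms in its kernel expansion, by (R/γ)², where R is the data radius and γ the margin of a unit-norm separator. A quick numerical check on hypothetical separable data:

```python
# Check Novikoff's mistake bound (R / gamma)^2 on separable toy data;
# this is the quantity the paper reinterprets as a margin-dependent
# sparsity guarantee for the perceptron solution.
import numpy as np

rng = np.random.default_rng(4)
w_star = np.array([0.6, -0.8])                 # unit-norm separator
X = rng.standard_normal((300, 2))
m = X @ w_star
X, y = X[np.abs(m) > 0.2], np.sign(m[np.abs(m) > 0.2])

R = np.linalg.norm(X, axis=1).max()            # data radius
gamma = (y * (X @ w_star)).min()               # margin of w_star
w, mistakes, changed = np.zeros(2), 0, True
while changed:                                 # run perceptron to convergence
    changed = False
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:
            w, mistakes, changed = w + yi * xi, mistakes + 1, True
print(mistakes, "<=", (R / gamma) ** 2)
```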

NeurIPS Conference 2000 Conference Paper

Large Scale Bayes Point Machines

  • Ralf Herbrich
  • Thore Graepel

The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) - also known as the Bayes point - exhibits excellent generalisation abilities. However, the billiard algorithm as presented in [4] is restricted to small sample size because it requires O(m²) of memory and O(N · m²) computational steps where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which allows us to overcome this algorithmic drawback. The method is algorithmically simple and is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes point machines (BPMs) are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. Finally, rejecting test points on the basis of their (approximate) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1% generalisation error for a given rejection rate of 10%.

NeurIPS Conference 2000 Conference Paper

The Kernel Gibbs Sampler

  • Thore Graepel
  • Ralf Herbrich

We present an algorithm that samples the hypothesis space of kernel classifiers. A uniform prior over normalised weight vectors combined with a likelihood based on a model of label noise leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov Chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the line chosen. The KGS can be used as an analytical tool for the exploration of Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking into account label noise.

NeurIPS Conference 1999 Conference Paper

Bayesian Transduction

  • Thore Graepel
  • Ralf Herbrich
  • Klaus Obermayer

Transduction is an inference principle that takes a training sample and aims at estimating the values of a function at given points contained in the so-called working sample as opposed to the whole of input space for induction. Transduction provides a confidence measure on single predictions rather than classifiers - a feature particularly important for risk-sensitive applications. The possibly infinite number of functions is reduced to a finite number of equivalence classes on the working sample. A rigorous Bayesian analysis reveals that for standard classification loss we cannot benefit from considering more than one test point at a time. The probability of the label of a given test point is determined as the posterior measure of the corresponding subset of hypothesis space. We consider the PAC setting of binary classification by linear discriminant functions (perceptrons) in kernel space such that the probability of labels is determined by the volume ratio in version space. We suggest to sample this region by an ergodic billiard. Experimental results on real world data indicate that Bayesian Transduction compares favourably to the well-known Support Vector Machine, in particular if the posterior probability of labellings is used as a confidence measure to exclude test points of low confidence.

NeurIPS Conference 1998 Conference Paper

Classification on Pairwise Proximity Data

  • Thore Graepel
  • Ralf Herbrich
  • Peter Bollmann-Sdorra
  • Klaus Obermayer

We investigate the problem of learning a classification task on data represented in terms of their pairwise proximities. This representation does not refer to an explicit feature representation of the data items and is thus more general than the standard approach of using Euclidean feature vectors, from which pairwise proximities can always be calculated. Our first approach is based on a combined linear embedding and classification procedure resulting in an extension of the Optimal Hyperplane algorithm to pseudo-Euclidean data. As an alternative we present another approach based on a linear threshold model in the proximity values themselves, which is optimized using Structural Risk Minimization. We show that prior knowledge about the problem can be incorporated by the choice of distance measures and examine different metrics w.r.t. their generalization. Finally, the algorithms are successfully applied to protein structure data and to data from the cat's cerebral cortex. They show better performance than K-nearest-neighbor classification.

IJCAI Conference 1997 Conference Paper

Unbiased Assessment of Learning Algorithms

  • Tobias Scheffer
  • Ralf Herbrich

In order to rank the performance of machine learning algorithms, many researchers conduct experiments on benchmark data sets. Since most learning algorithms have domain-specific parameters, it is a popular custom to adapt these parameters to obtain a minimal error rate on the test set. The same rate is then used to rank the algorithm, which causes an optimistic bias. We quantify this bias, showing, in particular, that an algorithm with more parameters will probably be ranked higher than an equally good algorithm with fewer parameters. We demonstrate this result, showing the number of parameters and trials required in order to pretend to outperform C4.5 or FOIL, respectively, for various benchmark problems. We then describe how unbiased ranking experiments should be conducted.
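
The bias is easy to simulate: even when k parameter settings are equally good, reporting the minimum of k test-set error estimates understates the true error, increasingly so as k grows. A hypothetical illustration:

```python
# Simulating the optimistic bias: k parameter settings, all with the same
# true error rate p; reporting the best of k test-set error estimates
# understates p, and more so as k (the number of tuning trials) grows.
import numpy as np

rng = np.random.default_rng(5)
p, n_test, reps = 0.20, 100, 10_000

for k in (1, 5, 25):
    # test errors of k equally good configurations, minimum taken per experiment
    errs = rng.binomial(n_test, p, size=(reps, k)) / n_test
    print(f"k={k:2d}: mean reported error = {errs.min(axis=1).mean():.3f}")
```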