Arrow Research search

Author name cluster

Gunnar Rätsch

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

41 papers
2 author rows

Possible papers (41)

ICLR Conference 2025 Conference Paper

Preference Elicitation for Offline Reinforcement Learning

  • Alizée Pace
  • Bernhard Schölkopf
  • Gunnar Rätsch
  • Giorgia Ramponi

Applying reinforcement learning (RL) to real-world problems is often made challenging by the inability to interact with the environment and the difficulty of designing reward functions. Offline RL addresses the first challenge by considering access to an offline dataset of environment interactions labeled by the reward function. In contrast, Preference-based RL does not assume access to the reward function and learns it from preferences, but typically requires an online interaction with the environment. We bridge the gap between these frameworks by exploring efficient methods for acquiring preference feedback in a fully offline setup. We propose Sim-OPRL, an offline preference-based reinforcement learning algorithm, which leverages a learned environment model to elicit preference feedback on simulated rollouts. Drawing on insights from both the offline RL and the preference-based RL literature, our algorithm employs a pessimistic approach for out-of-distribution data, and an optimistic approach for acquiring informative preferences about the optimal policy. We provide theoretical guarantees regarding the sample complexity of our approach, dependent on how well the offline data covers the optimal policy. Finally, we demonstrate the empirical performance of Sim-OPRL in various environments.

ICLR Conference 2024 Conference Paper

Delphic Offline Reinforcement Learning under Nonidentifiable Hidden Confounding

  • Alizée Pace
  • Hugo Yèche
  • Bernhard Schölkopf
  • Gunnar Rätsch
  • Guy Tennenholtz

A prominent challenge of offline reinforcement learning (RL) is the issue of hidden confounding: unobserved variables may influence both the actions taken by the agent and the observed outcomes. Hidden confounding can compromise the validity of any causal conclusion drawn from data and presents a major obstacle to effective offline RL. In the present paper, we tackle the problem of hidden confounding in the nonidentifiable setting. We propose a definition of uncertainty due to hidden confounding bias, termed delphic uncertainty, which uses variation over world models compatible with the observations, and differentiate it from the well-known epistemic and aleatoric uncertainties. We derive a practical method for estimating the three types of uncertainties, and construct a pessimistic offline RL algorithm to account for them. Our method does not assume identifiability of the unobserved confounders, and attempts to reduce the amount of confounding bias. We demonstrate through extensive experiments and ablations the efficacy of our approach on a sepsis management benchmark, as well as on electronic health records. Our results suggest that nonidentifiable hidden confounding bias can be mitigated to improve offline RL solutions in practice.

ICML Conference 2024 Conference Paper

Improving Neural Additive Models with Bayesian Principles

  • Kouroche Bouchiat
  • Alexander Immer
  • Hugo Yèche
  • Gunnar Rätsch
  • Vincent Fortuin

Neural additive models (NAMs) enhance the transparency of deep neural networks by handling input features in separate additive sub-networks. However, they lack inherent mechanisms that provide calibrated uncertainties and enable selection of relevant features and interactions. Approaching NAMs from a Bayesian perspective, we augment them in three primary ways, namely by a) providing credible intervals for the individual additive sub-networks; b) estimating the marginal likelihood to perform an implicit selection of features via an empirical Bayes procedure; and c) facilitating the ranking of feature pairs as candidates for second-order interaction in fine-tuned models. In particular, we develop Laplace-approximated NAMs (LA-NAMs), which show improved empirical performance on tabular datasets and challenging real-world medical tasks.
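
A minimal NumPy sketch of the additive structure described above: each input feature is handled by its own small sub-network and the contributions are summed. Sub-network sizes, activations, and all names below are illustrative choices; the paper's Bayesian machinery (Laplace approximation, credible intervals, feature selection) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_subnet(hidden=16):
    # One small MLP mapping a single scalar feature to a scalar contribution.
    return {"W1": rng.normal(0, 1, (1, hidden)), "b1": np.zeros(hidden),
            "W2": rng.normal(0, 1, (hidden, 1)), "b2": np.zeros(1)}

def subnet_forward(p, x_col):
    # x_col has shape (n, 1): the values of one feature across the batch.
    h = np.maximum(0.0, x_col @ p["W1"] + p["b1"])  # ReLU hidden layer
    return h @ p["W2"] + p["b2"]

def nam_forward(subnets, X, bias=0.0):
    # Additive prediction: f(x) = bias + sum_j f_j(x_j).
    contribs = [subnet_forward(p, X[:, j:j + 1]) for j, p in enumerate(subnets)]
    return bias + np.sum(contribs, axis=0).ravel(), contribs

X = rng.normal(size=(5, 3))                  # 5 samples, 3 features
subnets = [make_subnet() for _ in range(X.shape[1])]
y_hat, per_feature = nam_forward(subnets, X)
print(y_hat.shape, len(per_feature))         # (5,) 3: one curve per feature
```

The per-feature contributions are what make the model transparent: each entry of `per_feature` can be plotted against the raw feature values to inspect the learned shape function.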

ICLR Conference 2024 Conference Paper

Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion

  • Alexandru Meterez
  • Amir Joudaki
  • Francesco Orabona
  • Alexander Immer
  • Gunnar Rätsch
  • Hadi Daneshmand

Normalization layers are one of the key building blocks for deep neural networks. Several theoretical studies have shown that batch normalization improves the signal propagation, by avoiding the representations from becoming collinear across the layers. However, results on mean-field theory of batch normalization also conclude that this benefit comes at the expense of exploding gradients in depth. Motivated by these two aspects of batch normalization, in this study we pose the following question: *Can a batch-normalized network keep the optimal signal propagation properties, but avoid exploding gradients?* We answer this question in the affirmative by giving a particular construction of an *MLP with linear activations* and batch-normalization that provably has *bounded gradients* at any depth. Based on Weingarten calculus, we develop a rigorous and non-asymptotic theory for this constructed MLP that gives a precise characterization of forward signal propagation, while proving that gradients remain bounded for linearly independent input samples, which holds in most practical settings. Inspired by our theory, we also design an activation shaping scheme that empirically achieves the same properties for non-linear activations.

ICML Conference 2023 Conference Paper

Stochastic Marginal Likelihood Gradients using Neural Tangent Kernels

  • Alexander Immer
  • Tycho F. A. van der Ouderaa
  • Mark van der Wilk
  • Gunnar Rätsch
  • Bernhard Schölkopf

Selecting hyperparameters in deep learning greatly impacts its effectiveness but requires manual effort and expertise. Recent works show that Bayesian model selection with Laplace approximations makes it possible to optimize such hyperparameters just like standard neural network parameters, using gradients computed on the training data. However, estimating a single hyperparameter gradient requires a pass through the entire dataset, limiting the scalability of such algorithms. In this work, we overcome this issue by introducing lower bounds to the linearized Laplace approximation of the marginal likelihood. In contrast to previous estimators, these bounds are amenable to stochastic-gradient-based optimization and allow estimation accuracy to be traded off against computational complexity. We derive them using the function-space form of the linearized Laplace, which can be estimated using the neural tangent kernel. Experimentally, we show that the estimators can significantly accelerate gradient-based hyperparameter optimization.

ICML Conference 2023 Conference Paper

Temporal Label Smoothing for Early Event Prediction

  • Hugo Yèche
  • Alizée Pace
  • Gunnar Rätsch
  • Rita Kuznetsova

Models that can predict the occurrence of events ahead of time with low false-alarm rates are critical to the acceptance of decision support systems in the medical community. This challenging task is typically treated as a simple binary classification, ignoring temporal dependencies between samples, whereas we propose to exploit this structure. We first introduce a common theoretical framework unifying dynamic survival analysis and early event prediction. Following an analysis of objectives from both fields, we propose Temporal Label Smoothing (TLS), a simpler, yet best-performing method that preserves prediction monotonicity over time. By focusing the objective on areas with a stronger predictive signal, TLS improves performance over all baselines on two large-scale benchmark tasks. Gains are particularly notable along clinically relevant measures, such as event recall at low false-alarm rates. TLS reduces the number of missed events by up to a factor of two over previously used approaches in early event prediction.
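
One plausible NumPy instantiation of the smoothing idea, assuming labels are smoothed with a sigmoid ramp over the time remaining until the event; the exact parameterization used in the paper may differ, and the function name and constants are illustrative.

```python
import numpy as np

def temporal_label_smoothing(time_to_event, horizon, sharpness=4.0):
    """Smooth label in [0, 1] that rises monotonically as the event nears,
    replacing the hard 0/1 label "event occurs within the horizon"."""
    t = np.asarray(time_to_event, dtype=float)
    z = sharpness * (1.0 - t / horizon)        # positive inside the horizon
    labels = 1.0 / (1.0 + np.exp(-z))          # sigmoid ramp toward the event
    return np.where(np.isinf(t), 0.0, labels)  # no event: label stays 0

# Samples 0.5h, 2h and 8h before an event, plus an event-free sample.
print(temporal_label_smoothing([0.5, 2.0, 8.0, np.inf], horizon=4.0))
```

The labels decay monotonically with the distance to the event, which concentrates the training signal on the clinically informative region just before the event.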

ICLR Conference 2022 Conference Paper

Bayesian Neural Network Priors Revisited

  • Vincent Fortuin
  • Adrià Garriga-Alonso
  • Sebastian W. Ober
  • Florian Wenzel
  • Gunnar Rätsch
  • Richard E. Turner
  • Mark van der Wilk
  • Laurence Aitchison

Isotropic Gaussian priors are the de facto standard for modern Bayesian neural network inference. However, it is unclear whether these priors accurately reflect our true beliefs about the weight distributions or give optimal performance. To find better priors, we study summary statistics of neural network weights in networks trained using stochastic gradient descent (SGD). We find that convolutional neural network (CNN) and ResNet weights display strong spatial correlations, while fully connected networks (FCNNs) display heavy-tailed weight distributions. We show that building these observations into priors can lead to improved performance on a variety of image classification datasets. Surprisingly, these priors mitigate the cold posterior effect in FCNNs, but slightly increase the cold posterior effect in ResNets.

NeurIPS Conference 2022 Conference Paper

Invariance Learning in Deep Neural Networks with Differentiable Laplace Approximations

  • Alexander Immer
  • Tycho van der Ouderaa
  • Gunnar Rätsch
  • Vincent Fortuin
  • Mark van der Wilk

Data augmentation is commonly applied to improve performance of deep learning by enforcing the knowledge that certain transformations on the input preserve the output. Currently, the data augmentation parameters are chosen by human effort and costly cross-validation, which makes it cumbersome to apply to new datasets. We develop a convenient gradient-based method for selecting the data augmentation without validation data during training of a deep neural network. Our approach relies on phrasing data augmentation as an invariance in the prior distribution on the functions of a neural network, which allows us to learn it using Bayesian model selection. This has been shown to work in Gaussian processes, but not yet for deep neural networks. We propose a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective, which can be optimised without human supervision or validation data. We show that our method can successfully recover invariances present in the data, and that this improves generalisation and data efficiency on image datasets.

IJCAI Conference 2021 Conference Paper

Boosting Variational Inference With Locally Adaptive Step-Sizes

  • Gideon Dresdner
  • Saurav Shekhar
  • Fabian Pedregosa
  • Francesco Locatello
  • Gunnar Rätsch

Variational Inference makes a trade-off between the capacity of the variational family and the tractability of finding an approximate posterior distribution. Instead, Boosting Variational Inference allows practitioners to obtain increasingly good posterior approximations by spending more compute. The main obstacle to widespread adoption of Boosting Variational Inference is the amount of resources necessary to improve over a strong Variational Inference baseline. In our work, we trace this limitation back to the global curvature of the KL-divergence. We characterize how the global curvature impacts time and memory consumption, address the problem with the notion of local curvature, and provide a novel approximate backtracking algorithm for estimating local curvature. We give new theoretical convergence rates for our algorithms and provide experimental validation on synthetic and real-world datasets.
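
The local-curvature idea can be illustrated with a generic backtracking line search; this sketch uses a toy quadratic objective and the standard Armijo condition rather than the paper's KL-divergence functional, and all names are illustrative.

```python
import numpy as np

def backtracking_step(f, grad, x, direction, eta0=1.0, beta=0.5, c=1e-4):
    # Shrink the step size until the Armijo sufficient-decrease test holds,
    # adapting to the local curvature along `direction` instead of relying
    # on a conservative global curvature constant.
    eta, fx = eta0, f(x)
    slope = c * grad(x) @ direction
    while f(x + eta * direction) > fx + eta * slope:
        eta *= beta
    return eta

# Toy objective with badly scaled curvature: f(x) = 0.5 x^T A x.
A = np.diag([1.0, 100.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
d = -grad(x)                                  # descent direction
eta = backtracking_step(f, grad, x, d)
print(eta, f(x + eta * d) < f(x))             # a step size that decreases f
```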

NeurIPS Conference 2021 Conference Paper

HiRID-ICU-Benchmark --- A Comprehensive Machine Learning Benchmark on High-resolution ICU Data

  • Hugo Yèche
  • Rita Kuznetsova
  • Marc Zimmermann
  • Matthias Hüser
  • Xinrui Lyu
  • Martin Faltys
  • Gunnar Rätsch

The recent success of machine learning methods applied to time series collected from Intensive Care Units (ICU) exposes the lack of standardized machine learning benchmarks for developing and comparing such methods. While raw datasets, such as MIMIC-IV or eICU, can be freely accessed on Physionet, tasks and pre-processing are often chosen ad hoc for each publication, limiting comparability across publications. In this work, we aim to improve this situation by providing a benchmark covering a large spectrum of ICU-related tasks. Using the HiRID dataset, we define multiple clinically relevant tasks developed in collaboration with clinicians. In addition, we provide a reproducible end-to-end pipeline to construct both data and labels. Finally, we provide an in-depth analysis of current state-of-the-art sequence modeling methods, highlighting some limitations of deep learning approaches for this type of data. With this benchmark, we hope to give the research community the possibility of a fair comparison of their work.

ICML Conference 2021 Conference Paper

Neighborhood Contrastive Learning Applied to Online Patient Monitoring

  • Hugo Yèche
  • Gideon Dresdner
  • Francesco Locatello
  • Matthias Hüser
  • Gunnar Rätsch

Intensive care units (ICU) are increasingly looking towards machine learning for methods to provide online monitoring of critically ill patients. In machine learning, online monitoring is often formulated as a supervised learning problem. Recently, contrastive learning approaches have demonstrated promising improvements over competitive supervised benchmarks. These methods rely on well-understood data augmentation techniques developed for image data which do not apply to online monitoring. In this work, we overcome this limitation by supplementing time-series data augmentation techniques with a novel contrastive learning objective which we call neighborhood contrastive learning (NCL). Our objective explicitly groups together contiguous time segments from each patient while maintaining state-specific information. Our experiments demonstrate a marked improvement over existing work applying contrastive methods to medical time-series.
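
A simplified NumPy sketch of the objective: an InfoNCE-style loss in which the positives for each window are other windows from the same patient within a temporal neighborhood. The neighborhood width `w`, the temperature, and all names are illustrative; the paper's full objective additionally balances neighborhood alignment against preserving state-specific information.

```python
import numpy as np

def neighborhood_contrastive_loss(z, patient_ids, times, w=2.0, temp=0.1):
    """InfoNCE-style loss where positives are embeddings of windows from
    the same patient within a temporal neighborhood of width `w`.

    z: (n, d) L2-normalized embeddings of time-series windows.
    patient_ids, times: (n,) identifiers and timestamps of each window.
    """
    sim = z @ z.T / temp                                   # scaled similarities
    np.fill_diagonal(sim, -np.inf)                         # exclude self-pairs
    same_patient = patient_ids[:, None] == patient_ids[None, :]
    close_in_time = np.abs(times[:, None] - times[None, :]) <= w
    positives = same_patient & close_in_time
    np.fill_diagonal(positives, False)

    log_den = np.log(np.exp(sim).sum(axis=1))              # all candidates
    losses = []
    for i in range(len(z)):
        pos = np.where(positives[i])[0]
        if len(pos):                                       # average over positives
            losses.append(np.mean(log_den[i] - sim[i, pos]))
    return np.mean(losses)

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 4)); z /= np.linalg.norm(z, axis=1, keepdims=True)
pids = np.array([0, 0, 0, 0, 1, 1, 1, 1])
ts = np.array([0., 1., 2., 3., 0., 1., 2., 3.])
print(neighborhood_contrastive_loss(z, pids, ts))
```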

ICML Conference 2021 Conference Paper

Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning

  • Alexander Immer
  • Matthias Bauer 0001
  • Vincent Fortuin
  • Gunnar Rätsch
  • Mohammad Emtiyaz Khan

Marginal-likelihood based model-selection, even though promising, is rarely used in deep learning due to estimation difficulties. Instead, most approaches rely on validation data, which may not be readily available. In this work, we present a scalable marginal-likelihood estimation method to select both hyperparameters and network architectures, based on the training data alone. Some hyperparameters can be estimated online during training, simplifying the procedure. Our marginal-likelihood estimate is based on Laplace’s method and Gauss-Newton approximations to the Hessian, and it outperforms cross-validation and manual tuning on standard regression and image classification datasets, especially in terms of calibration and out-of-distribution detection. Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable (e.g., in nonstationary settings).
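
To make the estimated quantity concrete, here is the Laplace approximation of the log marginal likelihood for a linear-Gaussian model, where the approximation is exact; for deep networks the paper replaces the Hessian with Gauss-Newton approximations. The hyperparameter values in the sweep below are illustrative.

```python
import numpy as np

def laplace_log_marginal(X, y, prior_prec, noise_prec):
    """Laplace approximation of log p(y | X, hyperparameters):
    log p(y) ~= log p(y|w*) + log p(w*) + (d/2) log 2*pi - 0.5 log det H,
    where H is the Hessian of the negative log joint at the mode w*."""
    n, d = X.shape
    H = noise_prec * X.T @ X + prior_prec * np.eye(d)   # Gauss-Newton/Hessian
    w_star = noise_prec * np.linalg.solve(H, X.T @ y)   # posterior mode
    resid = y - X @ w_star
    log_lik = 0.5 * n * np.log(noise_prec / (2 * np.pi)) \
              - 0.5 * noise_prec * resid @ resid
    log_prior = 0.5 * d * np.log(prior_prec / (2 * np.pi)) \
                - 0.5 * prior_prec * w_star @ w_star
    _, logdet_H = np.linalg.slogdet(H)
    return log_lik + log_prior + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet_H

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1., -2., 0.5]) + 0.1 * rng.normal(size=50)
# Sweeping the prior precision shows the fit/complexity trade-off it encodes.
for delta in [0.01, 1.0, 100.0]:
    print(delta, laplace_log_marginal(X, y, delta, noise_prec=100.0))
```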

AAAI Conference 2020 Conference Paper

A Commentary on the Unsupervised Learning of Disentangled Representations

  • Francesco Locatello
  • Stefan Bauer
  • Mario Lucic
  • Gunnar Rätsch
  • Sylvain Gelly
  • Bernhard Schölkopf
  • Olivier Bachem

The goal of the unsupervised learning of disentangled representations is to separate the independent explanatory factors of variation in the data without access to supervision. In this paper, we summarize the results of (Locatello et al. 2019b) and focus on their implications for practitioners. We discuss the theoretical result showing that the unsupervised learning of disentangled representations is fundamentally impossible without inductive biases and the practical challenges it entails. Finally, we comment on our experimental findings, highlighting the limitations of state-of-the-art approaches and directions for future research.

ICLR Conference 2020 Conference Paper

Disentangling Factors of Variations Using Few Labels

  • Francesco Locatello
  • Michael Tschannen
  • Stefan Bauer
  • Gunnar Rätsch
  • Bernhard Schölkopf
  • Olivier Bachem

Learning disentangled representations is considered a cornerstone problem in representation learning. Recently, Locatello et al. (2019) demonstrated that unsupervised disentanglement learning without inductive biases is theoretically impossible and that existing inductive biases and unsupervised methods do not allow disentangled representations to be learned consistently. However, in many practical settings, one might have access to a limited amount of supervision, for example through manual labeling of (some) factors of variation in a few training examples. In this paper, we investigate the impact of such supervision on state-of-the-art disentanglement methods and perform a large-scale study, training over 52000 models under well-defined and reproducible experimental conditions. We observe that a small number of labeled examples (0.01--0.5% of the data set), with potentially imprecise and incomplete labels, is sufficient to perform model selection on state-of-the-art unsupervised models. Further, we investigate the benefit of incorporating supervision into the training process. Overall, we empirically validate that with little and imprecise supervision it is possible to reliably learn disentangled representations.

ICML Conference 2020 Conference Paper

Weakly-Supervised Disentanglement Without Compromises

  • Francesco Locatello
  • Ben Poole
  • Gunnar Rätsch
  • Bernhard Schölkopf
  • Olivier Bachem
  • Michael Tschannen

Intelligent agents should be able to learn useful representations by observing changes in their environment. We model such observations as pairs of non-i.i.d. images sharing at least one of the underlying factors of variation. First, we theoretically show that only knowing how many factors have changed, but not which ones, is sufficient to learn disentangled representations. Second, we provide practical algorithms that learn disentangled representations from pairs of images without requiring annotation of groups, individual factors, or the number of factors that have changed. Third, we perform a large-scale empirical study and show that such pairs of observations are sufficient to reliably learn disentangled representations on several benchmark data sets. Finally, we evaluate our learned representations and find that they are simultaneously useful on a diverse suite of tasks, including generalization under covariate shifts, fairness, and abstract reasoning. Overall, our results demonstrate that weak supervision enables learning of useful disentangled representations in realistic scenarios.

ICML Conference 2019 Conference Paper

Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations

  • Francesco Locatello
  • Stefan Bauer
  • Mario Lucic
  • Gunnar Rätsch
  • Sylvain Gelly
  • Bernhard Schölkopf
  • Olivier Bachem

The key idea behind the unsupervised learning of disentangled representations is that real-world data is generated by a few explanatory factors of variation which can be recovered by unsupervised learning algorithms. In this paper, we provide a sober look at recent progress in the field and challenge some common assumptions. We first theoretically show that the unsupervised learning of disentangled representations is fundamentally impossible without inductive biases on both the models and the data. Then, we train more than $12000$ models covering most prominent methods and evaluation metrics in a reproducible large-scale experimental study on seven different data sets. We observe that while the different methods successfully enforce properties “encouraged” by the corresponding losses, well-disentangled models seemingly cannot be identified without supervision. Furthermore, increased disentanglement does not seem to lead to a decreased sample complexity of learning for downstream tasks. Our results suggest that future work on disentanglement learning should be explicit about the role of inductive biases and (implicit) supervision, investigate concrete benefits of enforcing disentanglement of the learned representations, and consider a reproducible experimental setup covering several data sets.

ICML Conference 2018 Conference Paper

On Matching Pursuit and Coordinate Descent

  • Francesco Locatello
  • Anant Raj
  • Sai Praneeth Karimireddy
  • Gunnar Rätsch
  • Bernhard Schölkopf
  • Sebastian U. Stich
  • Martin Jaggi

Two popular examples of first-order optimization methods over linear spaces are coordinate descent and matching pursuit algorithms, with their randomized variants. While the former targets the optimization by moving along coordinates, the latter considers a generalized notion of directions. Exploiting the connection between the two algorithms, we present a unified analysis of both, providing affine invariant sublinear $O(1/t)$ rates on smooth objectives and linear convergence on strongly convex objectives. As a byproduct of our affine invariant analysis of matching pursuit, our rates for steepest coordinate descent are the tightest known. Furthermore, we show the first accelerated convergence rate $O(1/t^2)$ for matching pursuit and steepest coordinate descent on convex objectives.
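
The shared template of the two algorithm families can be sketched in a few lines: pick the direction from a dictionary most correlated with the gradient, then take the exact line-search step. With the standard basis this is steepest coordinate descent; with a general dictionary of atoms it is a matching-pursuit variant. The quadratic objective and all names below are illustrative.

```python
import numpy as np

def greedy_descent(A, b, atoms, steps=5000):
    # Minimize f(x) = 0.5 x^T A x - b^T x by repeatedly selecting the atom
    # most correlated with the gradient (Gauss-Southwell rule) and taking
    # the exact minimizing step along it.
    x = np.zeros(len(b))
    for _ in range(steps):
        g = A @ x - b                         # gradient of f at x
        scores = atoms @ g                    # <d, grad f> for every atom d
        d = atoms[np.argmax(np.abs(scores))]  # steepest direction in the set
        t = -(d @ g) / (d @ A @ d)            # exact line search on quadratic
        x += t * d
    return x

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = M @ M.T + np.eye(4)                       # strongly convex objective
b = rng.normal(size=4)
x = greedy_descent(A, b, np.eye(4))           # basis atoms = coordinate descent
print(np.allclose(x, np.linalg.solve(A, b), atol=1e-6))  # True
```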

NeurIPS Conference 2011 Conference Paper

Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation

  • Nico Goernitz
  • Christian Widmer
  • Georg Zeller
  • Andre Kahles
  • Gunnar Rätsch
  • Sören Sonnenburg

We present a novel regularization-based Multitask Learning (MTL) formulation for Structured Output (SO) prediction for the case of hierarchical task relations. Structured output learning often results in difficult inference problems and requires large amounts of training data to obtain accurate models. We propose to use MTL to exploit information available for related structured output learning tasks by means of hierarchical regularization. Due to the combination of example sets, the cost of training models for structured output prediction can easily become infeasible for real world applications. We thus propose an efficient algorithm based on bundle methods to solve the optimization problems resulting from MTL structured output learning. We demonstrate the performance of our approach on gene finding problems from the application domain of computational biology. We show that 1) our proposed solver achieves much faster convergence than previous methods and 2) the Hierarchical SO-MTL approach clearly outperforms the considered non-MTL methods.
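
The hierarchical regularization can be sketched as a penalty that pulls each task's parameters toward those of its parent in the task hierarchy. The quadratic penalty form, the toy hierarchy, and all names below are illustrative assumptions; the paper's bundle-method solver and structured-output losses are omitted.

```python
import numpy as np

def hierarchical_mtl_penalty(weights, parent, lam=1.0):
    """Hierarchical regularizer: each task's parameters are pulled toward
    those of its parent task (the root points to itself and is unpenalized).

    weights: dict task -> parameter vector; parent: dict task -> parent task.
    """
    return lam * sum(np.sum((weights[t] - weights[parent[t]]) ** 2)
                     for t in weights if parent[t] != t)

# Toy hierarchy: two leaf tasks share a common ancestor "root".
w = {"root": np.zeros(3),
     "taskA": np.array([1.0, 0.0, 0.0]),
     "taskB": np.array([0.9, 0.1, 0.0])}
parent = {"root": "root", "taskA": "root", "taskB": "root"}
print(hierarchical_mtl_penalty(w, parent))   # 1.0 + 0.82 = 1.82
```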

JMLR Journal 2010 Journal Article

The SHOGUN Machine Learning Toolbox

  • Sören Sonnenburg
  • Gunnar Rätsch
  • Sebastian Henschel
  • Christian Widmer
  • Jonas Behr
  • Alexander Zien
  • Fabio de Bona
  • Alexander Binder

We have developed a machine learning toolbox, called SHOGUN, which is designed for unified large-scale learning for a broad range of feature types and learning settings. It offers a considerable number of machine learning models such as support vector machines, hidden Markov models, multiple kernel learning, linear discriminant analysis, and more. Most of the specific algorithms are able to deal with several different data classes. We have used this toolbox in several applications from computational biology, some of them coming with no less than 50 million training examples and others with 7 billion test examples. With more than a thousand installations worldwide, SHOGUN is already widely adopted in the machine learning community and beyond. SHOGUN is implemented in C++ and interfaces to MATLAB, R, Octave, and Python, and has a stand-alone command line interface. The source code is freely available under the GNU General Public License, Version 3 at http://www.shogun-toolbox.org.

NeurIPS Conference 2008 Conference Paper

An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis

  • Gabriele Schweikert
  • Gunnar Rätsch
  • Christian Widmer
  • Bernhard Schölkopf

We study the problem of domain transfer for a supervised classification task in mRNA splicing. We consider a number of recent domain transfer methods from machine learning, including some that are novel, and evaluate them on genomic sequence data from model organisms of varying evolutionary distance. We find that in cases where the organisms are not closely related, the use of domain adaptation methods can help improve classification performance.

NeurIPS Conference 2007 Conference Paper

Boosting Algorithms for Maximizing the Soft Margin

  • Gunnar Rätsch
  • Manfred K. Warmuth
  • Karen Glocer

We present a novel boosting algorithm, called SoftBoost, designed for sets of binary labeled examples that are not necessarily separable by convex combinations of base hypotheses. Our algorithm achieves robustness by capping the distributions on the examples. Our update of the distribution is motivated by minimizing a relative entropy subject to the capping constraints and constraints on the edges of the obtained base hypotheses. The capping constraints imply a soft margin in the dual optimization problem. Our algorithm produces a convex combination of hypotheses whose soft margin is within δ of its maximum. We employ relative entropy projection methods to prove an $O(\ln N / \delta^2)$ iteration bound for our algorithm, where $N$ is the number of examples. We compare our algorithm with other approaches including LPBoost, BrownBoost, and SmoothBoost. We show that there exist cases where the number of iterations required by LPBoost grows linearly in $N$ instead of the logarithmic growth for SoftBoost. In simulation studies we show that our algorithm converges about as fast as LPBoost, faster than BrownBoost, and much faster than SmoothBoost. In a benchmark comparison we illustrate the competitiveness of our approach.

JMLR Journal 2007 Journal Article

The Need for Open Source Software in Machine Learning

  • Sören Sonnenburg
  • Mikio L. Braun
  • Cheng Soon Ong
  • Samy Bengio
  • Leon Bottou
  • Geoffrey Holmes
  • Yann LeCun
  • Klaus-Robert Müller

Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for diverse applications. However, the true potential of these methods is not used, since existing implementations are not openly shared, resulting in software with low usability, and weak interoperability. We argue that this situation can be significantly improved by increasing incentives for researchers to publish their software under an open source model. Additionally, we outline the problems authors are faced with when trying to publish algorithmic implementations of machine learning methods. We believe that a resource of peer reviewed software accompanied by short articles would be highly valuable to both the machine learning and the general scientific community.

NeurIPS Conference 2006 Conference Paper

Large Scale Hidden Semi-Markov SVMs

  • Gunnar Rätsch
  • Sören Sonnenburg

We describe Hidden Semi-Markov Support Vector Machines (SHM SVMs), an extension of HM SVMs to semi-Markov chains. This allows us to predict segmentations of sequences based on segment-based features measuring properties such as the length of the segment. We propose a novel technique to partition the problem into sub-problems. The independently obtained partial solutions can then be recombined in an efficient way, which allows us to solve label sequence learning problems with several thousands of labeled sequences. We have tested our algorithm for predicting gene structures, an important problem in computational biology. Results on a well-known model organism illustrate the great potential of SHM SVMs in computational biology.

JMLR Journal 2006 Journal Article

Large Scale Multiple Kernel Learning

  • Sören Sonnenburg
  • Gunnar Rätsch
  • Christin Schäfer
  • Bernhard Schölkopf

While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constrained quadratic program. We show that it can be rewritten as a semi-infinite linear program that can be efficiently solved by recycling the standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm works for hundreds of thousands of examples or hundreds of kernels to be combined, and helps for automatic model selection, improving the interpretability of the learning result. In a second part we discuss general speed-up mechanisms for SVMs, especially when used with sparse feature maps as they appear for string kernels, allowing us to train a string kernel SVM on a 10-million-example real-world splice data set from computational biology. We integrated multiple kernel learning in our machine learning toolbox SHOGUN, for which the source code is publicly available at http://www.fml.tuebingen.mpg.de/raetsch/projects/shogun.
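
A hedged NumPy sketch of the object being learned: a convex combination of base kernels plugged into the standard SVM decision function $f(x) = \operatorname{sign}\big(\sum_i \alpha_i y_i k(x_i, x) + b\big)$. The kernel weights `beta` and dual coefficients `alpha` are placeholders; in the paper they come from the semi-infinite LP and the underlying SVM solver.

```python
import numpy as np

def combined_kernel(kernels, beta, X1, X2):
    # Convex combination: k(x, x') = sum_k beta_k * k_k(x, x').
    return sum(b * k(X1, X2) for b, k in zip(beta, kernels))

def mkl_decision(kernels, beta, alpha, y, X_train, X_test, b=0.0):
    # SVM decision function with the multiple-kernel similarity.
    K = combined_kernel(kernels, beta, X_train, X_test)   # (n_train, n_test)
    return np.sign((alpha * y) @ K + b)

# Two simple base kernels; beta would normally come from the optimization.
linear = lambda A, B: A @ B.T
rbf = lambda A, B: np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

rng = np.random.default_rng(0)
Xtr = rng.normal(size=(6, 2))
ytr = np.array([1, 1, 1, -1, -1, -1])
alpha = np.full(6, 0.2)                       # placeholder dual coefficients
print(mkl_decision([linear, rbf], [0.5, 0.5], alpha, ytr, Xtr,
                   rng.normal(size=(3, 2))))
```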

ICML Conference 2006 Conference Paper

Totally corrective boosting algorithms that maximize the margin

  • Manfred K. Warmuth
  • Jun Liao
  • Gunnar Rätsch

We consider boosting algorithms that maintain a distribution over a set of examples. At each iteration a weak hypothesis is received and the distribution is updated. We motivate these updates as minimizing the relative entropy subject to linear constraints. For example, AdaBoost constrains the edge of the last hypothesis w.r.t. the updated distribution to be at most γ = 0. In some sense, AdaBoost is "corrective" w.r.t. the last hypothesis. A cleaner boosting method is to be "totally corrective": the edges of all past hypotheses are constrained to be at most γ, where γ is suitably adapted. Using new techniques, we prove the same iteration bounds for the totally corrective algorithms as for their corrective versions. Moreover, with adaptive γ, the algorithms provably maximize the margin. Experimentally, the totally corrective versions return smaller convex combinations of weak hypotheses than the corrective ones and are competitive with LPBoost, a totally corrective boosting algorithm with no regularization, for which no iteration bound is known.

NeurIPS Conference 2005 Conference Paper

A General and Efficient Multiple Kernel Learning Algorithm

  • Sören Sonnenburg
  • Gunnar Rätsch
  • Christin Schäfer

While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constrained quadratic program. We show that it can be rewritten as a semi-infinite linear program that can be efficiently solved by recycling the standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm helps for automatic model selection, improving the interpretability of the learning result, and works for hundreds of thousands of examples or hundreds of kernels to be combined. The learned classifier takes the standard SVM form $f(x) = \operatorname{sign}\big(\sum_{i=1}^{N} \alpha_i y_i k(x_i, x) + b\big)$.

JMLR Journal 2005 Journal Article

Efficient Margin Maximizing with Boosting

  • Gunnar Rätsch
  • Manfred K. Warmuth

AdaBoost produces a linear combination of base hypotheses and predicts with the sign of this linear combination. The linear combination may be viewed as a hyperplane in feature space where the base hypotheses form the features. It has been observed that the generalization error of the algorithm continues to improve even after all examples are on the correct side of the current hyperplane. The improvement is attributed to the experimental observation that the distances (margins) of the examples to the separating hyperplane are increasing even after all examples are on the correct side. We introduce a new version of AdaBoost, called AdaBoost$^*_\nu$, that explicitly maximizes the minimum margin of the examples up to a given precision. The algorithm incorporates a current estimate of the achievable margin into its calculation of the linear coefficients of the base hypotheses. The bound on the number of iterations needed by the new algorithm is the same as the number needed by a known version of AdaBoost that must have an explicit estimate of the achievable margin as a parameter. We also illustrate experimentally that our algorithm requires considerably fewer iterations than other algorithms that aim to maximize the margin.
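
One way to realize the coefficient rule described above, sketched in NumPy: each coefficient is the usual AdaBoost value minus a correction based on a running estimate of the achievable margin. The exact bookkeeping of the margin estimate and the correction term in the paper may differ; all names and constants here are illustrative.

```python
import numpy as np

def adaboost_star(h_preds, y, nu=0.1, T=50):
    """Sketch of margin-maximizing boosting in the spirit of AdaBoost*_nu.

    h_preds: (m, n) matrix of base-hypothesis predictions in {-1, +1}.
    The coefficient of each chosen hypothesis is reduced by a term that
    depends on rho_hat, the current estimate of the achievable margin.
    """
    m, n = h_preds.shape
    d = np.full(n, 1.0 / n)                   # distribution over examples
    alphas, chosen, edges = [], [], []
    for _ in range(T):
        e = h_preds @ (d * y)                 # edges of all base hypotheses
        j = int(np.argmax(e))                 # pick the largest edge
        gamma = np.clip(e[j], -0.999, 0.999)
        edges.append(gamma)
        rho_hat = np.clip(min(edges) - nu, -0.999, 0.999)  # margin estimate
        alpha = 0.5 * (np.log((1 + gamma) / (1 - gamma))
                       - np.log((1 + rho_hat) / (1 - rho_hat)))
        alphas.append(alpha); chosen.append(j)
        d *= np.exp(-alpha * y * h_preds[j])  # reweight the examples
        d /= d.sum()
    return chosen, np.array(alphas)

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=20)
H = rng.choice([-1, 1], size=(10, 20))
print(adaboost_star(H, y)[1][:5])             # first few coefficients
```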

JMLR Journal 2005 Journal Article

Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection

  • Koji Tsuda
  • Gunnar Rätsch
  • Manfred K. Warmuth

We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to linear constraints. The updates generalize the exponentiated gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the derivation and the analyses of the original EG update and AdaBoost generalize to the non-diagonal case. We apply the resulting matrix exponentiated gradient (MEG) update and DefiniteBoost to the problem of learning a kernel matrix from distance measurements.
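
A NumPy sketch of the MEG update for the square loss; the eigendecomposition implements the matrix log and exp for symmetric matrices. The learning rate and the single repeated example below are illustrative assumptions.

```python
import numpy as np

def sym_logm(S):
    vals, vecs = np.linalg.eigh(S)             # S symmetric positive definite
    return (vecs * np.log(vals)) @ vecs.T

def sym_expm(S):
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.exp(vals)) @ vecs.T

def meg_update(W, X, y, eta=0.1):
    # W_{t+1} = exp(log W_t - eta * grad) / Z_t : the matrix exponential
    # maps the symmetric exponent back to a positive definite matrix,
    # and dividing by the trace keeps tr(W) = 1.
    grad = 2.0 * (np.trace(W @ X) - y) * X     # gradient of (tr(WX) - y)^2
    E = sym_expm(sym_logm(W) - eta * grad)
    return E / np.trace(E)

d = 3
W = np.eye(d) / d                              # initial parameter I/d
rng = np.random.default_rng(0)
M = rng.normal(size=(d, d))
X = 0.5 * (M + M.T)                            # symmetric instance matrix
for _ in range(200):
    W = meg_update(W, X, y=0.3)
print(round(np.trace(W @ X), 3),               # prediction approaches 0.3
      np.linalg.eigvalsh(W).min() > 0)         # positive definiteness kept
```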

NeurIPS Conference 2004 Conference Paper

Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

  • Koji Tsuda
  • Gunnar Rätsch
  • Manfred K. Warmuth

We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the non-diagonal case. We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements.
Also the following related key problem has received a lot of attention recently [14, 11, 13]: Find a symmetric positive definite matrix that satisfies a number of symmetric linear inequality constraints. The new DefiniteBoost algorithm greedily chooses the most violated constraint and performs an approximated Bregman projection. In the diagonal case, we recover AdaBoost [9]. We also show how the convergence proof of AdaBoost generalizes to the non-diagonal case. 2 von Neumann Divergence or Quantum Relative Entropy If F is a real convex differentiable function on the parameter domain (symmetric d d positive definite matrices) and f (W): = F(W), then the Bregman divergence between two parameters W and W is defined as F(W, W) = F(W) - F(W) - tr[(W - W)f(W)]. When choosing F(W) = tr(W log W -W), then f(W) = log W and the corresponding Bregman divergence becomes the von Neumann divergence [8]: F(W, W) = tr(W log W - W log W - W + W). (1) In this paper, we are primarily interested in the normalized case (when tr(W) = 1). In this case, the positive symmetric definite matrices are related to density matrices commonly used in Statistical Physics and the divergence simplifies to F(W, W) = tr(W log W - W log W). If W = i ivivi is our notation for the eigenvalue decomposition, then we can rewrite the normalized divergence as ~ ~ F(W, W) = i ln ~ i + i ln j(~ vi vj )2. i i, j So this divergence quantifies the difference in the eigenvalues as well as the eigenvectors. 3 On-line Learning In this section, we present a natural extension of the Exponentiated Gradient (EG) up- date [6] to an update for symmetric positive definite matrices. At the t-th trial, the algorithm receives a symmetric instance matrix Xt Rdd. It then produces a prediction ^ yt = tr(WtXt) based on the algorithm's current symmetric positive definite parameter matrix Wt. Finally it incurs for instance1 a quadratic loss (^ yt - yt)2, 1For the sake of simplicity, we use the simple quadratic loss: L W t (W) = (tr(Xt ) - yt)2. For the general update, the gradient Lt(Wt) is exponentiated in the update (4) and this gradient must be symmetric. Following [5], more general loss functions (based on Bregman divergences) are amenable to our techniques. and updates its parameter matrix Wt. In the update we aim to solve the following problem: Wt+1 = argmin W F(W, Wt) + (tr(WXt) - yt)2, (2) where the convex function F defines the Bregman divergence. Setting the derivative with respect to W to zero, we have f (Wt+1) - f(Wt) + [(tr(Wt+1Xt) - yt)2] = 0. (3) The update rule is derived by solving (3) with respect to Wt+1, but it is not solvable in closed form. A common way to avoid this problem is to approximate tr(Wt+1Xt) by tr(WtXt) [5]. Then, we have the following update: Wt+1 = f -1(f (Wt) - 2(^yt - yt)Xt). In our case, F(W) = tr(W log W - W) and thus f(W) = log W and f-1(W) = exp W. We also augment (2) with the constraint tr(W) = 1, leading to the following Matrix Exponential Gradient (MEG) Update: 1 Wt+1 = exp(log W Z t - 2(^yt - yt)Xt), (4) t where the normalization factor Zt is tr[exp(log Wt - 2(^yt - yt)Xt)]. Note that in the above update, the exponent log Wt - 2(^yt - yt)Xt is an arbitrary symmetric matrix and the matrix exponential converts this matrix back into a symmetric positive definite matrix. A numerically stable version of the MEG update is given in Section 3. 2. 3. 1 Relative Loss Bounds We now begin with the definitions needed for the relative loss bounds. Let S = (X1, y1), .. 
., (XT, yT ) denote a sequence of examples, where the instance matrices Xt Rdd are symmetric and the labels yt R. For any symmetric positive semi-definite ma- trix U with tr(U) = 1, define its total loss as LU(S) = T (tr(UX t=1 t ) - yt)2. The total loss of the on-line algorithm is LMEG(S) = T (tr(W t=1 tXt) - yt)2. We prove a bound on the relative loss LMEG(S) -LU(S) that holds for any U. The proof generalizes a sim- ilar bound for the Exponentiated Gradient update (Lemmas 5. 8 and 5. 9 of [6]). The relative loss bound is derived in two steps: Lemma 3. 1 bounds the relative loss for an individual trial and Lemma 3. 2 for a whole sequence (Proofs are given in the full paper). Lemma 3. 1 Let Wt be any symmetric positive definite matrix. Let Xt be any symmetric matrix whose smallest and largest eigenvalues satisfy max - min r. Assume Wt+1 is produced from Wt by the MEG update and let U be any symmetric positive semi-definite matrix. Then for any constants a and b such that 0 a(yt - tr(WtXt))2 - b(yt - tr(UXt))2 (U, Wt) - (U, Wt+1) (5) In the proof, we use the Golden-Thompson inequality [3], i. e. , tr[exp(A + B)] tr[exp(A) exp(B)] for symmetric matrices A and B. We also needed to prove the fol- lowing generalization of Jensen's inequality to matrices: exp(1A + 2(I - A)) exp(1)A + exp(2)(I - A) for finite 1, 2 R and any symmetric matrix A with 0 c 1 1 LMEG(S) 1 + LU(S) + + r2(U, W 2 2 c 1). (6) Proof For the maximum tightness of (5), a should be chosen as a = = 2b/(2 + r2b). Let b = c/r2, and thus a = 2c/(r2(2 + c)). Then (5) is rewritten as 2c (y 2 + c t - tr(WtXt))2 - c(yt - tr(UXt))2 r2((U, Wt) - (U, Wt+1)) Adding the bounds for t = 1, , T, we get 2c L 2 + c MEG(S) - cLU(S) r2((U, W1) - (U, Wt+1)) r2(U, W1), which is equivalent to (6). Assuming LU(S) max and (U, W1) dmax, the bound (6) is tightest when c = r 2dmax/ max. Then we have LMEG(S) - LU(S) r2 maxdmax + r2(U, W 2 1). 3. 2 Numerically stable MEG update The MEG update is numerically unstable when the eigenvalues of Wt are around zero. However we can "unwrap" Wt+1 as follows: 1 t Wt+1 = exp(c (^ y ~ tI + log W1 s Z - 2 - ys)Xs), (7) t s=1 where the constant ~ Zt normalizes the trace of Wt+1 to one. As long as the eigen values of W1 are not too small then the computation of log Wt is stable. Note that the update is inde- pendent of the choice of ct R. We incrementally maintain an eigenvalue decomposition of the matrix in the exponent (O(n3) per iteration): t VttVTt = ctI + log W1 - 2 (^ys - ys)Xs), s=1 where the constant ct is chosen so that the maximum eigenvalue of the above is zero. Now Wt+1 = Vt exp(t)VTt /tr(exp(t)). 4 Bregman Projection and DefiniteBoost In this section, we address the following Bregman projection problem2 W = argmin W F (W, W1), tr(W) = 1, tr(WCj ) 0, for j = 1, .. ., n, (8) where the symmetric positive definite matrix W1 of trace one is the initial parameter ma- trix, and C1, .. ., Cn are arbitrary symmetric matrices. Prior knowledge about W is en- coded in the constraints, and the matrix closest to W1 is chosen among the matrices satis- fying all constraints. Tsuda and Noble [13] employed this approach for learning a kernel matrix among graph nodes, and this method can be potentially applied to learn a kernel matrix in other settings (e. g. [14, 11]). The problem (8) is a projection of W1 to the intersection of convex regions defined by the constraints. It is well known that the Bregman projection into the intersection of convex regions can be solved by sequential projections to each region [1]. 
In the original papers only asymptotic convergence was shown. More recently a connection [4, 7] was made to the AdaBoost algorithm which has an improved convergence analysis [2, 9]. We generalize the latter algorithm and its analysis to symmetric positive definite matrices and call the new algorithm DefiniteBoost. As in the original setting, only approximate projections (Figure 1) are required to show fast convergence. 2Note that if is large then the on-line update (2) becomes a Bregman projection subject to a single equality constraint tr(WXt) = yt. Approximate Figure 1: In (exact) Bregman projections, the intersection Projection of convex sets (i. e. , two lines here) is found by iterating pro- jections to each set. We project only approximately, so the projected point does not satisfy the current constraint. Nev- ertheless, global convergence to the optimal solution is guar- anteed via our proofs. Exact Projection Before presenting the algorithm, let us derive the dual problem of (8) by means of Lagrange multipliers, n = argmin log trexp(logW1- jCj), j 0. (9) j=1 See [13] for a detailed derivation of the dual problem. When (8) is feasible, the opti- mal solution is described as W = 1 exp(log W C ) = Z( 1 j ), where Z( ) - nj=1 j tr[exp(log W1 - n C j=1 j j )]. 4. 1 Exact Bregman Projections First, let us present the exact Bregman projection algorithm to solve (8). We start from the initial parameter W1. At the t-th step, the most unsatisfied constraint is chosen, jt = argmaxj=1, ,n tr(WtCj). Let us use Ct as the short notation for Cj. Then, the t following Bregman projection with respect to the chosen constraint is solved. Wt+1 = argmin (W, W W t), tr(W) = 1, tr(WCt) 0. (10) By means of a Lagrange multiplier, the dual problem is described as t = argmin tr[exp(log Wt - Ct)], 0. (11) Using the solution of the dual problem, Wt is updated as 1 Wt+1 = exp(log W Z t - tCt) (12) t(t) where the normalization factor is Zt(t) = tr[exp(log Wt -tCt)]. Note that we can use the same numerically stable update as in the previous section. 4. 2 Approximate Bregman Projections The solution of (11) cannot be obtained in closed form. However, one can use the following approximate solution: 1 1 + r t/max t t = log, (13) max t - min t 1 + rt/min t when the eigenvalues of Ct lie in the interval [min t, max t ] and rt = tr(WtCt). Since the most unsatisfied constraint is chosen, rt 0 and thus t 0. Although the projection is done only approximately, 3 the convergence of the dual objective (9) can be shown using the following upper bound. 3The approximate Bregman projection (with t as in (13) can also be motivated as an online algorithm based on an entropic loss and learning rate one (following Section 3 and [4]). Theorem 4. 1 The dual objective (9) is bounded as n T tr explogW1- jCj (rt) (14) j=1 t=1 max min t -t rt max max t -min t rt t -min t where (rt) = 1 - 1. max - t min t The dual objective is monotonically decreasing, because (rt) 1. Also, since rt corre- sponds to the maximum value among all constraint violations {rj}nj=1, we have (rt) = 1 only if rt = 0. Thus the dual objective continues to decrease until all constraints are satisfied. 4. 3 Relation to Boosting When all matrices are diagonal, the DefiniteBoost degenerates to AdaBoost [9]: Let {xi, yi}di=1 be the training samples, where xi Rm and yi {-1, 1}. Let h1(x), .. ., hn(x) [-1, 1] be the weak hypotheses. For the j-th hypothesis hj(x), let us define Cj = diag(y1hj(x1), .. ., ydhj(xd)). Since |yhj(x)| 1, max/min t = 1 for any t. 
Setting W1 = I/d, the dual objective (14) is rewritten as d n 1 exp d -yi jhj(xi), i=1 j=1 which is equivalent to the exponential loss function used in AdaBoost. Since Cj and W1 are diagonal, the matrix Wt stays diagonal after the update. If wti = [Wt]ii, the updating formula (12) becomes the AdaBoost update: wt+1, i = wti exp(-tyiht(xi))/Zt(t). The approximate solution of t (13) is described as t = 1 log 1+rt, where r 2 1-r t is the weighted t training error of the t-th hypothesis, i. e. rt = d w i=1 tiyiht(xi). 5 Experiments on Learning Kernels In this section, our technique is applied to learning a kernel matrix from a set of distance measurements. This application is not on-line per se, but it shows nevertheless that the theoretical bounds can be reasonably tight on natural data. When K is a d d kernel matrix among d objects, then the Kij characterizes the similarity between objects i and j. In the feature space, Kij corresponds to the inner product between object i and j, and thus the Euclidean distance can be computed from the entries of the kernel matrix [10]. In some cases, the kernel matrix is not given explicitly, but only a set of distance measurements is available. The data are represented either as (i) quantitative distance values (e. g. , the distance between i and j is 0. 75), or (ii) qualitative evaluations (e. g. , the distance between i and j is small) [14, 13]. Our task is to obtain a positive definite kernel matrix which fits well to the given distance data. On-line kernel learning In the first experiment, we consider the on-line learning scenario in which only one distance example is shown to the learner at each time step. The distance example at time t is described as {at, bt, yt}, which indicates that the squared Euclidean distance between objects at and bt is yt. Let us define a time-developing sequence of kernel matrices as {Wt}Tt=1, and the corresponding points in the feature space as {xti}di=1 (i. e. [Wt]ab = xtaxtb). Then, the total loss incurred by this sequence is T T 2 2 xta = (tr(W t - xtbt - yt tXt) - yt)2, t=1 t=1 1. 8 0. 45 1. 6 0. 4 1. 4 0. 35 1. 2 0. 3 1 0. 25 0. 8 Total Loss 0. 2 0. 6 Classification Error 0. 4 0. 15 0. 2 0. 1 0 0. 05 0 0. 5 1 1. 5 2 2. 5 3 0 0. 5 1 1. 5 2 2. 5 3 5 Iterations 5 Iterations x 10 x 10 Figure 2: Numerical results of on-line learning. (Left) total loss against the number of iterations. The dashed line shows the loss bound. (Right) classification error of the nearest neighbor classifier using the learned kernel. The dashed line shows the error by the target kernel. where Xt is a symmetric matrix whose (at, at) and (bt, bt) elements are 0. 5, (at, bt) and (bt, at) elements are -0. 5, and all the other elements are zero. We consider a controlled experiment in which the distance examples are created from a known target kernel matrix. We used a 52 52 kernel matrix among gyrB proteins of bacteria (d = 52). This data contains three bacteria species (see [12] for details). Each distance example is created by randomly choosing one element of the target kernel. The initial parameter was set as W1 = I/d. When the comparison matrix U is set to the target matrix, LU (S) = 0 and max = 0, because all the distance examples are derived from the target matrix. Therefore we choose learning rate = 2, which minimizes the relative loss bound of Lemma 3. 2. The total loss of the kernel matrix sequence obtained by the matrix exponential update is shown in Figure 2 (left). In the plot, we have also shown the relative loss bound. 
The bound seems to give a reasonably tight performance guarantee--it is about twice the actual total loss. To evaluate the learned kernel matrix, the prediction accuracy of bacteria species by the nearest neighbor classifier is calculated (Figure 2, right), where the 52 proteins are randomly divided into 50% training and 50% testing data. The value shown in the plot is the test error averaged over 10 different divisions. It took a large number of iterations ( 2 105) for the error rate to converge to the level of the target kernel. In practice one can often increase the learning rate for faster convergence, but here we chose the small rate suggested by our analysis to check the tightness of the bound. Kernel learning by Bregman projection Next, let us consider a batch learning sce- nario where we have a set of qualitative distance evaluations (i. e. inequality constraints). Given n pairs of similar objects {aj, bj}nj=1, the inequality constraints are constructed as xaj - xbj, j = 1, .. ., n, where is a predetermined constant. If Xj is de- fined as in the previous section and Cj = Xj - I, the inequalities are then rewritten as tr(WCj) 0, j = 1, .. ., n. The largest and smallest eigenvalues of any Cj are 1 - and -, respectively. As in the previous section, distance examples are generated from the target kernel matrix between gyrB proteins. Setting = 0. 2/d, we collected all object pairs whose distance in the feature space is less than to yield 980 inequalities (n = 980). Figure 3 (left) shows the convergence of the dual objective function as proven in Theo- rem 4. 1. The convergence was much faster than the previous experiment, because, in the batch setting, one can choose the most unsatisfied constraint, and optimize the step size as well. Figure 3 (right) shows the classification error of the nearest neighbor classifier. As opposed to the previous experiment, the error rate is higher than that of the target kernel matrix, because substantial amount of information is lost by the conversion to inequality constraints. 55 0. 8 50 0. 7 45 0. 6 40 0. 5 35 0. 4 Dual Obj 30 0. 3 Classification Error 25 0. 2 20 0. 1 15 0 0 50 100 150 200 250 300 0 50 100 150 200 250 300 Iterations Iterations Figure 3: Numerical results of Bregman projection. (Left) convergence of the dual objective function. (Right) classification error of the nearest neighbor classifier using the learned kernel. 6 Conclusion We motivated and analyzed a new update for symmetric positive matrices using the von Neumann divergence. We showed that the standard bounds for on-line learning and Boost- ing generalize to the case when the parameters are a symmetric positive definite matrix (of trace one) instead of a probability vector. As in quantum physics, the eigenvalues act as probabilities. Acknowledgment We would like to thank B. Sch olkopf, M. Kawanabe, J. Liao and W. S. Noble for fruitful discussions. M. W. was supported by NSF grant CCR 9821087 and UC Discovery grant LSIT02-10110. K. T. and G. R. gratefully acknowledge partial support from the PASCAL Network of Excellence (EU #506778). Part of this work was done while all three authors were visiting the National ICT Australia in Canberra.

NeurIPS Conference 2003 Conference Paper

Image Reconstruction by Linear Programming

  • Koji Tsuda
  • Gunnar Rätsch

A common way of image denoising is to project a noisy image to the subspace of admissible images made for instance by PCA. However, a major drawback of this method is that all pixels are updated by the projection, even when only a few pixels are corrupted by noise or occlusion. We propose a new method to identify the noisy pixels by $\ell_1$-norm penalization and update the identified pixels only. The identification and updating of noisy pixels are formulated as one linear program which can be solved efficiently. In particular, one can apply the ν-trick to directly specify the fraction of pixels to be reconstructed. Moreover, we extend the linear program to be able to exploit prior knowledge that occlusions often appear in contiguous blocks (e.g. sunglasses on faces). The basic idea is to penalize boundary points and interior points of the occluded area differently. We are able to show the ν-property also for this extended LP, leading to a method that is easy to use. Experimental results impressively demonstrate the power of our approach.

NeurIPS Conference 2002 Conference Paper

Adapting Codes and Embeddings for Polychotomies

  • Gunnar Rätsch
  • Sebastian Mika
  • Alex Smola

In this paper we consider formulations of multi-class problems based on a generalized notion of a margin and using output coding. This includes, but is not restricted to, standard multi-class SVM formulations. Differently from many previous approaches, we learn the code as well as the embedding function. We illustrate how this can lead to a formulation that allows for solving a wider range of problems, with for instance many classes or even “missing classes”. To keep our optimization problems tractable we propose an algorithm capable of solving them using two-class classifiers, similar in spirit to Boosting.

NeurIPS Conference 2001 Conference Paper

A New Discriminative Kernel From Probabilistic Models

  • Koji Tsuda
  • Motoaki Kawanabe
  • Gunnar Rätsch
  • Sören Sonnenburg
  • Klaus-Robert Müller

Recently, Jaakkola and Haussler proposed a method for constructing kernel functions from probabilistic models. Their so-called "Fisher kernel" has been combined with discriminative classifiers such as SVM and applied successfully in e.g. DNA and protein analysis. Whereas the Fisher kernel (FK) is calculated from the marginal log-likelihood, we propose the TOP kernel derived from Tangent vectors Of Posterior log-odds. Furthermore, we develop a theoretical framework on feature extractors from probabilistic models and use it for analyzing FK and TOP. In experiments our new discriminative TOP kernel compares favorably to the Fisher kernel.
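Both kernels are inner products of gradient-based feature vectors, so a generic sketch only needs the tangent computation. Here the gradients are taken numerically, log_fn is a placeholder for either the marginal log-likelihood (Fisher kernel) or the posterior log-odds (TOP kernel), and the arrangement of the TOP features is a simplified reading of the abstract rather than the paper's exact definition.

```python
import numpy as np

def tangent_features(x, log_fn, theta, h=1e-5):
    """Numerical tangent vector of log_fn(x, theta) w.r.t. theta.
    With log_fn = marginal log-likelihood this gives Fisher-kernel
    features; with log_fn = posterior log-odds, the tangent part of
    TOP-kernel features."""
    theta = np.asarray(theta, dtype=float)
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = h
        g[i] = (log_fn(x, theta + e) - log_fn(x, theta - e)) / (2 * h)
    return g

def top_kernel(x1, x2, log_odds, theta):
    """TOP kernel as an inner product of (value, tangent) feature vectors."""
    feats = lambda x: np.concatenate(([log_odds(x, theta)],
                                      tangent_features(x, log_odds, theta)))
    return float(feats(x1) @ feats(x2))
```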

NeurIPS Conference 2001 Conference Paper

Active Learning in the Drug Discovery Process

  • Manfred K. Warmuth
  • Gunnar Rätsch
  • Michael Mathieson
  • Jun Liao
  • Christian Lemmen

We investigate the following data mining problem from Computational Chemistry: from a large data set of compounds, find those that bind to a target molecule in as few iterations of biological testing as possible. In each iteration a comparatively small batch of compounds is screened for binding to the target. We apply active learning techniques for selecting the successive batches. One selection strategy picks unlabeled examples closest to the maximum margin hyperplane. Another produces many weight vectors by running perceptrons over multiple permutations of the data; each weight vector votes with its prediction, and we pick the unlabeled examples for which the vote is most evenly split. For a third selection strategy, note that each unlabeled example bisects the version space of consistent weight vectors. We estimate the volume on both sides of the split by bouncing a billiard through the version space and select unlabeled examples that cause the most even split of the version space. We demonstrate on two data sets provided by DuPont Pharmaceuticals that all three selection strategies perform comparably well and are much better than selecting random batches for testing.
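A sketch of the first selection strategy (pick the unlabeled compounds closest to the maximum-margin hyperplane); the scikit-learn SVC stand-in, the linear kernel, and the batch size are assumptions for illustration, not the classifiers used in the paper.

```python
import numpy as np
from sklearn.svm import SVC

def select_batch(X_labeled, y_labeled, X_pool, batch_size=10):
    """Margin-based active selection sketch: train a maximum-margin
    classifier on the labeled compounds, then return the indices of the
    pool examples closest to the separating hyperplane."""
    clf = SVC(kernel="linear").fit(X_labeled, y_labeled)
    margins = np.abs(clf.decision_function(X_pool))
    return np.argsort(margins)[:batch_size]   # indices into X_pool
```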

NeurIPS Conference 2001 Conference Paper

On the Convergence of Leveraging

  • Gunnar Rätsch
  • Sebastian Mika
  • Manfred K. Warmuth

We give a unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the Least-Square-Boost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotic convergence results for all these methods. Our analysis includes ℓ1-norm regularized cost functions, leading to a clean and general way to regularize ensemble learning.

1 Introduction

We show convergence rates of ensemble learning methods such as AdaBoost [10], Logistic Regression (LR) [11, 5] and the Least-Square (LS) regression algorithm called LS-Boost [12]. These algorithms have in common that they iteratively call a base learning algorithm L (also called the weak learner) on a weighted training sample. The base learner is expected to return in each iteration t a hypothesis ĥ_t from some hypothesis set of weak hypotheses H that has small weighted training error. This is the weighted number of false predictions in classification and the weighted estimation error in regression. These hypotheses are then linearly combined to form the final hypothesis f̂(x) = Σ_t α_t ĥ_t(x), where the α_t are the combination coefficients.
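The generic leveraging loop described in this introduction can be sketched as follows; the exponential weighting and the coefficient rule shown here correspond to the AdaBoost instance of the framework, and base_learner is a hypothetical callable returning a ±1-valued hypothesis.

```python
import numpy as np

def leveraging(X, y, base_learner, n_rounds=100):
    """Generic leveraging loop (sketch): each round, call a base learner
    on a weighted sample and add its hypothesis to a linear combination.
    Labels y are in {-1, +1}; base_learner(X, y, d) must return a
    callable h with h(X) in {-1, +1}."""
    f = np.zeros(len(y))                        # combined scores so far
    hypotheses, alphas = [], []
    for _ in range(n_rounds):
        d = np.exp(-y * f); d /= d.sum()        # AdaBoost example weights
        h = base_learner(X, y, d)
        pred = h(X)
        eps = d[pred != y].sum()                # weighted training error
        if eps <= 0 or eps >= 0.5:              # degenerate base hypothesis
            break
        alpha = 0.5 * np.log((1 - eps) / eps)   # AdaBoost coefficient
        f += alpha * pred
        hypotheses.append(h); alphas.append(alpha)
    return hypotheses, alphas
```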

NeurIPS Conference 2000 Conference Paper

A Mathematical Programming Approach to the Kernel Fisher Algorithm

  • Sebastian Mika
  • Gunnar Rätsch
  • Klaus-Robert Müller

We investigate a new kernel-based classifier: the Kernel Fisher Discriminant (KFD). A mathematical programming formulation based on the observation that KFD maximizes the average margin permits an interesting modification of the original KFD algorithm, yielding the sparse KFD. We find that both KFD and the proposed sparse KFD can be understood in a unifying probabilistic context. Furthermore, we show connections to Support Vector Machines and Relevance Vector Machines. From this understanding, we are able to outline an interesting kernel-regression technique based upon the KFD algorithm. Simulations support the usefulness of our approach.
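For reference, the classical linear-system route to the KFD expansion coefficients can be sketched as below; this is the standard formulation, not the paper's mathematical programming variant, and the ridge parameter mu is an assumed regularizer.

```python
import numpy as np

def kfd(K, y, mu=1e-3):
    """Kernel Fisher Discriminant sketch: alpha ∝ (N + mu*I)^{-1}(m_+ - m_-),
    where m_c are class-mean kernel columns and N is the within-class
    scatter in the kernel-induced feature space. K is the (n x n) kernel
    matrix, y the labels in {-1, +1}."""
    n = len(y)
    m, N = {}, np.zeros((n, n))
    for c in (+1, -1):
        idx = np.where(y == c)[0]
        Kc = K[:, idx]
        m[c] = Kc.mean(axis=1)                       # class-mean column
        P = np.eye(len(idx)) - np.full((len(idx),) * 2, 1 / len(idx))
        N += Kc @ P @ Kc.T                           # within-class scatter
    alpha = np.linalg.solve(N + mu * np.eye(n), m[+1] - m[-1])
    return alpha    # classify via sign(K[:, i] @ alpha + b), b from training
```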

NeurIPS Conference 1999 Conference Paper

Invariant Feature Extraction and Classification in Kernel Spaces

  • Sebastian Mika
  • Gunnar Rätsch
  • Jason Weston
  • Bernhard Schölkopf
  • Alex Smola
  • Klaus-Robert Müller

In hyperspectral imagery one pixel typically consists of a mixture of the reflectance spectra of several materials, where the mixture coefficients correspond to the abundances of the constituting materials. We assume linear combinations of reflectance spectra with some additive normal sensor noise and derive a probabilistic MAP framework for analyzing hyperspectral data. As the material reflectance characteristics are not known a priori, we face the problem of unsupervised linear unmixing. The incorporation of different prior information (e.g. positivity and normalization of the abundances) naturally leads to a family of interesting algorithms, for example in the noise-free case yielding an algorithm that can be understood as constrained independent component analysis (ICA). Simulations underline the usefulness of our theory.
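A minimal sketch of the per-pixel estimate under the positivity and normalization priors mentioned above, using the standard augmented nonnegative least-squares trick; the endmember matrix S and the weight delta are illustrative, and this is only the supervised sub-step, not the paper's unsupervised ICA-style algorithm.

```python
import numpy as np
from scipy.optimize import nnls

def unmix_pixel(S, y, delta=1e3):
    """Abundance estimate for one pixel under a linear mixing model with
    Gaussian noise: min_a ||y - S a||^2 with a >= 0 and sum(a) = 1,
    the sum-to-one constraint enforced softly by a heavily weighted
    extra row. S is (bands x materials), y a pixel spectrum."""
    S_aug = np.vstack([S, delta * np.ones(S.shape[1])])
    y_aug = np.append(y, delta)
    a, _ = nnls(S_aug, y_aug)    # nonnegative least squares
    return a
```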

NeurIPS Conference 1999 Conference Paper

v-Arc: Ensemble Learning in the Presence of Outliers

  • Gunnar Rätsch
  • Bernhard Schölkopf
  • Alex Smola
  • Klaus-Robert Müller
  • Takashi Onoda
  • Sebastian Mika

AdaBoost and other ensemble methods have successfully been applied to a number of classification tasks, seemingly defying problems of overfitting. AdaBoost performs gradient descent in an error function with respect to the margin, asymptotically concentrating on the patterns which are hardest to learn. For very noisy problems, however, this can be disadvantageous. Indeed, theoretical analysis has shown that the margin distribution, as opposed to just the minimal margin, plays a crucial role in understanding this phenomenon. Loosely speaking, some outliers should be tolerated if this has the benefit of substantially increasing the margin on the remaining points. We propose a new boosting algorithm which allows for the possibility of a pre-specified fraction of points to lie in the margin area or even on the wrong side of the decision boundary.

NeurIPS Conference 1998 Conference Paper

Kernel PCA and De-Noising in Feature Spaces

  • Sebastian Mika
  • Bernhard Schölkopf
  • Alex Smola
  • Klaus-Robert Müller
  • Matthias Scholz
  • Gunnar Rätsch

Kernel PCA as a nonlinear feature extractor has proven powerful as a preprocessing step for classification algorithms. But it can also be considered as a natural generalization of linear principal component analysis. This gives rise to the question how to use nonlinear features for data compression, reconstruction, and de-noising, applications common in linear PCA. This is a nontrivial task, as the results provided by kernel PCA live in some high dimensional feature space and need not have pre-images in input space. This work presents ideas for finding approximate pre-images, focusing on Gaussian kernels, and shows experimental results using these pre-images in data reconstruction and de-noising on toy examples as well as on real world data.

1 PCA and Feature Spaces

Principal Component Analysis (PCA) (e.g. [3]) is an orthogonal basis transformation. The new basis is found by diagonalizing the centered covariance matrix of a data set {x_k ∈ R^N | k = 1, …, ℓ}, defined by C = ⟨(x_i − ⟨x_k⟩)(x_i − ⟨x_k⟩)ᵀ⟩. The coordinates in the Eigenvector basis are called principal components. The size of an Eigenvalue λ corresponding to an Eigenvector v of C equals the amount of variance in the direction of v. Furthermore, the directions of the first n Eigenvectors corresponding to the biggest n Eigenvalues cover as much variance as possible by n orthogonal directions. In many applications they contain the most interesting information: for instance, in data compression, where we project onto the directions with biggest variance to retain as much information as possible, or in de-noising, where we deliberately drop directions with small variance.
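The linear-PCA pipeline recalled in this passage (diagonalize the centered covariance, keep the top-n directions, project) is a few lines of NumPy, sketched below; the kernel-space pre-image problem the paper actually addresses has no such direct closed form.

```python
import numpy as np

def pca_denoise(X, n_components):
    """Linear PCA denoising sketch: project data onto the n directions of
    largest variance and reconstruct. X is (samples x features)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    C = Xc.T @ Xc / len(X)                  # centered covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
    V = eigvecs[:, -n_components:]          # top-n principal directions
    return mean + (Xc @ V) @ V.T            # reconstruction from top-n PCs
```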

NeurIPS Conference 1998 Conference Paper

Regularizing AdaBoost

  • Gunnar Rätsch
  • Takashi Onoda
  • Klaus Müller

Boosting methods maximize a hard classification margin and are known as powerful techniques that do not exhibit overfitting for low-noise cases. For noisy data, however, boosting will still try to enforce a hard margin and thereby give too much weight to outliers, which then leads to the dilemma of non-smooth fits and overfitting. We therefore propose three algorithms to allow for soft margin classification by introducing regularization with slack variables into the boosting concept: (1) AdaBoost_reg and regularized versions of (2) linear and (3) quadratic programming AdaBoost. Experiments show the usefulness of the proposed algorithms in comparison to another soft margin classifier: the support vector machine.
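A toy illustration (with assumed numbers) of the weight concentration the abstract refers to: under the hard-margin exponential weighting, a single point with a persistently negative margin absorbs nearly all of the example weight as the combined function grows, which is the dilemma the regularized variants are designed to avoid.

```python
import numpy as np

# 20 typical points with margin +0.5 and one mislabeled outlier at -0.5;
# as the scale of the combined function f grows round by round, the
# outlier's share of the exponential weights d_i ∝ exp(-y_i f(x_i)) -> 1.
margins = np.append(np.full(20, 0.5), -0.5)
for scale in range(1, 11):
    w = np.exp(-scale * margins)
    print(f"scale {scale:2d}: outlier weight share = {w[-1] / w.sum():.3f}")
```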