Author name cluster

Robert Williamson

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

24 papers

1 author row

TMLR Journal 2025 Journal Article

Aggregating Algorithm and Axiomatic Loss Aggregation

Armando J Cabrera Pacheco
Rabanus Derr
Robert Williamson

Supervised learning has gone beyond the empirical risk minimization framework. Central to most of these developments is the introduction of more general aggregation functions for losses incurred by the learner. In this paper, we turn towards online learning under expert advice. Via easily justified assumptions we characterize a set of reasonable loss aggregation functions as quasi-sums. Based upon this insight, we suggest how to tailor Vovk's Aggregating Algorithm to these more general aggregation functions. The "change of variables" we propose, let us highlight that "weighting profiles" determine the contribution of each expert to the next prediction according to their loss and the multiplicative structure of the weight updates in the Aggregating Algorithm translates into the additive structure of the loss aggregation in the regret bound. In addition, we suggest that the mixability of the loss function, which is functionally necessary for the Aggregating Algorithm, is intrinsically relative to the log loss, because the standard aggregation of losses in online learning is the sum. Finally, we conceptually and empirically argue that our generalized loss aggregation functions express the attitude of the learner towards losses.

TMLR Journal 2023 Journal Article

Tailoring to the Tails: Risk Measures for Fine-Grained Tail Sensitivity

Christian Fröhlich
Robert Williamson

Expected risk minimization (ERM) is at the core of many machine learning systems. This means that the risk inherent in a loss distribution is summarized using a single number - its average. In this paper, we propose a general approach to construct risk measures which exhibit a desired tail sensitivity and may replace the expectation operator in ERM. Our method relies on the specification of a reference distribution with a desired tail behaviour, which is in a one-to-one correspondence to a coherent upper probability. Any risk measure, which is compatible with this upper probability, displays a tail sensitivity which is finely tuned to the reference distribution. As a concrete example, we focus on divergence risk measures based on f-divergence ambiguity sets, which are a widespread tool used to foster distributional robustness of machine learning systems. For instance, we show how ambiguity sets based on the Kullback-Leibler divergence are intricately tied to the class of subexponential random variables. We elaborate the connection of divergence risk measures and rearrangement invariant Banach norms.

TMLR Journal 2023 Journal Article

The Geometry of Mixability

Armando J Cabrera Pacheco
Robert Williamson

Mixable loss functions are of fundamental importance in the context of prediction with expert advice in the online setting since they characterize fast learning rates. By re-interpreting properness from the point of view of differential geometry, we provide a simple geometric characterization of mixability for the binary and multi-class cases: a proper loss function $\ell$ is $\eta$-mixable if and only if the superprediction set $\textrm{spr}(\eta \ell)$ of the scaled loss function $\eta \ell$ slides freely inside the superprediction set $\textrm{spr}(\ell_{\log})$ of the log loss $\ell_{\log}$, under fairly general assumptions on the differentiability of $\ell$. Our approach provides a way to treat some concepts concerning loss functions (like properness) in a ''coordinate-free'' manner and reconciles previous results obtained for mixable loss functions for the binary and the multi-class cases.

NeurIPS Conference 2019 Conference Paper

A Primal-Dual link between GANs and Autoencoders

Hisham Husain
Richard Nock
Robert Williamson

Since the introduction of Generative Adversarial Networks (GANs) and Variational Autoencoders (VAE), the literature on generative modelling has witnessed an overwhelming resurgence. The impressive, yet elusive empirical performance of GANs has lead to the rise of many GAN-VAE hybrids, with the hopes of GAN level performance and additional benefits of VAE, such as an encoder for feature reduction, which is not offered by GANs. Recently, the Wasserstein Autoencoder (WAE) was proposed, achieving performance similar to that of GANs, yet it is still unclear whether the two are fundamentally different or can be further improved into a unified model. In this work, we study the $f$-GAN and WAE models and make two main discoveries. First, we find that the $f$-GAN and WAE objectives partake in a primal-dual relationship and are equivalent under some assumptions, which then allows us to explicate the success of WAE. Second, the equivalence result allows us to, for the first time, prove generalization bounds for Autoencoder models, which is a pertinent problem when it comes to theoretical analyses of generative models. Furthermore, we show that the WAE objective is related to other statistical quantities such as the $f$-divergence and in particular, upper bounded by the Wasserstein distance, which then allows us to tap into existing efficient (regularized) optimal transport solvers. Our findings thus present the first primal-dual relationship between GANs and Autoencoder models, comment on generalization abilities and make a step towards unifying these models.

NeurIPS Conference 2018 Conference Paper

Constant Regret, Generalized Mixability, and Mirror Descent

Zakaria Mhammedi
Robert Williamson

We consider the setting of prediction with expert advice; a learner makes predictions by aggregating those of a group of experts. Under this setting, and for the right choice of loss function and ``mixing'' algorithm, it is possible for the learner to achieve a constant regret regardless of the number of prediction rounds. For example, a constant regret can be achieved for \emph{mixable} losses using the \emph{aggregating algorithm}. The \emph{Generalized Aggregating Algorithm} (GAA) is a name for a family of algorithms parameterized by convex functions on simplices (entropies), which reduce to the aggregating algorithm when using the \emph{Shannon entropy} $\operatorname{S}$. For a given entropy $\Phi$, losses for which a constant regret is possible using the \textsc{GAA} are called $\Phi$-mixable. Which losses are $\Phi$-mixable was previously left as an open question. We fully characterize $\Phi$-mixability and answer other open questions posed by \cite{Reid2015}. We show that the Shannon entropy $\operatorname{S}$ is fundamental in nature when it comes to mixability; any $\Phi$-mixable loss is necessarily $\operatorname{S}$-mixable, and the lowest worst-case regret of the \textsc{GAA} is achieved using the Shannon entropy. Finally, by leveraging the connection between the \emph{mirror descent algorithm} and the update step of the GAA, we suggest a new \emph{adaptive} generalized aggregating algorithm and analyze its performance in terms of the regret bound.

NeurIPS Conference 2017 Conference Paper

f-GANs in an Information Geometric Nutshell

Richard Nock
Zac Cranko
Aditya Menon
Lizhen Qu
Robert Williamson

Nowozin \textit{et al} showed last year how to extend the GAN \textit{principle} to all $f$-divergences. The approach is elegant but falls short of a full description of the supervised game, and says little about the key player, the generator: for example, what does the generator actually converge to if solving the GAN game means convergence in some space of parameters? How does that provide hints on the generator's design and compare to the flourishing but almost exclusively experimental literature on the subject? In this paper, we unveil a broad class of distributions for which such convergence happens --- namely, deformed exponential families, a wide superset of exponential families ---. We show that current deep architectures are able to factorize a very large number of such densities using an especially compact design, hence displaying the power of deep architectures and their concinnity in the $f$-GAN game. This result holds given a sufficient condition on \textit{activation functions} --- which turns out to be satisfied by popular choices. The key to our results is a variational generalization of an old theorem that relates the KL divergence between regular exponential families and divergences between their natural parameters. We complete this picture with additional results and experimental insights on how these results may be used to ground further improvements of GAN architectures, via (i) a principled design of the activation functions in the generator and (ii) an explicit integration of proper composite losses' link function in the discriminator.

NeurIPS Conference 2015 Conference Paper

Learning with Symmetric Label Noise: The Importance of Being Unhinged

Brendan van Rooyen
Aditya Menon
Robert Williamson

Convex potential minimisation is the de facto approach to binary classification. However, Long and Servedio [2008] proved that under symmetric label noise (SLN), minimisation of any convex potential over a linear function class can result in classification performance equivalent to random guessing. This ostensibly shows that convex losses are not SLN-robust. In this paper, we propose a convex, classification-calibrated loss and prove that it is SLN-robust. The loss avoids the Long and Servedio [2008] result by virtue of being negatively unbounded. The loss is a modification of the hinge loss, where one does not clamp at zero; hence, we call it the unhinged loss. We show that the optimal unhinged solution is equivalent to that of a strongly regularised SVM, and is the limiting solution for any convex potential; this implies that strong l2 regularisation makes most standard learners SLN-robust. Experiments confirm the unhinged loss’ SLN-robustness.

NeurIPS Conference 2014 Conference Paper

From Stochastic Mixability to Fast Rates

Nishant Mehta
Robert Williamson

Empirical risk minimization (ERM) is a fundamental learning rule for statistical learning problems where the data is generated according to some unknown distribution $\mathsf{P}$ and returns a hypothesis $f$ chosen from a fixed class $\mathcal{F}$ with small loss $\ell$. In the parametric setting, depending upon $(\ell, \mathcal{F}, \mathsf{P})$ ERM can have slow $(1/\sqrt{n})$ or fast $(1/n)$ rates of convergence of the excess risk as a function of the sample size $n$. There exist several results that give sufficient conditions for fast rates in terms of joint properties of $\ell$, $\mathcal{F}$, and $\mathsf{P}$, such as the margin condition and the Bernstein condition. In the non-statistical prediction with expert advice setting, there is an analogous slow and fast rate phenomenon, and it is entirely characterized in terms of the mixability of the loss $\ell$ (there being no role there for $\mathcal{F}$ or $\mathsf{P}$). The notion of stochastic mixability builds a bridge between these two models of learning, reducing to classical mixability in a special case. The present paper presents a direct proof of fast rates for ERM in terms of stochastic mixability of $(\ell, \mathcal{F}, \mathsf{P})$, and in so doing provides new insight into the fast-rates phenomenon. The proof exploits an old result of Kemperman on the solution to the general moment problem. We also show a partial converse that suggests a characterization of fast rates for ERM in terms of stochastic mixability is possible.

NeurIPS Conference 2012 Conference Paper

Mixability in Statistical Learning

Tim Erven
Peter Grünwald
Mark Reid
Robert Williamson

Statistical learning and sequential prediction are two different but related formalisms to study the quality of predictions. Mapping out their relations and transferring ideas is an active area of investigation. We provide another piece of the puzzle by showing that an important concept in sequential prediction, the mixability of a loss, has a natural counterpart in the statistical setting, which we call stochastic mixability. Just as ordinary mixability characterizes fast rates for the worst-case regret in sequential prediction, stochastic mixability characterizes fast rates in statistical learning. We show that, in the special case of log-loss, stochastic mixability reduces to a well-known (but usually unnamed) martingale condition, which is used in existing convergence theorems for minimum description length and Bayesian inference. In the case of 0/1-loss, it reduces to the margin condition of Mammen and Tsybakov, and in the case that the model under consideration contains all possible predictors, it is equivalent to ordinary mixability.

NeurIPS Conference 2011 Conference Paper

Composite Multiclass Losses

Elodie Vernet
Mark Reid
Robert Williamson

We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a proper composite loss'', which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We determine the stationarity condition, Bregman representation, order-sensitivity, existence and uniqueness of the composite representation for multiclass losses. We also show that the integral representation for binary proper losses can not be extended to multiclass losses. We subsume existing results on classification calibration'' by relating it to properness. We draw conclusions concerning the design of multiclass losses.

JMLR Journal 2007 Journal Article

The Need for Open Source Software in Machine Learning

Sören Sonnenburg
Mikio L. Braun
Cheng Soon Ong
Samy Bengio
Leon Bottou
Geoffrey Holmes
Yann LeCun
Klaus-Robert Müller

Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for diverse applications. However, the true potential of these methods is not used, since existing implementations are not openly shared, resulting in software with low usability, and weak interoperability. We argue that this situation can be significantly improved by increasing incentives for researchers to publish their software under an open source model. Additionally, we outline the problems authors are faced with when trying to publish algorithmic implementations of machine learning methods. We believe that a resource of peer reviewed software accompanied by short articles would be highly valuable to both the machine learning and the general scientific community. [abs] [ pdf ][ bib ] &copy JMLR 2007. ( edit, beta )

NeurIPS Conference 2002 Conference Paper

Hyperkernels

Cheng Ong
Robert Williamson
Alex Smola

We consider the problem of choosing a kernel suitable for estimation using a Gaussian Process estimator or a Support Vector Machine. A novel solution is presented which involves deﬁning a Reproducing Ker- nel Hilbert Space on the space of kernels itself. By utilizing an analog of the classical representer theorem, the problem of choosing a kernel from a parameterized family of kernels (e. g. of varying width) is reduced to a statistical estimation problem akin to the problem of minimizing a regularized risk functional. Various classical settings for model or kernel selection are special cases of our framework.

NeurIPS Conference 2001 Conference Paper

Algorithmic Luckiness

Ralf Herbrich
Robert Williamson

In contrast to standard statistical learning theory which studies uniform bounds on the expected error we present a framework that exploits the specific learning algorithm used. Motivated by the luckiness framework [8] we are also able to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypothe(cid: 173) ses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits both the margin and the distribution of the data in feature space.

NeurIPS Conference 2001 Conference Paper

Kernel Machines and Boolean Functions

Adam Kowalczyk
Alex Smola
Robert Williamson

We give results about the learnability and required complexity of logical formulae to solve classiﬁcation problems. These results are obtained by linking propositional logic with kernel machines. In particular we show that decision trees and disjunctive normal forms (DNF) can be repre- sented by the help of a special kernel, linking regularized risk to separa- tion margin. Subsequently we derive a number of lower bounds on the required complexity of logic formulae using properties of algorithms for generation of linear estimators, such as perceptron and maximal percep- tron learning.

NeurIPS Conference 2001 Conference Paper

Online Learning with Kernels

Jyrki Kivinen
Alex Smola
Robert Williamson

We consider online learning in a Reproducing Kernel Hilbert Space. Our method is computationally efﬁcient and leads to simple algorithms. In particular we derive update equations for classiﬁcation, regression, and novelty detection. The inclusion of the -trick allows us to give a robust parameterization. Moreover, unlike in batch learning where the -trick only applies to the -insensitive loss function we are able to derive gen- eral trimmed-mean types of estimators such as for Huber’s robust loss.

NeurIPS Conference 2000 Conference Paper

From Margin to Sparsity

Thore Graepel
Ralf Herbrich
Robert Williamson

We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation er(cid: 173) ror bound for the classifier learned by the perceptron learning algo(cid: 173) rithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are cur(cid: 173) rently available for the support vector solution itself.

NeurIPS Conference 2000 Conference Paper

Regularization with Dot-Product Kernels

Alex Smola
Zoltán Óvári
Robert Williamson

In this paper we give necessary and sufficient conditions under which kernels of dot product type k(x, y) = k(x. y) satisfy Mer(cid: 173) cer's condition and thus may be used in Support Vector Ma(cid: 173) chines (SVM), Regularization Networks (RN) or Gaussian Pro(cid: 173) cesses (GP). In particular, we show that if the kernel is analytic (i. e. can be expanded in a Taylor series), all expansion coefficients have to be nonnegative. We give an explicit functional form for the feature map by calculating its eigenfunctions and eigenvalues.

NeurIPS Conference 1999 Conference Paper

Support Vector Method for Novelty Detection

Bernhard Schölkopf
Robert Williamson
Alex Smola
John Shawe-Taylor
John Platt

Suppose you are given some dataset drawn from an underlying probabil(cid: 173) ity distribution P and you want to estimate a "simple" subset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori specified l/ between 0 and 1. We propose a method to approach this problem by trying to estimate a function f which is positive on S and negative on the complement. The functional form of f is given by a kernel expansion in terms of a poten(cid: 173) tially small subset of the training data; it is regularized by controlling the length of the weight vector in an associated feature space. We provide a theoretical analysis of the statistical performance of our algorithm. The algorithm is a natural extension of the support vector algorithm to the case of unlabelled data.

NeurIPS Conference 1999 Conference Paper

The Entropy Regularization Information Criterion

Alex Smola
John Shawe-Taylor
Bernhard Schölkopf
Robert Williamson

Effective methods of capacity control via uniform convergence bounds for function expansions have been largely limited to Support Vector ma(cid: 173) chines, where good bounds are obtainable by the entropy number ap(cid: 173) proach. We extend these methods to systems with expansions in terms of arbitrary (parametrized) basis functions and a wide range of regulariza(cid: 173) tion methods covering the whole range of general linear additive models. This is achieved by a data dependent analysis of the eigenvalues of the corresponding design matrix.

NeurIPS Conference 1998 Conference Paper

Shrinking the Tube: A New Support Vector Regression Algorithm

Bernhard Schölkopf
Peter Bartlett
Alex Smola
Robert Williamson

A new algorithm for Support Vector regression is described. For a priori chosen 1/, it automatically adjusts a flexible tube of minimal radius to the data such that at most a fraction 1/ of the data points lie outside. More(cid: 173) over, it is shown how to use parametric tube shapes with non-constant radius. The algorithm is analysed theoretically and experimentally.

NeurIPS Conference 1995 Conference Paper

Examples of learning curves from a modified VC-formalism

Adam Kowalczyk
Jacek Szymanski
Peter Bartlett
Robert Williamson

We examine the issue of evaluation of model specific parameters in a modified VC-formalism. Two examples are analyzed: the 2-dimensional homogeneous perceptron and the I-dimensional higher order neuron. Both models are solved theoretically, and their learning curves are com(cid: 173) pared against true learning curves. It is shown that the formalism has the potential to generate a variety of learning curves, including ones displaying ''phase transitions. "

NeurIPS Conference 1992 Conference Paper

Rational Parametrizations of Neural Networks

Uwe Helmke
Robert Williamson

A connection is drawn between rational functions, the realization theory of dynamical systems, and feedforward neural networks. This allows us to parametrize single hidden layer scalar neural networks with (almost) arbitrary analytic activation functions in terms of strictly proper rational functions. Hence, we can solve the uniqueness of parametrization problem for such networks.

NeurIPS Conference 1991 Conference Paper

Splines, Rational Functions and Neural Networks

Robert Williamson
Peter Bartlett

Connections between spline approximation, approximation with rational functions, and feedforward neural networks are studied. The potential improvement in the degree of approximation in going from single to two hidden layer networks is examined. Some results of Birman and Solomjak regarding the degree of approximation achievable when knot positions are chosen on the basis of the probability distribution of examples rather than the function values are extended.

NeurIPS Conference 1990 Conference Paper

e-Entropy and the Complexity of Feedforward Neural Networks

Robert Williamson

We develop a. new feedforward neuralnet. work represent. ation of Lipschitz functions from [0, p]n into [0, 1] ba'3ed on the level sets of the function. We show that ~~ + ~€r + ( 1 + h) (: ~) n is an upper bound on the number of nodes needed to represent f to within uniform error Cr, where L is the Lipschitz constant. \Ve also show that the number of bits needed to represent the weights in the network in order to achieve this approximation is given by o (~2; ~r (: ~) n). \Ve compare this bound with the [-entropy of the functional class under consideration.