Author name cluster

Robert C. Williamson

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

22 papers

2 author rows

JMLR Journal 2025 Journal Article

Geometry and Stability of Supervised Learning Problems

Facundo Mémoli
Brantley Vose
Robert C. Williamson

We introduce a notion of distance between supervised learning problems, which we call the Risk distance. This distance, inspired by optimal transport, facilitates stability results; one can quantify how seriously issues like sampling bias, noise, limited data, and approximations might change a given problem by bounding how much these modifications can move the problem under the Risk distance. With the distance established, we explore the geometry of the resulting space of supervised learning problems, providing explicit geodesics and proving that the set of classification problems is dense in a larger class of problems. We also provide two variants of the Risk distance: one that incorporates specified weights on a problem's predictors, and one that is more sensitive to the contours of a problem's risk landscape.

JMLR Journal 2024 Journal Article

Information Processing Equalities and the Information–Risk Bridge

Robert C. Williamson
Zac Cranko

We introduce two new classes of measures of information for statistical experiments which generalise and subsume φ-divergences, integral probability metrics, N-distances (MMD), and (f,Γ) divergences between two or more distributions. This enables us to derive a simple geometrical relationship between measures of information and the Bayes risk of a statistical decision problem, thus extending the variational φ-divergence representation to multiple distributions in an entirely symmetric manner. The new families of divergence are closed under the action of Markov operators which yields an information processing equality which is a refinement and generalisation of the classical information processing inequality. This equality gives insight into the significance of the choice of the hypothesis class in classical risk minimization.

JMLR Journal 2024 Journal Article

Risk Measures and Upper Probabilities: Coherence and Stratification

Christian Fröhlich
Robert C. Williamson

Machine learning typically presupposes classical probability theory which implies that aggregation is built upon expectation. There are now multiple reasons to motivate looking at richer alternatives to classical probability theory as a mathematical foundation for machine learning. We systematically examine a powerful and rich class of alternative aggregation functionals, known variously as spectral risk measures, Choquet integrals or Lorentz norms. We present a range of characterization results, and demonstrate what makes this spectral family so special. In doing so we arrive at a natural stratification of all coherent risk measures in terms of the upper probabilities that they induce by exploiting results from the theory of rearrangement invariant Banach spaces. We empirically demonstrate how this new approach to uncertainty helps tackle practical machine learning problems.

ICML Conference 2023 Conference Paper

Random Classification Noise does not defeat All Convex Potential Boosters Irrespective of Model Choice

Yishay Mansour
Richard Nock
Robert C. Williamson

A landmark negative result of Long and Servedio has had a considerable impact on research and development in boosting algorithms, around the now famous tagline that "noise defeats all convex boosters". In this paper, we appeal to the half-century+ founding theory of losses for class probability estimation, an extension of Long and Servedio’s results and a new general convex booster to demonstrate that the source of their negative result is in fact the model class, linear separators. Losses or algorithms are neither to blame. This leads us to a discussion on an otherwise praised aspect of ML, parameterisation.

JMLR Journal 2023 Journal Article

The Geometry and Calculus of Losses

Robert C. Williamson
Zac Cranko

Statistical decision problems lie at the heart of statistical machine learning. The simplest problems are multiclass classification and class probability estimation. Central to their definition is the choice of loss function, which is the means by which the quality of a solution is evaluated. In this paper we systematically develop the theory of loss functions for such problems from a novel perspective whose basic ingredients are convex sets with a particular structure. The loss function is defined as the subgradient of the support function of the convex set. It is consequently automatically proper (calibrated for probability estimation). This perspective provides three novel opportunities. First, it enables the development of a fundamental relationship between losses and (anti)-norms that appears to have not been noticed before. Second, it enables the development of a calculus of losses induced by the calculus of convex sets which allows the interpolation between different losses, and thus is a potentially useful design tool for tailoring losses to particular problems. In doing this we build upon, and considerably extend, existing results on M-sums of convex sets. Third, the perspective leads to a natural theory of “polar” loss functions, which are derived from the polar dual of the convex set defining the loss, and which form a natural universal substitution function for Vovk’s aggregating algorithm.

NeurIPS Conference 2020 Conference Paper

PAC-Bayesian Bound for the Conditional Value at Risk

Zakaria Mhammedi
Benjamin Guedj
Robert C. Williamson

Conditional Value at Risk (CVaR) is a "coherent risk measure" which generalizes expectation (reduced to a boundary parameter setting). Widely used in mathematical finance, it is garnering increasing interest in machine learning as an alternate approach to regularization, and as a means for ensuring fairness. This paper presents a generalization bound for learning algorithms that minimize the CVaR of the empirical loss. The bound is of PAC-Bayesian type and is guaranteed to be small when the empirical CVaR is small. We achieve this by reducing the problem of estimating CVaR to that of merely estimating an expectation. This then enables us, as a by-product, to obtain concentration inequalities for CVaR even when the random variable in question is unbounded.
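As an illustration of the quantity the abstract refers to, the following is a minimal sketch of the empirical CVaR of a finite sample of losses, computed from the standard Rockafellar-Uryasev variational form. It is not taken from the paper: the function name `empirical_cvar` and the level convention (alpha in (0, 1], with alpha = 1 recovering the plain mean) are assumptions chosen for this example.

```python
def empirical_cvar(losses, alpha):
    """Empirical CVaR of `losses` at level `alpha` in (0, 1].

    Hypothetical helper for illustration. Uses the variational form
        CVaR_alpha = min over mu of  mu + mean(max(loss - mu, 0)) / alpha.
    Convention assumed here: alpha = 1 gives the ordinary mean, and
    smaller alpha focuses on the worst (largest) losses.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")

    n = len(losses)

    def objective(mu):
        # Average excess of the losses over the threshold mu, scaled by 1/alpha.
        excess = sum(max(loss - mu, 0.0) for loss in losses)
        return mu + excess / (alpha * n)

    # The objective is convex and piecewise linear in mu with kinks at the
    # sample values, so the minimum is attained at one of the observed losses.
    return min(objective(mu) for mu in losses)


if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0, 4.0]
    for level in (1.0, 0.5, 0.25):
        print(level, empirical_cvar(sample, level))
```

With this convention, the boundary setting alpha = 1 reduces to the sample mean, which matches the abstract's remark that CVaR generalizes expectation; the scan over candidate thresholds is quadratic in the sample size and is meant for clarity rather than efficiency.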

ICML Conference 2019 Conference Paper

Fairness risk measures

Robert C. Williamson
Aditya Krishna Menon

Ensuring that classifiers are non-discriminatory or fair with respect to a sensitive feature (e.g., race or gender) is a topical problem. Progress in this task requires fixing a definition of fairness, and there have been several proposals in this regard over the past few years. Several of these, however, assume either binary sensitive features (thus precluding categorical or real-valued sensitive groups), or result in non-convex objectives (thus adversely affecting the optimisation landscape). In this paper, we propose a new definition of fairness that generalises some existing proposals, while allowing for generic sensitive features and resulting in a convex objective. The key idea is to enforce that the expected losses (or risks) across each subgroup induced by the sensitive feature are commensurate. We show how this relates to the rich literature on risk measures from mathematical finance. As a special case, this leads to a new convex fairness-aware objective based on minimising the conditional value at risk (CVaR).

ICML Conference 2019 Conference Paper

Lossless or Quantized Boosting with Integer Arithmetic

Richard Nock
Robert C. Williamson

In supervised learning, efficiency often starts with the choice of a good loss: support vector machines popularised Hinge loss, Adaboost popularised the exponential loss, etc. Recent trends in machine learning have highlighted the necessity for training routines to meet tight requirements on communication, bandwidth, energy, operations, encoding, among others. Fitting the often decades-old state of the art training routines into these new constraints does not go without pain and uncertainty or reduction in the original guarantees. Our paper starts with the design of a new strictly proper canonical, twice differentiable loss called the Q-loss. Importantly, its mirror update over (arbitrary) rational inputs uses only integer arithmetic – more precisely, the sole use of $+, -, /, \times, |\cdot|$. We build a learning algorithm which is able, under mild assumptions, to achieve a lossless boosting-compliant training. We give conditions for a quantization of its main memory footprint, weights, to be done while keeping the whole algorithm boosting-compliant. Experiments display that the algorithm can achieve a fast convergence during the early boosting rounds compared to AdaBoost, even with a weight storage that can be 30+ times smaller. Lastly, we show that the Bayes risk of the Q-loss can be used as node splitting criterion for decision trees and guarantees optimal boosting convergence.

JMLR Journal 2018 Journal Article

A Theory of Learning with Corrupted Labels

Brendan van Rooyen
Robert C. Williamson

It is usual in machine learning theory to assume that the training and testing sets consist of draws from the same distribution. This is rarely, if ever, true and one must admit the presence of corruption. There are many different types of corruption that can arise and as of yet there is no general means to compare the relative ease of learning in these settings. Such results are necessary if we are to make informed economic decisions regarding the acquisition of data. Here we begin to develop an abstract framework for tackling these problems. We present a generic method for learning from a fixed, known, reconstructible corruption, along with an analysis of its statistical properties. We demonstrate the utility of our framework via concrete novel results in solving supervised learning problems wherein the labels are corrupted, such as learning with noisy labels, semi-supervised learning and learning with partial labels.

JMLR Journal 2016 Journal Article

Bipartite Ranking: a Risk-Theoretic Perspective

Aditya Krishna Menon
Robert C. Williamson

We present a systematic study of the bipartite ranking problem, with the aim of explicating its connections to the class-probability estimation problem. Our study focuses on the properties of the statistical risk for bipartite ranking with general losses, which is closely related to a generalised notion of the area under the ROC curve: we establish alternate representations of this risk, relate the Bayes-optimal risk to a class of probability divergences, and characterise the set of Bayes-optimal scorers for the risk. We further study properties of a generalised class of bipartite risks, based on the $p$-norm push of Rudin (2009). Our analysis is based on the rich framework of proper losses, which are the central tool in the study of class-probability estimation. We show how this analytic tool makes transparent the generalisations of several existing results, such as the equivalence of the minimisers for four seemingly disparate risks from bipartite ranking and class-probability estimation. A novel practical implication of our analysis is the design of new families of losses for scenarios where accuracy at the head of ranked list is paramount, with comparable empirical performance to the $p$-norm push.

JMLR Journal 2016 Journal Article

Composite Multiclass Losses

Robert C. Williamson
Elodie Vernet
Mark D. Reid

We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a proper composite loss, which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We subsume results on "classification calibration" by relating it to properness. We determine the stationarity condition, Bregman representation, order-sensitivity, and quasi-convexity of multiclass proper losses. We then characterise the existence and uniqueness of the composite representation for multiclass losses. We show how the composite representation is related to other core properties of a loss: mixability, admissibility and (strong) convexity of multiclass losses which we characterise in terms of the Hessian of the Bayes risk. We show that the simple integral representation for binary proper losses cannot be extended to multiclass losses but offer concrete guidance regarding how to design different loss functions. The conclusion drawn from these results is that the proper composite representation is a natural and convenient tool for the design of multiclass loss functions.

JMLR Journal 2015 Journal Article

Fast Rates in Statistical and Online Learning

Tim van Erven
Peter D. Grünwald
Nishant A. Mehta
Mark D. Reid
Robert C. Williamson

The speed with which a learning algorithm converges as it is presented with more data is a central problem in machine learning: a fast rate of convergence means less data is needed for the same level of performance. The pursuit of fast rates in online and statistical learning has led to the discovery of many conditions in learning theory under which fast learning is possible. We show that most of these conditions are special cases of a single, unifying condition, that comes in two forms: the central condition for 'proper' learning algorithms that always output a hypothesis in the given model, and stochastic mixability for online algorithms that may make predictions outside of the model. We show that under surprisingly weak assumptions both conditions are, in a certain sense, equivalent. The central condition has a re-interpretation in terms of convexity of a set of pseudoprobabilities, linking it to density estimation under misspecification. For bounded losses, we show how the central condition enables a direct proof of fast rates and we prove its equivalence to the Bernstein condition, itself a generalization of the Tsybakov margin condition, both of which have played a central role in obtaining fast rates in statistical learning. Yet, while the Bernstein condition is two-sided, the central condition is one-sided, making it more suitable to deal with unbounded losses. In its stochastic mixability form, our condition generalizes both a stochastic exp-concavity condition identified by Juditsky, Rigollet and Tsybakov and Vovk's notion of mixability. Our unifying conditions thus provide a substantial step towards a characterization of fast rates in statistical learning, similar to how classical mixability characterizes constant regret in the sequential prediction with expert advice setting.

JMLR Journal 2012 Journal Article

Mixability is Bayes Risk Curvature Relative to Log Loss

Tim van Erven
Mark D. Reid
Robert C. Williamson

Mixability of a loss characterizes fast rates in the online learning setting of prediction with expert advice. The determination of the mixability constant for binary losses is straightforward but opaque. In the binary case we make this transparent and simpler by characterising mixability in terms of the second derivative of the Bayes risk of proper losses. We then extend this result to multiclass proper losses where there are few existing results. We show that mixability is governed by the maximum eigenvalue of the Hessian of the Bayes risk, relative to the Hessian of the Bayes risk for log loss. We conclude by comparing our result to other work that bounds prediction performance in terms of the geometry of the Bayes risk. Although all calculations are for proper losses, we also show how to carry the results across to improper losses.

ICML Conference 2012 Conference Paper

The Convexity and Design of Composite Multiclass Losses

Mark D. Reid
Robert C. Williamson
Peng Sun

JMLR Journal 2011 Journal Article

Information, Divergence and Risk for Binary Experiments

Mark D. Reid
Robert C. Williamson

We unify f-divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f-divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants.

JMLR Journal 2010 Journal Article

Composite Binary Losses

Mark D. Reid
Robert C. Williamson

We study losses for binary classification and class probability estimation and extend the understanding of them from margin losses to general composite losses which are the composition of a proper loss with a link function. We characterise when margin losses can be proper composite losses, explicitly show how to determine a symmetric loss in full from half of one of its partial losses, introduce an intrinsic parametrisation of composite binary losses and give a complete characterisation of the relationship between proper losses and "classification calibrated" losses. We also consider the question of the "best" surrogate binary loss. We introduce a precise notion of "best" and show there exist situations where two convex surrogate losses are incommensurable. We provide a complete explicit characterisation of the convexity of composite binary losses in terms of the link function and the weight function associated with the proper loss which make up the composite loss. This characterisation suggests new ways of "surrogate tuning" as well as providing an explicit characterisation of when Bregman divergences on the unit interval are convex in their second argument. Finally, in an appendix we present some new algorithm-independent results on the relationship between properness, convexity and robustness to misclassification noise for binary losses and show that all convex proper losses are non-robust to misclassification noise.

ICML Conference 2009 Conference Paper

Surrogate regret bounds for proper losses

Mark D. Reid
Robert C. Williamson

We present tight surrogate regret bounds for the class of proper (i.e., Fisher consistent) losses. The bounds generalise the margin-based bounds due to Bartlett et al. (2006). The proof uses Taylor's theorem and leads to new representations for loss and regret and a simple proof of the integral representation of proper losses. We also present a different formulation of a duality result of Bregman divergences which leads to a simple demonstration of the convexity of composite losses using canonical link functions.

JMLR Journal 2005 Journal Article

Learning the Kernel with Hyperkernels

Cheng Soon Ong
Alexander J. Smola
Robert C. Williamson

This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach.

JMLR Journal 2002 Journal Article

Algorithmic Luckiness

Ralf Herbrich
Robert C. Williamson

Classical statistical learning theory studies the generalisation performance of machine learning algorithms rather indirectly. One of the main detours is that algorithms are studied in terms of the hypothesis class that they draw their hypotheses from. In this paper, motivated by the luckiness framework of Shawe-Taylor et al. (1998), we study learning algorithms more directly and in a way that allows us to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits the margin, the sparsity of the resultant weight vector, and the degree of clustering of the training data in feature space.

JMLR Journal 2001 Journal Article

Introduction to the Special Issue on Kernel Methods (Kernel Machines Section)

Nello Cristianini
John Shawe-Taylor
Robert C. Williamson

JMLR Journal 2001 Journal Article

Prior Knowledge and Preferential Structures in Gradient Descent Learning Algorithms

Robert E. Mahony
Robert C. Williamson

A family of gradient descent algorithms for learning linear functions in an online setting is considered. The family includes the classical LMS algorithm as well as new variants such as the Exponentiated Gradient (EG) algorithm due to Kivinen and Warmuth. The algorithms are based on prior distributions defined on the weight space. Techniques from differential geometry are used to develop the algorithms as gradient descent iterations with respect to the natural gradient in the Riemannian structure induced by the prior distribution. The proposed framework subsumes the notion of "link-functions".

JMLR Journal 2001 Journal Article

Regularized Principal Manifolds (Kernel Machines Section)

Alexander J. Smola
Sebastian Mika
Bernhard Schölkopf
Robert C. Williamson

Many settings of unsupervised learning can be viewed as quantization problems - the minimization of the expected quantization error subject to some restrictions. This allows the use of tools such as regularization from the theory of (supervised) risk minimization for unsupervised learning. This setting turns out to be closely related to principal curves, the generative topographic map, and robust coding. We explore this connection in two ways: (1) we propose an algorithm for finding principal manifolds that can be regularized in a variety of ways; and (2) we derive uniform convergence bounds and hence bounds on the learning rates of the algorithm. In particular, we give bounds on the covering numbers which allows us to obtain nearly optimal learning rates for certain types of regularization operators. Experimental results demonstrate the feasibility of the approach.