Author name cluster

Christopher Williams

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

30 papers

1 author row

TMLR Journal 2024 Journal Article

Approximations to the Fisher Information Metric of Deep Generative Models for Out-Of-Distribution Detection

Sam Dauncey
Christopher C. Holmes
Christopher Williams
Fabian Falck

Likelihood-based deep generative models such as score-based diffusion models and variational autoencoders are state-of-the-art machine learning models approximating high-dimensional distributions of data such as images, text, or audio. One of many downstream tasks they can be naturally applied to is out-of-distribution (OOD) detection. However, seminal work by Nalisnick et al. which we reproduce showed that deep generative models consistently infer higher log-likelihoods for OOD data than data they were trained on, marking an open problem. In this work, we analyse using the gradient of a data point with respect to the parameters of the deep generative model for OOD detection, based on the simple intuition that OOD data should have larger gradient norms than training data. We formalise measuring the size of the gradient as approximating the Fisher information metric. We show that the Fisher information matrix (FIM) has large absolute diagonal values, motivating the use of chi-square distributed, layer-wise gradient norms as features. We combine these features to make a simple, model-agnostic and hyperparameter-free method for OOD detection which estimates the joint density of the layer-wise gradient norms for a given data point. We find that these layer-wise gradient norms are weakly correlated, rendering their combined usage informative, and prove that the layer-wise gradient norms satisfy the principle of (data representation) invariance. Our empirical results indicate that this method outperforms the Typicality test for most deep generative models and image dataset pairings.

NeurIPS Conference 2024 Conference Paper

Score-Optimal Diffusion Schedules

Christopher Williams
Andrew Campbell
Arnaud Doucet
Saifuddin Syed

Denoising diffusion models (DDMs) offer a flexible framework for sampling from high dimensional data distributions. DDMs generate a path of probability distributions interpolating between a reference Gaussian distribution and a data distribution by incrementally injecting noise into the data. To numerically simulate the sampling process, a discretisation schedule from the reference back towards clean data must be chosen. An appropriate discretisation schedule is crucial to obtain high quality samples. However, beyond hand crafted heuristics, a general method for choosing this schedule remains elusive. This paper presents a novel algorithm for adaptively selecting an optimal discretisation schedule with respect to a cost that we derive. Our cost measures the work done by the simulation procedure to transport samples from one point in the diffusion path to the next. Our method does not require hyperparameter tuning and adapts to the dynamics and geometry of the diffusion path. Our algorithm only involves the evaluation of the estimated Stein score, making it scalable to existing pre-trained models at inference time and online during training. We find that our learned schedule recovers performant schedules previously only discovered through manual search and obtains competitive FID scores on image datasets.

PDF Details DOI

NeurIPS Conference 2023 Conference Paper

A Unified Framework for U-Net Design and Analysis

Christopher Williams
Fabian Falck
George Deligiannidis
Chris C Holmes
Arnaud Doucet
Saifuddin Syed

U-Nets are a go-to neural architecture across numerous tasks for continuous signals on a square such as images and Partial Differential Equations (PDE), however their design and architecture is understudied. In this paper, we provide a framework for designing and analysing general U-Net architectures. We present theoretical results which characterise the role of the encoder and decoder in a U-Net, their high-resolution scaling limits and their conjugacy to ResNets via preconditioning. We propose Multi-ResNets, U-Nets with a simplified, wavelet-based encoder without learnable parameters. Further, we show how to design novel U-Net architectures which encode function constraints, natural bases, or the geometry of the data. In diffusion models, our framework enables us to identify that high-frequency information is dominated by noise exponentially faster, and show how U-Nets with average pooling exploit this. In our experiments, we demonstrate how Multi-ResNets achieve competitive and often superior performance compared to classical U-Nets in image segmentation, PDE surrogate modelling, and generative modelling with diffusion models. Our U-Net framework paves the way to study the theoretical properties of U-Nets and design natural, scalable neural architectures for a multitude of problems beyond the square.

NeurIPS Conference 2022 Conference Paper

A Multi-Resolution Framework for U-Nets with Applications to Hierarchical VAEs

Fabian Falck
Christopher Williams
Dominic Danks
George Deligiannidis
Christopher Yau
Chris C Holmes
Arnaud Doucet
Matthew Willetts

U-Net architectures are ubiquitous in state-of-the-art deep learning, however their regularisation properties and relationship to wavelets are understudied. In this paper, we formulate a multi-resolution framework which identifies U-Nets as finite-dimensional truncations of models on an infinite-dimensional function space. We provide theoretical results which prove that average pooling corresponds to projection within the space of square-integrable functions and show that U-Nets with average pooling implicitly learn a Haar wavelet basis representation of the data. We then leverage our framework to identify state-of-the-art hierarchical VAEs (HVAEs), which have a U-Net architecture, as a type of two-step forward Euler discretisation of multi-resolution diffusion processes which flow from a point mass, introducing sampling instabilities. We also demonstrate that HVAEs learn a representation of time which allows for improved parameter efficiency through weight-sharing. We use this observation to achieve state-of-the-art HVAE performance with half the number of parameters of existing models, exploiting the properties of our continuous-time formulation.

NeurIPS Conference 2012 Conference Paper

A Generative Model for Parts-based Object Segmentation

S. Eslami
Christopher Williams

The Shape Boltzmann Machine (SBM) has recently been introduced as a state-of-the-art model of foreground/background object shape. We extend the SBM to account for the foreground object's parts. Our model, the Multinomial SBM (MSBM), can capture both local and global statistics of part shapes accurately. We combine the MSBM with an appearance model to form a fully generative model of images of objects. Parts-based image segmentations are obtained simply by performing probabilistic inference in the model. We apply the model to two challenging datasets which exhibit significant shape and appearance variability, and find that it obtains results that are comparable to the state-of-the-art.

NeurIPS Conference 2008 Conference Paper

Multi-task Gaussian Process Learning of Robot Inverse Dynamics

Christopher Williams
Stefan Klanke
Sethu Vijayakumar
Kian Chai

The inverse dynamics problem for a robotic manipulator is to compute the torques needed at the joints to drive it along a given trajectory; it is beneficial to be able to learn this function for adaptive control. A given robot manipulator will often need to be controlled while holding different loads in its end effector, giving rise to a multi-task learning problem. We show how the structure of the inverse dynamics problem gives rise to a multi-task Gaussian process prior over functions, where the inter-task similarity depends on the underlying dynamic parameters. Experiments demonstrate that this multi-task formulation generally improves performance over either learning only on single tasks or pooling the data over all tasks.

NeurIPS Conference 2007 Conference Paper

Multi-task Gaussian Process Prediction

Edwin Bonilla
Kian Chai
Christopher Williams

In this paper we investigate multi-task learning in the context of Gaussian Pro- cesses (GP). We propose a model that learns a shared covariance function on input-dependent features and a “free-form” covariance matrix over tasks. This al- lows for good ﬂexibility when modelling inter-task dependencies while avoiding the need for large amounts of data for training. We show that under the assump- tion of noise-free observations and a block design, predictions for a given task only depend on its target values and therefore a cancellation of inter-task trans- fer occurs. We evaluate the beneﬁts of our model on two practical applications: a compiler performance prediction problem and an exam score prediction task. Additionally, we make use of GP approximations and properties of our model in order to provide scalability to large data sets.

NeurIPS Conference 2005 Conference Paper

Factorial Switching Kalman Filters for Condition Monitoring in Neonatal Intensive Care

Christopher Williams
John Quinn
Neil McIntosh

The observed physiological dynamics of an infant receiving intensive care are affected by many possible factors, including interventions to the baby, the operation of the monitoring equipment and the state of health. The Factorial Switching Kalman Filter can be used to infer the presence of such factors from a sequence of observations, and to estimate the true values where these observations have been corrupted. We apply this model to clinical time series data and show it to be effective in identifying a number of artifactual and physiological patterns.

NeurIPS Conference 2004 Conference Paper

Harmonising Chorales by Probabilistic Inference

Moray Allan
Christopher Williams

We describe how we used a data set of chorale harmonisations composed by Johann Sebastian Bach to train Hidden Markov Models. Using a prob- abilistic framework allows us to create a harmonisation system which learns from examples, and which can compose new harmonisations. We make a quantitative comparison of our system's harmonisation perfor- mance against simpler models, and provide example harmonisations.

NeurIPS Conference 2004 Conference Paper

Using the Equivalent Kernel to Understand Gaussian Process Regression

Peter Sollich
Christopher Williams

The equivalent kernel [1] is a way of understanding how Gaussian pro- cess regression works for large sample sizes based on a continuum limit. In this paper we show (1) how to approximate the equivalent kernel of the widely-used squared exponential (or Gaussian) kernel and related ker- nels, and (2) how analysis using the equivalent kernel helps to understand the learning curves for Gaussian processes. Consider the supervised regression problem for a dataset D with entries (xi, yi) for i = 1, .. ., n. Under Gaussian Process (GP) assumptions the predictive mean at a test point x is given by f (x) = k (x)(K + 2I)-1y, (1) where K denotes the n n matrix of covariances between the training points with entries k(xi, xj), k(x) is the vector of covariances k(xi, x), 2 is the noise variance on the observations and y is a n 1 vector holding the training targets. See e. g. [2] for further details. We can define a vector of functions h(x) = (K + 2I)-1k(x). Thus we have f (x) = h (x)y, making it clear that the mean prediction at a point x is a linear combination of the target values y. Gaussian process regression is thus a linear smoother, see [3, section 2. 8] for further details. For a fixed test point x, h(x) gives the vector of weights applied to targets y. Silverman [1] called h (x) the weight function. Understanding the form of the weight function is made complicated by the matrix inversion of K + 2I and the fact that K depends on the specific locations of the n datapoints. Idealizing the situation one can consider the observations to be "smeared out" in x-space at some constant density of observations. In this case analytic tools can be brought to bear on the problem, as shown below. By analogy to kernel smoothing Silverman [1] called the idealized weight function the equivalent kernel (EK). The structure of the remainder of the paper is as follows: In section 1 we describe how to derive the equivalent kernel in Fourier space. Section 2 derives approximations for the EK for the squared exponential and other kernels. In section 3 we show how use the EK approach to estimate learning curves for GP regression, and compare GP regression to kernel regression using the EK. 1 Gaussian Process Regression and the Equivalent Kernel It is well known (see e. g. [4]) that the posterior mean for GP regression can be obtained as the function which minimizes the functional 1 1 n J[f ] = f 2 (y 2 H + 22 i - f (xi))2, (2) n i=1 where f H is the RKHS norm corresponding to kernel k. (However, note that the GP framework gives much more than just this mean prediction, for example the predictive variance and the marginal likelihood p(y) of the data under the model. ) Let (x) = E[y|x] be the target function for our regression problem and write E[(y - f (x))2] = E[(y - (x))2] + ((x) - f(x))2. Using the fact that the first term on the RHS is independent of f motivates considering a smoothed version of equation 2, 1 J[f ] = ((x) - f (x))2dx + f 2 22 2 H, where has dimensions of the number of observations per unit of x-space (length/area/volume etc. as appropriate). If we consider kernels that are stationary, k(x, x ) = k(x - x ), the natural basis in which to analyse equation 1 is the Fourier basis of complex sinusoids so that f (x) is represented as ~ f (s)e2isxds and similarly for (x). Thus we obtain 1 | ~ f (s)|2 J[f ] = | ~ f (s) - ~ (s)|2 + ds, 2 2 S(s) as f 2 = | ~ f ( H s)|2/S(s)ds where S(s) is the power spectrum of the kernel k, S(s) = k(x)e-2isxdx. J[f ] can be minimized using calculus of variations to ob- tain ~ f (s) = S(s)(s)/(2/ + S(s)) which is recognized as the convolution f (x) = h(x - x)(x)dx. Here the Fourier transform of the equivalent kernel h(x) is ~ S(s) 1 h(s) = =. (3) S(s) + 2/ 1 + 2/(S(s)) The term 2/ in the first expression for ~ h(s) corresponds to the power spectrum of a white noise process, whose delta-function covariance function becomes a constant in the Fourier domain. This analysis is known as Wiener filtering; see, e. g. [5, 14-1]. Notice that as, h(x) tends to the delta function. If the input density is non-uniform the analysis above should be interpreted as computing the equivalent kernel for np(x) =. This approximation will be valid if the scale of variation of p(x) is larger than the width of the equivalent kernel. 2 The EK for the Squared Exponential and Related Kernels For certain kernels/covariance functions the EK h(x) can be computed exactly by Fourier inversion. Examples include the Ornstein-Uhlenbeck process in D = 1 with covariance k(x) = e-|x| (see [5, p. 326]), splines in D = 1 corresponding to the regularizer P f 2 = (f (m))2dx [1, 6], and the regularizer P f 2 = ( 2f )2dx in two dimen- sions, where the EK is given in terms of the Kelvin function kei [7]. We now consider the commonly used squared exponential (SE) kernel k(r) = exp(-r2/2 2), where r2 = ||x-x ||2. (This is sometimes called the Gaussian or radial ba- sis function kernel. ) Its Fourier transform is given by S(s) = (2 2)D/2 exp(-22 2|s|2), where D denotes the dimensionality of x (and s) space. From equation 3 we obtain ~ 1 hSE(s) =, 1 + b exp(22 2|s|2) where b = 2/(2 2)D/2. We are unaware of an exact result in this case, but the following initial approximation is simple but effective. For large, b will be small. Thus for small s = |s| we have that ~ hSE 1, but for large s it is approximately 0. The change takes place around the point sc where b exp(22 2s2c) = 1, i. e. s2c = log(1/b)/22 2. As exp(22 2s2) grows quickly with s, the transition of ~ hSE between 1 and 0 can be expected to be rapid, and thus be well-approximated by a step function. Proposition 1 The approximate form of the equivalent kernel for the squared-exponential kernel in D-dimensions is given by s D/2 h c SE(r) = J r D/2(2scr). Proof: hSE(s) is a function of s = |s| only, and for D > 1 the Fourier integral can be simplified by changing to spherical polar coordinates and integrating out the angular variables to give s +1 hSE(r) = 2r J(2rs)~hSE(s) ds (4) 0 r sc s +1 s D/2 2r J c (2rs) ds = JD/2(2scr). 0 r r where = D/2 - 1, J(z) is a Bessel function of the first kind and we have used the identity z+1J(z) = (d/dz)[z+1J+1(z)]. Note that in D = 1 by computing the Fourier transform of the boxcar function we obtain hSE(x) = 2scsinc(2scx) where sinc(z) = sin(z)/z. This is consistent with Proposition 1 and J1/2(z) = (2/z)1/2 sin(z). The asymptotic form of the EK in D = 2 is shown in Figure 2(left) below. Notice that sc scales as (log())1/2 so that the width of the EK (which is proportional to 1/sc) will decay very slowly as increases. In contrast for a spline of order m (with power spectrum |s|-2m) the width of the EK scales as -1/2m [1]. If instead of RD we consider the input set to be the unit circle, a stationary kernel can be periodized by the construction kp(x, x ) = k(x - x + 2n). This kernel will nZ be represented as a Fourier series (rather than with a Fourier transform) because of the periodicity. In this case the step function in Fourier space approximation would give rise to a Dirichlet kernel as the EK (see [8, section 4. 4. 3] for further details on the Dirichlet kernel). We now show that the result of Proposition 1 is asymptotically exact for, and calcu- late the leading corrections for finite. The scaling of the width of the EK as 1/sc suggests writing hSE(r) = (2sc)Dg(2scr). Then from equation 4 and using the definition of sc z 2s +1 J g(z) = cs (zs/sc) ds sc(2sc)D 0 z 1 + exp[22 2(s2 - s2c)] u +1 J = z (zu) du (5) 0 2z 1 + exp[22 2s2c(u2 - 1)] where we have rescaled s = scu in the second step. The value of sc, and hence, now enters only in the exponential via a = 22 2s2c. For a, the exponential tends to zero for u 1. The factor 1/[1 + exp(. .. )] is therefore a step function (1 - u) in the limit and Proposition 1 becomes exact, with g(z) lima g(z) = (2z)-D/2JD/2(z). To calculate corrections to this, one uses that for large but finite a the difference (u) = {1 + exp[a(u2 - 1)]}-1 - (1 - u) is non-negligible only in a range of order 1/a around u = 1. The other factors in the integrand of equation 5 can thus be Taylor-expanded around that point to give I dk u +1 g(z) = g k (z) + z J, I (u)(u - 1)k du k! duk 2z (zu) k = k=0 u=1 0 The problem is thus reduced to calculating the integrals Ik. Setting u = 1 + v/a one has 0 1 vk ak+1Ik = - 1 vk dv + dv -a 1 + exp(v2/a + 2v) 0 1 + exp(v2/a + 2v) a (-1)k+1vk vk = dv + dv 0 1 + exp(-v2/a + 2v) 0 1 + exp(v2/a + 2v) In the first integral, extending the upper limit to gives an error that is exponentially small in a. Expanding the remaining 1/a-dependence of the integrand one then gets, to leading order in 1/a, I0 = c0/a2, I1 = c1/a2 while all Ik with k 2 are smaller by at least 1/a2. The numerical constants are -c0 = c1 = 2/24. This gives, using that (d/dz)[z+1J(z)] = zJ(z) + z+1J-1(z) = (2 + 1)zJ(z) - z+1J+1(z): Proposition 2 The equivalent kernel for the squared-exponential kernel is given for large by hSE(r) = (2sc)Dg(2scr) with 1 z 1 g(z) = J (c ) D D/2(z) + 0 + c1(D - 1))JD/2-1(z) - c1zJD/2(z) +O( (2z) 2 a2 a4 For e. g. D = 1 this becomes g(z) = -1{sin(z)/z - 2/(24a2)[cos(z) + z sin(z)]}. Here and in general, by comparing the second part of the 1/a2 correction with the leading order term, one estimates that the correction is of relative size z2/a2. It will therefore provide a useful improvement as long as z = 2scr 1To understand this scaling of 2grid consider the case where grid > which means that the effective variance at each of the grid points per unit x-space is larger, but as there are correspondingly more points this effect cancels out. This can be understood by imagining the situation where there are grid/ independent Gaussian observations with variance 2grid at a single x-point; this would be equivalent to one Gaussian observation with variance 2. In effect the observations per unit x-space have been smoothed out uniformly. 0. 16 0. 35 0. 35 Numerical Numerical 0. 3 0. 3 Proposition 1 Proposition 1 0. 25 Proposition 2 0. 25 Proposition 2 0. 14 0. 2 0. 2 0. 15 0. 15 0. 12 0. 1 0. 1 0. 05 0. 05 0. 1 0 0 -0. 05 -0. 05 0. 08 -0. 1 -0. 1 0 5 10 15 0 5 10 15 0. 06 0. 04 0. 02 0 -0. 02 Numerical Proposition 1 -0. 04 Sample -0. 5 -0. 4 -0. 3 -0. 2 -0. 1 0 0. 1 0. 2 0. 3 0. 4 0. 5 Figure 1: Main figure: plot of the weight function corresponding to = 100 training points/unit length, plus the numerically computed equivalent kernel at x = 0. 0 and the sinc approximation from Proposition 1. Insets: numerically evaluated g(z) together with sinc and Proposition 2 approximations for = 100 (left) and = 104 (right). Figure 1 shows plots of the weight function for = 100, the EK computed on the grid as described above and the analytical sinc approximation. These are computed for parameter values of 2 = 0. 004 and 2 = 0. 1, with grid/ = 5/3. To reduce edge effects, the interval [-3/2, 3/2] was used for computations, although only the centre of this is shown in the figure. There is quite good agreement between the numerical computation and the analytical approximation, although the sidelobes decay more rapidly for the numerically computed EK. This is not surprising because the absence of a truly hard cutoff in Fourier space means one should expect less "ringing" than the analytical approximation predicts. The figure also shows good agreement between the weight function (based on the finite sample) and the numerically computed EK. The insets show the approximation of Proposi- tion 2 to g(z) for = 100 (a = 5. 67, left) and = 104 (a = 9. 67, right). As expected, the addition of the 1/a2-correction gives better agreement with the numerical result for z a where the 1/a expansion breaks down. 2. 2 Other kernels Our analysis is not in fact restricted to the SE kernel. Consider an isotropic kernel, for which the power spectrum S(s) depends on s = |s| only. Then we can again define from equation 3 an effective cutoff sc on the range of s in the EK via 2/ = S(sc), so that ~ h(s) = [1 + S(sc)/S(s)]-1. The EK will then have the limiting form given in Proposi- tion 1 if ~ h(s) approaches a step function (sc - s), i. e. if it becomes infinitely "steep" around the point s = sc for sc. A quantitative criterion for this is that the slope |~ h (sc)| should become much larger than 1/sc, the inverse of the range of the step func- tion. Since ~ h (s) = S (s)S(sc)S-2(s)[1 + S(sc)/S(s)]-2, this is equivalent to requiring that -scS (sc)/4S(sc) -d log S(sc)/d log sc must diverge for sc. The result of Proposition 1 therefore applies to any kernel whose power spectrum S(s) decays more rapidly than any positive power of 1/s. A trivial example of a kernel obeying this condition would be a superposition of finitely many SE kernels with different lengthscales 2; the asymptotic behaviour of sc is then governed by the smallest. A less obvious case is the "rational quadratic" k(r) = [1 + (r/l)2]-(D+1)/2 which has an exponentially decaying power spectrum S(s) exp(-2 s). (This relationship is often used in the reverse direction, to obtain the power spectrum of the Ornstein-Uhlenbeck (OU) kernel exp(-r/ ). ) Proposition 1 then applies, with the width of the EK now scaling as 1/sc 1/ log(). The previous example is a special case of kernels which can be written as superpositions of SE kernels with a distribution p( ) of lengthscales, k(r) = exp(-r2/2 2)p( ) d. This is in fact the most general representation for an isotropic kernel which defines a valid covariance function in any dimension D, see [9, 2. 10]. Such a kernel has power spectrum S(s) = (2)D/2 D exp(-22 2s2)p( ) d (6) 0 and one easily verifies that the rational quadratic kernel, which has S(s) exp(-2 0s), is obtained for p( ) -D-2 exp(- 20/2 2). More generally, because the exponential factor in equation 6 acts like a cutoff for > 1/s, one estimates S(s) 1/s Dp( ) d 0 for large s. This will decay more strongly than any power of 1/s for s if p( ) itself decreases more strongly than any power of for 0. Any such choice of p( ) will therefore yield a kernel to which Proposition 1 applies. 3 Understanding GP Learning Using the Equivalent Kernel We now turn to using EK analysis to get a handle on average case learning curves for Gaus- sian processes. Here the setup is that a function is drawn from a Gaussian process, and we obtain noisy observations of per unit x-space at random x locations. We are concerned with the mean squared error (MSE) between the GP prediction f and. Averaging over the noise process, the x-locations of the training data and the prior over we obtain the average MSE as a function of. See e. g. [10] and [11] for an overview of earlier work on GP learning curves. To understand the asymptotic behaviour of for large, we now approximate the true GP predictions with the EK predictions from noisy data, given by fEK(x) = h(x - x )y(x )dx in the continuum limit of "smoothed out" input locations. We assume as before that y = target + noise, i. e. y(x) = (x) + (x) where E[(x)(x )] = (2/)(x - x ). Here 2 denotes the true noise variance, as opposed to the noise variance assumed in the EK; the scaling of 2 with is explained in footnote 1. For a fixed target, the MSE is = ( dx)-1 [(x) - fEK(x)]2dx. Averaging over the noise process and target function gives in Fourier space 2 (2/)S = S (s)/S2(s) + 2 /2 (s)[1 - ~ h(s)]2 + (2/)~h2(s) ds = ds [1 + 2/(S(s))]2 (7) where S(s) is the power spectrum of the prior over target functions. In the case S(s) = S(s) and 2 = 2 where the kernel is exactly matched to the structure of the target, equation 7 gives the Bayes error B and simplifies to B = (2/) [1 + 2/(S(s))]-1ds (see also [5, eq. 14-16]). Interestingly, this is just the analogue (for a continuous power spectrum of the kernel rather than a discrete set of eigenvalues) of the lower bound of [10] 0. 5 0. 5 =2 0. 03 0. 025 0. 02 0. 015 0. 01 0. 1 =4 0. 005 0 -0. 005 1 0. 05 0. 5 1 0. 5 0 0 -0. 5 25 50 100 -0. 5 250 500 -1 -1 Figure 2: Left: plot of the asymptotic form of the EK (sc/r)J1(2scr) for D = 2 and = 1225. Right: log-log plot of against log() for the OU and Matern-class processes ( = 2, 4 respectively). The dashed lines have gradients of -1/2 and -3/2 which are the predicted rates. on the MSE of standard GP prediction from finite datasets. In experiments this bound provides a good approximation to the actual average MSE for large dataset size n [11]. This supports our approach of using the EK to understand the learning behaviour of GP regression. Treating the denominator in the expression for B again as a hard cutoff at s = sc, which is justified for large, one obtains for an SE target and learner 2sc/ (log())D/2/. To get analogous predictions for the mismatched case, one can write equation 7 as 2 [1 + 2/(S(s))] - 2/(S(s)) S = d (s) s + ds. [1 + 2/(S(s))]2 [S(s)/2 + 1]2 The first integral is smaller than (2/2) B and can be neglected as long as B. In the second integral we can again make the cutoff approximation--though now with s having to be above sc to get the scaling sD-1S s (s) ds. For target functions with a c power-law decay S(s) s- of the power spectrum at large s this predicts sD- c (log())(D-)/2. So we generically get slow logarithmic learning, consistent with the observations in [12]. For D = 1 and an OU target ( = 2) we obtain (log())-1/2, and for the Matern-class covariance function k(r) = (1 + r/ ) exp(-r/ ) (which has power spectrum (3/ 2 + 42s2)-2, so = 4) we get (log())-3/2. These predictions were tested experimentally using a GP learner with SE covariance function ( = 0. 1 and assumed noise level 2 = 0. 1) against targets from the OU and Matern-class priors (with = 0. 05) and with noise level 2 = 0. 01, averaging over 100 replications for each value of. To demonstrate the predicted power-law dependence of on log(), in Figure 2(right) we make a log-log plot of against log(). The dashed lines show the gradients of -1/2 and -3/2 and we observe good agreement between experimental and theoretical results for large. 3. 1 Using the Equivalent Kernel in Kernel Regression Above we have used the EK to understand how standard GP regression works. One could alternatively envisage using the EK to perform kernel regression, on given finite data sets, producing a prediction -1 h(x i - xi)yi at x. Intuitively this seems appealing as a cheap alternative to full GP regression, particularly for kernels such as the SE where the EK can be calculated analytically, at least to a good approximation. We now analyze briefly how such an EK predictor would perform compared to standard GP prediction. Letting denote averaging over noise, training input points and the test point and setting f(x) = h(x, x)(x)dx, the average MSE of the EK predictor is pred = [(x) - (1/) h(x, x i i)yi]2 = [(x) - f(x)]2 + 2 h2(x, x )dx + 1 h2(x, x )2(x )dx - 1 f 2 (x) 2 (2/)S 2 ds = (s)/S2(s) + 2 /2 ds + [1 + 2/(S(s))]2 [1 + 2/(S(s))]2 Here we have set 2 = ( dx)-1 2(x) dx = S(s) ds for the spatial average of the squared target amplitude. Taking the matched case, (S(s) = S(s) and 2 = 2) as an example, the first term (which is the one we get for the prediction from "smoothed out" training inputs, see eq. 7) is of order 2sD c /, while the second one is 2 sD c /. Thus both terms scale in the same way, but the ratio of the second term to the first is the signal- to-noise ratio 2 /2, which in practice is often large. The EK predictor will then perform significantly worse than standard GP prediction, by a roughly constant factor, and we have confirmed this prediction numerically. This result is somewhat surprising given the good agreement between the weight function h(x) and the EK that we saw in figure 1, leading to the conclusion that the detailed structure of the weight function is important for optimal prediction from finite data sets. In summary, we have derived accurate approximations for the equivalent kernel (EK) of GP regression with the widely used squared exponential kernel, and have shown that the same analysis in fact extends to a whole class of kernels. We have also demonstrated that EKs provide a simple means of understanding the learning behaviour of GP regression, even in cases where the learner's covariance function is not well matched to the structure of the target function. In future work, it will be interesting to explore in more detail the use of the EK in kernel smoothing. This is suboptimal compared to standard GP regression as we saw. However, it does remain feasible even for very large datasets, and may then be competitive with sparse methods for approximating GP regression. From the theoretical point of view, the average error of the EK predictor which we calculated may also provide the basis for useful upper bounds on GP learning curves. Acknowledgments: This work was supported in part by the IST Programme of the Eu- ropean Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.

NeurIPS Conference 2003 Conference Paper

Extreme Components Analysis

Max Welling
Christopher Williams
Felix Agakov

Principal components analysis (PCA) is one of the most widely used techniques in machine learning and data mining. Minor components analysis (MCA) is less well known, but can also play an important role in the presence of constraints on the data distribution. In this paper we present a probabilistic model for “extreme components analysis” (XCA) which at the maximum likelihood solution extracts an optimal combina- tion of principal and minor components. For a given number of compo- nents, the log-likelihood of the XCA model is guaranteed to be larger or equal than that of the probabilistic models for PCA and MCA. We de- scribe an efﬁcient algorithm to solve for the globally optimal solution. For log-convex spectra we prove that the solution consists of principal components only, while for log-concave spectra the solution consists of minor components. In general, the solution admits a combination of both. In experiments we explore the properties of XCA on some synthetic and real-world datasets.

NeurIPS Conference 2002 Conference Paper

Learning About Multiple Objects in Images: Factorial Learning without Factorial Search

Christopher Williams
Michalis Titsias

We consider data which are images containing views of multiple objects. Our task is to learn about each of the objects present in the images. This task can be approached as a factorial learning problem, where each image must be explained by instantiating a model for each of the objects present with the correct instantiation parameters. A major problem with learning a factorial model is that as the number of objects increases, there is a combinatorial explosion of the number of conﬁgurations that need to be considered. We develop a method to extract object models sequentially from the data by making use of a robust statistical method, thus avoid- ing the combinatorial explosion, and present results showing successful extraction of objects from real images.

NeurIPS Conference 2002 Conference Paper

The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum

Christopher Williams
John Shawe-Taylor

In this paper we analyze the relationships between the eigenvalues of the m x m Gram matrix K for a kernel k(·, .) corresponding to a sample Xl, .. ., Xm drawn from a density p(x) and the eigenvalues of the corresponding continuous eigenproblem. We bound the dif(cid: 173) ferences between the two spectra and provide a performance bound on kernel peA.

NeurIPS Conference 2001 Conference Paper

Products of Gaussians

Christopher Williams
Felix Agakov
Stephen Felderhof

Recently Hinton (1999) has introduced the Products of Experts (PoE) model in which several individual probabilistic models for data are combined to provide an overall model of the data. Be(cid: 173) low we consider PoE models in which each expert is a Gaussian. Although the product of Gaussians is also a Gaussian, if each Gaus(cid: 173) sian has a simple structure the product can have a richer structure. We examine (1) Products of Gaussian pancakes which give rise to probabilistic Minor Components Analysis, (2) products of I-factor PPCA models and (3) a products of experts construction for an AR(l) process. Recently Hinton (1999) has introduced the Products of Experts (PoE) model in which several individual probabilistic models for data are combined to provide an overall model of the data. In this paper we consider PoE models in which each expert is a Gaussian. It is easy to see that in this case the product model will also be Gaussian. However, if each Gaussian has a simple structure, the product can have a richer structure. Using Gaussian experts is attractive as it permits a thorough analysis of the product architecture, which can be difficult with other models, e. g. models defined over discrete random variables. Below we examine three cases of the products of Gaussians construction: (1) Prod(cid: 173) ucts of Gaussian pancakes (PoGP) which give rise to probabilistic Minor Compo(cid: 173) nents Analysis (MCA), providing a complementary result to probabilistic Principal Components Analysis (PPCA) obtained by Tipping and Bishop (1999); (2) Prod(cid: 173) ucts of I-factor PPCA models; (3) A products of experts construction for an AR(l) process. Products of Gaussians If each expert is a Gaussian pi(xI8i ) '" N(J1i' ( i), the resulting distribution of the product of m Gaussians may be expressed as By completing the square in the exponent it may be easily shown that p(xI8) N(/1; E, (2: ), where (E l = 2: :1 (i l. To simplify the following derivations we will assume that pi(xI8i ) '" N(O, (i) and thus that p(xI8) '" N(O, (2: ). J12: i ° can be obtained by translation of the coordinate system. 1 Products of Gaussian Pancakes A Gaussian "pancake" (GP) is a d-dimensional Gaussian, contracted in one dimen(cid: 173) sion and elongated in the other d - 1 dimensions. In this section we show that the maximum likelihood solution for a product of Gaussian pancakes (PoGP) yields a probabilistic formulation of Minor Components Analysis (MCA). 1. 1 Covariance Structure of a GP Expert Consider a d-dimensional Gaussian whose probability contours are contracted in the direction w and equally elongated in mutually orthogonal directions VI, .. ., vd-l. We call this a Gaussian pancake or GP. Its inverse covariance may be written as

NeurIPS Conference 2000 Conference Paper

On a Connection between Kernel PCA and Metric Multidimensional Scaling

Christopher Williams

In this paper we show that the kernel peA algorithm of Sch6lkopf et al (1998) can be interpreted as a form of metric multidimensional scaling (MDS) when the kernel function k(x, y) is isotropic, i. e. it depends only on Ilx - yll. This leads to a metric MDS algorithm where the desired configuration of points is found via the solution of an eigenproblem rather than through the iterative optimization of the stress objective function. The question of kernel choice is also discussed.

NeurIPS Conference 2000 Conference Paper

Using the Nyström Method to Speed Up Kernel Machines

Christopher Williams
Matthias Seeger

Abstract Unavailable

NeurIPS Conference 1999 Conference Paper

A MCMC Approach to Hierarchical Mixture Modelling

Christopher Williams

There are many hierarchical clustering algorithms available, but these lack a firm statistical basis. Here we set up a hierarchical probabilistic mixture model, where data is generated in a hierarchical tree-structured manner. Markov chain Monte Carlo (MCMC) methods are demonstrated which can be used to sample from the posterior distribution over trees containing variable numbers of hidden units.

NeurIPS Conference 1998 Conference Paper

Adding Constrained Discontinuities to Gaussian Process Models of Wind Fields

Dan Cornford
Ian Nabney
Christopher Williams

Gaussian Processes provide good prior models for spatial data, but can be too smooth. In many physical situations there are discontinuities along bounding surfaces, for example fronts in near-surface wind fields. We describe a modelling method for such a constrained discontinuity and demonstrate how to infer the model parameters in wind fields with MCMC sampling.

NeurIPS Conference 1998 Conference Paper

Discovering Hidden Features with Gaussian Processes Regression

Francesco Vivarelli
Christopher Williams

We study the dynamics of supervised learning in layered neural net(cid: 173) works, in the regime where the size p of the training set is proportional to the number N of inputs. Here the local fields are no longer described by Gaussian distributions. We use dynamical replica theory to predict the evolution of macroscopic observables, including the relevant error measures, incorporating the old formalism in the limit piN --t 00.

NeurIPS Conference 1998 Conference Paper

DTs: Dynamic Trees

Christopher Williams
Nicholas Adams

In this paper we introduce a new class of image models, which we call dynamic trees or DTs. A dynamic tree model specifies a prior over a large number of trees, each one of which is a tree-structured belief net (TSBN). Experiments show that DTs are capable of generating images that are less blocky, and the models have better translation invariance properties than a fixed, "balanced" TSBN. We also show that Simulated Annealing is effective at finding trees which have high posterior probability.

NeurIPS Conference 1998 Conference Paper

Finite-Dimensional Approximation of Gaussian Processes

Giancarlo Ferrari-Trecate
Christopher Williams
Manfred Opper

Gaussian process (GP) prediction suffers from O(n3) scaling with the data set size n. By using a finite-dimensional basis to approximate the GP predictor, the computational complexity can be reduced. We de(cid: 173) rive optimal finite-dimensional predictors under a number of assump(cid: 173) tions, and show the superiority of these predictors over the Projected Bayes Regression method (which is asymptotically optimal). We also show how to calculate the minimal model size for a given n. The calculations are backed up by numerical experiments.

NeurIPS Conference 1997 Conference Paper

Regression with Input-dependent Noise: A Gaussian Process Treatment

Paul Goldberg
Christopher Williams
Christopher Bishop

Gaussian processes provide natural non-parametric prior distribu(cid: 173) tions over regression functions. In this paper we consider regression problems where there is noise on the output, and the variance of the noise depends on the inputs. If we assume that the noise is a smooth function of the inputs, then it is natural to model the noise variance using a second Gaussian process, in addition to the Gaussian process governing the noise-free output value. We show that prior uncertainty about the parameters controlling both pro(cid: 173) cesses can be handled and that the posterior distribution of the noise rate can be sampled from using Markov chain Monte Carlo methods. Our results on a synthetic data set give a posterior noise variance that well-approximates the true variance. 1 Background and Motivation A very natural approach to regression problems is to place a prior on the kinds of function that we expect, and then after observing the data to obtain a posterior. The prior can be obtained by placing prior distributions on the weights in a neural 494 P. W Goldberg, C. K. L Williams and C. M. Bishop network, although we would argue that it is perhaps more natural to place priors di(cid: 173) rectly over functions. One tractable way of doing this is to create a Gaussian process prior. This has the advantage that predictions can be made from the posterior using only matrix multiplication for fixed hyperparameters and a global noise level. In contrast, for neural networks (with fixed hyperparameters and a global noise level) it is necessary to use approximations or Markov chain Monte Carlo (MCMC) meth(cid: 173) ods. Rasmussen (1996) has demonstrated that predictions obtained with Gaussian processes are as good as or better than other state-of-the art predictors. In much of the work on regression problems in the statistical and neural networks literatures, it is assumed that there is a global noise level, independent of the input vector x. The book by Bishop (1995) and the papers by Bishop (1994), MacKay (1995) and Bishop and Qazaz (1997) have examined the case of input-dependent (Such models are said to noise for parametric models such as neural networks. heteroscedastic in the statistics literature. ) In this paper we develop the treatment of an input-dependent noise model for Gaussian process regression, where the noise is assumed to be Gaussian but its variance depends on x. As the noise level is non(cid: 173) negative we place a Gaussian process prior on the log noise level. Thus there are two Gaussian processes involved in making predictions: the usual Gaussian process for predicting the function values (the y-process), and another one (the z-process) for predicting the log noise level. Below we present a Markov chain Monte Carlo method for carrying out inference with this model and demonstrate its performance on a test problem. 1. 1 Gaussian processes A stochastic process is a collection of random variables {Y(x)lx E X} indexed by a set X. Often X will be a space such as 'R, d for some dimension d, although it could be more general. The stochastic process is specified by giving the probability distribution for every finite subset of variables Y(Xl), .. ., Y(Xk) in a consistent manner. A Gaussian process is a stochastic process which can be fully specified by its mean function J. L(x) = E[Y(x)] and its covariance function Cp(x, x') = E[(Y(x)-J. L(x»)(Y(x')-J. L(x'»]; any finite set of points will have a joint multivariate Gaussian distribution. Below we consider Gaussian processes which have J. L(x) == O. This assumes that any known offset or trend in the data has been. removed. A non-zero I' (x ) is easily incorporated into the framework at the expense of extra notational complexity. A covariance junction is used to define a Gaussian process; it is a parametrised function from pairs of x-values to their covariance. The form of the covariance function that we shall use for the prior over functions is given by Cy(x(i), xU» =vyexp (-~ tWYl(x~i) _x~j»2) + Jy 8(i, j)

NeurIPS Conference 1996 Conference Paper

Computing with Infinite Networks

Christopher Williams

For neural networks with a wide class of weight-priors, it can be shown that in the limit of an infinite number of hidden units the prior over functions tends to a Gaussian process. In this paper an(cid: 173) alytic forms are derived for the covariance function of the Gaussian processes corresponding to networks with sigmoidal and Gaussian hidden units. This allows predictions to be made efficiently using networks with an infinite number of hidden units, and shows that, somewhat paradoxically, it may be easier to compute with infinite networks than finite ones.

NeurIPS Conference 1996 Conference Paper

Gaussian Processes for Bayesian Classification via Hybrid Monte Carlo

David Barber
Christopher Williams

The full Bayesian method for applying neural networks to a pre(cid: 173) diction problem is to set up the prior/hyperprior structure for the net and then perform the necessary integrals. However, these inte(cid: 173) grals are not tractable analytically, and Markov Chain Monte Carlo (MCMC) methods are slow, especially if the parameter space is high-dimensional. Using Gaussian processes we can approximate the weight space integral analytically, so that only a small number of hyperparameters need be integrated over by MCMC methods. We have applied this idea to classification problems, obtaining ex(cid: 173) cellent results on the real-world problems investigated so far.

NeurIPS Conference 1996 Conference Paper

GTM: A Principled Alternative to the Self-Organizing Map

Christopher Bishop
Markus Svensén
Christopher Williams

The Self-Organizing Map (SOM) algorithm has been extensively studied and has been applied with considerable success to a wide variety of problems. However, the algorithm is derived from heuris(cid: 173) tic ideas and this leads to a number of significant limitations. In this paper, we consider the problem of modelling the probabil(cid: 173) ity density of data in a space of several dimensions in terms of a smaller number of latent, or hidden, variables. We introduce a novel form of latent variable model, which we call the GTM algo(cid: 173) rithm (for Generative Topographic Mapping), which allows general non-linear transformations from latent space to data space, and which is trained using the EM (expectation-maximization) algo(cid: 173) rithm. Our approach overcomes the limitations of the SOM, while introducing no significant disadvantages. We demonstrate the per(cid: 173) formance of the GTM algorithm on simulated data from flow diag(cid: 173) nostics for a multi-phase oil pipeline.

NeurIPS Conference 1995 Conference Paper

EM Optimization of Latent-Variable Density Models

Christopher Bishop
Markus Svensén
Christopher Williams

There is currently considerable interest in developing general non(cid: 173) linear density models based on latent, or hidden, variables. Such models have the ability to discover the presence of a relatively small number of underlying 'causes' which, acting in combination, give rise to the apparent complexity of the observed data set. Unfortu(cid: 173) nately, to train such models generally requires large computational effort. In this paper we introduce a novel latent variable algorithm which retains the general non-linear capabilities of previous models but which uses a training procedure based on the EM algorithm. We demonstrate the performance of the model on a toy problem and on data from flow diagnostics for a multi-phase oil pipeline.

NeurIPS Conference 1995 Conference Paper

Gaussian Processes for Regression

Christopher Williams
Carl Rasmussen

The Bayesian analysis of neural networks is difficult because a sim(cid: 173) ple prior over weights implies a complex prior distribution over functions. In this paper we investigate the use of Gaussian process priors over functions, which permit the predictive Bayesian anal(cid: 173) ysis for fixed values of hyperparameters to be carried out exactly using matrix operations. Two methods, using optimization and av(cid: 173) eraging (via Hybrid Monte Carlo) over hyperparameters have been tested on a number of challenging problems and have produced excellent results.

NeurIPS Conference 1994 Conference Paper

Using a neural net to instantiate a deformable model

Christopher Williams
Michael Revow
Geoffrey Hinton

Deformable models are an attractive approach to recognizing non(cid: 173) rigid objects which have considerable within class variability. How(cid: 173) ever, there are severe search problems associated with fitting the models to data. We show that by using neural networks to provide better starting points, the search time can be significantly reduced. The method is demonstrated on a character recognition task. In previous work we have developed an approach to handwritten character recogni(cid: 173) tion based on the use of deformable models (Hinton, Williams and Revow, 1992a; Revow, Williams and Hinton, 1993). We have obtained good performance with this method, but a major problem is that the search procedure for fitting each model to an image is very computationally intensive, because there is no efficient algorithm (like dynamic programming) for this task. In this paper we demonstrate that it is possible to "compile down" some of the knowledge gained while fitting models to data to obtain better starting points that significantly reduce the search time. 1 DEFORMABLE MODELS FOR DIGIT RECOGNITION The basic idea in using deformable models for digit recognition is that each digit has a model, and a test image is classified by finding the model which is most likely to have generated it. The quality of the match between model and test image depends on the deformation of the model, the amount of ink that is attributed to noise and the distance of the remaining ink from the deformed model. ·Current address: Department of Computer Science and Applied Mathematics, Aston University, Birmingham B4 7ET, UK. 966 Christopher K. T. Williams, Michael D. Revow, Geoffrey E. Hinton More formally, the two important terms in assessing the fit are the prior probabil(cid: 173) ity distribution for the instantiation parameters of a model (which penalizes very distorted models), and the imaging model that characterizes the probability distri(cid: 173) bution over possible images given the instantiated model l. Let I be an image, M be a model and z be its instantiation parameters. Then the evidence for model M is given by P(IIM) = J P(zIM)P(IIM, z)dz (1) The first term in the integrand is the prior on the instantiation parameters and the second is the imaging model i. e. , the likelihood of the data given the instantiated model. P(MII) is directly proportional to P(IIM), as we assume a uniform prior on each digit. Equation 1 is formally correct, but if z has more than a few dimensions the evalua(cid: 173) tion of this integral is very computationally intensive. However, it is often possible to make an approximation based on the assumption that the integrand is strongly peaked around a (global) maximum value z*. In this case, the evidence can be ap(cid: 173) proximated by the highest peak of the integrand times a volume factor ~(zII, M), which measures the sharpness of the peak2. P(IIM) ~ P(z IM)P(Ilz, M)~(zII, M) (2) By Taylor expanding around z* to second order it can be shown that the volume factor depends on the determinant of the Hessian of 10gP(z, 11M). Taking logs of equation 2, defining EdeJ as the negative log of P(z*IM), and EJit as the cor(cid: 173) responding term for the imaging model, then the aim of the search is to find the minimum of E tot = EdeJ + EJit. Of course the total energy will have many local minima; for the character recognition task we aim to find the global minimum by using a continuation method (see section 1. 2). 1. 1 SPLINES, AFFINE TRANSFORMS AND IMAGING MODELS This section presents a brief overview of our work on using deformable models for digit recognition. For a fuller treatment, see Revow, Williams and Hinton (1993). Each digit is modelled by a cubic B-spline whose shape is determined by the posi(cid: 173) tions of the control points in the object-based frame. The models have eight control points, except for the one model which has three, and the seven model which has five. To generate an ideal example of a digit the control points are positioned at their "home" locations. Deformed characters are produced by perturbing the con(cid: 173) trol points away from their home locations. The home locations and covariance matrix for each model were adapted in order to improve the performance. The deformation energy only penalizes shape deformations. Affine transformations, i. e. , translation, rotation, dilation, elongation, and shear, do not change the under(cid: 173) lying shape of an object so we want the deformation energy to be invariant under them. We achieve this by giving each model its own "object-based frame" and computing the deformation energy relative to this frame. lThis framework has been used by many authors, e. g. Grenander et al (1991). 2The Gaussian approximation has been popularized in the neural net community by

NeurIPS Conference 1992 Conference Paper

Directional-Unit Boltzmann Machines

Richard Zemel
Christopher Williams
Michael Mozer

We present a general formulation for a network of stochastic di(cid: 173) rectional units. This formulation is an extension of the Boltzmann machine in which the units are not binary, but take on values in a cyclic range, between 0 and 271' radians. The state of each unit in a Directional-Unit Boltzmann Machine (DUBM) is described by a complex variable, where the phase component specifies a direction; the weights are also complex variables. We associate a quadratic energy function, and corresponding probability, with each DUBM configuration. The conditional distribution of a unit's stochastic state is a circular version of the Gaussian probability distribution, known as the von Mises distribution. In a mean-field approxima(cid: 173) tion to a stochastic DUBM, the phase component of a unit's state represents its mean direction, and the magnitude component spec(cid: 173) ifies the degree of certainty associated with this direction. This combination of a value and a certainty provides additional repre(cid: 173) sentational power in a unit. We describe a learning algorithm and simulations that demonstrate a mean-field DUBM'S ability to learn interesting mappings. Many kinds of information can naturally be represented in terms of angular, or directional, variables. A circular range forms a suitable representation for explicitly directional information, such as wind direction, as well as for information where the underlying range is periodic, such as days of the week or months of the year. In computer vision, tangent fields and optic flow fields are represented as fields of oriented line segments, each of which can be described by a magnitude and direction. Directions can also be used to represent a set of symbolic labels, e. g. , object label A at 0, and object label B at 71'/2 radians. We discuss below some advantages of representing symbolic labels with directional units.

NeurIPS Conference 1991 Conference Paper

Adaptive Elastic Models for Hand-Printed Character Recognition

Geoffrey Hinton
Christopher Williams
Michael Revow

Hand-printed digits can be modeled as splines that are governed by about 8 control points. For each known digit. the control points have preferred. , home" locations, and deformations of the digit are generated by moving the control points away from their home locations. Images of digits can be produced by placing Gaussian ink generators uniformly along the spline. Real images can be recognized by finding the digit model most likely to have generated the data. For each digit model we use an elastic matching algorithm to minimize an energy function that includes both the defor(cid: 173) mation energy of the digit model and the log probability that the model would generate the inked pixels in the image. The model with the lowest total energy wins. If a uniform noise process is included in the model of image generation, some of the inked pixels can be rejected as noise as a digit model is fitting a poorly segmented image. The digit models learn by modifying the home locations of the control points.