Arrow Research search

Author name cluster

Mark van der Wilk

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

38 papers
2 author rows

Possible papers

38

ICLR Conference 2025 Conference Paper

A Meta-Learning Approach to Bayesian Causal Discovery

  • Anish Dhir
  • Matthew Ashman
  • James Requeima
  • Mark van der Wilk

Discovering a unique causal structure is difficult due to both inherent identifiability issues, and the consequences of finite data. As such, uncertainty over causal structures, such as those obtained from a Bayesian posterior, are often necessary for downstream tasks. Finding an accurate approximation to this posterior is challenging, due to the large number of possible causal graphs, as well as the difficulty in the subproblem of finding posteriors over the functional relationships of the causal edges. Recent works have used Bayesian meta learning to view the problem of posterior estimation as a supervised learning task. Yet, these methods are limited as they cannot reliably sample from the posterior over causal structures and fail to encode key properties of the posterior, such as correlation between edges and permutation equivariance with respect to nodes. To address these limitations, we propose a Bayesian meta learning model that allows for sampling causal structures from the posterior and encodes these key properties. We compare our meta-Bayesian causal discovery against existing Bayesian causal discovery methods, demonstrating the advantages of directly learning a posterior over causal structure.

ICML Conference 2025 Conference Paper

Adjusting Model Size in Continual Gaussian Processes: How Big is Big Enough?

  • Guiomar Pescador-Barrios
  • Sarah Filippi
  • Mark van der Wilk

Many machine learning models require setting a parameter that controls their size before training, e. g. number of neurons in DNNs, or inducing points in GPs. Increasing capacity typically improves performance until all the information from the dataset is captured. After this point, computational cost keeps increasing, without improved performance. This leads to the question "How big is big enough? " We investigate this problem for Gaussian processes (single-layer neural networks) in continual learning. Here, data becomes available incrementally, and the final dataset size will therefore not be known before training, preventing the use of heuristics for setting a fixed model size. We develop a method to automatically adjust model size while maintaining near-optimal performance. Our experimental procedure follows the constraint that any hyperparameters must be set without seeing dataset properties, and we show that our method performs well across diverse datasets without the need to adjust its hyperparameter, showing it requires less tuning than others.

ICML Conference 2025 Conference Paper

Continuous Bayesian Model Selection for Multivariate Causal Discovery

  • Anish Dhir
  • Ruby Sedgwick
  • Avinash Kori
  • Ben Glocker
  • Mark van der Wilk

Current causal discovery approaches require restrictive model assumptions in the absence of interventional data to ensure structure identifiability. These assumptions often do not hold in real-world applications leading to a loss of guarantees and poor performance in practice. Recent work has shown that, in the bivariate case, Bayesian model selection can greatly improve performance by exchanging restrictive modelling for more flexible assumptions, at the cost of a small probability of making an error. Our work shows that this approach is useful in the important multivariate case as well. We propose a scalable algorithm leveraging a continuous relaxation of the discrete model selection problem. Specifically, we employ the Causal Gaussian Process Conditional Density Estimator (CGP-CDE) as a Bayesian non-parametric model, using its hyperparameters to construct an adjacency matrix. This matrix is then optimised using the marginal likelihood and an acyclicity regulariser, giving the maximum a posteriori causal graph. We demonstrate the competitiveness of our approach, showing it is advantageous to perform multivariate causal discovery without infeasible assumptions using Bayesian model selection.

NeurIPS Conference 2025 Conference Paper

Estimating Interventional Distributions with Uncertain Causal Graphs through Meta-Learning

  • Anish Dhir
  • Cristiana Diaconu
  • Valentinian Lungu
  • James Requeima
  • Richard Turner
  • Mark van der Wilk

In scientific domains---from biology to the social sciences---many questions boil down to \textit{What effect will we observe if we intervene on a particular variable? } If the causal relationships (e. g. ~a causal graph) are known, its possible to estimate the intervention distributions. In the absence of this domain knowledge, the causal structure must be discovered from the available observational data. However, observational data are often compatible with multiple causal graphs, making methods that commit to a single structure prone to overconfidence. A principled way to manage this structural uncertainty is via Bayesian inference, which averages over a posterior distribution on possible causal structures and functional mechanisms. Unfortunately, the number of causal structures grows super-exponentially with the number of nodes in the graph, making computations intractable. We propose to circumvent these challenges by using meta-learning to create an end-to-end model: the Model-Averaged Causal Estimation Transformer Neural Process (MACE-TNP). The model is trained to predict the Bayesian model-averaged interventional posterior distribution, and its end-to-end nature bypasses the need for expensive calculations. Empirically, we demonstrate that MACE-TNP outperforms strong Bayesian baselines. Our work established meta-learning as a flexible and scalable paradigm for approximating complex Bayesian causal inference, that can be scaled to increasingly challenging settings in the future.

ICML Conference 2025 Conference Paper

Rethinking Aleatoric and Epistemic Uncertainty

  • Freddie Bickford Smith
  • Jannik Kossen
  • Eleanor Trollope
  • Mark van der Wilk
  • Adam Foster 0001
  • Tom Rainforth

The ideas of aleatoric and epistemic uncertainty are widely used to reason about the probabilistic predictions of machine-learning models. We identify incoherence in existing discussions of these ideas and suggest this stems from the aleatoric-epistemic view being insufficiently expressive to capture all the distinct quantities that researchers are interested in. To address this we present a decision-theoretic perspective that relates rigorous notions of uncertainty, predictive performance and statistical dispersion in data. This serves to support clearer thinking as the field moves forward. Additionally we provide insights into popular information-theoretic quantities, showing they can be poor estimators of what they are often purported to measure, while also explaining how they can still be useful in guiding data acquisition.

TMLR Journal 2025 Journal Article

System-Aware Neural ODE Processes for Few-Shot Bayesian Optimization

  • Jixiang Qing
  • Rebecca D. Langdon
  • Robert Matthew Lee
  • Behrang Shafei
  • Mark van der Wilk
  • Calvin Tsay
  • Ruth Misener

We consider the problem of optimizing initial conditions and termination time in dynamical systems governed by unknown ordinary differential equations (ODEs), where evaluating different initial conditions is costly and the state's value can not be measured in real-time but only with a delay while the measuring device processes the sample. To identify the optimal conditions in limited trials, we introduce a few-shot Bayesian Optimization (BO) framework based on the system's prior information. At the core of our approach is the System-Aware Neural ODE Processes (SANODEP), an extension of Neural ODE Processes (NODEP) designed to meta-learn ODE systems from multiple trajectories using a novel context embedding block. We further develop a two-stage BO framework to effectively incorporate search space constraints, enabling efficient optimization of both initial conditions and observation timings. We conduct extensive experiments showcasing SANODEP's potential for few-shot BO within dynamical systems. We also explore SANODEP's adaptability to varying levels of prior information, highlighting the trade-off between prior flexibility and model fitting accuracy.

ICML Conference 2024 Conference Paper

Bivariate Causal Discovery using Bayesian Model Selection

  • Anish Dhir
  • Samuel Power
  • Mark van der Wilk

Much of the causal discovery literature prioritises guaranteeing the identifiability of causal direction in statistical models. For structures within a Markov equivalence class, this requires strong assumptions which may not hold in real-world datasets, ultimately limiting the usability of these methods. Building on previous attempts, we show how to incorporate causal assumptions within the Bayesian framework. Identifying causal direction then becomes a Bayesian model selection problem. This enables us to construct models with realistic assumptions, and consequently allows for the differentiation between Markov equivalent causal structures. We analyse why Bayesian model selection works in situations where methods based on maximum likelihood fail. To demonstrate our approach, we construct a Bayesian non-parametric model that can flexibly model the joint distribution. We then outperform previous methods on a wide range of benchmark datasets with varying data generating assumptions.

ICML Conference 2024 Conference Paper

Learning in Deep Factor Graphs with Gaussian Belief Propagation

  • Seth Nabarro
  • Mark van der Wilk
  • Andrew J. Davison

We propose an approach to do learning in Gaussian factor graphs. We treat all relevant quantities (inputs, outputs, parameters, activations) as random variables in a graphical model, and view training and prediction as inference problems with different observed nodes. Our experiments show that these problems can be efficiently solved with belief propagation (BP), whose updates are inherently local, presenting exciting opportunities for distributed and asynchronous training. Our approach can be scaled to deep networks and provides a natural means to do continual learning: use the BP-estimated posterior of the current task as a prior for the next. On a video denoising task we demonstrate the benefit of learnable parameters over a classical factor graph approach and we show encouraging performance of deep factor graphs for continual image classification.

NeurIPS Conference 2024 Conference Paper

Noether's Razor: Learning Conserved Quantities

  • Tycho F. van der Ouderaa
  • Mark van der Wilk
  • Pim De Haan

Symmetries have proven useful in machine learning models, improving generalisation and overall performance. At the same time, recent advancements in learning dynamical systems rely on modelling the underlying Hamiltonian to guarantee the conservation of energy. These approaches can be connected via a seminal result in mathematical physics: Noether's theorem, which states that symmetries in a dynamical system correspond to conserved quantities. This work uses Noether's theorem to parameterise symmetries as learnable conserved quantities. We then allow conserved quantities and associated symmetries to be learned directly from train data through approximate Bayesian model selection, jointly with the regular training procedure. As training objective, we derive a variational lower bound to the marginal likelihood. The objective automatically embodies an Occam's Razor effect that avoids collapse of conversation laws to the trivial constant, without the need to manually add and tune additional regularisers. We demonstrate a proof-of-principle on n-harmonic oscillators and n-body systems. We find that our method correctly identifies the correct conserved quantities and U(n) and SE(n) symmetry groups, improving overall performance and predictive accuracy on test data.

JMLR Journal 2024 Journal Article

Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees

  • Alexander Terenin
  • David R. Burt
  • Artem Artemev
  • Seth Flaxman
  • Mark van der Wilk
  • Carl Edward Rasmussen
  • Hong Ge

Gaussian processes are frequently deployed as part of larger machine learning and decision-making systems, for instance in geospatial modeling, Bayesian optimization, or in latent Gaussian models. Within a system, the Gaussian process model needs to perform in a stable and reliable manner to ensure it interacts correctly with other parts of the system. In this work, we study the numerical stability of scalable sparse approximations based on inducing points. To do so, we first review numerical stability, and illustrate typical situations in which Gaussian process models can be unstable. Building on stability theory originally developed in the interpolation literature, we derive sufficient and in certain cases necessary conditions on the inducing points for the computations performed to be numerically stable. For low-dimensional tasks such as geospatial modeling, we propose an automated method for computing inducing points satisfying these conditions. This is done via a modification of the cover tree data structure, which is of independent interest. We additionally propose an alternative sparse approximation for regression with a Gaussian likelihood which trades off a small amount of performance to further improve stability. We provide illustrative examples showing the relationship between stability of calculations and predictive performance of inducing point methods on spatial tasks. [abs] [ pdf ][ bib ] [ code ] &copy JMLR 2024. ( edit, beta )

NeurIPS Conference 2024 Conference Paper

Transition Constrained Bayesian Optimization via Markov Decision Processes

  • Jose P. Folch
  • Calvin Tsay
  • Robert M. Lee
  • Behrang Shafei
  • Weronika Ormaniec
  • Andreas Krause
  • Mark van der Wilk
  • Ruth Misener

Bayesian optimization is a methodology to optimize black-box functions. Traditionally, it focuses on the setting where you can arbitrarily query the search space. However, many real-life problems do not offer this flexibility; in particular, the search space of the next query may depend on previous ones. Example challenges arise in the physical sciences in the form of local movement constraints, required monotonicity in certain variables, and transitions influencing the accuracy of measurements. Altogether, such transition constraints necessitate a form of planning. This work extends classical Bayesian optimization via the framework of Markov Decision Processes. We iteratively solve a tractable linearization of our utility function using reinforcement learning to obtain a policy that plans ahead for the entire horizon. This is a parallel to the optimization of an acquisition function in policy space. The resulting policy is potentially history-dependent and non-Markovian. We showcase applications in chemical reactor optimization, informative path planning, machine calibration, and other synthetic examples.

NeurIPS Conference 2023 Conference Paper

Learning Layer-wise Equivariances Automatically using Gradients

  • Tycho van der Ouderaa
  • Alexander Immer
  • Mark van der Wilk

Convolutions encode equivariance symmetries into neural networks leading to better generalisation performance. However, symmetries provide fixed hard constraints on the functions a network can represent, need to be specified in advance, and can not be adapted. Our goal is to allow flexible symmetry constraints that can automatically be learned from data using gradients. Learning symmetry and associated weight connectivity structures from scratch is difficult for two reasons. First, it requires efficient and flexible parameterisations of layer-wise equivariances. Secondly, symmetries act as constraints and are therefore not encouraged by training losses measuring data fit. To overcome these challenges, we improve parameterisations of soft equivariance and learn the amount of equivariance in layers by optimising the marginal likelihood, estimated using differentiable Laplace approximations. The objective balances data fit and model complexity enabling layer-wise symmetry discovery in deep networks. We demonstrate the ability to automatically learn layer-wise equivariances on image classification tasks, achieving equivalent or improved performance over baselines with hard-coded symmetry.

ICML Conference 2023 Conference Paper

Stochastic Marginal Likelihood Gradients using Neural Tangent Kernels

  • Alexander Immer
  • Tycho F. A. van der Ouderaa
  • Mark van der Wilk
  • Gunnar Rätsch
  • Bernhard Schölkopf

Selecting hyperparameters in deep learning greatly impacts its effectiveness but requires manual effort and expertise. Recent works show that Bayesian model selection with Laplace approximations can allow to optimize such hyperparameters just like standard neural network parameters using gradients and on the training data. However, estimating a single hyperparameter gradient requires a pass through the entire dataset, limiting the scalability of such algorithms. In this work, we overcome this issue by introducing lower bounds to the linearized Laplace approximation of the marginal likelihood. In contrast to previous estimators, these bounds are amenable to stochastic-gradient-based optimization and allow to trade off estimation accuracy against computational complexity. We derive them using the function-space form of the linearized Laplace, which can be estimated using the neural tangent kernel. Experimentally, we show that the estimators can significantly accelerate gradient-based hyperparameter optimization.

ICLR Conference 2022 Conference Paper

Bayesian Neural Network Priors Revisited

  • Vincent Fortuin
  • Adrià Garriga-Alonso
  • Sebastian W. Ober
  • Florian Wenzel
  • Gunnar Rätsch
  • Richard E. Turner
  • Mark van der Wilk
  • Laurence Aitchison

Isotropic Gaussian priors are the de facto standard for modern Bayesian neural network inference. However, it is unclear whether these priors accurately reflect our true beliefs about the weight distributions or give optimal performance. To find better priors, we study summary statistics of neural network weights in networks trained using stochastic gradient descent (SGD). We find that convolutional neural network (CNN) and ResNet weights display strong spatial correlations, while fully connected networks (FCNNs) display heavy-tailed weight distributions. We show that building these observations into priors can lead to improved performance on a variety of image classification datasets. Surprisingly, these priors mitigate the cold posterior effect in FCNNs, but slightly increase the cold posterior effect in ResNets.

UAI Conference 2022 Conference Paper

Data augmentation in Bayesian neural networks and the cold posterior effect

  • Seth Nabarro
  • Stoil Ganev
  • Adrià Garriga-Alonso
  • Vincent Fortuin
  • Mark van der Wilk
  • Laurence Aitchison

Bayesian neural networks that incorporate data augmentation implicitly use a “randomly perturbed log-likelihood [which] does not have a clean interpretation as a valid likelihood function” (Izmailov et al. 2021). Here, we provide several approaches to developing principled Bayesian neural networks incorporating data augmentation. We introduce a “finite orbit” setting which allows valid likelihoods to be computed exactly, and for the more usual “full orbit” setting we derive multi-sample bounds tighter than those used previously. These models cast light on the origin of the cold posterior effect. In particular, we find that the cold posterior effect persists even in these principled models incorporating data augmentation. This suggests that the cold posterior effect cannot be dismissed as an artifact of data augmentation using incorrect likelihoods.

NeurIPS Conference 2022 Conference Paper

Invariance Learning in Deep Neural Networks with Differentiable Laplace Approximations

  • Alexander Immer
  • Tycho van der Ouderaa
  • Gunnar Rätsch
  • Vincent Fortuin
  • Mark van der Wilk

Data augmentation is commonly applied to improve performance of deep learning by enforcing the knowledge that certain transformations on the input preserve the output. Currently, the data augmentation parameters are chosen by human effort and costly cross-validation, which makes it cumbersome to apply to new datasets. We develop a convenient gradient-based method for selecting the data augmentation without validation data during training of a deep neural network. Our approach relies on phrasing data augmentation as an invariance in the prior distribution on the functions of a neural network, which allows us to learn it using Bayesian model selection. This has been shown to work in Gaussian processes, but not yet for deep neural networks. We propose a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective, which can be optimised without human supervision or validation data. We show that our method can successfully recover invariances present in the data, and that this improves generalisation and data efficiency on image datasets.

UAI Conference 2022 Conference Paper

Learning invariant weights in neural networks

  • Tycho F. A. van der Ouderaa
  • Mark van der Wilk

Assumptions about invariances or symmetries in data can significantly increase the predictive power of statistical models. Many commonly used machine learning models are constraint to respect certain symmetries, such as translation equivariance in convolutional neural networks, and incorporating other symmetry types is actively being studied. Yet, learning invariances from the data itself remains an open research problem. It has been shown that the marginal likelihood offers a principled way to learn invariances in Gaussian Processes. We propose a weight-space equivalent to this approach, by minimizing a lower bound on the marginal likelihood to learn invariances in neural networks, resulting in naturally higher performing models.

NeurIPS Conference 2022 Conference Paper

Memory safe computations with XLA compiler

  • Artem Artemev
  • Yuze An
  • Tilman Roeder
  • Mark van der Wilk

Software packages like TensorFlow and PyTorch are designed to support linear algebra operations, and their speed and usability determine their success. However, by prioritising speed, they often neglect memory requirements. As a consequence, the implementations of memory-intensive algorithms that are convenient in terms of software design can often not be run for large problems due to memory overflows. Memory-efficient solutions require complex programming approaches with significant logic outside the computational framework. This impairs the adoption and use of such algorithms. To address this, we developed an XLA compiler extension that adjusts the computational data-flow representation of an algorithm according to a user-specified memory limit. We show that k-nearest neighbour, sparse Gaussian process regression methods and Transformers can be run on a single device at a much larger scale, where standard implementations would have failed. Our approach leads to better use of hardware resources. We believe that further focus on removing memory constraints at a compiler level will widen the range of machine learning methods that can be developed in the future.

NeurIPS Conference 2022 Conference Paper

Relaxing Equivariance Constraints with Non-stationary Continuous Filters

  • Tycho van der Ouderaa
  • David W. Romero
  • Mark van der Wilk

Equivariances provide useful inductive biases in neural network modeling, with the translation equivariance of convolutional neural networks being a canonical example. Equivariances can be embedded in architectures through weight-sharing and place symmetry constraints on the functions a neural network can represent. The type of symmetry is typically fixed and has to be chosen in advance. Although some tasks are inherently equivariant, many tasks do not strictly follow such symmetries. In such cases, equivariance constraints can be overly restrictive. In this work, we propose a parameter-efficient relaxation of equivariance that can effectively interpolate between a (i) non-equivariant linear product, (ii) a strict-equivariant convolution, and (iii) a strictly-invariant mapping. The proposed parameterisation can be thought of as a building block to allow adjustable symmetry structure in neural networks. In addition, we demonstrate that the amount of equivariance can be learned from the training data using backpropagation. Gradient-based learning of equivariance achieves similar or improved performance compared to the best value found by cross-validation and outperforms baselines with partial or strict equivariance on CIFAR-10 and CIFAR-100 image classification tasks.

NeurIPS Conference 2022 Conference Paper

SnAKe: Bayesian Optimization with Pathwise Exploration

  • Jose Pablo Folch
  • Shiqiang Zhang
  • Robert Lee
  • Behrang Shafei
  • David Walz
  • Calvin Tsay
  • Mark van der Wilk
  • Ruth Misener

"Bayesian Optimization is a very effective tool for optimizing expensive black-box functions. Inspired by applications developing and characterizing reaction chemistry using droplet microfluidic reactors, we consider a novel setting where the expense of evaluating the function can increase significantly when making large input changes between iterations. We further assume we are working asynchronously, meaning we have to decide on new queries before we finish evaluating previous experiments. This paper investigates the problem and introduces 'Sequential Bayesian Optimization via Adaptive Connecting Samples' (SnAKe), which provides a solution by considering large batches of queries and preemptively building optimization paths that minimize input costs. We investigate some convergence properties and empirically show that the algorithm is able to achieve regret similar to classical Bayesian Optimization algorithms in both the synchronous and asynchronous settings, while reducing the input costs significantly. We show the method is robust to the choice of its single hyper-parameter and provide a parameter-free alternative. "

UAI Conference 2021 Conference Paper

Correlated weights in infinite limits of deep convolutional neural networks

  • Adrià Garriga-Alonso
  • Mark van der Wilk

Infinite width limits of deep neural networks often have tractable forms. They have been used to analyse the behaviour of finite networks, as well as being useful methods in their own right. When investigating infinitely wide convolutional neural networks (CNNs), it was observed that the correlations arising from spatial weight sharing disappear in the infinite limit. This is undesirable, as spatial correlation is the main motivation behind CNNs. We show that the loss of this property is not a consequence of the infinite limit, but rather of choosing an independent weight prior. Correlating the weights maintains the correlations in the activations. Varying the amount of correlation interpolates between independent-weight limits and mean-pooling. Empirical evaluation of the infinitely wide network shows that optimal performance is achieved between the extremes, indicating that correlations can be useful.

NeurIPS Conference 2021 Conference Paper

Deep Neural Networks as Point Estimates for Deep Gaussian Processes

  • Vincent Dutordoir
  • James Hensman
  • Mark van der Wilk
  • Carl Henrik Ek
  • Zoubin Ghahramani
  • Nicolas Durrande

Neural networks and Gaussian processes are complementary in their strengths and weaknesses. Having a better understanding of their relationship comes with the promise to make each method benefit from the strengths of the other. In this work, we establish an equivalence between the forward passes of neural networks and (deep) sparse Gaussian process models. The theory we develop is based on interpreting activation functions as interdomain inducing features through a rigorous analysis of the interplay between activation functions and kernels. This results in models that can either be seen as neural networks with improved uncertainty prediction or deep Gaussian processes with increased prediction accuracy. These claims are supported by experimental results on regression and classification datasets.

NeurIPS Conference 2021 Conference Paper

Speedy Performance Estimation for Neural Architecture Search

  • Robin Ru
  • Clare Lyle
  • Lisa Schut
  • Miroslav Fil
  • Mark van der Wilk
  • Yarin Gal

Reliable yet efficient evaluation of generalisation performance of a proposed architecture is crucial to the success of neural architecture search (NAS). Traditional approaches face a variety of limitations: training each architecture to completion is prohibitively expensive, early stopped validation accuracy may correlate poorly with fully trained performance, and model-based estimators require large training sets. We instead propose to estimate the final test performance based on a simple measure of training speed. Our estimator is theoretically motivated by the connection between generalisation and training speed, and is also inspired by the reformulation of a PAC-Bayes bound under the Bayesian setting. Our model-free estimator is simple, efficient, and cheap to implement, and does not require hyperparameter-tuning or surrogate training before deployment. We demonstrate on various NAS search spaces that our estimator consistently outperforms other alternatives in achieving better correlation with the true test performance rankings. We further show that our estimator can be easily incorporated into both query-based and one-shot NAS methods to improve the speed or quality of the search.

UAI Conference 2021 Conference Paper

The promises and pitfalls of deep kernel learning

  • Sebastian W. Ober
  • Carl E. Rasmussen
  • Mark van der Wilk

Deep kernel learning and related techniques promise to combine the representational power of neural networks with the reliable uncertainty estimates of Gaussian processes. One crucial aspect of these models is an expectation that, because they are treated as Gaussian process models optimized using the marginal likelihood, they are protected from overfitting. However, we identify pathological behavior, including overfitting, on a simple toy example. We explore this pathology, explaining its origins and considering how it applies to real datasets. Through careful experimentation on UCI datasets, CIFAR-10, and the UTKFace dataset, we find that the overfitting from overparameterized deep kernel learning, in which the model is “somewhat Bayesian”, can in certain scenarios be worse than that from not being Bayesian at all. However, we find that a fully Bayesian treatment of deep kernel learning can rectify this overfitting and obtain the desired performance improvements over standard neural networks and Gaussian processes.

ICML Conference 2021 Conference Paper

Tighter Bounds on the Log Marginal Likelihood of Gaussian Process Regression Using Conjugate Gradients

  • Artem Artemev
  • David R. Burt
  • Mark van der Wilk

We propose a lower bound on the log marginal likelihood of Gaussian process regression models that can be computed without matrix factorisation of the full kernel matrix. We show that approximate maximum likelihood learning of model parameters by maximising our lower bound retains many benefits of the sparse variational approach while reducing the bias introduced into hyperparameter learning. The basis of our bound is a more careful analysis of the log-determinant term appearing in the log marginal likelihood, as well as using the method of conjugate gradients to derive tight lower bounds on the term involving a quadratic form. Our approach is a step forward in unifying methods relying on lower bound maximisation (e. g. variational methods) and iterative approaches based on conjugate gradients for training Gaussian processes. In experiments, we show improved predictive performance with our model for a comparable amount of training time compared to other conjugate gradient based approaches.

NeurIPS Conference 2020 Conference Paper

A Bayesian Perspective on Training Speed and Model Selection

  • Clare Lyle
  • Lisa Schut
  • Robin Ru
  • Yarin Gal
  • Mark van der Wilk

We take a Bayesian perspective to illustrate a connection between training speed and the marginal likelihood in linear models. This provides two major insights: first, that a measure of a model's training speed can be used to estimate its marginal likelihood. Second, that this measure, under certain conditions, predicts the relative weighting of models in linear model combinations trained to minimize a regression loss. We verify our results in model selection tasks for linear models and for the infinite-width limit of deep neural networks. We further provide encouraging empirical evidence that the intuition developed in these settings also holds for deep neural networks trained with stochastic gradient descent. Our results suggest a promising new direction towards explaining why neural networks trained with stochastic gradient descent are biased towards functions that generalize well.

JMLR Journal 2020 Journal Article

Convergence of Sparse Variational Inference in Gaussian Processes Regression

  • David R. Burt
  • Carl Edward Rasmussen
  • Mark van der Wilk

Gaussian processes are distributions over functions that are versatile and mathematically convenient priors in Bayesian modelling. However, their use is often impeded for data with large numbers of observations, $N$, due to the cubic (in $N$) cost of matrix operations used in exact inference. Many solutions have been proposed that rely on $M \ll N$ inducing variables to form an approximation at a cost of $\mathcal{O}\left(NM^2\right)$. While the computational cost appears linear in $N$, the true complexity depends on how $M$ must scale with $N$ to ensure a certain quality of the approximation. In this work, we investigate upper and lower bounds on how $M$ needs to grow with $N$ to ensure high quality approximations. We show that we can make the KL-divergence between the approximate model and the exact posterior arbitrarily small for a Gaussian-noise regression model with $M \ll N$. Specifically, for the popular squared exponential kernel and $D$-dimensional Gaussian distributed covariates, $M = \mathcal{O}((\log N)^D)$ suffice and a method with an overall computational cost of $\mathcal{O}\left(N(\log N)^{2D}(\log \log N)^2\right)$ can be used to perform inference. [abs] [ pdf ][ bib ] [ code ] &copy JMLR 2020. ( edit, beta )

NeurIPS Conference 2020 Conference Paper

Stochastic Segmentation Networks: Modelling Spatially Correlated Aleatoric Uncertainty

  • Miguel Monteiro
  • Loic Le Folgoc
  • Daniel Coelho de Castro
  • Nick Pawlowski
  • Bernardo Marques
  • Konstantinos Kamnitsas
  • Mark van der Wilk
  • Ben Glocker

In image segmentation, there is often more than one plausible solution for a given input. In medical imaging, for example, experts will often disagree about the exact location of object boundaries. Estimating this inherent uncertainty and predicting multiple plausible hypotheses is of great interest in many applications, yet this ability is lacking in most current deep learning methods. In this paper, we introduce stochastic segmentation networks (SSNs), an efficient probabilistic method for modelling aleatoric uncertainty with any image segmentation network architecture. In contrast to approaches that produce pixel-wise estimates, SSNs model joint distributions over entire label maps and thus can generate multiple spatially coherent hypotheses for a single image. By using a low-rank multivariate normal distribution over the logit space to model the probability of the label map given the image, we obtain a spatially consistent probability distribution that can be efficiently computed by a neural network without any changes to the underlying architecture. We tested our method on the segmentation of real-world medical data, including lung nodules in 2D CT and brain tumours in 3D multimodal MRI scans. SSNs outperform state-of-the-art for modelling correlated uncertainty in ambiguous images while being much simpler, more flexible, and more efficient.

NeurIPS Conference 2019 Conference Paper

Bayesian Layers: A Module for Neural Network Uncertainty

  • Dustin Tran
  • Mike Dusenberry
  • Mark van der Wilk
  • Danijar Hafner

We describe Bayesian Layers, a module designed for fast experimentation with neural network uncertainty. It extends neural network libraries with drop-in replacements for common layers. This enables composition via a unified abstraction over deterministic and stochastic functions and allows for scalability via the underlying system. These layers capture uncertainty over weights (Bayesian neural nets), pre-activation units (dropout), activations ( stochastic output layers''), or the function itself (Gaussian processes). They can also be reversible to propagate uncertainty from input to output. We include code examples for common architectures such as Bayesian LSTMs, deep GPs, and flow-based models. As demonstration, we fit a 5-billion parameter Bayesian Transformer'' on 512 TPUv2 cores for uncertainty in machine translation and a Bayesian dynamics model for model-based planning. Finally, we show how Bayesian Layers can be used within the Edward2 language for probabilistic programming with stochastic processes.

ICML Conference 2019 Conference Paper

Overcoming Mean-Field Approximations in Recurrent Gaussian Process Models

  • Alessandro Davide Ialongo
  • Mark van der Wilk
  • James Hensman
  • Carl E. Rasmussen

We identify a new variational inference scheme for dynamical systems whose transition function is modelled by a Gaussian process. Inference in this setting has either employed computationally intensive MCMC methods, or relied on factorisations of the variational posterior. As we demonstrate in our experiments, the factorisation between latent system states and transition function can lead to a miscalibrated posterior and to learning unnecessarily large noise terms. We eliminate this factorisation by explicitly modelling the dependence between state trajectories and the low-rank representation of our Gaussian process posterior. Samples of the latent states can then be tractably generated by conditioning on this representation. The method we obtain gives better predictive performance and more calibrated estimates of the transition function, yet maintains the same time and space complexities as mean-field methods.

ICML Conference 2019 Conference Paper

Rates of Convergence for Sparse Variational Gaussian Process Regression

  • David R. Burt
  • Carl E. Rasmussen
  • Mark van der Wilk

Excellent variational approximations to Gaussian process posteriors have been developed which avoid the $\mathcal{O}\left(N^3\right)$ scaling with dataset size $N$. They reduce the computational cost to $\mathcal{O}\left(NM^2\right)$, with $M\ll N$ the number of inducing variables, which summarise the process. While the computational cost seems to be linear in $N$, the true complexity of the algorithm depends on how $M$ must increase to ensure a certain quality of approximation. We show that with high probability the KL divergence can be made arbitrarily small by growing $M$ more slowly than $N$. A particular case is that for regression with normally distributed inputs in D-dimensions with the Squared Exponential kernel, $M=\mathcal{O}(\log^D N)$ suffices. Our results show that as datasets grow, Gaussian process posteriors can be approximated cheaply, and provide a concrete rule for how to increase $M$ in continual learning scenarios.

NeurIPS Conference 2019 Conference Paper

Scalable Bayesian dynamic covariance modeling with variational Wishart and inverse Wishart processes

  • Creighton Heaukulani
  • Mark van der Wilk

We implement gradient-based variational inference routines for Wishart and inverse Wishart processes, which we apply as Bayesian models for the dynamic, heteroskedastic covariance matrix of a multivariate time series. The Wishart and inverse Wishart processes are constructed from i. i. d. Gaussian processes, existing variational inference algorithms for which form the basis of our approach. These methods are easy to implement as a black-box and scale favorably with the length of the time series, however, they fail in the case of the Wishart process, an issue we resolve with a simple modification into an additive white noise parameterization of the model. This modification is also key to implementing a factored variant of the construction, allowing inference to additionally scale to high-dimensional covariance matrices. Through experimentation, we demonstrate that some (but not all) model variants outperform multivariate GARCH when forecasting the covariances of returns on financial instruments.

NeurIPS Conference 2018 Conference Paper

Learning Invariances using the Marginal Likelihood

  • Mark van der Wilk
  • Matthias Bauer
  • ST John
  • James Hensman

In many supervised learning tasks, learning what changes do not affect the predic-tion target is as crucial to generalisation as learning what does. Data augmentationis a common way to enforce a model to exhibit an invariance: training data is modi-fied according to an invariance designed by a human and added to the training data. We argue that invariances should be incorporated the model structure, and learnedusing themarginal likelihood, which can correctly reward the reduced complexityof invariant models. We incorporate invariances in a Gaussian process, due to goodmarginal likelihood approximations being available for these models. Our maincontribution is a derivation for a variational inference scheme for invariant Gaussianprocesses where the invariance is described by a probability distribution that canbe sampled from, much like how data augmentation is implemented in practice

IJCAI Conference 2017 Conference Paper

Concrete Problems for Autonomous Vehicle Safety: Advantages of Bayesian Deep Learning

  • Rowan McAllister
  • Yarin Gal
  • Alex Kendall
  • Mark van der Wilk
  • Amar Shah
  • Roberto Cipolla
  • Adrian Weller

Autonomous vehicle (AV) software is typically composed of a pipeline of individual components, linking sensor inputs to motor outputs. Erroneous component outputs propagate downstream, hence safe AV software must consider the ultimate effect of each component’s errors. Further, improving safety alone is not sufficient. Passengers must also feel safe to trust and use AV systems. To address such concerns, we investigate three under-explored themes for AV research: safety, interpretability, and compliance. Safety can be improved by quantifying the uncertainties of component outputs and propagating them forward through the pipeline. Interpretability is concerned with explaining what the AV observes and why it makes the decisions it does, building reassurance with the passenger. Compliance refers to maintaining some control for the passenger. We discuss open challenges for research within these themes. We highlight the need for concrete evaluation metrics, propose example problems, and highlight possible solutions.

NeurIPS Conference 2017 Conference Paper

Convolutional Gaussian Processes

  • Mark van der Wilk
  • Carl Edward Rasmussen
  • James Hensman

We present a practical way of introducing convolutional structure into Gaussian processes, making them more suited to high-dimensional inputs like images. The main contribution of our work is the construction of an inter-domain inducing point approximation that is well-tailored to the convolutional kernel. This allows us to gain the generalisation benefit of a convolutional kernel, together with fast but accurate posterior inference. We investigate several variations of the convolutional kernel, and apply it to MNIST and CIFAR-10, where we obtain significant improvements over existing Gaussian process models. We also show how the marginal likelihood can be used to find an optimal weighting between convolutional and RBF kernels to further improve performance. This illustration of the usefulness of the marginal likelihood may help automate discovering architectures in larger models.

JMLR Journal 2017 Journal Article

GPflow: A Gaussian Process Library using TensorFlow

  • Alexander G. de G. Matthews
  • Mark van der Wilk
  • Tom Nickson
  • Keisuke Fujii
  • Alexis Boukouvalas
  • Pablo León-Villagrá
  • Zoubin Ghahramani
  • James Hensman

GPflow is a Gaussian process library that uses TensorFlow for its core computations and Python for its front end. The distinguishing features of GPflow are that it uses variational inference as the primary approximation method, provides concise code through the use of automatic differentiation, has been engineered with a particular emphasis on software testing and is able to exploit GPU hardware. [abs] [ pdf ][ bib ] [ code ] [ webpage ] &copy JMLR 2017. ( edit, beta )

NeurIPS Conference 2016 Conference Paper

Understanding Probabilistic Sparse Gaussian Process Approximations

  • Matthias Bauer
  • Mark van der Wilk
  • Carl Edward Rasmussen

Good sparse approximations are essential for practical inference in Gaussian Processes as the computational cost of exact methods is prohibitive for large datasets. The Fully Independent Training Conditional (FITC) and the Variational Free Energy (VFE) approximations are two recent popular methods. Despite superficial similarities, these approximations have surprisingly different theoretical properties and behave differently in practice. We thoroughly investigate the two methods for regression both analytically and through illustrative examples, and draw conclusions to guide practical application.

NeurIPS Conference 2014 Conference Paper

Distributed Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models

  • Yarin Gal
  • Mark van der Wilk
  • Carl Edward Rasmussen

Gaussian processes (GPs) are a powerful tool for probabilistic inference over functions. They have been applied to both regression and non-linear dimensionality reduction, and offer desirable properties such as uncertainty estimates, robustness to over-fitting, and principled ways for tuning hyper-parameters. However the scalability of these models to big datasets remains an active topic of research. We introduce a novel re-parametrisation of variational inference for sparse GP regression and latent variable models that allows for an efficient distributed algorithm. This is done by exploiting the decoupling of the data given the inducing points to re-formulate the evidence lower bound in a Map-Reduce setting. We show that the inference scales well with data and computational resources, while preserving a balanced distribution of the load among the nodes. We further demonstrate the utility in scaling Gaussian processes to big data. We show that GP performance improves with increasing amounts of data in regression (on flight data with 2 million records) and latent variable modelling (on MNIST). The results show that GPs perform better than many common models often used for big data.