Author name cluster

Arnaud Doucet

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

80 papers

2 author rows

ICML Conference 2025 Conference Paper

Accelerated Diffusion Models via Speculative Sampling

Valentin De Bortoli
Alexandre Galashov
Arthur Gretton
Arnaud Doucet

Speculative sampling is a popular technique for accelerating inference in Large Language Models by generating candidate tokens using a fast draft model and then accepting or rejecting them based on the target model’s distribution. While speculative sampling was previously limited to discrete sequences, we extend it to diffusion models, which generate samples via continuous, vector-valued Markov chains. In this context, the target model is a high-quality but computationally expensive diffusion model. We propose various drafting strategies, including a simple and effective approach that does not require training a draft model and is applicable out-of-the-box to any diffusion model. We demonstrate significant generation speedup on various diffusion models, halving the number of function evaluations while generating exact samples from the target model. Finally, we also show how this procedure can be used to accelerate Langevin diffusions to sample unnormalized distributions.

TMLR Journal 2025 Journal Article

Conformalized Credal Regions for Classification with Ambiguous Ground Truth

Michele Caprio
David Stutz
Shuo Li
Arnaud Doucet

An open question in Imprecise Probabilistic Machine Learning is how to empirically derive a credal region (i.e., a closed and convex family of probabilities on the output space) from the available data, without any prior knowledge or assumption. In classification problems, credal regions are a tool that is able to provide provable guarantees under realistic assumptions by characterizing the uncertainty about the distribution of the labels. Building on previous work, we show that credal regions can be directly constructed using conformal methods. This allows us to provide a novel extension of classical conformal prediction to problems with ambiguous ground truth, that is, when the exact labels for given inputs are not exactly known. The resulting construction enjoys desirable practical and theoretical properties: (i) conformal coverage guarantees, (ii) smaller prediction sets (compared to classical conformal prediction regions) and (iii) disentanglement of uncertainty sources (epistemic, aleatoric). We empirically verify our findings on both synthetic and real datasets.

ICML Conference 2025 Conference Paper

Distributional Diffusion Models with Scoring Rules

Valentin De Bortoli
Alexandre Galashov
J. Swaroop Guntupalli
Guangyao Zhou
Kevin Murphy 0002
Arthur Gretton
Arnaud Doucet

Diffusion models generate high-quality synthetic data. They operate by defining a continuous-time forward process which gradually adds Gaussian noise to data until fully corrupted. The corresponding reverse process progressively “denoises" a Gaussian sample into a sample from the data distribution. However, generating high-quality outputs requires many discretization steps to obtain a faithful approximation of the reverse process. This is expensive and has motivated the development of many acceleration methods. We propose to speed up sample generation by learning the posterior distribution of clean data samples given their noisy versions, instead of only the mean of this distribution. This allows us to sample from the probability transitions of the reverse process on a coarse time scale, significantly accelerating inference with minimal degradation of the quality of the output. This is accomplished by replacing the standard regression loss used to estimate conditional means with a scoring rule. We validate our method on image and robot trajectory generation, where we consistently outperform standard diffusion models at few discretization steps.

ICML Conference 2025 Conference Paper

Feynman-Kac Correctors in Diffusion: Annealing, Guidance, and Product of Experts

Marta Skreta
Tara Akhound-Sadegh
Viktor Ohanesian
Roberto Bondesan
Alán Aspuru-Guzik
Arnaud Doucet
Rob Brekelmans
Alexander Tong 0001

While score-based generative models are the model of choice across diverse domains, there are limited tools available for controlling inference-time behavior in a principled manner, e. g. for composing multiple pretrained models. Existing classifier-free guidance methods use a simple heuristic to mix conditional and unconditional scores to approximately sample from conditional distributions. However, such methods do not approximate the intermediate distributions, necessitating additional ‘corrector’ steps. In this work, we provide an efficient and principled method for sampling from a sequence of annealed, geometric-averaged, or product distributions derived from pretrained score-based models. We derive a weighted simulation scheme which we call Feynman-Kac Correctors (FKCs) based on the celebrated Feynman-Kac formula by carefully accounting for terms in the appropriate partial differential equations (PDEs). To simulate these PDEs, we propose Sequential Monte Carlo (SMC) resampling algorithms that leverage inference-time scaling to improve sampling quality. We empirically demonstrate the utility of our methods by proposing amortized sampling via inference-time temperature annealing, improving multi-objective molecule generation using pretrained models, and improving classifier-free guidance for text-to-image generation.

NeurIPS Conference 2025 Conference Paper

Progressive Inference-Time Annealing of Diffusion Models for Sampling from Boltzmann Densities

Tara Akhound-Sadegh
Jungyoon Lee
Joey Bose
Valentin De Bortoli
Arnaud Doucet
Michael Bronstein
Dominique Beaini
Siamak Ravanbakhsh

Sampling efficiently from a target unnormalized probability density remains a core challenge, with relevance across countless high-impact scientific applications. A promising approach towards this challenge is the design of amortized samplers that borrow key ideas, such as probability path design, from state-of-the-art generative diffusion models. However, all existing diffusion-based samplers remain unable to draw samples from distributions at the scale of even simple molecular systems. In this paper, we propose Progressive Inference-Time Annealing (PITA) a novel framework to learn diffusion-based samplers that combines two complementary interpolation techniques: I. ) Annealing of the Boltzmann distribution and II. ) Diffusion smoothing. PITA trains a sequence of diffusion models from high to low temperatures by sequentially training each model at progressively higher temperatures, leveraging engineered easy access to samples of the temperature-annealed target density. In the subsequent step, PITA enables simulating the trained diffusion model to *procure training samples at a lower temperature* for the next diffusion model through inference-time annealing using a novel Feynman-Kac PDE combined with Sequential Monte Carlo. Empirically, PITA enables, for the first time, equilibrium sampling of $N$-body particle systems, Alanine Dipeptide, and tripeptides in Cartesian coordinates with dramatically lower energy function evaluations.

TMLR Journal 2024 Journal Article

Error Bounds for Flow Matching Methods

Joe Benton
George Deligiannidis
Arnaud Doucet

Score-based generative models are a popular class of generative modelling techniques relying on stochastic differential equations (SDEs). From their inception, it was realized that it was also possible to perform generation using ordinary differential equations (ODEs) rather than SDEs. This led to the introduction of the probability flow ODE approach and denoising diffusion implicit models. Flow matching methods have recently further extended these ODE-based approaches and approximate a flow between two arbitrary probability distributions. Previous work derived bounds on the approximation error of diffusion models under the stochastic sampling regime, given assumptions on the $L^2$ loss. We present error bounds for the flow matching procedure using fully deterministic sampling, assuming an $L^2$ bound on the approximation error and a certain regularity condition on the data distributions.

ICLR Conference 2024 Conference Paper

Nearly d-Linear Convergence Bounds for Diffusion Models via Stochastic Localization

Joe Benton
Valentin De Bortoli
Arnaud Doucet
George Deligiannidis

Denoising diffusions are a powerful method to generate approximate samples from high-dimensional data distributions. Recent results provide polynomial bounds on their convergence rate, assuming $L^2$-accurate scores. Until now, the tightest bounds were either superlinear in the data dimension or required strong smoothness assumptions. We provide the first convergence bounds which are linear in the data dimension (up to logarithmic factors) assuming only finite second moments of the data distribution. We show that diffusion models require at most $\tilde O(\frac{d \log^2(1/\delta)}{\varepsilon^2})$ steps to approximate an arbitrary distribution on $\mathbb{R}^d$ corrupted with Gaussian noise of variance $\delta$ to within $\varepsilon^2$ in KL divergence. Our proof extends the Girsanov-based methods of previous works. We introduce a refined treatment of the error from discretizing the reverse SDE inspired by stochastic localization.

ICML Conference 2024 Conference Paper

Particle Denoising Diffusion Sampler

Angus Phillips
Hai-Dang Dau
Michael John Hutchinson
Valentin De Bortoli
George Deligiannidis
Arnaud Doucet

Denoising diffusion models have become ubiquitous for generative modeling. The core idea is to transport the data distribution to a Gaussian by using a diffusion. Approximate samples from the data distribution are then obtained by estimating the time-reversal of this diffusion using score matching ideas. We follow here a similar strategy to sample from unnormalized probability densities and compute their normalizing constants. However, the time-reversed diffusion is here simulated by using an original iterative particle scheme relying on a novel score matching loss. Contrary to standard denoising diffusion models, the resulting Particle Denoising Diffusion Sampler (PDDS) provides asymptotically consistent estimates under mild assumptions. We demonstrate PDDS on multimodal and high dimensional sampling tasks.

NeurIPS Conference 2024 Conference Paper

Schrodinger Bridge Flow for Unpaired Data Translation

Valentin De Bortoli
Iryna Korshunova
Andriy Mnih
Arnaud Doucet

Mass transport problems arise in many areas of machine learning whereby one wants to compute a map transporting one distribution to another. Generative modeling techniques like Generative Adversarial Networks (GANs) and Denoising Diffusion Models (DMMs) have been successfully adapted to solve such transport problems, resulting in CycleGAN and Bridge Matching respectively. However, these methods do not approximate Optimal Transport (OT) maps, which are known to have desirable properties. Existing techniques approximating OT maps for high-dimensional data-rich problems, including DDMs-based Rectified Flow and Schrodinger bridge procedures, require fully training a DDM-type model at each iteration, or use mini-batch techniques which can introduce significant errors. We propose a novel algorithm to compute the Schrodinger bridge, a dynamic entropy-regularized version of OT, that eliminates the need to train multiple DDMs-like models. This algorithm corresponds to a discretization of a flow of path measures, referred to as the Schrodinger Bridge Flow, whose only stationary point is the Schrodinger bridge. We demonstrate the performance of our algorithm on a variety of unpaired data translation tasks.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Score-Optimal Diffusion Schedules

Christopher Williams
Andrew Campbell
Arnaud Doucet
Saifuddin Syed

Denoising diffusion models (DDMs) offer a flexible framework for sampling from high dimensional data distributions. DDMs generate a path of probability distributions interpolating between a reference Gaussian distribution and a data distribution by incrementally injecting noise into the data. To numerically simulate the sampling process, a discretisation schedule from the reference back towards clean data must be chosen. An appropriate discretisation schedule is crucial to obtain high quality samples. However, beyond hand crafted heuristics, a general method for choosing this schedule remains elusive. This paper presents a novel algorithm for adaptively selecting an optimal discretisation schedule with respect to a cost that we derive. Our cost measures the work done by the simulation procedure to transport samples from one point in the diffusion path to the next. Our method does not require hyperparameter tuning and adapts to the dynamics and geometry of the diffusion path. Our algorithm only involves the evaluation of the estimated Stein score, making it scalable to existing pre-trained models at inference time and online during training. We find that our learned schedule recovers performant schedules previously only discovered through manual search and obtains competitive FID scores on image datasets.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Simplified and Generalized Masked Diffusion for Discrete Data

Jiaxin Shi
Kehang Han
Zhe Wang
Arnaud Doucet
Michalis Titsias

Masked (or absorbing) diffusion is actively explored as an alternative to autoregressive models for generative modeling of discrete data. However, existing work in this area has been hindered by unnecessarily complex model formulations and unclear relationships between different perspectives, leading to suboptimal parameterization, training objectives, and ad hoc adjustments to counteract these issues. In this work, we aim to provide a simple and general framework that unlocks the full potential of masked diffusion models. We show that the continuous-time variational objective of masked diffusion models is a simple weighted integral of cross-entropy losses. Our framework also enables training generalized masked diffusion models with state-dependent masking schedules. When evaluated by perplexity, our models trained on OpenWebText surpass prior diffusion language models at GPT-2 scale and demonstrate superior performance on 4 out of 5 zero-shot language modeling tasks. Furthermore, our models vastly outperform previous discrete diffusion models on pixel-level image modeling, achieving 2. 75 (CIFAR-10) and 3. 40 (ImageNet 64x64) bits per dimension that are better than autoregressive models of similar sizes.

PDF Details DOI

NeurIPS Conference 2023 Conference Paper

A Unified Framework for U-Net Design and Analysis

Christopher Williams
Fabian Falck
George Deligiannidis
Chris C Holmes
Arnaud Doucet
Saifuddin Syed

U-Nets are a go-to neural architecture across numerous tasks for continuous signals on a square such as images and Partial Differential Equations (PDE), however their design and architecture is understudied. In this paper, we provide a framework for designing and analysing general U-Net architectures. We present theoretical results which characterise the role of the encoder and decoder in a U-Net, their high-resolution scaling limits and their conjugacy to ResNets via preconditioning. We propose Multi-ResNets, U-Nets with a simplified, wavelet-based encoder without learnable parameters. Further, we show how to design novel U-Net architectures which encode function constraints, natural bases, or the geometry of the data. In diffusion models, our framework enables us to identify that high-frequency information is dominated by noise exponentially faster, and show how U-Nets with average pooling exploit this. In our experiments, we demonstrate how Multi-ResNets achieve competitive and often superior performance compared to classical U-Nets in image segmentation, PDE surrogate modelling, and generative modelling with diffusion models. Our U-Net framework paves the way to study the theoretical properties of U-Nets and design natural, scalable neural architectures for a multitude of problems beyond the square.

JMLR Journal 2023 Journal Article

Alpha-divergence Variational Inference Meets Importance Weighted Auto-Encoders: Methodology and Asymptotics

Kamélia Daudel
Joe Benton
Yuyang Shi
Arnaud Doucet

Several algorithms involving the Variational Rényi (VR) bound have been proposed to minimize an alpha-divergence between a target posterior distribution and a variational distribution. Despite promising empirical results, those algorithms resort to biased stochastic gradient descent procedures and thus lack theoretical guarantees. In this paper, we formalize and study the VR-IWAE bound, a generalization of the importance weighted auto-encoder (IWAE) bound. We show that the VR-IWAE bound enjoys several desirable properties and notably leads to the same stochastic gradient descent procedure as the VR bound in the reparameterized case, but this time by relying on unbiased gradient estimators. We then provide two complementary theoretical analyses of the VR-IWAE bound and thus of the standard IWAE bound. Those analyses shed light on the benefits or lack thereof of these bounds. Lastly, we illustrate our theoretical claims over toy and real-data examples. [abs] [ pdf ][ bib ] &copy JMLR 2023. ( edit, beta )

TMLR Journal 2023 Journal Article

Conformal prediction under ambiguous ground truth

David Stutz
Abhijit Guha Roy
Tatiana Matejovicova
Patricia Strachan
Ali Taylan Cemgil
Arnaud Doucet

Conformal Prediction (CP) allows to perform rigorous uncertainty quantification by constructing a prediction set $C(X)$ satisfying $\mathbb{P}(Y \in C(X))\geq 1-\alpha$ for a user-chosen $\alpha \in [0,1]$ by relying on calibration data $(X_1,Y_1),...,(X_n,Y_n)$ from $\mathbb{P}=\mathbb{P}^{X} \otimes \mathbb{P}^{Y|X}$. It is typically implicitly assumed that $\mathbb{P}^{Y|X}$ is the ``true'' posterior label distribution. However, in many real-world scenarios, the labels $Y_1,...,Y_n$ are obtained by aggregating expert opinions using a voting procedure, resulting in a one-hot distribution $\mathbb{P}_{\textup{vote}}^{Y|X}$. This is the case for most datasets, even well-known ones like ImageNet. For such ``voted'' labels, CP guarantees are thus w.r.t. $\mathbb{P}_{\textup{vote}}=\mathbb{P}^X \otimes \mathbb{P}_{\textup{vote}}^{Y|X}$ rather than the true distribution $\mathbb{P}$. In cases with unambiguous ground truth labels, the distinction between $\mathbb{P}_{\textup{vote}}$ and $\mathbb{P}$ is irrelevant. However, when experts do not agree because of ambiguous labels, approximating $\mathbb{P}^{Y|X}$ with a one-hot distribution $\mathbb{P}_{\textup{vote}}^{Y|X}$ ignores this uncertainty. In this paper, we propose to leverage expert opinions to approximate $\mathbb{P}^{Y|X}$ using a non-degenerate distribution $\mathbb{P}_{\textup{agg}}^{Y|X}$. We then develop \emph{Monte Carlo CP} procedures which provide guarantees w.r.t. $\mathbb{P}_{\textup{agg}}=\mathbb{P}^X \otimes \mathbb{P}_{\textup{agg}}^{Y|X}$ by sampling multiple synthetic pseudo-labels from $\mathbb{P}_{\textup{agg}}^{Y|X}$ for each calibration example $X_1,...,X_n$. In a case study of skin condition classification with significant disagreement among expert annotators, we show that applying CP w.r.t. $\mathbb{P}_{\textup{vote}}$ under-covers expert annotations: calibrated for $72\%$ coverage, it falls short by on average $10\%$; our Monte Carlo CP closes this gap both empirically and theoretically. We also extend Monte Carlo CP to multi-label classification and CP with calibration examples enriched through data augmentation.

ICLR Conference 2023 Conference Paper

Denoising Diffusion Samplers

Francisco Vargas 0001
Will Grathwohl
Arnaud Doucet

Denoising diffusion models are a popular class of generative models providing state-of-the-art results in many domains. One adds gradually noise to data using a diffusion to transform the data distribution into a Gaussian distribution. Samples from the generative model are then obtained by simulating an approximation of the time-reversal of this diffusion initialized by Gaussian samples. Practically, the intractable score terms appearing in the time-reversed process are approximated using score matching techniques. We explore here a similar idea to sample approximately from unnormalized probability density functions and estimate their normalizing constants. We consider a process where the target density diffuses towards a Gaussian. Denoising Diffusion Samplers (DDS) are obtained by approximating the corresponding time-reversal. While score matching is not applicable in this context, we can leverage many of the ideas introduced in generative modeling for Monte Carlo sampling. Existing theoretical results from denoising diffusion models also provide theoretical guarantees for DDS. We discuss the connections between DDS, optimal control and Schr\"odinger bridges and finally demonstrate DDS experimentally on a variety of challenging sampling tasks.

NeurIPS Conference 2023 Conference Paper

Diffusion Schrödinger Bridge Matching

Yuyang Shi
Valentin De Bortoli
Andrew Campbell
Arnaud Doucet

Solving transport problems, i. e. finding a map transporting one given distribution to another, has numerous applications in machine learning. Novel mass transport methods motivated by generative modeling have recently been proposed, e. g. Denoising Diffusion Models (DDMs) and Flow Matching Models (FMMs) implement such a transport through a Stochastic Differential Equation (SDE) or an Ordinary Differential Equation (ODE). However, while it is desirable in many applications to approximate the deterministic dynamic Optimal Transport (OT) map which admits attractive properties, DDMs and FMMs are not guaranteed to provide transports close to the OT map. In contrast, Schrödinger bridges (SBs) compute stochastic dynamic mappings which recover entropy-regularized versions of OT. Unfortunately, existing numerical methods approximating SBs either scale poorly with dimension or accumulate errors across iterations. In this work, we introduce Iterative Markovian Fitting (IMF), a new methodology for solving SB problems, and Diffusion Schrödinger Bridge Matching (DSBM), a novel numerical algorithm for computing IMF iterates. DSBM significantly improves over previous SB numerics and recovers as special/limiting cases various recent transport methods. We demonstrate the performance of DSBM on a variety of problems.

NeurIPS Conference 2023 Conference Paper

Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits

Muhammad Faaiz Taufiq
Arnaud Doucet
Rob Cornish
Jean-Francois Ton

Off-Policy Evaluation (OPE) in contextual bandits is crucial for assessing new policies using existing data without costly experimentation. However, current OPE methods, such as Inverse Probability Weighting (IPW) and Doubly Robust (DR) estimators, suffer from high variance, particularly in cases of low overlap between target and behaviour policies or large action and context spaces. In this paper, we introduce a new OPE estimator for contextual bandits, the Marginal Ratio (MR) estimator, which focuses on the shift in the marginal distribution of outcomes $Y$ instead of the policies themselves. Through rigorous theoretical analysis, we demonstrate the benefits of the MR estimator compared to conventional methods like IPW and DR in terms of variance reduction. Additionally, we establish a connection between the MR estimator and the state-of-the-art Marginalized Inverse Propensity Score (MIPS) estimator, proving that MR achieves lower variance among a generalized family of MIPS estimators. We further illustrate the utility of the MR estimator in causal inference settings, where it exhibits enhanced performance in estimating Average Treatment Effects (ATE). Our experiments on synthetic and real-world datasets corroborate our theoretical findings and highlight the practical advantages of the MR estimator in OPE for contextual bandits.

ICML Conference 2023 Conference Paper

Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC

Yilun Du
Conor Durkan
Robin Strudel
Joshua B. Tenenbaum
Sander Dieleman
Rob Fergus
Jascha Sohl-Dickstein
Arnaud Doucet

Since their introduction, diffusion models have quickly become the prevailing approach to generative modeling in many domains. They can be interpreted as learning the gradients of a time-varying sequence of log-probability density functions. This interpretation has motivated classifier-based and classifier-free guidance as methods for post-hoc control of diffusion models. In this work, we build upon these ideas using the score-based interpretation of diffusion models, and explore alternative ways to condition, modify, and reuse diffusion models for tasks involving compositional generation and guidance. In particular, we investigate why certain types of composition fail using current techniques and present a number of solutions. We conclude that the sampler (not the model) is responsible for this failure and propose new samplers, inspired by MCMC, which enable successful compositional generation. Further, we propose an energy-based parameterization of diffusion models which enables the use of new compositional operators and more sophisticated, Metropolis-corrected samplers. Intriguingly we find these samplers lead to notable improvements in compositional generation across a wide variety of problems such as classifier-guided ImageNet modeling and compositional text-to-image generation.

ICML Conference 2023 Conference Paper

SE(3) diffusion model with application to protein backbone generation

Jason Yim
Brian L. Trippe
Valentin De Bortoli
Emile Mathieu
Arnaud Doucet
Regina Barzilay
Tommi S. Jaakkola

The design of novel protein structures remains a challenge in protein engineering for applications across biomedicine and chemistry. In this line of work, a diffusion model over rigid bodies in 3D (referred to as frames) has shown success in generating novel, functional protein backbones that have not been observed in nature. However, there exists no principled methodological framework for diffusion on SE(3), the space of orientation preserving rigid motions in R3, that operates on frames and confers the group invariance. We address these shortcomings by developing theoretical foundations of SE(3) invariant diffusion models on multiple frames followed by a novel framework, FrameDiff, for estimating the SE(3) equivariant score over multiple frames. We apply FrameDiff on monomer backbone generation and find it can generate designable monomers up to 500 amino acids without relying on a pretrained protein structure prediction network that has been integral to previous methods. We find our samples are capable of generalizing beyond any known protein structure.

NeurIPS Conference 2023 Conference Paper

Trans-Dimensional Generative Modeling via Jump Diffusion Models

Andrew Campbell
William Harvey
Christian Weilbach
Valentin De Bortoli
Thomas Rainforth
Arnaud Doucet

We propose a new class of generative model that naturally handles data of varying dimensionality by jointly modeling the state and dimension of each datapoint. The generative process is formulated as a jump diffusion process that makes jumps between different dimensional spaces. We first define a dimension destroying forward noising process, before deriving the dimension creating time-reversed generative process along with a novel evidence lower bound training objective for learning to approximate it. Simulating our learned approximation to the time-reversed generative process then provides an effective way of sampling data of varying dimensionality by jointly generating state values and dimensions. We demonstrate our approach on molecular and video datasets of varying dimensionality, reporting better compatibility with test-time diffusion guidance imputation tasks and improved interpolation capabilities versus fixed dimensional models that generate state values and dimensions separately.

NeurIPS Conference 2023 Conference Paper

Tree-Based Diffusion Schrödinger Bridge with Applications to Wasserstein Barycenters

Maxence Noble
Valentin De Bortoli
Arnaud Doucet
Alain Durmus

Multi-marginal Optimal Transport (mOT), a generalization of OT, aims at minimizing the integral of a cost function with respect to a distribution with some prescribed marginals. In this paper, we consider an entropic version of mOT with a tree-structured quadratic cost, i. e. , a function that can be written as a sum of pairwise cost functions between the nodes of a tree. To address this problem, we develop Tree-based Diffusion Schr\"odinger Bridge (TreeDSB), an extension of the Diffusion Schr\"odinger Bridge (DSB) algorithm. TreeDSB corresponds to a dynamic and continuous state-space counterpart of the multimarginal Sinkhorn algorithm. A notable use case of our methodology is to compute Wasserstein barycenters which can be recast as the solution of a mOT problem on a star-shaped tree. We demonstrate that our methodology can be applied in high-dimensional settings such as image interpolation and Bayesian fusion.

NeurIPS Conference 2022 Conference Paper

A Continuous Time Framework for Discrete Denoising Models

Andrew Campbell
Joe Benton
Valentin De Bortoli
Thomas Rainforth
George Deligiannidis
Arnaud Doucet

We provide the first complete continuous time framework for denoising diffusion models of discrete data. This is achieved by formulating the forward noising process and corresponding reverse time generative process as Continuous Time Markov Chains (CTMCs). The model can be efficiently trained using a continuous time version of the ELBO. We simulate the high dimensional CTMC using techniques developed in chemical physics and exploit our continuous time framework to derive high performance samplers that we show can outperform discrete time methods for discrete data. The continuous time treatment also enables us to derive a novel theoretical result bounding the error between the generated sample distribution and the true data distribution.

NeurIPS Conference 2022 Conference Paper

A Multi-Resolution Framework for U-Nets with Applications to Hierarchical VAEs

Fabian Falck
Christopher Williams
Dominic Danks
George Deligiannidis
Christopher Yau
Chris C Holmes
Arnaud Doucet
Matthew Willetts

U-Net architectures are ubiquitous in state-of-the-art deep learning, however their regularisation properties and relationship to wavelets are understudied. In this paper, we formulate a multi-resolution framework which identifies U-Nets as finite-dimensional truncations of models on an infinite-dimensional function space. We provide theoretical results which prove that average pooling corresponds to projection within the space of square-integrable functions and show that U-Nets with average pooling implicitly learn a Haar wavelet basis representation of the data. We then leverage our framework to identify state-of-the-art hierarchical VAEs (HVAEs), which have a U-Net architecture, as a type of two-step forward Euler discretisation of multi-resolution diffusion processes which flow from a point mass, introducing sampling instabilities. We also demonstrate that HVAEs learn a representation of time which allows for improved parameter efficiency through weight-sharing. We use this observation to achieve state-of-the-art HVAE performance with half the number of parameters of existing models, exploiting the properties of our continuous-time formulation.

TMLR Journal 2022 Journal Article

An empirical study of implicit regularization in deep offline RL

Caglar Gulcehre
Srivatsan Srinivasan
Jakub Sygnowski
Georg Ostrovski
Mehrdad Farajtabar
Matthew Hoffman
Razvan Pascanu
Arnaud Doucet

Deep neural networks are the most commonly used function approximators in offline reinforcement learning. Prior works have shown that neural nets trained with TD-learning and gradient descent can exhibit implicit regularization that can be characterized by under-parameterization of these networks. Specifically, the rank of the penultimate feature layer, also called effective rank, has been observed to drastically collapse during the training. In turn, this collapse has been argued to reduce the model's ability to further adapt in later stages of learning, leading to the diminished final performance. Such an association between the effective rank and performance makes effective rank compelling for offline RL, primarily for offline policy evaluation. In this work, we conduct a careful empirical study on the relation between effective rank and performance on three offline RL datasets: bsuite, Atari, and DeepMind lab. We observe that a direct association exists only in restricted settings and disappears in the more extensive hyperparameter sweeps. Also, we empirically identify three phases of learning that explain the impact of implicit regularization on the learning dynamics and found that bootstrapping alone is insufficient to explain the collapse of the effective rank. Further, we show that several other factors could confound the relationship between effective rank and performance and conclude that studying this association under simplistic assumptions could be highly misleading.

TMLR Journal 2022 Journal Article

COIN++: Neural Compression Across Modalities

Emilien Dupont
Hrushikesh Loya
Milad Alizadeh
Adam Golinski
Yee Whye Teh
Arnaud Doucet

Neural compression algorithms are typically based on autoencoders that require specialized encoder and decoder architectures for different data modalities. In this paper, we propose COIN++, a neural compression framework that seamlessly handles a wide range of data modalities. Our approach is based on converting data to implicit neural representations, i.e. neural functions that map coordinates (such as pixel locations) to features (such as RGB values). Then, instead of storing the weights of the implicit neural representation directly, we store modulations applied to a meta-learned base network as a compressed code for the data. We further quantize and entropy code these modulations, leading to large compression gains while reducing encoding time by two orders of magnitude compared to baselines. We empirically demonstrate the feasibility of our method by compressing various data modalities, from images and audio to medical and climate data.

UAI Conference 2022 Conference Paper

Conditional simulation using diffusion Schrödinger bridges

Yuyang Shi 0002
Valentin De Bortoli
George Deligiannidis
Arnaud Doucet

Denoising diffusion models have recently emerged as a powerful class of generative models. They provide state-of-the-art results, not only for unconditional simulation, but also when used to solve conditional simulation problems arising in a wide range of inverse problems. A limitation of these models is that they are computationally intensive at generation time as they require simulating a diffusion process over a long time horizon. When performing unconditional simulation, a Schr{ö}dinger bridge formulation of generative modeling leads to a theoretically grounded algorithm shortening generation time which is complementary to other proposed acceleration techniques. We extend the Schrödinger bridge framework to conditional simulation. We demonstrate this novel methodology on various applications including image super-resolution, optimal filtering for state-space models and the refinement of pre-trained networks. Our code can be found at https: //github. com/vdeborto/cdsb.

NeurIPS Conference 2022 Conference Paper

Conformal Off-Policy Prediction in Contextual Bandits

Muhammad Faaiz Taufiq
Jean-Francois Ton
Rob Cornish
Yee Whye Teh
Arnaud Doucet

Most off-policy evaluation methods for contextual bandits have focused on the expected outcome of a policy, which is estimated via methods that at best provide only asymptotic guarantees. However, in many applications, the expectation may not be the best measure of performance as it does not capture the variability of the outcome. In addition, particularly in safety-critical settings, stronger guarantees than asymptotic correctness may be required. To address these limitations, we consider a novel application of conformal prediction to contextual bandits. Given data collected under a behavioral policy, we propose \emph{conformal off-policy prediction} (COPP), which can output reliable predictive intervals for the outcome under a new target policy. We provide theoretical finite-sample guarantees without making any additional assumptions beyond the standard contextual bandit setup, and empirically demonstrate the utility of COPP compared with existing methods on synthetic and real-world data.

ICML Conference 2022 Conference Paper

Continual Repeated Annealed Flow Transport Monte Carlo

Alexander G. de G. Matthews
Michael Arbel
Danilo Jimenez Rezende
Arnaud Doucet

We propose Continual Repeated Annealed Flow Transport Monte Carlo (CRAFT), a method that combines a sequential Monte Carlo (SMC) sampler (itself a generalization of Annealed Importance Sampling) with variational inference using normalizing flows. The normalizing flows are directly trained to transport between annealing temperatures using a KL divergence for each transition. This optimization objective is itself estimated using the normalizing flow/SMC approximation. We show conceptually and using multiple empirical examples that CRAFT improves on Annealed Flow Transport Monte Carlo (Arbel et al. , 2021), on which it builds and also on Markov chain Monte Carlo (MCMC) based Stochastic Normalizing Flows (Wu et al. , 2020). By incorporating CRAFT within particle MCMC, we show that such learnt samplers can achieve impressively accurate results on a challenging lattice field theory example.

JMLR Journal 2022 Journal Article

Efficient MCMC Sampling with Dimension-Free Convergence Rate using ADMM-type Splitting

Maxime Vono
Daniel Paulin
Arnaud Doucet

Performing exact Bayesian inference for complex models is computationally intractable. Markov chain Monte Carlo (MCMC) algorithms can provide reliable approximations of the posterior distribution but are expensive for large data sets and high-dimensional models. A standard approach to mitigate this complexity consists in using subsampling techniques or distributing the data across a cluster. However, these approaches are typically unreliable in high-dimensional scenarios. We focus here on a recent alternative class of MCMC schemes exploiting a splitting strategy akin to the one used by the celebrated alternating direction method of multipliers (ADMM) optimization algorithm. These methods appear to provide empirically state-of-the-art performance but their theoretical behavior in high dimension is currently unknown. In this paper, we propose a detailed theoretical study of one of these algorithms known as the split Gibbs sampler. Under regularity conditions, we establish explicit convergence rates for this scheme using Ricci curvature and coupling ideas. We support our theory with numerical illustrations. [abs] [ pdf ][ bib ] &copy JMLR 2022. ( edit, beta )

ICML Conference 2022 Conference Paper

Importance Weighted Kernel Bayes' Rule

Liyuan Xu
Yutian Chen
Arnaud Doucet
Arthur Gretton

We study a nonparametric approach to Bayesian computation via feature means, where the expectation of prior features is updated to yield expected posterior features, based on regression from kernel or neural net features of the observations. All quantities involved in the Bayesian update are learned from observed data, making the method entirely model-free. The resulting algorithm is a novel instance of a kernel Bayes’ rule (KBR). Our approach is based on importance weighting, which results in superior numerical stability to the existing approach to KBR, which requires operator inversion. We show the convergence of the estimator using a novel consistency analysis on the importance weighting estimator in the infinity norm. We evaluate our KBR on challenging synthetic benchmarks, including a filtering problem with a state-space model involving high dimensional image observations. The proposed method yields uniformly better empirical performance than the existing KBR, and competitive performance with other competing methods. We evaluate our KBR on challenging synthetic benchmarks, including a filtering problem with a state-space model involving high dimensional image observations. The proposed method yields uniformly better empirical performance than the existing KBR, and competitive performance with other competing methods.

ICLR Conference 2022 Conference Paper

Learning Optimal Conformal Classifiers

David Stutz
Krishnamurthy Dvijotham
A. Taylan Cemgil
Arnaud Doucet

Modern deep learning based classifiers show very high accuracy on test data but this does not provide sufficient guarantees for safe deployment, especially in high-stake AI applications such as medical diagnosis. Usually, predictions are obtained without a reliable uncertainty estimate or a formal guarantee. Conformal prediction (CP) addresses these issues by using the classifier's predictions, e.g., its probability estimates, to predict confidence sets containing the true class with a user-specified probability. However, using CP as a separate processing step after training prevents the underlying model from adapting to the prediction of confidence sets. Thus, this paper explores strategies to differentiate through CP during training with the goal of training model with the conformal wrapper end-to-end. In our approach, conformal training (ConfTr), we specifically "simulate" conformalization on mini-batches during training. Compared to standard training, ConfTr reduces the average confidence set size (inefficiency) of state-of-the-art CP methods applied after training. Moreover, it allows to "shape" the confidence sets predicted at test time, which is difficult for standard CP. On experiments with several datasets, we show ConfTr can influence how inefficiency is distributed across classes, or guide the composition of confidence sets in terms of the included classes, while retaining the guarantees offered by CP.

UAI Conference 2022 Conference Paper

Mitigating statistical bias within differentially private synthetic data

Sahra Ghalebikesabi
Harry Wilde
Jack Jewson
Arnaud Doucet
Sebastian J. Vollmer
Chris C. Holmes

Increasing interest in privacy-preserving machine learning has led to new and evolved approaches for generating private synthetic data from undisclosed real data. However, mechanisms of privacy preservation can significantly reduce the utility of synthetic data, which in turn impacts downstream tasks such as learning predictive models or inference. We propose several re-weighting strategies using privatised likelihood ratios that not only mitigate statistical bias of downstream estimators but also have general applicability to differentially private generative models. Through large-scale empirical evaluation, we show that private importance weighting provides simple and effective privacy-compliant augmentation for general applications of synthetic data.

JMLR Journal 2022 Journal Article

On Instrumental Variable Regression for Deep Offline Policy Evaluation

Yutian Chen
Liyuan Xu
Caglar Gulcehre
Tom Le Paine
Arthur Gretton
Nando de Freitas
Arnaud Doucet

We show that the popular reinforcement learning (RL) strategy of estimating the state-action value (Q-function) by minimizing the mean squared Bellman error leads to a regression problem with confounding, the inputs and output noise being correlated. Hence, direct minimization of the Bellman error can result in significantly biased Q-function estimates. We explain why fixing the target Q-network in Deep Q-Networks and Fitted Q Evaluation provides a way of overcoming this confounding, thus shedding new light on this popular but not well understood trick in the deep RL literature. An alternative approach to address confounding is to leverage techniques developed in the causality literature, notably instrumental variables (IV). We bring together here the literature on IV and RL by investigating whether IV approaches can lead to improved Q-function estimates. This paper analyzes and compares a wide range of recent IV methods in the context of offline policy evaluation (OPE), where the goal is to estimate the value of a policy using logged data only. By applying different IV techniques to OPE, we are not only able to recover previously proposed OPE methods such as model-based techniques but also to obtain competitive new techniques. We find empirically that state-of-the-art OPE methods are closely matched in performance by some IV methods such as AGMM, which were not developed for OPE. We open-source all our code and datasets at https://github.com/liyuan9988/IVOPEwithACME. [abs] [ pdf ][ bib ] [ code ] &copy JMLR 2022. ( edit, beta )

NeurIPS Conference 2022 Conference Paper

Riemannian Score-Based Generative Modelling

Valentin De Bortoli
Emile Mathieu
Michael Hutchinson
James Thornton
Yee Whye Teh
Arnaud Doucet

Score-based generative models (SGMs) are a powerful class of generative models that exhibit remarkable empirical performance. Score-based generative modelling (SGM) consists of a noising'' stage, whereby a diffusion is used to gradually add Gaussian noise to data, and a generative model, which entails a denoising'' process defined by approximating the time-reversal of the diffusion. Existing SGMs assume that data is supported on a Euclidean space, i. e. a manifold with flat geometry. In many domains such as robotics, geoscience or protein modelling, data is often naturally described by distributions living on Riemannian manifolds and current SGM techniques are not appropriate. We introduce here \emph{Riemannian Score-based Generative Models} (RSGMs), a class of generative models extending SGMs to Riemannian manifolds. We demonstrate our approach on a variety of compact manifolds, and in particular with earth and climate science spherical data.

NeurIPS Conference 2022 Conference Paper

Score-Based Diffusion meets Annealed Importance Sampling

Arnaud Doucet
Will Grathwohl
Alexander G. Matthews
Heiko Strathmann

More than twenty years after its introduction, Annealed Importance Sampling (AIS) remains one of the most effective methods for marginal likelihood estimation. It relies on a sequence of distributions interpolating between a tractable initial distribution and the target distribution of interest which we simulate from approximately using a non-homogeneous Markov chain. To obtain an importance sampling estimate of the marginal likelihood, AIS introduces an extended target distribution to reweight the Markov chain proposal. While much effort has been devoted to improving the proposal distribution used by AIS, by changing the intermediate distributions and corresponding Markov kernels, an underappreciated issue is that AIS uses a convenient but suboptimal extended target distribution. This can hinder its performance. We here leverage recent progress in score-based generative modeling (SGM) to approximate the optimal extended target distribution for AIS proposals corresponding to the discretization of Langevin and Hamiltonian dynamics. We demonstrate these novel, differentiable, AIS procedures on a number of synthetic benchmark distributions and variational auto-encoders.

NeurIPS Conference 2022 Conference Paper

Towards Learning Universal Hyperparameter Optimizers with Transformers

Yutian Chen
Xingyou Song
Chansoo Lee
Zi Wang
Richard Zhang
David Dohan
Kazuya Kawakami
Greg Kochanski

Meta-learning hyperparameter optimization (HPO) algorithms from prior experiments is a promising approach to improve optimization efficiency over objective functions from a similar distribution. However, existing methods are restricted to learning from experiments sharing the same set of hyperparameters. In this paper, we introduce the OptFormer, the first text-based Transformer HPO framework that provides a universal end-to-end interface for jointly learning policy and function prediction when trained on vast tuning data from the wild, such as Google’s Vizier database, one of the world’s largest HPO datasets. Our extensive experiments demonstrate that the OptFormer can simultaneously imitate at least 7 different HPO algorithms, which can be further improved via its function uncertainty estimates. Compared to a Gaussian Process, the OptFormer also learns a robust prior distribution for hyperparameter response functions, and can thereby provide more accurate and better calibrated predictions. This work paves the path to future extensions for training a Transformer-based model as a general HPO optimizer.

ICML Conference 2021 Conference Paper

Annealed Flow Transport Monte Carlo

Michael Arbel
Alexander G. de G. Matthews
Arnaud Doucet

Annealed Importance Sampling (AIS) and its Sequential Monte Carlo (SMC) extensions are state-of-the-art methods for estimating normalizing constants of probability distributions. We propose here a novel Monte Carlo algorithm, Annealed Flow Transport (AFT), that builds upon AIS and SMC and combines them with normalizing flows (NFs) for improved performance. This method transports a set of particles using not only importance sampling (IS), Markov chain Monte Carlo (MCMC) and resampling steps - as in SMC, but also relies on NFs which are learned sequentially to push particles towards the successive annealed targets. We provide limit theorems for the resulting Monte Carlo estimates of the normalizing constant and expectations with respect to the target distribution. Additionally, we show that a continuous-time scaling limit of the population version of AFT is given by a Feynman–Kac measure which simplifies to the law of a controlled diffusion for expressive NFs. We demonstrate experimentally the benefits and limitations of our methodology on a variety of applications.

ICML Conference 2021 Conference Paper

Differentiable Particle Filtering via Entropy-Regularized Optimal Transport

Adrien Corenflos
James Thornton
George Deligiannidis
Arnaud Doucet

Particle Filtering (PF) methods are an established class of procedures for performing inference in non-linear state-space models. Resampling is a key ingredient of PF necessary to obtain low variance likelihood and states estimates. However, traditional resampling methods result in PF-based loss functions being non-differentiable with respect to model and PF parameters. In a variational inference context, resampling also yields high variance gradient estimates of the PF-based evidence lower bound. By leveraging optimal transport ideas, we introduce a principled differentiable particle filter and provide convergence results. We demonstrate this novel method on a variety of applications.

NeurIPS Conference 2021 Conference Paper

Diffusion Schrödinger Bridge with Applications to Score-Based Generative Modeling

Valentin De Bortoli
James Thornton
Jeremy Heng
Arnaud Doucet

Progressively applying Gaussian noise transforms complex data distributions to approximately Gaussian. Reversing this dynamic defines a generative model. When the forward noising process is given by a Stochastic Differential Equation (SDE), Song et al (2021) demonstrate how the time inhomogeneous drift of the associated reverse-time SDE may be estimated using score-matching. A limitation of this approach is that the forward-time SDE must be run for a sufficiently long time for the final distribution to be approximately Gaussian. In contrast, solving the Schrödinger Bridge (SB) problem, i. e. an entropy-regularized optimal transport problem on path spaces, yields diffusions which generate samples from the data distribution in finite time. We present Diffusion SB (DSB), an original approximation of the Iterative Proportional Fitting (IPF) procedure to solve the SB problem, and provide theoretical analysis along with generative modeling experiments. The first DSB iteration recovers the methodology proposed by Song et al. (2021), with the flexibility of using shorter time intervals, as subsequent DSB iterations reduce the discrepancy between the final-time marginal of the forward (resp. backward) SDE with respect to the prior (resp. data) distribution. Beyond generative modeling, DSB offers a widely applicable computational optimal transport tool as the continuous state-space analogue of the popular Sinkhorn algorithm (Cuturi, 2013).

ICML Conference 2021 Conference Paper

Improving Lossless Compression Rates via Monte Carlo Bits-Back Coding

Yangjun Ruan
Karen Ullrich
Daniel Severo 0001
James Townsend
Ashish J. Khisti
Arnaud Doucet
Alireza Makhzani
Chris J. Maddison

Latent variable models have been successfully applied in lossless compression with the bits-back coding algorithm. However, bits-back suffers from an increase in the bitrate equal to the KL divergence between the approximate posterior and the true posterior. In this paper, we show how to remove this gap asymptotically by deriving bits-back coding algorithms from tighter variational bounds. The key idea is to exploit extended space representations of Monte Carlo estimators of the marginal likelihood. Naively applied, our schemes would require more initial bits than the standard bits-back coder, but we show how to drastically reduce this additional cost with couplings in the latent space. When parallel architectures can be exploited, our coders can achieve better rates than bits-back with little additional cost. We demonstrate improved lossless compression rates in a variety of settings, especially in out-of-distribution or sequential data compression.

ICLR Conference 2021 Conference Paper

Learning Deep Features in Instrumental Variable Regression

Liyuan Xu
Yutian Chen 0001
Siddarth Srinivasan
Nando de Freitas
Arnaud Doucet
Arthur Gretton

Instrumental variable (IV) regression is a standard strategy for learning causal relationships between confounded treatment and outcome variables from observational data by using an instrumental variable, which affects the outcome only through the treatment. In classical IV regression, learning proceeds in two stages: stage 1 performs linear regression from the instrument to the treatment; and stage 2 performs linear regression from the treatment to the outcome, conditioned on the instrument. We propose a novel method, deep feature instrumental variable regression (DFIV), to address the case where relations between instruments, treatments, and outcomes may be nonlinear. In this case, deep neural nets are trained to define informative nonlinear features on the instruments and treatments. We propose an alternating training regime for these features to ensure good end-to-end performance when composing stages 1 and 2, thus obtaining highly flexible feature maps in a computationally efficient manner. DFIV outperforms recent state-of-the-art methods on challenging IV benchmarks, including settings involving high dimensional image data. DFIV also exhibits competitive performance in off-policy policy evaluation for reinforcement learning, which can be understood as an IV regression task.

ICML Conference 2021 Conference Paper

Monte Carlo Variational Auto-Encoders

Achille Thin
Nikita Kotelevskii
Arnaud Doucet
Alain Durmus
Eric Moulines
Maxim Panov

Variational auto-encoders (VAE) are popular deep latent variable models which are trained by maximizing an Evidence Lower Bound (ELBO). To obtain tighter ELBO and hence better variational approximations, it has been proposed to use importance sampling to get a lower variance estimate of the evidence. However, importance sampling is known to perform poorly in high dimensions. While it has been suggested many times in the literature to use more sophisticated algorithms such as Annealed Importance Sampling (AIS) and its Sequential Importance Sampling (SIS) extensions, the potential benefits brought by these advanced techniques have never been realized for VAE: the AIS estimate cannot be easily differentiated, while SIS requires the specification of carefully chosen backward Markov kernels. In this paper, we address both issues and demonstrate the performance of the resulting Monte Carlo VAEs on a variety of applications.

NeurIPS Conference 2021 Conference Paper

NEO: Non Equilibrium Sampling on the Orbits of a Deterministic Transform

Achille Thin
Yazid Janati El Idrissi
Sylvain Le Corff
Charles Ollion
Eric Moulines
Arnaud Doucet
Alain Durmus
Christian X Robert

Sampling from a complex distribution $\pi$ and approximating its intractable normalizing constant $\mathrm{Z}$ are challenging problems. In this paper, a novel family of importance samplers (IS) and Markov chain Monte Carlo (MCMC) samplers is derived. Given an invertible map $\mathrm{T}$, these schemes combine (with weights) elements from the forward and backward Orbits through points sampled from a proposal distribution $\rho$. The map $\mathrm{T}$ does not leave the target $\pi$ invariant, hence the name NEO, standing for Non-Equilibrium Orbits. NEO-IS provides unbiased estimators of the normalizing constant and self-normalized IS estimators of expectations under $\pi$ while NEO-MCMC combines multiple NEO-IS estimates of the normalizing constant and an iterated sampling-importance resampling mechanism to sample from $\pi$. For $\mathrm{T}$ chosen as a discrete-time integrator of a conformal Hamiltonian system, NEO-IS achieves state-of-the art performance on difficult benchmarks and NEO-MCMC is able to explore highly multimodal targets. Additionally, we provide detailed theoretical results for both methods. In particular, we show that NEO-MCMC is uniformly geometrically ergodic and establish explicit mixing time estimates under mild conditions.

NeurIPS Conference 2021 Conference Paper

Online Variational Filtering and Parameter Learning

Andrew Campbell
Yuyang Shi
Thomas Rainforth
Arnaud Doucet

We present a variational method for online state estimation and parameter learning in state-space models (SSMs), a ubiquitous class of latent variable models for sequential data. As per standard batch variational techniques, we use stochastic gradients to simultaneously optimize a lower bound on the log evidence with respect to both model parameters and a variational approximation of the states' posterior distribution. However, unlike existing approaches, our method is able to operate in an entirely online manner, such that historic observations do not require revisitation after being incorporated and the cost of updates at each time step remains constant, despite the growing dimensionality of the joint posterior distribution of the states. This is achieved by utilizing backward decompositions of this joint posterior distribution and of its variational approximation, combined with Bellman-type recursions for the evidence lower bound and its gradients. We demonstrate the performance of this methodology across several examples, including high-dimensional SSMs and sequential Variational Auto-Encoders.

ICLR Conference 2021 Conference Paper

Robust Pruning at Initialization

Soufiane Hayou
Jean-Francois Ton
Arnaud Doucet
Yee Whye Teh

Overparameterized Neural Networks (NN) display state-of-the-art performance. However, there is a growing need for smaller, energy-efficient, neural networks to be able to use machine learning applications on devices with limited computational resources. A popular approach consists of using pruning techniques. While these techniques have traditionally focused on pruning pre-trained NN (LeCun et al.,1990; Hassibi et al., 1993), recent work by Lee et al. (2018) has shown promising results when pruning at initialization. However, for Deep NNs, such procedures remain unsatisfactory as the resulting pruned networks can be difficult to train and, for instance, they do not prevent one layer from being fully pruned. In this paper, we provide a comprehensive theoretical analysis of Magnitude and Gradient based pruning at initialization and training of sparse architectures. This allows us to propose novel principled approaches which we validate experimentally on a variety of NN architectures.

UAI Conference 2021 Conference Paper

Unbiased gradient estimation for variational auto-encoders using coupled Markov chains

Francisco J. R. Ruiz
Michalis K. Titsias
A. Taylan Cemgil
Arnaud Doucet

The variational auto-encoder (VAE) is a deep latent variable model that has two neural networks in an autoencoder-like architecture; one of them parameterizes the model’s likelihood. Fitting its parameters via maximum likelihood (ML) is challenging since the computation of the marginal likelihood involves an intractable integral over the latent space; thus the VAE is trained instead by maximizing a variational lower bound. Here, we develop a ML training scheme for VAEs by introducing unbiased estimators of the log-likelihood gradient. We obtain the estimators by augmenting the latent space with a set of importance samples, similarly to the importance weighted auto-encoder (IWAE), and then constructing a Markov chain Monte Carlo coupling procedure on this augmented space. We provide the conditions under which the estimators can be computed in finite time and with finite variance. We show experimentally that VAEs fitted with unbiased estimators exhibit better predictive performance.

UAI Conference 2021 Conference Paper

Variational inference with continuously-indexed normalizing flows

Anthony L. Caterini
Robert Cornish
Dino Sejdinovic
Arnaud Doucet

Continuously-indexed flows (CIFs) have recently achieved improvements over baseline normalizing flows on a variety of density estimation tasks. CIFs do not possess a closed-form marginal density, and so, unlike standard flows, cannot be plugged in directly to a variational inference (VI) scheme in order to produce a more expressive family of approximate posteriors. However, we show here how CIFs can be used as part of an auxiliary VI scheme to formulate and train expressive posterior approximations in a natural way. We exploit the conditional independence structure of multi-layer CIFs to build the required auxiliary inference models, which we show empirically yield low-variance estimators of the model evidence. We then demonstrate the advantages of CIFs over baseline flows in VI problems when the posterior distribution of interest possesses a complicated topology, obtaining improved results in both the Bayesian inference and surrogate maximum likelihood settings.

NeurIPS Conference 2020 Conference Paper

Modular Meta-Learning with Shrinkage

Yutian Chen
Abram L. Friesen
Feryal Behbahani
Arnaud Doucet
David Budden
Matthew Hoffman
Nando de Freitas

Many real-world problems, including multi-speaker text-to-speech synthesis, can greatly benefit from the ability to meta-learn large models with only a few task- specific components. Updating only these task-specific modules then allows the model to be adapted to low-data tasks for as many steps as necessary without risking overfitting. Unfortunately, existing meta-learning methods either do not scale to long adaptation or else rely on handcrafted task-specific architectures. Here, we propose a meta-learning approach that obviates the need for this often sub-optimal hand-selection. In particular, we develop general techniques based on Bayesian shrinkage to automatically discover and learn both task-specific and general reusable modules. Empirically, we demonstrate that our method discovers a small set of meaningful task-specific modules and outperforms existing meta- learning approaches in domains like few-shot text-to-speech that have little task data and long adaptation horizons. We also show that existing meta-learning methods including MAML, iMAML, and Reptile emerge as special cases of our method.

ICML Conference 2020 Conference Paper

Relaxing Bijectivity Constraints with Continuously Indexed Normalising Flows

Robert Cornish
Anthony L. Caterini
George Deligiannidis
Arnaud Doucet

We show that normalising flows become pathological when used to model targets whose supports have complicated topologies. In this scenario, we prove that a flow must become arbitrarily numerically noninvertible in order to approximate the target closely. This result has implications for all flow-based models, and especially residual flows (ResFlows), which explicitly control the Lipschitz constant of the bijection used. To address this, we propose continuously indexed flows (CIFs), which replace the single bijection used by normalising flows with a continuously indexed family of bijections, and which can intuitively "clean up" mass that would otherwise be misplaced by a single bijection. We show theoretically that CIFs are not subject to the same topological limitations as normalising flows, and obtain better empirical performance on a variety of models and benchmarks.

NeurIPS Conference 2019 Conference Paper

Augmented Neural ODEs

Emilien Dupont
Arnaud Doucet
Yee Whye Teh

We show that Neural Ordinary Differential Equations (ODEs) learn representations that preserve the topology of the input space and prove that this implies the existence of functions Neural ODEs cannot represent. To address these limitations, we introduce Augmented Neural ODEs which, in addition to being more expressive models, are empirically more stable, generalize better and have a lower computational cost than Neural ODEs.

ICML Conference 2019 Conference Paper

On the Impact of the Activation function on Deep Neural Networks Training

Soufiane Hayou
Arnaud Doucet
Judith Rousseau

The weight initialization and the activation function of deep neural networks have a crucial impact on the performance of the training procedure. An inappropriate selection can lead to the loss of information of the input during forward propagation and the exponential vanishing/exploding of gradients during back-propagation. Understanding the theoretical properties of untrained random networks is key to identifying which deep networks may be trained successfully as recently demonstrated by Samuel et al. (2017) who showed that for deep feedforward neural networks only a specific choice of hyperparameters known as the ‘Edge of Chaos’ can lead to good performance. While the work by Samuel et al. (2017) discuss trainability issues, we focus here on training acceleration and overall performance. We give a comprehensive theoretical analysis of the Edge of Chaos and show that we can indeed tune the initialization parameters and the activation function in order to accelerate the training and improve the performance.

ICML Conference 2019 Conference Paper

Replica Conditional Sequential Monte Carlo

Alexander Y. Shestopaloff
Arnaud Doucet

We propose a Markov chain Monte Carlo (MCMC) scheme to perform state inference in non-linear non-Gaussian state-space models. Current state-of-the-art methods to address this problem rely on particle MCMC techniques and its variants, such as the iterated conditional Sequential Monte Carlo (cSMC) scheme, which uses a Sequential Monte Carlo (SMC) type proposal within MCMC. A deficiency of standard SMC proposals is that they only use observations up to time $t$ to propose states at time $t$ when an entire observation sequence is available. More sophisticated SMC based on lookahead techniques could be used but they can be difficult to put in practice. We propose here replica cSMC where we build SMC proposals for one replica using information from the entire observation sequence by conditioning on the states of the other replicas. This approach is easily parallelizable and we demonstrate its excellent empirical performance when compared to the standard iterated cSMC scheme at fixed computational complexity.

ICML Conference 2019 Conference Paper

Scalable Metropolis-Hastings for Exact Bayesian Inference with Large Datasets

Robert Cornish
Paul Vanetti
Alexandre Bouchard-Côté
George Deligiannidis
Arnaud Doucet

Bayesian inference via standard Markov Chain Monte Carlo (MCMC) methods such as Metropolis-Hastings is too computationally intensive to handle large datasets, since the cost per step usually scales like $O(n)$ in the number of data points $n$. We propose the Scalable Metropolis-Hastings (SMH) kernel that only requires processing on average $O(1)$ or even $O(1/\sqrt{n})$ data points per step. This scheme is based on a combination of factorized acceptance probabilities, procedures for fast simulation of Bernoulli processes, and control variate ideas. Contrary to many MCMC subsampling schemes such as fixed step-size Stochastic Gradient Langevin Dynamics, our approach is exact insofar as the invariant distribution is the true posterior and not an approximation to it. We characterise the performance of our algorithm theoretically, and give realistic and verifiable conditions under which it is geometrically ergodic. This theory is borne out by empirical results that demonstrate overall performance benefits over standard Metropolis-Hastings and various subsampling algorithms.

NeurIPS Conference 2018 Conference Paper

Hamiltonian Variational Auto-Encoder

Anthony Caterini
Arnaud Doucet
Dino Sejdinovic

Variational Auto-Encoders (VAE) have become very popular techniques to perform inference and learning in latent variable models as they allow us to leverage the rich representational power of neural networks to obtain flexible approximations of the posterior of latent variables as well as tight evidence lower bounds (ELBO). Com- bined with stochastic variational inference, this provides a methodology scaling to large datasets. However, for this methodology to be practically efficient, it is neces- sary to obtain low-variance unbiased estimators of the ELBO and its gradients with respect to the parameters of interest. While the use of Markov chain Monte Carlo (MCMC) techniques such as Hamiltonian Monte Carlo (HMC) has been previously suggested to achieve this [23, 26], the proposed methods require specifying reverse kernels which have a large impact on performance. Additionally, the resulting unbiased estimator of the ELBO for most MCMC kernels is typically not amenable to the reparameterization trick. We show here how to optimally select reverse kernels in this setting and, by building upon Hamiltonian Importance Sampling (HIS) [17], we obtain a scheme that provides low-variance unbiased estimators of the ELBO and its gradients using the reparameterization trick. This allows us to develop a Hamiltonian Variational Auto-Encoder (HVAE). This method can be re-interpreted as a target-informed normalizing flow [20] which, within our context, only requires a few evaluations of the gradient of the sampled likelihood and trivial Jacobian calculations at each iteration.

NeurIPS Conference 2017 Conference Paper

Clone MCMC: Parallel High-Dimensional Gaussian Gibbs Sampling

Andrei-Cristian Barbos
Francois Caron
Jean-François Giovannelli
Arnaud Doucet

We propose a generalized Gibbs sampler algorithm for obtaining samples approximately distributed from a high-dimensional Gaussian distribution. Similarly to Hogwild methods, our approach does not target the original Gaussian distribution of interest, but an approximation to it. Contrary to Hogwild methods, a single parameter allows us to trade bias for variance. We show empirically that our method is very flexible and performs well compared to Hogwild-type algorithms.

NeurIPS Conference 2017 Conference Paper

Filtering Variational Objectives

Chris Maddison
John Lawson
George Tucker
Nicolas Heess
Mohammad Norouzi
Andriy Mnih
Arnaud Doucet
Yee Teh

When used as a surrogate objective for maximum likelihood estimation in latent variable models, the evidence lower bound (ELBO) produces state-of-the-art results. Inspired by this, we consider the extension of the ELBO to a family of lower bounds defined by a particle filter's estimator of the marginal likelihood, the filtering variational objectives (FIVOs). FIVOs take the same arguments as the ELBO, but can exploit a model's sequential structure to form tighter bounds. We present results that relate the tightness of FIVO's bound to the variance of the particle filter's estimator by considering the generic case of bounds defined as log-transformed likelihood estimators. Experimentally, we show that training with FIVO results in substantial improvements over training the same model architecture with the ELBO on sequential data.

JMLR Journal 2017 Journal Article

Generalized Pólya Urn for Time-Varying Pitman-Yor Processes

François Caron
Willie Neiswanger
Frank Wood
Arnaud Doucet
Manuel Davy

This article introduces a class of first-order stationary time- varying Pitman-Yor processes. Subsuming our construction of time-varying Dirichlet processes presented in (Caron et al., 2007), these models can be used for time-dynamic density estimation and clustering. Our intuitive and simple construction relies on a generalized PÃ³lya urn scheme. Significantly, this construction yields marginal distributions at each time point that can be explicitly characterized and easily controlled. Inference is performed using Markov chain Monte Carlo and sequential Monte Carlo methods. We demonstrate our models and algorithms on epidemiological and video tracking data. [abs] [ pdf ][ bib ] &copy JMLR 2017. ( edit, beta )

JMLR Journal 2017 Journal Article

On Markov chain Monte Carlo methods for tall data

Rémi Bardenet
Arnaud Doucet
Chris Holmes

Markov chain Monte Carlo methods are often deemed too computationally intensive to be of any practical use for big data applications, and in particular for inference on datasets containing a large number $n$ of individual data points, also known as tall datasets. In scenarios where data are assumed independent, various approaches to scale up the Metropolis- Hastings algorithm in a Bayesian inference context have been recently proposed in machine learning and computational statistics. These approaches can be grouped into two categories: divide-and-conquer approaches and, subsampling-based algorithms. The aims of this article are as follows. First, we present a comprehensive review of the existing literature, commenting on the underlying assumptions and theoretical guarantees of each method. Second, by leveraging our understanding of these limitations, we propose an original subsampling-based approach relying on a control variate method which samples under regularity conditions from a distribution provably close to the posterior distribution of interest, yet can require less than $O(n)$ data point likelihood evaluations at each iteration for certain statistical models in favourable scenarios. Finally, we emphasize that we have only been able so far to propose subsampling-based methods which display good performance in scenarios where the Bernstein-von Mises approximation of the target posterior distribution is excellent. It remains an open challenge to develop such methods in scenarios where the Bernstein-von Mises approximation is poor. [abs] [ pdf ][ bib ] &copy JMLR 2017. ( edit, beta )

JMLR Journal 2017 Journal Article

Particle Gibbs Split-Merge Sampling for Bayesian Inference in Mixture Models

Alexandre Bouchard-Côté
Arnaud Doucet
Andrew Roth

This paper presents an original Markov chain Monte Carlo method to sample from the posterior distribution of conjugate mixture models. This algorithm relies on a flexible split-merge procedure built using the particle Gibbs sampler introduced in Andrieu et al. (2009, 2010). The resulting so-called Particle Gibbs Split-Merge sampler does not require the computation of a complex acceptance ratio and can be implemented using existing sequential Monte Carlo libraries. We investigate its performance experimentally on synthetic problems as well as on geolocation data. Our results show that for a given computational budget, the Particle Gibbs Split-Merge sampler empirically outperforms existing split merge methods. The code and instructions allowing to reproduce the experiments is available at github.com/aroth85/pgsm. [abs] [ pdf ][ bib ] &copy JMLR 2017. ( edit, beta )

ICML Conference 2016 Conference Paper

Interacting Particle Markov Chain Monte Carlo

Tom Rainforth
Christian A. Naesseth
Fredrik Lindsten
Brooks Paige
Jan-Willem van de Meent
Arnaud Doucet
Frank Wood

We introduce interacting particle Markov chain Monte Carlo (iPMCMC), a PMCMC method based on an interacting pool of standard and conditional sequential Monte Carlo samplers. Like related methods, iPMCMC is a Markov chain Monte Carlo sampler on an extended space. We present empirical results that show significant improvements in mixing rates relative to both non-interacting PMCMC samplers and a single PMCMC sampler with an equivalent memory and computational budget. An additional advantage of the iPMCMC method is that it is suitable for distributed and multi-core architectures.

NeurIPS Conference 2015 Conference Paper

Expectation Particle Belief Propagation

Thibaut Lienart
Yee Whye Teh
Arnaud Doucet

We propose an original particle-based implementation of the Loopy Belief Propagation (LPB) algorithm for pairwise Markov Random Fields (MRF) on a continuous state space. The algorithm constructs adaptively efficient proposal distributions approximating the local beliefs at each note of the MRF. This is achieved by considering proposal distributions in the exponential family whose parameters are updated iterately in an Expectation Propagation (EP) framework. The proposed particle scheme provides consistent estimation of the LBP marginals as the number of particles increases. We demonstrate that it provides more accurate results than the Particle Belief Propagation (PBP) algorithm of Ihler and McAllester (2009) at a fraction of the computational cost and is additionally more robust empirically. The computational complexity of our algorithm at each iteration is quadratic in the number of particles. We also propose an accelerated implementation with sub-quadratic computational complexity which still provides consistent estimates of the loopy BP marginal distributions and performs almost as well as the original procedure.

NeurIPS Conference 2014 Conference Paper

Asynchronous Anytime Sequential Monte Carlo

Brooks Paige
Frank Wood
Arnaud Doucet
Yee Whye Teh

We introduce a new sequential Monte Carlo algorithm we call the particle cascade. The particle cascade is an asynchronous, anytime alternative to traditional sequential Monte Carlo algorithms that is amenable to parallel and distributed implementations. It uses no barrier synchronizations which leads to improved particle throughput and memory efficiency. It is an anytime algorithm in the sense that it can be run forever to emit an unbounded number of particles while keeping within a fixed memory budget. We prove that the particle cascade provides an unbiased marginal likelihood estimator which can be straightforwardly plugged into existing pseudo-marginal methods.

ICML Conference 2014 Conference Paper

Fast Computation of Wasserstein Barycenters

Marco Cuturi
Arnaud Doucet

We present new algorithms to compute the mean of a set of $N$ empirical probability measures under the optimal transport metric. This mean, known as the Wasserstein barycenter (Agueh and Carlier, 2011; Rabin et al, 2012), is the measure that minimizes the sum of its Wasserstein distances to each element in that set. We argue through a simple example that Wasserstein barycenters have appealing properties that differentiate them from other barycenters proposed recently, which all build on kernel smoothing and/or Bregman divergences. Two original algorithms are proposed that require the repeated computation of primal and dual optimal solutions of transport problems. However direct implementation of these algorithms is too costly as optimal transports are notoriously computationally expensive. Extending the work of Cuturi (2013), we smooth both the primal and dual of the optimal transport problem to recover fast approximations of the primal and dual optimal solutions. We apply these algorithms to the visualization of perturbed images and to a clustering problem.

ICML Conference 2014 Conference Paper

Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach

Rémi Bardenet
Arnaud Doucet
Chris C. Holmes

Markov chain Monte Carlo (MCMC) methods are often deemed far too computationally intensive to be of any practical use for large datasets. This paper describes a methodology that aims to scale up the Metropolis-Hastings (MH) algorithm in this context. We propose an approximate implementation of the accept/reject step of MH that only requires evaluating the likelihood of a random subset of the data, yet is guaranteed to coincide with the accept/reject step based on the full dataset with a probability superior to a user-specified tolerance level. This adaptive subsampling technique is an alternative to the recent approach developed in (Korattikara et al, ICML’14), and it allows us to establish rigorously that the resulting approximate MH algorithm samples from a perturbed version of the target distribution of interest, whose total variation distance to this very target is controlled explicitly. We explore the benefits and limitations of this scheme on several examples.

NeurIPS Conference 2009 Conference Paper

Bayesian Nonparametric Models on Decomposable Graphs

Francois Caron
Arnaud Doucet

Over recent years Dirichlet processes and the associated Chinese restaurant process (CRP) have found many applications in clustering while the Indian buffet process (IBP) is increasingly used to describe latent feature models. In the clustering case, we associate to each data point a latent allocation variable. These latent variables can share the same value and this induces a partition of the data set. The CRP is a prior distribution on such partitions. In latent feature models, we associate to each data point a potentially infinite number of binary latent variables indicating the possession of some features and the IBP is a prior distribution on the associated infinite binary matrix. These prior distributions are attractive because they ensure exchangeability (over samples). We propose here extensions of these models to decomposable graphs. These models have appealing properties and can be easily learned using Monte Carlo techniques.

UAI Conference 2009 Conference Paper

New inference strategies for solving Markov Decision Processes using reversible jump MCMC

Matthias Hoffman
Hendrik Kück
Nando de Freitas
Arnaud Doucet

In this paper we build on previous work which uses inferences techniques, in particular Markov Chain Monte Carlo (MCMC) methods, to solve parameterized control problems. We propose a number of modifications in order to make this approach more practical in general, higher-dimensional spaces. We first introduce a new target distribution which is able to incorporate more reward information from sampled trajectories. We also show how to break strong correlations between the policy parameters and sampled trajectories in order to sample more freely. Finally, we show how to incorporate these techniques in a principled manner to obtain estimates of the optimal policy.

ICML Conference 2008 Conference Paper

Sparse Bayesian nonparametric regression

Francois Caron
Arnaud Doucet

One of the most common problems in machine learning and statistics consists of estimating the mean response Xβ from a vector of observations y assuming y = Xβ + ε where X is known, β is a vector of parameters of interest and ε a vector of stochastic errors. We are particularly interested here in the case where the dimension K of β is much higher than the dimension of y . We propose some flexible Bayesian models which can yield sparse estimates of β. We show that as K → ∞ these models are closely related to a class of Lévy processes. Simulations demonstrate that our models outperform significantly a range of popular alternatives.

NeurIPS Conference 2007 Conference Paper

Bayesian Policy Learning with Trans-Dimensional MCMC

Matthew Hoffman
Arnaud Doucet
Nando Freitas
Ajay Jasra

A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artiﬁcial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efﬁcient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with one single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulations methods in situations where the reward is a rare event.

UAI Conference 2007 Conference Paper

Generalized Polya Urn for Time-varying Dirichlet Process Mixtures

Francois Caron
Manuel Davy
Arnaud Doucet

Dirichlet Process Mixtures (DPMs) are a popular class of statistical models to perform density estimation and clustering. However, when the data available have a distribution evolving over time, such models are inadequate. We introduce here a class of time-varying DPMs which ensures that at each time step the random distribution follows a DPM model. Our model relies on an intuitive and simple generalized Polya urn scheme. Inference is performed using Markov chain Monte Carlo and Sequential Monte Carlo. We demonstrate our model on various applications.

ICML Conference 2006 Conference Paper

Fast particle smoothing: if I had a million particles

Mike Klaas
Mark Briers
Nando de Freitas
Arnaud Doucet
Simon Maskell
Dustin Lang

We propose efficient particle smoothing methods for generalized state-spaces models. Particle smoothing is an expensive O(N 2 ) algorithm, where N is the number of particles. We overcome this problem by integrating dual tree recursions and fast multipole techniques with forward-backward smoothers, a new generalized two-filter smoother and a maximum a posteriori (MAP) smoother. Our experiments show that these improvements can substantially increase the practicality of particle smoothing.

ICML Conference 2005 Conference Paper

Adapting two-class support vector classification methods to many class problems

Simon I. Hill
Arnaud Doucet

UAI Conference 2005 Conference Paper

Toward Practical N2 Monte Carlo: the Marginal Particle Filter

Mike Klaas
Nando de Freitas
Arnaud Doucet

Sequential Monte Carlo techniques are useful for state estimation in non-linear, non-Gaussian dynamic models. These methods allow us to approximate the joint posterior distribution using sequential importance sampling. In this framework, the dimension of the target distribution grows with each time step, thus it is necessary to introduce some resampling steps to ensure that the estimates provided by the algorithm have a reasonable variance. In many applications, we are only interested in the marginal filtering distribution which is defined on a space of fixed dimension. We present a Sequential Monte Carlo algorithm called the Marginal Particle Filter which operates directly on the marginal distribution, hence avoiding having to perform importance sampling on a space of growing dimension. Using this idea, we also derive an improved version of the auxiliary particle filter. We show theoretic and empirical results which demonstrate a reduction in variance over conventional particle filtering, and present techniques for reducing the cost of the marginal particle filter with N particles from O(N2) to O(N logN).

NeurIPS Conference 2003 Conference Paper

Sequential Bayesian Kernel Regression

Jaco Vermaak
Simon Godsill
Arnaud Doucet

We propose a method for sequential Bayesian kernel regression. As is the case for the popular Relevance Vector Machine (RVM) [10, 11], the method automatically identiﬁes the number and locations of the kernels. Our algorithm overcomes some of the computational difﬁculties related to batch methods for kernel regression. It is non-iterative, and requires only a single pass over the data. It is thus applicable to truly sequen- tial data sets and batch data sets alike. The algorithm is based on a generalisation of Importance Sampling, which allows the design of in- tuitively simple and efﬁcient proposal distributions for the model param- eters. Comparative results on two standard data sets show our algorithm to compare favourably with existing batch estimation strategies.

ICML Conference 2002 Conference Paper

Sparse Bayesian Learning for Regression and Classification using Markov Chain Monte Carlo

Shien-Shin Tham
Arnaud Doucet
Kotagiri Ramamohanarao

NeurIPS Conference 2001 Conference Paper

Rao-Blackwellised Particle Filtering via Data Augmentation

Christophe Andrieu
Nando Freitas
Arnaud Doucet

EE Engineering University of Melbourne Parkville, Victoria 3052

UAI Conference 2000 Conference Paper

Rao-Blackwellised Particle Filtering for Dynamic Bayesian Networks

Arnaud Doucet
Nando de Freitas
Kevin Murphy 0002
Stuart Russell 0001

Particle filters (PFs) are powerful sampling-based inference/learning algorithms for dynamic Bayesian networks (DBNs). They allow us to treat, in a principled way, any type of probability distribution, nonlinearity and non-stationarity. They have appeared in several fields under such names as ``condensation'', ``sequential Monte Carlo'' and ``survival of the fittest''. In this paper, we show how we can exploit the structure of the DBN to increase the efficiency of particle filtering, using a technique known as Rao-Blackwellisation. Essentially, this samples some of the variables, and marginalizes out the rest exactly, using the Kalman filter, HMM filter, junction tree algorithm, or any other finite dimensional optimal filter. We show that Rao-Blackwellised particle filters (RBPFs) lead to more accurate estimates than standard PFs. We demonstrate RBPFs on two problems, namely non-stationary online regression with radial basis function networks and robot localization and map building. We also discuss other potential application areas and provide references to some finite dimensional optimal filters.

UAI Conference 2000 Conference Paper

Reversible Jump MCMC Simulated Annealing for Neural Networks

Christophe Andrieu
Nando de Freitas
Arnaud Doucet

We propose a novel reversible jump Markov chain Monte Carlo (MCMC) simulated annealing algorithm to optimize radial basis function (RBF) networks. This algorithm enables us to maximize the joint posterior distribution of the network parameters and the number of basis functions. It performs a global search in the joint space of the parameters and number of parameters, thereby surmounting the problem of local minima. We also show that by calibrating a Bayesian model, we can obtain the classical AIC, BIC and MDL model selection criteria within a penalized likelihood framework. Finally, we show theoretically and empirically that the algorithm converges to the modes of the full posterior distribution in an efficient way.

NeurIPS Conference 2000 Conference Paper

The Unscented Particle Filter

Rudolph van der Merwe
Arnaud Doucet
Nando de Freitas
Eric Wan

In this paper, we propose a new particle filter based on sequential importance sampling. The algorithm uses a bank of unscented fil(cid: 173) ters to obtain the importance proposal distribution. This proposal has two very "nice" properties. Firstly, it makes efficient use of the latest available information and, secondly, it can have heavy tails. As a result, we find that the algorithm outperforms stan(cid: 173) dard particle filtering and other nonlinear filtering methods very substantially. This experimental finding is in agreement with the theoretical convergence proof for the algorithm. The algorithm also includes resampling and (possibly) Markov chain Monte Carlo (MCMC) steps.

NeurIPS Conference 1999 Conference Paper

Robust Full Bayesian Methods for Neural Networks

Christophe Andrieu
João de Freitas
Arnaud Doucet

In this paper, we propose a full Bayesian model for neural networks. This model treats the model dimension (number of neurons), model parameters, regularisation parameters and noise parameters as ran(cid: 173) dom variables that need to be estimated. We then propose a re(cid: 173) versible jump Markov chain Monte Carlo (MCMC) method to per(cid: 173) form the necessary computations. We find that the results are not only better than the previously reported ones, but also appear to be robust with respect to the prior specification. Moreover, we present a geometric convergence theorem for the algorithm.

NeurIPS Conference 1998 Conference Paper

Global Optimisation of Neural Network Models via Sequential Sampling

João de Freitas
Mahesan Niranjan
Arnaud Doucet
Andrew Gee

We propose a novel strategy for training neural networks using se(cid: 173) quential sampling-importance resampling algorithms. This global optimisation strategy allows us to learn the probability distribu(cid: 173) tion of the network weights in a sequential framework. It is well suited to applications involving on-line, nonlinear, non-Gaussian or non-stationary signal processing.