Author name cluster

David Barber

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

42 papers

2 author rows

ICLR Conference 2025 Conference Paper

Improving Probabilistic Diffusion Models With Optimal Diagonal Covariance Matching

Zijing Ou
Mingtian Zhang
Andi Zhang 0001
Tim Z. Xiao
Yingzhen Li
David Barber

The probabilistic diffusion model has become highly effective across various domains. Typically, sampling from a diffusion model involves using a denoising distribution characterized by a Gaussian with a learned mean and either fixed or learned covariances. In this paper, we leverage the recently proposed covariance moment matching technique and introduce a novel method for learning the diagonal covariances. Unlike traditional data-driven covariance approximation approaches, our method involves directly regressing the optimal analytic covariance using a new, unbiased objective named Optimal Covariance Matching (OCM). This approach can significantly reduce the approximation error in covariance prediction. We demonstrate how our method can substantially enhance the sampling efficiency, recall rate and likelihood of both diffusion models and latent diffusion models.

NeurIPS Conference 2025 Conference Paper

Incremental Sequence Classification with Temporal Consistency

Lucas Maystre
Gabriel Barello
Tudor Berariu
Cambray Cambray
Rares Dolga
Alvaro Ortega Gonzalez
Andrei Nica
David Barber

We address the problem of incremental sequence classification, where predictions are updated as new elements in the sequence are revealed. Drawing on temporal-difference learning from reinforcement learning, we identify a temporal-consistency condition that successive predictions should satisfy. We leverage this condition to develop a novel loss function for training incremental sequence classifiers. Through a concrete example, we demonstrate that optimizing this loss can offer substantial gains in data efficiency. We apply our method to text classification tasks and show that it improves predictive accuracy over competing approaches on several benchmark datasets. We further evaluate our approach on the task of verifying large language model generations for correctness in grade-school math problems. Our results show that models trained with our method are better able to distinguish promising generations from unpromising ones after observing only a few tokens.

TMLR Journal 2025 Journal Article

Unifying Linear-Time Attention via Latent Probabilistic Modelling

Rares Dolga
Lucas Maystre
Marius Cobzarenco
David Barber

Transformers have achieved state-of-the-art results across a range of domains, but their quadratic attention mechanism poses significant challenges for long-sequence modelling. Recent efforts to design linear-time attention mechanisms have yielded more scalable alternatives, yet often at the cost of performance, particularly on discrete data such as language. In this work, we revisit linear attention through the lens of probabilistic graphical models. We first show that standard linear attention can be interpreted as an undirected latent variable model, revealing a key limitation: the absence of directionality. To address this, we propose a novel directed parameterisation of linear attention that introduces an asymmetric structure, enabling an interpretation aligned with the causal and sequential nature of language. Our formulation integrates global latent-variable attention with local standard attention in a fully probabilistic framework. Additionally, we introduce a recurrent parameterisation of queries and keys that avoids reliance on relative positional encodings, often incompatible with linear attention. Experiments on language modelling benchmarks demonstrate that our model achieves competitive performance with standard attention and outperforms existing linear attention variants.

ICML Conference 2024 Conference Paper

Active Preference Learning for Large Language Models

William Muldrew
Peter Hayes
Mingtian Zhang
David Barber

As large language models (LLMs) become more capable, fine-tuning techniques for aligning with human intent are increasingly important. A key consideration for aligning these models is how to most effectively use human resources, or model resources in the case where LLMs themselves are used as oracles. Reinforcement learning from Human or AI preferences (RLHF/RLAIF) is the most prominent example of such a technique, but is complex and often unstable. Direct Preference Optimization (DPO) has recently been proposed as a simpler and more stable alternative. In this work, we develop an active learning strategy for DPO to make better use of preference labels. We propose a practical acquisition function for prompt/completion pairs based on the predictive entropy of the language model and a measure of certainty of the implicit preference model optimized by DPO. We demonstrate how our approach improves both the rate of learning and final performance of fine-tuning on pairwise preference data.

ICML Conference 2024 Conference Paper

Diffusive Gibbs Sampling

Wenlin Chen
Mingtian Zhang
Brooks Paige
José Miguel Hernández-Lobato
David Barber

The inadequate mixing of conventional Markov Chain Monte Carlo (MCMC) methods for multi-modal distributions presents a significant challenge in practical applications such as Bayesian inference and molecular dynamics. Addressing this, we propose Diffusive Gibbs Sampling (DiGS), an innovative family of sampling methods designed for effective sampling from distributions characterized by distant and disconnected modes. DiGS integrates recent developments in diffusion models, leveraging Gaussian convolution to create an auxiliary noisy distribution that bridges isolated modes in the original space and applying Gibbs sampling to alternately draw samples from both spaces. A novel Metropolis-within-Gibbs scheme is proposed to enhance mixing in the denoising sampling step. DiGS exhibits a better mixing property for sampling multi-modal distributions than state-of-the-art methods such as parallel tempering, attaining substantially improved performance across various tasks, including mixtures of Gaussians, Bayesian neural networks and molecular dynamics.

NeurIPS Conference 2023 Conference Paper

Moment Matching Denoising Gibbs Sampling

Mingtian Zhang
Alex Hawkins-Hooker
Brooks Paige
David Barber

Energy-Based Models (EBMs) offer a versatile framework for modelling complex data distributions. However, training and sampling from EBMs continue to pose significant challenges. The widely-used Denoising Score Matching (DSM) method for scalable EBM training suffers from inconsistency issues, causing the energy model to learn a noisy data distribution. In this work, we propose an efficient sampling framework: (pseudo)-Gibbs sampling with moment matching, which enables effective sampling from the underlying clean model when given a noisy model that has been well-trained via DSM. We explore the benefits of our approach compared to related methods and demonstrate how to scale the method to high-dimensional datasets.

NeurIPS Conference 2022 Conference Paper

Generalization Gap in Amortized Inference

Mingtian Zhang
Peter Hayes
David Barber

The ability of likelihood-based probabilistic models to generalize to unseen data is central to many machine learning applications such as lossless compression. In this work, we study the generalization of a popular class of probabilistic model - the Variational Auto-Encoder (VAE). We discuss the two generalization gaps that affect VAEs and show that overfitting is usually dominated by amortized inference. Based on this observation, we propose a new training objective that improves the generalization of amortized inference. We demonstrate how our method can improve performance in the context of image modeling and lossless compression.

ICML Conference 2021 Conference Paper

Addressing Catastrophic Forgetting in Few-Shot Problems

Pau Ching Yap
Hippolyt Ritter
David Barber

Neural networks are known to suffer from catastrophic forgetting when trained on sequential datasets. While there have been numerous attempts to solve this problem in large-scale supervised classification, little has been done to overcome catastrophic forgetting in few-shot classification problems. We demonstrate that the popular gradient-based model-agnostic meta-learning algorithm (MAML) indeed suffers from catastrophic forgetting and introduce a Bayesian online meta-learning framework that tackles this problem. Our framework utilises Bayesian online learning and meta-learning along with Laplace approximation and variational inference to overcome catastrophic forgetting in few-shot classification problems. The experimental evaluations demonstrate that our framework can effectively achieve this goal in comparison with various baselines. As an additional utility, we also demonstrate empirically that our framework is capable of meta-learning on sequentially arriving few-shot tasks from a stationary task distribution.

ICLR Conference 2021 Conference Paper

Reducing the Computational Cost of Deep Generative Models with Binary Neural Networks

Thomas Bird
Friso H. Kingma
David Barber

Deep generative models provide a powerful set of tools to understand real-world data. But as these models improve, they increase in size and complexity, so their computational cost in memory and execution time grows. Using binary weights in neural networks is one method which has shown promise in reducing this cost. However, whether binary neural networks can be used in generative models is an open problem. In this work we show, for the first time, that we can successfully train generative models which utilize binary neural networks. This reduces the computational cost of the models massively. We develop a new class of binary weight normalization, and provide insights for architecture designs of these binarized generative models. We demonstrate that two state-of-the-art deep generative models, the ResNet VAE and Flow++ models, can be binarized effectively using these techniques. We train binary models that achieve loss values close to those of the regular models but are 90%-94% smaller in size, and also allow significant speed-ups in execution time.

ICLR Conference 2020 Conference Paper

HiLLoC: lossless image compression with hierarchical latent variable models

James Townsend
Thomas Bird
Julius Kunze
David Barber

We make the following striking observation: fully convolutional VAE models trained on 32x32 ImageNet can generalize well, not just to 64x64 but also to far larger photographs, with no changes to the model. We use this property, applying fully convolutional models to lossless compression, demonstrating a method to scale the VAE-based 'Bits-Back with ANS' algorithm for lossless compression to large color photographs, and achieving state of the art for compression of full size ImageNet images. We release Craystack, an open source library for convenient prototyping of lossless compression using probabilistic models, along with full implementations of all of our compression results.

ICML Conference 2020 Conference Paper

Spread Divergence

Mingtian Zhang
Peter Hayes
Thomas Bird
Raza Habib
David Barber

For distributions $\mathbb{P}$ and $\mathbb{Q}$ with different supports or undefined densities, the divergence $\textrm{D}(\mathbb{P}||\mathbb{Q})$ may not exist. We define a Spread Divergence $\tilde{\textrm{D}}(\mathbb{P}||\mathbb{Q})$ on modified $\mathbb{P}$ and $\mathbb{Q}$ and describe sufficient conditions for the existence of such a divergence. We demonstrate how to maximize the discriminatory power of a given divergence by parameterizing and learning the spread. We also give examples of using a Spread Divergence to train implicit generative models, including linear models (Independent Components Analysis) and non-linear models (Deep Generative Networks).

AAAI Conference 2018 Conference Paper

Generating Sentences Using a Dynamic Canvas

Harshil Shah
Bowen Zheng
David Barber

We introduce the Attentive Unsupervised Text (W)riter (AUTR), which is a word level generative model for natural language. It uses a recurrent neural network with a dynamic attention and canvas memory mechanism to iteratively construct sentences. By viewing the state of the memory at intermediate stages and where the model is placing its attention, we gain insight into how it constructs sentences. We demonstrate that AUTR learns a meaningful latent representation for each sentence, and achieves competitive log-likelihood lower bounds whilst being computationally efﬁcient. It is effective at generating and reconstructing sentences, as well as imputing missing words.

NeurIPS Conference 2018 Conference Paper

Generative Neural Machine Translation

Harshil Shah
David Barber

We introduce Generative Neural Machine Translation (GNMT), a latent variable architecture which is designed to model the semantics of the source and target sentences. We modify an encoder-decoder translation model by adding a latent variable as a language agnostic representation which is encouraged to learn the meaning of the sentence. GNMT achieves competitive BLEU scores on pure translation tasks, and is superior when there are missing words in the source sentence. We augment the model to facilitate multilingual translation and semi-supervised learning without adding parameters. This framework significantly reduces overfitting when there is limited paired data available, and is effective for translating between pairs of languages not seen during training.

NeurIPS Conference 2018 Conference Paper

Modular Networks: Learning to Decompose Neural Computation

Louis Kirsch
Julius Kunze
David Barber

Scaling model capacity has been vital in the success of deep learning. For a typical network, necessary compute resources and training time grow dramatically with model size. Conditional computation is a promising way to increase the number of parameters with a relatively small increase in resources. We propose a training algorithm that flexibly chooses neural modules based on the data to be processed. Both the decomposition and modules are learned end-to-end. In contrast to existing approaches, training does not rely on regularization to enforce diversity in module use. We apply modular networks both to image recognition and language modeling tasks, where we achieve superior performance compared to several baselines. Introspection reveals that modules specialize in interpretable contexts.

NeurIPS Conference 2018 Conference Paper

Online Structured Laplace Approximations for Overcoming Catastrophic Forgetting

Hippolyt Ritter
Aleksandar Botev
David Barber

We introduce the Kronecker factored online Laplace approximation for overcoming catastrophic forgetting in neural networks. The method is grounded in a Bayesian online learning framework, where we recursively approximate the posterior after every task with a Gaussian, leading to a quadratic penalty on changes to the weights. The Laplace approximation requires calculating the Hessian around a mode, which is typically intractable for modern architectures. In order to make our method scalable, we leverage recent block-diagonal Kronecker factored approximations to the curvature. Our algorithm achieves over 90% test accuracy across a sequence of 50 instantiations of the permuted MNIST dataset, substantially outperforming related methods for overcoming catastrophic forgetting.

ICML Conference 2017 Conference Paper

Practical Gauss-Newton Optimisation for Deep Learning

Aleksandar Botev
Hippolyt Ritter
David Barber

We present an efficient block-diagonal approximation to the Gauss-Newton matrix for feedforward neural networks. Our resulting algorithm is competitive against state-of-the-art first-order optimisation methods, with sometimes significant improvement in optimisation performance. Unlike first-order methods, for which hyperparameter tuning of the optimisation parameters is often a laborious process, our approach can provide good performance even when used with default settings. A side result of our work is that for piecewise linear transfer functions, the network objective function can have no differentiable local maxima, which may partially explain why such transfer functions facilitate effective optimisation.

NeurIPS Conference 2017 Conference Paper

Thinking Fast and Slow with Deep Learning and Tree Search

Thomas Anthony
Zheng Tian
David Barber

Sequential decision making problems, such as structured prediction, robotic control, and game playing, require a combination of planning policies and generalisation of those plans. In this paper, we present Expert Iteration (ExIt), a novel reinforcement learning algorithm which decomposes the problem into separate planning and generalisation tasks. Planning new policies is performed by tree search, while a deep neural network generalises those plans. Subsequently, tree search is improved by using the neural network policy to guide search, increasing the strength of new plans. In contrast, standard deep Reinforcement Learning algorithms rely on a neural network not only to generalise plans, but to discover them too. We show that ExIt outperforms REINFORCE for training a neural network to play the board game Hex, and our final tree search agent, trained tabula rasa, defeats MoHex1. 0, the most recent Olympiad Champion player to be publicly released.

NeurIPS Conference 2017 Conference Paper

Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning

Zhen He
Shaobing Gao
Liang Xiao
Daxue Liu
Hangen He
David Barber

Long Short-Term Memory (LSTM) is a popular approach to boosting the ability of Recurrent Neural Networks to store longer term temporal information. The capacity of an LSTM network can be increased by widening and adding layers. However, usually the former introduces additional parameters, while the latter increases the runtime. As an alternative we propose the Tensorized LSTM in which the hidden states are represented by tensors and updated via a cross-layer convolution. By increasing the tensor size, the network can be widened efficiently without additional parameters since the parameters are shared across different locations in the tensor; by delaying the output, the network can be deepened implicitly with little additional runtime since deep computations for each timestep are merged into temporal computations of the sequence. Experiments conducted on five challenging sequence learning tasks show the potential of the proposed model.

JMLR Journal 2016 Journal Article

Approximate Newton Methods for Policy Search in Markov Decision Processes

Thomas Furmston
Guy Lever
David Barber

Approximate Newton methods are standard optimization tools which aim to maintain the benefits of Newton's method, such as a fast rate of convergence, while alleviating its drawbacks, such as computationally expensive calculation or estimation of the inverse Hessian. In this work we investigate approximate Newton methods for policy optimization in Markov decision processes (MDPs). We first analyse the structure of the Hessian of the total expected reward, which is a standard objective function for MDPs. We show that, like the gradient, the Hessian exhibits useful structure in the context of MDPs and we use this analysis to motivate two Gauss-Newton methods for MDPs. Like the Gauss- Newton method for non-linear least squares, these methods drop certain terms in the Hessian. The approximate Hessians possess desirable properties, such as negative definiteness, and we demonstrate several important performance guarantees including guaranteed ascent directions, invariance to affine transformation of the parameter space and convergence guarantees. We finally provide a unifying perspective of key policy search algorithms, demonstrating that our second Gauss- Newton algorithm is closely related to both the EM-algorithm and natural gradient ascent applied to MDPs, but performs significantly better in practice on a range of challenging domains. [abs] [ pdf ][ bib ] &copy JMLR 2016. ( edit, beta )

ICML Conference 2014 Conference Paper

Gaussian Processes for Bayesian Estimation in Ordinary Differential Equations

David Barber
Yali Wang

Bayesian parameter estimation in coupled ordinary differential equations (ODEs) is challenging due to the high computational cost of numerical integration. In gradient matching a separate data model is introduced with the property that its gradient can be calculated easily. Parameter estimation is achieved by requiring consistency between the gradients computed from the data model and those specified by the ODE. We propose a Gaussian process model that directly links state derivative information with system observations, simplifying previous approaches and providing a natural generative model.

JMLR Journal 2013 Journal Article

Gaussian Kullback-Leibler Approximate Inference

Edward Challis
David Barber

We investigate Gaussian Kullback-Leibler (G-KL) variational approximate inference techniques for Bayesian generalised linear models and various extensions. In particular we make the following novel contributions: sufficient conditions for which the G-KL objective is differentiable and convex are described; constrained parameterisations of Gaussian covariance that make G-KL methods fast and scalable are provided; the lower bound to the normalisation constant provided by G-KL methods is proven to dominate those provided by local lower bounding methods; complexity and model applicability issues of G-KL versus other Gaussian approximate inference methods are discussed. Numerical results comparing G-KL and other deterministic Gaussian approximate inference methods are presented for: robust Gaussian process regression models with either Student-$t$ or Laplace likelihoods, large scale Bayesian binary logistic regression models, and Bayesian sparse linear models for sequential experimental design. [abs] [ pdf ][ bib ] &copy JMLR 2013. ( edit, beta )

NeurIPS Conference 2012 Conference Paper

A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

Thomas Furmston
David Barber

Parametric policy search algorithms are one of the methods of choice for the optimisation of Markov Decision Processes, with Expectation Maximisation and natural gradient ascent being considered the current state of the art in the field. In this article we provide a unifying perspective of these two algorithms by showing that their step-directions in the parameter space are closely related to the search direction of an approximate Newton method. This analysis leads naturally to the consideration of this approximate Newton method as an alternative gradient-based method for Markov Decision Processes. We are able show that the algorithm has numerous desirable properties, absent in the naive application of Newton's method, that make it a viable alternative to either Expectation Maximisation or natural gradient ascent. Empirical results suggest that the algorithm has excellent convergence and robustness properties, performing strongly in comparison to both Expectation Maximisation and natural gradient ascent.

NeurIPS Conference 2012 Conference Paper

Affine Independent Variational Inference

Edward Challis
David Barber

We present a method for approximate inference for a broad class of non-conjugate probabilistic models. In particular, for the family of generalized linear model target densities we describe a rich class of variational approximating densities which can be best fit to the target by minimizing the Kullback-Leibler divergence. Our approach is based on using the Fourier representation which we show results in efficient and scalable inference.

ICML Conference 2012 Conference Paper

Bayesian Conditional Cointegration

Chris Bracegirdle
David Barber

UAI Conference 2011 Conference Paper

Efficient Inference in Markov Control Problems

Thomas Furmston
David Barber

UAI Conference 2008 Conference Paper

Clique Matrices for Statistical Graph Decomposition and Parameterising Restricted Positive Definite Matrices

David Barber

We introduce Clique Matrices as an alternative representation of undirected graphs, being a generalisation of the incidence matrix representation. Here we use clique matrices to decompose a graph into a set of possibly overlapping clusters, defined as wellconnected subsets of vertices. The decomposition is based on a statistical description which encourages clusters to be well connected and few in number. Inference is carried out using a variational approximation. Clique matrices also play a natural role in parameterising positive definite matrices under zero constraints on elements of the matrix. We show that clique matrices can parameterise all positive definite matrices restricted according to a decomposable graph and form a structured Factor Analysis approximation in the non-decomposable case.

NeurIPS Conference 2006 Conference Paper

A Novel Gaussian Sum Smoother for Approximate Inference in Switching Linear Dynamical Systems

David Barber
Bertrand Mesot

We introduce a method for approximate smoothed inference in a class of switching linear dynamical systems, based on a novel form of Gaussian Sum smoother. This class includes the switching Kalman Filter and the more general case of switch transitions dependent on the continuous latent state. The method improves on the standard Kim smoothing approach by dispensing with one of the key approximations, thus making fuller use of the available future information. Whilst the only central assumption required is projection to a mixture of Gaussians, we show that an additional conditional independence assumption results in a simpler but stable and accurate alternative. Unlike the alternative unstable Expectation Propagation procedure, our method consists only of a single forward and backward pass and is reminiscent of the standard smoothing `correction' recursions in the simpler linear dynamical system. The algorithm performs well on both toy experiments and in a large scale application to noise robust speech recognition. 1 Switching Linear Dynamical System The Linear Dynamical System (LDS) [1] is a key temporal model in which a latent linear process generates the observed series. For complex time-series which are not well described globally by a single LDS, we may break the time-series into segments, each modeled by a potentially different LDS. This is the basis for the Switching LDS (SLDS) [2, 3, 4, 5] where, for each time t, a switch variable st 1, .. ., S describes which of the LDSs is to be used. The observation (or `visible') vt RV is linearly related to the hidden state ht RH with additive noise by vt = B (st )ht + v (st ) p(vt |ht, st ) = N (B (st )ht, v (st )) (1 ) where N (, ) denotes a Gaussian distribution with mean and covariance. The transition dynamics of the continuous hidden state ht is linear, A ( ht = A(st )ht-1 + h (st ), p(ht |ht-1, st ) = N (st )ht-1, h (st ) 2) The switch st may depend on both the previous st-1 and ht-1. This is an augmented SLDS (aSLDS), and defines the model p(v1: T, h1: T, s1: T ) = tT p(vt |ht, st )p(ht |ht-1, st )p(st |ht-1, st-1 ) =1 The standard SLDS[4] considers only switch transitions p(st |st-1 ). At time t = 1, p(s1 |h0, s0 ) simply denotes the prior p(s1 ), and p(h1 |h0, s1 ) denotes p(h1 |s1 ). The aim of this article is to address how to perform inference in the aSLDS. In particular we desire the filtered estimate p(ht, st |v1: t ) and the smoothed estimate p(ht, st |v1: T ), for any 1 t T. Both filtered and smoothed inference in the SLDS is intractable, scaling exponentially with time [4].

JMLR Journal 2006 Journal Article

Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems

David Barber

We introduce a method for approximate smoothed inference in a class of switching linear dynamical systems, based on a novel form of Gaussian Sum smoother. This class includes the switching Kalman 'Filter' and the more general case of switch transitions dependent on the continuous latent state. The method improves on the standard Kim smoothing approach by dispensing with one of the key approximations, thus making fuller use of the available future information. Whilst the central assumption required is projection to a mixture of Gaussians, we show that an additional conditional independence assumption results in a simpler but accurate alternative. Our method consists of a single Forward and Backward Pass and is reminiscent of the standard smoothing 'correction' recursions in the simpler linear dynamical system. The method is numerically stable and compares favourably against alternative approximations, both in cases where a single mixture component provides a good posterior approximation, and where a multimodal approximation is required. [abs] [ pdf ][ bib ] &copy JMLR 2006. ( edit, beta )

NeurIPS Conference 2006 Conference Paper

Unified Inference for Variational Bayesian Linear Gaussian State-Space Models

David Barber
Silvia Chiappa

Linear Gaussian State-Space Models are widely used and a Bayesian treatment of parameters is therefore of considerable interest. The approximate Variational Bayesian method applied to these models is an attractive approach, used successfully in applications ranging from acoustics to bioinformatics. The most challenging aspect of implementing the method is in performing inference on the hidden state sequence of the model. We show how to convert the inference problem so that standard Kalman Filtering/Smoothing recursions from the literature may be applied. This is in contrast to previously published approaches based on Belief Propagation. Our framework both simplifies and unifies the inference problem, so that future applications may be more easily developed. We demonstrate the elegance of the approach on Bayesian temporal ICA, with an application to finding independent dynamical processes underlying noisy EEG signals. 1 Linear Gaussian State-Space Models Linear Gaussian State-Space Models (LGSSMs)1 are fundamental in time-series analysis [1, 2, 3]. In these models the observations v1: T 2 are generated from an underlying dynamical system on h1: T according to: v v vt = B ht + t, t N (0V, V ), h h ht = Aht-1 + t, t N (0H, H ), where N (, ) denotes a Gaussian with mean and covariance, and 0X denotes an X dimensional zero vector. The observation vt has dimension V and the hidden state ht has dimension H. Probabilistically, the LGSSM is defined by: p(v1: T, h1: T |) = p(v1 |h1 )p(h1 ) tT p(vt |ht )p(ht |ht-1 ), =2 with p(vt |ht ) = N (B ht, V ), p(ht |ht-1 ) = N (Aht-1, H ), p(h1 ) = N (, ) and where = {A, B, H, V, , } denotes the model parameters. Because of the widespread use of these models, a Bayesian treatment of parameters is of considerable interest [4, 5, 6, 7, 8]. An exact implementation of the Bayesian LGSSM is formally intractable [8], and recently a Variational Bayesian (VB) approximation has been studied [4, 5, 6, 7, 9]. The most challenging part of implementing the VB method is performing inference over h1: T, and previous authors have developed their own specialized routines, based on Belief Propagation, since standard LGSSM inference routines appear, at first sight, not to be applicable. 1 2 Also called Kalman Filters/Smoothers, Linear Dynamical Systems. v1: T denotes v1, .. ., vT. A key contribution of this paper is to show how the Variational Bayesian treatment of the LGSSM can be implemented using standard LGSSM inference routines. Based on the insight we provide, any standard inference method may be applied, including those specifically addressed to improve numerical stability [2, 10, 11]. In this article, we decided to describe the predictor-corrector and Rauch-Tung-Striebel recursions [2], and also suggest a small modification that reduces computational cost. The Bayesian LGSSM is particularly of interest when strong prior constraints are needed to find adequate solutions. One such case is in EEG signal analysis, whereby we wish to extract sources that evolve independently through time. Since EEG is particularly noisy [12], a prior that encourages sources to have preferential dynamics is advantageous. This application is discussed in Section 4, and demonstrates the ease of applying our VB framework. 2 Bayesian Linear Gaussian State-Space Models In the Bayesian treatment of the LGSSM, instead of considering the model parameters as fixed, ^ ^ we define a prior distribution p(|), where is a set of hyperparameters. Then: ^ ^ p(v1: T |) = p(v1: T |)p(|). (1) In a full Bayesian treatment we would define additional prior distributions over the hyperparameters ^. Here we take instead the ML-II (`evidence') framework, in which the optimal set of hyperpa^ ^ rameters is found by maximizing p(v1: T |) with respect to [6, 7, 9]. For the parameter priors, here we define Gaussians on the columns of A and B 3: p(A|, H ) jH e- j 2

ICML Conference 2005 Conference Paper

A graphical model for chord progressions embedded in a psychoacoustic space

Jean-François Paiement
Douglas Eck
Samy Bengio
David Barber

NeurIPS Conference 2005 Conference Paper

Kernelized Infomax Clustering

David Barber
Felix Agakov

We propose a simple information-theoretic approach to soft clus- tering based on maximizing the mutual information I(x, y) between the unknown cluster labels y and the training patterns x with re- spect to parameters of speciﬁcally constrained encoding distribu- tions. The constraints are chosen such that patterns are likely to be clustered similarly if they lie close to speciﬁc unknown vectors in the feature space. The method may be conveniently applied to learning the optimal aﬃnity matrix, which corresponds to learn- ing parameters of the kernelized encoder. The procedure does not require computations of eigenvalues of the Gram matrices, which makes it potentially attractive for clustering large data sets.

NeurIPS Conference 2003 Conference Paper

Information Maximization in Noisy Channels : A Variational Approach

David Barber
Felix Agakov

The maximisation of information transmission over noisy channels is a common, albeit generally computationally diﬃcult problem. We approach the diﬃculty of computing the mutual information for noisy channels by using a variational approximation. The re- sulting IM algorithm is analagous to the EM algorithm, yet max- imises mutual information, as opposed to likelihood. We apply the method to several practical examples, including linear compression, population encoding and CDMA.

NeurIPS Conference 2002 Conference Paper

Dynamic Bayesian Networks with Deterministic Latent Tables

David Barber

The application of latent/hidden variable Dynamic Bayesian Net- works is constrained by the complexity of marginalising over latent variables. For this reason either small latent dimensions or Gaus- sian latent conditional tables linearly dependent on past states are typically considered in order that inference is tractable. We suggest an alternative approach in which the latent variables are modelled using deterministic conditional probability tables. This specialisa- tion has the advantage of tractable inference even for highly com- plex non-linear/non-Gaussian visible conditional probability tables. This approach enables the consideration of highly complex latent dynamics whilst retaining the bene(cid: 12)ts of a tractable probabilistic model.

NeurIPS Conference 2002 Conference Paper

Learning in Spiking Neural Assemblies

David Barber

We consider a statistical framework for learning in a class of net- works of spiking neurons. Our aim is to show how optimal local learning rules can be readily derived once the neural dynamics and desired functionality of the neural assembly have been speciﬂed, in contrast to other models which assume (sub-optimal) learning rules. Within this framework we derive local rules for learning tem- poral sequences in a model of spiking neurons and demonstrate its superior performance to correlation (Hebbian) based approaches. We further show how to include mechanisms such as synaptic de- pression and outline how the framework is readily extensible to learning in networks of highly complex spiking neurons. A stochas- tic quantal vesicle release mechanism is considered and implications on the complexity of learning discussed.

NeurIPS Conference 1999 Conference Paper

Gaussian Fields for Approximate Inference in Layered Sigmoid Belief Networks

David Barber
Peter Sollich

Local "belief propagation" rules of the sort proposed by Pearl [15] are guaranteed to converge to the correct posterior probabilities in singly connected graphical models. Recently, a number of researchers have em(cid: 173) pirically demonstrated good performance of "loopy belief propagation"(cid: 173) using these same rules on graphs with loops. Perhaps the most dramatic instance is the near Shannon-limit performance of "Turbo codes", whose decoding algorithm is equivalent to loopy belief propagation. Except for the case of graphs with a single loop, there has been little theo(cid: 173) retical understanding of the performance of loopy propagation. Here we analyze belief propagation in networks with arbitrary topologies when the nodes in the graph describe jointly Gaussian random variables. We give an analytical formula relating the true posterior probabilities with those calculated using loopy propagation. We give sufficient conditions for convergence and show that when belief propagation converges it gives the correct posterior means for all graph topologies, not just networks with a single loop. The related "max-product" belief propagation algorithm finds the max(cid: 173) imum posterior probability estimate for singly connected networks. We show that, even for non-Gaussian probability distributions, the conver(cid: 173) gence points of the max-product algorithm in loopy networks are max(cid: 173) ima over a particular large local neighborhood of the posterior proba(cid: 173) bility. These results help clarify the empirical performance results and motivate using the powerful belief propagation algorithm in a broader class of networks. Problems involving probabilistic belief propagation arise in a wide variety of applications, including error correcting codes, speech recognition and medical diagnosis. If the graph is singly connected, there exist local message-passing schemes to calculate the posterior probability of an unobserved variable given the observed variables. Pearl [15] derived such a scheme for singly connected Bayesian networks and showed that this "belief propagation" algorithm is guaranteed to converge to the correct posterior probabilities (or "beliefs"). Several groups have recently reported excellent experimental results by running algorithms 674 Y. Weiss and W T. Freeman equivalent to Pearl's algorithm on networks with loops [8, 13, 6]. Perhaps the most dramatic instance of this performance is for "Turbo code" [2] error correcting codes. These codes have been described as "the most exciting and potentially important development in coding theory in many years" [12] and have recently been shown [10, 11] to utilize an algorithm equivalent to belief propagation in a network with loops. Progress in the analysis of loopy belief propagation has been made for the case of networks with a single loop [17, 18, 4, 1]. For these networks, it can be shown that (1) unless all the compatabilities are deterministic, loopy belief propagation will converge. (2) The difference between the loopy beliefs and the true beliefs is related to the convergence rate the faster the convergence the more exact the approximation and (3) If of the messages - the hidden nodes are binary, then the loopy beliefs and the true beliefs are both maximized by the same assignments, although the confidence in that assignment is wrong for the loopy beliefs. In this paper we analyze belief propagation in graphs of arbitrary topology, for nodes de(cid: 173) scribing jointly Gaussian random variables. We give an exact formula relating the correct marginal posterior probabilities with the ones calculated using loopy belief propagation. We show that if belief propagation converges, then it will give the correct posterior means for all graph topologies, not just networks with a single loop. We show that the covari(cid: 173) ance estimates will generally be incorrect but present a relationship between the error in the covariance estimates and the convergence speed. For Gaussian or non-Gaussian vari(cid: 173) ables, we show that the "max-product" algorithm, which calculates the MAP estimate in singly connected networks, only converges to points that are maxima over a particular large neighborhood of the posterior probability of loopy networks.

NeurIPS Conference 1998 Conference Paper

Tractable Variational Structures for Approximating Graphical Models

David Barber
Wim Wiegerinck

Graphical models provide a broad probabilistic framework with ap(cid: 173) plications in speech recognition (Hidden Markov Models), medical diagnosis (Belief networks) and artificial intelligence (Boltzmann Machines). However, the computing time is typically exponential in the number of nodes in the graph. Within the variational frame(cid: 173) work for approximating these models, we present two classes of dis(cid: 173) tributions, decimatable Boltzmann Machines and Tractable Belief Networks that go beyond the standard factorized approach. We give generalised mean-field equations for both these directed and undirected approximations. Simulation results on a small bench(cid: 173) mark problem suggest using these richer approximations compares favorably against others previously reported in the literature.

NeurIPS Conference 1997 Conference Paper

Ensemble Learning for Multi-Layer Networks

David Barber
Christopher Bishop

Bayesian treatments of learning in neural networks are typically based either on local Gaussian approximations to a mode of the posterior weight distribution, or on Markov chain Monte Carlo simulations. A third approach, called ensemble learning, was in(cid: 173) troduced by Hinton and van Camp (1993). It aims to approximate the posterior distribution by minimizing the Kullback-Leibler di(cid: 173) vergence between the true posterior and a parametric approximat(cid: 173) ing distribution. However, the derivation of a deterministic algo(cid: 173) rithm relied on the use of a Gaussian approximating distribution with a diagonal covariance matrix and so was unable to capture the posterior correlations between parameters. In this paper, we show how the ensemble learning approach can be extended to full(cid: 173) covariance Gaussian distributions while remaining computationally tractable. We also extend the framework to deal with hyperparam(cid: 173) eters, leading to a simple re-estimation procedure. Initial results from a standard benchmark problem are encouraging.

NeurIPS Conference 1997 Conference Paper

On-line Learning from Finite Training Sets in Nonlinear Networks

Peter Sollich
David Barber

Online learning is one of the most common forms of neural net(cid: 173) work training. We present an analysis of online learning from finite training sets for non-linear networks (namely, soft-committee ma(cid: 173) chines), advancing the theory to more realistic learning scenarios. Dynamical equations are derived for an appropriate set of order parameters; these are exact in the limiting case of either linear networks or infinite training sets. Preliminary comparisons with simulations suggest that the theory captures some effects of finite training sets, but may not yet account correctly for the presence of local minima.

NeurIPS Conference 1997 Conference Paper

Radial Basis Functions: A Bayesian Treatment

David Barber
Bernhard Schottky

Bayesian methods have been successfully applied to regression and classification problems in multi-layer perceptrons. We present a novel application of Bayesian techniques to Radial Basis Function networks by developing a Gaussian approximation to the posterior distribution which, for fixed basis function widths, is analytic in the parameters. The setting of regularization constants by cross(cid: 173) validation is wasteful as only a single optimal parameter estimate is retained. We treat this issue by assigning prior distributions to these constants, which are then adapted in light of the data under a simple re-estimation formula.

NeurIPS Conference 1996 Conference Paper

Bayesian Model Comparison by Monte Carlo Chaining

David Barber
Christopher Bishop

The techniques of Bayesian inference have been applied with great success to many problems in neural computing including evaluation of regression functions, determination of error bars on predictions, and the treatment of hyper-parameters. However, the problem of model comparison is a much more challenging one for which current techniques have significant limitations. In this paper we show how an extended form of Markov chain Monte Carlo, called chaining, is able to provide effective estimates of the relative probabilities of different models. We present results from the robot arm problem and compare them with the corresponding results obtained using the standard Gaussian approximation framework. 1 Bayesian Model Comparison In a Bayesian treatment of statistical inference, our state of knowledge of the values of the parameters w in a model M is described in terms of a probability distribution function. Initially this is chosen to be some prior distribution p(wIM), which can be combined with a likelihood function p( Dlw, M) using Bayes' theorem to give a posterior distribution p(wID, M) in the form

NeurIPS Conference 1996 Conference Paper

Gaussian Processes for Bayesian Classification via Hybrid Monte Carlo

David Barber
Christopher Williams

The full Bayesian method for applying neural networks to a pre(cid: 173) diction problem is to set up the prior/hyperprior structure for the net and then perform the necessary integrals. However, these inte(cid: 173) grals are not tractable analytically, and Markov Chain Monte Carlo (MCMC) methods are slow, especially if the parameter space is high-dimensional. Using Gaussian processes we can approximate the weight space integral analytically, so that only a small number of hyperparameters need be integrated over by MCMC methods. We have applied this idea to classification problems, obtaining ex(cid: 173) cellent results on the real-world problems investigated so far.

NeurIPS Conference 1996 Conference Paper

Online Learning from Finite Training Sets: An Analytical Case Study

Peter Sollich
David Barber

We analyse online learning from finite training sets at non(cid: 173) infinitesimal learning rates TJ. By an extension of statistical me(cid: 173) chanics methods, we obtain exact results for the time-dependent generalization error of a linear network with a large number of weights N. We find, for example, that for small training sets of size p ~ N, larger learning rates can be used without compromis(cid: 173) ing asymptotic generalization performance or convergence speed. Encouragingly, for optimal settings of TJ (and, less importantly, weight decay, ) at given final learning time, the generalization per(cid: 173) formance of online learning is essentially as good as that of offline learning.