Arrow Research search

Author name cluster

Mohammad Emtiyaz Khan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

39 papers
2 author rows

Possible papers

39

TMLR Journal 2026 Journal Article

Variational Visual Question Answering for Uncertainty-Aware Selective Prediction

  • Tobias Jan Wieczorek
  • Nathalie Daun
  • Mohammad Emtiyaz Khan
  • Marcus Rohrbach

Despite remarkable progress in recent years, Vision Language Models (VLMs) remain prone to overconfidence and hallucinations on tasks such as Visual Question Answering (VQA) and Visual Reasoning. Bayesian methods can potentially improve reliability by helping models predict selectively, that is, models respond only when they are sufficiently confident. Unfortunately, such approaches can be costly and ineffective for large models, and there exists little evidence to show otherwise for multimodal applications. Here, we show for the first time the effectiveness and competitive edge of variational Bayes for selective prediction in VQA. We build on recent advances in variational methods for deep learning and propose an extension called "Variational VQA". This method improves calibration and yields significant gains for selective prediction on VQA and Visual Reasoning, particularly when the error tolerance is low (≤ 1%). Often, just one posterior sample yields more reliable answers than those given by models trained with AdamW. In addition, we propose a new risk-averse selector that outperforms standard sample averaging by considering the variance of predictions. Overall, we present compelling evidence that variational learning is a viable option to make large VLMs safer and more trustworthy.
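A minimal sketch of the variance-aware selection idea described in the abstract above. This is a hypothetical illustration only, not the paper's selector: the function name, threshold `tau`, and penalty weight `kappa` are assumptions.

```python
import numpy as np

def risk_averse_select(sample_probs, tau=0.9, kappa=1.0):
    """Hypothetical risk-averse selector: abstain unless the mean confidence,
    penalized by disagreement across posterior samples, clears a threshold.

    sample_probs: (S, C) class probabilities from S posterior samples.
    Returns (predicted class, whether to answer).
    """
    mean = sample_probs.mean(axis=0)      # average prediction over samples
    c = int(mean.argmax())                # candidate answer
    std = sample_probs[:, c].std()        # disagreement among samples
    score = mean[c] - kappa * std         # variance-penalized confidence
    return c, bool(score >= tau)
```

Samples that agree yield a high score and an answer; disagreeing samples lower the score and trigger abstention, which is the intuition behind penalizing predictive variance.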

NeurIPS Conference 2025 Conference Paper

Compact Memory for Continual Logistic Regression

  • Yohan Jung
  • Hyungi Lee
  • Wenlong Chen
  • Thomas Möllenhoff
  • Yingzhen Li
  • Juho Lee
  • Mohammad Emtiyaz Khan

Despite recent progress, continual learning still does not match the performance of batch training. To avoid catastrophic forgetting, we need to build compact memory of essential past knowledge, but no clear solution has yet emerged, even for shallow neural networks with just one or two layers. In this paper, we present a new method to build compact memory for logistic regression. Our method is based on a result by Khan and Swaroop [2021] who show the existence of optimal memory for such models. We formulate the search for the optimal memory as Hessian-matching and propose a probabilistic PCA method to estimate them. Our approach can drastically improve accuracy compared to Experience Replay. For instance, on Split-ImageNet, we get 60% accuracy compared to 30% obtained by replay with memory-size equivalent to 0.3% of the data size. Increasing the memory size to 2% further boosts the accuracy to 74%, closing the gap to the batch accuracy of 77.6% on this task. Our work opens a new direction for building compact memory that can also be useful in the future for continual deep learning.

ICLR Conference 2025 Conference Paper

Connecting Federated ADMM to Bayes

  • Siddharth Swaroop
  • Mohammad Emtiyaz Khan
  • Finale Doshi-Velez

We provide new connections between two distinct federated learning approaches based on (i) ADMM and (ii) Variational Bayes (VB), and propose new variants by combining their complementary strengths. Specifically, we show that the dual variables in ADMM naturally emerge through the "site" parameters used in VB with isotropic Gaussian covariances. Using this, we derive two versions of ADMM from VB that use flexible covariances and functional regularisation, respectively. Through numerical experiments, we validate the improvements obtained in performance. The work shows a connection between two fields that are believed to be fundamentally different and combines them to improve federated learning.

TMLR Journal 2025 Journal Article

Optimization Guarantees for Square-Root Natural-Gradient Variational Inference

  • Navish Kumar
  • Thomas Möllenhoff
  • Mohammad Emtiyaz Khan
  • Aurelien Lucchi

Variational inference with natural-gradient descent often shows fast convergence in practice, but its theoretical convergence guarantees have been challenging to establish. This is true even for the simplest cases that involve concave log-likelihoods and use a Gaussian approximation. We show that the challenge can be circumvented for such cases using a square-root parameterization for the Gaussian covariance. This approach establishes novel convergence guarantees for natural-gradient variational-Gaussian inference and its continuous-time gradient flow. Our experiments demonstrate the effectiveness of natural gradient methods and highlight their advantages over algorithms that use Euclidean or Wasserstein geometries.

NeurIPS Conference 2025 Conference Paper

Variational Learning Finds Flatter Solutions at the Edge of Stability

  • Avrajit Ghosh
  • Bai Cong
  • Rio Yokota
  • Saiprasad Ravishankar
  • Rongrong Wang
  • Molei Tao
  • Mohammad Emtiyaz Khan
  • Thomas Möllenhoff

Variational Learning (VL) has recently gained popularity for training deep neural networks. Part of its empirical success can be explained by theories such as PAC-Bayes bounds, minimum description length and marginal likelihood, but little has been done to unravel the implicit regularization at play. Here, we analyze the implicit regularization of VL through the Edge of Stability (EoS) framework. EoS has previously been used to show that gradient descent can find flat solutions and we extend this result to show that VL can find even flatter solutions. This result is obtained by controlling the shape of the variational posterior as well as the number of posterior samples used during training. The derivation follows in a similar fashion as in the standard EoS literature for deep learning, by first deriving a result for a quadratic problem and then extending it to deep neural networks. We empirically validate these findings on a wide variety of large networks, such as ResNet and ViT, to find that the theoretical results closely match the empirical ones. Ours is the first work to analyze the EoS dynamics of VL.

ICLR Conference 2024 Conference Paper

Conformal Prediction via Regression-as-Classification

  • Etash Kumar Guha
  • Shlok Natarajan
  • Thomas Möllenhoff
  • Mohammad Emtiyaz Khan
  • Eugène Ndiaye

Conformal prediction (CP) for regression can be challenging, especially when the output distribution is heteroscedastic, multimodal, or skewed. Some of the issues can be addressed by estimating a distribution over the output, but in reality, such approaches can be sensitive to estimation error and yield unstable intervals. Here, we circumvent the challenges by converting regression to a classification problem and then use CP for classification to obtain CP sets for regression. To preserve the ordering of the continuous-output space, we design a new loss function and present necessary modifications to the CP classification techniques. Empirical results on many benchmarks show that this simple approach gives surprisingly good results on many practical problems.
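The regression-as-classification recipe above (discretize the output range, classify over bins, then apply split conformal prediction for classification) can be sketched in a few lines. This is a generic split-conformal sketch under standard assumptions, not the paper's exact method, and all names here are hypothetical:

```python
import numpy as np

def conformal_set_from_bins(probs_cal, y_cal_bins, probs_test, alpha=0.1):
    """Split-conformal prediction sets over discretized outputs.

    probs_cal: (n_cal, K) predicted bin probabilities on calibration data
    y_cal_bins: (n_cal,) true bin index of each calibration point
    probs_test: (n_test, K) predicted bin probabilities on test data
    Returns a boolean (n_test, K) mask of bins included in each set.
    """
    n = len(y_cal_bins)
    # Nonconformity score: one minus the probability of the true bin.
    scores = 1.0 - probs_cal[np.arange(n), y_cal_bins]
    # Finite-sample-corrected quantile of the calibration scores.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    # Include every bin whose score would fall at or below the threshold.
    return probs_test >= 1.0 - q
```

Contiguous runs of included bins can then be mapped back to intervals on the continuous output space; the paper's ordering-preserving loss is what encourages such contiguity.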

ICLR Conference 2024 Conference Paper

Model Merging by Uncertainty-Based Gradient Matching

  • Nico Daheim
  • Thomas Möllenhoff
  • Edoardo M. Ponti
  • Iryna Gurevych
  • Mohammad Emtiyaz Khan

Models trained on different datasets can be merged by a weighted-averaging of their parameters, but why does it work and when can it fail? Here, we connect the inaccuracy of weighted-averaging to mismatches in the gradients and propose a new uncertainty-based scheme to improve the performance by reducing the mismatch. The connection also reveals implicit assumptions in other schemes such as averaging, task arithmetic, and Fisher-weighted averaging. Our new method gives consistent improvements for large language models and vision transformers, both in terms of performance and robustness to hyperparameters.
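The weighted-averaging family the abstract refers to can be caricatured as precision-weighted parameter averaging (the flavor behind Fisher-weighted averaging, which the paper analyzes). A hypothetical sketch, not the paper's uncertainty-based scheme:

```python
import numpy as np

def precision_weighted_merge(params, precisions):
    """Merge models by per-parameter precision-weighted averaging.

    params: list of parameter vectors, one per model
    precisions: matching list of per-parameter precision (certainty) vectors,
                e.g. diagonal Fisher estimates
    """
    # Parameters a model is more certain about contribute more to the merge.
    numerator = sum(h * p for h, p in zip(precisions, params))
    denominator = sum(precisions)
    return numerator / denominator
```

With equal precisions this reduces to plain averaging, which is the implicit assumption the abstract says such schemes reveal.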

ICML Conference 2024 Conference Paper

Position: Bayesian Deep Learning is Needed in the Age of Large-Scale AI

  • Theodore Papamarkou
  • Maria Skoularidou
  • Konstantina Palla
  • Laurence Aitchison
  • Julyan Arbel
  • David B. Dunson
  • Maurizio Filippone
  • Vincent Fortuin

In the current landscape of deep learning research, there is a predominant emphasis on achieving high predictive accuracy in supervised tasks involving large image and language datasets. However, a broader perspective reveals a multitude of overlooked metrics, tasks, and data types, such as uncertainty, active and continual learning, and scientific data, that demand attention. Bayesian deep learning (BDL) constitutes a promising avenue, offering advantages across these diverse settings. This paper posits that BDL can elevate the capabilities of deep learning. It revisits the strengths of BDL, acknowledges existing challenges, and highlights some exciting research avenues aimed at addressing these obstacles. Looking ahead, the discussion focuses on possible ways to combine large-scale foundation models with BDL to unlock their full potential.

ICML Conference 2024 Conference Paper

Variational Learning is Effective for Large Deep Networks

  • Yuesong Shen
  • Nico Daheim
  • Bai Cong
  • Peter Nickl
  • Gian Maria Marconi
  • Clement Bazan
  • Rio Yokota
  • Iryna Gurevych

We give extensive empirical evidence against the common belief that variational learning is ineffective for large neural networks. We show that an optimizer called Improved Variational Online Newton (IVON) consistently matches or outperforms Adam for training large networks such as GPT-2 and ResNets from scratch. IVON’s computational costs are nearly identical to Adam but its predictive uncertainty is better. We show several new use cases of IVON where we improve finetuning and model merging in Large Language Models, accurately predict generalization error, and faithfully estimate sensitivity to data. We find overwhelming evidence that variational learning is effective. Code is available at https://github.com/team-approx-bayes/ivon.

TMLR Journal 2023 Journal Article

Bridging the Gap Between Target Networks and Functional Regularization

  • Alexandre Piché
  • Valentin Thomas
  • Joseph Marino
  • Rafael Pardinas
  • Gian Maria Marconi
  • Christopher Pal
  • Mohammad Emtiyaz Khan

Bootstrapping is behind much of the success of deep Reinforcement Learning. However, learning the value function via bootstrapping often leads to unstable training due to fast-changing target values. Target Networks are employed to stabilize training by using an additional set of lagging parameters to estimate the target values. Despite the popularity of Target Networks, their effect on the optimization is still not well understood. In this work, we show that they act as an implicit regularizer which can be beneficial in some cases, but which is also inflexible and can cause instabilities, even when vanilla TD(0) converges. To overcome these issues, we propose an explicit Functional Regularization alternative that is flexible and a convex regularizer in function space, and we theoretically study its convergence. We conducted an experimental study across a range of environments, discount factors, and degrees of off-policy data collection to investigate the effectiveness of the regularization induced by Target Networks and Functional Regularization in terms of performance, accuracy, and stability. Our findings emphasize that Functional Regularization can be used as a drop-in replacement for Target Networks and result in performance improvements. Furthermore, adjusting both the regularization weight and the network update period in Functional Regularization can result in further performance improvements compared to solely adjusting the network update period as typically done with Target Networks. Our approach also enhances the ability of networks to recover accurate $Q$-values.

UAI Conference 2023 Conference Paper

Exploiting Inferential Structure in Neural Processes

  • Dharmesh Tailor
  • Mohammad Emtiyaz Khan
  • Eric T. Nalisnick

Neural Processes (NPs) are appealing due to their ability to perform fast adaptation based on a context set. This set is encoded by a latent variable, which is often assumed to follow a simple distribution. However, in real-world settings, the context set may be drawn from richer distributions having multiple modes, heavy tails, etc. In this work, we provide a framework that allows NPs’ latent variable to be given a rich prior defined by a graphical model. These distributional assumptions directly translate into an appropriate aggregation strategy for the context set. Moreover, we describe a message-passing procedure that still allows for end-to-end optimization with stochastic gradients. We demonstrate the generality of our framework by using mixture and Student-t assumptions that yield improvements in function modelling and test-time robustness.

TMLR Journal 2023 Journal Article

Improving Continual Learning by Accurate Gradient Reconstructions of the Past

  • Erik Daxberger
  • Siddharth Swaroop
  • Kazuki Osawa
  • Rio Yokota
  • Richard E Turner
  • José Miguel Hernández-Lobato
  • Mohammad Emtiyaz Khan

Weight-regularization and experience replay are two popular continual-learning strategies with complementary strengths: while weight-regularization requires less memory, replay can more accurately mimic batch training. How can we combine them to get better methods? Despite the simplicity of the question, little is known or done to optimally combine these approaches. In this paper, we present such a method by using a recently proposed principle of adaptation that relies on a faithful reconstruction of the gradients of the past data. Using this principle, we design a prior which combines two types of replay methods with a quadratic weight-regularizer and achieves better gradient reconstructions. The combination improves performance on standard task-incremental continual learning benchmarks such as Split-CIFAR, Split-TinyImageNet, and ImageNet-1000, achieving $>\!80\%$ of the batch performance by simply utilizing a memory of $<\!10\%$ of the past data. Our work shows that a good combination of the two strategies can be very effective in reducing forgetting.

ICML Conference 2023 Conference Paper

Memory-Based Dual Gaussian Processes for Sequential Learning

  • Paul Edmund Chang
  • Prakhar Verma
  • S. T. John
  • Arno Solin
  • Mohammad Emtiyaz Khan

Sequential learning with Gaussian processes (GPs) is challenging when access to past data is limited, for example, in continual and active learning. In such cases, errors can accumulate over time due to inaccuracies in the posterior, hyperparameters, and inducing points, making accurate learning challenging. Here, we present a method to keep all such errors in check using the recently proposed dual sparse variational GP. Our method enables accurate inference for generic likelihoods and improves learning by actively building and updating a memory of past data. We demonstrate its effectiveness in several applications involving Bayesian optimization, active learning, and continual learning.

ICLR Conference 2023 Conference Paper

SAM as an Optimal Relaxation of Bayes

  • Thomas Möllenhoff
  • Mohammad Emtiyaz Khan

Sharpness-aware minimization (SAM) and related adversarial deep-learning methods can drastically improve generalization, but their underlying mechanisms are not yet fully understood. Here, we establish SAM as a relaxation of the Bayes objective where the expected negative-loss is replaced by the optimal convex lower bound, obtained by using the so-called Fenchel biconjugate. The connection enables a new Adam-like extension of SAM to automatically obtain reasonable uncertainty estimates, while sometimes also improving its accuracy. By connecting adversarial and Bayesian methods, our work opens a new path to robustness.

ICML Conference 2023 Conference Paper

Simplifying Momentum-based Positive-definite Submanifold Optimization with Applications to Deep Learning

  • Wu Lin
  • Valentin Duruisseaux
  • Melvin Leok
  • Frank Nielsen
  • Mohammad Emtiyaz Khan
  • Mark Schmidt 0001

Riemannian submanifold optimization with momentum is computationally challenging because, to ensure that the iterates remain on the submanifold, we often need to solve difficult differential equations. Here, we simplify such difficulties for a class of structured symmetric positive-definite matrices with the affine-invariant metric. We do so by proposing a generalized version of the Riemannian normal coordinates that dynamically orthonormalizes the metric and locally converts the problem into an unconstrained problem in the Euclidean space. We use our approach to simplify existing approaches for structured covariances and develop matrix-inverse-free $2^\text{nd}$-order optimizers for deep learning in low precision settings.

JMLR Journal 2023 Journal Article

The Bayesian Learning Rule

  • Mohammad Emtiyaz Khan
  • Håvard Rue

We show that many machine-learning algorithms are specific instances of a single algorithm called the Bayesian learning rule. The rule, derived from Bayesian principles, yields a wide-range of algorithms from fields such as optimization, deep learning, and graphical models. This includes classical algorithms such as ridge regression, Newton's method, and Kalman filter, as well as modern deep-learning algorithms such as stochastic-gradient descent, RMSprop, and Dropout. The key idea in deriving such algorithms is to approximate the posterior using candidate distributions estimated by using natural gradients. Different candidate distributions result in different algorithms and further approximations to natural gradients give rise to variants of those algorithms. Our work not only unifies, generalizes, and improves existing algorithms, but also helps us design new ones.
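Schematically, in notation common to presentations of this line of work (not quoted from the paper), the rule performs natural-gradient descent on a variational objective over an exponential-family candidate $q_\lambda$ with natural parameter $\lambda$:

```latex
% Bayesian learning rule, schematic form: minimize the expected loss
% minus the entropy of q_lambda, via natural-gradient descent.
\lambda_{t+1} \;=\; \lambda_t \;-\; \rho_t\,
\widetilde{\nabla}_{\lambda}\Big(
  \mathbb{E}_{q_{\lambda_t}}\!\big[\bar{\ell}(\theta)\big]
  \;-\; \mathcal{H}(q_{\lambda_t})
\Big)
```

Different choices of the candidate family $q_\lambda$, and different approximations of the natural gradient $\widetilde{\nabla}$, are what recover the classical and deep-learning algorithms the abstract lists.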

NeurIPS Conference 2023 Conference Paper

The Memory-Perturbation Equation: Understanding Model's Sensitivity to Data

  • Peter Nickl
  • Lu Xu
  • Dharmesh Tailor
  • Thomas Möllenhoff
  • Mohammad Emtiyaz Khan

Understanding a model's sensitivity to its training data is crucial but can also be challenging and costly, especially during training. To simplify such issues, we present the Memory-Perturbation Equation (MPE) which relates a model's sensitivity to perturbations in its training data. Derived using Bayesian principles, the MPE unifies existing sensitivity measures, generalizes them to a wide variety of models and algorithms, and unravels useful properties regarding sensitivities. Our empirical results show that sensitivity estimates obtained during training can be used to faithfully predict generalization on unseen test data. The proposed equation is expected to be useful for future research on robust and adaptive learning.

NeurIPS Conference 2021 Conference Paper

Dual Parameterization of Sparse Variational Gaussian Processes

  • Vincent ADAM
  • Paul Chang
  • Mohammad Emtiyaz Khan
  • Arno Solin

Sparse variational Gaussian process (SVGP) methods are a common choice for non-conjugate Gaussian process inference because of their computational benefits. In this paper, we improve their computational efficiency by using a dual parameterization where each data example is assigned dual parameters, similarly to site parameters used in expectation propagation. Our dual parameterization speeds up inference using natural gradient descent, and provides a tighter evidence lower bound for hyperparameter learning. The approach has the same memory cost as the current SVGP methods, but it is faster and more accurate.

NeurIPS Conference 2021 Conference Paper

Knowledge-Adaptation Priors

  • Mohammad Emtiyaz Khan
  • Siddharth Swaroop

Humans and animals have a natural ability to quickly adapt to their surroundings, but machine-learning models, when subjected to changes, often require a complete retraining from scratch. We present Knowledge-adaptation priors (K-priors) to reduce the cost of retraining by enabling quick and accurate adaptation for a wide-variety of tasks and models. This is made possible by a combination of weight and function-space priors to reconstruct the gradients of the past, which recovers and generalizes many existing, but seemingly-unrelated, adaptation strategies. Training with simple first-order gradient methods can often recover the exact retrained model to an arbitrary accuracy by choosing a sufficiently large memory of the past data. Empirical results show that adaptation with K-priors achieves performance similar to full retraining, but only requires training on a handful of past examples.

ICML Conference 2021 Conference Paper

Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning

  • Alexander Immer
  • Matthias Bauer 0001
  • Vincent Fortuin
  • Gunnar Rätsch
  • Mohammad Emtiyaz Khan

Marginal-likelihood based model-selection, even though promising, is rarely used in deep learning due to estimation difficulties. Instead, most approaches rely on validation data, which may not be readily available. In this work, we present a scalable marginal-likelihood estimation method to select both hyperparameters and network architectures, based on the training data alone. Some hyperparameters can be estimated online during training, simplifying the procedure. Our marginal-likelihood estimate is based on Laplace’s method and Gauss-Newton approximations to the Hessian, and it outperforms cross-validation and manual tuning on standard regression and image classification datasets, especially in terms of calibration and out-of-distribution detection. Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable (e.g., in nonstationary settings).

UAI Conference 2021 Conference Paper

Subset-of-data variational inference for deep Gaussian-processes regression

  • Ayush Jain
  • P. K. Srijith
  • Mohammad Emtiyaz Khan

Deep Gaussian Processes (DGPs) are multi-layer, flexible extensions of Gaussian Processes, but their training remains challenging. Most existing methods for inference in DGPs use sparse approximations, which require optimization over a large number of inducing inputs and their locations across layers. In this paper, we simplify the training by setting the locations to a fixed subset of data and sampling the inducing inputs from a variational distribution. This reduces the trainable parameters and computation cost without any performance degradation, as demonstrated by our empirical results on regression data sets. Our modifications simplify and stabilize DGP training methods while making them amenable to sampling schemes such as leverage score and determinantal point processes.

ICML Conference 2021 Conference Paper

Tractable structured natural-gradient descent using local parameterizations

  • Wu Lin
  • Frank Nielsen
  • Mohammad Emtiyaz Khan
  • Mark Schmidt 0001

Natural-gradient descent (NGD) on structured parameter spaces (e.g., low-rank covariances) is computationally challenging due to difficult Fisher-matrix computations. We address this issue by using \emph{local-parameter coordinates} to obtain a flexible and efficient NGD method that works well for a wide variety of structured parameterizations. We show four applications where our method (1) generalizes the exponential natural evolutionary strategy, (2) recovers existing Newton-like algorithms, (3) yields new structured second-order algorithms, and (4) gives new algorithms to learn covariances of Gaussian and Wishart-based distributions. We show results on a range of problems from deep learning, variational inference, and evolution strategies. Our work opens a new direction for scalable structured geometric methods.

AAAI Conference 2020 Conference Paper

Beyond Unfolding: Exact Recovery of Latent Convex Tensor Decomposition Under Reshuffling

  • Chao Li
  • Mohammad Emtiyaz Khan
  • Zhun Sun
  • Gang Niu
  • Bo Han
  • Shengli Xie
  • Qibin Zhao

Exact recovery of tensor decomposition (TD) methods is a desirable property in both unsupervised learning and scientific data analysis. The numerical defects of TD methods, however, limit their practical applications on real-world data. As an alternative, convex tensor decomposition (CTD) was proposed to alleviate these problems, but its exact-recovery property has not been properly addressed so far. To this end, we focus on latent convex tensor decomposition (LCTD), a widely used CTD model in practice, and rigorously prove a sufficient condition for its exact-recovery property. Furthermore, we show that such a property can also be achieved by a more general model than LCTD. In the new model, we generalize the classic tensor (un-)folding into a reshuffling operation, a more flexible mapping to relocate the entries of the matrix into a tensor. Armed with the reshuffling operations and the exact-recovery property, we explore a totally novel application for (generalized) LCTD, i.e., image steganography. Experimental results on synthetic data validate our theory, and results on image steganography show that our method outperforms the state-of-the-art methods.

NeurIPS Conference 2020 Conference Paper

Continual Deep Learning by Functional Regularisation of Memorable Past

  • Pingbo Pan
  • Siddharth Swaroop
  • Alexander Immer
  • Runa Eschenhagen
  • Richard Turner
  • Mohammad Emtiyaz Khan

Continually learning new skills is important for intelligent systems, yet standard deep learning methods suffer from catastrophic forgetting of the past. Recent works address this with weight regularisation. Functional regularisation, although computationally expensive, is expected to perform better, but rarely does so in practice. In this paper, we fix this issue by using a new functional-regularisation approach that utilises a few memorable past examples crucial to avoid forgetting. By using a Gaussian Process formulation of deep networks, our approach enables training in weight-space while identifying both the memorable past and a functional prior. Our method achieves state-of-the-art performance on standard benchmarks and opens a new direction for life-long learning where regularisation and memory-based methods are naturally combined.

ICML Conference 2020 Conference Paper

Handling the Positive-Definite Constraint in the Bayesian Learning Rule

  • Wu Lin
  • Mark Schmidt 0001
  • Mohammad Emtiyaz Khan

The Bayesian learning rule is a natural-gradient variational inference method, which not only contains many existing learning algorithms as special cases but also enables the design of new algorithms. Unfortunately, when variational parameters lie in an open constraint set, the rule may not satisfy the constraint and requires line-searches which could slow down the algorithm. In this work, we address this issue for positive-definite constraints by proposing an improved rule that naturally handles the constraints. Our modification is obtained by using Riemannian gradient methods, and is valid when the approximation attains a block-coordinate natural parameterization (e.g., Gaussian distributions and their mixtures). Our method outperforms existing methods without any significant increase in computation. Our work makes it easier to apply the rule in the presence of positive-definite constraints in parameter spaces.

ICML Conference 2020 Conference Paper

Training Binary Neural Networks using the Bayesian Learning Rule

  • Xiangming Meng
  • Roman Bachmann 0001
  • Mohammad Emtiyaz Khan

Neural networks with binary weights are computation-efficient and hardware-friendly, but their training is challenging because it involves a discrete optimization problem. Surprisingly, ignoring the discrete nature of the problem and using gradient-based methods, such as the Straight-Through Estimator, still works well in practice. This raises the question: are there principled approaches which justify such methods? In this paper, we propose such an approach using the Bayesian learning rule. The rule, when applied to estimate a Bernoulli distribution over the binary weights, results in an algorithm which justifies some of the algorithmic choices made by the previous approaches. The algorithm not only obtains state-of-the-art performance, but also enables uncertainty estimation and continual learning to avoid catastrophic forgetting. Our work provides a principled approach for training binary neural networks which also justifies and extends existing approaches.

ICML Conference 2020 Conference Paper

Variational Imitation Learning with Diverse-quality Demonstrations

  • Voot Tangkaratt
  • Bo Han 0003
  • Mohammad Emtiyaz Khan
  • Masashi Sugiyama

Learning from demonstrations can be challenging when the quality of demonstrations is diverse, and even more so when the quality is unknown and there is no additional information to estimate the quality. We propose a new method for imitation learning in such scenarios. We show that simple quality-estimation approaches might fail due to compounding error, and fix this issue by jointly estimating both the quality and reward using a variational approach. Our method is easy to implement within reinforcement-learning frameworks and also achieves state-of-the-art performance on continuous-control benchmarks. Our work enables scalable and data-efficient imitation learning under more realistic settings than before.

NeurIPS Conference 2019 Conference Paper

Approximate Inference Turns Deep Networks into Gaussian Processes

  • Mohammad Emtiyaz Khan
  • Alexander Immer
  • Ehsan Abedi
  • Maciej Korzepa

Deep neural networks (DNN) and Gaussian processes (GP) are two powerful models with several theoretical connections relating them, but the relationship between their training methods is not well understood. In this paper, we show that certain Gaussian posterior approximations for Bayesian DNNs are equivalent to GP posteriors. This enables us to relate solutions and iterations of a deep-learning algorithm to GP inference. As a result, we can obtain a GP kernel and a nonlinear feature map while training a DNN. Surprisingly, the resulting kernel is the neural tangent kernel. We show kernels obtained on real datasets and demonstrate the use of the GP marginal likelihood to tune hyperparameters of DNNs. Our work aims to facilitate further research on combining DNNs and GPs in practical settings.

ICML Conference 2019 Conference Paper

Fast and Simple Natural-Gradient Variational Inference with Mixture of Exponential-family Approximations

  • Wu Lin
  • Mohammad Emtiyaz Khan
  • Mark Schmidt 0001

Natural-gradient methods enable fast and simple algorithms for variational inference, but due to computational difficulties, their use is mostly limited to minimal exponential-family (EF) approximations. In this paper, we extend their application to estimate structured approximations such as mixtures of EF distributions. Such approximations can fit complex, multimodal posterior distributions and are generally more accurate than unimodal EF approximations. By using a minimal conditional-EF representation of such approximations, we derive simple natural-gradient updates. Our empirical results demonstrate a faster convergence of our natural-gradient method compared to black-box gradient-based methods. Our work expands the scope of natural gradients for Bayesian inference and makes them more widely applicable than before.

NeurIPS Conference 2019 Conference Paper

Practical Deep Learning with Bayesian Principles

  • Kazuki Osawa
  • Siddharth Swaroop
  • Mohammad Emtiyaz Khan
  • Anirudh Jain
  • Runa Eschenhagen
  • Richard Turner
  • Rio Yokota

Bayesian methods promise to fix many shortcomings of deep learning, but they are impractical and rarely match the performance of standard methods, let alone improve them. In this paper, we demonstrate practical training of deep networks with natural-gradient variational inference. By applying techniques such as batch normalisation, data augmentation, and distributed training, we achieve similar performance in about the same number of epochs as the Adam optimiser, even on large datasets such as ImageNet. Importantly, the benefits of Bayesian principles are preserved: predictive probabilities are well-calibrated, uncertainties on out-of-distribution data are improved, and continual-learning performance is boosted. This work enables practical deep learning while preserving benefits of Bayesian principles. A PyTorch implementation is available as a plug-and-play optimiser.

ICML Conference 2019 Conference Paper

Scalable Training of Inference Networks for Gaussian-Process Models

  • Jiaxin Shi
  • Mohammad Emtiyaz Khan
  • Jun Zhu 0001

Inference in Gaussian process (GP) models is computationally challenging for large data, and often difficult to approximate with a small number of inducing points. We explore an alternative approximation that employs stochastic inference networks for a flexible inference. Unfortunately, for such networks, minibatch training makes it difficult to learn meaningful correlations over function outputs on large datasets. We propose an algorithm that enables such training by tracking a stochastic, functional mirror-descent algorithm. At each iteration, this only requires considering a finite number of input locations, resulting in a scalable and easy-to-implement algorithm. Empirical results show comparable and, sometimes, superior performance to existing sparse variational GP methods.

ICML Conference 2018 Conference Paper

Fast and Scalable Bayesian Deep Learning by Weight-Perturbation in Adam

  • Mohammad Emtiyaz Khan
  • Didrik Nielsen
  • Voot Tangkaratt
  • Wu Lin
  • Yarin Gal
  • Akash Srivastava

Uncertainty computation in deep learning is essential to design robust and reliable systems. Variational inference (VI) is a promising approach for such computation, but requires more effort to implement and execute compared to maximum-likelihood methods. In this paper, we propose new natural-gradient algorithms to reduce such efforts for Gaussian mean-field VI. Our algorithms can be implemented within the Adam optimizer by perturbing the network weights during gradient evaluations, and uncertainty estimates can be cheaply obtained by using the vector that adapts the learning rate. This requires lower memory, computation, and implementation effort than existing VI methods, while obtaining uncertainty estimates of comparable quality. Our empirical results confirm this and further suggest that the weight-perturbation in our algorithm could be useful for exploration in reinforcement learning and stochastic optimization.
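
As a rough sketch of the idea in this abstract (assumptions: a single scalar weight, a toy quadratic loss, and simplified hyperparameter handling — not the authors' reference implementation), weight-perturbation inside an Adam-style update looks like:

```python
import math
import random

def vadam_sketch(grad, n_data, steps=4000, lr=0.05,
                 beta1=0.9, beta2=0.999, prior_prec=1.0, seed=0):
    """Weight-perturbed Adam-style update for one scalar weight.

    grad(w): gradient of the average per-example loss at weight w.
    Returns the variational mean mu and an uncertainty estimate sigma
    read off from the same adaptive vector s that scales the step size.
    """
    rng = random.Random(seed)
    mu, m, s = 0.0, 0.0, 0.0
    lam = prior_prec / n_data                    # per-example prior precision
    for t in range(1, steps + 1):
        sigma = 1.0 / math.sqrt(n_data * (s + lam))
        w = mu + sigma * rng.gauss(0.0, 1.0)     # perturb the weight
        g = grad(w)
        m = beta1 * m + (1 - beta1) * (g + lam * mu)
        s = beta2 * s + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        s_hat = s / (1 - beta2 ** t)
        mu -= lr * m_hat / (math.sqrt(s_hat) + lam)
    return mu, 1.0 / math.sqrt(n_data * (s + lam))

# Toy problem: per-example loss 0.5 * (w - 3)^2, averaged over n_data points.
mu, sigma = vadam_sketch(lambda w: w - 3.0, n_data=100)
```

On this toy problem, mu settles near the posterior mean (close to 3), while sigma comes essentially for free from the vector that adapts the learning rate, mirroring the abstract's claim about cheap uncertainty estimates.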

NeurIPS Conference 2018 Conference Paper

SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient

  • Aaron Mishkin
  • Frederik Kunstner
  • Didrik Nielsen
  • Mark Schmidt
  • Mohammad Emtiyaz Khan

Uncertainty estimation in large deep-learning models is a computationally challenging task, where it is difficult to form even a Gaussian approximation to the posterior distribution. In such situations, existing methods usually resort to a diagonal approximation of the covariance matrix despite the fact that these matrices are known to give poor uncertainty estimates. To address this issue, we propose a new stochastic, low-rank, approximate natural-gradient (SLANG) method for variational inference in large deep models. Our method estimates a “diagonal plus low-rank” structure based solely on back-propagated gradients of the network log-likelihood. This requires strictly fewer gradient computations than methods that compute the gradient of the whole variational objective. Empirical evaluations on standard benchmarks confirm that SLANG enables faster and more accurate estimation of uncertainty than mean-field methods, and performs comparably to state-of-the-art methods.

UAI Conference 2016 Conference Paper

Faster Stochastic Variational Inference using Proximal-Gradient Methods with General Divergence Functions

  • Mohammad Emtiyaz Khan
  • Reza Babanezhad 0001
  • Wu Lin
  • Mark Schmidt 0001
  • Masashi Sugiyama

Several recent works have explored stochastic gradient methods for variational inference that exploit the geometry of the variational-parameter space. However, the theoretical properties of these methods are not well understood, and they typically apply only to conditionally-conjugate models. We present a new stochastic method for variational inference which exploits the geometry of the variational-parameter space and also yields simple closed-form updates even for non-conjugate models. We also give a convergence-rate analysis of our method and many other previous methods which exploit the geometry of the space. Our analysis generalizes existing convergence results for stochastic mirror-descent on non-convex objectives by using a more general class of divergence functions. Beyond giving a theoretical justification for a variety of recent methods, our experiments show that new algorithms derived in this framework lead to state-of-the-art results on a variety of problems. Further, due to its generality, we expect that our theoretical analysis could also apply to other applications.

NeurIPS Conference 2015 Conference Paper

Kullback-Leibler Proximal Variational Inference

  • Mohammad Emtiyaz Khan
  • Pierre Baque
  • François Fleuret
  • Pascal Fua

We propose a new variational inference method based on the Kullback-Leibler (KL) proximal term. We make two contributions towards improving efficiency of variational inference. Firstly, we derive a KL proximal-point algorithm and show its equivalence to gradient descent with natural gradient in stochastic variational inference. Secondly, we use the proximal framework to derive efficient variational algorithms for non-conjugate models. We propose a splitting procedure to separate non-conjugate terms from conjugate ones. We then linearize the non-conjugate terms and show that the resulting subproblem admits a closed-form solution. Overall, our approach converts a non-conjugate model to subproblems that involve inference in well-known conjugate models. We apply our method to many models and derive generalizations for non-conjugate exponential family. Applications to real-world datasets show that our proposed algorithms are easy to implement, fast to converge, perform well, and reduce computations.
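
In symbols (notation mine, not taken from the paper), the KL proximal-point iteration the abstract builds on can be written as:

```latex
\lambda_{t+1} \;=\; \operatorname*{arg\,max}_{\lambda}\;
  \mathcal{L}(\lambda) \;-\; \frac{1}{\beta_t}\,
  \mathrm{KL}\!\left[\, q_{\lambda}(z) \,\|\, q_{\lambda_t}(z) \,\right],
```

where $\mathcal{L}$ is the variational lower bound, $q_\lambda$ the variational approximation, and $\beta_t$ a step size; the paper's first contribution shows that, for suitable $q_\lambda$, iterating this is equivalent to natural-gradient steps in stochastic variational inference.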

NeurIPS Conference 2014 Conference Paper

Decoupled Variational Gaussian Inference

  • Mohammad Emtiyaz Khan

Variational Gaussian (VG) inference methods that optimize a lower bound to the marginal likelihood are a popular approach for Bayesian inference. These methods are fast and easy to use, while being reasonably accurate. A difficulty remains in computation of the lower bound when the latent dimensionality $L$ is large. Even though the lower bound is concave for many models, its computation requires optimization over $O(L^2)$ variational parameters. Efficient reparameterization schemes can reduce the number of parameters, but give inaccurate solutions or destroy concavity leading to slow convergence. We propose decoupled variational inference that brings the best of both worlds together. First, it maximizes a Lagrangian of the lower bound reducing the number of parameters to $O(N)$, where $N$ is the number of data examples. The reparameterization obtained is unique and recovers maxima of the lower bound even when the bound is not concave. Second, our method maximizes the lower bound using a sequence of convex problems, each of which is parallelizable over data examples and computes gradients efficiently. Overall, our approach avoids all direct computations of the covariance, only requiring its linear projections. Theoretically, our method converges at the same rate as existing methods in the case of concave lower bounds, while remaining convergent at a reasonable rate for the non-concave case.

ICML Conference 2013 Conference Paper

Fast Dual Variational Inference for Non-Conjugate Latent Gaussian Models

  • Mohammad Emtiyaz Khan
  • Aleksandr Y. Aravkin
  • Michael P. Friedlander
  • Matthias W. Seeger

Latent Gaussian models (LGMs) are widely used in statistics and machine learning. Bayesian inference in non-conjugate LGMs is difficult due to intractable integrals involving the Gaussian prior and non-conjugate likelihoods. Algorithms based on Variational Gaussian (VG) approximations are widely employed since they strike a favorable balance between accuracy, generality, speed, and ease of use. However, the structure of optimization problems associated with them remains poorly understood, and standard solvers take too long to converge. In this paper, we derive a novel dual variational inference approach, which exploits the convexity property of the VG approximations. Our approach yields an algorithm that solves a convex optimization problem, reduces the number of variational parameters, and converges much faster than previous methods. Using real-world data, we demonstrate these advantages on a variety of LGMs including Gaussian process classification and latent Gaussian Markov random fields.

NeurIPS Conference 2010 Conference Paper

Variational bounds for mixed-data factor analysis

  • Mohammad Emtiyaz Khan
  • Guillaume Bouchard
  • Kevin Murphy
  • Benjamin Marlin

We propose a new variational EM algorithm for fitting factor analysis models with mixed continuous and categorical observations. The algorithm is based on a simple quadratic bound to the log-sum-exp function. In the special case of fully observed binary data, the bound we propose is significantly faster than previous variational methods. We show that EM is significantly more robust in the presence of missing data compared to treating the latent factors as parameters, which is the approach used by exponential family PCA and other related matrix-factorization methods. A further benefit of the variational approach is that it can easily be extended to the case of mixtures of factor analyzers, as we show. We present results on synthetic and real data sets demonstrating several desirable properties of our proposed method.
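
One standard quadratic upper bound on log-sum-exp of this kind is obtained by applying a Jaakkola–Jordan-style bound coordinate-wise; the sketch below is illustrative (function names and the choice of variational parameters are mine, not necessarily the exact bound used in the paper):

```python
import math

def lam(xi):
    """Jaakkola-Jordan coefficient lambda(xi); the limit as xi -> 0 is 1/8."""
    if abs(xi) < 1e-8:
        return 0.125
    return math.tanh(xi / 2.0) / (4.0 * xi)

def log_sum_exp(x):
    m = max(x)
    return m + math.log(sum(math.exp(v - m) for v in x))

def quadratic_upper_bound(x, alpha, xi):
    """Quadratic (in x) upper bound on log-sum-exp with variational
    parameters alpha (scalar) and xi (one per coordinate)."""
    total = alpha
    for x_i, xi_i in zip(x, xi):
        t = x_i - alpha
        total += ((t - xi_i) / 2.0
                  + lam(xi_i) * (t * t - xi_i * xi_i)
                  + math.log1p(math.exp(xi_i)))
    return total

x = [0.3, -1.2, 2.0, 0.7]
alpha = sum(x) / len(x)
xi = [abs(v - alpha) for v in x]   # makes each coordinate-wise bound tight
gap = quadratic_upper_bound(x, alpha, xi) - log_sum_exp(x)
```

Because the bound is quadratic in x, it keeps the E-step of a variational EM algorithm in closed form for Gaussian latent factors, which is what makes a bound of this type attractive for the mixed-data factor analysis setting the abstract describes.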