Arrow Research search

Author name cluster

Jasper Snoek

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

28 papers
2 author rows

Possible papers

28

ICLR Conference 2025 Conference Paper

Bayesian Optimization via Continual Variational Last Layer Training

  • Paul Brunzema
  • Mikkel Jordahn
  • John Willes
  • Sebastian Trimpe
  • Jasper Snoek
  • James Harrison

Gaussian Processes (GPs) are widely seen as the state-of-the-art surrogate models for Bayesian optimization (BO) due to their ability to model uncertainty, their performance on tasks where correlations are easily captured (such as those defined by Euclidean metrics), and their ability to be efficiently updated online. However, the performance of GPs depends on the choice of kernel, and kernel selection for complex correlation structures is often difficult or must be made bespoke. While Bayesian neural networks (BNNs) are a promising direction for higher-capacity surrogate models, they have so far seen limited use due to poor performance on some problem types. In this paper, we propose an approach which shows competitive performance on many problem types, including some that BNNs typically struggle with. We build on variational Bayesian last layers (VBLLs), and connect training of these models to exact conditioning in GPs. We exploit this connection to develop an efficient online training algorithm that interleaves conditioning and optimization. Our findings suggest that VBLL networks significantly outperform GPs and other BNN architectures on tasks with complex input correlations, and match the performance of well-tuned GPs on established benchmark tasks.
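
A minimal sketch of the kind of exact online conditioning the abstract refers to, written as recursive Bayesian linear regression on last-layer features; the feature map `phi` is a hypothetical stand-in for the network's penultimate layer, and the code is not taken from the paper:

```python
import numpy as np

def phi(x):
    # hypothetical fixed feature map; in the paper this role is played by a learned network
    return np.stack([np.ones_like(x), x, np.sin(x)], axis=-1)

class BayesianLastLayer:
    def __init__(self, dim, noise_var=0.1, prior_var=1.0):
        self.S_inv = np.eye(dim) / prior_var  # posterior precision
        self.b = np.zeros(dim)                # precision-weighted mean
        self.noise_var = noise_var

    def condition(self, x, y):
        # exact rank-1 update on a single new observation (x, y)
        f = phi(x)
        self.S_inv += np.outer(f, f) / self.noise_var
        self.b += f * y / self.noise_var

    def predict(self, x):
        f = phi(x)
        S = np.linalg.inv(self.S_inv)
        return f @ S @ self.b, f @ S @ f + self.noise_var  # predictive mean, variance

layer = BayesianLastLayer(dim=3)
for x, y in [(0.1, 0.3), (0.5, 0.9), (0.9, 1.4)]:
    layer.condition(np.array(x), y)
print(layer.predict(np.array(0.7)))
```

One plausible reading of "interleaves conditioning and optimization" is alternating such closed-form updates with gradient-based refitting of the feature network, though the precise schedule is described in the paper itself.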

TMLR Journal 2024 Journal Article

Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

  • Avi Singh
  • John D Co-Reyes
  • Rishabh Agarwal
  • Ankesh Anand
  • Piyush Patil
  • Xavier Garcia
  • Peter J Liu
  • James Harrison

Fine-tuning language models (LMs) on human-generated data remains a prevalent practice. However, the performance of such models is often limited by the quantity and diversity of high-quality human data. In this paper, we explore whether we can go beyond human data on tasks where we have access to scalar feedback, for example, on math problems where one can verify correctness. To do so, we investigate a simple self-training method based on expectation-maximization, in which we (1) generate samples from the model and filter them using binary feedback, (2) fine-tune the model on these samples, and (3) repeat this process a few times. Testing on advanced MATH reasoning and APPS coding benchmarks using PaLM-2 models, we find that this method scales favorably with model size and significantly surpasses fine-tuning only on human data. Overall, our findings suggest that self-training with feedback can reduce dependence on human-generated data.
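
A minimal sketch of the generate-filter-finetune loop the abstract describes; `sample_solutions`, `is_correct`, and `finetune` are hypothetical helpers standing in for LM sampling, a binary correctness check, and supervised fine-tuning, and are not part of the paper's codebase:

```python
def self_training_loop(model, problems, sample_solutions, is_correct, finetune,
                       n_iterations=3, samples_per_problem=32):
    for _ in range(n_iterations):
        dataset = []
        # (1) generate samples from the model and keep those that pass the binary check
        for problem in problems:
            for solution in sample_solutions(model, problem, samples_per_problem):
                if is_correct(problem, solution):
                    dataset.append((problem, solution))
        # (2) fine-tune the model on the filtered samples
        model = finetune(model, dataset)
        # (3) repeat
    return model
```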

JMLR Journal 2024 Journal Article

Pre-trained Gaussian Processes for Bayesian Optimization

  • Zi Wang
  • George E. Dahl
  • Kevin Swersky
  • Chansoo Lee
  • Zachary Nado
  • Justin Gilmer
  • Jasper Snoek
  • Zoubin Ghahramani

Bayesian optimization (BO) has become a popular strategy for global optimization of expensive real-world functions. Contrary to a common expectation that BO is suited to optimizing black-box functions, it actually requires domain knowledge about those functions to deploy BO successfully. Such domain knowledge often manifests in Gaussian process (GP) priors that specify initial beliefs on functions. However, even with expert knowledge, it is non-trivial to quantitatively define a prior. This is especially true for hyperparameter tuning problems on complex machine learning models, where landscapes of tuning objectives are often difficult to comprehend. We seek an alternative practice for setting these functional priors. In particular, we consider the scenario where we have data from similar functions that allow us to pre-train a tighter distribution a priori. We detail what pre-training entails for GPs using a KL divergence based loss function, and propose a new pre-training based BO framework named HyperBO. Theoretically, we show bounded posterior predictions and near-zero regrets for HyperBO without assuming the "ground truth" GP prior is known. To verify our approach in realistic setups, we collect a large multi-task hyperparameter tuning dataset by training tens of thousands of configurations of near-state-of-the-art deep learning models on popular image and text datasets, as well as a protein sequence dataset. Our results show that on average, HyperBO is able to locate good hyperparameters at least 3 times more efficiently than the best competing methods on both our new tuning dataset and existing multi-task BO benchmarks.
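
A minimal sketch of what pre-training a GP prior on data from similar functions can look like: shared kernel hyperparameters are chosen to fit several related datasets at once. The grid search and summed log marginal likelihood below are a simple stand-in for the KL-divergence-based objective mentioned in the abstract, not the HyperBO implementation:

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale, amplitude):
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return amplitude ** 2 * np.exp(-0.5 * d2 / lengthscale ** 2)

def log_marginal_likelihood(X, y, lengthscale, amplitude, noise=1e-2):
    K = rbf_kernel(X, X, lengthscale, amplitude) + noise * np.eye(len(y))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(y) * np.log(2 * np.pi)

def pretrain_prior(tasks, grid):
    # tasks: list of (X, y) datasets from previously observed, related functions
    best, best_score = None, -np.inf
    for lengthscale, amplitude in grid:
        score = sum(log_marginal_likelihood(X, y, lengthscale, amplitude) for X, y in tasks)
        if score > best_score:
            best, best_score = (lengthscale, amplitude), score
    return best  # hyperparameters then used as the prior for subsequent BO runs

rng = np.random.default_rng(0)
tasks = [(rng.uniform(size=(20, 2)), rng.normal(size=20)) for _ in range(5)]
grid = [(l, a) for l in (0.1, 0.3, 1.0) for a in (0.5, 1.0, 2.0)]
print(pretrain_prior(tasks, grid))
```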

ICLR Conference 2024 Conference Paper

Variational Bayesian Last Layers

  • James Harrison
  • John Willes
  • Jasper Snoek

We introduce a deterministic variational formulation for training Bayesian last layer neural networks. This yields a sampling-free, single-pass model and loss that effectively improves uncertainty estimation. Our variational Bayesian last layer (VBLL) can be trained and evaluated with only quadratic complexity in last layer width, and is thus (nearly) computationally free to add to standard architectures. We experimentally investigate VBLLs, and show that they improve predictive accuracy, calibration, and out of distribution detection over baselines across both regression and classification. Finally, we investigate combining VBLL layers with variational Bayesian feature learning, yielding a lower variance collapsed variational inference method for Bayesian neural networks.

JMLR Journal 2023 Journal Article

A Simple Approach to Improve Single-Model Deep Uncertainty via Distance-Awareness

  • Jeremiah Zhe Liu
  • Shreyas Padhy
  • Jie Ren
  • Zi Lin
  • Yeming Wen
  • Ghassen Jerfel
  • Zachary Nado
  • Jasper Snoek

Accurate uncertainty quantification is a major challenge in deep learning, as neural networks can make overconfident errors and assign high confidence predictions to out-of-distribution (OOD) inputs. The most popular approaches to estimate predictive uncertainty in deep learning are methods that combine predictions from multiple neural networks, such as Bayesian neural networks (BNNs) and deep ensembles. However, their practicality in real-time, industrial-scale applications is limited due to the high memory and computational cost. Furthermore, ensembles and BNNs do not necessarily fix all the issues with the underlying member networks. In this work, we study principled approaches to improve the uncertainty property of a single network, based on a single, deterministic representation. By formalizing the uncertainty quantification as a minimax learning problem, we first identify distance awareness, i.e., the model's ability to quantify the distance of a testing example from the training data, as a necessary condition for a DNN to achieve high-quality (i.e., minimax optimal) uncertainty estimation. We then propose Spectral-normalized Neural Gaussian Process (SNGP), a simple method that improves the distance-awareness ability of modern DNNs with two simple changes: (1) applying spectral normalization to hidden weights to enforce bi-Lipschitz smoothness in representations and (2) replacing the last output layer with a Gaussian process layer. On a suite of vision and language understanding benchmarks and on modern architectures (Wide-ResNet and BERT), SNGP consistently outperforms other single-model approaches in prediction, calibration and out-of-domain detection. Furthermore, SNGP provides complementary benefits to popular techniques such as deep ensembles and data augmentation, making it a simple and scalable building block for probabilistic deep learning.
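
A minimal sketch of the two changes the abstract lists, assuming a small MLP rather than Wide-ResNet or BERT: spectral normalization on hidden layers and a random-feature approximation to a GP output layer. The covariance update used for predictive uncertainty in the full method is omitted here, and this is not the authors' implementation:

```python
import math
import torch
import torch.nn as nn

class RandomFeatureGPHead(nn.Module):
    """Random Fourier features approximating an RBF-kernel GP output layer."""
    def __init__(self, in_dim, num_features=1024, num_classes=10):
        super().__init__()
        self.register_buffer("W", torch.randn(num_features, in_dim))
        self.register_buffer("b", 2 * math.pi * torch.rand(num_features))
        self.beta = nn.Linear(num_features, num_classes, bias=False)
        self.scale = math.sqrt(2.0 / num_features)

    def forward(self, h):
        phi = self.scale * torch.cos(h @ self.W.T + self.b)
        return self.beta(phi)

class SNGPLikeModel(nn.Module):
    def __init__(self, in_dim=32, hidden=128, num_classes=10):
        super().__init__()
        # (1) spectral normalization on hidden weights
        self.body = nn.Sequential(
            nn.utils.spectral_norm(nn.Linear(in_dim, hidden)), nn.ReLU(),
            nn.utils.spectral_norm(nn.Linear(hidden, hidden)), nn.ReLU(),
        )
        # (2) GP-style layer replacing the usual dense output layer
        self.head = RandomFeatureGPHead(hidden, num_classes=num_classes)

    def forward(self, x):
        return self.head(self.body(x))

model = SNGPLikeModel()
print(model(torch.randn(4, 32)).shape)  # torch.Size([4, 10])
```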

TMLR Journal 2022 Journal Article

Deep Learning for Bayesian Optimization of Scientific Problems with High-Dimensional Structure

  • Samuel Kim
  • Peter Y Lu
  • Charlotte Loh
  • Jamie Smith
  • Jasper Snoek
  • Marin Soljacic

Bayesian optimization (BO) is a popular paradigm for global optimization of expensive black-box functions, but there are many domains where the function is not completely a black-box. The data may have some known structure (e.g.\ symmetries) and/or the data generation process may be a composite process that yields useful intermediate or auxiliary information in addition to the value of the optimization objective. However, surrogate models traditionally employed in BO, such as Gaussian Processes (GPs), scale poorly with dataset size and do not easily accommodate known structure. Instead, we use Bayesian neural networks, a class of scalable and flexible surrogate models with inductive biases, to extend BO to complex, structured problems with high dimensionality. We demonstrate BO on a number of realistic problems in physics and chemistry, including topology optimization of photonic crystal materials using convolutional neural networks, and chemical property optimization of molecules using graph neural networks. On these complex tasks, we show that neural networks often outperform GPs as surrogate models for BO in terms of both sampling efficiency and computational cost.

TMLR Journal 2022 Journal Article

Sparse MoEs meet Efficient Ensembles

  • James Urquhart Allingham
  • Florian Wenzel
  • Zelda E Mariet
  • Basil Mustafa
  • Joan Puigcerver
  • Neil Houlsby
  • Ghassen Jerfel
  • Vincent Fortuin

Machine learning models based on the aggregated outputs of submodels, either at the activation or prediction levels, often exhibit strong performance compared to individual models. We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixture of experts (sparse MoEs). First, we show that the two approaches have complementary features whose combination is beneficial. This includes a comprehensive evaluation of sparse MoEs in uncertainty related benchmarks. Then, we present efficient ensemble of experts (E$^3$), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble. Extensive experiments demonstrate the accuracy, log-likelihood, few-shot learning, robustness, and uncertainty improvements of E$^3$ over several challenging vision Transformer-based baselines. E$^3$ not only preserves its efficiency while scaling to models with up to 2.7B parameters, but also provides better predictive performance and uncertainty estimates for larger models.

ICLR Conference 2021 Conference Paper

Combining Ensembles and Data Augmentation Can Harm Your Calibration

  • Yeming Wen
  • Ghassen Jerfel
  • Rafael Muller
  • Michael W. Dusenberry
  • Jasper Snoek
  • Balaji Lakshminarayanan
  • Dustin Tran

Ensemble methods which average over multiple neural network predictions are a simple approach to improve a model’s calibration and robustness. Similarly, data augmentation techniques, which encode prior information in the form of invariant feature transformations, are effective for improving calibration and robustness. In this paper, we show a surprising pathology: combining ensembles and data augmentation can harm model calibration. This leads to a trade-off in practice, whereby improved accuracy by combining the two techniques comes at the expense of calibration. On the other hand, selecting only one of the techniques ensures good uncertainty estimates at the expense of accuracy. We investigate this pathology and identify a compounding under-confidence among methods which marginalize over sets of weights and data augmentation techniques which soften labels. Finally, we propose a simple correction, achieving the best of both worlds with significant accuracy and calibration gains over using only ensembles or data augmentation individually. Applying the correction produces a new state of the art in uncertainty calibration and robustness across CIFAR-10, CIFAR-100, and ImageNet.

ICLR Conference 2021 Conference Paper

Exploring the Uncertainty Properties of Neural Networks' Implicit Priors in the Infinite-Width Limit

  • Ben Adlam
  • Jaehoon Lee 0001
  • Lechao Xiao
  • Jeffrey Pennington
  • Jasper Snoek

Modern deep learning models have achieved great success in predictive accuracy for many data modalities. However, their application to many real-world tasks is restricted by poor uncertainty estimates, such as overconfidence on out-of-distribution (OOD) data and ungraceful failing under distributional shift. Previous benchmarks have found that ensembles of neural networks (NNs) are typically the best calibrated models on OOD data. Inspired by this, we leverage recent theoretical advances that characterize the function-space prior of an infinitely-wide NN as a Gaussian process, termed the neural network Gaussian process (NNGP). We use the NNGP with a softmax link function to build a probabilistic model for multi-class classification and marginalize over the latent Gaussian outputs to sample from the posterior. This gives us a better understanding of the implicit prior NNs place on function space and allows a direct comparison of the calibration of the NNGP and its finite-width analogue. We also examine the calibration of previous approaches to classification with the NNGP, which treat classification problems as regression to the one-hot labels. In this case the Bayesian posterior is exact, and we compare several heuristics to generate a categorical distribution over classes. We find these methods are well calibrated under distributional shift. Finally, we consider an infinite-width final layer in conjunction with a pre-trained embedding. This replicates the important practical use case of transfer learning and allows scaling to significantly larger datasets. As well as achieving competitive predictive accuracy, this approach is better calibrated than its finite width analogue.

ICLR Conference 2021 Conference Paper

Training independent subnetworks for robust prediction

  • Marton Havasi
  • Rodolphe Jenatton
  • Stanislav Fort
  • Jeremiah Zhe Liu
  • Jasper Snoek
  • Balaji Lakshminarayanan
  • Andrew M. Dai
  • Dustin Tran

Recent approaches to efficiently ensemble neural networks have shown that strong robustness and uncertainty performance can be achieved with a negligible gain in parameters over the original network. However, these methods still require multiple forward passes for prediction, leading to a significant runtime cost. In this work, we show a surprising result: the benefits of using multiple predictions can be achieved 'for free' under a single model's forward pass. In particular, we show that, using a multi-input multi-output (MIMO) configuration, one can utilize a single model's capacity to train multiple subnetworks that independently learn the task at hand. By ensembling the predictions made by the subnetworks, we improve model robustness without increasing compute. We observe a significant improvement in negative log-likelihood, accuracy, and calibration error on CIFAR10, CIFAR100, ImageNet, and their out-of-distribution variants compared to previous methods.
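
A minimal sketch of the multi-input multi-output (MIMO) configuration, assuming a plain MLP backbone rather than the larger convolutional networks used in the paper: M inputs are concatenated, a single forward pass produces M sets of logits, and the M subnetwork predictions are averaged at test time:

```python
import torch
import torch.nn as nn

class MIMOMLP(nn.Module):
    def __init__(self, in_dim, num_classes, num_members=3, hidden=256):
        super().__init__()
        self.num_members, self.num_classes = num_members, num_classes
        self.net = nn.Sequential(
            nn.Linear(in_dim * num_members, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes * num_members),
        )

    def forward(self, x):
        # x: (batch, num_members, in_dim). During training each member slot holds an
        # independently sampled example; at test time the same input is tiled M times.
        logits = self.net(x.flatten(start_dim=1))
        return logits.view(-1, self.num_members, self.num_classes)

model = MIMOMLP(in_dim=784, num_classes=10)
x_test = torch.randn(8, 1, 784).expand(-1, 3, -1)    # same input tiled for all members
member_logits = model(x_test)                        # (8, 3, 10): one prediction per subnetwork
ensembled = member_logits.softmax(dim=-1).mean(dim=1)
print(ensembled.shape)                               # torch.Size([8, 10])
```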

NeurIPS Conference 2020 Conference Paper

A Spectral Energy Distance for Parallel Speech Synthesis

  • Alexey Gritsenko
  • Tim Salimans
  • Rianne van den Berg
  • Jasper Snoek
  • Nal Kalchbrenner

Speech synthesis is an important practical generative modeling problem that has seen great progress over the last few years, with likelihood-based autoregressive neural models now outperforming traditional concatenative systems. A downside of such autoregressive models is that they require executing tens of thousands of sequential operations per second of generated audio, making them ill-suited for deployment on specialized deep learning hardware. Here, we propose a new learning method that allows us to train highly parallel models of speech, without requiring access to an analytical likelihood function. Our approach is based on a generalized energy distance between the distributions of the generated and real audio. This spectral energy distance is a proper scoring rule with respect to the distribution over magnitude-spectrograms of the generated waveform audio and offers statistical consistency guarantees. The distance can be calculated from minibatches without bias, and does not involve adversarial learning, yielding a stable and consistent method for training implicit generative models. Empirically, we achieve state-of-the-art generation quality among implicit generative models, as judged by the recently-proposed cFDSD metric. When combining our method with adversarial techniques, we also improve upon the recently-proposed GAN-TTS model in terms of Mean Opinion Score as judged by trained human evaluators.

ICML Conference 2020 Conference Paper

Efficient and Scalable Bayesian Neural Nets with Rank-1 Factors

  • Michael W. Dusenberry
  • Ghassen Jerfel
  • Yeming Wen
  • Yian Ma
  • Jasper Snoek
  • Katherine A. Heller
  • Balaji Lakshminarayanan
  • Dustin Tran

Bayesian neural networks (BNNs) demonstrate promising success in improving the robustness and uncertainty quantification of modern deep learning. However, they generally struggle with underfitting at scale and parameter efficiency. On the other hand, deep ensembles have emerged as alternatives for uncertainty quantification that, while outperforming BNNs on certain problems, also suffer from efficiency issues. It remains unclear how to combine the strengths of these two approaches and remediate their common issues. To tackle this challenge, we propose a rank-1 parameterization of BNNs, where each weight matrix involves only a distribution on a rank-1 subspace. We also revisit the use of mixture approximate posteriors to capture multiple modes, where unlike typical mixtures, this approach admits a significantly smaller memory increase (e.g., only a 0.4% increase for a ResNet-50 mixture of size 10). We perform a systematic empirical study on the choices of prior, variational posterior, and methods to improve training. For ResNet-50 on ImageNet, Wide ResNet 28-10 on CIFAR-10/100, and an RNN on MIMIC-III, rank-1 BNNs achieve state-of-the-art performance across log-likelihood, accuracy, and calibration on the test sets and out-of-distribution variants.
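
A minimal sketch of the rank-1 parameterization for a single linear layer: a shared deterministic weight matrix is modulated elementwise by the outer product of two vectors, and only those vectors carry a Gaussian posterior. Parameter names and initializations are illustrative, not the authors':

```python
import torch
import torch.nn as nn

class Rank1Linear(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Parameter(0.05 * torch.randn(out_dim, in_dim))  # shared weights
        self.r_mu = nn.Parameter(torch.ones(out_dim))
        self.r_logstd = nn.Parameter(torch.full((out_dim,), -3.0))
        self.s_mu = nn.Parameter(torch.ones(in_dim))
        self.s_logstd = nn.Parameter(torch.full((in_dim,), -3.0))

    def forward(self, x):
        # sample the rank-1 factors; only these carry a distribution
        r = self.r_mu + self.r_logstd.exp() * torch.randn_like(self.r_mu)
        s = self.s_mu + self.s_logstd.exp() * torch.randn_like(self.s_mu)
        # effective weight: W * (r s^T), a rank-1 multiplicative perturbation
        return x @ (self.W * torch.outer(r, s)).T

layer = Rank1Linear(16, 4)
print(layer(torch.randn(2, 16)).shape)  # torch.Size([2, 4]); each call resamples r and s
```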

ICML Conference 2020 Conference Paper

How Good is the Bayes Posterior in Deep Neural Networks Really?

  • Florian Wenzel
  • Kevin Roth
  • Bastiaan S. Veeling
  • Jakub Swiatkowski
  • Linh Tran
  • Stephan Mandt
  • Jasper Snoek
  • Tim Salimans

During the past five years the Bayesian deep learning community has developed increasingly accurate and efficient approximate inference procedures that allow for Bayesian inference in deep neural networks. However, despite this algorithmic progress and the promise of improved uncertainty quantification and sample efficiency there are—as of early 2020—no publicized deployments of Bayesian neural networks in industrial practice. In this work we cast doubt on the current understanding of Bayes posteriors in popular deep neural networks: we demonstrate through careful MCMC sampling that the posterior predictive induced by the Bayes posterior yields systematically worse predictions when compared to simpler methods including point estimates obtained from SGD. Furthermore, we demonstrate that predictive performance is improved significantly through the use of a “cold posterior” that overcounts evidence. Such cold posteriors sharply deviate from the Bayesian paradigm but are commonly used as heuristic in Bayesian deep learning papers. We put forward several hypotheses that could explain cold posteriors and evaluate the hypotheses through experiments. Our work questions the goal of accurate posterior approximations in Bayesian deep learning: If the true Bayes posterior is poor, what is the use of more accurate approximations? Instead, we argue that it is timely to focus on understanding the origin of cold posteriors.
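
The "cold posterior" the abstract refers to is usually written as a tempered posterior; a standard formulation (notation mine, consistent with the description above) is:

```latex
p_T(\theta \mid \mathcal{D}) \;\propto\; \exp\!\big(-U(\theta)/T\big),
\qquad
U(\theta) \;=\; -\sum_{i=1}^{n} \log p(y_i \mid x_i, \theta) \;-\; \log p(\theta).
```

Setting T = 1 recovers the Bayes posterior, while T < 1 sharpens it and effectively overcounts the evidence, which is the regime the abstract reports as giving better predictive performance.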

NeurIPS Conference 2020 Conference Paper

Hyperparameter Ensembles for Robustness and Uncertainty Quantification

  • Florian Wenzel
  • Jasper Snoek
  • Dustin Tran
  • Rodolphe Jenatton

Ensembles over neural network weights trained from different random initialization, known as deep ensembles, achieve state-of-the-art accuracy and calibration. The recently introduced batch ensembles provide a drop-in replacement that is more parameter efficient. In this paper, we design ensembles not only over weights, but over hyperparameters to improve the state of the art in both settings. For best performance independent of budget, we propose hyper-deep ensembles, a simple procedure that involves a random search over different hyperparameters, themselves stratified across multiple random initializations. Its strong performance highlights the benefit of combining models with both weight and hyperparameter diversity. We further propose a parameter efficient version, hyper-batch ensembles, which builds on the layer structure of batch ensembles and self-tuning networks. The computational and memory costs of our method are notably lower than typical ensembles. On image classification tasks, with MLP, LeNet, ResNet 20 and Wide ResNet 28-10 architectures, we improve upon both deep and batch ensembles.

ICML Conference 2020 Conference Paper

The k-tied Normal Distribution: A Compact Parameterization of Gaussian Mean Field Posteriors in Bayesian Neural Networks

  • Jakub Swiatkowski
  • Kevin Roth
  • Bastiaan S. Veeling
  • Linh Tran
  • Joshua V. Dillon
  • Jasper Snoek
  • Stephan Mandt
  • Tim Salimans

Variational Bayesian Inference is a popular methodology for approximating posterior distributions over Bayesian neural network weights. Recent work developing this class of methods has explored ever richer parameterizations of the approximate posterior in the hope of improving performance. In contrast, here we share a curious experimental finding that suggests instead restricting the variational distribution to a more compact parameterization. For a variety of deep Bayesian neural networks trained using Gaussian mean-field variational inference, we find that the posterior standard deviations consistently exhibit strong low-rank structure after convergence. This means that by decomposing these variational parameters into a low-rank factorization, we can make our variational approximation more compact without decreasing the models’ performance. Furthermore, we find that such factorized parameterizations improve the signal-to-noise ratio of stochastic gradient estimates of the variational lower bound, resulting in faster convergence.
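
A minimal sketch of the compact parameterization described above: rather than storing a free standard deviation for every weight, the matrix of posterior standard deviations is tied as a rank-k product of two small non-negative factors (shapes and values are illustrative):

```python
import numpy as np

out_dim, in_dim, k = 256, 512, 2
rng = np.random.default_rng(0)
U = rng.uniform(0.01, 0.1, size=(out_dim, k))  # per-output-unit factors
V = rng.uniform(0.01, 0.1, size=(k, in_dim))   # per-input-unit factors
sigma = U @ V                                  # rank-k matrix of posterior std deviations

full = out_dim * in_dim
tied = k * (out_dim + in_dim)
print(sigma.shape, f"{tied} std-deviation parameters instead of {full}")
```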

NeurIPS Conference 2019 Conference Paper

Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift

  • Yaniv Ovadia
  • Emily Fertig
  • Jie Ren
  • Zachary Nado
  • D. Sculley
  • Sebastian Nowozin
  • Joshua Dillon
  • Balaji Lakshminarayanan

Modern machine learning methods including deep learning have achieved great success in predictive accuracy for supervised learning tasks, but may still fall short in giving useful estimates of their predictive uncertainty. Quantifying uncertainty is especially critical in real-world settings, which often involve input distributions that are shifted from the training distribution due to a variety of factors including sample bias and non-stationarity. In such settings, well calibrated uncertainty estimates convey information about when a model's output should (or should not) be trusted. Many probabilistic deep learning methods, including Bayesian and non-Bayesian methods, have been proposed in the literature for quantifying predictive uncertainty, but to our knowledge there has not previously been a rigorous large-scale empirical comparison of these methods under dataset shift. We present a large-scale benchmark of existing state-of-the-art methods on classification problems and investigate the effect of dataset shift on accuracy and calibration. We find that traditional post-hoc calibration does indeed fall short, as do several other previous methods. However, some methods that marginalize over models give surprisingly strong results across a broad spectrum of tasks.

NeurIPS Conference 2019 Conference Paper

DppNet: Approximating Determinantal Point Processes with Deep Networks

  • Zelda Mariet
  • Yaniv Ovadia
  • Jasper Snoek

Determinantal point processes (DPPs) provide an elegant and versatile way to sample sets of items that balance the point-wise quality with the set-wise diversity of selected items. For this reason, they have gained prominence in many machine learning applications that rely on subset selection. However, sampling from a DPP over a ground set of size N is a costly operation, requiring in general an O(N^3) preprocessing cost and an O(Nk^3) sampling cost for subsets of size k. We approach this problem by introducing DppNets: generative deep models that produce DPP-like samples for arbitrary ground sets. We develop an inhibitive attention mechanism based on transformer networks that captures a notion of dissimilarity between feature vectors. We show theoretically that such an approximation is sensible as it maintains the guarantees of inhibition or dissimilarity that makes DPPs so powerful and unique. Empirically, we show across multiple datasets that DPPNET is orders of magnitude faster than competing approaches for DPP sampling, while generating high-likelihood samples and performing as well as DPPs on downstream tasks.

NeurIPS Conference 2019 Conference Paper

Likelihood Ratios for Out-of-Distribution Detection

  • Jie Ren
  • Peter Liu
  • Emily Fertig
  • Jasper Snoek
  • Ryan Poplin
  • Mark Depristo
  • Joshua Dillon
  • Balaji Lakshminarayanan

Discriminative neural networks offer little or no performance guarantees when deployed on data not generated by the same process as the training distribution. On such out-of-distribution (OOD) inputs, the prediction may not only be erroneous, but confidently so, limiting the safe deployment of classifiers in real-world applications. One such challenging application is bacteria identification based on genomic sequences, which holds the promise of early detection of diseases, but requires a model that can output low confidence predictions on OOD genomic sequences from new bacteria that were not present in the training data. We introduce a genomics dataset for OOD detection that allows other researchers to benchmark progress on this important problem. We investigate deep generative model based approaches for OOD detection and observe that the likelihood score is heavily affected by population level background statistics. We propose a likelihood ratio method for deep generative models which effectively corrects for these confounding background statistics. We benchmark the OOD detection performance of the proposed method against existing approaches on the genomics dataset and show that our method achieves state-of-the-art performance. Finally, we demonstrate the generality of the proposed method by showing that it significantly improves OOD detection when applied to deep generative models of images.
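
A minimal sketch of the likelihood-ratio score: an input is ranked by the gap between its log-likelihood under a model of the full data and under a "background" model intended to capture population-level statistics. The two log-density functions below are toy stand-ins, not trained deep generative models:

```python
import numpy as np

def likelihood_ratio_score(x, log_p_full, log_p_background):
    # higher scores indicate the input is explained by semantic content
    # rather than by background statistics alone
    return log_p_full(x) - log_p_background(x)

log_p_full = lambda x: -0.5 * np.sum((x - 1.0) ** 2)  # toy "full" model
log_p_background = lambda x: -0.5 * np.sum(x ** 2)    # toy "background" model
print(likelihood_ratio_score(np.array([0.9, 1.1]), log_p_full, log_p_background))
```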

JBHI Journal 2017 Journal Article

Unobtrusive Detection of Mild Cognitive Impairment in Older Adults Through Home Monitoring

  • Ahmad Akl
  • Jasper Snoek
  • Alex Mihailidis

The early detection of dementias such as Alzheimer's disease can in some cases reverse, stop, or slow cognitive decline and in general greatly reduce the burden of care. This is of increasing significance as demographic studies are warning of an aging population in North America and worldwide. Various smart homes and systems have been developed to detect cognitive decline through continuous monitoring of high-risk individuals. However, the majority of these smart homes and systems use a number of predefined heuristics to detect changes in cognition, which has been demonstrated to focus on the idiosyncratic nuances of the individual subjects, and thus, does not generalize. In this paper, we address this problem by building generalized linear models of home activity of older adults monitored using unobtrusive sensing technologies. We use inhomogeneous Poisson processes to model the presence of the recruited older adults within different rooms throughout the day. We employ an information theoretic approach to compare the generalized linear models learned, and we observe significant statistical differences between the cognitively intact and impaired older adults. Using a simple thresholding approach, we were able to detect mild cognitive impairment in older adults with an average area under the ROC curve of 0.716 and an average area under the precision-recall curve of 0.706 using activity models estimated over a time window of 12 weeks.

ICML Conference 2015 Conference Paper

Scalable Bayesian Optimization Using Deep Neural Networks

  • Jasper Snoek
  • Oren Rippel
  • Kevin Swersky
  • Jamie Kiros
  • Nadathur Satish
  • Narayanan Sundaram
  • Md. Mostofa Ali Patwary
  • Prabhat

Bayesian optimization is an effective methodology for the global optimization of functions with expensive evaluations. It relies on querying a distribution over functions defined by a relatively cheap surrogate model. An accurate model for this distribution over functions is critical to the effectiveness of the approach, and is typically fit using Gaussian processes (GPs). However, since GPs scale cubically with the number of observations, it has been challenging to handle objectives whose optimization requires many evaluations, and as such, massively parallelizing the optimization. In this work, we explore the use of neural networks as an alternative to GPs to model distributions over functions. We show that performing adaptive basis function regression with a neural network as the parametric form performs competitively with state-of-the-art GP-based approaches, but scales linearly with the number of data rather than cubically. This allows us to achieve a previously intractable degree of parallelism, which we apply to large scale hyperparameter optimization, rapidly finding competitive models on benchmark object recognition tasks using convolutional networks, and image caption generation using neural language models.

NeurIPS Conference 2015 Conference Paper

Spectral Representations for Convolutional Neural Networks

  • Oren Rippel
  • Jasper Snoek
  • Ryan Adams

Discrete Fourier transforms provide a significant speedup in the computation of convolutions in deep learning. In this work, we demonstrate that, beyond its advantages for efficient computation, the spectral domain also provides a powerful representation in which to model and train convolutional neural networks (CNNs). We employ spectral representations to introduce a number of innovations to CNN design. First, we propose spectral pooling, which performs dimensionality reduction by truncating the representation in the frequency domain. This approach preserves considerably more information per parameter than other pooling strategies and enables flexibility in the choice of pooling output dimensionality. This representation also enables a new form of stochastic regularization by randomized modification of resolution. We show that these methods achieve competitive results on classification and approximation tasks, without using any dropout or max-pooling. Finally, we demonstrate the effectiveness of complex-coefficient spectral parameterization of convolutional filters. While this leaves the underlying model unchanged, it results in a representation that greatly facilitates optimization. We observe on a variety of popular CNN configurations that this leads to significantly faster convergence during training.
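
A minimal sketch of spectral pooling on a single-channel array: transform to the frequency domain, keep only the lowest frequencies, and transform back. This illustrates the dimensionality reduction described in the abstract; normalization conventions are simplified and the code is not the authors':

```python
import numpy as np

def spectral_pool(image, out_h, out_w):
    freq = np.fft.fftshift(np.fft.fft2(image))           # low frequencies moved to the center
    h, w = freq.shape
    top, left = (h - out_h) // 2, (w - out_w) // 2
    cropped = freq[top:top + out_h, left:left + out_w]    # truncate high frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(cropped)))

image = np.random.default_rng(0).normal(size=(32, 32))
print(spectral_pool(image, 8, 8).shape)  # (8, 8): arbitrary output size, not tied to a stride
```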

UAI Conference 2014 Conference Paper

Bayesian Optimization with Unknown Constraints

  • Michael A. Gelbart
  • Jasper Snoek
  • Ryan P. Adams

Recent work on Bayesian optimization has shown its effectiveness in global optimization of difficult black-box objective functions. Many real-world optimization problems of interest also have constraints which are unknown a priori. In this paper, we study Bayesian optimization for constrained problems in the general case that noise may be present in the constraint functions, and the objective and constraints may be evaluated independently. We provide motivating practical examples, and present a general framework to solve such problems. We demonstrate the effectiveness of our approach on optimizing the performance of online latent Dirichlet allocation subject to topic sparsity constraints, tuning a neural network given test-time memory constraints, and optimizing Hamiltonian Monte Carlo to achieve maximal effectiveness in a fixed time, subject to passing standard convergence diagnostics.

JBHI Journal 2014 Journal Article

Data Mining in Bone Marrow Transplant Records to Identify Patients With High Odds of Survival

  • Babak Taati
  • Jasper Snoek
  • Dionne Aleman
  • Ardeshir Ghavamzadeh

Patients undergoing a bone marrow stem cell transplant (BMT) face various risk factors. Analyzing data from past transplants could enhance the understanding of the factors influencing success. Records of up to 120 measurements per transplant procedure were collected from 1751 patients undergoing BMT (Shariati Hospital). Collaborative filtering techniques allowed the processing of highly sparse records with 22.3% missing values. Ten-fold cross-validation was used to evaluate the performance of various classification algorithms trained on predicting the survival status. Modest accuracy levels were obtained in predicting the survival status (AUC = 0.69). More importantly, however, operations that had the highest chances of success were shown to be identifiable with high accuracy, e.g., 92% or 97% when identifying 74 or 31 recipients, respectively. Identifying the patients with the highest chances of survival has direct application in the prioritization of resources and in donor matching. For patients where high-confidence prediction is not achieved, assigning a probability to their survival odds has potential applications in probabilistic decision support systems and in combination with other sources of information.

ICML Conference 2014 Conference Paper

Input Warping for Bayesian Optimization of Non-Stationary Functions

  • Jasper Snoek
  • Kevin Swersky
  • Richard S. Zemel
  • Ryan P. Adams

Bayesian optimization has proven to be a highly effective methodology for the global optimization of unknown, expensive and multimodal functions. The ability to accurately model distributions over functions is critical to the effectiveness of Bayesian optimization. Although Gaussian processes provide a flexible prior over functions, there are various classes of functions that remain difficult to model. One of the most frequently occurring of these is the class of non-stationary functions. The optimization of the hyperparameters of machine learning algorithms is a problem domain in which parameters are often manually transformed a priori, for example by optimizing in "log-space", to mitigate the effects of spatially-varying length scale. We develop a methodology for automatically learning a wide family of bijective transformations or warpings of the input space using the Beta cumulative distribution function. We further extend the warping framework to multi-task Bayesian optimization so that multiple tasks can be warped into a jointly stationary space. On a set of challenging benchmark optimization tasks, we observe that the inclusion of warping greatly improves on the state-of-the-art, producing better results faster and more reliably.
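
A minimal sketch of the warping itself: each input dimension, rescaled to [0, 1], is passed through a Beta CDF whose two shape parameters become additional hyperparameters of the surrogate model; the parameter values below are illustrative:

```python
import numpy as np
from scipy.stats import beta

def warp_inputs(X, alphas, betas):
    # X in [0, 1]^d, with one (alpha, beta) pair per dimension
    return np.stack(
        [beta.cdf(X[:, j], alphas[j], betas[j]) for j in range(X.shape[1])], axis=1
    )

rng = np.random.default_rng(0)
X = rng.uniform(size=(5, 2))
# different (alpha, beta) pairs give concave or convex warps of each dimension
print(warp_inputs(X, alphas=[0.5, 2.0], betas=[2.0, 0.5]))
```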

NeurIPS Conference 2013 Conference Paper

A Determinantal Point Process Latent Variable Model for Inhibition in Neural Spiking Data

  • Jasper Snoek
  • Richard Zemel
  • Ryan Adams

Point processes are popular models of neural spiking behavior as they provide a statistical distribution over temporal sequences of spikes and help to reveal the complexities underlying a series of recorded action potentials. However, the most common neural point process models, the Poisson process and the gamma renewal process, do not capture interactions and correlations that are critical to modeling populations of neurons. We develop a novel model based on a determinantal point process over latent embeddings of neurons that effectively captures and helps visualize complex inhibitory and competitive interaction. We show that this model is a natural extension of the popular generalized linear model to sets of interacting neurons. The model is extended to incorporate gain control or divisive normalization, and the modulation of neural spiking based on periodic phenomena. Applied to neural spike recordings from the rat hippocampus, we see that the model captures inhibitory relationships, a dichotomy of classes of neurons, and a periodic modulation by the theta rhythm known to be present in the data.

NeurIPS Conference 2013 Conference Paper

Multi-Task Bayesian Optimization

  • Kevin Swersky
  • Jasper Snoek
  • Ryan Adams

Bayesian optimization has recently been proposed as a framework for automatically tuning the hyperparameters of machine learning models and has been shown to yield state-of-the-art performance with impressive ease and efficiency. In this paper, we explore whether it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently. Our approach is based on extending multi-task Gaussian processes to the framework of Bayesian optimization. We show that this method significantly speeds up the optimization process when compared to the standard single-task approach. We further propose a straightforward extension of our algorithm in order to jointly minimize the average error across multiple tasks and demonstrate how this can be used to greatly speed up $k$-fold cross-validation. Lastly, our most significant contribution is an adaptation of a recently proposed acquisition function, entropy search, to the cost-sensitive and multi-task settings. We demonstrate the utility of this new acquisition function by utilizing a small dataset in order to explore hyperparameter settings for a large dataset. Our algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost.

JMLR Journal 2012 Journal Article

Nonparametric Guidance of Autoencoder Representations using Label Information

  • Jasper Snoek
  • Ryan P. Adams
  • Hugo Larochelle

While unsupervised learning has long been useful for density modeling, exploratory data analysis and visualization, it has become increasingly important for discovering features that will later be used for discriminative tasks. Discriminative algorithms often work best with highly-informative features; remarkably, such features can often be learned without the labels. One particularly effective way to perform such unsupervised learning has been to use autoencoder neural networks, which find latent representations that are constrained but nevertheless informative for reconstruction. However, pure unsupervised learning with autoencoders can find representations that may or may not be useful for the ultimate discriminative task. It is a continuing challenge to guide the training of an autoencoder so that it finds features which will be useful for predicting labels. Similarly, we often have a priori information regarding what statistical variation will be irrelevant to the ultimate discriminative task, and we would like to be able to use this for guidance as well. Although a typical strategy would be to include a parametric discriminative model as part of the autoencoder training, here we propose a nonparametric approach that uses a Gaussian process to guide the representation. By using a nonparametric model, we can ensure that a useful discriminative function exists for a given set of features, without explicitly instantiating it. We demonstrate the superiority of this guidance mechanism on four data sets, including a real-world application to rehabilitation research. We also show how our proposed approach can learn to explicitly ignore statistically significant covariate information that is label-irrelevant, by evaluating on the small NORB image recognition problem in which pose and lighting labels are available.

NeurIPS Conference 2012 Conference Paper

Practical Bayesian Optimization of Machine Learning Algorithms

  • Jasper Snoek
  • Hugo Larochelle
  • Ryan Adams

The use of machine learning algorithms frequently involves careful tuning of learning parameters and model hyperparameters. Unfortunately, this tuning is often a “black art” requiring expert experience, rules of thumb, or sometimes brute-force search. There is therefore great appeal for automatic approaches that can optimize the performance of any given learning algorithm to the problem at hand. In this work, we consider this problem through the framework of Bayesian optimization, in which a learning algorithm’s generalization performance is modeled as a sample from a Gaussian process (GP). We show that certain choices for the nature of the GP, such as the type of kernel and the treatment of its hyperparameters, can play a crucial role in obtaining a good optimizer that can achieve expert-level performance. We describe new algorithms that take into account the variable cost (duration) of learning algorithm experiments and that can leverage the presence of multiple cores for parallel experimentation. We show that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms including Latent Dirichlet Allocation, Structured SVMs and convolutional neural networks.
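
A minimal sketch of one standard acquisition function used in this setting, expected improvement under the GP posterior (written for minimization); `mu` and `sigma` are the posterior mean and standard deviation at a candidate point and `best` is the incumbent value:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best):
    z = (best - mu) / sigma
    return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

print(expected_improvement(mu=0.4, sigma=0.2, best=0.5))  # score for one candidate point
```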