Arrow Research search

Author name cluster

Lawrence Saul

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

26 papers
1 author row

Possible papers

26

NeurIPS Conference 2025 Conference Paper

CosmoBench: A Multiscale, Multiview, Multitask Cosmology Benchmark for Geometric Deep Learning

  • Teresa Huang
  • Richard Stiskalek
  • Jun-Young Lee
  • Adrian Bayer
  • Charles Margossian
  • Christian Kragh Jespersen
  • Lucia Perez
  • Lawrence Saul

Cosmological simulations provide a wealth of data in the form of point clouds and directed trees. A crucial goal is to extract insights from this data that shed light on the nature and composition of the Universe. In this paper we introduce CosmoBench, a benchmark dataset curated from state-of-the-art cosmological simulations whose runs required more than 41 million core-hours and generated over two petabytes of data. CosmoBench is the largest dataset of its kind: it contains 34 thousand point clouds from simulations of dark matter halos and galaxies at three different length scales, as well as 25 thousand directed trees that record the formation history of halos on two different time scales. The data in CosmoBench can be used for multiple tasks---to predict cosmological parameters from point clouds and merger trees, to predict the velocities of individual halos and galaxies from their collective positions, and to reconstruct merger trees on finer time scales from those on coarser time scales. We provide multiple baselines on these tasks, some based on established approaches from cosmological modeling and others rooted in machine learning. For the latter, we study different approaches---from simple linear models that are minimally constrained by symmetries to much larger and more computationally-demanding models in deep learning, such as graph neural networks. We find that least-squares fits with a handful of invariant features sometimes outperform deep architectures with many more parameters and far longer training times. Still there remains tremendous potential to improve these baselines by combining machine learning and cosmological modeling in a more principled way, one that fully exploits the structure in the data. CosmoBench sets the stage for bridging cosmology and geometric deep learning at scale. We invite the community to push the frontier of scientific discovery by engaging with this challenging, high-impact dataset. The data and code are available at this URL.
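
The least-squares baselines mentioned above can be illustrated with a small sketch (this is not CosmoBench's released code): each point cloud is summarized by a few translation- and rotation-invariant statistics, and a single cosmological parameter is then fit by ordinary least squares. The feature set, the synthetic point clouds, and the target are all hypothetical stand-ins.

# A minimal sketch of a least-squares baseline on invariant point-cloud features.
# Features, data, and target below are illustrative, not CosmoBench's.
import numpy as np

def invariant_features(points):
    """Simple translation/rotation-invariant summaries of an (N, 3) point cloud."""
    centered = points - points.mean(axis=0)
    radii = np.linalg.norm(centered, axis=1)
    return np.array([radii.mean(), radii.std(), np.median(radii)])

rng = np.random.default_rng(0)
n_clouds, n_points = 200, 500
targets = rng.uniform(0.1, 0.5, size=n_clouds)            # stand-in cosmological parameter
clouds = [rng.normal(scale=1.0 + t, size=(n_points, 3)) for t in targets]

X = np.stack([invariant_features(c) for c in clouds])
X = np.hstack([X, np.ones((n_clouds, 1))])                 # bias column
coef, *_ = np.linalg.lstsq(X, targets, rcond=None)         # ordinary least squares
print("RMSE:", np.sqrt(np.mean((X @ coef - targets) ** 2)))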

NeurIPS Conference 2025 Conference Paper

Fisher meets Feynman: score-based variational inference with a product of experts

  • Diana Cai
  • Robert Gower
  • David Blei
  • Lawrence Saul

We introduce a highly expressive yet distinctly tractable family for black-box variational inference (BBVI). Each member of this family is a weighted product of experts (PoE), and each weighted expert in the product is proportional to a multivariate $t$-distribution. These products of experts can model distributions with skew, heavy tails, and multiple modes, but to use them for BBVI, we must be able to sample from their densities. We show how to do this by reformulating these products of experts as latent variable models with auxiliary Dirichlet random variables. These Dirichlet variables emerge from a Feynman identity, originally developed for loop integrals in quantum field theory, that expresses the product of multiple fractions (or in our case, $t$-distributions) as an integral over the simplex. We leverage this simplicial latent space to draw weighted samples from these products of experts---samples which BBVI then uses to find the PoE that best approximates a target density. Given a collection of experts, we derive an iterative procedure to optimize the exponents that determine their geometric weighting in the PoE. At each iteration, this procedure minimizes a regularized Fisher divergence to match the scores of the variational and target densities at a batch of samples drawn from the current approximation. This minimization reduces to a convex quadratic program, and we prove under general conditions that these updates converge exponentially fast to a near-optimal weighting of experts. We conclude by evaluating this approach on a variety of synthetic and real-world target distributions.
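
A minimal sketch of the variational family described above, assuming only what the abstract states: the unnormalized log-density of a weighted product of multivariate t experts and its score. The Dirichlet-based sampler and the convex quadratic program over the weights are not reproduced, and the expert parameters below are illustrative.

# Weighted product of multivariate t experts: unnormalized log-density and score.
# Parameters are illustrative; the paper's sampler and weight optimization are omitted.
import numpy as np
from scipy.stats import multivariate_t

def poe_logpdf_unnorm(x, experts, alphas):
    """log q(x) + const = sum_k alpha_k * log t_k(x)."""
    return sum(a * e.logpdf(x) for e, a in zip(experts, alphas))

def poe_score(x, experts, alphas, eps=1e-5):
    """Numerical gradient of the unnormalized log-density (the score)."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        dx = np.zeros_like(x, dtype=float)
        dx[i] = eps
        grad[i] = (poe_logpdf_unnorm(x + dx, experts, alphas)
                   - poe_logpdf_unnorm(x - dx, experts, alphas)) / (2 * eps)
    return grad

experts = [multivariate_t(loc=[0.0, 0.0], shape=np.eye(2), df=3),
           multivariate_t(loc=[2.0, -1.0], shape=0.5 * np.eye(2), df=5)]
alphas = [0.7, 0.3]                      # geometric weights of the experts
x = np.array([1.0, 0.5])
print(poe_logpdf_unnorm(x, experts, alphas), poe_score(x, experts, alphas))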

NeurIPS Conference 2023 Conference Paper

Variational Inference with Gaussian Score Matching

  • Chirag Modi
  • Robert Gower
  • Charles Margossian
  • Yuling Yao
  • David Blei
  • Lawrence Saul

Variational inference (VI) is a method to approximate the computationally intractable posterior distributions that arise in Bayesian statistics. Typically, VI fits a simple parametric distribution to be close to the target posterior, optimizing an appropriate objective such as the evidence lower bound (ELBO). In this work, we present a new approach to VI. Our method is based on the principle of score matching---namely, that if two distributions are equal then their score functions (i.e., gradients of the log density) are equal at every point on their support. With this principle, we develop score-matching VI, an iterative algorithm that seeks to match the scores between the variational approximation and the exact posterior. At each iteration, score-matching VI solves an inner optimization, one that minimally adjusts the current variational estimate to match the scores at a newly sampled value of the latent variables. We show that when the variational family is a Gaussian, this inner optimization enjoys a closed-form solution, which we call Gaussian score matching VI (GSM-VI). GSM-VI is a "black box" variational algorithm in that it only requires a differentiable joint distribution, and as such it can be applied to a wide class of models. We compare GSM-VI to black box variational inference (BBVI), which has similar requirements but instead optimizes the ELBO. We first study how GSM-VI behaves as a function of the problem dimensionality, the condition number of the target covariance matrix (when the target is Gaussian), and the degree of mismatch between the approximating and exact posterior distribution. We then study GSM-VI on a collection of real-world Bayesian inference problems from the posteriorDB database of datasets and models. We find that GSM-VI is faster than BBVI and equally or more accurate. Specifically, over a wide range of target posteriors, GSM-VI requires 10-100x fewer gradient evaluations than BBVI to obtain a comparable quality of approximation.
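
To illustrate the score-matching principle (this is not the paper's closed-form GSM-VI update), the sketch below fits a Gaussian q = N(m, P^-1) so that its score, -P(z - m), agrees with a target score at sampled points in a least-squares sense. The target density and sample points are illustrative assumptions.

# Fit a Gaussian by matching scores at sampled points (a sketch, not GSM-VI itself).
import numpy as np

def target_score(z):
    # Score of an illustrative Gaussian target N(mu, diag(2, 0.5)).
    mu = np.array([1.0, -2.0])
    prec = np.diag([0.5, 2.0])
    return -prec @ (z - mu)

rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 2))                       # points where scores are matched
G = np.stack([target_score(z) for z in Z])         # target scores at those points

# The score of q at z is b - P z with b = P m, which is linear in (P, b),
# so each row of [P | b] solves an independent least-squares problem.
design = np.hstack([-Z, np.ones((len(Z), 1))])     # columns: -z_1, -z_2, 1
sol, *_ = np.linalg.lstsq(design, G, rcond=None)   # shape (3, 2)
P = sol[:2].T                                      # estimated precision rows
P = 0.5 * (P + P.T)                                # symmetrize (sketch only)
m = np.linalg.solve(P, sol[2])                     # mean recovered from b = P m
print("precision:\n", P, "\nmean:", m)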

NeurIPS Conference 2021 Conference Paper

An online passive-aggressive algorithm for difference-of-squares classification

  • Lawrence Saul

We investigate a low-rank model of quadratic classification inspired by previous work on factorization machines, polynomial networks, and capsule-based architectures for visual object recognition. The model is parameterized by a pair of affine transformations, and it classifies examples by comparing the magnitudes of vectors that these transformations produce. The model is also over-parameterized in the sense that different pairs of affine transformations can describe classifiers with the same decision boundary and confidence scores. We show that such pairs arise from discrete and continuous symmetries of the model’s parameter space: in particular, the latter define symmetry groups of rotations and Lorentz transformations, and we use these group structures to devise appropriately invariant procedures for model alignment and averaging. We also leverage the form of the model’s decision boundary to derive simple margin-based updates for online learning. Here we explore a strategy of passive-aggressive learning: for each example, we compute the minimum change in parameters that is required to predict its correct label with high confidence. We derive these updates by solving a quadratically constrained quadratic program (QCQP); interestingly, this QCQP is nonconvex but tractable, and it can be solved efficiently by elementary methods. We highlight the conceptual and practical contributions of this approach. Conceptually, we show that it extends the paradigm of passive-aggressive learning to a larger family of nonlinear models for classification. Practically, we show that these models perform well on large-scale problems in online learning.
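
A minimal sketch of the decision rule as the abstract describes it: the model holds two affine transformations and labels an input by comparing the squared magnitudes of their outputs. The passive-aggressive QCQP update is not reproduced, and the parameters below are random placeholders.

# Difference-of-squares classifier: compare squared magnitudes of two affine maps.
import numpy as np

def predict(x, U, u, V, v):
    """Return +1 if ||Ux + u||^2 > ||Vx + v||^2, else -1, along with the score."""
    score = np.sum((U @ x + u) ** 2) - np.sum((V @ x + v) ** 2)
    return (1 if score > 0 else -1), score

rng = np.random.default_rng(0)
d, r = 5, 2                        # input dimension and rank of the model
U, V = rng.normal(size=(r, d)), rng.normal(size=(r, d))
u, v = rng.normal(size=r), rng.normal(size=r)
x = rng.normal(size=d)
label, margin = predict(x, U, u, V, v)
print(label, margin)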

NeurIPS Conference 2012 Conference Paper

Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

  • Matthew Der
  • Lawrence Saul

We describe a latent variable model for supervised dimensionality reduction and distance metric learning. The model discovers linear projections of high dimensional data that shrink the distance between similarly labeled inputs and expand the distance between differently labeled ones. The model’s continuous latent variables locate pairs of examples in a latent space of lower dimensionality. The model differs significantly from classical factor analysis in that the posterior distribution over these latent variables is not always multivariate Gaussian. Nevertheless we show that inference is completely tractable and derive an Expectation-Maximization (EM) algorithm for parameter estimation. We also compare the model to other approaches in distance metric learning. The model’s main advantage is its simplicity: at each iteration of the EM algorithm, the distance metric is re-estimated by solving an unconstrained least-squares problem. Experiments show that these simple updates are highly effective.

NeurIPS Conference 2011 Conference Paper

Maximum Covariance Unfolding: Manifold Learning for Bimodal Data

  • Vijay Mahadevan
  • Chi Wong
  • Jose Pereira
  • Tom Liu
  • Nuno Vasconcelos
  • Lawrence Saul

We propose maximum covariance unfolding (MCU), a manifold learning algorithm for simultaneous dimensionality reduction of data from different input modalities. Given high dimensional inputs from two different but naturally aligned sources, MCU computes a common low dimensional embedding that maximizes the cross-modal (inter-source) correlations while preserving the local (intra-source) distances. In this paper, we explore two applications of MCU. First we use MCU to analyze EEG-fMRI data, where an important goal is to visualize the fMRI voxels that are most strongly correlated with changes in EEG traces. To perform this visualization, we augment MCU with an additional step for metric learning in the high dimensional voxel space. Second, we use MCU to perform cross-modal retrieval of matched image and text samples from Wikipedia. To manage large applications of MCU, we develop a fast implementation based on ideas from spectral graph theory. These ideas transform the original problem for MCU, one of semidefinite programming, into a simpler problem in semidefinite quadratic linear programming.

NeurIPS Conference 2010 Conference Paper

Latent Variable Models for Predicting File Dependencies in Large-Scale Software Development

  • Diane Hu
  • Laurens Maaten
  • Youngmin Cho
  • Sorin Lerner
  • Lawrence Saul

When software developers modify one or more files in a large code base, they must also identify and update other related files. Many file dependencies can be detected by mining the development history of the code base: in essence, groups of related files are revealed by the logs of previous workflows. From data of this form, we show how to detect dependent files by solving a problem in binary matrix completion. We explore different latent variable models (LVMs) for this problem, including Bernoulli mixture models, exponential family PCA, restricted Boltzmann machines, and fully Bayesian approaches. We evaluate these models on the development histories of three large, open-source software systems: Mozilla Firefox, Eclipse Subversive, and Gimp. In all of these applications, we find that LVMs improve the performance of related file prediction over current leading methods.
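
As a sketch of the simplest latent variable model mentioned above, the code below fits a Bernoulli mixture to binary rows by EM and predicts a held-out entry of a partially observed row from the posterior over components. The synthetic matrix is a stand-in for real development histories.

# Bernoulli mixture for binary matrix completion (illustrative data, not real histories).
import numpy as np

rng = np.random.default_rng(0)
K, D, N = 3, 12, 500
true_theta = rng.uniform(0.05, 0.95, size=(K, D))
z = rng.integers(K, size=N)
X = (rng.random((N, D)) < true_theta[z]).astype(float)

pi = np.full(K, 1.0 / K)                       # mixing weights
theta = rng.uniform(0.3, 0.7, size=(K, D))     # Bernoulli means per component

for _ in range(50):
    # E-step: responsibilities r[n, k] proportional to pi_k * p(x_n | theta_k).
    logp = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T + np.log(pi)
    logp -= logp.max(1, keepdims=True)
    r = np.exp(logp)
    r /= r.sum(1, keepdims=True)
    # M-step: update mixing weights and Bernoulli means.
    Nk = r.sum(0)
    pi = Nk / N
    theta = np.clip((r.T @ X) / Nk[:, None], 1e-3, 1 - 1e-3)

# Predict a missing entry: observe all but the last feature of a row.
x_obs, missing = X[0, :-1], D - 1
logpost = (x_obs @ np.log(theta[:, :-1]).T
           + (1 - x_obs) @ np.log(1 - theta[:, :-1]).T + np.log(pi))
post = np.exp(logpost - logpost.max())
post /= post.sum()
print("P(missing entry = 1):", float(post @ theta[:, missing]))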

NeurIPS Conference 2009 Conference Paper

Kernel Methods for Deep Learning

  • Youngmin Cho
  • Lawrence Saul

We introduce a new family of positive-definite kernel functions that mimic the computation in large, multilayer neural nets. These kernel functions can be used in shallow architectures, such as support vector machines (SVMs), or in deep kernel-based architectures that we call multilayer kernel machines (MKMs). We evaluate SVMs and MKMs with these kernel functions on problems designed to illustrate the advantages of deep architectures. On several problems, we obtain better results than previous, leading benchmarks from both SVMs with Gaussian kernels as well as deep belief nets.
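
As far as I recall from this line of work, the degree-1 member of the kernel family has a closed form involving the angle between inputs, and "deep" kernels are obtained by iterating that construction layer by layer; treat the exact formula below as an assumption to verify against the paper. The sketch composes the kernel for a configurable number of layers.

# Sketch of a degree-1 arc-cosine-style kernel and its layer-wise composition.
# The closed form is written from memory and should be checked against the paper.
import numpy as np

def arccos1_from_inner(kxx, kyy, kxy):
    """Degree-1 kernel value computed from inner products."""
    norm = np.sqrt(kxx * kyy)
    cos_t = np.clip(kxy / norm, -1.0, 1.0)
    t = np.arccos(cos_t)
    return (norm / np.pi) * (np.sin(t) + (np.pi - t) * np.cos(t))

def multilayer_kernel(x, y, n_layers=3):
    """Compose the kernel with itself to mimic a multilayer architecture."""
    kxx, kyy, kxy = x @ x, y @ y, x @ y
    for _ in range(n_layers):
        kxx, kyy, kxy = (arccos1_from_inner(kxx, kxx, kxx),
                         arccos1_from_inner(kyy, kyy, kyy),
                         arccos1_from_inner(kxx, kyy, kxy))
    return kxy

x, y = np.array([1.0, 0.5, -0.2]), np.array([0.3, -1.0, 0.8])
print(multilayer_kernel(x, y))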

NeurIPS Conference 2006 Conference Paper

Graph Laplacian Regularization for Large-Scale Semidefinite Programming

  • Kilian Weinberger
  • Fei Sha
  • Qihui Zhu
  • Lawrence Saul

In many areas of science and engineering, the problem arises of how to discover low dimensional representations of high dimensional data. Recently, a number of researchers have converged on common solutions to this problem using methods from convex optimization. In particular, many results have been obtained by constructing semidefinite programs (SDPs) with low rank solutions. While the rank of matrix variables in SDPs cannot be directly constrained, it has been observed that low rank solutions emerge naturally by computing high variance or maximal trace solutions that respect local distance constraints. In this paper, we show how to solve very large problems of this type by a matrix factorization that leads to much smaller SDPs than those previously studied. The matrix factorization is derived by expanding the solution of the original problem in terms of the bottom eigenvectors of a graph Laplacian. The smaller SDPs obtained from this matrix factorization yield very good approximations to solutions of the original problem. Moreover, these approximations can be further refined by conjugate gradient descent. We illustrate the approach on localization in large scale sensor networks, where optimizations involving tens of thousands of nodes can be solved in just a few minutes.
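
A minimal sketch of the factorization idea, assuming a simple k-nearest-neighbor graph: the bottom eigenvectors Q of the graph Laplacian serve as an expansion basis, so the large Gram matrix is parameterized as Q Y Q^T with a much smaller positive semidefinite matrix Y. Solving the reduced SDP for Y is not shown; Y below is a placeholder.

# Build a kNN-graph Laplacian and its bottom eigenvectors as an expansion basis.
# The small SDP over Y is not solved here; Y is a stand-in.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                 # high dimensional inputs
k, m = 6, 10                                   # graph neighbors, basis size

# k-nearest-neighbor graph (symmetrized) and its unnormalized Laplacian.
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
idx = np.argsort(D2, axis=1)[:, 1:k + 1]
W = np.zeros_like(D2)
rows = np.repeat(np.arange(len(X)), k)
W[rows, idx.ravel()] = 1.0
W = np.maximum(W, W.T)
L = np.diag(W.sum(1)) - W

# Bottom-m eigenvectors of the Laplacian form the expansion basis Q.
eigvals, eigvecs = np.linalg.eigh(L)
Q = eigvecs[:, :m]                             # (n, m), with n >> m

Y = np.eye(m)                                  # placeholder for the small SDP variable
K = Q @ Y @ Q.T                                # implied n-by-n Gram matrix
print(Q.shape, K.shape)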

NeurIPS Conference 2006 Conference Paper

Large Margin Hidden Markov Models for Automatic Speech Recognition

  • Fei Sha
  • Lawrence Saul

We study the problem of parameter estimation in continuous density hidden Markov models (CD-HMMs) for automatic speech recognition (ASR). As in support vector machines, we propose a learning algorithm based on the goal of margin maximization. Unlike earlier work on max-margin Markov networks, our approach is specifically geared to the modeling of real-valued observations (such as acoustic feature vectors) using Gaussian mixture models. Unlike previous discriminative frameworks for ASR, such as maximum mutual information and minimum classification error, our framework leads to a convex optimization, without any spurious local minima. The objective function for large margin training of CD-HMMs is defined over a parameter space of positive semidefinite matrices. Its optimization can be performed efficiently with simple gradient-based methods that scale well to large problems. We obtain competitive results for phonetic recognition on the TIMIT speech corpus.

NeurIPS Conference 2005 Conference Paper

Distance Metric Learning for Large Margin Nearest Neighbor Classification

  • Kilian Weinberger
  • John Blitzer
  • Lawrence Saul

We show how to learn a Mahalanobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification, for example achieving a test error rate of 1.3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification.
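
A sketch of the kind of objective described above, written as a projected-gradient loop rather than the paper's semidefinite program: target neighbors of the same class are pulled together, and differently labeled points that invade the unit margin are pushed away, while the metric M is kept positive semidefinite. The data and hyperparameters are illustrative.

# Large-margin metric learning sketch (projected gradient, not the paper's SDP solver).
import numpy as np

def pairwise_d2(X, M):
    diff = X[:, None, :] - X[None, :, :]
    return np.einsum('ijk,kl,ijl->ij', diff, M, diff)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] + 0.3 * rng.normal(size=60) > 0).astype(int)
M = np.eye(4)
mu, lr, k = 1.0, 0.01, 3

for _ in range(100):
    D2 = pairwise_d2(X, M)
    grad = np.zeros_like(M)
    for i in range(len(X)):
        same = np.where((y == y[i]) & (np.arange(len(X)) != i))[0]
        neighbors = same[np.argsort(D2[i, same])[:k]]         # target neighbors
        for j in neighbors:
            dij = X[i] - X[j]
            grad += np.outer(dij, dij)                        # pull term
            for l in np.where(y != y[i])[0]:                  # impostor check
                if 1.0 + D2[i, j] - D2[i, l] > 0:             # hinge is active
                    dil = X[i] - X[l]
                    grad += mu * (np.outer(dij, dij) - np.outer(dil, dil))
    M -= lr * grad / len(X)
    w, V = np.linalg.eigh(M)                                  # project back to PSD cone
    M = (V * np.clip(w, 0, None)) @ V.T
print(np.round(M, 2))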

NeurIPS Conference 2004 Conference Paper

Hierarchical Distributed Representations for Statistical Language Modeling

  • John Blitzer
  • Fernando Pereira
  • Kilian Weinberger
  • Lawrence Saul

Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts.
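
One plausible linear initialization of the kind described above is PCA applied to each word's vector of bigram conditional frequencies; the toy corpus and embedding dimension below are illustrative only.

# PCA on bigram conditional-frequency vectors as a simple distributed word representation.
import numpy as np

corpus = "the vase broke the window broke the vase contains water".split()
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 2

counts = np.zeros((V, V))
for w1, w2 in zip(corpus[:-1], corpus[1:]):
    counts[index[w1], index[w2]] += 1.0          # C_ij: word j follows word i
freqs = counts / np.maximum(counts.sum(1, keepdims=True), 1.0)

centered = freqs - freqs.mean(0)                 # PCA on the frequency vectors
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
embeddings = centered @ Vt[:d].T                 # d-dimensional word vectors
for w in vocab:
    print(w, np.round(embeddings[index[w]], 3))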

NeurIPS Conference 2004 Conference Paper

Real-Time Pitch Determination of One or More Voices by Nonnegative Matrix Factorization

  • Fei Sha
  • Lawrence Saul

An auditory "scene", composed of overlapping acoustic sources, can be viewed as a complex object whose constituent parts are the individual sources. Pitch is known to be an important cue for auditory scene analysis. In this paper, with the goal of building agents that operate in human environments, we describe a real-time system to identify the presence of one or more voices and compute their pitch. The signal processing in the front end is based on instantaneous frequency estimation, a method for tracking the partials of voiced speech, while the pattern-matching in the back end is based on nonnegative matrix factorization, an unsupervised algorithm for learning the parts of complex objects. While supporting a framework to analyze complicated auditory scenes, our system maintains real-time operability and state-of-the-art performance in clean speech.
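
The pattern-matching back end relies on nonnegative matrix factorization. The sketch below shows generic NMF with the standard Lee-Seung multiplicative updates for the Euclidean objective, not the paper's real-time pitch tracker; the nonnegative "spectral" matrix is random and purely illustrative.

# Generic NMF with Lee-Seung multiplicative updates for ||V - WH||^2.
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((64, 200))          # e.g. nonnegative frequency analyses over time
r = 8                              # number of parts (e.g. pitch templates)
W = rng.random((64, r)) + 1e-3
H = rng.random((r, 200)) + 1e-3

for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)       # multiplicative update for H
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)       # multiplicative update for W

print("reconstruction error:", np.linalg.norm(V - W @ H))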

NeurIPS Conference 2002 Conference Paper

Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

  • Fei Sha
  • Lawrence Saul
  • Daniel Lee

We derive multiplicative updates for solving the nonnegative quadratic programming problem in support vector machines (SVMs). The updates have a simple closed form, and we prove that they converge monotonically to the solution of the maximum margin hyperplane. The updates optimize the traditionally proposed objective function for SVMs. They do not involve any heuristics such as choosing a learning rate or deciding which variables to update at each iteration. They can be used to adjust all the quadratic programming variables in parallel with a guarantee of improvement at each iteration. We analyze the asymptotic convergence of the updates and show that the coefficients of non-support vectors decay geometrically to zero at a rate that depends on their margins. In practice, the updates converge very rapidly to good classifiers.
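
Written from memory and to be checked against the paper, the multiplicative update for minimizing F(v) = (1/2) v'Av + b'v over v >= 0 splits A into its elementwise positive and negative parts and rescales each coordinate; the sketch below applies it to a random quadratic program rather than an actual SVM dual.

# Multiplicative update for nonnegative quadratic programming (form recalled from memory).
import numpy as np

rng = np.random.default_rng(0)
d = 10
B = rng.normal(size=(d, d))
A = B @ B.T + np.eye(d)            # positive definite quadratic term
b = rng.normal(size=d)

Apos = np.maximum(A, 0.0)          # elementwise positive part of A
Aneg = np.maximum(-A, 0.0)         # elementwise magnitude of negative entries
v = np.ones(d)

for _ in range(500):
    num = -b + np.sqrt(b ** 2 + 4.0 * (Apos @ v) * (Aneg @ v))
    v *= num / (2.0 * Apos @ v + 1e-12)

print("objective:", 0.5 * v @ A @ v + b @ v, "min v:", v.min())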

NeurIPS Conference 2002 Conference Paper

Real Time Voice Processing with Audiovisual Feedback: Toward Autonomous Agents with Perfect Pitch

  • Lawrence Saul
  • Daniel Lee
  • Charles Isbell
  • Yann Cun

We have implemented a real time front end for detecting voiced speech and estimating its fundamental frequency. The front end performs the signal processing for voice-driven agents that attend to the pitch contours of human speech and provide continuous audiovisual feedback. The algorithm we use for pitch tracking has several distinguishing features: it makes no use of FFTs or autocorrelation at the pitch period; it updates the pitch incrementally on a sample-by-sample basis; it avoids peak picking and does not require interpolation in time or frequency to obtain high resolution estimates; and it works reliably over a four octave range, in real time, without the need for postprocessing to produce smooth contours. The algorithm is based on two simple ideas in neural computation: the introduction of a purposeful nonlinearity, and the error signal of a least squares fit. The pitch tracker is used in two real time multimedia applications: a voice-to-MIDI player that synthesizes electronic music from vocalized melodies, and an audiovisual Karaoke machine with multimodal feedback. Both applications run on a laptop and display the user’s pitch scrolling across the screen as he or she sings into the computer.

NeurIPS Conference 2001 Conference Paper

Global Coordination of Local Linear Models

  • Sam Roweis
  • Lawrence Saul
  • Geoffrey Hinton

High dimensional data that lies on or near a low dimensional manifold can be described by a collection of local linear models. Such a description, however, does not provide a global parameterization of the manifold—arguably an important goal of unsupervised learning. In this paper, we show how to learn a collection of local linear models that solves this more difficult problem. Our local linear models are represented by a mixture of factor analyzers, and the “global coordination” of these models is achieved by adding a regularizing term to the standard maximum likelihood objective function. The regularizer breaks a degeneracy in the mixture model’s parameter space, favoring models whose internal coordinate systems are aligned in a consistent way. As a result, the internal coordinates change smoothly and continuously as one traverses a connected path on the manifold—even when the path crosses the domains of many different local models. The regularizer takes the form of a Kullback-Leibler divergence and illustrates an unexpected application of variational methods: not to perform approximate inference in intractable probabilistic models, but to learn more useful internal representations in tractable ones.

NeurIPS Conference 2001 Conference Paper

Multiplicative Updates for Classification by Mixture Models

  • Lawrence Saul
  • Daniel Lee

We investigate a learning algorithm for the classification of nonnegative data by mixture models. Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. The update rules have a simple closed form and an intuitive appeal. Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm—its guarantee of monotonic improvement, and its absence of tuning parameters—with the added advantage of optimizing a discriminative objective function. The algorithm reduces as a special case to the method of generalized iterative scaling for log-linear models. The learning rate of the algorithm is controlled by the sparseness of the training data. We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. This form of feature selection greatly accelerates learning and makes the algorithm practical on large problems. Experiments show that discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM.

NeurIPS Conference 2000 Conference Paper

Periodic Component Analysis: An Eigenvalue Method for Representing Periodic Structure in Speech

  • Lawrence Saul
  • Jont Allen

An eigenvalue method is developed for analyzing periodic structure in speech. Signals are analyzed by a matrix diagonalization reminiscent of methods for principal component analysis (PCA) and independent component analysis (ICA). Our method, called periodic component analysis (πCA), uses constructive interference to enhance periodic components of the frequency spectrum and destructive interference to cancel noise. The front end emulates important aspects of auditory processing, such as cochlear filtering, nonlinear compression, and insensitivity to phase, with the aim of approaching the robustness of human listeners. The method avoids the inefficiencies of autocorrelation at the pitch period: it does not require long delay lines, and it correlates signals at a clock rate on the order of the actual pitch, as opposed to the original sampling rate. We derive its cost function and present some experimental results.

NeurIPS Conference 1998 Conference Paper

Inference in Multilayer Networks via Large Deviation Bounds

  • Michael Kearns
  • Lawrence Saul

We study probabilistic inference in large, layered Bayesian networks represented as directed acyclic graphs. We show that the intractability of exact inference in such networks does not preclude their effective use. We give algorithms for approximate probabilistic inference that exploit averaging phenomena occurring at nodes with large numbers of parents. We show that these algorithms compute rigorous lower and upper bounds on marginal probabilities of interest, prove that these bounds become exact in the limit of large networks, and provide rates of convergence.

NeurIPS Conference 1998 Conference Paper

Markov Processes on Curves for Automatic Speech Recognition

  • Lawrence Saul
  • Mazin Rahim

We investigate a probabilistic framework for automatic speech recognition based on the intrinsic geometric properties of curves. In particular, we analyze the setting in which two variables, one continuous (x) and one discrete (s), evolve jointly in time. We suppose that the vector x traces out a smooth multidimensional curve and that the variable s evolves stochastically as a function of the arc length traversed along this curve. Since arc length does not depend on the rate at which a curve is traversed, this gives rise to a family of Markov processes whose predictions, Pr[s|x], are invariant to nonlinear warpings of time. We describe the use of such models, known as Markov processes on curves (MPCs), for automatic speech recognition, where x are acoustic feature trajectories and s are phonetic transcriptions. On two tasks, recognizing New Jersey town names and connected alpha-digits, we find that MPCs yield lower word error rates than comparably trained hidden Markov models.

NeurIPS Conference 1997 Conference Paper

Modeling Acoustic Correlations by Factor Analysis

  • Lawrence Saul
  • Mazin Rahim

Hidden Markov models (HMMs) for automatic speech recognition rely on high dimensional feature vectors to summarize the short-time properties of speech. Correlations between features can arise when the speech signal is non-stationary or corrupted by noise. We investigate how to model these correlations using factor analysis, a statistical method for dimensionality reduction. Factor analysis uses a small number of parameters to model the covariance structure of high dimensional data. These parameters are estimated by an Expectation-Maximization (EM) algorithm that can be embedded in the training procedures for HMMs. We evaluate the combined use of mixture densities and factor analysis in HMMs that recognize alphanumeric strings. Holding the total number of parameters fixed, we find that these methods, properly combined, yield better models than either method on its own.
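
As a sketch of factor analysis itself (generic EM for a low-rank-plus-diagonal covariance, not the paper's embedding of these updates inside HMM training), the code below estimates loadings Lambda and diagonal noise Psi from synthetic data.

# Generic EM for factor analysis: covariance modeled as Lambda Lambda' + Psi.
import numpy as np

rng = np.random.default_rng(0)
N, d, q = 2000, 8, 2
true_L = rng.normal(size=(d, q))
X = rng.normal(size=(N, q)) @ true_L.T + rng.normal(size=(N, d)) * 0.3
X -= X.mean(0)

L = rng.normal(size=(d, q))        # loading matrix Lambda
psi = np.ones(d)                   # diagonal noise variances Psi
S = X.T @ X / N                    # sample covariance

for _ in range(100):
    # E-step: posterior over latent factors z given each observation.
    G = np.linalg.inv(np.eye(q) + (L / psi[:, None]).T @ L)
    Ez = X @ (L / psi[:, None]) @ G               # (N, q) posterior means
    Ezz = N * G + Ez.T @ Ez                       # sum of E[z z'] over the data
    # M-step: update loadings and diagonal noise.
    L = (X.T @ Ez) @ np.linalg.inv(Ezz)
    psi = np.diag(S - L @ (Ez.T @ X) / N)

Sigma_hat = L @ L.T + np.diag(psi)
print("max |Sigma_hat - S|:", np.abs(Sigma_hat - S).max())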

NeurIPS Conference 1996 Conference Paper

A Variational Principle for Model-based Morphing

  • Lawrence Saul
  • Michael Jordan

Given a multidimensional data set and a model of its density, we consider how to define the optimal interpolation between two points. This is done by assigning a cost to each path through space, based on two competing goals: one to interpolate through regions of high density, the other to minimize arc length. From this path functional, we derive the Euler-Lagrange equations for extremal motion; given two points, the desired interpolation is found by solving a boundary value problem. We show that this interpolation can be done efficiently, in high dimensions, for Gaussian, Dirichlet, and mixture models.

NeurIPS Conference 1996 Conference Paper

Hidden Markov Decision Trees

  • Michael Jordan
  • Zoubin Ghahramani
  • Lawrence Saul

We study a time series model that can be viewed as a decision tree with Markov temporal structure. The model is intractable for exact calculations, thus we utilize variational approximations. We consider three different distributions for the approximation: one in which the Markov calculations are performed exactly and the layers of the decision tree are decoupled, one in which the decision tree calculations are performed exactly and the time steps of the Markov chain are decoupled, and one in which a Viterbi-like assumption is made to pick out a single most likely state sequence. We present simulation results for artificial data and the Bach chorales.

NeurIPS Conference 1995 Conference Paper

Exploiting Tractable Substructures in Intractable Networks

  • Lawrence Saul
  • Michael Jordan

We develop a refined mean field approximation for inference and learning in probabilistic neural networks. Our mean field theory, unlike most, does not assume that the units behave as independent degrees of freedom; instead, it exploits in a principled way the existence of large substructures that are computationally tractable. To illustrate the advantages of this framework, we show how to incorporate weak higher order interactions into a first-order hidden Markov model, treating the corrections (but not the first order structure) within mean field theory.

NeurIPS Conference 1995 Conference Paper

Fast Learning by Bounding Likelihoods in Sigmoid Type Belief Networks

  • Tommi Jaakkola
  • Lawrence Saul
  • Michael Jordan

Sigmoid type belief networks, a class of probabilistic neural networks, provide a natural framework for compactly representing probabilistic information in a variety of unsupervised and supervised learning problems. Often the parameters used in these networks need to be learned from examples. Unfortunately, estimating the parameters via exact probabilistic calculations (i.e., the EM algorithm) is intractable even for networks with fairly small numbers of hidden units. We propose to avoid the infeasibility of the E step by bounding likelihoods instead of computing them exactly. We introduce extended and complementary representations for these networks and show that the estimation of the network parameters can be made fast (reduced to quadratic optimization) by performing the estimation in either of the alternative domains. The complementary networks can be used for continuous density estimation as well.

NeurIPS Conference 1994 Conference Paper

Boltzmann Chains and Hidden Markov Models

  • Lawrence Saul
  • Michael Jordan

We propose a statistical mechanical framework for the modeling of discrete time series. Maximum likelihood estimation is done via Boltzmann learning in one-dimensional networks with tied weights. We call these networks Boltzmann chains and show that they contain hidden Markov models (HMMs) as a special case. Our framework also motivates new architectures that address particular shortcomings of HMMs. We look at two such architectures: parallel chains that model feature sets with disparate time scales, and looped networks that model long-term dependencies between hidden states. For these networks, we show how to implement the Boltzmann learning rule exactly, in polynomial time, without resort to simulated or mean-field annealing. The necessary computations are done by exact decimation procedures from statistical mechanics.