Arrow Research search

Author name cluster

Michael Jordan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

161 papers
1 author row

Possible papers

161

NeurIPS Conference 2025 Conference Paper

Backward Conformal Prediction

  • Etienne Gauthier
  • Francis Bach
  • Michael Jordan

We introduce *Backward Conformal Prediction*, a method that guarantees conformal coverage while providing flexible control over the size of prediction sets. Unlike standard conformal prediction, which fixes the coverage level and allows the conformal set size to vary, our approach defines a rule that constrains how prediction set sizes behave based on the observed data, and adapts the coverage level accordingly. Our method builds on two key foundations: (i) recent results by Gauthier et al. [2025] on post-hoc validity using e-values, which ensure marginal coverage of the form $\mathbb{P}(Y_{\rm test} \in \hat C_n^{\tilde{\alpha}}(X_{\rm test})) \ge 1 - \mathbb{E}[\tilde{\alpha}]$ for any data-dependent miscoverage $\tilde{\alpha}$, and (ii) a novel leave-one-out estimator $\hat{\alpha}^{\rm LOO}$ of the marginal miscoverage $\mathbb{E}[\tilde{\alpha}]$ based on the calibration set, ensuring that the theoretical guarantees remain computable in practice. This approach is particularly useful in applications where large prediction sets are impractical such as medical diagnosis. We provide theoretical results and empirical evidence supporting the validity of our method, demonstrating that it maintains computable coverage guarantees while ensuring interpretable, well-controlled prediction set sizes.

NeurIPS Conference 2025 Conference Paper

Conformal Prediction under Lévy-Prokhorov Distribution Shifts: Robustness to Local and Global Perturbations

  • Liviu Aolaritei
  • Julie Zhu
  • Oliver Wang
  • Michael Jordan
  • Youssef Marzouk

Conformal prediction provides a powerful framework for constructing prediction intervals with finite-sample guarantees, yet its robustness under distribution shifts remains a significant challenge. This paper addresses this limitation by modeling distribution shifts using Lévy-Prokhorov (LP) ambiguity sets, which capture both local and global perturbations. We provide a self-contained overview of LP ambiguity sets and their connections to popular metrics such as Wasserstein and Total Variation. We show that the link between conformal prediction and LP ambiguity sets is a natural one: by propagating the LP ambiguity set through the scoring function, we reduce complex high-dimensional distribution shifts to manageable one-dimensional distribution shifts, enabling exact quantification of worst-case quantiles and coverage. Building on this analysis, we construct robust conformal prediction intervals that remain valid under distribution shifts, explicitly linking LP parameters to interval width and confidence levels. Experimental results on real-world datasets demonstrate the effectiveness of the proposed approach.

NeurIPS Conference 2025 Conference Paper

Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers

  • Yixiao Huang
  • Hanlin Zhu
  • Tianyu Guo
  • Jiantao Jiao
  • Somayeh Sojoudi
  • Michael Jordan
  • Stuart J Russell
  • Song Mei

Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, even those without a causal link. Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, depending on whether the associated concepts are causally related. To build a rigorous theoretical understanding of this phenomenon, we then formalize OCR as a synthetic factual recall task. We empirically show that a one-layer single-head attention-only transformer with factorized output and value matrices can learn to solve this task, while a model with combined weights cannot, highlighting the crucial role of matrix factorization. Our theoretical analysis shows that the OCR capability can be attributed to the implicit bias of gradient descent, which favors solutions that minimize the nuclear norm of the combined output-value matrix. This structure explains why the model learns to associate facts and implications with high sample efficiency, regardless of whether the correlation is causal or merely spurious. Ultimately, our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.

NeurIPS Conference 2025 Conference Paper

Valid Selection among Conformal Sets

  • Mahmoud Hegazy
  • Liviu Aolaritei
  • Michael Jordan
  • Aymeric Dieuleveut

Conformal prediction offers a distribution-free framework for constructing prediction sets with coverage guarantees. In practice, multiple valid conformal prediction sets may be available, arising from different models or methodologies. However, selecting the most desirable set, such as the smallest, can invalidate the coverage guarantees. To address this challenge, we propose a stability-based approach that ensures coverage for the selected prediction set. We extend our results to the online conformal setting, propose several refinements in settings where additional structure is available, and demonstrate its effectiveness through experiments.

NeurIPS Conference 2024 Conference Paper

Data Acquisition via Experimental Design for Data Markets

  • Charles Lu
  • Baihe Huang
  • Sai Praneeth Karimireddy
  • Praneeth Vepakomma
  • Michael Jordan
  • Ramesh Raskar

The acquisition of training data is crucial for machine learning applications. Data markets can increase the supply of data, particularly in data-scarce domains such as healthcare, by incentivizing potential data providers to join the market. A major challenge for a data buyer in such a market is choosing the most valuable data points from a data seller. Unlike prior work in data valuation, which assumes centralized data access, we propose a federated approach to the data acquisition problem that is inspired by linear experimental design. Our proposed data acquisition method achieves lower prediction error without requiring labeled validation data and can be optimized in a fast and federated procedure. The key insight of our work is that a method that directly estimates the benefit of acquiring data for test set prediction is particularly compatible with a decentralized market setting.

NeurIPS Conference 2024 Conference Paper

Dimension-free Private Mean Estimation for Anisotropic Distributions

  • Yuval Dagan
  • Michael Jordan
  • Xuelin Yang
  • Lydia Zakynthinou
  • Nikita Zhivotovskiy

We present differentially private algorithms for high-dimensional mean estimation. Previous private estimators on distributions over $\mathbb{R}^d$ suffer from a curse of dimensionality, as they require $\Omega(d^{1/2})$ samples to achieve non-trivial error, even in cases where $O(1)$ samples suffice without privacy. This rate is unavoidable when the distribution is isotropic, namely, when the covariance is a multiple of the identity matrix. Yet, real-world data is often highly anisotropic, with signals concentrated on a small number of principal components. We develop estimators that are appropriate for such signals---our estimators are $(\varepsilon, \delta)$-differentially private and have sample complexity that is dimension-independent for anisotropic subgaussian distributions. Given $n$ samples from a distribution with known covariance-proxy $\Sigma$ and unknown mean $\mu$, we present an estimator $\hat{\mu}$ that achieves error, $\|\hat{\mu}-\mu\|_2\leq \alpha$, as long as $n\gtrsim \text{tr}(\Sigma)/\alpha^2+ \text{tr}(\Sigma^{1/2})/(\alpha\varepsilon)$. We show that this is the optimal sample complexity for this task up to logarithmic factors. Moreover, for the case of unknown covariance, we present an algorithm whose sample complexity has improved dependence on the dimension, from $d^{1/2}$ to $d^{1/4}$.

NeurIPS Conference 2024 Conference Paper

Fair Allocation in Dynamic Mechanism Design

  • Alireza Fallah
  • Michael Jordan
  • Annie Ulichney

We consider a dynamic mechanism design problem where an auctioneer sells an indivisible good to two groups of buyers in every round, for a total of $T$ rounds. The auctioneer aims to maximize their discounted overall revenue while adhering to a fairness constraint that guarantees a minimum average allocation for each group. We begin by studying the static case ($T=1$) and establish that the optimal mechanism involves two types of subsidization: one that increases the overall probability of allocation to all buyers, and another that favors the group which otherwise has a lower probability of winning the item. We then extend our results to the dynamic case by characterizing a set of recursive functions that determine the optimal allocation and payments in each round. Notably, our results establish that in the dynamic case, the seller, on one hand, commits to a participation reward to incentivize truth-telling, and, on the other hand, charges an entry fee for every round. Moreover, the optimal allocation once more involves subsidization in favor of one group, where the extent of subsidization depends on the difference in future utilities for both the seller and buyers when allocating the item to one group versus the other. Finally, we present an approximation scheme to solve the recursive equations and determine an approximately optimal and fair allocation efficiently.

NeurIPS Conference 2024 Conference Paper

Learning to Mitigate Externalities: the Coase Theorem with Hindsight Rationality

  • Antoine Scheid
  • Aymeric Capitaine
  • Etienne Boursier
  • Eric Moulines
  • Michael Jordan
  • Alain Durmus

In Economics, the concept of externality refers to any indirect effect resulting from an interaction between players and affecting a third party without compensation. Most of the models within which externality has been studied assume that agents have perfect knowledge of their environment and preferences. This is a major hindrance to the practical implementation of many proposed solutions. To adress this issue, we consider a two-players bandit game setting where the actions of one of the player affect the other one. Building upon this setup, we extend the Coase theorem [Coase, 2013], which suggests that the optimal approach for maximizing the social welfare in the presence of externality is to establish property rights, i. e. , enabling transfers and bargaining between the players. Nonetheless, this fundamental result relies on the assumption that bargainers possess perfect knowledge of the underlying game. We first demonstrate that in the absence of property rights in the considered online scenario, the social welfare breaks down. We then provide a policy for the players, which allows them to learn a bargaining strategy which maximizes the total welfare, recovering the Coase theorem under uncertainty.

NeurIPS Conference 2024 Conference Paper

Towards a Theoretical Understanding of the 'Reversal Curse' via Training Dynamics

  • Hanlin Zhu
  • Baihe Huang
  • Shaolun Zhang
  • Michael Jordan
  • Jiantao Jiao
  • Yuandong Tian
  • Stuart Russell

Auto-regressive large language models (LLMs) show impressive capacities to solve many complex reasoning tasks while struggling with some simple logical reasoning tasks such as inverse search: when trained on ''$A \to B$'' (e. g. , *Tom is the parent of John*), LLM fails to directly conclude ''$B \gets A$'' (e. g. , *John is the child of Tom*) during inference even if the two sentences are semantically identical, which is known as the ''reversal curse''. In this paper, we theoretically analyze the reversal curse via the training dynamics of (stochastic) gradient descent for two auto-regressive models: (1) a bilinear model that can be viewed as a simplification of a one-layer transformer; (2) one-layer transformers under certain assumptions. Our analysis reveals that for both models, the reversal curse is a consequence of the (effective) model weights *asymmetry*, i. e. , the increase of weights from a token $A$ to token $B$ during training does not necessarily cause the increase of the weights from $B$ to $A$, which is caused by the training dynamics under certain choice of loss function and the optimization space of model parameters. Moreover, our analysis can be naturally applied to other logical reasoning tasks such as chain-of-thought (COT), which provides a new perspective different from previous work that focuses on expressivity. Finally, we conduct experiments to validate our theory on multi-layer transformers under different settings. Our code is available at [https: //github. com/marlo-z/reversal_curse_analysis/](https: //github. com/marlo-z/reversal_curse_analysis/).

NeurIPS Conference 2024 Conference Paper

Unravelling in Collaborative Learning

  • Aymeric Capitaine
  • Etienne Boursier
  • Antoine Scheid
  • Eric Moulines
  • Michael Jordan
  • El-Mahdi El-Mhamdi
  • Alain Durmus

Collaborative learning offers a promising avenue for leveraging decentralized data. However, collaboration in groups of strategic learners is not a given. In this work, we consider strategic agents who wish to train a model together but have sampling distributions of different quality. The collaboration is organized by a benevolent aggregator who gathers samples so as to maximize total welfare, but is unaware of data quality. This setting allows us to shed light on the deleterious effect of adverse selection in collaborative learning. More precisely, we demonstrate that when data quality indices are private, the coalition may undergo a phenomenon known as unravelling, wherein it shrinks up to the point that it becomes empty or solely comprised of the worst agent. We show how this issue can be addressed without making use of external transfers, by proposing a novel method inspired by probabilistic verification. This approach makes the grand coalition a Nash equilibrium with high probability despite information asymmetry, thereby breaking unravelling.

NeurIPS Conference 2023 Conference Paper

A Unifying Perspective on Multi-Calibration: Game Dynamics for Multi-Objective Learning

  • Nika Haghtalab
  • Michael Jordan
  • Eric Zhao

We provide a unifying framework for the design and analysis of multi-calibrated predictors. By placing the multi-calibration problem in the general setting of multi-objective learning---where learning guarantees must hold simultaneously over a set of distributions and loss functions---we exploit connections to game dynamics to achieve state-of-the-art guarantees for a diverse set of multi-calibration learning problems. In addition to shedding light on existing multi-calibration guarantees and greatly simplifying their analysis, our approach also yields improved guarantees, such as error tolerances that scale with the square-root of group size versus the constant tolerances guaranteed by prior works, and improving the complexity of $k$-class multi-calibration by an exponential factor of $k$ versus Gopalan et al. . Beyond multi-calibration, we use these game dynamics to address emerging considerations in the study of group fairness and multi-distribution learning.

NeurIPS Conference 2023 Conference Paper

Class-Conditional Conformal Prediction with Many Classes

  • Tiffany Ding
  • Anastasios Angelopoulos
  • Stephen Bates
  • Michael Jordan
  • Ryan J. Tibshirani

Standard conformal prediction methods provide a marginal coverage guarantee, which means that for a random test point, the conformal prediction set contains the true label with a user-specified probability. In many classificationproblems, we would like to obtain a stronger guarantee--that for test pointsof a specific class, the prediction set contains the true label with thesame user-chosen probability. For the latter goal, existing conformal predictionmethods do not work well when there is a limited amount of labeled data perclass, as is often the case in real applications where the number of classes islarge. We propose a method called clustered conformal prediction thatclusters together classes having "similar" conformal scores and performs conformal prediction at the cluster level. Based on empirical evaluation acrossfour image data sets with many (up to 1000) classes, we find that clusteredconformal typically outperforms existing methods in terms of class-conditionalcoverage and set size metrics.

NeurIPS Conference 2023 Conference Paper

Doubly-Robust Self-Training

  • Banghua Zhu
  • Mingyu Ding
  • Philip Jacobson
  • Ming Wu
  • Wei Zhan
  • Michael Jordan
  • Jiantao Jiao

Self-training is a well-established technique in semi-supervised learning, which leverages unlabeled data by generating pseudo-labels and incorporating them with a limited labeled dataset for training. The effectiveness of self-training heavily relies on the accuracy of these pseudo-labels. In this paper, we introduce doubly-robust self-training, an innovative semi-supervised algorithm that provably balances between two extremes. When pseudo-labels are entirely incorrect, our method reduces to a training process solely using labeled data. Conversely, when pseudo-labels are completely accurate, our method transforms into a training process utilizing all pseudo-labeled data and labeled data, thus increasing the effective sample size. Through empirical evaluations on both the ImageNet dataset for image classification and the nuScenes autonomous driving dataset for 3D object detection, we demonstrate the superiority of the doubly-robust loss over the self-training baseline.

NeurIPS Conference 2023 Conference Paper

Improved Bayes Risk Can Yield Reduced Social Welfare Under Competition

  • Meena Jagadeesan
  • Michael Jordan
  • Jacob Steinhardt
  • Nika Haghtalab

As the scale of machine learning models increases, trends such as scaling laws anticipate consistent downstream improvements in predictive accuracy. However, these trends take the perspective of a single model-provider in isolation, while in reality providers often compete with each other for users. In this work, we demonstrate that competition can fundamentally alter the behavior of these scaling trends, even causing overall predictive accuracy across users to be non-monotonic or decreasing with scale. We define a model of competition for classification tasks, and use data representations as a lens for studying the impact of increases in scale. We find many settings where improving data representation quality (as measured by Bayes risk) decreases the overall predictive accuracy across users (i. e. , social welfare) for a marketplace of competing model-providers. Our examples range from closed-form formulas in simple settings to simulations with pretrained representations on CIFAR-10. At a conceptual level, our work suggests that favorable scaling trends for individual model-providers need not translate to downstream improvements in social welfare in marketplaces with multiple model providers.

NeurIPS Conference 2023 Conference Paper

On Learning Necessary and Sufficient Causal Graphs

  • Hengrui Cai
  • Yixin Wang
  • Michael Jordan
  • Rui Song

The causal revolution has stimulated interest in understanding complex relationships in various fields. Most of the existing methods aim to discover causal relationships among all variables within a complex large-scale graph. However, in practice, only a small subset of variables in the graph are relevant to the outcomes of interest. Consequently, causal estimation with the full causal graph---particularly given limited data---could lead to numerous falsely discovered, spurious variables that exhibit high correlation with, but exert no causal impact on, the target outcome. In this paper, we propose learning a class of necessary and sufficient causal graphs (NSCG) that exclusively comprises causally relevant variables for an outcome of interest, which we term causal features. The key idea is to employ probabilities of causation to systematically evaluate the importance of features in the causal graph, allowing us to identify a subgraph relevant to the outcome of interest. To learn NSCG from data, we develop a necessary and sufficient causal structural learning (NSCSL) algorithm, by establishing theoretical properties and relationships between probabilities of causation and natural causal effects of features. Across empirical studies of simulated and real data, we demonstrate that NSCSL outperforms existing algorithms and can reveal crucial yeast genes for target heritable traits of interest.

NeurIPS Conference 2023 Conference Paper

Optimal Extragradient-Based Algorithms for Stochastic Variational Inequalities with Separable Structure

  • Angela Yuan
  • Chris Junchi Li
  • Gauthier Gidel
  • Michael Jordan
  • Quanquan Gu
  • Simon S. Du

We consider the problem of solving stochastic monotone variational inequalities with a separable structure using a stochastic first-order oracle. Building on standard extragradient for variational inequalities we propose a novel algorithm---stochastic \emph{accelerated gradient-extragradient} (AG-EG)---for strongly monotone variational inequalities (VIs). Our approach combines the strengths of extragradient and Nesterov acceleration. By showing that its iterates remain in a bounded domain and applying scheduled restarting, we prove that AG-EG has an optimal convergence rate for strongly monotone VIs. Furthermore, when specializing to the particular case of bilinearly coupled strongly-convex-strongly-concave saddle-point problems, including bilinear games, our algorithm achieves fine-grained convergence rates that match the respective lower bounds, with the stochasticity being characterized by an additive statistical error term that is optimal up to a constant prefactor.

TMLR Journal 2023 Journal Article

Provably Personalized and Robust Federated Learning

  • Mariel Werner
  • Lie He
  • Michael Jordan
  • Martin Jaggi
  • Sai Praneeth Karimireddy

Clustering clients with similar objectives and learning a model per cluster is an intuitive and interpretable approach to personalization in federated learning. However, doing so with provable and optimal guarantees has remained an open challenge. In this work, we formalize personalized federated learning as a stochastic optimization problem. We propose simple clustering-based algorithms which iteratively identify and train within clusters, using local client gradients. Our algorithms have optimal convergence rates which asymptotically match those obtained if we knew the true underlying clustering of the clients, and are provably robust in the Byzantine setting where some fraction of the clients are malicious.

NeurIPS Conference 2023 Conference Paper

Towards Optimal Caching and Model Selection for Large Model Inference

  • Banghua Zhu
  • Ying Sheng
  • Lianmin Zheng
  • Clark Barrett
  • Michael Jordan
  • Jiantao Jiao

Large Language Models (LLMs) and other large foundation models have achieved impressive results, but their size exacerbates existing resource consumption and latency challenges. In particular, the large-scale deployment of these models is hindered by the significant resource requirements during inference. In this paper, we study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model selector to choose from an ensemble of models for query processing. Theoretically, we provide an optimal algorithm for jointly optimizing both approaches to reduce the inference cost in both offline and online tabular settings. By combining a caching algorithm, namely Greedy Dual Size with Frequency (GDSF) or Least Expected Cost (LEC), with a model selector, we achieve optimal rates in both offline and online settings. Empirically, simulations show that our caching and model selection algorithm greatly improves over the baselines, with up to $50\times$ improvement over the baseline when the ratio between the maximum cost and minimum cost is $100$. Experiments on real datasets show a $4. 3\times$ improvement in FLOPs over the baseline when the ratio for FLOPs is $10$, and a $1. 8\times$ improvement in latency when the ratio for average latency is $1. 85$.

NeurIPS Conference 2022 Conference Paper

Empirical Gateaux Derivatives for Causal Inference

  • Michael Jordan
  • Yixin Wang
  • Angela Zhou

We study a constructive procedure that approximates Gateaux derivatives for statistical functionals by finite-differencing, with attention to causal inference functionals. We focus on the case where probability distributions are not known a priori but need also to be estimated from data, leading to empirical Gateaux derivatives, and study relationships between empirical, numerical, and analytical Gateaux derivatives. Starting with a case study of counterfactual mean estimation, we verify the exact relationship between finite-differences and the analytical Gateaux derivative. We then derive requirements on the rates of numerical approximation in perturbation and smoothing that preserve statistical benefits. We study more complicated functionals such as dynamic treatment regimes and the linear-programming formulation for policy optimization infinite-horizon Markov decision processes. In the case of the latter, this approach can be used to approximate bias adjustments in the presence of arbitrary constraints, illustrating the usefulness of constructive approaches for Gateaux derivatives. We find that, omitting unfavorable dimension dependence of smoothing, although rate-double robustness permits for coarser rates of perturbation size than implied by generic approximation analysis of finite-differences for the case of the counterfactual mean, this is not the case for the infinite-horizon MDP policy value.

NeurIPS Conference 2022 Conference Paper

First-Order Algorithms for Min-Max Optimization in Geodesic Metric Spaces

  • Michael Jordan
  • Tianyi Lin
  • Emmanouil-Vasileios Vlatakis-Gkaragkounis

From optimal transport to robust dimensionality reduction, many machine learning applicationscan be cast into the min-max optimization problems over Riemannian manifolds. Though manymin-max algorithms have been analyzed in the Euclidean setting, it has been elusive how theseresults translate to the Riemannian case. Zhang et al. (2022) have recently identified that geodesic convexconcave Riemannian problems admit always Sion’s saddle point solutions. Immediately, an importantquestion that arises is if a performance gap between the Riemannian and the optimal Euclidean spaceconvex concave algorithms is necessary. Our work is the first to answer the question in the negative: We prove that the Riemannian corrected extragradient (RCEG) method achieves last-iterate at alinear convergence rate at the geodesically strongly convex concave case, matching the euclidean one. Our results also extend to the stochastic or non-smooth case where RCEG & Riemanian gradientascent descent (RGDA) achieve respectively near-optimal convergence rates up to factors dependingon curvature of the manifold. Finally, we empirically demonstrate the effectiveness of RCEG insolving robust PCA.

NeurIPS Conference 2022 Conference Paper

Gradient-Free Methods for Deterministic and Stochastic Nonsmooth Nonconvex Optimization

  • Tianyi Lin
  • Zeyu Zheng
  • Michael Jordan

Nonsmooth nonconvex optimization problems broadly emerge in machine learning and business decision making, whereas two core challenges impede the development of efficient solution methods with finite-time convergence guarantee: the lack of computationally tractable optimality criterion and the lack of computationally powerful oracles. The contributions of this paper are two-fold. First, we establish the relationship between the celebrated Goldstein subdifferential~\citep{Goldstein-1977-Optimization} and uniform smoothing, thereby providing the basis and intuition for the design of gradient-free methods that guarantee the finite-time convergence to a set of Goldstein stationary points. Second, we propose the gradient-free method (GFM) and stochastic GFM for solving a class of nonsmooth nonconvex optimization problems and prove that both of them can return a $(\delta, \epsilon)$-Goldstein stationary point of a Lipschitz function $f$ at an expected convergence rate at $O(d^{3/2}\delta^{-1}\epsilon^{-4})$ where $d$ is the problem dimension. Two-phase versions of GFM and SGFM are also proposed and proven to achieve improved large-deviation results. Finally, we demonstrate the effectiveness of 2-SGFM on training ReLU neural networks with the \textsc{Minst} dataset.

NeurIPS Conference 2022 Conference Paper

Learn to Match with No Regret: Reinforcement Learning in Markov Matching Markets

  • Yifei Min
  • Tianhao Wang
  • Ruitu Xu
  • Zhaoran Wang
  • Michael Jordan
  • Zhuoran Yang

We study a Markov matching market involving a planner and a set of strategic agents on the two sides of the market. At each step, the agents are presented with a dynamical context, where the contexts determine the utilities. The planner controls the transition of the contexts to maximize the cumulative social welfare, while the agents aim to find a myopic stable matching at each step. Such a setting captures a range of applications including ridesharing platforms. We formalize the problem by proposing a reinforcement learning framework that integrates optimistic value iteration with maximum weight matching. The proposed algorithm addresses the coupled challenges of sequential exploration, matching stability, and function approximation. We prove that the algorithm achieves sublinear regret.

NeurIPS Conference 2022 Conference Paper

Learning Two-Player Markov Games: Neural Function Approximation and Correlated Equilibrium

  • Chris Junchi Li
  • Dongruo Zhou
  • Quanquan Gu
  • Michael Jordan

We consider learning Nash equilibria in two-player zero-sum Markov Games with nonlinear function approximation, where the action-value function is approximated by a function in a Reproducing Kernel Hilbert Space (RKHS). The key challenge is how to do exploration in the high-dimensional function space. We propose a novel online learning algorithm to find a Nash equilibrium by minimizing the duality gap. At the core of our algorithms are upper and lower confidence bounds that are derived based on the principle of optimism in the face of uncertainty. We prove that our algorithm is able to attain an $O(\sqrt{T})$ regret with polynomial computational complexity, under very mild assumptions on the reward function and the underlying dynamic of the Markov Games. We also propose several extensions of our algorithm, including an algorithm with Bernstein-type bonus that can achieve a tighter regret bound, and another algorithm for model misspecification that can be applied to neural network function approximation.

TMLR Journal 2022 Journal Article

Multi-Source Causal Inference Using Control Variates under Outcome Selection Bias

  • Wenshuo Guo
  • Serena Lutong Wang
  • Peng Ding
  • Yixin Wang
  • Michael Jordan

While many areas of machine learning have benefited from the increasing availability of large and varied datasets, the benefit to causal inference has been limited given the strong assumptions needed to ensure the identifiability of causal effects -- which are often not satisfied in real-world datasets. For example, many large observational datasets (e.g., case-control studies in epidemiology, click-through data in recommender systems) suffer from selection bias on the outcome, which makes the average treatment effect (ATE) non-identifiable. We propose an algorithm to estimate causal effects from multiple data sources, where the ATE may be identifiable only in some datasets but not others. The idea is to construct control variates across the datasets in which the ATE may not be identifiable, which provably reduces the variance of the ATE estimate. We focus on a setting where the observational datasets suffer from outcome selection bias, assuming access to an auxiliary small dataset from which we can obtain a consistent estimate of the ATE. We propose a construction of control variate by taking the difference of the conditional odds ratio estimates from the two datasets. Across simulations and two case studies with real data, we show that the control variate-based ATE estimator has consistently and significantly reduced variance against different baselines.

NeurIPS Conference 2022 Conference Paper

Off-Policy Evaluation with Policy-Dependent Optimization Response

  • Wenshuo Guo
  • Michael Jordan
  • Angela Zhou

The intersection of causal inference and machine learning for decision-making is rapidly expanding, but the default decision criterion remains an average of individual causal outcomes across a population. In practice, various operational restrictions ensure that a decision-maker's utility is not realized as an average but rather as an output of a downstream decision-making problem (such as matching, assignment, network flow, minimizing predictive risk). In this work, we develop a new framework for off-policy evaluation with policy-dependent linear optimization responses: causal outcomes introduce stochasticity in objective function coefficients. Under this framework, a decision-maker's utility depends on the policy-dependent optimization, which introduces a fundamental challenge of optimization bias even for the case of policy evaluation. We construct unbiased estimators for the policy-dependent estimand by a perturbation method, and discuss asymptotic variance properties for a set of adjusted plug-in estimators. Lastly, attaining unbiased policy evaluation allows for policy optimization: we provide a general algorithm for optimizing causal interventions. We corroborate our theoretical results with numerical simulations.

NeurIPS Conference 2022 Conference Paper

On-Demand Sampling: Learning Optimally from Multiple Distributions

  • Nika Haghtalab
  • Michael Jordan
  • Eric Zhao

Societal and real-world considerations such as robustness, fairness, social welfare and multi-agent tradeoffs have given rise to multi-distribution learning paradigms, such as collaborative [Blum et al. 2017], group distributionally robust [Sagawa et al. 2019], and fair federated learning [Mohri et al. 2019]. In each of these settings, a learner seeks to minimize its worstcase loss over a set of $n$ predefined distributions, while using as few samples as possible. In this paper, we establish the optimal sample complexity of these learning paradigms and give algorithms that meet this sample complexity. Importantly, our sample complexity bounds exceed that of the sample complexity of learning a single distribution only by an additive factor of $\frac{n\log(n)}{\epsilon^2}$. These improve upon the best known sample complexity of agnostic federated learning by Mohri et al. 2019 by a multiplicative factor of $n$, the sample complexity of collaborative learning by Nguyen and Zakynthinou 2018 by a multiplicative factor $\frac{\log(n)}{\epsilon^3}$, and give the first sample complexity bounds for the group DRO objective of Sagawa et al. 2019. To achieve optimal sample complexity, our algorithms learn to sample and learn from distributions on demand. Our algorithm design and analysis extends stochastic optimization techniques to solve zero-sum games in a new stochastic setting.

NeurIPS Conference 2022 Conference Paper

Rank Diminishing in Deep Neural Networks

  • Ruili Feng
  • Kecheng Zheng
  • Yukun Huang
  • Deli Zhao
  • Michael Jordan
  • Zheng-Jun Zha

The rank of neural networks measures information flowing across layers. It is an instance of a key structural condition that applies across broad domains of machine learning. In particular, the assumption of low-rank feature representations led to algorithmic developments in many architectures. For neural networks, however, the intrinsic mechanism that yields low-rank structures remains vague and unclear. To fill this gap, we perform a rigorous study on the behavior of network rank, focusing particularly on the notion of rank deficiency. We theoretically establish a universal monotone decreasing property of network ranks from the basic rules of differential and algebraic composition, and uncover rank deficiency of network blocks and deep function coupling. By virtue of our numerical tools, we provide the first empirical analysis of the per-layer behavior of network ranks in realistic settings, \ieno, ResNets, deep MLPs, and Transformers on ImageNet. These empirical results are in direct accord with our theory. Furthermore, we reveal a novel phenomenon of independence deficit caused by the rank deficiency of deep networks, where classification confidence of a given category can be linearly decided by the confidence of a handful of other categories. The theoretical results of this work, together with the empirical findings, may advance understanding of the inherent principles of deep neural networks. Code to detect the rank behavior of networks can be found in https: //github. com/RuiLiFeng/Rank-Diminishing-in-Deep-Neural-Networks.

NeurIPS Conference 2022 Conference Paper

Robust Calibration with Multi-domain Temperature Scaling

  • Yaodong Yu
  • Stephen Bates
  • Yi Ma
  • Michael Jordan

Uncertainty quantification is essential for the reliable deployment of machine learning models to high-stakes application domains. Uncertainty quantification is all the more challenging when training distribution and test distribution are different, even if the distribution shifts are mild. Despite the ubiquity of distribution shifts in real-world applications, existing uncertainty quantification approaches mainly study the in-distribution setting where the train and test distributions are the same. In this paper, we develop a systematic calibration model to handle distribution shifts by leveraging data from multiple domains. Our proposed method---multi-domain temperature scaling---uses the heterogeneity in the domains to improve calibration robustness under distribution shift. Through experiments on three benchmark data sets, we find our proposed method outperforms existing methods as measured on both in-distribution and out-of-distribution test sets.

NeurIPS Conference 2022 Conference Paper

TCT: Convexifying Federated Learning using Bootstrapped Neural Tangent Kernels

  • Yaodong Yu
  • Alexander Wei
  • Sai Praneeth Karimireddy
  • Yi Ma
  • Michael Jordan

State-of-the-art federated learning methods can perform far worse than their centralized counterparts when clients have dissimilar data distributions. For neural networks, even when centralized SGD easily finds a solution that is simultaneously performant for all clients, current federated optimization methods fail to converge to a comparable solution. We show that this performance disparity can largely be attributed to optimization challenges presented by nonconvexity. Specifically, we find that the early layers of the network do learn useful features, but the final layers fail to make use of them. That is, federated optimization applied to this non-convex problem distorts the learning of the final layers. Leveraging this observation, we propose a Train-Convexify-Train (TCT) procedure to sidestep this issue: first, learn features using off-the-shelf methods (e. g. , FedAvg); then, optimize a convexified problem obtained from the network's empirical neural tangent kernel approximation. Our technique yields accuracy improvements of up to $+36\%$ on FMNIST and $+37\%$ on CIFAR10 when clients have dissimilar data.

NeurIPS Conference 2021 Conference Paper

Learning Equilibria in Matching Markets from Bandit Feedback

  • Meena Jagadeesan
  • Alexander Wei
  • Yixin Wang
  • Michael Jordan
  • Jacob Steinhardt

Large-scale, two-sided matching platforms must find market outcomes that align with user preferences while simultaneously learning these preferences from data. But since preferences are inherently uncertain during learning, the classical notion of stability (Gale and Shapley, 1962; Shapley and Shubik, 1971) is unattainable in these settings. To bridge this gap, we develop a framework and algorithms for learning stable market outcomes under uncertainty. Our primary setting is matching with transferable utilities, where the platform both matches agents and sets monetary transfers between them. We design an incentive-aware learning objective that captures the distance of a market outcome from equilibrium. Using this objective, we analyze the complexity of learning as a function of preference structure, casting learning as a stochastic multi-armed bandit problem. Algorithmically, we show that "optimism in the face of uncertainty, " the principle underlying many bandit algorithms, applies to a primal-dual formulation of matching with transfers and leads to near-optimal regret bounds. Our work takes a first step toward elucidating when and how stable matchings arise in large, data-driven marketplaces.

NeurIPS Conference 2021 Conference Paper

Learning in Multi-Stage Decentralized Matching Markets

  • Xiaowu Dai
  • Michael Jordan

Matching markets are often organized in a multi-stage and decentralized manner. Moreover, participants in real-world matching markets often have uncertain preferences. This article develops a framework for learning optimal strategies in such settings, based on a nonparametric statistical approach and variational analysis. We propose an efficient algorithm, built upon concepts of "lower uncertainty bound" and "calibrated decentralized matching, " for maximizing the participants' expected payoff. We show that there exists a welfare-versus-fairness trade-off that is characterized by the uncertainty level of acceptance. Participants will strategically act in favor of a low uncertainty level to reduce competition and increase expected payoff. We prove that participants can be better off with multi-stage matching compared to single-stage matching. We demonstrate aspects of the theoretical predictions through simulations and an experiment using real data from college admissions.

NeurIPS Conference 2021 Conference Paper

On Component Interactions in Two-Stage Recommender Systems

  • Jiri Hron
  • Karl Krauth
  • Michael Jordan
  • Niki Kilbertus

Thanks to their scalability, two-stage recommenders are used by many of today's largest online platforms, including YouTube, LinkedIn, and Pinterest. These systems produce recommendations in two steps: (i) multiple nominators—tuned for low prediction latency—preselect a small subset of candidates from the whole item pool; (ii) a slower but more accurate ranker further narrows down the nominated items, and serves to the user. Despite their popularity, the literature on two-stage recommenders is relatively scarce, and the algorithms are often treated as mere sums of their parts. Such treatment presupposes that the two-stage performance is explained by the behavior of the individual components in isolation. This is not the case: using synthetic and real-world data, we demonstrate that interactions between the ranker and the nominators substantially affect the overall performance. Motivated by these findings, we derive a generalization lower bound which shows that independent nominator training can lead to performance on par with uniformly random recommendations. We find that careful design of item pools, each assigned to a different nominator, alleviates these issues. As manual search for a good pool allocation is difficult, we propose to learn one instead using a Mixture-of-Experts based approach. This significantly improves both precision and recall at $K$.

NeurIPS Conference 2021 Conference Paper

On the Theory of Reinforcement Learning with Once-per-Episode Feedback

  • Niladri Chatterji
  • Aldo Pacchiano
  • Peter Bartlett
  • Michael Jordan

We study a theory of reinforcement learning (RL) in which the learner receives binary feedback only once at the end of an episode. While this is an extreme test case for theory, it is also arguably more representative of real-world applications than the traditional requirement in RL practice that the learner receive feedback at every time step. Indeed, in many real-world applications of reinforcement learning, such as self-driving cars and robotics, it is easier to evaluate whether a learner's complete trajectory was either good'' or bad, '' but harder to provide a reward signal at each step. To show that learning is possible in this more challenging setting, we study the case where trajectory labels are generated by an unknown parametric model, and provide a statistically and computationally efficient algorithm that achieves sublinear regret.

NeurIPS Conference 2021 Conference Paper

Robust Learning of Optimal Auctions

  • Wenshuo Guo
  • Michael Jordan
  • Emmanouil Zampetakis

We study the problem of learning revenue-optimal multi-bidder auctions from samples when the samples of bidders' valuations can be adversarially corrupted or drawn from distributions that are adversarially perturbed. First, we prove tight upper bounds on the revenue we can obtain with a corrupted distribution under a population model, for both regular valuation distributions and distributions with monotone hazard rate (MHR). We then propose new algorithms that, given only an ``approximate distribution'' for the bidder's valuation, can learn a mechanism whose revenue is nearly optimal simultaneously for all ``true distributions'' that are $\alpha$-close to the original distribution in Kolmogorov-Smirnov distance. The proposed algorithms operate beyond the setting of bounded distributions that have been studied in prior works, and are guaranteed to obtain a fraction $1-O(\alpha)$ of the optimal revenue under the true distribution when the distributions are MHR. Moreover, they are guaranteed to yield at least a fraction $1-O(\sqrt{\alpha})$ of the optimal revenue when the distributions are regular. We prove that these upper bounds cannot be further improved, by providing matching lower bounds. Lastly, we derive sample complexity upper bounds for learning a near-optimal auction for both MHR and regular distributions.

NeurIPS Conference 2021 Conference Paper

Tactical Optimism and Pessimism for Deep Reinforcement Learning

  • Ted Moskovitz
  • Jack Parker-Holder
  • Aldo Pacchiano
  • Michael Arbel
  • Michael Jordan

In recent years, deep off-policy actor-critic algorithms have become a dominant approach to reinforcement learning for continuous control. One of the primary drivers of this improved performance is the use of pessimistic value updates to address function approximation errors, which previously led to disappointing performance. However, a direct consequence of pessimism is reduced exploration, running counter to theoretical support for the efficacy of optimism in the face of uncertainty. So which approach is best? In this work, we show that the most effective degree of optimism can vary both across tasks and over the course of learning. Inspired by this insight, we introduce a novel deep actor-critic framework, Tactical Optimistic and Pessimistic (TOP) estimation, which switches between optimistic and pessimistic value learning online. This is achieved by formulating the selection as a multi-arm bandit problem. We show in a series of continuous control tasks that TOP outperforms existing methods which rely on a fixed degree of optimism, setting a new state of the art in challenging pixel-based environments. Since our changes are simple to implement, we believe these insights can easily be incorporated into a multitude of off-policy algorithms.

NeurIPS Conference 2021 Conference Paper

Test-time Collective Prediction

  • Celestine Mendler-Dünner
  • Wenshuo Guo
  • Stephen Bates
  • Michael Jordan

An increasingly common setting in machine learning involves multiple parties, each with their own data, who want to jointly make predictions on future test points. Agents wish to benefit from the collective expertise of the full set of agents to make better predictions than they would individually, but may not be willing to release labeled data or model parameters. In this work, we explore a decentralized mechanism to make collective predictions at test time, that is inspired by the literature in social science on human consensus-making. Building on a query model to facilitate information exchange among agents, our approach leverages each agent’s pre-trained model without relying on external validation, model retraining, or data pooling. A theoretical analysis shows that our approach recovers inverse mean-squared-error (MSE) weighting in the large-sample limit which is known to be the optimal way to combine independent, unbiased estimators. Empirically, we demonstrate that our scheme effectively combines models with differing quality across the input space: the proposed consensus prediction achieves significant gains over classical model averaging, and even outperforms weighted averaging schemes that have access to additional validation data. Finally, we propose a decentralized Jackknife procedure as a tool to evaluate the sensitivity of the collective predictions with respect to a single agent's opinion.

NeurIPS Conference 2021 Conference Paper

Wasserstein Flow Meets Replicator Dynamics: A Mean-Field Analysis of Representation Learning in Actor-Critic

  • Yufeng Zhang
  • Siyu Chen
  • Zhuoran Yang
  • Michael Jordan
  • Zhaoran Wang

Actor-critic (AC) algorithms, empowered by neural networks, have had significant empirical success in recent years. However, most of the existing theoretical support for AC algorithms focuses on the case of linear function approximations, or linearized neural networks, where the feature representation is fixed throughout training. Such a limitation fails to capture the key aspect of representation learning in neural AC, which is pivotal in practical problems. In this work, we take a mean-field perspective on the evolution and convergence of feature-based neural AC. Specifically, we consider a version of AC where the actor and critic are represented by overparameterized two-layer neural networks and are updated with two-timescale learning rates. The critic is updated by temporal-difference (TD) learning with a larger stepsize while the actor is updated via proximal policy optimization (PPO) with a smaller stepsize. In the continuous-time and infinite-width limiting regime, when the timescales are properly separated, we prove that neural AC finds the globally optimal policy at a sublinear rate. Additionally, we prove that the feature representation induced by the critic network is allowed to evolve within a neighborhood of the initial one.

NeurIPS Conference 2021 Conference Paper

Who Leads and Who Follows in Strategic Classification?

  • Tijana Zrnic
  • Eric Mazumdar
  • Shankar Sastry
  • Michael Jordan

As predictive models are deployed into the real world, they must increasingly contend with strategic behavior. A growing body of work on strategic classification treats this problem as a Stackelberg game: the decision-maker "leads" in the game by deploying a model, and the strategic agents "follow" by playing their best response to the deployed model. Importantly, in this framing, the burden of learning is placed solely on the decision-maker, while the agents’ best responses are implicitly treated as instantaneous. In this work, we argue that the order of play in strategic classification is fundamentally determined by the relative frequencies at which the decision-maker and the agents adapt to each other’s actions. In particular, by generalizing the standard model to allow both players to learn over time, we show that a decision-maker that makes updates faster than the agents can reverse the order of play, meaning that the agents lead and the decision-maker follows. We observe in standard learning settings that such a role reversal can be desirable for both the decision-maker and the strategic agents. Finally, we show that a decision-maker with the freedom to choose their update frequency can induce learning dynamics that converge to Stackelberg equilibria with either order of play.

AAAI Conference 2020 Conference Paper

Cost-Effective Incentive Allocation via Structured Counterfactual Inference

  • Romain Lopez
  • Chenchen Li
  • Xiang Yan
  • Junwu Xiong
  • Michael Jordan
  • Yuan Qi
  • Le Song

We address a practical problem ubiquitous in modern marketing campaigns, in which a central agent tries to learn a policy for allocating strategic financial incentives to customers and observes only bandit feedback. In contrast to traditional policy optimization frameworks, we take into account the additional reward structure and budget constraints common in this setting, and develop a new two-step method for solving this constrained counterfactual policy optimization problem. Our method first casts the reward estimation problem as a domain adaptation problem with supplementary structure, and then subsequently uses the estimators for optimizing the policy with constraints. We also establish theoretical error bounds for our estimation procedure and we empirically show that the approach leads to significant improvement on both synthetic and real datasets.

NeurIPS Conference 2020 Conference Paper

Decision-Making with Auto-Encoding Variational Bayes

  • Romain Lopez
  • Pierre Boyeau
  • Nir Yosef
  • Michael Jordan
  • Jeffrey Regier

To make decisions based on a model fit with auto-encoding variational Bayes (AEVB), practitioners often let the variational distribution serve as a surrogate for the posterior distribution. This approach yields biased estimates of the expected risk, and therefore leads to poor decisions for two reasons. First, the model fit with AEVB may not equal the underlying data distribution. Second, the variational distribution may not equal the posterior distribution under the fitted model. We explore how fitting the variational distribution based on several objective functions other than the ELBO, while continuing to fit the generative model based on the ELBO, affects the quality of downstream decisions. For the probabilistic principal component analysis model, we investigate how importance sampling error, as well as the bias of the model parameter estimates, varies across several approximate posteriors when used as proposal distributions. Our theoretical results suggest that a posterior approximation distinct from the variational distribution should be used for making decisions. Motivated by these theoretical results, we propose learning several approximate proposals for the best model and combining them using multiple importance sampling for decision-making. In addition to toy examples, we present a full-fledged case study of single-cell RNA sequencing. In this challenging instance of multiple hypothesis testing, our proposed approach surpasses the current state of the art.

NeurIPS Conference 2020 Conference Paper

Fixed-Support Wasserstein Barycenters: Computational Hardness and Fast Algorithm

  • Tianyi Lin
  • Nhat Ho
  • Xi Chen
  • Marco Cuturi
  • Michael Jordan

We study the fixed-support Wasserstein barycenter problem (FS-WBP), which consists in computing the Wasserstein barycenter of $m$ discrete probability measures supported on a finite metric space of size $n$. We show first that the constraint matrix arising from the standard linear programming (LP) representation of the FS-WBP is \textit{not totally unimodular} when $m \geq 3$ and $n \geq 3$. This result resolves an open question pertaining to the relationship between the FS-WBP and the minimum-cost flow (MCF) problem since it proves that the FS-WBP in the standard LP form is not an MCF problem when $m \geq 3$ and $n \geq 3$. We also develop a provably fast \textit{deterministic} variant of the celebrated iterative Bregman projection (IBP) algorithm, named \textsc{FastIBP}, with a complexity bound of $\tilde{O}(mn^{7/3}\varepsilon^{-4/3})$, where $\varepsilon \in (0, 1)$ is the desired tolerance. This complexity bound is better than the best known complexity bound of $\tilde{O}(mn^2\varepsilon^{-2})$ for the IBP algorithm in terms of $\varepsilon$, and that of $\tilde{O}(mn^{5/2}\varepsilon^{-1})$ from accelerated alternating minimization algorithm or accelerated primal-dual adaptive gradient algorithm in terms of $n$. Finally, we conduct extensive experiments with both synthetic data and real images and demonstrate the favorable performance of the \textsc{FastIBP} algorithm in practice.

AAAI Conference 2020 Conference Paper

LS-Tree: Model Interpretation When the Data Are Linguistic

  • Jianbo Chen
  • Michael Jordan

We study the problem of interpreting trained classification models in the setting of linguistic data sets. Leveraging a parse tree, we propose to assign least-squares-based importance scores to each word of an instance by exploiting syntactic constituency structure. We establish an axiomatic characterization of these importance scores by relating them to the Banzhaf value in coalitional game theory. Based on these importance scores, we develop a principled method for detecting and quantifying interactions between words in a sentence. We demonstrate that the proposed method can aid in interpretability and diagnostics for several widely-used language models.

AAAI Conference 2020 Conference Paper

ML-LOO: Detecting Adversarial Examples with Feature Attribution

  • Puyudi Yang
  • Jianbo Chen
  • Cho-Jui Hsieh
  • Jane-Ling Wang
  • Michael Jordan

Deep neural networks obtain state-of-the-art performance on a series of tasks. However, they are easily fooled by adding a small adversarial perturbation to the input. The perturbation is often imperceptible to humans on image data. We observe a significant difference in feature attributions between adversarially crafted examples and original examples. Based on this observation, we introduce a new framework to detect adversarial examples through thresholding a scale estimate of feature attribution scores. Furthermore, we extend our method to include multi-layer feature attributions in order to tackle attacks that have mixed confidence levels. As demonstrated in extensive experiments, our method achieves superior performances in distinguishing adversarial examples from popular attack methods on a variety of real data sets compared to stateof-the-art detection methods. In particular, our method is able to detect adversarial examples of mixed confidence levels, and transfer between different attacking methods. We also show that our method achieves competitive performance even when the attacker has complete access to the detector.

NeurIPS Conference 2020 Conference Paper

On the Theory of Transfer Learning: The Importance of Task Diversity

  • Nilesh Tripuraneni
  • Michael Jordan
  • Chi Jin

We provide new statistical guarantees for transfer learning via representation learning--when transfer is achieved by learning a feature representation shared across different tasks. This enables learning on new tasks using far less data than is required to learn them in isolation. Formally, we consider $t+1$ tasks parameterized by functions of the form $f_j \circ h$ in a general function class $F \circ H$, where each $f_j$ is a task-specific function in $F$ and $h$ is the shared representation in $H$. Letting $C(\cdot)$ denote the complexity measure of the function class, we show that for diverse training tasks (1) the sample complexity needed to learn the shared representation across the first $t$ training tasks scales as $C(H) + t C(F)$, despite no explicit access to a signal from the feature representation and (2) with an accurate estimate of the representation, the sample complexity needed to learn a new task scales only with $C(F)$. Our results depend upon a new general notion of task diversity--applicable to models with general tasks, features, and losses--as well as a novel chain rule for Gaussian complexities. Finally, we exhibit the utility of our general framework in several models of importance in the literature.

NeurIPS Conference 2020 Conference Paper

Projection Robust Wasserstein Distance and Riemannian Optimization

  • Tianyi Lin
  • Chenyou Fan
  • Nhat Ho
  • Marco Cuturi
  • Michael Jordan

Projection robust Wasserstein (PRW) distance, or Wasserstein projection pursuit (WPP), is a robust variant of the Wasserstein distance. Recent work suggests that this quantity is more robust than the standard Wasserstein distance, in particular when comparing probability measures in high-dimensions. However, it is ruled out for practical application because the optimization model is essentially non-convex and non-smooth which makes the computation intractable. Our contribution in this paper is to revisit the original motivation behind WPP/PRW, but take the hard route of showing that, despite its non-convexity and lack of nonsmoothness, and even despite some hardness results proved by~\citet{Niles-2019-Estimation} in a minimax sense, the original formulation for PRW/WPP \textit{can} be efficiently computed in practice using Riemannian optimization, yielding in relevant cases better behavior than its convex relaxation. More specifically, we provide three simple algorithms with solid theoretical guarantee on their complexity bound (one in the appendix), and demonstrate their effectiveness and efficiency by conducing extensive experiments on synthetic and real data. This paper provides a first step into a computational theory of the PRW distance and provides the links between optimal transport and Riemannian optimization.

NeurIPS Conference 2020 Conference Paper

Provably Efficient Reinforcement Learning with Kernel and Neural Function Approximations

  • Zhuoran Yang
  • Chi Jin
  • Zhaoran Wang
  • Mengdi Wang
  • Michael Jordan

Reinforcement learning (RL) algorithms combined with modern function approximators such as kernel functions and deep neural networks have achieved significant empirical successes in large-scale application problems with a massive number of states. From a theoretical perspective, however, RL with functional approximation poses a fundamental challenge to developing algorithms with provable computational and statistical efficiency, due to the need to take into consideration both the exploration-exploitation tradeoff that is inherent in RL and the bias-variance tradeoff that is innate in statistical estimation. To address such a challenge, focusing on the episodic setting where the action-value functions are represented by a kernel function or over-parametrized neural network, we propose the first provable RL algorithm with both polynomial runtime and sample complexity, without additional assumptions on the data-generating model. In particular, for both the kernel and neural settings, we prove that an optimistic modification of the least-squares value iteration algorithm incurs an $\tilde{\mathcal{O}}(\delta_{\cF} H^2 \sqrt{T})$ regret, where $\delta_{\cF}$ characterizes the intrinsic complexity of the function class $\cF$, $H$ is the length of each episode, and $T$ is the total number of episodes. Our regret bounds are independent of the number of states and therefore even allows it to diverge, which exhibits the benefit of function approximation.

NeurIPS Conference 2020 Conference Paper

Robust Optimization for Fairness with Noisy Protected Groups

  • Serena Wang
  • Wenshuo Guo
  • Harikrishna Narasimhan
  • Andrew Cotter
  • Maya Gupta
  • Michael Jordan

Many existing fairness criteria for machine learning involve equalizing some metric across protected groups such as race or gender. However, practitioners trying to audit or enforce such group-based criteria can easily face the problem of noisy or biased protected group information. First, we study the consequences of naively relying on noisy protected group labels: we provide an upper bound on the fairness violations on the true groups $G$ when the fairness criteria are satisfied on noisy groups $\hat{G}$. Second, we introduce two new approaches using robust optimization that, unlike the naive approach of only relying on $\hat{G}$, are guaranteed to satisfy fairness criteria on the true protected groups $G$ while minimizing a training objective. We provide theoretical guarantees that one such approach converges to an optimal feasible solution. Using two case studies, we show empirically that the robust approaches achieve better true group fairness guarantees than the naive approach.

NeurIPS Conference 2020 Conference Paper

Transferable Calibration with Lower Bias and Variance in Domain Adaptation

  • Ximei Wang
  • Mingsheng Long
  • Jianmin Wang
  • Michael Jordan

Domain Adaptation (DA) enables transferring a learning machine from a labeled source domain to an unlabeled target one. While remarkable advances have been made, most of the existing DA methods focus on improving the target accuracy at inference. How to estimate the predictive uncertainty of DA models is vital for decision-making in safety-critical scenarios but remains the boundary to explore. In this paper, we delve into the open problem of Calibration in DA, which is extremely challenging due to the coexistence of domain shift and the lack of target labels. We first reveal the dilemma that DA models learn higher accuracy at the expense of well-calibrated probabilities. Driven by this finding, we propose Transferable Calibration (TransCal) to achieve more accurate calibration with lower bias and variance in a unified hyperparameter-free optimization framework. As a general post-hoc calibration method, TransCal can be easily applied to recalibrate existing DA methods. Its efficacy has been justified both theoretically and empirically.

NeurIPS Conference 2019 Conference Paper

Acceleration via Symplectic Discretization of High-Resolution Differential Equations

  • Bin Shi
  • Simon Du
  • Weijie Su
  • Michael Jordan

We study first-order optimization algorithms obtained by discretizing ordinary differential equations (ODEs) corresponding to Nesterov’s accelerated gradient methods (NAGs) and Polyak’s heavy-ball method. We consider three discretization schemes: symplectic Euler (S), explicit Euler (E) and implicit Euler (I) schemes. We show that the optimization algorithm generated by applying the symplectic scheme to a high-resolution ODE proposed by Shi et al. [2018] achieves the accelerated rate for minimizing both strongly convex function and convex function. On the other hand, the resulting algorithm either fails to achieve acceleration or is impractical when the scheme is implicit, the ODE is low-resolution, or the scheme is explicit.

NeurIPS Conference 2019 Conference Paper

Transferable Normalization: Towards Improving Transferability of Deep Neural Networks

  • Ximei Wang
  • Ying Jin
  • Mingsheng Long
  • Jianmin Wang
  • Michael Jordan

Deep neural networks (DNNs) excel at learning representations when trained on large-scale datasets. Pre-trained DNNs also show strong transferability when fine-tuned to other labeled datasets. However, such transferability becomes weak when the target dataset is fully unlabeled as in Unsupervised Domain Adaptation (UDA). We envision that the loss of transferability may stem from the intrinsic limitation of the architecture design of DNNs. In this paper, we delve into the components of DNN architectures and propose Transferable Normalization (TransNorm) in place of existing normalization techniques. TransNorm is an end-to-end trainable layer to make DNNs more transferable across domains. As a general method, TransNorm can be easily applied to various deep neural networks and domain adaption methods, without introducing any extra hyper-parameters or learnable parameters. Empirical results justify that TransNorm not only improves classification accuracies but also accelerates convergence for mainstream DNN-based domain adaptation methods.

NeurIPS Conference 2018 Conference Paper

Conditional Adversarial Domain Adaptation

  • Mingsheng Long
  • Zhangjie Cao
  • Jianmin Wang
  • Michael Jordan

Adversarial learning has been embedded into deep networks to learn disentangled and transferable representations for domain adaptation. Existing adversarial domain adaptation methods may struggle to align different domains of multimodal distributions that are native in classification problems. In this paper, we present conditional adversarial domain adaptation, a principled framework that conditions the adversarial adaptation models on discriminative information conveyed in the classifier predictions. Conditional domain adversarial networks (CDANs) are designed with two novel conditioning strategies: multilinear conditioning that captures the cross-covariance between feature representations and classifier predictions to improve the discriminability, and entropy conditioning that controls the uncertainty of classifier predictions to guarantee the transferability. Experiments testify that the proposed approach exceeds the state-of-the-art results on five benchmark datasets.

NeurIPS Conference 2018 Conference Paper

Gen-Oja: Simple & Efficient Algorithm for Streaming Generalized Eigenvector Computation

  • Kush Bhatia
  • Aldo Pacchiano
  • Nicolas Flammarion
  • Peter Bartlett
  • Michael Jordan

In this paper, we study the problems of principle Generalized Eigenvector computation and Canonical Correlation Analysis in the stochastic setting. We propose a simple and efficient algorithm for these problems. We prove the global convergence of our algorithm, borrowing ideas from the theory of fast-mixing Markov chains and two-Time-Scale Stochastic Approximation, showing that it achieves the optimal rate of convergence. In the process, we develop tools for understanding stochastic processes with Markovian noise which might be of independent interest.

NeurIPS Conference 2018 Conference Paper

Generalized Zero-Shot Learning with Deep Calibration Network

  • Shichen Liu
  • Mingsheng Long
  • Jianmin Wang
  • Michael Jordan

A technical challenge of deep learning is recognizing target classes without seen data. Zero-shot learning leverages semantic representations such as attributes or class prototypes to bridge source and target classes. Existing standard zero-shot learning methods may be prone to overfitting the seen data of source classes as they are blind to the semantic representations of target classes. In this paper, we study generalized zero-shot learning that assumes accessible to target classes for unseen data during training, and prediction on unseen data is made by searching on both source and target classes. We propose a novel Deep Calibration Network (DCN) approach towards this generalized zero-shot learning paradigm, which enables simultaneous calibration of deep networks on the confidence of source classes and uncertainty of target classes. Our approach maps visual features of images and semantic representations of class prototypes to a common embedding space such that the compatibility of seen data to both source and target classes are maximized. We show superior accuracy of our approach over the state of the art on benchmark datasets for generalized zero-shot learning, including AwA, CUB, SUN, and aPY.

NeurIPS Conference 2018 Conference Paper

Information Constraints on Auto-Encoding Variational Bayes

  • Romain Lopez
  • Jeffrey Regier
  • Michael Jordan
  • Nir Yosef

Parameterizing the approximate posterior of a generative model with neural networks has become a common theme in recent machine learning research. While providing appealing flexibility, this approach makes it difficult to impose or assess structural constraints such as conditional independence. We propose a framework for learning representations that relies on Auto-Encoding Variational Bayes and whose search space is constrained via kernel-based measures of independence. In particular, our method employs the $d$-variable Hilbert-Schmidt Independence Criterion (dHSIC) to enforce independence between the latent representations and arbitrary nuisance factors. We show how to apply this method to a range of problems, including the problems of learning invariant representations and the learning of interpretable representations. We also present a full-fledged application to single-cell RNA sequencing (scRNA-seq). In this setting the biological signal in mixed in complex ways with sequencing errors and sampling effects. We show that our method out-performs the state-of-the-art in this domain.

NeurIPS Conference 2018 Conference Paper

Is Q-Learning Provably Efficient?

  • Chi Jin
  • Zeyuan Allen-Zhu
  • Sebastien Bubeck
  • Michael Jordan

Model-free reinforcement learning (RL) algorithms directly parameterize and update value functions or policies, bypassing the modeling of the environment. They are typically simpler, more flexible to use, and thus more prevalent in modern deep RL than model-based approaches. However, empirical work has suggested that they require large numbers of samples to learn. The theoretical question of whether not model-free algorithms are in fact \emph{sample efficient} is one of the most fundamental questions in RL. The problem is unsolved even in the basic scenario with finitely many states and actions. We prove that, in an episodic MDP setting, Q-learning with UCB exploration achieves regret $\tlO(\sqrt{H^3 SAT})$ where $S$ and $A$ are the numbers of states and actions, $H$ is the number of steps per episode, and $T$ is the total number of steps. Our regret matches the optimal regret up to a single $\sqrt{H}$ factor. Thus we establish the sample efficiency of a classical model-free approach. Moreover, to the best of our knowledge, this is the first model-free analysis to establish $\sqrt{T}$ regret \emph{without} requiring access to a ``simulator. ''

NeurIPS Conference 2018 Conference Paper

On the Local Minima of the Empirical Risk

  • Chi Jin
  • Lydia T. Liu
  • Rong Ge
  • Michael Jordan

Population risk is always of primary interest in machine learning; however, learning algorithms only have access to the empirical risk. Even for applications with nonconvex non-smooth losses (such as modern deep networks), the population risk is generally significantly more well behaved from an optimization point of view than the empirical risk. In particular, sampling can create many spurious local minima. We consider a general framework which aims to optimize a smooth nonconvex function $F$ (population risk) given only access to an approximation $f$ (empirical risk) that is pointwise close to $F$ (i. e. , $\norm{F-f}_{\infty} \le \nu$). Our objective is to find the $\epsilon$-approximate local minima of the underlying function $F$ while avoiding the shallow local minima---arising because of the tolerance $\nu$---which exist only in $f$. We propose a simple algorithm based on stochastic gradient descent (SGD) on a smoothed version of $f$ that is guaranteed to achieve our goal as long as $\nu \le O(\epsilon^{1. 5}/d)$. We also provide an almost matching lower bound showing that our algorithm achieves optimal error tolerance $\nu$ among all algorithms making a polynomial number of queries of $f$. As a concrete example, we show that our results can be directly used to give sample complexities for learning a ReLU unit.

NeurIPS Conference 2018 Conference Paper

Stochastic Cubic Regularization for Fast Nonconvex Optimization

  • Nilesh Tripuraneni
  • Mitchell Stern
  • Chi Jin
  • Jeffrey Regier
  • Michael Jordan

This paper proposes a stochastic variant of a classic algorithm---the cubic-regularized Newton method [Nesterov and Polyak]. The proposed algorithm efficiently escapes saddle points and finds approximate local minima for general smooth, nonconvex functions in only $\mathcal{\tilde{O}}(\epsilon^{-3. 5})$ stochastic gradient and stochastic Hessian-vector product evaluations. The latter can be computed as efficiently as stochastic gradients. This improves upon the $\mathcal{\tilde{O}}(\epsilon^{-4})$ rate of stochastic gradient descent. Our rate matches the best-known result for finding local minima without requiring any delicate acceleration or variance-reduction techniques.

NeurIPS Conference 2018 Conference Paper

Theoretical guarantees for EM under misspecified Gaussian mixture models

  • Raaz Dwivedi
  • nhật Hồ
  • Koulik Khamaru
  • Martin Wainwright
  • Michael Jordan

Recent years have witnessed substantial progress in understanding the behavior of EM for mixture models that are correctly specified. Given that model misspecification is common in practice, it is important to understand EM in this more general setting. We provide non-asymptotic guarantees for population and sample-based EM for parameter estimation under a few specific univariate settings of misspecified Gaussian mixture models. Due to misspecification, the EM iterates no longer converge to the true model and instead converge to the projection of the true model over the set of models being searched over. We provide two classes of theoretical guarantees: first, we characterize the bias introduced due to the misspecification; and second, we prove that population EM converges at a geometric rate to the model projection under a suitable initialization condition. This geometric convergence rate for population EM imply a statistical complexity of order $1/\sqrt{n}$ when running EM with $n$ samples. We validate our theoretical findings in different cases via several numerical examples.

NeurIPS Conference 2017 Conference Paper

Fast Black-box Variational Inference through Stochastic Trust-Region Optimization

  • Jeffrey Regier
  • Michael Jordan
  • Jon McAuliffe

We introduce TrustVI, a fast second-order algorithm for black-box variational inference based on trust-region optimization and the reparameterization trick. At each iteration, TrustVI proposes and assesses a step based on minibatches of draws from the variational distribution. The algorithm provably converges to a stationary point. We implemented TrustVI in the Stan framework and compared it to two alternatives: Automatic Differentiation Variational Inference (ADVI) and Hessian-free Stochastic Gradient Variational Inference (HFSGVI). The former is based on stochastic first-order optimization. The latter uses second-order information, but lacks convergence guarantees. TrustVI typically converged at least one order of magnitude faster than ADVI, demonstrating the value of stochastic second-order information. TrustVI often found substantially better variational distributions than HFSGVI, demonstrating that our convergence theory can matter in practice.

NeurIPS Conference 2017 Conference Paper

Gradient Descent Can Take Exponential Time to Escape Saddle Points

  • Simon Du
  • Chi Jin
  • Jason Lee
  • Michael Jordan
  • Aarti Singh
  • Barnabas Poczos

Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al. , 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions, GD can be significantly slowed down by saddle points, taking exponential time to escape. On the other hand, gradient descent with perturbations [Ge et al. , 2015, Jin et al. , 2017] is not slowed down by saddle points—it can find an approximate local minimizer in polynomial time. This result implies that GD is inherently slower than perturbed GD, and justifies the importance of adding perturbations for efficient non-convex optimization. While our focus is theoretical, we also present experiments that illustrate our theoretical findings.

NeurIPS Conference 2017 Conference Paper

Kernel Feature Selection via Conditional Covariance Minimization

  • Jianbo Chen
  • Mitchell Stern
  • Martin Wainwright
  • Michael Jordan

We propose a method for feature selection that employs kernel-based measures of independence to find a subset of covariates that is maximally predictive of the response. Building on past work in kernel dimension reduction, we show how to perform feature selection via a constrained optimization problem involving the trace of the conditional covariance operator. We prove various consistency results for this procedure, and also demonstrate that our method compares favorably with other state-of-the-art algorithms on a variety of synthetic and real data sets.

NeurIPS Conference 2017 Conference Paper

Non-convex Finite-Sum Optimization Via SCSG Methods

  • Lihua Lei
  • Cheng Ju
  • Jianbo Chen
  • Michael Jordan

We develop a class of algorithms, as variants of the stochastically controlled stochastic gradient (SCSG) methods, for the smooth nonconvex finite-sum optimization problem. Only assuming the smoothness of each component, the complexity of SCSG to reach a stationary point with $E \|\nabla f(x)\|^{2}\le \epsilon$ is $O(\min\{\epsilon^{-5/3}, \epsilon^{-1}n^{2/3}\})$, which strictly outperforms the stochastic gradient descent. Moreover, SCSG is never worse than the state-of-the-art methods based on variance reduction and it significantly outperforms them when the target accuracy is low. A similar acceleration is also achieved when the functions satisfy the Polyak-Lojasiewicz condition. Empirical experiments demonstrate that SCSG outperforms stochastic gradient methods on training multi-layers neural networks in terms of both training and validation loss.

NeurIPS Conference 2017 Conference Paper

Online control of the false discovery rate with decaying memory

  • Aaditya Ramdas
  • Fanny Yang
  • Martin Wainwright
  • Michael Jordan

In the online multiple testing problem, p-values corresponding to different null hypotheses are presented one by one, and the decision of whether to reject a hypothesis must be made immediately, after which the next p-value is presented. Alpha-investing algorithms to control the false discovery rate were first formulated by Foster and Stine and have since been generalized and applied to various settings, varying from quality-preserving databases for science to multiple A/B tests for internet commerce. This paper improves the class of generalized alpha-investing algorithms (GAI) in four ways: (a) we show how to uniformly improve the power of the entire class of GAI procedures under independence by awarding more alpha-wealth for each rejection, giving a near win-win resolution to a dilemma raised by Javanmard and Montanari, (b) we demonstrate how to incorporate prior weights to indicate domain knowledge of which hypotheses are likely to be null or non-null, (c) we allow for differing penalties for false discoveries to indicate that some hypotheses may be more meaningful/important than others, (d) we define a new quantity called the \emph{decaying memory false discovery rate, or $\memfdr$} that may be more meaningful for applications with an explicit time component, using a discount factor to incrementally forget past decisions and alleviate some potential problems that we describe and name ``piggybacking'' and ``alpha-death''. Our GAI++ algorithms incorporate all four generalizations (a, b, c, d) simulatenously, and reduce to more powerful variants of earlier algorithms when the weights and decay are all set to unity.

NeurIPS Conference 2016 Conference Paper

Cyclades: Conflict-free Asynchronous Machine Learning

  • Xinghao Pan
  • Maximilian Lam
  • Stephen Tu
  • Dimitris Papailiopoulos
  • Ce Zhang
  • Michael Jordan
  • Kannan Ramchandran
  • Christopher Ré

We present Cyclades, a general framework for parallelizing stochastic optimization algorithms in a shared memory setting. Cyclades is asynchronous during model updates, and requires no memory locking mechanisms, similar to Hogwild! -type algorithms. Unlike Hogwild! , Cyclades introduces no conflicts during parallel execution, and offers a black-box analysis for provable speedups across a large family of algorithms. Due to its inherent cache locality and conflict-free nature, our multi-core implementation of Cyclades consistently outperforms Hogwild! -type algorithms on sufficiently sparse datasets, leading to up to 40% speedup gains compared to Hogwild! , and up to 5\times gains over asynchronous implementations of variance reduction algorithms.

NeurIPS Conference 2016 Conference Paper

Local Maxima in the Likelihood of Gaussian Mixture Models: Structural Results and Algorithmic Consequences

  • Chi Jin
  • Yuchen Zhang
  • Sivaraman Balakrishnan
  • Martin Wainwright
  • Michael Jordan

We provide two fundamental results on the population (infinite-sample) likelihood function of Gaussian mixture models with $M \geq 3$ components. Our first main result shows that the population likelihood function has bad local maxima even in the special case of equally-weighted mixtures of well-separated and spherical Gaussians. We prove that the log-likelihood value of these bad local maxima can be arbitrarily worse than that of any global optimum, thereby resolving an open question of Srebro (2007). Our second main result shows that the EM algorithm (or a first-order variant of it) with random initialization will converge to bad critical points with probability at least $1-e^{-\Omega(M)}$. We further establish that a first-order variant of EM will not converge to strict saddle points almost surely, indicating that the poor performance of the first-order method can be attributed to the existence of bad local maxima rather than bad saddle points. Overall, our results highlight the necessity of careful initialization when using the EM algorithm in practice, even when applied in highly favorable settings.

AAAI Conference 2016 Conference Paper

The Constrained Laplacian Rank Algorithm for Graph-Based Clustering

  • Feiping Nie
  • Xiaoqian Wang
  • Michael Jordan
  • Heng Huang

Graph-based clustering methods perform clustering on a fixed input data graph. If this initial construction is of low quality then the resulting clustering may also be of low quality. Moreover, existing graph-based clustering methods require post-processing on the data graph to extract the clustering indicators. We address both of these drawbacks by allowing the data graph itself to be adjusted as part of the clustering procedure. In particular, our Constrained Laplacian Rank (CLR) method learns a graph with exactly k connected components (where k is the number of clusters). We develop two versions of this method, based upon the L1-norm and the L2-norm, which yield two new graph-based clustering objectives. We derive optimization algorithms to solve these objectives. Experimental results on synthetic datasets and real-world benchmark datasets exhibit the effectiveness of this new graph-based clustering method.

NeurIPS Conference 2016 Conference Paper

Unsupervised Domain Adaptation with Residual Transfer Networks

  • Mingsheng Long
  • Han Zhu
  • Jianmin Wang
  • Michael Jordan

The recent success of deep neural networks relies on massive amounts of labeled data. For a target task where labeled data is unavailable, domain adaptation can transfer a learner from a different source domain. In this paper, we propose a new approach to domain adaptation in deep networks that can jointly learn adaptive classifiers and transferable features from labeled data in the source domain and unlabeled data in the target domain. We relax a shared-classifier assumption made by previous methods and assume that the source classifier and target classifier differ by a residual function. We enable classifier adaptation by plugging several layers into deep network to explicitly learn the residual function with reference to the target classifier. We fuse features of multiple layers with tensor product and embed them into reproducing kernel Hilbert spaces to match distributions for feature adaptation. The adaptation can be achieved in most feed-forward models by extending them with new residual layers and loss functions, which can be trained efficiently via back-propagation. Empirical evidence shows that the new approach outperforms state of the art methods on standard domain adaptation benchmarks.

NeurIPS Conference 2015 Conference Paper

Linear Response Methods for Accurate Covariance Estimates from Mean Field Variational Bayes

  • Ryan Giordano
  • Tamara Broderick
  • Michael Jordan

Mean field variational Bayes (MFVB) is a popular posterior approximation method due to its fast runtime on large-scale data sets. However, a well known failing of MFVB is that it underestimates the uncertainty of model variables (sometimes severely) and provides no information about model variable covariance. We generalize linear response methods from statistical physics to deliver accurate uncertainty estimates for model variables---both for individual variables and coherently across variables. We call our method linear response variational Bayes (LRVB). When the MFVB posterior approximation is in the exponential family, LRVB has a simple, analytic form, even for non-conjugate models. Indeed, we make no assumptions about the form of the true posterior. We demonstrate the accuracy and scalability of our method on a range of models for both simulated and real data.

NeurIPS Conference 2015 Conference Paper

On the Accuracy of Self-Normalized Log-Linear Models

  • Jacob Andreas
  • Maxim Rabinovich
  • Michael Jordan
  • Dan Klein

Calculation of the log-normalizer is a major computational obstacle in applications of log-linear models with large output spaces. The problem of fast normalizer computation has therefore attracted significant attention in the theoretical and applied machine learning literature. In this paper, we analyze a recently proposed technique known as ``self-normalization'', which introduces a regularization term in training to penalize log normalizers for deviating from zero. This makes it possible to use unnormalized model scores as approximate probabilities. Empirical evidence suggests that self-normalization is extremely effective, but a theoretical understanding of why it should work, and how generally it can be applied, is largely lacking. We prove upper bounds on the loss in accuracy due to self-normalization, describe classes of input distributionsthat self-normalize easily, and construct explicit examples of high-variance input distributions. Our theoretical results make predictions about the difficulty of fitting self-normalized models to several classes of distributions, and we conclude with empirical validation of these predictions on both real and synthetic datasets.

NeurIPS Conference 2015 Conference Paper

Parallel Correlation Clustering on Big Graphs

  • Xinghao Pan
  • Dimitris Papailiopoulos
  • Samet Oymak
  • Benjamin Recht
  • Kannan Ramchandran
  • Michael Jordan

Given a similarity graph between items, correlation clustering (CC) groups similar items together and dissimilar ones apart. One of the most popular CC algorithms is KwikCluster: an algorithm that serially clusters neighborhoods of vertices, and obtains a 3-approximation ratio. Unfortunately, in practice KwikCluster requires a large number of clustering rounds, a potential bottleneck for large graphs. We present C4 and ClusterWild! , two algorithms for parallel correlation clustering that run in a polylogarithmic number of rounds, and provably achieve nearly linear speedups. C4 uses concurrency control to enforce serializability of a parallel clustering process, and guarantees a 3-approximation ratio. ClusterWild! is a coordination free algorithm that abandons consistency for the benefit of better scaling; this leads to a provably small loss in the 3 approximation ratio. We provide extensive experimental results for both algorithms, where we outperform the state of the art, both in terms of clustering accuracy and running time. We show that our algorithms can cluster billion-edge graphs in under 5 seconds on 32 cores, while achieving a 15x speedup.

NeurIPS Conference 2015 Conference Paper

Variational Consensus Monte Carlo

  • Maxim Rabinovich
  • Elaine Angelino
  • Michael Jordan

Practitioners of Bayesian statistics have long depended on Markov chain Monte Carlo (MCMC) to obtain samples from intractable posterior distributions. Unfortunately, MCMC algorithms are typically serial, and do not scale to the large datasets typical of modern machine learning. The recently proposed consensus Monte Carlo algorithm removes this limitation by partitioning the data and drawing samples conditional on each partition in parallel (Scott et al, 2013). A fixed aggregation function then combines these samples, yielding approximate posterior samples. We introduce variational consensus Monte Carlo (VCMC), a variational Bayes algorithm that optimizes over aggregation functions to obtain samples from a distribution that better approximates the target. The resulting objective contains an intractable entropy term; we therefore derive a relaxation of the objective and show that the relaxed problem is blockwise concave under mild conditions. We illustrate the advantages of our algorithm on three inference tasks from the literature, demonstrating both the superior quality of the posterior approximation and the moderate overhead of the optimization step. Our algorithm achieves a relative error reduction (measured against serial MCMC) of up to 39% compared to consensus Monte Carlo on the task of estimating 300-dimensional probit regression parameter expectations; similarly, it achieves an error reduction of 92% on the task of estimating cluster comembership probabilities in a Gaussian mixture model with 8 components in 8 dimensions. Furthermore, these gains come at moderate cost compared to the runtime of serial MCMC, achieving near-ideal speedup in some instances.

NeurIPS Conference 2014 Conference Paper

Communication-Efficient Distributed Dual Coordinate Ascent

  • Martin Jaggi
  • Virginia Smith
  • Martin Takac
  • Jonathan Terhorst
  • Sanjay Krishnan
  • Thomas Hofmann
  • Michael Jordan

Communication remains the most significant bottleneck in the performance of distributed optimization algorithms for large-scale machine learning. In this paper, we propose a communication-efficient framework, COCOA, that uses local computation in a primal-dual setting to dramatically reduce the amount of necessary communication. We provide a strong convergence rate analysis for this class of algorithms, as well as experiments on real-world distributed datasets with implementations in Spark. In our experiments, we find that as compared to state-of-the-art mini-batch versions of SGD and SDCA algorithms, COCOA converges to the same. 001-accurate solution quality on average 25× as quickly.

NeurIPS Conference 2014 Conference Paper

On the Convergence Rate of Decomposable Submodular Function Minimization

  • Robert Nishihara
  • Stefanie Jegelka
  • Michael Jordan

Submodular functions describe a variety of discrete problems in machine learning, signal processing, and computer vision. However, minimizing submodular functions poses a number of algorithmic challenges. Recent work introduced an easy-to-use, parallelizable algorithm for minimizing submodular functions that decompose as the sum of simple" submodular functions. Empirically, this algorithm performs extremely well, but no theoretical analysis was given. In this paper, we show that the algorithm converges linearly, and we provide upper and lower bounds on the rate of convergence. Our proof relies on the geometry of submodular polyhedra and draws on results from spectral graph theory. "

NeurIPS Conference 2014 Conference Paper

Parallel Double Greedy Submodular Maximization

  • Xinghao Pan
  • Stefanie Jegelka
  • Joseph Gonzalez
  • Joseph Bradley
  • Michael Jordan

Many machine learning problems can be reduced to the maximization of submodular functions. Although well understood in the serial setting, the parallel maximization of submodular functions remains an open area of research with recent results only addressing monotone functions. The optimal algorithm for maximizing the more general class of non-monotone submodular functions was introduced by Buchbinder et al. and follows a strongly serial double-greedy logic and program analysis. In this work, we propose two methods to parallelize the double-greedy algorithm. The first, coordination-free approach emphasizes speed at the cost of a weaker approximation guarantee. The second, concurrency control approach guarantees a tight 1/2-approximation, at the quantifiable cost of additional coordination and reduced parallelism. As a consequence we explore the trade off space between guaranteed performance and objective optimality. We implement and evaluate both algorithms on multi-core hardware and billion edge graphs, demonstrating both the scalability and tradeoffs of each approach.

NeurIPS Conference 2014 Conference Paper

Spectral Methods meet EM: A Provably Optimal Algorithm for Crowdsourcing

  • Yuchen Zhang
  • Xi Chen
  • Dengyong Zhou
  • Michael Jordan

The Dawid-Skene estimator has been widely used for inferring the true labels from the noisy labels provided by non-expert crowdsourcing workers. However, since the estimator maximizes a non-convex log-likelihood function, it is hard to theoretically justify its performance. In this paper, we propose a two-stage efficient algorithm for multi-class crowd labeling problems. The first stage uses the spectral method to obtain an initial estimate of parameters. Then the second stage refines the estimation by optimizing the objective function of the Dawid-Skene estimator via the EM algorithm. We show that our algorithm achieves the optimal convergence rate up to a logarithmic factor. We conduct extensive experiments on synthetic and real datasets. Experimental results demonstrate that the proposed algorithm is comparable to the most accurate empirical approach, while outperforming several other recently proposed methods.

NeurIPS Conference 2013 Conference Paper

A Comparative Framework for Preconditioned Lasso Algorithms

  • Fabian Wauthier
  • Nebojsa Jojic
  • Michael Jordan

The Lasso is a cornerstone of modern multivariate data analysis, yet its performance suffers in the common situation in which covariates are correlated. This limitation has led to a growing number of \emph{Preconditioned Lasso} algorithms that pre-multiply $X$ and $y$ by matrices $P_X$, $P_y$ prior to running the standard Lasso. A direct comparison of these and similar Lasso-style algorithms to the original Lasso is difficult because the performance of all of these methods depends critically on an auxiliary penalty parameter $\lambda$. In this paper we propose an agnostic, theoretical framework for comparing Preconditioned Lasso algorithms to the Lasso without having to choose $\lambda$. We apply our framework to three Preconditioned Lasso instances and highlight when they will outperform the Lasso. Additionally, our theory offers insights into the fragilities of these algorithms to which we provide partial solutions.

RLDM Conference 2013 Conference Abstract

Dirichlet Process Reinforcement Learning

  • Teodor Mihai Moldovan
  • Michael Jordan
  • Pieter Abbeel

We consider the problem of model-based reinforcement learning in continuous state-action spaces. The key ingredients of our algorithm are: (i) To model the (initially unknown) dynamics our algorithm uses a Dirichlet process mixture of linear models. We present a novel method for on-line inference with this model which enables to learn continuously at high frequency. (ii) To address the exploration-exploitation trade-off, we describe how to adapt BOLT [1], an established method for discrete systems, to make it prac- tical for continuous systems. (iii) Efficient control is possible by relying on recent advances in sequential quadratic programming (SQP). Our algorithm is highly automated, requiring only two scalar parameters, and it is designed for parallel computation. Experiments show that it can solve the classical cartpole and under-actuated pendulum swing-up tasks as well as a new helicopter 180-degree flip followed by inverted hover task with minimal re-tuning of these two parameters for each new system.

NeurIPS Conference 2013 Conference Paper

Estimation, Optimization, and Parallelism when Data is Sparse

  • John Duchi
  • Michael Jordan
  • Brendan McMahan

We study stochastic optimization problems when the \emph{data} is sparse, which is in a sense dual to the current understanding of high-dimensional statistical learning and optimization. We highlight both the difficulties---in terms of increased sample complexity that sparse data necessitates---and the potential benefits, in terms of allowing parallelism and asynchrony in the design of algorithms. Concretely, we derive matching upper and lower bounds on the minimax rate for optimization and learning with sparse data, and we exhibit algorithms achieving these rates. Our algorithms are adaptive: they achieve the best possible rate for the data observed. We also show how leveraging sparsity leads to (still minimax optimal) parallel and asynchronous algorithms, providing experimental evidence complementing our theoretical results on medium to large-scale learning tasks.

NeurIPS Conference 2013 Conference Paper

Information-theoretic lower bounds for distributed statistical estimation with communication constraints

  • Yuchen Zhang
  • John Duchi
  • Michael Jordan
  • Martin Wainwright

We establish minimax risk lower bounds for distributed statistical estimation given a budget $B$ of the total number of bits that may be communicated. Such lower bounds in turn reveal the minimum amount of communication required by any procedure to achieve the classical optimal rate for statistical estimation. We study two classes of protocols in which machines send messages either independently or interactively. The lower bounds are established for a variety of problems, from estimating the mean of a population to estimating parameters in linear regression or binary classification.

NeurIPS Conference 2013 Conference Paper

Local Privacy and Minimax Bounds: Sharp Rates for Probability Estimation

  • John Duchi
  • Martin Wainwright
  • Michael Jordan

We provide a detailed study of the estimation of probability distributions---discrete and continuous---in a stringent setting in which data is kept private even from the statistician. We give sharp minimax rates of convergence for estimation in these locally private settings, exhibiting fundamental tradeoffs between privacy and convergence rate, as well as providing tools to allow movement along the privacy-statistical efficiency continuum. One of the consequences of our results is that Warner's classical work on randomized response is an optimal way to perform survey sampling while maintaining privacy of the respondents.

NeurIPS Conference 2013 Conference Paper

Optimistic Concurrency Control for Distributed Unsupervised Learning

  • Xinghao Pan
  • Joseph Gonzalez
  • Stefanie Jegelka
  • Tamara Broderick
  • Michael Jordan

Research on distributed machine learning algorithms has focused primarily on one of two extremes---algorithms that obey strict concurrency constraints or algorithms that obey few or no such constraints. We consider an intermediate alternative in which algorithms optimistically assume that conflicts are unlikely and if conflicts do arise a conflict-resolution protocol is invoked. We view this optimistic concurrency control'' paradigm as particularly appropriate for large-scale machine learning algorithms, particularly in the unsupervised setting. We demonstrate our approach in three problem areas: clustering, feature learning and online facility location. We evaluate our methods via large-scale experiments in a cluster computing environment. "

NeurIPS Conference 2013 Conference Paper

Streaming Variational Bayes

  • Tamara Broderick
  • Nicholas Boyd
  • Andre Wibisono
  • Ashia Wilson
  • Michael Jordan

We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation primitive function. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI), both in the single-pass setting SVI was designed for and in the streaming setting, to which SVI does not apply.

NeurIPS Conference 2012 Conference Paper

Ancestor Sampling for Particle Gibbs

  • Fredrik Lindsten
  • Thomas Schön
  • Michael Jordan

We present a novel method in the family of particle MCMC methods that we refer to as particle Gibbs with ancestor sampling (PG-AS). Similarly to the existing PG with backward simulation (PG-BS) procedure, we use backward sampling to (considerably) improve the mixing of the PG kernel. Instead of using separate forward and backward sweeps as in PG-BS, however, we achieve the same effect in a single forward sweep. We apply the PG-AS framework to the challenging class of non-Markovian state-space models. We develop a truncation strategy of these models that is applicable in principle to any backward-simulation-based method, but which is particularly well suited to the PG-AS framework. In particular, as we show in a simulation study, PG-AS can yield an order-of-magnitude improved accuracy relative to PG-BS due to its robustness to the truncation error. Several application examples are discussed, including Rao-Blackwellized particle smoothing and inference in degenerate state-space models.

NeurIPS Conference 2012 Conference Paper

Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

  • Andre Wibisono
  • Martin Wainwright
  • Michael Jordan
  • John Duchi

We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that use gradient estimates based on random perturbations suffer a factor of at most $\sqrt{\dim}$ in convergence rate over traditional stochastic gradient methods, where $\dim$ is the dimension of the problem. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problem-dependent quantities: they cannot be improved by more than constant factors.

NeurIPS Conference 2012 Conference Paper

Privacy Aware Learning

  • Martin Wainwright
  • Michael Jordan
  • John Duchi

We study statistical risk minimization problems under a version of privacy in which the data is kept confidential even from the learner. In this local privacy framework, we show sharp upper and lower bounds on the convergence rates of statistical estimation procedures. As a consequence, we exhibit a precise tradeoff between the amount of privacy the data preserves and the utility, measured by convergence rate, of any statistical estimator.

NeurIPS Conference 2012 Conference Paper

Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

  • Ke Jiang
  • Brian Kulis
  • Michael Jordan

Links between probabilistic and non-probabilistic learning algorithms can arise by performing small-variance asymptotics, i. e. , letting the variance of particular distributions in a graphical model go to zero. For instance, in the context of clustering, such an approach yields precise connections between the k-means and EM algorithms. In this paper, we explore small-variance asymptotics for exponential family Dirichlet process (DP) and hierarchical Dirichlet process (HDP) mixture models. Utilizing connections between exponential family distributions and Bregman divergences, we derive novel clustering algorithms from the asymptotic limit of the DP and HDP mixtures that feature the scalability of existing hard clustering methods as well as the flexibility of Bayesian nonparametric models. We focus on special cases of our analysis for discrete-data problems, including topic modeling, and we demonstrate the utility of our results by applying variants of our algorithms to problems arising in vision and document analysis.

NeurIPS Conference 2011 Conference Paper

Bayesian Bias Mitigation for Crowdsourcing

  • Fabian Wauthier
  • Michael Jordan

Biased labelers are a systemic problem in crowdsourcing, and a comprehensive toolbox for handling their responses is still being developed. A typical crowdsourcing application can be divided into three steps: data collection, data curation, and learning. At present these steps are often treated separately. We present Bayesian Bias Mitigation for Crowdsourcing (BBMC), a Bayesian model to unify all three. Most data curation methods account for the {\it effects} of labeler bias by modeling all labels as coming from a single latent truth. Our model captures the {\it sources} of bias by describing labelers as influenced by shared random effects. This approach can account for more complex bias patterns that arise in ambiguous or hard labeling tasks and allows us to merge data curation and learning into a single computation. Active learning integrates data collection with learning, but is commonly considered infeasible with Gibbs sampling inference. We propose a general approximation strategy for Markov chains to efficiently quantify the effect of a perturbation on the stationary distribution and specialize this approach to active learning. Experiments show BBMC to outperform many common heuristics.

NeurIPS Conference 2011 Conference Paper

Divide-and-Conquer Matrix Factorization

  • Lester Mackey
  • Michael Jordan
  • Ameet Talwalkar

This work introduces Divide-Factor-Combine (DFC), a parallel divide-and-conquer framework for noisy matrix factorization. DFC divides a large-scale matrix factorization task into smaller subproblems, solves each subproblem in parallel using an arbitrary base matrix factorization algorithm, and combines the subproblem solutions using techniques from randomized matrix approximation. Our experiments with collaborative filtering, video background modeling, and simulated data demonstrate the near-linear to super-linear speed-ups attainable with this approach. Moreover, our analysis shows that DFC enjoys high-probability recovery guarantees comparable to those of its base algorithm.

NeurIPS Conference 2010 Conference Paper

Heavy-Tailed Process Priors for Selective Shrinkage

  • Fabian Wauthier
  • Michael Jordan

Heavy-tailed distributions are often used to enhance the robustness of regression and classification methods to outliers in output space. Often, however, we are confronted with ``outliers'' in input space, which are isolated observations in sparsely populated regions. We show that heavy-tailed process priors (which we construct from Gaussian processes via a copula), can be used to improve robustness of regression and classification estimators to such outliers by selectively shrinking them more strongly in sparse regions than in dense regions. We carry out a theoretical analysis to show that selective shrinkage occurs provided the marginals of the heavy-tailed process have sufficiently heavy tails. The analysis is complemented by experiments on biological data which indicate significant improvements of estimates in sparse regions while producing competitive results in dense regions.

NeurIPS Conference 2010 Conference Paper

Random Conic Pursuit for Semidefinite Programming

  • Ariel Kleiner
  • Ali Rahimi
  • Michael Jordan

We present a novel algorithm, Random Conic Pursuit, that solves semidefinite programs (SDPs) via repeated optimization over randomly selected two-dimensional subcones of the PSD cone. This scheme is simple, easily implemented, applicable to very general SDPs, scalable, and theoretically interesting. Its advantages are realized at the expense of an ability to readily compute highly exact solutions, though useful approximate solutions are easily obtained. This property renders Random Conic Pursuit of particular interest for machine learning applications, in which the relevant SDPs are generally based upon random data and so exact minima are often not a priority. Indeed, we present empirical results to this effect for various SDPs encountered in machine learning; these experiments demonstrate the potential practical usefulness of Random Conic Pursuit. We also provide a preliminary analysis that yields insight into the theoretical properties and convergence of the algorithm.

NeurIPS Conference 2010 Conference Paper

Tree-Structured Stick Breaking for Hierarchical Data

  • Zoubin Ghahramani
  • Michael Jordan
  • Ryan Adams

Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data.

NeurIPS Conference 2010 Conference Paper

Unsupervised Kernel Dimension Reduction

  • Meihong Wang
  • Fei Sha
  • Michael Jordan

We apply the framework of kernel dimension reduction, originally designed for supervised problems, to unsupervised dimensionality reduction. In this framework, kernel-based measures of independence are used to derive low-dimensional representations that maximally capture information in covariates in order to predict responses. We extend this idea and develop similarly motivated measures for unsupervised problems where covariates and responses are the same. Our empirical studies show that the resulting compact representation yields meaningful and appealing visualization and clustering of data. Furthermore, when used in conjunction with supervised learners for classification, our methods lead to lower classification errors than state-of-the-art methods, especially when embedding data in spaces of very few dimensions.

NeurIPS Conference 2010 Conference Paper

Variational Inference over Combinatorial Spaces

  • Alexandre Bouchard-Côté
  • Michael Jordan

Since the discovery of sophisticated fully polynomial randomized algorithms for a range of #P problems (Karzanov et al. , 1991; Jerrum et al. , 2001; Wilson, 2004), theoretical work on approximate inference in combinatorial spaces has focused on Markov chain Monte Carlo methods. Despite their strong theoretical guarantees, the slow running time of many of these randomized algorithms and the restrictive assumptions on the potentials have hindered the applicability of these algorithms to machine learning. Because of this, in applications to combinatorial spaces simple exact models are often preferred to more complex models that require approximate inference (Siepel et al. , 2004). Variational inference would appear to provide an appealing alternative, given the success of variational methods for graphical models (Wainwright et al. , 2008); unfortunately, however, it is not obvious how to develop variational approximations for combinatorial objects such as matchings, partial orders, plane partitions and sequence alignments. We propose a new framework that extends variational inference to a wide range of combinatorial spaces. Our method is based on a simple assumption: the existence of a tractable measure factorization, which we show holds in many examples. Simulations on a range of matching models show that the algorithm is more general and empirically faster than a popular fully polynomial randomized algorithm. We also apply the framework to the problem of multiple alignment of protein sequences, obtaining state-of-the-art results on the BAliBASE dataset (Thompson et al. , 1999).

NeurIPS Conference 2009 Conference Paper

Asymptotically Optimal Regularization in Smooth Parametric Models

  • Percy Liang
  • Guillaume Bouchard
  • Francis Bach
  • Michael Jordan

Many types of regularization schemes have been employed in statistical learning, each one motivated by some assumption about the problem domain. In this paper, we present a unified asymptotic analysis of smooth regularizers, which allows us to see how the validity of these assumptions impacts the success of a particular regularizer. In addition, our analysis motivates an algorithm for optimizing regularization parameters, which in turn can be analyzed within our framework. We apply our analysis to several examples, including hybrid generative-discriminative learning and multi-task learning.

NeurIPS Conference 2009 Conference Paper

Nonparametric Latent Feature Models for Link Prediction

  • Kurt Miller
  • Michael Jordan
  • Thomas Griffiths

As the availability and importance of relational data -- such as the friendships summarized on a social networking website -- increases, it becomes increasingly important to have good models for such data. The kinds of latent structure that have been considered for use in predicting links in such networks have been relatively limited. In particular, the machine learning community has focused on latent class models, adapting nonparametric Bayesian methods to jointly infer how many latent classes there are while learning which entities belong to each class. We pursue a similar approach with a richer kind of latent variable -- latent features -- using a nonparametric Bayesian technique to simultaneously infer the number of features at the same time we learn which entities have each feature. The greater expressiveness of this approach allows us to improve link prediction on three datasets.

NeurIPS Conference 2009 Conference Paper

Sharing Features among Dynamical Systems with Beta Processes

  • Emily Fox
  • Michael Jordan
  • Erik Sudderth
  • Alan Willsky

We propose a Bayesian nonparametric approach to relating multiple time series via a set of latent, dynamical behaviors. Using a beta process prior, we allow data-driven selection of the size of this set, as well as the pattern with which behaviors are shared among time series. Via the Indian buffet process representation of the beta process predictive distributions, we develop an exact Markov chain Monte Carlo inference method. In particular, our approach uses the sum-product algorithm to efficiently compute Metropolis-Hastings acceptance probabilities, and explores new dynamical behaviors via birth/death proposals. We validate our sampling algorithm using several synthetic datasets, and also demonstrate promising unsupervised segmentation of visual motion capture data.

NeurIPS Conference 2008 Conference Paper

DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification

  • Simon Lacoste-Julien
  • Fei Sha
  • Michael Jordan

Probabilistic topic models (and their extensions) have become popular as models of latent structures in collections of text documents or images. These models are usually treated as generative models and trained using maximum likelihood estimation, an approach which may be suboptimal in the context of an overall classification problem. In this paper, we describe DiscLDA, a discriminative learning framework for such models as Latent Dirichlet Allocation (LDA) in the setting of dimensionality reduction with supervised side information. In DiscLDA, a class-dependent linear transformation is introduced on the topic mixture proportions. This parameter is estimated by maximizing the conditional likelihood using Monte Carlo EM. By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification. We compare the predictive power of the latent structure of DiscLDA with unsupervised LDA on the 20 Newsgroup ocument classification task.

NeurIPS Conference 2008 Conference Paper

Efficient Inference in Phylogenetic InDel Trees

  • Alexandre Bouchard-Côté
  • Dan Klein
  • Michael Jordan

Accurate and efficient inference in evolutionary trees is a central problem in computational biology. Realistic models require tracking insertions and deletions along the phylogenetic tree, making inference challenging. We propose new sampling techniques that speed up inference and improve the quality of the samples. We compare our method to previous approaches and show performance improvement on metrics evaluating multiple sequence alignment and reconstruction of ancestral sequences.

NeurIPS Conference 2008 Conference Paper

High-dimensional support union recovery in multivariate regression

  • Guillaume Obozinski
  • Martin Wainwright
  • Michael Jordan

We study the behavior of block (cid: 96)1/(cid: 96)2 regularization for multivariate regression, where a K-dimensional response vector is regressed upon a fixed set of p co- variates. The problem of support union recovery is to recover the subset of covariates that are active in at least one of the regression problems. Study- ing this problem under high-dimensional scaling (where the problem parame- ters as well as sample size n tend to infinity simultaneously), our main result is to show that exact recovery is possible once the order parameter given by θ(cid: 96)1/(cid: 96)2(n, p, s): = n/[2ψ(B∗) log(p − s)] exceeds a critical threshold. Here n is the sample size, p is the ambient dimension of the regression model, s is the size of the union of supports, and ψ(B∗) is a sparsity-overlap function that measures a combination of the sparsities and overlaps of the K-regression coefficient vectors that constitute the model. This sparsity-overlap function reveals that block (cid: 96)1/(cid: 96)2 regularization for multivariate regression never harms performance relative to a naive (cid: 96)1-approach, and can yield substantial improvements in sample complexity (up to a factor of K) when the regression vectors are suitably orthogonal rela- tive to the design. We complement our theoretical results with simulations that demonstrate the sharpness of the result, even for relatively small problems.

NeurIPS Conference 2008 Conference Paper

Nonparametric Bayesian Learning of Switching Linear Dynamical Systems

  • Emily Fox
  • Erik Sudderth
  • Michael Jordan
  • Alan Willsky

Many nonlinear dynamical phenomena can be effectively modeled by a system that switches among a set of conditionally linear dynamical modes. We consider two such models: the switching linear dynamical system (SLDS) and the switching vector autoregressive (VAR) process. In this paper, we present a nonparametric approach to the learning of an unknown number of persistent, smooth dynamical modes by utilizing a hierarchical Dirichlet process prior. We develop a sampling algorithm that combines a truncated approximation to the Dirichlet process with an efficient joint sampling of the mode and state sequences. The utility and flexibility of our model are demonstrated on synthetic data, sequences of dancing honey bees, and the IBOVESPA stock index.

NeurIPS Conference 2008 Conference Paper

Posterior Consistency of the Silverman g-prior in Bayesian Model Choice

  • Zhihua Zhang
  • Michael Jordan
  • Dit-Yan Yeung

Kernel supervised learning methods can be unified by utilizing the tools from regularization theory. The duality between regularization and prior leads to interpreting regularization methods in terms of maximum a posteriori estimation and has motivated Bayesian interpretations of kernel methods. In this paper we pursue a Bayesian interpretation of sparsity in the kernel setting by making use of a mixture of a point-mass distribution and prior that we refer to as ``Silverman's g-prior. '' We provide a theoretical analysis of the posterior consistency of a Bayesian model choice procedure based on this prior. We also establish the asymptotic relationship between this procedure and the Bayesian information criterion.

NeurIPS Conference 2008 Conference Paper

Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes

  • Erik Sudderth
  • Michael Jordan

We develop a statistical framework for the simultaneous, unsupervised segmentation and discovery of visual object categories from image databases. Examining a large set of manually segmented scenes, we use chi--square tests to show that object frequencies and segment sizes both follow power law distributions, which are well modeled by the Pitman--Yor (PY) process. This nonparametric prior distribution leads to learning algorithms which discover an unknown set of objects, and segmentation methods which automatically adapt their resolution to each image. Generalizing previous applications of PY processes, we use Gaussian processes to discover spatially contiguous segments which respect image boundaries. Using a novel family of variational approximations, our approach produces segmentations which compare favorably to state--of--the--art methods, while simultaneously discovering categories shared among natural scenes.

NeurIPS Conference 2008 Conference Paper

Spectral Clustering with Perturbed Data

  • Ling Huang
  • Donghui Yan
  • Nina Taft
  • Michael Jordan

Spectral clustering is useful for a wide-ranging set of applications in areas such as biological data analysis, image processing and data mining. However, the computational and/or communication resources required by the method in processing large-scale data sets are often prohibitively high, and practitioners are often required to perturb the original data in various ways (quantization, downsampling, etc) before invoking a spectral algorithm. In this paper, we use stochastic perturbation theory to study the effects of data perturbation on the performance of spectral clustering. We show that the error under perturbation of spectral clustering is closely related to the perturbation of the eigenvectors of the Laplacian matrix. From this result we derive approximate upper bounds on the clustering error. We show that this bound is tight empirically across a wide range of problems, suggesting that it can be used in practical settings to determine the amount of data reduction allowed in order to meet a specification of permitted loss in clustering performance.

NeurIPS Conference 2007 Conference Paper

Agreement-Based Learning

  • Percy Liang
  • Dan Klein
  • Michael Jordan

The learning of probabilistic models with many hidden variables and non- decomposable dependencies is an important and challenging problem. In contrast to traditional approaches based on approximate inference in a single intractable model, our approach is to train a set of tractable submodels by encouraging them to agree on the hidden variables. This allows us to capture non-decomposable aspects of the data while still maintaining tractability. We propose an objective function for our approach, derive EM-style algorithms for parameter estimation, and demonstrate their effectiveness on three challenging real-world learning tasks.

NeurIPS Conference 2007 Conference Paper

Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization

  • XuanLong Nguyen
  • Martin Wainwright
  • Michael Jordan

We develop and analyze an algorithm for nonparametric estimation of divergence functionals and the density ratio of two probability distributions. Our method is based on a variational characterization of f-divergences, which turns the estima- tion into a penalized convex risk minimization problem. We present a derivation of our kernel-based estimation algorithm and an analysis of convergence rates for the estimator. Our simulation results demonstrate the convergence behavior of the method, which compares favorably with existing methods in the literature.

NeurIPS Conference 2007 Conference Paper

Feature Selection Methods for Improving Protein Structure Prediction with Rosetta

  • Ben Blum
  • David Baker
  • Michael Jordan
  • Philip Bradley
  • Rhiju Das
  • David Kim

Rosetta is one of the leading algorithms for protein structure prediction today. It is a Monte Carlo energy minimization method requiring many random restarts to find structures with low energy. In this paper we present a resampling technique for structure prediction of small alpha/beta proteins using Rosetta. From an ini- tial round of Rosetta sampling, we learn properties of the energy landscape that guide a subsequent round of sampling toward lower-energy structures. Rather than attempt to fit the full energy landscape, we use feature selection methods—both L1-regularized linear regression and decision trees—to identify structural features that give rise to low energy. We then enrich these structural features in the second sampling round. Results are presented across a benchmark set of nine small al- pha/beta proteins demonstrating that our methods seldom impair, and frequently improve, Rosetta’s performance.

NeurIPS Conference 2006 Conference Paper

In-Network PCA and Anomaly Detection

  • Ling Huang
  • XuanLong Nguyen
  • Minos Garofalakis
  • Michael Jordan
  • Anthony Joseph
  • Nina Taft

We consider the problem of network anomaly detection in large distributed systems. In this setting, Principal Component Analysis (PCA) has been proposed as a method for discover- ing anomalies by continuously tracking the projection of the data onto a residual subspace. This method was shown to work well empirically in highly aggregated networks, that is, those with a limited number of large nodes and at coarse time scales. This approach, how- ever, has scalability limitations. To overcome these limitations, we develop a PCA-based anomaly detector in which adaptive local data (cid: 2)lters send to a coordinator just enough data to enable accurate global detection. Our method is based on a stochastic matrix perturba- tion analysis that characterizes the tradeoff between the accuracy of anomaly detection and the amount of data communicated over the network.

NeurIPS Conference 2005 Conference Paper

Divergences, surrogate loss functions and experimental design

  • XuanLong Nguyen
  • Martin Wainwright
  • Michael Jordan

In this paper, we provide a general theorem that establishes a correspon- dence between surrogate loss functions in classification and the family of f-divergences. Moreover, we provide constructive procedures for determining the f-divergence induced by a given surrogate loss, and conversely for finding all surrogate loss functions that realize a given f-divergence. Next we introduce the notion of universal equivalence among loss functions and corresponding f-divergences, and provide nec- essary and sufficient conditions for universal equivalence to hold. These ideas have applications to classification problems that also involve a com- ponent of experiment design; in particular, we leverage our results to prove consistency of a procedure for learning a classifier under decen- tralization requirements.

NeurIPS Conference 2005 Conference Paper

Robust design of biological experiments

  • Patrick Flaherty
  • Adam Arkin
  • Michael Jordan

We address the problem of robust, computationally-efficient design of biological experiments. Classical optimal experiment design methods have not been widely adopted in biological practice, in part because the resulting designs can be very brittle if the nominal parameter estimates for the model are poor, and in part because of computational constraints. We present a method for robust experiment design based on a semidefinite programming relaxation. We present an application of this method to the design of experiments for a complex calcium signal transduction pathway, where we have found that the parameter estimates obtained from the robust design are better than those obtained from an "optimal" design.

NeurIPS Conference 2005 Conference Paper

Structured Prediction via the Extragradient Method

  • Ben Taskar
  • Simon Lacoste-Julien
  • Michael Jordan

We present a simple and scalable algorithm for large-margin estima- tion of structured models, including an important class of Markov net- works and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem and apply the extragradient method, yielding an algorithm with linear convergence using simple gra- dient and projection calculations. The projection step can be solved us- ing combinatorial algorithms for min-cost quadratic flow. This makes the approach an efficient alternative to formulations based on reductions to a quadratic program (QP). We present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm.

NeurIPS Conference 2004 Conference Paper

A Direct Formulation for Sparse PCA Using Semidefinite Programming

  • Alexandre d'Aspremont
  • Laurent Ghaoui
  • Michael Jordan
  • Gert Lanckriet

We examine the problem of approximating, in the Frobenius-norm sense, a positive, semidefinite symmetric matrix by a rank-one matrix, with an upper bound on the cardinality of its eigenvector. The problem arises in the decomposition of a covariance matrix into sparse factors, and has wide applications ranging from biology to finance. We use a modifica- tion of the classical variational representation of the largest eigenvalue of a symmetric matrix, where cardinality is constrained, and derive a semidefinite programming based relaxation for our problem.

NeurIPS Conference 2004 Conference Paper

Blind One-microphone Speech Separation: A Spectral Learning Approach

  • Francis Bach
  • Michael Jordan

We present an algorithm to perform blind, one-microphone speech sep- aration. Our algorithm separates mixtures of speech without modeling individual speakers. Instead, we formulate the problem of speech sep- aration as a problem in segmenting the spectrogram of the signal into two or more disjoint sets. We build feature sets for our segmenter using classical cues from speech psychophysics. We then combine these fea- tures into parameterized affinity matrices. We also take advantage of the fact that we can generate training examples for segmentation by artifi- cially superposing separately-recorded signals. Thus the parameters of the affinity matrices can be tuned using recent work on learning spectral clustering [1]. This yields an adaptive, speech-specific segmentation al- gorithm that can successfully separate one-microphone speech mixtures.

NeurIPS Conference 2004 Conference Paper

Computing regularization paths for learning multiple kernels

  • Francis Bach
  • Romain Thibaux
  • Michael Jordan

The problem of learning a sparse conic combination of kernel functions or kernel matrices for classification or regression can be achieved via the regularization by a block 1-norm [1]. In this paper, we present an al- gorithm that computes the entire regularization path for these problems. The path is obtained by using numerical continuation techniques, and involves a running time complexity that is a constant times the complex- ity of solving the problem for one value of the regularization parameter. Working in the setting of kernel linear regression and kernel logistic re- gression, we show empirically that the effect of the block 1-norm reg- ularization differs notably from the (non-block) 1-norm regularization commonly used for variable selection, and that the regularization path is of particular value in the block case.

NeurIPS Conference 2004 Conference Paper

Semi-supervised Learning via Gaussian Processes

  • Neil Lawrence
  • Michael Jordan

We present a probabilistic approach to learning a Gaussian Process classifier in the presence of unlabeled data. Our approach involves a "null category noise model" (NCNM) inspired by ordered cate- gorical noise models. The noise model reflects an assumption that the data density is lower between the class-conditional densities. We illustrate our approach on a toy problem and present compar- ative results for the semi-supervised classification of handwritten digits.

NeurIPS Conference 2004 Conference Paper

Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes

  • Yee Teh
  • Michael Jordan
  • Matthew Beal
  • David Blei

We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data. Each group of data is modeled with a mixture, with the number of components being open-ended and inferred automatically by the model. Further, components can be shared across groups, allowing dependencies across groups to be modeled effectively as well as conferring generaliza- tion to new groups. Such grouped clustering problems occur often in practice, e. g. in the problem of topic discovery in document corpora. We report experimental results on three text corpora showing the effective and superior performance of the HDP over previous models.

NeurIPS Conference 2003 Conference Paper

Autonomous Helicopter Flight via Reinforcement Learning

  • H. Kim
  • Michael Jordan
  • Shankar Sastry
  • Andrew Ng

Autonomous helicopter flight represents a challenging control problem, with complex, noisy, dynamics. In this paper, we describe a successful application of reinforcement learning to autonomous helicopter flight. We first fit a stochastic, nonlinear model of the helicopter dynamics. We then use the model to learn to hover in place, and to fly a number of maneuvers taken from an RC helicopter competition.

NeurIPS Conference 2003 Conference Paper

Hierarchical Topic Models and the Nested Chinese Restaurant Process

  • Thomas Griffiths
  • Michael Jordan
  • Joshua Tenenbaum
  • David Blei

We address the problem of learning topic hierarchies from data. The model selection problem in this domain is daunting—which of the large collection of possible trees to use? We take a Bayesian approach, gen- erating an appropriate prior via a distribution on partitions that we refer to as the nested Chinese restaurant process. This nonparametric prior al- lows arbitrarily large branching factors and readily accommodates grow- ing data collections. We build a hierarchical topic model by combining this prior with a likelihood that is based on a hierarchical variant of latent Dirichlet allocation. We illustrate our approach on simulated data and with an application to the modeling of NIPS abstracts.

NeurIPS Conference 2003 Conference Paper

Kernel Dimensionality Reduction for Supervised Learning

  • Kenji Fukumizu
  • Francis Bach
  • Michael Jordan

We propose a novel method of dimensionality reduction for supervised learning. Given a regression or classification problem in which we wish to predict a variable Y from an explanatory vector X, we treat the prob- lem of dimensionality reduction as that of finding a low-dimensional “ef- fective subspace” of X which retains the statistical relationship between X and Y. We show that this problem can be formulated in terms of conditional independence. To turn this formulation into an optimization problem, we characterize the notion of conditional independence using covariance operators on reproducing kernel Hilbert spaces; this allows us to derive a contrast function for estimation of the effective subspace. Un- like many conventional methods, the proposed method requires neither assumptions on the marginal distribution of X, nor a parametric model of the conditional distribution of Y.

NeurIPS Conference 2003 Conference Paper

Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

  • Peter Bartlett
  • Michael Jordan
  • Jon McAuliffe

Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by pro- viding a general quantitative relationship between the risk as assessed us- ing the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds un- der the weakest possible condition on the loss function—that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.

NeurIPS Conference 2003 Conference Paper

Learning Spectral Clustering

  • Francis Bach
  • Michael Jordan

Spectral clustering refers to a class of techniques which rely on the eigen- structure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in dif- ferent clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors.

NeurIPS Conference 2003 Conference Paper

On the Concentration of Expectation and Approximate Inference in Layered Networks

  • XuanLong Nguyen
  • Michael Jordan

We present an analysis of concentration-of-expectation phenomena in layered Bayesian networks that use generalized linear models as the local conditional probabilities. This framework encompasses a wide variety of probability distributions, including both discrete and continuous random variables. We utilize ideas from large deviation analysis and the delta method to devise and evaluate a class of approximate inference algo- rithms for layered Bayesian networks that have superior asymptotic error bounds and very fast computation time.

NeurIPS Conference 2003 Conference Paper

Semidefinite Relaxations for Approximate Inference on Graphs with Cycles

  • Michael Jordan
  • Martin Wainwright

We present a new method for calculating approximate marginals for probability distributions defined by graphs with cycles, based on a Gaus- sian entropy bound combined with a semidefinite outer bound on the marginal polytope. This combination leads to a log-determinant max- imization problem that can be solved by efficient interior point meth- ods [8]. As with the Bethe approximation and its generalizations [12], the optimizing arguments of this problem can be taken as approximations to the exact marginals. In contrast to Bethe/Kikuchi approaches, our vari- ational problem is strictly convex and so has a unique global optimum. An additional desirable feature is that the value of the optimal solution is guaranteed to provide an upper bound on the log partition function. In experimental trials, the performance of the log-determinant relaxation is comparable to or better than the sum-product algorithm, and by a sub- stantial margin for certain problem classes. Finally, the zero-temperature limit of our log-determinant relaxation recovers a class of well-known semidefinite relaxations for integer programming [e. g. , 3].

NeurIPS Conference 2003 Conference Paper

Statistical Debugging of Sampled Programs

  • Alice Zheng
  • Michael Jordan
  • Ben Liblit
  • Alex Aiken

We present a novel strategy for automatically debugging programs given sampled data from thousands of actual user runs. Our goal is to pinpoint those features that are most correlated with crashes. This is accomplished by maximizing an appropriately defined utility function. It has analogies with intuitive debugging heuristics, and, as we demonstrate, is able to deal with various types of bugs that occur in real programs.

NeurIPS Conference 2002 Conference Paper

A Hierarchical Bayesian Markovian Model for Motifs in Biopolymer Sequences

  • Eric Xing
  • Michael Jordan
  • Richard Karp
  • Stuart Russell

We propose a dynamic Bayesian model for motifs in biopolymer se- quences which captures rich biological prior knowledge and positional dependencies in motif structure in a principled way. Our model posits that the position-specific multinomial parameters for monomer distribu- tion are distributed as a latent Dirichlet-mixture random variable, and the position-specific Dirichlet component is determined by a hidden Markov process. Model parameters can be fit on training motifs using a vari- ational EM algorithm within an empirical Bayesian framework. Varia- tional inference is also used for detecting hidden motifs. Our model im- proves over previous models that ignore biological priors and positional dependence. It has much higher sensitivity to motifs during detection and a notable ability to distinguish genuine motifs from false recurring patterns.

NeurIPS Conference 2002 Conference Paper

A Minimal Intervention Principle for Coordinated Movement

  • Emanuel Todorov
  • Michael Jordan

Behavioral goals are achieved reliably and repeatedly with movements rarely reproducible in their detail. Here we offer an explanation: we show that not only are variability and goal achievement compatible, but indeed that allowing variability in redundant dimensions is the optimal control strategy in the face of uncertainty. The optimal feedback control laws for typical motor tasks obey a “minimal intervention” principle: deviations from the average trajectory are only corrected when they interfere with the task goals. The resulting behavior exhibits task-constrained variabil- ity, as well as synergetic coupling among actuators—which is another unexplained empirical phenomenon.

NeurIPS Conference 2002 Conference Paper

Distance Metric Learning with Application to Clustering with Side-Information

  • Eric Xing
  • Michael Jordan
  • Stuart Russell
  • Andrew Ng

@cs. berkeley. edu Many algorithms rely critically on being given a good metric over their inputs. For instance, data can often be clustered in many “plausible” ways, and if a clustering algorithm such as K-means initially fails to find one that is meaningful to a user, the only recourse may be for the user to manually tweak the metric until sufficiently good clusters are found. For these and other applications requiring good metrics, it is desirable that we provide a more systematic way for users to indicate what they con- sider “similar. ” For instance, we may ask them to provide examples. In this paper, we present an algorithm that, given examples of similar (and, , learns a distance metric over if desired, dissimilar) pairs of points in that respects these relationships. Our method is based on posing met-  ric learning as a convex optimization problem, which allows us to give efficient, local-optima-free algorithms. We also demonstrate empirically that the learned metrics can be used to significantly improve clustering performance. 

NeurIPS Conference 2002 Conference Paper

Learning Graphical Models with Mercer Kernels

  • Francis Bach
  • Michael Jordan

We present a class of algorithms for learning the structure of graphical models from data. The algorithms are based on a measure known as the kernel generalized variance (KGV), which essentially allows us to treat all variables on an equal footing as Gaussians in a feature space obtained from Mercer kernels. Thus we are able to learn hybrid graphs involving discrete and continuous variables of arbitrary type. We explore the computational properties of our approach, showing how to use the kernel trick to compute the relevant statistics in linear time. We illustrate our framework with experiments involving discrete and continuous data.

NeurIPS Conference 2002 Conference Paper

Robust Novelty Detection with Single-Class MPM

  • Laurent Ghaoui
  • Michael Jordan
  • Gert Lanckriet

the "single-class minimax probabil(cid: 173) In this paper we consider the problem of novelty detection, pre(cid: 173) senting an algorithm that aims to find a minimal region in input space containing a fraction 0: of the probability mass underlying a data set. This algorithm- ity machine (MPM)" - is built on a distribution-free methodology that minimizes the worst-case probability of a data point falling outside of a convex set, given only the mean and covariance matrix of the distribution and making no further distributional assump(cid: 173) tions. We present a robust approach to estimating the mean and covariance matrix within the general two-class MPM setting, and show how this approach specializes to the single-class problem. We provide empirical results comparing the single-class MPM to the single-class SVM and a two-class SVM method.

NeurIPS Conference 2001 Conference Paper

Latent Dirichlet Allocation

  • David Blei
  • Andrew Ng
  • Michael Jordan

We propose a generative model for text and other collections of dis(cid: 173) crete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hof(cid: 173) mann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present em(cid: 173) pirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.

NeurIPS Conference 2001 Conference Paper

Minimax Probability Machine

  • Gert Lanckriet
  • Laurent Ghaoui
  • Chiranjib Bhattacharyya
  • Michael Jordan

When constructing a classifier, the probability of correct classifi(cid: 173) cation of future data points should be maximized. In the current paper this desideratum is translated in a very direct way into an optimization problem, which is solved using methods from con(cid: 173) vex optimization. We also show how to exploit Mercer kernels in this setting to obtain nonlinear decision boundaries. A worst-case bound on the probability of misclassification of future data is ob(cid: 173) tained explicitly.

NeurIPS Conference 2001 Conference Paper

On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes

  • Andrew Ng
  • Michael Jordan

We compare discriminative and generative learning as typified by logistic regression and naive Bayes. We show, contrary to a widely(cid: 173) held belief that discriminative classifiers are almost always to be preferred, that there can often be two distinct regimes of per(cid: 173) formance as the training set size is increased, one in which each algorithm does better. This stems from the observation- which is borne out in repeated experiments- that while discriminative learning has lower asymptotic error, a generative classifier may also approach its (higher) asymptotic error much faster.

NeurIPS Conference 2001 Conference Paper

On Spectral Clustering: Analysis and an algorithm

  • Andrew Ng
  • Michael Jordan
  • Yair Weiss

Despite many empirical successes of spectral clustering methods(cid: 173) algorithms that cluster points using eigenvectors of matrices de(cid: 173) rived from the data- there are several unresolved issues. First, there are a wide variety of algorithms that use the eigenvectors in slightly different ways. Second, many of these algorithms have no proof that they will actually compute a reasonable clustering. In this paper, we present a simple spectral clustering algorithm that can be implemented using a few lines of Matlab. Using tools from matrix perturbation theory, we analyze the algorithm, and give conditions under which it can be expected to do well. We also show surprisingly good experimental results on a number of challenging clustering problems.

NeurIPS Conference 2001 Conference Paper

Thin Junction Trees

  • Francis Bach
  • Michael Jordan

We present an algorithm that induces a class of models with thin junction trees—models that are characterized by an upper bound on the size of the maximal cliques of their triangulated graph. By ensuring that the junction tree is thin, inference in our models remains tractable throughout the learning process. This allows both an efficient implementation of an iterative scaling parameter estimation algorithm and also ensures that inference can be performed efficiently with the final model. We illustrate the approach with applications in handwritten digit recognition and DNA splice site detection.

NeurIPS Conference 1999 Conference Paper

Approximate Inference A lgorithms for Two-Layer Bayesian Networks

  • Andrew Ng
  • Michael Jordan

We present a class of approximate inference algorithms for graphical models of the QMR-DT type. We give convergence rates for these al(cid: 173) gorithms and for the Jaakkola and Jordan (1999) algorithm, and verify these theoretical predictions empirically. We also present empirical re(cid: 173) sults on the difficult QMR-DT network problem, obtaining performance of the new algorithms roughly comparable to the Jaakkola and Jordan algorithm.

NeurIPS Conference 1998 Conference Paper

Learning from Dyadic Data

  • Thomas Hofmann
  • Jan Puzicha
  • Michael Jordan

Dyadzc data refers to a domain with two finite sets of objects in which observations are made for dyads, i. e. , pairs with one element from either set. This type of data arises naturally in many ap(cid: 173) plication ranging from computational linguistics and information retrieval to preference analysis and computer vision. In this paper, we present a systematic, domain-independent framework of learn(cid: 173) ing from dyadic data by statistical mixture models. Our approach covers different models with fiat and hierarchical latent class struc(cid: 173) tures. We propose an annealed version of the standard EM algo(cid: 173) rithm for model fitting which is empirically evaluated on a variety of data sets from different domains.

NeurIPS Conference 1997 Conference Paper

Adaptation in Speech Motor Control

  • John Houde
  • Michael Jordan

Human subjects are known to adapt their motor behavior to a shift of the visual field brought about by wearing prism glasses over their eyes. We have studied the analog of this effect in speech. U sing a device that can feed back transformed speech signals in real time, we exposed subjects to alterations of their own speech feedback. We found that speakers learn to adjust their production of a vowel to compensate for feedback alterations that change the vowel's perceived phonetic identity; moreover, the effect generalizes across consonant contexts and to different vowels.

NeurIPS Conference 1997 Conference Paper

Approximating Posterior Distributions in Belief Networks Using Mixtures

  • Christopher Bishop
  • Neil Lawrence
  • Tommi Jaakkola
  • Michael Jordan

Exact inference in densely connected Bayesian networks is computation(cid: 173) ally intractable, and so there is considerable interest in developing effec(cid: 173) tive approximation schemes. One approach which has been adopted is to bound the log likelihood using a mean-field approximating distribution. While this leads to a tractable algorithm, the mean field distribution is as(cid: 173) sumed to be factorial and hence unimodal. In this paper we demonstrate the feasibility of using a richer class of approximating distributions based on mixtures of mean field distributions. We derive an efficient algorithm for updating the mixture parameters and apply it to the problem of learn(cid: 173) ing in sigmoid belief networks. Our results demonstrate a systematic improvement over simple mean field theory as the number of mixture components is increased.

NeurIPS Conference 1997 Conference Paper

Estimating Dependency Structure as a Hidden Variable

  • Marina Meila
  • Michael Jordan

This paper introduces a probability model, the mixture of trees that can account for sparse, dynamically changing dependence relationships. We present a family of efficient algorithms that use EM and the Minimum Spanning Tree algorithm to find the ML and MAP mixture of trees for a variety of priors, including the Dirichlet and the MDL priors.

NeurIPS Conference 1996 Conference Paper

A Variational Principle for Model-based Morphing

  • Lawrence Saul
  • Michael Jordan

Given a multidimensional data set and a model of its density, we consider how to define the optimal interpolation between two points. This is done by assigning a cost to each path through space, based on two competing goals-one to interpolate through regions of high density, the other to minimize arc length. From this path functional, we derive the Euler-Lagrange equations for extremal motionj given two points, the desired interpolation is found by solv(cid: 173) ing a boundary value problem. We show that this interpolation can be done efficiently, in high dimensions, for Gaussian, Dirichlet, and mixture models.

NeurIPS Conference 1996 Conference Paper

Hidden Markov Decision Trees

  • Michael Jordan
  • Zoubin Ghahramani
  • Lawrence Saul

We study a time series model that can be viewed as a decision tree with Markov temporal structure. The model is intractable for exact calculations, thus we utilize variational approximations. We consider three different distributions for the approximation: one in which the Markov calculations are performed exactly and the layers of the decision tree are decoupled, one in which the decision tree calculations are performed exactly and the time steps of the Markov chain are decoupled, and one in which a Viterbi-like assumption is made to pick out a single most likely state sequence. We present simulation results for artificial data and the Bach chorales.

NeurIPS Conference 1996 Conference Paper

Recursive Algorithms for Approximating Probabilities in Graphical Models

  • Tommi Jaakkola
  • Michael Jordan

We develop a recursive node-elimination formalism for efficiently approximating large probabilistic networks. No constraints are set on the network topologies. Yet the formalism can be straightfor(cid: 173) wardly integrated with exact methods whenever they are/become applicable. The approximations we use are controlled: they main(cid: 173) tain consistently upper and lower bounds on the desired quantities at all times. We show that Boltzmann machines, sigmoid belief networks, or any combination (i. e. , chain graphs) can be handled within the same framework. The accuracy of the methods is veri(cid: 173) fied experimentally.

NeurIPS Conference 1996 Conference Paper

Triangulation by Continuous Embedding

  • Marina Meila
  • Michael Jordan

When triangulating a belief network we aim to obtain a junction tree of minimum state space. According to (Rose, 1970), searching for the optimal triangulation can be cast as a search over all the permutations of the graph's vertices. Our approach is to embed the discrete set of permutations in a convex continuous domain D. By suitably extending the cost function over D and solving the continous nonlinear optimization task we hope to obtain a good triangulation with respect to the aformentioned cost. This paper presents two ways of embedding the triangulation problem into continuous domain and shows that they perform well compared to the best known heuristic. 1 INTRODUCTION. WHAT IS TRIANGULATION? Belief networks are graphical representations of probability distributions over a set of variables. In what follows it will be always assumed that the variables take values in a finite set and that they correspond to the vertices of a graph. The graph's arcs will represent the dependencies among variables. There are two kinds of representations that have gained wide use: one is the directed acyclic graph model, also called a Bayes net, which represents the joint distribution as a product of the probabilities of each vertex conditioned on the values of its parents; the other is the undirected graph model, also called a Markov field, where the joint distribution is factorized over the cliques! of an undirected graph. This factorization is called a junction tree and optimizing it is the subject of the present paper. The power of both models lies in their ability to display and exploit existent marginal and conditional independencies among subsets of variables. Emphasizing independencies is useful 1 A clique is a fully connected set of vertices and a maximal clique is a clique that is not contained in any other clique. 558 M. Meilii and M. /. Jordan from both a qualitative point of view (it reveals something about the domain under study) and a quantitative one (it makes computations tractable). The two models differ in the kinds of independencies they are able to represent and often times in their naturalness in particular tasks. Directed graphs are more convenient for learning a model from data; on the other hand, the clique structure of undirected graphs organizes the information in a way that makes it immediately available to inference algorithms. Therefore it is a standard procedure to construct the model of a domain as a Bayes net and then to convert it to a Markov field for the purpose of querying it. This process is known as decomposition and it consists of the following stages: first, the directed graph is transformed into an undirected graph by an operation called moralization. Second, the moralized graph is triangulated. A graph is called triangulated if any cycle of length> 3 has a chord (i. e. an edge connecting two nonconsecutive vertices). If a graph is not triangulated it is always possible to add new edges so that the resulting graph is triangulated. We shall call this procedure triangulation and the added edges the fill-in. In the final stage, the junction tree (Kjrerulff, 1991) is constructed from the maximal cliques of the triangulated graph. We define the state space of a clique to be the cartesian product of the state spaces of the variables associated to the vertices in the clique and we call weight of the clique the size of this state space. The weight of the junction tree is the sum of the weights of its component cliques. All further exact inference in the net takes place in the junction tree representation. The number of computations required by an inference operation is proportional to the weight of the tree. For each graph there are several and usually a large number of possible triangu(cid: 173) lations, with widely varying state space sizes. Moreover, triangulation is the only stage where the cost of inference can be influenced. It is therefore critical that the triangulation procedure produces a graph that is optimal or at least "good" in this respect. Unfortunately, this is a hard problem. No optimal triangulation algorithm is known to date. However, a number of heuristic algorithms like maximum cardinality search (Tarjan and Yannakakis, 1984), lexicographic search (Rose et al. , 1976) and the minimum weight heuristic (MW) (Kjrerulff, 1990) are known. An optimization method based on simulated annealing which performs better than the heuristics on large graphs has been proposed in (Kjrerulff, 1991) and recently a "divide and conquer" algorithm which bounds the maximum clique size of the triangulated graph has been published (Becker and Geiger, 1996). All but the last algorithm are based on Rose's (Rose, 1970) elimination procedure: choose a node v of the graph, connect all its neighbors to form a clique, then eliminate v and all the edges incident to it and proceed recursively. The resulting filled-in graph is triangulated. It can be proven that the optimal triangulation can always be obtained by applying Rose's elimination procedure with an appropriate ordering of the nodes. It follows then that searching for an optimal triangulation can be cast as a search in the space of all node permutations. The idea of the present work is the following: embed the discrete search space of permutations of n objects (where n is the number of vertices) into a suitably chosen continuous space. Then extend the cost to a smooth function over the continuous domain and thus transform the discrete optimization problem into a continuous nonlinear optimization task. This allows one to take advantage of the thesaurus of optimization methods that exist for continuous cost functions. The rest of the paper will present this procedure in the following sequence: the next section introduces and discusses the objective function; section 3 states the continuous version of the problem; section 4 discusses further aspects of the optimization procedure and presents experimental results and section 5 concludes Triangulation by Continuous Embedding

NeurIPS Conference 1995 Conference Paper

Exploiting Tractable Substructures in Intractable Networks

  • Lawrence Saul
  • Michael Jordan

We develop a refined mean field approximation for inference and learning in probabilistic neural networks. Our mean field theory, unlike most, does not assume that the units behave as independent degrees of freedom; instead, it exploits in a principled way the existence of large substructures that are computationally tractable. To illustrate the advantages of this framework, we show how to incorporate weak higher order interactions into a first-order hidden Markov model, treating the corrections (but not the first order structure) within mean field theory.

NeurIPS Conference 1995 Conference Paper

Factorial Hidden Markov Models

  • Zoubin Ghahramani
  • Michael Jordan

We present a framework for learning in hidden Markov models with distributed state representations. Within this framework, we de(cid: 173) rive a learning algorithm based on the Expectation-Maximization (EM) procedure for maximum likelihood estimation. Analogous to the standard Baum-Welch update rules, the M-step of our algo(cid: 173) rithm is exact and can be solved analytically. However, due to the combinatorial nature of the hidden state representation, the exact E-step is intractable. A simple and tractable mean field approxima(cid: 173) tion is derived. Empirical results on a set of problems suggest that both the mean field approximation and Gibbs sampling are viable alternatives to the computationally expensive exact algorithm.

NeurIPS Conference 1995 Conference Paper

Fast Learning by Bounding Likelihoods in Sigmoid Type Belief Networks

  • Tommi Jaakkola
  • Lawrence Saul
  • Michael Jordan

Sigmoid type belief networks, a class of probabilistic neural net(cid: 173) works, provide a natural framework for compactly representing probabilistic information in a variety of unsupervised and super(cid: 173) vised learning problems. Often the parameters used in these net(cid: 173) works need to be learned from examples. Unfortunately, estimat(cid: 173) ing the parameters via exact probabilistic calculations (i. e, the EM-algorithm) is intractable even for networks with fairly small numbers of hidden units. We propose to avoid the infeasibility of the E step by bounding likelihoods instead of computing them ex(cid: 173) actly. We introduce extended and complementary representations for these networks and show that the estimation of the network parameters can be made fast (reduced to quadratic optimization) by performing the estimation in either of the alternative domains. The complementary networks can be used for continuous density estimation as well.

NeurIPS Conference 1995 Conference Paper

Learning Fine Motion by Markov Mixtures of Experts

  • Marina Meila
  • Michael Jordan

Compliant control is a standard method for performing fine manip(cid: 173) ulation tasks, like grasping and assembly, but it requires estimation of the state of contact (s. o. c. ) between the robot arm and the ob(cid: 173) jects involved. Here we present a method to learn a model of the movement from measured data. The method requires little or no prior knowledge and the resulting model explicitly estimates the s. o. c. The current s. o. c. is viewed as the hidden state variable of a discrete HMM. The control dependent transition probabilities between states are modeled as parametrized functions of the mea(cid: 173) surement. We show that their parameters can be estimated from measurements at the same time as the parameters of the movement in each s. o. c. The learning algorithm is a variant of the EM proce(cid: 173) dure. The E step is computed exactly; solving the M step exactly is not possible in general. Here, gradient ascent is used to produce an increase in likelihood.

NeurIPS Conference 1995 Conference Paper

Reinforcement Learning by Probability Matching

  • Philip Sabes
  • Michael Jordan

We present a new algorithm for associative reinforcement learn(cid: 173) ing. The algorithm is based upon the idea of matching a network's output probability with a probability distribution derived from the environment's reward signal. This Probability Matching algorithm is shown to perform faster and be less susceptible to local minima than previously existing algorithms. We use Probability Match(cid: 173) ing to train mixture of experts networks, an architecture for which other reinforcement learning rules fail to converge reliably on even simple problems. This architecture is particularly well suited for our algorithm as it can compute arbitrarily complex functions yet calculation of the output probability is simple.

NeurIPS Conference 1994 Conference Paper

Active Learning with Statistical Models

  • David Cohn
  • Zoubin Ghahramani
  • Michael Jordan

For many types of learners one can compute the statistically "op(cid: 173) timal" way to select data. We review how these techniques have been used with feedforward neural networks [MacKay, 1992; Cohn, 1994]. We then show how the same principles may be used to select data for two alternative, statistically-based learning architectures: mixtures of Gaussians and locally weighted regression. While the techniques for neural networks are expensive and approximate, the techniques for mixtures of Gaussians and locally weighted regres(cid: 173) sion are both efficient and accurate. 1 ACTIVE LEARNING - BACKGROUND An active learning problem is one where the learner has the ability or need to influence or select its own training data. Many problems of great practical interest allow active learning, and many even require it. We consider the problem of actively learning a mapping X - Y based on a set of training examples {(Xi, Yi)}~l' where Xi E X and Yi E Y. The learner is allowed to iteratively select new inputs x (possibly from a constrained set), observe the resulting output y, and incorporate the new examples (x, y) into its training set. The primary question of active learning is how to choose which x to try next. There are many heuristics for choosing x based on intuition, including choosing places where we don't have data, where we perform poorly [Linden and Weber, 1993], where we have low confidence [Thrun and Moller, 1992], where we expect it 706 David Cohn, Zoubin Ghahramani, Michael I. Jordon to change our model [Cohn et aI, 1990], and where we previously found data that resulted in learning [Schmidhuber and Storck, 1993]. In this paper we consider how one may select x "optimally" from a statistical viewpoint. We first review how the statistical approach can be applied to neural networks, as described in MacKay [1992] and Cohn [1994]. We then consider two alternative, statistically-based learning architectures: mixtures of Gaussians and locally weighted regression. While optimal data selection for a neural network is computationally expensive and approximate, we find that optimal data selection for the two statistical models is efficient and accurate. 2 ACTIVE LEARNING - A STATISTICAL APPROACH We denote the learner's output given input x as y(x). The mean squared error of this output can be expressed as the sum of the learner's bias and variance. The variance 0'3 (x) indicates the learner's uncertainty in its estimate at x. 1 Our goal will be to select a new example x such that when the resulting example (x, y) is added to the training set, the integrated variance IV is minimized: IV = J 0'3 P (x)dx. (1) Here, P(x) is the (known) distribution over X. In practice, we will compute a Monte Carlo approximation of this integral, evaluating 0'3 at a number of random points drawn according to P(x). Selecting x so as to minimize IV requires computing 0-3, the new variance at x given (x, y). Until we actually commit to an x, we do not know what corresponding y we will see, so the minimization cannot be performed deterministically. 2 Many learning architectures, however, provide an estimate of PWlx) based on current data, so we can use this estimate to compute the expectation of 0-3. Selecting x to minimize the expected integrated variance provides a solid statistical basis for choosing new examples. 2. 1 EXAMPLE: ACTIVE LEARNING WITH A NEURAL

NeurIPS Conference 1994 Conference Paper

An Alternative Model for Mixtures of Experts

  • Lei Xu
  • Michael Jordan
  • Geoffrey Hinton

We propose an alternative model for mixtures of experts which uses a different parametric form for the gating network. The modified model is trained by the EM algorithm. In comparison with earlier models-trained by either EM or gradient ascent-there is no need to select a learning stepsize. We report simulation experiments which show that the new architecture yields faster convergence. We also apply the new model to two problem domains: piecewise nonlinear function approximation and the combination of multiple previously trained classifiers.

NeurIPS Conference 1994 Conference Paper

Boltzmann Chains and Hidden Markov Models

  • Lawrence Saul
  • Michael Jordan

We propose a statistical mechanical framework for the modeling of discrete time series. Maximum likelihood estimation is done via Boltzmann learning in one-dimensional networks with tied weights. We call these networks Boltzmann chains and show that they contain hidden Markov models (HMMs) as a special case. Our framework also motivates new architectures that address partic(cid: 173) ular shortcomings of HMMs. We look at two such architectures: parallel chains that model feature sets with disparate time scales, and looped networks that model long-term dependencies between hidden states. For these networks, we show how to implement the Boltzmann learning rule exactly, in polynomial time, without resort to simulated or mean-field annealing. The necessary com(cid: 173) putations are done by exact decimation procedures from statistical mechanics. 1 INTRODUCTION AND SUMMARY Statistical models of discrete time series have a wide range of applications, most notably to problems in speech recognition (Juang & Rabiner, 1991) and molecular biology (Baldi, Chauvin, Hunkapiller, & McClure, 1992). A common problem in these fields is to find a probabilistic model, and a set of model parameters, that 436 Lawrence K. Saul, Michael I. Jordan account for sequences of observed data. Hidden Markov models (HMMs) have been particularly successful at modeling discrete time series. One reason for this is the powerful learning rule (Baum) 1972») a special case of the Expectation-Maximization (EM) procedure for maximum likelihood estimation (Dempster) Laird) & Rubin) 1977). In this work) we develop a statistical mechanical framework for the modeling of discrete time series. The framework enables us to relate HMMs to a large family of exactly solvable models in statistical mechanics. The connection to statistical mechanics was first noticed by Sourlas (1989») who studied spin glass models of error-correcting codes. We view the estimation procedure for HMMs as a special (and particularly tractable) case of the Boltzmann learning rule (Ackley) Hinton) & Sejnowski) 1985; Byrne) 1992). The rest of this paper is organized as follows. In Section 2) we review the modeling problem for discrete time series and establish the connection between HMMs and Boltzmann machines. In Section 3) we show how to quickly determine whether or not a particular Boltzmann machine is tractable) and if so) how to efficiently compute the correlations in the Boltzmann learning rule. Finally) in Section 4) we look at two architectures that address particular weaknesses of HMMs: the modelling of disparate time scales and long-term dependencies. 2 MODELING DISCRETE TIME SERIES A discrete time series is a sequence of symbols {jdr=l in which each symbol belongs to a finite countable set) i. e. jl E {1) 2). .. ) m}. Given one long sequence) or perhaps many shorter ones) the modeling task is to characterize the probability distribution from which the time series are generated. 2. 1 HIDDEN MARKOV MODELS A first-order Hidden Markov Model (HMM) is characterized by a set of n hidden states) an alphabet of m symbols) a transmission matrix ajj') an emission matrix bjj ) and a prior distribution 7I'j over the initial hidden state. The sequence of states {idr=l and symbols {jdr=l is modeled to occur with probability (1) The modeling problem is to find the parameter values (ajj', bij ) 7I'j) that maximize the likelihood of observed sequences of training data. We will elaborate on the learning rule in section 2. 3) but first let us make the connection to a well-known family of stochastic neural networks, namely Boltzmann machines. 2. 2 BOLTZMANN MACHINES Consider a Boltzmann machine with m-state visible units) n-state hidden units) tied weights) and the linear architecture shown in Figure 1. This example represents the simplest possible Boltzmann "chain))) one that is essentially equivalent to a first(cid: 173) order HMM unfolded in time (MacKay) 1994). The transition weights Aii' connect adjacent hidden units) while the emission weights Bjj connect each hidden unit to Boltzmann Chains and Hidden Markov Models

NeurIPS Conference 1994 Conference Paper

Computational Structure of coordinate transformations: A generalization study

  • Zoubin Ghahramani
  • Daniel Wolpert
  • Michael Jordan

One of the fundamental properties that both neural networks and the central nervous system share is the ability to learn and gener(cid: 173) alize from examples. While this property has been studied exten(cid: 173) sively in the neural network literature it has not been thoroughly explored in human perceptual and motor learning. We have chosen a coordinate transformation system-the visuomotor map which transforms visual coordinates into motor coordinates-to study the generalization effects of learning new input-output pairs. Using a paradigm of computer controlled altered visual feedback, we have studied the generalization of the visuomotor map subsequent to both local and context-dependent remappings. A local remapping of one or two input-output pairs induced a significant global, yet decaying, change in the visuomotor map, suggesting a representa(cid: 173) tion for the map composed of units with large functional receptive fields. Our study of context-dependent remappings indicated that a single point in visual space can be mapped to two different fin(cid: 173) ger locations depending on a context variable-the starting point of the movement. Furthermore, as the context is varied there is a gradual shift between the two remappings, consistent with two visuomotor modules being learned and gated smoothly with the context.

NeurIPS Conference 1994 Conference Paper

Forward dynamic models in human motor control: Psychophysical evidence

  • Daniel Wolpert
  • Zoubin Ghahramani
  • Michael Jordan

Based on computational principles, with as yet no direct experi(cid: 173) mental validation, it has been proposed that the central nervous system (CNS) uses an internal model to simulate the dynamic be(cid: 173) havior of the motor system in planning, control and learning (Sut(cid: 173) ton and Barto, 1981; Ito, 1984; Kawato et aI. , 1987; Jordan and Rumelhart, 1992; Miall et aI. , 1993). We present experimental re(cid: 173) sults and simulations based on a novel approach that investigates the temporal propagation of errors in the sensorimotor integration process. Our results provide direct support for the existence of an internal model.

NeurIPS Conference 1994 Conference Paper

Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems

  • Tommi Jaakkola
  • Satinder Singh
  • Michael Jordan

Increasing attention has been paid to reinforcement learning algo(cid: 173) rithms in recent years, partly due to successes in the theoretical analysis of their behavior in Markov environments. If the Markov assumption is removed, however, neither generally the algorithms nor the analyses continue to be usable. We propose and analyze a new learning algorithm to solve a certain class of non-Markov decision problems. Our algorithm applies to problems in which the environment is Markov, but the learner has restricted access to state information. The algorithm involves a Monte-Carlo pol(cid: 173) icy evaluation combined with a policy improvement method that is similar to that of Markov decision problems and is guaranteed to converge to a local maximum. The algorithm operates in the space of stochastic policies, a space which can yield a policy that per(cid: 173) forms considerably better than any deterministic policy. Although the space of stochastic policies is continuous-even for a discrete action space-our algorithm is computationally tractable. 346 Tommi Jaakkola, Satinder P. Singh, Michaell. Jordan

NeurIPS Conference 1994 Conference Paper

Reinforcement Learning with Soft State Aggregation

  • Satinder Singh
  • Tommi Jaakkola
  • Michael Jordan

It is widely accepted that the use of more compact representations than lookup tables is crucial to scaling reinforcement learning (RL) algorithms to real-world problems. Unfortunately almost all of the theory of reinforcement learning assumes lookup table representa(cid: 173) tions. In this paper we address the pressing issue of combining function approximation and RL, and present 1) a function approx(cid: 173) imator based on a simple extension to state aggregation (a com(cid: 173) monly used form of compact representation), namely soft state aggregation, 2) a theory of convergence for RL with arbitrary, but fixed, soft state aggregation, 3) a novel intuitive understanding of the effect of state aggregation on online RL, and 4) a new heuristic adaptive state aggregation algorithm that finds improved compact representations by exploiting the non-discrete nature of soft state aggregation. Preliminary empirical results are also presented.

NeurIPS Conference 1993 Conference Paper

Convergence of Stochastic Iterative Dynamic Programming Algorithms

  • Tommi Jaakkola
  • Michael Jordan
  • Satinder Singh

Increasing attention has recently been paid to algorithms based on dynamic programming (DP) due to the suitability of DP for learn(cid: 173) ing problems involving control. In stochastic environments where the system being controlled is only incompletely known, however, a unifying theoretical account of these methods has been missing. In this paper we relate DP-based learning algorithms to the pow(cid: 173) erful techniques of stochastic approximation via a new convergence theorem, enabling us to establish a class of convergent algorithms to which both TD(") and Q-Iearning belong.

NeurIPS Conference 1993 Conference Paper

Supervised learning from incomplete data via an EM approach

  • Zoubin Ghahramani
  • Michael Jordan

Real-world learning tasks may involve high-dimensional data sets with arbitrary patterns of missing data. In this paper we present a framework based on maximum likelihood density estimation for learning from such data set. s. VVe use mixture models for the den(cid: 173) sity estimates and make two distinct appeals to the Expectation(cid: 173) Maximization (EM) principle (Dempster et al. , 1977) in deriving a learning algorithm-EM is used both for the estimation of mix(cid: 173) ture components and for coping wit. h missing dat. a. The result(cid: 173) ing algorithm is applicable t. o a wide range of supervised as well as unsupervised learning problems. Result. s from a classification benchmark-t. he iris data set-are presented.

NeurIPS Conference 1992 Conference Paper

A dynamical model of priming and repetition blindness

  • Daphne Bavelier
  • Michael Jordan

We describe a model of visual word recognition that accounts for several aspects of the temporal processing of sequences of briefly presented words. The model utilizes a new representation for writ(cid: 173) ten words, based on dynamic time warping and multidimensional scaling. The visual input passes through cascaded perceptual, com(cid: 173) parison, and detection stages. We describe how these dynamical processes can account for several aspects of word recognition, in(cid: 173) cluding repetition priming and repetition blindness.

NeurIPS Conference 1991 Conference Paper

Forward Dynamics Modeling of Speech Motor Control Using Physiological Data

  • Makoto Hirayama
  • Eric Vatikiotis-Bateson
  • Mitsuo Kawato
  • Michael Jordan

We propose a paradigm for modeling speech production based on neural networks. We focus on characteristics of the musculoskeletal system. Using real physiological data - articulator movements and EMG from muscle activity(cid: 173) a neural network learns the forward dynamics relating motor commands to muscles and the ensuing articulator behavior. After learning, simulated perturbations, were used to asses properties of the acquired model, such as natural frequency, damping, and interarticulator couplings. Finally, a cascade neural network is used to generate continuous motor commands from a sequence of discrete articulatory targets.

NeurIPS Conference 1991 Conference Paper

Hierarchies of adaptive experts

  • Michael Jordan
  • Robert Jacobs

In this paper we present a neural network architecture that discovers a recursive decomposition of its input space. Based on a generalization of the modular architecture of Jacobs, Jordan, Nowlan, and Hinton (1991), the architecture uses competition among networks to recursively split the input space into nested regions and to learn separate associative mappings within each region. The learning algorithm is shown to perform gradient ascent in a log likelihood function that captures the architecture's hierarchical structure.

NeurIPS Conference 1990 Conference Paper

A competitive modular connectionist architecture

  • Robert Jacobs
  • Michael Jordan

We describe a multi-network, or modular, connectionist architecture that captures that fact that many tasks have structure at a level of granularity intermediate to that assumed by local and global function approximation schemes. The main innovation of the architecture is that it combines associative and competitive learning in order to learn task decompositions. A task decomposition is discovered by forcing the networks comprising the architecture to compete to learn the training patterns. As a result of the competition, different networks learn different training patterns and, thus, learn to partition the input space. The performance of the architecture on a "what" and "where" vision task and on a multi-payload robotics task are presented.

NeurIPS Conference 1989 Conference Paper

Learning to Control an Unstable System with Forward Modeling

  • Michael Jordan
  • Robert Jacobs

The forward modeling approach is a methodology for learning con(cid: 173) trol when data is available in distal coordinate systems. We extend previous work by considering how this methodology can be applied to the optimization of quantities that are distal not only in space but also in time. In many learning control problems, the output variables of the controller are not the natural coordinates in which to specify tasks and evaluate performance. Tasks are generally more naturally specified in "distal" coordinate systems (e. g. , endpoint coordinates for manipulator motion) than in the "proximal" coordinate system of the controller (e. g. , joint angles or torques). Furthermore, the relationship between proximal coordinates and distal coordinates is often not known a priori and, if known, not easily inverted. The forward modeling approach is a methodology for learning control when train(cid: 173) ing data is available in distal coordinate systems. A forward model is a network that learns the transformation from proximal to distal coordinates so that distal specifications can be used in training the controller (Jordan & Rumelhart, 1990). The forward model can often be learned separately from the controller because it depends only on the dynamics of the controlled system and not on the closed-loop dynamics. In previous work, we studied forward models of kinematic transformations (Jordan, 1988, 1990) and state transitions (Jordan & Rumelhart, 1990). In the current paper, Learning to Control an Unstable System with Forward Modeling 325 we go beyond the spatial credit assignment problems studied in those papers and broaden the application of forward modeling to include cases of temporal credit assignment (cf. Barto, Sutton, & Anderson, 1983; Werbos, 1987). As discussed below, the function to be modeled in such cases depends on a time integral of the closed-loop dynamics. This fact has two important implications. First, the data needed for learning the forward model can no longer be obtained solely by observing the instantaneous state or output of the plant. Second, the forward model is no longer independent of the controller: If the parameters of the controller are changed by a learning algorithm, then the closed-loop dynamics change and so does the mapping from proximal to distal variables. Thus the learning of the forward model and the learning of the controller can no longer be separated into different phases. 1 FORWARD MODELING In this section we briefly summarize our previous work on forward modeling (see also Nguyen & Widrow, 1989 and Werbos, 1987). 1. 1 LEARNING A FORWARD MODEL Given a fixed control law, the learning of a forward model is a system identification problem. Let z = g(s, u) be a system to be modeled, where z is the output or the state-derivative, s is the state, and u is the control. We require the forward model to minimize the cost functional Jm = ~ J (z - z)T(z - z)dt.