Arrow Research search

Author name cluster

Pradeep Ravikumar

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

131 papers
2 author rows

Possible papers

131

ICML Conference 2025 Conference Paper

Contextures: Representations from Contexts

  • Runtian Zhai
  • Kai Yang
  • Burak Varici
  • Che-Ping Tsai
  • J. Zico Kolter
  • Pradeep Ravikumar

Despite the empirical success of foundation models, we do not have a systematic characterization of the representations that these models learn. In this paper, we establish the contexture theory. It shows that a large class of representation learning methods can be characterized as learning from the association between the input and a context variable. Specifically, we show that many popular methods aim to approximate the top-d singular functions of the expectation operator induced by the context, in which case we say that the representation learns the contexture. We demonstrate the generality of the contexture theory by proving that representation learning within various learning paradigms – supervised, self-supervised, and manifold learning – can all be studied from such a perspective. We prove that representations that learn the contexture are optimal on those tasks that are compatible with the context. One important implication of our theory is that once the model is large enough to approximate the top singular functions, scaling up the model size yields diminishing returns, so further improvement requires better contexts. To this end, we study how to evaluate a context without knowing the downstream tasks. We propose a metric and show by experiments that it correlates well with the actual performance of the encoder on many real datasets.

ICLR Conference 2025 Conference Paper

Learning from weak labelers as constraints

  • Vishwajeet Agrawal
  • Rattana Pukdee
  • Maria-Florina Balcan
  • Pradeep Ravikumar

We study programmatic weak supervision, where in contrast to labeled data, we have access to \emph{weak labelers}, each of which either abstains or provides noisy labels corresponding to any input. Most previous approaches typically employ latent generative models that model the joint distribution of the weak labels and the latent ``true'' label. The caveats are that this relies on assumptions that may not always hold in practice such as conditional independence assumptions over the joint distribution of the weak labelers and the latent true label, and more general implicit inductive biases in the latent generative models. In this work, we consider a more explicit form of side-information that can be leveraged to denoise the weak labeler, namely the bounds on the average error of the weak labelers. We then propose a novel but natural weak supervision objective that minimizes a regularization functional subject to satisfying these bounds. This turns out to be a difficult constrained optimization problem due to discontinuous accuracy bound constraints. We provide a continuous optimization formulation for this objective through an alternating minimization algorithm that iteratively computes soft pseudo labels on the unlabeled data satisfying the constraints while being close to the model, and then updates the model on these labels until all the constraints are satisfied. We follow this with a theoretical analysis of this approach and provide insights into its denoising effects in training discriminative models given multiple weak labelers. Finally, we demonstrate the superior performance and robustness of our method on a popular weak supervision benchmark.

NeurIPS Conference 2024 Conference Paper

Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers

  • Yibo Jiang
  • Goutham Rajendran
  • Pradeep Ravikumar
  • Bryon Aragam

Large Language Models (LLMs) have the capacity to store and recall facts. Through experimentation with open-source models, we observe that this ability to retrieve facts can be easily manipulated by changing contexts, even without altering their factual meanings. These findings highlight that LLMs might behave like an associative memory model where certain tokens in the contexts serve as clues to retrieving facts. We mathematically explore this property by studying how transformers, the building blocks of LLMs, can complete such memory tasks. We study a simple latent concept association problem with a one-layer transformer and we show theoretically and empirically that the transformer gathers information using self-attention and uses the value matrix for associative memory.

NeurIPS Conference 2024 Conference Paper

From Causal to Concept-Based Representation Learning

  • Goutham Rajendran
  • Simon Buchholz
  • Bryon Aragam
  • Bernhard Schölkopf
  • Pradeep Ravikumar

To build intelligent machine learning systems, modern representation learning attempts to recover latent generative factors from data, such as in causal representation learning. A key question in this growing field is to provide rigorous conditions under which latent factors can be identified and thus, potentially learned. Motivated by extensive empirical literature on linear representations and concept learning, we propose to relax causal notions with a geometric notion of concepts. We formally define a notion of concepts and show rigorously that they can be provably recovered from diverse data. Instead of imposing assumptions on the "true" generative latent space, we assume that concepts can be represented linearly in this latent space. The tradeoff is that instead of identifying the "true" generative factors, we identify a subset of desired human-interpretable concepts that are relevant for a given application. Experiments on synthetic data, multimodal CLIP models and large language models supplement our results and show the utility of our approach. In this way, we provide a foundation for moving from causal representations to interpretable, concept-based representations by bringing together ideas from these two neighboring disciplines.

NeurIPS Conference 2024 Conference Paper

Identifying General Mechanism Shifts in Linear Causal Representations

  • Tianyu Chen
  • Kevin Bello
  • Francesco Locatello
  • Bryon Aragam
  • Pradeep Ravikumar

We consider the linear causal representation learning setting where we observe a linear mixing of $d$ unknown latent factors, which follow a linear structural causal model. Recent work has shown that it is possible to recover the latent factors as well as the underlying structural causal model over them, up to permutation and scaling, provided that we have at least $d$ environments, each of which corresponds to perfect interventions on a single latent node (factor). After this powerful result, a key open problem faced by the community has been to relax these conditions: allow for coarser than perfect single-node interventions, and allow for fewer than $d$ of them, since the number of latent factors $d$ could be very large. In this work, we consider precisely such a setting, where we allow a smaller than $d$ number of environments, and also allow for very coarse interventions that can very coarsely \textit{change the entire causal graph over the latent factors}. On the flip side, we relax what we wish to extract to simply the \textit{list of nodes that have shifted between one or more environments}. We provide a surprising identifiability result that it is indeed possible, under some very mild standard assumptions, to identify the set of shifted nodes. Our identifiability proof moreover is a constructive one: we explicitly provide necessary and sufficient conditions for a node to be a shifted node, and show that we can check these conditions given observed data. Our algorithm lends itself very naturally to the sample setting where instead of just interventional distributions, we are provided datasets of samples from each of these distributions. We corroborate our results on both synthetic experiments as well as an interesting psychometric dataset. The code can be found at https: //github. com/TianyuCodings/iLCS.

ICLR Conference 2024 Conference Paper

Identifying Representations for Intervention Extrapolation

  • Sorawit Saengkyongam
  • Elan Rosenfeld
  • Pradeep Ravikumar
  • Niklas Pfister
  • Jonas Peters

The premise of identifiable and causal representation learning is to improve the current representation learning paradigm in terms of generalizability or robustness. Despite recent progress in questions of identifiability, more theoretical results demonstrating concrete advantages of these methods for downstream tasks are needed. In this paper, we consider the task of intervention extrapolation: predicting how interventions affect an outcome, even when those interventions are not observed at training time, and show that identifiable representations can provide an effective solution to this task even if the interventions affect the outcome non-linearly. Our setup includes an outcome variable $Y$, observed features $X$, which are generated as a non-linear transformation of latent features $Z$, and exogenous action variables $A$, which influence $Z$. The objective of intervention extrapolation is then to predict how interventions on $A$ that lie outside the training support of $A$ affect $Y$. Here, extrapolation becomes possible if the effect of $A$ on $Z$ is linear and the residual when regressing Z on A has full support. As $Z$ is latent, we combine the task of intervention extrapolation with identifiable representation learning, which we call $\texttt{Rep4Ex}$: we aim to map the observed features $X$ into a subspace that allows for non-linear extrapolation in $A$. We show that the hidden representation is identifiable up to an affine transformation in $Z$-space, which, we prove, is sufficient for intervention extrapolation. The identifiability is characterized by a novel constraint describing the linearity assumption of $A$ on $Z$. Based on this insight, we propose a flexible method that enforces the linear invariance constraint and can be combined with any type of autoencoder. We validate our theoretical findings through a series of synthetic experiments and show that our approach can indeed succeed in predicting the effects of unseen interventions.

NeurIPS Conference 2024 Conference Paper

LogiCity: Advancing Neuro-Symbolic AI with Abstract Urban Simulation

  • Bowen Li
  • Zhaoyu Li
  • Qiwei Du
  • Jinqi Luo
  • Wenshan Wang
  • Yaqi Xie
  • Simon Stepputtis
  • Chen Wang

Recent years have witnessed the rapid development of Neuro-Symbolic (NeSy) AI systems, which integrate symbolic reasoning into deep neural networks. However, most of the existing benchmarks for NeSy AI fail to provide long-horizon reasoning tasks with complex multi-agent interactions. Furthermore, they are usually constrained by fixed and simplistic logical rules over limited entities, making them far from real-world complexities. To address these crucial gaps, we introduce LogiCity, the first simulator based on customizable first-order logic (FOL) for an urban-like environment with multiple dynamic agents. LogiCity models diverse urban elements using semantic and spatial concepts, such as $\texttt{IsAmbulance}(\texttt{X})$ and $\texttt{IsClose}(\texttt{X}, \texttt{Y})$. These concepts are used to define FOL rules that govern the behavior of various agents. Since the concepts and rules are abstractions, they can be universally applied to cities with any agent compositions, facilitating the instantiation of diverse scenarios. Besides, a key feature of LogiCity is its support for user-configurable abstractions, enabling customizable simulation complexities for logical reasoning. To explore various aspects of NeSy AI, LogiCity introduces two tasks, one features long-horizon sequential decision-making, and the other focuses on one-step visual reasoning, varying in difficulty and agent behaviors. Our extensive evaluation reveals the advantage of NeSy frameworks in abstract reasoning. Moreover, we highlight the significant challenges of handling more complex abstractions in long-horizon multi-agent scenarios or under high-dimensional, imbalanced data. With its flexible design, various features, and newly raised challenges, we believe LogiCity represents a pivotal step forward in advancing the next generation of NeSy AI. All the code and data are open-sourced at our website.

NeurIPS Conference 2024 Conference Paper

Markov Equivalence and Consistency in Differentiable Structure Learning

  • Chang Deng
  • Kevin Bello
  • Pradeep Ravikumar
  • Bryon Aragam

Existing approaches to differentiable structure learning of directed acyclic graphs (DAGs) rely on strong identifiability assumptions in order to guarantee that global minimizers of the acyclicity-constrained optimization problem identifies the true DAG. Moreover, it has been observed empirically that the optimizer may exploit undesirable artifacts in the loss function. We explain and remedy these issues by studying the behavior of differentiable acyclicity-constrained programs under general likelihoods with multiple global minimizers. By carefully regularizing the likelihood, it is possible to identify the sparsest model in the Markov equivalence class, even in the absence of an identifiable parametrization. We first study the Gaussian case in detail, showing how proper regularization of the likelihood defines a score that identifies the sparsest model. Assuming faithfulness, it also recovers the Markov equivalence class. These results are then generalized to general models and likelihoods, where the same claims hold. These theoretical results are validated empirically, showing how this can be done using standard gradient-based optimizers (without resorting to approximations such as Gumbel-Softmax), thus paving the way for differentiable structure learning under general models and losses. Open-source code is available at \url{https: //github. com/duntrain/dagrad}.

ICML Conference 2024 Conference Paper

On the Origins of Linear Representations in Large Language Models

  • Yibo Jiang
  • Goutham Rajendran
  • Pradeep Ravikumar
  • Bryon Aragam
  • Victor Veitch

An array of recent works have argued that high-level semantic concepts are encoded "linearly" in the representation space of large language models. In this work, we study the origins of such linear representations. To that end, we introduce a latent variable model to abstract and formalize the concept dynamics of the next token prediction. We use this formalism to prove that linearity arises as a consequence of the loss function and the implicit bias of gradient descent. The theory is further substantiated empirically via experiments.

ICLR Conference 2024 Conference Paper

Spectrally Transformed Kernel Regression

  • Runtian Zhai
  • Rattana Pukdee
  • Roger Jin
  • Maria-Florina Balcan
  • Pradeep Ravikumar

Unlabeled data is a key component of modern machine learning. In general, the role of unlabeled data is to impose a form of smoothness, usually from the similarity information encoded in a base kernel, such as the ϵ-neighbor kernel or the adjacency matrix of a graph. This work revisits the classical idea of spectrally transformed kernel regression (STKR), and provides a new class of general and scalable STKR estimators able to leverage unlabeled data. Intuitively, via spectral transformation, STKR exploits the data distribution for which unlabeled data can provide additional information. First, we show that STKR is a principled and general approach, by characterizing a universal type of “target smoothness”, and proving that any sufficiently smooth function can be learned by STKR. Second, we provide scalable STKR implementations for the inductive setting and a general transformation function, while prior work is mostly limited to the transductive setting. Third, we derive statistical guarantees for two scenarios: STKR with a known polynomial transformation, and STKR with kernel PCA when the transformation is unknown. Overall, we believe that this work helps deepen our understanding of how to work with unlabeled data, and its generality makes it easier to inspire new methods.

ICLR Conference 2024 Conference Paper

Understanding Augmentation-based Self-Supervised Representation Learning via RKHS Approximation and Regression

  • Runtian Zhai
  • Bingbin Liu
  • Andrej Risteski
  • J. Zico Kolter
  • Pradeep Ravikumar

Data augmentation is critical to the empirical success of modern self-supervised representation learning, such as contrastive learning and masked language modeling. However, a theoretical understanding of the exact role of the augmentation remains limited. Recent work has built the connection between self-supervised learning and the approximation of the top eigenspace of a graph Laplacian operator, suggesting that learning a linear probe atop such representation can be connected to RKHS regression. Building on this insight, this work delves into a statistical analysis of augmentation-based pretraining. Starting from the isometry property, a geometric characterization of the target function given by the augmentation, we disentangle the effects of the model and the augmentation, and prove two generalization bounds that are free of model complexity. Our first bound works for an arbitrary encoder, and it is the sum of an estimation error bound incurred by fitting a linear probe, and an approximation error bound by RKHS approximation. Our second bound specifically addresses the case where the encoder extracts the top-d eigenspace of a finite-sample-based approximation of the underlying RKHS. A key ingredient in our analysis is the *augmentation complexity*, which we use to quantitatively compare different augmentations and analyze their impact on downstream performance.

ICLR Conference 2023 Conference Paper

Concept Gradient: Concept-based Interpretation Without Linear Assumption

  • Andrew Bai
  • Chih-Kuan Yeh
  • Neil Y. C. Lin
  • Pradeep Ravikumar
  • Cho-Jui Hsieh

Concept-based interpretations of black-box models are often more intuitive for humans to understand. The most widely adopted approach for concept-based, gradient interpretation is Concept Activation Vector (CAV). CAV relies on learning a linear relation between some latent representation of a given model and concepts. The premise of meaningful concepts lying in a linear subspace of model layers is usually implicitly assumed but does not hold true in general. In this work we proposed Concept Gradient (CG), which extends concept-based, gradient interpretation methods to non-linear concept functions. We showed that for a general (potentially non-linear) concept, we can mathematically measure how a small change of concept affects the model’s prediction, which is an extension of gradient-based interpretation to the concept space. We demonstrated empirically that CG outperforms CAV in attributing concept importance on real world datasets and performed case study on a medical dataset. The code is available at github.com/jybai/concept-gradients.

JMLR Journal 2023 Journal Article

Faith-Shap: The Faithful Shapley Interaction Index

  • Che-Ping Tsai
  • Chih-Kuan Yeh
  • Pradeep Ravikumar

Shapley values, which were originally designed to assign attributions to individual players in coalition games, have become a commonly used approach in explainable machine learning to provide attributions to input features for black-box machine learning models. A key attraction of Shapley values is that they uniquely satisfy a very natural set of axiomatic properties. However, extending the Shapley value to assigning attributions to interactions rather than individual players, an interaction index, is non-trivial: as the natural set of axioms for the original Shapley values, extended to the context of interactions, no longer specify a unique interaction index. Many proposals thus introduce additional possibly stringent axioms, while sacrificing the key axiom of efficiency, in order to obtain unique interaction indices. In this work, rather than introduce additional conflicting axioms, we adopt the viewpoint of Shapley values as coefficients of the most faithful linear approximation to the pseudo-Boolean coalition game value function. By extending linear to higher order polynomial approximations, we can then define the general family of faithful interaction indices. We show that by additionally requiring the faithful interaction indices to satisfy interaction-extensions of the standard individual Shapley axioms (dummy, symmetry, linearity, and efficiency), we obtain a unique Faithful Shapley Interaction index, which we denote Faith-Shap, as a natural generalization of the Shapley value to interactions. We then provide some illustrative contrasts of Faith-Shap with previously proposed interaction indices, and further investigate some of its interesting algebraic properties. We further show the computational efficiency of computing Faith-Shap, together with some additional qualitative insights, via some illustrative experiments. [abs] [ pdf ][ bib ] &copy JMLR 2023. ( edit, beta )

NeurIPS Conference 2023 Conference Paper

Global Optimality in Bivariate Gradient-based DAG Learning

  • Chang Deng
  • Kevin Bello
  • Pradeep Ravikumar
  • Bryon Aragam

Recently, a new class of non-convex optimization problems motivated by the statistical problem of learning an acyclic directed graphical model from data has attracted significant interest. While existing work uses standard first-order optimization schemes to solve this problem, proving the global optimality of such approaches has proven elusive. The difficulty lies in the fact that unlike other non-convex problems in the literature, this problem is not "benign", and possesses multiple spurious solutions that standard approaches can easily get trapped in. In this paper, we prove that a simple path-following optimization scheme globally converges to the global minimum of the population loss in the bivariate setting.

ECAI Conference 2023 Conference Paper

Individual Fairness Under Uncertainty

  • Wenbin Zhang 0002
  • Zichong Wang
  • Juyong Kim 0002
  • Cheng Cheng 0001
  • Thomas Oommen
  • Pradeep Ravikumar
  • Jeremy C. Weiss

Algorithmic fairness, the research field of making machine learning (ML) algorithms fair, is an established area in ML. As ML technologies expand their application domains, including ones with high societal impact, it becomes essential to take fairness into consideration during the building of ML systems. Yet, despite its wide range of socially sensitive applications, most work treats the issue of algorithmic bias as an intrinsic property of supervised learning, i. e. , the class label is given as a precondition. Unlike prior studies in fairness, we propose an individual fairness measure and a corresponding algorithm that deal with the challenges of uncertainty arising from censorship in class labels, while enforcing similar individuals to be treated similarly from a ranking perspective, free of the Lipschitz condition in the conventional individual fairness definition. We argue that this perspective represents a more realistic model of fairness research for real-world application deployment and show how learning with such a relaxed precondition draws new insights that better explains algorithmic fairness. We conducted experiments on four real-world datasets to evaluate our proposed method compared to other fairness models, demonstrating its superiority in minimizing discrimination while maintaining predictive performance with uncertainty present.

NeurIPS Conference 2023 Conference Paper

iSCAN: Identifying Causal Mechanism Shifts among Nonlinear Additive Noise Models

  • Tianyu Chen
  • Kevin Bello
  • Bryon Aragam
  • Pradeep Ravikumar

Structural causal models (SCMs) are widely used in various disciplines to represent causal relationships among variables in complex systems. Unfortunately, the underlying causal structure is often unknown, and estimating it from data remains a challenging task. In many situations, however, the end goal is to localize the changes (shifts) in the causal mechanisms between related datasets instead of learning the full causal structure of the individual datasets. Some applications include root cause analysis, analyzing gene regulatory network structure changes between healthy and cancerous individuals, or explaining distribution shifts. This paper focuses on identifying the causal mechanism shifts in two or more related datasets over the same set of variables--- without estimating the entire DAG structure of each SCM. Prior work under this setting assumed linear models with Gaussian noises; instead, in this work we assume that each SCM belongs to the more general class of nonlinear additive noise models (ANMs). A key technical contribution of this work is to show that the Jacobian of the score function for the mixture distribution allows for the identification of shifts under general non-parametric functional mechanisms. Once the shifted variables are identified, we leverage recent work to estimate the structural differences, if any, for the shifted variables. Experiments on synthetic and real-world data are provided to showcase the applicability of this approach. Code implementing the proposed method is open-source and publicly available at https: //github. com/kevinsbello/iSCAN.

ICLR Conference 2023 Conference Paper

Label Propagation with Weak Supervision

  • Rattana Pukdee
  • Dylan Sam
  • Pradeep Ravikumar
  • Maria-Florina Balcan

Semi-supervised learning and weakly supervised learning are important paradigms that aim to reduce the growing demand for labeled data in current machine learning applications. In this paper, we introduce a novel analysis of the classical label propagation algorithm (LPA) (Zhu & Ghahramani, 2002) that moreover takes advantage of useful prior information, specifically probabilistic hypothesized labels on the unlabeled data. We provide an error bound that exploits both the local geometric properties of the underlying graph and the quality of the prior information. We also propose a framework to incorporate multiple sources of noisy information. In particular, we consider the setting of weak supervision, where our sources of information are weak labelers. We demonstrate the ability of our approach on multiple benchmark weakly supervised classification tasks, showing improvements upon existing semi-supervised and weakly supervised methods.

NeurIPS Conference 2023 Conference Paper

Learning Linear Causal Representations from Interventions under General Nonlinear Mixing

  • Simon Buchholz
  • Goutham Rajendran
  • Elan Rosenfeld
  • Bryon Aragam
  • Bernhard Schölkopf
  • Pradeep Ravikumar

We study the problem of learning causal representations from unknown, latent interventions in a general setting, where the latent distribution is Gaussian but the mixing function is completely general. We prove strong identifiability results given unknown single-node interventions, i. e. , without having access to the intervention targets. This generalizes prior works which have focused on weaker classes, such as linear maps or paired counterfactual data. This is also the first instance of identifiability from non-paired interventions for deep neural network embeddings and general causal structures. Our proof relies on carefully uncovering the high-dimensional geometric structure present in the data distribution after a non-linear density transformation, which we capture by analyzing quadratic forms of precision matrices of the latent distributions. Finally, we propose a contrastive algorithm to identify the latent variables in practice and evaluate its performance on various tasks.

NeurIPS Conference 2023 Conference Paper

Learning with Explanation Constraints

  • Rattana Pukdee
  • Dylan Sam
  • J. Zico Kolter
  • Maria-Florina F. Balcan
  • Pradeep Ravikumar

As larger deep learning models are hard to interpret, there has been a recent focus on generating explanations of these black-box models. In contrast, we may have apriori explanations of how models should behave. In this paper, we formalize this notion as learning from explanation constraints and provide a learning theoretic framework to analyze how such explanations can improve the learning of our models. One may naturally ask, "When would these explanations be helpful? "Our first key contribution addresses this question via a class of models that satisfies these explanation constraints in expectation over new data. We provide a characterization of the benefits of these models (in terms of the reduction of their Rademacher complexities) for a canonical class of explanations given by gradient information in the settings of both linear models and two layer neural networks. In addition, we provide an algorithmic solution for our framework, via a variational approximation that achieves better performance and satisfies these constraints more frequently, when compared to simpler augmented Lagrangian methods to incorporate these explanations. We demonstrate the benefits of our approach over a large array of synthetic and real-world experiments.

ICML Conference 2023 Conference Paper

Optimizing NOTEARS Objectives via Topological Swaps

  • Chang Deng
  • Kevin Bello
  • Bryon Aragam
  • Pradeep Ravikumar

Recently, an intriguing class of non-convex optimization problems has emerged in the context of learning directed acyclic graphs (DAGs). These problems involve minimizing a given loss or score function, subject to a non-convex continuous constraint that penalizes the presence of cycles in a graph. In this work, we delve into the optimality challenges associated with this class of non-convex programs. To address these challenges, we propose a bi-level algorithm that leverages the non-convex constraint in a novel way. The outer level of the algorithm optimizes over topological orders by iteratively swapping pairs of nodes within the topological order of a DAG. A key innovation of our approach is the development of an effective method for generating a set of candidate swapping pairs for each iteration. At the inner level, given a topological order, we utilize off-the-shelf solvers that can handle linear constraints. The key advantage of our proposed algorithm is that it is guaranteed to find a local minimum or a KKT point under weaker conditions compared to previous work and finds solutions with lower scores. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in terms of achieving a better score. Additionally, our method can also be used as a post-processing algorithm to significantly improve the score of other algorithms. Code implementing the proposed method is available at https: //github. com/duntrain/topo.

ICML Conference 2023 Conference Paper

Representer Point Selection for Explaining Regularized High-dimensional Models

  • Che-Ping Tsai
  • Jiong Zhang 0001
  • Hsiang-Fu Yu
  • Eli Chien
  • Cho-Jui Hsieh
  • Pradeep Ravikumar

We introduce a novel class of sample-based explanations we term high-dimensional representers, that can be used to explain the predictions of a regularized high-dimensional model in terms of importance weights for each of the training samples. Our workhorse is a novel representer theorem for general regularized high-dimensional models, which decomposes the model prediction in terms of contributions from each of the training samples: with positive (negative) values corresponding to positive (negative) impact training samples to the model’s prediction. We derive consequences for the canonical instances of $\ell_1$ regularized sparse models and nuclear norm regularized low-rank models. As a case study, we further investigate the application of low-rank models in the context of collaborative filtering, where we instantiate high-dimensional representers for specific popular classes of models. Finally, we study the empirical performance of our proposed methods on three real-world binary classification datasets and two recommender system datasets. We also showcase the utility of high-dimensional representers in explaining model recommendations.

NeurIPS Conference 2023 Conference Paper

Responsible AI (RAI) Games and Ensembles

  • Yash Gupta
  • Runtian Zhai
  • Arun Suggala
  • Pradeep Ravikumar

Several recent works have studied the societal effects of AI; these include issues such as fairness, robustness, and safety. In many of these objectives, a learner seeks to minimize its worst-case loss over a set of predefined distributions (known as uncertainty sets), with usual examples being perturbed versions of the empirical distribution. In other words, the aforementioned problems can be written as min-max problems over these uncertainty sets. In this work, we provide a general framework for studying these problems, which we refer to as Responsible AI (RAI) games. We provide two classes of algorithms for solving these games: (a) game-play based algorithms, and (b) greedy stagewise estimation algorithms. The former class is motivated by online learning and game theory, whereas the latter class is motivated by the classical statistical literature on boosting, and regression. We empirically demonstrate the applicability and competitive performance of our techniques for solving several RAI problems, particularly around subpopulation shift.

NeurIPS Conference 2023 Conference Paper

Sample based Explanations via Generalized Representers

  • Che-Ping Tsai
  • Chih-Kuan Yeh
  • Pradeep Ravikumar

We propose a general class of sample based explanations of machine learning models, which we term generalized representers. To measure the effect of a training sample on a model's test prediction, generalized representers use two components: a global sample importance that quantifies the importance of the training point to the model and is invariant to test samples, and a local sample importance that measures similarity between the training sample and the test point with a kernel. A key contribution of the paper is to show that generalized representers are the only class of sample based explanations satisfying a natural set of axiomatic properties. We discuss approaches to extract global importances given a kernel, and also natural choices of kernels given modern non-linear models. As we show, many popular existing sample based explanations could be cast as generalized representers with particular choices of kernels and approaches to extract global importances. Additionally, we conduct empirical comparisons of different generalized representers on two image classification datasets.

ICLR Conference 2023 Conference Paper

Understanding Why Generalized Reweighting Does Not Improve Over ERM

  • Runtian Zhai
  • Chen Dan 0001
  • J. Zico Kolter
  • Pradeep Ravikumar

Empirical risk minimization (ERM) is known to be non-robust in practice to distributional shift where the training and the test distributions are different. A suite of approaches, such as importance weighting, and variants of distributionally robust optimization (DRO), have been proposed to solve this problem. But a line of recent work has empirically shown that these approaches do not significantly improve over ERM in real applications with distribution shift. The goal of this work is to obtain a comprehensive theoretical understanding of this intriguing phenomenon. We first posit the class of Generalized Reweighting (GRW) algorithms, as a broad category of approaches that iteratively update model parameters based on iterative reweighting of the training samples. We show that when overparameterized models are trained under GRW, the resulting models are close to that obtained by ERM. We also show that adding small regularization which does not greatly affect the empirical training accuracy does not help. Together, our results show that a broad category of what we term GRW approaches are not able to achieve distributionally robust generalization. Our work thus has the following sobering takeaway: to make progress towards distributionally robust generalization, we either have to develop non-GRW approaches, or perhaps devise novel classification/regression loss functions that are adapted to GRW approaches.

ICLR Conference 2022 Conference Paper

Analyzing and Improving the Optimization Landscape of Noise-Contrastive Estimation

  • Bingbin Liu
  • Elan Rosenfeld
  • Pradeep Ravikumar
  • Andrej Risteski

Noise-contrastive estimation (NCE) is a statistically consistent method for learning unnormalized probabilistic models. It has been empirically observed that the choice of the noise distribution is crucial for NCE’s performance. However, such observation has never been made formal or quantitative. In fact, it is not even clear whether the difficulties arising from a poorly chosen noise distribution are statistical or algorithmic in nature. In this work, we formally pinpoint reasons for NCE’s poor performance when an inappropriate noise distribution is used. Namely, we prove these challenges arise due to an ill-behaved (more precisely, flat) loss landscape. To address this, we introduce a variant of NCE called \emph{eNCE} which uses an exponential loss and for which \emph{normalized gradient descent} addresses the landscape issues \emph{provably} when the target and noise distributions are in a given exponential family.

ICML Conference 2022 Conference Paper

Building Robust Ensembles via Margin Boosting

  • Dinghuai Zhang
  • Hongyang Zhang 0001
  • Aaron C. Courville
  • Yoshua Bengio
  • Pradeep Ravikumar
  • Arun Sai Suggala

In the context of adversarial robustness, a single model does not usually have enough power to defend against all possible adversarial attacks, and as a result, has sub-optimal robustness. Consequently, an emerging line of work has focused on learning an ensemble of neural networks to defend against adversarial attacks. In this work, we take a principled approach towards building robust ensembles. We view this problem from the perspective of margin-boosting and develop an algorithm for learning an ensemble with maximum margin. Through extensive empirical evaluation on benchmark datasets, we show that our algorithm not only outperforms existing ensembling techniques, but also large models trained in an end-to-end fashion. An important byproduct of our work is a margin-maximizing cross-entropy (MCE) loss, which is a better alternative to the standard cross-entropy (CE) loss. Empirically, we show that replacing the CE loss in state-of-the-art adversarial training techniques with our MCE loss leads to significant performance improvement.

NeurIPS Conference 2022 Conference Paper

DAGMA: Learning DAGs via M-matrices and a Log-Determinant Acyclicity Characterization

  • Kevin Bello
  • Bryon Aragam
  • Pradeep Ravikumar

The combinatorial problem of learning directed acyclic graphs (DAGs) from data was recently framed as a purely continuous optimization problem by leveraging a differentiable acyclicity characterization of DAGs based on the trace of a matrix exponential function. Existing acyclicity characterizations are based on the idea that powers of an adjacency matrix contain information about walks and cycles. In this work, we propose a new acyclicity characterization based on the log-determinant (log-det) function, which leverages the nilpotency property of DAGs. To deal with the inherent asymmetries of a DAG, we relate the domain of our log-det characterization to the set of $\textit{M-matrices}$, which is a key difference to the classical log-det function defined over the cone of positive definite matrices. Similar to acyclicity functions previously proposed, our characterization is also exact and differentiable. However, when compared to existing characterizations, our log-det function: (1) Is better at detecting large cycles; (2) Has better-behaved gradients; and (3) Its runtime is in practice about an order of magnitude faster. From the optimization side, we drop the typically used augmented Lagrangian scheme and propose DAGMA ($\textit{Directed Acyclic Graphs via M-matrices for Acyclicity}$), a method that resembles the central path for barrier methods. Each point in the central path of DAGMA is a solution to an unconstrained problem regularized by our log-det function, then we show that at the limit of the central path the solution is guaranteed to be a DAG. Finally, we provide extensive experiments for $\textit{linear}$ and $\textit{nonlinear}$ SEMs and show that our approach can reach large speed-ups and smaller structural Hamming distances against state-of-the-art methods. Code implementing the proposed method is open-source and publicly available at https: //github. com/kevinsbello/dagma.

ICLR Conference 2022 Conference Paper

FILM: Following Instructions in Language with Modular Methods

  • So Yeon Min
  • Devendra Singh Chaplot
  • Pradeep Ravikumar
  • Yonatan Bisk
  • Ruslan Salakhutdinov

Recent methods for embodied instruction following are typically trained end-to-end using imitation learning. This often requires the use of expert trajectories and low-level language instructions. Such approaches assume that neural states will integrate multimodal semantics to perform state tracking, building spatial memory, exploration, and long-term planning. In contrast, we propose a modular method with structured representations that (1) builds a semantic map of the scene and (2) performs exploration with a semantic search policy, to achieve the natural language goal. Our modular method achieves SOTA performance (24.46 %) with a substantial (8.17 % absolute) gap from previous work while using less data by eschewing both expert trajectories and low-level instructions. Leveraging low-level language, however, can further increase our performance (26.49 %). Our findings suggest that an explicit spatial memory and a semantic search policy can provide a stronger and more general representation for state-tracking and guidance, even in the absence of expert trajectories or low-level instructions.

NeurIPS Conference 2022 Conference Paper

First is Better Than Last for Language Data Influence

  • Chih-Kuan Yeh
  • Ankur Taly
  • Mukund Sundararajan
  • Frederick Liu
  • Pradeep Ravikumar

The ability to identify influential training examples enables us to debug training data and explain model behavior. Existing techniques to do so are based on the flow of training data influence through the model parameters. For large models in NLP applications, it is often computationally infeasible to study this flow through all model parameters, therefore techniques usually pick the last layer of weights. However, we observe that since the activation connected to the last layer of weights contains "shared logic", the data influenced calculated via the last layer weights prone to a "cancellation effect", where the data influence of different examples have large magnitude that contradicts each other. The cancellation effect lowers the discriminative power of the influence score, and deleting influential examples according to this measure often does not change the model's behavior by much. To mitigate this, we propose a technique called TracIn-WE that modifies a method called TracIn to operate on the word embedding layer instead of the last layer, where the cancellation effect is less severe. One potential concern is that influence based on the word embedding layer may not encode sufficient high level information. However, we find that gradients (unlike embeddings) do not suffer from this, possibly because they chain through higher layers. We show that TracIn-WE significantly outperforms other data influence methods applied on the last layer significantly on the case deletion evaluation on three language classification tasks for different models. In addition, TracIn-WE can produce scores not just at the level of the overall training input, but also at the level of words within the training input, a further aid in debugging.

JMLR Journal 2022 Journal Article

Fundamental Limits and Tradeoffs in Invariant Representation Learning

  • Han Zhao
  • Chen Dan
  • Bryon Aragam
  • Tommi S. Jaakkola
  • Geoffrey J. Gordon
  • Pradeep Ravikumar

A wide range of machine learning applications such as privacy-preserving learning, algorithmic fairness, and domain adaptation/generalization among others, involve learning invariant representations of the data that aim to achieve two competing goals: (a) maximize information or accuracy with respect to a target response, and (b) maximize invariance or independence with respect to a set of protected features (e.g.\ for fairness, privacy, etc). Despite their wide applicability, theoretical understanding of the optimal tradeoffs --- with respect to accuracy, and invariance --- achievable by invariant representations is still severely lacking. In this paper, we provide an information theoretic analysis of such tradeoffs under both classification and regression settings. More precisely, we provide a geometric characterization of the accuracy and invariance achievable by any representation of the data; we term this feasible region the information plane. We provide an inner bound for this feasible region for the classification case, and an exact characterization for the regression case, which allows us to either bound or exactly characterize the Pareto optimal frontier between accuracy and invariance. Although our contributions are mainly theoretical, a key practical application of our results is in certifying the potential sub-optimality of any given representation learning algorithm for either classification or regression tasks. Our results shed new light on the fundamental interplay between accuracy and invariance, and may be useful in guiding the design of future representation learning algorithms. [abs] [ pdf ][ bib ] &copy JMLR 2022. ( edit, beta )

NeurIPS Conference 2022 Conference Paper

Identifiability of deep generative models without auxiliary information

  • Bohdan Kivva
  • Goutham Rajendran
  • Pradeep Ravikumar
  • Bryon Aragam

We prove identifiability of a broad class of deep latent variable models that (a) have universal approximation capabilities and (b) are the decoders of variational autoencoders that are commonly used in practice. Unlike existing work, our analysis does not require weak supervision, auxiliary information, or conditioning in the latent space. Specifically, we show that for a broad class of generative (i. e. unsupervised) models with universal approximation capabilities, the side information $u$ is not necessary: We prove identifiability of the entire generative model where we do not observe $u$ and only observe the data $x$. The models we consider match autoencoder architectures used in practice that leverage mixture priors in the latent space and ReLU/leaky-ReLU activations in the encoder, such as VaDE and MFC-VAE. Our main result is an identifiability hierarchy that significantly generalizes previous work and exposes how different assumptions lead to different ``strengths'' of identifiability, and includes certain ``vanilla'' VAEs with isotropic Gaussian priors as a special case. For example, our weakest result establishes (unsupervised) identifiability up to an affine transformation, and thus partially resolves an open problem regarding model identifiability raised in prior work. These theoretical results are augmented with experiments on both simulated and real data.

NeurIPS Conference 2022 Conference Paper

Masked Prediction: A Parameter Identifiability View

  • Bingbin Liu
  • Daniel J. Hsu
  • Pradeep Ravikumar
  • Andrej Risteski

The vast majority of work in self-supervised learning have focused on assessing recovered features by a chosen set of downstream tasks. While there are several commonly used benchmark datasets, this lens of feature learning requires assumptions on the downstream tasks which are not inherent to the data distribution itself. In this paper, we present an alternative lens, one of parameter identifiability: assuming data comes from a parametric probabilistic model, we train a self-supervised learning predictor with a suitable parametric form, and ask whether the parameters of the optimal predictor can be used to extract the parameters of the ground truth generative model. Specifically, we focus on latent-variable models capturing sequential structures, namely Hidden Markov Models with both discrete and conditionally Gaussian observations. We focus on masked prediction as the self-supervised learning task and study the optimal masked predictor. We show that parameter identifiability is governed by the task difficulty, which is determined by the choice of data model and the amount of tokens to predict. Technique-wise, we uncover close connections with the uniqueness of tensor rank decompositions, a widely used tool in studying identifiability through the lens of the method of moments.

NeurIPS Conference 2021 Conference Paper

Boosted CVaR Classification

  • Runtian Zhai
  • Chen Dan
  • Arun Suggala
  • J. Zico Kolter
  • Pradeep Ravikumar

Many modern machine learning tasks require models with high tail performance, i. e. high performance over the worst-off samples in the dataset. This problem has been widely studied in fields such as algorithmic fairness, class imbalance, and risk-sensitive decision making. A popular approach to maximize the model's tail performance is to minimize the CVaR (Conditional Value at Risk) loss, which computes the average risk over the tails of the loss. However, for classification tasks where models are evaluated by the zero-one loss, we show that if the classifiers are deterministic, then the minimizer of the average zero-one loss also minimizes the CVaR zero-one loss, suggesting that CVaR loss minimization is not helpful without additional assumptions. We circumvent this negative result by minimizing the CVaR loss over randomized classifiers, for which the minimizers of the average zero-one loss and the CVaR zero-one loss are no longer the same, so minimizing the latter can lead to better tail performance. To learn such randomized classifiers, we propose the Boosted CVaR Classification framework which is motivated by a direct relationship between CVaR and a classical boosting algorithm called LPBoost. Based on this framework, we design an algorithm called $\alpha$-AdaLPBoost. We empirically evaluate our proposed algorithm on four benchmark datasets and show that it achieves higher tail performance than deterministic model training methods.

ICML Conference 2021 Conference Paper

DORO: Distributional and Outlier Robust Optimization

  • Runtian Zhai
  • Chen Dan 0001
  • J. Zico Kolter
  • Pradeep Ravikumar

Many machine learning tasks involve subpopulation shift where the testing data distribution is a subpopulation of the training distribution. For such settings, a line of recent work has proposed the use of a variant of empirical risk minimization(ERM) known as distributionally robust optimization (DRO). In this work, we apply DRO to real, large-scale tasks with subpopulation shift, and observe that DRO performs relatively poorly, and moreover has severe instability. We identify one direct cause of this phenomenon: sensitivity of DRO to outliers in the datasets. To resolve this issue, we propose the framework of DORO, for Distributional and Outlier Robust Optimization. At the core of this approach is a refined risk function which prevents DRO from overfitting to potential outliers. We instantiate DORO for the Cressie-Read family of Rényi divergence, and delve into two specific instances of this family: CVaR and $\chi^2$-DRO. We theoretically prove the effectiveness of the proposed method, and empirically show that DORO improves the performance and stability of DRO with experiments on large modern datasets, thereby positively addressing the open question raised by Hashimoto et al. , 2018. Codes are available at https: //github. com/RuntianZ/doro.

ICLR Conference 2021 Conference Paper

Evaluations and Methods for Explanation through Robustness Analysis

  • Cheng-Yu Hsieh
  • Chih-Kuan Yeh
  • Xuanqing Liu
  • Pradeep Ravikumar
  • Seungyeon Kim 0001
  • Sanjiv Kumar
  • Cho-Jui Hsieh

Feature based explanations, that provide importance of each feature towards the model prediction, is arguably one of the most intuitive ways to explain a model. In this paper, we establish a novel set of evaluation criteria for such feature based explanations by robustness analysis. In contrast to existing evaluations which require us to specify some way to "remove" features that could inevitably introduces biases and artifacts, we make use of the subtler notion of smaller adversarial perturbations. By optimizing towards our proposed evaluation criteria, we obtain new explanations that are loosely necessary and sufficient for a prediction. We further extend the explanation to extract the set of features that would move the current prediction to a target class by adopting targeted adversarial attack for the robustness analysis. Through experiments across multiple domains and a user study, we validate the usefulness of our evaluation criteria and our derived explanations.

NeurIPS Conference 2021 Conference Paper

Learning latent causal graphs via mixture oracles

  • Bohdan Kivva
  • Goutham Rajendran
  • Pradeep Ravikumar
  • Bryon Aragam

We study the problem of reconstructing a causal graphical model from data in the presence of latent variables. The main problem of interest is recovering the causal structure over the latent variables while allowing for general, potentially nonlinear dependencies. In many practical problems, the dependence between raw observations (e. g. pixels in an image) is much less relevant than the dependence between certain high-level, latent features (e. g. concepts or objects), and this is the setting of interest. We provide conditions under which both the latent representations and the underlying latent causal model are identifiable by a reduction to a mixture oracle. These results highlight an intriguing connection between the well-studied problem of learning the order of a mixture model and the problem of learning the bipartite structure between observables and unobservables. The proof is constructive, and leads to several algorithms for explicitly reconstructing the full graphical model. We discuss efficient algorithms and provide experiments illustrating the algorithms in practice.

ICML Conference 2021 Conference Paper

On Proximal Policy Optimization's Heavy-tailed Gradients

  • Saurabh Garg
  • Joshua Zhanson
  • Emilio Parisotto
  • Adarsh Prasad
  • J. Zico Kolter
  • Zachary C. Lipton
  • Sivaraman Balakrishnan
  • Ruslan Salakhutdinov

Modern policy gradient algorithms such as Proximal Policy Optimization (PPO) rely on an arsenal of heuristics, including loss clipping and gradient clipping, to ensure successful learning. These heuristics are reminiscent of techniques from robust statistics, commonly used for estimation in outlier-rich ("heavy-tailed") regimes. In this paper, we present a detailed empirical study to characterize the heavy-tailed nature of the gradients of the PPO surrogate reward function. We demonstrate that the gradients, especially for the actor network, exhibit pronounced heavy-tailedness and that it increases as the agent’s policy diverges from the behavioral policy (i. e. , as the agent goes further off policy). Further examination implicates the likelihood ratios and advantages in the surrogate reward as the main sources of the observed heavy-tailedness. We then highlight issues arising due to the heavy-tailed nature of the gradients. In this light, we study the effects of the standard PPO clipping heuristics, demonstrating that these tricks primarily serve to offset heavy-tailedness in gradients. Thus motivated, we propose incorporating GMOM, a high-dimensional robust estimator, into PPO as a substitute for three clipping tricks. Despite requiring less hyperparameter tuning, our method matches the performance of PPO (with all heuristics enabled) on a battery of MuJoCo continuous control tasks.

AAAI Conference 2021 Conference Paper

Sub-Seasonal Climate Forecasting via Machine Learning: Challenges, Analysis, and Advances

  • Sijie He
  • Xinyan Li
  • Timothy DelSole
  • Pradeep Ravikumar
  • Arindam Banerjee

Sub-seasonal forecasting (SSF) focuses on predicting key variables such as temperature and precipitation on the 2-week to 2-month time scale. Skillful SSF would have immense societal value in such areas as agricultural productivity, water resource management, and emergency planning for extreme weather events. However, SSF is considered more challenging than either weather prediction or even seasonal prediction, and is still a largely understudied problem. In this paper, we carefully investigate 10 Machine Learning (ML) approaches to sub-seasonal temperature forecasting over the contiguous U. S. on the SSF dataset we collect, including a variety of climate variables from the atmosphere, ocean, and land. Because of the complicated atmosphere-land-ocean couplings and the limited amount of good quality observational data, SSF imposes a great challenge for ML despite the recent advances in various domains. Our results indicate that suitable ML models, e. g. , XGBoost, to some extent, capture the predictability on sub-seasonal time scales and can outperform the climatological baselines, while Deep Learning (DL) models barely manage to match the best results with carefully designed architecture. Besides, our analysis and exploration provide insights on important aspects to improve the quality of sub-seasonal forecasts, e. g. , feature representation and model architecture. The SSF dataset and code 1 are released with this paper for use by the broader research community.

UAI Conference 2021 Conference Paper

Subseasonal climate prediction in the western US using Bayesian spatial models

  • Vishwak Srinivasan
  • Justin Khim
  • Arindam Banerjee 0001
  • Pradeep Ravikumar

Subseasonal climate forecasting is the task of predicting climate variables, such as temperature and precipitation, in a two-week to two-month time horizon. The primary predictors for such prediction problem are spatio-temporal satellite and ground measurements of a variety of climate variables in the atmosphere, ocean, and land, which however have rather limited predictive signal at the subseasonal time horizon. We propose a carefully constructed spatial hierarchical Bayesian regression model that makes use of the inherent spatial structure of the subseasonal climate prediction task. We use our Bayesian model to then derive decision-theoretically optimal point estimates with respect to various performance measures of interest to climate science. As we show, our approach handily improves on various off-the-shelf ML baselines. Since our method is based on a Bayesian framework, we are also able to quantify the uncertainty in our predictions, which is particularly crucial for difficult tasks such as the subseasonal prediction, where we expect any model to have considerable uncertainty at different test locations under different scenarios.

ICLR Conference 2021 Conference Paper

The Risks of Invariant Risk Minimization

  • Elan Rosenfeld
  • Pradeep Ravikumar
  • Andrej Risteski

Invariant Causal Prediction (Peters et al., 2016) is a technique for out-of-distribution generalization which assumes that some aspects of the data distribution vary across the training set but that the underlying causal mechanisms remain constant. Recently, Arjovsky et al. (2019) proposed Invariant Risk Minimization (IRM), an objective based on this idea for learning deep, invariant features of data which are a complex function of latent variables; many alternatives have subsequently been suggested. However, formal guarantees for all of these works are severely lacking. In this paper, we present the first analysis of classification under the IRM objective—as well as these recently proposed alternatives—under a fairly natural and general model. In the linear case, we show simple conditions under which the optimal solution succeeds or, more often, fails to recover the optimal invariant predictor. We furthermore present the very first results in the non-linear regime: we demonstrate that IRM can fail catastrophically unless the test data is sufficiently similar to the training distribution—this is precisely the issue that it was intended to solve. Thus, in this setting we find that IRM and its alternatives fundamentally do not improve over standard Empirical Risk Minimization.

NeurIPS Conference 2021 Conference Paper

When Is Generalizable Reinforcement Learning Tractable?

  • Dhruv Malik
  • Yuanzhi Li
  • Pradeep Ravikumar

Agents trained by reinforcement learning (RL) often fail to generalize beyond the environment they were trained in, even when presented with new scenarios that seem similar to the training environment. We study the query complexity required to train RL agents that generalize to multiple environments. Intuitively, tractable generalization is only possible when the environments are similar or close in some sense. To capture this, we introduce Weak Proximity, a natural structural condition that requires the environments to have highly similar transition and reward functions and share a policy providing optimal value. Despite such shared structure, we prove that tractable generalization is impossible in the worst case. This holds even when each individual environment can be efficiently solved to obtain an optimal linear policy, and when the agent possesses a generative model. Our lower bound applies to the more complex task of representation learning for efficient generalization to multiple environments. On the positive side, we introduce Strong Proximity, a strengthened condition which we prove is sufficient for efficient generalization.

UAI Conference 2020 Conference Paper

Automated Dependence Plots

  • David I. Inouye
  • Liu Leqi
  • Joon Sik Kim
  • Bryon Aragam
  • Pradeep Ravikumar

In practical applications of machine learning, it is necessary to look beyond standard metrics such as test accuracy in order to validate various qualitative properties of a model. Partial dependence plots (PDP), including instance-specific PDPs (i. e. , ICE plots), have been widely used as a visual tool to understand or validate a model. Yet, current PDPs suffer from two main drawbacks: (1) a user must manually sort or select interesting plots, and (2) PDPs are usually limited to plots along a single feature. To address these drawbacks, we formalize a method for automating the selection of interesting PDPs and extend PDPs beyond showing single features to show the model response along arbitrary directions, for example in raw feature space or a latent space arising from some generative model. We demonstrate the usefulness of our automated dependence plots (ADP) across multiple use-cases and datasets including model selection, bias detection, understanding out-of-sample behavior, and exploring the latent space of a generative model. The code is available at.

ICML Conference 2020 Conference Paper

Certified Robustness to Label-Flipping Attacks via Randomized Smoothing

  • Elan Rosenfeld
  • Ezra Winston
  • Pradeep Ravikumar
  • J. Zico Kolter

Machine learning algorithms are known to be susceptible to data poisoning attacks, where an adversary manipulates the training data to degrade performance of the resulting classifier. In this work, we present a unifying view of randomized smoothing over arbitrary functions, and we leverage this novel characterization to propose a new strategy for building classifiers that are pointwise-certifiably robust to general data poisoning attacks. As a specific instantiation, we utilize our framework to build linear classifiers that are robust to a strong variant of label flipping, where each test example is targeted independently. In other words, for each test point, our classifier includes a certification that its prediction would be the same had some number of training labels been changed adversarially. Randomized smoothing has previously been used to guarantee—with high probability—test-time robustness to adversarial manipulation of the input to a classifier; we derive a variant which provides a deterministic, analytical bound, sidestepping the probabilistic certificates that traditionally result from the sampling subprocedure. Further, we obtain these certified bounds with minimal additional runtime complexity over standard classification and no assumptions on the train or test distributions. We generalize our results to the multi-class case, providing the first multi-class classification algorithm that is certifiably robust to label-flipping attacks.

ICML Conference 2020 Conference Paper

Class-Weighted Classification: Trade-offs and Robust Approaches

  • Ziyu Xu 0001
  • Chen Dan 0001
  • Justin Khim
  • Pradeep Ravikumar

We consider imbalanced classification, the problem in which a label may have low marginal probability relative to other labels, by weighting losses according to the correct class. First, we examine the convergence rates of the expected excess weighted risk of plug-in classifiers where the weighting for the plug-in classifier and the risk may be different. This leads to irreducible errors that do not converge to the weighted Bayes risk, which motivates our consideration of robust risks. We define a robust risk that minimizes risk over a set of weightings, show excess risk bounds for this problem, and demonstrate that particular choices of the weighting set leads to a special instance of conditional value at risk (CVaR) from stochastic programming, which we call label conditional value at risk (LCVaR). Additionally, we generalize this weighting to derive a new robust risk problem that we call label heterogeneous conditional value at risk (LHCVaR). Finally, we empirically demonstrate the efficacy of LCVaR and LHCVaR on improving class conditional risks.

NeurIPS Conference 2020 Conference Paper

Generalized Boosting

  • Arun Suggala
  • Bingbin Liu
  • Pradeep Ravikumar

Boosting is a widely used learning technique in machine learning for solving classification problems. In boosting, one predicts the label of an example using an ensemble of weak classifiers. While boosting has shown tremendous success on many classification problems involving tabular data, it performs poorly on complex classification tasks involving low-level features such as image classification tasks. This drawback stems from the fact that boosting builds an additive model of weak classifiers, each of which has very little predictive power. Often, the resulting additive models are not powerful enough to approximate the complex decision boundaries of real-world classification problems. In this work, we present a general framework for boosting where, similar to traditional boosting, we aim to boost the performance of a weak learner and transform it into a strong learner. However, unlike traditional boosting, our framework allows for more complex forms of aggregation of weak learners. In this work, we specifically focus on one form of aggregation - \emph{function composition}. We show that many popular greedy algorithms for learning deep neural networks (DNNs) can be derived from our framework using function compositions for aggregation. Moreover, we identify the drawbacks of these greedy algorithms and propose new algorithms that fix these issues. Using thorough empirical evaluation, we show that our learning algorithms have superior performance over traditional additive boosting algorithms, as well as existing greedy learning techniques for DNNs. An important feature of our algorithms is that they come with strong theoretical guarantees.

ICLR Conference 2020 Conference Paper

MACER: Attack-free and Scalable Robust Training via Maximizing Certified Radius

  • Runtian Zhai
  • Chen Dan 0001
  • Di He 0001
  • Huan Zhang 0001
  • Boqing Gong
  • Pradeep Ravikumar
  • Cho-Jui Hsieh
  • Liwei Wang 0001

Adversarial training is one of the most popular ways to learn robust models but is usually attack-dependent and time costly. In this paper, we propose the MACER algorithm, which learns robust models without using adversarial training but performs better than all existing provable l2-defenses. Recent work shows that randomized smoothing can be used to provide a certified l2 radius to smoothed classifiers, and our algorithm trains provably robust smoothed classifiers via MAximizing the CErtified Radius (MACER). The attack-free characteristic makes MACER faster to train and easier to optimize. In our experiments, we show that our method can be applied to modern deep neural networks on a wide range of datasets, including Cifar-10, ImageNet, MNIST, and SVHN. For all tasks, MACER spends less training time than state-of-the-art adversarial training algorithms, and the learned models achieve larger average certified radius.

ICLR Conference 2020 Conference Paper

Minimizing FLOPs to Learn Efficient Sparse Representations

  • Biswajit Paria
  • Chih-Kuan Yeh
  • Ian En-Hsu Yen
  • Ning Xu
  • Pradeep Ravikumar
  • Barnabás Póczos

Deep representation learning has become one of the most widely adopted approaches for visual search, recommendation, and identification. Retrieval of such representations from a large database is however computationally challenging. Approximate methods based on learning compact representations, have been widely explored for this problem, such as locality sensitive hashing, product quantization, and PCA. In this work, in contrast to learning compact representations, we propose to learn high dimensional and sparse representations that have similar representational capacity as dense embeddings while being more efficient due to sparse matrix multiplication operations which can be much faster than dense multiplication. Following the key insight that the number of operations decreases quadratically with the sparsity of embeddings provided the non-zero entries are distributed uniformly across dimensions, we propose a novel approach to learn such distributed sparse embeddings via the use of a carefully constructed regularization function that directly minimizes a continuous relaxation of the number of floating-point operations (FLOPs) incurred during retrieval. Our experiments show that our approach is competitive to the other baselines and yields a similar or better speed-vs-accuracy tradeoff on practical datasets.

NeurIPS Conference 2020 Conference Paper

On Completeness-aware Concept-Based Explanations in Deep Neural Networks

  • Chih-Kuan Yeh
  • Been Kim
  • Sercan Arik
  • Chun-Liang Li
  • Tomas Pfister
  • Pradeep Ravikumar

Human explanations of high-level decisions are often expressed in terms of key concepts the decisions are based on. In this paper, we study such concept-based explainability for Deep Neural Networks (DNNs). First, we define the notion of \emph{completeness}, which quantifies how sufficient a particular set of concepts is in explaining a model's prediction behavior based on the assumption that complete concept scores are sufficient statistics of the model prediction. Next, we propose a concept discovery method that aims to infer a complete set of concepts that are additionally encouraged to be interpretable, which addresses the limitations of existing methods on concept explanations. To define an importance score for each discovered concept, we adapt game-theoretic notions to aggregate over sets and propose \emph{ConceptSHAP}. Via proposed metrics and user studies, on a synthetic dataset with apriori-known concept explanations, as well as on real-world image and language datasets, we validate the effectiveness of our method in finding concepts that are both complete in explaining the decisions and interpretable.

NeurIPS Conference 2020 Conference Paper

On Learning Ising Models under Huber's Contamination Model

  • Adarsh Prasad
  • Vishwak Srinivasan
  • Sivaraman Balakrishnan
  • Pradeep Ravikumar

We study the problem of learning Ising models in a setting where some of the samples from the underlying distribution can be arbitrarily corrupted. In such a setup, we aim to design statistically optimal estimators in a high-dimensional scaling in which the number of nodes p, the number of edges k and the maximal node degree d are allowed to increase to infinity as a function of the sample size n. Our analysis is based on exploiting moments of the underlying distribution, coupled with novel reductions to univariate estimation. Our proposed estimators achieve an optimal dimension independent dependence on the fraction of corrupted data in the contaminated setting, while also simultaneously achieving high-probability error guarantees with optimal sample-complexity. We corroborate our theoretical results by simulations.

ICML Conference 2020 Conference Paper

Sharp Statistical Guaratees for Adversarially Robust Gaussian Classification

  • Chen Dan 0001
  • Yuting Wei 0001
  • Pradeep Ravikumar

Adversarial robustness has become a fundamental requirement in modern machine learning applications. Yet, there has been surprisingly little statistical understanding so far. In this paper, we provide the first result of the \emph{optimal} minimax guarantees for the excess risk for adversarially robust classification, under Gaussian mixture model proposed by \cite{schmidt2018adversarially}. The results are stated in terms of the \emph{Adversarial Signal-to-Noise Ratio (AdvSNR)}, which generalizes a similar notion for standard linear classification to the adversarial setting. For the Gaussian mixtures with AdvSNR value of $r$, we prove an excess risk lower bound of order $\Theta(e^{-(\frac{1}{2}+o(1)) r^2} \frac{d}{n})$ and design a computationally efficient estimator that achieves this optimal rate. Our results built upon minimal assumptions while cover a wide spectrum of adversarial perturbations including $\ell_p$ balls for any $p \ge 1$.

ICML Conference 2020 Conference Paper

Uniform Convergence of Rank-weighted Learning

  • Justin Khim
  • Liu Leqi
  • Adarsh Prasad
  • Pradeep Ravikumar

The decision-theoretic foundations of classical machine learning models have largely focused on estimating model parameters that minimize the expectation of a given loss function. However, as machine learning models are deployed in varied contexts, such as in high-stakes decision-making and societal settings, it is clear that these models are not just evaluated by their average performances. In this work, we study a novel notion of L-Risk based on the classical idea of rank-weighted learning. These L-Risks, induced by rank-dependent weighting functions with bounded variation, is a unification of popular risk measures such as conditional value-at-risk and those defined by cumulative prospect theory. We give uniform convergence bounds of this broad class of risk measures and study their consequences on a logistic regression example.

AAAI Conference 2019 Short Paper

Building Human-Machine Trust via Interpretability

  • Umang Bhatt
  • Pradeep Ravikumar
  • Jos´e M. F. Moura

Developing human-machine trust is a prerequisite for adoption of machine learning systems in decision critical settings (e. g healthcare and governance). Users develop appropriate trust in these systems when they understand how the systems make their decisions. Interpretability not only helps users understand what a system learns but also helps users contest that system to align with their intuition. We propose an algorithm, AVA: Aggregate Valuation of Antecedents, that generates a consensus feature attribution, retrieving local explanations and capturing global patterns learned by a model. Our empirical results show that AVA rivals current benchmarks.

NeurIPS Conference 2019 Conference Paper

Game Design for Eliciting Distinguishable Behavior

  • Fan Yang
  • Liu Leqi
  • Yifan Wu
  • Zachary Lipton
  • Pradeep Ravikumar
  • Tom Mitchell
  • William Cohen

The ability to inferring latent psychological traits from human behavior is key to developing personalized human-interacting machine learning systems. Approaches to infer such traits range from surveys to manually-constructed experiments and games. However, these traditional games are limited because they are typically designed based on heuristics. In this paper, we formulate the task of designing behavior diagnostic games that elicit distinguishable behavior as a mutual information maximization problem, which can be solved by optimizing a variational lower bound. Our framework is instantiated by using prospect theory to model varying player traits, and Markov Decision Processes to parameterize the games. We validate our approach empirically, showing that our designed games can successfully distinguish among players with different traits, outperforming manually-designed ones by a large margin.

NeurIPS Conference 2019 Conference Paper

On Human-Aligned Risk Minimization

  • Liu Leqi
  • Adarsh Prasad
  • Pradeep Ravikumar

The statistical decision theoretic foundations of modern machine learning have largely focused on the minimization of the expectation of some loss function for a given task. However, seminal results in behavioral economics have shown that human decision-making is based on different risk measures than the expectation of any given loss function. In this paper, we pose the following simple question: in contrast to minimizing expected loss, could we minimize a better human-aligned risk measure? While this might not seem natural at first glance, we analyze the properties of such a revised risk measure, and surprisingly show that it might also better align with additional desiderata like fairness that have attracted considerable recent attention. We focus in particular on a class of human-aligned risk measures inspired by cumulative prospect theory. We empirically study these risk measures, and demonstrate their improved performance on desiderata such as fairness, in contrast to the traditional workhorse of expected loss minimization.

NeurIPS Conference 2019 Conference Paper

On the (In)fidelity and Sensitivity of Explanations

  • Chih-Kuan Yeh
  • Cheng-Yu Hsieh
  • Arun Suggala
  • David Inouye
  • Pradeep Ravikumar

We consider objective evaluation measures of saliency explanations for complex black-box machine learning models. We propose simple robust variants of two notions that have been considered in recent literature: (in)fidelity, and sensitivity. We analyze optimal explanations with respect to both these measures, and while the optimal explanation for sensitivity is a vacuous constant explanation, the optimal explanation for infidelity is a novel combination of two popular explanation methods. By varying the perturbation distribution that defines infidelity, we obtain novel explanations by optimizing infidelity, which we show to out-perform existing explanations in both quantitative and qualitative measurements. Another salient question given these measures is how to modify any given explanation to have better values with respect to these measures. We propose a simple modification based on lowering sensitivity, and moreover show that when done appropriately, we could simultaneously improve both sensitivity as well as fidelity.

NeurIPS Conference 2019 Conference Paper

Optimal Analysis of Subset-Selection Based L_p Low-Rank Approximation

  • Chen Dan
  • Hong Wang
  • Hongyang Zhang
  • Yuchen Zhou
  • Pradeep Ravikumar

We show that for the problem of $\ell_p$ rank-$k$ approximation of any given matrix over $R^{n\times m}$ and $C^{n\times m}$, the algorithm of column subset selection enjoys approximation ratio $(k+1)^{1/p}$ for $1\le p\le 2$ and $(k+1)^{1-1/p}$ for $p\ge 2$. This improves upon the previous $O(k+1)$ bound (Chierichetti et al. ,2017) for $p\ge 1$. We complement our analysis with lower bounds; these bounds match our upper bounds up to constant 1 when $p\geq 2$. At the core of our techniques is an application of Riesz-Thorin interpolation theorem from harmonic analysis, which might be of independent interest to other algorithmic designs and analysis more broadly. Our analysis results in improvements on approximation guarantees of several other algorithms with various time complexity. For example, to make the algorithm of column subset selection computationally efficient, we analyze a polynomial time bi-criteria algorithm which selects $O(k\log m)$ number of columns. We show that this algorithm has an approximation ratio of $O((k+1)^{1/p})$ for $1\le p\le 2$ and $O((k+1)^{1-1/p})$ for $p\ge 2$. This improves over the bound in (Chierichetti et al. ,2017) with an $O(k+1)$ approximation ratio. Our bi-criteria algorithm also implies an exact-rank method in polynomial time with a slightly larger approximation ratio.

AAAI Conference 2018 Conference Paper

A Voting-Based System for Ethical Decision Making

  • Ritesh Noothigattu
  • Snehalkumar Gaikwad
  • Edmond Awad
  • Sohan Dsouza
  • Iyad Rahwan
  • Pradeep Ravikumar
  • Ariel Procaccia

We present a general approach to automating ethical decisions, drawing on machine learning and computational social choice. In a nutshell, we propose to learn a model of societal preferences, and, when faced with a specific ethical dilemma at runtime, efficiently aggregate those preferences to identify a desirable choice. We provide a concrete algorithm that instantiates our approach; some of its crucial steps are informed by a new theory of swap-dominance efficient voting rules. Finally, we implement and evaluate a system for ethical decision making in the autonomous vehicle domain, using preference data collected from 1. 3 million people through the Moral Machine website.

ICML Conference 2018 Conference Paper

Binary Classification with Karmic, Threshold-Quasi-Concave Metrics

  • Bowei Yan
  • Sanmi Koyejo
  • Kai Zhong
  • Pradeep Ravikumar

Complex performance measures, beyond the popular measure of accuracy, are increasingly being used in the context of binary classification. These complex performance measures are typically not even decomposable, that is, the loss evaluated on a batch of samples cannot typically be expressed as a sum or average of losses evaluated at individual samples, which in turn requires new theoretical and methodological developments beyond standard treatments of supervised learning. In this paper, we advance this understanding of binary classification for complex performance measures by identifying two key properties: a so-called Karmic property, and a more technical threshold-quasi-concavity property, which we show is milder than existing structural assumptions imposed on performance measures. Under these properties, we show that the Bayes optimal classifier is a threshold function of the conditional probability of positive class. We then leverage this result to come up with a computationally practical plug-in classifier, via a novel threshold estimator, and further, provide a novel statistical analysis of classification error with respect to complex performance measures.

NeurIPS Conference 2018 Conference Paper

Connecting Optimization and Regularization Paths

  • Arun Suggala
  • Adarsh Prasad
  • Pradeep Ravikumar

We study the implicit regularization properties of optimization techniques by explicitly connecting their optimization paths to the regularization paths of ``corresponding'' regularized problems. This surprising connection shows that iterates of optimization techniques such as gradient descent and mirror descent are \emph{pointwise} close to solutions of appropriately regularized objectives. While such a tight connection between optimization and regularization is of independent intellectual interest, it also has important implications for machine learning: we can port results from regularized estimators to optimization, and vice versa. We investigate one key consequence, that borrows from the well-studied analysis of regularized estimators, to then obtain tight excess risk bounds of the iterates generated by optimization techniques.

JMLR Journal 2018 Journal Article

Cost-Sensitive Learning with Noisy Labels

  • Nagarajan Natarajan
  • Inderjit S. Dhillon
  • Pradeep Ravikumar
  • Ambuj Tewari

We study binary classification in the presence of \emph{class- conditional} random noise, where the learner gets to see labels that are flipped independently with some probability, and where the flip probability depends on the class. Our goal is to devise learning algorithms that are efficient and statistically consistent with respect to commonly used utility measures. In particular, we look at a family of measures motivated by their application in domains where cost-sensitive learning is necessary (for example, when there is class imbalance). In contrast to most of the existing literature on consistent classification that are limited to the classical 0-1 loss, our analysis includes more general utility measures such as the AM measure (arithmetic mean of True Positive Rate and True Negative Rate). For this problem of cost-sensitive learning under class- conditional random noise, we develop two approaches that are based on suitably modifying surrogate losses. First, we provide a simple unbiased estimator of any loss, and obtain performance bounds for empirical utility maximization in the presence of i.i.d. data with noisy labels. If the loss function satisfies a simple symmetry condition, we show that using unbiased estimator leads to an efficient algorithm for empirical maximization. Second, by leveraging a reduction of risk minimization under noisy labels to classification with weighted 0-1 loss, we suggest the use of a simple weighted surrogate loss, for which we are able to obtain strong utility bounds. This approach implies that methods already used in practice, such as biased SVM and weighted logistic regression, are provably noise- tolerant. For two practically important measures in our family, we show that the proposed methods are competitive with respect to recently proposed methods for dealing with label noise in several benchmark data sets. [abs] [ pdf ][ bib ] &copy JMLR 2018. ( edit, beta )

NeurIPS Conference 2018 Conference Paper

DAGs with NO TEARS: Continuous Optimization for Structure Learning

  • Xun Zheng
  • Bryon Aragam
  • Pradeep Ravikumar
  • Eric Xing

Estimating the structure of directed acyclic graphs (DAGs, also known as Bayesian networks) is a challenging problem since the search space of DAGs is combinatorial and scales superexponentially with the number of nodes. Existing approaches rely on various local heuristics for enforcing the acyclicity constraint. In this paper, we introduce a fundamentally different strategy: we formulate the structure learning problem as a purely continuous optimization problem over real matrices that avoids this combinatorial constraint entirely. This is achieved by a novel characterization of acyclicity that is not only smooth but also exact. The resulting problem can be efficiently solved by standard numerical algorithms, which also makes implementation effortless. The proposed method outperforms existing ones, without imposing any structural assumptions on the graph such as bounded treewidth or in-degree.

ICML Conference 2018 Conference Paper

Deep Density Destructors

  • David I. Inouye
  • Pradeep Ravikumar

We propose a unified framework for deep density models by formally defining density destructors. A density destructor is an invertible function that transforms a given density to the uniform density—essentially destroying any structure in the original density. This destructive transformation generalizes Gaussianization via ICA and more recent autoregressive models such as MAF and Real NVP. Informally, this transformation can be seen as a generalized whitening procedure or a multivariate generalization of the univariate CDF function. Unlike Gaussianization, our destructive transformation has the elegant property that the density function is equal to the absolute value of the Jacobian determinant. Thus, each layer of a deep density can be seen as a shallow density—uncovering a fundamental connection between shallow and deep densities. In addition, our framework provides a common interface for all previous methods enabling them to be systematically combined, evaluated and improved. Leveraging the connection to shallow densities, we also propose a novel tree destructor based on tree densities and an image-specific destructor based on pixel locality. We illustrate our framework on a 2D dataset, MNIST, and CIFAR-10. Code is available on first author’s website.

ICML Conference 2018 Conference Paper

Loss Decomposition for Fast Learning in Large Output Spaces

  • Ian En-Hsu Yen
  • Satyen Kale
  • Felix X. Yu
  • Daniel Niels Holtmann-Rice
  • Sanjiv Kumar
  • Pradeep Ravikumar

For problems with large output spaces, evaluation of the loss function and its gradient are expensive, typically taking linear time in the size of the output space. Recently, methods have been developed to speed up learning via efficient data structures for Nearest-Neighbor Search (NNS) or Maximum Inner-Product Search (MIPS). However, the performance of such data structures typically degrades in high dimensions. In this work, we propose a novel technique to reduce the intractable high dimensional search problem to several much more tractable lower dimensional ones via dual decomposition of the loss function. At the same time, we demonstrate guaranteed convergence to the original loss via a greedy message passing procedure. In our experiments on multiclass and multilabel classification with hundreds of thousands of classes, as well as training skip-gram word embeddings with a vocabulary size of half a million, our technique consistently improves the accuracy of search-based gradient approximation methods and outperforms sampling-based gradient approximation methods by a large margin.

NeurIPS Conference 2018 Conference Paper

MixLasso: Generalized Mixed Regression via Convex Atomic-Norm Regularization

  • Ian En-Hsu Yen
  • Wei-Cheng Lee
  • Kai Zhong
  • Sung-En Chang
  • Pradeep Ravikumar
  • Shou-De Lin

We consider a generalization of mixed regression where the response is an additive combination of several mixture components. Standard mixed regression is a special case where each response is generated from exactly one component. Typical approaches to the mixture regression problem employ local search methods such as Expectation Maximization (EM) that are prone to spurious local optima. On the other hand, a number of recent theoretically-motivated \emph{Tensor-based methods} either have high sample complexity, or require the knowledge of the input distribution, which is not available in most of practical situations. In this work, we study a novel convex estimator \emph{MixLasso} for the estimation of generalized mixed regression, based on an atomic norm specifically constructed to regularize the number of mixture components. Our algorithm gives a risk bound that trades off between prediction accuracy and model sparsity without imposing stringent assumptions on the input/output distribution, and can be easily adapted to the case of non-linear functions. In our numerical experiments on mixtures of linear as well as nonlinear regressions, the proposed method yields high-quality solutions in a wider range of settings than existing approaches.

NeurIPS Conference 2018 Conference Paper

Representer Point Selection for Explaining Deep Neural Networks

  • Chih-Kuan Yeh
  • Joon Kim
  • Ian En-Hsu Yen
  • Pradeep Ravikumar

We propose to explain the predictions of a deep neural network, by pointing to the set of what we call representer points in the training set, for a given test point prediction. Specifically, we show that we can decompose the pre-activation prediction of a neural network into a linear combination of activations of training points, with the weights corresponding to what we call representer values, which thus capture the importance of that training point on the learned parameters of the network. But it provides a deeper understanding of the network than simply training point influence: with positive representer values corresponding to excitatory training points, and negative values corresponding to inhibitory points, which as we show provides considerably more insight. Our method is also much more scalable, allowing for real-time feedback in a manner not feasible with influence functions.

NeurIPS Conference 2018 Conference Paper

The Sample Complexity of Semi-Supervised Learning with Nonparametric Mixture Models

  • Chen Dan
  • Liu Leqi
  • Bryon Aragam
  • Pradeep Ravikumar
  • Eric Xing

We study the sample complexity of semi-supervised learning (SSL) and introduce new assumptions based on the mismatch between a mixture model learned from unlabeled data and the true mixture model induced by the (unknown) class conditional distributions. Under these assumptions, we establish an $\Omega(K\log K)$ labeled sample complexity bound without imposing parametric assumptions, where $K$ is the number of classes. Our results suggest that even in nonparametric settings it is possible to learn a near-optimal classifier using only a few labeled samples. Unlike previous theoretical work which focuses on binary classification, we consider general multiclass classification ($K>2$), which requires solving a difficult permutation learning problem. This permutation defines a classifier whose classification error is controlled by the Wasserstein distance between mixing measures, and we provide finite-sample results characterizing the behaviour of the excess risk of this classifier. Finally, we describe three algorithms for computing these estimators based on a connection to bipartite graph matching, and perform experiments to illustrate the superiority of the MLE over the majority vote estimator.

ICML Conference 2017 Conference Paper

Doubly Greedy Primal-Dual Coordinate Descent for Sparse Empirical Risk Minimization

  • Qi Lei
  • Ian En-Hsu Yen
  • Chao-Yuan Wu
  • Inderjit S. Dhillon
  • Pradeep Ravikumar

We consider the popular problem of sparse empirical risk minimization with linear predictors and a large number of both features and observations. With a convex-concave saddle point objective reformulation, we propose a Doubly Greedy Primal-Dual Coordinate Descent algorithm that is able to exploit sparsity in both primal and dual variables. It enjoys a low cost per iteration and our theoretical analysis shows that it converges linearly with a good iteration complexity, provided that the set of primal variables is sparse. We then extend this algorithm further to leverage active sets. The resulting new algorithm is even faster, and experiments on large-scale Multi-class data sets show that our algorithm achieves up to 30 times speedup on several state-of-the-art optimization methods.

ICML Conference 2017 Conference Paper

Latent Feature Lasso

  • Ian En-Hsu Yen
  • Wei-Cheng Lee
  • Sung-En Chang
  • Arun Sai Suggala
  • Shou-De Lin
  • Pradeep Ravikumar

The latent feature model (LFM), proposed in (Griffiths \& Ghahramani, 2005), but possibly with earlier origins, is a generalization of a mixture model, where each instance is generated not from a single latent class but from a combination of latent features. Thus, each instance has an associated latent binary feature incidence vector indicating the presence or absence of a feature. Due to its combinatorial nature, inference of LFMs is considerably intractable, and accordingly, most of the attention has focused on nonparametric LFMs, with priors such as the Indian Buffet Process (IBP) on infinite binary matrices. Recent efforts to tackle this complexity either still have computational complexity that is exponential, or sample complexity that is high-order polynomial w. r. t. the number of latent features. In this paper, we address this outstanding problem of tractable estimation of LFMs via a novel atomic-norm regularization, which gives an algorithm with polynomial run-time and sample complexity without impractical assumptions on the data distribution.

NeurIPS Conference 2017 Conference Paper

On Separability of Loss Functions, and Revisiting Discriminative Vs Generative Models

  • Adarsh Prasad
  • Alexandru Niculescu-Mizil
  • Pradeep Ravikumar

We revisit the classical analysis of generative vs discriminative models for general exponential families, and high-dimensional settings. Towards this, we develop novel technical machinery, including a notion of separability of general loss functions, which allow us to provide a general framework to obtain l∞ convergence rates for general M-estimators. We use this machinery to analyze l∞ and l2 convergence rates of generative and discriminative models, and provide insights into their nuanced behaviors in high-dimensions. Our results are also applicable to differential parameter estimation, where the quantity of interest is the difference between generative model parameters.

ICML Conference 2017 Conference Paper

Ordinal Graphical Models: A Tale of Two Approaches

  • Arun Sai Suggala
  • Eunho Yang
  • Pradeep Ravikumar

Undirected graphical models or Markov random fields (MRFs) are widely used for modeling multivariate probability distributions. Much of the work on MRFs has focused on continuous variables, and nominal variables (that is, unordered categorical variables). However, data from many real world applications involve ordered categorical variables also known as ordinal variables, e. g. , movie ratings on Netflix which can be ordered from 1 to 5 stars. With respect to univariate ordinal distributions, as we detail in the paper, there are two main categories of distributions; while there have been efforts to extend these to multivariate ordinal distributions, the resulting distributions are typically very complex, with either a large number of parameters, or with non-convex likelihoods. While there have been some work on tractable approximations, these do not come with strong statistical guarantees, and moreover are relatively computationally expensive. In this paper, we theoretically investigate two classes of graphical models for ordinal data, corresponding to the two main categories of univariate ordinal distributions. In contrast to previous work, our theoretical developments allow us to provide correspondingly two classes of estimators that are not only computationally efficient but also have strong statistical guarantees.

NeurIPS Conference 2017 Conference Paper

The Expxorcist: Nonparametric Graphical Models Via Conditional Exponential Densities

  • Arun Suggala
  • Mladen Kolar
  • Pradeep Ravikumar

Non-parametric multivariate density estimation faces strong statistical and computational bottlenecks, and the more practical approaches impose near-parametric assumptions on the form of the density functions. In this paper, we leverage recent developments to propose a class of non-parametric models which have very attractive computational and statistical properties. Our approach relies on the simple function space assumption that the conditional distribution of each variable conditioned on the other variables has a non-parametric exponential family form.

ICML Conference 2016 Conference Paper

A Convex Atomic-Norm Approach to Multiple Sequence Alignment and Motif Discovery

  • Ian En-Hsu Yen
  • Xin Lin
  • Jiong Zhang 0001
  • Pradeep Ravikumar
  • Inderjit S. Dhillon

Multiple Sequence Alignment and Motif Discovery, known as NP-hard problems, are two fundamental tasks in Bioinformatics. Existing approaches to these two problems are based on either local search methods such as Expectation Maximization (EM), Gibbs Sampling or greedy heuristic methods. In this work, we develop a convex relaxation approach to both problems based on the recent concept of atomic norm and develop a new algorithm, termed Greedy Direction Method of Multiplier, for solving the convex relaxation with two convex atomic constraints. Experiments show that our convex relaxation approach produces solutions of higher quality than those standard tools widely-used in Bioinformatics community on the Multiple Sequence Alignment and Motif Discovery problems.

NeurIPS Conference 2016 Conference Paper

Dual Decomposed Learning with Factorwise Oracle for Structural SVM of Large Output Domain

  • Ian En-Hsu Yen
  • Xiangru Huang
  • Kai Zhong
  • Ruohan Zhang
  • Pradeep Ravikumar
  • Inderjit Dhillon

Many applications of machine learning involve structured output with large domain, where learning of structured predictor is prohibitive due to repetitive calls to expensive inference oracle. In this work, we show that, by decomposing training of Structural Support Vector Machine (SVM) into a series of multiclass SVM problems connected through messages, one can replace expensive structured oracle with Factorwise Maximization Oracle (FMO) that allows efficient implementation of complexity sublinear to the factor domain. A Greedy Direction Method of Multiplier (GDMM) algorithm is proposed to exploit sparsity of messages which guarantees $\epsilon$ sub-optimality after $O(log(1/\epsilon))$ passes of FMO calls. We conduct experiments on chain-structured problems and fully-connected problems of large output domains. The proposed approach is orders-of-magnitude faster than the state-of-the-art training algorithms for Structural SVM.

ICML Conference 2016 Conference Paper

Optimal Classification with Multivariate Losses

  • Nagarajan Natarajan
  • Sanmi Koyejo
  • Pradeep Ravikumar
  • Inderjit S. Dhillon

Multivariate loss functions are extensively employed in several prediction tasks arising in Information Retrieval. Often, the goal in the tasks is to minimize expected loss when retrieving relevant items from a presented set of items, where the expectation is with respect to the joint distribution over item sets. Our key result is that for most multivariate losses, the expected loss is provably optimized by sorting the items by the conditional probability of label being positive and then selecting top k items. Such a result was previously known only for the F-measure. Leveraging on the optimality characterization, we give an algorithm for estimating optimal predictions in practice with runtime quadratic in size of item sets for many losses. We provide empirical results on benchmark datasets, comparing the proposed algorithm to state-of-the-art methods for optimizing multivariate losses.

ICML Conference 2016 Conference Paper

PD-Sparse: A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification

  • Ian En-Hsu Yen
  • Xiangru Huang
  • Pradeep Ravikumar
  • Kai Zhong
  • Inderjit S. Dhillon

We consider Multiclass and Multilabel classification with extremely large number of classes, of which only few are labeled to each instance. In such setting, standard methods that have training, prediction cost linear to the number of classes become intractable. State-of-the-art methods thus aim to reduce the complexity by exploiting correlation between labels under assumption that the similarity between labels can be captured by structures such as low-rank matrix or balanced tree. However, as the diversity of labels increases in the feature space, structural assumption can be easily violated, which leads to degrade in the testing performance. In this work, we show that a margin-maximizing loss with l1 penalty, in case of Extreme Classification, yields extremely sparse solution both in primal and in dual without sacrificing the expressive power of predictor. We thus propose a Fully-Corrective Block-Coordinate Frank-Wolfe (FC-BCFW) algorithm that exploits both primal and dual sparsity to achieve a complexity sublinear to the number of primal and dual variables. A bi-stochastic search method is proposed to further improve the efficiency. In our experiments on both Multiclass and Multilabel problems, the proposed method achieves significant higher accuracy than existing approaches of Extreme Classification with very competitive training and prediction time.

ICML Conference 2016 Conference Paper

Square Root Graphical Models: Multivariate Generalizations of Univariate Exponential Families that Permit Positive Dependencies

  • David I. Inouye
  • Pradeep Ravikumar
  • Inderjit S. Dhillon

We develop Square Root Graphical Models (SQR), a novel class of parametric graphical models that provides multivariate generalizations of univariate exponential family distributions. Previous multivariate graphical models [Yang et al. 2015] did not allow positive dependencies for the exponential and Poisson generalizations. However, in many real-world datasets, variables clearly have positive dependencies. For example, the airport delay time in New York—modeled as an exponential distribution—is positively related to the delay time in Boston. With this motivation, we give an example of our model class derived from the univariate exponential distribution that allows for almost arbitrary positive and negative dependencies with only a mild condition on the parameter matrix—a condition akin to the positive definiteness of the Gaussian covariance matrix. Our Poisson generalization allows for both positive and negative dependencies without any constraints on the parameter values. We also develop parameter estimation methods using node-wise regressions with \ell_1 regularization and likelihood approximation methods using sampling. Finally, we demonstrate our exponential generalization on a synthetic dataset and a real-world dataset of airport delay times.

ICML Conference 2015 Conference Paper

A Convex Exemplar-based Approach to MAD-Bayes Dirichlet Process Mixture Models

  • Ian En-Hsu Yen
  • Xin Lin
  • Kai Zhong
  • Pradeep Ravikumar
  • Inderjit S. Dhillon

MAD-Bayes (MAP-based Asymptotic Derivations) has been recently proposed as a general technique to derive scalable algorithm for Bayesian Nonparametric models. However, the combinatorial nature of objective functions derived from MAD-Bayes results in hard optimization problem, for which current practice employs heuristic algorithms analogous to k-means to find local minimum. In this paper, we consider the exemplar-based version of MAD-Bayes formulation for DP and Hierarchical DP (HDP) mixture model. We show that an exemplar-based MAD-Bayes formulation can be relaxed to a convex structural-regularized program that, under cluster-separation conditions, shares the same optimal solution to its combinatorial counterpart. An algorithm based on Alternating Direction Method of Multiplier (ADMM) is then proposed to solve such program. In our experiments on several benchmark data sets, the proposed method finds optimal solution of the combinatorial problem and significantly improves existing methods in terms of the exemplar-based objective.

NeurIPS Conference 2015 Conference Paper

Beyond Sub-Gaussian Measurements: High-Dimensional Structured Estimation with Sub-Exponential Designs

  • Vidyashankar Sivakumar
  • Arindam Banerjee
  • Pradeep Ravikumar

We consider the problem of high-dimensional structured estimation with norm-regularized estimators, such as Lasso, when the design matrix and noise are drawn from sub-exponential distributions. Existing results only consider sub-Gaussian designs and noise, and both the sample complexity and non-asymptotic estimation error have been shown to depend on the Gaussian width of suitable sets. In contrast, for the sub-exponential setting, we show that the sample complexity and the estimation error will depend on the exponential width of the corresponding sets, and the analysis holds for any norm. Further, using generic chaining, we show that the exponential width for any set will be at most $\sqrt{\log p}$ times the Gaussian width of the set, yielding Gaussian width based results even for the sub-exponential case. Further, for certain popular estimators, viz Lasso and Group Lasso, using a VC-dimension based analysis, we show that the sample complexity will in fact be the same order as Gaussian designs. Our general analysis and results are the first in the sub-exponential setting, and are readily applicable to special sub-exponential families such as log-concave and extreme-value distributions.

NeurIPS Conference 2015 Conference Paper

Closed-form Estimators for High-dimensional Generalized Linear Models

  • Eunho Yang
  • Aurelie Lozano
  • Pradeep Ravikumar

We propose a class of closed-form estimators for GLMs under high-dimensional sampling regimes. Our class of estimators is based on deriving closed-form variants of the vanilla unregularized MLE but which are (a) well-defined even under high-dimensional settings, and (b) available in closed-form. We then perform thresholding operations on this MLE variant to obtain our class of estimators. We derive a unified statistical analysis of our class of estimators, and show that it enjoys strong statistical guarantees in both parameter error as well as variable selection, that surprisingly match those of the more complex regularized GLM MLEs, even while our closed-form estimators are computationally much simpler. We derive instantiations of our class of closed-form estimators, as well as corollaries of our general theorem, for the special cases of logistic, exponential and Poisson regression models. We corroborate the surprising statistical and computational performance of our class of estimators via extensive simulations.

NeurIPS Conference 2015 Conference Paper

Collaborative Filtering with Graph Information: Consistency and Scalable Methods

  • Nikhil Rao
  • Hsiang-Fu Yu
  • Pradeep Ravikumar
  • Inderjit Dhillon

Low rank matrix completion plays a fundamental role in collaborative filtering applications, the key idea being that the variables lie in a smaller subspace than the ambient space. Often, additional information about the variables is known, and it is reasonable to assume that incorporating this information will lead to better predictions. We tackle the problem of matrix completion when pairwise relationships among variables are known, via a graph. We formulate and derive a highly efficient, conjugate gradient based alternating minimization scheme that solves optimizations with over 55 million observations up to 2 orders of magnitude faster than state-of-the-art (stochastic) gradient-descent based methods. On the theoretical front, we show that such methods generalize weighted nuclear norm formulations, and derive statistical consistency guarantees. We validate our results on both real and synthetic datasets.

NeurIPS Conference 2015 Conference Paper

Consistent Multilabel Classification

  • Oluwasanmi Koyejo
  • Nagarajan Natarajan
  • Pradeep Ravikumar
  • Inderjit Dhillon

Multilabel classification is rapidly developing as an important aspect of modern predictive modeling, motivating study of its theoretical aspects. To this end, we propose a framework for constructing and analyzing multilabel classification metrics which reveals novel results on a parametric form for population optimal classifiers, and additional insight into the role of label correlations. In particular, we show that for multilabel metrics constructed as instance-, micro- and macro-averages, the population optimal classifier can be decomposed into binary classifiers based on the marginal instance-conditional distribution of each label, with a weak association between labels via the threshold. Thus, our analysis extends the state of the art from a few known multilabel classification metrics such as Hamming loss, to a general framework applicable to many of the classification metrics in common use. Based on the population-optimal classifier, we propose a computationally efficient and general-purpose plug-in classification algorithm, and prove its consistency with respect to the metric of interest. Empirical results on synthetic and benchmark datasets are supportive of our theoretical findings.

ICML Conference 2015 Conference Paper

Distributional Rank Aggregation, and an Axiomatic Analysis

  • Adarsh Prasad
  • Harsh H. Pareek
  • Pradeep Ravikumar

The rank aggregation problem has been studied with varying desiderata in varied communities such as Theoretical Computer Science, Statistics, Information Retrieval and Social Welfare Theory. We introduce a variant of this problem we call distributional rank aggregation, where the ranking data is only available via the induced distribution over the set of all permutations. We provide a novel translation of the usual social welfare theory axioms to this setting. As we show this allows for a more quantitative characterization of these axioms: which then are not only less prone to misinterpretation, but also allow simpler proofs for some key impossibility theorems. Most importantly, these quantitative characterizations lead to natural and novel relaxations of these axioms, which as we show, allow us to get around celebrated impossibility results in social choice theory. We are able to completely characterize the class of positional scoring rules with respect to our axioms and show that Borda Count is optimal in a certain sense.

NeurIPS Conference 2015 Conference Paper

Fast Classification Rates for High-dimensional Gaussian Generative Models

  • Tianyang Li
  • Adarsh Prasad
  • Pradeep Ravikumar

We consider the problem of binary classification when the covariates conditioned on the each of the response values follow multivariate Gaussian distributions. We focus on the setting where the covariance matrices for the two conditional distributions are the same. The corresponding generative model classifier, derived via the Bayes rule, also called Linear Discriminant Analysis, has been shown to behave poorly in high-dimensional settings. We present a novel analysis of the classification error of any linear discriminant approach given conditional Gaussian models. This allows us to compare the generative model classifier, other recently proposed discriminative approaches that directly learn the discriminant function, and then finally logistic regression which is another classical discriminative model classifier. As we show, under a natural sparsity assumption, and letting $s$ denote the sparsity of the Bayes classifier, $p$ the number of covariates, and $n$ the number of samples, the simple ($\ell_1$-regularized) logistic regression classifier achieves the fast misclassification error rates of $O\left(\frac{s \log p}{n}\right)$, which is much better than the other approaches, which are either inconsistent under high-dimensional settings, or achieve a slower rate of $O\left(\sqrt{\frac{s \log p}{n}}\right)$.

NeurIPS Conference 2015 Conference Paper

Fixed-Length Poisson MRF: Adding Dependencies to the Multinomial

  • David Inouye
  • Pradeep Ravikumar
  • Inderjit Dhillon

We propose a novel distribution that generalizes the Multinomial distribution to enable dependencies between dimensions. Our novel distribution is based on the parametric form of the Poisson MRF model [Yang et al. , 2012] but is fundamentally different because of the domain restriction to a fixed-length vector like in a Multinomial where the number of trials is fixed or known. Thus, we propose the Fixed-Length Poisson MRF (LPMRF) distribution. We develop methods to estimate the likelihood and log partition function (i. e. the log normalizing constant), which was not developed for the Poisson MRF model. In addition, we propose novel mixture and topic models that use LPMRF as a base distribution and discuss the similarities and differences with previous topic models such as the recently proposed Admixture of Poisson MRFs [Inouye et al. , 2014]. We show the effectiveness of our LPMRF distribution over Multinomial models by evaluating the test set perplexity on a dataset of abstracts and Wikipedia. Qualitatively, we show that the positive dependencies discovered by LPMRF are interesting and intuitive. Finally, we show that our algorithms are fast and have good scaling (code available online).

JMLR Journal 2015 Journal Article

Graphical Models via Univariate Exponential Family Distributions

  • Eunho Yang
  • Pradeep Ravikumar
  • Genevera I. Allen
  • Zhandong Liu

Undirected graphical models, or Markov networks, are a popular class of statistical models, used in a wide variety of applications. Popular instances of this class include Gaussian graphical models and Ising models. In many settings, however, it might not be clear which subclass of graphical models to use, particularly for non-Gaussian and non-categorical data. In this paper, we consider a general sub-class of graphical models where the node-wise conditional distributions arise from exponential families. This allows us to derive multivariate graphical model distributions from univariate exponential family distributions, such as the Poisson, negative binomial, and exponential distributions. Our key contributions include a class of M-estimators to fit these graphical model distributions; and rigorous statistical analysis showing that these M-estimators recover the true graphical model structure exactly, with high probability. We provide examples of genomic and proteomic networks learned via instances of our class of graphical models derived from Poisson and exponential distributions. [abs] [ pdf ][ bib ] &copy JMLR 2015. ( edit, beta )

NeurIPS Conference 2015 Conference Paper

Sparse Linear Programming via Primal and Dual Augmented Coordinate Descent

  • Ian En-Hsu Yen
  • Kai Zhong
  • Cho-Jui Hsieh
  • Pradeep Ravikumar
  • Inderjit Dhillon

Over the past decades, Linear Programming (LP) has been widely used in different areas and considered as one of the mature technologies in numerical optimization. However, the complexity offered by state-of-the-art algorithms (i. e. interior-point method and primal, dual simplex methods) is still unsatisfactory for problems in machine learning with huge number of variables and constraints. In this paper, we investigate a general LP algorithm based on the combination of Augmented Lagrangian and Coordinate Descent (AL-CD), giving an iteration complexity of $O((\log(1/\epsilon))^2)$ with $O(nnz(A))$ cost per iteration, where $nnz(A)$ is the number of non-zeros in the $m\times n$ constraint matrix $A$, and in practice, one can further reduce cost per iteration to the order of non-zeros in columns (rows) corresponding to the active primal (dual) variables through an active-set strategy. The algorithm thus yields a tractable alternative to standard LP methods for large-scale problems of sparse solutions and $nnz(A)\ll mn$. We conduct experiments on large-scale LP instances from $\ell_1$-regularized multi-class SVM, Sparse Inverse Covariance Estimation, and Nonnegative Matrix Factorization, where the proposed approach finds solutions of $10^{-3}$ precision orders of magnitude faster than state-of-the-art implementations of interior-point and simplex methods.

UAI Conference 2015 Conference Paper

Tracking with ranked signals

  • Tianyang Li
  • Harsh H. Pareek
  • Pradeep Ravikumar
  • Dhruv Balwada
  • Kevin Speer

We present a novel graphical model approach for a problem not previously considered in the machine learning literature: that of tracking with ranked signals. The problem consists of tracking a single target given observations about the target that consist of ranked continuous signals, from unlabeled sources in a cluttered environment. We introduce appropriate factors to handle the imposed ordering assumption, and also incorporate various systematic errors that can arise in this problem, particularly clutter or noise signals as well as missing signals. We show that inference in the obtained graphical model can be simplified by adding bipartite structures with appropriate factors. We apply a hybrid approach consisting of belief propagation and particle filtering in this mixed graphical model for inference and validate the approach on simulated data. We were motivated to formalize and study this problem by a key task in Oceanography, that of tracking the motion of RAFOS ocean floats, using range measurements sent from a set of fixed beacons, but where the identities of the beacons corresponding to the measurements are not known. However, unlike the usual tracking problem in artificial intelligence, there is an implicit ranking assumption among signal arrival times. Our experiments show that the proposed graphical model approach allows us to effectively leverage the problem constraints and improve tracking accuracy over baseline tracking methods yielding results similar to the ground truth hand-labeled data.

ICML Conference 2015 Conference Paper

Vector-Space Markov Random Fields via Exponential Families

  • Wesley Tansey
  • Oscar Hernan Madrid Padilla
  • Arun Sai Suggala
  • Pradeep Ravikumar

We present Vector-Space Markov Random Fields (VS-MRFs), a novel class of undirected graphical models where each variable can belong to an arbitrary vector space. VS-MRFs generalize a recent line of work on scalar-valued, uni-parameter exponential family and mixed graphical models, thereby greatly broadening the class of exponential families available (e. g. , allowing multinomial and Dirichlet distributions). Specifically, VS-MRFs are the joint graphical model distributions where the node-conditional distributions belong to generic exponential families with general vector space domains. We also present a sparsistent M-estimator for learning our class of MRFs that recovers the correct set of edges with high probability. We validate our approach via a set of synthetic data experiments as well as a real-world case study of over four million foods from the popular diet tracking app MyFitnessPal. Our results demonstrate that our algorithm performs well empirically and that VS-MRFs are capable of capturing and highlighting interesting structure in complex, real-world data. All code for our algorithm is open source and publicly available.

NeurIPS Conference 2014 Conference Paper

A Representation Theory for Ranking Functions

  • Harsh Pareek
  • Pradeep Ravikumar

This paper presents a representation theory for permutation-valued functions, which in their general form can also be called listwise ranking functions. Pointwise ranking functions assign a score to each object independently, without taking into account the other objects under consideration; whereas listwise loss functions evaluate the set of scores assigned to all objects as a whole. In many supervised learning to rank tasks, it might be of interest to use listwise ranking functions instead; in particular, the Bayes Optimal ranking functions might themselves be listwise, especially if the loss function is listwise. A key caveat to using listwise ranking functions has been the lack of an appropriate representation theory for such functions. We show that a natural symmetricity assumption that we call exchangeability allows us to explicitly characterize the set of such exchangeable listwise ranking functions. Our analysis draws from the theories of tensor analysis, functional analysis and De Finetti theorems. We also present experiments using a novel reranking method motivated by our representation theory.

ICML Conference 2014 Conference Paper

Admixture of Poisson MRFs: A Topic Model with Word Dependencies

  • David I. Inouye
  • Pradeep Ravikumar
  • Inderjit S. Dhillon

This paper introduces a new topic model based on an admixture of Poisson Markov Random Fields (APM), which can model dependencies between words as opposed to previous independent topic models such as PLSA (Hofmann, 1999), LDA (Blei et al. , 2003) or SAM (Reisinger et al. , 2010). We propose a class of admixture models that generalizes previous topic models and show an equivalence between the conditional distribution of LDA and independent Poissons—suggesting that APM subsumes the modeling power of LDA. We present a tractable method for estimating the parameters of an APM based on the pseudo log-likelihood and demonstrate the benefits of APM over previous models by preliminary qualitative and quantitative experiments.

NeurIPS Conference 2014 Conference Paper

Capturing Semantically Meaningful Word Dependencies with an Admixture of Poisson MRFs

  • David Inouye
  • Pradeep Ravikumar
  • Inderjit Dhillon

We develop a fast algorithm for the Admixture of Poisson MRFs (APM) topic model and propose a novel metric to directly evaluate this model. The APM topic model recently introduced by Inouye et al. (2014) is the first topic model that allows for word dependencies within each topic unlike in previous topic models like LDA that assume independence between words within a topic. Research in both the semantic coherence of a topic models (Mimno et al. 2011, Newman et al. 2010) and measures of model fitness (Mimno & Blei 2011) provide strong support that explicitly modeling word dependencies---as in APM---could be both semantically meaningful and essential for appropriately modeling real text data. Though APM shows significant promise for providing a better topic model, APM has a high computational complexity because $O(p^2)$ parameters must be estimated where $p$ is the number of words (Inouye et al. could only provide results for datasets with $p = 200$). In light of this, we develop a parallel alternating Newton-like algorithm for training the APM model that can handle $p = 10^4$ as an important step towards scaling to large datasets. In addition, Inouye et al. only provided tentative and inconclusive results on the utility of APM. Thus, motivated by simple intuitions and previous evaluations of topic models, we propose a novel evaluation metric based on human evocation scores between word pairs (i. e. how much one word brings to mind" another word (Boyd-Graber et al. 2006)). We provide compelling quantitative and qualitative results on the BNC corpus that demonstrate the superiority of APM over previous topic models for identifying semantically meaningful word dependencies. (MATLAB code available at: http: //bigdata. ices. utexas. edu/software/apm/)"

NeurIPS Conference 2014 Conference Paper

Consistent Binary Classification with Generalized Performance Metrics

  • Oluwasanmi Koyejo
  • Nagarajan Natarajan
  • Pradeep Ravikumar
  • Inderjit Dhillon

Performance metrics for binary classification are designed to capture tradeoffs between four fundamental population quantities: true positives, false positives, true negatives and false negatives. Despite significant interest from theoretical and applied communities, little is known about either optimal classifiers or consistent algorithms for optimizing binary classification performance metrics beyond a few special cases. We consider a fairly large family of performance metrics given by ratios of linear combinations of the four fundamental population quantities. This family includes many well known binary classification metrics such as classification accuracy, AM measure, F-measure and the Jaccard similarity coefficient as special cases. Our analysis identifies the optimal classifiers as the sign of the thresholded conditional probability of the positive class, with a performance metric-dependent threshold. The optimal threshold can be constructed using simple plug-in estimators when the performance metric is a linear combination of the population quantities, but alternative techniques are required for the general case. We propose two algorithms for estimating the optimal classifiers, and prove their statistical consistency. Both algorithms are straightforward modifications of standard approaches to address the key challenge of optimal threshold selection, thus are simple to implement in practice. The first algorithm combines a plug-in estimate of the conditional probability of the positive class with optimal threshold selection. The second algorithm leverages recent work on calibrated asymmetric surrogate losses to construct candidate classifiers. We present empirical comparisons between these algorithms on benchmark datasets.

NeurIPS Conference 2014 Conference Paper

Constant Nullspace Strong Convexity and Fast Convergence of Proximal Methods under High-Dimensional Settings

  • Ian En-Hsu Yen
  • Cho-Jui Hsieh
  • Pradeep Ravikumar
  • Inderjit Dhillon

State of the art statistical estimators for high-dimensional problems take the form of regularized, and hence non-smooth, convex programs. A key facet of thesestatistical estimation problems is that these are typically not strongly convex under a high-dimensional sampling regime when the Hessian matrix becomes rank-deficient. Under vanilla convexity however, proximal optimization methods attain only a sublinear rate. In this paper, we investigate a novel variant of strong convexity, which we call Constant Nullspace Strong Convexity (CNSC), where we require that the objective function be strongly convex only over a constant subspace. As we show, the CNSC condition is naturally satisfied by high-dimensional statistical estimators. We then analyze the behavior of proximal methods under this CNSC condition: we show global linear convergence of Proximal Gradient and local quadratic convergence of Proximal Newton Method, when the regularization function comprising the statistical estimator is decomposable. We corroborate our theory via numerical experiments, and show a qualitative difference in the convergence rates of the proximal algorithms when the loss function does satisfy the CNSC condition.

NeurIPS Conference 2014 Conference Paper

Elementary Estimators for Graphical Models

  • Eunho Yang
  • Aurelie Lozano
  • Pradeep Ravikumar

We propose a class of closed-form estimators for sparsity-structured graphical models, expressed as exponential family distributions, under high-dimensional settings. Our approach builds on observing the precise manner in which the classical graphical model MLE ``breaks down'' under high-dimensional settings. Our estimator uses a carefully constructed, well-defined and closed-form backward map, and then performs thresholding operations to ensure the desired sparsity structure. We provide a rigorous statistical analysis that shows that surprisingly our simple class of estimators recovers the same asymptotic convergence rates as those of the $\ell_1$-regularized MLEs that are much more difficult to compute. We corroborate this statistical performance, as well as significant computational advantages via simulations of both discrete and Gaussian graphical models.

ICML Conference 2014 Conference Paper

Elementary Estimators for High-Dimensional Linear Regression

  • Eunho Yang
  • Aurélie C. Lozano
  • Pradeep Ravikumar

We consider the problem of structurally constrained high-dimensional linear regression. This has attracted considerable attention over the last decade, with state of the art statistical estimators based on solving regularized convex programs. While these typically non-smooth convex programs can be solved in polynomial time, scaling the state of the art optimization methods to very large-scale problems is an ongoing and rich area of research. In this paper, we attempt to address this scaling issue at the source, by asking whether one can build \emphsimpler possibly closed-form estimators, that yet come with statistical guarantees that are nonetheless comparable to regularized likelihood estimators! We answer this question in the affirmative, with variants of the classical ridge and OLS (ordinary least squares estimators) for linear regression. We analyze our estimators in the high-dimensional setting, and moreover provide empirical corroboration of its performance on simulated as well as real world microarray data.

ICML Conference 2014 Conference Paper

Elementary Estimators for Sparse Covariance Matrices and other Structured Moments

  • Eunho Yang
  • Aurélie C. Lozano
  • Pradeep Ravikumar

We consider the problem of estimating distributional parameters that are expected values of given feature functions. We are interested in recovery under high-dimensional regimes, where the number of variables p is potentially larger than the number of samples n, and where we need to impose structural constraints upon the parameters. In a natural distributional setting for this problem, the feature functions comprise the sufficient statistics of an exponential family, so that the problem would entail estimating structured moments of exponential family distributions. A special case of the above involves estimating the covariance matrix of a random vector, and where the natural distributional setting would correspond to the multivariate Gaussian distribution. Unlike the inverse covariance estimation case, we show that the regularized MLEs for covariance estimation, as well as natural Dantzig variants, are \emphnon-convex, even when the regularization functions themselves are convex; with the same holding for the general structured moment case. We propose a class of elementary convex estimators, that in many cases are available in \emphclosed-form, for estimating general structured moments. We then provide a unified statistical analysis of our class of estimators. Finally, we demonstrate the applicability of our class of estimators on real-world climatology and biology datasets.

ICML Conference 2014 Conference Paper

Exponential Family Matrix Completion under Structural Constraints

  • Suriya Gunasekar
  • Pradeep Ravikumar
  • Joydeep Ghosh

We consider the matrix completion problem of recovering a structured matrix from noisy and partial measurements. Recent works have proposed tractable estimators with strong statistical guarantees for the case where the underlying matrix is low–rank, and the measurements consist of a subset, either of the exact individual entries, or of the entries perturbed by additive Gaussian noise, which is thus implicitly suited for thin–tailed continuous data. Arguably, common applications of matrix completion require estimators for (a) heterogeneous data–types, such as skewed–continuous, count, binary, etc. , (b) for heterogeneous noise models (beyond Gaussian), which capture varied uncertainty in the measurements, and (c) heterogeneous structural constraints beyond low–rank, such as block–sparsity, or a superposition structure of low–rank plus elementwise sparseness, among others. In this paper, we provide a vastly unified framework for generalized matrix completion by considering a matrix completion setting wherein the matrix entries are sampled from any member of the rich family of \textitexponential family distributions; and impose general structural constraints on the underlying matrix, as captured by a general regularizer \mathcalR(.). We propose a simple convex regularized M–estimator for the generalized framework, and provide a unified and novel statistical analysis for this general class of estimators. We finally corroborate our theoretical results on simulated datasets.

ICML Conference 2014 Conference Paper

Learning Graphs with a Few Hubs

  • Rashish Tandon
  • Pradeep Ravikumar

We consider the problem of recovering the graph structure of a “hub-networked” Ising model given iid samples, under high-dimensional settings, where number of nodes p could be potentially larger than the number of samples n. By a “hub-networked” graph, we mean a graph with a few “hub nodes” with very large degrees. State of the art estimators for Ising models have a sample complexity that scales polynomially with the maximum node-degree, and are thus ill-suited to recovering such graphs with a few hub nodes. Some recent proposals for specifically recovering hub graphical models do not come with theoretical guarantees, and even empirically provide limited improvements over vanilla Ising model estimators. Here, we show that under such low sample settings, instead of estimating “difficult” components such as hub-neighborhoods, we can use quantitative indicators of our inability to do so, and thereby identify hub-nodes. This simple procedure allows us to recover hub-networked graphs with very strong statistical guarantees even under very low sample settings.

NeurIPS Conference 2014 Conference Paper

On the Information Theoretic Limits of Learning Ising Models

  • Rashish Tandon
  • Karthikeyan Shanmugam
  • Pradeep Ravikumar
  • Alexandros Dimakis

We provide a general framework for computing lower-bounds on the sample complexity of recovering the underlying graphs of Ising models, given i. i. d. samples. While there have been recent results for specific graph classes, these involve fairly extensive technical arguments that are specialized to each specific graph class. In contrast, we isolate two key graph-structural ingredients that can then be used to specify sample complexity lower-bounds. Presence of these structural properties makes the graph class hard to learn. We derive corollaries of our main result that not only recover existing recent results, but also provide lower bounds for novel graph classes not considered previously. We also extend our framework to the random graph setting and derive corollaries for Erdos-Renyi graphs in a certain dense setting.

NeurIPS Conference 2014 Conference Paper

Proximal Quasi-Newton for Computationally Intensive L1-regularized M-estimators

  • Kai Zhong
  • Ian En-Hsu Yen
  • Inderjit Dhillon
  • Pradeep Ravikumar

We consider the class of optimization problems arising from computationally intensive L1-regularized M-estimators, where the function or gradient values are very expensive to compute. A particular instance of interest is the L1-regularized MLE for learning Conditional Random Fields (CRFs), which are a popular class of statistical models for varied structured prediction problems such as sequence labeling, alignment, and classification with label taxonomy. L1-regularized MLEs for CRFs are particularly expensive to optimize since computing the gradient values requires an expensive inference step. In this work, we propose the use of a carefully constructed proximal quasi-Newton algorithm for such computationally intensive M-estimation problems, where we employ an aggressive active set selection technique. In a key contribution of the paper, we show that our proximal quasi-Newton algorithm is provably super-linearly convergent, even in the absence of strong convexity, by leveraging a restricted variant of strong convexity. In our experiments, the proposed algorithm converges considerably faster than current state-of-the-art on the problems of sequence labeling and hierarchical classification.

NeurIPS Conference 2014 Conference Paper

QUIC & DIRTY: A Quadratic Approximation Approach for Dirty Statistical Models

  • Cho-Jui Hsieh
  • Inderjit Dhillon
  • Pradeep Ravikumar
  • Stephen Becker
  • Peder Olsen

In this paper, we develop a family of algorithms for optimizing superposition-structured” or “dirty” statistical estimators for high-dimensional problems involving the minimization of the sum of a smooth loss function with a hybrid regularization. Most of the current approaches are first-order methods, including proximal gradient or Alternating Direction Method of Multipliers (ADMM). We propose a new family of second-order methods where we approximate the loss function using quadratic approximation. The superposition structured regularizer then leads to a subproblem that can be efficiently solved by alternating minimization. We propose a general active subspace selection approach to speed up the solver by utilizing the low-dimensional structure given by the regularizers, and provide convergence guarantees for our algorithm. Empirically, we show that our approach is more than 10 times faster than state-of-the-art first-order approaches for the latent variable graphical model selection problems and multi-task learning problems when there is more than one regularizer. For these problems, our approach appears to be the first algorithm that can extend active subspace ideas to multiple regularizers. "

JMLR Journal 2014 Journal Article

QUIC: Quadratic Approximation for Sparse Inverse Covariance Estimation

  • Cho-Jui Hsieh
  • Mátyás A. Sustik
  • Inderjit S. Dhillon
  • Pradeep Ravikumar

The $\ell_1$-regularized Gaussian maximum likelihood estimator (MLE) has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix, or alternatively the underlying graph structure of a Gaussian Markov Random Field, from very limited samples. We propose a novel algorithm for solving the resulting optimization problem which is a regularized log-determinant program. In contrast to recent state-of-the-art methods that largely use first order gradient information, our algorithm is based on Newton's method and employs a quadratic approximation, but with some modifications that leverage the structure of the sparse Gaussian MLE problem. We show that our method is superlinearly convergent, and present experimental results using synthetic and real-world application data that demonstrate the considerable improvements in performance of our method when compared to previous methods. [abs] [ pdf ][ bib ] &copy JMLR 2014. ( edit, beta )

NeurIPS Conference 2014 Conference Paper

Sparse Random Feature Algorithm as Coordinate Descent in Hilbert Space

  • Ian En-Hsu Yen
  • Ting-Wei Lin
  • Shou-De Lin
  • Pradeep Ravikumar
  • Inderjit Dhillon

In this paper, we propose a Sparse Random Feature algorithm, which learns a sparse non-linear predictor by minimizing an $\ell_1$-regularized objective function over the Hilbert Space induced from kernel function. By interpreting the algorithm as Randomized Coordinate Descent in the infinite-dimensional space, we show the proposed approach converges to a solution comparable within $\eps$-precision to exact kernel method by drawing $O(1/\eps)$ number of random features, contrasted to the $O(1/\eps^2)$-type convergence achieved by Monte-Carlo analysis in current Random Feature literature. In our experiments, the Sparse Random Feature algorithm obtains sparse solution that requires less memory and prediction time while maintains comparable performance on tasks of regression and classification. In the meantime, as an approximate solver for infinite-dimensional $\ell_1$-regularized problem, the randomized approach converges to better solution than Boosting approach when the greedy step of Boosting cannot be performed exactly.

NeurIPS Conference 2013 Conference Paper

BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

  • Cho-Jui Hsieh
  • Matyas Sustik
  • Inderjit Dhillon
  • Pradeep Ravikumar
  • Russell Poldrack

The l1-regularized Gaussian maximum likelihood estimator (MLE) has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix even under high-dimensional settings. However, it requires solving a difficult non-smooth log-determinant program with number of parameters scaling quadratically with the number of Gaussian variables. State-of-the-art methods thus do not scale to problems with more than 20, 000 variables. In this paper, we develop an algorithm BigQUIC, which can solve 1 million dimensional l1-regularized Gaussian MLE problems (which would thus have 1000 billion parameters) using a single machine, with bounded memory. In order to do so, we carefully exploit the underlying structure of the problem. Our innovations include a novel block-coordinate descent method with the blocks chosen via a clustering scheme to minimize repeated computations; and allowing for inexact computation of specific components. In spite of these modifications, we are able to theoretically analyze our procedure and show that BigQUIC can achieve super-linear or even quadratic convergence rates.

NeurIPS Conference 2013 Conference Paper

Conditional Random Fields via Univariate Exponential Families

  • Eunho Yang
  • Pradeep Ravikumar
  • Genevera Allen
  • Zhandong Liu

Conditional random fields, which model the distribution of a multivariate response conditioned on a set of covariates using undirected graphs, are widely used in a variety of multivariate prediction applications. Popular instances of this class of models such as categorical-discrete CRFs, Ising CRFs, and conditional Gaussian based CRFs, are not however best suited to the varied types of response variables in many applications, including count-valued responses. We thus introduce a “novel subclass of CRFs”, derived by imposing node-wise conditional distributions of response variables conditioned on the rest of the responses and the covariates as arising from univariate exponential families. This allows us to derive novel multivariate CRFs given any univariate exponential distribution, including the Poisson, negative binomial, and exponential distributions. Also in particular, it addresses the common CRF problem of specifying feature'' functions determining the interactions between response variables and covariates. We develop a class of tractable penalized $M$-estimators to learn these CRF distributions from data, as well as a unified sparsistency analysis for this general class of CRFs showing exact structure recovery can be achieved with high probability. "

NeurIPS Conference 2013 Conference Paper

Dirty Statistical Models

  • Eunho Yang
  • Pradeep Ravikumar

We provide a unified framework for the high-dimensional analysis of “superposition-structured” or “dirty” statistical models: where the model parameters are a “superposition” of structurally constrained parameters. We allow for any number and types of structures, and any statistical model. We consider the general class of $M$-estimators that minimize the sum of any loss function, and an instance of what we call a “hybrid” regularization, that is the infimal convolution of weighted regularization functions, one for each structural component. We provide corollaries showcasing our unified framework for varied statistical models such as linear regression, multiple regression and principal component analysis, over varied superposition structures.

ICML Conference 2013 Conference Paper

Human Boosting

  • Harsh H. Pareek
  • Pradeep Ravikumar

Humans may be exceptional learners but they have biological limitations and moreover, inductive biases similar to machine learning algorithms. This puts limits on human learning ability and on the kinds of learning tasks humans can easily handle. In this paper, we consider the problem of “boosting” human learners to extend the learning ability of human learners and achieve improved performance on tasks which individual humans find difficult. We consider classification (category learning) tasks, propose a boosting algorithm for human learners and give theoretical justifications. We conduct experiments using Amazon’s Mechanical Turk on two synthetic datasets – a crosshair task with a nonlinear decision boundary and a gabor patch task with a linear boundary but which is inaccessible to human learners – and one real world dataset – the Opinion Spam detection task introduced in (Ott et al). Our results show that boosting human learners produces gains in accuracy and can overcome some fundamental limitations of human learners.

NeurIPS Conference 2013 Conference Paper

Large Scale Distributed Sparse Precision Estimation

  • Huahua Wang
  • Arindam Banerjee
  • Cho-Jui Hsieh
  • Pradeep Ravikumar
  • Inderjit Dhillon

We consider the problem of sparse precision matrix estimation in high dimensions using the CLIME estimator, which has several desirable theoretical properties. We present an inexact alternating direction method of multiplier (ADMM) algorithm for CLIME, and establish rates of convergence for both the objective and optimality conditions. Further, we develop a large scale distributed framework for the computations, which scales to millions of dimensions and trillions of parameters, using hundreds of cores. The proposed framework solves CLIME in column-blocks and only involves elementwise operations and parallel matrix multiplications. We evaluate our algorithm on both shared-memory and distributed-memory architectures, which can use block cyclic distribution of data and parameters to achieve load balance and improve the efficiency in the use of memory hierarchies. Experimental results show that our algorithm is substantially more scalable than state-of-the-art methods and scales almost linearly with the number of cores.

NeurIPS Conference 2013 Conference Paper

Learning with Noisy Labels

  • Nagarajan Natarajan
  • Inderjit Dhillon
  • Pradeep Ravikumar
  • Ambuj Tewari

In this paper, we theoretically study the problem of binary classification in the presence of random classification noise --- the learner, instead of seeing the true labels, sees labels that have independently been flipped with some small probability. Moreover, random label noise is \emph{class-conditional} --- the flip probability depends on the class. We provide two approaches to suitably modify any given surrogate loss function. First, we provide a simple unbiased estimator of any loss, and obtain performance bounds for empirical risk minimization in the presence of iid data with noisy labels. If the loss function satisfies a simple symmetry condition, we show that the method leads to an efficient algorithm for empirical minimization. Second, by leveraging a reduction of risk minimization under noisy labels to classification with weighted 0-1 loss, we suggest the use of a simple weighted surrogate loss, for which we are able to obtain strong empirical risk bounds. This approach has a very remarkable consequence --- methods used in practice such as biased SVM and weighted logistic regression are provably noise-tolerant. On a synthetic non-separable dataset, our methods achieve over 88\% accuracy even when 40\% of the labels are corrupted, and are competitive with respect to recently proposed methods for dealing with label noise in several benchmark datasets.

NeurIPS Conference 2013 Conference Paper

On Poisson Graphical Models

  • Eunho Yang
  • Pradeep Ravikumar
  • Genevera Allen
  • Zhandong Liu

Undirected graphical models, such as Gaussian graphical models, Ising, and multinomial/categorical graphical models, are widely used in a variety of applications for modeling distributions over a large number of variables. These standard instances, however, are ill-suited to modeling count data, which are increasingly ubiquitous in big-data settings such as genomic sequencing data, user-ratings data, spatial incidence data, climate studies, and site visits. Existing classes of Poisson graphical models, which arise as the joint distributions that correspond to Poisson distributed node-conditional distributions, have a major drawback: they can only model negative conditional dependencies for reasons of normalizability given its infinite domain. In this paper, our objective is to modify the Poisson graphical model distribution so that it can capture a rich dependence structure between count-valued variables. We begin by discussing two strategies for truncating the Poisson distribution and show that only one of these leads to a valid joint distribution; even this model, however, has limitations on the types of variables and dependencies that may be modeled. To address this, we propose two novel variants of the Poisson distribution and their corresponding joint graphical model distributions. These models provide a class of Poisson graphical models that can capture both positive and negative conditional dependencies between count-valued variables. One can learn the graph structure of our model via penalized neighborhood selection, and we demonstrate the performance of our methods by learning simulated networks as well as a network from microRNA-Sequencing data.

IJCAI Conference 2013 Conference Paper

On Robust Estimation of High Dimensional Generalized Linear Models

  • Eunho Yang
  • Ambuj Tewari
  • Pradeep Ravikumar

We study robust high-dimensional estimation of generalized linear models (GLMs); where a small number k of the n observations can be arbitrarily corrupted, and where the true parameter is high dimensional in the “p n” regime, but only has a small number s of non-zero entries. There has been some recent work connecting robustness and sparsity, in the context of linear regression with corrupted observations, by using an explicitly modeled outlier response vector that is assumed to be sparse. Interestingly, we show, in the GLM setting, such explicit outlier response modeling can be performed in two distinct ways. For each of these two approaches, we give `2 error bounds for parameter estimation for general values of the tuple (n, p, s, k).

NeurIPS Conference 2012 Conference Paper

A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation

  • Cho-Jui Hsieh
  • Arindam Banerjee
  • Inderjit Dhillon
  • Pradeep Ravikumar

In this paper, we consider the $\ell_1$ regularized sparse inverse covariance matrix estimation problem with a very large number of variables. Even in the face of this high dimensionality, and with limited number of samples, recent work has shown this estimator to have strong statistical guarantees in recovering the true structure of the sparse inverse covariance matrix, or alternatively the underlying graph structure of the corresponding Gaussian Markov Random Field. Our proposed algorithm divides the problem into smaller sub-problems, and uses the solutions of the sub-problems to build a good approximation for the original problem. We derive a bound on the distance of the approximate solution to the true solution. Based on this bound, we propose a clustering algorithm that attempts to minimize this bound, and in practice, is able to find effective partitions of the variables. We further use the approximate solution, i. e. , solution resulting from solving the sub-problems, as an initial point to solve the original problem, and achieve a much faster computational procedure. As an example, a recent state-of-the-art method, QUIC requires 10 hours to solve a problem (with 10, 000 nodes) that arises from a climate application, while our proposed algorithm, Divide and Conquer QUIC (DC-QUIC) only requires one hour to solve the problem.

NeurIPS Conference 2012 Conference Paper

Graphical Models via Generalized Linear Models

  • Eunho Yang
  • Genevera Allen
  • Zhandong Liu
  • Pradeep Ravikumar

Undirected graphical models, or Markov networks, such as Gaussian graphical models and Ising models enjoy popularity in a variety of applications. In many settings, however, data may not follow a Gaussian or binomial distribution assumed by these models. We introduce a new class of graphical models based on generalized linear models (GLM) by assuming that node-wise conditional distributions arise from exponential families. Our models allow one to estimate networks for a wide class of exponential distributions, such as the Poisson, negative binomial, and exponential, by fitting penalized GLMs to select the neighborhood for each node. A major contribution of this paper is the rigorous statistical analysis showing that with high probability, the neighborhood of our graphical models can be recovered exactly. We provide examples of high-throughput genomic networks learned via our GLM graphical models for multinomial and Poisson distributed data.

NeurIPS Conference 2011 Conference Paper

Greedy Algorithms for Structurally Constrained High Dimensional Problems

  • Ambuj Tewari
  • Pradeep Ravikumar
  • Inderjit Dhillon

A hallmark of modern machine learning is its ability to deal with high dimensional problems by exploiting structural assumptions that limit the degrees of freedom in the underlying model. A deep understanding of the capabilities and limits of high dimensional learning methods under specific assumptions such as sparsity, group sparsity, and low rank has been attained. Efforts (Negahban et al. , 2010, Chandrasekaran et al. , 2010} are now underway to distill this valuable experience by proposing general unified frameworks that can achieve the twin goals of summarizing previous analyses and enabling their application to notions of structure hitherto unexplored. Inspired by these developments, we propose and analyze a general computational scheme based on a greedy strategy to solve convex optimization problems that arise when dealing with structurally constrained high-dimensional problems. Our framework not only unifies existing greedy algorithms by recovering them as special cases but also yields novel ones. Finally, we extend our results to infinite dimensional problems by using interesting connections between smoothness of norms and behavior of martingales in Banach spaces.

NeurIPS Conference 2011 Conference Paper

Nearest Neighbor based Greedy Coordinate Descent

  • Inderjit Dhillon
  • Pradeep Ravikumar
  • Ambuj Tewari

Increasingly, optimization problems in machine learning, especially those arising from high-dimensional statistical estimation, have a large number of variables. Modern statistical estimators developed over the past decade have statistical or sample complexity that depends only weakly on the number of parameters when there is some structure to the problem, such as sparsity. A central question is whether similar advances can be made in their computational complexity as well. In this paper, we propose strategies that indicate that such advances can indeed be made. In particular, we investigate the greedy coordinate descent algorithm, and note that performing the greedy step efficiently weakens the costly dependence on the problem size provided the solution is sparse. We then propose a suite of methods that perform these greedy steps efficiently by a reduction to nearest neighbor search. We also devise a more amenable form of greedy descent for composite non-smooth objectives; as well as several approximate variants of such greedy descent. We develop a practical implementation of our algorithm that combines greedy coordinate descent with locality sensitive hashing. Without tuning the latter data structure, we are not only able to significantly speed up the vanilla greedy method, but also outperform cyclic descent when the problem size becomes large. Our results indicate the effectiveness of our nearest neighbor strategies, and also point to many open questions regarding the development of computational geometric techniques tailored towards first-order optimization methods.

NeurIPS Conference 2011 Conference Paper

On Learning Discrete Graphical Models using Greedy Methods

  • Ali Jalali
  • Christopher Johnson
  • Pradeep Ravikumar

In this paper, we address the problem of learning the structure of a pairwise graphical model from samples in a high-dimensional setting. Our first main result studies the sparsistency, or consistency in sparsity pattern recovery, properties of a forward-backward greedy algorithm as applied to general statistical models. As a special case, we then apply this algorithm to learn the structure of a discrete graphical model via neighborhood estimation. As a corollary of our general result, we derive sufficient conditions on the number of samples n, the maximum node-degree d and the problem size p, as well as other conditions on the model parameters, so that the algorithm recovers all the edges with high probability. Our result guarantees graph selection for samples scaling as n = Omega(d log(p)), in contrast to existing convex-optimization based algorithms that require a sample complexity of Omega(d^2 log(p)). Further, the greedy algorithm only requires a restricted strong convexity condition which is typically milder than irrepresentability assumptions. We corroborate these results using numerical simulations at the end.

NeurIPS Conference 2011 Conference Paper

Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation

  • Cho-Jui Hsieh
  • Inderjit Dhillon
  • Pradeep Ravikumar
  • Mátyás Sustik

The L_1 regularized Gaussian maximum likelihood estimator has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix, or alternatively the underlying graph structure of a Gaussian Markov Random Field, from very limited samples. We propose a novel algorithm for solving the resulting optimization problem which is a regularized log-determinant program. In contrast to other state-of-the-art methods that largely use first order gradient information, our algorithm is based on Newton's method and employs a quadratic approximation, but with some modifications that leverage the structure of the sparse Gaussian MLE problem. We show that our method is superlinearly convergent, and also present experimental results using synthetic and real application data that demonstrate the considerable improvements in performance of our method when compared to other state-of-the-art methods.

NeurIPS Conference 2010 Conference Paper

A Dirty Model for Multi-task Learning

  • Ali Jalali
  • Sujay Sanghavi
  • Chao Ruan
  • Pradeep Ravikumar

We consider the multiple linear regression problem, in a setting where some of the set of relevant features could be shared across the tasks. A lot of recent research has studied the use of $\ell_1/\ell_q$ norm block-regularizations with $q > 1$ for such (possibly) block-structured problems, establishing strong guarantees on recovery even under high-dimensional scaling where the number of features scale with the number of observations. However, these papers also caution that the performance of such block-regularized methods are very dependent on the {\em extent} to which the features are shared across tasks. Indeed they show~\citep{NWJoint} that if the extent of overlap is less than a threshold, or even if parameter {\em values} in the shared features are highly uneven, then block $\ell_1/\ell_q$ regularization could actually perform {\em worse} than simple separate elementwise $\ell_1$ regularization. We are far away from a realistic multi-task setting: not only do the set of relevant features have to be exactly the same across tasks, but their values have to as well. Here, we ask the question: can we leverage support and parameter overlap when it exists, but not pay a penalty when it does not? Indeed, this falls under a more general question of whether we can model such \emph{dirty data} which may not fall into a single neat structural bracket (all block-sparse, or all low-rank and so on). Here, we take a first step, focusing on developing a dirty model for the multiple regression problem. Our method uses a very simple idea: we decompose the parameters into two components and {\em regularize these differently. } We show both theoretically and empirically, our method strictly and noticeably outperforms both $\ell_1$ and $\ell_1/\ell_q$ methods, over the entire range of possible overlaps. We also provide theoretical guarantees that the method performs well under high-dimensional scaling.

JMLR Journal 2010 Journal Article

Message-passing for Graph-structured Linear Programs: Proximal Methods and Rounding Schemes

  • Pradeep Ravikumar
  • Alekh Agarwal
  • Martin J. Wainwright

The problem of computing a maximum a posteriori (MAP) configuration is a central computational challenge associated with Markov random fields. There has been some focus on "tree-based" linear programming (LP) relaxations for the MAP problem. This paper develops a family of super-linearly convergent algorithms for solving these LPs, based on proximal minimization schemes using Bregman divergences. As with standard message-passing on graphs, the algorithms are distributed and exploit the underlying graphical structure, and so scale well to large problems. Our algorithms have a double-loop character, with the outer loop corresponding to the proximal sequence, and an inner loop of cyclic Bregman projections used to compute each proximal update. We establish convergence guarantees for our algorithms, and illustrate their performance via some simulations. We also develop two classes of rounding schemes, deterministic and randomized, for obtaining integral configurations from the LP solutions. Our deterministic rounding schemes use a "re-parameterization" property of our algorithms so that when the LP solution is integral, the MAP solution can be obtained even before the LP-solver converges to the optimum. We also propose graph-structured randomized rounding schemes applicable to iterative LP-solving algorithms in general. We analyze the performance of and report simulations comparing these rounding schemes. [abs] [ pdf ][ bib ] &copy JMLR 2010. ( edit, beta )

NeurIPS Conference 2009 Conference Paper

A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

  • Sahand Negahban
  • Bin Yu
  • Martin Wainwright
  • Pradeep Ravikumar

The estimation of high-dimensional parametric models requires imposing some structure on the models, for instance that they be sparse, or that matrix structured parameters have low rank. A general approach for such structured parametric model estimation is to use regularized M-estimation procedures, which regularize a loss function that measures goodness of fit of the parameters to the data with some regularization function that encourages the assumed structure. In this paper, we aim to provide a unified analysis of such regularized M-estimation procedures. In particular, we report the convergence rates of such estimators in any metric norm. Using just our main theorem, we are able to rederive some of the many existing results, but also obtain a wide range of novel convergence rates results. Our analysis also identifies key properties of loss and regularization functions such as restricted strong convexity, and decomposability, that ensure the corresponding regularized M-estimators have good convergence rates.

NeurIPS Conference 2009 Conference Paper

Information-theoretic lower bounds on the oracle complexity of convex optimization

  • Alekh Agarwal
  • Martin Wainwright
  • Peter Bartlett
  • Pradeep Ravikumar

Despite the large amount of literature on upper bounds on complexity of convex analysis, surprisingly little is known about the fundamental hardness of these problems. The extensive use of convex optimization in machine learning and statistics makes such an understanding critical to understand fundamental computational limits of learning and estimation. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for some function classes. We also discuss implications of these results to the understanding the inherent complexity of large-scale learning and estimation problems.

NeurIPS Conference 2008 Conference Paper

Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell_1$-regularized MLE

  • Garvesh Raskutti
  • Bin Yu
  • Martin Wainwright
  • Pradeep Ravikumar

We consider the problem of estimating the graph structure associated with a Gaussian Markov random field (GMRF) from i. i. d. samples. We study the performance of study the performance of the 1 -regularized maximum likelihood estimator in the high-dimensional setting, where the number of nodes in the graph p, the number of edges in the graph s and the maximum node degree d, are allowed to grow as a function of the number of samples n. Our main result provides sufficient conditions on (n, p, d) for the 1 -regularized MLE estimator to recover all the edges of the graph with high probability. Under some conditions on the model covariance, we show that model selection can be achieved for sample sizes n = (d2 log(p)), with the error decaying as O(exp(-c log(p))) for some constant c. We illustrate our theoretical results via simulations and show good correspondences between the theoretical predictions and behavior in simulations.

NeurIPS Conference 2008 Conference Paper

Nonparametric sparse hierarchical models describe V1 fMRI responses to natural images

  • Vincent Vu
  • Bin Yu
  • Thomas Naselaris
  • Kendrick Kay
  • Jack Gallant
  • Pradeep Ravikumar

We propose a novel hierarchical, nonlinear model that predicts brain activity in area V1 evoked by natural images. In the study reported here brain activity was measured by means of functional magnetic resonance imaging (fMRI), a noninvasive technique that provides an indirect measure of neural activity pooled over a small volume (~ 2mm cube) of brain tissue. Our model, which we call the SpAM V1 model, is based on the reasonable assumption that fMRI measurements reflect the (possibly nonlinearly) pooled, rectified output of a large population of simple and complex cells in V1. It has a hierarchical filtering stage that consists of three layers: model simple cells, model complex cells, and a third layer in which the complex cells are linearly pooled (called “pooled-complex” cells). The pooling stage then obtains the measured fMRI signals as a sparse additive model (SpAM) in which a sparse nonparametric (nonlinear) combination of model complex cell and model pooled-complex cell outputs are summed. Our results show that the SpAM V1 model predicts fMRI responses evoked by natural images better than a benchmark model that only provides linear pooling of model complex cells. Furthermore, the spatial receptive fields, frequency tuning and orientation tuning curves of the SpAM V1 model estimated for each voxel appears to be consistent with the known properties of V1, and with previous analyses of this data set. A visualization procedure applied to the SpAM V1 model shows that most of the nonlinear pooling consists of simple compressive or saturating nonlinearities.

NeurIPS Conference 2007 Conference Paper

SpAM: Sparse Additive Models

  • Han Liu
  • Larry Wasserman
  • John Lafferty
  • Pradeep Ravikumar

We present a new class of models for high-dimensional nonparametric regression and classification called sparse additive models (SpAM). Our methods combine ideas from sparse linear modeling and additive nonparametric regression. We de- rive a method for fitting the models that is effective even when the number of covariates is larger than the sample size. A statistical analysis of the properties of SpAM is given together with empirical results on synthetic and real data, show- ing that SpAM can be effective in fitting sparse nonparametric models in high dimensional data.

NeurIPS Conference 2006 Conference Paper

High-Dimensional Graphical Model Selection Using $\ell_1$-Regularized Logistic Regression

  • Martin Wainwright
  • John Lafferty
  • Pradeep Ravikumar

We focus on the problem of estimating the graph structure associated with a discrete Markov random field. We describe a method based on 1- regularized logistic regression, in which the neighborhood of any given node is estimated by performing logistic regression subject to an 1-constraint. Our framework applies to the high-dimensional setting, in which both the number of nodes p and maximum neighborhood sizes d are allowed to grow as a function of the number of observations n. Our main result is to estab- lish sufficient conditions on the triple (n, p, d) for the method to succeed in consistently estimating the neighborhood of every node in the graph simul- taneously. Under certain mutual incoherence conditions analogous to those imposed in previous work on linear regression, we prove that consistent neighborhood selection can be obtained as long as the number of observa- tions n grows more quickly than 6d6 log d + 2d5 log p, thereby establishing that logarithmic growth in the number of samples n relative to graph size p is sufficient to achieve neighborhood consistency. Keywords: Graphical models; Markov random fields; structure learning; `1-regularization; model selection; convex risk minimization; high-dimensional asymptotics; concentration.

NeurIPS Conference 2005 Conference Paper

Preconditioner Approximations for Probabilistic Graphical Models

  • John Lafferty
  • Pradeep Ravikumar

We present a family of approximation techniques for probabilistic graphical models, based on the use of graphical preconditioners developed in the scientific computing literature. Our framework yields rigorous upper and lower bounds on event probabilities and the log partition function of undirected graphical models, using non-iterative procedures that have low time complexity. As in mean field approaches, the approximations are built upon tractable subgraphs; however, we recast the problem of optimizing the tractable distribution parameters and approximate inference in terms of the well-studied linear systems problem of obtaining a good matrix preconditioner. Experiments are presented that compare the new approximation schemes to variational methods.

UAI Conference 2004 Conference Paper

A Hierarchical Graphical Model for Record Linkage

  • Pradeep Ravikumar
  • William W. Cohen

The task of matching co-referent records is known among other names as rocord linkage. For large record-linkage problems, often there is little or no labeled data available, but unlabeled data shows a reasonable clear structure. For such problems, unsupervised or semi-supervised methods are preferable to supervised methods. In this paper, we describe a hierarchical graphical model framework for the linakge-problem in an unsupervised setting. In addition to proposing new methods, we also cast existing unsupervised probabilistic record-linkage methods in this framework. Some of the techniques we propose to minimize overfitting in the above model are of interest in the general graphical model setting. We describe a method for incorporating monotinicity constraints in a graphical model. We also outline a bootstrapping approach of using "single-field" classifiers to noisily label latent variables in a hierarchical model. Experimental results show that our proposed unsupervised methods perform quite competitively even with fully supervised record-linkage methods.

UAI Conference 2004 Conference Paper

Variational Chernoff Bounds for Graphical Models

  • Pradeep Ravikumar
  • John D. Lafferty

Recent research has made significant progress on the problem of bounding log partition functions for exponential family graphical models. Such bounds have associated dual parameters that are often used as heuristic estimates of the marginal probabilities required in inference and learning. However these variational estimates do not give rigorous bounds on marginal probabilities, nor do they give estimates for probabilities of more general events than simple marginals. In this paper we build on this recent work by deriving rigorous upper and lower bounds on event probabilities for graphical models. Our approach is based on the use of generalized Chernoff bounds to express bounds on event probabilities in terms of convex optimization problems; these optimization problems, in turn, require estimates of generalized log partition functions. Simulations indicate that this technique can result in useful, rigorous bounds to complement the heuristic variational estimates, with comparable computational cost.