Arrow Research

Author name cluster

Alexandros Dimakis

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers
1 author row

Possible papers (16)

NeurIPS Conference 2019 Conference Paper

Inverting Deep Generative Models, One Layer at a Time

  • Qi Lei
  • Ajil Jalal
  • Inderjit Dhillon
  • Alexandros Dimakis

We study the problem of inverting a deep generative model with ReLU activations. Inversion corresponds to finding a latent code vector that best explains the observed measurements. In most prior work this is done by attempting to solve a non-convex optimization problem involving the generator. In this paper we obtain several novel theoretical results for the inversion problem. We show that for the realizable case, single-layer inversion can be performed exactly in polynomial time by solving a linear program. Further, we show that for two layers, recovering a binary latent code is NP-hard (even in the realizable case) and the pre-image set can be non-convex. For generative models of arbitrary depth, we show that exact recovery is possible in polynomial time with high probability if the layers are expanding and the weights are randomly selected. Very recent work analyzed the same problem for gradient descent inversion; that analysis requires significantly higher expansion (logarithmic in the latent dimension), while our proposed algorithm provably reconstructs even with constant-factor expansion. We also provide provable error bounds in different norms for reconstructing noisy observations. Our empirical validation demonstrates that we obtain better reconstructions when the latent dimension is large.
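
The single-layer, realizable case admits a direct implementation: each output unit is either active (an equality constraint) or inactive (an inequality constraint), so the pre-image is a polyhedron and any feasible point recovers the latent code. A minimal Python sketch of this linear-program inversion (illustrative only, not the authors' code; `invert_relu_layer` is a hypothetical helper name):

```python
import numpy as np
from scipy.optimize import linprog

def invert_relu_layer(W, y):
    """Find x with ReLU(W @ x) == y (realizable case) via a linear program.

    Active units (y_i > 0) give equalities W_i x = y_i; inactive units
    (y_i == 0) give inequalities W_i x <= 0. Any feasible point is an
    exact pre-image, so the LP objective is identically zero.
    """
    act = y > 0
    n = W.shape[1]
    res = linprog(
        c=np.zeros(n),                                        # pure feasibility
        A_eq=W[act] if act.any() else None,
        b_eq=y[act] if act.any() else None,
        A_ub=W[~act] if (~act).any() else None,
        b_ub=np.zeros(int((~act).sum())) if (~act).any() else None,
        bounds=[(None, None)] * n,                            # x is unconstrained
    )
    return res.x if res.success else None

# Tiny demo on a random expanding layer (5 -> 20).
rng = np.random.default_rng(0)
W = rng.standard_normal((20, 5))
y = np.maximum(W @ rng.standard_normal(5), 0.0)
x_hat = invert_relu_layer(W, y)
print(np.allclose(np.maximum(W @ x_hat, 0.0), y))             # True
```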

NeurIPS Conference 2019 Conference Paper

Learning Distributions Generated by One-Layer ReLU Networks

  • Shanshan Wu
  • Alexandros Dimakis
  • Sujay Sanghavi

We consider the problem of estimating the parameters of a $d$-dimensional rectified Gaussian distribution from i.i.d. samples. A rectified Gaussian distribution is defined by passing a standard Gaussian distribution through a one-layer ReLU neural network. We give a simple algorithm to estimate the parameters (i.e., the weight matrix and bias vector of the ReLU neural network) up to an error $\epsilon\|W\|_F$ using $\widetilde{O}(1/\epsilon^2)$ samples and $\widetilde{O}(d^2/\epsilon^2)$ time (log factors are ignored for simplicity). This implies that we can estimate the distribution up to $\epsilon$ in total variation distance using $\widetilde{O}(\kappa^2 d^2/\epsilon^2)$ samples, where $\kappa$ is the condition number of the covariance matrix. Our only assumption is that the bias vector is non-negative. Without this non-negativity assumption, we show that estimating the bias vector to within any error requires a number of samples exponential in the infinity norm of the bias vector. Our algorithm is based on the key observation that vector norms and pairwise angles can be estimated separately; we also use a recent result on learning from truncated samples. We prove two sample complexity lower bounds: $\Omega(1/\epsilon^2)$ samples are required to estimate the parameters up to error $\epsilon$, while $\Omega(d/\epsilon^2)$ samples are necessary to estimate the distribution up to $\epsilon$ in total variation distance. The first lower bound implies that our algorithm is optimal for parameter estimation. Finally, we show an interesting connection between learning a two-layer generative model and non-negative matrix factorization. Experimental results are provided to support our analysis.
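
As an illustration of the "norms and angles separately" observation, the angle between two weight rows can be read off a single orthant probability when the bias is zero, since $\Pr[x_i > 0, x_j > 0] = 1/4 + \arcsin(\cos\theta_{ij})/(2\pi)$ for jointly Gaussian pre-activations. A hedged sketch under that zero-bias assumption (not the paper's full estimator):

```python
import numpy as np

def estimate_angle(xi, xj):
    """Estimate the angle between rows w_i, w_j of W from x = ReLU(W z)."""
    p_both = np.mean((xi > 0) & (xj > 0))        # empirical orthant probability
    rho = np.sin(2 * np.pi * (p_both - 0.25))    # invert the arcsine identity
    return np.arccos(np.clip(rho, -1.0, 1.0))

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))
X = np.maximum(rng.standard_normal((200_000, 4)) @ W.T, 0.0)
true = np.arccos(W[0] @ W[1] / (np.linalg.norm(W[0]) * np.linalg.norm(W[1])))
print(estimate_angle(X[:, 0], X[:, 1]), true)    # close for large sample sizes
```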

NeurIPS Conference 2019 Conference Paper

Primal-Dual Block Generalized Frank-Wolfe

  • Qi Lei
  • Jiacheng Zhuo
  • Constantine Caramanis
  • Inderjit Dhillon
  • Alexandros Dimakis

We propose a generalized variant of the Frank-Wolfe algorithm for solving a class of sparse/low-rank optimization problems. Our formulation includes Elastic Net, regularized SVMs, and phase retrieval as special cases. The proposed Primal-Dual Block Generalized Frank-Wolfe algorithm reduces the per-iteration cost while maintaining a linear convergence rate. The per-iteration cost of our method depends on the structural complexity of the solution (i.e., its sparsity or rank) rather than on the ambient dimension. We empirically show that our algorithm outperforms state-of-the-art methods on (multi-class) classification tasks.
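
For context, the classical Frank-Wolfe template over an $\ell_1$ ball, whose per-iteration work is a single linear minimization oracle call returning a 1-sparse vertex, looks as follows; the paper's primal-dual block variant layers dual updates and block coordinate steps on top of this baseline (illustrative sketch only):

```python
import numpy as np

def frank_wolfe_l1(grad, n, radius, steps=500):
    """Minimize a smooth convex f over {x : ||x||_1 <= radius}."""
    x = np.zeros(n)
    for t in range(steps):
        g = grad(x)
        i = np.argmax(np.abs(g))            # LMO: best vertex of the l1 ball
        s = np.zeros(n)
        s[i] = -radius * np.sign(g[i])
        gamma = 2.0 / (t + 2.0)             # standard open-loop step size
        x = (1 - gamma) * x + gamma * s     # iterates stay sparse
    return x

# Toy constrained least squares: min ||Ax - b||^2 s.t. ||x||_1 <= 1.
rng = np.random.default_rng(2)
A, b = rng.standard_normal((50, 20)), rng.standard_normal(50)
x = frank_wolfe_l1(lambda x: 2 * A.T @ (A @ x - b), n=20, radius=1.0)
print(np.linalg.norm(A @ x - b))
```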

NeurIPS Conference 2019 Conference Paper

Provable Certificates for Adversarial Examples: Fitting a Ball in the Union of Polytopes

  • Matt Jordan
  • Justin Lewis
  • Alexandros Dimakis

We propose a novel method for computing the exact pointwise robustness of deep neural networks for all convex $\ell_p$ norms. Our algorithm, GeoCert, finds the largest $\ell_p$ ball centered at an input point $x_0$ within which the output class of a given neural network with ReLU nonlinearities remains unchanged. We relate the problem of computing pointwise robustness of these networks to that of computing the maximum norm ball with a fixed center that can be contained in a non-convex polytope. This is a challenging problem in general; however, we show that there exists an efficient algorithm to compute it for polyhedral complexes. Further, we show that piecewise linear neural networks partition the input space into a polyhedral complex. Our algorithm almost immediately outputs a nontrivial lower bound on the pointwise robustness, which is iteratively improved until it ultimately becomes tight. We empirically show that our approach generates distance lower bounds that are tighter than prior work under moderate time constraints.
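
The geometric primitive behind GeoCert is easy to state for a single cell: inside one polytope $\{x : Ax \le b\}$, the largest $\ell_2$ ball centered at $x_0$ is limited by the nearest facet. A minimal sketch of that primitive (GeoCert itself searches facet by facet across the entire polyhedral complex):

```python
import numpy as np

def l2_ball_radius_in_polytope(A, b, x0):
    """Radius of the largest l2 ball centered at x0 inside {x : A x <= b}."""
    slack = b - A @ x0                           # slack of each constraint at x0
    assert np.all(slack >= 0), "x0 must lie inside the polytope"
    return float(np.min(slack / np.linalg.norm(A, axis=1)))

# The unit l_inf box, with x0 off-center: the nearest facet is x <= 1.
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.ones(4)
print(l2_ball_radius_in_polytope(A, b, np.array([0.2, 0.0])))  # 0.8
```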

NeurIPS Conference 2019 Conference Paper

Sparse Logistic Regression Learns All Discrete Pairwise Graphical Models

  • Shanshan Wu
  • Sujay Sanghavi
  • Alexandros Dimakis

We characterize the effectiveness of a classical algorithm for recovering the Markov graph of a general discrete pairwise graphical model from i.i.d. samples. The algorithm is (appropriately regularized) maximum conditional log-likelihood, which involves solving a convex program for each node; for Ising models this is $\ell_1$-constrained logistic regression, while for more general alphabets an $\ell_{2,1}$ group-norm constraint needs to be used. We show that this algorithm can recover any arbitrary discrete pairwise graphical model, and we characterize its sample complexity as a function of model width, alphabet size, edge parameter accuracy, and the number of variables. We show that along every one of these axes it matches or improves on all existing results and algorithms for this problem. Our analysis applies a sharp generalization error bound for logistic regression when the weight vector has an $\ell_1$ (or $\ell_{2,1}$) constraint and the sample vector has an $\ell_{\infty}$ (or $\ell_{2,\infty}$) constraint. We also show that the proposed convex programs can be efficiently solved in $\tilde{O}(n^2)$ running time (where $n$ is the number of variables) under the same statistical guarantees. We provide experimental results to support our analysis.
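
For the Ising case, the node-wise procedure is straightforward to prototype: regress each variable on all others with $\ell_1$-regularized logistic regression and threshold the weights to read off its neighborhood. A hedged sketch using sklearn's penalized (rather than the paper's constrained) formulation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ising_neighbors(samples, node, C=0.1, tol=1e-2):
    """Estimate the Markov neighborhood of `node` from +/-1 spin samples.

    samples: (num_samples, num_variables) array with entries in {-1, +1}.
    """
    X = np.delete(samples, node, axis=1)        # all other variables as features
    y = samples[:, node]                        # the node being regressed
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    w = clf.coef_.ravel()
    others = [j for j in range(samples.shape[1]) if j != node]
    return {others[k] for k in np.flatnonzero(np.abs(w) > tol)}
```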

NeurIPS Conference 2018 Conference Paper

Experimental Design for Cost-Aware Learning of Causal Graphs

  • Erik Lindgren
  • Murat Kocaoglu
  • Alexandros Dimakis
  • Sriram Vishwanath

We consider the minimum-cost intervention design problem: given the essential graph of a causal graph and a cost to intervene on each variable, identify a set of interventions with minimum total cost that can learn any causal graph with the given essential graph. We first show that this problem is NP-hard. We then prove that a greedy algorithm achieves a constant-factor approximation. We then constrain the sparsity of each intervention, develop an algorithm that returns an intervention design that is nearly optimal in size for sparse graphs with sparse interventions, and discuss how to use it when there are costs on the vertices.

AAAI Conference 2017 Conference Paper

Entropic Causal Inference

  • Murat Kocaoglu
  • Alexandros Dimakis
  • Sriram Vishwanath
  • Babak Hassibi

We consider the problem of identifying the causal direction between two discrete random variables using observational data. Unlike previous work, we keep the most general functional model but make an assumption on the unobserved exogenous variable: inspired by Occam's razor, we assume that the exogenous variable is simple in the true causal direction. We quantify simplicity using Rényi entropy. Our main result is that, under natural assumptions, if the exogenous variable has low $H_0$ entropy (cardinality) in the true direction, it must have high $H_0$ entropy in the wrong direction. We establish several algorithmic hardness results about estimating the minimum-entropy exogenous variable. We show that the problem of finding the exogenous variable with minimum $H_1$ (Shannon) entropy is equivalent to the problem of finding the minimum joint entropy given $n$ marginal distributions, also known as the minimum entropy coupling problem. We propose an efficient greedy algorithm for the minimum entropy coupling problem that, for $n = 2$, provably finds a local optimum. This gives a greedy algorithm for finding the exogenous variable with minimum Shannon entropy. Our greedy entropy-based causal inference algorithm performs comparably to state-of-the-art additive noise models on real datasets. One advantage of our approach is that we make no use of the values of the random variables, only their distributions; our method can therefore be used for causal inference on both ordinal and categorical data, unlike additive noise models.
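
The greedy coupling step is simple to sketch for two marginals: repeatedly match the largest remaining probability masses and assign their minimum as a joint mass, which tends to concentrate the joint distribution. A minimal illustration (the paper handles $n$ marginals; `greedy_coupling_entropy` is a hypothetical helper name):

```python
import heapq, math

def greedy_coupling_entropy(p, q):
    """Greedy coupling of marginals p and q; returns its Shannon entropy (bits)."""
    P = [(-pi, i) for i, pi in enumerate(p) if pi > 0]
    Q = [(-qj, j) for j, qj in enumerate(q) if qj > 0]
    heapq.heapify(P); heapq.heapify(Q)          # max-heaps via negated masses
    H = 0.0
    while P and Q:
        pi, i = heapq.heappop(P)
        qj, j = heapq.heappop(Q)
        m = min(-pi, -qj)                       # joint mass placed at cell (i, j)
        H -= m * math.log2(m)
        if -pi - m > 1e-12: heapq.heappush(P, (pi + m, i))
        if -qj - m > 1e-12: heapq.heappush(Q, (qj + m, j))
    return H

print(greedy_coupling_entropy([0.5, 0.5], [0.6, 0.4]))  # ~1.36 bits
```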

NeurIPS Conference 2017 Conference Paper

Model-Powered Conditional Independence Test

  • Rajat Sen
  • Ananda Theertha Suresh
  • Karthikeyan Shanmugam
  • Alexandros Dimakis
  • Sanjay Shakkottai

We consider the problem of non-parametric conditional independence testing (CI testing) for continuous random variables. Given i.i.d. samples from the joint distribution $f(x, y, z)$ of continuous random vectors $X$, $Y$, and $Z$, we determine whether $X \perp Y \mid Z$. We approach this by converting the conditional independence test into a classification problem, which allows us to harness very powerful classifiers such as gradient-boosted trees and deep neural networks. These models can handle complex probability distributions and allow us to perform significantly better than the prior state of the art for high-dimensional CI testing. The main technical challenge in the classification problem is the need for samples from the conditional product distribution $f^{CI}(x, y, z) = f(x|z)f(y|z)f(z)$ (which equals the joint distribution if and only if $X \perp Y \mid Z$) when given access only to i.i.d. samples from the true joint distribution $f(x, y, z)$. To tackle this problem we propose a novel nearest neighbor bootstrap procedure and theoretically show that our generated samples are indeed close to $f^{CI}$ in total variation distance. We then develop generalization bounds for classification in our setting, which translate into error bounds for CI testing, including a novel analysis of Rademacher-type classification bounds in the presence of non-i.i.d. \textit{near-independent} samples. We empirically validate the performance of our algorithm on simulated and real datasets and show performance gains over previous methods.
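
The bootstrap itself fits in a few lines: for each sample, keep $(x_i, z_i)$ but swap in the $y$ of the sample whose $z$ is the nearest neighbor of $z_i$, so the dependence between $X$ and $Y$ is broken conditionally on $Z$. A minimal sketch (omitting the paper's data-splitting details):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nn_bootstrap(X, Y, Z):
    """Approximate samples from f(x|z) f(y|z) f(z) via a nearest-neighbor swap."""
    nn = NearestNeighbors(n_neighbors=2).fit(Z)
    idx = nn.kneighbors(Z, return_distance=False)[:, 1]  # [:, 0] is the point itself
    return X, Y[idx], Z                                  # swapped Y breaks the X-Y link given Z
```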

NeurIPS Conference 2017 Conference Paper

Streaming Weak Submodularity: Interpreting Neural Networks on the Fly

  • Ethan Elenberg
  • Alexandros Dimakis
  • Moran Feldman
  • Amin Karbasi

In many machine learning applications, it is important to explain the predictions of a black-box classifier. For example, why does a deep neural network assign an image to a particular class? We cast interpretability of black-box classifiers as a combinatorial maximization problem and propose an efficient streaming algorithm to solve it subject to cardinality constraints. By extending ideas from Badanidiyuru et al. [2014], we provide a constant factor approximation guarantee for our algorithm in the case of random stream order and a weakly submodular objective function. This is the first such theoretical guarantee for this general class of functions, and we also show that no such algorithm exists for a worst case stream order. Our algorithm obtains similar explanations of Inception V3 predictions 10 times faster than the state-of-the-art LIME framework of Ribeiro et al. [2016].
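
The streaming template from Badanidiyuru et al. [2014] that the paper extends is compact: accept an element only if its marginal gain clears a threshold tied to a guess of the optimal value. A hedged single-threshold sketch (the full algorithm runs many guesses in parallel and handles weak submodularity):

```python
def stream_select(stream, f, k, opt_guess):
    """Pick up to k elements whose marginal gains clear a value-based threshold.

    f is a monotone set function evaluated on lists; opt_guess is a guess of
    the optimal value f(OPT).
    """
    S = []
    for e in stream:
        if len(S) == k:
            break
        gain = f(S + [e]) - f(S)
        if gain >= (opt_guess / 2.0 - f(S)) / (k - len(S)):
            S.append(e)     # element clears the threshold; keep it
    return S
```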

NeurIPS Conference 2016 Conference Paper

Leveraging Sparsity for Efficient Submodular Data Summarization

  • Erik Lindgren
  • Shanshan Wu
  • Alexandros Dimakis

The facility location problem is widely used for summarizing large datasets and has additional applications in sensor placement, image retrieval, and clustering. One difficulty of this problem is that submodular optimization algorithms require the calculation of pairwise benefits for all items in the dataset. This is infeasible for large problems, so recent work proposed calculating only nearest-neighbor benefits. One limitation is that several strong assumptions were invoked to obtain provable approximation guarantees. In this paper we establish that these extra assumptions are not necessary: solving the sparsified problem is almost optimal under the standard assumptions of the problem. We then analyze a different method of sparsification that better models techniques such as Locality Sensitive Hashing, accelerating the nearest neighbor computations and extending the problem to a broader family of similarities. We validate our approach by demonstrating that it rapidly generates interpretable summaries.
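
A sparsified greedy loop shows why nearest-neighbor benefits make this tractable: each candidate stores benefits only for its nearest points, so marginal-gain updates touch stored entries instead of the full pairwise matrix. A minimal sketch (hypothetical data layout, not the paper's implementation):

```python
import numpy as np

def greedy_facility_location(benefits, n_points, k):
    """benefits: dict candidate -> {point: benefit}, one sparse row per candidate."""
    best = np.zeros(n_points)                   # best coverage per point so far
    chosen = []
    for _ in range(k):
        gains = {j: sum(max(b - best[i], 0.0) for i, b in row.items())
                 for j, row in benefits.items() if j not in chosen}
        j_star = max(gains, key=gains.get)      # greedy: largest marginal gain
        chosen.append(j_star)
        for i, b in benefits[j_star].items():   # update only stored entries
            best[i] = max(best[i], b)
    return chosen
```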

NeurIPS Conference 2016 Conference Paper

Single Pass PCA of Matrix Products

  • Shanshan Wu
  • Srinadh Bhojanapalli
  • Sujay Sanghavi
  • Alexandros Dimakis

In this paper we present a new algorithm for computing a low-rank approximation of the product $A^T B$ using only a single pass over the two matrices $A$ and $B$. The straightforward way to do this is to (a) first sketch $A$ and $B$ individually, and then (b) find the top components using PCA on the sketch. Our algorithm, in contrast, retains additional summary information about $A$ and $B$ (e.g., row and column norms) and uses this information to obtain an improved approximation from the sketches. Our main analytical result establishes a spectral norm guarantee comparable to existing two-pass methods; in addition, we provide results from an Apache Spark implementation showing better computational and statistical performance on real-world and synthetic evaluation datasets.
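
The straightforward baseline described above, which the paper's method augments with extra summary statistics, can be sketched directly: apply one shared random sketch to each matrix while streaming, then take the SVD of the small sketched product (illustrative only):

```python
import numpy as np

def one_pass_product_pca(A, B, sketch_dim, rank, seed=0):
    """Rank-`rank` approximation of A.T @ B from a single pass over A and B."""
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((sketch_dim, A.shape[0])) / np.sqrt(sketch_dim)
    SA, SB = S @ A, S @ B                       # the only pass over A and B
    M = SA.T @ SB                               # E[SA.T @ SB] = A.T @ B
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :rank], s[:rank], Vt[:rank]
```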

NeurIPS Conference 2015 Conference Paper

Learning Causal Graphs with Small Interventions

  • Karthikeyan Shanmugam
  • Murat Kocaoglu
  • Alexandros Dimakis
  • Sriram Vishwanath

We consider the problem of learning causal networks with interventions, when each intervention is limited in size, under Pearl's Structural Equation Model with independent errors (SEM-IE). The objective is to minimize the number of experiments needed to discover the causal directions of all the edges in a causal graph. Previous work has focused on the use of separating systems for complete graphs for this task. We prove that any deterministic adaptive algorithm needs to be a separating system in order to learn complete graphs in the worst case. In addition, we present a novel separating system construction whose size is close to optimal and is arguably simpler than previous constructions in combinatorics. We also develop a novel information-theoretic lower bound on the number of interventions that applies in full generality, including to randomized adaptive learning algorithms. For general chordal graphs, we derive worst-case lower bounds on the number of interventions. Building on observations about induced trees, we give a new deterministic adaptive algorithm that completely learns the directions on any chordal skeleton. In the worst case, our achievable scheme is an $\alpha$-approximation algorithm, where $\alpha$ is the independence number of the graph. We also show that there exist graph classes for which the sufficient number of experiments is close to the lower bound, and, at the other extreme, graph classes for which the required number of experiments is multiplicatively $\alpha$ away from our lower bound. In simulations, our algorithm almost always performs very close to the lower bound, while the approach based on separating systems for complete graphs is significantly worse for random chordal graphs.
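
A simple (not size-optimal) separating system makes the complete-graph strategy concrete: give each vertex a distinct base-$a$ codeword and run one experiment per (digit, symbol) pair; any two vertices differ in some digit, so some experiment intervenes on exactly one of them. A hedged sketch (the paper's construction is tighter and bounds intervention sizes):

```python
from math import ceil, log

def separating_interventions(n, a):
    """One intervention set per (digit, symbol) of distinct base-a codewords."""
    length = max(1, ceil(log(n, a)))                 # codeword length
    digit = lambda v, d: (v // a**d) % a             # d-th base-a digit of v
    return [[v for v in range(n) if digit(v, d) == s]
            for d in range(length) for s in range(a)]

print(separating_interventions(4, 2))
# [[0, 2], [1, 3], [0, 1], [2, 3]]
```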

NeurIPS Conference 2015 Conference Paper

Orthogonal NMF through Subspace Exploration

  • Megasthenis Asteris
  • Dimitris Papailiopoulos
  • Alexandros Dimakis

Orthogonal Nonnegative Matrix Factorization (ONMF) aims to approximate a nonnegative matrix as the product of two $k$-dimensional nonnegative factors, one of which has orthonormal columns. It yields potentially useful data representations as superpositions of disjoint parts, and it has been shown to work well for clustering tasks where traditional methods underperform. Existing algorithms rely mostly on heuristics which, despite their good empirical performance, lack provable performance guarantees. We present a new ONMF algorithm with provable approximation guarantees: for any constant dimension $k$, we obtain an additive EPTAS without any assumptions on the input. Our algorithm relies on a novel approximation to the related Nonnegative Principal Component Analysis (NNPCA) problem; given an arbitrary data matrix, NNPCA seeks $k$ nonnegative components that jointly capture most of the variance. Our NNPCA algorithm is of independent interest and generalizes previous work that could only obtain guarantees for a single component. We evaluate our algorithms on several real and synthetic datasets and show that their performance matches or outperforms the state of the art.

NeurIPS Conference 2015 Conference Paper

Sparse PCA via Bipartite Matchings

  • Megasthenis Asteris
  • Dimitris Papailiopoulos
  • Anastasios Kyrillidis
  • Alexandros Dimakis

We consider the following multi-component sparse PCA problem: given a set of data points, we seek to extract a small number of sparse components with \emph{disjoint} supports that jointly capture the maximum possible variance. Such components can be computed one by one, repeatedly solving the single-component problem and deflating the input data matrix, but this greedy procedure is suboptimal. We present a novel algorithm for sparse PCA that jointly optimizes multiple disjoint components. The extracted features capture variance that lies within a multiplicative factor arbitrarily close to $1$ from the optimal. Our algorithm is combinatorial and computes the desired components by solving multiple instances of the bipartite maximum weight matching problem. Its complexity grows as a low order polynomial in the ambient dimension of the input data, but exponentially in its rank. However, it can be effectively applied on a low-dimensional sketch of the input data. We evaluate our algorithm on real datasets and empirically demonstrate that in many cases it outperforms existing, deflation-based approaches.

NeurIPS Conference 2014 Conference Paper

On the Information Theoretic Limits of Learning Ising Models

  • Rashish Tandon
  • Karthikeyan Shanmugam
  • Pradeep Ravikumar
  • Alexandros Dimakis

We provide a general framework for computing lower bounds on the sample complexity of recovering the underlying graphs of Ising models from i.i.d. samples. While there have been recent results for specific graph classes, these involve fairly extensive technical arguments specialized to each class. In contrast, we isolate two key graph-structural ingredients that can then be used to specify sample complexity lower bounds; the presence of these structural properties makes a graph class hard to learn. We derive corollaries of our main result that not only recover existing recent results but also provide lower bounds for novel graph classes not considered previously. We also extend our framework to the random graph setting and derive corollaries for Erdős-Rényi graphs in a certain dense setting.

NeurIPS Conference 2014 Conference Paper

Sparse Polynomial Learning and Graph Sketching

  • Murat Kocaoglu
  • Karthikeyan Shanmugam
  • Alexandros Dimakis
  • Adam Klivans

Let $f: \{-1, 1\}^n \rightarrow \mathbb{R}$ be a polynomial with at most $s$ non-zero real coefficients. We give an algorithm for exactly reconstructing $f$ from random examples drawn from the uniform distribution on $\{-1, 1\}^n$ that runs in time polynomial in $n$ and $2^{s}$ and succeeds if the function satisfies the \textit{unique sign property}: there is one output value which corresponds to a unique set of values of the participating parities. This sufficient condition is satisfied when every coefficient of $f$ is perturbed by a small random noise, and holds with high probability when the $s$ parity functions are chosen randomly or when all the coefficients are positive. Learning sparse polynomials over the Boolean domain in time polynomial in $n$ and $2^{s}$ is considered notoriously hard in the worst case; our result shows that the problem is tractable for almost all sparse polynomials. We then show an application of this result to hypergraph sketching, the problem of learning a hypergraph that is sparse both in the number of hyperedges and in their sizes from uniformly drawn random cuts. We also provide experimental results on a real-world dataset.