STOC Conference 2025 Conference Paper
Bypassing the Noisy Parity Barrier: Learning Higher-Order Markov Random Fields from Dynamics
- Jason Gaitonde
- Ankur Moitra
- Elchanan Mossel
STOC Conference 2025 Conference Paper
STOC Conference 2025 Conference Paper
FOCS Conference 2025 Conference Paper
Motivated by connections between algebraic complexity lower bounds and tensor decompositions, we investigate Koszul-Young flattenings, which are the main ingredient in recent lower bounds for matrix multiplication. Based on this tool we give a new algorithm for decomposing an $n_{1} \times n_{2} \times n_{3}$ tensor as the sum of a minimal number of rank-1 terms, and certifying uniqueness of this decomposition. For $n_{1} \leq n_{2} \leq n_{3}$ with $n_{1} \rightarrow \infty$ and $n_{3}/n_{2} = O(1)$, our algorithm is guaranteed to succeed when the tensor rank is bounded by $r \leq (1-\epsilon)(n_{2}+n_{3})$ for an arbitrary $\epsilon > 0$, provided the tensor components are generically chosen. For any fixed $\epsilon$, the runtime is polynomial in $n_{3}$. When $n_{2} = n_{3} = n$, our condition on the rank gives a factor-of-2 improvement over the classical simultaneous diagonalization algorithm, which requires $r \leq n$, and also improves on the recent algorithm of Koiran (2024), which requires $r \leq 4n/3$. It also improves on the PhD thesis of Persu (2018), which solves rank detection for $r \leq 3n/2$. We complement our upper bounds by showing limitations, in particular that no flattening of the style we consider can surpass rank $n_{2}+n_{3}$. Furthermore, for $n \times n \times n$ tensors, we show that an even more general class of degree-$d$ polynomial flattenings cannot surpass rank $Cn$ for a constant $C=C(d)$. This suggests that for tensor decompositions, the case of generic components may be fundamentally harder than that of random components, where efficient decomposition is possible even in highly overcomplete settings.
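For a concrete point of comparison, here is a minimal numpy sketch of the classical simultaneous diagonalization (Jennrich) baseline mentioned above, not the Koszul-Young algorithm of this paper; it assumes generic real factors and $r \leq \min(n_1, n_2)$, and the function name and tolerances are ours:

```python
import numpy as np

def jennrich(T, r):
    """Decompose a generic rank-r tensor T (n1 x n2 x n3) into rank-1 terms via
    simultaneous diagonalization; requires r <= min(n1, n2) and generic factors."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal(T.shape[2])
    b = rng.standard_normal(T.shape[2])
    Ma = np.einsum('ijk,k->ij', T, a)   # random linear combinations of slices
    Mb = np.einsum('ijk,k->ij', T, b)
    # Ma = A D_a B^T and Mb = A D_b B^T share factors, so Ma pinv(Mb) has the
    # columns of A as eigenvectors (eigenvalues D_a / D_b), and Ma^T pinv(Mb^T)
    # has the columns of B as eigenvectors with the SAME eigenvalues.
    wA, VA = np.linalg.eig(Ma @ np.linalg.pinv(Mb))
    wB, VB = np.linalg.eig(Ma.T @ np.linalg.pinv(Mb.T))
    iA = np.argsort(-np.abs(wA))[:r]
    iB = np.argsort(-np.abs(wB))[:r]
    A = VA[:, iA].real                  # real factors assumed
    B = VB[:, iB].real
    # pair the two sides by matching eigenvalues
    perm = [int(np.argmin(np.abs(wB[iB] - wA[i]))) for i in iA]
    B = B[:, perm]
    # recover the third factor by least squares on the flattened tensor
    KR = np.stack([np.outer(A[:, j], B[:, j]).ravel() for j in range(r)], axis=1)
    C, *_ = np.linalg.lstsq(KR, T.reshape(-1, T.shape[2]), rcond=None)
    return A, B, C.T
```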
ICML Conference 2025 Conference Paper
Graph neural networks (GNNs) are the dominant approach to solving machine learning problems defined over graphs. Despite much theoretical and empirical work in recent years, our understanding of finer-grained aspects of architectural design for GNNs remains impoverished. In this paper, we consider the benefits of architectures that maintain and update edge embeddings. On the theoretical front, under a suitable computational abstraction for a layer in the model, as well as memory constraints on the embeddings, we show that there are natural tasks on graphical models for which architectures leveraging edge embeddings can be much shallower. Our techniques are inspired by results on time-space tradeoffs in theoretical computer science. Empirically, we show architectures that maintain edge embeddings almost always improve on their node-based counterparts—frequently significantly so in topologies that have "hub" nodes.
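To make the architectural contrast concrete, here is a hypothetical minimal message-passing layer in numpy that maintains and updates edge embeddings alongside node embeddings; it sketches the general idea only, not the specific architectures studied in the paper:

```python
import numpy as np

def edge_mp_layer(X, E, edges, Wn, We):
    """One message-passing layer with edge states.
    X: (n, d) node embeddings; E: (m, d) edge embeddings;
    edges: (m, 2) int array of endpoints; Wn: (2d, d); We: (3d, d)."""
    msg = np.zeros_like(X)
    for idx, (u, v) in enumerate(edges):
        # messages in both directions are modulated by the edge's own state
        msg[v] += np.tanh(X[u] + E[idx])
        msg[u] += np.tanh(X[v] + E[idx])
    X_new = np.tanh(np.concatenate([X, msg], axis=1) @ Wn)
    # each edge state is refreshed from its endpoints and its previous value
    E_new = np.tanh(np.concatenate([X[edges[:, 0]], X[edges[:, 1]], E], axis=1) @ We)
    return X_new, E_new
```

A node-only counterpart would drop E entirely; the paper's point is that carrying E can make the network needed for certain tasks on graphical models much shallower.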
NeurIPS Conference 2024 Conference Paper
Motivated by the problem of detecting AI-generated text, we consider the problem of watermarking the output of language models with provable guarantees. We aim for watermarks which satisfy: (a) undetectability, a cryptographic notion introduced by Christ, Gunn, & Zamir (2023) which stipulates that it is computationally hard to distinguish watermarked language model outputs from the model's actual output distribution; and (b) robustness to channels which introduce a constant fraction of adversarial insertions, substitutions, and deletions to the watermarked text. Earlier schemes could only handle stochastic substitutions and deletions, and thus we are aiming for a more natural and appealing robustness guarantee that holds with respect to edit distance. Our main result is a watermarking scheme which achieves both (a) and (b) when the alphabet size for the language model is allowed to grow as a polynomial in the security parameter. To derive such a scheme, we follow an approach introduced by Christ & Gunn (2024), which proceeds via first constructing pseudorandom codes satisfying undetectability and robustness properties analogous to those above; our codes have the additional benefit of relying on weaker computational assumptions than used in previous work. Then we show that there is a generic transformation from such codes over large alphabets to watermarking schemes for arbitrary language models.
FOCS Conference 2024 Conference Paper
Supervised learning is often computationally easy in practice. But to what extent does this mean that other modes of learning, such as reinforcement learning (RL), ought to be computationally easy by extension? In this work we show the first cryptographic separation between RL and supervised learning, by exhibiting a class of block MDPs and associated decoding functions where reward-free exploration is provably computationally harder than the associated regression problem. We also show that there is no computationally efficient algorithm for reward-directed RL in block MDPs, even when given access to an oracle for this regression problem. It is known that being able to perform regression in block MDPs is necessary for finding a good policy; our results suggest that it is not sufficient. Our separation lower bound uses a new robustness property of the Learning Parities with Noise (LPN) hardness assumption, which is crucial in handling the dependent nature of RL data. We argue that separations and oracle lower bounds, such as ours, are a more meaningful way to prove hardness of learning because the constructions better reflect the practical reality that supervised learning by itself is often not the computational bottleneck.
STOC Conference 2024 Conference Paper
FOCS Conference 2024 Conference Paper
We show that thermal states of local Hamiltonians are separable above a constant temperature. Specifically, for a local Hamiltonian $H$ on a graph with degree $\mathfrak{g}$, its Gibbs state at inverse temperature $\beta$, denoted by $\rho=e^{-\beta H}/\mathrm{tr}(e^{-\beta H})$, is a classical distribution over product states for all $\beta < 1/(c\mathfrak{g})$, where $c$ is a constant. This sudden death of thermal entanglement upends conventional wisdom about the presence of short-range quantum correlations in Gibbs states. Moreover, we show that we can efficiently sample from the distribution over product states. In particular, for any $\beta < 1/(c\mathfrak{g}^{3})$, we can prepare a state $\varepsilon$-close to $\rho$ in trace distance with a depth-one quantum circuit and $\mathrm{poly}(n)\log(1/\varepsilon)$ classical overhead. (In independent and concurrent work, Rouzé, França, and Alhambra [37] obtain an efficient quantum algorithm for preparing high-temperature Gibbs states via a dissipative evolution.)
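The sudden-death phenomenon is easy to see numerically on two qubits. The sketch below (our own toy example, not the paper's algorithm) computes the Gibbs state of a Heisenberg interaction and applies the Peres-Horodecki partial-transpose test, which is exact for two qubits; entanglement disappears below a critical $\beta$:

```python
import numpy as np
from scipy.linalg import expm

# Two-qubit Heisenberg interaction H = X(x)X + Y(x)Y + Z(x)Z
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0 + 0j, -1.0])
H = sum(np.kron(P, P) for P in (X, Y, Z))

def gibbs(beta):
    rho = expm(-beta * H)
    return rho / np.trace(rho)

def ppt_min_eig(rho):
    # partial transpose on the second qubit: swap its row/column indices;
    # a negative eigenvalue certifies entanglement (exact for 2 qubits)
    return np.linalg.eigvalsh(
        rho.reshape(2, 2, 2, 2).transpose(0, 3, 2, 1).reshape(4, 4)).min()

for beta in (0.1, 0.2, 0.3, 0.5, 1.0):
    print(beta, round(float(ppt_min_eig(gibbs(beta))), 4))
# the minimum eigenvalue crosses 0 near beta = ln(3)/4 ~ 0.275: separable
# below, entangled above, a two-qubit cartoon of the constant-temperature
# threshold in the abstract
```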
STOC Conference 2024 Conference Paper
FOCS Conference 2024 Conference Paper
We study the problem of Hamiltonian structure learning from real-time evolution: given the ability to apply $e^{-\mathrm{i}Ht}$ for an unknown local Hamiltonian $H=\sum_{a=1}^{m}\lambda_{a}E_{a}$ on $n$ qubits, the goal is to recover $H$. This problem is already well-understood under the assumption that the interaction terms, $E_{a}$, are given, and only the interaction strengths, $\lambda_{a}$, are unknown. But how efficiently can we learn a local Hamiltonian without prior knowledge of its interaction structure? We present a new, general approach to Hamiltonian learning that not only solves the challenging structure learning variant, but also resolves other open questions in the area, all while achieving the gold standard of Heisenberg-limited scaling. In particular, our algorithm recovers the Hamiltonian to $\varepsilon$ error with total evolution time $\mathcal{O}(\log(n)/\varepsilon)$, and has the following appealing properties: (1) it does not need to know the Hamiltonian terms; (2) it works beyond the short-range setting, extending to any Hamiltonian $H$ where the sum of terms interacting with a qubit has bounded norm; (3) it evolves according to $H$ in constant time $t$ increments, thus achieving constant time resolution. As an application, we can also learn Hamiltonians exhibiting power-law decay up to accuracy $\varepsilon$ with total evolution time beating the standard limit of $1/\varepsilon^{2}$.
RLJ Journal 2024 Journal Article
In this paper, we study the offline RL problem with linear function approximation. Our main structural assumption is that the MDP has low inherent Bellman error, which stipulates that linear value functions have linear Bellman backups with respect to the greedy policy. This assumption is natural in that it is essentially the minimal assumption required for value iteration to succeed. We give a computationally efficient algorithm which outputs a policy whose value is at least that of any policy which is well-covered by the dataset, known as a single-policy coverage condition. Even in the setting when the inherent Bellman error is 0 (termed linear Bellman completeness), our algorithm yields the first known guarantee under single-policy coverage. In the setting of positive inherent Bellman error $\epsilon_{\mathsf{BE}} > 0$, we show that the suboptimality error of our algorithm scales with $\sqrt{\epsilon_{\mathsf{BE}}}$. Furthermore, we prove that the scaling of the suboptimality with $\sqrt{\epsilon_{\mathsf{BE}}}$ cannot be improved for any algorithm. Our lower bound stands in contrast to many other settings in reinforcement learning with misspecification, where one can typically obtain performance that degrades linearly with the misspecification error.
STOC Conference 2023 Conference Paper
ICLR Conference 2023 Conference Paper
Existing methods for isolating hard subpopulations and spurious correlations in datasets often require human intervention. This can make these methods labor-intensive and dataset-specific. To address these shortcomings, we present a scalable method for automatically distilling a model's failure modes. Specifically, we harness linear classifiers to identify consistent error patterns, and, in turn, induce a natural representation of these failure modes as directions within the feature space. We demonstrate that this framework allows us to discover and automatically caption challenging subpopulations within the training dataset. Moreover, by combining our framework with off-the-shelf diffusion models, we can generate images that are especially challenging for the analyzed model, and thus can be used to perform synthetic data augmentation that helps remedy the model's failure modes.
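As a minimal sketch of the core step (our simplification; the full pipeline adds captioning and per-class analysis), one can fit a linear probe in feature space that separates a model's errors from its correct predictions and treat its weight vector as a "failure direction":

```python
import numpy as np
from sklearn.svm import LinearSVC

def failure_direction(feats, correct):
    """feats: (n, d) penultimate-layer features; correct: (n,) bools.
    Returns a unit vector along which errors concentrate."""
    clf = LinearSVC(C=0.1).fit(feats, correct.astype(int))
    w = clf.coef_.ravel()
    return w / np.linalg.norm(w)

# Surface the hardest subpopulation by ranking training points along it:
# scores = feats @ failure_direction(feats, correct)
# hardest = np.argsort(scores)[:50]
```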
STOC Conference 2023 Conference Paper
NeurIPS Conference 2023 Conference Paper
Score matching is an alternative to maximum likelihood (ML) for estimating a probability distribution parametrized up to a constant of proportionality. By fitting the ``score'' of the distribution, it sidesteps the need to compute this constant of proportionality (which is often intractable). While score matching and variants thereof are popular in practice, a precise theoretical understanding of the benefits and tradeoffs relative to maximum likelihood---both computational and statistical---has been lacking. In this work, we give the first example of a natural exponential family of distributions such that the score matching loss is computationally efficient to optimize, and has a comparable statistical efficiency to ML, while the ML loss is intractable to optimize using a gradient-based method. The family consists of exponentials of polynomials of fixed degree, and our result can be viewed as a continuous analogue of recent developments in the discrete setting. Precisely, we show: (1) Designing a zeroth-order or first-order oracle for optimizing the maximum likelihood loss is NP-hard. (2) Maximum likelihood has a statistical efficiency polynomial in the ambient dimension and the radius of the parameters of the family. (3) Minimizing the score matching loss is both computationally and statistically efficient, with complexity polynomial in the ambient dimension.
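The computational contrast is easy to see in one dimension: for $p_\theta(x) \propto \exp(\sum_j \theta_j x^j)$, the score matching (Hyvärinen) objective is quadratic in $\theta$, so the estimator is a single linear solve with no partition function in sight. A minimal sketch in our own notation:

```python
import numpy as np

def score_match_poly(x, degree):
    """Closed-form score matching for p(x) ~ exp(sum_{j=1..degree} theta_j x^j).
    Minimizes E[0.5*(d/dx log p)^2 + d^2/dx^2 log p], quadratic in theta."""
    js = np.arange(1, degree + 1)
    G1 = js * x[:, None] ** (js - 1)   # per-sample gradient of sufficient stats
    G = G1.T @ G1 / len(x)             # E[grad T grad T^T]
    h = np.mean(js * (js - 1) * x[:, None] ** np.maximum(js - 2, 0), axis=0)
    return -np.linalg.solve(G, h)

# Sanity check: for standard Gaussian data, log p = -x^2/2 + const, so the
# estimate should approach theta = (0, -1/2):
# print(score_match_poly(np.random.randn(100_000), 2))
```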
ICLR Conference 2023 Conference Paper
Auditing the stability of a machine learning model to small changes in the training procedure is critical for engendering trust in practical applications. For example, a model should not be overly sensitive to removing a small fraction of its training data. However, algorithmically validating this property seems computationally challenging, even for the simplest of models: Ordinary Least Squares (OLS) linear regression. Concretely, recent work defines the stability of a regression as the minimum number of samples that need to be removed so that rerunning the analysis overturns the conclusion (Broderick et al., 2020), specifically meaning that the sign of a particular coefficient of the OLS regressor changes. But the only known approach for estimating this metric, besides the obvious exponential-time algorithm, is a greedy heuristic that may produce severe overestimates and therefore cannot certify stability. We show that stability can be efficiently certified in the low-dimensional regime: when the number of covariates is a constant but the number of samples is large, there are polynomial-time algorithms for estimating (a fractional version of) stability, with provable approximation guarantees. Applying our algorithms to the Boston Housing dataset, we exhibit regression analyses where our estimator outperforms the greedy heuristic, and can successfully certify stability even in the regime where a constant fraction of the samples are dropped.
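For reference, the greedy heuristic being compared against (which can only over-estimate the removal count, hence cannot certify stability) fits in a few lines; this is our paraphrase, not the paper's certification algorithm:

```python
import numpy as np

def greedy_sign_flip(X, y, j):
    """Repeatedly drop the sample whose removal moves OLS coefficient j most
    toward zero; return how many removals flipped its sign (an upper bound)."""
    keep = np.ones(len(y), bool)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    s0, removed = np.sign(beta[j]), 0
    while keep.sum() > X.shape[1]:
        best, best_val = None, s0 * beta[j]
        for i in np.flatnonzero(keep):
            m = keep.copy(); m[i] = False
            b = np.linalg.lstsq(X[m], y[m], rcond=None)[0][j]
            if s0 * b < best_val:
                best, best_val = i, s0 * b
        if best is None:
            return None                        # greedy is stuck
        keep[best] = False; removed += 1
        beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        if np.sign(beta[j]) != s0:
            return removed
    return None
```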
SODA Conference 2023 Conference Paper
FOCS Conference 2023 Conference Paper
Strong spatial mixing (SSM) is an important quantitative notion of correlation decay for Gibbs distributions arising in statistical physics, probability theory, and theoretical computer science. A longstanding conjecture is that the uniform distribution on proper $q$-colorings on a $\Delta$-regular tree exhibits SSM whenever $q \geq \Delta+1$. Moreover, it is widely believed that as long as SSM holds on bounded-degree trees with $q$ colors, one would obtain an efficient sampler for $q$-colorings on all bounded-degree graphs via simple Markov chain algorithms. It is surprising that such a basic question is still open, even on trees, but then again it also highlights how much we still have to learn about random colorings. In this paper, we show the following: (1) For any $\Delta \geq 3$, SSM holds for random $q$-colorings on trees of maximum degree $\Delta$ whenever $q \geq \Delta+3$. Thus we almost fully resolve the aforementioned conjecture. Our result substantially improves upon the previously best bound, which requires $q \geq 1.59\Delta+\gamma^{*}$ for an absolute constant $\gamma^{*} > 0$. (2) For any $\Delta \geq 3$ and $g=\Omega_{\Delta}(1)$, we establish optimal mixing of the Glauber dynamics for $q$-colorings on graphs of maximum degree $\Delta$ and girth $g$ whenever $q \geq \Delta+3$. Our approach is based on a new general reduction from spectral independence on large-girth graphs to SSM on trees that is of independent interest. Using the same techniques, we also prove near-optimal bounds on weak spatial mixing (WSM), a closely-related notion to SSM, for the antiferromagnetic Potts model on trees.
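The Markov chain in question is simple to state in code; below is a plain Glauber dynamics sampler for proper $q$-colorings (a standard textbook chain written by us; the paper's contribution is proving when it mixes, not the sampler itself):

```python
import random

def glauber_colorings(adj, q, steps, seed=0):
    """adj: dict vertex -> list of neighbors; assumes q >= max_degree + 2 so a
    greedy initial coloring exists and every vertex always has a free color."""
    rng = random.Random(seed)
    col = {}
    for v in adj:                          # greedy proper initial coloring
        used = {col[u] for u in adj[v] if u in col}
        col[v] = next(c for c in range(q) if c not in used)
    verts = list(adj)
    for _ in range(steps):
        v = rng.choice(verts)              # recolor a uniformly random vertex
        free = [c for c in range(q) if c not in {col[u] for u in adj[v]}]
        col[v] = rng.choice(free)
    return col
```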
ICML Conference 2023 Conference Paper
Recently Chen and Poor initiated the study of learning mixtures of linear dynamical systems. While linear dynamical systems already have wide-ranging applications in modeling time-series data, using mixture models can lead to a better fit or even a richer understanding of underlying subpopulations represented in the data. In this work we give a new approach to learning mixtures of linear dynamical systems that is based on tensor decompositions. As a result, our algorithm succeeds without strong separation conditions on the components, and can be used to compete with the Bayes optimal clustering of the trajectories. Moreover our algorithm works in the challenging partially-observed setting. Our starting point is the simple but powerful observation that the classic Ho-Kalman algorithm is a relative of modern tensor decomposition methods for learning latent variable models. This gives us a playbook for how to extend it to work with more complicated generative models.
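The classic Ho-Kalman algorithm that serves as the abstract's starting point is itself short; here is a SISO sketch in numpy (our own minimal rendering, for a single system rather than a mixture):

```python
import numpy as np

def ho_kalman(h, order):
    """Recover a minimal realization (A, B, C) of a SISO linear dynamical
    system from impulse-response samples h[k] = C A^k B, k = 0, 1, ....
    Needs roughly len(h) >= 2 * order + 1 for the Hankel matrices to have rank."""
    m = (len(h) - 1) // 2
    H  = np.array([[h[i + j]     for j in range(m)] for i in range(m)])  # Hankel
    Hs = np.array([[h[i + j + 1] for j in range(m)] for i in range(m)])  # shifted
    U, s, Vt = np.linalg.svd(H)
    U, s, Vt = U[:, :order], s[:order], Vt[:order]
    O = U * np.sqrt(s)                      # observability factor
    Q = np.sqrt(s)[:, None] * Vt            # controllability factor
    A = np.linalg.pinv(O) @ Hs @ np.linalg.pinv(Q)
    return A, Q[:, :1], O[:1, :]            # A, B, C
```

The observation in the abstract is that this Hankel/SVD pipeline is structurally a relative of tensor decomposition methods, which is what licenses the extension to mixtures.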
STOC Conference 2022 Conference Paper
NeurIPS Conference 2022 Conference Paper
Much of reinforcement learning theory is built on top of oracles that are computationally hard to implement. Specifically for learning near-optimal policies in Partially Observable Markov Decision Processes (POMDPs), existing algorithms either need to make strong assumptions about the model dynamics (e.g., deterministic transitions) or assume access to an oracle for solving a hard optimistic planning or estimation problem as a subroutine. In this work we develop the first oracle-free learning algorithm for POMDPs under reasonable assumptions. Specifically, we give a quasipolynomial-time end-to-end algorithm for learning in ``observable'' POMDPs, where observability is the assumption that well-separated distributions over states induce well-separated distributions over observations. Our techniques circumvent the more traditional approach of using the principle of optimism under uncertainty to promote exploration, and instead give a novel application of barycentric spanners to constructing policy covers.
FOCS Conference 2022 Conference Paper
In this work, we study the problem of community detection in the stochastic block model with adversarial node corruptions. Our main result is an efficient algorithm that can tolerate an $\epsilon$-fraction of corruptions and achieves error $O(\epsilon)+e^{-\frac{C}{2}(1\pm o(1))}$ where $C=(\sqrt{a}-\sqrt{b})^{2}$ is the signal-to-noise ratio and $a/n$ and $b/n$ are the intra-community and inter-community connection probabilities respectively. These bounds essentially match the minimax rates for the SBM without corruptions. We also give robust algorithms for $\mathbb{Z}_{2}$-synchronization. At the heart of our algorithm is a new semidefinite program that uses global information to robustly boost the accuracy of a rough clustering. Moreover, we show that our algorithms are doubly-robust in the sense that they work in an even more challenging noise model that mixes adversarial corruptions with unbounded monotone changes, from the semi-random model.
NeurIPS Conference 2022 Conference Paper
The Burer-Monteiro method is one of the most widely used techniques for solving large-scale semidefinite programs (SDP). The basic idea is to solve a nonconvex program in $Y$, where $Y$ is an $n \times p$ matrix such that $X = Y Y^T$. We show that this method can solve SDPs in polynomial time in a smoothed analysis setting. More precisely, we consider an SDP whose domain satisfies some compactness and smoothness assumptions, and slightly perturb the cost matrix and the constraints. We show that if $p \gtrsim \sqrt{2(1{+}\eta)m}$, where $m$ is the number of constraints and $\eta>0$ is any fixed constant, then the Burer-Monteiro method can solve SDPs to any desired accuracy in polynomial time, in the smoothed analysis setting. The bound on $p$ approaches the celebrated Barvinok-Pataki bound in the limit as $\eta$ goes to zero, beneath which the nonconvex program can be suboptimal. Our main technical contribution, which is key for our tight bound on $p$, is to connect spurious approximately critical points of the nonconvex program to tubular neighborhoods of certain algebraic varieties, and then estimate the volume of such tubes.
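As a toy illustration of the method itself (not the smoothed-analysis result), here is Burer-Monteiro applied to the max-cut-style SDP $\max \langle W, X\rangle$ s.t. $X \succeq 0$, $X_{ii}=1$, with our own step size and iteration count:

```python
import numpy as np

def burer_monteiro(W, p, iters=2000, lr=0.05, seed=0):
    """Substitute X = Y Y^T with Y of shape (n, p) and run projected gradient
    ascent, renormalizing rows to keep the unit-diagonal constraint."""
    rng = np.random.default_rng(seed)
    Y = rng.standard_normal((W.shape[0], p))
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)
    for _ in range(iters):
        Y += lr * (W @ Y)         # ascent direction for <W, Y Y^T> (W symmetric)
        Y /= np.linalg.norm(Y, axis=1, keepdims=True)
    return Y @ Y.T                # feasible, approximately optimal SDP solution
```

Here there are $m = n$ diagonal constraints, so the paper's condition suggests $p \gtrsim \sqrt{2(1+\eta)n}$ suffices after smoothing.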
NeurIPS Conference 2022 Conference Paper
In learning theory, a standard assumption is that the data is generated from a finite mixture model. But what happens when the number of components is not known in advance? The problem of estimating the number of components, also called model selection, is important in its own right but there are essentially no known efficient algorithms with provable guarantees. In this work, we study the problem of model selection for univariate Gaussian mixture models (GMMs). Given $\textsf{poly}(k/\epsilon)$ samples from a distribution that is $\epsilon$-close in TV distance to a GMM with $k$ components, we can construct a GMM with $\widetilde{O}(k)$ components that approximates the distribution to within $\widetilde{O}(\epsilon)$ in $\textsf{poly}(k/\epsilon)$ time. Thus we are able to approximately determine the minimum number of components needed to fit the distribution within a logarithmic factor. Moreover, by adapting the techniques we obtain similar results for reconstructing Fourier-sparse signals. Prior to our work, the only known algorithms for learning arbitrary univariate GMMs either output significantly more than $k$ components (e.g., $k/\epsilon^2$ components for kernel density estimates) or run in time exponential in $k$.
NeurIPS Conference 2021 Conference Paper
In recent years there has been significant effort to adapt the key tools and ideas in convex optimization to the Riemannian setting. One key challenge has remained: Is there a Nesterov-like accelerated gradient method for geodesically convex functions on a Riemannian manifold? Recent work has given partial answers and the hope was that this ought to be possible. Here we prove that in a noisy setting, there is no analogue of accelerated gradient descent for geodesically convex functions on the hyperbolic plane. Our results apply even when the noise is exponentially small. The key intuition behind our proof is short and simple: In negatively curved spaces, the volume of a ball grows so fast that information about the past gradients is not useful in the future.
STOC Conference 2021 Conference Paper
FOCS Conference 2021 Conference Paper
In this work we revisit two classic high-dimensional online learning problems, namely linear regression and contextual bandits, from the perspective of adversarial robustness. Existing works in algorithmic robust statistics make strong distributional assumptions that ensure that the input data is evenly spread out or comes from a nice generative model. Is it possible to achieve strong robustness guarantees even without distributional assumptions altogether, where the sequence of tasks we are asked to solve is adaptively and adversarially chosen? We answer this question in the affirmative for both linear regression and contextual bandits. In fact our algorithms succeed where conventional methods fail. In particular we show strong lower bounds against Huber regression and more generally any convex $M$-estimator. Our approach is based on a novel alternating minimization scheme that interleaves ordinary least-squares with a simple convex program that finds the optimal reweighting of the distribution under a spectral constraint. Our results obtain essentially optimal dependence on the contamination level $\eta$, reach the optimal breakdown point, and naturally apply to infinite dimensional settings where the feature vectors are represented implicitly via a kernel map.
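A drastically simplified stand-in for the alternating scheme (residual-based trimming in place of the paper's spectrally constrained reweighting program, which we do not reproduce here):

```python
import numpy as np

def alt_min_regression(X, y, eta, iters=10):
    """Interleave weighted least squares with down-weighting the eta-fraction
    of points with the largest residuals. Illustrative only."""
    n = len(y)
    w = np.ones(n)
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        r2 = (y - X @ beta) ** 2
        w = np.ones(n)
        k = int(eta * n)
        if k:
            w[np.argsort(r2)[-k:]] = 0.0   # drop suspected corruptions
    return beta
```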
STOC Conference 2021 Conference Paper
NeurIPS Conference 2020 Conference Paper
In this paper, we revisit the problem of distribution-independently learning halfspaces under Massart noise with rate $\eta$. Recent work resolved a long-standing problem in this model of efficiently learning to error $\eta + \epsilon$ for any $\epsilon > 0$, by giving an improper learner that partitions space into $\text{poly}(d, 1/\epsilon)$ regions. Here we give a much simpler algorithm and settle a number of outstanding open questions: (1) We give the first \emph{proper} learner for Massart halfspaces that achieves $\eta + \epsilon$. (2) Based on (1), we develop a blackbox knowledge distillation procedure to convert an arbitrarily complex classifier to an equally good proper classifier. (3) By leveraging a simple but overlooked connection to \emph{evolvability}, we show any SQ algorithm requires super-polynomially many queries to achieve $\mathsf{OPT} + \epsilon$. We then zoom out to study generalized linear models and give an efficient algorithm for learning under a challenging new corruption model generalizing Massart noise. Finally we study our algorithm for learning halfspaces under Massart noise empirically and find that it exhibits some appealing fairness properties as a byproduct of its strong provable robustness guarantees.
STOC Conference 2020 Conference Paper
NeurIPS Conference 2020 Conference Paper
Gaussian Graphical Models (GGMs) have wide-ranging applications in machine learning and the natural and social sciences. In most of the settings in which they are applied, the number of observed samples is much smaller than the dimension and they are assumed to be sparse. While there are a variety of algorithms (e.g., Graphical Lasso, CLIME) that provably recover the graph structure with a logarithmic number of samples, to do so they require various assumptions on the well-conditioning of the precision matrix that are not information-theoretically necessary. Here we give the first fixed polynomial-time algorithms for learning attractive GGMs and walk-summable GGMs with a logarithmic number of samples without any such assumptions. In particular, our algorithms can tolerate strong dependencies among the variables. Our result for structure recovery in walk-summable GGMs is derived from a more general result for efficient sparse linear regression in walk-summable models without any norm dependencies. We complement our results with experiments showing that many existing algorithms fail even in some simple settings where there are long dependency chains. Our algorithms do not.
NeurIPS Conference 2020 Conference Paper
We revisit the problem of learning from untrusted batches introduced by Qiao and Valiant [QV17]. Recently, Jain and Orlitsky [JO19] gave a simple semidefinite programming approach based on the cut-norm that achieves essentially information-theoretically optimal error in polynomial time. Concurrently, Chen et al. [CLM19] considered a variant of the problem where the underlying distribution $\mu$ is assumed to be structured, e.g., log-concave, monotone hazard rate, t-modal, etc. In this case, it is possible to achieve the same error with sample complexity sublinear in $n$, and they exhibited a quasi-polynomial time algorithm for doing so using Haar wavelets. In this paper, we find an appealing way to synthesize the techniques of [JO19] and [CLM19] to give the best of both worlds: an algorithm which runs in polynomial time and can exploit structure in the underlying distribution to achieve sublinear sample complexity. Along the way, we simplify the approach of [JO19] by avoiding the need for SDP rounding and giving a more direct interpretation of it through the lens of soft filtering, a powerful recent technique in high-dimensional robust estimation. We validate the usefulness of our algorithms in preliminary experimental evaluations.
NeurIPS Conference 2020 Conference Paper
Tensor completion is a natural higher-order generalization of matrix completion where the goal is to recover a low-rank tensor from sparse observations of its entries. Existing algorithms are either heuristic without provable guarantees, based on solving large semidefinite programs which are impractical to run, or make strong assumptions such as requiring the factors to be nearly orthogonal. In this paper we introduce a new variant of alternating minimization, which in turn is inspired by understanding how the progress measures that guide convergence of alternating minimization in the matrix setting need to be adapted to the tensor setting. We show strong provable guarantees, including showing that our algorithm converges linearly to the true tensor even when the factors are highly correlated, and that it can be implemented in nearly linear time. Moreover our algorithm is also highly practical and we show that we can complete third-order tensors with a thousand dimensions from observing only a tiny fraction of their entries. In contrast, and somewhat surprisingly, we show that the standard version of alternating minimization, without our new twist, can converge at a drastically slower rate in practice.
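For contrast with the paper's modified variant, the standard alternating minimization it improves upon looks like this (a plain ALS sketch of ours, solving one least-squares problem per factor row against the observed entries only):

```python
import numpy as np

def als_complete(T_obs, mask, r, iters=50, seed=0):
    """T_obs: (n1, n2, n3) with zeros off-support; mask: boolean observations."""
    rng = np.random.default_rng(seed)
    n1, n2, n3 = T_obs.shape
    A, B, C = (rng.standard_normal((n, r)) for n in (n1, n2, n3))
    obs = np.argwhere(mask)
    for _ in range(iters):
        for F, G1, G2, ax in [(A, B, C, 0), (B, A, C, 1), (C, A, B, 2)]:
            o1, o2 = [d for d in range(3) if d != ax]
            for i in range(F.shape[0]):
                idx = obs[obs[:, ax] == i]           # observations in this slice
                if len(idx) == 0:
                    continue
                M = G1[idx[:, o1]] * G2[idx[:, o2]]  # (k, r) design matrix
                t = T_obs[idx[:, 0], idx[:, 1], idx[:, 2]]
                F[i] = np.linalg.lstsq(M, t, rcond=None)[0]
    return np.einsum('ir,jr,kr->ijk', A, B, C)
```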
STOC Conference 2019 Conference Paper
SODA Conference 2019 Conference Paper
STOC Conference 2019 Conference Paper
STOC Conference 2019 Conference Paper
FOCS Conference 2018 Conference Paper
Mixtures of Mallows models are a popular generative model for ranking data coming from a heterogeneous population. They have a variety of applications including social choice, recommendation systems and natural language processing. Here we give the first polynomial time algorithm for provably learning the parameters of a mixture of Mallows models with any constant number of components. Prior to our work, only the two component case had been settled. Our analysis revolves around a determinantal identity of Zagier which was proven in the context of mathematical physics, which we use to show polynomial identifiability and ultimately to construct test functions to peel off one component at a time. To complement our upper bounds, we show information-theoretic lower bounds on the sample complexity as well as lower bounds against restricted families of algorithms that make only local queries. Together, these results demonstrate various impediments to improving the dependence on the number of components. They also motivate the study of learning mixtures of Mallows models from the perspective of beyond worst-case analysis. In this direction, we show that when the scaling parameters of the Mallows models have separation, there are much faster learning algorithms.
SODA Conference 2018 Conference Paper
STOC Conference 2017 Conference Paper
ICML Conference 2017 Conference Paper
Robust estimation is much more challenging in high-dimensions than it is in one-dimension: Most techniques either lead to intractable optimization problems or estimators that can tolerate only a tiny fraction of errors. Recent work in theoretical computer science has shown that, in appropriate distributional models, it is possible to robustly estimate the mean and covariance with polynomial time algorithms that can tolerate a constant fraction of corruptions, independent of the dimension. However, the sample and time complexity of these algorithms is prohibitively large for high-dimensional applications. In this work, we address both of these issues by establishing sample complexity bounds that are optimal, up to logarithmic factors, as well as giving various refinements that allow the algorithms to tolerate a much larger fraction of corruptions. Finally, we show on both synthetic and real data that our algorithms have state-of-the-art performance and suddenly make high-dimensional robust estimation a realistic possibility.
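The filtering paradigm at the heart of such algorithms fits in a dozen lines; this bare-bones version (scores and thresholds simplified by us) conveys the idea for robust mean estimation of an identity-covariance Gaussian:

```python
import numpy as np

def filtered_mean(X, eps, iters=20):
    """While the empirical covariance has a suspiciously large top eigenvalue,
    delete the points projecting farthest along the top eigenvector."""
    X = X.copy()
    for _ in range(iters):
        mu = X.mean(axis=0)
        w, V = np.linalg.eigh(np.cov(X, rowvar=False))
        if w[-1] < 1 + 10 * eps:          # spectrum looks clean: stop
            break
        proj = np.abs((X - mu) @ V[:, -1])
        k = max(1, int(len(X) * eps / iters))
        X = np.delete(X, np.argsort(proj)[-k:], axis=0)
    return X.mean(axis=0)
```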
NeurIPS Conference 2017 Conference Paper
Markov random fields are a popular model for high-dimensional probability distributions. Over the years, many mathematical, statistical and algorithmic problems on them have been studied. Until recently, the only known algorithms for provably learning them relied on exhaustive search, correlation decay or various incoherence assumptions. Bresler gave an algorithm for learning general Ising models on bounded degree graphs. His approach was based on a structural result about mutual information in Ising models. Here we take a more conceptual approach to proving lower bounds on the mutual information. Our proof generalizes well beyond Ising models, to arbitrary Markov random fields with higher order interactions. As an application, we obtain algorithms for learning Markov random fields on bounded degree graphs on $n$ nodes with $r$-order interactions in $n^r$ time and $\log n$ sample complexity. Our algorithms also extend to various partial observation models.
ICML Conference 2017 Conference Paper
Determinantal Point Processes (DPPs) are a family of probabilistic models that have a repulsive behavior, and lend themselves naturally to many tasks in machine learning where returning a diverse set of objects is important. While there are fast algorithms for sampling, marginalization and conditioning, much less is known about learning the parameters of a DPP. Our contribution is twofold: (i) we establish the optimal sample complexity achievable in this problem and show that it is governed by a natural parameter, which we call the cycle sparsity; (ii) we propose a provably fast combinatorial algorithm that implements the method of moments efficiently and achieves optimal sample complexity. Finally, we give experimental results that confirm our theoretical findings.
FOCS Conference 2016 Conference Paper
We prove that with high probability over the choice of a random graph $G$ from the Erdős–Rényi distribution $G(n, 1/2)$, the $n^{O(d)}$-time degree-$d$ Sum-of-Squares semidefinite programming relaxation for the clique problem will give a value of at least $n^{1/2-c(d/\log n)^{1/2}}$ for some constant $c > 0$. This yields a nearly tight $n^{1/2-o(1)}$ bound on the value of this program for any degree $d = o(\log n)$. Moreover we introduce a new framework that we call pseudo-calibration to construct Sum-of-Squares lower bounds. This framework is inspired by taking a computational analogue of Bayesian probability theory. It yields a general recipe for constructing good pseudo-distributions (i.e., dual certificates for the Sum-of-Squares semidefinite program), and sheds further light on the ways in which this hierarchy differs from others.
STOC Conference 2016 Conference Paper
The stochastic block model is one of the oldest and most ubiquitous models for studying clustering and community detection. In an exciting sequence of developments, motivated by deep but non-rigorous ideas from statistical physics, Decelle et al. conjectured a sharp threshold for when community detection is possible in the sparse regime. Mossel, Neeman, and Sly, and independently Massoulié, proved the conjecture and gave matching algorithms and lower bounds.
ICML Conference 2016 Conference Paper
Recently, there has been considerable progress on designing algorithms with provable guarantees —typically using linear algebraic methods—for parameter learning in latent variable models. Designing provable algorithms for inference has proved more difficult. Here we take a first step towards provable inference in topic models. We leverage a property of topic models that enables us to construct simple linear estimators for the unknown topic proportions that have small variance, and consequently can work with short documents. Our estimators also correspond to finding an estimate around which the posterior is well-concentrated. We show lower bounds that for shorter documents it can be information theoretically impossible to find the hidden topics. Finally, we give empirical results that demonstrate that our algorithm works on realistic topic models. It yields good solutions on synthetic data and runs in time comparable to a single iteration of Gibbs sampling.
FOCS Conference 2016 Conference Paper
We study high-dimensional distribution learning in an agnostic setting where an adversary is allowed to arbitrarily corrupt an $\epsilon$-fraction of the samples. Such questions have a rich history spanning statistics, machine learning and theoretical computer science. Even in the most basic settings, the only known approaches are either computationally inefficient or lose dimension-dependent factors in their error guarantees. This raises the following question: Is high-dimensional agnostic distribution learning even possible, algorithmically? In this work, we obtain the first computationally efficient algorithms for agnostically learning several fundamental classes of high-dimensional distributions: (1) a single Gaussian, (2) a product distribution on the hypercube, (3) mixtures of two product distributions (under a natural balancedness condition), and (4) mixtures of $k$ Gaussians with identical spherical covariances. All our algorithms achieve error that is independent of the dimension, and in many cases depends nearly-linearly on the fraction of adversarially corrupted samples. Moreover, we develop a general recipe for detecting and correcting corruptions in high-dimensions, that may be applicable to many other problems.
STOC Conference 2015 Conference Paper
SODA Conference 2014 Conference Paper
We consider a problem which has received considerable attention in the systems literature because of its applications to routing in delay tolerant networks and replica placement in distributed storage systems. In abstract terms the problem can be stated as follows: Given a random variable $X$ generated by a known product distribution over $\{0,1\}^{n}$ and a target value $0 \leq \theta \leq 1$, output a non-negative vector $w$, with $\|w\|_{1} \leq 1$, which maximizes the probability of the event $w \cdot X \geq \theta$. This is a challenging non-convex optimization problem for which even computing the value $\Pr[w \cdot X \geq \theta]$ of a proposed solution vector $w$ is #P-hard. We provide an additive EPTAS for this problem which, for constant-bounded product distributions, runs in $\mathrm{poly}(n) \cdot 2^{\mathrm{poly}(1/\epsilon)}$ time and outputs an $\epsilon$-approximately optimal solution vector $w$ for this problem. Our approach is inspired by, and extends, recent structural results from the complexity-theoretic study of linear threshold functions. Furthermore, in spite of the objective function being non-smooth, we give a unicriterion PTAS while previous work for such objective functions has typically led to a bicriterion PTAS. We believe our techniques may be applicable to get unicriterion PTAS for other non-smooth objective functions.
STOC Conference 2014 Conference Paper
FOCS Conference 2013 Conference Paper
We give a polynomial time algorithm for the lossy population recovery problem. In this problem, the goal is to approximately learn an unknown distribution on binary strings of length $n$ from lossy samples: for some parameter $\mu$, each coordinate of the sample is preserved with probability $\mu$ and otherwise is replaced by a `?'. The running time and number of samples needed for our algorithm is polynomial in $n$ and $1/\epsilon$ for each fixed $\mu>0$. This improves on the algorithm of Wigderson and Yehudayoff, which runs in quasi-polynomial time for any $\mu > 0$, and the polynomial time algorithm of Dvir et al., which was shown to work for $\mu \gtrsim 0.30$ by Batman et al. In fact, our algorithm also works in the more general framework of Batman et al. in which there is no a priori bound on the size of the support of the distribution. The algorithm we analyze is implicit in previous work; our main contribution is to analyze the algorithm by showing (via linear programming duality and connections to complex analysis) that a certain matrix associated with the problem has a robust local inverse even though its condition number is exponentially small. A corollary of our result is the first polynomial time algorithm for learning DNFs in the restriction access model of Dvir et al. [9].
ICML Conference 2013 Conference Paper
Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model learning have been based on a maximum likelihood objective. Efficient algorithms exist that attempt to approximate this objective, but they have no provable guarantees. Recently, algorithms have been introduced that provide provable bounds, but these algorithms are not practical because they are inefficient and not robust to violations of model assumptions. In this paper we present an algorithm for learning topic models that is both provable and practical. The algorithm produces results comparable to the best MCMC implementations while running orders of magnitude faster.
SODA Conference 2013 Conference Paper
STOC Conference 2013 Conference Paper
We prove an unconditional lower bound that any linear program that achieves an $O(n^{1-\epsilon})$ approximation for clique has size $2^{\Omega(n^{\epsilon})}$. There has been considerable recent interest in proving unconditional lower bounds against any linear program. Fiorini et al. proved that there is no polynomial sized linear program for traveling salesman. Braun et al. proved that there is no polynomial sized $O(n^{1/2-\epsilon})$-approximate linear program for clique. Here we prove an optimal and unconditional lower bound against linear programs for clique that matches Håstad's celebrated hardness result. Interestingly, the techniques used to prove such lower bounds have closely followed the progression of techniques used in communication complexity. Here we develop an information theoretic framework to approach these questions, and we use it to prove our main result. Also we resolve a related question: How many bits of communication are needed to get $\epsilon$-advantage over random guessing for disjointness? Kalyanasundaram and Schnitger proved that a protocol that gets constant advantage requires $\Omega(n)$ bits of communication. This result in conjunction with amplification implies that any protocol that gets $\epsilon$-advantage requires $\Omega(\epsilon^{2} n)$ bits of communication. Here we improve this bound to $\Omega(\epsilon n)$, which is optimal for any $\epsilon > 0$.
STOC Conference 2012 Conference Paper
FOCS Conference 2012 Conference Paper
Topic Modeling is an approach used for automatic comprehension and classification of data in a variety of settings, and perhaps the canonical application is in uncovering thematic structure in a corpus of documents. A number of foundational works both in machine learning and in theory have suggested a probabilistic model for documents, whereby documents arise as a convex combination of (i.e., distribution on) a small number of topic vectors, each topic vector being a distribution on words (i.e., a vector of word-frequencies). Similar models have since been used in a variety of application areas; the Latent Dirichlet Allocation or LDA model of Blei et al. is especially popular. Theoretical studies of topic modeling focus on learning the model's parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition (SVD), and consequently have one of two limitations: these works need to either assume that each document contains only one topic, or else can only recover the {\em span} of the topic vectors instead of the topic vectors themselves. This paper formally justifies Nonnegative Matrix Factorization (NMF) as a main tool in this context, which is an analog of SVD where all vectors are nonnegative. Using this tool we give the first polynomial-time algorithm for learning topic models without the above two limitations. The algorithm uses a fairly mild assumption about the underlying topic matrix called separability, which is usually found to hold in real-life data. Perhaps the most attractive feature of our algorithm is that it generalizes to yet more realistic models that incorporate topic-topic correlations, such as the Correlated Topic Model (CTM) and the Pachinko Allocation Model (PAM). We hope that this paper will motivate further theoretical results that use NMF as a replacement for SVD -- just as NMF has come to replace SVD in many applications.
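Separability says every topic has an "anchor word" occurring only in that topic, so the anchor rows of the word co-occurrence matrix form the vertices of a simplex containing all other rows. A common way to exploit this (our sketch of the greedy farthest-point idea used in this line of work, not necessarily this paper's exact subroutine):

```python
import numpy as np

def find_anchors(Q, k):
    """Q: (V, V) word co-occurrence counts; returns k candidate anchor words."""
    Q = Q / Q.sum(axis=1, keepdims=True)           # rows are conditional dists
    anchors = [int(np.argmax(np.linalg.norm(Q, axis=1)))]
    for _ in range(k - 1):
        # project out the span of current anchors; farthest row is next anchor
        Bq = np.linalg.qr(Q[anchors].T)[0]
        resid = Q - (Q @ Bq) @ Bq.T
        anchors.append(int(np.argmax(np.linalg.norm(resid, axis=1))))
    return anchors
```

Given anchors, topics can be recovered by expressing every other row as a convex combination of the anchor rows (a small simplex-constrained regression per word).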
STOC Conference 2012 Conference Paper
We describe two constructions of (very) dense graphs which are edge disjoint unions of large induced matchings. The first construction exhibits graphs on $N$ vertices with $\binom{N}{2}-o(N^{2})$ edges, which can be decomposed into pairwise disjoint induced matchings, each of size $N^{1-o(1)}$. The second construction provides a covering of all edges of the complete graph $K_{N}$ by two graphs, each being the edge disjoint union of at most $N^{2-\delta}$ induced matchings, where $\delta > 0.076$. This disproves (in a strong form) a conjecture of Meshulam, substantially improves a result of Birk, Linial and Meshulam on communicating over a shared channel, and (slightly) extends the analysis of Håstad and Wigderson of the graph test of Samorodnitsky and Trevisan for linearity. Additionally, our constructions settle a combinatorial question of Vempala regarding a candidate rounding scheme for the directed Steiner tree problem.
NeurIPS Conference 2012 Conference Paper
We present a new algorithm for Independent Component Analysis (ICA) which has provable performance guarantees. In particular, suppose we are given samples of the form $y = Ax + \eta$ where $A$ is an unknown $n \times n$ matrix and $x$ is chosen uniformly at random from $\{+1, -1\}^n$, and $\eta$ is an $n$-dimensional Gaussian random variable with unknown covariance $\Sigma$. We give an algorithm that provably recovers $A$ and $\Sigma$ up to an additive $\epsilon$ whose running time and sample complexity are polynomial in $n$ and $1 / \epsilon$. To accomplish this, we introduce a novel ``quasi-whitening'' step that may be useful in other contexts in which the covariance of Gaussian noise is not known in advance. We also give a general framework for finding all local optima of a function (given an oracle for approximately finding just one) and this is a crucial step in our algorithm, one that has been overlooked in previous attempts, and allows us to control the accumulation of error when we find the columns of $A$ one by one via local search.
SODA Conference 2011 Conference Paper
STOC Conference 2011 Conference Paper
FOCS Conference 2011 Conference Paper
We revisit the problem of reliable interactive communication over a noisy channel, and obtain the first fully explicit (randomized) efficient constant-rate emulation procedure for reliable interactive communication. Our protocol works for any discrete memoryless noisy channel with constant capacity, and fails with exponentially small probability in the total length of the protocol. Following work by Schulman [Schulman 1993], our simulation uses a tree code; yet as opposed to the non-constructive absolute tree code used by Schulman, we introduce a relaxation in the notion of goodness for a tree code and define a potent tree code. This relaxation allows us to construct an explicit emulation procedure for any two-party protocol. Our results also extend to the case of interactive multiparty communication. We show that a randomly generated tree code (with suitable constant alphabet size) is an efficiently decodable potent tree code with overwhelming probability. Furthermore we are able to partially derandomize this result by means of epsilon-biased distributions using only O(N) random bits, where N is the depth of the tree.
STOC Conference 2011 Conference Paper
STOC Conference 2010 Conference Paper
Given data drawn from a mixture of multivariate Gaussians, a basic problem is to accurately estimate the mixture parameters. We provide a polynomial-time algorithm for this problem for the case of two Gaussians in $n$ dimensions (even if they overlap), with provably minimal assumptions on the Gaussians, and polynomial data requirements. In statistical terms, our estimator converges at an inverse polynomial rate, and no such estimator (even exponential time) was known for this problem (even in one dimension). Our algorithm reduces the n-dimensional problem to the one-dimensional problem, where the method of moments is applied. One technical challenge is proving that noisy estimates of the first six moments of a univariate mixture suffice to recover accurate estimates of the mixture parameters, as conjectured by Pearson (1894), and in fact these estimates converge at an inverse polynomial rate. As a corollary, we can efficiently perform near-optimal clustering: in the case where the overlap between the Gaussians is small, one can accurately cluster the data, and when the Gaussians have partial overlap, one can still accurately cluster those data points which are not in the overlap region. A second consequence is a polynomial-time density estimation algorithm for arbitrary mixtures of two Gaussians, generalizing previous work on axis-aligned Gaussians (Feldman {\em et al}, 2006).
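In one dimension, the Pearson-style approach the abstract builds on reduces to solving the moment equations; below is a toy numerical version (a hypothetical fit of ours using a generic solver, rather than the paper's provable procedure):

```python
import numpy as np
from scipy.optimize import least_squares

def fit_two_gaussians_1d(x):
    """Match the first six empirical moments of a univariate 2-Gaussian mixture.
    Parameters: (weight, mean1, std1, mean2, std2)."""
    emp = np.array([np.mean(x ** i) for i in range(1, 7)])

    def resid(p):
        w, m1, s1, m2, s2 = p
        def raw(m, s, i):
            # Gaussian raw moments via M_i = m*M_{i-1} + (i-1)*s^2*M_{i-2}
            M = [1.0, m]
            for j in range(2, i + 1):
                M.append(m * M[-1] + (j - 1) * s ** 2 * M[-2])
            return M[i]
        mom = [w * raw(m1, s1, i) + (1 - w) * raw(m2, s2, i) for i in range(1, 7)]
        return np.array(mom) - emp

    p0 = [0.5, np.percentile(x, 25), 1.0, np.percentile(x, 75), 1.0]
    lb = [0.01, -np.inf, 1e-3, -np.inf, 1e-3]
    ub = [0.99, np.inf, np.inf, np.inf, np.inf]
    return least_squares(resid, p0, bounds=(lb, ub)).x
```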
STOC Conference 2010 Conference Paper
FOCS Conference 2010 Conference Paper
Given data drawn from a mixture of multivariate Gaussians, a basic problem is to accurately estimate the mixture parameters. We give an algorithm for this problem that has running time and data requirements polynomial in the dimension and the inverse of the desired accuracy, with provably minimal assumptions on the Gaussians. As a simple consequence of our learning algorithm, we give the first polynomial time algorithm for proper density estimation for mixtures of k Gaussians that needs no assumptions on the mixture. It was open whether proper density estimation was even statistically possible (with no assumptions) given only polynomially many samples, let alone whether it could be computationally efficient. The building blocks of our algorithm are based on the work (Kalai et al., STOC 2010) that gives an efficient algorithm for learning mixtures of two Gaussians by considering a series of projections down to one dimension, and applying the method of moments to each univariate projection. A major technical hurdle in the previous work is showing that one can efficiently learn univariate mixtures of two Gaussians. In contrast, because pathological scenarios can arise when considering projections of mixtures of more than two Gaussians, the bulk of the work in this paper concerns how to leverage a weaker algorithm for learning univariate mixtures (of many Gaussians) to learn in high dimensions. Our algorithm employs hierarchical clustering and rescaling, together with methods for backtracking and recovering from the failures that can occur in our univariate algorithm. Finally, while the running time and data requirements of our algorithm depend exponentially on the number of Gaussians in the mixture, we prove that such a dependence is necessary.
FOCS Conference 2010 Conference Paper
The notion of vertex sparsification (in particular cut-sparsification) was introduced in prior work, where it was shown that for any graph $G = (V, E)$ and any subset of $k$ terminals $K \subset V$, there is a polynomial time algorithm to construct a graph $H = (K, E_{H})$ on just the terminal set so that simultaneously for all cuts $(A, K \setminus A)$, the value of the minimum cut in $G$ separating $A$ from $K \setminus A$ is approximately the same as the value of the corresponding cut in $H$. Then approximation algorithms can be run directly on $H$ as a proxy for running on $G$. We give the first super-constant lower bounds for how well a cut-sparsifier $H$ can simultaneously approximate all minimum cuts in $G$. We prove a lower bound of $\Omega(\log^{1/4} k)$; this is polynomially related to the known upper bound of $O(\log k/\log\log k)$. Independently, a similar lower bound was given in concurrent work. This is an exponential improvement on the $\Omega(\log\log k)$ bound given previously, which in fact was for a stronger vertex sparsification guarantee and did not apply to cut sparsifiers. Despite this negative result, we show that for many natural optimization problems, we do not need to incur a multiplicative penalty for our reduction. Roughly, we show that any rounding algorithm which also works for the 0-extension relaxation can be used to construct good vertex-sparsifiers for which the optimization problem is easy. Using this, we obtain optimal $O(\log k)$-competitive Steiner oblivious routing schemes, which generalize earlier results. We also demonstrate that for a wide range of graph packing problems (which includes maximum concurrent flow, maximum multiflow and multicast routing, among others, as a special case), the integrality gap of the linear program is always at most $O(\log k)$ times the integrality gap restricted to trees. Lastly, we use our ideas to give an efficient construction for vertex-sparsifiers that match the current best existential results; this was previously open. Our algorithm makes novel use of Earth-mover constraints.
FOCS Conference 2009 Conference Paper
Linial, London and Rabinovich [16] and Aumann and Rabani [3] proved that the min-cut max-flow ratio for general maximum concurrent flow problems (when there are $k$ commodities) is $O(\log k)$. Here we attempt to derive a more general theory of Steiner cut and flow problems, and we prove bounds that are poly-logarithmic in $k$ for a much broader class of multicommodity flow and cut problems. Our structural results are motivated by the meta question: Suppose we are given a $\mathrm{poly}(\log n)$ approximation algorithm for a flow or cut problem; when can we give a $\mathrm{poly}(\log k)$ approximation algorithm for a generalization of this problem to a Steiner cut or flow problem? Thus we require that these approximation guarantees be independent of the size of the graph, and only depend on the number of commodities (or the number of terminal nodes in a Steiner cut problem). For many natural applications (when $k = n^{o(1)}$) this yields much stronger guarantees. We construct vertex-sparsifiers that approximately preserve the value of all terminal min-cuts. We prove such sparsifiers exist through zero-sum games and metric geometry, and we construct such sparsifiers through oblivious routing guarantees. These results let us reduce a broad class of multicommodity-type problems to a uniform case (on $k$ nodes) at the cost of a $\mathrm{poly}(\log k)$ loss in the approximation guarantee. We then give $\mathrm{poly}(\log k)$ approximation algorithms for a number of problems for which such results were previously unknown, such as requirement cut, $\ell$-multicut, oblivious 0-extension, and natural Steiner generalizations of oblivious routing, min-cut linear arrangement and minimum linear arrangement.
FOCS Conference 2008 Conference Paper
Geographic routing is a family of routing algorithms that uses geographic point locations as addresses for the purposes of routing. Such routing algorithms have proven to be both simple to implement and heuristically effective when applied to wireless sensor networks. Greedy routing is a natural abstraction of this model in which nodes are assigned virtual coordinates in a metric space, and these coordinates are used to perform point-to-point routing. Here we resolve a conjecture of Papadimitriou and Ratajczak that every 3-connected planar graph admits a greedy embedding into the Euclidean plane. This immediately implies that all 3-connected graphs that exclude $K_{3,3}$ as a minor admit a greedy embedding into the Euclidean plane. Additionally, we provide the first non-trivial examples of graphs that admit no such embedding. These structural results provide efficiently verifiable certificates that a graph admits a greedy embedding or that a graph admits no greedy embedding into the Euclidean plane.
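Greedy routing itself is the simplest part of the story; given candidate virtual coordinates, the following few lines (our own) check the greedy-embedding property on any source-target pair:

```python
def greedy_route(coords, adj, s, t):
    """coords: dict v -> (x, y); adj: dict v -> neighbors. Returns the greedy
    path from s to t, or None if routing gets stuck (no strictly closer
    neighbor), which witnesses that coords is not a greedy embedding."""
    def d2(u, v):
        (x1, y1), (x2, y2) = coords[u], coords[v]
        return (x1 - x2) ** 2 + (y1 - y2) ** 2
    path = [s]
    while path[-1] != t:
        here = path[-1]
        nxt = min(adj[here], key=lambda u: d2(u, t))
        if d2(nxt, t) >= d2(here, t):
            return None
        path.append(nxt)
    return path
```

The theorem guarantees that for every 3-connected planar graph some assignment of coords makes this succeed for every pair.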