STOC Conference 2025 Conference Paper
A Generalized Trace Reconstruction Problem: Recovering a String of Probabilities
- Joey Rivkin
- Gregory Valiant
- Paul Valiant
Author name cluster
Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.
STOC Conference 2025 Conference Paper
STOC Conference 2025 Conference Paper
NeurIPS Conference 2025 Conference Paper
We explore if it is possible to learn data structures end-to-end with neural networks, with a focus on the problem of nearest-neighbor (NN) search. We introduce a framework for data structure discovery, which adapts to the underlying data distribution and provides fine-grained control over query and space complexity. Crucially, the data structure is learned from scratch, and does not require careful initialization or seeding with candidate data structures. In several settings, we are able to reverse-engineer the learned data structures and query algorithms. For 1D nearest neighbor search, the model discovers optimal distribution (in)dependent algorithms such as binary search and variants of interpolation search. In higher dimensions, the model learns solutions that resemble k-d trees in some regimes, while in others, elements of locality-sensitive hashing emerge. Additionally, the model learns useful representations of high-dimensional data such as images and exploits them to design effective data structures. Beyond NN search, we believe the framework could be a powerful tool for data structure discovery for other problems and adapt our framework to the problem of estimating frequencies over a data stream. To encourage future work in this direction, we conclude with a discussion on some of the opportunities and remaining challenges of learning data structures end-to-end.
NeurIPS Conference 2025 Conference Paper
We study the problem of entropy calibration, which asks whether a language model's entropy over generations matches its log loss on human text. Past work found that models are miscalibrated, with entropy per step increasing as generations grow longer, due to error accumulation. To calibrate the model and improve text quality, it has become standard practice to truncate the distribution, but this approach reduces output diversity, which we would like to avoid. Therefore, in this paper, we ask: does miscalibration improve automatically with scale, and if not, is it theoretically possible to calibrate without tradeoffs? To build intuition, we first study a simplified theoretical setting to characterize the scaling behavior of miscalibration with respect to dataset size. We find that the rate of scaling depends on the power law exponent of the data distribution --- in particular, for a power law exponent close to 1, the scaling exponent is close to 0, meaning that miscalibration improves very slowly with scale. Next, we measure miscalibration empirically in language models ranging from 0. 5B to 70B parameters. We find that the observed scaling behavior is similar to what is predicted theoretically: our fitted scaling exponents for text are close to 0, meaning that larger models accumulate error at a similar rate as smaller ones. This scaling (or, lack thereof) provides one explanation for why we sample from larger models with similar amounts of truncation as smaller models, even though the larger models are of higher quality. However, truncation is not a satisfying solution because it comes at the cost of increased log loss. In theory, is it even possible to reduce entropy while preserving log loss? We prove that it is possible, if we assume access to a black box which can fit models to predict the future entropy of text.
TMLR Journal 2025 Journal Article
We examine the extent to which sublinear-sample property testing and estimation applies to settings where samples are independently but not identically distributed. Specifically, we consider the following distributional property testing framework: Suppose there is a set of distributions over a discrete support of size $k$, $p_1, p_2,\ldots,p_T$, and we obtain $c$ independent draws from each distribution. Suppose the goal is to learn or test a property of the average distribution, $p_{\mathrm{avg}}$. This setup models a number of important practical settings where the individual distributions correspond to heterogeneous entities --- either individuals, chronologically distinct time periods, spatially separated data sources, etc. From a learning standpoint, even with $c=1$ samples from each distribution, $\Theta(k/\varepsilon^2)$ samples are necessary and sufficient to learn $p_{\mathrm{avg}}$ to within error $\varepsilon$ in $\ell_1$ distance. To test uniformity or identity --- distinguishing the case that $p_{\mathrm{avg}}$ is equal to some reference distribution, versus has $\ell_1$ distance at least $\varepsilon$ from the reference distribution, we show that a linear number of samples in $k$ is necessary given $c=1$ samples from each distribution. In contrast, for $c \ge 2$, we recover the usual sublinear sample testing guarantees of the i.i.d. setting: we show that $O(\sqrt{k}/\varepsilon^2 + 1/\varepsilon^4)$ total samples are sufficient, matching the optimal sample complexity in the i.i.d. case in the regime where $\varepsilon \ge k^{-1/4}$. Additionally, we show that in the $c=2$ case, there is a constant $\rho > 0$ such that even in the linear regime with $\rho k$ samples, no tester that considers the multiset of samples (ignoring which samples were drawn from the same $p_i$) can perform uniformity testing. We further extend our techniques to the problem of testing ''closeness'' of two distributions: given $c=3$ independent draws from each of $p_1, p_2,\ldots,p_T$ and $q_1, q_2,\ldots,q_T$, one can distinguish the case that $p_{\mathrm{avg}}=q_{\mathrm{avg}}$ versus having $\ell_1$ distance at least $\varepsilon$ using $O(k^{2/3}/\varepsilon^{8/3})$ total samples, where $k$ is an upper bound on the support size, matching the optimal sample complexity of the i.i.d. setting up to the $\varepsilon$-dependence.
STOC Conference 2024 Conference Paper
IJCAI Conference 2023 Conference Paper
Minimizing a convex function with access to a first order oracle---that returns the function evaluation and (sub)gradient at a query point---is a canonical optimization problem and a fundamental primitive in machine learning. Gradient-based methods are the most popular approaches used for solving the problem, owing to their simplicity and computational efficiency. These methods, however, do not achieve the information-theoretically optimal query complexity for minimizing the underlying function to small error, which are achieved by more expensive techniques based on cutting-plane methods. Is it possible to achieve the information-theoretically query complexity without using these more complex and computationally expensive methods? In this work, we use memory as a lens to understand this, and show that is is not possible to achieve optimal query complexity without using significantly more memory than that used by gradient descent.
NeurIPS Conference 2023 Conference Paper
Token embeddings, a mapping from discrete lexical symbols to continuous vectors, are at the heart of any language model (LM). However, lexical symbol meanings can also be determined and even redefined by their structural role in a long context. In this paper, we ask: is it possible for a language model to be performant without \emph{any} fixed token embeddings? Such a language model would have to rely entirely on the co-occurence and repetition of tokens in the context rather than the \textit{a priori} identity of any token. To answer this, we study \textit{lexinvariant}language models that are invariant to lexical symbols and therefore do not need fixed token embeddings in practice. First, we prove that we can construct a lexinvariant LM to converge to the true language model at a uniform rate that is polynomial in terms of the context length, with a constant factor that is sublinear in the vocabulary size. Second, to build a lexinvariant LM, we simply encode tokens using random Gaussian vectors, such that each token maps to the same representation within each sequence but different representations across sequences. Empirically, we demonstrate that it can indeed attain perplexity comparable to that of a standard language model, given a sufficiently long context. We further explore two properties of the lexinvariant language models: First, given text generated from a substitution cipher of English, it implicitly implements Bayesian in-context deciphering and infers the mapping to the underlying real tokens with high accuracy. Second, it has on average 4X better accuracy over synthetic in-context reasoning tasks. Finally, we discuss regularizing standard language models towards lexinvariance and potential practical applications.
ICML Conference 2023 Conference Paper
Given only a few observed entries from a low-rank matrix $X$, matrix completion is the problem of imputing the missing entries, and it formalizes a wide range of real-world settings that involve estimating missing data. However, when there are too few observed entries to complete the matrix, what other aspects of the underlying matrix can be reliably recovered? We study one such problem setting, that of “one-sided” matrix completion, where our goal is to recover the right singular vectors of $X$, even in the regime where recovering the left singular vectors is impossible, which arises when there are more rows than columns and very few observations. We propose a natural algorithm that involves imputing the missing values of the matrix $X^TX$ and show that even with only two observations per row in $X$, we can provably recover $X^TX$ as long as we have at least $\Omega(r^2 d \log d)$ rows, where $r$ is the rank and $d$ is the number of columns. We evaluate our algorithm on one-sided recovery of synthetic data and low-coverage genome sequencing. In these settings, our algorithm substantially outperforms standard matrix completion and a variety of direct factorization methods.
NeurIPS Conference 2022 Conference Paper
In-context learning is the ability of a model to condition on a prompt sequence consisting of in-context examples (input-output pairs corresponding to some task) along with a new query input, and generate the corresponding output. Crucially, in-context learning happens only at inference time without any parameter updates to the model. While large language models such as GPT-3 exhibit some ability to perform in-context learning, it is unclear what the relationship is between tasks on which this succeeds and what is present in the training data. To investigate this, we consider the problem of training a model to in-context learn a function class (e. g. , linear functions): given data derived from some functions in the class, can we train a model (e. g. , a Transformer) to in-context learn most functions from that class? We show empirically that standard Transformers can be trained from scratch to perform in-context learning of linear functions---that is, the trained model is able to learn unseen linear functions from in-context examples with performance comparable to the optimal least squares estimator. In fact, in-context learning is possible even under two forms of distribution shift: (i) between the training data of the Transformer and inference-time prompts, and (ii) between the in-context examples and the query input during inference. We also show that we can train Transformers to in-context learn more complex function classes: sparse linear functions where the model outperforms least squares and nearly matches the performance of Lasso, and two-layer neural networks where the model performs comparably to neural networks trained on in-context examples using gradient descent.
ICML Conference 2021 Conference Paper
Self-training is a standard approach to semi-supervised learning where the learner’s own predictions on unlabeled data are used as supervision during training. In this paper, we reinterpret this label assignment process as an optimal transportation problem between examples and classes, wherein the cost of assigning an example to a class is mediated by the current predictions of the classifier. This formulation facilitates a practical annealing strategy for label assignment and allows for the inclusion of prior knowledge on class proportions via flexible upper bound constraints. The solutions to these assignment problems can be efficiently approximated using Sinkhorn iteration, thus enabling their use in the inner loop of standard stochastic optimization algorithms. We demonstrate the effectiveness of our algorithm on the CIFAR-10, CIFAR-100, and SVHN datasets in comparison with FixMatch, a state-of-the-art self-training algorithm.
STOC Conference 2021 Conference Paper
ICML Conference 2020 Conference Paper
Data augmentation is a powerful technique to improve performance in applications such as image and text classification tasks. Yet, there is little rigorous understanding of why and how various augmentations work. In this work, we consider a family of linear transformations and study their effects on the ridge estimator in an over-parametrized linear regression setting. First, we show that transformations which preserve the labels of the data can improve estimation by enlarging the span of the training data. Second, we show that transformations which mix data can improve estimation by playing a regularization effect. Finally, we validate our theoretical insights on MNIST. Based on the insights, we propose an augmentation scheme that searches over the space of transformations by how \emph{uncertain} the model is about the transformed data. We validate our proposed scheme on image and text datasets. For example, our method outperforms RandAugment by 1. 24% on CIFAR-100 using Wide-ResNet-28-10. Furthermore, we achieve comparable accuracy to the SoTA Adversarial AutoAugment on CIFAR datasets.
ICML Conference 2020 Conference Paper
Given data drawn from an unknown distribution, D, to what extent is it possible to “amplify” this dataset and faithfully output an even larger set of samples that appear to have been drawn from D? We formalize this question as follows: an (n, m) amplification procedure takes as input n independent draws from an unknown distribution D, and outputs a set of m > n “samples” which must be indistinguishable from m samples drawn iid from D. We consider this sample amplification problem in two fundamental settings: the case where D is an arbitrary discrete distribution supported on k elements, and the case where D is a d-dimensional Gaussian with unknown mean, and fixed covariance matrix. Perhaps surprisingly, we show a valid amplification procedure exists for both of these settings, even in the regime where the size of the input dataset, n, is significantly less than what would be necessary to learn distribution D to non-trivial accuracy. We also show that our procedures are optimal up to constant factors. Beyond these results, we describe potential applications of such data amplification, and formalize a number of curious directions for future research along this vein.
NeurIPS Conference 2020 Conference Paper
We introduce a framework for statistical estimation that leverages knowledge of how samples are collected but makes no distributional assumptions on the data values. Specifically, we consider a population of elements [n]={1, .. ., n} with corresponding data values x 1, .. ., x n. We observe the values for a "sample" set A \subset [n] and wish to estimate some statistic of the values for a "target" set B \subset [n] where B could be the entire set. Crucially, we assume that the sets A and B are drawn according to some known distribution P over pairs of subsets of [n]. A given estimation algorithm is evaluated based on its "worst-case, expected error" where the expectation is with respect to the distribution P from which the sample A and target sets B are drawn, and the worst-case is with respect to the data values x 1, .. ., x n. Within this framework, we give an efficient algorithm for estimating the target mean that returns a weighted combination of the sample values–-where the weights are functions of the distribution P and the sample and target sets A, B--and show that the worst-case expected error achieved by this algorithm is at most a multiplicative pi/2 factor worse than the optimal of such algorithms. The algorithm and proof leverage a surprising connection to the Grothendieck problem. We also extend these results to the linear regression setting where each datapoint is not a scalar but a labeled vector (x i, y i). This framework, which makes no distributional assumptions on the data values but rather relies on knowledge of the data collection process via the distribution P, is a significant departure from the typical statistical estimation framework and introduces a uniform analysis for the many natural settings where membership in a sample may be correlated with data values, such as when individuals are recruited into a sample through their social networks as in "snowball/chain" sampling or when samples have chronological structure as in "selective prediction".
NeurIPS Conference 2019 Conference Paper
We consider the problem of computing the maximum likelihood multivariate log-concave distribution for a set of points. Specifically, we present an algorithm which, given $n$ points in $\mathbb{R}^d$ and an accuracy parameter $\eps>0$, runs in time $\poly(n, d, 1/\eps), $ and returns a log-concave distribution which, with high probability, has the property that the likelihood of the $n$ points under the returned distribution is at most an additive $\eps$ less than the maximum likelihood that could be achieved via any log-concave distribution. This is the first computationally efficient (polynomial time) algorithm for this fundamental and practically important task. Our algorithm rests on a novel connection with exponential families: the maximum likelihood log-concave distribution belongs to a class of structured distributions which, while not an exponential family, ``locally'' possesses key properties of exponential families. This connection then allows the problem of computing the log-concave maximum likelihood distribution to be formulated as a convex optimization problem, and solved via an approximate first-order method. Efficiently approximating the (sub) gradients of the objective function of this optimization problem is quite delicate, and is the main technical challenge in this work.
ICML Conference 2019 Conference Paper
What learning algorithms can be run directly on compressively-sensed data? In this work, we consider the question of accurately and efficiently computing low-rank matrix or tensor factorizations given data compressed via random projections. We examine the approach of first performing factorization in the compressed domain, and then reconstructing the original high-dimensional factors from the recovered (compressed) factors. In both the matrix and tensor settings, we establish conditions under which this natural approach will provably recover the original factors. While it is well-known that random projections preserve a number of geometric properties of a dataset, our work can be viewed as showing that they can also preserve certain solutions of non-convex, NP-Hard problems like non-negative matrix factorization. We support these theoretical results with experiments on synthetic data and demonstrate the practical applicability of compressed factorization on real-world gene expression and EEG time series datasets.
ICML Conference 2019 Conference Paper
How can prior knowledge on the transformation invariances of a domain be incorporated into the architecture of a neural network? We propose Equivariant Transformers (ETs), a family of differentiable image-to-image mappings that improve the robustness of models towards pre-defined continuous transformation groups. Through the use of specially-derived canonical coordinate systems, ETs incorporate functions that are equivariant by construction with respect to these transformations. We show empirically that ETs can be flexibly composed to improve model robustness towards more complicated transformation groups in several parameters. On a real-world image classification task, ETs improve the sample efficiency of ResNet classifiers, achieving relative improvements in error rate of up to 15% in the limited data regime while increasing model parameter count by less than 1%.
NeurIPS Conference 2019 Conference Paper
Intense recent discussions have focused on how to provide individuals with control over when their data can and cannot be used --- the EU’s Right To Be Forgotten regulation is an example of this effort. In this paper we initiate a framework studying what to do when it is no longer permissible to deploy models derivative from specific user data. In particular, we formulate the problem of efficiently deleting individual data points from trained machine learning models. For many standard ML models, the only way to completely remove an individual's data is to retrain the whole model from scratch on the remaining data, which is often not computationally practical. We investigate algorithmic principles that enable efficient data deletion in ML. For the specific setting of $k$-means clustering, we propose two provably deletion efficient algorithms which achieve an average of over $100\times$ improvement in deletion efficiency across 6 datasets, while producing clusters of comparable statistical quality to a canonical $k$-means++ baseline.
ICML Conference 2019 Conference Paper
Consider a setting with $N$ independent individuals, each with an unknown parameter, $p_i \in [0, 1]$ drawn from some unknown distribution $P^\star$. After observing the outcomes of $t$ independent Bernoulli trials, i. e. , $X_i \sim \text{Binomial}(t, p_i)$ per individual, our objective is to accurately estimate $P^\star$ in the sparse regime, namely when $t \ll N$. This problem arises in numerous domains, including the social sciences, psychology, health-care, and biology, where the size of the population under study is usually large yet the number of observations per individual is often limited. Our main result shows that, in this sparse regime where $t \ll N$, the maximum likelihood estimator (MLE) is both statistically minimax optimal and efficiently computable. Precisely, for sufficiently large $N$, the MLE achieves the information theoretic optimal error bound of $\mathcal{O}(\frac{1}{t})$ for $t < c\log{N}$, with regards to the earth mover’s distance (between the estimated and true distributions). More generally, in an exponentially large interval of $t$ beyond $c \log{N}$, the MLE achieves the minimax error bound of $\mathcal{O}(\frac{1}{\sqrt{t\log N}})$. In contrast, regardless of how large $N$ is, the naive "plug-in" estimator for this problem only achieves the sub-optimal error of $\Theta(\frac{1}{\sqrt{t}})$. Empirically, we also demonstrate the MLE performs well on both synthetic as well as real datasets.
STOC Conference 2019 Conference Paper
We consider the problem of performing linear regression over a stream of d -dimensional examples, and show that any algorithm that uses a subquadratic amount of memory exhibits a slower rate of convergence than can be achieved without memory constraints. Specifically, consider a sequence of labeled examples ( a 1 , b 1 ), ( a 2 , b 2 )…, with a i drawn independently from a d -dimensional isotropic Gaussian, and where b i = ⟨ a i , x ⟩ + η i , for a fixed x ∈ ℝ d with || x || 2 = 1 and with independent noise η i drawn uniformly from the interval [−2 − d /5 ,2 − d /5 ]. We show that any algorithm with at most d 2 /4 bits of memory requires at least Ω( d loglog1/є) samples to approximate x to ℓ 2 error є with probability of success at least 2/3, for є sufficiently small as a function of d . In contrast, for such є, x can be recovered to error є with probability 1− o (1) with memory O ( d 2 log(1/є)) using d examples. This represents the first nontrivial lower bounds for regression with super-linear memory, and may open the door for strong memory/sample tradeoffs for continuous optimization.
NeurIPS Conference 2018 Conference Paper
Given the apparent difficulty of learning models that are robust to adversarial perturbations, we propose tackling the simpler problem of developing adversarially robust features. Specifically, given a dataset and metric of interest, the goal is to return a function (or multiple functions) that 1) is robust to adversarial perturbations, and 2) has significant variation across the datapoints. We establish strong connections between adversarially robust features and a natural spectral property of the geometry of the dataset and metric of interest. This connection can be leveraged to provide both robust features, and a lower bound on the robustness of any function that has significant variance across the dataset. Finally, we provide empirical evidence that the adversarially robust features given by this spectral approach can be fruitfully leveraged to learn a robust (and accurate) model.
NeurIPS Conference 2018 Conference Paper
We consider the problem of estimating how well a model class is capable of fitting a distribution of labeled data. We show that it is often possible to accurately estimate this ``learnability'' even when given an amount of data that is too small to reliably learn any accurate model. Our first result applies to the setting where the data is drawn from a $d$-dimensional distribution with isotropic covariance, and the label of each datapoint is an arbitrary noisy function of the datapoint. In this setting, we show that with $O(\sqrt{d})$ samples, one can accurately estimate the fraction of the variance of the label that can be explained via the best linear function of the data. We extend these techniques to a binary classification, and show that the prediction error of the best linear classifier can be accurately estimated given $O(\sqrt{d})$ labeled samples. For comparison, in both the linear regression and binary classification settings, even if there is no noise in the labels, a sample size linear in the dimension, $d$, is required to \emph{learn} any function correlated with the underlying model. We further extend our estimation approach to the setting where the data distribution has an (unknown) arbitrary covariance matrix, allowing these techniques to be applied to settings where the model class consists of a linear function applied to a nonlinear embedding of the data. We demonstrate the practical viability of our approaches on synthetic and real data. This ability to estimate the explanatory value of a set of features (or dataset), even in the regime in which there is too little data to realize that explanatory value, may be relevant to the scientific and industrial settings for which data collection is expensive and there are many potentially relevant feature sets that could be collected.
STOC Conference 2018 Conference Paper
We consider the problem of predicting the next observation given a sequence of past observations, and consider the extent to which accurate prediction requires complex algorithms that explicitly leverage long-range dependencies. Perhaps surprisingly, our positive results show that for a broad class of sequences, there is an algorithm that predicts well on average, and bases its predictions only on the most recent few observation together with a set of simple summary statistics of the past observations. Specifically, we show that for any distribution over observations, if the mutual information between past observations and future observations is upper bounded by I , then a simple Markov model over the most recent I /є observations obtains expected KL error є—and hence ℓ 1 error √є—with respect to the optimal predictor that has access to the entire past and knows the data generating distribution. For a Hidden Markov Model with n hidden states, I is bounded by log n , a quantity that does not depend on the mixing time, and we show that the trivial prediction algorithm based on the empirical frequencies of length O (log n /є) windows of observations achieves this error, provided the length of the sequence is d Ω(log n /є) , where d is the size of the observation alphabet. We also establish that this result cannot be improved upon, even for the class of HMMs, in the following two senses: First, for HMMs with n hidden states, a window length of log n /є is information-theoretically necessary to achieve expected KL error є, or ℓ 1 error √є. Second, the d Θ(log n /є) samples required to accurately estimate the Markov model when observations are drawn from an alphabet of size d is necessary for any computationally tractable learning/prediction algorithm, assuming the hardness of strongly refuting a certain class of CSPs.
ICML Conference 2017 Conference Paper
Given samples from a distribution, how many new elements should we expect to find if we keep on sampling this distribution? This is an important and actively studied problem, with many applications ranging from species estimation to genomics. We generalize this extrapolation and related unseen estimation problems to the multiple population setting, where population $j$ has an unknown distribution $D_j$ from which we observe $n_j$ samples. We derive an optimal estimator for the total number of elements we expect to find among new samples across the populations. Surprisingly, we prove that our estimator’s accuracy is independent of the number of populations. We also develop an efficient optimization algorithm to solve the more general problem of estimating multi-population frequency distributions. We validate our methods and theory through extensive experiments. Finally, on a real dataset of human genomes across multiple ancestries, we demonstrate how our approach for unseen estimation can enable cohort designs that can discover interesting mutations with greater efficiency.
STOC Conference 2017 Conference Paper
NeurIPS Conference 2017 Conference Paper
We study the basic problem of learning overcomplete HMMs---those that have many hidden states but a small output alphabet. Despite having significant practical importance, such HMMs are poorly understood with no known positive or negative results for efficient learning. In this paper, we present several new results---both positive and negative---which help define the boundaries between the tractable-learning setting and the intractable setting. We show positive results for a large subclass of HMMs whose transition matrices are sparse, well-conditioned and have small probability mass on short cycles. We also show that learning is impossible given only a polynomial number of samples for HMMs with a small output alphabet and whose transition matrices are random regular graphs with large degree. We also discuss these results in the context of learning HMMs which can capture long-term dependencies.
NeurIPS Conference 2017 Conference Paper
Consider the following estimation problem: there are $n$ entities, each with an unknown parameter $p_i \in [0, 1]$, and we observe $n$ independent random variables, $X_1, \ldots, X_n$, with $X_i \sim $ Binomial$(t, p_i)$. How accurately can one recover the ``histogram'' (i. e. cumulative density function) of the $p_i$'s? While the empirical estimates would recover the histogram to earth mover distance $\Theta(\frac{1}{\sqrt{t}})$ (equivalently, $\ell_1$ distance between the CDFs), we show that, provided $n$ is sufficiently large, we can achieve error $O(\frac{1}{t})$ which is information theoretically optimal. We also extend our results to the multi-dimensional parameter case, capturing settings where each member of the population has multiple associated parameters. Beyond the theoretical results, we demonstrate that the recovery algorithm performs well in practice on a variety of datasets, providing illuminating insights into several domains, including politics, sports analytics, and variation in the gender ratio of offspring.
ICML Conference 2017 Conference Paper
The popular Alternating Least Squares (ALS) algorithm for tensor decomposition is efficient and easy to implement, but often converges to poor local optima—particularly when the weights of the factors are non-uniform. We propose a modification of the ALS approach that is as efficient as standard ALS, but provably recovers the true factors with random initialization under standard incoherence assumptions on the factors of the tensor. We demonstrate the significant practical superiority of our approach over traditional ALS for a variety of tasks on synthetic data—including tensor factorization on exact, noisy and over-complete tensors, as well as tensor completion—and for computing word embeddings from a third-order word tri-occurrence tensor.
NeurIPS Conference 2016 Conference Paper
We consider a crowdsourcing model in which n workers are asked to rate the quality of n items previously generated by other workers. An unknown set of $\alpha n$ workers generate reliable ratings, while the remaining workers may behave arbitrarily and possibly adversarially. The manager of the experiment can also manually evaluate the quality of a small number of items, and wishes to curate together almost all of the high-quality items with at most an fraction of low-quality items. Perhaps surprisingly, we show that this is possible with an amount of work required of the manager, and each worker, that does not scale with n: the dataset can be curated with $\tilde{O}(1/\beta\alpha\epsilon^4)$ ratings per worker, and $\tilde{O}(1/\beta\epsilon^2)$ ratings by the manager, where $\beta$ is the fraction of high-quality items. Our results extend to the more general setting of peer prediction, including peer grading in online classrooms.
STOC Conference 2016 Conference Paper
We consider the following basic learning task: given independent draws from an unknown distribution over a discrete support, output an approximation of the distribution that is as accurate as possible in L1 distance (equivalently, total variation distance, or "statistical distance"). Perhaps surprisingly, it is often possible to "de-noise" the empirical distribution of the samples to return an approximation of the true distribution that is significantly more accurate than the empirical distribution, without relying on any prior assumptions on the distribution. We present an instance optimal learning algorithm which optimally performs this de-noising for every distribution for which such a de-noising is possible. More formally, given n independent draws from a distribution p, our algorithm returns a labelled vector whose expected distance from p is equal to the minimum possible expected error that could be obtained by any algorithm, even one that is given the true unlabeled vector of probabilities of distribution p and simply needs to assign labels---up to an additive subconstant term that is independent of p and goes to zero as n gets large. This somewhat surprising result has several conceptual implications, including the fact that, for any large sample from a distribution over discrete support, prior knowledge of the rates of decay of the tails of the distribution (e.g. power-law type assumptions) is not significantly helpful for the task of learning the distribution. As a consequence of our techniques, we also show that given a set of n samples from an arbitrary distribution, one can accurately estimate the expected number of distinct elements that will be observed in a sample of any size up to n log n. This sort of extrapolation is practically relevant, particularly to domains such as genomics where it is important to understand how much more might be discovered given larger sample sizes, and we are optimistic that our approach is practically viable.
NeurIPS Conference 2015 Conference Paper
We consider the problem of testing whether two unequal-sized samples were drawn from identical distributions, versus distributions that differ significantly. Specifically, given a target error parameter $\eps > 0$, $m_1$ independent draws from an unknown distribution $p$ with discrete support, and $m_2$ draws from an unknown distribution $q$ of discrete support, we describe a test for distinguishing the case that $p=q$ from the case that $||p-q||_1 \geq \eps$. If $p$ and $q$ are supported on at most $n$ elements, then our test is successful with high probability provided $m_1\geq n^{2/3}/\varepsilon^{4/3}$ and $m_2 = \Omega\left(\max\{\frac{n}{\sqrt m_1\varepsilon^2}, \frac{\sqrt n}{\varepsilon^2}\}\right). $ We show that this tradeoff is information theoretically optimal throughout this range, in the dependencies on all parameters, $n, m_1, $ and $\eps$, to constant factors. As a consequence, we obtain an algorithm for estimating the mixing time of a Markov chain on $n$ states up to a $\log n$ factor that uses $\tilde{O}(n^{3/2} \tau_{mix})$ queries to a ``next node'' oracle. The core of our testing algorithm is a relatively simple statistic that seems to perform well in practice, both on synthetic data and on natural language data. We believe that this statistic might prove to be a useful primitive within larger machine learning and natural language processing systems.
FOCS Conference 2014 Conference Paper
We consider the problem of verifying the identity of a distribution: Given the description of a distribution over a discrete support p = (p 1, p 2, .. ., p n ), how many samples (independent draws) must one obtain from an unknown distribution, q, to distinguish, with high probability, the case that p = q from the case that the total variation distance (L 1 distance) ||p - q||1≥ϵ? We resolve this question, up to constant factors, on an instance by instance basis: there exist universal constants c, c' and a function f(p, ϵ) on distributions and error parameters, such that our tester distinguishes p = q from ||p-q||1≥ ϵ using f(p, ϵ) samples with success probability > 2/3, but no tester can distinguish p = q from ||p - q||1≥ c · ϵ when given c' · f(p, ϵ) samples. The function f(p, ϵ) is upperbounded by a multiple of ||p||2/3/ϵ 2, but is more complicated, and is significantly smaller in some cases when p has many small domain elements, or a single large one. This result significantly generalizes and tightens previous results: since distributions of support at most n have L 2/3 norm bounded by √n, this result immediately shows that for such distributions, O(√n/ϵ 2 ) samples suffice, tightening the previous bound of O(√npolylog/n 4 ) for this class of distributions, and matching the (tight) known results for the case that p is the uniform distribution over support n. The analysis of our very simple testing algorithm involves several hairy inequalities. To facilitate this analysis, we give a complete characterization of a general class of inequalities- generalizing Cauchy-Schwarz, Holder's inequality, and the monotonicity of L p norms. Specifically, we characterize the set of sequences (a) i = a 1, .. ., ar, (b)i = b 1, .. ., br, (c)i = c 1, .. ., cr, for which it holds that for all finite sequences of positive numbers (x) j = x 1, .. . and (y) j = y 1, .. ., Π i=1 r (Σ j x a j i y i b i ) ci ≥1. For example, the standard Cauchy-Schwarz inequality corresponds to the sequences a = (1, 0, 1/2), b = (0, 1, 1/2), c = (1/2, 1/2, -1). Our characterization is of a non-traditional nature in that it uses linear programming to compute a derivation that may otherwise have to be sought throu. gh trial and error, by hand. We do not believe such a characterization has appeared in the literature, and hope its computational nature will be useful to others, and facilitate analyses like the one here.
ICML Conference 2014 Conference Paper
We study the effectiveness of learning low degree polynomials using neural networks by the gradient descent method. While neural networks have been shown to have great expressive power, and gradient descent has been widely used in practice for learning neural networks, few theoretical guarantees are known for such methods. In particular, it is well known that gradient descent can get stuck at local minima, even for simple classes of target functions. In this paper, we present several positive theoretical results to support the effectiveness of neural networks. We focus on two-layer neural networks (i. e. one hidden layer) where the top layer node is a linear function, similar to \citebarron93. First we show that for a randomly initialized neural network with sufficiently many hidden units, the gradient descent method can learn any low degree polynomial. Secondly, we show that if we use complex-valued weights (the target function can still be real), then under suitable conditions, there are no “robust local minima”: the neural network can always escape a local minimum by performing a random perturbation. This property does not hold for real-valued weights. Thirdly, we discuss whether sparse polynomials can be learned with \emphsmall neural networks, where the size is dependent on the sparsity of the target function.
SODA Conference 2014 Conference Paper
ICML Conference 2014 Conference Paper
This work provides simple algorithms for multi-class (and multi-label) prediction in settings where both the number of examples n and the data dimension d are relatively large. These robust and parameter free algorithms are essentially iterative least-squares updates and very versatile both in theory and in practice. On the theoretical front, we present several variants with convergence guarantees. Owing to their effective use of second-order structure, these algorithms are substantially better than first-order methods in many practical scenarios. On the empirical side, we show how to scale our approach to high dimensional datasets, achieving dramatic computational speedups over popular optimization packages such as Liblinear and Vowpal Wabbit on standard datasets (MNIST and CIFAR-10), while attaining state-of-the-art accuracies.
SODA Conference 2014 Conference Paper
We study the question of closeness testing for two discrete distributions. More precisely, given samples from two distributions p and q over an n -element set, we wish to distinguish whether p = q versus p is at least ∊ -far from q, in either ℓ 1 or ℓ 2 distance. Batu et al [BFR+00, BFR+13] gave the first sub-linear time algorithms for these problems, which matched the lower bounds of [Val11] up to a logarithmic factor in n, and a polynomial factor of ∊. In this work, we present simple testers for both the ℓ 1 and ℓ 2 settings, with sample complexity that is information-theoretically optimal, to constant factors, both in the dependence on n, and the dependence on ∊; for the ℓ 1 testing problem we establish that the sample complexity is Θ(max{ n 2/3 / ∊ 4/3, n 1/2 / ∊ 2 }).
FOCS Conference 2014 Conference Paper
We show that, if truth assignments on n variables reproduce through recombination so that satisfaction of a particular Boolean function confers a small evolutionary advantage, then a polynomially large population over polynomially many generations (polynomial in n and the inverse of the initial satisfaction probability) will end up almost certainly consisting exclusively of satisfying truth assignments. We argue that this theorem sheds light on the problem of the evolution of complex adaptations.
NeurIPS Conference 2013 Conference Paper
Recently, [Valiant and Valiant] showed that a class of distributional properties, which includes such practically relevant properties as entropy, the number of distinct elements, and distance metrics between pairs of distributions, can be estimated given a SUBLINEAR sized sample. Specifically, given a sample consisting of independent draws from any distribution over at most n distinct elements, these properties can be estimated accurately using a sample of size O(n / log n). We propose a novel modification of this approach and show: 1) theoretically, our estimator is optimal (to constant factors, over worst-case instances), and 2) in practice, it performs exceptionally well for a variety of estimation tasks, on a variety of natural distributions, for a wide range of parameters. Perhaps unsurprisingly, the key step in this approach is to first use the sample to characterize the unseen" portion of the distribution. This goes beyond such tools as the Good-Turing frequency estimation scheme, which estimates the total probability mass of the unobserved portion of the distribution: we seek to estimate the "shape"of the unobserved portion of the distribution. This approach is robust, general, and theoretically principled; we expect that it may be fruitfully used as a component within larger machine learning and data analysis systems. "
SODA Conference 2013 Conference Paper
We give highly efficient algorithms, and almost matching lower bounds, for a range of basic statistical problems that involve testing and estimating the L 1 (total variation) distance between two k -modal distributions p and q over the discrete domain {1, …, n }. More precisely, we consider the following four problems: given sample access to an unknown k -modal distribution p, T esting identity to a known or unknown distribution: 1. Determine whether p = q (for an explicitly given k -modal distribution q ) versus p is e-far from q; 2. Determine whether p = q (where q is available via sample access) versus p is ε-far from q; E stimating L 1 distance (“ tolerant testing ”) against a known or unknown distribution: 3. Approximate d TV ( p, q ) to within additive ε where q is an explicitly given k -modal distribution q; 4. Approximate d TV ( p, q ) to within additive ε where q is available via sample access. For each of these four problems we give sub-logarithmic sample algorithms, and show that our algorithms have optimal sample complexity up to additive poly ( k ) and multiplicative polylog log n + polylog k factors. Our algorithms significantly improve the previous results of [BKR04], which were for testing identity of distributions (items (1) and (2) above) in the special cases k = 0 (monotone distributions) and k = 1 (unimodal distributions) and required O ((log n ) 3 ) samples. As our main conceptual contribution, we introduce a new reduction-based approach for distribution-testing problems that lets us obtain all the above results in a unified way. Roughly speaking, this approach enables us to transform various distribution testing problems for k -modal distributions over {1, …, n } to the corresponding distribution testing problems for unrestricted distributions over a much smaller domain {1, …, ℓ} where ℓ = O ( k log n ).
FOCS Conference 2012 Conference Paper
Given a set of n d-dimensional Boolean vectors with the promise that the vectors are chosen uniformly at random with the exception of two vectors that have Pearson-correlation ρ (Hamming distance d · 1-ρ/2), how quickly can one find the two correlated vectors? We present an algorithm which, for any constants ε, ρ >; 0 and d >; >; logn/ρ 2, finds the correlated pair with high probability, and runs in time O(n 3ω/4 + ϵ) 1. 8 ), where w 2-O (ρ)), given by Paturi et al. [15], Locality Sensitive Hashing (LSH) [11] and the Bucketing Codes approach [6]. Applications and extensions of this basic algorithm yield improved algorithms for several other problems: ApproximateClosest Pair: For any sufficiently small constant ϵ >; 0, given n vectors in R d, our algorithm returns a pair of vectors whose Euclidean distance differs from that of the closest pair by a factor of at most 1+ϵ, and runs in time O(n 2-Θ(√ϵ) ). The best previous algorithms (including LSH) have runtime O(n 2-O(ϵ) ). Learning Sparse Parity with Noise: Given samples from an instance of the learning parity with noise problem where each example has length n, the true parity set has size at most k k poly(1/1-2η) 0. 8k poly(1/1-2η). This is the first algorithm with no depenJence on η in the exponent of n, aside from the trivial brute-force algorithm. Learning k-Juntas with Noise: Given uniformly random length n Boolean vectors, together with a label, which is some function of just k k poly(1/1-2η) 0. 8k poly(1/1-2η), 2 which improves on the previous best of >; n k (1-2/2k)poly( 1/1-2η ), from [8]. Learning k-Juntas without Noise: 1 Our results for learning sparse parities with noise imply an algorithm for learning juntas without noise with runtime n ω+ϵ/4 k poly(n) 0. 6 kpoly(n), which improves on the runtime n ω+1/ω poly(n) ≈ n 0. 7k poly(n) of Mossel n et al. [13].
STOC Conference 2011 Conference Paper
FOCS Conference 2011 Conference Paper
For a broad class of practically relevant distribution properties, which includes entropy and support size, nearly all of the proposed estimators have an especially simple form. Given a set of independent samples from a discrete distribution, these estimators tally the vector of summary statistics -- the number of domain elements seen once, twice, etc. in the sample -- and output the dot product between these summary statistics, and a fixed vector of coefficients. We term such estimators linear. This historical proclivity towards linear estimators is slightly perplexing, since, despite many efforts over nearly 60 years, all proposed such estimators have significantly sub optimal convergence, compared to the bounds shown in [26], [27]. Our main result, in some sense vindicating this insistence on linear estimators, is that for any property in this broad class, there exists a near-optimal linear estimator. Additionally, we give a practical and polynomial-time algorithm for constructing such estimators for any given parameters. While this result does not yield explicit bounds on the sample complexities of these estimation tasks, we leverage the insights provided by this result to give explicit constructions of near-optimal linear estimators for three properties: entropy, L 1 distance to uniformity, and for pairs of distributions, L 1 distance. Our entropy estimator, when given O(n/∈log n) independent samples from a distribution of support at most n, will estimate the entropy of the distribution to within additive accuracy ∈, with probability of failure o(1/poly(n)). From the recent lower bounds given in [26], [27], this estimator is optimal, to constant factor, both in its dependence on n, and its dependence on ∈. In particular, the inverse-linear convergence rate of this estimator resolves the main open question of [26], [28], which left open the possibility that the error decreased only with the square root of the number of samples. Our distance to uniformity estimator, when given O(m/∈ 2 log m) independent samples from any distribution, returns an ∈-accurate estimate of the L 1 distance to the uniform distribution of support m. This is constant-factor optimal, for constant ∈. Finally, our framework extends naturally to properties of pairs of distributions, including estimating the L 1 distance and KL-divergence between pairs of distributions. We give an explicit linear estimator for estimating L 1 distance to additive accuracy ∈ using O(n/∈ 2 log n) samples from each distribution, which is constant-factor optimal, for constant ∈. This is the first sub linear-sample estimator for this fundamental property.
STOC Conference 2010 Conference Paper
Given data drawn from a mixture of multivariate Gaussians, a basic problem is to accurately estimate the mixture parameters. We provide a polynomial-time algorithm for this problem for the case of two Gaussians in $n$ dimensions (even if they overlap), with provably minimal assumptions on the Gaussians, and polynomial data requirements. In statistical terms, our estimator converges at an inverse polynomial rate, and no such estimator (even exponential time) was known for this problem (even in one dimension). Our algorithm reduces the n-dimensional problem to the one-dimensional problem, where the method of moments is applied. One technical challenge is proving that noisy estimates of the first six moments of a univariate mixture suffice to recover accurate estimates of the mixture parameters, as conjectured by Pearson (1894), and in fact these estimates converge at an inverse polynomial rate. As a corollary, we can efficiently perform near-optimal clustering: in the case where the overlap between the Gaussians is small, one can accurately cluster the data, and when the Gaussians have partial overlap, one can still accurately cluster those data points which are not in the overlap region. A second consequence is a polynomial-time density estimation algorithm for arbitrary mixtures of two Gaussians, generalizing previous work on axis-aligned Gaussians (Feldman {\em et al}, 2006).
FOCS Conference 2010 Conference Paper
Given data drawn from a mixture of multivariate Gaussians, a basic problem is to accurately estimate the mixture parameters. We give an algorithm for this problem that has running time and data requirements polynomial in the dimension and the inverse of the desired accuracy, with provably minimal assumptions on the Gaussians. As a simple consequence of our learning algorithm, we we give the first polynomial time algorithm for proper density estimation for mixtures of k Gaussians that needs no assumptions on the mixture. It was open whether proper density estimation was even statistically possible (with no assumptions) given only polynomially many samples, let alone whether it could be computationally efficient. The building blocks of our algorithm are based on the work (Kalai et al, STOC 2010) that gives an efficient algorithm for learning mixtures of two Gaussians by considering a series of projections down to one dimension, and applying the method of moments to each univariate projection. A major technical hurdle in the previous work is showing that one can efficiently learn univariate mixtures of two Gaussians. In contrast, because pathological scenarios can arise when considering projections of mixtures of more than two Gaussians, the bulk of the work in this paper concerns how to leverage a weaker algorithm for learning univariate mixtures (of many Gaussians) to learn in high dimensions. Our algorithm employs hierarchical clustering and rescaling, together with methods for backtracking and recovering from the failures that can occur in our univariate algorithm. Finally, while the running time and data requirements of our algorithm depend exponentially on the number of Gaussians in the mixture, we prove that such a dependence is necessary.
SODA Conference 2009 Conference Paper
SODA Conference 2008 Conference Paper