Arrow Research search

Author name cluster

Paul Valiant

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

22 papers
2 author rows

Possible papers (22)

ICML Conference 2025 Conference Paper

All-Purpose Mean Estimation over R: Optimal Sub-Gaussianity with Outlier Robustness and Low Moments Performance

  • Jasper C. H. Lee
  • Walter McKelvie
  • Maoyuan Song
  • Paul Valiant

We consider the basic statistical challenge of designing an "all-purpose" mean estimation algorithm that is recommendable across a variety of settings and models. Recent work by [Lee and Valiant 2022] introduced the first 1-d mean estimator whose error in the standard finite-variance + i.i.d. setting is optimal even in its constant factors; experimental demonstration of its good performance was shown by [Gobet et al. 2022]. Yet, unlike for classic (but not necessarily practical) estimators such as median-of-means and trimmed mean, this new algorithm lacked proven robustness guarantees in other settings, including the settings of adversarial data corruption and heavy-tailed distributions with infinite variance. Such robustness is important for practical use cases. This raises a research question: is it possible to have a mean estimator that is robust, without sacrificing provably optimal performance in the standard i.i.d. setting? In this work, we show that Lee and Valiant's estimator is in fact an "all-purpose" mean estimator by proving: (A) It is robust to an $\eta$-fraction of data corruption, even in the strong contamination model; it has optimal estimation error $O(\sigma\sqrt{\eta})$ for distributions with variance $\sigma^2$. (B) For distributions with finite $z^\text{th}$ moment, for $z \in (1, 2)$, it has optimal estimation error, matching the lower bounds of [Devroye et al. 2016] up to constants. We further show (C) that outlier robustness for 1-d mean estimators in fact implies neighborhood optimality, a notion of beyond-worst-case and distribution-dependent optimality recently introduced by [Dang et al. 2023]. Previously, such an optimality guarantee was only known for median-of-means, but now it holds also for all estimators that are simultaneously robust and sub-Gaussian, including Lee and Valiant's, resolving a question raised by Dang et al.
Lastly, we show (D) the asymptotic normality and efficiency of Lee and Valiant’s estimator, as further evidence for its performance across many settings.
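
The classic median-of-means baseline that the abstract contrasts against can be sketched in a few lines. This is a minimal illustrative implementation of that textbook estimator only, not the Lee-Valiant estimator discussed in the paper:

```python
import random
import statistics

def median_of_means(xs, k):
    """Split the data into k groups, average each group, and return
    the median of the group means. With k ~ log(1/delta) groups this
    is robust to a few gross outliers, at the cost of suboptimal
    constants (which is what the Lee-Valiant estimator improves)."""
    random.shuffle(xs)  # guard against adversarial orderings of the data
    groups = [xs[i::k] for i in range(k)]
    return statistics.median(statistics.fmean(g) for g in groups)

# A single huge outlier corrupts only one of the k group means,
# so the median of the group means ignores it.
est = median_of_means([1.0] * 99 + [1000.0], 10)  # → 1.0
```

Note how the same data would pull the plain sample mean up to roughly 11.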

NeurIPS Conference 2023 Conference Paper

Minimax-Optimal Location Estimation

  • Shivam Gupta
  • Jasper Lee
  • Eric Price
  • Paul Valiant

Location estimation is one of the most basic questions in parametric statistics. Suppose we have a known distribution density $f$, and we get $n$ i.i.d. samples from $f(x-\mu)$ for some unknown shift $\mu$. The task is to estimate $\mu$ to high accuracy with high probability. The maximum likelihood estimator (MLE) is known to be asymptotically optimal as $n \to \infty$, but what is possible for finite $n$? In this paper, we give two location estimators that are optimal under different criteria: 1) an estimator that has minimax-optimal estimation error subject to succeeding with probability $1-\delta$ and 2) a confidence interval estimator which, subject to its output interval containing $\mu$ with probability at least $1-\delta$, has the minimum expected squared interval width among all shift-invariant estimators. The latter construction can be generalized to minimizing the expectation of any loss function on the interval width.
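
The MLE that this setup benchmarks against can be computed by brute force when the density is known. A minimal sketch, using a grid search and the Laplace density (for which the shift-MLE is the sample median); none of this is the paper's minimax construction:

```python
import math

def shift_mle(samples, log_f, grid):
    """Return the grid point mu maximizing sum_i log f(x_i - mu),
    i.e. a brute-force maximum-likelihood estimate of the shift."""
    return max(grid, key=lambda mu: sum(log_f(x - mu) for x in samples))

# Laplace noise: log f(t) = -|t| - log 2. Maximizing the likelihood
# minimizes sum |x_i - mu|, so the MLE is the sample median.
samples = [2.1, 2.9, 3.0, 3.2, 4.5]
grid = [i / 100 for i in range(0, 601)]
mu_hat = shift_mle(samples, lambda t: -abs(t) - math.log(2), grid)  # → 3.0
```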

NeurIPS Conference 2023 Conference Paper

Optimality in Mean Estimation: Beyond Worst-Case, Beyond Sub-Gaussian, and Beyond $1+\alpha$ Moments

  • Trung Dang
  • Jasper Lee
  • Maoyuan 'Raymond' Song
  • Paul Valiant

There is growing interest in improving our algorithmic understanding of fundamental statistical problems such as mean estimation, driven by the goal of understanding the fundamental limits of what we can extract from limited and valuable data. The state-of-the-art results for mean estimation in $\mathbb{R}$ are 1) the optimal sub-Gaussian mean estimator by [Lee and Valiant, 2022], attaining the optimal sub-Gaussian error constant for all distributions with finite but unknown variance, and 2) the analysis of the median-of-means algorithm by [Bubeck, Cesa-Bianchi and Lugosi, 2013] and a matching lower bound by [Devroye, Lerasle, Lugosi, and Oliveira, 2016], characterizing the big-O optimal errors for distributions that have tails heavy enough that only a $1+\alpha$ moment exists for some $\alpha \in (0, 1)$. Both of these results, however, are optimal only in the worst case. Motivated by the recent effort in the community to go "beyond the worst-case analysis" of algorithms, we initiate the fine-grained study of the mean estimation problem: Is it possible for algorithms to leverage *beneficial* features/quirks of their input distribution to *beat* the sub-Gaussian rate, without explicit knowledge of these features? We resolve this question, finding an unexpectedly nuanced answer: "Yes in limited regimes, but in general no". Given a distribution $p$, assuming *only* that it has a finite mean and absent any additional assumptions, we show how to construct a distribution $q_{n, \delta}$ such that the means of $p$ and $q$ are well-separated, yet $p$ and $q$ are impossible to distinguish with $n$ samples with probability $1-\delta$, and $q$ further preserves the finiteness of moments of $p$. Moreover, the variance of $q$ is at most twice the variance of $p$ if it exists.
The main consequence of our result is that no reasonable estimator can asymptotically achieve better than the sub-Gaussian error rate for any distribution, up to constant factors, which matches the worst-case result of [Lee and Valiant, 2022]. More generally, we introduce a new definitional framework to analyze the fine-grained optimality of algorithms, which we call "neighborhood optimality", interpolating between the unattainably strong "instance optimality" and the trivially weak admissibility/Pareto optimality definitions. As an application of the new framework, we show that the median-of-means algorithm is neighborhood optimal, up to constant factors. It is an open question to find a neighborhood-optimal estimator *without* constant factor slackness.

NeurIPS Conference 2022 Conference Paper

Finite-Sample Maximum Likelihood Estimation of Location

  • Shivam Gupta
  • Jasper Lee
  • Eric Price
  • Paul Valiant

We consider 1-dimensional location estimation, where we estimate a parameter $\lambda$ from $n$ samples $\lambda + \eta_i$, with each $\eta_i$ drawn i.i.d. from a known distribution $f$. For fixed $f$ the maximum-likelihood estimate (MLE) is well-known to be optimal in the limit as $n \to \infty$: it is asymptotically normal with variance matching the Cramér-Rao lower bound of $\frac{1}{n\mathcal{I}}$, where $\mathcal{I}$ is the Fisher information of $f$. However, this bound does not hold for finite $n$, or when $f$ varies with $n$. We show for arbitrary $f$ and $n$ that one can recover a similar theory based on the Fisher information of a smoothed version of $f$, where the smoothing radius decays with $n$.
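
The Fisher information appearing in the Cramér-Rao bound above is $\mathcal{I} = \int f'(x)^2/f(x)\,dx$, which is easy to approximate numerically. A sketch (the paper's smoothing of $f$ is omitted here; this is just the classical quantity, checked against the standard normal, whose location Fisher information is exactly 1):

```python
import math

def fisher_information(f, lo, hi, steps=200000):
    """Numerically approximate I = integral of f'(x)^2 / f(x) dx for a
    smooth density f, via central differences and a Riemann sum."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(1, steps):
        x = lo + i * h
        fp = (f(x + h) - f(x - h)) / (2 * h)  # central-difference f'(x)
        fx = f(x)
        if fx > 1e-300:  # avoid dividing by underflowed tail values
            total += fp * fp / fx * h
    return total

gauss = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
# For the standard normal, fisher_information(gauss, -10, 10) ≈ 1,
# matching the Cramér-Rao variance 1/(n*I) = 1/n for the sample mean.
```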

FOCS Conference 2021 Conference Paper

Optimal Sub-Gaussian Mean Estimation in $\mathbb{R}$

  • Jasper C. H. Lee
  • Paul Valiant

We settle the fundamental problem of estimating the mean of a real-valued distribution in the high probability regime, under the minimal (and essentially necessary) assumption that the distribution has finite but unknown variance: we propose an estimator with convergence tight up to a $1+o(1)$ factor. Crucially, in contrast to prior works, our estimator does not require prior knowledge of the variance, and works across the entire gamut of distributions with finite variance, including those without any higher moments. Parameterized by the sample size $n$, the failure probability $\delta$, and the variance $\sigma^{2}$, our estimator has additive accuracy within $\sigma\cdot(1+o(1))\sqrt{\frac{2\log\frac{1}{\delta}}{n}}$, which is optimal up to the $1+o(1)$ term. This asymptotically matches the convergence of the sample mean for the Gaussian distribution with the same variance. Our estimator construction and analysis give a framework generalizable to other problems, tightly analyzing a sum of dependent random variables by viewing the sum implicitly as a 2-parameter $\psi$-estimator, and constructing bounds using mathematical programming and duality techniques.
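
The accuracy guarantee in the abstract has a simple leading term that is easy to evaluate. A sketch that computes $\sigma\sqrt{2\log(1/\delta)/n}$, dropping the $1+o(1)$ factor (so this is the target rate, not the estimator itself):

```python
import math

def subgaussian_error(sigma, n, delta):
    """Leading term of the optimal sub-Gaussian accuracy guarantee:
    sigma * sqrt(2 * log(1/delta) / n), ignoring the 1+o(1) factor."""
    return sigma * math.sqrt(2 * math.log(1 / delta) / n)

# e.g. sigma=1, n=10000, delta=0.01 gives an error radius of about 0.030;
# quadrupling n halves the radius, as the sqrt(1/n) rate dictates.
```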

SODA Conference 2021 Conference Paper

Uncertainty about Uncertainty: Optimal Adaptive Algorithms for Estimating Mixtures of Unknown Coins

  • Jasper C. H. Lee
  • Paul Valiant

Given a mixture between two populations of coins, "positive" coins that each have bias ≥ ½ + Δ (unknown and potentially different across coins) and "negative" coins with bias ≤ ½ − Δ, we consider the task of estimating the fraction ρ of positive coins to within additive error ε. We achieve matching upper and lower bounds on the number of samples needed for a 1 − δ probability of success, where crucially, our lower bound applies to all fully-adaptive algorithms. Thus, our sample complexity bounds have tight dependence on every relevant problem parameter. A crucial component of our lower bound proof is a decomposition lemma (Lemma 5.2) showing how to assemble partially-adaptive bounds into a fully-adaptive bound, which may be of independent interest: though we invoke it for the special case of Bernoulli random variables (coins), it applies to general distributions. We present simulation results to demonstrate the practical efficacy of our approach for realistic problem parameters for crowdsourcing applications, focusing on the "rare events" regime where ρ is small. The fine-grained adaptive flavor of both our algorithm and lower bound contrasts with much previous work in distributional testing and learning.
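
The problem setup is easy to simulate. Below is a deliberately naive non-adaptive baseline (flip every coin a fixed number of times and classify by majority vote), included only to make the task concrete; the paper's contribution is precisely that adaptive schemes beat this kind of fixed-budget approach:

```python
import random

def naive_rho_estimate(coin_biases, flips_per_coin, rng):
    """Flip each coin a fixed number of times, classify it as 'positive'
    if a majority of flips land heads, and return the fraction classified
    positive. A non-adaptive baseline, not the paper's algorithm."""
    positive = 0
    for bias in coin_biases:
        heads = sum(rng.random() < bias for _ in range(flips_per_coin))
        if heads * 2 > flips_per_coin:
            positive += 1
    return positive / len(coin_biases)

rng = random.Random(0)
# rho = 0.3 of the coins have bias 0.9, the rest bias 0.1 (Delta = 0.4);
# with 15 flips per coin, misclassification is very unlikely.
coins = [0.9] * 300 + [0.1] * 700
estimate = naive_rho_estimate(coins, 15, rng)  # close to 0.3
```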

NeurIPS Conference 2020 Conference Paper

Worst-Case Analysis for Randomly Collected Data

  • Justin Chen
  • Gregory Valiant
  • Paul Valiant

We introduce a framework for statistical estimation that leverages knowledge of how samples are collected but makes no distributional assumptions on the data values. Specifically, we consider a population of elements $[n]=\{1,\ldots,n\}$ with corresponding data values $x_1,\ldots,x_n$. We observe the values for a "sample" set $A \subset [n]$ and wish to estimate some statistic of the values for a "target" set $B \subset [n]$, where $B$ could be the entire set. Crucially, we assume that the sets $A$ and $B$ are drawn according to some known distribution $P$ over pairs of subsets of $[n]$. A given estimation algorithm is evaluated based on its "worst-case, expected error", where the expectation is with respect to the distribution $P$ from which the sample set $A$ and target set $B$ are drawn, and the worst case is with respect to the data values $x_1,\ldots,x_n$. Within this framework, we give an efficient algorithm for estimating the target mean that returns a weighted combination of the sample values, where the weights are functions of the distribution $P$ and the sample and target sets $A, B$, and show that the worst-case expected error achieved by this algorithm is at most a multiplicative $\pi/2$ factor worse than the optimum over such algorithms. The algorithm and proof leverage a surprising connection to the Grothendieck problem. We also extend these results to the linear regression setting, where each datapoint is not a scalar but a labeled vector $(x_i, y_i)$.
This framework, which makes no distributional assumptions on the data values but rather relies on knowledge of the data collection process via the distribution P, is a significant departure from the typical statistical estimation framework and introduces a uniform analysis for the many natural settings where membership in a sample may be correlated with data values, such as when individuals are recruited into a sample through their social networks as in "snowball/chain" sampling or when samples have chronological structure as in "selective prediction".
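
A classical point of comparison for "weighted combination of the sample values" estimators is Horvitz-Thompson inverse-probability weighting, sketched below. To be clear, this is the textbook estimator, not the Grothendieck-based weighting scheme from the paper, and the inclusion probabilities here are an illustrative assumption:

```python
def horvitz_thompson_mean(sample, inclusion_prob, population_size):
    """Classical Horvitz-Thompson estimate of the population mean:
    each observed value is weighted by 1/pi_i, the probability that
    element i was included in the sample."""
    total = sum(x / inclusion_prob[i] for i, x in sample)
    return total / population_size

# Hypothetical design: elements 0,1 sampled with prob 0.5, elements
# 2,3 always sampled; suppose we observed elements 0, 2, and 3.
sample = [(0, 10.0), (2, 4.0), (3, 6.0)]
pi = {0: 0.5, 1: 0.5, 2: 1.0, 3: 1.0}
est = horvitz_thompson_mean(sample, pi, 4)  # (10/0.5 + 4 + 6) / 4 = 7.5
```

The 1/π weighting makes the estimate unbiased over the sampling design, which is the same "expectation over P, worst case over values" flavor the framework formalizes.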

STOC Conference 2016 Conference Paper

Instance optimal learning of discrete distributions

  • Gregory Valiant
  • Paul Valiant

We consider the following basic learning task: given independent draws from an unknown distribution over a discrete support, output an approximation of the distribution that is as accurate as possible in L1 distance (equivalently, total variation distance, or "statistical distance"). Perhaps surprisingly, it is often possible to "de-noise" the empirical distribution of the samples to return an approximation of the true distribution that is significantly more accurate than the empirical distribution, without relying on any prior assumptions on the distribution. We present an instance optimal learning algorithm which optimally performs this de-noising for every distribution for which such a de-noising is possible. More formally, given n independent draws from a distribution p, our algorithm returns a labelled vector whose expected distance from p is equal to the minimum possible expected error that could be obtained by any algorithm, even one that is given the true unlabeled vector of probabilities of distribution p and simply needs to assign labels---up to an additive subconstant term that is independent of p and goes to zero as n gets large. This somewhat surprising result has several conceptual implications, including the fact that, for any large sample from a distribution over discrete support, prior knowledge of the rates of decay of the tails of the distribution (e.g. power-law type assumptions) is not significantly helpful for the task of learning the distribution. As a consequence of our techniques, we also show that given a set of n samples from an arbitrary distribution, one can accurately estimate the expected number of distinct elements that will be observed in a sample of any size up to n log n. This sort of extrapolation is practically relevant, particularly to domains such as genomics where it is important to understand how much more might be discovered given larger sample sizes, and we are optimistic that our approach is practically viable.

FOCS Conference 2016 Conference Paper

Optimizing Star-Convex Functions

  • Jasper C. H. Lee
  • Paul Valiant

Star-convexity is a significant relaxation of the notion of convexity that allows for functions that do not have (sub)gradients at most points, and may even be discontinuous everywhere except at the global optimum. We introduce a polynomial-time algorithm for optimizing the class of star-convex functions, under no Lipschitz or other smoothness assumptions whatsoever, and no restrictions except exponential boundedness on a region about the origin, and Lebesgue measurability. The algorithm's performance is polynomial in the requested number of digits of accuracy and the dimension of the search domain. This contrasts with the previous best known algorithm of Nesterov and Polyak, which has exponential dependence on the number of digits of accuracy, but only $n^\omega$ dependence on the dimension $n$ (where $\omega$ is the matrix multiplication exponent), and which further requires Lipschitz second differentiability of the function [1].

FOCS Conference 2014 Conference Paper

An Automatic Inequality Prover and Instance Optimal Identity Testing

  • Gregory Valiant
  • Paul Valiant

We consider the problem of verifying the identity of a distribution: given the description of a distribution over a discrete support $p = (p_1, p_2, \ldots, p_n)$, how many samples (independent draws) must one obtain from an unknown distribution, $q$, to distinguish, with high probability, the case that $p = q$ from the case that the total variation distance ($L_1$ distance) $\|p-q\|_1 \ge \epsilon$? We resolve this question, up to constant factors, on an instance-by-instance basis: there exist universal constants $c, c'$ and a function $f(p, \epsilon)$ on distributions and error parameters, such that our tester distinguishes $p = q$ from $\|p-q\|_1 \ge \epsilon$ using $f(p, \epsilon)$ samples with success probability $> 2/3$, but no tester can distinguish $p = q$ from $\|p-q\|_1 \ge c \cdot \epsilon$ when given $c' \cdot f(p, \epsilon)$ samples. The function $f(p, \epsilon)$ is upper-bounded by a multiple of $\|p\|_{2/3}/\epsilon^2$, but is more complicated, and is significantly smaller in some cases when $p$ has many small domain elements, or a single large one. This result significantly generalizes and tightens previous results: since distributions of support at most $n$ have $L_{2/3}$ norm bounded by $\sqrt{n}$, this result immediately shows that for such distributions, $O(\sqrt{n}/\epsilon^2)$ samples suffice, tightening the previous bound of $O(\sqrt{n}\,\mathrm{polylog}(n)/\epsilon^4)$ for this class of distributions, and matching the (tight) known results for the case that $p$ is the uniform distribution over support $n$. The analysis of our very simple testing algorithm involves several hairy inequalities. To facilitate this analysis, we give a complete characterization of a general class of inequalities, generalizing Cauchy-Schwarz, Hölder's inequality, and the monotonicity of $L_p$ norms. Specifically, we characterize the set of sequences $(a_i) = a_1, \ldots, a_r$, $(b_i) = b_1, \ldots, b_r$, $(c_i) = c_1, \ldots, c_r$, for which it holds that for all finite sequences of positive numbers $(x_j) = x_1, \ldots$ and $(y_j) = y_1, \ldots$, $\prod_{i=1}^r \big(\sum_j x_j^{a_i} y_j^{b_i}\big)^{c_i} \ge 1$.
For example, the standard Cauchy-Schwarz inequality corresponds to the sequences $a = (1, 0, 1/2)$, $b = (0, 1, 1/2)$, $c = (1/2, 1/2, -1)$. Our characterization is of a non-traditional nature in that it uses linear programming to compute a derivation that may otherwise have to be sought through trial and error, by hand. We do not believe such a characterization has appeared in the literature, and hope its computational nature will be useful to others, and facilitate analyses like the one here.
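
The Cauchy-Schwarz triple above can be spot-checked numerically: with $a = (1, 0, 1/2)$, $b = (0, 1, 1/2)$, $c = (1/2, 1/2, -1)$, the product form reduces to $\sqrt{\sum x}\sqrt{\sum y}/\sum\sqrt{xy} \ge 1$. A small sketch evaluating it on random positive sequences:

```python
import random

def product_form(xs, ys, abc):
    """Evaluate prod_i ( sum_j x_j^{a_i} * y_j^{b_i} )^{c_i}
    for a list of (a_i, b_i, c_i) triples."""
    val = 1.0
    for a, b, c in abc:
        val *= sum(x ** a * y ** b for x, y in zip(xs, ys)) ** c
    return val

# The triple encoding Cauchy-Schwarz: the product is always >= 1.
cauchy_schwarz = [(1, 0, 0.5), (0, 1, 0.5), (0.5, 0.5, -1)]
rng = random.Random(1)
checks = [product_form([rng.uniform(0.1, 5) for _ in range(6)],
                       [rng.uniform(0.1, 5) for _ in range(6)],
                       cauchy_schwarz) for _ in range(100)]
# every value in `checks` is >= 1, with equality only when x and y
# are proportional
```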

SODA Conference 2014 Conference Paper

Optimal Algorithms for Testing Closeness of Discrete Distributions

  • Siu On Chan
  • Ilias Diakonikolas
  • Paul Valiant
  • Gregory Valiant

We study the question of closeness testing for two discrete distributions. More precisely, given samples from two distributions $p$ and $q$ over an $n$-element set, we wish to distinguish whether $p = q$ versus $p$ is at least $\epsilon$-far from $q$, in either $\ell_1$ or $\ell_2$ distance. Batu et al. [BFR+00, BFR+13] gave the first sub-linear time algorithms for these problems, which matched the lower bounds of [Val11] up to a logarithmic factor in $n$, and a polynomial factor of $\epsilon$. In this work, we present simple testers for both the $\ell_1$ and $\ell_2$ settings, with sample complexity that is information-theoretically optimal, to constant factors, both in the dependence on $n$ and the dependence on $\epsilon$; for the $\ell_1$ testing problem we establish that the sample complexity is $\Theta(\max\{n^{2/3}/\epsilon^{4/3}, n^{1/2}/\epsilon^2\})$.
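
The two terms of the $\Theta(\max\{n^{2/3}/\epsilon^{4/3}, n^{1/2}/\epsilon^2\})$ bound trade off at $\epsilon = n^{-1/4}$, which a few lines make concrete (this evaluates the bound up to its hidden constant; it is not the tester itself):

```python
def closeness_sample_complexity(n, eps):
    """Evaluate max{ n^(2/3)/eps^(4/3), n^(1/2)/eps^2 }, the paper's
    optimal l1 closeness-testing rate up to constant factors. The first
    term dominates for eps > n^(-1/4); the second takes over below."""
    return max(n ** (2 / 3) / eps ** (4 / 3), n ** 0.5 / eps ** 2)

# At n = 10^4 the crossover is eps = n^(-1/4) = 0.1, where both terms
# equal 10^4; at eps = 0.01 the n^(1/2)/eps^2 term dominates.
```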

NeurIPS Conference 2013 Conference Paper

Estimating the Unseen: Improved Estimators for Entropy and other Properties

  • Paul Valiant
  • Gregory Valiant

Recently, [Valiant and Valiant] showed that a class of distributional properties, which includes such practically relevant properties as entropy, the number of distinct elements, and distance metrics between pairs of distributions, can be estimated given a SUBLINEAR sized sample. Specifically, given a sample consisting of independent draws from any distribution over at most n distinct elements, these properties can be estimated accurately using a sample of size O(n / log n). We propose a novel modification of this approach and show: 1) theoretically, our estimator is optimal (to constant factors, over worst-case instances), and 2) in practice, it performs exceptionally well for a variety of estimation tasks, on a variety of natural distributions, for a wide range of parameters. Perhaps unsurprisingly, the key step in this approach is to first use the sample to characterize the "unseen" portion of the distribution. This goes beyond such tools as the Good-Turing frequency estimation scheme, which estimates the total probability mass of the unobserved portion of the distribution: we seek to estimate the "shape" of the unobserved portion of the distribution. This approach is robust, general, and theoretically principled; we expect that it may be fruitfully used as a component within larger machine learning and data analysis systems.
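
The Good-Turing scheme that the abstract contrasts against is simple enough to state in full: the total unseen probability mass is estimated by the fraction of the sample made up of singletons. A sketch of that classical estimate (not the paper's shape-recovery method):

```python
from collections import Counter

def good_turing_missing_mass(sample):
    """Classical Good-Turing estimate of the total probability mass of
    elements never observed: (# elements seen exactly once) / n."""
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)

# 'a' seen 3 times, 'b' twice, and five species seen once out of
# n = 10 draws, so the estimated unseen mass is 5/10 = 0.5.
unseen = good_turing_missing_mass(list("aaabbcdefg"))
```

Good-Turing returns only this single number; the paper's point is that the sample also constrains the entire histogram of unseen probabilities.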

SODA Conference 2013 Conference Paper

Testing k -Modal Distributions: Optimal Algorithms via Reductions

  • Constantinos Daskalakis
  • Ilias Diakonikolas
  • Rocco A. Servedio
  • Gregory Valiant
  • Paul Valiant

We give highly efficient algorithms, and almost matching lower bounds, for a range of basic statistical problems that involve testing and estimating the $L_1$ (total variation) distance between two $k$-modal distributions $p$ and $q$ over the discrete domain $\{1, \ldots, n\}$. More precisely, we consider the following four problems: given sample access to an unknown $k$-modal distribution $p$, Testing identity to a known or unknown distribution: 1. Determine whether $p = q$ (for an explicitly given $k$-modal distribution $q$) versus $p$ is $\epsilon$-far from $q$; 2. Determine whether $p = q$ (where $q$ is available via sample access) versus $p$ is $\epsilon$-far from $q$; Estimating $L_1$ distance ("tolerant testing") against a known or unknown distribution: 3. Approximate $d_{TV}(p, q)$ to within additive $\epsilon$ where $q$ is an explicitly given $k$-modal distribution; 4. Approximate $d_{TV}(p, q)$ to within additive $\epsilon$ where $q$ is available via sample access. For each of these four problems we give sub-logarithmic sample algorithms, and show that our algorithms have optimal sample complexity up to additive $\mathrm{poly}(k)$ and multiplicative $\mathrm{polylog}\log n + \mathrm{polylog}\,k$ factors. Our algorithms significantly improve the previous results of [BKR04], which were for testing identity of distributions (items (1) and (2) above) in the special cases $k = 0$ (monotone distributions) and $k = 1$ (unimodal distributions) and required $O((\log n)^3)$ samples. As our main conceptual contribution, we introduce a new reduction-based approach for distribution-testing problems that lets us obtain all the above results in a unified way. Roughly speaking, this approach enables us to transform various distribution testing problems for $k$-modal distributions over $\{1, \ldots, n\}$ to the corresponding distribution testing problems for unrestricted distributions over a much smaller domain $\{1, \ldots, \ell\}$ where $\ell = O(k \log n)$.

FOCS Conference 2011 Conference Paper

The Power of Linear Estimators

  • Gregory Valiant
  • Paul Valiant

For a broad class of practically relevant distribution properties, which includes entropy and support size, nearly all of the proposed estimators have an especially simple form. Given a set of independent samples from a discrete distribution, these estimators tally the vector of summary statistics -- the number of domain elements seen once, twice, etc. in the sample -- and output the dot product between these summary statistics and a fixed vector of coefficients. We term such estimators linear. This historical proclivity towards linear estimators is slightly perplexing, since, despite many efforts over nearly 60 years, all proposed such estimators have significantly suboptimal convergence, compared to the bounds shown in [26], [27]. Our main result, in some sense vindicating this insistence on linear estimators, is that for any property in this broad class, there exists a near-optimal linear estimator. Additionally, we give a practical and polynomial-time algorithm for constructing such estimators for any given parameters. While this result does not yield explicit bounds on the sample complexities of these estimation tasks, we leverage the insights provided by this result to give explicit constructions of near-optimal linear estimators for three properties: entropy, $L_1$ distance to uniformity, and, for pairs of distributions, $L_1$ distance. Our entropy estimator, when given $O(\frac{n}{\epsilon \log n})$ independent samples from a distribution of support at most $n$, will estimate the entropy of the distribution to within additive accuracy $\epsilon$, with probability of failure $o(1/\mathrm{poly}(n))$. From the recent lower bounds given in [26], [27], this estimator is optimal, to constant factors, both in its dependence on $n$ and its dependence on $\epsilon$. In particular, the inverse-linear convergence rate of this estimator resolves the main open question of [26], [28], which left open the possibility that the error decreased only with the square root of the number of samples.
Our distance-to-uniformity estimator, when given $O(\frac{m}{\epsilon^2 \log m})$ independent samples from any distribution, returns an $\epsilon$-accurate estimate of the $L_1$ distance to the uniform distribution of support $m$. This is constant-factor optimal, for constant $\epsilon$. Finally, our framework extends naturally to properties of pairs of distributions, including estimating the $L_1$ distance and KL-divergence between pairs of distributions. We give an explicit linear estimator for estimating $L_1$ distance to additive accuracy $\epsilon$ using $O(\frac{n}{\epsilon^2 \log n})$ samples from each distribution, which is constant-factor optimal, for constant $\epsilon$. This is the first sublinear-sample estimator for this fundamental property.
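
The "linear estimator" form defined in the abstract (dot product of the fingerprint with fixed coefficients) is concrete enough to sketch. The standard plug-in entropy estimator is itself linear: an element seen $i$ times contributes $-(i/n)\log(i/n)$. This is a well-known identity used for illustration; the paper's near-optimal coefficients are not reproduced here:

```python
import math
from collections import Counter

def fingerprint(sample):
    """F[i] = number of distinct elements appearing exactly i times."""
    counts = Counter(sample)
    return Counter(counts.values())

def linear_estimate(sample, coeff):
    """A linear estimator: dot product of the fingerprint with a
    fixed coefficient vector coeff(i)."""
    return sum(f * coeff(i) for i, f in fingerprint(sample).items())

# The plug-in (empirical) entropy as a linear estimator: each element
# seen i times contributes -(i/n) * log(i/n).
sample = list("aaabbc")
n = len(sample)
plugin = lambda i: -(i / n) * math.log(i / n)
h_emp = linear_estimate(sample, plugin)
```

The paper's result says coefficients exist that make such an estimator near-optimal; constructing them takes the linear program, not the plug-in rule above.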

FOCS Conference 2005 Conference Paper

On the Complexity of Two-Player Win-Lose Games

  • Timothy G. Abbott
  • Daniel M. Kane
  • Paul Valiant

The efficient computation of Nash equilibria is one of the most formidable challenges in computational complexity today. The problem remains open for two-player games. We show that the complexity of two-player Nash equilibria is unchanged when all outcomes are restricted to be 0 or 1. That is, win-or-lose games are as complex as the general case for two-player games.