STOC Conference 2025 Conference Paper
Pauli Measurements Are Not Optimal for Single-Copy Tomography
- Jayadev Acharya
- Abhilash Dharmavarapu
- Yuhan Liu 0007
- Nengkun Yu
Author name cluster
Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.
STOC Conference 2025 Conference Paper
NeurIPS Conference 2023 Conference Paper
We introduce camouflaged data poisoning attacks, a new attack vector that arises in the context of machine unlearning and other settings when model retraining may be induced. An adversary first adds a few carefully crafted points to the training dataset such that the impact on the model's predictions is minimal. The adversary subsequently triggers a request to remove a subset of the introduced points at which point the attack is unleashed and the model's predictions are negatively affected. In particular, we consider clean-label targeted attacks (in which the goal is to cause the model to misclassify a specific test point) on datasets including CIFAR-10, Imagenette, and Imagewoof. This attack is realized by constructing camouflage datapoints that mask the effect of a poisoned dataset. We demonstrate efficacy of our attack when unlearning is performed via retraining from scratch, the idealized setting of machine unlearning which other efficient methods attempt to emulate, as well as against the approximate unlearning approach of Graves et al. (2021).
NeurIPS Conference 2023 Conference Paper
We consider distributed parameter estimation using interactive protocols subject to local information constraints such as bandwidth limitations, local differential privacy, and restricted measurements. We provide a unified framework enabling us to derive a variety of (tight) minimax lower bounds for different parametric families of distributions, both continuous and discrete, under any $\ell_p$ loss. Our lower bound framework is versatile and yields “plug-and-play” bounds that are widely applicable to a large range of estimation problems, and, for the prototypical case of the Gaussian family, circumvents limitations of previous techniques. In particular, our approach recovers bounds obtained using data processing inequalities and Cramér–Rao bounds, two other alternative approaches for proving lower bounds in our setting of interest. Further, for the families considered, we complement our lower bounds with matching upper bounds.
NeurIPS Conference 2021 Conference Paper
We obtain tight minimax rates for the problem of distributed estimation of discrete distributions under communication constraints, where $n$ users observing $m $ samples each can broadcast only $\ell$ bits. Our main result is a tight characterization (up to logarithmic factors) of the error rate as a function of $m$, $\ell$, the domain size, and the number of users under most regimes of interest. While previous work focused on the setting where each user only holds one sample, we show that as $m$ grows the $\ell_1$ error rate gets reduced by a factor of $\sqrt{m}$ for small $m$. However, for large $m$ we observe an interesting phase transition: the dependence of the error rate on the communication constraint $\ell$ changes from $1/\sqrt{2^{\ell}}$ to $1/\sqrt{\ell}$.
NeurIPS Conference 2021 Conference Paper
We revisit first-order optimization under local information constraints such as local privacy, gradient quantization, and computational constraints limiting access to a few coordinates of the gradient. In this setting, the optimization algorithm is not allowed to directly access the complete output of the gradient oracle, but only gets limited information about it subject to the local information constraints. We study the role of adaptivity in processing the gradient output to obtain this limited information from it, and obtain tight or nearly tight bounds for both convex and strongly convex optimization when adaptive gradient processing is allowed.
NeurIPS Conference 2021 Conference Paper
We consider density estimation for Besov spaces when the estimator is restricted to use only a limited number of bits about each sample. We provide a noninteractive adaptive estimator which exploits the sparsity of wavelet bases, along with a simulate-and-infer technique from parametric estimation under communication constraints. We show that our estimator is nearly rate-optimal by deriving minmax lower bounds that hold even when interactive protocols are allowed. Interestingly, while our wavelet-based estimator is almost rate-optimal for Sobolev spaces as well, it is unclear whether the standard Fourier basis, which arise naturally for those spaces, can be used to achieve the same performance.
ICML Conference 2021 Conference Paper
We consider a linear autoencoder in which the latent variables are quantized, or corrupted by noise, and the constraint is Schur-concave in the set of latent variances. Although finding the optimal encoder/decoder pair for this setup is a nonconvex optimization problem, we show that decomposing the source into its principal components is optimal. If the constraint is strictly Schur-concave and the empirical covariance matrix has only simple eigenvalues, then any optimal encoder/decoder must decompose the source in this way. As one application, we consider a strictly Schur-concave constraint that estimates the number of bits needed to represent the latent variables under fixed-rate encoding, a setup that we call \emph{Principal Bit Analysis (PBA)}. This yields a practical, general-purpose, fixed-rate compressor that outperforms existing algorithms. As a second application, we show that a prototypical autoencoder-based variable-rate compressor is guaranteed to decompose the source into its principal components.
NeurIPS Conference 2021 Conference Paper
We study the problem of unlearning datapoints from a learnt model. The learner first receives a dataset $S$ drawn i. i. d. from an unknown distribution, and outputs a model $\widehat{w}$ that performs well on unseen samples from the same distribution. However, at some point in the future, any training datapoint $z \in S$ can request to be unlearned, thus prompting the learner to modify its output model while still ensuring the same accuracy guarantees. We initiate a rigorous study of generalization in machine unlearning, where the goal is to perform well on previously unseen datapoints. Our focus is on both computational and storage complexity. For the setting of convex losses, we provide an unlearning algorithm that can unlearn up to $O(n/d^{1/4})$ samples, where $d$ is the problem dimension. In comparison, in general, differentially private learning (which implies unlearning) only guarantees deletion of $O(n/d^{1/2})$ samples. This demonstrates a novel separation between differential privacy and machine unlearning.
ICML Conference 2021 Conference Paper
We study robust testing and estimation of discrete distributions in the strong contamination model. Our results cover both centralized setting and distributed setting with general local information constraints including communication and LDP constraints. Our technique relates the strength of manipulation attacks to the earth-mover distance using Hamming distance as the metric between messages (samples) from the users. In the centralized setting, we provide optimal error bounds for both learning and testing. Our lower bounds under local information constraints build on the recent lower bound methods in distributed inference. In the communication constrained setting, we develop novel algorithms based on random hashing and an L1-L1 isometry.
ICML Conference 2020 Conference Paper
Local differential privacy (LDP) is a strong notion of privacy that often leads to a significant drop in utility. The original definition of LDP assumes that all the elements in the data domain are equally sensitive. However, in many real-life applications, some elements are more sensitive than others. We propose a context-aware framework for LDP that allows the privacy level to vary across the data domain, enabling system designers to place privacy constraints where they matter without paying the cost where they do not. For binary data domains, we provide a universally optimal privatization scheme and highlight its connections to Warner’s randomized response and Mangat’s improved response. Motivated by geo-location and web search applications, for k-ary data domains, we consider two special cases of context-aware LDP: block-structured LDP and high-low LDP. We study minimax discrete distribution estimation under both cases and provide communication-efficient, sample-optimal schemes, and information-theoretic lower bounds. We show, using worst-case analyses and experiments on Gowalla’s 3. 6 million check-ins to 43, 750 locations, that context-aware LDP achieves a far better accuracy under the same number of samples.
ICML Conference 2019 Conference Paper
We consider the problems of distribution estimation, and heavy hitter (frequency) estimation under privacy, and communication constraints. While the constraints have been studied separately, optimal schemes for one are sub-optimal for the other. We propose a sample-optimal $\eps$-locally differentially private (LDP) scheme for distribution estimation, where each user communicates one bit, and requires no public randomness. We also show that Hadamard Response, a recently proposed scheme for $\eps$-LDP distribution estimation is also utility-optimal for heavy hitters estimation. Our final result shows that unlike distribution estimation, without public randomness, any utility-optimal heavy hitter estimation algorithm must require $\Omega(\log n)$ bits of communication per user.
ICML Conference 2019 Conference Paper
A central server needs to perform statistical inference based on samples that are distributed over multiple users who can each send a message of limited length to the center. We study problems of distribution learning and identity testing in this distributed inference setting and examine the role of shared randomness as a resource. We propose a general purpose simulate-and-infer strategy that uses only private-coin communication protocols and is sample-optimal for distribution learning. This general strategy turns out to be sample-optimal even for distribution testing among private-coin protocols. Interestingly, we propose a public-coin protocol that outperforms simulate-and-infer for distribution testing and is, in fact, sample-optimal. Underlying our public-coin protocol is a random hash that when applied to the samples minimally contracts the chi-squared distance of their distribution from the uniform distribution.
ICML Conference 2019 Conference Paper
In distributed statistical learning, $N$ samples are split across $m$ machines and a learner wishes to use minimal communication to learn as well as if the examples were on a single machine. This model has received substantial interest in machine learning due to its scalability and potential for parallel speedup. However, in high-dimensional settings, where the number examples is smaller than the number of features (‘"dimension"), the speedup afforded by distributed learning may be overshadowed by the cost of communicating a single example. This paper investigates the following question: When is it possible to learn a $d$-dimensional model in the distributed setting with total communication sublinear in $d$? Starting with a negative result, we observe that for learning $\ell_1$-bounded or sparse linear models, no algorithm can obtain optimal error until communication is linear in dimension. Our main result is that by slightly relaxing the standard boundedness assumptions for linear models, we can obtain distributed algorithms that enjoy optimal error with communication logarithmic in dimension. This result is based on a family of algorithms that combine mirror descent with randomized sparsification/quantization of iterates, and extends to the general stochastic convex optimization model.
NeurIPS Conference 2019 Conference Paper
We consider the task of estimating the entropy of $k$-ary distributions from samples in the streaming model, where space is limited. Our main contribution is an algorithm that requires $O\left(\frac{k \log (1/\varepsilon)^2}{\varepsilon^3}\right)$ samples and a constant $O(1)$ memory words of space and outputs a $\pm\varepsilon$ estimate of $H(p)$. Without space limitations, the sample complexity has been established as $S(k, \varepsilon)=\Theta\left(\frac k{\varepsilon\log k}+\frac{\log^2 k}{\varepsilon^2}\right)$, which is sub-linear in the domain size $k$, and the current algorithms that achieve optimal sample complexity also require nearly-linear space in $k$. Our algorithm partitions $[0, 1]$ into intervals and estimates the entropy contribution of probability values in each interval. The intervals are designed to trade bias and variance. Distribution property estimation and testing with limited memory is a largely unexplored research area. We hope our work will motivate research in this field.
NeurIPS Conference 2018 Conference Paper
We study the fundamental problems of identity testing (goodness of fit), and closeness testing (two sample test) of distributions over $k$ elements, under differential privacy. While the problems have a long history in statistics, finite sample bounds for these problems have only been established recently. In this work, we derive upper and lower bounds on the sample complexity of both the problems under $(\varepsilon, \delta)$-differential privacy. We provide optimal sample complexity algorithms for identity testing problem for all parameter ranges, and the first results for closeness testing. Our closeness testing bounds are optimal in the sparse regime where the number of samples is at most $k$. Our upper bounds are obtained by privatizing non-private estimators for these problems. The non-private estimators are chosen to have small sensitivity. We propose a general framework to establish lower bounds on the sample complexity of statistical tasks under differential privacy. We show a bound on differentially private algorithms in terms of a coupling between the two hypothesis classes we aim to test. By constructing carefully chosen priors over the hypothesis classes, and using Le Cam's two point theorem we provide a general mechanism for proving lower bounds. We believe that the framework can be used to obtain strong lower bounds for other statistical tasks under privacy.
ICML Conference 2018 Conference Paper
We develop differentially private methods for estimating various distributional properties. Given a sample from a discrete distribution p, some functional f, and accuracy and privacy parameters alpha and epsilon, the goal is to estimate f(p) up to accuracy alpha, while maintaining epsilon-differential privacy of the sample. We prove almost-tight bounds on the sample size required for this problem for several functionals of interest, including support size, support coverage, and entropy. We show that the cost of privacy is negligible in a variety of settings, both theoretically and experimentally. Our methods are based on a sensitivity analysis of several state-of-the-art methods for estimating these properties with sublinear sample complexities
NeurIPS Conference 2018 Conference Paper
We consider testing and learning problems on causal Bayesian networks as defined by Pearl (Pearl, 2009). Given a causal Bayesian network M on a graph with n discrete variables and bounded in-degree and bounded ``confounded components'', we show that O(log n) interventions on an unknown causal Bayesian network X on the same graph, and O(n/epsilon^2) samples per intervention, suffice to efficiently distinguish whether X=M or whether there exists some intervention under which X and M are farther than epsilon in total variation distance. We also obtain sample/time/intervention efficient algorithms for: (i) testing the identity of two unknown causal Bayesian networks on the same graph; and (ii) learning a causal Bayesian network on a given graph. Although our algorithms are non-adaptive, we show that adaptivity does not help in general: Omega(log n) interventions are necessary for testing the identity of two unknown causal Bayesian networks on the same graph, even adaptively. Our algorithms are enabled by a new subadditivity inequality for the squared Hellinger distance between two causal Bayesian networks.
JMLR Journal 2018 Journal Article
We study maximum selection and sorting of $n$ numbers using imperfect pairwise comparators. The imperfect comparator returns the larger of the two inputs if the inputs are more than a given threshold apart and an adversarially-chosen input otherwise. We consider two adversarial models: a non-adaptive adversary that decides on the outcomes in advance and an adaptive adversary that decides on the outcome of each comparison depending on the previous comparisons and outcomes. Against the non-adaptive adversary, we derive a maximum-selection algorithm that uses at most $2n$ comparisons in expectation and a sorting algorithm that uses at most $2n\ln n$ comparisons in expectation. In the presence of the adaptive adversary, the proposed maximum-selection algorithm uses $\Theta(n\log (1/{\epsilon}))$ comparisons to output a correct answer with probability at least $1-\epsilon$, resolving an open problem in Ajtai et al. (2015). Our study is motivated by a density-estimation problem. Given samples from an unknown distribution, we would like to find a distribution among a known class of $n$ candidate distributions that is close to the underlying distribution in $\ell_1$ distance. Scheffe's algorithm, for example, in Devroye and Lugosi (2001) outputs a distribution at an $\ell_1$ distance at most 9 times the minimum and runs in time $\Theta(n^2\log n)$. Using our algorithm, the runtime reduces to $\Theta(n\log n)$. [abs] [ pdf ][ bib ] © JMLR 2018. ( edit, beta )
ICML Conference 2017 Conference Paper
Symmetric distribution properties such as support size, support coverage, entropy, and proximity to uniformity, arise in many applications. Recently, researchers applied different estimators and analysis tools to derive asymptotically sample-optimal approximations for each of these properties. We show that a single, simple, plug-in estimator— profile maximum likelihood (PML) —is sample competitive for all symmetric properties, and in particular is asymptotically sample-optimal for all the above properties.
SODA Conference 2017 Conference Paper
We design a new, fast algorithm for agnostically learning univariate probability distributions whose densities are well-approximated by piecewise polynomial functions. Let ƒ be the density function of an arbitrary univariate distribution, and suppose that ƒ is OPT-close in Li- distance to an unknown piecewise polynomial function with t interval pieces and degree d. For any γ > 0, our algorithm draws n = Õ γ ( t ( d + 1)/ ∊ 2 ) samples from ƒ, runs in time Õ( n ), and with probability at least 9/10 outputs an Ο γ ( t )-piecewise degree- d hypothesis H that is (3 + γ) · OPT + ∊ close to f. Our approximation factor almost matches the best known information-theoretic (but computationally inefficient) upper bound of 3. Our general algorithm yields ( n early) sample- optimal and nearly-linear time estimators for a wide range of structured distribution families over both continuous and discrete domains in a unified way. For most of our applications, these are the first sample-optimal and nearly-linear time estimators in the literature. As a consequence, our work resolves the sample and computational complexities of a broad class of inference tasks via a single “meta-algorithm”. Moreover, we demonstrate that our algorithm performs very well in experiments. Our algorithm consists of three levels: (i) At the top level, we employ an iterative greedy algorithm for finding a good partition of the real line into the pieces of a piecewise polynomial. (ii) For each piece, we show that the sub-problem of finding a good polynomial fit on the current interval can be solved efficiently with a separation oracle method. (iii) We reduce the task of finding a separating hyperplane to a combinatorial problem and design a nearly-linear algorithm for this problem. Combining these three procedures gives a density estimation algorithm with the claimed guarantees.
ICML Conference 2016 Conference Paper
We study the fixed design segmented regression problem: Given noisy samples from a piecewise linear function f, we want to recover f up to a desired accuracy in mean-squared error. Previous rigorous approaches for this problem rely on dynamic programming (DP) and, while sample efficient, have running time quadratic in the sample size. As our main contribution, we provide new sample near-linear time algorithms for the problem that - while not being minimax optimal - achieve a significantly better sample-time tradeoff on large datasets compared to the DP approach. Our experimental evaluation shows that, compared with the DP approach, our algorithms provide a convergence rate that is only off by a factor of 2 to 4, while achieving speedups of three orders of magnitude.
NeurIPS Conference 2015 Conference Paper
Given samples from an unknown distribution, p, is it possible to distinguish whether p belongs to some class of distributions C versus p being far from every distribution in C? This fundamental question has receivedtremendous attention in Statistics, albeit focusing onasymptotic analysis, as well as in Computer Science, wherethe emphasis has been on small sample size and computationalcomplexity. Nevertheless, even for basic classes ofdistributions such as monotone, log-concave, unimodal, and monotone hazard rate, the optimal sample complexity is unknown. We provide a general approach via which we obtain sample-optimal and computationally efficient testers for all these distribution families. At the core of our approach is an algorithm which solves the following problem: Given samplesfrom an unknown distribution p, and a known distribution q, are p and q close in Chi^2-distance, or far in total variation distance? The optimality of all testers is established by providing matching lower bounds. Finally, a necessary building block for our tester and important byproduct of our work are the first known computationally efficient proper learners for discretelog-concave and monotone hazard rate distributions. We exhibit the efficacy of our testers via experimental analysis.
SODA Conference 2015 Conference Paper
SODA Conference 2015 Conference Paper
It was recently shown that estimating the Shannon entropy H (p) of a discrete k -symbol distribution p requires Θ( k / log k ) samples, a number that grows nearlinearly in the support size. In many applications H (p) can be replaced by the more general Rényi entropy of order α, H α (p). We determine the number of samples needed to estimate H α (p) for all α, showing that α < 1 requires super-linear, roughly k 1/ α samples, noninteger α > 1 requires near-linear, roughly k samples, but integer α > 1 requires only Θ( k 1−1/α ) samples. In particular, estimating H 2 (p), which arises in security, DNA reconstruction, closeness testing, and other applications, requires only samples. The estimators achieving these bounds are simple and run in time linear in the number of samples.
NeurIPS Conference 2014 Conference Paper
Many important distributions are high dimensional, and often they can be modeled as Gaussian mixtures. We derive the first sample-efficient polynomial-time estimator for high-dimensional spherical Gaussian mixtures. Based on intuitive spectral reasoning, it approximates mixtures of $k$ spherical Gaussians in $d$-dimensions to within$\ell_1$ distance $\epsilon$ using $\mathcal{O}({dk^9(\log^2 d)}/{\epsilon^4})$ samples and $\mathcal{O}_{k, \epsilon}(d^3\log^5 d)$ computation time. Conversely, we show that any estimator requires $\Omega\bigl({dk}/{\epsilon^2}\bigr)$ samples, hence the algorithm's sample complexity is nearly optimal in the dimension. The implied time-complexity factor \mathcal{O}_{k, \epsilon}$ is exponential in $k$, but much smaller than previously known. We also construct a simple estimator for one-dimensional Gaussian mixtures that uses $\tilde\mathcal{O}(k /\epsilon^2)$ samples and $\tilde\mathcal{O}((k/\epsilon)^{3k+1})$ computation time.
NeurIPS Conference 2012 Conference Paper
The minimax KL-divergence of any distribution from all distributions in a collection P has several practical implications. In compression, it is called redundancy and represents the least additional number of bits over the entropy needed to encode the output of any distribution in P. In online es- timation and learning, it is the lowest expected log-loss regret when guessing a sequence of random values generated by a distribution in P. In hypothesis testing, it upper bounds the largest number of distinguishable distributions in P. Motivated by problems ranging from population estimation to text classification and speech recognition, several machine-learning and information-theory researchers have recently considered label-invariant observations and properties induced by i. i. d. distributions. A sufficient statistic for all these properties is the data’s profile, the multiset of the number of times each data element appears. Improving on a sequence of previous works, we show that the redun- dancy of the collection of distributions induced over profiles by length-n i. i. d. sequences is between 0. 3 · n1/3 and n1/3 log2 n, in particular, establishing its exact growth power.