Arrow Research search

Author name cluster

Badih Ghazi

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

39 papers
2 author rows

Possible papers

39

NeurIPS Conference 2025 Conference Paper

Private Hyperparameter Tuning with Ex-Post Guarantee

  • Badih Ghazi
  • Pritish Kamath
  • Alexander Knop
  • Ravi Kumar
  • Pasin Manurangsi
  • Chiyuan Zhang

The conventional approach in differential privacy (DP) literature formulates the privacy-utility tradeoff with a "privacy-first" perspective: for a predetermined level of privacy, a certain utility is achievable. However, practitioners often operate under a "utility-first" paradigm, prioritizing a desired level of utility and then determining the corresponding privacy cost. Wu et al. [2019] initiated a formal study of this ``utility-first'' perspective by introducing ex-post DP. They demonstrated that by adding correlated Laplace noise and progressively reducing it on demand, a sequence of increasingly accurate estimates of a private parameter can be generated, with the privacy cost attributed only to the least noisy iterate released. This led to a Laplace mechanism variant that achieves a specified utility with minimal privacy loss. However, their work, and similar findings by Whitehouse et al. [2023], are primarily limited to simple mechanisms based on Laplace or Gaussian noise. In this paper, we significantly generalize these results. In particular, we extend the findings of Wu et al. [2019] and Liu and Talwar [2019] to support any sequence of private estimators, incurring at most a doubling of the original privacy budget. Furthermore, we demonstrate that hyperparameter tuning for these estimators, including the selection of an optimal privacy budget, can be performed without additional privacy cost. Finally, we extend our results to ex-post R\'{e}nyi DP, further broadening the applicability of utility-first privacy mechanisms.

NeurIPS Conference 2025 Conference Paper

Quantifying Cross-Modality Memorization in Vision-Language Models

  • Yuxin Wen
  • Yangsibo Huang
  • Tom Goldstein
  • Ravi Kumar
  • Badih Ghazi
  • Chiyuan Zhang

Understanding what and how neural networks memorize during training is crucial, both from the perspective of unintentional memorization of potentially sensitive information and from the standpoint of effective knowledge acquisition for real-world, knowledge-intensive tasks. While previous studies primarily investigate memorization within a single modality, such as text memorization in large language models or image memorization in diffusion models, unified multimodal models are becoming increasingly prevalent in practical applications. In this work, we focus on the unique characteristics of cross-modality memorization and conduct a systematic study centered on vision-language models. To facilitate controlled experiments, we first introduce a synthetic persona dataset comprising diverse synthetic person images and textual descriptions. We quantify factual knowledge memorization and cross-modal transferability by training models on a single modality and evaluating their performance in the other. Our results reveal that facts learned in one modality transfer to the other, but a significant gap exists between recalling information in the source and target modalities. Furthermore, we observe that this gap exists across various scenarios, including more capable models, machine unlearning, and the multi-hop case. At the end, we propose a baseline method to mitigate this challenge. We hope our study can inspire future research on developing more robust multimodal learning techniques to enhance cross-modal transferability.

NeurIPS Conference 2025 Conference Paper

Scaling Embedding Layers in Language Models

  • Da Yu
  • Edith Cohen
  • Badih Ghazi
  • Yangsibo Huang
  • Pritish Kamath
  • Ravi Kumar
  • Daogao Liu
  • Chiyuan Zhang

We propose SCONE (**S**calable, **C**ontextualized, **O**ffloaded, **N**-gram **E**mbedding), a new method for extending input embedding layers to enhance language model performance. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent $n$-grams. These embeddings provide contextualized representation for each input token and are learned with a separate model during training. After training, embeddings are precomputed and stored in off-accelerator memory; during inference, querying them has minimal impact on latency due to the low complexity of embedding lookups. SCONE enables two new scaling strategies: increasing the number of $n$-gram embeddings and scaling the model used to learn them, both while maintaining fixed accelerator usage during inference (in terms of FLOPS and memory). We show that scaling both aspects enables a model with 1B accelerator-resident parameters to outperform a 1. 9B-parameter baseline across diverse corpora, while using only about half the FLOPS and accelerator memory during inference.

ICML Conference 2025 Conference Paper

Scaling Laws for Differentially Private Language Models

  • Ryan McKenna
  • Yangsibo Huang
  • Amer Sinha
  • Borja Balle
  • Zachary Charles
  • Christopher A. Choquette-Choo
  • Badih Ghazi
  • Georgios Kaissis

Scaling laws have emerged as important components of large language model (LLM) training as they can predict performance gains through scale, and provide guidance on important hyper-parameter choices that would otherwise be expensive. LLMs also rely on large, high-quality training datasets, like those sourced from (sometimes sensitive) user data. Training models on this sensitive user data requires careful privacy protections like differential privacy (DP). However, the dynamics of DP training are significantly different, and consequently their scaling laws are not yet fully understood. In this work, we establish scaling laws that accurately model the intricacies of DP LLM training, providing a complete picture of the compute-privacy-utility and the optimal training configurations in many settings.

ICLR Conference 2025 Conference Paper

Unlearn and Burn: Adversarial Machine Unlearning Requests Destroy Model Accuracy

  • Yangsibo Huang
  • Daogao Liu
  • Lynn Chua
  • Badih Ghazi
  • Pritish Kamath
  • Ravi Kumar 0001
  • Pasin Manurangsi
  • Milad Nasr

Machine unlearning algorithms, designed for selective removal of training data from models, have emerged as a promising approach to growing privacy concerns. In this work, we expose a critical yet underexplored vulnerability in the deployment of unlearning systems: the assumption that the data requested for removal is always part of the original training set. We present a threat model where an attacker can degrade model accuracy by submitting adversarial unlearning requests for data \textit{not} present in the training set. We propose white-box and black-box attack algorithms and evaluate them through a case study on image classification tasks using the CIFAR-10 and ImageNet datasets, targeting a family of widely used unlearning methods. Our results show extremely poor test accuracy following the attack—3.6% on CIFAR-10 and 0.4% on ImageNet for white-box attacks, and 8.5% on CIFAR-10 and 1.3% on ImageNet for black-box attacks. Additionally, we evaluate various verification mechanisms to detect the legitimacy of unlearning requests and reveal the challenges in verification, as most of the mechanisms fail to detect stealthy attacks without severely impairing their ability to process valid requests. These findings underscore the urgent need for research on more robust request verification methods and unlearning protocols, should the deployment of machine unlearning systems become more prevalent in the future.

NeurIPS Conference 2024 Conference Paper

Differentially Private Optimization with Sparse Gradients

  • Badih Ghazi
  • Cristóbal Guzmán
  • Pritish Kamath
  • Ravi Kumar
  • Pasin Manurangsi

Motivated by applications of large embedding models, we study differentially private (DP) optimization problems under sparsity of individual gradients. We start with new near-optimal bounds for the classic mean estimation problem but with sparse data, improving upon existing algorithms particularly for the high-dimensional regime. The corresponding lower bounds are based on a novel block-diagonal construction that is combined with existing DP mean estimation lower bounds. Next, we obtain pure- and approximate-DP algorithms with almost optimal rates for stochastic convex optimization with sparse gradients; the former represents the first nearly dimension-independent rates for this problem. Furthermore, by introducing novel analyses of bias reduction in mean estimation and randomly-stopped biased SGD we obtain nearly dimension-independent rates for near-stationary points for the empirical risk in nonconvex settings under approximate-DP.

ICML Conference 2024 Conference Paper

How Private are DP-SGD Implementations?

  • Lynn Chua
  • Badih Ghazi
  • Pritish Kamath
  • Ravi Kumar 0001
  • Pasin Manurangsi
  • Amer Sinha
  • Chiyuan Zhang

We demonstrate a substantial gap between the privacy guarantees of the Adaptive Batch Linear Queries (ABLQ) mechanism under different types of batch sampling: (i) Shuffling, and (ii) Poisson subsampling; the typical analysis of Differentially Private Stochastic Gradient Descent (DP-SGD) follows by interpreting it as a post-processing of ABLQ. While shuffling-based DP-SGD is more commonly used in practical implementations, it has not been amenable to easy privacy analysis, either analytically or even numerically. On the other hand, Poisson subsampling-based DP-SGD is challenging to scalably implement, but has a well-understood privacy analysis, with multiple open-source numerically tight privacy accountants available. This has led to a common practice of using shuffling-based DP-SGD in practice, but using the privacy analysis for the corresponding Poisson subsampling version. Our result shows that there can be a substantial gap between the privacy analysis when using the two types of batch sampling, and thus advises caution in reporting privacy parameters for DP-SGD.

ICML Conference 2024 Conference Paper

Individualized Privacy Accounting via Subsampling with Applications in Combinatorial Optimization

  • Badih Ghazi
  • Pritish Kamath
  • Ravi Kumar 0001
  • Pasin Manurangsi
  • Adam Sealfon

In this work, we give a new technique for analyzing individualized privacy accounting via the following simple observation: if an algorithm is one-sided add-DP, then its subsampled variant satisfies two-sided DP. From this, we obtain several improved algorithms for private combinatorial optimization problems, including decomposable submodular maximization and set cover. Our error guarantees are asymptotically tight and our algorithm satisfies pure-DP while previously known algorithms (Gupta et al. , 2010; Chaturvedi et al. , 2021) are approximate-DP. We also show an application of our technique beyond combinatorial optimization by giving a pure-DP algorithm for the shifting heavy hitter problem in a stream; previously, only an approximate-DP algorithm was known (Kaplan et al. , 2021; Cohen & Lyu, 2023).

ICLR Conference 2024 Conference Paper

LabelDP-Pro: Learning with Label Differential Privacy via Projections

  • Badih Ghazi
  • Yangsibo Huang
  • Pritish Kamath
  • Ravi Kumar 0001
  • Pasin Manurangsi
  • Chiyuan Zhang

Label differentially private (label DP) algorithms seek to preserve the privacy of the labels in a training dataset in settings where the features are known to the adversary. In this work, we study a new family of label DP training algorithms. Unlike most prior label DP algorithms that have been based on label randomization, our algorithm naturally leverages the power of the central model of DP. It interleaves gradient projection operations with private stochastic gradient descent steps in order to improve the utility of the trained model while guaranteeing the privacy of the labels. We show that such projection-based algorithms can be made practical and that they improve on the state-of-the art for label DP training in the high-privacy regime. We complement our empirical evaluation with theoretical results shedding light on the efficacy of our method through the lens of bias-variance trade-offs.

NeurIPS Conference 2024 Conference Paper

Scalable DP-SGD: Shuffling vs. Poisson Subsampling

  • Lynn Chua
  • Badih Ghazi
  • Pritish Kamath
  • Ravi Kumar
  • Pasin Manurangsi
  • Amer Sinha
  • Chiyuan Zhang

We provide new lower bounds on the privacy guarantee of multi-epoch Adaptive Batch Linear Queries (ABLQ) mechanism with shuffled batch sampling, demonstrating substantial gaps when compared to Poisson subsampling; prior analysis was limited to a single epoch. Since the privacy analysis of Differentially Private Stochastic Gradient Descent (DP-SGD) is obtained by analyzing the ABLQ mechanism, this brings into serious question the common practice of implementing Shuffling based DP-SGD, but reporting privacy parameters as if Poisson subsampling was used. To understand the impact of this gap on the utility of trained machine learning models, we introduce a novel practical approach to implement Poisson subsampling at scale using massively parallel computation, and efficiently train models with the same. We provide a comparison between the utility of models trained with Poisson subsampling based DP-SGD, and the optimistic estimates of utility when using shuffling, via our new lower bounds on the privacy guarantee of ABLQ with shuffling.

AAAI Conference 2023 Conference Paper

Differentially Private Heatmaps

  • Badih Ghazi
  • Junfeng He
  • Kai Kohlhoff
  • Ravi Kumar
  • Pasin Manurangsi
  • Vidhya Navalpakkam
  • Nachiappan Valliappan

We consider the task of producing heatmaps from users' aggregated data while protecting their privacy. We give a differentially private (DP) algorithm for this task and demonstrate its advantages over previous algorithms on real-world datasets. Our core algorithmic primitive is a DP procedure that takes in a set of distributions and produces an output that is close in Earth Mover's Distance (EMD) to the average of the inputs. We prove theoretical bounds on the error of our algorithm under a certain sparsity assumption and that these are essentially optimal.

NeurIPS Conference 2023 Conference Paper

On Computing Pairwise Statistics with Local Differential Privacy

  • Badih Ghazi
  • Pritish Kamath
  • Ravi Kumar
  • Pasin Manurangsi
  • Adam Sealfon

We study the problem of computing pairwise statistics, i. e. , ones of the form $\binom{n}{2}^{-1} \sum_{i \ne j} f(x_i, x_j)$, where $x_i$ denotes the input to the $i$th user, with differential privacy (DP) in the local model. This formulation captures important metrics such as Kendall's $\tau$ coefficient, Area Under Curve, Gini's mean difference, Gini's entropy, etc. We give several novel and generic algorithms for the problem, leveraging techniques from DP algorithms for linear queries.

NeurIPS Conference 2023 Conference Paper

On Differentially Private Sampling from Gaussian and Product Distributions

  • Badih Ghazi
  • Xiao Hu
  • Ravi Kumar
  • Pasin Manurangsi

We study the problem, where given a dataset of $n$ i. i. d. samples from an unknown distribution $P$, we seek to generate a sample from a distribution that is close to $P$ in total variation distance, under the constraint of differential privacy. We study the settings where $P$ is a multi-dimensional Gaussian distribution with different assumptions: known covariance, unknown bounded covariance, and unknown unbounded covariance. We present new differentially private sampling algorithms, and show that they achieve near-optimal sample complexity in the first two settings. Moreover, when $P$ is a product distribution on the binary hypercube, we obtain a pure-DP algorithm whereas only an approximate-DP algorithm (with slightly worse sample complexity) was previously known.

ICML Conference 2023 Conference Paper

On User-Level Private Convex Optimization

  • Badih Ghazi
  • Pritish Kamath
  • Ravi Kumar 0001
  • Pasin Manurangsi
  • Raghu Meka
  • Chiyuan Zhang

We introduce a new mechanism for stochastic convex optimization (SCO) with user-level differential privacy guarantees. The convergence rates of this mechanism are similar to those in the prior work of Levy et al. 2021 and Narayanan et al. 2022, but with two important improvements. Our mechanism does not require any smoothness assumptions on the loss. Furthermore, our bounds are also the first where the minimum number of users needed for user-level privacy has no dependence on the dimension and only a logarithmic dependence on the desired excess error. The main idea underlying the new mechanism is to show that the optimizers of strongly convex losses have low local deletion sensitivity, along with a new output perturbation method for functions with low local deletion sensitivity, which could be of independent interest.

NeurIPS Conference 2023 Conference Paper

Optimal Unbiased Randomizers for Regression with Label Differential Privacy

  • Ashwinkumar Badanidiyuru Varadaraja
  • Badih Ghazi
  • Pritish Kamath
  • Ravi Kumar
  • Ethan Leeman
  • Pasin Manurangsi
  • Avinash V Varadarajan
  • Chiyuan Zhang

We propose a new family of label randomizers for training regression models under the constraint of label differential privacy (DP). In particular, we leverage the trade-offs between bias and variance to construct better label randomizers depending on a privately estimated prior distribution over the labels. We demonstrate that these randomizers achieve state-of-the-art privacy-utility trade-offs on several datasets, highlighting the importance of reducing bias when training neural networks with label DP. We also provide theoretical results shedding light on the structural properties of the optimal unbiased randomizers.

ICLR Conference 2023 Conference Paper

Regression with Label Differential Privacy

  • Badih Ghazi
  • Pritish Kamath
  • Ravi Kumar 0001
  • Ethan Leeman
  • Pasin Manurangsi
  • Avinash V. Varadarajan
  • Chiyuan Zhang

We study the task of training regression models with the guarantee of _label_ differential privacy (DP). Based on a global prior distribution of label values, which could be obtained privately, we derive a label DP randomization mechanism that is optimal under a given regression loss function. We prove that the optimal mechanism takes the form of a "randomized response on bins", and propose an efficient algorithm for finding the optimal bin values. We carry out a thorough experimental evaluation on several datasets demonstrating the efficacy of our algorithm.

NeurIPS Conference 2023 Conference Paper

Sparsity-Preserving Differentially Private Training of Large Embedding Models

  • Badih Ghazi
  • Yangsibo Huang
  • Pritish Kamath
  • Ravi Kumar
  • Pasin Manurangsi
  • Amer Sinha
  • Chiyuan Zhang

As the use of large embedding models in recommendation systems and language applications increases, concerns over user data privacy have also risen. DP-SGD, a training algorithm that combines differential privacy with stochastic gradient descent, has been the workhorse in protecting user privacy without compromising model accuracy by much. However, applying DP-SGD naively to embedding models can destroy gradient sparsity, leading to reduced training efficiency. To address this issue, we present two new algorithms, DP-FEST and DP-AdaFEST, that preserve gradient sparsity during the private training of large embedding models. Our algorithms achieve substantial reductions ($10^6 \times$) in gradient size, while maintaining comparable levels of accuracy, on benchmark real-world datasets.

FOCS Conference 2023 Conference Paper

Towards Separating Computational and Statistical Differential Privacy

  • Badih Ghazi
  • Rahul Ilango
  • Pritish Kamath
  • Ravi Kumar 0001
  • Pasin Manurangsi

Computational differential privacy (CDP) is a natural relaxation of the standard notion of (statistical) differential privacy (SDP) proposed by Beimel, Nissim, and Omri (CRYPTO 2008) and Mironov, Pandey, Reingold, and Vadhan (CRYPTO 2009). In contrast to SDP, CDP only requires privacy guarantees to hold against computationally-bounded adversaries rather than computationally-unbounded statistical adversaries. Despite the question being raised explicitly in several works (e. g. , Bun, Chen, and Vadhan, TCC 2016), it has remained tantalizingly open whether there is any task achievable with the CDP notion but not the SDP notion. Even a candidate such task is unknown. Indeed, it is even unclear what the truth could be! In this work, we give the first construction of a task achievable with the CDP notion but not the SDP notion, under the following strong but plausible cryptographic assumptions: •Non-Interactive Witness Indistinguishable Proofs, •Laconic Collision-Resistant Keyless Hash Functions, •Differing-Inputs Obfuscation for Public-Coin Samplers. In particular, we construct a task for which there exists an $\varepsilon$-CDP mechanism with $\varepsilon=O(1)$ achieving $1-o(1)$ utility, but any $(\varepsilon, \delta)$-SDP mechanism, including computationally-unbounded ones, that achieves a constant utility must use either a super-constant $\varepsilon$ or an inverse-polynomially large $\delta$. To prove this, we introduce a new approach for showing that a mechanism satisfies CDP: first we show that a mechanism is “private” against a certain class of decision tree adversaries, and then we use cryptographic constructions to “lift” this into privacy against computationally bounded adversaries. We believe this approach could be useful to devise further tasks separating CDP from SDP.

NeurIPS Conference 2023 Conference Paper

User-Level Differential Privacy With Few Examples Per User

  • Badih Ghazi
  • Pritish Kamath
  • Ravi Kumar
  • Pasin Manurangsi
  • Raghu Meka
  • Chiyuan Zhang

Previous work on user-level differential privacy (DP) [Ghazi et al. NeurIPS 2021, Bun et al. STOC 2023] obtained generic algorithms that work for various learning tasks. However, their focus was on the *example-rich* regime, where the users have so many examples that each user could themselves solve the problem. In this work we consider the *example-scarce* regime, where each user has only a few examples, and obtain the following results: * For approximate-DP, we give a generic transformation of any item-level DP algorithm to a user-level DP algorithm. Roughly speaking, the latter gives a (multiplicative) savings of $O_{\varepsilon, \delta}(\sqrt{m})$ in terms of the number of users required for achieving the same utility, where $m$ is the number of examples per user. This algorithm, while recovering most known bounds for specific problems, also gives new bounds, e. g. , for PAC learning. * For pure-DP, we present a simple technique for adapting the exponential mechanism [McSherry & Talwar, FOCS 2007] to the user-level setting. This gives new bounds for a variety of tasks, such as private PAC learning, hypothesis selection, and distribution learning. For some of these problems, we show that our bounds are near-optimal.

NeurIPS Conference 2022 Conference Paper

Anonymized Histograms in Intermediate Privacy Models

  • Badih Ghazi
  • Pritish Kamath
  • Ravi Kumar
  • Pasin Manurangsi

We study the problem of privately computing the $\mbox{\it anonymized histogram}$ (a. k. a. $\mbox{\it unattributed histogram}$), which is defined as the histogram without item labels. Previous works have provided algorithms with $\ell_1$- and $\ell_2^2$-errors of $O_\varepsilon(\sqrt{n})$ in the central model of differential privacy (DP). In this work, we provide an algorithm with a nearly matching error guarantee of $\widetilde{O}_\varepsilon(\sqrt{n})$ in the shuffle DP and pan-private models. Our algorithm is very simple: it just post-processes the discrete Laplace-noised histogram! Using this algorithm as a subroutine, we show applications in privately estimating symmetric properties of distributions such as entropy, support coverage, and support size.

ICML Conference 2022 Conference Paper

Faster Privacy Accounting via Evolving Discretization

  • Badih Ghazi
  • Pritish Kamath
  • Ravi Kumar 0001
  • Pasin Manurangsi

We introduce a new algorithm for numerical composition of privacy random variables, useful for computing the accurate differential privacy parameters for compositions of mechanisms. Our algorithm achieves a running time and memory usage of $polylog(k)$ for the task of self-composing a mechanism, from a broad class of mechanisms, $k$ times; this class, e. g. , includes the sub-sampled Gaussian mechanism, that appears in the analysis of differentially private stochastic gradient descent (DP-SGD). By comparison, recent work by Gopi et al. (NeurIPS 2021) has obtained a running time of $\widetilde{O}(\sqrt{k})$ for the same task. Our approach extends to the case of composing $k$ different mechanisms in the same class, improving upon the running time and memory usage in their work from $\widetilde{O}(k^{1. 5})$ to $\wtilde{O}(k)$.

NeurIPS Conference 2022 Conference Paper

Private Isotonic Regression

  • Badih Ghazi
  • Pritish Kamath
  • Ravi Kumar
  • Pasin Manurangsi

In this paper, we consider the problem of differentially private (DP) algorithms for isotonic regression. For the most general problem of isotonic regression over a partially ordered set (poset) $\mathcal{X}$ and for any Lipschitz loss function, we obtain a pure-DP algorithm that, given $n$ input points, has an expected excess empirical risk of roughly $\mathrm{width}(\mathcal{X}) \cdot \log|\mathcal{X}| / n$, where $\mathrm{width}(\mathcal{X})$ is the width of the poset. In contrast, we also obtain a near-matching lower bound of roughly $(\mathrm{width}(\mathcal{X}) + \log |\mathcal{X}|) / n$, that holds even for approximate-DP algorithms. Moreover, we show that the above bounds are essentially the best that can be obtained without utilizing any further structure of the poset. In the special case of a totally ordered set and for $\ell_1$ and $\ell_2^2$ losses, our algorithm can be implemented in near-linear running time; we also provide extensions of this algorithm to the problem of private isotonic regression with additional structural constraints on the output function.

AAAI Conference 2022 Conference Paper

Private Rank Aggregation in Central and Local Models

  • Daniel Alabi
  • Badih Ghazi
  • Ravi Kumar
  • Pasin Manurangsi

In social choice theory, (Kemeny) rank aggregation is a wellstudied problem where the goal is to combine rankings from multiple voters into a single ranking on the same set of items. Since rankings can reveal preferences of voters (which a voter might like to keep private), it is important to aggregate preferences in such a way to preserve privacy. In this work, we present differentially private algorithms for rank aggregation in the pure and approximate settings along with distributionindependent utility upper and lower bounds. In addition to bounds in the central model, we also present utility bounds for the local model of differential privacy.

NeurIPS Conference 2021 Conference Paper

Deep Learning with Label Differential Privacy

  • Badih Ghazi
  • Noah Golowich
  • Ravi Kumar
  • Pasin Manurangsi
  • Chiyuan Zhang

The Randomized Response (RR) algorithm is a classical technique to improve robustness in survey aggregation, and has been widely adopted in applications with differential privacy guarantees. We propose a novel algorithm, Randomized Response with Prior (RRWithPrior), which can provide more accurate results while maintaining the same level of privacy guaranteed by RR. We then apply RRWithPrior to learn neural networks with label differential privacy (LabelDP), and show that when only the label needs to be protected, the model performance can be significantly improved over the previous state-of-the-art private baselines. Moreover, we study different ways to obtain priors, which when used with RRWithPrior can additionally improve the model performance, further reducing the accuracy gap between private and non-private models. We complement the empirical results with theoretical analysis showing that LabelDP is provably easier than protecting both the inputs and labels.

ICML Conference 2021 Conference Paper

Differentially Private Aggregation in the Shuffle Model: Almost Central Accuracy in Almost a Single Message

  • Badih Ghazi
  • Ravi Kumar 0001
  • Pasin Manurangsi
  • Rasmus Pagh
  • Amer Sinha

The shuffle model of differential privacy has attracted attention in the literature due to it being a middle ground between the well-studied central and local models. In this work, we study the problem of summing (aggregating) real numbers or integers, a basic primitive in numerous machine learning tasks, in the shuffle model. We give a protocol achieving error arbitrarily close to that of the (Discrete) Laplace mechanism in central differential privacy, while each user only sends 1 + o(1) short messages in expectation.

ICML Conference 2021 Conference Paper

Locally Private k-Means in One Round

  • Alisa Chang
  • Badih Ghazi
  • Ravi Kumar 0001
  • Pasin Manurangsi

We provide an approximation algorithm for k-means clustering in the \emph{one-round} (aka \emph{non-interactive}) local model of differential privacy (DP). Our algorithm achieves an approximation ratio arbitrarily close to the best \emph{non private} approximation algorithm, improving upon previously known algorithms that only guarantee large (constant) approximation ratios. Furthermore, ours is the first constant-factor approximation algorithm for k-means that requires only \emph{one} round of communication in the local DP model, positively resolving an open question of Stemmer (SODA 2020). Our algorithmic framework is quite flexible; we demonstrate this by showing that it also yields a similar near-optimal approximation algorithm in the (one-round) shuffle DP model.

NeurIPS Conference 2021 Conference Paper

User-Level Differentially Private Learning via Correlated Sampling

  • Badih Ghazi
  • Ravi Kumar
  • Pasin Manurangsi

Most works in learning with differential privacy (DP) have focused on the setting where each user has a single sample. In this work, we consider the setting where each user holds $m$ samples and the privacy protection is enforced at the level of each user's data. We show that, in this setting, we may learn with a much fewer number of users. Specifically, we show that, as long as each user receives sufficiently many samples, we can learn any privately learnable class via an $(\epsilon, \delta)$-DP algorithm using only $O(\log(1/\delta)/\epsilon)$ users. For $\epsilon$-DP algorithms, we show that we can learn using only $O_{\epsilon}(d)$ users even in the local model, where $d$ is the probabilistic representation dimension. In both cases, we show a nearly-matching lower bound on the number of users required. A crucial component of our results is a generalization of global stability [Bun, Livni, Moran, FOCS 2020] that allows the use of public randomness. Under this relaxed notion, we employ a correlated sampling strategy to show that the global stability can be boosted to be arbitrarily close to one, at a polynomial expense in the number of samples.

NeurIPS Conference 2020 Conference Paper

Differentially Private Clustering: Tight Approximation Ratios

  • Badih Ghazi
  • Ravi Kumar
  • Pasin Manurangsi

We study the task of differentially private clustering. For several basic clustering problems, including Euclidean DensestBall, 1-Cluster, k-means, and k-median, we give efficient differentially private algorithms that achieve essentially the same approximation ratios as those that can be obtained by any non-private algorithm, while incurring only small additive errors. This improves upon existing efficient algorithms that only achieve some large constant approximation factors. Our results also imply an improved algorithm for the Sample and Aggregate privacy framework. Furthermore, we show that one of the tools used in our 1-Cluster algorithm can be employed to get a faster quantum algorithm for ClosestPair in a moderate number of dimensions.

ICML Conference 2020 Conference Paper

Private Counting from Anonymous Messages: Near-Optimal Accuracy with Vanishing Communication Overhead

  • Badih Ghazi
  • Ravi Kumar 0001
  • Pasin Manurangsi
  • Rasmus Pagh

Differential privacy (DP) is a formal notion for quantifying the privacy loss of algorithms. Algorithms in the central model of DP achieve high accuracy but make the strongest trust assumptions whereas those in the local DP model make the weakest trust assumptions but incur substantial accuracy loss. The shuffled DP model [Bittau et al 2017, Erlingsson et al 2019, Cheu et al 19] has recently emerged as a feasible middle ground between the central and local models, providing stronger trust assumptions than the former while promising higher accuracies than the latter. In this paper, we obtain practical communication-efficient algorithms in the shuffled DP model for two basic aggregation primitives used in machine learning: 1) binary summation, and 2) histograms over a moderate number of buckets. Our algorithms achieve accuracy that is arbitrarily close to that of central DP algorithms with an expected communication per user essentially matching what is needed without any privacy constraints! We demonstrate the practicality of our algorithms by experimentally evaluating them and comparing their performance to several widely-used protocols such as Randomized Response [Warner 1965] and RAPPOR [Erlingsson et al. 2014].

ICML Conference 2019 Conference Paper

Recursive Sketches for Modular Deep Learning

  • Badih Ghazi
  • Rina Panigrahy
  • Joshua R. Wang

We present a mechanism to compute a sketch (succinct summary) of how a complex modular deep network processes its inputs. The sketch summarizes essential information about the inputs and outputs of the network and can be used to quickly identify key components and summary statistics of the inputs. Furthermore, the sketch is recursive and can be unrolled to identify sub-components of these components and so forth, capturing a potentially complicated DAG structure. These sketches erase gracefully; even if we erase a fraction of the sketch at random, the remainder still retains the “high-weight” information present in the original sketch. The sketches can also be organized in a repository to implicitly form a “knowledge graph”; it is possible to quickly retrieve sketches in the repository that are related to a sketch of interest; arranged in this fashion, the sketches can also be used to learn emerging concepts by looking for new clusters in sketch space. Finally, in the scenario where we want to learn a ground truth deep network, we show that augmenting input/output pairs with these sketches can theoretically make it easier to do so.

SODA Conference 2018 Conference Paper

Resource-Efficient Common Randomness and Secret-Key Schemes

  • Badih Ghazi
  • T. S. Jayram

We study common randomness where two parties have access to i. i. d. samples from a known random source, and wish to generate a shared random key using limited (or no) communication with the largest possible probability of agreement. This problem is at the core of secret key generation in cryptography, with connections to communication under uncertainty and locality sensitive hashing. We take the approach of treating correlated sources as a critical resource, and ask whether common randomness can be generated resource-efficiently. We consider two notable sources in this setup arising from correlated bits and correlated Gaussians. We design the first explicit schemes that use only a polynomial number of samples (in the key length) so that the players can generate shared keys that agree with constant probability using optimal communication. The best previously known schemes were both non-constructive and used an exponential number of samples. In the amortized setting, we characterize the largest achievable ratio of key length to communication in terms of the external and internal information costs, two well-studied quantities in theoretical computer science. In the relaxed setting where the two parties merely wish to improve the correlation between the generated keys of length k, we show that there are no interactive protocols using o ( k ) bits of communication having agreement probability even as small as 2 – o ( k ). For the related communication problem where the players wish to compute a joint function f of their inputs using i. i. d samples from a known source, we give a simultaneous message passing protocol using 2 O ( c ) bits where c is the interactive randomized public-coin communication complexity of f. This matches the lower bound shown previously while the best previously known upper bound was doubly exponential in c. Our schemes reveal a new connection between common randomness and unbiased error-correcting codes, e. g. , dual-BCH codes and their analogues in Euclidean space.

SODA Conference 2016 Conference Paper

Communication Complexity of Permutation-Invariant Functions

  • Badih Ghazi
  • Pritish Kamath
  • Madhu Sudan 0001

Motivated by the quest for a broader understanding of upper bounds in communication complexity, at least for simple functions, we introduce the class of “permutation-invariant” functions. A partial function f: {0, 1} n × {0, 1} n → {0, 1, ?} is permutation-invariant if for every bijection π: {1, …, n } → {1, …, n } and every x, y ∊ {0, 1} n, it is the case that f (x, y) = f (x π, y π ). Most of the commonly studied functions in communication complexity are permutation-invariant. For such functions, we present a simple complexity measure (computable in time polynomial in n given an implicit description of f ) that describes their communication complexity up to polynomial factors and up to an additive error that is logarithmic in the input size. This gives a coarse taxonomy of the communication complexity of simple functions. Our work highlights the role of the well-known lower bounds of functions such as S et -D isjointness and I ndexing, while complementing them with the relatively lesser-known upper bounds for G ap -I nner -P roduct (from the sketching literature) and S parse -G ap -I nner -P roduct (from the recent work of Canonne et al. [ITCS 2015]). We also present consequences to the study of communication complexity with imperfectly shared randomness where we show that for total permutation-invariant functions, imperfectly shared randomness results in only a polynomial blow-up in communication complexity after an additive O (log log n ) overhead.

SODA Conference 2016 Conference Paper

Communication with Contextual Uncertainty

  • Badih Ghazi
  • Ilan Komargodski
  • Pravesh K. Kothari
  • Madhu Sudan 0001

We introduce a simple model illustrating the role of context in communication and the challenge posed by uncertainty of knowledge of context. We consider a variant of distributional communication complexity where Alice gets some information x and Bob gets y, where ( x, y ) is drawn from a known distribution, and Bob wishes to compute some function g ( x, y ) (with high probability over ( x, y )). In our variant, Alice does not know g, but only knows some function f which is an approximation of g. Thus, the function being computed forms the context for the communication, and knowing it imperfectly models (mild) uncertainty in this context. A naive solution would be for Alice and Bob to first agree on some common function h that is close to both f and g and then use a protocol for h to compute h ( x, y ). We show that any such agreement leads to a large overhead in communication ruling out such a universal solution. In contrast, we show that if g has a one-way communication protocol with complexity k in the standard setting, then it has a communication protocol with complexity O ( k · (1 + I )) in the uncertain setting, where I denotes the mutual information between x and y. In the particular case where the input distribution is a product distribution, the protocol in the uncertain setting only incurs a constant factor blow-up in communication and error. Furthermore, we show that the dependence on the mutual information I is required. Namely, we construct a class of functions along with a non-product distribution over ( x, y ) for which the communication complexity is a single bit in the standard setting but at least bits in the uncertain setting.

FOCS Conference 2016 Conference Paper

Decidability of Non-interactive Simulation of Joint Distributions

  • Badih Ghazi
  • Pritish Kamath
  • Madhu Sudan 0001

We present decidability results for a sub-class of "non-interactive" simulation problems, a well-studied class of problems in information theory. A non-interactive simulation problem is specified by two distributions P(x, y) and Q(u, v): The goal is to determine if two players, Alice and Bob, that observe sequences Xn and Yn respectively where {(Xi, Yi)}ni = 1 are drawn i. i. d. from P(x, y) can generate pairs U and V respectively (without communicating with each other) with a joint distribution that is arbitrarily close in total variation to Q(u, v). Even when P and Q are extremely simple: e. g. , P is uniform on the triples (0, 0), (0, 1), (1, 0) and Q is a "doubly symmetric binary source", i. e. , U and V are uniform ± 1 variables with correlation say 0. 49, it is open if P can simulate Q. In this work, we show that whenever P is a distribution on a finite domain and Q is a 2 × 2 distribution, then the non-interactive simulation problem is decidable: specifically, given δ > 0 the algorithm runs in time bounded by some function of P and δ and either gives a non-interactive simulation protocol that is δ-close to Q or asserts that no protocol gets O(δ)-close to Q. The main challenge to such a result is determining explicit (computable) convergence bounds on the number n of samples that need to be drawn from P(x, y) to get δ-close to Q. We invoke contemporary results from the analysis of Boolean functions such as the invariance principle and a regularity lemma to obtain such explicit bounds.

FOCS Conference 2016 Conference Paper

NP-Hardness of Reed-Solomon Decoding and the Prouhet-Tarry-Escott Problem

  • Venkata Gandikota
  • Badih Ghazi
  • Elena Grigorescu

Establishing the complexity of Bounded Distance Decoding for Reed-Solomon codes is a fundamental open problem in coding theory, explicitly asked by Guruswami and Vardy (IEEE Trans. Inf. Theory, 2005). The problem is motivated by the large current gap between the regime when it is NP-hard, and the regime when it is efficiently solvable (i. e. , the Johnson radius). We show the first NP-hardness results for asymptotically smaller decoding radii than the maximum likelihood decoding radius of Guruswami and Vardy. Specifically, for Reed-Solomon codes of length N and dimension K = O(N), we show that it is NP-hard to decode more than N-K-O/log N log log N) errors. Moreover, we show that the problem is NP-hard under quasipolynomial-time reductions for an error amount > N-K-c log N (with c > 0 an absolute constant). An alternative natural reformulation of the Bounded Distance Decoding problem for Reed-Solomon codes is as a Polynomial Reconstruction problem. In this view, our results show that it is NP-hard to decide whether there exists a degree K polynomial passing through K + O(log N / log log N) points from a given set of points (a1, b1), (a2, b2). .. , (aN, bN). Furthermore, it is NP-hard under quasipolynomial-time reductions to decide whether there is a degree K polynomial passing through K + c log N many points (with c > 0 an absolute constant). These results follow from the NP-hardness of a generalization of the classical Subset Sum problem to higher moments, called Moments Subset Sum, which has been a known open problem, and which may be of independent interest. We further reveal a strong connection with the well-studied Prouhet-Tarry-Escott problem in Number Theory, which turns out to capture a main barrier in extending our techniques. We believe the Prouhet-Tarry-Escott problem deserves further study in the theoretical computer science community.