Author name cluster

David Woodruff

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

34 papers

1 author row

NeurIPS Conference 2025 Conference Paper

Nearly-Linear Time and Massively Parallel Algorithms for $k$-anonymity

Kevin Aydin
Honghao Lin
David Woodruff
Peilin Zhong

$k$-anonymity is a widely-used privacy-preserving concept that ensures each record in a dataset is indistinguishable from at least $k-1$ other records. In this paper, we revisit $k$-anonymity by suppression and give an $O(k)$-approximation algorithm with a nearly-linear runtime of $\tilde{O}(nd + n^{1+1/C^2}/k^{1/C^2})$ for an arbitrary constant $C$, where $n$ is the number of records and $d$ is the number of attributes. Previous algorithms with provable guarantees either (1) achieve the same $O(k)$ approximation ratio but require at least $O(n^2 k)$ runtime, or (2) provide a better $O(\log k)$ approximation ratio at the cost of an impractical $O(n^{2k})$ worst-case runtime for general $d$ and $k$. Our algorithm extends to the Massively Parallel Computation (MPC) model, where it can be adapted into an MPC algorithm requiring $\tilde{O}(\log^{1+\epsilon} n)$ rounds and total space $O(n^{1+1/C^2}(d+k))$. Empirically, we also demonstrate that our algorithmic ideas can be adapted to existing heuristic methods, leading to significant speed-ups while preserving comparable performance. Although~\citep{PS07} introduced improvements to achieve more practical runtimes for their $O(\log k)$-approximation algorithm, its worst-case runtime remains $O(n^{2k})$. A natural question arises: can we develop an algorithm with an $o(k)$ approximation ratio and a polynomial runtime? We investigate the single-point $k$-anonymity problem, where the goal is to select $k-1$ additional records to make a given record indistinguishable. Surprisingly, assuming the dense vs random conjecture in complexity theory, we show that for $n = k^c$, no algorithm can achieve a $k^{1 - O(1/c)}$ approximation in $\mathrm{poly}(k)$ time. This provides evidence of the inherent hardness of the $k$-anonymity problem.

NeurIPS Conference 2025 Conference Paper

Query-Efficient Locally Private Hypothesis Selection via the Scheffe Graph

Gautam Kamath
Alireza F. Pour
Matthew Regehr
David Woodruff

We propose an algorithm with improved query-complexity for the problem of hypothesis selection under local differential privacy constraints. Given a set of $k$ probability distributions $Q$, we describe an algorithm that satisfies local differential privacy, performs $\tilde{O}(k^{3/2})$ non-adaptive queries to individuals who each have samples from a probability distribution $p$, and outputs a probability distribution from the set $Q$ which is nearly the closest to $p$. Previous algorithms required either $\Omega(k^2)$ queries or many rounds of interactive queries. Technically, we introduce a new object we dub the Scheff\'e graph, which captures structure of the differences between distributions in $Q$, and may be of more broad interest for hypothesis selection tasks.

NeurIPS Conference 2024 Conference Paper

Communication Bounds for the Distributed Experts Problem

Zhihao Jia
Qi Pang
Trung Tran
David Woodruff
Zhihao Zhang
Wenting Zheng

In this work, we study the experts problem in the distributed setting where an expert's cost needs to be aggregated across multiple servers. Our study considers various communication models such as the message-passing model and the broadcast model, along with multiple aggregation functions, such as summing and taking the $\ell_p$ norm of an expert's cost across servers. We propose the first communication-efficient protocols that achieve near-optimal regret in these settings, even against a strong adversary who can choose the inputs adaptively. Additionally, we give a conditional lower bound showing that the communication of our protocols is nearly optimal. Finally, we implement our protocols and demonstrate empirical savings on the HPO-B benchmarks.

PDF Details DOI

NeurIPS Conference 2023 Conference Paper

Computing Approximate $\ell_p$ Sensitivities

Swati Padmanabhan
David Woodruff
Richard Zhang

Recent works in dimensionality reduction for regression tasks have introduced the notion of sensitivity, an estimate of the importance of a specific datapoint in a dataset, offering provable guarantees on the quality of the approximation after removing low-sensitivity datapoints via subsampling. However, fast algorithms for approximating sensitivities, which we show is equivalent to approximate regression, are known for only the $\ell_2$ setting, in which they are popularly termed leverage scores. In this work, we provide the first efficient algorithms for approximating $\ell_p$ sensitivities and other summary statistics of a given matrix. In particular, for a given $n \times d$ matrix, we compute $\alpha$-approximation to its $\ell_1$ sensitivities at the cost of $n/\alpha$ sensitivity computations. For estimating the total $\ell_p$ sensitivity (i. e. the sum of $\ell_p$ sensitivities), we provide an algorithm based on importance sampling of $\ell_p$ Lewis weights, which computes a constant factor approximation at the cost of roughly $\sqrt{d}$ sensitivity computations, with no polynomial dependence on $n$. Furthermore, we estimate the maximum $\ell_1$ sensitivity up to a $\sqrt{d}$ factor in $O(d)$ sensitivity computations. We also generalize these results to $\ell_p$ norms. Lastly, we experimentally show that for a wide class of structured matrices in real-world datasets, the total sensitivity can be quickly approximated and is significantly smaller than the theoretical prediction, demonstrating that real-world datasets have on average low intrinsic effective dimensionality.

NeurIPS Conference 2023 Conference Paper

Hardness of Low Rank Approximation of Entrywise Transformed Matrix Products

Tamas Sarlos
Xingyou Song
David Woodruff
Richard Zhang

Inspired by fast algorithms in natural language processing, we study low rank approximation in the entrywise transformed setting where we want to find a good rank $k$ approximation to $f(U \cdot V)$, where $U, V^\top \in \mathbb{R}^{n \times r}$ are given, $r = O(\log(n))$, and $f(x)$ is a general scalar function. Previous work in sublinear low rank approximation has shown that if both (1) $U = V^\top$ and (2) $f(x)$ is a PSD kernel function, then there is an $O(nk^{\omega-1})$ time constant relative error approximation algorithm, where $\omega \approx 2. 376$ is the exponent of matrix multiplication. We give the first conditional time hardness results for this problem, demonstrating that both conditions (1) and (2) are in fact necessary for getting better than $n^{2-o(1)}$ time for a relative error low rank approximation for a wide class of functions. We give novel reductions from the Strong Exponential Time Hypothesis (SETH) that rely on lower bounding the leverage scores of flat sparse vectors and hold even when the rank of the transformed matrix $f(UV)$ and the target rank are $n^{o(1)}$, and when $U = V^\top$. Furthermore, even when $f(x) = x^p$ is a simple polynomial, we give runtime lower bounds in the case when $U \neq V^\top$ of the form $\Omega(\min(n^{2-o(1)}, \Omega(2^p)))$. Lastly, we demonstrate that our lower bounds are tight by giving an $O(n \cdot \text{poly}(k, 2^p, 1/\epsilon))$ time relative error approximation algorithm and a fast $O(n \cdot \text{poly}(k, p, 1/\epsilon))$ additive error approximation using fast tensor-based sketching. Additionally, since our low rank algorithms rely on matrix-vector product subroutines, our lower bounds extend to show that computing $f(UV)W$, for even a small matrix $W$, requires $\Omega(n^{2-o(1)})$ time.

NeurIPS Conference 2023 Conference Paper

Lower Bounds on Adaptive Sensing for Matrix Recovery

Praneeth Kacham
David Woodruff

We study lower bounds on adaptive sensing algorithms for recovering low rank matrices using linear measurements. Given an $n \times n$ matrix $A$, a general linear measurement $S(A)$, for an $n \times n$ matrix $S$, is just the inner product of $S$ and $A$, each treated as $n^2$-dimensional vectors. By performing as few linear measurements as possible on a rank-$r$ matrix $A$, we hope to construct a matrix $\hat{A}$ that satisfies $|A - \hat{A}|\_F^2 \le c |A|\_F^2$, for a small constant $c$. Here $|A|\_F$ denotes the Frobenius norm $(\sum_{i, j} A_{i, j}^2)^{1/2}$. It is commonly assumed that when measuring $A$ with $S$, the response is corrupted with an independent Gaussian random variable of mean $0$ and variance $\sigma^2$. Candès and Plan (IEEE Trans. Inform. Theory 2011) study non-adaptive algorithms for low rank matrix recovery using random linear measurements. They use the restricted isometry property (RIP) of Random Gaussian Matrices to give tractable algorithms to estimate $A$ from the measurements. At the edge of the noise level where recovery is information-theoretically feasible, it is known that their non-adaptive algorithms need to perform $\Omega(n^2)$ measurements, which amounts to reading the entire matrix. An important question is whether adaptivity helps in decreasing the overall number of measurements. While for the related problem of sparse recovery, adaptive algorithms have been extensively studied, as far as we are aware adaptive algorithms and lower bounds on them seem largely unexplored for matrix recovery. We show that any adaptive algorithm that uses $k$ linear measurements in each round and outputs an approximation as in (1) with probability $\ge 9/10$ must run for $t = \Omega(\log(n^2/k)/\log\log n)$ rounds. Our lower bound shows that any adaptive algorithm which uses $n^{2-\beta}$ ($\beta > 0$ is arbitrary constant) linear measurements in each round must run for $\Omega(\log n/\log\log n)$ rounds. Our techniques also readily extend to obtain lower bounds on adaptive algorithms for tensor recovery. Our hard distribution also allows us to give a measurement-vs-rounds trade-off for many sensing problems in numerical linear algebra, such as spectral norm low rank approximation, Frobenius norm low rank approximation, singular vector approximation, and more.

NeurIPS Conference 2023 Conference Paper

Near-Optimal $k$-Clustering in the Sliding Window Model

David Woodruff
Peilin Zhong
Samson Zhou

Clustering is an important technique for identifying structural information in large-scale data analysis, where the underlying dataset may be too large to store. In many applications, recent data can provide more accurate information and thus older data past a certain time is expired. The sliding window model captures these desired properties and thus there has been substantial interest in clustering in the sliding window model. In this paper, we give the first algorithm that achieves near-optimal $(1+\varepsilon)$-approximation to $(k, z)$-clustering in the sliding window model. Our algorithm uses $\frac{k}{\min(\varepsilon^4, \varepsilon^{2+z})}\, \text{polylog}\frac{n\Delta}{\varepsilon}$ words of space when the points are from $[\Delta]^d$, thus significantly improving on works by Braverman et. al. (SODA 2016), Borassi et. al. (NeurIPS 2021), and Epasto et. al. (SODA 2022). Along the way, we develop a data structure for clustering called an online coreset, which outputs a coreset not only for the end of a stream, but also for all prefixes of the stream. Our online coreset samples $\frac{k}{\min(\varepsilon^4, \varepsilon^{2+z})}\, \text{polylog}\frac{n\Delta}{\varepsilon}$ points from the stream. We then show that any online coreset requires $\Omega\left(\frac{k}{\varepsilon^2}\log n\right)$ samples, which shows a separation between the problem of constructing an offline coreset, i. e. , constructing online coresets is strictly harder. Our results also extend to general metrics on $[\Delta]^d$ and are near-optimal in light of a $\Omega\left(\frac{k}{\varepsilon^{2+z}}\right)$ lower bound for the size of an offline coreset.

NeurIPS Conference 2023 Conference Paper

On Robust Streaming for Learning with Experts: Algorithms and Lower Bounds

David Woodruff
Fred Zhang
Samson Zhou

In the online learning with experts problem, an algorithm makes predictions about an outcome on each of $T$ days, given a set of $n$ experts who make predictions on each day. The algorithm is given feedback on the outcomes of each day, including the cost of its prediction and the cost of the expert predictions, and the goal is to make a prediction with the minimum cost, compared to the best expert in hindsight. However, often the predictions made by experts or algorithms at some time influence future outcomes, so that the input is adaptively generated. In this paper, we study robust algorithms for the experts problem under memory constraints. We first give a randomized algorithm that is robust to adaptive inputs that uses $\widetilde{O}\left(\frac{n}{R\sqrt{T}}\right)$ space for $M=O\left(\frac{R^2 T}{\log^2 n}\right)$, thereby showing a smooth space-regret trade-off. We then show a space lower bound of $\widetilde{\Omega}\left(\frac{nM}{RT}\right)$ for any randomized algorithm that achieves regret $R$ with probability $1-2^{-\Omega(T)}$, when the best expert makes $M$ mistakes. Our result implies that the natural deterministic algorithm, which iterates through pools of experts until each expert in the pool has erred, is optimal up to polylogarithmic factors. Finally, we empirically demonstrate the benefit of using robust procedures against a white-box adversary that has access to the internal state of the algorithm.

NeurIPS Conference 2023 Conference Paper

Sketching Algorithms for Sparse Dictionary Learning: PTAS and Turnstile Streaming

Gregory Dexter
Petros Drineas
David Woodruff
Taisuke Yasuda

Sketching algorithms have recently proven to be a powerful approach both for designing low-space streaming algorithms as well as fast polynomial time approximation schemes (PTAS). In this work, we develop new techniques to extend the applicability of sketching-based approaches to the sparse dictionary learning and the Euclidean $k$-means clustering problems. In particular, we initiate the study of the challenging setting where the dictionary/clustering assignment for each of the $n$ input points must be output, which has surprisingly received little attention in prior work. On the fast algorithms front, we obtain a new approach for designing PTAS's for the $k$-means clustering problem, which generalizes to the first PTAS for the sparse dictionary learning problem. On the streaming algorithms front, we obtain new upper bounds and lower bounds for dictionary learning and $k$-means clustering. In particular, given a design matrix $\mathbf A\in\mathbb R^{n\times d}$ in a turnstile stream, we show an $\tilde O(nr/\epsilon^2 + dk/\epsilon)$ space upper bound for $r$-sparse dictionary learning of size $k$, an $\tilde O(n/\epsilon^2 + dk/\epsilon)$ space upper bound for $k$-means clustering, as well as an $\tilde O(n)$ space upper bound for $k$-means clustering on random order row insertion streams with a natural "bounded sensitivity" assumption. On the lower bounds side, we obtain a general $\tilde\Omega(n/\epsilon + dk/\epsilon)$ lower bound for $k$-means clustering, as well as an $\tilde\Omega(n/\epsilon^2)$ lower bound for algorithms which can estimate the cost of a single fixed set of candidate centers.

NeurIPS Conference 2022 Conference Paper

Optimal Query Complexities for Dynamic Trace Estimation

David Woodruff
Fred Zhang
Richard Zhang

We consider the problem of minimizing the number of matrix-vector queries needed for accurate trace estimation in the dynamic setting where our underlying matrix is changing slowly, such as during an optimization process. Specifically, for any $m$ matrices $\mathbf{A}_1, .. ., \mathbf{A}_m$ with consecutive differences bounded in Schatten-$1$ norm by $\alpha$, we provide a novel binary tree summation procedure that simultaneously estimates all $m$ traces up to $\epsilon$ error with $\delta$ failure probability with an optimal query complexity of $\widetilde{O}(m \alpha\sqrt{\log(1/\delta)}/\epsilon + m\log(1/\delta))$, improving the dependence on both $\alpha$ and $\delta$ from Dharangutte and Musco (NeurIPS, 2021). Our procedure works without additional norm bounds on $\mathbf{A}_i$ and can be generalized to a bound for the $p$-th Schatten norm for $p \in [1, 2]$, giving a complexity of $\widetilde{O}(m \alpha(\sqrt{\log(1/\delta)}/\epsilon)^p +m \log(1/\delta))$. By using novel reductions to communication complexity and information-theoretic analyses of Gaussian matrices, we provide matching lower bounds for static and dynamic trace estimation in all relevant parameters, including the failure probability. Our lower bounds (1) give the first tight bounds for Hutchinson's estimator in the matrix-vector product model with Frobenius norm error {\it even in the static setting}, and (2) are the first unconditional lower bounds for dynamic trace estimation, resolving open questions of prior work.

NeurIPS Conference 2021 Conference Paper

Few-Shot Data-Driven Algorithms for Low Rank Approximation

Piotr Indyk
Tal Wagner
David Woodruff

Recently, data-driven and learning-based algorithms for low rank matrix approximation were shown to outperform classical data-oblivious algorithms by wide margins in terms of accuracy. Those algorithms are based on the optimization of sparse sketching matrices, which lead to large savings in time and memory during testing. However, they require long training times on a large amount of existing data, and rely on access to specialized hardware and software. In this work, we develop new data-driven low rank approximation algorithms with better computational efficiency in the training phase, alleviating these drawbacks. Furthermore, our methods are interpretable: while previous algorithms choose the sketching matrix either at random or by black-box learning, we show that it can be set (or initialized) to clearly interpretable values extracted from the dataset. Our experiments show that our algorithms, either by themselves or in combination with previous methods, achieve significant empirical advantage over previous work, improving training times by up to an order of magnitude toward achieving the same target accuracy.

NeurIPS Conference 2021 Conference Paper

Linear and Kernel Classification in the Streaming Model: Improved Bounds for Heavy Hitters

Arvind Mahankali
David Woodruff

We study linear and kernel classification in the streaming model. For linear classification, we improve upon the algorithm of (Tai, et al. 2018), which solves the $\ell_1$ point query problem on the optimal weight vector $w_* \in \mathbb{R}^d$ in sublinear space. We first give an algorithm solving the more difficult $\ell_2$ point query problem on $w_*$, also in sublinear space. We also give an algorithm which solves the $\ell_2$ heavy hitter problem on $w_*$, in sublinear space and running time. Finally, we give an algorithm which can $\textit{deterministically}$ solve the $\ell_1$ point query problem on $w_*$, with sublinear space improving upon that of (Tai, et al. 2018). For kernel classification, if $w_* \in \mathbb{R}^{d^p}$ is the optimal weight vector classifying points in the stream according to their $p^{th}$-degree polynomial kernel, then we give an algorithm solving the $\ell_2$ point query problem on $w_*$ in $\text{poly}(\frac{p \log d}{\varepsilon})$ space, and an algorithm solving the $\ell_2$ heavy hitter problem in $\text{poly}(\frac{p \log d}{\varepsilon})$ space and running time. Note that our space and running time are polynomial in $p$, making our algorithms well-suited to high-degree polynomial kernels and the Gaussian kernel (approximated by the polynomial kernel of degree $p = \Theta(\log T)$).

NeurIPS Conference 2021 Conference Paper

Optimal Sketching for Trace Estimation

Shuli Jiang
Hai Pham
David Woodruff
Richard Zhang

Matrix trace estimation is ubiquitous in machine learning applications and has traditionally relied on Hutchinson's method, which requires $O(\log(1/\delta)/\epsilon^2)$ matrix-vector product queries to achieve a $(1 \pm \epsilon)$-multiplicative approximation to $\text{trace}(A)$ with failure probability $\delta$ on positive-semidefinite input matrices $A$. Recently, the Hutch++ algorithm was proposed, which reduces the number of matrix-vector queries from $O(1/\epsilon^2)$ to the optimal $O(1/\epsilon)$, and the algorithm succeeds with constant probability. However, in the high probability setting, the non-adaptive Hutch++ algorithm suffers an extra $O(\sqrt{\log(1/\delta)})$ multiplicative factor in its query complexity. Non-adaptive methods are important, as they correspond to sketching algorithms, which are mergeable, highly parallelizable, and provide low-memory streaming algorithms as well as low-communication distributed protocols. In this work, we close the gap between non-adaptive and adaptive algorithms, showing that even non-adaptive algorithms can achieve $O(\sqrt{\log(1/\delta)}/\epsilon + \log(1/\delta))$ matrix-vector products. In addition, we prove matching lower bounds demonstrating that, up to a $\log \log(1/\delta)$ factor, no further improvement in the dependence on $\delta$ or $\epsilon$ is possible by any non-adaptive algorithm. Finally, our experiments demonstrate the superior performance of our sketch over the adaptive Hutch++ algorithm, which is less parallelizable, as well as over the non-adaptive Hutchinson's method.

NeurIPS Conference 2020 Conference Paper

Revisiting the Sample Complexity of Sparse Spectrum Approximation of Gaussian Processes

Minh Hoang
Nghia Hoang
Hai Pham
David Woodruff

We introduce a new scalable approximation for Gaussian processes with provable guarantees which holds simultaneously over its entire parameter space. Our approximation is obtained from an improved sample complexity analysis for sparse spectrum Gaussian processes (SSGPs). In particular, our analysis shows that under a certain data disentangling condition, an SSGP's prediction and model evidence (for training) can well-approximate those of a full GP with low sample complexity. We also develop a new auto-encoding algorithm that finds a latent space to disentangle latent input coordinates into well-separated clusters, which is amenable to our sample complexity analysis. We validate our proposed method on several benchmarks with promising results supporting our theoretical analysis.

NeurIPS Conference 2020 Conference Paper

WOR and $p$'s: Sketches for $\ell_p$-Sampling Without Replacement

Edith Cohen
Rasmus Pagh
David Woodruff

Weighted sampling is a fundamental tool in data analysis and machine learning pipelines. Samples are used for efficient estimation of statistics or as sparse representations of the data. When weight distributions are skewed, as is often the case in practice, without-replacement (WOR) sampling is much more effective than with-replacement (WR) sampling: It provides a broader representation and higher accuracy for the same number of samples. We design novel composable sketches for WOR {\em $\ell_p$ sampling}, weighted sampling of keys according to a power $p\in[0, 2]$ of their frequency (or for signed data, sum of updates). Our sketches have size that grows only linearly with sample size. Our design is simple and practical, despite intricate analysis, and based on off-the-shelf use of widely implemented heavy hitters sketches such as \texttt{CountSketch}. Our method is the first to provide WOR sampling in the important regime of $p>1$ and the first to handle signed updates for $p>0$.

NeurIPS Conference 2019 Conference Paper

Average Case Column Subset Selection for Entrywise $\ell_1$-Norm Loss

Zhao Song
David Woodruff
Peilin Zhong

We study the column subset selection problem with respect to the entrywise $\ell_1$-norm loss. It is known that in the worst case, to obtain a good rank-$k$ approximation to a matrix, one needs an arbitrarily large $n^{\Omega(1)}$ number of columns to obtain a $(1+\epsilon)$-approximation to an $n \times n$ matrix. Nevertheless, we show that under certain minimal and realistic distributional settings, it is possible to obtain a $(1+\epsilon)$-approximation with a nearly linear running time and poly$(k/\epsilon)+O(k\log n)$ columns. Namely, we show that if the input matrix $A$ has the form $A = B + E$, where $B$ is an arbitrary rank-$k$ matrix, and $E$ is a matrix with i. i. d. entries drawn from any distribution $\mu$ for which the $(1+\gamma)$-th moment exists, for an arbitrarily small constant $\gamma > 0$, then it is possible to obtain a $(1+\epsilon)$-approximate column subset selection to the entrywise $\ell_1$-norm in nearly linear time. Conversely we show that if the first moment does not exist, then it is not possible to obtain a $(1+\epsilon)$-approximate subset selection algorithm even if one chooses any $n^{o(1)}$ columns. This is the first algorithm of any kind for achieving a $(1+\epsilon)$-approximation for entrywise $\ell_1$-norm loss low rank approximation.

NeurIPS Conference 2019 Conference Paper

Efficient and Thrifty Voting by Any Means Necessary

Debmalya Mandal
Ariel Procaccia
Nisarg Shah
David Woodruff

We take an unorthodox view of voting by expanding the design space to include both the elicitation rule, whereby voters map their (cardinal) preferences to votes, and the aggregation rule, which transforms the reported votes into collective decisions. Intuitively, there is a tradeoff between the communication requirements of the elicitation rule (i. e. , the number of bits of information that voters need to provide about their preferences) and the efficiency of the outcome of the aggregation rule, which we measure through distortion (i. e. , how well the utilitarian social welfare of the outcome approximates the maximum social welfare in the worst case). Our results chart the Pareto frontier of the communication-distortion tradeoff.

NeurIPS Conference 2019 Conference Paper

Optimal Sketching for Kronecker Product Regression and Low Rank Approximation

Huaian Diao
Rajesh Jayaram
Zhao Song
Wen Sun
David Woodruff

We study the Kronecker product regression problem, in which the design matrix is a Kronecker product of two or more matrices. Formally, given $A_i \in \R^{n_i \times d_i}$ for $i=1, 2, \dots, q$ where $n_i \gg d_i$ for each $i$, and $b \in \R^{n_1 n_2 \cdots n_q}$, let $\mathcal{A} = A_i \otimes A_2 \otimes \cdots \otimes A_q$. Then for $p \in [1, 2]$, the goal is to find $x \in \R^{d_1 \cdots d_q}$ that approximately minimizes $\|\mathcal{A}x - b\|_p$. Recently, Diao, Song, Sun, and Woodruff (AISTATS, 2018) gave an algorithm which is faster than forming the Kronecker product $\mathcal{A} \in \R^{n_1 \cdots n_q \times d_1 \cdots d_q}$. Specifically, for $p=2$ they achieve a running time of $O(\sum_{i=1}^q \texttt{nnz}(A_i) + \texttt{nnz}(b))$, where $ \texttt{nnz}(A_i)$ is the number of non-zero entries in $A_i$. Note that $\texttt{nnz}(b)$ can be as large as $\Theta(n_1 \cdots n_q)$. For $p=1, $ $q=2$ and $n_1 = n_2$, they achieve a worse bound of $O(n_1^{3/2} \text{poly}(d_1d_2) + \texttt{nnz}(b))$. In this work, we provide significantly faster algorithms. For $p=2$, our running time is $O(\sum_{i=1}^q \texttt{nnz}(A_i) )$, which has no dependence on $\texttt{nnz}(b)$. For $p<2$, our running time is $O(\sum_{i=1}^q \texttt{nnz}(A_i) + \texttt{nnz}(b))$, which matches the prior best running time for $p=2$. We also consider the related all-pairs regression problem, where given $A \in \R^{n \times d}, b \in \R^n$, we want to solve $\min_{x \in \R^d} \|\bar{A}x - \bar{b}\|_p$, where $\bar{A} \in \R^{n^2 \times d}, \bar{b} \in \R^{n^2}$ consist of all pairwise differences of the rows of $A, b$. We give an $O(\texttt{nnz}(A))$ time algorithm for $p \in[1, 2]$, improving the $\Omega(n^2)$ time required to form $\bar{A}$. Finally, we initiate the study of Kronecker product low rank and and low-trank approximation. For input $\mathcal{A}$ as above, we give $O(\sum_{i=1}^q \texttt{nnz}(A_i))$ time algorithms, which is much faster than computing $\mathcal{A}$.

NeurIPS Conference 2019 Conference Paper

Regularized Weighted Low Rank Approximation

Frank Ban
David Woodruff
Richard Zhang

The classical low rank approximation problem is to find a rank $k$ matrix $UV$ (where $U$ has $k$ columns and $V$ has $k$ rows) that minimizes the Frobenius norm of $A - UV$. Although this problem can be solved efficiently, we study an NP-hard variant of this problem that involves weights and regularization. A previous paper of [Razenshteyn et al. '16] derived a polynomial time algorithm for weighted low rank approximation with constant rank. We derive provably sharper guarantees for the regularized version by obtaining parameterized complexity bounds in terms of the statistical dimension rather than the rank, allowing for a rank-independent runtime that can be significantly faster. Our improvement comes from applying sharper matrix concentration bounds, using a novel conditioning technique, and proving structural theorems for regularized low rank problems.

NeurIPS Conference 2019 Conference Paper

Tight Dimensionality Reduction for Sketching Low Degree Polynomial Kernels

Michela Meister
Tamas Sarlos
David Woodruff

We revisit the classic randomized sketch of a tensor product of $q$ vectors $x_i\in\mathbb{R}^n$. The $i$-th coordinate $(Sx)_i$ of the sketch is equal to $\prod_{j = 1}^q \langle u^{i, j}, x^j \rangle / \sqrt{m}$, where $u^{i, j}$ are independent random sign vectors. Kar and Karnick (JMLR, 2012) show that if the sketching dimension $m = \Omega(\epsilon^{-2} C_{\Omega}^2 \log (1/\delta))$, where $C_{\Omega}$ is a certain property of the point set $\Omega$ one wants to sketch, then with probability $1-\delta$, $\|Sx\|_2 = (1\pm \epsilon)\|x\|_2$ for all $x\in\Omega$. However, in their analysis $C_{\Omega}^2$ can be as large as $\Theta(n^{2q})$, even for a set $\Omega$ of $O(1)$ vectors $x$. We give a new analysis of this sketch, providing nearly optimal bounds. Namely, we show an upper bound of $m = \Theta \left (\epsilon^{-2} \log(n/\delta) + \epsilon^{-1} \log^q(n/\delta) \right ), $ which by composing with CountSketch, can be improved to $\Theta(\epsilon^{-2}\log(1/(\delta \epsilon)) + \epsilon^{-1} \log^q (1/(\delta \epsilon))$. For the important case of $q = 2$ and $\delta = 1/\poly(n)$, this shows that $m = \Theta(\epsilon^{-2} \log(n) + \epsilon^{-1} \log^2(n))$, demonstrating that the $\epsilon^{-2}$ and $\log^2(n)$ terms do not multiply each other. We also show a nearly matching lower bound of $m = \Omega(\eps^{-2} \log(1/(\delta)) + \eps^{-1} \log^q(1/(\delta)))$. In a number of applications, one has $|\Omega| = \poly(n)$ and in this case our bounds are optimal up to a constant factor. This is the first high probability sketch for tensor products that has optimal sketch size and can be implemented in $m \cdot \sum_{i=1}^q \textrm{nnz}(x_i)$ time, where $\textrm{nnz}(x_i)$ is the number of non-zero entries of $x_i$. Lastly, we empirically compare our sketch to other sketches for tensor products, and give a novel application to compressing neural networks.

NeurIPS Conference 2019 Conference Paper

Total Least Squares Regression in Input Sparsity Time

Huaian Diao
Zhao Song
David Woodruff
Xin Yang

In the total least squares problem, one is given an $m \times n$ matrix $A$, and an $m \times d$ matrix $B$, and one seeks to ``correct'' both $A$ and $B$, obtaining matrices $\hat{A}$ and $\hat{B}$, so that there exists an $X$ satisfying the equation $\hat{A}X = \hat{B}$. Typically the problem is overconstrained, meaning that $m \gg \max(n, d)$. The cost of the solution $\hat{A}, \hat{B}$ is given by $\|A-\hat{A}\|_F^2 + \|B - \hat{B}\|_F^2$. We give an algorithm for finding a solution $X$ to the linear system $\hat{A}X=\hat{B}$ for which the cost $\|A-\hat{A}\|_F^2 + \|B-\hat{B}\|_F^2$ is at most a multiplicative $(1+\epsilon)$ factor times the optimal cost, up to an additive error $\eta$ that may be an arbitrarily small function of $n$. Importantly, our running time is $\tilde{O}(\nnz(A) + \nnz(B)) + \poly(n/\epsilon) \cdot d$, where for a matrix $C$, $\nnz(C)$ denotes its number of non-zero entries. Importantly, our running time does not directly depend on the large parameter $m$. As total least squares regression is known to be solvable via low rank approximation, a natural approach is to invoke fast algorithms for approximate low rank approximation, obtaining matrices $\hat{A}$ and $\hat{B}$ from this low rank approximation, and then solving for $X$ so that $\hat{A}X = \hat{B}$. However, existing algorithms do not apply since in total least squares the rank of the low rank approximation needs to be $n$, and so the running time of known methods would be at least $mn^2$. In contrast, we are able to achieve a much faster running time for finding $X$ by never explicitly forming the equation $\hat{A} X = \hat{B}$, but instead solving for an $X$ which is a solution to an implicit such equation. Finally, we generalize our algorithm to the total least squares problem with regularization.

NeurIPS Conference 2019 Conference Paper

Towards a Zero-One Law for Column Subset Selection

Zhao Song
David Woodruff
Peilin Zhong

There are a number of approximation algorithms for NP-hard versions of low rank approximation, such as finding a rank-$k$ matrix $B$ minimizing the sum of absolute values of differences to a given $n$-by-$n$ matrix $A$, $\min_{\textrm{rank-}k~B}\|A-B\|_1$, or more generally finding a rank-$k$ matrix $B$ which minimizes the sum of $p$-th powers of absolute values of differences, $\min_{\textrm{rank-}k~B}\|A-B\|_p^p$. Many of these algorithms are linear time columns subset selection algorithms, returning a subset of $\poly(k \log n)$ columns whose cost is no more than a $\poly(k)$ factor larger than the cost of the best rank-$k$ matrix. The above error measures are special cases of the following general entrywise low rank approximation problem: given an arbitrary function $g: \mathbb{R} \rightarrow \mathbb{R}_{\geq 0}$, find a rank-$k$ matrix $B$ which minimizes $\|A-B\|_g = \sum_{i, j}g(A_{i, j}-B_{i, j})$. A natural question is which functions $g$ admit efficient approximation algorithms? Indeed, this is a central question of recent work studying generalized low rank models. In this work we give approximation algorithms for {\it every} function $g$ which is approximately monotone and satisfies an approximate triangle inequality, and we show both of these conditions are necessary. Further, our algorithm is efficient if the function $g$ admits an efficient approximate regression algorithm. Our approximation algorithms handle functions which are not even scale-invariant, such as the Huber loss function, which we show have very different structural properties than $\ell_p$-norms, e. g. , one can show the lack of scale-invariance causes any column subset selection algorithm to provably require a $\sqrt{\log n}$ factor larger number of columns than $\ell_p$-norms; nevertheless we design the first efficient column subset selection algorithms for such error measures.

NeurIPS Conference 2018 Conference Paper

On Coresets for Logistic Regression

Alexander Munteanu
Chris Schwiegelshohn
Christian Sohler
David Woodruff

Coresets are one of the central methods to facilitate the analysis of large data. We continue a recent line of research applying the theory of coresets to logistic regression. First, we show the negative result that no strongly sublinear sized coresets exist for logistic regression. To deal with intractable worst-case instances we introduce a complexity measure $\mu(X)$, which quantifies the hardness of compressing a data set for logistic regression. $\mu(X)$ has an intuitive statistical interpretation that may be of independent interest. For data sets with bounded $\mu(X)$-complexity, we show that a novel sensitivity sampling scheme produces the first provably sublinear $(1\pm\eps)$-coreset. We illustrate the performance of our method by comparing to uniform sampling as well as to state of the art methods in the area. The experiments are conducted on real world benchmark data for logistic regression.

NeurIPS Conference 2018 Conference Paper

Robust Subspace Approximation in a Stream

Roie Levin
Anish Prasad Sevekari
David Woodruff

We study robust subspace estimation in the streaming and distributed settings. Given a set of n data points {a i} {i=1}^n in R^d and an integer k, we wish to find a linear subspace S of dimension k for which sum i M(dist(S, a i)) is minimized, where dist(S, x): = min_{y in S} |x-y|_2, and M() is some loss function. When M is the identity function, S gives a subspace that is more robust to outliers than that provided by the truncated SVD. Though the problem is NP-hard, it is approximable within a (1+epsilon) factor in polynomial time when k and epsilon are constant. We give the first sublinear approximation algorithm for this problem in the turnstile streaming and arbitrary partition distributed models, achieving the same time guarantees as in the offline case. Our algorithm is the first based entirely on oblivious dimensionality reduction, and significantly simplifies prior methods for this problem, which held in neither the streaming nor distributed models.

NeurIPS Conference 2018 Conference Paper

Sublinear Time Low-Rank Approximation of Distance Matrices

Ainesh Bakshi
David Woodruff

Let $\PP=\{ p_1, p_2, \ldots p_n \}$ and $\QQ = \{ q_1, q_2 \ldots q_m \}$ be two point sets in an arbitrary metric space. Let $\AA$ represent the $m\times n$ pairwise distance matrix with $\AA_{i, j} = d(p_i, q_j)$. Such distance matrices are commonly computed in software packages and have applications to learning image manifolds, handwriting recognition, and multi-dimensional unfolding, among other things. In an attempt to reduce their description size, we study low rank approximation of such matrices. Our main result is to show that for any underlying distance metric $d$, it is possible to achieve an additive error low rank approximation in sublinear time. We note that it is provably impossible to achieve such a guarantee in sublinear time for arbitrary matrices $\AA$, and our proof exploits special properties of distance matrices. We develop a recursive algorithm based on additive projection-cost preserving sampling. We then show that in general, relative error approximation in sublinear time is impossible for distance matrices, even if one allows for bicriteria solutions. Additionally, we show that if $\PP = \QQ$ and $d$ is the squared Euclidean distance, which is not a metric but rather the square of a metric, then a relative error bicriteria solution can be found in sublinear time. Finally, we empirically compare our algorithm with the SVD and input sparsity time algorithms. Our algorithm is several hundred times faster than the SVD, and about $8$-$20$ times faster than input sparsity methods on real-world and and synthetic datasets of size $10^8$. Accuracy-wise, our algorithm is only slightly worse than that of the SVD (optimal) and input-sparsity time algorithms.

NeurIPS Conference 2017 Conference Paper

Approximation Algorithms for $\ell_0$-Low Rank Approximation

Karl Bringmann
Pavel Kolev
David Woodruff

We study the $\ell_0$-Low Rank Approximation Problem, where the goal is, given an $m \times n$ matrix $A$, to output a rank-$k$ matrix $A'$ for which $\|A'-A\|_0$ is minimized. Here, for a matrix $B$, $\|B\|_0$ denotes the number of its non-zero entries. This NP-hard variant of low rank approximation is natural for problems with no underlying metric, and its goal is to minimize the number of disagreeing data positions. We provide approximation algorithms which significantly improve the running time and approximation factor of previous work. For $k > 1$, we show how to find, in poly$(mn)$ time for every $k$, a rank $O(k \log(n/k))$ matrix $A'$ for which $\|A'-A\|_0 \leq O(k^2 \log(n/k)) \OPT$. To the best of our knowledge, this is the first algorithm with provable guarantees for the $\ell_0$-Low Rank Approximation Problem for $k > 1$, even for bicriteria algorithms. For the well-studied case when $k = 1$, we give a $(2+\epsilon)$-approximation in {\it sublinear time}, which is impossible for other variants of low rank approximation such as for the Frobenius norm. We strengthen this for the well-studied case of binary matrices to obtain a $(1+O(\psi))$-approximation in sublinear time, where $\psi = \OPT/\nnz{A}$. For small $\psi$, our approximation factor is $1+o(1)$.

NeurIPS Conference 2017 Conference Paper

Is Input Sparsity Time Possible for Kernel Low-Rank Approximation?

Cameron Musco
David Woodruff

Low-rank approximation is a common tool used to accelerate kernel methods: the $n \times n$ kernel matrix $K$ is approximated via a rank-$k$ matrix $\tilde K$ which can be stored in much less space and processed more quickly. In this work we study the limits of computationally efficient low-rank kernel approximation. We show that for a broad class of kernels, including the popular Gaussian and polynomial kernels, computing a relative error $k$-rank approximation to $K$ is at least as difficult as multiplying the input data matrix $A \in R^{n \times d}$ by an arbitrary matrix $C \in R^{d \times k}$. Barring a breakthrough in fast matrix multiplication, when $k$ is not too large, this requires $\Omega(nnz(A)k)$ time where $nnz(A)$ is the number of non-zeros in $A$. This lower bound matches, in many parameter regimes, recent work on subquadratic time algorithms for low-rank approximation of general kernels [MM16, MW17], demonstrating that these algorithms are unlikely to be significantly improved, in particular to $O(nnz(A))$ input sparsity runtimes. At the same time there is hope: we show for the first time that $O(nnz(A))$ time approximation is possible for general radial basis function kernels (e. g. , the Gaussian kernel) for the closely related problem of low-rank approximation of the kernelized dataset.

NeurIPS Conference 2017 Conference Paper

Near Optimal Sketching of Low-Rank Tensor Regression

Xingguo Li
Jarvis Haupt
David Woodruff

We study the least squares regression problem $\min_{\Theta \in \RR^{p_1 \times \cdots \times p_D}} \| \cA(\Theta) - b \|_2^2$, where $\Theta$ is a low-rank tensor, defined as $\Theta = \sum_{r=1}^{R} \theta_1^{(r)} \circ \cdots \circ \theta_D^{(r)}$, for vectors $\theta_d^{(r)} \in \mathbb{R}^{p_d}$ for all $r \in [R]$ and $d \in [D]$. %$R$ is small compared with $p_1, \ldots, p_D$, Here, $\circ$ denotes the outer product of vectors, and $\cA(\Theta)$ is a linear function on $\Theta$. This problem is motivated by the fact that the number of parameters in $\Theta$ is only $R \cdot \sum_{d=1}^D p_D$, which is significantly smaller than the $\prod_{d=1}^{D} p_d$ number of parameters in ordinary least squares regression. We consider the above CP decomposition model of tensors $\Theta$, as well as the Tucker decomposition. For both models we show how to apply data dimensionality reduction techniques based on {\it sparse} random projections $\Phi \in \RR^{m \times n}$, with $m \ll n$, to reduce the problem to a much smaller problem $\min_{\Theta} \|\Phi \cA(\Theta) - \Phi b\|_2^2$, for which $\|\Phi \cA(\Theta) - \Phi b\|_2^2 = (1 \pm \varepsilon) \| \cA(\Theta) - b \|_2^2$ holds simultaneously for all $\Theta$. We obtain a significantly smaller dimension and sparsity in the randomized linear mapping $\Phi$ than is possible for ordinary least squares regression. Finally, we give a number of numerical simulations supporting our theory.

NeurIPS Conference 2016 Conference Paper

Communication-Optimal Distributed Clustering

Jiecao Chen
He Sun
David Woodruff
Qin Zhang

Clustering large datasets is a fundamental problem with a number of applications in machine learning. Data is often collected on different sites and clustering needs to be performed in a distributed manner with low communication. We would like the quality of the clustering in the distributed setting to match that in the centralized setting for which all the data resides on a single site. In this work, we study both graph and geometric clustering problems in two distributed models: (1) a point-to-point model, and (2) a model with a broadcast channel. We give protocols in both models which we show are nearly optimal by proving almost matching communication lower bounds. Our work highlights the surprising power of a broadcast channel for clustering problems; roughly speaking, to cluster n points or n vertices in a graph distributed across s servers, for a worst-case partitioning the communication complexity in a point-to-point model is n*s, while in the broadcast model it is n + s. We implement our algorithms and demonstrate this phenomenon on real life datasets, showing that our algorithms are also very efficient in practice.

NeurIPS Conference 2016 Conference Paper

Sublinear Time Orthogonal Tensor Decomposition

Zhao Song
David Woodruff
Huan Zhang

A recent work (Wang et. al. , NIPS 2015) gives the fastest known algorithms for orthogonal tensor decomposition with provable guarantees. Their algorithm is based on computing sketches of the input tensor, which requires reading the entire input. We show in a number of cases one can achieve the same theoretical guarantees in sublinear time, i. e. , even without reading most of the input tensor. Instead of using sketches to estimate inner products in tensor decomposition algorithms, we use importance sampling. To achieve sublinear time, we need to know the norms of tensor slices, and we show how to do this in a number of important cases. For symmetric tensors $ T = \sum_{i=1}^k \lambda_i u_i^{\otimes p}$ with $\lambda_i > 0$ for all i, we estimate such norms in sublinear time whenever p is even. For the important case of p = 3 and small values of k, we can also estimate such norms. For asymmetric tensors sublinear time is not possible in general, but we show if the tensor slice norms are just slightly below $\| T \|_F$ then sublinear time is again possible. One of the main strengths of our work is empirical - in a number of cases our algorithm is orders of magnitude faster than existing methods with the same accuracy.

NeurIPS Conference 2014 Conference Paper

Improved Distributed Principal Component Analysis

Yingyu Liang
Maria-Florina Balcan
Vandana Kanchanapally
David Woodruff

We study the distributed computing setting in which there are multiple servers, each holding a set of points, who wish to compute functions on the union of their point sets. A key task in this setting is Principal Component Analysis (PCA), in which the servers would like to compute a low dimensional subspace capturing as much of the variance of the union of their point sets as possible. Given a procedure for approximate PCA, one can use it to approximately solve problems such as $k$-means clustering and low rank approximation. The essential properties of an approximate distributed PCA algorithm are its communication cost and computational efficiency for a given desired accuracy in downstream applications. We give new algorithms and analyses for distributed PCA which lead to improved communication and computational costs for $k$-means clustering and related problems. Our empirical study on real world data shows a speedup of orders of magnitude, preserving communication with only a negligible degradation in solution quality. Some of these techniques we develop, such as input-sparsity subspace embeddings with high correctness probability with a dimension and sparsity independent of the error probability, may be of independent interest.

NeurIPS Conference 2014 Conference Paper

Low Rank Approximation Lower Bounds in Row-Update Streams

David Woodruff

We study low-rank approximation in the streaming model in which the rows of an $n \times d$ matrix $A$ are presented one at a time in an arbitrary order. At the end of the stream, the streaming algorithm should output a $k \times d$ matrix $R$ so that $\|A-AR^{\dagger}R\|_F^2 \leq (1+\eps)\|A-A_k\|_F^2$, where $A_k$ is the best rank-$k$ approximation to $A$. A deterministic streaming algorithm of Liberty (KDD, 2013), with an improved analysis of Ghashami and Phillips (SODA, 2014), provides such a streaming algorithm using $O(dk/\epsilon)$ words of space. A natural question is if smaller space is possible. We give an almost matching lower bound of $\Omega(dk/\epsilon)$ bits of space, even for randomized algorithms which succeed only with constant probability. Our lower bound matches the upper bound of Ghashami and Phillips up to the word size, improving on a simple $\Omega(dk)$ space lower bound.

NeurIPS Conference 2014 Conference Paper

Subspace Embeddings for the Polynomial Kernel

Haim Avron
Huy Nguyen
David Woodruff

Sketching is a powerful dimensionality reduction tool for accelerating statistical learning algorithms. However, its applicability has been limited to a certain extent since the crucial ingredient, the so-called oblivious subspace embedding, can only be applied to data spaces with an explicit representation as the column span or row span of a matrix, while in many settings learning is done in a high-dimensional space implicitly defined by the data matrix via a kernel transformation. We propose the first {\em fast} oblivious subspace embeddings that are able to embed a space induced by a non-linear kernel {\em without} explicitly mapping the data to the high-dimensional space. In particular, we propose an embedding for mappings induced by the polynomial kernel. Using the subspace embeddings, we obtain the fastest known algorithms for computing an implicit low rank approximation of the higher-dimension mapping of the data matrix, and for computing an approximate kernel PCA of the data, as well as doing approximate kernel principal component regression.

NeurIPS Conference 2013 Conference Paper

Sketching Structured Matrices for Faster Nonlinear Regression

Haim Avron
Vikas Sindhwani
David Woodruff

Motivated by the desire to extend fast randomized techniques to nonlinear $l_p$ regression, we consider a class of structured regression problems. These problems involve Vandermonde matrices which arise naturally in various statistical modeling settings, including classical polynomial fitting problems and recently developed randomized techniques for scalable kernel methods. We show that this structure can be exploited to further accelerate the solution of the regression problem, achieving running times that are faster than input sparsity''. We present empirical results confirming both the practical value of our modeling framework, as well as speedup benefits of randomized regression. "