Arrow Research search

Author name cluster

Rasmus Pagh

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

33 papers
2 author rows

Possible papers

33

NeurIPS Conference 2025 Conference Paper

Differentially Private Quantiles with Smaller Error

  • Jacob Imola
  • Fabrizio Boninsegna
  • Hannah Keller
  • Anders Aamand
  • Amrita Roy Chowdhury
  • Rasmus Pagh

In the approximate quantiles problem, the goal is to output $m$ quantile estimates, the ranks of which are as close as possible to $m$ given quantiles $0 \leq q_1 \leq\dots \leq q_m \leq 1$. We present a mechanism for approximate quantiles that satisfies $\varepsilon$-differential privacy for a dataset of $n$ real numbers where the ratio between the distance between the closest pair of points and the size of the domain is bounded by $\psi$. As long as the minimum gap between quantiles is sufficiently large, $|q_i-q_{i-1}|\geq \Omega\left(\frac{m\log(m)\log(\psi)}{n\varepsilon}\right)$ for all $i$, the maximum rank error of our mechanism is $O\left(\frac{\log(\psi) + \log^2(m)}{\varepsilon}\right)$ with high probability. Previously, the best known algorithm under pure DP was due to Kaplan, Schnapp, and Stemmer (ICML '22), who achieved a bound of $O\left(\frac{\log(\psi)\log^2(m) + \log^3(m)}{\varepsilon}\right)$. Our improvement stems from the use of continual counting techniques which allows the quantiles to be randomized in a correlated manner. We also present an $(\varepsilon, \delta)$-differentially private mechanism that relaxes the gap assumption without affecting the error bound, improving on existing methods when $\delta$ is sufficiently close to zero. We provide experimental evaluation which confirms that our mechanism performs favorably compared to prior work in practice, in particular when the number of quantiles $m$ is large.

ICML Conference 2025 Conference Paper

Lightweight Protocols for Distributed Private Quantile Estimation

  • Anders Aamand
  • Fabrizio Boninsegna
  • Abigail Gentle
  • Jacob Imola
  • Rasmus Pagh

Distributed data analysis is a large and growing field driven by a massive proliferation of user devices, and by privacy concerns surrounding the centralised storage of data. We consider two adaptive algorithms for estimating one quantile (e.g., the median) when each user holds a single data point in a domain $[B]$ and can be queried once through a private mechanism; one algorithm works under local differential privacy (LDP) and the other under shuffle differential privacy (shuffle-DP). In the adaptive setting we present an $\varepsilon$-LDP algorithm which can estimate any quantile within error $\alpha$ using only $O(\frac{\log B}{\varepsilon^2\alpha^2})$ users, and an $(\varepsilon, \delta)$-shuffle DP algorithm requiring only $\widetilde{O}((\frac{1}{\varepsilon^2}+\frac{1}{\alpha^2})\log B)$ users. Prior (non-adaptive) algorithms require more users by several logarithmic factors in $B$. We further provide a matching lower bound for adaptive protocols, showing that our LDP algorithm is optimal in the low-$\varepsilon$ regime. Additionally, we establish lower bounds against non-adaptive protocols which, paired with our understanding of the adaptive case, prove a fundamental separation between these models.
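
The protocols themselves are more involved, but the core adaptive idea can be sketched as a noisy binary search in which each queried user answers a single threshold comparison through randomized response. A minimal illustration, not the paper's algorithm; all function names and parameters here are invented for the sketch:

```python
import math
import random

def rr_bit(bit, eps, rng):
    # eps-LDP randomized response: report the true bit w.p. e^eps / (1 + e^eps).
    p = math.exp(eps) / (1 + math.exp(eps))
    return bit if rng.random() < p else 1 - bit

def debias(mean, eps):
    # Invert the randomized-response bias to estimate Pr[x <= threshold].
    p = math.exp(eps) / (1 + math.exp(eps))
    return (mean - (1 - p)) / (2 * p - 1)

def ldp_median(data, domain_size, eps, users_per_round, rng):
    # Adaptive noisy binary search over [0, domain_size);
    # every user is queried exactly once.
    users = iter(data)
    lo, hi = 0, domain_size - 1
    while lo < hi:
        mid = (lo + hi) // 2
        reports = [rr_bit(int(next(users) <= mid), eps, rng)
                   for _ in range(users_per_round)]
        frac = debias(sum(reports) / len(reports), eps)
        if frac < 0.5:
            lo = mid + 1
        else:
            hi = mid
    return lo
```

With a domain of size 1000, roughly 10 rounds of threshold queries suffice, and the rank error is governed by the per-round estimation noise rather than by the domain size.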

ICML Conference 2025 Conference Paper

Private Lossless Multiple Release

  • Joel Daniel Andersson
  • Lukas Retschmeier
  • Boel Nelson
  • Rasmus Pagh

Koufogiannis et al. (2016) showed a $\textit{gradual release}$ result for Laplace noise-based differentially private mechanisms: given an $\varepsilon$-DP release, a new release with privacy parameter $\varepsilon' > \varepsilon$ can be computed such that the combined privacy loss of both releases is at most $\varepsilon'$ and the distribution of the latter is the same as a single release with parameter $\varepsilon'$. They also showed gradual release techniques for Gaussian noise, later also explored by Whitehouse et al. (2022). In this paper, we consider a more general $\textit{multiple release}$ setting in which analysts hold private releases with different privacy parameters corresponding to different access/trust levels. These releases are determined one by one, with privacy parameters in arbitrary order. A multiple release is $\textit{lossless}$ if having access to a subset $S$ of the releases has the same privacy guarantee as the least private release in $S$, and each release has the same distribution as a single release with the same privacy parameter. Our main result is that lossless multiple release is possible for a large class of additive noise mechanisms. For the Gaussian mechanism we give a simple method for lossless multiple release with a short, self-contained analysis that does not require knowledge of the mathematics of Brownian motion. We also present lossless multiple release for the Laplace and Poisson mechanisms. Finally, we consider how to efficiently do gradual release of sparse histograms, and present a mechanism with running time independent of the number of dimensions.

NeurIPS Conference 2024 Conference Paper

Continual Counting with Gradual Privacy Expiration

  • Joel D. Andersson
  • Monika Henzinger
  • Rasmus Pagh
  • Teresa A. Steiner
  • Jalaj Upadhyay

Differential privacy with gradual expiration models the setting where data items arrive in a stream and at a given time $t$ the privacy loss guaranteed for a data item seen at time $(t-d)$ is $\varepsilon g(d)$, where $g$ is a monotonically non-decreasing function. We study the fundamental *continual (binary) counting* problem where each data item consists of a bit and the algorithm needs to output at each time step the sum of all the bits streamed so far. For a stream of length $T$ and privacy *without* expiration, continual counting is possible with maximum (over all time steps) additive error $O(\log^2(T)/\varepsilon)$, and the best known lower bound is $\Omega(\log(T)/\varepsilon)$; closing this gap is a challenging open problem. We show that the situation is very different for privacy with gradual expiration by giving upper and lower bounds for a large set of expiration functions $g$. Specifically, our algorithm achieves an additive error of $O(\log(T)/\varepsilon)$ for a large set of privacy expiration functions. We also give a lower bound that shows that if $C$ is the additive error of any $\varepsilon$-DP algorithm for this problem, then the product of $C$ and the privacy expiration function after $2C$ steps must be $\Omega(\log(T)/\varepsilon)$. Our algorithm matches this lower bound as its additive error is $O(\log(T)/\varepsilon)$, even when $g(2C) = O(1)$. Our empirical evaluation shows that we achieve a slowly growing privacy loss that has significantly smaller empirical privacy loss for large values of $d$ than a natural baseline algorithm.

ICML Conference 2024 Conference Paper

Profile Reconstruction from Private Sketches

  • Hao Wu 0057
  • Rasmus Pagh

Given a multiset of $n$ items from $\mathcal{D}$, the profile reconstruction problem is to estimate, for $t = 0, 1, \dots, n$, the fraction $\vec{f}[t]$ of items in $\mathcal{D}$ that appear exactly $t$ times. We consider differentially private profile estimation in a distributed, space-constrained setting where we wish to maintain an updatable, private sketch of the multiset that allows us to compute an approximation of $\vec{f} = (\vec{f}[0], \dots, \vec{f}[n])$. Using a histogram privatized using discrete Laplace noise, we show how to “reverse” the noise using an approach of Dwork et al. (ITCS '10). We show how to speed up the algorithm from polynomial time to $O(d + n \log n)$, and analyze the achievable error in the $\ell_1$, $\ell_2$ and $\ell_\infty$ norms. In all cases the dependency of the error on $d = |\mathcal{D}|$ is $O(1/\sqrt{d})$ — we give an information-theoretic lower bound showing that this dependence on $d$ is asymptotically optimal among all private, updatable sketches for the profile reconstruction problem with a high-probability error guarantee.
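
The starting point described above, a histogram privatized with discrete Laplace noise, can be sketched as follows. Illustrative only: the paper's sketch is additionally space-constrained and updatable, and the noise-“reversal” step is omitted:

```python
import math
import random

def discrete_laplace(scale, rng):
    # Two-sided geometric noise: the difference of two i.i.d. geometric
    # variables is discrete-Laplace distributed with alpha = exp(-1/scale).
    alpha = math.exp(-1.0 / scale)
    def geom():
        # Failures before first success, success probability 1 - alpha.
        return int(math.log(1 - rng.random()) / math.log(alpha))
    return geom() - geom()

def private_histogram(items, domain, eps, rng):
    # Adding/removing one item changes a single count by 1, so per-count
    # discrete Laplace noise of scale 1/eps gives an eps-DP histogram.
    hist = {d: 0 for d in domain}
    for x in items:
        hist[x] += 1
    return {d: c + discrete_laplace(1.0 / eps, rng) for d, c in hist.items()}
```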

NeurIPS Conference 2023 Conference Paper

A Smooth Binary Mechanism for Efficient Private Continual Observation

  • Joel Daniel Andersson
  • Rasmus Pagh

In privacy under continual observation we study how to release differentially private estimates based on a dataset that evolves over time. The problem of releasing private prefix sums of $x_1, x_2, x_3, \dots \in \{0, 1\}$ (where the value of each $x_i$ is to be private) is particularly well-studied, and a generalized form is used in state-of-the-art methods for private stochastic gradient descent (SGD). The seminal binary mechanism privately releases the first $t$ prefix sums with noise of variance polylogarithmic in $t$. Recently, Henzinger et al. and Denisov et al. showed that it is possible to improve on the binary mechanism in two ways: The variance of the noise can be reduced by a (large) constant factor, and also made more even across time steps. However, their algorithms for generating the noise distribution are not as efficient as one would like in terms of computation time and (in particular) space. We address the efficiency problem by presenting a simple alternative to the binary mechanism in which 1) generating the noise takes constant average time per value, 2) the variance is reduced by a factor of about 4 compared to the binary mechanism, and 3) the noise distribution at each step is identical. Empirically, a simple Python implementation of our approach outperforms the running time of the approach of Henzinger et al., as well as an attempt to improve their algorithm using high-performance algorithms for multiplication with Toeplitz matrices.
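
For context, the baseline being improved, the seminal binary mechanism, can be sketched as follows: each stream element enters $O(\log T)$ dyadic partial sums, each privatized once, and a prefix sum combines the noisy sums on the binary decomposition of $t$. A minimal sketch of that baseline, not the paper's smooth mechanism:

```python
import math
import random

def laplace(scale, rng):
    # The difference of two exponentials is Laplace-distributed with this scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

class BinaryMechanism:
    # eps-DP continual counting via noisy dyadic partial sums.
    def __init__(self, T, eps, rng):
        levels = max(1, math.ceil(math.log2(T + 1)))
        self.scale = levels / eps          # each item affects <= levels sums
        self.a = [0.0] * (levels + 1)      # exact dyadic partial sums
        self.noisy = [0.0] * (levels + 1)  # their privatized versions
        self.t = 0
        self.rng = rng

    def update_and_query(self, x):
        self.t += 1
        i = (self.t & -self.t).bit_length() - 1   # lowest set bit of t
        self.a[i] = x + sum(self.a[:i])           # close the new dyadic block
        self.noisy[i] = self.a[i] + laplace(self.scale, self.rng)
        for j in range(i):                        # lower levels restart
            self.a[j] = self.noisy[j] = 0.0
        # prefix sum = noisy dyadic sums on the binary decomposition of t
        return sum(n for j, n in enumerate(self.noisy) if (self.t >> j) & 1)
```

Each prefix sum aggregates at most $\log T$ noise terms, hence the polylogarithmic variance the abstract refers to.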

FOCS Conference 2023 Conference Paper

Pseudorandom Hashing for Space-bounded Computation with Applications in Streaming

  • Praneeth Kacham
  • Rasmus Pagh
  • Mikkel Thorup
  • David P. Woodruff

We revisit Nisan's classical pseudorandom generator (PRG) for space-bounded computation (STOC 1990) and its applications in streaming algorithms. We describe a new generator, HashPRG, that can be thought of as a symmetric version of Nisan's generator over larger alphabets. Our generator allows a trade-off between seed length and the time needed to compute a given block of the generator's output. HashPRG can be used to obtain derandomizations with much better update time and without sacrificing space for a large number of data stream algorithms, for example:

  • Andoni's $F_{p}$ estimation algorithm for constant $p \gt 2$ (ICASSP 2017) assumes a random oracle, but achieves optimal space and constant update time. Using HashPRG's time-space trade-off we eliminate the random oracle assumption while preserving the other properties. Previously no time-optimal derandomization was known. Using similar techniques, we give an algorithm for a relaxed version of $\ell_{p}$ sampling in a turnstile stream. Both of our algorithms use $\tilde{O}\left(d^{1-2 / p}\right)$ bits of space and have $O(1)$ update time.
  • For $0 \lt p \lt 2$, the $1 \pm \varepsilon$ approximate $F_{p}$ estimation algorithm of Kane et al. (STOC 2011) uses an optimal $O\left(\varepsilon^{-2} \log d\right)$ bits of space but has an update time of $O\left(\log^{2}(1/\varepsilon) \log \log(1/\varepsilon)\right)$. Using HashPRG, we show that if $1/\sqrt{d} \leq \varepsilon \leq 1/d^{c}$ for an arbitrarily small constant $c \gt 0$, then we can obtain a $1 \pm \varepsilon$ approximate $F_{p}$ estimation algorithm that uses the optimal $O\left(\varepsilon^{-2} \log d\right)$ bits of space and has an update time of $O(\log d)$ in the Word RAM model, which is more than a quadratic improvement in the update time. We obtain similar improvements for entropy estimation.
  • CountSketch, with the fine-grained error analysis of Minton and Price (SODA 2014). For derandomization, they suggested a direct application of Nisan's generator, yielding a logarithmic multiplicative space overhead. With HashPRG we obtain an efficient derandomization yielding the same asymptotic space as when assuming a random oracle. Our ability to obtain a time-efficient derandomization makes crucial use of HashPRG's symmetry.

We also give the first derandomization of a recent private version of CountSketch. For a $d$-dimensional vector $x$ being updated in a turnstile stream, we show that $\|x\|_{\infty}$ can be estimated up to an additive error of $\varepsilon\|x\|_{2}$ using $O\left(\varepsilon^{-2} \log (1/\varepsilon) \log d\right)$ bits of space. Additionally, the update time of this algorithm is $O(\log 1/\varepsilon)$ in the Word RAM model. We show that the space complexity of this algorithm is optimal up to constant factors. However, for vectors $x$ with $\|x\|_{\infty}=\Theta\left(\|x\|_{2}\right)$, we show that the lower bound can be broken by giving an algorithm that uses $O\left(\varepsilon^{-2} \log d\right)$ bits of space which approximates $\|x\|_{\infty}$ up to an additive error of $\varepsilon\|x\|_{2}$. We use our aforementioned derandomization of the CountSketch data structure to obtain this algorithm, and using the time-space trade-off of HashPRG, we show that the update time of this algorithm is also $O(\log 1/\varepsilon)$ in the Word RAM model.

NeurIPS Conference 2022 Conference Paper

Improved Utility Analysis of Private CountSketch

  • Rasmus Pagh
  • Mikkel Thorup

Sketching is an important tool for dealing with high-dimensional vectors that are sparse (or well-approximated by a sparse vector), especially useful in distributed, parallel, and streaming settings. It is known that sketches can be made differentially private by adding noise according to the sensitivity of the sketch, and this has been used in private analytics and federated learning settings. The post-processing property of differential privacy implies that \emph{all} estimates computed from the sketch can be released within the given privacy budget. In this paper we consider the classical CountSketch, made differentially private with the Gaussian mechanism, and give an improved analysis of its estimation error. Perhaps surprisingly, the privacy-utility trade-off is essentially the best one could hope for, independent of the number of repetitions in CountSketch: The error is almost identical to the error from non-private CountSketch plus the noise needed to make the vector private in the original, high-dimensional domain.

ICML Conference 2021 Conference Paper

CountSketches, Feature Hashing and the Median of Three

  • Kasper Green Larsen
  • Rasmus Pagh
  • Jakub Tetek

In this paper, we revisit the classic CountSketch method, which is a sparse, random projection that transforms a (high-dimensional) Euclidean vector $v$ to a vector of dimension $(2t-1)s$, where $t, s > 0$ are integer parameters. It is known that a CountSketch allows estimating coordinates of $v$ with variance bounded by $\|v\|_2^2/s$. For $t > 1$, the estimator takes the median of $2t-1$ independent estimates, and the probability that the estimate is off by more than $2\|v\|_2/\sqrt{s}$ is exponentially small in $t$. This suggests choosing $t$ to be logarithmic in a desired inverse failure probability. However, implementations of CountSketch often use a small, constant $t$. Previous work only predicts a constant factor improvement in this setting. Our main contribution is a new analysis of CountSketch, showing an improvement in variance to $O(\min\{\|v\|_1^2/s^2, \|v\|_2^2/s\})$ when $t > 1$. That is, the variance decreases proportionally to $s^{-2}$, asymptotically for large enough $s$.
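
The estimator under discussion can be sketched directly. A simplified illustration that uses Python's built-in tuple hashing as a stand-in for proper pairwise-independent hash functions:

```python
import statistics

class CountSketch:
    # 2t-1 rows; each row hashes a coordinate to one of s buckets with a
    # random sign, and the estimate of v[i] is the median over rows.
    def __init__(self, t, s):
        self.rows, self.s = 2 * t - 1, s
        self.table = [[0.0] * s for _ in range(self.rows)]

    def _bucket_sign(self, r, i):
        # Stand-in hash functions (not pairwise independent; demo only).
        return hash((r, i, 1)) % self.s, 1 - 2 * (hash((r, i, 2)) % 2)

    def update(self, i, delta):
        for r in range(self.rows):
            b, sg = self._bucket_sign(r, i)
            self.table[r][b] += sg * delta

    def estimate(self, i):
        return statistics.median(
            sg * self.table[r][b]
            for r in range(self.rows)
            for b, sg in [self._bucket_sign(r, i)])
```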

ICML Conference 2021 Conference Paper

Differentially Private Aggregation in the Shuffle Model: Almost Central Accuracy in Almost a Single Message

  • Badih Ghazi
  • Ravi Kumar 0001
  • Pasin Manurangsi
  • Rasmus Pagh
  • Amer Sinha

The shuffle model of differential privacy has attracted attention in the literature due to it being a middle ground between the well-studied central and local models. In this work, we study the problem of summing (aggregating) real numbers or integers, a basic primitive in numerous machine learning tasks, in the shuffle model. We give a protocol achieving error arbitrarily close to that of the (Discrete) Laplace mechanism in central differential privacy, while each user only sends 1 + o(1) short messages in expectation.
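
The role of the shuffler can be illustrated with the classical split-and-mix idea for exact summation from anonymous messages (a standard textbook sketch, not this paper's protocol, which additionally injects noise to achieve DP while sending only 1 + o(1) short messages per user):

```python
import random

def split_shares(x, m, q, rng):
    # Additive secret sharing over Z_q: m-1 uniform shares, the last fixes the sum.
    shares = [rng.randrange(q) for _ in range(m - 1)]
    shares.append((x - sum(shares)) % q)
    return shares

def shuffle_sum(values, m, q, rng):
    pool = [s for x in values for s in split_shares(x, m, q, rng)]
    rng.shuffle(pool)        # the shuffler hides which share came from whom
    return sum(pool) % q     # the analyzer recovers the exact sum
```

Because each user's shares are uniform conditioned on their sum, the shuffled pool reveals (information-theoretically, for large enough m) little beyond the total.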

ICML Conference 2020 Conference Paper

Composable Sketches for Functions of Frequencies: Beyond the Worst Case

  • Edith Cohen
  • Ofir Geri
  • Rasmus Pagh

Recently there has been increased interest in using machine learning techniques to improve classical algorithms. In this paper we study when it is possible to construct compact, composable sketches for weighted sampling and statistics estimation according to functions of data frequencies. Such structures are now central components of large-scale data analytics and machine learning pipelines. However, many common functions, such as thresholds and $p$th frequency moments with $p>2$, are known to require polynomial size sketches in the worst case. We explore performance beyond the worst case under two different types of assumptions. The first is having access to noisy \emph{advice} on item frequencies. This continues the line of work of Hsu et al. (ICLR 2019), who assume predictions are provided by a machine learning model. The second is providing guaranteed performance on a restricted class of input frequency distributions that are better aligned with what is observed in practice. This extends the work on heavy hitters under Zipfian distributions in a seminal paper of Charikar et al. (ICALP 2002). Surprisingly, we show analytically and empirically that "in practice" small polylogarithmic-size sketches provide accuracy for "hard" functions.

SODA Conference 2020 Conference Paper

Oblivious Sketching of High-Degree Polynomial Kernels

  • Thomas D. Ahle
  • Michael Kapralov
  • Jakob Bæk Tejs Knudsen
  • Rasmus Pagh
  • Ameya Velingker
  • David P. Woodruff
  • Amir Zandieh

Kernel methods are fundamental tools in machine learning that allow detection of non-linear dependencies between data without explicitly constructing feature vectors in high dimensional spaces. A major disadvantage of kernel methods is their poor scalability: primitives such as kernel PCA or kernel ridge regression generally take prohibitively large quadratic space and (at least) quadratic time, as kernel matrices are usually dense. Some methods for speeding up kernel linear algebra are known, but they all invariably take time exponential in either the dimension of the input point set (e.g., fast multipole methods suffer from the curse of dimensionality) or in the degree of the kernel function. Oblivious sketching has emerged as a powerful approach to speeding up numerical linear algebra over the past decade, but our understanding of oblivious sketching solutions for kernel matrices has remained quite limited, suffering from the aforementioned exponential dependence on input parameters. Our main contribution is a general method for applying sketching solutions developed in numerical linear algebra over the past decade to a tensoring of data points without forming the tensoring explicitly. This leads to the first oblivious sketch for the polynomial kernel with a target dimension that is only polynomially dependent on the degree of the kernel function, as well as the first oblivious sketch for the Gaussian kernel on bounded datasets that does not suffer from an exponential dependence on the dimensionality of input data points.

ICML Conference 2020 Conference Paper

Private Counting from Anonymous Messages: Near-Optimal Accuracy with Vanishing Communication Overhead

  • Badih Ghazi
  • Ravi Kumar 0001
  • Pasin Manurangsi
  • Rasmus Pagh

Differential privacy (DP) is a formal notion for quantifying the privacy loss of algorithms. Algorithms in the central model of DP achieve high accuracy but make the strongest trust assumptions whereas those in the local DP model make the weakest trust assumptions but incur substantial accuracy loss. The shuffled DP model [Bittau et al. 2017, Erlingsson et al. 2019, Cheu et al. 2019] has recently emerged as a feasible middle ground between the central and local models, providing stronger trust assumptions than the former while promising higher accuracies than the latter. In this paper, we obtain practical communication-efficient algorithms in the shuffled DP model for two basic aggregation primitives used in machine learning: 1) binary summation, and 2) histograms over a moderate number of buckets. Our algorithms achieve accuracy that is arbitrarily close to that of central DP algorithms with an expected communication per user essentially matching what is needed without any privacy constraints! We demonstrate the practicality of our algorithms by experimentally evaluating them and comparing their performance to several widely-used protocols such as Randomized Response [Warner 1965] and RAPPOR [Erlingsson et al. 2014].
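
The Randomized Response baseline cited above is a few lines; its debiased count is unbiased, but the error grows roughly as sqrt(n)/eps, which is the gap to central-DP accuracy that shuffled-DP protocols aim to close:

```python
import math
import random

def warner_rr_sum(bits, eps, rng):
    # Warner's randomized response: each user keeps their bit with
    # probability e^eps / (1 + e^eps), otherwise flips it; then debias.
    p = math.exp(eps) / (1 + math.exp(eps))
    noisy = sum(b if rng.random() < p else 1 - b for b in bits)
    return (noisy - (1 - p) * len(bits)) / (2 * p - 1)
```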

NeurIPS Conference 2020 Conference Paper

WOR and $p$'s: Sketches for $\ell_p$-Sampling Without Replacement

  • Edith Cohen
  • Rasmus Pagh
  • David Woodruff

Weighted sampling is a fundamental tool in data analysis and machine learning pipelines. Samples are used for efficient estimation of statistics or as sparse representations of the data. When weight distributions are skewed, as is often the case in practice, without-replacement (WOR) sampling is much more effective than with-replacement (WR) sampling: It provides a broader representation and higher accuracy for the same number of samples. We design novel composable sketches for WOR {\em $\ell_p$ sampling}, weighted sampling of keys according to a power $p\in[0, 2]$ of their frequency (or for signed data, sum of updates). Our sketches have size that grows only linearly with sample size. Our design is simple and practical, despite intricate analysis, and based on off-the-shelf use of widely implemented heavy hitters sketches such as \texttt{CountSketch}. Our method is the first to provide WOR sampling in the important regime of $p>1$ and the first to handle signed updates for $p>0$.
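
The sampling objective itself (though not the paper's composable sketch) has a clean non-streaming description via exponential race clocks: key $i$ receives an arrival time distributed as an exponential with rate $w_i^p$, and the $k$ earliest arrivals form a without-replacement sample proportional to $w^p$:

```python
import random

def wor_lp_sample(weights, k, p, rng):
    # Successive WOR sampling: the minimum arrival time lands on key i with
    # probability w_i^p / sum_j w_j^p, and the same holds recursively among
    # the remaining keys, so the k smallest times form a WOR sample.
    times = {key: rng.expovariate(w ** p) for key, w in weights.items()}
    return sorted(times, key=times.get)[:k]
```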

STOC Conference 2017 Conference Paper

Set similarity search beyond MinHash

  • Tobias Christiani
  • Rasmus Pagh

We consider the problem of approximate set similarity search under Braun-Blanquet similarity $B(x, y) = |x \cap y| / \max(|x|, |y|)$. The $(b_1, b_2)$-approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets $P$ such that, given a query set $q$, if there exists $x \in P$ with $B(q, x) \geq b_1$, then we can efficiently return $x' \in P$ with $B(q, x') > b_2$. We present a simple data structure that solves this problem with space usage $O(n^{1+\rho}\log n + \sum_{x \in P}|x|)$ and query time $O(|q| n^{\rho} \log n)$ where $n = |P|$ and $\rho = \log(1/b_1)/\log(1/b_2)$. Making use of existing lower bounds for locality-sensitive hashing by O'Donnell et al. (TOCT 2014) we show that this value of $\rho$ is tight across the parameter space, i.e., for every choice of constants $0 < b_2 < b_1 < 1$. In the case where all sets have the same size our solution strictly improves upon the value of $\rho$ that can be obtained through the use of state-of-the-art data-independent techniques in the Indyk-Motwani locality-sensitive hashing framework (STOC 1998), such as Broder's MinHash (CCS 1997) for Jaccard similarity and Andoni et al.'s cross-polytope LSH (NIPS 2015) for cosine similarity. Surprisingly, even though our solution is data-independent, for a large part of the parameter space we outperform the currently best data-dependent method by Andoni and Razenshteyn (STOC 2015).
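
For comparison, the MinHash baseline referenced above estimates Jaccard similarity as the fraction of agreeing minimum hash values. A small demo with stand-in salted hash functions; Braun-Blanquet similarity is computed directly:

```python
import random

def minhash_signature(s, salts):
    # One minimum per salted hash function (tuple hashing, demo only).
    return [min(hash((x, salt)) for x in s) for salt in salts]

def estimate_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def braun_blanquet(a, b):
    return len(a & b) / max(len(a), len(b))

rng = random.Random(0)
salts = [rng.getrandbits(32) for _ in range(200)]
```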

SODA Conference 2015 Conference Paper

Approximate Range Emptiness in Constant Time and Optimal Space

  • Mayank Goswami 0001
  • Allan Grønlund Jørgensen
  • Kasper Green Larsen
  • Rasmus Pagh

This paper studies the $\varepsilon$-approximate range emptiness problem, where the task is to represent a set $S$ of $n$ points from $\{0, \dots, U-1\}$ and answer emptiness queries of the form "$[a; b] \cap S \neq \emptyset$?" with a probability of false positives allowed. This generalizes the functionality of Bloom filters from single point queries to any interval length $L$. Setting the false positive rate to $\varepsilon/L$ and performing $L$ queries, Bloom filters yield a solution to this problem with space $O(n \lg(L/\varepsilon))$ bits, false positive probability bounded by $\varepsilon$ for intervals of length up to $L$, using query time $O(L \lg(L/\varepsilon))$. Our first contribution is to show that the space/error trade-off cannot be improved asymptotically: any data structure for answering approximate range emptiness queries on intervals of length up to $L$ with false positive probability $\varepsilon$ must use space $\Omega(n \lg(L/\varepsilon)) - O(n)$ bits. On the positive side we show that the query time can be improved greatly, to constant time, while matching our space lower bound up to a lower order additive term. This result is achieved through a succinct data structure for (non-approximate 1d) range emptiness/reporting queries, which may be of independent interest.

STOC Conference 2015 Conference Paper

From Independence to Expansion and Back Again

  • Tobias Christiani
  • Rasmus Pagh
  • Mikkel Thorup

We consider the following fundamental problems: Constructing k-independent hash functions with a space-time tradeoff close to Siegel's lower bound. Constructing representations of unbalanced expander graphs having small size and allowing fast computation of the neighbor function. It is not hard to show that these problems are intimately connected in the sense that a good solution to one of them leads to a good solution to the other one. In this paper we exploit this connection to present efficient, recursive constructions of k-independent hash functions (and hence expanders with a small representation). While the previously most efficient construction (Thorup, FOCS 2013) needed time quasipolynomial in Siegel's lower bound, our time bound is just a logarithmic factor from the lower bound.

FOCS Conference 2014 Conference Paper

Generating k-Independent Variables in Constant Time

  • Tobias Christiani
  • Rasmus Pagh

The generation of pseudorandom elements over finite fields is fundamental to the time, space and randomness complexity of randomized algorithms and data structures. We consider the problem of generating $k$-independent random values over a finite field $F$ in a word RAM model equipped with constant time addition and multiplication in $F$, and present the first nontrivial construction of a generator that outputs each value in constant time, not dependent on $k$. Our generator has period length $|F|^{\operatorname{poly}\log k}$ and uses $k \operatorname{poly}(\log k) \log |F|$ bits of space, which is optimal up to a $\operatorname{poly}\log k$ factor. We are able to bypass Siegel's lower bound on the time-space tradeoff for $k$-independent functions by a restriction to sequential evaluation.
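
The classical point of comparison is polynomial hashing (Joffe; Wegman-Carter): a uniformly random polynomial of degree $k-1$ over $F_p$ is $k$-wise independent, but each evaluation costs $O(k)$ time, which is exactly the barrier a constant-time generator must bypass:

```python
import random

def k_independent_generator(k, p, rng):
    # Uniform degree-(k-1) polynomial over F_p: its values at any k
    # distinct points are independent and uniform.
    coeffs = [rng.randrange(p) for _ in range(k)]
    def value(i):
        acc = 0
        for c in reversed(coeffs):   # Horner evaluation mod p, O(k) time
            acc = (acc * i + c) % p
        return acc
    return value
```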

FOCS Conference 2013 Conference Paper

How to Approximate a Set without Knowing Its Size in Advance

  • Rasmus Pagh
  • Gil Segev 0001
  • Udi Wieder

The dynamic approximate membership problem asks to represent a set S of size n, whose elements are provided in an on-line fashion, supporting membership queries without false negatives and with a false positive rate at most ε. That is, the membership algorithm must be correct on each x ∈ S, and may err with probability at most ε on each x ∉ S. We study a well-motivated, yet insufficiently explored, variant of this problem where the size n of the set is not known in advance. Existing optimal approximate membership data structures require that the size is known in advance, but in many practical scenarios this is not a realistic assumption. Moreover, even if the eventual size n of the set is known in advance, it is desirable to have the smallest possible space usage also when the current number of inserted elements is smaller than n. Our contribution consists of the following results: (1) We show a super-linear gap between the space complexity when the size is known in advance and the space complexity when the size is not known in advance. When the size is known in advance, it is well-known that Θ(n log(1/ε)) bits of space are necessary and sufficient (Bloom '70, Carter et al. '78). However, when the size is not known in advance, we prove that at least (1 - o(1))n log(1/ε) + Ω(n log log n) bits of space must be used. In particular, the average number of bits per element must depend on the size of the set. (2) We show that our space lower bound is tight, and can even be matched by a highly efficient data structure. We present a data structure that uses (1+o(1))n log(1/ε) + O(n log log n) bits of space for approximating any set of any size n, without having to know n in advance. Our data structure supports membership queries in constant time in the worst case with high probability, and supports insertions in expected amortized constant time. Moreover, it can be "de-amortized" to also support insertions in constant time in the worst case with high probability by only increasing its space usage to O(n log(1/ε) + n log log n) bits.
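
A folklore approach to the unknown-n setting, a chain of filters of doubling capacity with shrinking false-positive rates, illustrates where extra bits per element can come from: with level-l rate eps/((l+1)(l+2)) the rates sum to eps while each element pays only O(log l) = O(log log n) additional bits. A sketch built on a simple fingerprint filter, not the paper's optimal structure:

```python
import math

class FingerprintFilter:
    # Stores one fingerprint per key modulo m: for a non-member the
    # false-positive rate is at most (number stored) / m.
    def __init__(self, capacity, fp_rate, seed=0):
        self.capacity = capacity
        self.mod = max(2, math.ceil(capacity / fp_rate))
        self.seed = seed
        self.prints = set()

    def add(self, x):
        self.prints.add(hash((x, self.seed)) % self.mod)

    def __contains__(self, x):
        return hash((x, self.seed)) % self.mod in self.prints

class GrowingFilter:
    # Chain of filters with doubling capacity; level l gets false-positive
    # rate eps/((l+1)(l+2)), so the total error over all levels is <= eps.
    def __init__(self, eps):
        self.eps = eps
        self.filters = []
        self.in_last = 0

    def add(self, x):
        if not self.filters or self.in_last >= self.filters[-1].capacity:
            l = len(self.filters)
            rate = self.eps / ((l + 1) * (l + 2))
            self.filters.append(FingerprintFilter(8 * 2 ** l, rate, seed=l))
            self.in_last = 0
        self.filters[-1].add(x)
        self.in_last += 1

    def __contains__(self, x):
        return any(x in f for f in self.filters)
```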

SODA Conference 2012 Conference Paper

I/O-efficient data structures for colored range and prefix reporting

  • Kasper Green Larsen
  • Rasmus Pagh

Motivated by information retrieval applications, we consider the one-dimensional colored range reporting problem in rank space. The goal is to build a static data structure for sets $C_1, \dots, C_m \subseteq \{1, \dots, \sigma\}$ that supports queries of the kind: given indices $a, b$, report the set $\bigcup_{a \leq i \leq b} C_i$. We study the problem in the I/O model, and show that there exists an optimal linear-space data structure that answers queries in $O(1 + k/B)$ I/Os, where $k$ denotes the output size and $B$ the disk block size in words. In fact, we obtain the same bound for the harder problem of three-sided orthogonal range reporting. In this problem, we are to preprocess a set of $n$ two-dimensional points in rank space, such that all points inside a query rectangle of the form $[x_1, x_2] \times (-\infty, y]$ can be reported. The best previous bounds for this problem are either $O(n \lg^2_B n)$ space and $O(1 + k/B)$ query I/Os, or $O(n)$ space and $O(\lg^{(h)}_B n + k/B)$ query I/Os, where $\lg^{(h)}_B n$ is the base-$B$ logarithm iterated $h$ times, for any constant integer $h$. The previous bounds are both achieved under the indivisibility assumption, while our solution exploits the full capabilities of the underlying machine. Breaking the indivisibility assumption thus provides us with cleaner and optimal bounds. Our results also imply an optimal solution to the following colored prefix reporting problem. Given a set $S$ of strings, each $O(1)$ disk blocks in length, and a function $c: S \to 2^{\{1, \dots, \sigma\}}$, support queries of the kind: given a string $p$, report the set $\bigcup_{x \in S \cap p*} c(x)$, where $p*$ denotes the set of strings with prefix $p$. Finally, we consider the possibility of top-$k$ extensions of this result, and present a simple solution in a model that allows non-blocked I/O.

STOC Conference 2005 Conference Paper

On dynamic range reporting in one dimension

  • Christian Worm Mortensen
  • Rasmus Pagh
  • Mihai Patrascu

We consider the problem of maintaining a dynamic set of integers and answering queries of the form: report a point (equivalently, all points) in a given interval. Range searching is a natural and fundamental variant of integer search, and can be solved using predecessor search. However, for a RAM with $w$-bit words, we show how to perform updates in $O(\lg w)$ time and answer queries in $O(\lg \lg w)$ time. The update time is identical to the van Emde Boas structure, but the query time is exponentially faster. Existing lower bounds show that achieving our query time for predecessor search requires doubly-exponentially slower updates. We present some arguments supporting the conjecture that our solution is optimal. Our solution is based on a new and interesting recursion idea which is "more extreme" than the van Emde Boas recursion. Whereas van Emde Boas uses a simple recursion (repeated halving) on each path in a trie, we use a nontrivial, van Emde Boas-like recursion on every such path. Despite this, our algorithm is quite clean when seen from the right angle. To achieve linear space for our data structure, we solve a problem which is of independent interest. We develop the first scheme for dynamic perfect hashing requiring sublinear space. This gives a dynamic Bloomier filter (a storage scheme for sparse vectors) which uses low space. We strengthen previous lower bounds to show that these results are optimal.

STOC Conference 2001 Conference Paper

On the cell probe complexity of membership and perfect hashing

  • Rasmus Pagh

We study two fundamental static data structure problems, membership and perfect hashing, in Yao's cell probe model. The first space and bit probe optimal worst case upper bound is given for the membership problem. We also give a new efficient membership scheme where the query algorithm makes just one adaptive choice, and probes a total of three words. A lower bound shows that two word probes generally do not suffice. For minimal perfect hashing we show a tight bit probe lower bound, and give a simple scheme achieving this performance, making just one adaptive choice. Linear range perfect hashing is shown to be implementable with the same number of bit probes, of which just one is adaptive. In contrast, we establish that for sufficiently sparse sets, non-adaptive perfect hashing needs exponentially more bit probes. This is the first such separation of adaptivity and non-adaptivity.