Arrow Research search

Author name cluster

Chris Schwiegelshohn

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

27 papers
2 author rows

Possible papers (27)

STOC 2025 · Conference Paper

Almost Optimal PAC Learning for k-Means

  • Vincent Cohen-Addad
  • Silvio Lattanzi
  • Chris Schwiegelshohn

Given a set of points, the $k$-means clustering problem consists of finding a partition of the points into $k$ clusters such that the sum of squared Euclidean distances between the points and their assigned centers is minimized. In this paper, we consider learning bounds for this problem. That is, given a set of $n$ samples $P$ drawn independently from some unknown but fixed distribution $D$, how quickly does a solution computed on $P$ converge to the optimal clustering of $D$? The currently fastest provable rate of convergence, of the order $\sqrt{\frac{k}{n}\min\left(k,\log k\log^2(n/k)\right)}$, is due to [Appert, Catoni, 2021], with the best known lower bound being of the order $\sqrt{k/n}$ due to [Bartlett, Linder, and Lugosi, 1998]. We give learning bounds with both optimal dependency on the sample size $n$ and nearly optimal dependency on $k$ by proving a convergence rate of the order $\sqrt{k\log k/n}$.
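
To make the objective concrete, here is a minimal sketch (not the paper's algorithm) that computes the $k$-means cost and compares centers trained on a sample with centers trained on the full point set; the synthetic data, sample size, and use of scikit-learn's KMeans are illustrative assumptions.

```python
# Minimal sketch: the k-means cost and the empirical-vs-population gap that the
# learning bounds quantify. Data, sample size, and solver are illustrative only.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_cost(points, centers):
    """Sum of squared Euclidean distances to the nearest center."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

rng = np.random.default_rng(0)
population = rng.normal(size=(20_000, 5))       # stand-in for the distribution D
sample = population[rng.choice(len(population), size=1_000, replace=False)]

k = 10
centers_sample = KMeans(n_clusters=k, n_init=10, random_state=0).fit(sample).cluster_centers_
centers_full = KMeans(n_clusters=k, n_init=10, random_state=0).fit(population).cluster_centers_

# Excess cost of the sample-trained solution on the "population"; the paper's
# bound controls this kind of gap at a rate of roughly sqrt(k log k / n).
print(kmeans_cost(population, centers_sample) / kmeans_cost(population, centers_full))
```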

ICML 2025 · Conference Paper

Distributed Differentially Private Data Analytics via Secure Sketching

  • Jakob Burkhardt
  • Hannah Keller
  • Claudio Orlandi
  • Chris Schwiegelshohn

We introduce the linear-transformation model, a distributed model of differentially private data analysis. Clients have access to a trusted platform capable of applying a public matrix to their inputs. Such computations can be securely distributed across multiple servers using simple and efficient secure multiparty computation techniques. The linear-transformation model serves as an intermediate model between the highly expressive central model and the minimal local model. In the central model, clients have access to a trusted platform capable of applying any function to their inputs. However, this expressiveness comes at a cost, as it is often expensive to distribute such computations, leading to the central model typically being implemented by a single trusted server. In contrast, the local model assumes no trusted platform, which forces clients to add significant noise to their data. The linear-transformation model avoids the single point of failure for privacy present in the central model, while also mitigating the high noise required in the local model. We demonstrate that linear transformations are very useful for differential privacy, allowing for the computation of linear sketches of input data. These sketches largely preserve utility for tasks such as private low-rank approximation and private ridge regression, while introducing only minimal error, critically independent of the number of clients.
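
A schematic sketch of the linear-transformation idea described above, not the paper's protocol: a public matrix is applied to the clients' stacked inputs and only a noisy linear sketch is released. The sketch dimension, row clipping, and noise scale below are illustrative assumptions.

```python
# Schematic sketch (not the paper's protocol): clients' rows are clipped, a public
# matrix is applied across clients, and Gaussian noise is added before release.
import numpy as np

rng = np.random.default_rng(1)
n_clients, d, sketch_dim = 1_000, 50, 20

X = rng.normal(size=(n_clients, d))                              # one row per client
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))   # clip rows to norm <= 1

S = rng.normal(size=(sketch_dim, n_clients)) / np.sqrt(sketch_dim)  # public matrix

sigma = 2.0                                                      # placeholder noise scale
noisy_sketch = S @ X + rng.normal(scale=sigma, size=(sketch_dim, d))

# Downstream analyses (e.g., ridge regression or low-rank approximation) would
# then work only with noisy_sketch, never with the raw client rows.
print(noisy_sketch.shape)
```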

ICML 2025 · Conference Paper

Improved Learning via k-DTW: A Novel Dissimilarity Measure for Curves

  • Amer Krivosija
  • Alexander Munteanu
  • André Nusser
  • Chris Schwiegelshohn

This paper introduces $k$-Dynamic Time Warping ($k$-DTW), a novel dissimilarity measure for polygonal curves. $k$-DTW has stronger metric properties than Dynamic Time Warping (DTW) and is more robust to outliers than the Fréchet distance, which are the two gold standards of dissimilarity measures for polygonal curves. We show interesting properties of $k$-DTW and give an exact algorithm as well as a $(1+\varepsilon)$-approximation algorithm for $k$-DTW by a parametric search for the $k$-th largest matched distance. We prove the first dimension-free learning bounds for curves and further learning theoretic results. $k$-DTW not only admits smaller sample size than DTW for the problem of learning the median of curves, where some factors depending on the curves’ complexity $m$ are replaced by $k$, but we also show a surprising separation on the associated Rademacher and Gaussian complexities: $k$-DTW admits strictly smaller bounds than DTW, by a factor $\tilde\Omega(\sqrt{m})$ when $k\ll m$. We complement our theoretical findings with an experimental illustration of the benefits of using $k$-DTW for clustering and nearest neighbor classification.
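
For background, here is a textbook dynamic program for classical DTW between two polygonal curves; $k$-DTW itself, which the paper defines via the $k$-th largest matched distance, is not reproduced here.

```python
# Reference implementation of classical DTW (one common variant that sums Euclidean
# distances along the optimal alignment). Not the paper's k-DTW measure.
import numpy as np

def dtw(P, Q):
    """Dynamic Time Warping cost between curves P (m x d) and Q (n x d)."""
    m, n = len(P), len(Q)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = np.linalg.norm(P[i - 1] - Q[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]

P = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.1]])
Q = np.array([[0.0, 0.1], [1.1, 0.0], [2.0, 0.0], [2.5, 0.0]])
print(dtw(P, Q))
```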

FOCS 2025 · Conference Paper

Nearly Tight Regret Bounds for Profit Maximization in Bilateral Trade

  • Simone Di Gregorio 0001
  • Paul Dütting
  • Federico Fusco 0001
  • Chris Schwiegelshohn

Bilateral trade models the task of intermediating between two strategic agents, a seller and a buyer, willing to trade a good for which they hold private valuations. We study this problem from the perspective of a broker, in a regret minimization framework. At each time step, a new seller and buyer arrive, and the broker has to propose a mechanism that is incentive-compatible and individually rational, with the goal of maximizing profit. We propose a learning algorithm that guarantees a nearly tight regret in the stochastic setting when seller and buyer valuations are drawn i.i.d. from a fixed and possibly correlated unknown distribution. We further show that it is impossible to achieve sublinear regret in the non-stationary scenario where valuations are generated upfront by an adversary. Our ambitious benchmark for these results is the best incentive-compatible and individually rational mechanism. This separates us from previous works on efficiency maximization in bilateral trade, where the benchmark is a single number: the best fixed price in hindsight. A particular challenge we face is that uniform convergence for all mechanisms’ profits is impossible. We overcome this difficulty via a careful chaining analysis that proves convergence for a provably near-optimal mechanism at (essentially) optimal rate. We further showcase the broader applicability of our techniques by providing nearly optimal results for the joint ads problem.
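
A sketch of the trading environment described above: the broker posts a seller price and a buyer price, and profit is earned when both sides accept. The grid search only illustrates the simpler fixed-price benchmark mentioned in the abstract; it is neither the paper's learning algorithm nor its richer mechanism benchmark, and the valuation distribution is invented for illustration.

```python
# Posted-price bilateral trade: offer price p_s to the seller and p_b >= p_s to the
# buyer; a trade happens when seller value <= p_s and buyer value >= p_b, giving
# profit p_b - p_s. Grid search over fixed price pairs, purely for illustration.
import numpy as np

rng = np.random.default_rng(9)
T = 100_000
sellers = rng.uniform(0, 1, size=T)                               # seller valuations
buyers = np.clip(sellers + rng.uniform(0, 0.5, size=T), 0, 1)     # correlated buyer valuations

def avg_profit(p_s, p_b):
    trade = (sellers <= p_s) & (buyers >= p_b)
    return trade.mean() * (p_b - p_s)

grid = np.linspace(0, 1, 21)
best = max((avg_profit(ps, pb), ps, pb) for ps in grid for pb in grid if pb >= ps)
print(best)   # (average profit, best seller price, best buyer price)
```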

ICML 2025 · Conference Paper

Randomized Dimensionality Reduction for Euclidean Maximization and Diversity Measures

  • Jie Gao 0001
  • Rajesh Jayaram
  • Benedikt Kolbe
  • Shay Sapir
  • Chris Schwiegelshohn
  • Sandeep Silwal
  • Erik Waingarten

Randomized dimensionality reduction is a widely-used algorithmic technique for speeding up large-scale Euclidean optimization problems. In this paper, we study dimension reduction for a variety of maximization problems, including max-matching, max-spanning tree, as well as various measures for dataset diversity. For these problems, we show that the effect of dimension reduction is intimately tied to the doubling dimension $\lambda_X$ of the underlying dataset $X$—a quantity measuring the intrinsic dimensionality of point sets. Specifically, the dimension required is $O(\lambda_X)$, which we also show is necessary for some of these problems. This is in contrast to classical dimension reduction results, whose dependence grows with the dataset size $|X|$. We also provide empirical results validating the quality of solutions found in the projected space, as well as speedups due to dimensionality reduction.
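
A minimal sketch of the experiment style the abstract alludes to: project the points with a random Gaussian map and compare a simple diversity measure before and after. The target dimension, data, and choice of measure (sum of pairwise distances) are illustrative assumptions; the paper's guarantee ties the required dimension to the doubling dimension of the point set.

```python
# Random Gaussian projection, then compare a diversity measure before and after.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
n, d, target_dim = 500, 200, 20

X = rng.normal(size=(n, d))
G = rng.normal(size=(d, target_dim)) / np.sqrt(target_dim)   # JL-style projection
Y = X @ G

diversity_before = pdist(X).sum()                            # sum of pairwise distances
diversity_after = pdist(Y).sum()
print(diversity_after / diversity_before)                    # near 1 when the measure is preserved
```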

NeurIPS 2025 · Conference Paper

Simple and Optimal Sublinear Algorithms for Mean Estimation

  • Beatrice Bertolotti
  • Matteo Russo
  • Chris Schwiegelshohn
  • Sudarshan Shyam

We study the sublinear multivariate mean estimation problem in $d$-dimensional Euclidean space. Specifically, we aim to find the mean $\mu$ of a ground point set $A$, which minimizes the sum of squared Euclidean distances of the points in $A$ to $\mu$. We first show that a multiplicative $(1+\varepsilon)$ approximation to $\mu$ can be found with probability $1-\delta$ using $O(\varepsilon^{-1}\log \delta^{-1})$ many independent uniform random samples, and provide a matching lower bound. Furthermore, we give two estimators with optimal sample complexity that can be computed in optimal running time for extracting a suitable approximate mean: 1. The coordinate-wise median of $\log \delta^{-1}$ sample means of sample size $\varepsilon^{-1}$. As a corollary, we also show improved convergence rates for this estimator for estimating means of multivariate distributions. 2. The geometric median of $\log \delta^{-1}$ sample means of sample size $\varepsilon^{-1}$. To compute a solution efficiently, we design a novel and simple gradient descent algorithm that is significantly faster for our specific setting than all other known algorithms for computing geometric medians. In addition, we propose an order statistics approach that is empirically competitive with these algorithms, has an optimal sample complexity and matches the running time up to lower order terms. We finally provide an extensive experimental evaluation among several estimators which concludes that the geometric-median-of-means-based approach is typically the most competitive in practice.
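
A minimal sketch of the two estimators named above: the coordinate-wise median of sample means and the geometric median of sample means. The group counts and sizes loosely follow the abstract's recipe (about $\log \delta^{-1}$ groups of size $\varepsilon^{-1}$), but the constants, the heavy-tailed test data, and the use of Weiszfeld iterations for the geometric median (rather than the paper's own gradient descent or order-statistics methods) are assumptions.

```python
# Median-of-means estimators for the mean of a ground point set A.
import numpy as np

def geometric_median(points, iters=100):
    """Weiszfeld iterations (a standard method; the paper designs its own solver)."""
    c = points.mean(axis=0)
    for _ in range(iters):
        dist = np.maximum(np.linalg.norm(points - c, axis=1), 1e-12)
        w = 1.0 / dist
        c = (points * w[:, None]).sum(axis=0) / w.sum()
    return c

rng = np.random.default_rng(3)
A = rng.standard_t(df=2.5, size=(100_000, 10))       # ground point set (heavy-tailed)
true_mean = A.mean(axis=0)                           # exact minimizer of sum of squared distances

n_groups, group_size = 9, 200                        # ~ log(1/delta) groups of size ~ 1/eps
idx = rng.choice(len(A), size=(n_groups, group_size))
group_means = A[idx].mean(axis=1)                    # one sample mean per group

est_coord = np.median(group_means, axis=0)           # coordinate-wise median of means
est_geo = geometric_median(group_means)              # geometric median of means
print(np.linalg.norm(est_coord - true_mean), np.linalg.norm(est_geo - true_mean))
```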

TIST 2024 · Journal Article

Fair Projections as a Means toward Balanced Recommendations

  • Aris Anagnostopoulos
  • Luca Becchetti
  • Matteo Böhm
  • Adriano Fazzone
  • Stefano Leonardi
  • Cristina Menghini
  • Chris Schwiegelshohn

The goal of recommender systems is to provide users with suggestions that match their interests, with the eventual goal of increasing their satisfaction, as measured by the number of transactions (clicks, purchases, and so forth). Often, this leads to providing recommendations that are of a particular type. For some contexts (e.g., browsing videos for information) this may be undesirable, as it may enforce the creation of filter bubbles. This is because of the existence of underlying bias in the input data of prior user actions. Reducing hidden bias in the data and ensuring fairness in algorithmic data analysis has recently received significant attention. In this article, we consider both the densest subgraph and the \(k\)-clustering problem, two primitives that are being used by some recommender systems. We are given a coloring on the nodes, respectively the points, and aim to compute a fair solution \(S\), consisting of a subgraph or a clustering, such that none of the colors is disparately impacted by the solution. Unfortunately, introducing fair solutions typically makes these problems substantially more difficult. Unlike the unconstrained densest subgraph problem, which is solvable in polynomial time, the fair densest subgraph problem is NP-hard even to approximate, which means that under the standard computational model it is probably impossible to solve (or even approximate sufficiently well) in polynomial time. For \(k\)-clustering, the fairness constraints make the problem very similar to capacitated clustering, which is a notoriously hard problem to even approximate. Despite such negative premises, we are able to provide positive results in important use cases. In particular, we are able to prove that a suitable spectral embedding allows recovery of an almost optimal, fair, dense subgraph hidden in the input data, whenever one is present, a result that is further supported by experimental evidence. We also give a polynomial-time \(2\)-approximation algorithm for the fair densest subgraph problem, assuming that there exist only two colors and both colors occur equally often in the graph. This result turns out to be optimal assuming the small set expansion hypothesis. For fair \(k\)-clustering, we show that we can recover high-quality fair clusterings effectively and efficiently. For the special case of \(k\)-median and \(k\)-center, we offer additional, fast and simple approximation algorithms as well as new hardness results. The above theoretical findings drive the design of heuristics, which we experimentally evaluate on a scenario based on real data, in which our aim is to strike a good balance between diversity and highly correlated items from Amazon co-purchasing graphs and Facebook contacts. We additionally evaluate our algorithmic solutions for the fair \(k\)-median problem through experiments on various real-world datasets.

AAAI 2024 · Conference Paper

Low-Distortion Clustering with Ordinal and Limited Cardinal Information

  • Jakob Burkhardt
  • Ioannis Caragiannis
  • Karl Fehrs
  • Matteo Russo
  • Chris Schwiegelshohn
  • Sudarshan Shyam

Motivated by recent work in computational social choice, we extend the metric distortion framework to clustering problems. Given a set of n agents located in an underlying metric space, our goal is to partition them into k clusters, optimizing some social cost objective. The metric space is defined by a distance function d between the agent locations. Information about d is available only implicitly via n rankings, through which each agent ranks all other agents in terms of their distance from her. Still, even though no cardinal information (i.e., the exact distance values) is available, we would like to evaluate clustering algorithms in terms of social cost objectives that are defined using d. This is done using the notion of distortion, which measures how far from optimality a clustering can be, taking into account all underlying metrics that are consistent with the ordinal information available. Unfortunately, the most important clustering objectives (e.g., those used in the well-known k-median and k-center problems) do not admit algorithms with finite distortion. To sidestep this disappointing fact, we follow two alternative approaches: We first explore whether resource augmentation can be beneficial. We consider algorithms that use more than k clusters but compare their social cost to that of the optimal k-clusterings. We show that using exponentially (in terms of k) many clusters, we can get low (constant or logarithmic) distortion for the k-center and k-median objectives. Interestingly, such an exponential blowup is shown to be necessary. More importantly, we explore whether limited cardinal information can be used to obtain better results. Somewhat surprisingly, for k-median and k-center, we show that a number of queries that is polynomial in k and only logarithmic in n (i.e., only sublinear in the number of agents for the most relevant scenarios in practice) is enough to get constant distortion.

ICML 2024 · Conference Paper

Optimal Coresets for Low-Dimensional Geometric Median

  • Peyman Afshani
  • Chris Schwiegelshohn

We investigate coresets for approximating the cost with respect to median queries. In this problem, we are given a set of points $P\subset \mathbb{R}^d$, and the cost of a median query $c\in \mathbb{R}^d$ is $\sum_{p\in P} \|p-c\|$. Our goal is to compute a small weighted summary $S\subset P$ such that the cost of any median query is approximated within a multiplicative $(1\pm\varepsilon)$ factor. We provide matching upper and lower bounds on the number of points contained in $S$ of the order $\tilde{\Theta}\left(\varepsilon^{-d/(d+1)}\right)$.
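
A minimal sketch of what the coreset guarantee asserts: a weighted subset whose median-query cost tracks the full cost within $(1\pm\varepsilon)$ for every query. The subset below is built by naive uniform sampling with weights $n/m$ purely for illustration; it is not the paper's (optimal) construction.

```python
# Evaluate median queries on a weighted subset versus the full point set.
import numpy as np

def median_cost(points, c, weights=None):
    d = np.linalg.norm(points - c, axis=1)
    return d.sum() if weights is None else (weights * d).sum()

rng = np.random.default_rng(4)
P = rng.normal(size=(50_000, 2))
m = 500
S = P[rng.choice(len(P), size=m, replace=False)]   # naive uniform sample, not the paper's coreset
w = np.full(m, len(P) / m)                         # uniform weights n/m

for c in rng.normal(size=(3, 2)):                  # a few median queries
    print(median_cost(S, c, w) / median_cost(P, c))
```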

FOCS 2024 · Conference Paper

Sensitivity Sampling for k-Means: Worst Case and Stability Optimal Coreset Bounds

  • Nikhil Bansal 0001
  • Vincent Cohen-Addad
  • Milind Prabhu
  • David Saulpic
  • Chris Schwiegelshohn

Coresets are arguably the most popular compression paradigm for center-based clustering objectives such as $k$-means. Given a point set $P$, a coreset $\Omega$ is a small, weighted summary that preserves the cost of all candidate solutions $S$ up to a $(1\pm\varepsilon)$ factor. For $k$-means in $d$-dimensional Euclidean space, the cost of a solution $S$ is $\sum_{p\in P}\min_{s\in S}\Vert p-s\Vert^2$. A very popular method for coreset construction, both in theory and practice, is Sensitivity Sampling, where points are sampled in proportion to their importance. We show that Sensitivity Sampling yields optimal coresets of size $\widetilde{O}(k/\varepsilon^{2}\cdot\min(\sqrt{k}, \varepsilon^{-2}))$ for worst-case instances. Uniquely among all known coreset algorithms, for well-clusterable data sets with $\Omega(1)$ cost stability, Sensitivity Sampling gives coresets of size $\widetilde{O}(k/\varepsilon^{2})$, improving over the worst-case lower bound. Notably, Sensitivity Sampling does not have to know the cost stability in order to exploit it: it is appropriately sensitive to the clusterability of the data set while being oblivious to it. We also show that any coreset for stable instances consisting of only input points must have size $\Omega(k/\varepsilon^{2})$. Our results for Sensitivity Sampling also extend to the $k$-median problem and to more general metric spaces.
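
A simplified sketch of the sensitivity-sampling template: points are sampled with probability proportional to an importance score and reweighted by the inverse sampling probability. The score used below (cost share relative to a rough $k$-means solution plus a uniform per-cluster term) is a common proxy, not the exact sensitivities analyzed in the paper, and the parameters are arbitrary.

```python
# Importance (sensitivity-style) sampling to build a weighted k-means coreset.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
P = rng.normal(size=(20_000, 10))
k, m = 20, 1_000                                    # number of clusters, coreset size

km = KMeans(n_clusters=k, n_init=5, random_state=0).fit(P)
dist2 = ((P - km.cluster_centers_[km.labels_]) ** 2).sum(axis=1)
cluster_sizes = np.bincount(km.labels_, minlength=k)

score = dist2 / dist2.sum() + 1.0 / (k * cluster_sizes[km.labels_])   # importance proxy
prob = score / score.sum()

idx = rng.choice(len(P), size=m, p=prob)            # sample proportional to importance
coreset, weights = P[idx], 1.0 / (m * prob[idx])    # reweight by inverse probability
print(coreset.shape, weights.sum())                 # weights roughly sum to n
```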

ICML 2024 · Conference Paper

Sparse Dimensionality Reduction Revisited

  • Mikael Møller Høgsgaard
  • Lior Kamma
  • Kasper Green Larsen
  • Jelani Nelson
  • Chris Schwiegelshohn

The sparse Johnson-Lindenstrauss transform is one of the central techniques in dimensionality reduction. It supports embedding a set of $n$ points in $\mathbb{R}^d$ into $m=O(\varepsilon^{-2} \ln n)$ dimensions while preserving all pairwise distances to within $1 \pm \varepsilon$. Each input point $x$ is embedded to $Ax$, where $A$ is an $m \times d$ matrix having $s$ non-zeros per column, allowing for an embedding time of $O(s \|x\|_0)$. Since the sparsity of $A$ governs the embedding time, much work has gone into improving the sparsity $s$. The current state-of-the-art by Kane and Nelson (2014) shows that $s = O(\varepsilon^{-1} \ln n)$ suffices. This is almost matched by a lower bound of $s = \Omega(\varepsilon^{-1} \ln n/\ln(1/\varepsilon))$ by Nelson and Nguyen (2013) for $d=\Omega(n)$. Previous work thus suggests that we have near-optimal embeddings. In this work, we revisit sparse embeddings and present a sparser embedding for instances in which $d = n^{o(1)}$, which in many applications is realistic. Formally, our embedding achieves $s = O(\varepsilon^{-1}(\ln n/\ln(1/\varepsilon)+\ln^{2/3}n \ln^{1/3} d))$. We also complement our analysis by strengthening the lower bound of Nelson and Nguyen to hold also when $d \ll n$, thereby matching the first term in our new sparsity upper bound. Finally, we also improve the sparsity of the best oblivious subspace embeddings for optimal embedding dimensionality.
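
A minimal sketch of a sparse JL-style embedding as described above: an $m \times d$ matrix with $s$ random $\pm 1/\sqrt{s}$ entries per column, applied as $x \mapsto Ax$. The parameters are illustrative; the paper's contribution is the improved bound on $s$ itself, not this construction.

```python
# Sparse random sign matrix with s non-zeros per column, used as a JL embedding.
import numpy as np

rng = np.random.default_rng(6)
n, d = 2_000, 5_000
m, s = 400, 8

A = np.zeros((m, d))
for j in range(d):
    rows = rng.choice(m, size=s, replace=False)              # s non-zero rows in column j
    A[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)

X = rng.normal(size=(n, d))
Y = X @ A.T                                                  # embed each point x -> Ax

i, j = 0, 1
print(np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j]))   # close to 1
```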

FOCS 2023 · Conference Paper

Deterministic Clustering in High Dimensional Spaces: Sketches and Approximation

  • Vincent Cohen-Addad
  • David Saulpic
  • Chris Schwiegelshohn

In all state-of-the-art sketching and coreset techniques for clustering, as well as in the best known fixed-parameter tractable approximation algorithms, randomness plays a key role. For the classic k-median and k-means problems, there is no known deterministic dimensionality reduction procedure or coreset construction that avoids an exponential dependency on the input dimension d, the precision parameter $\varepsilon^{-1}$, or k. Furthermore, there is no coreset construction that succeeds with probability $1-1/n$ and whose size does not depend on the number of input points, n. This has led researchers in the area to ask what is the power of randomness for clustering sketches [Feldman, WIREs Data Mining Knowl. Discov. '20]. Similarly, the best approximation ratios achievable deterministically without a complexity exponential in the dimension are $1+\sqrt{2}$ for k-median [Cohen-Addad, Esfandiari, Mirrokni, Narayanan, STOC '22] and 6.12903 for k-means [Grandoni, Ostrovsky, Rabani, Schulman, Venkat, Inf. Process. Lett. '22]. Those are the best results, even when allowing a complexity FPT in the number of clusters k: this stands in sharp contrast with the $(1+\varepsilon)$-approximation achievable in that case, when allowing randomization. In this paper, we provide deterministic sketch constructions for clustering whose size bounds are close to the best-known randomized ones. We show how to compute a dimension reduction onto $\varepsilon^{-O(1)} \log k$ dimensions in time $k^{O\left(\varepsilon^{-O(1)}+\log \log k\right)}$ poly$(nd)$, and how to build a coreset of size $O\left(k^{2} \log^{3} k \cdot \varepsilon^{-O(1)}\right)$ in time $2^{\varepsilon^{O(1)} k \log^{3} k}+k^{O\left(\varepsilon^{-O(1)}+\log \log k\right)}$ poly$(nd)$. In the case where k is small, this answers an open question of [Feldman, WIDM '20] and [Munteanu and Schwiegelshohn, Künstliche Intell. '18] on whether it is possible to efficiently compute coresets deterministically. We also construct a deterministic algorithm for computing a $(1+\varepsilon)$-approximation to k-median and k-means in high-dimensional Euclidean spaces in time $2^{k^{2} \log^{3} k / \varepsilon^{O(1)}}$ poly$(nd)$, close to the best randomized complexity of $2^{(k/\varepsilon)^{O(1)}} nd$ (see [Kumar, Sabharwal, Sen, JACM '10] and [Bhattacharya, Jaiswal, Kumar, TCS '18]). Furthermore, our new insights on sketches also yield a randomized coreset construction that uses uniform sampling and immediately improves over the recent results of [Braverman et al., FOCS '22] by a factor of k.

NeurIPS 2023 · Conference Paper

On Generalization Bounds for Projective Clustering

  • Maria Sofia Bucarelli
  • Matilde Larsen
  • Chris Schwiegelshohn
  • Mads Toftrup

Given a set of points, clustering consists of finding a partition of the point set into $k$ clusters such that the center to which a point is assigned is as close as possible. Most commonly, centers are points themselves, which leads to the famous $k$-median and $k$-means objectives. One may also choose centers to be $j$-dimensional subspaces, which gives rise to subspace clustering. In this paper, we consider learning bounds for these problems. That is, given a set of $n$ samples $P$ drawn independently from some unknown, but fixed distribution $\mathcal{D}$, how quickly does a solution computed on $P$ converge to the optimal clustering of $\mathcal{D}$? We give several near-optimal results. In particular, 1. For center-based objectives, we show a convergence rate of $\tilde{O}\left(\sqrt{{k}/{n}}\right)$. This matches the known optimal bounds of [Fefferman, Mitter, and Narayanan, Journal of the American Mathematical Society 2016] and [Bartlett, Linder, and Lugosi, IEEE Trans. Inf. Theory 1998] for $k$-means and extends them to other important objectives such as $k$-median. 2. For subspace clustering with $j$-dimensional subspaces, we show a convergence rate of $\tilde{O}\left(\sqrt{{(kj^2)}/{n}}\right)$. These are the first provable bounds for most of these problems. For the specific case of projective clustering, which generalizes $k$-means, we show that a convergence rate of $\Omega\left(\sqrt{{(kj)}/{n}}\right)$ is necessary, thereby proving that the bounds from [Fefferman, Mitter, and Narayanan, Journal of the American Mathematical Society 2016] are essentially optimal.

NeurIPS 2022 · Conference Paper

Improved Coresets for Euclidean $k$-Means

  • Vincent Cohen-Addad
  • Kasper Green Larsen
  • David Saulpic
  • Chris Schwiegelshohn
  • Omar Ali Sheikh-Omar

Given a set of $n$ points in $d$ dimensions, the Euclidean $k$-means problem (resp. Euclidean $k$-median) consists of finding $k$ centers such that the sum of squared distances (resp. sum of distances) from every point to its closest center is minimized. The arguably most popular way of dealing with this problem in the big data setting is to first compress the data by computing a weighted subset known as a coreset and then run any algorithm on this subset. The guarantee of the coreset is that for any candidate solution, the ratio between the coreset cost and the cost of the original instance is within a $(1\pm \varepsilon)$ factor. The current state-of-the-art coreset size is $\tilde O(\min(k^{2} \cdot \varepsilon^{-2}, k\cdot \varepsilon^{-4}))$ for Euclidean $k$-means and $\tilde O(\min(k^{2} \cdot \varepsilon^{-2}, k\cdot \varepsilon^{-3}))$ for Euclidean $k$-median. The best known lower bound for both problems is $\Omega(k\varepsilon^{-2})$. In this paper, we improve these bounds to $\tilde O(\min(k^{3/2} \cdot \varepsilon^{-2}, k\cdot \varepsilon^{-4}))$ for Euclidean $k$-means and $\tilde O(\min(k^{4/3} \cdot \varepsilon^{-2}, k\cdot \varepsilon^{-3}))$ for Euclidean $k$-median. In particular, ours is the first provable bound that breaks through the $k^2$ barrier while retaining an optimal dependency on $\varepsilon$.

FOCS 2022 · Conference Paper

The Power of Uniform Sampling for Coresets

  • Vladimir Braverman
  • Vincent Cohen-Addad
  • Shaofeng H. -C. Jiang
  • Robert Krauthgamer
  • Chris Schwiegelshohn
  • Mads Bech Toftrup
  • Xuan Wu 0002

Motivated by practical generalizations of the classic k-median and k-means objectives, such as clustering with size constraints, fair clustering, and Wasserstein barycenter, we introduce a meta-theorem for designing coresets for constrained-clustering problems. The meta-theorem reduces the task of coreset construction to one on a bounded number of ring instances with a much-relaxed additive error. This reduction enables us to construct coresets using uniform sampling, in contrast to the widely-used importance sampling, and consequently we can easily handle constrained objectives. Notably and perhaps surprisingly, this simpler sampling scheme can yield coresets whose size is independent of n, the number of input points. Our technique yields smaller coresets, and sometimes the first coresets, for a large number of constrained clustering problems, including capacitated clustering, fair clustering, Euclidean Wasserstein barycenter, clustering in minor-excluded graphs, and polygon clustering under Fréchet and Hausdorff distance. Finally, our technique also yields smaller coresets for 1-median in low-dimensional Euclidean spaces, specifically of size $\tilde{O}(\varepsilon^{-15})$ in $\mathbb{R}^{2}$ and $\tilde{O}(\varepsilon^{-16})$ in $\mathbb{R}^{3}$.

NeurIPS 2021 · Conference Paper

Improved Coresets and Sublinear Algorithms for Power Means in Euclidean Spaces

  • Vincent Cohen-Addad
  • David Saulpic
  • Chris Schwiegelshohn

In this paper, we consider the problem of finding high-dimensional power means: given a set $A$ of $n$ points in $\mathbb{R}^d$, find the point $m$ that minimizes the sum of Euclidean distances, raised to the power $z$, over all input points. Special cases of this problem include the well-known Fermat-Weber problem -- or geometric median problem -- where $z = 1$, the mean or centroid where $z=2$, and the Minimum Enclosing Ball problem, where $z = \infty$. We consider these problems in the big data regime. Here, we are interested in sampling as few points as possible such that we can accurately estimate $m$. More specifically, we consider sublinear algorithms as well as coresets for these problems. Sublinear algorithms have random query access to $A$, and the goal is to minimize the number of queries. Here, we show that $\tilde{O}(\varepsilon^{-z-3})$ samples are sufficient to achieve a $(1+\varepsilon)$ approximation, generalizing the results of Cohen, Lee, Miller, Pachocki, and Sidford [STOC '16] and Inaba, Katoh, and Imai [SoCG '94] to arbitrary $z$. Moreover, we show that this bound is nearly optimal, as any algorithm requires at least $\Omega(\varepsilon^{-z+1})$ queries to achieve said approximation. The second contribution is coresets for these problems, where we aim to find a small, weighted subset of the points which approximates the cost of every candidate point $c\in \mathbb{R}^d$ up to a $(1\pm\varepsilon)$ factor. Here, we show that $\tilde{O}(\varepsilon^{-2})$ points are sufficient, improving on the $\tilde{O}(d\varepsilon^{-2})$ bound of Feldman and Langberg [STOC '11] and the $\tilde{O}(\varepsilon^{-4})$ bound of Braverman, Jiang, Krauthgamer, and Wu [SODA '21].

SODA 2019 · Conference Paper

(1 + ε)-Approximate Incremental Matching in Constant Deterministic Amortized Time

  • Fabrizio Grandoni 0001
  • Stefano Leonardi 0001
  • Piotr Sankowski
  • Chris Schwiegelshohn
  • Shay Solomon

We study the matching problem in the incremental setting, where we are given a sequence of edge insertions and aim at maintaining a near-maximum cardinality matching of the graph with small update time. We present a deterministic algorithm that, for any constant ε > 0, maintains a (1 + ε)-approximate matching with constant amortized update time per insertion.

NeurIPS 2019 · Conference Paper

Fully Dynamic Consistent Facility Location

  • Vincent Cohen-Addad
  • Niklas Oskar Hjuler
  • Nikos Parotsidis
  • David Saulpic
  • Chris Schwiegelshohn

We consider classic clustering problems in fully dynamic data streams, where data elements can be both inserted and deleted. In this context, several parameters are of importance: (1) the quality of the solution after each insertion or deletion, (2) the time it takes to update the solution, and (3) how different consecutive solutions are. The question of obtaining efficient algorithms in this context for facility location, $k$-median, and $k$-means has been raised in a recent paper by Hubert-Chan et al. [WWW '18] and also appears as a natural follow-up to the online model with recourse studied by Lattanzi and Vassilvitskii [ICML '17] (i.e., in insertion-only streams). In this paper, we focus on general metric spaces and mainly on the facility location problem. We give an arguably simple algorithm that maintains a constant factor approximation, with $O(n\log n)$ update time and total recourse $O(n)$. This improves over the naive algorithm, which consists in recomputing a solution at each time step and can take up to $O(n^2)$ update time and $O(n^2)$ total recourse. These bounds are nearly optimal: in a general metric space, inserting a point takes $O(n)$ time to describe its distances to the other points, and we give a simple lower bound of $O(n)$ for the recourse. Moreover, we generalize this result to the $k$-median and $k$-means problems: our algorithm maintains a constant factor approximation in time $\widetilde{O}(n+k^2)$. We complement our analysis with experiments showing that the cost of the solution maintained by our algorithm at any time $t$ is very close to the cost of a solution obtained by quickly recomputing a solution from scratch at time $t$, while having a much better running time.

NeurIPS 2018 · Conference Paper

On Coresets for Logistic Regression

  • Alexander Munteanu
  • Chris Schwiegelshohn
  • Christian Sohler
  • David Woodruff

Coresets are one of the central methods to facilitate the analysis of large data. We continue a recent line of research applying the theory of coresets to logistic regression. First, we show the negative result that no strongly sublinear sized coresets exist for logistic regression. To deal with intractable worst-case instances we introduce a complexity measure $\mu(X)$, which quantifies the hardness of compressing a data set for logistic regression. $\mu(X)$ has an intuitive statistical interpretation that may be of independent interest. For data sets with bounded $\mu(X)$-complexity, we show that a novel sensitivity sampling scheme produces the first provably sublinear $(1\pm\varepsilon)$-coreset. We illustrate the performance of our method by comparing it to uniform sampling as well as to state-of-the-art methods in the area. The experiments are conducted on real-world benchmark data for logistic regression.
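
An illustration of the compression question the paper studies, not its sampling scheme: fit logistic regression on the full data and on a small weighted subsample, then compare accuracy. Uniform sampling is used here purely as the baseline the paper also compares against, and the synthetic data and subsample size are assumptions.

```python
# Full-data logistic regression versus a reweighted uniform subsample.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n, d, m = 50_000, 20, 500

X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(int)

full = LogisticRegression(max_iter=1000).fit(X, y)

idx = rng.choice(n, size=m, replace=False)
weights = np.full(m, n / m)                       # reweight the subsample to stand in for all n points
small = LogisticRegression(max_iter=1000).fit(X[idx], y[idx], sample_weight=weights)

print(full.score(X, y), small.score(X, y))
```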

FOCS 2017 · Conference Paper

On the Local Structure of Stable Clustering Instances

  • Vincent Cohen-Addad
  • Chris Schwiegelshohn

We study the classic k-median and k-means clustering objectives in the beyond-worst-case scenario. We consider three well-studied notions of structured data that aim at characterizing real-world inputs: Distribution Stability (introduced by Awasthi, Blum, and Sheffet, FOCS 2010); Spectral Separability (introduced by Kumar and Kannan, FOCS 2010); Perturbation Resilience (introduced by Bilu and Linial, ICS 2010). We prove structural results showing that inputs satisfying at least one of the conditions are inherently local. Namely, for any such input, any local optimum is close, both in terms of structure and in terms of objective value, to the global optimum. As a corollary, we obtain that the widely-used Local Search algorithm has strong performance guarantees for both the tasks of recovering the underlying optimal clustering and obtaining a clustering of small cost. This is a significant step toward understanding the success of local search heuristics in clustering applications.
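
A minimal single-swap local search for k-median, the generic heuristic whose behavior the result analyzes, not the paper's structural proof. Centers are restricted to input points, swap candidates are subsampled, and a swap is accepted whenever it lowers the cost; all parameters are illustrative.

```python
# Single-swap local search for k-median over a point set P (centers are input points).
import numpy as np

def kmedian_cost(P, centers):
    return np.linalg.norm(P[:, None, :] - P[centers][None, :, :], axis=2).min(axis=1).sum()

def local_search(P, k, rng, max_rounds=50):
    centers = list(rng.choice(len(P), size=k, replace=False))
    cost = kmedian_cost(P, centers)
    for _ in range(max_rounds):
        improved = False
        for i in range(k):
            for cand in rng.choice(len(P), size=20, replace=False):   # sampled swap candidates
                if cand in centers:
                    continue
                trial = centers.copy()
                trial[i] = cand
                c = kmedian_cost(P, trial)
                if c < cost:                                          # accept any improving swap
                    centers, cost, improved = trial, c, True
        if not improved:
            break
    return centers, cost

rng = np.random.default_rng(8)
P = np.vstack([rng.normal(loc=mu, size=(200, 2)) for mu in (0, 5, 10)])
print(local_search(P, k=3, rng=rng)[1])
```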