Arrow Research

Author name cluster

Arturs Backurs

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

25 papers
2 author rows

Possible papers

NeurIPS Conference 2025 Conference Paper

Struct-Bench: A Benchmark for Differentially Private Structured Text Generation

  • Shuaiqi Wang
  • Vikas Raunak
  • Arturs Backurs
  • Victor Reis
  • Pei Zhou
  • Sihao Chen
  • Longqi Yang
  • Zinan Lin

Differentially private (DP) synthetic data generation is a promising technique for utilizing private datasets that otherwise cannot be exposed for model training or other analytics. While much research literature has focused on generating private unstructured text and image data, in enterprise settings, structured data (e.g., tabular) is more common, often including natural language fields or components. Existing synthetic data evaluation techniques (e.g., FID) struggle to capture the structural properties and correlations of such datasets. In this work, we propose Struct-Bench, a framework and benchmark for evaluating synthetic datasets derived from structured datasets that contain natural language data. The Struct-Bench framework requires users to provide a representation of their dataset structure as a Context-Free Grammar (CFG). Our benchmark comprises 5 real-world and 2 synthetically generated datasets. We show that these datasets demonstrably present a great challenge even for state-of-the-art DP synthetic data generation methods. Struct-Bench provides reference implementations of different metrics and a leaderboard, offering a standardized platform to benchmark and investigate privacy-preserving synthetic data methods. We also present a case study showing how Struct-Bench improves the synthetic data quality of Private Evolution (PE) on structured data. The benchmark and the leaderboard have been made publicly available at https://struct-bench.github.io.
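
To make the CFG requirement concrete, here is a minimal, hypothetical sketch (using the nltk library; the grammar, field names, and record below are invented for illustration and are not taken from the benchmark) of checking whether one structured record conforms to a user-supplied grammar:

```python
import nltk  # assumes nltk is installed; grammar and record below are toy examples

# A toy grammar describing one record with a numeric field and a free-text field,
# in the spirit of the dataset-structure CFGs Struct-Bench asks users to provide.
grammar = nltk.CFG.fromstring("""
RECORD -> FIELD FIELD
FIELD -> 'rating' ':' NUM | 'review' ':' TEXT
NUM -> '1' | '2' | '3' | '4' | '5'
TEXT -> WORD | WORD TEXT
WORD -> 'great' | 'bad' | 'service'
""")
parser = nltk.ChartParser(grammar)
tokens = ['rating', ':', '5', 'review', ':', 'great', 'service']
print(any(True for _ in parser.parse(tokens)))   # True -> record matches the schema
```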

ICML Conference 2024 Conference Paper

Differentially Private Synthetic Data via Foundation Model APIs 2: Text

  • Chulin Xie
  • Zinan Lin 0001
  • Arturs Backurs
  • Sivakanth Gopi
  • Da Yu
  • Huseyin A. Inan
  • Harsha Nori
  • Haotian Jiang

Text data has become extremely valuable due to the emergence of machine learning algorithms that learn from it. A lot of high-quality text data generated in the real world is private and therefore cannot be shared or used freely due to privacy concerns. Generating synthetic replicas of private text data with a formal privacy guarantee, i.e., differential privacy (DP), offers a promising and scalable solution. However, existing methods necessitate DP finetuning of large language models (LLMs) on private data to generate DP synthetic data. This approach is not viable for proprietary LLMs (e.g., GPT-3.5) and also demands considerable computational resources for open-source LLMs. Lin et al. (2024) recently introduced the Private Evolution (PE) algorithm to generate DP synthetic images with only API access to diffusion models. In this work, we propose an augmented PE algorithm, named Aug-PE, that applies to the complex setting of text. We use API access to an LLM and generate DP synthetic text without any model training. We conduct comprehensive experiments on three benchmark datasets. Our results demonstrate that Aug-PE produces DP synthetic text that yields competitive utility with the SOTA DP finetuning baselines. This underscores the feasibility of relying solely on API access of LLMs to produce high-quality DP synthetic texts, thereby facilitating more accessible routes to privacy-preserving LLM applications.

ICLR Conference 2024 Conference Paper

Efficiently Computing Similarities to Private Datasets

  • Arturs Backurs
  • Zinan Lin 0001
  • Sepideh Mahabadi
  • Sandeep Silwal
  • Jakub Tarnawski

Many methods in differentially private model training rely on computing the similarity between a query point (such as public or synthetic data) and private data. We abstract out this common subroutine and study the following fundamental algorithmic problem: Given a similarity function $f$ and a large high-dimensional private dataset $X \subset \mathbb{R}^d$, output a differentially private (DP) data-structure which approximates $\sum_{x \in X} f(x,y)$ for any query $y$. We consider the cases where $f$ is a kernel function, such as $f(x,y) = e^{-\|x-y\|_2^2/\sigma^2}$ (also known as DP kernel density estimation), or a distance function such as $f(x,y) = \|x-y\|_2$, among others. Our theoretical results improve upon prior work and give better privacy-utility trade-offs as well as faster query times for a wide range of kernels and distance functions. The unifying approach behind our results is leveraging `low-dimensional structures' present in the specific functions $f$ that we study, using tools such as provable dimensionality reduction, approximation theory, and one-dimensional decomposition of the functions. Our algorithms empirically exhibit improved query times and accuracy over prior state of the art. We also present an application to DP classification. Our experiments demonstrate that the simple methodology of classifying based on average similarity is orders of magnitude faster than prior DP-SGD based approaches for comparable accuracy.
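
As a point of reference for the query abstraction above (and not the paper's data structure), the sketch below shows the simplest DP baseline for one similarity-sum query with the Gaussian kernel f(x,y) = e^{-||x-y||_2^2/σ^2}: because the kernel is bounded by 1, the sum changes by at most 1 when a single private point is added or removed, so Laplace noise of scale 1/ε gives ε-DP for that single query. Function and parameter names are illustrative.

```python
import numpy as np

def dp_kernel_sum(X, y, sigma=1.0, eps=1.0, rng=np.random.default_rng(0)):
    """Baseline eps-DP answer to sum_{x in X} exp(-||x - y||^2 / sigma^2) for a
    single query y: the Gaussian kernel is bounded by 1, so the sum has
    sensitivity 1 and Laplace(1/eps) noise suffices."""
    true_sum = np.exp(-np.sum((X - y) ** 2, axis=1) / sigma**2).sum()
    return true_sum + rng.laplace(scale=1.0 / eps)

X = np.random.default_rng(1).normal(size=(1000, 16))   # toy stand-in for the private set
print(dp_kernel_sum(X, np.zeros(16), sigma=2.0, eps=0.5))
```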

ICLR Conference 2024 Conference Paper

Privately Aligning Language Models with Reinforcement Learning

  • Fan Wu
  • Huseyin A. Inan
  • Arturs Backurs
  • Varun Chandrasekaran
  • Janardhan Kulkarni
  • Robert Sim

Positioned between pre-training and user deployment, aligning large language models (LLMs) through reinforcement learning (RL) has emerged as a prevailing strategy for training instruction-following models such as ChatGPT. In this work, we initiate the study of privacy-preserving alignment of LLMs through Differential Privacy (DP) in conjunction with RL. Following the influential work of Ziegler et al. (2020), we study two dominant paradigms: (i) alignment via RL without human in the loop (e.g., positive review generation) and (ii) alignment via RL from human feedback (RLHF) (e.g., summarization in a human-preferred way). We give a new DP framework to achieve alignment via RL, and prove its correctness. Our experimental results validate the effectiveness of our approach, offering competitive utility while ensuring strong privacy protections.

ICLR Conference 2023 Conference Paper

Exploring the Limits of Differentially Private Deep Learning with Group-wise Clipping

  • Jiyan He
  • Xuechen Li
  • Da Yu
  • Huishuai Zhang
  • Janardhan Kulkarni
  • Yin Tat Lee
  • Arturs Backurs
  • Nenghai Yu

Differentially private deep learning has recently witnessed advances in computational efficiency and privacy-utility trade-off. We explore whether further improvements along the two axes are possible and provide affirmative answers leveraging two instantiations of \emph{group-wise clipping}. To reduce the compute time overhead of private learning, we show that \emph{per-layer clipping}, where the gradient of each neural network layer is clipped separately, allows clipping to be performed in conjunction with backpropagation in differentially private optimization. This results in private learning that is as memory-efficient and almost as fast per training update as non-private learning for many workflows of interest. While per-layer clipping with constant thresholds tends to underperform standard flat clipping, per-layer clipping with adaptive thresholds matches or outperforms flat clipping under given training epoch constraints, hence attaining similar or better task performance within less wall time. To explore the limits of scaling (pretrained) models in differentially private deep learning, we privately fine-tune the 175 billion-parameter GPT-3. We bypass scaling challenges associated with clipping gradients that are distributed across multiple devices with \emph{per-device clipping} that clips the gradient of each model piece separately on its host device. Privately fine-tuning GPT-3 with per-device clipping achieves a task performance at $\epsilon=1$ better than what is attainable by non-privately fine-tuning the largest GPT-2 on a summarization task.
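
To make the clipping terminology concrete, here is a minimal illustrative sketch of per-layer clipping in a DP-SGD-style update, written with a naive per-example loop (avoiding exactly this overhead, by clipping in conjunction with backpropagation, is the point of the paper); the toy model, thresholds, and noise scale are placeholder assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

# Toy model and data; all choices below are illustrative placeholders.
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()
clip, sigma, lr = 1.0, 0.5, 0.1
xs, ys = torch.randn(8, 10), torch.randn(8, 1)

summed = [torch.zeros_like(p) for p in model.parameters()]
for x, y in zip(xs, ys):                       # naive per-example gradients
    model.zero_grad()
    loss_fn(model(x), y).backward()
    for acc, p in zip(summed, model.parameters()):
        # per-layer clipping: each parameter tensor's gradient is clipped to
        # norm `clip` separately, instead of flat-clipping the full gradient
        scale = torch.clamp(clip / (p.grad.norm() + 1e-12), max=1.0)
        acc += p.grad * scale

with torch.no_grad():
    for p, acc in zip(model.parameters(), summed):
        noisy = acc + sigma * clip * torch.randn_like(acc)   # Gaussian noise per layer
        p -= lr * noisy / len(xs)
```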

ICLR Conference 2022 Conference Paper

Differentially Private Fine-tuning of Language Models

  • Da Yu
  • Saurabh Naik
  • Arturs Backurs
  • Sivakanth Gopi
  • Huseyin A. Inan
  • Gautam Kamath 0001
  • Janardhan Kulkarni
  • Yin Tat Lee

We give simpler, sparser, and faster algorithms for differentially private fine-tuning of large-scale pre-trained language models, which achieve the state-of-the-art privacy versus utility tradeoffs on many standard NLP tasks. We propose a meta-framework for this problem, inspired by the recent success of highly parameter-efficient methods for fine-tuning. Our experiments show that differentially private adaptations of these approaches outperform previous private algorithms in three important dimensions: utility, privacy, and the computational and memory cost of private training. On many commonly studied datasets, the utility of private models approaches that of non-private models. For example, on the MNLI dataset we achieve an accuracy of $87.8\%$ using RoBERTa-Large and $83.5\%$ using RoBERTa-Base with a privacy budget of $\epsilon = 6.7$. In comparison, absent privacy constraints, RoBERTa-Large achieves an accuracy of $90.2\%$. Our findings are similar for natural language generation when privately fine-tuning GPT-2. Our experiments also show that larger models are better suited for private fine-tuning: while they are well known to achieve superior accuracy non-privately, we find that they also better maintain their accuracy when privacy is introduced.
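
As one concrete example of the parameter-efficient methods the meta-framework builds on, the sketch below is a generic low-rank adapter around a frozen linear layer; it is only meant to show why so few parameters need gradients (DP-SGD clipping and noising of those adapter gradients is omitted), and it is not claimed to match the exact adapter variants evaluated in the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (W x + B A x)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)               # only the adapter is trained
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # adapter params only
```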

NeurIPS Conference 2022 Conference Paper

Differentially Private Model Compression

  • FatemehSadat Mireshghallah
  • Arturs Backurs
  • Huseyin A. Inan
  • Lukas Wutschitz
  • Janardhan Kulkarni

Recent papers have shown that large pre-trained language models (LLMs) such as BERT, GPT-2 can be fine-tuned on private data to achieve performance comparable to non-private models for many downstream Natural Language Processing (NLP) tasks while simultaneously guaranteeing differential privacy. The inference cost of these models -- which consist of hundreds of millions of parameters -- however, can be prohibitively large. Hence, often in practice, LLMs are compressed before they are deployed in specific applications. In this paper, we initiate the study of differentially private model compression and propose frameworks for achieving 50% sparsity levels while maintaining nearly full performance. We demonstrate these ideas on standard GLUE benchmarks using BERT models, setting benchmarks for future research on this topic.
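
To make the compression target concrete, the snippet below performs ordinary (non-private) magnitude pruning of a linear layer to 50% sparsity; the paper's contribution is reaching this kind of sparsity while training under differential privacy, which this sketch does not attempt.

```python
import torch
import torch.nn as nn

def magnitude_prune_(module: nn.Linear, sparsity: float = 0.5):
    """Zero out the smallest-magnitude weights (plain, non-private pruning;
    the paper studies how to reach such sparsity under differential privacy)."""
    w = module.weight.data
    k = int(w.numel() * sparsity)
    threshold = w.abs().flatten().kthvalue(k).values
    w.mul_((w.abs() > threshold).float())

layer = nn.Linear(64, 64)
magnitude_prune_(layer, 0.5)
print((layer.weight == 0).float().mean().item())   # approximately 0.5
```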

ICML Conference 2021 Conference Paper

Faster Kernel Matrix Algebra via Density Estimation

  • Arturs Backurs
  • Piotr Indyk
  • Cameron Musco
  • Tal Wagner

We study fast algorithms for computing basic properties of an n x n positive semidefinite kernel matrix K corresponding to n points x_1, ..., x_n in R^d. In particular, we consider estimating the sum of kernel matrix entries, along with its top eigenvalue and eigenvector. These are some of the most basic problems defined over kernel matrices. We show that the sum of matrix entries can be estimated up to a multiplicative factor of 1+ε in time sublinear in n and linear in d for many popular kernel functions, including the Gaussian, exponential, and rational quadratic kernels. For these kernels, we also show that the top eigenvalue (and a witnessing approximate eigenvector) can be approximated to a multiplicative factor of 1+ε in time sub-quadratic in n and linear in d. Our algorithms represent significant advances in the best known runtimes for these problems. They leverage the positive definiteness of the kernel matrix, along with a recent line of work on efficient kernel density estimation.
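
For intuition about the first quantity, the snippet below is a naive Monte Carlo estimator of the sum of kernel matrix entries: it samples rows uniformly and rescales, which is unbiased because Σ_{i,j} K_{ij} = n·E_i[Σ_j K_{ij}]. It costs linear (not sublinear) time in n per sample, so it is only a baseline for comparison, not the paper's algorithm; the Gaussian kernel and all parameters are illustrative.

```python
import numpy as np

def kernel_sum_estimate(X, n_samples=200, sigma=1.0, rng=np.random.default_rng(0)):
    """Unbiased Monte Carlo estimate of sum_{i,j} exp(-||x_i - x_j||^2 / sigma^2):
    sample rows uniformly, compute their exact row sums, and rescale by n."""
    n = len(X)
    idx = rng.integers(0, n, size=n_samples)
    row_sums = [np.exp(-np.sum((X - X[i]) ** 2, axis=1) / sigma**2).sum() for i in idx]
    return n * float(np.mean(row_sums))

X = np.random.default_rng(1).normal(size=(2000, 8))   # toy dataset
print(kernel_sum_estimate(X))
```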

NeurIPS Conference 2020 Conference Paper

Impossibility Results for Grammar-Compressed Linear Algebra

  • Amir Abboud
  • Arturs Backurs
  • Karl Bringmann
  • Marvin Künnemann

To handle vast amounts of data, it is natural and popular to compress vectors and matrices. When we compress a vector from size N down to size n << N, it certainly makes it easier to store and transmit efficiently, but does it also make it easier to process? In this paper we consider lossless compression schemes, and ask if we can run our computations on the compressed data as efficiently as if the original data was that small. That is, if an operation has time complexity T(input-size), can we perform it on the compressed representation in time T(n) rather than T(N)? We consider the most basic linear algebra operations: inner product, matrix-vector multiplication, and matrix multiplication. In particular, given two compressed vectors, can we compute their inner product in time O(n)? Or perhaps we must decompress first and then multiply, spending Omega(N) time? The answer depends on the compression scheme. While for simple ones such as Run-Length-Encoding (RLE) the inner product can be done in O(n) time, we prove that this is impossible for compressions from a richer class: essentially n^2 or even larger runtimes are needed in the worst case (under complexity assumptions). This is the class of \emph{grammar-compressions} containing most popular methods such as the Lempel-Ziv family. These schemes are more compressing than the simple RLE, but alas, we prove that performing computations on them is much harder.
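
The positive case mentioned above is easy to make concrete: for Run-Length Encoding, the inner product of two compressed vectors can be computed by merging their runs, in time linear in the number of runs rather than in the uncompressed length N. A minimal sketch (the encoding as (value, run_length) pairs is an illustrative choice):

```python
def rle_inner_product(a_runs, b_runs):
    """Inner product of two vectors given as run-length encodings
    [(value, run_length), ...] of equal total length, in time linear in the
    number of runs rather than the uncompressed length."""
    i = j = 0
    ai, bj = a_runs[0][1], b_runs[0][1]   # remaining lengths of the current runs
    total = 0
    while i < len(a_runs) and j < len(b_runs):
        overlap = min(ai, bj)
        total += a_runs[i][0] * b_runs[j][0] * overlap
        ai -= overlap
        bj -= overlap
        if ai == 0:
            i += 1
            ai = a_runs[i][1] if i < len(a_runs) else 0
        if bj == 0:
            j += 1
            bj = b_runs[j][1] if j < len(b_runs) else 0
    return total

# (3,3,3,5) . (2,2,4,4) = 6 + 6 + 12 + 20 = 44
print(rle_inner_product([(3, 3), (5, 1)], [(2, 2), (4, 2)]))
```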

ICML Conference 2020 Conference Paper

Scalable Nearest Neighbor Search for Optimal Transport

  • Arturs Backurs
  • Yihe Dong
  • Piotr Indyk
  • Ilya P. Razenshteyn
  • Tal Wagner

The Optimal Transport (a.k.a. Wasserstein) distance is an increasingly popular similarity measure for rich data domains, such as images or text documents. This raises the necessity for fast nearest neighbor search algorithms according to this distance, which poses a substantial computational bottleneck on massive datasets. In this work we introduce Flowtree, a fast and accurate approximation algorithm for the Wasserstein-1 distance. We formally analyze its approximation factor and running time. We perform an extensive experimental evaluation of nearest neighbor search algorithms in the W_1 distance on real-world datasets. Our results show that compared to the previous state of the art, Flowtree achieves up to 7.4 times faster running time.
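
For a sense of the bottleneck being addressed, the snippet below does brute-force exact nearest neighbor search under W_1 using SciPy's one-dimensional Wasserstein distance; the dataset is a toy stand-in (the paper targets distributions over high-dimensional ground metrics, where exact W_1 is far more expensive), and this is the kind of baseline that Flowtree-style indexing is designed to avoid.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Brute-force exact nearest neighbor under W_1 over toy one-dimensional point
# clouds; in 1-D exact W_1 has a simple closed form, which keeps the example small.
rng = np.random.default_rng(0)
dataset = [rng.normal(loc=rng.uniform(-3, 3), size=100) for _ in range(200)]
query = rng.normal(loc=0.5, size=100)

nearest = min(range(len(dataset)), key=lambda i: wasserstein_distance(dataset[i], query))
print(nearest, wasserstein_distance(dataset[nearest], query))
```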

SODA Conference 2019 Conference Paper

Fast Modular Subset Sum using Linear Sketching

  • Kyriakos Axiotis
  • Arturs Backurs
  • Ce Jin 0001
  • Christos Tzamos
  • Hongxun Wu

Given n positive integers, the Modular Subset Sum problem asks if a subset adds up to a given target t modulo a given integer m. This is a natural generalization of the Subset Sum problem (where m = +∞) with ties to additive combinatorics and cryptography. Recently, in [Bri17, KX17], efficient algorithms have been developed for the non-modular case, running in near-linear pseudo-polynomial time. For the modular case, however, the best known algorithm by Koiliaris and Xu [KX17] runs in time Õ(m^{5/4}). In this paper, we present an algorithm running in Õ(m) randomized time, which matches a recent conditional lower bound of [ABHS17] based on the Strong Exponential Time Hypothesis. Interestingly, in contrast to most previous results on Subset Sum, our algorithm does not use the Fast Fourier Transform. Instead, it is able to simulate the "textbook" Dynamic Programming algorithm much faster, using ideas from linear sketching. This is one of the first applications of sketching-based techniques to obtain fast algorithms for exact combinatorial problems in an offline setting.
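
The "textbook" dynamic program referenced here simply maintains the set of residues modulo m reachable by some subset, taking O(nm) time overall; the paper's algorithm simulates this recurrence in roughly Õ(m) time via linear sketching. A plain version of the DP (names and the toy instance are illustrative):

```python
def modular_subset_sum(nums, t, m):
    """Textbook O(n*m)-time dynamic program: `reachable` is the set of residues
    modulo m attainable by some (possibly empty) subset of the numbers seen so far."""
    reachable = {0}
    for a in nums:
        reachable |= {(r + a) % m for r in reachable}
    return t % m in reachable

print(modular_subset_sum([3, 7, 12], t=10, m=13))   # 3 + 7 = 10, so True
```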

ICML Conference 2019 Conference Paper

Scalable Fair Clustering

  • Arturs Backurs
  • Piotr Indyk
  • Krzysztof Onak
  • Baruch Schieber
  • Ali Vakilian
  • Tal Wagner

We study the fair variant of the classic k-median problem introduced by (Chierichetti et al., NeurIPS 2017) in which the points are colored, and the goal is to minimize the same average distance objective as in the standard $k$-median problem while ensuring that all clusters have an “approximately equal” number of points of each color. (Chierichetti et al., NeurIPS 2017) proposed a two-phase algorithm for fair $k$-clustering. In the first step, the pointset is partitioned into subsets called fairlets that satisfy the fairness requirement and approximately preserve the k-median objective. In the second step, fairlets are merged into k clusters by one of the existing k-median algorithms. The running time of this algorithm is dominated by the first step, which takes super-quadratic time. In this paper, we present a practical approximate fairlet decomposition algorithm that runs in nearly linear time.

NeurIPS Conference 2019 Conference Paper

Space and Time Efficient Kernel Density Estimation in High Dimensions

  • Arturs Backurs
  • Piotr Indyk
  • Tal Wagner

Recently, Charikar and Siminelakis (2017) presented a framework for kernel density estimation in provably sublinear query time, for kernels that possess a certain hashing-based property. However, their data structure requires a significantly increased super-linear storage space, as well as super-linear preprocessing time. These limitations inhibit the practical applicability of their approach on large datasets. In this work, we present an improvement to their framework that retains the same query time, while requiring only linear space and linear preprocessing time. We instantiate our framework with the Laplacian and Exponential kernels, two popular kernels which possess the aforementioned property. Our experiments on various datasets verify that our approach attains accuracy and query time similar to Charikar and Siminelakis (2017), with significantly improved space and preprocessing time.

FOCS Conference 2018 Conference Paper

Efficient Density Evaluation for Smooth Kernels

  • Arturs Backurs
  • Moses Charikar
  • Piotr Indyk
  • Paris Siminelakis

Given a kernel function k(·,·) and a dataset P ⊂ R^d, the kernel density function of P at a point x ∈ R^d is equal to KDF_P(x) := (1/|P|) Σ_{y∈P} k(x, y). Kernel density evaluation has numerous applications, in scientific computing, statistics, computer vision, machine learning and other fields. In all of them it is necessary to evaluate KDF_P(x) quickly, often for many inputs x and large point-sets P. In this paper we present a collection of algorithms for efficient KDF evaluation under the assumption that the kernel k is "smooth", i.e. the value changes at most polynomially with the distance. This assumption is satisfied by several well-studied kernels, including the (generalized) t-student kernel and rational quadratic kernel. For smooth kernels, we give a data structure that, after O(dn log(Φn)/ε^2) preprocessing, estimates KDF_P(x) up to a factor of 1 ± ε in O(d log(Φn)/ε^2) time, where Φ is the aspect ratio. The log(Φn) term can be further replaced by log n under an additional decay condition on k, which is satisfied by the aforementioned examples. We further extend the results in two ways. First, we use low-distortion embeddings to extend the results to kernels defined for spaces other than ℓ_2. The key feature of this reduction is that the distortion of the embedding affects only the running time of the algorithm, not the accuracy of the estimation. As a result, we obtain (1+ε)-approximate estimation algorithms for kernels over other ℓ_p norms, Earth-Mover Distance, and other metric spaces. Second, for smooth kernels that are decreasing with distance, we present a general reduction from density estimation to approximate near neighbor in the underlying space. This allows us to construct algorithms for general doubling metrics, as well as alternative algorithms for ℓ_p norms and other spaces.
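
For reference, the quantity being accelerated is just an average of kernel values: the naive evaluator below computes KDF_P(x) exactly in O(|P|·d) time with the rational quadratic kernel, one of the smooth kernels named above. This is the baseline the paper's data structure improves on; the kernel scale c and the toy data are illustrative.

```python
import numpy as np

def kdf(P, x, kernel):
    """Naive exact evaluation of KDF_P(x) = (1/|P|) * sum_{y in P} k(x, y)."""
    return float(np.mean([kernel(x, y) for y in P]))

# Rational quadratic kernel: a "smooth" kernel whose value changes at most
# polynomially with the distance.
rational_quadratic = lambda x, y, c=1.0: 1.0 / (1.0 + np.sum((x - y) ** 2) / c)

P = np.random.default_rng(0).normal(size=(500, 4))
print(kdf(P, np.zeros(4), rational_quadratic))
```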

STOC Conference 2018 Conference Paper

Towards tight approximation bounds for graph diameter and eccentricities

  • Arturs Backurs
  • Liam Roditty
  • Gilad Segal
  • Virginia Vassilevska Williams
  • Nicole Wein

Among the most important graph parameters is the Diameter, the largest distance between any two vertices. There are no known very efficient algorithms for computing the Diameter exactly. Thus, much research has been devoted to how fast this parameter can be approximated. Chechik et al. [SODA 2014] showed that the diameter can be approximated within a multiplicative factor of 3/2 in Õ(m^{3/2}) time. Furthermore, Roditty and Vassilevska W. [STOC 13] showed that unless the Strong Exponential Time Hypothesis (SETH) fails, no O(n^{2−ε}) time algorithm can achieve an approximation factor better than 3/2 in sparse graphs. Thus the above algorithm is essentially optimal for sparse graphs for approximation factors less than 3/2. It was, however, completely plausible that a 3/2-approximation is possible in linear time. In this work we conditionally rule out such a possibility by showing that unless SETH fails no O(m^{3/2−ε}) time algorithm can achieve an approximation factor better than 5/3. Another fundamental set of graph parameters are the Eccentricities. The Eccentricity of a vertex v is the distance between v and the farthest vertex from v. Chechik et al. [SODA 2014] showed that the Eccentricities of all vertices can be approximated within a factor of 5/3 in Õ(m^{3/2}) time and Abboud et al. [SODA 2016] showed that no O(n^{2−ε}) algorithm can achieve better than 5/3 approximation in sparse graphs. We show that the runtime of the 5/3 approximation algorithm is also optimal by proving that under SETH, there is no O(m^{3/2−ε}) algorithm that achieves a better than 9/5 approximation. We also show that no near-linear time algorithm can achieve a better than 2 approximation for the Eccentricities. This is the first lower bound in fine-grained complexity that addresses near-linear time computation. We show that our lower bound for near-linear time algorithms is essentially tight by giving an algorithm that approximates Eccentricities within a 2+δ factor in Õ(m/δ) time for any 0 < δ < 1. This beats all Eccentricity algorithms in Cairo et al. [SODA 2016] and is the first constant factor approximation for Eccentricities in directed graphs. To establish the above lower bounds we study the S-T Diameter problem: Given a graph and two subsets S and T of vertices, output the largest distance between a vertex in S and a vertex in T. We give new algorithms and show tight lower bounds that serve as a starting point for all other hardness results. Our lower bounds apply only to sparse graphs. We show that for dense graphs, there are near-linear time algorithms for S-T Diameter, Diameter and Eccentricities, with almost the same approximation guarantees as their Õ(m^{3/2}) counterparts, improving upon the best known algorithms for dense graphs.
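
For contrast with these fine-grained bounds, the folklore near-linear-time baseline is a single BFS: in an undirected graph, ecc(v) ≤ Diameter ≤ 2·ecc(v) for any vertex v, giving a 2-approximation. The sketch below shows only that baseline (toy graph and names are illustrative), not the paper's algorithms.

```python
from collections import deque

def bfs_eccentricity(adj, s):
    """Distances from s by BFS; the eccentricity of s is the largest one."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return max(dist.values())

# Folklore linear-time 2-approximation for undirected connected graphs:
# ecc(v) <= Diameter <= 2 * ecc(v) for any vertex v.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # a path, diameter 3
print(bfs_eccentricity(adj, 1))                 # 2, so the diameter is in [2, 4]
```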

SODA Conference 2017 Conference Paper

Better Approximations for Tree Sparsity in Nearly-Linear Time

  • Arturs Backurs
  • Piotr Indyk
  • Ludwig Schmidt

The Tree Sparsity problem is defined as follows: given a node-weighted tree of size n and an integer k, output a rooted subtree of size k with maximum weight. The best known algorithm solves this problem in time O(kn), i.e., quadratic in the size of the input tree for k = Θ(n). In this work, we design (1+ε)-approximation algorithms for the Tree Sparsity problem that run in nearly-linear time. Unlike prior algorithms for this problem, our results offer single criterion approximations, i.e., they do not increase the sparsity of the output solution, and work for arbitrary trees (not only balanced trees). We also provide further algorithms for this problem with different runtime vs approximation trade-offs. Finally, we show that if the exact version of the Tree Sparsity problem can be solved in strongly subquadratic time, then the (min, +) convolution problem can be solved in strongly subquadratic time as well. The latter is a well-studied problem for which no strongly subquadratic time algorithm is known.
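
The exact algorithm referred to above is a tree-knapsack dynamic program over subtree sizes. The sketch below is a plain recursive version of that DP, written without the bookkeeping that brings the running time down to O(kn), and assuming the returned subtree must contain the given root; the tree encoding and the toy instance are illustrative.

```python
def tree_sparsity(children, weight, root, k):
    """best[j] = maximum weight of a size-j subtree rooted at v and containing v."""
    def solve(v):
        best = [float("-inf")] * (k + 1)
        if k >= 1:
            best[1] = weight[v]
        for c in children.get(v, []):
            child = solve(c)
            merged = best[:]                      # option: skip this child entirely
            for i in range(1, k + 1):
                if best[i] == float("-inf"):
                    continue
                for j in range(1, k + 1 - i):
                    if child[j] != float("-inf"):
                        merged[i + j] = max(merged[i + j], best[i] + child[j])
            best = merged
        return best
    return solve(root)[k]

children = {0: [1, 2], 1: [3]}                    # toy tree rooted at 0
weight = {0: 1, 1: 5, 2: 3, 3: 10}
print(tree_sparsity(children, weight, root=0, k=3))   # picks {0, 1, 3} -> 16
```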

FOCS Conference 2017 Conference Paper

Fine-Grained Complexity of Analyzing Compressed Data: Quantifying Improvements over Decompress-and-Solve

  • Amir Abboud
  • Arturs Backurs
  • Karl Bringmann
  • Marvin Künnemann

Can we analyze data without decompressing it? As our data keeps growing, understanding the time complexity of problems on compressed inputs, rather than in convenient uncompressed forms, becomes more and more relevant. Suppose we are given a compression of size n of data that originally has size N, and we want to solve a problem with time complexity T(·). The naive strategy of “decompress-and-solve” gives time T(N), whereas “the gold standard” is time T(n): to analyze the compression as efficiently as if the original data was small. We restrict our attention to data in the form of a string (text, files, genomes, etc.) and study the most ubiquitous tasks. While the challenge might seem to depend heavily on the specific compression scheme, most methods of practical relevance (Lempel-Ziv-family, dictionary methods, and others) can be unified under the elegant notion of Grammar-Compressions. A vast literature, across many disciplines, established this as an influential notion for Algorithm design. We introduce a direly needed framework for proving (conditional) lower bounds in this field, allowing us to assess whether decompress-and-solve can be improved, and by how much. Our main results are: (1) The O(nN√(log(N/n))) bound for LCS and the O(min{N log N, nM}) bound for Pattern Matching with Wildcards are optimal up to N^{o(1)} factors, under the Strong Exponential Time Hypothesis. (Here, M denotes the uncompressed length of the compressed pattern.) (2) Decompress-and-solve is essentially optimal for Context-Free Grammar Parsing and RNA Folding, under the k-Clique conjecture. (3) We give an algorithm showing that decompress-and-solve is not optimal for Disjointness.

ICML Conference 2017 Conference Paper

Improving Viterbi is Hard: Better Runtimes Imply Faster Clique Algorithms

  • Arturs Backurs
  • Christos Tzamos

The classic algorithm of Viterbi computes the most likely path in a Hidden Markov Model (HMM) that results in a given sequence of observations. It runs in time $O(Tn^2)$ given a sequence of T observations from a HMM with n states. Despite significant interest in the problem and prolonged effort by different communities, no known algorithm achieves more than a polylogarithmic speedup. In this paper, we explain this difficulty by providing matching conditional lower bounds. Our lower bounds are based on assumptions that the best known algorithms for the All-Pairs Shortest Paths problem (APSP) and for the Max-Weight k-Clique problem in edge-weighted graphs are essentially tight. Finally, using a recent algorithm by Green Larsen and Williams for online Boolean matrix-vector multiplication, we get a $2^{\Omega(\sqrt{\log n})}$ speedup for the Viterbi algorithm when there are few distinct transition probabilities in the HMM.
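
For readers who want the baseline being discussed, this is the classic O(Tn^2) Viterbi dynamic program, written in log-space; the toy HMM parameters are arbitrary.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Classic Viterbi DP in O(T n^2) time: log_pi[i] are log initial
    probabilities, log_A[i, j] log transition probabilities, log_B[i, o] log
    emission probabilities, and obs is the observation sequence."""
    dp = log_pi + log_B[:, obs[0]]
    back = []
    for o in obs[1:]:
        scores = dp[:, None] + log_A            # scores[i, j]: end in state j via state i
        back.append(np.argmax(scores, axis=0))
        dp = np.max(scores, axis=0) + log_B[:, o]
    # backtrack the most likely state path
    path = [int(np.argmax(dp))]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return list(reversed(path))

rng = np.random.default_rng(0)
A = rng.dirichlet(np.ones(3), size=3)            # toy 3-state HMM
B = rng.dirichlet(np.ones(4), size=3)            # 4 possible observation symbols
pi = np.full(3, 1 / 3)
print(viterbi(np.log(pi), np.log(A), np.log(B), [0, 2, 1, 3]))
```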

NeurIPS Conference 2017 Conference Paper

On the Fine-Grained Complexity of Empirical Risk Minimization: Kernel Methods and Neural Networks

  • Arturs Backurs
  • Piotr Indyk
  • Ludwig Schmidt

Empirical risk minimization (ERM) is ubiquitous in machine learning and underlies most supervised learning methods. While there is a large body of work on algorithms for various ERM problems, the exact computational complexity of ERM is still not understood. We address this issue for multiple popular ERM problems including kernel SVMs, kernel ridge regression, and training the final layer of a neural network. In particular, we give conditional hardness results for these problems based on complexity-theoretic assumptions such as the Strong Exponential Time Hypothesis. Under these assumptions, we show that there are no algorithms that solve the aforementioned ERM problems to high accuracy in sub-quadratic time. We also give similar hardness results for computing the gradient of the empirical loss, which is the main computational burden in many non-convex learning tasks.

FOCS Conference 2016 Conference Paper

Which Regular Expression Patterns Are Hard to Match?

  • Arturs Backurs
  • Piotr Indyk

Regular expressions constitute a fundamental notion in formal language theory and are frequently used in computer science to define search patterns. In particular, regular expression matching and membership testing are widely used computational primitives, employed in many programming languages and text processing utilities. A classic algorithm for these problems constructs and simulates a non-deterministic finite automaton corresponding to the expression, resulting in an O(mn) running time (where m is the length of the pattern and n is the length of the text). This running time can be improved slightly (by a polylogarithmic factor), but no significantly faster solutions are known. At the same time, much faster algorithms exist for various special cases of regular expressions, including dictionary matching, wildcard matching, subset matching, the word break problem, etc. In this paper, we show that the complexity of regular expression matching can be characterized based on its depth (when interpreted as a formula). Our results hold for expressions involving concatenation, OR, Kleene star and Kleene plus. For regular expressions of depth two (involving any combination of the above operators), we show the following dichotomy: matching and membership testing can be solved in near-linear time, except for "concatenations of stars", which cannot be solved in strongly sub-quadratic time assuming the Strong Exponential Time Hypothesis (SETH). For regular expressions of depth three the picture is more complex. Nevertheless, we show that all problems can either be solved in strongly sub-quadratic time, or cannot be solved in strongly sub-quadratic time assuming SETH. An intriguing special case of membership testing involves regular expressions of the form "a star of an OR of concatenations", e.g., [a|ab|bc]*. This corresponds to the so-called word break problem, for which a dynamic programming algorithm with a runtime of (roughly) O(n√m) is known. We show that the latter bound is not tight and improve the runtime to O(n m^{0.44...}).
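
The word break special case mentioned at the end has a simple dynamic program, shown below, that runs in roughly O(nm) time where m is the total length of the dictionary words; the O(n√m) algorithm cited above and the paper's improved O(n m^{0.44...}) bound are more sophisticated than this sketch, whose dictionary and string are toy examples.

```python
def word_break(text, words):
    """Dynamic program for the word break problem: can the text be written as a
    concatenation of dictionary words? ok[i] means text[:i] is segmentable."""
    n = len(text)
    ok = [False] * (n + 1)
    ok[0] = True
    for i in range(1, n + 1):
        ok[i] = any(ok[i - len(w)] and text.startswith(w, i - len(w))
                    for w in words if len(w) <= i)
    return ok[n]

print(word_break("abbcab", ["a", "ab", "bc"]))   # ab|bc|ab -> True
```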

STOC Conference 2015 Conference Paper

Edit Distance Cannot Be Computed in Strongly Subquadratic Time (unless SETH is false)

  • Arturs Backurs
  • Piotr Indyk

The edit distance (a.k.a. the Levenshtein distance) between two strings is defined as the minimum number of insertions, deletions or substitutions of symbols needed to transform one string into another. The problem of computing the edit distance between two strings is a classical computational task, with a well-known algorithm based on dynamic programming. Unfortunately, all known algorithms for this problem run in nearly quadratic time. In this paper we provide evidence that the near-quadratic running time bounds known for the problem of computing edit distance might be tight. Specifically, we show that, if the edit distance can be computed in time O(n^{2-δ}) for some constant δ > 0, then the satisfiability of conjunctive normal form formulas with N variables and M clauses can be solved in time M^{O(1)} 2^{(1-ε)N} for a constant ε > 0. The latter result would violate the Strong Exponential Time Hypothesis, which postulates that such algorithms do not exist.
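
For context, this is the textbook dynamic program whose near-quadratic running time the paper shows is likely unavoidable; the single-row variant below is a standard implementation (the example strings are the usual textbook ones).

```python
def edit_distance(a, b):
    """Textbook O(n^2)-time dynamic program for edit distance, kept to one row."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))                     # distances from a[:0] to b[:j]
    for i in range(1, m + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, n + 1):
            prev_diag, dp[j] = dp[j], min(
                dp[j] + 1,                           # delete a[i-1]
                dp[j - 1] + 1,                       # insert b[j-1]
                prev_diag + (a[i - 1] != b[j - 1]),  # substitute (or match)
            )
    return dp[n]

print(edit_distance("kitten", "sitting"))   # 3
```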

FOCS Conference 2015 Conference Paper

If the Current Clique Algorithms are Optimal, So is Valiant's Parser

  • Amir Abboud
  • Arturs Backurs
  • Virginia Vassilevska Williams

The CFG recognition problem is: given a context-free grammar G and a string w of length n, decide if w can be obtained from G. This is the most basic parsing question and is a core computer science problem. Valiant's parser from 1975 solves the problem in O(n^ω) time, where ω < 2.373 is the matrix multiplication exponent. Dozens of parsing algorithms have been proposed over the years, yet Valiant's upper bound remains unbeaten. The best combinatorial algorithms have mildly subcubic O(n^3 / log^3 n) complexity. Lee (JACM'01) provided evidence that fast matrix multiplication is needed for CFG parsing, and that very efficient and practical algorithms might be hard or even impossible to obtain. Lee showed that any algorithm for a more general parsing problem with running time O(|G| n^{3-ε}) can be converted into a surprising subcubic algorithm for Boolean Matrix Multiplication. Unfortunately, Lee's hardness result required that the grammar size be |G| = Ω(n^6). Nothing was known for the more relevant case of constant size grammars. In this work, we prove that any improvement on Valiant's algorithm, even for constant size grammars, either in terms of runtime or by avoiding the inefficiencies of fast matrix multiplication, would imply a breakthrough algorithm for the k-Clique problem: given a graph on n nodes, decide if there are k nodes that form a clique. Besides classifying the complexity of a fundamental problem, our reduction has led us to similar lower bounds for more modern and well-studied cubic time problems for which faster algorithms are highly desirable in practice: RNA Folding, a central problem in computational biology, and Dyck Language Edit Distance, answering an open question of Saha (FOCS'14).
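
A concrete instance of the combinatorial cubic-time algorithms mentioned above is the CYK recognizer for grammars in Chomsky normal form; the sketch below runs in O(n^3 · |G|) time, exactly the regime the k-Clique reduction addresses. The nested-parentheses grammar is a toy example.

```python
def cyk(grammar, start, w):
    """CYK recognition for a grammar in Chomsky normal form, given as a list of
    (nonterminal, rhs-tuple) productions; runs in O(n^3 * |G|) time."""
    n = len(w)
    # table[i][j]: set of nonterminals deriving w[i:j+1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, c in enumerate(w):
        table[i][i] = {A for A, rhs in grammar if rhs == (c,)}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for A, rhs in grammar:
                    if len(rhs) == 2 and rhs[0] in table[i][k] and rhs[1] in table[k + 1][j]:
                        table[i][j].add(A)
    return start in table[0][n - 1]

# Nested parentheses in CNF: S -> L T | L R, T -> S R, L -> '(', R -> ')'
grammar = [("S", ("L", "T")), ("S", ("L", "R")), ("T", ("S", "R")),
           ("L", ("(",)), ("R", (")",))]
print(cyk(grammar, "S", "(())"))   # True
```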

FOCS Conference 2015 Conference Paper

Tight Hardness Results for LCS and Other Sequence Similarity Measures

  • Amir Abboud
  • Arturs Backurs
  • Virginia Vassilevska Williams

Two important similarity measures between sequences are the longest common subsequence (LCS) and the dynamic time warping distance (DTWD). The computations of these measures for two given sequences are central tasks in a variety of applications. Simple dynamic programming algorithms solve these tasks in O(n^2) time, and despite an extensive amount of research, no algorithms with significantly better worst case upper bounds are known. In this paper, we show that for any constant ε > 0, an O(n^{2-ε}) time algorithm for computing the LCS or the DTWD of two sequences of length n over a constant size alphabet refutes the popular Strong Exponential Time Hypothesis (SETH).
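
The simple dynamic programming algorithm for LCS referred to here is the standard quadratic-time table fill; a compact version follows (the example strings are the usual textbook ones).

```python
def lcs_length(a, b):
    """Standard O(n^2)-time dynamic program for the longest common subsequence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

print(lcs_length("ABCBDAB", "BDCABA"))   # 4
```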