Author name cluster

Inge Li Gørtz

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers

2 author rows

SODA Conference 2024 Conference Paper

Sparse Regular Expression Matching

Philip Bille
Inge Li Gørtz

A regular expression specifies a set of strings formed by single characters combined with concatenation, union, and Kleene star operators. Given a regular expression R and a string Q, the regular expression matching problem is to decide if Q matches any of the strings specified by R. Regular expressions are a fundamental concept in formal languages and regular expression matching is a basic primitive for searching and processing data. A standard textbook solution [Thompson, CACM 1968] constructs and simulates a nondeterministic finite automaton, leading to an O(nm) time algorithm, where n is the length of Q and m is the length of R. Despite considerable research efforts only polylogarithmic improvements of this bound are known. Recently, conditional lower bounds provided evidence for this lack of progress when Backurs and Indyk [FOCS 2016] proved that, assuming the strong exponential time hypothesis (SETH), regular expression matching cannot be solved in O (( nm ) 1-ɛ ), for any constant ɛ > 0. Hence, the complexity of regular expression matching is essentially settled in terms of n and m. In this paper, we take a new approach and introduce a density parameter, Δ, that captures the amount of nondeterminism in the NFA simulation on Q. The density is at most nm + 1 but can be significantly smaller. Our main result is a new algorithm that solves regular expression matching in time. This essentially replaces nm with Δ in the complexity of regular expression matching. We complement our upper bound by a matching conditional lower bound that proves that we cannot solve regular expression matching in time O (Δ 1-ɛ ) for any constant ɛ > 0 assuming SETH. The key technical contribution in the result is a new linear space representation of the classic position automaton that supports fast state-set transition computation in near-linear time in the size of the input and output state sets. To achieve this we develop several new insights and techniques of independent interest, including new structural properties of the parse trees of regular expression, a decomposition of state-set transitions based on parse trees, and a fast batched predecessor data structure. * The full version of the paper can be accessed at https: //arxiv. org/abs/1907. 04752

Details

TCS Journal 2022 Journal Article

Partial sums on the ultra-wide word RAM

Philip Bille
Inge Li Gørtz
Frederik Rye Skjoldjensen

We consider the classic partial sums problem on the ultra-wide word RAM model of computation. This model extends the classic w-bit word RAM model with special ultrawords of length w 2 bits that support standard arithmetic and boolean operation and scattered memory access operations that can access w (non-contiguous) locations in memory. The ultra-wide word RAM model captures (and idealizes) modern vector processor architectures. Our main result is a new in-place data structure for the partial sum problem that only stores a constant number of ultrawords in addition to the input and supports operations in doubly logarithmic time. This matches the best known time bounds for the problem (among polynomial space data structures) while improving the space from superlinear to a constant number of ultrawords. Our results are based on a simple and elegant in-place word RAM data structure, known as the Fenwick tree. Our main technical contribution is a new efficient parallel ultra-wide word RAM implementation of the Fenwick tree, which is likely of independent interest.

Details DOI

TCS Journal 2022 Journal Article

String indexing for top-k close consecutive occurrences

Philip Bille
Inge Li Gørtz
Max Rishøj Pedersen
Eva Rotenberg
Teresa Anna Steiner

The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string P, report all occurrences of P within S. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-k close consecutive occurrences problem (Sitcco). Here, a consecutive occurrence is a pair ( i, j ), i < j, such that P occurs at positions i and j in S and there is no occurrence of P between i and j, and their distance is defined as j − i. Given a pattern P and a parameter k, the goal is to report the top-k consecutive occurrences of P in S of minimal distance. The challenge is to compactly represent S while supporting queries in time close to the length of P and k. We give three time-space trade-offs for the problem. Let n be the length of S, m the length of P, and ϵ ∈ ( 0, 1 ]. Our first result achieves O ( n log ⁡ n ) space and optimal query time of O ( m + k ). Our second and third results achieve linear space and query times either O ( m + k 1 + ϵ ) or O ( m + k log 1 + ϵ ⁡ n ). Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees.

Details DOI

MFCS Conference 2019 Conference Paper

From Regular Expression Matching to Parsing

Philip Bille
Inge Li Gørtz

Given a regular expression R and a string Q, the regular expression parsing problem is to determine if Q matches R and if so, determine how it matches, e. g. , by a mapping of the characters of Q to the characters in R. Regular expression parsing makes finding matches of a regular expression even more useful by allowing us to directly extract subpatterns of the match, e. g. , for extracting IP-addresses from internet traffic analysis or extracting subparts of genomes from genetic data bases. We present a new general techniques for efficiently converting a large class of algorithms that determine if a string Q matches regular expression R into algorithms that can construct a corresponding mapping. As a consequence, we obtain the first efficient linear space solutions for regular expression parsing.

Details

TCS Journal 2018 Journal Article

Time–space trade-offs for Lempel–Ziv compressed indexing

Philip Bille
Mikko Berggren Ettienne
Inge Li Gørtz
Hjalte Wedel Vildhøj

Given a string S, the compressed indexing problem is to preprocess S into a compressed representation that supports fast substring queries. The goal is to use little space relative to the compressed size of S while supporting fast queries. We present a compressed index based on the Lempel–Ziv 1977 compression scheme. We obtain the following time–space trade-offs: For constant-sized alphabets (i) O ( m + occ lg ⁡ lg ⁡ n ) time using O ( z lg ⁡ ( n / z ) lg ⁡ lg ⁡ z ) space, or (ii) O ( m ( 1 + lg ϵ ⁡ z lg ⁡ ( n / z ) ) + occ ( lg ⁡ lg ⁡ n + lg ϵ ⁡ z ) ) time using O ( z lg ⁡ ( n / z ) ) space, For integer alphabets polynomially bounded by n (iii) O ( m ( 1 + lg ϵ ⁡ z lg ⁡ ( n / z ) ) + occ ( lg ⁡ lg ⁡ n + lg ϵ ⁡ z ) ) time using O ( z ( lg ⁡ ( n / z ) + lg ⁡ lg ⁡ z ) ) space, or (iv) O ( m + occ ( lg ⁡ lg ⁡ n + lg ϵ ⁡ z ) ) time using O ( z ( lg ⁡ ( n / z ) + lg ϵ ⁡ z ) ) space, where n and m are the length of the input string and query string respectively, z is the number of phrases in the LZ77 parse of the input string, occ is the number of occurrences of the query in the input and ϵ > 0 is an arbitrarily small constant. In particular, (i) improves the leading term in the query time of the previous best solution from O ( m lg ⁡ m ) to O ( m ) at the cost of increasing the space by a factor lg ⁡ lg ⁡ z. Alternatively, (ii) matches the previous best space bound, but has a leading term in the query time of O ( m ( 1 + lg ϵ ⁡ z lg ⁡ ( n / z ) ) ). However, for any polynomial compression ratio, i. e. , z = O ( n 1 − δ ), for constant δ > 0, this becomes O ( m ). Our index also supports extraction of any substring of length ℓ in O ( ℓ + lg ⁡ ( n / z ) ) time. Technically, our results are obtained by novel extensions and combinations of existing data structures of independent interest, including a new batched variant of weak prefix search.

Details DOI

TCS Journal 2016 Journal Article

Longest common extensions in trees

Philip Bille
Paweł Gawrychowski
Inge Li Gørtz
Gad M. Landau
Oren Weimann

The longest common extension (LCE) of two indices in a string is the length of the longest identical substrings starting at these two indices. The LCE problem asks to preprocess a string into a compact data structure that supports fast LCE queries. In this paper we generalize the LCE problem to trees and suggest a few applications of LCE in trees to tries and XML databases. Given a labeled and rooted tree T of size n, the goal is to preprocess T into a compact data structure that support the following LCE queries between subpaths and subtrees in T. Let v 1, v 2, w 1, and w 2 be nodes of T such that w 1 and w 2 are descendants of v 1 and v 2 respectively. • LCE PP ( v 1, w 1, v 2, w 2 ): (path–path LCE) return the longest common prefix of the paths v 1 ↝ w 1 and v 2 ↝ w 2. • LCE PT ( v 1, w 1, v 2 ): (path–tree LCE) return maximal path–path LCE of the path v 1 ↝ w 1 and any path from v 2 to a descendant leaf. • LCE TT ( v 1, v 2 ): (tree–tree LCE) return a maximal path–path LCE of any pair of paths from v 1 and v 2 to descendant leaves. We present the first non-trivial bounds for supporting these queries. For LCE PP queries, we present a linear-space solution with O ( log ⁎ ⁡ n ) query time. For LCE PT queries, we present a linear-space solution with O ( ( log ⁡ log ⁡ n ) 2 ) query time, and complement this with a lower bound showing that any path–tree LCE structure of size O ( n polylog ( n ) ) must necessarily use Ω ( log ⁡ log ⁡ n ) time to answer queries. For LCE TT queries, we present a time-space trade-off, that given any parameter τ, 1 ≤ τ ≤ n, leads to an O ( n τ ) space and O ( n / τ ) query-time solution (all of these bounds hold on a standard unit-cost RAM model with logarithmic word size). This is complemented with a reduction from the set intersection problem implying that a fast linear space solution is not likely to exist.

Details DOI

TCS Journal 2014 Journal Article

Compact q-gram profiling of compressed strings

Philip Bille
Patrick Hagge Cording
Inge Li Gørtz

We consider the problem of computing the q-gram profile of a string T of size N compressed by a context-free grammar with n production rules. We present an algorithm that runs in O ( N − α ) expected time and uses O ( n + q + k T, q ) space, where N − α ≤ q n is the exact number of characters decompressed by the algorithm and k T, q ≤ N − α is the number of distinct q-grams in T. This simultaneously matches the current best known time bound and improves the best known space bound. Our space bound is asymptotically optimal in the sense that any algorithm storing the grammar and the q-gram profile must use Ω ( n + q + k T, q ) space. To achieve this we introduce the q-gram graph that space-efficiently captures the structure of a string with respect to its q-grams, and show how to construct it from a grammar.

Details DOI

TCS Journal 2012 Journal Article

String matching with variable length gaps

Philip Bille
Inge Li Gørtz
Hjalte Wedel Vildhøj
David Kofoed Wind

We consider string matching with variable length gaps. Given a string T and a pattern P consisting of strings separated by variable length gaps (arbitrary strings of length in a specified range), the problem is to find all ending positions of substrings in T that match P. This problem is a basic primitive in computational biology applications. Let m and n be the lengths of P and T, respectively, and let k be the number of strings in P. We present a new algorithm achieving time O ( n log k + m + α ) and space O ( m + A ), where A is the sum of the lower bounds of the lengths of the gaps in P and α is the total number of occurrences of the strings in P within T. Compared to the previous results this bound essentially achieves the best known time and space complexities simultaneously. Consequently, our algorithm obtains the best known bounds for almost all combinations of m, n, k, A, and α. Our algorithm is surprisingly simple and straightforward to implement. We also present algorithms for finding and encoding the positions of all strings in P for every match of the pattern.

Details DOI

TCS Journal 2006 Journal Article

Asymmetry in k -center variants

Inge Li Gørtz
Anthony Wirth

This paper explores three concepts: the k -center problem, some of its variants, and asymmetry. The k -center problem is fundamental in location theory. Variants of k -center may more accurately model real-life problems than the original formulation. Asymmetry is a significant impediment to approximation in many graph problems, such as k -center, facility location, k -median, and the TSP. We give an O ( log * n ) -approximation algorithm for the asymmetric weighted k -center problem. Here, the vertices have weights and we are given a total budget for opening centers. In the p-neighbor variant each vertex must have p (unweighted) centers nearby: we give an O ( log * k ) -bicriteria algorithm using 2 k centers, for small p. Finally, we show the following three versions of the asymmetric k -center problem to be inapproximable: priority k -center, k-supplier, and outliers with forbidden centers.

Details DOI