Arrow Research search

Author name cluster

Tomohiro I

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
2 author rows

Possible papers

10

TCS Journal 2020 Journal Article

Refining the r-index

  • Hideo Bannai
  • Travis Gagie
  • Tomohiro I

Gagie, Navarro and Prezza's r-index (SODA, 2018) promises to speed up DNA alignment and variation calling by allowing us to index entire genomic databases, provided certain obstacles can be overcome. In this paper we first strengthen and simplify Policriti and Prezza's Toehold Lemma (DCC '16; Algorithmica, 2017), which inspired the r-index and plays an important role in its implementation. We then show how to update the r-index efficiently after adding a new genome to the database, which is likely to be vital in practice. As a by-product of this result, we obtain an online version of Policriti and Prezza's algorithm for constructing the LZ77 parse from a run-length compressed Burrows-Wheeler Transform. Our experiments demonstrate the practicality of all three of these results. Finally, we show how to augment the r-index such that, given a new genome and fast random access to the database, we can quickly compute the matching statistics and maximal exact matches of the new genome with respect to the database.

TCS Journal 2017 Journal Article

Inferring strings from Lyndon factorization

  • Yuto Nakashima
  • Takashi Okabe
  • Tomohiro I
  • Shunsuke Inenaga
  • Hideo Bannai
  • Masayuki Takeda

The Lyndon factorization of a string w is a unique factorization ℓ 1 p 1, …, ℓ m p m of w such that ℓ 1, …, ℓ m is a sequence of Lyndon words that is monotonically decreasing in lexicographic order. In this paper, we consider the reverse-engineering problem on Lyndon factorization: Given a sequence S = ( ( s 1, p 1 ), …, ( s m, p m ) ) of ordered pairs of positive integers, find a string w whose Lyndon factorization corresponds to the input sequence S, i. e. , the Lyndon factorization of w is in a form of ℓ 1 p 1, …, ℓ m p m with | ℓ i | = s i for all 1 ≤ i ≤ m. Firstly, we show that there exists a simple O ( n ) -time algorithm if the size of the alphabet is unbounded, where n is the length of the output string. Secondly, we present an O ( n ) -time algorithm to compute a string over an alphabet of the smallest size. Thirdly, we show how to compute only the size of the smallest alphabet in O ( m ) time. Fourthly, we give an O ( m ) -time algorithm to compute an O ( m ) -size representation of a string over an alphabet of the smallest size. Finally, we propose an efficient algorithm to enumerate all strings whose Lyndon factorizations correspond to S.

TCS Journal 2016 Journal Article

Faster Lyndon factorization algorithms for SLP and LZ78 compressed text

  • Tomohiro I
  • Yuto Nakashima
  • Shunsuke Inenaga
  • Hideo Bannai
  • Masayuki Takeda

We present two efficient algorithms which, given a compressed representation of a string w of length N, compute the Lyndon factorization of w. Given a straight line program (SLP) S of size n that describes w, the first algorithm runs in O ( n 2 + P ( n, N ) + Q ( n, N ) n log ⁡ n ) time and O ( n 2 + S ( n, N ) ) space, where P ( n, N ), S ( n, N ), Q ( n, N ) are respectively the pre-processing time, space, and query time of a data structure for longest common extensions (LCE) on SLPs. Given the Lempel–Ziv 78 encoding of size s for w, the second algorithm runs in O ( s log ⁡ s ) time and space.

MFCS Conference 2016 Conference Paper

Fully Dynamic Data Structure for LCE Queries in Compressed Space

  • Takaaki Nishimoto
  • Tomohiro I
  • Shunsuke Inenaga
  • Hideo Bannai
  • Masayuki Takeda

A Longest Common Extension (LCE) query on a text T of length N asks for the length of the longest common prefix of suffixes starting at given two positions. We show that the signature encoding G of size w = O(min(z log N log^* M, N)) [Mehlhorn et al. , Algorithmica 17(2): 183-198, 1997] of T, which can be seen as a compressed representation of T, has a capability to support LCE queries in O(log N + log ell log^* M) time, where ell is the answer to the query, z is the size of the Lempel-Ziv77 (LZ77) factorization of T, and M >= 4N is an integer that can be handled in constant time under word RAM model. In compressed space, this is the fastest deterministic LCE data structure in many cases. Moreover, G can be enhanced to support efficient update operations: After processing G in O(w f_A) time, we can insert/delete any (sub)string of length y into/from an arbitrary position of T in O((y + log Nlog^* M) f_A) time, where f_A = O(min{ (loglog M loglog w)/(logloglog M), sqrt(log w/loglog w)}). This yields the first fully dynamic LCE data structure working in compressed space. We also present efficient construction algorithms from various types of inputs: We can construct G in O(N f_A) time from uncompressed string T; in O(n loglog (n log^* M) log N log^* M) time from grammar-compressed string T represented by a straight-line program of size n; and in O(z f_A log N log^* M) time from LZ77-compressed string T with z factors. On top of the above contributions, we show several applications of our data structures which improve previous best known results on grammar-compressed string processing.

TCS Journal 2015 Journal Article

Compressed automata for dictionary matching

  • Tomohiro I
  • Takaaki Nishimoto
  • Shunsuke Inenaga
  • Hideo Bannai
  • Masayuki Takeda

We address a variant of the dictionary matching problem where the dictionary is represented by a straight line program (SLP). For a given SLP-compressed dictionary D of size n and height h representing m patterns of total length N, we present an O ( n 2 log ⁡ N ) -size representation of Aho–Corasick automaton which recognizes all occurrences of the patterns in D in amortized O ( h + m ) running time per character. We also propose an algorithm to construct this compressed Aho–Corasick automaton in O ( n 3 log ⁡ n log ⁡ N ) time and O ( n 2 log ⁡ N ) space. In a spacial case where D represents only a single pattern, we present an O ( n log ⁡ N ) -size representation of the Morris–Pratt automaton which permits us to find all occurrences of the pattern in amortized O ( h ) running time per character, and we show how to construct this representation in O ( n 3 log ⁡ n log ⁡ N ) time with O ( n 2 log ⁡ N ) working space.

MFCS Conference 2013 Conference Paper

Detecting Regularities on Grammar-Compressed Strings

  • Tomohiro I
  • Wataru Matsubara
  • Kouji Shimohira
  • Shunsuke Inenaga
  • Hideo Bannai
  • Masayuki Takeda
  • Kazuyuki Narisawa
  • Ayumi Shinohara

Abstract We solve the problems of detecting and counting various forms of regularities in a string represented as a Straight Line Program (SLP). Given an SLP of size n that represents a string s of length N, our algorithm computes all runs and squares in s in O ( n 3 h ) time and O ( n 2 ) space, where h is the height of the derivation tree of the SLP. We also show an algorithm to compute all gapped-palindromes in O ( n 3 h + gnh log N ) time and O ( n 2 ) space, where g is the length of the gap. The key technique of the above solution also allows us to compute the periods and covers of the string in O ( n 2 h ) time and O ( nh ( n + log 2 N )) time, respectively.

TCS Journal 2013 Journal Article

Palindrome pattern matching

  • Tomohiro I
  • Shunsuke Inenaga
  • Masayuki Takeda

A palindrome is a string that reads the same forward and backward. For a string x, let Pals ( x ) be the set of all maximal palindromes of x, where each maximal palindrome in Pals ( x ) is encoded by a pair ( c, r ) of its center c and its radius r. Given a text t of length n and a pattern p of length m, the palindrome pattern matching problem is to compute all positions i of t such that Pals ( p ) = Pals ( t [ i: i + m − 1 ] ). We present linear-time algorithms to solve this problem.

TCS Journal 2011 Journal Article

Verifying and enumerating parameterized border arrays

  • Tomohiro I
  • Shunsuke Inenaga
  • Hideo Bannai
  • Masayuki Takeda

The parameterized pattern matching problem is to check if there exists a renaming bijection on the alphabet with which a given pattern can be transformed into a substring of a given text. A parameterized border array (p-border array) is a parameterized version of a standard border array, and we can efficiently solve the parameterized pattern matching problem using p-border arrays. In this paper, we present a linear time algorithm to verify if a given integer array is a valid p-border array for a binary alphabet. We also show a linear time algorithm to compute all binary parameterized strings sharing a given p-border array. In addition, we give an algorithm which computes all p-border arrays of length at most n, where n is a given threshold. This algorithm runs in O ( B 2 n ) time, where B 2 n is the number of all p-border arrays of length n for a binary parameter alphabet. The problems with a larger alphabet are much more difficult. Still, we present an O ( n 1. 5 ) –time O ( n ) –space algorithm to verify if a given integer array of length n is a valid p-border array for an unbounded alphabet. The best previously known solution to this task takes time proportional to the n -th Bell number 1 e ∑ k = 0 ∞ k n k! , and hence our algorithm is much more efficient. Also, we show that it is possible to enumerate all p-border arrays of length at most n for an unbounded alphabet in O ( B n n 2. 5 ) time, where B n denotes the number of p-border arrays of length n.