
String Algorithms

String algorithms are essential for text processing, bioinformatics, search engines, and compilers. This file covers pattern matching and string processing beyond what was introduced in the data structures topic.

Pattern Matching

KMP (Knuth-Morris-Pratt)

Covered in data structures (arrays/strings). O(n + m) using the failure function to avoid re-examining characters.

Boyer-Moore

Covered in data structures. Scans right-to-left with bad character and good suffix heuristics. O(n/m) best case (sublinear).

Aho-Corasick

Multi-pattern matching. Searches for all patterns simultaneously in one pass over the text.

Structure: Build a trie of all patterns. Add failure links (similar to KMP's failure function) connecting each node to the longest proper suffix that is also a prefix of some pattern. Add output links for patterns found at each node.

Algorithm: Feed the text character by character through the automaton. At each state, follow failure links to find all matching patterns.

Time: O(n + m + z) where n = text length, m = total pattern length, z = number of matches.

Applications: Intrusion detection (scan network traffic for thousands of attack signatures), dictionary-based spell checking, DNA motif finding, grep with multiple patterns.
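A minimal Python sketch of the automaton described above (the function name and dict-based trie representation are illustrative choices, not a standard API). It builds the trie, computes failure links with a BFS, and reports (start_index, pattern) pairs in one pass over the text:

```python
from collections import deque

def aho_corasick(text, patterns):
    # Trie: goto[state][char] -> next state; out[state] = patterns ending there
    goto, out = [{}], [[]]
    for pat in patterns:
        s = 0
        for c in pat:
            if c not in goto[s]:
                goto.append({})
                out.append([])
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].append(pat)

    # BFS to compute failure links (longest proper suffix that is a trie node)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for c, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            out[t] += out[fail[t]]        # inherit matches via failure link

    # Single pass over the text
    matches, s = [], 0
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for pat in out[s]:
            matches.append((i - len(pat) + 1, pat))
    return matches
```

Searching "ushers" for {"he", "she", "his", "hers"} finds "she", "he", and "hers" in a single scan, which a one-pattern-at-a-time approach would need four scans to do.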

Z-Algorithm

Compute the Z-array: Z[i] = length of the longest substring starting at i that matches a prefix of the string.

FUNCTION Z_FUNCTION(s)
    n ← length(s)
    z ← array of size n, all 0
    l ← 0, r ← 0
    FOR i ← 1 TO n - 1
        IF i < r
            z[i] ← MIN(z[i - l], r - i)
        WHILE i + z[i] < n AND s[z[i]] = s[i + z[i]]
            z[i] ← z[i] + 1
        IF i + z[i] > r
            l ← i
            r ← i + z[i]
    RETURN z

Pattern matching with Z: Concatenate pattern + "$" + text, where "$" is a separator character that appears in neither. Positions i > m with Z[i] = m (the pattern length) correspond to matches starting at text index i - m - 1.

Time: O(n). Simpler than KMP for some applications.
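The pseudocode above translates directly to Python. The helper `z_search` and its NUL separator are illustrative choices for the concatenation trick:

```python
def z_function(s):
    # z[i] = length of the longest substring at i matching a prefix of s
    n = len(s)
    z = [0] * n
    l = r = 0
    for i in range(1, n):
        if i < r:
            z[i] = min(z[i - l], r - i)      # reuse the mirrored value
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z

def z_search(text, pattern, sep="\x00"):
    # Occurrence positions of pattern in text; sep must occur in neither
    m = len(pattern)
    z = z_function(pattern + sep + text)
    return [i - m - 1 for i in range(m + 1, len(z)) if z[i] == m]
```

For example, `z_search("abcababc", "ab")` reports the occurrences at indices 0, 3, and 5.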

Suffix Arrays

A sorted array of all suffixes of a string (represented by starting indices).

SA-IS Construction

O(n) construction using the Suffix Array Induced Sorting algorithm (Nong, Zhang, Chan, 2009).

Key idea: Classify each suffix as S-type (smaller than the next suffix) or L-type (larger). Sort the LMS (leftmost S-type) suffixes, then induce the sorted order of all remaining suffixes from them.

In practice: SA-IS is fast but complex. O(n log n) algorithms (prefix doubling) are simpler to implement.

LCP Array

The Longest Common Prefix array: LCP[i] = length of the longest common prefix between the suffixes starting at SA[i] and SA[i-1].

Kasai's algorithm: Compute LCP array from the suffix array in O(n).

Applications with SA + LCP:

  • Pattern search: Binary search on SA → O(m log n). With LCP: O(m + log n).
  • Longest repeated substring: max(LCP).
  • Number of distinct substrings: n(n+1)/2 - Σ LCP[i].
  • Longest common substring of two strings: Concatenate with separator, build SA + LCP, find max LCP between suffixes from different strings.
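A short Python sketch tying these pieces together: a deliberately naive suffix-array build (sorting full suffixes, O(n² log n); SA-IS or prefix doubling would replace it in practice), Kasai's LCP algorithm, and the distinct-substring count from the formula above. Function names are illustrative:

```python
def suffix_array(s):
    # Naive build by sorting whole suffixes; fine for illustration only
    return sorted(range(len(s)), key=lambda i: s[i:])

def kasai_lcp(s, sa):
    # Kasai's O(n): lcp[i] = LCP of suffixes at sa[i] and sa[i-1]; lcp[0] = 0
    n = len(s)
    rank = [0] * n
    for pos, suf in enumerate(sa):
        rank[suf] = pos
    lcp = [0] * n
    h = 0
    for i in range(n):                 # process suffixes in text order
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1                 # invariant: h drops by at most 1 per step
        else:
            h = 0
    return lcp

def distinct_substrings(s):
    # n(n+1)/2 total substrings minus prefixes shared by adjacent suffixes
    n = len(s)
    return n * (n + 1) // 2 - sum(kasai_lcp(s, suffix_array(s)))
```

For "banana" the suffix array is [5, 3, 1, 0, 4, 2] with LCP [0, 1, 3, 0, 0, 2]; the maximum LCP value 3 is the longest repeated substring "ana".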

Suffix Trees

Compressed trie of all suffixes. Linear space and time construction (Ukkonen's algorithm, O(n)).

Suffix trees are more powerful but harder to implement than suffix arrays. In practice, SA + LCP is preferred (simpler, cache-friendly, less memory).

Suffix Automaton

The smallest DFA accepting all suffixes of a string. O(n) states and transitions.

Key property: Each state represents an equivalence class of substrings (all substrings sharing the same set of ending positions).

Construction: Online, O(n). At each step, extend the automaton with one character.

Applications: Count distinct substrings, find shortest non-occurring substring, find longest common substring.
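A sketch of the standard online construction in Python (states held as dicts in a list; the representation is an illustrative choice). Counting distinct substrings then falls out of the state lengths, since each state covers the substrings of lengths len(link(v))+1 through len(v):

```python
def build_suffix_automaton(s):
    # Each state: longest substring length in its class, suffix link, edges
    sa = [{"len": 0, "link": -1, "next": {}}]
    last = 0
    for c in s:
        cur = len(sa)
        sa.append({"len": sa[last]["len"] + 1, "link": -1, "next": {}})
        p = last
        while p != -1 and c not in sa[p]["next"]:
            sa[p]["next"][c] = cur
            p = sa[p]["link"]
        if p == -1:
            sa[cur]["link"] = 0
        else:
            q = sa[p]["next"][c]
            if sa[p]["len"] + 1 == sa[q]["len"]:
                sa[cur]["link"] = q
            else:
                # Split q: clone keeps q's transitions but a shorter length
                clone = len(sa)
                sa.append({"len": sa[p]["len"] + 1,
                           "link": sa[q]["link"],
                           "next": dict(sa[q]["next"])})
                while p != -1 and sa[p]["next"].get(c) == q:
                    sa[p]["next"][c] = clone
                    p = sa[p]["link"]
                sa[q]["link"] = clone
                sa[cur]["link"] = clone
        last = cur
    return sa

def count_distinct_substrings(s):
    # Each non-initial state contributes len(v) - len(link(v)) substrings
    sa = build_suffix_automaton(s)
    return sum(st["len"] - sa[st["link"]]["len"] for st in sa[1:])
```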

Longest Palindromic Substring (Manacher's Algorithm)

Find the longest palindromic substring in O(n).

Key idea: Track the rightmost palindrome boundary. Use symmetry to initialize palindrome radii at new centers.

FUNCTION MANACHER(s)
    // Transform: "abc" -> "^#a#b#c#$" to handle even-length palindromes
    n ← length(s)
    t ← ['^', '#']
    FOR EACH c IN s
        APPEND c TO t; APPEND '#' TO t
    APPEND '$' TO t

    m ← length(t)
    p ← array of size m, all 0   // p[i] = radius of palindrome centered at i
    c ← 0, r ← 0                 // center and right boundary

    FOR i ← 1 TO m - 2
        IF i < r
            p[i] ← MIN(p[2 * c - i], r - i)
        WHILE t[i + p[i] + 1] = t[i - p[i] - 1]
            p[i] ← p[i] + 1
        IF i + p[i] > r
            c ← i
            r ← i + p[i]

    RETURN MAX(p)   // max radius = max palindrome length in original

Time: O(n). Space: O(n).
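A Python version of the pseudocode that also maps the winning center back to the original string and returns the substring itself (the index arithmetic in the last two lines is the only part not spelled out above):

```python
def longest_palindrome(s):
    # Transform "abc" -> "^#a#b#c#$"; sentinels remove bounds checks
    t = "^#" + "#".join(s) + "#$"
    n = len(t)
    p = [0] * n                   # p[i] = palindrome radius at center i in t
    c = r = 0
    for i in range(1, n - 1):
        if i < r:
            p[i] = min(p[2 * c - i], r - i)   # mirror of i about center c
        while t[i + p[i] + 1] == t[i - p[i] - 1]:
            p[i] += 1
        if i + p[i] > r:
            c, r = i, i + p[i]
    max_len, center = max((p[i], i) for i in range(1, n - 1))
    start = (center - max_len) // 2           # map back to original indices
    return s[start:start + max_len]
```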

String Hashing

Rolling Hash (Rabin-Karp)

Hash a substring in O(1) after O(n) preprocessing.

H(s[l..r]) = (s[l]·p^(r-l) + s[l+1]·p^(r-l-1) + ... + s[r]) mod M

Precompute prefix hashes (prefix_hash[i] = hash of s[0..i-1]) and powers of p. Then:

H(s[l..r]) = (prefix_hash[r+1] - prefix_hash[l] · p^(r-l+1)) mod M

Add M before the final mod in languages where the remainder operator can return negative values.

Collision probability: ~1/M per comparison. Use double hashing (two different (p, M) pairs) for ~1/M² collision probability.

Applications: Rabin-Karp pattern matching, longest common substring (binary search + hashing), plagiarism detection, string equality checking.
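A compact sketch of the prefix-hash scheme above (the class name is illustrative, and mapping characters via `ord(ch) - ord("a") + 1` assumes lowercase input). Python's `%` already yields a non-negative result, so no correction is needed in `get`:

```python
class RollingHash:
    # Prefix-hash table: query the hash of any substring in O(1)
    def __init__(self, s, p=31, mod=10**9 + 7):
        self.mod = mod
        n = len(s)
        self.h = [0] * (n + 1)    # h[i] = hash of s[0..i-1]
        self.pw = [1] * (n + 1)   # pw[i] = p^i mod M
        for i, ch in enumerate(s):
            v = ord(ch) - ord("a") + 1       # 'a'..'z' -> 1..26 (avoid 0)
            self.h[i + 1] = (self.h[i] * p + v) % mod
            self.pw[i + 1] = self.pw[i] * p % mod

    def get(self, l, r):
        # Hash of s[l..r] inclusive
        return (self.h[r + 1] - self.h[l] * self.pw[r - l + 1]) % self.mod
```

With this, checking whether two substrings are (probably) equal is a single comparison: in "abcabc", the hashes of s[0..2] and s[3..5] agree.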

Polynomial Hash

FUNCTION POLY_HASH(s, p, m)
    hash ← 0
    power ← 1
    FOR EACH c IN s
        hash ← (hash + c * power) MOD m
        power ← (power * p) MOD m
    RETURN hash

Good choices: p = 31 or 37 (prime, larger than the alphabet), M = 10⁹ + 7 or 10⁹ + 9. Map characters to values starting at 1 (e.g. 'a' → 1): mapping to 0 makes strings like "a" and "aa" collide. Note that POLY_HASH accumulates from the low-order end (s[0]·p⁰ + s[1]·p¹ + …), the reverse of the convention used for substring hashing above; either works if applied consistently.

Regular Expression Matching

Thompson's Construction

Convert a regex to an NFA (non-deterministic finite automaton), then simulate the NFA.

Simulation: Track the set of current states (using BFS/DFS through epsilon transitions). For each input character, transition to the next set of states.

Time: O(nm) where n = text length, m = regex size. Linear in text length for a given regex.

Key insight: The NFA simulation approach is much better than backtracking-based regex engines for pathological patterns. Backtracking engines (Perl, Python, Java) can take exponential time on patterns like (a+)+b.
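The set-of-states simulation can be sketched for a tiny regex subset: literals, '.' (any character), and '*' (zero or more of the previous atom). This is not full Thompson's construction; states are simply positions in the pattern, and the epsilon closure handles skipping starred atoms. The function name and subset are illustrative assumptions:

```python
def regex_match(text, pattern):
    # NFA-style simulation: state i means "pattern[:i] has been matched".
    # Assumes a well-formed pattern ('*' never first, never doubled).
    def closure(states):
        # Epsilon closure: a starred atom may be skipped entirely
        out, stack = set(), list(states)
        while stack:
            i = stack.pop()
            if i in out:
                continue
            out.add(i)
            if i + 1 < len(pattern) and pattern[i + 1] == "*":
                stack.append(i + 2)
        return out

    current = closure({0})
    for ch in text:
        nxt = set()
        for i in current:
            if i >= len(pattern):
                continue                # accepting state cannot consume input
            c = pattern[i]
            if c == "." or c == ch:
                if i + 1 < len(pattern) and pattern[i + 1] == "*":
                    nxt.add(i)          # loop on the starred atom
                else:
                    nxt.add(i + 1)
        current = closure(nxt)
        if not current:
            return False                # no live states: fail early
    return len(pattern) in current
```

Because the state set never exceeds the pattern length, each input character costs O(m), giving the O(nm) bound; there is no backtracking, so patterns like (a+)+b cannot blow up.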

Practical Regex Engines

  • DFA-based (RE2, Go, Rust regex): Guaranteed O(n) matching. No backreferences.
  • NFA-based (Thompson): O(nm) worst case. No backreferences.
  • Backtracking (PCRE, Perl, Python): Supports backreferences and lookahead. Exponential worst case.

Rust's regex crate uses a hybrid approach: DFA when possible, bounded NFA fallback.

Applications in CS

  • Search engines: Inverted index construction uses string processing. Query matching uses pattern algorithms.
  • Bioinformatics: Sequence alignment, motif discovery, genome assembly (suffix arrays, Aho-Corasick).
  • Compilers: Lexical analysis (regex → NFA → DFA). String interning.
  • Version control: diff algorithms use LCS. git uses rolling hashes for pack files.
  • Data compression: LZ77/LZ78 use longest matching substring. BWT uses suffix arrays.
  • Plagiarism detection: Document fingerprinting with rolling hashes (Winnowing algorithm).
  • Network security: Deep packet inspection uses multi-pattern matching (Aho-Corasick, PCRE).
  • Text editors: Find and replace, syntax highlighting, autocomplete.