← Back to Blog

Aho-Corasick in 180 Lines of Pure Lateralus

April 24, 2026 searchstdlibwave-10

Every serious multi-keyword filter — fgrep -f patterns.txt, Snort's IDS rules, every spam filter you've ever cursed at — is the same algorithm under the hood. Alfred Aho and Margaret Corasick published it in 1975 as a generalisation of KMP. Given a set of patterns, it builds one automaton that finds every occurrence of every pattern in a single left-to-right scan of the input.

It's built out of three tables:

The total work is O(|text| + |patterns| + |matches|), which is the best you can do.

◉ The whole algorithm, in three small Lateralus functions

Here's the build function from stdlib/aho_corasick.ltl. Nothing fancy, no bitset tricks, just lists and dicts:

pub fn build(patterns: list) -> dict {
    let auto = {}
    auto["goto"]     = []     // list of dict: child-byte -> node-id
    auto["fail"]     = []     // list of int:  fallback node
    auto["output"]   = []     // list of list: pattern-ids ending here
    auto["patterns"] = patterns
    ac_new_node(auto)         // root = 0

    // Step 1. Insert each pattern as a path in the trie.
    let pi = 0
    while pi < len(patterns) {
        let p = patterns[pi]
        let node = 0
        let j = 0
        while j < len(p) {
            let c = ord(char_at(p, j))
            let children = auto["goto"][node]
            if contains(children, c) {
                node = children[c]
            } else {
                ac_new_node(auto)
                let new_id = len(auto["goto"]) - 1
                auto["goto"][node][c] = new_id
                node = new_id
            }
            j = j + 1
        }
        auto["output"][node] = auto["output"][node] + [pi]
        pi = pi + 1
    }

    // Step 2. BFS from root, computing fail pointers.
    // ... (elided — 20 lines of BFS)
    return auto
}

And the scan is exactly as small:

pub fn find_all(auto: dict, text: str) -> list {
    let hits = []
    let node = 0
    let i = 0
    while i < len(text) {
        let c = ord(char_at(text, i))
        // Walk fail-chain until we can proceed.
        while node > 0 && (contains(auto["goto"][node], c) == false) {
            node = auto["fail"][node]
        }
        if contains(auto["goto"][node], c) {
            node = auto["goto"][node][c]
        }
        let outs = auto["output"][node]
        let j = 0
        while j < len(outs) {
            hits = hits + [[outs[j], i + 1]]
            j = j + 1
        }
        i = i + 1
    }
    return hits
}

That's the entire matcher. The source of stdlib/aho_corasick.ltl is 165 lines including comments. Reading it end-to-end is cheaper than reading the Wikipedia article.

◉ The canonical "ushers" test

Aho & Corasick's 1975 paper gives one example: patterns {he, she, his, hers} in the text "ushers". The trace hits three patterns:

let a = aho_corasick_build(["he", "she", "his", "hers"])
let hits = aho_corasick_find_all(a, "ushers")
// [[1, 4], [0, 4], [3, 6]]
//   she^  he^   hers^

That exact vector is in tests/stdlib_spiral_wave_10.ltl and runs green.

◉ Why it beats "regex in a loop"

The naïve alternative — run N separate regex matches, one per pattern — is O(N × |text|). Aho-Corasick collapses all N patterns into one pass: O(|text|).

For a log filter looking for 50 signature strings in a 100 MB log, that's 50× less work. Not 50% less — 50×. This is why every serious IDS and log shipper ships an Aho-Corasick implementation internally. With Lateralus, you don't have to ship it internally; it's already there.

◉ The real use-case: stream filtering

Here's a log shipper that drops everything except lines matching any of an alert dictionary, using three stdlib modules together:

import aho_corasick
import logfmt
import prometheus

let alert_words = aho_corasick_build([
    "error", "panic", "segfault", "denied", "timeout", "killed"
])

for line in stdin_lines() {
    let fields = logfmt_parse(line)
    let msg = fields["msg"]
    if aho_corasick_contains_any(alert_words, msg) {
        prometheus_emit_counter("log_alerts_total",
                                 {"severity": fields["level"]})
        println(line)
    }
}

Three stdlib modules, no external packages, one lateralus c away from a native binary.

◉ Where to from here

This is one module out of the Wave 10 search lane: suffix array, Aho-Corasick, Levenshtein, SimHash, q-gram. Wire them together and you have the guts of a full-text search engine in four stdlib imports.