Aho-Corasick in 180 Lines of Pure Lateralus
Every serious multi-keyword filter — fgrep -f patterns.txt, Snort's IDS rules, every spam filter you've ever cursed at — is the same algorithm under the hood. Alfred Aho and Margaret Corasick published it in 1975 as a generalisation of KMP. Given a set of patterns, it builds one automaton that finds every occurrence of every pattern in a single left-to-right scan of the input.
It's built out of three tables:
- goto — a trie of the patterns. Walk characters; if a child exists, follow it.
- fail — for each node, the longest proper suffix that is also a prefix of some pattern. When goto is stuck, jump here and try again.
- output — at each node, which patterns end here (including those ending at a fail-chain ancestor).
The total work is O(|text| + |patterns| + |matches|), which is the best you can do.
◉ The whole algorithm, in three small Lateralus functions
Here's the build function from stdlib/aho_corasick.ltl. Nothing fancy, no bitset tricks, just lists and dicts:
pub fn build(patterns: list) -> dict {
let auto = {}
auto["goto"] = [] // list of dict: child-byte -> node-id
auto["fail"] = [] // list of int: fallback node
auto["output"] = [] // list of list: pattern-ids ending here
auto["patterns"] = patterns
ac_new_node(auto) // root = 0
// Step 1. Insert each pattern as a path in the trie.
let pi = 0
while pi < len(patterns) {
let p = patterns[pi]
let node = 0
let j = 0
while j < len(p) {
let c = ord(char_at(p, j))
let children = auto["goto"][node]
if contains(children, c) {
node = children[c]
} else {
ac_new_node(auto)
let new_id = len(auto["goto"]) - 1
auto["goto"][node][c] = new_id
node = new_id
}
j = j + 1
}
auto["output"][node] = auto["output"][node] + [pi]
pi = pi + 1
}
// Step 2. BFS from root, computing fail pointers.
// ... (elided — 20 lines of BFS)
return auto
}
And the scan is exactly as small:
pub fn find_all(auto: dict, text: str) -> list {
let hits = []
let node = 0
let i = 0
while i < len(text) {
let c = ord(char_at(text, i))
// Walk fail-chain until we can proceed.
while node > 0 && (contains(auto["goto"][node], c) == false) {
node = auto["fail"][node]
}
if contains(auto["goto"][node], c) {
node = auto["goto"][node][c]
}
let outs = auto["output"][node]
let j = 0
while j < len(outs) {
hits = hits + [[outs[j], i + 1]]
j = j + 1
}
i = i + 1
}
return hits
}
That's the entire matcher. The source of stdlib/aho_corasick.ltl is 165 lines including comments. Reading it end-to-end is cheaper than reading the Wikipedia article.
◉ The canonical "ushers" test
Aho & Corasick's 1975 paper gives one example: patterns {he, she, his, hers} in the text "ushers". The trace hits three patterns:
let a = aho_corasick_build(["he", "she", "his", "hers"])
let hits = aho_corasick_find_all(a, "ushers")
// [[1, 4], [0, 4], [3, 6]]
// she^ he^ hers^
That exact vector is in tests/stdlib_spiral_wave_10.ltl and runs green.
◉ Why it beats "regex in a loop"
The naïve alternative — run N separate regex matches, one per pattern — is O(N × |text|). Aho-Corasick collapses all N patterns into one pass: O(|text|).
For a log filter looking for 50 signature strings in a 100 MB log, that's 50× less work. Not 50% less — 50×. This is why every serious IDS and log shipper ships an Aho-Corasick implementation internally. With Lateralus, you don't have to ship it internally; it's already there.
◉ The real use-case: stream filtering
Here's a log shipper that drops everything except lines matching any of an alert dictionary, using three stdlib modules together:
import aho_corasick
import logfmt
import prometheus
let alert_words = aho_corasick_build([
"error", "panic", "segfault", "denied", "timeout", "killed"
])
for line in stdin_lines() {
let fields = logfmt_parse(line)
let msg = fields["msg"]
if aho_corasick_contains_any(alert_words, msg) {
prometheus_emit_counter("log_alerts_total",
{"severity": fields["level"]})
println(line)
}
}
Three stdlib modules, no external packages, one lateralus c away from a native binary.
◉ Where to from here
- Source: stdlib/aho_corasick.ltl (165 lines)
- Tests: tests/stdlib_spiral_wave_10.ltl
- Example: examples/spiral_search.ltl
- Play with it: lateralus.dev/playground
This is one module out of the Wave 10 search lane: suffix array, Aho-Corasick, Levenshtein, SimHash, q-gram. Wire them together and you have the guts of a full-text search engine in four stdlib imports.