This article summarizes a local
benchmark run for ahocorasick. The benchmark uses long
English and Chinese novels as input, then compares three tasks:
detect: return whether each document contains at least
one pattern.count: return the total number of pattern matches per
document.extract: materialize the extracted matches.The benchmark compares ahocorasick with
polars, AhoCorasickTrie, a basic base R
for loop, and tidyverse +
stringr. Note that AhoCorasickTrie is
ASCII-only, so it is skipped for Chinese and mixed UTF-8 cases.
ℹ️ The benchmark scripts are totally generated by Codex and look messy.
The benchmark data and runners live under bench/.
Generated output is written to bench/output/, which is
ignored by git.
The results below come from a run with 5 iterations per expression, 500 documents per corpus, and 5000 characters per document chunk.
| Case | Corpus | Patterns | Documents | Input MiB |
|---|---|---|---|---|
en_small_common |
en_docs |
32 | 500 | 2.38 |
en_large_rare |
en_docs |
256 | 500 | 2.38 |
zh_small_common |
zh_docs |
32 | 500 | 7.05 |
mix_medium |
mix_docs |
64 | 500 | 4.72 |
The English corpus is built from Crime and Punishment,
Pride and Prejudice, and The Count of Monte Cristo.
The Chinese corpus is built from 红楼梦,
鲁迅全集, 三体1-3, and
射雕英雄传.
The benchmark uses fixed-string matching semantics.
ahocorasick uses a pre-built automaton in every timed
query. count and extract use overlapping
matches.
The correctness gate checks all supported detect and
count results against ahocorasick. For
extract, it checks extracted-match counts per document. All
supported methods agreed in this run.
ahocorasick builds a reusable automaton before querying.
Build time was small for all tested pattern sets.
| Case | Build median ms |
|---|---|
en_small_common |
0.22 |
en_large_rare |
0.37 |
zh_small_common |
0.24 |
mix_medium |
0.31 |
ahocorasick was the fastest method for
detect in all four cases. polars were steadily
fast. The base R loop was close for en_small_common because
it can stop at the first common pattern in each document.
Median time in milliseconds:
| Case | ahocorasick |
polars |
AhoCorasickTrie |
base for |
tidyverse |
|---|---|---|---|---|---|
en_small_common |
0.38 | 0.93 | 111.59 | 0.53 | 4.13 |
en_large_rare |
3.04 | 3.21 | 13.82 | 103.94 | 53.16 |
zh_small_common |
0.81 | 1.08 | skipped | 4.98 | 14.78 |
mix_medium |
0.36 | 1.02 | skipped | 2.96 | 14.34 |
ahocorasick was also the fastest method for
count in all four cases. polars were close to
ahocorasick and quite fast.
Median time in milliseconds:
| Case | ahocorasick |
polars |
AhoCorasickTrie |
base for |
tidyverse |
|---|---|---|---|---|---|
en_small_common |
5.86 | 6.53 | 110.77 | 60.25 | 13.29 |
en_large_rare |
4.66 | 5.38 | 14.12 | 375.45 | 53.16 |
zh_small_common |
11.32 | 12.58 | skipped | 202.76 | 28.85 |
mix_medium |
9.11 | 9.21 | skipped | 266.52 | 26.61 |
extract measures the cost of materializing match
results, not just searching. ahocorasick::ac_extract_df()
is the fastest method in all four cases. It will return a tidy long
DataFrame directly from the Rust output, avoiding the overhead in R.
polars’ performance is almost the same as
ahocorasick because they both use the same Rust crate under
the hood.
Median time in milliseconds:
| Case | ahocorasick |
polars |
AhoCorasickTrie |
base for |
tidyverse |
|---|---|---|---|---|---|
en_small_common |
16.29 | 16.97 | 62.11 | 1957.20 | 1904.87 |
en_large_rare |
4.78 | 6.74 | 12.28 | 506.96 | 222.36 |
zh_small_common |
16.65 | 17.93 | skipped | 2037.89 | 1163.19 |
mix_medium |
15.82 | 18.34 | skipped | 3499.28 | 2535.26 |
Use ahocorasick when you need fast reusable
multi-pattern matching over R character vectors, especially with UTF-8
text. In this run it is consistently fastest for detect,
count, and extract.
Use polars if your data is already in a DataFrame and
you want to both keep in the DataFrame workflow and have good
performance. Its expressions are fast, but you should make sure to use
the right ones (such as find_many()).