Benchmarks

This article summarizes a local benchmark run for ahocorasick. The benchmark uses long English and Chinese novels as input, then compares three tasks:

detect: return whether each document contains at least one pattern.
count: return the total number of pattern matches per document.
extract: materialize the extracted matches.

The benchmark compares ahocorasick with polars, AhoCorasickTrie, a basic base R for loop, and tidyverse + stringr. Note that AhoCorasickTrie is ASCII-only, so it is skipped for Chinese and mixed UTF-8 cases.

ℹ️ The benchmark scripts are totally generated by Codex and look messy.

Reproducing

The benchmark data and runners live under bench/. Generated output is written to bench/output/, which is ignored by git.

Rscript bench/01-generate-benchmark-data.R
Rscript bench/02-run-benchmarks.R

The results below come from a run with 5 iterations per expression, 500 documents per corpus, and 5000 characters per document chunk.

Case	Corpus	Patterns	Documents	Input MiB
`en_small_common`	`en_docs`	32	500	2.38
`en_large_rare`	`en_docs`	256	500	2.38
`zh_small_common`	`zh_docs`	32	500	7.05
`mix_medium`	`mix_docs`	64	500	4.72

The English corpus is built from Crime and Punishment, Pride and Prejudice, and The Count of Monte Cristo. The Chinese corpus is built from 红楼梦, 鲁迅全集, 三体1-3, and 射雕英雄传.

Semantics

The benchmark uses fixed-string matching semantics. ahocorasick uses a pre-built automaton in every timed query. count and extract use overlapping matches.

The correctness gate checks all supported detect and count results against ahocorasick. For extract, it checks extracted-match counts per document. All supported methods agreed in this run.

Build Cost

ahocorasick builds a reusable automaton before querying. Build time was small for all tested pattern sets.

Case	Build median ms
`en_small_common`	0.22
`en_large_rare`	0.37
`zh_small_common`	0.24
`mix_medium`	0.31

Detect

ahocorasick was the fastest method for detect in all four cases. polars were steadily fast. The base R loop was close for en_small_common because it can stop at the first common pattern in each document.

Median time in milliseconds:

Case	`ahocorasick`	`polars`	`AhoCorasickTrie`	base `for`	`tidyverse`
`en_small_common`	0.38	0.93	111.59	0.53	4.13
`en_large_rare`	3.04	3.21	13.82	103.94	53.16
`zh_small_common`	0.81	1.08	skipped	4.98	14.78
`mix_medium`	0.36	1.02	skipped	2.96	14.34

Count

ahocorasick was also the fastest method for count in all four cases. polars were close to ahocorasick and quite fast.

Median time in milliseconds:

Case	`ahocorasick`	`polars`	`AhoCorasickTrie`	base `for`	`tidyverse`
`en_small_common`	5.86	6.53	110.77	60.25	13.29
`en_large_rare`	4.66	5.38	14.12	375.45	53.16
`zh_small_common`	11.32	12.58	skipped	202.76	28.85
`mix_medium`	9.11	9.21	skipped	266.52	26.61

Extract

extract measures the cost of materializing match results, not just searching. ahocorasick::ac_extract_df() is the fastest method in all four cases. It will return a tidy long DataFrame directly from the Rust output, avoiding the overhead in R. polars’ performance is almost the same as ahocorasick because they both use the same Rust crate under the hood.

Median time in milliseconds:

Case	`ahocorasick`	`polars`	`AhoCorasickTrie`	base `for`	`tidyverse`
`en_small_common`	16.29	16.97	62.11	1957.20	1904.87
`en_large_rare`	4.78	6.74	12.28	506.96	222.36
`zh_small_common`	16.65	17.93	skipped	2037.89	1163.19
`mix_medium`	15.82	18.34	skipped	3499.28	2535.26

Summary

Use ahocorasick when you need fast reusable multi-pattern matching over R character vectors, especially with UTF-8 text. In this run it is consistently fastest for detect, count, and extract.

Use polars if your data is already in a DataFrame and you want to both keep in the DataFrame workflow and have good performance. Its expressions are fast, but you should make sure to use the right ones (such as find_many()).

- Reproducing
- Semantics
- Build Cost
- Detect
- Count
- Extract
- Summary