Benchmarks

This article summarizes a local benchmark run for ahocorasick. The benchmark uses long English and Chinese novels as input, then compares three tasks:

  • detect: return whether each document contains at least one pattern.
  • count: return the total number of pattern matches per document.
  • extract: materialize the extracted matches.

The benchmark compares ahocorasick with polars, AhoCorasickTrie, a basic base R for loop, and tidyverse + stringr. Note that AhoCorasickTrie is ASCII-only, so it is skipped for Chinese and mixed UTF-8 cases.

ℹ️ The benchmark scripts are totally generated by Codex and look messy.

Reproducing

The benchmark data and runners live under bench/. Generated output is written to bench/output/, which is ignored by git.

Rscript bench/01-generate-benchmark-data.R
Rscript bench/02-run-benchmarks.R

The results below come from a run with 5 iterations per expression, 500 documents per corpus, and 5000 characters per document chunk.

Case Corpus Patterns Documents Input MiB
en_small_common en_docs 32 500 2.38
en_large_rare en_docs 256 500 2.38
zh_small_common zh_docs 32 500 7.05
mix_medium mix_docs 64 500 4.72

The English corpus is built from Crime and Punishment, Pride and Prejudice, and The Count of Monte Cristo. The Chinese corpus is built from 红楼梦, 鲁迅全集, 三体1-3, and 射雕英雄传.

Semantics

The benchmark uses fixed-string matching semantics. ahocorasick uses a pre-built automaton in every timed query. count and extract use overlapping matches.

The correctness gate checks all supported detect and count results against ahocorasick. For extract, it checks extracted-match counts per document. All supported methods agreed in this run.

Build Cost

ahocorasick builds a reusable automaton before querying. Build time was small for all tested pattern sets.

Case Build median ms
en_small_common 0.22
en_large_rare 0.37
zh_small_common 0.24
mix_medium 0.31

Detect

ahocorasick was the fastest method for detect in all four cases. polars were steadily fast. The base R loop was close for en_small_common because it can stop at the first common pattern in each document.

Median time in milliseconds:

Case ahocorasick polars AhoCorasickTrie base for tidyverse
en_small_common 0.38 0.93 111.59 0.53 4.13
en_large_rare 3.04 3.21 13.82 103.94 53.16
zh_small_common 0.81 1.08 skipped 4.98 14.78
mix_medium 0.36 1.02 skipped 2.96 14.34

Count

ahocorasick was also the fastest method for count in all four cases. polars were close to ahocorasick and quite fast.

Median time in milliseconds:

Case ahocorasick polars AhoCorasickTrie base for tidyverse
en_small_common 5.86 6.53 110.77 60.25 13.29
en_large_rare 4.66 5.38 14.12 375.45 53.16
zh_small_common 11.32 12.58 skipped 202.76 28.85
mix_medium 9.11 9.21 skipped 266.52 26.61

Extract

extract measures the cost of materializing match results, not just searching. ahocorasick::ac_extract_df() is the fastest method in all four cases. It will return a tidy long DataFrame directly from the Rust output, avoiding the overhead in R. polars’ performance is almost the same as ahocorasick because they both use the same Rust crate under the hood.

Median time in milliseconds:

Case ahocorasick polars AhoCorasickTrie base for tidyverse
en_small_common 16.29 16.97 62.11 1957.20 1904.87
en_large_rare 4.78 6.74 12.28 506.96 222.36
zh_small_common 16.65 17.93 skipped 2037.89 1163.19
mix_medium 15.82 18.34 skipped 3499.28 2535.26

Summary

Use ahocorasick when you need fast reusable multi-pattern matching over R character vectors, especially with UTF-8 text. In this run it is consistently fastest for detect, count, and extract.

Use polars if your data is already in a DataFrame and you want to both keep in the DataFrame workflow and have good performance. Its expressions are fast, but you should make sure to use the right ones (such as find_many()).