---
title: "Benchmarks"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Benchmarks}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r}
#| label: setup
#| include: false
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
This article summarizes a local benchmark run for `ahocorasick`. The benchmark
uses long English and Chinese novels as input, then compares three tasks:
- `detect`: return whether each document contains at least one pattern.
- `count`: return the total number of pattern matches per document.
- `extract`: materialize the extracted matches.
The benchmark compares `ahocorasick` with `polars`, `AhoCorasickTrie`, a basic
base R `for` loop, and `tidyverse` + `stringr`. Note that `AhoCorasickTrie` is
ASCII-only, so it is skipped for Chinese and mixed UTF-8 cases.
ℹ️ *The benchmark scripts are totally generated by Codex and look messy.*
## Reproducing
The benchmark data and runners live under `bench/`. Generated output is written
to `bench/output/`, which is ignored by git.
```{bash}
#| label: run-benchmarks
#| eval: false
Rscript bench/01-generate-benchmark-data.R
Rscript bench/02-run-benchmarks.R
```
The results below come from a run with 5 iterations per expression, 500
documents per corpus, and 5000 characters per document chunk.
| Case | Corpus | Patterns | Documents | Input MiB |
| ----------------- | ---------- | -------- | --------- | --------- |
| `en_small_common` | `en_docs` | 32 | 500 | 2.38 |
| `en_large_rare` | `en_docs` | 256 | 500 | 2.38 |
| `zh_small_common` | `zh_docs` | 32 | 500 | 7.05 |
| `mix_medium` | `mix_docs` | 64 | 500 | 4.72 |
The English corpus is built from *Crime and Punishment*, *Pride and Prejudice*,
and *The Count of Monte Cristo*. The Chinese corpus is built from `红楼梦`,
`鲁迅全集`, `三体1-3`, and `射雕英雄传`.
## Semantics
The benchmark uses fixed-string matching semantics. `ahocorasick` uses a
pre-built automaton in every timed query. `count` and `extract` use overlapping
matches.
The correctness gate checks all supported `detect` and `count` results against
`ahocorasick`. For `extract`, it checks extracted-match counts per document. All
supported methods agreed in this run.
## Build Cost
`ahocorasick` builds a reusable automaton before querying. Build time was small
for all tested pattern sets.
| Case | Build median ms |
| ----------------- | --------------- |
| `en_small_common` | 0.22 |
| `en_large_rare` | 0.37 |
| `zh_small_common` | 0.24 |
| `mix_medium` | 0.31 |
## Detect
`ahocorasick` was the fastest method for `detect` in all four cases. `polars`
were steadily fast. The base R loop was close for `en_small_common` because it
can stop at the first common pattern in each document.
Median time in milliseconds:
| Case | `ahocorasick` | `polars` | `AhoCorasickTrie` | base `for` | `tidyverse` |
| ----------------- | ------------- | -------- | ----------------- | ---------- | ----------- |
| `en_small_common` | 0.38 | 0.93 | 111.59 | 0.53 | 4.13 |
| `en_large_rare` | 3.04 | 3.21 | 13.82 | 103.94 | 53.16 |
| `zh_small_common` | 0.81 | 1.08 | skipped | 4.98 | 14.78 |
| `mix_medium` | 0.36 | 1.02 | skipped | 2.96 | 14.34 |
## Count
`ahocorasick` was also the fastest method for `count` in all four cases.
`polars` were close to `ahocorasick` and quite fast.
Median time in milliseconds:
| Case | `ahocorasick` | `polars` | `AhoCorasickTrie` | base `for` | `tidyverse` |
| ----------------- | ------------- | -------- | ----------------- | ---------- | ----------- |
| `en_small_common` | 5.86 | 6.53 | 110.77 | 60.25 | 13.29 |
| `en_large_rare` | 4.66 | 5.38 | 14.12 | 375.45 | 53.16 |
| `zh_small_common` | 11.32 | 12.58 | skipped | 202.76 | 28.85 |
| `mix_medium` | 9.11 | 9.21 | skipped | 266.52 | 26.61 |
## Extract
`extract` measures the cost of materializing match results, not just searching.
`ahocorasick::ac_extract_df()` is the fastest method in all four cases. It will
return a tidy long DataFrame directly from the Rust output, avoiding the
overhead in R. `polars`' performance is almost the same as `ahocorasick` because
they both use the same Rust crate under the hood.
Median time in milliseconds:
| Case | `ahocorasick` | `polars` | `AhoCorasickTrie` | base `for` | `tidyverse` |
| ----------------- | ------------- | -------- | ----------------- | ---------- | ----------- |
| `en_small_common` | 16.29 | 16.97 | 62.11 | 1957.20 | 1904.87 |
| `en_large_rare` | 4.78 | 6.74 | 12.28 | 506.96 | 222.36 |
| `zh_small_common` | 16.65 | 17.93 | skipped | 2037.89 | 1163.19 |
| `mix_medium` | 15.82 | 18.34 | skipped | 3499.28 | 2535.26 |
## Summary
Use `ahocorasick` when you need fast reusable multi-pattern matching over R
character vectors, especially with UTF-8 text. In this run it is consistently
fastest for `detect`, `count`, and `extract`.
Use `polars` if your data is already in a DataFrame and you want to both keep in
the DataFrame workflow and have good performance. Its expressions are fast, but
you should make sure to use the right ones (such as `find_many()`).