There are many great libraries for string processing! Mostly, of course, written in Assembly, C, and C++, but some in Rust as well.
Where Rust outshines C and C++ is the simplicity of dependency management, making it great for benchmarking "Systems Software" and lining up apples-to-apples comparisons across native crates and their Python bindings.
So, to accelerate the development of the StringZilla C, C++, and CUDA libraries (with Rust and Python bindings), I've created this repository to compare them against some of my and the community's most beloved Rust projects, like:
- `memchr` for substring search.
- `rapidfuzz` and `bio` for edit distances and alignments.
- `aHash`, `xxhash-rust`, `foldhash`, and `blake3` for hashing.
- `aho_corasick` and `regex` for multi-pattern search.
- `arrow` and `polars` for collections and sorting.
- `icu` for Unicode processing.
- `ring` and `sodiumoxide` for encryption.
Of course, the functionality of the projects differs, as do the APIs and usage patterns. So, I focus on the workloads for which StringZilla was designed and compare the throughput of the core operations. Notably, I also favor modern hardware supporting a wider range of SIMD instructions, like mask-equipped AVX-512 on x86, starting with the 2017 Intel Skylake-X CPUs, or the more recent predicated variable-length SVE and SVE2 on Arm, which aren't often supported by existing libraries and tooling.
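Since most of these gains hinge on instruction sets your CPU may or may not expose, it's worth checking what your benchmark machine actually supports before comparing numbers. A minimal sketch using Rust's standard runtime feature detection, assuming an x86-64 or AArch64 host:

```rust
// A minimal sketch: report whether the current CPU exposes the SIMD
// extensions that the fastest kernels discussed below can rely on.
fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        println!("avx512f:    {}", std::arch::is_x86_feature_detected!("avx512f"));
        println!("avx512bw:   {}", std::arch::is_x86_feature_detected!("avx512bw"));
        println!("avx512vbmi: {}", std::arch::is_x86_feature_detected!("avx512vbmi"));
    }
    #[cfg(target_arch = "aarch64")]
    {
        println!("sve:  {}", std::arch::is_aarch64_feature_detected!("sve"));
        println!("sve2: {}", std::arch::is_aarch64_feature_detected!("sve2"));
    }
}
```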
> [!IMPORTANT]
> The numbers in the tables below are provided for reference only and may vary depending on the CPU, compiler, dataset, and tokenization method.
> Most of them were obtained on Intel Sapphire Rapids (SPR) and Granite Rapids (GNR) CPUs and Nvidia Hopper-based H100 and Blackwell-based RTX 6000 Pro GPUs, using Rust with the -C target-cpu=native optimization flag.
> To replicate the results, please refer to the Replicating the Results section below.
Many hashing libraries exist, but they often lack reproducible outputs, streaming support, or cross-language availability. Throughput on short words and long lines:
Short Words Long Lines
Rust:
stringzilla::hash ████████████████████ 1.84 ████████████████████ 11.38 GB/s
aHash::hash_one ██████████████ 1.23 ████████████████ 8.61 GB/s
xxh3::xxh3_64 ████████████ 1.08 █████████████████ 9.48 GB/s
std::hash █████ 0.43 ███████ 3.74 GB/s
Python:
stringzilla.hash ████████████████████ 0.14 ████████████████████ 9.19 GB/s
hash ███████████████████ 0.13 ██████████ 4.27 GB/s
xxhash.xxh3_64 ██████ 0.04 ██████████████ 6.38 GB/s
See bench_hash.md for details
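For context, here is roughly what the non-StringZilla rows above exercise; a minimal sketch, assuming the `ahash` and `xxhash-rust` (with its `xxh3` feature) crates as dependencies:

```rust
// A minimal sketch of the contenders above: seeded one-shot hashing with
// aHash, and one-shot vs. streaming hashing with xxHash3.
use std::hash::BuildHasher;

fn main() {
    let text = b"The quick brown fox";

    // aHash: keyed, one-shot hashing via the standard BuildHasher trait.
    let ahash_state = ahash::RandomState::new();
    let h1: u64 = ahash_state.hash_one(&text[..]);

    // xxHash3: the streaming variant must match the one-shot digest.
    let h2 = xxhash_rust::xxh3::xxh3_64(text);
    let mut streaming = xxhash_rust::xxh3::Xxh3::new();
    streaming.update(&text[..9]);
    streaming.update(&text[9..]);
    assert_eq!(streaming.digest(), h2);

    println!("ahash: {h1:#x}, xxh3: {h2:#x}");
}
```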
Substring search is offloaded to C's memmem or strstr in most languages, but SIMD-optimized implementations can do better.
Throughput on long lines:
Left to right Reverse order
Rust:
memmem::Finder ████████████████████ 10.99
stringzilla ████████████████████ 10.82 ████████████████████ 10.66 GB/s
std::str ████████████████████ 10.88 ████████████ 5.94 GB/s
Python:
stringzilla ████████████████████ 11.79 ████████████████████ 11.56 GB/s
str ██ 1.23 ███████ 3.84 GB/s
See bench_find.md for details
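The `memmem::Finder` row above benefits from building the searcher once and reusing it across haystacks; a minimal sketch of that pattern, assuming the `memchr` crate:

```rust
// A minimal sketch: the Finder precomputes its searcher state, so repeated
// forward and reverse lookups amortize the setup cost.
use memchr::memmem;

fn main() {
    let haystack = b"hello world, hello stringwars";

    // Left-to-right search for the first occurrence.
    let finder = memmem::Finder::new("hello");
    assert_eq!(finder.find(haystack), Some(0));

    // Reverse-order search uses a separate right-to-left searcher.
    let rfinder = memmem::FinderRev::new("hello");
    assert_eq!(rfinder.rfind(haystack), Some(13));
}
```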
Searching for character sets (tabs, HTML markup, digits) commonly uses regex or Aho-Corasick automata. Throughput counting all matches on long lines:
Rust:
stringzilla ████████████████████ 8.17 GB/s
regex::find_iter █████████████ 5.22 GB/s
aho_corasick ██ 0.50 GB/s
Python:
stringzilla ████████████████████ 8.79 GB/s
re.finditer █ 0.19 GB/s
See bench_find.md for details
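To make the workload concrete: both engines count every occurrence of a character set, here digits, in a line. A minimal sketch, assuming the `regex` and `aho-corasick` crates:

```rust
// A minimal sketch: count all digit matches in a line with both engines and
// confirm they agree.
fn main() {
    let line = "route 66 crosses 8 states over 3940 km";

    let re = regex::Regex::new(r"[0-9]").unwrap();
    let regex_count = re.find_iter(line).count();

    let ac = aho_corasick::AhoCorasick::new(
        ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"],
    )
    .unwrap();
    let ac_count = ac.find_iter(line).count();

    assert_eq!(regex_count, ac_count); // 7 digits: 6, 6, 8, 3, 9, 4, 0
    println!("{regex_count} digits found");
}
```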
Different scripts stress UTF-8 differently: Korean has 3-byte Hangul syllables with single-byte whitespace (representative of tokenization workloads), Arabic uses 2-byte characters, and English is mostly 1-byte ASCII. Throughput on AMD Zen5 Turin:
English Arabic
Newline splitting:
stringzilla ████████████████ 15.45 ████████████████████ 18.34 GB/s
stdlib ██ 1.90 ██ 1.82 GB/s
English Korean
Whitespace splitting:
stringzilla ████████████████████ 0.82 ████████████████████ 1.88 GB/s
stdlib ███████████████████ 0.77 ███████████ 0.98 GB/s
icu::WhiteSpace ███ 0.11 ██ 0.15 GB/s
Case folding on bicameral scripts (Latin, Cyrillic, Greek, Armenian) plus Chinese for reference:
English 16x German 6x
Case folding:
stringzilla ████████████████████ 7.53 ████████████████████ 2.59 GB/s
stdlib ███ 0.48 ████ 0.43 GB/s
Russian 10x French 5x
stringzilla ████████████████████ 2.20 ████████████████████ 1.84 GB/s
stdlib ██ 0.22 ████ 0.35 GB/s
Greek 5x Armenian 4x
stringzilla ████████████████████ 1.00 ████████████████████ 908 MB/s
stdlib █████ 0.22 █████ 223 MB/s
Vietnamese 1.3x Chinese 4x
stringzilla ████████████████████ 352 ████████████████████ 1.21 GB/s
stdlib ██████████████ 265 ██████ 325 MB/s
See bench_unicode.md for details
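The `stdlib` rows above correspond to Rust's built-in Unicode handling; a minimal sketch of those baselines:

```rust
// A minimal sketch of the standard-library baselines: Unicode-aware
// whitespace splitting and full case mapping.
fn main() {
    let text = "Größe   matters\tin   Straßen";

    // `split_whitespace` understands Unicode whitespace, not just ASCII.
    let words: Vec<&str> = text.split_whitespace().collect();
    assert_eq!(words.len(), 4);

    // Full case mapping can change the byte length: "ß" uppercases to "SS".
    assert_eq!("Straße".to_uppercase(), "STRASSE");
}
```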
Dataframe libraries and search engines rely heavily on string sorting. SIMD-accelerated comparisons and specialized radix sorts can outperform generic algorithms. Throughput on short words:
Rust:
stringzilla ████████████████████ 213.73 M cmp/s
polars::sort ███████████████████ 200.34 M cmp/s
arrow::lexsort ████████████ 122.20 M cmp/s
std::sort █████ 54.35 M cmp/s
Python:
polars.sort ████████████████████ 223.38 M cmp/s
stringzilla.sorted ████████████████ 171.13 M cmp/s
pyarrow.sort ██████ 62.17 M cmp/s
list.sort █████ 47.06 M cmp/s
GPU: cudf on H100 reaches 9,463 M cmp/s on short words.
See bench_sequence.md for details
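The `std::sort` baseline boils down to comparison-based sorting with byte-wise lexicographic ordering on string slices; a minimal sketch:

```rust
// A minimal sketch of the comparison-based baseline: &str ordering is plain
// byte order, so all uppercase ASCII sorts before all lowercase ASCII.
fn main() {
    let mut words = vec!["banana", "apple", "Cherry", "cherry"];
    words.sort_unstable();
    // "Cherry" leads only because 'C' (0x43) is a smaller byte than 'a' (0x61).
    assert_eq!(words, ["Cherry", "apple", "banana", "cherry"]);
}
```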
Random byte generation and lookup tables are common in image processing and bioinformatics. Throughput on long lines:
Rust:
stringzilla ████████████████████ 10.57 GB/s
zeroize █████████ 4.73 GB/s
rand_xoshiro ████████ 3.85 GB/s
Python:
stringzilla ████████████████████ 20.37 GB/s
pycryptodome █████████████ 13.16 GB/s
numpy.Philox ██ 1.59 GB/s
See bench_memory.md for details
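A minimal sketch of the `rand_xoshiro` workload above: filling a buffer with pseudo-random bytes through the `rand_core` traits the crate re-exports:

```rust
// A minimal sketch: seed a xoshiro generator and fill a buffer with
// pseudo-random bytes, the core of the generation benchmark.
use rand_xoshiro::rand_core::{RngCore, SeedableRng};
use rand_xoshiro::Xoshiro256PlusPlus;

fn main() {
    let mut rng = Xoshiro256PlusPlus::seed_from_u64(42);
    let mut buffer = vec![0u8; 1 << 20]; // 1 MiB of random bytes
    rng.fill_bytes(&mut buffer);
    println!("first bytes: {:?}", &buffer[..8]);
}
```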
Edit distance is essential for search engines, data cleaning, NLP, and bioinformatics. It's computationally expensive with O(n*m) complexity, but GPUs and multi-core parallelism help. Levenshtein distance on ~1,000 byte lines (MCUPS = Million Cell Updates Per Second):
1 Core 1 Socket
Rust:
bio::levenshtein ██ 823
rapidfuzz ████████████████████ 14,316
stringzilla (384x GNR) ███████████████████ 13,084 ████████████████████ 3,084,270 MCUPS
stringzilla (B200) ███████ 998,620 MCUPS
stringzilla (H100) ██████ 925,890 MCUPS
See bench_similarities.md for details
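To unpack the MCUPS unit: two ~1,000-byte lines span a dynamic-programming matrix of roughly a million cells, so 14,316 MCUPS works out to about 14,300 such pairs per second. A minimal sketch of the underlying operation, assuming the `bio` crate:

```rust
// A minimal sketch: the classic O(n*m) Levenshtein distance, from which
// MCUPS = (n * m) / seconds / 1e6 is derived.
use bio::alignment::distance::levenshtein;

fn main() {
    let a = b"kitten";
    let b = b"sitting";
    // Three edits: k->s substitution, e->i substitution, appended g.
    assert_eq!(levenshtein(a, b), 3);
}
```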
Converting variable-length strings into fixed-length sketches (like Min-Hashing) enables fast approximate matching in large-scale retrieval. Throughput on ~1,000 byte lines:
1 Core 1 Socket
Rust:
pc::MinHash ████████████████████ 3.16
stringzilla (384x GNR) ████ 0.51 ████████████████ 302.30 MB/s
stringzilla (H100) ████████████████████ 392.37 MB/s
See bench_fingerprints.md for details
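The Min-Hashing idea itself is compact: k seeded hash functions each keep the minimum hash seen across all n-grams, yielding a fixed-length fingerprint whose fraction of matching slots estimates n-gram Jaccard similarity. A minimal, library-agnostic illustration (not `pc::MinHash`'s or StringZilla's actual API), using only the standard library:

```rust
// A minimal sketch of Min-Hashing: K seeded hashes, each reduced to the
// minimum over all byte n-grams of the input.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn minhash_fingerprint<const K: usize>(text: &[u8], ngram: usize) -> [u64; K] {
    let mut sketch = [u64::MAX; K];
    for window in text.windows(ngram) {
        for (seed, slot) in sketch.iter_mut().enumerate() {
            let mut hasher = DefaultHasher::new();
            (seed as u64).hash(&mut hasher); // distinct seed per slot
            window.hash(&mut hasher);
            *slot = (*slot).min(hasher.finish());
        }
    }
    sketch
}

fn main() {
    let a = minhash_fingerprint::<16>(b"the quick brown fox jumps", 4);
    let b = minhash_fingerprint::<16>(b"the quick brown dog jumps", 4);
    let matches = a.iter().zip(&b).filter(|(x, y)| x == y).count();
    println!("{matches}/16 slots match");
}
```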
ChaCha20 and AES256 encryption throughput comparison on long lines:
Rust:
ring::aes256 ████████████████████ 2.89 GB/s
ring::chacha20 █████████ 1.19 GB/s
libsodium::chacha20 █████ 0.71 GB/s
See bench_encryption.md for details
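A minimal sketch of `ring`'s AEAD interface for AES-256-GCM, the fastest entry above; the all-zero key and nonce are placeholders for illustration only:

```rust
// A minimal sketch: seal a buffer in place with AES-256-GCM via ring's AEAD
// API, appending the authentication tag to the ciphertext.
use ring::aead::{Aad, LessSafeKey, Nonce, UnboundKey, AES_256_GCM};

fn main() {
    let key_bytes = [0u8; 32]; // never use a fixed key in practice
    let key = LessSafeKey::new(UnboundKey::new(&AES_256_GCM, &key_bytes).unwrap());

    let mut in_out = b"plaintext to seal".to_vec();
    let nonce = Nonce::assume_unique_for_key([0u8; 12]); // must be unique per key
    key.seal_in_place_append_tag(nonce, Aad::empty(), &mut in_out)
        .unwrap();
    println!("ciphertext + tag: {} bytes", in_out.len());
}
```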
Before running the benchmarks, you can test your Rust environment by running:
cargo install cargo-criterion --locked

To pull and compile all the dependencies, you can call:
RUSTFLAGS="-C target-cpu=native" cargo build --benches --all-features # to compile everything
RUSTFLAGS="-C target-cpu=native" cargo check --benches --all-features --all-targets # to fail on warningsBy default StringWars links stringzilla in CPU mode.
If the machine has an NVIDIA GPU with CUDA installed, enable the CUDA kernels explicitly when running benches, for example:
RUSTFLAGS="-C target-cpu=native" \
STRINGWARS_DATASET=README.md \
STRINGWARS_TOKENS=lines \
STRINGWARS_FILTER=GPU \
cargo criterion --features "cuda bench_similarities" bench_similarities --jobs 1

Wars always take long, and so do these benchmarks.
Every one of them includes a few seconds of a warm-up phase to ensure that the CPU caches are filled and the results are not affected by cold start or SIMD-related frequency scaling.
Each of them accepts a few environment variables to control the dataset, the tokenization, and the error bounds.
You can list those by printing the file-level documentation using awk on Linux:
awk '/^\/\/!/ { print } !/^\/\/!/ { exit }' bench_find.rs

Commonly used environment variables are:
- `STRINGWARS_DATASET` - the path to the textual dataset file.
- `STRINGWARS_TOKENS` - the tokenization mode: `file`, `lines`, or `words`.
- `STRINGWARS_ERROR_BOUND` - the maximum allowed error in the Levenshtein distance.
Here is an example of a common benchmark run on a Unix-like system:
RUSTFLAGS="-C target-cpu=native" \
STRINGWARS_DATASET=README.md \
STRINGWARS_TOKENS=lines \
cargo criterion --features bench_hash bench_hash --jobs $(nproc)

On Windows, using PowerShell, you'd need to set the environment variables differently:
$env:STRINGWARS_DATASET="README.md"
cargo criterion --jobs $env:NUMBER_OF_PROCESSORS

It's recommended to use uv for Python dependency management and running the benchmarks.
To install all dependencies for all benchmarks:
uv venv --python 3.12
uv pip install -r requirements.txt -r requirements-cuda.txt
uv pip install --only-binary=:all: -r requirements.txt -r requirements-cuda.txt

To install dependencies for individual benchmarks:
PIP_EXTRA_INDEX_URL=https://pypi.nvidia.com \
uv pip install '.[find,hash,sequence,fingerprints,similarities]'

To run individual benchmarks, you can call:
uv run --no-project python bench_hash.py --help
uv run --no-project python bench_find.py --help
uv run --no-project python bench_memory.py --help
uv run --no-project python bench_sequence.py --help
uv run --no-project python bench_similarities.py --help
uv run --no-project python bench_fingerprints.py --help

For mixed UTF-8 data, I've used the XL Sum dataset for multilingual extractive summarization. It's 4.7 GB in size (1.7 GB compressed), 1'004'598 lines long, and contains 268'435'456 tokens of mean length 8. To download, unpack, and run the benchmarks, execute the following bash script in your terminal:
curl -fL -o xlsum.csv.gz https://github.com/ashvardanian/xl-sum/releases/download/v1.0.0/xlsum.csv.gz
gzip -d xlsum.csv.gz
STRINGWARS_DATASET=xlsum.csv cargo criterion --jobs $(nproc)

The Cohere Wikipedia dataset provides pre-processed JSONL files for different languages. This may be the optimal dataset for relative comparisons of UTF-8 decoding and matching engines in each individual environment. Not all Wikipedia languages are available, but the following have been selected specifically:
- Chinese (zh): 3-byte CJK characters, rare 1-byte punctuation
- Korean (ko): 3-byte Hangul syllables, frequent 1-byte punctuation
- Arabic (ar): 2-byte Arabic script, with regular 1-byte punctuation
- French (fr): Mixed 1-2 byte Latin with high diacritic density
- English (en): Mostly 1-byte ASCII baseline
To download and decompress one file from each language:
curl -fL -o wiki_en.jsonl.gz https://huggingface.co/datasets/Cohere/wikipedia-22-12/resolve/main/en/000.jsonl.gz && gunzip wiki_en.jsonl.gz
curl -fL -o wiki_zh.jsonl.gz https://huggingface.co/datasets/Cohere/wikipedia-22-12/resolve/main/zh/000.jsonl.gz && gunzip wiki_zh.jsonl.gz
curl -fL -o wiki_ko.jsonl.gz https://huggingface.co/datasets/Cohere/wikipedia-22-12/resolve/main/ko/000.jsonl.gz && gunzip wiki_ko.jsonl.gz
curl -fL -o wiki_ar.jsonl.gz https://huggingface.co/datasets/Cohere/wikipedia-22-12/resolve/main/ar/000.jsonl.gz && gunzip wiki_ar.jsonl.gz
curl -fL -o wiki_fr.jsonl.gz https://huggingface.co/datasets/Cohere/wikipedia-22-12/resolve/main/fr/000.jsonl.gz && gunzip wiki_fr.jsonl.gz
curl -fL -o wiki_de.jsonl.gz https://huggingface.co/datasets/Cohere/wikipedia-22-12/resolve/main/de/000.jsonl.gz && gunzip wiki_de.jsonl.gz
curl -fL -o wiki_es.jsonl.gz https://huggingface.co/datasets/Cohere/wikipedia-22-12/resolve/main/es/000.jsonl.gz && gunzip wiki_es.jsonl.gz
curl -fL -o wiki_it.jsonl.gz https://huggingface.co/datasets/Cohere/wikipedia-22-12/resolve/main/it/000.jsonl.gz && gunzip wiki_it.jsonl.gz

Each JSONL file contains one JSON object per line with the fields: id, title, text (paragraph content), url, wiki_id, and paragraph_id.
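If a benchmark expects a plain line-per-document text file, the paragraphs can be extracted from the JSONL first; a minimal sketch, assuming the `serde_json` crate (the output filename is arbitrary):

```rust
// A minimal sketch: pull the "text" field out of each JSONL record and
// write one paragraph per line, ready for STRINGWARS_DATASET.
use std::io::{BufRead, BufReader, Write};

fn main() -> std::io::Result<()> {
    let input = BufReader::new(std::fs::File::open("wiki_en.jsonl")?);
    let mut output = std::fs::File::create("wiki_en.txt")?;
    for line in input.lines() {
        let record: serde_json::Value = serde_json::from_str(&line?).unwrap();
        if let Some(text) = record["text"].as_str() {
            writeln!(output, "{}", text)?;
        }
    }
    Ok(())
}
```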
The CC-100 corpus provides large monolingual text files (1-80 GB) for 100+ languages, extracted from Common Crawl. Files are XZ-compressed plain text with documents separated by double-newlines.
| Workload | Relevant Scripts | Best Test Languages |
|---|---|---|
| Case Folding | Latin, Cyrillic, Greek, Armenian | Turkish (I/i), German (ß→SS), Greek, Russian |
| Normalization | Indic, Arabic, Vietnamese, Korean | Vietnamese, Hindi, Korean, Arabic |
| Whitespace Tokenization | Most scripts except CJK/Thai | English, Russian, Arabic vs. Chinese, Japanese, Thai |
| Grapheme Clusters | Indic, Thai, Khmer, Myanmar | Thai, Tamil, Myanmar, Khmer |
| RTL Handling | Arabic, Hebrew | Arabic, Hebrew, Persian |
Bicameral scripts with various case folding rules:
curl -fL https://data.statmt.org/cc-100/en.txt.xz | xz -d > cc100_en.txt # 82 GB - English
curl -fL https://data.statmt.org/cc-100/de.txt.xz | xz -d > cc100_de.txt # 18 GB - German
curl -fL https://data.statmt.org/cc-100/tr.txt.xz | xz -d > cc100_tr.txt # 5.4 GB - Turkish
curl -fL https://data.statmt.org/cc-100/ru.txt.xz | xz -d > cc100_ru.txt # 46 GB - Russian
curl -fL https://data.statmt.org/cc-100/uk.txt.xz | xz -d > cc100_uk.txt # 14 GB - Ukrainian
curl -fL https://data.statmt.org/cc-100/el.txt.xz | xz -d > cc100_el.txt # 7.4 GB - Greek
curl -fL https://data.statmt.org/cc-100/hy.txt.xz | xz -d > cc100_hy.txt # 776 MB - Armenian
curl -fL https://data.statmt.org/cc-100/ka.txt.xz | xz -d > cc100_ka.txt # 1.1 GB - Georgian
curl -fL https://data.statmt.org/cc-100/pl.txt.xz | xz -d > cc100_pl.txt # 12 GB - Polish
curl -fL https://data.statmt.org/cc-100/cs.txt.xz | xz -d > cc100_cs.txt # 4.4 GB - Czech
curl -fL https://data.statmt.org/cc-100/nl.txt.xz | xz -d > cc100_nl.txt # 7.9 GB - Dutch
curl -fL https://data.statmt.org/cc-100/fr.txt.xz | xz -d > cc100_fr.txt # 14 GB - French
curl -fL https://data.statmt.org/cc-100/es.txt.xz | xz -d > cc100_es.txt # 14 GB - Spanish
curl -fL https://data.statmt.org/cc-100/pt.txt.xz | xz -d > cc100_pt.txt # 13 GB - Portuguese
curl -fL https://data.statmt.org/cc-100/it.txt.xz | xz -d > cc100_it.txt # 7.8 GB - Italian

Unicameral scripts without case folding, but with other normalization/segmentation challenges:
curl -fL https://data.statmt.org/cc-100/ar.txt.xz | xz -d > cc100_ar.txt # 5.4 GB - Arabic (RTL)
curl -fL https://data.statmt.org/cc-100/he.txt.xz | xz -d > cc100_he.txt # 6.1 GB - Hebrew (RTL)
curl -fL https://data.statmt.org/cc-100/fa.txt.xz | xz -d > cc100_fa.txt # 20 GB - Persian (RTL)
curl -fL https://data.statmt.org/cc-100/hi.txt.xz | xz -d > cc100_hi.txt # 2.5 GB - Hindi (Devanagari)
curl -fL https://data.statmt.org/cc-100/bn.txt.xz | xz -d > cc100_bn.txt # 860 MB - Bengali
curl -fL https://data.statmt.org/cc-100/ta.txt.xz | xz -d > cc100_ta.txt # 1.3 GB - Tamil
curl -fL https://data.statmt.org/cc-100/te.txt.xz | xz -d > cc100_te.txt # 536 MB - Telugu
curl -fL https://data.statmt.org/cc-100/th.txt.xz | xz -d > cc100_th.txt # 8.7 GB - Thai (no spaces)
curl -fL https://data.statmt.org/cc-100/vi.txt.xz | xz -d > cc100_vi.txt # 28 GB - Vietnamese
curl -fL https://data.statmt.org/cc-100/zh-Hans.txt.xz | xz -d > cc100_zh.txt # 14 GB - Chinese
curl -fL https://data.statmt.org/cc-100/ja.txt.xz | xz -d > cc100_ja.txt # 15 GB - Japanese
curl -fL https://data.statmt.org/cc-100/ko.txt.xz | xz -d > cc100_ko.txt # 14 GB - Korean (Jamo)
curl -fL https://data.statmt.org/cc-100/my.txt.xz | xz -d > cc100_my.txt # 46 MB - Myanmar
curl -fL https://data.statmt.org/cc-100/km.txt.xz | xz -d > cc100_km.txt # 153 MB - Khmer
curl -fL https://data.statmt.org/cc-100/am.txt.xz | xz -d > cc100_am.txt # 133 MB - Amharic (Ethiopic)
curl -fL https://data.statmt.org/cc-100/si.txt.xz | xz -d > cc100_si.txt # 452 MB - Sinhala

The Leipzig Corpora Collection provides pre-segmented sentences in 200+ languages.
Each tar.gz contains *-sentences.txt (tab-separated id\tsentence), *-words.txt (frequencies), and co-occurrence files.
Standard sizes: 10K, 30K, 100K, 300K, 1M sentences. Check for newer years at the download page.
Bicameral scripts with various case folding rules:
curl -fL https://downloads.wortschatz-leipzig.de/corpora/eng_wikipedia_2016_1M.tar.gz | tar -xzf - -O 'eng_wikipedia_2016_1M/eng_wikipedia_2016_1M-sentences.txt' | cut -f2 > leipzig1M_en.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/deu_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'deu_wikipedia_2021_1M/deu_wikipedia_2021_1M-sentences.txt' | cut -f2 > leipzig1M_de.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/tur_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'tur_wikipedia_2021_1M/tur_wikipedia_2021_1M-sentences.txt' | cut -f2 > leipzig1M_tr.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/rus_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'rus_wikipedia_2021_1M/rus_wikipedia_2021_1M-sentences.txt' | cut -f2 > leipzig1M_ru.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/ukr_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'ukr_wikipedia_2021_1M/ukr_wikipedia_2021_1M-sentences.txt' | cut -f2 > leipzig1M_uk.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/ell_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'ell_wikipedia_2021_1M/ell_wikipedia_2021_1M-sentences.txt' | cut -f2 > leipzig1M_el.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/hye_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'hye_wikipedia_2021_1M/hye_wikipedia_2021_1M-sentences.txt' | cut -f2 > leipzig1M_hy.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/kat_wikipedia_2021_300K.tar.gz | tar -xzf - -O 'kat_wikipedia_2021_300K/kat_wikipedia_2021_300K-sentences.txt' | cut -f2 > leipzig300K_ka.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/pol_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'pol_wikipedia_2021_1M/pol_wikipedia_2021_1M-sentences.txt' | cut -f2 > leipzig1M_pl.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/ces_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'ces_wikipedia_2021_1M/ces_wikipedia_2021_1M-sentences.txt' | cut -f2 > leipzig1M_cs.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/nld_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'nld_wikipedia_2021_1M/nld_wikipedia_2021_1M-sentences.txt' | cut -f2 > leipzig1M_nl.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/fra_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'fra_wikipedia_2021_1M/fra_wikipedia_2021_1M-sentences.txt' | cut -f2 > leipzig1M_fr.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/spa_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'spa_wikipedia_2021_1M/spa_wikipedia_2021_1M-sentences.txt' | cut -f2 > leipzig1M_es.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/por_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'por_wikipedia_2021_1M/por_wikipedia_2021_1M-sentences.txt' | cut -f2 > leipzig1M_pt.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/ita_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'ita_wikipedia_2021_1M/ita_wikipedia_2021_1M-sentences.txt' | cut -f2 > leipzig1M_it.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/lit_wikipedia_2021_300K.tar.gz | tar -xzf - -O 'lit_wikipedia_2021_300K/lit_wikipedia_2021_300K-sentences.txt' | cut -f2 > leipzig300K_lt.txt

Unicameral scripts without case folding, but with other normalization/segmentation challenges:
curl -fL https://downloads.wortschatz-leipzig.de/corpora/ara_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'ara_wikipedia_2021_1M/ara_wikipedia_2021_1M-sentences.txt' | cut -f2 > leipzig1M_ar.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/heb_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'heb_wikipedia_2021_1M/heb_wikipedia_2021_1M-sentences.txt' | cut -f2 > leipzig1M_he.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/fas_wikipedia_2014_1M.tar.gz | tar -xzf - -O 'fas_wikipedia_2014_1M/fas_wikipedia_2014_1M-sentences.txt' | cut -f2 > leipzig1M_fa.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/hin_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'hin_wikipedia_2021_1M/hin_wikipedia_2021_1M-sentences.txt' | cut -f2 > leipzig1M_hi.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/ben_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'ben_wikipedia_2021_1M/ben_wikipedia_2021_1M-sentences.txt' | cut -f2 > leipzig1M_bn.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/tam_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'tam_wikipedia_2021_1M/tam_wikipedia_2021_1M-sentences.txt' | cut -f2 > leipzig1M_ta.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/tel_wikipedia_2021_300K.tar.gz | tar -xzf - -O 'tel_wikipedia_2021_300K/tel_wikipedia_2021_300K-sentences.txt' | cut -f2 > leipzig300K_te.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/tha_wikipedia_2021_10K.tar.gz | tar -xzf - -O 'tha_wikipedia_2021_10K/tha_wikipedia_2021_10K-sentences.txt' | cut -f2 > leipzig10K_th.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/vie_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'vie_wikipedia_2021_1M/vie_wikipedia_2021_1M-sentences.txt' | cut -f2 > leipzig1M_vi.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/zho_wikipedia_2018_1M.tar.gz | tar -xzf - -O 'zho_wikipedia_2018_1M/zho_wikipedia_2018_1M-sentences.txt' | cut -f2 > leipzig1M_zh.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/jpn_wikipedia_2018_1M.tar.gz | tar -xzf - -O 'jpn_wikipedia_2018_1M/jpn_wikipedia_2018_1M-sentences.txt' | cut -f2 > leipzig1M_ja.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/kor_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'kor_wikipedia_2021_1M/kor_wikipedia_2021_1M-sentences.txt' | cut -f2 > leipzig1M_ko.txt
curl -fL https://downloads.wortschatz-leipzig.de/corpora/amh_wikipedia_2021_30K.tar.gz | tar -xzf - -O 'amh_wikipedia_2021_30K/amh_wikipedia_2021_30K-sentences.txt' | cut -f2 > leipzig30K_am.txt

To produce a mixed dataset with rows in all languages:
cat leipzig*.txt | shuf | head -c 1G > leipzig1GB.txt

For bioinformatics workloads, I use the following datasets with increasing string lengths:
curl -fL -o acgt_100.txt 'https://huggingface.co/datasets/ashvardanian/StringWars/resolve/main/acgt_100.txt?download=true'
curl -fL -o acgt_1k.txt 'https://huggingface.co/datasets/ashvardanian/StringWars/resolve/main/acgt_1k.txt?download=true'
curl -fL -o acgt_10k.txt 'https://huggingface.co/datasets/ashvardanian/StringWars/resolve/main/acgt_10k.txt?download=true'
curl -fL -o acgt_100k.txt 'https://huggingface.co/datasets/ashvardanian/StringWars/resolve/main/acgt_100k.txt?download=true'
curl -fL -o acgt_1m.txt 'https://huggingface.co/datasets/ashvardanian/StringWars/resolve/main/acgt_1m.txt?download=true'
curl -fL -o acgt_10m.txt 'https://huggingface.co/datasets/ashvardanian/StringWars/resolve/main/acgt_10m.txt?download=true'

In case you are profiling some of the internal kernels of the mentioned libraries, here are a few example commands to get you started.
For example, you can use ncu on NVIDIA GPUs to evaluate the register usage and occupancy of the CUDA kernels used in StringZilla's Levenshtein distance calculation:
/usr/local/cuda/bin/ncu \
--metrics launch__registers_per_thread,launch__occupancy_per_block_size,sm__warps_active.avg.pct_of_peak_sustained_active,sm__throughput.avg.pct_of_peak_sustained_elapsed,dram__throughput.avg.pct_of_peak_sustained_elapsed,dram__bytes.sum \
--target-processes all \
--kernel-name "levenshtein_on_each_cuda_thread" \
--launch-skip 5 \
--launch-count 1 \
bash -c 'STRINGWARS_DATASET=acgt_100.txt STRINGWARS_BATCH=65536 STRINGWARS_TOKENS=lines STRINGWARS_FILTER="uniform/stringzillas::LevenshteinDistances\(1xGPU\)" cargo criterion --features "cuda bench_similarities" bench_similarities --jobs 1'

You can also use perf on Linux to analyze the CPU-side performance of SIMD-accelerated substring search:
perf record -e cpu-clock -g graph,0x400000 -o perf.data -- cargo criterion --features "bench_similarities" bench_similarities --jobs 1
perf report -i perf.data