liblevenshtein-rust

A Rust implementation of liblevenshtein for fast approximate string matching using Levenshtein Automata—deterministic finite automata that enable O(|W|) construction complexity for error-tolerant dictionary queries.

Based on "Fast String Correction with Levenshtein-Automata" (Schulz & Mihov, 2002).

What's New

v0.5.0 (2025-11-04)

  • DynamicDawgChar - Unicode support for Dynamic DAWG with character-level operations
  • Contextual Code Completion - Hierarchical scope-aware completion engine with zipper-based navigation
  • Performance Optimizations - DynamicDawg with Bloom filters, auto-minimization, and sorted batch insertion
  • UTF-8 Optimization Analysis - Comprehensive documentation of Unicode performance characteristics
  • Bug Fixes - Fixed suffix sharing bug, test compilation, GitHub Actions workflow, and all clippy warnings

v0.4.0 (2025-10-30)

  • Unicode Support - Character-level dictionary variants for correct Unicode Levenshtein distances
    • DoubleArrayTrieChar and PathMapDictionaryChar for character-level operations
    • Proper handling of multi-byte UTF-8 sequences (accented chars, CJK, emoji)
    • ~5% performance overhead, 4x memory for edge labels
    • Full test coverage with 19 Unicode-specific tests
  • Phase 4: SIMD Optimization - Comprehensive SIMD acceleration with 20-64% performance gains
    • 8 SIMD-optimized components across critical performance paths
    • Data-driven threshold tuning based on empirical benchmarking
    • AVX2/SSE4.1 implementations with runtime CPU feature detection
    • Extensive documentation (950+ lines across 3 analysis documents)

v0.3.0 (2025-10-26)

  • Package Support: Debian (.deb), RPM (.rpm), and Arch Linux (.pkg.tar.zst) packages
  • CI Improvements: Explicit CPU features for better compatibility across platforms
  • Code Quality: Fixed all clippy warnings without suppressing checks
  • Bug Fixes: Fixed CLI format auto-detection for text dictionaries

See CHANGELOG.md for complete details.

Overview

This library provides efficient fuzzy string matching against large dictionaries using finite state automata. It supports multiple Levenshtein distance algorithms and pluggable dictionary backends.

What Makes This Library Fast

Unlike naive approaches that compute Levenshtein distance for every dictionary word (O(|D| × |W| × |V|)), this library uses Levenshtein Automata—a deterministic finite automaton that accepts exactly all words within distance n from a query word.

Key Algorithmic Advantages

  1. O(|W|) Automaton Construction - Linear in query word length for fixed error bound n
  2. O(|D|) Dictionary Traversal - Single pass through dictionary (measured as total edges)
  3. No Distance Recomputation - Automaton guides search, avoiding per-word calculations
  4. Proven Correctness - Accepts exactly L_Lev(n,W) = {V | d_L(W,V) ≤ n}

Result: In practice the traversal visits only the small fraction of the dictionary automaton compatible with the query, so query time is nearly independent of total dictionary size for well-distributed tries.

Theory: Based on "Fast String Correction with Levenshtein-Automata" (Schulz & Mihov, 2002). See complete algorithm documentation for algorithmic details including Position structures, Subsumption relations, and Characteristic vectors.

Theoretical Background

How Levenshtein Automata Work

Traditional approaches compute edit distance for every dictionary word, requiring O(|D| × |W| × |V|) operations. This library uses Levenshtein Automata—deterministic finite automata that accept exactly all words within distance n from query word W:

LEV_n(W) accepts L_Lev(n,W) = {V | d_L(W,V) ≤ n}

Key Concepts

Position (i#e): Represents "matched i characters of W with e errors"

  • Example: 3#1 means "matched 3 characters, used 1 error"
  • Range: 0 ≤ i ≤ |W|, 0 ≤ e ≤ n

Subsumption (⊑): Eliminates redundant positions for minimal state sets

  • Position i#e subsumes j#f if: (e < f) ∧ (|j-i| ≤ f-e)
  • Any word accepted from j#f is also accepted from i#e

Characteristic Vectors (χ): Encode where characters appear in query word

  • χ(x, W) = ⟨b₁, ..., b_{|W|}⟩ where bᵢ = 1 iff W[i] = x
  • Used to determine valid transitions without recomputing distances

Elementary Transitions: Move between positions based on dictionary character

  • Defined in Tables 4.1, 7.1, 8.1 of the paper for each algorithm variant
  • Enable O(1) state updates during traversal

Result: Automaton construction is O(|W|) and dictionary traversal is at most O(|D|); in practice only the portion of the dictionary automaton compatible with the query is visited, so query cost is largely independent of total dictionary size.
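These definitions translate almost directly into code. A minimal sketch of the position and subsumption machinery (illustrative types, not the library's internal representation):

// Illustrative only (not the library's internal definitions).

/// A position i#e: "matched `i` characters of W using `e` errors".
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Position {
    i: usize, // characters of W consumed
    e: usize, // errors used so far
}

impl Position {
    /// Subsumption: i#e subsumes j#f iff e < f and |j - i| <= f - e.
    /// Subsumed positions are redundant and can be dropped from a state.
    fn subsumes(&self, other: &Position) -> bool {
        self.e < other.e && self.i.abs_diff(other.i) <= other.e - self.e
    }
}

/// Characteristic vector χ(x, W): bit k is set iff W[k] = x.
fn characteristic_vector(x: u8, w: &[u8]) -> Vec<bool> {
    w.iter().map(|&c| c == x).collect()
}

let p = Position { i: 3, e: 0 };
let q = Position { i: 4, e: 1 };
assert!(p.subsumes(&q)); // 0 < 1 and |4 - 3| ≤ 1 - 0
assert_eq!(characteristic_vector(b't', b"test"), vec![true, false, false, true]);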

Foundational Papers

Core Algorithm:

  • Schulz, Klaus U., and Stoyan Mihov. "Fast string correction with Levenshtein automata." International Journal on Document Analysis and Recognition 5.1 (2002): 67-85.

Extensions:

  • Mitankin, Petar, Stoyan Mihov, and Klaus Schulz. "Universal Levenshtein Automata. Building and Properties." Information Processing & Management 41.4 (2005): 687-702.

Features

  • Fast approximate string matching using Levenshtein Automata with O(|W|) construction complexity
  • Multiple algorithms (all maintaining O(|W|) construction complexity):
    • Standard: Insert, delete, substitute operations (Chapters 4-6 of foundational paper)
      • Example: "hello" → "helo" (delete), "helo" → "hello" (insert), "hello" → "hallo" (substitute)
      • Use for: General fuzzy matching, spell checking
    • Transposition: Adds adjacent character swaps (Chapter 7) - Damerau-Levenshtein distance
      • Example: "teh" → "the" costs 1 (not 2), "recieve" → "receive" costs 1
      • Use for: Typing errors, keyboard input corrections (see the comparison sketch after this list)
    • MergeAndSplit: Adds two-character ↔ one-character operations (Chapter 8)
      • Example: "rn" ↔ "m", "vv" ↔ "w" (common in OCR)
      • Use for: Scanned documents, character recognition systems
  • Multiple dictionary backends:
    • DoubleArrayTrie (recommended, default) - O(1) transitions with excellent cache locality and minimal memory footprint
    • DoubleArrayTrieChar - Character-level variant for correct Unicode Levenshtein distances
    • PathMap (optional pathmap-backend feature) - high-performance trie with structural sharing, supports dynamic updates
    • PathMapChar (optional pathmap-backend feature) - Character-level PathMap for Unicode fuzzy matching
    • DAWG - Directed Acyclic Word Graph for space-efficient storage
    • OptimizedDawg - Arena-based DAWG with 20-25% faster construction
    • DynamicDawg - DAWG with online insert/delete/minimize operations
    • DynamicDawgChar - Character-level DAWG with Unicode support and online updates
    • SuffixAutomaton - For substring/infix matching
    • SuffixAutomatonChar - Character-level suffix automaton for Unicode substring matching
    • Extensible trait-based design for custom backends
  • Ordered results: OrderedQueryIterator returns results sorted by distance first, then lexicographically
  • Filtering and prefix matching: Filter results with custom predicates, enable prefix mode for code completion
  • Serialization support (optional serialization feature):
    • Multiple formats: Bincode (binary), JSON, Plain Text (newline-delimited UTF-8), Protobuf
    • Optimized Protobuf formats: V2 (62% smaller), DAT-specific, SuffixAutomaton-specific
    • Gzip compression (optional compression feature) - 85% file size reduction
    • Save and load dictionaries to/from disk
  • Full-featured CLI tool (optional cli feature):
    • Interactive REPL for exploration
    • Query, insert, delete, convert operations
    • Support for all serialization formats including compressed variants
  • Runtime dictionary updates:
    • Thread-safe insert, remove, and clear operations (PathMap, DynamicDawg)
    • Queries automatically see updates via RwLock-based interior mutability
    • Concurrent queries during modifications
  • Lazy evaluation - results generated on-demand
  • Efficient memory usage - shared dictionary state across transducers
  • SIMD acceleration (optional simd feature, x86_64 only):
    • AVX2/SSE4.1 optimized operations with runtime CPU feature detection
    • 20-64% faster query performance across all workloads
    • Optimized characteristic vectors, position subsumption, state operations
    • Dictionary edge lookup with data-driven threshold tuning
    • Distance matrix computation with vectorized operations
    • Automatic fallback to scalar implementation when SIMD unavailable
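The difference between the algorithm variants listed above is easiest to see side by side. A small sketch using the query API shown under Usage below; the expected distances follow from the definitions (Standard counts a swap as two substitutions, Transposition counts it as one):

use liblevenshtein::prelude::*;

let standard = Transducer::new(DoubleArrayTrie::from_terms(vec!["the"]), Algorithm::Standard);
let damerau = Transducer::new(DoubleArrayTrie::from_terms(vec!["the"]), Algorithm::Transposition);

for c in standard.query_with_distance("teh", 2) {
    println!("Standard: {} (distance: {})", c.term, c.distance); // distance: 2
}
for c in damerau.query_with_distance("teh", 2) {
    println!("Transposition: {} (distance: {})", c.term, c.distance); // distance: 1
}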

Installation

From crates.io (Recommended)

Add to your Cargo.toml:

[dependencies]
liblevenshtein = "0.5"

# Or with SIMD acceleration (x86_64 only, requires SSE4.1/AVX2):
liblevenshtein = { version = "0.5", features = ["simd"] }

Or install the CLI tool:

cargo install liblevenshtein --features cli,compression,protobuf

Pre-built Packages

Download pre-built packages from the GitHub Releases page:

  • Debian/Ubuntu: .deb packages
  • Fedora/RHEL/CentOS: .rpm packages
  • Arch Linux: .pkg.tar.zst packages
  • macOS: .dmg disk images for easy installation (x86_64 and ARM64)
  • Binaries: .tar.gz and .zip archives for Linux and macOS (x86_64 and ARM64)

Usage

Basic Querying

use liblevenshtein::prelude::*;

// Create a dictionary from terms (using DoubleArrayTrie for best performance)
let terms = vec!["test", "testing", "tested", "tester"];
let dict = DoubleArrayTrie::from_terms(terms);

// Create a transducer with Standard algorithm
let transducer = Transducer::new(dict, Algorithm::Standard);

// Query for terms within edit distance 2
for term in transducer.query("tset", 2) {
    println!("Match: {}", term);
}

// Query with distances
for candidate in transducer.query_with_distance("tset", 2) {
    println!("Match: {} (distance: {})", candidate.term, candidate.distance);
}

Output:

Match: test
Match: test (distance: 2)

Unicode Support

For correct character-level Levenshtein distances with Unicode text, use the character-level dictionary variants:

use liblevenshtein::dictionary::double_array_trie_char::DoubleArrayTrieChar;
use liblevenshtein::prelude::*;

// Create a character-level dictionary for Unicode content
let terms = vec!["café", "naïve", "中文", "🎉"];
let dict = DoubleArrayTrieChar::from_terms(terms);
let transducer = Transducer::new(dict, Algorithm::Standard);

// Character-level distance: "" → "é" is 1 (one character);
// byte-level would incorrectly report 2, since 'é' encodes as two UTF-8 bytes
let results: Vec<_> = transducer.query("", 1).collect();
// Results: ["🎉"], the only single-character term in this dictionary

// Works with accented characters
for candidate in transducer.query_with_distance("cafe", 1) {
    println!("{}: {}", candidate.term, candidate.distance);
    // Output: café: 1 (one character substitution: e→é)
}

Character-level backends:

  • DoubleArrayTrieChar - Character-level Double-Array Trie
  • PathMapDictionaryChar (requires pathmap-backend feature) - Character-level PathMap with dynamic updates

When to use:

  • ✅ Dictionaries containing non-ASCII Unicode (accented characters, CJK, emoji)
  • ✅ When edit distance must count characters, not bytes
  • ✅ Multi-language applications requiring correct Unicode semantics

Performance: ~5% overhead for UTF-8 decoding, 4x memory for edge labels (char vs u8).
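The byte/character distinction is just UTF-8 encoding: one accented character can occupy several bytes, so a byte-level dictionary sees several edits where a character-level one sees one. Standard-library Rust shows the gap:

// "café" is 4 characters but 5 bytes: 'é' encodes as two UTF-8 bytes.
let s = "café";
assert_eq!(s.chars().count(), 4); // character-level length
assert_eq!(s.len(), 5);           // byte-level length
// Hence "cafe" → "café" is 1 character edit but 2 byte edits
// (the single byte 'e' must become the two bytes 0xC3 0xA9).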

Runtime Dictionary Updates

The DynamicDawg backend supports thread-safe runtime updates:

use liblevenshtein::prelude::*;

// Create dictionary with runtime update support
let dict = DynamicDawg::from_terms(vec!["cat", "dog"]);
let transducer = Transducer::new(dict.clone(), Algorithm::Standard);

// Insert new terms at runtime
dict.insert("cot");
dict.insert("bat");

// Queries immediately see the new terms
let results: Vec<_> = transducer.query("cot", 1).collect();
// Results: ["cat", "cot"]

// Remove terms
dict.remove("dog");

Thread Safety: The dictionary uses RwLock for interior mutability:

  • Multiple concurrent queries are allowed (read locks)
  • Modifications acquire exclusive write locks
  • Active Transducer instances automatically see updates
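For example, one thread can insert while another queries through a shared transducer. A sketch assuming the cloned dictionary handle can be moved across threads, per the thread-safety notes above:

use liblevenshtein::prelude::*;
use std::thread;

let dict = DynamicDawg::from_terms(vec!["cat", "dog"]);
let transducer = Transducer::new(dict.clone(), Algorithm::Standard);

// Writer thread: each insert briefly takes the write lock.
let writer = {
    let dict = dict.clone();
    thread::spawn(move || {
        dict.insert("cot");
    })
};

// Queries take read locks and see the insert once it lands.
writer.join().unwrap();
let results: Vec<_> = transducer.query("cot", 1).collect();
assert!(results.contains(&"cot".to_string()));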

Performance Optimizations: Configure Bloom filter and auto-minimization for better performance:

use liblevenshtein::prelude::*;

// Enable Bloom filter for 10x faster contains() operations
let dict = DynamicDawg::with_config(
    f32::INFINITY,  // Auto-minimize threshold (disabled)
    Some(10000),    // Bloom filter capacity (enabled)
);

// Or enable auto-minimization for bulk insertions (30% faster for large datasets)
let dict = DynamicDawg::with_config(
    1.5,            // Auto-minimize at 50% growth
    None,           // Bloom filter (disabled)
);

// Or enable both optimizations
let dict = DynamicDawg::with_config(
    1.5,            // Auto-minimize threshold
    Some(10000),    // Bloom filter capacity
);

Optimization Guide:

  • Bloom Filter: Use when frequently checking if terms exist (88-93% faster contains())
  • Auto-Minimization: Use for bulk insertions of 1000+ terms (30% faster, prevents memory bloat)
  • Default: Both disabled for maximum flexibility and minimal overhead on small datasets

Note: PathMapDictionary also supports runtime updates but requires the optional pathmap-backend feature.

See examples/dynamic_dictionary.rs for a complete demonstration.

Value-Mapped Dictionaries

Store metadata with dictionary terms using value-mapped dictionaries:

use liblevenshtein::prelude::*;
use std::collections::HashSet;

// DynamicDawg with integer values (e.g., word frequencies)
let dict: DynamicDawg<u32> = DynamicDawg::new();
dict.insert_with_value("apple", 42);
dict.insert_with_value("apply", 17);

// Retrieve values
assert_eq!(dict.get_value("apple"), Some(42));
assert_eq!(dict.get_value("banana"), None);

// Use with FuzzyMap for fuzzy lookups with values
let map = FuzzyMap::new(dict, Algorithm::Standard);
let results = map.get_with_distance("aple", 1);  // fuzzy lookup
// Results include both terms and their values

// Or use FuzzyMultiMap to aggregate multiple values
let dict: DynamicDawg<HashSet<u32>> = DynamicDawg::new();
dict.insert_with_value("test", HashSet::from([1, 2, 3]));

Supported backends:

  • DynamicDawg<V> - Dynamic dictionary with values of type V
  • DynamicDawgChar<V> - Character-level dynamic dictionary with Unicode support
  • PathMapDictionary<V> - PathMap with values (requires pathmap-backend)
  • PathMapDictionaryChar<V> - Character-level PathMap with values

Common value types:

  • u32 / u64 - Frequencies, scores, IDs
  • HashSet<T> - Multiple associations per term
  • Vec<T> - Ordered collections
  • Any type implementing Clone + Send + Sync + 'static

Integration with contextual completion:

use liblevenshtein::prelude::*;

// Use DynamicDawg backend for contextual completion
let dict: DynamicDawg<Vec<u32>> = DynamicDawg::new();
let engine = ContextualCompletionEngine::with_dictionary(
    dict,
    Algorithm::Standard
);

// Insert terms with context IDs
let ctx = engine.create_root_context();
engine.insert_finalized(ctx, "variable", vec![ctx]);

Ordered Results

Get results sorted by edit distance first, then alphabetically:

use liblevenshtein::prelude::*;

let dict = DoubleArrayTrie::from_terms(vec!["apple", "apply", "ape", "app"]);
let transducer = Transducer::new(dict, Algorithm::Standard);

// Results ordered by distance, then alphabetically
for candidate in transducer.query_ordered("aple", 2) {
    println!("{}: {}", candidate.term, candidate.distance);
}

Output:

ape: 1
apple: 1
app: 2
apply: 2

Filtering and Prefix Matching

Filter results and enable prefix matching for code completion:

use liblevenshtein::prelude::*;

let dict = DoubleArrayTrie::from_terms(vec![
    "getValue", "getVariable", "setValue", "setVariable"
]);
let transducer = Transducer::new(dict, Algorithm::Standard);

// Prefix matching with filtering
for candidate in transducer
    .query_ordered("getVal", 1)
    .prefix()  // Match terms starting with query ± edits
    .filter(|c| c.term.starts_with("get"))  // Only getter methods
{
    println!("{}: {}", candidate.term, candidate.distance);
}

Output:

getValue: 0
getVariable: 1

See examples/code_completion_demo.rs and examples/contextual_filtering_optimization.rs for more examples.

Dictionary Backend Comparison

The library provides multiple dictionary backends optimized for different use cases:

Backend             | Best For                                 | Performance Highlights
DoubleArrayTrie     | General use (recommended)                | 3x faster queries, 30x faster contains, 8 bytes/state
DoubleArrayTrieChar | Unicode text, character-level distances  | Correct Unicode semantics, ~5% overhead
PathMap             | Dynamic updates, runtime modifications   | Thread-safe insert/delete, fastest dynamic backend
PathMapChar         | Unicode + dynamic updates                | Character-level distances with runtime modifications
DAWG                | Static dictionaries, moderate size       | Good balance of speed and memory
OptimizedDawg       | Static dictionaries, construction speed  | 20-25% faster construction than DAWG
DynamicDawg         | Occasional modifications                 | Best fuzzy matching for dynamic use
DynamicDawgChar     | Unicode + occasional modifications       | Character-level with insert/remove, ~5% overhead
SuffixAutomaton     | Substring/infix matching                 | Find patterns anywhere in text

Performance Comparison (10,000 words):

Construction:     DAT: 3.2ms   PathMap: 3.5ms   DAWG: 7.2ms
Exact Match:      DAT: 6.6µs   DAWG: 19.8µs     PathMap: 71.1µs
Contains (100):   DAT: 0.22µs  DAWG: 6.7µs      PathMap: 132µs
Distance 1:       DAT: 12.9µs  DAWG: 319µs      PathMap: 888µs
Distance 2:       DAT: 16.3µs  DAWG: 2,150µs    PathMap: 5,919µs
Memory/state:     DAT: ~8B     OptDawg: ~13B    DAWG: ~32B

Why DoubleArrayTrie is Fastest:

  • O(1) transitions - Direct array indexing vs pointer chasing
  • Excellent cache locality - Sequential memory layout minimizes cache misses
  • Minimal memory footprint - ~8 bytes per state (vs 32 bytes for DAWG)
  • Perfect for parallel traversal - Optimal implementation of dictionary automaton A^D from paper

All backends implement the dictionary automaton A^D that is traversed in parallel with the Levenshtein automaton LEV_n(W) during queries. The choice of backend affects the constant factors in the O(|D|) query complexity, with DoubleArrayTrie providing the smallest constants due to hardware-friendly memory access patterns.

Prefix Matching Support: All backends except SuffixAutomaton support efficient prefix matching through the .prefix() method, making them ideal for code completion and autocomplete applications.

When to use each backend:

  • Static dictionaries (ASCII/Latin-1) → DoubleArrayTrie (best overall performance, default choice)
  • Static dictionaries (Unicode) → DoubleArrayTrieChar (correct character-level distances)
  • Dynamic updates (ASCII/Latin-1) → DynamicDawg (thread-safe runtime modifications)
  • Dynamic updates (Unicode) → DynamicDawgChar (character-level with insert/remove, best fuzzy matching)
  • Dynamic updates (Unicode, frequent) → PathMapChar (alternative, requires pathmap-backend)
  • Substring search → SuffixAutomaton (finds patterns anywhere in text)
  • Memory-constrained → DoubleArrayTrie (8 bytes/state, most efficient)
  • Multi-language apps → Character-level variants (*Char) for correct Unicode semantics

Serialization and Compression

Save and load dictionaries with optional compression:

use liblevenshtein::prelude::*;
use std::fs::File;

let dict = DoubleArrayTrie::from_terms(vec!["test", "testing", "tested"]);

// Save with compression (85% file size reduction)
let file = File::create("dict.bin.gz")?;
GzipSerializer::<BincodeSerializer>::serialize(&dict, file)?;

// Load compressed dictionary
let file = File::open("dict.bin.gz")?;
let dict: DoubleArrayTrie = GzipSerializer::<BincodeSerializer>::deserialize(file)?;

// Or use optimized Protobuf formats
let file = File::create("dict.pb")?;
OptimizedProtobufSerializer::serialize(&dict, file)?;  // 62% smaller than standard

Requires serialization and compression features:

[dependencies]
liblevenshtein = { version = "0.5", features = ["serialization", "compression"] }

CLI Tool

The library includes a full-featured command-line tool with interactive REPL:

# Install with CLI support
cargo install liblevenshtein --features cli,compression,protobuf

# Or download pre-built binaries from GitHub Releases

# Query a dictionary
liblevenshtein query "test" --dict /usr/share/dict/words -m 2 -s

# Convert between formats with compression
liblevenshtein convert words.txt words.bin.gz \
  --to-format bincode-gz --to-backend path-map

# Launch interactive REPL
liblevenshtein repl --dict words.bin.gz

The CLI auto-detects formats from file extensions and supports:

  • Formats: text, bincode, json, protobuf (plus -gz compressed variants)
  • Backends: path-map, dawg, dynamic-dawg
  • Operations: query, insert, delete, convert, save, info

See docs/developer-guide/building.md for comprehensive CLI documentation.

FuzzyMultiMap - Aggregating Values

The FuzzyMultiMap enables fuzzy lookup of associated values, aggregating results from all matching terms:

use liblevenshtein::prelude::*;
use liblevenshtein::cache::multimap::FuzzyMultiMap;
use std::collections::HashSet;

// Create a dictionary with values (e.g., user IDs for each name)
let dict = PathMapDictionary::from_terms_with_values([
    ("alice", HashSet::from([101, 102])),
    ("alicia", HashSet::from([103])),
    ("bob", HashSet::from([201])),
    ("robert", HashSet::from([202, 203])),
]);

// Create fuzzy multimap
let fuzzy_map = FuzzyMultiMap::new(dict, Algorithm::Standard);

// Query "alise" - matches both "alice" and "alicia" at distance 1
let user_ids = fuzzy_map.query("alise", 1).unwrap();
// Returns: HashSet {101, 102, 103} - union of all matching values

// Practical use case: find all IDs for fuzzy name search
let ids = fuzzy_map.query("rob", 3).unwrap();
// Returns IDs for both "bob" (distance 1) and "robert" (distance 3)

Supported collection types:

  • HashSet<T> - Union of all matching sets
  • BTreeSet<T> - Union of all matching sets
  • Vec<T> - Concatenation of all matching vectors

Use cases:

  • User ID lookup with fuzzy name matching
  • Tag aggregation across similar terms
  • Multi-valued dictionary queries with error tolerance

Contextual Code Completion (Zipper-Based)

The ContextualCompletionEngine provides hierarchical scope-aware code completion with draft state management:

use liblevenshtein::contextual::ContextualCompletionEngine;
use liblevenshtein::transducer::Algorithm;

// Create engine (thread-safe, shareable with Arc)
let engine = ContextualCompletionEngine::with_algorithm(Algorithm::Standard);

// Create hierarchical scopes (global → function → block)
let global = engine.create_root_context(0);
let function = engine.create_child_context(1, global).unwrap();
let block = engine.create_child_context(2, function).unwrap();

// Add finalized terms to each scope
engine.finalize_direct(global, "std::vector").unwrap();
engine.finalize_direct(global, "std::string").unwrap();
engine.finalize_direct(function, "parameter").unwrap();
engine.finalize_direct(function, "result").unwrap();

// Incremental typing in block scope (draft state)
engine.insert_str(block, "local_var").unwrap();

// Query completions - sees all visible scopes (block + function + global)
let completions = engine.complete(block, "par", 1);
for comp in completions {
    println!("{} (distance: {}, draft: {}, from context: {:?})",
             comp.term, comp.distance, comp.is_draft, comp.contexts);
}
// Output: parameter (distance: 0, draft: false, from context: [1])

// Checkpoint/undo support for editor integration
engine.checkpoint(block).unwrap();
engine.insert_str(block, "iable").unwrap();  // Now: "local_variable"
engine.undo(block).unwrap();  // Restore to: "local_var"

// Finalize draft to add to dictionary
let term = engine.finalize(block).unwrap();  // Returns "local_var"
assert!(engine.has_term("local_var"));

// Query now sees finalized term
let results = engine.complete(block, "loc", 1);
// Returns: local_var (now finalized, visible in this context)

Features:

  • Hierarchical visibility: Child contexts see parent terms (global → function → block)
  • Draft state: Incremental typing without polluting finalized dictionary
  • Checkpoint/undo: Editor-friendly state management
  • Thread-safe: Share engine across threads with Arc
  • Mixed queries: Search both drafts and finalized terms simultaneously

Performance (sub-millisecond for interactive use):

  • Insert character: ~4 µs
  • Checkpoint: ~116 ns
  • Query (500 terms, distance 1): ~11.5 µs
  • Query (distance 2): ~309 µs

Use cases:

  • LSP servers with multi-file scope awareness
  • Code editors with context-sensitive completion
  • REPL environments with session-scoped symbols
  • Any application requiring hierarchical fuzzy matching

See examples/contextual_completion.rs for a complete example.

Prefix-Based Iteration (PrefixZipper)

For efficient autocomplete and code completion scenarios, use the PrefixZipper trait to navigate directly to a prefix and iterate only matching terms. This is 5-10× faster than full dictionary iteration with filtering when the prefix matches a small subset of terms.

Performance: O(k) navigation + O(m) iteration, where k = prefix length, m = matching terms.

use liblevenshtein::prelude::*;
use liblevenshtein::dictionary::prefix_zipper::PrefixZipper;
use liblevenshtein::dictionary::double_array_trie_zipper::DoubleArrayTrieZipper;

let terms = vec!["process", "processUser", "produce", "product", "apple"];
let dict = DoubleArrayTrie::from_terms(terms.iter());

// Create zipper and navigate to prefix
let zipper = DoubleArrayTrieZipper::new_from_dict(&dict);
if let Some(iter) = zipper.with_prefix(b"proc") {
    for (path, _zipper) in iter {
        let term = String::from_utf8(path).unwrap();
        println!("Found: {}", term);
        // Prints only: "process" and "processUser" (not all 5 terms!)
    }
}

Unicode support (character-level):

use liblevenshtein::dictionary::double_array_trie_char::DoubleArrayTrieChar;
use liblevenshtein::dictionary::double_array_trie_char_zipper::DoubleArrayTrieCharZipper;
use liblevenshtein::dictionary::prefix_zipper::PrefixZipper;

let terms = vec!["café", "cafétéria", "naïve"];
let dict = DoubleArrayTrieChar::from_terms(terms.iter());

let zipper = DoubleArrayTrieCharZipper::new_from_dict(&dict);
let prefix: Vec<char> = "caf".chars().collect();

if let Some(iter) = zipper.with_prefix(&prefix) {
    for (path, _) in iter {
        let term: String = path.iter().collect();
        println!("Found: {}", term);
    }
}

Valued dictionaries (for metadata like scope IDs, frequencies, etc.):

use liblevenshtein::dictionary::prefix_zipper::ValuedPrefixZipper;

let dict = DoubleArrayTrie::from_terms_with_values(
    vec![("cat", 1), ("cats", 2), ("dog", 3)].into_iter()
);

let zipper = DoubleArrayTrieZipper::new_from_dict(&dict);
if let Some(iter) = zipper.with_prefix_values(b"cat") {
    for (path, value) in iter {
        let term = String::from_utf8(path).unwrap();
        println!("{} -> {}", term, value);
        // Prints: "cat -> 1", "cats -> 2"
    }
}

Use cases:

  • Code completion / autocomplete
  • Prefix search in large dictionaries
  • Pattern-aware completion (LSP servers)
  • Any scenario requiring "terms starting with X"

Backend support: All dictionary backends support PrefixZipper (DoubleArrayTrie, DynamicDawg, PathMap, etc.) through blanket trait implementation.

Advanced: Direct Zipper API

For fine-grained control over traversal, use the zipper API directly:

use liblevenshtein::dictionary::pathmap::PathMapDictionary;
use liblevenshtein::dictionary::pathmap_zipper::PathMapZipper;
use liblevenshtein::transducer::{AutomatonZipper, Algorithm, StatePool};
use liblevenshtein::transducer::intersection_zipper::IntersectionZipper;

// Create dictionary
let dict = PathMapDictionary::<()>::new();
dict.insert("cat");
dict.insert("cats");
dict.insert("dog");

// Create zippers
let dict_zipper = PathMapZipper::new_from_dict(&dict);
let auto_zipper = AutomatonZipper::new("cot".as_bytes(), 1, Algorithm::Standard);

// Intersect dictionary and automaton
let intersection = IntersectionZipper::new(dict_zipper, auto_zipper);

// Manual traversal
let mut pool = StatePool::new();
for (label, child) in intersection.children(&mut pool) {
    if child.is_match() {
        println!("Match: {} (distance: {})",
                 child.term(),
                 child.distance().unwrap());
    }

    // Recurse into child for custom traversal algorithms
    for (_, grandchild) in child.children(&mut pool) {
        // Custom logic here...
    }
}

When to use:

  • Custom traversal algorithms (DFS, A*, beam search)
  • Early termination with custom heuristics
  • Integration with external data structures
  • Research and experimentation

Note: Most users should use Transducer::query() or ContextualCompletionEngine instead. The zipper API is lower-level and requires manual state management.

Value-Filtered Queries

For performance optimization when querying dictionaries with scoped values:

use liblevenshtein::prelude::*;
use std::collections::HashSet;

// Dictionary with scope IDs
let dict = PathMapDictionary::from_terms_with_values([
    ("global_var", 0),      // Scope 0 (global)
    ("function_param", 1),  // Scope 1 (function)
    ("local_var", 2),       // Scope 2 (block)
]);

let transducer = Transducer::new(dict, Algorithm::Standard);

// Query only specific scopes (e.g., visible from scope 1)
let visible_scopes = HashSet::from([0, 1]);  // global + function
for term in transducer.query_by_value_set("local_var", 2, &visible_scopes) {
    println!("{}", term);
}
// Output: global_var (distance 2, scope 0 is visible)
// Does NOT return: local_var (an exact match, but scope 2 is not in the visible set)

// Or use a custom predicate
for term in transducer.query_filtered("local_var", 2, |scope_id| *scope_id == 2) {
    println!("{}", term);
}
// Returns: local_var only (global_var matches at distance 2 but scope 0 fails the predicate)

Performance benefit: Filtering by value during traversal is significantly faster than post-filtering results, especially for large dictionaries with many out-of-scope matches.

Use cases:

  • Scope-based code completion (only show visible symbols)
  • Access control (filter by user permissions)
  • Multi-tenancy (filter by tenant ID)

Documentation

How Code Maps to Theory

For developers wanting to understand the implementation or extend the algorithms:

Paper Concept             | Code Location                         | Description
Position (i#e)            | src/transducer/position.rs:11-35      | Index + error count structure
Subsumption (⊑)           | src/transducer/position.rs:231-269    | Position redundancy elimination
Characteristic Vector (χ) | src/transducer/position.rs            | Character occurrence encoding
Elementary Transitions (δ)| src/transducer/transition.rs:119-438  | Tables 4.1, 7.1, 8.1 from paper
State Transitions (Δ)     | src/transducer/query.rs               | Parallel traversal implementation
Algorithm Variants        | src/transducer/algorithm.rs           | Standard/Transposition/MergeAndSplit
Imitation Method          | src/transducer/query.rs:86-188        | Chapter 6 - on-demand state generation
Dictionary Automaton (A^D)| src/dictionary/                       | Multiple backend implementations

Key Implementation Patterns:

  • Position Structure: Tracks (term_index, num_errors, is_special) corresponding to i#e_t or i#e_s in the paper
  • Subsumption Checking: Ensures minimal state sets by eliminating redundant positions
  • Transition Functions: Implement the case analysis from Tables 4.1, 7.1, and 8.1
  • Parallel Traversal: Simultaneously navigate dictionary automaton A^D and Levenshtein automaton LEV_n(W)

See Implementation Mapping for detailed code-to-paper correspondence with examples.
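To build intuition for this parallel traversal without the paper's full position-set machinery, the sketch below walks a toy trie while carrying an error-tracking state per node. It substitutes a simpler state, the classic dynamic-programming row (Ukkonen-style trie search, a well-known technique in its own right), for the subsumption-minimized position sets, but the control flow is the same: the dictionary automaton and the error state advance in lockstep, pruning branches from which no match is still reachable.

use std::collections::BTreeMap;

// A toy dictionary automaton: a plain trie.
#[derive(Default)]
struct Node {
    children: BTreeMap<u8, Node>,
    is_final: bool,
}

fn insert(root: &mut Node, term: &str) {
    let mut node = root;
    for b in term.bytes() {
        node = node.children.entry(b).or_default();
    }
    node.is_final = true;
}

// Walk the trie in lockstep with a DP row over the query:
// row[i] = edit distance between the current trie path and query[..i].
fn search(node: &Node, query: &[u8], row: &[usize], max: usize,
          path: &mut Vec<u8>, out: &mut Vec<(String, usize)>) {
    let dist = *row.last().unwrap();
    if node.is_final && dist <= max {
        out.push((String::from_utf8(path.clone()).unwrap(), dist));
    }
    for (&label, child) in &node.children {
        // Advance the error state for this dictionary edge.
        let mut next = vec![row[0] + 1];
        for i in 1..row.len() {
            let cost = if query[i - 1] == label { 0 } else { 1 };
            next.push((row[i - 1] + cost).min(row[i] + 1).min(next[i - 1] + 1));
        }
        // Prune: abandon the branch once every entry exceeds `max`.
        if *next.iter().min().unwrap() <= max {
            path.push(label);
            search(child, query, &next, max, path, out);
            path.pop();
        }
    }
}

let mut root = Node::default();
for t in ["test", "testing", "tester"] {
    insert(&mut root, t);
}
let query = b"tset";
let row: Vec<usize> = (0..=query.len()).collect();
let mut out = Vec::new();
search(&root, query, &row, 2, &mut Vec::new(), &mut out);
assert_eq!(out, vec![("test".to_string(), 2)]);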

Performance

The library maintains the theoretical guarantees from the foundational paper while adding practical optimizations:

Theoretical Complexity

  • Construction: O(|W|) time for automaton construction (Theorem 5.2.1)
  • Query: O(|D|) time for dictionary traversal where |D| is total edges (Chapter 3)
  • Space: O(|W|) states for fixed error bound n (Theorem 4.0.32)

vs Naive Approach: Computing Levenshtein distance for every dictionary entry requires O(|D| × |W| × |V|) operations. This library achieves 100-1000× speedup on large dictionaries by avoiding per-word distance calculations through automata-guided search.
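For contrast, the naive baseline recomputes a full dynamic-programming matrix for every dictionary word; this per-word O(|W| × |V|) inner step is exactly what the automaton avoids. A textbook implementation of that inner step:

/// Classic Wagner-Fischer edit distance: O(|a| × |b|) per word pair.
/// The naive approach runs this once per dictionary entry.
fn levenshtein(a: &str, b: &str) -> usize {
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr.push((prev[j] + cost).min(prev[j + 1] + 1).min(curr[j] + 1));
        }
        prev = curr;
    }
    *prev.last().unwrap()
}

assert_eq!(levenshtein("tset", "test"), 2);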

Core Optimizations

Beyond the algorithmic foundations, the implementation includes:

  • Arc Path Sharing: Eliminated expensive cloning operations during traversal
  • StatePool: Object pool pattern for state reuse with exceptional performance gains
  • SmallVec Integration: Stack-allocated vectors reduce heap allocation pressure
  • Lazy Edge Iteration: 15-50% faster PathMap edge iteration with zero-copy implementation
  • Aggressive Inlining: Hot path functions inlined for optimal performance

Benchmarks show 3.3x speedup for DAWG operations and 5-18% improvements across filtering/prefix scenarios.
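StatePool is an instance of the classic object-pool pattern: buffers are recycled across traversal steps instead of being allocated and dropped once per visited state. A generic sketch of the pattern (not the library's internal type):

/// Minimal object pool: hand out cleared Vec buffers, keeping their capacity.
struct Pool<T> {
    free: Vec<Vec<T>>,
}

impl<T> Pool<T> {
    fn new() -> Self {
        Pool { free: Vec::new() }
    }

    /// Reuse a retired buffer if one is available, else allocate fresh.
    fn acquire(&mut self) -> Vec<T> {
        self.free.pop().unwrap_or_default()
    }

    /// Return a buffer; clearing keeps its capacity for the next acquire.
    fn release(&mut self, mut buf: Vec<T>) {
        buf.clear();
        self.free.push(buf);
    }
}

let mut pool: Pool<u32> = Pool::new();
let mut state = pool.acquire();
state.extend([1, 2, 3]); // use the buffer as a transient traversal state
pool.release(state);
assert!(pool.acquire().capacity() >= 3); // the allocation was reused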

SIMD Acceleration (optional simd feature)

When compiled with the simd feature on x86_64 platforms, the library achieves 20-64% performance gains through:

  • 8 SIMD-optimized components across critical performance paths
  • AVX2/SSE4.1 implementations with runtime CPU feature detection
  • Data-driven threshold tuning based on empirical benchmarking
  • Automatic fallback to scalar implementation when SIMD unavailable
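The fallback in the last point uses the standard runtime-dispatch idiom from std::arch. A generic sketch of the pattern (not the library's internal code):

/// Count positions where two byte slices agree, dispatching at runtime.
fn count_matches(a: &[u8], b: &[u8]) -> usize {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe to call: AVX2 availability was just verified.
            return unsafe { count_matches_avx2(a, b) };
        }
    }
    count_matches_scalar(a, b) // portable fallback
}

fn count_matches_scalar(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).filter(|(x, y)| x == y).count()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn count_matches_avx2(a: &[u8], b: &[u8]) -> usize {
    // A real kernel would compare 32-byte chunks with _mm256_cmpeq_epi8;
    // this sketch simply defers to the scalar loop.
    count_matches_scalar(a, b)
}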

Key optimizations:

  • Characteristic vector operations (vectorized bit manipulation)
  • Position subsumption checking (parallel state comparisons)
  • State operations (vectorized distance computations)
  • Dictionary edge lookups (batched queries with adaptive thresholds)
  • Distance matrix computation (vectorized row/column operations)

Performance improvements vary by workload:

  • Small dictionaries (< 1,000 terms): 20-30% faster
  • Medium dictionaries (1,000-10,000 terms): 30-45% faster
  • Large dictionaries (> 10,000 terms): 45-64% faster

See docs/analysis/ for detailed SIMD performance analysis (950+ lines of documentation).

Unicode Performance

Character-level dictionary variants (*Char) for Unicode support:

  • ~5% overhead for UTF-8 decoding during traversal
  • 4x memory for edge labels (4 bytes per char vs 1 byte per u8)
  • Zero-cost abstraction via monomorphization (no runtime polymorphism)
  • Same query performance characteristics as byte-level variants

When to use:

  • ✅ Use *Char variants for multi-language dictionaries with non-ASCII Unicode
  • ✅ Use byte-level variants (DoubleArrayTrie, PathMapDictionary) for ASCII/Latin-1 content

Research & Planned Features

This library includes extensive research documentation on potential enhancements:

Universal Levenshtein Automata (Planned)

Support for restricted substitutions where only specific character pairs can be substituted:

Use Cases:

  • Keyboard proximity - Only adjacent keys allowed (QWERTY/AZERTY/Dvorak layouts)
    • Example: 'a' ↔ 's' ↔ 'd' (adjacent on QWERTY)
  • OCR error modeling - Visual similarity constraints
    • Example: 'l' ↔ 'I' ↔ '1', 'O' ↔ '0', 'rn' ↔ 'm'
  • Phonetic matching - Sound-alike rules
    • Example: 'f' ↔ 'ph', 'k' ↔ 'c', 's' ↔ 'z'

Theory: Based on "Universal Levenshtein Automata. Building and Properties" (Mitankin, Mihov, Schulz, 2005). Maintains O(|W|) construction complexity with restricted substitution set S ⊆ Σ × Σ.

Status: Research complete, implementation planned. See Universal Levenshtein Automata Documentation for complete details.

Weighted Operations (Research Phase)

Support for variable operation costs through discretization:

Use Cases:

  • Keyboard distance costs - Closer keys cost less (adjacent: 0.5, same row: 1.0, different rows: 1.5)
  • Context-dependent costs - Position-aware error penalties (beginning/end of word)
  • Frequency-based costs - Common errors cost less than rare ones

Complexity: O(|W| × max_cost/precision) through cost discretization, still linear in query length when the cost range and precision are fixed. For example, with precision 0.5 and maximum cost 1.5, operation costs discretize to the integer weights 1-3.

Status: Methodology documented, implementation research phase. See Weighted Levenshtein Automata Research for feasibility analysis and implementation approach.

Contributing Research

Interested in implementing these features or researching new extensions? The research documentation provides complete theoretical foundations and implementation roadmaps for extending the library.

License

Licensed under the Apache License, Version 2.0. See LICENSE for details.
