A Rust implementation of liblevenshtein for fast approximate string matching using Levenshtein Automata—deterministic finite automata that enable O(|W|) construction complexity for error-tolerant dictionary queries.
Based on "Fast String Correction with Levenshtein-Automata" (Schulz & Mihov, 2002).
- DynamicDawgChar - Unicode support for Dynamic DAWG with character-level operations
- Contextual Code Completion - Hierarchical scope-aware completion engine with zipper-based navigation
- Performance Optimizations - DynamicDawg with Bloom filters, auto-minimization, and sorted batch insertion
- UTF-8 Optimization Analysis - Comprehensive documentation of Unicode performance characteristics
- Bug Fixes - Fixed suffix sharing bug, test compilation, GitHub Actions workflow, and all clippy warnings
- Unicode Support - Character-level dictionary variants for correct Unicode Levenshtein distances
- `DoubleArrayTrieChar` and `PathMapDictionaryChar` for character-level operations
- Proper handling of multi-byte UTF-8 sequences (accented chars, CJK, emoji)
- ~5% performance overhead, 4x memory for edge labels
- Full test coverage with 19 Unicode-specific tests
- Phase 4: SIMD Optimization - Comprehensive SIMD acceleration with 20-64% performance gains
- 8 SIMD-optimized components across critical performance paths
- Data-driven threshold tuning based on empirical benchmarking
- AVX2/SSE4.1 implementations with runtime CPU feature detection
- Extensive documentation (950+ lines across 3 analysis documents)
- Package Support: Debian (.deb), RPM (.rpm), and Arch Linux (.pkg.tar.zst) packages
- CI Improvements: Explicit CPU features for better compatibility across platforms
- Code Quality: Fixed all clippy warnings without suppressing checks
- Bug Fixes: Fixed CLI format auto-detection for text dictionaries
See CHANGELOG.md for complete details.
This library provides efficient fuzzy string matching against large dictionaries using finite state automata. It supports multiple Levenshtein distance algorithms and pluggable dictionary backends.
Unlike naive approaches that compute Levenshtein distance for every dictionary word (O(|D| × |W| × |V|)), this library uses Levenshtein Automata—a deterministic finite automaton that accepts exactly all words within distance n from a query word.
- O(|W|) Automaton Construction - Linear in query word length for fixed error bound n
- O(|D|) Dictionary Traversal - Single pass through dictionary (measured as total edges)
- No Distance Recomputation - Automaton guides search, avoiding per-word calculations
- Proven Correctness - Accepts exactly L_Lev(n,W) = {V | d_L(W,V) ≤ n}
Result: Query time is effectively independent of dictionary size for well-distributed tries!
Theory: Based on "Fast String Correction with Levenshtein-Automata" (Schulz & Mihov, 2002). See complete algorithm documentation for algorithmic details including Position structures, Subsumption relations, and Characteristic vectors.
Traditional approaches compute edit distance for every dictionary word, requiring O(|D| × |W| × |V|) operations. This library uses Levenshtein Automata—deterministic finite automata that accept exactly all words within distance n from query word W:
LEV_n(W) accepts L_Lev(n,W) = {V | d_L(W,V) ≤ n}
Position (i#e): Represents "matched i characters of W with e errors"
- Example: 3#1 means "matched 3 characters, used 1 error"
- Range: 0 ≤ i ≤ |W|, 0 ≤ e ≤ n
Subsumption (⊑): Eliminates redundant positions for minimal state sets
- Position i#e subsumes j#f if: (e < f) ∧ (|j-i| ≤ f-e)
- Any word accepted from j#f is also accepted from i#e
Characteristic Vectors (χ): Encode where characters appear in query word
- χ(x,W) = ⟨b₁,...,b_w⟩ where b_i = 1 if W[i] = x
- Used to determine valid transitions without recomputing distances
Elementary Transitions: Move between positions based on dictionary character
- Defined in Tables 4.1, 7.1, 8.1 of the paper for each algorithm variant
- Enable O(1) state updates during traversal
Result: Automaton construction is O(|W|), dictionary traversal is O(|D|), regardless of dictionary size!
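As a concrete illustration, here is a minimal, standalone sketch of the characteristic vector and subsumption test exactly as defined above; the function names and types are illustrative and do not mirror the library's internal API.

```rust
/// Characteristic vector χ(x, W): bit i is set iff W[i] = x.
fn characteristic_vector(x: u8, w: &[u8]) -> Vec<bool> {
    w.iter().map(|&c| c == x).collect()
}

/// Subsumption: position i#e subsumes j#f when e < f and |j - i| <= f - e,
/// so the subsumed position can be dropped without changing the accepted language.
fn subsumes(i: usize, e: usize, j: usize, f: usize) -> bool {
    e < f && i.abs_diff(j) <= f - e
}

fn main() {
    // χ('t', "tset") = <1, 0, 0, 1>: 't' occurs at indices 0 and 3.
    assert_eq!(
        characteristic_vector(b't', b"tset"),
        vec![true, false, false, true]
    );
    // 3#1 subsumes 4#2 because 1 < 2 and |4 - 3| = 1 <= 2 - 1.
    assert!(subsumes(3, 1, 4, 2));
}
```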
Core Algorithm:
- Schulz, Klaus U., and Stoyan Mihov. "Fast string correction with Levenshtein automata." International Journal on Document Analysis and Recognition 5.1 (2002): 67-85.
Extensions:
- Mitankin, Petar, Stoyan Mihov, and Klaus Schulz. "Universal Levenshtein Automata. Building and Properties." Information Processing & Management 41.4 (2005): 687-702.
- Complete Algorithm Documentation - Detailed paper summary
- Implementation Mapping - Code-to-paper correspondence
- Glossary - Terms and notation reference
- Universal Levenshtein Automata Research - Restricted substitutions
- Weighted Distance Research - Variable operation costs
- Fast approximate string matching using Levenshtein Automata with O(|W|) construction complexity
- Multiple algorithms, all maintaining O(|W|) construction complexity (see the usage sketch after this feature list):
- Standard: Insert, delete, substitute operations (Chapters 4-6 of foundational paper)
- Example: "hello" → "helo" (delete), "helo" → "hello" (insert), "hello" → "hallo" (substitute)
- Use for: General fuzzy matching, spell checking
- Transposition: Adds adjacent character swaps (Chapter 7) - Damerau-Levenshtein distance
- Example: "teh" → "the" costs 1 (not 2), "recieve" → "receive" costs 1
- Use for: Typing errors, keyboard input corrections
- MergeAndSplit: Adds two-character ↔ one-character operations (Chapter 8)
- Example: "rn" ↔ "m", "vv" ↔ "w" (common in OCR)
- Use for: Scanned documents, character recognition systems
- Multiple dictionary backends:
- DoubleArrayTrie (recommended, default) - O(1) transitions with excellent cache locality and minimal memory footprint
- DoubleArrayTrieChar - Character-level variant for correct Unicode Levenshtein distances
- PathMap (optional `pathmap-backend` feature) - High-performance trie with structural sharing, supports dynamic updates
- PathMapChar (optional `pathmap-backend` feature) - Character-level PathMap for Unicode fuzzy matching
- DAWG - Directed Acyclic Word Graph for space-efficient storage
- OptimizedDawg - Arena-based DAWG with 20-25% faster construction
- DynamicDawg - DAWG with online insert/delete/minimize operations
- DynamicDawgChar - Character-level DAWG with Unicode support and online updates
- SuffixAutomaton - For substring/infix matching
- SuffixAutomatonChar - Character-level suffix automaton for Unicode substring matching
- Extensible trait-based design for custom backends
- Ordered results: `OrderedQueryIterator` returns results sorted by distance first, then lexicographically
- Filtering and prefix matching: Filter results with custom predicates, enable prefix mode for code completion
- Serialization support (optional `serialization` feature):
- Multiple formats: Bincode (binary), JSON, Plain Text (newline-delimited UTF-8), Protobuf
- Optimized Protobuf formats: V2 (62% smaller), DAT-specific, SuffixAutomaton-specific
- Gzip compression (optional `compression` feature) - 85% file size reduction
- Save and load dictionaries to/from disk
- Full-featured CLI tool (optional `cli` feature):
- Interactive REPL for exploration
- Query, insert, delete, convert operations
- Support for all serialization formats including compressed variants
- Runtime dictionary updates:
- Thread-safe insert, remove, and clear operations (PathMap, DynamicDawg)
- Queries automatically see updates via `RwLock`-based interior mutability
- Concurrent queries during modifications
- Lazy evaluation - results generated on-demand
- Efficient memory usage - shared dictionary state across transducers
- SIMD acceleration (optional `simd` feature, x86_64 only):
- AVX2/SSE4.1 optimized operations with runtime CPU feature detection
- 20-64% faster query performance across all workloads
- Optimized characteristic vectors, position subsumption, state operations
- Dictionary edge lookup with data-driven threshold tuning
- Distance matrix computation with vectorized operations
- Automatic fallback to scalar implementation when SIMD unavailable
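As referenced in the algorithms list above, here is a short usage sketch contrasting two of the variants. It assumes only the `Transducer`, `Algorithm`, and `DoubleArrayTrie` API shown in the Quick Start below; the distances in the comments follow the variant definitions, and `Algorithm::MergeAndSplit` can be swapped in for the OCR-oriented variant.

```rust
use liblevenshtein::prelude::*;

fn main() {
    // Standard: "teh" -> "the" needs two edits (substitute both swapped letters).
    let standard = Transducer::new(
        DoubleArrayTrie::from_terms(vec!["the"]),
        Algorithm::Standard,
    );
    // Transposition (Damerau-Levenshtein): the adjacent swap counts as a single edit.
    let transposition = Transducer::new(
        DoubleArrayTrie::from_terms(vec!["the"]),
        Algorithm::Transposition,
    );

    assert_eq!(standard.query("teh", 1).count(), 0); // distance 2 under Standard
    assert_eq!(transposition.query("teh", 1).count(), 1); // distance 1 with swaps
}
```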
Add to your Cargo.toml:
[dependencies]
liblevenshtein = "0.5"
# Or with SIMD acceleration (x86_64 only, requires SSE4.1/AVX2):
liblevenshtein = { version = "0.5", features = ["simd"] }

Or install the CLI tool:
cargo install liblevenshtein --features cli,compression,protobuf

Download pre-built packages from the GitHub Releases page:
- Debian/Ubuntu: `.deb` packages
- Fedora/RHEL/CentOS: `.rpm` packages
- Arch Linux: `.pkg.tar.zst` packages
- macOS: `.dmg` disk images for easy installation (x86_64 and ARM64)
- Binaries: `.tar.gz` and `.zip` archives for Linux and macOS (x86_64 and ARM64)
use liblevenshtein::prelude::*;
// Create a dictionary from terms (using DoubleArrayTrie for best performance)
let terms = vec!["test", "testing", "tested", "tester"];
let dict = DoubleArrayTrie::from_terms(terms);
// Create a transducer with Standard algorithm
let transducer = Transducer::new(dict, Algorithm::Standard);
// Query for terms within edit distance 2
for term in transducer.query("tset", 2) {
println!("Match: {}", term);
}
// Query with distances
for candidate in transducer.query_with_distance("tset", 2) {
println!("Match: {} (distance: {})", candidate.term, candidate.distance);
}

Output:
Match: test
Match: test (distance: 2)
For correct character-level Levenshtein distances with Unicode text, use the character-level dictionary variants:
use liblevenshtein::dictionary::double_array_trie_char::DoubleArrayTrieChar;
use liblevenshtein::prelude::*;
// Create a character-level dictionary for Unicode content
let terms = vec!["café", "naïve", "中文", "🎉"];
let dict = DoubleArrayTrieChar::from_terms(terms);
let transducer = Transducer::new(dict, Algorithm::Standard);
// Character-level distance: "" → "¡" is distance 1 (one character)
// Byte-level would incorrectly report distance 2 (two UTF-8 bytes)
let results: Vec<_> = transducer.query("", 1).collect();
// Results include single-character Unicode terms
// Works with accented characters
for candidate in transducer.query_with_distance("cafe", 1) {
println!("{}: {}", candidate.term, candidate.distance);
// Output: café: 1 (one character substitution: e→é)
}

Character-level backends:
- `DoubleArrayTrieChar` - Character-level Double-Array Trie
- `PathMapDictionaryChar` (requires `pathmap-backend` feature) - Character-level PathMap with dynamic updates
When to use:
- ✅ Dictionaries containing non-ASCII Unicode (accented characters, CJK, emoji)
- ✅ When edit distance must count characters, not bytes
- ✅ Multi-language applications requiring correct Unicode semantics
Performance: ~5% overhead for UTF-8 decoding, 4x memory for edge labels (char vs u8).
The DynamicDawg backend supports thread-safe runtime updates:
use liblevenshtein::prelude::*;
// Create dictionary with runtime update support
let dict = DynamicDawg::from_terms(vec!["cat", "dog"]);
let transducer = Transducer::new(dict.clone(), Algorithm::Standard);
// Insert new terms at runtime
dict.insert("cot");
dict.insert("bat");
// Queries immediately see the new terms
let results: Vec<_> = transducer.query("cot", 1).collect();
// Results: ["cat", "cot"]
// Remove terms
dict.remove("dog");Thread Safety: The dictionary uses RwLock for interior mutability:
- Multiple concurrent queries are allowed (read locks)
- Modifications acquire exclusive write locks
- Active `Transducer` instances automatically see updates
Performance Optimizations: Configure Bloom filter and auto-minimization for better performance:
use liblevenshtein::prelude::*;
// Enable Bloom filter for 10x faster contains() operations
let dict = DynamicDawg::with_config(
f32::INFINITY, // Auto-minimize threshold (disabled)
Some(10000), // Bloom filter capacity (enabled)
);
// Or enable auto-minimization for bulk insertions (30% faster for large datasets)
let dict = DynamicDawg::with_config(
1.5, // Auto-minimize at 50% growth
None, // Bloom filter (disabled)
);
// Or enable both optimizations
let dict = DynamicDawg::with_config(
1.5, // Auto-minimize threshold
Some(10000), // Bloom filter capacity
);

Optimization Guide:
- Bloom Filter: Use when frequently checking if terms exist (88-93% faster `contains()`)
- Auto-Minimization: Use for bulk insertions of 1000+ terms (30% faster, prevents memory bloat)
- Default: Both disabled for maximum flexibility and minimal overhead on small datasets
Note: PathMapDictionary also supports runtime updates but requires the optional pathmap-backend feature.
See examples/dynamic_dictionary.rs for a complete demonstration.
Store metadata with dictionary terms using value-mapped dictionaries:
use liblevenshtein::prelude::*;
use std::collections::HashSet;
// DynamicDawg with integer values (e.g., word frequencies)
let dict: DynamicDawg<u32> = DynamicDawg::new();
dict.insert_with_value("apple", 42);
dict.insert_with_value("apply", 17);
// Retrieve values
assert_eq!(dict.get_value("apple"), Some(42));
assert_eq!(dict.get_value("banana"), None);
// Use with FuzzyMap for fuzzy lookups with values
let map = FuzzyMap::new(dict, Algorithm::Standard);
let results = map.get_with_distance("aple", 1); // fuzzy lookup
// Results include both terms and their values
// Or use FuzzyMultiMap to aggregate multiple values
let dict: DynamicDawg<HashSet<u32>> = DynamicDawg::new();
dict.insert_with_value("test", HashSet::from([1, 2, 3]));Supported backends:
DynamicDawg<V>- Dynamic dictionary with values of typeVDynamicDawgChar<V>- Character-level dynamic dictionary with Unicode supportPathMapDictionary<V>- PathMap with values (requirespathmap-backend)PathMapDictionaryChar<V>- Character-level PathMap with values
Common value types:
- `u32`/`u64` - Frequencies, scores, IDs
- `HashSet<T>` - Multiple associations per term
- `Vec<T>` - Ordered collections
- Any type implementing `Clone + Send + Sync + 'static`
Integration with contextual completion:
use liblevenshtein::prelude::*;
// Use DynamicDawg backend for contextual completion
let dict: DynamicDawg<Vec<u32>> = DynamicDawg::new();
let engine = ContextualCompletionEngine::with_dictionary(
dict,
Algorithm::Standard
);
// Insert terms with context IDs
let ctx = engine.create_root_context();
engine.insert_finalized(ctx, "variable", vec![ctx]);

Get results sorted by edit distance first, then alphabetically:
use liblevenshtein::prelude::*;
let dict = DoubleArrayTrie::from_terms(vec!["apple", "apply", "ape", "app"]);
let transducer = Transducer::new(dict, Algorithm::Standard);
// Results ordered by distance, then alphabetically
for candidate in transducer.query_ordered("aple", 1) {
println!("{}: {}", candidate.term, candidate.distance);
}

Output:
ape: 1
apple: 1
app: 2
apply: 2
Filter results and enable prefix matching for code completion:
use liblevenshtein::prelude::*;
let dict = DoubleArrayTrie::from_terms(vec![
"getValue", "getVariable", "setValue", "setVariable"
]);
let transducer = Transducer::new(dict, Algorithm::Standard);
// Prefix matching with filtering
for candidate in transducer
.query_ordered("getVal", 1)
.prefix() // Match terms starting with query ± edits
.filter(|c| c.term.starts_with("get")) // Only getter methods
{
println!("{}: {}", candidate.term, candidate.distance);
}

Output:
getValue: 0
getVariable: 1
See examples/code_completion_demo.rs and examples/contextual_filtering_optimization.rs for more examples.
The library provides multiple dictionary backends optimized for different use cases:
| Backend | Best For | Performance Highlights |
|---|---|---|
| DoubleArrayTrie | General use (recommended) | 3x faster queries, 30x faster contains, 8 bytes/state |
| DoubleArrayTrieChar | Unicode text, character-level distances | Correct Unicode semantics, ~5% overhead |
| PathMap | Dynamic updates, runtime modifications | Thread-safe insert/delete, fastest dynamic backend |
| PathMapChar | Unicode + dynamic updates | Character-level distances with runtime modifications |
| DAWG | Static dictionaries, moderate size | Good balance of speed and memory |
| OptimizedDawg | Static dictionaries, construction speed | 20-25% faster construction than DAWG |
| DynamicDawg | Occasional modifications | Best fuzzy matching for dynamic use |
| DynamicDawgChar | Unicode + occasional modifications | Character-level with insert/remove, ~5% overhead |
| SuffixAutomaton | Substring/infix matching | Find patterns anywhere in text |
Performance Comparison (10,000 words):
Construction: DAT: 3.2ms PathMap: 3.5ms DAWG: 7.2ms
Exact Match: DAT: 6.6µs DAWG: 19.8µs PathMap: 71.1µs
Contains (100): DAT: 0.22µs DAWG: 6.7µs PathMap: 132µs
Distance 1: DAT: 12.9µs DAWG: 319µs PathMap: 888µs
Distance 2: DAT: 16.3µs DAWG: 2,150µs PathMap: 5,919µs
Memory/state: DAT: ~8B OptDawg: ~13B DAWG: ~32B
Why DoubleArrayTrie is Fastest:
- O(1) transitions - Direct array indexing vs pointer chasing
- Excellent cache locality - Sequential memory layout minimizes cache misses
- Minimal memory footprint - ~8 bytes per state (vs 32 bytes for DAWG)
- Perfect for parallel traversal - Optimal implementation of dictionary automaton A^D from paper
All backends implement the dictionary automaton A^D that is traversed in parallel with the Levenshtein automaton LEV_n(W) during queries. The choice of backend affects the constant factors in the O(|D|) query complexity, with DoubleArrayTrie providing the smallest constants due to hardware-friendly memory access patterns.
Prefix Matching Support: All backends except SuffixAutomaton support efficient prefix matching through the .prefix() method, making them ideal for code completion and autocomplete applications.
When to use each backend:
- Static dictionaries (ASCII/Latin-1) → `DoubleArrayTrie` (best overall performance, default choice)
- Static dictionaries (Unicode) → `DoubleArrayTrieChar` (correct character-level distances)
- Dynamic updates (ASCII/Latin-1) → `DynamicDawg` (thread-safe runtime modifications)
- Dynamic updates (Unicode) → `DynamicDawgChar` (character-level with insert/remove, best fuzzy matching)
- Dynamic updates (Unicode, frequent) → `PathMapChar` (alternative, requires `pathmap-backend`)
- Substring search → `SuffixAutomaton` (finds patterns anywhere in text)
- Memory-constrained → `DoubleArrayTrie` (8 bytes/state, most efficient)
- Multi-language apps → Character-level variants (`*Char`) for correct Unicode semantics
Save and load dictionaries with optional compression:
use liblevenshtein::prelude::*;
use std::fs::File;
let dict = DoubleArrayTrie::from_terms(vec!["test", "testing", "tested"]);
// Save with compression (85% file size reduction)
let file = File::create("dict.bin.gz")?;
GzipSerializer::<BincodeSerializer>::serialize(&dict, file)?;
// Load compressed dictionary
let file = File::open("dict.bin.gz")?;
let dict: DoubleArrayTrie = GzipSerializer::<BincodeSerializer>::deserialize(file)?;
// Or use optimized Protobuf formats
let file = File::create("dict.pb")?;
OptimizedProtobufSerializer::serialize(&dict, file)?; // 62% smaller than standard

Requires the `serialization` and `compression` features:
[dependencies]
liblevenshtein = { git = "https://github.com/universal-automata/liblevenshtein-rust", tag = "v0.3.0", features = ["serialization", "compression"] }

The library includes a full-featured command-line tool with interactive REPL:
# Install with CLI support (from GitHub)
cargo install --git https://github.com/universal-automata/liblevenshtein-rust --tag v0.3.0 \
--features cli,compression,protobuf liblevenshtein
# Or download pre-built binaries from GitHub Releases
# Query a dictionary
liblevenshtein query "test" --dict /usr/share/dict/words -m 2 -s
# Convert between formats with compression
liblevenshtein convert words.txt words.bin.gz \
--to-format bincode-gz --to-backend path-map
# Launch interactive REPL
liblevenshtein repl --dict words.bin.gz

The CLI auto-detects formats from file extensions and supports:
- Formats: text, bincode, json, protobuf (plus `-gz` compressed variants)
- Backends: path-map, dawg, dynamic-dawg
- Operations: query, insert, delete, convert, save, info
See docs/developer-guide/building.md for comprehensive CLI documentation.
The FuzzyMultiMap enables fuzzy lookup of associated values, aggregating results from all matching terms:
use liblevenshtein::prelude::*;
use liblevenshtein::cache::multimap::FuzzyMultiMap;
use std::collections::HashSet;
// Create a dictionary with values (e.g., user IDs for each name)
let dict = PathMapDictionary::from_terms_with_values([
("alice", HashSet::from([101, 102])),
("alicia", HashSet::from([103])),
("bob", HashSet::from([201])),
("robert", HashSet::from([202, 203])),
]);
// Create fuzzy multimap
let fuzzy_map = FuzzyMultiMap::new(dict, Algorithm::Standard);
// Query "alise" - matches both "alice" and "alicia" at distance 1
let user_ids = fuzzy_map.query("alise", 1).unwrap();
// Returns: HashSet {101, 102, 103} - union of all matching values
// Practical use case: find all IDs for fuzzy name search
let ids = fuzzy_map.query("rob", 2).unwrap();
// Returns IDs for both "bob" (distance 1) and "robert" (distance 2)Supported collection types:
- `HashSet<T>` - Union of all matching sets
- `BTreeSet<T>` - Union of all matching sets
- `Vec<T>` - Concatenation of all matching vectors
Use cases:
- User ID lookup with fuzzy name matching
- Tag aggregation across similar terms
- Multi-valued dictionary queries with error tolerance
The ContextualCompletionEngine provides hierarchical scope-aware code completion with draft state management:
use liblevenshtein::contextual::ContextualCompletionEngine;
use liblevenshtein::transducer::Algorithm;
// Create engine (thread-safe, shareable with Arc)
let engine = ContextualCompletionEngine::with_algorithm(Algorithm::Standard);
// Create hierarchical scopes (global → function → block)
let global = engine.create_root_context(0);
let function = engine.create_child_context(1, global).unwrap();
let block = engine.create_child_context(2, function).unwrap();
// Add finalized terms to each scope
engine.finalize_direct(global, "std::vector").unwrap();
engine.finalize_direct(global, "std::string").unwrap();
engine.finalize_direct(function, "parameter").unwrap();
engine.finalize_direct(function, "result").unwrap();
// Incremental typing in block scope (draft state)
engine.insert_str(block, "local_var").unwrap();
// Query completions - sees all visible scopes (block + function + global)
let completions = engine.complete(block, "par", 1);
for comp in completions {
println!("{} (distance: {}, draft: {}, from context: {:?})",
comp.term, comp.distance, comp.is_draft, comp.contexts);
}
// Output: parameter (distance: 0, draft: false, from context: [1])
// Checkpoint/undo support for editor integration
engine.checkpoint(block).unwrap();
engine.insert_str(block, "iable").unwrap(); // Now: "local_variable"
engine.undo(block).unwrap(); // Restore to: "local_var"
// Finalize draft to add to dictionary
let term = engine.finalize(block).unwrap(); // Returns "local_var"
assert!(engine.has_term("local_var"));
// Query now sees finalized term
let results = engine.complete(block, "loc", 1);
// Returns: local_var (now finalized, visible in this context)

Features:
- Hierarchical visibility: Child contexts see parent terms (global → function → block)
- Draft state: Incremental typing without polluting finalized dictionary
- Checkpoint/undo: Editor-friendly state management
- Thread-safe: Share engine across threads with `Arc`
- Mixed queries: Search both drafts and finalized terms simultaneously
Performance (sub-millisecond for interactive use):
- Insert character: ~4 µs
- Checkpoint: ~116 ns
- Query (500 terms, distance 1): ~11.5 µs
- Query (distance 2): ~309 µs
Use cases:
- LSP servers with multi-file scope awareness
- Code editors with context-sensitive completion
- REPL environments with session-scoped symbols
- Any application requiring hierarchical fuzzy matching
See examples/contextual_completion.rs for a complete example.
For efficient autocomplete and code completion scenarios, use the PrefixZipper trait to navigate directly to a prefix and iterate only matching terms. This is 5-10× faster than full dictionary iteration with filtering when the prefix matches a small subset of terms.
Performance: O(k) navigation + O(m) iteration, where k = prefix length, m = matching terms.
use liblevenshtein::prelude::*;
use liblevenshtein::dictionary::prefix_zipper::PrefixZipper;
use liblevenshtein::dictionary::double_array_trie_zipper::DoubleArrayTrieZipper;
let terms = vec!["process", "processUser", "produce", "product", "apple"];
let dict = DoubleArrayTrie::from_terms(terms.iter());
// Create zipper and navigate to prefix
let zipper = DoubleArrayTrieZipper::new_from_dict(&dict);
if let Some(iter) = zipper.with_prefix(b"proc") {
for (path, _zipper) in iter {
let term = String::from_utf8(path).unwrap();
println!("Found: {}", term);
// Prints only: "process" and "processUser" (not all 5 terms!)
}
}

Unicode support (character-level):
use liblevenshtein::dictionary::double_array_trie_char::DoubleArrayTrieChar;
use liblevenshtein::dictionary::double_array_trie_char_zipper::DoubleArrayTrieCharZipper;
use liblevenshtein::dictionary::prefix_zipper::PrefixZipper;
let terms = vec!["café", "cafétéria", "naïve"];
let dict = DoubleArrayTrieChar::from_terms(terms.iter());
let zipper = DoubleArrayTrieCharZipper::new_from_dict(&dict);
let prefix: Vec<char> = "caf".chars().collect();
if let Some(iter) = zipper.with_prefix(&prefix) {
for (path, _) in iter {
let term: String = path.iter().collect();
println!("Found: {}", term);
}
}

Valued dictionaries (for metadata like scope IDs, frequencies, etc.):
use liblevenshtein::dictionary::prefix_zipper::ValuedPrefixZipper;
let dict = DoubleArrayTrie::from_terms_with_values(
vec![("cat", 1), ("cats", 2), ("dog", 3)].into_iter()
);
let zipper = DoubleArrayTrieZipper::new_from_dict(&dict);
if let Some(iter) = zipper.with_prefix_values(b"cat") {
for (path, value) in iter {
let term = String::from_utf8(path).unwrap();
println!("{} -> {}", term, value);
// Prints: "cat -> 1", "cats -> 2"
}
}

Use cases:
- Code completion / autocomplete
- Prefix search in large dictionaries
- Pattern-aware completion (LSP servers)
- Any scenario requiring "terms starting with X"
Backend support: All dictionary backends support PrefixZipper (DoubleArrayTrie, DynamicDawg, PathMap, etc.) through blanket trait implementation.
For fine-grained control over traversal, use the zipper API directly:
use liblevenshtein::dictionary::pathmap::PathMapDictionary;
use liblevenshtein::dictionary::pathmap_zipper::PathMapZipper;
use liblevenshtein::transducer::{AutomatonZipper, Algorithm, StatePool};
use liblevenshtein::transducer::intersection_zipper::IntersectionZipper;
// Create dictionary
let dict = PathMapDictionary::<()>::new();
dict.insert("cat");
dict.insert("cats");
dict.insert("dog");
// Create zippers
let dict_zipper = PathMapZipper::new_from_dict(&dict);
let auto_zipper = AutomatonZipper::new("cot".as_bytes(), 1, Algorithm::Standard);
// Intersect dictionary and automaton
let intersection = IntersectionZipper::new(dict_zipper, auto_zipper);
// Manual traversal
let mut pool = StatePool::new();
for (label, child) in intersection.children(&mut pool) {
if child.is_match() {
println!("Match: {} (distance: {})",
child.term(),
child.distance().unwrap());
}
// Recurse into child for custom traversal algorithms
for (_, grandchild) in child.children(&mut pool) {
// Custom logic here...
}
}

When to use:
- Custom traversal algorithms (DFS, A*, beam search)
- Early termination with custom heuristics
- Integration with external data structures
- Research and experimentation
Note: Most users should use Transducer::query() or ContextualCompletionEngine instead. The zipper API is lower-level and requires manual state management.
For performance optimization when querying dictionaries with scoped values:
use liblevenshtein::prelude::*;
use std::collections::HashSet;
// Dictionary with scope IDs
let dict = PathMapDictionary::from_terms_with_values([
("global_var", 0), // Scope 0 (global)
("function_param", 1), // Scope 1 (function)
("local_var", 2), // Scope 2 (block)
]);
let transducer = Transducer::new(dict, Algorithm::Standard);
// Query only specific scopes (e.g., visible from scope 1)
let visible_scopes = HashSet::from([0, 1]); // global + function
for term in transducer.query_by_value_set("param", 1, &visible_scopes) {
println!("{}", term);
}
// Output: function_param (scope 1 is visible)
// Does NOT return: local_var (scope 2 not in visible set)
// Or use custom predicate
for term in transducer.query_filtered("var", 1, |scope_id| *scope_id <= 1) {
println!("{}", term);
}
// Returns: global_var, function_param (scopes 0, 1)

Performance benefit: Filtering by value during traversal is significantly faster than post-filtering results, especially for large dictionaries with many out-of-scope matches.
Use cases:
- Scope-based code completion (only show visible symbols)
- Access control (filter by user permissions)
- Multi-tenancy (filter by tenant ID)
- Technical Glossary - Comprehensive reference for all technical terms (70+ entries)
- User Guide - Getting started, algorithms, backends, and features
- Developer Guide - Architecture, building, contributing, and performance
- Building and Testing - Comprehensive build instructions and CLI usage
- Contributing Guidelines - How to contribute to the project
- Features Overview - Detailed feature documentation
- Publishing Guide - Requirements for publishing to crates.io
- Changelog - Version history and release notes
For developers wanting to understand the implementation or extend the algorithms:
| Paper Concept | Code Location | Description |
|---|---|---|
| Position (i#e) | `src/transducer/position.rs:11-35` | Index + error count structure |
| Subsumption (⊑) | `src/transducer/position.rs:231-269` | Position redundancy elimination |
| Characteristic Vector (χ) | `src/transducer/position.rs` | Character occurrence encoding |
| Elementary Transitions (δ) | `src/transducer/transition.rs:119-438` | Tables 4.1, 7.1, 8.1 from paper |
| State Transitions (Δ) | `src/transducer/query.rs` | Parallel traversal implementation |
| Algorithm Variants | `src/transducer/algorithm.rs` | Standard/Transposition/MergeAndSplit |
| Imitation Method | `src/transducer/query.rs:86-188` | Chapter 6 - on-demand state generation |
| Dictionary Automaton (A^D) | `src/dictionary/` | Multiple backend implementations |
Key Implementation Patterns:
- Position Structure: Tracks (term_index, num_errors, is_special) corresponding to i#e_t or i#e_s in the paper
- Subsumption Checking: Ensures minimal state sets by eliminating redundant positions
- Transition Functions: Implement the case analysis from Tables 4.1, 7.1, and 8.1
- Parallel Traversal: Simultaneously navigate dictionary automaton A^D and Levenshtein automaton LEV_n(W)
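To make the parallel-traversal pattern concrete, the sketch below walks a toy trie while carrying per-branch matching state and pruning branches that can no longer stay within the error bound. It uses the classic dynamic-programming row per trie edge rather than the paper's precomputed elementary transitions, so it illustrates the guided-search idea only, not the library's actual automaton machinery.

```rust
use std::collections::BTreeMap;

/// Toy trie node for illustration only.
#[derive(Default)]
struct Node {
    children: BTreeMap<u8, Node>,
    is_word: bool,
}

impl Node {
    fn insert(&mut self, word: &[u8]) {
        let mut node = self;
        for &b in word {
            node = node.children.entry(b).or_default();
        }
        node.is_word = true;
    }
}

/// Guided trie search: extend one DP row per edge and prune subtrees whose
/// row minimum already exceeds the error bound.
fn search(
    node: &Node,
    prefix: &mut Vec<u8>,
    prev_row: &[usize],
    query: &[u8],
    max: usize,
    out: &mut Vec<String>,
) {
    for (&b, child) in &node.children {
        let mut row = vec![prev_row[0] + 1];
        for j in 1..=query.len() {
            let cost = if query[j - 1] == b { 0 } else { 1 };
            row.push((prev_row[j - 1] + cost).min(prev_row[j] + 1).min(row[j - 1] + 1));
        }
        prefix.push(b);
        if child.is_word && *row.last().unwrap() <= max {
            out.push(String::from_utf8(prefix.clone()).unwrap());
        }
        // Prune: the row minimum is a lower bound for every extension of this prefix.
        if row.iter().min().copied().unwrap() <= max {
            search(child, prefix, &row, query, max, out);
        }
        prefix.pop();
    }
}

fn main() {
    let mut root = Node::default();
    for w in ["test", "testing", "tester"] {
        root.insert(w.as_bytes());
    }
    let query = b"tset";
    let first_row: Vec<usize> = (0..=query.len()).collect();
    let mut results = Vec::new();
    search(&root, &mut Vec::new(), &first_row, query, 2, &mut results);
    println!("{:?}", results); // ["test"]
}
```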
See Implementation Mapping for detailed code-to-paper correspondence with examples.
The library maintains the theoretical guarantees from the foundational paper while adding practical optimizations:
- Construction: O(|W|) time for automaton construction (Theorem 5.2.1)
- Query: O(|D|) time for dictionary traversal where |D| is total edges (Chapter 3)
- Space: O(|W|) states for fixed error bound n (Theorem 4.0.32)
vs Naive Approach: Computing Levenshtein distance for every dictionary entry requires O(|D| × |W| × |V|) operations. This library achieves 100-1000× speedup on large dictionaries by avoiding per-word distance calculations through automata-guided search.
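For contrast, here is a minimal sketch of that naive baseline: a full Wagner-Fischer table recomputed for every dictionary entry.

```rust
/// Full distance computation between w and v: O(|w| × |v|) work per call.
fn levenshtein(w: &str, v: &str) -> usize {
    let (w, v): (Vec<char>, Vec<char>) = (w.chars().collect(), v.chars().collect());
    let mut prev: Vec<usize> = (0..=v.len()).collect();
    for (i, &wc) in w.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &vc) in v.iter().enumerate() {
            let cost = if wc == vc { 0 } else { 1 };
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[v.len()]
}

/// Naive query: recompute the distance for every word, the O(|D| × |W| × |V|)
/// cost that automaton-guided search avoids.
fn naive_query<'a>(dict: &[&'a str], w: &str, n: usize) -> Vec<&'a str> {
    dict.iter().copied().filter(|v| levenshtein(w, v) <= n).collect()
}

fn main() {
    let dict = ["test", "testing", "tested", "tester"];
    assert_eq!(naive_query(&dict, "tset", 2), vec!["test"]);
}
```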
Beyond the algorithmic foundations, the implementation includes:
- Arc Path Sharing: Eliminated expensive cloning operations during traversal
- StatePool: Object pool pattern for state reuse with exceptional performance gains
- SmallVec Integration: Stack-allocated vectors reduce heap allocation pressure
- Lazy Edge Iteration: 15-50% faster PathMap edge iteration with zero-copy implementation
- Aggressive Inlining: Hot path functions inlined for optimal performance
Benchmarks show 3.3x speedup for DAWG operations and 5-18% improvements across filtering/prefix scenarios.
When compiled with the simd feature on x86_64 platforms, the library achieves 20-64% performance gains through:
- 8 SIMD-optimized components across critical performance paths
- AVX2/SSE4.1 implementations with runtime CPU feature detection
- Data-driven threshold tuning based on empirical benchmarking
- Automatic fallback to scalar implementation when SIMD unavailable
Key optimizations:
- Characteristic vector operations (vectorized bit manipulation)
- Position subsumption checking (parallel state comparisons)
- State operations (vectorized distance computations)
- Dictionary edge lookups (batched queries with adaptive thresholds)
- Distance matrix computation (vectorized row/column operations)
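The runtime-detection-plus-fallback pattern looks roughly like the sketch below; it is illustrative only, using a trivial population count in place of the library's actual SIMD kernels.

```rust
// Dispatch pattern: check CPU features once at runtime, fall back to scalar code.
#[cfg(target_arch = "x86_64")]
fn popcount_bits(bits: &[u64]) -> u32 {
    if is_x86_feature_detected!("avx2") {
        // The runtime check above guarantees AVX2 support, so this call is sound.
        unsafe { popcount_bits_avx2(bits) }
    } else {
        popcount_bits_scalar(bits)
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn popcount_bits(bits: &[u64]) -> u32 {
    popcount_bits_scalar(bits)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn popcount_bits_avx2(bits: &[u64]) -> u32 {
    // A real kernel would use AVX2 intrinsics here; the scalar body keeps the
    // sketch short while showing where the vectorized path lives.
    popcount_bits_scalar(bits)
}

fn popcount_bits_scalar(bits: &[u64]) -> u32 {
    bits.iter().map(|b| b.count_ones()).sum()
}

fn main() {
    assert_eq!(popcount_bits(&[0b1011, u64::MAX]), 3 + 64);
}
```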
Performance improvements vary by workload:
- Small dictionaries (< 1,000 terms): 20-30% faster
- Medium dictionaries (1,000-10,000 terms): 30-45% faster
- Large dictionaries (> 10,000 terms): 45-64% faster
See docs/analysis/ for detailed SIMD performance analysis (950+ lines of documentation).
Character-level dictionary variants (*Char) for Unicode support:
- ~5% overhead for UTF-8 decoding during traversal
- 4x memory for edge labels (4 bytes per `char` vs 1 byte per `u8`)
- Zero-cost abstraction via monomorphization (no runtime polymorphism)
- Same query performance characteristics as byte-level variants
When to use:
- ✅ Use `*Char` variants for multi-language dictionaries with non-ASCII Unicode
- ✅ Use byte-level variants (`DoubleArrayTrie`, `PathMapDictionary`) for ASCII/Latin-1 content
This library includes extensive research documentation on potential enhancements:
Support for restricted substitutions where only specific character pairs can be substituted:
Use Cases:
- Keyboard proximity - Only adjacent keys allowed (QWERTY/AZERTY/Dvorak layouts)
- Example: 'a' ↔ 's' ↔ 'd' (adjacent on QWERTY)
- OCR error modeling - Visual similarity constraints
- Example: 'l' ↔ 'I' ↔ '1', 'O' ↔ '0', 'rn' ↔ 'm'
- Phonetic matching - Sound-alike rules
- Example: 'f' ↔ 'ph', 'k' ↔ 'c', 's' ↔ 'z'
Theory: Based on "Universal Levenshtein Automata. Building and Properties" (Mitankin, Mihov, Schulz, 2005). Maintains O(|W|) construction complexity with restricted substitution set S ⊆ Σ × Σ.
Status: Research complete, implementation planned. See Universal Levenshtein Automata Documentation for complete details.
Support for variable operation costs through discretization:
Use Cases:
- Keyboard distance costs - Closer keys cost less (adjacent: 0.5, same row: 1.0, different rows: 1.5)
- Context-dependent costs - Position-aware error penalties (beginning/end of word)
- Frequency-based costs - Common errors cost less than rare ones
Complexity: O(|W| × max_cost/precision) through cost discretization—still linear in query length when cost range and precision are fixed.
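A small worked sketch of the discretization step (illustrative only; no weighted API exists yet): with precision 0.5, keyboard-distance costs of 0.5 / 1.0 / 1.5 become integer costs 1 / 2 / 3, and an error budget of 2.0 becomes 4 discrete units, which keeps construction linear in |W| for a fixed cost range and precision.

```rust
// Hypothetical discretization: scale fractional operation costs to integers.
fn discretize(cost: f64, precision: f64) -> u32 {
    (cost / precision).round() as u32
}

fn main() {
    let precision = 0.5;
    assert_eq!(discretize(0.5, precision), 1); // adjacent key
    assert_eq!(discretize(1.0, precision), 2); // same row
    assert_eq!(discretize(1.5, precision), 3); // different rows
    assert_eq!(discretize(2.0, precision), 4); // error budget of 2.0 → 4 units
}
```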
Status: Methodology documented, implementation research phase. See Weighted Levenshtein Automata Research for feasibility analysis and implementation approach.
Interested in implementing these features or researching new extensions?
The research documentation provides complete theoretical foundations and implementation roadmaps for extending the library.
Licensed under the Apache License, Version 2.0. See LICENSE for details.