suggester: Switch from UTF-32 to UTF-8 in ngram module
Nuspell switches to UTF-32 for the ngram part of the suggester. This
makes many of the metrics easier to calculate since, for example,
`s[3]` is the character at index 3 in a UTF-32 string. (Not so with
UTF-8: indexing into a UTF-8 string uses byte indices, and a byte index
is not necessarily a character boundary.)
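
For example (a standalone sketch, not code from this crate), reading the character at a given index is a direct slice index with UTF-32 but requires walking character boundaries with UTF-8:

```rust
fn main() {
    // UTF-32: the character at index 1 is simply `s[1]`.
    let utf32: Vec<char> = "héllo".chars().collect();
    assert_eq!(utf32[1], 'é');

    // UTF-8: indices are byte offsets. Byte 2 falls in the middle of the
    // two-byte encoding of 'é', so it is not a character boundary and
    // `&utf8[2..3]` would panic. Reaching the second character means
    // walking the string's chars instead.
    let utf8 = "héllo";
    assert!(!utf8.is_char_boundary(2));
    assert_eq!(utf8.chars().nth(1), Some('é'));
}
```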

UTF-32 in Rust is not nearly as well supported as UTF-8. Instead of a
`String` you use a `Vec<char>`, and instead of `&str` you have
`&[char]`. These alternatives do not have routines as well optimized as
their UTF-8 counterparts, especially when it comes to searches via the
standard library's unstable `Pattern` trait, which only works on `str`
haystacks. The standard library will use platform `memcmp` and its
internal `memchr` routine for operations like `[u8]::eq` and
`[u8]::contains` respectively, which far outperform the generic
`[T]::eq` or `[T]::starts_with`.

The most expensive part of ngram analysis seems to be the first step:
iterating over the whole word list and comparing the input word with the
basic `ngram_similarity` function. A flamegraph revealed that we spent
a huge amount of time in `contains_slice`, a generic but naive
routine called by `ngram_similarity` in a loop. It checks whether a
slice contains a given subslice and emulates Nuspell's use of
`std::basic_string_view<char32_t>::find`. Ultimately this boiled down to
a lot of `[char]::starts_with`, which is fairly slow relative to
`str::contains` with a `str` pattern.
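
Roughly the shape of that routine (a simplified sketch; the real `contains_slice` may differ in details), next to the UTF-8 one-liner it competes with:

```rust
// A hand-rolled subslice search over UTF-32: check every starting
// position with `[char]::starts_with`. Simple, but it gets none of the
// specialized machinery behind `str` substring search.
fn contains_slice(haystack: &[char], needle: &[char]) -> bool {
    if needle.is_empty() {
        return true;
    }
    if needle.len() > haystack.len() {
        return false;
    }
    (0..=haystack.len() - needle.len()).any(|i| haystack[i..].starts_with(needle))
}

fn main() {
    let haystack: Vec<char> = "exmaple".chars().collect();
    let needle: Vec<char> = "map".chars().collect();
    assert!(contains_slice(&haystack, &needle));

    // The UTF-8 equivalent delegates to the standard library's optimized
    // substring searcher.
    assert!("exmaple".contains("map"));
}
```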

The `ngram` module was a bit special before this commit because it
eagerly converted most `str`s into UTF-32 (`Vec<char>`s). That matched
Nuspell and, as mentioned above, made some calculations easier/dumber.
But the optimizations in the standard library for UTF-8 are undeniable.

This commit decreases the total time to `suggest` for a rough word like
"exmaple" by 25%.

"exmaple" is tough because it contains two 'e's. 'E' is super common in
English, so the `ngram_similarity` function ends up working relatively
harder for "exmaple". Same for other words with multiple common chars or
common stem substrings. The reason is that `ngram_similarity` has a fast
lane to break out of looping when it notices that a word is quite
dissimilar. It's a kind of "layered cake" - for a `left` and `right`
string, you first find any k=1 kgrams of `left` in `right` and that's a
fancy way of saying you find any `char`s in `left` that are in `right`.
If there is more than one match you move onto k=2: find any substrings
of `right` that match any two-character window in `left`. So the
substrings you search for increase in size:

    k=1: "e", "x", "m", "a", "p", "l", "e"
    k=2: "ex", "xm", "ma", "ap", "pl", "le"
    k=3: "exm", "xma", "map", "apl", "ple"
    ...

You may break out of the loop at a low `k` if the two words are
dissimilar. For words with multiple common letters, though, the loop is
unlikely to break out early against your average stem in the dictionary.
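
A simplified sketch of that layered loop (illustrative only; the real `ngram_similarity` also applies weighting and other refinements on top of this skeleton):

```rust
/// Counts k-gram matches of `left` in `right` for k = 1..=n, stopping
/// early once a layer finds at most one match.
fn ngram_similarity(n: usize, left: &str, right: &str) -> usize {
    let left_chars: Vec<char> = left.chars().collect();
    let mut score = 0;
    for k in 1..=n.min(left_chars.len()) {
        let mut k_score = 0;
        // Every k-character window of `left`...
        for window in left_chars.windows(k) {
            let window: String = window.iter().collect();
            // ...scores a point if it appears anywhere in `right`.
            if right.contains(window.as_str()) {
                k_score += 1;
            }
        }
        score += k_score;
        // The fast lane: a layer with fewer than two matches means larger
        // k-grams cannot do any better, so bail out for dissimilar words.
        if k_score < 2 {
            break;
        }
    }
    score
}

fn main() {
    // "example" shares most of its letters and substrings with "exmaple",
    // so the loop climbs through several values of k; "zurich" shares
    // none and stops after k=1.
    assert!(ngram_similarity(3, "exmaple", "example") > ngram_similarity(3, "exmaple", "zurich"));
}
```

For a dissimilar stem the k=1 layer already fails to find two matches and the loop stops; for stems sharing several common letters with "exmaple" the loop keeps climbing to larger k, so the subslice search dominates.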

All of this is to say that checking whether `right` contains a given
subslice of `left` is central to this module and even more so in
degenerate cases. So I believe focusing on UTF-8 here is worth the extra
complexity of dealing with byte indices.

---

To make this possible, the module adds `CharsStr`, a wrapper struct
around `&str`:

```rust
struct CharsStr<'a, 'i> {
    inner: &'a str,
    char_indices: &'i [u16]
}
```

`CharsStr::new` eagerly computes the `str::char_indices`, borrowing a
`&'i mut Vec<u16>`'s allocation. This is hopefully about as expensive as
converting each stem or expanded string to UTF-32 in a reused
`Vec<char>`: we need to iterate over the chars either way, and we store
one index per char instead of one `char` per char. Unlike a conversion
to UTF-32, though, we retain (and do not duplicate) the UTF-8
representation, which lets us take advantage of the standard library's
string-searching optimizations. Hopefully this is also thriftier with
memory.
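
A sketch of how construction and character-indexed slicing can work with that layout (the `char_len` and `char_slice` helpers here are illustrative, not necessarily this crate's API):

```rust
use std::ops::Range;

// The struct from above, repeated so this sketch is self-contained.
struct CharsStr<'a, 'i> {
    inner: &'a str,
    char_indices: &'i [u16],
}

impl<'a, 'i> CharsStr<'a, 'i> {
    /// Records the byte offset of every character up front, reusing
    /// `buffer`'s allocation. Assumes words are short enough for their
    /// byte offsets to fit in a `u16`.
    fn new(inner: &'a str, buffer: &'i mut Vec<u16>) -> Self {
        buffer.clear();
        buffer.extend(inner.char_indices().map(|(byte_idx, _)| byte_idx as u16));
        Self { inner, char_indices: buffer }
    }

    /// Length in characters rather than bytes.
    fn char_len(&self) -> usize {
        self.char_indices.len()
    }

    /// Slices by character positions, translating them to byte offsets.
    fn char_slice(&self, range: Range<usize>) -> &'a str {
        let start = self.char_indices[range.start] as usize;
        let end = if range.end == self.char_len() {
            self.inner.len()
        } else {
            self.char_indices[range.end] as usize
        };
        &self.inner[start..end]
    }
}

fn main() {
    let mut buffer = Vec::new();
    let word = CharsStr::new("exmaple", &mut buffer);
    assert_eq!(word.char_len(), 7);
    // The k=3 window starting at character index 2 is "map".
    assert_eq!(word.char_slice(2..5), "map");
}
```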
the-mikedavis committed Nov 13, 2024
1 parent b178970 commit 8b80d25
Showing 3 changed files with 310 additions and 205 deletions.
README.md (1 change: 1 addition & 0 deletions)
```diff
@@ -102,6 +102,7 @@ COMPOUNDRULE n*mp
 
 * [`@zverok`]'s [blog series on rebuilding Hunspell][zverok-blog] was an invaluable resource during early prototypes. The old [`spylls`](https://github.com/zverok/spylls)-like prototype can be found on the `spylls` branch.
 * Ultimately [Nuspell](https://github.com/nuspell/nuspell)'s codebase became the reference for Spellbook though as C++ idioms mesh better with Rust than Python's. Nuspell's code is in great shape and is much more readable than Hunspell so for now Spellbook is essentially a Rust rewrite of Nuspell (though we may diverge in the future).
+* There are a few ways Spellbook diverges from Nuspell. Mostly this relates to data structures like using [`hashbrown`] instead of a custom hash table implementation or German strings for stems and flagsets (see the internal doc). Another difference is that Spellbook uses UTF-8 when calculating ngram suggestions rather than UTF-32; the results are the same but this performs better given the Rust standard library's optimizations for UTF-8.
 * The parser for `.dic` and `.aff` files is loosely based on [ZSpell](https://github.com/pluots/zspell).
 
 [`hashbrown`]: https://github.com/rust-lang/hashbrown
```
src/aff.rs (15 changes: 0 additions & 15 deletions)
```diff
@@ -1115,21 +1115,6 @@ impl CaseHandling {
         }
     }
 
-    pub fn lowercase_into_utf32(&self, word: &str, out: &mut Vec<char>) {
-        out.extend(
-            word.chars()
-                .map(match self {
-                    Self::Turkic => |ch| match ch {
-                        'I' => 'ı',
-                        'İ' => 'i',
-                        _ => ch,
-                    },
-                    Self::Standard => |ch| ch,
-                })
-                .flat_map(|ch| ch.to_lowercase()),
-        )
-    }
-
     pub fn uppercase(&self, word: &str) -> String {
         match self {
             Self::Turkic => word.replace('i', "İ").replace('ı', "I").to_uppercase(),
```
