suggester: Switch from UTF-32 to UTF-8 in ngram module
Nuspell switches to UTF-32 for the ngram part of the suggester. This
makes many of the metrics easier to calculate since, for example,
`s[3]` is the character at index 3 in a UTF-32 string. (Not so with
UTF-8: indexing into a UTF-8 string uses byte indices, and a byte index
is not necessarily a character boundary.)
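
For example (a standalone sketch, not code from this crate), reading the character at a given index is a direct slice index with UTF-32 but requires walking character boundaries with UTF-8:

```rust
fn main() {
    // UTF-32: the character at index 1 is simply `s[1]`.
    let utf32: Vec<char> = "héllo".chars().collect();
    assert_eq!(utf32[1], 'é');

    // UTF-8: indices are byte offsets. Byte 2 falls in the middle of the
    // two-byte encoding of 'é', so it is not a character boundary and
    // `&utf8[2..3]` would panic. Reaching the second character means
    // walking the string's chars instead.
    let utf8 = "héllo";
    assert!(!utf8.is_char_boundary(2));
    assert_eq!(utf8.chars().nth(1), Some('é'));
}
```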

UTF-32 in Rust is not nearly as well supported as UTF-8. Instead of a
`String` you use a `Vec<char>`, and instead of `&str` you have
`&[char]`. These alternatives do not have routines as well optimized as
their UTF-8 counterparts, especially when it comes to searches via the
standard library's unstable `Pattern` trait, which only works on `str`
haystacks. The standard library will use platform `memcmp` and its
internal `memchr` routine for operations like `[u8]::eq` and
`[u8]::contains` respectively, which far outperform the generic
`[T]::eq` or `[T]::starts_with`.

The most expensive part of ngram analysis seems to be the first step:
iterating over the whole word list and comparing the input word with the
basic `ngram_similarity` function. A flamegraph revealed that we spent
a huge amount of time in `contains_slice`, a generic but naive
routine called by `ngram_similarity` in a loop. It checks whether a
slice contains a given subslice and emulates Nuspell's use of
`std::basic_string_view<char32_t>::find`. Ultimately this boiled down to
a lot of `[char]::starts_with`, which is fairly slow relative to
`str::contains` with a `str` pattern.
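
Roughly the shape of that routine (a simplified sketch; the real `contains_slice` may differ in details), next to the UTF-8 one-liner it competes with:

```rust
// A hand-rolled subslice search over UTF-32: check every starting
// position with `[char]::starts_with`. Simple, but it gets none of the
// specialized machinery behind `str` substring search.
fn contains_slice(haystack: &[char], needle: &[char]) -> bool {
    if needle.is_empty() {
        return true;
    }
    if needle.len() > haystack.len() {
        return false;
    }
    (0..=haystack.len() - needle.len()).any(|i| haystack[i..].starts_with(needle))
}

fn main() {
    let haystack: Vec<char> = "exmaple".chars().collect();
    let needle: Vec<char> = "map".chars().collect();
    assert!(contains_slice(&haystack, &needle));

    // The UTF-8 equivalent delegates to the standard library's optimized
    // substring searcher.
    assert!("exmaple".contains("map"));
}
```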

The `ngram` module was a bit special before this commit because it
eagerly converted most `str`s into UTF-32 (`Vec<char>`s). That matched
Nuspell and, as mentioned above, made some calculations easier/dumber.
But the optimizations in the standard library for UTF-8 are undeniable.

This commit decreases the total time to `suggest` for a rough word like
"exmaple" by 25%.

"exmaple" is tough because it contains two 'e's. 'E' is super common in
English, so the `ngram_similarity` function ends up working relatively
harder for "exmaple". Same for other words with multiple common chars or
common stem substrings. The reason is that `ngram_similarity` has a fast
lane to break out of looping when it notices that a word is quite
dissimilar. It's a kind of "layered cake" - for a `left` and `right`
string, you first find any k=1 kgrams of `left` in `right` and that's a
fancy way of saying you find any `char`s in `left` that are in `right`.
If there is more than one match you move onto k=2: find any substrings
of `right` that match any two-character window in `left`. So the
substrings you search for increase in size:

    k=1: "e", "x", "m", "a", "p", "l", "e"
    k=2: "ex", "xm", "ma", "ap", "pl", "le"
    k=3: "exm", "xma", "map", "apl", "ple"
    ...

You may break out of the loop at a low `k` if the two words are
dissimilar. For words with multiple common letters, though, the loop is
unlikely to break out early against your average stem in the dictionary.
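
A simplified sketch of that layered loop (illustrative only; the real `ngram_similarity` also applies weighting and other refinements on top of this skeleton):

```rust
/// Counts k-gram matches of `left` in `right` for k = 1..=n, stopping
/// early once a layer finds at most one match.
fn ngram_similarity(n: usize, left: &str, right: &str) -> usize {
    let left_chars: Vec<char> = left.chars().collect();
    let mut score = 0;
    for k in 1..=n.min(left_chars.len()) {
        let mut k_score = 0;
        // Every k-character window of `left`...
        for window in left_chars.windows(k) {
            let window: String = window.iter().collect();
            // ...scores a point if it appears anywhere in `right`.
            if right.contains(window.as_str()) {
                k_score += 1;
            }
        }
        score += k_score;
        // The fast lane: a layer with fewer than two matches means larger
        // k-grams cannot do any better, so bail out for dissimilar words.
        if k_score < 2 {
            break;
        }
    }
    score
}

fn main() {
    // "example" shares most of its letters and substrings with "exmaple",
    // so the loop climbs through several values of k; "zurich" shares
    // none and stops after k=1.
    assert!(ngram_similarity(3, "exmaple", "example") > ngram_similarity(3, "exmaple", "zurich"));
}
```

For a dissimilar stem the k=1 layer already fails to find two matches and the loop stops; for stems sharing several common letters with "exmaple" the loop keeps climbing to larger k, so the subslice search dominates.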

All of this is to say that checking whether `right` contains a given
subslice of `left` is central to this module and even more so in
degenerate cases. So I believe focusing on UTF-8 here is worth the extra
complexity of dealing with byte indices.

---

To make this possible, the module adds `CharsStr`, a wrapper struct
around `&str`:

```rust
struct CharsStr<'a, 'i> {
    inner: &'a str,
    char_indices: &'i [u16]
}
```

`CharsStr::new` eagerly computes the `str::char_indices`, borrowing a
`&'i mut Vec<u16>`'s allocation. This is hopefully about as expensive as
converting each stem or expanded string to UTF-32 in a reused
`Vec<char>`: we need to iterate over the chars either way, and we store
one index per char instead of one `char` per char. Unlike a conversion
to UTF-32, though, we retain (and do not duplicate) the UTF-8
representation, which lets us take advantage of the standard library's
string-searching optimizations. Hopefully this is also thriftier with
memory.
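
A sketch of how construction and character-indexed slicing can work with that layout (the `char_len` and `char_slice` helpers here are illustrative, not necessarily this crate's API):

```rust
use std::ops::Range;

// The struct from above, repeated so this sketch is self-contained.
struct CharsStr<'a, 'i> {
    inner: &'a str,
    char_indices: &'i [u16],
}

impl<'a, 'i> CharsStr<'a, 'i> {
    /// Records the byte offset of every character up front, reusing
    /// `buffer`'s allocation. Assumes words are short enough for their
    /// byte offsets to fit in a `u16`.
    fn new(inner: &'a str, buffer: &'i mut Vec<u16>) -> Self {
        buffer.clear();
        buffer.extend(inner.char_indices().map(|(byte_idx, _)| byte_idx as u16));
        Self { inner, char_indices: buffer }
    }

    /// Length in characters rather than bytes.
    fn char_len(&self) -> usize {
        self.char_indices.len()
    }

    /// Slices by character positions, translating them to byte offsets.
    fn char_slice(&self, range: Range<usize>) -> &'a str {
        let start = self.char_indices[range.start] as usize;
        let end = if range.end == self.char_len() {
            self.inner.len()
        } else {
            self.char_indices[range.end] as usize
        };
        &self.inner[start..end]
    }
}

fn main() {
    let mut buffer = Vec::new();
    let word = CharsStr::new("exmaple", &mut buffer);
    assert_eq!(word.char_len(), 7);
    // The k=3 window starting at character index 2 is "map".
    assert_eq!(word.char_slice(2..5), "map");
}
```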
the-mikedavis committed Nov 13, 2024
1 parent b178970 commit 8b80d25
Showing 3 changed files with 310 additions and 205 deletions.
README.md (1 change: 1 addition & 0 deletions)
```diff
@@ -102,6 +102,7 @@ COMPOUNDRULE n*mp
 
 * [`@zverok`]'s [blog series on rebuilding Hunspell][zverok-blog] was an invaluable resource during early prototypes. The old [`spylls`](https://github.com/zverok/spylls)-like prototype can be found on the `spylls` branch.
 * Ultimately [Nuspell](https://github.com/nuspell/nuspell)'s codebase became the reference for Spellbook though as C++ idioms mesh better with Rust than Python's. Nuspell's code is in great shape and is much more readable than Hunspell so for now Spellbook is essentially a Rust rewrite of Nuspell (though we may diverge in the future).
+* There are a few ways Spellbook diverges from Nuspell. Mostly this relates to data structures like using [`hashbrown`] instead of a custom hash table implementation or German strings for stems and flagsets (see the internal doc). Another difference is that Spellbook uses UTF-8 when calculating ngram suggestions rather than UTF-32; the results are the same but this performs better given the Rust standard library's optimizations for UTF-8.
 * The parser for `.dic` and `.aff` files is loosely based on [ZSpell](https://github.com/pluots/zspell).
 
 [`hashbrown`]: https://github.com/rust-lang/hashbrown
```
src/aff.rs (15 changes: 0 additions & 15 deletions)
```diff
@@ -1115,21 +1115,6 @@ impl CaseHandling {
         }
     }
 
-    pub fn lowercase_into_utf32(&self, word: &str, out: &mut Vec<char>) {
-        out.extend(
-            word.chars()
-                .map(match self {
-                    Self::Turkic => |ch| match ch {
-                        'I' => 'ı',
-                        'İ' => 'i',
-                        _ => ch,
-                    },
-                    Self::Standard => |ch| ch,
-                })
-                .flat_map(|ch| ch.to_lowercase()),
-        )
-    }
-
     pub fn uppercase(&self, word: &str) -> String {
         match self {
             Self::Turkic => word.replace('i', "İ").replace('ı', "I").to_uppercase(),
```
