suggester: Switch from UTF-32 to UTF-8 in ngram module
Nuspell switches to UTF-32 for the ngram part of the suggester. This makes plenty of the metrics easier to calculate since, for example, `s[3]` is the third character in a UTF-32 string. (Not so with UTF-8: indexing into a UTF-8 string is a byte index and not necessarily a character boundary.) UTF-32 in Rust is not very well supported compared to UTF-8. Instead of a `String` you use a `Vec<char>`, and instead of `&str` you have `&[char]`. These alternatives do not have as well-optimized routines as UTF-8, especially when it comes to the standard library's unstable `Pattern` trait. The standard library will use the platform `memcmp` and the `memchr` crate for operations like `[u8]::eq` and `[u8]::contains` respectively, which far outperform the generic/dumb `[T]::eq` or `[T]::starts_with`.

The most expensive part of ngram analysis seems to be the first step: iterating over the whole word list and comparing the input word with the basic `ngram_similarity` function. A flamegraph reveals that we spend a huge amount of time in `contains_slice`, a generic but dumb routine called by `ngram_similarity` in a loop. It's a function that finds whether a slice contains a subslice and emulates Nuspell's use of `std::basic_string_view<char32_t>::find`. Ultimately this was a lot of `[char]::starts_with`, which is fairly slow relative to `str::contains` with a `str` pattern.

The `ngram` module was a bit special before this commit because it eagerly converted most `str`s into UTF-32: `Vec<char>`s. That matched Nuspell and, as mentioned above, made some calculations easier/dumber. But the optimizations in the standard library for UTF-8 are undeniable. This commit decreases the total time to `suggest` for a rough word like "exmaple" by 25%.

"exmaple" is tough because it contains two 'e's. 'E' is super common in English, so the `ngram_similarity` function ends up working relatively harder for "exmaple". The same goes for other words with multiple common chars or common stem substrings. The reason is that `ngram_similarity` has a fast lane to break out of looping when it notices that a word is quite dissimilar. It's a kind of "layered cake": for a `left` and `right` string, you first find any k=1 k-grams of `left` in `right`, which is a fancy way of saying you find any `char`s in `left` that are in `right`. If there is more than one match you move on to k=2: find any substrings of `right` that match any two-character window in `left`. So the substrings you search for increase in size:

    k=1: "e", "x", "m", "a", "p", "l", "e"
    k=2: "ex", "xm", "ma", "ap", "pl", "le"
    k=3: "exm", "xma", "map", "apl", "ple"
    ...

You may break out of the loop at a low `k` if your words are dissimilar. Words with multiple common letters, though, are unlikely to break out early for your average other stem in the dictionary. (A simplified sketch of this layered loop appears at the end of this message.) All of this is to say that checking whether `right` contains a given subslice of `left` is central to this module, and even more so in degenerate cases. So I believe focusing on UTF-8 here is worth the extra complexity of dealing with byte indices.

---

To make this possible, this module adds a wrapper struct around `&str`, `CharsStr`:

```rust
struct CharsStr<'a, 'i> {
    inner: &'a str,
    char_indices: &'i [u16],
}
```

This eagerly computes the `str::char_indices` in `CharsStr::new`, borrowing a `&'i mut Vec<u16>`'s allocation. So this is hopefully about as expensive as converting each stem or expanded string to UTF-32 in a reused `Vec<char>`, since we need to iterate over the chars anyways and allocate per char.
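For illustration, here is a minimal sketch of what that eager construction could look like; the real implementation may differ, and the `char_len`/`char_slice` helpers are hypothetical names:

```rust
impl<'a, 'i> CharsStr<'a, 'i> {
    /// Sketch: record the byte offset of every character boundary (plus the
    /// total byte length) into a reused buffer, borrowing its allocation.
    /// The u16 offsets follow the struct definition above.
    fn new(s: &'a str, buffer: &'i mut Vec<u16>) -> Self {
        buffer.clear();
        buffer.extend(s.char_indices().map(|(byte_idx, _)| byte_idx as u16));
        buffer.push(s.len() as u16);
        Self {
            inner: s,
            char_indices: buffer.as_slice(),
        }
    }

    /// Length in characters rather than bytes.
    fn char_len(&self) -> usize {
        self.char_indices.len() - 1
    }

    /// Hypothetical helper: the UTF-8 slice covering the characters at
    /// `start..end`, usable with `str::contains` and friends.
    fn char_slice(&self, start: usize, end: usize) -> &'a str {
        &self.inner[self.char_indices[start] as usize..self.char_indices[end] as usize]
    }
}
```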
Unlike converting to UTF-32, though, we retain (and do not duplicate) the UTF-8 representation, allowing us to take advantage of the standard library's string searching optimizations. Hopefully this is also thriftier with memory.
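As a rough illustration (not code from this commit) of why that matters: a generic subslice search over `&[char]` has to scan window by window, while the UTF-8 side can hand the same query straight to `str::contains`:

```rust
/// A generic window-by-window subslice search, roughly the shape of the old
/// `contains_slice` over `&[char]`: no memcmp/memchr to lean on.
fn contains_slice(haystack: &[char], needle: &[char]) -> bool {
    needle.is_empty() || haystack.windows(needle.len()).any(|window| window == needle)
}

fn main() {
    let right: Vec<char> = "example".chars().collect();
    let needle: Vec<char> = "ple".chars().collect();
    assert!(contains_slice(&right, &needle));

    // The UTF-8 equivalent uses the standard library's optimized
    // substring search directly.
    assert!("example".contains("ple"));
}
```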
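And here is the simplified layered loop referenced above. It only counts matching k-grams and bails out when a layer finds nothing; the real `ngram_similarity` uses its own cutoff and weighting, and `ngram_similarity_sketch` is a made-up name:

```rust
/// Simplified sketch of the layered k-gram search: count how many
/// k-character windows of `left` occur in `right`, growing k layer by layer.
fn ngram_similarity_sketch(left: &str, right: &str) -> usize {
    let left_chars: Vec<char> = left.chars().collect();
    let mut score = 0;

    for k in 1..=left_chars.len() {
        let mut matches = 0;
        for window in left_chars.windows(k) {
            let kgram: String = window.iter().collect();
            if right.contains(kgram.as_str()) {
                matches += 1;
            }
        }
        score += matches;
        // Fast lane: if no k-gram of this length matched, no longer k-gram
        // can match either, so dissimilar words bail out at a low `k`.
        // (The real function's exact cutoff is not reproduced here.)
        if matches == 0 {
            break;
        }
    }

    score
}

fn main() {
    // "exmaple" shares many k-grams with "example" but almost none with "zzz".
    assert!(ngram_similarity_sketch("exmaple", "example") > ngram_similarity_sketch("exmaple", "zzz"));
}
```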