improve Spelling rule by using word boundaries #900

emteelb · 2024-11-12T14:49:23Z

By using regex word boundary (\b) delimiters, the spelling rule applies to individual words rather than a word that might contain the regex filter. For example, "\b[cC]he\b" will match only "che" and "Che" rather than a regex filter without word boundary delimiters, for example, "[cC]he" that would match misspelled words that contain the regex, such as "aache" or "chemitsry".

Closes #894

github-actions · 2024-11-12T14:49:36Z

Deploy PR Preview failed.

emteelb · 2024-11-12T14:52:10Z

This is a "heavy-handed" approach where word boundary delimiters are used for every filter regular expression, for consistency, even those filters that might be unambiguous and unlikely to be contained within other words.

ccoVeille · 2024-11-12T20:03:51Z

I would like to drop my thought on this

Some rules comes with nonwords flag.

https://vale.sh/docs/topics/styles/

I don't remember whether spelling rule supports it.

nonword:true is about removing the automatic \b that are added around each words with the default value nonword:false.

So if you use nonword:true, you have to use the \b to surround word, unless you want to catch words with a prefix or suffix, so you use only a \b.

It is also useful when you want to catch multiple words \bun\-[a-z]+\b or punctuation such as \bvs[\., ]\b

So adding \b somewhere can be useless, if you are using nonword:true AND there are some rules where you don't use \b on both side. Otherwise, you can use the default nonword:false and avoid repeating the \b everywhere

Let's go back to this PR, I don't see a need to add \b on both side. Also, I don't think it works as expected, especially because I don't see a nonword:true flag

Finally, the spelling rules doesn't invent things, if you add foo, it won't allow foobar. I mean if you add smally (just an example that is not in the dictionary), you want allow eyesmally or smallyfish

So I don't see what you are trying to fix, and things are getting very complicated...

emteelb · 2024-11-13T13:39:13Z

I did not know about the nonword key for style rules. Thanks for pointing that out. I notice it's shown in the link you provided but when I search Vale documentation for more information, there are no results. If you know where this key is documented, lmk.

Try running the following with styles/RedHat/Spelling.yml as it is now in the main branch:

vale "This is a mistache."

With both RedHat.Spelling = YES and Vale.Spelling = YES in my .vale.ini configuration file, my results are:

 stdin.txt
 1:11  error  Did you really mean             Vale.Spelling
              'mistache'?                                  

✖ 1 error, 0 warnings and 0 suggestions in stdin.

Neither adding nonword: true nor nonword: false to the RH spelling style rule changes this result so I gather that the spelling style rule either does not support this key or else the key does something different that does not change the result.

If I change the "[cC]he" filter in the style rule to match what I have in this PR ("\b[cC]he\b"), and enter the previous vale command, my result is:

 stdin.txt
 1:11  warning  Verify the word 'mistache'. It  RedHat.Spelling
                is not in the American English                 
                spelling dictionary used by                    
                Vale.                                          
 1:11  error    Did you really mean             Vale.Spelling  
                'mistache'?                                    

✖ 1 error, 1 warning and 0 suggestions in stdin.

as would be expected.

Granted, "mistache" is an unlikely typo but I originally stumbled upon this issue when reviewing some writing where someone had a simple transposition mistake, "subcommnad", and because of the RH spelling style rule filter ("[Ss]u"), it wasn't caught (I normally enable only one spelling style rule, RH in this case, and disable the Vale spelling style rule).

As I said in a comment above, I took a heavy-handed approach in this PR by delimiting all the filters with word boundaries. I am open to a more selective approach and that might be something the RH team considers. This was just the easiest approach to be certain not to miss any potential word partials.

ccoVeille · 2024-11-17T17:57:37Z

thanks for you reply and tests.

Let me ask for help.

@jdkato what would you recommend here?

What do you think about what I wrote in my previous post?

jdkato · 2025-01-03T21:40:00Z

I think this PR solves the problem outlined in #894 sufficiently.

However, this isn't how I personally would go about handling spelling exceptions.

My general practices are:

Filters should be used for ignoring large classes of words that you expect to be false positives. See the built-in ones for examples.
Case-insensitive exceptions should be in a dictionary.
Case-sensitive exceptions should be in a vocabulary, providing the added benefit of ensuring they're ignored during spell check but always capitalized correctly.

emteelb · 2025-02-07T13:25:55Z

Thank you @jdkato for the thoughts about this PR. @ccoVeille any opinions or directions from the Red Hat team about which direction to take this PR? Leave as is or go down the mentioned general practices road, which would require (I think) doing things differently in other places (dictionary, vocabulary)?

ccoVeille · 2025-02-07T22:52:00Z

Hi @emteelb

If you asked me the same question in a very short delay after @jdkato replied, I would have say it depends on your expectations in term of delay. And the long term direction should be the one @jdkato mentioned.

But, the PR is opened for 3 months, so I guess the issue you are trying to fix is not urgent.

So, it means you have time to fix it the right way.

The fact you are on a large codebase with many contributors could be an issue.

So it's a matter of pro and cons you will have to figure with this project maintainers

emteelb · 2025-02-10T13:29:49Z

Is there a project maintainer on the Red Hat team who can provide direction for this PR? Perhaps @aireilly or if not, @aireilly might alert an appropriate person?

aireilly · 2025-02-10T15:04:26Z

Regarding the problem at hand, I would be inclined to fix the rule for just those rules terms that are known or likely to give false positives, for example "\b[cC]he\b", or other small words that might be found in larger terms.

"\b[dD]efragmentation\b" from the PR is IMO not likely to form part of a larger word for example. Add /b delimters to all terms also makes the file less user/reader friendly for updates.

By using regex word boundary (\b) delimiters, the spelling rule applies to individual words rather than a word that might contain the regex filter. For example, `\b[cC]he\b` will match only "che" and "Che" rather than a regex filter without word boundary delimiters, for example, `[cC]he` that would match misspelled words that contain the regex, such as "aache" or "chemitsry". This commit also combines multiple related filters that share a common word base, for example, a single filter `"[bB]reakpoint(s)?` rather than `[bB]reakpoint` and `[bB]reakpoints`.

emteelb · 2025-02-11T19:45:42Z

@aireilly Had another go at this, trying to keep in mind your guidance. I also tried to economize and combined some of the filters, for example:

  - "[bB]reakpoint(s)?"

rather than

  - "[bB]reakpoint"
  - "[bB]reakpoints"

In the existing RH Spelling style rule, "[bB]reakpoints" is a redundant filter. "[bB]reakpoint" already matches it, because it is not delimited by word boundaries.

Also, just something to be aware of and perhaps pointing to @jdkato 's advice for a different long-term solution: Without word boundary delimiters around filters, the spelling rule will not catch slippery finger mistakes such as "defragmentationn".

aireilly · 2025-02-17T09:50:47Z

Great work @emteelb ❇️

Can you PTAL at the test fixtures and update accordingly:

ERROR: 17 in .vale/fixtures/RedHat/Spelling/testvalid.adoc for .vale/styles/RedHat/Spelling.yml

emteelb · 2025-02-18T14:38:30Z

@aireilly Thanks for the feedback. Took a look at the shell script that runs to validate RH style rules against corresponding valid/invalid files w/i the fixtures dir and which currently has 17 errors reported for Spelling.yml. To pass the script check here, I would need to modify this pull request to return "combined" filter regex lines back to separate lines, and add word boundary delimiters to the shorter word filters.

For example:

  - "[bB]ackport(ed)?"

in the pull request would become:

  - "\b[bB]ackport\b"
  - "[bB]ackported"

An alternative approach would be to create a separate pull request that would change the shell script to use the regex found in the filters or tokens in a given style rule file and keep a running counter against matches in corresponding fixture files for the style rule. Then compare the running counter number against the wc -l number for the corresponding fixture files.

That's a vague idea. Hopefully the meaning is clear. Do you prefer one or the other approach? Something different?

aireilly · 2025-02-18T19:26:41Z

@emteelb if you're happy to review/modify the shell script please go ahead in this PR :)

Do whatever you think is best, happy to review. Go for the path of least resistance :)

emteelb changed the title ~~improve Spelling rules by using word boundaries~~ improve Spelling rule by using word boundaries Nov 12, 2024

emteelb force-pushed the main branch from 040af1d to f201a4e Compare February 11, 2025 16:35

emteelb marked this pull request as draft February 11, 2025 16:57

emteelb force-pushed the main branch from f201a4e to f63d460 Compare February 11, 2025 19:27

emteelb force-pushed the main branch from f63d460 to ecafab6 Compare February 11, 2025 19:44

emteelb marked this pull request as ready for review February 11, 2025 19:45

aireilly mentioned this pull request Feb 17, 2025

Apply best practice to spelling rules #916

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

improve Spelling rule by using word boundaries #900

improve Spelling rule by using word boundaries #900

emteelb commented Nov 12, 2024

github-actions bot commented Nov 12, 2024 •

edited

Loading

emteelb commented Nov 12, 2024

ccoVeille commented Nov 12, 2024 •

edited

Loading

emteelb commented Nov 13, 2024 •

edited

Loading

ccoVeille commented Nov 17, 2024

jdkato commented Jan 3, 2025

emteelb commented Feb 7, 2025

ccoVeille commented Feb 7, 2025 •

edited

Loading

emteelb commented Feb 10, 2025

aireilly commented Feb 10, 2025

emteelb commented Feb 11, 2025

aireilly commented Feb 17, 2025

emteelb commented Feb 18, 2025

aireilly commented Feb 18, 2025 •

edited

Loading

improve Spelling rule by using word boundaries #900

Are you sure you want to change the base?

improve Spelling rule by using word boundaries #900

Conversation

emteelb commented Nov 12, 2024

github-actions bot commented Nov 12, 2024 • edited Loading

emteelb commented Nov 12, 2024

ccoVeille commented Nov 12, 2024 • edited Loading

emteelb commented Nov 13, 2024 • edited Loading

ccoVeille commented Nov 17, 2024

jdkato commented Jan 3, 2025

emteelb commented Feb 7, 2025

ccoVeille commented Feb 7, 2025 • edited Loading

emteelb commented Feb 10, 2025

aireilly commented Feb 10, 2025

emteelb commented Feb 11, 2025

aireilly commented Feb 17, 2025

emteelb commented Feb 18, 2025

aireilly commented Feb 18, 2025 • edited Loading

github-actions bot commented Nov 12, 2024 •

edited

Loading

ccoVeille commented Nov 12, 2024 •

edited

Loading

emteelb commented Nov 13, 2024 •

edited

Loading

ccoVeille commented Feb 7, 2025 •

edited

Loading

aireilly commented Feb 18, 2025 •

edited

Loading