Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve Spelling rule by using word boundaries #900

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

emteelb
Copy link
Contributor

@emteelb emteelb commented Nov 12, 2024

By using regex word boundary (\b) delimiters, the spelling rule applies to individual words rather than a word that might contain the regex filter. For example, "\b[cC]he\b" will match only "che" and "Che" rather than a regex filter without word boundary delimiters, for example, "[cC]he" that would match misspelled words that contain the regex, such as "aache" or "chemitsry".

Closes #894

Copy link

github-actions bot commented Nov 12, 2024

Deploy PR Preview failed.

@emteelb
Copy link
Contributor Author

emteelb commented Nov 12, 2024

This is a "heavy-handed" approach where word boundary delimiters are used for every filter regular expression, for consistency, even those filters that might be unambiguous and unlikely to be contained within other words.

@emteelb emteelb changed the title improve Spelling rules by using word boundaries improve Spelling rule by using word boundaries Nov 12, 2024
@ccoVeille
Copy link

ccoVeille commented Nov 12, 2024

I would like to drop my thought on this

Some rules comes with nonwords flag.

https://vale.sh/docs/topics/styles/

I don't remember whether spelling rule supports it.

nonword:true is about removing the automatic \b that are added around each words with the default value nonword:false.

So if you use nonword:true, you have to use the \b to surround word, unless you want to catch words with a prefix or suffix, so you use only a \b.

It is also useful when you want to catch multiple words \bun\-[a-z]+\b or punctuation such as \bvs[\., ]\b

So adding \b somewhere can be useless, if you are using nonword:true AND there are some rules where you don't use \b on both side. Otherwise, you can use the default nonword:false and avoid repeating the \b everywhere

Let's go back to this PR, I don't see a need to add \b on both side. Also, I don't think it works as expected, especially because I don't see a nonword:true flag

Finally, the spelling rules doesn't invent things, if you add foo, it won't allow foobar. I mean if you add smally (just an example that is not in the dictionary), you want allow eyesmally or smallyfish

So I don't see what you are trying to fix, and things are getting very complicated...

@emteelb
Copy link
Contributor Author

emteelb commented Nov 13, 2024

I did not know about the nonword key for style rules. Thanks for pointing that out. I notice it's shown in the link you provided but when I search Vale documentation for more information, there are no results. If you know where this key is documented, lmk.

Try running the following with styles/RedHat/Spelling.yml as it is now in the main branch:

vale "This is a mistache."

With both RedHat.Spelling = YES and Vale.Spelling = YES in my .vale.ini configuration file, my results are:

 stdin.txt
 1:11  error  Did you really mean             Vale.Spelling
              'mistache'?                                  

✖ 1 error, 0 warnings and 0 suggestions in stdin.

Neither adding nonword: true nor nonword: false to the RH spelling style rule changes this result so I gather that the spelling style rule either does not support this key or else the key does something different that does not change the result.

If I change the "[cC]he" filter in the style rule to match what I have in this PR ("\b[cC]he\b"), and enter the previous vale command, my result is:

 stdin.txt
 1:11  warning  Verify the word 'mistache'. It  RedHat.Spelling
                is not in the American English                 
                spelling dictionary used by                    
                Vale.                                          
 1:11  error    Did you really mean             Vale.Spelling  
                'mistache'?                                    

✖ 1 error, 1 warning and 0 suggestions in stdin.

as would be expected.

Granted, "mistache" is an unlikely typo but I originally stumbled upon this issue when reviewing some writing where someone had a simple transposition mistake, "subcommnad", and because of the RH spelling style rule filter ("[Ss]u"), it wasn't caught (I normally enable only one spelling style rule, RH in this case, and disable the Vale spelling style rule).

As I said in a comment above, I took a heavy-handed approach in this PR by delimiting all the filters with word boundaries. I am open to a more selective approach and that might be something the RH team considers. This was just the easiest approach to be certain not to miss any potential word partials.

@ccoVeille
Copy link

thanks for you reply and tests.

Let me ask for help.

@jdkato what would you recommend here?

What do you think about what I wrote in my previous post?

@jdkato
Copy link

jdkato commented Jan 3, 2025

I think this PR solves the problem outlined in #894 sufficiently.

However, this isn't how I personally would go about handling spelling exceptions.

My general practices are:

  1. Filters should be used for ignoring large classes of words that you expect to be false positives. See the built-in ones for examples.

  2. Case-insensitive exceptions should be in a dictionary.

  3. Case-sensitive exceptions should be in a vocabulary, providing the added benefit of ensuring they're ignored during spell check but always capitalized correctly.

@emteelb
Copy link
Contributor Author

emteelb commented Feb 7, 2025

Thank you @jdkato for the thoughts about this PR. @ccoVeille any opinions or directions from the Red Hat team about which direction to take this PR? Leave as is or go down the mentioned general practices road, which would require (I think) doing things differently in other places (dictionary, vocabulary)?

@ccoVeille
Copy link

ccoVeille commented Feb 7, 2025

Hi @emteelb

If you asked me the same question in a very short delay after @jdkato replied, I would have say it depends on your expectations in term of delay. And the long term direction should be the one @jdkato mentioned.

But, the PR is opened for 3 months, so I guess the issue you are trying to fix is not urgent.

So, it means you have time to fix it the right way.

The fact you are on a large codebase with many contributors could be an issue.

So it's a matter of pro and cons you will have to figure with this project maintainers

@emteelb
Copy link
Contributor Author

emteelb commented Feb 10, 2025

Is there a project maintainer on the Red Hat team who can provide direction for this PR? Perhaps @aireilly or if not, @aireilly might alert an appropriate person?

@aireilly
Copy link
Member

Regarding the problem at hand, I would be inclined to fix the rule for just those rules terms that are known or likely to give false positives, for example "\b[cC]he\b", or other small words that might be found in larger terms.

"\b[dD]efragmentation\b" from the PR is IMO not likely to form part of a larger word for example. Add /b delimters to all terms also makes the file less user/reader friendly for updates.

By using regex word boundary (\b) delimiters, the spelling rule
applies to individual words rather than a word that might contain the
regex filter. For example, `\b[cC]he\b` will match only "che" and "Che"
rather than a regex filter without word boundary delimiters, for
example, `[cC]he` that would match misspelled words that contain the
regex, such as "aache" or "chemitsry".

This commit also combines multiple related filters that share a common
word base, for example, a single filter `"[bB]reakpoint(s)?` rather than
`[bB]reakpoint` and `[bB]reakpoints`.
@emteelb
Copy link
Contributor Author

emteelb commented Feb 11, 2025

@aireilly Had another go at this, trying to keep in mind your guidance. I also tried to economize and combined some of the filters, for example:

  - "[bB]reakpoint(s)?"

rather than

  - "[bB]reakpoint"
  - "[bB]reakpoints"

In the existing RH Spelling style rule, "[bB]reakpoints" is a redundant filter. "[bB]reakpoint" already matches it, because it is not delimited by word boundaries.

Also, just something to be aware of and perhaps pointing to @jdkato 's advice for a different long-term solution: Without word boundary delimiters around filters, the spelling rule will not catch slippery finger mistakes such as "defragmentationn".

@emteelb emteelb marked this pull request as ready for review February 11, 2025 19:45
@aireilly
Copy link
Member

Great work @emteelb ❇️

Can you PTAL at the test fixtures and update accordingly:

ERROR: 17 in .vale/fixtures/RedHat/Spelling/testvalid.adoc for .vale/styles/RedHat/Spelling.yml

@emteelb
Copy link
Contributor Author

emteelb commented Feb 18, 2025

@aireilly Thanks for the feedback. Took a look at the shell script that runs to validate RH style rules against corresponding valid/invalid files w/i the fixtures dir and which currently has 17 errors reported for Spelling.yml. To pass the script check here, I would need to modify this pull request to return "combined" filter regex lines back to separate lines, and add word boundary delimiters to the shorter word filters.

For example:

  - "[bB]ackport(ed)?"

in the pull request would become:

  - "\b[bB]ackport\b"
  - "[bB]ackported"

An alternative approach would be to create a separate pull request that would change the shell script to use the regex found in the filters or tokens in a given style rule file and keep a running counter against matches in corresponding fixture files for the style rule. Then compare the running counter number against the wc -l number for the corresponding fixture files.

That's a vague idea. Hopefully the meaning is clear. Do you prefer one or the other approach? Something different?

@aireilly
Copy link
Member

aireilly commented Feb 18, 2025

@emteelb if you're happy to review/modify the shell script please go ahead in this PR :)

Do whatever you think is best, happy to review. Go for the path of least resistance :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Spelling.yml filters word partials leading to misspellings not being flagged
4 participants