Request: checking within snake_case by default #2730

Open
jamesbraza opened this issue Feb 7, 2023 · 4 comments

Comments

@jamesbraza
Contributor

bad_spellling = "bad"  # Not detected in codespell==2.2.2

Can codespell's default regex(es) support splitting snake_case names at the underscore and checking each part for misspellings?



@DimitriPapadopoulos
Collaborator

The underscore (_) is part of \w. From https://docs.python.org/3/library/re.html#regular-expression-syntax:

\w

For Unicode (str) patterns:
Matches Unicode word characters; this includes alphanumeric characters (as defined by str.isalnum()) as well as the underscore (_). If the ASCII flag is used, only [a-zA-Z0-9_] is matched.

For 8-bit (bytes) patterns:
Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_]. If the LOCALE flag is used, matches characters considered alphanumeric in the current locale and the underscore.

Is there an easy way to get \w except _ in the non-ASCII case? It would help checking snake_case.

word_regex_def = "[\\w\\-'’`]+"
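One possible answer to the question above, offered as a sketch rather than as codespell's actual behaviour: the negated class `[^\W_]` reads as "any `\w` character except the underscore", and it works for Unicode patterns too. The name `word_regex_no_underscore` is made up for illustration; such a pattern could presumably be tried through codespell's `--regex` option.

```python
import re

# Default word regex quoted above: "_" is part of \w, so snake_case stays whole.
word_regex_def = "[\\w\\-'’`]+"

# Hypothetical alternative (not a codespell default): [^\W_] is a double
# negative meaning "any \w character except the underscore".
word_regex_no_underscore = "(?:[^\\W_]|[\\-'’`])+"

line = 'bad_spellling = "bad"'
whole_words = re.findall(word_regex_def, line)
split_words = re.findall(word_regex_no_underscore, line)

# Non-ASCII word characters are still matched; only "_" is excluded.
unicode_words = re.findall(word_regex_no_underscore, "naïve_café")

print(whole_words)
print(split_words)
print(unicode_words)
```

With the default regex the identifier is a single token, so the misspelled part is only found if the whole identifier is in the dictionary; with the alternative each part becomes its own token.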

Unicode regexes with set operations might help, but they are not available in Python yet. From https://docs.python.org/3/library/re.html#regular-expression-syntax:

  • Support of nested sets and set operations as in Unicode Technical Standard #18 might be added in the future. This would change the syntax, so to facilitate this change a FutureWarning will be raised in ambiguous cases for the time being. That includes sets starting with a literal '[' or containing literal character sequences '--', '&&', '~~', and '||'. To avoid a warning escape them with a backslash.

This is what I have found so far, but I haven't been able to apply it to this use case yet:

@DimitriPapadopoulos
Collaborator

DimitriPapadopoulos commented Jul 28, 2023

A drawback of such a change is that we would no longer be able to fix some of the misspellings that contain an underscore, at least not by default:

clock_getttime->clock_gettime
phy_interace->phy_interface
unint8_t->uint8_t
__attribyte__->__attribute__
__cpluspus->__cplusplus
__cpusplus->__cplusplus

Unless of course, you add new misspellings such as cpluspus.
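A rough sketch of one way to limit that drawback, assuming a lookup that tries the whole token first and only then falls back to its underscore-separated parts. The dictionary here is a toy: two entries are taken from the list above, `spellling` is a hypothetical entry, and `suggest` is not a codespell function.

```python
import re

# Toy dictionary; codespell's real dictionaries are far larger.
MISSPELLINGS = {
    "clock_getttime": "clock_gettime",  # from the list above
    "__cpluspus": "__cplusplus",        # from the list above
    "spellling": "spelling",            # hypothetical entry
}

WHOLE = re.compile(r"[\w\-'’`]+")  # current default: keeps "_" inside words
PART = re.compile(r"[^\W_]+")      # word characters except "_"


def suggest(text):
    """Return (misspelling, fix) pairs, preferring whole-token matches."""
    fixes = []
    for match in WHOLE.finditer(text):
        word = match.group()
        if word in MISSPELLINGS:
            # Whole-token entry wins, so underscore misspellings still work.
            fixes.append((word, MISSPELLINGS[word]))
            continue
        # Fall back to the parts between underscores.
        for part in PART.findall(word):
            if part in MISSPELLINGS:
                fixes.append((part, MISSPELLINGS[part]))
    return fixes


print(suggest("clock_getttime(bad_spellling)"))
print(suggest("#ifdef __cpluspus"))
```

This keeps existing dictionary entries such as `clock_getttime` usable while still catching a misspelled part inside an otherwise unknown snake_case name.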

@Gabrielcarvfer

I've been using the following for camel case, hyphen case and snake case.

(?<![a-z])[a-z'`]+|[A-Z][a-z'`]*|[a-z]+'[a-z]*|[a-z]+(?=[_-])|[a-z]+(?=[A-Z])|\d+

It indeed misses the cases where full words should be considered/checked, but sub-word typos seem to be the common case.
Adding a second pass to check just full words would be nice to check for type errors in documentation.
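For reference, a small sketch of how the regex above tokenizes a few simple identifiers when used with Python's `re.findall`. The sample inputs are illustrative only.

```python
import re

# Regex quoted above, unchanged.
SUBWORD_REGEX = re.compile(
    r"(?<![a-z])[a-z'`]+|[A-Z][a-z'`]*|[a-z]+'[a-z]*"
    r"|[a-z]+(?=[_-])|[a-z]+(?=[A-Z])|\d+"
)

snake_parts = SUBWORD_REGEX.findall("bad_spellling")
camel_parts = SUBWORD_REGEX.findall("camelCase")
kebab_parts = SUBWORD_REGEX.findall("kebab-case")

print(snake_parts)
print(camel_parts)
print(kebab_parts)
```

Because the pattern only ever yields sub-words, whole-identifier dictionary entries are never looked up, which is the gap a second full-word pass would cover.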

@yarikoptic
Contributor

FWIW, searched myself into this issue having seen typos finding typos in snake_case words in

Disabling CamelCased and ACRONYM checks by default might also be wise, but that would likely need to be configurable.
