Description
strip_accents_unicode fails to strip accents from strings that are already in NFKD form
User Request
The strip_accents="unicode" feature of CountVectorizer (and related utilities) does not strip combining accents for inputs that are already in NFKD form. Composed and decomposed forms of the same character should normalize to the same accent-stripped base string, but currently only the composed form is handled correctly.
Steps/Code to Reproduce
import unicodedata
from sklearn.feature_extraction.text import strip_accents_unicode
# One code point: LATIN SMALL LETTER N WITH TILDE (U+00F1)
s1 = 'ñ'
# Two code points: LATIN SMALL LETTER N (U+006E) + COMBINING TILDE (U+0303)
s2 = 'n' + '\u0303'
print(strip_accents_unicode(s1)) # expected: 'n'
print(strip_accents_unicode(s2)) # expected: 'n' (currently returns: 'n' + U+0303)
Expected Results
Both s1 and s2 should normalize to 'n'.
Actual Results
s2 remains unchanged because strip_accents_unicode short-circuits when the string is already in NFKD form, skipping combining-mark removal.
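A minimal sketch approximating the pre-fix logic (a standalone reimplementation, not the actual scikit-learn source) makes the short-circuit visible: the early return fires exactly when the input is already NFKD, so combining marks survive.

```python
import unicodedata

def strip_accents_unicode_buggy(s):
    # Approximation of the pre-fix logic: NFKD-normalize, then short-circuit
    # when the input is unchanged, skipping combining-mark removal entirely.
    normalized = unicodedata.normalize('NFKD', s)
    if normalized == s:
        return s  # bug: combining marks in already-NFKD input survive
    return ''.join(c for c in normalized if not unicodedata.combining(c))

print(strip_accents_unicode_buggy('\u00f1'))   # composed input -> 'n'
print(strip_accents_unicode_buggy('n\u0303'))  # decomposed input -> tilde survives
```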
Observed Failure and Stack Trace
To surface the failure under pytest, add this test locally and run:
pytest -q sklearn/feature_extraction/tests/test_text.py -k decomposed --maxfail=1 -vv
Example failing assertion before the fix:
> assert strip_accents_unicode('n' + '\u0303') == 'n'
E AssertionError: assert 'ñ' == 'n'
E - ñ
E + n
(Here 'ñ' is the letter 'n' followed by U+0303 COMBINING TILDE.)
Researcher Specification (Summary)
- Root cause:
strip_accents_unicode performs NFKD normalization and returns early when normalized == s, which leaves combining marks intact for inputs that are already in NFKD form.
- Proposed behavior: Always remove combining marks after NFKD normalization, regardless of whether the input changes during normalization.
- Algorithmic steps:
- normalized = unicodedata.normalize('NFKD', s)
- Return a new string containing only the code points where unicodedata.combining(c) == 0.
- Edge cases: This retains existing NFKD semantics (e.g., ligature decomposition), and strips diacritics across scripts (Arabic, Latin, etc.).
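The algorithmic steps above can be sketched as a standalone function (an illustration of the proposed behavior, not the final scikit-learn patch): normalization and mark removal happen unconditionally, so composed and decomposed inputs converge.

```python
import unicodedata

def strip_accents_unicode_fixed(s):
    # Always drop combining marks after NFKD normalization; no early return.
    normalized = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in normalized if not unicodedata.combining(c))

# Composed and decomposed forms now agree:
print(strip_accents_unicode_fixed('\u00f1'))   # -> 'n'
print(strip_accents_unicode_fixed('n\u0303'))  # -> 'n'
# NFKD semantics are retained, e.g. the fi ligature still decomposes:
print(strip_accents_unicode_fixed('\ufb01'))   # -> 'fi'
```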
Test Plan (Non-regression)
Add tests in sklearn/feature_extraction/tests/test_text.py:
- Composed vs decomposed forms produce identical outputs:
  'ñ' -> 'n'
  'n' + '\u0303' -> 'n'
- Multiple combining marks:
  'e' + '\u0301' + '\u0308' -> 'e'
- Pre-normalized NFKD inputs still strip accents:
  unicodedata.normalize('NFKD', 'é') -> 'e'
- Mixed scripts:
  'إ' + 'ñ' + 'A' -> 'ا' + 'n' + 'A'
Versions
System:
python: 3.7.4 (default, Jul 9 2019, 15:11:16) [GCC 7.4.0]
executable: /home/dgrady/.local/share/virtualenvs/profiling-data-exploration--DO1bU6C/bin/python3.7
machine: Linux-4.4.0-17763-Microsoft-x86_64-with-Ubuntu-18.04-bionic
Python deps:
pip: 19.2.2
setuptools: 41.2.0
sklearn: 0.21.3
numpy: 1.17.2
scipy: 1.3.1
Cython: None
pandas: 0.25.1
Notes
- A PR with this fix and the non-regression tests will follow.
- This Issue aggregates the user report and the specification for implementation.