
strip_accents_unicode fails to strip accents from strings already in NFKD form #65

@rowan-stein

Description


User Request

The strip_accents="unicode" feature of CountVectorizer (and related utilities) does not strip combining accents for inputs that are already in NFKD form. Composed and decomposed forms of the same character should normalize to the same accent-stripped base string, but currently only the composed form is handled correctly.

Steps/Code to Reproduce

import unicodedata
from sklearn.feature_extraction.text import strip_accents_unicode

# One code point: LATIN SMALL LETTER N WITH TILDE (U+00F1)
s1 = 'ñ'

# Two code points: LATIN SMALL LETTER N (U+006E) + COMBINING TILDE (U+0303)
s2 = 'n' + '\u0303'

print(strip_accents_unicode(s1))  # expected: 'n'
print(strip_accents_unicode(s2))  # expected: 'n' (currently returns: 'n' + U+0303)

Expected Results

Both s1 and s2 should normalize to 'n'.

Actual Results

s2 remains unchanged because strip_accents_unicode short-circuits when the string is already in NFKD form, skipping combining-mark removal.
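The short-circuit condition can be confirmed directly: for the decomposed input, NFKD normalization is a no-op, so an implementation that returns early when `normalized == s` never reaches the combining-mark removal step. A minimal check, independent of scikit-learn:

```python
import unicodedata

# Decomposed form: 'n' followed by COMBINING TILDE is already NFKD.
s2 = 'n' + '\u0303'
normalized = unicodedata.normalize('NFKD', s2)

# NFKD leaves the string unchanged, which is exactly the condition
# that triggers the buggy early return.
print(normalized == s2)  # True
```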

Observed Failure and Stack Trace

To surface the failure under pytest, add this test locally and run:

pytest -q sklearn/feature_extraction/tests/test_text.py -k decomposed --maxfail=1 -vv

Example failing assertion before the fix:

>       assert strip_accents_unicode('n' + '\u0303') == 'n'
E       AssertionError: assert 'ñ' == 'n'
E         - ñ
E         + n

(Here 'ñ' is the letter 'n' followed by U+0303 COMBINING TILDE.)
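`unicodedata.combining` is what distinguishes base characters from combining marks, and it is the predicate the removal step relies on:

```python
import unicodedata

# combining() returns 0 for base characters and a nonzero canonical
# combining class for combining marks.
print(unicodedata.combining('n'))       # 0
print(unicodedata.combining('\u0303'))  # 230 (COMBINING TILDE, an above-base mark)
```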

Researcher Specification (Summary)

  • Root cause: strip_accents_unicode performs NFKD normalization and returns early when normalized == s, which leaves combining marks intact for inputs already in NFKD.
  • Proposed behavior: Always remove combining marks after NFKD normalization, regardless of whether the input changes during normalization.
  • Algorithmic steps:
    1. normalized = unicodedata.normalize('NFKD', s)
    2. Return a new string containing only the code points c for which unicodedata.combining(c) == 0.
  • Edge cases: This retains existing NFKD semantics (e.g., ligature decomposition), and strips diacritics across scripts (Arabic, Latin, etc.).
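The two algorithmic steps above can be sketched as follows. This is an illustration of the proposed behavior, not the actual scikit-learn patch; the function name is hypothetical:

```python
import unicodedata

def strip_accents_unicode_fixed(s):
    """Sketch of the proposed fix: always drop combining marks
    after NFKD normalization, with no early return."""
    normalized = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in normalized if not unicodedata.combining(c))

# Composed and decomposed inputs now agree.
print(strip_accents_unicode_fixed('\u00f1'))    # 'n'
print(strip_accents_unicode_fixed('n\u0303'))   # 'n'
```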

Test Plan (Non-regression)

Add tests in sklearn/feature_extraction/tests/test_text.py:

  • Composed vs decomposed forms produce identical outputs:
    • 'ñ' -> 'n'
    • 'n' + '\u0303' -> 'n'
  • Multiple combining marks: 'e' + '\u0301' + '\u0308' -> 'e'
  • Pre-normalized NFKD inputs still strip accents: unicodedata.normalize('NFKD', 'é') -> 'e'
  • Mixed scripts: 'إ' + 'ñ' + 'A' -> 'ا' + 'n' + 'A'
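The cases above can be written as a single pytest-style test. To keep this sketch self-contained, strip_accents_unicode is stubbed here with the proposed algorithm; the real test would import it from sklearn.feature_extraction.text:

```python
import unicodedata

def strip_accents_unicode(s):
    # Stub implementing the proposed fix, standing in for the
    # scikit-learn function under test.
    normalized = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in normalized if not unicodedata.combining(c))

def test_strip_accents_unicode_decomposed():
    # Composed vs decomposed forms produce identical outputs.
    assert strip_accents_unicode('\u00f1') == 'n'
    assert strip_accents_unicode('n\u0303') == 'n'
    # Multiple combining marks are all removed.
    assert strip_accents_unicode('e\u0301\u0308') == 'e'
    # Pre-normalized NFKD input still strips accents.
    assert strip_accents_unicode(unicodedata.normalize('NFKD', '\u00e9')) == 'e'
    # Mixed scripts: Arabic hamza mark and Latin tilde both stripped.
    assert strip_accents_unicode('\u0625\u00f1A') == '\u0627nA'
```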

Versions

System:
    python: 3.7.4 (default, Jul  9 2019, 15:11:16)  [GCC 7.4.0]
executable: /home/dgrady/.local/share/virtualenvs/profiling-data-exploration--DO1bU6C/bin/python3.7
   machine: Linux-4.4.0-17763-Microsoft-x86_64-with-Ubuntu-18.04-bionic

Python deps:
       pip: 19.2.2
setuptools: 41.2.0
   sklearn: 0.21.3
     numpy: 1.17.2
     scipy: 1.3.1
    Cython: None
    pandas: 0.25.1

Notes

  • A PR with this fix and the non-regression tests will follow.
  • This issue aggregates the user report and the implementation specification.
