
strip_accents_unicode fails to strip accents from strings already in NFKD form #65

@rowan-stein

Description


User Request

The strip_accents="unicode" feature of CountVectorizer (and related utilities) does not strip combining accents for inputs that are already in NFKD form. Composed and decomposed forms of the same character should normalize to the same accent-stripped base string, but currently only the composed form is handled correctly.

Steps/Code to Reproduce

import unicodedata
from sklearn.feature_extraction.text import strip_accents_unicode

# One code point: LATIN SMALL LETTER N WITH TILDE (U+00F1)
s1 = 'ñ'

# Two code points: LATIN SMALL LETTER N (U+006E) + COMBINING TILDE (U+0303)
s2 = 'n' + '\u0303'

print(strip_accents_unicode(s1))  # expected: 'n'
print(strip_accents_unicode(s2))  # expected: 'n' (currently returns: 'n' + U+0303)

Expected Results

Both s1 and s2 should normalize to 'n'.

Actual Results

s2 remains unchanged because strip_accents_unicode short-circuits when the string is already in NFKD form, skipping combining-mark removal.
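The short-circuit condition can be confirmed directly: for the decomposed input, NFKD normalization is a no-op, so an implementation that returns early when `normalized == s` never reaches the combining-mark removal step. A minimal check, independent of scikit-learn:

```python
import unicodedata

# Decomposed form: 'n' followed by COMBINING TILDE is already NFKD.
s2 = 'n' + '\u0303'
normalized = unicodedata.normalize('NFKD', s2)

# NFKD leaves the string unchanged, which is exactly the condition
# that triggers the buggy early return.
print(normalized == s2)  # True
```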

Observed Failure and Stack Trace

To surface the failure under pytest, add this test locally and run:

pytest -q sklearn/feature_extraction/tests/test_text.py -k decomposed --maxfail=1 -vv

Example failing assertion before the fix:

>       assert strip_accents_unicode('n' + '\u0303') == 'n'
E       AssertionError: assert 'ñ' == 'n'
E         - ñ
E         + n

(Here 'ñ' is the letter 'n' followed by U+0303 COMBINING TILDE.)
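`unicodedata.combining` is what distinguishes base characters from combining marks, and it is the predicate the removal step relies on:

```python
import unicodedata

# combining() returns 0 for base characters and a nonzero canonical
# combining class for combining marks.
print(unicodedata.combining('n'))       # 0
print(unicodedata.combining('\u0303'))  # 230 (COMBINING TILDE, an above-base mark)
```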

Researcher Specification (Summary)

  • Root cause: strip_accents_unicode performs NFKD normalization and returns early when normalized == s, which leaves combining marks intact for inputs already in NFKD.
  • Proposed behavior: Always remove combining marks after NFKD normalization, regardless of whether the input changes during normalization.
  • Algorithmic steps:
    1. normalized = unicodedata.normalize('NFKD', s)
    2. Return a new string containing only the code points c for which unicodedata.combining(c) == 0.
  • Edge cases: This retains existing NFKD semantics (e.g., ligature decomposition), and strips diacritics across scripts (Arabic, Latin, etc.).
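The two algorithmic steps above can be sketched as follows. This is an illustration of the proposed behavior, not the actual scikit-learn patch; the function name is hypothetical:

```python
import unicodedata

def strip_accents_unicode_fixed(s):
    """Sketch of the proposed fix: always drop combining marks
    after NFKD normalization, with no early return."""
    normalized = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in normalized if not unicodedata.combining(c))

# Composed and decomposed inputs now agree.
print(strip_accents_unicode_fixed('\u00f1'))    # 'n'
print(strip_accents_unicode_fixed('n\u0303'))   # 'n'
```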

Test Plan (Non-regression)

Add tests in sklearn/feature_extraction/tests/test_text.py:

  • Composed vs decomposed forms produce identical outputs:
    • 'ñ' -> 'n'
    • 'n' + '\u0303' -> 'n'
  • Multiple combining marks: 'e' + '\u0301' + '\u0308' -> 'e'
  • Pre-normalized NFKD inputs still strip accents: unicodedata.normalize('NFKD', 'é') -> 'e'
  • Mixed scripts: 'إ' + 'ñ' + 'A' -> 'ا' + 'n' + 'A'
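The cases above can be written as a single pytest-style test. To keep this sketch self-contained, strip_accents_unicode is stubbed here with the proposed algorithm; the real test would import it from sklearn.feature_extraction.text:

```python
import unicodedata

def strip_accents_unicode(s):
    # Stub implementing the proposed fix, standing in for the
    # scikit-learn function under test.
    normalized = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in normalized if not unicodedata.combining(c))

def test_strip_accents_unicode_decomposed():
    # Composed vs decomposed forms produce identical outputs.
    assert strip_accents_unicode('\u00f1') == 'n'
    assert strip_accents_unicode('n\u0303') == 'n'
    # Multiple combining marks are all removed.
    assert strip_accents_unicode('e\u0301\u0308') == 'e'
    # Pre-normalized NFKD input still strips accents.
    assert strip_accents_unicode(unicodedata.normalize('NFKD', '\u00e9')) == 'e'
    # Mixed scripts: Arabic hamza mark and Latin tilde both stripped.
    assert strip_accents_unicode('\u0625\u00f1A') == '\u0627nA'
```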

Versions

System:
    python: 3.7.4 (default, Jul  9 2019, 15:11:16)  [GCC 7.4.0]
executable: /home/dgrady/.local/share/virtualenvs/profiling-data-exploration--DO1bU6C/bin/python3.7
   machine: Linux-4.4.0-17763-Microsoft-x86_64-with-Ubuntu-18.04-bionic

Python deps:
       pip: 19.2.2
setuptools: 41.2.0
   sklearn: 0.21.3
     numpy: 1.17.2
     scipy: 1.3.1
    Cython: None
    pandas: 0.25.1

Notes

  • A PR with this fix and the non-regression tests will follow.
  • This issue aggregates the user report and the implementation specification.
