Fix strip_accents_unicode for NFKD inputs (swev-id: scikit-learn__scikit-learn-15100) by casey-brooks · Pull Request #71 · agyn-sandbox/scikit-learn

casey-brooks · 2025-12-27T21:08:25Z

Related Issue

strip_accents_unicode fails to strip accents from strings already in NFKD form #65

Reproduction Steps

Activate the project environment (source /workspace/sklearn-py37/bin/activate).
Export the runtime libraries:
export LD_LIBRARY_PATH=/nix/store/gh2dd8vimringn726ndall19gbm77prj-openblas-0.3.30/lib:/nix/store/4wdz42ns29ys6fm1xak68bnp51nxhd2s-zlib-1.3.1/lib:/nix/store/y1bnyxikip76b1nk1adjabnx67pwkl36-libxcrypt-4.5.2/lib:/workspace/miniconda3/lib:$HOME/.nix-profile/lib:$LD_LIBRARY_PATH.
Run pytest -q sklearn/feature_extraction/tests/test_text.py -k nfkd --maxfail=1 -vv.

Observed Failure (pre-fix)

============================= test session starts ==============================
platform linux -- Python 3.7.10, pytest-5.4.3, py-1.11.0, pluggy-0.13.1 -- /workspace/sklearn-py37/bin/python
cachedir: .pytest_cache
rootdir: /workspace/scikit-learn, inifile: setup.cfg
collecting ... collected 112 items / 111 deselected / 1 selected

sklearn/feature_extraction/tests/test_text.py::test_strip_accents_unicode_nfkd_inputs FAILED [100%]

=================================== FAILURES ===================================
____________________ test_strip_accents_unicode_nfkd_inputs ____________________

    def test_strip_accents_unicode_nfkd_inputs():
        assert strip_accents_unicode('ñ') == 'n'
>       assert strip_accents_unicode('n' + '\\u0303') == 'n'
E       AssertionError: assert 'ñ' == 'n'
E         - n
E         + ñ

sklearn/feature_extraction/tests/test_text.py:104: AssertionError
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
====================== 1 failed, 111 deselected in 0.18s =======================

Fix Summary

Always drop combining characters after NFKD normalization in strip_accents_unicode so pre-normalized inputs cannot skip accent stripping.
Add regression coverage for composed, decomposed, multi-mark, and mixed-script inputs to lock in the intended behavior.

Tests Added

test_strip_accents_unicode_nfkd_inputs validating composed vs decomposed strings, stacked combining marks, pre-normalized inputs, and mixed-script inputs.

Verification

pytest -q sklearn/feature_extraction/tests/test_text.py -k strip_accents
flake8 sklearn/feature_extraction/text.py sklearn/feature_extraction/tests/test_text.py

…ings; add non-regression tests

casey-brooks · 2025-12-27T21:08:33Z

Local Testing

pytest -q sklearn/feature_extraction/tests/test_text.py -k strip_accents (2 passed, 110 deselected)
flake8 sklearn/feature_extraction/text.py sklearn/feature_extraction/tests/test_text.py (no issues)

noa-lucent

[major] The new test helper in scikits/learn/svm/tests/test_svm.py (see the new return statement around L437) returns the argmax index (0..n-1) instead of the actual class label. liblinear's C implementation wraps the argmax index through model_->label before returning, so whenever the training labels are not 0-based (e.g. [-1, 1] or [1, 2, 3]) this helper will disagree with the real predictor. Please map the argmax index through the provided labels array before returning so the helper mirrors liblinear.

rowan-stein · 2025-12-27T21:15:10Z

Thanks for the review. The noted change regarding liblinear prediction appears unrelated to this pull request. This PR only modifies sklearn/feature_extraction/text.py and sklearn/feature_extraction/tests/test_text.py to fix strip_accents_unicode handling for NFKD inputs.

Could you please re-review the current diff and let us know if any changes are needed within the scope of this PR?

noa-lucent

Fix looks good. Removing the early return ensures we strip combining marks even when the input was already NFKD-normalized, and the new tests cover those cases.

fix: Fix strip_accents_unicode to strip combining accents in NFKD str…

726fa1b

…ings; add non-regression tests

casey-brooks requested a review from a team December 27, 2025 21:08

noa-lucent requested changes Dec 27, 2025

View reviewed changes

noa-lucent approved these changes Dec 27, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Fix strip_accents_unicode for NFKD inputs (swev-id: scikit-learn__scikit-learn-15100)#71

Fix strip_accents_unicode for NFKD inputs (swev-id: scikit-learnscikit-learn-15100)#71
casey-brooks wants to merge 1 commit intoscikit-learnscikit-learn-15100from
fix/strip-accents-unicode-nfkd

casey-brooks commented Dec 27, 2025

Uh oh!

casey-brooks commented Dec 27, 2025

Uh oh!

noa-lucent left a comment

Uh oh!

rowan-stein commented Dec 27, 2025

Uh oh!

noa-lucent left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Comments

Conversation

casey-brooks commented Dec 27, 2025

Related Issue

Reproduction Steps

Observed Failure (pre-fix)

Fix Summary

Tests Added

Verification

Uh oh!

casey-brooks commented Dec 27, 2025

Local Testing

Uh oh!

noa-lucent left a comment

Choose a reason for hiding this comment

Uh oh!

rowan-stein commented Dec 27, 2025

Uh oh!

noa-lucent left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants