Skip to content

Comments

Fix strip_accents_unicode for NFKD inputs (swev-id: scikit-learn__scikit-learn-15100)#71

Open
casey-brooks wants to merge 1 commit intoscikit-learn__scikit-learn-15100from
fix/strip-accents-unicode-nfkd
Open

Fix strip_accents_unicode for NFKD inputs (swev-id: scikit-learn__scikit-learn-15100)#71
casey-brooks wants to merge 1 commit intoscikit-learn__scikit-learn-15100from
fix/strip-accents-unicode-nfkd

Conversation

@casey-brooks
Copy link

Related Issue

Reproduction Steps

  1. Activate the project environment (source /workspace/sklearn-py37/bin/activate).
  2. Export the runtime libraries:
    export LD_LIBRARY_PATH=/nix/store/gh2dd8vimringn726ndall19gbm77prj-openblas-0.3.30/lib:/nix/store/4wdz42ns29ys6fm1xak68bnp51nxhd2s-zlib-1.3.1/lib:/nix/store/y1bnyxikip76b1nk1adjabnx67pwkl36-libxcrypt-4.5.2/lib:/workspace/miniconda3/lib:$HOME/.nix-profile/lib:$LD_LIBRARY_PATH.
  3. Run pytest -q sklearn/feature_extraction/tests/test_text.py -k nfkd --maxfail=1 -vv.

Observed Failure (pre-fix)

============================= test session starts ==============================
platform linux -- Python 3.7.10, pytest-5.4.3, py-1.11.0, pluggy-0.13.1 -- /workspace/sklearn-py37/bin/python
cachedir: .pytest_cache
rootdir: /workspace/scikit-learn, inifile: setup.cfg
collecting ... collected 112 items / 111 deselected / 1 selected

sklearn/feature_extraction/tests/test_text.py::test_strip_accents_unicode_nfkd_inputs FAILED [100%]

=================================== FAILURES ===================================
____________________ test_strip_accents_unicode_nfkd_inputs ____________________

    def test_strip_accents_unicode_nfkd_inputs():
        assert strip_accents_unicode('ñ') == 'n'
>       assert strip_accents_unicode('n' + '\\u0303') == 'n'
E       AssertionError: assert 'ñ' == 'n'
E         - n
E         + ñ

sklearn/feature_extraction/tests/test_text.py:104: AssertionError
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
====================== 1 failed, 111 deselected in 0.18s =======================

Fix Summary

  • Always drop combining characters after NFKD normalization in strip_accents_unicode so pre-normalized inputs cannot skip accent stripping.
  • Add regression coverage for composed, decomposed, multi-mark, and mixed-script inputs to lock in the intended behavior.

Tests Added

  • test_strip_accents_unicode_nfkd_inputs validating composed vs decomposed strings, stacked combining marks, pre-normalized inputs, and mixed-script inputs.

Verification

  • pytest -q sklearn/feature_extraction/tests/test_text.py -k strip_accents
  • flake8 sklearn/feature_extraction/text.py sklearn/feature_extraction/tests/test_text.py

@casey-brooks casey-brooks requested a review from a team December 27, 2025 21:08
@casey-brooks
Copy link
Author

Local Testing

  • pytest -q sklearn/feature_extraction/tests/test_text.py -k strip_accents (2 passed, 110 deselected)
  • flake8 sklearn/feature_extraction/text.py sklearn/feature_extraction/tests/test_text.py (no issues)

Copy link

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[major] The new test helper in scikits/learn/svm/tests/test_svm.py (see the new return statement around L437) returns the argmax index (0..n-1) instead of the actual class label. liblinear's C implementation wraps the argmax index through model_->label before returning, so whenever the training labels are not 0-based (e.g. [-1, 1] or [1, 2, 3]) this helper will disagree with the real predictor. Please map the argmax index through the provided labels array before returning so the helper mirrors liblinear.

@rowan-stein
Copy link
Collaborator

Thanks for the review. The noted change regarding liblinear prediction appears unrelated to this pull request. This PR only modifies sklearn/feature_extraction/text.py and sklearn/feature_extraction/tests/test_text.py to fix strip_accents_unicode handling for NFKD inputs.

Could you please re-review the current diff and let us know if any changes are needed within the scope of this PR?

Copy link

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix looks good. Removing the early return ensures we strip combining marks even when the input was already NFKD-normalized, and the new tests cover those cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants