Fix strip_accents_unicode for NFKD inputs (swev-id: scikit-learn__scikit-learn-15100) #71
…ings; add non-regression tests
Local Testing
noa-lucent
left a comment
[major] The new test helper in scikits/learn/svm/tests/test_svm.py (see the new return statement around L437) returns the argmax index (0..n-1) instead of the actual class label. liblinear's C implementation wraps the argmax index through model_->label before returning, so whenever the training labels are not 0-based (e.g. [-1, 1] or [1, 2, 3]) this helper will disagree with the real predictor. Please map the argmax index through the provided labels array before returning so the helper mirrors liblinear.
Thanks for the review. The noted change regarding liblinear prediction appears unrelated to this pull request. This PR only modifies Could you please re-review the current diff and let us know if any changes are needed within the scope of this PR?
noa-lucent
left a comment
Fix looks good. Removing the early return ensures we strip combining marks even when the input was already NFKD-normalized, and the new tests cover those cases.
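For readers skimming the thread, here is a minimal sketch of the behaviour described above, assuming the fix amounts to always decomposing and filtering combining marks. The function name `strip_accents_unicode_sketch` is hypothetical and this is not the code from the diff; it only uses the standard-library `unicodedata` module.

```python
import unicodedata


def strip_accents_unicode_sketch(s):
    """Hypothetical stand-in: decompose to NFKD, then drop combining marks.

    There is deliberately no early return when the input is already in
    NFKD form, so decomposed inputs are stripped as well.
    """
    normalized = unicodedata.normalize("NFKD", s)
    return "".join(c for c in normalized if not unicodedata.combining(c))


composed = "\u00f1"      # LATIN SMALL LETTER N WITH TILDE, one code point
decomposed = "n\u0303"   # "n" followed by COMBINING TILDE, already NFKD

# Both spellings should reduce to the same plain letter.
assert strip_accents_unicode_sketch(composed) == "n"
assert strip_accents_unicode_sketch(decomposed) == "n"
```

With an early return of the form "if the normalized string equals the input, return the input", the decomposed spelling would be handed back unchanged, which is the case the review comment refers to.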
Related Issue
Reproduction Steps
source /workspace/sklearn-py37/bin/activate
export LD_LIBRARY_PATH=/nix/store/gh2dd8vimringn726ndall19gbm77prj-openblas-0.3.30/lib:/nix/store/4wdz42ns29ys6fm1xak68bnp51nxhd2s-zlib-1.3.1/lib:/nix/store/y1bnyxikip76b1nk1adjabnx67pwkl36-libxcrypt-4.5.2/lib:/workspace/miniconda3/lib:$HOME/.nix-profile/lib:$LD_LIBRARY_PATH
pytest -q sklearn/feature_extraction/tests/test_text.py -k nfkd --maxfail=1 -vv
Observed Failure (pre-fix)
Fix Summary
Removed the early return in strip_accents_unicode so pre-normalized inputs cannot skip accent stripping.
Tests Added
test_strip_accents_unicode_nfkd_inputs, validating composed vs decomposed strings, stacked combining marks, pre-normalized inputs, and mixed-script inputs.
Verification
pytest -q sklearn/feature_extraction/tests/test_text.py -k strip_accents
flake8 sklearn/feature_extraction/text.py sklearn/feature_extraction/tests/test_text.py
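As a rough illustration of the input categories the added test is described as covering, the sketch below checks one example per category. The helper `strip_marks` and the specific strings are assumptions chosen for illustration; they are not taken from the test file in this PR.

```python
import unicodedata


def strip_marks(s):
    # Hypothetical helper: NFKD-decompose, then drop combining characters.
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# (input, expected) pairs, one per category; all strings are illustrative.
CASES = [
    ("\u00e9", "e"),                                # composed: e with acute
    ("e\u0301", "e"),                               # decomposed: e + combining acute
    ("a\u0301\u0323", "a"),                         # stacked combining marks
    (unicodedata.normalize("NFKD", "\u00e5"), "a"), # pre-normalized a with ring
    ("caf\u00e9 \u03ac", "cafe \u03b1"),            # mixed Latin and Greek
]

for text, expected in CASES:
    assert strip_marks(text) == expected, (text, expected)
```

In the real test suite these cases would more naturally be expressed with `pytest.mark.parametrize` against `strip_accents_unicode` itself; the plain loop here keeps the sketch free of third-party imports.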