
fix extract keyword bug #88

Open
DotaArtist wants to merge 30 commits into vi3k6i5:master from DotaArtist:master

@DotaArtist

BUG 1:
>>> import flashtext
>>> _extractor = flashtext.KeywordProcessor()
>>> _extractor.add_keyword('地中海贫血')
True
>>> _extractor.extract_keywords('地中海贫血')
['地中海贫血']
>>> _extractor.extract_keywords('地中海贫血2')  # expected: ['地中海贫血']
[]

BUG 2:
>>> import flashtext
>>> _extractor = flashtext.KeywordProcessor()
>>> _extractor.add_keyword('头疼')
True
>>> _extractor.add_keyword('头晕')
True
>>> _extractor.extract_keywords('头疼头晕')  # expected: ['头疼', '头晕']
['头疼']

vi3k6i5 and others added 30 commits November 10, 2017 20:47
added reference to flashtext paper
  `charactes` | `characters`
  `explaination` | `explanation`
  `matche` | `match`
Fix issue with incomplete keyword at the end of the sentence
Performances improvement for strings manipulations
@vi3k6i5
Owner

vi3k6i5 commented May 3, 2020

Can you please resolve the conflict?

HCYT added a commit to termdock/flashtext-i18n that referenced this pull request Jan 13, 2026
Problem:
- Keywords like '地中海贫血' were not extracted from '地中海贫血2'
- Numbers (0-9) are in non_word_boundaries, causing the match to fail

Root cause:
- The algorithm required the next character to be a word boundary
- But for CJK, the keyword's last character itself is already a boundary

Solution:
- Check if the last matched character is not in non_word_boundaries (CJK)
- If so, accept the match regardless of what follows
- This preserves English word boundary semantics while fixing CJK

Fixes: #1
Ref: upstream vi3k6i5/flashtext#87, vi3k6i5/flashtext#88
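
The boundary rule described in the commit message above can be sketched in a few lines of standalone Python. This is only an illustration under stated assumptions, not the code from the referenced commit or from flashtext: it uses a naive longest-match scan instead of a trie, the names `right_boundary_ok`, `left_boundary_ok`, and the `cjk_aware` flag are hypothetical, and the character set is assumed to match flashtext's default of ASCII letters, digits, and `_`. It models only the first bug (a CJK keyword followed by a digit), not the second.

```python
import string

# Assumption: as in flashtext's default setup, ASCII letters, digits and "_"
# are "non-word-boundary" characters; everything else (including CJK) counts
# as a boundary.
NON_WORD_BOUNDARIES = set(string.ascii_letters + string.digits + "_")


def right_boundary_ok(sentence, end, cjk_aware):
    """Decide whether a keyword ending just before index `end` may be accepted."""
    if end == len(sentence) or sentence[end] not in NON_WORD_BOUNDARIES:
        return True  # followed by a boundary or end of text: always accepted
    # Rule from the commit message: if the keyword's last character is itself
    # a boundary character (e.g. CJK), accept regardless of what follows.
    return cjk_aware and sentence[end - 1] not in NON_WORD_BOUNDARIES


def left_boundary_ok(sentence, start):
    """Hypothetical helper: a match may not start in the middle of an ASCII word."""
    if start == 0:
        return True
    return (sentence[start - 1] not in NON_WORD_BOUNDARIES
            or sentence[start] not in NON_WORD_BOUNDARIES)


def extract_keywords(sentence, keywords, cjk_aware=True):
    """Naive longest-match scan; a stand-in for the trie walk, not flashtext itself."""
    ordered = sorted((k for k in keywords if k), key=len, reverse=True)
    found, i = [], 0
    while i < len(sentence):
        for kw in ordered:
            end = i + len(kw)
            if (sentence.startswith(kw, i)
                    and left_boundary_ok(sentence, i)
                    and right_boundary_ok(sentence, end, cjk_aware)):
                found.append(kw)
                i = end
                break
        else:
            i += 1  # no keyword starts here; move on one character
    return found


if __name__ == "__main__":
    kws = ["地中海贫血", "apple"]
    print(extract_keywords("地中海贫血2", kws, cjk_aware=False))  # []
    print(extract_keywords("地中海贫血2", kws, cjk_aware=True))   # ['地中海贫血']
    print(extract_keywords("apples", kws, cjk_aware=True))        # []
```

With `cjk_aware=False` the trailing `2` blocks the match, which mirrors the reported behaviour; with `cjk_aware=True` the CJK keyword is accepted while `apple` inside `apples` is still rejected, so the English word-boundary semantics are unchanged.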