Add spelling correction module [resolves #190] #213
base: develop
EDIT (note to self): After a thorough reading of SymSpellPy's source code, it is pretty much impossible to override its distance function without rewriting the whole distance calculation with Turkish character equivalency in mind. One approach, generating all possible combinations of Turkish umlauts in a word and picking the correction with the smallest edit distance among them (thus simulating Turkish character equivalency), yielded around 58% accuracy, but the number of combinations made it far too slow, so it was scrapped. For now, umlauts are compensated for only by also looking up the "flipped" version of the word (e.g. when "yuzuyorum" is looked up, "yüzüyörüm" is looked up as well).
As an extra note, see this: https://towardsdatascience.com/spelling-correction-how-to-make-an-accurate-and-fast-corrector-dc6d0bcbba5f
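To make the two umlaut strategies above concrete, here is a minimal sketch in plain Python. It is not the code from this PR: the function names are hypothetical, and it assumes that only the o/ö and u/ü pairs count as "umlauts" (other Turkish pairs such as c/ç or s/ş are left out on purpose).

```python
from itertools import product

# Assumption: only o/ö and u/ü are treated as umlaut pairs in this sketch.
UMLAUT_PAIRS = {"o": "ö", "ö": "o", "u": "ü", "ü": "u"}


def flip_umlauts(word: str) -> str:
    """Swap every o/ö and u/ü in the word for its counterpart (the "flipped" lookup)."""
    return "".join(UMLAUT_PAIRS.get(ch, ch) for ch in word)


def all_umlaut_variants(word: str) -> set:
    """Generate every flipped/unflipped combination (the scrapped approach).

    A word with n umlaut-capable letters yields 2**n variants, which is
    why looking all of them up becomes slow.
    """
    choices = [
        (ch, UMLAUT_PAIRS[ch]) if ch in UMLAUT_PAIRS else (ch,)
        for ch in word
    ]
    return {"".join(combo) for combo in product(*choices)}


flipped = flip_umlauts("yuzuyorum")           # "yüzüyörüm"
variants = all_umlaut_variants("yuzulmuyor")  # 4 flippable letters, 16 variants
```

The flipped lookup costs one extra query per word, while the exhaustive version doubles the number of queries with every additional o/ö/u/ü in the word.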
As the title says, this adds a spelling correction module that uses SymSpellPy as its backend. The vocabulary for spelling correction is a term-frequency list of approximately 450k entries, created by merging OpenSubtitles (Turkish) and Turkish Wikipedia data.
It currently uses only one of the possible ways to use SymSpellPy, namely its lookup_compound() method, which is not necessarily the best way to correct spelling.
The module is integrated as such:
TODO:
- Keep term frequency vocabulary in a bucket instead of LFS (no need)
- Customizable spelling correction (configs/overriding spelling correction class?) (can be added in another PR)
- (Currently basic mode tests fail)

Dataset to test on:
https://github.com/StarlangSoftware/Dictionary/blob/master/src/main/resources/turkish_misspellings.txt
EDIT:
Result (best) max_edit_distance = 2
Accuracy: 51%
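For reference, a figure like the one above can be computed as exact-match accuracy over (misspelled, expected) pairs. The sketch below is an assumption about the evaluation, not the script used for this PR; the `accuracy` helper, the toy corrector, and the toy pairs are all made up for illustration, and the dataset's file format is not assumed.

```python
from typing import Callable, Iterable, Tuple


def accuracy(correct: Callable[[str], str],
             pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of misspellings whose correction equals the expected word."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(1 for misspelled, expected in pairs
               if correct(misspelled) == expected)
    return hits / len(pairs)


# Toy stand-in for a real corrector: a fixed lookup table (hypothetical data).
toy_table = {"yuzulmuyor": "duyulmuyor", "gelyorum": "geliyorum"}


def toy_corrector(word: str) -> str:
    return toy_table.get(word, word)


toy_pairs = [("yuzulmuyor", "yüzülmüyor"), ("gelyorum", "geliyorum")]
score = accuracy(toy_corrector, toy_pairs)  # one of two is right
```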
Most of the mistakes were in words with misplaced or omitted Turkish umlaut letters:
e.g. "yuzulmuyor" was corrected to "duyulmuyor" when it should have been "yüzülmüyor"
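The example above can be checked by hand with an edit-distance calculation. The sketch below uses plain Levenshtein distance as a stand-in (SymSpellPy's own distance implementation is not reproduced here), and `folded_distance` is a hypothetical helper that treats o/ö and u/ü as equal by folding them to ASCII before comparing.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Assumption: only ö and ü are folded in this sketch.
ASCII_FOLD = str.maketrans("öü", "ou")


def folded_distance(a: str, b: str) -> int:
    """Edit distance that ignores the o/ö and u/ü distinction (hypothetical)."""
    return levenshtein(a.translate(ASCII_FOLD), b.translate(ASCII_FOLD))


wrong = levenshtein("yuzulmuyor", "duyulmuyor")      # 2 substitutions
right = levenshtein("yuzulmuyor", "yüzülmüyor")      # 3 substitutions
folded = folded_distance("yuzulmuyor", "yüzülmüyor")  # 0 after folding
```

With max_edit_distance = 2, "yüzülmüyor" sits at distance 3 and is out of reach, while "duyulmuyor" at distance 2 is a valid candidate. Under an umlaut-insensitive distance the intended word would be at distance 0.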
Two (orthogonal to each other) ways to bring accuracy to 90%+:
These improvements are left to other PRs as this PR is already getting a bit too large.
resolves #190