Add spelling correction module [resolves #190] #213

Open
wants to merge 12 commits into
base: develop

askarbozcan
Member

@askarbozcan askarbozcan commented Feb 8, 2021

As the title says, this PR adds a spelling correction module that uses SymSpellPy as its backend. The vocabulary is a term frequency list of approximately 450k entries, created by merging OpenSubtitles (Turkish) and Turkish Wikipedia data.

The module currently uses only one of the ways SymSpellPy can be used, namely its lookup_compound() method, which is not necessarily the best way to correct spelling.

The module is integrated as follows:

d = Doc("Ali bubanın çiftliği")
d_fixed = d.get_spell_corrected()
print(d_fixed)  # "Ali babanın çiftliği"

TODO:

  • Add more aggressive/softer methods of spelling correction
  • By default, load the pickled term frequency vocabulary instead of parsing it from text (loads faster)
  • Notify the user while the dictionary is being loaded (it takes a few seconds)
    Keep term frequency vocabulary in a bucket instead of LFS (no need)
    Customizable spelling correction (configs/overriding spelling correction class?) (can be added in another PR)
  • Test its performance on the dataset below and find decent default parameters
  • Unit tests!
  • Fix punctuation preservation in "basic" mode when multiple punctuation marks are involved (basic mode tests currently fail)
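
The pickled-vocabulary and load-notification items above could be sketched roughly as follows. This is only an illustration using the standard library: the function name `load_term_frequencies`, the "term count" line format, and the file paths are assumptions, not the module's actual code.

```python
import pickle
from pathlib import Path


def load_term_frequencies(text_path, cache_path):
    """Load a {term: count} dict, preferring a pickled cache over the text file.

    Hypothetical helper: assumes one "term count" pair per line in the text file.
    """
    text_path, cache_path = Path(text_path), Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)

    # Slow path: tell the user why there is a pause.
    print(f"Loading spelling dictionary from {text_path} ...")
    freqs = {}
    with text_path.open(encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            term, count = parts
            freqs[term] = int(count)

    # Cache the parsed dict so the next load skips text parsing.
    with cache_path.open("wb") as f:
        pickle.dump(freqs, f, protocol=pickle.HIGHEST_PROTOCOL)
    return freqs
```

Note that this sketch never invalidates the cache, so the pickle has to be deleted when the text vocabulary changes. Pickle files should only be loaded from trusted sources.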

Dataset to test on:
https://github.com/StarlangSoftware/Dictionary/blob/master/src/main/resources/turkish_misspellings.txt

EDIT:
Best result, obtained with max_edit_distance = 2:
Accuracy: 51%
Most of the mistakes were in words with misplaced or omitted Turkish umlaut letters.
Example: "yuzulmuyor" was corrected to "duyulmuyor" when it should have been "yüzülmüyor".
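
For reference, a minimal sketch of how such an accuracy figure can be computed. It assumes the dataset has one whitespace-separated "misspelled correct" pair per line (not verified here), and `parse_pairs`, `accuracy`, and the toy corrector are hypothetical stand-ins rather than the code used for the 51% result.

```python
def parse_pairs(lines):
    """Parse 'misspelled correct' lines into (misspelled, correct) tuples."""
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # assumed format; other lines are skipped
            pairs.append((parts[0], parts[1]))
    return pairs


def accuracy(correct_fn, pairs):
    """Fraction of pairs for which correct_fn(misspelled) equals the expected word."""
    if not pairs:
        return 0.0
    hits = sum(1 for wrong, right in pairs if correct_fn(wrong) == right)
    return hits / len(pairs)


# Toy stand-in for the real corrector, for illustration only.
toy_corrections = {"yuzulmuyor": "duyulmuyor", "bubanın": "babanın"}
pairs = parse_pairs(["yuzulmuyor yüzülmüyor", "bubanın babanın"])
print(accuracy(lambda w: toy_corrections.get(w, w), pairs))  # 0.5
```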

Two (orthogonal to each other) ways to bring accuracy to 90%+:

  1. Prioritize fixing wrongly placed Turkish characters first.
  2. Use FastText embeddings to pick the best candidate based on semantic meaning of the word and its neighbours.

These improvements are left to other PRs as this PR is already getting a bit too large.

resolves #190

@askarbozcan askarbozcan linked an issue Feb 8, 2021 that may be closed by this pull request
@askarbozcan askarbozcan changed the title Add spelling correction module [WIP] Add spelling correction module [WIP] [resolves #190] Feb 11, 2021
@askarbozcan
Member Author

askarbozcan commented Feb 13, 2021

Note to self:
Modify the symspellpy distance calculation so that swapping a Turkish umlaut character for its English counterpart (ü -> u, ç -> c) or vice versa (u -> ü, c -> ç) costs a smaller edit distance than changing any other character.

EDIT: After a thorough reading of SymSpellPy's source code, it looks pretty much impossible to override its distance calculation without rewriting the whole calculation with Turkish character equivalency in mind.
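
To make the idea concrete, here is a standalone sketch of such a weighted distance, independent of SymSpellPy. The set of equivalent letter pairs and the 0.25 cost are assumptions chosen for illustration, and `weighted_distance` is a hypothetical name.

```python
# Assumed equivalence pairs; the exact set intended in this PR is not stated.
TURKISH_PAIRS = {("c", "ç"), ("g", "ğ"), ("i", "ı"), ("o", "ö"), ("s", "ş"), ("u", "ü")}
EQUIVALENT = TURKISH_PAIRS | {(b, a) for a, b in TURKISH_PAIRS}


def weighted_distance(source, target, diacritic_cost=0.25):
    """Levenshtein distance where swapping a letter for its Turkish
    counterpart costs diacritic_cost instead of 1."""
    prev = [float(j) for j in range(len(target) + 1)]
    for i, s_ch in enumerate(source, start=1):
        curr = [float(i)]
        for j, t_ch in enumerate(target, start=1):
            if s_ch == t_ch:
                sub = 0.0
            elif (s_ch, t_ch) in EQUIVALENT:
                sub = diacritic_cost
            else:
                sub = 1.0
            curr.append(min(prev[j] + 1.0,       # deletion
                            curr[j - 1] + 1.0,   # insertion
                            prev[j - 1] + sub))  # substitution
        prev = curr
    return prev[-1]


print(weighted_distance("yuzulmuyor", "yüzülmüyor"))  # 0.75
print(weighted_distance("yuzulmuyor", "duyulmuyor"))  # 2.0
```

With a cost of 1.0 for every substitution, "yüzülmüyor" is 3 edits away from "yuzulmuyor" while "duyulmuyor" is only 2, which is consistent with the wrong candidate winning at max_edit_distance = 2.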

An approach that generates all possible combinations of Turkish umlauts in a word and picks the correction with the smallest edit distance among them (thus simulating Turkish character equivalency) yielded around 58% accuracy. However, because of the number of combinations it was far too slow, so it was scrapped.
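
A rough sketch of why this gets slow: with k ambiguous letters in a word there are 2**k variants, and each one needs its own lookup. The letter mapping and the name `diacritic_variants` are assumptions for illustration, not the scrapped implementation.

```python
from itertools import product

# Assumed ASCII <-> Turkish counterparts; the exact set is not listed in this PR.
COUNTERPART = {"c": "ç", "g": "ğ", "i": "ı", "o": "ö", "s": "ş", "u": "ü"}
COUNTERPART.update({v: k for k, v in COUNTERPART.items()})


def diacritic_variants(word):
    """Yield every variant of word obtained by optionally swapping each
    ambiguous letter with its counterpart (2**k variants for k such letters)."""
    options = [(ch, COUNTERPART[ch]) if ch in COUNTERPART else (ch,) for ch in word]
    for combo in product(*options):
        yield "".join(combo)


variants = list(diacritic_variants("yuzulmuyor"))
print(len(variants))             # 16
print("yüzülmüyor" in variants)  # True
```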

For now, the only way umlauts are compensated for is by also looking up the word's "flipped" version (e.g. when "yuzuyorum" is looked up, "yüzüyörüm" is also looked up).
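
A minimal sketch of the flipping step, assuming every ambiguous letter is swapped in both directions. The exact letter set used in the module is not spelled out here, and `flipped` and `lookup_candidates` are hypothetical names.

```python
# Assumed mapping, applied in both directions.
_ASCII = "cgiosu"
_TURKISH = "çğıöşü"
FLIP_TABLE = str.maketrans(_ASCII + _TURKISH, _TURKISH + _ASCII)


def flipped(word):
    """Swap every ambiguous letter with its counterpart."""
    return word.translate(FLIP_TABLE)


def lookup_candidates(word):
    """Return the forms to look up: the word itself and its flipped version."""
    alt = flipped(word)
    return [word] if alt == word else [word, alt]


print(lookup_candidates("yuzuyorum"))  # ['yuzuyorum', 'yüzüyörüm']
```

This costs one extra lookup per word instead of 2**k, at the price of missing words where only some of the letters need flipping.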

@askarbozcan askarbozcan changed the title Add spelling correction module [WIP] [resolves #190] Add spelling correction module [resolves #190] Feb 18, 2021
@askarbozcan askarbozcan marked this pull request as ready for review February 18, 2021 15:38
Development

Successfully merging this pull request may close these issues.

Add spelling correction
1 participant