Add spelling correction module [resolves #190] #213

askarbozcan · 2021-02-08T13:49:06Z

As the title says, this adds a spelling correction module that uses SymSpellPy as its backend. The vocabulary used for correction is a term frequency list of approximately 450k entries, created by merging OpenSubtitles (Turkish) and Turkish Wikipedia data.

~~Currently only utilizes only one of the possible ways to use SymSpellPy, namely using its lookup_compound() method which is not necessarily the best way to correct spelling.~~

The module is integrated as such:

d = Doc("Ali bubanın çiftliği")
d_fixed = d.get_spell_corrected()
print(d_fixed)  # "Ali babanın çiftliği"

TODO:

Add more aggressive/softer methods of spelling correction
As a default, load pickled term frequency vocabulary instead of from text (faster loading this way)
Notify user when the dictionary is being loaded (as it takes a few seconds)
~~Keep term frequency vocabulary in a bucket instead of LFS~~ (no need)
~~Customizable spelling correction (configs/overriding spelling correction class?)~~ (can be added in another PR)
Test its performance on dataset shown below and find decent default parameters
Unit tests
Fix punctuation preservation in "basic" mode when multiple punctuation marks are involved. ~~(Currently basic mode tests fail)~~
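
The pickling and load-notification items above could look roughly like the sketch below. The file layout (one `term count` pair per line), the function name `load_term_frequencies` and its parameters are assumptions made for illustration, not the module's actual interface.

```python
import pickle
from pathlib import Path


def load_term_frequencies(text_path, pickle_path, ignore_pickle=False):
    """Load a term -> count dict, preferring a pickled cache when one exists."""
    text_path, pickle_path = Path(text_path), Path(pickle_path)

    # Fast path: reuse the cache written by an earlier call.
    if pickle_path.exists() and not ignore_pickle:
        with pickle_path.open("rb") as f:
            return pickle.load(f)

    # Slow path: parse the text file and tell the user why it takes a while.
    print("Loading spelling dictionary from text, this may take a few seconds...")
    freqs = {}
    with text_path.open(encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            term, count = parts
            freqs[term] = int(count)

    # Write (or refresh) the cache so the next load is fast.
    with pickle_path.open("wb") as f:
        pickle.dump(freqs, f)
    return freqs
```

Passing `ignore_pickle=True` forces a re-parse of the text file and refreshes the cache, which is useful after the vocabulary has been edited.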

Dataset to test on:
https://github.com/StarlangSoftware/Dictionary/blob/master/src/main/resources/turkish_misspellings.txt

EDIT:
Result (best) max_edit_distance = 2
Accuracy: 51%
Most of the mistakes were in words with misplaced (or omitted) Turkish umlaut-letters:
e.g. "yuzulmuyor" was corrected to "duyulmuyor" when it should have been "yüzülmüyor"
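
For reference, an accuracy figure like the one above can be computed with a loop along these lines. This is a minimal sketch: the `accuracy` helper and the assumption that the dataset has already been parsed into `(misspelled, expected)` pairs are illustrative, and `correct` stands for any callable that returns a corrected word.

```python
def accuracy(pairs, correct):
    """Fraction of (misspelled, expected) pairs where correct(misspelled) == expected."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(1 for wrong, expected in pairs if correct(wrong) == expected)
    return hits / len(pairs)


# Toy corrector that reproduces the failure case described above.
toy_corrections = {"yuzulmuyor": "duyulmuyor", "bubanın": "babanın"}
pairs = [("yuzulmuyor", "yüzülmüyor"), ("bubanın", "babanın")]
print(accuracy(pairs, lambda w: toy_corrections.get(w, w)))
```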

Two (orthogonal to each other) ways to bring accuracy to 90%+:

Prioritize fixing wrongly placed Turkish characters first.
Use FastText embeddings to pick the best candidate based on semantic meaning of the word and its neighbours.

These improvements are left to other PRs as this PR is already getting a bit too large.

resolves #190

askarbozcan · 2021-02-13T17:06:46Z

Note to self:
Modify the symspellpy distance calculation in such a way that changing Turkish umlaut-characters to English counterparts (ü -> u, ç->c) and vice versa (u -> ü, c -> ç) has a smaller edit distance compared to changing any other characters.

EDIT: After a thorough reading of SymSpellPy's source code, it looks pretty much impossible to override symspellpy's distance without rewriting the whole distance calculation itself with Turkish character equivalency in mind.
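
Independently of symspellpy, the idea of a cheaper substitution cost for Turkish/ASCII letter pairs can be illustrated with a standalone weighted Levenshtein distance. This is only a sketch: the letter pairs, the `diacritic_cost` value of 0.25 and the function name are assumptions, and nothing here plugs into symspellpy.

```python
# Turkish letters paired with their ASCII look-alikes (assumed set, lowercase only).
TURKISH_PAIRS = {("c", "ç"), ("g", "ğ"), ("i", "ı"), ("o", "ö"), ("s", "ş"), ("u", "ü")}
EQUIVALENT = TURKISH_PAIRS | {(b, a) for a, b in TURKISH_PAIRS}


def weighted_distance(source, target, diacritic_cost=0.25):
    """Levenshtein distance where swapping a letter for its Turkish/ASCII twin is cheap."""
    prev = [float(j) for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        curr = [float(i)]
        for j, t in enumerate(target, start=1):
            if s == t:
                sub = 0.0
            elif (s, t) in EQUIVALENT:
                sub = diacritic_cost
            else:
                sub = 1.0
            curr.append(min(prev[j] + 1.0,        # deletion
                            curr[j - 1] + 1.0,    # insertion
                            prev[j - 1] + sub))   # substitution
        prev = curr
    return prev[-1]


print(weighted_distance("yuzulmuyor", "yüzülmüyor"))  # three cheap substitutions
print(weighted_distance("yuzulmuyor", "duyulmuyor"))  # two full-cost substitutions
```

Under this weighting the intended correction "yüzülmüyor" ends up closer to the input than "duyulmuyor", which is the ordering the note above is after.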

An approach of simply generating all possible combinations of Turkish umlauts in a word and picking the correction with the smallest edit distance among them (thus simulating Turkish character equivalency) yielded around 58% accuracy. However, because of the number of possible combinations it was far too slow, so it was scrapped.
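
To show why this gets slow, here is a rough sketch of the candidate generation. The `VARIANTS` table and the function name are assumptions, and only the ASCII-to-Turkish direction is covered; the number of candidates doubles with every ambiguous letter in the word.

```python
from itertools import product

# ASCII letter -> the forms it may stand for (itself or its Turkish counterpart).
VARIANTS = {"c": "cç", "g": "gğ", "i": "iı", "o": "oö", "s": "sş", "u": "uü"}


def diacritic_variants(word):
    """Yield every spelling obtained by optionally restoring Turkish diacritics."""
    choices = [VARIANTS.get(ch, ch) for ch in word]
    for combo in product(*choices):
        yield "".join(combo)


candidates = list(diacritic_variants("yuzulmuyor"))
print(len(candidates))  # 2 ** 4, since the word has four ambiguous letters (u, u, u, o)
```

Each of these candidates would then need its own lookup, so a word with ten ambiguous letters already costs over a thousand lookups.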

For now, the only way umlauts are compensated for is by also looking up the "flipped" version of a word (i.e. when "yuzuyorum" is looked up, "yüzüyörüm" is also looked up).
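
A minimal sketch of the "flipped" lookup, assuming the flip swaps every letter with its Turkish/ASCII counterpart in both directions. The translation table, the function names and the plain set-membership lookup are illustrative simplifications, not the code in this PR.

```python
# Swap each letter with its Turkish/ASCII counterpart (lowercase only, assumed set).
FLIP = str.maketrans("cgiosuçğıöşü", "çğıöşücgiosu")


def flipped(word):
    """Return the word with every flippable letter swapped for its counterpart."""
    return word.translate(FLIP)


def lookup_with_flipped(word, vocabulary):
    """Return the word or its flipped form, whichever is in the vocabulary first."""
    for candidate in (word, flipped(word)):
        if candidate in vocabulary:
            return candidate
    return None


print(flipped("yuzuyorum"))
print(lookup_with_flipped("uzum", {"üzüm", "baba"}))
```

This costs only one extra lookup per word, but it only helps when all flippable letters in the word are wrong in the same direction.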

askarbozcan · 2021-06-10T12:16:22Z

As an extra note, see this: https://towardsdatascience.com/spelling-correction-how-to-make-an-accurate-and-fast-corrector-dc6d0bcbba5f

askarbozcan added 2 commits February 8, 2021 15:10

Add initial SpellingCorrector class and related term frequency vocab

174f30c

Integrate spelling correction as a Doc method

14fc3b8

askarbozcan requested a review from husnusensoy February 8, 2021 13:49

askarbozcan linked an issue Feb 8, 2021 that may be closed by this pull request

Add spelling correction #190

Open

askarbozcan added 3 commits February 9, 2021 18:37

Add basic spelling mode (word by word spelling correction)

8173198

Add basic_compound mode + ability to select mode of correction

7221d50

Implement pickling loaded term frequency dictionary on first use

518e2c9

askarbozcan changed the title ~~Add spelling correction module [WIP]~~ Add spelling correction module [WIP] [resolves #190] Feb 11, 2021

askarbozcan added 3 commits February 13, 2021 14:28

Add option to ignore pickled dictionary

5a064fb

Small doc fix

19197b0

Added "flipped" lookup based on results from testing on the dataset

dc3858d

Askar Bozcan and others added 4 commits February 17, 2021 15:45

Change doc format to Numpy style

09575ee

Add tests

2095b41

Fix test bug

e0c430a

Fix punctuation preservation bugs

983912b

askarbozcan changed the title ~~Add spelling correction module [WIP] [resolves #190]~~ Add spelling correction module [resolves #190] Feb 18, 2021

askarbozcan marked this pull request as ready for review February 18, 2021 15:38

irmakyucel mentioned this pull request Jul 7, 2021

Spelling Correction Test Results #283

Open
