- spell_check(list_text, deep_search=True, context_search=True, sim_char_search=True)
  Replaces the misspelled words in the given list of texts using various strategies, without training.
  - list_text: python list of texts.
  - deep_search: default=True | enables / disables deep-learning search.
  - context_search: default=True | enables / disables context search.
  - sim_char_search: default=True | enables / disables similar character search.
- train_check(list_text, deep_search=True, context_search=True, sim_char_search=True)
  Trains on the words in the given list and updates misspell.json, then replaces the misspelled words in the given list of texts using various strategies.
  - list_text: python list of texts.
  - deep_search: default=True | enables / disables deep-learning search.
  - context_search: default=True | enables / disables context search.
  - sim_char_search: default=True | enables / disables similar character search.
- train(list_text)
  Trains on the words in the given list and updates misspell.json.
  - list_text: python list of texts.
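For clarity, here is a minimal usage sketch of the three functions above. The `jsonspeller` import path is an assumption; adjust it to however the package actually exposes them.

```python
# Minimal usage sketch; the `jsonspeller` import path is an assumption.
from jsonspeller import spell_check, train, train_check

tweets = [
    "i love this wolrd so much",
    "covid is sperading fast",
]

# Correct misspellings without updating misspell.json:
fixed = spell_check(tweets, deep_search=True, context_search=True, sim_char_search=True)

# Learn from the texts (updates misspell.json), then correct them:
fixed_after_training = train_check(tweets)

# Only learn: update misspell.json without correcting anything.
train(tweets)

print(fixed)
```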
- COVID19 Tweets by @Gabriel Preda
- Disaster Tweets by @Viktor Stepanenko
- Bitcoin tweets by @Alexandre Bouillet
- More in line...
(Results to be observed once training is done.)
- JsonSpeller
- autocorrect
- textblob
- symspellpy
- pyspellcheck
There wasn't a Korean spellchecker that I really liked... so I'm building one... and as a trial run, I made the English version first!
While working on my final project, I realized that there are not many great spellcheckers for Korean.
Therefore, I decided to create one.
I named it JsonSpeller because it works based on a JSON file, not because my name is Jason.
Before diving into my spellchecker, here are the results of the different spellcheckers we tried.
We had around 2,000 texts to spellcheck.
Model | Time | Result | Tested By |
---|---|---|---|
T5 | 2.5 hours | Does a good job, but some hate speech gets replaced with the label "hate-speech" while some is left as is. | @JasonHeesangLee |
T5 - detailed prompt | 4 hours | I'll have to modify the prompt again to try, but for now it only outputs ####Hello#### millions of times... | @JasonHeesangLee |
ChatGPT API - simple prompt | 2.5 hours | Good, but too good... For hate speech, it cleanses the term and explains the situation (may damage the original if used as is). Most importantly, it costs... | @SionBang |
ChatGPT - detailed prompt | 2.5 hours | Good; hate speech is labeled as [OFF], and when a single word is all there is to the text, it explains that word (may damage the original if used as is). It also costs... | @JasonHeesangLee |
Rule-based spacing module with khaiii POS-tagger by Kakao Corp | 22 seconds | 95% of spacing corrected (maybe up to 100% if the rules are developed further). However, it doesn't correct spellings; it only re-concatenates consonants and vowels when they are separated (which happens often when typing fast). It may be useful when ensembled with other spellchecking models. | @JasonHeesangLee |
symspellpy-ko | 20 minutes | A model built on symspellpy. Since it checks spelling against an external dictionary, some new terms can be left out or wrongly corrected. (If a perfect word-list existed, I believe this would be the best model in theory, but we all know the perfect word-list doesn't exist.) | @JasonHeesangLee |
Pusan University Spellchecking API | 40 ~ 60 minutes | Performs well at spacing, sanitizing profanity, and separating sentences, but doesn't understand neologisms, people's names, etc., and returns incorrect spellings. It also cannot be used for deep learning, as the service is free and offered to the public. | @NayoungBae |
soynlp | 20 minutes for training (30,000 sentences), 3 seconds for inference | Some neologisms were well maintained while some were not. For other normal terms, it also returns bad spacing. | @SionBang |
hanspell | No idea | Errored in every teammate's environment. | ALL |
symspellpy-ko would have been the best model among the above, if only a perfect word-list, including neologisms, existed.
Then I thought: "What if we make typos for words on purpose and put them in the dictionary, so they can be replaced?"
But creating typos ourselves would be too inefficient, even if we generated them with LLMs.
Below is the modified version of this thought.
(The current version I have developed can only perform spellchecking on English; it will be extended to Korean soon.)
Please kindly go through it and tell me if this method would be too inefficient, or if there are any problems I haven't thought of.
Disclaimer: We are working on extracting keywords from users' daily records and segmenting them into "Positive" and "Negative" emotions, as explained in this discussion, and this JsonSpeller was initially developed solely for that project.
A term must appear at least twice to be considered a keyword.
For nouns that appear only once, if the spelling is correct (i.e., the word is in the Korean word-list), we extract it as a keyword; otherwise we exclude it.
Even if such a word were included, the user might say: "What is this? It's not a word I wrote down."
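In code, this keyword rule might look like the sketch below; `korean_wordlist` is a hypothetical stand-in for the project's actual dictionary.

```python
from collections import Counter

def extract_keywords(nouns: list[str], wordlist: set[str]) -> list[str]:
    """Sketch of the keyword rule: keep nouns appearing at least twice;
    keep single-occurrence nouns only if they are spelled correctly
    (i.e., present in the wordlist); exclude everything else."""
    counts = Counter(nouns)
    keywords = []
    for noun, freq in counts.items():
        if freq >= 2:
            keywords.append(noun)      # frequent enough to trust as a keyword
        elif noun in wordlist:
            keywords.append(noun)      # appears once, but correctly spelled
        # appears once and not in the wordlist -> excluded
    return keywords

# Hypothetical usage; a real korean_wordlist would be loaded from a dictionary file.
korean_wordlist = {"하늘", "바다", "학교"}
print(extract_keywords(["하늘", "하늘", "학교", "학꾜"], korean_wordlist))
# -> ['하늘', '학교'] ('학꾜' appears once and is misspelled, so it is excluded)
```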
1. Prepare tweet texts or any texts that are written informally (not news articles) for pretraining.
2. List up the words that appear more than once.
   This threshold could be adjusted after a certain number of pretraining passes, once we have enough data to set it.
3. Extract the words in the possible-misspelled-terms category for those correct words from the text, and put them into a dictionary in the form key (misspelled word) - value_array (possible correct terms).
   3-1. What are the possible-misspelled-terms and possible-correct-terms categories?
   - possible-misspelled-terms category: misspelled terms like "wolrd", "word", "owrld", etc. when the correct word is "world".
     However, the words in this category must appear only once in the text.
     If a word appears more than once, it is added to the known-words list mentioned above. Since there are real words like "word", it is likely that the user meant "world" when "world" appears multiple times and "word" appears only once.
     We also exclude actual words that are in the dictionary, because we don't want future uses of "word" to be recognized as "world".
   - possible-correct-terms category: the known, real words that a misspelled term could stand for; these make up the value_array.
   For example, a key - value_array entry will look like: "wword" : ["word", "world", "work", ... etc.]
4. Iterate over the dictionary and, when a noun in the string matches a key, replace it with a word from its value_array (a runnable sketch of these steps follows this list).
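Here is a sketch of steps 2 - 4 under simplifying assumptions: the possible-misspelled-terms test is approximated by an edit distance of at most 1 (counting adjacent transpositions, so "wolrd" - "world" qualifies), and plain whitespace tokens stand in for the noun tagger.

```python
import json
from collections import Counter

def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: Levenshtein plus adjacent
    transpositions, so "wolrd" -> "world" counts as a single edit."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def build_misspell_dict(texts, wordlist, max_edits=1):
    """Steps 2-3: words seen more than once are 'known'; words seen exactly
    once, and not in the real-word dictionary, become misspelling keys."""
    words = [w for t in texts for w in t.lower().split()]
    counts = Counter(words)
    known = {w for w, c in counts.items() if c > 1}
    once = [w for w, c in counts.items() if c == 1 and w not in wordlist]
    misspell = {}
    for w in once:
        candidates = [k for k in known if osa_distance(w, k) <= max_edits]
        if candidates:
            misspell[w] = candidates  # key: misspelled word, value_array: possible correct terms
    return misspell

def replace_with_dict(text, misspell):
    """Step 4: replace tokens that match a misspelling key (naively taking
    the first candidate; the real module would rank the candidates)."""
    return " ".join(misspell[w][0] if w in misspell else w for w in text.lower().split())

texts = ["hello world", "the world is big", "what a wolrd"]
misspell = build_misspell_dict(texts, wordlist={"word"})
# json.dump(misspell, open("misspell.json", "w"))   # persisting, as misspell.json
print(misspell)                                     # {'wolrd': ['world']}
print(replace_with_dict("what a wolrd", misspell))  # what a world
```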
- If this is applied in a real-world service, it is possible to create separate JSON files of frequently used typos for each user and use them continuously. (Personalization of misspelling processing?)
- Since the process only runs for words recognized as nouns in the sentence (Korean language only), the processing time does not grow proportionally as the sentence gets longer.
- Many variants of typos can be collected quickly when combining these personalized JSON files (see the sketch after this list).
- There is no need to build a dictionary of neologisms, because the words people use often will already be among the known words (appearing more than once).
- We can use different datasets to pretrain on a certain topic - data science, sports, medical, etc., whatever topic you think of.
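Combining the personalized files could be a straightforward union of the value arrays; the per-user file layout here is an assumption based on the key - value_array format above.

```python
import json
from pathlib import Path

def merge_misspell_files(paths):
    """Merge per-user misspell JSON files into one typo dictionary,
    unioning the candidate arrays and keeping them deduplicated."""
    merged = {}
    for path in paths:
        user_dict = json.loads(Path(path).read_text(encoding="utf-8"))
        for typo, candidates in user_dict.items():
            bucket = merged.setdefault(typo, [])
            for c in candidates:
                if c not in bucket:
                    bucket.append(c)
    return merged

# Hypothetical usage, assuming one file per user:
# merged = merge_misspell_files(["user_a_misspell.json", "user_b_misspell.json"])
```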
- Some words could be left out or misjudged.
- It's likely to take a long time to go through each word.
- But... will there really be that many typo variants for one word...?
- Users' behavior has to go exactly according to plan for all of this to work.
Last but not least, a great THANK YOU to my teammates for being patient with me! @nayoungbae @chanhyukhan @bangsioni