PyAutoSpell

Why Yet Another Spellchecker in Python (YASP)? I was looking for a text pre-processing component that would (1) identify misspellings and provide candidate corrections, and (2) automatically incorporate corrections into a final, corrected text using techniques such as candidate-in-context re-ranking via ngram probabilities. While there are plenty of Python libraries that perform (1) reasonably well, I've yet to come across anything that also does (2). So I wrote PyAutoSpell. Given a potentially misspelled text, PyAutoSpell identifies the misspellings in it, generates a set of candidate corrections for each one, and then uses an ngram model to estimate the probability that each candidate is the best correction for that misspelling in context. It then incorporates the highest-probability correction into the final, corrected text.
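To make the re-ranking idea concrete, here is a minimal toy sketch. It is not PyAutoSpell's implementation: it swaps the KenLM model for a tiny add-one smoothed bigram model built from a few made-up sentences, and the candidate lists are hypothetical stand-ins for what a dictionary lookup might suggest. The names `bigram_logprob` and `rerank` are invented for illustration.

```python
import math
from collections import Counter

# Tiny stand-in corpus (an assumption for illustration only); a real system
# would query a large n-gram language model instead.
CORPUS = (
    "it is my belief that the alleys in london do not present a record of sin "
    "it is my experience that the countryside is beautiful "
    "my belief is founded upon my experience"
).split()

UNIGRAMS = Counter(CORPUS)
BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))
VOCAB_SIZE = len(UNIGRAMS)


def bigram_logprob(prev, word):
    """Add-one smoothed log P(word | prev) estimated from the toy corpus."""
    return math.log((BIGRAMS[(prev, word)] + 1) / (UNIGRAMS[prev] + VOCAB_SIZE))


def rerank(left, candidates, right):
    """Order candidates by how well each fits between its left and right neighbours."""
    def score(candidate):
        return bigram_logprob(left, candidate) + bigram_logprob(candidate, right)
    return sorted(candidates, key=score, reverse=True)


# "do noit present": rank hypothetical candidates for the misspelling "noit".
ranked = rerank("do", ["knit", "not", "note"], "present")
print(ranked[0])
```

The point of the sketch is that the surrounding words, not just edit distance, decide which candidate wins. A larger model and a wider context window would presumably make the same approach far more reliable.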

Setup

The following has been tested on OSX, High Sierra.

After cloning the project, cd to PyAutoSpell and then do

python3 -m pip install -r requirements.txt

(Note that I include python3 -m before my pip call since I have multiple pythons installed on my system and I want the package installed for my python3 install; your setup might be different).

The spacy dependency requires an en language model. Do

python3 -m spacy download en

One installation pitfall that OSX users should be mindful of when installing the CyHunspell dependency: during install, you'll very likely get an Operation not permitted error when pip tries to install one of CyHunspell's dependencies, six. Deal with this by editing the pip install command above as follows:

python3 -m pip install -r requirements.txt --ignore-installed six

You should confirm that CyHunspell and the rest of the dependencies listed in requirements.txt have been correctly installed by running pip freeze.
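If you'd rather check from inside Python, something like the following sketch may help. The helper `missing_modules` is hypothetical, and the import names are assumptions: CyHunspell is imported as `hunspell`, while `spacy` and `six` import under their own names. Adjust the list to match whatever requirements.txt actually contains.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found by this interpreter."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Assumed top-level import names for the dependencies mentioned above.
print(missing_modules(["hunspell", "spacy", "six"]))
```

An empty list means every module in the list is importable from the interpreter you ran the check with.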

Then, install PyAutoSpell by doing

python3 -m pip install .

Along with the Python dependencies included in requirements.txt, PyAutoSpell requires two additional resources: a hunspell dictionary and a KenLM language model in binary format. The former is available from the usual sources. (I've also included a recent hunspell English dictionary in the resources folder, so you can go ahead and use that.) For demonstration purposes I've trained a relatively small (~53 million trigrams) KenLM model on a subset of Wikipedia, which you can access here. This model works reasonably well, but you'll have to train a larger model if you want better performance. The usage example below assumes that both of these files have been placed in the resources folder.

Usage

Python 3.6.5 (default, Apr 25 2018, 14:23:58) 
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyautospell
>>> from pyautospell.autochecker import AutoChecker
>>> spell_checker = AutoChecker.Builder().dictionary("./resources/hunspell-en_US-2018.04.16/").dictionary_name("en_US").language_model('./resources/wiki_lm_truncated_c.klm').build()
>>> auto_checker = AutoChecker(spell_checker)
>>> corrections = auto_checker.correct_spelling("It is my beleif, Watson, founded upon my esperience, that the low3est and vilest alleys in London do noit present a more dr3eadful record of sin than does the smiling and beautiful countryside")
<class 'list'>
>>> misspellings = corrections.misspellings
>>> for misspelling in misspellings:
...         print("misspelled : ", misspelling.word)
... 
misspelled :  beleif
misspelled :  esperience
misspelled :  low3est
misspelled :  noit
misspelled :  dr3eadful
>>> print("corrected text: ",corrections.corrected_text)
corrected text:  It is my belief, Watson, founded upon my experience, that the lowest and vilest alleys in London do not present a more disregardful record of sin than does the smiling and beautiful countryside