A manual spell checker built on pyenchant that allows you to swiftly correct misspelled words.
While working on a text-based multi-class classification competition, I noticed that the data contained many misspelled words that existing automated spell-check packages couldn't fix, because the data had been compiled from a survey of people who weren't native English speakers. As there weren't many samples in the dataset (~1000), I decided to write some code to detect spelling errors automatically so that I could fix them manually, and thus this package was born.
pip install manual_spellchecker
- All features as provided by pyenchant
- Quickly analyze and get a list of all misspelled words
- Can replace, skip and delete misspelled words
- Use your favourite tokenizer for splitting words
- Replace misspelled words with one of the provided suggestions by simply typing in its index (indexing starts from 1, not 0)
- Can checkpoint current set of corrections
- Contextualized pretty printing for easy visual correction (works on both command line and notebook)
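To give a rough idea of what contextual display means here, the sketch below shows one way a misspelled word could be printed together with a fixed number of neighbouring words on each side. This is not the package's own code: the function name `context_window`, its parameters, and the square-bracket marking are all assumptions made for illustration.

```python
def context_window(tokens, error_index, n_words=5):
    """Return the token at error_index with up to n_words neighbours per side.

    Hypothetical helper for illustration; not part of manual_spellchecker.
    """
    # Clamp the window so it never runs past either end of the token list
    start = max(0, error_index - n_words)
    end = min(len(tokens), error_index + n_words + 1)
    left = tokens[start:error_index]
    right = tokens[error_index + 1:end]
    # Mark the misspelled token so it stands out in plain-text output
    marked = f"[{tokens[error_index]}]"
    return " ".join(left + [marked] + right)


tokens = "the quick brwon fox jumps over the lazy dog".split()
print(context_window(tokens, 2, n_words=2))  # the quick [brwon] fox jumps
```

A smaller window keeps the output compact, while a larger one gives more context for ambiguous words.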
# Initialize the spell checking object
__init__(dataframe, column_names, tokenizer=None, num_n_words_dis=5, save_path=None)
- dataframe - Takes a pandas dataframe as input
- column_names - Pass the column name(s) upon which you want to perform spelling correction
- tokenizer=None - Pass your favourite tokenizer like nltk or spacy, etc. (Default: splits on space)
- num_n_words_dis=5 - This decides how many neighbouring words to display on either side of the error
- save_path=None - If a save path is provided, the final corrected dataframe is saved as a csv. (Default: the dataframe is not saved but returned)
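If you would rather not depend on nltk or spacy, a small custom tokenizer may be enough. The sketch below assumes that the `tokenizer` argument accepts any callable that takes a string and returns a list of word strings, which is how nltk's `word_tokenize` behaves. The name `simple_tokenizer` and the regular expression are illustrative choices, not part of the package.

```python
import re


def simple_tokenizer(text):
    """Split text into word tokens, dropping punctuation and digits.

    Hypothetical example tokenizer; assumes a str -> list[str] callable
    is what the tokenizer parameter expects.
    """
    # Match runs of letters, allowing internal apostrophes (e.g. "Don't")
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)


print(simple_tokenizer("Hello, wrold! Don't panic."))
# ['Hello', 'wrold', "Don't", 'panic']
```

Under that assumption it would be passed in the same position as any other tokenizer, for example `spell_checker(df, "text", simple_tokenizer)`.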
# For quick analysis of all the misspelled words
spell_check()
# Returns a list of all the misspelled words
get_all_errors()
# Starts the process of correcting erroneous words
correct_words()
Important Note:
- Type -999 into the input box to stop the error correction and save the current progress (if save_path is provided)
- Simply press enter if you want to skip the current word
- Type in "" or '' in the input box to delete a misspelled word
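The input rules above, together with the 1-based suggestion indices, can be summarised in a short sketch. It is only an illustration of the documented behaviour, not the package's implementation: the function `interpret_input` and the action names are hypothetical, and because the handling of any other input is not described here, the sketch simply flags it as invalid.

```python
def interpret_input(raw, suggestions):
    """Map one line of user input to an (action, value) pair.

    Hypothetical helper for illustration; not part of manual_spellchecker.
    """
    text = raw.strip()
    if text == "-999":
        # Stop correcting (progress is saved if a save path was given)
        return ("stop", None)
    if text == "":
        # Pressing enter skips the current word
        return ("skip", None)
    if text in ('""', "''"):
        # An empty pair of quotes deletes the misspelled word
        return ("delete", None)
    if text.isdigit():
        index = int(text)
        # Suggestion indices start from 1, not 0
        if 1 <= index <= len(suggestions):
            return ("replace", suggestions[index - 1])
    # Assumption: anything else is flagged rather than guessed at
    return ("invalid", None)


suggestions = ["brown", "brawn", "brow"]
print(interpret_input("1", suggestions))     # ('replace', 'brown')
print(interpret_input("", suggestions))      # ('skip', None)
print(interpret_input("''", suggestions))    # ('delete', None)
print(interpret_input("-999", suggestions))  # ('stop', None)
```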
import pandas as pd
from manual_spellchecker import spell_checker
# Read the data
df = pd.read_csv("Train.csv")
# Initialize the model
ob = spell_checker(df, "text")
# Quick analysis
ob.spell_check()
# Read the data
df = pd.read_csv("Train.csv")
# Initialize the model
ob = spell_checker(df, ["text", "label"])
# Quick analysis
ob.spell_check()
# Import nltk's word tokenizer
from nltk import word_tokenize
# Read the data
df = pd.read_csv("Train.csv")
# Initialize the model
ob = spell_checker(df, "text", word_tokenize)
# Quick analysis
ob.spell_check()
# Read the data
df = pd.read_csv("Train.csv")
# Initialize the model
ob = spell_checker(df, "text")
# Quick analysis. This needs to be performed before getting all errors
ob.spell_check()
# Returns a list of all errors
ob.get_all_errors()
# Read the data
df = pd.read_csv("Train.csv")
# Initialize the model
ob = spell_checker(df, "text")
# Start corrections
ob.correct_words()
# Read the data
df = pd.read_csv("Train.csv")
# Initialize the model
ob = spell_checker(df, "text", save_path="correct_train_data.csv")
- Will be adding automated, contextual error corrections
Drop me an email at atif.hit.hassan@gmail.com if you would like a particular feature.