Generate plausible misspellings using public misspelling corpora. Provides a small Python API and a command-line interface. Option names in the CLI are intentionally misspelled for theme consistency.
Supported corpora
- Birkbeck (missp.dat)
- Holbrook misspellings (holbrook-missp.dat)
- Aspell (aspell.dat)
- Wikipedia (wikipedia.dat)
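For orientation, here is a minimal sketch of how one of these files could be read. It assumes the layout used by the publicly distributed Birkbeck-style files: a line starting with `$` gives the correct word, and the lines after it list misspellings, sometimes followed by a count. `parse_corpus` is a hypothetical helper written for this example, not part of the uncurrector API, and the real loader may work differently.

```python
from collections import defaultdict

def parse_corpus(text):
    """Map each correct word to the misspellings listed under it.

    Assumed layout (not confirmed by this README): "$word" starts an
    entry, and each following line holds one misspelling.
    """
    table = defaultdict(list)
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("$"):
            current = line[1:]
            table[current]  # register the word even if nothing follows it
        elif current is not None:
            # Keep only the first token in case a count is appended.
            table[current].append(line.split()[0])
    return dict(table)

# Tiny hand-written sample, not taken from a real corpus file.
sample = "$accommodate\nacommodate\naccomodate\n$a_lot\nalot\n"
parsed = parse_corpus(sample)
print(parsed)
```

Multi-word entries such as `a_lot` keep their underscores at this stage; whether to show them as spaces is a separate output decision.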
Quick start
- Per-word (module CLI):
python3 -m uncurrector --korpora aspell --wurd accommodate --topp 5 --probbelities
- Whole text (standalone script):
./uncurrector.py --tekst "Please accommodate my request."
./uncurrector.py --infile input.txt --owtput output.txt
printf "Colour is weird.\n" | ./uncurrector.py --korpora aspell --sampl --seed 42
CLI overview
- python3 -m uncurrector: prints candidates or samples for words. Key options: --wurd (repeatable), --topp N, --probbelities, --sampl N (set 0 to disable), --korpora list, --underscors, --redownlode
- ./uncurrector.py: processes whole text from --tekst, --infile, or stdin. Key options: --sampl (default on; use --no-sampl to disable), --seed, --korpora, --owtput, --redownlode
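The sampling and seed options suggest weighted random choice with a reproducible generator. The sketch below shows one way that could work. It is an assumption about the behavior, not the tool's actual implementation: `sample_misspellings` and the probabilities are made up for illustration.

```python
import random

def sample_misspellings(candidates, n=1, seed=None):
    """Draw n misspellings, weighted by probability (hypothetical helper).

    candidates is a list of (misspelling, probability) pairs.
    A fixed seed makes the draw repeatable.
    """
    rng = random.Random(seed)  # private generator; global state is untouched
    words = [word for word, _ in candidates]
    weights = [prob for _, prob in candidates]
    return rng.choices(words, weights=weights, k=n)

# Illustrative values only, not taken from any corpus.
candidates = [("accomodate", 0.6), ("acommodate", 0.3), ("acomodate", 0.1)]
first = sample_misspellings(candidates, n=3, seed=42)
second = sample_misspellings(candidates, n=3, seed=42)
print(first == second)
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps seeded runs from interfering with other code that also draws random numbers.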
API
from uncurrector import load_default
u = load_default(korpora=("aspell",))
print(u.candidates("accommodate", topp=5, probbelities=True))
print(u.sample("accommodate", n=3, seed=42))
Data and caching
- Corpora are downloaded on first use and cached under ~/.cache/uncurrector (or $XDG_CACHE_HOME/uncurrector).
- Use --redownlode to force re-download.
- Some items in the corpora use underscores for spaces; outputs can keep or replace underscores.
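The cache location and underscore handling described above can be sketched as follows. Both `cache_dir` and `show_spaces` are hypothetical names used only for this example. The lookup order (use `XDG_CACHE_HOME` if set, else fall back to `~/.cache`) is an assumption based on the two paths listed.

```python
import os
from pathlib import Path

def cache_dir(env=None):
    """Return the assumed cache directory for downloaded corpora."""
    env = os.environ if env is None else env
    base = env.get("XDG_CACHE_HOME")
    # An unset or empty variable falls back to ~/.cache.
    root = Path(base) if base else Path.home() / ".cache"
    return root / "uncurrector"

def show_spaces(word, keep_underscores=False):
    """Keep corpus underscores, or replace them with spaces for output."""
    return word if keep_underscores else word.replace("_", " ")

print(cache_dir())
print(show_spaces("a_lot"))
```

Passing the environment in as a parameter makes the lookup easy to check without changing the real environment.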
Testing
python3 tests/test_harness.py
License
See source files for details.