j33433/uncurrector

uncurrector

Generate plausible misspellings using public misspelling corpora. Provides a small Python API and a command-line interface. Option names in the CLI are intentionally misspelled for theme consistency.

Supported corpora

  • Birkbeck (missp.dat)
  • Holbrook misspellings (holbrook-missp.dat)
  • Aspell (aspell.dat)
  • Wikipedia (wikipedia.dat)
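
As a rough illustration of how such corpus files can be read, the sketch below assumes the layout used by the publicly distributed versions of these corpora: a line starting with `$` names the correctly spelled word, and the lines after it list observed misspellings (optionally followed by a count, as in the Holbrook file). The function name `parse_dat` is hypothetical and is not part of uncurrector's API.

```python
from collections import defaultdict


def parse_dat(lines):
    """Map each correct word to the misspellings listed after it.

    Assumed format: "$word" starts a new entry; following lines are
    misspellings, possibly with a trailing count ("color 3").
    """
    table = defaultdict(list)
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("$"):
            current = line[1:]
            table[current]  # create the entry even if nothing follows
        elif current is not None:
            # Keep only the word; drop any trailing count.
            table[current].append(line.split()[0])
    return dict(table)


sample = ["$accommodate", "acommodate", "accomodate", "$colour", "color 3"]
print(parse_dat(sample))
```

The sample lines are made up for the example; real files should be checked against this assumed layout before relying on it.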

Quick start

  • Per-word (module CLI):
  python3 -m uncurrector --korpora aspell --wurd accommodate --topp 5 --probbelities
  • Whole text (standalone script):
  ./uncurrector.py --tekst "Please accommodate my request."
  ./uncurrector.py --infile input.txt --owtput output.txt
  printf "Colour is weird.\n" | ./uncurrector.py --korpora aspell --sampl --seed 42

CLI overview

  • python3 -m uncurrector prints candidates or samples for individual words. Key options: --wurd (repeatable), --topp N, --probbelities, --sampl N (set 0 to disable), --korpora list, --underscors, --redownlode.

  • ./uncurrector.py processes whole text from --tekst, --infile, or stdin. Key options: --sampl (on by default; use --no-sampl to disable), --seed, --korpora, --owtput, --redownlode.
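
To make the option semantics above concrete, here is a minimal `argparse` sketch of how the whole-text options could be declared. It is not the project's actual parser: the `nargs`, types, and defaults are assumptions, and only the option names come from this README. The paired `--sampl` / `--no-sampl` switch uses `argparse.BooleanOptionalAction`, which requires Python 3.9 or later.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the whole-text options listed above."""
    p = argparse.ArgumentParser(prog="uncurrector.py")
    p.add_argument("--tekst", help="text to process (otherwise --infile or stdin)")
    p.add_argument("--infile", help="path of the input file")
    p.add_argument("--owtput", help="path of the output file")
    # Assumption: corpora are given as one or more space-separated names.
    p.add_argument("--korpora", nargs="+", default=None)
    # BooleanOptionalAction creates both --sampl and --no-sampl.
    p.add_argument("--sampl", action=argparse.BooleanOptionalAction, default=True)
    p.add_argument("--seed", type=int, default=None)
    p.add_argument("--redownlode", action="store_true")
    return p


args = build_parser().parse_args(
    ["--tekst", "Colour is weird.", "--korpora", "aspell", "--no-sampl", "--seed", "42"]
)
print(args.sampl, args.seed, args.korpora)
```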

API

from uncurrector import load_default

u = load_default(korpora=("aspell",))
print(u.candidates("accommodate", topp=5, probbelities=True))
print(u.sample("accommodate", n=3, seed=42))
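
The `seed` argument suggests that sampling is reproducible. The sketch below shows one way seeded, probability-weighted sampling can work using only the standard library. It is an illustration under assumptions, not uncurrector's implementation: `sample_misspellings` is a hypothetical helper, and the candidate probabilities are invented for the example.

```python
import random


def sample_misspellings(candidates, n=1, seed=None):
    """Draw n misspellings, weighted by probability, from (word, prob) pairs."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    words = [word for word, _ in candidates]
    weights = [prob for _, prob in candidates]
    return rng.choices(words, weights=weights, k=n)


# Invented probabilities, for illustration only.
candidates = [("accomodate", 0.6), ("acommodate", 0.3), ("acomodate", 0.1)]

first = sample_misspellings(candidates, n=3, seed=42)
second = sample_misspellings(candidates, n=3, seed=42)
print(first == second)  # same seed, same draws
```

Using a dedicated `random.Random` instance keeps the result independent of any other code that touches the module-level generator.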

Data and caching

  • Corpora are downloaded on first use and cached under ~/.cache/uncurrector (or $XDG_CACHE_HOME/uncurrector).
  • Use --redownlode to force re-download.
  • Some items in the corpora use underscores for spaces; outputs can keep or replace underscores.
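
The cache location and underscore handling described above can be sketched as follows. The helper names `cache_dir` and `format_entry` are hypothetical, and the precedence of `$XDG_CACHE_HOME` over `~/.cache` is an assumption based on the usual XDG convention rather than on the project's source.

```python
import os
from pathlib import Path


def cache_dir(env=None):
    """Return $XDG_CACHE_HOME/uncurrector if set, else ~/.cache/uncurrector."""
    env = os.environ if env is None else env
    base = env.get("XDG_CACHE_HOME")
    root = Path(base) if base else Path.home() / ".cache"
    return root / "uncurrector"


def format_entry(entry, keep_underscores=False):
    """Keep underscores as they appear in the corpus, or turn them into spaces."""
    return entry if keep_underscores else entry.replace("_", " ")


print(cache_dir({"XDG_CACHE_HOME": "/tmp/xdg"}))
print(format_entry("a_lot"))
print(format_entry("a_lot", keep_underscores=True))
```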

Testing

python3 tests/test_harness.py

License

See source files for details.
