Home
jirkle edited this page Feb 1, 2017 · 6 revisions
Welcome to the ErrCorp wiki!
## ErrCorp
ErrCorp is a tool for automatically generating error corpora from a Wikipedia dump.
It takes a bz2-compressed wiki dump with revision history and processes it page by page. For each page, it compares the content of every pair of adjacent revisions and extracts the sentences that are unique to the older revision and those that are unique to the newer one. ErrCorp then links each old sentence to its best-matching new sentence, and finally classifies each match as one of the following error types:
- Word order - the old and new sentences have the same bag of words
- Typo - the revision comment matches a predefined set of words (regex `typoFilter`); the individual typos are then extracted
- Edit - the revision comment matches a predefined set of words (regex `editFilter`)
- Other - all remaining, unclassified errors
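
The pipeline described above can be illustrated with a minimal Python sketch. It is not ErrCorp's actual code. The function names, the two regex patterns standing in for `typoFilter` and `editFilter`, the use of `difflib` similarity for "best match", whitespace tokenization, and the order in which the categories are checked are all assumptions made for illustration.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical stand-ins for ErrCorp's typoFilter / editFilter regexes.
TYPO_FILTER = re.compile(r"\b(typo|spelling)\b", re.IGNORECASE)
EDIT_FILTER = re.compile(r"\b(edit|rewrite|copyedit)\b", re.IGNORECASE)


def unique_sentences(old_rev, new_rev):
    """Return (sentences only in the old revision, sentences only in the new one)."""
    old_set, new_set = set(old_rev), set(new_rev)
    only_old = [s for s in old_rev if s not in new_set]
    only_new = [s for s in new_rev if s not in old_set]
    return only_old, only_new


def best_match(old_sentence, candidates):
    """Pick the most similar candidate (assumed metric: difflib ratio)."""
    return max(
        candidates,
        key=lambda c: SequenceMatcher(None, old_sentence, c).ratio(),
        default=None,  # no candidates -> no match
    )


def bag_of_words(sentence):
    """Multiset of lowercased, whitespace-separated tokens."""
    return Counter(sentence.lower().split())


def classify(old_sentence, new_sentence, comment):
    """Assign one of the four error types; the check order is an assumption."""
    if bag_of_words(old_sentence) == bag_of_words(new_sentence):
        return "word_order"
    if TYPO_FILTER.search(comment):
        return "typo"
    if EDIT_FILTER.search(comment):
        return "edit"
    return "other"


def process_revision_pair(old_rev, new_rev, comment):
    """Yield (old sentence, matched new sentence, error type) triples."""
    only_old, only_new = unique_sentences(old_rev, new_rev)
    for old_sentence in only_old:
        match = best_match(old_sentence, only_new)
        if match is not None:
            yield old_sentence, match, classify(old_sentence, match, comment)
```

For example, calling `list(process_revision_pair(old, new, "fixed typo"))` on two sentence lists that differ in a single sentence would yield one triple labelled `"typo"`, provided the two sentences do not share the same bag of words. A real implementation would also need sentence splitting and a similarity threshold below which an old sentence is left unmatched; both are omitted here.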