STAT GR5243 Fall 2018 Applied Data Science

Project 4 Optical Character Recognition (OCR) Post-processing Algorithms Evaluation

Adapting published algorithms to address a data problem in your project is a common task in practice. In this project, working in teams, you will implement, evaluate and compare a pair of algorithms for OCR post-processing.

Challenge

OCR is a technology that enables you to convert different types of documents (e.g., scanned documents, PDF files, or images captured by a digital camera) into editable and searchable data. An OCR system consists of a pre-processing module, a word-recognition module, and a post-processing module, which are pipelined together to turn a scanned document into machine-readable text.

(Figure: the OCR pipeline, from pre-processing through word recognition to post-processing.)

For this project, each team is assigned a pair of research papers from the OCR post-processing literature. You will study the papers carefully and implement the algorithms, from scratch, in R. Paper assignments will be posted on Piazza in a post specific to each team.

You will also receive a set of 100 pairs of text files. Each pair consists of a ground-truth text file and a Tesseract-processed file, which serves as the input for the task.
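As a minimal sketch of how one such file pair might be loaded and tokenized in R (the folder layout and file name below are assumptions; adapt them to the released data):

```r
# Sketch: load one ground-truth/Tesseract pair and tokenize by whitespace.
# The paths and file name are hypothetical placeholders.
ground_truth <- readLines("data/ground_truth/group1_00000001.txt",
                          warn = FALSE, encoding = "UTF-8")
tesseract    <- readLines("data/tesseract/group1_00000001.txt",
                          warn = FALSE, encoding = "UTF-8")

tokenize <- function(lines) {
  tokens <- unlist(strsplit(lines, "\\s+"))
  tokens[tokens != ""]  # drop empty strings left by repeated whitespace
}
truth_tokens <- tokenize(ground_truth)
ocr_tokens   <- tokenize(tesseract)
```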

For the submission on Courseworks, you will submit the link to your team's GitHub repo containing your code and a testing report (which must be a reproducible R notebook or a similar format) that compares the algorithms side by side in terms of performance and computational efficiency. For the presentation, each team should briefly explain what each algorithm does, how the evaluation was carried out, and what the main results are.
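For the computational-efficiency side of the comparison, base R's `system.time()` is one simple option. A minimal sketch, where `correct_with_A()` and `correct_with_B()` are hypothetical placeholders for your two implementations and `ocr_tokens` comes from the loading sketch above:

```r
# Sketch: wall-clock comparison of the two algorithms on the same input.
# correct_with_A() and correct_with_B() stand in for your implementations.
time_A <- system.time(corrected_A <- correct_with_A(ocr_tokens))
time_B <- system.time(corrected_B <- correct_with_B(ocr_tokens))
data.frame(algorithm   = c("A", "B"),
           elapsed_sec = c(time_A["elapsed"], time_B["elapsed"]))
```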

All development needs to be carried out in the group-shared private repo on [https://www.github.com/TZstatsADS/] with a clear project-management log, taking advantage of GitHub issues.

Evaluation criteria

  • Readability and reproducibility of code;
  • Validity of evaluation (well-defined measures of performance; well-designed evaluation experiment setup);
  • Presentation and organization (report, GitHub, and in-class presentation).

(More details will be posted as grading rubrics on Courseworks/Canvas.)

Project details

Project timetable

This is a short project. During the first week, we will give a tutorial in class and hold live discussion and brainstorming sessions. The instruction team will join team discussions during class and online.

  • Week 1 (Nov 7/8): Introduction and project description. An overview of OCR.
  • Week 2 (Nov 14/15): Team Q&A on algorithms and evaluation plan.
  • Final presentation (Nov 28/29)
Suggested team workflow
  1. [wk1] Week 1 is the reading and coding week. Read the papers assigned to you, understand the algorithms, and start coding them up; also load the data into R and understand its structure.
  2. [wk1] Each team is strongly encouraged to demonstrate project progress by posting a project plan with task assignments as GitHub issues by Nov 14/15.
  3. [wk1] As a team, brainstorm about your evaluation plan.
  4. [wk2] Based on the outcomes of week 1's reading and brainstorming sessions, continue coding and start the evaluation.
  5. [wk2] Week 2 is the evaluation week.
  6. [wk2] It is OK to split into two sub-teams, each working on one algorithm, as long as both sub-teams use the same criteria for evaluating the algorithms. The two sub-teams can also serve as each other's validators.
  7. If you use an R Notebook to carry out the coding and evaluation, producing your final report can be as simple as adding explanations and comments to that notebook.
Working together
  • Set up a GitHub project folder with everyone listed as a contributor. Everyone clones the project locally and creates a local branch.
  • The team can work in subgroups of 2-3, which might meet more frequently than the entire team. However, everyone should check in regularly on the online group discussion and on changes in the GitHub folder.
  • Learning to work together is an important learning goal of this course.

Resources

Error Detection Papers
  1. Kulp, S., & Kontostathis, A. (2007, November). On Retrieving Legal Files: Shortening Documents and Weeding Out Garbage. In TREC. Link
  • Rule-based techniques (a brief illustrative sketch follows this list)
  • OCR Error Detection is in section 2.2
  2. Riseman, E. M., & Hanson, A. R. (1974). A contextual postprocessing system for error correction using binary n-grams. IEEE Transactions on Computers, 100(5), 480-493. Link
  • Positional binary digram technique
  • Error Detection is in section 3-a
  3. Wudtke, R., Ringlstetter, C., & Schulz, K. U. (2011, September). Recognizing garbage in OCR output on historical documents. In Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data (p. 8). ACM. Link
  • Probabilistic techniques -- SVM garbage detection
  • Features are in section 5 (you can choose not to implement ‘Levenshtein distance’ feature)
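For concreteness, here is a minimal sketch of what a rule-based garbage detector can look like; the three rules below are generic examples of this style, not a transcription of any assigned paper:

```r
# Sketch: rule-based garbage-token detection. The rules are illustrative
# stand-ins; use the exact rule set from your assigned paper.
is_garbage <- function(token) {
  n <- nchar(token)
  if (n == 0) return(TRUE)
  # Rule 1: implausibly long strings are likely OCR noise.
  if (n > 20) return(TRUE)
  # Rule 2: mostly non-alphanumeric characters.
  n_alnum <- nchar(gsub("[^[:alnum:]]", "", token))
  if (n_alnum / n < 0.5) return(TRUE)
  # Rule 3: the same character repeated four or more times in a row.
  if (grepl("(.)\\1{3,}", token, perl = TRUE)) return(TRUE)
  FALSE
}

# ocr_tokens comes from the loading sketch earlier in this document.
detected_garbage <- vapply(ocr_tokens, is_garbage, logical(1))
```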
Error Correction Papers
  1. Riseman, E. M., & Hanson, A. R. (1974). A contextual postprocessing system for error correction using binary n-grams. IEEE Transactions on Computers, 100(5), 480-493. Link
  • Positional binary digram technique
  • Error Correction is in section 3-b
  2. Mei, J., Islam, A., Wu, Y., Moh'd, A., & Milios, E. E. (2016). Statistical learning for OCR text correction. arXiv preprint arXiv:1611.06950. Link
  • Supervised model -- correction regressor
  3. Church, K. W., & Gale, W. A. (1991). Probability scoring for spelling correction. Statistics and Computing, 1(2), 93-103. Link
  • Focus on section 3 -- probability scoring without context (a brief illustrative sketch follows this list)
  4. Church, K. W., & Gale, W. A. (1991). Probability scoring for spelling correction. Statistics and Computing, 1(2), 93-103. Link
  • Same paper as the one above, but focus on section 5 -- probability scoring with contextual constraints
  5. Wick, M., Ross, M., & Learned-Miller, E. (2007, September). Context-sensitive error correction: Using topic models to improve OCR. In Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on (Vol. 2, pp. 1168-1172). IEEE. Link
  • Topic modeling.
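As an illustration of the context-free probability-scoring idea (paper 3 in the list above), here is a minimal sketch that picks, among dictionary words within edit distance 1 of a flagged token, the one with the highest corpus frequency. `word_freq` is an assumed frequency table, e.g. built from the ground-truth corpus; a full implementation would also weight candidates by a channel model P(error | candidate):

```r
# Sketch: context-free candidate scoring in the spirit of Church & Gale.
# word_freq is an assumed named count vector, e.g. table(truth_tokens).
correct_token <- function(error, word_freq, max_dist = 1) {
  dict <- names(word_freq)
  d <- adist(error, dict)[1, ]                # Levenshtein distances (base R)
  candidates <- dict[d > 0 & d <= max_dist]
  if (length(candidates) == 0) return(error)  # nothing plausible: keep token
  # Rank by the prior P(candidate), estimated here by raw frequency;
  # the paper additionally multiplies in a channel model term.
  candidates[which.max(word_freq[candidates])]
}

word_freq <- table(truth_tokens)
correct_token("exqmple", word_freq)
```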
Example starter code

As an example, the GitHub starter code includes a partial implementation of the rule-based error-detection method (the first three rules in the paper); its word error-correction step simply deletes the detected errors. The starter code implements the word-level evaluation criterion; you should at least also complete the letter-level criterion. A sketch of what such measures can look like follows.
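A minimal sketch of word-level and letter-level recall/precision, treating tokens (or characters) as multisets; `corrected_tokens` is an assumed name for your post-processed output, and `truth_tokens` comes from the loading sketch:

```r
# Sketch: word- and letter-level recall/precision via multiset overlap.
# corrected_tokens is a hypothetical placeholder for your algorithm's output.
multiset_overlap <- function(a, b) {
  shared <- intersect(a, b)
  if (length(shared) == 0) return(0)
  sum(pmin(table(a)[shared], table(b)[shared]))
}

word_overlap   <- multiset_overlap(corrected_tokens, truth_tokens)
word_recall    <- word_overlap / length(truth_tokens)
word_precision <- word_overlap / length(corrected_tokens)

# Letter-level versions compare individual characters instead of tokens.
truth_chars     <- unlist(strsplit(truth_tokens, ""))
corrected_chars <- unlist(strsplit(corrected_tokens, ""))
char_recall <- multiset_overlap(corrected_chars, truth_chars) /
  length(truth_chars)
```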