OCR work for the Ancient World Citation Analysis project at UC Berkeley.
The main product of the work in this repository is the Text class. Basic usage:
from tesseract_manager import Text
src = 'path/to/a/pdf_file.pdf'
out = 'path/to/directory/that/will/be/created/to/store/ocr/output'
Text(src, out).save_ocr()
The output directory will then contain two files:
- A CSV file containing text and page-level data extracted from the PDF
- A pickle containing the serialized Text instance that you created. This object includes the same information as the CSV, as well as word-level metadata.
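As a rough sketch of how these two outputs might be read back, the snippet below loads the first CSV and the first pickle found in an output directory. The helper name load_ocr_output and the glob patterns *.csv and *.pkl are assumptions made for illustration, since the actual file names are not stated here. Unpickling a Text instance also requires that tesseract_manager be importable.

```python
import csv
import pickle
from pathlib import Path


def load_ocr_output(out_dir, csv_pattern="*.csv", pickle_pattern="*.pkl"):
    """Load the CSV rows and the pickled object found in ``out_dir``.

    The default patterns are guesses; adjust them to the real file names.
    """
    out_dir = Path(out_dir)
    # Take the first match (in sorted order) for each pattern, if any.
    csv_path = next(iter(sorted(out_dir.glob(csv_pattern))), None)
    pickle_path = next(iter(sorted(out_dir.glob(pickle_pattern))), None)
    if csv_path is None or pickle_path is None:
        raise FileNotFoundError(f"expected a CSV and a pickle in {out_dir}")

    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # Only unpickle files you created yourself: pickle can run arbitrary code.
    with open(pickle_path, "rb") as f:
        obj = pickle.load(f)
    return rows, obj
```

Calling load_ocr_output(out) with the same out path used above would return the CSV rows as a list of dictionaries together with the deserialized object.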
Here are the requirements:
- tesseract-ocr-all. On Linux and Google Colab, this can be installed using apt-get install tesseract-ocr-all.
- Protobufs. On Linux and Google Colab, this can be installed using apt-get install protobuf-compiler libprotobuf-dev.
- Various Python packages. These can be installed by changing into this Git repository and running pip install -r requirements.txt.
  - In Google Colab, pip install -r reduced-requirements.txt should be used instead. This is because many of the dependencies are pre-installed in Colab, and it is best not to install them again.
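The choice between the two requirements files, plus a quick check for the Tesseract binary, could be sketched as follows. This is a minimal illustration and not part of the repository: the helper names are hypothetical, the Colab check assumes that the google.colab module is importable only inside Colab, and the executable is assumed to be named tesseract.

```python
import shutil
import sys


def in_colab():
    """Return True if the google.colab module can be imported."""
    try:
        import google.colab  # noqa: F401
    except ImportError:
        return False
    return True


def requirements_file(colab):
    """Pick the requirements file for the platform."""
    return "reduced-requirements.txt" if colab else "requirements.txt"


def pip_install_command(colab):
    """Build (but do not run) the pip command for the chosen file."""
    return [sys.executable, "-m", "pip", "install", "-r",
            requirements_file(colab)]


def tesseract_available():
    """Check whether an executable named 'tesseract' is on PATH."""
    return shutil.which("tesseract") is not None


if __name__ == "__main__":
    print("Tesseract found:", tesseract_available())
    print("Suggested command:", " ".join(pip_install_command(in_colab())))
```

The command is only built, not executed, so it can be inspected before it is passed to something like subprocess.run from inside the repository directory.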
Here are some resources:
- Certain files are useful but too large or cumbersome to check into version control. These include:
  - A 1600-page PDF including random pages from our collection. (These are pages selected uniformly at random from PDFs selected uniformly at random.)
  - Three 400-page development sets: sample 0, sample 1, and sample 2.
  - A spreadsheet with data about the 1600-page PDF, including which pages are from PDFs that are also represented in the development sets. This spreadsheet is a source of statistics about our corpus that may guide decision-making for OCR.
  - Training data for Tesseract, and other Tesseract-related files.
Here are some tips:
- If you must use Windows instead of a Unix-like system, you may wish to use a platform such as WSL or VMware Workstation (or even Google Drive and Colab) that will enable you to emulate that environment. This is because
  - the package gcld3 is not easy to install on Windows due to its dependency on protobufs, and
  - Tesseract 4 is not easy to install on Windows.
- A virtual environment might be useful.
- Because Google Drive is being used to store datasets that are too large for a personal computer, it can be useful to access this repository via Google Colab. Because Colab notebooks essentially provide command-line access to a virtual machine (just precede commands with the symbols ! or %), it is convenient to use Git on Google Colab.
Here are some gentle suggestions in order of decreasing importance:
- Docstrings explaining the purpose of a module, class, or function are desirable.
- Compliance with PEP 8 guidelines is desirable. Pycodestyle and autopep8 can help: They are easy to install in VS Code, and pycodestyle is enabled by default in PyCharm.
- Compliance with reST format for docstrings is desirable.
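To illustrate these suggestions together, here is a small, hypothetical function (not taken from this repository) with a docstring in reST field-list format and a PEP 8-compliant layout.

```python
def count_words(page_text):
    """Count the whitespace-separated words in a page of OCR text.

    :param page_text: text extracted from a single page
    :type page_text: str
    :return: number of whitespace-separated tokens
    :rtype: int
    """
    return len(page_text.split())
```

Running pycodestyle on a file containing this function should report no style violations, assuming default settings.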