Convert OCR data for the Internet Archive Bookreader.

uchicago-library/ocr_converters

 
 

OCR Converters for XTF Sites (e.g., Campus Publications)

This repository contains a script to convert simple OCR data, in the form of a word list and page coordinates, into the format needed by the Internet Archive Bookreader.

Please note that the script contains hardcoded references to the LDR pair tree and the ark_data.db database. Coordinate with the systems administrators to get access to these locations in the filesystem and adjust the script before proceeding.

To run this script, start by setting up a Python virtual environment. Activate the environment, clone this repo, and install its dependencies:

python3 -m venv venv
source venv/bin/activate
git clone https://github.com/uchicago-library/ocr_converters.git
cd ocr_converters
pip install -r requirements.txt
pip install -r requirements_dev.txt

Then, run the program like this:

python build_ia_bookreader_ocr.py <identifier> <min-year> <max-year> [<shrink_to_height>]

<identifier> is the mvol identifier to produce OCR for, for example, mvol-0001-0002-0003. <min-year> is the earliest year for any item from this item's journal. (This is necessary because each item contains metadata for the entire title.) <max-year> is the latest year for any item from this item's journal. <shrink_to_height> is optional; use it when the JPEG images shown in the Internet Archive Bookreader have been shrunk to a smaller pixel height than the original master file.
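As an illustration of why the shrink_to_height argument matters, here is a minimal sketch of rescaling a word's coordinates from the master image to a shrunken JPEG. This is not the script's actual code: the function name `scale_box` and the `(left, top, right, bottom)` pixel layout are assumptions made for the example, and it assumes the image was resized proportionally.

```python
def scale_box(box, original_height, shrink_to_height):
    """Scale a (left, top, right, bottom) pixel box measured on the master
    image to an image resized proportionally to shrink_to_height pixels tall.

    Hypothetical helper: the box layout is an assumption, not the format
    used by build_ia_bookreader_ocr.py.
    """
    if original_height <= 0 or shrink_to_height <= 0:
        raise ValueError("heights must be positive")
    # Multiply before dividing so exact ratios stay exact.
    return tuple(round(v * shrink_to_height / original_height) for v in box)


# A word box on a 6000px-tall master, displayed at 1000px tall.
print(scale_box((400, 1200, 900, 1260), 6000, 1000))  # (67, 200, 150, 210)
```

If the coordinates are not rescaled this way, highlighted search hits in the Bookreader would land in the wrong place on the shrunken page images.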

The script outputs OCR for the Internet Archive Bookreader, which is used in XTF sites like Campus Publications.

XTF File Layout

Get the XTF production and development server names from the systems administrators. XTF uses a data directory: cd into that directory, then cd into bookreader. You'll find a sequence of directories, one for each digital object. Each is named something like "mvol-0001-0002-0003"; this is the internal identifier the Preservation Department uses to track these files.

Inside each directory is a sequence of JPEGs. Each filename is an eight-digit number padded with leading zeroes, like:

00000001.jpg
00000002.jpg
00000003.jpg
etc.

These are the page images for this item. To add a new item to XTF, use your favorite utility to convert TIFF files to JPEGs, optionally shrinking them to some smaller height. (If you shrink them you can use the <shrink_to_height> option on build_ia_bookreader_ocr.py above.)
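A minimal sketch of the naming step, in case it helps when writing a conversion script. It only plans the target filenames and does no image conversion. The function name `plan_page_names` is hypothetical, and it assumes that sorting the source TIFF filenames gives the correct page order, which may not hold for every deposit.

```python
from pathlib import Path


def plan_page_names(tiff_paths):
    """Pair each source TIFF with its eight-digit JPEG name.

    Assumption: sorting the source paths yields the page order.
    """
    ordered = sorted(Path(p) for p in tiff_paths)
    # Pages are numbered from 1 and zero-padded to eight digits.
    return [(src, f"{i:08d}.jpg") for i, src in enumerate(ordered, start=1)]


for src, target in plan_page_names(["scan_002.tif", "scan_001.tif"]):
    print(src, "->", target)
```

Each pair can then be handed to whatever conversion utility you prefer.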

Each directory also contains a thumbnail image, <identifier>.jpg, which is 100px tall, e.g.:

mvol-0001-0002-0003.jpg

Each directory contains a PDF with all page images:

mvol-0001-0002-0003.pdf

The OCR produced above is stored at:

mvol-0001-0002-0003.xml

And the text of the document itself, with no OCR information, lives in:

mvol-0001-0002-0003.txt

The entire file layout should look like this:

00000001.jpg
00000002.jpg
00000003.jpg
mvol-0001-0002-0003.jpg
mvol-0001-0002-0003.pdf
mvol-0001-0002-0003.txt
mvol-0001-0002-0003.xml
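Before copying a directory to the XTF servers, it can be worth checking it against this layout. The following is a minimal sketch, not part of this repository: `check_layout` is a hypothetical helper, it takes the identifier from the directory name, and it assumes page images must be numbered consecutively from 00000001.jpg with no gaps.

```python
import re
from pathlib import Path

# Page images: exactly eight digits followed by .jpg
PAGE_RE = re.compile(r"\d{8}\.jpg")


def check_layout(item_dir):
    """Return a list of problems found in one item directory (empty if none)."""
    item_dir = Path(item_dir)
    identifier = item_dir.name  # e.g. mvol-0001-0002-0003
    names = {p.name for p in item_dir.iterdir() if p.is_file()}
    problems = []

    # Thumbnail, PDF, plain text, and OCR files named after the identifier.
    for ext in ("jpg", "pdf", "txt", "xml"):
        expected = f"{identifier}.{ext}"
        if expected not in names:
            problems.append(f"missing {expected}")

    # Page images should run 00000001.jpg, 00000002.jpg, ... (assumed gapless).
    pages = sorted(n for n in names if PAGE_RE.fullmatch(n))
    if not pages:
        problems.append("no page images found")
    for i, name in enumerate(pages, start=1):
        if name != f"{i:08d}.jpg":
            problems.append(f"page sequence breaks at {name}")
            break
    return problems
```

Calling `check_layout("mvol-0001-0002-0003")` on a complete directory should return an empty list; anything else is a list of human-readable problems to fix before upload.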

Because input data tends to change with each deposit, I write ad-hoc scripts to get data into this format and scp it to the XTF servers.

Re-Indexing the Site

To re-index the site, look in the XTF bin directory. To rebuild the index completely, run:

./textIndexer -clean -index default

Note that this will probably take about half an hour, during which time the site will be unavailable.

See Also

For more information about XTF, see https://xtf.cdlib.org/.
