- "Digital Himalaya, OCR, and Non-Latin Text". Workshop held with Laura Ferris at the UBC Library's Pixelating Mixer, November 2, 2017. (Google Slides)
- "OCR Tools for Non-Latin text: Lessons from the Digital Himalaya Project". Lightning talk presented at Code4Lib BC, December 1, 2017. (Google Slides)
- docs_ocr.gs: Google Apps Script for extracting text from a batch of JPEG files
- sample_items: contains JPEG images from Mother Tongue Pipal Pustak and Nepali Aawaz. Made available by the Digital Himalaya Project under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported license.
This script was created to generate transcripts from images featuring Nepali and Tibetan text. It finds all JPEG files within the specified Google Drive folder, opens them as Google Docs, and exports their filenames and text contents to the specified Google Sheet. (Uploaded JPEGs are deleted from Drive in the process; the corresponding Google Docs remain.)
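The flow described above can be sketched roughly as follows. This is an illustrative sketch only, not the contents of `docs_ocr.gs`: the function name `sketchExtractText` is hypothetical, and it assumes the Drive advanced service (v2) is enabled in the Apps Script project so that `Drive.Files.insert` with `ocr: true` is available. Note that it moves the JPEGs to the trash rather than deleting them permanently.

```javascript
// Hypothetical sketch of the JPEG -> Google Doc -> Sheet flow.
// Assumes the Drive advanced service (v2) is enabled for the project.
function sketchExtractText(folderName, sheetId) {
  // Rows are appended to the first sheet of the transcript spreadsheet.
  var sheet = SpreadsheetApp.openById(sheetId).getSheets()[0];

  // Folders are looked up by name, so the name should be unique in Drive.
  var folders = DriveApp.getFoldersByName(folderName);
  if (!folders.hasNext()) {
    throw new Error('Folder not found: ' + folderName);
  }
  var folder = folders.next();

  var files = folder.getFilesByType(MimeType.JPEG);
  var count = 0;
  while (files.hasNext()) {
    var file = files.next();

    // Uploading the image with ocr: true asks Drive to run OCR on it;
    // the result is assumed here to be a Google Doc holding the text.
    var doc = Drive.Files.insert(
      { title: file.getName(), parents: [{ id: folder.getId() }] },
      file.getBlob(),
      { ocr: true }
    );

    // Read the recognized text back out of the new Google Doc.
    var text = DocumentApp.openById(doc.id).getBody().getText();

    // One row per image: filename, then transcript text.
    sheet.appendRow([file.getName(), text]);

    // Move the original JPEG to the trash; the Google Doc remains.
    file.setTrashed(true);
    count++;
  }
  return count;
}
```

Looking folders up by name keeps setup simple, at the cost of ambiguity if two folders share a name; the sketch simply takes the first match.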
- Create a new folder for your JPEG files. Keep track of the folder's name for step 6.
- Create a new Google Sheet in the same folder. This will store your transcript text.
- Copy the ID found in the sheet's URL (look for the long string of letters and numbers between 'd/' and '/edit'). Hold onto it for step 7.
- Under the 'Tools' menu, select 'Script Editor'.
- Paste the contents of 'docs_ocr.gs' into the script editor.
- Update 'folderName' with the name of your image folder (see step 1).
- Update 'sheetId' with the ID associated with your transcript sheet (see step 3).
- Click the clock icon to add a trigger. Select options "extractTextOnOpen", "From Spreadsheet", and "on open". This will tell the script to run whenever someone opens the spreadsheet.
- Upload JPEGs to the folder you set up.
- Open the spreadsheet.
- Make a cup of coffee/tea and relax while Google converts the JPEGs, extracts text, and populates the spreadsheet.
- 'Google Drive: Page not found' : Make sure you're only logged into one Google account (see Stack Overflow)
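
For step 3 of the setup, the sheet ID can also be pulled out of the URL programmatically instead of by eye. The helper below is a small sketch: `sheetIdFromUrl` is a hypothetical name, not part of `docs_ocr.gs`, and it assumes the ID consists only of letters, digits, hyphens, and underscores.

```javascript
// Hypothetical helper: returns the ID between 'd/' and the next slash,
// or null if the URL does not contain a '/d/<id>' segment.
function sheetIdFromUrl(url) {
  var match = /\/d\/([A-Za-z0-9_-]+)/.exec(url);
  return match ? match[1] : null;
}
```

The returned string is the value to paste into 'sheetId'.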
Research: Laura Ferris, Digital Initiatives Assistant, UBC Library
Code: Rebecca Dickson, Digital Projects Student Librarian, UBC Library
Inspiration: http://blogs.bl.uk/digital-scholarship/2017/07/a-workshop-on-optical-character-recognition-for-bangla.html