OpenAlexOCR

This repository is a basic setup and demonstration of how to query the OpenAlex API, download PDFs of scientific documents, and pipe those documents into two separate OCR engines to extract their contents.

The majority of scientific information and technical documentation is stored in PDFs. For human readers, this is no issue. For machines and large language models (LLMs), however, PDFs can be difficult to read correctly because contextual information is easily lost. Many of these errors arise around objects such as mathematical equations, where an LLM can misinterpret important symbols or text.

Recent advancements in optical character recognition (OCR) have given rise to the two engines demonstrated here. I chose them to illustrate the need for better PDF-reading applications.

In this repository, I have set up a Colab notebook that calls the OpenAlex API, which indexes hundreds of millions of scientific documents, many of which are available as PDFs. The user enters a subject to search for, the notebook sends that query to the API, and the matching PDFs are downloaded. Once the download is complete, the user can choose either the 'Nougat' or the 'Marker' OCR engine to convert the PDFs into Mathpix Markdown (.mmd) files. Mathpix Markdown is significantly more digestible for machines and LLMs because equations are converted from images into LaTeX. Much of the contextual information, as well as the special characters that an LLM or any other program would struggle to interpret, can then be read as LaTeX, so the information contained in the PDFs is no longer lost.
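The query step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names are hypothetical, and it assumes the OpenAlex `works` endpoint with its `search`, `filter`, and `per-page` parameters and the `best_oa_location.pdf_url` field of each result.

```python
import json
import urllib.parse
import urllib.request

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def build_works_url(query, max_results=100):
    """Build a search URL for open-access works matching `query`."""
    params = {
        "search": query,
        # Assumption: restrict results to open-access works so PDF links are likely.
        "filter": "open_access.is_oa:true",
        # A single OpenAlex page holds at most 200 results.
        "per-page": min(max_results, 200),
    }
    return OPENALEX_WORKS_URL + "?" + urllib.parse.urlencode(params)


def extract_pdf_urls(response):
    """Collect direct PDF links from a decoded OpenAlex works response."""
    urls = []
    for work in response.get("results", []):
        # `best_oa_location` may be missing or null for some works.
        location = work.get("best_oa_location") or {}
        pdf_url = location.get("pdf_url")
        if pdf_url:
            urls.append(pdf_url)
    return urls


def search_pdf_urls(query, max_results=100):
    """Call the API (requires network access) and return the PDF links found."""
    with urllib.request.urlopen(build_works_url(query, max_results)) as reply:
        return extract_pdf_urls(json.load(reply))
```

Keeping the URL construction and the response parsing separate from the network call makes each part easy to check on its own. Works without a direct PDF link are simply skipped.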

Installation

Execute each setup cell in order. No email is required, so leave that field as is.
Query OpenAlex with the string you want to search for. Optionally change the maximum number of results from the default of 100.
Run the download PDFs cell to download all PDFs.
(Marker) Run the first three cells inside the Marker pipeline to verify that the packages have installed correctly. Output files will be saved to the designated folder created in Colab. Optionally change the '--output_format' variable in the subprocess to 'json' to output JSON instead of markdown.
(Nougat) Run the first two cells to install the Nougat dependencies. Then run the third cell to convert the files to MMD.

Marker

In the tests I have run, Marker appears to be the more efficient engine for converting PDFs into MMD. Note that it can also convert PDFs to regular markdown or JSON.
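As a rough sketch of how the output format could be switched when Marker is launched from a subprocess: the helper names below are hypothetical, and the `marker_single` executable and `--output_dir` option are assumptions about Marker's command-line interface. Only `--output_format` is named in this README.

```python
import subprocess
from pathlib import Path

# The two formats this README mentions for the '--output_format' option.
SUPPORTED_FORMATS = {"markdown", "json"}


def build_marker_command(pdf_path, output_dir, output_format="markdown"):
    """Return the argument list for converting one PDF with Marker."""
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {output_format}")
    return [
        "marker_single",  # assumed name of Marker's single-file CLI
        str(pdf_path),
        "--output_dir", str(output_dir),  # assumed option name
        "--output_format", output_format,
    ]


def convert_folder(pdf_dir, output_dir, output_format="markdown"):
    """Run Marker on every PDF in `pdf_dir` (requires Marker to be installed)."""
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        command = build_marker_command(pdf_path, output_dir, output_format)
        subprocess.run(command, check=True)
```

Passing the command as a list rather than a single shell string avoids quoting problems with file names that contain spaces.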

Improvements

I would like to give Marker a conversion engine that performs full OCR. This way, it could convert the information directly into LaTeX. This could be accomplished by using the base models of either Nougat or Donut.

Original repo: https://github.com/datalab-to/marker

Nougat

Nougat, on the other hand, is a fully fledged OCR conversion model. It does very well at converting content such as mathematical equations from their PDF form into LaTeX. Where it struggles is in writing the output back to the location on the page where the text originally appeared.
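A minimal sketch of driving Nougat from Python follows. The helper names are hypothetical, and it assumes the `nougat` command-line tool with its `-o` output-directory option, which writes one `.mmd` file per input PDF.

```python
import subprocess
from pathlib import Path


def build_nougat_command(pdf_path, output_dir):
    """Return the argument list for converting one PDF with Nougat."""
    return ["nougat", str(pdf_path), "-o", str(output_dir)]


def expected_mmd_path(pdf_path, output_dir):
    """Path where the converted file is expected: <output_dir>/<pdf stem>.mmd."""
    return Path(output_dir) / (Path(pdf_path).stem + ".mmd")


def convert_pdf(pdf_path, output_dir):
    """Run Nougat on one PDF (requires Nougat to be installed) and return the output path."""
    subprocess.run(build_nougat_command(pdf_path, output_dir), check=True)
    return expected_mmd_path(pdf_path, output_dir)
```

Checking for the expected `.mmd` path after each run is a simple way to detect PDFs that Nougat failed to convert.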

Improvements

One potential way to address this placement issue is to use a library such as Tesseract to recover the locations of the original text on the page, so that the converted output can be written back to the correct position.

Original repo: https://github.com/facebookresearch/nougat
