
WheelWell9876/OpenAlexOCR


OpenAlexOCR

This is a basic setup and demonstration of how to query the OpenAlex API to download PDFs of scientific documents and pipe the documents into 2 separate OCR engines to extract the information from the PDFs.

Most scientific and technical documentation is distributed as PDFs. That is no obstacle for human readers, but machines and large language models (LLMs) often struggle to read these documents because they may fail to pick up contextual information. Many of the errors arise around objects such as mathematical equations, where an LLM can misinterpret important symbols or text.

Recent advances in optical character recognition (OCR) have produced the two engines demonstrated here, which I chose to illustrate the need for better PDF-reading applications.

In this repository, I have set up a Colab notebook that calls the OpenAlex API, which indexes hundreds of millions of scientific documents, many of them available as PDFs. The user enters a subject to search for, the notebook queries the API with it, and then it downloads the matching PDFs. Once the download is complete, the user can choose either the 'Nougat' or the 'Marker' OCR engine to convert the PDFs into Mathpix Markdown (.mmd) files. Mathpix Markdown is significantly more digestible for machines and LLMs because equations are converted from images into LaTeX. Much of the contextual information and many of the special characters that an LLM or any other program would struggle to interpret can then be read as LaTeX, so the information contained in the PDFs is no longer lost.
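The query step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names `build_works_url` and `extract_pdf_urls` are hypothetical, the open-access filter is my own choice, and the cap of 200 results per page reflects my understanding of the OpenAlex paging limit.

```python
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"


def build_works_url(query, max_results=100, email=None):
    """Build a search URL for the OpenAlex /works endpoint (hypothetical helper)."""
    params = {
        "search": query,
        # Assumption: restrict to open-access works, which are more likely to expose a PDF.
        "filter": "open_access.is_oa:true",
        # Assumption: OpenAlex accepts at most 200 results per page.
        "per-page": min(max_results, 200),
    }
    if email:
        # Optional contact address; the notebook works without one.
        params["mailto"] = email
    return f"{OPENALEX_WORKS}?{urlencode(params)}"


def extract_pdf_urls(works):
    """Collect the direct PDF links from a list of OpenAlex work records."""
    urls = []
    for work in works:
        # best_oa_location can be null in the API response, hence the `or {}`.
        location = work.get("best_oa_location") or {}
        pdf_url = location.get("pdf_url")
        if pdf_url:
            urls.append(pdf_url)
    return urls
```

The JSON returned by the API keeps its records under the `results` key, so a caller would pass `response_json["results"]` to `extract_pdf_urls`. Records without a direct PDF link are skipped.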

Installation

  1. Execute each setup cell in order. No email is required, so leave that field as is.
  2. Query OpenAlex with the string you want to search for. Optionally, change the maximum number of results from the default of 100.
  3. Run the download PDFs cell to download all PDFs.
  4. (Marker) Run the first three cells of the Marker pipeline to verify that the packages installed correctly. Output files are written to the designated folder created in Colab. Optionally, change the '--output_format' argument in the subprocess call to 'json' to output JSON instead of markdown.
  5. (Nougat) Run the first two cells to install the Nougat dependencies. Then run the third cell to convert the files to MMD.
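The download step (step 3) might look something like the sketch below. It is an illustration under stated assumptions rather than the notebook's cell: `pdf_filename` and `download_pdfs` are hypothetical names, the `pdfs` folder and the User-Agent string are placeholders, and checking for the `%PDF` signature is my own addition to skip links that return an HTML landing page instead of a document.

```python
import os
import re
import urllib.request
from urllib.parse import urlparse


def pdf_filename(url, index):
    """Derive a safe, unique local filename from a PDF URL (hypothetical helper)."""
    stem = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    stem = re.sub(r"\.pdf$", "", stem, flags=re.IGNORECASE)
    # Replace anything that is not filename-safe with an underscore.
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", stem).strip("_") or "document"
    return f"{index:04d}_{stem}.pdf"


def download_pdfs(urls, out_dir="pdfs", timeout=30):
    """Download each URL into out_dir and return the paths that were saved."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        request = urllib.request.Request(url, headers={"User-Agent": "OpenAlexOCR-demo"})
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                data = response.read()
        except (OSError, ValueError) as error:
            # Network failures and malformed URLs are skipped, not fatal.
            print(f"skipped {url}: {error}")
            continue
        if not data.startswith(b"%PDF"):
            # Assumption: anything without the PDF signature is not a usable document.
            print(f"skipped {url}: response is not a PDF")
            continue
        path = os.path.join(out_dir, pdf_filename(url, index))
        with open(path, "wb") as handle:
            handle.write(data)
        saved.append(path)
    return saved
```

Prefixing each filename with its index keeps two papers that share a basename from overwriting each other.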

Marker

In my tests, Marker appears to be the more efficient engine for converting PDFs into MMD. It can also convert PDFs to regular markdown and to JSON.
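As a rough sketch of how the Marker subprocess call could be assembled, assuming the `marker` command is installed and accepts an input folder plus `--output_dir` and `--output_format` (flag names may differ between Marker versions, so check `marker --help`). The helpers `marker_command` and `run_marker` are hypothetical and not taken from the notebook.

```python
import shutil
import subprocess


def marker_command(input_dir, output_dir, output_format="markdown"):
    """Build the argument list for a Marker batch conversion (hypothetical helper)."""
    # Only the two formats mentioned in this README are allowed here.
    if output_format not in {"markdown", "json"}:
        raise ValueError(f"unsupported output format: {output_format}")
    return [
        "marker", input_dir,
        "--output_dir", output_dir,
        "--output_format", output_format,
    ]


def run_marker(input_dir, output_dir, output_format="markdown"):
    """Run Marker on a folder of PDFs; raises if the CLI is missing or fails."""
    if shutil.which("marker") is None:
        raise RuntimeError("the 'marker' command was not found on PATH")
    return subprocess.run(
        marker_command(input_dir, output_dir, output_format),
        check=True,
    )
```

Passing the arguments as a list instead of a single shell string avoids quoting problems when folder names contain spaces.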

Improvements

I would like to give Marker a conversion engine that can do full OCR. That way, it could convert the information directly into LaTeX. This could be accomplished by using the base models of either Nougat or Donut.

Original repo: https://github.com/datalab-to/marker

Nougat

Nougat, on the other hand, is a full-fledged OCR conversion model. It does very well at converting content such as mathematical equations from PDF into LaTeX. Where it struggles is in writing the output back to the location on the page where the text originally appeared.
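A minimal sketch of driving Nougat from Python, assuming the `nougat` command-line tool is installed and that, as I understand its CLI, `-o` sets the output directory and each PDF produces a `.mmd` file with the same stem. The helpers `nougat_command` and `missing_outputs` are hypothetical names, not part of Nougat or of this notebook.

```python
from pathlib import Path


def nougat_command(pdf_path, output_dir):
    """Build the argument list for converting one PDF with Nougat (hypothetical helper)."""
    return ["nougat", str(pdf_path), "-o", str(output_dir)]


def missing_outputs(pdf_names, mmd_names):
    """Return the PDFs that have no .mmd file with a matching stem yet."""
    done = {Path(name).stem for name in mmd_names}
    return sorted(name for name in pdf_names if Path(name).stem not in done)
```

After a run, comparing the PDF folder against the output folder with `missing_outputs` gives a quick list of documents that failed or were skipped, which can then be retried one at a time via `subprocess.run(nougat_command(...), check=True)`.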

Improvements

One potential way to improve this placement issue is to use a library like Tesseract to recover the locations of the original text, so that the output can be written back to the correct positions.
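One way this idea could start is sketched below: Tesseract, through the `pytesseract` wrapper, can report a bounding box for every recognized word, and those boxes could later be matched against Nougat's output. This is only an assumption about how the improvement might be approached. It presumes `pytesseract`, Pillow, and the Tesseract binary are installed and that each PDF page has already been rendered to an image; `words_with_boxes` and `ocr_page` are hypothetical helpers.

```python
def words_with_boxes(data, min_conf=0.0):
    """Keep non-empty words from Tesseract's word table, with their page positions."""
    words = []
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue  # structural rows (blocks, lines) carry no text
        # Confidence may arrive as a string or a number depending on the version.
        if float(data["conf"][i]) < min_conf:
            continue
        words.append({
            "text": text,
            "left": data["left"][i],
            "top": data["top"][i],
            "width": data["width"][i],
            "height": data["height"][i],
        })
    return words


def ocr_page(image_path, min_conf=0.0):
    """Run Tesseract on one page image and return the words with bounding boxes."""
    # Imported lazily so the pure helper above works without Tesseract installed.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    return words_with_boxes(data, min_conf)
```

Matching these boxes to the text that Nougat emits is the harder, unsolved part and is not attempted here.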

Original repo: https://github.com/facebookresearch/nougat
