A Python API for the Science Parse server.
pip install science_parse_api
This also requires that you have a science parse server running. This can be done through docker:
docker run -p 127.0.0.1:8080:8080 --rm --init ucrel/ucrel-science-parse:3.0.1
For versions of Science Parse < 3.0.1 see the AllenAI docker hub.
If you would like to run the docker image with less memory (default is 8GB) then the following command will run it with a limit of 6GB:
docker run -p 127.0.0.1:8080:8080 --rm --init --memory=6g --memory-swap=6g --env JAVA_MEMORY=5 ucrel/ucrel-science-parse:3.0.1
For more details on this docker image see the UCREL docker hub page.
Note that the Science Parse GitHub repository recommends running the Science Parse server with 6GB of memory for the Java process, e.g. JAVA_MEMORY=6
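Whichever docker command is used, it can be useful to confirm that the server port is accepting connections before trying to parse anything. The following is a minimal sketch, not part of this library: server_is_reachable is a hypothetical helper, and it assumes the address and port from the docker commands above (127.0.0.1 and 8080). An open port only shows that something is listening; it does not prove that Science Parse has finished loading.

```python
import socket
from urllib.parse import urlparse


def server_is_reachable(server_address: str, port: str,
                        timeout: float = 2.0) -> bool:
    """Hypothetical helper: True if a TCP connection can be opened."""
    # Extract the host name from an address such as "http://127.0.0.1"
    hostname = urlparse(server_address).hostname
    if hostname is None:
        # The address had no scheme/host part, so there is nothing to try
        return False
    try:
        with socket.create_connection((hostname, int(port)),
                                      timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host could not be resolved
        return False


# Assumed values, matching the docker commands above
print(server_is_reachable('http://127.0.0.1', '8080'))
```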
The API has only one main function: parse_pdf.
It takes the following inputs:
- server_address -- Address to the science parse server e.g. "http://127.0.0.1"
- file_path_to_pdf -- The file path to the PDF you would like to parse.
- port -- Port of the science parse server e.g. "8080"
It will then return the parsed PDF as a Python dictionary with the following keys:
['abstractText', 'authors', 'id', 'references', 'sections', 'title', 'year']
Note that not all of these dictionary keys will always exist: if Science Parse cannot detect the relevant information, the key is omitted, e.g. if it cannot find any references then there will be no 'references' key.
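Because any of these keys may be missing, it is safer to read the returned dictionary with dict.get and a default value than with direct indexing. The sketch below is only an illustration: summarise_paper is a hypothetical helper, and the example dictionary is hand-written rather than real server output.

```python
def summarise_paper(parsed: dict) -> dict:
    """Hypothetical helper: summarise a parsed paper, tolerating missing keys."""
    return {
        'title': parsed.get('title', ''),
        'year': parsed.get('year'),  # None when no year was detected
        'num_authors': len(parsed.get('authors', [])),
        'num_references': len(parsed.get('references', [])),
        'num_sections': len(parsed.get('sections', [])),
    }


# Hand-written example with no 'references' or 'sections' keys
example = {
    'title': 'Example paper for testing',
    'year': 2021,
    'authors': [{'affiliations': [], 'name': 'Andrew Moore'}],
}
summary = summarise_paper(example)
print(summary)
```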
The example below shows how to use the parse_pdf function and the expected output format. In this example we ran the Science Parse server using docker, e.g.:
docker run -p 127.0.0.1:8080:8080 --rm --init ucrel/ucrel-science-parse:3.0.1
from pathlib import Path
import tempfile
from IPython.display import Image
import requests
from science_parse_api.test_helper import test_data_dir
try:
    # Tries to find the folder `test_data`
    test_data_directory = test_data_dir()
    test_pdf_paper = Path(test_data_directory,
                          'example_for_test.pdf').resolve()
    image_file_name = str(Path(test_data_directory,
                               'example_test_pdf_as_png.png'))
except FileNotFoundError:
    # If it cannot find that folder will get the pdf and
    # image from Github. This will occur if you are using
    # Google Colab
    pdf_url = ('https://github.com/UCREL/science_parse_py_api/'
               'raw/master/test_data/example_for_test.pdf')
    temp_test_pdf_paper = tempfile.NamedTemporaryFile('rb+')
    test_pdf_paper = Path(temp_test_pdf_paper.name)
    with test_pdf_paper.open('rb+') as test_fp:
        test_fp.write(requests.get(pdf_url).content)
    image_url = ('https://github.com/UCREL/science_parse_py_api'
                 '/raw/master/test_data/example_test_pdf_as_png.png')
    image_file = tempfile.NamedTemporaryFile('rb+', suffix='.png')
    with Path(image_file.name).open('rb+') as image_fp:
        image_fp.write(requests.get(image_url).content)
    image_file_name = image_file.name
Image(filename=image_file_name)
import pprint
from science_parse_api.api import parse_pdf
host = 'http://127.0.0.1'
port = '8080'
output_dict = parse_pdf(host, test_pdf_paper, port=port)
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(output_dict)
{ 'abstractText': 'The abstract which is normally short.',
'authors': [{'affiliations': [], 'name': 'Andrew Moore'}],
'id': 'SP:1499421a494e54e17ee903b5161ccb31091fb77a',
'references': [ { 'authors': [ 'Tomas Mikolov',
'Greg Corrado',
'Kai Chen',
'Jeffrey Dean.'],
'title': 'Efficient estimation of word '
'representations in vector space',
'venue': 'Proceedings of the International '
'Conference on Learning Representations, '
'pages 1–12.',
'year': 2013}],
'sections': [ { 'text': 'The abstract which is normally short.\n'
'1 Introduction\n'
'Some introduction text.\n'
'2 Section 1\n'
'Here is some example text.'},
{ 'heading': '2.1 Sub Section 1',
'text': 'Some more text but with a reference (Mikolov '
'et al., 2013).\n'
'3 Section 2\n'
'The last section\n'
'References\n'
'Tomas Mikolov, Greg Corrado, Kai Chen, and '
'Jeffrey Dean. 2013. Efficient estimation of '
'word representations in vector space. '
'Proceedings of the International Conference '
'on Learning Representations, pages 1–12.'}],
'title': 'Example paper for testing',
'year': 2021}
The output is not perfect but it is very good! Some of the things it did not pick up on:
- The 'authors' key never seems to get the affiliations of the authors (I have tried a few papers).
- The sections are a list of sections and each section is made up of 'text' and 'heading'. However, as this example shows, these keys are not always guaranteed, e.g. the first section only contains a 'text' key.
- The sections in this example do not contain all of the sections.
- The last section also contains the References.
- The output of the 'authors' from 'references' contains all of the correct authors. However, one small issue is that Jeffrey Dean has a full stop at the end, e.g. 'Jeffrey Dean.'
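Some of these quirks can be smoothed over after parsing. The following is a rough sketch under stated assumptions, not something the library provides: clean_author_name and iter_sections are hypothetical helpers, and the sample dictionary is a shortened copy of the output above. Note that stripping a trailing full stop would also alter names that legitimately end in one (for example a trailing initial), so treat it as a heuristic.

```python
def clean_author_name(name: str) -> str:
    """Hypothetical helper: drop surrounding spaces and trailing full stops."""
    return name.strip().rstrip('.').strip()


def iter_sections(parsed: dict):
    """Hypothetical helper: yield (heading, text) pairs.

    heading is None when the section has no 'heading' key.
    """
    for section in parsed.get('sections', []):
        yield section.get('heading'), section.get('text', '')


# Shortened copy of the example output above
sample = {
    'references': [{'authors': ['Tomas Mikolov', 'Jeffrey Dean.']}],
    'sections': [{'text': 'Some introduction text.'},
                 {'heading': '2.1 Sub Section 1', 'text': 'Some more text.'}],
}

cleaned_authors = [clean_author_name(author)
                   for reference in sample.get('references', [])
                   for author in reference.get('authors', [])]
section_pairs = list(iter_sections(sample))
print(cleaned_authors)
print(section_pairs)
```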
Some of the really nice features:
- Creates a unique 'id' key based on hashing the request to the Science Parse server, thus each request to the server will create a unique 'id'.
- The 'year' key contains a Python int, e.g. 2021 and 2013.
Science Parse has been used in the following academic papers:
- S2ORC: The Semantic Scholar Open Research Corpus. They used Science Parse to extract title and authors from the PDF of academic papers. They then used Grobid to extract the rest of the data from the PDFs.
If you would like to develop on this library, clone the repository and then install the regular requirements and the development requirements using:
pip install -e .[dev]
The -e is an editable flag, meaning that if you change anything in the library locally Python will keep track of those changes.
Note that, as the library is created with nbdev, the code and documentation are generated from the notebooks within the ./module_notebooks folder.
Note that you need to run the following once: nbdev_install_git_hooks. From the nbdev documentation: "This will set up git hooks which will remove metadata from your notebooks when you commit, greatly reducing the chance you have a conflict."
The main workflow is the following:
- Edit the notebook(s) you want within the ./module_notebooks folder. The README is generated from the ./module_notebooks/index.ipynb file.
- Run nbdev_build_lib to convert the notebook(s) into a Python module, which in this case will go into the ./science_parse_api folder. Note that if you created a function in one Python module and want to use it in another module, then you will need to run nbdev_build_lib first, as that Python module code needs to be transferred from the ./module_notebooks folder into the ./science_parse_api folder.
- Create the documentation using nbdev_build_docs.
- Optionally, if you created tests, run them using make test. When you add tests in the notebooks you will need to import the function from the module and not rely on the function already defined in the notebook; this ensures that code coverage is calculated correctly.
- Optionally, if you would like to see the documentation locally, see the sub-section below.
- Git add the relevant notebook(s), Python module code, and documentation.
The documentation can be run locally via a docker container. The easiest way to run this container is through the make command:
make docker_docs_serve
NOTE This documentation does not update automatically, so it requires re-running this make command each time you want to see an updated version of the documentation.
To release an updated version of the package:
- Change the version number in ./settings.ini
- Build the library using nbdev_build_lib
- Then make the package and upload it to PyPI using make release
The work has been funded by the UCREL research centre at Lancaster University.
We would like to thank the AllenAI institute for creating the Science Parse software.