This project benchmarks and evaluates existing PDF extraction tools with respect to their semantic abilities to extract the body text from PDF documents, in particular from scientific articles. It provides (1) a benchmark generator, (2) a ready-to-use benchmark, and (3) an extensive evaluation with meaningful evaluation criteria.
The benchmark generator
- constructs high-quality benchmarks from TeX source files.
- identifies the following 16 types of logical text blocks: title, author(s), affiliation(s), date, abstract, headings, paragraphs of the body text, formulas, figures, tables, captions, listing items, footnotes, acknowledgements, references, and appendices.
- serializes the desired logical text blocks to plain text, XML, or JSON format (see the illustrative sketch below).
For more details and usage, see benchmark-generator/.
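As a rough illustration of what a JSON serialization could look like, here is a minimal sketch. The schema, the field names (`type`, `text`), and the example content are assumptions made purely for illustration; the actual output format of the benchmark generator may differ (see benchmark-generator/ for the real formats).

```python
import json

# Purely illustrative sketch: the block types mirror the logical text blocks
# listed above, but the schema and field names are assumptions, not the
# benchmark generator's actual output format.
blocks = [
    {"type": "title", "text": "A Benchmark for PDF Extraction Tools"},
    {"type": "heading", "text": "1 Introduction"},
    {"type": "body", "text": "Extracting the body text from PDF files is surprisingly hard ..."},
]

print(json.dumps({"blocks": blocks}, indent=2))
```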
The benchmark
- consists of 12,099 ground truth files and 12,099 PDF files of scientific articles, randomly selected from arXiv.org. Each ground truth file contains the title, the headings, and the body text paragraphs of a particular scientific article (see the pairing sketch below).
- was generated using the benchmark generator described above.
For more details, see benchmark/.
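The sketch below shows how one might iterate over the benchmark and pair each ground truth file with its PDF file. The directory names, file extensions, and the assumption that both files share a base name are hypothetical and only for illustration; consult benchmark/ for the actual layout.

```python
from pathlib import Path

# Hypothetical layout: directory names, file extensions and the shared
# base-name convention are illustrative assumptions, not the benchmark's
# documented structure.
groundtruth_dir = Path("benchmark/groundtruth")
pdf_dir = Path("benchmark/pdf")

pairs = []
if groundtruth_dir.is_dir() and pdf_dir.is_dir():
    for gt_file in sorted(groundtruth_dir.glob("*.txt")):
        pdf_file = pdf_dir / (gt_file.stem + ".pdf")
        if pdf_file.exists():
            pairs.append((pdf_file, gt_file))

print(f"Found {len(pairs)} PDF / ground truth pairs.")
```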
The evaluation
- assesses the following 13 PDF extraction tools: pdftotext, pdftohtml, pdf2xml (Xerox), pdf2xml (Tiedemann), PdfBox, ParsCit, LA-PdfText, PdfMiner, pdfXtk, pdf-extract, PDFExtract, Grobid, and Icecite.
- provides meaningful evaluation criteria for assessing the semantic abilities of a tool, namely how well it identifies (1) the words, (2) the reading order, (3) the paragraph boundaries, and (4) the semantic roles of the text elements in a PDF (a small sketch of criterion (1) follows below).
For more details, see evaluation/.
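As a rough illustration of criterion (1), the sketch below compares the bag of words in a tool's extraction output against the ground truth and reports precision and recall. This is only an assumption about how such a word-level comparison could be computed, not the evaluation module's actual metric or code.

```python
from collections import Counter

def word_precision_recall(extracted: str, groundtruth: str):
    """Compare the bag of words of an extraction against the ground truth.

    Illustrative only: the actual evaluation may tokenize, normalize and
    score differently (and also covers reading order, paragraph boundaries
    and semantic roles, which are not sketched here).
    """
    extracted_words = Counter(extracted.lower().split())
    groundtruth_words = Counter(groundtruth.lower().split())
    common = sum((extracted_words & groundtruth_words).values())
    precision = common / max(sum(extracted_words.values()), 1)
    recall = common / max(sum(groundtruth_words.values()), 1)
    return precision, recall

p, r = word_precision_recall("The quick brown fox", "the quick brown fox jumps")
print(f"precision={p:.2f}, recall={r:.2f}")
```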