Releases: MicheleCotrufo/pdf2doi

v1.7

10 Nov 21:07

Main changes

Changed the URL used for dx.doi.org validation (#35)
Added the 'r' prefix to string literals to suppress warnings in recent Python versions (#36)
Changed the pymupdf dependency to pymupdf>=1.21.0 (#32, #28, #37)
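
The 'r' prefix mentioned above matters mostly for regular expressions, where backslash sequences such as \d are not valid string escapes and trigger a warning in recent Python versions. The following is a minimal sketch of the idea; the pattern and the function name find_doi_prefixes are hypothetical and are not taken from pdf2doi's source.

```python
import re

# Both spellings produce the same pattern text. Only the raw one avoids the
# invalid-escape warning, because "\d" is not a recognised string escape.
# (Hypothetical pattern for illustration, not copied from pdf2doi.)
RAW_PATTERN = r"10\.\d{4,9}/"
ESCAPED_PATTERN = "10\\.\\d{4,9}/"


def find_doi_prefixes(text):
    """Return every DOI-like prefix (e.g. '10.1000/') found in text."""
    return re.findall(RAW_PATTERN, text)
```

Writing the pattern as a raw string keeps it readable and leaves the backslashes for the regex engine to interpret.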

v1.6

18 Jun 15:23

Main changes

The library pypdf is now used (instead of PyPDF2) to add new metadata to the pdf files (see also the fix below). Since PyPDF2 is now deprecated, in the next version of pdf2doi we will progressively replace all tasks performed by PyPDF2 with pypdf.

Added

Make sure that the input variable target is converted to a string before processing (#27).

Fixed

Fixed a bug related to storing the DOI in the metadata of the pdf files. Due to some quirks of the library PyPDF2, the size of the pdf file would double after adding the metadata. In this new version, adding metadata to a pdf file is performed via the library pypdf (thanks to Ole Steuernagel for pointing out this issue).

v1.5

31 Dec 19:02

Main changes

The library textract has been removed from the required dependencies because it often creates problems during installation (due to conflicts between library versions), and because it generally requires installing many other dependencies which are not needed by pdf2doi. The user can still decide to install textract==1.6.4 if desired; pdf2doi will use textract only if it is installed.
pdf2doi now stores any found identifier in a tag called /pdf2doi_identifier (previously /identifier).

Added

The library pdfminer is now used directly by pdf2doi to extract the text from a pdf file (instead of indirectly via textract).
An additional method to find the title of a pdf file, based on the library pymupdf, has been added.
Issue #21: When an arXiv ID is found, a corresponding DOI is also returned when available. This could be either the standard arXiv DOI or the DOI of the corresponding journal publication. This behavior can be disabled by adding the option -no_arxiv2doi to the pdf2doi invocation.
Issue #22: The function get_pdf_text (finders.py) has been modified to allow the library PyPDF2 to also extract the text of any annotation/comment present in the pdf file.
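
As an illustration of the "standard arXiv DOI" mentioned above: arXiv registers DOIs of the form 10.48550/arXiv.<id> for its submissions. The sketch below builds that DOI from a new-style arXiv ID. It is not pdf2doi's implementation (the release notes do not say how the lookup is done), the function name arxiv_id_to_doi is hypothetical, and it does not cover the journal-publication DOI, which requires an online lookup.

```python
import re

# New-style arXiv identifiers look like 2101.01234 or 2101.01234v2.
ARXIV_ID = re.compile(r"^(\d{4}\.\d{4,5})(v\d+)?$")


def arxiv_id_to_doi(arxiv_id):
    """Return the arXiv-issued DOI for a new-style arXiv ID, or None.

    Hypothetical helper for illustration only.
    """
    match = ARXIV_ID.match(arxiv_id.strip())
    if match is None:
        return None
    # arXiv-issued DOIs use the prefix 10.48550 and omit the version suffix.
    return "10.48550/arXiv." + match.group(1)
```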

Fixed

Potential titles of the papers were often not found correctly, because the function find_possible_titles() (finders.py) would mistakenly discard all the results if any one of the three methods (pdftitle, PyPDF2, filename) generated an error.
Fixed a bug in the function add_metadata() (finders.py). In previous versions, some of the pre-existing metadata were not preserved when a new entry was added.
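
The title-finding fix above amounts to isolating each method so that one failure does not discard the others' results. A minimal sketch of that pattern follows; collect_titles and its arguments are hypothetical names, not the actual code of find_possible_titles().

```python
import logging

logger = logging.getLogger(__name__)


def collect_titles(methods, source):
    """Run every title-finding callable, keeping results from those that succeed.

    `methods` is a list of callables taking `source` and returning a title
    (or a falsy value). Hypothetical helper for illustration only.
    """
    titles = []
    for method in methods:
        try:
            title = method(source)
        except Exception:
            # Log the traceback and move on instead of aborting the whole search.
            logger.exception("Title method %r failed", method)
            continue
        if title:
            titles.append(title)
    return titles
```

Wrapping each call separately means a failure in one extractor costs only that extractor's candidate.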

v1.4

03 Nov 03:49

Main improvements (see also merge from #20)

Check for server error status codes when validating on dx.doi.org, as 504 errors can occur.
When performing Google searches, pdf2doi also looks for DOIs in the URLs.
- Support any URL with a matching DOI and the doi keyword in the URL.
Attempt to strip extensions from filenames: otherwise doi10.111/1111.pdf will fail to locate the DOI, as 10.111/1111.pdf is a valid, if uncommon, DOI.
"Standardise" DOIs to handle loose matches, e.g. case variations or trailing punctuation.
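
The last two items can be sketched as follows. This is an assumption about what such normalisation might look like, not pdf2doi's actual rules: the function names and the exact set of stripped punctuation are hypothetical. It relies on the fact that DOIs are case-insensitive, so lowercasing does not change which object a DOI refers to.

```python
def standardise_doi(doi):
    """Lowercase a DOI candidate and drop trailing punctuation.

    Hypothetical helper: the punctuation set is an illustrative choice.
    """
    return doi.strip().lower().rstrip(".,;:")


def strip_pdf_extension(name):
    """Remove a trailing .pdf (any case) so it is not read as part of a DOI."""
    return name[:-4] if name.lower().endswith(".pdf") else name
```

For the filename from the notes above, stripping the extension first leaves doi10.111/1111, from which 10.111/1111 can then be matched.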

Minor code changes (see also merge from #20)

Moved regex patterns to patterns.py and added pytest tests for common DOI patterns.
Updated to use logger.exception, which provides tracebacks on errors.
Moved the code that adds the '/identifier' tag to a general function add_metadata() in finders.py.

v1.3

17 Jun 17:38

Fixed

File objects were not closed after being opened (issue #17).
Make sure that version 2.0.0 of PyPDF2 is used, since the text extracted with newer versions occasionally garbles some DOIs.

v1.2

28 May 16:59

Added

Print explicit error when target path is not a valid file or directory (when used via CLI).

Fixed

Bug due to some functions returning None instead of an empty list (issue #15).
Fixed typo at line 134 of main.py ('/identfier' -> '/identifier')
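
The None-versus-empty-list bug above is a common one: callers that iterate over or measure the result break when a function returns None for "nothing found". A minimal sketch of the safer convention, using hypothetical function names that are not from pdf2doi:

```python
import re


def find_all(pattern, text):
    """Return every match of pattern in text; always a list, never None."""
    if not text:
        # An empty list keeps callers' loops and len() calls working.
        return []
    return re.findall(pattern, text)


def count_candidates(pattern, text):
    """Example caller that would raise TypeError if find_all returned None."""
    return len(find_all(pattern, text))
```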

v1.1

01 May 04:36

Improved the internal behavior of some functions. The input argument of the function pdf2doi.pdf2doi_singlefile can now be either a string (with a relative or absolute path to the file to process) or a file object opened elsewhere in the code. The first input argument of all the "finder" functions must now be a file object (opened elsewhere) and not a string with the file path.
Cleaned up dependencies and removed version constraints that are no longer necessary.
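
Accepting either a path or an already-open file object, as described above, can be sketched with a small context manager. The name open_target is hypothetical and this is not pdf2doi's code; accepting os.PathLike objects in addition to strings is an extra assumption made here for convenience.

```python
import io
import os
from contextlib import contextmanager


@contextmanager
def open_target(target):
    """Yield a binary file object for a path or an already-open file.

    Files opened here are closed on exit; file objects passed in by the
    caller are left open, since the caller owns them.
    """
    if isinstance(target, (str, os.PathLike)):
        with open(target, "rb") as handle:
            yield handle
    else:
        yield target


# Usage with an in-memory file object standing in for an open PDF.
buffer = io.BytesIO(b"%PDF-1.4 example")
with open_target(buffer) as handle:
    header = handle.read(4)
```

Leaving caller-supplied file objects open avoids surprising the code that opened them.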

v1.0.1

04 Dec 04:23

Improved the look-up of DOIs and arXiv IDs in a text.
In previous versions, for each possible regexp identifying a DOI or arXiv ID, the search would only look for the first occurrence of a potential DOI or arXiv ID in the text. Now it searches for all possible results.
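
The difference between the old and new behavior corresponds to the difference between re.search and re.findall in Python's standard library. The sketch below uses a simplified, hypothetical DOI-like pattern rather than pdf2doi's own patterns.

```python
import re

# Simplified DOI-like pattern for illustration; not pdf2doi's actual regexp.
DOI_LIKE = re.compile(r"10\.\d{4,9}/\S+")


def first_match(text):
    """Old behavior: stop at the first candidate."""
    match = DOI_LIKE.search(text)
    return match.group(0) if match else None


def all_matches(text):
    """New behavior: return every candidate so later ones can be validated too."""
    return DOI_LIKE.findall(text)
```

Returning every candidate matters when the first match is a false positive, for example a DOI from the reference list rather than the paper's own.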

v1.0

18 Nov 05:29

- Re-organized all code
- Moved all BibTeX-related stuff to a new package pdf2bib
- Fixed minor bugs in previous version

v0.6

19 Jul 03:51

Main bugs fixed:

When parsing the author field of a bibtex entry, a problem occurred if an author name contained the substring "and".
Version 0.5 was not compatible with Mac systems because the library winreg was imported without first checking the operating system (issue #5).
Files with the extension ".PDF" (uppercase) were not recognized as valid pdf files.
An error occurred when sanitizing text strings that contained more than one LaTeX code.
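
Two of the fixes above lend themselves to a short illustration. BibTeX separates authors with the word "and" surrounded by whitespace, so splitting on the bare substring breaks names such as "Anderson"; and a file-extension check should ignore case. The helpers below are hypothetical sketches of both ideas, not the code used in version 0.6.

```python
import re


def split_authors(author_field):
    """Split a BibTeX author field on ' and ' used as a separator.

    Requiring whitespace on both sides leaves names that merely contain
    the letters 'and' (Anderson, Brand) intact.
    """
    return [name.strip() for name in re.split(r"\s+and\s+", author_field) if name.strip()]


def is_pdf(filename):
    """Case-insensitive check, so 'PAPER.PDF' counts as a pdf file."""
    return filename.lower().endswith(".pdf")
```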
