Releases: MicheleCotrufo/pdf2doi
v1.7
v1.6
Main changes
- The library pypdf is now used (instead of PyPdf2) to add new metadata to the pdf files (see also fix below). Since PyPdf2 is now deprecated, in the next version of pdf2doi we will progressively replace all tasks performed by PyPdf2 by pypdf.
Added
- Make sure that the input variable target is converted to a string before processing #27
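As a rough illustration of this kind of conversion, here is a minimal sketch. The helper name normalize_target is hypothetical and not part of pdf2doi; it only shows how a path-like object could be accepted alongside a plain string.

```python
from pathlib import Path


def normalize_target(target):
    """Return the target as a plain string (hypothetical helper, not pdf2doi's API)."""
    # str() leaves strings unchanged and turns pathlib.Path objects into
    # their textual path, so later string operations work on both.
    return str(target)


print(normalize_target(Path("article.pdf")))
print(normalize_target("article.pdf"))
```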
Fixed
- Fixed a bug related to the storing of the DOI into the metadata of the pdf files. Due to some quirks of the library PyPdf2, the size of the pdf file would double after adding the metadata. In this new version, adding metadata to a pdf file is now performed via the library pypdf (thanks to Ole Steuernagel for pointing out this issue).
v1.5
Main changes
- The library textract has been removed from the required dependencies because it often creates problems during installation (due to conflicts between library versions), and because it generally requires installing many other dependencies which are not needed by pdf2doi. The user can still decide to install textract==1.6.4 if desired. pdf2doi will use textract only if it is installed.
- pdf2doi now stores any found identifier into a tag called /pdf2doi_identifier (previously it was /identifier).
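Using a library only when it is installed can be sketched as follows. This is an assumption about the general technique, not pdf2doi's actual code: the function names and the backend labels are hypothetical.

```python
import importlib.util


def textract_available():
    """Return True if the optional textract package can be imported."""
    return importlib.util.find_spec("textract") is not None


def choose_text_extractors():
    """Return the names of the extraction backends to try, in order.

    The names are illustrative labels, not identifiers used by pdf2doi.
    """
    backends = ["pdfminer"]
    if textract_available():
        # Only offered when the user chose to install it separately.
        backends.append("textract")
    return backends


print(choose_text_extractors())
```

Checking with importlib.util.find_spec avoids importing the package just to find out whether it exists.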
Added
- The library pdfminer is now directly used by pdf2doi to extract the text from a pdf file (instead of doing it indirectly via textract).
- An additional method to find the title of a pdf file, based on the library pymupdf, has been added.
- [Issue https://github.com//issues/21]: When an arXiv ID is found, a corresponding DOI is also returned when available. This could be either the standard arXiv DOI (see also here), or the DOI of the corresponding journal publication. This behavior can be disabled by adding the optional command -no_arxiv2doi to the pdf2doi invocation.
- [Issue https://github.com//issues/22]: The function get_pdf_text (finders.py) has been modified to allow the library PyPDF2 to also extract the text of any annotation/comment present in the pdf file.
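For the standard arXiv DOI, the mapping from an arXiv ID can be sketched as below. The function name arxiv_id_to_doi is hypothetical, and the sketch covers only new-style identifiers and the 10.48550/arXiv. prefix that arXiv registers. Finding the DOI of a journal publication requires a metadata lookup and is not shown.

```python
import re

# New-style arXiv identifiers look like 2101.01234 or 2101.01234v2.
ARXIV_ID = re.compile(r"^(\d{4}\.\d{4,5})(v\d+)?$")


def arxiv_id_to_doi(arxiv_id):
    """Build the standard arXiv DOI for a new-style identifier.

    Returns None when the identifier does not match the expected format.
    """
    match = ARXIV_ID.match(arxiv_id.strip())
    if match is None:
        return None
    # Assumption: the version suffix is dropped, since the DOI refers to
    # the article rather than to one specific version.
    return "10.48550/arXiv." + match.group(1)


print(arxiv_id_to_doi("2101.01234v2"))  # 10.48550/arXiv.2101.01234
```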
Fixed
- Potential titles of the papers were often not correctly found, because the function find_possible_titles() (finders.py) would mistakenly disregard all the results if one of the three methods (pdftitle, PyPDF2, filename) generated an error.
- Fixed bug in the function add_metadata() (finders.py). In previous versions, some of the pre-existing metadata were not preserved when a new one was added (commit).
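The title fix comes down to isolating failures so that one failing method does not discard the results of the others. A minimal sketch of that pattern follows; the helpers collect_titles, from_filename and broken_method are hypothetical stand-ins, not the code in finders.py.

```python
import logging

logger = logging.getLogger(__name__)


def collect_titles(methods, source):
    """Run each title-finding callable and keep results from those that succeed."""
    titles = []
    for name, method in methods:
        try:
            title = method(source)
        except Exception:
            # A failure in one method is logged and skipped, so the
            # candidates found by the other methods are kept.
            logger.exception("Title method %s failed", name)
            continue
        if title:
            titles.append(title)
    return titles


def from_filename(path):
    """Use the file name without directory and extension as a title candidate."""
    return path.rsplit("/", 1)[-1].rsplit(".", 1)[0]


def broken_method(path):
    """Stand-in for a method that raises on some files."""
    raise ValueError("simulated failure")


methods = [("broken", broken_method), ("filename", from_filename)]
print(collect_titles(methods, "papers/quantum_optics.pdf"))  # ['quantum_optics']
```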
v1.4
Main improvements (see also merge from #20)
- Check for server error status codes when validating on dx.doi.org as 504 errors can occur
- When performing Google searches, DOIs are now also looked for in the URLs of the results.
- Support any URL with a matching DOI and the doi keyword in the URL.
- Attempt to strip extensions from filenames. Without this, a filename such as doi10.111/1111.pdf fails to yield the correct DOI, because 10.111/1111.pdf is itself a valid, if uncommon, DOI.
- "Standardise" DOIs to handle loose matches e.g. case variations, or trailing punctuation.
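The extension stripping and the standardisation step can be sketched together as follows. The regular expression is a deliberately loose pattern chosen for illustration, and the function names are hypothetical. The actual patterns live in patterns.py and may differ.

```python
import re

# Loose DOI pattern for illustration only (not the one in patterns.py).
DOI_PATTERN = re.compile(r"10\.\d{3,9}/[^\s\"<>]+", re.IGNORECASE)


def strip_pdf_extension(name):
    """Drop a trailing .pdf (any case) so it is not absorbed into the DOI."""
    return re.sub(r"\.pdf$", "", name, flags=re.IGNORECASE)


def standardise_doi(doi):
    """Lower-case the DOI and remove trailing punctuation."""
    return doi.rstrip(".,;:)]").lower()


def doi_from_filename(name):
    """Return a standardised DOI found in a file name, or None."""
    match = DOI_PATTERN.search(strip_pdf_extension(name))
    return standardise_doi(match.group(0)) if match else None


print(doi_from_filename("doi10.111/1111.pdf"))  # 10.111/1111
print(standardise_doi("10.1103/PhysRevLett.116.061102."))  # 10.1103/physrevlett.116.061102
```

Stripping trailing punctuation is a heuristic: a DOI that legitimately ends in one of those characters would be truncated, so a real implementation would typically validate the result before accepting it.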
Minor code changes (see also merge from #20)
- Moved regex patterns to patterns.py + add pytest tests for common DOI patterns
- Update to use logger.exception which provides tracebacks on errors.
- Moved code to add the '/identifier' tag to a general function add_metadata() in finders.py
v1.3
v1.2
v1.1
- Improved the internal behavior of some functions. The input argument of the function pdf2doi.pdf2doi_singlefile can now be either a string (with a relative or absolute path to the file to process) or a file object, opened elsewhere in the code. The first input argument of all the "finder" functions must now be a file object (opened elsewhere) and not a string with the file path.
- Cleaned up dependencies and removed version constraints that are no longer necessary.
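Accepting either a path string or an already open file object is commonly handled with a small wrapper like the one below. This is a sketch of the general pattern under stated assumptions, not the code of pdf2doi.pdf2doi_singlefile; open_target and read_header are hypothetical names.

```python
import io
from contextlib import contextmanager


@contextmanager
def open_target(target):
    """Yield a binary file object for either a path string or an open file.

    Files opened here are closed here. File objects passed in by the
    caller are left open for the caller to manage.
    """
    if isinstance(target, str):
        with open(target, "rb") as handle:
            yield handle
    else:
        yield target


def read_header(target, size=5):
    """Read the first bytes of the target, whatever form it was given in."""
    with open_target(target) as handle:
        return handle.read(size)


fake_pdf = io.BytesIO(b"%PDF-1.7 example")
print(read_header(fake_pdf))  # b'%PDF-'
```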
v1.0.1
- Improved the look-up of DOIs and arXiv IDs in a text.
In previous versions, for each regular expression identifying a DOI or arXiv ID, the search returned only the first occurrence of a potential DOI or arXiv ID in the text. It now returns all occurrences.
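The difference between the two behaviors corresponds to re.search versus re.findall in Python. The pattern below is a simplified stand-in for illustration, not one of the expressions used by pdf2doi.

```python
import re

# Simplified DOI pattern, for illustration only.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s,;]+")

text = "See 10.1000/first.one, and later 10.2000/second.two for details."

# Old behavior: stop at the first occurrence.
first_only = DOI_PATTERN.search(text).group(0)

# New behavior: collect every occurrence in the text.
all_matches = DOI_PATTERN.findall(text)

print(first_only)   # 10.1000/first.one
print(all_matches)  # ['10.1000/first.one', '10.2000/second.two']
```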
v1.0
v0.6
Main bugs fixed:
- When parsing the author field of a bibtex entry, a problem occurred if an author name contained the substring "and".
- Version 0.5 was not compatible with Mac systems because the library winreg was imported without first checking the operating system (issue #5).
- Files with extension ".PDF" (upper case) were not recognized as valid pdf files.
- An error occurred when sanitizing text strings that contained more than one LaTeX code.
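The author-field problem is typical of splitting on the bare substring "and" instead of on the separator word. One possible way to avoid it is sketched below. The function split_bibtex_authors is hypothetical and is not necessarily how the bug was fixed in pdf2doi.

```python
import re


def split_bibtex_authors(author_field):
    """Split a BibTeX author field on the word 'and' surrounded by whitespace."""
    parts = re.split(r"\s+and\s+", author_field.strip())
    return [part.strip() for part in parts if part.strip()]


field = "Brandt, Amanda and Sandoval, Luis"

# Naive splitting also cuts inside names that contain "and".
print(field.split("and"))

# Splitting on the whitespace-delimited word keeps the names intact.
print(split_bibtex_authors(field))  # ['Brandt, Amanda', 'Sandoval, Luis']
```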