Extract reference URLs and DOIs from an IEEE-style reference list and download the references to the filesystem.
- downloads webpages as a single file in MHTML format
  - including client-rendered content
  - uses puppeteer-extra-plugin-stealth to evade anti-bot measures
- downloads PDF files and other file types from URLs
- downloads papers with a DOI from sci-hub
  - if you make use of this feature, consider donating to sci-hub
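The split between MHTML snapshots and plain file downloads could, for instance, be decided from the response's Content-Type header. The sketch below is only an illustration of that idea: the function name `chooseSaveStrategy` and the two strategy labels are hypothetical, and this project's actual logic may work differently.

```javascript
// Hypothetical helper (not part of this project's API): pick a save strategy
// from a Content-Type header value. Assumes that only HTML-like responses are
// rendered in a browser and stored as MHTML; everything else is saved as-is.
function chooseSaveStrategy(contentType) {
  // Strip parameters such as "; charset=utf-8" and normalise the case.
  const mimeType = (contentType || "").split(";")[0].trim().toLowerCase();
  const isWebPage =
    mimeType === "text/html" || mimeType === "application/xhtml+xml";
  return isWebPage ? "mhtml-snapshot" : "direct-download";
}

console.log(chooseSaveStrategy("text/html; charset=utf-8"));
console.log(chooseSaveStrategy("application/pdf"));
```

A missing or empty header falls through to the direct download in this sketch, which is a simplification chosen for the example.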
This software requires Node.js version 18 or newer.
Before first use, install the dependencies:
npm install # or pnpm install, or yarn install, etc.
- Put your references in a plaintext file, e.g. references.txt
- Create a target directory, e.g. ./archive
- Run:
node index.js references.txt ./archive
To use the library from your own code instead of the command line:
import { extractAndSaveAllURLs } from "./lib.js";
await extractAndSaveAllURLs(
'[1] First reference. [Online]. Available: https://jfhr.de/reference-archive/example.pdf (Accessed: 2023-07-23)\n' +
'[2] Second reference. [Online]. Available: https://jfhr.de/reference-archive/example.html (Accessed: 2023-07-23)\n' +
'[3] Third reference. [Online]. Available: https://jfhr.de/reference-archive/cr.html (Accessed: 2023-07-23)\n' +
'[4] S. DeRisi, R. Kennison and N. Twyman, The What and Whys of DOIs. doi: 10.1371/journal.pbio.0000057\n' +
'[5] N. Paskin, "Digital Object Identifier (DOI) System", Encyclopedia of Library and Information Sciences (3rd ed.)\n',
'./archive/'
);
Say you have the following reference list:
[1] First reference. [Online]. Available: https://jfhr.de/reference-archive/example.pdf (Accessed: 2023-07-23)
[2] Second reference. [Online]. Available: https://jfhr.de/reference-archive/example.html (Accessed: 2023-07-23)
[3] Third reference. [Online]. Available: https://jfhr.de/reference-archive/cr.html (Accessed: 2023-07-23)
[4] S. DeRisi, R. Kennison and N. Twyman, The What and Whys of DOIs. doi: 10.1371/journal.pbio.0000057
[5] N. Paskin, "Digital Object Identifier (DOI) System", Encyclopedia of Library and Information Sciences (3rd ed.)
reference-archive would download the following files to your filesystem:
1.pdf # PDF file from URL
2.mhtml # Single file web page from URL
3.mhtml # Single file web page from URL, including client-rendered content
4.pdf # PDF file with DOI from sci-hub
# no 5.* - reference 5 has no URL and no DOI
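To illustrate how a reference list like the one above can be mapped to URLs and DOIs, here is a minimal sketch using regular expressions. The names `extractReferences`, `URL_PATTERN` and `DOI_PATTERN` are hypothetical, the patterns are deliberately simple, and the sketch assumes one reference per line; it is not this project's actual implementation.

```javascript
// Hypothetical helpers for illustration only; names and patterns are assumptions.
const URL_PATTERN = /https?:\/\/[^\s<>"]+/i;
// Simplified DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
const DOI_PATTERN = /\b10\.\d{4,9}\/[^\s<>"]+/;

// Drop punctuation that usually belongs to the sentence, not the identifier.
function trimTrailingPunctuation(value) {
  return value.replace(/[.,;]+$/, "");
}

// Parse one reference per line; prefer a URL and fall back to a DOI.
function extractReferences(text) {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      const numberMatch = line.match(/^\[(\d+)\]/);
      const number = numberMatch ? Number(numberMatch[1]) : null;
      const url = line.match(URL_PATTERN);
      if (url) {
        return { number, kind: "url", value: trimTrailingPunctuation(url[0]) };
      }
      const doi = line.match(DOI_PATTERN);
      if (doi) {
        return { number, kind: "doi", value: trimTrailingPunctuation(doi[0]) };
      }
      return { number, kind: "none", value: null };
    });
}

const sample = [
  "[1] First reference. [Online]. Available: https://jfhr.de/reference-archive/example.pdf (Accessed: 2023-07-23)",
  "[4] S. DeRisi, R. Kennison and N. Twyman, The What and Whys of DOIs. doi: 10.1371/journal.pbio.0000057",
  '[5] N. Paskin, "Digital Object Identifier (DOI) System", Encyclopedia of Library and Information Sciences (3rd ed.)',
].join("\n");

for (const ref of extractReferences(sample)) {
  console.log(ref.number, ref.kind, ref.value);
}
```

References without a URL or DOI are reported with kind "none", which corresponds to the missing 5.* file in the example above.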
To run tests:
node --test