📄 DOI Paper Scraper — Extract Academic Papers to Markdown 🚀

An automated research paper extraction tool designed for academics, researchers, and developers.

Scrape academic papers from ACM Digital Library, IEEE Xplore, and other publishers using just a DOI. Convert complex academic layouts into structured, clean Markdown with full-text content, LaTeX equations, tables, and high-quality figures.

🌟 Key Features

🎯 Intelligent DOI Resolution: Accepts plain DOIs, doi.org URLs, publisher direct links, or any string containing a Digital Object Identifier.
🛡️ Cloudflare & Anti-Bot Bypass: Leverages pydoll for advanced browser automation to bypass WAFs and access protected content.
📚 Multi-Publisher Support: Built-in specialized scrapers for ACM Digital Library and IEEE Xplore. Easily extensible for Springer, Elsevier, Wiley, and more.
📐 Rich Content Extraction:
- Preserves full paper hierarchy (Headings, Sub-headings).
- Automatically converts MathJax/LaTeX equations into Markdown $math$ blocks.
- Extracts Figures and Tables with original captions and placement.
🔗 Institutional Access Support: Reaches paywalled content you are entitled to through institutional proxy redirection and browser cookie injection (supports GMU's EZProxy and others).
📋 Structured Output: Generates clean, text-searchable Markdown files—perfect for research archival, NLP analysis, and building personal knowledge bases.
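
To illustrate the DOI resolution feature, here is a minimal sketch of how a DOI might be pulled out of a plain DOI, a doi.org URL, or a publisher link. It is not the scraper's own code: the function name `extract_doi` is hypothetical, and the pattern is the commonly cited Crossref-style regular expression for modern DOIs, which may miss some legacy identifiers.

```python
import re
from typing import Optional
from urllib.parse import unquote

# Crossref-style pattern for modern DOIs (an assumption, not the tool's pattern).
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def extract_doi(text: str) -> Optional[str]:
    """Return the first DOI-like substring found in text, or None."""
    # Decode percent-escapes first so links containing "10.1145%2F..." also match.
    match = DOI_PATTERN.search(unquote(text))
    if match is None:
        return None
    # Trailing punctuation is usually sentence residue, not part of the DOI.
    return match.group(0).rstrip(".,;")
```

Under these assumptions, `extract_doi("https://doi.org/10.1109/CSCloud-EdgeCom58631.2023.00053")` yields the bare identifier without the resolver prefix.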

🛠️ Installation

This project uses the high-performance uv package manager.

# 1. Clone the repository
git clone https://github.com/ahnafnafee/doi-paper-scraper.git
cd doi-paper-scraper

# 2. Install dependencies (creates a virtualenv automatically)
uv sync

🚀 Quick Usage

Extract any paper into Markdown with one command:

# Extract by plain DOI
uv run paper-scrape 10.1145/3746059.3747603

# Extract by DOI URL
uv run paper-scrape "https://doi.org/10.1109/CSCloud-EdgeCom58631.2023.00053"

# Save to a specific directory
uv run paper-scrape [DOI] --output-dir ./my_research

🏫 Accessing Paywalled Content (Institutional Login)

If you have access through a university library (e.g., George Mason University):

1. Log in to the publisher (IEEE/ACM) through your university's proxy.
2. Export your session cookies as a JSON file using a browser extension (like Cookie-Editor).
3. Run the scraper with the cookies and proxy flags:

uv run paper-scrape [DOI] --cookies ieee_cookies.json --proxy "https://mutex.gmu.edu/login?qurl=%u"
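
Before passing a cookie file to `--cookies`, it can help to sanity-check the export. The sketch below assumes the file is a JSON array of cookie objects carrying at least `name`, `value`, and `domain` keys, which is how Cookie-Editor-style exports typically look. The exact fields the scraper requires are not documented here, so `REQUIRED_KEYS` and `check_cookie_file` are hypothetical.

```python
import json
from pathlib import Path

# Assumed minimum set of keys; the scraper's real requirements may differ.
REQUIRED_KEYS = {"name", "value", "domain"}


def check_cookie_file(path: str) -> list[str]:
    """Return human-readable problems found in a cookie export (empty if none)."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        return [f"cannot read {path}: {exc}"]
    if not isinstance(data, list):
        return ["expected a JSON array of cookie objects"]
    problems = []
    for index, cookie in enumerate(data):
        if not isinstance(cookie, dict):
            problems.append(f"entry {index} is not an object")
            continue
        missing = REQUIRED_KEYS.difference(cookie)
        if missing:
            problems.append(f"entry {index} is missing {sorted(missing)}")
    return problems
```

An empty list means the file at least has the expected shape; it says nothing about whether the session is still valid.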

💻 CLI Reference

| Option | Shorthand | Description | Default |
|---|---|---|---|
| `--output-dir` | `-o` | Directory where papers and images will be saved. | `output/` |
| `--cookies` | `-c` | Path to a JSON cookie file for institutional authentication. | `None` |
| `--proxy` | `-p` | Proxy URL template (use `%u` for target URL). | GMU EZProxy |
| `--no-proxy` | | Disable the default proxy even if on a supported domain. | `False` |
| `--verbose` | `-v` | Enable detailed logging for debugging. | `False` |
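
The `%u` placeholder in `--proxy` stands for the target URL. The following sketch shows one way such a template could be expanded. Whether the scraper percent-encodes the target before substitution is an assumption, so the hypothetical `apply_proxy_template` helper exposes that choice as a parameter.

```python
from urllib.parse import quote


def apply_proxy_template(template: str, target_url: str, encode: bool = True) -> str:
    """Substitute target_url for the %u placeholder in an EZProxy-style template."""
    if "%u" not in template:
        raise ValueError("proxy template must contain the %u placeholder")
    # Percent-encoding keeps the target intact when it travels as a query value.
    value = quote(target_url, safe="") if encode else target_url
    return template.replace("%u", value)
```

With the GMU template from the example above and `https://doi.org/10.1145/3746059.3747603` as the target, the encoded form carries the target URL as a single `qurl` query value.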

📂 Output Structure

The tool organizes extracted data into a clean, portable structure:

output/
├── Quarks_A_Secure_Messaging_Network.md   # Paper text + Markdown formatting
└── images/                                # Extracted figures, diagrams, and tables
    ├── fig_a1b2.png
    └── table_c3d4.gif
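
Because the layout is flat (Markdown files at the top level, assets under `images/`), a downstream script can inventory a run with a few lines. This is a sketch that assumes only the structure shown above; `summarize_output` is a hypothetical helper, not part of the tool.

```python
from pathlib import Path


def summarize_output(output_dir: str) -> dict:
    """List the Markdown papers and image files found in an output directory."""
    root = Path(output_dir)
    papers = sorted(p.name for p in root.glob("*.md"))
    images_dir = root / "images"
    images = []
    if images_dir.is_dir():
        images = sorted(p.name for p in images_dir.iterdir() if p.is_file())
    return {"papers": papers, "images": images}
```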

🧬 Why Choose DOI Paper Scraper?

Research Portability: Text-searchable Markdown is far easier to search and edit than static PDFs.
Knowledge Graphs: Perfect for importing papers into tools like Obsidian, Logseq, or Notion.
NLP Research: Clean text extraction without the "noise" of PDF parsing (extra line breaks, headers/footers).
Automation: Designed to be integrated into CI/CD pipelines or batch processing scripts.
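
As an illustration of batch processing, the sketch below wraps the documented `uv run paper-scrape` command in a small Python driver. It uses only the `--output-dir` and `--cookies` options from the CLI reference. The helper names are hypothetical, and treating a nonzero exit code as failure is an assumption about the CLI's behavior.

```python
import subprocess
from typing import Optional


def build_command(doi: str, output_dir: str = "output",
                  cookies: Optional[str] = None) -> list[str]:
    """Assemble the argument list for one paper-scrape invocation."""
    cmd = ["uv", "run", "paper-scrape", doi, "--output-dir", output_dir]
    if cookies:
        cmd += ["--cookies", cookies]
    return cmd


def scrape_all(dois: list[str], output_dir: str = "output") -> dict[str, int]:
    """Run the CLI once per DOI and map each DOI to its exit code."""
    results = {}
    for doi in dois:
        # check=False so one failing paper does not abort the whole batch.
        completed = subprocess.run(build_command(doi, output_dir), check=False)
        results[doi] = completed.returncode
    return results
```

Passing arguments as a list avoids shell quoting problems with DOIs that contain parentheses or semicolons.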

📜 License

Distributed under the MIT License. Free for academic, personal, and commercial use. 🎓

Developed with ❤️ by Ahnaf Nafee
