Add PDF scrapers with URL resolution #120

muthukumaranR · 2025-08-19T15:47:30Z

Summary

Add PDF scraper capabilities with URL resolution and fallback strategies.

Changes

PyPaperBot Integration: Multi-source PDF retrieval with CrossRef/Scholar/Unpaywall fallbacks
URL Resolution: ArXiv abs → pdf conversion, publisher-specific resolvers
Waterfall Strategy: Sequential scraper execution with fallbacks
Unpaywall API: DOI-to-open-access resolution with email configuration
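
The waterfall strategy listed above can be sketched roughly as follows. This is a minimal illustration, not the WaterfallScraper implementation from this PR: the `Scraper` callable type, `run_waterfall`, and the two stand-in scrapers are hypothetical names, and the sketch assumes each scraper is an async callable that takes a URL and either returns text or raises.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

# Hypothetical scraper signature: takes a URL, returns extracted text or raises.
Scraper = Callable[[str], Awaitable[str]]


async def run_waterfall(url: str, scrapers: Sequence[Scraper]) -> str:
    """Try each scraper in order and return the first non-empty result."""
    errors: List[str] = []
    for scraper in scrapers:
        name = getattr(scraper, "__name__", "scraper")
        try:
            content = await scraper(url)
        except Exception as exc:
            # A failed scraper is recorded and the next one is tried.
            errors.append(f"{name}: {exc}")
            continue
        if content:
            return content
        errors.append(f"{name}: empty result")
    raise RuntimeError("all scrapers failed: " + "; ".join(errors))


# Stand-in scrapers used only to exercise the fallback path.
async def failing_scraper(url: str) -> str:
    raise ValueError("cannot parse")


async def working_scraper(url: str) -> str:
    return f"content from {url}"


if __name__ == "__main__":
    result = asyncio.run(
        run_waterfall("https://example.org/paper.pdf", [failing_scraper, working_scraper])
    )
    print(result)
```

Collecting the per-scraper errors keeps the final exception informative when every fallback fails.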

Usage

# UnpaywallResolver for DOI resolution
config = ArticleResolverConfig(debug=True, user_agent="Research Bot")
resolver = UnpaywallResolver(config=config)
result = await resolver.arun(ResolverInputSchema(url="https://doi.org/10.1038/nature12345"))

# PyPaperBot scraper
scraper = PyPaperBotScraper()
result = await scraper.arun(ScraperToolInputSchema(url="https://arxiv.org/abs/2301.00001"))

# Waterfall scraper with URL resolution
waterfall = WaterfallScraper()
result = await waterfall.arun(ScraperToolInputSchema(url="https://doi.org/10.1038/example"))

# CompositeWaterfallScraper with PyPaperBot fallback
composite = CompositeWaterfallScraper(scrapers=["docling", "pypaperbot"])
result = await composite.arun(ScraperToolInputSchema(url="https://arxiv.org/pdf/2301.00001.pdf"))

Testing

Integration tests with real API calls for UnpaywallResolver
PyPaperBot configuration and DOI extraction tests
WaterfallScraper URL resolution logic validation
CompositeWaterfallScraper fallback behavior tests

- Implement DOI resolution via Unpaywall API
- Add email configuration for API compliance
- Include test coverage
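
For readers unfamiliar with Unpaywall, the DOI lookup described above might look roughly like this. It is a sketch, not the UnpaywallResolver code: `build_unpaywall_url` and `pick_pdf_url` are hypothetical helpers, and the response fields (`best_oa_location`, `url_for_pdf`, `url`) are assumed from Unpaywall's public v2 API. No network call is made here.

```python
from typing import Optional
from urllib.parse import quote, urlencode

UNPAYWALL_BASE = "https://api.unpaywall.org/v2"


def build_unpaywall_url(doi: str, email: str) -> str:
    """Build the lookup URL for a DOI; Unpaywall asks for an email parameter."""
    query = urlencode({"email": email})
    return f"{UNPAYWALL_BASE}/{quote(doi, safe='/')}?{query}"


def pick_pdf_url(payload: dict) -> Optional[str]:
    """Return the best open-access link from a decoded JSON response, if any."""
    # best_oa_location may be missing or null when no open-access copy is known.
    best = payload.get("best_oa_location") or {}
    return best.get("url_for_pdf") or best.get("url")


if __name__ == "__main__":
    print(build_unpaywall_url("10.1038/nature12345", "bot@example.org"))
```

Keeping URL construction and response parsing as pure functions makes them testable without the real API calls that the integration tests rely on.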

- Multi-source PDF fetching: CrossRef → Semantic Scholar → Unpaywall → PyPaperBot
- DOI extraction from URLs and HTML metadata
- Configurable Sci-Hub mirrors and proxy support
- Query-based search fallback when DOI extraction fails
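
DOI extraction from a URL, as mentioned above, could be approximated with a regular expression. This is only a sketch under assumptions: `extract_doi` and `DOI_PATTERN` are hypothetical names, the pattern is a commonly used approximation for modern DOIs rather than a full grammar, and the PR's actual extraction logic may differ.

```python
import re
from typing import Optional
from urllib.parse import unquote

# Approximate pattern for modern DOIs: "10.", a 4-9 digit registrant code,
# a slash, then a suffix drawn from a conservative character set.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def extract_doi(text: str) -> Optional[str]:
    """Return the first DOI-like string in a URL or text, or None."""
    # Decode percent-escapes first so "10.1002%2Fabc" is seen as "10.1002/abc".
    match = DOI_PATTERN.search(unquote(text))
    if match is None:
        return None
    # Trim trailing punctuation that usually belongs to the surrounding text.
    return match.group(0).rstrip(".;")


if __name__ == "__main__":
    print(extract_doi("https://doi.org/10.1038/nature12345"))
```

When this returns None, a caller would fall through to the next strategy, such as the query-based search mentioned in the list.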

- ArXiv abs → pdf conversion with paper ID extraction
- Publisher-specific resolvers for Wiley, ScienceDirect, DOI
- Prefetching with browser-like headers to avoid 403 errors
- Sequential scraper execution with fallback strategies
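
The arXiv abs → pdf conversion above might be sketched as follows. The names `ARXIV_ID` and `arxiv_pdf_url` are hypothetical and this is not the resolver from the PR; it assumes the two public arXiv identifier forms (new-style such as 2301.00001 and old-style such as hep-th/9901001) and the `.pdf` URL shape used in the Usage section.

```python
import re
from typing import Optional

# Matches new-style IDs (2301.00001, optional version suffix) and
# old-style IDs (hep-th/9901001) under either /abs/ or /pdf/.
ARXIV_ID = re.compile(
    r"arxiv\.org/(?:abs|pdf)/"
    r"(?P<id>\d{4}\.\d{4,5}(?:v\d+)?|[a-z\-]+(?:\.[a-z]{2})?/\d{7}(?:v\d+)?)",
    re.IGNORECASE,
)


def arxiv_pdf_url(url: str) -> Optional[str]:
    """Convert an arXiv abstract or PDF URL to a direct PDF URL, or return None."""
    match = ARXIV_ID.search(url)
    if match is None:
        return None
    return f"https://arxiv.org/pdf/{match.group('id')}.pdf"


if __name__ == "__main__":
    print(arxiv_pdf_url("https://arxiv.org/abs/2301.00001"))
```

Returning None for non-arXiv URLs lets a caller pass the URL on to the next publisher-specific resolver unchanged.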

- Add PyPaperBot fallback when Docling processing fails
- Update Docling wrapper with enhanced metadata extraction
- Configurable pipeline with PDF processing modes and OCR options

- Add PyPaperBotScraper to __all__ exports
- Update WaterfallScraper export structure

- Test PyPaperBot scraper configuration and DOI extraction
- Test WaterfallScraper URL resolution logic
- Test composite scraper fallback behavior
- Test configuration validation

NISH1001 · 2025-08-20T02:19:48Z

akd/tools/scrapers/waterfall.py

Can we move composite scrapers like this one to akd.tools.scrapers.composite? It also uses the waterfall scraper, and these are composite scrapers built with dependency injection.

NISH1001 · 2025-08-20T02:20:36Z

pyproject.toml

-    "tiktoken>=0.9.0"
+    "tiktoken>=0.9.0",
+    "nbconvert>=7.16.6",
+    "matplotlib>=3.10.5",


Isn't matplotlib an extra dependency?

muthukumaranR added 6 commits August 19, 2025 01:25

add UnpaywallResolver for DOI-to-open-access resolution

5552716

- Implement DOI resolution via Unpaywall API
- Add email configuration for API compliance
- Include test coverage

add WaterfallScraper with intelligent URL resolution

cc8a7bb

- ArXiv abs → pdf conversion with paper ID extraction
- Publisher-specific resolvers for Wiley, ScienceDirect, DOI
- Prefetching with browser-like headers to avoid 403 errors
- Sequential scraper execution with fallback strategies

update OmniScraper with PyPaperBot fallback

35d1598

- Add PyPaperBot fallback when Docling processing fails
- Update Docling wrapper with enhanced metadata extraction
- Configurable pipeline with PDF processing modes and OCR options

update scraper package exports

6c8cf52

- Add PyPaperBotScraper to __all__ exports
- Update WaterfallScraper export structure

add test coverage for new scraper functionality

41ce06a

- Test PyPaperBot scraper configuration and DOI extraction
- Test WaterfallScraper URL resolution logic
- Test composite scraper fallback behavior
- Test configuration validation

github-actions bot mentioned this pull request Aug 19, 2025

Integration branch merge conflict #121

Closed

muthukumaranR force-pushed the enhance/scrapers branch from 4a34617 to 41ce06a Compare August 19, 2025 16:02

github-actions bot mentioned this pull request Aug 19, 2025

Integration branch merge conflict #122

Closed

add deps

5660042

github-actions bot mentioned this pull request Aug 19, 2025

Integration branch merge conflict #129

Closed

add unpaywall module in resolver

6ba7c25

github-actions bot mentioned this pull request Aug 19, 2025

Integration branch merge conflict #131

Closed

NISH1001 requested changes Aug 20, 2025
