Add PDF scrapers with URL resolution #120
base: develop
- Implement DOI resolution via Unpaywall API
- Add email configuration for API compliance
- Include test coverage
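For illustration only, a minimal sketch of what DOI resolution against Unpaywall could look like. It is not the code from this PR: the helper names `build_unpaywall_url` and `pick_pdf_url` are hypothetical, and the sketch assumes the public Unpaywall v2 endpoint and its `best_oa_location` / `oa_locations` / `url_for_pdf` response fields. No network call is made; the second helper only inspects an already-decoded JSON record.

```python
from typing import Optional
from urllib.parse import quote

# Assumed public endpoint of the Unpaywall v2 REST API.
UNPAYWALL_BASE = "https://api.unpaywall.org/v2"


def build_unpaywall_url(doi: str, email: str) -> str:
    """Build a lookup URL (hypothetical helper); Unpaywall expects a contact email."""
    if not email:
        raise ValueError("Unpaywall requests must include a contact email")
    # Keep "/" unescaped so the DOI stays readable in the path.
    return f"{UNPAYWALL_BASE}/{quote(doi, safe='/')}?email={quote(email)}"


def pick_pdf_url(record: dict) -> Optional[str]:
    """Return the first PDF link in a decoded Unpaywall record, or None."""
    best = record.get("best_oa_location") or {}
    if best.get("url_for_pdf"):
        return best["url_for_pdf"]
    # Fall back to any other open-access location that exposes a PDF.
    for location in record.get("oa_locations") or []:
        if location.get("url_for_pdf"):
            return location["url_for_pdf"]
    return None
```

Keeping URL construction separate from response parsing makes both halves testable without hitting the API.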
- Multi-source PDF fetching: CrossRef → Semantic Scholar → Unpaywall → PyPaperBot
- DOI extraction from URLs and HTML metadata
- Configurable Sci-Hub mirrors and proxy support
- Query-based search fallback when DOI extraction fails
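DOI extraction from a URL or from HTML metadata might look roughly like the sketch below. The function name `extract_doi`, the regular expression, and the choice of `citation_doi` / `dc.identifier` meta tags are assumptions made for this example, not taken from the PR.

```python
import re
from html.parser import HTMLParser
from typing import Optional

# Loose DOI pattern (an assumption): "10.", a 4-9 digit registrant, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"'<>?#]+")


class _MetaDOIParser(HTMLParser):
    """Collects the first DOI found in a <meta> tag commonly used by publishers."""

    def __init__(self) -> None:
        super().__init__()
        self.doi: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.doi is not None:
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in {"citation_doi", "dc.identifier"}:
            match = DOI_PATTERN.search(attr.get("content") or "")
            if match:
                self.doi = match.group(0)


def extract_doi(url: str, html: str = "") -> Optional[str]:
    """Try the URL first, then fall back to HTML metadata (hypothetical helper)."""
    match = DOI_PATTERN.search(url)
    if match:
        # Drop trailing punctuation that is unlikely to belong to the DOI.
        return match.group(0).rstrip(".,;)")
    parser = _MetaDOIParser()
    parser.feed(html)
    return parser.doi
```

When both lookups return `None`, a caller would move on to the query-based search fallback listed above.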
- ArXiv abs → pdf conversion with paper ID extraction
- Publisher-specific resolvers for Wiley, ScienceDirect, DOI
- Prefetching with browser-like headers to avoid 403 errors
- Sequential scraper execution with fallback strategies
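As a rough illustration of the abs → pdf conversion, the sketch below extracts the paper ID from an arXiv abstract URL and rebuilds the PDF URL. The function name `arxiv_abs_to_pdf` and the regular expression are assumptions for this example; the PR's actual resolver may differ.

```python
import re
from typing import Optional

# Matches abstract pages such as https://arxiv.org/abs/2301.12345v2
# and old-style IDs such as https://arxiv.org/abs/hep-th/9901001.
ARXIV_ABS = re.compile(
    r"^https?://(?:www\.)?arxiv\.org/abs/(?P<paper_id>[^?#]+?)/?(?:[?#].*)?$"
)


def arxiv_abs_to_pdf(url: str) -> Optional[str]:
    """Convert an arXiv abstract URL to its PDF URL; None if it is not an abs URL."""
    match = ARXIV_ABS.match(url.strip())
    if match is None:
        return None
    return f"https://arxiv.org/pdf/{match.group('paper_id')}"
```

Returning `None` for non-arXiv input lets a caller hand the URL to the next publisher-specific resolver.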
- Add PyPaperBot fallback when Docling processing fails
- Update Docling wrapper with enhanced metadata extraction
- Configurable pipeline with PDF processing modes and OCR options

- Add PyPaperBotScraper to `__all__` exports
- Update WaterfallScraper export structure

- Test PyPaperBot scraper configuration and DOI extraction
- Test WaterfallScraper URL resolution logic
- Test composite scraper fallback behavior
- Test configuration validation
4a34617 to 41ce06a
Can we move composite scrapers like this one to `akd.tools.scrapers.composite`? It also uses the waterfall scraper, and these are all composite scrapers with dependency injection.
    "tiktoken>=0.9.0"
    "tiktoken>=0.9.0",
    "nbconvert>=7.16.6",
    "matplotlib>=3.10.5",
Isn't matplotlib an extra dependency?
Summary
Add PDF scraper capabilities with URL resolution and fallback strategies.
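The fallback strategy can be pictured with the minimal sketch below. It is not the PR's `WaterfallScraper`: the `Scraper` type alias and `fetch_with_fallback` are hypothetical names, and the sketch assumes each scraper is a callable that returns PDF bytes or `None`.

```python
from typing import Callable, Optional, Sequence

# Hypothetical scraper signature: takes a URL, returns PDF bytes or None.
Scraper = Callable[[str], Optional[bytes]]


def fetch_with_fallback(url: str, scrapers: Sequence[Scraper]) -> Optional[bytes]:
    """Try each scraper in order and return the first non-empty result."""
    for scraper in scrapers:
        try:
            content = scraper(url)
        except Exception:
            # A failing source (e.g. HTTP 403, timeout) should not stop the chain.
            continue
        if content:
            return content
    return None
```

Passing the scrapers in as a sequence keeps the ordering configurable and makes it easy to substitute stubs in tests.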
Changes
Usage
Testing