@muthukumaranR
Collaborator

Summary

Add PDF scraper capabilities with URL resolution and fallback strategies.

Changes

  • PyPaperBot Integration: Multi-source PDF retrieval with CrossRef/Scholar/Unpaywall fallbacks
  • URL Resolution: ArXiv abs → pdf conversion, publisher-specific resolvers
  • Waterfall Strategy: Sequential scraper execution with fallbacks
  • Unpaywall API: DOI-to-open-access resolution with email configuration
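
For reviewers, the ArXiv abs → pdf conversion listed above can be illustrated with a minimal sketch. This is not the PR's implementation: the function name `arxiv_abs_to_pdf` and the regular expression are assumptions made for illustration, and only new-style arXiv IDs (for example `2301.00001`) are handled.

```python
import re
from typing import Optional

# Hypothetical pattern: matches new-style arXiv abstract URLs such as
# https://arxiv.org/abs/2301.00001 or https://arxiv.org/abs/2301.00001v2
_ARXIV_ABS = re.compile(
    r"^https?://(?:www\.)?arxiv\.org/abs/(?P<id>\d{4}\.\d{4,5}(?:v\d+)?)/?$"
)


def arxiv_abs_to_pdf(url: str) -> Optional[str]:
    """Return the PDF URL for an arXiv abstract URL, or None if it is not one."""
    match = _ARXIV_ABS.match(url.strip())
    if match is None:
        return None
    # Extract the paper ID and rebuild the URL in the /pdf/<id>.pdf form
    # used in the Usage section of this PR.
    return f"https://arxiv.org/pdf/{match.group('id')}.pdf"
```

URLs that do not match, such as DOI links, return `None`, so a caller could pass them on to a publisher-specific resolver instead.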

Usage

# UnpaywallResolver for DOI resolution
config = ArticleResolverConfig(debug=True, user_agent="Research Bot")
resolver = UnpaywallResolver(config=config)
result = await resolver.arun(ResolverInputSchema(url="https://doi.org/10.1038/nature12345"))

# PyPaperBot scraper
scraper = PyPaperBotScraper()
result = await scraper.arun(ScraperToolInputSchema(url="https://arxiv.org/abs/2301.00001"))

# Waterfall scraper with URL resolution
waterfall = WaterfallScraper()
result = await waterfall.arun(ScraperToolInputSchema(url="https://doi.org/10.1038/example"))

# CompositeWaterfallScraper with PyPaperBot fallback
composite = CompositeWaterfallScraper(scrapers=["docling", "pypaperbot"])
result = await composite.arun(ScraperToolInputSchema(url="https://arxiv.org/pdf/2301.00001.pdf"))
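
The waterfall behaviour used above (try each scraper in order and fall back when one fails) can be sketched independently of the classes in this PR. The names `run_waterfall`, `failing_scraper`, and `fallback_scraper` are hypothetical stand-ins, and the real `WaterfallScraper` may differ in how it reports errors.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence

# A scraper here is any async callable that takes a URL and returns text or None.
Scraper = Callable[[str], Awaitable[Optional[str]]]


async def run_waterfall(url: str, scrapers: Sequence[Scraper]) -> Optional[str]:
    """Try each scraper in order; return the first non-empty result."""
    for scraper in scrapers:
        try:
            content = await scraper(url)
        except Exception:
            # A failing scraper should not abort the chain; move to the next one.
            continue
        if content:
            return content
    return None  # every scraper failed or returned nothing


async def failing_scraper(url: str) -> Optional[str]:
    # Stand-in for a primary scraper that cannot process the document.
    raise RuntimeError("simulated processing failure")


async def fallback_scraper(url: str) -> Optional[str]:
    # Stand-in for a secondary scraper that succeeds.
    return f"content from {url}"


if __name__ == "__main__":
    url = "https://arxiv.org/pdf/2301.00001.pdf"
    print(asyncio.run(run_waterfall(url, [failing_scraper, fallback_scraper])))
```

Swallowing every exception keeps the sketch short; a production version would probably log each failure so that a fully failed chain can be diagnosed.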

Testing

  • Integration tests with real API calls for UnpaywallResolver
  • PyPaperBot configuration and DOI extraction tests
  • WaterfallScraper URL resolution logic validation
  • CompositeWaterfallScraper fallback behavior tests

- Implement DOI resolution via Unpaywall API
- Add email configuration for API compliance
- Include test coverage
- Multi-source PDF fetching: CrossRef → Semantic Scholar → Unpaywall → PyPaperBot
- DOI extraction from URLs and HTML metadata
- Configurable Sci-Hub mirrors and proxy support
- Query-based search fallback when DOI extraction fails
- ArXiv abs → pdf conversion with paper ID extraction
- Publisher-specific resolvers for Wiley, ScienceDirect, DOI
- Prefetching with browser-like headers to avoid 403 errors
- Sequential scraper execution with fallback strategies
- Add PyPaperBot fallback when Docling processing fails
- Update Docling wrapper with enhanced metadata extraction
- Configurable pipeline with PDF processing modes and OCR options
- Add PyPaperBotScraper to __all__ exports
- Update WaterfallScraper export structure
- Test PyPaperBot scraper configuration and DOI extraction
- Test WaterfallScraper URL resolution logic
- Test composite scraper fallback behavior
- Test configuration validation
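
As a rough illustration of the DOI extraction and Unpaywall lookup described in the commits above, here is a sketch that makes no network calls. The helper names (`extract_doi`, `unpaywall_url`, `pick_pdf_url`) are hypothetical and not taken from this PR. The endpoint shape and the `best_oa_location` / `url_for_pdf` fields follow the public Unpaywall v2 API as I understand it and should be checked against its documentation.

```python
import re
from typing import Any, Dict, Optional
from urllib.parse import quote, urlencode

# Loose DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
_DOI = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def extract_doi(url: str) -> Optional[str]:
    """Return the first DOI-looking substring in a URL, or None."""
    match = _DOI.search(url)
    return match.group(0) if match else None


def unpaywall_url(doi: str, email: str) -> str:
    """Build an Unpaywall v2 lookup URL; the API expects an email parameter."""
    return f"https://api.unpaywall.org/v2/{quote(doi, safe='/')}?{urlencode({'email': email})}"


def pick_pdf_url(payload: Dict[str, Any]) -> Optional[str]:
    """Pick a PDF link from an Unpaywall-style response dict (assumed shape)."""
    best = payload.get("best_oa_location") or {}
    return best.get("url_for_pdf") or best.get("url")


if __name__ == "__main__":
    doi = extract_doi("https://doi.org/10.1038/nature12345")
    print(doi)
    print(unpaywall_url(doi, "bot@example.org"))
    # Hand-written sample payload, not a real API response.
    sample = {"best_oa_location": {"url_for_pdf": "https://example.org/paper.pdf"}}
    print(pick_pdf_url(sample))
```

When `extract_doi` returns `None`, a caller could fall back to the query-based search mentioned in the commit list.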
Collaborator

Can we move composite scrapers like this one to akd.tools.scrapers.composite? It also uses the waterfall scraper, and these are all composite scrapers built with dependency injection.

"tiktoken>=0.9.0"
"tiktoken>=0.9.0",
"nbconvert>=7.16.6",
"matplotlib>=3.10.5",
Collaborator

Isn't matplotlib an extra dependency?

3 participants