feat: batch arXiv API calls, RSS-first fetching, and retry resilience (v0.3.1) #8
Merged
added 2 commits
February 16, 2026 22:14
- Batch category queries into single OR query (N API calls → 1)
- Match page_size to min(max_results, 100) to avoid over-fetching
- Add tenacity exponential backoff (5s→15s→45s→90s) on arxiv.HTTPError
- Add ArxivRateLimitError with friendly message for HTTP 429
- Remove ThreadPoolExecutor from fetch_recent_papers (single call now)
- Update mock local_client and all tests for new categories list signature
- Add 8 new unit tests for batched queries, page_size, retry, and 429s
- Bump version to 0.3.1
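The batching and page-size items in this commit can be illustrated with a minimal sketch. The helper names `build_batched_query` and `choose_page_size` are hypothetical and are not taken from the PR; only the `cat:X OR cat:Y` query shape and the `min(max_results, 100)` rule come from the description.

```python
from typing import Sequence


def build_batched_query(categories: Sequence[str]) -> str:
    """Join arXiv category codes into one OR query (hypothetical helper).

    One combined query replaces one request per category.
    """
    if not categories:
        raise ValueError("at least one category is required")
    return " OR ".join(f"cat:{code}" for code in categories)


def choose_page_size(max_results: int, cap: int = 100) -> int:
    """Request no more per page than the caller actually wants (hypothetical helper)."""
    return min(max_results, cap)


query = build_batched_query(["cs.AI", "cs.CL", "cs.LG"])
page_size = choose_page_size(20)
```

With the `arxiv` package, values like these would presumably be passed as `arxiv.Search(query=query, max_results=20)` and `arxiv.Client(page_size=page_size)`; how the PR wires them together is not shown here.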
Daily lookups now try arXiv RSS feeds before falling back to the API, eliminating rate-limit exposure for the most common usage pattern. --force-refresh no longer forces a 7-day backfill; it fetches today's papers via the fast RSS path. The 7-day bootstrap is reserved for true first runs (no prior watermark). Quick Start updated to recommend plain `paperweight run`.
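The RSS-first behavior described in this commit could look roughly like the sketch below. The function name `fetch_with_rss_first` and the injected `fetch_rss` / `fetch_api` callables are assumptions made for illustration, and the broad `except Exception` is a simplification; the actual `fetch_recent_papers()` may differ.

```python
from typing import Callable, Dict, List

Paper = Dict[str, str]  # stand-in for whatever paper record the project uses


def fetch_with_rss_first(
    start_days: int,
    fetch_rss: Callable[[], List[Paper]],
    fetch_api: Callable[[int], List[Paper]],
) -> List[Paper]:
    """Try RSS for daily lookups; fall back to the API on failure or empty results."""
    if start_days <= 1:
        try:
            papers = fetch_rss()
        except Exception:
            # Simplification: any RSS error is treated as "use the API instead".
            papers = []
        if papers:
            return papers
    # Multi-day windows, RSS errors, and empty feeds all end up here.
    return fetch_api(start_days)
```

Passing the two fetchers in as callables keeps the fallback logic testable without network access.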
Summary
Minimizes arXiv API pressure, adds an RSS fast path for daily runs, and adds retry resilience to prevent 429 rate-limit failures.
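To make the retry behavior concrete, here is a standard-library sketch of the 5s → 15s → 45s → 90s schedule. The PR itself uses `tenacity`; this hand-rolled version only mirrors the schedule. The helper names, the factor-of-3-with-cap reading of the schedule, and the assumption of four waits followed by a final attempt are illustrative guesses, not the PR's code.

```python
import time
from typing import Callable, List, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 5.0, factor: float = 3.0,
                   cap: float = 90.0, waits: int = 4) -> List[float]:
    """Exponential delays capped at `cap`; the defaults reproduce 5, 15, 45, 90."""
    return [min(base * factor ** i, cap) for i in range(waits)]


def call_with_backoff(
    fn: Callable[[], T],
    delays: Sequence[float],
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, sleeping through `delays` after each retryable failure."""
    for delay in delays:
        try:
            return fn()
        except retry_on:
            sleep(delay)
    # Final attempt: any exception now propagates to the caller.
    return fn()
```

The `sleep` parameter is injectable so that tests can record the waits instead of actually pausing.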
Problem
- One API call was issued per category (through `ThreadPoolExecutor`), easily triggering 429s
- `page_size=100` always requested 100 results even when `max_results` was small
- `--force-refresh` forced a 7-day API window even when the user just wanted today's freshest papers
Changes
RSS-first daily fetching
- `fetch_rss_papers()` fetches today's papers from arXiv RSS feeds: no rate limits, sub-second metadata
- `fetch_recent_papers()` tries RSS first for daily lookups (`start_days <= 1`), falling back to the arXiv API on failure or empty results
`--force-refresh` decoupled from 7-day window
- `--force-refresh` now sets `days=1` (today only → RSS fast path) instead of `days=7` (always API)
- The 7-day bootstrap is reserved for true first runs (`last_processed_date is None`)
- Quick Start now recommends plain `paperweight run` since it already backfills automatically
API call reduction
- Category queries are batched into a single OR query (`cat:cs.AI OR cat:cs.CL OR cat:cs.LG`): N calls → 1
- Removed `ThreadPoolExecutor` from `fetch_recent_papers()` since only one call is needed
- `page_size` now set to `min(max_results, 100)` instead of hardcoded 100
Retry resilience
- Added `tenacity` exponential backoff (5s → 15s → 45s → 90s) wrapping `client.results()`
- Added `ArxivRateLimitError` exception with user-friendly message for HTTP 429
Onboarding timing (real API, dev mode)
- `uv sync --all-extras`
- `paperweight init`
- `paperweight doctor`
- `paperweight run` (first run, 50 papers, API)
- `paperweight run --force-refresh` (RSS path)
Files changed
- `scraper.py`
- `main.py`: `ArxivRateLimitError` in pipeline + error mapper
- `__init__.py`: `ArxivRateLimitError`
- `README.md`: `--force-refresh` scoped as power-user flag
- `CHANGELOG.md`
- `test_scraper.py`
- `pyproject.toml`
Test results