feat: batch arXiv API calls, RSS-first fetching, and retry resilience (v0.3.1) #8

Merged
seanbrar merged 3 commits into main from fix/arxiv-api-optimization on Feb 17, 2026

@seanbrar seanbrar commented Feb 17, 2026

Summary

Minimizes arXiv API pressure, adds an RSS fast path for daily runs, and adds retry resilience to prevent 429 rate-limit failures.

Problem

  • Each category fired its own parallel API request (ThreadPoolExecutor), easily triggering 429s
  • page_size=100 always requested 100 results even when max_results was small
  • No exponential backoff — the arxiv.py library only retries with a flat 3s delay
  • 429 errors surfaced as raw stack traces
  • --force-refresh forced a 7-day API window even when the user just wanted today's freshest papers

Changes

RSS-first daily fetching

  • New fetch_rss_papers() fetches today's papers from arXiv RSS feeds — no rate limits, sub-second metadata
  • fetch_recent_papers() tries RSS first for daily lookups (start_days <= 1), falling back to the arXiv API on failure or empty results
  • Multi-day ranges (backfill, catch-up) still use the arXiv API directly
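
The routing rule above can be sketched roughly as follows. This is a minimal illustration, not the code in scraper.py: the function name `fetch_recent_papers_sketch`, the injected `fetch_rss` / `fetch_api` callables, and the `(source, papers)` return shape are all assumptions made so the example is self-contained.

```python
from typing import Callable

Paper = dict  # hypothetical record type, used only for this sketch


def fetch_recent_papers_sketch(
    categories: list[str],
    start_days: int,
    fetch_rss: Callable[[list[str]], list[Paper]],
    fetch_api: Callable[[list[str], int], list[Paper]],
) -> tuple[str, list[Paper]]:
    """Return (source, papers), preferring RSS for daily lookups."""
    if start_days <= 1:
        try:
            papers = fetch_rss(categories)
        except Exception:
            # Assumption: any RSS failure is treated like an empty feed.
            papers = []
        if papers:
            return "rss", papers
    # Multi-day ranges, or an RSS attempt that failed or came back empty.
    return "api", fetch_api(categories, start_days)
```

Passing the fetchers in as arguments is a choice made for the sketch; it lets the routing be exercised without touching the network.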

--force-refresh decoupled from 7-day window

  • --force-refresh now sets days=1 (today only → RSS fast path) instead of days=7 (always API)
  • 7-day bootstrap reserved for true first runs (last_processed_date is None)
  • Quick Start updated: recommends plain paperweight run since it already backfills automatically
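
One way to picture the new window selection is the sketch below. The helper name `resolve_days` is hypothetical, and two details are assumptions rather than facts from this PR: that --force-refresh takes precedence over the first-run bootstrap, and that a normal run catches up by the number of days since the last watermark.

```python
from datetime import date
from typing import Optional


def resolve_days(
    last_processed_date: Optional[date],
    force_refresh: bool,
    today: date,
) -> int:
    """Pick how many days back to fetch (illustrative only)."""
    if force_refresh:
        # Today only, which keeps the run on the RSS fast path.
        return 1
    if last_processed_date is None:
        # True first run: 7-day bootstrap through the API.
        return 7
    # Assumed catch-up rule: cover the gap since the last run, at least 1 day.
    return max(1, (today - last_processed_date).days)
```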

API call reduction

  • Batch all categories into a single OR query (cat:cs.AI OR cat:cs.CL OR cat:cs.LG) — N calls → 1
  • Removed ThreadPoolExecutor from fetch_recent_papers() since only one call is needed
  • page_size now set to min(max_results, 100) instead of hardcoded 100
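
As a rough illustration of the two changes above, the query string and page size could be derived like this. The helper names `build_category_query` and `effective_page_size` are made up for the example and are not taken from scraper.py.

```python
def build_category_query(categories: list[str]) -> str:
    """Join categories into one OR query so a single API call covers them all."""
    return " OR ".join(f"cat:{category}" for category in categories)


def effective_page_size(max_results: int, cap: int = 100) -> int:
    """Never request a larger page than the caller actually wants."""
    return min(max_results, cap)


query = build_category_query(["cs.AI", "cs.CL", "cs.LG"])
print(query)                     # cat:cs.AI OR cat:cs.CL OR cat:cs.LG
print(effective_page_size(50))   # 50
print(effective_page_size(500))  # 100
```

The resulting values would then be handed to the arxiv.py search and client objects in place of the per-category calls.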

Retry resilience

  • Added tenacity exponential backoff (5s → 15s → 45s → 90s) wrapping client.results()
  • New ArxivRateLimitError exception with user-friendly message for HTTP 429
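
The PR uses tenacity for this. The sketch below shows the same idea with a plain loop and only the standard library, so the wait schedule is easy to see. `TransientHTTPError` is a stand-in for arxiv.HTTPError, and `backoff_schedule`, `call_with_backoff`, the attempt count (one initial call plus four retries), and the error message text are all assumptions.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class TransientHTTPError(Exception):
    """Stand-in for arxiv.HTTPError in this sketch."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


class ArxivRateLimitError(Exception):
    """Raised when arXiv keeps answering HTTP 429 after all retries."""


def backoff_schedule(
    base: float = 5.0, factor: float = 3.0, cap: float = 90.0, steps: int = 4
) -> list[float]:
    """Exponential waits capped at `cap`: 5s, 15s, 45s, 90s by default."""
    return [min(base * factor**i, cap) for i in range(steps)]


def call_with_backoff(
    fn: Callable[[], T],
    delays: Sequence[float],
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, sleeping through `delays` between failed attempts."""
    last_error: Optional[TransientHTTPError] = None
    for delay in [*delays, None]:  # None marks the final attempt
        try:
            return fn()
        except TransientHTTPError as exc:
            last_error = exc
            if delay is None:
                break
            sleep(delay)
    assert last_error is not None
    if last_error.status == 429:
        # Hypothetical wording for the user-friendly message.
        raise ArxivRateLimitError(
            "arXiv is rate-limiting requests (HTTP 429). "
            "Wait a few minutes and try again."
        ) from last_error
    raise last_error
```

Injecting `sleep` keeps the retry path testable without real waiting.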

Onboarding timing (real API, dev mode)

| Step | Time |
|---|---|
| uv sync --all-extras | 1.5s |
| paperweight init | 0.4s |
| paperweight doctor | 0.9s |
| paperweight run (first run, 50 papers, API) | ~3–7s |
| paperweight run --force-refresh (RSS path) | 0.9s |
| Cached repeat run | 0.4s |

Files changed

| File | Change |
|---|---|
| scraper.py | RSS fetcher, RSS-first routing, force-refresh fix, batched queries, backoff |
| main.py | Catch ArxivRateLimitError in pipeline + error mapper |
| __init__.py | Export ArxivRateLimitError |
| README.md | Quick Start simplified; --force-refresh scoped as power-user flag |
| CHANGELOG.md | Updated v0.3.1 entry |
| test_scraper.py | 26 new tests (RSS parsing, routing, force-refresh, batching, retry) |
| pyproject.toml | Version → 0.3.1 |

Test results

140 passed, 4 skipped, 0 failures

Sean Brar added 2 commits February 16, 2026 22:14
- Batch category queries into single OR query (N API calls → 1)
- Match page_size to min(max_results, 100) to avoid over-fetching
- Add tenacity exponential backoff (5s→15s→45s→90s) on arxiv.HTTPError
- Add ArxivRateLimitError with friendly message for HTTP 429
- Remove ThreadPoolExecutor from fetch_recent_papers (single call now)
- Update mock local_client and all tests for new categories list signature
- Add 8 new unit tests for batched queries, page_size, retry, and 429s
- Bump version to 0.3.1

Daily lookups now try arXiv RSS feeds before falling back to the API,
eliminating rate-limit exposure for the most common usage pattern.

--force-refresh no longer forces a 7-day backfill; it fetches today's
papers via the fast RSS path. The 7-day bootstrap is reserved for true
first runs (no prior watermark).

Quick Start updated to recommend plain `paperweight run`.
@seanbrar seanbrar changed the title from "feat: batch arXiv API calls and add retry resilience (v0.3.1)" to "feat: batch arXiv API calls, RSS-first fetching, and retry resilience (v0.3.1)" on Feb 17, 2026
@seanbrar seanbrar merged commit 5241c79 into main Feb 17, 2026
3 checks passed
@seanbrar seanbrar deleted the fix/arxiv-api-optimization branch February 17, 2026 08:17