feat: batch arXiv API calls, RSS-first fetching, and retry resilience (v0.3.1) by seanbrar · Pull Request #8 · seanbrar/paperweight

seanbrar · 2026-02-17T06:15:26Z

Summary

Minimizes arXiv API pressure, adds an RSS fast path for daily runs, and adds retry resilience to prevent 429 rate-limit failures.

Problem

Each category fired its own parallel API request (ThreadPoolExecutor), easily triggering 429s
page_size=100 always requested 100 results even when max_results was small
No exponential backoff — the arxiv.py library only retries with a flat 3s delay
429 errors surfaced as raw stack traces
--force-refresh forced a 7-day API window even when the user just wanted today's freshest papers

Changes

RSS-first daily fetching

New fetch_rss_papers() fetches today's papers from arXiv RSS feeds — no rate limits, sub-second metadata
fetch_recent_papers() tries RSS first for daily lookups (start_days <= 1), falling back to the arXiv API on failure or empty results
Multi-day ranges (backfill, catch-up) still use the arXiv API directly
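
The routing described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `scraper.py`: the `route_fetch` name, the `Paper` alias, and the injected `rss_fetcher` / `api_fetcher` callables are hypothetical. Only the `start_days <= 1` condition and the fallback on failure or empty results come from this description.

```python
from typing import Callable

Paper = dict  # hypothetical stand-in for the scraper's paper record type
Fetcher = Callable[[list[str]], list[Paper]]


def route_fetch(
    categories: list[str],
    start_days: int,
    rss_fetcher: Fetcher,
    api_fetcher: Fetcher,
) -> list[Paper]:
    """Try RSS for daily lookups; otherwise, or on failure, use the API."""
    if start_days <= 1:
        try:
            papers = rss_fetcher(categories)
        except Exception:
            # Assumption: any RSS failure is treated like an empty result.
            papers = []
        if papers:
            return papers
    # Multi-day ranges, RSS errors, and empty RSS results all land here.
    return api_fetcher(categories)
```

Passing the fetchers in as arguments keeps the routing decision testable without network access, which is presumably similar in spirit to how the new routing tests mock both paths.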

--force-refresh decoupled from 7-day window

--force-refresh now sets days=1 (today only → RSS fast path) instead of days=7 (always API)
7-day bootstrap reserved for true first runs (last_processed_date is None)
Quick Start updated: recommends plain paperweight run since it already backfills automatically
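
One way to picture the new window selection is the sketch below. The function name `resolve_fetch_days` is hypothetical, and two details are assumptions rather than facts from this PR: that `--force-refresh` takes precedence over the first-run bootstrap, and that a normal run catches up by the number of days since the watermark.

```python
from datetime import date
from typing import Optional


def resolve_fetch_days(
    last_processed_date: Optional[date],
    today: date,
    force_refresh: bool = False,
) -> int:
    """Pick how many days back to fetch."""
    if force_refresh:
        # Today only, which keeps the run on the RSS fast path.
        return 1
    if last_processed_date is None:
        # True first run: 7-day bootstrap via the API.
        return 7
    # Assumed catch-up rule: cover the gap since the last processed date.
    return max(1, (today - last_processed_date).days)
```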

API call reduction

Batch all categories into a single OR query (cat:cs.AI OR cat:cs.CL OR cat:cs.LG) — N calls → 1
Removed ThreadPoolExecutor from fetch_recent_papers() since only one call is needed
page_size now set to min(max_results, 100) instead of hardcoded 100
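
A minimal sketch of the batching and page-size logic, for illustration only. The helper names `build_batched_query` and `effective_page_size` are hypothetical; the query shape and the `min(max_results, 100)` rule are taken from the list above.

```python
def build_batched_query(categories: list[str]) -> str:
    """Join categories into one OR query so a single API call covers them all."""
    if not categories:
        raise ValueError("at least one category is required")
    return " OR ".join(f"cat:{category}" for category in categories)


def effective_page_size(max_results: int, api_cap: int = 100) -> int:
    """Never request a larger page than the caller actually needs."""
    return min(max_results, api_cap)
```

These values would then presumably be passed to `arxiv.Search(query=...)` and `arxiv.Client(page_size=...)`, so a run with `max_results=50` requests one page of 50 rather than one page of 100.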

Retry resilience

Added tenacity exponential backoff (5s → 15s → 45s → 90s) wrapping client.results()
New ArxivRateLimitError exception with user-friendly message for HTTP 429
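
The retry schedule can be illustrated with a dependency-free loop. The PR itself uses tenacity around `client.results()`; this sketch swaps in a plain standard-library loop so the waits are easy to see and test. `HTTPStatusError` is a hypothetical stand-in for the library's HTTP error, `call_with_backoff` is a made-up name, and the error message text is illustrative, not the one shipped in this PR.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Waits between attempts, per the schedule quoted above.
BACKOFF_SECONDS = (5, 15, 45, 90)


class HTTPStatusError(Exception):
    """Hypothetical stand-in for the HTTP error raised by the API client."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


class ArxivRateLimitError(Exception):
    """Name from this PR; the message used below is illustrative only."""


def call_with_backoff(
    fn: Callable[[], T],
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, sleeping 5s, 15s, 45s, then 90s between failed attempts."""
    for wait in BACKOFF_SECONDS:
        try:
            return fn()
        except HTTPStatusError:
            sleep(wait)
    try:
        return fn()  # final attempt after the last wait
    except HTTPStatusError as exc:
        if exc.status == 429:
            # Replace the raw stack trace with an actionable message.
            raise ArxivRateLimitError(
                "arXiv is rate-limiting requests; wait a few minutes and retry."
            ) from exc
        raise
```

Injecting `sleep` lets tests record the waits instead of actually pausing for them.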

Onboarding timing (real API, dev mode)

| Step | Time |
| --- | --- |
| `uv sync --all-extras` | 1.5s |
| `paperweight init` | 0.4s |
| `paperweight doctor` | 0.9s |
| `paperweight run` (first run, 50 papers, API) | ~3–7s |
| `paperweight run --force-refresh` (RSS path) | 0.9s |
| Cached repeat run | 0.4s |

Files changed

| File | Change |
| --- | --- |
| `scraper.py` | RSS fetcher, RSS-first routing, force-refresh fix, batched queries, backoff |
| `main.py` | Catch `ArxivRateLimitError` in pipeline + error mapper |
| `__init__.py` | Export `ArxivRateLimitError` |
| `README.md` | Quick Start simplified; `--force-refresh` scoped as power-user flag |
| `CHANGELOG.md` | Updated v0.3.1 entry |
| `test_scraper.py` | 26 new tests (RSS parsing, routing, force-refresh, batching, retry) |
| `pyproject.toml` | Version → 0.3.1 |

Test results

140 passed, 4 skipped, 0 failures

- Batch category queries into single OR query (N API calls → 1)
- Match page_size to min(max_results, 100) to avoid over-fetching
- Add tenacity exponential backoff (5s→15s→45s→90s) on arxiv.HTTPError
- Add ArxivRateLimitError with friendly message for HTTP 429
- Remove ThreadPoolExecutor from fetch_recent_papers (single call now)
- Update mock local_client and all tests for new categories list signature
- Add 8 new unit tests for batched queries, page_size, retry, and 429s
- Bump version to 0.3.1

Daily lookups now try arXiv RSS feeds before falling back to the API, eliminating rate-limit exposure for the most common usage pattern. --force-refresh no longer forces a 7-day backfill; it fetches today's papers via the fast RSS path. The 7-day bootstrap is reserved for true first runs (no prior watermark). Quick Start updated to recommend plain `paperweight run`.

Sean Brar added 2 commits February 16, 2026 22:14

seanbrar changed the title ~~feat: batch arXiv API calls and add retry resilience (v0.3.1)~~ feat: batch arXiv API calls, RSS-first fetching, and retry resilience (v0.3.1) Feb 17, 2026

style: fix linting and formatting issues

7e52957

seanbrar merged commit 5241c79 into main Feb 17, 2026
3 checks passed

seanbrar deleted the fix/arxiv-api-optimization branch February 17, 2026 08:17

seanbrar merged 3 commits into main from fix/arxiv-api-optimization
