Merged
2 changes: 2 additions & 0 deletions .gitignore
@@ -8,6 +8,8 @@ data/
artifacts/
.DS_Store
.coverage
.paperweight_cache.json
*.bak

# Build artifacts
build/
18 changes: 17 additions & 1 deletion CHANGELOG.md
@@ -7,6 +7,21 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [0.3.0] - 2026-02-15

### Added
- Profile switching via `--profile NAME` flag and `PAPERWEIGHT_PROFILE` env var
- Metadata cache (`metadata_cache` config section) to skip repeated arXiv API calls within a TTL window
- Progress logging during triage and summary LLM calls
- Per-call LLM timeout (45 s) for triage and summary to prevent hanging runs
- `--version` flag on the CLI
- Public API surface: `paperweight.__version__`, `load_config`, `get_recent_papers`, `score_papers`, etc. re-exported from `__init__.py`

### Changed
- Triage now uses per-paper async calls (same pattern as summaries) instead of batch `run_many`
- Triage rationale is compact: prompt asks for max 20 words, output is whitespace-normalized and truncated
- `paperweight init` prints a clean error to stderr (no traceback) when config already exists; use `--force` to overwrite

## [0.2.0] - 2026-02-14

### Added
@@ -72,7 +87,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Email notification system
- YAML-based configuration

[Unreleased]: https://github.com/seanbrar/paperweight/compare/v0.3.0...HEAD
[0.3.0]: https://github.com/seanbrar/paperweight/compare/v0.2.0...v0.3.0
[0.2.0]: https://github.com/seanbrar/paperweight/compare/v0.1.2...v0.2.0
[0.1.2]: https://github.com/seanbrar/paperweight/compare/v0.1.1...v0.1.2
[0.1.1]: https://github.com/seanbrar/paperweight/compare/v0.1.0...v0.1.1
8 changes: 8 additions & 0 deletions README.md
@@ -71,10 +71,16 @@ paperweight run --delivery email

# strict checks for CI/release gates
paperweight doctor --strict

# activate a named profile
paperweight run --profile fast
```

Detailed command behavior: `docs/CLI.md`

`--max-items` is a processing cap (not a guaranteed output count): paperweight
processes at most N fetched papers, and may output fewer if filters remove them.

## Configuration

Core sections:
@@ -83,6 +89,8 @@
- `triage`: shortlist gate (title + abstract)
- `processor`: scoring config
- `analyzer`: `abstract` or `summary`
- `metadata_cache` (optional, speeds up repeated runs)
- `profiles` (optional, named config overlays)
- `logging`
- `notifier` (optional, only for email)

47 changes: 38 additions & 9 deletions config-base.yaml
@@ -19,26 +19,41 @@ processor:
content_keyword_weight: 1
exclusion_keyword_penalty: 5
important_words_weight: 0.5
min_score: 3

analyzer:
type: abstract # summary | abstract
# llm_provider: openai # openai | gemini
# api_key: ${OPENAI_API_KEY}
# model: gpt-5-mini
max_input_tokens: 7000
max_input_chars: 20000

# AI triage — opt-in. Uncomment and set enabled: true to use.
# Requires an LLM provider and API key.
# triage:
# enabled: true
# llm_provider: openai # openai | gemini
# # api_key: ${OPENAI_API_KEY}
# # model: gpt-5-mini
# min_score: 60
# max_selected: 25

# Metadata cache — avoids repeated arXiv API calls within the TTL window.
metadata_cache:
enabled: true
path: .paperweight_cache.json
ttl_hours: 4

logging:
level: INFO # DEBUG | INFO | WARNING | ERROR
# file: paperweight.log # Omit for stderr-only logging

# Concurrency limits for parallelized pipeline stages.
concurrency:
content_fetch: 6 # 1-20, threads for paper content downloads
triage: 3 # 1-10, async LLM workers for triage
summary: 3 # 1-10, async LLM workers for summarization

# Optional metadata for --delivery atom
feed:
@@ -71,3 +86,17 @@ db:
# Optional artifact storage path
storage:
base_dir: data/artifacts

# Named profiles — activate with --profile NAME or PAPERWEIGHT_PROFILE env var.
# Each profile is a partial config overlay that deep-merges on top of the base.
# profiles:
# fast:
# arxiv:
# max_results: 20
# triage:
# max_selected: 10
# deep:
# arxiv:
# max_results: 200
# triage:
# max_selected: 50
39 changes: 30 additions & 9 deletions docs/CLI.md
@@ -8,6 +8,10 @@ paperweight has three commands:

`paperweight` is shorthand for `paperweight run`.

Global flags:

- `--version` — print version and exit

## run

```bash
paperweight run \
[--delivery stdout|json|atom|email] \
[--output PATH] \
[--sort-order relevance|alphabetical|publication_time] \
[--max-items N]
[--max-items N] \
[--profile NAME] \
[--quiet]
```

Behavior:
@@ -26,6 +32,9 @@
- runs triage on title + abstract
- hydrates full text only for shortlisted papers
- scores/summarizes and delivers digest
- `--max-items N` caps how many fetched papers enter processing (triage/hydration/summary); output may be fewer than `N` after filtering
- `--profile NAME` activates a named profile from the config's `profiles` section (or set `PAPERWEIGHT_PROFILE` env var)
- `--quiet` suppresses progress status lines on stderr

Delivery modes:

@@ -34,14 +43,24 @@
- `atom`: Atom feed XML
- `email`: SMTP send via `notifier.email` config

`json` fields (always present):

- `title` — paper title
- `arxiv_id` — arXiv identifier
- `authors` — list of author names
- `categories` — list of arXiv categories
- `published` — publication date (ISO format)
- `abstract` — paper abstract
- `link` — arXiv abstract URL
- `pdf_url` — direct PDF URL
- `score` — relevance score (float)
- `keywords_matched` — list of matched keywords

`json` fields (conditional):

- `triage_score` — present when triage is enabled
- `triage_rationale` — present when triage is enabled
- `summary` — present when summary differs from abstract (i.e. LLM summarization was used)
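
The field contract above can be consumed directly from a script. A minimal sketch — the sample record is hypothetical, but the keys follow the documented contract; in practice the list would come from `paperweight run --delivery json`:

```python
import json

# Hypothetical digest record mirroring the documented field contract;
# real output would come from `paperweight run --delivery json`.
digest = json.loads("""
[
  {
    "title": "Attention Is All You Need",
    "arxiv_id": "1706.03762",
    "authors": ["A. Vaswani"],
    "categories": ["cs.CL"],
    "published": "2017-06-12",
    "abstract": "The dominant sequence transduction models...",
    "link": "https://arxiv.org/abs/1706.03762",
    "pdf_url": "https://arxiv.org/pdf/1706.03762",
    "score": 12.5,
    "keywords_matched": ["attention", "transformer"]
  }
]
""")

# `summary` is conditional, so fall back to the always-present `abstract`.
for paper in digest:
    if paper["score"] >= 10:
        blurb = paper.get("summary", paper["abstract"])
        print(f"{paper['score']:>5.1f}  {paper['title']}")
```

Because the always-present fields are guaranteed, filters like the score check above need no defensive key handling; only the conditional fields require `.get()`.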

## init

@@ -53,11 +72,12 @@

Behavior:

- writes a minimal `config.yaml` template
- refuses to overwrite unless `--force` is passed
- prints a clean error message (no traceback) if config already exists

## doctor

```bash
paperweight doctor [--config PATH] [--strict] [--profile NAME]
```

Checks:
@@ -71,3 +91,4 @@

Exit codes:

- `0`: healthy (or warnings present without `--strict`)
- `1`: hard failure, or warning in strict mode

36 changes: 36 additions & 0 deletions docs/CONFIGURATION.md
@@ -48,6 +48,37 @@ This keeps runtime lower than downloading full text for every candidate.

## Optional sections

### `metadata_cache`

```yaml
metadata_cache:
enabled: false
path: .paperweight_cache.json
ttl_hours: 4
```

When enabled, paperweight caches arXiv metadata locally and reuses it within the
TTL window, skipping repeated API calls. `--force-refresh` bypasses the cache.

### `profiles`

```yaml
profiles:
fast:
arxiv:
max_results: 20
triage:
max_selected: 10
deep:
arxiv:
max_results: 200
triage:
max_selected: 50
```

Activate with `--profile fast` or `PAPERWEIGHT_PROFILE=fast`. Each profile is
a partial config overlay that deep-merges on top of the base config.
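
The overlay semantics can be sketched with a generic deep-merge. This is a minimal model of the behavior described above, not paperweight's internal merge function (which may differ in details such as list handling):

```python
def deep_merge(base: dict, overlay: dict) -> dict:
    """Return base with overlay applied: nested dicts merge, scalars/lists replace."""
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"arxiv": {"categories": ["cs.CL"], "max_results": 100},
        "triage": {"max_selected": 25}}
fast_profile = {"arxiv": {"max_results": 20}, "triage": {"max_selected": 10}}

config = deep_merge(base, fast_profile)
# `categories` survives from the base; the profile overrides only the keys it sets.
```

This is why a profile only needs to list the keys it changes: everything else falls through to the base config.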

### `notifier` (only for `--delivery email`)

```yaml
@@ -113,10 +144,14 @@ export PAPERWEIGHT_MAX_RESULTS=100
## Analyzer keys

When `analyzer.type: summary`, an API key is required.
If a summary call fails at runtime, paperweight falls back to that paper's abstract.

When `triage.enabled: true`, an API key is strongly recommended. Without one,
paperweight falls back to a lightweight keyword/abstract heuristic.

If triage LLM calls fail or time out at runtime, paperweight falls back to
heuristic triage for the entire batch to keep behavior consistent within a run.

Provider keys:

- `OPENAI_API_KEY` for OpenAI
@@ -128,3 +163,4 @@
- `--delivery json` ignores `notifier`.
- `--delivery atom` uses optional `feed` metadata.
- `--delivery email` requires valid `notifier.email` settings.
- `--profile NAME` deep-merges the named profile on top of the base config before env overrides.
84 changes: 44 additions & 40 deletions docs/ROADMAP.md
@@ -5,64 +5,68 @@ time saved, setup simplicity, and digest quality.

## Product definition

paperweight is a **fast, scriptable arXiv interface** — the himalaya of academic papers.
It fetches structured arXiv data, scores by keywords, and outputs rich JSON.
AI enrichment (triage, summarization) is available via config but imposes zero cost
on the default path.

paperweight should be better than "just checking arXiv" when the user wants:

- a smaller daily reading queue
- deterministic output that can be automated and piped
- keyword-scored relevance filtering out of the box
- structured metadata (authors, categories, PDF URLs) for scripting

## Core success metrics

These metrics guide all releases:

1. **Time to first useful run**
- target: <= 2 minutes from install to first digest
2. **Daily digest size**
- target: median 5-20 items after user tuning
3. **Runtime**
- target: sub-second warm runs (metadata cached), <= 60s cold fetch for 3 categories x 50 papers
4. **CLI reliability**
- target: >= 99% successful runs in local smoke workflows
5. **Signal quality (human-evaluated)**
- target: >= 7/10 items marked "worth reading" in pilot usage

## v0.2 release gates (must pass)

1. CLI contract stable:
- `run`, `init`, `doctor`
- `run` delivery: `stdout`, `json`, `atom`, optional `email`
2. Zero-key baseline works:
- `init` defaults to `analyzer.type: abstract`
- `run` works without LLM keys via triage fallback
3. Setup validation:
- `doctor --strict` returns non-zero on warnings/failures
4. Output ergonomics:
- deterministic text digest
- scriptable JSON
- Atom feed export
5. Quality checks:
- lint clean
- tests green (including small CLI integration suite)
6. Packaging:
- release workflow present and tag-driven

## v0.3 focus (config resilience, richer metadata, performance, CLI polish)

1. **Config resilience**
- DEFAULT_CONFIG ensures partial/minimal configs never crash
- triage disabled by default (opt-in via config)
- log file optional (stderr-only by default)
- target: `paperweight run` works with only `arxiv.categories` set
2. **Richer metadata**
- capture authors, categories, PDF URL, arXiv ID from API
- track which keywords matched during scoring
- JSON output includes full structured data contract
- target: JSON schema always complete without AI
3. **Performance**
- lazy imports for heavy dependencies (psycopg, pollux, tiktoken, pypdf)
- parallel category fetching
- target: sub-second warm runs, ~3x cold-fetch speedup
4. **CLI polish & API surface**
- `--version` flag
- `init` prints clean error (not traceback) when config exists
- `__init__.py` exposes `__version__` and key public functions
- target: scriptable from `import paperweight` without submodule diving

## v0.4 focus (typed data, AI enrichment, feedback loop)

1. **Typed data structures**
- replace `Dict[str, Any]` pipeline with `Paper` dataclass/Pydantic model
- eliminate in-place mutation in `process_papers` (return new objects)
- target: zero `KeyError` risk from undocumented dict keys
2. **AI enrichment polish**
- improve triage rationale quality and compactness
   - target: rationale present on >= 95% of shortlisted items when triage enabled
3. **Feedback loop**
   - add local feedback capture (`relevant` / `irrelevant`)
   - incorporate feedback into ranking
   - target: +20% improvement in user-rated relevance from v0.2 baseline

## v1.0 criteria
