
Add file-based usage metrics for local runs (#110) #219

Merged
neoneye merged 13 commits into main from feature/110-usage-metrics
Mar 10, 2026

Conversation


neoneye commented Mar 9, 2026

Summary

  • Add file-based usage metrics (usage_metrics.jsonl) for local runs that don't have database access
  • Records per-LLM-call metrics via llama_index instrumentation: model (with provider prefix), tokens (input/output/thinking), duration, cost, and success/failure
  • Failures are recorded separately by LLMExecutor since instrumentation end events aren't emitted on failure
  • Pipeline sets the metrics path at start and clears it after completion
  • usage_metrics.jsonl is excluded from pipeline progress calculation
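The summary above can be sketched as a minimal version of the `usage_metrics` module. The field names and their order match the PR description; the internals (module-level path state, append-mode writes, warn-and-continue error handling) are assumptions, not the actual implementation.

```python
import json
import logging
from datetime import datetime
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)

# Module-level state, set by the pipeline at run start and cleared at teardown.
_metrics_path: Optional[Path] = None

def set_usage_metrics_path(path: Optional[Path]) -> None:
    """Set the JSONL output path; pass None to tear down after the run."""
    global _metrics_path
    _metrics_path = path

def record_usage_metric(success: bool, model: str, duration_seconds: float,
                        input_tokens: int, output_tokens: int,
                        cost_usd: float) -> None:
    """Append one metric row. Never raises: a failed metrics write
    must not block the pipeline, so problems are only warned about."""
    if _metrics_path is None:
        logger.warning("usage metrics path is unset; dropping metric")
        return
    row = {
        "timestamp": datetime.now().isoformat(),
        "success": success,  # placed before model for easier error skimming
        "model": model,
        "duration_seconds": duration_seconds,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": cost_usd,
    }
    try:
        with open(_metrics_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(row) + "\n")
    except OSError as exc:
        logger.warning("failed to write usage metric: %s", exc)
```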

Example output

{"timestamp": "2026-03-10T13:36:48.250446", "success": true, "model": "Google AI Studio:google/gemini-2.0-flash-001", "duration_seconds": 4.879, "input_tokens": 5316, "output_tokens": 643, "cost_usd": 0.0007888}
{"timestamp": "2026-03-10T13:36:53.554864", "success": true, "model": "Google:google/gemini-2.0-flash-001", "duration_seconds": 5.237, "input_tokens": 8877, "output_tokens": 562, "cost_usd": 0.0011125}

Code quality improvements

  • Move usage_metrics import to top-level so bad imports fail hard on startup
  • Remove redundant try/except — record_usage_metric handles errors internally
  • Warn (not debug-log) when metrics path is unset or write fails
  • Document set_usage_metrics_path(None) teardown in module docstring
  • Place success field before model in JSONL output for easier error skimming

Bug fixes

  • Resume for legacy plans: Frontend and MCP resume checks incorrectly rejected plans created before pipeline_version was stamped into parameters (comparing None != PIPELINE_VERSION). The worker-side check against the actual snapshot metadata file is the real safety gate.
  • Heartbeat crash: A corrupted psycopg2 connection during WorkerItem.upsert_heartbeat() was propagating up and killing Luigi tasks. Wrapped in try/except with a session rollback, since the heartbeat is just a liveness signal.
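A hedged sketch of the heartbeat guard: `upsert_heartbeat()` is the method named in the PR, but the wrapper function, its signature, and the session handling here are illustrative assumptions.

```python
import logging

logger = logging.getLogger(__name__)

def safe_heartbeat(worker_item, session) -> None:
    """Best-effort liveness signal: swallow database errors so a
    corrupted connection cannot kill the surrounding Luigi task."""
    try:
        worker_item.upsert_heartbeat()
    except Exception as exc:
        logger.warning("heartbeat failed, continuing: %s", exc)
        try:
            session.rollback()  # reset the connection for later queries
        except Exception:
            pass  # even a failed rollback must not crash the pipeline
```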

Files changed

  • worker_plan/worker_plan_internal/llm_util/usage_metrics.py — core module for file-based metric recording
  • worker_plan/worker_plan_internal/llm_util/track_activity.py — records successful calls via _record_file_usage_metric()
  • worker_plan/worker_plan_internal/llm_util/llm_executor.py — records failed calls only
  • worker_plan/worker_plan_internal/plan/run_plan_pipeline.py — sets/clears metrics path around pipeline execution
  • worker_plan/worker_plan_api/filenames.py — add USAGE_METRICS_JSONL constant
  • frontend_multi_user/src/app.py — fix resume version check for legacy plans
  • mcp_cloud/db_queries.py — fix resume version check for legacy plans
  • worker_plan_database/app.py — protect pipeline from heartbeat failures
  • docs/proposals/110-usage-metrics-local-runs.md — mark as implemented
  • docs/proposals/111-promising-directions.md — mark Mcp example plans #110 as complete

Test plan

  • Run a local pipeline and verify usage_metrics.jsonl is created with per-call token counts and cost
  • Verify the model field includes the provider prefix (e.g. Google AI Studio:google/gemini-2.0-flash-001)
  • Verify metrics recording does not block pipeline on write failure
  • Stop a plan, then resume it — verify usage_metrics.jsonl is created on resume
  • Resume a legacy plan (no pipeline_version in parameters) — verify it is not rejected
  • Verify a heartbeat database error does not crash the pipeline

🤖 Generated with Claude Code

Write per-LLM-call metrics (model, tokens, duration, success/failure)
to usage_metrics.jsonl in the run output directory. Works without a
database — designed for local/offline runs where the DB-backed
token_metrics_store is unavailable.

- Add USAGE_METRICS_JSONL to ExtraFilenameEnum
- New usage_metrics.py module with set/get path and record function
- Extend LLMExecutor._record_attempt_token_metrics() to also write
  file-based metrics alongside existing DB recording
- Wire usage metrics path in ExecutePipeline.run()
- Add usage_metrics.jsonl to progress ignore list

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
neoneye deleted the branch main March 10, 2026 00:48
neoneye closed this Mar 10, 2026
neoneye reopened this Mar 10, 2026
neoneye changed the base branch from feature/plan-resume-tool to main March 10, 2026 01:43
neoneye and others added 12 commits March 10, 2026 11:05
…utor

Fail hard on startup if imports are bad instead of silently swallowing
errors inside a try block at runtime.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
record_usage_metric and extract_token_count handle errors internally,
so the outer try block was unnecessary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Plans created before pipeline_version was stamped into parameters were
incorrectly rejected by the frontend and MCP resume checks. The
worker-side check against the actual snapshot metadata file is the real
safety gate, so allow None through at the API layer.
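The relaxed API-layer check described above might look like this. The `PIPELINE_VERSION` value and the shape of `parameters` are placeholders, not the project's actual definitions.

```python
PIPELINE_VERSION = "2026.03"  # placeholder value for illustration

def can_resume(parameters: dict) -> bool:
    """API-layer resume check. Legacy plans have no stamped version;
    let them through and defer to the worker-side snapshot check."""
    stamped = parameters.get("pipeline_version")
    if stamped is None:
        return True  # legacy plan created before stamping existed
    return stamped == PIPELINE_VERSION
```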

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Point to worker_plan_database/app.py and the actual snapshot metadata
file (001-3-planexe_metadata.json) so readers can find the real check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The detailed provider/model info (e.g. "Google AI Studio:google/gemini-2.0-flash-001")
was already extracted by token_counter but not recorded. Now included in
usage_metrics.jsonl when available. Also move success field before model
for easier error skimming.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Successful LLM calls are now recorded by TrackActivity which has access
to the real ChatResponse with full token counts, cost, and provider:model
info. LLMExecutor only records failures since instrumentation end events
are not emitted when the call fails. Skip events without token usage or
cost to avoid "unknown" rows. Remove redundant upstream_provider and
upstream_model fields since model already contains "provider:model".
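The "skip events without token usage or cost" rule could be a small predicate like this sketch; the parameter names are assumptions and not the llama_index instrumentation API.

```python
from typing import Optional

def should_record(input_tokens: Optional[int],
                  output_tokens: Optional[int],
                  cost_usd: Optional[float]) -> bool:
    """Drop events that would produce all-'unknown' rows: record only
    when there is at least some token usage or a nonzero cost."""
    has_tokens = bool(input_tokens) or bool(output_tokens)
    has_cost = cost_usd is not None and cost_usd > 0
    return has_tokens or has_cost
```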

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mark proposal 110 as implemented with PR #219 details: JSONL format,
instrumentation-based recording, resolved open questions. Mark 110 as
complete in the promising directions roadmap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
A corrupted psycopg2 connection during WorkerItem.upsert_heartbeat()
was propagating up and killing Luigi tasks. The heartbeat is just a
liveness signal — wrap it in try/except with a session rollback so
the pipeline can continue.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
neoneye merged commit 6970347 into main Mar 10, 2026
3 checks passed
neoneye deleted the feature/110-usage-metrics branch March 10, 2026 15:05