Add file-based usage metrics for local runs (#110) (PR #219)
Merged
Write per-LLM-call metrics (model, tokens, duration, success/failure) to usage_metrics.jsonl in the run output directory. Works without a database — designed for local/offline runs where the DB-backed token_metrics_store is unavailable.

- Add USAGE_METRICS_JSONL to ExtraFilenameEnum
- New usage_metrics.py module with set/get path and record function
- Extend LLMExecutor._record_attempt_token_metrics() to also write file-based metrics alongside existing DB recording
- Wire usage metrics path in ExecutePipeline.run()
- Add usage_metrics.jsonl to progress ignore list

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
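The set-path/record pair described in this commit can be sketched as follows. `set_usage_metrics_path` and `record_usage_metric` are named in the PR, but this body is an illustrative assumption, not the actual implementation; the key property is that recording is a no-op when no path is set and never raises into the pipeline.

```python
import json
import threading
from datetime import datetime, timezone
from pathlib import Path

# Module-level target path, set by the pipeline runner; None disables recording.
_usage_metrics_path = None
_lock = threading.Lock()


def set_usage_metrics_path(path):
    """Point recording at <run_dir>/usage_metrics.jsonl (pass None to tear down)."""
    global _usage_metrics_path
    _usage_metrics_path = Path(path) if path else None


def record_usage_metric(**fields):
    """Append one JSON object per LLM call; errors are swallowed so a
    failed metrics write can never kill the run."""
    if _usage_metrics_path is None:
        return
    try:
        row = {"timestamp": datetime.now(timezone.utc).isoformat(), **fields}
        with _lock, _usage_metrics_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(row) + "\n")
    except Exception:
        pass  # best-effort: metrics must not interrupt the pipeline
```

Appending one JSON object per line (JSON Lines) keeps writes atomic enough for a single process and lets partial runs still yield a readable file.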
…utor Fail hard on startup if imports are bad instead of silently swallowing errors inside a try block at runtime. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
record_usage_metric and extract_token_count handle errors internally, so the outer try block was unnecessary. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Plans created before pipeline_version was stamped into parameters were incorrectly rejected by the frontend and MCP resume checks. The worker-side check against the actual snapshot metadata file is the real safety gate, so allow None through at the API layer. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
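The relaxed API-layer check described in this commit can be sketched as below. The function name and parameter-dict shape are hypothetical; the real checks live in the frontend and MCP resume paths, with the worker-side snapshot-metadata comparison as the actual safety gate.

```python
def can_resume(plan_parameters: dict, current_version: str) -> bool:
    """API-layer resume check (illustrative sketch).

    Plans created before pipeline_version was stamped into parameters
    have no stored version at all; let them through here and rely on
    the worker-side check against the snapshot metadata file.
    """
    stored = plan_parameters.get("pipeline_version")
    if stored is None:
        return True  # legacy plan: defer to the worker-side safety gate
    return stored == current_version
```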
Point to worker_plan_database/app.py and the actual snapshot metadata file (001-3-planexe_metadata.json) so readers can find the real check. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The detailed provider/model info (e.g. "Google AI Studio:google/gemini-2.0-flash-001") was already extracted by token_counter but not recorded. Now included in usage_metrics.jsonl when available. Also move success field before model for easier error skimming. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
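With "provider:model" recorded in each row, the file is easy to aggregate offline. A minimal sketch, assuming the field names shown in the PR's example output; `summarize_usage` is a hypothetical helper, not part of the PR:

```python
import json
from collections import defaultdict


def summarize_usage(path):
    """Aggregate usage_metrics.jsonl rows into per-model totals."""
    totals = defaultdict(lambda: {"calls": 0, "input_tokens": 0,
                                  "output_tokens": 0, "cost_usd": 0.0})
    with open(path, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            t = totals[row.get("model", "unknown")]
            t["calls"] += 1
            t["input_tokens"] += row.get("input_tokens", 0)
            t["output_tokens"] += row.get("output_tokens", 0)
            t["cost_usd"] += row.get("cost_usd", 0.0)
    return dict(totals)
```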
Successful LLM calls are now recorded by TrackActivity which has access to the real ChatResponse with full token counts, cost, and provider:model info. LLMExecutor only records failures since instrumentation end events are not emitted when the call fails. Skip events without token usage or cost to avoid "unknown" rows. Remove redundant upstream_provider and upstream_model fields since model already contains "provider:model". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
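The "skip events without usage" rule and the success-before-model field order can be sketched as one small function. Event field names here are assumptions; the real code reads the ChatResponse seen by TrackActivity:

```python
def metric_row_from_event(event: dict):
    """Build a usage-metrics row from an instrumentation end-event, or
    return None when the event carries no token usage or cost, so the
    file does not fill with "unknown" rows (illustrative sketch)."""
    if not (event.get("input_tokens") or event.get("output_tokens")
            or event.get("cost_usd")):
        return None
    return {
        "success": True,                          # end-events only fire for completed calls
        "model": event.get("model", "unknown"),   # already "provider:model"
        "input_tokens": event.get("input_tokens", 0),
        "output_tokens": event.get("output_tokens", 0),
        "cost_usd": event.get("cost_usd", 0.0),
    }
```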
Mark proposal 110 as implemented with PR #219 details: JSONL format, instrumentation-based recording, resolved open questions. Mark 110 as complete in the promising directions roadmap. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
A corrupted psycopg2 connection during WorkerItem.upsert_heartbeat() was propagating up and killing Luigi tasks. The heartbeat is just a liveness signal — wrap it in try/except with a session rollback so the pipeline can continue. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
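The hardened heartbeat described above can be sketched as a wrapper; the names are illustrative, not the actual API in worker_plan_database/app.py:

```python
def safe_heartbeat(session, worker_item):
    """Run the heartbeat upsert without letting a corrupted DB
    connection propagate into the Luigi task: the heartbeat is only
    a liveness signal, so failures are swallowed after a rollback."""
    try:
        worker_item.upsert_heartbeat(session)
        session.commit()
    except Exception:
        try:
            session.rollback()  # reset the session for later queries
        except Exception:
            pass  # even rollback can fail on a dead connection
```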
## Summary
- Per-LLM-call metrics file (`usage_metrics.jsonl`) for local runs that don't have database access
- `usage_metrics.jsonl` is excluded from pipeline progress calculation

### Example output
```json
{"timestamp": "2026-03-10T13:36:48.250446", "success": true, "model": "Google AI Studio:google/gemini-2.0-flash-001", "duration_seconds": 4.879, "input_tokens": 5316, "output_tokens": 643, "cost_usd": 0.0007888}
{"timestamp": "2026-03-10T13:36:53.554864", "success": true, "model": "Google:google/gemini-2.0-flash-001", "duration_seconds": 5.237, "input_tokens": 8877, "output_tokens": 562, "cost_usd": 0.0011125}
```

## Code quality improvements
- Move the `usage_metrics` import to top level so bad imports fail hard on startup
- `record_usage_metric` handles errors internally, so callers need no outer try block
- Document `set_usage_metrics_path(None)` teardown in the module docstring
- Put the `success` field before `model` in JSONL output for easier error skimming

## Bug fixes
- Plans created before `pipeline_version` was stamped into parameters were rejected on resume (comparing `None != PIPELINE_VERSION`). The worker-side check against the actual snapshot metadata file is the real safety gate, so `None` is now allowed through at the API layer
- A corrupted connection during `WorkerItem.upsert_heartbeat()` was propagating up and killing Luigi tasks. Wrapped in try/except with session rollback since the heartbeat is just a liveness signal

## Files changed
- `worker_plan/worker_plan_internal/llm_util/usage_metrics.py` — core module for file-based metric recording
- `worker_plan/worker_plan_internal/llm_util/track_activity.py` — records successful calls via `_record_file_usage_metric()`
- `worker_plan/worker_plan_internal/llm_util/llm_executor.py` — records failed calls only
- `worker_plan/worker_plan_internal/plan/run_plan_pipeline.py` — sets/clears metrics path around pipeline execution
- `worker_plan/worker_plan_api/filenames.py` — add `USAGE_METRICS_JSONL` constant
- `frontend_multi_user/src/app.py` — fix resume version check for legacy plans
- `mcp_cloud/db_queries.py` — fix resume version check for legacy plans
- `worker_plan_database/app.py` — protect pipeline from heartbeat failures
- `docs/proposals/110-usage-metrics-local-runs.md` — mark as implemented
- `docs/proposals/111-promising-directions.md` — mark #110 as complete

## Test plan
- `usage_metrics.jsonl` is created with per-call token counts and cost
- The `model` field includes the provider prefix (e.g. `Google AI Studio:google/gemini-2.0-flash-001`)
- `usage_metrics.jsonl` is created on resume
- Resume a plan created without `pipeline_version` in parameters — verify it is not rejected

🤖 Generated with Claude Code