@ashleefr ashleefr commented Dec 22, 2025

Fix: Export job execution metrics to Prometheus

Problem

In PR #1825, job execution metrics were added using the github.com/armon/go-metrics package:

metrics.IncrCounterWithLabels([]string{"job", "executions_succeeded_total"}, 1, ...)
metrics.IncrCounterWithLabels([]string{"job", "executions_failed_total"}, 1, ...)

However, these metrics are not exported to Prometheus even though they are being created.

Root Cause

The issue is an architectural mismatch between two different metrics systems:

  1. hashicorp/go-metrics (aliased as github.com/armon/go-metrics) - Creates metrics in its own internal sink
  2. prometheus/client_golang - Exports metrics from the global Prometheus registry

The HTTP endpoint in dkron/api.go uses promhttp.Handler() from prometheus/client_golang:

r.GET("/metrics", gin.WrapH(promhttp.Handler()))

This handler exports metrics from the global Prometheus registry, but metrics created via go-metrics are not registered there. They go to a separate PrometheusSink from hashicorp/go-metrics/prometheus, which has its own HTTP handler that the API endpoint never uses.

Why PR #1825 Didn't Work

Looking at the commit history of PR #1825, there were several iterations:

  1. "Initial plan"
  2. "Add job execution metrics emission"
  3. "Implement native Prometheus metrics for job executions"
  4. "Switch to go-metrics package instead of direct Prometheus integration" ← This is where the problem was introduced

The switch to go-metrics was likely made to:

  • Use a unified metrics system (for both statsd and prometheus)
  • Maintain compatibility with existing code

However, the integration between the go-metrics Prometheus sink and promhttp.Handler() was not implemented correctly.

Solution

This PR adds direct Prometheus metrics using prometheus/client_golang, similar to how it's done in plugin/shell/prometheus.go:

  1. Created dkron/job_metrics.go with Prometheus metric definitions:

    • dkron_job_executions_succeeded_total{job_name="..."}
    • dkron_job_executions_failed_total{job_name="..."}
  2. Updated dkron/store.go to emit both:

    • go-metrics metrics (for statsd compatibility)
    • prometheus/client_golang metrics (for Prometheus export)

This ensures:

  • ✅ Metrics are registered in the global Prometheus registry
  • ✅ Metrics are exported via promhttp.Handler()
  • ✅ Backward compatibility with statsd (via go-metrics)
  • ✅ Follows the same pattern as shell executor metrics

Testing

The fix has been tested locally:

  • Metrics are created when jobs execute
  • Metrics appear in /metrics endpoint
  • Metrics are scraped by Prometheus
  • Metrics are visible in Grafana dashboards

Example output:

dkron_job_executions_succeeded_total{job_name="test-success-job"} 3
dkron_job_executions_failed_total{job_name="test-failure-job"} 3

Summary by CodeRabbit

  • New Features
    • Added Prometheus metrics for comprehensive job execution monitoring, tracking both successful and failed job executions with job name labels.

coderabbitai bot commented Dec 22, 2025

Walkthrough

Introduces Prometheus metrics for job execution outcomes by defining two new CounterVec metrics (succeeded and failed totals) and integrating them into the SetExecutionDone function to track execution results alongside existing hashicorp/go-metrics instrumentation.

Changes

Cohort / File(s) | Summary
New Prometheus Metrics (dkron/job_metrics.go) | Defines two Prometheus CounterVec metrics: JobExecutionsSucceededTotal and JobExecutionsFailedTotal, initialized via promauto with namespace "dkron", subsystem "job", and labeled by job_name.
Metrics Integration (dkron/store.go) | Updates SetExecutionDone to emit dual metrics: increments both hashicorp/go-metrics and Prometheus counters for successful and failed job executions.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Verify promauto initialization syntax and Prometheus metric naming conventions align with existing patterns
  • Confirm metric increment calls are placed in correct success/failure code paths within SetExecutionDone
  • Validate job_name label cardinality and prevent label explosion risks

Poem

🐰 A counter springs forth with joy and care,
Success and failure tracked with flair,
Prometheus watches jobs execute,
Metrics bloom from root to root! 🌱

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title clearly and concisely describes the main change: adding Prometheus export functionality for job execution metrics.
Description check | ✅ Passed | The description is comprehensive and covers proposed changes with context, but lacks the required template structure with explicit type-of-changes checkboxes.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3e5ce60 and cb12566.

📒 Files selected for processing (2)
  • dkron/job_metrics.go
  • dkron/store.go
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (AGENTS.md)

**/*.go: Use the Storage interface for all storage operations in Go code; wrap operations in context and check for buntdb.ErrNotFound when getting jobs/executions
Check a.IsLeader() before performing leader-only operations like job scheduling or applying Raft logs
Use the module path github.com/distribworks/dkron/v4 for imports when referencing this project

Files:

  • dkron/job_metrics.go
  • dkron/store.go
dkron/**/*.go

📄 CodeRabbit inference engine (AGENTS.md)

dkron/**/*.go: Use GRPCClient.ExecutionDone() to report execution results to the leader; use AgentRun() to trigger job execution on target nodes
Replace hash symbol (~) in job schedules with hash of job name for load distribution across cluster nodes
Store job scheduling logic and cron parsing in the dkron package using robfig/cron/v3 with extended syntax support via extcron/ package

Files:

  • dkron/job_metrics.go
  • dkron/store.go
🧠 Learnings (4)
📚 Learning: 2025-12-20T09:34:05.813Z
Learnt from: CR
Repo: distribworks/dkron PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-20T09:34:05.813Z
Learning: Applies to dkron/**/*.go : Store job scheduling logic and cron parsing in the dkron package using `robfig/cron/v3` with extended syntax support via `extcron/` package

Applied to files:

  • dkron/job_metrics.go
📚 Learning: 2025-12-20T09:34:05.813Z
Learnt from: CR
Repo: distribworks/dkron PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-20T09:34:05.813Z
Learning: Dkron is a distributed, fault-tolerant job scheduling system written in Go using Raft for consensus, Serf for cluster membership, BuntDB for storage, gRPC for inter-node communication, and Gin for HTTP/REST API

Applied to files:

  • dkron/job_metrics.go
📚 Learning: 2025-12-20T09:34:05.813Z
Learnt from: CR
Repo: distribworks/dkron PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-20T09:34:05.813Z
Learning: Applies to dkron/**/*.go : Use `GRPCClient.ExecutionDone()` to report execution results to the leader; use `AgentRun()` to trigger job execution on target nodes

Applied to files:

  • dkron/job_metrics.go
  • dkron/store.go
📚 Learning: 2025-12-20T09:34:05.813Z
Learnt from: CR
Repo: distribworks/dkron PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-20T09:34:05.813Z
Learning: Applies to dkron/**/*.go : Replace hash symbol (`~`) in job schedules with hash of job name for load distribution across cluster nodes

Applied to files:

  • dkron/job_metrics.go
🧬 Code graph analysis (2)
dkron/job_metrics.go (2)
dkron/metrics.go (1)
  • initMetrics (12-67)
dkron/job.go (1)
  • ID (58-145)
dkron/store.go (1)
dkron/job_metrics.go (2)
  • JobExecutionsSucceededTotal (10-18)
  • JobExecutionsFailedTotal (21-29)
🔇 Additional comments (3)
dkron/store.go (2)

295-297: LGTM! Dual-metric emission for successful executions.

The addition of Prometheus metrics alongside the existing go-metrics counters correctly addresses the issue described in the PR objectives. The implementation maintains backward compatibility while ensuring metrics appear in the Prometheus global registry.


303-305: LGTM! Dual-metric emission for failed executions.

The Prometheus counter for failed executions is correctly implemented, mirroring the pattern used for successful executions. This ensures consistent metric reporting across both execution outcomes.

dkron/job_metrics.go (1)

1-31: LGTM! Clean Prometheus metrics definition.

The metrics are well-structured with appropriate naming conventions and labels. Using promauto.NewCounterVec ensures these counters are automatically registered in the global Prometheus registry at package initialization, which directly addresses the issue described in the PR objectives.

Note: These metrics are registered unconditionally at package init time, even if Prometheus scraping is disabled via configuration. This is acceptable and aligns with standard prometheus/client_golang patterns—the overhead is minimal and ensures metrics are immediately available when Prometheus is enabled.


@ashleefr
Author

@vcastellm Please review the PR, as the metrics are still not functioning

@vcastellm vcastellm added this to the v4.1 milestone Dec 31, 2025