fix: Export job execution metrics to Prometheus #1885

ashleefr · 2025-12-22T07:50:01Z

Fix: Export job execution metrics to Prometheus

Problem

In PR #1825, job execution metrics were added using the github.com/armon/go-metrics package:

metrics.IncrCounterWithLabels([]string{"job", "executions_succeeded_total"}, 1, ...)
metrics.IncrCounterWithLabels([]string{"job", "executions_failed_total"}, 1, ...)

However, these metrics are not exported to Prometheus even though they are being created.

Root Cause

The issue is an architectural mismatch between two different metrics systems:

hashicorp/go-metrics (aliased as github.com/armon/go-metrics) - Creates metrics in its own internal sink
prometheus/client_golang - Exports metrics from the global Prometheus registry

The HTTP endpoint in dkron/api.go uses promhttp.Handler() from prometheus/client_golang:

r.GET("/metrics", gin.WrapH(promhttp.Handler()))

This handler exports metrics from the global Prometheus registry, but metrics created via go-metrics are never registered there. They go to a separate PrometheusSink from hashicorp/go-metrics/prometheus, which has its own HTTP handler that is never used in the API endpoint.
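
The mismatch can be illustrated with a small standard-library model. This is only a sketch: the `registry` type, its methods, and the metric name are hypothetical stand-ins, not the prometheus/client_golang or go-metrics APIs. It shows that a handler bound to one registry never sees counters written to another.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sort"
	"sync"
)

// registry is a toy stand-in for a metrics registry (hypothetical type,
// not a real library API).
type registry struct {
	mu       sync.Mutex
	counters map[string]float64
}

func newRegistry() *registry {
	return &registry{counters: make(map[string]float64)}
}

// inc adds delta to the named counter, creating it on first use.
func (r *registry) inc(name string, delta float64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.counters[name] += delta
}

// handler renders only the counters held by this registry, sorted by name.
func (r *registry) handler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		r.mu.Lock()
		defer r.mu.Unlock()
		names := make([]string, 0, len(r.counters))
		for n := range r.counters {
			names = append(names, n)
		}
		sort.Strings(names)
		for _, n := range names {
			fmt.Fprintf(w, "%s %g\n", n, r.counters[n])
		}
	})
}

// scrape performs an in-process GET against h and returns the response body.
func scrape(h http.Handler) string {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	return rec.Body.String()
}

func main() {
	global := newRegistry() // plays the role of the global Prometheus registry
	sink := newRegistry()   // plays the role of the separate go-metrics sink

	// The job counter is written to the sink only (illustrative metric name).
	sink.inc("job_executions_succeeded_total", 1)

	// The endpoint serves the global registry, so the job counter is absent.
	fmt.Printf("global: %q\n", scrape(global.handler()))
	fmt.Printf("sink:   %q\n", scrape(sink.handler()))
}
```

In this model the fix amounts to writing the counter to the registry that the served handler actually reads.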

Why PR #1825 Didn't Work

Looking at the commit history of PR #1825, there were several iterations:

"Initial plan"
"Add job execution metrics emission"
"Implement native Prometheus metrics for job executions"
"Switch to go-metrics package instead of direct Prometheus integration" ← This is where the problem was introduced

The switch to go-metrics was likely made to:

Use a unified metrics system (for both statsd and prometheus)
Maintain compatibility with existing code

However, the integration between the go-metrics Prometheus sink and promhttp.Handler() was not properly implemented.

Solution

This PR adds direct Prometheus metrics using prometheus/client_golang, similar to how it's done in plugin/shell/prometheus.go:

Created dkron/job_metrics.go with Prometheus metric definitions:
  - dkron_job_executions_succeeded_total{job_name="..."}
  - dkron_job_executions_failed_total{job_name="..."}
Updated dkron/store.go to emit both:
  - go-metrics metrics (for statsd compatibility)
  - prometheus/client_golang metrics (for Prometheus export)

This ensures:

✅ Metrics are registered in the global Prometheus registry
✅ Metrics are exported via promhttp.Handler()
✅ Backward compatibility with statsd (via go-metrics)
✅ Follows the same pattern as shell executor metrics
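
The dual-emission idea can be sketched as follows. This is a minimal model, assuming a hypothetical `counterSink` interface and an in-memory `memorySink`; the actual code in dkron/store.go calls go-metrics and prometheus/client_golang directly, and those APIs differ from what is shown here.

```go
package main

import "fmt"

// counterSink is a hypothetical minimal interface used for illustration only.
type counterSink interface {
	incr(name, jobName string)
}

// memorySink counts increments per metric name and job_name label in memory.
type memorySink struct {
	counts map[string]int
}

func newMemorySink() *memorySink {
	return &memorySink{counts: make(map[string]int)}
}

// incr increments the series identified by the metric name and job_name label.
func (m *memorySink) incr(name, jobName string) {
	m.counts[fmt.Sprintf("%s{job_name=%q}", name, jobName)]++
}

// recordExecution emits the outcome to every sink, so neither the
// statsd-style path nor the Prometheus path is skipped.
func recordExecution(success bool, jobName string, sinks ...counterSink) {
	name := "dkron_job_executions_failed_total"
	if success {
		name = "dkron_job_executions_succeeded_total"
	}
	for _, s := range sinks {
		s.incr(name, jobName)
	}
}

func main() {
	statsdLike := newMemorySink() // stands in for the go-metrics path
	promLike := newMemorySink()   // stands in for the Prometheus path

	recordExecution(true, "test-success-job", statsdLike, promLike)
	recordExecution(false, "test-failure-job", statsdLike, promLike)

	fmt.Println(len(statsdLike.counts), len(promLike.counts))
}
```

Routing both paths through one helper keeps the success and failure branches from drifting apart, which is the same placement concern the review raises for SetExecutionDone.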

Testing

The fix has been tested locally:

Metrics are created when jobs execute
Metrics appear in /metrics endpoint
Metrics are scraped by Prometheus
Metrics are visible in Grafana dashboards

Example output:

dkron_job_executions_succeeded_total{job_name="test-success-job"} 3
dkron_job_executions_failed_total{job_name="test-failure-job"} 3

Summary by CodeRabbit

New Features
- Added Prometheus metrics for comprehensive job execution monitoring, tracking both successful and failed job executions with job name labels.


coderabbitai · 2025-12-22T07:50:11Z

Walkthrough

Introduces Prometheus metrics for job execution outcomes by defining two new CounterVec metrics (succeeded and failed totals) and integrating them into the SetExecutionDone function to track execution results alongside existing hashicorp/go-metrics instrumentation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| New Prometheus Metrics `dkron/job_metrics.go` | Defines two Prometheus CounterVec metrics: `JobExecutionsSucceededTotal` and `JobExecutionsFailedTotal`, initialized via promauto with namespace "dkron", subsystem "job", and labeled by job_name. |
| Metrics Integration `dkron/store.go` | Updates `SetExecutionDone` to emit dual metrics: increments both hashicorp/go-metrics and Prometheus counters for successful and failed job executions. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Verify promauto initialization syntax and Prometheus metric naming conventions align with existing patterns
Confirm metric increment calls are placed in correct success/failure code paths within SetExecutionDone
Validate job_name label cardinality and prevent label explosion risks
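
The label cardinality concern can be illustrated with a toy model. This is a sketch only: `counterVec` here is a hypothetical type, not the prometheus/client_golang `CounterVec`, and the job names are made up. The point is that every distinct job_name value creates a new time series.

```go
package main

import "fmt"

// counterVec is a toy model of a labeled counter: one series per label value.
type counterVec struct {
	series map[string]float64
}

func newCounterVec() *counterVec {
	return &counterVec{series: make(map[string]float64)}
}

// inc increments the series for jobName, creating it on first use.
func (c *counterVec) inc(jobName string) {
	c.series[jobName]++
}

// cardinality reports how many distinct series exist.
func (c *counterVec) cardinality() int {
	return len(c.series)
}

func main() {
	stable := newCounterVec()
	for i := 0; i < 1000; i++ {
		stable.inc("nightly-backup") // same job name: still a single series
	}

	generated := newCounterVec()
	for i := 0; i < 1000; i++ {
		// Hypothetical generated job names: each one adds a series.
		generated.inc(fmt.Sprintf("adhoc-%d", i))
	}

	fmt.Println(stable.cardinality(), generated.cardinality())
}
```

A fixed set of job names keeps the series count bounded, while programmatically generated names grow it with every new job.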

Poem

🐰 A counter springs forth with joy and care,
Success and failure tracked with flair,
Prometheus watches jobs execute,
Metrics bloom from root to root! 🌱

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding Prometheus export functionality for job execution metrics. |
| Description check | ✅ Passed | The description is comprehensive and covers proposed changes with context, but lacks the required template structure with explicit type-of-changes checkboxes. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3e5ce60 and cb12566.

📒 Files selected for processing (2)

dkron/job_metrics.go
dkron/store.go

🧰 Additional context used

📓 Path-based instructions (2)

**/*.go

📄 CodeRabbit inference engine (AGENTS.md)

**/*.go: Use the Storage interface for all storage operations in Go code; wrap operations in context and check for buntdb.ErrNotFound when getting jobs/executions
Check a.IsLeader() before performing leader-only operations like job scheduling or applying Raft logs
Use the module path github.com/distribworks/dkron/v4 for imports when referencing this project

Files:

dkron/job_metrics.go
dkron/store.go

dkron/**/*.go

📄 CodeRabbit inference engine (AGENTS.md)

dkron/**/*.go: Use GRPCClient.ExecutionDone() to report execution results to the leader; use AgentRun() to trigger job execution on target nodes
Replace hash symbol (~) in job schedules with hash of job name for load distribution across cluster nodes
Store job scheduling logic and cron parsing in the dkron package using robfig/cron/v3 with extended syntax support via extcron/ package

Files:

dkron/job_metrics.go
dkron/store.go

🧠 Learnings (4)

📚 Learning: 2025-12-20T09:34:05.813Z

Learnt from: CR
Repo: distribworks/dkron PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-20T09:34:05.813Z
Learning: Applies to dkron/**/*.go : Store job scheduling logic and cron parsing in the dkron package using `robfig/cron/v3` with extended syntax support via `extcron/` package

Applied to files:

dkron/job_metrics.go

📚 Learning: 2025-12-20T09:34:05.813Z

Learnt from: CR
Repo: distribworks/dkron PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-20T09:34:05.813Z
Learning: Dkron is a distributed, fault-tolerant job scheduling system written in Go using Raft for consensus, Serf for cluster membership, BuntDB for storage, gRPC for inter-node communication, and Gin for HTTP/REST API

Applied to files:

dkron/job_metrics.go

📚 Learning: 2025-12-20T09:34:05.813Z

Learnt from: CR
Repo: distribworks/dkron PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-20T09:34:05.813Z
Learning: Applies to dkron/**/*.go : Use `GRPCClient.ExecutionDone()` to report execution results to the leader; use `AgentRun()` to trigger job execution on target nodes

Applied to files:

dkron/job_metrics.go
dkron/store.go

📚 Learning: 2025-12-20T09:34:05.813Z

Learnt from: CR
Repo: distribworks/dkron PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-20T09:34:05.813Z
Learning: Applies to dkron/**/*.go : Replace hash symbol (`~`) in job schedules with hash of job name for load distribution across cluster nodes

Applied to files:

dkron/job_metrics.go

🧬 Code graph analysis (2)

dkron/job_metrics.go (2)

dkron/metrics.go (1)

initMetrics (12-67)

dkron/job.go (1)

ID (58-145)

dkron/store.go (1)

dkron/job_metrics.go (2)

JobExecutionsSucceededTotal (10-18)

JobExecutionsFailedTotal (21-29)

🔇 Additional comments (3)

dkron/store.go (2)

295-297: LGTM! Dual-metric emission for successful executions.

The addition of Prometheus metrics alongside the existing go-metrics counters correctly addresses the issue described in the PR objectives. The implementation maintains backward compatibility while ensuring metrics appear in the Prometheus global registry.

303-305: LGTM! Dual-metric emission for failed executions.

The Prometheus counter for failed executions is correctly implemented, mirroring the pattern used for successful executions. This ensures consistent metric reporting across both execution outcomes.

dkron/job_metrics.go (1)

1-31: LGTM! Clean Prometheus metrics definition.

The metrics are well-structured with appropriate naming conventions and labels. Using promauto.NewCounterVec ensures these counters are automatically registered in the global Prometheus registry at package initialization, which directly addresses the issue described in the PR objectives.

Note: These metrics are registered unconditionally at package init time, even if Prometheus scraping is disabled via configuration. This is acceptable and aligns with standard prometheus/client_golang patterns—the overhead is minimal and ensures metrics are immediately available when Prometheus is enabled.


ashleefr · 2025-12-24T08:22:17Z

@vcastellm Please review the PR, as the metrics are still not functioning

fix: Export job execution metrics to Prometheus

cb12566

vcastellm added this to the v4.1 milestone Dec 31, 2025
