
Conversation

@guillemgt

What does this PR do?

Adds the ability to query Prometheus metrics and log them to experiment tracking backends (WandB, TensorBoard, MLflow, etc.) during training. This allows users to correlate infrastructure metrics (GPU cache usage, throughput) with training metrics in a unified view.
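To illustrate the idea (this is a hypothetical sketch, not the PR's actual code): Prometheus instant queries return one result per label combination, with values encoded as `[timestamp, "value"]` string pairs, so the response has to be flattened into scalar key/value pairs before it can be sent to a tracking backend. The `rollout/` prefix and the colon-to-underscore renaming below are assumptions based on the metric name shown in the Test section.

```python
# Hypothetical sketch: flatten a Prometheus /api/v1/query JSON response
# into {metric_name: value} pairs suitable for a tracking backend
# (TensorBoard, WandB, MLflow, ...).

def flatten_prometheus_result(metric_name: str, response: dict) -> dict:
    """Turn an instant-query response into loggable scalars."""
    if response.get("status") != "success":
        return {}
    out = {}
    for result in response.get("data", {}).get("result", []):
        # Each result carries a [timestamp, "value"] pair; sum across
        # label sets (e.g. per-replica counters) into a single scalar.
        _, value = result["value"]
        key = f"rollout/{metric_name.replace(':', '_')}"
        out[key] = out.get(key, 0.0) + float(value)
    return out
```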

Checklist Before Starting

  • Search for similar PRs: prometheus metrics
  • Format the PR title as [{modules}] {type}: {description}

Test

Testing was performed locally during development. The feature has been validated to work correctly with:

  • Prometheus HTTP API queries
  • Ray head node auto-discovery
  • Cache behavior
  • Error handling (connection errors, timeouts, malformed responses)
  • Graceful failure modes

Because mocking the Ray and Prometheus infrastructure is complex, comprehensive unit tests are not included in this PR. The feature can be validated end-to-end by configuring metrics_to_log and verifying that the metrics appear in the experiment tracking backend.
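The graceful-failure behavior listed above can be sketched as follows (hypothetical code, not the PR's implementation): the HTTP call is injected as a callable, any exception is logged and retried, and the final fallback is `None`, so a metrics failure never crashes training.

```python
import logging

logger = logging.getLogger(__name__)

def safe_query(fetch, query: str, retries: int = 2):
    """Hypothetical sketch of a graceful query path.

    `fetch` performs the actual HTTP call and may raise (connection
    error, timeout, malformed JSON, ...); each failure is logged with
    context and retried, and None is returned if every attempt fails.
    """
    for attempt in range(retries + 1):
        try:
            return fetch(query)
        except Exception as exc:
            logger.warning("Prometheus query %r failed (attempt %d/%d): %s",
                           query, attempt + 1, retries + 1, exc)
    return None
```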


Tested with the example config below. The metric was successfully logged as rollout/vllm_generation_tokens_total in TensorBoard over a 20-iteration run on 2 nodes.

[Screenshot: TensorBoard plot of rollout/vllm_generation_tokens_total, 2026-02-11]

API and Usage Example

```yaml
actor_rollout_ref:
  rollout:
    prometheus:
      enable: True
      port: 9090
      metrics_to_log:
        - "vllm:generation_tokens_total"
```
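The config keys above could be mirrored by a dataclass like the following (a hypothetical sketch; the actual `PrometheusConfig` in `verl/workers/config/rollout.py` may differ in field names and defaults):

```python
from dataclasses import dataclass, field

@dataclass
class PrometheusConfig:
    """Hypothetical mirror of the YAML block above."""
    enable: bool = False                 # whether to query Prometheus at all
    port: int = 9090                     # Prometheus HTTP API port
    metrics_to_log: list = field(default_factory=list)  # PromQL metric names
```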

Design & Code Changes

  1. verl/workers/config/rollout.py: Add metrics_to_log field to PrometheusConfig
  2. verl/experimental/agent_loop/prometheus_utils.py: Add PrometheusClient class (~180 lines)
  3. verl/trainer/ppo/ray_trainer.py: Initialize client and query metrics before logging
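The caching behavior mentioned in the Test section could look like the sketch below (hypothetical; the real ~180-line `PrometheusClient` also handles Ray head-node discovery, HTTP transport, and retries, all omitted here). A short TTL avoids re-querying Prometheus on every logging step without serving stale values for long.

```python
import time

class PrometheusClient:
    """Hypothetical sketch of the client's caching layer."""

    def __init__(self, fetch, cache_ttl: float = 5.0):
        self._fetch = fetch          # callable performing the HTTP query
        self._cache = {}             # query -> (timestamp, result)
        self._cache_ttl = cache_ttl  # seconds a cached result stays valid

    def query(self, promql: str):
        now = time.monotonic()
        hit = self._cache.get(promql)
        if hit is not None and now - hit[0] < self._cache_ttl:
            return hit[1]            # serve a recent result without re-querying
        result = self._fetch(promql)
        self._cache[promql] = (now, result)
        return result
```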

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

  • Read the Contribute Guide
  • Apply pre-commit checks
  • Add documentation
  • Add unit tests: see the Test section for why unit tests are not included
  • Notify ci-request Slack channel (will do after PR creation)
  • Recipe submodule not affected

Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a PrometheusClient to query and log metrics from Prometheus to experiment tracking backends. The implementation is well-structured, including features like Ray head node discovery, caching, and retry logic. However, I've identified two critical issues in the error handling within prometheus_utils.py. One issue in the Ray head node discovery defaults to localhost on any failure, which is problematic in a distributed setting. The other issue is in the metric querying loop, which silently swallows all exceptions. Addressing these will significantly improve the robustness and debuggability of this new feature.
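The two fixes the review asks for could be sketched like this (hypothetical code; function names and the injected callables are illustrative, not the PR's API). The first surfaces discovery failures instead of silently falling back to localhost; the second logs each metric's failure with context instead of swallowing every exception.

```python
import logging

logger = logging.getLogger(__name__)

def discover_head_ip(get_ray_address):
    """Raise on discovery failure rather than defaulting to localhost,
    which would point the client at the wrong node in a multi-node job."""
    try:
        address = get_ray_address()      # e.g. "10.0.0.5:6379"
    except Exception as exc:
        raise RuntimeError("Ray head node discovery failed") from exc
    if not address:
        raise RuntimeError("Ray returned an empty head address")
    return address.split(":")[0]

def query_all(client_query, metrics):
    """Query each configured metric, logging (not swallowing) failures
    so a single bad metric does not block the rest."""
    results = {}
    for name in metrics:
        try:
            results[name] = client_query(name)
        except Exception:
            logger.exception("Failed to query Prometheus metric %r", name)
    return results
```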

@guillemgt force-pushed the guillem.tarrach/upstream-prometheus-metrics branch from b1e6d83 to 3a60c25 on February 11, 2026 16:15