Feature(health): Add Switch NVUE Rest API health #412

Open
mkoci wants to merge 9 commits into NVIDIA:main from mkoci:feature_health_nvue_rest

mkoci commented Feb 27, 2026

Description

This PR adds NVUE telemetry collection for NVLink Switches to the health service in a new collector:

  • NvueRest (HTTP polling)

It is disabled by default and configurable (polling interval, request timeouts, and enablement of telemetry per path).
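
As a rough illustration of "disabled by default and configurable", the settings could be modeled along these lines. This is a minimal sketch, not the PR's actual schema: the struct name `NvueRestConfig`, its field names, the default durations, and the example paths are all assumptions made for illustration.

```rust
use std::collections::BTreeMap;
use std::time::Duration;

/// Hypothetical configuration for the NVUE REST collector.
/// Field names and default values are assumptions, not taken from the PR.
#[derive(Debug, Clone, PartialEq)]
pub struct NvueRestConfig {
    /// Whether the collector runs at all.
    pub enabled: bool,
    /// How often each switch is polled.
    pub poll_interval: Duration,
    /// Upper bound for a single HTTP request.
    pub request_timeout: Duration,
    /// Per-path telemetry toggles, keyed by REST path.
    pub paths: BTreeMap<String, bool>,
}

impl Default for NvueRestConfig {
    fn default() -> Self {
        Self {
            enabled: false, // matches "disabled by default" in the description
            poll_interval: Duration::from_secs(60), // assumed value
            request_timeout: Duration::from_secs(10), // assumed value
            paths: BTreeMap::new(),
        }
    }
}

impl NvueRestConfig {
    /// Returns the paths that should be polled.
    /// A disabled collector polls nothing, regardless of per-path toggles.
    pub fn active_paths(&self) -> Vec<&str> {
        if !self.enabled {
            return Vec::new();
        }
        self.paths
            .iter()
            .filter(|(_, on)| **on)
            .map(|(path, _)| path.as_str())
            .collect()
    }
}
```

Keeping the global switch separate from the per-path toggles means an operator can pre-configure paths without starting any polling.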

Path enablement:

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

Manual testing on a rack is still needed and will follow soon.

copy-pr-bot bot commented Feb 27, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

mkoci marked this pull request as ready for review February 27, 2026 22:43
mkoci requested a review from a team as a code owner February 27, 2026 22:43
Copilot AI review requested due to automatic review settings February 27, 2026 22:43
Copilot AI left a comment

Pull request overview

Adds an NVUE REST (HTTP(S) polling) telemetry collector for NVLink switches to the health service, integrating it into discovery/spawn, configuration, endpoint discovery, and metrics emission.

Changes:

  • Introduces NvueRestCollector + RestClient to poll NVUE endpoints and export Prometheus metrics / sink events.
  • Extends configuration to enable/disable NVUE collection, tune polling intervals/timeouts, and selectively enable NVUE REST paths.
  • Updates discovery wiring and Carbide API endpoint fetching logic to include switch endpoints when NVUE collection is enabled.
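
The endpoint-fetching rule in the last bullet reduces to a simple predicate: switch endpoints are needed when either switch-facing collector is on. The sketch below assumes a hypothetical `CollectorFlags` struct; the real code passes the NVUE enablement flag into `ApiClientWrapper::new`, whose exact signature is not shown here.

```rust
/// Hypothetical bundle of collector enablement flags.
/// The names are assumptions; the PR threads an NVUE flag into `ApiClientWrapper::new`.
#[derive(Debug, Clone, Copy, Default)]
pub struct CollectorFlags {
    pub nmxt_enabled: bool,
    pub nvue_rest_enabled: bool,
}

impl CollectorFlags {
    /// Switch endpoints are fetched when at least one switch collector is enabled.
    pub fn needs_switch_endpoints(self) -> bool {
        self.nmxt_enabled || self.nvue_rest_enabled
    }
}
```

With both flags off, the service can skip the switch lookup entirely, so deployments without NVLink switches would pay no extra API cost.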

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| crates/health/src/sink/health_override.rs | Updates ApiClientWrapper::new call signature to include NVUE enablement flag. |
| crates/health/src/lib.rs | Adds HealthError::HttpError and threads NVUE enablement into endpoint-source wiring. |
| crates/health/src/discovery/spawn.rs | Spawns NVUE REST collector for switch endpoints when enabled; adds registry/metrics wiring. |
| crates/health/src/discovery/context.rs | Adds CollectorKind::NvueRest and stores NVUE config in discovery loop context. |
| crates/health/src/discovery/cleanup.rs | Includes NVUE REST collectors in cleanup/stop bookkeeping and logging. |
| crates/health/src/config.rs | Adds NVUE collector config structures, defaults, and parsing tests; adds NMX-T request timeout. |
| crates/health/src/collectors/nvue/mod.rs | Adds NVUE collector module entry point. |
| crates/health/src/collectors/nvue/rest/mod.rs | Defines NVUE REST submodule layout. |
| crates/health/src/collectors/nvue/rest/client.rs | Implements NVUE REST HTTP client + response DTOs and parsing tests. |
| crates/health/src/collectors/nvue/rest/collector.rs | Implements periodic NVUE REST polling collector and Prometheus gauge emission. |
| crates/health/src/collectors/nmxt.rs | Adds configurable request timeout plumbed into NMX-T scraping. |
| crates/health/src/collectors/mod.rs | Registers the new NVUE collector module and re-exports the collector types. |
| crates/health/src/api_client.rs | Fetches switch endpoints when either NMX-T or NVUE collector is enabled. |
| crates/health/example/config.example.toml | Documents example NVUE REST collector configuration and new NMX-T timeout. |
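
To make the collector's role concrete, one polling pass can be pictured as: visit each enabled path, record a gauge value on success, and note the failure otherwise so that one bad path does not abort the pass. This is a simplified sketch under stated assumptions, not the code in collector.rs: `PollOutcome` and `poll_once` are hypothetical names, and the `fetch` closure stands in for the real HTTP client, which would also apply the configured request timeout.

```rust
use std::collections::HashMap;

/// Result of one polling pass (hypothetical type, for illustration only).
#[derive(Debug, Default, PartialEq)]
pub struct PollOutcome {
    /// Latest gauge value per successfully polled path.
    pub gauges: HashMap<String, f64>,
    /// Paths whose request or parsing failed during this pass.
    pub failed: Vec<String>,
}

/// Polls every path once. `fetch` is a stand-in for the HTTP client call;
/// it returns a numeric reading or an error message.
pub fn poll_once<F>(paths: &[&str], mut fetch: F) -> PollOutcome
where
    F: FnMut(&str) -> Result<f64, String>,
{
    let mut outcome = PollOutcome::default();
    for &path in paths {
        match fetch(path) {
            Ok(value) => {
                outcome.gauges.insert(path.to_string(), value);
            }
            // A failing path is recorded and the pass continues.
            Err(_) => outcome.failed.push(path.to_string()),
        }
    }
    outcome
}
```

Injecting the fetcher as a closure keeps the loop testable without a live switch, which may help while manual rack testing is still pending.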

mkoci force-pushed the feature_health_nvue_rest branch from be24fea to 6ca7758 February 28, 2026 06:07

yoks commented Mar 2, 2026

Before we merge this, I think we need to extend StaticEndpointConfig with the ability to add switches (serial metadata), and test it on a couple of real switches to see what the metrics look like, without API integration for now.
