Feature(health): Add Switch NVUE Rest API health #412
mkoci wants to merge 9 commits into NVIDIA:main
…ueRest. Update example and add doc comments.
Pull request overview
Adds an NVUE REST (HTTP(S) polling) telemetry collector for NVLink switches to the health service, integrating it into discovery/spawn, configuration, endpoint discovery, and metrics emission.
Changes:
- Introduces NvueRestCollector + RestClient to poll NVUE endpoints and export Prometheus metrics / sink events.
- Extends configuration to enable/disable NVUE collection, tune polling intervals/timeouts, and selectively enable NVUE REST paths.
- Updates discovery wiring and Carbide API endpoint fetching logic to include switch endpoints when NVUE collection is enabled.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| crates/health/src/sink/health_override.rs | Updates ApiClientWrapper::new call signature to include NVUE enablement flag. |
| crates/health/src/lib.rs | Adds HealthError::HttpError and threads NVUE enablement into endpoint-source wiring. |
| crates/health/src/discovery/spawn.rs | Spawns NVUE REST collector for switch endpoints when enabled; adds registry/metrics wiring. |
| crates/health/src/discovery/context.rs | Adds CollectorKind::NvueRest and stores NVUE config in discovery loop context. |
| crates/health/src/discovery/cleanup.rs | Includes NVUE REST collectors in cleanup/stop bookkeeping and logging. |
| crates/health/src/config.rs | Adds NVUE collector config structures, defaults, and parsing tests; adds NMX-T request timeout. |
| crates/health/src/collectors/nvue/mod.rs | Adds NVUE collector module entry point. |
| crates/health/src/collectors/nvue/rest/mod.rs | Defines NVUE REST submodule layout. |
| crates/health/src/collectors/nvue/rest/client.rs | Implements NVUE REST HTTP client + response DTOs and parsing tests. |
| crates/health/src/collectors/nvue/rest/collector.rs | Implements periodic NVUE REST polling collector and Prometheus gauge emission. |
| crates/health/src/collectors/nmxt.rs | Adds configurable request timeout plumbed into NMX-T scraping. |
| crates/health/src/collectors/mod.rs | Registers the new NVUE collector module and re-exports the collector types. |
| crates/health/src/api_client.rs | Fetches switch endpoints when either NMX-T or NVUE collector is enabled. |
| crates/health/example/config.example.toml | Documents example NVUE REST collector configuration and new NMX-T timeout. |
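The api_client.rs row above says switch endpoints are fetched when either the NMX-T or the NVUE collector is enabled. A minimal sketch of that gating rule follows; `CollectorFlags` and `should_fetch_switch_endpoints` are hypothetical names chosen for illustration and are not taken from the PR.

```rust
/// Hypothetical pair of enablement flags; the real code threads these
/// through ApiClientWrapper::new in some form not shown here.
#[derive(Debug, Clone, Copy, Default)]
pub struct CollectorFlags {
    pub nmxt_enabled: bool,
    pub nvue_enabled: bool,
}

/// Switch endpoints are only worth fetching if at least one collector
/// that targets switches is turned on.
pub fn should_fetch_switch_endpoints(flags: CollectorFlags) -> bool {
    flags.nmxt_enabled || flags.nvue_enabled
}
```

With both flags off (the `Default` value), the function returns false, so a deployment that uses neither collector would skip the switch lookup entirely.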
…ng how the DataSink is implemented
… remove stuttering from metric family names
Force-pushed from be24fea to 6ca7758.
Before we merge this, I think we need to extend StaticEndpointConfig with the ability to add switches (serial metadata), and test it on a couple of real switches to see what the metrics look like. No API integration for now.
…r testing NVUE REST scraping in isolation
Description
This PR adds NVUE telemetry collection for NVLink Switches to the health service in a new collector:
It is disabled by default and configurable (polling interval, request timeouts, and enablement of telemetry per path).
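As a rough illustration of the configuration surface described above (disabled by default, with a polling interval, a request timeout, and per-path enablement), one possible shape is sketched below. The struct name, field names, default durations, and path labels are all assumptions made for this example; the actual definitions live in crates/health/src/config.rs and may differ.

```rust
use std::collections::BTreeMap;
use std::time::Duration;

/// Hypothetical config shape; every field name here is illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct NvueRestSketchConfig {
    pub enabled: bool,
    pub poll_interval: Duration,
    pub request_timeout: Duration,
    /// Per-path enablement, keyed by a placeholder path label.
    pub paths: BTreeMap<String, bool>,
}

impl Default for NvueRestSketchConfig {
    fn default() -> Self {
        Self {
            // The PR states the collector is disabled by default.
            enabled: false,
            // Placeholder durations; the PR's real defaults are not shown.
            poll_interval: Duration::from_secs(60),
            request_timeout: Duration::from_secs(10),
            paths: BTreeMap::new(),
        }
    }
}

impl NvueRestSketchConfig {
    /// Paths that would be polled: none while the collector is disabled,
    /// otherwise only the paths flagged `true`.
    pub fn active_paths(&self) -> Vec<&str> {
        if !self.enabled {
            return Vec::new();
        }
        self.paths
            .iter()
            .filter(|(_, on)| **on)
            .map(|(path, _)| path.as_str())
            .collect()
    }
}
```

Keeping the top-level switch separate from the per-path flags means operators can stage a path list in the config file without the collector issuing any requests until `enabled` is flipped.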
Path enablement:
Type of Change
Testing
Additional Notes
Manual testing on a rack is still needed and is planned soon.