
Conversation


@ilana-n ilana-n commented Sep 30, 2025

GPU Telemetry Milestone 2 - Service Integration

Summary

This PR builds upon the GPU telemetry work from the previous milestone and delivers integration with AIPerf's core service architecture.

Building on Previous Work: This milestone extends the initial telemetry data-collection foundation by fully integrating the TelemetryManager as an AIPerf service, implementing the complete telemetry metrics processing pipeline, and exporting statistical summaries to the console (when --verbose is on) and to the CSV and JSON file exports (always).

Components of the project NOT yet included, which will be added in future PRs:

  • CI end-to-end integration tests
  • User facing documentation
  • Displaying telemetry metrics in real time in a tab, with a CLI option to enable it
  • Dynamically adding metrics depending on user customization

Key Features Added in This Milestone

Service Architecture Integration

  • SystemController Integration: TelemetryManager is now an optional service in the SystemController's service lifecycle
  • Service Lifecycle Management: Full integration with AIPerf's start/stop/configure lifecycle, with proper initialization ordering and graceful shutdown
  • Message Bus Integration: Integration with AIPerf's messaging system for coordinated service communication

Metrics Processing Pipeline

  • BaseTelemetryMetric: Abstract base class enabling consistent metric extraction from TelemetryRecord objects
  • Statistical Analysis: Complete statistical summary with min/max/avg/std and percentiles (p1, p5, p25, p50, p75, p90, p95, p99)
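The statistical summary described above can be sketched in a few lines. This is an illustrative approximation only; the function name and the exact quantile method are assumptions, not AIPerf's actual implementation:

```python
import statistics

def summarize(samples: list[float]) -> dict[str, float]:
    """Hypothetical sketch: min/max/avg/std plus the percentiles listed above
    for one metric's time series (requires at least two samples)."""
    # quantiles(..., n=100) returns 99 cut points; qs[k-1] approximates p_k
    qs = statistics.quantiles(samples, n=100, method="inclusive")
    summary = {
        "min": min(samples),
        "max": max(samples),
        "avg": statistics.fmean(samples),
        "std": statistics.stdev(samples),
    }
    for p in (1, 5, 25, 50, 75, 90, 95, 99):
        summary[f"p{p}"] = qs[p - 1]
    return summary
```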

Architecture Improvements

  • Async-First Design: Migration to asyncio/aiohttp for non-blocking operations during intensive benchmarking
  • Hierarchical Storage: Efficient data structure (dcgm_url → gpu_uuid → time series) enabling both real-time access and final aggregation
  • Error Resilience: Exception handling with detailed error reporting through the standard AIPerf error system
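The hierarchical storage layout can be illustrated with a minimal sketch. The helper names and tuple layout are hypothetical; only the dcgm_url → gpu_uuid → time-series nesting comes from the description above:

```python
from collections import defaultdict

# dcgm_url -> gpu_uuid -> list of (timestamp, value) samples
def new_store():
    return defaultdict(lambda: defaultdict(list))

def record_sample(store, dcgm_url: str, gpu_uuid: str, ts: float, value: float) -> None:
    """Append one sample; real-time readers index directly, final
    aggregation iterates the nested dicts."""
    store[dcgm_url][gpu_uuid].append((ts, value))

def latest(store, dcgm_url: str, gpu_uuid: str):
    """Real-time access: most recent sample for one GPU, or None."""
    series = store[dcgm_url][gpu_uuid]
    return series[-1] if series else None
```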

Export & Results Integration

  • TelemetryResultsProcessor: Converts raw telemetry records into statistical MetricResult objects
  • Console Integration: Results flow through the standard AIPerf console display system
  • Export Compatibility: Full integration with existing CSV and JSON exporters - telemetry metrics appear alongside inference metrics in all exports
  • Hierarchical Result Tags: Structured tagging system enabling dashboard filtering and hierarchical organization (dcgm_url, gpu_index, gpu_uuid)
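As a rough illustration of how per-metric statistics and the hierarchical tags might sit together in a JSON export (all field names and values below are hypothetical, not the exporter's actual schema):

```python
import json

# Illustrative only: a plausible shape for one telemetry entry emitted
# alongside inference metrics. Keys and values are assumptions.
telemetry_block = {
    "telemetry_stats": {
        "gpu_power_usage": {
            "unit": "W",
            "avg": 241.3,
            "p50": 240.8,
            "p99": 289.1,
            "tags": {
                "dcgm_url": "http://localhost:9401/metrics",
                "gpu_index": "0",
                "gpu_uuid": "GPU-hypothetical-uuid",
            },
        }
    }
}

print(json.dumps(telemetry_block, indent=2))
```

The tags dictionary is what would let a dashboard filter first by DCGM endpoint, then by GPU.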

Running and Testing

Automatic Integration

GPU telemetry collection is now automatically enabled when DCGM is available:

# Standard AIPerf command - telemetry collection happens automatically and is displayed in the console when --verbose is on 
aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --streaming \
  --concurrency 10 \
  --request-count 100 \
  --verbose
# AIPerf command with custom DCGM endpoint url 
aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --streaming \
  --concurrency 10 \
  --request-count 100 \
  --server-metrics-url http://localhost:9400/metrics http://localhost:9401/metrics http://localhost:9402/metrics \
  --verbose 

Prerequisites

  • DCGM running and accessible at http://localhost:9401/metrics (default); see the linked tutorial for setup
  • GPU hardware with DCGM support

Example Output

(screenshot: console telemetry summary output)

Verification

Service Registration

You should see TelemetryManager registration in the logs:

INFO: Registered Telemetry Manager (id: 'telemetry_manager_xyz')

# Plus any ERROR/WARNING messages if issues occur
WARNING: GPU telemetry disabled - no DCGM endpoints reachable

Testing & Development

Unit Testing

cd /home/inguyen/code/aiperf
python -m pytest tests/gpu_telemetry/ -v
python -m pytest tests/common/models/test_telemetry_models.py -v
python -m pytest tests/post_processors/test_telemetry_results_processor.py -v

Summary by CodeRabbit

  • New Features

    • Integrated GPU telemetry collection with endpoint discovery and status.
    • New console telemetry view (verbose) showing per-endpoint/GPU metrics.
    • CSV and JSON exports now include telemetry summaries when available.
    • Added config/CLI option to specify server metrics endpoints.
  • Improvements

    • Exposed additional metric units; energy reported in megajoules.
    • Coordinated shutdown waits for telemetry when enabled.
  • Changes

    • Default telemetry sampling interval set to ~0.33s (~3 Hz).
    • “No data” console export message lowered from warning to debug.

@ilana-n ilana-n requested a review from Copilot September 30, 2025 19:35

coderabbitai bot commented Sep 30, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

Walkthrough

Introduces a GPU telemetry subsystem: new config group and user setting, telemetry enums/messages/models, telemetry manager and data collector, records aggregation and processing, controller coordination and shutdown logic, and telemetry export paths (console, CSV, JSON). Updates metrics base validation and energy unit, constants, and tests to reflect telemetry data structures and flows.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Config: Groups & UserConfig**<br>`aiperf/common/config/groups.py`, `aiperf/common/config/user_config.py` | Adds TELEMETRY group and a `UserConfig` field `server_metrics_url` (`list[str]`, CLI-parameterized, TELEMETRY group). |
| **Enums: Metrics, Data Exporters, Messages, Services**<br>`aiperf/common/enums/__init__.py`, `aiperf/common/enums/data_exporter_enums.py`, `aiperf/common/enums/message_enums.py`, `aiperf/common/enums/service_enums.py` | Exposes new metric unit enums; adds `ConsoleExporterType.TELEMETRY`; adds `MessageType` entries for telemetry; adds `ServiceType.TELEMETRY_MANAGER`. |
| **Messages: Telemetry**<br>`aiperf/common/messages/telemetry_messages.py`, `aiperf/common/messages/__init__.py` | Adds `TelemetryRecordsMessage`, `ProcessTelemetryResultMessage`, `TelemetryStatusMessage` and exports them. |
| **Models: Telemetry**<br>`aiperf/common/models/telemetry_models.py`, `aiperf/common/models/__init__.py` | Adds `TelemetryResults` and `ProcessTelemetryResult`; removes `TelemetryRecord` fields `gpu_power_limit` and `total_gpu_memory`; exports telemetry models. |
| **Models: Records (minor)**<br>`aiperf/common/models/record_models.py` | Adds commented placeholder for `telemetry_results`; no runtime impact. |
| **Controller: System lifecycle integration**<br>`aiperf/controller/system_controller.py` | Integrates telemetry manager, status/result handlers, shutdown coordination, and passes `telemetry_results` to exporters. |
| **Exporters: Wiring & Config**<br>`aiperf/exporters/exporter_config.py`, `aiperf/exporters/exporter_manager.py`, `aiperf/exporters/__init__.py` | Adds `TelemetryResults` to `ExporterConfig`/`ExporterManager` and exports `GPUTelemetryConsoleExporter`. |
| **Exporters: Implementations**<br>`aiperf/exporters/gpu_telemetry_console_exporter.py`, `aiperf/exporters/csv_exporter.py`, `aiperf/exporters/json_exporter.py`, `aiperf/exporters/console_metrics_exporter.py` | New GPU telemetry console exporter; CSV/JSON exporters emit telemetry sections/blocks; log-level change in `ConsoleMetricsExporter` when no records. |
| **GPU Telemetry Package**<br>`aiperf/gpu_telemetry/__init__.py`, `aiperf/gpu_telemetry/constants.py`, `aiperf/gpu_telemetry/telemetry_data_collector.py`, `aiperf/gpu_telemetry/telemetry_manager.py` | Public package init; updates defaults/mappings; async-aware collector callbacks and parsing changes; manager reworked to `BaseComponentService` with publish-based messaging and status reporting. |
| **Metrics Base & Types**<br>`aiperf/metrics/base_metric.py`, `aiperf/metrics/gpu_telemetry_types/energy_consumption_metric.py` | Allows `BaseTelemetryMetric` as a valid base; changes energy unit from millijoule to megajoule. |
| **Post-processors**<br>`aiperf/post_processors/telemetry_results_processor.py`, `aiperf/post_processors/__init__.py` | Adds `TelemetryResultsProcessor`; updates metric set and error handling. |
| **Records Manager**<br>`aiperf/records/records_manager.py` | Adds telemetry aggregation, error summary, and independent telemetry result publishing/export. |
| **Tests: GPU Telemetry**<br>`tests/gpu_telemetry/*` | Updates fixtures for removed fields and new `dcgm_url`; adjusts callbacks and defaults; adds model tests. |
| **Tests: Exporters**<br>`tests/data_exporters/*` | Adds telemetry fixtures/tests for CSV/JSON/console telemetry exporters; updates constructors with `telemetry_results`. |
| **Tests: Post-processors**<br>`tests/post_processors/test_telemetry_results_processor.py` | Adds tests for `TelemetryResultsProcessor` across success/error paths. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant User as User
  participant SystemController as SystemController
  participant ServiceMgr as ServiceManager
  participant TelemetryMgr as TelemetryManager
  participant RecordsMgr as RecordsManager
  participant Exporters as ExporterManager

  User->>SystemController: start()
  SystemController->>ServiceMgr: start()
  SystemController->>TelemetryMgr: start()
  TelemetryMgr->>SystemController: publish(TelemetryStatusMessage)
  alt telemetry enabled and endpoints reachable
    loop collection intervals
      TelemetryMgr->>RecordsMgr: publish(TelemetryRecordsMessage)
      RecordsMgr->>RecordsMgr: aggregate telemetry hierarchy
    end
  else telemetry disabled or none reachable
    Note over SystemController: Will not wait for telemetry results
  end

  par profiling completes
    RecordsMgr-->>SystemController: publish(ProcessRecordsResultMessage)
    SystemController->>SystemController: mark profile_results_received
  and telemetry processing completes
    RecordsMgr-->>SystemController: publish(ProcessTelemetryResultMessage)
    SystemController->>SystemController: store telemetry_results
  end

  SystemController->>SystemController: _check_and_trigger_shutdown()
  alt ready to shutdown
    SystemController->>Exporters: export(results, telemetry_results)
    Exporters->>Exporters: console/csv/json emit telemetry sections
    SystemController-->>User: done
  end
```
```mermaid
sequenceDiagram
  autonumber
  participant ExporterManager
  participant Csv as CsvExporter
  participant Json as JsonExporter
  participant Console as GPUTelemetryConsoleExporter

  ExporterManager->>Csv: init(ExporterConfig{telemetry_results})
  ExporterManager->>Json: init(ExporterConfig{telemetry_results})
  ExporterManager->>Console: init(ExporterConfig{telemetry_results})
  par CSV
    Csv->>Csv: _generate_csv_content(records, telemetry_results)
    Csv->>Csv: _write_telemetry_section()
  and JSON
    Json->>Json: build telemetry_export_data
  and Console
    Console->>Console: get_renderable(telemetry_results)
  end
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

Poem

I thump my paws in metric glee,
New wires hum with teleme-try.
Endpoints blink, GPUs sing,
Tables bloom with stately bling.
When runs are done, we export bright—
A carrot chart of watts and byte.
Hop-hop! The data’s right. 🥕📊

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 77.54%, below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title clearly identifies the primary change (integrating GPU telemetry support), which aligns with the PR's main objective of adding TelemetryManager and related telemetry features; "Milestone 2" is additional context that does not obscure the core feature. |


@github-actions github-actions bot added the feat label Sep 30, 2025
@ilana-n ilana-n (Author) commented Sep 30, 2025

@coderabbitai review


coderabbitai bot commented Sep 30, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@Copilot Copilot AI left a comment


Pull Request Overview

This PR implements GPU telemetry integration as Milestone 2 of the telemetry project, adding comprehensive service architecture integration, metrics processing pipeline, and export capabilities to AIPerf's core service system.

Key changes include:

  • Service architecture integration with TelemetryManager as an optional service in SystemController's lifecycle
  • Complete metrics processing pipeline with statistical analysis and export integration
  • Hierarchical telemetry data storage enabling real-time access and final aggregation

Reviewed Changes

Copilot reviewed 45 out of 45 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| `aiperf/gpu_telemetry/` | New telemetry module with TelemetryManager service and TelemetryDataCollector |
| `aiperf/post_processors/telemetry_results_processor.py` | Telemetry-specific results processor implementing TelemetryResultsProcessorProtocol |
| `aiperf/metrics/gpu_telemetry_types/` | GPU telemetry metric definitions extending BaseTelemetryMetric |
| `aiperf/common/models/telemetry_models.py` | Comprehensive telemetry data models with hierarchical storage |
| `aiperf/exporters/` | Enhanced exporters with telemetry data integration for CSV, JSON, and console output |
| `aiperf/records/records_manager.py` | Extended RecordsManager to handle telemetry records alongside inference metrics |


coderabbitai bot added a commit that referenced this pull request Sep 30, 2025
Docstrings generation was requested by @ilana-n.

* #316 (comment)

The following files were modified:

* `aiperf/common/config/user_config.py`
* `aiperf/common/enums/metric_enums.py`
* `aiperf/common/messages/telemetry_messages.py`
* `aiperf/common/models/telemetry_models.py`
* `aiperf/common/protocols.py`
* `aiperf/controller/system_controller.py`
* `aiperf/exporters/console_metrics_exporter.py`
* `aiperf/exporters/csv_exporter.py`
* `aiperf/exporters/exporter_manager.py`
* `aiperf/exporters/gpu_telemetry_console_exporter.py`
* `aiperf/exporters/json_exporter.py`
* `aiperf/gpu_telemetry/telemetry_data_collector.py`
* `aiperf/gpu_telemetry/telemetry_manager.py`
* `aiperf/metrics/base_metric.py`
* `aiperf/metrics/base_telemetry_metric.py`
* `aiperf/metrics/gpu_telemetry_types/energy_consumption_metric.py`
* `aiperf/metrics/gpu_telemetry_types/gpu_memory_used_metric.py`
* `aiperf/metrics/gpu_telemetry_types/gpu_power_usage_metric.py`
* `aiperf/metrics/gpu_telemetry_types/gpu_temperature_metric.py`
* `aiperf/metrics/gpu_telemetry_types/gpu_utilization_metric.py`
* `aiperf/metrics/gpu_telemetry_types/memory_clock_frequency_metric.py`
* `aiperf/metrics/gpu_telemetry_types/memory_temperature_metric.py`
* `aiperf/metrics/gpu_telemetry_types/sm_clock_frequency_metric.py`
* `aiperf/post_processors/telemetry_results_processor.py`
* `aiperf/records/records_manager.py`
* `tests/gpu_telemetry/conftest.py`
* `tests/gpu_telemetry/test_telemetry_data_collector.py`


@ilana-n ilana-n requested a review from Copilot September 30, 2025 22:12

codecov bot commented Sep 30, 2025

Thread safety:
Uses self._shutdown_lock to prevent race conditions when ProcessRecordsResultMessage
and ProcessTelemetryResultMessage arrive concurrently. The lock ensures atomic
check-and-set of _shutdown_triggered, preventing double-triggering of stop().
@ilana-n (Author):

Scenario 1: Both messages arrive simultaneously (telemetry enabled & ready)

  • Without lock: Both tasks pass the _shutdown_triggered check, both call stop() → double-trigger
  • With lock: First task acquires lock, sets _shutdown_triggered = True, calls stop(). Second task waits, then hits early-return guard → single trigger

Scenario 2: Profile results arrive first, telemetry still pending

  • Task acquires lock, evaluates telemetry_ready_for_shutdown = (not True or None) = False
  • Logs "Waiting for telemetry results..." and returns without setting _shutdown_triggered
  • When telemetry arrives later, second call acquires lock and correctly triggers shutdown

Scenario 3: Telemetry disabled

  • Single call evaluates telemetry_ready_for_shutdown = (not False or ...) = True
  • Triggers shutdown immediately (lock provides no overhead in sequential case)
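The three scenarios above condense into a small asyncio sketch. Class and attribute names mirror the description, but this is an approximation under stated assumptions, not the SystemController's actual code:

```python
import asyncio

class ShutdownCoordinator:
    """Approximates the locked check-and-trigger logic described above."""

    def __init__(self, telemetry_enabled: bool):
        self._lock = asyncio.Lock()
        self._shutdown_triggered = False
        self.telemetry_enabled = telemetry_enabled
        self.profile_results_received = False
        self.telemetry_results = None
        self.stop_calls = 0  # stand-in for the real stop() side effects

    async def stop(self) -> None:
        self.stop_calls += 1

    async def check_and_trigger_shutdown(self) -> None:
        async with self._lock:
            if self._shutdown_triggered:
                return  # early-return guard: prevents double-triggering stop()
            if not self.profile_results_received:
                return
            telemetry_ready = (not self.telemetry_enabled
                               or self.telemetry_results is not None)
            if not telemetry_ready:
                return  # "Waiting for telemetry results..." (scenario 2)
            self._shutdown_triggered = True  # atomic check-and-set under the lock
            await self.stop()
```

Because the check-and-set happens under the lock, two concurrent calls (scenario 1) serialize, and only the first reaches `stop()`.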

@ilana-n ilana-n force-pushed the gpu-telemetry-cli branch from 2e4f01c to 6bf7c00 Compare October 1, 2025 01:15
- Add comprehensive GPU telemetry data collection and monitoring system
- Implement TelemetryManager and TelemetryDataCollector for real-time GPU metrics
- Create GPU telemetry console exporter for CLI display of telemetry data
- Add telemetry results post-processor for data analysis and aggregation
- Implement comprehensive unit test coverage for all telemetry components
- Add integration tests for end-to-end telemetry functionality
- Update system controller to integrate GPU telemetry with existing infrastructure
- Extend existing exporters to support telemetry data formats
- Add comprehensive error handling and async callback support
- Include code cleanup, linting fixes, and documentation improvements
Contributor

@ajcasagrande ajcasagrande left a comment


Still have more to go through. Submitting my comments so far. Looking really great and I am excited to test it out!

Contributor

@ajcasagrande ajcasagrande left a comment


Have added some more comments and feedback. I appreciate the well documented code. thanks!

@ilana-n ilana-n requested a review from Copilot October 3, 2025 21:37

@Copilot Copilot AI left a comment


Pull Request Overview

Copilot reviewed 47 out of 47 changed files in this pull request and generated 3 comments.



```python
)

await self._check_if_all_records_received()
```

Copilot AI Oct 3, 2025


The decorator changed from @on_pull_message to @on_message. This change in message handling pattern should be documented or explained, as it affects how telemetry records are processed in the system architecture.

Suggested change:

```python
# Note: We use @on_message (push-based delivery) for TELEMETRY_RECORDS instead of @on_pull_message.
# This is intentional: telemetry records are pushed to the RecordsManager as they are generated,
# ensuring timely processing and reducing latency. In contrast, metric records use a pull-based
# approach (@on_pull_message) to allow for controlled, demand-driven retrieval. This distinction
# is important for system performance and reliability.
```

