
Conversation


@ilana-n ilana-n commented Sep 30, 2025

GPU Telemetry Milestone 2 - Service Integration

Summary

This PR builds upon the GPU telemetry work from the previous milestone and delivers integration with AIPerf's core service architecture.

Building on Previous Work: This milestone extends the initial telemetry data-collection foundation by fully integrating the TelemetryManager as an AIPerf service, implementing the complete telemetry metrics processing pipeline, and exporting statistical summaries to the console (when --verbose is on) and to the CSV and JSON file exports (always).

Components of the project NOT yet included, which will be added in future PRs:

  • CI end-to-end integration tests
  • User facing documentation
  • Displaying telemetry metrics in real time in a tab, with a CLI option to enable it
  • Dynamically adding metrics depending on user customization

Key Features Added in This Milestone

Service Architecture Integration

  • SystemController Integration: TelemetryManager is now an optional service in the SystemController's service lifecycle
  • Service Lifecycle Management: Full integration with AIPerf's start/stop/configure lifecycle, with proper initialization ordering and graceful shutdown
  • Message Bus Integration: Integration with AIPerf's messaging system for coordinated service communication

Metrics Processing Pipeline

  • BaseTelemetryMetric: Abstract base class enabling consistent metric extraction from TelemetryRecord objects
  • Statistical Analysis: Complete statistical summary with min/max/avg/std and percentiles (p1, p5, p25, p50, p75, p90, p95, p99)
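The statistical summary described above can be sketched in a few lines. This is an illustrative approximation only; the function name and the exact quantile method are assumptions, not AIPerf's actual implementation:

```python
import statistics

def summarize(samples: list[float]) -> dict[str, float]:
    """Hypothetical sketch: min/max/avg/std plus the percentiles listed above
    for one metric's time series (requires at least two samples)."""
    # quantiles(..., n=100) returns 99 cut points; qs[k-1] approximates p_k
    qs = statistics.quantiles(samples, n=100, method="inclusive")
    summary = {
        "min": min(samples),
        "max": max(samples),
        "avg": statistics.fmean(samples),
        "std": statistics.stdev(samples),
    }
    for p in (1, 5, 25, 50, 75, 90, 95, 99):
        summary[f"p{p}"] = qs[p - 1]
    return summary
```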

Architecture Improvements

  • Async-First Design: Migration to asyncio/aiohttp for non-blocking operations during intensive benchmarking
  • Hierarchical Storage: Efficient data structure (dcgm_url → gpu_uuid → time series) enabling both real-time access and final aggregation
  • Error Resilience: Exception handling with detailed error reporting through the standard AIPerf error system
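The hierarchical storage layout can be illustrated with a minimal sketch. The helper names and tuple layout are hypothetical; only the dcgm_url → gpu_uuid → time-series nesting comes from the description above:

```python
from collections import defaultdict

# dcgm_url -> gpu_uuid -> list of (timestamp, value) samples
def new_store():
    return defaultdict(lambda: defaultdict(list))

def record_sample(store, dcgm_url: str, gpu_uuid: str, ts: float, value: float) -> None:
    """Append one sample; real-time readers index directly, final
    aggregation iterates the nested dicts."""
    store[dcgm_url][gpu_uuid].append((ts, value))

def latest(store, dcgm_url: str, gpu_uuid: str):
    """Real-time access: most recent sample for one GPU, or None."""
    series = store[dcgm_url][gpu_uuid]
    return series[-1] if series else None
```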

Export & Results Integration

  • TelemetryResultsProcessor: Converts raw telemetry records into statistical MetricResult objects
  • Console Integration: Results flow through the standard AIPerf console display system
  • Export Compatibility: Full integration with existing CSV and JSON exporters - telemetry metrics appear alongside inference metrics in all exports
  • Hierarchical Result Tags: Structured tagging system enabling dashboard filtering and hierarchical organization (dcgm_url, gpu_index, gpu_uuid)
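As a rough illustration of how per-metric statistics and the hierarchical tags might sit together in a JSON export (all field names and values below are hypothetical, not the exporter's actual schema):

```python
import json

# Illustrative only: a plausible shape for one telemetry entry emitted
# alongside inference metrics. Keys and values are assumptions.
telemetry_block = {
    "telemetry_stats": {
        "gpu_power_usage": {
            "unit": "W",
            "avg": 241.3,
            "p50": 240.8,
            "p99": 289.1,
            "tags": {
                "dcgm_url": "http://localhost:9401/metrics",
                "gpu_index": "0",
                "gpu_uuid": "GPU-hypothetical-uuid",
            },
        }
    }
}

print(json.dumps(telemetry_block, indent=2))
```

The tags dictionary is what would let a dashboard filter first by DCGM endpoint, then by GPU.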

Running and Testing

Automatic Integration

GPU telemetry collection is now automatically enabled when DCGM is available:

# Standard AIPerf command - telemetry collection happens automatically and is displayed in the console when --verbose is on 
aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --streaming \
  --concurrency 10 \
  --request-count 100 \
  --verbose
# AIPerf command with custom DCGM endpoint url 
aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --streaming \
  --concurrency 10 \
  --request-count 100 \
  --server-metrics-url http://localhost:9400/metrics http://localhost:9401/metrics http://localhost:9402/metrics \
  --verbose 

Prerequisites

  • DCGM running and accessible at http://localhost:9401/metrics (default); see the linked tutorial for setup
  • GPU hardware with DCGM support

Example Output

(screenshot: console telemetry summary output)

Verification

Service Registration

You should see TelemetryManager registration in the logs:

INFO: Registered Telemetry Manager (id: 'telemetry_manager_xyz')

# Plus any ERROR/WARNING messages if issues occur
WARNING: GPU telemetry disabled - no DCGM endpoints reachable

Testing & Development

Unit Testing

cd /home/inguyen/code/aiperf
python -m pytest tests/gpu_telemetry/ -v
python -m pytest tests/common/models/test_telemetry_models.py -v
python -m pytest tests/post_processors/test_telemetry_results_processor.py -v

Summary by CodeRabbit

  • New Features

    • Integrated GPU telemetry collection with endpoint discovery and status.
    • New console telemetry view (verbose) showing per-endpoint/GPU metrics.
    • CSV and JSON exports now include telemetry summaries when available.
    • Added config/CLI option to specify server metrics endpoints.
  • Improvements

    • Exposed additional metric units; energy reported in megajoules.
    • Coordinated shutdown waits for telemetry when enabled.
  • Changes

    • Default telemetry sampling interval set to ~0.33s (~3 Hz).
    • “No data” console export message lowered from warning to debug.

@ilana-n ilana-n requested a review from Copilot September 30, 2025 19:35

coderabbitai bot commented Sep 30, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

Walkthrough

Introduces a GPU telemetry subsystem: new config group and user setting, telemetry enums/messages/models, telemetry manager and data collector, records aggregation and processing, controller coordination and shutdown logic, and telemetry export paths (console, CSV, JSON). Updates metrics base validation and energy unit, constants, and tests to reflect telemetry data structures and flows.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Config: Groups & UserConfig**<br>`aiperf/common/config/groups.py`, `aiperf/common/config/user_config.py` | Adds TELEMETRY group and a `UserConfig` field `server_metrics_url` (`list[str]`, CLI-parameterized, TELEMETRY group). |
| **Enums: Metrics, Data Exporters, Messages, Services**<br>`aiperf/common/enums/__init__.py`, `aiperf/common/enums/data_exporter_enums.py`, `aiperf/common/enums/message_enums.py`, `aiperf/common/enums/service_enums.py` | Exposes new metric unit enums; adds `ConsoleExporterType.TELEMETRY`; adds `MessageType` entries for telemetry; adds `ServiceType.TELEMETRY_MANAGER`. |
| **Messages: Telemetry**<br>`aiperf/common/messages/telemetry_messages.py`, `aiperf/common/messages/__init__.py` | Adds `TelemetryRecordsMessage`, `ProcessTelemetryResultMessage`, `TelemetryStatusMessage` and exports them. |
| **Models: Telemetry**<br>`aiperf/common/models/telemetry_models.py`, `aiperf/common/models/__init__.py` | Adds `TelemetryResults` and `ProcessTelemetryResult`; removes `TelemetryRecord` fields `gpu_power_limit` and `total_gpu_memory`; exports telemetry models. |
| **Models: Records (minor)**<br>`aiperf/common/models/record_models.py` | Adds commented placeholder for `telemetry_results`; no runtime impact. |
| **Controller: System lifecycle integration**<br>`aiperf/controller/system_controller.py` | Integrates telemetry manager, status/result handlers, shutdown coordination, and passes `telemetry_results` to exporters. |
| **Exporters: Wiring & Config**<br>`aiperf/exporters/exporter_config.py`, `aiperf/exporters/exporter_manager.py`, `aiperf/exporters/__init__.py` | Adds `TelemetryResults` to `ExporterConfig`/`ExporterManager` and exports `GPUTelemetryConsoleExporter`. |
| **Exporters: Implementations**<br>`aiperf/exporters/gpu_telemetry_console_exporter.py`, `aiperf/exporters/csv_exporter.py`, `aiperf/exporters/json_exporter.py`, `aiperf/exporters/console_metrics_exporter.py` | New GPU telemetry console exporter; CSV/JSON exporters emit telemetry sections/blocks; log-level change in `ConsoleMetricsExporter` when no records. |
| **GPU Telemetry Package**<br>`aiperf/gpu_telemetry/__init__.py`, `aiperf/gpu_telemetry/constants.py`, `aiperf/gpu_telemetry/telemetry_data_collector.py`, `aiperf/gpu_telemetry/telemetry_manager.py` | Public package init; updates defaults/mappings; async-aware collector callbacks and parsing changes; manager reworked to `BaseComponentService` with publish-based messaging and status reporting. |
| **Metrics Base & Types**<br>`aiperf/metrics/base_metric.py`, `aiperf/metrics/gpu_telemetry_types/energy_consumption_metric.py` | Allows `BaseTelemetryMetric` as a valid base; changes energy unit from millijoule to megajoule. |
| **Post-processors**<br>`aiperf/post_processors/telemetry_results_processor.py`, `aiperf/post_processors/__init__.py` | Adds `TelemetryResultsProcessor`; updates metric set and error handling. |
| **Records Manager**<br>`aiperf/records/records_manager.py` | Adds telemetry aggregation, error summary, and independent telemetry result publishing/export. |
| **Tests: GPU Telemetry**<br>`tests/gpu_telemetry/*` | Updates fixtures for removed fields and new `dcgm_url`; adjusts callbacks and defaults; adds model tests. |
| **Tests: Exporters**<br>`tests/data_exporters/*` | Adds telemetry fixtures/tests for CSV/JSON/console telemetry exporters; updates constructors with `telemetry_results`. |
| **Tests: Post-processors**<br>`tests/post_processors/test_telemetry_results_processor.py` | Adds tests for `TelemetryResultsProcessor` across success/error paths. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant User as User
  participant SystemController as SystemController
  participant ServiceMgr as ServiceManager
  participant TelemetryMgr as TelemetryManager
  participant RecordsMgr as RecordsManager
  participant Exporters as ExporterManager

  User->>SystemController: start()
  SystemController->>ServiceMgr: start()
  SystemController->>TelemetryMgr: start()
  TelemetryMgr->>SystemController: publish(TelemetryStatusMessage)
  alt telemetry enabled and endpoints reachable
    loop collection intervals
      TelemetryMgr->>RecordsMgr: publish(TelemetryRecordsMessage)
      RecordsMgr->>RecordsMgr: aggregate telemetry hierarchy
    end
  else telemetry disabled or none reachable
    Note over SystemController: Will not wait for telemetry results
  end

  par profiling completes
    RecordsMgr-->>SystemController: publish(ProcessRecordsResultMessage)
    SystemController->>SystemController: mark profile_results_received
  and telemetry processing completes
    RecordsMgr-->>SystemController: publish(ProcessTelemetryResultMessage)
    SystemController->>SystemController: store telemetry_results
  end

  SystemController->>SystemController: _check_and_trigger_shutdown()
  alt ready to shutdown
    SystemController->>Exporters: export(results, telemetry_results)
    Exporters->>Exporters: console/csv/json emit telemetry sections
    SystemController-->>User: done
  end
```
```mermaid
sequenceDiagram
  autonumber
  participant ExporterManager
  participant Csv as CsvExporter
  participant Json as JsonExporter
  participant Console as GPUTelemetryConsoleExporter

  ExporterManager->>Csv: init(ExporterConfig{telemetry_results})
  ExporterManager->>Json: init(ExporterConfig{telemetry_results})
  ExporterManager->>Console: init(ExporterConfig{telemetry_results})
  par CSV
    Csv->>Csv: _generate_csv_content(records, telemetry_results)
    Csv->>Csv: _write_telemetry_section()
  and JSON
    Json->>Json: build telemetry_export_data
  and Console
    Console->>Console: get_renderable(telemetry_results)
  end
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

Poem

I thump my paws in metric glee,
New wires hum with teleme-try.
Endpoints blink, GPUs sing,
Tables bloom with stately bling.
When runs are done, we export bright—
A carrot chart of watts and byte.
Hop-hop! The data’s right. 🥕📊

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 77.54%, below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title clearly identifies the primary change (integrating GPU telemetry support), which aligns with the PR's main objective of adding TelemetryManager and related telemetry features; "Milestone 2" is additional context that does not obscure the core feature. |


@github-actions github-actions bot added the feat label Sep 30, 2025
@ilana-n ilana-n (Author) commented Sep 30, 2025

@coderabbitai review


coderabbitai bot commented Sep 30, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@Copilot Copilot AI left a comment


Pull Request Overview

This PR implements GPU telemetry integration as Milestone 2 of the telemetry project, adding comprehensive service architecture integration, metrics processing pipeline, and export capabilities to AIPerf's core service system.

Key changes include:

  • Service architecture integration with TelemetryManager as an optional service in SystemController's lifecycle
  • Complete metrics processing pipeline with statistical analysis and export integration
  • Hierarchical telemetry data storage enabling real-time access and final aggregation

Reviewed Changes

Copilot reviewed 45 out of 45 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| `aiperf/gpu_telemetry/` | New telemetry module with TelemetryManager service and TelemetryDataCollector |
| `aiperf/post_processors/telemetry_results_processor.py` | Telemetry-specific results processor implementing TelemetryResultsProcessorProtocol |
| `aiperf/metrics/gpu_telemetry_types/` | GPU telemetry metric definitions extending BaseTelemetryMetric |
| `aiperf/common/models/telemetry_models.py` | Comprehensive telemetry data models with hierarchical storage |
| `aiperf/exporters/` | Enhanced exporters with telemetry data integration for CSV, JSON, and console output |
| `aiperf/records/records_manager.py` | Extended RecordsManager to handle telemetry records alongside inference metrics |


coderabbitai bot added a commit that referenced this pull request Sep 30, 2025
Docstrings generation was requested by @ilana-n.

* #316 (comment)

The following files were modified:

* `aiperf/common/config/user_config.py`
* `aiperf/common/enums/metric_enums.py`
* `aiperf/common/messages/telemetry_messages.py`
* `aiperf/common/models/telemetry_models.py`
* `aiperf/common/protocols.py`
* `aiperf/controller/system_controller.py`
* `aiperf/exporters/console_metrics_exporter.py`
* `aiperf/exporters/csv_exporter.py`
* `aiperf/exporters/exporter_manager.py`
* `aiperf/exporters/gpu_telemetry_console_exporter.py`
* `aiperf/exporters/json_exporter.py`
* `aiperf/gpu_telemetry/telemetry_data_collector.py`
* `aiperf/gpu_telemetry/telemetry_manager.py`
* `aiperf/metrics/base_metric.py`
* `aiperf/metrics/base_telemetry_metric.py`
* `aiperf/metrics/gpu_telemetry_types/energy_consumption_metric.py`
* `aiperf/metrics/gpu_telemetry_types/gpu_memory_used_metric.py`
* `aiperf/metrics/gpu_telemetry_types/gpu_power_usage_metric.py`
* `aiperf/metrics/gpu_telemetry_types/gpu_temperature_metric.py`
* `aiperf/metrics/gpu_telemetry_types/gpu_utilization_metric.py`
* `aiperf/metrics/gpu_telemetry_types/memory_clock_frequency_metric.py`
* `aiperf/metrics/gpu_telemetry_types/memory_temperature_metric.py`
* `aiperf/metrics/gpu_telemetry_types/sm_clock_frequency_metric.py`
* `aiperf/post_processors/telemetry_results_processor.py`
* `aiperf/records/records_manager.py`
* `tests/gpu_telemetry/conftest.py`
* `tests/gpu_telemetry/test_telemetry_data_collector.py`


@ilana-n ilana-n requested a review from Copilot September 30, 2025 22:12

codecov bot commented Sep 30, 2025

Thread safety:
Uses self._shutdown_lock to prevent race conditions when ProcessRecordsResultMessage
and ProcessTelemetryResultMessage arrive concurrently. The lock ensures atomic
check-and-set of _shutdown_triggered, preventing double-triggering of stop().
@ilana-n (Author):

Scenario 1: Both messages arrive simultaneously (telemetry enabled & ready)

  • Without lock: Both tasks pass the _shutdown_triggered check, both call stop() → double-trigger
  • With lock: First task acquires lock, sets _shutdown_triggered = True, calls stop(). Second task waits, then hits early-return guard → single trigger

Scenario 2: Profile results arrive first, telemetry still pending

  • Task acquires lock, evaluates telemetry_ready_for_shutdown = (not True or None) = False
  • Logs "Waiting for telemetry results..." and returns without setting _shutdown_triggered
  • When telemetry arrives later, second call acquires lock and correctly triggers shutdown

Scenario 3: Telemetry disabled

  • Single call evaluates telemetry_ready_for_shutdown = (not False or ...) = True
  • Triggers shutdown immediately (lock provides no overhead in sequential case)
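The three scenarios above condense into a small asyncio sketch. Class and attribute names mirror the description, but this is an approximation under stated assumptions, not the SystemController's actual code:

```python
import asyncio

class ShutdownCoordinator:
    """Approximates the locked check-and-trigger logic described above."""

    def __init__(self, telemetry_enabled: bool):
        self._lock = asyncio.Lock()
        self._shutdown_triggered = False
        self.telemetry_enabled = telemetry_enabled
        self.profile_results_received = False
        self.telemetry_results = None
        self.stop_calls = 0  # stand-in for the real stop() side effects

    async def stop(self) -> None:
        self.stop_calls += 1

    async def check_and_trigger_shutdown(self) -> None:
        async with self._lock:
            if self._shutdown_triggered:
                return  # early-return guard: prevents double-triggering stop()
            if not self.profile_results_received:
                return
            telemetry_ready = (not self.telemetry_enabled
                               or self.telemetry_results is not None)
            if not telemetry_ready:
                return  # "Waiting for telemetry results..." (scenario 2)
            self._shutdown_triggered = True  # atomic check-and-set under the lock
            await self.stop()
```

Because the check-and-set happens under the lock, two concurrent calls (scenario 1) serialize, and only the first reaches `stop()`.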

@ilana-n ilana-n force-pushed the gpu-telemetry-cli branch from 2e4f01c to 6bf7c00 Compare October 1, 2025 01:15
- Add comprehensive GPU telemetry data collection and monitoring system
- Implement TelemetryManager and TelemetryDataCollector for real-time GPU metrics
- Create GPU telemetry console exporter for CLI display of telemetry data
- Add telemetry results post-processor for data analysis and aggregation
- Implement comprehensive unit test coverage for all telemetry components
- Add integration tests for end-to-end telemetry functionality
- Update system controller to integrate GPU telemetry with existing infrastructure
- Extend existing exporters to support telemetry data formats
- Add comprehensive error handling and async callback support
- Include code cleanup, linting fixes, and documentation improvements
Contributor

@ajcasagrande ajcasagrande left a comment


Still have more to go through. Submitting my comments so far. Looking really great and I am excited to test it out!

Contributor

@ajcasagrande ajcasagrande left a comment


Have added some more comments and feedback. I appreciate the well documented code. thanks!

@ilana-n ilana-n requested a review from Copilot October 3, 2025 21:37

@Copilot Copilot AI left a comment


Pull Request Overview

Copilot reviewed 47 out of 47 changed files in this pull request and generated 3 comments.



```python
)

await self._check_if_all_records_received()
```

Copilot AI Oct 3, 2025


The decorator changed from @on_pull_message to @on_message. This change in message handling pattern should be documented or explained, as it affects how telemetry records are processed in the system architecture.

Suggested change:

```python
# Note: We use @on_message (push-based delivery) for TELEMETRY_RECORDS instead of @on_pull_message.
# This is intentional: telemetry records are pushed to the RecordsManager as they are generated,
# ensuring timely processing and reducing latency. In contrast, metric records use a pull-based
# approach (@on_pull_message) to allow for controlled, demand-driven retrieval. This distinction
# is important for system performance and reliability.
```

