Conversation

nv-hwoo (Contributor) commented Aug 15, 2025

Summary

This pull request introduces new infrastructure for dataset processing, standardizes dataset type handling, and adds new messaging and configuration options to support dataset processors. The changes lay the groundwork for scalable, configurable dataset generation and processing services within the system. Key updates include new enums, message types, configuration fields, and removal of the old composer abstraction.

Dataset Processing Infrastructure:

  • Added DatasetProcessor service type and related configuration (dataset_processor_service_count) to enable scalable dataset processing as a first-class service.
  • Introduced new message types for dataset processing (e.g., ProcessSyntheticDatasetMessage, ProcessDatasetResponseMessage) to support communication between components involved in dataset generation and processing.
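The message types named above are not defined in this summary; a minimal sketch of what they might look like (field names here are assumptions for illustration, not the actual aiperf definitions):

```python
# Hypothetical shapes for the dataset-processing messages named above.
# The real classes live in aiperf/common/messages and may differ.
from dataclasses import dataclass, field


@dataclass
class ProcessSyntheticDatasetMessage:
    """Job sent from the dataset manager to a processor."""
    num_conversations: int


@dataclass
class ProcessDatasetResponseMessage:
    """Result sent back from a processor to the manager."""
    conversation_ids: list[str] = field(default_factory=list)
```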

Configuration and CLI Enhancements:

  • Added dataset_type configuration and CLI parameter, with validation logic to ensure correct dataset type and file/custom type combinations.
  • Added new communication addresses and ports for dataset job/result messaging in the ZMQ config, enabling routing of dataset processing tasks and results.

Enums and API Cleanup:

  • Standardized dataset type handling by introducing DatasetType (replacing ComposerType), updating imports, and cleaning up references throughout the codebase.
  • Removed the obsolete ComposerFactory and related composer imports, reflecting the move away from the composer abstraction in favor of the new dataset processor model.
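A minimal sketch of what the DatasetType enum replacing ComposerType might look like (member names are assumptions; see aiperf/common/enums/dataset_enums.py for the actual definition):

```python
# Hypothetical sketch of the DatasetType enum described above.
from enum import Enum


class DatasetType(str, Enum):
    SYNTHETIC = "synthetic"  # generated prompts/audio/images
    CUSTOM = "custom"        # user-supplied dataset file
```

Deriving from `str` keeps the members directly usable as CLI/config values.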

Command and Message Types:

  • Added new command types for spawning and shutting down dataset processors, and new message types for dataset processing workflow.

These changes collectively enable more flexible and robust dataset processing, with improved configuration, clearer API boundaries, and better support for scaling dataset generation to meet system demands.

Copilot AI left a comment

Pull Request Overview

This pull request introduces parallel dataset processing infrastructure to replace the previous composer-based architecture. The changes enable distributed dataset generation across multiple processor services, standardize dataset type handling, and add configuration and messaging support for the new processing model.

  • Replaces composer architecture with distributed dataset processors - Removes old ComposerFactory and composer classes, introducing DatasetProcessor services that can run in parallel
  • Standardizes dataset type handling - Introduces DatasetType enum to replace ComposerType, with improved validation between dataset types and file configurations
  • Adds parallel processing infrastructure - Implements job distribution, result aggregation, and ZMQ communication channels for dataset generation tasks

Reviewed Changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 2 comments.

Summary per file:

  • aiperf/dataset/processor.py: New DatasetProcessor service that handles distributed dataset generation with support for synthetic, custom, and trace datasets
  • aiperf/dataset/dataset_manager.py: Updated to distribute dataset generation jobs across multiple processors and aggregate results
  • aiperf/common/enums/dataset_enums.py: Replaces ComposerType with DatasetType enum for clearer dataset categorization
  • aiperf/common/config/input_config.py: Adds dataset_type parameter with validation logic for dataset type and file combinations
  • aiperf/common/messages/dataset_messages.py: New message types for dataset processing communication between manager and processors
  • aiperf/common/config/zmq_config.py: Adds dataset job and result communication addresses/ports for ZMQ messaging
  • tests/services/test_dataset_processor.py: New test suite for dataset processor service functionality
Comments suppressed due to low confidence (3)

aiperf/common/config/zmq_config.py:258

  • The dataset_job_address property is missing from the ZMQTCPConfig class. This property is declared as abstract in the base class and implemented in ZMQIPCConfig but not in ZMQTCPConfig.
        """Get the credit return address based on protocol configuration."""

aiperf/common/config/zmq_config.py:269

  • The dataset_result_address property is missing from the ZMQTCPConfig class. This property is declared as abstract in the base class and implemented in ZMQIPCConfig but not in ZMQTCPConfig.
        return f"ipc://{self.path}/dataset_result.ipc"
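The missing TCP implementations flagged here would presumably mirror the IPC ones; a hypothetical sketch (host/port field names are assumptions, not the actual ZMQTCPConfig fields):

```python
# Sketch of the dataset_job_address / dataset_result_address properties
# on a TCP config, mirroring the IPC variant quoted in the review comment.
class ZMQTCPConfig:
    def __init__(self, host: str, dataset_job_port: int, dataset_result_port: int):
        self.host = host
        self.dataset_job_port = dataset_job_port
        self.dataset_result_port = dataset_result_port

    @property
    def dataset_job_address(self) -> str:
        """Address processors connect to for receiving dataset jobs."""
        return f"tcp://{self.host}:{self.dataset_job_port}"

    @property
    def dataset_result_address(self) -> str:
        """Address processors send dataset results to."""
        return f"tcp://{self.host}:{self.dataset_result_port}"
```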

aiperf/dataset/processor.py:315

  • The max_tokens field is being set directly instead of using the _set_max_tokens method. This bypasses the configuration-based token sampling logic and could lead to inconsistent behavior compared to other dataset types.
                conversation.turns.append(turn)


# composer = SyntheticDatasetComposer(audio_config, mock_tokenizer)
#
# assert composer.config.input.audio.length.mean == 2
# assert composer.include_image is False
Copilot AI commented Aug 15, 2025

The entire test file is commented out with a TODO note. This leaves the synthetic dataset processing functionality without proper test coverage, which could lead to undetected regressions.

Suggested change

# assert composer.include_image is False
class TestSyntheticDatasetComposer:
    # ========================================================================
    # Initialization Tests
    # ========================================================================
    def test_initialization_basic_config(self, synthetic_config, mock_tokenizer):
        """Test that SyntheticDatasetComposer can be instantiated with basic config."""
        composer = SyntheticDatasetComposer(synthetic_config, mock_tokenizer)
        assert composer.config == synthetic_config
        assert composer.config.input.conversation.num == 5
        assert composer.prompt_generator is not None
        assert composer.include_image is False
        assert composer.include_audio is False

    def test_initialization_with_images(self, image_config, mock_tokenizer):
        """Test initialization with image generation enabled."""
        composer = SyntheticDatasetComposer(image_config, mock_tokenizer)
        assert composer.config.input.image.width.mean == 10
        assert composer.config.input.image.height.mean == 10
        assert composer.include_image is True
        assert composer.include_audio is False

    def test_initialization_with_audio(self, audio_config, mock_tokenizer):
        """Test initialization with audio generation enabled."""
        composer = SyntheticDatasetComposer(audio_config, mock_tokenizer)
        assert composer.config.input.audio.length.mean == 2
        # assert composer.include_image is False


self.prompt_generator = prompt_generator
self.user_config = user_config
self._skipped_traces = 0
def __init__(self, user_config: UserConfig, **kwargs) -> None:
Copilot AI commented Aug 15, 2025

The constructor signature changed to remove specific parameters like prompt_generator, but the class may still need access to these components. Consider whether this breaks the loader's functionality or whether these dependencies should be injected differently.


nv-hwoo (Contributor, Author) commented

@ajcasagrande noticed this test being non-deterministic a while ago, and I made the fix in this PR.

nv-hwoo (Contributor, Author) commented Aug 15, 2025

I will follow up with a separate PR that migrates and adds new tests and refactors some of the components in the dataset processor. Some unit tests for the custom dataset will fail because that functionality moved to the dataset processor.

@nv-hwoo nv-hwoo marked this pull request as draft August 18, 2025 18:59
nv-hwoo (Contributor, Author) commented Aug 18, 2025

Converted to draft to address one important issue pointed out by @ajcasagrande. With distributed dataset generation, we need to figure out how to create, maintain, and fetch from cached token blocks across multiple processes.
