Hi community,

I'm working with a custom environment where each `reset()` call involves loading a significant amount of data from disk. To optimize this, I'd like to use a standard `torch.utils.data.DataLoader` with `num_workers > 0` inside my environment's reset logic.
However, I've encountered a fundamental conflict when using this setup with `torchrl.envs.ParallelEnv`. I'm opening this discussion to seek advice on the recommended patterns for this scenario and to see if others have similar needs.
### The Use Case: Data-Loading on `reset()`
Our environment simulates scenarios based on real-world data logs. Each episode requires loading a different data file. The simplified logic looks like this:
```python
import random

import torch
from torch.utils.data import Dataset, DataLoader
from torchrl.envs import EnvBase


class MyHeavyDataEnv(EnvBase):
    def __init__(self, file_paths, ...):
        super().__init__(...)
        self.file_paths = file_paths
        # This DataLoader will be re-initialized on each reset
        self.loader = None

    def _reset(self, tensordict):
        # Sample a new data file for the new episode
        file_path = random.choice(self.file_paths)
        # Create a dataset and a DataLoader to load the data for this episode
        dataset = MyDataset(file_path)
        # The ideal scenario is to parallelize this data loading
        self.loader = DataLoader(
            dataset,
            batch_size=None,  # loading the whole file
            num_workers=4,    # <--- the desired optimization
        )
        # Consume the loader to get the data for this episode
        episode_data = next(iter(self.loader))
        # ... use episode_data to set the initial state ...
        return initial_state_tensordict
```
With `num_workers=0` this works without any issues, but it becomes a performance bottleneck because data loading and preprocessing are slow.
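For context, the environments are then wrapped in `ParallelEnv` roughly as sketched below; the worker count and the `my_file_paths` variable are placeholders rather than our actual configuration.

```python
from torchrl.envs import ParallelEnv

def make_env():
    # my_file_paths is a placeholder for however the experiment supplies file paths
    return MyHeavyDataEnv(file_paths=my_file_paths)

env = ParallelEnv(8, make_env)  # 8 is an illustrative worker count
td = env.reset()                # each worker runs MyHeavyDataEnv._reset, where the DataLoader is built
```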
### The Technical Challenge
The core issue is that `ParallelEnv` creates its worker processes as daemonic, while a `DataLoader` with `num_workers > 0` attempts to spawn its own child processes. This leads to Python's well-known restriction:
```
AssertionError: daemonic processes are not allowed to have children
```
This creates a conflict between torchrl's process management for parallel environments and PyTorch's standard mechanism for parallel data loading.
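To be clear, the restriction comes from Python's `multiprocessing` itself rather than from anything torchrl-specific; a minimal standalone repro (no torchrl involved) looks like this:

```python
import multiprocessing as mp

def grandchild():
    pass

def daemonic_worker():
    # A daemonic process is not allowed to start children of its own:
    # Process.start() raises
    # "AssertionError: daemonic processes are not allowed to have children".
    p = mp.Process(target=grandchild)
    p.start()
    p.join()

if __name__ == "__main__":
    worker = mp.Process(target=daemonic_worker, daemon=True)
    worker.start()
    worker.join()
```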
### What I've Considered
#### Approach 1: `num_workers=0` (The Safe-but-Slow Method)
This is the most straightforward workaround. It avoids the error but sacrifices performance at a critical point in our training loop. Given that we have many parallel environments, the total throughput is okay, but each `reset` is individually slow.
#### Approach 2: A Custom Non-Daemonic `ParallelEnv` (The Powerful-but-Risky Method)
I've considered subclassing `ParallelEnv` and forcing its worker processes to be non-daemonic, which would permit them to have their own children (the `DataLoader` workers).
However, the significant drawback is that it requires manual, careful process lifecycle management to avoid zombie processes, which adds a lot of complexity and risk to the training framework.
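For reference, the generic "non-daemonic process" trick this approach builds on is a common recipe around Python's `multiprocessing`, shown below only to illustrate the mechanism. Wiring it into `ParallelEnv`'s process creation is exactly the part that would need the risky subclassing and is not shown here.

```python
import multiprocessing

class NoDaemonProcess(multiprocessing.Process):
    """A Process that always reports daemon=False, so it may spawn its own children."""

    @property
    def daemon(self):
        return False

    @daemon.setter
    def daemon(self, value):
        # Silently ignore attempts by the launcher to mark this process as daemonic.
        pass
```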
### Discussion Points & Questions
I'd love to hear the community's thoughts on this:
- **Is this a common scenario?** Do other users have environments that require heavy, parallelizable I/O or computation during `reset()`?
- **Are there more elegant, recommended patterns to solve this?** Perhaps there's a different way to structure the data-loading pipeline with torchrl that I've missed.
- **Would the TorchRL team consider adding a feature to facilitate this?** For example, a built-in option in `ParallelEnv` like `daemon_workers=False` could be a solution (see the sketch below). This would, of course, come with a clear warning in the documentation about the user's responsibility to properly close the environment.
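To make the idea concrete, here is a purely hypothetical sketch of what that could look like from the user's side; `daemon_workers` does not exist in torchrl today, and the name and semantics are only illustrative.

```python
from torchrl.envs import ParallelEnv

# Hypothetical flag -- not part of the current ParallelEnv API.
env = ParallelEnv(8, make_env, daemon_workers=False)
try:
    td = env.reset()
    # ... rollout / training ...
finally:
    # With non-daemonic workers, cleanup becomes the user's responsibility;
    # forgetting env.close() could leave orphaned worker processes behind.
    env.close()
```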
This seems like a potentially valuable capability for a broader set of use cases where environments are tightly coupled with large datasets.
Thank you for your time and for building such a great library! I look forward to the discussion.