Hi community,

I'm working with a custom environment where each `reset()` call involves loading a significant amount of data from disk. To optimize this, I'd like to use a standard `torch.utils.data.DataLoader` with `num_workers > 0` inside my environment's reset logic.
However, I've encountered a fundamental conflict when using this setup with `torchrl.envs.ParallelEnv`. I'm opening this discussion to seek advice on the recommended patterns for this scenario and to see if others have similar needs.
### The Use Case: Data-Loading on `reset()`
Our environment simulates scenarios based on real-world data logs. Each episode requires loading a different data file. The simplified logic looks like this:
```python
import random

import torch
from torch.utils.data import Dataset, DataLoader
from torchrl.envs import EnvBase


class MyHeavyDataEnv(EnvBase):
    def __init__(self, file_paths, ...):
        super().__init__(...)
        self.file_paths = file_paths
        # This DataLoader will be re-initialized on each reset
        self.loader = None

    def _reset(self, tensordict):
        # Sample a new data file for the new episode
        file_path = random.choice(self.file_paths)
        # Create a dataset and a DataLoader to load the data for this episode
        dataset = MyDataset(file_path)
        # The ideal scenario is to parallelize this data loading
        self.loader = DataLoader(
            dataset,
            batch_size=None,  # loading the whole file
            num_workers=4,    # <--- the desired optimization
        )
        # Consume the loader to get the data for this episode
        episode_data = next(iter(self.loader))
        # ... use episode_data to set the initial state ...
        return initial_state_tensordict
```
With `num_workers=0` this works without any issues, but it becomes a performance bottleneck because data loading and preprocessing are slow.
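For context, the environments are then wrapped in `ParallelEnv` roughly as sketched below; the worker count and the `my_file_paths` variable are placeholders rather than our actual configuration.

```python
from torchrl.envs import ParallelEnv

def make_env():
    # my_file_paths is a placeholder for however the experiment supplies file paths
    return MyHeavyDataEnv(file_paths=my_file_paths)

env = ParallelEnv(8, make_env)  # 8 is an illustrative worker count
td = env.reset()                # each worker runs MyHeavyDataEnv._reset, where the DataLoader is built
```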
### The Technical Challenge
The core issue is that `ParallelEnv` creates its worker processes as daemonic, while a `DataLoader` with `num_workers > 0` attempts to spawn its own child processes. This leads to Python's well-known restriction:
```
AssertionError: daemonic processes are not allowed to have children
```
This creates a conflict between torchrl's process management for parallel environments and PyTorch's standard mechanism for parallel data loading.
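To be clear, the restriction comes from Python's `multiprocessing` itself rather than from anything torchrl-specific; a minimal standalone repro (no torchrl involved) looks like this:

```python
import multiprocessing as mp

def grandchild():
    pass

def daemonic_worker():
    # A daemonic process is not allowed to start children of its own:
    # Process.start() raises
    # "AssertionError: daemonic processes are not allowed to have children".
    p = mp.Process(target=grandchild)
    p.start()
    p.join()

if __name__ == "__main__":
    worker = mp.Process(target=daemonic_worker, daemon=True)
    worker.start()
    worker.join()
```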
### What I've Considered
#### Approach 1: `num_workers=0` (The Safe-but-Slow Method)
This is the most straightforward workaround. It avoids the error but sacrifices performance at a critical point in our training loop. Given that we have many parallel environments, the total throughput is okay, but each `reset` is individually slow.
#### Approach 2: A Custom Non-Daemonic `ParallelEnv` (The Powerful-but-Risky Method)
I've considered subclassing `ParallelEnv` and forcing its worker processes to be non-daemonic, which would permit them to have their own children (the `DataLoader` workers).
However, the significant drawback is that it requires manual, careful process lifecycle management to avoid zombie processes, which adds a lot of complexity and risk to the training framework.
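For reference, the generic "non-daemonic process" trick this approach builds on is a common recipe around Python's `multiprocessing`, shown below only to illustrate the mechanism. Wiring it into `ParallelEnv`'s process creation is exactly the part that would need the risky subclassing and is not shown here.

```python
import multiprocessing

class NoDaemonProcess(multiprocessing.Process):
    """A Process that always reports daemon=False, so it may spawn its own children."""

    @property
    def daemon(self):
        return False

    @daemon.setter
    def daemon(self, value):
        # Silently ignore attempts by the launcher to mark this process as daemonic.
        pass
```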
### Discussion Points & Questions
I'd love to hear the community's thoughts on this:
- **Is this a common scenario?** Do other users have environments that require heavy, parallelizable I/O or computation during `reset()`?
- **Are there more elegant, recommended patterns to solve this?** Perhaps there's a different way to structure the data-loading pipeline with torchrl that I've missed.
- **Would the TorchRL team consider adding a feature to facilitate this?** For example, a built-in option in `ParallelEnv` like `daemon_workers=False` could be a solution (see the sketch below). This would, of course, come with a clear warning in the documentation about the user's responsibility to properly close the environment.
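To make the idea concrete, here is a purely hypothetical sketch of what that could look like from the user's side; `daemon_workers` does not exist in torchrl today, and the name and semantics are only illustrative.

```python
from torchrl.envs import ParallelEnv

# Hypothetical flag -- not part of the current ParallelEnv API.
env = ParallelEnv(8, make_env, daemon_workers=False)
try:
    td = env.reset()
    # ... rollout / training ...
finally:
    # With non-daemonic workers, cleanup becomes the user's responsibility;
    # forgetting env.close() could leave orphaned worker processes behind.
    env.close()
```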
This seems like a potentially valuable capability for a broader set of use cases where environments are tightly coupled with large datasets.
Thank you for your time and for building such a great library! I look forward to the discussion.