
Conversation

@lhoestq lhoestq commented Nov 6, 2025

this works:

import torch.distributed as dist
from datasets import IterableDataset
from datasets.distributed import split_dataset_by_node
from collections import Counter

def g(shards):
    for shard in shards:
        # shards don't have the same length
        num_examples = 3 + shard 
        for i in range(num_examples):
            yield {"shard": f"{shard=}", "i": i}

if __name__ == "__main__":
    dist.init_process_group(backend="gloo")
    rank, world_size = dist.get_rank(), dist.get_world_size()
    num_shards = 6
    ds = IterableDataset.from_generator(g, gen_kwargs={"shards": list(range(num_shards))})
    ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
    # Check that each rank has the same number of examples
    # and show the number of examples per shard and per rank
    counter = Counter(ds["shard"])
    print(f"# {rank=}\ttotal={counter.total()}\t{counter}", flush=True)

    # torchrun --nproc_per_node 2 script.py
    # rank=0        total=16        Counter({'shard=4': 7, 'shard=2': 5, 'shard=0': 4})
    # rank=1        total=16        Counter({'shard=3': 6, 'shard=5': 6, 'shard=1': 4})
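For reference, the per-rank shard assignment implied by the printed counters can be reproduced with a small round-robin sketch. The modulo rule below is an assumption inferred from the output (rank 0 got shards 0, 2, 4 and rank 1 got shards 1, 3, 5), not the documented contract of `split_dataset_by_node`:

```python
# Hypothetical sketch: assign shards to ranks round-robin (shard index mod world_size).
# This reproduces the assignment visible in the printed counters above, but the exact
# rule inside split_dataset_by_node is an assumption here.
num_shards, world_size = 6, 2

def shards_for_rank(rank, num_shards, world_size):
    return [s for s in range(num_shards) if s % world_size == rank]

for rank in range(world_size):
    shards = shards_for_rank(rank, num_shards, world_size)
    # each shard s yields 3 + s examples, per the generator g above
    natural_total = sum(3 + s for s in shards)
    print(rank, shards, natural_total)
```

Note that the natural per-rank totals differ (15 for rank 0, 18 for rank 1); the point of this PR is that both ranks nevertheless stop at the same count (16 in the output above).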

TODO: make it work with DataLoader (communicate with the main process to know when the node runs out of data?)
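On the synchronization idea itself: a rank cannot simply stop when its own shards run out, since other ranks may still have data. Below is a torch-free, single-process simulation of one possible scheme, where every rank advances in lockstep, exchanges an "exhausted" flag (standing in for `dist.all_reduce`), restarts its iterator when it runs dry, and all ranks stop as soon as any one of them has been exhausted. The first-exhausted stopping rule and the cycling of shorter ranks are assumptions that happen to reproduce the totals printed above, not necessarily the PR's actual implementation:

```python
# Hypothetical single-process simulation of "stop together" synchronization.
# lengths: natural number of examples per rank (assumed positive).
def synchronized_counts(lengths):
    iters = [iter(range(n)) for n in lengths]
    exhausted = [False] * len(lengths)
    counts = [0] * len(lengths)
    while True:
        for rank in range(len(iters)):
            try:
                next(iters[rank])
            except StopIteration:
                # this rank ran out of data: flag it and restart (cycle) its iterator
                exhausted[rank] = True
                iters[rank] = iter(range(lengths[rank]))
                next(iters[rank])
            counts[rank] += 1
        # stands in for dist.all_reduce of the exhausted flags + a "first exhausted" rule
        if any(exhausted):
            return counts

print(synchronized_counts([15, 18]))  # natural lengths of ranks 0 and 1 above
```

With natural lengths 15 and 18, this returns `[16, 16]`, matching the equal totals printed by the script.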

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


@LTMeyer LTMeyer left a comment


Thank you @lhoestq for this addition! I look forward to trying it.

Comment on lines +2156 to +2160
yield key, example
dist.all_reduce(is_exhausted)
if self.bool_strategy_func(is_exhausted):
    return
is_exhausted[self.rank] = True

I'm wondering whether this yields one extra sample once the dataset is exhausted.
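To make the concern concrete, here is a toy model of the ordering in the quoted snippet (no torch; `stop_at` simulates the step at which the reduced exhaustion flag turns true, and all names are hypothetical). Checking the flag after the yield emits one more example than checking it before:

```python
# Toy model of the yield/check ordering question, not the PR's actual code.
def run(order, data, stop_at):
    # stop_at: step at which the (simulated) reduced exhaustion flag becomes true
    yielded = []
    for step, example in enumerate(data, start=1):
        flag = step >= stop_at  # stands in for dist.all_reduce(is_exhausted)
        if order == "check_before" and flag:
            break
        yielded.append(example)
        if order == "check_after" and flag:  # mirrors the quoted snippet's ordering
            break
    return yielded

after = run("check_after", range(10), stop_at=4)
before = run("check_before", range(10), stop_at=4)
print(len(after), len(before))  # checking after the yield emits one extra example
```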

Member Author


good point, let me see

@lhoestq lhoestq marked this pull request as draft November 7, 2025 15:22