
Conversation

@lhoestq lhoestq commented Nov 6, 2025

this works:

import torch.distributed as dist
from datasets import IterableDataset
from datasets.distributed import split_dataset_by_node
from collections import Counter

def g(shards):
    for shard in shards:
        # shards don't have the same length
        num_examples = 3 + shard 
        for i in range(num_examples):
            yield {"shard": f"{shard=}", "i": i}

if __name__ == "__main__":
    dist.init_process_group(backend="gloo")
    rank, world_size = dist.get_rank(), dist.get_world_size()
    num_shards = 6
    ds = IterableDataset.from_generator(g, gen_kwargs={"shards": list(range(num_shards))})
    ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
    # Check that each rank has the same number of examples
    # and show the number of examples per shard and per rank
    counter = Counter(ds["shard"])
    print(f"# {rank=}\ttotal={counter.total()}\t{counter}", flush=True)

    # torchrun --nproc_per_node 2 script.py
    # rank=0        total=16        Counter({'shard=4': 7, 'shard=2': 5, 'shard=0': 4})
    # rank=1        total=16        Counter({'shard=3': 6, 'shard=5': 6, 'shard=1': 4})
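For reference, the per-rank shard assignment implied by the printed counters can be reproduced with a small round-robin sketch. The modulo rule below is an assumption inferred from the output (rank 0 got shards 0, 2, 4 and rank 1 got shards 1, 3, 5), not the documented contract of `split_dataset_by_node`:

```python
# Hypothetical sketch: assign shards to ranks round-robin (shard index mod world_size).
# This reproduces the assignment visible in the printed counters above, but the exact
# rule inside split_dataset_by_node is an assumption here.
num_shards, world_size = 6, 2

def shards_for_rank(rank, num_shards, world_size):
    return [s for s in range(num_shards) if s % world_size == rank]

for rank in range(world_size):
    shards = shards_for_rank(rank, num_shards, world_size)
    # each shard s yields 3 + s examples, per the generator g above
    natural_total = sum(3 + s for s in shards)
    print(rank, shards, natural_total)
```

Note that the natural per-rank totals differ (15 for rank 0, 18 for rank 1); the point of this PR is that both ranks nevertheless stop at the same count (16 in the output above).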

TODO: make it work with DataLoader (communicate with the main process to know when the node runs out of data?)
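On the synchronization idea itself: a rank cannot simply stop when its own shards run out, since other ranks may still have data. Below is a torch-free, single-process simulation of one possible scheme, where every rank advances in lockstep, exchanges an "exhausted" flag (standing in for `dist.all_reduce`), restarts its iterator when it runs dry, and all ranks stop as soon as any one of them has been exhausted. The first-exhausted stopping rule and the cycling of shorter ranks are assumptions that happen to reproduce the totals printed above, not necessarily the PR's actual implementation:

```python
# Hypothetical single-process simulation of "stop together" synchronization.
# lengths: natural number of examples per rank (assumed positive).
def synchronized_counts(lengths):
    iters = [iter(range(n)) for n in lengths]
    exhausted = [False] * len(lengths)
    counts = [0] * len(lengths)
    while True:
        for rank in range(len(iters)):
            try:
                next(iters[rank])
            except StopIteration:
                # this rank ran out of data: flag it and restart (cycle) its iterator
                exhausted[rank] = True
                iters[rank] = iter(range(lengths[rank]))
                next(iters[rank])
            counts[rank] += 1
        # stands in for dist.all_reduce of the exhausted flags + a "first exhausted" rule
        if any(exhausted):
            return counts

print(synchronized_counts([15, 18]))  # natural lengths of ranks 0 and 1 above
```

With natural lengths 15 and 18, this returns `[16, 16]`, matching the equal totals printed by the script.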

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


@LTMeyer LTMeyer left a comment


Thank you @lhoestq for this addition! I look forward to trying it.

Comment on lines +2156 to +2160
yield key, example
dist.all_reduce(is_exhausted)
if self.bool_strategy_func(is_exhausted):
    return
is_exhausted[self.rank] = True

I'm wondering whether this yields one extra sample once the dataset is exhausted.
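To make the concern concrete, here is a toy model of the ordering in the quoted snippet (no torch; `stop_at` simulates the step at which the reduced exhaustion flag turns true, and all names are hypothetical). Checking the flag after the yield emits one more example than checking it before:

```python
# Toy model of the yield/check ordering question, not the PR's actual code.
def run(order, data, stop_at):
    # stop_at: step at which the (simulated) reduced exhaustion flag becomes true
    yielded = []
    for step, example in enumerate(data, start=1):
        flag = step >= stop_at  # stands in for dist.all_reduce(is_exhausted)
        if order == "check_before" and flag:
            break
        yielded.append(example)
        if order == "check_after" and flag:  # mirrors the quoted snippet's ordering
            break
    return yielded

after = run("check_after", range(10), stop_at=4)
before = run("check_before", range(10), stop_at=4)
print(len(after), len(before))  # checking after the yield emits one extra example
```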

Member Author


good point, let me see

@lhoestq lhoestq marked this pull request as draft November 7, 2025 15:22