fix docstring code example for distributed shuffle (#7166)
lhoestq authored Sep 24, 2024
1 parent 548d2d2 commit e9ec56c
Showing 1 changed file with 1 addition and 1 deletion.
src/datasets/arrow_dataset.py

@@ -5146,7 +5146,7 @@ def to_iterable_dataset(self, num_shards: Optional[int] = 1) -> "IterableDataset":
 ```python
 >>> from datasets.distributed import split_dataset_by_node
 >>> ids = ds.to_iterable_dataset(num_shards=512)
->>> ids = ids.shuffle(buffer_size=10_000) # will shuffle the shards order and use a shuffle buffer when you start iterating
+>>> ids = ids.shuffle(buffer_size=10_000, seed=42) # will shuffle the shards order and use a shuffle buffer when you start iterating
 >>> ids = split_dataset_by_node(ds, world_size=8, rank=0) # will keep only 512 / 8 = 64 shards from the shuffled lists of shards when you start iterating
 >>> dataloader = torch.utils.data.DataLoader(ids, num_workers=4) # will assign 64 / 4 = 16 shards from this node's list of shards to each worker when you start iterating
 >>> for example in ids:
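For reference, below is the corrected example rewritten as a self-contained sketch rather than a docstring fragment. The dataset name is a placeholder, and `world_size=8, rank=0` would normally come from the distributed launcher; note also that the docstring's context line still passes `ds` to `split_dataset_by_node`, while this sketch passes the shuffled `ids`, which appears to be the intended flow. The added `seed=42` presumably ensures every node shuffles the 512 shards in the same order, so that `split_dataset_by_node` hands each node a disjoint slice of the data.

```python
# Self-contained sketch of the corrected pattern (assumed setup, not from
# the commit): "my_dataset" is a placeholder, and rank/world_size would
# normally be read from the launcher, e.g. torchrun's RANK/WORLD_SIZE env vars.
import torch
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

ds = load_dataset("my_dataset", split="train")
ids = ds.to_iterable_dataset(num_shards=512)

# The same seed on every node produces the same shard order on every node,
# so the per-node splits below are disjoint and together cover all 512 shards.
ids = ids.shuffle(buffer_size=10_000, seed=42)

# Keep 512 / 8 = 64 shards on this node (rank 0 of 8); we pass the shuffled
# `ids` here, unlike the docstring context line that passes `ds`.
ids = split_dataset_by_node(ids, world_size=8, rank=0)

# Each of the 4 workers receives 64 / 4 = 16 of this node's shards.
dataloader = torch.utils.data.DataLoader(ids, num_workers=4)

for example in dataloader:
    pass  # training step would go here
```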
