Philip's blog #35

p208p2002 opened this issue Mar 2, 2024 · 0 comments
https://blog.philip-huang.tech/?page=iterable-style-dataset-worker-setting

An iterable-style dataset can iterate over massive amounts of training data, but when multiple workers are used, each worker gets an identical copy of the dataset object. PyTorch leaves it to the developer to implement the logic that prevents workers from yielding duplicate data.

The data is usually a generator object, so even when every worker holds its own copy, memory usage stays low.

Following PyTorch's official recommendation, we can use torch.utils.data.get_worker_info() to configure each worker so that the workers partition the data among themselves.

For iterable-style datasets, since each worker process gets a replica of the dataset object, naive multi-process loading will often result in duplicated data. Using torch.utils.data.get_worker_info() and/or worker_init_fn, users may configure each replica independently.
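The sharding pattern the quote describes can be sketched roughly as follows (a minimal example in the spirit of the PyTorch docs; RangeDataset and the start/end parameters are illustrative names, not part of any real API):

```python
import math

from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class RangeDataset(IterableDataset):
    """Yields integers in [start, end); each worker takes a disjoint slice."""

    def __init__(self, start: int, end: int):
        self.start, self.end = start, end

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: iterate over the full range.
            lo, hi = self.start, self.end
        else:
            # Multi-process loading: carve out this worker's contiguous shard
            # so that no two workers yield the same item.
            per_worker = math.ceil((self.end - self.start) / info.num_workers)
            lo = self.start + info.id * per_worker
            hi = min(lo + per_worker, self.end)
        return iter(range(lo, hi))


if __name__ == "__main__":
    # With 2 workers, worker 0 yields 0..3 and worker 1 yields 4..6:
    # every item appears exactly once, with no duplicates.
    loader = DataLoader(RangeDataset(0, 7), num_workers=2, batch_size=None)
    print(sorted(int(x) for x in loader))
```

Without the get_worker_info() branch, both workers would iterate the full range and every item would be emitted once per worker.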
