Philip's blog #35

p208p2002 opened this issue Mar 2, 2024 · 0 comments
https://blog.philip-huang.tech/?page=iterable-style-dataset-worker-setting

An iterable-style dataset can iterate over massive amounts of training data, but when multiple workers are used, each worker gets an identical copy of the dataset object. PyTorch leaves it to the developer to implement the logic that prevents workers from yielding duplicate data.

The data is usually a generator object, so even when every worker holds its own copy, memory usage stays low.

Following PyTorch's official recommendation, we can use torch.utils.data.get_worker_info() to configure each worker so that the workers partition the data among themselves.

For iterable-style datasets, since each worker process gets a replica of the dataset object, naive multi-process loading will often result in duplicated data. Using torch.utils.data.get_worker_info() and/or worker_init_fn, users may configure each replica independently.
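The sharding pattern the quote describes can be sketched roughly as follows (a minimal example in the spirit of the PyTorch docs; RangeDataset and the start/end parameters are illustrative names, not part of any real API):

```python
import math

from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class RangeDataset(IterableDataset):
    """Yields integers in [start, end); each worker takes a disjoint slice."""

    def __init__(self, start: int, end: int):
        self.start, self.end = start, end

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: iterate over the full range.
            lo, hi = self.start, self.end
        else:
            # Multi-process loading: carve out this worker's contiguous shard
            # so that no two workers yield the same item.
            per_worker = math.ceil((self.end - self.start) / info.num_workers)
            lo = self.start + info.id * per_worker
            hi = min(lo + per_worker, self.end)
        return iter(range(lo, hi))


if __name__ == "__main__":
    # With 2 workers, worker 0 yields 0..3 and worker 1 yields 4..6:
    # every item appears exactly once, with no duplicates.
    loader = DataLoader(RangeDataset(0, 7), num_workers=2, batch_size=None)
    print(sorted(int(x) for x in loader))
```

Without the get_worker_info() branch, both workers would iterate the full range and every item would be emitted once per worker.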
