«MPDataset» implements a new iterable-style dataset class for large-scale data loading.
The mp/ directory provides a simple implementation, and the corresponding tests can be found under tests/.
The complete MPDataset implementation has been integrated into the zcls repository; see mp_dataset.py and MPDataset.
The following are the test results on CIFAR100:
arch | dataset | shuffle | GPUs | top-1 (%) | top-5 (%) |
---|---|---|---|---|---|
sfv1_3g1x | CIFAR100 | no | 1 | 69.470 | 91.350 |
sfv1_3g1x | MPDataset | no | 1 | 67.340 | 89.560 |
sfv1_3g1x | GeneralDataset | no | 1 | 1.010 | 4.960 |
sfv1_3g1x | CIFAR100 | yes | 1 | 70.350 | 91.040 |
sfv1_3g1x | MPDataset | yes | 1 | 68.000 | 90.030 |
sfv1_3g1x | GeneralDataset | yes | 1 | 68.680 | 90.660 |
sfv1_3g1x | CIFAR100 | no | 3 | 69.716 | 91.112 |
sfv1_3g1x | MPDataset | no | 3 | 67.367 | 89.652 |
sfv1_3g1x | GeneralDataset | no | 3 | 1.420 | 5.879 |
sfv1_3g1x | CIFAR100 | yes | 3 | 70.756 | 91.972 |
sfv1_3g1x | MPDataset | yes | 3 | 68.806 | 90.252 |
sfv1_3g1x | GeneralDataset | yes | 3 | 68.656 | 90.472 |
- For the `dataset` item, refer to Dataset (a construction sketch follows this list):
  - CIFAR100: uses the dataset class provided by PyTorch
  - MPDataset: uses the custom iterable-style dataset class
  - GeneralDataset: a wrapper class based on ImageFolder
- The complete configuration file is located under configs/
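As a rough illustration of the first and third dataset variants (this is not the repository's actual wrapper code, and the image-folder path below is hypothetical), the CIFAR100 class ships with torchvision, while a GeneralDataset-style loader wraps an on-disk image tree with ImageFolder:

```python
from torchvision import datasets, transforms

transform = transforms.ToTensor()

# CIFAR100: the dataset class shipped with torchvision (uses the official binary files).
cifar = datasets.CIFAR100(root='./data', train=True, download=True, transform=transform)

# GeneralDataset-style: wrap a directory of images (one sub-folder per class) with ImageFolder.
# './data/cifar100_images/train' is a hypothetical path used only for illustration.
folder = datasets.ImageFolder(root='./data/cifar100_images/train', transform=transform)
```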
There is no obvious accuracy difference between MPDataset and GeneralDataset, and with shuffling they can do even better (the data files were created in the original data loading order, so shuffling the data first yields better results).
One strange phenomenon remains: using the official CIFAR files always gives better results.
- Why is there such a large accuracy difference between using the image dataset and using PyTorch's own dataset directly?
Given current large-scale training needs (tens of millions or even hundreds of millions of samples), the training environment needs further optimization. In PyTorch's implementation, data can be loaded and preprocessed in parallel through multiple worker processes; however, each process keeps a full copy of the dataset, even though it only needs part of it.
In conventional map-style dataset usage, the sampler runs in the main process and distributes indices to the worker processes. Since v1.2, PyTorch provides a new iterable-style dataset class, IterableDataset, which lets each process define and use its own sampler. This repository defines an iterable-style dataset class for loading large-scale data, which ensures that each process retains only the part of the data it needs.
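A minimal sketch of the idea is shown below. This is not the repository's MPDataset implementation; it only illustrates how an iterable-style dataset can query `get_worker_info()` so that each DataLoader worker iterates over its own shard, and `load_sample` is a hypothetical placeholder for real decoding:

```python
import math

from torch.utils.data import IterableDataset, DataLoader, get_worker_info


class ShardedIterableDataset(IterableDataset):
    """Illustrative iterable-style dataset: each DataLoader worker reads only
    its own shard of the sample list instead of holding the full copy."""

    def __init__(self, sample_paths):
        super().__init__()
        self.sample_paths = sample_paths

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: iterate over everything.
            shard = self.sample_paths
        else:
            # Multi-process loading: give each worker a contiguous slice.
            per_worker = int(math.ceil(len(self.sample_paths) / info.num_workers))
            start = info.id * per_worker
            shard = self.sample_paths[start:start + per_worker]
        for path in shard:
            # load_sample is a hypothetical helper; replace it with real
            # image decoding and preprocessing.
            yield load_sample(path)


def load_sample(path):
    # Placeholder: return the path itself instead of a decoded image.
    return path


if __name__ == '__main__':
    dataset = ShardedIterableDataset([f'img_{i:05d}.jpg' for i in range(16)])
    loader = DataLoader(dataset, batch_size=4, num_workers=2)
    for batch in loader:
        print(batch)
```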
- zhujian - Initial work - zjykzj
Anyone's participation is welcome! Open an issue or submit PRs.
Small note:
- Git commit messages should comply with the Conventional Commits specification
- If versioned, please conform to the Semantic Versioning 2.0.0 specification
- If editing the README, please conform to the standard-readme specification.
Apache License 2.0 © 2021 zjykzj