[Example] Refactor and Polish Cifar10-DeepSpeed Code Example. #843
Description
I'm grateful to the DeepSpeed Team for their detailed examples, which are invaluable for learners like me ✨.
As a beginner, I attempted to learn from the official DeepSpeed tutorial/example, similar to how I approached learning PyTorch's DDP with a minimal code example like Example.
Upon comprehensive review, I believe the Cifar10 may be the most suitable code example for me to gain practical experience with DeepSpeed. However, I found some potential issues after going through it. The original example indeed illustrates how to train and test cifar10 using DeepSpeed, but:
with_cuda in argparse section) and comments (Sudden appearance of 2 passes in comments). Even some temporarily commented statements for debugging purposes.
gan_deepspeed_train.py for clarity.
Therefore, I believe that the Cifar10-deepspeed code example could benefit from some additional polishing and refactoring.
Proposal
I have proposed some changes to the example, based on my understanding and experience. These changes are open for discussion and I welcome feedback for further refinement. The aim of these modifications is to enhance the following aspects:
get_ds_config, test, and main. This allows us to quickly target our goals by collapsing irrelevant sections of the code while reading.
Log
To verify correctness, here are the logs I obtained after running it on my own server: