Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Example] Refactor and Polish Cifar10-DeepSpeed Code Example. #843

Merged
merged 3 commits into from
Jan 26, 2024

Conversation

keli-wen
Copy link
Contributor

@keli-wen keli-wen commented Jan 5, 2024

Description

I'm grateful to the DeepSpeed Team for their detailed examples, which are invaluable for learners like me ✨.

As a beginner, I attempted to learn from the DeepSpeed official tutorial/example, similar to how I approached learning PyTorch's DDP with a minimum code example like Example.

Upon comprehensive review, I believe the Cifar10 may be the most suitable code example for me to gain practical experience with DeepSpeed.

However, I found some potential issues after going through it. The original example indeed illustrates how to train and test cifar10 using DeepSpeed, but:

  • The code contains some unused or meaningless variables (such as with_cuda in argparse section) and comments (Sudden appearance of 2 passes in comments). Even some temporarily commented statements for debugging purposes.
  • The entire code lacks structure and is more akin to a jupyter notebook rather than an official example. I believe it should align more with the style of gan_deepspeed_train.py for clarity.
  • Some exploratory content is indeed interesting (such as showcasing images), but it can be confusing when approached with the aim of learning DeepSpeed best practices.

Therefore, I believe that the Cifar10-deepspeed code example could benefit from some additional polishing and refactoring.

Proposal

I have proposed some changes to the example, based on my understanding and experience. These changes are open for discussion and I welcome feedback for further refinement. The aim of these modifications is to enhance the following aspects:

  • Clearer code structure with functions like get_ds_config, test, and main. This allows us to quickly target our goals by collapsing irrelevant sections of the code while reading.
  • The example should minimize confusing variables/code/comments. I chose to remove completely unused variables and meaningless comments, including those interesting but not quite fitting visualizations and step-by-step debugging codes.
  • Clearer output, especially in a distributed environment where output is inherently messy. I propose some logical modifications to enhance the clarity of the output, aiming for a more user-friendly experience.
  • I attempted to format the entire function using some linters and simple comment formatting rules.

Log

To verify correctness, here are the logs I obtained after running it on my own server:

$ bash run_ds.sh
...
Accuracy of the network on the 10000 test images:  59 %
Accuracy of plane :  64 %
Accuracy of   car :  73 %
Accuracy of  bird :  56 %
Accuracy of   cat :  42 %
Accuracy of  deer :  48 %
Accuracy of   dog :  48 %
Accuracy of  frog :  59 %
Accuracy of horse :  62 %
Accuracy of  ship :  72 %
Accuracy of truck :  62 %
[2024-01-05 09:10:30,930] [INFO] [launch.py:347:main] Process 2124952 exits successfully.
[2024-01-05 09:10:30,931] [INFO] [launch.py:347:main] Process 2124957 exits successfully.
[2024-01-05 09:10:30,931] [INFO] [launch.py:347:main] Process 2124949 exits successfully.
[2024-01-05 09:10:30,932] [INFO] [launch.py:347:main] Process 2124937 exits successfully.

@keli-wen
Copy link
Contributor Author

keli-wen commented Jan 5, 2024

@microsoft-github-policy-service agree

@keli-wen
Copy link
Contributor Author

Hi DeepSpeed Team.

Could you please advise on any further steps needed to progress with the merge of my PR.

@loadams
Copy link
Contributor

loadams commented Jan 25, 2024

Hi DeepSpeed Team.

Could you please advise on any further steps needed to progress with the merge of my PR.

Hi @keli-wen - we will prioritize looking at this PR and reviewing it soon, thanks for your contributions

Copy link
Contributor

@conglongli conglongli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested. LGTM except two typos.

training/cifar/README.md Outdated Show resolved Hide resolved
training/cifar/README.md Outdated Show resolved Hide resolved
@conglongli conglongli merged commit 107681e into microsoft:master Jan 26, 2024
2 checks passed
stceum pushed a commit to stceum/DeepSpeedExamples that referenced this pull request Jan 27, 2024
…oft#843)

* Polish and Refactor Cifar10 Code Example

* fix typos

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants