Deepspeed-Domino #929 (Merged)

Changes from all commits (29 commits):
- 4f56482 add domino (chengming-zhang)
- a6e0559 use transformer from deepspeed (shenzheyu)
- c348644 clean args (chengming-zhang)
- 034270a mega opt (chengming-zhang)
- f867064 add opt & timer (shenzheyu)
- edab567 add opt (shenzheyu)
- da0c63b fix loss (chengming-zhang)
- 069f638 folder name (chengming-zhang)
- a95e398 Change arguent in pretrain script
- a90c082 Add readme for domino (shenzheyu)
- 1e09330 Merge branch 'master' of github.com:zhangsmallshark/DeepSpeedExamples (shenzheyu)
- addf1f1 Update readme for domino (shenzheyu)
- 1f51b86 Fixing usage issues (tjruwase)
- 2f90f50 Rebase (tjruwase)
- 89205c8 update dataset (zhangsmallshark)
- d3afb28 megatron dependencies (zhangsmallshark)
- bce66a5 path (zhangsmallshark)
- 4546f52 Update README.md (shenzheyu)
- a1eea24 remove imports (zhangsmallshark)
- bad69e4 update import (zhangsmallshark)
- 7a16420 Update README.md (shenzheyu)
- ffa84d4 Minor example script changes (tjruwase)
- e4e9c91 conflict fixed (zhangsmallshark)
- eccdf38 train bash (zhangsmallshark)
- f7fb12f fix pull (zhangsmallshark)
- 360e54a require (zhangsmallshark)
- 3fdc0c5 Merge branch 'master' into master (loadams)
- 47fffed Update README.md (shenzheyu)
- 9c3ca5f Merge branch 'master' into master (tjruwase)
# Domino Example

## Install Dependency Libraries

```
pip install -r requirements.txt
```

## Prepare the Dataset

Follow the instructions from [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/universal_checkpointing#download-and-pre-process-training-dataset) to prepare the training dataset.

## Execute Domino Training

To start training, adjust the following parameters in the script as needed:

- **GPUS_PER_NODE**: Number of GPUs per node.
- **CHECKPOINT_PATH**: Path to the checkpoint, if applicable.
- **VOCAB_FILE**, **MERGE_FILE**, **DATA_PATH**: Paths to the dataset files.
- **--micro-batch-size**: Batch size per GPU.
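For reference, the variables at the top of a pretraining script might be set like this. All values and paths below are hypothetical placeholders; substitute the ones from your own environment and dataset preparation step:

```bash
# Hypothetical example values; replace with paths from your own setup.
GPUS_PER_NODE=4
CHECKPOINT_PATH=checkpoints/gpt3_2.7b          # optional, if resuming from a checkpoint
VOCAB_FILE=data/gpt2-vocab.json
MERGE_FILE=data/gpt2-merges.txt
DATA_PATH=data/my-gpt2_text_document
```

`--micro-batch-size` is passed as a command-line flag to the training launcher rather than set as a variable; lower it if you hit out-of-memory errors on your GPUs.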

### Available Models and Scripts

| Model      | Script                  |
|------------|-------------------------|
| GPT-3 2.7B | `pretrain_gpt3_2.7b.sh` |
| GPT-3 6.7B | `pretrain_gpt3_6.7b.sh` |
| LLaMA 7B   | `pretrain_llama_7b.sh`  |
| LLaMA 13B  | `pretrain_llama_13b.sh` |

### Example

To train the GPT-3 2.7B model, run the following command:

```bash
bash pretrain_gpt3_2.7b.sh
```

The output should look like this:

```
training ...
iteration: 1 | loss: 11.318 | iteration time (ms): 2174.0469932556152
iteration: 2 | loss: 11.307 | iteration time (ms): 1414.4024848937988
iteration: 3 | loss: 11.323 | iteration time (ms): 1385.9455585479736
iteration: 4 | loss: 11.310 | iteration time (ms): 1475.5175113677979
iteration: 5 | loss: 11.306 | iteration time (ms): 1395.7207202911377
iteration: 6 | loss: 11.315 | iteration time (ms): 1392.2104835510254
iteration: 7 | loss: 11.314 | iteration time (ms): 1402.6703834533691
iteration: 8 | loss: 11.309 | iteration time (ms): 1450.613260269165
iteration: 9 | loss: 11.305 | iteration time (ms): 1473.1688499450684
iteration: 10 | loss: 11.320 | iteration time (ms): 1398.4534740447998
[2024-11-04 15:32:30,918] [INFO] [launch.py:351:main] Process 73015 exits successfully.
[2024-11-04 15:32:30,918] [INFO] [launch.py:351:main] Process 73017 exits successfully.
[2024-11-04 15:32:30,919] [INFO] [launch.py:351:main] Process 73014 exits successfully.
[2024-11-04 15:32:30,919] [INFO] [launch.py:351:main] Process 73016 exits successfully.
```
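To summarize throughput from a log in this format, the per-iteration times can be averaged with standard shell tools. This is a sketch: the two sample lines are taken from the output above, and in practice you would pipe in your own captured training log instead:

```bash
# Two sample lines from the output above; in practice, pipe in your real log.
printf '%s\n' \
  'iteration: 1 | loss: 11.318 | iteration time (ms): 2174.0469932556152' \
  'iteration: 2 | loss: 11.307 | iteration time (ms): 1414.4024848937988' |
grep -o 'iteration time (ms): [0-9.]*' |
awk '{ sum += $NF; n++ } END { printf "mean iteration time: %.1f ms\n", sum / n }'
# prints: mean iteration time: 1794.2 ms
```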

## Advanced Usage

You can compile PyTorch and Apex from source for better performance.

### Compile PyTorch from Source

Compiling PyTorch from source can enable TorchScript (JIT) support.

```
git clone -b v2.1.0 https://github.com/pytorch/pytorch.git
cd pytorch
git submodule sync
git submodule update --init --recursive
conda install cmake ninja
pip install -r requirements.txt
conda install intel::mkl-static intel::mkl-include
# install the magma-cuda* build that matches your CUDA version (https://anaconda.org/pytorch/repo)
conda install -c pytorch magma-cuda121
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py develop

# Build torchvision
git clone https://github.com/pytorch/vision.git
cd vision
python setup.py develop
```

### Build Apex

```
git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1), which supports multiple --config-settings with the same key:
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" --config-settings "--build-option=--fast_layer_norm" ./
# otherwise:
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" ./
```
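After building, a quick import check can confirm that both packages are visible to the interpreter. This is a sketch that assumes `python3` is on your PATH in the same environment where you ran the builds:

```bash
# Report whether the freshly built packages are importable.
for pkg in torch apex; do
  if python3 -c "import ${pkg}" 2>/dev/null; then
    echo "${pkg}: importable"
  else
    echo "${pkg}: NOT importable"
  fi
done
```

If either package reports as not importable, re-activate the conda environment used for the build before launching training.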