Deepspeed-Domino #929

Merged 29 commits on Nov 7, 2024
4f56482
add domino
chengming-zhang Sep 18, 2024
a6e0559
use transformer from deepspeed
shenzheyu Sep 19, 2024
c348644
clean args
chengming-zhang Sep 23, 2024
034270a
mega opt
chengming-zhang Sep 25, 2024
f867064
add opt & timer
shenzheyu Sep 26, 2024
edab567
add opt
shenzheyu Sep 26, 2024
da0c63b
fix loss
chengming-zhang Sep 26, 2024
069f638
folder name
chengming-zhang Sep 30, 2024
a95e398
Change arguent in pretrain script
Oct 15, 2024
a90c082
Add readme for domino
shenzheyu Oct 15, 2024
1e09330
Merge branch 'master' of github.com:zhangsmallshark/DeepSpeedExamples
shenzheyu Oct 15, 2024
addf1f1
Update readme for domino
shenzheyu Oct 15, 2024
1f51b86
Fixing usage issues
tjruwase Oct 18, 2024
2f90f50
Rebase
tjruwase Oct 18, 2024
89205c8
update dataset
zhangsmallshark Oct 18, 2024
d3afb28
megatron dependencies
zhangsmallshark Oct 18, 2024
bce66a5
path
zhangsmallshark Oct 18, 2024
4546f52
Update README.md
shenzheyu Oct 21, 2024
a1eea24
remove imports
zhangsmallshark Oct 23, 2024
bad69e4
update import
zhangsmallshark Oct 24, 2024
7a16420
Update README.md
shenzheyu Oct 28, 2024
ffa84d4
Minor example script changes
tjruwase Oct 29, 2024
e4e9c91
conflict fixed
zhangsmallshark Oct 29, 2024
eccdf38
train bash
zhangsmallshark Oct 29, 2024
f7fb12f
fix pull
zhangsmallshark Oct 29, 2024
360e54a
require
zhangsmallshark Oct 29, 2024
3fdc0c5
Merge branch 'master' into master
loadams Oct 29, 2024
47fffed
Update README.md
shenzheyu Nov 4, 2024
9c3ca5f
Merge branch 'master' into master
tjruwase Nov 7, 2024
86 changes: 86 additions & 0 deletions training/DeepSpeed-Domino/README.md
@@ -0,0 +1,86 @@
# Domino Example

## Install Dependency Libraries
```bash
pip install -r requirements.txt
```

## Prepare the Dataset
Follow the instructions from [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/universal_checkpointing#download-and-pre-process-training-dataset) to prepare the training dataset.

## Execute Domino Training

To start training, adjust the following parameters in the script as needed:

- **GPUS_PER_NODE**: Number of GPUs per node.
- **CHECKPOINT_PATH**: Path to the checkpoint, if applicable.
- **VOCAB_FILE**, **MERGE_FILE**, **DATA_PATH**: Paths to the dataset files.
- **--micro-batch-size**: Batch size per GPU.
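
As a rough illustration of the parameters listed above, the top of a pretraining script might look like the sketch below. Every value here is a hypothetical placeholder (GPU count, directory layout, file names), and `MICRO_BATCH_SIZE` and `TRAIN_ARGS` are illustrative variable names that may not match the shipped scripts, so check the actual script before editing.

```bash
# Hypothetical values; replace them with the ones for your own machine and dataset.
GPUS_PER_NODE=4
CHECKPOINT_PATH="checkpoint"
VOCAB_FILE="dataset/gpt2-vocab.json"        # assumed file name
MERGE_FILE="dataset/gpt2-merges.txt"        # assumed file name
DATA_PATH="dataset/my-gpt2_text_document"   # assumed dataset prefix

# Illustrative variable; the real scripts may hard-code this flag instead.
MICRO_BATCH_SIZE=8
TRAIN_ARGS="--micro-batch-size ${MICRO_BATCH_SIZE}"

# Warn early about missing tokenizer files instead of failing mid-launch.
for f in "$VOCAB_FILE" "$MERGE_FILE"; do
  [ -f "$f" ] || echo "warning: missing $f" >&2
done
```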

### Available Models and Scripts

| Model | Script |
|------------|--------------------------|
| GPT-3 2.7B | `pretrain_gpt3_2.7b.sh` |
| GPT-3 6.7B | `pretrain_gpt3_6.7b.sh` |
| LLaMA 7B | `pretrain_llama_7b.sh` |
| LLaMA 13B | `pretrain_llama_13b.sh` |

### Example

To train the GPT-3 2.7B model, run the following command:

```bash
bash pretrain_gpt3_2.7b.sh
```

The output should look like this:

```
training ...
iteration: 1 | loss: 11.318 | iteration time (ms): 2174.0469932556152
iteration: 2 | loss: 11.307 | iteration time (ms): 1414.4024848937988
iteration: 3 | loss: 11.323 | iteration time (ms): 1385.9455585479736
iteration: 4 | loss: 11.310 | iteration time (ms): 1475.5175113677979
iteration: 5 | loss: 11.306 | iteration time (ms): 1395.7207202911377
iteration: 6 | loss: 11.315 | iteration time (ms): 1392.2104835510254
iteration: 7 | loss: 11.314 | iteration time (ms): 1402.6703834533691
iteration: 8 | loss: 11.309 | iteration time (ms): 1450.613260269165
iteration: 9 | loss: 11.305 | iteration time (ms): 1473.1688499450684
iteration: 10 | loss: 11.320 | iteration time (ms): 1398.4534740447998
[2024-11-04 15:32:30,918] [INFO] [launch.py:351:main] Process 73015 exits successfully.
[2024-11-04 15:32:30,918] [INFO] [launch.py:351:main] Process 73017 exits successfully.
[2024-11-04 15:32:30,919] [INFO] [launch.py:351:main] Process 73014 exits successfully.
[2024-11-04 15:32:30,919] [INFO] [launch.py:351:main] Process 73016 exits successfully.
```
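
To compare runs, it can help to average the reported iteration times while skipping the first, slower iteration. The helper below is a minimal sketch, not part of the example itself: `avg_iter_ms` is a hypothetical function name, and it assumes the log lines keep the `iteration: N | loss: X | iteration time (ms): T` shape shown above.

```bash
# Hypothetical helper: mean iteration time (ms) from a saved training log,
# ignoring the first iteration, which includes warm-up overhead.
avg_iter_ms() {
  awk -F': ' '
    # Only lines that start with "iteration:" carry timing data;
    # the last ": "-separated field is the time in milliseconds.
    /^ *iteration:/ { n++; if (n > 1) { sum += $NF; cnt++ } }
    END { if (cnt) printf "%.1f\n", sum / cnt }
  ' "$1"
}
```

For example, after saving the output with `bash pretrain_gpt3_2.7b.sh 2>&1 | tee train.log`, calling `avg_iter_ms train.log` on the sample log above would report roughly 1421 ms.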

## Advanced Usage
You can compile PyTorch and Apex from source for better performance.

### Compile PyTorch from Source
Compiling PyTorch from source can enable JIT script support.
```bash
git clone -b v2.1.0 https://github.com/pytorch/pytorch.git
cd pytorch  # the remaining build steps run inside the cloned repository
git submodule sync
git submodule update --init --recursive
conda install cmake ninja
pip install -r requirements.txt
conda install intel::mkl-static intel::mkl-include
conda install -c pytorch magma-cuda121 # or the magma-cuda* that matches your CUDA version from https://anaconda.org/pytorch/repo
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py develop
cd ..

# Build torchvision (check out the release that matches your PyTorch version if the build fails)
git clone https://github.com/pytorch/vision.git
cd vision
python setup.py develop
cd ..
```

### Build Apex
```bash
git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" --config-settings "--build-option=--fast_layer_norm" ./
# otherwise, pass every build flag through --global-option
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" ./
```
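
Which of the two Apex commands applies depends on the installed pip version. The sketch below shows one way to check this before building; `version_ge` is a hypothetical helper, and it assumes GNU `sort` with the `-V` (version sort) option is available.

```bash
# Hypothetical helper: succeeds when version $1 >= version $2.
# Relies on GNU sort -V; the smaller version sorts first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# `pip --version` prints "pip <version> from ..."; the second field is the version.
pip_version="$(pip --version 2>/dev/null | awk '{print $2}')"

if version_ge "${pip_version:-0}" 23.1; then
  echo "pip ${pip_version}: use the --config-settings form"
else
  echo "pip ${pip_version:-not found}: use the --global-option form"
fi
```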
tjruwase marked this conversation as resolved.