End-to-End LLM Model Development with Torchtitan and Torchtune #341

Open · wants to merge 856 commits into base: main from torchtitan-torchtune

Conversation

@KeitaW (Collaborator) commented May 20, 2024

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

KeitaW and others added 30 commits March 17, 2024 08:28
SMHP: Remove 14k log lines from efa exporter LCC
Add conda and docker environment setups for 16.pytorch-capu-ddp test case.
Bump dcgm exporter version to correctly capture GPU utilization
NCCL 2.19.4 has performance regression.
Rename 0.crate-conda-env.sh to 0.create-conda-env.sh
Updating CF template for HyperPod to support second private subnet
smp v2 llama2 training example using fp8
Signed-off-by: Sean Smith <seaam@amazon.com>
Validate Json in preflight check
…distributed-training into torchtitan-torchtune
@KeitaW force-pushed the torchtitan-torchtune branch from 64e0724 to 00dfbf5 on June 4, 2024 02:26
@KeitaW force-pushed the main branch 2 times, most recently from 44e448e to 1209815 on June 4, 2024 02:30
@KeitaW force-pushed the torchtitan-torchtune branch from 436b58c to 952eba3 on June 4, 2024 03:32
@KeitaW requested a review from pbelevich June 5, 2024 02:18
@KeitaW marked this pull request as ready for review June 5, 2024 02:18
@KeitaW (Collaborator, Author) commented Jun 11, 2024

Basic functionality has been implemented. Allow me to iterate on the rest in other PRs...

3.test_cases/torchtune/slurm/README.md: review thread (outdated, resolved)
3.test_cases/torchtune/slurm/README.md: review thread (outdated, resolved)
KeitaW and others added 3 commits June 11, 2024 14:23
Co-authored-by: Pavel Belevich <belevich@amazon.com>
…ent/README.md

Co-authored-by: Pavel Belevich <belevich@amazon.com>
Co-authored-by: Pavel Belevich <belevich@amazon.com>
* Evaluation
* Deployment

for details of each step, refer the [overview documentation](../../README.md).
Suggested change
for details of each step, refer the [overview documentation](../../README.md).
for details of each step, refer to the [overview documentation](../../README.md).

In this step, you will fine-tune the Llama3 model starting from the original checkpoint using the WikiText dataset. This process, known as Full-Parameter Finetuning, updates all the parameters in the original model. The configuration file used for this process is `./tutorials/e2e-llama3-70b-development/full_finetune_distributed.yaml`.

### Memory Consumption Challenges
One of the primary challenges during such training is memory consumption. A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter plus activation memory (6 bytes for parameters in mixed precision training, 8 bytes for AdamW, and 4 bytes for other overheads). For more details on the anatomy, see the [Hugging Face blog post](https://huggingface.co/docs/transformers/model_memory_anatomy) blog post. This means that training a 70B parameter model would require more than 1.12 TB of accelerated memory, which far exceeds the 80 GB capacity of H100 accelerated memory. To address this issue, torchtune integrates PyTorch Fully Sharded Data Parallel (FSDP).

@pbelevich (Collaborator) commented Jun 11, 2024:
Suggested change
One of the primary challenges during such training is memory consumption. A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter plus activation memory (6 bytes for parameters in mixed precision training, 8 bytes for AdamW, and 4 bytes for other overheads). For more details on the anatomy, see the [Hugging Face blog post](https://huggingface.co/docs/transformers/model_memory_anatomy) blog post. This means that training a 70B parameter model would require more than 1.12 TB of accelerated memory, which far exceeds the 80 GB capacity of H100 accelerated memory. To address this issue, torchtune integrates PyTorch Fully Sharded Data Parallel (FSDP).
One of the primary challenges during such training is memory consumption. A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter(6 bytes for parameter in mixed precision training, 4 bytes for gradient and 8 bytes for AdamW optimizer states) plus activation memory. For more details on the anatomy, see the [Hugging Face blog post](https://huggingface.co/docs/transformers/model_memory_anatomy) blog post. This means that training a 70B parameter model would require more than 1.12 TiB of accelerator's memory, which far exceeds the 80 GB capacity of H100 memory. To address this issue, torchtune integrates PyTorch Fully Sharded Data Parallel (FSDP).

Collaborator commented:
How was 1.12 TiB calculated?
70_000_000_000 * 18 = 1_260_000_000_000
1_260_000_000_000 / 1024 / 1024 / 1024 / 1024 = 1.15TiB
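
For reference, the same arithmetic on the command line, in both decimal and binary units (a quick sketch assuming `bc` is installed):

```
# 18 bytes per parameter for a 70B-parameter model, in decimal TB and binary TiB.
echo "scale=3; 70 * 10^9 * 18 / 10^12" | bc     # 1.260 (TB)
echo "scale=3; 70 * 10^9 * 18 / 1024^4" | bc    # 1.145 (TiB), i.e. the ~1.15 TiB above
```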

@pbelevich (Collaborator) commented Jun 11, 2024:
memory is not accelerated itself


### Basic concepts and relevant configuration

**FSDP** is a distributed training feature designed to efficiently handle large model training by sharding model parameters, gradients, and optimizer states across multiple devices. This approach significantly reduces memory consumption and optimizes resource utilization, making it possible to train models that are too large to fit on a single GPU. In `torchtune` users can launch FSDP training job with command `tune run full_finetune_distributed`.

Collaborator commented:
Suggested change
**FSDP** is a distributed training feature designed to efficiently handle large model training by sharding model parameters, gradients, and optimizer states across multiple devices. This approach significantly reduces memory consumption and optimizes resource utilization, making it possible to train models that are too large to fit on a single GPU. In `torchtune` users can launch FSDP training job with command `tune run full_finetune_distributed`.
**FSDP** is a distributed training technique designed to efficiently handle large model training by sharding model parameters, gradients, and optimizer states across multiple devices. This approach significantly reduces memory consumption and optimizes resource utilization, making it possible to train models that are too large to fit on a single GPU. In `torchtune` users can launch FSDP training job with command `tune run full_finetune_distributed`.
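
For reference, a minimal sketch of how this recipe is launched through torchtune's CLI, following the `tune run full_finetune_distributed` command mentioned above (the node and GPU counts below are placeholders; in this PR the same recipe is driven from an sbatch script instead):

```
# Sketch only: single-node launch of the distributed full-finetuning recipe.
# The config path is the one referenced earlier in this README.
tune run \
    --nnodes 1 \
    --nproc_per_node 8 \
    full_finetune_distributed \
    --config ./tutorials/e2e-llama3-70b-development/full_finetune_distributed.yaml
```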

```
--master_port $RANDOM
--nproc_per_node=8
--nnodes $NNODES
--nnodes=$SLURM_JOB_NUM_NODES
```

Collaborator commented:
--nnodes twice
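
A de-duplicated version of these launcher flags might look like the sketch below, keeping only the Slurm-derived node count (all remaining torchrun and recipe arguments unchanged from the sbatch script):

```
--master_port $RANDOM
--nproc_per_node=8
--nnodes=$SLURM_JOB_NUM_NODES
```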

```
sbatch tutorials/e2e-llama3-70b-development/full_finetune_distributed.sbatch
```

By default, this script launches the FSDP training job with two instances. Once the job has been scheduled, you will see the following outputs in the log file named `logs/full-finetuning*`:

@pbelevich (Collaborator) commented Jun 11, 2024:
I don't see where two instances are specified by default; I only see `--nnodes 1` / `--nnodes=1` in the sbatch files.
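
For what it's worth, in a typical Slurm setup the instance count is pinned in the sbatch header and forwarded to the launcher. A hypothetical two-node header (an illustration only, not the file as it currently stands in this PR; rendezvous flags omitted) would look roughly like:

```
#!/bin/bash
# Hypothetical header for a two-node run; the PR's full_finetune_distributed.sbatch
# reportedly sets --nnodes to 1.
#SBATCH --job-name=full-finetuning
#SBATCH --nodes=2                               # number of instances for the job
#SBATCH --output=logs/full-finetuning_%j.out

# Slurm sets SLURM_JOB_NUM_NODES to match --nodes, so the launcher flag below
# stays in sync with the header automatically. "$@" stands in for the recipe
# and config arguments passed through from the original script.
srun torchrun --nnodes=$SLURM_JOB_NUM_NODES --nproc_per_node=8 "$@"
```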

Labels: enhancement (New feature or request)
Projects: None yet