Unexpected Memory Usage and Latency with PP #1056
When running the examples/llama/pippy_llama.py script on two A800 GPUs, each rank is observed to consume the full model size in memory, rather than sharing the weights across the two GPUs. Additionally, the latency differs from the expected values. For comparison, when utilizing a single GPU, the performance metrics are as follows:

These results are measured with the following code:
Upon trying the initialization settings from examples/cpu_init/gpt2_cpu_init.py, a RuntimeError occurs when using the stage on the CUDA device created from a pipeline on the CPU device:

So I wonder whether this kind of memory consumption and latency is normal for PP?
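For context, the pattern that example follows is, roughly, to build the whole model in host memory and move only this rank's layers to its GPU. A minimal core-PyTorch sketch of that idea (the even layer split and all names here are illustrative placeholders, not the PiPPy API):

```python
import os
import torch
import torch.nn as nn

rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "2"))

# Build the full model on CPU: cheap host RAM, no GPU allocation yet.
model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)])

# Naive even split: each rank keeps 1/world_size of the layers and
# moves only that shard onto its own device.
per_rank = len(model) // world_size
shard = model[rank * per_rank:(rank + 1) * per_rank]
stage = shard.to(f"cuda:{rank}")
```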
PS: The issue persists across different versions of the software:

Comments
Hi, on latency: if you measure the first iteration, it will include the distributed initialization time (e.g., NCCL communicator initialization). You can try giving it some warm-up runs and then measure the latency.
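A minimal sketch of what warm-up-then-measure timing might look like (`stage_fn` and `inputs` are placeholder names, not taken from the example):

```python
import time
import torch

def measure_latency(stage_fn, inputs, warmup=5, iters=20):
    # Warm-up iterations absorb one-time costs such as NCCL
    # communicator setup; they are deliberately not timed.
    for _ in range(warmup):
        stage_fn(inputs)
    torch.cuda.synchronize()   # flush queued kernels before timing
    start = time.perf_counter()
    for _ in range(iters):
        stage_fn(inputs)
    torch.cuda.synchronize()   # wait for the last kernel to finish
    return (time.perf_counter() - start) / iters
```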
On memory consumption, it is expected to be high if you initialize the model on a real device.
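A common way to avoid materializing the full model on every rank is meta-device initialization: parameters carry only shapes and dtypes, and a rank later materializes just the shard it owns. A minimal core-PyTorch sketch (again with placeholder names; this is not the PiPPy API itself):

```python
import torch
import torch.nn as nn

# Under the meta device, parameters hold only shape/dtype metadata,
# so this allocates essentially no memory.
with torch.device("meta"):
    model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)])

# Materialize only this rank's shard on the GPU. to_empty() allocates
# uninitialized storage; real weights must then be loaded explicitly
# (e.g. from a checkpoint).
shard = model[:4].to_empty(device="cuda:0")
```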
, a RuntimeError occurs when using the stage on the CUDA device created from a pipeline on the CPU device:So I wonder if it's normal for PP with this kind of memory cunsumption and latency?
PS: The issue persists across different versions of the software: