Unexpected Memory Usage and Latency with PP #1056

Open
Lucius-THU opened this issue Apr 12, 2024 · 4 comments

Lucius-THU commented Apr 12, 2024

When running the examples/llama/pippy_llama.py script on two A800 GPUs, each rank consumes memory equal to the full model size, rather than each GPU holding only its share of the weights. The forward latency also differs from the expected values.

Rank: 0 Forward Latency: 0.7621021270751953s Peak memory: 26.341GiB
Rank: 1 Forward Latency: 0.8798844814300537s Peak memory: 26.204GiB

For comparison, when utilizing a single GPU, the performance metrics are as follows:

Rank: 0 Forward Latency: 0.45336079597473145s Peak memory: 26.252GiB

These results are measured with the following code:

import time
import torch

# Reset the peak-memory counter right before the timed step.
torch.cuda.reset_peak_memory_stats(device)
start = time.time()

# Only rank 0 feeds the real inputs; later ranks receive activations
# from the previous pipeline stage.
if rank == 0:
    args = inputs["input_ids"]
else:
    args = None
output = schedule.step(args)

# Note: CUDA work is launched asynchronously, so without a
# torch.cuda.synchronize(device) here the wall-clock time may not
# capture all queued GPU work.
end = time.time()
peak_mem = torch.cuda.max_memory_allocated(device)

When I instead try the initialization approach from examples/cpu_init/gpt2_cpu_init.py (creating the pipeline on the CPU device and then running the stage on the CUDA device), a RuntimeError occurs:

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

So I wonder: is this kind of memory consumption and latency normal for PP?

PS: The issue persists across different versions of the software:

  1. the latest torchpippy installed from source and torch==2.4.0.dev20240411
  2. torchpippy==0.2.0 from pip and torch==2.2.2
@kwen2501
Contributor

Hi, on latency: if you measure the first iteration, it will include the distributed initialization time (e.g. NCCL communicator initialization). You can try giving it some warm-up runs and then measure the latency.
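
For example, a minimal warm-up sketch (reusing the schedule, args, and device names from the measurement snippet above) could look like this:

# Untimed warm-up iterations: these absorb one-time costs such as NCCL
# communicator setup and CUDA context/allocator initialization.
for _ in range(3):
    schedule.step(args)
torch.cuda.synchronize(device)

# Time a steady-state iteration.
start = time.time()
output = schedule.step(args)
torch.cuda.synchronize(device)
latency = time.time() - start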

@kwen2501
Contributor

On examples/cpu_init/gpt2_cpu_init.py, I couldn't repro the error, whether with 2 ranks or 4 ranks.
Are you using the llama model with cpu init?

@kwen2501
Contributor

On memory consumption, it is expected to be high if you initialize the model on a real device.
We are actively developing techniques to support creating the initial model on the meta device:

with torch.device("meta"):
    model = Model()

pipe = pipeline(model, ...)
stage_mod = pipe.get_stage_module(stage_index)
stage_mod.load_state_dict(torch.load(PATH))
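
One caveat with this sketch: parameters created under torch.device("meta") have no real storage, so loading a checkpoint into the stage module typically also needs either a prior to_empty() call or load_state_dict(..., assign=True) (available in torch >= 2.1) so the loaded tensors replace the meta parameters.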

@Lucius-THU
Author

Thanks for your reply! I've confirmed that the latency measured after several warm-up runs is normal.

On examples/cpu_init/gpt2_cpu_init.py, I couldn't repro the error, whether with 2 ranks or 4 ranks. Are you using the llama model with cpu init?

Yes. Since I first noticed the issue with llama, I'm using examples/llama/pippy_llama.py with the cpu_init method; however, it seems that some ops (maybe indexing?) do not work properly when the stage runs on the CUDA device while the pipeline was created on the CPU device.
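
For what it's worth, that error message looks like the one PyTorch raises when an advanced-indexing op gets CUDA indices while the indexed tensor is still on the CPU, which would be consistent with a CPU-initialized stage receiving CUDA inputs (this is my reading, not confirmed). A minimal standalone reproducer, independent of torchpippy:

import torch

weight = torch.randn(10, 4)                    # tensor still on CPU
idx = torch.tensor([1, 2, 3], device="cuda")   # indices on CUDA
weight[idx]  # raises a device-mismatch RuntimeError like the one above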
