Unexpected Memory Usage and Latency with PP #1056

Open
Lucius-THU opened this issue Apr 12, 2024 · 4 comments

Lucius-THU commented Apr 12, 2024

When running the examples/llama/pippy_llama.py script on two A800 GPUs, each rank consumes memory equal to the full model size, rather than each GPU holding only its share of the weights. The forward latency also differs from the expected values.

Rank: 0 Forward Latency: 0.7621021270751953s Peak memory: 26.341GiB
Rank: 1 Forward Latency: 0.8798844814300537s Peak memory: 26.204GiB

For comparison, when utilizing a single GPU, the performance metrics are as follows:

Rank: 0 Forward Latency: 0.45336079597473145s Peak memory: 26.252GiB

These results are measured with the following code:

import time
import torch

# Reset the peak-memory counter right before the timed step.
torch.cuda.reset_peak_memory_stats(device)
start = time.time()

# Only rank 0 feeds the real inputs; later ranks receive activations
# from the previous pipeline stage.
if rank == 0:
    args = inputs["input_ids"]
else:
    args = None
output = schedule.step(args)

# Note: CUDA work is launched asynchronously, so without a
# torch.cuda.synchronize(device) here the wall-clock time may not
# capture all queued GPU work.
end = time.time()
peak_mem = torch.cuda.max_memory_allocated(device)

When I instead try the initialization approach from examples/cpu_init/gpt2_cpu_init.py (creating the pipeline on the CPU device and then running the stage on the CUDA device), a RuntimeError occurs:

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

So I wonder: is this kind of memory consumption and latency normal for PP?

PS: The issue persists across different versions of the software:

  1. the latest torchpippy installed from source and torch==2.4.0.dev20240411
  2. torchpippy==0.2.0 from pip and torch==2.2.2
@kwen2501
Contributor

Hi, on latency: if you measure the first iteration, it will include the distributed initialization time (e.g. NCCL communicator initialization). You can try giving it some warm-up runs and then measure the latency.
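
For example, a minimal warm-up sketch (reusing the schedule, args, and device names from the measurement snippet above) could look like this:

# Untimed warm-up iterations: these absorb one-time costs such as NCCL
# communicator setup and CUDA context/allocator initialization.
for _ in range(3):
    schedule.step(args)
torch.cuda.synchronize(device)

# Time a steady-state iteration.
start = time.time()
output = schedule.step(args)
torch.cuda.synchronize(device)
latency = time.time() - start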

@kwen2501
Contributor

On examples/cpu_init/gpt2_cpu_init.py, I couldn't repro the error, whether with 2 ranks or 4 ranks.
Are you using the llama model with cpu init?

@kwen2501
Contributor

On memory consumption, it is expected to be high if you initialize the model on a real device.
We are actively developing techniques to support creating the initial model on the meta device:

with torch.device("meta"):
    model = Model()

pipe = pipeline(model, ...)
stage_mod = pipe.get_stage_module(stage_index)
stage_mod.load_state_dict(torch.load(PATH))
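
One caveat with this sketch: parameters created under torch.device("meta") have no real storage, so loading a checkpoint into the stage module typically also needs either a prior to_empty() call or load_state_dict(..., assign=True) (available in torch >= 2.1) so the loaded tensors replace the meta parameters.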

@Lucius-THU
Author

Thanks for your reply! I've confirmed that the latency measured after several warm-up runs is normal.

On examples/cpu_init/gpt2_cpu_init.py, I couldn't repro the error, whether with 2 ranks or 4 ranks. Are you using the llama model with cpu init?

Yes. Since I first noticed the issue with llama, I'm using examples/llama/pippy_llama.py with the cpu_init method; however, it seems that some ops (maybe indexing?) do not work properly when the stage runs on the CUDA device while the pipeline was created on the CPU device.
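
For what it's worth, that error message looks like the one PyTorch raises when an advanced-indexing op gets CUDA indices while the indexed tensor is still on the CPU, which would be consistent with a CPU-initialized stage receiving CUDA inputs (this is my reading, not confirmed). A minimal standalone reproducer, independent of torchpippy:

import torch

weight = torch.randn(10, 4)                    # tensor still on CPU
idx = torch.tensor([1, 2, 3], device="cuda")   # indices on CUDA
weight[idx]  # raises a device-mismatch RuntimeError like the one above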
