We have recently been testing the context parallelism (CP) strategy in a 2D configuration: FSDP + CP.
As we understand it, CP shards the sequence dimension, and since the attention kernel must attend over the whole sequence, each GPU needs to gather the sharded KV from the other ranks using some collective communication kernel.
However, we did not see any such kernels in the trace; we only found the All-Gather of parameters in the pre-forward phase.
Is there anything we misunderstood? Please add your comments for better understanding.
Thanks.
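To illustrate what we expect, here is a minimal single-process sketch in plain NumPy (not the actual PyTorch CP implementation, and no real collectives): each simulated rank keeps only its Q shard, but reconstructs the full K/V before attending, which in a real run is where the gather communication would appear.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # plain single-device scaled dot-product attention
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def cp_attention(q, k, v, world_size):
    # shard Q, K, V along the sequence dimension, one shard per "rank"
    q_shards = np.array_split(q, world_size)
    k_shards = np.array_split(k, world_size)
    v_shards = np.array_split(v, world_size)
    # simulated all-gather: every rank reconstructs the full K/V
    # (this concat is the step a real CP run does via communication)
    k_full = np.concatenate(k_shards)
    v_full = np.concatenate(v_shards)
    # each rank attends with its local Q shard over the full sequence
    outs = [attention(qs, k_full, v_full) for qs in q_shards]
    return np.concatenate(outs)

rng = np.random.default_rng(0)
seq_len, head_dim = 8, 4
q, k, v = (rng.standard_normal((seq_len, head_dim)) for _ in range(3))

# sequence-sharded result matches the unsharded computation
assert np.allclose(cp_attention(q, k, v, world_size=4), attention(q, k, v))
```

Note this sketch assumes a plain all-gather of K/V; an actual implementation may instead overlap computation with P2P send/recv (ring-attention style), in which case the trace would show point-to-point kernels rather than a single All-Gather.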
@fegin Thanks for your feedback.
I was running the experiments on an AMD platform; I assume you were using NVIDIA GPUs.
My command is almost the same as yours, but I can only find the All-Gather stream issued by FSDP, and no communication streams issued by CP.
Given the hardware difference, perhaps the kernels issued by CP were not executed, or were not captured by the profiler.
I am not sure why there is no such trace on the AMD GPU.