apply sort_waits if dist_prims.wait is in a trace #776

Merged: 6 commits into main on Jul 19, 2024

Conversation

crcrpar (Collaborator) commented Jul 16, 2024

What does this PR do?

Fixes #765

@crcrpar crcrpar force-pushed the crpa/torch-native-dist-ops-with-waits-sorted branch 3 times, most recently from 18aa07c to 163e7b3 Compare July 17, 2024 11:22
@crcrpar crcrpar requested a review from kiya00 July 17, 2024 13:00
@crcrpar crcrpar marked this pull request as ready for review July 17, 2024 13:00
if bsym.sym.id in {all_gather_prim_impl.id, reduce_scatter_prim_impl.id}:
    comm_idx = idx
if bsym.sym.id == wait_prim_impl.id:
    self.assertGreater(idx, comm_idx + 2)
crcrpar (Collaborator, Author) commented:

The goal of this check is to make sure that sort_waits moves the wait away from the comm. In-place comms are expressed as a pair of thunder's functional comm and a copy, so initially the comm and its wait sit right next to each other.
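
For reference, a minimal standalone sketch of what this check does. The helper name assert_waits_are_sorted and the comm_ids, wait_id, and min_gap parameters are illustrative, not part of the actual test:

def assert_waits_are_sorted(bound_symbols, comm_ids, wait_id, min_gap=2):
    # Walk the trace in order; every wait must appear at least min_gap
    # bound symbols after the most recent communication op.
    comm_idx = None
    for idx, bsym in enumerate(bound_symbols):
        if bsym.sym.id in comm_ids:
            comm_idx = idx
        elif bsym.sym.id == wait_id:
            assert comm_idx is not None, "wait encountered before any comm"
            assert idx > comm_idx + min_gap, f"wait at {idx} too close to comm at {comm_idx}"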

@crcrpar crcrpar force-pushed the crpa/torch-native-dist-ops-with-waits-sorted branch 2 times, most recently from 66b1dc1 to fbe9526 Compare July 18, 2024 00:54
from thunder.distributed.utils import maybe_sort_waits

with langctxs.langctx(cd.langctx):
    tmp_comp_trc = _transform_for_operator_executor_execution(computation_trc, cd.executors_list)
Collaborator:

We also have this _transform_for_operator_executor_execution in transform_for_execution; does it affect anything? I also noticed that bucketing used to be applied before _transform_for_operator_executor_execution and is now applied after it; I don't know if that matters.

crcrpar (Collaborator, Author) commented Jul 19, 2024:

To effectively apply sort_waits to a trace, the waits need to be BoundSymbols. If a trace has a BoundSymbol representing an in-place distributed op, its subsymbols are an out-of-place dist op, a wait, and a copy, and there is no BoundSymbol whose sym is wait.
The call to _transform_for_operator_executor_execution here flattens such bsyms of in-place dist ops, so tmp_comp_trc has bsyms of waits whenever computation_trc has bsyms of in-place dist ops.

Before _transform_for_operator_executor_execution:

# Constructed by Dead Code Elimination (took 0 milliseconds)
import thunder
import thunder.core.prims as prims
import thunder.torch as ltorch
import torch
from thunder.executors.torchex import no_autocast

@torch.no_grad()
@no_autocast
def computation(a, b, output):
  # a: "cuda:1 f32[4, 2]"
  # b: "cuda:1 f32[4, 2]"
  # output: "cuda:1 f32[8, 2]"

  # /opt/pytorch/lightning-thunder/inplace_dist.py:12:      c = a + b
  result = ltorch.add(a, b, alpha=None)  # result: "cuda:1 f32[4, 2]"
    # result = prims.add(a, b)  # result: "cuda:1 f32[4, 2]"
  t2 = ltorch.all_gather(output, result, _torch_distributed_distributed_c10d_ProcessGroup_0, True)  # t2: "cuda:1 f32[8, 2]"
    # p1 = thunder.distributed.prims.all_gather(result, _torch_distributed_distributed_c10d_ProcessGroup_0, True, None)  # p1: "FUTURE thunder.devices.Device(type='cuda:1') f32[8, 2]"
    # t2 = thunder.distributed.prims.wait(p1)  # t2: "cuda:1 f32[8, 2]"
    # t2 = ltorch.view(t2, (8, 2))  # t2: "cuda:1 f32[8, 2]"

  # /opt/pytorch/lightning-thunder/inplace_dist.py:14:      e = c + 1
  e = ltorch.add(result, 1, alpha=None)  # e: "cuda:1 f32[4, 2]"
    # _ = prims.convert_element_type(1, float)
    # e = prims.add(result, 1.0)  # e: "cuda:1 f32[4, 2]"

  # /opt/pytorch/lightning-thunder/inplace_dist.py:16:      f = e * b
  f = ltorch.mul(e, b)  # f: "cuda:1 f32[4, 2]"
    # f = prims.mul(e, b)  # f: "cuda:1 f32[4, 2]"
  t6 = ltorch.mul(t2, 2)  # t6: "cuda:1 f32[8, 2]"
    # t6 = ltorch.mul(t2, 2)  # t6: "cuda:1 f32[8, 2]"
      # _ = prims.convert_element_type(2, float)
      # t6 = prims.mul(t2, 2.0)  # t6: "cuda:1 f32[8, 2]"
  prims.copy_(t6, output)

  # /opt/pytorch/lightning-thunder/inplace_dist.py:17:      output *= 2
  return f

After _transform_for_operator_executor_execution:

# Constructed by Transform for operator executor execution (took 0 milliseconds)
import thunder
import thunder.core.prims as prims
import thunder.torch as ltorch
from torch import Tensor
import torch
from thunder.executors.torchex import no_autocast

@torch.no_grad()
@no_autocast
def computation(a, b, output):
  # a: "cuda:1 f32[4, 2]"
  # b: "cuda:1 f32[4, 2]"
  # output: "cuda:1 f32[8, 2]"

  # /opt/pytorch/lightning-thunder/inplace_dist.py:12:      c = a + b
  result = ltorch.add(a, b, alpha=None)  # result: "cuda:1 f32[4, 2]"
    # result = prims.add(a, b)  # result: "cuda:1 f32[4, 2]"
  p1 = torch_all_gather_prim_impl(result, _torch_distributed_distributed_c10d_ProcessGroup_1, True, None)  # p1: "FUTURE thunder.devices.Device(type='cuda:1') f32[8, 2]"
  t2 = torch_wait_prim_impl(p1)  # t2: "cuda:1 f32[8, 2]"
  t2 = Tensor.view(t2, (8, 2))  # t2: "cuda:1 f32[8, 2]"
    # t2 = ltorch.view(t2, (8, 2))  # t2: "cuda:1 f32[8, 2]"
      # t2 = ltorch.reshape(t2, (8, 2))  # t2: "cuda:1 f32[8, 2]"

  # /opt/pytorch/lightning-thunder/inplace_dist.py:14:      e = c + 1
  e = ltorch.add(result, 1, alpha=None)  # e: "cuda:1 f32[4, 2]"
    # _ = prims.convert_element_type(1, float)
    # e = prims.add(result, 1.0)  # e: "cuda:1 f32[4, 2]"

  # /opt/pytorch/lightning-thunder/inplace_dist.py:16:      f = e * b
  f = ltorch.mul(e, b)  # f: "cuda:1 f32[4, 2]"
    # f = prims.mul(e, b)  # f: "cuda:1 f32[4, 2]"
  t6 = ltorch.mul(t2, 2)  # t6: "cuda:1 f32[8, 2]"
    # t6 = ltorch.mul(t2, 2)  # t6: "cuda:1 f32[8, 2]"
      # _ = prims.convert_element_type(2, float)
      # t6 = prims.mul(t2, 2.0)  # t6: "cuda:1 f32[8, 2]"
  prims.copy_(t6, output)

  # /opt/pytorch/lightning-thunder/inplace_dist.py:17:      output *= 2
  return f
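
A rough sketch of the gating this enables (a hypothetical helper, not thunder's actual maybe_sort_waits): flatten first so in-place dist ops expose their wait subsymbols as top-level bound symbols, then sort only if a wait is actually present.

def sort_waits_if_present(trace, flatten_pass, sort_waits_pass, wait_sym_ids):
    # flatten_pass stands in for _transform_for_operator_executor_execution;
    # it decomposes in-place dist ops into comm, wait, and copy bound symbols.
    flat_trace = flatten_pass(trace)
    has_wait = any(bsym.sym.id in wait_sym_ids for bsym in flat_trace.bound_symbols)
    # Only reorder when there is a wait bound symbol to move.
    return sort_waits_pass(flat_trace) if has_wait else trace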

bw_extrace = sort_waits(bw_extrace)
if (not use_ddp) and (not use_fsdp):
Collaborator:

Is this sorting also needed for in-place comms when using ddp/fsdp?

crcrpar (Collaborator, Author) commented Jul 19, 2024:

Could you elaborate? I'm not quite following the question.

In-place comms are not currently used in thunder's ddp & fsdp. This check is rather to avoid redundant application for ddp and unwanted application for fsdp.
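
In other words (an illustrative helper, not the literal diff): ddp and fsdp already run sort_waits on their own traces, so the extra pass is only applied on the plain path that may contain torch-native in-place comms.

def maybe_apply_extra_sort(computation_trc, use_ddp, use_fsdp, sort_pass):
    if use_ddp or use_fsdp:
        # The ddp/fsdp transforms already call sort_waits where appropriate.
        return computation_trc
    # Plain path: sort waits introduced by torch-native in-place comms.
    return sort_pass(computation_trc)  # e.g. maybe_sort_waits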

Collaborator:

In-place comms are not currently used in thunder's ddp & fsdp. This check is rather to avoid redundant application for ddp and unwanted application for fsdp.

I get it now, thanks for the explanation and the trace example above.

kiya00 (Collaborator) commented Jul 19, 2024:

It seems that for fsdp, the maybe_sort_waits in thunder/__init__ will be applied after limit_in_flight_allgathers on the forward trace. Will it prioritize the allgathers again and break limit_in_flight_allgathers?

crcrpar added 6 commits July 19, 2024 18:05
even when `use_ddp=False` and `use_fsdp=False`

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
@crcrpar crcrpar force-pushed the crpa/torch-native-dist-ops-with-waits-sorted branch from fbe9526 to a893e96 Compare July 19, 2024 09:05
t-vi (Collaborator) commented Jul 19, 2024:

Thank you @crcrpar @kiya00

@t-vi t-vi merged commit 721e28e into main Jul 19, 2024
36 checks passed
@t-vi t-vi deleted the crpa/torch-native-dist-ops-with-waits-sorted branch July 19, 2024 14:29
from thunder.executors.passes import _transform_for_operator_executor_execution
from thunder.distributed.utils import maybe_sort_waits

with langctxs.langctx(cd.langctx):
Collaborator:

I think langctxs.langctx(cd.langctx) should be applied as a decorator on the get_computation_and_inputs function as a whole. I will create a PR for this.
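
For illustration, a toy example of the decorator-vs-with-block difference being suggested, using a stand-in context manager rather than thunder's langctxs.langctx:

import contextlib

@contextlib.contextmanager
def toy_langctx(name):
    # Stand-in for langctxs.langctx; prints instead of switching language contexts.
    print(f"enter {name}")
    try:
        yield
    finally:
        print(f"exit {name}")

# Current shape: only part of the body runs inside the context.
def get_computation_and_inputs_partial():
    with toy_langctx("torch"):
        pass  # _transform_for_operator_executor_execution(...)

# Suggested shape: the whole function body runs inside the context.
@toy_langctx("torch")
def get_computation_and_inputs_whole():
    pass  # entire body under the language context

In the real code the language context comes from cd.langctx at call time, so the actual change may need a small wrapper rather than a literal decorator applied at definition time.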

Development

Successfully merging this pull request may close these issues.

Apply thunder.distributed.utils.sort_waits appropriately when possible