
reorderShardedAxisPass for DID loop split #4256


Merged: 55 commits from pm/preseg_reorder_sharded into main, Jun 4, 2025

Conversation

@Priya2698 (Collaborator) commented Apr 16, 2025

Issue #3900.

Key changes:

  1. Modifying allocation domain instead of logical domain
  • The previous implementation modified the logical shape of the communication inputs and outputs so that the gathered/scattered axes were outermost. The current implementation only sets the allocation domain (see the sketch after this description).
  2. hasShardingChanges -> getGatherOrScatterCommInfo
  • The new function finds any gather / scatter / reduce-scatter communication patterns and returns the logical IterDomains involved in the communication. This accommodates any DID loop split.
  3. isInnerResharding -> isCommLayoutCompliant and isAllocatedOutermost, to support the split allocation domains seen in DID loop split.
  4. Dependency on canLower is removed to decouple from issue #4382 (Extend InsertReshardingPass for loop split).

This PR does not handle ParallelType::Stream right now.

TODO: ReduceScatter tests after PR #4384
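
A minimal sketch of ideas 1 and 2, assuming nvFuser's TensorView API as used in the tests below. The CommunicationInfo fields and the helper function are illustrative, not the exact pass implementation:

#include <algorithm>
#include <vector>

// Illustrative shape of the info returned for a communication pattern:
// the kind of communication plus the sharded logical IterDomains involved.
struct CommunicationInfo {
  CommunicationType type;    // e.g. allgather / scatter / reduce-scatter
  IterDomain* p_sharded_id;  // sharded logical ID on the producer
  IterDomain* c_sharded_id;  // sharded logical ID on the consumer
};

// Idea 1: leave the logical domain alone and only reorder the allocation
// domain so the gathered/scattered axis is allocated outermost.
void allocateShardedIdOutermost(TensorView* tv, IterDomain* sharded_id) {
  std::vector<IterDomain*> allocation = tv->getLogicalDomain();
  auto it = std::find(allocation.begin(), allocation.end(), sharded_id);
  // Rotate the sharded ID to the front; relative order of the rest is kept.
  std::rotate(allocation.begin(), it, it + 1);
  tv->setAllocationDomain(allocation, /*contiguity=*/true);
}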

@Priya2698:

!test

github-actions bot commented Apr 16, 2025

Review updated until commit 50a4ee2

Description

  • Refactor communication layout compliance checks

  • Introduce CommunicationInfo struct for communication pattern analysis

  • Update canLower to use isCommunicationLayoutCompliant

  • Enhance lower_to_communication.cpp with new functions and logic


Changes walkthrough 📝

Relevant files

Enhancement (7 files)

  • lower.cpp: Update canLower to use isCommunicationLayoutCompliant (+2/-2)
  • lower_to_communication.cpp: Introduce CommunicationInfo and related functions (+188/-5)
  • communication.cpp: Add contiguity checks in postAllreduce and postReduceScatter (+16/-1)
  • utils.cpp: Remove deprecated functions and add new utility functions (+2/-86)
  • reorder_sharded_axis.cpp: Refactor pass to use new communication layout compliance checks (+173/-118)
  • lower_to_communication.h: Add declarations for CommunicationInfo and related functions (+29/-0)
  • utils.h: Remove deprecated function declarations and add new utility function declarations (+5/-15)

Tests (4 files)

  • test_multidevice_lower_communication.cpp: Update and add new tests for communication layout compliance (+127/-90)
  • test_resharding.cpp: Update tests to use new communication layout compliance checks (+3/-2)
  • test_communication.py: Re-enable and update test_reduce_scatter_noncontiguous (+0/-3)
  • test_matmul.py: Update test parameters for multidevice matmul (+1/-1)

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Possible Issue

The new function isCommunicationLayoutCompliant is used in place of isInnerResharding. Ensure that this new function correctly identifies inner resharding scenarios and that the logic is consistent with the previous implementation.

if (!ignore_inner_resharding && !isCommunicationLayoutCompliant(expr)) {
  return false;
}
Performance Concern

The new function isAllocationOrderCompliant checks if the allocation order is compliant with NCCL/UCC requirements. Ensure that this function is efficient and does not introduce significant overhead during the compilation process.

bool isAllocationOrderCompliant(TensorView* tv, IterDomain* sharded_id) {
  NVF_ERROR(
      std::find(
          tv->getLogicalDomain().begin(),
          tv->getLogicalDomain().end(),
          sharded_id) != tv->getLogicalDomain().end(),
      "The sharded ID ",
      sharded_id->toString(),
      " is not in the logical domain ",
      tv->getLogicalDomain());

  if (isLocalSizeOne(sharded_id)) {
    // Parallelized dimension, broadcast, and reduction do not affect
    // allocation.
    return true;
  }

  // This sharded logical ID may not be directly present in allocation domain.
  // This indicates allocation domain has DID transformations.
  std::optional<Layout> layout = canonicalizeLayout(tv);
  if (!layout.has_value()) {
    return false;
  }

  const std::vector<IterDomain*>& allocation_domain = layout->allocation_domain;

  NVF_ERROR(
      std::is_permutation(
          allocation_domain.begin(),
          allocation_domain.end(),
          tv->getLogicalDomain().begin(),
          tv->getLogicalDomain().end()),
      "The allocation domain returned by canonicalizeLayout",
      allocation_domain,
      " should be a permutation of the logical domain ",
      tv->getLogicalDomain());

  // Check if sharded_id appears at the front.
  for (IterDomain* id : allocation_domain) {
    if (id == sharded_id) {
      return true;
    }
    if (!isLocalSizeOne(id)) {
      return false;
    }
  }
  // Unreachable: the NVF_ERROR above guarantees sharded_id is present.
  return false;
}
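
For context, a hypothetical call site for this check inside the pass (comm_info and the copy-insertion step are assumptions, not taken from the diff):

if (!isAllocationOrderCompliant(producer, comm_info.p_sharded_id)) {
  // The pass would make the layout compliant here, e.g. by inserting a
  // copy whose allocation domain puts the sharded ID outermost.
}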
Test Coverage

The test Allgather_NonCompliantAllocation checks for non-compliant allocation. Ensure that this test covers all edge cases and that the expected behavior is correctly validated.

TensorView* tv0 = makeConcreteTensor({5, d * 3});
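// Mark the input's allocation as non-contiguous; this is what makes the
// layout non-compliant for direct lowering to an allgather.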
tv0->setAllocationDomain(tv0->getLogicalDomain(), false);

TensorView* tv1 = set(tv0);
tv1->setAllocationDomain(tv1->getLogicalDomain(), true);

tv0->setDeviceMesh(mesh);
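// DID loop split: outer-split axis 1 by d and parallelize the outer piece on DIDx.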
tv0->outer_split(1, d);
tv0->axis(1)->parallelize(ParallelType::DIDx);

tv1->setDeviceMesh(mesh);

fusion->addInput(tv0);
fusion->addOutput(tv1);

at::Tensor unsharded_in_tensor = at::randn({5, d * 3}, tensor_options);
at::Tensor in_tensor = shardTensor(unsharded_in_tensor, 1, mesh);

FusionExecutorCache executor_cache(std::move(fusion));
at::Tensor out_tensor =
    executor_cache.runFusionWithInputs({in_tensor})[0].as<at::Tensor>();

EXPECT_TRUE(out_tensor.is_contiguous());
EXPECT_TRUE(at::allclose(out_tensor, unsharded_in_tensor));

FusionKernelRuntime* runtime = executor_cache.getMostRecentKernelRuntime();
EXPECT_THAT(
    runtime->fusionSegments()->groups(),
    Contains(HeuristicIs(SchedulerType::PointWise)).Times(2));

@Priya2698 Priya2698 force-pushed the pm/preseg_reorder_sharded branch from b420957 to 8e64e6e Compare May 8, 2025 00:22
@Priya2698:

!test

1 similar comment
@Priya2698:

!test

@Priya2698 Priya2698 force-pushed the pm/preseg_reorder_sharded branch from f8e6435 to fb07219 Compare May 10, 2025 02:49
@Priya2698:

!test --diff

1 similar comment
@Priya2698:

!test --diff

@Priya2698 Priya2698 force-pushed the pm/preseg_reorder_sharded branch from 8803b66 to 58f1039 Compare May 14, 2025 20:49
@Priya2698:

!test --diff

@Priya2698:

!test

@Priya2698 Priya2698 force-pushed the pm/preseg_reorder_sharded branch from 8fb5af9 to 35090b0 Compare May 20, 2025 20:18
@Priya2698 Priya2698 changed the title from "Pm/preseg reorder sharded" to "reorderShardedAxisPass for DID loop split" May 20, 2025
Priya2698 added a commit that referenced this pull request May 21, 2025
Extracted the logic for converting a split allocation domain to logical from
`alias.cpp` to ir_utils.
This avoids duplicate logic in PR #4256 when checking the allocation of
gathered/scattered axes.
@Priya2698 Priya2698 force-pushed the pm/preseg_reorder_sharded branch from d635cb8 to 602de55 Compare May 21, 2025 22:18
@Priya2698 Priya2698 marked this pull request as ready for review May 22, 2025 00:53
@Priya2698 Priya2698 requested a review from wujingyue May 22, 2025 01:01
@Priya2698:

!test

@Priya2698:

!test

@Priya2698:

!test


@wujingyue wujingyue left a comment


First batch; still reviewing...

const std::vector<IterDomain*>& domain,
const IterDomain* id);

// Returns the communication info for the
Collaborator:

Add a code comment noting that this assumes expr has been decomposed.

I'd move this to lower_to_communication.h so it's closer to `convertSingleOpToCommunication`. The two functions should be kept in sync, so it makes sense to put them close to each other.

Collaborator Author:

That can lead to a circular dependency since that file imports multidevice/utils, and multidevice/utils itself needs this function.
In a separate PR, I can extend this function to be more elaborate and use it directly in convertSingleOpToCommunication to avoid any mismatch between them.

Collaborator:

> That can lead to a circular dependency since that file imports multidevice/utils, and multidevice/utils itself needs this function.

AFAICT, to avoid circular dependency, only getCommunicationInfo and isCommunicationLayoutCompliant need to go to lower_to_communication. They probably should anyhow because they are all about lowering.

In general, a large "utils.h" header file tends to cause the following issues:

  1. Increased Compilation Times: A large utils.h header will be included in many source files, leading to a lot of parsing and compilation overhead, significantly increasing build times. Any change to utils.h, even a minor one, requires recompilation of all files that include it, further slowing down development.
  2. Reduced Encapsulation and Interface Clarity: A large utils.h likely exposes many internal utility functions and data structures to clients that don't need to know about them, violating the principle of information hiding and reducing encapsulation. This makes it harder to understand the actual public interface and can lead to accidental misuse or dependencies on internal implementation details.

@Priya2698 (Collaborator, Author) commented May 28, 2025:

  1. lower.cpp and lower_to_communication.cpp import each other -> This is easy to resolve; lower.cpp does not need lower_to_communication.h.
  2. lower.cpp could hold the above functions, but it imports the presegmentation passes. This is fine for now, since reorderShardedAxisPass is removed from there. However, previously these two files were also importing each other. The same dependency exists between other preseg passes and this file: the preseg passes query canLower, whereas lower calls the preseg passes.

We need a restructuring to avoid this. Moving canLower to lower_to_communication seems sufficient, but it is part of the HostIrLower class.

Collaborator:

Circular dependencies among cpps are OK, although still not the best practice. Recall that cpp files are compiled independently. Circular dependencies among header files are much more problematic, but I don't think you are hitting any.

To avoid circular dependencies among cpps [1] for FusionExecutorCache, I think we can:

  1. Let preseg depend on lower_to_communication.
  2. Don't let lower_to_communication depend on lower. HostIrLowerParams can be avoided by passing in CommunicatorBackend directly; `HostIrLower::canLower(c)` can be removed or, if MultiDeviceExecutor needs it, moved to lower. (See the sketch after the footnotes below.)
  3. Keep lower away from FusionExecutorCache. I saw several #include lower.h in the main stack, but none of them seem necessary.

Footnotes

  1. This isn't accurate because we never include a cpp from another cpp. I'm really referring to scenarios like a.cpp including b.h and b.cpp including a.h.
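
A hedged sketch of item 2 above, using the `convertSingleOpToCommunication` entry point named earlier in this thread; the surrounding parameter list is an assumption, not quoted from the code:

// lower_to_communication.h: no dependency on lower.h. The backend enum is
// passed directly instead of a HostIrLowerParams struct from lower.h.
std::vector<Expr*> convertSingleOpToCommunication(
    Expr* expr,
    int64_t my_device_index,
    CommunicatorBackend backend);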


@Priya2698 Priya2698 requested a review from wujingyue May 27, 2025 21:03
@Priya2698 Priya2698 force-pushed the pm/preseg_reorder_sharded branch from 56ebef4 to b2122d5 Compare May 28, 2025 21:38
@Priya2698 Priya2698 force-pushed the pm/preseg_reorder_sharded branch from 1fd7ea9 to 8c55e43 Compare June 3, 2025 22:56
@Priya2698:

!test

@Priya2698:

!test

@Priya2698 Priya2698 merged commit 5508e22 into main Jun 4, 2025
52 of 53 checks passed
@Priya2698 Priya2698 deleted the pm/preseg_reorder_sharded branch June 4, 2025 20:03
wujingyue added a commit that referenced this pull request Jun 16, 2025
As a follow-up to
#4256 (comment)

This makes the test more realistic and gives better coverage. It indeed
caught #4642, a new bug.