[LLVMGPU] Enable scf.forall distr. on vectorDistribute Pipeline #19420

pashu123 · 2024-12-09T14:48:37Z

Enables scf.forall distribution on the vector distribute pipeline.

Max191

It seems a bit strange that the vector sizes have changed in vector distribute and that some of the reads/writes from global to shared memory have disappeared in tile and fuse. Do you know what caused these differences?

Max191 · 2025-01-10T19:53:13Z

compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_igemm_tile_and_fuse.mlir

-//      CHECK-DAG:       %[[LHS_RD:.+]] = vector.transfer_read %[[B0]]{{.*}} vector<1xf16>
-//      CHECK-DAG:       vector.transfer_write %[[LHS_RD]]
-// Note that to simplify the test we are not showing the mapping of the RHS_RD
-// to its buffer as it goes through an scf.if/else control structure
-// involving allocas.
-//      CHECK-DAG:       %[[RHS_RD:.+]] = vector.transfer_read {{.*}} vector<1xf16>
-//      CHECK-DAG:       vector.transfer_write %[[RHS_RD]]


What happened to these reads?

pashu123 · 2025-01-11

This happens when the tensor.extract_slice ops are moved up in the block; buffer optimization then kicks in.

Max191 · 2025-01-10T19:53:25Z

compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_tile_and_fuse.mlir

-//   CHECK-DAG:       %[[LHS_RD:.+]] = vector.transfer_read {{.*}} vector<4xf32>
-//   CHECK-DAG:       vector.transfer_write %[[LHS_RD]]
-//   CHECK-DAG:       %[[RHS_RD:.+]] = vector.transfer_read {{.*}} vector<1xf32>
-//   CHECK-DAG:       vector.transfer_write %[[RHS_RD]]


Max191 · 2025-01-10T20:01:07Z

compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_vector_distribute_gfx942.mlir

+// CHECK-SAME: -> (vector<1x1x1xf32>, vector<1x1x1xf32>, vector<1x2x1x4x1x4xf32>)
+// CHECK-COUNT-24:  amdgpu.mfma {{.*}} {blocks = 1 : i32, k = 8 : i32, m = 32 : i32, n = 32 : i32} blgp =  none : vector<4xf16>, vector<4xf16>, vector<16xf32>


Why has the number of mfmas changed here?

pashu123 · 2025-01-11

Distribution now handles tile sizes of any length (previously it was limited to 3), which is why the count changed.

Max191

Also, there are 3 separate things happening here. Can you split the 3 commits into separate PRs (the workgroup reordering, the slice optimization, and the forall distribution enablement)?

pashu123 force-pushed the scfforallgpu branch 3 times, most recently from f6c10c5 to 6719bb3 Compare December 16, 2024 06:39

pashu123 force-pushed the scfforallgpu branch 3 times, most recently from f96b601 to f442fdf Compare December 17, 2024 12:30

pashu123 force-pushed the scfforallgpu branch from f442fdf to a9185da Compare January 2, 2025 08:55

pashu123 mentioned this pull request Jan 6, 2025

Unnecessary buffers creation when enabling scf.forall distribution on the vector distribute pipeline. #19608

Open

pashu123 force-pushed the scfforallgpu branch 2 times, most recently from 502f954 to 7267b8b Compare January 8, 2025 19:22

pashu123 added 2 commits January 9, 2025 00:57

[LLVMGPU] Enable scf.forall distr. on vectorDistribute Pipeline

0097e44

Update tile and distribute to enable workgroup reordering

09d6cf4

pashu123 force-pushed the scfforallgpu branch 8 times, most recently from f06be62 to 08aa578 Compare January 9, 2025 13:14

pashu123 marked this pull request as ready for review January 9, 2025 13:16

pashu123 requested review from MaheshRavishankar, qedawkins, kuhar, Groverkss and hanhanW as code owners January 9, 2025 13:16

pashu123 force-pushed the scfforallgpu branch from 08aa578 to ac25bbb Compare January 9, 2025 13:17

pashu123 requested a review from Max191 January 9, 2025 13:59

pashu123 force-pushed the scfforallgpu branch 2 times, most recently from f4b62d0 to 651af59 Compare January 9, 2025 17:43

pashu123 force-pushed the scfforallgpu branch 7 times, most recently from bc4deee to 44709eb Compare January 10, 2025 19:01

Move tensor_extract slice up in the block to respect bufferization

d5f856a

pashu123 force-pushed the scfforallgpu branch from 44709eb to d5f856a Compare January 10, 2025 19:02

Max191 reviewed Jan 10, 2025

