WCC Grouped Gemm Implementation #24
base: main
Conversation
Pull Request Overview
This PR implements a Work-Centric Grouped GEMM (General Matrix Multiply) operation for the tritonblas library using Triton kernels. The implementation provides an efficient way to execute multiple matrix multiplications in a single kernel launch by distributing work across processing units.
- Adds a new work-centric grouped GEMM implementation with partial result accumulation (the scheduling idea is sketched after this list)
- Includes comprehensive test coverage for various matrix sizes and block configurations
- Exposes the grouped GEMM functionality through the tritonblas public API
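To make the work-centric idea concrete, here is a minimal host-side sketch of how output tiles from all groups can be flattened into one work list and striped across a fixed pool of persistent workers. This models the scheduling only; the names and the round-robin policy are illustrative assumptions, not code from this PR.

```python
# A host-side model of work-centric scheduling, not the Triton kernel itself.
def assign_tiles(group_shapes, block_m, block_n, num_workers):
    # Flatten every output tile of every group into one global work list.
    tiles = []
    for g, (m, n, _k) in enumerate(group_shapes):
        tiles_m = (m + block_m - 1) // block_m
        tiles_n = (n + block_n - 1) // block_n
        tiles += [(g, t) for t in range(tiles_m * tiles_n)]
    # Round-robin: worker w takes tiles w, w + num_workers, w + 2*num_workers, ...
    return {w: tiles[w::num_workers] for w in range(num_workers)}

# Two uneven groups still spread evenly across 8 persistent workers.
work = assign_tiles([(512, 512, 128), (64, 64, 64)], 256, 256, num_workers=8)
```

This striping is what lets small and large problems in the same batch share one kernel launch instead of launching one grid per GEMM.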
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/test_grouped_gemm.py | Test file with parametrized tests for different matrix sizes, block sizes, and group counts |
| include/tritonblas/internal/wcc_grouped_gemm.py | Core implementation containing WCC grouped GEMM kernel and helper functions |
| include/tritonblas/grouped_gemm.py | Public API wrapper that prepares data structures and launches the kernel |
| include/tritonblas/__init__.py | Exports the grouped_gemm function for public use |
```python
current_device_index = torch.cuda.current_device()
current_device = torch.cuda.get_device_properties(current_device_index)
MAX_SMS = current_device.multi_processor_count
#TODO: 256x256 for fp16/bf16, need adjust for fp8/fp4
```
Copilot AI commented on Sep 12, 2025:
TODO comment should be formatted with proper spacing: '# TODO:' and provide more specific guidance about the adjustment needed for fp8/fp4 data types.
Suggested change:
```diff
-#TODO: 256x256 for fp16/bf16, need adjust for fp8/fp4
+# TODO: 256x256 block size is suitable for fp16/bf16; for fp8/fp4, consider reducing block size (e.g., 128x128) due to hardware and data type constraints. Investigate optimal values.
```
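As a hedged illustration of that suggestion, the block size could be selected from the element type. The values below echo the review comment and are assumptions, not constants taken from this PR.

```python
import torch

def pick_block_size(dtype: torch.dtype) -> tuple[int, int]:
    # Assumed policy: keep 256x256 for fp16/bf16 and drop to 128x128 for
    # narrower types such as fp8 (e.g. torch.float8_e4m3fn). Illustrative only.
    if dtype in (torch.float16, torch.bfloat16):
        return (256, 256)
    return (128, 128)
```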
```python
A_BASE = A + rm[:, None] * stride_am + rk[None, :] * stride_ak + BLOCK_SIZE_K * stride_ak * remainder
B_BASE = B + rk[:, None] * stride_bk + rn[None, :] * stride_bn + BLOCK_SIZE_K * stride_bk * remainder
"""
A_BASE = A + rm[:, None] * stride_am + rk[None, :] + (BLOCK_SIZE_K * tile_offset)
```
Copilot AI commented on Sep 12, 2025:
Matrix A pointer calculation is missing the stride_ak multiplication. It should be rk[None, :] * stride_ak to properly handle different stride patterns for matrix A.
Suggested change:
```diff
-A_BASE = A + rm[:, None] * stride_am + rk[None, :] + (BLOCK_SIZE_K * tile_offset)
+A_BASE = A + rm[:, None] * stride_am + rk[None, :] * stride_ak + (BLOCK_SIZE_K * tile_offset)
```
@ryanswann-amd can you take a look if it's still correct after adding this stride?
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
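For context on why the stride_ak factor matters, here is a small host-side check (illustrative, not from this PR). The kernel-style flat offset `i * stride_am + k * stride_ak` addresses `A[i, k]` correctly only when both stride terms are applied; dropping `stride_ak` happens to work for row-major inputs, where `stride_ak == 1`, and silently reads the wrong elements for transposed ones.

```python
import torch

A = torch.arange(12.0).reshape(3, 4)  # row-major view: strides (4, 1)
At = A.t()                            # transposed view:  strides (1, 4)
storage = A.flatten()                 # the underlying buffer shared by both views

for M in (A, At):
    s_am, s_ak = M.stride()
    i, k = 2, 1
    # The kernel-style flat offset; correct only with both stride terms.
    flat = i * s_am + k * s_ak
    assert storage[flat] == M[i, k]
```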
Motivation
Work-Centric Grouped GEMM Implementation
Test Plan
A testing script is added to verify correctness; a sketch of such a test follows.
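This is a sketch of what a parametrized correctness check might look like. The real tests live in tests/test_grouped_gemm.py, and the grouped_gemm signature used here (lists of A, B, and preallocated C tensors) is an assumption.

```python
import pytest
import torch
import tritonblas

@pytest.mark.parametrize("m,n,k", [(64, 64, 64), (256, 512, 128)])
@pytest.mark.parametrize("num_groups", [1, 4])
def test_grouped_gemm_matches_torch(m, n, k, num_groups):
    # Build num_groups independent GEMM problems on the GPU.
    As = [torch.randn(m, k, device="cuda", dtype=torch.float16) for _ in range(num_groups)]
    Bs = [torch.randn(k, n, device="cuda", dtype=torch.float16) for _ in range(num_groups)]
    Cs = [torch.empty(m, n, device="cuda", dtype=torch.float16) for _ in range(num_groups)]
    tritonblas.grouped_gemm(As, Bs, Cs)
    # Compare each group's output against a per-group torch reference.
    for a, b, c in zip(As, Bs, Cs):
        torch.testing.assert_close(c, a @ b, rtol=1e-2, atol=1e-2)
```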
Test Result
All tests pass
Submission Checklist