@ryanswann-amd (Collaborator)
Motivation

Add epilogue function support to the tritonBLAS persistent GEMM kernels, enabling users to apply element-wise operations (e.g., activation functions) directly to the output accumulator. Fusing these operations into the GEMM kernel improves performance by eliminating the separate kernel launches otherwise required for common post-GEMM operations.

Technical Details

Core Changes:

  • Added epilogue.py module with built-in activation functions (ReLU, GELU, SiLU, Sigmoid, Tanh, Leaky ReLU, Identity)
  • Modified persistent_gemm.py to accept an optional epilogue_fn parameter (default: None)
  • The epilogue is applied after the GEMM, scales, and bias, but before conversion to the output dtype (see the sketch after this list)
  • When epilogue_fn=None, the Triton JIT compiler optimizes the branch away entirely (zero overhead)
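
A minimal sketch of where the epilogue is applied, assuming the kernel receives the function as a compile-time constant (the helper name and arguments here are illustrative, not the actual kernel source):

```python
import triton
import triton.language as tl

# Illustrative sketch only; the real logic lives inside persistent_gemm.py.
# Because epilogue_fn arrives as a compile-time constant, the None check is
# folded away by the JIT, so epilogue_fn=None adds zero overhead.
@triton.jit
def _apply_epilogue_and_store(acc, c_ptr, offs, mask, epilogue_fn: tl.constexpr):
    if epilogue_fn is not None:
        acc = epilogue_fn(acc)              # element-wise op on the fp32 tile
    c = acc.to(c_ptr.dtype.element_ty)      # output-dtype cast happens after
    tl.store(c_ptr + offs, c, mask=mask)
```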

Files Modified:

  • include/tritonblas/kernels/stages/algorithms/epilogue.py (new)
  • include/tritonblas/kernels/stages/algorithms/__init__.py
  • include/tritonblas/kernels/persistent_gemm.py
  • tests/test_epilogues.py (new)
  • examples/example_matmul_epilogue.py (new)
  • docs/EPILOGUES.md (new)

Key Features:

  • Numerically stable implementations (tanh/GELU are computed in forms that avoid overflow)
  • Custom epilogue functions are easy to define with @triton.jit (see the sketches after this list)
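
For illustration, here is a numerically stable tanh in the spirit of the first point, plus a hypothetical custom epilogue like the clamp used in examples/example_matmul_epilogue.py (the signatures in the actual module may differ):

```python
import triton
import triton.language as tl

# Sketch of a numerically stable tanh: computing exp(-2|x|) keeps the
# exponential in (0, 1] so it never overflows; the sign is restored at the end.
@triton.jit
def stable_tanh(x):
    e = tl.exp(-2.0 * tl.abs(x))
    t = (1.0 - e) / (1.0 + e)
    return tl.where(x >= 0, t, -t)

# Hypothetical custom epilogue: clamp the accumulator to [-1.0, 1.0].
@triton.jit
def clamp_epilogue(acc):
    return tl.minimum(tl.maximum(acc, -1.0), 1.0)
```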

Test Plan

  • Created a comprehensive pytest suite (tests/test_epilogues.py) with parametrized tests (one case is sketched below)
  • Tests all built-in epilogue functions (ReLU, GELU, SiLU, Sigmoid, Tanh, Leaky ReLU, Identity)
  • Tests epilogue with bias addition
  • Tests epilogue_fn=None (no epilogue)
  • Validates against PyTorch reference implementations
  • Multiple problem sizes tested (256×256, 512×512, 128×256×512)
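
The tests take roughly this shape; the tritonblas entry point and epilogue name below are assumptions for illustration, and the real suite in tests/test_epilogues.py may differ:

```python
import pytest
import torch
import tritonblas  # the matmul/relu names below are assumptions, not verified API

@pytest.mark.parametrize("m, n, k", [(256, 256, 256), (512, 512, 512), (128, 256, 512)])
def test_relu_epilogue(m, n, k):
    a = torch.randn((m, k), device="cuda", dtype=torch.float16)
    b = torch.randn((k, n), device="cuda", dtype=torch.float16)
    ref = torch.relu(a @ b)                                     # PyTorch reference
    out = tritonblas.matmul(a, b, epilogue_fn=tritonblas.relu)  # assumed names
    torch.testing.assert_close(out, ref, rtol=1e-2, atol=1e-2)
```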

Test Result

All tests pass with expected fp16 precision tolerance (rtol=1e-2, atol=1e-2):

  • ✓ Identity, ReLU, GELU, SiLU, Tanh, Sigmoid, Leaky ReLU epilogues
  • ✓ Epilogue with bias
  • ✓ No epilogue (None)
  • ✓ Example's custom clamp epilogue matches the PyTorch reference exactly

Copilot AI review requested due to automatic review settings, January 21, 2026 20:39

Copilot AI (Contributor) left a comment:

Pull request overview

This PR adds support for epilogue functions to the tritonBLAS persistent GEMM kernel, enabling fused element-wise operations on the output accumulator for improved performance through kernel fusion.

Changes:

  • Introduced a new epilogue.py module with built-in activation functions (ReLU, GELU, SiLU, Sigmoid, Tanh, Leaky ReLU, Identity)
  • Modified persistent_gemm.py to accept an optional epilogue_fn parameter
  • Added comprehensive test coverage and example demonstrating custom epilogue usage

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| include/tritonblas/kernels/stages/algorithms/epilogue.py | New module implementing built-in activation functions as JIT-compiled epilogue operations |
| include/tritonblas/kernels/stages/algorithms/__init__.py | Exports the epilogue functions from the new module |
| include/tritonblas/kernels/persistent_gemm.py | Adds the optional epilogue_fn parameter and applies it to the accumulator before type conversion |
| tests/test_epilogues.py | Comprehensive test suite validating all epilogue functions against PyTorch references |
| examples/example_matmul_epilogue.py | Example demonstrating custom epilogue function creation and usage |
