
CUTLASS 3.4.0

hwu36 released this on 16 Jan 22:39
Commit 751eb9a
  • Improved Mixed-input Hopper GEMMs supporting {16-bit, 8-bit} x {8-bit, 4-bit} input types, with fast numerical converters and group scaling factors tuned for optimal performance on Hopper H100.
  • Beta release of Pointer-Array Batched GEMMs utilizing TMA and Hopper H100 tensor cores is now available. (Requires CUDA 12.3 or above)
  • Beta release of Group-GEMM, commonly used to optimize Mixture-of-Experts models, is now available on Hopper GPUs, taking advantage of TMA and Hopper H100 tensor cores. (Requires CUDA 12.3 or above)
  • Ampere Sparse GEMM now supports the Epilogue Visitor Tree (EVT).
  • Improvements to NamedBarriers, including details on the ReservedNamedBarriers used within the CUTLASS library.
  • Improved CuTe documentation, with greater clarity and depth in the Quickstart, CuTe Layout, and CuTe Layout Algebra documents. The associated code comments, post-conditions, and details in the CuTe Core unit tests have also been improved.
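To make the "group scaling factors" in the mixed-input GEMM item concrete, here is a minimal host-side C++ reference sketch of the arithmetic. It is an illustration under stated assumptions, not CUTLASS's API or memory layout: the function name `reference_mixed_input_gemm`, row-major storage, `float` standing in for the 16-bit operand, `int8_t` for the narrow operand, and a scale tensor of shape (K / group_size) x N (one scale shared by every `group_size` consecutive K elements in a column of B) are all hypothetical choices. It pins down only what is computed, not how the tuned Hopper kernels schedule the conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host reference (not a CUTLASS function) for a mixed-input GEMM
// with group-wise scaling: D = A * dequant(B).
//   A:     M x K floats, row-major (stand-in for the wider 16-bit operand)
//   B:     K x N int8 values, row-major (the narrower operand)
//   scale: (K / group_size) x N floats, row-major; rows kk with the same
//          kk / group_size in column j share scale[(kk / group_size) * n + j]
std::vector<float> reference_mixed_input_gemm(
    int m, int n, int k, int group_size,
    const std::vector<float>& A,
    const std::vector<std::int8_t>& B,
    const std::vector<float>& scale) {
  assert(k % group_size == 0);  // assumed: K is a whole number of groups
  std::vector<float> D(static_cast<std::size_t>(m) * n, 0.0f);
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (int kk = 0; kk < k; ++kk) {
        // Convert the narrow integer to float, then apply its group's scale.
        float b = static_cast<float>(B[kk * n + j]) *
                  scale[(kk / group_size) * n + j];
        acc += A[i * k + kk] * b;
      }
      D[i * n + j] = acc;
    }
  }
  return D;
}
```

A reference like this is mainly useful for checking device results in a unit test, since it applies conversion and scaling in the simplest possible order.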
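Group-GEMM runs many independent GEMMs, each with its own shape, in a single launch, while Pointer-Array Batched GEMM runs a batch of same-shaped GEMMs whose operands are reached through arrays of pointers rather than a fixed batch stride. The host-side C++ sketch below spells out those semantics only. `GemmProblem` and `reference_grouped_gemm` are hypothetical names (not CUTLASS types), and row-major fp32 matrices are an assumption. Pointer-array batched GEMM corresponds to the special case in which every entry has the same (m, n, k).

```cpp
#include <cassert>
#include <vector>

// Hypothetical description of one problem in a group (not a CUTLASS type).
// Row-major: A is m x k, B is k x n, C and D are m x n.
struct GemmProblem {
  int m, n, k;
  const float* A;  // each problem's operands may live at arbitrary addresses
  const float* B;
  const float* C;
  float* D;
};

// Computes D_i = alpha * A_i * B_i + beta * C_i for every problem i.
// Grouped GEMM: (m, n, k) may differ per problem.
// Pointer-array batched GEMM: the same loop with identical (m, n, k) everywhere.
void reference_grouped_gemm(const std::vector<GemmProblem>& problems,
                            float alpha, float beta) {
  for (const GemmProblem& p : problems) {
    for (int i = 0; i < p.m; ++i) {
      for (int j = 0; j < p.n; ++j) {
        float acc = 0.0f;
        for (int kk = 0; kk < p.k; ++kk) {
          acc += p.A[i * p.k + kk] * p.B[kk * p.n + j];
        }
        p.D[i * p.n + j] = alpha * acc + beta * p.C[i * p.n + j];
      }
    }
  }
}
```

In a Mixture-of-Experts layer, each entry would typically pair one expert's routed tokens with that expert's weights, which is why M tends to vary from problem to problem.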