Documentation for Composable Kernel is available at [https://rocm.docs.amd.com/projects/composable_kernel/en/latest/](https://rocm.docs.amd.com/projects/composable_kernel/en/latest/).

## (Unreleased) Composable Kernel for ROCm

### Added

* Added a compute async pipeline in the CK TILE universal GEMM on gfx950.
* Added support for the B tensor type pk_int4_t in the CK TILE weight preshuffle GEMM.
* Added a new API to load different memory sizes to SGPRs.
* Added support for B Tensor Preshuffle in CK TILE Grouped GEMM.
* Added a basic copy kernel example and supporting documentation for new CK Tile developers.
* Added support for grouped_gemm kernels to perform the multi_d elementwise operation.
* Added support for Multiple ABD GEMM.
* Added benchmarking support for tile engine GEMM Multi D.
* Added block scaling support in CK_TILE GEMM, allowing flexible use of quantization matrices from either the A or B operand.
* Added row-wise and column-wise quantization for CK_TILE GEMM and CK_TILE Grouped GEMM (see the sketch after this list).
* Added support for f32 to FMHA (fwd/bwd).
* Added tensor-wise quantization for CK_TILE GEMM.
* Added support for a batched contraction kernel.
* Added a pooling kernel in CK_TILE.
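The quantization entries above all describe the same basic scheme: a low-precision GEMM whose int32 accumulator is rescaled in the epilogue by per-row, per-column, per-block, or per-tensor scale factors. As a rough orientation, here is a plain C++ reference sketch of row-wise/column-wise quantized GEMM semantics; it is illustrative only, not the CK_TILE API, and every name in it is hypothetical:

```cpp
#include <cstdint>
#include <vector>

// Reference semantics for a row-wise (A) by column-wise (B) quantized GEMM:
//   C[m][n] = scale_a[m] * scale_b[n] * sum_k A[m][k] * B[k][n]
// with int8 operands and an int32 accumulator. CK_TILE fuses the
// dequantization into the GEMM epilogue; this loop nest only spells out
// the math such a fused kernel reproduces.
void rowcol_quant_gemm_ref(const std::vector<std::int8_t>& A, // M x K, row-major
                           const std::vector<std::int8_t>& B, // K x N, row-major
                           const std::vector<float>& scale_a, // one scale per row of A
                           const std::vector<float>& scale_b, // one scale per column of B
                           std::vector<float>& C,             // M x N, row-major
                           int M, int N, int K) {
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      std::int32_t acc = 0; // accumulate in int32 to avoid int8 overflow
      for (int k = 0; k < K; ++k) {
        acc += std::int32_t(A[m * K + k]) * std::int32_t(B[k * N + n]);
      }
      C[m * N + n] = scale_a[m] * scale_b[n] * float(acc);
    }
  }
}
```

Tensor-wise quantization is the special case with a single scale per operand, and block scaling replaces the per-row and per-column vectors with one scale per tile of the quantized matrix.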

### Changed

* Removed `BlockSize` in `make_kernel` and `CShuffleEpilogueProblem` to support Wave32 in CK_TILE (#2594)

## Composable Kernel 1.1.0 for ROCm 7.1.0

### Added

* Added support for hdim as a multiple of 32 for FMHA (fwd/fwd_splitkv/bwd).
* Added support for the elementwise kernel (see the sketch after this list).
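For orientation, an elementwise kernel applies a scalar functor across corresponding elements of its inputs. The HIP sketch below shows only the general pattern; it is not CK's implementation, and the kernel and functor names are made up for illustration:

```cpp
#include <hip/hip_runtime.h>
#include <cstddef>

// Generic elementwise kernel: out[i] = op(a[i], b[i]) for all i < n,
// using a grid-stride loop so any launch shape covers the whole range.
template <typename T, typename Op>
__global__ void elementwise_kernel(const T* a, const T* b, T* out,
                                   std::size_t n, Op op) {
  for (std::size_t i = blockIdx.x * std::size_t(blockDim.x) + threadIdx.x;
       i < n; i += std::size_t(gridDim.x) * blockDim.x) {
    out[i] = op(a[i], b[i]);
  }
}

// Example functor: elementwise addition.
struct Add {
  __device__ float operator()(float x, float y) const { return x + y; }
};

// Launch example (device pointers d_a, d_b, d_out assumed allocated):
//   elementwise_kernel<<<256, 256>>>(d_a, d_b, d_out, n, Add{});
```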

### Upcoming changes

* Non-grouped convolutions are deprecated. Their functionality is supported by grouped convolution.

## Composable Kernel 1.1.0 for ROCm 7.0.0

### Added

* Added support for bf16, f32, and f16 for 2D and 3D NGCHW grouped convolution backward data.
* Added a fully asynchronous HOST (CPU) arguments copy flow for CK grouped GEMM kernels.
* Added support for the GKCYX layout for grouped convolution forward (NGCHW/GKCYX/NGKHW); the number of instances in the instance factory for NGCHW/GKYXC/NGKHW has been reduced.
* Added support for the GKCYX layout for grouped convolution backward data (NGCHW/GKCYX/NGKHW).
* Added support for a Stream-K version of mixed fp8/bf16 GEMM.
* Added support for Multiple D GEMM.
* Added a GEMM pipeline for microscaling (MX) FP8/FP6/FP4 data types.
* Added support for FP16 2:4 structured sparsity to universal GEMM.
* Added support for Split K for grouped convolution backward data.
* Added logit soft-capping support for fMHA forward kernels (see the sketch after this list).
* Added support for hdim as a multiple of 32 for FMHA (fwd/fwd_splitkv).
* Added benchmarking support for tile engine GEMM.
* Added Ping-pong scheduler support for GEMM operations along the K dimension.
* Added a rotating buffer feature for CK_Tile GEMM.
* Added int8 support for CK_TILE GEMM.
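Logit soft-capping (the fMHA entry above) bounds raw attention scores before the softmax by squashing them through tanh, so extreme logits cannot dominate. A scalar sketch of the transform, with an illustrative cap value; the fused kernels apply this per score inside the attention pipeline:

```cpp
#include <cmath>

// Logit soft-capping: map an unbounded score into (-cap, +cap).
// For |score| much smaller than cap the transform is nearly the
// identity; large scores saturate smoothly instead of swamping
// the softmax.
inline float soft_cap(float score, float cap /* e.g. 30.0f */) {
  return cap * std::tanh(score / cap);
}
```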

### Optimized

* Optimized the GEMM multiply-multiply preshuffle and LDS bypass with packing of KGroup and a better instruction layout (#2166)
* Added the Vectorize Transpose optimization for CK Tile (#2131)
* Added asynchronous copy for gfx950 (#2425)

### Changed

* Removed support for gfx940 and gfx941 targets (#1944)
* Replaced the raw buffer load/store intrinsics with Clang20 built-ins (#1876)
* DL and DPP kernels are now enabled by default.
* The number of instances in the instance factory for grouped convolution forward NGCHW/GKYXC/NGKHW has been reduced.
* The number of instances in the instance factory for grouped convolution backward weight NGCHW/GKYXC/NGKHW has been reduced.
* The number of instances in the instance factory for grouped convolution backward data NGCHW/GKYXC/NGKHW has been reduced.