Unified GEMM and GEMM+GELU on Nvidia Tensor Cores, Intel XMX of PVC and DG2, and Intel AMX of SPR using SYCL joint matrix
- cache tiling of i and j
- cache tiling on k as well (so no reordering is needed)
- data reuse of A and B in physical layer
- Out-of-bounds checking is enabled for PVC with -DOOB
- Prefetch for PVC is enabled under -DPREFETCH
- Both row major and VNNI transform options are supported. For row major, omit -DVNNI
- No reordering and no SLM for DG2/Nvidia
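The cache tiling of i, j, and k listed above can be illustrated with a plain host-side C++ sketch. This is not the SYCL joint matrix kernel itself: the `Tiles` struct, its field names, and the function `tiled_gemm` are hypothetical, and the tile sizes only loosely mirror the MCACHE/NCACHE/KCACHE build parameters. It shows two levels of blocking on i and j, with k blocked at the outer level so that A and B tiles are reused while they are still cache resident.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile sizes (assumption): outer tiles play the role of the
// level-2 cache parameters, inner tiles the role of the level-1 parameters.
struct Tiles {
  std::size_t m2, n2, k2; // outer tiles on i, j, k
  std::size_t m1, n1;     // inner tiles on i, j
};

// Computes C (M x N) += A (M x K) * B (K x N); all matrices are row major.
void tiled_gemm(const std::vector<float> &A, const std::vector<float> &B,
                std::vector<float> &C, std::size_t M, std::size_t N,
                std::size_t K, const Tiles &t) {
  assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
  for (std::size_t i2 = 0; i2 < M; i2 += t.m2)
    for (std::size_t j2 = 0; j2 < N; j2 += t.n2)
      for (std::size_t k2 = 0; k2 < K; k2 += t.k2) {
        // Clamp the outer tile so sizes need not divide M, N, K evenly.
        const std::size_t iEnd = std::min(i2 + t.m2, M);
        const std::size_t jEnd = std::min(j2 + t.n2, N);
        const std::size_t kEnd = std::min(k2 + t.k2, K);
        for (std::size_t i1 = i2; i1 < iEnd; i1 += t.m1)
          for (std::size_t j1 = j2; j1 < jEnd; j1 += t.n1) {
            const std::size_t iIn = std::min(i1 + t.m1, iEnd);
            const std::size_t jIn = std::min(j1 + t.n1, jEnd);
            for (std::size_t i = i1; i < iIn; ++i)
              for (std::size_t k = k2; k < kEnd; ++k) {
                const float a = A[i * K + k]; // reused across the j loop
                for (std::size_t j = j1; j < jIn; ++j)
                  C[i * N + j] += a * B[k * N + j];
              }
          }
      }
}
```

Because k is tiled at the outer level, each C tile is revisited once per k tile and accumulates partial sums, which is why C must be zero-initialized (or hold the values to accumulate into) before the call.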
For maximum performance, the cache and register blocking parameters differ between Nvidia Tensor Cores, AMX, and DPAS on DG2 versus PVC. See the device-specific parameters below:
For M=N=K=X cases, use -DMATRIX_SIZE=X. Otherwise, set each dimension separately, for example: -DMATRIX_M=1024 -DMATRIX_N=6144 -DMATRIX_K=6144
icpx -fsycl -fsycl-targets=nvidia_gpu_sm_80 joint_matrix_fill_k_cache.cpp -DNVIDIA -DMCACHE1=64 -DNCACHE1=64 -DMCACHE2=128 -DNCACHE2=128
icpx -fsycl -fsycl-targets=nvidia_gpu_sm_80 joint_matrix_fill_k_cache.cpp -DMATRIX_SIZE=4096 -DNVIDIA -DMCACHE1=64 -DNCACHE1=64 -DMCACHE2=128 -DNCACHE2=128
icpx -fsycl joint_matrix_fill_k_cache.cpp -DPREFETCH -DOOB
icpx -fsycl joint_matrix_fill_k_cache.cpp -DPREFETCH -DOOB -DMATRIX_SIZE=4096
icpx -fsycl joint_matrix_fill_k_cache.cpp -DNCACHE1=32 -DMCACHE2=128 -DNCACHE2=128 -DKCACHE2=16 -DVNNI
icpx -fsycl joint_matrix_fill_k_cache.cpp -DNCACHE1=32 -DMCACHE2=128 -DNCACHE2=128 -DKCACHE2=16 -DMATRIX_SIZE=4096 -DVNNI
icpx -fsycl joint_matrix_fill_k_cache.cpp -DNCACHE1=32 -DKCACHE1=32 -DMCACHE2=128 -DNCACHE2=128 -DKCACHE2=1024 -DVNNI
icpx -fsycl joint_matrix_fill_k_cache.cpp -DNCACHE1=32 -DKCACHE1=32 -DMCACHE2=256 -DNCACHE2=256 -DKCACHE2=1024 -DMATRIX_SIZE=4096 -DVNNI
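As a rough illustration of how the size macros in the build lines above might be consumed, here is a minimal sketch. It is an assumption about one possible arrangement, not the logic of joint_matrix_fill_k_cache.cpp: the constant names `kM`, `kN`, `kK` are hypothetical, and the fallback values are simply borrowed from the example flags rather than being the sample's real defaults.

```cpp
#include <cstddef>

// -DMATRIX_SIZE=X selects the square case M = N = K = X.
#if defined(MATRIX_SIZE)
constexpr std::size_t kM = MATRIX_SIZE, kN = MATRIX_SIZE, kK = MATRIX_SIZE;
// Otherwise all three of -DMATRIX_M, -DMATRIX_N, -DMATRIX_K are expected.
#elif defined(MATRIX_M) && defined(MATRIX_N) && defined(MATRIX_K)
constexpr std::size_t kM = MATRIX_M, kN = MATRIX_N, kK = MATRIX_K;
#else
// Assumed fallback (illustrative only): the sizes from the example flags.
constexpr std::size_t kM = 1024, kN = 6144, kK = 6144;
#endif

static_assert(kM > 0 && kN > 0 && kK > 0, "matrix dimensions must be nonzero");
```

Resolving the sizes at compile time lets the blocking parameters be checked against the matrix dimensions with further `static_assert`s, if desired.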
To run on Nvidia GPU: ONEAPI_DEVICE_SELECTOR=cuda:0 ./a.out
To run on Intel GPU with the large register file: SYCL_PROGRAM_COMPILE_OPTIONS="-ze-opt-large-register-file" ./a.out
To run on CPU: DPCPP_CPU_NUM_CUS=112 ./a.out
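For readers unfamiliar with the VNNI transform selected by -DVNNI, the following is a minimal host-side sketch of one common form of that layout. It assumes 16-bit elements (packing factor 2), which this text does not state explicitly. The function name `vnni_pack` and the use of `std::uint16_t` as a stand-in for the element type are illustrative assumptions, not taken from joint_matrix_fill_k_cache.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Packs a row-major K x N matrix B into a (K / factor) x (N * factor) matrix
// in which `factor` consecutive rows of each column are stored contiguously.
// factor = 2 corresponds to 16-bit elements (assumption for this sketch).
std::vector<std::uint16_t> vnni_pack(const std::vector<std::uint16_t> &B,
                                     std::size_t K, std::size_t N,
                                     std::size_t factor = 2) {
  assert(factor > 0 && K % factor == 0 && B.size() == K * N);
  std::vector<std::uint16_t> packed(K * N);
  for (std::size_t k = 0; k < K; ++k)
    for (std::size_t j = 0; j < N; ++j) {
      const std::size_t row = k / factor;              // packed row
      const std::size_t col = j * factor + k % factor; // packed column
      packed[row * (N * factor) + col] = B[k * N + j];
    }
  return packed;
}
```

For a 4 x 3 input holding the values 0 through 11 in row-major order, the packed result is 0, 3, 1, 4, 2, 5 in the first row and 6, 9, 7, 10, 8, 11 in the second: each column's two neighboring k values end up adjacent in memory.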