Task04 Кудрявцев Федор HSE #147
Conversation
src/cl/matrix_transpose.cl
Outdated
if (j >= m)
    return;

tile[local_i][local_j] = a[i * k + j];
Non-coalesced access to global memory. Also, the local array serves no purpose when used this way: each thread reads back the same value it loaded itself.
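One common way to address both points is sketched below as a CPU emulation in plain C, not as the kernel from this PR. The names `transpose_group`, `transpose_tiled`, `TILE`, `lx` and `ly` are hypothetical; `lx` stands in for `get_local_id(0)`, the index assumed to vary fastest between neighbouring work-items. The tile is read back transposed, so each work-item consumes a value loaded by a different one, and both the global read and the global write move along consecutive addresses.

```c
#include <assert.h>

#define TILE 4  /* hypothetical tile size, kept small for the emulation */

/* Emulates one work-group. a is rows x cols, at is cols x rows, both row-major. */
static void transpose_group(const float *a, float *at, int rows, int cols,
                            int group_x, int group_y)
{
    float tile[TILE][TILE];

    /* Phase 1: neighbouring lx read neighbouring addresses of a. */
    for (int ly = 0; ly < TILE; ly++)
        for (int lx = 0; lx < TILE; lx++) {
            int col = group_x * TILE + lx;
            int row = group_y * TILE + ly;
            tile[ly][lx] = a[row * cols + col];
        }

    /* A real kernel would need barrier(CLK_LOCAL_MEM_FENCE) here. */

    /* Phase 2: read the tile transposed, so neighbouring lx also write
       neighbouring addresses of at. */
    for (int ly = 0; ly < TILE; ly++)
        for (int lx = 0; lx < TILE; lx++) {
            int out_col = group_y * TILE + lx;  /* column of at = row of a */
            int out_row = group_x * TILE + ly;  /* row of at = column of a */
            at[out_row * rows + out_col] = tile[lx][ly];
        }
}

/* Runs every work-group in turn; sizes must be multiples of TILE. */
void transpose_tiled(const float *a, float *at, int rows, int cols)
{
    assert(rows % TILE == 0 && cols % TILE == 0);
    for (int gy = 0; gy < rows / TILE; gy++)
        for (int gx = 0; gx < cols / TILE; gx++)
            transpose_group(a, at, rows, cols, gx, gy);
}
```

In an actual kernel the two loop nests collapse into the work-item body, with the barrier between the local store and the local load.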
src/cl/matrix_transpose.cl
Outdated
if (j >= m)
    return;

tile[biased_i][biased_j] = a[i * k + j];
Non-coalesced access to global memory. Also, the local array serves no purpose when used this way: each thread reads back the same value it loaded itself.
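The biased indices in this variant presumably target local-memory bank conflicts, which only matter once the tile is actually read along a column. The sketch below uses a simplified model to show why padding the tile row by one element helps in that case. The bank model (32 banks of 4-byte words, bank = word index modulo 32) is an assumption and may not match the Iris Xe hardware; `max_bank_conflict` and `NUM_BANKS` are hypothetical names.

```c
#include <assert.h>

#define NUM_BANKS 32  /* assumption: 32 banks, one 4-byte word each */

/* Worst-case number of work-items that hit the same bank when `lanes`
   neighbouring work-items read one column of a row-major float tile
   whose rows are `row_stride` elements long. 1 means conflict-free. */
int max_bank_conflict(int lanes, int row_stride)
{
    int hits[NUM_BANKS] = {0};
    int worst = 0;

    assert(lanes > 0 && row_stride > 0);
    for (int lane = 0; lane < lanes; lane++) {
        int word = lane * row_stride;  /* offset of tile[lane][0] */
        int bank = word % NUM_BANKS;
        hits[bank]++;
        if (hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

Under this model a 32 x 32 tile read by column serialises all 32 lanes on one bank, while a 32 x 33 tile spreads them over 32 different banks.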
src/cl/matrix_multiplication.cl
Outdated
float sum = 0.0f;
for (int tileK = 0; tileK * TILE_SIZE < K; tileK++) {
    tileA[local_i][local_j] = a[i * K + local_j + tileK * TILE_SIZE];
Non-coalesced access.
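Whether this load is coalesced depends on which index follows dimension 0 of the work-group. The small check below computes the address distance between two work-items that are neighbours along dimension 0 for the index expression from the reviewed line. It assumes, without confirmation from the kernel source, that `i` comes from `get_global_id(0)`; `a_index` and `neighbour_stride` are hypothetical helper names.

```c
#include <assert.h>

#define TILE_SIZE 16

/* Index expression from the reviewed line:
   a[i * K + local_j + tileK * TILE_SIZE]. */
static int a_index(int i, int local_j, int tileK, int K)
{
    return i * K + local_j + tileK * TILE_SIZE;
}

/* Distance in elements between the addresses read by two work-items that
   are neighbours along dimension 0.
   row_is_dim0 != 0 models i = get_global_id(0) (the assumed current mapping),
   row_is_dim0 == 0 models local_j = get_local_id(0). */
int neighbour_stride(int row_is_dim0, int K)
{
    assert(K > 0);
    int i = 5, local_j = 3, tileK = 2;  /* an arbitrary work-item */
    int base = a_index(i, local_j, tileK, K);
    int next = row_is_dim0 ? a_index(i + 1, local_j, tileK, K)
                           : a_index(i, local_j + 1, tileK, K);
    return next - base;
}
```

With K = 1024 the first mapping gives a stride of 1024 elements between neighbours, the second a stride of 1, which is the coalesced case.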
src/cl/matrix_multiplication.cl
Outdated
for (int tileK = 0; tileK * TILE_SIZE < K; tileK++) {
    for (int thread = 0; thread < WORK_PER_THREAD; thread++) {
        tileA[local_i * WORK_PER_THREAD + thread][local_j] = a[(i * WORK_PER_THREAD + thread) * K + local_j + tileK * TILE_SIZE];
Non-coalesced access.
src/cl/matrix_multiplication.cl
Outdated
}

for (int thread = 0; thread < WORK_PER_THREAD; thread++) {
    c[(i * WORK_PER_THREAD + thread) * N + j] = sum[thread];
Non-coalesced access.
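A possible layout that keeps the tile loads and the final store coalesced while still computing several results per work-item is sketched below as a CPU emulation in plain C. It is not the kernel from this PR. `lx` emulates `get_local_id(0)` and selects the column, `ly` emulates `get_local_id(1)`, and each work-item handles rows `ly`, `ly + RTS`, and so on. The names `matmul_wpt`, `TS`, `WPT` and `RTS` are hypothetical, and the sizes are kept small so the emulation stays cheap.

```c
#include <assert.h>

#define TS  4            /* tile size, small for the emulation */
#define WPT 2            /* results computed by each work-item */
#define RTS (TS / WPT)   /* work-group height once each item does WPT rows */

/* C[M x N] = A[M x K] * B[K x N], all row-major. */
void matmul_wpt(const float *a, const float *b, float *c, int M, int K, int N)
{
    assert(M % TS == 0 && K % TS == 0 && N % TS == 0 && TS % WPT == 0);

    for (int gy = 0; gy < M / TS; gy++)
    for (int gx = 0; gx < N / TS; gx++) {
        float tileA[TS][TS], tileB[TS][TS];
        float sum[RTS][TS][WPT] = {{{0.0f}}};  /* per-work-item accumulators */

        for (int t = 0; t < K / TS; t++) {
            /* Load: neighbouring lx read neighbouring addresses of a and b. */
            for (int ly = 0; ly < RTS; ly++)
            for (int lx = 0; lx < TS; lx++)
            for (int w = 0; w < WPT; w++) {
                int row = ly + w * RTS;
                tileA[row][lx] = a[(gy * TS + row) * K + t * TS + lx];
                tileB[row][lx] = b[(t * TS + row) * N + gx * TS + lx];
            }
            /* barrier(CLK_LOCAL_MEM_FENCE) in a real kernel */

            for (int ly = 0; ly < RTS; ly++)
            for (int lx = 0; lx < TS; lx++)
            for (int w = 0; w < WPT; w++)
                for (int k = 0; k < TS; k++)
                    sum[ly][lx][w] += tileA[ly + w * RTS][k] * tileB[k][lx];
            /* barrier(CLK_LOCAL_MEM_FENCE) before the next tile is loaded */
        }

        /* Store: neighbouring lx write neighbouring columns of c. */
        for (int ly = 0; ly < RTS; ly++)
        for (int lx = 0; lx < TS; lx++)
        for (int w = 0; w < WPT; w++)
            c[(gy * TS + ly + w * RTS) * N + gx * TS + lx] = sum[ly][lx][w];
    }
}
```

The key choice is that the work-per-thread loop strides over rows, while the fastest-varying local index always addresses the column, so every global access in a row of work-items touches consecutive floats.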
src/main_matrix_multiplication.cpp
Outdated
std::string kernel_name = "matrix_multiplication_local_wpt";
gpu::WorkSize work_size(0, 0/*TODO*/);
gpu::WorkSize work_size(tile_size / wpt, tile_size, M / wpt, N);
Local output

Transposition

OpenCL devices:
  Device #0: CPU. 13th Gen Intel(R) Core(TM) i7-13700H. Intel(R) Corporation. Total memory: 16003 Mb
  Device #1: GPU. Intel(R) Iris(R) Xe Graphics. Total memory: 6401 Mb
Using device #1: GPU. Intel(R) Iris(R) Xe Graphics. Total memory: 6401 Mb
Data generated for M=4096, K=4096
[matrix_transpose_naive]
    GPU: 0.00706667+-0.000249444 s
    GPU: 2374.13 millions/s
[matrix_transpose_local_bad_banks]
    GPU: 0.00485+-0.000510718 s
    GPU: 3459.22 millions/s
[matrix_transpose_local_good_banks]
    GPU: 0.004+-0 s
    GPU: 4194.3 millions/s

Multiplication

OpenCL devices:
  Device #0: CPU. 13th Gen Intel(R) Core(TM) i7-13700H. Intel(R) Corporation. Total memory: 16003 Mb
  Device #1: GPU. Intel(R) Iris(R) Xe Graphics. Total memory: 6401 Mb
Using device #1: GPU. Intel(R) Iris(R) Xe Graphics. Total memory: 6401 Mb
Data generated for M=1024, K=1024, N=1024
CPU: 6.475+-0 s
CPU: 0.30888 GFlops
[local wpt, ts=4, wpt=2]
    GPU: 0.0616667+-0.00188562 s
    GPU: 32.4324 GFlops
    Average difference: 0.000196008%
[local wpt, ts=4, wpt=4]
    GPU: 0.0955+-0.00180278 s
    GPU: 20.9424 GFlops
    Average difference: 0.000196008%
[local wpt, ts=8, wpt=2]
    GPU: 0.00966667+-0.000471405 s
    GPU: 206.897 GFlops
    Average difference: 0.000196008%
[local wpt, ts=8, wpt=4]
    GPU: 0.0176667+-0.000471405 s
    GPU: 113.208 GFlops
    Average difference: 0.000196008%
[local wpt, ts=8, wpt=8]
    GPU: 0.017+-0.00057735 s
    GPU: 117.647 GFlops
    Average difference: 0.000196008%
[local wpt, ts=16, wpt=2]
    GPU: 0.0106667+-0.000471405 s
    GPU: 187.5 GFlops
    Average difference: 0.000196008%
[local wpt, ts=16, wpt=4]
    GPU: 0.00833333+-0.000471405 s
    GPU: 240 GFlops
    Average difference: 0.000196008%
[local wpt, ts=16, wpt=8]
    GPU: 0.00783333+-0.000372678 s
    GPU: 255.319 GFlops
    Average difference: 0.000196008%
[local wpt, ts=16, wpt=16]
    GPU: 0.0115+-0.0005 s
    GPU: 173.913 GFlops
    Average difference: 0.000196008%
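The throughput figures in the log can be reproduced from the timings with the arithmetic below. The formulas are inferred from the printed numbers, not taken from the benchmark source: the GFlops values match 2·M·K·N operations divided by 1024³ rather than 10⁹, and the transpose figures match M·K elements divided by 10⁶. The function names are hypothetical.

```c
#include <assert.h>

/* GFlops as they appear in the log: 2*M*K*N operations per run,
   with "giga" taken as 1024^3 (inferred from the printed values). */
double gflops(int M, int K, int N, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * M * K * N / (1024.0 * 1024.0 * 1024.0) / seconds;
}

/* Transpose throughput in millions of elements per second. */
double millions_per_second(int M, int K, double seconds)
{
    assert(seconds > 0.0);
    return (double)M * K / 1e6 / seconds;
}
```

For example, `gflops(1024, 1024, 1024, 0.00783333)` gives roughly 255.32, matching the `ts=16, wpt=8` entry, and `millions_per_second(4096, 4096, 0.00706667)` gives roughly 2374.13, matching `matrix_transpose_naive`.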
Task accepted
Local output
Transposition
Multiplication
Github CI output
Transposition
Multiplication