Task04 Кудрявцев Федор HSE #147

Closed
wants to merge 2 commits into from

@koufesser commented Oct 5, 2024

Local output

Transpose

C:\Users\koufe\GPGPUTasks2024\cmake-build-debug\matrix_transpose.exe 1
OpenCL devices:
  Device #0: CPU. 13th Gen Intel(R) Core(TM) i7-13700H. Intel(R) Corporation. Total memory: 16003 Mb
  Device #1: GPU. Intel(R) Iris(R) Xe Graphics. Total memory: 6401 Mb
Using device #1: GPU. Intel(R) Iris(R) Xe Graphics. Total memory: 6401 Mb
Data generated for M=4096, K=4096
[matrix_transpose_naive]
    GPU: 0.00703333+-0.000515321 s
    GPU: 2385.39 millions/s
[matrix_transpose_local_bad_banks]
    GPU: 0.00731667+-0.000465176 s
    GPU: 2293.01 millions/s
[matrix_transpose_local_good_banks]
    GPU: 0.00723333+-0.000422953 s
    GPU: 2319.43 millions/s

Multiplication

C:\Users\koufe\GPGPUTasks2024\cmake-build-debug\matrix_multiplication.exe 1
OpenCL devices:
  Device #0: CPU. 13th Gen Intel(R) Core(TM) i7-13700H. Intel(R) Corporation. Total memory: 16003 Mb
  Device #1: GPU. Intel(R) Iris(R) Xe Graphics. Total memory: 6401 Mb
Using device #1: GPU. Intel(R) Iris(R) Xe Graphics. Total memory: 6401 Mb
Data generated for M=1024, K=1024, N=1024
CPU: 5.769+-0 s
CPU: 0.346681 GFlops
[naive, ts=4]
    GPU: 0.0505+-0.00111803 s
    GPU: 39.604 GFlops
    Average difference: 0.000196008%
[naive, ts=8]
    GPU: 0.0278333+-0.000372678 s
    GPU: 71.8563 GFlops
    Average difference: 0.000196008%
[naive, ts=16]
    GPU: 0.021+-0.00057735 s
    GPU: 95.2381 GFlops
    Average difference: 0.000196008%
[local, ts=4]
    GPU: 0.0338333+-0.00146249 s
    GPU: 59.1133 GFlops
    Average difference: 0.000196008%
[local, ts=8]
    GPU: 0.0196667+-0.000471405 s
    GPU: 101.695 GFlops
    Average difference: 0.000196008%
[local, ts=16]
    GPU: 0.0265+-0.000957427 s
    GPU: 75.4717 GFlops
    Average difference: 0.000196008%
[local wpt, ts=4, wpt=2]
    GPU: 0.0691667+-0.00226691 s
    GPU: 28.9157 GFlops
    Average difference: 0.000196008%
[local wpt, ts=4, wpt=4]
    GPU: 0.111833+-0.00681705 s
    GPU: 17.8838 GFlops
    Average difference: 0.000196008%
[local wpt, ts=8, wpt=2]
    GPU: 0.0123333+-0.000471405 s
    GPU: 162.162 GFlops
    Average difference: 0.000196008%
[local wpt, ts=8, wpt=4]
    GPU: 0.0256667+-0.00449691 s
    GPU: 77.9221 GFlops
    Average difference: 0.000196008%
[local wpt, ts=8, wpt=8]
    GPU: 0.0201667+-0.000897527 s
    GPU: 99.1736 GFlops
    Average difference: 0.000196008%
[local wpt, ts=16, wpt=2]
    GPU: 0.0151667+-0.000687184 s
    GPU: 131.868 GFlops
    Average difference: 0.000196008%
[local wpt, ts=16, wpt=4]
    GPU: 0.01+-0.00057735 s
    GPU: 200 GFlops
    Average difference: 0.000196008%
[local wpt, ts=16, wpt=8]
    GPU: 0.00883333+-0.000372678 s
    GPU: 226.415 GFlops
    Average difference: 0.000196008%
[local wpt, ts=16, wpt=16]
    GPU: 0.0123333+-0.000471405 s
    GPU: 162.162 GFlops
    Average difference: 0.000196008%

GitHub CI output

Transpose

OpenCL devices:
  Device #0: CPU. AMD EPYC 7763 64-Core Processor                . Intel(R) Corporation. Total memory: 15991 Mb
Using device #0: CPU. AMD EPYC 7763 64-Core Processor                . Intel(R) Corporation. Total memory: 15991 Mb
Data generated for M=4096, K=4096
[matrix_transpose_naive]
    GPU: 0.0171868+-0.00135369 s
    GPU: 976.169 millions/s
[matrix_transpose_local_bad_banks]
    GPU: 0.0238586+-0.000391206 s
    GPU: 703.194 millions/s
[matrix_transpose_local_good_banks]
    GPU: 0.0280841+-0.000134503 s
    GPU: 597.393 millions/s

Multiplication

OpenCL devices:
  Device #0: CPU. AMD EPYC 7763 64-Core Processor                . Intel(R) Corporation. Total memory: 15991 Mb
Using device #0: CPU. AMD EPYC 7763 64-Core Processor                . Intel(R) Corporation. Total memory: 15991 Mb
Data generated for M=1024, K=1024, N=1024
CPU: 6.25+-0 s
CPU: 0.32 GFlops
[naive, ts=4]
    GPU: 0.246773+-0.000759867 s
    GPU: 8.10463 GFlops
    Average difference: 0.000149043%
[naive, ts=8]
    GPU: 0.261947+-0.0029847 s
    GPU: 7.63512 GFlops
    Average difference: 0.000149043%
[naive, ts=16]
    GPU: 0.267871+-0.00295813 s
    GPU: 7.46629 GFlops
    Average difference: 0.000149043%
[local, ts=4]
    GPU: 0.555525+-0.0012201 s
    GPU: 3.6002 GFlops
    Average difference: 0.000149043%
[local, ts=8]
    GPU: 0.315502+-0.00441561 s
    GPU: 6.33911 GFlops
    Average difference: 0.000149043%
[local, ts=16]
    GPU: 0.285373+-0.00140375 s
    GPU: 7.00836 GFlops
    Average difference: 0.000149043%
[local wpt, ts=4, wpt=2]
    GPU: 0.5184+-0.00304857 s
    GPU: 3.85802 GFlops
    Average difference: 0.000149043%
[local wpt, ts=4, wpt=4]
    GPU: 0.438042+-0.00168949 s
    GPU: 4.56577 GFlops
    Average difference: 0.000149043%
[local wpt, ts=8, wpt=2]
    GPU: 0.31639+-0.00250521 s
    GPU: 6.32132 GFlops
    Average difference: 0.000149043%
[local wpt, ts=8, wpt=4]
    GPU: 0.283483+-0.00238618 s
    GPU: 7.05509 GFlops
    Average difference: 0.000149043%
[local wpt, ts=8, wpt=8]
    GPU: 0.261017+-0.00169234 s
    GPU: 7.66233 GFlops
    Average difference: 0.000149043%
[local wpt, ts=16, wpt=2]
    GPU: 0.220607+-0.000865029 s
    GPU: 9.0659 GFlops
    Average difference: 0.000149043%
[local wpt, ts=16, wpt=4]
    GPU: 0.194112+-0.000522242 s
    GPU: 10.3033 GFlops
    Average difference: 0.000149043%
[local wpt, ts=16, wpt=8]
    GPU: 0.285621+-0.00175619 s
    GPU: 7.00228 GFlops
    Average difference: 0.000149043%
[local wpt, ts=16, wpt=16]
    GPU: 0.337311+-0.00248971 s
    GPU: 5.92924 GFlops
    Average difference: 0.000149043%

if (j >= m)
return;

tile[local_i][local_j] = a[i * k + j];
Collaborator

Non-coalesced access to global memory. Also, the local array is pointless when used like this: each thread reads back the same value it loaded itself.

if (j >= m)
return;

tile[biased_i][biased_j] = a[i * k + j];
Collaborator

Non-coalesced access to global memory. Also, the local array is pointless when used like this: each thread reads back the same value it loaded itself.
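The two comments on the transpose kernels can be illustrated with a CPU-side emulation of a tiled transpose. This is only a sketch, not the PR's kernel: `TILE`, `transpose_tiled`, and the loop structure are hypothetical, and it assumes a row-major `m` x `k` input with `lx` standing in for `get_local_id(0)`, the index that varies fastest between neighbouring work-items.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t TILE = 16; // hypothetical work-group edge

// Emulates a tiled transpose: `a` is m x k (row-major), the result is k x m.
std::vector<float> transpose_tiled(const std::vector<float>& a, std::size_t m, std::size_t k) {
    std::vector<float> at(m * k);
    for (std::size_t gy = 0; gy < m; gy += TILE) {
        for (std::size_t gx = 0; gx < k; gx += TILE) {
            // One extra column of padding, so that column-wise reads of the tile
            // would fall into different local-memory banks on a GPU.
            float tile[TILE][TILE + 1];

            // Phase 1: work-item (lx, ly) loads a[(gy + ly) * k + gx + lx].
            // Neighbours in lx read neighbouring addresses (coalesced read).
            for (std::size_t ly = 0; ly < TILE; ++ly)
                for (std::size_t lx = 0; lx < TILE; ++lx)
                    if (gy + ly < m && gx + lx < k)
                        tile[ly][lx] = a[(gy + ly) * k + (gx + lx)];

            // A kernel would need barrier(CLK_LOCAL_MEM_FENCE) here.

            // Phase 2: work-item (lx, ly) stores tile[lx][ly], a value loaded by a
            // different work-item. Neighbours in lx again write neighbouring
            // addresses of `at` (coalesced write).
            for (std::size_t ly = 0; ly < TILE; ++ly)
                for (std::size_t lx = 0; lx < TILE; ++lx)
                    if (gx + ly < k && gy + lx < m)
                        at[(gx + ly) * m + (gy + lx)] = tile[lx][ly];
        }
    }
    return at;
}
```

The key point is that the swap happens inside the tile (`tile[lx][ly]`), so both global accesses keep `lx` as the fastest-changing part of the address.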


float sum = 0.0f;
for (int tileK = 0; tileK * TILE_SIZE < K; tileK++) {
tileA[local_i][local_j] = a[i * K + local_j + tileK * TILE_SIZE];
Collaborator

Non-coalesced access.
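To see why the reviewer flags this load, it helps to compute the address stride between two work-items that are neighbours in dimension 0. The sketch below assumes that `i` in the PR comes from `get_global_id(0)` (the snippet does not show this, so it is an assumption); the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Index of the element of A that a work-item loads into the tile, for global
// row `row`, in-tile column `col` and tile number `tileK`.
constexpr std::size_t a_index(std::size_t row, std::size_t col, std::size_t tileK,
                              std::size_t K, std::size_t tile) {
    return row * K + tileK * tile + col;
}

// Stride (in elements) between neighbours in dimension 0 when dimension 0
// selects the row of A, as assumed for the PR's kernel.
constexpr std::size_t stride_rows_on_dim0(std::size_t K, std::size_t tile) {
    return a_index(1, 0, 0, K, tile) - a_index(0, 0, 0, K, tile);
}

// The same stride when dimension 0 selects the in-tile column instead.
constexpr std::size_t stride_cols_on_dim0(std::size_t K, std::size_t tile) {
    return a_index(0, 1, 0, K, tile) - a_index(0, 0, 0, K, tile);
}
```

With K=1024 the first mapping jumps 1024 floats between neighbouring work-items, while the second touches consecutive floats, which is the pattern the hardware can merge into wide memory transactions.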


for (int tileK = 0; tileK * TILE_SIZE < K; tileK++) {
for (int thread = 0; thread < WORK_PER_THREAD; thread++) {
tileA[local_i * WORK_PER_THREAD + thread][local_j] = a[(i * WORK_PER_THREAD + thread) * K + local_j + tileK * TILE_SIZE];
Collaborator

Non-coalesced access.

}

for (int thread = 0; thread < WORK_PER_THREAD; thread++) {
c[(i * WORK_PER_THREAD + thread) * N + j] = sum[thread];
Collaborator

Non-coalesced access.
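One common way to keep both the tile loads and the final stores coalesced in a work-per-thread kernel is to let dimension 0 run along the columns and give each work-item several rows spaced `TS / WPT` apart. The CPU emulation below is a sketch of that layout under stated assumptions, not the PR's code: `TS`, `WPT`, `RTS`, and `matmul_wpt` are hypothetical names, matrices are row-major, and sizes are assumed to be multiples of `TS`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t TS = 4;         // hypothetical tile size
constexpr std::size_t WPT = 2;        // hypothetical work per thread
constexpr std::size_t RTS = TS / WPT; // rows of work-items in one group

// C (M x N) = A (M x K) * B (K x N). Work-item (lx, ly) of a TS x RTS group
// computes WPT outputs: rows ly + w * RTS, column lx of the group's tile.
std::vector<float> matmul_wpt(const std::vector<float>& a, const std::vector<float>& b,
                              std::size_t M, std::size_t K, std::size_t N) {
    std::vector<float> c(M * N, 0.0f);
    for (std::size_t gy = 0; gy < M; gy += TS) {
        for (std::size_t gx = 0; gx < N; gx += TS) {
            float acc[RTS][TS][WPT] = {}; // private accumulators per work-item
            for (std::size_t t = 0; t < K; t += TS) {
                float tileA[TS][TS], tileB[TS][TS];
                // Loads: lx is the fastest-changing part of every global index.
                for (std::size_t ly = 0; ly < RTS; ++ly)
                    for (std::size_t lx = 0; lx < TS; ++lx)
                        for (std::size_t w = 0; w < WPT; ++w) {
                            const std::size_t row = ly + w * RTS;
                            tileA[row][lx] = a[(gy + row) * K + t + lx];
                            tileB[row][lx] = b[(t + row) * N + gx + lx];
                        }
                // barrier(CLK_LOCAL_MEM_FENCE) in a kernel.
                for (std::size_t ly = 0; ly < RTS; ++ly)
                    for (std::size_t lx = 0; lx < TS; ++lx)
                        for (std::size_t w = 0; w < WPT; ++w)
                            for (std::size_t kk = 0; kk < TS; ++kk)
                                acc[ly][lx][w] += tileA[ly + w * RTS][kk] * tileB[kk][lx];
                // barrier(CLK_LOCAL_MEM_FENCE) before the next tile overwrites the arrays.
            }
            // Stores: neighbours in lx write neighbouring elements of a row of C.
            for (std::size_t ly = 0; ly < RTS; ++ly)
                for (std::size_t lx = 0; lx < TS; ++lx)
                    for (std::size_t w = 0; w < WPT; ++w)
                        c[(gy + ly + w * RTS) * N + gx + lx] = acc[ly][lx][w];
        }
    }
    return c;
}
```

Compared with the snippets under review, the row index is `ly + w * RTS` rather than `i * WORK_PER_THREAD + thread`, and the column index always comes from dimension 0.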

std::string kernel_name = "matrix_multiplication_local_wpt";
gpu::WorkSize work_size(0, 0/*TODO*/);
gpu::WorkSize work_size(tile_size / wpt, tile_size, M /wpt, N);
Collaborator

This provokes non-coalesced access even when the kernel itself is written correctly.
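A launch configuration that avoids the problem would keep dimension 0 at the full tile width along the columns of C and shrink dimension 1 by `wpt` instead. The sketch below uses a hypothetical `WorkSize2D` struct as a stand-in for `gpu::WorkSize`, assuming its four arguments are (group size 0, group size 1, global size 0, global size 1), as the PR's call suggests.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for gpu::WorkSize(group0, group1, global0, global1).
struct WorkSize2D {
    std::size_t group0, group1, global0, global1;
};

// Dimension 0 walks along a row of C (N columns, full tile width);
// dimension 1 covers the rows and is divided by wpt.
inline WorkSize2D wpt_work_size(std::size_t tile_size, std::size_t wpt,
                                std::size_t M, std::size_t N) {
    return WorkSize2D{tile_size, tile_size / wpt, N, M / wpt};
}
```

With `tile_size = 16` and `wpt = 4` this gives 16 x 4 groups over a 1024 x 256 grid, so 16 consecutive work-items in dimension 0 can address 16 consecutive floats of a row.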
Screenshot from 2024-10-12 21-52-31

@koufesser
Author

Local output

Transpose

OpenCL devices:
  Device #0: CPU. 13th Gen Intel(R) Core(TM) i7-13700H. Intel(R) Corporation. Total memory: 16003 Mb
  Device #1: GPU. Intel(R) Iris(R) Xe Graphics. Total memory: 6401 Mb
Using device #1: GPU. Intel(R) Iris(R) Xe Graphics. Total memory: 6401 Mb
Data generated for M=4096, K=4096
[matrix_transpose_naive]
    GPU: 0.00706667+-0.000249444 s
    GPU: 2374.13 millions/s
[matrix_transpose_local_bad_banks]
    GPU: 0.00485+-0.000510718 s
    GPU: 3459.22 millions/s
[matrix_transpose_local_good_banks]
    GPU: 0.004+-0 s
    GPU: 4194.3 millions/s

Multiplication

OpenCL devices:
  Device #0: CPU. 13th Gen Intel(R) Core(TM) i7-13700H. Intel(R) Corporation. Total memory: 16003 Mb
  Device #1: GPU. Intel(R) Iris(R) Xe Graphics. Total memory: 6401 Mb
Using device #1: GPU. Intel(R) Iris(R) Xe Graphics. Total memory: 6401 Mb
Data generated for M=1024, K=1024, N=1024
CPU: 6.475+-0 s
CPU: 0.30888 GFlops
[local wpt, ts=4, wpt=2]
    GPU: 0.0616667+-0.00188562 s
    GPU: 32.4324 GFlops
    Average difference: 0.000196008%
[local wpt, ts=4, wpt=4]
    GPU: 0.0955+-0.00180278 s
    GPU: 20.9424 GFlops
    Average difference: 0.000196008%
[local wpt, ts=8, wpt=2]
    GPU: 0.00966667+-0.000471405 s
    GPU: 206.897 GFlops
    Average difference: 0.000196008%
[local wpt, ts=8, wpt=4]
    GPU: 0.0176667+-0.000471405 s
    GPU: 113.208 GFlops
    Average difference: 0.000196008%
[local wpt, ts=8, wpt=8]
    GPU: 0.017+-0.00057735 s
    GPU: 117.647 GFlops
    Average difference: 0.000196008%
[local wpt, ts=16, wpt=2]
    GPU: 0.0106667+-0.000471405 s
    GPU: 187.5 GFlops
    Average difference: 0.000196008%
[local wpt, ts=16, wpt=4]
    GPU: 0.00833333+-0.000471405 s
    GPU: 240 GFlops
    Average difference: 0.000196008%
[local wpt, ts=16, wpt=8]
    GPU: 0.00783333+-0.000372678 s
    GPU: 255.319 GFlops
    Average difference: 0.000196008%
[local wpt, ts=16, wpt=16]
    GPU: 0.0115+-0.0005 s
    GPU: 173.913 GFlops
    Average difference: 0.000196008%

@simiyutin
Collaborator

Task accepted

@simiyutin simiyutin closed this Jan 21, 2025