Task03 Кудрявцев Федор HSE #130
Conversation
Local output

C:\Users\koufe\GPGPUTasks2024\cmake-build-debug\sum.exe 1
CPU: 0.158+-0.00416333 s
CPU: 632.911 millions/s
CPU OMP: 0.0175+-0.0005 s
CPU OMP: 5714.29 millions/s
OpenCL devices:
Device #0: CPU. 13th Gen Intel(R) Core(TM) i7-13700H. Intel(R) Corporation. Total memory: 16003 Mb
Device #1: GPU. Intel(R) Iris(R) Xe Graphics. Total memory: 6401 Mb
Using device #1: GPU. Intel(R) Iris(R) Xe Graphics. Total memory: 6401 Mb
atomic_sum
GPU: 0.0103333+-0.000942809 s
GPU: 9677.42 millions/s
cycle_sum
GPU: 0.0258333+-0.000687184 s
GPU: 3870.97 millions/s
cycle_coalesced_sum
GPU: 0.0185+-0.000763763 s
GPU: 5405.41 millions/s
local_mem_sum
GPU: 0.0363333+-0.000942809 s
GPU: 2752.29 millions/s
tree_sum
GPU: 0.0136667+-0.000471405 s
GPU: 7317.07 millions/s

GitHub CI output
CPU: 0.032211+-0.00011248 s
CPU: 3104.53 millions/s
CPU OMP: 0.0177532+-0.000364107 s
CPU OMP: 5632.8 millions/s
OpenCL devices:
Device #0: CPU. AMD EPYC 7763 64-Core Processor. Intel(R) Corporation. Total memory: 15991 Mb
Using device #0: CPU. AMD EPYC 7763 64-Core Processor. Intel(R) Corporation. Total memory: 15991 Mb
atomic_sum
GPU: 1.47011+-0.000636482 s
GPU: 68.0223 millions/s
cycle_sum
GPU: 1.85374+-0.000738618 s
GPU: 53.9449 millions/s
cycle_coalesced_sum
GPU: 1.5364+-0.00325968 s
GPU: 65.0872 millions/s
local_mem_sum
GPU: 0.037967+-0.000127016 s
GPU: 2633.87 millions/s
tree_sum
GPU: 0.181019+-0.000599893 s
GPU: 552.428 millions/s

Locally, the simplest variant, atomic_sum, was fastest, followed by the tree variant and the coalesced one. As expected, the coalesced and tree variants are much faster than the plain loop and the local-memory variant. On the CI server everything ran on the CPU, and the only variant that did not lose speed there was local_mem_sum. tree_sum is much slower than local memory there, and the gap between the coalesced and non-coalesced loops is much smaller.
src/main_sum.cpp
Outdated
// gpu::Device device = gpu::chooseGPUDevice(argc, argv);
gpu::Device device = gpu::chooseGPUDevice(argc, argv);
run(device, "atomic_sum", n, reference_sum, benchmarkingIters, as);
run(device, "cycle_sum", n, reference_sum, benchmarkingIters, as);
In this and the next version each thread does more work; does that affect the work-space configuration in any way? (And the configuration, in turn, the performance?)
Yes, it does.
Tuning it gave roughly a 2x performance improvement:
atomic_sum
GPU: 0.013+-0.00355903 s
GPU: 7692.31 millions/s
cycle_sum
GPU: 0.0158333+-0.000687184 s
GPU: 6315.79 millions/s
cycle_coalesced_sum
GPU: 0.00983333+-0.00279384 s
GPU: 10169.5 millions/s
local_mem_sum
GPU: 0.042+-0.0057735 s
GPU: 2380.95 millions/s
tree_sum
GPU: 0.0163333+-0.000942809 s
GPU: 6122.45 millions/s
Task accepted.