
GPU Task progression using callback function implemented. #554

Open
wants to merge 1 commit into base: master

Conversation

josephjohnjj
Contributor

The callback function for a stream is implemented using cudaLaunchHostFunc(). The callback pushes the tasks to the next stream.

This replaces the CUDA event-based task progression.
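For context, below is a minimal, self-contained sketch of the pattern this PR describes, with the event-based alternative it replaces shown in comments. This is illustrative only, not PaRSEC code; the stage name and the printf stand in for the real progression logic.

```c
#include <cuda_runtime.h>
#include <stdio.h>

/* Host function the CUDA runtime invokes once all prior work on the
 * stream has completed. Per the CUDA docs it must not make CUDA calls. */
static void stage_done(void *user_data)
{
    printf("stage '%s' complete; push the task to the next queue here\n",
           (const char *)user_data);
}

int main(void)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* ... enqueue a kernel or cudaMemcpyAsync on `stream` here ... */

    /* Event-based progression (the approach being replaced):
     *   cudaEvent_t ev;
     *   cudaEventCreate(&ev);
     *   cudaEventRecord(ev, stream);
     *   ... a progress thread polls cudaEventQuery(ev) ... */

    /* Callback-based progression (this PR): the runtime calls
     * stage_done() when the stream reaches this point. */
    cudaLaunchHostFunc(stream, stage_done, (void *)"stage-in");

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```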
@josephjohnjj josephjohnjj requested a review from a team as a code owner June 7, 2023 07:46
@bosilca
Contributor

bosilca commented Jun 7, 2023

A few high-level comments for now:

  • Is there any performance impact with this change?
  • What does this provide the engine with?
  • Which thread triggers the callback?
  • What is the expected duration of the callback function? Can PaRSEC take its time to trigger completion, or does it need to return from the callback ASAP? I recall it is not possible to use CUDA calls in the callback, which means we still need a real progress thread to push tasks into the streams.

@josephjohnjj
Contributor Author

> A few high-level comments for now:
>
>   • Is there any performance impact with this change?
>   • What does this provide the engine with?
>   • Which thread triggers the callback?
>   • What is the expected duration of the callback function? Can PaRSEC take its time to trigger completion, or does it need to return from the callback ASAP? I recall it is not possible to use CUDA calls in the callback, which means we still need a real progress thread to push tasks into the streams.

This is the result from master:
[****] TIME(s) 37.72689 : dgemm PxQxg= 1 1 1 NB= 1000 N= 50000 : 6626.573274 gflops - ENQ&PROG&DEST 43.04640 : 5807.687086 gflops - ENQ 3.74052 - DEST 1.57899

This is the result from the PR:
[****] TIME(s) 40.73636 : dgemm PxQxg= 1 1 1 NB= 1000 N= 50000 : 6137.023988 gflops - ENQ&PROG&DEST 47.43896 : 5269.929655 gflops - ENQ 5.06823 - DEST 1.63438

The callback function calls complete_stage for the stage it completed and pushes the task to the queue for the next stage (for instance, after stage-in, the task is pushed to the execution queue). As the manager thread also has access to these queues, there can be contention on them, and this is one aspect that can increase the cost of callbacks. Another drawback is that nothing can be added to the stream while the callback is being triggered. A sketch of where the contention arises follows.
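Here is a minimal sketch (all names hypothetical, not PaRSEC's actual structures) of a callback that marks a stage complete and pushes the task onto the next stage's queue; the queue lock is the point where the callback thread and the manager thread can collide.

```c
#include <pthread.h>

typedef struct task_s task_t;

typedef struct {
    pthread_mutex_t lock;
    task_t         *head;   /* intrusive singly linked list */
} task_queue_t;

struct task_s {
    task_t *next;
    int     stage;          /* e.g. STAGE_IN -> EXEC -> STAGE_OUT */
};

static void queue_push(task_queue_t *q, task_t *t)
{
    pthread_mutex_lock(&q->lock);   /* contention point with the manager */
    t->next = q->head;
    q->head = t;
    pthread_mutex_unlock(&q->lock);
}

typedef struct {
    task_t       *task;
    task_queue_t *next_stage_queue;
} callback_arg_t;

/* Passed to cudaLaunchHostFunc(); must not call back into CUDA. */
static void stage_complete_cb(void *arg)
{
    callback_arg_t *ca = (callback_arg_t *)arg;
    ca->task->stage += 1;                       /* complete_stage analogue */
    queue_push(ca->next_stage_queue, ca->task); /* e.g. stage-in -> exec */
}
```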

We still need a progress thread, as the progress thread (manager) calls the function that offloads actions (execution or memcopy) to the streams. Immediately after this, the callback is also pushed to the stream by the progress thread. But the callback trigger itself is CUDA functionality, and we don't need a manager or any other worker thread for it.
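To make that division of labor concrete, a sketch (illustrative names, not PaRSEC code) of what the progress thread does per stage: enqueue the asynchronous action, chain the callback on the same stream, and return without polling.

```c
#include <cuda_runtime.h>

/* Illustrative only: one stage as seen from the progress (manager)
 * thread. It enqueues the asynchronous action, then the callback on
 * the same stream, and returns without waiting. */
static void offload_stage_in(cudaStream_t stream,
                             void *dev_buf, const void *host_buf,
                             size_t size,
                             void (*cb)(void *), void *cb_arg)
{
    /* 1. Offload the action (here, a host-to-device copy). */
    cudaMemcpyAsync(dev_buf, host_buf, size,
                    cudaMemcpyHostToDevice, stream);

    /* 2. Immediately chain the completion callback. Stream ordering
     *    guarantees it runs only after the copy finishes, and the CUDA
     *    runtime, not a PaRSEC thread, invokes it. */
    cudaLaunchHostFunc(stream, cb, cb_arg);

    /* 3. Return at once; the manager is free to serve other streams
     *    and never polls an event for this transfer. */
}
```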

One main advantage of this PR, I think, is that we can progress more tasks. In master, when a task has been successfully progressed, the progression of all other tasks in the same stage is delayed while the manager thread moves the progressed task to the next stage. With this PR, the manager thread's responsibility is limited to initiating the task stage-in (by moving the task to the stage-in queue) and task completion (__parsec_complete_execution() and parsec_cuda_kernel_epilog()); the rest of the task progression is done by the callbacks.
