
GPU Task progression using callback function implemented. #554

Open
wants to merge 1 commit into base: master

Conversation

josephjohnjj
Contributor

The callback function for a stream is implemented using cudaLaunchHostFunc(). The callback pushes the tasks to the next stream.

This replaces the CUDA event-based task progression.
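For context, below is a minimal, self-contained sketch of the pattern this PR describes, with the event-based alternative it replaces shown in comments. This is illustrative only, not PaRSEC code; the stage name and the printf stand in for the real progression logic.

```c
#include <cuda_runtime.h>
#include <stdio.h>

/* Host function the CUDA runtime invokes once all prior work on the
 * stream has completed. Per the CUDA docs it must not make CUDA calls. */
static void stage_done(void *user_data)
{
    printf("stage '%s' complete; push the task to the next queue here\n",
           (const char *)user_data);
}

int main(void)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* ... enqueue a kernel or cudaMemcpyAsync on `stream` here ... */

    /* Event-based progression (the approach being replaced):
     *   cudaEvent_t ev;
     *   cudaEventCreate(&ev);
     *   cudaEventRecord(ev, stream);
     *   ... a progress thread polls cudaEventQuery(ev) ... */

    /* Callback-based progression (this PR): the runtime calls
     * stage_done() when the stream reaches this point. */
    cudaLaunchHostFunc(stream, stage_done, (void *)"stage-in");

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```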
@josephjohnjj josephjohnjj requested a review from a team as a code owner June 7, 2023 07:46
@bosilca
Contributor

bosilca commented Jun 7, 2023

A few high-level comments for now:

  • Is there any performance impact with this change?
  • What does this provide the engine with?
  • Which thread triggers the callback?
  • What is the expected duration of the callback function? Can PaRSEC take its time to trigger completion, or does it need to return from the callback ASAP? I recall it is not possible to use CUDA calls in the callback, which means we still need a real progress thread to push tasks into the streams.

@josephjohnjj
Contributor Author

> A few high-level comments for now:
>
>   • Is there any performance impact with this change?
>   • What does this provide the engine with?
>   • Which thread triggers the callback?
>   • What is the expected duration of the callback function? Can PaRSEC take its time to trigger completion, or does it need to return from the callback ASAP? I recall it is not possible to use CUDA calls in the callback, which means we still need a real progress thread to push tasks into the streams.

This is the result from master:
[****] TIME(s) 37.72689 : dgemm PxQxg= 1 1 1 NB= 1000 N= 50000 : 6626.573274 gflops - ENQ&PROG&DEST 43.04640 : 5807.687086 gflops - ENQ 3.74052 - DEST 1.57899

This is the result from the PR:
[****] TIME(s) 40.73636 : dgemm PxQxg= 1 1 1 NB= 1000 N= 50000 : 6137.023988 gflops - ENQ&PROG&DEST 47.43896 : 5269.929655 gflops - ENQ 5.06823 - DEST 1.63438

The callback function calls complete_stage for the stage it completed and pushes the task to the queue for the next stage (for instance, after stage-in, the task is pushed to the execution queue). As the manager thread also has access to these queues, there can be contention on them, and this is one aspect that can increase the cost of callbacks. Another drawback is that nothing can be added to the stream while the callback is being triggered. A sketch of where the contention arises follows.
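Here is a minimal sketch (all names hypothetical, not PaRSEC's actual structures) of a callback that marks a stage complete and pushes the task onto the next stage's queue; the queue lock is the point where the callback thread and the manager thread can collide.

```c
#include <pthread.h>

typedef struct task_s task_t;

typedef struct {
    pthread_mutex_t lock;
    task_t         *head;   /* intrusive singly linked list */
} task_queue_t;

struct task_s {
    task_t *next;
    int     stage;          /* e.g. STAGE_IN -> EXEC -> STAGE_OUT */
};

static void queue_push(task_queue_t *q, task_t *t)
{
    pthread_mutex_lock(&q->lock);   /* contention point with the manager */
    t->next = q->head;
    q->head = t;
    pthread_mutex_unlock(&q->lock);
}

typedef struct {
    task_t       *task;
    task_queue_t *next_stage_queue;
} callback_arg_t;

/* Passed to cudaLaunchHostFunc(); must not call back into CUDA. */
static void stage_complete_cb(void *arg)
{
    callback_arg_t *ca = (callback_arg_t *)arg;
    ca->task->stage += 1;                       /* complete_stage analogue */
    queue_push(ca->next_stage_queue, ca->task); /* e.g. stage-in -> exec */
}
```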

We still need a progress thread, as the progress thread (manager) calls the function that offloads actions (execution or memcopy) to the streams. Immediately after this, the callback is also pushed to the stream by the progress thread. But the callback trigger itself is CUDA functionality, and we don't need a manager or any other worker thread for it.
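To make that division of labor concrete, a sketch (illustrative names, not PaRSEC code) of what the progress thread does per stage: enqueue the asynchronous action, chain the callback on the same stream, and return without polling.

```c
#include <cuda_runtime.h>

/* Illustrative only: one stage as seen from the progress (manager)
 * thread. It enqueues the asynchronous action, then the callback on
 * the same stream, and returns without waiting. */
static void offload_stage_in(cudaStream_t stream,
                             void *dev_buf, const void *host_buf,
                             size_t size,
                             void (*cb)(void *), void *cb_arg)
{
    /* 1. Offload the action (here, a host-to-device copy). */
    cudaMemcpyAsync(dev_buf, host_buf, size,
                    cudaMemcpyHostToDevice, stream);

    /* 2. Immediately chain the completion callback. Stream ordering
     *    guarantees it runs only after the copy finishes, and the CUDA
     *    runtime, not a PaRSEC thread, invokes it. */
    cudaLaunchHostFunc(stream, cb, cb_arg);

    /* 3. Return at once; the manager is free to serve other streams
     *    and never polls an event for this transfer. */
}
```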

One main advantage of this PR, I think, is that we can progress more tasks. In master, when a task has been successfully progressed, the progression of all other tasks in the same stage is delayed while the manager thread moves the progressed task to the next stage. With this PR, the manager thread's responsibility is limited to initiating the task stage-in (by moving the task to the stage-in queue) and task completion (__parsec_complete_execution() and parsec_cuda_kernel_epilog()); the rest of the task progression is done by the callbacks.
