Naive register <-> tmem load/store support #3786
Overall LGTM. Just left one small question.
Do you plan to wrap the PTX assembly in CUDA functions?
// different CTA to different physical address. There is no virtual TMem
// address. All addresses are physical addresses.
//
// Because multiple CTAs can execute on the same SM simultaneously, there must
Due to this handshaking mechanism, is it better to have only a single CTA occupy an SM?
Are you talking about kernel design for better perf? My guess is that if you allocate at the beginning of the kernel and relinquish right after allocating, the latency should be acceptable even with multiple CTAs per SM. But we need to test it before drawing any conclusions.
Yes, for maximum performance.
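For reference, the "allocate at the beginning, relinquish right after" pattern discussed above could look roughly like the sketch below. This is not the code from this PR. The wrapper and kernel names (`tmemAlloc`, `tmemRelinquishAllocPermit`, `tmemDealloc`, `allocRelinquishSketch`) are hypothetical, and only the `tcgen05` PTX instructions themselves are taken from the PTX ISA. It assumes an `sm_100a` target, a 1D block whose size is a multiple of 32, and a single warp doing both allocation and deallocation. It makes no claim about performance, which, as noted above, still needs to be measured.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical wrappers around the tcgen05 PTX instructions (sm_100a only).

// Allocate `num_cols` TMem columns (a power of two in [32, 512]). The hardware
// writes the base TMem address to the shared-memory word at `smem_dst`.
__device__ inline void tmemAlloc(uint32_t smem_dst, uint32_t num_cols) {
  asm volatile(
      "tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%0], %1;\n"
      :
      : "r"(smem_dst), "r"(num_cols)
      : "memory");
}

// Tell the hardware this CTA will not allocate TMem again.
__device__ inline void tmemRelinquishAllocPermit() {
  asm volatile(
      "tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned;\n" ::
          : "memory");
}

// Free a previous allocation; must happen before the kernel exits.
__device__ inline void tmemDealloc(uint32_t taddr, uint32_t num_cols) {
  asm volatile(
      "tcgen05.dealloc.cta_group::1.sync.aligned.b32 %0, %1;\n"
      :
      : "r"(taddr), "r"(num_cols)
      : "memory");
}

__global__ void allocRelinquishSketch() {
  constexpr uint32_t kNumCols = 32;
  __shared__ uint32_t tmem_base;

  // The instructions are .sync.aligned, so a whole warp issues them together.
  const bool is_alloc_warp = (threadIdx.x / 32 == 0);
  if (is_alloc_warp) {
    tmemAlloc(
        static_cast<uint32_t>(__cvta_generic_to_shared(&tmem_base)), kNumCols);
    // Relinquish immediately so other CTAs on this SM are not held up.
    tmemRelinquishAllocPermit();
  }
  __syncthreads();  // make the base address visible to every warp
  const uint32_t taddr = tmem_base;

  // ... register <-> TMem traffic (tcgen05.st / tcgen05.ld) would go here ...

  __syncthreads();  // all TMem use is done before freeing
  if (is_alloc_warp) {
    tmemDealloc(taddr, kNumCols);
  }
}

int main() {
  // Two CTAs of four warps each; whether they share an SM is up to the scheduler.
  allocRelinquishSketch<<<2, 128>>>();
  return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}
```

Once the permit is relinquished, the CTA can no longer call `tcgen05.alloc`, so this shape only fits kernels that know their full TMem footprint up front. Real code may also need additional `tcgen05` fences around the TMem accesses, which this sketch leaves out.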
I like |
Extracted from #3755 to make code review easy.
This PR adds a new unit test, TMemTest.GmemRegTMemRegGmemCopy, that schedules a copy kernel gmem -> register -> tmem -> register -> gmem, and updates our system with the minimum changes required to make this test pass.
The purpose of this PR is not to provide a good implementation of TMem support, but to provide the absolute minimum we need to get started. Limitations are:
Generated code: