
[QST] Question about UniversalFMA #2101

Open
leven-comeon opened this issue Feb 12, 2025 · 1 comment


@leven-comeon

What is your question?

While running the sgemm_70.cu program, I used print_latex to visualize the layout of mmaC. The output is shown in the figure below.

[Figure: print_latex output of the mmaC layout]

As far as I know, UniversalFMA performs its computation on CUDA cores. However, while the GEMM runs, something puzzling seems to happen: a thread appears to read values held in the registers of other threads. For example, thread T17 appears to use the values in the registers of threads T1 and T16 to complete its computation.
Could you explain the reason for this behavior?

@thakkarV
Collaborator

The A and B TV (thread-value) mappings only show the first thread. In reality, multiple threads each read A and B.
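The following host-side C++ model sketches why the plot can be misleading. It rests on two assumptions. First, the tiled MMA is taken to be a 16x16 grid of threads laid out column-major, so thread t computes C(t % 16, t / 16). This is consistent with T17 being paired with T1 and T16, but it has not been checked against the example source. Second, the figure is taken to label each A and B element with only its lowest-numbered reader, as the reply above says. All helper names are hypothetical and are not part of the CuTe API. In this model, every thread loads its own A and B values into private variables and performs a scalar fused multiply-add. No thread reads another thread's registers.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical model of a 16x16x1 UniversalFMA tiled MMA (assumption:
// 256 threads, column-major thread layout (16,16):(1,16)).
constexpr int kM = 16;
constexpr int kN = 16;
constexpr int kThreads = kM * kN;

constexpr int thread_m(int t) { return t % kM; }  // row of A (and C) used by thread t
constexpr int thread_n(int t) { return t / kM; }  // column of B (and C) used by thread t

// Lowest-numbered thread that reads row m of A: the only label a
// "first thread only" plot would show for that row.
int first_reader_of_A_row(int m) {
  for (int t = 0; t < kThreads; ++t)
    if (thread_m(t) == m) return t;
  return -1;
}

// Lowest-numbered thread that reads column n of B.
int first_reader_of_B_col(int n) {
  for (int t = 0; t < kThreads; ++t)
    if (thread_n(t) == n) return t;
  return -1;
}

// How many threads read row m of A (each keeps its own copy).
int readers_of_A_row(int m) {
  int count = 0;
  for (int t = 0; t < kThreads; ++t)
    if (thread_m(t) == m) ++count;
  return count;
}

// Simulates one 16x16xK tile. Every "thread" loads A(m, k) and B(k, n) into
// its own variables and accumulates with std::fma into a private accumulator.
// Inputs are small integers, so all results are exact and can be compared
// with == against a plain reference GEMM.
bool per_thread_fma_matches_reference(int K) {
  std::vector<float> A(kM * K), B(K * kN);
  for (int i = 0; i < kM * K; ++i) A[i] = static_cast<float>(i % 7) - 3.0f;
  for (int i = 0; i < K * kN; ++i) B[i] = static_cast<float>(i % 5) - 2.0f;

  for (int t = 0; t < kThreads; ++t) {
    const int m = thread_m(t);
    const int n = thread_n(t);
    float acc = 0.0f;                 // thread-private accumulator
    for (int k = 0; k < K; ++k) {
      const float a = A[m * K + k];   // this thread's own copy of A(m, k)
      const float b = B[k * kN + n];  // this thread's own copy of B(k, n)
      acc = std::fma(a, b, acc);      // scalar fused multiply-add
    }
    float ref = 0.0f;
    for (int k = 0; k < K; ++k) ref += A[m * K + k] * B[k * kN + n];
    if (acc != ref) return false;
  }
  return true;
}
```

In this model, thread 17 computes C(1, 1). The A row it needs is labeled T1 (first_reader_of_A_row(1) == 1), and the B column it needs is labeled T16 (first_reader_of_B_col(1) == 16). However, readers_of_A_row(1) is 16, so T17 holds its own copy of that row rather than borrowing it from T1's registers.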
