
Missing File in 09_optimize_reduce/01_interleaved_addressing #35

Closed · A-suozhang opened this issue Jul 17, 2024 · 6 comments
@A-suozhang

First of all, many thanks to the authors for presenting this awesome tutorial; it really helps newbies in CUDA programming!

I find that the CUDA file is missing from the 09_optimize_reduce/01_interleaved_addressing folder. The documentation is quite clear, and simply replacing a few lines of reduce_naive.cu would work (something like the sketch below). However, since the other folders do provide their CUDA files, supplementing this one would keep the project consistent.
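For concreteness, the missing file presumably contains the classic interleaved-addressing variant of the reduction. A minimal sketch, assuming the same `(int *, int *, int)` signature and template block size as the `reduce_naive_kernel` shown in the profile further down (the kernel and parameter names here are my guesses, not the repository's actual code):

```cuda
// Interleaved addressing: the stride s doubles each step, and only threads
// whose index is a multiple of 2*s stay active. The modulo test and the
// resulting warp divergence are exactly what later steps of the tutorial
// optimize away. Assumes BLOCKSIZE is a power of two and equals blockDim.x.
template <int BLOCKSIZE>
__global__ void reduce_interleaved_kernel(int *d_in, int *d_out, int n)
{
    __shared__ int sdata[BLOCKSIZE];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread into shared memory (0 if out of range).
    sdata[tid] = (i < n) ? d_in[i] : 0;
    __syncthreads();

    // Tree reduction in shared memory with interleaved addressing.
    for (unsigned int s = 1; s < BLOCKSIZE; s *= 2) {
        if (tid % (2 * s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum.
    if (tid == 0) d_out[blockIdx.x] = sdata[0];
}
```

Launched, for example, as `reduce_interleaved_kernel<32><<<(n + 31) / 32, 32>>>(d_in, d_out, n)` to match the `<(int)32>` instantiation visible in the profile, followed by a second pass (or host-side loop) over the per-block partial sums.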

Hope to see more updates of the tutorial!

@A-suozhang (Author)

Also, I would like to inquire about the methodology used to measure bandwidth in the table within resolve_bank_conflict.

@AndSonder (Collaborator)

> I find that the cuda file is missing in 09_optimize_reduce/01_interleaved_addressing folder …

done #36

@AndSonder (Collaborator)

> Also, I would like to inquire about the methodology used to measure bandwidth in the table within resolve_bank_conflict.

Check the gld throughput in the output of nvprof/ncu.

@A-suozhang (Author)

Thanks for answering the question; however, I did not quite follow the gld throughput suggestion.
I'm using nsys nvprof, because nvprof itself does not support higher compute capabilities.
The output of my profiling command looks like the following, and no element related to "throughput" seems to be present.
Does the output of nvprof/ncu explicitly contain a "throughput" field?

WARNING: reduce_interleaved_address and any of its children processes will be profiled.

sum = 49995000
success
Generating '/tmp/nsys-report-f903.qdstrm'
[1/7] [========================100%] report3.nsys-rep
[2/7] [========================100%] report3.sqlite
[3/7] Executing 'nvtx_sum' stats report
SKIPPED: /mnt/public/diffusion_quant/zhaotianchen/project/cuda_learning/src/report3.sqlite does not contain NV Tools Extension (NVTX) data.
[4/7] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)    Min (ns)   Max (ns)     StdDev (ns)            Name
 --------  ---------------  ---------  ------------  ------------  --------  -----------  -------------  ----------------------
     99.7      149,038,131          2  74,519,065.5  74,519,065.5     5,383  149,032,748  105,378,260.4  cudaMalloc
      0.1          176,059          1     176,059.0     176,059.0   176,059      176,059            0.0  cudaLaunchKernel
      0.1          121,229          2      60,614.5      60,614.5     7,165      114,064       75,589.0  cudaFree
      0.1           79,184          2      39,592.0      39,592.0    30,062       49,122       13,477.5  cudaMemcpy
      0.0            3,010          1       3,010.0       3,010.0     3,010        3,010            0.0  cuModuleGetLoadingMode

[5/7] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                          Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------
    100.0            3,487          1   3,487.0   3,487.0     3,487     3,487          0.0  void reduce_naive_kernel<(int)32>(int *, int *, int)

[6/7] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)           Operation
 --------  ---------------  -----  --------  --------  --------  --------  -----------  ----------------------------
     57.4            6,080      1   6,080.0   6,080.0     6,080     6,080          0.0  [CUDA memcpy Host-to-Device]
     42.6            4,512      1   4,512.0   4,512.0     4,512     4,512          0.0  [CUDA memcpy Device-to-Host]

[7/7] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)           Operation
 ----------  -----  --------  --------  --------  --------  -----------  ----------------------------
      0.040      1     0.040     0.040     0.040     0.040        0.000  [CUDA memcpy Device-to-Host]
      0.040      1     0.040     0.040     0.040     0.040        0.000  [CUDA memcpy Host-to-Device]
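As a rough cross-check of the bandwidth question: nsys itself does not print a throughput metric, but an effective bandwidth can be estimated from the numbers above, assuming the kernel's traffic is dominated by reading the 0.040 MB input: roughly 40,000 bytes / 3,487 ns ≈ 11.5 GB/s. (This ignores the per-block partial sums the kernel writes back, so it is only an estimate.)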

@AndSonder (Collaborator) commented Jul 19, 2024

> gld throughput

@A-suozhang you can use ncu with this metric:

[screenshot: the gld_throughput row of the ncu metric comparison table]

https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html?highlight=gld%20thnoughput#metric-comparison
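For reference, a possible invocation; the nvprof flag is standard, while the ncu metric name is my reading of the linked comparison table and should be double-checked there:

```bash
# Legacy nvprof (pre-Turing GPUs): report global load throughput directly.
nvprof --metrics gld_throughput ./reduce_interleaved_address

# Nsight Compute replacement for gld_throughput, per the metric comparison table
# (metric name assumed from that table -- verify against the link above).
ncu --metrics l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum.per_second ./reduce_interleaved_address
```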

@A-suozhang (Author)

Thank you very much for clarifying this matter. I will try this out.
