
Missing File in 09_optimize_reduce/01_interleaved_addressing #35

Closed · A-suozhang opened this issue Jul 17, 2024 · 6 comments
@A-suozhang

First of all, many thanks to the authors for presenting this awesome tutorial; it really helps newbies in CUDA programming!

I find that the CUDA file is missing from the 09_optimize_reduce/01_interleaved_addressing folder. The documentation is quite clear, and simply replacing a few lines of reduce_naive.cu would work (something like the sketch below). However, since the other folders do provide their CUDA files, supplementing this one would keep the project consistent.
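For concreteness, the missing file presumably contains the classic interleaved-addressing variant of the reduction. A minimal sketch, assuming the same `(int *, int *, int)` signature and template block size as the `reduce_naive_kernel` shown in the profile further down (the kernel and parameter names here are my guesses, not the repository's actual code):

```cuda
// Interleaved addressing: the stride s doubles each step, and only threads
// whose index is a multiple of 2*s stay active. The modulo test and the
// resulting warp divergence are exactly what later steps of the tutorial
// optimize away. Assumes BLOCKSIZE is a power of two and equals blockDim.x.
template <int BLOCKSIZE>
__global__ void reduce_interleaved_kernel(int *d_in, int *d_out, int n)
{
    __shared__ int sdata[BLOCKSIZE];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread into shared memory (0 if out of range).
    sdata[tid] = (i < n) ? d_in[i] : 0;
    __syncthreads();

    // Tree reduction in shared memory with interleaved addressing.
    for (unsigned int s = 1; s < BLOCKSIZE; s *= 2) {
        if (tid % (2 * s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum.
    if (tid == 0) d_out[blockIdx.x] = sdata[0];
}
```

Launched, for example, as `reduce_interleaved_kernel<32><<<(n + 31) / 32, 32>>>(d_in, d_out, n)` to match the `<(int)32>` instantiation visible in the profile, followed by a second pass (or host-side loop) over the per-block partial sums.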

Hope to see more updates of the tutorial!

@A-suozhang (Author)

Also, I would like to inquire about the methodology used to measure bandwidth in the table within resolve_bank_conflict.

@AndSonder (Collaborator)

> I find that the cuda file is missing in 09_optimize_reduce/01_interleaved_addressing folder …

done #36

@AndSonder (Collaborator)

> Also, I would like to inquire about the methodology used to measure bandwidth in the table within resolve_bank_conflict.

Check the gld throughput in the output of nvprof/ncu.

@A-suozhang (Author)

Thanks for answering the question; however, I did not quite follow the gld throughput suggestion.
I'm using nsys nvprof, because nvprof itself does not support higher compute capabilities.
The output of my profiling command looks like the following, and no element related to "throughput" seems to be present.
Does the output of nvprof/ncu explicitly contain a "throughput" field?

WARNING: reduce_interleaved_address and any of its children processes will be profiled.

sum = 49995000
success
Generating '/tmp/nsys-report-f903.qdstrm'
[1/7] [========================100%] report3.nsys-rep
[2/7] [========================100%] report3.sqlite
[3/7] Executing 'nvtx_sum' stats report
SKIPPED: /mnt/public/diffusion_quant/zhaotianchen/project/cuda_learning/src/report3.sqlite does not contain NV Tools Extension (NVTX) data.
[4/7] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)    Min (ns)   Max (ns)     StdDev (ns)            Name
 --------  ---------------  ---------  ------------  ------------  --------  -----------  -------------  ----------------------
     99.7      149,038,131          2  74,519,065.5  74,519,065.5     5,383  149,032,748  105,378,260.4  cudaMalloc
      0.1          176,059          1     176,059.0     176,059.0   176,059      176,059            0.0  cudaLaunchKernel
      0.1          121,229          2      60,614.5      60,614.5     7,165      114,064       75,589.0  cudaFree
      0.1           79,184          2      39,592.0      39,592.0    30,062       49,122       13,477.5  cudaMemcpy
      0.0            3,010          1       3,010.0       3,010.0     3,010        3,010            0.0  cuModuleGetLoadingMode

[5/7] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                          Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------
    100.0            3,487          1   3,487.0   3,487.0     3,487     3,487          0.0  void reduce_naive_kernel<(int)32>(int *, int *, int)

[6/7] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)           Operation
 --------  ---------------  -----  --------  --------  --------  --------  -----------  ----------------------------
     57.4            6,080      1   6,080.0   6,080.0     6,080     6,080          0.0  [CUDA memcpy Host-to-Device]
     42.6            4,512      1   4,512.0   4,512.0     4,512     4,512          0.0  [CUDA memcpy Device-to-Host]

[7/7] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)           Operation
 ----------  -----  --------  --------  --------  --------  -----------  ----------------------------
      0.040      1     0.040     0.040     0.040     0.040        0.000  [CUDA memcpy Device-to-Host]
      0.040      1     0.040     0.040     0.040     0.040        0.000  [CUDA memcpy Host-to-Device]
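As a rough cross-check of the bandwidth question: nsys itself does not print a throughput metric, but an effective bandwidth can be estimated from the numbers above, assuming the kernel's traffic is dominated by reading the 0.040 MB input: roughly 40,000 bytes / 3,487 ns ≈ 11.5 GB/s. (This ignores the per-block partial sums the kernel writes back, so it is only an estimate.)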

@AndSonder (Collaborator) commented Jul 19, 2024

> gld throughput

@A-suozhang you can use ncu with this metric:

[screenshot: the gld_throughput row of the ncu metric comparison table]

https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html?highlight=gld%20thnoughput#metric-comparison
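For reference, a possible invocation; the nvprof flag is standard, while the ncu metric name is my reading of the linked comparison table and should be double-checked there:

```bash
# Legacy nvprof (pre-Turing GPUs): report global load throughput directly.
nvprof --metrics gld_throughput ./reduce_interleaved_address

# Nsight Compute replacement for gld_throughput, per the metric comparison table
# (metric name assumed from that table -- verify against the link above).
ncu --metrics l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum.per_second ./reduce_interleaved_address
```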

@A-suozhang (Author)

Thank you very much for clarifying this matter. I will try this out.
