
Increase granularity of halo-exchange timing info #639

Merged (9 commits) on Nov 12, 2024

Conversation

@max-Hawkins (Contributor) commented Oct 1, 2024

Description

Previously, the NVTX range measuring the so-called 'MPI' time also included the time spent packing and unpacking the contiguous buffers actually exchanged during the MPI_SENDRECV operation. While that grouping may make sense, to avoid confusion and to always be able to measure pure communication time, I renamed the 'RHS-MPI' NVTX range to 'RHS-MPI+BufPack' and added a new range, 'RHS-MPI_SENDRECV', that covers only the MPI_SENDRECV call.
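For reference, a minimal sketch of the resulting range nesting, assuming MFC-style nvtxStartRange/nvtxEndRange wrappers; the pack/unpack helpers and buffer/rank variable names below are illustrative placeholders, not the actual routines in m_mpi_proxy.fpp:

```fortran
! Sketch only: wrapper, helper, and variable names are illustrative.
call nvtxStartRange("RHS-MPI+BufPack")            ! packing + exchange + unpacking

call s_pack_halo_buffers(buff_send)               ! hypothetical packing step

call nvtxStartRange("RHS-MPI_SENDRECV")           ! pure communication time only
call MPI_SENDRECV(buff_send, buff_count, MPI_DOUBLE_PRECISION, dest_proc, 0, &
                  buff_recv, buff_count, MPI_DOUBLE_PRECISION, src_proc, 0, &
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
call nvtxEndRange                                 ! closes RHS-MPI_SENDRECV

call s_unpack_halo_buffers(buff_recv)             ! hypothetical unpacking step

call nvtxEndRange                                 ! closes RHS-MPI+BufPack
```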

Type of change

  • [x] New feature (non-breaking change which adds functionality)

How Has This Been Tested?

I ran an example case under Nsight Systems (nsys) with and without this change. The time reported by the new RHS-MPI_SENDRECV NVTX range was within 5% of the MPI trace's reported time for this example.

See below for screenshots from the NSYS reports. In this example, the MPI_SENDRECV time is ~1.4% of the total 'MPI' time.

This shows the Nsys MPI trace timing info; note the highlighted line's 'total time':
[Screenshot: Nsys MPI trace timing, 2024-09-30]
This is the NVTX range timing information. Note that the RHS-MPI_SENDRECV range's total time is close to the MPI trace total time above:
[Screenshot: NVTX range timing, 2024-09-30]
[Screenshot: NVTX range timing (continued), 2024-09-30]

Test Configuration:
4 V100 nodes on Phoenix running the 2D shockbubble case for 700 timesteps.

Checklist

  • [x] I ran ./mfc.sh format before committing my code
  • [x] This PR does not introduce any repeated code (it follows the DRY principle)
  • [x] I cannot think of a way to condense this code and reduce any introduced additional line count

If your code changes any source files (anything in src/simulation)

To make sure the code is performing as expected on GPU devices, I have:

  • [x] Checked that the code compiles using NVHPC compilers
  • [ ] Checked that the code compiles using CRAY compilers
  • [x] Ran the code on either V100, A100, or H100 GPUs and ensured the new feature performed as expected (the GPU results match the CPU results)
  • [ ] Ran the code on MI200+ GPUs and ensured the new features performed as expected (the GPU results match the CPU results)
  • [x] Enclosed the new feature via NVTX ranges so that they can be identified in profiles
  • [x] Ran an Nsight Systems profile using ./mfc.sh run XXXX --gpu -t simulation --nsys, and have attached the output file (.nsys-rep) and plain-text results to this PR
  • [ ] Ran an Omniperf profile using ./mfc.sh run XXXX --gpu -t simulation --omniperf, and have attached the output file and plain-text results to this PR
  • [ ] Ran my code on varying numbers of GPUs in parallel (e.g., 1, 2, and 8) and made sure the results scale similarly to runs without the new code/feature

codecov bot commented Oct 1, 2024

Codecov Report

Attention: Patch coverage is 80.59701% with 13 lines in your changes missing coverage. Please review.

Project coverage is 42.96%. Comparing base (efc9d67) to head (022f593).
Report is 3 commits behind head on master.

Files with missing lines             Patch %   Lines
src/simulation/m_rhs.fpp             73.68%    5 Missing and 5 partials ⚠️
src/simulation/m_time_steppers.fpp   66.66%    2 Missing ⚠️
src/simulation/m_mpi_proxy.fpp       90.00%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #639      +/-   ##
==========================================
+ Coverage   42.85%   42.96%   +0.11%     
==========================================
  Files          61       61              
  Lines       16280    16314      +34     
  Branches     1891     1882       -9     
==========================================
+ Hits         6976     7010      +34     
- Misses       8259     8260       +1     
+ Partials     1045     1044       -1     


@sbryngelson (Member) commented:

Is it possible to adjust the NVTX range naming to make the hierarchy of the ranges more obvious? For example, TimeStep — RHS — Communication — MPI/SendRecv, or something similar, with this style used for all ranges. Right now, it's hard to discern which calls nest other calls. I do realize that other parts of the Nsys GUI make this more obvious.
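For illustration, a minimal sketch of how such hierarchical names might be nested in the code; the specific range names and the nvtx wrapper calls here are assumptions for the sake of the example, not the names ultimately adopted:

```fortran
! Illustrative naming only; the indentation mirrors the intended nesting.
call nvtxStartRange("TIMESTEP")
    call nvtxStartRange("TIMESTEP-RHS")
        call nvtxStartRange("TIMESTEP-RHS-COMM")              ! packing + exchange + unpacking
            call nvtxStartRange("TIMESTEP-RHS-COMM-SENDRECV") ! MPI_SENDRECV only
            ! ... MPI_SENDRECV call ...
            call nvtxEndRange                                 ! closes ...-COMM-SENDRECV
        call nvtxEndRange                                     ! closes ...-RHS-COMM
    call nvtxEndRange                                         ! closes TIMESTEP-RHS
call nvtxEndRange                                             ! closes TIMESTEP
```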

@max-Hawkins (Contributor, Author) commented:

Just updated things. Here's the output viewed in NSYS (table and graph views):

[Screenshot: Nsys table and graph views, 2024-10-17]

@sbryngelson (Member) commented:

Nice. I think a 'TSTEP-SUBSTEP' range makes sense (for RK3 you have 3 such substeps). This helps consolidate things. Related to #631
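For context, a minimal sketch of how a per-substep range could wrap the RK3 stages in the time stepper; the loop structure and names below are assumptions for illustration, not the actual m_time_steppers.fpp code:

```fortran
! Sketch: one range per RK stage so each substep shows up separately in Nsys.
call nvtxStartRange("TSTEP")
do rk_stage = 1, 3                          ! three substeps for RK3
    call nvtxStartRange("TSTEP-SUBSTEP")
    ! ... RHS evaluation and state update for this stage ...
    call nvtxEndRange                       ! closes TSTEP-SUBSTEP
end do
call nvtxEndRange                           ! closes TSTEP
```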

@sbryngelson (Member) commented:

@max-Hawkins would you mind updating/finishing this for merge?

@max-Hawkins (Contributor, Author) commented:

@henryleberre Ready for your evaluation.
[Screenshot, 2024-11-11]

@sbryngelson (Member) commented Nov 12, 2024

needs ./mfc.sh format -j 4

edit: nvm

@sbryngelson (Member) commented:

Thanks! A beauty. Merging.

@sbryngelson merged commit a79e2e0 into MFlowCode:master on Nov 12, 2024. All 25 checks passed.