
GPU pingpong test #556

Open: wants to merge 3 commits into master

Conversation

therault
Contributor

This adds a simple test in which data is updated alternately on the CPU and on GPUs.

While writing the test, I found that DTD had not been ported to HIP; this PR completes that port, and also provides a first test for HIP.

@therault therault requested a review from a team as a code owner June 14, 2023 17:59
@therault therault force-pushed the gpu_pingpong_test branch 16 times, most recently from 2cdb527 to 4222191 on June 16, 2023 21:11
@therault
Contributor Author

I'm a bit lost with CI here... Another pair of eyes would help. To summarize what I observe:

  • When running in shared=OFF / profiling=ON mode, we don't detect CUDA at all (no device, no compiler).
  • When running in shared=ON / profiling=OFF mode, we always detect CUDA (the device part).
    • In the master version, we asked for the Spack package gcc@12.something, which makes check_language(CUDA) fail, because nvcc cannot work with gcc > 11.x.
    • In the version proposed in this patch, we load the Spack package gcc@11.3.0. Now something even more curious is happening:
      • We detect CUDAToolkit and enable the CUDA device without a problem.
      • We still claim that check_language(CUDA) fails.
      • To investigate why, I have added some CMake messages whose output appears in the currently failing job (https://github.com/ICLDisco/parsec/actions/runs/5294237295/jobs/9583355256?pr=556):
        • nvcc is where it should be, based on the CUDA toolkit we discovered.
        • I can successfully run nvcc -c /path/to/some/cufile.cu by hand.
        • No CMakeError.log file is generated. I display the contents of CMakeFiles/ and it doesn't seem to contain any useful information.

To conclude, I have no idea why check_language(CUDA) fails in this setup, and I'm now out of ideas to test...
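For reference, the detection sequence being described boils down to something like the following minimal sketch (this is not PaRSEC's actual CMake code, and the project name is made up; it only illustrates the two steps that disagree in the failing job):

```cmake
# Minimal repro sketch of the puzzling situation: find_package(CUDAToolkit)
# succeeds, yet check_language(CUDA) reports no usable CUDA compiler.
cmake_minimum_required(VERSION 3.18)
project(cuda_detect_repro C)

include(CheckLanguage)

# Step 1: succeeds in the failing CI job -- the toolkit and nvcc are located.
find_package(CUDAToolkit)
if(CUDAToolkit_FOUND)
  message(STATUS "CUDAToolkit found, nvcc = ${CUDAToolkit_NVCC_EXECUTABLE}")
endif()

# Step 2: unexpectedly fails -- check_language() tries to compile a tiny CUDA
# source with nvcc plus the host compiler; a gcc/nvcc mismatch would normally
# leave a trace in CMakeError.log, but none is generated here.
check_language(CUDA)
if(CMAKE_CUDA_COMPILER)
  enable_language(CUDA)
else()
  message(STATUS "check_language(CUDA) found no usable CUDA compiler")
endif()
```

Note that step 1 and step 2 test different things: find_package(CUDAToolkit) only locates the toolkit on disk, while check_language(CUDA) actually attempts a trial compilation, so the two can legitimately disagree when the host compiler is incompatible.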

@@ -4,6 +4,7 @@
#include "parsec/data_dist/matrix/two_dim_rectangle_cyclic.h"
#include "parsec/interfaces/dtd/insert_function_internal.h"
#include "tests/tests_data.h"
#include "parsec/mca/device/cuda/device_cuda_internal.h"
Contributor
Why do we need this? This is internal and should not spill over into user code.

@bosilca
Contributor

bosilca commented Oct 19, 2023

Please rebase and reassess the changes to the CI part (it is not clear they are still needed).

Make a token pass from the CPU to each GPU and back, a few times, to check a possible bug found by @devreal.

Part of the DTD interface was not fully ported to HIP

Enable (cuda|hip)_pingpong test in CI

Add a PTG GPU pingpong test to compare with the behavior in DTD -- Work in progress

Tests need to include the appropriate GPU-specific header file, as insert_function_internal.h doesn't do it for them anymore

Enable PTG test over CUDA

Fix errors in data distribution initialization and some DAG errors in the PTG of the GPU pingpong test

Rename files and directories to match the new status of the tests: tests/runtime/cuda is renamed tests/runtime/gpu, and the pingpong tests are named after the API rather than a particular device, since they should work on both GPU types

Only define the pingpong tests if a suitable compiler is found for the kernels
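A guard of that sort might look like the following sketch (target, source, and test names here are purely illustrative, not the ones in the PR's Testings.cmake):

```cmake
# Hypothetical sketch: only define the pingpong tests when a kernel compiler
# for the corresponding GPU language was actually found.
include(CheckLanguage)

check_language(CUDA)
if(CMAKE_CUDA_COMPILER)
  enable_language(CUDA)
  add_executable(cuda_pingpong cuda_pingpong.c pingpong_kernel.cu)
  add_test(NAME runtime/gpu/cuda_pingpong COMMAND cuda_pingpong)
endif()

check_language(HIP)
if(CMAKE_HIP_COMPILER)
  enable_language(HIP)
  add_executable(hip_pingpong hip_pingpong.c pingpong_kernel.hip)
  add_test(NAME runtime/gpu/hip_pingpong COMMAND hip_pingpong)
endif()
```

With this shape, a build host without any GPU toolchain simply skips the tests instead of failing at configure time.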

Do a ping-pong-pong test instead of ping-pong, to see how dependencies are tracked for GPU-to-GPU task dependencies

Fix the checks of the pingpong test, and add it in the Testings.cmake

PTG ping-pong test: in order to guide the selection of the best device, the advised data needs to flow from a CPU task, not directly from memory.

Trying to introduce the gpu_nvidia runner in the CI matrix

Add ROCm, create one github_runner-[device].yaml file per device; remove debugging info from CMakeLists.txt

Add some infrastructure to make sure CI does the device tests where it should, and issue an error if things cannot be tested (e.g. because the GPUs are down or the compiler/Spack is broken)

Trying to work around the xml2 issue with mesa.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

Integrate the gpu_amd/release in the test suite

Add support to rocm-smi in check_nb_devices.sh
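The device-counting helper could look roughly like the sketch below (the function name and the exact output parsing are assumptions for illustration, not the actual contents of check_nb_devices.sh):

```shell
#!/bin/sh
# Hypothetical sketch of a device-count helper in the spirit of
# check_nb_devices.sh: prefer nvidia-smi, fall back to rocm-smi, and report 0
# when neither tool is available, so CI can flag an unexpected device count.
count_gpus() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # nvidia-smi prints one line per GPU with --list-gpus
        nvidia-smi --list-gpus 2>/dev/null | wc -l | tr -d ' '
    elif command -v rocm-smi >/dev/null 2>&1; then
        # rocm-smi's summary output starts one line per device with "GPU[N]"
        # (assumed format; adjust the pattern to the installed ROCm version)
        rocm-smi --showid 2>/dev/null | grep -c '^GPU\['
    else
        echo 0
    fi
}

count_gpus
```

CI can then compare the printed count against the number of devices the runner is supposed to expose, and fail the job loudly when they differ instead of silently skipping GPU tests.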

Make the CMake commands conditional on which GitHub runner is loaded, to prepare for testing
@abouteiller abouteiller added this to the v4.0 milestone Jan 12, 2024
@therault
Contributor Author

Split this PR in two: one for the tester itself and another for the CI/runners

3 participants