Change how CUDA runtime and capabilities are defined in the task and Condor #11689

Open

wants to merge 6 commits into master
Conversation

@amaltaro (Contributor) commented Aug 15, 2023

Fixes #11595

Status

In development

Description

The following is provided in this PR:

  • the WMTask class now aggregates the GPU requirements of the steps within the same task, as follows:
    • GPUMemoryMB: use the largest value among the steps within a task
    • CUDARuntime: changed from a plain string to a list of strings holding the union of the CUDA runtime versions
    • CUDACapabilities: still a list of strings, but now holding the union of the CUDA capability versions
  • BossAir plugin method to:
    • sort the list of version strings from smallest to largest (see cudaCapabilityToSingleVersion)
    • select the smallest CUDA capability version required by the job (version string in dot notation)
    • convert the dot-notation version to a single integer with the formula (1000 * major + 10 * medium + minor), where 1.2.3 gives major=1, medium=2, minor=3; see [1] for further context and the sketch after this list
  • [GlideinWMS] Refactor the HTCondor CUDACapability classad into a single integer (represented as a string). For instance, it changes from "1.2,3.2,1.4" to "1020". Matchmaking can then be done with something like: Node CUDACapability >= Job CUDACapability, after converting the capability version to a single integer to be compatible with these changes.
  • [GlideinWMS] Refactor the HTCondor CUDARuntime classad into a comma-separated list of CUDA runtimes. For instance, it changes from "1.2" to "1.2,3.2". Matchmaking can then be done with something like: stringListSubsetMatch(job_list_cudaruntime, node_list_cudaruntime). NOTE: this needs an HTCondor upgrade to >= 10.0.6.
  • Create a new HTCondor classad OriginalCUDACapability with the original comma-separated list of CUDA capabilities
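For illustration, here is a minimal sketch of the version ordering and conversion described above, reusing the helper names mentioned in this PR (orderVersionList, cudaCapabilityToSingleVersion); the actual WMCore implementation may differ in details:

```python
from distutils.version import StrictVersion


def orderVersionList(versionList):
    """Sort a list of dot-notation version strings in place, smallest first."""
    if not isinstance(versionList, list):
        return versionList
    versionList.sort(key=StrictVersion)
    return versionList


def cudaCapabilityToSingleVersion(capabilities=None):
    """Convert the smallest CUDA capability required by the job into an integer,
    using (1000 * major + 10 * medium + minor); e.g. ["1.2", "3.2", "1.4"] -> 1020."""
    defaultRes = 0
    capabilities = orderVersionList(capabilities)
    if not capabilities:
        return defaultRes
    parts = [int(i) for i in capabilities[0].split(".")]
    parts += [0] * (3 - len(parts))  # pad missing medium/minor digits with 0
    major, medium, minor = parts[:3]
    return 1000 * major + 10 * medium + minor
```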

Is it backward compatible (if not, which system does it affect)?

NO (in the sense that job matchmaking will have to be updated)

Related PRs

Complement to #11588 such that hybrid GPU workflows can be supported.

External dependencies / deployment changes

Submission Infrastructure needs to update HTCondor to >= 10.0.6, and GlideinWMS needs to update the job matchmaking expressions for CUDARuntime and CUDACapabilities.
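For illustration only, a hedged sketch of what the updated matchmaking could look like in ClassAd syntax, assuming the machine side advertises attributes with the same names (not something this PR defines): the first clause compares the single-integer capabilities, the second checks that every CUDA runtime requested by the job is offered by the node (stringListSubsetMatch requires HTCondor >= 10.0.6).

```
(TARGET.CUDACapability >= MY.CUDACapability) &&
stringListSubsetMatch(MY.CUDARuntime, TARGET.CUDARuntime)
```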

SI ticket: https://its.cern.ch/jira/browse/CMSSI-79
[1]
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART____VERSION.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 test no longer failing
    • 2 tests added
    • 1 change in unstable tests
  • Python3 Pylint check: failed
    • 8 warnings and errors that must be fixed
    • 4 warnings
    • 79 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 38 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14420/artifact/artifacts/PullRequestReport.html

@amaltaro changed the title from "Fix Change how CUDA runtime and capabilities are defined in the task and Condor" to "Change how CUDA runtime and capabilities are defined in the task and Condor" on Aug 16, 2023
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 test no longer failing
    • 2 tests added
  • Python3 Pylint check: failed
    • 28 warnings and errors that must be fixed
    • 7 warnings
    • 229 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 84 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14422/artifact/artifacts/PullRequestReport.html

fix WMTask unit tests
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failure
    • 1 test no longer failing
    • 2 tests added
    • 1 change in unstable tests
  • Python3 Pylint check: failed
    • 29 warnings and errors that must be fixed
    • 7 warnings
    • 259 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 98 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14423/artifact/artifacts/PullRequestReport.html

@amaltaro (Contributor, Author)

In order to perform full testing of these changes, we need to have changes at the SI level (see initial description).
Nonetheless, I'd appreciate any feedback and review.

@todor-ivanov (Contributor) left a comment


Hi @amaltaro

Things look good to me, but I've left a few comments inline. You may wish to take a look.

"""
if not isinstance(versionList, list):
return versionList
versionList.sort(key=StrictVersion)
Contributor

Other than the good documentation provided with this function, which I admit gives useful info, why do we need it? The list.sort method works in place, so executing this same line outside of the function would do just fine. We could still add a NOTE comment with the information from the docstring, though.
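(For reference, a quick illustration of the in-place behaviour mentioned above; not code from this PR:)

```python
from distutils.version import StrictVersion

versions = ["10.2", "1.4", "3.2"]
versions.sort(key=StrictVersion)  # sorts in place and returns None
print(versions)                   # ['1.4', '3.2', '10.2']
```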

defaultRes = 0
# get an ordered list of the versions and use the very first element
capabilities = orderVersionList(capabilities)
if not capabilities:
Contributor

If capabilities is not a list or None (which is the default for the method), but rather a dictionary or a string..., then, because orderVersionList returns its argument as-is when it is not a list, this protection here won't stop the execution, and the next line will break.

Contributor Author

All the GPU parameters are validated at workflow creation (StdBase), so there shouldn't be any type other than list or None at this level.

for comparison/job matchmaking purposes.
Version conversion formula is: (1000 * major + 10 * medium + minor)
:param capabilities: a list of string versions
:return: an integer with the version value; 0 in case of failure
Contributor

What is the order here... is 0 the least significant value, or vice versa?

Contributor Author

I expect the job matchmaking to be available_version >= required_version, so 0 here would mean that any CUDA capability is acceptable to the job. I am not sure this answers your question; if not, I may have misunderstood it.

Contributor

Partially it does.
What bothers me here is that the default value is again 0, and it overlaps with the value returned in case of an error/failure. That would be fine if 0 could never be reached in the matchmaking condition; but since it is included, a failure and the default would trigger the same behavior.

Contributor Author

Yes, 0 can come either from a valid version-to-integer conversion or from an error. However, this code will only be executed if GPUs are required, and if they are, the GPU parameters have been validated and the version is enforced to match something like r"^\d+\.\d+(\.\d+)?$". If the version provided is 0.0, then yes, we get a "valid" single version of 0, which is the same value we would get if an empty list had been provided. But then again, everything is validated at the spec level.

I could change it to None and then mark the classad as undefined. But then we would have a job requiring GPUs while asking for an undefined CUDA capability. I am not sure whether that could trigger other errors and unwanted behavior. Please let me know if you have any suggestions.
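(For context, a quick illustration of how the validation regex mentioned above behaves; hypothetical snippet, not code from the PR:)

```python
import re

versionRe = re.compile(r"^\d+\.\d+(\.\d+)?$")
bool(versionRe.match("7.5"))    # True
bool(versionRe.match("1.2.3"))  # True
bool(versionRe.match("0.0"))    # True -- the corner case that yields a "valid" 0
bool(versionRe.match("abc"))    # False
```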

@todor-ivanov (Contributor) commented Aug 17, 2023

Alan, I do not want to say it... but usually there are many "ifs" in the justification of unhealthy code behavior. Overlapping default and failure return values is not good practice in general, let alone type modification. I know we are supposed to be protected at another level, and we should not enter this bit of code under bad conditions, but let's imagine something changes in the future and we also change our policy on GPU job execution... Then we may enter this code under unexpected conditions, or because of a bug in the request validation code... anything may happen.

So I'd vote for returning a None value in case of a failure and letting the upstream code deal with it. But that's up to you. If you think we are 100% safe and this function will only ever be called under the best conditions, we can leave it as it is.

Contributor Author

I don't really have a strong preference here. I can make it return None by default, have a check in SimpleCondorPlugin for None, and map that to an undefined classad. Sounds good?
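(A rough sketch of what that mapping could look like; gpuCapabilityAd is a hypothetical helper name, and the exact handling of the undefined classad value would need to be checked against the real plugin code:)

```python
import classad


def gpuCapabilityAd(minimalCapability):
    """Hypothetical helper: map a failed conversion (None) to the ClassAd
    'undefined' literal, otherwise quote the integer as done in this PR."""
    if minimalCapability is None:
        return 'undefined'
    return classad.quote(str(minimalCapability))

# usage sketch, mirroring the submit-description line in the diff:
# ad['My.CUDACapability'] = gpuCapabilityAd(minimalCapability)
```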

Contributor

sounds good, yes
thank you @amaltaro

minimalCapability = self.cudaCapabilityToSingleVersion(job['gpuRequirements']['CUDACapabilities'])
ad['My.CUDACapability'] = classad.quote(str(minimalCapability))
ad['My.OriginalCUDACapability'] = classad.quote(str(cudaCapabilities))
cudaRuntime = ','.join(sorted(job['gpuRequirements']['CUDARuntime']))
Contributor

Why do we need to sort the cudaRuntime variable as well?

Contributor Author

I am actually not sure whether this is needed or not. I was going to say it keeps the Condor classad values consistent among different jobs of the same workflow/task, but it could be that the original list is already consistent/ordered. I'd keep it around in case the original list comes out with scrambled versions.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 test no longer failing
    • 1 test added
  • Python3 Pylint check: failed
    • 29 warnings and errors that must be fixed
    • 7 warnings
    • 238 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 84 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14427/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failure
    • 1 test added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 29 warnings and errors that must be fixed
    • 7 warnings
    • 238 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 84 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14428/artifact/artifacts/PullRequestReport.html

@amaltaro (Contributor, Author)

@todor-ivanov the last 2 commits provide changes based on your review. Please have another look.

@todor-ivanov (Contributor) left a comment


things look good
thanks @amaltaro

@amaltaro (Contributor, Author)

@mapellidario @belforte Hi Dario, Stefano, these are the latest changes that we are trying to commission and deploy in production for GPU jobs. I just updated the initial description, but please let me know if you have any questions.

Note that this is not yet in production and we need to discuss/plan the required changes at the SI layer.

@belforte (Member)

Thanks Alan, @novicecpp will be back on Aug 24 and will be able to look at integrating these changes in CRAB as well. It would be nice to have some examples to test, and of course what we dearly miss is users!

@amaltaro (Contributor, Author)

test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failure
    • 1 test added
    • 1 change in unstable tests
  • Python3 Pylint check: failed
    • 35 warnings and errors that must be fixed
    • 7 warnings
    • 245 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 93 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14583/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Can one of the admins verify this patch?

Development

Successfully merging this pull request may close these issues.

Revisit GPUMemoryMB, CUDARuntime and CUDACapabilities at workflow and job matchmaking