Change how CUDA runtime and capabilities are defined in the task and Condor #11689

Open

wants to merge 6 commits into master
Conversation

@amaltaro (Contributor) commented Aug 15, 2023

Fixes #11595

Status

In development

Description

The following is provided in this PR:

  • the WMTask class now aggregates the GPU requirements of the steps within the same task, as follows:
    • GPUMemoryMB: use the largest value among the steps within a task
    • CUDARuntime: changed from a plain string to a list of strings holding the union of the CUDA runtime versions
    • CUDACapabilities: still a list of strings, but now holding the union of the CUDA capability versions
  • BossAir plugin method to:
    • sort the list of version strings from smallest to largest (see cudaCapabilityToSingleVersion)
    • select the smallest CUDA capability version required by the job (version string in dot notation)
    • convert the dot-notation version to a single integer with the formula (1000 * major + 10 * medium + minor), where 1.2.3 gives major=1, medium=2, minor=3; see [1] for further context and the sketch after this list
  • [GlideinWMS] Refactor the HTCondor CUDACapability classad into a single integer (represented as a string). For instance, it changes from "1.2,3.2,1.4" to "1020". Matchmaking can then be done with something like: Node CUDACapability >= Job CUDACapability, after converting the capability version to a single integer to be compatible with these changes.
  • [GlideinWMS] Refactor the HTCondor CUDARuntime classad into a comma-separated list of CUDA runtimes. For instance, it changes from "1.2" to "1.2,3.2". Matchmaking can then be done with something like: stringListSubsetMatch(job_list_cudaruntime, node_list_cudaruntime). NOTE: this needs an HTCondor upgrade to >= 10.0.6.
  • Create a new HTCondor classad OriginalCUDACapability with the original comma-separated list of CUDA capabilities
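For illustration, here is a minimal sketch of the version ordering and conversion described above, reusing the helper names mentioned in this PR (orderVersionList, cudaCapabilityToSingleVersion); the actual WMCore implementation may differ in details:

```python
from distutils.version import StrictVersion


def orderVersionList(versionList):
    """Sort a list of dot-notation version strings in place, smallest first."""
    if not isinstance(versionList, list):
        return versionList
    versionList.sort(key=StrictVersion)
    return versionList


def cudaCapabilityToSingleVersion(capabilities=None):
    """Convert the smallest CUDA capability required by the job into an integer,
    using (1000 * major + 10 * medium + minor); e.g. ["1.2", "3.2", "1.4"] -> 1020."""
    defaultRes = 0
    capabilities = orderVersionList(capabilities)
    if not capabilities:
        return defaultRes
    parts = [int(i) for i in capabilities[0].split(".")]
    parts += [0] * (3 - len(parts))  # pad missing medium/minor digits with 0
    major, medium, minor = parts[:3]
    return 1000 * major + 10 * medium + minor
```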

Is it backward compatible (if not, which system does it affect)?

NO (in the sense that job matchmaking will have to be updated)

Related PRs

Complement to #11588 such that hybrid GPU workflows can be supported.

External dependencies / deployment changes

Submission Infrastructure needs to update HTCondor to >= 10.0.6, and GlideinWMS needs to update the job matchmaking expressions for CUDARuntime and CUDACapabilities.
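For illustration only, a hedged sketch of what the updated matchmaking could look like in ClassAd syntax, assuming the machine side advertises attributes with the same names (not something this PR defines): the first clause compares the single-integer capabilities, the second checks that every CUDA runtime requested by the job is offered by the node (stringListSubsetMatch requires HTCondor >= 10.0.6).

```
(TARGET.CUDACapability >= MY.CUDACapability) &&
stringListSubsetMatch(MY.CUDARuntime, TARGET.CUDARuntime)
```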

SI ticket: https://its.cern.ch/jira/browse/CMSSI-79
[1]
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART____VERSION.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 test no longer failing
    • 2 tests added
    • 1 change in unstable tests
  • Python3 Pylint check: failed
    • 8 warnings and errors that must be fixed
    • 4 warnings
    • 79 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 38 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14420/artifact/artifacts/PullRequestReport.html

@amaltaro changed the title from "Fix Change how CUDA runtime and capabilities are defined in the task and Condor" to "Change how CUDA runtime and capabilities are defined in the task and Condor" on Aug 16, 2023
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 test no longer failing
    • 2 tests added
  • Python3 Pylint check: failed
    • 28 warnings and errors that must be fixed
    • 7 warnings
    • 229 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 84 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14422/artifact/artifacts/PullRequestReport.html

fix WMTask unit tests
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failure
    • 1 test no longer failing
    • 2 tests added
    • 1 change in unstable tests
  • Python3 Pylint check: failed
    • 29 warnings and errors that must be fixed
    • 7 warnings
    • 259 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 98 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14423/artifact/artifacts/PullRequestReport.html

@amaltaro (Contributor, Author)

In order to perform full testing of these changes, we need to have changes at the SI level (see initial description).
Nonetheless, I'd appreciate any feedback and review.

@todor-ivanov (Contributor) left a comment


Hi @amaltaro

Things look good to me, but I've left a few comments inline. You may wish to take a look.

"""
if not isinstance(versionList, list):
return versionList
versionList.sort(key=StrictVersion)
Contributor

Other than the good documentation provided with this function, which I admit gives useful info, why do we need it? The list.sort method works in place, so executing this same line outside of the function would do just fine. We could still add a NOTE comment with the information from the docstring, though.
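(For reference, a quick illustration of the in-place behaviour mentioned above; not code from this PR:)

```python
from distutils.version import StrictVersion

versions = ["10.2", "1.4", "3.2"]
versions.sort(key=StrictVersion)  # sorts in place and returns None
print(versions)                   # ['1.4', '3.2', '10.2']
```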

defaultRes = 0
# get an ordered list of the versions and use the very first element
capabilities = orderVersionList(capabilities)
if not capabilities:
Contributor

If capabilities is not a list or None (which is the default for the method), but rather a dictionary or a string..., then, because orderVersionList returns its argument as-is when it is not a list, this protection here won't stop the execution, and the next line will break.

Contributor Author

All the GPU parameters are validated at workflow creation (StdBase), so there shouldn't be any type other than list or None at this level.

for comparison/job matchmaking purposes.
Version conversion formula is: (1000 * major + 10 * medium + minor)
:param capabilities: a list of string versions
:return: an integer with the version value; 0 in case of failure
Contributor

What is the order here... is 0 the least significant value, or vice versa?

Contributor Author

I expect the job matchmaking to be available_version >= required_version, so 0 here would mean that any CUDA capability is acceptable to the job. I am not sure this answers your question; if not, I may have misunderstood it.

Contributor

Partially it does.
What bothers me here is that the default value is again 0, and it overlaps with the value returned in case of an error/failure. That would be fine if 0 could never be reached in the matchmaking condition; but since it is included, a failure and the default would trigger the same behavior.

Contributor Author

Yes, 0 can come either from a valid version-to-integer conversion or from an error. However, this code will only be executed if GPUs are required, and if they are, the GPU parameters have been validated and the version is enforced to match something like r"^\d+\.\d+(\.\d+)?$". If the version provided is 0.0, then yes, we get a "valid" single version of 0, which is the same value we would get if an empty list had been provided. But then again, everything is validated at the spec level.

I could change it to None and then mark the classad as undefined. But then we would have a job requiring GPUs while asking for an undefined CUDA capability. I am not sure whether that could trigger other errors and unwanted behavior. Please let me know if you have any suggestions.
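(For context, a quick illustration of how the validation regex mentioned above behaves; hypothetical snippet, not code from the PR:)

```python
import re

versionRe = re.compile(r"^\d+\.\d+(\.\d+)?$")
bool(versionRe.match("7.5"))    # True
bool(versionRe.match("1.2.3"))  # True
bool(versionRe.match("0.0"))    # True -- the corner case that yields a "valid" 0
bool(versionRe.match("abc"))    # False
```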

@todor-ivanov (Contributor) commented Aug 17, 2023

Alan, I do not want to say it... but usually there are many "ifs" in the justification of unhealthy code behavior. Overlapping default and failure return values is not good practice in general, let alone type modification. I know we are supposed to be protected at another level, and we should not enter this bit of code under bad conditions, but let's imagine something changes in the future and we also change our policy on GPU job execution... Then we may enter this code under unexpected conditions, or because of a bug in the request validation code... anything may happen.

So I'd vote for returning a None value in case of a failure and letting the upstream code deal with it. But that's up to you. If you think we are 100% safe and this function will only ever be called under the best conditions, we can leave it as it is.

Contributor Author

I don't really have a strong preference here. I can make it return None by default, have a check in SimpleCondorPlugin for None, and map that to an undefined classad. Sounds good?
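(A rough sketch of what that mapping could look like; gpuCapabilityAd is a hypothetical helper name, and the exact handling of the undefined classad value would need to be checked against the real plugin code:)

```python
import classad


def gpuCapabilityAd(minimalCapability):
    """Hypothetical helper: map a failed conversion (None) to the ClassAd
    'undefined' literal, otherwise quote the integer as done in this PR."""
    if minimalCapability is None:
        return 'undefined'
    return classad.quote(str(minimalCapability))

# usage sketch, mirroring the submit-description line in the diff:
# ad['My.CUDACapability'] = gpuCapabilityAd(minimalCapability)
```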

Contributor

sounds good, yes
thank you @amaltaro

minimalCapability = self.cudaCapabilityToSingleVersion(job['gpuRequirements']['CUDACapabilities'])
ad['My.CUDACapability'] = classad.quote(str(minimalCapability))
ad['My.OriginalCUDACapability'] = classad.quote(str(cudaCapabilities))
cudaRuntime = ','.join(sorted(job['gpuRequirements']['CUDARuntime']))
Contributor

Why do we need to sort the cudaRuntime variable as well?

Contributor Author

I am actually not sure whether this is needed or not. I was going to say it keeps the Condor classad values consistent among different jobs of the same workflow/task, but it could be that the original list is already consistent/ordered. I'd keep it around in case the original list comes out with scrambled versions.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 test no longer failing
    • 1 test added
  • Python3 Pylint check: failed
    • 29 warnings and errors that must be fixed
    • 7 warnings
    • 238 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 84 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14427/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failure
    • 1 test added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 29 warnings and errors that must be fixed
    • 7 warnings
    • 238 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 84 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14428/artifact/artifacts/PullRequestReport.html

@amaltaro (Contributor, Author)

@todor-ivanov the last 2 commits provide changes based on your review. Please have another look.

@todor-ivanov (Contributor) left a comment


things look good
thanks @amaltaro

@amaltaro (Contributor, Author)

@mapellidario @belforte Hi Dario, Stefano, these are the latest changes that we are trying to commission and deploy in production for GPU jobs. I just updated the initial description, but please let me know if you have any questions.

Note that this is not yet in production and we need to discuss/plan the required changes at the SI layer.

@belforte (Member)

Thanks Alan, @novicecpp will be back on Aug 24 and will be able to look at integrating these changes in CRAB as well. It would be nice to have some examples to test, and of course what we dearly miss is users!

@amaltaro (Contributor, Author)

test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failure
    • 1 test added
    • 1 change in unstable tests
  • Python3 Pylint check: failed
    • 35 warnings and errors that must be fixed
    • 7 warnings
    • 245 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 93 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14583/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Can one of the admins verify this patch?

Development

Successfully merging this pull request may close these issues.

Revisit GPUMemoryMB, CUDARuntime and CUDACapabilities at workflow and job matchmaking