
[release-3.6] Replace nvidia-persistenced service with parallelcluster_nvidia service to avoid conflicts with DLAMI #2341

Merged: 1 commit into aws:release-3.6 on Jun 29, 2023

Conversation

@francesco-giordano (Contributor) commented Jun 29, 2023

Description of changes

  • Replace the installation of the nvidia-persistenced.service with the parallelcluster_nvidia service, which can:
  1. Execute /usr/bin/nvidia-persistenced if no other nvidia-persistenced process is already running.
  2. Execute /usr/bin/nvidia-smi, which triggers the creation of /dev/nvidia0 but does not conflict with other services.

This allows ParallelCluster to have a service that slurmd can depend on, without introducing race conditions with any NVIDIA daemon a customer may already be running; a sketch of the resulting decision logic follows.
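
A minimal sketch of how the rendered unit could encode this choice. The unit layout and systemd directives here are illustrative assumptions, not the cookbook's actual file; the `is_nvidia_persistenced_running` variable name comes from the diff discussed below:

```erb
[Unit]
Description=ParallelCluster NVIDIA service (illustrative sketch)
Before=slurmd.service

[Service]
<% if @is_nvidia_persistenced_running -%>
# A persistence daemon is already running (e.g. started by a DLAMI):
# run nvidia-smi once, which triggers the /dev/nvidia0 creation without
# fighting over nvidia-persistenced.
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/nvidia-smi
<% else -%>
# No other daemon found: own GPU persistence mode ourselves.
Type=forking
ExecStart=/usr/bin/nvidia-persistenced
<% end -%>

[Install]
WantedBy=multi-user.target
```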

Tests

  • Kitchen tested on EC2.
  • Created a cluster in us-east-1 with ami-0901c773cf8fa8cb6 and the following DevSettings:

```yaml
DevSettings:
  AmiSearchFilters:
    Owner: '447714826191'
  Cookbook:
    ChefCookbook: https://github.com/francesco-giordano/aws-parallelcluster-cookbook/tarball/b25d096b2d32da5ee67bcb70af37eeabe940270e
```

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@codecov (bot) commented Jun 29, 2023

Codecov Report

Merging #2341 (363d9f0) into release-3.6 (a5d6e8d) will not change coverage.
The diff coverage is n/a.

```
@@             Coverage Diff              @@
##           release-3.6    #2341   +/-   ##
============================================
  Coverage        70.01%   70.01%
============================================
  Files               13       13
  Lines             1834     1834
============================================
  Hits              1284     1284
  Misses             550      550
============================================
```

| Flag | Coverage Δ |
| --- | --- |
| unittests | 70.01% <ø> (ø) |

Flags with carried forward coverage won't be shown.

@francesco-giordano force-pushed the release-3.6 branch 3 times, most recently from 409b095 to 0ceaf08, June 29, 2023 13:13
@hanwen-pcluste marked this pull request as ready for review June 29, 2023 14:14
@hanwen-pcluste requested review from a team as code owners June 29, 2023 14:14
@hanwen-pcluste previously approved these changes Jun 29, 2023
Replace nvidia-persistenced service with parallelcluster_nvidia service to avoid conflicts with DLAMI

The parallelcluster_nvidia service ensures the creation of the block device /dev/nvidia0 and is needed by the slurmd service.

parallelcluster_nvidia starts nvidia-persistenced or runs nvidia-smi to avoid race conditions with other services.

Signed-off-by: Francesco Giordano <giordafr@amazon.it>
```ruby
# Check if a process is running
def is_process_running(process_name)
  ps = Mixlib::ShellOut.new("ps aux | grep '#{process_name}' | egrep -v \"grep .*#{process_name}\"")
  # Body completed for readability (the diff is truncated here): run the
  # command and report whether any matching process was found.
  ps.run_command
  !ps.stdout.strip.empty?
end
```
Review comment (Contributor): minor: we could have used pgrep to match processes by name instead of grepping twice on ps.
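
A minimal sketch of that suggestion, assuming the same Mixlib::ShellOut usage as the original helper (illustrative only, not what the PR ships):

```ruby
require 'mixlib/shellout'

# pgrep matches processes directly, so no self-filtering
# 'grep | egrep -v grep' chain is needed. -f matches against the full
# command line, which avoids the 15-character truncation of process names
# in /proc ('nvidia-persistenced' is 19 characters long).
def is_process_running(process_name)
  pgrep = Mixlib::ShellOut.new("pgrep -f '#{process_name}'")
  pgrep.run_command
  # pgrep exits 0 when at least one matching process exists.
  pgrep.exitstatus.zero?
end
```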

```ruby
mode '0644'
action :create
variables(is_nvidia_persistenced_running: is_process_running('nvidia-persistenced'))
```
Review comment (Contributor): What happens if nvidia-persistenced is run but parallelcluster_nvidia.service is already running?
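
For context, a sketch of the full resource the quoted lines likely belong to; the destination path, template source, and owner/group are assumptions, while mode, action, and variables come verbatim from the diff:

```ruby
# Hypothetical surrounding resource; only the last three properties are
# taken from the quoted diff.
template '/etc/systemd/system/parallelcluster_nvidia.service' do
  source 'parallelcluster_nvidia.service.erb'
  owner 'root'
  group 'root'
  mode '0644'
  action :create
  # The check runs once, at Chef run time, so the rendered unit bakes in
  # whether a customer persistence daemon was already running.
  variables(is_nvidia_persistenced_running: is_process_running('nvidia-persistenced'))
end
```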

@hanwen-pcluste merged commit 5b45d2f into aws:release-3.6 Jun 29, 2023
24 of 27 checks passed
enrico-usai pushed a commit to enrico-usai/aws-parallelcluster-cookbook that referenced this pull request Jul 7, 2023
The `parallelcluster_nvidia` service ensures the creation of the block device `/dev/nvidia0` and is needed by the `slurmd` service.

`parallelcluster_nvidia` starts `nvidia-persistenced` or runs `nvidia-smi` to avoid race conditions with other services and conflicts when using a DLAMI with a GPU instance.

### Tests
* Modified ChefSpec to verify new changes.

### References
Backport of: aws#2341

Signed-off-by: Enrico Usai <usai@amazon.com>
@enrico-usai changed the title Replace nvidia-persistenced service with parallelcluster_nvidia service to avoid conflicts with DLAMI [release-3.6] Replace nvidia-persistenced service with parallelcluster_nvidia service to avoid conflicts with DLAMI Jul 7, 2023
enrico-usai pushed a commit that referenced this pull request Jul 7, 2023
rkilpadi pushed a commit that referenced this pull request Jul 21, 2023