[release-3.6] Replace nvidia-persistenced service with parallelcluster_nvidia service to avoid conflicts with DLAMI #2341
Codecov Report
@@ Coverage Diff @@
## release-3.6 #2341 +/- ##
============================================
Coverage 70.01% 70.01%
============================================
Files 13 13
Lines 1834 1834
============================================
Hits 1284 1284
Misses 550 550
…ce to avoid conflicts with DLAMI

The parallelcluster_nvidia service ensures the creation of the block devices /dev/nvidia0, which the slurmd service needs. parallelcluster_nvidia either starts nvidia-persistenced or runs nvidia-smi, to avoid a race condition with other services.

Signed-off-by: Francesco Giordano <giordafr@amazon.it>
# Check if a process is running
#
def is_process_running(process_name)
  ps = Mixlib::ShellOut.new("ps aux | grep '#{process_name}' | egrep -v \"grep .*#{process_name}\"")
minor: we could have used pgrep to match processes by name instead of grepping twice on ps.
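A minimal sketch of the `pgrep` approach suggested above, offered as an illustration rather than as the code merged in this PR. The helper name `process_running?` is hypothetical, and the sketch uses plain Ruby `system` instead of `Mixlib::ShellOut` so that it runs without Chef. One detail to keep in mind: by default `pgrep` matches against the process name, which is limited to 15 characters, and `nvidia-persistenced` is 19 characters long, so the sketch assumes `-f` (match against the full command line) is the right choice here.

```ruby
# Hypothetical alternative to the ps | grep | egrep pipeline (not the PR's code).
# Returns true when pgrep reports at least one matching process.
def process_running?(pattern)
  # pattern is treated by pgrep as an extended regular expression.
  # The array form of system avoids spawning a shell, so no "sh -c ..."
  # wrapper process can match the pattern by accident.
  # system returns true on exit status 0 (match found), false on a non-zero
  # status (no match), and nil if pgrep could not be executed at all.
  system('pgrep', '-f', pattern, out: File::NULL, err: File::NULL) == true
end
```

It would be called as `process_running?('nvidia-persistenced')`. Because `-f` looks at whole command lines, any process whose arguments contain the pattern also counts as a match, which may or may not be acceptable in this recipe.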
NVIDIA
mode '0644'
action :create
variables(is_nvidia_persistenced_running: is_process_running('nvidia-persistenced'))
What happens if nvidia-persistenced is run but parallelcluster_nvidia.service is already running?
The `parallelcluster_nvidia` service ensures the creation of the block devices `/dev/nvidia0`, which the `slurmd` service needs. `parallelcluster_nvidia` either starts `nvidia-persistenced` or runs `nvidia-smi`, which avoids a race condition with other services and avoids conflicts when using DLAMI with a GPU instance.

### Tests
* Modified ChefSpec to verify new changes.

### References
Backport of: aws#2341

Signed-off-by: Enrico Usai <usai@amazon.com>
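The start-or-probe choice described in this commit message can be sketched with a small ERB template. This is an assumption-laden illustration: the real `parallelcluster_nvidia` unit file is not shown on this page, so `UNIT_TEMPLATE` and `render_unit` are hypothetical names and the single `ExecStart` line stands in for whatever the actual template contains. Only the variable name `is_nvidia_persistenced_running` and the two binary paths come from the PR.

```ruby
require 'erb'

# Hypothetical, heavily reduced unit template (not the cookbook's real file).
# If another nvidia-persistenced is already running, fall back to nvidia-smi,
# which still triggers creation of /dev/nvidia0 without starting a second daemon.
UNIT_TEMPLATE = <<~UNIT
  [Service]
  ExecStart=<%= is_nvidia_persistenced_running ? '/usr/bin/nvidia-smi' : '/usr/bin/nvidia-persistenced' %>
UNIT

# Render the template with the flag that the recipe computes at converge time.
def render_unit(is_nvidia_persistenced_running)
  ERB.new(UNIT_TEMPLATE).result_with_hash(
    is_nvidia_persistenced_running: is_nvidia_persistenced_running
  )
end

puts render_unit(false)
```

Note that in this sketch, as in the reviewed recipe, the flag is evaluated once when the template is rendered, not each time the service starts.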
Description of changes
Replace `nvidia-persistenced.service` with a `parallelcluster_nvidia` service, which can either:
* start `/usr/bin/nvidia-persistenced`, if no other `/usr/bin/nvidia-persistenced` is already running; or
* run `/usr/bin/nvidia-smi`, which triggers the creation of `/dev/nvidia0` but does not conflict with other services.

This gives ParallelCluster a service that `slurmd` can depend on, without race conditions against any other NVIDIA daemon the customer may be running.

Tests
* `us-east-1` with `ami-0901c773cf8fa8cb6` and …

Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.