
[release-3.6] Replace nvidia-persistenced service with parallelcluster_nvidia service to avoid conflicts with DLAMI #2341

Merged: 1 commit into aws:release-3.6 on Jun 29, 2023

Conversation

@francesco-giordano (Contributor) commented Jun 29, 2023

Description of changes

  • Replace the installation of the nvidia-persistenced.service with the parallelcluster_nvidia service, which can:
  1. Execute /usr/bin/nvidia-persistenced if no other nvidia-persistenced process is already running.
  2. Execute /usr/bin/nvidia-smi, which triggers the creation of /dev/nvidia0 but does not conflict with other services.

This allows ParallelCluster to have a service that slurmd can depend on, without introducing race conditions with any NVIDIA daemon a customer may already be running; a sketch of the resulting decision logic follows.
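
A minimal sketch of how the rendered unit could encode this choice. The unit layout and systemd directives here are illustrative assumptions, not the cookbook's actual file; the `is_nvidia_persistenced_running` variable name comes from the diff discussed below:

```erb
[Unit]
Description=ParallelCluster NVIDIA service (illustrative sketch)
Before=slurmd.service

[Service]
<% if @is_nvidia_persistenced_running -%>
# A persistence daemon is already running (e.g. started by a DLAMI):
# run nvidia-smi once, which triggers the /dev/nvidia0 creation without
# fighting over nvidia-persistenced.
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/nvidia-smi
<% else -%>
# No other daemon found: own GPU persistence mode ourselves.
Type=forking
ExecStart=/usr/bin/nvidia-persistenced
<% end -%>

[Install]
WantedBy=multi-user.target
```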

Tests

  • Kitchen tested on EC2.
  • Created a cluster in us-east-1 with ami-0901c773cf8fa8cb6 and the following DevSettings:

```yaml
DevSettings:
  AmiSearchFilters:
    Owner: '447714826191'
  Cookbook:
    ChefCookbook: https://github.com/francesco-giordano/aws-parallelcluster-cookbook/tarball/b25d096b2d32da5ee67bcb70af37eeabe940270e
```

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@codecov (bot) commented Jun 29, 2023

Codecov Report

Merging #2341 (363d9f0) into release-3.6 (a5d6e8d) will not change coverage.
The diff coverage is n/a.

```
@@             Coverage Diff              @@
##           release-3.6    #2341   +/-   ##
============================================
  Coverage        70.01%   70.01%
============================================
  Files               13       13
  Lines             1834     1834
============================================
  Hits              1284     1284
  Misses             550      550
============================================
```

| Flag | Coverage Δ |
| --- | --- |
| unittests | 70.01% <ø> (ø) |

Flags with carried forward coverage won't be shown.

@francesco-giordano force-pushed the release-3.6 branch 3 times, most recently from 409b095 to 0ceaf08, June 29, 2023 13:13
@hanwen-pcluste marked this pull request as ready for review June 29, 2023 14:14
@hanwen-pcluste requested review from a team as code owners June 29, 2023 14:14
@hanwen-pcluste previously approved these changes Jun 29, 2023
Replace nvidia-persistenced service with parallelcluster_nvidia service to avoid conflicts with DLAMI

The parallelcluster_nvidia service ensures the creation of the block device /dev/nvidia0 and is needed by the slurmd service.

parallelcluster_nvidia starts nvidia-persistenced or runs nvidia-smi to avoid race conditions with other services.

Signed-off-by: Francesco Giordano <giordafr@amazon.it>
```ruby
# Check if a process is running
def is_process_running(process_name)
  ps = Mixlib::ShellOut.new("ps aux | grep '#{process_name}' | egrep -v \"grep .*#{process_name}\"")
  # Body completed for readability (the diff is truncated here): run the
  # command and report whether any matching process was found.
  ps.run_command
  !ps.stdout.strip.empty?
end
```
Review comment (Contributor): minor: we could have used pgrep to match processes by name instead of grepping twice on ps.
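
A minimal sketch of that suggestion, assuming the same Mixlib::ShellOut usage as the original helper (illustrative only, not what the PR ships):

```ruby
require 'mixlib/shellout'

# pgrep matches processes directly, so no self-filtering
# 'grep | egrep -v grep' chain is needed. -f matches against the full
# command line, which avoids the 15-character truncation of process names
# in /proc ('nvidia-persistenced' is 19 characters long).
def is_process_running(process_name)
  pgrep = Mixlib::ShellOut.new("pgrep -f '#{process_name}'")
  pgrep.run_command
  # pgrep exits 0 when at least one matching process exists.
  pgrep.exitstatus.zero?
end
```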

```ruby
mode '0644'
action :create
variables(is_nvidia_persistenced_running: is_process_running('nvidia-persistenced'))
```
Review comment (Contributor): What happens if nvidia-persistenced is run but parallelcluster_nvidia.service is already running?
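
For context, a sketch of the full resource the quoted lines likely belong to; the destination path, template source, and owner/group are assumptions, while mode, action, and variables come verbatim from the diff:

```ruby
# Hypothetical surrounding resource; only the last three properties are
# taken from the quoted diff.
template '/etc/systemd/system/parallelcluster_nvidia.service' do
  source 'parallelcluster_nvidia.service.erb'
  owner 'root'
  group 'root'
  mode '0644'
  action :create
  # The check runs once, at Chef run time, so the rendered unit bakes in
  # whether a customer persistence daemon was already running.
  variables(is_nvidia_persistenced_running: is_process_running('nvidia-persistenced'))
end
```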

@hanwen-pcluste merged commit 5b45d2f into aws:release-3.6 Jun 29, 2023
24 of 27 checks passed
enrico-usai pushed a commit to enrico-usai/aws-parallelcluster-cookbook that referenced this pull request Jul 7, 2023
The `parallelcluster_nvidia` service ensures the creation of the block device `/dev/nvidia0` and is needed by the `slurmd` service.

`parallelcluster_nvidia` starts `nvidia-persistenced` or runs `nvidia-smi` to avoid race conditions with other services and conflicts when using a DLAMI with a GPU instance.

### Tests
* Modified ChefSpec to verify new changes.

### References
Backport of: aws#2341

Signed-off-by: Enrico Usai <usai@amazon.com>
@enrico-usai changed the title Replace nvidia-persistenced service with parallelcluster_nvidia service to avoid conflicts with DLAMI [release-3.6] Replace nvidia-persistenced service with parallelcluster_nvidia service to avoid conflicts with DLAMI Jul 7, 2023
enrico-usai pushed a commit that referenced this pull request Jul 7, 2023
rkilpadi pushed a commit that referenced this pull request Jul 21, 2023