add GPU accounting for SMHP #462

KeitaW · 2024-10-21T09:32:26Z

This PR proposes to add GPU accounting to setup_mariadb_accounting.sh.

$ sreport -tminper cluster utilization --tres="gres/gpu" start=2024-10-21T08:00:00
--------------------------------------------------------------------------------
Cluster Utilization 2024-10-21T08:00:00 - 2024-10-21T08:59:59
Usage reported in TRES Minutes/Percentage of Total
--------------------------------------------------------------------------------
  Cluster      TRES Name         Allocated              Down         PLND Down              Idle           Planned          Reported
--------- -------------- ----------------- ----------------- ----------------- ----------------- ----------------- -----------------
ml-clust+       gres/gpu        11(18.92%)          0(0.00%)          0(0.00%)        49(81.08%)          0(0.00%)       60(100.00%)
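
For scripted checks, the Allocated percentage can be pulled out of a report like the one above. This is a minimal sketch, assuming the whitespace-separated column layout shown here (Cluster, TRES Name, Allocated, ...); `allocated_gpu_pct` is a hypothetical helper name and is not part of this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read sreport-style rows on stdin and print the
# "Allocated" percentage of the gres/gpu row.
# Assumes the column order shown above: Cluster, TRES Name, Allocated, ...
allocated_gpu_pct() {
  awk '$2 == "gres/gpu" {
    # $3 looks like "11(18.92%)"; keep only the number between "(" and "%".
    pct = $3
    sub(/^[^(]*\(/, "", pct)
    sub(/%\).*$/, "", pct)
    print pct
  }'
}

# Sample row copied from the report above.
sample='ml-clust+       gres/gpu        11(18.92%)          0(0.00%)          0(0.00%)        49(81.08%)          0(0.00%)       60(100.00%)'
printf '%s\n' "$sample" | allocated_gpu_pct
```

On a live cluster the same function could be fed from the `sreport` command shown above by piping its output into `allocated_gpu_pct`; header and separator lines are skipped because their second field is not `gres/gpu`.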

nghtm · 2024-10-21T13:17:16Z

Please test auto-resume with this configuration enabled to confirm there is no conflict, and post the results here. I don't think there should be one, since you are only modifying accounting.conf. However, I have seen odd behavior in how HyperPod reacts when gres attributes are modified or used in slurm.conf. Let me know if you need help with testing.

mhuguesaws

Left comments.

1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/setup_mariadb_accounting.sh

olegtalalov · 2024-10-22T04:02:41Z

1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/setup_mariadb_accounting.sh

@@ -99,6 +99,8 @@ JobAcctGatherFrequency=30
 AccountingStorageType=accounting_storage/slurmdbd
 AccountingStorageHost=$DBD_HOST
 AccountingStoragePort=6819
+AccountingStorageTRES=gres/gpu
+GresTypes=gpu
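
A quick way to confirm that a generated config file carries both added settings is sketched below. It is an illustration only: `has_gpu_accounting` is a hypothetical helper, and the demo runs against a temporary file that mirrors this hunk rather than the real file written by the script.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed only if the given config file contains both
# settings added by this hunk.
has_gpu_accounting() {
  local conf="$1"
  grep -Eq '^AccountingStorageTRES=.*gres/gpu' "$conf" &&
    grep -Eq '^GresTypes=.*gpu' "$conf"
}

# Demo against a temporary file that mirrors the hunk above.
demo_conf="$(mktemp)"
cat > "$demo_conf" <<'EOF'
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoragePort=6819
AccountingStorageTRES=gres/gpu
GresTypes=gpu
EOF

if has_gpu_accounting "$demo_conf"; then
  echo "GPU accounting settings present"
fi
rm -f "$demo_conf"
```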


Let me copy-paste my comment from Slack.

With GresTypes=gpu set, customers can no longer choose to resume a faulty job: the AutoResume plugin will always requeue the job in this case. Personally, I'm fine with this because the result is the same: we recover a faulty job from a hardware failure. But this is a behaviour change for existing customers. We can start a discussion on whether gres should be on by default, because in my opinion it has more benefits for customers than resuming jobs.

We also have CPU-only instances. Even trn1/trn2 are marked as "CPU only" and don't have any Gres attached now, and they definitely will not have GPUs. Maybe it makes sense to configure these values via ClusterAgent, depending on the instance types in the cluster.
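
One way to picture the instance-type-dependent setup is a guard that emits the two settings only when the cluster contains a GPU instance type. This is a rough sketch under stated assumptions: both function names are hypothetical, the prefix list is illustrative rather than exhaustive, and it does not show how ClusterAgent would actually supply the instance types.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if any of the given instance types is a GPU type.
# The prefix list is an illustrative assumption, not an exhaustive or official list.
has_gpu_instance() {
  local itype
  for itype in "$@"; do
    case "$itype" in
      ml.p4d.*|ml.p4de.*|ml.p5.*|ml.g5.*) return 0 ;;
    esac
  done
  return 1
}

# Print the GPU accounting settings only when the cluster has GPU instances.
gpu_accounting_lines() {
  if has_gpu_instance "$@"; then
    printf '%s\n' 'AccountingStorageTRES=gres/gpu' 'GresTypes=gpu'
  fi
}

# Mixed cluster: the two settings are printed.
gpu_accounting_lines ml.trn1.32xlarge ml.p5.48xlarge
# No GPU instance types: nothing is printed.
gpu_accounting_lines ml.trn1.32xlarge ml.m5.xlarge
```

With a guard like this, clusters made up only of CPU or trn1/trn2 instances would keep their current configuration and resume behaviour.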

Discussing the gres setup internally.

KeitaW requested review from mhuguesaws and nghtm October 21, 2024 09:34

KeitaW force-pushed the add_gpu_accounting_smhp branch from d53d46f to f152b80 Compare October 21, 2024 09:38

mhuguesaws reviewed Oct 21, 2024

add GPU accounting

840c226

KeitaW force-pushed the add_gpu_accounting_smhp branch from f152b80 to 840c226 Compare October 21, 2024 22:07

olegtalalov reviewed Oct 22, 2024
