add GPU accounting for SMHP #462

Open: wants to merge 1 commit into main

KeitaW
Collaborator

@KeitaW KeitaW commented Oct 21, 2024

This PR proposes to add GPU accounting to setup_mariadb_accounting.sh.

$ sreport -tminper cluster utilization --tres="gres/gpu" start=2024-10-21T08:00:00
--------------------------------------------------------------------------------
Cluster Utilization 2024-10-21T08:00:00 - 2024-10-21T08:59:59
Usage reported in TRES Minutes/Percentage of Total
--------------------------------------------------------------------------------
  Cluster      TRES Name         Allocated              Down         PLND Down              Idle           Planned          Reported
--------- -------------- ----------------- ----------------- ----------------- ----------------- ----------------- -----------------
ml-clust+       gres/gpu        11(18.92%)          0(0.00%)          0(0.00%)        49(81.08%)          0(0.00%)       60(100.00%)
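For anyone who wants to track this number over time, the Allocated column of the report above can be extracted with a few lines of shell. This is only a sketch under assumptions: `parse_gpu_allocated` is a hypothetical helper name that is not part of this PR, and it assumes the column layout shown in the sample output (cluster, TRES name, Allocated, ...).

```shell
#!/usr/bin/env bash
# Sketch: extract the Allocated value for gres/gpu from sreport-style text.
# parse_gpu_allocated is a hypothetical helper, not part of the PR.
parse_gpu_allocated() {
  # Reads report text on stdin; prints "<minutes> <percent>" for the gres/gpu row.
  awk '$2 == "gres/gpu" {
         # "11(18.92%)" splits into parts[1]="11" and parts[2]="18.92"
         split($3, parts, /[()%]/)
         print parts[1], parts[2]
       }'
}

# Sample row copied from the report above.
sample='ml-clust+       gres/gpu        11(18.92%)          0(0.00%)          0(0.00%)        49(81.08%)          0(0.00%)       60(100.00%)'
printf '%s\n' "$sample" | parse_gpu_allocated
```

On a live cluster the same function could presumably be fed directly from the `sreport` command shown above by piping its output into `parse_gpu_allocated`.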

@KeitaW KeitaW requested review from mhuguesaws and nghtm October 21, 2024 09:34
@KeitaW KeitaW force-pushed the add_gpu_accounting_smhp branch from d53d46f to f152b80 Compare October 21, 2024 09:38
@nghtm
Collaborator

nghtm commented Oct 21, 2024

Please test auto-resume with this configuration enabled to confirm there is no conflict, and post the results here. I don't think there should be one, since you are only modifying accounting.conf; however, I have seen funky behavior in how HyperPod reacts when gres attributes are modified or used in slurm.conf. Let me know if you need help with testing.

Contributor

@mhuguesaws mhuguesaws left a comment

Left comments.

@KeitaW KeitaW force-pushed the add_gpu_accounting_smhp branch from f152b80 to 840c226 Compare October 21, 2024 22:07
@@ -99,6 +99,8 @@ JobAcctGatherFrequency=30
 AccountingStorageType=accounting_storage/slurmdbd
 AccountingStorageHost=$DBD_HOST
 AccountingStoragePort=6819
+AccountingStorageTRES=gres/gpu
+GresTypes=gpu

Let me copy paste my comment from slack.

GresTypes=gpu doesn't allow customers to resume a faulty job if they want to. The AutoResume plugin will always requeue the job in this case. Personally, I'm fine with this because the result is the same: we recover a faulty job from a hardware failure. But this is a behaviour change for existing customers. We can start a discussion on whether gres should be on by default, because in my opinion it has more benefits for customers than resuming jobs.

We have CPU-only instances. Even trn1/trn2 are marked as "CPU only" and don't have any Gres attached now, and they definitely will not have GPUs. Maybe it makes sense to configure these values via ClusterAgent depending on the instance types in the cluster.
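To make the last suggestion concrete, here is a rough sketch of emitting the GPU settings only when the cluster contains a GPU instance type. Everything in it is an assumption for illustration: the function names are hypothetical, the `ml.p*` / `ml.g*` prefix match is a guess at how GPU types could be recognized, and this is plain shell rather than anything ClusterAgent actually does.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical and not part of the PR.

is_gpu_instance_type() {
  # Assumption: GPU instance types start with ml.p or ml.g.
  # trn1/trn2 and CPU-only types fall through to "not GPU".
  case "$1" in
    ml.p*|ml.g*) return 0 ;;
    *)           return 1 ;;
  esac
}

gres_conf_lines() {
  # Prints the two GPU settings from the diff if any argument is a GPU type;
  # prints nothing for clusters without GPU instances.
  local t
  for t in "$@"; do
    if is_gpu_instance_type "$t"; then
      printf '%s\n' 'AccountingStorageTRES=gres/gpu' 'GresTypes=gpu'
      return 0
    fi
  done
  return 0
}

gres_conf_lines ml.p5.48xlarge ml.trn1.32xlarge   # mixed cluster: prints both settings
gres_conf_lines ml.trn1.32xlarge ml.c5.large      # no GPU types: prints nothing
```

A real implementation would need an authoritative list of instance types rather than a prefix match, and would still have to address the AutoResume behaviour change described above.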

Collaborator Author

Discussing the gres setup internally.

4 participants