
Conversation

@yonglimeta (Contributor)

Summary

We would like to add the following backfill and main-scheduler sdiag stats to the gcm collector:

  • schedule_cycle_max
  • schedule_cycle_mean
  • schedule_cycle_sum
  • schedule_cycle_total
  • schedule_cycle_per_minute
  • schedule_queue_length
  • bf_backfilled_jobs
  • bf_cycle_max
  • bf_cycle_mean
  • bf_cycle_sum
  • bf_queue_len
  • sdiag_jobs_submitted
  • sdiag_jobs_started
  • sdiag_jobs_completed
  • sdiag_jobs_canceled
  • sdiag_jobs_failed
  • sdiag_jobs_pending
  • sdiag_jobs_running

These will help us debug Slurm controller slowness and responsiveness issues.

Note that these counters are cumulative unless we call sdiag reset, so this change adds a reset call every time after we collect the sdiag stats. This makes each sample reflect activity since the previous collection, which is more meaningful as time-series data.
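For reference, a minimal sketch of the collect-then-reset flow described above, assuming `sdiag --json` and `sdiag --reset` are available and that the JSON field names mirror the metric names listed; the actual gcm collector code and field mapping may differ:

```python
import json
import subprocess


def collect_and_reset_sdiag() -> dict:
    """Collect scheduler diagnostics via `sdiag --json`, then reset the counters."""
    # `sdiag --json` prints the scheduler diagnostics as a JSON document;
    # the exact layout varies by Slurm version, so fields are read defensively.
    proc = subprocess.run(
        ["sdiag", "--json"], capture_output=True, text=True, check=True
    )
    stats = json.loads(proc.stdout).get("statistics", {})

    metrics = {
        "schedule_cycle_max": stats.get("schedule_cycle_max", 0),
        "schedule_cycle_mean": stats.get("schedule_cycle_mean", 0),
        "schedule_queue_length": stats.get("schedule_queue_length", 0),
        "bf_backfilled_jobs": stats.get("bf_backfilled_jobs", 0),
        # ... remaining fields from the list above ...
    }

    # `sdiag --reset` zeroes the cumulative counters (requires operator/admin
    # privileges), so the next collection only covers the interval since now.
    subprocess.run(["sdiag", "--reset"], check=True)
    return metrics
```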

Test Plan

Run unit test:
python -m pytest gcm/tests/test_slurm.py -v

Output:

plugins: mock-3.14.1, xdist-3.7.0, typeguard-2.13.3, subprocess-1.5.3, requests-mock-1.12.1
collected 6 items                                                                                                                                                                                                           

gcm/tests/test_slurm.py::TestSlurmCliClient::test_squeue[expected0] PASSED                                                                                                                                            [ 16%]
gcm/tests/test_slurm.py::TestSlurmCliClient::test_squeue_throws_if_popen_throws PASSED                                                                                                                                [ 33%]
gcm/tests/test_slurm.py::TestSlurmCliClient::test_sinfo_throws_if_popen_throws PASSED                                                                                                                                 [ 50%]
gcm/tests/test_slurm.py::TestSlurmCliClient::test_sinfo_structured[sinfo-output-for-structured.txt-expected0] PASSED                                                                                                  [ 66%]
gcm/tests/test_slurm.py::TestSlurmCliClient::test_parse_sdiag_json PASSED                                                                                                                                             [ 83%]
gcm/tests/test_slurm.py::TestSlurmCliClient::test_parse_sdiag_json_with_missing_fields PASSED                                                                                                                         [100%]

===================================================================================================== 6 passed in 0.17s =====================================================================================================
You can verify the E2E test results at the following URLs.

Also build and install gcm on the fair-rc cluster (within a gcm conda env):

pip install --no-deps -e .
gcm slurm_monitor --sink=stdout --once

Output:
[{"derived_cluster": "fair-aws-rc-1", "server_thread_count": 1, "agent_queue_size": 0, "agent_count": 0, "agent_thread_count": 0, "dbd_agent_queue_size": 0, "schedule_cycle_max": 30402, "schedule_cycle_mean": 1300, "schedule_cycle_sum": 3151388, "schedule_cycle_total": 2424, "schedule_cycle_per_minute": 1, "schedule_queue_length": 63, "sdiag_jobs_submitted": 769, "sdiag_jobs_started": 691, "sdiag_jobs_completed": 681, "sdiag_jobs_canceled": 1, "sdiag_jobs_failed": 0, "sdiag_jobs_pending": 78, "sdiag_jobs_running": 1, "bf_backfilled_jobs": 46, "bf_cycle_mean": 5846, "bf_cycle_sum": 3852599, "bf_cycle_max": 24428, "bf_queue_len": 62, "nodes_allocated": 0, "nodes_completing": 2, "nodes_down": 0, "nodes_drained": 0, "nodes_draining": 0, "nodes_fail": 0, "nodes_failing": 0, "nodes_future": 0, "nodes_idle": 2, "nodes_inval": 0, "nodes_maint": 0, "nodes_reboot_issued": 0, "nodes_reboot_requested": 0, "nodes_mixed": 0, "nodes_perfctrs": 0, "nodes_planned": 0, "nodes_power_down": 0, "nodes_powered_down": 0, "nodes_powering_down": 0, "nodes_powering_up": 0, "nodes_reserved": 0, "nodes_unknown": 0, "nodes_not_responding": 0, "nodes_unknown_state": 0, "nodes_total": 4, "total_cpus_avail": 448, "total_gpus_avail": 16, "total_cpus_up": 448, "total_gpus_up": 16, "total_cpus_down": 0, "total_gpus_down": 0, "cluster": "fair-aws-rc-1", "running_and_pending_users": 0, "jobs_pending": 0, "gpus_pending": 0, "nodes_pending": 0, "jobs_failed": 0, "jobs_running": 0, "jobs_without_user": 0, "total_cpus_alloc": 0, "total_down_nodes": 0, "total_gpus_alloc": 0, "total_nodes_alloc": 2}]

yongl@yongl-login-0.yongl-login.tenant-slurm.svc.cluster.local and others added 2 commits February 2, 2026 18:14
)

# Reset sdiag counters after collection
self._reset_sdiag_counters()
Contributor

Why are we resetting the sdiag counters on each collection?

Member

> Note that these counters are cumulative unless we call sdiag reset, so this change adds a reset call every time after we collect the sdiag stats. This makes each sample reflect activity since the previous collection.

@yonglimeta maybe add a CLI flag to control this; resetting seems fine.
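A minimal sketch of what such a flag could look like, assuming a hypothetical `--reset-sdiag` name and standalone argparse wiring rather than gcm's actual CLI:

```python
import argparse

parser = argparse.ArgumentParser(prog="gcm slurm_monitor")
# Hypothetical flag name; gcm's real CLI wiring may look different.
parser.add_argument(
    "--reset-sdiag",
    action=argparse.BooleanOptionalAction,
    default=True,
    help="Reset sdiag counters after each collection "
    "(pass --no-reset-sdiag to keep them cumulative).",
)

args = parser.parse_args(["--no-reset-sdiag"])
print(args.reset_sdiag)  # False
```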

Contributor Author

This is because these sdiag counters are cumulative; if they are not reset, they keep increasing. Resetting allows us to collect true time-series sdiag data.

Contributor

I'm okay with starting with this.

I hope these race conditions are somehow covered, or at least are not a big issue:
1/ Data is collected --> sdiag reset; in parallel, new events arrive between the collection and the reset and get wiped by the reset too.
2/ sdiag also resets its counters every midnight, which opens up multiple race-condition scenarios of its own.


Seconding the concern about potential data loss between each collection and reset. Also, it seems that Slurm natively resets the counters at 12am server time.
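For context, one way to avoid calling sdiag --reset entirely is to keep the last observed value per counter and emit deltas, treating a decrease as an external reset (e.g. the midnight one). A small illustrative sketch, not part of this PR:

```python
from typing import Optional


class CounterDelta:
    """Emit per-interval deltas from a cumulative counter.

    If the counter goes backwards (e.g. Slurm reset it at midnight), the
    current value is taken as the whole delta for the interval.
    """

    def __init__(self) -> None:
        self._last: Optional[int] = None

    def update(self, value: int) -> int:
        if self._last is None or value < self._last:
            delta = value  # first sample, or the counter was reset externally
        else:
            delta = value - self._last
        self._last = value
        return delta


tracker = CounterDelta()
print(tracker.update(100))  # 100 (first sample)
print(tracker.update(150))  # 50
print(tracker.update(20))   # 20 (counter was reset between samples)
```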

yongl@yongl-login-0.yongl-login.tenant-slurm.svc.cluster.local and others added 3 commits February 5, 2026 17:21
@yonglimeta (Contributor Author)

After merging, there are failed checks. The nox-format and typecheck failures seem to point to the sprio update and should not be related to this PR.

yonglimeta merged commit d84e225 into main on Feb 5, 2026 (28 of 33 checks passed).