
Conversation

@yonglimeta (Contributor)

Summary

We would like to add the following backfill and main-scheduler sdiag stats to the gcm collector:

  • schedule_cycle_max
  • schedule_cycle_mean
  • schedule_cycle_sum
  • schedule_cycle_total
  • schedule_cycle_per_minute
  • schedule_queue_length
  • bf_backfilled_jobs
  • bf_cycle_max
  • bf_cycle_mean
  • bf_cycle_sum
  • bf_queue_len
  • sdiag_jobs_submitted
  • sdiag_jobs_started
  • sdiag_jobs_completed
  • sdiag_jobs_canceled
  • sdiag_jobs_failed
  • sdiag_jobs_pending
  • sdiag_jobs_running

These will help us debug Slurm controller slowness and responsiveness issues.

Note that these counters are cumulative unless we call sdiag reset, so this change adds a reset call every time after we collect the sdiag stats. This makes each sample reflect activity since the previous collection, which is more meaningful as time-series data.
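For reference, a minimal sketch of the collect-then-reset flow described above, assuming `sdiag --json` and `sdiag --reset` are available and that the JSON field names mirror the metric names listed; the actual gcm collector code and field mapping may differ:

```python
import json
import subprocess


def collect_and_reset_sdiag() -> dict:
    """Collect scheduler diagnostics via `sdiag --json`, then reset the counters."""
    # `sdiag --json` prints the scheduler diagnostics as a JSON document;
    # the exact layout varies by Slurm version, so fields are read defensively.
    proc = subprocess.run(
        ["sdiag", "--json"], capture_output=True, text=True, check=True
    )
    stats = json.loads(proc.stdout).get("statistics", {})

    metrics = {
        "schedule_cycle_max": stats.get("schedule_cycle_max", 0),
        "schedule_cycle_mean": stats.get("schedule_cycle_mean", 0),
        "schedule_queue_length": stats.get("schedule_queue_length", 0),
        "bf_backfilled_jobs": stats.get("bf_backfilled_jobs", 0),
        # ... remaining fields from the list above ...
    }

    # `sdiag --reset` zeroes the cumulative counters (requires operator/admin
    # privileges), so the next collection only covers the interval since now.
    subprocess.run(["sdiag", "--reset"], check=True)
    return metrics
```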

Test Plan

Run unit test:
python -m pytest gcm/tests/test_slurm.py -v

Output:

plugins: mock-3.14.1, xdist-3.7.0, typeguard-2.13.3, subprocess-1.5.3, requests-mock-1.12.1
collected 6 items                                                                                                                                                                                                           

gcm/tests/test_slurm.py::TestSlurmCliClient::test_squeue[expected0] PASSED                                                                                                                                            [ 16%]
gcm/tests/test_slurm.py::TestSlurmCliClient::test_squeue_throws_if_popen_throws PASSED                                                                                                                                [ 33%]
gcm/tests/test_slurm.py::TestSlurmCliClient::test_sinfo_throws_if_popen_throws PASSED                                                                                                                                 [ 50%]
gcm/tests/test_slurm.py::TestSlurmCliClient::test_sinfo_structured[sinfo-output-for-structured.txt-expected0] PASSED                                                                                                  [ 66%]
gcm/tests/test_slurm.py::TestSlurmCliClient::test_parse_sdiag_json PASSED                                                                                                                                             [ 83%]
gcm/tests/test_slurm.py::TestSlurmCliClient::test_parse_sdiag_json_with_missing_fields PASSED                                                                                                                         [100%]

===================================================================================================== 6 passed in 0.17s =====================================================================================================
You can verify the E2E test results at the following URLs.

Also build and install gcm on the fair-rc cluster (within a gcm conda env):

pip install --no-deps -e .
gcm slurm_monitor --sink=stdout --once

Output:
[{"derived_cluster": "fair-aws-rc-1", "server_thread_count": 1, "agent_queue_size": 0, "agent_count": 0, "agent_thread_count": 0, "dbd_agent_queue_size": 0, "schedule_cycle_max": 30402, "schedule_cycle_mean": 1300, "schedule_cycle_sum": 3151388, "schedule_cycle_total": 2424, "schedule_cycle_per_minute": 1, "schedule_queue_length": 63, "sdiag_jobs_submitted": 769, "sdiag_jobs_started": 691, "sdiag_jobs_completed": 681, "sdiag_jobs_canceled": 1, "sdiag_jobs_failed": 0, "sdiag_jobs_pending": 78, "sdiag_jobs_running": 1, "bf_backfilled_jobs": 46, "bf_cycle_mean": 5846, "bf_cycle_sum": 3852599, "bf_cycle_max": 24428, "bf_queue_len": 62, "nodes_allocated": 0, "nodes_completing": 2, "nodes_down": 0, "nodes_drained": 0, "nodes_draining": 0, "nodes_fail": 0, "nodes_failing": 0, "nodes_future": 0, "nodes_idle": 2, "nodes_inval": 0, "nodes_maint": 0, "nodes_reboot_issued": 0, "nodes_reboot_requested": 0, "nodes_mixed": 0, "nodes_perfctrs": 0, "nodes_planned": 0, "nodes_power_down": 0, "nodes_powered_down": 0, "nodes_powering_down": 0, "nodes_powering_up": 0, "nodes_reserved": 0, "nodes_unknown": 0, "nodes_not_responding": 0, "nodes_unknown_state": 0, "nodes_total": 4, "total_cpus_avail": 448, "total_gpus_avail": 16, "total_cpus_up": 448, "total_gpus_up": 16, "total_cpus_down": 0, "total_gpus_down": 0, "cluster": "fair-aws-rc-1", "running_and_pending_users": 0, "jobs_pending": 0, "gpus_pending": 0, "nodes_pending": 0, "jobs_failed": 0, "jobs_running": 0, "jobs_without_user": 0, "total_cpus_alloc": 0, "total_down_nodes": 0, "total_gpus_alloc": 0, "total_nodes_alloc": 2}]

yongl@yongl-login-0.yongl-login.tenant-slurm.svc.cluster.local and others added 2 commits February 2, 2026 18:14
)

# Reset sdiag counters after collection
self._reset_sdiag_counters()
Contributor

Why are we resetting the sdiag counters on each collection?

Member

> Note that these counters are cumulative unless we call sdiag reset, so this change adds a reset call every time after we collect the sdiag stats. This makes each sample reflect activity since the previous collection.

@yonglimeta maybe add a CLI flag to control this; resetting seems fine.
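A minimal sketch of what such a flag could look like, assuming a hypothetical `--reset-sdiag` name and standalone argparse wiring rather than gcm's actual CLI:

```python
import argparse

parser = argparse.ArgumentParser(prog="gcm slurm_monitor")
# Hypothetical flag name; gcm's real CLI wiring may look different.
parser.add_argument(
    "--reset-sdiag",
    action=argparse.BooleanOptionalAction,
    default=True,
    help="Reset sdiag counters after each collection "
    "(pass --no-reset-sdiag to keep them cumulative).",
)

args = parser.parse_args(["--no-reset-sdiag"])
print(args.reset_sdiag)  # False
```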

Contributor Author

This is because these sdiag counters are cumulative; if they are not reset, they keep increasing. Resetting allows us to collect true time-series sdiag data.

Contributor

I'm okay with starting with this.

I hope these race conditions are somehow covered, or at least are not a big issue:
1/ Data is collected --> sdiag reset; in parallel, new events arrive between the collection and the reset and get wiped by the reset too.
2/ sdiag also resets its counters every midnight, which opens up multiple race-condition scenarios of its own.


Seconding the concern about potential data loss between each collection and reset. Also, it seems that Slurm natively resets the counters at 12am server time.
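For context, one way to avoid calling sdiag --reset entirely is to keep the last observed value per counter and emit deltas, treating a decrease as an external reset (e.g. the midnight one). A small illustrative sketch, not part of this PR:

```python
from typing import Optional


class CounterDelta:
    """Emit per-interval deltas from a cumulative counter.

    If the counter goes backwards (e.g. Slurm reset it at midnight), the
    current value is taken as the whole delta for the interval.
    """

    def __init__(self) -> None:
        self._last: Optional[int] = None

    def update(self, value: int) -> int:
        if self._last is None or value < self._last:
            delta = value  # first sample, or the counter was reset externally
        else:
            delta = value - self._last
        self._last = value
        return delta


tracker = CounterDelta()
print(tracker.update(100))  # 100 (first sample)
print(tracker.update(150))  # 50
print(tracker.update(20))   # 20 (counter was reset between samples)
```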

yongl@yongl-login-0.yongl-login.tenant-slurm.svc.cluster.local and others added 3 commits February 5, 2026 17:21
@yonglimeta (Contributor Author)

After merging, there are failed checks. The nox-format and typecheck failures seem to point to the sprio update and should not be related to this PR.

yonglimeta merged commit d84e225 into main on Feb 5, 2026 (28 of 33 checks passed).