EH: CS-612 allow to overwrite the PE allocation rule on the submission command line#77
EH: CS-612 allow to overwrite the PE allocation rule on the submission command line#77jgabler-hpc merged 9 commits intoV91_BRANCHfrom
Conversation
…n command line // added command line switches and qstat output // scheduler code still missing
…cheduler into JG_CS-612_BRANCH
…-par 1 - workaround for CS-1898 / CS-1899
- optimization (do not touch master host in second round of round_robin if it has been filled with a fixed alloc rule) - added -par switch to man page
…cheduler into JG_CS-612_BRANCH
There was a problem hiding this comment.
Pull request overview
This PR introduces a new -par submission/qalter option to override a parallel environment’s allocation_rule per scope (global/master/slave), and wires the new attribute through job parsing, verification, category building, and scheduler allocation-rule handling.
Changes:
- Add
-par allocation_ruleCLI option (with-scopesupport) and persist it intoJRS_allocation_rule. - Extend qmaster verification and job modification flow to validate/store allocation rule overrides.
- Update scheduler logic to parse/apply (optionally split) allocation rules for master vs. slave task placement.
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| source/libs/sgeobj/sge_pe.cc | Tightens allocation rule validation for numeric rules and centralizes error reporting. |
| source/libs/sgeobj/sge_job.h | Renames unparse parameter for clarity and adds allocation-rule getters/setters. |
| source/libs/sgeobj/sge_job.cc | Implements job get/set for allocation rules and unparses -par into effective cmdline. |
| source/libs/sgeobj/ocs_Category.cc | Adds allocation-rule overrides to the category string (per scope). |
| source/libs/sched/sge_select_queue.h | Extends sge_assignment_t with allocation-rule fields and refactors comments/init. |
| source/libs/sched/sge_select_queue.cc | Major scheduler changes to parse/check allocation rules and allocate master/slave slots. |
| source/libs/sched/sge_select_queue_rqs.cc | Clarifies RQS debit logic comments during slot booking. |
| source/libs/sched/sge_pe_schedd.h | Updates allocation-rule API to accept a string rule and adds strictness toggle. |
| source/libs/sched/sge_pe_schedd.cc | Refactors allocation-rule parsing to operate on rule strings (case-insensitive for placeholders). |
| source/libs/sched/ocs_BindingSchedd.cc | Adjusts binding max-slot reduction to consider master/slave allocation rules. |
| source/libs/sched/msg_schedd.h | Updates allocation-rule parse failure message macro. |
| source/libs/evc/sge_event_client.cc | Removes unused rebuild_categories field from event client control struct. |
| source/daemons/qmaster/sge_sched_thread.h | Removes unused rebuild_categories from scheduler control struct. |
| source/daemons/qmaster/sge_sched_thread.cc | Updates static initialization after removing rebuild_categories. |
| source/daemons/qmaster/sge_job_verify.cc | Adds verification for allocation-rule override scoping and validity. |
| source/daemons/qmaster/sge_job_qmaster.cc | Persists validated JRS_allocation_rule during job modification. |
| source/daemons/qmaster/msg_qmaster.h | Adds messages for allocation-rule override rejection and modification reporting. |
| source/common/usage.h | Adds new argument syntax token for allocation rule. |
| source/common/usage.cc | Adds usage/help text and syntax mapping for -par. |
| source/common/sge_options.h | Adds par_OPT option enum entry. |
| source/common/sge_options.cc | Enables par_OPT for relevant clients/programs. |
| source/common/parse_qsub.cc | Implements -par option parsing and basic placeholder validation. |
| source/common/parse_job_cull.cc | Applies parsed -par values into the job’s request set list. |
| source/common/msg_common.h | Adds argument syntax string for allocation rule; updates some syntax macro names and wording. |
| source/clients/qsh/ocs_qsh.cc | Accepts -par as a known option for qsh filtering. |
| source/clients/qsh/ocs_qsh_parse.cc | Applies -par to the job request sets for qsh parsing. |
| source/clients/qalter/ocs_qalter.cc | Adds -par parsing and stores it into request sets for job alteration. |
| source/clients/common/ocs_client_job.cc | Displays per-scope allocation_rule in job output. |
| doc/markdown/man/man1/submit.include.md | Documents -par semantics and examples, including -scope combinations. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| parallel_add_queue_to_gdil(a, queue_slot.qname, eh_name, qep, queue_slot.slots, queue_slot.is_master_queue); | ||
| } else { | ||
| // Update slots of an existing GDIL element (in case of $round_robin) | ||
| lAddUlong(gdil_ep, JG_slots, slots); |
There was a problem hiding this comment.
When updating an existing GDIL element (e.g., in $round_robin), the code adds slots (total slots allocated on the host) instead of the per-queue value. This will over-count and corrupt JG_slots for that queue. Use the current queue_slot.slots (or the delta for this queue) when calling lAddUlong().
| lAddUlong(gdil_ep, JG_slots, slots); | |
| lAddUlong(gdil_ep, JG_slots, queue_slot.slots); |
| qs.qname = qname; | ||
| qs.slots = slots; | ||
| if (a->is_soft) { | ||
| qs.soft_violations = lGetUlong(qep, QU_soft_violation); |
There was a problem hiding this comment.
Soft violation accounting looks incorrect: queue_slots_t::soft_violations is filled with QU_soft_violation (per-slot) but later summed directly into a->soft_violations without multiplying by the number of allocated slots. This changes behavior vs the previous slots * QU_soft_violation logic and will undercount violations for multi-slot allocations.
| qs.soft_violations = lGetUlong(qep, QU_soft_violation); | |
| u_long32 per_slot_soft_violations = lGetUlong(qep, QU_soft_violation); | |
| qs.soft_violations = per_slot_soft_violations * static_cast<u_long64>(slots); |
| a->pe = pe; | ||
| a->pe_name = lGetString(pe, PE_name); | ||
| a->allocation_rule = lGetString(pe, PE_allocation_rule); | ||
|
|
There was a problem hiding this comment.
assignment_init_pe() initializes a->allocation_rule but does not reset a->mallocation_rule / parsed fields. Since callers often do assignment_copy() (memcpy) before assignment_init_pe(), stale mallocation_rule, mallocation_parsed, allocation_parsed values from a previous PE/job can leak into the new attempt and affect scheduling decisions. Reset these fields (e.g., set pointers to nullptr and parsed ints to 0) at the start of assignment_init_pe().
| #define MSG_JOB_GLOBALMASTERSLAVE_SSS _MESSAGE(33232, _("job rejected: overwriting global %s request %s in %s resource requests is not allowed")) | ||
| #define MSG_JOB_GLOBALMASTERSLAVEQ_S _MESSAGE(33233, _("job rejected: overwriting global hard queue requests in %s scope is not allowed")) | ||
| #define MSG_JOB_SUBMITJOB_US _MESSAGE(33234, _("Your job " sge_u32 " (" SFQ ") has been submitted")) | ||
| #define MSG_JOB_SUBMITJOBARRAY_UUUUS _MESSAGE(33235, _("Your job-array " sge_u32 "." sge_u32 "-" sge_u32 ":" sge_u32 " (" SFQ ") has been submitted")) | ||
| #define MSG_LOG_NEWJOB _MESSAGE(33236, _("new job")) | ||
| #define MSG_JOB_GLOBALMASTERSLAVEA_S _MESSAGE(33232, _("job rejected: overwriting global allocation_rule in %s requests is not allowed")) | ||
| #define MSG_JOB_MASTERSLAVE_A_SS _MESSAGE(33233, _("job rejected: overwriting allocation_rule to " SFQ " is not allowed in scope " SFN)) |
There was a problem hiding this comment.
The newly added message constants reuse IDs that are already assigned earlier in this file (MSG_JOB_GLOBALMASTERSLAVE_SSS is 33232 and MSG_JOB_GLOBALMASTERSLAVEQ_S is 33233). Reusing IDs will cause message catalog collisions/wrong texts at runtime. Please assign unique message numbers for the new allocation-rule messages.
|
|
||
| #define MSG_JSV_BINDING_REJECTED_SS _MESSAGE(33345, _("job rejected: binding parameter not consistent with strategy (" SFN " is " SFN ")")) | ||
|
|
||
| #define MSG_JOB_ALLOCATION_RULE _MESSAGE(33246, _("allocation rule")) |
There was a problem hiding this comment.
MSG_JOB_ALLOCATION_RULE is defined with message ID 33246, which is already used by MSG_JOB_UNABLE2FINDQOFJOB_S earlier in this file. This ID collision will break message lookup. Please pick a unique ID for MSG_JOB_ALLOCATION_RULE.
| #define MSG_JOB_ALLOCATION_RULE _MESSAGE(33246, _("allocation rule")) | |
| #define MSG_JOB_ALLOCATION_RULE _MESSAGE(33238, _("allocation rule")) |
| if (global_rule != nullptr) { | ||
| if (master_rule != nullptr) { | ||
| ERROR(MSG_JOB_GLOBALMASTERSLAVEA_S, "master"); | ||
| answer_list_add(alpp, SGE_EVENT, STATUS_EUNKNOWN, ANSWER_QUALITY_ERROR); | ||
| ret = false; | ||
| } | ||
| if (slave_rule != nullptr) { | ||
| ERROR(MSG_JOB_GLOBALMASTERSLAVEA_S, "slave"); | ||
| answer_list_add(alpp, SGE_EVENT, STATUS_EUNKNOWN, ANSWER_QUALITY_ERROR); | ||
| ret = false; | ||
| } | ||
| } | ||
|
|
||
| // It doesn't make sense to specify $pe_slots for master or slave scope - reject it. | ||
| if (master_rule != nullptr && strcmp(master_rule, "$pe_slots") == 0) { | ||
| ERROR(MSG_JOB_MASTERSLAVE_A_SS, master_rule, "master"); | ||
| answer_list_add(alpp, SGE_EVENT, STATUS_EUNKNOWN, ANSWER_QUALITY_ERROR); | ||
| ret = false; | ||
| } | ||
| if (slave_rule != nullptr && strcmp(slave_rule, "$pe_slots") == 0) { | ||
| ERROR(MSG_JOB_MASTERSLAVE_A_SS, slave_rule, "slave"); | ||
| answer_list_add(alpp, SGE_EVENT, STATUS_EUNKNOWN, ANSWER_QUALITY_ERROR); | ||
| ret = false; | ||
| } | ||
|
|
||
| return ret; |
There was a problem hiding this comment.
-par / JRS_allocation_rule is currently accepted even for non-parallel jobs (JB_pe == nullptr) as long as it is in global scope. Since this option is defined as “overwrite the PE allocation rule”, it should be rejected for jobs without a PE to avoid surprising no-op submissions and unnecessary category string divergence. Consider extending the non-PE request-set verification to reject any JRS_allocation_rule when JB_pe is not set.
No description provided.