Skip to content

EH: CS-612 allow to overwrite the PE allocation rule on the submission command line#77

Merged
jgabler-hpc merged 9 commits intoV91_BRANCHfrom
JG_CS-612_BRANCH
Apr 1, 2026
Merged

EH: CS-612 allow to overwrite the PE allocation rule on the submission command line#77
jgabler-hpc merged 9 commits intoV91_BRANCHfrom
JG_CS-612_BRANCH

Conversation

@jgabler-hpc
Copy link
Copy Markdown
Contributor

No description provided.

Copilot AI review requested due to automatic review settings April 1, 2026 09:51
@jgabler-hpc jgabler-hpc merged commit 7f7ad6f into V91_BRANCH Apr 1, 2026
3 checks passed
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new -par submission/qalter option to override a parallel environment’s allocation_rule per scope (global/master/slave), and wires the new attribute through job parsing, verification, category building, and scheduler allocation-rule handling.

Changes:

  • Add -par allocation_rule CLI option (with -scope support) and persist it into JRS_allocation_rule.
  • Extend qmaster verification and job modification flow to validate/store allocation rule overrides.
  • Update scheduler logic to parse/apply (optionally split) allocation rules for master vs. slave task placement.

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
source/libs/sgeobj/sge_pe.cc Tightens allocation rule validation for numeric rules and centralizes error reporting.
source/libs/sgeobj/sge_job.h Renames unparse parameter for clarity and adds allocation-rule getters/setters.
source/libs/sgeobj/sge_job.cc Implements job get/set for allocation rules and unparses -par into effective cmdline.
source/libs/sgeobj/ocs_Category.cc Adds allocation-rule overrides to the category string (per scope).
source/libs/sched/sge_select_queue.h Extends sge_assignment_t with allocation-rule fields and refactors comments/init.
source/libs/sched/sge_select_queue.cc Major scheduler changes to parse/check allocation rules and allocate master/slave slots.
source/libs/sched/sge_select_queue_rqs.cc Clarifies RQS debit logic comments during slot booking.
source/libs/sched/sge_pe_schedd.h Updates allocation-rule API to accept a string rule and adds strictness toggle.
source/libs/sched/sge_pe_schedd.cc Refactors allocation-rule parsing to operate on rule strings (case-insensitive for placeholders).
source/libs/sched/ocs_BindingSchedd.cc Adjusts binding max-slot reduction to consider master/slave allocation rules.
source/libs/sched/msg_schedd.h Updates allocation-rule parse failure message macro.
source/libs/evc/sge_event_client.cc Removes unused rebuild_categories field from event client control struct.
source/daemons/qmaster/sge_sched_thread.h Removes unused rebuild_categories from scheduler control struct.
source/daemons/qmaster/sge_sched_thread.cc Updates static initialization after removing rebuild_categories.
source/daemons/qmaster/sge_job_verify.cc Adds verification for allocation-rule override scoping and validity.
source/daemons/qmaster/sge_job_qmaster.cc Persists validated JRS_allocation_rule during job modification.
source/daemons/qmaster/msg_qmaster.h Adds messages for allocation-rule override rejection and modification reporting.
source/common/usage.h Adds new argument syntax token for allocation rule.
source/common/usage.cc Adds usage/help text and syntax mapping for -par.
source/common/sge_options.h Adds par_OPT option enum entry.
source/common/sge_options.cc Enables par_OPT for relevant clients/programs.
source/common/parse_qsub.cc Implements -par option parsing and basic placeholder validation.
source/common/parse_job_cull.cc Applies parsed -par values into the job’s request set list.
source/common/msg_common.h Adds argument syntax string for allocation rule; updates some syntax macro names and wording.
source/clients/qsh/ocs_qsh.cc Accepts -par as a known option for qsh filtering.
source/clients/qsh/ocs_qsh_parse.cc Applies -par to the job request sets for qsh parsing.
source/clients/qalter/ocs_qalter.cc Adds -par parsing and stores it into request sets for job alteration.
source/clients/common/ocs_client_job.cc Displays per-scope allocation_rule in job output.
doc/markdown/man/man1/submit.include.md Documents -par semantics and examples, including -scope combinations.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

parallel_add_queue_to_gdil(a, queue_slot.qname, eh_name, qep, queue_slot.slots, queue_slot.is_master_queue);
} else {
// Update slots of an existing GDIL element (in case of $round_robin)
lAddUlong(gdil_ep, JG_slots, slots);
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When updating an existing GDIL element (e.g., in $round_robin), the code adds slots (total slots allocated on the host) instead of the per-queue value. This will over-count and corrupt JG_slots for that queue. Use the current queue_slot.slots (or the delta for this queue) when calling lAddUlong().

Suggested change
lAddUlong(gdil_ep, JG_slots, slots);
lAddUlong(gdil_ep, JG_slots, queue_slot.slots);

Copilot uses AI. Check for mistakes.
qs.qname = qname;
qs.slots = slots;
if (a->is_soft) {
qs.soft_violations = lGetUlong(qep, QU_soft_violation);
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Soft violation accounting looks incorrect: queue_slots_t::soft_violations is filled with QU_soft_violation (per-slot) but later summed directly into a->soft_violations without multiplying by the number of allocated slots. This changes behavior vs the previous slots * QU_soft_violation logic and will undercount violations for multi-slot allocations.

Suggested change
qs.soft_violations = lGetUlong(qep, QU_soft_violation);
u_long32 per_slot_soft_violations = lGetUlong(qep, QU_soft_violation);
qs.soft_violations = per_slot_soft_violations * static_cast<u_long64>(slots);

Copilot uses AI. Check for mistakes.
Comment on lines +402 to +405
a->pe = pe;
a->pe_name = lGetString(pe, PE_name);
a->allocation_rule = lGetString(pe, PE_allocation_rule);

Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assignment_init_pe() initializes a->allocation_rule but does not reset a->mallocation_rule / parsed fields. Since callers often do assignment_copy() (memcpy) before assignment_init_pe(), stale mallocation_rule, mallocation_parsed, allocation_parsed values from a previous PE/job can leak into the new attempt and affect scheduling decisions. Reset these fields (e.g., set pointers to nullptr and parsed ints to 0) at the start of assignment_init_pe().

Copilot uses AI. Check for mistakes.
Comment on lines 259 to +265
#define MSG_JOB_GLOBALMASTERSLAVE_SSS _MESSAGE(33232, _("job rejected: overwriting global %s request %s in %s resource requests is not allowed"))
#define MSG_JOB_GLOBALMASTERSLAVEQ_S _MESSAGE(33233, _("job rejected: overwriting global hard queue requests in %s scope is not allowed"))
#define MSG_JOB_SUBMITJOB_US _MESSAGE(33234, _("Your job " sge_u32 " (" SFQ ") has been submitted"))
#define MSG_JOB_SUBMITJOBARRAY_UUUUS _MESSAGE(33235, _("Your job-array " sge_u32 "." sge_u32 "-" sge_u32 ":" sge_u32 " (" SFQ ") has been submitted"))
#define MSG_LOG_NEWJOB _MESSAGE(33236, _("new job"))
#define MSG_JOB_GLOBALMASTERSLAVEA_S _MESSAGE(33232, _("job rejected: overwriting global allocation_rule in %s requests is not allowed"))
#define MSG_JOB_MASTERSLAVE_A_SS _MESSAGE(33233, _("job rejected: overwriting allocation_rule to " SFQ " is not allowed in scope " SFN))
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The newly added message constants reuse IDs that are already assigned earlier in this file (MSG_JOB_GLOBALMASTERSLAVE_SSS is 33232 and MSG_JOB_GLOBALMASTERSLAVEQ_S is 33233). Reusing IDs will cause message catalog collisions/wrong texts at runtime. Please assign unique message numbers for the new allocation-rule messages.

Copilot uses AI. Check for mistakes.

#define MSG_JSV_BINDING_REJECTED_SS _MESSAGE(33345, _("job rejected: binding parameter not consistent with strategy (" SFN " is " SFN ")"))

#define MSG_JOB_ALLOCATION_RULE _MESSAGE(33246, _("allocation rule"))
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MSG_JOB_ALLOCATION_RULE is defined with message ID 33246, which is already used by MSG_JOB_UNABLE2FINDQOFJOB_S earlier in this file. This ID collision will break message lookup. Please pick a unique ID for MSG_JOB_ALLOCATION_RULE.

Suggested change
#define MSG_JOB_ALLOCATION_RULE _MESSAGE(33246, _("allocation rule"))
#define MSG_JOB_ALLOCATION_RULE _MESSAGE(33238, _("allocation rule"))

Copilot uses AI. Check for mistakes.
Comment on lines +145 to +170
if (global_rule != nullptr) {
if (master_rule != nullptr) {
ERROR(MSG_JOB_GLOBALMASTERSLAVEA_S, "master");
answer_list_add(alpp, SGE_EVENT, STATUS_EUNKNOWN, ANSWER_QUALITY_ERROR);
ret = false;
}
if (slave_rule != nullptr) {
ERROR(MSG_JOB_GLOBALMASTERSLAVEA_S, "slave");
answer_list_add(alpp, SGE_EVENT, STATUS_EUNKNOWN, ANSWER_QUALITY_ERROR);
ret = false;
}
}

// It doesn't make sense to specify $pe_slots for master or slave scope - reject it.
if (master_rule != nullptr && strcmp(master_rule, "$pe_slots") == 0) {
ERROR(MSG_JOB_MASTERSLAVE_A_SS, master_rule, "master");
answer_list_add(alpp, SGE_EVENT, STATUS_EUNKNOWN, ANSWER_QUALITY_ERROR);
ret = false;
}
if (slave_rule != nullptr && strcmp(slave_rule, "$pe_slots") == 0) {
ERROR(MSG_JOB_MASTERSLAVE_A_SS, slave_rule, "slave");
answer_list_add(alpp, SGE_EVENT, STATUS_EUNKNOWN, ANSWER_QUALITY_ERROR);
ret = false;
}

return ret;
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-par / JRS_allocation_rule is currently accepted even for non-parallel jobs (JB_pe == nullptr) as long as it is in global scope. Since this option is defined as “overwrite the PE allocation rule”, it should be rejected for jobs without a PE to avoid surprising no-op submissions and unnecessary category string divergence. Consider extending the non-PE request-set verification to reject any JRS_allocation_rule when JB_pe is not set.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants