V3.0 #105
base: v3.0
Correct the protection to use static versions of pmix_getline if PMIx version is less than v4.2.5 Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 5cde35d)
Always default the number of slots to the available cpus in the topology. Ensure that we always display some form of the resulting process map, or else we will silently exit. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit f01e2a2)
It should be `help-hostfile.txt`, not `help-hostfiles.txt` Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit c34d91a)
If we use one cpu from an object, then we will get a NULL response if we ask for the next object of that type within the remaining cpuset since not all of the cpus in the object are still available. This problem resulted from the recent change to only use available cpus in PRRTE topologies. So instead scan across the cpus, check to see if it is inside the object of interest - if so, then we can bind to that cpu, if not then we keep searching. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 2d0a840)
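The search described in the commit message above can be illustrated with a small Python sketch. This is not PRRTE's code, which works on hwloc bitmaps in C. The function name `find_cpu_in_object` and the use of plain sets for cpusets are assumptions made for the example.

```python
from typing import Iterable, Optional, Set


def find_cpu_in_object(available_cpus: Iterable[int],
                       object_cpus: Set[int]) -> Optional[int]:
    """Return the first available cpu that lies inside the object, or None.

    Hypothetical helper: it scans the remaining cpus instead of asking for
    the "next object" within the available cpuset, a lookup that fails once
    any cpu of the object is already in use.
    """
    for cpu in sorted(available_cpus):
        if cpu in object_cpus:
            return cpu  # this cpu belongs to the object, so we can bind to it
    return None  # nothing left in this object; the caller keeps searching


# A core with cpus {0, 1}; cpu 0 is already taken, so only cpu 1 remains.
core = {0, 1}
remaining = {1, 2, 3}
chosen = find_cpu_in_object(remaining, core)
```

The point of the sketch is that a partially used object still yields a usable cpu, whereas a lookup that requires the whole object to be available would return nothing.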
Attempt to make it clearer that the binding failed due to a lack of cpus for the given map/bind policies. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit dfcc9a7)
PRRTE itself no longer requires specific resilience settings. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 8b95fe2)
Add a new cmd line option that corresponds to this attribute. Add the attribute to the prun payload. When received, it will default to including in the job info for the spawned job. Add query support for it. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 3957789)
Homebrew has broken something and I cannot figure out how to fix it. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 2ac45f3)
Changes will need to be made to Open MPI to parse the contents of the OMPI_MCA_mpi_memory_alloc_kinds environment variable to determine how to use the user supplied memory-alloc-kinds information. See section 11.4.3 of the MPI 4.1 standard. Signed-off-by: Howard Pritchard <howardp@lanl.gov> (cherry picked from commit c5953e1)
Get takes a (pmix_value_t**), so don't cast it to (void**) Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 475df02)
If we haven't requested LSF support, then don't warn about not finding yp_all - we didn't ask for LSF, so no need to warn us if support cannot be built. It will show in the summary at end of configure. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit fcaa417)
Now that we have a broader group of contributors starting to show up, we probably need to start paying more attention to code quality of contributions. Enable devel-check by default in Git clones that are configured with enable-debug. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 5ffc3d4)
Try adding a build using latest Clang Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 1a3dc29)
When building against older PMIx Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 940a474)
Signed-off-by: Ralph Castain <rhc@pmix.org>
It has been reported (and confirmed) that building against one version of PMIx and then running with another version will cause PRRTE to segfault. This isn't a universal rule. For example, one can switch v5.0 and master without a problem. However, switching v5.0 and v4.2 is a definite segfault. The root cause of the problem is a change in the layout of the base pmix_object_t definition. This renders all PMIx objects binary incompatible when crossing between the v5 and v4 (and below) series. Changing the v5 definition back to match v4 is an overly complex task. The changes were required to accommodate the new shared memory support that was introduced in v5. So instead, we check the runtime version of PMIx against the build version. If the runtime version is incompatible with the build version, then we print an explanatory error message and error out. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit d02ad07)
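A minimal Python sketch of the build-versus-runtime check described in the commit message above follows. It is an illustration only. The function names, the tuple representation of versions, and the rule that every version with major number 5 or higher shares one object layout are assumptions made for the example, not PRRTE's actual logic.

```python
from typing import Tuple

Version = Tuple[int, int, int]


def abi_series(version: Version) -> str:
    """Group versions by object layout: v5 and later vs. v4 and below.

    Assumption for this sketch: the layout changed exactly once, at v5.
    """
    return "v5+" if version[0] >= 5 else "v4-"


def check_runtime_version(build: Version, runtime: Version) -> None:
    """Raise if the runtime PMIx cannot be used with the build-time PMIx."""
    if abi_series(build) != abi_series(runtime):
        raise RuntimeError(
            "PMIx version mismatch: built against "
            f"{'.'.join(map(str, build))} but running with "
            f"{'.'.join(map(str, runtime))}"
        )
```

Failing early with an explanatory message replaces the segfault that the mismatch would otherwise cause later, when objects with different layouts are exchanged.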
Refs open-mpi/ompi#12540 Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 7e0ff9b)
We had problems in the past with quoted params, but stripping quotes also has consequences, and the best solution is not clear. For now, let's try going the other way and see how many problems we encounter. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit be840ab)
Take only the piece that is applicable to v3.0. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry-pick of openpmix@891bad8)
Fix the issues with the MacOS builds so that they work again in Github Action environments. Signed-off-by: Jeff Squyres <jeff@squyres.com> (cherry picked from commit 4a682ef)
Enables build against v1.11.8 and above. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit ac80553)
If we are trying to bind to an HWLOC object type that is not defined on a given node, then (a) if the binding policy was specified by user, then error out; and (b) if we are using a default binding policy, then simply do not bind. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 5d21059)
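The two cases in the commit message above reduce to a small decision function. The Python sketch below is only an illustration: `BindAction`, `resolve_binding`, and the representation of a node's object types as a set of strings are hypothetical names chosen for the example.

```python
from enum import Enum
from typing import Set


class BindAction(Enum):
    BIND = "bind"
    DO_NOT_BIND = "do not bind"
    ERROR = "error"


def resolve_binding(object_types_on_node: Set[str], target_type: str,
                    user_specified: bool) -> BindAction:
    """Decide what to do when binding to target_type on a given node."""
    if target_type in object_types_on_node:
        return BindAction.BIND
    # The object type is not defined on this node.
    if user_specified:
        return BindAction.ERROR  # (a) an explicit user request cannot be honored
    return BindAction.DO_NOT_BIND  # (b) a default policy quietly falls back
```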
Signed-off-by: Ralph Castain <rhc@pmix.org>
In some recent Slurm versions, the Slurm runtime is inserting custom arguments to the PRRTE launcher's `srun` cmd line without the user being aware of it. In many cases, this may not be a problem - but in some cases (where the user or the system admin needs/wants particular cmd line arguments used) this can cause problems as it happens silently, without the user being aware of it. Make this visible when it happens, and provide a mechanism by which the user/admin can override it. Provide a fairly long help message explaining what happened and offering advice on resolution, along with a param for disabling the warning. Add a param for overriding the "args" param if necessary, along with a caution as to possible consequences. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 092cd7c)
RTD is rolling out some changes. Per https://about.readthedocs.com/blog/2024/07/addons-by-default/, these are the changes we need to make. Port of open-mpi/ompi#12687 Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 584845f)
We currently do not support the LTO optimizer as it is incompatible with our plugin component architecture. So detect it has been specified in configure and error out with an explanation. Includes suggestions from @jsquyres Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit dd7706c)
Break the multi-loop thru loading of param files that caused us to overwrite values. Defer to the PMIx pmdl components for obtaining envars and for checking MCA param overlaps across projects. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit a68d647)
Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit ce25672)
Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit e204c73)
Make the remote connection and foreign tool settings be via MCA param so they can be globally set. Don't set the remote connection option unless someone specified it so that PMIx can use the default behavior if necessary. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit ce2f0c2)
Signed-off-by: Ralph Castain <rhc@pmix.org>
When fixing a merge conflict, some code was inadvertently removed, so replace it Signed-off-by: Ralph Castain <rhc@pmix.org>
Signed-off-by: Ralph Castain <rhc@pmix.org>
Provide MCA params to control the ability for a client to connect even if it has a different pid than what we started. This happens when an intermediate script or executable is being used to fork the client - e.g., in the case of a debugger. Set this to not require pid match by default. Also provide a switch to enable/disable client clones - i.e., for a client process to fork a child that also connects back to the PMIx server since it will use the same nspace/rank as its parent. This is currently an unusual use-case, but allowed by the Standard. Set this to not allow clones by default. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 52c8851)
We captured the HNP's aliases in prte_process_info, but that happened _after_ we had already copied them to the HNP's node object. So when we then checked the node aliases, they were missing from that node. Ensure we capture the HNP's aliases on the node object. Simplify the check for local node by including the "localhost" and "127.0.0.1" aliases, being sure not to include them in the nidmap. Correct the check in dash-host for matching node names. Thanks to Alexey Novikov for the report Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 8070277)
Signed-off-by: Ralph Castain <rhc@pmix.org>
If someone specifies that child jobs inherit from their parents, then have them inherit any env directives as well as job-level directives. Have children inherit their parent's inheritance directive, unless directed not to do so. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit eb577d4)
If we are inheriting envar directives from our parent job, then extend that to inheriting envar directives for the application of the proc that spawned us. Shift processing of inheritance directives to the mapper, and ensure that the child inherits the inheritance directive so that the grandchildren will also inherit. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit a63791f)
Check RAS components for compile errors by shimming the environment-specific functions Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 17399cd)
There were two compensating errors that wound up yielding the correct map, but the code had a flaw that would surface should a certain condition exist. So rework the code to fix the errors and remove the flaw. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit bdbf4db)
Work from left-to-right across the cmd line, applying env-related options as we go. When one operation affects the result of another, this preserves a user's common expectation. Add a "--set-env" option if the corresponding PMIx CLI is defined. Seemed a little weird that we had "prepend-env", "append-env", etc., but no "set-env". It's the equivalent of "-x foo=val". Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 805e130)
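To show why left-to-right processing matters, here is a small Python sketch. It does not reproduce PRRTE's command-line parser. The `(operation, name, value)` tuples, the function name `apply_env_ops`, and the `:` separator are assumptions made for the example.

```python
from typing import Dict, Iterable, Tuple

EnvOp = Tuple[str, str, str]  # (operation, variable name, value)


def apply_env_ops(env: Dict[str, str], ops: Iterable[EnvOp],
                  sep: str = ":") -> Dict[str, str]:
    """Apply env operations strictly in the order they were given."""
    result = dict(env)  # leave the caller's environment untouched
    for op, name, value in ops:
        current = result.get(name)
        if op == "set":
            result[name] = value
        elif op == "prepend":
            result[name] = value if current is None else value + sep + current
        elif op == "append":
            result[name] = value if current is None else current + sep + value
        else:
            raise ValueError(f"unknown env operation: {op}")
    return result


# The same two operations, given in opposite orders.
ops_a = [("set", "FOO", "base"), ("append", "FOO", "extra")]
ops_b = [("append", "FOO", "extra"), ("set", "FOO", "base")]
```

With `ops_a` the append extends the value that was just set, while with `ops_b` the later set overwrites the appended value. Processing in command-line order gives the result a user would expect from reading the line left to right.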
Signed-off-by: Matthew Whitlock <mwhitlo@sandia.gov> (cherry picked from commit 0b1ada9)
This error is also displayed in cases where files or directories do not exist and is not only caused by missing permissions. Signed-off-by: Christoph Niethammer <niethammer@hlrs.de> (cherry picked from commit ac77387)
Allow the target node list to follow the ordering inside a provided hostfile and dash-host specification by not assigning a bookmark based on the DVM job. Add support for missing default-hostfile cmd line option. We have the support for the user to specify it via MCA param, but somehow we lost the integration to pick it up off of the prte and prterun cmd lines. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 16d8412)
PPR placement policy requests are uniform - i.e., the specified number of procs must be placed on every object of the directed type. When the request includes a cpu/proc directive, then there must also be enough CPUs to meet the request on every object. When that isn't the case, then we need to error out and not just place the proc without binding it. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 665c38e)
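The uniformity requirement in the commit message above can be written as a simple feasibility check. The Python sketch below is an illustration under stated assumptions: `check_ppr_request` is a hypothetical name, and each object is represented only by its count of available cpus.

```python
from typing import Dict


def check_ppr_request(cpus_per_object: Dict[str, int], procs_per_object: int,
                      cpus_per_proc: int) -> None:
    """Raise if any object cannot hold its share of procs with bound cpus."""
    needed = procs_per_object * cpus_per_proc
    for obj, available in cpus_per_object.items():
        if available < needed:
            # Error out instead of placing the proc without binding it.
            raise RuntimeError(
                f"{obj} has {available} cpus but the request needs {needed}"
            )
```

For example, a request for 2 procs per package with 2 cpus per proc needs 4 cpus on every package, so a node with one 4-cpu package and one 2-cpu package is rejected as a whole.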
If we are using the seq or rankfile mapper and have multiple apps on the cmd line, then allow the mappers to compute their own num procs if one or more are not given. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit cb17cce)
The empty nodes were not properly being added to the list of names to be used by the mapper. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 58130c6)
Per note in the OMPI project, at least one compiler family is removing the "sprintf" function. Replace all uses of that function with the safer "snprintf" version. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 2ff7d6b)
When a timeout is specified and the primary job is timed-out, then we need to ensure we also report and kill any child jobs it started. This includes reporting any requested stack traces. Also allow inheritance of output directives like tag and timestamp. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit d072f27)
Port the "launching-apps" section from the OMPI docs over to PRRTE since it specifically deals with prterun usage. Add some updates about gridengine support courtesy of open-mpi/ompi#13450. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 424480d)
Use the hwloc synthetic topology string as the signature instead of our custom attempt at counting number of types of objects - the synthetic retains some hierarchical info and hopefully does a little better job of detecting that hetero nodes are in use. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 7e5d030)
Update the MCA param help message to clarify what the param does and what values it supports. Cleanup an error where we would overwrite the resulting list of signals to forward. Cleanup the return value so we don't generate spurious error log output. Provide verbose output showing the signals being forwarded. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 2845dcd)
Further improve automatic handling of hetero nodes by making the non-symmetric signature unique, thereby forcing collection of the full topology from each such node. Fix an error in the topology retrieval procedure whereby we double-counted cached nodes, thereby causing us to quit collecting topologies early. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 4671290)
Need to init the ess framework to have the signal forwarding list initialized Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit bff13fb)
Signed-off-by: Ralph Castain <rhc@pmix.org>
Signed-off-by: Matthew Whitlock <mwhitlo@sandia.gov> (cherry picked from commit da3ca98)
Hello! The Git Commit Checker CI bot found a few problems with this PR:
5b8889e: Update NEWS and VERSION
2e89339: Final update of NEWS and VERSION for release
0267dfe: Update NEWS
2cb071d: Replace some incorrectly removed code
25d98c8: Update NEWS
b33a522: Update NEWS
b6c3a01: Protect against running with PMIx versions too hig...
f86c858: Check for PMIx version too high
222f03f: Update VERSION and NEWS for release
0ff51bd: Update NEWS for release
c4f6f78: Roll version to 3.0.10
d20e10c: Update NEWS for release
20f2c2a: Constrain PMIx versions
7b9b2aa: Protect against stone age HWLOC
648aa78: Update NEWS
253f60a: Minor cleanups
1141770: Add mpi4py CI
5a42463: Add build against older PMIx CI
be9ad17: Minor cleanups
ac40d3f: Remove the group CI as this release branch doesn't...
611f87b: Roll VERSION for end of release branch
f6f5c18: Final update for release
35270a9: Update NEWS and VERSION
2828a49: Revert "configure.ac: generate prte_version.h prop...
37e0525: Revert "configure.ac: generate prte_version.h prop...
b2f4163: Update NEWS and VERSION for final release
1b6e6d7: Update NEWS and VERSION for release
e9507eb: Protect against old PMIx versions
50147d8: Pull a couple of fixes from master branch
289e6ab: 3.0: fix support for MPIEXEC_TIMEOUT
b68a0ac: Update NEWS and VERSION for release
c6c9d12: Tailored backport of "various fixes for singleton ...
fce79e9: Cleanup issues surfaced by devel-check
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
I can't push directly to the v3.0 branch (so I can't sync via the GitHub web API), so here's the PR to do so.
This is the second step of what we discussed yesterday.