Environment variables for SLURM distributed training #828
Comments
Is there a reason to hard-wire slurm anyway? Can we make that parameter only a command line option, and it's up to the user, in their batch file, to figure out how to go from scheduler-specific env vars to the number that the script needs?
I, at least, would prefer a couple of clearly documented environment variables or CLI options to the current situation. However, some of the other parameters (e.g. job rank) could be a little more difficult to get manually.
I personally really don't like the code being scheduler specific (comes from having used many schedulers over the years). I especially hate it when it's slurm, which is in general nice but terrible about not setting env vars. For the rank, since I don't like the current dependence on
There is also this one: #586. The situation at this stage is suboptimal. For all its sins, slurm, IF set correctly, just works... Personally I would suggest having pre-canned options and one for total freedom...
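One way to read the suggestions above is explicit CLI options that always win, plus a few pre-canned launcher presets that only supply defaults. The sketch below illustrates that idea and is not MACE's actual interface: the option names (`--launcher`, `--world-size`, `--rank`, `--local-rank`) and the helpers `build_parser` and `resolve` are hypothetical. The environment variable names are the ones SLURM and torchrun conventionally export.

```python
import argparse
import os

# Hypothetical presets: which environment variable backs each value per launcher.
PRESETS = {
    "slurm": {"world_size": "SLURM_NTASKS", "rank": "SLURM_PROCID", "local_rank": "SLURM_LOCALID"},
    "torchrun": {"world_size": "WORLD_SIZE", "rank": "RANK", "local_rank": "LOCAL_RANK"},
    "manual": {},  # total freedom: everything must come from the command line
}


def build_parser():
    """Build a parser with hypothetical distributed-launch options."""
    parser = argparse.ArgumentParser(description="Distributed launch options (sketch)")
    parser.add_argument("--launcher", choices=sorted(PRESETS), default="manual")
    parser.add_argument("--world-size", type=int, default=None)
    parser.add_argument("--rank", type=int, default=None)
    parser.add_argument("--local-rank", type=int, default=None)
    return parser


def resolve(args, env=None):
    """Return world_size, rank and local_rank; CLI values override the preset."""
    env = os.environ if env is None else env
    preset = PRESETS[args.launcher]
    resolved = {}
    for name in ("world_size", "rank", "local_rank"):
        value = getattr(args, name)
        if value is None and name in preset:
            raw = env.get(preset[name])
            value = int(raw) if raw is not None else None
        if value is None:
            option = "--" + name.replace("_", "-")
            hint = f" and {preset[name]} is not set" if name in preset else ""
            raise ValueError(f"{option} was not given{hint}")
        resolved[name] = value
    return resolved
```

In a batch file, a user on any scheduler could then pass the values directly, for example `--world-size $SLURM_NTASKS --rank $SLURM_PROCID --local-rank $SLURM_LOCALID` inside an `srun` step, while `--launcher slurm` would cover the common case without extra arguments.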
Distributed training requires `SLURM_NTASKS_PER_NODE` to be set, and if it is not, it provides a fairly unhelpful error message. The variable is not set by slurm unless the `--ntasks-per-node` option is used, so there are some reasonable batch jobs (e.g. specifying `--ntasks` explicitly) where MACE fails unexpectedly. The variable is only used by MACE to compute a default if `SLURM_NTASKS` is not set, but based on my reading of the SLURM doc, `SLURM_NTASKS` will always be set if `NTASKS_PER_NODE` is set. Could the dependence on this environment variable be removed so that a wider range of scheduler requests can be accepted without error?
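To make the request concrete, here is a minimal sketch of a lookup that treats `SLURM_NTASKS_PER_NODE` as optional. It is not MACE's current code: the function name `resolve_slurm_world_size` is hypothetical, and the fallback order (total task count first, then tasks per node times node count) is an assumption about what a tolerant default could look like.

```python
import os


def resolve_slurm_world_size(env=None):
    """Return the total task count from SLURM variables, or raise a clear error."""
    env = os.environ if env is None else env

    # Preferred: SLURM_NTASKS gives the total number of tasks directly.
    ntasks = env.get("SLURM_NTASKS")
    if ntasks is not None:
        return int(ntasks)

    # Fallback: tasks per node times node count, used only if both are present.
    per_node = env.get("SLURM_NTASKS_PER_NODE")
    nnodes = env.get("SLURM_NNODES") or env.get("SLURM_JOB_NUM_NODES")
    if per_node is not None and nnodes is not None:
        return int(per_node) * int(nnodes)

    # Neither route worked: say exactly what was looked for and how to fix it.
    raise RuntimeError(
        "Could not determine the number of tasks: SLURM_NTASKS is not set, and "
        "SLURM_NTASKS_PER_NODE with SLURM_NNODES is not available either. "
        "Request the job with --ntasks or --ntasks-per-node."
    )
```

With this ordering, a job submitted with only `--ntasks` resolves through the first branch, and a missing variable produces a message that names the variables and the scheduler options involved.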