
Environment variables for SLURM distributed training #828

Open
aschankler opened this issue Feb 21, 2025 · 4 comments
@aschankler

Distributed training requires SLURM_NTASKS_PER_NODE to be set; if it is not, MACE fails with a fairly unhelpful error message.

The variable is not set by SLURM unless the --ntasks-per-node option is used, so there are reasonable batch jobs (e.g. ones that specify --ntasks explicitly) where MACE fails unexpectedly. MACE only uses the variable to compute a default when SLURM_NTASKS is not set, but based on my reading of the SLURM docs, SLURM_NTASKS is always set whenever SLURM_NTASKS_PER_NODE is. Could the dependence on this environment variable be removed, so that a wider range of scheduler requests is accepted without error?
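For illustration, a minimal sketch of the lookup order being asked for (the helper name is hypothetical, not MACE's actual code): prefer SLURM_NTASKS, and fall back to the per-node count only when the total is unavailable.

```python
import os

def slurm_world_size() -> int:
    """Hypothetical helper: prefer SLURM_NTASKS over SLURM_NTASKS_PER_NODE."""
    ntasks = os.environ.get("SLURM_NTASKS")
    if ntasks is not None:
        return int(ntasks)
    # Fallback: derive the total from per-node count * node count. Note that
    # SLURM_NTASKS_PER_NODE is only exported when --ntasks-per-node is used.
    per_node = os.environ.get("SLURM_NTASKS_PER_NODE")
    if per_node is not None:
        return int(per_node) * int(os.environ.get("SLURM_JOB_NUM_NODES", "1"))
    raise RuntimeError(
        "Cannot determine world size: neither SLURM_NTASKS nor "
        "SLURM_NTASKS_PER_NODE is set; request tasks with --ntasks or "
        "--ntasks-per-node, or pass the size explicitly."
    )
```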

@aschankler changed the title from "SLURM distributed" to "Environment variables for SLURM distributed training" on Feb 21, 2025
@bernstei
Collaborator

Is there a reason to hard-wire SLURM anyway? Could we make that parameter a command-line option only, and leave it up to the user, in their batch file, to figure out how to go from scheduler-specific env vars to the number the script needs?
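A minimal sketch of that suggestion, with hypothetical flag names: the script itself knows nothing about any scheduler, and the batch file does the translation.

```python
import argparse

# Hypothetical CLI; the batch file maps scheduler variables onto it, e.g.:
#   srun python run_train.py --num-ranks "$SLURM_NTASKS" ...
parser = argparse.ArgumentParser()
parser.add_argument("--num-ranks", type=int, required=True,
                    help="total number of training processes")
args = parser.parse_args()
```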

@aschankler
Author

I, at least, would prefer a couple of clearly documented environment variables or CLI options to the current situation. However, some of the other parameters (e.g. the job rank) could be a little more difficult to get manually.

@bernstei
Collaborator

bernstei commented Feb 21, 2025

I personally really don't like the code being scheduler-specific (that comes from having used many schedulers over the years). I especially hate it when it's SLURM, which is in general nice but terrible about not setting env vars reliably.

As for the rank: since I dislike the current dependence on srun for the same reason, getting it from the MPI rank, e.g. using mpi4py, would be more versatile.
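A minimal sketch of that idea, assuming torch.distributed's env:// initialization downstream (RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT are the variables it reads; the port choice here is arbitrary):

```python
import os
import socket

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
world_size = comm.Get_size()

# Rank 0 broadcasts its hostname so every process agrees on the rendezvous
# address, independent of any scheduler-specific environment variables.
master_addr = comm.bcast(socket.gethostname() if rank == 0 else None, root=0)

os.environ["RANK"] = str(rank)
os.environ["WORLD_SIZE"] = str(world_size)
os.environ["MASTER_ADDR"] = master_addr
os.environ["MASTER_PORT"] = "29500"  # arbitrary; must be free on rank 0's node
```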

@alinelena
Contributor

There is also #586. The situation at this stage is suboptimal. For all its sins, SLURM, IF set up correctly, just works... Personally I would suggest having pre-canned options plus one for total freedom...
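A rough sketch of what "pre-canned options plus one for total freedom" could look like (the flag and helper names are hypothetical, not MACE's actual interface; SLURM_PROCID and SLURM_NTASKS are the variables srun exports):

```python
import argparse
import os

def slurm_env():
    # Pre-canned SLURM option: relies on variables srun exports.
    return int(os.environ["SLURM_PROCID"]), int(os.environ["SLURM_NTASKS"])

def mpi_env():
    # Pre-canned MPI option: works under any MPI-aware launcher.
    from mpi4py import MPI
    return MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.Get_size()

parser = argparse.ArgumentParser()
parser.add_argument("--launcher", choices=["slurm", "mpi", "manual"],
                    default="manual")
parser.add_argument("--rank", type=int)
parser.add_argument("--world-size", type=int)
args = parser.parse_args()

if args.launcher == "slurm":
    rank, world_size = slurm_env()
elif args.launcher == "mpi":
    rank, world_size = mpi_env()
else:
    # "Total freedom": the user supplies the numbers however they like.
    rank, world_size = args.rank, args.world_size
```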
