
share_allocation: max -- asynchronous launching #725

Open
j-ogas opened this issue Jan 17, 2024 · 3 comments

j-ogas commented Jan 17, 2024

Pav2 is the only test harness I've found that allows me to specify a number of nodes and execute all subsequent jobs on them (thank you). This is achieved as follows:

modes/share.yaml

scheduler: slurm
schedule:
  nodes: 1
  share_allocation: max

However, when looking at the results output, it appears that these jobs are launched serially, rather than asynchronously. See below.

Edited pav results output showing launch times.

11:20:24
11:20:19
11:20:16
11:20:12
11:20:03
11:19:53
11:19:43
11:19:34
11:19:30
11:19:27
11:19:24
11:19:20

Note that all of these tests are single-rank, so they should be launchable concurrently with srun using the following extra arguments.

  slurm:
    srun_extra:
     - --overlap  

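The asynchronous launch pattern that --overlap enables can be sketched as "start every step before waiting on any of them." The following is an illustrative Python sketch only, not Pavilion's actual kickoff logic; the commands stand in for per-test srun -n 1 --overlap invocations.

```python
import subprocess
import sys

def launch_async(cmds):
    """Start every command before waiting on any, mimicking how
    backgrounded `srun --overlap` steps run concurrently inside one
    allocation. Returns the list of exit codes."""
    procs = [subprocess.Popen(c) for c in cmds]  # all launched up front
    return [p.wait() for p in procs]             # then collect results

if __name__ == "__main__":
    # Stand-ins for single-rank test commands.
    cmds = [[sys.executable, "-c", f"print('test {i}')"] for i in range(3)]
    print(launch_async(cmds))
```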
One potential issue is overwhelming Slurm. Adding another key, e.g. max_queue, that limits the number of asynchronous jobs placed in the srun queue might help. Perhaps something like the following.

modes/share.yaml

scheduler: slurm
schedule:
  nodes: 1
  share_allocation: max
  max_queue: 250
  slurm:
    srun_extra:
     - --overlap 
     - --gres=craynetwork:0
j-ogas added the enhancement and feature request labels on Jan 17, 2024
@Paul-Ferrell (Collaborator)

Currently the kickoff scripts simply contain a pav _run command for each test to run in an allocation, which is why the tests launch serially.

What we need to do is expand pav _run so that it can take multiple tests as arguments and manage those tests according to their max_queue setting. It should look up the number of tasks each test requires via the scheduler variables and count that against the total queue size. Note that the queue size can vary from test to test (unless we make it one of the parameters that forces allocation separation), so the number of concurrently running tests will have to be managed dynamically. For example, if we have tests with max_queue values of 1, 2, 4, and 12, the size-1 test would run by itself, while any pair of the size-2, size-4, and size-12 tests could run together.

I think we need a better name than max_queue. Maybe max_share_tasks?
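The dynamic management described above can be sketched as a greedy backfill loop. This is a simplified, hypothetical illustration using a single shared task budget (closer to the original proposal than to per-test max_queue compatibility): tests are started while their combined task counts fit under max_queue, and when one finishes, its tasks are freed and the wait queue is backfilled. The tuple format and names are invented for the example.

```python
from collections import deque

def schedule_shared(tests, max_queue):
    """Greedy sketch: `tests` is a list of (name, ntasks) tuples.
    Returns the successive sets of concurrently running tests,
    assuming the oldest running test finishes first."""
    waiting = deque(tests)
    running, batches = [], []
    while waiting or running:
        # Start as many waiting tests as fit in the remaining budget.
        used = sum(n for _, n in running)
        while waiting and used + waiting[0][1] <= max_queue:
            test = waiting.popleft()
            running.append(test)
            used += test[1]
        if not running:
            raise ValueError("a single test exceeds max_queue")
        batches.append([name for name, _ in running])
        # Simulate the oldest running test completing, freeing its tasks.
        running.pop(0)
    return batches

print(schedule_shared([("a", 1), ("b", 2), ("c", 4), ("d", 12)], 14))
# → [['a', 'b', 'c'], ['b', 'c'], ['c'], ['d']]
```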


j-ogas commented Jan 17, 2024

One quick clarification: the hope is that max_queue limits the number of jobs active in the queue at any given time. If I need 2000 tests to run on this single node, no more than max_queue of them should be queued at once, until all 2000 tests complete.
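This bounded-active-jobs semantics can be illustrated with a counting semaphore: at most max_queue jobs run at once, and the rest block until a slot frees, continuing until every job has completed. A hypothetical sketch, not Pavilion code:

```python
import threading

def run_bounded(jobs, max_queue):
    """Run every callable in `jobs`, allowing at most `max_queue` of
    them to be active simultaneously. Returns (results, peak_active)."""
    slots = threading.Semaphore(max_queue)  # max_queue concurrency slots
    lock = threading.Lock()
    active, peak, results = [0], [0], []

    def worker(job):
        with slots:  # blocks while max_queue jobs are already active
            with lock:
                active[0] += 1
                peak[0] = max(peak[0], active[0])
            results.append(job())  # list.append is thread-safe in CPython
            with lock:
                active[0] -= 1

    threads = [threading.Thread(target=worker, args=(j,)) for j in jobs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, peak[0]

results, peak = run_bounded([(lambda i=i: i) for i in range(10)], 3)
print(sorted(results), peak <= 3)
```

All 10 jobs complete, but the semaphore guarantees no more than 3 were ever active at once.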

@Paul-Ferrell (Collaborator)

This is done.
