
Check restarting/handling of pending config when resuming a run #30

Open
Neeratyoy opened this issue Nov 13, 2023 · 3 comments · May be fixed by #129
Labels: bug (Something isn't working)

@Neeratyoy (Contributor)

For potential reproducibility of the observed issue:

  • Running Random Search for 20 (max_evaluations_total) evaluations distributed across 4 workers
  • Midway through the run, killed a worker and restarted it shortly afterwards
  • The overall run completed fine, but I noticed certain anomalies, described below:
  1. The process termination interrupted a config mid-evaluation, in this case config ID 16
  2. On restarting, the 4 workers proceeded without errors, but an extra config ID 21 was generated, while config ID 16 was not re-evaluated or completed and remains pending forever

Some more observations:

  • For max_evaluations_total=20 we should have config IDs 1-20, each with its own result.yaml
  • Only config_16 does not have a result.yaml, whereas config_21 does
  • If I now re-run a worker with max_evaluations_total=21, it satisfies the extra required evaluation by sampling a new config, config_22

Should a new worker re-evaluate pending configs as a priority?
Also, under this scenario the generated config IDs range over [1, n+1] when max_evaluations_total=n. (A small check for which configs are missing a result.yaml is sketched below.)
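
For reference, a minimal sketch of that check, assuming the layout described above (one config_<id> directory per evaluation under the run's results folder, with result.yaml written only on completion; the exact directory names are an assumption and may vary across NePS versions):

    # List config directories with no result.yaml, i.e. interrupted or pending evaluations.
    from pathlib import Path

    root = Path("results_example/results")  # hypothetical root_directory of the run

    pending = sorted(
        d.name for d in root.glob("config_*") if not (d / "result.yaml").exists()
    )
    print("Configs without result.yaml:", pending)  # e.g. ['config_16'] in the run above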

@Neeratyoy added the bug label on Nov 13, 2023
@karibbov (Contributor) commented Nov 21, 2023

This happens when the process is force-killed during the evaluation of a config, and is reproducible with a single process.

To reproduce:

  1. Choose an algorithm with very low overhead, e.g. Random Search
  2. Write a run_pipeline(...) function that takes a relatively long time compared to the algorithm overhead, e.g. time.sleep(10)
  3. Run neps.api.run. The arguments don't matter; this should reproduce regardless
  4. Watch the logs and terminate the process once the algorithm enters the evaluation phase, indicated by the log Start evaluating config .... Otherwise, refine steps 1 and 2 to increase your chance of terminating during an evaluation.
  5. If after termination there is a config with a missing result.yaml file, you have successfully interrupted an evaluation.
  6. Re-run the process to see the effect described.

Alternatively, you can skip steps 1-5 and manually delete a result.yaml file from any config folder to make NePS think there is a pending config that some mysterious other process is handling right now. (A minimal script for steps 1-3 is sketched below.)
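
A minimal sketch of that reproduction, assuming the neps.run entry point and the max_evaluations_total argument mentioned in this issue; the pipeline space definition, parameter class, and root_directory value are illustrative and may differ between NePS versions:

    # Sketch of the reproduction (steps 1-3): cheap Random Search, slow evaluation.
    import time

    import neps


    def run_pipeline(x: float) -> dict:
        # Step 2: the evaluation is long relative to the Random Search overhead,
        # leaving a wide window in which to kill the process (step 4).
        time.sleep(10)
        return {"loss": x**2}


    pipeline_space = {
        "x": neps.FloatParameter(lower=-1.0, upper=1.0),  # assumed parameter class
    }

    if __name__ == "__main__":
        neps.run(
            run_pipeline=run_pipeline,
            pipeline_space=pipeline_space,
            root_directory="results_example",  # hypothetical path
            max_evaluations_total=20,
        )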

@eddiebergman added this to the Runtime milestone on Jul 30, 2024
@eddiebergman (Contributor)

There have been some developments here:

  1. If there is a configuration (Trial) marked as pending, the next available worker will pick it up instead of sampling a new configuration.

    neps/neps/runtime.py

    Lines 169 to 187 in 08f30ae

    def _get_next_trial_from_state(self) -> Trial:
        nxt_trial = self.state.get_next_pending_trial()
        # If we have a trial, we will use it
        if nxt_trial is not None:
            logger.info(
                f"Worker '{self.worker_id}' got previosly sampled trial: {nxt_trial}"
            )
        # Otherwise sample a new one
        else:
            nxt_trial = self.state.sample_trial(
                worker_id=self.worker_id,
                optimizer=self.optimizer,
                _sample_hooks=self._pre_sample_hooks,
            )
            logger.info(f"Worker '{self.worker_id}' sampled a new trial: {nxt_trial}")
        return nxt_trial

  2. Killing a worker mid-evaluation is interesting. Right now, if a configuration evaluation errors and the worker can register that it crashed, the configuration will be marked as such and a result.yaml will be written for it, indicating that the configuration crashed. Such a configuration will not be re-attempted.

However, what should happen if you Ctrl+C a worker that is currently evaluating a configuration? This is not a fault of the configuration, so it should probably be re-evaluated. The current behaviour is that the configuration will remain in the EVALUATING state forever.

Fixing this is non-trivial, although there's some patchwork to make this less bad.

a) When a Ctrl+C happens, the worker immediately kills the configuration evaluation, and its one remaining task before exiting is to tell the NePSState that the config is no longer EVALUATING and to set it back to PENDING, so that it can be picked up again. There is no chance of saving a checkpoint here and resuming from this partial state; supporting anything like that would also add a lot of complication. Maybe in the future this can be revisited.

I'll implement the Ctrl+C handler and consider this issue done, as far as we can go for now.
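
For illustration, a rough sketch of the idea in (a), not the actual NePS implementation; state, trial, and the method names below are hypothetical stand-ins for whatever NePSState exposes:

    def evaluate_with_reset(state, trial, evaluate_fn):
        # Mark the trial as being worked on, then evaluate it.
        state.set_trial_state(trial, "EVALUATING")  # hypothetical API
        try:
            result = evaluate_fn(trial.config)
        except KeyboardInterrupt:
            # Ctrl+C is not the config's fault: hand it back as PENDING so
            # another worker can pick it up, then let the interrupt propagate.
            state.set_trial_state(trial, "PENDING")  # hypothetical API
            raise
        state.report_result(trial, result)  # hypothetical API
        return result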

@eddiebergman (Contributor) commented Aug 5, 2024

I did an implementation in #129 which should be robust to common occurrences: Ctrl+C as well as SLURM, which sends process signals. The only exception is SIGKILL, which really is just a hard kill and there's no way around it. Most default actions, however, do not send a SIGKILL.
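
For illustration only (not the actual #129 code), one common way to cover the SLURM case is to map SIGTERM onto the same path as Ctrl+C; SIGKILL cannot be caught by any handler:

    import signal


    def _raise_keyboard_interrupt(signum, frame):
        # Re-use the Ctrl+C cleanup path (reset the trial to PENDING) for
        # termination signals such as those sent on SLURM preemption.
        raise KeyboardInterrupt(f"Received signal {signum}")


    # SIGINT already raises KeyboardInterrupt by default; SIGKILL cannot be trapped.
    signal.signal(signal.SIGTERM, _raise_keyboard_interrupt)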
