
Check restarting/handling of pending config when resuming a run #30

Open
Neeratyoy opened this issue Nov 13, 2023 · 3 comments · May be fixed by #129
Labels: bug (Something isn't working)

@Neeratyoy (Contributor)

For potential reproducibility of the observed issue:

  • Running Random Search for 20 (max_evaluations_total) evaluations distributed across 4 workers
  • Midway through the run, killed a worker and restarted it shortly afterwards
  • The overall run completed fine, but I noticed certain anomalies, described below:
  1. The process termination interrupted a config mid-evaluation, in this case config ID 16
  2. On restarting, the 4 workers proceeded without errors, but an extra config ID 21 was generated, while config ID 16 was not re-evaluated or completed and remains pending forever

Some more observations:

  • For max_evaluations_total=20 we should have config IDs 1-20, each with its own result.yaml
  • Only config_16 does not have a result.yaml, whereas config_21 does
  • If I now re-run a worker with max_evaluations_total=21, it satisfies the extra required evaluation by sampling a new config, config_22

Should a new worker re-evaluate pending configs as a priority?
Also, under this scenario the generated config IDs range over [1, n+1] when max_evaluations_total=n. (A small check for which configs are missing a result.yaml is sketched below.)
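
For reference, a minimal sketch of that check, assuming the layout described above (one config_<id> directory per evaluation under the run's results folder, with result.yaml written only on completion; the exact directory names are an assumption and may vary across NePS versions):

    # List config directories with no result.yaml, i.e. interrupted or pending evaluations.
    from pathlib import Path

    root = Path("results_example/results")  # hypothetical root_directory of the run

    pending = sorted(
        d.name for d in root.glob("config_*") if not (d / "result.yaml").exists()
    )
    print("Configs without result.yaml:", pending)  # e.g. ['config_16'] in the run above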

@Neeratyoy added the bug label on Nov 13, 2023
@karibbov (Contributor) commented Nov 21, 2023

This happens when the process is force-killed during the evaluation of a config, and is reproducible with a single process.

To reproduce:

  1. Choose an algorithm with very low overhead, e.g. Random Search
  2. Write a run_pipeline(...) function that takes a relatively long time compared to the algorithm overhead, e.g. time.sleep(10)
  3. Run neps.api.run. The arguments don't matter; this should reproduce regardless
  4. Watch the logs and terminate the process once the algorithm enters the evaluation phase, indicated by the log Start evaluating config .... Otherwise, refine steps 1 and 2 to increase your chance of terminating during an evaluation.
  5. If after termination there is a config with a missing result.yaml file, you have successfully interrupted an evaluation.
  6. Re-run the process to see the effect described.

Alternatively, you can skip steps 1-5 and manually delete a result.yaml file from any config folder to make NePS think there is a pending config that some mysterious other process is handling right now. (A minimal script for steps 1-3 is sketched below.)
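
A minimal sketch of that reproduction, assuming the neps.run entry point and the max_evaluations_total argument mentioned in this issue; the pipeline space definition, parameter class, and root_directory value are illustrative and may differ between NePS versions:

    # Sketch of the reproduction (steps 1-3): cheap Random Search, slow evaluation.
    import time

    import neps


    def run_pipeline(x: float) -> dict:
        # Step 2: the evaluation is long relative to the Random Search overhead,
        # leaving a wide window in which to kill the process (step 4).
        time.sleep(10)
        return {"loss": x**2}


    pipeline_space = {
        "x": neps.FloatParameter(lower=-1.0, upper=1.0),  # assumed parameter class
    }

    if __name__ == "__main__":
        neps.run(
            run_pipeline=run_pipeline,
            pipeline_space=pipeline_space,
            root_directory="results_example",  # hypothetical path
            max_evaluations_total=20,
        )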

@eddiebergman added this to the Runtime milestone on Jul 30, 2024
@eddiebergman (Contributor)

There have been some developments here:

  1. If there is a configuration (Trial) marked as pending, the next available worker will pick it up instead of sampling a new configuration.

    neps/neps/runtime.py

    Lines 169 to 187 in 08f30ae

    def _get_next_trial_from_state(self) -> Trial:
        nxt_trial = self.state.get_next_pending_trial()
        # If we have a trial, we will use it
        if nxt_trial is not None:
            logger.info(
                f"Worker '{self.worker_id}' got previosly sampled trial: {nxt_trial}"
            )
        # Otherwise sample a new one
        else:
            nxt_trial = self.state.sample_trial(
                worker_id=self.worker_id,
                optimizer=self.optimizer,
                _sample_hooks=self._pre_sample_hooks,
            )
            logger.info(f"Worker '{self.worker_id}' sampled a new trial: {nxt_trial}")
        return nxt_trial

  2. Killing a worker mid-evaluation is interesting. Right now, if a configuration evaluation errors and the worker can register that it crashed, the configuration will be marked as such and a result.yaml will be written for it, indicating that the configuration crashed. Such a configuration will not be re-attempted.

However, what should happen if you Ctrl+C a worker that is currently evaluating a configuration? This is not a fault of the configuration, so it should probably be re-evaluated. The current behaviour is that the configuration will remain in the EVALUATING state forever.

Fixing this is non-trivial, although there's some patchwork to make this less bad.

a) When a Ctrl+C happens, the worker immediately kills the configuration evaluation, and its one remaining task before exiting is to tell the NePSState that the config is no longer EVALUATING and to set it back to PENDING, so that it can be picked up again. There is no chance of saving a checkpoint here and resuming from this partial state; supporting anything like that would also add a lot of complication. Maybe in the future this can be revisited.

I'll implement the Ctrl+C handler and consider this issue done, as far as we can go for now.
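
For illustration, a rough sketch of the idea in (a), not the actual NePS implementation; state, trial, and the method names below are hypothetical stand-ins for whatever NePSState exposes:

    def evaluate_with_reset(state, trial, evaluate_fn):
        # Mark the trial as being worked on, then evaluate it.
        state.set_trial_state(trial, "EVALUATING")  # hypothetical API
        try:
            result = evaluate_fn(trial.config)
        except KeyboardInterrupt:
            # Ctrl+C is not the config's fault: hand it back as PENDING so
            # another worker can pick it up, then let the interrupt propagate.
            state.set_trial_state(trial, "PENDING")  # hypothetical API
            raise
        state.report_result(trial, result)  # hypothetical API
        return result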

@eddiebergman (Contributor) commented Aug 5, 2024

I did an implementation in #129 which should be robust to common occurrences: Ctrl+C as well as SLURM, which sends process signals. The only exception is SIGKILL, which really is just a hard kill and there's no way around it. Most default actions, however, do not send a SIGKILL.
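
For illustration only (not the actual #129 code), one common way to cover the SLURM case is to map SIGTERM onto the same path as Ctrl+C; SIGKILL cannot be caught by any handler:

    import signal


    def _raise_keyboard_interrupt(signum, frame):
        # Re-use the Ctrl+C cleanup path (reset the trial to PENDING) for
        # termination signals such as those sent on SLURM preemption.
        raise KeyboardInterrupt(f"Received signal {signum}")


    # SIGINT already raises KeyboardInterrupt by default; SIGKILL cannot be trapped.
    signal.signal(signal.SIGTERM, _raise_keyboard_interrupt)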
