
Ignore 10% worst replicas in hyper loss #2014

Merged
merged 14 commits into master from hyper-selection on Jun 5, 2024
Conversation

APJansen
Collaborator

Basic implementation; not tested yet, and the percentage is not configurable from the runcard.

@APJansen
Collaborator Author

APJansen commented Apr 3, 2024

I have implemented what we discussed by removing passed as it was previously computed (except when there is no hyperopt; I left that case as is). Instead, passed is now set purely by whether the hyper loss exceeds the threshold: if more than 10% of the replicas have an infinite loss, the threshold is automatically violated.

I'm a bit confused now about whether that's what we want, since this check already exists of course. The difference is that the one I added does not include the penalties and sets the hyperopt status to failed, whereas the existing check just adds a penalty.

To make the tests pass I had to increase this hyper threshold, since a violation now fails the trial. Can you check that this doesn't mess up other tests, @Cmurilochem?

Also, it's running on the GPU again, without lhapdf and conda; see here for the setup and slurm scripts.
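
Roughly what I mean, as a minimal sketch (illustrative names and values only, not the actual code in this PR):

```python
import numpy as np

def trimmed_fold_loss(per_replica_losses, proportion=0.10):
    """Average the per-replica losses after discarding the worst `proportion`.

    If more than `proportion` of the replicas have an infinite loss, at least
    one infinity survives the trimming, so the result is infinite and any
    finite threshold check fails automatically.
    """
    losses = np.sort(np.asarray(per_replica_losses, dtype=float))
    n_keep = int(np.ceil(len(losses) * (1.0 - proportion)))
    return losses[:n_keep].mean()

hyper_threshold = 1000.0  # the (increased) threshold discussed below
per_replica_losses = [1.8, 2.1, 2.4, np.inf]  # 25% infinite > 10%
fold_loss = trimmed_fold_loss(per_replica_losses)
passed = fold_loss < hyper_threshold  # False: the trimmed average is still inf
```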

@APJansen APJansen marked this pull request as ready for review April 3, 2024 13:06
@Cmurilochem
Collaborator

> I have implemented what we discussed by removing the passed as it was created [...]

(quoting @APJansen's comment above)

@APJansen, I do not expect the other tests to break by increasing the hyper threshold, but they may take more time since you would in principle be fitting more folds than before. However, I am also a bit confused about the reason behind this.

If I understood correctly, in the past (not sure whether it is still the case) the passed/not passed status had to do with threshold_chi2, which was passed as an argument when instantiating the Stopping object (here). This is used in its monitor_chi2 method, which is in turn used in callback.py. But there is also hyper_threshold, which was compared against the final (statistic) hyper_loss of each fold (including penalties) and was responsible for skipping a hyperparameter combination. I see now that your passed is directly tied to self.hyper_threshold. I am quite sure that this is your intention, but even so this discussion might help you in some way. Please also have a look at monitor_chi2; I see that it tries to deal with some NaNs there.
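
To make the distinction explicit, here is a rough sketch of the two checks as I understand them (illustrative names only, not the actual implementation):

```python
import numpy as np

def epoch_passes(validation_chi2, threshold_chi2):
    """Per-epoch check, the role played by threshold_chi2 inside
    Stopping.monitor_chi2 (used from callback.py): an epoch only counts as
    good if the validation chi2 is finite and below the stopping threshold."""
    return np.isfinite(validation_chi2) and validation_chi2 < threshold_chi2

def fold_passes(fold_loss, hyper_threshold):
    """Per-fold check, the role played by hyper_threshold: once the fold has
    been fitted and reduced over replicas, decide whether the whole hyperopt
    trial is marked as failed/skipped."""
    return bool(np.isfinite(fold_loss) and fold_loss < hyper_threshold)
```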

Member

@scarlehoff scarlehoff left a comment


I don't understand why the threshold has to go from 2 to 1000? Surely this has to do with some 1/N that has not been considered (where N might be replicas, data points, folds, or a combination of the above)?
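
For example (purely illustrative numbers, not taken from the runcard):

```python
# If the reduced loss effectively sums over replicas (and penalties) instead
# of averaging, its scale grows with N even when each replica is fine.
per_replica_loss = 2.0
n_replicas = 100
summed_loss = per_replica_loss * n_replicas  # 200 before penalties, i.e. the
# kind of missing 1/N factor asked about above.
```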

n3fit/src/n3fit/hyper_optimization/rewards.py (review comments, outdated/resolved)
n3fit/src/n3fit/model_trainer.py (review comments, outdated/resolved)
@Cmurilochem
Collaborator

Hi @APJansen and @goord,
As I discussed with you before, I ran a hyperopt experiment and realized that most of the trials finish with {"status": "fail"} and {"loss": NaN}. For instance, out of 85 calculated trials only one seems to have a loss lower than 10. Just in case, I attach my partial trial.json file below:
tries.json

After looking at the renew_hyperopt.yml runcard, I noticed that I should have added average_best as my replica_statistic. I overlooked it before because I copy this runcard directly from the repo to my local directory before running the experiment on snellius.

  1. @APJansen: to avoid confusion in the future, could we add average_best to renew_hyperopt.yml?:

    replica_statistic: average

    I could do that if you say so.

  2. Turning back to the results of the experiments I ran (without realizing it) with average as the replica_statistic: I still cannot quite understand why so many trials are discarded. Our passed criterion seems to be much more restrictive than before. Looking at the code, I see

    passed = fold_loss < self.hyper_threshold

    and right after,

    if hyper_loss > self.hyper_threshold:

    Both fold_loss and hyper_loss appear to represent reduced-over-replicas hyper losses for a specific fold (see the sketch below).
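
To illustrate what I mean by the reduction over replicas, a sketch under my own assumptions about how average and average_best could differ (the real code in rewards.py may of course differ):

```python
import numpy as np

# Per-replica losses of one fold, with one badly behaved replica.
losses = np.array([1.8, 1.9, 2.0, 2.0, 2.1, 2.1, 2.2, 2.3, 2.4, 50.0])

average = losses.mean()  # ~6.9, pulled up by the outlier
# "average_best": drop the worst 10% of replicas before averaging
average_best = np.sort(losses)[: int(np.ceil(0.9 * len(losses)))].mean()  # ~2.1

# fold_loss  -> such a reduced value, compared to hyper_threshold to set passed
# hyper_loss -> the per-fold value that may also include penalties, compared
#               again to decide whether the trial is skipped
```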

@scarlehoff
Member

I'll rebase this on top of master to facilitate the review (it shouldn't make a big difference since the last rebase was not long ago)

@scarlehoff
Member

@Cmurilochem I've added a runcard option penalties_in_loss.

I'll add this to the docs. I've decided not to add the proportion to the runcard because it adds a bit of complexity (I need to add a custom parser in validphys) and thought that at this stage of the project it was better to skip it.

(However, since I've already written the code, if you would like to have access to the proportion or other arguments for the statistics, let me know and I'll push it.)
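
Roughly, what the new option is meant to do (a sketch of the behaviour, not the exact code):

```python
def fold_hyper_loss(fold_loss, penalties, penalties_in_loss=False):
    """With penalties_in_loss: true in the runcard, the penalties are added to
    the reduced per-fold loss before it enters the trial loss; otherwise they
    are only monitored."""
    if penalties_in_loss:
        return fold_loss + sum(penalties)
    return fold_loss
```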

@Cmurilochem
Collaborator

> @Cmurilochem I've added a runcard option penalties_in_loss. [...]

(quoting @scarlehoff's comment above)

Thanks @scarlehoff. It looks clear. Thanks for your help.

Member

@scarlehoff scarlehoff left a comment


If @Cmurilochem confirms this PR works for him as it is, I think it can be merged.

@Cmurilochem
Collaborator

Cmurilochem commented Jun 4, 2024

> If @Cmurilochem confirms this PR works for him as it is, I think it can be merged.

Hi @scarlehoff. Thanks for all your hard work. It looks great to me and it is working as expected. I just added a final commit in which I update our experiments' runcards, adding a non-default penalties_in_loss for reproducibility. Please let me know if you agree with that. For the moment I will take the lead and approve your changes.

edit: oh, I did not see that you had done so already; please merge it when you are ready.

@scarlehoff scarlehoff merged commit 7caa1ef into master Jun 5, 2024
6 checks passed
@scarlehoff scarlehoff deleted the hyper-selection branch June 5, 2024 06:03