
Make weight initialization reproducible #1923

Merged: 9 commits merged into master on Mar 8, 2024
Conversation

@APJansen (Collaborator) commented Jan 29, 2024

This is an attempt to address #1916, branching off of #1905 because it uses the MultiInitializer to initialize the preprocessing weights.

I have checked that when doing any of:

  • n3fit runcard.yaml 1 -r 3
  • n3fit runcard.yaml 2 -r 3
  • n3fit runcard.yaml 1 -r 2

the preprocessing weights, NN weights and PDF output right after creation are identical for replica number 2.

Even after rebasing on trvl-mask-layers, though, results do not remain identical.

I don't know where the difference is coming from, as the tr/vl masks and the invcovmats are also the same for replica 2.

@APJansen self-assigned this Jan 29, 2024
@scarlehoff (Member)

Rebase all these branches on current master. There was a bug in #1881 that I've corrected in #1922, and that would make the final fit different.

@APJansen (Collaborator, Author) commented Feb 7, 2024

The reason this isn't working yet is of course precisely that it needs both #1788 and #1905, for proper seeds of the train/val splits and of the weights, respectively. I will continue once those are merged.

@APJansen force-pushed the multi-dense-layer branch 2 times, most recently from f6529ab to 234efa0 on February 16, 2024
Base automatically changed from multi-dense-layer to master February 19, 2024 10:05
@APJansen force-pushed the reproducibility branch 2 times, most recently from 2b4168a to f3d79bf on February 22, 2024
@APJansen (Collaborator, Author) commented Mar 5, 2024

I've added a test; do you agree that this is what we want to test here? And do you have an idea why it's not finding the runcards? It works locally (if I remove the linux mark).

@scarlehoff (Member)

They should be in the regression folder, I think. Locally it works because, I guess, you are running pytest in the folder where the tests are.
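
(A common fix for tests that pass locally but fail in CI, sketched here with an assumed folder name: resolve the runcard relative to the test module rather than to whatever directory pytest happens to run from.)

import pathlib

# locate the runcard next to the test file instead of relying on the CWD;
# "regressions" is an assumed folder name, not necessarily the real layout
REGRESSION_FOLDER = pathlib.Path(__file__).with_name("regressions")
QUICKCARD = REGRESSION_FOLDER / "quickcard-parallel.yml"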

@scarlehoff (Member) left a comment

> I have checked that when doing any of:
>
>   • n3fit runcard.yaml 1 -r 3
>   • n3fit runcard.yaml 2 -r 3
>   • n3fit runcard.yaml 1 -r 2
>
> the preprocessing weights, NN weights and PDF output right after creation are identical for replica number 2.

I'd say the missing check is that n3fit runcard.yaml 2 is also identical.

However, to avoid the "replica 1 problem" I would do instead:

  • n3fit runcard.yaml 1 -r 3
  • n3fit runcard.yaml 2 -r 3
  • n3fit runcard.yaml 3

And check that replica 3 is the same.

With the test_fit itself... What about using save: weights.h5 and checking that the weights for replica 2 are the same? If epochs = 1 or epochs = 0 (not sure) they will be the initial ones...
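
(A sketch of what that check could look like, comparing two saved weight files with h5py; the paths are hypothetical and the save option is taken from the suggestion above, not verified here.)

import h5py
import numpy as np

# walk two weights.h5 files in parallel and require every dataset to match
def assert_same_weights(path_a, path_b):
    def walk(group_a, group_b):
        for key in group_a:
            if isinstance(group_a[key], h5py.Group):
                walk(group_a[key], group_b[key])
            else:
                np.testing.assert_allclose(group_a[key][()], group_b[key][()])
    with h5py.File(path_a, "r") as fa, h5py.File(path_b, "r") as fb:
        walk(fa, fb)

# hypothetical output folders of two runs of the same runcard
assert_same_weights("a/nnfit/replica_2/weights.h5", "b/nnfit/replica_2/weights.h5")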

Review threads on n3fit/src/n3fit/tests/quickcard-parallel.yml (outdated, resolved)
@APJansen (Collaborator, Author) commented Mar 5, 2024

What is the "replica 1 problem"? I suppose it's related to why you started using varying replica numbers in the regression tests, but I never knew the reason for that.

@scarlehoff (Member)

> What is the "replica 1 problem"?

It turns out that we were only testing the first replica, so after some of the multireplica stuff was merged it actually only worked for the first replica when running sequentially.

Also, when seeding, if we are missing some seed we might not notice it for replica 1 and only realise from replica 2 onwards.

@APJansen (Collaborator, Author) commented Mar 5, 2024

> With the test_fit itself... What about using save: weights.h5 and checking that the weights for replica 2 are the same? If epochs = 1 or epochs = 0 (not sure) they will be the initial ones...

This is tricky: running with 1 epoch will actually run 1 epoch, and running with 0 epochs generates a cascade of errors (first there's a check that stops you; turning that into a warning, the timer callback errors; fixing that, the stopping errors; etc.).

Apart from your newly suggested n3fit runcard.yaml 3, the other 3 should really be identical, as they follow the same branches. So checking on the results should be ok; in fact it passes for the sequential runs, but unfortunately not yet for the parallel one.

@APJansen (Collaborator, Author) commented Mar 5, 2024

I found the issue: it's the constraints on the preprocessing weights. Removing them makes all the weights identical to within ~1e-8; with the constraints, the trainable preprocessing coefficients are completely different.
I don't understand why, though, or how to solve it; this is just supposed to clip the weights to the specified range, right?

edit: I guess what is happening is that the constraint is applied across replicas, so one replica can influence the others. I'll try to rewrite it to be per replica (see the sketch below).
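
(A minimal sketch of a per-replica constraint using Keras's stock MinMaxNorm; the weight shape and axis choice are assumptions for illustration, not the actual n3fit code. Taking the norm over a singleton axis reduces it to the elementwise absolute value, so each replica's coefficients are clipped independently and the replicas stay decoupled.)

import numpy as np
import tensorflow as tf
from tensorflow.keras.constraints import MinMaxNorm

# assumed shape: (replicas, 1, flavours)
w = tf.Variable(np.random.uniform(-1.0, 2.0, size=(3, 1, 8)), dtype=tf.float32)

# the norm over the singleton axis 1 is just |w| elementwise, so every
# (replica, flavour) entry is constrained on its own and no replica can
# influence another through the constraint
constraint = MinMaxNorm(min_value=0.0, max_value=1.0, rate=1.0, axis=1)
w.assign(constraint(w))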

@APJansen (Collaborator, Author) commented Mar 5, 2024

The constraint was a simple fix, but the test is still not passing.

Checking the weight differences between my 3 ways of running (in a temporary script, below, for now), I find relative differences below 1e-6 after 1 epoch, with the exception of the biases, which are initialized to 0 and can have relative differences even above 0.1. I guess this is not surprising, given that they start at 0 and there are small numerical differences, though I had hoped there wouldn't be any.

script
import h5py
import numpy as np

testnr = 1  # pytest run number; adjust to the run being inspected
TEMPDIR = f"/private/var/folders/lt/xy7j0k1j4tdf_k8p_87tb6300000gn/T/pytest-of-aronjansen/pytest-{testnr}/test_multireplica_runs_quickca1"

weights = {}
for name in ['a', 'b', 'c']:
    weight_path = f"{TEMPDIR}/{name}/quickcard-parallel/nnfit/replica_2/weights.h5"
    weights[name] = h5py.File(weight_path, 'r')


# recursively pull every dataset out of the h5 file into nested dicts
def extract_all_weights(file):
    weights = {}
    for key in file.keys():
        if isinstance(file[key], h5py.Group):
            weights[key] = extract_all_weights(file[key])
        else:
            weights[key] = file[key][()]
    return weights


# compute the mean relative difference between two nested dicts of arrays
def diff(d1, d2):
    d = {}
    for key in d1:
        if isinstance(d1[key], dict):
            d[key] = diff(d1[key], d2[key])
        else:
            reldiff = np.mean(np.abs(
                (d1[key] - d2[key]) / (d1[key] + d2[key])
            ))
            d[key] = reldiff
            if reldiff > 1e-5:
                print(f"key: {key}, relative diff: {reldiff}, first: {d1[key]}, second: {d2[key]}")
            else:
                print(f"key: {key}, relative diff: {reldiff}")
    return d


a = extract_all_weights(weights['a'])
b = extract_all_weights(weights['b'])
c = extract_all_weights(weights['c'])

print("Comparing a and b:")
diff(a, b)
print("Comparing a and c:")
diff(a, c)
print("Comparing b and c:")
diff(b, c)

@scarlehoff (Member)

What if you set the learning rate to 0? (But anyway, having the weights, even without the biases, be the same at epoch 1 is a good enough test of the initialization being the same, which is the goal here.)
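
(A minimal sketch of that suggestion; the optimizer here is an assumption, not necessarily the runcard's. With a zero learning rate the full training loop runs while every weight, biases included, keeps its initial value.)

from tensorflow.keras.optimizers import Adam

# zero learning rate: gradients are still computed but the weights never move,
# so the weights saved after 1 epoch equal the initialization
optimizer = Adam(learning_rate=0.0)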

@scarlehoff (Member) left a comment

Not sure whether it makes sense, or is needed, to change those 0s to 1. It might be that they never enter Keras alone (or that you wanted 0 so that if the user uses 0 they indeed get the Keras behaviour).

Leaving the comment just to make sure that those 0s are intended.

Review threads on n3fit/src/n3fit/layers/preprocessing.py (outdated, resolved)
@APJansen (Collaborator, Author) commented Mar 7, 2024

The base seed is always added to the replica seeds, which never results in a 0. I'm also ok with changing it to 1, but it will change all the regressions of course.

@scarlehoff (Member)

> The base seed is always added to the replica seeds, which never results in a 0.

Then it is fine. I just wanted to make sure it was intended and not an oversight!

@APJansen (Collaborator, Author) commented Mar 7, 2024

Ok, yes: by passing the base seed separately, rather than taking it from the single-replica initializer, we now just override whatever random seed Keras chose (rather than adding the replica seed to it).
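
(For illustration, a minimal sketch of this seeding pattern; the class and argument names are assumptions, not the actual n3fit MultiInitializer API.)

import tensorflow as tf
from tensorflow.keras.initializers import GlorotUniform

class MultiInitializerSketch:
    def __init__(self, single_initializer, replica_seeds, base_seed):
        self.cls = type(single_initializer)
        self.config = single_initializer.get_config()
        # base_seed + replica_seed overrides whatever seed Keras chose for the
        # single-replica initializer, and is never 0 for positive base seeds
        self.seeds = [base_seed + s for s in replica_seeds]

    def __call__(self, shape, dtype=None):
        replicas = [
            self.cls.from_config({**self.config, "seed": seed})(shape, dtype)
            for seed in self.seeds
        ]
        return tf.stack(replicas, axis=0)  # leading replica axis

multi_init = MultiInitializerSketch(GlorotUniform(), replica_seeds=[1, 2, 3], base_seed=100)
weights = multi_init((4, 5))  # shape (3, 4, 5): one reproducible slice per replica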

@scarlehoff (Member) left a comment

thanks, lgtm

Review threads on n3fit/src/n3fit/layers/preprocessing.py and n3fit/src/n3fit/tests/test_fit.py (outdated, resolved)
@APJansen added the redo-regressions label (Recompute the regression data) on Mar 7, 2024
@scarlehoff (Member)

Also update the fitbot with the latest one once it finishes running.

@scarlehoff added the run-fit-bot label (Starts fit bot from a PR) on Mar 7, 2024
@APJansen (Collaborator, Author) commented Mar 7, 2024

I don't see the fitbot results?

@scarlehoff (Member)

It is still running, but on a previous commit. It seems it cannot run on the commit of another bot.

github-actions bot commented Mar 7, 2024

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

APJansen and others added 8 commits March 8, 2024 09:00

  • Remove option for replica seeds to be None in Preprocessing
  • Uniformize naming of layers in NN
  • Add test comparing 3 ways of running
  • darwin->linux
  • Simplify quickcards
  • increase tolerance
  • Change axis in weight constraint
  • Add test on constraint
  • Simplify constraint, tighten test
  • Test weights only
  • Revert "Simplify constraint, tighten test" (this reverts commit df8781f)
  • Clarify error message
  • Avoid duplicate checks
  • Change test cases
  • Avoid seed=0 issue with Keras 3 (Co-authored-by: Juan M. Cruz-Martinez <juacrumar@lairen.eu>)
@APJansen (Collaborator, Author) commented Mar 8, 2024

We're having a silly issue with one hyperopt test, only on python 3.11, where the test fails on the phi2 hyperopt loss. In the CI it's 0.0, locally it's 1e-10, and locally on master it's 1e7...
Practically, we can just make the test pass either by changing the assert from > 0 to >= 0.0 or by removing the option to have phi2 as a loss completely.
But I have no idea where this change is coming from. Running a 1-trial hyperopt with 200 epochs locally on master vs this branch, the differences in phi and everything else are order 1 (which is fine, as the preprocessing initialization changed), not 17 orders of magnitude.

@scarlehoff (Member)

And why is it only python 3.11 with the pip installation?

It might be that merging the mongodb stuff (which changed dependencies) introduced a dependency difference between conda and pip that in turn creates this discrepancy?

@APJansen (Collaborator, Author) commented Mar 8, 2024

No idea... that could be, but it doesn't explain that locally I also get very different results (with the same environment, on python 3.9). I also checked several seeds; the order of magnitude remains +7 on master and -10 here.

@scarlehoff (Member)

Is this the value of phi2? Does it start at the same point?
Might there be a 1/(almost 0) division somewhere, such that in one case it gets removed / set to 0 and in the other it goes through?

I agree that's very weird.

@APJansen (Collaborator, Author) commented Mar 8, 2024

I found the explanation. The seed was being set as an int, which then uses the same seed for all replicas. I didn't look into the details, but I assume this causes the phi2 statistic to be vanishingly small, which must happen on any version. Still, that's larger than 0, so it's ok. For some reason (maybe, as you mention Juan, the added packages causing some version differences) in python 3.11 it was exactly 0, thus failing the test; but the actual issue was that it was practically zero everywhere. I just changed the seed from an int to a list of 2 different ints, and locally it's now again of order +7.
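
(A toy illustration of that explanation, not the actual n3fit phi2 code: phi2 is taken here as the average chi2 over replicas minus the chi2 of the central prediction, so identical replicas collapse it to zero up to rounding.)

import numpy as np

# toy phi2: average chi2 over replicas minus chi2 of the central (mean) prediction
def phi2(replica_preds, data, invcov):
    central = replica_preds.mean(axis=0)
    chi2 = lambda p: (p - data) @ invcov @ (p - data)
    return np.mean([chi2(p) for p in replica_preds]) - chi2(central)

rng = np.random.default_rng(0)
data, invcov = rng.normal(size=5), np.eye(5)

distinct = rng.normal(size=(4, 5))               # different seed per replica
identical = np.tile(rng.normal(size=5), (4, 1))  # one seed shared by all replicas

print(phi2(distinct, data, invcov))   # O(1): replicas spread around the mean
print(phi2(identical, data, invcov))  # 0 up to floating point: no spread at all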

@APJansen merged commit 39ff111 into master on Mar 8, 2024 (7 checks passed)
@APJansen deleted the reproducibility branch on March 8, 2024