Refactoring model creation code #1734
Conversation
Commits:
- …ather than reused, in the sumrule)
- …tion grid and the integration grid, and outputs normalized pdf
- …isplay shapes in model summary and plot
- …actor for clarity and consistency)
The final two comparisons of this branch vs master, on the feature scaling and flavor basis runcards, are done:
Thanks @APJansen. While the feature scaling seems fine, the flavor basis looks a bit odd (better agreement than expected based on random sampling). Are you sure you are comparing the two different branches? Under "Code versions" in the report the versions are shown to be "inconsistent", so perhaps something went wrong?
Hm... I'm not sure... have you seen this before? I did have some trouble running vp-comparefit, eventually fixed by renaming and re-uploading the master branch run.
"inconsistent" in this caes just means that not all replicas were produced with the same commit. The large agreement between two fits I'm basing on the distance plots. They are the distance between two central values normalised wrt the uncertainty in std of the central values, for more details see appendix B of https://arxiv.org/pdf/1410.8849.pdf. Of course, if two fits are done independently (i.e. different seeds, though in this case I assume even the different model should have effectively the same impact) these distances should be of order 1. While this is the case for feature scaling, for the flavor basis fit the agreement looks better than expected if the two fits consisted of independent samples. Perhaps it's my assumption that the different model is similar to different seeds that's wrong, but then we'd need to understand why we observe this agreement for flavor basis but not feature scaling. |
Hmm, I see it's also inconsistent in the previous report on the flavor basis, where I compared it to something else. Perhaps I was developing while the jobs were in the queue. I guess I'll have to rerun it.
If you are rerunning, make sure you use the latest commit from this branch. I think this is basically done now, so it would be great to link here the reports for what would basically be the last commit before merging.
Here's the new report: https://vp.nnpdf.science/1pUg_vIYTkGmuHVMVSjQkw==/
Perfect agreement also for the flavour basis :D Great! If @RoyStegeman is also happy with the tests I think this is basically done?
I'm not really... Somehow the flavor basis seems to be exactly the same replica-by-replica, but not feature scaling, which I don't understand. Either I'd expect the model to be exactly the same given the same seed before and after the changes in this PR, or I'd expect them to differ (which I would've thought more likely). Unless I'm missing something, these results hint at an inconsistency.
I did a quick sanity check, running both the flavor_basis and feature_scaling runcards that @RoyStegeman posted here locally on both branches, and just comparing the validation chi2 after 100 epochs. Indeed they are different for the feature_scaling runcard, already in the first digit, whereas for flavor_basis they are the same up to the 5th decimal. I don't understand why. I tried copying either the model architecture and seeds, or the basis, from flavor_basis into feature_scaling and running that on both branches, but they're still not identical. Conversely, starting with the flavor_basis runcard and copying the model architecture, seeds and basis from feature_scaling, the results still differ. I'll try some more comparisons.
I'm still confused. Doing both of (1) turning off feature scaling and (2) changing the basis to the flavor basis makes the difference between the two branches very small (~5 decimals in the validation chi2 after 100 epochs), but doing either separately doesn't.
I'm very confused. At some point you had a fit, this one: standard fit (master vs this branch): https://vp.nnpdf.science/vEkkEc1jQIiHACiGX5ilTA==/, in the evolution basis, and that's exactly the same. So in principle, doing 1 but not 2 should also make the difference very small. The three comparisons that we want are:
What are the comparisons we actually have?
Right, so we have:
So I'm also confused. I just checked the standard runcard, both with the latest commits and with the commits in that report, and the difference in the 100-epoch validation chi2 is not small in either case.
Well, rather than merely small, the reports make it look like the results are exactly equal. I don't understand how you can get the same results if the differences are large after 100 epochs.
I just did the same check, the standard runcard after 100 epochs, but on Snellius rather than on my mac, and now the results are identical.
And what if you do this check on Snellius with the feature scaling runcard?
Yes, that was it: feature scaling has a slight difference, but just turning that off and leaving everything else the same, the results become identical.
Do you know in which part of the code this difference originates? I thought everything would be deterministic, but maybe I'm forgetting something.
I'm not sure. To be honest, I thought that they were always slightly different, i.e. that the initialization wasn't the same, because I had always tested this locally. (No idea why that would matter either; different TensorFlow versions probably, but why...)
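As an aside on the determinism question (not something from this thread): TensorFlow does provide hooks to make runs bit-reproducible on a single machine. A minimal sketch, assuming TF >= 2.8:

```python
import tensorflow as tf

# Seed Python's `random` module, NumPy and TensorFlow in a single call.
tf.keras.utils.set_random_seed(42)

# Force deterministic GPU/CPU kernels (can slow training down noticeably).
tf.config.experimental.enable_op_determinism()
```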
I think I found the reason: I changed the scaler to include the scaling from [0, 1] to [-1, 1]; originally this was done with a Lambda layer inside the model. This causes a difference of ~10^-8 in the scaled x using this runcard. That difference is then amplified, I assume, by the training. So this explains why it only happens with feature scaling, and I think it's harmless.
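To illustrate the kind of mismatch being described (a hypothetical reconstruction, not the actual code in this PR): applying the [0, 1] -> [-1, 1] map in float64 inside the scaler, versus in a float32 Lambda layer inside the model, can disagree at roughly the 10^-8 level, since that is below float32 resolution:

```python
import numpy as np
import tensorflow as tf

x = np.random.default_rng(0).random(1000)  # inputs already scaled to [0, 1]

# Old approach (hypothetical): rescale inside the model, in float32.
lam = tf.keras.layers.Lambda(lambda t: 2.0 * t - 1.0)
inside_model = lam(tf.constant(x, dtype=tf.float32)).numpy()

# New approach (hypothetical): rescale in the scaler, in float64, cast afterwards.
inside_scaler = (2.0 * x - 1.0).astype(np.float32)

# Maximum absolute difference is typically ~1e-8, below float32 epsilon (~1.2e-7).
print(np.max(np.abs(inside_model - inside_scaler)))
```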
That indeed sounds like it's harmless. Do you understand what causes the difference of ~10^-8? I'd expect it wouldn't matter if we shift by a constant.
Do you mean a difference of ~10^-8 in the scaled x?
To be precise, a maximum absolute difference of ~10^-8.
Then that's perfectly fine.
Great, well done for finding the difference!
Having checked that the results of this branch and master are numerically the same, I'd say this can be merged. I believe black and isort have both been run on the changed files?
Great, thanks! Indeed I've run black and isort.
Aim
I am refactoring the code that creates the PDF model (and perhaps the full one as well). The main objective is to make the code and the resulting architecture more readable, without changing anything in the behavior.
If anything, I expect the changes to make it slightly faster and smaller in memory, but either way the effect will likely be negligible.
It is still a work in progress, but here is a summary of the changes I made so far, roughly from big to small:
- msr_impose: previously this was a function creating a function that takes a pdf function and returns a normalized pdf function. Now it is a TensorFlow model that takes as input the unnormalized pdf on the output grid and on the integration grid, as well as the integration grid itself, and outputs the normalized pdf. This makes the code and the resulting model plots much easier to read.

With these changes, it's possible to create much more understandable plots of the architecture using tf.keras.utils.plot_model. These are from the basic runcard:
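To make the described structure concrete, here is a minimal sketch (my own illustration, not the actual n3fit implementation; the shapes, names and trapezoid integration are invented for the example) of a normalization model in this style, plus the plot_model call used for the architecture plots:

```python
import tensorflow as tf

# Inputs: the unnormalized pdf on the output grid and on the integration grid,
# plus the integration grid itself (shapes are purely illustrative).
pdf_grid = tf.keras.Input(shape=(50, 8), name="pdf_on_grid")
pdf_integ = tf.keras.Input(shape=(200, 8), name="pdf_on_integration_grid")
x_integ = tf.keras.Input(shape=(200, 1), name="integration_grid")

# Crude trapezoid-rule integral of the pdf over the integration grid.
dx = x_integ[:, 1:, :] - x_integ[:, :-1, :]
avg = 0.5 * (pdf_integ[:, 1:, :] + pdf_integ[:, :-1, :])
integral = tf.reduce_sum(dx * avg, axis=1, keepdims=True)  # (batch, 1, 8)

# Divide by the integral so the normalization holds by construction.
normalized = tf.keras.layers.Lambda(lambda t: t[0] / t[1], name="impose_msr")(
    [pdf_grid, integral]
)

msr_model = tf.keras.Model(
    inputs=[pdf_grid, pdf_integ, x_integ], outputs=normalized, name="msr"
)

# Render the architecture; requires pydot and graphviz to be installed.
tf.keras.utils.plot_model(msr_model, to_file="msr.png", show_shapes=True)
```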