Refactoring model creation code #1734

Merged: 74 commits merged into master from model_refactor on Jul 14, 2023

Conversation

@APJansen (Collaborator) commented May 17, 2023

Aim

I am refactoring the code that creates the PDF model (and perhaps the full model as well). The main objective is to make the code and the resulting architecture more readable, without changing the behavior at all.
If anything, I expect the changes to make it slightly faster and smaller in memory, but either way the difference will likely be negligible.

It is still a work in progress, but here is a summary of the changes I made so far, roughly from big to small:

  • I changed the structure of the sum rule code. Previously there was a function msr_impose that created a function which takes a pdf function and returns a normalized pdf function. Now it is a TensorFlow model that takes as inputs the unnormalized pdf on the grid and on the integration grid, plus the integration grid itself, and outputs the normalized pdf (a rough sketch of the idea is shown after this list). This makes the code and the resulting model plots much easier to read.
  • I grouped all the actual neural layers into a neural network model, for the same reason.
  • All of the smaller steps are now in named lambda layers, shared wherever possible between replicas and grids.
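As a rough illustration of the new structure, here is a heavily simplified sketch with made-up names, shapes and a naive integral (not the actual n3fit implementation):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

n_flav = 8  # illustrative number of flavours

# Unnormalized pdf on the output grid and on the integration grid,
# plus the integration grid itself (shapes are illustrative).
pdf_xgrid = layers.Input(shape=(None, n_flav), name="pdf_xgrid")
pdf_integration_grid = layers.Input(shape=(None, n_flav), name="pdf_integration_grid")
xgrid_integration = layers.Input(shape=(None, 1), name="xgrid_integration")

def normalize(tensors):
    """Naive trapezoid-like integral of the unnormalized pdf, divided out."""
    pdf, pdf_int, x_int = tensors
    dx = x_int[:, 1:, :] - x_int[:, :-1, :]
    midpoints = 0.5 * (pdf_int[:, 1:, :] + pdf_int[:, :-1, :])
    integral = tf.reduce_sum(midpoints * dx, axis=1, keepdims=True)
    return pdf / integral

normalized_pdf = layers.Lambda(normalize, name="impose_sum_rules")(
    [pdf_xgrid, pdf_integration_grid, xgrid_integration]
)

msr_model = Model(
    inputs=[pdf_xgrid, pdf_integration_grid, xgrid_integration],
    outputs=normalized_pdf,
    name="msr_model",
)
```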

With these changes, it's possible to create much more understandable plots of the architecture using tf.keras.utils.plot_model. These are from the basic runcard:

  • A single replica PDF (plot: pdf_model)
  • The NN block (plot: NN_model)
  • The MSR block (plot: msr_model)
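For reference, the plots were produced with tf.keras.utils.plot_model; a minimal example with a toy stand-in model (in practice the model comes out of the generator code):

```python
import tensorflow as tf
from tensorflow.keras.utils import plot_model

# Toy stand-in, just to show the call; the real pdf_model is built by n3fit.
x = tf.keras.Input(shape=(None, 1), name="x")
out = tf.keras.layers.Dense(8, name="NN")(x)
pdf_model = tf.keras.Model(x, out, name="pdf_model")

# Requires the pydot and graphviz packages to be installed.
plot_model(pdf_model, to_file="pdf_model.png", show_shapes=True)
```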

@APJansen (Collaborator, Author):

I simplified the preprocessing of the input somewhat, getting rid of layers that do nothing and collecting the options closer together.

Here's my understanding of what happens, just checking, with some questions interspersed:

inp effectively controls whether to add logs, taking x to (x, log x).
scaler is a lambda function (if not None) that takes x to (scaler(x), x).
If scaler is not None, inp is set to 1, so there are effectively 3 options (a small sketch follows this list):

  • Do nothing, x -> x. I assume this is never used?
  • Add logs, x -> (x, log x)
  • Scale, x -> (scaler(x), x)
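A minimal sketch of the three cases, written as plain numpy functions (the names are mine, not the ones in the code):

```python
import numpy as np

def passthrough(x):
    # Option 1: do nothing, x -> x (presumably never used in practice)
    return x

def add_logs(x):
    # Option 2: x -> (x, log x), both columns go into the neural network
    return np.concatenate([x, np.log(x)], axis=-1)

def feature_scale(x, scaler):
    # Option 3: x -> (scaler(x), x); the scaled copy goes into the NN,
    # the original x is kept for the preprocessing prefactor
    return np.concatenate([scaler(x), x], axis=-1)

# Example on a small column of x values, with a dummy scaler
x = np.array([[1e-5], [1e-3], [0.1], [1.0]])
print(add_logs(x))
print(feature_scale(x, scaler=np.sqrt))  # np.sqrt only as an illustration
```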

In pdfNN_layer_generator I clarified this by introducing add_logs and use_feature_scaling variables, but I've kept the interface the same.

Wouldn't it be better to just remove the inp parameter from pdfNN_layer_generator and base the behavior on scaler: if scaler is None, add logs, otherwise don't?

I assume the scaling happens before the model because it's not possible to use scipy functions inside a tensorflow layer?

After the scaling, inside the PDF model, the grid is mapped from the interval [0, 1] to [-1, 1]. Why not include this in the scaler?

Rather than concatenating scaler(x) to x and then splitting it again, wouldn't it be simpler to have separate inputs (see the sketch below)?
You'd have 5 inputs: the scaled and the original grid, the scaled and the original integration grid, and the scaled x=1 grid (not the original, since the original is only used for the preprocessing prefactor, while x=1 only goes into the NN).
For the case of adding logs instead, you could have 4 inputs to keep it as similar as possible: (x, log(x)) for both grids as input to the NN, and the original x for both grids as input to the preprocessing prefactor.
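A sketch of what the suggested input layout could look like for the feature-scaling case (names and shapes are purely illustrative):

```python
from tensorflow.keras import layers

# Five separate inputs instead of concatenating scaler(x) to x and splitting:
x_scaled = layers.Input(shape=(None, 1), name="x_scaled")                   # -> NN
x_original = layers.Input(shape=(None, 1), name="x_original")               # -> prefactor
x_integ_scaled = layers.Input(shape=(None, 1), name="x_integ_scaled")       # -> NN
x_integ_original = layers.Input(shape=(None, 1), name="x_integ_original")   # -> prefactor
x_one_scaled = layers.Input(shape=(1, 1), name="x_one_scaled")              # -> NN only (subtract_one)
```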

Unrelated: the FkRotation docstring is out of date. It says the input is 8-dimensional but it's actually 9-dimensional, as you can see clearly from the model plot. I don't know enough to fix this but it'd be good if someone could.

Do you want me to add these plots as an option in the logging, and if so can you point me to where and how to best do this?

Below are the model plots as they are now.

Using logs

The only preprocessing that happens here is that the log of x is concatenated to it before it goes into the NN.

(plot: pdf_model_log)

Using feature scaling

Here the input is already of the form (scaler(x), x). The preprocessing first splits this back into x_original = x and x_scaled = scaler(x). The latter then undergoes a linear transformation mapping [0, 1] to [-1, 1] (process_input). In addition, feature scaling turns on the subtract_one option, where instead of NN(x_scaled) the model computes NN(x_scaled) - NN(x=1_scaled).
(plot: pdf_model_scaled)
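For clarity, a minimal sketch of the subtract_one behaviour (layer names and sizes are illustrative, not the actual n3fit ones):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# The same network is applied to the scaled grid and to the scaled x=1 point,
# and the latter is subtracted, so the output vanishes at x=1 by construction.
nn = tf.keras.Sequential(
    [layers.Dense(25, activation="tanh"), layers.Dense(8)], name="NN"
)

x_scaled = layers.Input(shape=(None, 1), name="x_scaled")
x_one_scaled = layers.Input(shape=(1, 1), name="x_one_scaled")

subtracted = layers.Lambda(lambda t: t[0] - t[1], name="subtract_one")(
    [nn(x_scaled), nn(x_one_scaled)]
)

nn_with_subtraction = Model([x_scaled, x_one_scaled], subtracted, name="NN_subtract_one")
```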

Not using either

I didn't test this as I wasn't sure how to turn both off or if that should even be possible.

@APJansen (Collaborator, Author):

The final 2 comparisons of this branch vs master on the feature scaling and flavor basis runcards are done:
  • feature scaling
  • flavor basis

@RoyStegeman (Member):

Thanks @APJansen. While the feature scaling seems fine, the flavor basis looks a bit odd (better agreement than expected based on random sampling). Are you sure you are comparing the two different branches? Under "Code versions" in the report the versions are shown to be "inconsistent" so perhaps something went wrong?

@APJansen (Collaborator, Author):

Hm... I'm not sure... have you seen this before? I did have some trouble running vp-comparefit, which I eventually fixed by renaming and reuploading the master branch run.

@RoyStegeman (Member) commented Jul 13, 2023

"inconsistent" in this caes just means that not all replicas were produced with the same commit.

The large agreement between the two fits is something I'm basing on the distance plots. They show the distance between the two central values, normalised with respect to the uncertainty (the standard deviation) of the central values; for more details see appendix B of https://arxiv.org/pdf/1410.8849.pdf. Of course, if two fits are done independently (i.e. with different seeds, though in this case I assume even the different model should have effectively the same impact), these distances should be of order 1. While this is the case for feature scaling, for the flavor basis fit the agreement looks better than expected if the two fits consisted of independent samples.
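If I remember the appendix correctly, the distance is schematically

$$
d(x,Q) = \sqrt{\frac{\left[\langle q^{(1)}(x,Q)\rangle - \langle q^{(2)}(x,Q)\rangle\right]^{2}}{\sigma^{2}\!\left[q^{(1)}\right]/N_{1} + \sigma^{2}\!\left[q^{(2)}\right]/N_{2}}},
$$

i.e. the difference of the central values normalised by the variance of the means, so two independent but statistically equivalent fits give distances of order 1, while distances close to 0 mean the central values agree much more closely than the replica spread would allow by chance.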

Perhaps it's my assumption that the different model is similar to different seeds that's wrong, but then we'd need to understand why we observe this agreement for flavor basis but not feature scaling.

@APJansen (Collaborator, Author):

Hmm, I see it's also inconsistent in the previous report on the flavor basis where I compared it to something else. Perhaps I was developing while the jobs were in the queue. I guess I'll have to rerun it.

@scarlehoff (Member):

If you are rerunning, make sure you use the latest commit from this branch. I think this is basically done now, so it would be great to link here the reports made with what will essentially be the last commit before merging.

@APJansen (Collaborator, Author):

Here's the new report: https://vp.nnpdf.science/1pUg_vIYTkGmuHVMVSjQkw==/
The code version inconsistency is gone, but the results are still very similar (identical, even?).
I have no idea how: I'm using the same scripts as before, printing the environment and the last commit, and they show the last commits of either branch and the corresponding conda environments.

@scarlehoff (Member):

Perfect agreement also for the flavour basis :D

Great! If @RoyStegeman is also happy with the tests I think this is basically done?

@RoyStegeman (Member):

> If @RoyStegeman is also happy with the tests

I'm not really... Somehow the flavor basis seems to be exactly the same replica-by-replica, but feature scaling isn't, which I don't understand. Either I'd expect the model to be exactly the same given the same seed before and after the changes in this PR, or I'd expect them to differ (which I would have thought more likely). Unless I'm missing something, these results hint at an inconsistency.

@APJansen (Collaborator, Author):

I did a quick sanity check locally, running both the flavor_basis and feature_scaling runcards that @RoyStegeman posted here on both branches, and just comparing the validation chi2 after 100 epochs. Indeed they are different for the feature_scaling runcard, already in the first digit, whereas for flavor_basis they are the same up to the 5th decimal.

I don't understand why. I tried copying either the model architecture and seeds or the basis from flavor_basis into feature_scaling and running that on both branches, but the results are still not identical. Conversely, starting with the flavor_basis runcard and copying the model architecture, seeds and basis from feature_scaling, the results still differ.

I'll try some more comparisons.

@APJansen (Collaborator, Author):

I'm still confused. Doing both of 1. turning off feature scaling and 2. changing the basis to the flavor basis makes the difference between the two branches very small (~5 decimals in the validation chi2 after 100 epochs), but doing either separately doesn't.
But the scaled grids (i.e. the output of _xgrid_generation) are identical between branches.
Also, rather than changing the basis to the flavor basis, just setting the smallx limits to [1, 1] as they are in that runcard is not sufficient (the agreement only reaches two decimals).

@scarlehoff (Member):

I'm very confused. At some point you had a fit which is this one:

standard fit (master vs this branch): https://vp.nnpdf.science/vEkkEc1jQIiHACiGX5ilTA==/

in the evolution basis, and that's exactly the same.

So in principle, doing 1 but not 2 should make the difference also very small.

The three comparisons that we want are:

  • standard fit (which unless I'm mistaken is this one: https://vp.nnpdf.science/vEkkEc1jQIiHACiGX5ilTA==/):
    this is with the evolution basis and the (x, log x) input

  • feature scaling: this can be in the evolution basis, no problem with that

  • flavour basis (this can be without feature scaling)


What are the comparisons we actually have?

@APJansen (Collaborator, Author):

Right so we have:

| name | report | runcard | report diff | 100 epoch diff |
| --- | --- | --- | --- | --- |
| feature scaling | report | feature_scaling.txt | small | large |
| flavor basis | report | flavor_basis.txt | very small | very small |
| standard | report | NNPDF40_nnlo_as_01180_1000.yml | very small | large |

So I'm also confused, I just checked the standard runcard, both with the latest commits and the commits in that report, and the difference in the 100 epoch validation chi2 is not small in either case.

@scarlehoff (Member):

Well, rather than just small, the reports make it look like the results are exactly equal.

I don't understand how you can get the same results if the differences are large after 100 epochs.

@APJansen (Collaborator, Author):

I just did the same check (the standard runcard after 100 epochs), but on Snellius rather than on my Mac, and now the results are identical.

@RoyStegeman (Member):

And what if you do this check on snellius with the feature scaling runcard?

@APJansen (Collaborator, Author):

Yes, that was it: with feature scaling there is a slight difference, but just turning it off and leaving everything else the same makes the results identical.

@RoyStegeman (Member):

Do you know in which part of the code this difference originates? I thought everything would be deterministic, but maybe I'm forgetting something.

@APJansen (Collaborator, Author):

I'm not sure. To be honest, I thought that they were always slightly different, i.e. that the initialization wasn't the same, because I always tested this locally. (No idea why that would matter either; different tensorflow versions probably, but why...)
Feature scaling certainly shouldn't introduce anything non-deterministic. The difference is small, 79.8 vs 78.1 (locally it could even be a factor of 2), but still surprising. Feature scaling also toggles the subtraction of NN(x=1), maybe there's something there.

@APJansen (Collaborator, Author):

I think I found the reason: I changed the scaler to include the scaling from [0, 1] to [-1, 1], which was originally done with a lambda layer inside the model. This causes a difference of ~10^-8 in the scaled x using this runcard, and that difference is presumably amplified by training. So this explains why it only happens with feature scaling, and I think it's harmless.
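A sketch of how a difference of that size can arise (my guess at the mechanism, not the actual n3fit code): the same affine map applied in float64 numpy outside the model versus in a float32 Lambda layer inside it.

```python
import numpy as np
import tensorflow as tf

# Random stand-in for the scaler output: values in [0, 1].
x_scaled = np.random.default_rng(0).uniform(0.0, 1.0, size=(1000, 1))

# Outside the model: plain numpy in float64.
outside = 2 * x_scaled - 1

# Inside the model: a Lambda layer operating in the default float32.
inside = tf.keras.layers.Lambda(lambda t: 2 * t - 1)(
    tf.constant(x_scaled, dtype=tf.float32)
).numpy()

diff = inside - outside
print("max abs diff :", np.abs(diff).max())   # a few times 1e-8: float32 rounding
print("mean abs diff:", np.abs(diff).mean())
print("mean diff    :", diff.mean())
```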

@RoyStegeman (Member):

That indeed sounds like it's harmless. Do you understand what causes the difference of ~10^-8? I'd expect it wouldn't matter if we shift by 2*x-1 outside the model or inside a model layer.

@scarlehoff (Member):

> ~10^-8 in the scaled x using this runcard.

Do you mean a difference of 10^-8 (which is basically entering the numerical precision zone) or a difference around x=1e-8?

@APJansen (Collaborator, Author):

To be precise, a maximum absolute difference of 6e-8, mean absolute difference of 2e-8 and mean (signed) difference of 1e-9.
In one case it's done inside a keras layer, and in the other by numpy. I would guess that they just have slightly different numerical errors.

@scarlehoff (Member):

Then that's perfectly fine.

@RoyStegeman (Member):

Great, well done for finding the difference.

@RoyStegeman (Member) left a review comment:

Having checked that the results of this branch and master are numerically the same, I'd say this can be merged. I believe black and isort have both been run on the changed files?

@APJansen (Collaborator, Author):

Great, thanks! Indeed I've run black and isort.

@APJansen merged commit 6ddab6e into master on Jul 14, 2023
@APJansen deleted the model_refactor branch on July 14, 2023 at 14:28