Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Test implementation of new commondata format #1500

Closed
wants to merge 13 commits into from

Conversation

enocera
Copy link
Contributor

@enocera enocera commented Jan 19, 2022

This PR addresses #1436.
The proposed layout is as follows. For each data set, there is a folder called DATASET_NAME. Each folder contains:

  • a metadata.yaml file;
  • a filter.py file.

The filter.py script reads the metadata.yaml file and generates no less than three .json files:

  • data.json, for the central values of the data;
  • kinematics.json for the kinematics of the data;
  • uncertainties.json for the uncertainties of the data.

There can be as many variants as needed for each of these three files. Variants are specified in the metdata.yaml.

TO DO

  • include the missing filter.py
  • have a look at the layout and at its prototypical realisation;
  • write a validphys reader that takes as input the .json files in lieu of the 'DATAandSYSTYPE` files. (Add vp parser for new commondata format #1511)

@scarlehoff
Copy link
Member

One thing missing from the list (I only realised now that I started with the parser) would be an automatic installation.

I guess to first approximation these should be part of the validphys package, right? Since the fit needs validphys anyway.

(it's not important for the parser, I can hardcode the folder for now)

@scarlehoff
Copy link
Member

@enocera
One question, is it the idea that the variants are now selected in a way similar to how we select the kfactors? For instance one can have as input:

dataset_input:
  - dataset: "NMCPD"
  - variant: "dw"

and if no variant is selected it defauls to the standard one.

@enocera
Copy link
Contributor Author

enocera commented Feb 17, 2022

@scarlehoff Do you mean in a vp runcard?

@scarlehoff
Copy link
Member

Yes. Instead of having to write NMCPD_dw

@enocera
Copy link
Contributor Author

enocera commented Feb 17, 2022

Then yes, I think that we want to have something similar in spirit to, but more general than, the current sys key. cc @Zaharid

@enocera
Copy link
Contributor Author

enocera commented Feb 17, 2022

Or, as you say, the cfac key.

@scarlehoff
Copy link
Member

Great, I like this idea. I'll implement it in this spirit. The idea is then that "sys" will disappear?

@enocera
Copy link
Contributor Author

enocera commented Feb 17, 2022

Yes, definitely. The problem with sys is that it only controls the way in which systematic uncertainties are correlated or not. The variant is something more general (and cleaner) which is specified by a set of data.json and uncertainty.json files. This means that you can have different central values and different uncertainties (correlated in whatever way) for each dataset (without the need of renaming the dataset). In my opinion, this is useful to have nuclear variants, but also, e.g., EW variants (and of course also more manageable debugged variants).

@scarlehoff
Copy link
Member

Do we want to have a CommonDataSpec and a CommonData as two separate classes or can we unify them as CommonDataSpec that will have all the attributes of both? This one is for @Zaharid I guess

@scarlehoff
Copy link
Member

And for @enocera, this is a perfect and total substitution, right? So I can just scrap all the "old" commondata stuff as I find it in the code? No need to keep legacy wrappers like I have to do for the fktables.

@enocera
Copy link
Contributor Author

enocera commented Feb 17, 2022

And for @enocera, this is a perfect and total substitution, right? So I can just scrap all the "old" commondata stuff as I find it in the code? No need to keep legacy wrappers like I have to do for the fktables.

I'd answer yes because the idea (my idea) is to rework all the datasets in the new format.

@@ -0,0 +1 @@
{"kinematics_avg":[{"x":[0.0015,0.0015,0.0015,0.0015,0.0015,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.005,0.005,0.005,0.005,0.005,0.005,0.005,0.005,0.005,0.008,0.008,0.008,0.008,0.008,0.008,0.008,0.008,0.008,0.008,0.008,0.008,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0175,0.0175,0.0175,0.0175,0.0175,0.0175,0.0175,0.0175,0.0175,0.0175,0.0175,0.0175,0.0175,0.0175,0.025,0.025,0.025,0.025,0.025,0.025,0.025,0.025,0.025,0.025,0.025,0.025,0.025,0.025,0.025,0.035,0.035,0.035,0.035,0.035,0.035,0.035,0.035,0.035,0.035,0.035,0.035,0.035,0.035,0.035,0.035,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.07,0.07,0.07,0.07,0.07,0.07,0.07,0.07,0.07,0.07,0.07,0.07,0.07,0.07,0.07,0.07,0.09,0.09,0.09,0.09,0.09,0.09,0.09,0.09,0.09,0.09,0.09,0.09,0.09,0.09,0.09,0.09,0.11,0.11,0.11,0.11,0.11,0.11,0.11,0.11,0.11,0.11,0.11,0.11,0.11,0.11,0.11,0.11,0.14,0.14,0.14,0.14,0.14,0.14,0.14,0.14,0.14,0.14,0.14,0.14,0.14,0.14,0.14,0.18,0.18,0.18,0.18,0.18,0.18,0.18,0.18,0.18,0.18,0.18,0.18,0.18,0.18,0.18,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.275,0.275,0.275,0.275,0.275,0.275,0.275,0.275,0.275,0.275,0.275,0.275,0.275,0.275,0.35,0.35,0.35,0.35,0.35,0.35,0.35,0.35,0.35,0.35,0.35,0.35,0.35,0.45,0.45,0.45,0.45,0.45,0.45,0.45,0.45,0.45,0.45,0.45,0.45,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.675,0.675,0.675,0.675,0.675,0.675,0.675,0.675,0.675,0.675 ]},{"Q2":[0.16,0.25,0.35,0.45,0.6,0.17,0.25,0.35,0.45,0.63,0.88,1.12,0.16,0.25,0.35,0.45,0.61,0.88,1.13,1.38,1.71,0.16,0.25,0.35,0.45,0.64,0.86,1.12,1.37,1.75,2.24,2.73,3.46,0.16,0.26,0.35,0.45,0.62,0.88,1.12,1.37,1.74,2.23,2.74,3.46,4.47,5.41,0.25,0.35,0.45,0.62,0.88,1.12,1.37,1.75,2.24,2.73,3.48,4.47,5.49,6.83,0.26,0.35,0.45,0.62,0.86,1.13,1.37,1.74,2.24,2.74,3.45,4.47,5.48,6.92,8.92,0.36,0.45,0.64,0.86,1.13,1.38,1.74,2.24,2.74,3.46,4.45,5.47,6.92,8.96,11.45,14.36,0.46,0.61,0.88,1.13,1.37,1.74,2.25,2.74,3.46,4.46,5.46,6.9,8.93,11.44,14.82,19.19,0.68,0.86,1.11,1.38,1.74,2.24,2.75,3.47,4.47,5.47,6.91,8.91,11.4,14.89,19.63,26.07,0.9,1.11,1.38,1.76,2.24,2.75,3.49,4.47,5.46,6.91,8.92,11.37,14.87,19.74,26.36,34.74,1.13,1.38,1.75,2.24,2.74,3.49,4.47,5.46,6.9,8.92,11.37,14.85,19.74,26.52,35.32,44.94,1.4,1.75,2.24,2.74,3.47,4.48,5.47,6.9,8.92,11.37,14.84,19.76,26.55,35.28,46.95,1.82,2.24,2.75,3.47,4.47,5.49,6.92,8.93,11.37,14.85,19.75,26.61,35.37,47.01,63.04,2.28,2.74,3.48,4.47,5.47,6.97,8.93,11.38,14.85,19.74,26.64,35.42,46.95,63.23,2.78,3.46,4.47,5.47,6.89,8.94,11.37,14.87,19.74,26.6,35.43,46.98,63.48,90.68,3.57,4.51,5.48,6.9,8.91,11.43,14.85,19.74,26.63,35.45,47.12,63.52,96.35,4.54,5.47,6.93,8.92,11.34,14.89,19.77,26.64,35.5,47.26,63.65,98.05,5.53,6.88,8.91,11.34,14.85,19.74,26.64,35.57,47.16,63.56,98.82,7.04,8.88,11.36,14.85,19.79,26.49,35.4,47.03,63.53,99.03]},{"y":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]}]"kinematics_min":[{"x":null},{"Q2":null},{"y":null}]"kinematics_max":[{"x":null},{"Q2":null},{"y":null}]}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these null expected? What should be returned if an user asks for the "minimums array", just the same thing as the avg?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I placed a null because this is a case in which bin edges are not provided by the experimentalists; I'd say that if null, then min=avg.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks.
I also I think there might have a bug in the script that outputs the json because the kinematics_max and kinematics_min are missing a comma in front.

Also, I think [{"x":null},{"Q2":null},{"y":null}] could just be a single dictionary: {"x":null, "Q2":null, "y":null} (I'm using the null as example because the arrays are big, but the same applies.

arXiv: "hep-ex/9611022"
iNSPIRE: "https://inspirehep.net/literature/424154"
hepdata: "https://www.hepdata.net/record/ins426595"
tables:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
tables:
tables:

@Zaharid
Copy link
Contributor

Zaharid commented Feb 17, 2022

Do we want to have a CommonDataSpec and a CommonData as two separate classes or can we unify them as CommonDataSpec that will have all the attributes of both? This one is for @Zaharid I guess

I like the separation between header and object and also would like the parsing stage of vp to be as fast as possible, so rather prefer the arrangement where we have two classes.

Comment on lines 56 to 59
variants:
- nuclear_uncertainty_choice:
deweighted: uncertainties_dw.json
shifted: uncertainties_dw.json
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
variants:
- nuclear_uncertainty_choice:
deweighted: uncertainties_dw.json
shifted: uncertainties_dw.json
variants:
deweighted: uncertainties_dw.json
shifted: uncertainties_dw.json

I think it is better to have just one level of variance here, otherwise the fit runcards might become too complicated.
Then we might want to have metadata in the variance type of course.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can variants change things other than the uncertainties?

Copy link
Contributor Author

@enocera enocera Feb 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a very good question. The answer is yes, in the future. Example: datasets for which EW corrns have not been subtracted, in the case in which these are instead in the theory. So maybe we should list both the data.json, uncertainty.json and kinematics.json files for each variant?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of having a default selection and then allowing the variant to change a number of things.

So:

variants
  deweighted:
    uncertainties: uncertainties_dw.json
  ew:
     data: ew_subtracted.json

etc

@scarlehoff
Copy link
Member

I like the separation between header and object and also would like the parsing stage of vp to be as fast as possible, so rather prefer the arrangement where we have two classes.

This can be done with one single class as well. In any case, I already started putting all in one for the toy model, separating it in two should be easy.

Instead what I'm separating is KinematicSpec and UncertaintiesSpec (which right now were somewhere in libNNPDF I guess)

@Zaharid
Copy link
Contributor

Zaharid commented Feb 17, 2022

The one class we need to have is the one representing precisely whatever the user wrote in the runcard. As @RoyStegeman can tell, not having that e.g. for positivity is a source of of pain.

@scarlehoff
Copy link
Member

scarlehoff commented Feb 17, 2022

That has nothing to do with this, the kinematics or uncertainties are not written in the runcard but are part of the metadata of the commondata and that needs to be held by the CommonDataSpec.

@Zaharid
Copy link
Contributor

Zaharid commented Feb 17, 2022

One big annoyance with json for scientific applications is that it cannot handle NaN or infinity. I wonder how relevant is that in here. For one we could parse the same files with yaml and write e.g. .inf for infinity.

@enocera
Copy link
Contributor Author

enocera commented Feb 17, 2022

One big annoyance with json for scientific applications is that it cannot handle NaN or infinity. I wonder how relevant is that in here. For one we could parse the same files with yaml and write e.g. .inf for infinity.

@Zaharid Are you wondering how relevant .json vs .yaml is or whether NaN or inf are relevant? If you're wondering about the first, I'm in favour of moving from .json to .yaml.

@Zaharid
Copy link
Contributor

Zaharid commented Feb 17, 2022

I guess both.

@enocera
Copy link
Contributor Author

enocera commented Feb 17, 2022

So, I don't think that NaN or inf can appear as sensible values of data central values or uncertainties or kinematic bins. Moving to yaml is perhaps advisable anyway - incidentally I used .json instead of .yaml only because this is what was originally suggested.

@scarlehoff
Copy link
Member

Hi @enocera, I don't understand the format of the uncertainties:

{'sys_1': ['ADD', 'CORR', <numbers>}

What should I parse from here. Are the numbers following ADD and CORR the list of ADD and CORR concatenated together? Wouldn't it be better to have maybe {sys_1: ([list of add], [list of corr])}?

Or even directly a list of dictionaries

[{`ADD`: [numbers], 'CORR': [NUMBERS]},]

and then the number of systematic is just the number of dictionaries in the list (so no need to keep track with the sys_x name).

@enocera
Copy link
Contributor Author

enocera commented Feb 21, 2022

So:

  • sys_1 is the 1-st sytematic uncertainty; we can have as many sys as we need, and I thought to enumerate them like sys_1, sys_2, ... , sys_i, ..., sys_n, where n is the number of systematic uncertainties;
  • sys_1 in this case is additive (ADD) and correlated (CORR) at the same time; the nature (additive vs multiplicative) and correlation of each uncertainty should be specified for each sys_i (and can be different for different is);
  • <numbers> correspond to sys_i over the j data points.

@scarlehoff
Copy link
Member

Ah! Ok. Then I guess we should have some kind of metadata (thinking about yaml in this example):

sys_1:
   mode: ["ADD", "CORR"]
   values: [array of ndata numbers of points]

@enocera
Copy link
Contributor Author

enocera commented Feb 21, 2022

Yes, definitely. This is what I was thinking about after Zahari suggested to abandon .json in favour of .yaml.

@Radonirinaunimi
Copy link
Member

There was also the suggestion to add the information on what is being fitted here. This might be worth adding already?

@Zaharid
Copy link
Contributor

Zaharid commented Apr 22, 2022

As far as I understand, you still manually download the tables from HepData? This could be automatized in order to avoid this manual work. HepData has a hepdata-cli/-api but this does not seem to easily provide us with what we need. In addition, I don't think we'd rely on a package that is not maintained. Instead, below is a simple sample function that can retrieves the tables for us.

We have discussed doing this several times and it would indeed be nice if we could have it in a pull request form. A key requirement is the ability to check that the current version of the tables matches the latest version in hepdata.

@Radonirinaunimi
Copy link
Member

We have discussed doing this several times and it would indeed be nice if we could have it in a pull request form. A key requirement is the ability to check that the current version of the tables matches the latest version in hepdata.

I agree that this is definitely something that we want. This could be easily implemented and something I can take care of.

@Radonirinaunimi
Copy link
Member

I now have a module that reads the hepdata URL from metadata.yaml and download the corresponding tables. It does so by first checking if the corresponding dataset has already been downloaded, and if yes checks if the version matches the latest version on hepdata.

This can be used to perform what you, @Zaharid, mentioned above. But I just want to make sure that I understand exactly what you have in mind. What I have been playing with is a github action that--using the module--checks every now and then hepdata if a new version of a dataset is available. And if so, it opens a PR with the new version and keeps pushing commits whenever new versions are available until the PR is merged. Is this somehow in line with what you think?

scarlehoff added a commit that referenced this pull request Aug 3, 2022
…ary files for the reader (no removals or rawdata)
scarlehoff added a commit that referenced this pull request Dec 14, 2022
…ary files for the reader (no removals or rawdata)
@scarlehoff
Copy link
Member

Hi @Radonirinaunimi, in this branch there are some utilities that might be useful (while the dataset implementation is now outdated). Do you want to keep these tools?

Otherwise we can close this PR.

@Radonirinaunimi
Copy link
Member

Hi @Radonirinaunimi, in this branch there are some utilities that might be useful (while the dataset implementation is now outdated). Do you want to keep these tools?

Otherwise we can close this PR.

I think it'd be better to close this PR, and if needed I'd implement these utilities in another PR. Especially since after having re-implemented datasets, I'd approach these utilities in a different way now.

@scarlehoff
Copy link
Member

ok, thanks!

@scarlehoff scarlehoff closed this Nov 13, 2023
@scarlehoff scarlehoff deleted the NEW_COMMONDATA branch November 13, 2023 09:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants