Simplify buildmaster implementation #1417

Zaharid · 2021-09-23T15:34:54Z

As it stands, adding something to buildmaster is relatively tedious and requires certain amount of C++ coding (which could lead to had to find problems e.g. #1264). A bunch of python scripts (and a small library) could fulfil the same function more simply. The basic idea is "take as input the rawdata folder (from the experiment) plus some metadata (from us)" and output whatever #1416 would become. So I imagine we might have a small script in each data folder for that.

This all needs to be flashed out and designed.

Then there is the question of what to do with the exiting files and whether a new "filter" would need to be written for them. I think the answer that has some hope of getting implemented is to treat existing commondata files as "input" for the new commondata. Those can hopefully be combined with some binning information that we must already have elsewhere to produce the new format.

Zaharid · 2021-09-23T15:35:11Z

cc @enocera

enocera · 2021-10-06T11:35:03Z

As it stands, adding something to buildmaster is relatively tedious and requires certain amount of C++ coding (which could lead to had to find problems e.g. #1264). A bunch of python scripts (and a small library) could fulfil the same function more simply. The basic idea is "take as input the rawdata folder (from the experiment) plus some metadata (from us)" and output whatever #1416 would become. So I imagine we might have a small script in each data folder for that.

I had thought the same. The rawdata folder should ideally contain files downloaded from Hepdata, whenever these are available. I think that the relevant Hepdata entry/entries, which is/are associated to individual urls, should be part of the metadata. This is advantageous because one can immediately see whether the Hepdata entry/entries is/are up to date (sometimes experimentalists submit revised version of their data, which are called v2, v3 etc.) and also outline which tables are relevant (some experiments may have hundreds of tables, of which only a couple are relevant for the buildmaster implementation).

I am thinking whether, if one gives the relevant Hepdata entries in the metadata, these can be automatically downloaded from Hepdata, and whether it may be possible to check at the same time that these are the most updated available. I'm not sure that this is a good idea; manual download has the advantage that one needs to know what he is doing.

This all needs to be flashed out and designed.

Then there is the question of what to do with the exiting files and whether a new "filter" would need to be written for them. I think the answer that has some hope of getting implemented is to treat existing commondata files as "input" for the new commondata. Those can hopefully be combined with some binning information that we must already have elsewhere to produce the new format.

I understand that this suggestion is motivated by the fact that manpower is limited. In an ideal world, we may want to implement from scratch the old experiments. I think that experience tougth us that this may be beneficial: for instance, when we were finalising NNPDF4.0, inspection of old experiments revealed some bugs in their implementation, such as a wrong assignment of ADD/MULT labels, a spurious bin in the CDFD0 measurement and problems with the symmetrisation of the uncertainties. Re-writing the analyses for the old experiments will have the advantage of eliminating residual bugs, if any.

I of course understand that this program is really dependent on manpower. Even if we had a large amount of manpower, there would possibly be less tedious/more effective ways of using it. The idea of "filtering" the current files is certainly the cheapest.

Zaharid · 2021-10-06T16:48:03Z

As it stands, adding something to buildmaster is relatively tedious and requires certain amount of C++ coding (which could lead to had to find problems e.g. #1264). A bunch of python scripts (and a small library) could fulfil the same function more simply. The basic idea is "take as input the rawdata folder (from the experiment) plus some metadata (from us)" and output whatever #1416 would become. So I imagine we might have a small script in each data folder for that.

I had thought the same. The rawdata folder should ideally contain files downloaded from Hepdata, whenever these are available. I think that the relevant Hepdata entry/entries, which is/are associated to individual urls, should be part of the metadata. This is advantageous because one can immediately see whether the Hepdata entry/entries is/are up to date (sometimes experimentalists submit revised version of their data, which are called v2, v3 etc.) and also outline which tables are relevant (some experiments may have hundreds of tables, of which only a couple are relevant for the buildmaster implementation).

I am thinking whether, if one gives the relevant Hepdata entries in the metadata, these can be automatically downloaded from Hepdata, and whether it may be possible to check at the same time that these are the most updated available. I'm not sure that this is a good idea; manual download has the advantage that one needs to know what he is doing.

I think all that is doable. Hepdata has some json api from where we could pull the relevant information, and hopefully check for new versions. So our input would be "I want this table from this dataset" and the files would be materialized and checked for updates as appropriate. One could have a look at e.g. https://github.com/hepdata/hepdata-validator to see how the json format works. AFAICT the documentation (both there an in the hpedata paper) is oriented to submitting data in the correct format, but I'd be curious about its usages.

This all needs to be flashed out and designed.
Then there is the question of what to do with the exiting files and whether a new "filter" would need to be written for them. I think the answer that has some hope of getting implemented is to treat existing commondata files as "input" for the new commondata. Those can hopefully be combined with some binning information that we must already have elsewhere to produce the new format.

I understand that this suggestion is motivated by the fact that manpower is limited. In an ideal world, we may want to implement from scratch the old experiments. I think that experience tougth us that this may be beneficial: for instance, when we were finalising NNPDF4.0, inspection of old experiments revealed some bugs in their implementation, such as a wrong assignment of ADD/MULT labels, a spurious bin in the CDFD0 measurement and problems with the symmetrisation of the uncertainties. Re-writing the analyses for the old experiments will have the advantage of eliminating residual bugs, if any.

I of course understand that this program is really dependent on manpower. Even if we had a large amount of manpower, there would possibly be less tedious/more effective ways of using it. The idea of "filtering" the current files is certainly the cheapest.

I don't have such a strong view either way. Probably the answer will become clearer in the future, depending on both the available resources and what the future format ends looking like.

enocera · 2021-10-06T16:56:21Z

I am thinking whether, if one gives the relevant Hepdata entries in the metadata, these can be automatically downloaded from Hepdata, and whether it may be possible to check at the same time that these are the most updated available. I'm not sure that this is a good idea; manual download has the advantage that one needs to know what he is doing.

I think all that is doable. Hepdata has some json api from where we could pull the relevant information, and hopefully check for new versions. So our input would be "I want this table from this dataset" and the files would be materialized and checked for updates as appropriate. One could have a look at e.g. https://github.com/hepdata/hepdata-validator to see how the json format works. AFAICT the documentation (both there an in the hpedata paper) is oriented to submitting data in the correct format, but I'd be curious about its usages.

OK, thanks. This is maybe a point for next week's discussion. The discussion about what is doable and what is not should perhaps be put in context with whether the feature is useful or not. A clear advantage I see is that, if the experimentalists silently submit a new version of a measurement on hepdata, having it automatically downloaded and properly labeled as a dataset variant will save us the trouble of making sure that our data is in sync with the latest version on hepdata.

scarlehoff · 2023-11-17T09:18:51Z

This is superseded by #1709

enocera changed the title ~~Simplify buildmaster implmentation~~ Simplify buildmaster implementation Sep 23, 2021

Zaharid added the data toolchain label Sep 23, 2021

scarlehoff closed this as completed Nov 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify buildmaster implementation #1417

Simplify buildmaster implementation #1417

Zaharid commented Sep 23, 2021

Zaharid commented Sep 23, 2021

enocera commented Oct 6, 2021

Zaharid commented Oct 6, 2021

enocera commented Oct 6, 2021

scarlehoff commented Nov 17, 2023

Simplify buildmaster implementation #1417

Simplify buildmaster implementation #1417

Comments

Zaharid commented Sep 23, 2021

Zaharid commented Sep 23, 2021

enocera commented Oct 6, 2021

Zaharid commented Oct 6, 2021

enocera commented Oct 6, 2021

scarlehoff commented Nov 17, 2023