add y labels, integrate utils in filters and remove comma in sys names #2020

Closed
wants to merge 4 commits

Conversation

t7phy
Member

@t7phy commented Mar 24, 2024

No description provided.

@t7phy marked this pull request as ready for review March 24, 2024 15:01
@@ -1,7 +1,59 @@
import yaml
from validphys.commondata_utils import symmetrize_errors as se
from validphys.commondata_utils import percentage_to_absolute as pta
from validphys.commondata_utils import covmat_to_artunc as cta
Member

I think the covmat_to_artunc was supposed to become a generic function to be used by all (since there are some datasets by @Radonirinaunimi that also use it).

Unless you already talked among yourselves and decided to take this one out as well.

Member

Yes, the commondata utils will have to include corrmat_to_covmat, otherwise all of the collider DY datasets will break. Unfortunately, however, this PR does not yet introduce/address the utils (in the form we discussed at the CM); that will still have to be done in a separate PR.

And in that respect, I absolutely don't think such a function should be re-implemented in all of the filters that require it, as is done here.

@RoyStegeman
Member

Is this supposed to be merged in favor of #1693, while closing that one?

P.S. even if it's a lot of repetition, having descriptive names and docstrings beats the alternative.

Member

@Radonirinaunimi left a comment

The fixes regarding the labels and sysnames look good to me. @t7phy, could you attach the corresponding data vs. theory comparisons for these datasets in the description?

As for the utils, let's discuss at the CM the base functions that must crucially form the utils. For the time being, covmat_to_artunc should be reverted.

@@ -1,7 +1,59 @@
import yaml
from validphys.commondata_utils import symmetrize_errors as se
from validphys.commondata_utils import percentage_to_absolute as pta
from validphys.commondata_utils import covmat_to_artunc as cta
Member

Such a function should certainly be part of the utils, and for the purposes of this PR it should still be called from validphys.commondata_utils.

@RoyStegeman
Member

As for the utils, let's discuss at the CM the base functions that must crucially form the utils

This has been discussed, and this was the agreed upon solution: #1693 (comment)

At the same CM @comane also offered to take charge of fixing the filter.py files along with the tests and utils

@Radonirinaunimi
Member

As for the utils, let's discuss at the CM the base functions that must crucially form the utils

This has been discussed, and this was the agreed upon solution: #1693 (comment)

At the same CM @comane also offered to take charge of fixing the filter.py files along with the tests and utils

I indeed had @comane in mind regarding the other utils, as he has been implementing datasets which also require base functions that I think should be part of the utils (such as the one below, for example):

But if this doesn't happen soon, I'd be happy with having only covmat_to_artunc in the utils as in #1693 (comment) and merging that one. But definitely, @t7phy, cta has to be reverted here.

@t7phy
Member Author

t7phy commented Mar 25, 2024

My idea behind the choice to decouple it from the current utils PR was that once we have finalized the actual utils (i.e. taking the ones we need from Mark and the ones we need from my PR), we should then, and only then, reintroduce the dependencies. It would then be a very simple task to remove a function and add an import; however, keeping non-working filter files any longer is not a good idea. I suggest we merge this, and then, once we finalize the actual utils, we can come back to the filters to remove the redundant functions, thus ensuring the filters always work.

Please also note that since we are moving towards the creation of a data package, we may actually have to import from nnpdf_data.utils etc., so I think it is important to only add dependencies once the dependencies are themselves finalized.

@Radonirinaunimi I would have to redo the data/theory comparison if you would like that, but I am not sure why that would change? The data, kinematics and uncertainties are all the same.

@RoyStegeman
Member

RoyStegeman commented Mar 25, 2024

Personally I would still keep only a single version of each function and import from there, but fine, I won't insist on it (though of course I don't speak for @Radonirinaunimi). I would still like the function names and docstrings to be there in full. Please also close #1693 and write a helpful summary in the first comment of this PR.

Oh, and please run black and isort.

@Radonirinaunimi
Member

Radonirinaunimi commented Mar 25, 2024

Personally I would still keep only a single version of each function and import from there, but fine, I won't insist on it (though of course I don't speak for @Radonirinaunimi).

I was exactly about to write something along those lines.

My idea behind the choice to decouple it from the current utils PR was that once we have finalized the actual utils (i.e. taking the ones we need from Mark and the ones we need from my PR), we should then, and only then, reintroduce the dependencies. It would then be a very simple task to remove a function and add an import; however, keeping non-working filter files any longer is not a good idea. I suggest we merge this, and then, once we finalize the actual utils, we can come back to the filters to remove the redundant functions, thus ensuring the filters always work.

I don't think this stands at all. The idea is that once this is merged, hopefully we will not touch them anymore, i.e. not re-run the filters unless there are bugs. I don't see the point in opening another PR once the utils are merged to fix all of these, while it is easier to revert them now. As mentioned, either way, a few of the filters currently in master do not work because of these utils, so this problem persists anyway.

But I also won't insist on this.

@Radonirinaunimi I would have to redo the data/theory comparison if you would like that, but I am not sure why that would change? The data, kinematics and uncertainties are all the same.

Well, the idea is to finally see the fixed labels on the plots in case there are uncaught (LaTeX/typographical) errors.

@t7phy
Member Author

t7phy commented Mar 25, 2024

@RoyStegeman I can add the docstrings but I would like to keep the names as they are... I was using from XYZ import full_name as fn, so using all of these functions requires the short names instead. Of course I could add a new line fn = full_name, but that seems pointless at this point; these are local functions now anyway.

@Radonirinaunimi yes, I forgot about the need to check the LaTeX labels; I will share the DTC once I have it.

@RoyStegeman
Member

A search-and-replace would solve that; so would putting them in one of the files and importing from there.

@t7phy
Member Author

t7phy commented Mar 25, 2024

The problem isn't applying search and replace; it's that sometimes multiple functions (as many as 3 or 4) are used in the same line, because one requires the output of another, which requires the output of yet another, and using the full names makes the lines unnecessarily long.

At this point it seems we are overcomplicating things for no reason. #1693 is dead. All these functions are now local to the filter files and used as they are needed, so complaining about their names is counterproductive, because then we may as well have a look at every single filter file in every single implementation to see how the functions are named.

@scarlehoff
Member

sometimes multiple functions (as many as 3 or 4) are used in the same line

This should be avoided whenever possible.

@t7phy
Member Author

t7phy commented Mar 25, 2024

I am sorry, but at this point I am unable to dedicate time to rewriting the filter files. I am happy to produce a DTC, add docstrings and make the formatting changes using black and isort for this PR. If people prefer to change every single thing about the implementation, please feel free to delete this PR and reimplement the datasets.

Say I need to generate a covmat based on a corrmat provided by the experimentalists and on unsymmetrized percentage uncertainties: clearly I need to convert from percentage to absolute, then symmetrize them, then use them to convert the corrmat to a covmat.

The util functions in the utils PR were designed such that these functions worked perfectly with one another and the final artUnc could be obtained using very compact code.
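As an illustration of the chain described above, here is a minimal sketch using simplified stand-in helpers; the names mirror the validphys.commondata_utils imports shown in the diff, but the bodies and the example numbers are purely illustrative, not the actual implementations:

import numpy as np

def percentage_to_absolute(percentage, value):
    # Simplified stand-in: turn a percentage uncertainty into an absolute one.
    return percentage * value / 100.0

def symmetrize_errors(delta_plus, delta_minus):
    # Simplified stand-in: return a central-value shift and a symmetric error
    # from an asymmetric (+, -) pair.
    shift = (delta_plus + delta_minus) / 2.0
    sym_err = (abs(delta_plus) + abs(delta_minus)) / 2.0
    return shift, sym_err

def corrmat_to_covmat(corrmat, errors):
    # cov_ij = corr_ij * sigma_i * sigma_j
    errors = np.asarray(errors)
    return np.asarray(corrmat) * np.outer(errors, errors)

def covmat_to_artunc(covmat):
    # Decompose the covariance matrix into artificial uncertainties via its
    # eigendecomposition: art_ik = v_ik * sqrt(lambda_k).
    eigvals, eigvecs = np.linalg.eigh(covmat)
    return eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))

# Chained usage with named intermediate steps (illustrative numbers only):
central = np.array([10.0, 20.0])
plus_pct, minus_pct = np.array([5.0, 4.0]), np.array([-3.0, -6.0])
abs_plus = percentage_to_absolute(plus_pct, central)
abs_minus = percentage_to_absolute(minus_pct, central)
shifts, sym_errs = symmetrize_errors(abs_plus, abs_minus)
corrmat = np.array([[1.0, 0.5], [0.5, 1.0]])
art_unc = covmat_to_artunc(corrmat_to_covmat(corrmat, sym_errs))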

Complaining about all these points without looking at how the data is provided on HEPData for particular edge cases is counterproductive, and I am not able to dedicate more time towards these implementations.
As I mentioned, I will do the things listed above; however, if you have complaints about the entire implementation of about 75 distributions, please feel free to close this PR.

@Radonirinaunimi
Member

I am sorry, but at this point I am unable to dedicate time to rewriting the filter files. I am happy to produce a DTC, add docstrings and make the formatting changes using black and isort for this PR. If people prefer to change every single thing about the implementation, please feel free to delete this PR and reimplement the datasets.

I am not sure I understand any of this. Suggestions/requests for modifications in PRs are meant to maintain a reasonable standard throughout the code, in order not to descend into chaos. It is not just about getting things to work!

Say I need to generate a covmat based on a corrmat provided by the experimentalists and on unsymmetrized percentage uncertainties: clearly I need to convert from percentage to absolute, then symmetrize them, then use them to convert the corrmat to a covmat.

Sure, and no one is opposing that! However, the way in which things are approached could be improved. A complicated one-line syntax is definitely not a good one, and variable/function naming should follow some convention.

Complaining about all these points without looking at how the data is provided on HEPData for particular edge cases is counterproductive, and I am not able to dedicate more time towards these implementations. As I mentioned, I will do the things listed above; however, if you have complaints about the entire implementation of about 75 distributions, please feel free to close this PR.

Various people have implemented datasets so far and this issue has never arisen...

@scarlehoff
Member

@t7phy this is a relatively big collaboration; ensuring that things work smoothly and that a lot of different people can work in the same (quite big) code base requires making some compromises and efforts.

One of these efforts concerns Pull Requests and code reviews. Doing a review of someone else's code has three main goals:

  1. That we audit the code that goes into the repository and that will be used in the future by many other people.
  2. That more than one person understands the code that goes into the repository.
  3. That there is some consistency on the quality and style of the code such that new people can be quickly onboarded into the collaboration. This is crucial if we want to keep having master students who only have a few months to do something.

For the person doing the review this is often not a rewarding job: it takes quite some time and it is difficult to justify, as you are not "producing" anything new, but it is necessary since we are a big group of people contributing to the same codebase. This review process is neither adversarial nor useless. It is what has allowed us to keep this alive for about 8 years already (of the main 10 vp authors from before 2019, none continue in the collaboration* and the code is still alive; I think that's quite impressive!!!)

If you don't want to contribute, that is perfectly OK. There are people who use the code as external users; that is fine, and that's why the code is public. As a contributor, however, it is important to treat code reviews not as a waste of time but as a way of improving the overall health of nnpdf.

Also, a very important point to mention is that code reviews not only strengthen the quality of our code base, they are also one of the best ways to improve and hone your programming skills. The reviewer is not always right, of course, and you will learn when it makes sense to argue and when it doesn't. You will also learn to write code (and this is crucial) differently depending on what you are doing. The code I write here is different from what I write for pinefarm, which has a different target, or in my own projects, which only I need to understand.

All these functions are now local to the filter files and used as they are needed, so complaining about their names is counterproductive, because then we may as well have a look at every single filter file in every single implementation to see how the functions are named.

I would instead say it is a productive thing. It is my fault that I didn't look into the filter files when some of the new data were merged, but it would be good to fix past mistakes instead of introducing new ones.

Regarding the names of the functions, a search and replace plus running black afterwards would be extremely helpful, since only a couple of instructions (a few calls to sed to replace the names and to black to style the code) would turn these hard-to-read one-liners:

this_is_the_first_function(some_function(on_some_other_function(on_yet_another_function(oh_my(input1, input2)))))

into this much more readable tower:

this_is_the_first_function(
    some_function(
        on_some_other_function(on_yet_another_function(oh_my(input1, input2)))
    )
)

(but really, going forward please don't apply more than two functions in the same line)
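For example, the nesting can also be removed entirely by naming the intermediate results (a sketch using the same placeholder names as above):

step1 = oh_my(input1, input2)
step2 = on_yet_another_function(step1)
step3 = on_some_other_function(step2)
result = this_is_the_first_function(some_function(step3))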

Complaining about all these points without looking at how the data is provided on HEPData for particular edge cases is counterproductive, and I am not able to dedicate more time towards these implementations.

If you cannot dedicate time to this, that is also ok: just work on this when you do have time. Make this into a draft PR to signal that it is work in progress and we will come back to it when you can take care of it. We do this constantly (see for instance here); then people can make comments (which can be useful even if they are about WIP code!) but there's no pressure on anyone to either finish it or review it.

*contributing code, some have now become too senior for that :P

@t7phy
Member Author

t7phy commented Mar 25, 2024

@Radonirinaunimi @scarlehoff at no point did I say that code review is not useful or should not be done, but a PR to fix a few things cannot be turned into a complete overhaul of a significant amount of code. Note that I explicitly mentioned that I would soon apply the suggestions, including the use of black, and you are right, I didn't realize earlier that find-and-replace together with black would solve the issue.

But at the same time, it's also essential that we prioritize physics over the name of a function. Two points here:

  1. @Radonirinaunimi as you mentioned this function:

At first glance, this function looks much simpler than covmat_to_artunc; however, there is no check that the matrix is positive definite, and in the case of normalized distributions one of the eigenvalues could be 0, which furthermore could come out slightly negative (but close to 0) for numerical reasons and needs to be checked and set to 0. Moreover, for covmats that span multiple normalized distributions, the function needs to allow for as many 0 eigenvalues as there are distributions (a sketch of such checks is shown at the end of this comment).

Without taking all these caveats into account, a comparison serves no purpose, as you are not considering the different possibilities a utils function should be able to handle.
It is therefore important not to make blanket statements about someone's implementation being right or wrong without understanding the intricacies of the situation, and that also means having some trust in the decisions made by the person who implemented the code.

  2. In #1995 ("unable to perform fit comparison"), I shared a problem, and it appears that we now have jet implementations in the master branch that are problematic. They increase the chi2 a lot and lead to problems with positivity. Someone must have reviewed it, approved it and merged it. Maybe, just maybe, the focus of PR review should be more on getting the physics right than on function names. PR reviews are useful, but I feel that at times we do them in a way that's so bureaucratic that it becomes counterproductive and hinders the actual checks which should happen to ensure the implementation and the actual physics and math are correct.
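A minimal sketch of the checks described in point 1, with a hypothetical function name, tolerance and zero-mode count (illustrative only, not the actual covmat_to_artunc implementation):

import numpy as np

def covmat_to_artunc_checked(covmat, n_zero_modes=0, tol=1e-10):
    # Hypothetical sketch: decompose a covariance matrix into artificial
    # uncertainties while allowing exactly n_zero_modes vanishing eigenvalues
    # (one per normalized distribution covered by the covmat).
    eigvals, eigvecs = np.linalg.eigh(np.asarray(covmat, dtype=float))
    if np.any(eigvals < -tol):
        # Genuinely negative eigenvalues mean the matrix is not a valid covmat.
        raise ValueError("covariance matrix is not positive semi-definite")
    # Eigenvalues that vanish up to numerical noise are reset to exactly 0.
    eigvals = np.where(np.abs(eigvals) < tol, 0.0, eigvals)
    if np.count_nonzero(eigvals == 0.0) != n_zero_modes:
        raise ValueError("unexpected number of zero eigenvalues")
    return eigvecs * np.sqrt(eigvals)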

@t7phy
Member Author

t7phy commented Mar 25, 2024

@Radonirinaunimi @scarlehoff here is the vp report: https://vp.nnpdf.science/MzdicX55Ti2M8PlOyp26Hg==/ and I have pushed the changes above as suggested. Please let me know if this is fine.

@t7phy
Member Author

t7phy commented Mar 25, 2024

Closed in favor of #2021.

@t7phy closed this Mar 25, 2024
@t7phy deleted the multiple_fixes branch March 25, 2024 21:49