Alternative variable binning approach #849
Before doing a more detailed review, let me try to summarise a few key aspects of this PR:
In conclusion, I am highly in favour of the approach proposed in this PR over that in #835.
Hi, sorry for being very late to this discussion. I think the code looks good and it is a good option for a variable binning with one "split" variable.
I understand that #835 introduces a lot of changes. While I tried to avoid affecting existing functionality as much as possible, I will fully understand if people don't feel comfortable pushing it to main and prefer adding this version instead.
In my case I specifically needed to introduce arbitrary cuts for classification, so I don't think this solution would be suitable for my analysis. However, since I might be the only person who needs this functionality for now, I would not have a problem working in a separate branch.
Hi Maria, a set of n "arbitrary" cuts/selection criteria can be represented by a one-dimensional binning too, can't it? You would just have to evaluate which one of your cuts each event satisfies and add one unique number per such cut to each event in the events file before running the pipeline, then define the one-dimensional
Yes, I agree that this is an option, but it would be more of a workaround than actual functionality. As you suggested, it would mean that the analyzer needs to regenerate the event file every time there is a change or adjustment to the binning. In my opinion, doing it this way will not only make setting up a new binning less convenient, but will also introduce an additional opportunity for mistakes by adding a necessary step outside PISA. Re-matching particular numbered bins to variable cuts (e.g. for plotting or to check that bins are numbered correctly) will also be on the analyzer, which is another opportunity for mistakes.
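For reference, the workaround discussed in the two comments above could look roughly like the sketch below. It uses plain NumPy rather than any PISA API, and the event keys (`pid`, `n_hit_doms`), the cut definitions, and the helper `label_events` are all hypothetical names chosen for illustration.

```python
import numpy as np

def label_events(events, cuts):
    """Return, per event, the index of the first cut it satisfies (-1 if none)."""
    n_events = len(next(iter(events.values())))
    labels = np.full(n_events, -1, dtype=int)
    # Go through the cuts in reverse so that earlier cuts win if cuts overlap.
    for idx in reversed(range(len(cuts))):
        mask = cuts[idx](events)
        labels[mask] = idx
    return labels

# Toy event sample (hypothetical keys and values).
events = {
    "pid": np.array([0.1, 0.6, 0.95, 0.4]),
    "n_hit_doms": np.array([5, 12, 30, 8]),
}

# Three "arbitrary" selections, each a function returning a boolean mask.
cuts = [
    lambda ev: ev["pid"] < 0.5,
    lambda ev: (ev["pid"] >= 0.5) & (ev["n_hit_doms"] < 20),
    lambda ev: (ev["pid"] >= 0.5) & (ev["n_hit_doms"] >= 20),
]

labels = label_events(events, cuts)

# One-dimensional binning over the labels: one bin per cut, centred on the integer.
edges = np.arange(len(cuts) + 1) - 0.5
counts, _ = np.histogram(labels, bins=edges)
```

The extra bookkeeping is visible here: the mapping from label `1` back to its cut expression lives only in the order of the `cuts` list, which is the re-matching burden raised in the reply above.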
Sorry, clicked the button by accident
@JanWeldert if we decide to go ahead with it, would you mind adding a test for
I don't think this has to be external to PISA at all. One option I see is for this functionality to become a service, or we allow
We could allow a series of cuts, we just have to make sure the different cuts are applied correctly. E.g. if you specify two cuts, do you want to split events by all possible combinations of the cut values or only combine specific combinations (like first-first, second-second, ...).
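To make the two interpretations concrete, here is a small sketch in plain Python. The function `combine_cuts`, the `mode` argument, and the cut strings are assumptions made up for this illustration and are not part of the PR.

```python
from itertools import product

def combine_cuts(cut_lists, mode="product"):
    """Combine several lists of cut strings into joint selection strings.

    mode="product": every combination of one cut from each list.
    mode="paired":  first-with-first, second-with-second, and so on.
    """
    if mode == "product":
        combos = product(*cut_lists)
    elif mode == "paired":
        if len({len(cuts) for cuts in cut_lists}) != 1:
            raise ValueError("paired mode needs equally long cut lists")
        combos = zip(*cut_lists)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return [" & ".join(f"({cut})" for cut in combo) for combo in combos]

pid_cuts = ["pid < 0.5", "pid >= 0.5"]
dom_cuts = ["n_hit_doms < 20", "n_hit_doms >= 20"]

all_combos = combine_cuts([pid_cuts, dom_cuts], mode="product")
paired = combine_cuts([pid_cuts, dom_cuts], mode="paired")
```

With two cuts of two values each, the product mode yields four event sets while the paired mode yields two, so the choice directly changes how many binnings the user has to supply.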
I should have written "series of selections" instead (consider How about, in addition to the
for a split into two sets of events each with their own dedicated binning? We'd have to make sure that We have logic for applying selections as these in https://github.com/icecube/pisa/blob/master/pisa/core/events_pi.py#L500. It looks like this was copied into the (I still think we should avoid making a |
pisa/core/binning.py
if isinstance(selections, OneDimBinning):
    assert selections.name not in b.names
else:
    assert binnings.count(b) == 1, 'Binning used more than once, modify your selection'
Do we need to restrict the user to a different MultiDimBinning for each selection? I feel like it would be more convenient to allow duplicate binnings. For example, one can think of a situation where we have three PID bins, but only the middle bin has a different binning. If one cannot use duplicate binnings, they will have to define a selection for the lower+higher PID bins and then a 3d binning in (E, coszen, PID).
This of course would not affect the calculation, but, in my opinion, it could just be more convenient to be able to assign same binning to different selections.
I only enforce this restriction for arbitrary selections defined in a list of strings. You can use the OneDimBinning selection option for the case of three PID bins where the middle bin should get a different binning in E and coszen. The reason for not allowing it for the arbitrary selections is that this would split your statistic (MC and real event counts) and increase your uncertainty in each bin. In general, if the variable used for the split is also part of the binning, the OneDimBinning option should be used.
Or do you want to have other cuts additional to the PID split that are different for each PID bin?
Yes, this was just an example. I agree that having duplicate binnings will split your statistics, but if the selection variable carries some information useful for the oscillation fit, this could be balanced out by its contribution to sensitivity. Also, as you suggested, it would be useful in the case where there are multiple variables used in the selection and PID (with a different binning for the middle bin) is just one of them.
Hmm, how about making it a warning instead? I think we should also warn that the selection variables should not be part of the MultiDimBinning(s), because the selection itself already effectively acts as binning here
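A rough sketch of what replacing the assertion with warnings could look like is given below. It does not use PISA classes: each binning is stood in for by a plain dict mapping dimension names to bin edges, and `check_selections` is a hypothetical helper, so this only illustrates the two checks proposed in this thread.

```python
import re
import warnings

def check_selections(selections, binnings):
    """Warn (instead of asserting) about questionable selection/binning setups.

    selections: list of cut strings, e.g. "pid < 0.3"
    binnings:   list of dicts {dimension name: bin edges}, one per selection
    Returns the list of warning messages that were issued.
    """
    if len(selections) != len(binnings):
        raise ValueError("need exactly one binning per selection")
    messages = []
    seen = []
    for sel, binning in zip(selections, binnings):
        if binning in seen:
            messages.append(
                f"Binning used more than once (selection '{sel}'); "
                "this splits the statistics across selections."
            )
        seen.append(binning)
        # Identifiers in the cut string that are also binning dimensions.
        overlap = set(re.findall(r"[A-Za-z_]\w*", sel)) & set(binning)
        if overlap:
            messages.append(
                f"Selection '{sel}' uses binning dimension(s) {sorted(overlap)}; "
                "the selection already acts as a binning in these variables."
            )
    for msg in messages:
        warnings.warn(msg)
    return messages

coarse = {"reco_energy": [1, 10, 100], "reco_coszen": [-1, 0, 1]}
fine = {"reco_energy": [1, 5, 10, 50, 100], "reco_coszen": [-1, -0.5, 0, 0.5, 1]}

msgs = check_selections(
    ["pid < 0.3", "(pid >= 0.3) & (pid < 0.7)", "pid >= 0.7"],
    [coarse, fine, coarse],
)
```

In this toy case only the duplicate-binning warning fires, since `pid` is not a dimension of either binning.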
Yeah, I think a warning would be better. And I think it's a good idea to have a warning about not mixing selection and binning variables.
I also just thought that the same statistic split can be achieved if a variable which has no impact on the fit is used in regular multidim binning. I don't think there is a way to have a warning there though, so it is up to the analyzer to make sure binning variables carry some information
yeah, in the end, we have to rely on the analyzers to know what they are doing. 😅
Thank you very much @JanWeldert and @thehrh for putting work into this! The modifications look good to me. The only question I have is about the ability to assign duplicate binnings to different selections, which is more about convenience than functionality.
…ifferent tasks), new pipeline unit test & minor logic fix
…ction counts and warn if total is zero, add test for non-exclusive selections and empty selections
…t cleared, moved core pipeline output calculations into separate functions for clarity
…rom config + minor
…ning dim. in a cut expression, allow duplicate binnings, more detailed logging); add notes to class docstring; couple superficial mods
….py; comment on possible caching optimisation in pipeline.py; use built-in DistributionMaker profiling functionality in notebook and reduce no. of pseudoexperiments for usability
Similar to #835, this PR introduces the option to use different (regular) binnings in an analysis.
Which events use which binning depends on a separate variable called cut_var. This can be, for example, the pid value but also the number of hit modules.
I tried to modify as little code as possible but also provide all necessary changes to use the new binning type. An example notebook is also provided. This PR introduces a new binning class, VarBinning, which basically just holds multiple MultiDimBinning objects and one OneDimBinning which represents the cut variable. The main change when using the VarBinning class is that the histogramming is not happening in the dedicated stage but in the output function of the pipeline. Consequently, a pipeline using VarBinning can not have a hist stage.
The way a VarBinning is defined is by passing a list in the binning config file.
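The idea behind the description above (one binning per bin of the cut variable, with histogramming done at output time) can be sketched with plain NumPy as follows. This is not the PISA implementation: the function `hist_with_variable_binning`, the event keys, and the dict-based binnings are assumptions for illustration only.

```python
import numpy as np

def hist_with_variable_binning(events, cut_var, cut_edges, binnings, weights_key="weights"):
    """Histogram events with a different binning per bin of the cut variable.

    binnings: list of dicts {dimension name: bin edges}, one per cut-variable bin.
    Returns one N-dimensional histogram per cut-variable bin.
    """
    if len(binnings) != len(cut_edges) - 1:
        raise ValueError("need exactly one binning per bin of the cut variable")
    # Index of the cut-variable bin each event falls into; events outside the
    # edges (or exactly on the last edge) get an index that matches no binning.
    idx = np.digitize(events[cut_var], cut_edges) - 1
    hists = []
    for i, binning in enumerate(binnings):
        mask = idx == i
        sample = np.column_stack([events[name][mask] for name in binning])
        hist, _ = np.histogramdd(
            sample,
            bins=[binning[name] for name in binning],
            weights=events[weights_key][mask],
        )
        hists.append(hist)
    return hists

# Toy events (hypothetical keys and values).
events = {
    "pid": np.array([0.1, 0.2, 0.5, 0.9, 0.95]),
    "reco_energy": np.array([5.0, 20.0, 8.0, 50.0, 3.0]),
    "reco_coszen": np.array([-0.5, 0.2, -0.9, 0.7, 0.1]),
    "weights": np.ones(5),
}

coarse = {"reco_energy": [1, 10, 100], "reco_coszen": [-1, 0, 1]}
fine = {"reco_energy": [1, 5, 10, 50, 100], "reco_coszen": [-1, -0.5, 0, 0.5, 1]}

# Three PID bins; only the middle one gets the finer (E, coszen) binning.
hists = hist_with_variable_binning(
    events, "pid", cut_edges=[0.0, 0.3, 0.7, 1.0], binnings=[coarse, fine, coarse]
)
```

Because the per-selection histograms have different shapes, they cannot be stored in a single regular array, which is consistent with the description's point that the histogramming moves out of the dedicated hist stage.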