Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative variable binning approach #849

Merged
merged 20 commits into from
Mar 13, 2025
Merged

Conversation

JanWeldert
Copy link
Collaborator

Similar to #835 this PR introduces the option to use different (regular) binnings in an analysis.
Which events use which binning depends on a separate variable called cut_var. This can be for example the pid value but also the number of hit modules.
I tried to modify as little code as possible but also provide all necessary changes to use the new binning type. An example notebook is also provided. This PR introduces a new binning class VarBinning which basically just holds multiple MultiDimBinning objects and one OneDimBinning which represents the cut variable. The main change when using the VarBinning class is that the histogramming is not happening in the dedicated stage but in the output function of the pipeline. Consequently, a pipeline using VarBinning can not have a hist stage.
The way a VarBinning is defined is by passing a list in the binning config file.

@thehrh
Copy link
Contributor

thehrh commented Feb 20, 2025

Before doing a more detailed review, let me try to summarise a few key aspects of this PR:

  • one binning dimension can now also serve as variable dependent on which the binnings in the remaining dimensions change (still always hyperrectangular unless masking is used, same syntax as dependent binnings: one universal mask or list of masks); it seems unlikely that more than one such distinguished dimension will be required in practice
  • in contrast to Introducing support for variable binning with event classes (species) #835, we are not introducing yet another events class and need no "event species names" (this role is played by the bins in the distinguished dimension)
  • the modifications of parse_pipeline_config, allowing it to instantiate the new binning_dict entry, are moderate in both cases
  • Container finally implements the "sum" translation mode for histogramming events (low effort, uses the pre-existing array_to_binned method)
  • this mode replaces a pipeline's utils.hist service when a VarBinning is set as the output_binning
    • all events are jointly processed by the pipeline: only at the end of _get_outputs are they split into separate ContainerSets, one per bin in the distinguished dimension, and histogrammed (using the now built-in functionality) according to the appropriate MultiDimBinning in the remaining dimensions
    • accordingly, a list of MapSets is returned, requiring minor modifications to the DistributionMaker class
  • a major benefit with respect to Introducing support for variable binning with event classes (species) #835 is the removed need for many invasive and often boilerplate changes to Map functions, whose thoroughness only emphasises the large maintenance burden they are accompanied by
    • Possibly, if the binning had been more flexible from scratch, the class would have been integrated into Map like this.
    • But the solution in this PR shows that a Map needn't be aware of this type of variable binning for conducting statistical analysis. Instead, only fairly few simple additions to the Analysis class are required, similar to what we already have there when we are distinguishing between a DistributionMaker and a Detectors instance.
    • The drawback is of course that users who are e.g. interactively computing metric values will have to manually perform the sum in the same way as fit_recursively does it.

In conclusion, I am highly in favour of the approach proposed in this PR over that in #835.

Copy link
Contributor

@marialiubarska marialiubarska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, sorry for being very late to this discussion. I think the code looks good and it is a good option for a variable binning with one "split" variable.

I understand that #835 introduces a lot of changes. While I tried to avoid affecting existing functionality as much a possible, I will fully understand if people don't feel comfortable pushing it to main and prefer adding this version instead.

In my case I specifically needed to introduce arbitrary cuts for classification, so I don't think this solution would be suitable for my analysis. However, since I might be the only person who needs this functionality for now, I would not have a problem working in separate branch.

@thehrh
Copy link
Contributor

thehrh commented Feb 21, 2025

Hi Maria, a set of n "arbitrary" cuts/selection criteria can be represented by a one-dimensional binning too, can't it? You would just have to evaluate which one of your cuts each event satisfies and add one unique number per such cut to each event in the the events file before running the pipeline, then define the one-dimensional split bin edges accordingly. This way, double counting by PISA wouldn't even be a concern (you as analyser would need to make sure cuts are mutually exclusive anyway I presume).

@marialiubarska
Copy link
Contributor

Yes, I agree that this is an option, but it would be more like a workaround then an actual functionality. Like you suggested, it would mean that the analyzer needs to regenerate event file every time there is a change or adjustment to binning. In my opinion, doing it this way will not only make the process of setting up new binning less convenient, but will also introduce additional opportunity for mistakes by adding a necessary step outside PISA. Re-matching particular numbered bins to variables cuts (e.g. for plotting or to check bins are numbered correctly) will also be on the analyzer, which is another opportunity for mistakes.

@marialiubarska marialiubarska marked this pull request as ready for review February 24, 2025 22:42
@marialiubarska marialiubarska marked this pull request as draft February 24, 2025 22:43
@marialiubarska
Copy link
Contributor

Sorry, clicked the button by accident

@marialiubarska
Copy link
Contributor

@JanWeldert if we decide to go ahead with it, would you mind adding a test for VarBinning?

@thehrh
Copy link
Contributor

thehrh commented Feb 24, 2025

I don't think this has to be external to PISA at all. One option I see is for this functionality to become a service, or we allow split to be specified as series of cuts similar to https://github.com/icecube/pisa/pull/835/files#diff-f20428059ca19bffc11af4daed3fd79017584ca4910e41eb9cc6c37dda35d4afR670 ? Wouldn't one just need to change the logic a bit around https://github.com/icecube/pisa/pull/849/files#diff-79820012d0e6e41b7ff46042be416541ba3b7924a178e039cf7060304f1f522bR409 @JanWeldert ?

@JanWeldert
Copy link
Collaborator Author

We could allow a series of cuts, we just have to make sure the different cuts are applied correctly. E.g. if you specify two cuts, do you want to split events by all possible combinations of the cut values or only combine specific combinations (like first-first, second-second, ...).
I also wanted to point out that if this PR in its current form doesn't provide the functionality you need @marialiubarska, we (mostly I) have to work on it to change that. You shouldn't feel the need to work in separate branch.

@thehrh
Copy link
Contributor

thehrh commented Feb 25, 2025

I should have written "series of selections" instead (consider cut_var -> selections as in #835 a more appropriate designation). Look at https://github.com/icecube/pisa/pull/835/files#diff-5f7216eac81cdb9c0a8129aa6cb83056cd518b9dbc66c7c0a93db9a24484bbc2R3932 for example.

How about, in addition to the OneDimBinning currently implemented, allowing the user to pass a comma-separated list of such selections explicitly only (not in some implicit form that has to be interpreted by PISA), such as

reco_var_binning.split = (nDOM == 7) & (l7_muon_classifier_prob_nu >= 0.8), (nDOM >= 10) & (l7_muon_classifier_prob_nu <= 0.5)

for a split into two sets of events each with their own dedicated binning? We'd have to make sure that VarBinning stores the individual selection expressions such that they can be accessed through the instantiated Pipeline for plotting etc.

We have logic for applying selections as these in https://github.com/icecube/pisa/blob/master/pisa/core/events_pi.py#L500. It looks like this was copied into the utils.hist service in #835 (https://github.com/icecube/pisa/pull/835/files#diff-3f79be10e1c93fd330b7e4a4f904b1cfe8bfd766565fb7f8b352b5a38c120cf1R285).

(I still think we should avoid making a Map aware of the new binning as in https://github.com/icecube/pisa/pull/835/files#diff-2817515fee33820beb651d8b78c6dc430f377d845a8a92789c8df4d90dad11a2, as per my summary comment above).

@JanWeldert JanWeldert marked this pull request as ready for review March 3, 2025 12:54
if isinstance(selections, OneDimBinning):
assert selections.name not in b.names
else:
assert binnings.count(b) == 1, 'Binning used more than once, modify your selection'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to restrict user to use different MultidimBinning for each selection? I feel like it would be more convenient to allow using duplicate binning. For example one can think of a situation where we have three PID bins, but only middle bin has different binning. If one can not use duplicate binning, they will have to define a selection for lower+higher PID bins and then a 3d binning in (E, coszen, PID).
This of course would not affect the calculation, but, in my opinion, it could just be more convenient to be able to assign same binning to different selections.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only enforce this restriction for arbitrary selections defined in a list of strings. You can use the OneDimBinning selection option for the case of three PID bins where the middle bin should get a different binning in E and coszen. The reason for not allowing it for the arbitrary selections is that this would split your statistic (MC and real event counts) and increase your uncertainty in each bin. In general, if the variable used for the split is also part of the binning, the OneDimBinning option should be used.
Or do you want to have other cuts additional to the PID split that are different for each PID bin?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was just an example. I agree that having duplicate binning will split your statistic, but if the selection variable carries some information useful for oscillation fit, it could also be balanced out by its contribution to sensitivity. Also like you suggested, it would be useful in the case when there multiple variables used in selection and PID (with different binning for middle bin) is just one of them

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, how about making it a warning instead? I think we should also warn that the selection variables should not be part of the MultiDimBinning(s), because the selection itself already effectively acts as binning here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think warning would be better. And I think it's a good idea to have a warning about not mixing selection and binding variables.

I also just thought that the same statistic split can be achieved if a variable which has no impact on the fit is used in regular multidim binning. I don't think there is a way to have a warning there though, so it is up to the analyzer to make sure binning variables carry some information

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, in the end, we have to rely on the analyzers to know what they are doing. 😅

Copy link
Contributor

@marialiubarska marialiubarska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @JanWeldert and @thehrh for putting work into this! The modifications look good to me. The only question I have is about ability to assign duplicate binnings to different selections, which is more about convenience then functionality

…ifferent tasks), new pipeline unit test & minor logic fix
…ning dim. in a cut expression, allow duplicate binnings, more detailed logging); add notes to class docstring; couple superficial mods
….py; comment on possible caching optimisation in pipeline.py; use built-in DistributionMaker profiling functionality in notebook and reduce no. of pseudoexperiments for usability
@thehrh thehrh merged commit fe8a899 into icecube:master Mar 13, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants