Alternative variable binning approach #849

JanWeldert · 2025-02-14T13:13:40Z

Similar to #835, this PR introduces the option to use different (regular) binnings in an analysis.
Which events use which binning depends on a separate variable called cut_var. This can be, for example, the pid value, but also the number of hit modules.
I tried to modify as little code as possible while still providing all changes necessary to use the new binning type. An example notebook is also provided. This PR introduces a new binning class, VarBinning, which holds multiple MultiDimBinning objects and one OneDimBinning that represents the cut variable. The main change when using the VarBinning class is that histogramming no longer happens in a dedicated stage but in the output function of the pipeline. Consequently, a pipeline using VarBinning cannot have a hist stage.
A VarBinning is defined by passing a list in the binning config file.
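
For readers who want the idea in code: below is a minimal NumPy-only sketch of the per-event logic described above, not the implementation in this PR. Events are first assigned to a bin of the cut variable, and each group is then histogrammed with its own regular binning. The function name `hist_var_binning`, the variable names, and all bin edges are hypothetical.

```python
import numpy as np

def hist_var_binning(events, cut_var, cut_edges, binnings, weights=None):
    """Histogram events with a different regular binning per cut-variable bin.

    events    : dict of 1D arrays, one entry per variable
    cut_var   : name of the variable that decides which binning applies
    cut_edges : bin edges of the cut variable (n_groups + 1 values)
    binnings  : list of n_groups dicts mapping variable name -> bin edges
    """
    if len(binnings) != len(cut_edges) - 1:
        raise ValueError("need exactly one binning per cut-variable bin")
    if weights is None:
        weights = np.ones_like(events[cut_var], dtype=float)

    hists = []
    for lo, hi, binning in zip(cut_edges[:-1], cut_edges[1:], binnings):
        # Half-open interval [lo, hi): each event lands in at most one group
        mask = (events[cut_var] >= lo) & (events[cut_var] < hi)
        sample = np.column_stack([events[name][mask] for name in binning])
        hist, _ = np.histogramdd(
            sample, bins=list(binning.values()), weights=weights[mask]
        )
        hists.append(hist)
    return hists

# Toy events (hypothetical variable names)
rng = np.random.default_rng(0)
events = {
    "reco_energy": rng.uniform(1.0, 100.0, 1000),
    "reco_coszen": rng.uniform(-1.0, 1.0, 1000),
    "pid": rng.uniform(0.0, 1.0, 1000),
}
cut_edges = [0.0, 0.5, 1.0]
binnings = [
    # fine binning for the first pid bin
    {"reco_energy": np.linspace(1, 100, 11), "reco_coszen": np.linspace(-1, 1, 6)},
    # coarse binning for the second pid bin
    {"reco_energy": np.linspace(1, 100, 5), "reco_coszen": np.linspace(-1, 1, 5)},
]
hists = hist_var_binning(events, "pid", cut_edges, binnings)
```

The result is a list with one histogram per cut-variable bin, which mirrors the fact that a pipeline with a VarBinning returns several outputs rather than a single one.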

thehrh · 2025-02-20T12:46:05Z

Before doing a more detailed review, let me try to summarise a few key aspects of this PR:

- one binning dimension can now also serve as the variable on which the binnings in the remaining dimensions depend (still always hyperrectangular unless masking is used, same syntax as dependent binnings: one universal mask or list of masks); it seems unlikely that more than one such distinguished dimension will be required in practice
- in contrast to "Introducing support for variable binning with event classes (species)" #835, we are not introducing yet another events class and need no "event species names" (this role is played by the bins in the distinguished dimension)
- the modifications of parse_pipeline_config, allowing it to instantiate the new binning_dict entry, are moderate in both cases
- Container finally implements the "sum" translation mode for histogramming events (low effort, uses the pre-existing array_to_binned method)
- this mode replaces a pipeline's utils.hist service when a VarBinning is set as the output_binning
  - all events are jointly processed by the pipeline: only at the end of _get_outputs are they split into separate ContainerSets, one per bin in the distinguished dimension, and histogrammed (using the now built-in functionality) according to the appropriate MultiDimBinning in the remaining dimensions
  - accordingly, a list of MapSets is returned, requiring minor modifications to the DistributionMaker class
- a major benefit with respect to "Introducing support for variable binning with event classes (species)" #835 is the removed need for many invasive and often boilerplate changes to Map functions, whose thoroughness only emphasises the large maintenance burden that accompanies them
  - Possibly, if the binning had been more flexible from the start, the class would have been integrated into Map like this.
  - But the solution in this PR shows that a Map needn't be aware of this type of variable binning for conducting statistical analysis. Instead, only a few simple additions to the Analysis class are required, similar to what we already have there when we are distinguishing between a DistributionMaker and a Detectors instance.
  - The drawback is of course that users who are e.g. interactively computing metric values will have to manually perform the sum in the same way as fit_recursively does it.
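
To illustrate the manual sum mentioned in the last point, here is a minimal sketch that uses plain NumPy arrays in place of PISA objects. It assumes a simple Pearson chi-square as the metric; the helper names `chi2` and `total_chi2` are hypothetical and are not part of PISA.

```python
import numpy as np

def chi2(expected, observed):
    """Pearson chi-square of one histogram, skipping bins with zero expectation."""
    expected = np.asarray(expected, dtype=float)
    observed = np.asarray(observed, dtype=float)
    filled = expected > 0
    return float(np.sum((observed[filled] - expected[filled]) ** 2 / expected[filled]))

def total_chi2(expected_hists, observed_hists):
    """Sum the metric over all per-selection histograms (shapes may differ)."""
    if len(expected_hists) != len(observed_hists):
        raise ValueError("expected and observed lists differ in length")
    return sum(chi2(e, o) for e, o in zip(expected_hists, observed_hists))

# Two selections with different binnings: a 2x3 and a 4-bin histogram
expected = [np.full((2, 3), 10.0), np.full((4,), 5.0)]
observed = [np.full((2, 3), 12.0), np.full((4,), 5.0)]
total = total_chi2(expected, observed)
```

Because each selection has its own binning, the histograms cannot be stacked into one array; the metric is evaluated per histogram and only the scalar values are added.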

In conclusion, I am highly in favour of the approach proposed in this PR over that in #835.

marialiubarska

Hi, sorry for being very late to this discussion. I think the code looks good and it is a good option for a variable binning with one "split" variable.

I understand that #835 introduces a lot of changes. While I tried to avoid affecting existing functionality as much as possible, I will fully understand if people don't feel comfortable pushing it to main and prefer adding this version instead.

In my case I specifically needed to introduce arbitrary cuts for classification, so I don't think this solution would be suitable for my analysis. However, since I might be the only person who needs this functionality for now, I would not have a problem working in a separate branch.

thehrh · 2025-02-21T10:09:31Z

Hi Maria, a set of n "arbitrary" cuts/selection criteria can be represented by a one-dimensional binning too, can't it? You would just have to evaluate which one of your cuts each event satisfies and add one unique number per such cut to each event in the events file before running the pipeline, then define the one-dimensional split bin edges accordingly. This way, double counting by PISA wouldn't even be a concern (you as analyser would need to make sure cuts are mutually exclusive anyway, I presume).
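
A rough sketch of this labelling step, written with plain NumPy outside of PISA: each event gets the index of the cut it satisfies, and the split bin edges are chosen so that label k falls into bin k. The function `label_events` and the toy values are hypothetical; the variable names are borrowed from the selection example later in this thread.

```python
import numpy as np

def label_events(events, cuts):
    """Return one integer label per event: index of the cut it passes, -1 if none."""
    n_events = len(next(iter(events.values())))
    labels = np.full(n_events, -1, dtype=int)
    n_passed = np.zeros(n_events, dtype=int)
    for index, cut in enumerate(cuts):
        mask = cut(events)
        labels[mask] = index
        n_passed += mask
    if np.any(n_passed > 1):
        # An event passing two cuts would be ambiguous
        raise ValueError("cuts are not mutually exclusive")
    return labels

events = {
    "nDOM": np.array([7, 12, 7, 10, 3]),
    "l7_muon_classifier_prob_nu": np.array([0.9, 0.4, 0.5, 0.5, 0.99]),
}
cuts = [
    lambda ev: (ev["nDOM"] == 7) & (ev["l7_muon_classifier_prob_nu"] >= 0.8),
    lambda ev: (ev["nDOM"] >= 10) & (ev["l7_muon_classifier_prob_nu"] <= 0.5),
]
labels = label_events(events, cuts)

# Edges such that label k falls into bin k; label -1 lies below the first edge
split_edges = np.arange(len(cuts) + 1) - 0.5
counts, _ = np.histogram(labels, bins=split_edges)
```

Events that satisfy none of the cuts keep the label -1 and fall outside the split binning, so they are simply not histogrammed.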

marialiubarska · 2025-02-24T22:41:53Z

Yes, I agree that this is an option, but it would be more of a workaround than actual functionality. As you suggested, it would mean that the analyzer needs to regenerate the event file every time the binning is changed or adjusted. In my opinion, doing it this way will not only make setting up a new binning less convenient, but will also introduce an additional opportunity for mistakes by adding a necessary step outside PISA. Re-matching particular numbered bins to variable cuts (e.g. for plotting or to check that bins are numbered correctly) will also be on the analyzer, which is another opportunity for mistakes.

marialiubarska · 2025-02-24T22:44:23Z

Sorry, clicked the button by accident

marialiubarska · 2025-02-24T22:46:47Z

@JanWeldert if we decide to go ahead with it, would you mind adding a test for VarBinning?

thehrh · 2025-02-24T23:12:15Z

I don't think this has to be external to PISA at all. One option I see is for this functionality to become a service, or we allow split to be specified as a series of cuts similar to https://github.com/icecube/pisa/pull/835/files#diff-f20428059ca19bffc11af4daed3fd79017584ca4910e41eb9cc6c37dda35d4afR670 ? Wouldn't one just need to change the logic a bit around https://github.com/icecube/pisa/pull/849/files#diff-79820012d0e6e41b7ff46042be416541ba3b7924a178e039cf7060304f1f522bR409 @JanWeldert ?

JanWeldert · 2025-02-25T08:32:22Z

We could allow a series of cuts, we just have to make sure the different cuts are applied correctly. E.g. if you specify two cuts, do you want to split events by all possible combinations of the cut values or only combine specific combinations (like first-first, second-second, ...).
I also wanted to point out that if this PR in its current form doesn't provide the functionality you need @marialiubarska, we (mostly I) have to work on it to change that. You shouldn't feel the need to work in a separate branch.
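
To make the two interpretations of a series of cuts explicit, here is a small illustrative sketch in pure Python (nothing PISA-specific; the cut strings and helper names are hypothetical): either every combination of cut values defines a group, or only cuts at matching positions are combined.

```python
from itertools import product

cuts_a = ["nDOM < 10", "nDOM >= 10"]
cuts_b = ["pid < 0.5", "pid >= 0.5"]

def combine(parts):
    """Join individual cut expressions into one parenthesised selection."""
    return " & ".join(f"({part})" for part in parts)

def all_combinations(*cut_lists):
    """Every combination of cut values: n_a * n_b * ... selections."""
    return [combine(combo) for combo in product(*cut_lists)]

def pairwise(*cut_lists):
    """Only first-first, second-second, ...: requires lists of equal length."""
    if len({len(cut_list) for cut_list in cut_lists}) != 1:
        raise ValueError("pairwise combination needs lists of equal length")
    return [combine(combo) for combo in zip(*cut_lists)]

combos = all_combinations(cuts_a, cuts_b)
pairs = pairwise(cuts_a, cuts_b)
```

With two cuts per variable, the first option yields four selections (and needs four binnings), the second only two.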

thehrh · 2025-02-25T11:23:04Z

I should have written "series of selections" instead (consider cut_var -> selections as in #835 a more appropriate designation). Look at https://github.com/icecube/pisa/pull/835/files#diff-5f7216eac81cdb9c0a8129aa6cb83056cd518b9dbc66c7c0a93db9a24484bbc2R3932 for example.

How about, in addition to the OneDimBinning currently implemented, allowing the user to pass a comma-separated list of such selections, written out explicitly (not in some implicit form that has to be interpreted by PISA), such as

reco_var_binning.split = (nDOM == 7) & (l7_muon_classifier_prob_nu >= 0.8), (nDOM >= 10) & (l7_muon_classifier_prob_nu <= 0.5)

for a split into two sets of events each with their own dedicated binning? We'd have to make sure that VarBinning stores the individual selection expressions such that they can be accessed through the instantiated Pipeline for plotting etc.
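
As a rough illustration of how such a split string could be handled (a sketch only, not the logic in events_pi.py or in this PR): the string is split at top-level commas, and each expression is evaluated against a dict of event arrays. The helper names are hypothetical, and `eval` is only acceptable here under the assumption that the config string comes from a trusted source.

```python
import numpy as np

def split_selections(spec):
    """Split a comma-separated selection string at top-level commas only."""
    selections, current, depth = [], [], 0
    for char in spec:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        if char == "," and depth == 0:
            selections.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    selections.append("".join(current).strip())
    return [selection for selection in selections if selection]

def apply_selection(expression, events):
    """Evaluate one selection expression on a dict of event arrays."""
    # Assumption: the expression is trusted; builtins are disabled regardless
    mask = eval(expression, {"__builtins__": {}}, dict(events))
    return np.asarray(mask, dtype=bool)

spec = (
    "(nDOM == 7) & (l7_muon_classifier_prob_nu >= 0.8), "
    "(nDOM >= 10) & (l7_muon_classifier_prob_nu <= 0.5)"
)
events = {
    "nDOM": np.array([7, 12, 7, 10]),
    "l7_muon_classifier_prob_nu": np.array([0.9, 0.4, 0.5, 0.5]),
}
selections = split_selections(spec)
masks = [apply_selection(selection, events) for selection in selections]

# Selections are mutually exclusive if no event passes more than one of them
exclusive = bool(np.all(np.sum(masks, axis=0) <= 1))
```

Keeping the original expression strings in `selections` is what would allow them to be looked up later, e.g. for plot labels.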

We have logic for applying selections such as these in https://github.com/icecube/pisa/blob/master/pisa/core/events_pi.py#L500. It looks like this was copied into the utils.hist service in #835 (https://github.com/icecube/pisa/pull/835/files#diff-3f79be10e1c93fd330b7e4a4f904b1cfe8bfd766565fb7f8b352b5a38c120cf1R285).

(I still think we should avoid making a Map aware of the new binning as in https://github.com/icecube/pisa/pull/835/files#diff-2817515fee33820beb651d8b78c6dc430f377d845a8a92789c8df4d90dad11a2, as per my summary comment above).

marialiubarska · 2025-03-06T00:29:43Z

pisa/core/binning.py

+            if isinstance(selections, OneDimBinning):
+                assert selections.name not in b.names
+            else:
+                assert binnings.count(b) == 1, 'Binning used more than once, modify your selection'


Do we need to restrict the user to a different MultiDimBinning for each selection? I feel it would be more convenient to allow duplicate binnings. For example, one can think of a situation where we have three PID bins, but only the middle bin has a different binning. If one cannot use duplicate binnings, they will have to define a selection for the lower+higher PID bins and then a 3D binning in (E, coszen, PID).
This of course would not affect the calculation but, in my opinion, it would be more convenient to be able to assign the same binning to different selections.

JanWeldert · 2025-03-06

I only enforce this restriction for arbitrary selections defined in a list of strings. You can use the OneDimBinning selection option for the case of three PID bins where the middle bin should get a different binning in E and coszen. The reason for not allowing it for the arbitrary selections is that this would split your statistics (MC and real event counts) and increase your uncertainty in each bin. In general, if the variable used for the split is also part of the binning, the OneDimBinning option should be used.
Or do you want to have other cuts in addition to the PID split that are different for each PID bin?

marialiubarska · 2025-03-06

Yes, this was just an example. I agree that having duplicate binnings will split your statistics, but if the selection variable carries some information useful for the oscillation fit, this could be balanced out by its contribution to the sensitivity. Also, as you suggested, it would be useful in the case where multiple variables are used in the selection and PID (with a different binning for the middle bin) is just one of them.

JanWeldert · 2025-03-06

Hmm, how about making it a warning instead? I think we should also warn that the selection variables should not be part of the MultiDimBinning(s), because the selection itself already effectively acts as binning here.

marialiubarska · 2025-03-06

Yeah, I think a warning would be better. And I think it's a good idea to have a warning about not mixing selection and binning variables.

I also just thought that the same statistics split can be achieved if a variable which has no impact on the fit is used in a regular multidim binning. I don't think there is a way to have a warning there though, so it is up to the analyzer to make sure binning variables carry some information.

JanWeldert · 2025-03-06

yeah, in the end, we have to rely on the analyzers to know what they are doing. 😅
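
The two warnings discussed in this thread could look roughly like the following sketch. It is not the code that was merged: the function `check_selections`, its arguments, and the message texts are hypothetical, and binnings are represented only by tuples of dimension names.

```python
import re
import warnings

def check_selections(selections, binning_names):
    """Warn about duplicate binnings and about selection variables that are
    also binning dimensions.

    selections    : list of selection expression strings
    binning_names : list of tuples of dimension names, one tuple per selection
    """
    if len(selections) != len(binning_names):
        raise ValueError("need exactly one binning per selection")

    if len(set(binning_names)) < len(binning_names):
        warnings.warn(
            "The same binning is assigned to more than one selection; "
            "this splits the event statistics across selections."
        )

    for selection, names in zip(selections, binning_names):
        # Crude identifier extraction; sufficient for simple cut expressions
        used = set(re.findall(r"[A-Za-z_]\w*", selection))
        overlap = used & set(names)
        if overlap:
            warnings.warn(
                f"Selection '{selection}' uses binning dimension(s) "
                f"{sorted(overlap)}; the selection already acts as a binning."
            )

selections = ["nDOM < 10", "(nDOM >= 10) & (reco_energy > 5)"]
binning_names = [("reco_energy", "reco_coszen"), ("reco_energy", "reco_coszen")]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_selections(selections, binning_names)
messages = [str(w.message) for w in caught]
```

Using warnings rather than assertions leaves the decision with the analyzer, which matches the conclusion reached above.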

marialiubarska

Thank you very much @JanWeldert and @thehrh for putting work into this! The modifications look good to me. The only question I have is about the ability to assign duplicate binnings to different selections, which is more about convenience than functionality.

JanWeldert added 5 commits January 3, 2025 09:42

Add variable binning and adjust output functions

fcae83c

Force events representation and add histogramming to translations

8c20429

Adjust analysis.py and config parser

35dcf0e

Add example

81c308b

Expand example a bit

eaf6235

JanWeldert requested review from thehrh and marialiubarska February 14, 2025 13:13

fit_hypo does not know hypo_asimov_dist

3215d55

thehrh mentioned this pull request Feb 20, 2025

Stage/service to define different binning for analysis #108

Closed

thehrh added this to the PISA 4.2 milestone Feb 20, 2025

marialiubarska approved these changes Feb 21, 2025

marialiubarska marked this pull request as ready for review February 24, 2025 22:42

marialiubarska marked this pull request as draft February 24, 2025 22:43

thehrh and others added 5 commits February 26, 2025 23:57

generalise pipeline config parser and varbinning to accept selection string also

87f21fb

Make selection string work with container

854e356

Check if selections are exclusive

eb78c69

Define selection before checking it

0ee0429

Simple test and more checks for VarBinning

ba8c4a7

JanWeldert marked this pull request as ready for review March 3, 2025 12:54

JanWeldert requested a review from marialiubarska March 3, 2025 12:55

marialiubarska reviewed Mar 6, 2025

marialiubarska approved these changes Mar 6, 2025

docs & comments, split up check_varbinning (was performing two very different tasks), new pipeline unit test & minor logic fix

e9b0b5b

JanWeldert commented Mar 6, 2025

pisa/core/pipeline.py (outdated, resolved)

thehrh requested changes Mar 6, 2025

thehrh and others added 5 commits March 6, 2025 16:05

exclusivity check only when requesting new output binning in get_outputs

9f078c6

Add docstrings and change assert to warn

92409cf

fix undefined varbinning name in config parser, debug logging of selection counts and warn if total is zero, add test for non-exclusive selections and empty selections

94a0529

superficial: comments & docstrings, thousands of lines of ipynb output cleared, moved core pipeline output calculations into separate functions for clarity

7837426

also make separate functions for parsing different types of binning from config + minor

f7a53ad

JanWeldert added the enhancement label Mar 11, 2025

Merge remote-tracking branch 'origin/master' into var_bin

b0eabba

thehrh reviewed Mar 12, 2025

pisa/core/binning.py (outdated, resolved)

adapt varbinning init and tests (require > 1 binnings, detect any binning dim. in a cut expression, allow duplicate binnings, more detailed logging); add notes to class docstring; couple superficial mods

1a3dd76

thehrh reviewed Mar 13, 2025

pisa/analysis/analysis.py (resolved)

thehrh reviewed Mar 13, 2025

pisa/analysis/analysis.py (resolved)

comments/NotImplementedErrors for generalized_poisson_llh in analysis.py; comment on possible caching optimisation in pipeline.py; use built-in DistributionMaker profiling functionality in notebook and reduce no. of pseudoexperiments for usability

5c2d85f

thehrh approved these changes Mar 13, 2025

thehrh merged commit fe8a899 into icecube:master Mar 13, 2025
2 checks passed

