Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enh!: Respect input order of collection in summary #1364

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

hoxbro
Copy link
Member

@hoxbro hoxbro commented Oct 11, 2024

This is a breaking change because the resulting array values' keyword arguments are no longer sorted, which affects the array's depth ordering.

With the current implementation, the depth order is decided based on the input variables to ds.summary; this PR will make it so it is decided based on the input order itself. Nothing is inherently wrong with the current implementation, but it makes it cumbersome to define a custom ordering of the output array, and can be hard to wrap around.

An example is that the following ds.summary(val=ds.min("s"), a=ds.where(ds.min("s"))) and ds.summary(val=ds.min("s"), z=ds.where(ds.min("s"))), will currently give different output arrays. With a the first layer would be a and the second being val; with z the first layer would be val and the second being z. In a more real-world use case, you could imagine adding an extra input ds.summary(..., b=...), which will not be appended to the end of the dataset but put into the middle.

This is hard to see if you do not know it is happening. An example is in HoloViews if you pass a summary as an aggregator. The only reason this can be "easily" seen in the following example is that a/z is an int layer and will, therefore, show -1 (ints nan value) as light blue.

a z
image image
import datashader as ds
import numpy as np
import pandas as pd

import holoviews as hv
from holoviews.operation.datashader import rasterize

hv.extension("bokeh")

num = 10000
np.random.seed(1)

dists = {
    cat: pd.DataFrame(
        {
            "x": np.random.normal(x, s, num),
            "y": np.random.normal(y, s, num),
            "s": s,
            "val": val,
            "cat": cat,
        }
    )
    for x, y, s, val, cat in [
        (2, 2, 0.03, 0, "d1"),
        (2, -2, 0.10, 1, "d2"),
        (-2, -2, 0.50, 2, "d3"),
        (-2, 2, 1.00, 3, "d4"),
        (0, 0, 3.00, 4, "d5"),
    ]
}

df = pd.concat(dists, ignore_index=True)
points = hv.Points(df)
agg = ds.summary(val=ds.min("s"), a=ds.where(ds.min("s")))
# agg = ds.summary(val=ds.min("s"), z=ds.where(ds.min("s")))
plot = rasterize(points, aggregator=agg)#.opts(tools=["hover"])
plot

I need to think if there are some consequences that I currently don't see. One thing could be __hash__ is right now sensitive to input order, even though the data could be the same

Copy link

codecov bot commented Oct 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.57%. Comparing base (09ea0f0) to head (e64b2bc).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1364   +/-   ##
=======================================
  Coverage   88.57%   88.57%           
=======================================
  Files          93       93           
  Lines       18644    18648    +4     
=======================================
+ Hits        16514    16518    +4     
  Misses       2130     2130           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@hoxbro hoxbro marked this pull request as draft October 12, 2024 06:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant