Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update type_histogram.R #288

Merged
merged 6 commits into from
Feb 2, 2025
Merged

Conversation

eleuven
Copy link
Contributor

@eleuven eleuven commented Jan 21, 2025

do not force common break points across facets. this can be done by passing break points to the break option. however, passing a vector to break is currently broken as is.null() is not vectorised.

see here for some background

Fixes #269

do not force common break points across facets. this can be done through the break option. however, passing a vector to break is currently broken as is.null() is not vectorised.
Copy link
Owner

@grantmcdermott grantmcdermott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this.

I can see that a couple of tests are failing. I'm not sure if it's because of things that I've pointed out in my quick review of the code or something else. Hopefully I'll have time to investigate properly this evening.

R/type_histogram.R Outdated Show resolved Hide resolved
Comment on lines 52 to 56
datapoints_breaks = hist(datapoints$x, breaks = hbreaks, plot = FALSE)
datapoints = split(datapoints, list(datapoints$by, datapoints$facet))
datapoints = Filter(function(k) nrow(k) > 0, datapoints)

datapoints = lapply(datapoints, function(k) {
h = hist(k$x, breaks = datapoints_breaks$breaks, plot = FALSE)
h = hist(k$x, breaks = hbreaks, plot = FALSE)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can delete the datapoints_breaks object, since we we should still use it by default in the case of grouped histograms. The thing we want to avoid is different binning widths for each group (unless the user explicitly requests it) which is why we calculate it on the full dataset first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, but atm there is no possibility of different binning for each group/facet since datapoint_breaks is always passed to breaks, and there is no possibility of passing a vector of breaks because of ifelse(). it would be nice if breaks accepted the same possibilities as hist (function, vector, number).

i think that different binning by group/facet often makes sense when freq=FALSE (missing atm).

thanks for looking into this!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps introducing an option to allow for free breaks (as in this commit) is an idea...

@grantmcdermott
Copy link
Owner

grantmcdermott commented Jan 28, 2025

This is great, thanks @eleuven. I finally had a chance to play around with your suggestion tonight and agree that we need a solution like the one you've proposed here.

As it happens, we've just been resolving a similar issue for joint bandwidth selection in grouped density plots (e.g., see here). The solution we landed on there was to introduce a joint.bw argument that takes the arguments "mean" (default), "full", or "none".

My only hesitation right now is thinking about argument consistency across type_density() and type_histogram(). Do we want to introduce a joint.breaks argument for the latter (that mirrors joint.bw in the former), or stick with something simple like your freebreaks suggestion?

Let me think on it a bit more and get back to you.

@eleuven
Copy link
Contributor Author

eleuven commented Jan 28, 2025

As it happens, we've just been resolving a similar issue for joint bandwidth selection in grouped density plots (e.g., see here). The solution we landed on there was to introduce a joint.bw argument that takes the arguments "mean" (default), "full", or "none".

My only hesitation right now is thinking about argument consistency across type_density() and type_histogram(). Do we want to introduce a joint.breaks argument for the latter (that mirrors joint.bw in the former), or stick with something simple like your freebreaks suggestion?

it looks like my freebreaks suggestion coincides with "full", or "none" in the joint.bw so it make sense to use at least some similar naming

i do not know what the historical background is for "mean", but it strikes me as a bit odd and that "none" is a sensible default, and in the case of a histogram i am also not sure whether it makes sense

however, if i am not mistaken then what is now implemented by "mean" could also be achieved by the user through using "full" + passing a custom bw, so in that sense i do not think the joint.bw implementation is conceptually very clean as it mixes up the bandwidth estimator and the population for which it should be computed

just thinking out loud though...

@grantmcdermott
Copy link
Owner

it looks like my freebreaks suggestion coincides with "full", or "none" in the joint.bw so it make sense to use at least some similar naming

Yup (specifically "none").

however, if i am not mistaken then what is now implemented by "mean" could also be achieved by the user through using "full" + passing a custom bw, so in that sense i do not think the joint.bw implementation is conceptually very clean as it mixes up the bandwidth estimator and the population for which it should be computed

The joint.bw arg is ignored if the user passes through a custom (numeric) bw, so I think we do maintain conceptual clarity there. Nonetheless, you're certainly right that it's hard to find something consistent across all cases. We've played around with many different datasets (both real and simulated) for density type and it turns out that it's very hard to find one rule that dominates globally. (See the discussion here for some guidance and why settled on "mean" as the most pragmatic default.)

I appreciate the feedback and discussion, though. Right now, I'm leaning towards a joint.breaks argument... Let me play around with an implementation to see what it feels like in practice.

@grantmcdermott
Copy link
Owner

Okay, after experimenting, I've decided to stick with a very slightly tweaked version of your free.breaks logical arg. (The added period between "free" and "breaks" allows the lazy user to do things like type_hist(free = TRUE).)

Note, however, that this wasn't the main reason free facet scales weren't working. Rather, it was the fact that we were keeping zero count bins. So I've added a new drop.zeros arg to handle that case (which it now does by default) and updated the documentation to explain when these two args do.

Thanks again for pushing this forward @eleuven!

@grantmcdermott grantmcdermott merged commit da6e957 into grantmcdermott:main Feb 2, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

facet.args=list(free=TRUE) does not free x-axis in type="hist"
2 participants