-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add size_by
argument to model_tree()
#720
Conversation
- Allows users to control the node sizing by a particular column present in `.log_df` - Important to note that `ofv` cant be used without first calling `add_summary()`, which is consistent with the previous behavior of `color_by` and `include_info` - This argument does support logical columns, however I noticed that the normalized sizing can differ here depending when the first `TRUE` value originates. If it's the first value, all the nodes appear smaller. Anywhere else and the "base size" is larger. This observation is likely true for numeric columns too, though is less apparent when the values are all closer to each other. - `collapsibleTree::collapsibleTreeNetwork` is only intended to support numeric columns for sizing, so I opted to map these values to: TRUE=5, FALSE=1, NA=3 for the time being. We may want to revisit NA values for both numeric and logical columns - Added examples to docs
- While the implementation and tests are done, I think we still want to revist the NA handling, which also impacts the "start" node size.
- While updating the vignette, I noticed that the size_by and color_by arguments occasionally didnt line up for numeric columns. Numeric columns need to be sorted first so the right colors are assigned. This was previously "tested" using the character 'run' column, though was more observable when testing against objective function values (which dont necessarily appear in ascending/descending order)
- uses leafCount, which makes the nodes containing more child nodes, larger
- Improve formatting of code in some cases
- They were always NA previously
- Adds additional tests when required or specified columns are missing - doc adjustments
- If color_by or size_by is specified, ensure these columns are also part of the tooltip. This avoids the need for specifying new columns (e.g., 'Out of Date') in both the tooltip and color_by
- doc and text adjustments - Revert change from previous commit (comment it out) until a decision has been made.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't finished reviewing these, but here's an initial wave
Testing this out, things are overall working as expected on my end. I noticed two behaviors that look off to me. Here's an example starting with log_df
from the size_tree_by
tests.
log_df$foo <- 1:5
log_df$bar <- 5:1
log_df$baz <- -2:2
First: feeding negative values to size_by
throws things off. There are warning messages and the relative sizing does not match the values.
model_tree(log_df, add_summary = FALSE, size_by = "baz", static = TRUE)
#> Warning messages:
#> 1: In sqrt(x$SizeOfNode * scaleFactor) : NaNs produced
#> 2: In sqrt(x$SizeOfNode * scaleFactor) : NaNs produced
Perhaps model_tree
should scale the input? (Only the relative values should matter for sizing.)
Second: I was surprised about how nodes were sized differently if values were reversed.
model_tree(log_df, add_summary = FALSE, size_by = "foo", static = TRUE)
model_tree(log_df, add_summary = FALSE, size_by = "bar", static = TRUE)
I expected the same set of sizes to be used for those nodes, just in reverse (e.g., run 1 in bar would be the same size as run 5 in foo). Is the above behavior intended?
is only intended to support numeric columns for sizing, so I opted to map these values to: TRUE=5, FALSE=1, NA=3
Should this part of the PR description have been updated for f6e6554 (Dont support logical columns, improve error formatting)?
- Dont create a node_size column if size_by is NULL or an unsupported column type is specified. Instead set to default sizing (leafCount) and set as an attribute
- Add a test for this. Perhaps not necessary, but good to have to capture any future change in handling
- Store color_by attribute within color_tree_by instead of using conditional logic outside of the function. - Add comments for attribute for both size_by and color_by handling setting to make this more clear
- Instead of relying on the raw data, perform the following operations: - Normalize values based on standard deviation, which ensures sizes are comparable across datasets and prevents large values from dominating. - Scale all values to be within `rescale_to`, which also helps ensure consistent sizing, but also handles negative values (some of which may be introduced by the previous operation). - The larger the standard deviation is, the larger the the nodes can be. e.g. if rescale_to was c(0, 100) and the values in the data were c(-2, -1, 0, 1, 2), the largest node would almost fill the page. - `rescale_to` is not currently an argument that can be provided by the user. There may be reasons to want to visualize size differences to a greater degree in the future (perhaps objective function value
I introduced relative scaling in d6730dd which helps keep the node sizing consistent. The one case where this doesnt work is I believe I documented this behavior in some commit (haven't been able to find it), but the actual node sizes seem to be scaled based on the first node, as if that has a predetermined size. I think this is related to Bad scaling > log_df$bar <- 5:1
> model_tree(log_df, add_summary = FALSE, size_by = "bar") Good scaling > log_df$bar <- 5:1; log_df$bar[1] <- 1
> model_tree(log_df, add_summary = FALSE, size_by = "bar") Back to bad scaling > log_df$bar <- 5:1; log_df$bar[2] <- 1
> model_tree(log_df, add_summary = FALSE, size_by = "bar") |
R/model-tree.R
Outdated
# Rescale values based on standard deviation | ||
# - a large SD can lead to very large nodes | ||
mean_val <- mean(tree_data$node_size, na.rm = TRUE) | ||
sd_val <- sd(tree_data$node_size, na.rm = TRUE) | ||
|
||
if(sd_val != 0){ | ||
tree_data$node_size <- (tree_data$node_size - mean_val) / sd_val | ||
}else{ | ||
# Would only happen if all values are the same | ||
# - avoids dividing by 0 | ||
tree_data$node_size <- rep(1, length(tree_data$node_size)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure any of this is needed?
xs <- rnorm(100, mean = -30, sd = 70)
all.equal(scales::rescale(xs, to = c(0, 3)), scales::rescale((xs - mean(xs)) / sd(xs), to = c(0, 3)))
#> [1] TRUE
xs <- sample(1:1000, 100)
all.equal(scales::rescale(xs, to = c(0, 3)), scales::rescale((xs - mean(xs)) / sd(xs), to = c(0, 3)))
#> [1] TRUE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I asked myself the same question and had trouble coming up with good examples to test but your example proves it to me. I Noticed that passing rescale_to = c(1, 100)
made the size differences incredibly large and was trying to normalize to the standard deviation to avoid that, but I seem to have had a brain fart.
- This didnt impact the final node_size and was redundant. `rescale_to` is the only parameter to be considered when scaling the relative sizes
Core new feature:
size_by
Allows users to control the node sizing by a particular column present in
.log_df
Important to note that
ofv
cant be used without first callingadd_summary()
, which is consistent with the previous behavior ofcolor_by
andinclude_info
.This argument does support logical columns, however I noticed that the normalized sizing can differ here depending when the first
TRUE
value originates. If it's the first value, all the nodes appear smaller. Anywhere else and the "base size" is larger. This observation is likely true for numeric columns too, though is less apparent when the values are all closer to each other.Other changes
font_size
argument. While updating the vignette I foresaw a future request to adjust this, and it was easy enough to pass tocollapsibleTree::collapsibleTreeNetwork
(with one other change)