Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add size_by argument to model_tree() #720

Merged
merged 20 commits into from
Jan 15, 2025
Merged

Add size_by argument to model_tree() #720

merged 20 commits into from
Jan 15, 2025

Conversation

barrettk
Copy link
Collaborator

@barrettk barrettk commented Sep 30, 2024

Core new feature: size_by

  • Allows users to control the node sizing by a particular column present in .log_df

  • Important to note that ofv cant be used without first calling add_summary(), which is consistent with the previous behavior of color_by and include_info.

  • This argument does support logical columns, however I noticed that the normalized sizing can differ here depending when the first TRUE value originates. If it's the first value, all the nodes appear smaller. Anywhere else and the "base size" is larger. This observation is likely true for numeric columns too, though is less apparent when the values are all closer to each other.

    • We may want to revisit NA values for numeric columns

image

Other changes

  • Fixed a bug when coloring numeric columns. See details in 6d11e25
  • Added font_size argument. While updating the vignette I foresaw a future request to adjust this, and it was easy enough to pass to collapsibleTree::collapsibleTreeNetwork (with one other change)
  • Updated vignette with new example
  • Formatting fix: dont display tooltips for 'start' node (fc65532)
  • Css simplifications and other minor adjustments

 - Allows users to control the node sizing by a particular column present in `.log_df`

 - Important to note that `ofv` cant be used without first calling `add_summary()`, which is consistent with the previous behavior of `color_by` and `include_info`

 - This argument does support logical columns, however I noticed that the normalized sizing can differ here depending when the first `TRUE` value originates. If it's the first value, all the nodes appear smaller. Anywhere else and the "base size" is larger. This observation is likely true for numeric columns too, though is less apparent when the values are all closer to each other.
    - `collapsibleTree::collapsibleTreeNetwork` is only intended to support numeric columns for sizing, so I opted to map these values to: TRUE=5, FALSE=1, NA=3 for the time being. We may want to revisit NA values for both numeric and logical columns

 - Added examples to docs
 - While the implementation and tests are done, I think we still want to revist the NA handling, which also impacts the "start" node size.
@barrettk barrettk requested a review from seth127 October 4, 2024 16:34
 - While updating the vignette, I noticed that the size_by and color_by arguments occasionally didnt line up for numeric columns. Numeric columns need to be sorted first so the right colors are assigned. This was previously "tested" using the character 'run' column, though was more observable when testing against objective function values (which dont necessarily appear in ascending/descending order)
 - uses leafCount, which makes the nodes containing more child nodes, larger
 - Improve formatting of code in some cases
 - Adds additional tests when required or specified columns are missing

 - doc adjustments
 - If color_by or size_by is specified, ensure these columns are also part of the tooltip. This avoids the need for specifying new columns (e.g., 'Out of Date') in both the tooltip and color_by
 - doc and text adjustments
 - Revert change from previous commit (comment it out) until a decision has been made.
@barrettk barrettk requested review from andersone1 and removed request for andersone1 January 9, 2025 19:26
@seth127 seth127 requested a review from kyleam January 9, 2025 22:09
@seth127 seth127 added the next release Issues/PRs to be included on the next release label Jan 9, 2025
Copy link
Collaborator

@kyleam kyleam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't finished reviewing these, but here's an initial wave

Testing this out, things are overall working as expected on my end. I noticed two behaviors that look off to me. Here's an example starting with log_df from the size_tree_by tests.

log_df$foo <- 1:5
log_df$bar <- 5:1
log_df$baz <- -2:2

First: feeding negative values to size_by throws things off. There are warning messages and the relative sizing does not match the values.

model_tree(log_df, add_summary = FALSE, size_by = "baz", static = TRUE)
#> Warning messages:
#> 1: In sqrt(x$SizeOfNode * scaleFactor) : NaNs produced
#> 2: In sqrt(x$SizeOfNode * scaleFactor) : NaNs produced

baz

Perhaps model_tree should scale the input? (Only the relative values should matter for sizing.)

Second: I was surprised about how nodes were sized differently if values were reversed.

model_tree(log_df, add_summary = FALSE, size_by = "foo", static = TRUE)

foo

model_tree(log_df, add_summary = FALSE, size_by = "bar", static = TRUE)

bar

I expected the same set of sizes to be used for those nodes, just in reverse (e.g., run 1 in bar would be the same size as run 5 in foo). Is the above behavior intended?


is only intended to support numeric columns for sizing, so I opted to map these values to: TRUE=5, FALSE=1, NA=3

Should this part of the PR description have been updated for f6e6554 (Dont support logical columns, improve error formatting)?

R/model-tree.R Outdated Show resolved Hide resolved
R/model-tree.R Outdated Show resolved Hide resolved
R/model-tree.R Outdated Show resolved Hide resolved
R/model-tree.R Show resolved Hide resolved
R/model-tree.R Show resolved Hide resolved
R/model-tree.R Outdated Show resolved Hide resolved
R/model-tree.R Outdated Show resolved Hide resolved
R/model-tree.R Outdated Show resolved Hide resolved
 - Dont create a node_size column if size_by is NULL or an unsupported column type is specified. Instead set to default sizing (leafCount) and set as an attribute
 - Add a test for this. Perhaps not necessary, but good to have to capture any future change in handling
 - Store color_by attribute within color_tree_by instead of using conditional logic outside of the function.

 - Add comments for attribute for both size_by and color_by handling setting to make this more clear
 - Instead of relying on the raw data, perform the following operations:
   - Normalize values based on standard deviation, which ensures sizes are comparable across datasets and prevents large values from dominating.
   - Scale all values to be within `rescale_to`, which also helps ensure consistent sizing, but also handles negative values (some of which may be introduced by the previous operation).

 - The larger the standard deviation is, the larger the the nodes can be. e.g. if rescale_to was c(0, 100) and the values in the data were c(-2, -1, 0, 1, 2), the largest node would almost fill the page.

 - `rescale_to` is not currently an argument that can be provided by the user. There may be reasons to want to visualize size differences to a greater degree in the future (perhaps objective function value
@barrettk
Copy link
Collaborator Author

I introduced relative scaling in d6730dd which helps keep the node sizing consistent. The one case where this doesnt work is model_tree(log_df, add_summary = FALSE, size_by = "bar")

I believe I documented this behavior in some commit (haven't been able to find it), but the actual node sizes seem to be scaled based on the first node, as if that has a predetermined size. I think this is related to aggFun = "identity"

Bad scaling

> log_df$bar <- 5:1
> model_tree(log_df, add_summary = FALSE, size_by = "bar")

image

Good scaling

> log_df$bar <- 5:1; log_df$bar[1] <- 1
> model_tree(log_df, add_summary = FALSE, size_by = "bar")

image

Back to bad scaling

> log_df$bar <- 5:1; log_df$bar[2] <- 1
> model_tree(log_df, add_summary = FALSE, size_by = "bar")

image

@kyleam kyleam linked an issue Jan 15, 2025 that may be closed by this pull request
R/model-tree.R Show resolved Hide resolved
@barrettk barrettk requested a review from kyleam January 15, 2025 20:36
R/model-tree.R Outdated
Comment on lines 763 to 774
# Rescale values based on standard deviation
# - a large SD can lead to very large nodes
mean_val <- mean(tree_data$node_size, na.rm = TRUE)
sd_val <- sd(tree_data$node_size, na.rm = TRUE)

if(sd_val != 0){
tree_data$node_size <- (tree_data$node_size - mean_val) / sd_val
}else{
# Would only happen if all values are the same
# - avoids dividing by 0
tree_data$node_size <- rep(1, length(tree_data$node_size))
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure any of this is needed?

xs <- rnorm(100, mean = -30, sd = 70)
all.equal(scales::rescale(xs, to = c(0, 3)), scales::rescale((xs - mean(xs)) / sd(xs), to = c(0, 3)))
#> [1] TRUE
xs <- sample(1:1000, 100)
all.equal(scales::rescale(xs, to = c(0, 3)), scales::rescale((xs - mean(xs)) / sd(xs), to = c(0, 3)))
#> [1] TRUE

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I asked myself the same question and had trouble coming up with good examples to test but your example proves it to me. I Noticed that passing rescale_to = c(1, 100) made the size differences incredibly large and was trying to normalize to the standard deviation to avoid that, but I seem to have had a brain fart.

 - This didnt impact the final node_size and was redundant. `rescale_to` is the only parameter to be considered when scaling the relative sizes
@barrettk barrettk requested a review from kyleam January 15, 2025 21:04
@barrettk barrettk merged commit 055f903 into main Jan 15, 2025
8 checks passed
@barrettk barrettk deleted the model_tree/size_by branch January 15, 2025 21:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
next release Issues/PRs to be included on the next release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

failure with testthat 3.2.3
3 participants