Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mutate() drops some attributes of a tabyl object #6689

Closed
sfirke opened this issue Feb 6, 2023 · 6 comments
Closed

mutate() drops some attributes of a tabyl object #6689

sfirke opened this issue Feb 6, 2023 · 6 comments

Comments

@sfirke
Copy link
Contributor

sfirke commented Feb 6, 2023

Using dplyr 1.1.0 and janitor 2.2.0, a user reported this problem with a mutate call breaking their janitor pipeline. I see mutate is causing attributes to drop. But I don't understand how or why, especially as I can reproduce this example from Hadley that this shouldn't happen.

This has the $core and $tabyl_type attributes:

library(reprex)
library(dplyr, warn.conflicts = FALSE)
library(janitor, warn.conflicts = FALSE)
mtcars |>
  tabyl(cyl) |>
  attributes()
#> $names
#> [1] "cyl"     "n"       "percent"
#> 
#> $class
#> [1] "tabyl"      "data.frame"
#> 
#> $row.names
#> [1] 1 2 3
#> 
#> $core
#>   cyl  n percent
#> 1   4 11 0.34375
#> 2   6  7 0.21875
#> 3   8 14 0.43750
#> 
#> $tabyl_type
#> [1] "one_way"

This drops the attributes $core and $tabyl_type:

mtcars |>
  tabyl(cyl) |>
  mutate(foo = 1) |>
  attributes()
#> $names
#> [1] "cyl"     "n"       "percent" "foo"    
#> 
#> $row.names
#> [1] 1 2 3
#> 
#> $class
#> [1] "tabyl"      "data.frame"

Do you know why these attributes are dropped compared to the comment I linked above? And thoughts on whether this could/might be addressed on the dplyr side? In the past I had to work around dplyr commands in my code for this reason and it's been great to see the package moving toward more preservation of data.frame attributes 🙏

Created on 2023-02-06 by the reprex package (v2.0.1)

@DavisVaughan
Copy link
Member

I don't think this is really dplyr's fault. Internally we do data[loc] and it seems you are inheriting the [.data.frame method which drops attributes

library(janitor, warn.conflicts = FALSE)

df <- mtcars |> tabyl(cyl)

attributes(df)
#> $names
#> [1] "cyl"     "n"       "percent"
#> 
#> $class
#> [1] "tabyl"      "data.frame"
#> 
#> $row.names
#> [1] 1 2 3
#> 
#> $core
#>   cyl  n percent
#> 1   4 11 0.34375
#> 2   6  7 0.21875
#> 3   8 14 0.43750
#> 
#> $tabyl_type
#> [1] "one_way"

df2 <- df[c("cyl", "n", "percent")]

attributes(df2)
#> $names
#> [1] "cyl"     "n"       "percent"
#> 
#> $row.names
#> [1] 1 2 3
#> 
#> $class
#> [1] "tabyl"      "data.frame"

It is worth noting that [.tbl_df does preserve attributes

library(tibble)

df <- mtcars |> 
  as_tibble()

attr(df, "foo") <- "bar"

attributes(df)
#> $class
#> [1] "tbl_df"     "tbl"        "data.frame"
#> 
#> $row.names
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32
#> 
#> $names
#>  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
#> [11] "carb"
#> 
#> $foo
#> [1] "bar"

df2 <- df[c("mpg", "cyl")]

attributes(df2)
#> $names
#> [1] "mpg" "cyl"
#> 
#> $row.names
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32
#> 
#> $class
#> [1] "tbl_df"     "tbl"        "data.frame"
#> 
#> $foo
#> [1] "bar"

@DavisVaughan
Copy link
Member

DavisVaughan commented Feb 7, 2023

FWIW this is not a new issue in 1.1.0, this was present in 1.0.10 as well. We also used data[loc] there.

# devtools::install_version("dplyr", "1.0.10")

library(dplyr, warn.conflicts = FALSE)
library(janitor, warn.conflicts = FALSE)

packageVersion("dplyr")
#> [1] '1.0.10'

mtcars |>
  tabyl(cyl) |>
  mutate(foo = 1) |>
  attributes()
#> $names
#> [1] "cyl"     "n"       "percent" "foo"    
#> 
#> $row.names
#> [1] 1 2 3
#> 
#> $class
#> [1] "tabyl"      "data.frame"

Created on 2023-02-07 with reprex v2.0.2.9000

@sfirke
Copy link
Contributor Author

sfirke commented Feb 7, 2023

Thanks @DavisVaughan, I think I understand the above. I just don't follow then how this works:

x <- mtcars
attr(x, "test") <- "foo"

x |>
  mutate(new_col = 1) |>
  attr("test")

There x is a data.frame, not a tibble, and its attribute is preserved unlike the example of a tabyl input. What is the difference in these input objects that explains that?

@DavisVaughan
Copy link
Member

That's a good question.

In dplyr_col_select() (our wrapper for [), we have a specific path for bare data frames and bare data.table objects that "reconstruct" them after calling [. Both end up calling [.data.frame, meaning that both dropped attributes. The reconstruction is one way to restore them.

dplyr/R/generics.R

Lines 240 to 244 in e8702df

# Patch base data frames and data.table (#6171) to restore extra attributes that `[.data.frame` drops.
# We require `[` methods to keep extra attributes for all data frame subclasses.
if (identical(class(.data), "data.frame") || identical(class(.data), c("data.table", "data.frame"))) {
out <- dplyr_reconstruct(out, .data)
}

We only apply this patch to these two specific types because we don't have any power to change them, and this is the best we can do there. We also know that any extra attributes on those types are just extra metadata that the user added which just needs to be carried along, but they won't be dependent on the data in any way.

We don't want to apply this patch in general though. Imagine the tsibble class, which has an attribute that identifies the column of the data that functions as the date-time index for the data frame. If data[loc] removed that column, we don't want to just automatically add that attribute back on, as that would no longer make any sense. Instead, we rely on data frame subclasses to have [ methods that retain or adjust attributes as it makes sense for that subclass.

This typically works pretty well because people who subclass tibble will automatically get attribute propagation due to [.tbl_df. It mainly pops up for people who subclass "data.frame" like you happen to do.

@sfirke
Copy link
Contributor Author

sfirke commented Feb 7, 2023

Thank you very much for explaining this, it's invaluable in figuring out how to best develop janitor. I hope my last question is a quick one. Re:

Instead, we rely on data frame subclasses to have [ methods that retain or adjust attributes as it makes sense for that subclass.

If I wrote a [.tabyl method that preserved a tabyl's attributes, would that get used in dplyr::mutate() and preserve attributes? If that is your suggestion for a proper way forward, I will start down that road. But if that wouldn't be a complete and neat fix, I can work around this in a less-elegant but simpler way.

@DavisVaughan
Copy link
Member

DavisVaughan commented Feb 7, 2023

Yes, that is exactly right. Adding a [.tablyl method will help with mutate() and another of other verbs like select() and relocate(). Writing a method like that is exactly what we'd suggest. You can probably use NextMethod() and then patch up the result before returning it.

You can also see ?dplyr::dplyr_extending for more information (possibly a little outdated, but it looks pretty good)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants