Add a with_deduplication helper to run a vectorized function after deduplicating the input. #1090

Closed
orgadish opened this issue Jul 9, 2023 · 2 comments

Comments


orgadish commented Jul 9, 2023

I discovered that fs::path_file and fs::path_dir run very slowly on Windows (see fs issue 424). Since I mostly call these functions on the file_path column produced by readr::read_csv(files, .id = "file_path"), the input vector consists largely of duplicates. I found that deduplicating the vector before calling the function saves a significant amount of time (2x on Mac, 40x on Windows), and the approach is useful well beyond the fs::path_ functions.

The most straightforward approach is:

with_deduplication <- function(f) {
  function(x, ...) {
    ux <- unique(x)
    # Apply f to the unique values only, then map each element of x
    # back to its result via match().
    f(ux, ...)[match(x, ux)]
  }
}
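
For example, wrapping fs::path_file (the names below are purely illustrative):

# Wrap fs::path_file so the work is done only on the unique paths
# (dedup_path_file and paths are made-up names for illustration).
dedup_path_file <- with_deduplication(fs::path_file)

paths <- rep(c("data/a.csv", "data/b.csv", "data/c.csv"), 1000)
identical(dedup_path_file(paths), fs::path_file(paths))  # should be TRUE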

I've also submitted a PR to vctrs that speeds this up (see vctrs issue 1857 and PR 1858).

I'm not sure where this helper should live, but since it's a functional-programming building block, I think it would make sense for it to live in purrr.


hadley commented Jul 26, 2023

IMO it's easier to solve this sort of problem with memoisation, i.e. with https://github.com/r-lib/memoise.
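
For context, a minimal sketch of the memoisation approach (the names below are illustrative, not from this thread):

paths <- rep(c("data/a.csv", "data/b.csv"), 1000)

# memoise::memoise() caches on the full argument value, so a repeated call
# with the exact same vector is answered from the cache rather than recomputed.
path_file_memo <- memoise::memoise(fs::path_file)
path_file_memo(paths)  # computed on the first call
path_file_memo(paths)  # identical input: served from the cache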

hadley closed this as completed Jul 26, 2023
orgadish (Author) commented Jul 26, 2023

As far as I can tell, memoise caches on the whole input to the function, not on the individual elements of the input, so it doesn't help when each call receives a vector with many repeated elements. That said, I do think it would be a good idea to add this capability to memoise directly, and I will suggest it there:

TOTAL_N <- 1e6
UNIQUE_N <- 10

# Build a vector of TOTAL_N strings containing only UNIQUE_N distinct values.
repeated_strs <- purrr::map_chr(1:(5 * UNIQUE_N),
                                \(x) sample(LETTERS, 3) |> paste(collapse = "/")) |>
  unique() |>
  head(UNIQUE_N) |>            # Ensure UNIQUE_N unique items.
  rep(TOTAL_N / UNIQUE_N) |>   # Create TOTAL_N total items.
  sample()                     # Shuffle order.

with_dedup <- function(f) {
  function(x) {
    ux <- unique(x)
    # data.table::chmatch() is a faster match() for character vectors.
    f(ux)[data.table::chmatch(x, ux)]
  }
}

bench::mark(
  direct = stringr::str_to_lower(repeated_strs),
  dedup = with_dedup(stringr::str_to_lower)(repeated_strs),
  memo = memoise::memoise(stringr::str_to_lower)(repeated_strs),
  iterations = 100
)
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 direct       53.7ms     63ms      16.0    3.84MB     2.18
#> 2 dedup        12.4ms   15.4ms      64.7   13.34MB    36.4 
#> 3 memo           84ms   99.5ms      10.0   10.37MB     4.93

Created on 2023-07-26 with reprex v2.0.2
