[feature request] add .drop argument to group_by, e.g. to include empty seqlevels #95

Open
jayoung opened this issue Feb 24, 2022 · 0 comments


Hi there,

I only just discovered your package - I'm excited to start being tidy with my GRanges!

Apologies if I missed something, but I think I am requesting an enhancement.

With tibbles, when I group on a factor, I can make sure empty groups are included in the summary by passing .drop=FALSE to group_by(). But for GRanges, I don't see a way to include the empty groups. Again, sorry if I missed it - I have tried searching but didn't see anything.

I've provided code below that I think is a nice small example.

Thanks very much,

Janet Young

Malik lab,
Fred Hutch Cancer Research Center,
Seattle, WA

## here's how I include empty groups when summarizing a tibble
library(tidyverse)
fruit_tbl <- data.frame(fruit=factor( c("apple","apple","orange","pear"), 
                                      levels=c("apple","orange","pear","banana")),
                        weight=c(3,4,5,3)) %>% 
    as_tibble()
# we DO get output for 'banana', the empty group:
fruit_tbl %>% 
    group_by(fruit, .drop=FALSE) %>% 
    summarise(numFruits=n(), 
              mean=mean(weight))
#   A tibble: 4 × 3
#   fruit  numFruits  mean
#   <fct>      <int> <dbl>
# 1 apple          2   3.5
# 2 orange         1   5  
# 3 pear           1   3  
# 4 banana         0 NaN  

But I don't see a way to include empty groups in plyranges. Is that true? Sorry if I missed it. I am using plyranges_1.14.0 (release version). Here's what I tried (after restarting R to make sure tidyverse packages aren't loaded):

library(plyranges)

## make GRanges where not all factor levels are represented (for seqnames, also for regionType)
grng2 <- data.frame(seqnames = sample(c("chr1", "chr2"), 7, replace = TRUE),
                   strand = sample(c("+", "-"), 7, replace = TRUE),
                   gc = runif(7),
                   start = 1:7,
                   width = 10) %>%
    mutate(seqnames=factor(seqnames, levels=c("chr1", "chr2", "chr3"))) %>% 
    mutate(regionType=factor( sample(c("a", "b"), 7, replace = TRUE),
                              levels=c("a", "b", "c"))) %>% 
    as_granges()

## works, but we don't get summaries for the empty levels of seqnames (chr3) or regionType (c):
grng2 %>% 
    group_by(seqnames) %>% 
    summarize(numRegions=n(),
              meanGC=mean(gc))
# DataFrame with 2 rows and 3 columns
#   seqnames numRegions    meanGC
#      <Rle>  <integer> <numeric>
# 1     chr1          6  0.592756
# 2     chr2          1  0.664616


grng2 %>% 
    group_by(regionType) %>% 
    summarize(numRegions=n(),
              meanGC=mean(gc))
# DataFrame with 2 rows and 3 columns
#   regionType numRegions    meanGC
#    <factor>  <integer> <numeric>
# 1          a          6  0.646677
# 2          b          1  0.341085

## can't use .drop like I would with a tibble
grng2 %>% 
    group_by(regionType, .drop=FALSE) %>% 
    summarize(numRegions=n(),
              meanGC=mean(gc))
# Error in new_grouping(.data, ...) : Column `.drop` is unknown
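
In the meantime, one possible workaround (a sketch only, not something plyranges documents for this purpose) is to convert the GRanges to a plain data frame and let dplyr handle the grouping, since dplyr's group_by() does accept .drop=FALSE. The names gr_demo and with_empty below are made up for illustration, and the toy data is hypothetical. The sketch assumes as.data.frame() keeps the metadata column as a factor with its unused levels intact.

```r
library(plyranges)

# Hypothetical toy data mirroring the example above: level "c" is never used
gr_demo <- data.frame(seqnames = c("chr1", "chr1", "chr2"),
                      start = 1:3,
                      width = 10,
                      gc = c(0.2, 0.4, 0.6),
                      regionType = factor(c("a", "a", "b"),
                                          levels = c("a", "b", "c"))) %>%
    as_granges()

# Step out of GRanges: after as.data.frame() the object is an ordinary
# data frame, so dplyr's data-frame methods (and .drop=FALSE) apply
with_empty <- gr_demo %>%
    as.data.frame() %>%
    dplyr::group_by(regionType, .drop = FALSE) %>%
    dplyr::summarise(numRegions = dplyr::n(),
                     meanGC = mean(gc))

# One row per factor level, including the empty level "c"
with_empty
```

The trade-off is that the result is a tibble rather than a DataFrame, and range-aware operations are no longer available once the data has left GRanges, so a native .drop argument would still be the nicer solution.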

And here's my R session information:

library(sessioninfo)
sessioninfo::session_info()
─ Session info ─────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.1.2 (2021-11-01)
 os       macOS Monterey 12.2.1
 system   x86_64, darwin17.0
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Los_Angeles
 date     2022-02-23
 rstudio  1.4.1717 Juliet Rose (desktop)
 pandoc   NA

─ Packages ─────────────────────────────────────────────────────────────────────────────────
 package              * version  date (UTC) lib source
 assertthat             0.2.1    2019-03-21 [1] CRAN (R 4.1.0)
 Biobase                2.54.0   2021-10-26 [1] Bioconductor
 BiocGenerics         * 0.40.0   2021-10-26 [1] Bioconductor
 BiocIO                 1.4.0    2021-10-26 [1] Bioconductor
 BiocParallel           1.28.3   2021-12-09 [1] Bioconductor
 Biostrings             2.62.0   2021-10-26 [1] Bioconductor
 bitops                 1.0-7    2021-04-24 [1] CRAN (R 4.1.0)
 cli                    3.1.1    2022-01-20 [1] CRAN (R 4.1.2)
 crayon                 1.4.2    2021-10-29 [1] CRAN (R 4.1.0)
 DBI                    1.1.2    2021-12-20 [1] CRAN (R 4.1.1)
 DelayedArray           0.20.0   2021-10-26 [1] Bioconductor
 dplyr                  1.0.7    2021-06-18 [1] CRAN (R 4.1.0)
 ellipsis               0.3.2    2021-04-29 [1] CRAN (R 4.1.0)
 fansi                  1.0.2    2022-01-14 [1] CRAN (R 4.1.2)
 generics               0.1.2    2022-01-31 [1] CRAN (R 4.1.2)
 GenomeInfoDb         * 1.30.0   2021-10-26 [1] Bioconductor
 GenomeInfoDbData       1.2.7    2021-11-16 [1] Bioconductor
 GenomicAlignments      1.30.0   2021-10-26 [1] Bioconductor
 GenomicRanges        * 1.46.1   2021-11-18 [1] Bioconductor
 glue                   1.6.1    2022-01-22 [1] CRAN (R 4.1.2)
 IRanges              * 2.28.0   2021-10-26 [1] Bioconductor
 lattice                0.20-45  2021-09-22 [1] CRAN (R 4.1.2)
 lifecycle              1.0.1    2021-09-24 [1] CRAN (R 4.1.0)
 magrittr               2.0.2    2022-01-26 [1] CRAN (R 4.1.2)
 Matrix                 1.4-0    2021-12-08 [1] CRAN (R 4.1.0)
 MatrixGenerics         1.6.0    2021-10-26 [1] Bioconductor
 matrixStats            0.61.0   2021-09-17 [1] CRAN (R 4.1.0)
 pillar                 1.7.0    2022-02-01 [1] CRAN (R 4.1.2)
 pkgconfig              2.0.3    2019-09-22 [1] CRAN (R 4.1.0)
 plyranges            * 1.14.0   2021-10-26 [1] Bioconductor
 purrr                  0.3.4    2020-04-17 [1] CRAN (R 4.1.0)
 R6                     2.5.1    2021-08-19 [1] CRAN (R 4.1.0)
 RCurl                  1.98-1.5 2021-09-17 [1] CRAN (R 4.1.0)
 restfulr               0.0.13   2017-08-06 [1] CRAN (R 4.1.0)
 rjson                  0.2.21   2022-01-09 [1] CRAN (R 4.1.2)
 rlang                  1.0.1    2022-02-03 [1] CRAN (R 4.1.2)
 Rsamtools              2.10.0   2021-10-26 [1] Bioconductor
 rstudioapi             0.13     2020-11-12 [1] CRAN (R 4.1.0)
 rtracklayer            1.54.0   2021-10-26 [1] Bioconductor
 S4Vectors            * 0.32.3   2021-11-21 [1] Bioconductor
 sessioninfo          * 1.2.2    2021-12-06 [1] CRAN (R 4.1.0)
 SummarizedExperiment   1.24.0   2021-10-26 [1] Bioconductor
 tibble                 3.1.6    2021-11-07 [1] CRAN (R 4.1.0)
 tidyselect             1.1.1    2021-04-30 [1] CRAN (R 4.1.0)
 utf8                   1.2.2    2021-07-24 [1] CRAN (R 4.1.0)
 vctrs                  0.3.8    2021-04-29 [1] CRAN (R 4.1.0)
 XML                    3.99-0.8 2021-09-17 [1] CRAN (R 4.1.0)
 XVector                0.34.0   2021-10-26 [1] Bioconductor
 yaml                   2.2.2    2022-01-25 [1] CRAN (R 4.1.2)
 zlibbioc               1.40.0   2021-10-26 [1] Bioconductor

 [1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library