Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Add min.width argument for reduce_ranges #84

Open
whtns opened this issue Jul 16, 2020 · 4 comments
Open

[BUG] Add min.width argument for reduce_ranges #84

whtns opened this issue Jul 16, 2020 · 4 comments

Comments

@whtns
Copy link

whtns commented Jul 16, 2020

Context

I'd like to use reduce_ranges to calculate grouped summary stats. This is difficult with base granges tools. I also want to specify a minimum width argument of 0 bases.

Code

Include the code you ran and comments

library(tibble)
library(plyranges)

# setup granges
start_granges <- tibble(seqnames = rep(1, 8),
                       start = rep(c(1, 2, 11, 22), 2),
                       end = rep(c(10, 8, 20, 30), 2),
                       coverage = rep(c(10, 10, 20, 30), 2),
                       sample_id = c(rep("A", 4), rep("B", 4))) %>% 
  as_granges()

# summarize coverage by sample id with min.width = 1
ply_reduced_granges <- 
  start_granges %>% 
  group_by(sample_id) %>% 
  plyranges::reduce_ranges(coverage = sum(coverage))

# summarize coverage by sample id with min.width = 0
bioc_reduced_granges <- 
  start_granges %>% 
  split(.$sample_id) %>% 
  GRangesList()

bioc_reduced_granges <- lapply(bioc_reduced_granges, reduce, min.gapwidth = 0L, with.revmap = TRUE)

# this results in a complex granges object that makes summary stat calculations difficult

R session information

Remember to include your full R session information.

─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 3.6.3 (2020-02-29)
os Ubuntu 16.04.6 LTS
system x86_64, linux-gnu
ui RStudio
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Los_Angeles
date 2020-07-16

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
backports 1.1.8 2020-06-17 [1] CRAN (R 3.6.3)
Biobase 2.44.0 2019-05-02 [1] Bioconductor
BiocGenerics * 0.30.0 2019-05-02 [1] Bioconductor
BiocParallel 1.18.1 2019-08-06 [1] Bioconductor
Biostrings 2.52.0 2019-05-02 [1] Bioconductor
bitops 1.0-6 2013-08-17 [1] CRAN (R 3.6.0)
callr 3.4.3 2020-03-28 [1] CRAN (R 3.6.3)
cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.2)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
DelayedArray 0.10.0 2019-05-02 [1] Bioconductor
desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0)
devtools * 2.3.0 2020-04-10 [1] CRAN (R 3.6.3)
digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.2)
dplyr 1.0.0 2020-05-29 [1] CRAN (R 3.6.3)
ellipsis 0.3.1 2020-05-15 [1] CRAN (R 3.6.3)
fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.0)
fs 1.4.1 2020-04-04 [1] CRAN (R 3.6.3)
generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.0)
GenomeInfoDb * 1.20.0 2019-05-02 [1] Bioconductor
GenomeInfoDbData 1.2.1 2020-05-06 [1] Bioconductor
GenomicAlignments 1.20.1 2019-06-18 [1] Bioconductor
GenomicRanges * 1.36.1 2019-09-06 [1] Bioconductor
glue 1.4.1 2020-05-13 [1] CRAN (R 3.6.3)
googledrive * 1.0.1 2020-05-05 [1] CRAN (R 3.6.3)
IRanges * 2.18.3 2019-09-24 [1] Bioconductor
lattice 0.20-41 2020-04-02 [3] CRAN (R 3.6.3)
lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.3)
magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
MASS * 7.3-51.6 2020-04-26 [3] CRAN (R 3.6.3)
Matrix 1.2-18 2019-11-27 [3] CRAN (R 3.6.1)
matrixStats 0.56.0 2020-03-13 [1] CRAN (R 3.6.3)
memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.3)
packrat 0.5.0 2018-11-14 [1] CRAN (R 3.6.0)
pillar 1.4.4 2020-05-05 [1] CRAN (R 3.6.3)
pkgbuild 1.0.8.9000 2020-07-01 [1] Github (r-lib/pkgbuild@ed57598)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.0)
pkgload 1.1.0 2020-05-29 [1] CRAN (R 3.6.3)
plyranges * 1.4.4 2019-09-17 [1] Bioconductor
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.0)
processx 3.4.2 2020-02-09 [1] CRAN (R 3.6.2)
ps 1.3.3 2020-05-08 [1] CRAN (R 3.6.3)
purrr 0.3.4 2020-04-17 [1] CRAN (R 3.6.3)
R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.0)
Rcpp 1.0.4.6 2020-04-09 [1] CRAN (R 3.6.3)
RCurl 1.98-1.2 2020-04-18 [1] CRAN (R 3.6.3)
remotes 2.1.1 2020-02-15 [1] CRAN (R 3.6.2)
rlang 0.4.6 2020-05-02 [1] CRAN (R 3.6.3)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.0)
Rsamtools 2.0.3 2019-10-10 [1] Bioconductor
rstudioapi 0.11 2020-02-07 [1] CRAN (R 3.6.2)
rtracklayer 1.44.4 2019-09-06 [1] Bioconductor
S4Vectors * 0.22.1 2019-09-09 [1] Bioconductor
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
SummarizedExperiment 1.14.1 2019-07-31 [1] Bioconductor
testthat 2.3.2 2020-03-02 [1] CRAN (R 3.6.2)
tibble * 3.0.1 2020-04-20 [1] CRAN (R 3.6.3)
tidyselect 1.1.0 2020-05-11 [1] CRAN (R 3.6.3)
usethis * 1.6.1 2020-04-29 [1] CRAN (R 3.6.3)
vctrs 0.3.1 2020-06-05 [1] CRAN (R 3.6.3)
withr 2.2.0 2020-04-20 [1] CRAN (R 3.6.3)
XML 3.99-0.3 2020-01-20 [1] CRAN (R 3.6.0)
XVector 0.24.0 2019-05-02 [1] Bioconductor
zlibbioc 1.30.0 2019-05-02 [1] Bioconductor

[1] /usr/local/lib/R/site-library
[2] /usr/lib/R/site-library
[3] /usr/lib/R/library

The output of sessioninfo::session_info() includes relevant GitHub installation information and other details that are missed by sessionInfo().

@sa-lee
Copy link
Collaborator

sa-lee commented Jul 17, 2020

Thanks, so I definitely think we should include an option for this - but I'm a bit swamped with other things so it might be a while to look at it.

My initial thinking was users could do interval arithmetic, in your case you could emulate the gapwidth argument like this:

library(tibble)
library(plyranges)

# setup granges
start_granges <- tibble(seqnames = rep(1, 8),
                        start = rep(c(1, 2, 11, 22), 2),
                        end = rep(c(10, 8, 20, 30), 2),
                        coverage = rep(c(10, 10, 20, 30), 2),
                        sample_id = c(rep("A", 4), rep("B", 4))) %>% 
  as_granges()

start_granges %>% 
  mutate(width = width - 1) %>% # moves width but leaves start fixed, can be made explict with anchor_start()
  group_by(sample_id) %>% 
  reduce_ranges(coverage = sum(coverage)) %>% 
  mutate(end = end + 1)  # shift the end back
#> GRanges object with 6 ranges and 2 metadata columns:
#>       seqnames    ranges strand |   sample_id  coverage
#>          <Rle> <IRanges>  <Rle> | <character> <numeric>
#>   [1]        1      1-10      * |           A        20
#>   [2]        1     11-20      * |           A        20
#>   [3]        1     22-30      * |           A        30
#>   [4]        1      1-10      * |           B        20
#>   [5]        1     11-20      * |           B        20
#>   [6]        1     22-30      * |           B        30
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

@whtns
Copy link
Author

whtns commented Jul 18, 2020

Thanks! that will work for the moment, I think. I'll try to PR if I get the chance. Do you think adding this option would impact a lot of other parts of plyranges?

@sa-lee
Copy link
Collaborator

sa-lee commented Jul 19, 2020

I don't think so - I think it just requires updating the reduce_ranges signature to allow for additional arguments to reduce via .... I should have some time in the next few days to look at it.

@mikelove
Copy link
Member

I took a shot at this here:

mikelove@a0dd1ab

If this looks alright, I can add an example and tests.

> x <- data.frame(seqnames="1", start=c(1,11,21), width=c(5,10,5)) %>% as_granges()
> x
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1       1-5      *
  [2]        1     11-20      *
  [3]        1     21-25      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> x %>% reduce_ranges(min.gapwidth=5L)
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1       1-5      *
  [2]        1     11-25      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> x %>% reduce_ranges(min.gapwidth=6L)
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1      1-25      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants