sparse bw error in rsamtools summary #100

shaorray · 2021-02-21T10:50:58Z

rtracklayer::summary(BigWigFile(file)) is convenient for extracting interval density, but it's not bullet proof.

It gives error counts when there is empty coverage in the bw file.

For example, if load these intervals:

gene_mm9_chr1_bed <- "gene.mm9.chr1.bed"

toy_bw_file <- "GSE48895_V6.5_untreated_Plus.chr1.bw"

gene_res <- bw_bed(bwfiles = toy_bw_file, bedfile = gene_mm9_chr1_bed)

There are errors Failed to summarize range 194 (chr1:145487827-145497067)Failed to summarize range 231 (chr1:173144101-173149324)Failed to summarize range 265 (chr1:182981368-182990798)Failed to summarize range 303 (chr1:18241059-18255219)Failed to summarize range 311 (chr1:32600391-32604138)Failed to summarize range 368 (chr1:65146963-65149937)Failed to summarize range 431 (chr1:121804478-121809745)

The most trickiest is that it doesn't fill the empty intervals with 0s, but NAs. And half empty intervals also have wrong density counts.
example.zip

The text was updated successfully, but these errors were encountered:

shaorray · 2021-02-21T11:23:52Z

The method (bw_bed()) used in elsasserlib is rtracklayer::summary(), which gives the density (RPK).

There are two other methods also can do the job after importing the bw coverage.

First, bw needs to be in Rle format. Then use IRanges::Views(bw_cov, ranges) to extract the profiles following by sum to aggregate the coverages.

The second way is fairly simple, Rle object is indexable, therefore bw_cov[ranges(intervals)] simply gives the combined coverages in one vector. So the density can be aggregated by intervals' widths.

These two methods can deal with the empty coverages without any error.

shaorray · 2021-02-21T11:29:35Z

I put my function here

.countBW <- function(bw_files, intervals, .summary = F)
{
if (.summary) {
count_dat <- sapply(bw_files, function (file)
unlist(rtracklayer::summary(BigWigFile(file), intervals))$score) %>%
as.data.frame()
} else {
count_dat <- foreach ( i = seq_along(bw_files), .combine = cbind) %dopar%
{
bw_i <- rtracklayer::import(bw_files[i], format = 'BW', as="Rle")
interval_counts <- NULL
for (chr in unique(seqnames(intervals)))
{
tmp.chr <- intervals[as.character(seqnames(intervals)) == chr]
interval_counts <- c(interval_counts,
diff(
c(0, as.numeric(cumsum(bw_i[[chr]][ranges(tmp.chr)])[cumsum(width(tmp.chr))]) )
) / width(tmp.chr) * 1000
)
}
return(interval_counts)
} %>% as.data.frame()
}
colnames(count_dat) <- gsub(".\/(.)\.*", "\1", bw_files)
rownames(count_dat) <- names(intervals)
return(count_dat)
}

cnluzon · 2021-02-22T08:13:20Z

So just to be clear, the error behavior you mention is that you get NAs when there are no reads instead of zeros, right? I haven't spotted errors in the actual counts, but it would be good to know if that was the case.

I was aware of both things, that empty intervals are filled with NAs and that a warning is thrown for each. Note that this is a warning, not an error, an can probably be silenced with something like suppressWarnings if we want. I didn't do it inside the function because I think it's useful to know when that happens, if it happens a lot there may be some reference error with the bigWig. I have been looking in the past for a way to kind of summarize the warnings and give it a number. If we all agree that there is no scenario where we would want to keep the NA values I can replace them with 0 after the summary. It is true that with RNA-seq and things like this it becomes annoying at some point.

Also, it is my understanding that it's not RPK what I get from rtracklayer::summary, but mean coverage (single base pair resolution - see the unit tests for specific examples). So if you multply times 1000 you get different values than the ones you get when you use summary.

cnluzon · 2021-02-22T08:34:09Z

Just to make sure, I have checked bwtool output in your example. bwtool also outputs NA when there are no read counts. But I have spotted slight discrepancies on the values in a few cases, which I attributed to rounding errors before but I am going to look at it in more detail, since sometimes it's higher than 0.1, which I had not seen before.

shaorray · 2021-02-22T08:36:35Z

I have observed a little discrepancy (very small) with histone ChIP coverage between rtracklayer::summary and Rle object mean coverage.

The difference happened to a larger extent when it has bw file empty area, like this:

So I have changed to my own script and the result suddenly making much more sense for the differential analysis.

shaorray · 2021-02-22T08:42:25Z

Also the igv browser profile comparing by eyes, in favours of the the Rle mean coverage rather than rtracklayer::summary. That's all I have checked, but don't know where the error counts come from.

cnluzon · 2021-02-22T08:59:21Z

Oh I see! Thanks that's very helpful. You have observed this only when data is sparse, right? I'm looking into it.

shaorray · 2021-02-22T11:33:40Z

Yes, only sparse bw files have this issue. Histone coverages aligned well between methods. The sparsity introduced extremely large values from rtracklayer::summary, I think you also can find these abnormal values in processing the example bw file.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sparse bw error in rsamtools summary #100

sparse bw error in rsamtools summary #100

shaorray commented Feb 21, 2021 •

edited

Loading

shaorray commented Feb 21, 2021 •

edited

Loading

shaorray commented Feb 21, 2021 •

edited

Loading

cnluzon commented Feb 22, 2021

cnluzon commented Feb 22, 2021

shaorray commented Feb 22, 2021

shaorray commented Feb 22, 2021

cnluzon commented Feb 22, 2021 •

edited

Loading

shaorray commented Feb 22, 2021

sparse bw error in rsamtools summary #100

sparse bw error in rsamtools summary #100

Comments

shaorray commented Feb 21, 2021 • edited Loading

shaorray commented Feb 21, 2021 • edited Loading

shaorray commented Feb 21, 2021 • edited Loading

cnluzon commented Feb 22, 2021

cnluzon commented Feb 22, 2021

shaorray commented Feb 22, 2021

shaorray commented Feb 22, 2021

cnluzon commented Feb 22, 2021 • edited Loading

shaorray commented Feb 22, 2021

shaorray commented Feb 21, 2021 •

edited

Loading

shaorray commented Feb 21, 2021 •

edited

Loading

shaorray commented Feb 21, 2021 •

edited

Loading

cnluzon commented Feb 22, 2021 •

edited

Loading