Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sparse bw error in rsamtools summary #100

Open
shaorray opened this issue Feb 21, 2021 · 8 comments
Open

sparse bw error in rsamtools summary #100

shaorray opened this issue Feb 21, 2021 · 8 comments

Comments

@shaorray
Copy link
Contributor

shaorray commented Feb 21, 2021

rtracklayer::summary(BigWigFile(file)) is convenient for extracting interval density, but it's not bullet proof.

It gives error counts when there is empty coverage in the bw file.

For example, if load these intervals:

gene_mm9_chr1_bed <- "gene.mm9.chr1.bed"

toy_bw_file <- "GSE48895_V6.5_untreated_Plus.chr1.bw"

gene_res <- bw_bed(bwfiles = toy_bw_file, bedfile = gene_mm9_chr1_bed)

There are errors Failed to summarize range 194 (chr1:145487827-145497067)Failed to summarize range 231 (chr1:173144101-173149324)Failed to summarize range 265 (chr1:182981368-182990798)Failed to summarize range 303 (chr1:18241059-18255219)Failed to summarize range 311 (chr1:32600391-32604138)Failed to summarize range 368 (chr1:65146963-65149937)Failed to summarize range 431 (chr1:121804478-121809745)

The most trickiest is that it doesn't fill the empty intervals with 0s, but NAs. And half empty intervals also have wrong density counts.
example.zip

@shaorray
Copy link
Contributor Author

shaorray commented Feb 21, 2021

The method (bw_bed()) used in elsasserlib is rtracklayer::summary(), which gives the density (RPK).

There are two other methods also can do the job after importing the bw coverage.

First, bw needs to be in Rle format. Then use IRanges::Views(bw_cov, ranges) to extract the profiles following by sum to aggregate the coverages.

The second way is fairly simple, Rle object is indexable, therefore bw_cov[ranges(intervals)] simply gives the combined coverages in one vector. So the density can be aggregated by intervals' widths.

These two methods can deal with the empty coverages without any error.

@shaorray
Copy link
Contributor Author

shaorray commented Feb 21, 2021

I put my function here

.countBW <- function(bw_files, intervals, .summary = F)
{
if (.summary) {
count_dat <- sapply(bw_files, function (file)
unlist(rtracklayer::summary(BigWigFile(file), intervals))$score) %>%
as.data.frame()
} else {
count_dat <- foreach ( i = seq_along(bw_files), .combine = cbind) %dopar%
{
bw_i <- rtracklayer::import(bw_files[i], format = 'BW', as="Rle")
interval_counts <- NULL
for (chr in unique(seqnames(intervals)))
{
tmp.chr <- intervals[as.character(seqnames(intervals)) == chr]
interval_counts <- c(interval_counts,
diff(
c(0, as.numeric(cumsum(bw_i[[chr]][ranges(tmp.chr)])[cumsum(width(tmp.chr))]) )
) / width(tmp.chr) * 1000
)
}
return(interval_counts)
} %>% as.data.frame()
}
colnames(count_dat) <- gsub(".\/(.)\.*", "\1", bw_files)
rownames(count_dat) <- names(intervals)
return(count_dat)
}

@cnluzon
Copy link
Collaborator

cnluzon commented Feb 22, 2021

So just to be clear, the error behavior you mention is that you get NAs when there are no reads instead of zeros, right? I haven't spotted errors in the actual counts, but it would be good to know if that was the case.

I was aware of both things, that empty intervals are filled with NAs and that a warning is thrown for each. Note that this is a warning, not an error, an can probably be silenced with something like suppressWarnings if we want. I didn't do it inside the function because I think it's useful to know when that happens, if it happens a lot there may be some reference error with the bigWig. I have been looking in the past for a way to kind of summarize the warnings and give it a number. If we all agree that there is no scenario where we would want to keep the NA values I can replace them with 0 after the summary. It is true that with RNA-seq and things like this it becomes annoying at some point.

Also, it is my understanding that it's not RPK what I get from rtracklayer::summary, but mean coverage (single base pair resolution - see the unit tests for specific examples). So if you multply times 1000 you get different values than the ones you get when you use summary.

@cnluzon
Copy link
Collaborator

cnluzon commented Feb 22, 2021

Just to make sure, I have checked bwtool output in your example. bwtool also outputs NA when there are no read counts. But I have spotted slight discrepancies on the values in a few cases, which I attributed to rounding errors before but I am going to look at it in more detail, since sometimes it's higher than 0.1, which I had not seen before.

@shaorray
Copy link
Contributor Author

I have observed a little discrepancy (very small) with histone ChIP coverage between rtracklayer::summary and Rle object mean coverage.

The difference happened to a larger extent when it has bw file empty area, like this:
image

So I have changed to my own script and the result suddenly making much more sense for the differential analysis.

@shaorray
Copy link
Contributor Author

Also the igv browser profile comparing by eyes, in favours of the the Rle mean coverage rather than rtracklayer::summary. That's all I have checked, but don't know where the error counts come from.

@cnluzon
Copy link
Collaborator

cnluzon commented Feb 22, 2021

Oh I see! Thanks that's very helpful. You have observed this only when data is sparse, right? I'm looking into it.

@shaorray
Copy link
Contributor Author

Yes, only sparse bw files have this issue. Histone coverages aligned well between methods. The sparsity introduced extremely large values from rtracklayer::summary, I think you also can find these abnormal values in processing the example bw file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants