Large number of BAM files leads to Error in vec_interleave_indices() #450

Open · NikoLichi opened this issue Oct 17, 2024 · 10 comments
Labels: enhancement (New feature or request)

@NikoLichi

Dear Bambu team,

I am running a massive project with 480 BAM files with ~4.8 TB total data.
Following the previous suggestion for Bambu, I am first running the annotation extension step (quant = FALSE), with the idea of running the quantification later in batches.

However, there is a major issue when starting the extended annotations:

--- Start extending annotations ---
Error in `vec_interleave_indices()`:
! Long vectors are not yet supported in `vec_interleave()`. Result from interleaving would have size 8857886400, which is larger than the maximum supported size of 2^31 - 1.
Backtrace:
     ▆
  1. ├─bambu::bambu(...)
  2. │ └─bambu:::bambu.extendAnnotations(...)
  3. │   └─bambu:::isore.combineTranscriptCandidates(...)
  4. │     ├─... %>% data.table()
  5. │     └─bambu:::combineSplicedTranscriptModels(...)
  6. │       └─bambu:::updateStartEndReadCount(combinedFeatureTibble)
  7. │         └─... %>% mutate(sumReadCount = sum(readCount, na.rm = TRUE))
  8. ├─data.table::data.table(.)
  9. ├─dplyr::mutate(., sumReadCount = sum(readCount, na.rm = TRUE))
 10. ├─dplyr::group_by(., rowID)
 11. ├─tidyr::pivot_longer(...)
 12. ├─tidyr:::pivot_longer.data.frame(...)
 13. │ └─tidyr::pivot_longer_spec(...)
 14. │   └─vctrs::vec_interleave(!!!val_cols, .ptype = val_type)
 15. │     └─vctrs:::vec_interleave_indices(n, size)
 16. └─rlang::abort(message = message)
Execution halted

Is there anything I could do to run Bambu?

My code looks like:

BAMlist = BAMs_one_per_Line
fa.file <- "/refData/release46/GRCh38.primary_assembly.genome.fa"
gtf.file <-  "/refData/release46/gencode.v46.primary_assembly.annotation.gtf"
bambuAnnotations <- prepareAnnotations(gtf.file)

extendedAnnotations = bambu(reads = BAMlist, annotations = bambuAnnotations, genome = fa.file, quant = FALSE, lowMemory=T, ncore = 14, rcOutDir="MY_PATH/bambu_20241015_all")

As an additional note, I also get the same warning message that others have reported in issue #407 .

This is with R 4.3.2 and Bioc 3.18 / bambu (3.4.1).
Platform: x86_64-conda-linux-gnu (64-bit)

All the best,
Niko

@NikoLichi (Author)

Hello there,

I made some progress with a different approach, but bambu is still failing with the same issue.

I divided the data into 3 batches to obtain the extended annotations as .rds files. I set it up like below for each of the 3 batches:

extendedAnnotations = bambu(reads = BAMs_one_per_Line, annotations = bambuAnnotations, genome = fa.file, NDR = 0.1, quant = FALSE, lowMemory=T, ncore = 14, rcOutDir="MY_DIR")

As a small note, I tested with the default NDR first and each batch had an NDR > 0.6; since I am working with human samples, I decided to fix NDR=0.1.

After this, I put together the .rds files from the three bambu runs, listing the .rds files for each run in a vector and merging them like:

B01 <- Myfiles01
B02 <- Myfiles02
B03 <- Myfiles03

B_allRDs <- c(B01,B02,B03)

mergedAnno = bambu(reads = B_allRDs, genome = fa.file, annotations = bambuAnnotations, quant = FALSE)

But then I got the same error as above; the first lines are:

--- Start extending annotations ---
Error in `vec_interleave_indices()`:
! Long vectors are not yet supported in `vec_interleave()`. Result from interleaving would have size 17715772800, which is larger than the maximum supported size of 2^31 - 1.

I would appreciate any help on how to run this massive data set.
Best,
Niko

@cying111 (Collaborator)

Hi @NikoLichi ,

sorry for getting back late. We have been looking into the issue and have pushed a fix to a separate branch: https://github.com/GoekeLab/bambu/tree/patch_bigsamples

where we have rewritten the code relevant to the error.

Could you please try it out and see if it works? If not, could you please post the complete error output so that we can investigate further?

To use this branch:

devtools::install_github("GoekeLab/bambu", ref = "patch_bigsamples")
library(bambu)

Thank you
Kind regards,
Ying

cying111 added the enhancement (New feature or request) label on Oct 28, 2024
@NikoLichi (Author)

Hi @cying111,

Thanks for your reply.

I did two tests with the new installation.

  1. Bambu annotation with all samples at once.
  2. Bambu merging RDS annotation from 3 batches.

Both had the same error, which is a little weird: it could not find a function...

--- Start extending annotations ---
Error in readCountWeightedMedian(.SD, c("start.1", "start.10", "start.100",  :
  could not find function "readCountWeightedMedian"
Calls: bambu ... combineSplicedTranscriptModels -> updateStartEndReadCount -> [ -> [.data.table
Execution halted

Kind regards,
NiKo

@cying111 (Collaborator) commented Oct 28, 2024

Hi @NikoLichi ,

sorry for the bug; I have pushed a fix that adds the missing function to the patch_bigsamples branch.

Could you try again and see if it works?

Update: I just pushed one more fix for a typo; please update your branch and try again. Thanks a lot!

Thank you
Warm regards,
Ying

@NikoLichi (Author) commented Oct 30, 2024

Hi @cying111,

Alright, after 1 day and 13 hours of running, this is the error that both of my jobs/approaches are showing:

--- Start extending annotations ---
Error in `[.data.table`(startEndDt, combinedFeatureTibble[, .(intronStarts,  :
  When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.
Calls: bambu ... [ -> [.data.table -> stopf -> raise_condition -> signal
In addition: Warning message:
In `[.data.table`(startEndDt, combinedFeatureTibble[, .(intronStarts,  :
  Ignoring by/keyby because 'j' is not supplied
Execution halted

I hope this helps.
Warm regards,
NiKo

@cying111 (Collaborator) commented Nov 1, 2024

Hi @NikoLichi ,

It's great to hear back. The error means that the changes I made are working, but there was a small typo in the code of the version you used, for which I pushed a fix shortly after I realized it.

Could you run this again with the most recent version, which should have the typo fixed?

devtools::install_github("GoekeLab/bambu", ref = "patch_bigsamples")
library(bambu)

I think this time it should work.

Hope to hear from you soon
Thank you
Kind regards,
Ying

@NikoLichi (Author)

Hi @cying111,

Sorry for not getting back to you sooner. I was doing many tests and they took quite some time.

First of all, the current patch no longer shows the previous error, so that is solved! Thanks for it!

However, I found that Bambu is memory-hungry even when using the lowMemory=T option. I would appreciate it if you could guide me and correct my steps/code here.

I did two tests, first using a reduced data set (1 million reads per sample) to get the parameters ready for the large data, like this:

library(bambu)

BAMlist <- paste(BAM_Files_Samples_Vector)  # vector of BAM file paths, one per sample

fa.file <- "GRCh38.primary_assembly.genome.fa"
gtf.file <-  "gencode.v46.primary_assembly.annotation.gtf"
bambuAnnotations <- prepareAnnotations(gtf.file)

extendedAnnotations = bambu(reads = BAMlist, annotations = bambuAnnotations, genome = fa.file, NDR = 0.3, quant = FALSE, lowMemory=T, ncore = 14, rcOutDir="MY_DIR/Reduced_1MReads")

RDsFiles <- list.files("MY_DIR/Reduced_1MReads", pattern = "*.rds", full.names = T)

se.multiSample <- bambu(reads = RDsFiles, annotations = extendedAnnotations, genome = fa.file, ncore = 14, lowMemory=T, NDR=0.3,  discovery =F, quant =T)

This worked quite well. However, I would appreciate it if you could comment on the syntax, since I don't quite get why the read classes are used as reads and what should go in the annotations argument.

For the second approach, with all the data, I ran bambu in batches (i.e., 3), writing out the read classes and then calling them similarly to above:
1st

library(bambu)

BAMlist <- paste(BAM_Files_Samples_Vector)  # vector of BAM file paths, one per sample

fa.file <- "GRCh38.primary_assembly.genome.fa"
gtf.file <-  "gencode.v46.primary_assembly.annotation.gtf"
bambuAnnotations <- prepareAnnotations(gtf.file)

extendedAnnotations = bambu(reads = BAMlist, annotations = bambuAnnotations, genome = fa.file, NDR = 0.3, quant = FALSE, lowMemory=T, rcOutDir="MY_DIR/3batchs_1")

Later, when these were finished, I ran bambu again like:

# 1-3 batch RDS list
RDsFiles <- list.files("MY_DIR/3batchs_?", pattern = "*.rds", full.names = T)

mergedAnno = bambu(reads = RDsFiles, annotations = bambuAnnotations, genome = fa.file, lowMemory=T, NDR = 0.3, discovery =TRUE, quant = FALSE)

se.multiSample <- bambu(reads = RDsFiles, annotations = mergedAnno, genome = fa.file, ncore = 14, lowMemory=T, discovery =F, quant =T)

I have one last question: given bambu's large RAM usage, do you recommend using the .rds read classes or the BAM files for the reads= argument in each step?

Sorry for the extensive code and questions, but I want to be sure we are processing the data correctly.

Thanks again!
Kind regards,
Niko

@cying111 (Collaborator) commented Nov 13, 2024

Hi @NikoLichi ,

Thanks for getting back with the good news; I am happy that the patch branch works.

For your first test using subset data, the code all looks good to me.

extendedAnnotations = bambu(reads = BAMlist, annotations = bambuAnnotations, genome = fa.file, NDR = 0.3, quant = FALSE, lowMemory=T, ncore = 14, rcOutDir="MY_DIR/Reduced_1MReads")

This step actually does two things in one command: it processes the BAM files, performs junction correction, and collapses reads into read classes, which are then saved as .rds files in the provided rcOutDir. Because discovery is set to TRUE by default, this step also returns the extended annotation object, which contains both the reference bambuAnnotations and the annotations newly discovered by Bambu.

se.multiSample <- bambu(reads = RDsFiles, annotations = extendedAnnotations, genome = fa.file, ncore = 14, lowMemory=T, NDR=0.3, discovery =F, quant =T)

In your second step, you can then just use the preprocessed .rds files as input, so that the BAM files do not need to be processed again, and extendedAnnotations is supplied for quantification purposes. In this step, NDR=0.3 and lowMemory=T are not actually used, so you can leave them out. The rest is all correct and needed.
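
For reference, a trimmed version of that quantification call without the unused arguments could look like this (just a sketch, using the same objects as in your code above):

# quantification only: read classes as input, extended annotations supplied
se.multiSample <- bambu(reads = RDsFiles, annotations = extendedAnnotations, genome = fa.file, ncore = 14, discovery = FALSE, quant = TRUE)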

For running all data together, I want to clarify with you a few details first:

extendedAnnotations = bambu(reads = BAMlist, annotations = bambuAnnotations, genome = fa.file, NDR = 0.3, quant = FALSE, lowMemory=T, rcOutDir="MY_DIR/3batchs_1")

For this step, which processes the BAM files in batches, you can actually set discovery = FALSE, which will save you some time. You can also leave NDR=0.3 out, so that this step only processes the BAM files. You can even do it per sample from bash, since the BAM file processing is done per sample anyway. That is, you can loop through every sample in bash, i.e., 480 batches instead of just 3 batches, which will help to reduce the memory usage.
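
As a rough sketch of that per-sample preprocessing loop (using the same BAMlist, bambuAnnotations, and fa.file objects as in your code; the shared output directory here is just a placeholder, and each iteration could equally be a separate Rscript job submitted from bash):

# write read classes one sample at a time; discovery and quantification are both skipped
for (i in seq_along(BAMlist)) {
  bambu(reads = BAMlist[i], annotations = bambuAnnotations, genome = fa.file, discovery = FALSE, quant = FALSE, lowMemory = TRUE, rcOutDir = "MY_DIR/readclasses")
}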

For the second part onwards,

mergedAnno = bambu(reads = RDsFiles, annotations = bambuAnnotations, genome = fa.file, lowMemory=T, NDR = 0.3, discovery =TRUE, quant = FALSE)
This is the step where you are trying to perform discovery; the syntax is all correct. Did this step run successfully?

If so, for the quantification parts,

se.eachsample <- bambu(reads = RDsFiles[i], annotations = mergedAnno, genome = fa.file, ncore = 14, lowMemory=T, discovery =F, quant =T)

similarly, I would suggest you loop through all the .rds files one by one in bash, so that quantification is done per sample, which will save a lot of memory. Because the annotations are now the same mergedAnno for all samples, the results can also be easily combined and compared.
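
A sketch of that per-sample quantification loop (same RDsFiles, mergedAnno, and fa.file as above; collecting the per-sample results in a list so they can be combined afterwards):

# quantify each read-class .rds file separately against the shared mergedAnno
seList <- lapply(RDsFiles, function(rds) {
  bambu(reads = rds, annotations = mergedAnno, genome = fa.file, ncore = 14, discovery = FALSE, quant = TRUE)
})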

In this way, I believe you should be able to run it through. I hope this has clarified your questions!

Thank you, and let us know if you have any doubts about the above or need any help!
All the best,
Ying

@NikoLichi (Author)

Hi @cying111,

Thank you very much for your input and the details in the commands. It seems I can speed up the process a bit more with your comments!

mergedAnno = bambu(reads = RDsFiles, annotations = bambuAnnotations, genome = fa.file, lowMemory=T, NDR = 0.3, discovery =TRUE, quant = FALSE)

Yes, this step was the bottleneck in R, but using your patch seems to work well. I only got a warning:
Detected Bambu derived annotations in the annotations. Set a new prefix with opt.discovery(list(prefix='newPrefix')) to prevent ambigious id assignment.

You mentioned that when running quantification for each sample,

se.eachsample <- bambu(reads = RDsFiles[i], annotations = mergedAnno, genome = fa.file, ncore = 14, lowMemory=T, discovery =F, quant =T)

the results can easily be merged. So, in this case, would the RangedSummarizedExperiments be merged with a simple concatenation, like se.Allsamples <- c(se.eachsample_1, ..., se.eachsample_480), or in some other way?

All the best,
Niko

@cying111 (Collaborator)

Hi @NikoLichi ,

Thanks for getting back so quickly, and I am happy that the mergedAnno step runs through.

For the warning you see in the mergedAnno step: it suggests that bambuAnnotations already contains Bambu-discovered transcripts, which should not be the case, as Bambu discovery only happens in this step. Maybe you can double-check that bambuAnnotations is the direct output of prepareAnnotations on the GTF file?
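
One quick way to check this (assuming the default "BambuTx" prefix for novel transcript IDs; adjust if you set a custom prefix) would be:

# should be FALSE for a plain reference annotation straight from prepareAnnotations()
any(grepl("^BambuTx", names(bambuAnnotations)))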

For the second question, yes, the RangedSummarizedExperiments can easily be combined; it just needs a bit of tweaking because of the metadata:

seList <- list(se.eachsample_1, ..., se.eachsample_480)
incompatibleCounts <- lapply(seList, function(x) metadata(x)$incompatibleCounts)
incompatibleCounts <- Reduce(bambu:::merge_wrapper, incompatibleCounts)

This will keep the incompatible counts information

seAll <- do.call(cbind, lapply(seList, function(x) { metadata(x)$incompatibleCounts <- NULL; metadata(x)$warnings <- NULL; return(x) }))

This will combine all the SEs.
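
If you want to keep the merged incompatible counts together with the combined object, one option (just a suggestion, not something bambu requires) is to put them back into its metadata:

# reattach the merged incompatible counts to the combined SummarizedExperiment
metadata(seAll)$incompatibleCounts <- incompatibleCounts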

Let me know if you have questions regarding the above!
All the best
Warm regards,
Ying
