Large number of BAM files leads to Error in vec_interleave_indices() #450
Hello there, I made some progress with a different approach, but bambu is still failing with the same error. I divided the data into 3 batches to obtain the extended annotations:
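A minimal sketch of what such a per-batch discovery run typically looks like (file paths, batch layout, and core count are hypothetical placeholders):

```r
library(bambu)

# hypothetical paths: one batch of the BAM files plus the reference files
bamFiles    <- list.files("batch1_bams", pattern = "\\.bam$", full.names = TRUE)
annotations <- prepareAnnotations("reference.gtf")

# discovery only (quant = FALSE), saving read classes for later reuse
batchAnno <- bambu(reads = bamFiles, annotations = annotations,
                   genome = "genome.fa", rcOutDir = "read_classes",
                   quant = FALSE, ncore = 4)
saveRDS(batchAnno, "batch1_anno.rds")
```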
As a small note, I tested before with the default NDR, and each batch had an NDR > 0.6; but since I am working with human samples, I decided to fix NDR = 0.1. After this, I put together the extended annotations from the batches:
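A guess at what this combination step might have looked like (file names are hypothetical; note that a plain concatenation keeps transcripts discovered in more than one batch as duplicates):

```r
# per-batch extended annotations saved in the previous step
annoFiles <- c("batch1_anno.rds", "batch2_anno.rds", "batch3_anno.rds")
annoList  <- lapply(annoFiles, readRDS)

# naive concatenation of the GRangesList objects into one set of annotations
combinedAnno <- do.call(c, annoList)
```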
But then I got the same error as above; the first lines are:
I would appreciate any help with running this massive data set. |
Hi @NikoLichi, sorry for getting back late; we have been looking into the issue and have pushed a fix in a separate branch, https://github.com/GoekeLab/bambu/tree/patch_bigsamples, where we have rewritten the code relevant to the error. Could you please try it out and see if it works? If not, could you post the complete error output so that we can investigate further? To use this branch:
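One common way to install a specific branch from GitHub, assuming the remotes package is available (the exact command is not spelled out in the thread):

```r
# install the patch branch and load it
remotes::install_github("GoekeLab/bambu", ref = "patch_bigsamples")
library(bambu)
```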
Thank you |
Hi @cying111, Thanks for your reply. I did two tests with the new installation.
Both had the same error, which is a little weird: it did not find a function...
Kind regards, |
Hi @NikoLichi, sorry for the bug; I have pushed a fix to add the missing function to the patch_bigsamples branch. Could you try again to see if it works? Update: I pushed one more fix for a typo just now; please update your branch and try again. Thanks a lot! |
Hi @cying111, Alright, after 1 day and 13 hours of running, this is the issue that both of my jobs/approaches are presenting:
I hope this helps. |
Hi @NikoLichi, It's great to hear back. The error means that the changes I made are working, but there was a small typo in the code of the version you used, for which I pushed a fix shortly after I realized it. Could you run this again with the most recent version, which should have the typo fixed?
I think this time it should work. Hope to hear from you soon |
Hi @cying111, Sorry for not getting back to you sooner; I was running many tests and they took quite some time. First of all, the current patch does not show the previous error, so that part is solved, thanks! However, I found that bambu is memory-hungry even when using the lowMemory option. I did two tests, first using a reduced data set (1 million reads per sample) to get the parameters ready for the large data, like:
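A minimal sketch of what this two-step run on the subset might look like (bamFiles, bambuAnnotations, genomeFa, and the exact parameter values are hypothetical placeholders):

```r
# step 1: discovery on the subset BAMs, writing read classes to rcOutDir
extAnno <- bambu(reads = bamFiles, annotations = bambuAnnotations,
                 genome = genomeFa, rcOutDir = "rc_subset",
                 NDR = 0.1, quant = FALSE)

# step 2: quantification, reusing the saved read classes as `reads` and
# the newly extended annotations as `annotations`
rcFiles <- list.files("rc_subset", pattern = "\\.rds$", full.names = TRUE)
se <- bambu(reads = rcFiles, annotations = extAnno, genome = genomeFa,
            NDR = 0.3, lowMemory = TRUE, discovery = FALSE)
```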
This worked quite well. However, I would appreciate it if you could comment on the syntax, since I don't get why the read classes are used as reads and what goes into the annotations part. For the second approach, on all the data, I ran bambu in batches (i.e., 3), writing out the read classes and then calling them similarly to the above:
Later, when these were finished, I ran bambu again like:
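Presumably a quantification-only call over all saved read classes, along these lines (paths are hypothetical; combinedAnno is the merged annotation object from the batches):

```r
# all read classes written out by the batch runs
rcFiles <- list.files("read_classes", pattern = "\\.rds$", full.names = TRUE)

# quantification only, against the combined annotations
se <- bambu(reads = rcFiles, annotations = combinedAnno,
            genome = genomeFa, discovery = FALSE)
writeBambuOutput(se, path = "bambu_output")
```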
I have one last question: given bambu's large use of RAM, what do you recommend using for the quantification of the full data set? Sorry for my extensive piece of code and questions, but I want to be sure we are getting the data processed correctly. Thanks again! |
Hi @NikoLichi, Thanks for getting back with the good news; I am happy that the patch branch works. For your first test using the subset data, the code all looks good to me.
The first step actually does two things in one command: it processes the BAM files, performs junction correction, and collapses reads to read classes, which are then saved as rds files in the provided rcOutDir. And because discovery is set to TRUE by default, this step also returns the extended annotation object, which contains both the reference bambuAnnotations and the annotations newly discovered by bambu.
In your second step, you can then just use the preprocessed rds files as input, so that the BAM files do not need to be processed again, and extendAnnotations is supplied for quantification purposes. In this step, NDR = 0.3 and lowMemory = TRUE are actually not used, so you can leave them out. The rest is all correct and needed. For running all the data together, I want to clarify a few details with you first:
For the step that processes the BAM files in batches, you can set discovery = FALSE, which will save you some time, and you can also leave NDR = 0.3 out, so that this step only processes the BAM files. You can even do it per sample in bash, since the BAM processing is done per sample anyway: just loop through every sample, i.e., 480 batches instead of 3, which will help to reduce the memory usage; see the sketch below.
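A sketch of such a per-sample loop, written in R here for consistency (each iteration could equally be its own Rscript invocation or cluster job, since the samples are independent; variable names are hypothetical):

```r
# process each BAM file on its own: no discovery, no quantification,
# just junction correction and read classes written to rcOutDir
for (bam in bamFiles) {
  bambu(reads = bam, annotations = annotations, genome = genomeFa,
        rcOutDir = "read_classes", discovery = FALSE, quant = FALSE)
}
```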
If so, then for the quantification part:
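A sketch of what a single-sample quantification call might look like (paths are hypothetical; mergedAnno is the annotation object shared by all samples):

```r
# quantify one sample from its saved read classes against the shared annotations
se <- bambu(reads = "read_classes/sample1.rds", annotations = mergedAnno,
            genome = genomeFa, discovery = FALSE)
saveRDS(se, "quant/sample1_se.rds")
```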
Similarly, I would suggest looping through all the rds files one by one in bash, so that quantification is done per sample, which will save a lot of memory. Because the annotations (mergedAnno) are now the same for all samples, the results can also be easily combined and compared. This way, I believe you should be able to run it through. I hope this has clarified your questions! Thank you, and let us know if you have any doubts about the above or need any help! |
Hi @cying111, Thank you very much for your input and the details in the commands. It seems I can speed up the process a bit more with your comments!
Yes, this step was the bottleneck in R, but with your patch it seems to work well. I only got a warning. You mentioned that when running quantification for each sample,
the results can easily be merged. So, in this case, can the RangedSummarizedExperiment objects be merged with a simple vector? All the best, |
Hi @NikoLichi, Thanks for getting back so quickly; I am happy that the mergedAnno step runs through. The warning you see in the mergedAnno step suggests that bambuAnnotations already contains bambu-discovered transcripts, which should not be the case, as bambu discovery only happens in this step. Maybe you can double-check that bambuAnnotations is the direct output of prepareAnnotations on the gtf file? For the second question: yes, the RangedSummarizedExperiment objects can easily be combined; it just needs a few tweaks because of the metadata:
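A sketch of what the first tweak might look like, assuming the bambu output stores the incompatible counts under metadata(se)$incompatibleCounts and that seFiles is a hypothetical vector of per-sample rds paths:

```r
library(SummarizedExperiment)

seList <- lapply(seFiles, readRDS)

# set the incompatible counts aside, then clear the per-run metadata,
# which otherwise differs between objects and gets in the way of combining
incompatibleCounts <- lapply(seList, function(se) metadata(se)$incompatibleCounts)
seList <- lapply(seList, function(se) { metadata(se) <- list(); se })
```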
This will keep the incompatible counts information.
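Then, assuming all objects share the same rowRanges (the same mergedAnno), a sketch of the combining step:

```r
# combine the samples column-wise and reattach the incompatible counts
combinedSe <- do.call(cbind, seList)
metadata(combinedSe)$incompatibleCounts <- incompatibleCounts
```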
This will combine all the SEs. Let me know if you have questions regarding the above! |
Dear Bambu team,
I am running a massive project with 480 BAM files, ~4.8 TB of data in total.
Following the previous suggestions for bambu, I am first running the extended annotations (quant = FALSE), with the idea of running the quantification later in batches. However, there is a major issue when starting the extended annotations: the Error in vec_interleave_indices() from the title.
Is there anything I could do to run Bambu?
My code looks like:
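A minimal sketch of such a discovery-only run over the 480 BAM files (paths and core count are hypothetical placeholders):

```r
library(bambu)

# hypothetical paths for the 480 BAM files and the reference files
bamFiles    <- list.files("bams", pattern = "\\.bam$", full.names = TRUE)
annotations <- prepareAnnotations("reference.gtf")

# extended annotations only; quantification is deferred to later batches
extAnno <- bambu(reads = bamFiles, annotations = annotations,
                 genome = "genome.fa", rcOutDir = "read_classes",
                 quant = FALSE, ncore = 8)
```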
As an additional note, I also get the same warning message that others have reported in issue #407.
This is with R 4.3.2 and Bioc 3.18 / bambu (3.4.1).
Platform: x86_64-conda-linux-gnu (64-bit)
All the best,
Niko