Merging read classes across samples results in fewer novel transcripts #448
Hi Asher,

Yes, this might seem a bit unintuitive, but it is expected behavior.

For background: we found during multi-sample analysis that as you add more and more reads, you greatly increase the noise, meaning more false-positive transcripts are reported as novel (resulting in less accurate transcript discovery with more data). We therefore use the presence of known annotations in the ranking of transcript candidates to determine how many novel transcripts to return, in order to control for this sample size/read number bias. An NDR of 0.1 means a prediction score threshold is chosen such that 10% of the transcripts passing that threshold are novel.

If you add more samples and the ratio of highly scored novel transcripts to highly scored known transcripts increases, this then leads to a reduction in the total number of novel transcripts returned, which is what I suspect is occurring in your case. This is more likely if you are combining multiple sample types from different tissues, as highly scored known transcripts are usually present in multiple sample types, whereas novel transcripts are more likely to be niche.

We found that for analysis of a single human tissue, an NDR of 0.1 is sufficient; however, if you are combining multiple sample types, or you expect more novel transcripts, you may want to raise the NDR value. Alternatively, you can leave the NDR value unset and let Bambu suggest one, in which case it attempts to predict how many missing annotations are present in your sample and suggests an appropriate threshold. You may also want to run transcript discovery with an NDR of 1 and then manually filter the novel transcripts down to the desired number, depending on your research goals.

I hope this answers your question and clarifies how Bambu's multi-sample handling differs from the method stringtie merge uses, which is more similar to a union across samples.

Kind Regards,
Andre
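To make the three options above concrete, here is a minimal sketch using the bambu() interface; the file names are placeholders, and the exact arguments are worth checking against your installed version:

```r
library(bambu)

# Placeholder inputs: substitute your own BAM, GTF, and genome FASTA.
annotations <- prepareAnnotations("annotations.gtf")

# 1) Fixed threshold (sufficient for a single human tissue):
se <- bambu(reads = "sample1.bam", annotations = annotations,
            genome = "genome.fa", NDR = 0.1)

# 2) Leave NDR unset and let Bambu recommend a threshold:
se <- bambu(reads = "sample1.bam", annotations = annotations,
            genome = "genome.fa")

# 3) Keep all candidates (NDR = 1) and filter the novel transcripts
#    manually afterwards, depending on your research goals:
se <- bambu(reads = "sample1.bam", annotations = annotations,
            genome = "genome.fa", NDR = 1)
```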
Hi Andre,

Thank you so much for this very thoughtful response. The background really helps me wrap my head around this. I started to explore the distributions of NDRs for all the transcripts in my analysis, and there are a couple of points I was hoping you could help me understand:
Thanks again for your time and help.

Best wishes,
Asher
Hi Asher,

Kind Regards,
Andre
Hi Andre,

Thank you again, this is really helpful. For 1, this makes sense; thanks for explaining. For 2, got it, thanks for explaining this point as well. This makes a lot of sense too.

My last question related to this: if a max NDR of 0.4 is pretty low, is there anything I might try to increase the sensitivity in my analysis? And when you say you typically see a max NDR of 0.6-0.7, is that for cell lines or for human tissue samples? I could imagine that cell lines are more divergent than human tissue, even if the human tissue samples are derived from tumors. Also, could this potentially be related to improved Nanopore chemistry if the samples were recently collected?

Best wishes,
Asher
Hi Asher,

Ahh, sorry: when I said an NDR of 0.4 was low, I meant that it is abnormally good for a sample, since the lower the number, the fewer false positives are in the dataset. The maximum NDR does not have a bearing on sensitivity.

Yes, the samples I normally see higher max NDRs for are cancer cell lines, so I think your explanation is right, but improved chemistry with lower error rates and more full-length transcripts would contribute to this too.

There are some other things you can do to increase sensitivity, though we do not typically recommend them: setting remove.subsetTx to FALSE and min.readFractionByGene to 0 via the opt.discovery argument. The first means Bambu will also score and output subset transcripts (incomplete splice matches), which will drastically increase the number of false positives. The second stops filtering out transcripts that represent <0.05 (the default) of a gene's expression; this change won't be as bad for precision as the first, so I would attempt it first and see whether it raises sensitivity to acceptable levels for you. A sketch of both options follows below.

Kind Regards,
Andre
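A minimal sketch of the two opt.discovery options named above; the file names are placeholders, and the parameter names are taken from the comment above but are worth verifying against your Bambu version:

```r
library(bambu)

# Safer option: disable the per-gene expression-fraction filter
# (default min.readFractionByGene = 0.05).
se <- bambu(reads = "sample1.bam", annotations = annotations,
            genome = "genome.fa",
            opt.discovery = list(min.readFractionByGene = 0))

# Aggressive option: also score and output incomplete splice matches.
# Expect a drastic increase in false positives.
se <- bambu(reads = "sample1.bam", annotations = annotations,
            genome = "genome.fa",
            opt.discovery = list(remove.subsetTx = FALSE))
```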
Hi Andre,

Re: the low NDR. That's great; thanks for explaining, this really helps.

Re: increasing sensitivity. Okay, sounds good. I will give the latter, safer option a try first and see how it goes.

Thanks again for all your help.

Cheers,
Asher
Hi Andre,

Sorry for all the questions here, but upon examining these data further, I realized that when I look at the readCount column of the merged summarized experiment object from the 42 samples (there are only 42; the 44 I wrote before was a typo), the minimum readCount across all transcripts is 84. Unless this is a coincidence, this seems to suggest that merging in the way we discussed in issue #444 requires that any assembled transcript have >= 2 full-length reads in each sample (2 reads x 42 samples = 84). Is this some sort of artifact? I know that the default minimum read count is 2.

Thanks again,
Asher

PS: Also, the readCounts here are full-length read counts (vs. partial + full-length read counts), right?
Hi Asher,

No problem with all the questions; they have been insightful.

Re: the PS. The readCounts in the rowData do not represent the final quantification counts, and that column is actually one we might consider removing in the future, as it is a leftover from the transcript discovery step for candidates that are already present in the known annotations. These read counts have no use once you have the output from the quantification step. The partial + full-length read counts are found in assays(se)$counts.

Kind Regards,
Andre
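In other words, a quick sketch of where to look, assuming a standard SummarizedExperiment returned by the quantification step:

```r
library(SummarizedExperiment)

# Final quantification values live in the assays, not in rowData:
counts <- assays(se)$counts       # partial + full-length read counts
head(rowData(se)$readCount)       # discovery-step leftover; not for analysis
```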
Hi Andre,

Great, I'm glad. :)

Re: minimum read counts. Okay, sounds good; thanks for looking into this!

Re: interpretation of readCounts. Got it, thanks for explaining; this is very helpful. I had been hoping that maybe I did not need to run the final quantification step and could use these values instead, but it sounds like I should still run the final quantification step regardless.

Thanks again,
Asher
Hi Asher,

Sorry about the wait on this. I tried looking into it but wasn't able to replicate your observation: I was able to run multiple samples together and get $readCount entries with 2 in them. Even though I do not recommend using this number in your future analysis, it is still concerning that you are observing it, so it will be good to understand why before you continue further, in case it has unseen ramifications.

Could you paste your run script here so I can compare my parameters? A very silly troubleshooting question, but good to double check: are you sure you are providing 42 different samples, and not the same sample 42 times?

For me to look into this further I may need you to send me a minimal reproducible example so that I can debug it. One way to generate this is to load in two read class files, filter one to one chromosome and the other to a different chromosome (preferably small chromosomes), then run these two samples through discovery (with NDR = 1). There should be no overlap between the two inputs, so theoretically there should be $readCount values == 2. If you only have values >= 4, it will be a good example to send to me. You can shuffle the rowData(rcf)$readCount column of both read class files if needed for privacy reasons. A sketch of this recipe is below; let me know if this is possible or if you need more explanation.

Kind Regards,
Andre
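A hedged sketch of that recipe. The helper onChr() is hypothetical, and it assumes the read class files are SummarizedExperiment objects whose rowRanges is a GRangesList with each read class on a single chromosome; paths and chromosome names are placeholders:

```r
library(bambu)
library(SummarizedExperiment)

# Load two previously saved read class files (placeholder paths).
rcf1 <- readRDS("sample1_readClass.rds")
rcf2 <- readRDS("sample2_readClass.rds")

# Hypothetical helper: keep only read classes on one chromosome.
onChr <- function(rcf, chr) {
  as.character(seqnames(unlist(range(rowRanges(rcf))))) == chr
}
rcf1 <- rcf1[onChr(rcf1, "chr21"), ]
rcf2 <- rcf2[onChr(rcf2, "chr22"), ]

# Optional: shuffle read counts for privacy before sharing.
rowData(rcf1)$readCount <- sample(rowData(rcf1)$readCount)
rowData(rcf2)$readCount <- sample(rowData(rcf2)$readCount)

# Joint discovery; with no overlap between the two inputs,
# $readCount values of 2 should be possible in the result.
res <- bambu(reads = list(rcf1, rcf2), annotations = annotations,
             genome = "genome.fa", NDR = 1, quant = FALSE)
```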
Hi Andre,

No worries at all; thank you again for your help. I just took a look at my code, and I am very embarrassed to report that I was actually accidentally merging read classes from a single sample 42 times instead of merging all 42 samples. I am re-running now to verify that this was indeed the case, and I will let you know as soon as I can confirm. This may also explain why I found fewer transcript isoforms in the 42 samples vs the 9 samples (though I know there are other important factors, as you noted above)?

Also, are there no sample names included in the merged summarized experiment object if quantification mode is off? I was trying to check whether I could find the sample names somewhere in the summarized experiment before I re-ran.

Thank you for your suggestions here; I think you may have saved me a potentially huge headache in the future!

Best wishes,
Asher
Hi Asher,

I am glad it was something simple to fix. :)

Regarding sample names: for the read class list object you can find the sample names using colData(se). If you mean the output after discovery but before quantification, which is a GRangesList, the sample names are unfortunately not stored in that object.

Hope this helps,
Andre
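For reference, a sketch of those accessors; these are standard SummarizedExperiment calls, and se stands in for the read class list or quantified object:

```r
library(SummarizedExperiment)

colnames(se)    # sample names of the read class list / quantified object
colData(se)     # per-sample metadata, one row per sample
```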
Hi Andre,

Yes! It appears that this was indeed the issue; thank you again. :)

Re: sample names. Got it; I was referring to the GRangesList, but regardless, it does seem that the source of the issue was loading the same sample multiple times.

Looking at the results with the properly merged GRangesList, I do see that the number of novel transcripts with NDR < 0.1 is about the same in my 42-sample vs 9-sample analysis. However, many more novel transcripts are assembled in the 42-sample analysis, and the max NDR goes all the way up to ~0.85. I did notice that the default NDR is now higher, though; it's around ~0.18. I was wondering if you could explain why this has shifted upwards? Is this due to the additional non-canonical transcripts that were assembled in the 42-sample vs 9-sample analysis?

Thanks again,
Asher
Hi Asher,

Yes, if the default NDR shifts up, it suggests there are more highly ranked non-canonical transcript candidates ranking higher than known transcripts. It is not always because there are more novel transcripts overall, but the more data you have, the more likely you are to include more real novel transcripts. I hope this answers your question.

Kind Regards,
Andre
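One way to inspect this yourself, sketched under the assumption that the discovery output carries an NDR metadata column and that novel transcripts use Bambu-assigned names like "BambuTx..." (both version-dependent; extendedAnnotations is a placeholder name for the discovery-only GRangesList):

```r
library(GenomicRanges)

# Distribution of NDR scores across all transcript candidates.
ndr <- mcols(extendedAnnotations)$NDR
summary(ndr)
hist(ndr, breaks = 50, xlab = "NDR", main = "NDR distribution")

# Count novel isoforms passing a chosen threshold, e.g. NDR < 0.1.
novel <- grepl("^BambuTx", mcols(extendedAnnotations)$TXNAME)
sum(novel & ndr < 0.1, na.rm = TRUE)
```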
Hi Andre and the Bambu team,
I am encountering an issue related to issue #444. I merged read classes across 44 samples as you had suggested and found that about 1000 novel transcript isoforms were discovered. Originally, I had analyzed 9 of these samples using the same read classes and found 1300 novel transcript isoforms. In both cases, I fixed NDR = 0.1.
Why am I finding fewer novel transcript isoforms with the 44 samples vs the 9 samples? From my reading of your user manual on GitHub, I thought that in multi-sample mode the samples were still analyzed independently, and that the merge simply unified novel transcript IDs across samples. Is it possible that some of the putative novel transcripts ended up merged with canonical transcripts detected in the larger analysis? Further, how is this merge performed, and how does it differ from stringtie's merge method? I know you mentioned that using the latter would increase my false positive rate, but it is unclear to me how.
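For context, a rough sketch of the multi-sample workflow being referenced, with placeholder paths; it assumes the read-class-file pattern from the Bambu documentation rather than the exact commands discussed in issue #444:

```r
library(bambu)

annotations <- prepareAnnotations("annotations.gtf")

# Step 1 (per sample): have bambu write a read class file to rcOutDir
# as a side effect of processing each BAM.
for (bam in c("s01.bam", "s02.bam")) {   # ...one entry per sample
  bambu(reads = bam, annotations = annotations, genome = "genome.fa",
        rcOutDir = "readClassFiles/", quant = FALSE)
}

# Step 2: joint discovery across all samples' read classes at NDR = 0.1.
rcFiles <- list.files("readClassFiles/", full.names = TRUE)
extendedAnnotations <- bambu(reads = rcFiles, annotations = annotations,
                             genome = "genome.fa", NDR = 0.1, quant = FALSE)
```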
Thanks for your time and help.
Best,
Asher