Raise maximum barcode length #629
-
Hi, first of all congratulations on a great tool; the expansion to scRNA-seq analysis is especially appreciated!
I was wondering what the reason is for setting an upper limit on the barcode length in alevin. Would longer barcodes affect the computation in some manner? We are working with barcodes of length 27, which are incompatible with the hardcoded upper barcode length limit here. I manually raised the limit in a modified alevin version, and the final output looks as expected, so if there is no risk that I am unaware of, would you consider raising the barcode length limit or removing it altogether?
Thank you for your help!
Describe the bug: The barcode length upper limit is hardcoded here.
Replies: 8 comments 1 reply
-
Hi @kaukrise, thanks for the very interesting question. I don't think there is any theoretical limit with respect to alevin's method; however, it would be interesting to check how alevin's running time changes as the CB length increases. The length bound of 20 was just a sanity check and can be increased, as you already did.
-
Thank you for the swift answer! We are working with BD Rhapsody, which uses a complex barcode structure (you can read about this in their bioinformatics handbook on page 14). The extracted, combined CB is 27 bp long, which is why the default sanity check was too low for our purposes.
In terms of cell numbers, BD Rhapsody actually appears to generate a lot of "false-positive cells" (we are seeing up to 90% false positives). This is expected, and is also mentioned in their bioinformatics handbook (pages 23-25), but it appears to be an issue for alevin's cell detection: with standard settings, the number of detected cells is approximately two orders of magnitude lower than expected.
In terms of performance, a complete alevin run on 150M reads (25k expected cells) takes around 1.5 hours using 10 threads, which is perfectly reasonable for us.
-
My apologies for the late reply; somehow I missed your response. Closing this issue, but feel free to reopen.
-
Could you please increase the maximum barcode length, as per this issue [i.e. here]? Salmon is still complaining when I use a converted file with the linker sequences spliced out, having a concatenated cell barcode length of 27, and there doesn't seem to be any user-definable option to modify or ignore the limit:
|
-
As an alternative, documenting the geometry format for specifying custom barcodes would be helpful. This seems to avoid the barcode length issue. From what I can tell, the format is
|
-
Hi @gringer,
Yes, we can add a section for this in the docs. It will replace the old way of specifying geometry soon, as it's just easier and more flexible. We talk about it in the 1.4.0 release notes. I copy the relevant info below (@k3yavi pulled for the 1-based indexing and won out ... this time):
generic barcode / umi / read geometry syntax: Alevin learned to support a generic syntax to specify the read sequence that should be used for barcodes, UMIs and the read sequence. The syntax allows one to specify how the pattern corresponding to the barcode, UMI, and read sequence should be pieced together, and the syntax is meant to be intuitive and general. For example, one can specify the 10Xv2 geometry in the following manner using the generic syntax:
--read-geometry 2[1-end] --bc-geometry 1[1-16] --umi-geometry 1[17-26]
This specifies that the "sequence" read (the biological sequence to be aligned) comes from read 2, and that it spans from index 1 (this syntax uses 1-based indexing) until the end of the read. Likewise, the barcode derives from read 1 and occupies positions 1-16, and the UMI comes from read 1 and occupies positions 17-26.
The syntax can specify multiple ranges, and they will simply be concatenated together to produce the string. For example, one could specify --bc-geometry 1[1-8,16-23] to designate that the barcode should be taken from the substring in positions 1-8 of read 1 followed by the substring in positions 16-23 of read 1. It is even possible to have the string pieced together across both reads, but that functionality is only available if you are running with --rad or --sketch and preparing a RAD file for alevin-fry. If you are running classic alevin, the barcode must reside on a single read.
The robust parsing of the flexible geometry syntax is made possible by the cpp-peglib project.
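To make the 1-based, inclusive semantics of the geometry strings concrete, here is a minimal Python sketch that interprets single-read specifications such as 1[1-16] or 1[1-8,16-23]. It is only an illustration of the syntax as described above, not alevin's actual parser (which is built on cpp-peglib); the function names parse_geometry and extract are made up for this example, and specifications that span both reads are deliberately not handled.

```python
import re


def parse_geometry(spec):
    """Parse a single-read spec like '1[1-8,16-23]' or '2[1-end]'.

    Returns (read_number, [(start, end), ...]) with 1-based inclusive
    positions; end is None when the range runs to the end of the read.
    Hypothetical helper for illustration, not part of salmon/alevin.
    """
    m = re.fullmatch(r"([12])\[([^\]]+)\]", spec.strip())
    if m is None:
        raise ValueError(f"unrecognised geometry: {spec!r}")
    read_number = int(m.group(1))
    ranges = []
    for part in m.group(2).split(","):
        r = re.fullmatch(r"(\d+)-(\d+|end)", part.strip())
        if r is None:
            raise ValueError(f"unrecognised range: {part!r}")
        start = int(r.group(1))
        end = None if r.group(2) == "end" else int(r.group(2))
        if start < 1 or (end is not None and end < start):
            raise ValueError(f"invalid range: {part!r}")
        ranges.append((start, end))
    return read_number, ranges


def extract(spec, read1, read2):
    """Concatenate the substrings that the spec selects from one read."""
    read_number, ranges = parse_geometry(spec)
    seq = read1 if read_number == 1 else read2
    pieces = []
    for start, end in ranges:
        stop = len(seq) if end is None else end
        if stop > len(seq):
            raise ValueError(f"range {start}-{stop} exceeds read length {len(seq)}")
        # 1-based inclusive [start, stop] maps to the Python slice [start-1:stop].
        pieces.append(seq[start - 1:stop])
    return "".join(pieces)


# Toy reads (made up): a 16-base barcode followed by a 10-base UMI on read 1.
read1 = "ACGTACGTACGTACGT" + "TTTTTTTTTT"
read2 = "GGGCCC"
barcode = extract("1[1-16]", read1, read2)
umi = extract("1[17-26]", read1, read2)
split_barcode = extract("1[1-8,16-23]", read1, read2)
sequence = extract("2[1-end]", read1, read2)
```

With a multi-range specification the selected pieces are joined in the order given, so the length of the resulting barcode is the sum of the range lengths (8 + 8 = 16 for 1[1-8,16-23]).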
-
Oh, excellent, thanks. The multi-range format will be useful for Rhapsody reads. |
-
Hey @rob-p and @k3yavi - thanks for adding support for these alternative barcode geometries, described here and here! It will likely work for SPLiT-seq/SplitBio/ParseBio data, which is very exciting.
One thing that is either unclear to me or would be great to add is how to handle the fact that two RT barcodes may exist per biological sample. In other words, these plates are set up such that each sample well has both an oligo-dT barcode and a random hexamer barcode, to enable both amplification of the 3' end and internal priming within the transcript. The end result is that you can get two barcodes that actually represent the exact same cell, with differing BC1s but identical BC2 and BC3, and these should ideally be merged at a fairly early step in processing. For example, you could have a particular cell represented by these two sets of barcodes:
The way other algorithms permit merging of these is by supplying a separate file that pairs oligo-dT and random hexamer barcodes, like zUMIs does here. In this toy example above, one line of that file would have:
Though it is possible to do this manually and collapse these by taking the sum after the cells x genes matrix is created, I am reasonably sure it's better to merge them prior to any filtering or cell detection, since it will change the number of reads per cell/gene. What are your thoughts on this? Do you think it's possible to incorporate a merge step like this if one doesn't already exist?
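For reference, the merge described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not something alevin or zUMIs provides in this form: the barcode sequences are made up, the combined cell barcode is assumed to start with an 8-base BC1, and the names canonical_barcode, merge_counts and bc1_pairs are hypothetical. The same canonical_barcode function could be applied to each read's barcode before quantification; merge_counts shows the simpler post-hoc variant that sums an existing count table.

```python
from collections import defaultdict

# Assumed layout for this sketch: BC1 occupies the first 8 bases of the
# concatenated cell barcode. Adjust to the real barcode structure.
BC1_LEN = 8


def canonical_barcode(barcode, bc1_pairs, bc1_len=BC1_LEN):
    """Rewrite a random-hexamer BC1 to its paired oligo-dT BC1.

    bc1_pairs maps random-hexamer BC1 -> oligo-dT BC1; barcodes whose
    BC1 is not in the mapping are returned unchanged.
    """
    bc1, rest = barcode[:bc1_len], barcode[bc1_len:]
    return bc1_pairs.get(bc1, bc1) + rest


def merge_counts(counts, bc1_pairs):
    """Sum (barcode, gene) counts after canonicalising each barcode."""
    merged = defaultdict(int)
    for (barcode, gene), n in counts.items():
        merged[(canonical_barcode(barcode, bc1_pairs), gene)] += n
    return dict(merged)


# Made-up example: two barcodes that differ only in BC1 but share BC2/BC3.
bc1_pairs = {"TTTTTTTT": "AAAAAAAA"}  # random hexamer -> oligo-dT
counts = {
    ("AAAAAAAACCCCCCCCGGGGGGGG", "geneA"): 3,
    ("TTTTTTTTCCCCCCCCGGGGGGGG", "geneA"): 2,
    ("TTTTTTTTCCCCCCCCGGGGGGGG", "geneB"): 1,
}
merged = merge_counts(counts, bc1_pairs)
```

Summing after quantification does not revisit UMI deduplication or cell detection, which is the reason given above for preferring a merge before those steps.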