Raise maximum barcode length #629

kaukrise · 2019-11-06T11:00:10Z

kaukrise
Nov 6, 2019

Hi,

first of all congratulations on a great tool, the expansion to scRNA-seq analysis is especially appreciated!

I was wondering what the reason for setting an upper limit on the barcode length in alevin is - would longer barcodes affect the computation in some manner? We are working with barcodes of length 27, which are incompatible with the hardcoded upper barcode length limit here.

I manually raised the limit on a modified alevin version, and the final output looks as expected, so if there is no risk that I am unaware of, would you consider raising or removing the barcode length limit altogether?

Thank you for your help!

Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?
alevin

Describe the bug
Using the manual barcode and UMI specification with --end, --barcodeLength, and --umiLength fails for barcodes longer than 20 with the error message:

Barcode length (27) was not in the required length range [1, 20].

The barcode length upper limit is hardcoded here.

To Reproduce
In Salmon 1.0.0, run salmon alevin [...] --end 5 --barcodeLength 27 --umiLength 8 (or any barcodeLength value above 20).

Which version of salmon was used? 1.0.0
How was salmon installed (compiled, downloaded executable, through bioconda)? bioconda
Which reference (e.g. transcriptome) was used? not relevant
Which read files were used? not relevant
Which program options were used? --end 5 --barcodeLength 27 --umiLength 8

Expected behavior
Ideally, barcodes longer than 20 would be processed as normal.

Desktop (please complete the following information):

OS: Mac OS X
Version: 10.14.6 (18G103)

Additional context
See top of post

Answered by rob-p

Feb 12, 2021

k3yavi · 2019-11-06T16:32:47Z

k3yavi
Nov 6, 2019
Collaborator

Hi @kaukrise ,

Thanks for the very interesting question. I don't think there is any theoretical limit with respect to alevin's method; however, it would be interesting to check how alevin's running time changes as the CB length increases. The bound of 20 was just a sanity check and can be increased, as you already did.
I'd be very interested, if possible, in hearing back about your experience with alevin using longer CBs, both in terms of running time and the gene expression estimates generated. Also, if I may ask, what's the reason for using such a long CB? Are you expecting a very large number of real cells? If so, we can think about improving alevin even more. In my experience, we have generally seen individual 10x experiments with ~20k cells at most; even the 1.3M dataset is 164 separate experiments.

0 replies

kaukrise · 2019-11-07T13:41:19Z

kaukrise
Nov 7, 2019
Author

Thank you for the swift answer!

We are working with BD Rhapsody, which uses a complex barcode structure (you can read about this in their bioinformatics handbook on page 14). The extracted, combined CB is 27bp long, which is why the default sanity check was too low for our purposes.

In terms of cell numbers, BD Rhapsody actually appears to generate a lot of "false-positive cells" (we are seeing up to 90% false positives). This is expected, and also mentioned in their bioinformatics handbook (pages 23-25), but it appears to be an issue for the alevin cell detection: with standard settings, the number of detected cells is approximately two orders of magnitude lower than expected. Using --expectCells improves matters drastically, however. We have opted for removing the false positives in post-processing ourselves; the low count depth population is very easily identifiable.

In terms of performance, a complete alevin run on 150M reads (25k expected cells) takes around 1.5 hours using 10 threads, which is perfectly reasonable for us.

0 replies

k3yavi · 2019-12-13T04:44:39Z

k3yavi
Dec 13, 2019
Collaborator

My apologies for the late reply, somehow I missed the reply.
I am glad to hear that and thanks for testing alevin with BD Rhapsody.
Let us know if you need help with anything else, we'd be happy to help.

Closing this issue but feel free to reopen.

0 replies

gringer · 2021-02-11T22:31:55Z

gringer
Feb 11, 2021

Could you please increase the maximum barcode length, as per this issue [i.e. here]? Salmon is still complaining when I use a converted file with the linker sequences spliced out, having a concatenated cell barcode length of 27, and there doesn't seem to be any user-definable option to modify or ignore the limit:

gringer@elegans:/mnt/ufds/jmayer$ salmon alevin -l ISR -1 normalised_H2GYLDRXY_1_210203_FD09251586_Other_CGAGGCTG_R_210203_DAVGAL_INDEXLIBNOVASEQ_M001_R1.fastq.gz normalised_H2GYLDRXY_2_210203_FD09251586_Other_CGAGGCTG_R_210203_DAVGAL_INDEXLIBNOVASEQ_M001_R1.fastq.gz -2 H2GYLDRXY_1_210203_FD09251586_Other_CGAGGCTG_R_210203_DAVGAL_INDEXLIBNOVASEQ_M001_R2.fastq.gz -2 H2GYLDRXY_2_210203_FD09251586_Other_CGAGGCTG_R_210203_DAVGAL_INDEXLIBNOVASEQ_M001_R2.fastq.gz -i /mnt/ufds/salmon/gencode_M23/salmon_1.4.0_decoy_M23 -p 10 -o salmon_1.4_5_27_8_JM_2021-02-12 --tgMap txp2gene.txt --end 5 --barcodeLength 27 --umiLength 8
[2021-02-12 11:01:37.654] [alevinLog] [warning] Note: the use of --end, --barcodeLength and --umiLength to describe the barcode and umi geometry is deprecated. Please start using the `--barcode-geometry` and `--umi-geometry` options instead.
[2021-02-12 11:01:37.655] [alevinLog] [error] Barcode length (27) was not in the required length range [1, 20].
Exiting now.

0 replies

gringer · 2021-02-12T00:15:39Z

gringer
Feb 12, 2021

As an alternative, documenting the geometry format for specifying custom barcodes would be helpful. This seems to avoid the barcode length issue.

From what I can tell, the format is <readNum>[start-end], i.e. for my case:

--umi-geometry '1[28-35]' --bc-geometry '1[1-27]' --read-geometry '2[1-end]'
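
For anyone converting an old-style `--barcodeLength` / `--umiLength` call, the arithmetic behind the strings above can be sketched in Python. This is only an illustration, not part of salmon: `legacy_to_geometry` is a hypothetical helper, and it assumes the barcode starts at position 1 of read 1 with the UMI following it directly (as in the `--end 5` command quoted earlier) and that read 2 is used in full.

```python
def legacy_to_geometry(barcode_length: int, umi_length: int) -> dict:
    """Translate a barcode-then-UMI layout on read 1 into geometry strings.

    Assumption (not taken from salmon's code): the barcode occupies the
    first `barcode_length` bases of read 1, the UMI follows immediately,
    and the biological sequence is all of read 2.
    """
    if barcode_length < 1 or umi_length < 1:
        raise ValueError("lengths must be positive")
    # Positions are 1-based and inclusive, as in the geometry syntax.
    umi_start = barcode_length + 1
    umi_end = barcode_length + umi_length
    return {
        "--bc-geometry": f"1[1-{barcode_length}]",
        "--umi-geometry": f"1[{umi_start}-{umi_end}]",
        "--read-geometry": "2[1-end]",
    }


if __name__ == "__main__":
    for flag, value in legacy_to_geometry(27, 8).items():
        print(flag, repr(value))
```

With a barcode length of 27 and a UMI length of 8 this gives `1[1-27]` and `1[28-35]`, matching the options above.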

0 replies

rob-p · 2021-02-12T00:22:57Z

rob-p
Feb 12, 2021
Maintainer

Hi @gringer,

Yes, we can add a section for this in the docs. It will replace the old way for specifying geometry soon, as it's just easier and more flexible. We talk about it in the 1.4.0 release notes. I copy the relevant info below (@k3yavi pulled for the 1-based indexing and won out ... this time):

generic barcode / umi / read geometry syntax : Alevin learned to support a generic syntax to specify the read sequence that should be used for barcodes, UMIs and the read sequence. The syntax allows one to specify how the pattern corresponding to the barcode, UMI, and read sequence should be pieced together, and the syntax is meant to be intuitive and general. For example, one can specify the 10Xv2 geometry in the following manner using the generic syntax:

--read-geometry 2[1-end] --bc-geometry 1[1-16] --umi-geometry 1[17-26]

This specifies that the "sequence" read (the biological sequence to be aligned) comes from read 2, and it spans from the first index 1 (this syntax used 1-based indexing) until the end of the read. Likewise, the barcode derives from read 1 and occupies positions 1-16, and the UMI comes from read 1 and occupies positions 17-26. The syntax can specify multiple ranges, and they will simply be concatenated together to produce the string. For example, one could specify --bc-geometry 1[1-8,16-23] to designate that the barcode should be taken from the substring in positions 1-8 of read 1 followed by the substring in positions 16-23 of read 1. It is even possible to have the string pieced together across both reads, but that functionality is only available if you are running with --rad or --sketch and preparing a RAD file for alevin-fry. If you are running classic alevin, the barcode must reside on a single read. The robust parsing of the flexible geometry syntax is made possible by the cpp-peglib project.
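
To make the 1-based, inclusive ranges and the concatenation rule concrete, here is a minimal Python sketch. It is not salmon's parser (which, per the note above, is built on cpp-peglib); `parse_geometry` and `extract` are hypothetical names, the regular expression only covers the single-read forms shown in this thread, and the read strings in the usage note are made up.

```python
import re

# Matches e.g. "1[1-16]" or "1[1-8,16-23]" or "2[1-end]" (single read only).
GEOMETRY_RE = re.compile(r"^([12])\[(.+)\]$")


def parse_geometry(spec: str):
    """Return (read_number, [(start, end), ...]) with 1-based inclusive bounds.

    `end` is None when the range runs to the end of the read.
    """
    match = GEOMETRY_RE.match(spec.strip())
    if match is None:
        raise ValueError(f"unrecognised geometry: {spec!r}")
    read_number = int(match.group(1))
    ranges = []
    for piece in match.group(2).split(","):
        start_text, sep, end_text = piece.partition("-")
        if not sep:
            raise ValueError(f"range needs start-end: {piece!r}")
        start = int(start_text)
        end = None if end_text == "end" else int(end_text)
        if start < 1 or (end is not None and end < start):
            raise ValueError(f"bad range: {piece!r}")
        ranges.append((start, end))
    return read_number, ranges


def extract(spec: str, read1: str, read2: str) -> str:
    """Concatenate the substrings that `spec` selects from the chosen read."""
    read_number, ranges = parse_geometry(spec)
    read = read1 if read_number == 1 else read2
    # 1-based inclusive [start, end] maps to the Python slice [start-1:end];
    # an end of None slices through to the end of the read.
    return "".join(read[start - 1:end] for start, end in ranges)
```

For a toy read 1 of `ACGTACGT` + `NNNNNNN` + `TTGGCCAA` + `CC`, `extract("1[1-8,16-23]", read1, read2)` returns `ACGTACGTTTGGCCAA`, skipping the seven bases in positions 9-15. Note that Python slicing silently truncates ranges that run past the end of a read; a real tool would presumably want to reject such reads instead.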

1 reply

connersk Jun 20, 2024

Hi @rob-p! Thanks for this great tool! I'm trying to use salmon alevin with a novel single-cell chemistry where the cell barcode is 34 base pairs long and I'm getting an error that 32 base pairs is now the longest cell barcode allowed.

I'm happy to make a fork of salmon and modify the cell barcode length but I'm not sure what I need to modify. Could you provide any guidance on what I'd need to modify?

gringer · 2021-02-12T00:37:25Z

gringer
Feb 12, 2021

Oh, excellent, thanks. The multi-range format will be useful for Rhapsody reads.

0 replies

jeremymsimon · 2021-05-24T16:03:13Z

jeremymsimon
May 24, 2021

Hey @rob-p and @k3yavi - thanks for adding support for these alternative barcode geometries, described here and here! It likely will work for SPLiT-seq/SplitBio/ParseBio data, which is very exciting. One thing that is either unclear to me or may be great to add is how to handle the fact that two RT barcodes may exist per biological sample. In other words, these plates are set up such that each sample well has both an oligo-dT barcode and random hexamer barcode to enable amplification of both the 3' end as well as prime internally in the transcript. The end result is that you can get two barcodes that actually represent the exact same cell, with differing BC1s but identical BC2 and BC3, and thus should ideally be merged at a fairly early step in processing.

For example, you could have a particular cell represented by these two sets of barcodes:

ACTCGTAA-GACAAAGC-TCTTAATC
CTGCTTTG-GACAAAGC-TCTTAATC

The way other algorithms permit merging of these is by supplying a separate file that pairs oligo-dT and random hexamer barcodes, like zUMIs does here. In this toy example above, one line of that file would have:
ACTCGTAA CTGCTTTG

Though it is possible to do this manually and collapse these by taking the sum after the cells x genes matrix is created, I am reasonably sure it's better to merge them prior to any filtering or cell detection, since it will change the number of reads per cell/gene.

What are your thoughts on this? Do you think it's possible to incorporate a merge step like this if one doesn't already exist?
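
As a rough illustration of the merge described above (not an existing alevin feature), the collapsing step could look like the following Python sketch. The helper names are hypothetical, the pairing file is assumed to list the oligo-dT barcode first and the random hexamer barcode second on each line, BC1 is assumed to be the first hyphen-separated segment, and the read counts in the usage note are invented.

```python
from collections import Counter


def load_pairs(lines):
    """Map each random-hexamer BC1 to its paired oligo-dT BC1.

    Assumes whitespace-separated lines: <oligo-dT BC1> <random hexamer BC1>.
    """
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        oligo_dt, random_hexamer = fields
        mapping[random_hexamer] = oligo_dt
    return mapping


def canonical_barcode(barcode, hexamer_to_dt):
    """Rewrite BC1 (the first segment) to its oligo-dT partner if it has one."""
    bc1, rest = barcode.split("-", 1)
    return "-".join([hexamer_to_dt.get(bc1, bc1), rest])


def merge_counts(counts, hexamer_to_dt):
    """Sum per-barcode counts after collapsing paired BC1s onto one barcode."""
    merged = Counter()
    for barcode, n in counts.items():
        merged[canonical_barcode(barcode, hexamer_to_dt)] += n
    return merged
```

With the pairing line `ACTCGTAA CTGCTTTG` and invented counts of 120 and 45 reads for the two barcodes in the example, `merge_counts` yields a single entry `ACTCGTAA-GACAAAGC-TCTTAATC` with 165 reads. Applying the same rewrite to barcodes before cell detection, rather than to the final matrix, is the ordering argued for above.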

0 replies
