
slamdunk all command treats paired-end reads as separate samples #165

Open
hullv opened this issue Dec 14, 2024 · 6 comments
@hullv

hullv commented Dec 14, 2024

Hello!

I am trying to analyze paired-end sequencing data with SLAM-DUNK using the slamdunk all command. However, it seems that SLAM-DUNK is treating my R1 and R2 reads as separate samples instead of as paired-end data.

My command is: slamdunk all -r */hg38.fa -b */human/3utr.bed -o /output/ -5 0 -t 30 -ss -rl 150 s4u/.fq.gz

I expected SLAM-DUNK to recognize and process these files as paired-end data. However, the output directory contains separate results for each FASTQ file, as if they were independent samples.

My output files look like this:
dtt_R1.fq_slamdunk_mapped.bam
dtt_R2.fq_slamdunk_mapped.bam

How can I correctly specify paired-end data with the slamdunk all command? Or should I use separate slamdunk align commands for R1 and R2 followed by the other subcommands?

@isaacvock

isaacvock commented Dec 16, 2024

While I am not affiliated with SLAMDUNK, I can tell you it was only designed to analyze single-end data. Analyzing paired-end data has been discussed in several previous issues (like #147, #148, #57).

To summarize some of the suggestions in those posts, you could consider:

  1. Only processing a single read from each pair. Suboptimal, but you could probably also do as you suggested and analyze all reads separately, then merge the final SLAMDUNK output for the read pairs (see the sketch after this list).
  2. Using a tool designed to work more seamlessly with paired-end data, such as rnalib, GRAND-SLAM, or fastq2EZbakR.
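To make option 1 concrete, here is a minimal sketch of what merging could look like after running slamdunk separately on R1 and R2. This is not an official SLAMDUNK feature; the file paths and column names are assumptions based on typical slamdunk count output and should be checked against your actual files.

```python
# Minimal sketch (not an official SLAMDUNK feature): combine the per-UTR
# count tables produced by running slamdunk independently on R1 and R2.
# File paths and column names are assumptions; adjust to your real
# slamdunk count output (a tab-separated table with '#' comment lines).
import pandas as pd

def read_tcount(path):
    # slamdunk count files start with '#' comment lines; skip them.
    return pd.read_csv(path, sep="\t", comment="#")

r1 = read_tcount("count/dtt_R1.fq_slamdunk_mapped_filtered_tcount.tsv")
r2 = read_tcount("count/dtt_R2.fq_slamdunk_mapped_filtered_tcount.tsv")

keys = ["Chromosome", "Start", "End", "Name", "Strand"]  # assumed key columns
merged = r1.merge(r2, on=keys, suffixes=("_R1", "_R2"))

# Sum raw read counts across mates and recompute a pooled conversion rate
# from the summed T coverage and T->C conversion counts.
# Caveat: if both mates of a pair cover the same UTR, simple summation
# counts that pair twice.
merged["ReadCount"] = merged["ReadCount_R1"] + merged["ReadCount_R2"]
merged["TcReadCount"] = merged["TcReadCount_R1"] + merged["TcReadCount_R2"]
merged["ConversionRate"] = (
    (merged["ConversionsOnTs_R1"] + merged["ConversionsOnTs_R2"])
    / (merged["CoverageOnTs_R1"] + merged["CoverageOnTs_R2"]).clip(lower=1)
)

merged[keys + ["ReadCount", "TcReadCount", "ConversionRate"]].to_csv(
    "count/dtt_merged_tcount.tsv", sep="\t", index=False
)
```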

Best,
Isaac

@hullv

hullv commented Dec 16, 2024

Dear Isaac,

Thank you for your detailed response regarding SLAMDUNK and paired-end data analysis.

Your suggestions are very helpful. Based on your recommendations, I'm considering two options:

  1. Using one read from each pair as you suggested, though I understand this may not be the optimal solution.

  2. Switching to one of the tools you recommended that are specifically designed for paired-end data. I'll look into rnalib, GRAND-SLAM, and fastq2EZbakR to see which one best suits my research needs.

Could you possibly provide any additional insights about the relative strengths of these three tools for paired-end data analysis? This would help me make a more informed decision.

Thank you again for your assistance.

Best regards,
Wang

@isaacvock

isaacvock commented Dec 16, 2024

Hello Wang,

Full disclosure, I developed fastq2EZbakR and am not affiliated with the other two, but will try to provide an unbiased view of the pros and cons of each.

  1. rnalib: this is an easy-to-install, well-designed, and highly flexible Python package for processing RNA-seq and SLAM-seq data. A downside of rnalib is its nascent RNA quantification strategy. For a SLAM-seq analysis, one of the main tasks is to determine how many reads are likely to have come from labeled vs. unlabeled RNA. To my knowledge, rnalib, like SLAMDUNK, only supports simple mutational cutoffs that call reads with at least a certain number of mutations "labeled" and those with fewer "unlabeled". I have discussed the suboptimality of this strategy here, and in the preprint describing my SLAM-seq analysis toolkit.
  2. GRAND-SLAM: A better way to analyze SLAM-seq data is to use mixture modeling, and GRAND-SLAM was the first tool to implement such a strategy (though the strategy was originally described in our lab's TimeLapse-seq paper; TimeLapse-seq is an alternative to SLAM-seq that works similarly and was developed around the same time). The GRAND-SLAM paper is here. It is an impressively performant tool and has a helper R package, grandR, that can perform downstream analyses of GRAND-SLAM output. A downside of GRAND-SLAM is that it does not provide the intermediate processed mutational data that is passed to its mixture models, which makes it difficult to interrogate model fit and to catch cases where the assumptions of its model are not met. (A toy sketch of the mixture-modeling idea appears after this list.)
  3. fastq2EZbakR: fastq2EZbakR supports mixture modeling like GRAND-SLAM, but has the advantage of providing the processed mutational data in a form that can be easily analyzed with EZbakR, a companion R package that I also developed. It is also more flexible in some of the ways it can process SLAM-seq data; in particular, it can assign reads to a more diverse array of genomic features and thus allows higher-resolution dissection of certain SLAM-seq datasets (e.g., total RNA libraries). A downside of fastq2EZbakR is that it is not as computationally performant (from both an efficiency and a RAM-usage perspective) as GRAND-SLAM, and it is a Snakemake pipeline, which has some advantages but makes it overall a bit clunkier to install and work with than the other two lightweight software packages described above.
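To illustrate the mixture-modeling idea from point 2, here is a toy sketch of fitting a two-component binomial mixture to per-read conversion data with EM. This is my own illustration under simplified assumptions, not the actual GRAND-SLAM or EZbakR implementation; all names and parameter values are made up for the example.

```python
# Toy two-component binomial mixture: each read covers n_i uridines
# (T positions) and shows k_i T->C conversions. Unlabeled reads convert at
# a low background/error rate p_bg; labeled reads at a higher rate p_lab.
# EM estimates the rates and the labeled fraction, instead of applying a
# hard mutation cutoff. Not the GRAND-SLAM or EZbakR code, just a sketch.
import numpy as np
from scipy.stats import binom

def fit_binom_mixture(k, n, n_iter=200):
    k, n = np.asarray(k), np.asarray(n)
    pi, p_bg, p_lab = 0.5, 0.001, 0.05  # initial guesses
    for _ in range(n_iter):
        # E-step: posterior probability that each read came from labeled RNA
        lab = pi * binom.pmf(k, n, p_lab)
        bg = (1 - pi) * binom.pmf(k, n, p_bg)
        w = lab / (lab + bg)
        # M-step: update the mixing fraction and the two conversion rates
        pi = w.mean()
        p_lab = np.sum(w * k) / max(np.sum(w * n), 1e-12)
        p_bg = np.sum((1 - w) * k) / max(np.sum((1 - w) * n), 1e-12)
    return pi, p_bg, p_lab, w

# Simulated example: 40% of reads labeled at 5% conversion, 0.1% background
rng = np.random.default_rng(1)
n = rng.integers(5, 40, size=10000)           # uridines covered per read
labeled = rng.random(10000) < 0.4
k = rng.binomial(n, np.where(labeled, 0.05, 0.001))
print(fit_binom_mixture(k, n)[:3])            # ~ (0.4, 0.001, 0.05)
```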

I hope this helps,
Isaac

@isaacvock

I'll also try to tag the developers of the other packages in case they want to weigh in: @popitsch @florianerhard

@popitsch

Hi Wang and Isaac,
I'm the developer of rnalib; thanks for including me in this thread.

Rnalib is meant to be(come) a general Python library for the analysis of RNAseq data, so nucleotide-conversion sequencing (NC-seq) is not its main focus (I am rather working on a more specialized package on top of rnalib for this purpose).

You can, however, use rnalib to annotate the number of convertible positions and the number of actual conversions per read using its tag_tc() tool [*]. You can then count and export per annotation of your choice (genes, transcripts, exons, etc.); example code is included in the rnalib SLAMseq tutorial.

In this simplified SLAMseq tutorial, reads are then indeed simply classified into 'new' (one or more T-to-C conversions) and 'old' (no conversions), which is inaccurate, as Isaac correctly points out: it misclassifies 'new' reads that by chance contain no T-to-C conversion, and these are a rather large fraction given the low 4sU concentrations in typical SLAMseq experiments. A better way is indeed to fit these data with a binomial mixture model as implemented, e.g., in GRAND-SLAM and EZbakR (NB: I have zero experience with the latter tool). I will try to add a section demonstrating this to my SLAMseq tutorial.

So, in a nutshell, rnalib is very flexible but it’s not an end-to-end tool for NC-seq data, so you might resort to one of the approaches mentioned above unless you are interested in developing a new method :)

BW niko


[*] tag_tc() takes a BAM file that contains MD tags (e.g., added by a mapper such as STAR or via samtools calmd) and adds two tags (xt and xc) containing the number of T's (convertible positions) and the number of T-to-C conversions, respectively, while masking T-to-C SNPs and filtering for mapping and/or per-base qualities; see https://github.com/popitsch/rnalib/blob/main/rnalib/tools.py for details.
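For concreteness, once tag_tc() has written the xt/xc tags, they could be tabulated with pysam roughly as below. This is a sketch, not rnalib code: it assumes a tagged BAM file named tagged.bam and integer xt/xc tags as described above, and it applies exactly the naive ">=1 conversion" cutoff whose limitations are discussed in this thread.

```python
# Sketch: tally convertible positions (xt) and T->C conversions (xc) as
# written by rnalib's tag_tc(), then apply the naive >=1-conversion
# classification discussed above. Assumes the BAM was already processed
# by tag_tc(); tag names follow the description in this thread.
import pysam

new_reads, old_reads = 0, 0
with pysam.AlignmentFile("tagged.bam", "rb") as bam:
    for read in bam:
        if read.is_unmapped or not (read.has_tag("xt") and read.has_tag("xc")):
            continue
        convertible = read.get_tag("xt")   # number of T positions covered
        conversions = read.get_tag("xc")   # number of T->C conversions
        if convertible == 0:
            continue  # read carries no information about labeling status
        if conversions >= 1:
            new_reads += 1
        else:
            old_reads += 1

print(f"'new' (>=1 T->C): {new_reads}, 'old' (0 T->C): {old_reads}")
```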

@hullv

hullv commented Dec 17, 2024


Dear Isaac,

Thank you for your thoughtful and detailed response. I sincerely appreciate your patience in getting back to me.

Your explanation regarding SLAMDUNK's limitations with paired-end data is extremely helpful. I understand now that it was primarily designed for single-end data analysis. Both solutions you've proposed are valuable:

  1. Processing only one read from each pair
  2. Analyzing all reads separately and then merging the final outputs

I'm particularly grateful for your recommendations of alternative tools that are better suited for paired-end data analysis:

  • rnalib
  • GRAND-SLAM
  • fastq2EZbakR

These suggestions will definitely help guide my research in the right direction. I will carefully evaluate these options to determine the most suitable approach for my work.

Thank you again for your expert guidance!

Best regards,
Wangang
