Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

salmon transcript-level to gene-level quantification #30

Open
Mitmischer opened this issue Sep 30, 2024 · 0 comments
Open

salmon transcript-level to gene-level quantification #30

Mitmischer opened this issue Sep 30, 2024 · 0 comments

Comments

@Mitmischer
Copy link

Mitmischer commented Sep 30, 2024

Hello,

I need help on an issue that is not directly related to the material, but as I built my pipeline based on your (awesome) tutorials, maybe you have an idea or even find that a helpful question to discuss in later version of the tutorial.

I use salmon for transcript-level quantification on RNA-seq data which I then want to reduce to gene-level for further analysis. I do so using tximport.

I noticed that after tximport, some genes with the same name are assigned to different gene ids in different replicas. That makes differential expression on this data hard, as lot of the top upregulated genes found by deseq2 actually stem from an unlucky assignment.

To make this more clear, one of my most upregulated genes is ENSG00000232326 (TAP2), but there are only counts for that gene in a single sample. During further inspection, I noticed that there are also counts assigned to ENSG00000228582, ENSG00000206235, ENSG00000223481, ENSG00000204267 - all of which have the gene name TAP2! I'm not quite sure why there is this level of duplication, but the genes seem to exist on different (virtual?) chromosomes. I'd like to have those assigned to the same gene before differential expression. I think that ENSG00000204267 is a sensible target gene for TAP2 as that one is on the primary assembly.

I had some ideas but I am unsure how to proceed.

  1. In the quant.sf , I found that transcript ENST00000457634 was assigned to ENSG00000232326. I see that transcript in the cdna.all.fa (that I used as suggested in 08_quasi_alignment_salmon.md, downloaded from http://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz), so I know where that's coming from:
$ grep -i 457634 Homo_sapiens.GRCh38.cdna.all.fa
>ENST00000457634.6 cdna scaffold:GRCh38:HSCHR6_MHC_SSTO_CTG1:4224836:4238030:-1 gene:ENSG00000232326.9 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:TAP2 description:transporter 2, ATP binding cassette subfamily B member [Source:HGNC Symbol;Acc:HGNC:44]

Assembling the annotations from AnnotationHub (as suggested here: https://hbctraining.github.io/DGE_workshop_salmon/lessons/genomic_annotation.html), I could also find that mapping, so that's consistent.

I do not see that transcript in the gtf/gff3 though (obtained from http://ftp.ensembl.org/pub/release-111/gtf/homo_sapiens/Homo_sapiens.GRCh38.111.gtf.gz and https://ftp.ensembl.org/pub/release-111/gff3/homo_sapiens/Homo_sapiens.GRCh38.111.gff3.gz respectively).

$ grep -i 00000457634 Homo_sapiens.GRCh38.111.gff3
$ grep -i 00000457634 Homo_sapiens.GRCh38.111.gtf

So I assume that this transcript and gene are not in the primary assembly and are therefore variants. So it appears easiest to me to just use a "cdna.all.fa" that only contains cDNA from the primary assembly. Is my reasoning correct and if so, where can I obtain such a file?

Then, I should be able to obtain my index from the gtf as suggested here: https://www.biostars.org/p/9570963/

  1. A different idea is to take the salmon output as is and use a different wayof transcript-gene mapping.

I thought about obtaining the mapping "normally" (as in https://hbctraining.github.io/DGE_workshop_salmon/lessons/genomic_annotation.html), the resulting data frame looks like this:

      | tx_id           | gene_id         | gene_name | entrezid
<...>
34379 | ENST00000612574 | ENSG00000206299 | TAP2 | 6891
34380 | ENST00000383239 | ENSG00000206299 | TAP2 | 6891
34388 | ENST00000485516 | ENSG00000206299 | TAP2 | 6891
34407 | ENST00000469427 | ENSG00000206299 | TAP2 | 6891
34494 | ENST00000451907 | ENSG00000223481 | TAP2 | 6891
34498 | ENST00000452371 | ENSG00000223481 | TAP2 | 6891
34504 | ENST00000462883 | ENSG00000223481 | TAP2 | 6891
34511 | ENST00000468636 | ENSG00000223481 | TAP2 | 6891
34624 | ENST00000445986 | ENSG00000237599 | TAP2 | 6891
<...>

... and I would change the gene_id to a common one - favorably the one on the primary assembly. Is that "okay" and how would I find out which gene_id is the correct one (i.e. which one is on the primary assembly)?

  1. I could also build the salmon index from the primary assembly, not the cDNA. That works perfectly fine for STAR. Do you see issues with that regarding salmon? This way, I'd also make sure that only the primary assembly and no variants are available for Salmon to map to.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant