Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in gtf library when trying to add stringtie output #77

Open
lukaas33 opened this issue Aug 26, 2024 · 12 comments
Open

Error in gtf library when trying to add stringtie output #77

lukaas33 opened this issue Aug 26, 2024 · 12 comments

Comments

@lukaas33
Copy link

I am getting the following error from the gtf library when trying to annotate my vcf with stringtie output.

$ vcf-expression-annotator -o /shared_dir/temp.annotated.vcf -s sample /shared_dir/temp.vep.vcf /shared_dir/temp.abundance.tsv stringtie transcript
Traceback (most recent call last):
  File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/gtfparse/read_gtf.py", line 121, in parse_with_polars_lazy
    **kwargs).lazy()
AttributeError: 'LazyFrame' object has no attribute 'lazy'

Version 2.0.1 with Python 3.6

I have also reported this issue at the gtf page:
openvax/gtfparse#49

@lukaas33
Copy link
Author

I have now ran the same command in the latest vatools Docker image and am getting a new error:

> vcf-expression-annotator -o /shared_dir/temp.annotated.vcf -s sample /shared_dir/temp.vep.vcf /shared_dir/temp.abundance.tsv stringtie transcript
Traceback (most recent call last):
  File "parsers.pyx", line 1160, in pandas._libs.parsers.TextReader._convert_tokens
TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe'

@griffithlab griffithlab deleted a comment Aug 26, 2024
@lukaas33
Copy link
Author

The start of my stringtie output:

Gene ID	Gene Name	Reference	Strand	Start	End	Coverage	FPKM	TPM
ENSG00000282881	TMEM275	1	-	46532166	46543969	0.0	0.0	0.0
ENSG00000201405	Y_RNA	1	+	23370254	23370346	0.0	0.0	0.0
ENSG00000143774	GUK1	1	+	228139962	228148984	0.0	0.0	0.0
ENSG00000288775	.	1	-	159776325	159779383	0.0	0.0	0.0
ENSG00000239887	C1orf226	1	+	162378841	162386812	0.0	0.0	0.0
ENSG00000200575	RNU6-414P	1	+	61816419	61816522	0.0	0.0	0.0
ENSG00000251785	RNA5SP20	1	+	77614869	77614952	0.0	0.0	0.0
ENSG00000237872	POU5F1P4	1	+	155433178	155434262	0.0	0.0	0.0

@susannasiebert
Copy link
Collaborator

susannasiebert commented Aug 26, 2024

For the first error, I suspect that one of the dependencies you have installed is incompatible with gtfparse. Here are the versions in our Docker images:

gtfparse==1.3.0
numpy==1.26.1
pandas==2.1.1
pysam==0.22.0
python-dateutil==2.8.2
pytz==2023.3.post1
six==1.16.0
testfixtures==7.2.0
tzdata==2023.3
vatools==5.1.0
vcfpy==0.12.3

Try downgrading your dependencies to match these versions.

@susannasiebert
Copy link
Collaborator

I'm unable to reproduce the error in your second comment with the stringtie output provided. Can you please attach all of your input files (VCF and full stringtie TSV)?

@lukaas33
Copy link
Author

Hi @susannasiebert, can I send these to you privately to maintain privacy of this data?

@susannasiebert
Copy link
Collaborator

Yes, absolutely. My email is susanna.kiwala@wustl.edu

@susannasiebert
Copy link
Collaborator

My apologies for the belated replies. After investigating your files, it looks like you are trying to use a gene abundance file in transcript mode. If you switch your command to

vcf-expression-annotator -o /shared_dir/temp.annotated.vcf -s sample /shared_dir/temp.vep.vcf /shared_dir/temp.abundance.tsv stringtie gene

It works without problems.

The transcript abundance file from stringtie is in gtf format while the gene abundance file is in tsv format. Mixing them up leads to unexpected errors like the one you are seeing. I've added issue #78 to add better error handling for this case.

@lukaas33
Copy link
Author

lukaas33 commented Aug 28, 2024

Ah, and sorry for my ignorance. But wouldn't I need to add both of them? So the tsv and gtf files.

Is there a way to do this with one command or will this always involve two steps?

Or does the transcript level expression always contain more detail?

@susannasiebert
Copy link
Collaborator

You would need to run this as two steps, unfortunately.

@lukaas33
Copy link
Author

Ah so it is recommended to also add gene expression level besides transcript expression level?

@susannasiebert
Copy link
Collaborator

susannasiebert commented Aug 28, 2024

Yes, the gene expression levels are, for example, used during tiering in the aggregated report.

@lukaas33
Copy link
Author

Ah in that case it may be nice to have a feature to add both files.
Or to support command line piping so that it can be done in one line and creates only one file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants
@susannasiebert @lukaas33 and others