Interpretation of Wochenende Results #98

Open
arpit20328 opened this issue Mar 25, 2024 · 11 comments

@arpit20328

I have the following output from my paired-end FASTQ files. It was produced by running "bash runbatch_Wochenende_reporting.sh".

image

Can anyone tell me which column represents the abundance of each species (each row in this matrix)?

@colindaven
Contributor

Hi @arpit20328, thanks for your interest in Wochenende.

The docs are here but you probably found them already:

https://github.com/MHH-RCUG/nf_wochenende/wiki/Interpreting-Wochenende-output

And to answer your question: use the bacteria_per_human_cell and/or the RPMM columns.

RPMM basically combines the first two normalizations, reads_per_million_ref_bases and reads_per_million_reads_in_experiment. These normalize for a) the bacterial genome length, since bigger genomes produce more reads by virtue of their length, and b) sequencing depth, since sequencing more reads yields more data.
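
To make the normalization concrete, here is a minimal Python sketch of how the three values relate. It assumes RPMM is the read count divided by both the reference length in millions of bases and the total number of reads in the experiment in millions. The function and parameter names are hypothetical, the numbers are made up, and the wiki linked above remains the authoritative definition.

```python
def reads_per_million_ref_bases(read_count, ref_length_bp):
    """Normalize for genome length: reads per 1,000,000 reference bases."""
    return read_count / (ref_length_bp / 1_000_000)


def reads_per_million_reads(read_count, total_reads):
    """Normalize for sequencing depth: reads per 1,000,000 reads in the experiment."""
    return read_count / (total_reads / 1_000_000)


def rpmm(read_count, ref_length_bp, total_reads):
    """Assumed definition: apply both normalizations to the same read count."""
    return read_count / ((ref_length_bp / 1_000_000) * (total_reads / 1_000_000))


# Made-up example: 2,000 reads on a 4 Mbp genome in a 10-million-read experiment.
print(reads_per_million_ref_bases(2000, 4_000_000))  # reads per Mbp of reference
print(reads_per_million_reads(2000, 10_000_000))     # reads per million reads
print(rpmm(2000, 4_000_000, 10_000_000))             # both normalizations combined
```

Under this assumption, doubling either the genome length or the sequencing depth while keeping the read count fixed halves the RPMM, which is why the value is comparable across species and across samples.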

The key problem, though, is that you do not have many reads aligned (max 154).

We typically see anywhere from thousands to 100k+ reads aligned.

Why is this? Maybe your sample is from soil or another biome that does not fit our current reference bacteria well (so far mostly tested on clinical metagenomes, e.g. lung samples). Maybe there is nothing in your sample, or it is 16S data? Wochenende is only for WGS metagenomics.

How many reads are you supplying? Is human the key source of "contamination", i.e. is this a clinical sample?

@arpit20328
Author

Thanks @colindaven for your reply.

My data is actually WGS paired-end data from Illumina. Previously I sent only a snapshot, not the full extent of the data.

When our FASTQ files were processed by Wochenende, it produced a sorted CSV file with 320 rows:

image

Yes, it's a clinical sample. I do not have the read numbers at hand right now, but the input is around 6.6 GB of paired-end .fastq.gz files.

@arpit20328
Author

From this file, it looks like Wochenende found around 31 million reads in our FASTQ data.

@MHH-RCUG deleted a comment from arpit20328 on Mar 25, 2024

@colindaven
Contributor

Thanks for that - I deleted the comment since it may be sensitive information and likely shouldn't have been there :-).

Congrats, it looks like you have some interesting results. The read counts are very interesting, but the distribution of reads along the chromosomes (and, for multi-chromosomal organisms, whether all chromosomes are covered) is the real indicator that a species is present and not just a false positive.

You can try to plot the information, but this requires an R server/installation, which we couldn't easily fit into the Wochenende conda install instructions. Maybe you have an R server, can install the required software and run the plotting, or you can do the plotting yourself based on our scripts (see the plots subdirectory).

I hope the raspir step worked for you, since its results are another very important step in eliminating false positives. Raspir does, however, assume circular chromosomes, which are common for bacteria but AFAIK not typical for fungi.
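
As a rough illustration of why the read distribution matters, the Python sketch below divides a chromosome into equal-width bins and reports the fraction of bins that contain at least one read start. This is not how raspir or the Wochenende plotting scripts work; the function name, the bin count, and the positions are all hypothetical and chosen only to show the idea.

```python
def fraction_of_bins_covered(read_starts, chrom_length, n_bins=100):
    """Return the fraction of equal-width bins holding at least one read start."""
    covered = set()
    for pos in read_starts:
        # Ignore positions outside the chromosome (0-based coordinates assumed).
        if 0 <= pos < chrom_length:
            covered.add(pos * n_bins // chrom_length)
    return len(covered) / n_bins


chrom_length = 1_000_000  # made-up chromosome length in bp

# 200 reads piled up in one small region: typical of a spurious hit.
clustered = [1000 + i for i in range(200)]

# 200 reads spread evenly along the chromosome: more convincing evidence.
spread = [i * 5000 for i in range(200)]

print(fraction_of_bins_covered(clustered, chrom_length))
print(fraction_of_bins_covered(spread, chrom_length))
```

Both made-up examples have the same read count, yet only the second covers the whole chromosome, which is the kind of difference a plot of reads along the chromosome would reveal.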

@colindaven
Contributor

> From this file, it looks like Wochenende found around 31 million reads in our FASTQ data.

Yes, this is true, but most are human (the 1_1_1_1 -> 1_1_1_Y etc. results with the most assigned reads). This is good and entirely expected for a clinical, human-associated sample. Because there are so many human reads, we can use the bacteria_per_human_cell column as a normalization method to get an estimate of absolute abundance, not just relative abundance.
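
The idea behind normalizing to human reads can be sketched in Python as a ratio of coverage depths. This is an assumption made for illustration, not the formula taken from the Wochenende source: it treats reads divided by genome length as a proxy for genome copies and assumes a diploid human cell. All names and numbers are hypothetical.

```python
def bacteria_per_human_cell(bact_reads, bact_genome_bp,
                            human_reads, human_haploid_bp, ploidy=2):
    """Estimate bacterial genome copies per human cell from read counts.

    Assumption: reads / genome length is proportional to genome copies,
    and each human cell carries `ploidy` copies of the haploid genome.
    Returns None when there are no human reads to normalize against.
    """
    if human_reads == 0:
        return None
    bacterial_copies = bact_reads / bact_genome_bp
    human_cells = human_reads / (ploidy * human_haploid_bp)
    return bacterial_copies / human_cells


# Made-up example: 1,000 bacterial reads on a 5 Mbp genome,
# 30 million human reads on a 3.1 Gbp haploid reference.
print(bacteria_per_human_cell(1000, 5_000_000, 30_000_000, 3_100_000_000))

# Without any human reads the estimate is undefined.
print(bacteria_per_human_cell(1000, 5_000_000, 0, 3_100_000_000))
```

The second call shows why a value of this kind cannot be computed for a sample in which no reads map to the human reference.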

@arpit20328
Author

I see. Thanks @colindaven for the detailed reply.

I will need more of your input in the coming weeks. We are running a clinical trial here in Mumbai, India, and we feel Wochenende is a good fit for our study.

Great! Thanks again.

Have a great weekend... oops, I mean a great Wochenende...

@colindaven
Contributor

No problem. Yes, we're happy to help out where we can.

The trial sounds very interesting!

:-)

@arpit20328
Author

@colindaven Hi again. For many of the FASTQ files in our sample set, we are not getting a value in the "bacteria_per_human_cell" column.

image

So, as per our previous communication, we are taking "RPMM" as our abundance. Can you please describe what RPMM is and the formula behind it? Can you suggest how RPMM could be transformed to get the relative or absolute abundance of each species out of 100?

Thanks

@colindaven
Contributor

Hi @arpit20328

Nice job. I don't know why your samples are not sufficient to calculate the bacteria_per_human_cell parameter. Do they now have any/enough human reads mapped?

You're right that the docs are a bit unspecific on normalization, so I improved and extended them here with examples:

https://github.com/MHH-RCUG/nf_wochenende/wiki/Interpreting-Wochenende-output

Please let me know if that is sufficient for your needs.
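
For the "out of 100" part of the question, one common approach (an assumption here, not something confirmed by the wiki) is to rescale the RPMM values of the non-human taxa so that they sum to 100. The Python sketch below uses a hypothetical function name, placeholder taxon labels, and made-up RPMM values. The result is a relative abundance only, and it inherits any false positives that have not been filtered out beforehand.

```python
def rpmm_to_percent(rpmm_by_taxon):
    """Rescale RPMM values so they sum to 100 (relative abundance in percent)."""
    total = sum(rpmm_by_taxon.values())
    if total == 0:
        # Nothing detected: report 0% for every taxon instead of dividing by zero.
        return {taxon: 0.0 for taxon in rpmm_by_taxon}
    return {taxon: 100.0 * value / total
            for taxon, value in rpmm_by_taxon.items()}


# Made-up RPMM values for three placeholder taxa (human rows already excluded).
sample = {"taxon_A": 30.0, "taxon_B": 10.0, "taxon_C": 10.0}

for taxon, percent in rpmm_to_percent(sample).items():
    print(taxon, round(percent, 1))
```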

I would really recommend using the raspir function to remove false positives, and the plotting functionality to check the distribution of reads along one or more chromosomes before you decide whether the taxon is present or not.

These do require extra work to get running, but at least the raspir function should work quite well for the nf_wochenende repo (with Nextflow). It likely will not work with the older Wochenende repo without some hacking to adjust it to your compute environment.

Maybe @irosenboom can also share some experience of what constitutes a high or low RPMM, and how these values can be affected by, in particular, short genome lengths etc.

cheers
Colin

@arpit20328
Author

@colindaven thanks

IMPORTANT: I removed the human reference sequences from the Wochenende recommended database, since I was only interested in pathogen identification and not human DNA.

I think I should have used the complete Wochenende reference FASTA database.

Your thoughts?

@colindaven
Contributor

> IMPORTANT: I removed the human reference sequences from the Wochenende recommended database, since I was only interested in pathogen identification and not human DNA.

Ah, please don't do this. You'll get human reads massively and erroneously mapping to bacteria, and you'll have many, many false positives.
The human sequences are there to filter out the human DNA. Similarly, when we did mouse skin or gut metagenomes, we added the mouse genome so that the mouse reads could be mapped to their true origin and filtered out.

In clinical samples, 90-99% of the DNA is human in my experience, and it needs to be excluded.

When using the proper version with the human sequences, you'll get very, very different results.
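
A quick sanity check along these lines can be sketched in Python: sum the reads assigned to human reference sequences and compare them with the total. The sketch assumes the human chromosomes are the rows whose names start with "1_1_1_", as in the 1_1_1_1 -> 1_1_1_Y results mentioned earlier in this thread. The function name, the row layout, and the counts are hypothetical.

```python
def human_read_fraction(rows, human_prefix="1_1_1_"):
    """Return the fraction of assigned reads that map to human sequences.

    `rows` is an iterable of (reference_name, read_count) pairs.
    Returns None if no reads were assigned at all.
    """
    rows = list(rows)
    total = sum(count for _, count in rows)
    if total == 0:
        return None
    human = sum(count for name, count in rows if name.startswith(human_prefix))
    return human / total


# Made-up table: two human chromosomes and one placeholder bacterial reference.
with_human_ref = [("1_1_1_1", 900), ("1_1_1_Y", 50), ("bacterial_ref", 50)]

# Same sample mapped against a reference without the human sequences.
without_human_ref = [("bacterial_ref", 400)]

print(human_read_fraction(with_human_ref))
print(human_read_fraction(without_human_ref))
```

A human fraction of zero in a clinical sample, as in the second call, is a strong hint that the human sequences are missing from the reference rather than from the sample.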
