Interpretation of Wochenende Results #98

arpit20328 · 2024-03-25T14:25:36Z

I have the following output from my paired-end FASTQ files. This output came after I ran "bash runbatch_Wochenende_reporting.sh".

Can anyone tell me which column represents the abundance value of each species (row in this matrix)?

colindaven · 2024-03-25T15:08:49Z

Hi @arpit20328 thanks for your interest in Wochenende.

The docs are here but you probably found them already:

https://github.com/MHH-RCUG/nf_wochenende/wiki/Interpreting-Wochenende-output

And to answer your question: use the bacteria_per_human_cell and/or the RPMM columns.

RPMM basically combines the first two normalizations, reads_per_million_ref_bases and reads_per_million_reads_in_experiment. These normalize for a) the bacterial genome length (bigger genomes produce more reads by virtue of their length), and b) the sequencing depth (if you sequence more reads, you get more data).
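As a rough illustration of the two normalizations described above, here is a minimal Python sketch. The function names, the example numbers, and the exact way the two normalizations are combined are assumptions for illustration only; the wiki page linked above remains the authoritative definition.

```python
def reads_per_million_ref_bases(reads, ref_length_bp):
    """Normalize a read count by reference length, in millions of bases."""
    if ref_length_bp <= 0:
        raise ValueError("ref_length_bp must be positive")
    return reads / (ref_length_bp / 1_000_000)


def reads_per_million_reads(reads, total_reads):
    """Normalize a read count by sequencing depth, in millions of reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return reads / (total_reads / 1_000_000)


def rpmm(reads, ref_length_bp, total_reads):
    """Assumed combination: reads per million reference bases
    per million reads in the experiment."""
    per_million_bases = reads_per_million_ref_bases(reads, ref_length_bp)
    return per_million_bases / (total_reads / 1_000_000)


# Hypothetical example: 1,000 reads on a 5 Mbp genome,
# in an experiment with 20 million reads in total.
print(reads_per_million_ref_bases(1000, 5_000_000))  # length-normalized
print(reads_per_million_reads(1000, 20_000_000))     # depth-normalized
print(rpmm(1000, 5_000_000, 20_000_000))             # both combined
```

Under this assumed formula, doubling either the genome length or the total number of reads halves the value, which is the intent of both normalizations.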

The key problem, though, is that you do not have many reads aligned (max 154).

We typically see thousands to 100k+ reads aligned.

Why is this? Maybe your sample is from soil or another biome which does not fit well to our current reference bacteria (mostly tested on clinical metagenomes to date, e.g. lung samples)? Maybe there is nothing in your sample, or it is 16S? Wochenende is only for WGS metagenomics.

How many reads are you supplying? Do you have human as the key source of "contamination", i.e. is this a clinical sample?

arpit20328 · 2024-03-25T15:47:58Z

Thanks @colindaven for your reply.

So my data is actually WGS paired-end data from Illumina. I previously sent only a snapshot, not the complete dimensions of the data.

When our FASTQ files were processed by Wochenende, it produced a sorted CSV file with 320 rows.

Yes, it's a clinical sample. I do not have the read counts right now, but this is around 6.6 GB of paired-end .fastq.gz files.

arpit20328 · 2024-03-25T15:54:04Z

From this file, Wochenende has found around 31 million reads in our FASTQ data.

colindaven · 2024-03-25T16:24:46Z

Thanks for that - I deleted the comment since it may be sensitive information and likely shouldn't have been there :-).

Congrats, it looks like you have some interesting results. The numbers of reads are very interesting, but the distribution of reads along the chromosomes (and, for multi-chromosomal organisms, whether all chromosomes are covered) is the real indicator that the species is there and is not just a false positive.
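To make the point about read distribution concrete, here is a small Python sketch that bins read start positions into fixed windows and reports what fraction of windows contain at least one read. This is not the Wochenende plotting code and not raspir; the function names, the window size, and the 0-based positions are all assumptions for illustration. In practice the positions would come from the alignment (BAM) file.

```python
def window_counts(read_starts, chrom_length, window=10_000):
    """Count reads per fixed-size window along one chromosome.

    read_starts: iterable of 0-based read start positions (assumed input).
    """
    if chrom_length <= 0 or window <= 0:
        raise ValueError("chrom_length and window must be positive")
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < chrom_length:  # ignore positions off the chromosome
            counts[pos // window] += 1
    return counts


def fraction_windows_covered(counts):
    """Fraction of windows that contain at least one read."""
    return sum(1 for c in counts if c > 0) / len(counts)


# Hypothetical 1 Mbp chromosome with 100 reads in two different layouts.
even_reads = [i * 10_000 for i in range(100)]   # spread along the chromosome
piled_reads = [500_000 + i for i in range(100)]  # all stacked in one region

print(fraction_windows_covered(window_counts(even_reads, 1_000_000)))
print(fraction_windows_covered(window_counts(piled_reads, 1_000_000)))
```

Both layouts have the same read count, but only the first covers the chromosome broadly. A pile of reads in one small region is the kind of pattern that tends to indicate a false positive.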

You can try to plot the information, but this requires an R server/installation which we couldn't easily fit into the Wochenende conda install instructions. Maybe you have an R server, can install the required software and run the plotting, or do the plotting yourself if you prefer based upon our scripts (see the plots subdirectory).

I hope the raspir step worked for you, since its results are another very important step for eliminating false positives. Raspir does, however, assume circular chromosomes, which are common for bacteria but AFAIK not typical for fungi.

colindaven · 2024-03-25T16:27:21Z

> from this file it is around 31 million reads Wochenende has found in our FASTQ data

Yes, this is true, but most are human (the 1_1_1_1 -> 1_1_1_Y etc. results have the most assigned reads). This is good and entirely expected for a clinical, human-associated sample. Because there are so many human reads, we can use the bacteria_per_human_cell column as a normalization method to get an estimate of absolute abundances, and not just relative abundance.

arpit20328 · 2024-03-25T16:29:30Z

I see. Thanks @colindaven for the detailed reply.

I will be requiring more of your input in the coming weeks. We are running a clinical trial here in Mumbai, India, and we feel Wochenende fits our study.

Great! Thanks again.

Have a great weekend... oops, I mean a great Wochenende...

colindaven · 2024-03-26T07:04:46Z

No problem. Yes, we're happy to help out where we can.

The trial sounds very interesting!

:-)

arpit20328 · 2024-05-13T05:00:17Z

@colindaven Hi again. For many FASTQ files in our samples we are not getting a value in the "bacteria_per_human_cell" column.

So, as per our previous communication, we are taking "RPMM" as our abundance. Can you please describe what RPMM is and what the formula behind it is? Can you suggest how RPMM could be rescaled to get a relative or absolute abundance of each species out of 100?

Thanks

colindaven · 2024-05-13T15:59:28Z

Hi @arpit20328

Nice job. I don't know why your samples are not sufficient to calculate the bacteria_per_human_cell parameter. Do they now have any/enough human reads mapped?

You're right in that the docs are a bit unspecific on normalization, so I improved and extended them here with examples:

https://github.com/MHH-RCUG/nf_wochenende/wiki/Interpreting-Wochenende-output

Please let me know if that is sufficient for your needs.
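For the "out of 100" part of the question, one generic option is to rescale the RPMM values of the detected taxa so that they sum to 100. The sketch below is a plain rescaling and not a method taken from the Wochenende docs; the function name and the species names are hypothetical. Note that it gives only a relative share among the taxa you pass in (human rows should be left out), not an absolute abundance.

```python
def rpmm_to_percent(rpmm_by_taxon):
    """Rescale RPMM values so that they sum to 100 across the given taxa.

    rpmm_by_taxon: dict mapping taxon name -> RPMM value (assumed input).
    """
    if any(value < 0 for value in rpmm_by_taxon.values()):
        raise ValueError("RPMM values must not be negative")
    total = sum(rpmm_by_taxon.values())
    if total <= 0:
        raise ValueError("at least one RPMM value must be positive")
    return {name: 100.0 * value / total for name, value in rpmm_by_taxon.items()}


# Hypothetical RPMM values for three bacterial taxa (human rows excluded).
example = {"species_A": 30.0, "species_B": 10.0, "species_C": 10.0}
for name, percent in rpmm_to_percent(example).items():
    print(f"{name}: {percent:.1f}%")
```

Because the result depends on which taxa are included, false positives should be removed before rescaling, otherwise they inflate the denominator.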

I would really recommend using the raspir function to remove false positives, and the plotting functionality to check the distribution of reads along one or more chromosomes, before you decide whether the taxon is present or not.

These do require extra work to get running, but at least the raspir function should work quite well for the nf_wochenende repo (with Nextflow). It likely will not work with the older Wochenende repo without some hacking to adjust it to your compute environment.

Maybe @irosenboom can also share some experience of what constitutes a high or low RPMM, and how these values can be affected by factors such as especially short genome lengths.

cheers
Colin

arpit20328 · 2024-05-13T16:03:41Z

@colindaven thanks

IMPORTANT: I removed the human reference sequences from the Wochenende recommended database, since I was only interested in pathogen identification and not in human DNA.

I think I should have used the complete Wochenende database reference FASTA.

Your thoughts?

colindaven · 2024-05-14T07:19:58Z

> IMPORTANT: I chopped off Human Ref Sequences from wochenende recommended database since I was only interested in Pathogens identification and not human dna..

Ah -- please don't do this. You'll get human reads massively and erroneously mapping to bacteria and have many, many false positives.
The human sequences are there to filter out the human DNA. Similarly, when we did mouse skin or gut metagenomes, we added the mouse genome to allow the mouse reads to be mapped to their true origin, and filtered out.

In clinical samples 90-99% of DNA is human in my experience, and needs to be excluded.

When using the proper version with the human sequences you'll get very, very different results.

MHH-RCUG deleted a comment from arpit20328 Mar 25, 2024
