Minimum length used by all 10 marker-gene profiles? #7

Jigyasa3 · 2021-04-01T04:43:44Z

I wanted to confirm if the minimum length used by all 10 marker-genes in mOTU2 is the same as mOTU1 paper?
Based on the supplementary table 4, a marker gene-specific sequence length cutoff is used in the mOTU1 paper.

When I extract marker-genes from my metagenomes using either mOTU2 or mOTU1, I find a correlation between no. of sequences per marker-gene (i.e. count) and marker-gene length. As there is no option to change the length cutoff in fetchMGs.pl and classify-genomes, I was wondering if that would bias the relative abundance calculations.

What do you think about it?

The text was updated successfully, but these errors were encountered:

AlessioMilanese · 2021-04-01T07:28:52Z

Hi @Jigyasa3,

We do not use a sequence length cutoff (at least for mOTUs2, I was not involved for mOTUs1, but I think also there it was not used). We select genes based on hmm scores calculated with HMMER3.
For example, if a gene map to the HMM for COG0012 with a score of 1259, which is higher than the threshold 1190, then it is selected.

When I extract marker-genes from my metagenomes using either mOTU2 or mOTU1, I find a correlation between no. of sequences per marker-gene (i.e. count) and marker-gene length.

There are two steps: 1. extract genes from genomes, and 2. map metagenomic reads to the extracted genes.
When doing the second one (which we also call metagenomic profiling) you can either count reads that map as they are (and the length of the genes play an important role) or you can normalize by gene length. When running mOTUs you can use insert.raw_counts to have no normalization; or insert.scaled_counts to normalize by gene length.

As there is no option to change the length cutoff in fetchMGs.pl and classify-genomes, I was wondering if that would bias the relative abundance calculations.

fetchMGs.pl and classify-genomes are not involved with read counts, but only with extracting the genes. So these 2 tools do the part 1 described above. While mOTUs works with reads and do taxonomic profiling (part 2).

Hope it makes sense.

Jigyasa3 · 2021-04-01T11:11:51Z

Hey @AlessioMilanese

Thank you so much for a quick response! Sorry about not considering the marker-gene counts with length normalization for this analysis! Thank you for pointing that out.

For the first question, I am still a bit confused (sorry :( ). You mention that mOTU2 (and maybe mOTU1) extract genes based on HMM search against the marker-gene. I understand that. But we find that in our metagenomes, there is a large difference in how many sequences are extracted for each marker-gene (raw counts). Could we possibly explain this by saying that
(a) mOTU2 (and mOTU1) extract "full length" marker-genes (hmmbuild command has a default percentage identity of 0.62 for global alignment) and as there is a higher probability of genes being truncated/pseudogenized in a gut metagenome that is why we don't get all marker-genes per organism?

I observe this in one of the single-cell genomes available on NCBI too. Candidatus Endomicrobium trichonymphae has only 8 marker-genes and this bacteria has been shown to undergo genome reduction and pseudogenization in the gut environment.

We are interested in explaining the differences between marker-genes in our gut metagenomes dataset and want to tease apart what might be causing it. It looks like the biology of the microbes is the key..Is there any other way to suggest that the software used is not the reason...

AlessioMilanese · 2021-04-01T14:54:47Z

I think that how mOTUs works is a little confusing. I'll try to explain it here, related to what you wrote me.
For mOTUs we create a database, a process that is done once. During this process we extract marker genes from reference genomes, metagenomic samples and MAGs:

When you run mOTUs (with the command motus), we map the metagenomic reads to the marker genes extracted before. Which are in the database. We use bwa to map the reads to the extracted marker genes. mOTUs is not using HMMER, but bwa. Example:

Jigyasa3 · 2021-04-02T02:10:42Z

Hey @AlessioMilanese

Thank you for replying! And attaching the figures to explain the process. I have one more question, sorry if I am slow.
mOTU2 maps the raw reads to the mOTU database, but mOTU1 and classify-genomes don't use reads. It takes protein-coding genes (mOTU1) or metagenome contigs (classify-genomes) as input. So does that mean the process is the same? Instead of mapping reads, classify-genomes maps the metagenome genes (extracted after running prodigal) to the database and extracts the ones that are marker-genes? That is what I want to know, is there a specific threshold used to specify if a gene is a marker-gene or not? or the process of gene extraction doesn't require a threshold?

Sorry if my question is rudimentary, hope I am explaining my question properly.

AlessioMilanese · 2021-04-02T21:05:41Z

No problem. The whole process is a little complicated.

mOTU2 maps the raw reads to the mOTU database, but mOTU1 and classify-genomes don't use reads.

mOTUs 1 uses reads (http://www.bork.embl.de/software/mOTUs1/):

Species-level profiles are generated by mapping shotgun DNA sequencing reads from metagenomes to a pre-compiled database (mOTU.v1.padded) consisting of 10 MGs representing

So if the inputs is reads, you use mOTUs to identify which species are present and their relative abundance (this process is called taxonomic profiling):

E. coli    30%
P. copri   15%
E.rectale  10%

This is when you sequence a human gut sample.

If you grow a bacteria on a petri dish and you sequence it, then you have a genome.
If the input is a genome, you just find what are the marker genes. But there is no way to calculate relative abundance given a genome. classify-genomes predict genes based on a genome, so there is no way to get a relative abundance (as in taxonomic profiling of a metagenomic sample)

That is what I want to know, is there a specific threshold used to specify if a gene is a marker-gene or not? or the process of gene extraction doesn't require a threshold?

Extracting marker genes is based on thresholds. Which is the second column of this file:
https://github.com/AlessioMilanese/progenomes_classifier/blob/master/lib/MG_BitScoreCutoffs.verybesthit.txt

based on the HMM score

AlessioMilanese self-assigned this Apr 1, 2021

AlessioMilanese added the question Further information is requested label Apr 1, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Minimum length used by all 10 marker-gene profiles? #7

Minimum length used by all 10 marker-gene profiles? #7

Jigyasa3 commented Apr 1, 2021

AlessioMilanese commented Apr 1, 2021

Jigyasa3 commented Apr 1, 2021

AlessioMilanese commented Apr 1, 2021 •

edited

Loading

Jigyasa3 commented Apr 2, 2021

AlessioMilanese commented Apr 2, 2021

Minimum length used by all 10 marker-gene profiles? #7

Minimum length used by all 10 marker-gene profiles? #7

Comments

Jigyasa3 commented Apr 1, 2021

AlessioMilanese commented Apr 1, 2021

Jigyasa3 commented Apr 1, 2021

AlessioMilanese commented Apr 1, 2021 • edited Loading

Jigyasa3 commented Apr 2, 2021

AlessioMilanese commented Apr 2, 2021

AlessioMilanese commented Apr 1, 2021 •

edited

Loading