Skip to content

Minimum length used by all 10 marker-gene profiles? #7

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Jigyasa3 opened this issue Apr 1, 2021 · 5 comments
Open

Minimum length used by all 10 marker-gene profiles? #7

Jigyasa3 opened this issue Apr 1, 2021 · 5 comments
Assignees
Labels
question Further information is requested

Comments

@Jigyasa3
Copy link

Jigyasa3 commented Apr 1, 2021

hey @AlessioMilanese

I wanted to confirm if the minimum length used by all 10 marker-genes in mOTU2 is the same as mOTU1 paper?
Based on the supplementary table 4, a marker gene-specific sequence length cutoff is used in the mOTU1 paper.

When I extract marker-genes from my metagenomes using either mOTU2 or mOTU1, I find a correlation between no. of sequences per marker-gene (i.e. count) and marker-gene length. As there is no option to change the length cutoff in fetchMGs.pl and classify-genomes, I was wondering if that would bias the relative abundance calculations.

What do you think about it?

@AlessioMilanese
Copy link
Owner

Hi @Jigyasa3,

We do not use a sequence length cutoff (at least for mOTUs2, I was not involved for mOTUs1, but I think also there it was not used). We select genes based on hmm scores calculated with HMMER3.
For example, if a gene map to the HMM for COG0012 with a score of 1259, which is higher than the threshold 1190, then it is selected.

When I extract marker-genes from my metagenomes using either mOTU2 or mOTU1, I find a correlation between no. of sequences per marker-gene (i.e. count) and marker-gene length.

There are two steps: 1. extract genes from genomes, and 2. map metagenomic reads to the extracted genes.
When doing the second one (which we also call metagenomic profiling) you can either count reads that map as they are (and the length of the genes play an important role) or you can normalize by gene length. When running mOTUs you can use insert.raw_counts to have no normalization; or insert.scaled_counts to normalize by gene length.

As there is no option to change the length cutoff in fetchMGs.pl and classify-genomes, I was wondering if that would bias the relative abundance calculations.

fetchMGs.pl and classify-genomes are not involved with read counts, but only with extracting the genes. So these 2 tools do the part 1 described above. While mOTUs works with reads and do taxonomic profiling (part 2).

Hope it makes sense.

@AlessioMilanese AlessioMilanese self-assigned this Apr 1, 2021
@AlessioMilanese AlessioMilanese added the question Further information is requested label Apr 1, 2021
@Jigyasa3
Copy link
Author

Jigyasa3 commented Apr 1, 2021

Hey @AlessioMilanese

Thank you so much for a quick response! Sorry about not considering the marker-gene counts with length normalization for this analysis! Thank you for pointing that out.

For the first question, I am still a bit confused (sorry :( ). You mention that mOTU2 (and maybe mOTU1) extract genes based on HMM search against the marker-gene. I understand that. But we find that in our metagenomes, there is a large difference in how many sequences are extracted for each marker-gene (raw counts). Could we possibly explain this by saying that
(a) mOTU2 (and mOTU1) extract "full length" marker-genes (hmmbuild command has a default percentage identity of 0.62 for global alignment) and as there is a higher probability of genes being truncated/pseudogenized in a gut metagenome that is why we don't get all marker-genes per organism?

I observe this in one of the single-cell genomes available on NCBI too. Candidatus Endomicrobium trichonymphae has only 8 marker-genes and this bacteria has been shown to undergo genome reduction and pseudogenization in the gut environment.

We are interested in explaining the differences between marker-genes in our gut metagenomes dataset and want to tease apart what might be causing it. It looks like the biology of the microbes is the key..Is there any other way to suggest that the software used is not the reason...

@AlessioMilanese
Copy link
Owner

AlessioMilanese commented Apr 1, 2021

I think that how mOTUs works is a little confusing. I'll try to explain it here, related to what you wrote me.
For mOTUs we create a database, a process that is done once. During this process we extract marker genes from reference genomes, metagenomic samples and MAGs:
build_db

When you run mOTUs (with the command motus), we map the metagenomic reads to the marker genes extracted before. Which are in the database. We use bwa to map the reads to the extracted marker genes. mOTUs is not using HMMER, but bwa. Example:
tax_profiling

@Jigyasa3
Copy link
Author

Jigyasa3 commented Apr 2, 2021

Hey @AlessioMilanese

Thank you for replying! And attaching the figures to explain the process. I have one more question, sorry if I am slow.
mOTU2 maps the raw reads to the mOTU database, but mOTU1 and classify-genomes don't use reads. It takes protein-coding genes (mOTU1) or metagenome contigs (classify-genomes) as input. So does that mean the process is the same? Instead of mapping reads, classify-genomes maps the metagenome genes (extracted after running prodigal) to the database and extracts the ones that are marker-genes? That is what I want to know, is there a specific threshold used to specify if a gene is a marker-gene or not? or the process of gene extraction doesn't require a threshold?

Sorry if my question is rudimentary, hope I am explaining my question properly.

@AlessioMilanese
Copy link
Owner

No problem. The whole process is a little complicated.

mOTU2 maps the raw reads to the mOTU database, but mOTU1 and classify-genomes don't use reads.

mOTUs 1 uses reads (http://www.bork.embl.de/software/mOTUs1/):

Species-level profiles are generated by mapping shotgun DNA sequencing reads from metagenomes to a pre-compiled database (mOTU.v1.padded) consisting of 10 MGs representing

So if the inputs is reads, you use mOTUs to identify which species are present and their relative abundance (this process is called taxonomic profiling):

E. coli    30%
P. copri   15%
E.rectale  10%

This is when you sequence a human gut sample.

If you grow a bacteria on a petri dish and you sequence it, then you have a genome.
If the input is a genome, you just find what are the marker genes. But there is no way to calculate relative abundance given a genome. classify-genomes predict genes based on a genome, so there is no way to get a relative abundance (as in taxonomic profiling of a metagenomic sample)

That is what I want to know, is there a specific threshold used to specify if a gene is a marker-gene or not? or the process of gene extraction doesn't require a threshold?

Extracting marker genes is based on thresholds. Which is the second column of this file:
https://github.com/AlessioMilanese/progenomes_classifier/blob/master/lib/MG_BitScoreCutoffs.verybesthit.txt

based on the HMM score

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants