-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
issues with KEGG annotations and KO mapping #26
Comments
Hi, MetaErg performs its KO annotation by a BLAST search against a UniProt database. While this is a valid approach, only reference genes also present in KEGG have KO terms annotated. I like KofamKOALA. It is also not perfect, but they tried to reduce false positives and can be run offline. Sometimes even the reference genomes miss a annotation, making certain pathways not complete. eggNOGmapper can also provide you with EC numbers and KO terms, but many are automatically generated and appear to be quite questionable sometimes (especially KO terms), but the orthologs can be compared among different genomes and help to identify certain genes InterProScan pathway annotations also contain EC numbers and those often helped me to complete certain pathways. Hopefully this helps in one way or another, |
MetaErg also adds the FOAM and TIGRFAM KO hits, wasn't sure, but just found it in the source code. |
Hi, |
Hi, |
Found the issue. The report uses the KOs after they have been selected by MinPath, while the Using your data:
#!/usr/bin/env python3
import json
from io import StringIO
import re
import itertools
# Table used for the fam_found entries:
reporttable = 'data/data/kegg.cds.profile.datatable.json'
kolist = 'data/data/cds.gene2ko.mapping.txt'
tab = 'data/data/cds.gene2ko.tab.txt'
minpath = 'data/data/cds.gene2ko.minpath.details'
def kosfromtable(filepath):
with open(filepath, 'rt') as jsonin:
# Memory inefficient, but ok for now
jsonstr = jsonin.read()
jsonstr = re.findall(r'data = (\[.+\])',jsonstr, re.DOTALL)[0]
jsonstr = re.sub(r',\s*}','}',jsonstr)
jsonstr = json.load(StringIO(jsonstr))
kos = {ko for path in jsonstr for ko in path['kos'].split('+')}
return kos
def kosfromgene2ko(filepath,i=1,skip=False):
with open(filepath, 'rt') as jsonin:
if skip:
next(jsonin)
kos = {line.split('\t')[i].strip() for line in jsonin }
return kos
kohit = re.compile(r'^ (K\d{5}) .+')
def kosfromminpath(filepath):
with open(filepath, 'rt') as jsonin:
kos = {line.group(1) for line in map(kohit.match, jsonin) if line is not None}
return kos
def jaccardsim(A: set, B: set):
return len(A.intersection(B))/len(A.union(B))*100
def main():
kos = {
'report': kosfromtable(reporttable),
'gene2ko.mapping': kosfromgene2ko(kolist),
'gene2ko.tab': kosfromgene2ko(tab,8,True),
'gene2ko_minpath': kosfromminpath(minpath)
}
print('Sizes:')
for koset in kos.keys():
print(f'{koset}\t{len(kos[koset])}')
print('\nSimilarity:')
for A, B in itertools.combinations(kos.keys(),2):
print('\t'.join((A,B,str(round(jaccardsim(kos[A],kos[B]),2)))))
if __name__ == '__main__':
main() |
Thanks - that makes sense! |
Hi,
I have used metaerg to annotate metagenome bins, and have found some issues with the KEGG annotations.
I tested out a Methanogen MAG. It has the McrA gene and its correctly annotated in the cds.faa file. However, this does not show up in the KEGG map.
We then tried to use the cds.gene2ko.mapping.txt
The McrA gene comes up here, but this file appears to be over annotated and adds in functions that are not present in the genome - e.g. DsrA, K11180.
The gene with this KO added is metaerg.pl|01003 - which is correctly annotated as >metaerg.pl|01003 Coenzyme F420-dependent sulfite reductase OS=s__Methanothermobacter_thermautotrophicus len=689 cid=Contig_1_consensus_sequence
in the cds.faa file (KO K21816 with blastkoala)
Have anyone else noted discrepancies like this?
I have attached the data folder from the analysis.
data.zip
Camilla
The text was updated successfully, but these errors were encountered: