use prodigal results on eukaryotic contigs rather than refseq eukaryotic proteins #4

jvollme · 2021-11-30T10:26:37Z

ORF calling works drastically differently on eukaryotes than on bacteria. Instead of using eukaryotic proteins as references, run prodigal with prokaryotic and metagenomic settings on reference eukaryotic genomes (not necessary for viral genomes) and use the predicted proteins as references. This is more likely to mimic what happens to eukaryotic contigs in metagenome analysis pipelines.

To do this:

download the refseq release eukaryotic genomes (nucleotide sequences)
randomly cut them into chunks of ~5 kb, but also cut at stretches of "N"s (discard chunks that end up smaller than 200 bp)
run prodigal, then dereplicate the proteins (95% identity? or 90% identity?) to reduce database size, always keeping the largest representative --> protein diamond db: eukaryotic-refprotein-db
extract all remaining chunks without any predicted CDS (non-coding reference chunks), then dereplicate (95% or 90% identity?), always keeping the largest representative --> nucleotide blastn-db: eukaryotic-noncoding-chunks-db
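
A rough sketch of the chunking step and of picking out the non-coding chunks, in case it helps the discussion. This is only one possible reading of the steps above, not a settled implementation. The function names, the ±20% size jitter, and treating any run of one or more "N" as a cut point are all assumptions; only the ~5 kb target and the 200 bp cutoff come from this issue. The second helper relies on prodigal naming each predicted protein as the input sequence ID followed by `_<gene number>`.

```python
import random
import re

TARGET_CHUNK = 5000  # from the issue: chunks of ~5 kb
MIN_CHUNK = 200      # from the issue: discard chunks smaller than 200 bp
N_RUN = re.compile(r"[Nn]+")  # assumption: any run of N counts as a "stretch"


def chunk_sequence(seq, target=TARGET_CHUNK, jitter=0.2, min_len=MIN_CHUNK, rng=None):
    """Yield (start, chunk) pairs: split at N runs, then into ~target-sized pieces."""
    rng = rng or random.Random()

    # First split at N stretches so that no chunk contains an N.
    segments = []
    pos = 0
    for match in N_RUN.finditer(seq):
        segments.append((pos, match.start()))
        pos = match.end()
    segments.append((pos, len(seq)))

    # Then cut each N-free segment into randomly sized chunks around `target`.
    for seg_start, seg_end in segments:
        start = seg_start
        while start < seg_end:
            size = int(target * rng.uniform(1 - jitter, 1 + jitter))
            end = min(start + size, seg_end)
            if end - start >= min_len:  # drop leftovers that are too short
                yield start, seq[start:end]
            start = end


def noncoding_chunk_ids(chunk_ids, prodigal_protein_ids):
    """Return the chunk IDs for which prodigal predicted no CDS.

    Assumes protein IDs of the form <chunk_id>_<gene_number>.
    """
    coding = {pid.rsplit("_", 1)[0] for pid in prodigal_protein_ids}
    return [cid for cid in chunk_ids if cid not in coding]
```

The chunks would then be written to FASTA and passed to prodigal (e.g. with `-p meta` for the metagenomic setting). Dereplication and building the diamond and blastn databases are not covered here.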

jvollme added the enhancement New feature or request label Nov 30, 2021

jvollme added this to the Improvements A milestone Nov 30, 2021

jvollme self-assigned this Nov 30, 2021
