Genome Crossovers

Just a bit of fun to keep track of which genomes (in this case genomes with a set of predicted proteins) are available where.

I am mainly taking information from the NCBI, Ensembl, JGI and EuPathDB portals (and their constituent portals, e.g. Mycocosm) as of ~April 2018. Occasionally I add in other genomes not available from these 'main' portals to cover my own needs. I am only really interested in keeping track of the Eukaryotes (for another project, OrcharDB). I have also split them into 4 rough 'groups'. I am not making any specific comment on these groupings; it's just a little easier to keep track of them this way.

Fungi

The number of genomes from each of the 'big' genome portals is as below:

Ensembl	JGI	FungiDB	Other	NCBI	Total	Unique Total
761	874	81	25	900	2641	1493
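
As a quick sanity check on the table, the sketch below re-adds the per-portal counts and works out how many listings repeat a genome already counted elsewhere. It assumes that 'Unique Total' means the number of distinct genomes across all portals; the variable names are just for illustration.

```python
# Per-portal counts copied from the Fungi table above.
fungi_counts = {"Ensembl": 761, "JGI": 874, "FungiDB": 81, "Other": 25, "NCBI": 900}

# 'Unique Total' as reported in the table (assumed here to be the number
# of distinct genomes once all portal lists are merged).
unique_total = 1493

# Sum of every listing, counting a genome once per portal that holds it.
total = sum(fungi_counts.values())

# Listings that duplicate a genome already counted from another portal.
redundant = total - unique_total

print(f"total listings: {total}")
print(f"redundant listings: {redundant} ({redundant / total:.1%} of all listings)")
```

The same check works for the other groups by swapping in their rows.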

UpSetR Plot

Their intersections are as below; this is like a Venn diagram but 1000x better:

For example, we can see that:

whilst NCBI contains the most genomes, only 91 are unique
whereas JGI has the most unique genomes, with 530 not available elsewhere
the 4 genome portals share 39 of the same genomes
with Ensembl and NCBI having the largest overlap
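
The numbers in the list above are counts of genomes found in one exact combination of portals and nowhere else. Here is a minimal Python sketch of that counting, not the R/UpSetR code used for the real plot. The function name and the toy genome labels are hypothetical, and it assumes a genome has the same name in every portal that lists it.

```python
from collections import Counter


def exclusive_intersections(portals):
    """Count genomes by the exact set of portals that list them."""
    membership = {}
    for portal, genomes in portals.items():
        for genome in genomes:
            membership.setdefault(genome, set()).add(portal)
    # A frozenset of portal names identifies one bar of the plot.
    return Counter(frozenset(found_in) for found_in in membership.values())


# Hypothetical toy data, not taken from the real lists.
portals = {
    "NCBI": {"Genome A", "Genome B", "Genome C"},
    "JGI": {"Genome B", "Genome D"},
    "Ensembl": {"Genome B", "Genome C"},
}

counts = exclusive_intersections(portals)
for combo, n in sorted(counts.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(" & ".join(sorted(combo)), n)
```

In the toy data 'Genome A' is unique to NCBI, while 'Genome B' only counts towards the bar shared by all three portals.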

I think that's pretty revealing: if you're going to try and cover taxonomic diversity in your analyses, you're going to need to use more than just NCBI! Of course, you will need to explore the data more thoroughly; some genera have more sequencing projects than others and that is likely inflating some of the numbers...
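
One rough way to see how much a few heavily sequenced genera inflate the totals is to count genomes per genus. A minimal sketch, assuming each entry is a name that starts with the genus (e.g. "Genus species strain"); the function name is hypothetical and the names are only an illustrative list, not taken from the data files.

```python
from collections import Counter


def genomes_per_genus(names):
    """Count entries per genus, taking the first word of each name as the genus."""
    return Counter(name.split()[0] for name in names if name.strip())


# Illustrative names only; the real lists live in the data files.
names = [
    "Saccharomyces cerevisiae S288C",
    "Saccharomyces cerevisiae YJM789",
    "Saccharomyces paradoxus",
    "Neurospora crassa",
]

per_genus = genomes_per_genus(names)
print(f"{len(names)} genomes across {len(per_genus)} genera")
for genus, n in per_genus.most_common():
    print(genus, n)
```

Comparing the number of genomes with the number of genera gives a quick feel for how taxonomically broad a portal really is.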

You can access the data here, and the code to make the plot here; there are other examples in the Fungi directory.

Plants

Continuing with the loose definitions, plants = anything green...

The number of genomes from each of the 'big' genome portals is as below:

Ensembl	JGI	NCBI	Total	Unique Total
53	85	87	225	148

UpSetR Plot

Their intersections are as below; this is like a Venn diagram but 1000x better:

For example, we can see that:

NCBI wins this time with the largest number of unique genomes
but JGI is a close second
and there are surprisingly few shared between the portals

You can access the data here, and the code to make the plot here.

Metazoa

The number of genomes from each of the 'big' genome portals is as below:

Ensembl	JGI	NCBI	Total	Unique Total
162	26	371	559	433

UpSetR Plot

Their intersections are as below; this is like a Venn diagram but 1000x better:

For example, we can see that:

Very few taxa exist in all 3 portals
NCBI seems to be very metazoan heavy!

You can access the data here, and the code to make the plot here.

Protists / Other

The number of genomes from each of the 'big' genome portals is as below:

Ensembl	JGI	NCBI	EuPathDB	Other	Total	Unique Total
170	22	9	93	36	330	275

UpSetR Plot

Their intersections are as below; this is like a Venn diagram but 1000x better:

For example, we can see that:

Ensembl has the largest collection of protists in one place
Lots of protists exist in their own genome portals, or in subsidiaries of others, e.g. EuPathDB
NCBI may be underrepresented in this graph!

You can access the data here, and the code to make the plot here.

Caveats

These lists were curated from various sources. Not everyone makes their information easily accessible. So, I have probably missed taxa and if it's your favourite one, then I apologise!

NCBI

They have a list hidden away in their FTP site @ ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/. Is it complete? Hard to say. Is it full of duplicates and multiple assemblies? Yes. If anyone knows of a better list then please feel free to let me know and tell NCBI to make it more obvious.
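
To deal with the duplicates and multiple assemblies, something like the sketch below can collapse a tab-separated report to one entry per organism and keep only organisms with predicted proteins. It is not the script actually used here. The column names (`#Organism/Name`, `Proteins`) and the use of `-` for missing values are assumptions about the report layout, so check them against the real file header. The sample rows and accessions are made up for illustration.

```python
import csv
import io


def unique_organisms_with_proteins(report_text,
                                   name_col="#Organism/Name",
                                   proteins_col="Proteins"):
    """Return sorted organism names that have at least one assembly with proteins.

    Column names are assumptions; adjust them to match the real report header.
    """
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    seen = set()
    for row in reader:
        proteins = (row.get(proteins_col) or "").strip()
        # Skip assemblies with no protein count (assumed to be "-" or empty).
        if not proteins.isdigit() or int(proteins) == 0:
            continue
        # A set collapses multiple assemblies of the same organism.
        seen.add(row[name_col].strip())
    return sorted(seen)


# Hypothetical rows in the assumed layout (accessions are placeholders).
sample = "\n".join([
    "#Organism/Name\tAssembly Accession\tProteins",
    "Organism one\tGCA_000000001.1\t5000",
    "Organism one\tGCA_000000002.1\t5100",
    "Organism two\tGCA_000000003.1\t-",
    "Organism three\tGCA_000000004.1\t7000",
])

organisms = unique_organisms_with_proteins(sample)
print(organisms)
```

Deduplicating on the organism name is the crudest option; it merges different strains that share a name, so a strain or TaxID column may be a better key if one is available.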

JGI

Fungi seem to be all contained in Mycocosm now, which is great but wasn't always the case. Other taxa (plants/protists) are not all in their own '-cosms' or '-zomes' though, so it's a bit more hit and miss. I have to parse their XMLs with scripts here or just look at the genome portal web pages, one by one.

Ensembl

Most of the constituent portals have lists and tables containing the information I want. By far the easiest to use and extract data from! Yay!

EuPathDB + Others

Pretty easy from EuPathDB and its constituent portals, as they have a list function which also shows taxa with predicted proteins. Nice.
