
Global Microbial Gene Catalog sequence data, stored separately according to sequence length


malabz/GMGC-data


GMGC data downloaded from https://gmgc.embl.de

This repository collects Global Microbial Gene Catalog (GMGC) sequence data, stored separately according to sequence length.

Data statistics

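The number of sequences in each dataset, by length range, is shown below:

| Length $L$ | Built environment | Cat gut | Dog gut | Freshwater | Human gut | Human nose | Human oral | Human skin | Human vagina | Marine | Mouse gut | Pig gut | Soil | Wastewater | After deduplication |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $5000 \leq L \leq 10000$ | 10423 | 11694 | 9221 | 1853 | 82549 | 10786 | 33935 | 33167 | 5261 | 28914 | 5222 | 32600 | 10568 | 12775 | 153270 |
| $10000 \leq L \leq 20000$ | 2263 | 922 | 859 | 488 | 8587 | 886 | 2908 | 3987 | 330 | 3338 | 512 | 3512 | 1306 | 1627 | 16557 |
| $20000 \leq L \leq 30000$ | 1347 | 875 | 190 | 57 | 2128 | 996 | 1000 | 1463 | 28 | 396 | 80 | 600 | 185 | 392 | 2891 |
| $30000 \leq L \leq 40000$ | 2149 | 1913 | 1 | 1 | 2338 | 2187 | 1921 | 2163 | 0 | 38 | 7 | 63 | 229 | 3 | 2459 |
| $L \geq 40000$ | 11 | 1 | 0 | 0 | 122 | 68 | 0 | 51 | 0 | 6 | 11 | 2 | 9 | 2 | 179 |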
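To check a count against a local copy, something like the following sketch may help. It assumes the files are FASTA (one header line starting with `>` per sequence), which this README does not state, and the helper name `count_sequences` is hypothetical.

```shell
# Count sequences in an xz-compressed file without writing the
# decompressed data to disk.
# Assumption: the file is FASTA, so each sequence has one '>' header line.
count_sequences() {
    # xz -d decompresses, -c writes the result to standard output;
    # grep -c counts the lines that start with '>'
    xz -dc "$1" | grep -c '^>'
}
```

Call it as `count_sequences path/to/file.xz`, where the path is whichever file you want to check.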

How to use data

All data in this repository are compressed with xz, except dedup/5000.zip. On a Unix system, use xz to decompress the files.
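As a minimal sketch, the function below decompresses every `.xz` file under one dataset folder. The function name `decompress_folder` is hypothetical, and it assumes the compressed files end in `.xz` and sit somewhere under the folder you pass in.

```shell
# Decompress every .xz file found under one folder, keeping the archives.
# Assumption: compressed files use the .xz suffix.
decompress_folder() {
    folder="$1"
    # xz -d decompresses, -k keeps the original .xz file next to the output
    find "$folder" -type f -name '*.xz' -exec xz -dk {} +
}
```

For example, `decompress_folder human-gut` run from the repository root. Drop `-k` to delete each archive after it is decompressed. Note that xz refuses to overwrite an existing output file unless `-f` is given.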

For dedup/5000.zip, first concatenate its parts into a single archive, then unzip the result:

cat dedup/5000.z* > dedup/5000_final.zip
unzip dedup/5000_final.zip

The folder for each dataset is listed below:

| Dataset name | Folder name |
| --- | --- |
| Built environment | built-env |
| Cat gut | cat-gut |
| Dog gut | dog-gut |
| Freshwater | freshwater |
| Human gut | human-gut |
| Human nose | human-nose |
| Human oral | human-oral |
| Human skin | human-skin |
| Human vagina | human-vagina |
| Marine | marine |
| Mouse gut | mouse-gut |
| Pig gut | pig-gut |
| Soil | soil |
| Wastewater | wastewater |
| After deduplication | dedup |

Citation

Coelho, L.P., et al. Towards the biogeography of prokaryotic genes. Nature 601, 252–256 (2022).
