GMGC data downloaded from https://gmgc.embl.de
This repository collects Global Microbial Gene Catalog (GMGC) sequence data, stored separately according to sequence length.
The number of sequences in each dataset is shown below:
| Length | Built environment | Cat gut | Dog gut | Freshwater | Human gut | Human nose | Human oral | Human skin | Human vagina | Marine | Mouse gut | Pig gut | Soil | Wastewater | After deduplication |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 10423 | 11694 | 9221 | 1853 | 82549 | 10786 | 33935 | 33167 | 5261 | 28914 | 5222 | 32600 | 10568 | 12775 | 153270 |
| | 2263 | 922 | 859 | 488 | 8587 | 886 | 2908 | 3987 | 330 | 3338 | 512 | 3512 | 1306 | 1627 | 16557 |
| | 1347 | 875 | 190 | 57 | 2128 | 996 | 1000 | 1463 | 28 | 396 | 80 | 600 | 185 | 392 | 2891 |
| | 2149 | 1913 | 1 | 1 | 2338 | 2187 | 1921 | 2163 | 0 | 38 | 7 | 63 | 229 | 3 | 2459 |
| | 11 | 1 | 0 | 0 | 122 | 68 | 0 | 51 | 0 | 6 | 11 | 2 | 9 | 2 | 179 |
All data in this repository is compressed with `xz`, except `dedup/5000.zip`. Use `xz` on a Unix system to decompress the files.
For `dedup/5000.zip`, use the following commands:
cat dedup/5000.z* > dedup/5000_final.zip
unzip dedup/5000_final.zip
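Where the `xz` command-line tool is not available, the same decompression can be sketched in Python with the standard-library `lzma` module. This is only an illustrative sketch and not part of the repository: the function name `decompress_xz_files` is hypothetical, and it assumes the compressed files carry an `.xz` suffix.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz_files(folder):
    """Decompress every *.xz file under `folder`, keeping the originals.

    Assumption: compressed files end in ".xz". Returns the written paths.
    """
    written = []
    for src in sorted(Path(folder).rglob("*.xz")):
        dst = src.with_suffix("")  # drops only the trailing ".xz"
        # Stream the data so large files are not read into memory at once.
        with lzma.open(src, "rb") as fin, open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        written.append(dst)
    return written
```

For example, `decompress_xz_files("human-gut")` would decompress every `.xz` file found in that folder and leave the compressed copies in place.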
The folder for each dataset is shown below:
Dataset name | Folder name |
---|---|
Built environment | built-env |
Cat gut | cat-gut |
Dog gut | dog-gut |
Freshwater | freshwater |
Human gut | human-gut |
Human nose | human-nose |
Human oral | human-oral |
Human skin | human-skin |
Human vagina | human-vagina |
Marine | marine |
Mouse gut | mouse-gut |
Pig gut | pig-gut |
Soil | soil |
Wastewater | wastewater |
After deduplication | dedup |
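
When scripting against the repository, the dataset-to-folder mapping in the table above can be kept as a small lookup. The sketch below is one possible way to do that; the names `DATASET_FOLDERS` and `dataset_folder` are hypothetical, and the `root` argument is assumed to be the path of a local copy of this repository.

```python
from pathlib import Path

# Dataset name -> folder name, copied from the table above.
DATASET_FOLDERS = {
    "Built environment": "built-env",
    "Cat gut": "cat-gut",
    "Dog gut": "dog-gut",
    "Freshwater": "freshwater",
    "Human gut": "human-gut",
    "Human nose": "human-nose",
    "Human oral": "human-oral",
    "Human skin": "human-skin",
    "Human vagina": "human-vagina",
    "Marine": "marine",
    "Mouse gut": "mouse-gut",
    "Pig gut": "pig-gut",
    "Soil": "soil",
    "Wastewater": "wastewater",
    "After deduplication": "dedup",
}


def dataset_folder(name, root="."):
    """Return the folder path for a dataset name (case-insensitive)."""
    lookup = {key.lower(): value for key, value in DATASET_FOLDERS.items()}
    key = name.strip().lower()
    if key not in lookup:
        raise KeyError(f"unknown dataset name: {name!r}")
    # `root` is an assumed local checkout location, not defined by the repository.
    return Path(root) / lookup[key]
```

Calling `dataset_folder("Human gut")` returns the relative path `human-gut`, and an unrecognised name raises `KeyError` rather than silently pointing at a folder that does not exist.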
Coelho, L.P., et al. Towards the biogeography of prokaryotic genes. Nature 601, 252–256 (2022).