cFMD

curatedFoodMetagenomicData (cFMD) is the largest public food microbiome resource, comprising curated metadata, microbiome profiles, and genomes reconstructed from food (shotgun) metagenomes. cFMD currently hosts 14,904 metagenome-assembled genomes (MAGs) spanning 1,153 prokaryotic and 110 eukaryotic species-level genome bins (SGBs), and covers 3,444 food metagenomes from 87 food metagenomic datasets.

cFMD, initially developed within the EU Horizon 2020 MASTER project, is currently maintained within the DOMINO EU project. To contribute your own food-associated metagenomic data to the further expansion of cFMD, get in touch with one of the contacts listed at the bottom of this page.

Go to cFMD v1.1.0 for the version associated with Carlino et al., "Unexplored microbial diversity from 2,500 food metagenomes and links with the human microbiome", Cell, 2024, DOI: 10.1016/j.cell.2024.07.039.

Main update of the current version (v1.3.1): Additional curated metadata for cheese metagenomes (see MetaCheeseDB), along with the incorporation of two novel cheese metagenomic datasets.

Data

From this GitHub repository you can access the following cFMD-level files (more details are provided in the section "Detailed description of data" below):

cFMD_datasets: summary of the datasets included in the current release, the version in which each dataset was added to the cFMD database and to the MetaRefSGB system, and a reference to the publication (if available)
cFMD_metadata: metadata information, in addition to statistics about reconstructed MAGs at sample level. The table has samples as row indices and type of information as column headers. This includes:
- categorization of the samples,
- accession codes to retrieve public metagenomes,
- technical information (e.g. DNA extraction kit, sequencer, etc.),
- basic statistics (number of reads, number of bases, number of MAGs, etc.).
The unique key for querying the database is the combination of dataset_name and sample_id. Food samples were classified according to their composition and production using three levels of detail (category, type and subtype).
cFMD_metadata_rules: syntactic rules defining the metadata fields of the cFMD_metadata file above
cFMD_mags_list: list of the reconstructed MAGs, with information on:
- sample origin,
- assigned taxonomy at species-level genome bin (SGB) level (MAGs remain unassigned if they belong to SGBs not present in the MetaRefSGB database),
- known/unknown status of the SGB,
- basic statistics (number of contigs, N50, completeness, contamination, etc.).
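
Because samples are identified by the pair dataset_name + sample_id, a lookup has to match on both columns. The following is a minimal sketch of such a lookup, assuming that cFMD_metadata.tsv is tab-separated with a single header row containing columns named dataset_name and sample_id; the function name lookup_sample is hypothetical and not part of this repository.

```shell
#!/usr/bin/env bash
# Print the header plus every row whose dataset_name and sample_id match.
# Columns are located by header name, so column order does not matter.
# Usage: lookup_sample <metadata.tsv> <dataset_name> <sample_id>
lookup_sample() {
  local tsv="$1" dataset="$2" sample="$3"
  awk -F'\t' -v d="$dataset" -v s="$sample" '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "dataset_name") di = i
        if ($i == "sample_id")    si = i
      }
      if (!di || !si) {
        print "key column(s) missing from header" > "/dev/stderr"
        exit 2
      }
      print
      next
    }
    # Appending "" forces a string comparison, so IDs such as "01" and "1"
    # are not treated as equal numbers.
    ($di "") == (d "") && ($si "") == (s "")
  ' "$tsv"
}
```

For example, `lookup_sample cFMD_metadata.tsv LiZ_2019 SAMPLE_ID` prints the header and the matching row, where SAMPLE_ID stands for a sample identifier taken from the table.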

Alongside the cFMD-level files above, we also provide dataset-specific folders, accessible from the cFMD_data folder, containing the following files:

${DATASET}_prok_mags_info: metadata of the reconstructed prokaryotic MAGs.
${DATASET}_euk_mags_info: metadata of the reconstructed eukaryotic MAGs.
${DATASET}_metadata: sample-level metadata information for the dataset. An additional more complete metadata file (${DATASET}_additional_cheese_metadata) is available for datasets included in MetaCheeseDB.
${DATASET}_taxonomic_profiles: taxonomic profiles with samples as column headers and taxa as row indices, with values expressed as taxa relative abundances (%).
${DATASET}_mags: the reconstructed MAGs in fasta format (hosted externally due to large size; a download script is provided).

Users can download the MAGs of one or more datasets by downloading the provided script download_mags.sh and running it with the name(s) of the desired dataset(s) as arguments, as below (LiZ_2019 and YuY_2022 are used as examples):
```
wget "https://raw.githubusercontent.com/SegataLab/cFMD/refs/heads/main/download_mags.sh" && chmod +x download_mags.sh
./download_mags.sh LiZ_2019 YuY_2022
```
${DATASET}_functional_profiles: functional profiles (normalized UniRef90 gene families, pathway abundances, and pathway coverages) with samples as column headers and row indices as functions (hosted externally due to large size; a download script is provided).

Users can download the functional profiles of one or more datasets by downloading the provided script download_functional_profiles.sh and running it with the name(s) of the desired dataset(s) as arguments, as below (LiZ_2019 and YuY_2022 are used as examples):
```
wget "https://raw.githubusercontent.com/SegataLab/cFMD/refs/heads/main/download_functional_profiles.sh" && chmod +x download_functional_profiles.sh
./download_functional_profiles.sh LiZ_2019 YuY_2022
```
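
The ${DATASET}_taxonomic_profiles tables described above store taxa as rows and samples as columns, so reading one sample means selecting one column. The sketch below lists the most abundant taxa of a single sample. It assumes a tab-separated file whose first non-comment line is a header with sample IDs from the second column onward; the function name top_taxa and the handling of lines starting with # are assumptions, not a description of the actual files. It relies on the -g option of sort (available in GNU and BSD sort).

```shell
#!/usr/bin/env bash
# Print the n most abundant taxa (taxon <TAB> relative abundance) of one sample.
# Usage: top_taxa <profiles.tsv> <sample_id> [n]   (n defaults to 10)
top_taxa() {
  local tsv="$1" sample="$2" n="${3:-10}"
  awk -F'\t' -v s="$sample" '
    /^#/ { next }                        # skip comment lines, if any
    !header_seen {
      header_seen = 1
      for (i = 2; i <= NF; i++) if (($i "") == (s "")) c = i
      if (!c) { print "sample not found: " s > "/dev/stderr"; exit 2 }
      next
    }
    ($c + 0) > 0 { printf "%s\t%s\n", $1, $c }   # keep taxa present in the sample
  ' "$tsv" | sort -t "$(printf '\t')" -k2,2gr | head -n "$n"
}
```

For example, `top_taxa PROFILES_FILE SAMPLE_ID 5` prints the five most abundant taxa, where PROFILES_FILE and SAMPLE_ID are placeholders for a downloaded profile table and one of its column headers.
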
Detailed description of data

Field descriptions for some of the files presented above:

cFMD_metadata (unique key = dataset_name + sample_id)
- dataset_name: name of the dataset, formed as i) “first author surname + initial letter(s) of first author name(s) + _ + year of publication” for public datasets; ii) “first author surname + initial letter(s) of first author name(s) + _ + xxxx” for datasets that are not yet public (these include MASTER partner datasets); iii) “MASTER + WPn + sampling partner + increasing number” for datasets produced within MASTER
- sample_id: name of the sample
- macrocategory: highest-level description of the sample type (food, controls, food processing, environment, or animal)
- category: second highest-level description of the sample type
- type: third highest-level description of the sample type
- subtype: lowest level of description of the sample type (can be blank if not necessary/available)
- commercial_name: name of the commercialized product
- fermented/non-fermented: whether the sample is fermented, used to categorize samples across and within categories
- country: country of origin of the sample, as an ISO3 (ISO 3166-1 alpha-3) three-letter code
- sample_accession: accession code of the sample, if present in public databases
- run_accession: accession code of the sequencing run, if present in public databases
- experiment_accession: accession code of the experiment, if present in public databases
- study_accession: accession code of the study, if present in public databases
- project_accession: accession code of the project, if present in public databases
- database_origin: name of the public database from which the reads of the sample were downloaded
- library_layout: layout of the sequencing library (e.g. paired, single)
- sequencing_platform: sequencer used to read the DNA bases
- DNA_extraction_kit: extraction kit used to isolate DNA in the sample
- collection_date: day (DD/MM/YYYY) or month (MM-YYYY) or year (YYYY) of sample collection
- n_of_bases: # of nucleotides in the reads of the sample after pre-processing
- n_of_reads: # of reads of the sample after pre-processing
- min_read_len: minimum read length (in bases) among the reads of the sample
- median_read_len: median read length (in bases) among the reads of the sample
- mean_read_len: mean read length (in bases) among the reads of the sample
- max_read_len: maximum read length (in bases) among the reads of the sample
- n_contigs: # of contigs with length > 1000 bp assembled from the reads of the sample
- n_MAGs_MQ_prok: # of prokaryotic MAGs with 50%<=completeness<90% and contamination <5% according to CheckM
- n_MAGs_HQ_prok: # of prokaryotic MAGs with completeness >=90% and contamination <5% according to CheckM
- n_MAGs_MQ_euk: # of eukaryotic MAGs with 50%<=completeness<90% and contamination <5% according to BUSCO
- n_MAGs_HQ_euk: # of eukaryotic MAGs with completeness >=90% and contamination <5% according to BUSCO
- filtered: food samples with fewer than 1e08 bases, excluded from subsequent analyses
- curator: name of the curator
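
The HQ and MQ counts above follow fixed completeness and contamination thresholds. As a minimal sketch, the same rule can be written as a small helper that labels a single MAG from its two percentages. The function name mag_quality and the LQ label for MAGs meeting neither definition are assumptions introduced here for illustration.

```shell
#!/usr/bin/env bash
# Label a MAG from its completeness and contamination percentages.
#   HQ: completeness >= 90 and contamination < 5
#   MQ: 50 <= completeness < 90 and contamination < 5
#   LQ: anything else (label is an assumption, not used in the tables)
# Usage: mag_quality <completeness_%> <contamination_%>
mag_quality() {
  awk -v comp="$1" -v cont="$2" 'BEGIN {
    comp += 0; cont += 0               # force numeric comparison
    if (cont < 5 && comp >= 90)      print "HQ"
    else if (cont < 5 && comp >= 50) print "MQ"
    else                             print "LQ"
  }'
}
```

For example, `mag_quality 92.3 1.1` prints HQ and `mag_quality 75 4` prints MQ.
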
cFMD_mags_list (unique key = mag)
- MAG_id: name of the MAG formed by “${dataset_name}__${sample_id}__bin.${bin_number}”
- dataset_id: name of the dataset from which the MAG has been reconstructed
- sample_id: name of the sample from which the MAG has been reconstructed
- SGB_assignment: whether the MAG could be assigned to an SGB (genomic distance < 5%)
- SGB_id: identification number of the SGB in MetaRefSGB to which the MAG has been assigned
- unknown: one of three values: kSGB (short for knownSGB, i.e. a cluster containing at least one isolate genome), uSGB (unknownSGB, a cluster containing only reconstructed genomes), or ufSGB (unknownfoodSGB, a cluster containing only reconstructed genomes from food samples and hence newly introduced)
- assigned_taxonomy_level: species if the SGB contains at least one reference genome, otherwise the lowest assignable taxonomic rank
- superkingdom: superkingdom of the assigned taxonomy
- phylum: phylum of the assigned taxonomy
- class: class of the assigned taxonomy
- order: order of the assigned taxonomy
- family: family of the assigned taxonomy
- genus: genus of the assigned taxonomy
- species: species of the assigned taxonomy
- genome_size: # of nucleotides (including unknowns specified by N's) in the genome (CheckM)
- n_contigs: number of contigs within the genome as determined by splitting scaffolds at any position consisting of more than 10 consecutive ambiguous bases (CheckM)
- N50: N50 statistics as calculated over all contigs (CheckM)
- completeness: percentage value of the estimated completeness of the genome as determined from the presence/absence of marker genes and the expected colocalization of these genes (CheckM)
- contamination: percentage value of the estimated contamination of genome as determined by the presence of multi-copy marker genes and the expected colocalization of these genes (CheckM)
- GC_content: percentage of G+C nucleotides with respect to genome length
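
Two of the fields above lend themselves to quick command-line checks: MAG_id encodes the dataset, sample, and bin number, and the unknown field splits MAGs into kSGB, uSGB, and ufSGB. The sketch below shows one way to work with both. It assumes that dataset names contain no double underscore, and that cFMD_mags_list.tsv is tab-separated with a header row containing a column literally named unknown; the function names parse_mag_id and count_sgb_status are hypothetical.

```shell
#!/usr/bin/env bash
# Split "${dataset_name}__${sample_id}__bin.${bin_number}" into
# tab-separated dataset, sample, and bin number.
# Usage: parse_mag_id <MAG_id>
parse_mag_id() {
  local mag_id="$1" rest
  rest="${mag_id#*__}"                 # drop the leading "<dataset>__"
  printf '%s\t%s\t%s\n' \
    "${mag_id%%__*}" \
    "${rest%__bin.*}" \
    "${rest##*__bin.}"
}

# Count MAGs per value of the "unknown" column (kSGB, uSGB, ufSGB, or empty).
# Usage: count_sgb_status <mags_list.tsv>
count_sgb_status() {
  awk -F'\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == "unknown") c = i
      if (!c) { print "no \"unknown\" column in header" > "/dev/stderr"; exit 2 }
      next
    }
    { n[$c]++ }
    END { for (k in n) printf "%s\t%d\n", k, n[k] }
  ' "$1" | LC_ALL=C sort               # fixed locale gives a stable order
}
```

For example, `parse_mag_id "LiZ_2019__SAMPLE__bin.7"` prints LiZ_2019, SAMPLE, and 7 separated by tabs, where SAMPLE is a placeholder sample identifier.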

Data generation

The data provided here were mainly generated with the following tools:

Pre-processing of raw reads: validated pipeline available here
Reconstruction and taxonomic assignment of MAGs: assembly-based pipeline available here
Taxonomic profiling: MetaPhlAn4-based pipeline, with full tutorial available here
Strain-level profiling: StrainPhlAn4-based pipeline, with full tutorial available here
Functional profiling: HUMAnN3-based pipeline, with full tutorial available here

Further information and requests should be directed to Niccolò Carlino (niccolo.carlino@unitn.it), Hrituraj Dey (hrituraj.dey@unitn.it), Vitor Heidrich (vitor.heidrich@unitn.it), Nicola Segata (nicola.segata@unitn.it), Edoardo Pasolli (edoardo.pasolli@unina.it)

Publication

Carlino et al., "Unexplored microbial diversity from 2,500 food metagenomes and links with the human microbiome", Cell, 2024, DOI: 10.1016/j.cell.2024.07.039

Acknowledgements

The MASTER EU Consortium was funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 818368. This resource is also supported by the European Union’s Horizon Europe programme (project DOMINO-101060218).
