A curated version of the [GTDB v.202 representatives](https://data.gtdb.ecogenomic.org/releases/release202/202.0/) was used to train the models used in Caribou. The pipeline was also benchmarked by comparing its performance with [Kraken2](https://github.com/DerrickWood/kraken2) and [MetaPhlAn3](https://github.com/biobakery/MetaPhlAn).

All data is available under [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) on a repository administered by [Canada's Federated Research Data Repository](https://www.frdr-dfdr.ca/repo/).

Dataset repository link: [https://doi.org/10.20383/103.01160](https://doi.org/10.20383/103.01160)

Dataset repository organisation:

```
root
|- training
|   |- source
|   |- species
|   |- genus
|   |- family
|   |- class
|   |- order
|   |- phylum
|   |- domain
|
|- benchmark
|   |- simulated-reads
|   |   |- ds_50
|   |   |- ds_100
|   |   |- ds_150
|   |   |- ds_200
|   |   |- ds_250
|   |   |- ds_500
|   |
|   |- whole-genomes
|       |- ds_50
|       |- ds_100
|       |- ds_150
|       |- ds_200
|       |- ds_250
|       |- ds_500
|
|- models
```

## Datasets in the repository

For Caribou to be able to use a dataset, the dataset must consist of two files, as described in the [*building database* section of the wiki](https://github.com/bioinfoUQAM/Caribou/wiki/Building-database):

1. A ``fasta`` file containing the sequences
2. A ``csv`` file containing the taxonomic classification

### Training datasets

In the source folder, 3 datasets are present:

* ``GTDB``: Bacterial genomes from GTDB representatives
* ``cucurbita_sample_1000``: A sample of 1000 genomes for the *Cucurbita* host genus
* ``samples``: A sample dataset consisting of 2 randomly sampled genomes for each species in the GTDB dataset

In the folders named after taxonomic ranks, 3 datasets are present. They consist of a growing number of classes, each randomly sampled 100 times among the classes present more than 100 times for the corresponding taxon.

| Taxonomic rank | Number of classes | Abundance per class | Total number of samples |
|----------------|-------------------|---------------------|-------------------------|
| Domain         | 2                 | 1000                | 2000                    |
| Phylum         | 50                | 100                 | 5000                    |
| Class          | 100               | 100                 | 10000                   |
| Order          | 150               | 100                 | 15000                   |
| Family         | 200               | 100                 | 20000                   |
| Genus          | 250               | 100                 | 25000                   |
| Species        | 300               | 100                 | 30000                   |

### Benchmark datasets

In this folder, there are 2 dataset categories:

* ``whole-genomes``: Randomly sampled datasets of varying abundances containing whole genomes from species with >100 genomes available
* ``simulated-reads``: Simulated reads (10 per genome) of varying abundances from randomly sampled genomes from species with >100 genomes available

For each category, 6 abundances are present:

* 50 genomes
* 100 genomes
* 150 genomes
* 200 genomes
* 250 genomes
* 500 genomes

Each abundance was generated in triplicate, except for the 500-genome abundance, which is present in 5 replicates. Each category therefore contains 3 datasets per abundance (5 for 500 genomes), for a total of 40 datasets generated for this benchmark.

### Models

Pre-trained models for the default neural network (CNN) are available in this repository. Two files are present for each taxonomic rank:

1. ``{taxa}.hdf5``
2. ``{taxa}.json``

The `hdf5` files are the pre-trained models that can be loaded by Caribou. The `json` files contain the labels known by the models for each class and their translation into a human-readable format.

## Dataset citation

```
@misc{deMontigny:2024,
    author       = {de Montigny, Nicolas and Kembel, Steven W. and Diallo, Abdoulaye Baniré},
    title        = {Caribou pipeline for the alignment-free bacterial identification and classification in metagenomics sequencing data using machine learning},
    year         = {2024},
    howpublished = {https://doi.org/10.20383/103.01160}
}
```
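
## Example: checking a dataset pair

Since every dataset is a ``fasta`` file paired with a ``csv`` file, it can be useful to confirm that both files describe the same genomes before handing them to Caribou. The following is a minimal sketch using only the Python standard library and is not part of Caribou itself. It assumes that the identifier of a sequence is the first whitespace-delimited token of its ``fasta`` header and that the ``csv`` file has a column named ``id`` holding the same identifiers. Both are assumptions, as are the function names; check the wiki for the actual column names.

```python
import csv


def fasta_ids(fasta_path):
    """Return the set of sequence identifiers found in a FASTA file."""
    ids = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                if header:
                    # Assumption: the identifier is the first token of the header.
                    ids.add(header.split()[0])
    return ids


def csv_ids(csv_path, id_column="id"):
    """Return the set of identifiers listed in the taxonomy CSV.

    `id_column` is a hypothetical column name; adjust it to the real header.
    """
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames is None or id_column not in reader.fieldnames:
            raise ValueError(f"column {id_column!r} not found in {csv_path}")
        return {row[id_column] for row in reader}


def compare_dataset(fasta_path, csv_path, id_column="id"):
    """Report identifiers that appear in only one of the two files."""
    sequences = fasta_ids(fasta_path)
    taxonomy = csv_ids(csv_path, id_column)
    return {
        "missing_taxonomy": sorted(sequences - taxonomy),
        "missing_sequence": sorted(taxonomy - sequences),
    }
```

Calling ``compare_dataset`` with the paths of a ``fasta``/``csv`` pair returns two lists; both are empty when every sequence has a taxonomy entry and every taxonomy entry has a sequence.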