A curated version of the [GTDB v.202 representatives](https://data.gtdb.ecogenomic.org/releases/release202/202.0/) was used to train the models used by Caribou.

The pipeline was also benchmarked by comparing its performance with [Kraken2](https://github.com/DerrickWood/kraken2) and [MetaPhlAn3](https://github.com/biobakery/MetaPhlAn).

All data is available under [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) in a repository administered by [Canada's Federated Research Data Repository](https://www.frdr-dfdr.ca/repo/).

Dataset repository link: [https://doi.org/10.20383/103.01160](https://doi.org/10.20383/103.01160)

Dataset repository organisation:
```
root
|- training
|  |- source
|  |- species
|  |- genus
|  |- family
|  |- class
|  |- order
|  |- phylum
|  |- domain
|
|- benchmark
|  |- simulated-reads
|  |  |- ds_50
|  |  |- ds_100
|  |  |- ds_150
|  |  |- ds_200
|  |  |- ds_250
|  |  |- ds_500
|  |
|  |- whole-genomes
|    |- ds_50
|    |- ds_100
|    |- ds_150
|    |- ds_200
|    |- ds_250
|    |- ds_500
|
|- models
```
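
The layout above can be checked against a local copy of the repository. The sketch below is illustrative and is not part of Caribou: the folder names are copied from the tree, while the function names and the temporary directory standing in for a download location are assumptions.

```python
import tempfile
from pathlib import Path

# Folder names copied from the tree above.
RANKS = ["species", "genus", "family", "class", "order", "phylum", "domain"]
ABUNDANCES = [50, 100, 150, 200, 250, 500]
CATEGORIES = ["simulated-reads", "whole-genomes"]


def expected_folders():
    """Return the relative folder paths listed in the tree above."""
    folders = [Path("training") / name for name in ["source"] + RANKS]
    for category in CATEGORIES:
        for abundance in ABUNDANCES:
            folders.append(Path("benchmark") / category / f"ds_{abundance}")
    folders.append(Path("models"))
    return folders


def missing_folders(root):
    """Return the expected folders that do not exist under root."""
    root = Path(root)
    return [folder for folder in expected_folders() if not (root / folder).is_dir()]


# Demonstration on a throwaway directory standing in for a local download.
with tempfile.TemporaryDirectory() as tmp:
    for folder in expected_folders():
        (Path(tmp) / folder).mkdir(parents=True)
    (Path(tmp) / "models").rmdir()  # simulate an incomplete download
    missing = missing_folders(tmp)

print(len(expected_folders()), "folders expected")
print("missing:", [folder.as_posix() for folder in missing])
```

Calling `missing_folders` with the path of an actual download returns an empty list when every folder of the tree is present.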

## Datasets in the repository
To be usable by Caribou, a dataset must consist of two files, as described in the [*Building database* section of the wiki](https://github.com/bioinfoUQAM/Caribou/wiki/Building-database):
1. A ``fasta`` file containing the sequences
2. A ``csv`` file containing the taxonomic classification
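
Before building a database, it can help to confirm that the two files listed above describe the same sequences. The sketch below is not Caribou code and rests on two assumptions that should be checked against the wiki: that the sequence identifier is the first word of each FASTA header, and that the ``csv`` file has a column named `id` holding the same identifiers. The file contents and the other column names are made up for the demonstration.

```python
import csv
import tempfile
from pathlib import Path


def fasta_ids(path):
    """Collect sequence identifiers (first word of each header line)."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            fields = line[1:].split() if line.startswith(">") else []
            if fields:
                ids.add(fields[0])
    return ids


def csv_ids(path, id_column="id"):
    """Collect identifiers from the assumed id column of the csv file."""
    with open(path, newline="") as handle:
        return {row[id_column] for row in csv.DictReader(handle)}


def compare_dataset(fasta_path, csv_path):
    """Return ids found only in the fasta file and ids found only in the csv file."""
    in_fasta, in_csv = fasta_ids(fasta_path), csv_ids(csv_path)
    return sorted(in_fasta - in_csv), sorted(in_csv - in_fasta)


# Made-up two-file dataset: seq3 has no classification row.
with tempfile.TemporaryDirectory() as tmp:
    fasta_path = Path(tmp) / "example.fasta"
    csv_path = Path(tmp) / "example.csv"
    fasta_path.write_text(">seq1 demo\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n")
    csv_path.write_text("id,species,genus\nseq1,sp_a,g_a\nseq2,sp_b,g_b\n")
    only_fasta, only_csv = compare_dataset(fasta_path, csv_path)

print("only in fasta:", only_fasta)
print("only in csv:", only_csv)
```

If the identifier column has another name in the real files, it can be passed through the `id_column` argument of `csv_ids`.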

### Training datasets
In the ``source`` folder, 3 datasets are present:
* ``GTDB``: Bacterial genomes from GTDB representatives
* ``cucurbita_sample_1000``: A sample of 1000 genomes for the host genus *Cucurbita*
* ``samples``: A sample dataset consisting of 2 randomly sampled genomes for each species in the GTDB dataset

In the folders named after taxonomic ranks, 3 datasets are present. They consist of a growing number of classes, each randomly sampled 100 times (1000 times for the domain, as listed in the table below) among the classes present more than 100 times for the corresponding taxon.

| Taxonomic rank | Number of classes | Abundance per class | Total number of samples |
|----------------|-------------------|---------------------|-------------------------|
| Domain         | 2                 | 1000                | 2000                    |
| Phylum         | 50                | 100                 | 5000                    |
| Class          | 100               | 100                 | 10000                   |
| Order          | 150               | 100                 | 15000                   |
| Family         | 200               | 100                 | 20000                   |
| Genus          | 250               | 100                 | 25000                   |
| Species        | 300               | 100                 | 30000                   |
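
The totals in the table follow from the sampling rule described above: the number of samples is the number of classes multiplied by the abundance per class. The sketch below first checks that arithmetic, then illustrates one reading of the rule on made-up data. The function `sample_classes` is hypothetical and is not the code that produced the datasets. The toy identifiers and the small thresholds are chosen only to keep the example short.

```python
import random
from collections import defaultdict

# (rank, number of classes, abundance per class, total samples) from the table.
TABLE = [
    ("Domain", 2, 1000, 2000),
    ("Phylum", 50, 100, 5000),
    ("Class", 100, 100, 10000),
    ("Order", 150, 100, 15000),
    ("Family", 200, 100, 20000),
    ("Genus", 250, 100, 25000),
    ("Species", 300, 100, 30000),
]
table_is_consistent = all(n * per_class == total for _, n, per_class, total in TABLE)


def sample_classes(labels, n_classes, per_class, min_count, seed=0):
    """Pick n_classes among classes seen more than min_count times,
    then draw per_class genomes at random from each picked class."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for genome_id, label in labels.items():
        members[label].append(genome_id)
    eligible = sorted(name for name, ids in members.items() if len(ids) > min_count)
    picked = rng.sample(eligible, n_classes)
    return {name: rng.sample(members[name], per_class) for name in picked}


# Toy data: classes a and b have 6 genomes each, class c has only 2.
sizes = [("a", 6), ("b", 6), ("c", 2)]
labels = {f"{name}_{i}": name for name, size in sizes for i in range(size)}
subset = sample_classes(labels, n_classes=2, per_class=5, min_count=5)

print("table consistent:", table_is_consistent)
print({name: len(ids) for name, ids in sorted(subset.items())})
```

With `n_classes=300`, `per_class=100` and `min_count=100`, the same function would return 30000 samples, the size of the species row, provided enough classes are eligible.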

### Benchmark datasets
In the ``benchmark`` folder, there are 2 dataset categories:
* ``whole-genomes``: Randomly sampled datasets of varying abundances containing whole genomes from species with >100 genomes available
* ``simulated-reads``: Simulated reads (10 per genome) of varying abundances from randomly sampled genomes from species with >100 genomes available

For each category, 6 abundances are present:
* 50 genomes
* 100 genomes
* 150 genomes
* 200 genomes
* 250 genomes
* 500 genomes

Each abundance was then used in triplicate, except for the 500 genomes abundance, which is present in 5 replicates. This means there are 3 datasets for each abundance in each category (5 for 500 genomes). In total, 40 datasets were generated for this benchmark.
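
The total of 40 can be recovered by enumerating the categories, abundances and replicates described above. This is a counting sketch only: the category and `ds_*` folder names come from the tree at the top of this page, and the replicate numbering is hypothetical because the names of the files inside each folder are not described here.

```python
CATEGORIES = ["whole-genomes", "simulated-reads"]
ABUNDANCES = [50, 100, 150, 200, 250, 500]


def replicates(abundance):
    """Triplicates for every abundance, except 500 genomes which has 5."""
    return 5 if abundance == 500 else 3


datasets = [
    (category, f"ds_{abundance}", replicate)
    for category in CATEGORIES
    for abundance in ABUNDANCES
    for replicate in range(1, replicates(abundance) + 1)
]
per_category = {c: sum(1 for d in datasets if d[0] == c) for c in CATEGORIES}

print(per_category)
print("total:", len(datasets))
```

Each category contributes 5 x 3 + 5 = 20 datasets, which gives 40 over the two categories.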


### Models
Pre-trained models for the default neural network (CNN) are available in this repository.

2 files are present for each taxonomic rank:
1. ``{taxa}.hdf5``
2. ``{taxa}.json``

The ``hdf5`` files are the pre-trained models that can be loaded by Caribou.

The ``json`` files contain the labels known by the models for each class and their translation into a human-readable format.
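
A downloaded ``models`` folder can be checked for complete pairs of the two files described above before it is handed to Caribou. The sketch below only pairs files by name and reads a ``json`` file with the standard library. It does not load the ``hdf5`` model, which is left to Caribou. The file names `phylum` and `genus` and the content of the label file are invented for the demonstration, since the values taken by `{taxa}` and the internal structure of the ``json`` files are not described here.

```python
import json
import tempfile
from pathlib import Path


def model_pairs(models_dir):
    """Map each file stem to the set of extensions found for it."""
    found = {}
    for path in Path(models_dir).iterdir():
        if path.suffix in {".hdf5", ".json"}:
            found.setdefault(path.stem, set()).add(path.suffix)
    return found


def incomplete(models_dir):
    """Return the stems that lack either the hdf5 or the json file."""
    pairs = model_pairs(models_dir)
    return sorted(stem for stem, ext in pairs.items() if len(ext) != 2)


# Made-up folder: "phylum" is complete, "genus" lacks its json file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "phylum.hdf5").write_bytes(b"")  # placeholder, not a real model
    (root / "phylum.json").write_text(json.dumps({"0": "class_a", "1": "class_b"}))
    (root / "genus.hdf5").write_bytes(b"")
    missing_pair = incomplete(root)
    labels = json.loads((root / "phylum.json").read_text())

print("incomplete:", missing_pair)
print("entries in phylum.json:", len(labels))
```

An empty list means that every model found in the folder comes with its label file.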

## Dataset citation
```
@misc{deMontigny:2024,
  author = {de Montigny, Nicolas and Kembel, Steven W. and Diallo, Abdoulaye Baniré},
  title = {Caribou pipeline for the alignment-free bacterial identification and classification in metagenomics sequencing data using machine learning},
  year = {2024},
  howpublished = {https://doi.org/10.20383/103.01160}
}
```