Building database

Nicolas de Montigny edited this page Dec 11, 2024 · 2 revisions

Caribou was developed with the expectation that its models would be trained on the GTDB taxonomy database.

However, any database can be used for training and classification with Caribou, as long as the required file structure is respected.

The required files consist of the following:

  • Sequences in fasta format

    This can be either a folder of fasta files or a single large fasta file

  • Taxonomic classification in a csv file
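Before building the database, it can help to check how many sequences the fasta input actually contains, whether it is a folder or a single file. The function below is a minimal sketch of such a check. It is not part of Caribou: the name `count_fasta_records` is hypothetical, and the extensions `.fa`, `.fasta` and `.fna` as well as the use of uncompressed files are assumptions.

```shell
# Hypothetical helper (not part of Caribou): count fasta records,
# i.e. header lines starting with '>', in a file or in a folder.
count_fasta_records() {
    input="$1"
    if [ -d "$input" ]; then
        # Folder: concatenate every file with an assumed fasta extension
        # and count the header lines across all of them.
        find "$input" -type f \( -name '*.fa' -o -name '*.fasta' -o -name '*.fna' \) \
            -exec cat {} + | grep -c '^>' || true
    elif [ -f "$input" ]; then
        # Single large fasta file: count its header lines directly.
        grep -c '^>' "$input" || true
    else
        echo "No such file or folder: $input" >&2
        return 1
    fi
}
```

Calling `count_fasta_records [directory]` or `count_fasta_records [fastaFile]` prints the number of records found, which can then be compared with the number of rows expected in the classes csv file.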

To build the training database from the GTDB taxonomy, use the template script, which merges the data into one large fasta file and extracts the classes into a csv file. Before running it, the user must edit the template to insert the file paths, and comment out the host section if no host is used.

The modified template can be submitted to an HPC cluster managed by Slurm (e.g. Compute Canada) using the following command:

sbatch Caribou/data/build_data_scripts/template_slurm_datagen.sh

The modified template can also be run in a Linux shell with the following command:

sh Caribou/data/build_data_scripts/template_slurm_datagen.sh

Finally, each script used by the template can be run on its own in a Linux shell with the following commands:

# Generate a list of all fastas to be merged
sh Caribou/data/build_data_scripts/generateFastaList.sh -d [directory] -o [outputFile]

# Extract classes for each bacterial genome fasta using the GTDB taxonomy
sh Caribou/data/build_data_scripts/fasta2class_bact.sh -d [directory] -i [inputFile] -c [classesFile] -o [outputDirectory]

# Extract classes for each host fasta
sh Caribou/data/build_data_scripts/fasta2class_host.sh -d [directory] -i [inputFile] -o [outputDirectory]