Building database
Caribou was developed with the expectation that its models would be trained on the GTDB taxonomy database.
However, any database can be used for training and classification with Caribou, as long as the required file structure is respected.
The required files consist of the following:
- Sequences in fasta format. This can be either a folder containing fasta files or one large fasta file
- Taxonomic classification in a csv file
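Before training, it can help to confirm that both inputs are present and roughly consistent. The following is a minimal sketch and not part of Caribou: the function names are hypothetical, and it assumes uncompressed fasta files ending in `.fa`, `.fasta` or `.fna`, and a csv file with a single header line.

```shell
#!/bin/sh
# Count FASTA records (header lines starting with ">") in a file or a folder.
# Assumption: FASTA files are uncompressed and end in .fa, .fasta or .fna.
count_fasta_records() {
    target=$1
    if [ -d "$target" ]; then
        find "$target" -type f \( -name '*.fa' -o -name '*.fasta' -o -name '*.fna' \) \
            -exec cat {} + | grep -c '^>'
    else
        grep -c '^>' "$target"
    fi
}

# Count non-empty data rows in a csv file.
# Assumption: the first line is a header and is not counted.
count_csv_rows() {
    awk 'NR > 1 && NF { n++ } END { print n + 0 }' "$1"
}
```

Calling `count_fasta_records` on the sequence folder (or on the single large fasta file) and `count_csv_rows` on the classification file gives two numbers to compare. Whether they should match exactly depends on how the csv is organized (one row per sequence or one row per genome), which this sketch does not assume.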
To build the training database from the GTDB taxonomy, use the template script, which builds the data into one large fasta file and extracts the classes into a csv file. Before running it, edit the template to insert your file paths, and comment out the host section if no host is used.
The modified template can be submitted to an HPC cluster managed by Slurm (e.g. Compute Canada) using the following command:
sbatch Caribou/data/build_data_scripts/template_slurm_datagen.sh
The modified template can also be run in a Linux command shell with the following command:
sh Caribou/data/build_data_scripts/template_slurm_datagen.sh
Finally, each script used by the template can be run on its own in a Linux command shell with the following commands:
# Generate a list of all fastas to be merged
sh Caribou/data/build_data_scripts/generateFastaList.sh -d [directory] -o [outputFile]
# Extract classes for each bacterial genome fasta using the GTDB taxonomy
sh Caribou/data/build_data_scripts/fasta2class_bact.sh -d [directory] -i [inputFile] -c [classesFile] -o [outputDirectory]
# Extract classes for each host fasta
sh Caribou/data/build_data_scripts/fasta2class_host.sh -d [directory] -i [inputFile] -o [outputDirectory]
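For intuition, the merge step that would consume a list of fasta files can be sketched as follows. This is not the Caribou script itself: it assumes the list holds one uncompressed fasta path per line, and `merge_fastas`, `fasta_list.txt` and `merged.fasta` are hypothetical names.

```shell
#!/bin/sh
# Concatenate every FASTA file named in a list into one large FASTA file.
# Assumption: the list holds one uncompressed FASTA path per line.
merge_fastas() {
    list=$1
    out=$2
    : > "$out"                          # start from an empty output file
    # The "|| [ -n ... ]" part keeps a last line that lacks a trailing newline.
    while IFS= read -r path || [ -n "$path" ]; do
        if [ -n "$path" ]; then         # skip blank lines
            cat "$path" >> "$out"
        fi
    done < "$list"
}

# Example call (hypothetical file names):
# merge_fastas fasta_list.txt merged.fasta
```

Plain concatenation only gives a valid result when every input file ends with a newline; otherwise the last sequence line of one file would run into the header of the next.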