Building database

Caribou was developed with the expectation that its models would be trained on the GTDB taxonomy database.

However, any database can be used for training and classification with Caribou, as long as the required file structure is respected.

The required files consist of the following:

Sequences in FASTA format, provided either as a folder containing FASTA files or as one large FASTA file

Taxonomic classification in a CSV file
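
Before training, it can help to confirm that both inputs are in place. The following is a minimal sketch, not part of Caribou: `check_inputs` is a hypothetical helper, and it assumes the FASTA input is uncompressed (a gzipped file would fail the header check).

```shell
#!/bin/sh
# Sketch only: sanity-check the two required inputs.
# check_inputs is a hypothetical helper, not a Caribou script.
check_inputs() {
    seqs=$1      # folder of FASTA files OR one large FASTA file
    classes=$2   # CSV file holding the taxonomic classification

    if [ -d "$seqs" ]; then
        # Folder: require at least one regular file inside it
        if [ -z "$(find "$seqs" -type f | head -n 1)" ]; then
            echo "no files found in $seqs" >&2
            return 1
        fi
    elif [ -f "$seqs" ]; then
        # Single file: an uncompressed FASTA file starts with a '>' header line
        if ! head -n 1 "$seqs" | grep -q '^>'; then
            echo "$seqs does not look like FASTA" >&2
            return 1
        fi
    else
        echo "sequences path not found: $seqs" >&2
        return 1
    fi

    # The CSV must exist and be non-empty; its columns are not checked here
    if [ ! -s "$classes" ]; then
        echo "classes CSV missing or empty: $classes" >&2
        return 1
    fi
    echo "inputs look usable"
}
```

It could be called as `check_inputs /path/to/genomes /path/to/classes.csv`, where both paths are placeholders.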

To build the training database from the GTDB taxonomy, use the template script, which gathers the data into one large FASTA file and extracts the classes into a CSV file. Before running the template, edit it to insert your file paths, and comment out the host section if no host is used.

The modified template can be submitted to an HPC cluster managed by Slurm (e.g., Compute Canada) using the following command:

sbatch Caribou/data/build_data_scripts/template_slurm_datagen.sh

The modified template can also be run in a Linux shell using the following command:

sh Caribou/data/build_data_scripts/template_slurm_datagen.sh

Finally, each script used by the template can be run on its own in a Linux shell using the following commands:

# Generate a list of all fastas to be merged
sh Caribou/data/build_data_scripts/generateFastaList.sh -d [directory] -o [outputFile]

# Extract classes for each bacterial genome fasta using the GTDB taxonomy
sh Caribou/data/build_data_scripts/fasta2class_bact.sh -d [directory] -i [inputFile] -c [classesFile] -o [outputDirectory]

# Extract classes for each host fasta
sh Caribou/data/build_data_scripts/fasta2class_host.sh -d [directory] -i [inputFile] -o [outputDirectory]
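
The three commands above can be wrapped so that they are previewed before anything is executed. This is a rough sketch under stated assumptions: every path is a placeholder to replace, the `run` and `build_all` helpers and the `DRY_RUN` variable are hypothetical, and each bracketed argument is kept as an independent variable because this page does not state how the output of one script feeds the next.

```shell
#!/bin/sh
# Sketch only: preview or run the three helper scripts in order.
# All values below are placeholders; none comes from Caribou itself.
SCRIPTS=Caribou/data/build_data_scripts
DRY_RUN=${DRY_RUN:-1}                # 1 = print commands, 0 = execute them

LIST_DIR=/path/to/genomes            # -d for generateFastaList.sh
LIST_OUT=/path/to/fasta_list.txt     # -o for generateFastaList.sh
BACT_DIR=/path/to/bact_dir           # -d for fasta2class_bact.sh
BACT_INPUT=/path/to/bact_input       # -i for fasta2class_bact.sh
CLASSES_FILE=/path/to/classes_file   # -c for fasta2class_bact.sh
HOST_DIR=/path/to/host_dir           # -d for fasta2class_host.sh
HOST_INPUT=/path/to/host_input       # -i for fasta2class_host.sh
OUT_DIR=/path/to/output              # -o for both fasta2class scripts

# Print the command in dry-run mode, otherwise execute it
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stop at the first step that fails
build_all() {
    run sh "$SCRIPTS/generateFastaList.sh" -d "$LIST_DIR" -o "$LIST_OUT" || return 1
    run sh "$SCRIPTS/fasta2class_bact.sh" -d "$BACT_DIR" -i "$BACT_INPUT" -c "$CLASSES_FILE" -o "$OUT_DIR" || return 1
    run sh "$SCRIPTS/fasta2class_host.sh" -d "$HOST_DIR" -i "$HOST_INPUT" -o "$OUT_DIR" || return 1
}

build_all
```

With the default `DRY_RUN=1`, the script only prints the three commands so the paths can be reviewed. Setting `DRY_RUN=0` in the environment executes them, and the host line can be removed when no host is used.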
