Configuration file
All information required in the configuration file is described below. At a minimum, you must provide the paths to your files on your drive in the [io] section. All other variables are optional and can be omitted to use the default settings.
A template for the configuration file is provided, in which users can replace each value with their own.
The template is located at:
Caribou/eval_configs/template_config.ini
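As a minimal sketch of how such a file can be read, the example below parses a configuration with Python's standard configparser module, checks that the required [io] section is present, and falls back to a default for an omitted optional variable. The paths are placeholders, the load_config helper is hypothetical, and whether Caribou itself reads the file this way is an assumption.

```python
import configparser

# Hypothetical minimal configuration: only the [io] section is filled in.
# The paths are placeholders, not real files.
EXAMPLE_CONFIG = """
[io]
database_seq_file = /abs/path/to/database.fna.gz
database_cls_file = /abs/path/to/class.csv
metagenome_seq_file = /abs/path/to/metagenome.fna.gz
outdir = /abs/path/to/results
"""


def load_config(text):
    """Parse configuration text and check that the [io] section exists."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section("io"):
        raise ValueError("the [io] section is required")
    return parser


config = load_config(EXAMPLE_CONFIG)
# An optional variable that is absent falls back to its documented default.
k = config.getint("settings", "k", fallback=35)
print(k)  # 35
```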
Names section.
Optional. These names are used to create folders and files automatically.
-
database
String. Name of the database used, defaults to "database".
-
metagenome
String. Name of the metagenomic community to analyse, defaults to "metagenome".
-
host
String. Name of the host for the metagenomic community to analyse if there is one, defaults to "None" to signify that there is no host.
Input and output section.
Required. Files for the database and the host (if there is one), as well as a folder for the files produced by Caribou.
In this section, absolute paths are recommended. Otherwise, the program may be unable to find the specified files and folders.
Compressed fasta files in gzip format are also recommended, as they take much less space on the drive.
If the scripts in the folder
Caribou/data/build_GTDB
were used to generate these files, the generated fasta files will already be compressed and the generated class.csv files will contain the following taxonomic levels: "species", "genus", "family", "order", "class", "phylum", "domain".
-
database_seq_file
String. PATH to a fasta file containing all sequences from the database used.
-
database_cls_file
String. PATH to a csv file containing the taxonomic classification of all sequences from the database used.
If a class.csv file is created manually, it should contain at least the following 3 columns: "id", "species", "domain", and possibly other taxa columns to be identified depending on the user's interest and/or database used.
- The "id" column contains the id of each bacterial genome in the database.
- The "species" column contains the identified species of each bacterial genome in the database.
- The "domain" column must contain only the term "bacteria".
- Other taxonomic levels can be specified with various names depending on the user/database following the same principle as the "species" column.
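A manually created class.csv file can be checked against the rules above before running the program. The following is a minimal sketch using only the standard library; the check_class_rows helper and the example rows are hypothetical and not part of Caribou.

```python
import csv
import io

# Columns that the description above lists as mandatory.
REQUIRED_COLUMNS = {"id", "species", "domain"}


def check_class_rows(handle):
    """Read a class.csv-like file and verify the documented constraints."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = list(reader)
    # The "domain" column must contain only the term "bacteria".
    bad_ids = [row["id"] for row in rows if row["domain"] != "bacteria"]
    if bad_ids:
        raise ValueError(f"unexpected domain for ids: {bad_ids}")
    return rows


# Hypothetical content with one extra taxonomic level ("genus").
example = io.StringIO(
    "id,species,genus,domain\n"
    "genome_1,Escherichia coli,Escherichia,bacteria\n"
    "genome_2,Bacillus subtilis,Bacillus,bacteria\n"
)
rows = check_class_rows(example)
print(len(rows))  # 2
```

For a real file, the same function can be called on `open(path, newline="")`, or on `gzip.open(path, "rt", newline="")` if the csv is compressed.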
-
host_seq_file
String. PATH to a fasta file containing all known host sequences, taken from a database or a sequencing experiment.
-
host_cls_file
String. PATH to a csv file containing the host taxonomic classification.
If a class.csv file is created manually, it should contain only the following 2 columns: "id", "domain".
- The "id" column contains all the ids of the host.
- The "domain" column must contain only the term "host".
-
metagenome_seq_file
String. PATH to a fasta file containing all sequences to classify from a DNA metagenomic sequencing experiment. The sequences should be preprocessed for quality and trimmed of the sequencing primers.
-
outdir
String. PATH to the folder in which the output files will be saved. New subfolders will be created in this folder depending on the options provided in the [settings] section.
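Because absolute paths are recommended in this section, it can help to flag relative ones before launching a run. The sketch below is one possible check; the relative_paths helper and the example values are hypothetical.

```python
import os


def relative_paths(io_section):
    """Return the option names whose value is not an absolute path."""
    return [
        name
        for name, value in io_section.items()
        if not os.path.isabs(value)
    ]


# Hypothetical [io] values: one of them is relative.
io_section = {
    "database_seq_file": "/data/gtdb/database.fna.gz",
    "database_cls_file": "data/class.csv",
    "outdir": "/data/results",
}
print(relative_paths(io_section))  # ['database_cls_file'] on a POSIX system
```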
Settings section.
Optional. Values for various options in classification steps.
-
k
Integer. Length of the k-mers that will be extracted to establish the profiles, defaults to 35.
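To illustrate what k controls, the sketch below counts every overlapping substring of length k in a sequence. It is a generic illustration of k-mer counting, not Caribou's own extraction code, and count_kmers is a hypothetical name.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping substring of length k in a sequence."""
    sequence = sequence.upper()
    # A sequence shorter than k yields an empty profile.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


profile = count_kmers("ATGCATGC", k=4)
print(profile["ATGC"])  # 2
```

A larger k gives more specific k-mers, and each sequence of length n contributes at most n - k + 1 of them.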
-
cross_validation
Boolean. If cross-validation statistics should be computed and saved as graphs, defaults to "True".
If '1', 'yes', 'true', 'on' or 'True' are used, cross-validation will be computed.
If '0', 'no', 'false', 'off' or 'False' are used, cross-validation will not be computed.
If this option is used, the folder
outdir/plots
will be created and the graphs containing the cross-validation statistics will be saved there.
-
nb_cv_jobs
Integer. Number of cross-validation jobs to run, defaults to 1.
This option only takes effect together with the cross_validation option; it is ignored otherwise.
If multiple cross-validation jobs are run, the program will keep the one with the best statistics and discard the others after producing the graphs.
If a GPU is enabled on the machine used, the program will parallelise the training by using multiple cores. Otherwise, it will try to run on multiple threads at the same time. Finally, if no parallelisation is possible, all training and cross-validation will run sequentially.
-
verbose
Boolean. If the program should be verbose and print to stdout which step it is running, defaults to "True".
If '1', 'yes', 'true', 'on' or 'True' are used, the program will be verbose.
If '0', 'no', 'false', 'off' or 'False' are used, the program will not be verbose.
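The accepted boolean spellings listed for these options are the ones recognised by the getboolean method of Python's standard configparser module. The sketch below shows how such values parse; that Caribou reads them through configparser is an assumption.

```python
import configparser

parser = configparser.ConfigParser()
parser.read_string("""
[settings]
cross_validation = yes
verbose = off
""")

# getboolean maps '1', 'yes', 'true', 'on' to True and
# '0', 'no', 'false', 'off' to False, ignoring case.
cross_validation = parser.getboolean("settings", "cross_validation", fallback=True)
verbose = parser.getboolean("settings", "verbose", fallback=True)
# An option that is left out takes the fallback value.
binary_save_host = parser.getboolean("settings", "binary_save_host", fallback=True)

print(cross_validation, verbose, binary_save_host)  # True False True
```

Any other spelling, such as "maybe", makes getboolean raise a ValueError.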
-
training_batch_size
Integer. The size of the batches to be used, defaults to 32.
The classifiers will be trained by using only parts of the data at one time.
This parameter is mainly used to handle large datasets on machines with limited RAM.
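The idea of training on only part of the data at a time can be sketched as follows. This is a generic batching helper with a hypothetical name, not the way Caribou feeds its classifiers.

```python
def iter_batches(items, batch_size=32):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 70 hypothetical samples split with the default batch size of 32.
batches = list(iter_batches(list(range(70)), batch_size=32))
print([len(batch) for batch in batches])  # [32, 32, 6]
```

Only one batch needs to be held in memory at a time, so a smaller batch size lowers peak memory use at the cost of more training steps.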
-
binary_save_host
Boolean. If the sequences identified as host should be extracted to a file, defaults to "True".
If '1', 'yes', 'true', 'on' or 'True' are used, the host sequences will be saved to a file.
If '0', 'no', 'false', 'off' or 'False' are used, the host sequences will not be saved to a file.
-
binary_save_unclassified
Boolean. If the unclassified sequences from the host extraction step should be extracted to a file, defaults to "True".
If '1', 'yes', 'true', 'on' or 'True' are used, the unclassified sequences will be saved to a file.
If '0', 'no', 'false', 'off' or 'False' are used, the unclassified sequences will not be saved to a file.
-
host_extractor
String. Name of the classifier to be used to extract bacteria sequences, defaults to "attention" if there is a host or "onesvm" if there is no host.
Available options are:
- onesvm : A One-Class Support Vector Machine (SVM) classifier implemented in scikit-learn, to be used only when there is no host
- linearsvm : A linear SVM classifier implemented in scikit-learn
- attention : A Neural Network based on Weighted Average architecture
- lstm : A Neural Network based on long short-term memory (LSTM) architecture
- deeplstm : A deep Neural Network based on LSTM architecture
-
bacteria_classifier
String. Name of the classifier to be used to classify bacteria sequences at each taxonomic rank, defaults to "lstm_attention".
Available options are:
- sgd : A Ridge regression with SGD squared loss implemented in scikit-learn
- mnb : A Multinomial Naïve Bayes classifier implemented in scikit-learn
- lstm_attention : A Neural Network based on LSTM and Attention architecture
- cnn : A Neural Network based on Convolution (CNN) architecture
- deepcnn : A deep Neural Network based on CNN architecture
-
classification_threshold
Float. The threshold used to decide whether a classification at a taxonomic level is kept or sent to the next level, defaults to 0.8.
The threshold must be between 0 and 1. The higher the value, the more specific the classification will be.
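One way to picture the threshold is as a cut-off on the confidence of each prediction. The sketch below assumes that each prediction comes with a probability and that values at or above the threshold are kept; both points are assumptions for illustration, and split_by_threshold is a hypothetical helper, not Caribou's implementation.

```python
def split_by_threshold(predictions, threshold=0.8):
    """Separate kept classifications from those sent to the next level.

    predictions maps a sequence id to a (label, probability) pair.
    """
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    kept, deferred = {}, []
    for seq_id, (label, probability) in predictions.items():
        if probability >= threshold:
            kept[seq_id] = label
        else:
            deferred.append(seq_id)
    return kept, deferred


# Hypothetical predictions at one taxonomic level.
predictions = {
    "read_1": ("Escherichia coli", 0.93),
    "read_2": ("Bacillus subtilis", 0.55),
}
kept, deferred = split_by_threshold(predictions, threshold=0.8)
print(kept)      # {'read_1': 'Escherichia coli'}
print(deferred)  # ['read_2']
```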
Outputs section.
Optional. Values for various options for outputs.
-
mpa-style
Boolean. If an mpa-style classification table should be output, defaults to "True".
If '1', 'yes', 'true', 'on' or 'True' are used, the table will be created.
If '0', 'no', 'false', 'off' or 'False' are used, the table will not be created.
-
kronagram
Boolean. If the Kronagram should be output, defaults to "True".
If '1', 'yes', 'true', 'on' or 'True' are used, the Kronagram will be created.
If '0', 'no', 'false', 'off' or 'False' are used, the Kronagram will not be created.
-
abundance_report
Boolean. If an abundance report table should be output, defaults to "True".
If '1', 'yes', 'true', 'on' or 'True' are used, the table will be created.
If '0', 'no', 'false', 'off' or 'False' are used, the table will not be created.