API reference

Nicolas de Montigny edited this page Dec 11, 2024 · 1 revision

Usage of each Caribou script.

For a description of each step, see the analysis description part of this wiki.

Whole pipeline

Run the entire Caribou analysis pipeline.

Script: Caribou_pipeline.py

Arguments:

-h, --help            show this help message and exit
-c CONFIG, --config CONFIG
                    PATH to a configuration file containing the choices made by the user. Please refer to the wiki for further details : https://github.com/bioinfoUQAM/Caribou/wiki
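
As a minimal sketch of how the pipeline script might be launched from Python, the snippet below only assembles and prints the command line. The configuration path is a placeholder, and the sketch assumes `Caribou_pipeline.py` is reachable from the current directory; the wiki linked above describes the actual configuration format.

```python
import shlex

# Hypothetical path; replace it with your own configuration file.
config_path = "path/to/config_file"

# Only the --config option documented above is used.
pipeline_cmd = ["python", "Caribou_pipeline.py", "--config", config_path]

# Print the command instead of running it. To launch it for real,
# pass the list to subprocess.run(pipeline_cmd, check=True).
print(shlex.join(pipeline_cmd))
```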

K-mers representation of genetic sequences

Extract the k-mers profile of a given dataset and save it to disk.

Script: Caribou_kmers.py

Arguments:

-h, --help            show this help message and exit
-s SEQ_FILE, --seq_file SEQ_FILE
                    PATH to a fasta file containing bacterial genomes to build k-mers from or a folder containing fasta files with one sequence per file
-c CLS_FILE, --cls_file CLS_FILE
                    PATH to a csv file containing classes of the corresponding fasta
-dt DATASET_NAME, --dataset_name DATASET_NAME
                    Name of the dataset used to name files
-sh SEQ_FILE_HOST, --seq_file_host SEQ_FILE_HOST
                    PATH to a fasta file containing host genomes to build k-mers from or a folder containing fasta files with one sequence per file
-ch CLS_FILE_HOST, --cls_file_host CLS_FILE_HOST
                    PATH to a csv file containing classes of the corresponding host fasta
-dh HOST_NAME, --host_name HOST_NAME
                    Name of the host used to name files
-k K_LENGTH, --k_length K_LENGTH
                    Length of k-mers to extract
-l KMERS_LIST, --kmers_list KMERS_LIST
                    PATH to a file containing a list of k-mers to be extracted if the dataset is not a training database
-o OUTDIR, --outdir OUTDIR
                    PATH to a directory on disk where outputs will be saved
-wd WORKDIR, --workdir WORKDIR
                    Optional. Path to a working directory where tuning data will be spilled
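
The options above suggest two ways of calling the script: without `--kmers_list` for a training database, and with it for any other dataset. The sketch below assembles both command lines without running them. All paths, dataset names, and the value of `k` are placeholders, and which options are mandatory in each case is not stated above, so treat the selection as an assumption.

```python
import shlex


def kmers_command(seq_file, cls_file, dataset_name, k_length, outdir, kmers_list=None):
    """Assemble a Caribou_kmers.py command line from the documented options."""
    cmd = [
        "python", "Caribou_kmers.py",
        "--seq_file", seq_file,
        "--cls_file", cls_file,
        "--dataset_name", dataset_name,
        "--k_length", str(k_length),
        "--outdir", outdir,
    ]
    # --kmers_list is only passed when the dataset is not a training database.
    if kmers_list is not None:
        cmd += ["--kmers_list", kmers_list]
    return cmd


# Training database: no list of k-mers is passed.
db_cmd = kmers_command("db/genomes.fna", "db/classes.csv", "my_db", 20, "out")

# Any other dataset: pass an existing list of k-mers (hypothetical file name).
mg_cmd = kmers_command(
    "mg/reads.fna", "mg/classes.csv", "my_metagenome", 20, "out",
    kmers_list="out/kmers_list.txt",
)

print(shlex.join(db_cmd))
print(shlex.join(mg_cmd))
```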

Bacterial sequences identification and host sequences exclusion

Train a model and extract bacteria / host sequences.

Script: Caribou_extraction.py

Arguments:

-h, --help            show this help message and exit
-db DATA_BACTERIA, --data_bacteria DATA_BACTERIA
                    PATH to a npz file containing the data corresponding to the k-mers profile for the bacteria database
-dh DATA_HOST, --data_host DATA_HOST
                    PATH to a npz file containing the data corresponding to the k-mers profile for the host
-dn DATABASE_NAME, --database_name DATABASE_NAME
                    Name of the bacteria database used to name files
-hn HOST_NAME, --host_name HOST_NAME
                    Name of the host database used to name files
-dm DATA_METAGENOME, --data_metagenome DATA_METAGENOME
                    PATH to a npz file containing the data corresponding to the k-mers profile for the metagenome to classify
-mn METAGENOME_NAME, --metagenome_name METAGENOME_NAME
                    Name of the metagenome to classify used to name files
-m MERGED, --merged MERGED
                    PATH to a npz file containing the k-mers profile for the merged bacteria and host databases
-v VALIDATION, --validation VALIDATION
                    PATH to a npz file containing the k-mers profile for the validation dataset
-model {None,onesvm,linearsvm,attention,lstm,deeplstm}, --model_type {None,onesvm,linearsvm,attention,lstm,deeplstm}
                    The type of model to train
-bs BATCH_SIZE, --batch_size BATCH_SIZE
                    Batch size to use, defaults to 32
-e TRAINING_EPOCHS, --training_epochs TRAINING_EPOCHS
                    The number of training iterations for the neural network models if one is chosen, defaults to 100
-o OUTDIR, --outdir OUTDIR
                    PATH to a directory on disk where outputs will be saved
-wd WORKDIR, --workdir WORKDIR
                    Optional. Path to a working directory where Ray Tune will output and spill tuning data
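
Because `--model_type` only accepts a fixed set of names, it can help to check the name before starting a long run. The following is a sketch of such a wrapper, not part of Caribou: the function name and all paths are hypothetical, the naming options (`--database_name`, `--host_name`, `--metagenome_name`) are left out for brevity and may be required in practice, and the command is printed rather than executed.

```python
import shlex

# Model names copied from the --model_type choices listed above.
EXTRACTION_MODELS = ("onesvm", "linearsvm", "attention", "lstm", "deeplstm")


def extraction_command(data_bacteria, data_host, data_metagenome, model_type,
                       outdir, batch_size=32, training_epochs=100):
    """Assemble a Caribou_extraction.py command line, rejecting unknown models early."""
    if model_type not in EXTRACTION_MODELS:
        raise ValueError(f"unsupported model type: {model_type!r}")
    return [
        "python", "Caribou_extraction.py",
        "--data_bacteria", data_bacteria,
        "--data_host", data_host,
        "--data_metagenome", data_metagenome,
        "--model_type", model_type,
        "--batch_size", str(batch_size),
        "--training_epochs", str(training_epochs),
        "--outdir", outdir,
    ]


# Placeholder paths; 32 and 100 are the defaults documented above.
extraction_cmd = extraction_command(
    "out/bacteria.npz", "out/host.npz", "out/metagenome.npz",
    "linearsvm", "out/extraction",
)
print(shlex.join(extraction_cmd))
```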

Bacterial sequences identification and host sequences exclusion with cross-validation

Train and cross-validate a model for the bacteria extraction / host removal step.

Script: Caribou_extraction_train_cv.py

Arguments:

-h, --help            show this help message and exit
-db DATA_BACTERIA, --data_bacteria DATA_BACTERIA
                    PATH to a npz file containing the data corresponding to the k-mers profile for the bacteria database
-dh DATA_HOST, --data_host DATA_HOST
                    PATH to a npz file containing the data corresponding to the k-mers profile for the host
-dn DATABASE_NAME, --database_name DATABASE_NAME
                    Name of the bacteria database used to name files
-hn HOST_NAME, --host_name HOST_NAME
                    Name of the host database used to name files
-m MERGED, --merged MERGED
                    PATH to a npz file containing the k-mers profile for the merged bacteria and host databases
-v VALIDATION, --validation VALIDATION
                    PATH to a npz file containing the k-mers profile for the validation dataset
-t TEST, --test TEST  PATH to a npz file containing the k-mers profile for the test dataset
-model {onesvm,linearsvm,attention,lstm,deeplstm}, --model_type {onesvm,linearsvm,attention,lstm,deeplstm}
                    The type of model to train
-bs BATCH_SIZE, --batch_size BATCH_SIZE
                    Batch size to use, defaults to 32
-e TRAINING_EPOCHS, --training_epochs TRAINING_EPOCHS
                    The number of training iterations for the neural network models if one is chosen, defaults to 100
-o OUTDIR, --outdir OUTDIR
                    PATH to a directory on disk where outputs will be saved
-wd WORKDIR, --workdir WORKDIR
                    Optional. Path to a working directory where Ray Tune will output and spill tuning data
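
One plausible use of the cross-validation script is to compare several model types on the same data. The sketch below builds one command per model, each with its own output directory, and prints them instead of running them. File paths, dataset names, and the output directory layout are assumptions made for illustration.

```python
import shlex

# Model names copied from the --model_type choices listed above.
cv_models = ["onesvm", "linearsvm", "attention", "lstm", "deeplstm"]

# Options shared by every run; all paths and names are placeholders.
cv_base = [
    "python", "Caribou_extraction_train_cv.py",
    "--data_bacteria", "out/bacteria.npz",
    "--data_host", "out/host.npz",
    "--database_name", "my_db",
    "--host_name", "my_host",
]

# One command per model, each writing to its own output directory.
cv_commands = [
    cv_base + ["--model_type", model, "--outdir", f"out/cv_{model}"]
    for model in cv_models
]

for cmd in cv_commands:
    print(shlex.join(cmd))
```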

Top-down bacterial sequences classification

Train a model and classify bacteria sequences iteratively over known taxonomic ranks.

Script: Caribou_classification.py

Arguments:

-h, --help            show this help message and exit
-db DATA_BACTERIA, --data_bacteria DATA_BACTERIA
                    PATH to a npz file containing the data corresponding to the k-mers profile for the bacteria database
-dt DATABASE_NAME, --database_name DATABASE_NAME
                    Name of the bacteria database used to name files
-mg DATA_METAGENOME, --data_metagenome DATA_METAGENOME
                    PATH to a npz file containing the data corresponding to the k-mers profile for the metagenome to classify
-mn METAGENOME_NAME, --metagenome_name METAGENOME_NAME
                    Name of the metagenome to classify used to name files
-v VALIDATION, --validation VALIDATION
                    PATH to a npz file containing the k-mers profile for the validation dataset
-model {sgd,mnb,lstm_attention,cnn,widecnn}, --model_type {sgd,mnb,lstm_attention,cnn,widecnn}
                    The type of model to train
-tx TAXA, --taxa TAXA
                    The taxonomic level to use for the classification, defaults to species. Can be one level or a list of levels separated by commas.
-bs BATCH_SIZE, --batch_size BATCH_SIZE
                    Batch size to use, defaults to 32
-e TRAINING_EPOCHS, --training_epochs TRAINING_EPOCHS
                    The number of training iterations for the neural network models if one is chosen, defaults to 100
-o OUTDIR, --outdir OUTDIR
                    PATH to a directory on disk where outputs will be saved
-wd WORKDIR, --workdir WORKDIR
                    Optional. Path to a working directory where Ray Tune will output and spill tuning data
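
Since `--taxa` takes either one level or a comma-separated list, a list of levels can be joined into a single argument. The sketch below assumes the level names `genus` and `species`; only `species` appears in the help text above, so check the accepted spellings before relying on `genus`. Paths and dataset names are placeholders, and the command is printed rather than executed.

```python
import shlex

# Assumed level names; only "species" is confirmed by the help text.
taxa_levels = ["genus", "species"]

# The help text asks for levels separated by commas, with no spaces.
taxa_arg = ",".join(taxa_levels)

classification_cmd = [
    "python", "Caribou_classification.py",
    "--data_bacteria", "out/bacteria.npz",
    "--database_name", "my_db",
    "--data_metagenome", "out/metagenome.npz",
    "--metagenome_name", "my_metagenome",
    "--model_type", "sgd",
    "--taxa", taxa_arg,
    "--outdir", "out/classification",
]
print(shlex.join(classification_cmd))
```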

Top-down bacterial sequences classification with cross-validation

Train and cross-validate a model for the bacteria classification step.

Script: Caribou_classification_train_cv.py

Arguments:

-h, --help            show this help message and exit
-db DATA_BACTERIA, --data_bacteria DATA_BACTERIA
                    PATH to a npz file containing the data corresponding to the k-mers profile for the bacteria database
-dn DATABASE_NAME, --database_name DATABASE_NAME
                    Name of the bacteria database used to name files
-v VALIDATION, --validation VALIDATION
                    PATH to a npz file containing the k-mers profile for the validation dataset
-t TEST, --test TEST  PATH to a npz file containing the k-mers profile for the test dataset
-model {sgd,mnb,lstm_attention,cnn,widecnn}, --model_type {sgd,mnb,lstm_attention,cnn,widecnn}
                    The type of model to train
-tx TAXA, --taxa TAXA
                    The taxonomic level to use for the classification, defaults to None. Can be one level or a list of levels separated by commas.
-bs BATCH_SIZE, --batch_size BATCH_SIZE
                    Batch size to use, defaults to 32
-e TRAINING_EPOCHS, --training_epochs TRAINING_EPOCHS
                    The number of training iterations for the neural network models if one is chosen, defaults to 100
-o OUTDIR, --outdir OUTDIR
                    PATH to a directory on disk where outputs will be saved
-wd WORKDIR, --workdir WORKDIR
                    Optional. Path to a working directory where Ray Tune will output and spill tuning data

Outputs

Produce outputs from the results of classified data by Caribou.

Script: Caribou_outputs.py

Arguments:

-h, --help            show this help message and exit
-db DATA_BACTERIA, --data_bacteria DATA_BACTERIA
                    PATH to a npz file containing the data corresponding to the k-mers profile for the bacteria database
-cd CLASSIFIED_DATA, --classified_data CLASSIFIED_DATA
                    PATH to a npz file containing the data classified by Caribou
-model {sgd,mnb,lstm_attention,cnn,widecnn}, --model_type {sgd,mnb,lstm_attention,cnn,widecnn}
                    The type of model used for classification
-dt DATASET_NAME, --dataset_name DATASET_NAME
                    Name of the classified dataset used to name files
-dh HOST_NAME, --host_name HOST_NAME
                    Name of the host database used to name files
-m, --mpa             Should the mpa-style output be generated?
-k, --kronagram       Should the interactive kronagram be generated?
-r, --report          Should the abundance report be generated?
-wd WORKDIR, --workdir WORKDIR
                    Optional. Path to a working directory where tuning data will be spilled
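
The `--mpa`, `--kronagram` and `--report` options take no value, so they behave as on/off switches. A minimal sketch of a wrapper that appends only the requested switches is shown below; the function name, paths, and dataset name are hypothetical, and nothing is executed.

```python
import shlex


def outputs_command(data_bacteria, classified_data, model_type, dataset_name,
                    mpa=False, kronagram=False, report=False):
    """Assemble a Caribou_outputs.py command line with the requested output switches."""
    cmd = [
        "python", "Caribou_outputs.py",
        "--data_bacteria", data_bacteria,
        "--classified_data", classified_data,
        "--model_type", model_type,
        "--dataset_name", dataset_name,
    ]
    # Switches carry no value: they are either present or absent.
    for flag, enabled in (("--mpa", mpa), ("--kronagram", kronagram), ("--report", report)):
        if enabled:
            cmd.append(flag)
    return cmd


# Request the mpa-style output and the abundance report, but no kronagram.
outputs_cmd = outputs_command(
    "out/bacteria.npz", "out/classified.npz", "cnn", "my_metagenome",
    mpa=True, report=True,
)
print(shlex.join(outputs_cmd))
```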

Decompose k-mers features

Fit a feature decomposition on a given k-mers dataset, then apply it.

Script: Caribou_dimensions_decomposition.py

Arguments:

-h, --help            show this help message and exit
-db DATASET, --dataset DATASET
                    PATH to a npz file containing the data corresponding to the k-mers profile for the bacteria database
-l KMERS_LIST, --kmers_list KMERS_LIST
                    PATH to a file containing a list of k-mers that will be reduced
-n NB_COMPONENTS, --nb_components NB_COMPONENTS
                    Number of components to decompose data into
-o OUTDIR, --outdir OUTDIR
                    PATH to a directory on disk where outputs will be saved
-wd WORKDIR, --workdir WORKDIR
                    Optional. Path to a working directory where tuning data will be spilled
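
As a sketch, a small helper can check `--nb_components` before the command is built. The requirement that it be a positive integer is an assumption based on what a component count usually means, not something stated above; the function name, paths, and the value 100 are placeholders.

```python
import shlex


def decomposition_command(dataset, kmers_list, nb_components, outdir):
    """Assemble a Caribou_dimensions_decomposition.py command line."""
    # Assumption: a component count only makes sense as a positive integer.
    if not isinstance(nb_components, int) or nb_components < 1:
        raise ValueError("nb_components must be a positive integer")
    return [
        "python", "Caribou_dimensions_decomposition.py",
        "--dataset", dataset,
        "--kmers_list", kmers_list,
        "--nb_components", str(nb_components),
        "--outdir", outdir,
    ]


decomposition_cmd = decomposition_command(
    "out/bacteria.npz", "out/kmers_list.txt", 100, "out/decomposed"
)
print(shlex.join(decomposition_cmd))
```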

Reduce k-mers features

Fit a feature reduction on a given k-mers dataset, then apply it.

Script: Caribou_reduce_features.py

Arguments:

-h, --help            show this help message and exit
-db DATASET, --dataset DATASET
                    PATH to a npz file containing the data corresponding to the k-mers profile for the bacteria database
-dt DATASET_NAME, --dataset_name DATASET_NAME
                    Name of the dataset used to name files
-l KMERS_LIST, --kmers_list KMERS_LIST
                    PATH to a file containing a list of k-mers that will be reduced
-t TAXA, --taxa TAXA  The taxonomic level to use for the classification, defaults to Phylum.
-o OUTDIR, --outdir OUTDIR
                    PATH to a directory on disk where outputs will be saved
-wd WORKDIR, --workdir WORKDIR
                    Optional. Path to a working directory where tuning data will be spilled
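
Short flags are not consistent across scripts: `-t` means `--taxa` here but `--test` in the cross-validation scripts, and `-dt` and `-dh` also change meaning between scripts. The sketch below therefore uses long option names only. Paths and the dataset name are placeholders, `Phylum` is the documented default, and the command is printed rather than executed.

```python
import shlex

# Long option names avoid confusing -t (--taxa here) with -t (--test elsewhere).
reduce_cmd = [
    "python", "Caribou_reduce_features.py",
    "--dataset", "out/bacteria.npz",
    "--dataset_name", "my_db",
    "--kmers_list", "out/kmers_list.txt",
    "--taxa", "Phylum",
    "--outdir", "out/reduced",
]
print(shlex.join(reduce_cmd))
```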