API reference

Usage descriptions for each script.

For a description of each step, see the analysis description section of this wiki.

Whole pipeline

Run the entire Caribou analysis pipeline.

Script: Caribou_pipeline.py

Arguments:

-h, --help            show this help message and exit
-c CONFIG, --config CONFIG
                    PATH to a configuration file containing the choices made by the user. Please refer to the wiki for further details : https://github.com/bioinfoUQAM/Caribou/wiki
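
As a minimal sketch, the pipeline can be launched from Python by assembling the command documented above. The configuration file name `caribou_config.ini` is a hypothetical placeholder (this page does not describe the file format), and the sketch assumes the script sits in the current directory and is run through the current Python interpreter. It only prints the command rather than executing it.

```python
import shlex
import sys


def pipeline_command(config_path, script="Caribou_pipeline.py"):
    """Build the argument list for the whole-pipeline script."""
    # --config is the only documented option besides --help.
    return [sys.executable, script, "--config", config_path]


# "caribou_config.ini" is a hypothetical file name used for illustration.
cmd = pipeline_command("caribou_config.ini")
print(shlex.join(cmd))
# To actually run the pipeline, pass the list to subprocess.run(cmd, check=True).
```

Keeping the command as a list avoids shell quoting problems when the configuration path contains spaces.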

K-mers representation of genetic sequences

Extract the k-mers profile of a given dataset and save it to disk.

Script: Caribou_kmers.py

Arguments:

-h, --help            show this help message and exit
-s SEQ_FILE, --seq_file SEQ_FILE
                    PATH to a fasta file containing bacterial genomes to build k-mers from or a folder containing fasta files with one sequence per file
-c CLS_FILE, --cls_file CLS_FILE
                    PATH to a csv file containing classes of the corresponding fasta
-dt DATASET_NAME, --dataset_name DATASET_NAME
                    Name of the dataset used to name files
-sh SEQ_FILE_HOST, --seq_file_host SEQ_FILE_HOST
                    PATH to a fasta file containing host genomes to build k-mers from or a folder containing fasta files with one sequence per file
-ch CLS_FILE_HOST, --cls_file_host CLS_FILE_HOST
                    PATH to a csv file containing classes of the corresponding host fasta
-dh HOST_NAME, --host_name HOST_NAME
                    Name of the host used to name files
-k K_LENGTH, --k_length K_LENGTH
                    Length of k-mers to extract
-l KMERS_LIST, --kmers_list KMERS_LIST
                    PATH to a file containing a list of k-mers to be extracted if the dataset is not a training database
-o OUTDIR, --outdir OUTDIR
                    PATH to a directory on file where outputs will be saved
-wd WORKDIR, --workdir WORKDIR
                    Optional. Path to a working directory where tuning data will be spilled
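
The options above can be combined as in the following sketch, which builds a `Caribou_kmers.py` call and adds the host options only when a host dataset is supplied. This page does not state which options are required, so the choice of mandatory parameters, the all-or-nothing handling of the host options, the file names and the k-mer length of 20 are all assumptions made for illustration.

```python
import shlex


def kmers_command(seq_file, cls_file, dataset_name, k_length, outdir,
                  seq_file_host=None, cls_file_host=None, host_name=None):
    """Assemble a Caribou_kmers.py call from the documented long options."""
    cmd = [
        "Caribou_kmers.py",
        "--seq_file", seq_file,
        "--cls_file", cls_file,
        "--dataset_name", dataset_name,
        "--k_length", str(k_length),
        "--outdir", outdir,
    ]
    # Assumption of this sketch: host options are passed only as a complete group.
    host = {
        "--seq_file_host": seq_file_host,
        "--cls_file_host": cls_file_host,
        "--host_name": host_name,
    }
    if all(value is not None for value in host.values()):
        for flag, value in host.items():
            cmd += [flag, value]
    return cmd


# Hypothetical file names and k-mer length.
cmd = kmers_command("bacteria.fasta", "bacteria_classes.csv", "bacteria_db", 20, "results",
                    seq_file_host="host.fasta", cls_file_host="host_classes.csv",
                    host_name="host")
print(shlex.join(cmd))
```

The sketch uses long options throughout because short flags are reused with different meanings across scripts: `-dh` is `--host_name` here but `--data_host` in Caribou_extraction.py.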

Bacterial sequences identification and host sequences exclusion

Train a model and extract bacteria / host sequences.

Script: Caribou_extraction.py

Arguments:

-h, --help            show this help message and exit
-db DATA_BACTERIA, --data_bacteria DATA_BACTERIA
                    PATH to a npz file containing the data corresponding to the k-mers profile for the bacteria database
-dh DATA_HOST, --data_host DATA_HOST
                    PATH to a npz file containing the data corresponding to the k-mers profile for the host
-dn DATABASE_NAME, --database_name DATABASE_NAME
                    Name of the bacteria database used to name files
-hn HOST_NAME, --host_name HOST_NAME
                    Name of the host database used to name files
-dm DATA_METAGENOME, --data_metagenome DATA_METAGENOME
                    PATH to a npz file containing the data corresponding to the k-mers profile for the metagenome to classify
-mn METAGENOME_NAME, --metagenome_name METAGENOME_NAME
                    Name of the metagenome to classify used to name files
-m MERGED, --merged MERGED
                    PATH to a npz file containing the k-mers profile for the merged bacteria and host databases
-v VALIDATION, --validation VALIDATION
                    PATH to a npz file containing the k-mers profile for the validation dataset
-model {None,onesvm,linearsvm,attention,lstm,deeplstm}, --model_type {None,onesvm,linearsvm,attention,lstm,deeplstm}
                    The type of model to train
-bs BATCH_SIZE, --batch_size BATCH_SIZE
                    Batch size to use, defaults to 32
-e TRAINING_EPOCHS, --training_epochs TRAINING_EPOCHS
                    The number of training iterations for the neural network models if one is chosen, defaults to 100
-o OUTDIR, --outdir OUTDIR
                    PATH to a directory on file where outputs will be saved
-wd WORKDIR, --workdir WORKDIR
                    Optional. Path to a working directory where Ray Tune will output and spill tuning data
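
Because `--model_type` accepts only a fixed set of names, it can help to check the name before submitting a long job. The sketch below does that and then builds the call. The defaults of 32 and 100 come from the option descriptions above. The file and dataset names, and the decision to treat these particular options as the ones to pass, are assumptions.

```python
import shlex

# Model names copied from the documented choices. The "None" choice is left out
# because its behaviour is not described in this reference.
EXTRACTION_MODELS = ("onesvm", "linearsvm", "attention", "lstm", "deeplstm")


def extraction_command(data_bacteria, database_name, data_metagenome,
                       metagenome_name, model_type, outdir,
                       batch_size=32, training_epochs=100):
    """Build a Caribou_extraction.py call after checking the model name."""
    if model_type not in EXTRACTION_MODELS:
        raise ValueError(
            f"unknown model_type {model_type!r}; expected one of {EXTRACTION_MODELS}"
        )
    return [
        "Caribou_extraction.py",
        "--data_bacteria", data_bacteria,
        "--database_name", database_name,
        "--data_metagenome", data_metagenome,
        "--metagenome_name", metagenome_name,
        "--model_type", model_type,
        "--batch_size", str(batch_size),
        "--training_epochs", str(training_epochs),
        "--outdir", outdir,
    ]


# All file and dataset names below are hypothetical placeholders.
cmd = extraction_command("bacteria_kmers.npz", "bacteria_db", "sample_kmers.npz",
                         "sample", "linearsvm", "results")
print(shlex.join(cmd))
```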

Bacterial sequences identification and host sequences exclusion with cross-validation

Train and cross-validate a model for the bacteria extraction / host removal step.

Script: Caribou_extraction_train_cv.py

Arguments:

-h, --help            show this help message and exit
-db DATA_BACTERIA, --data_bacteria DATA_BACTERIA
                    PATH to a npz file containing the data corresponding to the k-mers profile for the bacteria database
-dh DATA_HOST, --data_host DATA_HOST
                    PATH to a npz file containing the data corresponding to the k-mers profile for the host
-dn DATABASE_NAME, --database_name DATABASE_NAME
                    Name of the bacteria database used to name files
-hn HOST_NAME, --host_name HOST_NAME
                    Name of the host database used to name files
-m MERGED, --merged MERGED
                    PATH to a npz file containing the k-mers profile for the merged bacteria and host databases
-v VALIDATION, --validation VALIDATION
                    PATH to a npz file containing the k-mers profile for the validation dataset
-t TEST, --test TEST  PATH to a npz file containing the k-mers profile for the test dataset
-model {onesvm,linearsvm,attention,lstm,deeplstm}, --model_type {onesvm,linearsvm,attention,lstm,deeplstm}
                    The type of model to train
-bs BATCH_SIZE, --batch_size BATCH_SIZE
                    Batch size to use, defaults to 32
-e TRAINING_EPOCHS, --training_epochs TRAINING_EPOCHS
                    The number of training iterations for the neural network models if one is chosen, defaults to 100
-o OUTDIR, --outdir OUTDIR
                    PATH to a directory on file where outputs will be saved
-wd WORKDIR, --workdir WORKDIR
                    Optional. Path to a working directory where Ray Tune will output and spill tuning data
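
To check an argument list before launching a cross-validation run, the options listed above can be mirrored in a small stand-alone parser. This is an illustrative reconstruction from the help text, not Caribou's own parser: the `int` types and the absence of required options are assumptions, while the defaults of 32 and 100 are taken from the descriptions.

```python
import argparse

CV_MODELS = ["onesvm", "linearsvm", "attention", "lstm", "deeplstm"]


def build_cv_parser():
    """Illustrative parser mirroring the documented options (not Caribou's code)."""
    parser = argparse.ArgumentParser(prog="Caribou_extraction_train_cv.py")
    parser.add_argument("-db", "--data_bacteria")
    parser.add_argument("-dh", "--data_host")
    parser.add_argument("-dn", "--database_name")
    parser.add_argument("-hn", "--host_name")
    parser.add_argument("-m", "--merged")
    parser.add_argument("-v", "--validation")
    parser.add_argument("-t", "--test")
    parser.add_argument("-model", "--model_type", choices=CV_MODELS)
    # Assumed to be integers; the help text only gives the default values.
    parser.add_argument("-bs", "--batch_size", type=int, default=32)
    parser.add_argument("-e", "--training_epochs", type=int, default=100)
    parser.add_argument("-o", "--outdir")
    parser.add_argument("-wd", "--workdir")
    return parser


# Hypothetical file names; only the parsing is demonstrated here.
args = build_cv_parser().parse_args(
    ["--merged", "merged_kmers.npz", "--test", "test_kmers.npz", "--model_type", "lstm"]
)
print(args.model_type, args.batch_size, args.training_epochs)
```

Compared with Caribou_extraction.py, this script has no metagenome options, adds `--test`, and does not list `None` among the model choices.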

Top-down bacterial sequences classification

Train a model and classify bacteria sequences iteratively over known taxonomic ranks.

Script: Caribou_classification.py

Arguments:

-h, --help            show this help message and exit
-db DATA_BACTERIA, --data_bacteria DATA_BACTERIA
                    PATH to a npz file containing the data corresponding to the k-mers profile for the bacteria database
-dt DATABASE_NAME, --database_name DATABASE_NAME
                    Name of the bacteria database used to name files
-mg DATA_METAGENOME, --data_metagenome DATA_METAGENOME
                    PATH to a npz file containing the data corresponding to the k-mers profile for the metagenome to classify
-mn METAGENOME_NAME, --metagenome_name METAGENOME_NAME
                    Name of the metagenome to classify used to name files
-v VALIDATION, --validation VALIDATION
                    PATH to a npz file containing the k-mers profile for the validation dataset
-model {sgd,mnb,lstm_attention,cnn,widecnn}, --model_type {sgd,mnb,lstm_attention,cnn,widecnn}
                    The type of model to train
-tx TAXA, --taxa TAXA
                    The taxonomic level to use for the classification, defaults to species. Can be one level or a list of levels separated by commas.
-bs BATCH_SIZE, --batch_size BATCH_SIZE
                    Batch size to use, defaults to 32
-e TRAINING_EPOCHS, --training_epochs TRAINING_EPOCHS
                    The number of training iterations for the neural network models if one is chosen, defaults to 100
-o OUTDIR, --outdir OUTDIR
                    PATH to a directory on file where outputs will be saved
-wd WORKDIR, --workdir WORKDIR
                    Optional. Path to a working directory where Ray Tune will output and spill tuning data
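
Since `--taxa` takes either one level or several levels separated by commas, a list of levels has to be joined into a single argument. The helper below is a minimal sketch of that step. The helper name is hypothetical, and the exact spelling and capitalisation of level names accepted by the script are an assumption (this page shows both `species` and `Phylum`).

```python
def taxa_argument(levels):
    """Join taxonomic levels into the comma-separated form described for --taxa."""
    cleaned = [level.strip() for level in levels if level.strip()]
    if not cleaned:
        raise ValueError("at least one taxonomic level is required")
    return ",".join(cleaned)


# Level names are illustrative; check the accepted spelling against your data.
taxa = taxa_argument(["phylum", " species "])
print(taxa)  # phylum,species

# The joined value is passed as a single argument after --taxa.
cmd_fragment = ["--model_type", "cnn", "--taxa", taxa]
print(cmd_fragment)
```

Passing the levels as one argument without spaces avoids the shell splitting them into separate tokens.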

Top-down bacterial sequences classification with cross-validation

Train and cross-validate a model for the bacteria classification step.

Script: Caribou_classification_train_cv.py

Arguments:

-h, --help            show this help message and exit
-db DATA_BACTERIA, --data_bacteria DATA_BACTERIA
                    PATH to a npz file containing the data corresponding to the k-mers profile for the bacteria database
-dn DATABASE_NAME, --database_name DATABASE_NAME
                    Name of the bacteria database used to name files
-v VALIDATION, --validation VALIDATION
                    PATH to a npz file containing the k-mers profile for the validation dataset
-t TEST, --test TEST  PATH to a npz file containing the k-mers profile for the test dataset
-model {sgd,mnb,lstm_attention,cnn,widecnn}, --model_type {sgd,mnb,lstm_attention,cnn,widecnn}
                    The type of model to train
-tx TAXA, --taxa TAXA
                    The taxonomic level to use for the classification, defaults to None. Can be one level or a list of levels separated by commas.
-bs BATCH_SIZE, --batch_size BATCH_SIZE
                    Batch size to use, defaults to 32
-e TRAINING_EPOCHS, --training_epochs TRAINING_EPOCHS
                    The number of training iterations for the neural network models if one is chosen, defaults to 100
-o OUTDIR, --outdir OUTDIR
                    PATH to a directory on file where outputs will be saved
-wd WORKDIR, --workdir WORKDIR
                    Optional. Path to a working directory where Ray Tune will output and spill tuning data

Outputs

Produce outputs from the results of classified data by Caribou.

Script: Caribou_outputs.py

Arguments:

-h, --help            show this help message and exit
-db DATA_BACTERIA, --data_bacteria DATA_BACTERIA
                    PATH to a npz file containing the data corresponding to the k-mers profile for the bacteria database
-cd CLASSIFIED_DATA, --classified_data CLASSIFIED_DATA
                    PATH to a npz file containing the data classified by Caribou
-model {sgd,mnb,lstm_attention,cnn,widecnn}, --model_type {sgd,mnb,lstm_attention,cnn,widecnn}
                    The type of model used for classification
-dt DATASET_NAME, --dataset_name DATASET_NAME
                    Name of the classified dataset used to name files
-dh HOST_NAME, --host_name HOST_NAME
                    Name of the host database used to name files
-m, --mpa             Should the mpa-style output be generated?
-k, --kronagram       Should the interactive kronagram be generated?
-r, --report          Should the abundance report be generated?
-wd WORKDIR, --workdir WORKDIR
                    Optional. Path to a working directory where tuning data will be spilled
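
The `--mpa`, `--kronagram` and `--report` options are listed without a value placeholder, which suggests they are plain switches. Under that assumption, the sketch below builds a `Caribou_outputs.py` call in which each requested output adds its switch and nothing else. File and dataset names are hypothetical.

```python
import shlex


def outputs_command(data_bacteria, classified_data, model_type, dataset_name,
                    host_name=None, mpa=False, kronagram=False, report=False):
    """Build a Caribou_outputs.py call; output switches are added without values."""
    cmd = [
        "Caribou_outputs.py",
        "--data_bacteria", data_bacteria,
        "--classified_data", classified_data,
        "--model_type", model_type,
        "--dataset_name", dataset_name,
    ]
    if host_name is not None:
        cmd += ["--host_name", host_name]
    # Assumed to be value-less switches, based on the help text above.
    for flag, enabled in (("--mpa", mpa), ("--kronagram", kronagram), ("--report", report)):
        if enabled:
            cmd.append(flag)
    return cmd


# Hypothetical names; requests the mpa-style output and the abundance report.
cmd = outputs_command("bacteria_kmers.npz", "classified.npz", "cnn", "sample",
                      mpa=True, report=True)
print(shlex.join(cmd))
```

Note that this script lists no `--outdir` option, so the sketch does not pass one.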

Decompose k-mers features

Fit a feature decomposition on a given k-mers dataset, then apply it to that dataset.

Script: Caribou_dimensions_decomposition.py

Arguments:

-h, --help            show this help message and exit
-db DATASET, --dataset DATASET
                    PATH to a npz file containing the data corresponding to the k-mers profile for the bacteria database
-l KMERS_LIST, --kmers_list KMERS_LIST
                    PATH to a file containing a list of k-mers that will be reduced
-n NB_COMPONENTS, --nb_components NB_COMPONENTS
                    Number of components to decompose data into
-o OUTDIR, --outdir OUTDIR
                    PATH to a directory on file where outputs will be saved
-wd WORKDIR, --workdir WORKDIR
                    Optional. Path to a working directory where tuning data will be spilled
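
A decomposition call can be sketched as follows. The check that the number of components is a positive integer is an assumption of this sketch rather than documented behaviour, and the file names and the value 50 are placeholders. `--workdir` is added only when given, since it is marked optional.

```python
import shlex


def decomposition_command(dataset, kmers_list, nb_components, outdir, workdir=None):
    """Build a Caribou_dimensions_decomposition.py call from the documented options."""
    # Assumption: a component count below 1 is never meaningful.
    if not isinstance(nb_components, int) or nb_components < 1:
        raise ValueError("nb_components must be a positive integer")
    cmd = [
        "Caribou_dimensions_decomposition.py",
        "--dataset", dataset,
        "--kmers_list", kmers_list,
        "--nb_components", str(nb_components),
        "--outdir", outdir,
    ]
    if workdir is not None:
        cmd += ["--workdir", workdir]
    return cmd


# Hypothetical file names and component count.
cmd = decomposition_command("bacteria_kmers.npz", "kmers_list.txt", 50, "results")
print(shlex.join(cmd))
```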

Reduce k-mers features

Fit a feature reduction on a given k-mers dataset, then apply it to that dataset.

Script: Caribou_reduce_features.py

Arguments:

-h, --help            show this help message and exit
-db DATASET, --dataset DATASET
                    PATH to a npz file containing the data corresponding to the k-mers profile for the bacteria database
-dt DATASET_NAME, --dataset_name DATASET_NAME
                    Name of the dataset used to name files
-l KMERS_LIST, --kmers_list KMERS_LIST
                    PATH to a file containing a list of k-mers that will be reduced
-t TAXA, --taxa TAXA  The taxonomic level to use for the classification, defaults to Phylum.
-o OUTDIR, --outdir OUTDIR
                    PATH to a directory on file where outputs will be saved
-wd WORKDIR, --workdir WORKDIR
                    Optional. Path to a working directory where tuning data will be spilled
