Simulate data

Nicolas de Montigny edited this page Dec 11, 2024 · 1 revision

Caribou implements a wrapper class and a script for using the InSilicoSeq package.

The script generates simulated reads from a collection of genomes in FASTA format. It can also produce validation and test subsets.

API reference

Simulate sequencing reads for validation and/or test dataset(s) from a whole-genome dataset

Script:

Caribou_simulate_test_val.py

Arguments:

-h, --help            show this help message and exit
-db DATASET, --dataset DATASET
                    PATH to an npz file containing the data corresponding to the k-mer profile for the bacteria database
-dt DATASET_NAME, --dataset_name DATASET_NAME
                    Name of the dataset used to name files
-dh HOSTSET, --hostset HOSTSET
                    Path to .npz data for the extracted k-mer profile of the host
-ds HOSTSET_NAME, --hostset_name HOSTSET_NAME
                    Name of the host database used to name files
-v, --validation      Flag argument for making a "validation"-named simulated dataset
-t, --test            Flag argument for making a "test"-named simulated dataset
-l KMERS_LIST, --kmers_list KMERS_LIST
                    Optional. PATH to a file containing a list of k-mers to be extracted after the simulation. Should be the same list as the one used for the reference database
-o OUTDIR, --outdir OUTDIR
                    Path to folder for outputting tuning results
-wd WORKDIR, --workdir WORKDIR
                    Optional. Path to a working directory where tuning data will be spilled
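
Example:

The sketch below shows one way the arguments above could be assembled into a command line from Python. The helper `build_simulate_command` is hypothetical (it is not part of Caribou), all paths and names are placeholders, and it assumes the script is reachable on the PATH. Treating `--dataset`, `--dataset_name` and `--outdir` as required is also an assumption, since the help text only marks `--kmers_list` and `--workdir` as optional.

```python
import shlex


def build_simulate_command(dataset, dataset_name, outdir, *, hostset=None,
                           hostset_name=None, validation=False, test=False,
                           kmers_list=None, workdir=None,
                           script="Caribou_simulate_test_val.py"):
    """Hypothetical helper: build an argv list using only the documented flags."""
    cmd = [script, "--dataset", str(dataset), "--dataset_name", dataset_name]
    if hostset is not None:
        cmd += ["--hostset", str(hostset)]
    if hostset_name is not None:
        cmd += ["--hostset_name", hostset_name]
    if validation:
        cmd.append("--validation")  # flag argument, takes no value
    if test:
        cmd.append("--test")  # flag argument, takes no value
    if kmers_list is not None:
        cmd += ["--kmers_list", str(kmers_list)]
    cmd += ["--outdir", str(outdir)]
    if workdir is not None:
        cmd += ["--workdir", str(workdir)]
    return cmd


if __name__ == "__main__":
    # Placeholder paths and names, for illustration only.
    cmd = build_simulate_command(
        "data/bacteria_kmers.npz", "bacteria", "results/simulation",
        validation=True, test=True,
    )
    # Print the command for inspection; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.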