Plain simple amplicon sequence simulator for in-silico genomic sequencing assays
TL;DR: no external requirements needed. Both the recursive GitHub clone as well as the bioconda package should work out-of-the-box.
🛠️ Details to build from source
The amplisim software is intended for 64-bit POSIX-compliant operating systems and was tested successfully under Ubuntu 22.04 LTS and macOS v12.5.1 (Monterey). Building amplisim from source requires the lzma, libbz2 and libcurl libraries on your system in order to compile htslib. Both Linux and macOS operating systems typically provide them via their respective package managers. See instructions below.

The easiest way to install amplisim is via the conda package manager from the bioconda channel. Please note that the conda installation is currently only available for Linux operating systems.
# create a new conda environment
conda create --name amplisim
# activate the environment so that the package is installed into it
conda activate amplisim
# install the latest amplisim version from the bioconda channel
conda install -c bioconda amplisim
git clone --recursive https://github.com/Krannich479/amplisim.git
cd amplisim
mkdir build
make -C lib/htslib
make
🍎 macOS system dependencies
If you are working on an Apple workstation with macOS and want to build amplisim from source, you might be missing the system libraries for openssl and argp. These can be installed using the brew package manager via
brew install glib-openssl argp-standalone
A quick and simple way to test your software binary is to download some public SARS-CoV-2 data and run amplisim on it.
mkdir testdata && cd testdata
wget https://raw.githubusercontent.com/artic-network/primer-schemes/master/nCoV-2019/V5.3.2/SARS-CoV-2.primer.bed
wget https://www.ebi.ac.uk/ena/browser/api/fasta/MN908947.3
sed 's/>ENA|MN908947|MN908947.3 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome./>MN908947.3/g' MN908947.3 > MN908947.3.fasta
cd ..
amplisim testdata/MN908947.3.fasta testdata/SARS-CoV-2.primer.bed
The most concise way to get familiar with amplisim is to inspect the help page via amplisim --help. This will display:
Usage: amplisim [OPTION...] REFERENCE PRIMERS
amplisim -- a program to simulate amplicon sequences from a reference genome
  -m, --mean=INT             Set the mean number of replicates per amplicon
  -n, --sd=INT               Set the standard deviation for the mean number of
                             replicates per amplicon
  -o, --output=FILE          Output to FILE instead of standard output
  -s, --seed=INT             Set a random seed
  -x, --dropout=INT          Set the likelihood for an amplicon dropout [0,1]
  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version
Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.
Report bugs to https://github.com/rki-mf1/amplisim/issues.
The minimal command to run amplisim provides a reference genome in FASTA format and a set of primers in BED format (see chapter Input and output for more details). By default, amplisim prints the amplicon sequences to standard output, so the user can either redirect them to a file or pipe them into the next program.
amplisim <my_reference.fasta> <my_primers.bed> > <my_amplicons.fasta>
If you want amplisim to store the resulting amplicon sequences directly in a FASTA file, you can use the -o option.
amplisim -o <my_amplicons.fasta> <my_reference.fasta> <my_primers.bed>
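When amplisim is called from a script, it can help to assemble the command line in one place. The following is a minimal sketch, not part of amplisim itself: the helper name build_amplisim_command and its parameter names are hypothetical, and only the long options listed on the help page above are used. Option values are passed through unchanged, so the caller is responsible for supplying sensible values.

```python
from typing import List, Optional


def build_amplisim_command(
    reference: str,
    primers: str,
    output: Optional[str] = None,
    mean: Optional[int] = None,
    sd: Optional[int] = None,
    seed: Optional[int] = None,
    dropout: Optional[float] = None,
) -> List[str]:
    """Assemble an argument list for amplisim (hypothetical helper, not part of amplisim)."""
    command = ["amplisim"]
    # Only options that were explicitly set are added, in a fixed order.
    for flag, value in (
        ("--output", output),
        ("--mean", mean),
        ("--sd", sd),
        ("--seed", seed),
        ("--dropout", dropout),
    ):
        if value is not None:
            command.append(f"{flag}={value}")
    # The two positional arguments come last: REFERENCE, then PRIMERS.
    command += [reference, primers]
    return command


# Example use (requires amplisim on the PATH, so it is left commented out):
# import subprocess
# subprocess.run(build_amplisim_command("ref.fasta", "primers.bed", seed=42), check=True)
```

Passing the result to subprocess.run as a list avoids shell quoting issues with file names that contain spaces.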
The PRIMERS input file is a plain tab-separated text file with pre-defined columns.
The format of the PRIMERS file required by amplisim has to comply with the following properties:
- The BED format specification. That is, the first column is a chromosome identifier, and the second and third columns are the boundary indexes of a range in the chromosome. The second column is the start index of a primer and the third column is the end index of a primer. The start index should always be strictly smaller than the end index.
- A pair of primers (forward and reverse primer) is expected to be in consecutive lines in the file.
- The chromosome identifiers have to be arranged in blocks. I.e. irrespective of the order of the chromosomes, all primers of a particular chromosome have to occur consecutively in the file.
These format properties generally comply with the definitions in samtools but are slightly more stringent, as amplisim currently does not allow alternative primers in a pair. Ready-to-use examples can be found in the artic-network repository for virus primer schemes, e.g. the primers for SARS-CoV-2.
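The properties above can be checked before running amplisim. The sketch below is an illustration only and not the validation that amplisim performs internally: the function name check_primer_bed is hypothetical, and it assumes that "consecutive lines" means primers are paired as lines 1 and 2, 3 and 4, and so on within each chromosome block, so an odd number of primers in a block is reported as an incomplete pair. Blank lines are skipped, and lines that cannot be parsed are reported but not counted.

```python
from typing import Iterable, List


def check_primer_bed(lines: Iterable[str]) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems: List[str] = []
    seen_chroms = set()    # chromosomes whose block has started at some point
    current_chrom = None   # chromosome of the block currently being read
    primers_in_block = 0   # valid primer lines counted in the current block

    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            problems.append(f"line {number}: expected at least 3 tab-separated columns")
            continue
        chrom = fields[0]
        try:
            start, end = int(fields[1]), int(fields[2])
        except ValueError:
            problems.append(f"line {number}: start and end must be integers")
            continue
        if start >= end:
            problems.append(f"line {number}: start {start} is not smaller than end {end}")

        if chrom != current_chrom:
            # A new block begins: close the previous one and check its pairing.
            if current_chrom is not None and primers_in_block % 2 != 0:
                problems.append(f"chromosome {current_chrom}: odd number of primers")
            if chrom in seen_chroms:
                problems.append(f"line {number}: chromosome {chrom} reappears outside its block")
            seen_chroms.add(chrom)
            current_chrom = chrom
            primers_in_block = 0
        primers_in_block += 1

    # Check the pairing of the final block as well.
    if current_chrom is not None and primers_in_block % 2 != 0:
        problems.append(f"chromosome {current_chrom}: odd number of primers")
    return problems
```

A typical use would be to open the primer file, pass the file object to check_primer_bed, and print any reported problems before starting a simulation.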
The REFERENCE input file is a standard text file in FASTA format which contains one or multiple records (chromosomes).
The output of amplisim is a stream or plain text file in FASTA format.
The header line of each amplicon sequence provides the following information:
>amplicon_<amplicon_index>_<replicate_index>
where <amplicon_index> is the i-th index (i=0...n-1) of the amplicons defined by n primer pairs and <replicate_index> is a unique index across all replicates of all amplicons. See schematic below.
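The header scheme makes it straightforward to recover how many replicates were generated for each amplicon. The following is a minimal sketch that assumes headers follow exactly the pattern described above; the function names parse_amplicon_header and replicates_per_amplicon are hypothetical and not part of amplisim.

```python
from collections import Counter
from typing import Iterable, Tuple


def parse_amplicon_header(header: str) -> Tuple[int, int]:
    """Split '>amplicon_<amplicon_index>_<replicate_index>' into two integers."""
    name = header.strip().lstrip(">")
    # Split from the right so that the two trailing fields are the indexes.
    prefix, amplicon_index, replicate_index = name.rsplit("_", 2)
    if prefix != "amplicon":
        raise ValueError(f"unexpected header: {header!r}")
    return int(amplicon_index), int(replicate_index)


def replicates_per_amplicon(fasta_lines: Iterable[str]) -> Counter:
    """Count the FASTA records per amplicon index."""
    counts: Counter = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            amplicon_index, _ = parse_amplicon_header(line)
            counts[amplicon_index] += 1
    return counts
```

Applied to an open amplisim output file, replicates_per_amplicon returns a mapping from each amplicon index to its number of replicates; amplicon indexes that do not appear in the output are simply absent from the mapping.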
For questions about amplisim, feature requests and bug reports please refer to the issues section of this repository.