A pipeline for variant calling on bacterial genomes, built with Nextflow and Singularity/Docker.
This pipeline is currently under development. If you wish to use it in future, please feel free to watch the repository.
nextflow run uct-cbio/bacterial_variant_calling --reads sample_sheet.csv --genome H37Rv.fa -with-docker bacterial_env
This assumes you have a sample sheet formatted as described below, and a Docker image named 'bacterial_env' built from the included VarDock recipe.
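For example, the image could be built from the repository root with a command along these lines (a sketch only, assuming the VarDock recipe can be passed to docker build as a Dockerfile; adjust the path if it lives elsewhere):

# Build the Docker image and tag it with the name the pipeline expects
docker build -t bacterial_env -f VarDock .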
The typical command for running the pipeline is as follows:
nextflow run uct-cbio/bacterial_variant_calling --reads sample_sheet.csv --genome refgenome.fa -profile uct_hex
Mandatory arguments:
--reads Path to input data (must be surrounded with quotes)
--genome Path to the reference genome against which the reads will be aligned (in FASTA format).
-profile Hardware config to use. Currently the only profile available is for UCT's HPC ('uct_hex') - create your own if necessary
Other arguments:
--outdir The output directory where the results will be saved
--SRAdir The directory where reads downloaded from the SRA will be stored
--email Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits
-name Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic
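For instance, a run that also sets the optional arguments might look like this (the paths, e-mail address and run name are placeholders):

nextflow run uct-cbio/bacterial_variant_calling --reads sample_sheet.csv --genome refgenome.fa -profile uct_hex --outdir /path/to/results --SRAdir /path/to/SRA/dir --email you@example.com -name my_first_run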
Example run: To run on UCT hex
- Start a 'screen' session from the headnode
- Start an interactive job using: qsub -I -q UCTlong -l nodes=1:series600:ppn=1 -d `pwd`
- A typical command would look something like:
nextflow run uct-cbio/bacterial_variant_calling --reads sample_sheet.csv --genome refgenome.fa -profile uct_hex --SRAdir /path/to/writable/dir/
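Put together, a session on hex might look roughly like this (a sketch: the screen session name is arbitrary and the paths are placeholders):

screen -S variant_calling
qsub -I -q UCTlong -l nodes=1:series600:ppn=1 -d `pwd`
nextflow run uct-cbio/bacterial_variant_calling --reads sample_sheet.csv --genome refgenome.fa -profile uct_hex --SRAdir /path/to/writable/dir/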
If you are using reads from the SRA, these will be downloaded using the SRA toolkit and deposited in the specified --SRAdir. Please make sure that this directory is writable.
To allow for both local reads and reads from the SRA to be used, the pipeline can pull reads from the SRA based on the accession number (e.g. SRR5989977).
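The download is handled by the pipeline itself; for reference, fetching an accession manually with the SRA toolkit would look roughly like this (illustrative only, the pipeline's exact invocation may differ):

prefetch SRR5989977 --output-directory /path/to/SRAdir
fasterq-dump SRR5989977 --outdir /path/to/SRAdir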
The sample sheet must contain the columns shown below, and the 'number' column must contain a unique value for each sample:
number | origin | replicate | isolate | R1 | R2 |
---|---|---|---|---|---|
1 | genomic | 1 | wgs_sample_1 | path/to/reads/reads_R1.fq | path/to/reads/reads_R2.fq |
2 | genomic | 2 | wgs_sample_1 | path/to/reads/reads_R1.fq | path/to/reads/reads_R2.fq |
3 | genomic | 3 | wgs_sample_1 | path/to/reads/reads_R1.fq | path/to/reads/reads_R2.fq |
4 | genomic | 1 | wgs_sample_2 | path/to/reads/reads_R1.fq | path/to/reads/reads_R2.fq |
5 | genomic | 2 | wgs_sample_2 | path/to/reads/reads_R1.fq | path/to/reads/reads_R2.fq |
6 | genomic | 3 | wgs_sample_2 | path/to/reads/reads_R1.fq | path/to/reads/reads_R2.fq |
7 | genomic | 1 | wgs_sample_3 | path/to/reads/reads_R1.fq | path/to/reads/reads_R2.fq |
8 | genomic | 2 | wgs_sample_3 | path/to/reads/reads_R1.fq | path/to/reads/reads_R2.fq |
9 | genomic | 3 | wgs_sample_3 | path/to/reads/reads_R1.fq | path/to/reads/reads_R2.fq |
10 | genomic | 1 | H37Rv | SRR5989977 | |
In the above example, samples 1-9 are stored locally, while sample 10 is a control sample from the SRA. Including the accession number in the R1 column will result in the reads being downloaded from the SRA and used in the analysis. The sample sheet must be exported as a CSV file, with a comma ',' separating the columns:
number,origin,replicate,isolate,R1,R2
1,genomic,1,wgs_sample_1,path/to/reads/reads_R1.fq,path/to/reads/reads_R2.fq
2,genomic,2,wgs_sample_1,path/to/reads/reads_R1.fq,path/to/reads/reads_R2.fq
...
10,genomic,1,H37Rv,SRR5989977
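A quick sanity check on the sheet before launching can save a failed run; for example, this one-liner (a hypothetical helper, not part of the pipeline) flags duplicate values in the 'number' column:

# Report any sample sheet rows whose 'number' value has already been seen
awk -F',' 'NR > 1 && seen[$1]++ { print "Duplicate number on line " NR ": " $1 }' sample_sheet.csv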
Dependencies: Nextflow and Docker. All other dependencies are provided by the included Docker recipe (VarDock).
Note: if you are working on UCT hex you can simply use the Singularity image specified in the uct_hex profile.
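If Nextflow itself is not already available on your system, it can be installed with the standard command from nextflow.io (requires Java):

curl -s https://get.nextflow.io | bash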
Read mapping: BWA
Variant caller used: FreeBayes
Phylogenetic analysis: RAxML
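For orientation, the mapping and calling steps of a pipeline like this boil down to standard BWA-MEM and FreeBayes usage, roughly as follows (an illustration only, not the pipeline's exact commands):

# Index the reference, align a read pair, sort and index the alignment, then call variants
bwa index refgenome.fa
bwa mem refgenome.fa reads_R1.fq reads_R2.fq | samtools sort -o sample.sorted.bam
samtools index sample.sorted.bam
freebayes -f refgenome.fa sample.sorted.bam > sample.vcf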
To create a Singularity image from a Docker image, please make use of a Docker-to-Singularity conversion tool such as docker2singularity. This is needed to run the pipeline on the UCT cluster.
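As a sketch, assuming the image built earlier is named 'bacterial_env' and the current directory is where the converted image should land, the docker2singularity helper image can be used like this:

docker run -v /var/run/docker.sock:/var/run/docker.sock -v $(pwd):/output --privileged -t --rm quay.io/singularity/docker2singularity bacterial_env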
GFF vs GTF format
Some tools require GFF and others GTF, and converting between the two formats is often impossible due to poor adoption of the standards. If the run fails, it is a format issue 90% of the time. The annotation files from NCBI are often the only ones that will work.
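Where a conversion is nonetheless worth attempting, a tool such as gffread (not bundled with this pipeline) can translate between the formats, though the result depends on the quality of the source annotation:

# Convert a GFF3 annotation to GTF; the file names here are placeholders
gffread annotation.gff -T -o annotation.gtf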
Naming of fastq files
This pipeline was developed by members of the Bioinformatics Support Team (BST) at the University of Cape Town. Dr. Jon Ambler, a member of CIDRI-Africa, is the main developer of this pipeline, using the layout and documentation outlined by Dr. Katie Lennard and Gerrit Botha. The pipeline is adapted from the nf-core/rnaseq pipeline.
Additional thanks to Paolo Di Tommaso, the developer of NextFlow, for their help troubleshooting.
This project is licensed under the MIT License - see the LICENSE file for details