A Nextflow pipeline for running mlst on a set of assemblies.
flowchart TD
assembly --> quast(quast)
quast --> assembly_qc
assembly --> mlst(mlst)
mlst --> mlst.json
mlst --> parse_alleles(parse_alleles)
parse_alleles --> alleles.csv
parse_alleles --> sequence_type.csv
nextflow run BCCDC-PHL/mlst-nf \
--assembly_input </path/to/assemblies> \
--outdir </path/to/outdir>
The pipeline also supports a 'samplesheet input' mode. Pass a samplesheet.csv file with the headers ID, ASSEMBLY:
nextflow run BCCDC-PHL/mlst-nf \
--samplesheet_input </path/to/samplesheet.csv> \
--outdir </path/to/outdir>
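If the assemblies already sit in one directory, a samplesheet with the ID and ASSEMBLY headers can be generated rather than written by hand. The following is a minimal sketch, not part of the pipeline: the function name `write_samplesheet`, the list of FASTA extensions, and the choice to derive the sample ID from the filename are all assumptions made for illustration.

```python
import csv
from pathlib import Path

# Assumption: assemblies are FASTA files with one of these extensions.
FASTA_SUFFIXES = {".fa", ".fasta", ".fna"}


def write_samplesheet(assembly_dir, samplesheet_path):
    """Write a samplesheet with ID and ASSEMBLY columns; return the row count."""
    rows = []
    for path in sorted(Path(assembly_dir).iterdir()):
        if path.is_file() and path.suffix in FASTA_SUFFIXES:
            # Assumption: the sample ID is the filename without its extension.
            rows.append({"ID": path.stem, "ASSEMBLY": str(path.resolve())})

    with open(samplesheet_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["ID", "ASSEMBLY"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Absolute paths are written here as a precaution, so that the samplesheet does not depend on the directory the pipeline is launched from.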
Outputs for each sample will be written to a separate directory under the output directory, named using the sample ID.
The following output files are produced for each sample.
sample-01
├── sample-01_20211202154752_provenance.yml
├── sample-01_alleles.csv
├── sample-01_mlst.json
└── sample-01_sequence_type.csv
The mlst.json output is generated directly by the mlst tool. It has the following format:
[
   {
      "scheme" : "sepidermidis",
      "alleles" : {
         "mutS" : "1",
         "yqiL" : "1",
         "tpiA" : "1",
         "pyrR" : "2",
         "gtr" : "2",
         "aroE" : "1",
         "arcC" : "16"
      },
      "sequence_type" : "184",
      "filename" : "test/example.gbk.gz",
      "id" : "test/example.gbk.gz"
   }
]
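The JSON output can be read with any standard JSON parser. As a rough sketch, the snippet below loads a file with the structure shown above and returns the scheme, sequence type, and allele calls for each record. The function name `load_mlst_calls` is hypothetical; the keys are taken from the example.

```python
import json


def load_mlst_calls(json_path):
    """Return a list of (scheme, sequence_type, alleles) tuples from an mlst.json file."""
    with open(json_path) as handle:
        records = json.load(handle)  # the top level is a list of records

    calls = []
    for record in records:
        # Keys follow the example above: scheme, sequence_type, alleles.
        calls.append((record["scheme"], record["sequence_type"], dict(record["alleles"])))
    return calls
```

Note that allele numbers and the sequence type are strings in the JSON, so they are returned as strings rather than converted to integers.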
The alleles.csv file is generated from the .json output. It includes two boolean (True/False) fields that indicate whether each allele is a perfect match or a novel allele, based on the presence of ? or ~ characters in the allele calls, as described here. The per-locus score field is computed based on the rules described here.
The fields in the alleles.csv output are:
sample_id
scheme
locus
allele
perfect_match
novel_allele
score
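The boolean fields make it straightforward to flag loci that need a closer look. The sketch below lists every locus whose perfect_match field is not True. It assumes the file starts with a header row containing the field names listed above; the function name `imperfect_loci` is hypothetical.

```python
import csv


def imperfect_loci(alleles_csv_path):
    """Return (locus, allele) pairs whose perfect_match field is not "True"."""
    with open(alleles_csv_path, newline="") as handle:
        # Assumption: the first row is a header with the field names listed above.
        return [
            (row["locus"], row["allele"])
            for row in csv.DictReader(handle)
            if row["perfect_match"].strip() != "True"
        ]
```

An empty list means every locus in the file was reported as a perfect match.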
The sequence_type.csv file includes an overall sequence type ID based on the allele calls for each locus, and the overall score, which is the sum of the per-locus scores for the sample. The fields in the sequence_type.csv output are:
sample_id
scheme
sequence_type
score
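Because the overall score is the sum of the per-locus scores, the two CSV files for a sample can be cross-checked against each other. The following sketch does that with a small floating-point tolerance. It assumes both files have a header row and that sequence_type.csv holds a single data row per sample; the function names and the default tolerance are illustrative choices.

```python
import csv
import math


def summed_locus_score(alleles_csv_path):
    """Sum the per-locus score column of an alleles.csv file."""
    with open(alleles_csv_path, newline="") as handle:
        return sum(float(row["score"]) for row in csv.DictReader(handle))


def scores_agree(alleles_csv_path, sequence_type_csv_path, tolerance=1e-6):
    """Return True if the overall score matches the sum of the per-locus scores."""
    with open(sequence_type_csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    # Assumption: sequence_type.csv contains exactly one data row for the sample.
    overall = float(rows[0]["score"])
    return math.isclose(overall, summed_locus_score(alleles_csv_path), abs_tol=tolerance)
```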
Each analysis will create a provenance.yml file for each sample. The filename of the provenance.yml file includes a timestamp with format YYYYMMDDHHMMSS to ensure that a unique file will be produced if a sample is re-analyzed and outputs are stored to the same directory.
- pipeline_name: BCCDC-PHL/mlst-nf
  pipeline_version: 0.1.4
  nextflow_session_id: f18b89aa-06f7-41e4-b016-3519dfd5a5cb
  nextflow_run_name: sharp_bhaskara
  timestamp_analysis_start: 2024-02-20T22:59:37.862710
- input_filename: NC-000913.3.fa
  input_path: /home/runner/work/mlst-nf/mlst-nf/.github/data/assemblies/NC-000913.3.fa
  sha256: 6b195feda4c66140f6762742eb8b30c2652f02b45878b174f5b00ef85ecc95d7
- process_name: mlst
  tools:
    - tool_name: mlst
      tool_version: 2.16.1
      parameters:
        - parameter: minid
          value: 95
        - parameter: mincov
          value: 10
        - parameter: minscore
          value: 50
- process_name: quast
  tools:
    - tool_name: quast
      tool_version: 5.0.2
      parameters:
        - parameter: --space-efficient
          value: null
        - parameter: --fast
          value: null
        - parameter: --min-contig
          value: 0
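When a sample has been re-analyzed into the same directory, several provenance files will sit side by side, and the timestamp in the filename identifies the most recent one. The sketch below picks it out. It assumes filenames follow the `<sample_id>_<YYYYMMDDHHMMSS>_provenance.yml` pattern seen in the example output listing; the function name `latest_provenance` is hypothetical.

```python
import re
from datetime import datetime
from pathlib import Path

# Assumption: filenames look like <sample_id>_<YYYYMMDDHHMMSS>_provenance.yml.
PROVENANCE_PATTERN = re.compile(r"^(?P<sample_id>.+)_(?P<timestamp>\d{14})_provenance\.yml$")


def latest_provenance(sample_dir):
    """Return the path of the most recent provenance file in sample_dir, or None."""
    candidates = []
    for path in Path(sample_dir).iterdir():
        match = PROVENANCE_PATTERN.match(path.name)
        if match:
            # Parse the 14-digit timestamp so files are compared as dates.
            stamp = datetime.strptime(match.group("timestamp"), "%Y%m%d%H%M%S")
            candidates.append((stamp, path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```

Parsing the timestamp with `datetime.strptime` also rejects filenames whose 14 digits do not form a valid date, by raising a `ValueError`.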