diff --git a/.gitignore b/.gitignore
index 58f5020..361b7ac 100644
--- a/.gitignore
+++ b/.gitignore
@@ -19,4 +19,5 @@ _site
# Jupyter cache
\ No newline at end of file
diff --git a/CITATION.cff b/CITATION.cff
new file mode 100644
index 0000000..b6a7cdd
--- /dev/null
+++ b/CITATION.cff
@@ -0,0 +1,30 @@
+# This CITATION.cff file was generated with cffinit.
+# Visit https://bit.ly/cffinit to generate yours today!
+cff-version: 1.2.0
+title: Working with Bacterial Genomes
+message: >-
+ You may cite these materials using the metadata in this
+ file. Please cite our materials if you publish materials
+ derived from these, run a workshop using them, or use
+ the information in your own work.
+type: dataset
+authors:
+ - given-names: Andries
+ family-names: van Tonder
+ orcid: 'https://orcid.org/0000-0002-4380-5250'
+ affiliation: >-
+ Department of Veterinary Medicine, University of
+ Cambridge
+ - given-names: Hugo
+ family-names: Tavares
+ affiliation: Cambridge Centre for Research Informatics Training
+ orcid: 'https://orcid.org/0000-0001-9373-2726'
+ - given-names: Bajuna
+ family-names: Salehe
+ affiliation: Cambridge Centre for Research Informatics Training
+repository-code: 'https://github.com/cambiotraining/bacterial-genomics'
+url: 'https://cambiotraining.github.io/bacterial-genomics/'
+license: CC-BY-4.0
+commit: TODO
+date-released: '2024-07-22'
diff --git a/_quarto.yml b/_quarto.yml
index 5799598..60cf0f0 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -9,7 +9,6 @@ project:
- index.md
- setup.md
- - materials.md
- materials/*
- "!materials/*/no_render"
diff --git a/course_files/scripts/M_tuberculosis/01-run_fetchngs.sh b/course_files/scripts/M_tuberculosis/01-run_fetchngs.sh
index 6a12a43..15e290c 100644
--- a/course_files/scripts/M_tuberculosis/01-run_fetchngs.sh
+++ b/course_files/scripts/M_tuberculosis/01-run_fetchngs.sh
@@ -13,4 +13,5 @@ nextflow run nf-core/fetchngs \
--max_memory '16.GB' --max_cpus 8 \
--input SAMPLES \
--outdir results/fetchngs \
- --nf_core_pipeline viralrecon
+ --nf_core_pipeline viralrecon \
+ --download_method sratools
diff --git a/index.md b/index.md
index c3c1186..be56ef2 100644
--- a/index.md
+++ b/index.md
@@ -48,55 +48,24 @@ The course is aimed at biologists interested in microbiology, prokaryotic genomi
- A working knowledge of running analysis on High Performance Computing (HPC) clusters (course registration page).
-## Authors
-About the authors:
-- **Andries van Tonder**
- _Affiliation_: Department of Veterinary Medicine, University of Cambridge
- _Roles_: writing - original draft; conceptualisation; coding
-- **Hugo Tavares**
- _Affiliation_: Bioinformatics Training Facility, University of Cambridge
- _Roles_: writing - review & editing
-- **Bajuna Salehe**
- _Affiliation_: Bioinformatics Training Facility, University of Cambridge
- _Roles_: writing - original content; conceptualisation; coding; data curation
-## Citation
+## Citation & Authors
+These materials are a fork of the [Carpentries Shell Lesson](https://swcarpentry.github.io/shell-novice/).
+As such, please make sure to cite both works if you use these materials (see [Acknowledgements]).
Please cite these materials if:
- You adapted or used any of them in your own teaching.
-- These materials were useful for your research work. For example, you can cite us in the methods section of your paper: "We carried our analyses based on the recommendations in _TODO_.".
-You can cite these materials as:
+- These materials were useful for your research work. For example, you can cite us in the methods section of your paper: "We carried out our analyses based on the recommendations in _YourReferenceHere_".
-Or in BibTeX format:
- author = {},
- title = {},
- month = {},
- year = {},
- url = {},
- doi = {}
+{{< citation CITATION.cff >}}
## Acknowledgements
diff --git a/materials.md b/materials.md
deleted file mode 100644
index cc7e0e6..0000000
--- a/materials.md
+++ /dev/null
@@ -1,4 +0,0 @@
-title: "Materials"
-subtitle: Detailed course materials can be found in this section, including exercises to practice. If you are a self-learner, make sure to check the [setup page](setup.md).
\ No newline at end of file
diff --git a/materials/36-outbreak_solutions.md b/materials/36-outbreak_solutions.md
new file mode 100644
index 0000000..f8a81d8
--- /dev/null
+++ b/materials/36-outbreak_solutions.md
@@ -0,0 +1,204 @@
+title: "OUTBREAK ALERT! - Trainer solutions"
+## QC
+Run `avantonder/bacQC` ([link to section](09-bacqc.md)).
+Start by creating the samplesheet using the helper script:
+python scripts/fastq_dir_to_samplesheet.py data/reads/ --read1_extension "_1.fastq.gz" --read2_extension "_2.fastq.gz" samplesheet.csv
+Then run bacQC with the command below.
+Participants will need to "guess" a genome size, as they won't know what it is.
+The genome size is not critical for the analysis.
+nextflow run avantonder/bacQC \
+ -r "v1.2" \
+ -resume -profile "singularity" \
+ --input "samplesheet.csv" \
+ --outdir "results/bacqc" \
+ --kraken2db "databases/k2_standard_08gb_20240605" \
+ --brackendb "databases/k2_standard_08gb_20240605" \
+ --genome_size "3200000"
+From the QC, participants should notice:
+- Isolate02 has very few reads, which might lead to poor assemblies - they can choose whether to proceed with it or not
+- The Kraken barplot reveals that we are dealing with a _Vibrio cholerae_ outbreak.
+- However, isolate08 is _Vibrio parahaemolyticus_. The GC content was also slightly different for this isolate.
+ - Again, they can choose whether to proceed with this sample. Keeping it is useful, as it can serve as an outgroup for the phylogeny later on.
+::: callout-tip
+#### Shell scripts
+Encourage participants to write their commands in a shell script for reproducibility and documentation.
+## Assembly
+Run `avantonder/assembleBAC` ([link to section](20-assemblebac.md)):
+nextflow run avantonder/assembleBAC \
+ -r "v1.2.1" \
+ -resume -profile "singularity" \
+ --input "samplesheet.csv" \
+ --outdir "results/assemblebac" \
+ --baktadb "databases/bakta_light_20240119" \
+ --genome_size "3.2M" \
+ --checkm2db "databases/checkm2_v2_20210323/uniref100.KO.1.dmnd"
+#### Preprocessed data
+`assembleBAC` takes a long time to run (up to 1h).
+To avoid waiting that long, there is a preprocessed folder in a shared drive on the training machines.
+Participants can copy the files from there once their pipeline is running successfully:
+cp -r ~/Course_Share/preprocessed-outbreak ~/Course_Materials/outbreak/preprocessed
+From the assemblies, participants should notice:
+- MultiQC report:
+ - If they proceeded with isolate02, they will notice a very poor, essentially unusable assembly.
+ - Other samples are all comparable in quality, some more fragmented than others, but not massively different.
+- The `checkm2_report.tsv` (in the `report` output folder):
+ - Very high assembly completeness for all samples (except isolate02).
+ - Isolate08 is again a bit different: slightly lower GC content (matching what was seen with bacQC), a larger predicted genome size (~5 Mb) and a higher number of predicted genes.
+ - As further QC they can compare these numbers with the ones on reference databases, for example on KEGG: [V. parahaemolyticus](https://www.genome.jp/kegg-bin/show_organism?org=T00120); [V. cholerae](https://www.genome.jp/entry/gn:T00034)
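To make that comparison easier, the relevant columns of the report can be pulled out by header name. This is only a sketch: the helper `select_columns` is hypothetical, and the column names in the usage note (`Name`, `Completeness`, `GC_Content`, `Genome_Size`) are assumptions that should be checked against the header of the actual `checkm2_report.tsv`.

```shell
# Print selected columns of a tab-separated file, chosen by header name.
# Usage: select_columns FILE COL1,COL2,...
# (hypothetical helper, not part of assembleBAC or CheckM2)
select_columns() {
  awk -F '\t' -v OFS='\t' -v wanted="$2" '
    NR == 1 {
      # map each header name to its column position
      n = split(wanted, names, ",")
      for (i = 1; i <= NF; i++) pos[$i] = i
      for (j = 1; j <= n; j++) {
        if (!(names[j] in pos)) {
          print "column not found: " names[j] > "/dev/stderr"
          exit 1
        }
        idx[j] = pos[names[j]]
      }
    }
    {
      # print the requested columns in the requested order
      line = $(idx[1])
      for (j = 2; j <= n; j++) line = line OFS $(idx[j])
      print line
    }
  ' "$1"
}
```

For example, `select_columns results/assemblebac/report/checkm2_report.tsv Name,Completeness,GC_Content,Genome_Size` would print a compact table, assuming the report sits at that path and uses those header names.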
+## Core genome alignment
+Create a script to run Panaroo ([link to section](24-panaroo.md)).
+This takes a while to run, but we provide preprocessed files that participants can use to proceed once their command is working and running.
+mamba activate panaroo
+# create output directory
+mkdir -p results/panaroo
+# ensure isolate02 is not being used - they might do this in a more "manual" way
+gffs=$(ls results/assemblebac/annotation/*.gff3 | grep -v "isolate02")
+# run panaroo
+panaroo \
+ --input $gffs \
+ --out_dir results/panaroo \
+ --clean-mode strict \
+ --alignment core \
+ --core_threshold 0.98 \
+ --remove-invalid-genes \
+ --threads 8
+::: callout-tip
+#### Discussing use of an outgroup
+We're including _V. parahaemolyticus_ to use as an outgroup in our phylogeny.
+However, it's worth noting that by doing so we may be reducing the number of core genes in the alignment, as these species may be sufficiently diverged to alter the Panaroo clustering.
+One possibility would be to lower the `--core_threshold` to include a few more genes.
+There is no clear "right" answer, but this could be a good discussion to have with the participants.
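To support that discussion, one rough way to see how the threshold affects the number of core genes is to count rows of the presence/absence matrix. This sketch assumes Panaroo wrote a tab-separated `gene_presence_absence.Rtab` (a `Gene` column followed by one 0/1 column per sample); the helper `count_core_genes` is hypothetical and only approximates Panaroo's own core definition.

```shell
# Count genes present in at least THRESHOLD (a fraction, e.g. 0.98) of samples.
# Usage: count_core_genes RTAB THRESHOLD
# (hypothetical helper; assumes a 0/1 matrix with a header row)
count_core_genes() {
  awk -F '\t' -v thr="$2" '
    NR > 1 {
      present = 0
      for (i = 2; i <= NF; i++) if ($i > 0) present++
      if (present / (NF - 1) >= thr) core++
    }
    END { print core + 0 }
  ' "$1"
}
```

Running, for instance, `count_core_genes results/panaroo/gene_presence_absence.Rtab 0.98` and then again with `0.90` gives a quick sense of how many extra genes a lower threshold would admit.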
+## Phylogeny
+Once participants have the core alignment, they can build their trees by writing a script to run IQ-TREE ([link to section](12-phylogenetics.md)).
+# create output directory
+mkdir -p results/snp-sites/
+mkdir -p results/iqtree/
+# extract variable sites
+snp-sites results/panaroo/core_gene_alignment_filtered.aln > results/snp-sites/core_gene_alignment_snps.aln
+# count invariant sites
+snp-sites -C results/panaroo/core_gene_alignment_filtered.aln > results/snp-sites/constant_sites.txt
+# Run iqtree
+iqtree \
+ -fconst $(cat results/snp-sites/constant_sites.txt) \
+ -s results/snp-sites/core_gene_alignment_snps.aln \
+ --prefix results/iqtree/vibrio_outbreak \
+ -nt AUTO \
+ -ntmax 8 \
+ -mem 8G \
+ -m MFP \
+ -bb 1000
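Because `-fconst` expects four comma-separated counts (A, C, G, T), it can be worth checking the constant sites file before launching IQ-TREE. The following is a minimal sketch; `check_fconst` and `total_constant_sites` are hypothetical helper names, and the check assumes the file holds a single line of counts.

```shell
# Succeed only if the file contains a line of exactly four comma-separated integers.
check_fconst() {
  grep -Eqx '[0-9]+(,[0-9]+){3}' "$1"
}

# Print the total number of constant sites (sum of the four counts on line 1).
total_constant_sites() {
  awk -F ',' 'NR == 1 { print $1 + $2 + $3 + $4 }' "$1"
}
```

A script could then run `check_fconst results/snp-sites/constant_sites.txt || exit 1` before the `iqtree` call, so that an empty or malformed file stops the analysis early.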
+Participants can visualise the phylogeny in Microreact, FigTree or R/ggtree - their choice.
+Encourage participants to include metadata from other analyses (e.g. MLST and AMR).
+## MLST
+Although `assembleBAC` runs MLST, participants may opt to run this separately as well ([link to section](22-strain_typing.md)).
+They can use the `mlst --list` command to see which schemes are available for _Vibrio_.
+There are actually two schemes available, named `vcholerae` and `vcholerae_2`.
+They can run both:
+mamba activate mlst
+mkdir -p results/mlst
+# exclude isolate02 and 08
+samples=$(ls results/assemblebac/assemblies/*.fa | grep -v "isolate02\|isolate08")
+mlst --scheme vcholerae $samples > results/mlst/mlst_typing_scheme1.tsv
+mlst --scheme vcholerae_2 $samples > results/mlst/mlst_typing_scheme2.tsv
+Participants should notice that most isolates fall under [profile 69](https://pubmlst.org/bigsdb?page=profileInfo&db=pubmlst_vcholerae_seqdef&scheme_id=1&profile_id=69), except isolate09 and isolate10.
+This should match what they see in the phylogeny.
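A quick tally of sequence types makes the odd ones stand out. This sketch relies on the `mlst` output being tab-separated with the sequence type in the third column (file, scheme, ST, then alleles); the helper name `count_sts` is hypothetical.

```shell
# Tabulate how many isolates fall under each sequence type.
# Usage: count_sts MLST_TSV
count_sts() {
  cut -f 3 "$1" | sort | uniq -c | sort -k1,1nr
}
```

For example, `count_sts results/mlst/mlst_typing_scheme1.tsv` lists each ST with its number of isolates, most common first.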
+## AMR
+Participants can upload their FASTA assemblies to Pathogenwatch, which implements its own AMR analysis method ([link to section](31-pathogenwatch.md)).
+In addition, they can use the funcscan workflow ([link to section](34-command_line_amr.md)).
+First we need to create the funcscan samplesheet with two columns: "sample" and "fasta".
+This can be done manually, but we provide some code here as a (slightly convoluted) way to do this with the command line:
+# get sample names into temporary file - excluding isolate02 and isolate08
+basename -a results/assemblebac/assemblies/*.fa | sed 's/_T1_contigs.fa//' | grep -v "isolate02\|isolate08" > temp_samples
+# get FASTA files into temporary file - excluding isolate02 and isolate08
+ls results/assemblebac/assemblies/*.fa | grep -v "isolate02\|isolate08" > temp_fastas
+# create samplesheet header
+echo "sample,fasta" > samplesheet_funcscan.csv
+# paste the temporary files and append to samplesheet
+paste -d "," temp_samples temp_fastas >> samplesheet_funcscan.csv
+# remove temporary files
+rm temp_samples temp_fastas
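An alternative that avoids temporary files is a single loop over the assemblies. This is a sketch under the same assumptions as the code above (assemblies named `<sample>_T1_contigs.fa`, with isolate02 and isolate08 excluded); the function name `make_funcscan_samplesheet` is hypothetical.

```shell
# Write a two-column funcscan samplesheet (sample,fasta) for all assemblies
# in a directory, skipping isolate02 and isolate08.
# Usage: make_funcscan_samplesheet ASSEMBLY_DIR OUTPUT_CSV
make_funcscan_samplesheet() {
  echo "sample,fasta" > "$2"
  for fasta in "$1"/*.fa; do
    # skip the unexpanded pattern if the directory has no .fa files
    [ -e "$fasta" ] || continue
    sample=$(basename "$fasta" _T1_contigs.fa)
    case "$sample" in
      isolate02|isolate08) continue ;;
    esac
    echo "${sample},${fasta}" >> "$2"
  done
}
```

It could be called as `make_funcscan_samplesheet results/assemblebac/assemblies samplesheet_funcscan.csv`. Unlike the `grep -v` approach, it matches the sample names exactly rather than as substrings.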
+The following can be used in a shell script to run the workflow:
+nextflow run nf-core/funcscan \
+ -r "1.1.6" \
+ -resume -profile "singularity" \
+ --input "samplesheet_funcscan.csv" \
+ --outdir "results/funcscan" \
+ --run_arg_screening \
+ --arg_skip_deeparg
diff --git a/materials/_sidebar.yml b/materials/_sidebar.yml
deleted file mode 100644
index aa83652..0000000
--- a/materials/_sidebar.yml
+++ /dev/null
@@ -1,11 +0,0 @@
- sidebar:
- - title: "Materials"
- contents:
- - materials.md
- # Only edit from this line down
- - section: "Developer Guidelines"
- contents:
- - materials/01-render_guidelines.md
- - materials/02-content_guidelines.md
\ No newline at end of file
diff --git a/setup.md b/setup.md
index c1c367c..f9627a4 100644
--- a/setup.md
+++ b/setup.md
@@ -307,10 +307,10 @@ You can follow the same instructions as for "Ubuntu".
## Data
-The data used in these materials is provided as an archive file (`bact-data.tar.gz`).
+The data used in these materials is provided as an archive file (`bact-data.tar`).
You can download it from the link below and extract the files from the archive into a directory of your choice.
@@ -322,11 +322,18 @@ datadir="$HOME/Desktop/bacterial_genomics"
# download and extract to directory
mkdir $datadir
-wget -O $datadir/bact-data.tar.gz "https://www.dropbox.com/scl/fi/osjpmst8i2919fv3by3eh/bact-data.tar.gz?rlkey=iddkexnnm6ccsx8prfcay259u&dl=1"
-tar -xzvf $datadir/bact-data.tar.gz -C $datadir
+wget -O $datadir/bact-data.tar "https://www.dropbox.com/scl/fi/gdqf3y3toot2hjtivhlpk/bact-data.tar?rlkey=udjh38aqd05eg3r8klw5mguld&dl=1"
+tar -xvf $datadir/bact-data.tar -C $datadir
rm $datadir/bact-data.tar
+::: {.callout-tip}
+#### Note for training facility
+We also need to include preprocessed data for the outbreak exercise.
+See the [download script](https://github.com/cambiotraining/bacterial-genomics/blob/main/utils/download_data.sh) in the repo for details.
### Databases
diff --git a/utils/download_data.sh b/utils/download_data.sh
index 4af36ff..02393db 100644
--- a/utils/download_data.sh
+++ b/utils/download_data.sh
@@ -11,9 +11,16 @@ fi
# Download and extract course data
echo "Downloading and extracting course files"
-wget -O bact-data.tar.gz "https://www.dropbox.com/scl/fi/osjpmst8i2919fv3by3eh/bact-data.tar.gz?rlkey=iddkexnnm6ccsx8prfcay259u&dl=1"
-tar -xzf bact-data.tar.gz
-rm bact-data.tar.gz
+wget -O bact-data.tar "https://www.dropbox.com/scl/fi/gdqf3y3toot2hjtivhlpk/bact-data.tar?rlkey=udjh38aqd05eg3r8klw5mguld&dl=1"
+tar -xf bact-data.tar
+rm bact-data.tar
+# Download and extract preprocessed files for outbreak exercise
+echo "Downloading and extracting preprocessed outbreak exercise files"
+wget -O preprocessed-outbreak.tar "https://www.dropbox.com/scl/fi/axj9ur03hhe78d9dvbak3/preprocessed-outbreak.tar?rlkey=1ftvheba6ak3b7k20nls2ijkh&st=mbnd6lfw&dl=1"
+tar -xf preprocessed-outbreak.tar
+mv preprocessed-outbreak ~/Course_Share/
+rm preprocessed-outbreak.tar
# Download and extract public databases
mkdir databases