Merge branch 'main' of https://github.com/cambiotraining/bacterial-genomics
cambiotraining · Dec 5, 2024 · 8d97c7c
2 parents 71128c3 + 7fef3eb
Showing 10 changed files with 270 additions and 67 deletions.
diff --git a/.gitignore b/.gitignore
@@ -19,4 +19,5 @@ _site
 .Renviron
 
 # Jupyter cache
-.jupyter_cache
+.jupyter_cache
+/.luarc.json
diff --git a/CITATION.cff b/CITATION.cff
@@ -0,0 +1,30 @@
+# This CITATION.cff file was generated with cffinit.
+# Visit https://bit.ly/cffinit to generate yours today!
+
+cff-version: 1.2.0
+title: Working with Bacterial Genomes
+message: >-
+  You may cite these materials using the metadata in this
+  file. Please cite our materials if you publish materials
+  derived from these, run a workshop using them, or use
+  the information in your own work.
+type: dataset
+authors:
+  - given-names: Andries
+    family-names: van Tonder
+    orcid: 'https://orcid.org/0000-0002-4380-5250'
+    affiliation: >-
+      Department of Veterinary Medicine, University of
+      Cambridge
+  - given-names: Hugo
+    family-names: Tavares
+    affiliation: Cambridge Centre for Research Informatics Training
+    orcid: 'https://orcid.org/0000-0001-9373-2726'
+  - given-names: Bajuna
+    family-names: Salehe
+    affiliation: Cambridge Centre for Research Informatics Training
+repository-code: 'https://github.com/cambiotraining/bacterial-genomics'
+url: 'https://cambiotraining.github.io/bacterial-genomics/'
+license: CC-BY-4.0
+commit: TODO
+date-released: '2024-07-22'
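Review note: `commit: TODO` in the new `CITATION.cff` is still a placeholder. The sketch below shows one way it could be filled in, assuming the intended value is a Git commit hash (in a real checkout this might come from `git rev-parse HEAD`). The helper name `set_cff_commit`, the demo file contents, and the choice of hash are illustrative assumptions, not part of this commit.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: replace the placeholder "commit: TODO" line in a
# CITATION.cff file with the given commit hash.
set_cff_commit() {
  local cff_file="$1" hash="$2"
  local tmp
  tmp="$(mktemp)"
  # Only a line that is exactly "commit: TODO" is rewritten.
  sed "s/^commit: TODO\$/commit: ${hash}/" "$cff_file" > "$tmp"
  mv "$tmp" "$cff_file"
}

# Demo on a throwaway file; the hash is an example value only.
demo_dir="$(mktemp -d)"
printf 'license: CC-BY-4.0\ncommit: TODO\n' > "$demo_dir/CITATION.cff"
set_cff_commit "$demo_dir/CITATION.cff" "8d97c7c"
cat "$demo_dir/CITATION.cff"
```

Writing to a temporary file and moving it into place avoids relying on `sed -i`, whose syntax differs between GNU and BSD `sed`.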
diff --git a/_quarto.yml b/_quarto.yml
@@ -9,7 +9,6 @@ project:
   render:
     - index.md
     - setup.md
-    - materials.md
     - materials/*
     - "!materials/*/no_render"
 

diff --git a/course_files/scripts/M_tuberculosis/01-run_fetchngs.sh b/course_files/scripts/M_tuberculosis/01-run_fetchngs.sh
@@ -13,4 +13,5 @@ nextflow run nf-core/fetchngs \
   --max_memory '16.GB' --max_cpus 8 \
   --input SAMPLES \
   --outdir results/fetchngs \
-  --nf_core_pipeline viralrecon
+  --nf_core_pipeline viralrecon \
+  --download_method sratools
diff --git a/index.md b/index.md
@@ -48,55 +48,24 @@ The course is aimed at biologists interested in microbiology, prokaryotic genomi
 - A working knowledge of running analysis on High Performance Computing (HPC) clusters (course registration page).
 
 
-## Authors
-<!-- 
-The listing below shows an example of how you can give more details about yourself.
-These examples include icons with links to GitHub and Orcid. 
--->
-
-About the authors:
-
-- **Andries van Tonder**
-  <a href="https://orcid.org/0000-0002-4380-5250" target="_blank"><i class="fa-brands fa-orcid" style="color:#a6ce39"></i></a> 
-  <a href="https://github.com/avantonder" target="_blank"><i class="fa-brands fa-github" style="color:#4078c0"></i></a>  
-  _Affiliation_: Department of Veterinary Medicine, University of Cambridge  
-  _Roles_: writing - original draft; conceptualisation; coding
-- **Hugo Tavares**
-  <a href="https://orcid.org/0000-0001-9373-2726" target="_blank"><i class="fa-brands fa-orcid" style="color:#a6ce39"></i></a> 
-  <a href="https://github.com/tavareshugo" target="_blank"><i class="fa-brands fa-github" style="color:#4078c0"></i></a>  
-  _Affiliation_: Bioinformatics Training Facility, University of Cambridge  
-  _Roles_: writing - review & editing
-- **Bajuna Salehe**
-  <a href="https://github.com/bsalehe" target="_blank"><i class="fa-brands fa-github" style="color:#4078c0"></i></a>  
-  _Affiliation_: Bioinformatics Training Facility, University of Cambridge  
-  _Roles_: writing - original content; conceptualisation; coding; data curation
-
-
-## Citation
+## Citation & Authors
 
-<!-- We can do this at the end -->
+:::{.callout-important}
+These materials are a fork of the [Carpentries Shell Lesson](https://swcarpentry.github.io/shell-novice/).  
+As such, please make sure to cite both works if you use these materials (see [Acknowledgements]).
+:::
 
 Please cite these materials if:
 
 - You adapted or used any of them in your own teaching.
-- These materials were useful for your research work. For example, you can cite us in the methods section of your paper: "We carried our analyses based on the recommendations in _TODO_.".
-
-You can cite these materials as:
-
-> TODO
+- These materials were useful for your research work. For example, you can cite us in the methods section of your paper: "We carried out our analyses based on the recommendations in _YourReferenceHere_".
 
-Or in BibTeX format:
+<!-- 
+This is generated automatically from the CITATION.cff file. 
+If you think you should be added as an author, please get in touch with us.
+-->
 
-```
-@Misc{,
-  author = {},
-  title = {},
-  month = {},
-  year = {},
-  url = {},
-  doi = {}
-}
-```
+{{< citation CITATION.cff >}}
 
 
 ## Acknowledgements

diff --git a/materials.md b/materials.md
diff --git a/materials/36-outbreak_solutions.md b/materials/36-outbreak_solutions.md
@@ -0,0 +1,204 @@
+---
+title: "OUTBREAK ALERT! - Trainer solutions"
+---
+
+## QC
+
+Run `avantonder/bacQC` ([link to section](09-bacqc.md)). 
+
+Start by creating the samplesheet using the helper script: 
+
+```bash
+python scripts/fastq_dir_to_samplesheet.py data/reads/ --read1_extension "_1.fastq.gz" --read2_extension "_2.fastq.gz" samplesheet.csv
+```
+
+Then run bacQC with the command below.
+Participants will need to "guess" a genome size, as they won't know what it is. 
+The genome size is not critical for the analysis.
+
+```bash
+nextflow run avantonder/bacQC \
+  -r "v1.2" \
+  -resume -profile "singularity" \
+  --input "samplesheet.csv" \
+  --outdir "results/bacqc" \
+  --kraken2db "databases/k2_standard_08gb_20240605" \
+  --brackendb "databases/k2_standard_08gb_20240605" \
+  --genome_size "3200000"
+```
+
+From the QC, participants should notice: 
+
+- Isolate02 has very few reads, which might lead to poor assemblies - they can choose whether to proceed with it or not
+- The Kraken barplot reveals that we are dealing with a _Vibrio cholerae_ outbreak.
+- However, isolate08 is _Vibrio parahaemolyticus_. The GC content was also slightly different for this isolate. 
+  - Again, they can choose whether to proceed with this sample. Keeping it is useful, as it can serve as an outgroup for the phylogeny later on.
+
+::: callout-tip
+#### Shell scripts
+
+Encourage participants to write their commands in a shell script for reproducibility and documentation. 
+:::
+
+
+## Assembly
+
+Run `avantonder/assembleBAC` ([link to section](20-assemblebac.md)):
+
+```bash
+nextflow run avantonder/assembleBAC \
+  -r "v1.2.1" \
+  -resume -profile "singularity" \
+  --input "samplesheet.csv" \
+  --outdir "results/assemblebac" \
+  --baktadb "databases/bakta_light_20240119" \
+  --genome_size "3.2M" \
+  --checkm2db "databases/checkm2_v2_20210323/uniref100.KO.1.dmnd"
+```
+
+:::{.callout-warning}
+#### Preprocessed data
+
+`assembleBAC` takes a long time to run (up to 1h). 
+To avoid waiting for that long, there is a preprocessed folder in a shared drive on the training machines. 
+Participants can copy the files from there once their pipeline is running successfully: 
+
+```bash
+cp -r ~/Course_Share/preprocessed-outbreak ~/Course_Materials/outbreak/preprocessed
+```
+:::
+
+From the assemblies, participants should notice: 
+
+- MultiQC report: 
+  - If they proceeded with isolate02, they will notice a very poor assembly, essentially unusable. 
+  - The other samples are all comparable in quality; some are more fragmented than others, but not massively so.
+- The `checkm2_report.tsv` (in the `report` output folder):
+  - Very high assembly completeness for all samples (except isolate02).
+  - Isolate08 is again a bit different: slightly lower GC content (matches what was seen with bacQC) and higher predicted genome size at ~5Mb and higher number of predicted genes. 
+  - As further QC they can compare these numbers with the ones on reference databases, for example on KEGG: [V. parahaemolyticus](https://www.genome.jp/kegg-bin/show_organism?org=T00120); [V. cholerae](https://www.genome.jp/entry/gn:T00034)
+
+
+## Core genome alignment
+
+Create a script to run Panaroo ([link to section](24-panaroo.md)).
+This takes a while to run, but we provide preprocessed files if they want to proceed from there once their command is working and running.
+
+```bash
+mamba activate panaroo
+
+# create output directory
+mkdir results/panaroo
+
+# ensure isolate02 is not being used - they might do this in a more "manual" way
+gffs=$(ls results/assemblebac/annotation/*.gff3 | grep -v "isolate02")
+
+# run panaroo
+panaroo \
+  --input $gffs \
+  --out_dir results/panaroo \
+  --clean-mode strict \
+  --alignment core \
+  --core_threshold 0.98 \
+  --remove-invalid-genes \
+  --threads 8
+```
+
+::: callout-tip
+#### Discussing use of an outgroup
+
+We're including _V. parahaemolyticus_ to use as an outgroup in our phylogeny.
+However, it's worth noting that by doing so we may be reducing the number of core genes in the alignment, as these species may be sufficiently diverged to alter the Panaroo clustering. 
+One possibility would be to lower the `--core_threshold` to include a few more genes. 
+
+There is no clear "right" answer, but this could be a good discussion to have with the participants. 
+:::
+
+
+## Phylogeny
+
+Once participants have the core alignment, they can build their trees by writing a script to run IQ-TREE ([link to section](12-phylogenetics.md)).
+
+```bash
+# create output directory
+mkdir -p results/snp-sites/
+mkdir -p results/iqtree/
+
+# extract variable sites
+snp-sites results/panaroo/core_gene_alignment_filtered.aln > results/snp-sites/core_gene_alignment_snps.aln
+
+# count invariant sites
+snp-sites -C results/panaroo/core_gene_alignment_filtered.aln > results/snp-sites/constant_sites.txt
+
+# Run iqtree
+iqtree \
+  -fconst $(cat results/snp-sites/constant_sites.txt) \
+  -s results/snp-sites/core_gene_alignment_snps.aln \
+  --prefix results/iqtree/vibrio_outbreak \
+  -nt AUTO \
+  -ntmax 8 \
+  -mem 8G \
+  -m MFP \
+  -bb 1000
+```
+
+Participants can visualise the phylogeny in Microreact, FigTree or R/ggtree - their choice.
+Encourage participants to include metadata from other analyses (e.g. MLST and AMR).
+
+
+## MLST
+
+Although `assembleBAC` runs MLST, participants may opt to run this separately as well ([link to section](22-strain_typing.md)). 
+
+They can use the `mlst --list` command to see which schemes are available for _Vibrio_. 
+There are actually two schemes available, named `vcholerae` and `vcholerae_2`. 
+They can run both: 
+
+```bash
+mamba activate mlst
+
+mkdir results/mlst
+
+# exclude isolate02 and 08
+samples=$(ls results/assemblebac/assemblies/*.fa | grep -v "isolate02\|isolate08")
+
+mlst --scheme vcholerae $samples > results/mlst/mlst_typing_scheme1.tsv
+mlst --scheme vcholerae_2 $samples > results/mlst/mlst_typing_scheme2.tsv
+```
+
+Participants should notice that most isolates fall under [profile 69](https://pubmlst.org/bigsdb?page=profileInfo&db=pubmlst_vcholerae_seqdef&scheme_id=1&profile_id=69), except isolate09 and isolate10.
+This should match what they see in the phylogeny. 
+
+
+## AMR
+
+Participants can upload their FASTA assemblies to Pathogenwatch, which implements its own AMR analysis method ([link to section](31-pathogenwatch.md)). 
+
+In addition, they can use the funcscan workflow ([link to section](34-command_line_amr.md)). 
+First we need to create the funcscan samplesheet with two columns: "sample" and "fasta".
+This can be done manually, but we provide some code here as a (slightly convoluted) way to do this with the command line:
+
+```bash
+# get sample names into temporary file - excluding isolate02 and isolate08
+basename -a results/assemblebac/assemblies/*.fa | sed 's/_T1_contigs.fa//' | grep -v "isolate02\|isolate08" > temp_samples
+# get FASTA files into temporary file - excluding isolate02 and isolate08
+ls results/assemblebac/assemblies/*.fa | grep -v "isolate02\|isolate08" > temp_fastas
+# create samplesheet header
+echo "sample,fasta" > samplesheet_funcscan.csv
+# paste the temporary files and append to samplesheet
+paste -d "," temp_samples temp_fastas >> samplesheet_funcscan.csv
+# remove temporary files
+rm temp_samples temp_fastas
+```
+
+The following can be used in a shell script to run the workflow: 
+
+```bash
+nextflow run nf-core/funcscan \
+  -r "1.1.6" \
+  -resume -profile "singularity" \
+  --input "samplesheet_funcscan.csv" \
+  --outdir "results/funcscan" \
+  --run_arg_screening \
+  --arg_skip_deeparg
+```
diff --git a/materials/_sidebar.yml b/materials/_sidebar.yml
diff --git a/setup.md b/setup.md
@@ -307,10 +307,10 @@ You can follow the same instructions as for "Ubuntu".
 
 ## Data
 
-The data used in these materials is provided as an archive file (`bact-data.tar.gz`). 
+The data used in these materials is provided as an archive file (`bact-data.tar`). 
 You can download it from the link below and extract the files from the archive into a directory of your choice.
 
-<a href="https://www.dropbox.com/scl/fi/osjpmst8i2919fv3by3eh/bact-data.tar.gz?rlkey=iddkexnnm6ccsx8prfcay259u&dl=0">
+<a href="https://www.dropbox.com/scl/fi/gdqf3y3toot2hjtivhlpk/bact-data.tar?rlkey=udjh38aqd05eg3r8klw5mguld&dl=0">
   <button class="btn"><i class="fa fa-download"></i> Download</button>
 </a>
 
@@ -322,11 +322,18 @@ datadir="$HOME/Desktop/bacterial_genomics"
 
 # download and extract to directory
 mkdir $datadir
-wget -O $datadir/bact-data.tar.gz "https://www.dropbox.com/scl/fi/osjpmst8i2919fv3by3eh/bact-data.tar.gz?rlkey=iddkexnnm6ccsx8prfcay259u&dl=1"
-tar -xzvf $datadir/bact-data.tar.gz -C $datadir
+wget -O $datadir/bact-data.tar "https://www.dropbox.com/scl/fi/gdqf3y3toot2hjtivhlpk/bact-data.tar?rlkey=udjh38aqd05eg3r8klw5mguld&dl=1"
+tar -xvf $datadir/bact-data.tar -C $datadir
 rm $datadir/bact-data.tar
 ```
 
+::: {.callout-tip}
+#### Note for training facility
+
+We also need to include preprocessed data for the outbreak exercise. 
+See the [download script](https://github.com/cambiotraining/bacterial-genomics/blob/main/utils/download_data.sh) in the repo for details.
+:::
+
 
 ### Databases
 

diff --git a/utils/download_data.sh b/utils/download_data.sh
@@ -11,9 +11,16 @@ fi
 
 # Download and extract course data
 echo "Downloading and extracting course files"
-wget -O bact-data.tar.gz "https://www.dropbox.com/scl/fi/osjpmst8i2919fv3by3eh/bact-data.tar.gz?rlkey=iddkexnnm6ccsx8prfcay259u&dl=1"
-tar -xzf bact-data.tar.gz
-rm bact-data.tar.gz
+wget -O bact-data.tar "https://www.dropbox.com/scl/fi/gdqf3y3toot2hjtivhlpk/bact-data.tar?rlkey=udjh38aqd05eg3r8klw5mguld&dl=1"
+tar -xf bact-data.tar
+rm bact-data.tar
+
+# Download and extract preprocessed files for outbreak exercise
+echo "Downloading and extracting preprocessed outbreak exercise files"
+wget -O preprocessed-outbreak.tar "https://www.dropbox.com/scl/fi/axj9ur03hhe78d9dvbak3/preprocessed-outbreak.tar?rlkey=1ftvheba6ak3b7k20nls2ijkh&st=mbnd6lfw&dl=1"
+tar -xf preprocessed-outbreak.tar
+mv preprocessed-outbreak ~/Course_Share/
+rm preprocessed-outbreak.tar
 
 # Download and extract public databases
 mkdir databases
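Review note: since the archives are now plain `.tar` files, the `-z` flag is dropped from the `tar` calls. The sketch below illustrates the extract-then-clean-up pattern used in `utils/download_data.sh` on a tiny local archive, so nothing is downloaded. The helper name `extract_and_clean` and the temporary paths are hypothetical. It also creates the destination with `mkdir -p`, which the script above does not do for `~/Course_Share/` (it may already exist on the training machines).

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: extract an uncompressed .tar archive into a target
# directory (created if missing), then delete the archive.
extract_and_clean() {
  local archive="$1" dest="$2"
  mkdir -p "$dest"
  tar -xf "$archive" -C "$dest"
  rm "$archive"
}

# Demo: build a small archive locally, then extract it.
work_dir="$(mktemp -d)"
mkdir -p "$work_dir/src/preprocessed-outbreak"
echo "placeholder" > "$work_dir/src/preprocessed-outbreak/README.txt"
tar -cf "$work_dir/preprocessed-outbreak.tar" -C "$work_dir/src" preprocessed-outbreak
extract_and_clean "$work_dir/preprocessed-outbreak.tar" "$work_dir/Course_Share"
ls "$work_dir/Course_Share/preprocessed-outbreak"
```

Extracting directly into the destination with `-C` removes the need for a separate `mv` step.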