diff --git a/.gitignore b/.gitignore index 58f5020..361b7ac 100644 --- a/.gitignore +++ b/.gitignore @@ -19,4 +19,5 @@ _site .Renviron # Jupyter cache -.jupyter_cache \ No newline at end of file +.jupyter_cache +/.luarc.json diff --git a/CITATION.cff b/CITATION.cff new file mode 100644 index 0000000..b6a7cdd --- /dev/null +++ b/CITATION.cff @@ -0,0 +1,30 @@ +# This CITATION.cff file was generated with cffinit. +# Visit https://bit.ly/cffinit to generate yours today! + +cff-version: 1.2.0 +title: Working with Bacterial Genomes +message: >- + You may cite these materials using the metadata in this + file. Please cite our materials if you publish materials + derived from these, run a workshop using them, or use + the information in your own work. +type: dataset +authors: + - given-names: Andries + family-names: van Tonder + orcid: 'https://orcid.org/0000-0002-4380-5250' + affiliation: >- + Department of Veterinary Medicine, University of + Cambridge + - given-names: Hugo + family-names: Tavares + affiliation: Cambridge Centre for Research Informatics Training + orcid: 'https://orcid.org/0000-0001-9373-2726' + - given-names: Bajuna + family-names: Salehe + affiliation: Cambridge Centre for Research Informatics Training +repository-code: 'https://github.com/cambiotraining/bacterial-genomics' +url: 'https://cambiotraining.github.io/bacterial-genomics/' +license: CC-BY-4.0 +commit: TODO +date-released: '2024-07-22' diff --git a/_quarto.yml b/_quarto.yml index 5799598..60cf0f0 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -9,7 +9,6 @@ project: render: - index.md - setup.md - - materials.md - materials/* - "!materials/*/no_render" diff --git a/course_files/scripts/M_tuberculosis/01-run_fetchngs.sh b/course_files/scripts/M_tuberculosis/01-run_fetchngs.sh index 6a12a43..15e290c 100644 --- a/course_files/scripts/M_tuberculosis/01-run_fetchngs.sh +++ b/course_files/scripts/M_tuberculosis/01-run_fetchngs.sh @@ -13,4 +13,5 @@ nextflow run nf-core/fetchngs \ --max_memory '16.GB' 
--max_cpus 8 \ --input SAMPLES \ --outdir results/fetchngs \ - --nf_core_pipeline viralrecon + --nf_core_pipeline viralrecon \ + --download_method sratools diff --git a/index.md b/index.md index c3c1186..be56ef2 100644 --- a/index.md +++ b/index.md @@ -48,55 +48,24 @@ The course is aimed at biologists interested in microbiology, prokaryotic genomi - A working knowledge of running analysis on High Performance Computing (HPC) clusters (course registration page). -## Authors - - -About the authors: - -- **Andries van Tonder** - - - _Affiliation_: Department of Veterinary Medicine, University of Cambridge - _Roles_: writing - original draft; conceptualisation; coding -- **Hugo Tavares** - - - _Affiliation_: Bioinformatics Training Facility, University of Cambridge - _Roles_: writing - review & editing -- **Bajuna Salehe** - - _Affiliation_: Bioinformatics Training Facility, University of Cambridge - _Roles_: writing - original content; conceptualisation; coding; data curation - - -## Citation +## Citation & Authors - +:::{.callout-important} +These materials are a fork of the [Carpentries Shell Lesson](https://swcarpentry.github.io/shell-novice/). +As such, please make sure to cite both works if you use these materials (see [Acknowledgements]). +::: Please cite these materials if: - You adapted or used any of them in your own teaching. -- These materials were useful for your research work. For example, you can cite us in the methods section of your paper: "We carried our analyses based on the recommendations in _TODO_.". - -You can cite these materials as: - -> TODO +- These materials were useful for your research work. For example, you can cite us in the methods section of your paper: "We carried out our analyses based on the recommendations in _YourReferenceHere_". 
-Or in BibTeX format: + -``` -@Misc{, - author = {}, - title = {}, - month = {}, - year = {}, - url = {}, - doi = {} -} -``` +{{< citation CITATION.cff >}} ## Acknowledgements diff --git a/materials.md b/materials.md deleted file mode 100644 index cc7e0e6..0000000 --- a/materials.md +++ /dev/null @@ -1,4 +0,0 @@ ---- -title: "Materials" -subtitle: Detailed course materials can be found in this section, including exercises to practice. If you are a self-learner, make sure to check the [setup page](setup.md). ---- \ No newline at end of file diff --git a/materials/36-outbreak_solutions.md b/materials/36-outbreak_solutions.md new file mode 100644 index 0000000..f8a81d8 --- /dev/null +++ b/materials/36-outbreak_solutions.md @@ -0,0 +1,204 @@ +--- +title: "OUTBREAK ALERT! - Trainer solutions" +--- + +## QC + +Run `avantonder/bacQC` ([link to section](09-bacqc.md)). + +Start by creating the samplesheet using the helper script: + +```bash +python scripts/fastq_dir_to_samplesheet.py data/reads/ --read1_extension "_1.fastq.gz" --read2_extension "_2.fastq.gz" samplesheet.csv +``` + +Then run bacQC with the command below. +Participants will need to "guess" a genome size, as they won't know what it is. +The genome size is not critical for the analysis. + +```bash +nextflow run avantonder/bacQC \ + -r "v1.2" \ + -resume -profile "singularity" \ + --input "samplesheet.csv" \ + --outdir "results/bacqc" \ + --kraken2db "databases/k2_standard_08gb_20240605" \ + --brackendb "databases/k2_standard_08gb_20240605" \ + --genome_size "3200000" +``` + +From the QC participants should notice: + +- Isolate02 has very few reads, which might lead to poor assemblies - they can choose whether to proceed with it or not +- The Kraken barplot reveals that we are dealing with a _Vibrio cholerae_ outbreak. +- However, isolate08 is _Vibrio parahaemolyticus_. The GC content was also slightly different for this isolate. 
+ - again, they can choose to proceed with this sample or not - however proceeding is nice to use as an outgroup for phylogeny later on. + +::: callout-tip +#### Shell scripts + +Encourage participants to write their commands in a shell script for reproducibility and documentation. +::: + + +## Assembly + +Run `avantonder/assembleBAC` ([link to section](20-assemblebac.md)): + +```bash +nextflow run avantonder/assembleBAC \ + -r "v1.2.1" \ + -resume -profile "singularity" \ + --input "samplesheet.csv" \ + --outdir "results/assemblebac" \ + --baktadb "databases/bakta_light_20240119" \ + --genome_size "3.2M" \ + --checkm2db "databases/checkm2_v2_20210323/uniref100.KO.1.dmnd" +``` + +:::{.callout-warning} +#### Preprocessed data + +`assembleBAC` takes a long time to run (up to 1h). +To avoid waiting for that long, there is a preprocessed folder in a shared drive on the training machines. +Files can be copied from there, once they are running the pipeline successfully: + +```bash +cp -r ~/Course_Share/preprocessed-outbreak ~/Course_Materials/outbreak/preprocessed +``` +::: + +From the assemblies participants should notice: + +- MultiQC report: + - If they proceeded with isolate02, they will notice very poor assembly, essentially unusable. + - Other samples are all comparable in quality, some more fragmented than others, but not massively different +- The `checkm2_report.tsv` (in the `report` output folder): + - Very high assembly completeness for all samples (except isolate02). + - Isolate08 is again a bit different: slightly lower GC content (matches what was seen with bacQC) and higher predicted genome size at ~5Mb and higher number of predicted genes. + - As further QC they can compare these numbers with the ones on reference databases, for example on KEGG: [V. parahaemolyticus](https://www.genome.jp/kegg-bin/show_organism?org=T00120); [V. 
cholerae](https://www.genome.jp/entry/gn:T00034) + + +## Core genome alignment + +Create a script to run Panaroo ([link to section](24-panaroo.md)). +This takes a while to run, but we provide preprocessed files if they want to proceed from there once their command is working and running. + +```bash +mamba activate panaroo + +# create output directory +mkdir results/panaroo + +# ensure isolate02 is not being used - they might do this in a more "manual" way +gffs=$(ls results/assemblebac/annotation/*.gff3 | grep -v "isolate02") + +# run panaroo +panaroo \ + --input $gffs \ + --out_dir results/panaroo \ + --clean-mode strict \ + --alignment core \ + --core_threshold 0.98 \ + --remove-invalid-genes \ + --threads 8 +``` + +::: callout-tip +#### Discussing use of an outgroup + +We're including _V. parahaemolyticus_ to use as an outgroup in our phylogeny. +However, it's worth noting that by doing so we may be reducing the number of core genes in the alignment, as these species may be sufficiently diverged to alter the Panaroo clustering. +One possibility would be to lower the `--core_threshold` to include a few more genes. + +There is no clear "right" answer, but this could be a good discussion to have with the participants. +::: + + +## Phylogeny + +Once participants have the core alignment, they can build their trees by writing a script to run IQ-tree ([link to section](12-phylogenetics.md)). 
+ +```bash +# create output directory +mkdir -p results/snp-sites/ +mkdir -p results/iqtree/ + +# extract variable sites +snp-sites results/panaroo/core_gene_alignment_filtered.aln > results/snp-sites/core_gene_alignment_snps.aln + +# count invariant sites +snp-sites -C results/panaroo/core_gene_alignment_filtered.aln > results/snp-sites/constant_sites.txt + +# Run iqtree +iqtree \ + -fconst $(cat results/snp-sites/constant_sites.txt) \ + -s results/snp-sites/core_gene_alignment_snps.aln \ + --prefix results/iqtree/vibrio_outbreak \ + -nt AUTO \ + -ntmax 8 \ + -mem 8G \ + -m MFP \ + -bb 1000 +``` + +Participants can visualise the phylogeny in Microreact, FigTree or R/ggtree - their choice. +Encourage participants to include metadata from other analyses (e.g. MLST and AMR). + + +## MLST + +Although `assembleBAC` runs MLST, participants may opt to run this separately as well ([link to section](22-strain_typing.md)). + +They can use the `mlst --list` command to see which schemes are available for _Vibrio_. +There are actually two schemes available, named `vcholerae` and `vcholerae_2`. +They can run both: + +```bash +mamba activate mlst + +mkdir results/mlst + +# exclude isolate02 and 08 +samples=$(ls results/assemblebac/assemblies/*.fa | grep -v "isolate02\|isolate08") + +mlst --scheme vcholerae $samples > results/mlst/mlst_typing_scheme1.tsv +mlst --scheme vcholerae_2 $samples > results/mlst/mlst_typing_scheme2.tsv +``` + +Participants should notice that most isolates fall under [profile 69](https://pubmlst.org/bigsdb?page=profileInfo&db=pubmlst_vcholerae_seqdef&scheme_id=1&profile_id=69), except isolate09 and isolate10. +This should match what they see in the phylogeny. + + +## AMR + +Participants can upload their FASTA assemblies to Pathogenwatch, which implements its own AMR analysis method ([link to section](31-pathogenwatch.md)). + +In addition, they can use the funcscan workflow ([link to section](34-command_line_amr.md)). 
+First we need to create the funcscan samplesheet with two columns: "sample" and "fasta". +This can be done manually, but we provide some code here as a (slightly convoluted) way to do this with the command line: + +```bash +# get sample names into temporary file - excluding isolate02 and isolate08 +basename -a results/assemblebac/assemblies/*.fa | sed 's/_T1_contigs.fa//' | grep -v "isolate02\|isolate08" > temp_samples +# get FASTA files into temporary file - excluding isolate02 and isolate08 +ls results/assemblebac/assemblies/*.fa | grep -v "isolate02\|isolate08" > temp_fastas +# create samplesheet header +echo "sample,fasta" > samplesheet_funcscan.csv +# paste the temporary files and append to samplesheet +paste -d "," temp_samples temp_fastas >> samplesheet_funcscan.csv +# remove temporary files +rm temp_samples temp_fastas +``` + +The following can be used in a shell script to run the workflow: + +```bash +nextflow run nf-core/funcscan \ + -r "1.1.6" \ + -resume -profile "singularity" \ + --input "samplesheet_funcscan.csv" \ + --outdir "results/funcscan" \ + --run_arg_screening \ + --arg_skip_deeparg +``` diff --git a/materials/_sidebar.yml b/materials/_sidebar.yml deleted file mode 100644 index aa83652..0000000 --- a/materials/_sidebar.yml +++ /dev/null @@ -1,11 +0,0 @@ -# DEPRECATED: USE THE _chapters.yml INSTEAD -website: - sidebar: - - title: "Materials" - contents: - - materials.md - # Only edit from this line down - - section: "Developer Guidelines" - contents: - - materials/01-render_guidelines.md - - materials/02-content_guidelines.md \ No newline at end of file diff --git a/setup.md b/setup.md index c1c367c..f9627a4 100644 --- a/setup.md +++ b/setup.md @@ -307,10 +307,10 @@ You can follow the same instructions as for "Ubuntu". ## Data -The data used in these materials is provided as an archive file (`bact-data.tar.gz`). +The data used in these materials is provided as an archive file (`bact-data.tar`). 
You can download it from the link below and extract the files from the archive into a directory of your choice. - + @@ -322,11 +322,18 @@ datadir="$HOME/Desktop/bacterial_genomics" # download and extract to directory mkdir $datadir -wget -O $datadir/bact-data.tar.gz "https://www.dropbox.com/scl/fi/osjpmst8i2919fv3by3eh/bact-data.tar.gz?rlkey=iddkexnnm6ccsx8prfcay259u&dl=1" -tar -xzvf $datadir/bact-data.tar.gz -C $datadir +wget -O $datadir/bact-data.tar "https://www.dropbox.com/scl/fi/gdqf3y3toot2hjtivhlpk/bact-data.tar?rlkey=udjh38aqd05eg3r8klw5mguld&dl=1" +tar -xvf $datadir/bact-data.tar -C $datadir rm $datadir/bact-data.tar ``` +::: {.callout-tip} +#### Note for training facility + +We also need to include preprocessed data for the outbreak exercise. +See the [download script](https://github.com/cambiotraining/bacterial-genomics/blob/main/utils/download_data.sh) in the repo for details. +::: + ### Databases diff --git a/utils/download_data.sh b/utils/download_data.sh index 4af36ff..02393db 100644 --- a/utils/download_data.sh +++ b/utils/download_data.sh @@ -11,9 +11,16 @@ fi # Download and extract course data echo "Downloading and extracting course files" -wget -O bact-data.tar.gz "https://www.dropbox.com/scl/fi/osjpmst8i2919fv3by3eh/bact-data.tar.gz?rlkey=iddkexnnm6ccsx8prfcay259u&dl=1" -tar -xzf bact-data.tar.gz -rm bact-data.tar.gz +wget -O bact-data.tar "https://www.dropbox.com/scl/fi/gdqf3y3toot2hjtivhlpk/bact-data.tar?rlkey=udjh38aqd05eg3r8klw5mguld&dl=1" +tar -xf bact-data.tar +rm bact-data.tar + +# Download and extract preprocessed files for outbreak exercise +echo "Downloading and extracting preprocessed outbreak exercise files" +wget -O preprocessed-outbreak.tar "https://www.dropbox.com/scl/fi/axj9ur03hhe78d9dvbak3/preprocessed-outbreak.tar?rlkey=1ftvheba6ak3b7k20nls2ijkh&st=mbnd6lfw&dl=1" +tar -xf preprocessed-outbreak.tar +mv preprocessed-outbreak ~/Course_Share/ +rm preprocessed-outbreak.tar # Download and extract public databases mkdir databases