avantonder committed Dec 5, 2024
2 parents 71128c3 + 7fef3eb commit 8d97c7c
Showing 10 changed files with 270 additions and 67 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -19,4 +19,5 @@ _site
.Renviron

# Jupyter cache
.jupyter_cache
.jupyter_cache
/.luarc.json
30 changes: 30 additions & 0 deletions CITATION.cff
@@ -0,0 +1,30 @@
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Working with Bacterial Genomes
message: >-
You may cite these materials using the metadata in this
file. Please cite our materials if you publish materials
derived from these, run a workshop using them, or use
the information in your own work.
type: dataset
authors:
- given-names: Andries
family-names: van Tonder
orcid: 'https://orcid.org/0000-0002-4380-5250'
affiliation: >-
Department of Veterinary Medicine, University of
Cambridge
- given-names: Hugo
family-names: Tavares
affiliation: Cambridge Centre for Research Informatics Training
orcid: 'https://orcid.org/0000-0001-9373-2726'
- given-names: Bajuna
family-names: Salehe
affiliation: Cambridge Centre for Research Informatics Training
repository-code: 'https://github.com/cambiotraining/bacterial-genomics'
url: 'https://cambiotraining.github.io/bacterial-genomics/'
license: CC-BY-4.0
commit: TODO
date-released: '2024-07-22'
1 change: 0 additions & 1 deletion _quarto.yml
@@ -9,7 +9,6 @@ project:
render:
- index.md
- setup.md
- materials.md
- materials/*
- "!materials/*/no_render"

3 changes: 2 additions & 1 deletion course_files/scripts/M_tuberculosis/01-run_fetchngs.sh
@@ -13,4 +13,5 @@ nextflow run nf-core/fetchngs \
--max_memory '16.GB' --max_cpus 8 \
--input SAMPLES \
--outdir results/fetchngs \
--nf_core_pipeline viralrecon
--nf_core_pipeline viralrecon \
--download_method sratools
53 changes: 11 additions & 42 deletions index.md
@@ -48,55 +48,24 @@ The course is aimed at biologists interested in microbiology, prokaryotic genomi
- A working knowledge of running analysis on High Performance Computing (HPC) clusters (course registration page).


## Authors
<!--
The listing below shows an example of how you can give more details about yourself.
These examples include icons with links to GitHub and Orcid.
-->

About the authors:

- **Andries van Tonder**
<a href="https://orcid.org/0000-0002-4380-5250" target="_blank"><i class="fa-brands fa-orcid" style="color:#a6ce39"></i></a>
<a href="https://github.com/avantonder" target="_blank"><i class="fa-brands fa-github" style="color:#4078c0"></i></a>
_Affiliation_: Department of Veterinary Medicine, University of Cambridge
_Roles_: writing - original draft; conceptualisation; coding
- **Hugo Tavares**
<a href="https://orcid.org/0000-0001-9373-2726" target="_blank"><i class="fa-brands fa-orcid" style="color:#a6ce39"></i></a>
<a href="https://github.com/tavareshugo" target="_blank"><i class="fa-brands fa-github" style="color:#4078c0"></i></a>
_Affiliation_: Bioinformatics Training Facility, University of Cambridge
_Roles_: writing - review & editing
- **Bajuna Salehe**
<a href="https://github.com/bsalehe" target="_blank"><i class="fa-brands fa-github" style="color:#4078c0"></i></a>
_Affiliation_: Bioinformatics Training Facility, University of Cambridge
_Roles_: writing - original content; conceptualisation; coding; data curation


## Citation
## Citation & Authors

<!-- We can do this at the end -->
:::{.callout-important}
These materials are a fork of the [Carpentries Shell Lesson](https://swcarpentry.github.io/shell-novice/).
As such, please make sure to cite both works if you use these materials (see [Acknowledgements]).
:::

Please cite these materials if:

- You adapted or used any of them in your own teaching.
- These materials were useful for your research work. For example, you can cite us in the methods section of your paper: "We carried out our analyses based on the recommendations in _TODO_.".

You can cite these materials as:

> TODO
- These materials were useful for your research work. For example, you can cite us in the methods section of your paper: "We carried out our analyses based on the recommendations in _YourReferenceHere_".

Or in BibTeX format:
<!--
This is generated automatically from the CITATION.cff file.
If you think you should be added as an author, please get in touch with us.
-->

```
@Misc{,
author = {},
title = {},
month = {},
year = {},
url = {},
doi = {}
}
```
{{< citation CITATION.cff >}}


## Acknowledgements
4 changes: 0 additions & 4 deletions materials.md

This file was deleted.

204 changes: 204 additions & 0 deletions materials/36-outbreak_solutions.md
@@ -0,0 +1,204 @@
---
title: "OUTBREAK ALERT! - Trainer solutions"
---

## QC

Run `avantonder/bacQC` ([link to section](09-bacqc.md)).

Start by creating the samplesheet using the helper script:

```bash
python scripts/fastq_dir_to_samplesheet.py data/reads/ --read1_extension "_1.fastq.gz" --read2_extension "_2.fastq.gz" samplesheet.csv
```

Then run bacQC with the command below.
Participants will need to guess a genome size, as they won't know what it is.
The exact value is not critical for the analysis.

```bash
nextflow run avantonder/bacQC \
-r "v1.2" \
-resume -profile "singularity" \
--input "samplesheet.csv" \
--outdir "results/bacqc" \
--kraken2db "databases/k2_standard_08gb_20240605" \
--brackendb "databases/k2_standard_08gb_20240605" \
--genome_size "3200000"
```

From the QC, participants should notice:

- Isolate02 has very few reads, which might lead to a poor assembly. They can choose whether or not to proceed with it.
- The Kraken barplot reveals that we are dealing with a _Vibrio cholerae_ outbreak.
- However, isolate08 is _Vibrio parahaemolyticus_. The GC content was also slightly different for this isolate.
    - Again, they can choose whether or not to proceed with this sample. Keeping it is useful, as it can serve as an outgroup for the phylogeny later on.
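
If participants want to confirm the low read count for isolate02 outside the QC report, a minimal sketch like the one below can count FASTQ records directly. It assumes gzipped FASTQ files with exactly four lines per record; the function name `count_reads` and the example path are illustrative and not part of bacQC.

```bash
# Count reads in a gzipped FASTQ file.
# Assumption: four lines per record (no wrapped sequence lines).
# count_reads is a hypothetical helper, not part of bacQC.
count_reads() {
  gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Example usage (illustrative path):
# count_reads data/reads/isolate02_1.fastq.gz
```

Comparing this number across isolates gives a quick sanity check before deciding whether to carry a sample forward.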

::: callout-tip
#### Shell scripts

Encourage participants to write their commands in a shell script for reproducibility and documentation.
:::


## Assembly

Run `avantonder/assembleBAC` ([link to section](20-assemblebac.md)):

```bash
nextflow run avantonder/assembleBAC \
-r "v1.2.1" \
-resume -profile "singularity" \
--input "samplesheet.csv" \
--outdir "results/assemblebac" \
--baktadb "databases/bakta_light_20240119" \
--genome_size "3.2M" \
--checkm2db "databases/checkm2_v2_20210323/uniref100.KO.1.dmnd"
```

:::{.callout-warning}
#### Preprocessed data

`assembleBAC` takes a long time to run (up to 1h).
To avoid waiting that long, there is a preprocessed folder in a shared drive on the training machines.
Files can be copied from there once participants have the pipeline running successfully:

```bash
cp -r ~/Course_Share/preprocessed-outbreak ~/Course_Materials/outbreak/preprocessed
```
:::

From the assemblies, participants should notice:

- MultiQC report:
    - If they proceeded with isolate02, they will notice a very poor assembly, essentially unusable.
    - Other samples are all comparable in quality. Some are more fragmented than others, but not massively different.
- The `checkm2_report.tsv` (in the `report` output folder):
    - Very high assembly completeness for all samples (except isolate02).
    - Isolate08 is again a bit different: slightly lower GC content (matching what was seen with bacQC), a larger predicted genome size (~5 Mb) and a higher number of predicted genes.
- As further QC, they can compare these numbers with those in reference databases, for example KEGG: [V. parahaemolyticus](https://www.genome.jp/kegg-bin/show_organism?org=T00120); [V. cholerae](https://www.genome.jp/entry/gn:T00034)


## Core genome alignment

Create a script to run Panaroo ([link to section](24-panaroo.md)).
This takes a while to run, but we provide preprocessed files so participants can proceed from those once their own command is running correctly.

```bash
mamba activate panaroo

# create output directory
mkdir -p results/panaroo

# ensure isolate02 is not being used - they might do this in a more "manual" way
gffs=$(ls results/assemblebac/annotation/*.gff3 | grep -v "isolate02")

# run panaroo
panaroo \
--input $gffs \
--out_dir results/panaroo \
--clean-mode strict \
--alignment core \
--core_threshold 0.98 \
--remove-invalid-genes \
--threads 8
```

::: callout-tip
#### Discussing use of an outgroup

We're including _V. parahaemolyticus_ to use as an outgroup in our phylogeny.
However, it's worth noting that by doing so we may be reducing the number of core genes in the alignment, as these species may be sufficiently diverged to alter the Panaroo clustering.
One possibility would be to lower the `--core_threshold` to include a few more genes.

There is no clear "right" answer, but this could be a good discussion to have with the participants.
:::
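
To make the outgroup discussion concrete, participants could count how many genes pass different core thresholds. The sketch below is only an approximation: it assumes a tab-separated 0/1 presence/absence matrix with a header row and gene names in the first column, which is how we understand Panaroo's `gene_presence_absence.Rtab` to be laid out. The function name `count_core_genes` is hypothetical, and Panaroo's own core definition may differ in edge cases.

```bash
# Count genes present in at least a given fraction of samples.
# Assumption: tab-separated 0/1 matrix, header row, gene names in column 1
# (the layout we assume for Panaroo's gene_presence_absence.Rtab).
# count_core_genes is a hypothetical helper, not part of Panaroo.
count_core_genes() {
  awk -F '\t' -v thr="$2" '
    NR > 1 {
      present = 0
      for (i = 2; i <= NF; i++) present += $i
      if (present / (NF - 1) >= thr) core++
    }
    END { print core + 0 }
  ' "$1"
}

# Example usage (illustrative path):
# count_core_genes results/panaroo/gene_presence_absence.Rtab 0.98
# count_core_genes results/panaroo/gene_presence_absence.Rtab 0.90
```

Running it with and without the outgroup sample in the matrix would show how much the outgroup affects the core gene count.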


## Phylogeny

Once participants have the core alignment, they can build their trees by writing a script to run IQ-TREE ([link to section](12-phylogenetics.md)).

```bash
# create output directory
mkdir -p results/snp-sites/
mkdir -p results/iqtree/

# extract variable sites
snp-sites results/panaroo/core_gene_alignment_filtered.aln > results/snp-sites/core_gene_alignment_snps.aln

# count invariant sites
snp-sites -C results/panaroo/core_gene_alignment_filtered.aln > results/snp-sites/constant_sites.txt

# Run iqtree
iqtree \
-fconst $(cat results/snp-sites/constant_sites.txt) \
-s results/snp-sites/core_gene_alignment_snps.aln \
--prefix results/iqtree/vibrio_outbreak \
-nt AUTO \
-ntmax 8 \
-mem 8G \
-m MFP \
-bb 1000
```
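
Because `-fconst` takes the counts of constant A, C, G and T sites as four comma-separated numbers, it can be worth checking the `snp-sites -C` output before launching IQ-TREE. The following is a minimal sketch; `is_valid_fconst` is a hypothetical helper name.

```bash
# Check that a string looks like four comma-separated non-negative integers,
# i.e. the A,C,G,T constant-site counts passed to -fconst.
# is_valid_fconst is a hypothetical helper, not part of snp-sites or IQ-TREE.
is_valid_fconst() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+,[0-9]+,[0-9]+,[0-9]+$'
}

# Example usage:
# is_valid_fconst "$(cat results/snp-sites/constant_sites.txt)" \
#   || echo "Unexpected constant sites format"
```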

Participants can visualise the phylogeny in Microreact, FigTree or R/ggtree - their choice.
Encourage participants to include metadata from other analyses (e.g. MLST and AMR).


## MLST

Although `assembleBAC` runs MLST, participants may opt to run this separately as well ([link to section](22-strain_typing.md)).

They can use the `mlst --list` command to see which schemes are available for _Vibrio_.
There are actually two schemes available, named `vcholerae` and `vcholerae_2`.
They can run both:

```bash
mamba activate mlst

mkdir -p results/mlst

# exclude isolate02 and 08
samples=$(ls results/assemblebac/assemblies/*.fa | grep -v "isolate02\|isolate08")

mlst --scheme vcholerae $samples > results/mlst/mlst_typing_scheme1.tsv
mlst --scheme vcholerae_2 $samples > results/mlst/mlst_typing_scheme2.tsv
```

Participants should notice that most isolates fall under [profile 69](https://pubmlst.org/bigsdb?page=profileInfo&db=pubmlst_vcholerae_seqdef&scheme_id=1&profile_id=69), except isolate09 and isolate10.
This should match what they see in the phylogeny.
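
To summarise the typing results, participants could tally how many isolates fall under each sequence type. This sketch assumes the default `mlst` output: tab-separated, no header, with the sequence type in the third column. The function name `count_sequence_types` is hypothetical.

```bash
# Tally sequence types from an mlst results file.
# Assumption: tab-separated output without a header, ST in column 3.
# count_sequence_types is a hypothetical helper, not part of mlst.
count_sequence_types() {
  awk -F '\t' '{ n[$3]++ } END { for (st in n) print st "\t" n[st] }' "$1" | sort
}

# Example usage:
# count_sequence_types results/mlst/mlst_typing_scheme1.tsv
```

Each output line gives a sequence type followed by the number of isolates assigned to it.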


## AMR

Participants can upload their FASTA assemblies to Pathogenwatch, which implements its own AMR analysis method ([link to section](31-pathogenwatch.md)).

In addition, they can use the funcscan workflow ([link to section](34-command_line_amr.md)).
First we need to create the funcscan samplesheet with two columns: "sample" and "fasta".
This can be done manually, but we provide some code here as a (slightly convoluted) way to do this with the command line:

```bash
# get sample names into temporary file - excluding isolate02 and isolate08
basename -a results/assemblebac/assemblies/*.fa | sed 's/_T1_contigs.fa//' | grep -v "isolate02\|isolate08" > temp_samples
# get FASTA files into temporary file - excluding isolate02 and isolate08
ls results/assemblebac/assemblies/*.fa | grep -v "isolate02\|isolate08" > temp_fastas
# create samplesheet header
echo "sample,fasta" > samplesheet_funcscan.csv
# paste the temporary files and append to samplesheet
paste -d "," temp_samples temp_fastas >> samplesheet_funcscan.csv
# remove temporary files
rm temp_samples temp_fastas
```
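
Before launching the workflow, it may help to confirm that every FASTA path in the samplesheet exists. The sketch below assumes the two-column CSV with a `sample,fasta` header created above; `check_samplesheet` is a hypothetical helper name.

```bash
# Report samplesheet rows whose FASTA file does not exist.
# Assumption: two-column CSV with a "sample,fasta" header line.
# check_samplesheet is a hypothetical helper, not part of funcscan.
check_samplesheet() {
  tail -n +2 "$1" | while IFS=, read -r sample fasta; do
    [ -f "$fasta" ] || echo "missing FASTA for $sample: $fasta"
  done
}

# Example usage:
# check_samplesheet samplesheet_funcscan.csv
```

No output means all listed files were found.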

The following can be used in a shell script to run the workflow:

```bash
nextflow run nf-core/funcscan \
-r "1.1.6" \
-resume -profile "singularity" \
--input "samplesheet_funcscan.csv" \
--outdir "results/funcscan" \
--run_arg_screening \
--arg_skip_deeparg
```
11 changes: 0 additions & 11 deletions materials/_sidebar.yml

This file was deleted.

15 changes: 11 additions & 4 deletions setup.md
@@ -307,10 +307,10 @@ You can follow the same instructions as for "Ubuntu".

## Data

The data used in these materials is provided as an archive file (`bact-data.tar.gz`).
The data used in these materials is provided as an archive file (`bact-data.tar`).
You can download it from the link below and extract the files from the archive into a directory of your choice.

<a href="https://www.dropbox.com/scl/fi/osjpmst8i2919fv3by3eh/bact-data.tar.gz?rlkey=iddkexnnm6ccsx8prfcay259u&dl=0">
<a href="https://www.dropbox.com/scl/fi/gdqf3y3toot2hjtivhlpk/bact-data.tar?rlkey=udjh38aqd05eg3r8klw5mguld&dl=0">
<button class="btn"><i class="fa fa-download"></i> Download</button>
</a>

@@ -322,11 +322,18 @@ datadir="$HOME/Desktop/bacterial_genomics"

# download and extract to directory
mkdir $datadir
wget -O $datadir/bact-data.tar.gz "https://www.dropbox.com/scl/fi/osjpmst8i2919fv3by3eh/bact-data.tar.gz?rlkey=iddkexnnm6ccsx8prfcay259u&dl=1"
tar -xzvf $datadir/bact-data.tar.gz -C $datadir
wget -O $datadir/bact-data.tar "https://www.dropbox.com/scl/fi/gdqf3y3toot2hjtivhlpk/bact-data.tar?rlkey=udjh38aqd05eg3r8klw5mguld&dl=1"
tar -xvf $datadir/bact-data.tar -C $datadir
rm $datadir/bact-data.tar
```
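
If a download is interrupted, the archive may be incomplete. One possible check, sketched below, is to list the archive contents without extracting them; a truncated download will typically make `tar` fail or list fewer entries than expected. The function name `count_tar_entries` is hypothetical.

```bash
# Count the entries in a tar archive without extracting it.
# count_tar_entries is a hypothetical helper for a quick integrity check.
count_tar_entries() {
  tar -tf "$1" | awk 'END { print NR }'
}

# Example usage (illustrative path; run before the rm step):
# count_tar_entries "$datadir/bact-data.tar"
```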

::: {.callout-tip}
#### Note for training facility

We also need to include preprocessed data for the outbreak exercise.
See the [download script](https://github.com/cambiotraining/bacterial-genomics/blob/main/utils/download_data.sh) in the repo for details.
:::


### Databases

13 changes: 10 additions & 3 deletions utils/download_data.sh
@@ -11,9 +11,16 @@ fi

# Download and extract course data
echo "Downloading and extracting course files"
wget -O bact-data.tar.gz "https://www.dropbox.com/scl/fi/osjpmst8i2919fv3by3eh/bact-data.tar.gz?rlkey=iddkexnnm6ccsx8prfcay259u&dl=1"
tar -xzf bact-data.tar.gz
rm bact-data.tar.gz
wget -O bact-data.tar "https://www.dropbox.com/scl/fi/gdqf3y3toot2hjtivhlpk/bact-data.tar?rlkey=udjh38aqd05eg3r8klw5mguld&dl=1"
tar -xf bact-data.tar
rm bact-data.tar

# Download and extract preprocessed files for outbreak exercise
echo "Downloading and extracting preprocessed outbreak exercise files"
wget -O preprocessed-outbreak.tar "https://www.dropbox.com/scl/fi/axj9ur03hhe78d9dvbak3/preprocessed-outbreak.tar?rlkey=1ftvheba6ak3b7k20nls2ijkh&st=mbnd6lfw&dl=1"
tar -xf preprocessed-outbreak.tar
mv preprocessed-outbreak ~/Course_Share/
rm preprocessed-outbreak.tar

# Download and extract public databases
mkdir databases