small additions and bugs
tavareshugo committed Dec 12, 2023
1 parent a00f7dc commit 41a49a9
Showing 4 changed files with 49 additions and 34 deletions.
1 change: 1 addition & 0 deletions materials/23-panaroo.md
@@ -212,6 +212,7 @@ The fixed script is:
#!/bin/bash

# create output directory
mkdir -p results/snp-sites/
mkdir -p results/iqtree/

# extract variable sites
43 changes: 23 additions & 20 deletions materials/27-run_bactmap.md
@@ -20,19 +20,30 @@ Remember, the first step of any analysis of a new sequence dataset is to perform

#### Running nf-core/bactmap

Your next task is to run the **bactmap** pipeline on the _S. pneumoniae_ data. In the folder `scripts` (within the `S_pneumoniae` analysis directory) you will find a script named `01-run_bactmap.sh`. This script contains the code to run bactmap.

- First, create a `samplesheet.csv` file for `bactmap`. Refer back to the [bacQC pipeline](07-bacqc.md#prepare-a-samplesheet) page for how to do this, if you've forgotten.

- Edit the `01-run_bactmap.sh` script, adjusting it to fit your input files and the name and location of the reference you're going to map to (Hint: the reference sequence is located in `resources/reference`).

- Activate the `nextflow` software environment.

- Run the script using `bash scripts/01-run_bactmap.sh`.

- Have a look at the MultiQC report. Do any of the samples look to be of poor quality?

:::{.callout-answer}

First, we created our samplesheet using the helper Python script, as explained for the bacQC pipeline:

```bash
python scripts/fastq_dir_to_samplesheet.py data/reads \
samplesheet.csv \
-r1 _1.fastq.gz \
-r2 _2.fastq.gz
```
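
If all went well, the first lines of the resulting `samplesheet.csv` should look something like this (the exact sample names come from your `data/reads` folder; the ones shown here are illustrative):

```
sample,fastq_1,fastq_2
ERX1265396_ERR1192012,data/reads/ERX1265396_ERR1192012_1.fastq.gz,data/reads/ERX1265396_ERR1192012_2.fastq.gz
```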

Next, we fixed the script:

```bash
#!/bin/bash
@@ -46,28 +57,19 @@ nextflow run nf-core/bactmap \
--genome_size 2.0M
```

Then, we activated the `nextflow` environment:

```bash
mamba activate nextflow
```


Finally, we ran the script as instructed using:

```bash
bash scripts/01-run_bactmap.sh
```

While it was running, it printed a message on the screen:

```bash
N E X T F L O W ~ version 23.04.1
@@ -84,21 +86,22 @@ Launching `https://github.com/nf-core/bactmap` [cranky_swartz] DSL2 - revision:
------------------------------------------------------
```

The results for all the samples looked really good, so we can keep all of them for the next steps of our analyses.

:::
:::

:::{.callout-exercise}
#### How much of the reference was mapped?

- Activate the `seqtk` software environment.
- Run the `02-pseudogenome_check.sh` script we've provided in the `scripts` folder, which calculates how much missing data there is for each sample using `seqtk comp`.
- Once the analysis finishes, open the `mapping_summary.tsv` file in _Excel_ from your file browser <i class="fa-solid fa-folder"></i>.
- Sort the results by the `%ref mapped` column and identify the sample which has the lowest percentage of the reference mapped.
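
The calculation behind the script can be sketched as follows. `seqtk comp` reports, for each sequence, its length and the counts of each base; positions that are not A, C, G or T (the `N`s at unmapped sites) are missing data. Assuming the length is in column 2 and the A/C/G/T counts are in columns 3-6 (the real script may handle the output differently), the percentage of the reference mapped can be computed with `awk` on a mocked `seqtk comp` line:

```bash
# One mocked line of `seqtk comp` output: name, length, #A, #C, #G, #T
# (real output has further columns, which awk simply ignores)
printf 'ERX1265396_ERR1192012\t2000000\t450000\t400000\t420000\t430000\n' |
  awk '{ printf "%s\t%.1f\n", $1, ($3 + $4 + $5 + $6) / $2 * 100 }'
```

Here 1.7 Mb of the 2 Mb pseudogenome consists of called bases, so this sample would show 85.0 in the `%ref mapped` column.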

:::{.callout-answer}
- We activated the software environment: `mamba activate seqtk`
- We then ran the script using `bash scripts/02-pseudogenome_check.sh`. The script prints a message while it's running:

```bash
Processing ERX1265396_ERR1192012.fas
12 changes: 12 additions & 0 deletions materials/31-intro_amr.md
@@ -26,6 +26,18 @@ Numerous software tools have been created to predict the presence of genes linke
Estimating the function of a gene or protein solely from its sequence is complex, leading to varying outcomes across different software tools.
It is advisable to employ multiple tools and compare their findings, thus increasing our confidence in identifying which antimicrobial drugs might be more effective for treating patients infected with the strains we're studying.

:::{.callout-warning}
#### Do not use reference-based pseudogenomes for AMR analysis

Genome consensus sequences, obtained by reference-based alignment, are a fast way to identify mutations across a large number of isolates.
However, these pseudogenomes are biased towards the reference genome used.
For example, if a particular sequence is missing from the reference genome, it will not be present in the pseudogenome.
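
This can be illustrated with a toy example (the sequences below are made up for illustration, not real genomes):

```python
# A reference-based pseudogenome has exactly one base per reference
# position, so it can never contain sequence that the reference lacks.
reference = "ATGCATGCAT"

# This isolate carries an extra sequence absent from the reference
# (think of an acquired AMR gene):
amr_gene = "GGGTTTAAA"
isolate = reference + amr_gene

# Consensus calling keeps one base per reference position:
pseudogenome = isolate[:len(reference)]
assert amr_gene not in pseudogenome  # the extra gene is invisible

# A de novo assembly is built from the reads themselves, so the extra
# sequence can be recovered (represented here by the isolate sequence):
assembly = isolate
assert amr_gene in assembly
```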

The bias created by using a reference genome is tolerable for phylogenetic applications.
However, when our goal is to find antimicrobial resistance factors, we need as complete a genome as possible for each isolate.
Therefore, for **AMR scans we should use a pipeline such as `avantonder/assembleBAC` for _de novo_ genome assembly**.
:::

## Summary

::: {.callout-tip}
27 changes: 13 additions & 14 deletions materials/32-command_line_amr.md
@@ -42,7 +42,18 @@ Two columns are required:
- `sample` --> a sample name of our choice (we will use the same name that we used for the assembly).
- `fasta` --> the path to the FASTA file corresponding to that sample.

You can create this file using spreadsheet software such as _Excel_, making sure to save the file as a CSV.
To get you started, we can save a list of our assembly file names into a file:

```bash
ls preprocessed/assemblebac/assemblies/*.fa | head -n 5 > samplesheet_funcscan.csv
```

In this case, we only list the first **five** files (`head -n 5`), to save time when running the pipeline.
For your own data, you should include all the files.

We then open this file in _Excel_ and edit it further to have the two columns explained above.
Here is our final samplesheet:

```
sample,fasta
@@ -51,18 +62,6 @@ ERX1265488_ERR1192104_T1,preprocessed/assemblebac/assemblies/ERX1265488_ERR11921
ERX1501202_ERR1430824_T1,preprocessed/assemblebac/assemblies/ERX1501202_ERR1430824_T1_contigs.fa
ERX1501203_ERR1430825_T1,preprocessed/assemblebac/assemblies/ERX1501203_ERR1430825_T1_contigs.fa
ERX1501204_ERR1430826_T1,preprocessed/assemblebac/assemblies/ERX1501204_ERR1430826_T1_contigs.fa
```
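
As an alternative to editing the file in _Excel_, the sample column can also be derived on the command line in one go. This sketch assumes each assembly is named `<sample>_contigs.fa`, as in the example above (adjust the suffix if your files differ):

```bash
# Build the two-column samplesheet directly from the file listing:
# strip the directory and the "_contigs.fa" suffix to get the sample name
{
  echo "sample,fasta"
  ls preprocessed/assemblebac/assemblies/*.fa | head -n 5 |
    awk -F/ '{ s = $NF; sub(/_contigs\.fa$/, "", s); print s "," $0 }'
} > samplesheet_funcscan.csv
```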

Once we have the samplesheet ready, we can run the `nf-core/funcscan` workflow using the following commands:
@@ -72,7 +71,7 @@ Once we have the samplesheet ready, we can run the `nf-core/funcscan` workflow u
mamba activate nextflow

# create output directory
mkdir -p results/funcscan

# run the pipeline
nextflow run nf-core/funcscan \
