From 41a49a9482eae757f0fdaf10d662a3c8f493e65c Mon Sep 17 00:00:00 2001 From: Hugo Tavares Date: Tue, 12 Dec 2023 11:03:37 +0000 Subject: [PATCH] small additions and bugs --- materials/23-panaroo.md | 1 + materials/27-run_bactmap.md | 43 +++++++++++++++++--------------- materials/31-intro_amr.md | 12 +++++++++ materials/32-command_line_amr.md | 27 ++++++++++---------- 4 files changed, 49 insertions(+), 34 deletions(-) diff --git a/materials/23-panaroo.md b/materials/23-panaroo.md index 065dc2c..a489370 100644 --- a/materials/23-panaroo.md +++ b/materials/23-panaroo.md @@ -212,6 +212,7 @@ The fixed script is: #!/bin/bash # create output directory +mkdir -p results/snp-sites/ mkdir -p results/iqtree/ # extract variable sites diff --git a/materials/27-run_bactmap.md b/materials/27-run_bactmap.md index c24caeb..85cd010 100644 --- a/materials/27-run_bactmap.md +++ b/materials/27-run_bactmap.md @@ -20,11 +20,13 @@ Remember, the first step of any analysis of a new sequence dataset is to perform #### Running nf-core/bactmap -Your next task is to run the **bactmap** pipeline on the _S. pneumoniae_ data. In the folder `scripts` (within the `S_pneumoniae` analysis directory) you will find a script named `01-run_bactmap.sh`. This script contains the code to run bactmap. Edit this script, adjusting it to fit your input files and the name and location of the reference you're going to map to (Hint: the reference sequence is located in `resources/reference`). +Your next task is to run the **bactmap** pipeline on the _S. pneumoniae_ data. In the folder `scripts` (within the `S_pneumoniae` analysis directory) you will find a script named `01-run_bactmap.sh`. This script contains the code to run bactmap. -- Activate the `nextflow` software environment. +- First, create a `samplesheet.csv` file for `bactmap`. Refer back to [The bacQC pipeline](07-bacqc.md#prepare-a-samplesheet) page for how to do this, if you've forgotten. 
+ +- Edit the `01-run_bactmap.sh` script, adjusting it to fit your input files and the name and location of the reference you're going to map to (Hint: the reference sequence is located in `resources/reference`). -- You will need to create the `samplesheet.csv` file. Refer back to [The bacQC pipeline](07-bacqc.md#prepare-a-samplesheet) page for how to do this, if you've forgotten. +- Activate the `nextflow` software environment. - Run the script using `bash scripts/01-run_bactmap.sh`. @@ -32,7 +34,16 @@ Your next task is to run the **bactmap** pipeline on the _S. pneumoniae_ data. :::{.callout-answer} -The fixed script is: +First, we created our samplesheet using the helper python script, as explained for the bacQC pipeline: + +```bash +python scripts/fastq_dir_to_samplesheet.py data/reads \ + samplesheet.csv \ + -r1 _1.fastq.gz \ + -r2 _2.fastq.gz +``` + +Next, we fixed the script: ```bash #!/bin/bash @@ -46,28 +57,19 @@ nextflow run nf-core/bactmap \ --genome_size 2.0M ``` -- We activated the `nextflow` environment: +Then, we activated the `nextflow` environment: ```bash mamba activate nextflow ``` -- We created the `samplesheet.csv` file by running the following command: - -```bash -python scripts/fastq_dir_to_samplesheet.py data/reads \ - samplesheet.csv \ - -r1 _1.fastq.gz \ - -r2 _2.fastq.gz -``` - -- We ran the script as instructed using: +Finally, we ran the script as instructed using: ```bash bash scripts/01-run_bactmap.sh ``` -- While it was running it printed a message on the screen: +While it was running it printed a message on the screen: ```bash N E X T F L O W ~ version 23.04.1 @@ -84,7 +86,7 @@ Launching `https://github.com/nf-core/bactmap` [cranky_swartz] DSL2 - revision: ------------------------------------------------------ ``` -- The results for all the samples looked really good so we can keep all of them for the next steps of our analyses. 
+The results for all the samples looked really good so we can keep all of them for the next steps of our analyses. ::: ::: @@ -92,13 +94,14 @@ Launching `https://github.com/nf-core/bactmap` [cranky_swartz] DSL2 - revision: :::{.callout-exercise} #### How much of the reference was mapped? -- Use the `02-pseudogenome_check.sh` script we've provided in the `scripts` folder, which calculates how much missing data there is for each sample using `seqtk comp`. -- Once the analysis finishes open the `mapping_summary.tsv` file in _Excel_ from your file browser . +- Activate the `seqtk` software environment. +- Run the `02-pseudogenome_check.sh` script we've provided in the `scripts` folder, which calculates how much missing data there is for each sample using `seqtk comp`. +- Once the analysis finishes, open the `mapping_summary.tsv` file in _Excel_ from your file browser. - Sort the results by the `%ref mapped` column and identify the sample which has the lowest percentage of the reference mapped. :::{.callout-answer} - We activated the software environment: `mamba activate seqtk` -- We then ran the script using `bash scripts/03-pseudogenome_check.sh`. The script prints a message while it's running: +- We then ran the script using `bash scripts/02-pseudogenome_check.sh`. The script prints a message while it's running: ```bash Processing ERX1265396_ERR1192012.fas diff --git a/materials/31-intro_amr.md b/materials/31-intro_amr.md index c72dde6..b5e5644 100644 --- a/materials/31-intro_amr.md +++ b/materials/31-intro_amr.md @@ -26,6 +26,18 @@ Numerous software tools have been created to predict the presence of genes linke Estimating the function of a gene or protein solely from its sequence is complex, leading to varying outcomes across different software tools. 
It is advisable to employ multiple tools and compare their findings, thus increasing our confidence in identifying which antimicrobial drugs might be more effective for treating patients infected with the strains we're studying. +:::{.callout-warning} +#### Do not use reference-based pseudogenomes for AMR analysis + +Genome consensus sequences, obtained using reference-based alignment, are a fast method to obtain the mutations in a large number of isolates. +However, these pseudogenomes are biased towards the reference genome used. +For example, if a particular sequence is missing from the reference genome, it will not be present in the pseudogenome. + +The bias created by using a reference genome is tolerable for phylogenetic applications. +However, when our goal is to find antimicrobial resistance factors, we need as complete a genome as possible for each isolate. +Therefore, for **AMR scans we should use a pipeline such as `avantonder/assembleBAC` for _de novo_ genome assembly**. +::: + ## Summary ::: {.callout-tip} diff --git a/materials/32-command_line_amr.md b/materials/32-command_line_amr.md index 7ab2439..d102179 100644 --- a/materials/32-command_line_amr.md +++ b/materials/32-command_line_amr.md @@ -42,7 +42,18 @@ Two columns are required: - `sample` --> a sample name of our choice (we will use the same name that we used for the assembly). - `fasta` --> the path to the FASTA file corresponding to that sample. -You can create this file using a spreadsheet software such as _Excel_, making sure to save the file as a CSV. 
+To get you started in creating this file, we can save a list of our assembly file names into a file: + +```bash +ls preprocessed/assemblebac/assemblies/*.fa | head -n 5 > samplesheet_funcscan.csv +``` + +In this case, we only list the first **five** files (`head -n 5`), to save time when running the pipeline. +For your own data, you should list all the files. + +We then open this file in _Excel_ and edit it further to have the two columns explained above. +Here is our final samplesheet: ``` sample,fasta @@ -51,18 +62,6 @@ ERX1265488_ERR1192104_T1,preprocessed/assemblebac/assemblies/ERX1265488_ERR11921 ERX1501202_ERR1430824_T1,preprocessed/assemblebac/assemblies/ERX1501202_ERR1430824_T1_contigs.fa ERX1501203_ERR1430825_T1,preprocessed/assemblebac/assemblies/ERX1501203_ERR1430825_T1_contigs.fa ERX1501204_ERR1430826_T1,preprocessed/assemblebac/assemblies/ERX1501204_ERR1430826_T1_contigs.fa -ERX1501205_ERR1430827_T1,preprocessed/assemblebac/assemblies/ERX1501205_ERR1430827_T1_contigs.fa -ERX1501206_ERR1430828_T1,preprocessed/assemblebac/assemblies/ERX1501206_ERR1430828_T1_contigs.fa -ERX1501207_ERR1430829_T1,preprocessed/assemblebac/assemblies/ERX1501207_ERR1430829_T1_contigs.fa -ERX1501208_ERR1430830_T1,preprocessed/assemblebac/assemblies/ERX1501208_ERR1430830_T1_contigs.fa -ERX1501212_ERR1430834_T1,preprocessed/assemblebac/assemblies/ERX1501212_ERR1430834_T1_contigs.fa -``` -To save time, we're only going to run `funcscan` on **five** assemblies so you'll need to edit the `samplesheet_funcscan.csv` file. 
You can do this in _Excel_, deleting all samples except the first five or use `head` to extract the header and five samples on the command line (you'll need to save the output with a different filename and then overwrite the original samplesheet with `mv`): - -```bash -head -n 6 samplesheet_funcscan.csv > samplesheet_funcscan.csv.1 - -mv samplesheet_funcscan.csv.1 samplesheet_funcscan.csv ``` Once we have the samplesheet ready, we can run the `nf-core/funcscan` workflow using the following commands: @@ -72,7 +71,7 @@ Once we have the samplesheet ready, we can run the `nf-core/funcscan` workflow u mamba activate nextflow # create output directory -mkdir results/funcscan +mkdir -p results/funcscan # run the pipeline nextflow run nf-core/funcscan \