small additions and bugs
tavareshugo committed Dec 12, 2023
1 parent a00f7dc commit 41a49a9
Showing 4 changed files with 49 additions and 34 deletions.
1 change: 1 addition & 0 deletions materials/23-panaroo.md
@@ -212,6 +212,7 @@ The fixed script is:
#!/bin/bash

# create output directory
mkdir -p results/snp-sites/
mkdir -p results/iqtree/

# extract variable sites
43 changes: 23 additions & 20 deletions materials/27-run_bactmap.md
@@ -20,19 +20,30 @@ Remember, the first step of any analysis of a new sequence dataset is to perform

#### Running nf-core/bactmap

Your next task is to run the **bactmap** pipeline on the _S. pneumoniae_ data. In the folder `scripts` (within the `S_pneumoniae` analysis directory) you will find a script named `01-run_bactmap.sh`. This script contains the code to run bactmap.

- First, create a `samplesheet.csv` file for `bactmap`. Refer back to the [bacQC pipeline](07-bacqc.md#prepare-a-samplesheet) page for how to do this, if you've forgotten.

- Edit the `01-run_bactmap.sh` script, adjusting it to fit your input files and the name and location of the reference you're going to map to (Hint: the reference sequence is located in `resources/reference`).

- Activate the `nextflow` software environment.

- Run the script using `bash scripts/01-run_bactmap.sh`.

- Have a look at the MultiQC report. Do any of the samples look to be of poor quality?

:::{.callout-answer}

First, we created our samplesheet using the helper Python script, as explained for the bacQC pipeline:

```bash
python scripts/fastq_dir_to_samplesheet.py data/reads \
samplesheet.csv \
-r1 _1.fastq.gz \
-r2 _2.fastq.gz
```
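
If all went well, the first lines of the resulting `samplesheet.csv` should look something like this (the exact sample names come from your `data/reads` folder; the ones shown here are illustrative):

```
sample,fastq_1,fastq_2
ERX1265396_ERR1192012,data/reads/ERX1265396_ERR1192012_1.fastq.gz,data/reads/ERX1265396_ERR1192012_2.fastq.gz
```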

Next, we fixed the script:

```bash
#!/bin/bash
@@ -46,28 +57,19 @@ nextflow run nf-core/bactmap \
--genome_size 2.0M
```

Then, we activated the `nextflow` environment:

```bash
mamba activate nextflow
```


Finally, we ran the script as instructed using:

```bash
bash scripts/01-run_bactmap.sh
```

While it was running, it printed a message on the screen:

```bash
N E X T F L O W ~ version 23.04.1
@@ -84,21 +86,22 @@ Launching `https://github.com/nf-core/bactmap` [cranky_swartz] DSL2 - revision:
------------------------------------------------------
```

The results for all the samples looked really good, so we can keep all of them for the next steps of our analyses.

:::
:::

:::{.callout-exercise}
#### How much of the reference was mapped?

- Activate the `seqtk` software environment.
- Run the `02-pseudogenome_check.sh` script we've provided in the `scripts` folder, which calculates how much missing data there is for each sample using `seqtk comp`.
- Once the analysis finishes, open the `mapping_summary.tsv` file in _Excel_ from your file browser <i class="fa-solid fa-folder"></i>.
- Sort the results by the `%ref mapped` column and identify the sample which has the lowest percentage of the reference mapped.
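
The calculation behind the script can be sketched as follows. `seqtk comp` reports, for each sequence, its length and the counts of each base; positions that are not A, C, G or T (the `N`s at unmapped sites) are missing data. Assuming the length is in column 2 and the A/C/G/T counts are in columns 3-6 (the real script may handle the output differently), the percentage of the reference mapped can be computed with `awk` on a mocked `seqtk comp` line:

```bash
# One mocked line of `seqtk comp` output: name, length, #A, #C, #G, #T
# (real output has further columns, which awk simply ignores)
printf 'ERX1265396_ERR1192012\t2000000\t450000\t400000\t420000\t430000\n' |
  awk '{ printf "%s\t%.1f\n", $1, ($3 + $4 + $5 + $6) / $2 * 100 }'
```

Here 1.7 Mb of the 2 Mb pseudogenome consists of called bases, so this sample would show 85.0 in the `%ref mapped` column.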

:::{.callout-answer}
- We activated the software environment: `mamba activate seqtk`
- We then ran the script using `bash scripts/02-pseudogenome_check.sh`. The script prints a message while it's running:

```bash
Processing ERX1265396_ERR1192012.fas
12 changes: 12 additions & 0 deletions materials/31-intro_amr.md
@@ -26,6 +26,18 @@ Numerous software tools have been created to predict the presence of genes linke
Estimating the function of a gene or protein solely from its sequence is complex, leading to varying outcomes across different software tools.
It is advisable to employ multiple tools and compare their findings, thus increasing our confidence in identifying which antimicrobial drugs might be more effective for treating patients infected with the strains we're studying.

:::{.callout-warning}
#### Do not use reference-based pseudogenomes for AMR analysis

Genome consensus sequences, obtained by reference-based alignment, are a fast way to identify mutations across a large number of isolates.
However, these pseudogenomes are biased towards the reference genome used.
For example, if a particular sequence is missing from the reference genome, it will not be present in the pseudogenome.
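
This can be illustrated with a toy example (the sequences below are made up for illustration, not real genomes):

```python
# A reference-based pseudogenome has exactly one base per reference
# position, so it can never contain sequence that the reference lacks.
reference = "ATGCATGCAT"

# This isolate carries an extra sequence absent from the reference
# (think of an acquired AMR gene):
amr_gene = "GGGTTTAAA"
isolate = reference + amr_gene

# Consensus calling keeps one base per reference position:
pseudogenome = isolate[:len(reference)]
assert amr_gene not in pseudogenome  # the extra gene is invisible

# A de novo assembly is built from the reads themselves, so the extra
# sequence can be recovered (represented here by the isolate sequence):
assembly = isolate
assert amr_gene in assembly
```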

The bias created by using a reference genome is tolerable for phylogenetic applications.
However, when our goal is to find antimicrobial resistance factors, we need as complete a genome as possible for each isolate.
Therefore, for **AMR scans we should use a pipeline such as `avantonder/assembleBAC` for _de novo_ genome assembly**.
:::

## Summary

::: {.callout-tip}
27 changes: 13 additions & 14 deletions materials/32-command_line_amr.md
@@ -42,7 +42,18 @@ Two columns are required:
- `sample` --> a sample name of our choice (we will use the same name that we used for the assembly).
- `fasta` --> the path to the FASTA file corresponding to that sample.

You can create this file using spreadsheet software such as _Excel_, making sure to save the file as a CSV.
To get you started, we can save a list of our assembly file names into a file:

```bash
ls preprocessed/assemblebac/assemblies/*.fa | head -n 5 > samplesheet_funcscan.csv
```

In this case, we only list the first **five** files (`head -n 5`), to save time when running the pipeline.
For your own data, you should include all the files.

We then open this file in _Excel_ and edit it further to have the two columns explained above.
Here is our final samplesheet:

```
sample,fasta
@@ -51,18 +62,6 @@ ERX1265488_ERR1192104_T1,preprocessed/assemblebac/assemblies/ERX1265488_ERR11921
ERX1501202_ERR1430824_T1,preprocessed/assemblebac/assemblies/ERX1501202_ERR1430824_T1_contigs.fa
ERX1501203_ERR1430825_T1,preprocessed/assemblebac/assemblies/ERX1501203_ERR1430825_T1_contigs.fa
ERX1501204_ERR1430826_T1,preprocessed/assemblebac/assemblies/ERX1501204_ERR1430826_T1_contigs.fa
```
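
As an alternative to editing the file in _Excel_, the sample column can also be derived on the command line in one go. This sketch assumes each assembly is named `<sample>_contigs.fa`, as in the example above (adjust the suffix if your files differ):

```bash
# Build the two-column samplesheet directly from the file listing:
# strip the directory and the "_contigs.fa" suffix to get the sample name
{
  echo "sample,fasta"
  ls preprocessed/assemblebac/assemblies/*.fa | head -n 5 |
    awk -F/ '{ s = $NF; sub(/_contigs\.fa$/, "", s); print s "," $0 }'
} > samplesheet_funcscan.csv
```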

Once we have the samplesheet ready, we can run the `nf-core/funcscan` workflow using the following commands:
@@ -72,7 +71,7 @@ Once we have the samplesheet ready, we can run the `nf-core/funcscan` workflow u
mamba activate nextflow

# create output directory
mkdir -p results/funcscan

# run the pipeline
nextflow run nf-core/funcscan \
