Skip to content

Commit

Permalink
Merge pull request #66 from pdimens/docs_dev
Browse files Browse the repository at this point in the history
update to 0.9x
  • Loading branch information
pdimens authored Apr 3, 2024
2 parents 43f435f + fab71bc commit c5be001
Show file tree
Hide file tree
Showing 5 changed files with 460 additions and 1 deletion.
1 change: 1 addition & 0 deletions Modules/Align/ema.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@ order: 5
- at least 4 cores/threads available
- a genome assembly in FASTA format
- paired-end fastq sequence file with the [proper naming convention](/haplotagdata/#naming-conventions) (gzipped recommended)
- patience
==- Why EMA?
The original haplotag manuscript uses BWA to map reads. The authors have since recommended
the use of EMA (EMerald Aligner) for most applications. EMA is barcode-aware,
Expand Down
194 changes: 194 additions & 0 deletions Modules/Align/minimap.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,194 @@
---
label: Minimap
description: Align haplotagged sequences with Minimap2
icon: dot
order: 5
---

# :icon-quote: Map Reads onto a genome with Minimap2
=== :icon-checklist: You will need
- at least 4 cores/threads available
- a genome assembly in FASTA format
- paired-end fastq sequence file with the [proper naming convention](/haplotagdata/#naming-conventions) (gzipped recommended)
===

Once sequences have been trimmed and passed through other QC filters, they will need to
be aligned to a reference genome. This module within Harpy expects filtered reads as input,
such as those derived using `harpy qc`. You can map reads onto a genome assembly with Harpy
using the `align` module:

```bash usage
harpy align minimap OPTIONS... INPUTS...
```
```bash example
harpy align minimap --genome genome.fasta Sequences/
```

## :icon-terminal: Running Options
In addition to the [common runtime options](/commonoptions.md), the `harpy align minimap` module is configured using these command-line arguments:

| argument | short name | type | default | required | description |
|:-------------------|:----------:|:----------------------|:-------:|:--------:|:------------------------------------------------------|
| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments) |
| `--genome` | `-g` | file path | | **yes** | Genome assembly for read mapping |
| `--molecule-distance` | `-m` | integer | 100000 | no | Base-pair distance threshold to separate molecules |
| `--quality-filter` | `-f` | integer (0-40) | 30 | no | Minimum `MQ` (SAM mapping quality) to pass filtering |
| `--method` | `-m` | choice [`bwa`, `ema`] | bwa | no | Which aligning software to use |
| `--extra-params` | `-x` | string | | no | Additional EMA-align/BWA arguments, in quotes |

### Molecule distance
The `--molecule-distance` option is used during the BWA alignment workflow
to assign alignments a unique Molecular Identifier `MI:i` tag based on their
haplotag barcode and the distance threshold you specify. See
[haplotag data](/haplotagdata/#barcode-thresholds) for more information on
what this value does.

## Quality filtering
The `--quality` argument filters out alignments below a given $MQ$ threshold. The default, `30`, keeps alignments
that are at least 99.9% likely correctly mapped. Set this value to `1` if you only want alignments removed with
$MQ = 0$ (0% likely correct). You may also set it to `0` to keep all alignments for diagnostic purposes.
The plot below shows the relationship between $MQ$ score and the likelihood the alignment is correct and will serve to help you decide
on a value you may want to use. It is common to remove alignments with $MQ <30$ (<99.9% chance correct) or $MQ <40$ (<99.99% chance correct).

==- What is the $MQ$ score?
Every alignment in a BAM file has an associated mapping quality score ($MQ$) that informs you of the likelihood
that the alignment is accurate. This score can range from 0-40, where higher numbers mean the alignment is more
likely correct. The math governing the $MQ$ score actually calculates the percent chance the alignment is ***incorrect***:
$$
\%\ chance\ incorrect = 10^\frac{-MQ}{10} \times 100\\
\text{where }0\le MQ\le 40
$$
You can simply subtract it from 100 to determine the percent chance the alignment is ***correct***:
$$
\%\ chance\ correct = 100 - \%\ chance\ incorrect\\
\text{or} \\
\%\ chance\ correct = (1 - 10^\frac{-MQ}{10}) \times 100
$$

[!embed el="embed"](//plotly.com/~pdimens/7.embed)
===

## Marking PCR duplicates
Harpy uses `samtools markdup` to mark putative PCR duplicates. By using the `--barcode-tag BX`
option, it considers the linked-read barcode for more accurate duplicate detection. Duplicate
marking also uses the `-S` option to mark supplementary (chimeric) alignments as duplicates
if the primary alignment was marked as a duplicate. Duplicates get marked but **are not removed**.

----

## :icon-git-pull-request: Minimap workflow
+++ :icon-git-merge: details
- ignores (but retains) barcode information
- ultra-fast
- comparable accuracy to BWA MEM for sequences greater than 100bp
- accuracy may be lower for sequences less than 100bp

The [minimap2](https://github.com/lh3/minimap2) workflow is nearly identical to the BWA workflow,
the only real difference being how the input genome is indexed and that alignment is performed with
`minimap2` instead of BWA. Duplicates are marked using `samtools markdup`.
The `BX:Z` tags in the read headers are still added to the alignment headers, even though barcodes
are not used to inform mapping. The `-m` threshold is used for alignment molecule assignment.

```mermaid
graph LR
Z([trimmed reads]) --> B
A([index genome]) --> B([align to genome])
B-->C([sort alignments])
C-->D([mark duplicates])
D-->E([assign molecules])
E-->F([alignment metrics])
D-->G([barcode stats])
G-->F
subgraph markdp [mark duplicates via `samtools`]
direction LR
collate-->fixmate
fixmate-->sort
sort-->markdup
end
```
+++ :icon-file-directory: minimap2 output
The default output directory is `Align/minimap` with the folder structure below. `Sample1` is a generic sample name for demonstration purposes.
The resulting folder also includes a `workflow` directory (not shown) with workflow-relevant runtime files and information.
```
Align/bwa
├── Sample1.bam
├── Sample1.bam.bai
├── logs
│ └── markduplicates
│    └── Sample1.markdup.log
└── reports
├── minimap.stats.html
├── BXstats
│   ├── Sample1.bxstats.html
│ └── data
│ └── Sample1.bxstats.gz
└── coverage
   ├── Sample1.gencov.html
└── data
└── Sample1.gencov.gz
```

| item | description |
|:---------|:------------------------------------------------------------------------------------------------------------|
| `*.bam` | sequence alignments for each sample |
| `*.bai` | sequence alignment indexes for each sample |
| `logs/markduplicates` | stats provided by `samtools markdup` |
| `reports/` | various counts/statistics/reports relating to sequence alignment |
| `reports/minimap.stats.html` | report summarizing `samtools flagstat and stats` results across all samples from `multiqc` |
| `reports/reads.bxstats.html` | interactive html report summarizing valid vs invalid barcodes across all samples |
| `reports/BXstats/*.bxstats.html` | interactive html report summarizing inferred molecule size |
| `reports/coverage/*.html` | summary plots of alignment coverage per contig |
| `reports/coverage/data/*.gencov.gz` | output from samtools bedcov from all alignments, used for plots |
| `reports/BXstats/` | reports summarizing molecule size and reads per molecule |
| `reports/BXstats/data/` | tabular data containing the information used to generate the BXstats reports |

+++ :icon-code-square: minimap2 parameters
By default, Harpy runs `minimap2` with these parameters (excluding inputs and outputs):
```bash
minimap2 -ax sr -y --sam-hit-only -R \"@RG\\tID:samplename\\tSM:samplename\"
```

The Minimap2 aligner has a lot of parameters that can be provided, too many to list here, so please refer to the
[Minimap2 documentation](https://lh3.github.io/minimap2/minimap2.html). The `-a` indicates to output a SAM format
file and the `-x sr` argument is a preset for short reads with these parameters:
- `-k21`
- `-w11`
- `--sr`
- `--frag=yes`
- `-A2`
- `-B8`
- `-O12,32`
- `-E2,1`
- `-b0`
- `-r100`
- `-p.5`
- `-N20`
- `-f1000,5000`
- `-n2`
- `-m20`
- `-s40`
- `-g100`
- `-2K50m`
- `--heap-sort=yes`
- `--secondary=no`

+++ :icon-graph: reports
These are the summary reports Harpy generates for this workflow. You may right-click
the images and open them in a new tab if you wish to see the examples in better detail.
||| Depth and coverage
Reports the depth of alignments in 10kb windows.
![reports/coverage/*.html](/static/report_align_coverage.png)
||| BX validation
Reports the number of valid/invalid barcodes in the alignments.
![reports/reads.bxstats.html](/static/report_align_bxstats.png)
||| Molecule size
Reports the inferred molecule sized based on barcodes in the alignments.
![reports/BXstats/*.bxstats.html](/static/report_align_bxmol.png)
||| Alignment stats
Reports the general statistics computed by samtools `stats` and `flagstat`
![reports/samtools_*stat/*html](/static/report_align_flagstat.png)
|||

+++
2 changes: 2 additions & 0 deletions Modules/Simulate/index.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
icon: flame
order: 5
Loading

0 comments on commit c5be001

Please sign in to comment.