Merge pull request #51 from pdimens/docs_dev
Docs dev
pdimens authored Feb 29, 2024
2 parents ba15ea0 + 34c83f8 commit 46d8557
Showing 13 changed files with 113 additions and 75 deletions.
21 changes: 11 additions & 10 deletions Modules/Align/bwa.md
@@ -18,19 +18,19 @@ such as those derived using `harpy qc`. You can map reads onto a genome assembly
using the `align` module:

```bash usage
harpy align bwa|ema OPTIONS...
harpy align bwa OPTIONS... INPUTS...
```
```bash example
harpy align bwa --genome genome.fasta --directory Sequences/
harpy align bwa --genome genome.fasta Sequences/
```

## :icon-terminal: Running Options
In addition to the [common runtime options](/commonoptions.md), the `harpy align bwa` module is configured using these command-line arguments:

| argument | short name | type | default | required | description |
|:-------------------|:----------:|:----------------------|:-------:|:--------:|:------------------------------------------------------|
| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments) |
| `--genome` | `-g` | file path | | **yes** | Genome assembly for read mapping |
| `--directory` | `-d` | folder path | | **yes** | Directory with sample sequences |
| `--molecule-distance` | `-m` | integer | 100000 | no | Base-pair distance threshold to separate molecules |
| `--quality-filter` | `-f` | integer (0-40) | 30 | no | Minimum `MQ` (SAM mapping quality) to pass filtering |
| `--method` | `-m` | choice [`bwa`, `ema`] | bwa | no | Which aligning software to use |
@@ -44,7 +44,13 @@ to assign alignments a unique Molecular Identifier `MI:i` tag based on their
what this value does.

## Quality filtering
==- What is a $MQ$ score?
The `--quality-filter` argument filters out alignments below a given $MQ$ threshold. The default, `30`, keeps alignments
that are at least 99.9% likely correctly mapped. Set this value to `1` if you only want alignments removed with
$MQ = 0$ (0% likely correct). You may also set it to `0` to keep all alignments for diagnostic purposes.
The plot below shows the relationship between $MQ$ score and the likelihood the alignment is correct and will serve to help you decide
on a value you may want to use. It is common to remove alignments with $MQ <30$ (<99.9% chance correct) or $MQ <40$ (<99.99% chance correct).

==- What is the $MQ$ score?
Every alignment in a BAM file has an associated mapping quality score ($MQ$) that informs you of the likelihood
that the alignment is accurate. This score can range from 0-40, where higher numbers mean the alignment is more
likely correct. The math governing the $MQ$ score actually calculates the percent chance the alignment is ***incorrect***:
@@ -58,14 +64,9 @@ $$
\text{or} \\
\%\ chance\ correct = (1 - 10^\frac{-MQ}{10}) \times 100
$$
===
The `--quality-filter` argument filters out alignments below a given $MQ$ threshold. The default, `30`, keeps alignments
that are at least 99.9% likely correctly mapped. Set this value to `1` if you only want alignments removed with
$MQ = 0$ (0% likely correct). You may also set it to `0` to keep all alignments for diagnostic purposes.
The plot below shows the relationship between $MQ$ score and the likelihood the alignment is correct and will serve to help you decide
on a value you may want to use. It is common to remove alignments with $MQ <30$ (<99.9% chance correct) or $MQ <40$ (<99.99% chance correct).

[!embed el="embed"](//plotly.com/~pdimens/7.embed)
===
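
As a quick check of the formula above, the following sketch converts an $MQ$ value into the percent chance that the alignment is correct. The function name `mq_to_percent_correct` is hypothetical and not part of Harpy; the sketch only assumes that a POSIX `awk` is available.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Harpy): convert a mapping quality (MQ)
# into the percent chance the alignment is correct, using
#   % correct = (1 - 10^(-MQ/10)) * 100
mq_to_percent_correct() {
    awk -v mq="$1" 'BEGIN { printf "%.2f\n", (1 - 10 ^ (-mq / 10)) * 100 }'
}

# Print the values for a few common thresholds
for mq in 0 10 20 30 40; do
    printf 'MQ=%s\t%s%% chance correct\n' "$mq" "$(mq_to_percent_correct "$mq")"
done
```

Running the loop for your own candidate thresholds may help when choosing a value for `--quality-filter`.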

## Marking PCR duplicates
Harpy uses `samtools markdup` to mark putative PCR duplicates. By using the `--barcode-tag BX`
21 changes: 11 additions & 10 deletions Modules/Align/ema.md
@@ -25,10 +25,10 @@ such as those derived using `harpy qc`. You can map reads onto a genome assembly
using the `align` module:

```bash usage
harpy align OPTIONS...
harpy align ema OPTIONS... INPUTS...
```
```bash example
harpy align ema --genome genome.fasta --directory Sequences/
harpy align ema --genome genome.fasta Sequences/
```


@@ -37,10 +37,10 @@ In addition to the [common runtime options](/commonoptions.md), the `harpy align

| argument | short name | type | default | required | description |
|:-------------------|:----------:|:----------------------|:-------:|:--------:|:-------------------------------------------------------------------|
| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments) |
| `--genome` | `-g` | file path | | **yes** | Genome assembly for read mapping |
| `--platform` | `-p` | string | haplotag | **yes** | Linked read technology: `haplotag` or `10x` |
| `--whitelist` | `-w` | file path | | no | Path to barcode whitelist (`--platform 10x` only) |
| `--directory` | `-d` | folder path | | **yes** | Directory with sample sequences |
| `--ema-bins` | `-e` | integer (1-1000) | 500 | no | Number of barcode bins for EMA |
| `--quality-filter` | `-f` | integer (0-40) | 30 | no | Minimum `MQ` (SAM mapping quality) to pass filtering |
| `--extra-params` | `-x` | string | | no | Additional EMA-align/BWA arguments, in quotes |
@@ -52,7 +52,13 @@ If you need to process 10x data, then you will need to include the whitelist fil
Conveniently, **haplotag data doesn't require this file**.

## Quality filtering
==- What is a $MQ$ score?
The `--quality-filter` argument filters out alignments below a given $MQ$ threshold. The default, `30`, keeps alignments
that are at least 99.9% likely correctly mapped. Set this value to `1` if you only want alignments removed with
$MQ = 0$ (0% likely correct). You may also set it to `0` to keep all alignments for diagnostic purposes.
The plot below shows the relationship between $MQ$ score and the likelihood the alignment is correct and will serve to help you decide
on a value you may want to use. It is common to remove alignments with $MQ <30$ (<99.9% chance correct) or $MQ <40$ (<99.99% chance correct).

==- What is the $MQ$ score?
Every alignment in a BAM file has an associated mapping quality score ($MQ$) that informs you of the likelihood
that the alignment is accurate. This score can range from 0-40, where higher numbers mean the alignment is more
likely correct. The math governing the $MQ$ score actually calculates the percent chance the alignment is ***incorrect***:
@@ -66,14 +72,9 @@ $$
\text{or} \\
\%\ chance\ correct = (1 - 10^\frac{-MQ}{10}) \times 100
$$
===
The `--quality-filter` argument filters out alignments below a given $MQ$ threshold. The default, `30`, keeps alignments
that are at least 99.9% likely correctly mapped. Set this value to `1` if you only want alignments removed with
$MQ = 0$ (0% likely correct). You may also set it to `0` to keep all alignments for diagnostic purposes.
The plot below shows the relationship between $MQ$ score and the likelihood the alignment is correct and will serve to help you decide
on a value you may want to use. It is common to remove alignments with $MQ <30$ (<99.9% chance correct) or $MQ <40$ (<99.99% chance correct).

[!embed el="embed"](//plotly.com/~pdimens/7.embed)
===
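
To illustrate what the threshold does, here is a minimal sketch that filters SAM-formatted text by mapping quality with `awk`. It is not a description of how Harpy performs the filtering internally, which is not shown here. The function name `filter_sam_by_mq` is hypothetical, and the only fact the sketch relies on is that the mapping quality is the fifth tab-separated column of a SAM record.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Harpy): keep SAM header lines and any
# alignment whose mapping quality (column 5) is >= the given threshold.
filter_sam_by_mq() {
    awk -F'\t' -v min="$1" '
        /^@/ { print; next }          # header lines pass through untouched
        ($5 + 0) >= (min + 0)         # alignments: print if MQ >= threshold
    '
}

# Toy input: one header line and three alignments with MQ 0, 29 and 30
{
    printf '@HD\tVN:1.6\tSO:coordinate\n'
    printf 'read1\t0\tchr1\t100\t0\t4M\t*\t0\t0\tACGT\tIIII\n'
    printf 'read2\t0\tchr1\t200\t29\t4M\t*\t0\t0\tACGT\tIIII\n'
    printf 'read3\t0\tchr1\t300\t30\t4M\t*\t0\t0\tACGT\tIIII\n'
} | filter_sam_by_mq 30
```

With a threshold of `1`, only the $MQ = 0$ record would be dropped, and with `0` every record is kept, mirroring the behaviour described above.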

## Marking PCR duplicates
EMA marks duplicates in the resulting alignments, however the read with invalid barcodes
12 changes: 6 additions & 6 deletions Modules/SV/leviathan.md
@@ -25,8 +25,8 @@ This file is optional and only useful if you want variant calling to happen on a
- spaces can be used as delimiters too
- the groups can be numbers or text (_i.e._ meaningful population names)
- you can comment out lines with `#` for Harpy to ignore them
- create with `harpy extra popgroup -d <samplefolder>` or manually
- if created with `harpy extra popgroup`, all the samples will be assigned to group `pop1`
- create with `harpy popgroup -d <samplefolder>` or manually
- if created with `harpy popgroup`, all the samples will be assigned to group `pop1`
- make sure to edit the second column to reflect your data correctly.

``` example file for --populations
@@ -47,25 +47,25 @@ from the sample names. A simple fix would be to use underscores (`_`) to differe

After reads have been aligned, _e.g._ with `harpy align`, you can use those alignment files
(`.bam`) to call structural variants in your data using LEVIATHAN. To make sure your data
will work seamlessly with LEVIATHAN, the alignments in the input BAM files should **end**
will work seamlessly with LEVIATHAN, the alignments in the [input BAM files](/commonoptions.md) should **end**
with a `BX:Z:AxxCxxBxxDxx` tag. Use `harpy preflight bam` if you want to double-check file
format validity.

```bash usage
harpy sv leviathan OPTIONS...
harpy sv leviathan OPTIONS... INPUTS...
```

```bash example
harpy sv leviathan --threads 20 --directory Align/bwa -g genome.fasta
harpy sv leviathan --threads 20 -g genome.fasta Align/bwa
```

## :icon-terminal: Running Options
In addition to the [common runtime options](/commonoptions.md), the `harpy sv leviathan` module is configured using these command-line arguments:

| argument | short name | type | default | required | description |
|:-----------------|:----------:|:--------------|:-------:|:--------:|:---------------------------------------------------|
| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input BAM files](/commonoptions.md#input-arguments) |
| `--genome` | `-g` | file path | | **conditionally** | Genome assembly for phasing bam files |
| `--directory` | `-d` | folder path | | **yes** | Directory with sequence alignments |
| `--populations` | `-p` | file path | | no | Tab-delimited file of sample\<*tab*\>group |
| `--extra-params` | `-x` | string | | no | Additional naibr arguments, in quotes |

14 changes: 7 additions & 7 deletions Modules/SV/naibr.md
@@ -20,8 +20,8 @@ This file is optional and only useful if you want variant calling to happen on a
- spaces can be used as delimiters too
- the groups can be numbers or text (_i.e._ meaningful population names)
- you can comment out lines with `#` for Harpy to ignore them
- create with `harpy extra popgroup -d <samplefolder>` or manually
- if created with `harpy extra popgroup`, all the samples will be assigned to group `pop1`
- create with `harpy popgroup -d <samplefolder>` or manually
- if created with `harpy popgroup`, all the samples will be assigned to group `pop1`
- make sure to edit the second column to reflect your data correctly.

``` example file for --populations
@@ -49,24 +49,24 @@ should already have that information (yay!). If your alignments don't have phasi
then you will need to do a little extra work for NAIBR to work best with your data. This process is described below.

```bash usage
harpy sv naibr OPTIONS...
harpy sv naibr OPTIONS... INPUTS...
```

```bash examples
# input bams already phased
harpy sv naibr --threads 20 --directory Align/bwa --genome genome.fasta
harpy sv naibr --threads 20 --genome genome.fasta Align/bwa

# input bams require phasing
harpy sv naibr --threads 20 --directory Align/bwa --genome genome.fasta --vcf Variants/data.vcf.gz
harpy sv naibr --threads 20 --genome genome.fasta --vcf Variants/data.vcf.gz Align/bwa
```

## :icon-terminal: Running Options
In addition to the [common runtime options](/commonoptions.md), the `harpy sv naibr` module is configured using these command-line arguments:

| argument | short name | type | default | required | description |
|:-----------------|:----------:|:--------------|:-------:|:--------:|:---------------------------------------------------|
| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input BAM files](/commonoptions.md#input-arguments) |
| `--genome` | `-g` | file path | | **yes** | Genome assembly for phasing bam files |
| `--directory` | `-d` | folder path | | **yes** | Directory with sequence alignments |
| `--vcf` | `-v` | file path | | **conditionally** | Phased vcf file for phasing bam files |
| `--molecule-distance` | `-m` | integer | 100000 | no | Base-pair distance threshold to separate molecules |
| `--populations` | `-p` | file path | | no | Tab-delimited file of sample\<*tab*\>group |
@@ -99,7 +99,7 @@ it too well.
In order to get the best variant calling performance out of NAIBR, it requires _phased_ bam files as input.
The `--vcf` option is optional and not used by NAIBR. However, to use `harpy sv naibr` with
bam files that are not phased, you will need to include `--vcf`, which Harpy uses with
`whatshap haplotag` to phase your input bam files prior to variant calling. See the [whatshap documentation](https://whatshap.readthedocs.io/en/latest/guide.html#whatshap-haplotag)
`whatshap haplotag` to phase your input BAM files prior to variant calling. See the [whatshap documentation](https://whatshap.readthedocs.io/en/latest/guide.html#whatshap-haplotag)
for more details on that process.
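
For orientation, the sketch below prints the kind of `whatshap haplotag` command that phases a single BAM file against a phased VCF. It is a dry run that only echoes the commands. The exact arguments Harpy passes are not shown here, so treat the chosen flags, the `phased/` output folder, and the function name `print_haplotag_cmds` as assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical dry run (not Harpy's actual workflow): print one
# `whatshap haplotag` command per BAM file found in a directory.
print_haplotag_cmds() {
    local genome="$1" vcf="$2" bamdir="$3" bam sample
    for bam in "$bamdir"/*.bam; do
        [ -e "$bam" ] || continue            # skip if the glob matched nothing
        sample="$(basename "$bam" .bam)"
        echo "whatshap haplotag --reference $genome -o phased/$sample.bam $vcf $bam"
    done
}

# Echo the commands for the paths used in the examples above
print_haplotag_cmds genome.fasta Variants/data.vcf.gz Align/bwa
```

Because the function only prints, it is safe to run before deciding whether to supply `--vcf` and let Harpy handle the phasing for you.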

#### a phased input --vcf
6 changes: 3 additions & 3 deletions Modules/demultiplex.md
@@ -20,17 +20,17 @@ should have been added during the sample DNA preparation in a laboratory. The de
haplotag technology you are using (read [Haplotag Types](#haplotag-types)).

```bash usage
harpy demultiplex OPTIONS...
harpy demultiplex OPTIONS... INPUT
```
```bash example
harpy demultiplex --threads 20 --file Plate_1_S001_R1.fastq.gz --samplesheet demux.schema
harpy demultiplex --threads 20 --samplesheet demux.schema Plate_1_S001_R1.fastq.gz
```
## :icon-terminal: Running Options
In addition to the [common runtime options](/commonoptions.md), the `harpy demultiplex` module is configured using these command-line arguments:

| argument | short name | type | default | required | description |
|:------------------|:----------:|:-----------|:-------:|:--------:|:-------------------------------------------------------------------------------------|
| `--file` | `-f` | file path | | **yes** | The forward (or reverse) multiplexed FASTQ file |
| `INPUT` | | file path | | **yes** | The forward (or reverse) multiplexed FASTQ file |
| `--samplesheet` | `-b` | file path | | **yes** | Tab-delimited file of sample\<tab\>barcode |
| `--method` | `-m` | choice | `gen1` | **yes** | Haplotag technology of the sequences |

10 changes: 5 additions & 5 deletions Modules/impute.md
@@ -28,24 +28,24 @@ most from your data. Harpy uses `STITCH` to impute genotypes, a haplotype-based
method that is linked-read aware. Imputing genotypes requires a variant call file
**containing SNPs**, such as that produced by `harpy variants`. You can impute genotypes with Harpy using the `impute` module:
```bash usage
harpy impute OPTIONS...
harpy impute OPTIONS... INPUTS...
```

```bash example
# create stitch parameter file 'stitch.params'
harpy stitchparams -o stitch.params

# run imputation
harpy impute --threads 20 --vcf Variants/mpileup/variants.raw.bcf --directory Align/ema --parameters stitch.params
harpy impute --threads 20 --vcf Variants/mpileup/variants.raw.bcf --parameters stitch.params Align/ema
```

## :icon-terminal: Running Options
In addition to the [common runtime options](/commonoptions.md), the `harpy impute` module is configured using these command-line arguments:

| argument | short name | type | default | required | description |
|:---------------|:----------:|:------------|:-------------:|:--------:|:------------------------------------------------------------------------------------------------|
| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input BAM files](/commonoptions.md) |
| `--vcf` | `-v` | file path | | **yes** | Path to VCF/BCF file |
| `--directory` | `-d` | folder path | | **yes** | Directory with sequence alignments |
| `--extra-params` | `-x` | folder path | | no | Extra arguments to add to the STITCH R function, provided in quotes and R syntax |
| `--vcf-samples`| | toggle | | no | [Use samples present in vcf file](#prioritize-the-vcf-file) for imputation rather than those found in the directory |
| `--parameters` | `-p` | file path | | **yes** | STITCH [parameter file](#parameter-file) (tab-delimited) |
@@ -57,7 +57,7 @@ syntax (e.g. `regionStart=0`, `populations=c("GBA","CUE")`). The argument should
however, if your additional parameters require the use of quotes (like the previous example), then wrap the `-x` argument
in **single quotes**. Otherwise, the format should take the form of `"arg1=value, arg2=value2"`. Example:
```bash
harpy impute -v file.vcf -p stitch.params -t 15 -x 'regionStart=20, regionEnd=500'
harpy impute -v file.vcf -p stitch.params -t 15 -x 'regionStart=20, regionEnd=500' Align/ema
```

### Prioritize the vcf file
@@ -227,7 +227,7 @@ Impute/
|:------------------------------------|:--------------------------------------------------------------------------|
| `logs/harpy.impute.log` | relevant runtime parameters for the phase module |
| `input/*.stitch` | biallelic SNPs used for imputation |
| `input/samples.list` | list of input BAM files |
| `input/samples.list` | list of [input BAM files](/commonoptions.md) |
| `input/samples.names` | list of sample names |
| `model*/concat.log` | output from bcftools concat to create final imputed bcf |
| `model*/variants.imputed.bcf` | final bcf file of imputed genotypes |
5 changes: 4 additions & 1 deletion Modules/othermodules.md
@@ -18,9 +18,12 @@ The arguments represent different sub-commands and can be run in any order or co

### popgroup
#### Sample grouping file for variant calling

```bash
harpy popgroup -o samples.groups data/
```
##### arguments
- `-o`, `--output`: name of the output file
- `-d`, `--directory`: name of the directory of input files, either fastq or bam.

This file is entirely optional and useful if you want SNP variant calling to happen on a
per-population level via `harpy snp ... -p` or on samples pooled-as-populations via `harpy sv ... -p`.
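
Before passing the file to `-p`, it can help to confirm how many samples ended up in each group. The following is a minimal sketch under the format described for `--populations`: a sample name followed by a group, separated by a tab or spaces, with `#` marking lines to ignore. The function name `count_groups` is hypothetical and not a Harpy command.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Harpy): count samples per group in a
# sample<tab>group file, ignoring lines commented out with '#'.
count_groups() {
    awk '
        /^#/ { next }                 # commented-out samples are skipped
        NF >= 2 { n[$2]++ }           # column 2 holds the group name
        END { for (g in n) printf "%s\t%d\n", g, n[g] }
    ' "$1" | sort
}

# Toy file: three samples listed under pop1 (one commented out), one under pop2
groups_file="$(mktemp)"
printf 'sample1\tpop1\n#sample2\tpop1\nsample3 pop1\nsample4\tpop2\n' > "$groups_file"
count_groups "$groups_file"
rm -f "$groups_file"
```

A file freshly created by `harpy popgroup` should report every sample under `pop1`, which is a reminder to edit the second column before use.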