Merge pull request #51 from pdimens/docs_dev

Docs dev
pdimens · Feb 29, 2024 · 46d8557
2 parents ba15ea0 + 34c83f8

Showing 13 changed files with 113 additions and 75 deletions.
diff --git a/Modules/Align/bwa.md b/Modules/Align/bwa.md
@@ -18,19 +18,19 @@ such as those derived using `harpy qc`. You can map reads onto a genome assembly
 using the `align` module:
 
 ```bash usage
-harpy align bwa|ema OPTIONS...
+harpy align bwa OPTIONS... INPUTS...
 ```
 ```bash example
-harpy align bwa --genome genome.fasta --directory Sequences/ 
+harpy align bwa --genome genome.fasta Sequences/ 
 ```
 
 ## :icon-terminal: Running Options
 In addition to the [common runtime options](/commonoptions.md), the `harpy align bwa` module is configured using these command-line arguments:
 
 | argument           | short name | type                  | default | required | description                                           |
 |:-------------------|:----------:|:----------------------|:-------:|:--------:|:------------------------------------------------------|
+| `INPUTS`           |            | file/directory paths  |         | **yes**  | Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments)     |
 | `--genome`         |    `-g`    | file path             |         | **yes**  | Genome assembly for read mapping                      |
-| `--directory`      |    `-d`    | folder path           |         | **yes**  | Directory with sample sequences                       |
 | `--molecule-distance` |    `-m`    | integer         |  100000  |    no    | Base-pair distance threshold to separate molecules      |
 | `--quality-filter` |    `-f`    | integer (0-40)        |   30    |    no    | Minimum `MQ` (SAM mapping quality) to pass filtering  |
 | `--method`         |    `-m`    | choice [`bwa`, `ema`] |   bwa   |    no    | Which aligning software to use                        |
@@ -44,7 +44,13 @@ to assign alignments a unique Molecular Identifier `MI:i` tag based on their
 what this value does. 
 
 ## Quality filtering
-==- What is a $MQ$ score?
+The `--quality-filter` argument filters out alignments below a given $MQ$ threshold. The default, `30`, keeps alignments
+that are at least 99.9% likely correctly mapped. Set this value to `1` if you only want alignments removed with
+$MQ = 0$ (0% likely correct). You may also set it to `0` to keep all alignments for diagnostic purposes.
+The plot below shows the relationship between $MQ$ score and the likelihood the alignment is correct and will serve to help you decide
+on a value you may want to use. It is common to remove alignments with $MQ <30$ (<99.9% chance correct) or $MQ <40$ (<99.99% chance correct).
+
+==- What is the $MQ$ score?
 Every alignment in a BAM file has an associated mapping quality score ($MQ$) that informs you of the likelihood 
 that the alignment is accurate. This score can range from 0-40, where higher numbers mean the alignment is more
 likely correct. The math governing the $MQ$ score actually calculates the percent chance the alignment is ***incorrect***: 
@@ -58,14 +64,9 @@ $$
 \text{or} \\
 \%\ chance\ correct = (1 - 10^\frac{-MQ}{10}) \times 100
 $$
-===
-The `--quality` argument filters out alignments below a given $MQ$ threshold. The default, `30`, keeps alignments
-that are at least 99.9% likely correctly mapped. Set this value to `1` if you only want alignments removed with
-$MQ = 0$ (0% likely correct). You may also set it to `0` to keep all alignments for diagnostic purposes.
-The plot below shows the relationship between $MQ$ score and the likelihood the alignment is correct and will serve to help you decide
-on a value you may want to use. It is common to remove alignments with $MQ <30$ (<99.9% chance correct) or $MQ <40$ (<99.99% chance correct).
 
 [!embed el="embed"](//plotly.com/~pdimens/7.embed)
+===
 
 ## Marking PCR duplicates
 Harpy uses `samtools markdup` to mark putative PCR duplicates. By using the `--barcode-tag BX`
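
Review note (illustrative, not part of the commit): the formula in the hunk above, `% chance correct = (1 - 10^(-MQ/10)) × 100`, can be sanity-checked against the thresholds quoted in the new "Quality filtering" paragraph. The sketch below assumes only a POSIX `awk`; the function name `mq_percent_correct` is hypothetical and is not a Harpy command.

```bash
#!/usr/bin/env bash
# Convert a SAM mapping quality (MQ) into the percent chance that the
# alignment is correct, using: (1 - 10^(-MQ/10)) * 100
# NOTE: mq_percent_correct is a hypothetical helper for illustration only.
mq_percent_correct() {
    awk -v mq="$1" 'BEGIN { printf "%.2f\n", (1 - 10^(-mq / 10)) * 100 }'
}

# Print a small lookup table for a few commonly discussed thresholds.
for mq in 0 1 10 20 30 40; do
    printf 'MQ=%s\t%s%%\n' "$mq" "$(mq_percent_correct "$mq")"
done
```

Running it shows why `30` and `40` correspond to the 99.9% and 99.99% figures in the text, and why `0` carries no confidence at all.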

diff --git a/Modules/Align/ema.md b/Modules/Align/ema.md
@@ -25,10 +25,10 @@ such as those derived using `harpy qc`. You can map reads onto a genome assembly
 using the `align` module:
 
 ```bash usage
-harpy align OPTIONS...
+harpy align ema OPTIONS... INPUTS...
 ```
 ```bash example
-harpy align ema --genome genome.fasta --directory Sequences/ 
+harpy align ema --genome genome.fasta Sequences/ 
 ```
 
 
@@ -37,10 +37,10 @@ In addition to the [common runtime options](/commonoptions.md), the `harpy align
 
 | argument           | short name | type                  | default | required | description                                                        |
 |:-------------------|:----------:|:----------------------|:-------:|:--------:|:-------------------------------------------------------------------|
+| `INPUTS`           |            | file/directory paths  |         | **yes**  | Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments)                  |
 | `--genome`         |    `-g`    | file path             |         | **yes**  | Genome assembly for read mapping                                   |
 | `--platform`       |    `-p`    | string                | haplotag | **yes** | Linked read technology: `haplotag` or `10x`                        |
 | `--whitelist`      |    `-w`    | file path             |         |    no    | Path to barcode whitelist (`--platform 10x` only)                  |
-| `--directory`      |    `-d`    | folder path           |         | **yes**  | Directory with sample sequences                                    |
 | `--ema-bins`       |    `-e`    | integer (1-1000)      |   500   |    no    | Number of barcode bins for EMA                                     |
 | `--quality-filter` |    `-f`    | integer (0-40)        |   30    |    no    | Minimum `MQ` (SAM mapping quality) to pass filtering               |
 | `--extra-params`   |    `-x`    | string                |         |    no    | Additional EMA-align/BWA arguments, in quotes                      |
@@ -52,7 +52,13 @@ If you need to process 10x data, then you will need to include the whitelist fil
 Conveniently, **haplotag data doesn't require this file**.
 
 ## Quality filtering
-==- What is a $MQ$ score?
+The `--quality-filter` argument filters out alignments below a given $MQ$ threshold. The default, `30`, keeps alignments
+that are at least 99.9% likely correctly mapped. Set this value to `1` if you only want alignments removed with
+$MQ = 0$ (0% likely correct). You may also set it to `0` to keep all alignments for diagnostic purposes.
+The plot below shows the relationship between $MQ$ score and the likelihood the alignment is correct and will serve to help you decide
+on a value you may want to use. It is common to remove alignments with $MQ <30$ (<99.9% chance correct) or $MQ <40$ (<99.99% chance correct).
+
+==- What is the $MQ$ score?
 Every alignment in a BAM file has an associated mapping quality score ($MQ$) that informs you of the likelihood 
 that the alignment is accurate. This score can range from 0-40, where higher numbers mean the alignment is more
 likely correct. The math governing the $MQ$ score actually calculates the percent chance the alignment is ***incorrect***: 
@@ -66,14 +72,9 @@ $$
 \text{or} \\
 \%\ chance\ correct = (1 - 10^\frac{-MQ}{10}) \times 100
 $$
-===
-The `--quality` argument filters out alignments below a given $MQ$ threshold. The default, `30`, keeps alignments
-that are at least 99.9% likely correctly mapped. Set this value to `1` if you only want alignments removed with
-$MQ = 0$ (0% likely correct). You may also set it to `0` to keep all alignments for diagnostic purposes.
-The plot below shows the relationship between $MQ$ score and the likelihood the alignment is correct and will serve to help you decide
-on a value you may want to use. It is common to remove alignments with $MQ <30$ (<99.9% chance correct) or $MQ <40$ (<99.99% chance correct).
 
 [!embed el="embed"](//plotly.com/~pdimens/7.embed)
+===
 
 ## Marking PCR duplicates
 EMA marks duplicates in the resulting alignments, however the read with invalid barcodes

diff --git a/Modules/SV/leviathan.md b/Modules/SV/leviathan.md
@@ -25,8 +25,8 @@ This file is optional and only useful if you want variant calling to happen on a
     - spaces can be used as delimiters too
 - the groups can be numbers or text (_i.e._ meaningful population names)
 - you can comment out lines with `#` for Harpy to ignore them
-- create with `harpy extra popgroup -d <samplefolder>` or manually
-- if created with `harpy extra popgroup`, all the samples will be assigned to group `pop1`
+- create with `harpy popgroup -d <samplefolder>` or manually
+- if created with `harpy popgroup`, all the samples will be assigned to group `pop1`
     - make sure to edit the second column to reflect your data correctly.
 
 ``` example file for --populations
@@ -47,25 +47,25 @@ from the sample names. A simple fix would be to use underscores (`_`) to differe
 
 After reads have been aligned, _e.g._ with `harpy align`, you can use those alignment files
 (`.bam`) to call structural variants in your data using LEVIATHAN. To make sure your data
-will work seamlessly with LEVIATHAN, the alignments in the input BAM files should **end**
+will work seamlessly with LEVIATHAN, the alignments in the [input BAM files](/commonoptions.md) should **end**
 with a `BX:Z:AxxCxxBxxDxx` tag. Use `harpy preflight bam` if you want to double-check file
 format validity.
 
 ```bash usage
-harpy sv leviathan OPTIONS... 
+harpy sv leviathan OPTIONS... INPUTS...
 ```
 
 ```bash example
-harpy sv leviathan --threads 20 --directory Align/bwa -g genome.fasta
+harpy sv leviathan --threads 20 -g genome.fasta Align/bwa
 ```
 
 ## :icon-terminal: Running Options
 In addition to the [common runtime options](/commonoptions.md), the `harpy sv leviathan` module is configured using these command-line arguments:
 
 | argument         | short name | type          | default | required | description                                        |
 |:-----------------|:----------:|:--------------|:-------:|:--------:|:---------------------------------------------------|
+| `INPUTS`         |            | file/directory paths  |         | **yes**  | Files or directories containing [input BAM files](/commonoptions.md#input-arguments)     |
 | `--genome`       |    `-g`    | file path     |         | **conditionally** | Genome assembly for phasing bam files     |
-| `--directory`    |    `-d`    | folder path   |         | **yes**           | Directory with sequence alignments                  |
 | `--populations`  |    `-p`    | file path     |         |    no             | Tab-delimited file of sample\<*tab*\>group         |
 | `--extra-params` |    `-x`    | string        |         |    no             | Additional naibr arguments, in quotes              |
 

diff --git a/Modules/SV/naibr.md b/Modules/SV/naibr.md
@@ -20,8 +20,8 @@ This file is optional and only useful if you want variant calling to happen on a
     - spaces can be used as delimiters too
 - the groups can be numbers or text (_i.e._ meaningful population names)
 - you can comment out lines with `#` for Harpy to ignore them
-- create with `harpy extra popgroup -d <samplefolder>` or manually
-- if created with `harpy extra popgroup`, all the samples will be assigned to group `pop1`
+- create with `harpy popgroup -d <samplefolder>` or manually
+- if created with `harpy popgroup`, all the samples will be assigned to group `pop1`
     - make sure to edit the second column to reflect your data correctly.
 
 ``` example file for --populations
@@ -49,24 +49,24 @@ should already have that information (yay!). If your alignments don't have phasi
 then you will need to do a little extra work for NAIBR to work best with your data. This process is described below.
 
 ```bash usage
-harpy sv naibr OPTIONS... 
+harpy sv naibr OPTIONS... INPUTS...
 ```
 
 ```bash examples
 # input bams already phased
-harpy sv naibr --threads 20 --directory Align/bwa --genome genome.fasta
+harpy sv naibr --threads 20 --genome genome.fasta Align/bwa
 
 # input bams require phasing
-harpy sv naibr --threads 20 --directory Align/bwa --genome genome.fasta --vcf Variants/data.vcf.gz
+harpy sv naibr --threads 20 --genome genome.fasta --vcf Variants/data.vcf.gz Align/bwa
 ```
 
 ## :icon-terminal: Running Options
 In addition to the [common runtime options](/commonoptions.md), the `harpy sv naibr` module is configured using these command-line arguments:
 
 | argument         | short name | type          | default | required | description                                        |
 |:-----------------|:----------:|:--------------|:-------:|:--------:|:---------------------------------------------------|
+| `INPUTS`         |            | file/directory paths  |         | **yes**  | Files or directories containing [input BAM files](/commonoptions.md#input-arguments)     |
 | `--genome`       |    `-g`    | file path     |         | **yes** | Genome assembly for phasing bam files     |
-| `--directory`    |    `-d`    | folder path   |         | **yes**           | Directory with sequence alignments                  |
 | `--vcf`          |    `-v`    | file path     |         | **conditionally** | Phased vcf file for phasing bam files     |
 | `--molecule-distance` |  `-m` | integer       |  100000 |    no             | Base-pair distance threshold to separate molecules |
 | `--populations`  |    `-p`    | file path     |         |    no             | Tab-delimited file of sample\<*tab*\>group         |
@@ -99,7 +99,7 @@ it too well.
 In order to get the best variant calling performance out of NAIBR, it requires _phased_ bam files as input. 
 The `--vcf` option is optional and not used by NAIBR. However, to use `harpy sv naibr` with
 bam files that are not phased, you will need to include `--vcf`, which Harpy uses with 
-`whatshap haplotag` to phase your input bam files prior to variant calling. See the [whatshap documentation](https://whatshap.readthedocs.io/en/latest/guide.html#whatshap-haplotag)
+`whatshap haplotag` to phase your input BAM files prior to variant calling. See the [whatshap documentation](https://whatshap.readthedocs.io/en/latest/guide.html#whatshap-haplotag)
 for more details on that process.
 
 #### a phased input --vcf

diff --git a/Modules/demultiplex.md b/Modules/demultiplex.md
@@ -20,17 +20,17 @@ should have been added during the sample DNA preparation in a laboratory. The de
 haplotag technology you are using (read [Haplotag Types](#haplotag-types)).
 
 ```bash usage
-harpy demultiplex OPTIONS... 
+harpy demultiplex OPTIONS... INPUT
 ```
 ```bash example
-harpy demultiplex --threads 20 --file Plate_1_S001_R1.fastq.gz --samplesheet demux.schema
+harpy demultiplex --threads 20 --samplesheet demux.schema Plate_1_S001_R1.fastq.gz
 ```
 ## :icon-terminal: Running Options
 In addition to the [common runtime options](/commonoptions.md), the `harpy demultiplex` module is configured using these command-line arguments:
 
 | argument          | short name | type       | default | required | description                                                                          |
 |:------------------|:----------:|:-----------|:-------:|:--------:|:-------------------------------------------------------------------------------------|
-| `--file`          |    `-f`    | file path  |         | **yes**  | The forward (or reverse) multiplexed FASTQ file                                      |
+| `INPUT`           |            | file path  |         | **yes**  | The forward (or reverse) multiplexed FASTQ file                                      |
 | `--samplesheet`   |    `-b`    | file path  |         | **yes**  | Tab-delimited file of sample\<tab\>barcode                                           |
 | `--method`        |    `-m`    | choice     | `gen1`  | **yes**  | Haplotag technology of the sequences                                                 |
 

diff --git a/Modules/impute.md b/Modules/impute.md
@@ -28,24 +28,24 @@ most from your data. Harpy uses `STITCH` to impute genotypes, a haplotype-based
 method that is linked-read aware. Imputing genotypes requires a variant call file 
 **containing SNPs**, such as that produced by `harpy variants`. You can impute genotypes with Harpy using the `impute` module:
 ```bash usage
-harpy impute OPTIONS...
+harpy impute OPTIONS... INPUTS...
 ```
 
 ```bash example
 # create stitch parameter file 'stitch.params'
 harpy stitchparams -o stitch.params 
 
 # run imputation
-harpy impute --threads 20 --vcf Variants/mpileup/variants.raw.bcf --directory Align/ema --parameters stitch.params
+harpy impute --threads 20 --vcf Variants/mpileup/variants.raw.bcf --parameters stitch.params Align/ema
 ```
 
 ## :icon-terminal: Running Options
 In addition to the [common runtime options](/commonoptions.md), the `harpy impute` module is configured using these command-line arguments:
 
 | argument       | short name | type        |    default    | required | description                                                                                     |
 |:---------------|:----------:|:------------|:-------------:|:--------:|:------------------------------------------------------------------------------------------------|
+| `INPUTS`       |            | file/directory paths  |         | **yes**  | Files or directories containing [input BAM files](/commonoptions.md)     |
 | `--vcf`        |    `-v`    | file path   |               | **yes**  | Path to VCF/BCF file                                                                            |
-| `--directory`  |    `-d`    | folder path |               | **yes**  | Directory with sequence alignments                                                              |
 | `--extra-params` |  `-x`    | folder path |               |  no      | Extra arguments to add to the STITCH R function, provided in quotes and R syntax                |
 | `--vcf-samples`|            |    toggle   |               | no       | [Use samples present in vcf file](#prioritize-the-vcf-file) for imputation rather than those found in the directory    |
 | `--parameters` |    `-p`    | file path   |               | **yes**  | STITCH [parameter file](#parameter-file) (tab-delimited)                                                           |
@@ -57,7 +57,7 @@ syntax (e.g. `regionStart=0`, `populations=c("GBA","CUE")`). The argument should
 however, if your additional parameters require the use of quotes (like the previous example), then wrap the `-x` argument 
 in **single quotes**. Otherwise, the format should take the form of `"arg1=value, arg2=value2"`. Example:
 ```bash
-harpy impute -v file.vcf -p stitch.params -t 15 -x 'regionStart=20, regionEnd=500'
+harpy impute -v file.vcf -p stitch.params -t 15 -x 'regionStart=20, regionEnd=500' Align/ema
 ```
 
 ### Prioritize the vcf file
@@ -227,7 +227,7 @@ Impute/
 |:------------------------------------|:--------------------------------------------------------------------------|
 | `logs/harpy.impute.log`             | relevant runtime parameters for the phase module                          |
 | `input/*.stitch`                    | biallelic SNPs used for imputation                                        |
-| `input/samples.list`                | list of input BAM files                                                   |
+| `input/samples.list`                | list of [input BAM files](/commonoptions.md)                                                   |
 | `input/samples.names`               | list of sample names                                                      |
 | `model*/concat.log`                 | output from bcftools concat to create final imputed bcf                   |
 | `model*/variants.imputed.bcf`       | final bcf file of imputed genotypes                                       |

diff --git a/Modules/othermodules.md b/Modules/othermodules.md
@@ -18,9 +18,12 @@ The arguments represent different sub-commands and can be run in any order or co
 
 ### popgroup
 #### Sample grouping file for variant calling
+
+```bash
+harpy popgroup -o samples.groups data/
+```
 ##### arguments
 - `-o`, `--output`: name of the output file
-- `-d`, `--directory`: name of the directory of input files, either fastq or bam.
 
 This file is entirely optional and useful if you want SNP variant calling to happen on a
 per-population level via `harpy snp ... -p` or on samples pooled-as-populations via `harpy sv ... -p`.
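
Review note (illustrative, not part of the commit): for readers who create the grouping file manually, the sketch below shows the general shape described in these docs, a tab-delimited `sample<tab>group` file where every sample starts in `pop1`. It is an assumption-laden stand-in, not Harpy's implementation: the function name `make_popgroup` is hypothetical, it only looks at files ending in `.bam`, and it takes the sample name to be the file name without that extension.

```bash
#!/usr/bin/env bash
# Write a tab-delimited sample<TAB>group file for every .bam file in a folder.
# NOTE: make_popgroup is a hypothetical helper, not part of Harpy; it assumes
# sample names are the BAM file names with the ".bam" extension removed.
make_popgroup() {
    indir="$1"
    outfile="$2"
    for f in "$indir"/*.bam; do
        # Skip the literal pattern when the folder has no .bam files.
        [ -e "$f" ] || continue
        printf '%s\tpop1\n' "$(basename "$f" .bam)"
    done > "$outfile"
}

# Example call (paths are placeholders):
#   make_popgroup Align/bwa samples.groups
```

As the docs note for the real command, the second column would still need editing by hand so the groups reflect the actual populations.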