diff --git a/Modules/Align/bwa.md b/Modules/Align/bwa.md index 65a6e3e88..34db2471a 100644 --- a/Modules/Align/bwa.md +++ b/Modules/Align/bwa.md @@ -18,10 +18,10 @@ such as those derived using `harpy qc`. You can map reads onto a genome assembly using the `align` module: ```bash usage -harpy align bwa|ema OPTIONS... +harpy align bwa OPTIONS... INPUTS... ``` ```bash example -harpy align bwa --genome genome.fasta --directory Sequences/ +harpy align bwa --genome genome.fasta Sequences/ ``` ## :icon-terminal: Running Options @@ -29,8 +29,8 @@ In addition to the [common runtime options](/commonoptions.md), the `harpy align | argument | short name | type | default | required | description | |:-------------------|:----------:|:----------------------|:-------:|:--------:|:------------------------------------------------------| +| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments) | | `--genome` | `-g` | file path | | **yes** | Genome assembly for read mapping | -| `--directory` | `-d` | folder path | | **yes** | Directory with sample sequences | | `--molecule-distance` | `-m` | integer | 100000 | no | Base-pair distance threshold to separate molecules | | `--quality-filter` | `-f` | integer (0-40) | 30 | no | Minimum `MQ` (SAM mapping quality) to pass filtering | | `--method` | `-m` | choice [`bwa`, `ema`] | bwa | no | Which aligning software to use | @@ -44,7 +44,13 @@ to assign alignments a unique Molecular Identifier `MI:i` tag based on their what this value does. ## Quality filtering -==- What is a $MQ$ score? +The `--quality` argument filters out alignments below a given $MQ$ threshold. The default, `30`, keeps alignments +that are at least 99.9% likely correctly mapped. Set this value to `1` if you only want alignments removed with +$MQ = 0$ (0% likely correct). You may also set it to `0` to keep all alignments for diagnostic purposes. 
+The plot below shows the relationship between $MQ$ score and the likelihood the alignment is correct and will serve to help you decide +on a value you may want to use. It is common to remove alignments with $MQ <30$ (<99.9% chance correct) or $MQ <40$ (<99.99% chance correct). + +==- What is the $MQ$ score? Every alignment in a BAM file has an associated mapping quality score ($MQ$) that informs you of the likelihood that the alignment is accurate. This score can range from 0-40, where higher numbers mean the alignment is more likely correct. The math governing the $MQ$ score actually calculates the percent chance the alignment is ***incorrect***: @@ -58,14 +64,9 @@ $$ \text{or} \\ \%\ chance\ correct = (1 - 10^\frac{-MQ}{10}) \times 100 $$ -=== -The `--quality` argument filters out alignments below a given $MQ$ threshold. The default, `30`, keeps alignments -that are at least 99.9% likely correctly mapped. Set this value to `1` if you only want alignments removed with -$MQ = 0$ (0% likely correct). You may also set it to `0` to keep all alignments for diagnostic purposes. -The plot below shows the relationship between $MQ$ score and the likelihood the alignment is correct and will serve to help you decide -on a value you may want to use. It is common to remove alignments with $MQ <30$ (<99.9% chance correct) or $MQ <40$ (<99.99% chance correct). [!embed el="embed"](//plotly.com/~pdimens/7.embed) +=== ## Marking PCR duplicates Harpy uses `samtools markdup` to mark putative PCR duplicates. By using the `--barcode-tag BX` diff --git a/Modules/Align/ema.md b/Modules/Align/ema.md index ad621ba9b..5877d6cf8 100644 --- a/Modules/Align/ema.md +++ b/Modules/Align/ema.md @@ -25,10 +25,10 @@ such as those derived using `harpy qc`. You can map reads onto a genome assembly using the `align` module: ```bash usage -harpy align OPTIONS... +harpy align ema OPTIONS... INPUTS... 
``` ```bash example -harpy align ema --genome genome.fasta --directory Sequences/ +harpy align ema --genome genome.fasta Sequences/ ``` @@ -37,10 +37,10 @@ In addition to the [common runtime options](/commonoptions.md), the `harpy align | argument | short name | type | default | required | description | |:-------------------|:----------:|:----------------------|:-------:|:--------:|:-------------------------------------------------------------------| +| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments) | | `--genome` | `-g` | file path | | **yes** | Genome assembly for read mapping | | `--platform` | `-p` | string | haplotag | **yes** | Linked read technology: `haplotag` or `10x` | | `--whitelist` | `-w` | file path | | no | Path to barcode whitelist (`--platform 10x` only) | -| `--directory` | `-d` | folder path | | **yes** | Directory with sample sequences | | `--ema-bins` | `-e` | integer (1-1000) | 500 | no | Number of barcode bins for EMA | | `--quality-filter` | `-f` | integer (0-40) | 30 | no | Minimum `MQ` (SAM mapping quality) to pass filtering | | `--extra-params` | `-x` | string | | no | Additional EMA-align/BWA arguments, in quotes | @@ -52,7 +52,13 @@ If you need to process 10x data, then you will need to include the whitelist fil Conveniently, **haplotag data doesn't require this file**. ## Quality filtering -==- What is a $MQ$ score? +The `--quality` argument filters out alignments below a given $MQ$ threshold. The default, `30`, keeps alignments +that are at least 99.9% likely correctly mapped. Set this value to `1` if you only want alignments removed with +$MQ = 0$ (0% likely correct). You may also set it to `0` to keep all alignments for diagnostic purposes. +The plot below shows the relationship between $MQ$ score and the likelihood the alignment is correct and will serve to help you decide +on a value you may want to use. 
It is common to remove alignments with $MQ <30$ (<99.9% chance correct) or $MQ <40$ (<99.99% chance correct). + +==- What is the $MQ$ score? Every alignment in a BAM file has an associated mapping quality score ($MQ$) that informs you of the likelihood that the alignment is accurate. This score can range from 0-40, where higher numbers mean the alignment is more likely correct. The math governing the $MQ$ score actually calculates the percent chance the alignment is ***incorrect***: @@ -66,14 +72,9 @@ $$ \text{or} \\ \%\ chance\ correct = (1 - 10^\frac{-MQ}{10}) \times 100 $$ -=== -The `--quality` argument filters out alignments below a given $MQ$ threshold. The default, `30`, keeps alignments -that are at least 99.9% likely correctly mapped. Set this value to `1` if you only want alignments removed with -$MQ = 0$ (0% likely correct). You may also set it to `0` to keep all alignments for diagnostic purposes. -The plot below shows the relationship between $MQ$ score and the likelihood the alignment is correct and will serve to help you decide -on a value you may want to use. It is common to remove alignments with $MQ <30$ (<99.9% chance correct) or $MQ <40$ (<99.99% chance correct). 
[!embed el="embed"](//plotly.com/~pdimens/7.embed) +=== ## Marking PCR duplicates EMA marks duplicates in the resulting alignments, however the read with invalid barcodes diff --git a/Modules/SV/leviathan.md b/Modules/SV/leviathan.md index c2a8e6548..181596b44 100644 --- a/Modules/SV/leviathan.md +++ b/Modules/SV/leviathan.md @@ -25,8 +25,8 @@ This file is optional and only useful if you want variant calling to happen on a - spaces can be used as delimeters too - the groups can be numbers or text (_i.e._ meaningful population names) - you can comment out lines with `#` for Harpy to ignore them -- create with `harpy extra popgroup -d ` or manually -- if created with `harpy extra popgroup`, all the samples will be assigned to group `pop1` +- create with `harpy popgroup -d ` or manually +- if created with `harpy popgroup`, all the samples will be assigned to group `pop1` - make sure to edit the second column to reflect your data correctly. ``` example file for --populations @@ -47,16 +47,16 @@ from the sample names. A simple fix would be to use underscores (`_`) to differe After reads have been aligned, _e.g._ with `harpy align`, you can use those alignment files (`.bam`) to call structural variants in your data using LEVIATHAN. To make sure your data -will work seemlessly with LEVIATHAN, the alignments in the input BAM files should **end** +will work seamlessly with LEVIATHAN, the alignments in the [input BAM files](/commonoptions.md) should **end** with a `BX:Z:AxxCxxBxxDxx` tag. Use `harpy preflight bam` if you want to double-check file format validity. ```bash usage -harpy sv leviathan OPTIONS... +harpy sv leviathan OPTIONS... INPUTS...
``` ```bash example -harpy sv leviathan --threads 20 --directory Align/bwa -g genome.fasta +harpy sv leviathan --threads 20 -g genome.fasta Align/bwa ``` ## :icon-terminal: Running Options @@ -64,8 +64,8 @@ In addition to the [common runtime options](/commonoptions.md), the `harpy sv le | argument | short name | type | default | required | description | |:-----------------|:----------:|:--------------|:-------:|:--------:|:---------------------------------------------------| +| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input BAM files](/commonoptions.md#input-arguments) | | `--genome` | `-g` | file path | | **conditionally** | Genome assembly for phasing bam files | -| `--directory` | `-d` | folder path | | **yes** | Directory with sequence alignments | | `--populations` | `-p` | file path | | no | Tab-delimited file of sample\<*tab*\>group | | `--extra-params` | `-x` | string | | no | Additional naibr arguments, in quotes | diff --git a/Modules/SV/naibr.md b/Modules/SV/naibr.md index 923ca85fc..4b8fa2d5d 100644 --- a/Modules/SV/naibr.md +++ b/Modules/SV/naibr.md @@ -20,8 +20,8 @@ This file is optional and only useful if you want variant calling to happen on a - spaces can be used as delimeters too - the groups can be numbers or text (_i.e._ meaningful population names) - you can comment out lines with `#` for Harpy to ignore them -- create with `harpy extra popgroup -d ` or manually -- if created with `harpy extra popgroup`, all the samples will be assigned to group `pop1` +- create with `harpy popgroup -d ` or manually +- if created with `harpy popgroup`, all the samples will be assigned to group `pop1` - make sure to edit the second column to reflect your data correctly. ``` example file for --populations @@ -49,15 +49,15 @@ should already have that information (yay!). If your alignments don't have phasi then you will need to do a little extra work for NAIBR to work best with your data. This process is described below. 
```bash usage -harpy sv naibr OPTIONS... +harpy sv naibr OPTIONS... INPUTS... ``` ```bash examples # input bams already phased -harpy sv naibr --threads 20 --directory Align/bwa --genome genome.fasta +harpy sv naibr --threads 20 --genome genome.fasta Align/bwa # input bams require phasing -harpy sv naibr --threads 20 --directory Align/bwa --genome genome.fasta --vcf Variants/data.vcf.gz +harpy sv naibr --threads 20 --genome genome.fasta --vcf Variants/data.vcf.gz Align/bwa ``` ## :icon-terminal: Running Options @@ -65,8 +65,8 @@ In addition to the [common runtime options](/commonoptions.md), the `harpy sv na | argument | short name | type | default | required | description | |:-----------------|:----------:|:--------------|:-------:|:--------:|:---------------------------------------------------| +| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input BAM files](/commonoptions.md#input-arguments) | | `--genome` | `-g` | file path | | **yes** | Genome assembly for phasing bam files | -| `--directory` | `-d` | folder path | | **yes** | Directory with sequence alignments | | `--vcf` | `-v` | file path | | **conditionally** | Phased vcf file for phasing bam files | | `--molecule-distance` | `-m` | integer | 100000 | no | Base-pair distance threshold to separate molecules | | `--populations` | `-p` | file path | | no | Tab-delimited file of sample\<*tab*\>group | @@ -99,7 +99,7 @@ it too well. In order to get the best variant calling performance out of NAIBR, it requires _phased_ bam files as input. The `--vcf` option is optional and not used by NAIBR. However, to use `harpy sv naibr` with bam files that are not phased, you will need to include `--vcf`, which Harpy uses with -`whatshap haplotag` to phase your input bam files prior to variant calling. See the [whatshap documentation](https://whatshap.readthedocs.io/en/latest/guide.html#whatshap-haplotag) +`whatshap haplotag` to phase your input BAM files prior to variant calling. 
See the [whatshap documentation](https://whatshap.readthedocs.io/en/latest/guide.html#whatshap-haplotag) for more details on that process. #### a phased input --vcf diff --git a/Modules/demultiplex.md b/Modules/demultiplex.md index 124de2fd6..be96c822d 100644 --- a/Modules/demultiplex.md +++ b/Modules/demultiplex.md @@ -20,17 +20,17 @@ should have been added during the sample DNA preparation in a laboratory. The de haplotag technology you are using (read [Haplotag Types](#haplotag-types)). ```bash usage -harpy demultiplex OPTIONS... +harpy demultiplex OPTIONS... INPUT ``` ```bash example -harpy demultiplex --threads 20 --file Plate_1_S001_R1.fastq.gz --samplesheet demux.schema +harpy demultiplex --threads 20 --samplesheet demux.schema Plate_1_S001_R1.fastq.gz ``` ## :icon-terminal: Running Options In addition to the [common runtime options](/commonoptions.md), the `harpy demultiplex` module is configured using these command-line arguments: | argument | short name | type | default | required | description | |:------------------|:----------:|:-----------|:-------:|:--------:|:-------------------------------------------------------------------------------------| -| `--file` | `-f` | file path | | **yes** | The forward (or reverse) multiplexed FASTQ file | +| `INPUT` | | file path | | **yes** | The forward (or reverse) multiplexed FASTQ file | | `--samplesheet` | `-b` | file path | | **yes** | Tab-delimited file of sample\barcode | | `--method` | `-m` | choice | `gen1` | **yes** | Haplotag technology of the sequences | diff --git a/Modules/impute.md b/Modules/impute.md index 8a6c0b8fb..f085d4d63 100644 --- a/Modules/impute.md +++ b/Modules/impute.md @@ -28,7 +28,7 @@ most from your data. Harpy uses `STITCH` to impute genotypes, a haplotype-based method that is linked-read aware. Imputing genotypes requires a variant call file **containing SNPs**, such as that produced by `harpy variants`. 
You can impute genotypes with Harpy using the `impute` module: ```bash usage -harpy impute OPTIONS... +harpy impute OPTIONS... INPUTS... ``` ```bash example @@ -36,7 +36,7 @@ harpy impute OPTIONS... harpy stitchparams -o stitch.params # run imputation -harpy impute --threads 20 --vcf Variants/mpileup/variants.raw.bcf --directory Align/ema --parameters stitch.params +harpy impute --threads 20 --vcf Variants/mpileup/variants.raw.bcf --parameters stitch.params Align/ema ``` ## :icon-terminal: Running Options @@ -44,8 +44,8 @@ In addition to the [common runtime options](/commonoptions.md), the `harpy imput | argument | short name | type | default | required | description | |:---------------|:----------:|:------------|:-------------:|:--------:|:------------------------------------------------------------------------------------------------| +| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input BAM files](/commonoptions.md) | | `--vcf` | `-v` | file path | | **yes** | Path to VCF/BCF file | -| `--directory` | `-d` | folder path | | **yes** | Directory with sequence alignments | | `--extra-params` | `-x` | folder path | | no | Extra arguments to add to the STITCH R function, provided in quotes and R syntax | | `--vcf-samples`| | toggle | | no | [Use samples present in vcf file](#prioritize-the-vcf-file) for imputation rather than those found the directory | | `--parameters` | `-p` | file path | | **yes** | STITCH [parameter file](#parameter-file) (tab-delimited) | @@ -57,7 +57,7 @@ syntax (e.g. `regionStart=0`, `populations=c("GBA","CUE")`). The argument should however, if your additional parameters require the use of quotes (like the previous example), then wrap the `-x` argument in **single quotes**. Otherwise, the format should take the form of `"arg1=value, arg2=value2"`. 
Example: ```bash -harpy impute -v file.vcf -p stitch.params -t 15 -x 'regionStart=20, regionEnd=500' +harpy impute -v file.vcf -p stitch.params -t 15 -x 'regionStart=20, regionEnd=500' Align/ema ``` ### Prioritize the vcf file @@ -227,7 +227,7 @@ Impute/ |:------------------------------------|:--------------------------------------------------------------------------| | `logs/harpy.impute.log` | relevant runtime parameters for the phase module | | `input/*.stitch` | biallelic SNPs used for imputation | -| `input/samples.list` | list of input BAM files | +| `input/samples.list` | list of [input BAM files](/commonoptions.md) | | `input/samples.names` | list of sample names | | `model*/concat.log` | output from bcftools concat to create final imputed bcf | | `model*/variants.imputed.bcf` | final bcf file of imputed genotypes | diff --git a/Modules/othermodules.md b/Modules/othermodules.md index 57115eb57..7cf014a48 100644 --- a/Modules/othermodules.md +++ b/Modules/othermodules.md @@ -18,9 +18,12 @@ The arguments represent different sub-commands and can be run in any order or co ### popgroup #### Sample grouping file for variant calling + +```bash +harpy popgroup -o samples.groups data/ +``` ##### arguments - `-o`, `--output`: name of the output file -- `-d`, `--directory`: name of the directory of input files, either fastq or bam. This file is entirely optional and useful if you want SNP variant calling to happen on a per-population level via `harpy snp ... -p` or on samples pooled-as-populations via `harpy sv ... -p`. diff --git a/Modules/phase.md b/Modules/phase.md index 02afe144b..c29b8d13c 100644 --- a/Modules/phase.md +++ b/Modules/phase.md @@ -21,10 +21,10 @@ works on SNP data**, and will not work for structural variants produced by `LEVI phase genotypes into haplotypes with Harpy using the `phase` module: ```bash usage -harpy phase OPTIONS... +harpy phase OPTIONS... INPUTS... 
``` ```bash example -harpy phase --threads 20 --vcf Variants/variants.raw.bcf --directory Align/ema +harpy phase --threads 20 --vcf Variants/variants.raw.bcf Align/ema ``` ## :icon-terminal: Running Options @@ -32,8 +32,8 @@ In addition to the [common runtime options](/commonoptions.md), the `harpy phase | argument | short name | type | default | required | description | |:----------------------|:----------:|:----------------|:-------:|:--------:|:---------------------------------------------------------------------| +| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input BAM files](/commonoptions.md#input-arguments) | | `--vcf` | `-v` | file path | | **yes** | Path to BCF/VCF file | -| `--directory` | `-d` | folder path | | **yes** | Directory with sequence alignments | | `--genome ` | `-g` | file path | | no | Path to genome if wanting to also use reads spanning indels | | `--molecule-distance` | `-m` | integer | 100000 | no | Base-pair distance threshold to separate molecules | | `--prune-threshold` | `-p` | integer (0-100) | 7 | no | PHRED-scale (%) threshold for pruning low-confidence SNPs | diff --git a/Modules/preflight.md b/Modules/preflight.md index 6f053df82..4c98b45a7 100644 --- a/Modules/preflight.md +++ b/Modules/preflight.md @@ -27,17 +27,17 @@ for the pipeline. There are separate `fastq` and `bam` submodules and the result ```bash fastq usage and example -harpy preflight fastq OPTIONS... +harpy preflight fastq OPTIONS... INPUTS... # example -harpy preflight fastq --threads 20 -d raw_data +harpy preflight fastq --threads 20 raw_data ``` ```bash bam usage and example -harpy preflight bam OPTIONS... +harpy preflight bam OPTIONS... INPUTS...
# example -harpy preflight bam --threads 20 -d Align/bwa +harpy preflight bam --threads 20 Align/bwa ``` ## :icon-terminal: Running Options @@ -45,12 +45,13 @@ In addition to the [common runtime options](/commonoptions.md), the `harpy prefl | argument | short name | type | default | required | description | |:------------------|:----------:|:-----------|:-------:|:--------:|:-------------------------------------------------------------------------------------| -| `--directory` | `-d` | folder path | | **yes** | Directory with sequences or alignments | +| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input fastq or bam files](/commonoptions.md#input-arguments) | ## Workflow +++ `fastq` checks -Below is a table of the format specifics `harpy preflight` checks for FASTQ files. Take note +Below is a table of the format specifics `harpy preflight` checks for FASTQ files. Since 10X data doesn't use +the haplotagging data format, you will find little value in running `preflight` on 10X FASTQ files. Take note of the language such as when "any" and "all" are written. | Criteria | Pass Condition | Fail Condition | diff --git a/Modules/qc.md b/Modules/qc.md index f52f92c6f..811792b1d 100644 --- a/Modules/qc.md +++ b/Modules/qc.md @@ -21,17 +21,17 @@ harpy qc OPTIONS... 
``` ```bash example -harpy qc --directory Sequences_Raw/ --threads 20 +harpy qc --threads 20 Sequences_Raw/ ``` ## :icon-terminal: Running Options In addition to the [common runtime options](/commonoptions.md), the `harpy qc` module is configured using these command-line arguments: | argument | short name | type | default | required | description | -|:-----------------|:----------:|:------------|:-------:|:--------:|:------------------------------------------------------------------------------------------------| -| `--directory` | `-d` | folder path | | **yes** | Directory with sequence alignments | -| `--max-length` | `-l` | integer | 150 | no | Maximum length to trim sequences down to | -| `--extra-params` | `-x` | string | | no | Additional fastp arguments, in quotes | +|:-----------------|:----------:|:------------|:-------:|:-------:|:------------------------------------------------------------------------------------------------| +| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments) | +| `--max-length` | `-l` | integer | 150 | no | Maximum length to trim sequences down to | +| `--extra-params` | `-x` | string | | no | Additional fastp arguments, in quotes | --- ## :icon-git-pull-request: QC Workflow diff --git a/Modules/snp.md b/Modules/snp.md index 8a3c1495a..90e202ff6 100644 --- a/Modules/snp.md +++ b/Modules/snp.md @@ -11,15 +11,15 @@ order: 5 - at least 4 cores/threads available - a genome assembly in FASTA format - sequence alignments, in `.bam` format -- sample grouping file +- sample grouping file (optional) ==- :icon-file: sample grouping file This file is optional and useful if you want variant calling to happen on a per-population level. 
- takes the format of sample\<*tab*\>group - spaces can be used as delimeters too - the groups can be numbers or text (_i.e._ meaningful population names) - you can comment out lines with `#` for Harpy to ignore them -- create with `harpy extra popgroup -d ` or manually -- if created with `harpy extra popgroup`, all the samples will be assigned to group `pop1`, so make sure to edit the second column to reflect your data correctly. +- create with `harpy popgroup ` or manually +- if created with `harpy popgroup`, all the samples will be assigned to group `pop1`, so make sure to edit the second column to reflect your data correctly. ``` example file for --populations sample1 pop1 @@ -43,15 +43,15 @@ After reads have been aligned, _e.g._, with `harpy align`, you can use those ali You can call SNPs with the ` harpy snp` module: ```bash usage -harpy snp method OPTIONS... +harpy snp method OPTIONS... INPUTS... ``` ```bash examples # call variants with mpileup -harpy snp mpileup --threads 20 --genome genome.fasta --directory Align/bwa +harpy snp mpileup --threads 20 --genome genome.fasta Align/bwa # call variants with freebayes -harpy snp freebayes --threads 20 --genome genome.fasta --directory Align/bwa +harpy snp freebayes --threads 20 --genome genome.fasta Align/bwa ``` ## :icon-terminal: Running Options @@ -59,8 +59,8 @@ In addition to the [common runtime options](../commonoptions.md), the `harpy snp | argument | short name | type | default | required | description | |:-----------------|:----------:|:--------------------------------|:-------:|:--------:|:----------------------------------------------------| +| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input BAM files](/commonoptions.md#input-arguments) | | `--genome` | `-g` | file path | | **yes** | Genome assembly for variant calling | -| `--directory` | `-d` | folder path | | **yes** | Directory with sequence alignments | | `--windowsize` | `-w` | integer | 50000 | no | Interval 
size for parallel variant calling | | `--populations` | `-p` | file path | | no | Tab-delimited file of sample\<*tab*\>group | | `--ploidy` | `-x` | integer | 2 | no | Ploidy of samples | diff --git a/commonoptions.md b/commonoptions.md index 5095e36e6..8b0744702 100644 --- a/commonoptions.md +++ b/commonoptions.md @@ -5,6 +5,35 @@ order: 4 --- # :icon-list-unordered: Common Harpy Options +## Input Arguments +Each of the main Harpy modules (e.g. `qc` or `phase`) follows the format of +```bash +harpy module options arguments +``` +where `module` is something like `impute` or `snp mpileup` and `options` are the runtime parameters, +which can include things like an input `--vcf` file, `--molecule-distance`, etc. After the options +is where you provide the input files/directories without flags and following standard BASH expansion +rules (e.g. wildcards). You can mix and match entire directories, individual files, and wildcard expansions. +In most cases, you can provide an unlimited number of input arguments, which Harpy will parse and symlink +into the `*/workflow/input` folder, leaving the original files unmodified. In practice, that can look like: +```bash +harpy align bwa -t 5 -g genome.fasta data/pop1 data/pop2/trimmed*gz data/pop3/sample{1,2}* data/pop4/sample{2..5}*gz +``` +!!!info not recursive +Keep in mind that Harpy will not recursively scan input directories for files. If you provide `data/` as an input, +Harpy will search for fastq/bam files in `data/` and not in any subdirectories within `data/`. This is done deliberately +to avoid unexpected behavior. +!!! + +!!!warning clashing names +Harpy will symlink just the file names into `workflow/input` regardless of their origin, +meaning that files in different directories that have the same name (ignoring extensions) will +clash.
As an example, both `folderA/sample001.bam` and `folderB/sample001.bam` will become symlinked +as `workflow/input/sample001.bam`, with one symlink overwriting the other, leaving you with one missing +sample. During parsing, Harpy will inform you of naming clashes and terminate to protect you against +this behavior. +!!! + ## Common command-line options Every Harpy module has a series of configuration parameters. These are arguments you need to input to configure the module to run on your data, such as the directory with the reads/alignments, @@ -25,11 +54,11 @@ configured using these arguments: As as example, you could call the `harpy align` module and specify 20 threads with no output to console: ```bash -harpy align bwa --threads 20 --directory samples/trimmedreads --quiet +harpy align bwa --threads 20 --quiet samples/trimmedreads -# same as # +# identical to # -harpy align bwa -t 20 -d samples/trimmedreads -q +harpy align bwa -t 20 -q samples/trimmedreads ``` --- @@ -41,15 +70,15 @@ and the contents therein also allow you to rerun the workflow manually. 
The `wor | item | contents | utility | |:-----|:---------|:--------| -|`*.smk`| Snakefile with the full recipe of the workflow | useful for understanding the workflow | -| `config.yml` | Configuration file generated from command-line arguments and consumed by the Snakefile | useful for bookkeeping | -| `report/*.Rmd` | RMarkdown files used to generate the fancy reports | useful to understand math behind plots/tables or borrow code from | +|`*.smk` | Snakefile with the full recipe of the workflow | useful for understanding the workflow | +| `config.yml` | Configuration file generated from command-line arguments and consumed by the Snakefile | useful for bookkeeping | +| `input/` | Symlinks to all of the provided input files with standardized extensions | +| `report/*.Rmd` | RMarkdown files used to generate the fancy reports | useful to understand math behind plots/tables or borrow code from | | `*.workflow.summary` | Plain-text overview of the important parts of the workflow | useful for bookkeeping and writing Methods | --- ## The `Genome` folder - You will notice that many of the workflows will create a `Genome` folder in the working directory. This folder is to make it easier for Harpy to store the genome and the associated indexing/etc. files. Your input genome will be symlinked into that directory (not copied), but diff --git a/retype.yml b/retype.yml index 2b447f94e..38383a029 100644 --- a/retype.yml +++ b/retype.yml @@ -16,15 +16,18 @@ links: footer: copyright: "© Copyright {{ year }}. All rights reserved." branding: - title: Harpy # Your custom website title; keep it short. - label: v0.6.1 - logo: static/favicon.png # Path to a logo file. - logoDark: static/favicon.png # Path to a logo file to use in dark mode. + title: Harpy + label: v0.7.0 + logo: static/favicon.png + logoDark: static/favicon.png logoAlign: left -favicon: static/favicon.png # Path to a custom favicon, or. 
+favicon: static/favicon.png search: mode: partial edit: repo: "https://github.com/pdimens/HARPY/edit" base: / branch: "docs" +exclude: + - "src/" + - "test/" \ No newline at end of file