diff --git a/Modules/Align/ema.md b/Modules/Align/ema.md index 6e0a654bd..f6ee8ab10 100644 --- a/Modules/Align/ema.md +++ b/Modules/Align/ema.md @@ -10,6 +10,7 @@ order: 5 - at least 4 cores/threads available - a genome assembly in FASTA format - paired-end fastq sequence file with the [proper naming convention](/haplotagdata/#naming-conventions) (gzipped recommended) +- patience ==- Why EMA? The original haplotag manuscript uses BWA to map reads. The authors have since recommended the use of EMA (EMerald Aligner) for most applications. EMA is barcode-aware, diff --git a/Modules/Align/minimap.md b/Modules/Align/minimap.md new file mode 100644 index 000000000..1ec79dcd4 --- /dev/null +++ b/Modules/Align/minimap.md @@ -0,0 +1,194 @@ +--- +label: Minimap +description: Align haplotagged sequences with Minimap2 +icon: dot +order: 5 +--- + +# :icon-quote: Map Reads onto a genome with Minimap2 +=== :icon-checklist: You will need +- at least 4 cores/threads available +- a genome assembly in FASTA format +- paired-end fastq sequence file with the [proper naming convention](/haplotagdata/#naming-conventions) (gzipped recommended) +=== + +Once sequences have been trimmed and passed through other QC filters, they will need to +be aligned to a reference genome. This module within Harpy expects filtered reads as input, +such as those derived using `harpy qc`. You can map reads onto a genome assembly with Harpy +using the `align` module: + +```bash usage +harpy align minimap OPTIONS... INPUTS... +``` +```bash example +harpy align minimap --genome genome.fasta Sequences/ +``` + +## :icon-terminal: Running Options +In addition to the [common runtime options](/commonoptions.md), the `harpy align minimap` module is configured using these command-line arguments: + +| argument | short name | type | default | required | description | +|:-------------------|:----------:|:----------------------|:-------:|:--------:|:------------------------------------------------------| +| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments) | +| `--genome` | `-g` | file path | | **yes** | Genome assembly for read mapping | +| `--molecule-distance` | `-m` | integer | 100000 | no | Base-pair distance threshold to separate molecules | +| `--quality-filter` | `-f` | integer (0-40) | 30 | no | Minimum `MQ` (SAM mapping quality) to pass filtering | +| `--method` | `-m` | choice [`bwa`, `ema`] | bwa | no | Which aligning software to use | +| `--extra-params` | `-x` | string | | no | Additional EMA-align/BWA arguments, in quotes | + +### Molecule distance +The `--molecule-distance` option is used during the BWA alignment workflow +to assign alignments a unique Molecular Identifier `MI:i` tag based on their + haplotag barcode and the distance threshold you specify. See +[haplotag data](/haplotagdata/#barcode-thresholds) for more information on +what this value does. + +## Quality filtering +The `--quality` argument filters out alignments below a given $MQ$ threshold. The default, `30`, keeps alignments +that are at least 99.9% likely correctly mapped. Set this value to `1` if you only want alignments removed with +$MQ = 0$ (0% likely correct). You may also set it to `0` to keep all alignments for diagnostic purposes. +The plot below shows the relationship between $MQ$ score and the likelihood the alignment is correct and will serve to help you decide +on a value you may want to use. It is common to remove alignments with $MQ <30$ (<99.9% chance correct) or $MQ <40$ (<99.99% chance correct). + +==- What is the $MQ$ score? +Every alignment in a BAM file has an associated mapping quality score ($MQ$) that informs you of the likelihood +that the alignment is accurate. This score can range from 0-40, where higher numbers mean the alignment is more +likely correct. The math governing the $MQ$ score actually calculates the percent chance the alignment is ***incorrect***: +$$ +\%\ chance\ incorrect = 10^\frac{-MQ}{10} \times 100\\ +\text{where }0\le MQ\le 40 +$$ +You can simply subtract it from 100 to determine the percent chance the alignment is ***correct***: +$$ +\%\ chance\ correct = 100 - \%\ chance\ incorrect\\ +\text{or} \\ +\%\ chance\ correct = (1 - 10^\frac{-MQ}{10}) \times 100 +$$ + +[!embed el="embed"](//plotly.com/~pdimens/7.embed) +=== + +## Marking PCR duplicates +Harpy uses `samtools markdup` to mark putative PCR duplicates. By using the `--barcode-tag BX` +option, it considers the linked-read barcode for more accurate duplicate detection. Duplicate +marking also uses the `-S` option to mark supplementary (chimeric) alignments as duplicates +if the primary alignment was marked as a duplicate. Duplicates get marked but **are not removed**. + +---- + +## :icon-git-pull-request: Minimap workflow ++++ :icon-git-merge: details +- ignores (but retains) barcode information +- ultra-fast +- comparable accuracy to BWA MEM for sequences greater than 100bp + - accuracy may be lower for sequences less than 100bp + +The [minimap2](https://github.com/lh3/minimap2) workflow is nearly identical to the BWA workflow, +the only real difference being how the input genome is indexed and that alignment is performed with +`minimap2` instead of BWA. Duplicates are marked using `samtools markdup`. +The `BX:Z` tags in the read headers are still added to the alignment headers, even though barcodes +are not used to inform mapping. The `-m` threshold is used for alignment molecule assignment. + +```mermaid +graph LR + Z([trimmed reads]) --> B + A([index genome]) --> B([align to genome]) + B-->C([sort alignments]) + C-->D([mark duplicates]) + D-->E([assign molecules]) + E-->F([alignment metrics]) + D-->G([barcode stats]) + G-->F + subgraph markdp [mark duplicates via `samtools`] + direction LR + collate-->fixmate + fixmate-->sort + sort-->markdup + end +``` ++++ :icon-file-directory: minimap2 output +The default output directory is `Align/minimap` with the folder structure below. `Sample1` is a generic sample name for demonstration purposes. +The resulting folder also includes a `workflow` directory (not shown) with workflow-relevant runtime files and information. +``` +Align/bwa +├── Sample1.bam +├── Sample1.bam.bai +├── logs +│ └── markduplicates +│    └── Sample1.markdup.log +└── reports + ├── minimap.stats.html + ├── BXstats + │   ├── Sample1.bxstats.html + │ └── data + │ └── Sample1.bxstats.gz + └── coverage +    ├── Sample1.gencov.html + └── data + └── Sample1.gencov.gz + + +``` + +| item | description | +|:---------|:------------------------------------------------------------------------------------------------------------| +| `*.bam` | sequence alignments for each sample | +| `*.bai` | sequence alignment indexes for each sample | +| `logs/markduplicates` | stats provided by `samtools markdup` | +| `reports/` | various counts/statistics/reports relating to sequence alignment | +| `reports/minimap.stats.html` | report summarizing `samtools flagstat and stats` results across all samples from `multiqc` | +| `reports/reads.bxstats.html` | interactive html report summarizing valid vs invalid barcodes across all samples | +| `reports/BXstats/*.bxstats.html` | interactive html report summarizing inferred molecule size | +| `reports/coverage/*.html` | summary plots of alignment coverage per contig | +| `reports/coverage/data/*.gencov.gz` | output from samtools bedcov from all alignments, used for plots | +| `reports/BXstats/` | reports summarizing molecule size and reads per molecule | +| `reports/BXstats/data/` | tabular data containing the information used to generate the BXstats reports | + ++++ :icon-code-square: minimap2 parameters +By default, Harpy runs `minimap2` with these parameters (excluding inputs and outputs): +```bash +minimap2 -ax sr -y --sam-hit-only -R \"@RG\\tID:samplename\\tSM:samplename\" +``` + +The Minimap2 aligner has a lot of parameters that can be provided, too many to list here, so please refer to the +[Minimap2 documentation](https://lh3.github.io/minimap2/minimap2.html). The `-a` indicates to output a SAM format +file and the `-x sr` argument is a preset for short reads with these parameters: +- `-k21` +- `-w11` +- `--sr` +- `--frag=yes` +- `-A2` +- `-B8` +- `-O12,32` +- `-E2,1` +- `-b0` +- `-r100` +- `-p.5` +- `-N20` +- `-f1000,5000` +- `-n2` +- `-m20` +- `-s40` +- `-g100` +- `-2K50m` +- `--heap-sort=yes` +- `--secondary=no` + ++++ :icon-graph: reports +These are the summary reports Harpy generates for this workflow. You may right-click +the images and open them in a new tab if you wish to see the examples in better detail. +||| Depth and coverage +Reports the depth of alignments in 10kb windows. +![reports/coverage/*.html](/static/report_align_coverage.png) +||| BX validation +Reports the number of valid/invalid barcodes in the alignments. +![reports/reads.bxstats.html](/static/report_align_bxstats.png) +||| Molecule size +Reports the inferred molecule sized based on barcodes in the alignments. +![reports/BXstats/*.bxstats.html](/static/report_align_bxmol.png) +||| Alignment stats +Reports the general statistics computed by samtools `stats` and `flagstat` +![reports/samtools_*stat/*html](/static/report_align_flagstat.png) +||| + ++++ diff --git a/Modules/Simulate/index.yml b/Modules/Simulate/index.yml new file mode 100644 index 000000000..7fa938bf3 --- /dev/null +++ b/Modules/Simulate/index.yml @@ -0,0 +1,2 @@ +icon: flame +order: 5 \ No newline at end of file diff --git a/Modules/Simulate/simulate-variants.md b/Modules/Simulate/simulate-variants.md new file mode 100644 index 000000000..9d172bbe1 --- /dev/null +++ b/Modules/Simulate/simulate-variants.md @@ -0,0 +1,262 @@ +--- +label: Variants +description: Simulate snps, indels, inversions, cnv, translocations +icon: dot +#visibility: hidden +order: 6 +--- + +# :icon-flame: Simulate Genomic Variants +Simulate snps, indels, inversions, cnv, translocations + +=== :icon-checklist: You will need +- a reference genome in fasta or gzipped fasta format +=== + +You may want to benchmark haplotag data on different kinds of genomic variants. To +do that, you'll need *known* variants, and typically simulations are how you achieve +that. This series of modules simulates genomic variants onto a genome, either randomly +or specific variants provided in VCF files. The simulator Harpy uses, +[simuG](https://github.com/yjx1217/simuG), can only simulate one type of +variant at a time and each variant type has their own set of parameters. This page +is divided by variant types to help you navigate the process. The general usage +for simulating variants is: + +```bash usage +harpy simulate variant OPTIONS... INPUT_GENOME +``` +```bash example +harpy simulate inversion -n 10 --min-size 1000 --max-size 50000 path/to/genome.fasta +``` +## Modules +There are 4 submodules with very obvious names: + +| submodule | what it does | +|:----------|:-------------| +|`snpindel` | simulates single nucleotide polymorphisms (snps) and insertion-deletions (indels) | +| `inversion` | simulates inversions | +| `cnv` | simulates copy number variants | +| `translocation` | simulates translocations | + +## :icon-terminal: Running Options +While there are serveral differences between the submodule command line options, each has available all the +[common runtime options](/commonoptions.md) like other Harpy modules. Each requires and input genome at the +end of the command line, and each requires either a `--count` of variants to randomly simulate, or a `--vcf` of +specific variants to simulate. There are also these unifying options among the different variant types: + +| argument | short name | type | description | +| :-----|:-----|:-----|:-----| +| `INPUT_GENOME` | | file path | The haploid genome to simulate variants onto. **REQUIRED** | +| `--prefix` | | string | Naming prefix for output files (default: `sim.{module_name}`)| +| `--exclude-chr` | `-e` | file path | Text file of chromosomes to avoid, one per line | +| `--centromeres` | `-c` | file path | GFF3 file of centromeres to avoid | +| `--genes` | `-g` | file path | GFF3 file of genes to avoid simulating over (see `snpindel` for caveat) | +| `--heterozygosity` | `-z` | float between [0,1] | [% heterozygosity to simulate diploid later](#heterozygosity) (default: `0`) | +| `--randomseed` | | integer | Random seed for simulation | + +==- snps and indels +### snpindel +A single nucleotide polymorphism ("SNP") is a genomic variant at a single base position in the DNA ([source](https://www.genome.gov/genetics-glossary/Single-Nucleotide-Polymorphisms)). +An indel, is a type of mutation that involves the addition/deletion of one or more nucleotides into a segment of DNA ([insertions](https://www.genome.gov/genetics-glossary/Insertion), [deletions](https://www.genome.gov/genetics-glossary/Deletion)). +The snp and indel variants are combined in this module because `simuG` allows simulating them together. The +ratio parameters control different things for snp and indel variants and have special meanings when setting +the value to either `9999` or `0` : +- `--titv-ratio` + - `9999`: transitions only + - `0`: transversions only +- `--indel-ratio` + - `9999`: insertions only + - `0`: deletions only + +| argument | short name | type | default | description | +|:------------------|:----------:|:-----------|:-------:|:-------------------------------------------------------------| +| `--snp-vcf`| `-v` | file path | | VCF file of known snps to simulate | +| `--indel-vcf` | `-i` | file path | | VCF file of known indels to simulate | +| `--snp-count` | `-n` | integer | 0 | Number of random snps to simluate | +| `--indel-count` | `-m` | integer | 0 | Number of random indels to simluate | +| `--titv-ratio` | `-r` | float | 0.5 | Transition/Transversion ratio for snps | +| `--indel-ratio` | `-d` | float | 1 | Insertion/Deletion ratio for indels | +| `--indel-size-alpha` | `-a` | float | 2.0 | Exponent Alpha for power-law-fitted indel size distribution| +| `--indel-size-constant` | `-l` | float | 0.5 | Exponent constant for power-law-fitted indel size distribution | +| `--snp-gene-constraints` | `-y` | string | | How to constrain randomly simulated SNPs {`noncoding`,`coding`,`2d`,`4d`} when using `--genes`| + +!!!warning SNPs can be slow +Given software limitations, simulating many SNPs (>10,000) will be noticeably slower than the other variant types. +!!! + +==- inversions +### inversion +Inversions are when a section of a chromosome appears in the reverse orientation ([source](https://www.genome.gov/genetics-glossary/Inversion)). + +| argument | short name | type | default | description | +|:------------------|:----------:|:-----------|:-------:|:----------------| +| `--vcf` | `-v` | file path | | VCF file of known inversions to simulate | +| `--count`| `-n` | integer | 0 | Number of random inversions to simluate | +| `--min-size` | `-m` | integer | 1000 | Minimum inversion size (bp) | +| `--max-size` | `-x` | integer | 100000 | Maximum inversion size (bp) | + +==- copy number variants +### cnv +A copy number variation (CNV) is when the number of copies of a particular gene varies +between individuals ([source](https://www.genome.gov/genetics-glossary/Copy-Number-Variation)) +The ratio parameters control different things and have special meanings when setting +the value to either `9999` or `0` : +- `--dup-ratio` + - `9999`: tandem duplications only + - `0`: dispersed duplications only +- `--gain-ratio` + - `9999`: gain only + - `0`: loss only + +| argument | short name | type | default | description | +|:------------------|:----------:|:-----------|:-------:|:----------------| +| `--vcf` | `-v` | file path | | VCF file of known copy number variants to simulate | +| `--count` | `-n` | integer | 0 | Number of random cnv to simluate | +| `--min-size` | `-m` | integer | 1000 | Minimum cnv size (bp) | +| `--max-size`| `-x` | integer |100000 | Maximum cnv size (bp) | +| `--max-copy` | `-y` | integer | 10 | Maximum number of copies | +| `--dup-ratio` | `-d` | float | 1 | Tandem/Dispersed duplication ratio | +| `--gain-ratio` |`-l` | float | 1 | Relative ratio of DNA gain over DNA loss | + + +==- translocations +### translocation +A translocation occurs when a chromosome breaks and the fragmented pieces re-attach to different chromosomes ([source](https://www.genome.gov/genetics-glossary/Translocation)). + +| argument | short name | type | default | description | +|:------------------|:----------:|:-----------|:-------:|:----------------| +| `--vcf` | `-v` | file path | | VCF file of known inversions to simulate | +| `--count`| `-n` | integer | 0 | Number of random inversions to simluate | + +=== + +## Simulate known variants +Rather than simulating random variants, you can use a VCF file as input to any of the submodules +to have `simuG` simulate the variants (of that type) from the VCF file. This becomes particularly +handy because the modules output a VCF file of the variants that were introduced, which you can +modify and reuse as you see fit (see [heterozygosity](#heterozygosity)). Using `--genes`, +`--centromeres`, or `--exclude-chr` would still avoid creating variants in those regions as with +random simulation, except with SNPs, where you would have to specify the contraints for using +`--genes` as per usual. + +## Heterozygosity +Each submodule has a `--heterozygosity` parameter where you can specify the heterozygosity of +an intended diploid genome, should you use the resulting VCF(s) to simulate variants again. + This **does not** create a diploid genome for you, rather, it +takes the output VCF from `simuG` and creates two new VCF files (`{prefix}.hap1.vcf`, +`{prefix}.hap2.vcf`) that have their variants shuffled between the two haplotypes to +achieve the desired heterozygosity. You can then use the two resulting VCF files to +[simulate known variants](#simulate-known-variants) onto a genome and create a diploid genome! +A simplified workflow is provided below to explain that in better context. + +==- How the paramater is actually used +To understand how heterozygosity is created from the `simuG` VCF output, consider a genome +with 5 variants added to it, here represented as a column labelled `h1` with `1` being the presence +of a variant (the ALT allele). +``` +h1 +1 +1 +1 +1 +1 +``` +If we were to simulate those same variants onto the genome again, it would create a homozygote +at every position (`h2` is the second haplotype): +``` +h1 h2 +1 1 +1 1 +1 1 +1 1 +1 1 +``` +However, if we omit some of the variants on `h2` to create 40% heterozygosity (2/5), +we would now have heterozygotes, except the ALT allele for the heterozygote would +only every be on the first haplotype `h1`: +``` +h1 h2 +1 1 +1 1 +1 1 +1 <- heterozygote with ALT on h1 +1 <- heterozygote with ALT on h1 +``` +It would probably be more biologically sound to then make sure that the ALT allele +in the heterozygote can appear in either haplotype: +``` +h1 h2 +1 1 +1 1 +1 1 + 1 <- heterozygote with ALT on h2 +1 <- heterozygote with ALT on h1 +``` +Within Harpy, a heterozygous variant has a 50% chance of being assigned to one of the haplotypes. +So that's the logic behind the `--heterozygosity` parameter and why it ouputs 3 VCF files: +1. the VCF `simuG` outputs of variants added to the genome +2. haplotype 1 of that VCF file with some of the variants +3. haplotype 2 of that VCF file with some of the variants + +Knowing that, you can then have a workflow to start with a haploid assembly and create +a diploid assembly with simulated variants. +==- + +## Simulate Diploid Assembly +Here is a simple but realistic workflow of creating a diploid assembly with simulated variants. +If you haven't already, please read the sections about [simulating known variants](#simulate-known-variants) +and [heterozygosity](#heterozygosity). The idea here is that due to the limitations of `simuG`, we can +only simulate one type of variant at a time and we will take advantage of the VCF files created by +the `--heterozygosity` parameter to simulate known variants from random ones! The basic idea is such: + +#### Step 1 +Simulate random variants onto your haploid assembly with `--heterozygosity` (`-z`) set above `0`. +We aren't interested in the resulting genome, but rather the positions of the variants `simuG` created. +```mermaid +graph LR + geno(haploid genome)-->|simulate inversion -n 10 -z 0.5|hap(inversions.vcf) + hap-->hap1(inversion.hap1.vcf) + hap-->hap2(inversion.hap2.vcf) +``` +#### Step 2 +Use the resulting hap1 and hap2 VCF files to simulate those same variants, but shuffled +into homozygotes and heterozygotes, onto the original haploid genome, creating two haplotype +genomes. +```mermaid +graph LR + hap1(inversion.hap1.vcf)-->|simulate inversion -v|geno + geno(haploid genome)-->genohap1(haplotype-1 genome) +``` +```mermaid +graph LR + hap1(inversion.hap2.vcf)-->|simulate inversion -v|geno + geno(haploid genome)-->genohap1(haplotype-2 genome) +``` +#### Step 3 +Use the one of the new genome haplotypes for simulating other kinds of variants. +Again, use `--heterozygosity` (`-z`) with a value greater than `0`. Like [**Step 1**](#step-1), +we're only interested in the haplotype VCF files (positions of variants) and not the resulting +genome. +```mermaid +graph LR + geno(haplotype-1 genome)-->|simulate snpindel -n 100000 -z 0.5|hap(snpindel.vcf) + hap-->hap1(snpindel.hap1.vcf) + hap-->hap2(snpindel.hap2.vcf) +``` +#### Step 4 +Use the resulting haplotype VCFs to simulate known variants onto the **haplotype genomes** from +Step 2. + +```mermaid +graph LR + hap1(snpindel.hap1.vcf)-->|simulate snpindel -v|geno + geno(haplotype-1 genome)-->genohap1(haplotype-1 genome with new variants) +``` +```mermaid +graph LR + hap1(snpindel.hap2.vcf)-->|simulate snpindel -v|geno + geno(haplotype-2 genome)-->genohap1(haplotype-2 genome with new variants) +``` + +#### Step 5 +Repeat [**Step 3**](#step-3) and [**Step 4**](#step-4) to your heart's content. \ No newline at end of file diff --git a/retype.yml b/retype.yml index 61fead8e2..9a7017d92 100644 --- a/retype.yml +++ b/retype.yml @@ -17,7 +17,7 @@ footer: copyright: "© Copyright {{ year }}. All rights reserved." branding: title: Harpy - label: v0.8.x + label: v0.9.x logo: static/favicon.png logoDark: static/favicon.png logoAlign: left