From bc3224d75b868c6692c525791224a4526efe8d78 Mon Sep 17 00:00:00 2001
From: pdimens <pdimens@live.com>
Date: Tue, 2 Apr 2024 14:49:46 -0400
Subject: [PATCH 1/5] add minimap

---
 Modules/Align/minimap.md | 194 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 194 insertions(+)
 create mode 100644 Modules/Align/minimap.md

diff --git a/Modules/Align/minimap.md b/Modules/Align/minimap.md
new file mode 100644
index 000000000..72baed3f8
--- /dev/null
+++ b/Modules/Align/minimap.md
@@ -0,0 +1,194 @@
+---
+label: Minimap
+description: Align haplotagged sequences with Minimap2
+icon: dot
+order: 5
+---
+
+# :icon-quote: Map Reads onto a genome with BWA MEM
+===  :icon-checklist: You will need
+- at least 4 cores/threads available
+- a genome assembly in FASTA format
+- paired-end fastq sequence file with the [proper naming convention](/haplotagdata/#naming-conventions) (gzipped recommended)
+===
+
+Once sequences have been trimmed and passed through other QC filters, they will need to
+be aligned to a reference genome. This module within Harpy expects filtered reads as input,
+such as those derived using `harpy qc`. You can map reads onto a genome assembly with Harpy 
+using the `align` module:
+
+```bash usage
+harpy align minimap OPTIONS... INPUTS...
+```
+```bash example
+harpy align minimap --genome genome.fasta Sequences/ 
+```
+
+## :icon-terminal: Running Options
+In addition to the [common runtime options](/commonoptions.md), the `harpy align minimap` module is configured using these command-line arguments:
+
+| argument           | short name | type                  | default | required | description                                           |
+|:-------------------|:----------:|:----------------------|:-------:|:--------:|:------------------------------------------------------|
+| `INPUTS`           |            | file/directory paths  |         | **yes**  | Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments)     |
+| `--genome`         |    `-g`    | file path             |         | **yes**  | Genome assembly for read mapping                      |
+| `--molecule-distance` |    `-m`    | integer         |  100000  |    no    | Base-pair distance threshold to separate molecules      |
+| `--quality-filter` |    `-f`    | integer (0-40)        |   30    |    no    | Minimum `MQ` (SAM mapping quality) to pass filtering  |
+| `--method`         |    `-m`    | choice [`bwa`, `ema`] |   bwa   |    no    | Which aligning software to use                        |
+| `--extra-params`   |    `-x`    | string                |         |    no    | Additional EMA-align/BWA arguments, in quotes         |
+
+### Molecule distance
+The `--molecule-distance` option is used during the BWA alignment workflow
+to assign alignments a unique Molecular Identifier `MI:i` tag based on their
+ haplotag barcode and the distance threshold you specify. See 
+[haplotag data](/haplotagdata/#barcode-thresholds) for more information on
+what this value does. 
+
+## Quality filtering
+The `--quality` argument filters out alignments below a given $MQ$ threshold. The default, `30`, keeps alignments
+that are at least 99.9% likely correctly mapped. Set this value to `1` if you only want alignments removed with
+$MQ = 0$ (0% likely correct). You may also set it to `0` to keep all alignments for diagnostic purposes.
+The plot below shows the relationship between $MQ$ score and the likelihood the alignment is correct and will serve to help you decide
+on a value you may want to use. It is common to remove alignments with $MQ <30$ (<99.9% chance correct) or $MQ <40$ (<99.99% chance correct).
+
+==- What is the $MQ$ score?
+Every alignment in a BAM file has an associated mapping quality score ($MQ$) that informs you of the likelihood 
+that the alignment is accurate. This score can range from 0-40, where higher numbers mean the alignment is more
+likely correct. The math governing the $MQ$ score actually calculates the percent chance the alignment is ***incorrect***: 
+$$
+\%\ chance\ incorrect = 10^\frac{-MQ}{10} \times 100\\
+\text{where }0\le MQ\le 40
+$$
+You can simply subtract it from 100 to determine the percent chance the alignment is ***correct***:
+$$
+\%\ chance\ correct = 100 - \%\ chance\ incorrect\\
+\text{or} \\
+\%\ chance\ correct = (1 - 10^\frac{-MQ}{10}) \times 100
+$$
+
+[!embed el="embed"](//plotly.com/~pdimens/7.embed)
+===
+
+## Marking PCR duplicates
+Harpy uses `samtools markdup` to mark putative PCR duplicates. By using the `--barcode-tag BX`
+option, it considers the linked-read barcode for more accurate duplicate detection. Duplicate
+marking also uses the `-S` option to mark supplementary (chimeric) alignments as duplicates
+if the primary alignment was marked as a duplicate. Duplicates get marked but **are not removed**.
+
+----
+
+## :icon-git-pull-request: Minimap workflow
++++ :icon-git-merge: details
+- ignores (but retains) barcode information
+- ultra-fast
+- comparable accuracy to BWA MEM for sequences greater than 100bp
+    - accuracy may be lower for sequences less than 100bp
+
+The [minimap2](https://github.com/lh3/minimap2) workflow is nearly identical to the BWA workflow,
+the only real difference being how the input genome is indexed and that alignment is performed with
+`minimap2` instead of BWA. Duplicates are marked using `samtools markdup`.
+The `BX:Z` tags in the read headers are still added to the alignment headers, even though barcodes
+are not used to inform mapping. The `-m` threshold is used for alignment molecule assignment.
+
+```mermaid
+graph LR
+    Z([trimmed reads]) --> B
+    A([index genome]) --> B([align to genome])
+    B-->C([sort alignments])
+    C-->D([mark duplicates])
+    D-->E([assign molecules])
+    E-->F([alignment metrics])
+    D-->G([barcode stats])
+    G-->F
+    subgraph markdp [mark duplicates via `samtools`]
+        direction LR
+        collate-->fixmate
+        fixmate-->sort
+        sort-->markdup
+    end
+```
++++ :icon-file-directory: minimap2 output
+The default output directory is `Align/minimap` with the folder structure below. `Sample1` is a generic sample name for demonstration purposes.
+The resulting folder also includes a `workflow` directory (not shown) with workflow-relevant runtime files and information.
+```
+Align/bwa
+├── Sample1.bam
+├── Sample1.bam.bai
+├── logs
+│   └── markduplicates
+│       └── Sample1.markdup.log
+└── reports
+    ├── minimap.stats.html
+    ├── BXstats
+    │   ├── Sample1.bxstats.html
+    │   └── data
+    │       └── Sample1.bxstats.gz
+    └── coverage
+        ├── Sample1.gencov.html
+        └── data
+            └── Sample1.gencov.gz
+
+
+```
+
+| item     | description                                                                                                 |
+|:---------|:------------------------------------------------------------------------------------------------------------|
+| `*.bam`                             | sequence alignments for each sample                                              |
+| `*.bai`                             | sequence alignment indexes for each sample                                       |
+| `logs/markduplicates`               | stats provided by `samtools markdup`                                             |
+| `reports/`                          | various counts/statistics/reports relating to sequence alignment                 |
+| `reports/minimap.stats.html`        | report summarizing `samtools flagstat and stats` results across all samples from `multiqc` |
+| `reports/reads.bxstats.html`        | interactive html report summarizing valid vs invalid barcodes across all samples | 
+| `reports/BXstats/*.bxstats.html`    | interactive html report summarizing inferred molecule size                       | 
+| `reports/coverage/*.html`           | summary plots of alignment coverage per contig                                   |
+| `reports/coverage/data/*.gencov.gz` | output from samtools bedcov from all alignments, used for plots                  |
+| `reports/BXstats/`                  | reports summarizing molecule size and reads per molecule                         |
+| `reports/BXstats/data/`             | tabular data containing the information used to generate the BXstats reports     |
+
++++ :icon-code-square: minimap2 parameters
+By default, Harpy runs `minimap2` with these parameters (excluding inputs and outputs):
+```bash
+minimap2 -ax sr -y --sam-hit-only -R \"@RG\\tID:samplename\\tSM:samplename\"
+```
+
+The Minimap2 aligner has a lot of parameters that can be provided, too many to list here, so please refer to the 
+[Minimap2 documentation](https://lh3.github.io/minimap2/minimap2.html). The `-a` indicates to output a SAM format
+file and the `-x sr` argument is a preset for short reads with these parameters:
+- `-k21`
+- `-w11`
+- `--sr`
+- `--frag=yes`
+- `-A2`
+- `-B8`
+- `-O12,32` 
+- `-E2,1`
+- `-b0`
+- `-r100`
+- `-p.5`
+- `-N20`
+- `-f1000,5000` 
+- `-n2`
+- `-m20`
+- `-s40`
+- `-g100`
+- `-2K50m`
+- `--heap-sort=yes`
+- `--secondary=no`
+
++++ :icon-graph: reports
+These are the summary reports Harpy generates for this workflow. You may right-click
+the images and open them in a new tab if you wish to see the examples in better detail.
+||| Depth and coverage
+Reports the depth of alignments in 10kb windows.
+![reports/coverage/*.html](/static/report_align_coverage.png)
+||| BX validation
+Reports the number of valid/invalid barcodes in the alignments.
+![reports/reads.bxstats.html](/static/report_align_bxstats.png)
+||| Molecule size
+Reports the inferred molecule sized based on barcodes in the alignments.
+![reports/BXstats/*.bxstats.html](/static/report_align_bxmol.png)
+||| Alignment stats
+Reports the general statistics computed by samtools `stats` and `flagstat`
+![reports/samtools_*stat/*html](/static/report_align_flagstat.png)
+|||
+
++++

From 0e1af4afd299d7422c8d5020172cf074128cc658 Mon Sep 17 00:00:00 2001
From: pdimens <pdimens@live.com>
Date: Tue, 2 Apr 2024 16:31:52 -0400
Subject: [PATCH 2/5] add simulations

---
 Modules/Align/ema.md                  |  1 +
 Modules/Align/minimap.md              |  2 +-
 Modules/Simulate/index.yml            |  2 +
 Modules/Simulate/simulate-variants.md | 91 +++++++++++++++++++++++++++
 4 files changed, 95 insertions(+), 1 deletion(-)
 create mode 100644 Modules/Simulate/index.yml
 create mode 100644 Modules/Simulate/simulate-variants.md

diff --git a/Modules/Align/ema.md b/Modules/Align/ema.md
index 6e0a654bd..f6ee8ab10 100644
--- a/Modules/Align/ema.md
+++ b/Modules/Align/ema.md
@@ -10,6 +10,7 @@ order: 5
 - at least 4 cores/threads available
 - a genome assembly in FASTA format
 - paired-end fastq sequence file with the [proper naming convention](/haplotagdata/#naming-conventions) (gzipped recommended)
+- patience
 ==- Why EMA?
 The original haplotag manuscript uses BWA to map reads. The authors have since recommended
 the use of EMA (EMerald Aligner) for most applications. EMA is barcode-aware,
diff --git a/Modules/Align/minimap.md b/Modules/Align/minimap.md
index 72baed3f8..1ec79dcd4 100644
--- a/Modules/Align/minimap.md
+++ b/Modules/Align/minimap.md
@@ -5,7 +5,7 @@ icon: dot
 order: 5
 ---
 
-# :icon-quote: Map Reads onto a genome with BWA MEM
+# :icon-quote: Map Reads onto a genome with Minimap2
 ===  :icon-checklist: You will need
 - at least 4 cores/threads available
 - a genome assembly in FASTA format
diff --git a/Modules/Simulate/index.yml b/Modules/Simulate/index.yml
new file mode 100644
index 000000000..7fa938bf3
--- /dev/null
+++ b/Modules/Simulate/index.yml
@@ -0,0 +1,2 @@
+icon: flame
+order: 5
\ No newline at end of file
diff --git a/Modules/Simulate/simulate-variants.md b/Modules/Simulate/simulate-variants.md
new file mode 100644
index 000000000..d359d5750
--- /dev/null
+++ b/Modules/Simulate/simulate-variants.md
@@ -0,0 +1,91 @@
+---
+label: Variants
+description: Simulate snps, indels, inversions, cnv, translocations
+icon: dot
+#visibility: hidden
+order: 6
+---
+
+# :icon-flame: Simulate Genomic Variants
+Simulate snps, indels, inversions, cnv, translocations
+
+===  :icon-checklist: You will need
+- a reference genome in fasta or gzipped fasta format
+===
+
+You may want to benchmark haplotag data on different kinds of genomic variants. To
+do that, you'll need *known* variants, and typically simulations are how you achieve
+that. This series of modules simulates genomic variants onto a genome, either randomly
+or specific variants provided in VCF files. The simulator Harpy uses,
+[simuG](https://github.com/yjx1217/simuG), can only simulate one type of
+variant at a time and each variant type has their own set of parameters. This page
+is divided by variant types to help you navigate the process. The general usage
+for simulating variants is:
+
+```bash usage
+harpy simulate variant OPTIONS... INPUT_GENOME
+```
+```bash example
+harpy simulate inversion -n 10 --min-size 1000 --max-size 50000  path/to/genome.fasta
+```
+There are 4 submodules with very obvious names:
+
+| submodule | what it does |
+|:----------|:-------------|
+|`snpindel` | simulates single nucleotide polymorphisms (snps) and insertion-deletions (indels) |
+| `inversion` | simulates inversions |
+| `cnv` | simulates copy number variants |
+| `translocation` | simulates translocations |
+
+## :icon-terminal: Running Options
+While there are serveral differences between the submodule command line options, each has available all the
+[common runtime options](/commonoptions.md) like other Harpy modules. Each requires and input genome at the
+end of the command line, and each requires either a `--count` of variants to randomly simulate, or a `--vcf` of
+specific variants to simulate. There are also these unifying options among the different variant types:
+
+| argument | short name | type | required | description |
+| :-----|:-----|:-----|:-----|:-----|
+| `INPUT_GENOME`           |            | file path  |   **yes**  | The forward (or reverse) multiplexed FASTQ file                                      |
+| `--prefix` | | string | | Naming prefix for output files (default: `sim.{module_name}`)|
+| `--exclude-chr` | `-e` | file path | | Text file of chromosomes to avoid, one per line |
+| `--centromeres` | `-c` | file path | | GFF3 file of centromeres to avoid |
+| `--genes` | `-g` | file path | | GFF3 file of genes to avoid simulating over (see `snpindel` for caveat) |
+| `--heterozygosity` | `-z` | float between [0,1] | 0 | [% heterozygosity to simulate diploid later](#simulate-diploids) |
+| `--randomseed` |  | integer > 0 |  | Random seed for simulation |
+
+### snpindel
+The snp and indel variants are combined in this module because `simuG` allows simulating them together. The
+ratio parameters control different things for snp and indel variants and have special meanings when setting
+the value to either `9999` or `0` :
+- `--titv-ratio`
+    - `9999`: transitions only
+    - `0`: transversions only
+- `--indel-ratio`
+    - `9999`: insertions only
+    - `0`: deletions only
+
+| argument          | short name | type       | default |  description                                                 |
+|:------------------|:----------:|:-----------|:-------:|:-------------------------------------------------------------|
+| `--snp-vcf`| `-v` | file path | | VCF file of known snps to simulate |
+| `--indel-vcf` | `-i` | file path | | VCF file of known indels to simulate |
+| `--snp-count` | `-n` | integer | 0 | Number of random snps to simluate |
+| `--indel-count` |  `-m` | integer | 0 | Number of random indels to simluate |
+| `--titv-ratio` | `-r` | float  | 0.5 | Transition/Transversion ratio for snps |
+| `--indel-ratio` | `-d` | float  |  1 | Insertion/Deletion ratio for indels |
+| `--indel-size-alpha` | `-a` | float |  2.0 | Exponent Alpha for power-law-fitted indel size distribution|
+| `--indel-size-constant` | `-l` | float | 0.5 | Exponent constant for power-law-fitted indel size distribution |
+| `--snp-gene-constraints` | `-y` | string | | How to constrain randomly simulated SNPs {`noncoding`,`coding`,`2d`,`4d`} when using `--genes`|
+
+### inversion
+Inversions are when a section of a chromosome appears in the reverse orientation.
+
+| argument          | short name | type       | default |  description     |
+|:------------------|:----------:|:-----------|:-------:|:----------------|
+| `--vcf` | `-v` | file path |  |  VCF file of known inversions to simulate |
+| `--count`| `-n` | integer | 0 |  Number of random inversions to simluate |
+| `--min-size` | `-m` | integer | 1000 | Minimum inversion size (bp) |
+| `--max-size` | `-x` | integer | 100000 | Maximum inversion size (bp) |
+
+## Simulate known variants
+
+## Simulate diploids

From d0bf210fe74c4223f7d513224ac973293f43da64 Mon Sep 17 00:00:00 2001
From: pdimens <pdimens@live.com>
Date: Tue, 2 Apr 2024 16:37:14 -0400
Subject: [PATCH 3/5] update

---
 Modules/Simulate/simulate-variants.md | 7 +++++--
 retype.yml                            | 2 +-
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/Modules/Simulate/simulate-variants.md b/Modules/Simulate/simulate-variants.md
index d359d5750..4ae910fd6 100644
--- a/Modules/Simulate/simulate-variants.md
+++ b/Modules/Simulate/simulate-variants.md
@@ -45,14 +45,15 @@ specific variants to simulate. There are also these unifying options among the d
 
 | argument | short name | type | required | description |
 | :-----|:-----|:-----|:-----|:-----|
-| `INPUT_GENOME`           |            | file path  |   **yes**  | The forward (or reverse) multiplexed FASTQ file                                      |
+| `INPUT_GENOME`           |            | file path  |   **yes**  | The haploid genome to simulate variants onto   |
 | `--prefix` | | string | | Naming prefix for output files (default: `sim.{module_name}`)|
 | `--exclude-chr` | `-e` | file path | | Text file of chromosomes to avoid, one per line |
 | `--centromeres` | `-c` | file path | | GFF3 file of centromeres to avoid |
 | `--genes` | `-g` | file path | | GFF3 file of genes to avoid simulating over (see `snpindel` for caveat) |
-| `--heterozygosity` | `-z` | float between [0,1] | 0 | [% heterozygosity to simulate diploid later](#simulate-diploids) |
+| `--heterozygosity` | `-z` | float between [0,1] | | [% heterozygosity to simulate diploid later](#simulate-diploids) (default: `0`) |
 | `--randomseed` |  | integer > 0 |  | Random seed for simulation |
 
+==- snpindel
 ### snpindel
 The snp and indel variants are combined in this module because `simuG` allows simulating them together. The
 ratio parameters control different things for snp and indel variants and have special meanings when setting
@@ -76,6 +77,7 @@ the value to either `9999` or `0` :
 | `--indel-size-constant` | `-l` | float | 0.5 | Exponent constant for power-law-fitted indel size distribution |
 | `--snp-gene-constraints` | `-y` | string | | How to constrain randomly simulated SNPs {`noncoding`,`coding`,`2d`,`4d`} when using `--genes`|
 
+==- inversion
 ### inversion
 Inversions are when a section of a chromosome appears in the reverse orientation.
 
@@ -86,6 +88,7 @@ Inversions are when a section of a chromosome appears in the reverse orientation
 | `--min-size` | `-m` | integer | 1000 | Minimum inversion size (bp) |
 | `--max-size` | `-x` | integer | 100000 | Maximum inversion size (bp) |
 
+===
 ## Simulate known variants
 
 ## Simulate diploids
diff --git a/retype.yml b/retype.yml
index 61fead8e2..9a7017d92 100644
--- a/retype.yml
+++ b/retype.yml
@@ -17,7 +17,7 @@ footer:
   copyright: "&copy; Copyright {{ year }}. All rights reserved."
 branding:
   title: Harpy
-  label: v0.8.x
+  label: v0.9.x
   logo: static/favicon.png
   logoDark: static/favicon.png
   logoAlign: left

From 7ac820d746fa4889e8b97079c27791296d85545e Mon Sep 17 00:00:00 2001
From: pdimens <pdimens@live.com>
Date: Wed, 3 Apr 2024 11:56:03 -0400
Subject: [PATCH 4/5] update much text

---
 Modules/Simulate/simulate-variants.md | 185 ++++++++++++++++++++++++--
 1 file changed, 171 insertions(+), 14 deletions(-)

diff --git a/Modules/Simulate/simulate-variants.md b/Modules/Simulate/simulate-variants.md
index 4ae910fd6..46814e945 100644
--- a/Modules/Simulate/simulate-variants.md
+++ b/Modules/Simulate/simulate-variants.md
@@ -43,18 +43,20 @@ While there are serveral differences between the submodule command line options,
 end of the command line, and each requires either a `--count` of variants to randomly simulate, or a `--vcf` of
 specific variants to simulate. There are also these unifying options among the different variant types:
 
-| argument | short name | type | required | description |
-| :-----|:-----|:-----|:-----|:-----|
-| `INPUT_GENOME`           |            | file path  |   **yes**  | The haploid genome to simulate variants onto   |
-| `--prefix` | | string | | Naming prefix for output files (default: `sim.{module_name}`)|
-| `--exclude-chr` | `-e` | file path | | Text file of chromosomes to avoid, one per line |
-| `--centromeres` | `-c` | file path | | GFF3 file of centromeres to avoid |
-| `--genes` | `-g` | file path | | GFF3 file of genes to avoid simulating over (see `snpindel` for caveat) |
-| `--heterozygosity` | `-z` | float between [0,1] | | [% heterozygosity to simulate diploid later](#simulate-diploids) (default: `0`) |
-| `--randomseed` |  | integer > 0 |  | Random seed for simulation |
-
-==- snpindel
+| argument | short name | type |  description |
+| :-----|:-----|:-----|:-----|
+| `INPUT_GENOME`           |    | file path  |  The haploid genome to simulate variants onto. **REQUIRED**   |
+| `--prefix` | | string |  Naming prefix for output files (default: `sim.{module_name}`)|
+| `--exclude-chr` | `-e` | file path | Text file of chromosomes to avoid, one per line |
+| `--centromeres` | `-c` | file path | GFF3 file of centromeres to avoid |
+| `--genes` | `-g` | file path |  GFF3 file of genes to avoid simulating over (see `snpindel` for caveat) |
+| `--heterozygosity` | `-z` | float between [0,1] |  [% heterozygosity to simulate diploid later](#heterozygosity) (default: `0`) |
+| `--randomseed` |  | integer > 0 |   Random seed for simulation |
+
+==- snps and indels
 ### snpindel
+A single nucleotide polymorphism ("SNP") is a genomic variant at a single base position in the DNA ([source](https://www.genome.gov/genetics-glossary/Single-Nucleotide-Polymorphisms)).
+An indel, is a type of mutation that involves the addition/deletion of one or more nucleotides into a segment of DNA ([insertions](https://www.genome.gov/genetics-glossary/Insertion), [deletions](https://www.genome.gov/genetics-glossary/Deletion)).
 The snp and indel variants are combined in this module because `simuG` allows simulating them together. The
 ratio parameters control different things for snp and indel variants and have special meanings when setting
 the value to either `9999` or `0` :
@@ -77,9 +79,9 @@ the value to either `9999` or `0` :
 | `--indel-size-constant` | `-l` | float | 0.5 | Exponent constant for power-law-fitted indel size distribution |
 | `--snp-gene-constraints` | `-y` | string | | How to constrain randomly simulated SNPs {`noncoding`,`coding`,`2d`,`4d`} when using `--genes`|
 
-==- inversion
+==- inversions
 ### inversion
-Inversions are when a section of a chromosome appears in the reverse orientation.
+Inversions are when a section of a chromosome appears in the reverse orientation ([source](https://www.genome.gov/genetics-glossary/Inversion)).
 
 | argument          | short name | type       | default |  description     |
 |:------------------|:----------:|:-----------|:-------:|:----------------|
@@ -88,7 +90,162 @@ Inversions are when a section of a chromosome appears in the reverse orientation
 | `--min-size` | `-m` | integer | 1000 | Minimum inversion size (bp) |
 | `--max-size` | `-x` | integer | 100000 | Maximum inversion size (bp) |
 
+==- copy number variants
+### cnv
+A copy number variation (CNV) is when the number of copies of a particular gene varies
+between individuals ([source](https://www.genome.gov/genetics-glossary/Copy-Number-Variation))
+The ratio parameters control different things and have special meanings when setting
+the value to either `9999` or `0` :
+- `--dup-ratio`
+    - `9999`: tandem duplications only
+    - `0`: dispersed duplications only
+- `--gain-ratio`
+    - `9999`: gain only
+    - `0`: loss only
+
+| argument          | short name | type       | default |  description     |
+|:------------------|:----------:|:-----------|:-------:|:----------------|
+| `--vcf` | `-v` | file path | | VCF file of known copy number variants to simulate |
+| `--count` | `-n` | integer | 0 | Number of random cnv to simluate |
+| `--min-size` | `-m` | integer |  1000 | Minimum cnv size (bp) |
+| `--max-size`|   `-x` | integer |100000 | Maximum cnv size (bp) |
+| `--max-copy` |  `-y` | integer | 10 | Maximum number of copies |
+| `--dup-ratio` | `-d` | float |  1 | Tandem/Dispersed duplication ratio |
+| `--gain-ratio` |`-l` | float |  1 | Relative ratio of DNA gain over DNA loss |
+
+
+==- translocations
+### translocation
+A translocation occurs when a chromosome breaks and the fragmented pieces re-attach to different chromosomes ([source](https://www.genome.gov/genetics-glossary/Translocation)). 
+
+| argument          | short name | type       | default |  description     |
+|:------------------|:----------:|:-----------|:-------:|:----------------|
+| `--vcf` | `-v` | file path |  |  VCF file of known inversions to simulate |
+| `--count`| `-n` | integer | 0 |  Number of random inversions to simluate |
+
 ===
+
 ## Simulate known variants
+Rather than simulating random variants, you can use a VCF file as input to any of the submodules
+to have `simuG` simulate the variants (of that type) from the VCF file. This becomes particularly
+handy because the modules output a VCF file of the variants that were introduced, which you can
+modify and reuse as you see fit (see [heterozygosity](#heterozygosity)). Using `--genes`, 
+`--centromeres`, or `--exclude-chr` would still avoid creating variants in those regions as with
+random simulation, except with SNPs, where you would have to specify the contraints for using 
+`--genes` as per usual.
+
+## Heterozygosity
+Each submodule has a `--heterozygosity` parameter where you can specify the heterozygosity of
+an intended diploid genome, should you use the resulting VCF(s) to simulate variants again.
+ This **does not** create a diploid genome for you, rather, it
+takes the output VCF from `simuG` and creates two new VCF files (`{prefix}.hap1.vcf`,
+`{prefix}.hap2.vcf`) that have their variants shuffled between the two haplotypes to
+achieve the desired heterozygosity. You can then use the two resulting VCF files to
+[simulate known variants](#simulate-known-variants) onto a genome and create a diploid genome!
+A simplified workflow is provided below to explain that in better context. 
+
+==- How the paramater is actually used
+To understand how heterozygosity is created from the `simuG` VCF output, consider a genome
+with 5 variants added to it, here represented as a column labelled `h1` with `1` being the presence
+of a variant (the ALT allele).
+```
+h1
+1
+1
+1
+1
+1
+```
+If we were to simulate those same variants onto the genome again, it would create a homozygote
+at every position (`h2` is the second haplotype):
+```
+h1 h2
+1  1
+1  1
+1  1
+1  1
+1  1
+```
+However, if we omit some of the variants on `h2` to create 40% heterozygosity (2/5),
+we would now have heterozygotes, except the ALT allele for the heterozygote would
+only every be on the first haplotype `h1`:
+```
+h1 h2
+1  1
+1  1
+1  1
+1     <- heterozygote with ALT on h1
+1     <- heterozygote with ALT on h1
+```
+It would probably be more biologically sound to then make sure that the ALT allele
+in the heterozygote can appear in either haplotype:
+```
+h1 h2
+1  1
+1  1
+1  1
+   1  <- heterozygote with ALT on h2
+1     <- heterozygote with ALT on h1
+```
+So that's the logic behind the `--heterozygosity` parameter and why it ouputs 3 VCF files:
+1. the VCF `simuG` outputs of variants added to the genome
+2. haplotype 1 of that VCF file with some of the variants
+3. haplotype 2 of that VCF file with some of the variants
+
+Knowing that, you can then have a workflow to start with a haploid assembly and create
+a diploid assembly with simulated variants.
+==-
+
+## Simulate Diploid Assembly
+Here is a simple but realistic workflow of creating a diploid assembly with simulated variants.
+If you haven't already, please read the sections about [simulating known variants](#simulate-known-variants)
+and [heterozygosity](#heterozygosity). The idea here is that due to the limitations of `simuG`, we can
+only simulate one type of variant at a time and we will take advantage of the VCF files created by
+the `--heterozygosity` parameter to simulate known variants from random ones! The basic idea is such:
+
+#### Step 1
+Simulate random variants onto your haploid assembly with `--heterozygosity` (`-z`) set above `0`.
+```mermaid
+graph LR
+    geno(haploid genome)-->|simulate inversion -n 10 -z 0.5|hap(inversions.vcf)
+    hap-->hap1(inversion.hap1.vcf)
+    hap-->hap2(inversion.hap2.vcf)
+```
+#### Step 2
+Use the resulting hap1 and hap2 VCF files to simulate those same variants onto the (original) haploid genome. 
+```mermaid
+graph LR
+    hap1(inversion.hap1.vcf)-->|simulate inversion -v|geno
+    geno(haploid genome)-->genohap1(haplotype-1 genome)
+```
+```mermaid
+graph LR
+    hap1(inversion.hap2.vcf)-->|simulate inversion -v|geno
+    geno(haploid genome)-->genohap1(haplotype-2 genome)
+```
+#### Step 3
+Use the one of the new genome haplotypes for simulating other kinds of variants. 
+Again, use `--heterozygosity` (`-z`) with a value greater than `0`.
+```mermaid
+graph LR
+    geno(haplotype-1 genome)-->|simulate snpindel -n 100000 -z 0.5|hap(snpindel.vcf)
+    hap-->hap1(snpindel.hap1.vcf)
+    hap-->hap2(snpindel.hap2.vcf)
+```
+#### Step 4
+Use the resulting haplotype VCFs to simulate known variants onto the haplotype genomes from
+Step 2.
+
+```mermaid
+graph LR
+    hap1(snpindel.hap1.vcf)-->|simulate snpindel -v|geno
+    geno(haplotype-1 genome)-->genohap1(haplotype-1 genome with new variants)
+```
+```mermaid
+graph LR
+    hap1(snpindel.hap2.vcf)-->|simulate snpindel -v|geno
+    geno(haplotype-2 genome)-->genohap1(haplotype-2 genome with new variants)
+```
 
-## Simulate diploids
+#### Step 5
+Repeat [**Step 3**](#step-3) and [**Step 4**](#step-4) to your heart's content.
\ No newline at end of file

From fab71bcbdc57c445ba3c44a7f1228cfee75f65c8 Mon Sep 17 00:00:00 2001
From: pdimens <pdimens@live.com>
Date: Wed, 3 Apr 2024 12:59:08 -0400
Subject: [PATCH 5/5] nicer

---
 Modules/Simulate/simulate-variants.md | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/Modules/Simulate/simulate-variants.md b/Modules/Simulate/simulate-variants.md
index 46814e945..9d172bbe1 100644
--- a/Modules/Simulate/simulate-variants.md
+++ b/Modules/Simulate/simulate-variants.md
@@ -28,6 +28,7 @@ harpy simulate variant OPTIONS... INPUT_GENOME
 ```bash example
 harpy simulate inversion -n 10 --min-size 1000 --max-size 50000  path/to/genome.fasta
 ```
+## Modules
 There are 4 submodules with very obvious names:
 
 | submodule | what it does |
@@ -51,7 +52,7 @@ specific variants to simulate. There are also these unifying options among the d
 | `--centromeres` | `-c` | file path | GFF3 file of centromeres to avoid |
 | `--genes` | `-g` | file path |  GFF3 file of genes to avoid simulating over (see `snpindel` for caveat) |
 | `--heterozygosity` | `-z` | float between [0,1] |  [% heterozygosity to simulate diploid later](#heterozygosity) (default: `0`) |
-| `--randomseed` |  | integer > 0 |   Random seed for simulation |
+| `--randomseed` |  | integer |   Random seed for simulation |
 
 ==- snps and indels
 ### snpindel
@@ -79,6 +80,10 @@ the value to either `9999` or `0` :
 | `--indel-size-constant` | `-l` | float | 0.5 | Exponent constant for power-law-fitted indel size distribution |
 | `--snp-gene-constraints` | `-y` | string | | How to constrain randomly simulated SNPs {`noncoding`,`coding`,`2d`,`4d`} when using `--genes`|
 
+!!!warning SNPs can be slow
+Given software limitations, simulating many SNPs (>10,000) will be noticeably slower than the other variant types.
+!!!
+
 ==- inversions
 ### inversion
 Inversions are when a section of a chromosome appears in the reverse orientation ([source](https://www.genome.gov/genetics-glossary/Inversion)).
@@ -187,6 +192,7 @@ h1 h2
    1  <- heterozygote with ALT on h2
 1     <- heterozygote with ALT on h1
 ```
+Within Harpy, a heterozygous variant has a 50% chance of being assigned to one of the haplotypes.
 So that's the logic behind the `--heterozygosity` parameter and why it ouputs 3 VCF files:
 1. the VCF `simuG` outputs of variants added to the genome
 2. haplotype 1 of that VCF file with some of the variants
@@ -205,6 +211,7 @@ the `--heterozygosity` parameter to simulate known variants from random ones! Th
 
 #### Step 1
 Simulate random variants onto your haploid assembly with `--heterozygosity` (`-z`) set above `0`.
+We aren't interested in the resulting genome, but rather the positions of the variants `simuG` created.
 ```mermaid
 graph LR
     geno(haploid genome)-->|simulate inversion -n 10 -z 0.5|hap(inversions.vcf)
@@ -212,7 +219,9 @@ graph LR
     hap-->hap2(inversion.hap2.vcf)
 ```
 #### Step 2
-Use the resulting hap1 and hap2 VCF files to simulate those same variants onto the (original) haploid genome. 
+Use the resulting hap1 and hap2 VCF files to simulate those same variants, but shuffled
+into homozygotes and heterozygotes, onto the original haploid genome, creating two haplotype
+genomes. 
 ```mermaid
 graph LR
     hap1(inversion.hap1.vcf)-->|simulate inversion -v|geno
@@ -225,7 +234,9 @@ graph LR
 ```
 #### Step 3
 Use the one of the new genome haplotypes for simulating other kinds of variants. 
-Again, use `--heterozygosity` (`-z`) with a value greater than `0`.
+Again, use `--heterozygosity` (`-z`) with a value greater than `0`. Like [**Step 1**](#step-1),
+we're only interested in the haplotype VCF files (positions of variants) and not the resulting
+genome.
 ```mermaid
 graph LR
     geno(haplotype-1 genome)-->|simulate snpindel -n 100000 -z 0.5|hap(snpindel.vcf)
@@ -233,7 +244,7 @@ graph LR
     hap-->hap2(snpindel.hap2.vcf)
 ```
 #### Step 4
-Use the resulting haplotype VCFs to simulate known variants onto the haplotype genomes from
+Use the resulting haplotype VCFs to simulate known variants onto the **haplotype genomes** from
 Step 2.
 
 ```mermaid