fix assembly
pdimens committed Nov 6, 2024
1 parent dc21da4 commit cdcc0f0
Showing 9 changed files with 106 additions and 95 deletions.
6 changes: 3 additions & 3 deletions Workflows/SV/leviathan.md
@@ -69,9 +69,9 @@ In addition to the [!badge variant="info" corners="pill" text="common runtime op
| `--contigs` | | file path or list | | | [Contigs to plot](/commonoptions.md#--contigs) in the report |
| `--extra-params` | `-x` | string | | | Additional leviathan arguments, in quotes |
| `--genome` | `-g` | file path | | ‼️ | Genome assembly that was used to create alignments |
| `--iterations` | `-i` | integer | 50 | | Number of iterations to perform through index (reduces memory) |
| `--min-barcodes` | `-b` | integer | 2 | | Minimum number of barcode overlaps supporting candidate SV |
| `--min-sv` | `-m` | integer | 1000 | | Minimum size of SV to detect |
| `--iterations` | `-i` | integer | `50` | | Number of iterations to perform through index (reduces memory) |
| `--min-barcodes` | `-b` | integer | `2` | | Minimum number of barcode overlaps supporting candidate SV |
| `--min-sv` | `-m` | integer | `1000` | | Minimum size of SV to detect |
| `--populations` | `-p` | file path | | | Tab-delimited file of sample\<*tab*\>group |
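
The `--populations` file described in the table is a plain two-column, tab-delimited listing of sample and group. As a minimal sketch of that layout (the function name `read_populations` and the sample names are hypothetical, and skipping blank lines and `#` comment lines is an assumption rather than documented Harpy behavior), such a file could be parsed like this:

```python
import io
from collections import defaultdict


def read_populations(handle):
    """Group sample names by population from sample<tab>group lines."""
    groups = defaultdict(list)
    for line in handle:
        line = line.strip()
        # Assumption: blank lines and '#' comment lines are ignored.
        if not line or line.startswith("#"):
            continue
        # Only the first two tab-separated fields are used.
        sample, group = line.split("\t")[:2]
        groups[group].append(sample)
    return dict(groups)


# Hypothetical samples and groups, for illustration only.
example = io.StringIO("sample1\tpop1\nsample2\tpop1\nsample3\tpop2\n")
populations = read_populations(example)
print(populations)
```

In practice the handle would come from `open()` on the file passed to `--populations`; the in-memory example only keeps the sketch self-contained.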

### Single-sample variant calling
8 changes: 4 additions & 4 deletions Workflows/SV/naibr.md
@@ -69,10 +69,10 @@ In addition to the [!badge variant="info" corners="pill" text="common runtime op
| `--contigs` | | file path or list | | | [Contigs to plot](/commonoptions.md#--contigs) in the report |
| `--extra-params` | `-x` | string | | | Additional naibr arguments, in quotes |
| `--genome` | `-g` | file path | | ‼️ | Genome assembly for phasing bam files |
| `--min-barcodes` | `-b` | integer | 2 | | Minimum number of barcode overlaps supporting candidate SV |
| `--min-quality` | `-q` | integer (0-40) | 30 | | Minimum `MQ` (SAM mapping quality) to pass filtering |
| `--min-sv` | `-n` | integer | 1000 | | Minimum size of SV to detect |
| `--molecule-distance` | `-m` | integer | 100000 | | Base-pair distance threshold to separate molecules |
| `--min-barcodes` | `-b` | integer | `2` | | Minimum number of barcode overlaps supporting candidate SV |
| `--min-quality` | `-q` | integer (0-40) | `30` | | Minimum `MQ` (SAM mapping quality) to pass filtering |
| `--min-sv` | `-n` | integer | `1000` | | Minimum size of SV to detect |
| `--molecule-distance` | `-m` | integer | `100000` | | Base-pair distance threshold to separate molecules |
| `--populations` | `-p` | file path | | | Tab-delimited file of sample\<*tab*\>group |
| `--vcf` | `-v` | file path | | | Phased vcf file for phasing bam files ([see below](#optional-vcf-file)) |

24 changes: 12 additions & 12 deletions Workflows/Simulate/simulate-linkedreads.md
@@ -45,18 +45,18 @@ harpy simulate linkedreads -t 4 -n 2 -l 100 -p 50 data/genome.hap1.fasta data/
In addition to the [!badge variant="info" corners="pill" text="common runtime options"](/commonoptions.md), the [!badge corners="pill" text="simulate linkedreads"] module is configured using these command-line arguments:

{.compact}
| argument | short name | type | default | required | description |
|:---------------|:----------:|:------------|:-------------:|:--------:|:------------------------------------------------------------------------------------------------|
| `HAP1_GENOME` | | file path | | ‼️ | Haplotype 1 of the diploid genome to simulate reads |
| `HAP2_GENOME` | | file path | | ‼️ | Haplotype 2 of the diploid genome to simulate reads |
| `--barcodes` | `-b` | file path | [10X barcodes](https://github.com/aquaskyline/LRSIM/blob/master/4M-with-alts-february-2016.txt) | | File of linked-read barcodes to add to reads |
| `--distance-sd` | `-s` | integer | 15 | | Standard deviation of read-pair distance |
| `--molecule-length` | `-l` | integer | 100 | | Mean molecule length (kbp) |
| `--molecules-per` | `-m` | integer | 10 | | Average number of molecules per partition |
| `--mutation-rate` | `-r` | number | 0.001 | | Random mutation rate for simulating reads (0 - 1.0) |
| `--outer-distance` | `-d` | integer | 350 | | Outer distance between paired-end reads (bp) |
| `--partitions` | `-p` | integer | 1500 | | Number (in thousands) of partitions/beads to generate |
| `--read-pairs` | `-n` | number | 600 | | Number (in millions) of read pairs to simulate |
| argument | short name | default | required | description |
| :------------------ | :--------: | :---------------------------------------------------------------------------------------------: | :------: | :---------------------------------------------------- |
| `HAP1_GENOME` | | | ‼️ | Haplotype 1 of the diploid genome to simulate reads |
| `HAP2_GENOME` | | | ‼️ | Haplotype 2 of the diploid genome to simulate reads |
| `--barcodes` | `-b` | [10X barcodes](https://github.com/aquaskyline/LRSIM/blob/master/4M-with-alts-february-2016.txt) | | File of linked-read barcodes to add to reads |
| `--distance-sd` | `-s` | `15` | | Standard deviation of read-pair distance |
| `--molecule-length` | `-l` | `100` | | Mean molecule length (kbp) |
| `--molecules-per` | `-m` | `10` | | Average number of molecules per partition |
| `--mutation-rate` | `-r` | `0.001` | | Random mutation rate for simulating reads (0 - 1.0) |
| `--outer-distance` | `-d` | `350` | | Outer distance between paired-end reads (bp) |
| `--partitions` | `-p` | `1500` | | Number (in thousands) of partitions/beads to generate |
| `--read-pairs` | `-n` | `600` | | Number (in millions) of read pairs to simulate |
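
Several of the options above are given in scaled units (partitions in thousands, read pairs in millions), which makes their interplay easy to misjudge. The following back-of-the-envelope sketch converts them to raw counts and estimates the average number of read pairs per molecule. The function name is hypothetical, and the even spread of read pairs across molecules is a simplifying assumption, not a description of how the simulator allocates reads.

```python
def read_pairs_per_molecule(partitions_thousands=1500,
                            molecules_per=10,
                            read_pairs_millions=600):
    """Estimate average read pairs per molecule from the scaled options."""
    partitions = partitions_thousands * 1_000       # -p is in thousands
    molecules = partitions * molecules_per          # -m molecules per partition
    read_pairs = read_pairs_millions * 1_000_000    # -n is in millions
    # Assumption: read pairs are spread evenly over all molecules.
    return read_pairs / molecules


defaults = read_pairs_per_molecule()
print(f"average read pairs per molecule with defaults: {defaults}")
```

Lowering the number of partitions or molecules per partition while keeping the read-pair count fixed raises this average, which is one quick way to sanity-check a parameter combination before running a long simulation.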

## Mutation Rate
The read simulation is two-part: first `dwgsim` generates forward and reverse FASTQ files from the provided genome haplotypes
108 changes: 54 additions & 54 deletions Workflows/Simulate/simulate-variants.md
@@ -32,12 +32,12 @@ harpy simulate inversion -n 10 --min-size 1000 --max-size 50000 path/to/genome.
There are 4 submodules with very obvious names:

{.compact}
| submodule | what it does |
|:----------|:-------------|
| [!badge corners="pill" text="snpindel"](#snpindel) | simulates single nucleotide polymorphisms (snps) and insertion-deletions (indels) |
| [!badge corners="pill" text="inversion"](#inversion) | simulates inversions |
| [!badge corners="pill" text="cnv"](#cnv) | simulates copy number variants |
| [!badge corners="pill" text="translocation"](#translocation) | simulates translocations |
| submodule | what it does |
| :----------------------------------------------------------- | :-------------------------------------------------------------------------------- |
| [!badge corners="pill" text="snpindel"](#snpindel) | simulates single nucleotide polymorphisms (snps) and insertion-deletions (indels) |
| [!badge corners="pill" text="inversion"](#inversion) | simulates inversions |
| [!badge corners="pill" text="cnv"](#cnv) | simulates copy number variants |
| [!badge corners="pill" text="translocation"](#translocation) | simulates translocations |

## :icon-terminal: Running Options
While there are several differences between individual workflow options, each has available all the
@@ -46,16 +46,16 @@ Each requires an input genome at the end of the command line, and each requires
to randomly simulate, or a `--vcf` of specific variants to simulate. There are also these unifying options among the different variant types:

{.compact}
| argument | short name | type | required | description |
| :-----|:-----|:-----|:---:|:-----|
| `INPUT_GENOME` | | file path | ‼️ | The haploid genome to simulate variants onto|
| `--centromeres` | `-c` | file path | | GFF3 file of centromeres to avoid |
| `--exclude-chr` | `-e` | file path | | Text file of chromosomes to avoid, one per line |
| `--genes` | `-g` | file path | | GFF3 file of genes to avoid simulating over (see `snpindel` for caveat) |
| `--heterozygosity` | `-z` | float between [0,1] | | [proportion of simulated variants to make heterozygous ](#heterozygosity) (default: `0`) |
| `--only-vcf` | | toggle | | When used with `--heterozygosity`, will create the diploid VCFs but will not simulate a diploid genome |
| `--prefix` | | string | | Naming prefix for output files (default: `sim.{module_name}`)|
| `--randomseed` | | integer | | Random seed for simulation |
| argument | short name | required | description |
| :----------------- | :--------- | :------: | :----------------------------------------------------------------------------------------------------- |
| `INPUT_GENOME` | | ‼️ | The haploid genome to simulate variants onto |
| `--centromeres` | `-c` | | GFF3 file of centromeres to avoid |
| `--exclude-chr` | `-e` | | Text file of chromosomes to avoid, one per line |
| `--genes` | `-g` | | GFF3 file of genes to avoid simulating over (see `snpindel` for caveat) |
| `--heterozygosity` | `-z` | | [proportion of simulated variants to make heterozygous ](#heterozygosity) (default: `0`) |
| `--only-vcf` | | | When used with `--heterozygosity`, will create the diploid VCFs but will not simulate a diploid genome |
| `--prefix` | | | Naming prefix for output files (default: `sim.{module_name}`) |
| `--randomseed` | | | Random seed for simulation |
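
To illustrate what "proportion of simulated variants to make heterozygous" means together with a random seed, here is a minimal sketch. The function name `pick_heterozygous` and the variant identifiers are hypothetical, and choosing an exact count with `round()` is an assumption; the documentation does not say whether Harpy selects an exact number of variants or decides independently per variant.

```python
import random


def pick_heterozygous(variant_ids, heterozygosity, seed=None):
    """Return the subset of variants treated as heterozygous."""
    if not 0 <= heterozygosity <= 1:
        raise ValueError("heterozygosity must be between 0 and 1")
    variants = list(variant_ids)
    # A seeded generator makes the selection reproducible.
    rng = random.Random(seed)
    # Assumption: an exact count is chosen rather than a per-variant coin flip.
    n_het = round(len(variants) * heterozygosity)
    return set(rng.sample(variants, n_het))


# Hypothetical variant identifiers, for illustration only.
variants = [f"var{i}" for i in range(100)]
het = pick_heterozygous(variants, 0.25, seed=42)
print(len(het), "of", len(variants), "variants chosen as heterozygous")
```

With a proportion of `0` (the default) no variant is selected, so the output would be effectively haploid; re-running with the same seed yields the same subset.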

!!!warning simulations can be slow
Given software limitations, simulating many variants **relative to the size of the input genome** will be noticeably slow.
@@ -69,72 +69,72 @@ An indel is a type of mutation that involves the addition/deletion of one or mo
The snp and indel variants are combined in this module because `simuG` allows simulating them together.

{.compact}
| argument | short name | type | default | description |
|:------------------|:----------:|:-----------|:-------:|:-------------------------------------------------------------|
| `--indel-count` | `-m` | integer | 0 | Number of random indels to simulate |
| `--indel-vcf` | `-i` | file path | | VCF file of known indels to simulate |
| `--indel-ratio` | `-d` | float | 1 | Insertion/Deletion ratio for indels |
| `--indel-size-alpha` | `-a` | float | 2.0 | Exponent Alpha for power-law-fitted indel size distribution|
| `--indel-size-constant` | `-l` | float | 0.5 | Exponent constant for power-law-fitted indel size distribution |
| `--snp-count` | `-n` | integer | 0 | Number of random snps to simulate |
| `--snp-gene-constraints` | `-y` | string | | How to constrain randomly simulated SNPs {`noncoding`,`coding`,`2d`,`4d`} when using `--genes`|
| `--snp-vcf`| `-s` | file path | | VCF file of known snps to simulate |
| `--titv-ratio` | `-r` | float | 0.5 | Transition/Transversion ratio for snps |
| argument | short name | default | description |
| :----------------------- | :--------: | :-----: | :--------------------------------------------------------------------------------------------- |
| `--indel-count` | `-m` | `0` | Number of random indels to simulate |
| `--indel-vcf` | `-i` | | VCF file of known indels to simulate |
| `--indel-ratio` | `-d` | `1` | Insertion/Deletion ratio for indels |
| `--indel-size-alpha` | `-a` | `2.0` | Exponent Alpha for power-law-fitted indel size distribution |
| `--indel-size-constant` | `-l` | `0.5` | Exponent constant for power-law-fitted indel size distribution |
| `--snp-count` | `-n` | `0` | Number of random snps to simulate |
| `--snp-gene-constraints` | `-y` | | How to constrain randomly simulated SNPs {`noncoding`,`coding`,`2d`,`4d`} when using `--genes` |
| `--snp-vcf` | `-s` | | VCF file of known snps to simulate |
| `--titv-ratio` | `-r` | `0.5` | Transition/Transversion ratio for snps |

The ratio parameters for snp and indel variants have special meanings when set
to either `0` or `9999`:

{.compact}
| ratio | `0` meaning | `9999` meaning |
|:---- |:---|:---|
| `--indel-ratio` | deletions only | insertions only |
| `--titv-ratio` | transversions only | transitions only |
| ratio | `0` meaning | `9999` meaning |
| :-------------- | :----------------- | :---------------- |
| `--indel-ratio` | deletions only | insertions only |
| `--titv-ratio` | transversions only | transitions only |
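
The two sentinel values can be read as the extremes of an ordinary ratio. The sketch below turns a ratio into the expected fraction of "numerator" events (insertions for `--indel-ratio`, transitions for `--titv-ratio`). The function name is hypothetical, and treating intermediate values as plain `a:1` odds is an assumption made for illustration, not something taken from the `simuG` source.

```python
def numerator_fraction(ratio):
    """Expected fraction of numerator-type events for a given ratio."""
    if ratio < 0:
        raise ValueError("ratio must be non-negative")
    if ratio == 0:
        return 0.0   # e.g. deletions only / transversions only
    if ratio == 9999:
        return 1.0   # e.g. insertions only / transitions only
    # Assumption: other values behave as odds of numerator:denominator.
    return ratio / (1 + ratio)


for value in (0, 0.5, 1, 9999):
    print(value, numerator_fraction(value))
```

Under this reading, the default `--indel-ratio` of `1` gives equal numbers of insertions and deletions on average, while the default `--titv-ratio` of `0.5` gives one transition for every two transversions.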

+++ 🔵 inversions
### inversion
Inversions are when a section of a chromosome appears in the reverse orientation ([source](https://www.genome.gov/genetics-glossary/Inversion)).

{.compact}
| argument | short name | type | default | description |
|:------------------|:----------:|:-----------|:-------:|:----------------|
| `--count`| `-n` | integer | 0 | Number of random inversions to simulate |
| `--max-size` | `-x` | integer | 100000 | Maximum inversion size (bp) |
| `--min-size` | `-m` | integer | 1000 | Minimum inversion size (bp) |
| `--vcf` | `-v` | file path | | VCF file of known inversions to simulate |
| argument | short name | default | description |
| :----------- | :--------: | :------: | :--------------------------------------- |
| `--count` | `-n` | `0` | Number of random inversions to simulate |
| `--max-size` | `-x` | `100000` | Maximum inversion size (bp) |
| `--min-size` | `-m` | `1000` | Minimum inversion size (bp) |
| `--vcf` | `-v` | | VCF file of known inversions to simulate |

+++ 🟢 copy number variants
### cnv
A copy number variation (CNV) is when the number of copies of a particular gene varies
between individuals ([source](https://www.genome.gov/genetics-glossary/Copy-Number-Variation)).

{.compact}
| argument | short name | type | default | description |
|:------------------|:----------:|:-----------|:-------:|:----------------|
| `--vcf` | `-v` | file path | | VCF file of known copy number variants to simulate |
| `--count` | `-n` | integer | 0 | Number of random cnv to simulate |
| `--dup-ratio` | `-d` | float | 1 | Tandem/Dispersed duplication ratio |
| `--gain-ratio` |`-l` | float | 1 | Relative ratio of DNA gain over DNA loss |
| `--max-size`| `-x` | integer |100000 | Maximum cnv size (bp) |
| `--max-copy` | `-y` | integer | 10 | Maximum number of copies |
| `--min-size` | `-m` | integer | 1000 | Minimum cnv size (bp) |
| argument | short name | default | description |
| :------------- | :--------: | :------: | :------------------------------------------------- |
| `--vcf` | `-v` | | VCF file of known copy number variants to simulate |
| `--count` | `-n` | `0` | Number of random cnv to simulate |
| `--dup-ratio` | `-d` | `1` | Tandem/Dispersed duplication ratio |
| `--gain-ratio` | `-l` | `1` | Relative ratio of DNA gain over DNA loss |
| `--max-size` | `-x` | `100000` | Maximum cnv size (bp) |
| `--max-copy` | `-y` | `10` | Maximum number of copies |
| `--min-size` | `-m` | `1000` | Minimum cnv size (bp) |

The ratio parameters have special meanings when set to either `0` or `9999`:

{.compact}
| ratio | `0` meaning | `9999` meaning |
|:---- |:---|:---|
| `--dup-ratio` | dispersed duplications only | tandem duplications only |
| `--gain-ratio` | loss only | gain only |
| ratio | `0` meaning | `9999` meaning |
| :------------- | :-------------------------- | :----------------------- |
| `--dup-ratio` | dispersed duplications only | tandem duplications only |
| `--gain-ratio` | loss only | gain only |

+++ 🟡 translocations
### translocation
A translocation occurs when a chromosome breaks and the fragmented pieces re-attach to different chromosomes ([source](https://www.genome.gov/genetics-glossary/Translocation)).

{.compact}
| argument | short name | type | default | description |
|:------------------|:----------:|:-----------|:-------:|:----------------|
| `--count`| `-n` | integer | 0 | Number of random translocations to simulate |
| `--vcf` | `-v` | file path | | VCF file of known translocations to simulate |
| argument | short name | default | description |
| :-------- | :--------: | :-----: | :--------------------------------------- |
| `--count` | `-n` | `0` | Number of random translocations to simulate |
| `--vcf` | `-v` | | VCF file of known translocations to simulate |

+++
