diff --git a/.gitignore b/.gitignore index dc467ef53..23ce969fa 100644 --- a/.gitignore +++ b/.gitignore @@ -7,8 +7,15 @@ src/ .harpy_envs/ build/ Demultiplex/ +Deconvolve/ QC/ SNP/ +SV/ Impute/ Phase/ -workflow/ \ No newline at end of file +Simulate/ +*ssembly/ +workflow/ +harpy/ +__pycache__/ +harpy.egg-info/ \ No newline at end of file diff --git a/Modules/Simulate/simulate-variants.md b/Modules/Simulate/simulate-variants.md deleted file mode 100644 index 386060a31..000000000 --- a/Modules/Simulate/simulate-variants.md +++ /dev/null @@ -1,301 +0,0 @@ ---- -label: Variants -description: Simulate snps, indels, inversions, cnv, translocations -icon: dot -#visibility: hidden -order: 6 ---- - -# :icon-flame: Simulate Genomic Variants -Simulate snps, indels, inversions, cnv, translocations - -=== :icon-checklist: You will need -- a reference genome in FASTA format: [!badge variant="success" text=".fasta"] [!badge variant="success" text=".fa"] [!badge variant="success" text=".fasta.gz"] [!badge variant="success" text=".fa.gz"] -=== - -You may want to benchmark haplotag data on different kinds of genomic variants. To -do that, you'll need *known* variants, and typically simulations are how you achieve -that. This series of modules simulates genomic variants onto a genome, either randomly -or specific variants provided in VCF files. The simulator Harpy uses, -[simuG](https://github.com/yjx1217/simuG), can only simulate one type of -variant at a time and each variant type has their own set of parameters. This page -is divided by variant types to help you navigate the process. The general usage -for simulating variants is: - -```bash usage -harpy simulate variant OPTIONS... 
INPUT_GENOME -``` -```bash example -harpy simulate inversion -n 10 --min-size 1000 --max-size 50000 path/to/genome.fasta -``` -## Modules -There are 4 submodules with very obvious names: - -{.compact} -| submodule | what it does | -|:----------|:-------------| -| [!badge corners="pill" text="snpindel"](#snpindel) | simulates single nucleotide polymorphisms (snps) and insertion-deletions (indels) | -| [!badge corners="pill" text="inversion"](#inversion) | simulates inversions | -| [!badge corners="pill" text="cnv"](#cnv) | simulates copy number variants | -| [!badge corners="pill" text="translocation"](#translocation) | simulates translocations | - -## :icon-terminal: Running Options -While there are serveral differences between the submodule command line options, each has available all the -[!badge variant="info" corners="pill" text="common runtime options"](/commonoptions.md) like other Harpy modules. Each requires and input genome at the -end of the command line, and each requires either a `--count` of variants to randomly simulate, or a `--vcf` of -specific variants to simulate. There are also these unifying options among the different variant types: - -{.compact} -| argument | short name | type | description | -| :-----|:-----|:-----|:-----| -| `INPUT_GENOME` | | file path | The haploid genome to simulate variants onto. 
**REQUIRED** | -| `--centromeres` | `-c` | file path | GFF3 file of centromeres to avoid | -| `--exclude-chr` | `-e` | file path | Text file of chromosomes to avoid, one per line | -| `--genes` | `-g` | file path | GFF3 file of genes to avoid simulating over (see `snpindel` for caveat) | -| `--heterozygosity` | `-z` | float between [0,1] | [% heterozygosity to simulate diploid later](#heterozygosity) (default: `0`) | -| `--prefix` | | string | Naming prefix for output files (default: `sim.{module_name}`)| -| `--randomseed` | | integer | Random seed for simulation | - -+++ 🟣 snps and indels -### snpindel -!!!warning SNPs can be slow -Given software limitations, simulating many SNPs (>10,000) will be noticeably slower than the other variant types. -!!! - -A single nucleotide polymorphism ("SNP") is a genomic variant at a single base position in the DNA ([source](https://www.genome.gov/genetics-glossary/Single-Nucleotide-Polymorphisms)). -An indel, is a type of mutation that involves the addition/deletion of one or more nucleotides into a segment of DNA ([insertions](https://www.genome.gov/genetics-glossary/Insertion), [deletions](https://www.genome.gov/genetics-glossary/Deletion)). -The snp and indel variants are combined in this module because `simuG` allows simulating them together. 
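To build intuition for the ratio-style options in this section (`--titv-ratio`, `--indel-ratio`), here is a minimal Python sketch of how a "first/second" ratio could be turned into a per-variant choice, including the `0` and `9999` sentinel values. The function name and the `ratio / (1 + ratio)` probability are illustrative assumptions, not simuG internals:

```python
import random

def pick_category(ratio, first, second, rng=None):
    """Choose between two mutation categories from a first/second ratio.

    Sentinels follow the CLI convention: 0 means `second` only,
    9999 means `first` only. (Illustrative sketch, not simuG code.)
    """
    rng = rng or random.Random()
    if ratio == 0:
        return second
    if ratio == 9999:
        return first
    # e.g. a transition/transversion ratio of 0.5 would yield
    # transitions with probability 0.5 / 1.5 = 1/3 under this model
    return first if rng.random() < ratio / (1 + ratio) else second
```

For example, `pick_category(0.5, "transition", "transversion")` mirrors the default `--titv-ratio`, and `pick_category(1, "insertion", "deletion")` mirrors the default `--indel-ratio`.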
- -{.compact} -| argument | short name | type | default | description | -|:------------------|:----------:|:-----------|:-------:|:-------------------------------------------------------------| -| `--indel-count` | `-m` | integer | 0 | Number of random indels to simluate | -| `--indel-vcf` | `-i` | file path | | VCF file of known indels to simulate | -| `--indel-ratio` | `-d` | float | 1 | Insertion/Deletion ratio for indels | -| `--indel-size-alpha` | `-a` | float | 2.0 | Exponent Alpha for power-law-fitted indel size distribution| -| `--indel-size-constant` | `-l` | float | 0.5 | Exponent constant for power-law-fitted indel size distribution | -| `--snp-count` | `-n` | integer | 0 | Number of random snps to simluate | -| `--snp-gene-constraints` | `-y` | string | | How to constrain randomly simulated SNPs {`noncoding`,`coding`,`2d`,`4d`} when using `--genes`| -| `--snp-vcf`| `-s` | file path | | VCF file of known snps to simulate | -| `--titv-ratio` | `-r` | float | 0.5 | Transition/Transversion ratio for snps | - -The ratio parameters for snp and indel variants and have special meanings when setting -the value to either `0` or `9999` : - -{.compact} -| ratio | `0` meaning | `9999` meaning | -|:---- |:---|:---| -| `--indel-ratio` | deletions only | insertions only | -| `--titv-ratio` | transversions only | transitions only | - -+++ 🔵 inversions -### inversion -Inversions are when a section of a chromosome appears in the reverse orientation ([source](https://www.genome.gov/genetics-glossary/Inversion)). 
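At the sequence level, an inversion is just a reverse-complemented interval. A small illustrative sketch (not simuG code) of what simulating one inversion does to a sequence:

```python
# translation table for complementing DNA bases
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def apply_inversion(seq: str, start: int, end: int) -> str:
    """Return `seq` with the half-open interval [start, end) reverse-complemented."""
    inverted = seq[start:end].translate(_COMPLEMENT)[::-1]
    return seq[:start] + inverted + seq[end:]
```

Note the sequence length is unchanged; the size options only bound the span of the inverted interval.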
- -{.compact} -| argument | short name | type | default | description | -|:------------------|:----------:|:-----------|:-------:|:----------------| -| `--count`| `-n` | integer | 0 | Number of random inversions to simluate | -| `--max-size` | `-x` | integer | 100000 | Maximum inversion size (bp) | -| `--min-size` | `-m` | integer | 1000 | Minimum inversion size (bp) | -| `--vcf` | `-v` | file path | | VCF file of known inversions to simulate | - -+++ 🟢 copy number variants -### cnv -A copy number variation (CNV) is when the number of copies of a particular gene varies -between individuals ([source](https://www.genome.gov/genetics-glossary/Copy-Number-Variation)). - -{.compact} -| argument | short name | type | default | description | -|:------------------|:----------:|:-----------|:-------:|:----------------| -| `--vcf` | `-v` | file path | | VCF file of known copy number variants to simulate | -| `--count` | `-n` | integer | 0 | Number of random cnv to simluate | -| `--dup-ratio` | `-d` | float | 1 | Tandem/Dispersed duplication ratio | -| `--gain-ratio` |`-l` | float | 1 | Relative ratio of DNA gain over DNA loss | -| `--max-size`| `-x` | integer |100000 | Maximum cnv size (bp) | -| `--max-copy` | `-y` | integer | 10 | Maximum number of copies | -| `--min-size` | `-m` | integer | 1000 | Minimum cnv size (bp) | - -The ratio parameters special meanings when setting the value to either `0` or `9999` : - -{.compact} -| ratio | `0` meaning | `9999` meaning | -|:---- |:---|:---| -| `--dup-ratio` | dispersed duplications only | tandem duplications only | -| `--gain-ratio` | loss only | gain only | - -+++ 🟡 translocations -### translocation -A translocation occurs when a chromosome breaks and the fragmented pieces re-attach to different chromosomes ([source](https://www.genome.gov/genetics-glossary/Translocation)). 
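A reciprocal translocation can be pictured as two chromosomes exchanging the segments downstream of their breakpoints. A toy sketch of that idea (illustrative only, not simuG's model):

```python
def reciprocal_translocation(chrom_a, chrom_b, break_a, break_b):
    """Swap everything after each breakpoint between two chromosome sequences."""
    new_a = chrom_a[:break_a] + chrom_b[break_b:]
    new_b = chrom_b[:break_b] + chrom_a[break_a:]
    return new_a, new_b
```

Total sequence content is conserved even though the individual chromosome lengths may change.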
- -{.compact} -| argument | short name | type | default | description | -|:------------------|:----------:|:-----------|:-------:|:----------------| -| `--count`| `-n` | integer | 0 | Number of random inversions to simluate | -| `--vcf` | `-v` | file path | | VCF file of known inversions to simulate | - -+++ - -## Simulate known variants -Rather than simulating random variants, you can use a VCF file as input to any of the submodules -to have `simuG` simulate the variants (of that type) from the VCF file. This becomes particularly -handy because the modules output a VCF file of the variants that were introduced, which you can -modify and reuse as you see fit (see [heterozygosity](#heterozygosity)). Using `--genes`, -`--centromeres`, or `--exclude-chr` would still avoid creating variants in those regions as with -random simulation, except with SNPs, where you would have to specify the contraints for using -`--genes` as per usual. - -## Heterozygosity -Each submodule has a `--heterozygosity` parameter where you can specify the heterozygosity of -an intended diploid genome, should you use the resulting VCF(s) to simulate variants again. - This **does not** create a diploid genome for you, rather, it -takes the output VCF from `simuG` and creates two new VCF files (`{prefix}.hap1.vcf`, -`{prefix}.hap2.vcf`) that have their variants shuffled between the two haplotypes to -achieve the desired heterozygosity. You can then use the two resulting VCF files to -[simulate known variants](#simulate-known-variants) onto a genome and create a diploid genome! -A simplified workflow is provided below to explain that in better context. - -==- How the paramater is actually used -To understand how heterozygosity is created from the `simuG` VCF output, consider a genome -with 5 variants added to it, here represented as a column labelled `h1` with `1` being the presence -of a variant (the ALT allele). 
-``` -h1 -1 -1 -1 -1 -1 -``` -If we were to simulate those same variants onto the genome again, it would create a homozygote -at every position (`h2` is the second haplotype): -``` -h1 h2 -1 1 -1 1 -1 1 -1 1 -1 1 -``` -However, if we omit some of the variants on `h2` to create 40% heterozygosity (2/5), -we would now have heterozygotes, except the ALT allele for the heterozygote would -only every be on the first haplotype `h1`: -``` -h1 h2 -1 1 -1 1 -1 1 -1 <- heterozygote with ALT on h1 -1 <- heterozygote with ALT on h1 -``` -It would probably be more biologically sound to then make sure that the ALT allele -in the heterozygote can appear in either haplotype: -``` -h1 h2 -1 1 -1 1 -1 1 - 1 <- heterozygote with ALT on h2 -1 <- heterozygote with ALT on h1 -``` -Within Harpy, a heterozygous variant has a 50% chance of being assigned to one of the haplotypes. -So that's the logic behind the `--heterozygosity` parameter and why it ouputs 3 VCF files: -1. the VCF `simuG` outputs of variants added to the genome -2. haplotype 1 of that VCF file with some of the variants -3. haplotype 2 of that VCF file with some of the variants - -Knowing that, you can then have a workflow to start with a haploid assembly and create -a diploid assembly with simulated variants. -==- - -## Simulate Diploid Assembly -Here is a simple but realistic workflow of creating a diploid assembly with simulated variants. Due -to the roundabout complexity of the process, attempts were made to use color to help keep track of the -original haploid genome and the resulting genome haplotypes. -If you haven't already, please read the sections about [simulating known variants](#simulate-known-variants) -and [heterozygosity](#heterozygosity). The idea here is that due to the limitations of `simuG`, we can -only simulate one type of variant at a time and we will take advantage of the VCF files created by -the `--heterozygosity` parameter to simulate known variants from random ones! 
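The haplotype-shuffling logic described above can be sketched in a few lines of Python (an illustration of the idea, not Harpy's actual code):

```python
import random

def split_haplotypes(variants, heterozygosity, seed=None):
    """Assign simulated variants to two haplotype sets.

    Roughly a `heterozygosity` fraction of sites become heterozygous and
    land on haplotype 1 or 2 with equal probability; the remainder are
    written to both haplotypes (homozygous).
    """
    rng = random.Random(seed)
    hap1, hap2 = [], []
    for variant in variants:
        if rng.random() < heterozygosity:
            # heterozygote: ALT allele on one haplotype, chosen at random
            (hap1 if rng.random() < 0.5 else hap2).append(variant)
        else:
            # homozygote: ALT allele on both haplotypes
            hap1.append(variant)
            hap2.append(variant)
    return hap1, hap2
```

With `heterozygosity=0` both haplotype sets contain every variant (all homozygotes); with `1` every variant is heterozygous and appears in exactly one set.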
The basic idea is such: - -#### Step 1 -Simulate random variants onto your haploid assembly with `--heterozygosity` (`-z`) set above `0`. -We aren't interested in the resulting genome, but rather the positions of the variants `simuG` created. -```mermaid -graph LR - geno(haploid genome)-->|simulate inversion -n 10 -z 0.5|hap(inversions.vcf):::clean - hap-->hap1(inversion.hap1.vcf) - hap-->hap2(inversion.hap2.vcf) - style geno fill:#ebb038,stroke:#d19b2f,stroke-width:2px - style hap1 fill:#f5f6f9,stroke:#90c8be,stroke-width:2px - style hap2 fill:#f5f6f9,stroke:#bd8fcb,stroke-width:2px - classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px -``` -#### Step 2 -Use the resulting hap1 and hap2 VCF files to simulate those same variants, but shuffled -into homozygotes and heterozygotes, onto the original haploid genome, creating two haplotype -genomes. -```mermaid -graph LR - subgraph id1 ["Haplotype 1 Inputs"] - hap1(inversion.hap1.vcf)---geno(haploid genome) - end - id1-->|simulate inversion -v|hapgeno(haplotype-1 genome) - style id1 fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px - style hap1 fill:#f5f6f9,stroke:#90c8be,stroke-width:2px - style hapgeno fill:#90c8be,stroke:#6fb6a9,stroke-width:2px - style geno fill:#ebb038,stroke:#d19b2f,stroke-width:2px -``` -```mermaid -graph LR - subgraph id2 ["Haplotype 2 Inputs"] - hap2(inversion.hap2.vcf)---geno(haploid genome) - end - id2-->|simulate inversion -v|hapgeno2(haplotype-2 genome) - style id2 fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px - style hap2 fill:#f5f6f9,stroke:#bd8fcb,stroke-width:2px - style hapgeno2 fill:#bd8fcb,stroke:#a460b7,stroke-width:2px - style geno fill:#ebb038,stroke:#d19b2f,stroke-width:2px -``` -#### Step 3 -Use the one of the new genome haplotypes for simulating other kinds of variants. -Again, use `--heterozygosity` (`-z`) with a value greater than `0`. Like [**Step 1**](#step-1), -we're only interested in the haplotype VCF files (positions of variants) and not the resulting -genome. 
-```mermaid -graph LR - geno(haplotype-1 genome)-->|simulate snpindel -n 100000 -z 0.5|hap(snpindel.vcf):::clean - hap-->hap1(snpindel.hap1.vcf) - hap-->hap2(snpindel.hap2.vcf) - style geno fill:#90c8be,stroke:#6fb6a9,stroke-width:2px - style hap1 fill:#f5f6f9,stroke:#90c8be,stroke-width:2px - style hap2 fill:#f5f6f9,stroke:#bd8fcb,stroke-width:2px - classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px -``` -#### Step 4 -Use the resulting haplotype VCFs to simulate known variants onto the **haplotype genomes** from -[Step 2](#step-2). -```mermaid -graph LR - subgraph id1 ["Haplotype 1 inputs"] - hap1(snpindel.hap1.vcf)---geno(haplotype-1 genome) - end - id1-->|simulate inversion -v|genohap1(haplotype-1 genome with new variants) - style id1 fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px - style geno fill:#90c8be,stroke:#6fb6a9,stroke-width:2px - style hap1 fill:#f5f6f9,stroke:#90c8be,stroke-width:2px - style genohap1 fill:#90c8be,stroke:#000000,stroke-width:2px -``` -```mermaid -graph LR - subgraph id2 ["Haplotype 2 inputs"] - hap2(snpindel.hap2.vcf)---geno(haplotype-2 genome) - end - id2-->|simulate inversion -v|genohap2(haplotype-2 genome with new variants) - style id2 fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px - style geno fill:#bd8fcb,stroke:#a460b7,stroke-width:2px - style hap2 fill:#f5f6f9,stroke:#bd8fcb,stroke-width:2px - style genohap2 fill:#bd8fcb,stroke:#000000,stroke-width:2px -``` - -#### Step 5 -Repeat [**Step 3**](#step-3) and [**Step 4**](#step-4) to your heart's content. diff --git a/Modules/other.md b/Modules/other.md deleted file mode 100644 index c48a0cd75..000000000 --- a/Modules/other.md +++ /dev/null @@ -1,106 +0,0 @@ ---- -label: Other -icon: file-diff -description: Generate extra files for analysis with Harpy -order: 7 ---- - -# :icon-file-diff: Other Harpy modules -Some parts of Harpy (variant calling, imputation) want or need extra files. 
You can create various files necessary for different modules using these extra modules: - -## :icon-terminal: Other modules -{.compact} -| module | description | -|:---------------|:---------------------------------------------------------------------------------| -| `resume` | Continue a Harpy workflow from an existing output folder | -| `popgroup` | Create generic sample-group file using existing sample file names (fq.gz or bam) | -| `stitchparams` | Create template STITCH parameter file | - -### resume -When calling a workflow (e.g. [!badge corners="pill" text="qc"](qc.md)), Harpy performs various file checks and validations, sets up the Snakemake command, -output folder(s), etc. In the event you want to continue a failed or manually terminated workflow without overwriting the workflow -files (e.g. `config.yaml`), you can use [!badge corners="pill" text="harpy resume"]. - -```bash usage -harpy resume [--conda] DIRECTORY -``` - -#### arguments -{.compact} -| argument | short name | type | default | required | description | -|:----------------------|:----------:|:----------------|:-------:|:--------:|:---------------------------------------------------------------------| -| `DIRECTORY` | | file/directory paths | | **yes** | Output directory of an existing harpy workflow | -| `--conda` | | toggle | | | generate a `.harpy_envs/` folder with the necessary conda enviroments | - -The `DIRECTORY` is the output directory of a previous harpy-invoked workflow, which **must** have the `workflow/config.yaml` file. -For example, if you previously ran `harpy align bwa -o align-bwa ...`, then you would use `harpy resume align-bwa`, -which would have the necessary `workflow/config.yaml` (and other necessary things) required to successfully continue the workflow. 
-Using [!badge corners="pill" text="resume"] does **not** overwrite any preprocessing files in the target directory (whereas rerunning the workflow would), -which means you can also manually modify the `config.yaml` file (advanced, not recommended unless you are confident with what you're doing). - -[!badge corners="pill" text="resume"] also requires an existing and populated `.harpy_envs/` directory in the current work directory, like the kind all -main `harpy` workflows would create. If one is not present, you can use `--conda` to create one. - -### popgroup -Creates a sample grouping file for variant calling - -```bash usage -harpy popgroup -o OUTPUTFILE INPUTS -``` - -```bash usage example -harpy popgroup -o samples.groups data/ -``` -#### arguments -{.compact} -| argument | short name | type | default | required | description | -|:----------------------|:----------:|:----------------|:-------:|:--------:|:---------------------------------------------------------------------| -| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing input FASTQ/BAM files | -| `--output` | `-o` | file path | | **yes** | name of the output file | - -This optional file is useful if you want SNP variant calling to happen on a -per-population level via [!badge corners="pill" text="harpy snp"](snp.md/#populations) or on samples -pooled-as-populations via [!badge corners="pill" text="harpy sv"](SV/naibr.md/#pooled-sample-variant-calling). -- takes the format of sample[!badge variant="ghost" text="tab"]group -- all the samples will be assigned to group `pop1` since file names don't always provide grouping information - - so make sure to edit the second column to reflect your data correctly. -- the file will look like: -```less popgroups.txt -sample1 pop1 -sample2 pop1 -sample3 pop2 -sample4 pop1 -sample5 pop3 -``` ---- -### stitchparams -Create a template parameter file for the [!badge corners="pill" text="impute"](/Modules/impute.md) module. 
The file is formatted correctly and serves -as a starting point for using parameters that make sense for your study. - -```bash usage -harpy stitchparams -o OUTPUTFILE -``` - -```bash example -harpy stitchparams -o params.stitch -``` - -#### arguments -{.compact} -| argument | short name | type | default | required | description | -|:----------------------|:----------:|:----------------|:-------:|:--------:|:---------------------------------------------------------------------| -| `--output` | `-o` | file path | | **yes** | name of the output file | - -Typically, one runs STITCH multiple times, exploring how results vary with -different model parameters. The solution Harpy uses for this is to have the user -provide a tab-delimited dataframe file where the columns are the 6 STITCH model -parameters and the rows are the values for those parameters. To make formatting -easier, a template file is generated for you, just replace the values and add/remove -rows as necessary. See the section for the [!badge corners="pill" text="impute"](/Modules/impute.md) -module for details on these parameters. 
The template file will look like: - -``` params.stitch -model usebx bxlimit k s ngen -diploid TRUE 50000 3 2 10 -diploid TRUE 50000 3 1 5 -``` \ No newline at end of file diff --git a/Modules/Align/Align.md b/Workflows/Align/Align.md similarity index 100% rename from Modules/Align/Align.md rename to Workflows/Align/Align.md diff --git a/Modules/Align/bwa.md b/Workflows/Align/bwa.md similarity index 83% rename from Modules/Align/bwa.md rename to Workflows/Align/bwa.md index 9b16660ec..2671ba549 100644 --- a/Modules/Align/bwa.md +++ b/Workflows/Align/bwa.md @@ -8,11 +8,12 @@ order: 5 # :icon-quote: Map Reads onto a genome with BWA MEM === :icon-checklist: You will need - at least 4 cores/threads available -- a genome assembly in FASTA format: [!badge variant="success" text=".fasta"] [!badge variant="success" text=".fa"] [!badge variant="success" text=".fasta.gz"] [!badge variant="success" text=".fa.gz"] -- paired-end fastq sequence file with the [proper naming convention](/haplotagdata/#naming-conventions) [!badge variant="secondary" text="gzipped recommended"] +- a genome assembly in FASTA format: [!badge variant="success" text=".fasta"] [!badge variant="success" text=".fa"] [!badge variant="success" text=".fasta.gz"] [!badge variant="success" text=".fa.gz"] [!badge variant="secondary" text="case insensitive"] +- paired-end fastq sequence files [!badge variant="secondary" text="gzipped recommended"] + - **sample name**: [!badge variant="success" text="a-z"] [!badge variant="success" text="0-9"] [!badge variant="success" text="."] [!badge variant="success" text="_"] [!badge variant="success" text="-"] [!badge variant="secondary" text="case insensitive"] - **forward**: [!badge variant="success" text="_F"] [!badge variant="success" text=".F"] [!badge variant="success" text=".1"] [!badge variant="success" text="_1"] [!badge variant="success" text="_R1_001"] [!badge variant="success" text=".R1_001"] [!badge variant="success" text="_R1"] [!badge variant="success" 
text=".R1"] - **reverse**: [!badge variant="success" text="_R"] [!badge variant="success" text=".R"] [!badge variant="success" text=".2"] [!badge variant="success" text="_2"] [!badge variant="success" text="_R2_001"] [!badge variant="success" text=".R2_001"] [!badge variant="success" text="_R2"] [!badge variant="success" text=".R2"] - - **fastq extension**: [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="success" text=".FQ"] [!badge variant="success" text=".FASTQ"] + - **fastq extension**: [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="secondary" text="case insensitive"] === Once sequences have been trimmed and passed through other QC filters, they will need to @@ -31,14 +32,15 @@ harpy align bwa --genome genome.fasta Sequences/ In addition to the [!badge variant="info" corners="pill" text="common runtime options"](/commonoptions.md), the [!badge corners="pill" text="align bwa"] module is configured using these command-line arguments: {.compact} -| argument | short name | type | default | required | description | -|:-------------------|:----------:|:----------------------|:-------:|:--------:|:------------------------------------------------------| -| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments) | -| `--extra-params` | `-x` | string | | no | Additional EMA-align/BWA arguments, in quotes | -| `--genome` | `-g` | file path | | **yes** | Genome assembly for read mapping | -| `--keep-unmapped` | `-u` | toggle | false | no | Output unmapped sequences too | -| `--min-quality` | `-q` | integer (0-40) | 30 | no | Minimum `MQ` (SAM mapping quality) to pass filtering | -| `--molecule-distance` | `-d` | integer | 100000 | no | Base-pair distance threshold to separate molecules | +| argument | short name | type | default | description | +| :-------------------- | :--------: | 
:------------------- | :-----: | :----------------------------------------------------------------------------------------------------------------------------- | +| `INPUTS` | | file/directory paths | | [!badge variant="info" text="required"] Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments) | +| `--contigs` | | file path or list | | [Contigs to plot](/commonoptions.md#--contigs) in the report | +| `--extra-params` | `-x` | string | | Additional EMA-align/BWA arguments, in quotes | +| `--genome` | `-g` | file path | | [!badge variant="info" text="required"] Genome assembly for read mapping | +| `--keep-unmapped` | `-u` | toggle | false | Output unmapped sequences too | +| `--min-quality` | `-q` | integer (0-40) | `30` | Minimum `MQ` (SAM mapping quality) to pass filtering | +| `--molecule-distance` | `-d` | integer | `100000` | Base-pair distance threshold to separate molecules | ### Molecule distance The `--molecule-distance` option is used during the BWA alignment workflow diff --git a/Modules/Align/ema.md b/Workflows/Align/ema.md similarity index 77% rename from Modules/Align/ema.md rename to Workflows/Align/ema.md index 833ddf599..088468496 100644 --- a/Modules/Align/ema.md +++ b/Workflows/Align/ema.md @@ -8,11 +8,12 @@ order: 5 # :icon-quote: Map Reads onto a genome with EMA === :icon-checklist: You will need - at least 4 cores/threads available -- a genome assembly in FASTA format: [!badge variant="success" text=".fasta"] [!badge variant="success" text=".fa"] [!badge variant="success" text=".fasta.gz"] [!badge variant="success" text=".fa.gz"] -- paired-end fastq sequence file with the [proper naming convention](/haplotagdata/#naming-conventions) [!badge variant="secondary" text="gzipped recommended"] +- a genome assembly in FASTA format: [!badge variant="success" text=".fasta"] [!badge variant="success" text=".fa"] [!badge variant="success" text=".fasta.gz"] [!badge variant="success" text=".fa.gz"] [!badge 
variant="secondary" text="case insensitive"] +- paired-end fastq sequence files [!badge variant="secondary" text="gzipped recommended"] + - **sample name**: [!badge variant="success" text="a-z"] [!badge variant="success" text="0-9"] [!badge variant="success" text="."] [!badge variant="success" text="_"] [!badge variant="success" text="-"] [!badge variant="secondary" text="case insensitive"] - **forward**: [!badge variant="success" text="_F"] [!badge variant="success" text=".F"] [!badge variant="success" text=".1"] [!badge variant="success" text="_1"] [!badge variant="success" text="_R1_001"] [!badge variant="success" text=".R1_001"] [!badge variant="success" text="_R1"] [!badge variant="success" text=".R1"] - **reverse**: [!badge variant="success" text="_R"] [!badge variant="success" text=".R"] [!badge variant="success" text=".2"] [!badge variant="success" text="_2"] [!badge variant="success" text="_R2_001"] [!badge variant="success" text=".R2_001"] [!badge variant="success" text="_R2"] [!badge variant="success" text=".R2"] - - **fastq extension**: [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="success" text=".FQ"] [!badge variant="success" text=".FASTQ"] + - **fastq extension**: [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="secondary" text="case insensitive"] - patience because EMA is [!badge variant="warning" text="slow"] ==- Why EMA? The original haplotag manuscript uses BWA to map reads. 
The authors have since recommended @@ -40,16 +41,18 @@ harpy align ema --genome genome.fasta Sequences/ In addition to the [!badge variant="info" corners="pill" text="common runtime options"](/commonoptions.md), the [!badge corners="pill" text="align ema"] module is configured using these command-line arguments: {.compact} -| argument | short name | type | default | required | description | -|:-------------------|:----------:|:----------------------|:-------:|:--------:|:-------------------------------------------------------------------| -| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments) | -| `--ema-bins` | `-e` | integer (1-1000) | 500 | no | Number of barcode bins for EMA | -| `--extra-params` | `-x` | string | | no | Additional EMA-align/BWA arguments, in quotes | -| `--genome` | `-g` | file path | | **yes** | Genome assembly for read mapping | -| `--keep-unmapped` | `-u` | toggle | false | no | Output unmapped sequences too | -| `--min-quality` | `-q` | integer (0-40) | 30 | no | Minimum `MQ` (SAM mapping quality) to pass filtering | -| `--platform` | `-p` | string | haplotag | **yes** | Linked read technology: `haplotag` or `10x` | -| `--whitelist` | `-w` | file path | | no | Path to barcode whitelist (`--platform 10x` only) | +| argument | short name | type | default | description | +| :------------------- | :--------: | :------------------- | :------: | :----------------------------------------------------------------------------------------------------------------------------- | +| `INPUTS` | | file/directory paths | | [!badge variant="info" text="required"] Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments) | +| `--contigs` | | file path or list | | [Contigs to plot](/commonoptions.md#--contigs) in the report | +| `--fragment-density` | `-d` | toggle | false | Perform read fragment density optimization | +| `--ema-bins` | `-e` | 
integer (1-1000) | `500` | Number of barcode bins for EMA | +| `--extra-params` | `-x` | string | | Additional EMA-align arguments, in quotes | +| `--genome` | `-g` | file path | | [!badge variant="info" text="required"] Genome assembly for read mapping | +| `--keep-unmapped` | `-u` | toggle | false | Output unmapped sequences too | +| `--min-quality` | `-q` | integer (0-40) | `30` | Minimum `MQ` (SAM mapping quality) to pass filtering | +| `--platform` | `-p` | string | `haplotag` | [!badge variant="info" text="required"] Linked read technology: `haplotag` or `10x` | +| `--whitelist` | `-w` | file path | | Path to barcode whitelist (`--platform 10x` only) | ### Barcode whitelist Some linked-read methods (e.g. 10x, Tellseq) require the inclusion of a barcode "whitelist." This file is a @@ -96,7 +99,7 @@ marked as a duplicate. Duplicates get marked but **are not removed**. - leverages the BX barcode information to improve mapping - sometimes better downstream SV detection - slower -- marks split alignments as secondary alignments [⚠️](/Modules/SV/leviathan.md) +- marks split alignments as secondary alignments [⚠️](/Workflows/SV/leviathan.md) - lots of temporary files Since [EMA](https://github.com/arshajii/ema) does extra things to account for barcode diff --git a/Modules/Align/index.yml b/Workflows/Align/index.yml similarity index 100% rename from Modules/Align/index.yml rename to Workflows/Align/index.yml diff --git a/Modules/Align/strobe.md b/Workflows/Align/strobe.md similarity index 83% rename from Modules/Align/strobe.md rename to Workflows/Align/strobe.md index 845c10af2..baa6d03d7 100644 --- a/Modules/Align/strobe.md +++ b/Workflows/Align/strobe.md @@ -8,11 +8,12 @@ order: 5 # :icon-quote: Map Reads onto a genome with strobealign === :icon-checklist: You will need - at least 4 cores/threads available -- a genome assembly in FASTA format: [!badge variant="success" text=".fasta"] [!badge variant="success" text=".fa"] [!badge variant="success" text=".fasta.gz"] 
[!badge variant="success" text=".fa.gz"] -- paired-end fastq sequence file with the [proper naming convention](/haplotagdata/#naming-conventions) [!badge variant="secondary" text="gzipped recommended"] +- a genome assembly in FASTA format: [!badge variant="success" text=".fasta"] [!badge variant="success" text=".fa"] [!badge variant="success" text=".fasta.gz"] [!badge variant="success" text=".fa.gz"] [!badge variant="secondary" text="case insensitive"] +- paired-end fastq sequence files [!badge variant="secondary" text="gzipped recommended"] + - **sample name**: [!badge variant="success" text="a-z"] [!badge variant="success" text="0-9"] [!badge variant="success" text="."] [!badge variant="success" text="_"] [!badge variant="success" text="-"] [!badge variant="secondary" text="case insensitive"] - **forward**: [!badge variant="success" text="_F"] [!badge variant="success" text=".F"] [!badge variant="success" text=".1"] [!badge variant="success" text="_1"] [!badge variant="success" text="_R1_001"] [!badge variant="success" text=".R1_001"] [!badge variant="success" text="_R1"] [!badge variant="success" text=".R1"] - **reverse**: [!badge variant="success" text="_R"] [!badge variant="success" text=".R"] [!badge variant="success" text=".2"] [!badge variant="success" text="_2"] [!badge variant="success" text="_R2_001"] [!badge variant="success" text=".R2_001"] [!badge variant="success" text="_R2"] [!badge variant="success" text=".R2"] - - **fastq extension**: [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="success" text=".FQ"] [!badge variant="success" text=".FASTQ"] + - **fastq extension**: [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="secondary" text="case insensitive"] === Once sequences have been trimmed and passed through other QC filters, they will need to @@ -31,15 +32,16 @@ harpy align strobe --genome genome.fasta Sequences/ In addition to the [!badge variant="info" 
corners="pill" text="common runtime options"](/commonoptions.md), the [!badge corners="pill" text="align strobe"] module is configured using these command-line arguments: {.compact} -| argument | short name | type | default | required | description | -|:-------------------|:----------:|:----------------------|:-------:|:--------:|:------------------------------------------------------| -| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments) | -| `--extra-params` | `-x` | string | | no | Additional EMA-align/BWA arguments, in quotes | -| `--genome` | `-g` | file path | | **yes** | Genome assembly for read mapping | -| `--keep-unmapped` | `-u` | toggle | false | no | Output unmapped sequences too | -| `--min-quality` | `-d` | integer (0-40) | 30 | no | Minimum `MQ` (SAM mapping quality) to pass filtering | -| `--molecule-distance` | `-m` | integer | 100000 | no | Base-pair distance threshold to separate molecules | -| `--read-length` | `-l` | choice | `auto` | no | Average read length for creating index. 
Options: [auto, 50, 75, 100, 125, 150, 250, 400] | +| argument | short name | type | default | description | +| :-------------------- | :--------: | :------------------- | :-----: | :----------------------------------------------------------------------------------------------------------------------------- | +| `INPUTS` | | file/directory paths | | [!badge variant="info" text="required"] Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments) | +| `--contigs` | | file path or list | | [Contigs to plot](/commonoptions.md#--contigs) in the report | +| `--extra-params` | `-x` | string | | Additional strobealign arguments, in quotes | +| `--genome` | `-g` | file path | | [!badge variant="info" text="required"] Genome assembly for read mapping | +| `--keep-unmapped` | `-u` | toggle | false | Output unmapped sequences too | +| `--min-quality` | `-d` | integer (0-40) | `30` | Minimum `MQ` (SAM mapping quality) to pass filtering | +| `--molecule-distance` | `-m` | integer | `100000` | Base-pair distance threshold to separate molecules | +| `--read-length` | `-l` | choice | `auto` | Average read length for creating index. Options: [auto, 50, 75, 100, 125, 150, 250, 400] | ### Read Length The strobealign program uses a new _strobemer_ design for aligning and requires its own way of indexing the genome.
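As a sketch of the table above (paths are hypothetical), you can hint the read length at the command line rather than relying on `auto` detection:

```bash usage
# hypothetical paths; -l tells strobealign which pre-tuned index parameters to use
harpy align strobe --read-length 150 --genome genome.fasta Sequences/
```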
diff --git a/Modules/SV/SV.md b/Workflows/SV/SV.md similarity index 100% rename from Modules/SV/SV.md rename to Workflows/SV/SV.md diff --git a/Modules/SV/index.yml b/Workflows/SV/index.yml similarity index 100% rename from Modules/SV/index.yml rename to Workflows/SV/index.yml diff --git a/Modules/SV/leviathan.md b/Workflows/SV/leviathan.md similarity index 71% rename from Modules/SV/leviathan.md rename to Workflows/SV/leviathan.md index 94133bdf4..64e0c5239 100644 --- a/Modules/SV/leviathan.md +++ b/Workflows/SV/leviathan.md @@ -10,8 +10,8 @@ order: 1 === :icon-checklist: You will need - at least 4 cores/threads available -- sequence alignments, in BAM format: [!badge variant="success" text=".bam"] -- genome assembly in FASTA format: [!badge variant="success" text=".fasta"] [!badge variant="success" text=".fa"] [!badge variant="success" text=".fasta.gz"] [!badge variant="success" text=".fa.gz"] +- sequence alignments: [!badge variant="success" text=".bam"] [!badge variant="secondary" text="coordinate-sorted"] +- genome assembly in FASTA format: [!badge variant="success" text=".fasta"] [!badge variant="success" text=".fa"] [!badge variant="success" text=".fasta.gz"] [!badge variant="success" text=".fa.gz"] [!badge variant="secondary" text="case insensitive"] - [!badge variant="ghost" text="optional"] sample grouping file ([see below](#pooled-sample-variant-calling)) !!!warning EMA-mapped reads @@ -63,15 +63,16 @@ harpy sv leviathan --threads 20 -g genome.fasta Align/bwa In addition to the [!badge variant="info" corners="pill" text="common runtime options"](/commonoptions.md), the [!badge corners="pill" text="sv leviathan"] module is configured using these command-line arguments: {.compact} -| argument | short name | type | default | required | description | -|:-----------------|:----------:|:--------------|:-------:|:--------:|:---------------------------------------------------| -| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing 
[input BAM files](/commonoptions.md#input-arguments) | -| `--extra-params` | `-x` | string | | no | Additional naibr arguments, in quotes | -| `--genome` | `-g` | file path | | yes | Genome assembly that was used to create alignments | -| `--iterations` | `-i` | integer | 50 | no | Number of iterations to perform through index (reduces memory) | -| `--min-barcodes` | `-b` | integer | 2 | no | Minimum number of barcode overlaps supporting candidate SV | -| `--min-sv` | `-m` | integer | 1000 | no | Minimum size of SV to detect | -| `--populations` | `-p` | file path | | no | Tab-delimited file of sample\<*tab*\>group | +| argument | short name | default | description | +| :--------------- | :--------: | :-----: | :--------------------------------------------------------------------------------------------------------------------------- | +| `INPUTS` | | | [!badge variant="info" text="required"] Files or directories containing [input BAM files](/commonoptions.md#input-arguments) | +| `--contigs` | | | [Contigs to plot](/commonoptions.md#--contigs) in the report | +| `--extra-params` | `-x` | | Additional leviathan arguments, in quotes | +| `--genome` | `-g` | | [!badge variant="info" text="required"] Genome assembly that was used to create alignments | +| `--iterations` | `-i` | `50` | Number of iterations to perform through index (reduces memory) | +| `--min-barcodes` | `-b` | `2` | Minimum number of barcode overlaps supporting candidate SV | +| `--min-sv` | `-m` | `1000` | Minimum size of SV to detect | +| `--populations` | `-p` | | Tab-delimited file of sample\<*tab*\>group | ### Single-sample variant calling When **not** using a population grouping file via `--populations`, variants will be called per-sample.
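To switch to pooled calling, the grouping file is just tab-delimited sample/group pairs. A minimal sketch (sample and path names are hypothetical):

```bash
# build a tab-delimited sample<TAB>group file, one sample per line
printf 'sample1\tpop1\nsample2\tpop1\nsample3\tpop2\n' > samples.groups
# then call SVs on the pooled groups (hypothetical paths):
# harpy sv leviathan --populations samples.groups --genome genome.fasta Align/bwa
```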
@@ -140,27 +141,24 @@ SV/leviathan └── sample2.bcf ``` {.compact} -| item | description | -|:-----------------------|:---------------------------------------------------------| -| `breakends.bedpe` | an aggregation of all the breakends identified by LEVIATHAN | -| `deletions.bedpe` | an aggregation of all the deletions identified by LEVIATHAN | -| `duplications.bedpe` | an aggregation of all the duplications identified by LEVIATHAN | -| `inversions.bedpe` | an aggregation of all the inversions identified by LEVIATHAN | -| `logs/harpy.variants.log` | relevant runtime parameters for the variants module | -| `logs/sample.groups` | if provided, a copy of the file provided to `--populations` with commented lines removed | -| `logs/*candidates` | candidate structural variants LEVIATHAN identified | -| `reports/` | summary reports with interactive plots of detected SV | -| `stats/` | results of `bcftools stats` on the vcf LEVIATHAN creates | -| `vcf/` | structural variants identified by LEVIATHAN | +| item | description | +| :------------------------ | :--------------------------------------------------------------------------------------- | +| `breakends.bedpe` | an aggregation of all the breakends identified by LEVIATHAN | +| `deletions.bedpe` | an aggregation of all the deletions identified by LEVIATHAN | +| `duplications.bedpe` | an aggregation of all the duplications identified by LEVIATHAN | +| `inversions.bedpe` | an aggregation of all the inversions identified by LEVIATHAN | +| `logs/harpy.variants.log` | relevant runtime parameters for the variants module | +| `logs/sample.groups` | if provided, a copy of the file provided to `--populations` with commented lines removed | +| `logs/*candidates` | candidate structural variants LEVIATHAN identified | +| `reports/` | summary reports with interactive plots of detected SV | +| `stats/` | results of `bcftools stats` on the vcf LEVIATHAN creates | +| `vcf/` | structural variants identified by LEVIATHAN | +++ 
:icon-code-square: leviathan parameters -By default, Harpy runs `leviathan` with default parameters (shown below), only modifying inputs and outputs at the command line. - Below is a list of all `leviathan` command line options, excluding those Harpy already uses or those made redundant by Harpy's implementation of LEVIATHAN. These are taken directly from the [LEVIATHAN documentation](https://github.com/morispi/LEVIATHAN). ``` LEVIATHAN arguments -r, --regionSize: Size of the regions on the reference genome to consider (default: 1000) - -v, --minVariantSize: Minimum size of the SVs to detect (default: same as regionSize) -n, --maxLinks: Remove from candidates list all candidates which have a region involved in that much candidates (default: 1000) -M, --mediumSize: Minimum size of medium variants (default: 2000) -L, --largeSize: Minimum size of large variants (default: 10000) @@ -170,8 +168,6 @@ These are taken directly from the [LEVIATHAN documentation](https://github.com/m -d, --duplicates: Consider SV as duplicates if they have the same type and if their breakpoints are within this distance (default: 10) -s, --skipTranslocations: Skip SVs that are translocations (default: false) -p, --poolSize: Size of the thread pool (default: 100000) - -B, --nbBins: Number of iterations to perform through the barcode index (default: 10) - -c, --minBarcodes: Always remove candidates that share less than this number of barcodes (default: 1) ``` +++ :icon-graph: reports These are the summary reports Harpy generates for this workflow. 
You may right-click diff --git a/Modules/SV/naibr.md b/Workflows/SV/naibr.md similarity index 72% rename from Modules/SV/naibr.md rename to Workflows/SV/naibr.md index 98c7ada9e..898e4d8cb 100644 --- a/Modules/SV/naibr.md +++ b/Workflows/SV/naibr.md @@ -10,8 +10,8 @@ order: 1 === :icon-checklist: You will need - at least 4 cores/threads available -- sequence alignments, in BAM format: [!badge variant="success" text=".bam"] -- genome assembly in FASTA format: [!badge variant="success" text=".fasta"] [!badge variant="success" text=".fa"] [!badge variant="success" text=".fasta.gz"] [!badge variant="success" text=".fa.gz"] +- sequence alignments: [!badge variant="success" text=".bam"] [!badge variant="secondary" text="coordinate-sorted"] +- genome assembly in FASTA format: [!badge variant="success" text=".fasta"] [!badge variant="success" text=".fa"] [!badge variant="success" text=".fasta.gz"] [!badge variant="success" text=".fa.gz"] [!badge variant="secondary" text="case insensitive"] - [!badge variant="ghost" text="optional"] phased VCF file - [!badge variant="ghost" text="optional"] sample grouping file ([see below](#pooled-sample-variant-calling)) ==- :icon-file: sample grouping file [!badge variant="ghost" text="optional"] @@ -63,17 +63,18 @@ harpy sv naibr --threads 20 --genome genome.fasta --vcf Variants/data.vcf.gz Ali In addition to the [!badge variant="info" corners="pill" text="common runtime options"](/commonoptions.md), the [!badge corners="pill" text="sv naibr"] module is configured using these command-line arguments: {.compact} -| argument | short name | type | default | required | description | -|:-----------------|:----------:|:--------------|:-------:|:--------:|:---------------------------------------------------| -| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input BAM files](/commonoptions.md#input-arguments) | -| `--extra-params` | `-x` | string | | no | Additional naibr arguments, in quotes | -| `--genome` | 
`-g` | file path | | **yes** | Genome assembly for phasing bam files | -| `--min-barcodes` | `-b` | integer | 2 | no | Minimum number of barcode overlaps supporting candidate SV | -| `--min-quality` | `-q` | integer (0-40) | 30 | no | Minimum `MQ` (SAM mapping quality) to pass filtering | -| `--min-sv` | `-n` | integer | 1000 | no | Minimum size of SV to detect | -| `--molecule-distance` | `-m` | integer | 100000 | no | Base-pair distance threshold to separate molecules | -| `--populations` | `-p` | file path | | no | Tab-delimited file of sample\<*tab*\>group | -| `--vcf` | `-v` | file path | | **conditionally** | Phased vcf file for phasing bam files | +| argument | short name | default | description | +| :-------------------- | :--------: | :------: | :---------------------------------------------------------------------------------------------------------------------------- | +| `INPUTS` | | | [!badge variant="info" text="required"] Files or directories containing [input BAM files](/commonoptions.md#input-arguments) | +| `--contigs` | | | [Contigs to plot](/commonoptions.md#--contigs) in the report | +| `--extra-params` | `-x` | | Additional naibr arguments, in quotes | +| `--genome` | `-g` | | [!badge variant="info" text="required"] Genome assembly for phasing bam files | +| `--min-barcodes` | `-b` | `2` | Minimum number of barcode overlaps supporting candidate SV | +| `--min-quality` | `-q` | `30` | Minimum `MQ` (SAM mapping quality) to pass filtering | +| `--min-sv` | `-n` | `1000` | Minimum size of SV to detect | +| `--molecule-distance` | `-m` | `100000` | Base-pair distance threshold to separate molecules | +| `--populations` | `-p` | | Tab-delimited file of sample\<*tab*\>group | +| `--vcf` | `-v` | | [!badge variant="info" text="conditionally required"] Phased vcf file for phasing bam files ([see below](#optional-vcf-file)) | ### Molecule distance The `--molecule-distance` option is used to let the program determine how far apart alignments on a contig 
with the same @@ -98,10 +99,10 @@ comparing "populations" as usual. Keep in mind that if there are too many sample it too well. !!! -### optional vcf file -In order to get the best variant calling performance out of NAIBR, it requires _phased_ bam files as input. -The `--vcf` option is optional and not used by NAIBR. However, to use [!badge corners="pill" text="sv naibr"] with -bam files that are not phased, you will need to include `--vcf`, which Harpy uses with +### Optional vcf file +In order to get the best variant calling performance out of NAIBR, it requires **_phased_** bam files as input. +Using `--vcf` is optional and not used by NAIBR directly. However, to use [!badge corners="pill" text="sv naibr"] with +bam files that are not phased, you will need to include a **phased VCF file** with `--vcf`, which Harpy uses with `whatshap haplotag` to phase your input BAM files prior to variant calling. See the [whatshap documentation](https://whatshap.readthedocs.io/en/latest/guide.html#whatshap-haplotag) for more details on that process. @@ -109,7 +110,7 @@ for more details on that process. This file can be in vcf/vcf.gz/bcf format and most importantly **it must be phased haplotypes**. There are various ways to haplotype SNPs, but you can use [!badge corners="pill" text="harpy phase"](../phase.md) to phase your SNPs into haplotypes using the haplotag barcode information. The resulting phased VCF file can then be used as input here. -Your VCF file should be [filtered in some capacity](/blog/filteringsnps.md) to keep high quality data. +Your VCF file should be [filtered in some capacity](/blog/filtering_snps.md) to keep high quality data. 
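Putting the phasing and filtering advice together, preparation of a VCF for `--vcf` might look like the sketch below. The file names, the output path of [!badge corners="pill" text="harpy phase"], and the `QUAL` cutoff are all hypothetical; adapt them to your data.

```bash
# phase SNPs into haplotypes using the linked-read barcode information
harpy phase --vcf variants.vcf.gz Align/bwa
# keep only confident sites before handing the phased VCF to sv naibr
bcftools view -i 'QUAL>=30' Phase/variants.phased.bcf -Oz -o variants.phased.filt.vcf.gz
harpy sv naibr --genome genome.fasta --vcf variants.phased.filt.vcf.gz Align/bwa
```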
```mermaid --- @@ -197,39 +198,38 @@ SV/naibr ``` {.compact} -| item | description | -|:--------------|:-----------------------------------------------------------------| +| item | description | +| :------------------- | :--------------------------------------------------------- | | `deletions.bedpe` | an aggregation of all the deletions identified by NAIBR | | `duplications.bedpe` | an aggregation of all the duplications identified by NAIBR | | `inversions.bedpe` | an aggregation of all the inversions identified by NAIBR | -| `bedpe/` | structural variants identified by NAIBR | -| `configs/` | the configuration files harpy generated for each sample | -| `filtered/` | the variants that failed NAIBR's internal filters | -| `IGV/` | same as the output `.bedpe` files but in IGV format | -| `logs/*.log` | what NAIBR writes to `stderr` during operation | -| `reports/` | summary reports with interactive plots of detected SV | -| `vcf/` | the resulting variants, but in `.VCF` format | +| `bedpe/` | structural variants identified by NAIBR | +| `configs/` | the configuration files harpy generated for each sample | +| `filtered/` | the variants that failed NAIBR's internal filters | +| `IGV/` | same as the output `.bedpe` files but in IGV format | +| `logs/*.log` | what NAIBR writes to `stderr` during operation | +| `reports/` | summary reports with interactive plots of detected SV | +| `vcf/` | the resulting variants, but in `.VCF` format | +++ :icon-code-square: naibr parameters By default, Harpy runs `naibr` with these parameters (excluding inputs and outputs): ```python min_mapq = 30 min_sv = 100000 -k = 3 +k = 2 +d = 100000 ``` Below is a list of all `naibr` runtime options, excluding those Harpy already uses or those made redundant by Harpy's implementation of NAIBR. These are taken directly from the [NAIBR documentation](https://github.com/pontushojer/NAIBR#running-naibr). 
If adding these arguments, do so in quotes: ``` -harpy sv naibr -d somedir -x "min_sv 1000 k 5" +harpy sv naibr -x "candidates duplications.bedpe" data/alignments/* ``` ``` NAIBR arguments - -blacklist: BED-file with regions to be excluded from analysis - -candidates: BEDPE-file with novel adjacencies to be scored by NAIBR. This will override automatic detection of candidate novel adjacencies - -min_sv: Minimum size of a structural variant to be detected (default: lmax, i.e. the 95th percentile of the paired-end read insert size distribution) - -k: minimum number of barcode overlaps supporting a candidate NA (default = 3) +-blacklist: BED-file with regions to be excluded from analysis +-candidates: BEDPE-file with novel adjacencies to be scored by NAIBR. This will override automatic detection of candidate novel adjacencies ``` +++ :icon-graph: reports diff --git a/Modules/Simulate/Simulate.md b/Workflows/Simulate/Simulate.md similarity index 100% rename from Modules/Simulate/Simulate.md rename to Workflows/Simulate/Simulate.md diff --git a/Modules/Simulate/index.yml b/Workflows/Simulate/index.yml similarity index 100% rename from Modules/Simulate/index.yml rename to Workflows/Simulate/index.yml diff --git a/Modules/Simulate/simulate-linkedreads.md b/Workflows/Simulate/simulate-linkedreads.md similarity index 53% rename from Modules/Simulate/simulate-linkedreads.md rename to Workflows/Simulate/simulate-linkedreads.md index d43d1c07e..03ca16ba0 100644 --- a/Modules/Simulate/simulate-linkedreads.md +++ b/Workflows/Simulate/simulate-linkedreads.md @@ -10,28 +10,36 @@ order: 6 Simulate linked reads from a genome === :icon-checklist: You will need -- two haplotypes of a reference genome in FASTA format: [!badge variant="success" text=".fasta"] [!badge variant="success" text=".fa"] [!badge variant="success" text=".fasta.gz"] [!badge variant="success" text=".fa.gz"] +- two haplotypes of a reference genome in FASTA format: [!badge variant="success" text=".fasta"] [!badge 
variant="success" text=".fa"] [!badge variant="success" text=".fasta.gz"] [!badge variant="success" text=".fa.gz"] [!badge variant="secondary" text="case insensitive"] - can be created with [!badge corners="pill" text="simulate {snpindel,inversion,...}"](simulate-variants.md) -- [!badge variant="ghost" text="optional"] a file of 16-basepair barcodes to tag linked reads with + - see the [tutorial](/blog/simulate_diploid.md) +- [!badge variant="ghost" text="optional"] a file of barcodes to tag linked reads with ==- :icon-question: LRSIM differences The original [LRSIM](https://github.com/aquaskyline/LRSIM) is a lengthy Perl script that, like Harpy, outsources -to various other programs (SURVIVOR, DWGSIM, samtools, msort) and acts as a workflow through these programs. The Harpy -version of LRSIM keeps only the novel LRSIM code that creates linked reads from reads simulated by DWGSIM. The -rest of LRSIM's components are reincorporated into the Snakemake workflow governing the [!badge corners="pill" text="simulate linkedreads"]module, while removing the SURVIVOR part since [!badge corners="pill" text="simulate {snpindel,...}"](simulate-variants.md) are used for that purpose. +to various other programs (`SURVIVOR`, `DWGSIM`, `samtools`, `msort`) and acts as a workflow through these programs. The Harpy +version of `LRSIM` keeps only the novel `LRSIM` code that creates linked reads from reads simulated by `DWGSIM`. The +rest of `LRSIM`'s components are reincorporated into the Snakemake workflow governing the [!badge corners="pill" text="simulate linkedreads"] +module, while removing the `SURVIVOR` part since [!badge corners="pill" text="simulate {snpindel,...}"](simulate-variants.md) are used for that purpose. +We generally wouldn't recommend using the Harpy version of LRSIM (`HaploSim.pl`) by itself unless you're confident you know what you're doing. 
#### Notable differences -- dependencies are expected to be on the `PATH`, not hardcoded to the folder LRSIM is running from -- `-r` parameter changed to folder prefix since Harpy uses `-g` for the haplotypes -- outputs are coded a little differently for flexibility (and use the `-r` parameter for some parts) -- SURVIVOR variant simulation functionality removed entirely -- DWGSIM, samtools, msort, and extractReads functionality moved into Harpy workflow -- uses newer version of DWGSIM +- dependencies are expected to be on the `PATH`, not hardcoded to the folder `LRSIM` is running from +- `-r`, `-u` parameters removed +- `-p` names the outputs with a prefix +- `-g` is used for the DWGSIM output fastq files, not input genomes +- `-a` is used for input fasta.fai files +- accepts any length of barcodes (minimum 6 bases) +- **does not** add sequencing error to the barcode +- no `.status` file, all logging writes to `stderr` +- `SURVIVOR` variant simulation functionality removed entirely +- `DWGSIM`, samtools, msort, and extractReads functionality moved into Harpy workflow +- uses newer version of `DWGSIM` === You may want to benchmark haplotag data on different kinds of genomic variants. To do that, you'll need *known* variants (like those created by [!badge corners="pill" text="simulate {snpindel,...}"](simulate-variants.md)) and linked-read sequences. This module will create (diploid) linked-read sequences from two genome haplotypes. To accomplish this, Harpy uses a modified version of [LRSIM](https://github.com/aquaskyline/LRSIM), -and converts the LRSIM 10X-style output into Haplotag-style reads. To simulate linked reads, use: +and demultiplexes the resulting linked-reads into Haplotag-formatted reads. To simulate linked reads, use: ```bash usage harpy simulate linkedreads OPTIONS...
HAP1_GENOME HAP2_GENOME @@ -44,18 +52,18 @@ harpy simulate linkedreads -t 4 -n 2 -l 100 -p 50 data/genome.hap1.fasta data/ In addition to the [!badge variant="info" corners="pill" text="common runtime options"](/commonoptions.md), the [!badge corners="pill" text="simulate linkedreads"] module is configured using these command-line arguments: {.compact} -| argument | short name | type | default | required | description | -|:---------------|:----------:|:------------|:-------------:|:--------:|:------------------------------------------------------------------------------------------------| -| `HAP1_GENOME` | | file path | | **yes** | Haplotype 1 of the diploid genome to simulate reads | -| `HAP2_GENOME` | | file path | | **yes** | Haplotype 1 of the diploid genome to simulate reads | -| `--barcodes` | `-b` | file path | [10X barcodes](https://github.com/aquaskyline/LRSIM/blob/master/4M-with-alts-february-2016.txt) | | File of linked-read barcodes to add to reads | -| `--distance-sd` | `-s` | integer | 15 | | Standard deviation of read-pair distance | -| `--molecule-length` | `-l` | integer | 100 | | Mean molecule length (kbp) | -| `--molecules-per` | `-m` | integer | 10 | | Average number of molecules per partition | -| `--mutation-rate` | `-r` | number | 0.001 | | Random mutation rate for simulating reads (0 - 1.0) | -| `--outer-distance` | `-d` | integer | 350 | | Outer distance between paired-end reads (bp) | -| `--patitions` | `-p` | integer | 1500 | | Number (in thousands) of partitions/beads to generate | -| `--read-pairs` | `-n` | number | 600 | | Number (in millions) of read pairs to simulate | +| argument | short name | default | description | +| :------------------ | :--------: | :--------------------------------: | :------------------------------------------------------------------------------------------ | +| `HAP1_GENOME` | | | [!badge variant="info" text="required"] Haplotype 1 of the diploid genome to simulate reads | +| `HAP2_GENOME` | | | [!badge 
variant="info" text="required"] Haplotype 2 of the diploid genome to simulate reads | +| `--barcodes` | `-b` | Meier et al. haplotagging barcodes | File of linked-read barcodes to add to reads | +| `--distance-sd` | `-s` | `15` | Standard deviation of read-pair distance | +| `--molecule-length` | `-l` | `100` | Mean molecule length (kbp) | +| `--molecules-per` | `-m` | `10` | Average number of molecules per partition | +| `--mutation-rate` | `-r` | `0.001` | Random mutation rate for simulating reads (0 - 1.0) | +| `--outer-distance` | `-d` | `350` | Outer distance between paired-end reads (bp) | +| `--partitions` | `-p` | `1500` | Number (in thousands) of partitions/beads to generate | +| `--read-pairs` | `-n` | `600` | Number (in millions) of read pairs to simulate | ## Mutation Rate The read simulation is two-part: first `dwgsim` generates forward and reverse FASTQ files from the provided genome haplotypes @@ -63,6 +71,7 @@ The read simulation is two-part: first `dwgsim` generates forward and reverse FA option controls random mutation rate `dwgsim` uses when creating FASTQ files from your provided genome haplotypes. This parameter adds SNPs/variation in addition to the error rate assumed for the Illumina platform. If you don't want any more SNPs added to the reads beyond sequencing error, set this value to `--mutation-rate 0`. + #### Simulating a single sample If you intend to simulate a "single individual" (i.e. use this module once), then you might want no additional SNPs beyond the variants you may have already introduced into the genome and set `--mutation-rate 0`. @@ -85,25 +94,23 @@ reads that aren't from the same molecule have the same linked-read barcode. This corresponding Harpy modules) have an option to set the [barcode threshold](../../haplotagdata.md/#barcode-thresholds). ## Barcodes -Barcodes, if provided, must be given as 16-basepair nucleotide sequences, one per line.
If not provided, -Harpy will download the standard 10X Genomics `4M-with-alts-february-2016.txt` barcode set from the [LRSIM -repository](https://github.com/aquaskyline/LRSIM/blob/master/4M-with-alts-february-2016.txt) and use those. The barcode file should look like: +Barcodes, if provided, must be given as nucleotide sequences, one per line. If not provided, +Harpy will generate the standard set of $96^4$ haplotagging barcodes (24bp) used by Meier et al. +**All barcodes must be the same length**. A barcode file should look like this: ``` input.barcodes.txt ATATGTACTCATACCA GGATGTACTCATTCCA TCACGTACTCATACCA etc... ``` -### 10X to Haplotag conversion -Harpy will convert the simulated 10X-style reads, where the 16-basepair barcode is at the beginning of read 1, -to haplotag format, where the barcode is coded in the sequence header under the `BX:Z` tag with the format -`AxxCxxBxxDxx`, where `xx` is a number between `00` and `96`. Using this format, a `00` would invalidate the -entire barcode due to a segment failing to be associated with a beadtag segment. In the simulated data, since -10X barcodes don't feature segments, failure to associate the first 16 bases of read 1 with barcodes provided -to `--barcodes` will appear as `BX:Z:A00C00B00D00`. The original 10X barcode (or first 16 bases of read 1) -will be removed from the sequence and stored in the `TX:Z` sequence header tag, e.g. `TX:Z:ATATGTACTCATACCA`. + +### Haplotagging conversion +Harpy will convert (demultiplex) the simulated linked-reads into proper haplotagging ACBD format by matching the first `X` +number of nucleotides from the start of the forward read to the barcode list. The provided barcodes will all be assigned a +unique haplotagging ACBD barcode and moved to the `BX` tag in the sequence headers, e.g. `BX:Z:A01C47B32D91`. The original nucleotide barcode
The paired reverse read will also have these tags. The diagram below attempts to simplify this visually. -![10X linked read barcode conversion into AxxCxxBxxDxx haplotag barcode format](/static/lr_conversion.png) +![inline linked-read barcode conversion into AxxCxxBxxDxx haplotag barcode format](/static/lr_conversion.png) ## Choosing parameters LRSIM does internal calculations to determine the number of reads per molecule based on `--read-pairs`, @@ -116,7 +123,7 @@ $$ $$\text{where:}\\\text{N = number of reads to simulate (in millions)}\\\text{H = number of haplotypes (fixed at 2)}\\\text{P = number of partitions (in thousands)}\\\text{M = molecules per partition}$$ ### Parameter calculator -Conveniently, we provide a calculator to help you make informed decisions for these parameters: +Use the calculator provided below to help you make informed decisions for these parameters: [!embed](https://app.calconic.com/api/embed/calculator/662146310482ea001e7acea2) ## :icon-git-pull-request: Simulate Linkedreads Workflow diff --git a/Workflows/Simulate/simulate-variants.md b/Workflows/Simulate/simulate-variants.md new file mode 100644 index 000000000..885975f73 --- /dev/null +++ b/Workflows/Simulate/simulate-variants.md @@ -0,0 +1,236 @@ +--- +label: Variants +description: Simulate snps, indels, inversions, cnv, translocations +icon: dot +#visibility: hidden +order: 6 +--- + +# :icon-flame: Simulate Genomic Variants +Simulate snps, indels, inversions, cnv, translocations + +=== :icon-checklist: You will need +- a reference genome in FASTA format: [!badge variant="success" text=".fasta"] [!badge variant="success" text=".fa"] [!badge variant="success" text=".fasta.gz"] [!badge variant="success" text=".fa.gz"] [!badge variant="secondary" text="case insensitive"] +=== + +You may want to benchmark haplotag data on different kinds of genomic variants. To +do that, you'll need *known* variants, and typically simulations are how you achieve +that. 
This series of modules simulates genomic variants onto a genome, either randomly +or from specific variants provided in VCF files. The simulator Harpy uses, +[simuG](https://github.com/yjx1217/simuG), can only simulate one type of +variant at a time and each variant type has its own set of parameters. This page +is divided by variant types to help you navigate the process. The general usage +for simulating variants is: + +```bash usage +harpy simulate variant OPTIONS... INPUT_GENOME +``` +```bash example +harpy simulate inversion -n 10 --min-size 1000 --max-size 50000 path/to/genome.fasta +``` +## Modules +There are 4 submodules with very obvious names: + +{.compact} +| submodule | what it does | +| :----------------------------------------------------------- | :-------------------------------------------------------------------------------- | +| [!badge corners="pill" text="snpindel"](#snpindel) | simulates single nucleotide polymorphisms (snps) and insertion-deletions (indels) | +| [!badge corners="pill" text="inversion"](#inversion) | simulates inversions | +| [!badge corners="pill" text="cnv"](#cnv) | simulates copy number variants | +| [!badge corners="pill" text="translocation"](#translocation) | simulates translocations | + +## :icon-terminal: Running Options +While there are several differences between individual workflow options, each has available all the +[!badge variant="info" corners="pill" text="common runtime options"](/commonoptions.md) like other Harpy modules. +Each requires an input genome at the end of the command line, and each requires either a `--count` of variants +to randomly simulate, or a `--vcf` of specific variants to simulate.
There are also these unifying options among the different variant types:
+
+{.compact}
+| argument | short name | description |
+| :----------------- | :--------: | :----------------------------------------------------------------------------------------------------- |
+| `INPUT_GENOME` | | [!badge variant="info" text="required"] The haploid genome to simulate variants onto |
+| `--centromeres` | `-c` | GFF3 file of centromeres to avoid |
+| `--exclude-chr` | `-e` | Text file of chromosomes to avoid, one per line |
+| `--genes` | `-g` | GFF3 file of genes to avoid simulating over (see `snpindel` for caveat) |
+| `--heterozygosity` | `-z` | [proportion of simulated variants to make heterozygous](#heterozygosity) (default: `0`) |
+| `--only-vcf` | | When used with `--heterozygosity`, will create the diploid VCFs but will not simulate a diploid genome |
+| `--prefix` | | Naming prefix for output files (default: `sim.{module_name}`) |
+| `--randomseed` | | Random seed for simulation |
+
+!!!warning simulations can be slow
+Given software limitations, simulating many variants **relative to the size of the input genome** will be noticeably slow.
+The program slows down substantially when it becomes difficult to find new sites to create variants.
+!!!
+
++++ 🟣 snps and indels
+### snpindel
+A single nucleotide polymorphism ("SNP") is a genomic variant at a single base position in the DNA ([source](https://www.genome.gov/genetics-glossary/Single-Nucleotide-Polymorphisms)).
+An indel is a type of mutation that involves the addition/deletion of one or more nucleotides into a segment of DNA ([insertions](https://www.genome.gov/genetics-glossary/Insertion), [deletions](https://www.genome.gov/genetics-glossary/Deletion)).
+The snp and indel variants are combined in this module because `simuG` allows simulating them together.
+
+{.compact}
+| argument | short name | default | description |
+| :----------------------- | :--------: | :-----: | :--------------------------------------------------------------------------------------------- |
+| `--indel-count` | `-m` | `0` | Number of random indels to simulate |
+| `--indel-vcf` | `-i` | | VCF file of known indels to simulate |
+| `--indel-ratio` | `-d` | `1` | Insertion/Deletion ratio for indels |
+| `--indel-size-alpha` | `-a` | `2.0` | Exponent Alpha for power-law-fitted indel size distribution |
+| `--indel-size-constant` | `-l` | `0.5` | Exponent constant for power-law-fitted indel size distribution |
+| `--snp-count` | `-n` | `0` | Number of random snps to simulate |
+| `--snp-gene-constraints` | `-y` | | How to constrain randomly simulated SNPs {`noncoding`,`coding`,`2d`,`4d`} when using `--genes` |
+| `--snp-vcf` | `-s` | | VCF file of known snps to simulate |
+| `--titv-ratio` | `-r` | `0.5` | Transition/Transversion ratio for snps |
+
+The ratio parameters for snp and indel variants have special meanings when setting
+the value to either `0` or `9999`:
+
+{.compact}
+| ratio | `0` meaning | `9999` meaning |
+| :-------------- | :----------------- | :---------------- |
+| `--indel-ratio` | deletions only | insertions only |
+| `--titv-ratio` | transversions only | transitions only |
+
++++ 🔵 inversions
+### inversion
+Inversions are when a section of a chromosome appears in the reverse orientation ([source](https://www.genome.gov/genetics-glossary/Inversion)).
+
+{.compact}
+| argument | short name | default | description |
+| :----------- | :--------: | :------: | :--------------------------------------- |
+| `--count` | `-n` | `0` | Number of random inversions to simulate |
+| `--max-size` | `-x` | `100000` | Maximum inversion size (bp) |
+| `--min-size` | `-m` | `1000` | Minimum inversion size (bp) |
+| `--vcf` | `-v` | | VCF file of known inversions to simulate |
+
++++ 🟢 copy number variants
+### cnv
+A copy number variation (CNV) is when the number of copies of a particular gene varies
+between individuals ([source](https://www.genome.gov/genetics-glossary/Copy-Number-Variation)).
+
+{.compact}
+| argument | short name | default | description |
+| :------------- | :--------: | :------: | :------------------------------------------------- |
+| `--vcf` | `-v` | | VCF file of known copy number variants to simulate |
+| `--count` | `-n` | `0` | Number of random cnvs to simulate |
+| `--dup-ratio` | `-d` | `1` | Tandem/Dispersed duplication ratio |
+| `--gain-ratio` | `-l` | `1` | Relative ratio of DNA gain over DNA loss |
+| `--max-size` | `-x` | `100000` | Maximum cnv size (bp) |
+| `--max-copy` | `-y` | `10` | Maximum number of copies |
+| `--min-size` | `-m` | `1000` | Minimum cnv size (bp) |
+
+The ratio parameters have special meanings when setting the value to either `0` or `9999`:
+
+{.compact}
+| ratio | `0` meaning | `9999` meaning |
+| :------------- | :-------------------------- | :----------------------- |
+| `--dup-ratio` | dispersed duplications only | tandem duplications only |
+| `--gain-ratio` | loss only | gain only |
+
++++ 🟡 translocations
+### translocation
+A translocation occurs when a chromosome breaks and the fragmented pieces re-attach to different chromosomes ([source](https://www.genome.gov/genetics-glossary/Translocation)).
+
+{.compact}
+| argument | short name | default | description |
+| :-------- | :--------: | :-----: | :------------------------------------------- |
+| `--count` | `-n` | `0` | Number of random translocations to simulate |
+| `--vcf` | `-v` | | VCF file of known translocations to simulate |
+
++++
+
+## Simulate known variants
+Rather than simulating random variants, you can use a VCF file as input to any of the workflows
+to have `simuG` simulate the variants (of that type) from the VCF file. This becomes particularly
+handy because the modules output a VCF file of the variants that were introduced, which you can
+modify and reuse as you see fit (see [heterozygosity](#heterozygosity)). Using `--genes`,
+`--centromeres`, or `--exclude-chr` would still avoid creating variants in those regions as with
+random simulation, except with SNPs, where you would have to specify the constraints for using
+`--genes` as per usual.
+
+## Heterozygosity
+Each workflow has a `--heterozygosity` parameter where you can specify the heterozygosity of
+the simulated variants, which creates two new VCF files (`{prefix}.hap1.vcf`,
+`{prefix}.hap2.vcf`) that have their variants shuffled between the two haplotypes to
+achieve the desired heterozygosity. The workflows will then use the new "diploid" variants
+to generate a diploid genome, one fasta file for each haplotype. You can disable the creation
+of the diploid fasta files using `--only-vcf`, which will still create the VCF files of the variants
+to your chosen heterozygosity.
+
+==- How the parameter is actually used
+To understand how heterozygosity is created from the `simuG` VCF output, consider a genome
+with 5 variants added to it, here represented as a column labelled `h1` with `1` being the presence
+of a variant (the ALT allele).
+
+```
+h1
+1
+1
+1
+1
+1
+```
+If we were to simulate those same variants onto the genome again, it would create a homozygote
+at every position (`h2` is the second haplotype):
+```
+h1 h2
+1  1
+1  1
+1  1
+1  1
+1  1
+```
+However, if we omit some of the variants on `h2` to create 40% heterozygosity (2/5),
+we would now have heterozygotes, except the ALT allele for the heterozygote would
+only ever be on the first haplotype `h1`:
+```
+h1 h2
+1  1
+1  1
+1  1
+1    <- heterozygote with ALT on h1
+1    <- heterozygote with ALT on h1
+```
+It would probably be more biologically sound to then make sure that the ALT allele
+in the heterozygote can appear in either haplotype:
+```
+h1 h2
+1  1
+1  1
+1  1
+   1 <- heterozygote with ALT on h2
+1    <- heterozygote with ALT on h1
+```
+Within Harpy, a heterozygous variant has a 50% chance of being assigned to one of the haplotypes.
+So that's the logic behind the `--heterozygosity` parameter and why it outputs 3 VCF files:
+1. the VCF `simuG` outputs of variants added to the genome
+2. haplotype 1 of that VCF file with some of the variants
+3. haplotype 2 of that VCF file with some of the variants
+
+Knowing that, you can then have a workflow to start with a haploid assembly and create
+a diploid assembly with simulated variants.
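In code, the shuffling described above amounts to two coin flips per variant: one weighted flip to decide whether the variant is heterozygous, and, if so, a fair flip to pick which haplotype carries the ALT allele. A minimal Python sketch of that idea (an illustration only, not Harpy's actual implementation; the function name is invented):

```python
import random

def split_into_haplotypes(variants, heterozygosity, seed=None):
    """Split simulated variants across two haplotypes at a target heterozygosity.

    variants: list of variant records (any objects)
    heterozygosity: proportion of variants to make heterozygous (0..1)
    Returns (hap1, hap2): the variants present on each haplotype.
    """
    rng = random.Random(seed)
    hap1, hap2 = [], []
    for v in variants:
        if rng.random() < heterozygosity:
            # heterozygote: the ALT allele lands on one haplotype,
            # chosen with 50% probability
            (hap1 if rng.random() < 0.5 else hap2).append(v)
        else:
            # homozygote: the variant is present on both haplotypes
            hap1.append(v)
            hap2.append(v)
    return hap1, hap2
```

With `--heterozygosity 0.4` and 5 variants, on average 2 of them would end up on only one of the two haplotypes, matching the example columns above.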
+==-
+
+## :icon-git-pull-request: Variant Simulation Workflow
+```mermaid
+graph LR
+    geno[input genome]:::clean
+    simhap([simulate haploid variants]):::clean
+    makedipl([create diploids]):::clean
+    simhap1([simulate haplotype 1]):::clean
+    simhap2([simulate haplotype 2]):::clean
+    haplo1([haplotype1.fasta])
+    haplo2([haplotype2.fasta])
+    geno--->simhap
+    simhap--->|--heterozygosity > 0|makedipl
+    makedipl--->simhap1
+    makedipl--->simhap2
+    simhap1--->haplo1
+    simhap2--->haplo2
+    subgraph diploid ["diploid variants"]
+    haplo1
+    haplo2
+    end
+    classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px
+    style diploid fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
+    style haplo1 stroke:#90c8be,fill:#f0f0f0,stroke-width:2px
+    style haplo2 stroke:#bd8fcb,fill:#f0f0f0,stroke-width:2px
+```
\ No newline at end of file
diff --git a/Workflows/assembly.md b/Workflows/assembly.md
new file mode 100644
index 000000000..2c8d1e39f
--- /dev/null
+++ b/Workflows/assembly.md
@@ -0,0 +1,142 @@
+---
+label: Assembly
+description: Create a genome assembly from linked reads
+icon: commit
+order: 11
+---
+
+# :icon-commit: Create a Genome Assembly
+
+=== :icon-checklist: You will need
+- at least 2 cores/threads available
+- paired-end reads from an Illumina sequencer in FASTQ format [!badge variant="secondary" text="gzip recommended"]
+  - deconvolved with [!badge corners="pill" text="deconvolve"](deconvolve.md) (QuickDeconvolution) or equivalent [!badge variant="warning" text="IMPORTANT"]
+===
+
+If you have single-sample data, you might be interested in a genome assembly. Unlike metagenome assemblies,
+a classic genome assembly assumes there is exactly one genome present in your sequences and will try to
+assemble the most contiguous sequences for this one individual.
+
+```bash usage
+harpy assembly OPTIONS... 
FASTQ_R1 FASTQ_R2
+```
+
+```bash example
+harpy assembly --threads 20 -u prokaryote -k 13,51,75,83 FASTQ_R1 FASTQ_R2
+```
+
+## :icon-terminal: Running Options
+In addition to the [!badge variant="info" corners="pill" text="common runtime options"](/commonoptions.md), the [!badge corners="pill" text="assembly"]
+module is configured using the command-line arguments below. Since the assembly process consists of several distinct phases,
+the descriptions are provided with an extra badge to reflect which part of the assembly process they correspond to.
+
+{.compact}
+| argument | short name | default | description |
+| :-------------------- | :--------: | :---------: | :--------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `FASTQ_R1` | | | [!badge variant="info" text="required"] FASTQ file of forward reads |
+| `FASTQ_R2` | | | [!badge variant="info" text="required"] FASTQ file of reverse reads |
+| `--extra-params` | `-x` | | [!badge variant="secondary" text="spades assembly"] Additional spades parameters, in quotes |
+| `--kmer-length` | `-k` | `auto` | [!badge variant="secondary" text="spades assembly"] Kmer lengths to use for initial spades assembly. They must be **odd** and **<128**, separated by commas, and without spaces. (e.g.
`13,23,51`) | +| `--max-memory` | `-r` | `10000` | [!badge variant="secondary" text="spades assembly"] Maximum memory for spades to use, given in megabytes | +| `--arcs-extra` | `-y` | | [!badge variant="secondary" text="arcs scaffold"] Additional ARCS parameters, in quotes and `option=arg` format | +| `--contig-length` | `-c` | `500` | [!badge variant="secondary" text="arcs scaffold"] Minimum contig length | +| `--links` | `-n` | `5` | [!badge variant="secondary" text="arcs scaffold"] Minimum number of links to compute scaffold | +| `--min-aligned` | `-a` | `5` | [!badge variant="secondary" text="arcs scaffold"] Minimum aligned read pairs per barcode | +| `--min-quality` | `-q` | `0` | [!badge variant="secondary" text="arcs scaffold"] Minimum mapping quality | +| `--mismatch` | `-m` | `5` | [!badge variant="secondary" text="arcs scaffold"] Maximum number of mismatches | +| `--molecule-distance` | `-d` | `50000` | [!badge variant="secondary" text="arcs scaffold"] Distance cutoff to split molecules (bp) | +| `--molecule-length` | `-l` | `2000` | [!badge variant="secondary" text="arcs scaffold"] Minimum molecule length (bp) | +| `--seq-identity` | `-i` | `98` | [!badge variant="secondary" text="arcs scaffold"] Minimum sequence identity | +| `--span` | `-s` | `20` | [!badge variant="secondary" text="arcs scaffold"] Minimum number of spanning molecules to be considered assembled | +| `--organism-type` | `-u` | `eukaryote` | [!badge variant="secondary" text="report"] Organism type for assembly report: `eukaryote`,`prokaryote`, or `fungus` | + + +## :icon-tag: Deconvolved Inputs +For linked-read assemblies, the barcodes need to be deconvolved in the sequence data, meaning that +barcodes that are shared by reads that originate from different molecules need to have unique barcode +IDs. Deconvolution often takes the form of adding a hyphenated integer to the end of a barcode so that software +can recognize that they are different from each other. 
For example: two sequences from different molecules +sharing the [linked read] barcode `A03C45B11D91` would have one of them recoded as `A03C45B11D91-1`. Software +like [QuickDeconvolution](https://github.com/RolandFaure/QuickDeconvolution), which is used by [!badge corners="pill" text="deconvolve"](deconvolve.md) will parse +your fastq input files and perform this deconvolution. + +## :icon-git-pull-request: Assembly Workflow ++++ :icon-git-merge: details +Initial assembly is performed with [cloudspades](https://github.com/ablab/spades/tree/cloudspades-ismb), +followed by [tigmint](https://github.com/bcgsc/tigmint), [arcs](https://github.com/bcgsc/arcs), +[links](https://github.com/bcgsc/links) to scaffold the contig-level assembly. + +```mermaid +graph LR + subgraph Inputs + F1([fastq read 1]):::clean + F2([fastq read 2]):::clean + end + subgraph init[Initial Assembly] + A([cloudspades]):::clean + end + Inputs--->A + subgraph Scaffolding + A ---> B([tigmint]):::clean + B-->C([arcs]):::clean + C-->D([links]):::clean + end + style Inputs fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px + style init fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px + style Scaffolding fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px + classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px +``` + ++++ :icon-file-directory: assembly output +The default output directory is `Assembly` with the folder structure below. Using `--skip-reports` +will skip the QUAST/BUSCO analysis as well. The file structure below isn't exhaustive and serves +to highlight the general structure and most important outputs. 
+```
+Assembly/
+├── busco
+│   ├── short_summary.*.txt
+│   └── run_*_odb10
+├── quast
+│   ├── report.*
+│   └── predicted_genes
+├── reports
+│   └── assembly.metrics.html
+├── scaffold
+├── spades
+│   └── contigs.fasta
+└── scaffolds.fasta
+```
+{.compact}
+| item | description |
+| :------------------------------ | :---------------------------------------------------------------------------- |
+| `busco/` | directory with results from the BUSCO analysis |
+| `busco/short_summary.*.txt` | text file summarizing BUSCO analysis |
+| `busco/run_*_odb10` | directory with results from the BUSCO analysis for the specific organism type |
+| `scaffold/` | directory with the tigmint/arcs/links output |
+| `scaffolds.fasta` | the resulting scaffolded assembly |
+| `spades/` | directory with the SPADES output |
+| `spades/contigs.fasta` | the resulting primary assembly |
+| `quast/` | directory with results from the QUAST analysis |
+| `quast/report*` | resulting QUAST report in various formats |
+| `quast/predicted_genes/` | GLIMMER gene-finding output |
+| `reports/assembly.metrics.html` | aggregated summary of QUAST and BUSCO results |
+
++++ :icon-code-square: SPADES parameters
+By default, Harpy runs `spades` with these parameters (excluding inputs and outputs):
+```bash
+spades.py -t threads -m mem -k k --gemcode1-1 FQ_R1 --gemcode1-2 FQ_R2
+```
+See the [SPADES documentation](http://ablab.github.io/spades/running.html) for a list of all available command line options.
+
++++ :icon-graph: reports
+These are the summary reports Harpy generates for this workflow. You may right-click
+the image and open it in a new tab if you wish to see the example in better detail.
+
+|||Aggregated Report
+Aggregates QUAST and BUSCO analyses.
+![reports/assembly.metrics.html](/static/assembly_multiqc.png) +||| QUAST Report +This is the report produced by QUAST +![reports/assembly.metrics.html](/static/assembly_quast.png) +||| ++++ diff --git a/Modules/deconvolve.md b/Workflows/deconvolve.md similarity index 66% rename from Modules/deconvolve.md rename to Workflows/deconvolve.md index 95ad91ceb..8d9b91f08 100644 --- a/Modules/deconvolve.md +++ b/Workflows/deconvolve.md @@ -1,17 +1,18 @@ --- label: Deconvolve -description: Resolve clashing barcodes from different molecules +description: Resolve barcodes shared by different molecules icon: tag order: 10 --- -# :icon-tag: Resolve clashing barcodes from different molecules +# :icon-tag: Resolve barcodes shared by different molecules === :icon-checklist: You will need - paired-end reads from an Illumina sequencer in FASTQ format [!badge variant="secondary" text="gzip recommended"] + - **sample name**: [!badge variant="success" text="a-z"] [!badge variant="success" text="0-9"] [!badge variant="success" text="."] [!badge variant="success" text="_"] [!badge variant="success" text="-"] [!badge variant="secondary" text="case insensitive"] - **forward**: [!badge variant="success" text="_F"] [!badge variant="success" text=".F"] [!badge variant="success" text=".1"] [!badge variant="success" text="_1"] [!badge variant="success" text="_R1_001"] [!badge variant="success" text=".R1_001"] [!badge variant="success" text="_R1"] [!badge variant="success" text=".R1"] - **reverse**: [!badge variant="success" text="_R"] [!badge variant="success" text=".R"] [!badge variant="success" text=".2"] [!badge variant="success" text="_2"] [!badge variant="success" text="_R2_001"] [!badge variant="success" text=".R2_001"] [!badge variant="success" text="_R2"] [!badge variant="success" text=".R2"] - - **fastq extension**: [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="success" text=".FQ"] [!badge variant="success" text=".FASTQ"] + - **fastq 
extension**: [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="secondary" text="case insensitive"] === @@ -44,13 +45,13 @@ harpy deconvolve OPTIONS... INPUTS... ## :icon-terminal: Running Options {.compact} -| argument | short name | type | default | required | description | -|:----------------------|:----------:|:----------------|:-------:|:--------:|:---------------------------------------------------------------------| -| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments) | -| `--density` | `-d` | integer | 3 | | On average, $\frac{1}{2^d}$ kmers are indexed | -| `--dropout` | `-a` | integer | 0 | | Minimum cloud size to deconvolve | -| `--kmer-length` | `-k` | integer | 21 | | Size of k-mers to search for similarities | -| `--window-size` | `-w` | integer | 40 | | Size of window guaranteed to contain at least one kmer | +| argument | short name | default | description | +| :-------------- | :--------: | :-----: | :----------------------------------------------------------------------------------------------------------------------------- | +| `INPUTS` | | | [!badge variant="info" text="required"] Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments) | +| `--density` | `-d` | `3` | On average, $\frac{1}{2^d}$ kmers are indexed | +| `--dropout` | `-a` | `0` | Minimum cloud size to deconvolve | +| `--kmer-length` | `-k` | `21` | Size of k-mers to search for similarities | +| `--window-size` | `-w` | `40` | Size of window guaranteed to contain at least one kmer | ## Resulting Barcodes After deconvolution, some barcodes may have a hyphenated suffix like `-1` or `-2` (e.g. `A01C33B41D93-1`). 
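The suffix scheme above can be sketched as a toy in Python (this is not QuickDeconvolution's actual algorithm, which infers molecules from shared k-mers; the function and its inputs are invented for illustration): reads that carry the same barcode but were inferred to come from different molecules receive distinct integer suffixes.

```python
from collections import defaultdict

def suffix_barcodes(read_molecules):
    """Append '-N' to each read's barcode, one N per distinct inferred molecule.

    read_molecules: iterable of (read_id, barcode, molecule_id) tuples,
    where molecule_id identifies the molecule a read was assigned to.
    Returns {read_id: deconvolved_barcode}.
    """
    # number the molecules seen for each barcode, in order of appearance
    molecule_index = defaultdict(dict)
    out = {}
    for read_id, barcode, molecule in read_molecules:
        seen = molecule_index[barcode]
        idx = seen.setdefault(molecule, len(seen) + 1)
        out[read_id] = f"{barcode}-{idx}"
    return out
```

With this sketch, two reads sharing `A03C45B11D91` but assigned to different molecules would come out as `A03C45B11D91-1` and `A03C45B11D91-2`, so downstream software treats them as distinct barcodes.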
diff --git a/Modules/demultiplex.md b/Workflows/demultiplex.md similarity index 74% rename from Modules/demultiplex.md rename to Workflows/demultiplex.md index 82c60c81b..3efb9d9c2 100644 --- a/Modules/demultiplex.md +++ b/Workflows/demultiplex.md @@ -28,14 +28,14 @@ harpy demultiplex gen1 --threads 20 --schema demux.schema Plate_1_S001_R*.fastq. In addition to the [!badge variant="info" corners="pill" text="common runtime options"](/commonoptions.md), the [!badge corners="pill" text="demultiplex"] module is configured using these command-line arguments: {.compact} -| argument | short name | type | required | description | -|:------------------|:----------:|:-----------|:--------:|:------------------------------------------------------------------------| -| `METHOD` | | choice | **yes** | Haplotag technology of the sequences [`gen1`] | -| `R1_FQ` | | file path | **yes** | The forward multiplexed FASTQ file | -| `R2_FQ` | | file path | **yes** | The reverse multiplexed FASTQ file | -| `I1_FQ` | | file path | **yes** | The forward FASTQ index file provided by the sequencing facility | -| `I2_FQ` | | file path | **yes** | The reverse FASTQ index file provided by the sequencing facility | -| `--schema` | `-s` | file path | **yes** | Tab-delimited file of sample\barcode | +| argument | short name | description | +| :--------- | :--------: | :--------------------------------------------------------------- | +| `METHOD` | | [!badge variant="info" text="required"] Haplotag technology of the sequences [`gen1`] | +| `R1_FQ` | | [!badge variant="info" text="required"] The forward multiplexed FASTQ file | +| `R2_FQ` | | [!badge variant="info" text="required"] The reverse multiplexed FASTQ file | +| `I1_FQ` | | [!badge variant="info" text="required"] The forward FASTQ index file provided by the sequencing facility | +| `I2_FQ` | | [!badge variant="info" text="required"] The reverse FASTQ index file provided by the sequencing facility | +| `--schema` | `-s` | [!badge 
variant="info" text="required"] Tab-delimited file of sample\barcode | ## Haplotag Types ==- Generation 1 - `gen1` @@ -101,11 +101,11 @@ Demultiplex/    └── demultiplex.QC.html ``` {.compact} -| item | description | -|:---|:---| -| `*.F.fq.gz` | Forward-reads from multiplexed input `--file` belonging to samples from the `samplesheet` | -| `*.R.fq.gz` | Reverse-reads from multiplexed input `--file` belonging to samples from the `samplesheet` | -| `reports/demultiplex.QC.html` | phased vcf annotated with phased blocks | +| item | description | +| :---------------------------- | :---------------------------------------------------------------------------------------- | +| `*.F.fq.gz` | Forward-reads from multiplexed input `--file` belonging to samples from the `samplesheet` | +| `*.R.fq.gz` | Reverse-reads from multiplexed input `--file` belonging to samples from the `samplesheet` | +| `reports/demultiplex.QC.html` | phased vcf annotated with phased blocks | +++ :icon-graph: reports ||| FASTQC metrics diff --git a/Modules/impute.md b/Workflows/impute.md similarity index 64% rename from Modules/impute.md rename to Workflows/impute.md index c06638eb5..b2b565b5a 100644 --- a/Modules/impute.md +++ b/Workflows/impute.md @@ -1,15 +1,17 @@ --- label: Impute description: Impute genotypes for haplotagged data with Harpy -icon: workflow +icon: project order: 8 --- -# :icon-workflow: Impute Genotypes using Sequences +# :icon-project: Impute Genotypes using Sequences === :icon-checklist: You will need -- a tab-delimited parameter file -- sequence alignments in BAM format: [!badge variant="success" text=".bam"] +- at least 4 cores/threads available +- a tab-delimited [parameter file](#parameter-file) +- sequence alignments: [!badge variant="success" text=".bam"] [!badge variant="secondary" text="coordinate-sorted"] + - **sample names**: [!badge variant="success" text="a-z"] [!badge variant="success" text="0-9"] [!badge variant="success" text="."] [!badge variant="success" 
text="_"] [!badge variant="success" text="-"] [!badge variant="secondary" text="case insensitive"] - a variant call format file: [!badge variant="success" text=".vcf"] [!badge variant="success" text=".vcf.gz"] [!badge variant="success" text=".bcf"] ==- :icon-codescan: Curation of input VCF file To work well with STITCH, Harpy needs the input variant call file to meet specific criteria. @@ -35,15 +37,15 @@ tasks yourself prior to running the [!badge corners="pill" text="impute"] module After variants have been called, you may want to impute missing genotypes to get the most from your data. Harpy uses `STITCH` to impute genotypes, a haplotype-based method that is linked-read aware. Imputing genotypes requires a variant call file -**containing SNPs**, such as that produced by [!badge corners="pill" text="harpy snp"](snp.md) and preferably [filtered in some capacity](/blog/filteringsnps.md). +**containing SNPs**, such as that produced by [!badge corners="pill" text="harpy snp"](snp.md) and preferably [filtered in some capacity](/blog/filtering_snps.md). You can impute genotypes with Harpy using the [!badge corners="pill" text="impute"] module: ```bash usage harpy impute OPTIONS... INPUTS... 
``` ```bash example -# create stitch parameter file 'stitch.params' -harpy stitchparams -o stitch.params +# create the parameter file 'stitch.params' +harpy imputeparams -o stitch.params # run imputation harpy impute --threads 20 --vcf Variants/mpileup/variants.raw.bcf --parameters stitch.params Align/ema @@ -53,13 +55,13 @@ harpy impute --threads 20 --vcf Variants/mpileup/variants.raw.bcf --parameters s In addition to the [!badge variant="info" corners="pill" text="common runtime options"](/commonoptions.md), the [!badge corners="pill" text="impute"] module is configured using these command-line arguments: {.compact} -| argument | short name | type | default | required | description | -|:---------------|:----------:|:------------|:-------------:|:--------:|:------------------------------------------------------------------------------------------------| -| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input BAM files](/commonoptions.md) | -| `--extra-params` | `-x` | folder path | | no | Extra arguments to add to the STITCH R function, provided in quotes and R syntax | -| `--parameters` | `-p` | file path | | **yes** | STITCH [parameter file](#parameter-file) (tab-delimited) | -| `--vcf` | `-v` | file path | | **yes** | Path to VCF/BCF file | -| `--vcf-samples`| | toggle | | no | [Use samples present in vcf file](#prioritize-the-vcf-file) for imputation rather than those found the directory | +| argument | short name | default | description | +| :--------------- | :--------: | :-----: | :--------------------------------------------------------------------------------------------------------------------------- | +| `INPUTS` | | | [!badge variant="info" text="required"] Files or directories containing [input BAM files](/commonoptions.md) | +| `--extra-params` | `-x` | | Extra arguments to add to the STITCH R function, provided in quotes and R syntax | +| `--parameters` | `-p` | | [!badge variant="info" text="required"] STITCH 
[parameter file](#parameter-file) (tab-delimited) | +| `--vcf` | `-v` | | [!badge variant="info" text="required"] Path to VCF/BCF file | +| `--vcf-samples` | | | Use samples present in vcf file for imputation rather than those found the directory ([see below](#prioritize-the-vcf-file)) | ### Extra STITCH parameters You may add [additional parameters](https://github.com/rwdavies/STITCH/blob/master/Options.md) to STITCH by way of the @@ -84,44 +86,46 @@ Typically, one runs STITCH multiple times, exploring how results vary with different model parameters (explained in next section). The solution Harpy uses for this is to have the user provide a tab-delimited dataframe file where the columns are the 6 STITCH model parameters and the rows are the values for those parameters. The parameter file -is required and can be created manually or with [!badge corners="pill" text="harpy stitchparams"](other.md/#stitchparams). +is required and can be created manually or with [!badge corners="pill" text="harpy imputeparams"](other.md/#imputeparams). If created using harpy, the resulting file includes largely meaningless values that you will need to adjust for your study. The parameter must follow a particular format: - tab or comma delimited -- column order doesn't matter, but all 6 column names must be present +- column order doesn't matter, but all 7 column names must be present - header row present with the specific column names below +++example file This file is tab-delimited, note the column names: ``` paramaters.txt -model usebx bxlimit k s nGen -diploid TRUE 50000 10 5 50 -diploid TRUE 50000 15 10 100 -pseudoHaploid TRUE 50000 10 1 50 +name model usebx bxlimit k s nGen +model1 diploid TRUE 50000 10 5 50 +model2 diploid TRUE 50000 15 10 100 +waffles pseudoHaploid TRUE 50000 10 1 50 ``` ++++example file (as a table) +This is the table view of the tab-delimited file, shown here for clarity. 
+ +{.compact} +| name | model | useBX | bxlimit | k | s | nGen | +| :------ | :------------ | :---- | :------ | :-- | :-- | :--- | +| model1 | diploid | TRUE | 50000 | 10 | 5 | 50 | +| model2 | diploid | TRUE | 50000 | 15 | 10 | 100 | +| waffles | pseudoHaploid | TRUE | 50000 | 10 | 1 | 50 | + +++parameter file columns See the section below for detailed information on each parameter. This table serves as an overview of the parameters. {.compact} -| column name | value type | accepted values | description | -|:------------|:------------:|:---------------------------------------:|:----------------------------------------------------------------------| -| model | text | pseudoHaploid, diploid, diploid-inbred | The STITCH model/method to use | -| usebx | text/boolean | true, false, yes, no (case insensitive) | Whether to incorporate beadtag information | -| bxlimit | integer | ≥ 1 | Distance between identical BX tags at which to consider them different molecules | -| k | integer | ≥ 1 | Number of founder haplotypes | -| s | integer | ≥ 1 | Number of instances of the founder haplotypes to average results over | -| nGen | integer | ≥ 1 | Estimated number of generations since founding | - -+++example file (as a table) -This is the table view of the tab-delimited file, shown here for clarity. 
+| column name | accepted values | description | +| :---------- | :------------------------------------------: | :---------------------------------------------------------------------------------------------- | +| name | alphanumeric (a-z, 0-9) and `-_.` | Arbitrary name of the parameter set, used to name outputs | +| model | `pseudoHaploid`, `diploid`, `diploid-inbred` | The STITCH model/method to use [!badge variant="secondary" text="case sensitive"] | +| usebx | `true`, `false`, `yes`, `no` | Whether to incorporate beadtag information [!badge variant="secondary" text="case insensitive"] | +| bxlimit | ≥ 1 | Distance between identical BX tags at which to consider them different molecules | +| k | ≥ 1 | Number of founder haplotypes | +| s | ≥ 1 | Number of instances of the founder haplotypes to average results over | +| nGen | ≥ 1 | Estimated number of generations since founding | -{.compact} -| model | useBX | bxlimit | k | s | nGen | -|:--------------|:------|:---------|:---|:---|:-----| -| diploid | TRUE | 50000 | 10 | 5 | 50 | -| diploid | TRUE | 50000 | 15 | 10 | 100 | -| pseudoHaploid | TRUE | 50000 | 10 | 1 | 50 | +++ ## STITCH Parameters @@ -164,6 +168,10 @@ that each ancestral haplotype gets at least a certain average \_X of coverage, l The `s` parameter controls the number of sets of ancestral haplotypes used and which final results are averaged over. This may be useful for wild or large populations, like humans. The `s` value should affect RAM and run time in a near-linearly. +!!! +For the time being, it's probably best to set this value to `1` due to [this inconsistent issue](https://github.com/rwdavies/STITCH/issues/98#issuecomment-2248700697). +!!! + +++nGen ##### Recombination rate between samples The `nGen` parameter controls recombination rate between the sequenced samples and the ancestral haplotypes. 
@@ -207,58 +215,37 @@ graph LR ``` +++ :icon-file-directory: impute output The default output directory is `Impute` with the folder structure below. `contig1` and `contig2` -are generic contig names from an imaginary `genome.fasta` for demonstration purposes. The directory `model1/` -is a generic name to reflect the corresponding parameter row in the stitch parameter -file, which would have explicit names in real use (e.g. `modelpseudoHaploid_useBXTrue_k10_s1_nGen50/`). +are generic contig names from an imaginary `genome.fasta` for demonstration purposes. The directory `modelname` +is a placeholder for whatever `name` you gave that parameter set in the parameter file's `name` column. The resulting folder also includes a `workflow` directory (not shown) with workflow-relevant runtime files and information. ``` Impute/ -├── input -│   ├── contig1.stitch -│   ├── contig2.stitch -│   ├── samples.list -│   ├── samples.names -│   └── logs -│ └── harpy.impute.log -└── model1 -    ├── concat.log -    ├── variants.imputed.bcf -    ├── variants.imputed.bcf.csi -    ├── variants.imputed.html -    └── contigs -    ├── contig1 -    │ ├── contig1.log -    │ ├── contig1.impute.html -    │ ├── contig1.stats -    │ ├── contig1.vcf.gz -    │ └── contig1.vcf.gz.tbi -    ├── contig2 -    │ ├── contig2.log -    │ ├── contig2.impute.html -    │ ├── contig2.stats -    │ ├── contig2.vcf.gz -    │ └── contig2.vcf.gz.tbi -    └── stats - └── impute.compare.stats +└── modelname +    ├── modelname.bcf +    ├── modelname.bcf.csi +    ├── contigs +    │ ├── contig1.vcf.gz +    │ ├── contig1.vcf.gz.tbi +    │ ├── contig2.vcf.gz +    │ └── contig2.vcf.gz.tbi +   ├── logs +   └── reports + ├── data + ├── contig1.modelname.html + ├── contig2.modelname.html + └── modelname.html ``` {.compact} -| item | description | -|:------------------------------------|:--------------------------------------------------------------------------| -| `logs/harpy.impute.log` | relevant runtime parameters for the 
phase module | -| `input/*.stitch` | biallelic SNPs used for imputation | -| `input/samples.list` | list of [input BAM files](/commonoptions.md) | -| `input/samples.names` | list of sample names | -| `model*/concat.log` | output from bcftools concat to create final imputed bcf | -| `model*/variants.imputed.bcf` | final bcf file of imputed genotypes | -| `model*/variants.imputed.bcf.csi` | index of `variants.imputed.bcf` | -| `model*/variants.imputed.html` | report summarizing the results of imputation | -| `model*/contigs/*/*.impute.html` | summary of STITCH imputation | -| `model*/contigs/*/*.log` | what STITCH writes to `stdout` and `stderr` | -| `model*/contigs/*/*.vcf.gz` | variants resulting from imputation | -| `model*/contigs/*/*.vcf.gz.tbi` | index of variant file | -| `model*/contigs/*/impute.compare.stats` | results of `bcftools stats` comparing the original to the imputed vcf | +| item | description | +| :----------------------------------- | :------------------------------------------- | +| `modelname/modelname.bcf` | final bcf file of imputed genotypes | +| `modelname/modelname.bcf.csi` | index of `modelname.bcf` | +| `modelname/reports/modelname.html` | report summarizing the results of imputation | +| `modelname/reports/*.modelname.html` | summary of STITCH imputation (per contig) | +| `modelname/contigs/*.vcf.gz` | variants resulting from imputation | +| `modelname/contigs/*.vcf.gz.tbi` | index of variant file | +++ :icon-code-square: STITCH parameters While you are expected to run STITCH using your own set of diff --git a/Modules/index.yml b/Workflows/index.yml similarity index 100% rename from Modules/index.yml rename to Workflows/index.yml diff --git a/Workflows/metassembly.md b/Workflows/metassembly.md new file mode 100644 index 000000000..3c2c9a948 --- /dev/null +++ b/Workflows/metassembly.md @@ -0,0 +1,140 @@ +--- +label: Metassembly +description: Create a metagenome assembly from linked reads +icon: workflow +order: 8 +--- + +# 
:icon-workflow: Create a Metagenome Assembly + +=== :icon-checklist: You will need +- at least 2 cores/threads available +- paired-end reads from an Illumina sequencer in FASTQ format [!badge variant="secondary" text="gzip recommended"] + - deconvolved with [!badge corners="pill" text="deconvolve"](deconvolve.md) (QuickDeconvolution) or equivalent [!badge variant="warning" text="IMPORTANT"] +=== + +If you have mixed-sample data, you might be interested in a metagenome assembly, also known as a metassembly. Unlike +a single-sample assembly, a metassembly assumes there are multiple genomes present in your sequences and will try to +assemble the most contiguous sequences for multi-sample (or multi-species) data. + +```bash usage +harpy metassembly OPTIONS... FASTQ_R1 FASTQ_R2 +``` + +```bash example +harpy metassembly --threads 20 -u prokaryote -k 13,51,75,83 FASTQ_R1 FASTQ_R2 +``` + +## :icon-terminal: Running Options +In addition to the [!badge variant="info" corners="pill" text="common runtime options"](/commonoptions.md), the [!badge corners="pill" text="metassembly"] module is configured using these command-line arguments: + +{.compact} +| argument | short name | default | description | +| :---------------- | :--------: | :---------: | :--------------------------------------------------------------------------------------------------------------------------------------------- | +| `FASTQ_R1` | | | [!badge variant="info" text="required"] deconvolved FASTQ file of forward reads | +| `FASTQ_R2` | | | [!badge variant="info" text="required"] deconvolved FASTQ file of reverse reads | +| `--bx-tag` | `-b` | `BX` | [!badge variant="info" text="required"] Which sequence header tag encodes the linked-read barcode (`BX` for `BX:Z` or `BC` for `BC:Z`) | +| `--extra-params` | `-x` | | Additional spades parameters, in quotes | +| `--ignore-bx` | | | Ignore linked-read info for initial spades assembly | +| `--kmer-length` | `-k` | `auto` | Kmer lengths to use for initial spades 
assembly. They must be **odd** and **<128**, separated by commas, and without spaces. (e.g. `13,23,51`) | +| `--max-memory` | `-r` | `10000` | Maximum memory for spades to use, given in megabytes | +| `--organism-type` | `-u` | `eukaryote` | Organism type for assembly report. Options: `eukaryote`,`prokaryote`,`fungus` | + +## :icon-tag: Deconvolved Inputs +For linked-read assemblies, the barcodes need to be deconvolved in the sequence data, meaning that +barcodes that are shared by reads that originate from different molecules need to have unique barcode +IDs. Deconvolution often takes the form of adding a hyphenated integer to the end of a barcode so that software +can recognize that they are different from each other. For example: two sequences from different molecules +sharing the linked-read barcode `A03C45B11D91` would have one of them recoded as `A03C45B11D91-1`. Software +like [QuickDeconvolution](https://github.com/RolandFaure/QuickDeconvolution), which is used by [!badge corners="pill" text="deconvolve"](deconvolve.md), will parse +your fastq input files and perform this deconvolution. + +## :icon-git-pull-request: Metassembly Workflow ++++ :icon-git-merge: details +Initial assembly is performed with [spades](http://ablab.github.io/spades/) or [cloudspades](https://github.com/ablab/spades/tree/cloudspades-ismb) +depending on whether `--ignore-bx` was used. After the initial spades-based assembly, +[athena](https://github.com/abishara/athena_meta) assembles the contigs into larger scaffolds.
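The suffixing scheme described above can be mimicked with a toy one-liner. Note this naive counter treats every repeated barcode as a new molecule, which is *not* how QuickDeconvolution decides molecule identity (it uses read overlap information); it only illustrates the output format:

```shell
# Toy sketch of deconvolution suffixing (illustrative only, not QuickDeconvolution's logic):
# repeated barcodes receive a hyphenated integer so downstream tools can tell them apart
printf 'BX:Z:A03C45B11D91\nBX:Z:A03C45B11D91\n' \
  | awk '{ seen[$0]++; if (seen[$0] > 1) print $0 "-" (seen[$0] - 1); else print }'
# prints:
# BX:Z:A03C45B11D91
# BX:Z:A03C45B11D91-1
```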
+ +```mermaid +graph LR + subgraph Inputs + F1([fastq forward]):::clean + F2([fastq reverse]):::clean + F1---|and|F2 + end + subgraph init[Initial Assembly] + AC([meta cloudspades]):::clean + A([metaspades]):::clean + A---|or|AC + end + subgraph athena[Secondary Assembly] + B([athena]):::clean + end + Inputs ---> sort([sort by barcode]):::clean + sort--->init--->B + style Inputs fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px + style init fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px + style athena fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px + classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px +``` + ++++ :icon-file-directory: metassembly output +The default output directory is `Metassembly` with the folder structure below. If `--ignore-bx` is used, the initial +spades assembly will be in `*/spades_assembly`, otherwise it will be in `*/cloudspades_assembly`. Using `--skip-reports` +will skip the QUAST/BUSCO analysis as well. The file structure below isn't exhaustive and serves to highlight the general +structure and most important outputs. 
+``` +Metassembly/ +├── athena +│   ├── athena.asm.fa +│   └── athena.config +├── busco +│   ├── short_summary.*.txt +│   └── run_*_odb10 +├── quast +│   ├── report.* +│   └── predicted_genes +├── reports +│   └── assembly.metrics.html +└── *spades_assembly +    └── contigs.fasta +``` +{.compact} +| item | description | +| :------------------------------- | :---------------------------------------------------------------------------- | +| `athena/athena.asm.fa` | final metagenome assembly | +| `busco/` | directory with results from the BUSCO analysis | +| `busco/short_summary.*.txt` | text file summarizing BUSCO analysis | +| `busco/run_*_odb10` | directory with results from the BUSCO analysis for the specific organism type | +| `quast/` | directory with results from the QUAST analysis | +| `quast/report*` | resulting QUAST report in various formats | +| `quast/predicted_genes/` | GLIMMER gene-finding output | +| `reports/assembly.metrics.html` | aggregation and summary of the QUAST and BUSCO results | +| `*spades_assembly/` | directory with the SPADES output | +| `*spades_assembly/contigs.fasta` | the resulting primary assembly | + ++++ :icon-code-square: SPADES parameters +By default, Harpy runs `spades` with these parameters (excluding inputs and outputs): +```bash +spades.py --meta -t threads -m mem -k k --gemcode1-1 FQ_R1 --gemcode1-2 FQ_R2 + +# with --ignore-bx +## error correct reads +metaspades.py -t threads -m mem -k k -1 FQ_R1 -2 FQ_R2 --only-error-correction +## assemble corrected reads +metaspades.py -t threads -m mem -k k -1 FQ_R1C -2 FQ_R2C -s FQ_UNPAIREDC --only-assembler +``` +See the [SPADES documentation](http://ablab.github.io/spades/running.html) for a list of all available command line options. + ++++ :icon-graph: reports +These are the summary reports Harpy generates for this workflow. You may right-click +the image and open it in a new tab if you wish to see the example in better detail.
+ +|||Aggregated Report +Aggregates QUAST and BUSCO analyses. +![reports/assembly.metrics.html](/static/assembly_multiqc.png) +||| QUAST Report +This is the report produced by QUAST +![reports/assembly.metrics.html](/static/metassembly_quast.png) +||| ++++ diff --git a/Workflows/other.md b/Workflows/other.md new file mode 100644 index 000000000..90d21faa2 --- /dev/null +++ b/Workflows/other.md @@ -0,0 +1,139 @@ +--- +label: Other +icon: file-diff +description: Generate extra files for analysis with Harpy +order: 7 +--- + +# :icon-file-diff: Other Harpy modules +On this page you'll find Harpy functions that aren't standalone workflows. These may create necessary inputs, continue where you left off, +or view important workflow files. + +## :icon-terminal: Other modules +{.compact} +| module | description | +| :------------- | :------------------------------------------------------------------------------- | +| `imputeparams` | Create a template imputation parameter file | +| `resume` | Continue a Harpy workflow from an existing output folder | +| `popgroup` | Create generic sample-group file using existing sample file names (fq.gz or bam) | +| `view` | View a workflow log, config, or snakefile | + + +--- + +### imputeparams +Create a template parameter file for the [!badge corners="pill" text="impute"](/Workflows/impute.md) module. +The file is formatted correctly and serves as a starting point for using parameters that make sense for your study. + +```bash usage +harpy imputeparams -o OUTPUTFILE +``` + +```bash example +harpy imputeparams -o params.stitch +``` + +#### arguments +{.compact} +| argument | short name | description | +| :--------- | :--------: | :-------------------------------------------------------------- | +| `--output` | `-o` | [!badge variant="info" text="required"] Name of the output file | + +Typically, one runs STITCH multiple times, exploring how results vary with +different model parameters. 
The solution Harpy uses for this is to have the user +provide a tab-delimited dataframe file where the columns are the 6 STITCH model +parameters and the rows are the values for those parameters. To make formatting +easier, a template file is generated for you; just replace the values and add/remove +rows as necessary. See the section for the [!badge corners="pill" text="impute"](/Workflows/impute.md) +module for details on these parameters. The template file will look like: + +```text params.stitch +name model usebx bxlimit k s ngen +k10_ng50 diploid TRUE 50000 3 2 10 +k1_ng30 diploid TRUE 50000 3 1 5 +high_ngen diploid TRUE 50000 15 1 100 +``` +--- + +### resume +When calling a workflow (e.g. [!badge corners="pill" text="qc"](qc.md)), Harpy performs various file checks +and validations, sets up the Snakemake command, output folder(s), etc. In the event you want to continue a +failed or manually terminated workflow without overwriting the workflow files (e.g. `config.yaml`), +you can use [!badge corners="pill" text="harpy resume"]. Using `resume` also skips all input validations. + +```bash usage +harpy resume [--conda] DIRECTORY +``` + +#### arguments +{.compact} +| argument | short name | type | description | +| :---------- | :--------: | :------------------- | :------------------------------------------------------------------------------------- | +| `DIRECTORY` | | file/directory paths | [!badge variant="info" text="required"] Output directory of an existing harpy workflow | +| `--conda` | | toggle | Generate a `/workflow/envs` folder with the necessary conda environments | + +The `DIRECTORY` is the output directory of a previous harpy-invoked workflow, which **must** have the `workflow/config.yaml` file. +For example, if you previously ran `harpy align bwa -o align-bwa ...`, then you would use `harpy resume align-bwa`, +which would have the necessary `workflow/config.yaml` (and other necessary things) required to successfully continue the workflow.
+Using [!badge corners="pill" text="resume"] does **not** overwrite any preprocessing files in the target directory (whereas rerunning the workflow would), +which means you can also manually modify the `config.yaml` file (advanced, not recommended unless you are confident in what you're doing). + +[!badge corners="pill" text="resume"] also requires an existing and populated `workflow/envs/` directory in the target directory, like the kind all +main `harpy` workflows would create. If one is not present, you can use `--conda` to create one. + +--- + +### popgroup +Creates a sample grouping file for variant calling. + +```bash usage +harpy popgroup -o OUTPUTFILE INPUTS +``` + +```bash usage example +harpy popgroup -o samples.groups data/ +``` +#### arguments +{.compact} +| argument | short name | type | description | +| :--------- | :--------: | :------------------- | :-------------------------------------------------------------------------------------------- | +| `INPUTS` | | file/directory paths | [!badge variant="info" text="required"] Files or directories containing input FASTQ/BAM files | +| `--output` | `-o` | file path | [!badge variant="info" text="required"] Name of the output file | + +This optional file is useful if you want SNP variant calling to happen on a +per-population level via [!badge corners="pill" text="harpy snp"](snp.md/#populations) or on samples +pooled-as-populations via [!badge corners="pill" text="harpy sv"](SV/naibr.md/#pooled-sample-variant-calling). +- takes the format of sample[!badge variant="ghost" text="tab"]group +- all the samples will be assigned to group `pop1` since file names don't always provide grouping information + - so make sure to edit the second column to reflect your data correctly.
+- the file will look like: +```less popgroups.txt +sample1 pop1 +sample2 pop1 +sample3 pop2 +sample4 pop1 +sample5 pop3 +``` +--- + +### view +This convenience command lets you view the latest workflow log file +of a Harpy output directory. Use `--snakefile` or `--config` to view the workflow +snakefile or config.yaml file instead, respectively. Output is printed to the screen via `less` and +accepts the typical [keyboard shortcuts to navigate](https://gist.github.com/glnds/8862214) the output. + +```bash usage +harpy view [-s] [-c] DIRECTORY +``` + +```bash example +harpy view Align/bwa +``` + +#### arguments +{.compact} +| argument | short name | description | +| :------------ | :--------: | :------------------------------------------------------------------------------------- | +| `DIRECTORY` | | [!badge variant="info" text="required"] Output directory of an existing harpy workflow | +| `--snakefile` | `-s` | View the workflow snakefile instead | +| `--config` | `-c` | View the `config.yaml` file instead | \ No newline at end of file diff --git a/Modules/phase.md b/Workflows/phase.md similarity index 61% rename from Modules/phase.md rename to Workflows/phase.md index 4bbb9e09e..f3e8be6b0 100644 --- a/Modules/phase.md +++ b/Workflows/phase.md @@ -9,9 +9,10 @@ order: 6 === :icon-checklist: You will need - at least 2 cores/threads available -- sequence alignments, in BAM format: [!badge variant="success" text=".bam"] +- sequence alignments: [!badge variant="success" text=".bam"] [!badge variant="secondary" text="coordinate-sorted"] + - **sample name**: [!badge variant="success" text="a-z"] [!badge variant="success" text="0-9"] [!badge variant="success" text="."] [!badge variant="success" text="_"] [!badge variant="success" text="-"] [!badge variant="secondary" text="case insensitive"] - a variant call format file of genotypes: [!badge variant="success" text=".vcf"] [!badge variant="success" text=".bcf"] -- [!badge variant="ghost" text="optional"] a reference
genome in FASTA format: [!badge variant="success" text=".fasta"] [!badge variant="success" text=".fa"] [!badge variant="success" text=".fasta.gz"] [!badge variant="success" text=".fa.gz"] +- [!badge variant="ghost" text="optional"] a reference genome in FASTA format: [!badge variant="success" text=".fasta"] [!badge variant="success" text=".fa"] [!badge variant="success" text=".fasta.gz"] [!badge variant="success" text=".fa.gz"] [!badge variant="secondary" text="case insensitive"] === @@ -21,7 +22,7 @@ genotypes into haplotypes requires alignment files, such as those produced by [! and a variant call file, such as one produced by [!badge corners="pill" text="snp freebayes"](snp.md) or [!badge corners="pill" text="impute"](impute.md). **Phasing only works on SNP data**, and will not work for structural variants produced by [!badge corners="pill" text="sv leviathan"](SV/leviathan.md) -or [!badge corners="pill" text="sv naibr"](SV/naibr.md), preferably [filtered in some capacity](/blog/filteringsnps.md). You can phase genotypes into haplotypes with +or [!badge corners="pill" text="sv naibr"](SV/naibr.md). The input SNPs should preferably be [filtered in some capacity](/blog/filtering_snps.md).
You can phase genotypes into haplotypes with Harpy using the [!badge corners="pill" text="phase"] module: ```bash usage @@ -35,16 +36,17 @@ harpy phase --threads 20 --vcf Variants/variants.raw.bcf Align/ema In addition to the [!badge variant="info" corners="pill" text="common runtime options"](/commonoptions.md), the [!badge corners="pill" text="phase"] module is configured using these command-line arguments: {.compact} -| argument | short name | type | default | required | description | -|:----------------------|:----------:|:----------------|:-------:|:--------:|:---------------------------------------------------------------------| -| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input BAM files](/commonoptions.md#input-arguments) | -| `--extra-params` | `-x` | string | | no | Additional Hapcut2 arguments, in quotes | -| `--genome ` | `-g` | file path | | no | Path to genome if wanting to also use reads spanning indels | -| `--ignore-bx` | `-b` | toggle | | no | Ignore haplotag barcodes for phasing | -| `--molecule-distance` | `-d` | integer | 100000 | no | Base-pair distance threshold to separate molecules | -| `--prune-threshold` | `-p` | integer (0-100) | 7 | no | PHRED-scale (%) threshold for pruning low-confidence SNPs | -| `--vcf` | `-v` | file path | | **yes** | Path to BCF/VCF file | -| `--vcf-samples` | | toggle | | no | [Use samples present in vcf file](#prioritize-the-vcf-file) for imputation rather than those found the directory | +| argument | short name | default | description | +| :-------------------- | :--------: | :------: | :--------------------------------------------------------------------------------------------------------------------------- | +| `INPUTS` | | | [!badge variant="info" text="required"] Files or directories containing [input BAM files](/commonoptions.md#input-arguments) | +| `--contigs` | | | [Contigs to plot](/commonoptions.md#--contigs) in the report | +| `--extra-params` | `-x` | | 
Additional Hapcut2 arguments, in quotes | +| `--genome` | `-g` | | Path to genome if you also want to use reads spanning indels | +| `--ignore-bx` | `-b` | | Ignore haplotag barcodes for phasing | +| `--molecule-distance` | `-d` | `100000` | Base-pair distance threshold to separate molecules | +| `--prune-threshold` | `-p` | `7` | PHRED-scale (%) threshold for pruning low-confidence SNPs | +| `--vcf` | `-v` | | [!badge variant="info" text="required"] Path to BCF/VCF file | +| `--vcf-samples` | | | [Use samples present in vcf file](#prioritize-the-vcf-file) for phasing rather than those found in the directory | ### Prioritize the vcf file Sometimes you want to run phasing on all the samples present in the `INPUTS`, but other times you may want
:--------------------------------------------------------------------------------------- | +| `variants.phased.bcf*` | final vcf output of HapCut2 with all samples merged into a single file (with .csi index) | +| `annotations/` | phased vcf annotated with phased blocks | +| `annotations_merge/` | merged vcf of annotated and original vcf | +| `extractHairs/` | output from `extractHairs` | +| `extractHairs/logs/` | everything HapCut2's `extractHairs` prints to `stderr` | +| `input/head.names` | extra file harpy creates to support new INFO fields in the phased VCF | +| `input/*.bcf` | vcf of a single sample from the original multi-sample input vcf | +| `input/*.het.bcf` | vcf of heterozygous loci of a single sample from the original multi-sample input vcf | +| `linkFragments/` | results from HapCut2's `linkFragments` | +| `linkFragments/logs` | everything `linkFragments` prints to `stderr` | +| `reports/blocks.summary.gz` | summary information of all the samples' block files | +| `reports/phase.html` | report of haplotype phasing results | +| `phaseBlocks/*.blocks*` | output from HapCut2 | +| `phaseBlocks/logs` | everything HapCut2 prints to `stderr` | +++ :icon-code-square: HapCut2 parameters By default, Harpy runs `HAPCUT2` with these parameters (excluding inputs and outputs): @@ -176,13 +178,9 @@ Advanced Options: These are the summary reports Harpy generates for this workflow. You may right-click the image and open it in a new tab if you wish to see the example in better detail. -||| Overall Phasing -Aggregates phasing metrics across all samples. +||| Phasing Performance +Aggregates phasing metrics overall and across all samples. 
![reports/phase.html](/static/report_phase.png) -||| Per Sample Phasing -The second tab shows haplotype metrics for every sample -![reports/phase.html](/static/report_phase2.png) ||| - +++ diff --git a/Modules/preflight.md b/Workflows/preflight.md similarity index 58% rename from Modules/preflight.md rename to Workflows/preflight.md index c8063af29..a511c5672 100644 --- a/Modules/preflight.md +++ b/Workflows/preflight.md @@ -11,9 +11,10 @@ order: 5 - at least 2 cores/threads available - [!badge corners="pill" text="preflight bam"]: SAM/BAM alignment files [!badge variant="secondary" text="BAM recommended"] - [!badge corners="pill" text="preflight fastq"]: paired-end reads from an Illumina sequencer in FASTQ format [!badge variant="secondary" text="gzip recommended"] + - **sample names**: [!badge variant="success" text="a-z"] [!badge variant="success" text="0-9"] [!badge variant="success" text="."] [!badge variant="success" text="_"] [!badge variant="success" text="-"] [!badge variant="secondary" text="case insensitive"] - **forward**: [!badge variant="success" text="_F"] [!badge variant="success" text=".F"] [!badge variant="success" text=".1"] [!badge variant="success" text="_1"] [!badge variant="success" text="_R1_001"] [!badge variant="success" text=".R1_001"] [!badge variant="success" text="_R1"] [!badge variant="success" text=".R1"] - **reverse**: [!badge variant="success" text="_R"] [!badge variant="success" text=".R"] [!badge variant="success" text=".2"] [!badge variant="success" text="_2"] [!badge variant="success" text="_R2_001"] [!badge variant="success" text=".R2_001"] [!badge variant="success" text="_R2"] [!badge variant="success" text=".R2"] - - **fastq extension**: [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="success" text=".FQ"] [!badge variant="success" text=".FASTQ"] + - **fastq extension**: [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="secondary" 
text="case insensitive"] === Harpy does a lot of stuff with a lot of software and each of these programs expects the incoming data to follow particular formats (plural, unfortunately). @@ -46,9 +47,9 @@ harpy preflight bam --threads 20 Align/bwa In addition to the [!badge variant="info" corners="pill" text="common runtime options"](/commonoptions.md), the [!badge corners="pill" text="preflight fastq"] and [!badge corners="pill" text="preflight bam"] modules are configured using only command-line input arguments: {.compact} -| argument | short name | type | default | required | description | -|:------------------|:----------:|:-----------|:-------:|:--------:|:-------------------------------------------------------------------------------------| -| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input fastq or bam files](/commonoptions.md#input-arguments) | +| argument | short name | default | description | +| :------- | :--------: | :-----: | :------------------------------------------------------------------------------------------------------------------------------------ | +| `INPUTS` | | | [!badge variant="info" text="required"] Files or directories containing [input fastq or bam files](/commonoptions.md#input-arguments) | ## Workflow @@ -58,25 +59,25 @@ the haplotagging data format, you will find little value in running [!badge corn of the language such as when "any" and "all" are written.
{.compact} -| Criteria | Pass Condition | Fail Condition | -|:---|:---|:---| -|AxxCxxBxxDxx format| **all** reads with BX:Z: tag have properly formatted `AxxCxxBxxDxx` barcodes | **any** BX:Z: barcodes have incorrect format| -|follows SAM spec | **all** reads have proper `TAG:TYPE:VALUE` comments | **any** reads have incorrectly formatted comments| -|BX:Z: last comment | **all** reads have `BX:Z`: as final comment| **at least 1 read** doesn't have `BX:Z:` tag as final comment| -|BX:Z: tag | any `BX:Z:` tags present | **all** reads lack `BX:Z:` tag| +| Criteria | Pass Condition | Fail Condition | +| :------------------ | :--------------------------------------------------------------------------- | :------------------------------------------------------------ | +| AxxCxxBxxDxx format | **all** reads with BX:Z: tag have properly formatted `AxxCxxBxxDxx` barcodes | **any** BX:Z: barcodes have incorrect format | +| follows SAM spec | **all** reads have proper `TAG:TYPE:VALUE` comments | **any** reads have incorrectly formatted comments | +| BX:Z: last comment | **all** reads have `BX:Z:` as final comment | **at least 1 read** doesn't have `BX:Z:` tag as final comment | +| BX:Z: tag | any `BX:Z:` tags present | **all** reads lack `BX:Z:` tag | +++ bam checks Below is a table of the format specifics [!badge corners="pill" text="preflight bam"] checks for SAM/BAM files. Take note of the language such as when "any" and "all" are written.
{.compact} -| Criteria | Pass Condition | Fail Condition | -|:---|:---|:---| -|name matches| the file name matches the `@RG ID:` tag in the header| file name does not match `@RG ID:` in the header| -|MI: tag| **any** alignments with `BX:Z:` tags also have `MI:i:` (or `MI:Z:`) tags| **all** reads have `BX:Z:` tag present but `MI:i:` tag absent| -|BX:Z: tag| any `BX:Z:` tags present| **all** alignments lack `BX:Z:` tag| -|AxxCxxBxxDxx format| **all** alignments with BX:Z: tag have properly formatted `AxxCxxBxxDxx` barcodes| **any** `BX:Z:` barcodes have incorrect format| -|BX:Z: last tag| **all** reads have `BX:Z`: as final tag in alignment records | **at least 1 read** doesn't have `BX:Z:` tag as final tag| +| Criteria | Pass Condition | Fail Condition | +| :------------------ | :-------------------------------------------------------------------------------- | :------------------------------------------------------------ | +| name matches | the file name matches the `@RG ID:` tag in the header | file name does not match `@RG ID:` in the header | +| MI: tag | **any** alignments with `BX:Z:` tags also have `MI:i:` (or `MI:Z:`) tags | **all** reads have `BX:Z:` tag present but `MI:i:` tag absent | +| BX:Z: tag | any `BX:Z:` tags present | **all** alignments lack `BX:Z:` tag | +| AxxCxxBxxDxx format | **all** alignments with BX:Z: tag have properly formatted `AxxCxxBxxDxx` barcodes | **any** `BX:Z:` barcodes have incorrect format | +| BX:Z: last tag | **all** reads have `BX:Z:` as final tag in alignment records | **at least 1 read** doesn't have `BX:Z:` tag as final tag | +++ output The default output directory is `Preflight/fastq` or `Preflight/bam` depending on which mode you are using.
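As a rough sketch of what one of these checks looks like in practice (a standalone awk example, not Harpy's actual implementation), the "BX:Z: last comment" criterion for FASTQ headers could be probed like this:

```shell
# flag fastq records (1 header per 4 lines) whose final header comment is not a BX:Z: tag
printf '@r1 AB:Z:x BX:Z:A01C01B01D01\nACGT\n+\nFFFF\n@r2 BX:Z:A01C01B01D01 AB:Z:x\nACGT\n+\nFFFF\n' \
  | awk 'NR % 4 == 1 && $NF !~ /^BX:Z:/ { bad++ } END { print bad + 0 }'
# prints: 1   (only @r2 fails, because its BX:Z: tag is not the final comment)
```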
diff --git a/Modules/qc.md b/Workflows/qc.md similarity index 73% rename from Modules/qc.md rename to Workflows/qc.md index ddc397844..ffb70d859 100644 --- a/Modules/qc.md +++ b/Workflows/qc.md @@ -9,9 +9,10 @@ order: 4 === :icon-checklist: You will need - at least 2 cores/threads available - paired-end [fastq](../haplotagdata.md/#naming-conventions) sequence files [!badge variant="secondary" text="gzip recommended"] + - **sample names**: [!badge variant="success" text="a-z"] [!badge variant="success" text="0-9"] [!badge variant="success" text="."] [!badge variant="success" text="_"] [!badge variant="success" text="-"] [!badge variant="secondary" text="case insensitive"] - **forward**: [!badge variant="success" text="_F"] [!badge variant="success" text=".F"] [!badge variant="success" text=".1"] [!badge variant="success" text="_1"] [!badge variant="success" text="_R1_001"] [!badge variant="success" text=".R1_001"] [!badge variant="success" text="_R1"] [!badge variant="success" text=".R1"] - **reverse**: [!badge variant="success" text="_R"] [!badge variant="success" text=".R"] [!badge variant="success" text=".2"] [!badge variant="success" text="_2"] [!badge variant="success" text="_R2_001"] [!badge variant="success" text=".R2_001"] [!badge variant="success" text="_R2"] [!badge variant="success" text=".R2"] - - **fastq extension**: [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="success" text=".FQ"] [!badge variant="success" text=".FASTQ"] + - **fastq extension**: [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="secondary" text="case insensitive"] === Raw sequences are not suitable for downstream analyses. 
They have sequencing adapters, @@ -31,16 +32,16 @@ harpy qc --threads 20 Sequences_Raw/ In addition to the [!badge variant="info" corners="pill" text="common runtime options"](/commonoptions.md), the [!badge corners="pill" text="qc"] module is configured using these command-line arguments: {.compact} -| argument | short name | type | default | required | description | -|:-----------------|:----------:|:------------|:-------:|:-------:|:--------------------------------------------------------------------------------------------------| -| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments) | -| `--deconvolve` | `-c` | toggle | | | Resolve barcode clashes between reads from different molecules | -| `--deconvolve-params` | `-p` | (int,int,int,int) | (21,40,3,0) | | Accepts the [QuickDeconvolution parameters](/Modules/deconvolve.md/#running-options) for `k`,`w`,`d`,`a`, in that order | -| `--deduplicate` | `-d` | toggle | | | Identify and remove PCR duplicates | -| `--extra-params` | `-x` | string | | | Additional fastp arguments, in quotes | -| `--min-length` | `-n` | integer | 30 | | Discard reads shorter than this length | -| `--max-length` | `-m` | integer | 150 | | Maximum length to trim sequences down to | -| `--trim-adapters` | `-a` | toggle | | | Detect and remove adapter sequences | +| argument | short name | default | description | +| :-------------------- | :--------: | :---------: | :----------------------------------------------------------------------------------------------------------------------------- | +| `INPUTS` | | | [!badge variant="info" text="required"] Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments) | +| `--deconvolve` | `-c` | | Resolve barcode clashes between reads from different molecules | +| `--deconvolve-params` | `-p` | `21,40,3,0` | Accepts the [QuickDeconvolution 
parameters](/Workflows/deconvolve.md/#running-options) for `k`,`w`,`d`,`a`, in that order | +| `--deduplicate` | `-d` | | Identify and remove PCR duplicates [!badge variant="secondary" text="recommended"] | +| `--extra-params` | `-x` | | Additional fastp arguments, in quotes | +| `--min-length` | `-n` | `30` | Discard reads shorter than this length | +| `--max-length` | `-m` | `150` | Maximum length to trim sequences down to | +| `--trim-adapters` | `-a` | | Detect and remove adapter sequences [!badge variant="secondary" text="recommended"] | By default, this workflow will only quality-trim the sequences. You can also opt-in to: - [!badge variant="secondary" text="recommended"] find and remove sequencing adapters @@ -123,7 +124,7 @@ Aggregates the metrics FASTP generates for every sample during QC. ![reports/trim.report.html](/static/report_trim_aggregate.png) ||| BX validation Reports the number of valid/invalid barcodes in the sequences and the segments contributing to invalidation. -![reports/summary.bx.valid.html](/static/report_align_bxstats.png) +![reports/summary.bx.valid.html](/static/report_qc_bx.png) ||| +++ diff --git a/Modules/snp.md b/Workflows/snp.md similarity index 75% rename from Modules/snp.md rename to Workflows/snp.md index c5e5492a6..5c5dbf4a0 100644 --- a/Modules/snp.md +++ b/Workflows/snp.md @@ -9,8 +9,9 @@ order: 2 === :icon-checklist: You will need - at least 4 cores/threads available -- sequence alignments, in BAM format: [!badge variant="success" text=".bam"] -- genome assembly in FASTA format: [!badge variant="success" text=".fasta"] [!badge variant="success" text=".fa"] [!badge variant="success" text=".fasta.gz"] [!badge variant="success" text=".fa.gz"] +- sequence alignments: [!badge variant="success" text=".bam"] [!badge variant="secondary" text="coordinate-sorted"] + - **sample name**: [!badge variant="success" text="a-z"] [!badge variant="success" text="0-9"] [!badge variant="success" text="."] [!badge variant="success" text="_"] 
[!badge variant="success" text="-"] [!badge variant="secondary" text="case insensitive"] +- genome assembly in FASTA format: [!badge variant="success" text=".fasta"] [!badge variant="success" text=".fa"] [!badge variant="success" text=".fasta.gz"] [!badge variant="success" text=".fa.gz"] [!badge variant="secondary" text="case insensitive"] - [!badge variant="ghost" text="optional"] sample grouping file ==- :icon-file: sample grouping file [!badge variant="ghost" text="optional"] This file is optional and useful if you want variant calling to happen on a per-population level. @@ -58,14 +59,19 @@ harpy snp freebayes --threads 20 --genome genome.fasta Align/bwa In addition to the [!badge variant="info" corners="pill" text="common runtime options"](../commonoptions.md), the [!badge corners="pill" text="snp"] module is configured using these command-line arguments: {.compact} -| argument | short name | type | default | required | description | -|:-----------------|:----------:|:--------------------------------|:-------:|:--------:|:----------------------------------------------------| -| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input BAM files](/commonoptions.md#input-arguments) | -| `--extra-params` | `-x` | string | | no | Additional mpileup/freebayes arguments, in quotes | -| `--genome` | `-g` | file path | | **yes** | Genome assembly for variant calling | -| `--ploidy` | `-n` | integer | 2 | no | Ploidy of samples | -| `--populations` | `-p` | file path | | no | Tab-delimited file of sample\<*tab*\>group | -| `--regions` | `-r` | integer/file path/string | 50000 | no | Regions to call variants on ([see below](#regions)) | +| argument | short name | default | description | +| :--------------- | :--------: | :-----: | :--------------------------------------------------------------------------------------------------------------------------- | +| `INPUTS` | | | [!badge variant="info" text="required"] Files or directories 
containing [input BAM files](/commonoptions.md#input-arguments) | +| `--extra-params` | `-x` | | Additional mpileup/freebayes arguments, in quotes | +| `--genome` | `-g` | | [!badge variant="info" text="required"] Genome assembly for variant calling | +| `--ploidy` | `-n` | `2` | Ploidy of samples | +| `--populations` | `-p` | | Tab-delimited file of sample\<*tab*\>group | +| `--regions` | `-r` | `50000` | Regions to call variants on ([see below](#regions)) | + + +### ploidy +If you are calling haploid or diploid samples, using either `mpileup` or `freebayes` will be comparable. However, if you need to call SNPs in polyploids (ploidy >2), +then you will need to use `freebayes`, since `mpileup` does not call variants for ploidy greater than 2. ### regions The `--regions` (`-r`) option lets you specify the genomic regions you want to call variants on. Keep in mind that @@ -124,40 +130,34 @@ it. ## :icon-git-pull-request: SNP calling workflow +++ :icon-git-merge: details -The workflow is parallelized over genomic intervals (`--`). All intermediate outputs are removed, leaving +The workflow is parallelized over genomic intervals (`--regions`). All intermediate outputs are removed, leaving you only the raw variants file (in `.bcf` format), the index of that file, and some stats about it. -### mpileup -The `mpileup` and `call` modules from [bcftools](https://samtools.github.io/bcftools/bcftools.html) (formerly samtools) -are used to call variants from alignments. +### SNP workflows +The `mpileup` and `call` modules from [bcftools](https://samtools.github.io/bcftools/bcftools.html) (formerly samtools) or [Freebayes](https://github.com/freebayes/freebayes) are used to call variants from alignments. Both +are very popular variant callers to call SNPs and small indels. 
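Following the ploidy note above, calling SNPs in a tetraploid dataset is just a matter of routing through the freebayes workflow (a sketch; the genome and alignment paths are placeholders):

```bash
# mpileup does not call variants at ploidy >2, so polyploid data must use freebayes
harpy snp freebayes --threads 20 --ploidy 4 --genome genome.fasta Align/bwa
```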
```mermaid graph LR subgraph Inputs aln[BAM alignments]:::clean---gen[genome]:::clean end - Inputs --> B([freebayes]):::clean - B-->C([bcftools call]):::clean + subgraph mpileup + B([mpileup]):::clean-->C([bcftools call]):::clean + end + subgraph freebayes + z([freebayes]):::clean + end + Inputs --> mpileup + Inputs --> freebayes C-->D([index BCFs]):::clean D-->E([combine BCFs]):::clean + z --> D C-->E + E-->F([realign indels]):::clean style Inputs fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px - classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px -``` - -### freebayes -[Freebayes](https://github.com/freebayes/freebayes) is a very popular variant caller that uses local haplotype assemblies to -call SNPs and small indels. Like mpileup, this method is ubiquitous in bioinformatics and very easy to use. - -```mermaid -graph LR - subgraph Inputs - aln[BAM alignments]:::clean---gen[genome]:::clean - end - Inputs --> B([freebayes]):::clean - B-->D([index BCFs]):::clean - D-->E([combine BCFs]):::clean - style Inputs fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px + style mpileup fill:#b0c1b3,stroke:#9eada1,stroke-width:2px + style freebayes fill:#bcb7cd,stroke:#a9a4b8,stroke-width:2px classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px ``` @@ -187,6 +187,7 @@ SNP/method | item | description | |:--------------------------|:-----------------------------------------------------------------------------------------------| | `variants.raw.bcf` | vcf file produced from variant calling, contains all samples and loci | +| `variants.normalized.bcf` | variants, but with indels realigned and duplicates removed | | `variants.*.bcf.csi` | index file for `variants.*.bcf` | | `logs/*.call.log` | what `bcftools call` writes to `stderr` | | `logs/*.METHOD.log` | what `bcftools mpileup` or `freebayes` writes to `stderr` | diff --git a/blog/filteringsnps.md b/blog/filtering_snps.md similarity index 95% rename from blog/filteringsnps.md rename to blog/filtering_snps.md index 
dda6c0e9a..549fa9d39 100644 --- a/blog/filteringsnps.md +++ b/blog/filtering_snps.md @@ -1,5 +1,7 @@ --- -author: Pavel Dimens +author: + - name: Pavel Dimens + avatar: https://cals.cornell.edu/sites/default/files/styles/faculty/public/2024-09/afs-headshot-high-res-2cropped_0.jpg date: 2024-06-05 category: guides description: A gentle introduction to the wild world of filtering SNPs @@ -70,5 +72,5 @@ fairly common to use `0.05` (e.g. `-i 'MAF>0.05'`) to `0.10` (e.g. `-i 'MAF>0.10 Missing data is, frankly, not terribly useful. The amount of missing data you're willing to tolerate will depend on your study, but it's common to remove sites with >20% missing data (e.g. `-e 'F_MISSING>0.2'`). This can be as strict (or lenient) as you want; it's not uncommon to see very conservative filtering at 10% or 5% missing data. **However**, you can impute missing genotypes to recover -missing data! Harpy can leverage linked-read information to impute genotypes with the [!badge corners="pill" text="impute"](../Modules/impute.md) +missing data! Harpy can leverage linked-read information to impute genotypes with the [!badge corners="pill" text="impute"](../Workflows/impute.md) module. You should try to impute genotypes first before filtering out sites based on missingness. 
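Putting these thresholds together, a combined filtering pass could look like the following (a sketch; the cutoffs and file names are examples, not prescriptions):

```bash
# keep sites with MAF > 0.05 and no more than 20% missing genotypes
bcftools view -i 'MAF>0.05 && F_MISSING<=0.2' variants.raw.bcf -O b -o variants.filtered.bcf
```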
diff --git a/blog/simulate_diploid.md b/blog/simulate_diploid.md new file mode 100644 index 000000000..0d580cbfe --- /dev/null +++ b/blog/simulate_diploid.md @@ -0,0 +1,179 @@ +--- +author: + - name: Pavel Dimens + avatar: https://cals.cornell.edu/sites/default/files/styles/faculty/public/2024-09/afs-headshot-high-res-2cropped_0.jpg +date: 2024-09-30 +category: guides +description: A realistic workflow to simulate variants +icon: git-merge-queue +image: https://neupsykey.com/wp-content/uploads/2017/03/ch02_f01.jpg +--- + +# :icon-git-merge-queue: Simulating variants +You may want to (and are encouraged to) simulate data before investing in the +costs associated with linked-read sample preparation and subsequent sequencing. +Harpy provides both variant and linked-read simulators, and this tutorial serves to +show a real-world workflow starting with a haploid genome and creating a diploid genome +with the variants we want in it, which will then be fed into linked-read simulation. The +process might seem a little roundabout due to the limitations of the underlying software, +but it shouldn't be too bad to wrap your head around it! Ultimately, you will create +linked-reads from the resulting genome and then align those reads onto your **original** +genome to identify those variants. + +=== :icon-checklist: You will need +- a genome assembly in FASTA format: [!badge variant="success" text=".fasta"] [!badge variant="success" text=".fa"] [!badge variant="success" text=".fasta.gz"] [!badge variant="success" text=".fa.gz"] [!badge variant="secondary" text="case insensitive"] +!!! Picking a genome +For simplicity and shorter runtimes, this tutorial can be followed with a [Drosophila melanogaster](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001215.4/) +genome. Otherwise use your favorite genome! For learning purposes, the fewer contigs the better. +!!!
+=== + +To keep this tutorial simple, but with a certain amount of real-world complexity, let's say you are interested in **inversions** and +whether your linked-reads will be able to identify inversions in your system. To simulate inversions, we will first simulate the inversions, +then introduce SNPs and small indels. The SNPs and indels serve to create typical variants in the downstream linked-reads because it's highly +unlikely that the _only_ variants in your data are a few inversions. + +!!! heterozygosity +By specifying a heterozygosity value for `harpy simulate ...`, we can make sure that our diploid haplotypes aren't exclusively homozygous for the alternative allele of the variants we are introducing. +!!! + +## 1. Add random inversions +First, we will need to simulate some inversions and set a `--heterozygosity` value >0 to get a diploid genome as the output. +If you wanted to manually create inversions in specific areas or with specific lengths, this would be a good starting point too, since +you could manually modify the resulting VCF to create the specific inversions you want (in that case, use `--only-vcf`). We won't be covering +that here, but you should hopefully be able to intuit how to do that by the end of this tutorial. + +```bash +harpy simulate inversion --conda -n 20 -z 0.5 --min-size 30000 dmel.fa +``` +==- :icon-terminal: code explanation +- `-n` is the number of random inversions (`20`) +- `--min-size` is the minimum inversion size (`30000`) + - default is 1000 bp, which was arbitrarily made bigger in this example +- `-z` is the level of heterozygosity (`0.5` = 50%) +- `--conda` is optional and a matter of runtime preference +- `dmel.fa` is your genome file +==- :icon-git-compare: diagram +```mermaid +graph LR + geno(haploid genome)-->|simulate inversion ... 
-z 0.5|hap(sim.fasta):::clean + hap-->hap1(haplotype1.fasta) + hap-->hap2(haplotype2.fasta) + style geno fill:#ebb038,stroke:#d19b2f,stroke-width:2px + style hap1 fill:#90c8be,stroke:#6fb6a9,stroke-width:2px + style hap2 fill:#bd8fcb,stroke:#a460b7,stroke-width:2px + classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px +``` +=== 📝 output to keep +The two haploid FASTA files in `Simulate/inversion/diploid/` +=== + +## 2. Add snps and indels +Let's say we wanted to simulate SNPs and indels like so: +- 500 indels +- 75k snps +- introduce a heterozygosity of 10% for the variants + - based on [this for flies](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1203202/pdf/255.pdf) + +We will first need to create random SNPs and indels from the haploid genome, then let Harpy create +the homo/heterozygotes. We can then use the VCFs of hom/het variants to simulate those genotypes ([Step 3](#3-simulate-known-snps-and-indels-onto-the-diploid-genome-with-inversions)) onto +the resulting diploid genome from [Step 1](#1-add-random-inversions). + +The Harpy command to accomplish this is: +```bash +harpy simulate snpindel -m 500 -n 75000 -z 0.1 --only-vcf --conda -o sim_snpindel GENOME.fa +``` +==- :icon-terminal: code explanation +- `-m` is the number of random indels (`500`) +- `-n` is the number of random snps (`75000`) +- `-z` is the level of heterozygosity (`0.1` = 10%) +- `-o` is the name of the output directory + - specifying this so subsequent runs don't overwrite each other +- `--only-vcf` is to skip the diploid genome simulation +- `--conda` is optional and a matter of runtime preference +- `GENOME.fa` is your genome file +==- :icon-git-compare: diagram +```mermaid +graph LR + geno(haploid genome)-->|simulate snpindel ... 
-z 0.1 --only-vcf|hap(sim.fasta):::clean + hap-->snp1(snp.hap1.vcf) + hap-->snp2(snp.hap2.vcf) + hap-->indel1(indel.hap1.vcf) + hap-->indel2(indel.hap2.vcf) + style geno fill:#ebb038,stroke:#d19b2f,stroke-width:2px + style snp1 fill:#f5f6f9,stroke:#90c8be,stroke-width:2px + style indel1 fill:#f5f6f9,stroke:#90c8be,stroke-width:2px + style snp2 fill:#f5f6f9,stroke:#bd8fcb,stroke-width:2px + style indel2 fill:#f5f6f9,stroke:#bd8fcb,stroke-width:2px + classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px +``` +=== 📝 output to keep +The VCF files of SNPs and indels in the `diploid/` subdirectory. There should be two VCFs for SNPs and two for indels (one per haplotype), depending +on whether you simulated one variant type or both (up to a total of 4 VCF files). +=== + +## 3. Simulate "known" snps and indels onto the diploid genome with inversions +We will run Harpy twice, once for each haplotype, using the corresponding VCFs from [**Step 2**](#2-add-snps-and-indels): + +The Harpy command to accomplish this is: +```bash +# haplotype 1 +harpy simulate snpindel --conda --snp-vcf SNP.hap1.vcf --indel-vcf indel.hap1.vcf -o sim_snp_hap1 haplotype1.fa + +# haplotype 2 +harpy simulate snpindel --conda --snp-vcf SNP.hap2.vcf --indel-vcf indel.hap2.vcf -o sim_snp_hap2 haplotype2.fa +``` +==- :icon-terminal: code explanation +- `--snp-vcf` is the vcf of snps for haplotype 1 (or 2) from [**Step 2**](#2-add-snps-and-indels) +- `--indel-vcf` is the vcf of indels for haplotype 1 (or 2) from [**Step 2**](#2-add-snps-and-indels) +- `-o` is the name of the output directory +- `--conda` is optional and a matter of runtime preference +- `haplotypeX.fa` refers to the two haplotype genomes from [**Step 1**](#1-add-random-inversions) +==- :icon-git-compare: diagram +```mermaid +graph LR + subgraph id1 ["Haplotype 1 inputs"] + geno1(haplotype1.fasta) + snphap1(snp.hap1.vcf) + indelhap1(indel.hap1.vcf) + end + subgraph id2 ["Haplotype 2 inputs"] + geno2(haplotype2.fasta) + snphap2(snp.hap2.vcf) + 
indelhap2(indel.hap2.vcf) + end + + id1-->|simulate snpindel -s -i|genohap1(haplotype1.fasta final) + id2-->|simulate snpindel -s -i|genohap2(haplotype2.fasta final) + + style id1 fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px + style id2 fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px + style geno1 fill:#90c8be,stroke:#6fb6a9,stroke-width:2px + style geno2 fill:#bd8fcb,stroke:#a460b7,stroke-width:2px + style snphap2 fill:#f5f6f9,stroke:#bd8fcb,stroke-width:2px + style snphap1 fill:#f5f6f9,stroke:#90c8be,stroke-width:2px + style indelhap1 fill:#f5f6f9,stroke:#90c8be,stroke-width:2px + style indelhap2 fill:#f5f6f9,stroke:#bd8fcb,stroke-width:2px + style genohap1 fill:#90c8be,stroke:#000000,stroke-width:2px + style genohap2 fill:#bd8fcb,stroke:#000000,stroke-width:2px +``` +=== 📝 output to keep +The resulting genomes for both haplotype 1 and haplotype 2 +=== +------- + +## 4. Simulating linked-reads +Now that you have heterozygous haplotypes created from your starting genome, you can simulate linked-reads from them using +`harpy simulate linkedreads`. A simple implementation of that could look like: +```bash +harpy simulate linkedreads --conda -t 4 HAP1.fa HAP2.fa +``` +==- :icon-terminal: code explanation +- `--conda` is optional and a matter of runtime preference +- `-t` is the number of threads to use (`4`) +- `HAP1.fa` is the resulting genome from [**Step 3: haplotype 1**](#3-simulate-known-snps-and-indels-onto-the-diploid-genome-with-inversions) + - i.e. haplotype 1 with the simulated snps, indels, and inversions +- `HAP2.fa` is the resulting genome from [**Step 3: haplotype 2**](#3-simulate-known-snps-and-indels-onto-the-diploid-genome-with-inversions) + - i.e. 
haplotype 2 with the simulated snps, indels, and inversions +=== \ No newline at end of file diff --git a/blog/softwareenvironments.md b/blog/software_environments.md similarity index 97% rename from blog/softwareenvironments.md rename to blog/software_environments.md index 57773aac8..88228a4ab 100644 --- a/blog/softwareenvironments.md +++ b/blog/software_environments.md @@ -1,5 +1,7 @@ --- -author: Pavel Dimens +author: + - name: Pavel Dimens + avatar: https://cals.cornell.edu/sites/default/files/styles/faculty/public/2024-09/afs-headshot-high-res-2cropped_0.jpg date: 2024-06-21 category: guides description: Deciding between using Conda or Containers diff --git a/blog/sort_by_barcode.md b/blog/sort_by_barcode.md new file mode 100644 index 000000000..6e566c13f --- /dev/null +++ b/blog/sort_by_barcode.md @@ -0,0 +1,81 @@ +--- +author: + - name: Pavel Dimens + avatar: https://cals.cornell.edu/sites/default/files/styles/faculty/public/2024-09/afs-headshot-high-res-2cropped_0.jpg +date: 2024-11-05 +category: guides +description: Sorting data by linked-read barcode +icon: list-ordered +image: https://t4.ftcdn.net/jpg/07/40/71/29/360_F_740712956_dyX3F3ehpbdqjCZT0RdqkJ8Aniu7fecl.jpg +--- + +# :icon-list-ordered: Sort data by barcode +You would think sorting data would be a no-brainer, and in most cases it is. +You can use `seqtk` or `seqkit` to sort FASTQ/A files by their IDs, `samtools` to sort +SAM/BAM/CRAM files by name or coordinates. However, in the world of linked-read +data, sometimes you may need to sort your FASTQ (or BAM) files by the +linked-read barcode. The way to do that wasn't initially obvious to the Harpy/haplotag +team, so this article serves to make this knowledge widely available to linked-read +adopters. + +## Sorting Alignments +Let's start with BAM (or SAM/CRAM) files because the process is much simpler. 
+Since the linked-read barcode is stored in a `BX:Z` tag (or less often as `BC:Z:`), +we can use a little feature of `samtools sort` to guide the sort by the barcode: +> -t TAG +> Sort first by the value in the alignment tag TAG, then by position or name (if using -n or -N) + +The `-t` option then makes it pretty trivial to sort an alignment file by barcode: +```bash +samtools sort -t BX file.bam > sorted.bam +``` + +The above command will accomplish sorting by whatever kind of barcode is listed in +the `BX:Z` tag. If your barcode was in the `BC:Z` tag, you would use `-t BC`. + +!!! BX and not BX:Z +Notice that you only specify `BX` rather than `BX:Z`, because the `:Z` part +of the tag is a SAM specification to indicate what type of data is in the tag (i.e. `TAG:TYPE:VALUE`). +A tag is named by its first two letters and you cannot have clashing names in a single record +(e.g. having both `BX:Z` and `BX:i` would be invalid because they are both named `BX`). +!!! + +## Sorting FASTQ +Sorting FASTQ files by barcode is trickier, only because there aren't (to our knowledge!) +any existing convenience methods to do it. Like any bioinformatics puzzle, you could +probably solve it with a sophisticated AWK command, but HTSlib tools are so much more +efficient and built for these exact purposes. The process to accomplish this includes +3 steps that will be shown at the end as a single pipe. + +### 1. convert FASTQ to SAM +Yep, we're solving our problem by doing a simple file conversion to SAM/BAM. That's +the easiest way to do it, surprisingly. FASTQ files can be converted to unmapped BAM +files using `samtools import`, which would also interleave the forward and reverse +reads into a single file. The `-T "*"` argument preserves all the tags between file formats. +```bash +samtools import -T "*" sample_01.R1.fq sample_01.R2.fq > sample_01.sam +``` + +### 2. 
sort the SAM by barcode +Exactly as shown above to sort a SAM/BAM file with `samtools sort`, we're going +to do the same on the unmapped SAM file we just created: +```bash +samtools sort -O SAM -t BX sample_01.sam > sample_01.sort.sam +``` + +### 3. convert SAM back to FASTQ +Now that the data have been sorted, we need to convert them back into forward and reverse +FASTQ files using `samtools fastq`. The `-T "*"` argument once again preserves all the tags +between file formats. The `-1` and `-2` arguments are the forward and reverse output FASTQ files, +respectively. +```bash +samtools fastq -T "*" -1 sample_01.sort.R1.fq -2 sample_01.sort.R2.fq sample_01.sort.sam +``` + +### as a single pipe +Rather than splitting out these three processes, you can stream/pipe them in a single workflow: +```bash +samtools import -T "*" sample_01.R1.fq sample_01.R2.fq | +samtools sort -O SAM -t BX | +samtools fastq -T "*" -1 sample_01.sort.R1.fq -2 sample_01.sort.R2.fq +``` \ No newline at end of file diff --git a/blog/svpooling.md b/blog/sv_pooling.md similarity index 80% rename from blog/svpooling.md rename to blog/sv_pooling.md index 68a7b6e43..ce6a046e9 100644 --- a/blog/svpooling.md +++ b/blog/sv_pooling.md @@ -1,5 +1,7 @@ --- -author: Pavel Dimens +author: + - name: Pavel Dimens + avatar: https://cals.cornell.edu/sites/default/files/styles/faculty/public/2024-09/afs-headshot-high-res-2cropped_0.jpg date: 2024-08-01 category: guides description: Why pool samples for SV calling and when to do it @@ -59,34 +61,43 @@ get results like these: ![PCA of Alosa sapidissima (SNPs from low-depth haplotag dataset)](/static/pca.png) Given these results, a sensible pooling strategy may be: -- **Pool 1**: Miramichi samples -- **Pool 2**: Annapolis samples -- **Pool 3**: St_Johns samples -- **Pool 4**: Santee_Cooper samples -- **Pool 5**: Roanoke + Chowan-Blackwater samples -- **Pool 6**: Potomac samples -- **Pool 7**: Delaware samples -- **Pool 8**: Hudson samples -- **Pool 9**: Connecticut
samples -- **Pool 10**: Merrimack + Kennebec samples + +{.compact} +| Pool | Sample Locations | +| :---:| :------------------------- | +| 1 | Miramichi | +| 2 | Annapolis | +| 3 | St_Johns | +| 4 | Santee_Cooper | +| 5 | Roanoke, Chowan-Blackwater | +| 6 | Potomac | +| 7 | Delaware | +| 8 | Hudson | +| 9 | Connecticut | +| 10 | Merrimack, Kennebec | !!!danger Do not concatenate manually! It would seem natural to use `samtools cat` to combine alignment files -quickly and easily, but **do not do this**. The reason is, samples aligned +quickly and easily, but **do not do this**. The [!badge corners="pill" text="harpy sv"](/Workflows/SV/SV.md) +workflows will intelligently concatenate files and will make sure +every individual will have unique `MI` values that are not shared with any +other individual in the pool. If you need to concatenate linked-read alignment files outside +of a workflow, use [concatenate_bam.py](/utilities.md#concatenate_bampy) shipped with Harpy + instead of `samtools cat` or other similar tools. +==- technical explanation +The reason is, samples aligned using Harpy have their linked-read barcodes deconvolved and supplemented with a unique molecule ID, given as an `MI:i` SAM tag in the alignment records. The molecule ID's start at `1` for each sample and increment for every identified unique molecule. What this means is that doing a typical concatenation via -`samtools cat` **will not resolve conflicting molecule IDs**. Think of it this way, +`samtools cat` **will not resolve conflicting molecule IDs**. + +Think of it this way, every sample has MI's from `1` to `N`, and as soon as you do a basic concatenation, the MI 1..N from sample2 will immediately conflict with 1..N from sample1. Why? Well, it's impossible that `MI:i:1` from sample1 and `MI:i:1` from sample2 came from the same molecule. 
A basic concatenation will flood the resulting file with clashing `MI` tags, which will make it impossible for a linked-read aware SV caller to make sense of the data and do its job well. - -The [!badge corners="pill" text="harpy sv"](/Modules/SV/SV.md) workflows will intelligently concatenate files and will make sure -every individual will have unique `MI` values that are not shared with any -other individual in the pool. If you need to concatenate linked-read alignment files outside -of a workflow, use `concatenate_bam.py` shipped with Harpy instead of `samtools cat` or other similar tools. +=== !!! diff --git a/commonoptions.md b/commonoptions.md index c957e6e8c..72af27a72 100644 --- a/commonoptions.md +++ b/commonoptions.md @@ -6,11 +6,11 @@ order: 4 # :icon-list-unordered: Common Harpy Options ## Input Arguments -Each of the main Harpy modules (e.g. [!badge corners="pill" text="qc"](Modules/qc.md) or [!badge corners="pill" text="phase"](Modules/phase.md)) follows the format of +Each of the main Harpy modules (e.g. [!badge corners="pill" text="qc"](Workflows/qc.md) or [!badge corners="pill" text="phase"](Workflows/phase.md)) follows the format of ```bash harpy module options arguments ``` -where `module` is something like [!badge corners="pill" text="impute"](Modules/impute.md) or [!badge corners="pill" text="snp mpileup"](Modules/snp.md) and `options` are the runtime parameters, +where `module` is something like [!badge corners="pill" text="impute"](Workflows/impute.md) or [!badge corners="pill" text="snp mpileup"](Workflows/snp.md) and `options` are the runtime parameters, which can include things like an input `--vcf` file, `--molecule-distance`, etc. After the options is where you provide the input files/directories without flags and following standard BASH expansion rules (e.g. wildcards). You can mix and match entire directories, individual files, and wildcard expansions. @@ -27,7 +27,7 @@ to avoid unexpected behavior. 
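As a sketch of those expansion rules (all file paths here are hypothetical):

```bash
# mix a whole directory, two explicit files, and a wildcard expansion in one call
harpy qc raw_reads/ extra/sample9.F.fq.gz extra/sample9.R.fq.gz more_lanes/*.fq.gz
```

The shell expands the wildcard before Harpy ever sees it, so the usual BASH quoting rules apply.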
!!!warning clashing names Given the regex pattern matching Harpy employs under the hood and the isolation of just the sample names for Snakemake rules, files in different directories that have the same name (ignoring extensions) will clash. For example, `lane1/sample1.F.fq` -and `lane2/sample1.F.fq` would both derive the sample name `sample1`, which, in a workflow like [!badge corners="pill" text="align"](Modules/Align/Align.md) +and `lane2/sample1.F.fq` would both derive the sample name `sample1`, which, in a workflow like [!badge corners="pill" text="align"](Workflows/Align/Align.md) would both result in `output/sample1.bam`, creating a problem. This also holds true for the same sample name but different extension, such as `sample1.F.fq` and `sample1_R1.fq.gz`, which would again derive `sample1` as the sample name and create a naming clash for workflow outputs. During parsing, Harpy will inform you of naming clashes and terminate to protect you against this behavior. @@ -36,7 +36,7 @@ During parsing, Harpy will inform you of naming clashes and terminate to protect ## Common command-line options Every Harpy module has a series of configuration parameters. These are arguments you need to input to configure the module to run on your data, such as the directory with the reads/alignments, -the genome assembly, etc. All main modules (e.g. [!badge corners="pill" text="qc"](Modules/qc.md)) also share a series of common runtime +the genome assembly, etc. All main modules (e.g. [!badge corners="pill" text="qc"](Workflows/qc.md)) also share a series of common runtime parameters that don't impact the results of the module, but instead control the speed/verbosity/etc. of calling the module. 
These runtime parameters are listed in the modules' help strings and can be configured using these arguments: @@ -44,15 +44,40 @@ configured using these arguments: {.compact} | argument | short name | type | default | description | |:--------------- |:----------:|:------- |:-------:|:--------------------------------------------------------------------------------- | -| `--output-dir` | `-o` | string | varies | Name of output directory | -| `--threads` | `-t` | integer | 4 | Number of threads to use | | `--conda` | | toggle | | Use local conda environments instead of preconfigured Singularity container | -| `--skipreports` | | toggle | | Skip the processing and generation of HTML reports in a workflow | -| `--snakemake` | | string | | Additional [Snakemake](snakemake/#adding-snakemake-parameters) options, in quotes | -| `--quiet` | `-q` | toggle | | Suppress Snakemake printing to console | +| `--contigs` | | file path or list | | Contigs to plot in the report(s) | | `--help` | `-h` | | | Show the module docstring | +| `--output-dir` | `-o` | string | varies | Name of output directory | +| `--quiet` | | toggle | | Suppress the progress bars and other status text when running | +| `--skip-reports` | | toggle | | Skip the processing and generation of HTML reports in a workflow | +| `--snakemake` | | string | | Additional [Snakemake](snakemake/#adding-snakemake-parameters) options, in quotes | +| `--threads` | `-t` | integer | 4 | Number of threads to use | + +### --contigs +Some of the workflows (like [!badge corners="pill" text="align"](Workflows/Align/Align.md)) plot per-contig information in their reports. +By default, Harpy will plot up to 30 of the largest contigs. If you are only interested in a specific set of contigs, then you can use `--contigs` +to have Harpy only create plots for those contigs. **This will only impact plotting for reports**. 
This can be done by including a file of one-per-line contig names or a comma-separated +list of contigs (without spaces): + +==- contigs.txt +``` +contig1 +contig2 +sexchrom1 +``` +=== +```bash +harpy align bwa -g genome.fasta --contigs contig1,contig2,sexchrom1 dir/data +# or # +harpy align bwa -g genome.fasta --contigs contigs.txt dir/data +``` +!!!warning too many contigs +Things start to look sloppy and cluttered with >30 contigs, so it's advisable not to +exceed that number. +!!! -As as example, you could call [!badge corners="pill" text="align strobe"](Modules/Align/strobe.md) and specify 20 threads with no output to console: +### example +You could call [!badge corners="pill" text="align strobe"](Workflows/Align/strobe.md) and specify 20 threads with no output to console: ```bash harpy align strobe --threads 20 --quiet samples/trimmedreads @@ -72,10 +97,10 @@ and the contents therein also allow you to rerun the workflow manually. The `wor {.compact} | item | contents | utility | |:-----|:---------|:--------| -|`*.smk` | Snakefile with the full recipe of the workflow | useful for understanding the workflow | -| `config.yml` | Configuration file generated from command-line arguments and consumed by the Snakefile | useful for bookkeeping | -| `report/*.Rmd` | RMarkdown files used to generate the fancy reports | useful to understand math behind plots/tables or borrow code from | -| `*.summary` | Plain-text overview of the important parts of the workflow | useful for bookkeeping and writing Methods | +|`*.smk` | Snakefile with the full recipe of the workflow | understanding the entire workflow | +| `config.yml` | Configuration file generated from command-line arguments and consumed by the Snakefile | general bookkeeping, advanced runs | +| `report/*.Rmd` | RMarkdown files used to generate the fancy reports | seeing the math behind plots/tables or borrowing code | +| `*.summary` | Plain-text overview of the important parts of the workflow | bookkeeping and writing
Methods in manuscripts | --- diff --git a/development.md b/development.md index ac98bfab9..8493a82c3 100644 --- a/development.md +++ b/development.md @@ -2,6 +2,7 @@ label: Development icon: tools order: 1 +hidden: true --- # :icon-tools: Developing Harpy @@ -15,28 +16,27 @@ Before we get into the technical details, you, dear reader, need to understand why Harpy is the way it is. Harpy may be a pipeline for other software, but there is a lot of extra stuff built in to make it user friendly. Not just friendly, but _compassionate_. The guiding ethos for Harpy is -**"We don't hate the user"**. That means there is a lot -of code that checks input files, runtime details, etc. to exit before +**"Don't hate the user"**. That means there is a lot +of code that checks input files, runtime details, etc. before Snakemake takes over. This is done to minimize time wasted on minor errors that only rear their ugly heads 18 hours into a 96-hour process. With that in mind: 1. **Code should be written clearly** because someone else will need to read it at some point, and that person could be future-you who hasn't seen or thought about that code for a while. Write nicely. Annotate. -2. **Error messages should provide all the information a user needs to fix the problem and retry**. It's not enough to exit when an error is identified. Collate -the things causing the error, explain to the user what and why. Harpy follows a -style of presenting and explaining the error, then providing a solution and showing exactly what files/rows/columns/etc. caused the error. Be kind to users. - -![These are Harpy error messages](/static/errormsg.png) - -3. **Documentation is just as important as the code**. No features are undocumented, +2. **Documentation is just as important as the code**. No user-facing features are undocumented, and the documentation should read like something that a new student can pick up and understand.
Good documentation, written compassionately, will lower the barrier to entry for people who just want to process their haplotag data. Harpy isn't about ego, it's about accessibility. We invest in this documentation because -we _want_ to, not because we need to. +we **_want_** to. +3. **Error messages should provide all the information a user needs to fix the problem and retry**. It's not enough to exit when an error is identified. Collate +the things causing the error, explain to the user what and why. Harpy follows a +style of presenting and explaining the error, then providing a solution and showing exactly what files/rows/columns/etc. caused the error. Be kind to users. +![These are Harpy error messages](/static/errormsg.png) + === -## Installing Harpy for development +## Installing dev version The process consists of cloning the harpy repository, installing the preconfigured conda environment, and running the `resources/buildlocal.sh` script to move all the necessary files to the `/bin/` path within your active conda environment. @@ -85,7 +85,7 @@ this is handled by `resources/buildlocal.sh`. It's a little circuitous, but it's we can keep the source code modular, installable, and have the flexibility of using non-python code. -### Bioconda recipe +### bioconda recipe For the ease of installation for end-users, Harpy has a [recipe and build script](https://github.com/bioconda/bioconda-recipes/blob/master/recipes/harpy/meta.yaml) in Bioconda, which makes it available for download and installation. A copy of the recipe is also stored in `resources/meta.yml`. The yaml file is the metadata of the package, including software @@ -95,12 +95,12 @@ requiring any intervention on the development side for the newest Harpy version ## The Harpy repository -### structure -Harpy exists as a Git repository and has 5 standard branches that are used in +### repo structure +Harpy exists as a Git repository and has 3 standard branches that are used in specific ways during development.
Git is a popular version control system and discussing its use is out of the scope of this documentation, however there is no shortage of [great resources](https://www.youtube.com/watch?v=8Dd7KRpKeaE) to get -you started. The 5 standard branches in the Harpy repository are outlined in the +you started. The 3 standard branches in the Harpy repository are outlined in the table below: {.compact} @@ -155,18 +155,18 @@ version. So, if the Harpy version is `1.4.12`, then the associated docker image (unless there is a very good reason otherwise) since automatic Docker tagging happens upon releases of new Harpy versions. ## Automations -### Testing +### testing CI (**C**ontinuous **I**ntegration) is a term describing automated actions that do things to/with your code and are triggered by how you interact with a repository. Harpy has a series of GitHub Actions triggered by interactions with the `main` branch (in `.github/workflows`) to test the Harpy modules depending on which files are being changed by the push or -pull request. It's setup such that, for example, when files associated with +pull request. It's set up such that, for example, when files associated with demultiplexing are altered, it will run `harpy demultiplex` on the test data in the cloud and notify the Harpy devs if for some reason `harpy demultiplex` could not run successfully to completion. These tests do not test for accuracy, but test for breaking behavior. You'd be shocked to find out how many errors -crop up this way and require more development so Harpy can be resilient to more use cases. -### Releases +crop up this way and require more work so Harpy can be resilient to more use cases. +### releases There is [an automation](https://github.com/pdimens/harpy/blob/dev/.github/workflows/createrelease.yml) that gets triggered every time Harpy is tagged with the new version. It strips out the unnecessary files and will upload a cleaned tarball to the new release (reducing filesize by orders of magnitude). 
The automation will also diff --git a/haplotagdata.md b/haplotagdata.md index fda47d8f1..01ab7c80e 100644 --- a/haplotagdata.md +++ b/haplotagdata.md @@ -60,12 +60,12 @@ Read comments that aren't following the `TAG:TYPE:VALUE` SAM spec may cause down The Leviathan structural variant caller expects the `BX:Z:` tag at the end of the alignment record, so if you intend to use that variant caller, you will need to make sure the `BX:Z:` tag is the last one in the _sequence alignment_ (BAM file). If you use any method within -[!badge corners="pill" text="harpy align"](Modules/Align/bwa.md), the `BX:Z:` tag is guaranteed to be at +[!badge corners="pill" text="harpy align"](Workflows/Align/bwa.md), the `BX:Z:` tag is guaranteed to be at the end of the alignment record. !!! ### Read length -Reads must be at least 30 base pairs in length for alignment. By default, the [!badge corners="pill" text="qc"](Modules/qc.md) module removes reads <30bp. +Reads must be at least 30 base pairs in length for alignment. By default, the [!badge corners="pill" text="qc"](Workflows/qc.md) module removes reads <30bp. ### Compression Harpy generally doesn't require the input sequences to be in gzipped/bgzipped format, but it's good practice to compress your reads anyway. @@ -75,14 +75,14 @@ Compressed files are expected to end with the extension [!badge variant="success Unfortunately, there are many different ways of naming FASTQ files, which makes it difficult to accommodate every wacky iteration currently in circulation. While Harpy tries its best to be flexible, there are limitations.
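Before running, you can approximate the supported naming scheme with a quick shell check. This is a hedged sketch, not Harpy's actual parser: the regex and the directory `rawseq/` are illustrative, and the marker list mirrors the forward/reverse conventions documented on this page.

```bash
# Illustrative approximation of the supported naming scheme:
# <sample stem> + <forward/reverse marker> + <fastq extension> (+ optional .gz)
pattern='^[a-z0-9._-]+[._](f|r|1|2|r1|r2|r1_001|r2_001)\.(fq|fastq)(\.gz)?$'
for f in rawseq/*; do
  base=${f##*/}
  # lowercase the name before matching, since the conventions are case insensitive
  if [[ ! ${base,,} =~ $pattern ]]; then
    echo "may not parse cleanly: $f"
  fi
done
```

Anything the loop prints is worth renaming toward one of the supported styles before handing the files to Harpy.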
-To that end, for the [!badge corners="pill" text="demultiplex"](Modules/demultiplex.md), [!badge corners="pill" text="qc"](Modules/qc.md), and [!badge corners="pill" text="align"](Modules/Align/bwa.md) modules, the +To that end, for the [!badge corners="pill" text="demultiplex"](Workflows/demultiplex.md), [!badge corners="pill" text="qc"](Workflows/qc.md), and [!badge corners="pill" text="align"](Workflows/Align/bwa.md) modules, the most common FASTQ naming styles are supported: -- **sample names**: Alphanumeric and [!badge variant="success" text="."] [!badge variant="success" text="_"] [!badge variant="success" text="-"] +- **sample names**: [!badge variant="success" text="a-z"] [!badge variant="success" text="0-9"] [!badge variant="success" text="."] [!badge variant="success" text="_"] [!badge variant="success" text="-"] [!badge variant="secondary" text="case insensitive"] - you can mix and match special characters, but that's bad practice and not recommended - examples: `Sample.001`, `Sample_001_year4`, `Sample-001_population1.year2` <- not recommended - **forward**: [!badge variant="success" text="_F"] [!badge variant="success" text=".F"] [!badge variant="success" text="_1"] [!badge variant="success" text=".1"] [!badge variant="success" text="_R1_001"] [!badge variant="success" text=".R1_001"] [!badge variant="success" text="_R1"] [!badge variant="success" text=".R1"] - **reverse**: [!badge variant="success" text="_R"] [!badge variant="success" text=".R"] [!badge variant="success" text="_2"] [!badge variant="success" text=".2"] [!badge variant="success" text="_R2_001"] [!badge variant="success" text=".R2_001"] [!badge variant="success" text="_R2"] [!badge variant="success" text=".R2"] - **fastq extension**: [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="secondary" text="case insensitive"] - **gzipped**: supported and recommended - **not gzipped**: supported diff --git a/index.md b/index.md index dde6426a6..849097e38 100644 --- a/index.md +++ b/index.md @@ -17,19 +17,22 @@ Most of the settings are pre-configured and the settings you can modify are done ## Harpy Modules Harpy is modular, meaning you can use different parts of it independently of each other. Need to only align reads? -Great! Only want to call variants? Awesome! All modules are called by `harpy <module>`. For example, use `harpy align` to align reads. +Great! Only want to call variants? Awesome! All modules are called by `harpy <module>`. For example, use `harpy align` to align reads. -| Module | Description | -|:-------------------------------------------------------------------|:----------------------------------------------| -| [!badge corners="pill" text="align"](Modules/Align/Align.md) | Align sample sequences to a reference genome | -| [!badge corners="pill" text="demultiplex"](Modules/demultiplex.md) | Demultiplex haplotagged FASTQ files | -| [!badge corners="pill" text="impute"](Modules/impute.md) | Impute genotypes using variants and sequences | -| [!badge corners="pill" text="phase"](Modules/phase.md) | Phase SNPs into haplotypes | -| [!badge corners="pill" text="preflight"](Modules/preflight.md) | Run various format checks for FASTQ and BAM files | -| [!badge corners="pill" text="qc"](Modules/qc.md) | Remove adapters, deduplicate, and quality trim sequences | -| [!badge corners="pill" text="simulate"](Modules/Simulate/Simulate.md) | Simulate haplotag linked reads or genomic variants | -| [!badge corners="pill" text="snp"](Modules/snp.md) | Call SNPs and small indels | -| [!badge corners="pill" text="sv"](Modules/SV/SV.md) | Call large structural variants (inversions, deletions, duplications) | +{.compact} +| Workflow | Description | +| :---------------------------------------------------------------------- |
:------------------------------------------------------------------- | +| [!badge corners="pill" text="align"](Workflows/Align/Align.md) | Align sample sequences to a reference genome | +| [!badge corners="pill" text="assembly"](Workflows/assembly.md) | Create a genome assembly from linked-reads | +| [!badge corners="pill" text="demultiplex"](Workflows/demultiplex.md) | Demultiplex haplotagged FASTQ files | +| [!badge corners="pill" text="impute"](Workflows/impute.md) | Impute genotypes using variants and sequences | +| [!badge corners="pill" text="metassembly"](Workflows/metassembly.md) | Create a metagenome assembly from linked-reads | +| [!badge corners="pill" text="phase"](Workflows/phase.md) | Phase SNPs into haplotypes | +| [!badge corners="pill" text="preflight"](Workflows/preflight.md) | Run various format checks for FASTQ and BAM files | +| [!badge corners="pill" text="qc"](Workflows/qc.md) | Remove adapters, deduplicate, and quality trim sequences | +| [!badge corners="pill" text="simulate"](Workflows/Simulate/Simulate.md) | Simulate haplotag linked reads or genomic variants | +| [!badge corners="pill" text="snp"](Workflows/snp.md) | Call SNPs and small indels | +| [!badge corners="pill" text="sv"](Workflows/SV/SV.md) | Call large structural variants (inversions, deletions, duplications) | ## Using Harpy You can call `harpy` without any arguments (or with `--help`) to print the docstring to your terminal. You can likewise call any of the modules without arguments or with `--help` to see their usage (e.g. `harpy align --help`). @@ -46,7 +49,7 @@ You can call `harpy` without any arguments (or with `--help`) to print the docst │ --version Show the version and exit. │ │ --help -h Show this message and exit. 
│ ╰─────────────────────────────────────────────────────────────────────────╯ -╭─ Modules ───────────────────────────────────────────────────────────────╮ +╭─ workflows ─────────────────────────────────────────────────────────────╮ │ demultiplex Demultiplex haplotagged FASTQ files │ │ qc Remove adapters and quality-control sequences │ │ align Align sample sequences to a reference genome │ @@ -55,19 +58,23 @@ You can call `harpy` without any arguments (or with `--help`) to print the docst │ impute Impute genotypes using variants and alignments │ │ phase Phase SNPs into haplotypes │ │ simulate Simulate variants or linked-reads from a genome │ +│ assembly Create an assembly from linked-reads │ +│ metassembly Create a metassembly from linked-reads │ ╰─────────────────────────────────────────────────────────────────────────╯ ╭─ Other Commands ────────────────────────────────────────────────────────╮ -│ resume Resume a workflow from an existing Harpy directory │ -│ hpc Profile templates for cluster job submissions │ -│ preflight File format checks for haplotag data │ │ deconvolve Resolve clashing barcodes from different molecules │ +│ hpc Profile templates for cluster job submissions │ +│ imputeparams Create a template imputation parameter file │ │ popgroup Create a template grouping file for samples │ -│ stitchparams Create a template STITCH parameter file │ +│ preflight File format checks for haplotag data │ +│ resume Resume a workflow from an existing Harpy directory │ +│ view View a workflow log, config, or snakefile │ ╰─────────────────────────────────────────────────────────────────────────╯ ``` -## Linked-Read Workflow -Depending on your project goals, you may want any combination of SNPs, structural variants (inversions, deletions, duplications), or phased haplotypes. 
Below is a flow chart +## Typical Linked-Read Workflows +Depending on your project goals, you may want any combination of SNPs, structural +variants (inversions, deletions, duplications), or phased haplotypes. Below is a flow chart outlining a general workflow of linked-read data. ```mermaid @@ -79,5 +86,15 @@ graph LR SNP--->Phase([phase haplotypes]):::clean Align--->SV([call structural variants]):::clean + classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px +``` + +Alternatively, if you are interested in assembling a genome or metagenome, your workflow might look like: + +```mermaid +graph LR + QC([QC, trim adapters, etc.]):::clean--->DC([barcode deconvolution]):::clean + DC--->Assembly([assembly/metassembly]):::clean + classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px ``` \ No newline at end of file diff --git a/install.md b/install.md index 3270a4e8f..d3ee4a937 100644 --- a/install.md +++ b/install.md @@ -6,12 +6,12 @@ order: 100 ## :icon-desktop-download: Install Harpy === :icon-checklist: You will need -- [conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html) or [⚡mamba⚡](https://mamba.readthedocs.io/en/latest/installation.html) - - we [strongly recommend](issues#problem-installing-with-conda) you use mamba - - if using conda, replace `mamba` with `conda` in the instructions below +- [conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html) or [mamba](https://mamba.readthedocs.io/en/latest/installation.html) + - we [strongly recommend](troubleshooting#installation-issue) you use conda `23.10` (or later) or mamba + - if using mamba, replace `conda` with `mamba` in the instructions below === -Harpy is hosted on [Bioconda](https://anaconda.org/bioconda/harpy)! That means to install it, you just need to have `mamba` (or `conda`) on your Linux-based +Harpy is hosted on [Bioconda](https://anaconda.org/bioconda/harpy)! 
That means to install it, you just need to have `conda` (or `mamba`) on your Linux-based system and install it with a simple command. You can install Harpy into an existing environment or create a new one for it (recommended). ### install into a new environment @@ -19,31 +19,31 @@ system and install it with a simple command. You can install Harpy into an exist The code snippet below creates a new environment called `harpy` (the `-n harpy` part) and installs harpy into it from the bioconda channel (`-c bioconda` part). You can name this environment anything (e.g. `haplotagging`, `jeanclaudevandamme`, etc.). ```bash install harpy -mamba create -n harpy -c bioconda -c conda-forge harpy +conda create -n harpy -c bioconda -c conda-forge harpy ``` Once conda finishes the install, you can activate the environment (with whatever `-n <env_name>` you gave it) and `harpy` will be callable from the command line. ```bash activate harpy environment -mamba activate <env_name> +conda activate <env_name> ``` ### install into an existing environment -If you want to install harpy into an existing environment, then with an environment already activated (via `mamba activate <env_name>`) simply use the `mamba install` command and harpy +If you want to install harpy into an existing environment, then with an environment already activated (via `conda activate <env_name>`) simply use the `conda install` command and harpy will be callable from the command line.
```bash install harpy -mamba install -c bioconda -c conda-forge harpy +conda install -c bioconda -c conda-forge harpy ``` --- ## :icon-move-to-top: Update Harpy === :icon-checklist: You will need -- [conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html) or [⚡mamba⚡](https://mamba.readthedocs.io/en/latest/installation.html) +- [conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html) or [mamba](https://mamba.readthedocs.io/en/latest/installation.html) - Harpy installed into an existing conda environment === If you want to update Harpy, the process is quite similar: ```bash update harpy -mamba activate <env_name> -mamba update -c conda-forge bioconda::harpy -``` \ No newline at end of file +conda activate <env_name> +conda update -c conda-forge bioconda::harpy +``` diff --git a/issues.md b/issues.md deleted file mode 100644 index aa10108cd..000000000 --- a/issues.md +++ /dev/null @@ -1,42 +0,0 @@ ---- -label: Common Issues -icon: alert -order: 3 ---- - -Lots of stuff can go wrong during an analysis. The intent of this page is to highlight -common issues you may experience during analysis and ways to address these issues. - -## Problem installing with conda -Conda is an awesome package manager, but it's _slow_ and uses a ton of memory -as dependencies increase. Harpy has a lot of dependencies and you might stall -out conda trying to install it. Use mamba instead-- it'll work where conda fails. - - -## Failures during imputation or phasing -If you use `bamutils clipOverlap` on alignments that are used for the [!badge corners="pill" text="impute"](Modules/impute.md) or -[!badge corners="pill" text="phase"](Modules/phase.md) modules, they will cause both programs to error. We don't know why, but they do. - -**Solution**: Do not clip overlapping alignments for bam files you intend to use for -the [!badge corners="pill" text="impute"](Modules/impute.md) or -[!badge corners="pill" text="phase"](Modules/phase.md) modules.
Harpy does not clip overlapping alignments, so -alignments produced by Harpy should work just fine. - -## Alignment file name and ID: tag mismatch -Aligning a sample to a genome via Harpy will insert the sample name (based on the file name) -into the alignment header (the `@RG ID:name SM:name` tag). It likewise expects, through various steps, -that the sample names in resulting vcf files match the filenames of associated bam files. This creates -problems when manually renaming alignment files after the creation of any vcf files. If you rename the -bam file, the alignments will still have the original sample name hardcoded into the file header. -Harpy will check for this and will preemtively warn you of a mismatch between file name and encoded -sample name. Due to certain expectations of the workflow, this mismatch will absolutely cause things -to fail, hence the pre-flight check. - -**Solution**: If you need to rename a bam file, do so using the `renamebam` script bundled with Harpy, which is a just a thin veneer over `samtools addreplacerg` with some extra validations. -```bash -renamebam input.bam newname -``` -Call the script with no arguments to see the full usage instructions. - - -**More cases will be added here as they become apparent to us** \ No newline at end of file diff --git a/retype.yml b/retype.yml index 929a71bc0..f1010eacf 100644 --- a/retype.yml +++ b/retype.yml @@ -3,6 +3,21 @@ output: .retype url: pdimens.github.io/harpy/ meta: title: " | Harpy haplotag" +footer: + copyright: "© Copyright {{ year }}. All rights reserved." 
+branding: + title: Harpy + label: v1.13 + logo: static/logo_navbar.png + logoDark: static/logo_navbar.png + logoAlign: left +favicon: static/favicon.png +search: + mode: partial +edit: + repo: "https://github.com/pdimens/harpy/edit" + base: / + branch: "docs" links: - text: Source Code link: https://github.com/pdimens/harpy @@ -19,26 +34,13 @@ links: - text: Guides & Tutorials link: /blog icon: code-of-conduct - target: blank - text: Submit an Issue link: https://github.com/pdimens/harpy/issues/new/choose icon: bug target: blank -footer: - copyright: "© Copyright {{ year }}. All rights reserved." -branding: - title: Harpy - label: v1.5 - logo: static/logo_navbar.png - logoDark: static/logo_navbar.png - logoAlign: left -favicon: static/favicon.png -search: - mode: partial -edit: - repo: "https://github.com/pdimens/harpy/edit" - base: / - branch: "docs" +- text: Development + link: /development.md + icon: tools exclude: - "src/" - "test/" \ No newline at end of file diff --git a/snakemake.md b/snakemake.md index 4272e9a25..b9aff8ae7 100644 --- a/snakemake.md +++ b/snakemake.md @@ -1,33 +1,37 @@ --- label: Snakemake Things -icon: terminal +icon: git-merge order: 2 --- -# :icon-terminal: Snakamake Things +# :icon-git-merge: Snakemake Things ## Workflow logs Barring a few exceptions, most of Harpy's options are Snakemake workflows. This means we are all at the mercy of how Snakemake operates, which includes the `.snakemake/` folder in your project directory. That folder contains all sorts of things necessary for Snakemake to do its magic. However, as a convenience, Harpy workflows will also create a copy of the Snakemake -workflow log (all those things that print on screen when Harpy is running) +workflow log (all those things that print on screen when Snakemake is running) in a workflow's output directory.
These logs are found in `OUTDIR/logs/snakemake` -and are named `workflow.runX.DATE.snakelog`, where `workflow` is the harpy workflow -(qc, sv_naibr, etc.), `runX` is the attempt number (given by `X`, e.g. `run4`), and +and are named `workflow.X.DATE.log.gz`, where `workflow` is the harpy workflow +(qc, sv_naibr, etc.), `X` is the attempt number (e.g. `4`), and `DATE` is given as `MONTH_DAY_YEAR`. As an example, using the default -settings of `harpy qc`, you will find a copy of your workflow's log in -`QC/logs/snakemake/qc.run1.5_13_2024.snakelog`. The name of this log will -**not** match the official log Snakemake creates in `.snakemake/logs`, but -the contents will be identical. This naming convention exists to allow multiple -runs in a single output directory, which usually happens because something went wrong. +settings of `harpy qc`, you will find your workflow's log in +`QC/logs/snakemake/qc.1.5_13_2024.log.gz`. + +!!!ghost Design choices +Snakemake prints **a lot** of text during runtime. For some workflows, the resulting log files +could occupy >1 gigabyte of hard drive space 😱. For this reason, Harpy gzip compresses the log file +and deletes the original Snakemake logfile that you would expect to find in `.snakemake/logs/`. +!!! ## Adding Snakemake Parameters Harpy relies on Snakemake under the hood to handle file and job dependencies. -Most of these details have been abstracted away from the end-user, but every -module of Harpy (except `popgroup`, and `stitchparams`) has an optional flag `--snakemake` -that you can use to augment the Snakemake workflow if necessary. Whenever you -use this flag, your argument must be enclosed in quotation marks, for example: +Most of these details have been abstracted away from the end-user, but most +Harpy modules have an optional flag `--snakemake` that you can use to augment +the Snakemake workflow if necessary.
[The full list of Snakemake command line +options can be found here](https://snakemake.readthedocs.io/en/stable/executing/cli.html). +Whenever you use this flag, your argument must be enclosed in quotation marks, for example: ```bash harpy qc --snakemake "--dry-run" rawseq ``` @@ -36,36 +40,10 @@ This means you can add several Snakemake arguments at once, as long as the entir harpy qc --snakemake "--dry-run --debug --shadow-prefix /scratch" rawseq ``` -!!!danger reserved/forbidden arguments -Harpy calls Snakemake with a given set of arguments, meaning you cannot append -these again to the internal command line call. Well, you can, but Snakemake will -complain or behave unexpectedly. [Everything else](https://snakemake.readthedocs.io/en/stable/executing/cli.html#all-options) -is allowed. The reserved (**forbidden**) arguments are: -- `--conda-prefix` -- `--configfile` -- `--cores` -- `--directory` -- `--nolock` -- `--rerun-incomplete` -- `--rerun-triggers` -- `--show-failed-logs` -- `--software-deployment-method` -- `--snakefile` -!!! - ### Common use cases You likely won't need to invoke `--snakemake` very often, if ever. However, here are examples of some possible use cases for this parameter. -==- Dry run -##### `--dry-run` -This is a directive in which Snakemake will build the DAG and "pretend" to -run the Harpy workflow. Useful for knowing what you're getting yourself into -ahead of time. It's also useful for debugging during development. Here is an -example of dry-running variant calling: -```bash -harpy snp mpileup -g genome.fasta --snakemake "--dry-run" Align/ema -``` ==- Specific file target Sometimes you want to generate a specific intermediate file (or files) rather than running the entire module to completion. For example, you want the beadtag report Harpy makes from the output of `EMA count`. To do this, just list the file/files (relative @@ -73,7 +51,7 @@ to your working directory) without any flags.
Example for the beadtag report: ```bash harpy align bwa -g genome.fasta -t 4 --snakemake "Align/ema/reports/bxstats.html" QC/ ``` -This of course necessitates knowing the names of the files ahead of time. See the individual modules for a breakdown of expected outputs. +This of course necessitates knowing the names of the files ahead of time. See the individual workflows for a breakdown of expected outputs. ==- Set a shadow directory ##### `--shadow-prefix ` diff --git a/software.md b/software.md index 9a5db987e..d03571623 100644 --- a/software.md +++ b/software.md @@ -14,6 +14,8 @@ Issues with specific tools might warrant a discussion with the authors/developer {.compact} | Software | Links | Publication | |:------------|:-------------------------| :-------------------------------------------------------------------------------------------| +| ARCS suite | [tigmint](https://github.com/bcgsc/tigmint), [arcs](https://github.com/bcgsc/arcs), [links](https://github.com/bcgsc/links) | [tigmint](https://doi.org/10.1186/s12859-018-2425-6), [arcs](https://doi.org/10.1101/100750), [links](https://gigascience.biomedcentral.com/articles/10.1186/s13742-015-0076-3) | +| athena | [github](https://github.com/abishara/athena_meta) | [publication](https://doi.org/10.1038/nbt.4266) | | bash | [website](https://www.gnu.org/software/bash/) | | bcftools | [github](https://github.com/samtools/bcftools), [website](https://samtools.github.io/bcftools/bcftools.html) | | | bgzip | [website](http://www.htslib.org/doc/bgzip.html) | | @@ -35,6 +37,7 @@ Issues with specific tools might warrant a discussion with the authors/developer | seqtk | [github](https://github.com/lh3/seqtk) | | | simuG | [github](https://github.com/aquaskyline/LRSIM) | [publication](https://doi.org/10.1093/bioinformatics/btz424) | | Snakemake | [github](https://github.com/snakemake/snakemake)| [publication](https://f1000research.com/articles/10-33/v1) | +| spades | [spades](http://ablab.github.io/spades/), 
[cloudspades](https://github.com/ablab/spades/tree/cloudspades-ismb) | [spades](https://doi.org/10.1002/cpbi.102), [cloudspades](https://doi.org/10.1093/bioinformatics/btz349) | | strobealign | [github](https://github.com/ksahlin/strobealign)| [publication](https://doi.org/10.1186/s13059-022-02831-7) | | whatshap | [github](https://github.com/whatshap/whatshap) |[publication](https://doi.org/10.1101/085050) | @@ -45,7 +48,6 @@ Issues with specific tools might warrant a discussion with the authors/developer | click | python | [github](https://github.com/pallets/click) | | | pysam | python | [github](https://github.com/pysam-developers/pysam) | | | r-biocircos | R | [github](https://github.com/lvulliard/BioCircos.R) | | -| r-circlize | R | [github](https://github.com/jokergoo/circlize) | [publication](https://doi.org/10.1093/bioinformatics/btu393) | | r-highcharter | R | [website (source)](https://www.highcharts.com/) [website (R-package)](https://github.com/jbkunst/highcharter/) | | | r-tidyverse | R | [website](https://www.tidyverse.org/) | [publication](https://doi.org/10.21105/joss.01686) | | r-DT | R | [website](https://rstudio.github.io/DT/), [js-website](http://datatables.net) | | diff --git a/static/assembly_multiqc.png b/static/assembly_multiqc.png new file mode 100644 index 000000000..14a1012fc Binary files /dev/null and b/static/assembly_multiqc.png differ diff --git a/static/assembly_quast.png b/static/assembly_quast.png new file mode 100644 index 000000000..6000c83ab Binary files /dev/null and b/static/assembly_quast.png differ diff --git a/static/error_text.png b/static/error_text.png new file mode 100644 index 000000000..14a360c5c Binary files /dev/null and b/static/error_text.png differ diff --git a/static/errormsg.png b/static/errormsg.png index 8c52812e8..358815920 100644 Binary files a/static/errormsg.png and b/static/errormsg.png differ diff --git a/static/lr_conversion.png b/static/lr_conversion.png index 009399333..e930f3f1c 100644 Binary files 
a/static/lr_conversion.png and b/static/lr_conversion.png differ diff --git a/static/lr_conversion.svg b/static/lr_conversion.svg index cb64466f0..0f518482b 100644 --- a/static/lr_conversion.svg +++ b/static/lr_conversion.svg @@ -27,14 +27,14 @@ inkscape:deskcolor="#d1d1d1" inkscape:document-units="mm" inkscape:zoom="1.2094621" - inkscape:cx="422.50187" - inkscape:cy="212.49116" + inkscape:cx="431.18342" + inkscape:cy="132.29021" inkscape:window-width="1920" - inkscape:window-height="1052" + inkscape:window-height="1020" inkscape:window-x="0" inkscape:window-y="0" inkscape:window-maximized="1" - inkscape:current-layer="layer1" + inkscape:current-layer="g9" inkscape:export-bgcolor="#ffffffff" />Forward readForward read@seq_id/1 TX:Z: BX:Z:@seq_id/1 OX:Z: BX:Z:@seq_id/2 TX:Z: BX:Z:@seq_id/2 OX:Z: BX:Z:barcodesATACGAGTACAGATGGAGGATACBDACBD10X10XATCG + sodipodi:nodetypes="ccc" />ATCG diff --git a/static/metassembly_quast.png b/static/metassembly_quast.png new file mode 100644 index 000000000..19a2ee684 Binary files /dev/null and b/static/metassembly_quast.png differ diff --git a/static/report_align_bxmol.png b/static/report_align_bxmol.png index 90f91a2c7..da31cf3dc 100644 Binary files a/static/report_align_bxmol.png and b/static/report_align_bxmol.png differ diff --git a/static/report_align_bxstats.png b/static/report_align_bxstats.png deleted file mode 100644 index c869d6713..000000000 Binary files a/static/report_align_bxstats.png and /dev/null differ diff --git a/static/report_align_coverage.png b/static/report_align_coverage.png index 780b8cfa3..cd191d1e4 100644 Binary files a/static/report_align_coverage.png and b/static/report_align_coverage.png differ diff --git a/static/report_align_flagstat.png b/static/report_align_flagstat.png index c982f7169..23a80e473 100644 Binary files a/static/report_align_flagstat.png and b/static/report_align_flagstat.png differ diff --git a/static/report_impute.png b/static/report_impute.png index c37052798..4d5e0fe59 100644 Binary 
files a/static/report_impute.png and b/static/report_impute.png differ diff --git a/static/report_phase.png b/static/report_phase.png index 5abe11cd9..acc687ecd 100644 Binary files a/static/report_phase.png and b/static/report_phase.png differ diff --git a/static/report_phase2.png b/static/report_phase2.png deleted file mode 100644 index ca60f1b04..000000000 Binary files a/static/report_phase2.png and /dev/null differ diff --git a/static/report_qc_bx.png b/static/report_qc_bx.png new file mode 100644 index 000000000..467c3a1f2 Binary files /dev/null and b/static/report_qc_bx.png differ diff --git a/static/troubleshoot_cli.png b/static/troubleshoot_cli.png new file mode 100644 index 000000000..de5fc1649 Binary files /dev/null and b/static/troubleshoot_cli.png differ diff --git a/static/workflow_call.png b/static/workflow_call.png new file mode 100644 index 000000000..51acdade1 Binary files /dev/null and b/static/workflow_call.png differ diff --git a/troubleshooting.md b/troubleshooting.md new file mode 100644 index 000000000..f7e186fff --- /dev/null +++ b/troubleshooting.md @@ -0,0 +1,84 @@ +--- +label: Troubleshooting +icon: alert +order: 3 +--- + +Lots of stuff can go wrong during an analysis. The intent of this page is to guide you through +navigating the inevitable errors associated with doing bioinformatics. + +## Troubleshooting Harpy +Harpy has two steps: first it performs checks and validations, then it runs Snakemake. + +### checks and validations +First, Harpy takes your command-line inputs and checks/validates the input files and parameters. +If your parameters are not the correct type (e.g. a number where there should be a file), the +program will error immediately and inform you that something isn't right. +![Harpy option validations](/static/troubleshoot_cli.png) + +If all the command-line options pass validation, then the workflow inputs +will be validated with all kinds of scripts and checks internally. 
For +example, an input genome FASTA will be checked to be a properly formatted +FASTA file. We do our best to include an error message that guides you +on how to fix things and try again. If there are errors you're receiving +that you think could use better wording or more information, [please let us know](https://github.com/pdimens/harpy/issues/new?assignees=&labels=enhancement&projects=&template=feature_request.yml)! +![Harpy file validations](/static/errormsg.png) + +### snakemake validations +Once all the file validations pass, Harpy passes the baton over to +Snakemake. Snakemake builds a workflow graph of the rules and performs +its own checks. If you get an error before the workflow starts processing any data (there +won't yet be a Snakemake log file created), then something is wrong with +the Snakefile. Harpy may print the error to the terminal, but it's +possible you won't see any Snakemake error text (let us know!). If no +helpful error text is printed, then you should run the Snakemake command +directly and explore the output. You can copy and paste the Snakemake +command from the `config.yaml` file created by Harpy, specifically listed +with `workflow_call:`. If it seems like something on our end, please +[open an issue](https://github.com/pdimens/harpy/issues/new?assignees=&labels=bug&projects=&template=bug_report.yml) +and include the error text Snakemake provides. +![copy and paste this command directly to see Snakemake error text](/static/workflow_call.png) + +### error during a workflow +Sometimes something goes wrong with one of the steps in a workflow. If/when +that happens, Harpy will print the offending step and all the information +Snakemake has regarding the failure. If the step had a log file, it will +print the log information too, hopefully making it easier to figure out +what's wrong. 
+![error during a workflow](/static/error_text.png) + +--- +## Common Issues +### installation issue +Conda is an awesome package manager, but _was_ slow and used a ton of memory +as dependencies increased. Recent (`23.10+`) versions of Conda [now use the `libmamba` solver](https://www.anaconda.com/blog/a-faster-conda-for-a-growing-community), +the super-fast and super-lightweight solver from Mamba. If you're experiencing +a suspiciously slow Harpy installation, either update Conda to at least version `23.10` or use Mamba. + +### imputation or phasing failure +If you use `bamutils clipOverlap` on alignments that are used for the [!badge corners="pill" text="impute"](Workflows/impute.md) or +[!badge corners="pill" text="phase"](Workflows/phase.md) modules, they will cause both programs to error. We don't know why, but they do. + +**Solution**: Do not clip overlapping alignments for bam files you intend to use for +the [!badge corners="pill" text="impute"](Workflows/impute.md) or +[!badge corners="pill" text="phase"](Workflows/phase.md) modules. Harpy does not clip overlapping alignments, so +alignments produced by Harpy should work just fine. + +### SAM name and ID mismatch +Aligning a sample to a genome via Harpy will insert the sample name (based on the file name) +into the alignment header (the `@RG ID:name SM:name` tag). It likewise expects, through various steps, +that the sample names in resulting vcf files match the filenames of associated bam files. This creates +problems when manually renaming alignment files after the creation of any vcf files. If you rename the +bam file, the alignments will still have the original sample name hardcoded into the file header. +Harpy will check for this and will preemptively warn you of a mismatch between file name and encoded +sample name. Due to certain expectations of the workflow, this mismatch will absolutely cause things +to fail, hence the pre-flight check.
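You can inspect the encoded sample name yourself with `samtools view -H input.bam`. The comparison Harpy makes can be sketched in Python; the header line and file names below are hypothetical examples for illustration, not output from a real run:

```python
from pathlib import Path

def rg_matches_filename(bam_path: str, rg_line: str) -> bool:
    """Return True if the @RG SM: field matches the BAM file's basename."""
    # Parse a header line like "@RG\tID:sample1\tSM:sample1" into {tag: value}
    fields = dict(tag.split(":", 1) for tag in rg_line.strip().split("\t")[1:])
    return fields.get("SM") == Path(bam_path).stem

header = "@RG\tID:sample1\tSM:sample1"
print(rg_matches_filename("sample1.bam", header))  # name and SM agree
print(rg_matches_filename("renamed.bam", header))  # the mismatch Harpy warns about
```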
+ +**Solution**: If you need to rename a bam file, do so using the [rename_bam](utilities.md#rename_bam) script bundled with Harpy, which is just a thin veneer over `samtools addreplacerg` with some extra validations. +```bash +rename_bam newname input.bam +``` +Call the script with no arguments to see the full usage instructions. + + +**More cases will be added here as they become apparent to us** diff --git a/utilities.md b/utilities.md new file mode 100644 index 000000000..ea5f2696a --- /dev/null +++ b/utilities.md @@ -0,0 +1,155 @@ +--- +label: Utilities +icon: terminal +order: 2 +--- + +# :icon-terminal: Utilities +Harpy is the sum of its parts and some of those parts are [stand-alone scripts](https://github.com/pdimens/harpy/tree/main/harpy/bin) +used by the workflows and accessible from within the Harpy conda environment. +This page serves to document those scripts, since using them outside of a workflow +might be useful too. You can call up the docstring for any one of these utilities +by calling the program without any arguments. + +### assign_mi.py +```bash +assign_mi.py -c cutoff -o output.bam input.bam +``` +Assign an `MI:i` (Molecular Identifier) tag to each barcoded +record based on a molecular distance cutoff. Input file **must be coordinate sorted**. +This is similar to [deconvolve_alignments.py](#deconvolve_alignmentspy), except it does not record the deconvolution in the `BX` tag. +- unmapped records are discarded +- records without a `BX:Z` tag or with an invalid barcode (`00` as one of its segments) are preserved but are not assigned an `MI:i` tag + +### bx_stats.py +```bash +bx_stats.py -o output.gz input.bam +``` +Calculates various linked-read molecule metrics from the (coordinate-sorted) input alignment file.
+Metrics include (per molecule): +- number of reads +- position start +- position end +- length of molecule inferred from alignments +- total aligned basepairs +- total length of inferred inserts +- molecule coverage (%) based on aligned bases +- molecule coverage (%) based on total inferred insert length + +### check_bam.py +```bash +check_bam.py input.bam > output.txt +``` +Parses an alignment file to check: +- if the sample name matches the `RG` tag +- whether `BX:Z` is the last tag in the record +- the counts of: + - total alignments + - alignments with an `MI:i` tag + - alignments without `BX:Z` tag + - incorrect `BX:Z` tag + +### check_fastq.py +```bash +check_fastq.py input.fastq > output.txt +``` +Parses a FASTQ file to check if any sequences don't conform to the SAM spec, +whether `BX:Z` is the last tag in the record, and the counts of: +- total reads +- reads without `BX:Z` tag +- reads with incorrect `BX:Z` tag + +### concatenate_bam.py +```bash +concatenate_bam.py [--bx] -o output.bam file_1.bam file_2.bam...file_N.bam +# or # +concatenate_bam.py [--bx] -o output.bam -b bam_files.txt +``` +Concatenate records from haplotagged SAM/BAM files while making sure `MI` tags remain unique for every sample. +This is a means of accomplishing the same as `samtools cat`, except all `MI` tags are updated +so individuals don't have overlapping `MI` tags (which would mess up all the linked-read data). You can either provide +all the files you want to concatenate, or a single file listing the filenames with the `-b` option. Use the `--bx` option +to also rewrite `BX` tags such that they are unique for every individual too, although take note that there can only be +$96^4$ (84,934,656) unique haplotag barcodes and it will raise an error if that number is exceeded.
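The $96^4$ ceiling follows from each of the four segments (A, C, B, D) of a Gen I haplotag barcode taking one of 96 values. A quick sanity check of that arithmetic:

```python
# Each of the 4 haplotag segments (A, C, B, D) has 96 possible values
unique_barcodes = 96 ** 4
print(unique_barcodes)  # 84934656
```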
+ +### count_bx.py +```bash +count_bx.py input.fastq > output.txt +``` +Parses a FASTQ file to count: +- total sequences +- total number of `BX` tags +- number of valid haplotagging `BX` tags +- number of invalid `BX` tags +- number of invalid `BX` tag segments (i.e. `A00`, `C00`, `B00`, `D00`). + +### deconvolve_alignments.py +```bash +deconvolve_alignments.py -c cutoff -o output.bam input.bam +``` +Deconvolve BX-tagged barcodes and assign an `MI` (Molecular Identifier) tag to each barcoded record based on a molecular distance cutoff. +Input file **must be coordinate sorted**. This is similar to [assign_mi.py](#assign_mipy), except it will also deconvolve the `BX` tag by +hyphenating it with an integer (e.g. `A01C25B31D92-2`). +- unmapped records are discarded +- records without a `BX` tag or with an invalid barcode (`00` as one of its segments) are preserved but are not assigned an `MI` tag + +### depth_windows.py +```bash +samtools depth -a file.bam | depth_windows.py windowsize > output.txt +``` +Reads the output of `samtools depth -a` from stdin and calculates means within windows of a given `windowsize`. + +### haplotag_acbd.py +```bash +haplotag_acbd.py output_directory +``` +Generates the `BC_{ABCD}.txt` files necessary to demultiplex Gen I haplotag barcodes into the specified `output_directory`. + +### infer_sv.py +```bash +infer_sv.py file.bedpe [-f fail.bedpe] > outfile.bedpe +``` +Creates a column in the NAIBR bedpe output inferring the SV type from the orientation. Removes variants with FAIL flags +and you can use the optional `-f` (`--fail`) argument to output FAIL variants to a separate file. + +### inline_to_haplotag.py +```bash +inline_to_haplotag.py -f -r -b -p > barcodes.conversion.txt +``` +Converts inline nucleotide barcodes in reads to haplotag linked reads with barcodes in `BX:Z` and `OX:Z` header tags.
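To make the windowed averaging of `depth_windows.py` (above) concrete, here is an illustrative Python sketch. It is not the script itself; it assumes consecutive 1-based positions on a single contig, which is what `samtools depth -a` emits:

```python
def window_means(depth_lines, windowsize):
    """Average the depth column of `samtools depth -a` output in fixed windows.

    Assumes one contig and consecutive 1-based positions, as produced by -a.
    Returns a list of (window_start, mean_depth) tuples.
    """
    window, start, out = [], 1, []
    for line in depth_lines:
        _chrom, _pos, depth = line.split()
        window.append(int(depth))
        if len(window) == windowsize:  # window full: emit its mean
            out.append((start, sum(window) / windowsize))
            start += windowsize
            window = []
    if window:  # partial final window
        out.append((start, sum(window) / len(window)))
    return out

rows = ["chr1\t1\t4", "chr1\t2\t6", "chr1\t3\t8", "chr1\t4\t2"]
print(window_means(rows, 2))  # [(1, 5.0), (3, 5.0)]
```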
+ +### make_windows.py +```bash +make_windows.py -w -m <0,1> input.fasta[.fai] > output.bed +``` +Create a BED file of fixed intervals (`-w`, `--window`) from a FASTA or fai file (the kind generated with `samtools faidx`). +Nearly identical to `bedtools makewindows`, except the intervals are nonoverlapping. The `-m` (`--mode`) option specifies +whether indexing starts at `0` or `1`. + +### molecule_coverage.py +```bash +molecule_coverage.py -f genome.fasta.fai statsfile > output.cov +``` +Using the statsfile generated by `bx_stats.py` from Harpy, calculates "molecular coverage" across the genome. +Molecular coverage is the "effective" alignment coverage if you treat a molecule inferred from linked-read data as +one contiguous alignment, even though the reads that make up that molecule don't cover its entire length. Requires a +FASTA fai index (the kind created with `samtools faidx`) to know the actual sizes of the contigs. + +### parse_phaseblocks.py +```bash +parse_phaseblocks.py input > output.txt +``` +Parse a phase block file from HapCut2 to pull out summary information. + +### rename_bam +```bash +rename_bam.py [-d] new_name input.bam +``` +Rename a sam/bam file and modify the `@RG` tag of the alignment file to reflect the change for both `ID` and `SM`. +This process creates a new file `new_name.bam` and you may use `-d` to delete the original file. Requires `samtools`. + +### separate_validbx +```bash +separate_validbx input.bam > valid.bam 2> invalid.bam +``` +Split a BAM file with `BX` tags into 2 files, one with valid ACBD barcodes (`stdout`), one with invalid ACBD barcodes (`stderr`).
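Several of the utilities above (`count_bx.py`, `deconvolve_alignments.py`, `separate_validbx`) share the same notion of barcode validity: a `00` in any segment marks the barcode as invalid. A minimal sketch of that rule for an undeconvolved Gen I barcode (hyphenated suffixes like `-2` from deconvolution are deliberately not handled here):

```python
import re

def is_valid_haplotag(bx: str) -> bool:
    """True if an AxxCxxBxxDxx barcode has no 00 (missing) segments."""
    match = re.fullmatch(r"A(\d{2})C(\d{2})B(\d{2})D(\d{2})", bx)
    return bool(match) and "00" not in match.groups()

print(is_valid_haplotag("A01C25B31D92"))  # valid
print(is_valid_haplotag("A00C25B31D92"))  # invalid: the A segment is 00
```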