diff --git a/404.html b/404.html index 21cc3d88d..d328ae8ec 100644 --- a/404.html +++ b/404.html @@ -4,7 +4,7 @@ - + @@ -29,11 +29,11 @@ - + - + - +
diff --git a/commonoptions/index.html b/commonoptions/index.html index f5f38c854..7796b2034 100644 --- a/commonoptions/index.html +++ b/commonoptions/index.html @@ -4,7 +4,7 @@ - + @@ -32,12 +32,12 @@ - + - + - - + +
diff --git a/development/index.html b/development/index.html index 440559be2..f45c494ae 100644 --- a/development/index.html +++ b/development/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + +
@@ -335,7 +335,7 @@

not all of it can be installed as a pure python program using setuptools. The build process installs part of Harpy as a pure-python command line program, but the extra files Harpy needs at runtime must be installed separately. All of
-this is handled by misc/buildlocal.sh. It's a little circuitous, but it's how
+this is handled by resources/buildlocal.sh. It's a little circuitous, but it's how
we can keep the source code modular, installable, and have the flexibility of using non-python code.

@@ -347,13 +347,12 @@

For the ease of installation for end-users, Harpy has a recipe and build script in Bioconda, which makes it available for download and installation. A copy of the recipe and
-build script is also stored in misc/meta.yml and misc/build.sh. The yaml file
+build script is also stored in resources/meta.yml and resources/build.sh. The yaml file
is the metadata of the package, including software deps and their versions. The
-build script is how conda will install all of Harpy's parts. In order to modify
-these files for a new release, you need to fork bioconda/bioconda-recipes,
-create a new branch, and modify the Harpy meta.yml (and possibly build.sh) files. Bioconda
-has a bot that looks for changes to the version number in the meta.yml file
-and will automatically submit a Pull Request when it notices that has changed.

+build script is how conda will install all of Harpy's parts. Now that Harpy is hosted on
+Bioconda, when a new version is tagged in a release, Bioconda will automatically open
+a pull request (after a delay), typically requiring no intervention on the development side
+before the newest Harpy version becomes available for conda installation.

# @@ -422,12 +421,18 @@

  • add and modify code with your typical coding workflow, pushing your changes to your Harpy fork
  • when it's ready for inclusion into Harpy (and testing), create a Pull Request to merge your changes into the Harpy dev branch
  • - +

    - # - Testing and CI + # + Automations

    + +

    + # + Testing +

    +

    CI (Continuous Integration) is a term describing automated tasks that run on your code, triggered by how you interact with a repository. Harpy has a series of GitHub Actions triggered by interactions with the dev branch (in .github/workflows)
@@ -438,6 +443,25 @@

    could not run successfully to completion. These tests do not check for accuracy; they check for breaking behavior. You'd be shocked to find out how many errors crop up this way and require further development so Harpy can be resilient to more use cases.
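A "runs to completion" smoke test of this kind can be sketched as a small shell harness. The checks below are stand-ins, not Harpy's actual CI invocations — each one only cares about the exit status, never the accuracy of results:

```shell
#!/bin/sh
# Minimal sketch of a smoke-test harness in the spirit of the CI described
# above: run a command, record PASS/FAIL from its exit status alone.
# The commands checked here are placeholders for real module runs.
fail=0

run_check() {
    desc="$1"; shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS: $desc"
    else
        echo "FAIL: $desc"
        fail=1
    fi
}

# In real CI, each line would invoke a module on bundled test data,
# e.g. something like: run_check "qc module" harpy qc testdata/
run_check "command exits cleanly" true
run_check "pipeline step completes" sh -c 'exit 0'

exit "$fail"
```

Because only exit codes are inspected, this style of test catches crashes and broken invocations cheaply, which is exactly the failure mode the workflow tests target.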

    + +

    + # + Releases +

    +
    +

    To save on disk space, there is an automation that strips out unnecessary files and uploads a
+cleaned tarball to the new release. This is triggered automatically when a new version is tagged.
+Tagging is easily accomplished with Git commands on the command line:

    +
    +
    # make sure you're on the main branch
    +$ git checkout main
    +
    +# create the tag locally, where X.X.X is something like 1.7.1
    +$ git tag X.X.X
    +
    +# push the new tag to the repository
    +$ git push origin X.X.X
    +
    diff --git a/haplotagdata/index.html b/haplotagdata/index.html index 5b2190491..128415f05 100644 --- a/haplotagdata/index.html +++ b/haplotagdata/index.html @@ -4,7 +4,7 @@ - + @@ -34,11 +34,11 @@ - + - + - + diff --git a/index.html b/index.html index f102bb70a..60540e8df 100644 --- a/index.html +++ b/index.html @@ -4,7 +4,7 @@ - + @@ -34,11 +34,11 @@ - + - + - +
    diff --git a/install/index.html b/install/index.html index 47233aa1a..72841cdfa 100644 --- a/install/index.html +++ b/install/index.html @@ -4,7 +4,7 @@ - + @@ -32,12 +32,12 @@ - + - + - - + +
    diff --git a/issues/index.html b/issues/index.html index 05f20e917..b42f8d432 100644 --- a/issues/index.html +++ b/issues/index.html @@ -4,7 +4,7 @@ - + @@ -32,12 +32,12 @@ - + - + - - + +
    diff --git a/modules/align/bwa/index.html b/modules/align/bwa/index.html index 8da13ac26..e785f5e89 100644 --- a/modules/align/bwa/index.html +++ b/modules/align/bwa/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + + diff --git a/modules/align/ema/index.html b/modules/align/ema/index.html index d1333f69d..658a69846 100644 --- a/modules/align/ema/index.html +++ b/modules/align/ema/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + + diff --git a/modules/demultiplex/index.html b/modules/demultiplex/index.html index 571451e91..a6923f720 100644 --- a/modules/demultiplex/index.html +++ b/modules/demultiplex/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + + diff --git a/modules/impute/index.html b/modules/impute/index.html index e1d6e1f15..56f8e416d 100644 --- a/modules/impute/index.html +++ b/modules/impute/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + + diff --git a/modules/othermodules/index.html b/modules/othermodules/index.html index 21ea63f9a..3a8f665a6 100644 --- a/modules/othermodules/index.html +++ b/modules/othermodules/index.html @@ -4,7 +4,7 @@ - + @@ -32,12 +32,12 @@ - + - + - - + +
    diff --git a/modules/phase/index.html b/modules/phase/index.html index cd5136902..1b350c87e 100644 --- a/modules/phase/index.html +++ b/modules/phase/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + + diff --git a/modules/preflight/index.html b/modules/preflight/index.html index 7e88dfbf0..9fec912dc 100644 --- a/modules/preflight/index.html +++ b/modules/preflight/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + +
    diff --git a/modules/qc/index.html b/modules/qc/index.html index 6f0167020..5cf1aaefa 100644 --- a/modules/qc/index.html +++ b/modules/qc/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + + diff --git a/modules/snp/index.html b/modules/snp/index.html index eb135e784..ab74f8c6e 100644 --- a/modules/snp/index.html +++ b/modules/snp/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + + diff --git a/modules/sv/leviathan/index.html b/modules/sv/leviathan/index.html index 9633db6be..265f1798d 100644 --- a/modules/sv/leviathan/index.html +++ b/modules/sv/leviathan/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + + diff --git a/modules/sv/naibr/index.html b/modules/sv/naibr/index.html index e6990c14c..16e4a099c 100644 --- a/modules/sv/naibr/index.html +++ b/modules/sv/naibr/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + + diff --git a/resources/js/config.js b/resources/js/config.js index aa9db1875..8f178984e 100644 --- a/resources/js/config.js +++ b/resources/js/config.js @@ -1 +1 @@ -var __DOCS_CONFIG__ = {"id":"nwgcjYi9zGQBXMdwLyC215BPnT325aSJym+","key":"LztNliMBtal2zfan24ECw6+ix/OfiA9KIp+VZWOtEQU.kJ4B+qpFf/C22+nO0hUAVTTErGzmflMWvnjHc+W/PQZWJo1eJE2VutaVTQyESoIgkK7icLwmKwOd217Ce8nlyg.127","base":"/HARPY/","host":"pdimens.github.io","version":"1.0.0","useRelativePaths":true,"documentName":"index.html","appendDocumentName":false,"trailingSlash":true,"preloadSearch":false,"cacheBustingToken":"3.5.0.761763982618","cacheBustingStrategy":"query","sidebarFilterPlaceholder":"Filter","toolbarFilterPlaceholder":"Filter","showSidebarFilter":true,"filterNotFoundMsg":"No member names found containing the query 
\"{query}\"","maxHistoryItems":15,"homeIcon":"","access":[{"value":"public","label":"Public"},{"value":"protected","label":"Protected"}],"toolbarLinks":[{"id":"fields","label":"Fields"},{"id":"properties","label":"Properties"},{"id":"methods","label":"Methods"},{"id":"events","label":"Events"}],"sidebar":[{"n":"/","l":"Home","s":""},{"n":"install","l":"Install","s":""},{"n":"modules","l":"Modules","c":false,"i":[{"n":"demultiplex","l":"Demultiplex","s":""},{"n":"preflight","l":"Preflight","s":""},{"n":"qc","l":"QC","s":""},{"n":"align","l":"Align","c":false,"i":[{"n":"bwa","l":"BWA","s":""},{"n":"ema","l":"EMA","s":""}],"s":""},{"n":"snp","l":"SNP","s":""},{"n":"sv","l":"SV","c":false,"i":[{"n":"leviathan","l":"Leviathan","s":""},{"n":"naibr","l":"Naibr","s":""}],"s":""},{"n":"impute","l":"Impute","s":""},{"n":"phase","l":"Phase","s":""},{"n":"othermodules","l":"Other","s":""}],"s":""},{"n":"haplotagdata","l":"Haplotag data","s":""},{"n":"commonoptions","l":"Common Options","s":""},{"n":"issues","l":"Common Issues","s":""},{"n":"snakemake","l":"Sneaky Snakemake","s":""},{"n":"software","l":"Software","s":""},{"n":"development","l":"Development","s":""}],"search":{"mode":1,"minChars":2,"maxResults":20,"placeholder":"Search","hotkeys":["k"],"noResultsFoundMsg":"Sorry, no results found.","recognizeLanguages":true,"languages":[0],"preload":false},"resources":{"History_Title_Label":"History","History_ClearLink_Label":"Clear","History_NoHistory_Label":"No history items","API_AccessFilter_Label":"Access","API_ParameterSection_Label":"PARAMETERS","API_SignatureSection_Label":"SIGNATURE","API_CopyHint_Label":"Copy","API_CopyNameHint_Label":"Copy name","API_CopyLinkHint_Label":"Copy link","API_CopiedAckHint_Label":"Copied!","API_MoreOverloads_Label":"more","API_MoreDropdownItems_Label":"More","API_OptionalParameter_Label":"optional","API_DefaultParameterValue_Label":"Default 
value","API_InheritedFilter_Label":"Inherited","Search_Input_Placeholder":"Search","Toc_Contents_Label":"Contents","Toc_RelatedClasses_Label":"Related Classes","History_JustNowTime_Label":"just now","History_AgoTime_Label":"ago","History_YearTime_Label":"y","History_MonthTime_Label":"mo","History_DayTime_Label":"d","History_HourTime_Label":"h","History_MinuteTime_Label":"m","History_SecondTime_Label":"s"}}; +var __DOCS_CONFIG__ = {"id":"JtK5ngNTtmAwChWCUtpiQWtP7215aIzRz4O","key":"dZtL8rqPQCZS8qsJmkyPOIs4Dp05n5jRtZhPxegtwZ4.P3ArqvhrpFFTrDcjLjRFCqH/nR3MvFLElDiwtnmQSsPsIuQQ2+JB2vR1k0/3pJ7scTA5iIT3wnGeuXDSdjWKUQ.94","base":"/HARPY/","host":"pdimens.github.io","version":"1.0.0","useRelativePaths":true,"documentName":"index.html","appendDocumentName":false,"trailingSlash":true,"preloadSearch":false,"cacheBustingToken":"3.5.0.761771887257","cacheBustingStrategy":"query","sidebarFilterPlaceholder":"Filter","toolbarFilterPlaceholder":"Filter","showSidebarFilter":true,"filterNotFoundMsg":"No member names found containing the query \"{query}\"","maxHistoryItems":15,"homeIcon":"","access":[{"value":"public","label":"Public"},{"value":"protected","label":"Protected"}],"toolbarLinks":[{"id":"fields","label":"Fields"},{"id":"properties","label":"Properties"},{"id":"methods","label":"Methods"},{"id":"events","label":"Events"}],"sidebar":[{"n":"/","l":"Home","s":""},{"n":"install","l":"Install","s":""},{"n":"modules","l":"Modules","c":false,"i":[{"n":"demultiplex","l":"Demultiplex","s":""},{"n":"preflight","l":"Preflight","s":""},{"n":"qc","l":"QC","s":""},{"n":"align","l":"Align","c":false,"i":[{"n":"bwa","l":"BWA","s":""},{"n":"ema","l":"EMA","s":""}],"s":""},{"n":"snp","l":"SNP","s":""},{"n":"sv","l":"SV","c":false,"i":[{"n":"leviathan","l":"Leviathan","s":""},{"n":"naibr","l":"Naibr","s":""}],"s":""},{"n":"impute","l":"Impute","s":""},{"n":"phase","l":"Phase","s":""},{"n":"othermodules","l":"Other","s":""}],"s":""},{"n":"haplotagdata","l":"Haplotag 
data","s":""},{"n":"commonoptions","l":"Common Options","s":""},{"n":"issues","l":"Common Issues","s":""},{"n":"snakemake","l":"Sneaky Snakemake","s":""},{"n":"software","l":"Software","s":""},{"n":"development","l":"Development","s":""}],"search":{"mode":1,"minChars":2,"maxResults":20,"placeholder":"Search","hotkeys":["k"],"noResultsFoundMsg":"Sorry, no results found.","recognizeLanguages":true,"languages":[0],"preload":false},"resources":{"History_Title_Label":"History","History_ClearLink_Label":"Clear","History_NoHistory_Label":"No history items","API_AccessFilter_Label":"Access","API_ParameterSection_Label":"PARAMETERS","API_SignatureSection_Label":"SIGNATURE","API_CopyHint_Label":"Copy","API_CopyNameHint_Label":"Copy name","API_CopyLinkHint_Label":"Copy link","API_CopiedAckHint_Label":"Copied!","API_MoreOverloads_Label":"more","API_MoreDropdownItems_Label":"More","API_OptionalParameter_Label":"optional","API_DefaultParameterValue_Label":"Default value","API_InheritedFilter_Label":"Inherited","Search_Input_Placeholder":"Search","Toc_Contents_Label":"Contents","Toc_RelatedClasses_Label":"Related Classes","History_JustNowTime_Label":"just now","History_AgoTime_Label":"ago","History_YearTime_Label":"y","History_MonthTime_Label":"mo","History_DayTime_Label":"d","History_HourTime_Label":"h","History_MinuteTime_Label":"m","History_SecondTime_Label":"s"}}; diff --git a/resources/js/search.json b/resources/js/search.json index 1036000d3..a3315479d 100644 --- a/resources/js/search.json +++ b/resources/js/search.json @@ -1 +1 @@ -[[{"i":"#","p":["Using Harpy to process your haplotagged data"]},{"l":"Home","p":["Harpy is a haplotagging data processing pipeline for Linux-based systems. 
It uses all the magic of Snakemake under the hood to handle the worklfow decision-making, but as a user, you just interact with it like a normal command-line"]},{"i":"what-is-haplotagging","l":"What is haplotagging?","p":["Linked-read sequencing exists to combine the throughput and accuracy of short-read sequencing with the long range haplotype information of long-read sequencing. Haplotagging is an implementation of linked-read sequencing developed by"]},{"l":"Harpy Modules","p":["Harpy is modular, meaning you can use different parts of it independent from each other. Need to only align reads? Great! Only want to call variants? Awesome! All modules are called by"]},{"l":"Using Harpy","p":["You can call harpy without any arguments (or with --help) to print the docstring to your terminal. You can likewise call any of the modules without arguments or with --help to see their usage (e.g."]}],[{"l":"Install HARPY","p":["Harpy is now hosted on Bioconda! That means to install it, you just need to have mamba(or conda) on your Linux-based system and install it with a simple command. You can install Harpy into an existing environment or create a new one for it (recommended)."]}],[{"i":"#","p":["Demultiplex raw sequences into haplotag barcoded samples"]},{"l":"Demultiplex Raw Sequences","p":["When pooling samples and sequencing them in parallel on an Illumina sequencer, you will be given large multiplexed FASTQ files in return. 
These files contain sequences for all of your samples and need to be demultiplexed using barcodes to"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy demultiplex module is configured using these command-line arguments:"]},{"l":"Haplotag Types"},{"l":"Gen I Demultiplex Workflow"}],[{"i":"#","p":["Run file format checks on haplotagged FASTQ/BAM files"]},{"l":"Pre-flight checks for input files","p":["Harpy does a lot of stuff with a lot of software and each of these programs expect the incoming data to follow particular formats (plural, unfortunately). These formatting opinions/specifics are at the mercy of the original developers and while there are times when Harpy can (and does)"]},{"l":"when to run"},{"l":"Running Options","p":["In addition to the common runtime options, the harpy preflight fastq|bam module is configured using these command-line arguments:"]},{"l":"Workflow"}],[{"i":"#","p":["Quality trim haplotagged sequences with Harpy"]},{"l":"Quality Trim Sequences","p":["Raw sequences are not suitable for downstream analyses. They have sequencing adapters, index sequences, regions of poor quality, etc. The first step of any genetic sequence analyses is to remove these adapters and trim poor quality data. You can remove adapters"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy qc module is configured using these command-line arguments:"]},{"l":"QC Workflow"}],[{"i":"#","p":["Align haplotagged sequences with BWA MEM"]},{"l":"Map Reads onto a genome with BWA MEM","p":["Once sequences have been trimmed and passed through other QC filters, they will need to be aligned to a reference genome. 
This module within Harpy expects filtered reads as input,"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy align bwa module is configured using these command-line arguments:"]},{"l":"Molecule distance","p":["The --molecule-distance option is used during the BWA alignment workflow to assign alignments a unique Molecular Identifier MI:i tag based on their haplotag barcode and the distance threshold you specify. See"]},{"l":"Quality filtering","p":["The --quality argument filters out alignments below a given MQ threshold. The default, 30, keeps alignments that are at least 99.9% likely correctly mapped. Set this value to 1"]},{"l":"Marking PCR duplicates","p":["Harpy uses samtools markdup to mark putative PCR duplicates. By using the --barcode-tag BX option, it considers the linked-read barcode for more accurate duplicate detection. Duplicate"]},{"l":"BWA workflow"}],[{"i":"#","p":["Align haplotagged sequences with EMA"]},{"l":"Map Reads onto a genome with EMA","p":["Once sequences have been trimmed and passed through other QC filters, they will need to be aligned to a reference genome. This module within Harpy expects filtered reads as input,"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy align ema module is configured using these command-line arguments:"]},{"l":"Barcode whitelist","p":["Some linked-read methods (e.g. 10x, Tellseq) require the inclusion of a barcode \"whitelist.\" This file is a simple text file that has one barcode per line so a given software knows what barcodes to expect in your data."]},{"l":"Quality filtering","p":["The --quality argument filters out alignments below a given MQ threshold. The default, 30, keeps alignments that are at least 99.9% likely correctly mapped. Set this value to 1"]},{"l":"Marking PCR duplicates","p":["EMA marks duplicates in the resulting alignments, however the read with invalid barcodes are aligned separately with BWA. 
Therefore, Harpy uses samtools markdup to mark putative"]},{"l":"EMA workflow"}],[{"i":"#","p":["Call SNPs and small indels"]},{"l":"Call SNPs and small indels","p":["After reads have been aligned, e.g., with harpy align, you can use those alignment files(.bam) to call variants in your data. Harpy can call SNPs and small indels using bcftools mpileup"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy snp module is configured using these command-line arguments:"]},{"l":"windowsize","p":["To speed things along, Harpy will call variants in parallel on different contig intervals, then merge everything at the end. You can control the level of parallelization by using"]},{"l":"populations","p":["Grouping samples changes the way the variant callers computes certain statistics when calling variants. If you have reason to believe there is a biologically meaningful grouping scheme to your samples, then you should include"]},{"l":"SNP calling workflow"}],[{"i":"#","p":["Call structural variants using Leviathan"]},{"l":"Call Structural Variants using LEVIATHAN","p":["(like indels, insertions, duplications, breakends)"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy sv leviathan module is configured using these command-line arguments:"]},{"l":"Single-sample variant calling","p":["When not using a population grouping file via --populations, variants will be called per-sample. Due to the nature of structural variant VCF files, there isn't an entirely fool-proof way"]},{"l":"Pooled-sample variant calling","p":["With the inclusion of a population grouping file via --populations, Harpy will merge the bam files of all samples within a population and call variants on these alignment pools. 
Preliminary work shows that this way identifies more variants and with fewer false"]},{"l":"LEVIATHAN workflow"}],[{"i":"#","p":["Call structural variants using NAIBR (plus)"]},{"l":"Call Structural Variants using NAIBR","p":["(like indels, insertions, duplications)"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy sv naibr module is configured using these command-line arguments:"]},{"l":"Molecule distance","p":["The --molecule-distance option is used to let the program determine how far apart alignments on a contig with the same barcode can be from each other and still considered as originating from the same DNA molecule. See"]},{"l":"Single-sample variant calling","p":["When not using a population grouping file via --populations, variants will be called per-sample. Due to the nature of structural variant VCF files, there isn't an entirely fool-proof way"]},{"l":"Pooled-sample variant calling","p":["With the inclusion of a population grouping file via --populations, Harpy will merge the bam files of all samples within a population and call variants on these alignment pools. Preliminary work shows that this way identifies more variants and with fewer false"]},{"l":"optional vcf file","p":["In order to get the best variant calling performance out of NAIBR, it requires phased bam files as input. The --vcf option is optional and not used by NAIBR. However, to use harpy sv naibr"]},{"i":"a-phased-input---vcf","l":"a phased input --vcf","p":["This file can be in vcf/vcf.gz/bcf format and most importantly it must be phased haplotypes. There are various ways to haplotype SNPs, but you can use harpy phase to phase your SNPs into haplotypes using the haplotag barcode"]},{"l":"NAIBR workflow"}],[{"i":"#","p":["Impute genotypes for haplotagged data with Harpy"]},{"l":"Impute Genotypes using Sequences","p":["After variants have been called, you may want to impute missing genotypes to get the most from your data. 
Harpy uses STITCH to impute genotypes, a haplotype-based method that is linked-read aware. Imputing genotypes requires a variant call file"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy impute module is configured using these command-line arguments:"]},{"l":"Extra STITCH parameters","p":["You may add additional parameters to STITCH by way of the--extra-params(or -x) option. Since STITCH is a function in the R language, the parameters you add must be in R syntax (e.g."]},{"l":"Prioritize the vcf file","p":["Sometimes you want to run imputation on all the samples present in the --directory, but other times you may want to only impute the samples present in the --vcf file. By default, Harpy assumes you want to use all the samples"]},{"l":"Parameter file","p":["Typically, one runs STITCH multiple times, exploring how results vary with different model parameters (explained in next section). The solution Harpy uses for this is to have the user"]},{"l":"STITCH Parameters"},{"l":"Imputation Workflow"}],[{"i":"#","p":["Phase haplotypes for haplotagged data with Harpy"]},{"l":"Phase SNPs into Haplotypes","p":["You may want to phase your genotypes into haplotypes, as haplotypes tend to be more informative than unphased genotypes (higher polymorphism, captures relationship between genotypes). Phasing"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy phase module is configured using these command-line arguments:"]},{"l":"Prioritize the vcf file","p":["Sometimes you want to run imputation on all the samples present in the --directory, but other times you may want to only impute the samples present in the --vcf file. By default, Harpy assumes you want to use all the samples"]},{"l":"Molecule distance","p":["The molecule distance refers to the base-pair distance dilineating separate molecules. 
In other words, when two alignments on a single contig share the same barcode, how far away from each other are we willing to say they were and still consider them having"]},{"l":"Pruning threshold","p":["The pruning threshold refers to a PHRED-scale value between 0-1 (a percentage) for removing low-confidence SNPs from consideration. With Harpy, you configure this value as an integer"]},{"l":"Phasing Workflow"}],[{"i":"#","p":["Generate extra files for analysis with Harpy"]},{"l":"Other Harpy modules","p":["Some parts of Harpy (variant calling, imputation) want or need extra files. You can create various files necessary for different modules using these extra modules: The arguments represent different sub-commands and can be run in any order or combination to generate the files you need."]},{"l":"Other modules"},{"l":"popgroup"},{"l":"Sample grouping file for variant calling"},{"l":"arguments","p":["This file is entirely optional and useful if you want SNP variant calling to happen on a per-population level via harpy snp ... -p or on samples pooled-as-populations via harpy sv ... -p"]},{"l":"stitchparams"},{"l":"STITCH parameter file"},{"i":"arguments-1","l":"arguments","p":["Typically, one runs STITCH multiple times, exploring how results vary with different model parameters. The solution Harpy uses for this is to have the user provide a tab-delimited dataframe file where the columns are the 6 STITCH model"]},{"l":"hpc"},{"l":"HPC cluster profile"},{"i":"arguments-2","l":"arguments","p":["For snakemake to work in harmony with an HPC scheduler, a \"profile\" needs to be provided that tells Snakemake how it needs to interact with the HPC scheduler to submit your jobs to the cluster. 
Using"]}],[{"l":"Haplotag data"},{"l":"Data Format"},{"l":"Barcodes","p":["While barcodes are actually combinatorial bases, in the read headers they are represented with the format AxxCxxBxxDxx, where each barcode segment is denoted as Axx(or Bxx, etc.)."]},{"l":"barcode protocol varieties","p":["If you think haplotagging is as simple as exactly 96^4 unique barcodes, you would only be half-correct. The original haplotagging protocol in Meier et al. is good, but the authors (and others) have been working to improve this linked-read technology to improve"]},{"l":"where the barcodes go","p":["Chromium 10X linked-reads have a particular format where the barcode is the leading 16 bases of the read. However, haplotagging data does not use that format, nor do the tools implemented in Harpy work correctly with it. Once demultiplexed, haplotagging sequences should look"]},{"l":"Read headers","p":["Like mentioned, the haplotag barcode is expected to be stored in the BX:Z: tag in the read header. This information is retained through the various Harpy steps. An example read header could look like:"]},{"l":"Read length","p":["Reads must be at least 30 base pairs in length for alignment. The qc module removes reads <50bp."]},{"l":"Compression","p":["Harpy generally doesn't require the input sequences to be in gzipped/bgzipped format, but it's good practice to compress your reads anyway. Compressed files are expected to end with the extension"]},{"l":"Naming conventions","p":["Unfortunately, there are many different ways of naming FASTQ files, which makes it difficult to accomodate every wacky iteration currently in circulation. While Harpy tries its best to be flexible, there are limitations."]},{"l":"Barcode thresholds","p":["By the nature of linked read technologies, there will (almost always) be more DNA fragments than unique barcodes for them. As a result, it's common for barcodes to reappear in sequences. 
Rather than incorrectly assume that all sequences/alignments with the same barcode"]}],[{"l":"Common Harpy Options"},{"l":"Common command-line options","p":["Every Harpy module has a series of configuration parameters. These are arguments you need to input to configure the module to run on your data, such as the directory with the reads/alignments,"]},{"l":"The workflow folder","p":["When you run one of the main Harpy modules, the output directory will contain a workflow folder. This folder is both necessary for the module to run and is very useful to understand what the module did, be it for your own"]},{"l":"The Genome folder","p":["You will notice that many of the workflows will create a Genome folder in the working directory. This folder is to make it easier for Harpy to store the genome and the associated"]}],[{"l":"Common Issues","p":["Lots of stuff can go wrong during an analysis. The intent of this page is to highlight common issues you may experience during analysis and ways to address these issues."]},{"l":"Problem installing with conda","p":["Conda is an awesome package manager, but it's slow and uses a ton of memory as dependencies increase. Harpy has a lot of dependencies and you might stall out conda trying to install it. Use mamba instead-- it'll work where conda fails."]},{"l":"Failures during imputation or phasing","p":["If you use bamutils clipOverlap on alignments that are used for the impute or phase modules, they will cause both programs to error. We don't know why, but they do."]},{"i":"alignment-file-name-and-id-tag-mismatch","l":"Alignment file name and ID: tag mismatch","p":["Aligning a sample to a genome via Harpy will insert the sample name (based on the file name) into the alignment header (the @RG ID:name SM:name tag). It likewise expects, through various steps,"]}],[{"l":"Adding Snakamake parameters","p":["Harpy relies on Snakemake under the hood to handle file and job dependencies. 
Most of these details have been abstracted away from the end-user, but every module of Harpy (except"]},{"l":"Common use cases","p":["You likely wont need to invoke --snakemake very often, if ever. However, here are common use cases for this parameter."]}],[{"l":"Software used in Harpy","p":["HARPY is the sum of its parts, and out of tremendous respect for the developers involved in the included software, we would like to highlight the tools directly involved in HARPY's many moving pieces."]}],[{"l":"Developing Harpy","p":["Harpy is an open source program written using a combination of BASH, R, RMarkdown, Python, and Snakemake. This page provides information on Harpy's development and how to contribute to it, if you were inclined to do so."]},{"l":"Installing Harpy for development","p":["The process follows cloning the harpy repository, installing the preconfigured conda environment, and running the misc/buildlocal.sh script to move all the necessary files to the"]},{"i":"harpys-components","l":"Harpy's components"},{"l":"source code","p":["Harpy runs in two stages:"]},{"l":"Bioconda recipe","p":["For the ease of installation for end-users, Harpy has a recipe and build script in Bioconda, which makes it available for download and installation. A copy of the recipe and build script is also stored in"]},{"l":"The Harpy repository"},{"l":"structure","p":["Harpy exists as a Git repository and has 5 standard branches that are used in specific ways during development. Git is a popular version control system and discussing its use is out of the scope of this documentation, however there is no"]},{"l":"development workflow","p":["The dev workflow is reasonably standard:"]},{"l":"Testing and CI","p":["CI ( C ontinuous I ntegration) is a term describing automated actions that do things to/with your code and are triggered by how you interact with a repository. 
Harpy has a series of GitHub Actions triggered by interactions with the"]}]] \ No newline at end of file +[[{"i":"#","p":["Using Harpy to process your haplotagged data"]},{"l":"Home","p":["Harpy is a haplotagging data processing pipeline for Linux-based systems. It uses all the magic of Snakemake under the hood to handle the worklfow decision-making, but as a user, you just interact with it like a normal command-line"]},{"i":"what-is-haplotagging","l":"What is haplotagging?","p":["Linked-read sequencing exists to combine the throughput and accuracy of short-read sequencing with the long range haplotype information of long-read sequencing. Haplotagging is an implementation of linked-read sequencing developed by"]},{"l":"Harpy Modules","p":["Harpy is modular, meaning you can use different parts of it independent from each other. Need to only align reads? Great! Only want to call variants? Awesome! All modules are called by"]},{"l":"Using Harpy","p":["You can call harpy without any arguments (or with --help) to print the docstring to your terminal. You can likewise call any of the modules without arguments or with --help to see their usage (e.g."]}],[{"l":"Install HARPY","p":["Harpy is now hosted on Bioconda! That means to install it, you just need to have mamba(or conda) on your Linux-based system and install it with a simple command. You can install Harpy into an existing environment or create a new one for it (recommended)."]}],[{"i":"#","p":["Demultiplex raw sequences into haplotag barcoded samples"]},{"l":"Demultiplex Raw Sequences","p":["When pooling samples and sequencing them in parallel on an Illumina sequencer, you will be given large multiplexed FASTQ files in return. 
These files contain sequences for all of your samples and need to be demultiplexed using barcodes to"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy demultiplex module is configured using these command-line arguments:"]},{"l":"Haplotag Types"},{"l":"Gen I Demultiplex Workflow"}],[{"i":"#","p":["Run file format checks on haplotagged FASTQ/BAM files"]},{"l":"Pre-flight checks for input files","p":["Harpy does a lot of stuff with a lot of software and each of these programs expects the incoming data to follow particular formats (plural, unfortunately). These formatting opinions/specifics are at the mercy of the original developers and while there are times when Harpy can (and does)"]},{"l":"when to run"},{"l":"Running Options","p":["In addition to the common runtime options, the harpy preflight fastq|bam module is configured using these command-line arguments:"]},{"l":"Workflow"}],[{"i":"#","p":["Quality trim haplotagged sequences with Harpy"]},{"l":"Quality Trim Sequences","p":["Raw sequences are not suitable for downstream analyses. They have sequencing adapters, index sequences, regions of poor quality, etc. The first step of any genetic sequence analysis is to remove these adapters and trim poor quality data. You can remove adapters"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy qc module is configured using these command-line arguments:"]},{"l":"QC Workflow"}],[{"i":"#","p":["Align haplotagged sequences with BWA MEM"]},{"l":"Map Reads onto a genome with BWA MEM","p":["Once sequences have been trimmed and passed through other QC filters, they will need to be aligned to a reference genome. 
This module within Harpy expects filtered reads as input,"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy align bwa module is configured using these command-line arguments:"]},{"l":"Molecule distance","p":["The --molecule-distance option is used during the BWA alignment workflow to assign alignments a unique Molecular Identifier MI:i tag based on their haplotag barcode and the distance threshold you specify. See"]},{"l":"Quality filtering","p":["The --quality argument filters out alignments below a given MQ threshold. The default, 30, keeps alignments that are at least 99.9% likely correctly mapped. Set this value to 1"]},{"l":"Marking PCR duplicates","p":["Harpy uses samtools markdup to mark putative PCR duplicates. By using the --barcode-tag BX option, it considers the linked-read barcode for more accurate duplicate detection. Duplicate"]},{"l":"BWA workflow"}],[{"i":"#","p":["Align haplotagged sequences with EMA"]},{"l":"Map Reads onto a genome with EMA","p":["Once sequences have been trimmed and passed through other QC filters, they will need to be aligned to a reference genome. This module within Harpy expects filtered reads as input,"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy align ema module is configured using these command-line arguments:"]},{"l":"Barcode whitelist","p":["Some linked-read methods (e.g. 10x, Tellseq) require the inclusion of a barcode \"whitelist.\" This file is a simple text file that has one barcode per line so a given software knows what barcodes to expect in your data."]},{"l":"Quality filtering","p":["The --quality argument filters out alignments below a given MQ threshold. The default, 30, keeps alignments that are at least 99.9% likely correctly mapped. Set this value to 1"]},{"l":"Marking PCR duplicates","p":["EMA marks duplicates in the resulting alignments, however the reads with invalid barcodes are aligned separately with BWA. 
Therefore, Harpy uses samtools markdup to mark putative"]},{"l":"EMA workflow"}],[{"i":"#","p":["Call SNPs and small indels"]},{"l":"Call SNPs and small indels","p":["After reads have been aligned, e.g., with harpy align, you can use those alignment files(.bam) to call variants in your data. Harpy can call SNPs and small indels using bcftools mpileup"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy snp module is configured using these command-line arguments:"]},{"l":"windowsize","p":["To speed things along, Harpy will call variants in parallel on different contig intervals, then merge everything at the end. You can control the level of parallelization by using"]},{"l":"populations","p":["Grouping samples changes the way the variant caller computes certain statistics when calling variants. If you have reason to believe there is a biologically meaningful grouping scheme to your samples, then you should include"]},{"l":"SNP calling workflow"}],[{"i":"#","p":["Call structural variants using Leviathan"]},{"l":"Call Structural Variants using LEVIATHAN","p":["(like indels, insertions, duplications, breakends)"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy sv leviathan module is configured using these command-line arguments:"]},{"l":"Single-sample variant calling","p":["When not using a population grouping file via --populations, variants will be called per-sample. Due to the nature of structural variant VCF files, there isn't an entirely fool-proof way"]},{"l":"Pooled-sample variant calling","p":["With the inclusion of a population grouping file via --populations, Harpy will merge the bam files of all samples within a population and call variants on these alignment pools. 
Preliminary work shows that this way identifies more variants with fewer false"]},{"l":"LEVIATHAN workflow"}],[{"i":"#","p":["Call structural variants using NAIBR (plus)"]},{"l":"Call Structural Variants using NAIBR","p":["(like indels, insertions, duplications)"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy sv naibr module is configured using these command-line arguments:"]},{"l":"Molecule distance","p":["The --molecule-distance option is used to let the program determine how far apart alignments on a contig with the same barcode can be from each other and still be considered as originating from the same DNA molecule. See"]},{"l":"Single-sample variant calling","p":["When not using a population grouping file via --populations, variants will be called per-sample. Due to the nature of structural variant VCF files, there isn't an entirely fool-proof way"]},{"l":"Pooled-sample variant calling","p":["With the inclusion of a population grouping file via --populations, Harpy will merge the bam files of all samples within a population and call variants on these alignment pools. Preliminary work shows that this way identifies more variants with fewer false"]},{"l":"optional vcf file","p":["In order to get the best variant calling performance out of NAIBR, it requires phased bam files as input. The --vcf option is optional and not used by NAIBR. However, to use harpy sv naibr"]},{"i":"a-phased-input---vcf","l":"a phased input --vcf","p":["This file can be in vcf/vcf.gz/bcf format and most importantly it must be phased haplotypes. There are various ways to haplotype SNPs, but you can use harpy phase to phase your SNPs into haplotypes using the haplotag barcode"]},{"l":"NAIBR workflow"}],[{"i":"#","p":["Impute genotypes for haplotagged data with Harpy"]},{"l":"Impute Genotypes using Sequences","p":["After variants have been called, you may want to impute missing genotypes to get the most from your data. 
Harpy uses STITCH, a haplotype-based method that is linked-read aware, to impute genotypes. Imputing genotypes requires a variant call file"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy impute module is configured using these command-line arguments:"]},{"l":"Extra STITCH parameters","p":["You may add additional parameters to STITCH by way of the--extra-params(or -x) option. Since STITCH is a function in the R language, the parameters you add must be in R syntax (e.g."]},{"l":"Prioritize the vcf file","p":["Sometimes you want to run imputation on all the samples present in the --directory, but other times you may want to only impute the samples present in the --vcf file. By default, Harpy assumes you want to use all the samples"]},{"l":"Parameter file","p":["Typically, one runs STITCH multiple times, exploring how results vary with different model parameters (explained in next section). The solution Harpy uses for this is to have the user"]},{"l":"STITCH Parameters"},{"l":"Imputation Workflow"}],[{"i":"#","p":["Phase haplotypes for haplotagged data with Harpy"]},{"l":"Phase SNPs into Haplotypes","p":["You may want to phase your genotypes into haplotypes, as haplotypes tend to be more informative than unphased genotypes (higher polymorphism, captures relationship between genotypes). Phasing"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy phase module is configured using these command-line arguments:"]},{"l":"Prioritize the vcf file","p":["Sometimes you want to run imputation on all the samples present in the --directory, but other times you may want to only impute the samples present in the --vcf file. By default, Harpy assumes you want to use all the samples"]},{"l":"Molecule distance","p":["The molecule distance refers to the base-pair distance delineating separate molecules. 
In other words, when two alignments on a single contig share the same barcode, how far away from each other are we willing to say they were and still consider them having"]},{"l":"Pruning threshold","p":["The pruning threshold refers to a PHRED-scale value between 0-1 (a percentage) for removing low-confidence SNPs from consideration. With Harpy, you configure this value as an integer"]},{"l":"Phasing Workflow"}],[{"i":"#","p":["Generate extra files for analysis with Harpy"]},{"l":"Other Harpy modules","p":["Some parts of Harpy (variant calling, imputation) want or need extra files. You can create various files necessary for different modules using these extra modules: The arguments represent different sub-commands and can be run in any order or combination to generate the files you need."]},{"l":"Other modules"},{"l":"popgroup"},{"l":"Sample grouping file for variant calling"},{"l":"arguments","p":["This file is entirely optional and useful if you want SNP variant calling to happen on a per-population level via harpy snp ... -p or on samples pooled-as-populations via harpy sv ... -p"]},{"l":"stitchparams"},{"l":"STITCH parameter file"},{"i":"arguments-1","l":"arguments","p":["Typically, one runs STITCH multiple times, exploring how results vary with different model parameters. The solution Harpy uses for this is to have the user provide a tab-delimited dataframe file where the columns are the 6 STITCH model"]},{"l":"hpc"},{"l":"HPC cluster profile"},{"i":"arguments-2","l":"arguments","p":["For snakemake to work in harmony with an HPC scheduler, a \"profile\" needs to be provided that tells Snakemake how it needs to interact with the HPC scheduler to submit your jobs to the cluster. 
Using"]}],[{"l":"Haplotag data"},{"l":"Data Format"},{"l":"Barcodes","p":["While barcodes are actually combinatorial bases, in the read headers they are represented with the format AxxCxxBxxDxx, where each barcode segment is denoted as Axx(or Bxx, etc.)."]},{"l":"barcode protocol varieties","p":["If you think haplotagging is as simple as exactly 96^4 unique barcodes, you would only be half-correct. The original haplotagging protocol in Meier et al. is good, but the authors (and others) have been working to improve this linked-read technology"]},{"l":"where the barcodes go","p":["Chromium 10X linked-reads have a particular format where the barcode is the leading 16 bases of the read. However, haplotagging data does not use that format, nor do the tools implemented in Harpy work correctly with it. Once demultiplexed, haplotagging sequences should look"]},{"l":"Read headers","p":["As mentioned, the haplotag barcode is expected to be stored in the BX:Z: tag in the read header. This information is retained through the various Harpy steps. An example read header could look like:"]},{"l":"Read length","p":["Reads must be at least 30 base pairs in length for alignment. The qc module removes reads <50bp."]},{"l":"Compression","p":["Harpy generally doesn't require the input sequences to be in gzipped/bgzipped format, but it's good practice to compress your reads anyway. Compressed files are expected to end with the extension"]},{"l":"Naming conventions","p":["Unfortunately, there are many different ways of naming FASTQ files, which makes it difficult to accommodate every wacky iteration currently in circulation. While Harpy tries its best to be flexible, there are limitations."]},{"l":"Barcode thresholds","p":["By the nature of linked read technologies, there will (almost always) be more DNA fragments than unique barcodes for them. As a result, it's common for barcodes to reappear in sequences. 
Rather than incorrectly assume that all sequences/alignments with the same barcode"]}],[{"l":"Common Harpy Options"},{"l":"Common command-line options","p":["Every Harpy module has a series of configuration parameters. These are arguments you need to input to configure the module to run on your data, such as the directory with the reads/alignments,"]},{"l":"The workflow folder","p":["When you run one of the main Harpy modules, the output directory will contain a workflow folder. This folder is both necessary for the module to run and is very useful to understand what the module did, be it for your own"]},{"l":"The Genome folder","p":["You will notice that many of the workflows will create a Genome folder in the working directory. This folder is to make it easier for Harpy to store the genome and the associated"]}],[{"l":"Common Issues","p":["Lots of stuff can go wrong during an analysis. The intent of this page is to highlight common issues you may experience during analysis and ways to address these issues."]},{"l":"Problem installing with conda","p":["Conda is an awesome package manager, but it's slow and uses a ton of memory as dependencies increase. Harpy has a lot of dependencies and you might stall out conda trying to install it. Use mamba instead-- it'll work where conda fails."]},{"l":"Failures during imputation or phasing","p":["If you use bamutils clipOverlap on alignments that are used for the impute or phase modules, they will cause both programs to error. We don't know why, but they do."]},{"i":"alignment-file-name-and-id-tag-mismatch","l":"Alignment file name and ID: tag mismatch","p":["Aligning a sample to a genome via Harpy will insert the sample name (based on the file name) into the alignment header (the @RG ID:name SM:name tag). It likewise expects, through various steps,"]}],[{"l":"Adding Snakemake parameters","p":["Harpy relies on Snakemake under the hood to handle file and job dependencies. 
Most of these details have been abstracted away from the end-user, but every module of Harpy (except"]},{"l":"Common use cases","p":["You likely won't need to invoke --snakemake very often, if ever. However, here are common use cases for this parameter."]}],[{"l":"Software used in Harpy","p":["HARPY is the sum of its parts, and out of tremendous respect for the developers involved in the included software, we would like to highlight the tools directly involved in HARPY's many moving pieces."]}],[{"l":"Developing Harpy","p":["Harpy is an open source program written using a combination of BASH, R, RMarkdown, Python, and Snakemake. This page provides information on Harpy's development and how to contribute to it, if you were inclined to do so."]},{"l":"Installing Harpy for development","p":["The process follows cloning the harpy repository, installing the preconfigured conda environment, and running the misc/buildlocal.sh script to move all the necessary files to the"]},{"i":"harpys-components","l":"Harpy's components"},{"l":"source code","p":["Harpy runs in two stages:"]},{"l":"Bioconda recipe","p":["For the ease of installation for end-users, Harpy has a recipe and build script in Bioconda, which makes it available for download and installation. A copy of the recipe and build script is also stored in"]},{"l":"The Harpy repository"},{"l":"structure","p":["Harpy exists as a Git repository and has 5 standard branches that are used in specific ways during development. Git is a popular version control system and discussing its use is out of the scope of this documentation, however there is no"]},{"l":"development workflow","p":["The dev workflow is reasonably standard:"]},{"l":"Automations"},{"l":"Testing","p":["CI (Continuous Integration) is a term describing automated actions that do things to/with your code and are triggered by how you interact with a repository. 
Harpy has a series of GitHub Actions triggered by interactions with the"]},{"l":"Releases","p":["To save on disk space, there is an automation to strip out the unnecessary files and upload a cleaned tarball to the new release. This is triggered automatically when a new version is tagged."]}]] \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 25d7fac70..57e79a263 100644 Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ diff --git a/snakemake/index.html b/snakemake/index.html index 34b8a790b..9b6a82659 100644 --- a/snakemake/index.html +++ b/snakemake/index.html @@ -4,7 +4,7 @@ - + @@ -32,12 +32,12 @@ - + - + - - + +
    diff --git a/software/index.html b/software/index.html index 892d0afb2..cccee1017 100644 --- a/software/index.html +++ b/software/index.html @@ -4,7 +4,7 @@ - + @@ -32,11 +32,11 @@ - + - + - +