
Commit

Merge pull request #56 from Cristianetaniguti/joint_EmpiricalReads
Joint empirical reads
Cristianetaniguti authored Nov 29, 2022
2 parents f05da13 + b39a5e7 commit 09cc213
Showing 7 changed files with 105 additions and 11 deletions.
11 changes: 6 additions & 5 deletions README.md
@@ -3,9 +3,9 @@

## Reads2Map

Reads2Map presents [WDL workflows](https://openwdl.org/) a collection of pipelines to build linkage maps from sequencing reads. Each pipeline release is described in the [Read2Map releases page](https://github.com/Cristianetaniguti/Reads2Map/releases).
Reads2Map presents a collection of [WDL workflows](https://openwdl.org/) to build linkage maps from sequencing reads. Each workflow release is described in the [Read2Map releases page](https://github.com/Cristianetaniguti/Reads2Map/releases).

The main workflows are the `EmpiricalSNPCalling.wdl`, the `EmpiricalMaps.wdl`, and the `SimulatedReads.wdl`. `EmpiricalSNPCalling.wdl` performs the SNP calling and `EmpiricalMaps.wdl` performs the genotype calling and map building in empirical reads. The `SimulatedReads.wdl` simulates Illumina reads for RADseq, exome, or WGS data and performs the SNP and genotype calling and genetic map building.
The main workflows are the `EmpiricalReads2Map.wdl` and the `SimulatedReads2Map.wdl`. The `EmpiricalReads2Map.wdl` is composed of the `EmpiricalSNPCalling.wdl`, which performs the SNP calling, and the `EmpiricalMaps.wdl`, which performs the genotype calling and map building on empirical reads. The `SimulatedReads2Map.wdl` simulates Illumina reads for RADseq, exome, or WGS data and performs the SNP and genotype calling and genetic map building.

Currently, [GATK](https://github.com/broadinstitute/gatk) and [Freebayes](https://github.com/ekg/freebayes) are included for SNP calling; [updog](https://github.com/dcgerard/updog), [polyRAD](https://github.com/lvclark/polyRAD), and [SuperMASSA](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030906) for dosage calling; and [OneMap](https://github.com/augusto-garcia/onemap) and [GUSMap](https://github.com/tpbilton/GUSMap) for linkage map building.

@@ -23,19 +23,19 @@ Check the description of the inputs for the pipelines:

* [EmpiricalReads2Map (EmpiricalSNPCalling and EmpiricalMaps)](https://cristianetaniguti.github.io/Tutorials/Reads2Map/EmpiricalReads.html)

* [SimulatedReads](https://cristianetaniguti.github.io/Tutorials/Reads2Map/simulatedreads.html)
* [SimulatedReads2Map](https://cristianetaniguti.github.io/Tutorials/Reads2Map/simulatedreads.html)

Check how to evaluate the workflows' results in the Reads2MapApp Shiny app:

* [Reads2MapApp](https://github.com/Cristianetaniguti/Reads2MapApp)

Once you have selected the best pipeline using a subset of your data, you can build a complete high-density linkage map:

* [Quick Guide to build High-Density Linkage Maps](https://cristianetaniguti.github.io/Tutorials/onemap/Quick_HighDens/High_density_maps.html)
* [A Guide to Build High-Density Linkage Maps](https://cristianetaniguti.github.io/Tutorials/onemap/Quick_HighDens/High_density_maps.html)

Check more information and examples of usage in:

* [Taniguti, C. H., Taniguti, L. M., Amadeu, R. R., Mollinari, M., Da, G., Pereira, S., Riera-Lizarazu, O., Lau, J., Byrne, D., de Siqueira Gesteira, G., De, T., Oliveira, P., Ferreira, G. C., & Franco Garcia, A. A. Developing best practices for genotyping-by-sequencing analysis using linkage maps as benchmarks. BioRxiv. https://doi.org/10.1101/2022.11.24.517847](https://www.biorxiv.org/content/10.1101/2022.11.24.517847v1)
* [Taniguti, C. H., Taniguti, L. M., Amadeu, R. R., Mollinari, M., Da, G., Pereira, S., Riera-Lizarazu, O., Lau, J., Byrne, D., de Siqueira Gesteira, G., De, T., Oliveira, P., Ferreira, G. C., & Franco Garcia, A. A. Developing best practices for genotyping-by-sequencing analysis using linkage maps as benchmarks. BioRxiv. https://doi.org/10.1101/2022.11.24.517847](https://www.biorxiv.org/content/10.1101/2022.11.24.517847v2)

## Third-party software and images

@@ -53,6 +53,7 @@ Check more information and examples of usage in:
- [SuperMASSA](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030906) in [cristaniguti/reads2map:0.0.1](https://hub.docker.com/repository/docker/cristaniguti/reads2map): Efficient Exact Maximum a Posteriori Computation for Bayesian SNP Genotyping in Polyploids;
- [bcftools](https://github.com/samtools/bcftools) in [lifebitai/bcftools:1.10.2](https://hub.docker.com/r/lifebitai/bcftools): utilities for variant calling and manipulating VCFs and BCFs;
- [vcftools](http://vcftools.sourceforge.net/) in [cristaniguti/split_markers:0.0.1](https://hub.docker.com/repository/docker/cristaniguti/split_markers): program package designed for working with VCF files.
- [MCHap](https://github.com/PlantandFoodResearch/MCHap) in [cristaniguti/mchap:0.7.0](https://hub.docker.com/repository/docker/cristaniguti/mchap): Polyploid micro-haplotype assembly using Markov chain Monte Carlo simulation.
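
In the workflows, these containers are referenced from WDL `runtime` blocks. A minimal hedged sketch of how a task might pin one of the images listed above — the task name and the bcftools command are illustrative, not taken from this repository:

```wdl
version 1.0

task BcftoolsStatsSketch {
    input {
        File vcf
    }
    command <<<
        # Summarize the VCF inside the pinned bcftools container
        bcftools stats ~{vcf} > stats.txt
    >>>
    output {
        File stats = "stats.txt"
    }
    runtime {
        docker: "lifebitai/bcftools:1.10.2"
    }
}
```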

### R packages

8 changes: 4 additions & 4 deletions pipelines/EmpiricalMaps/EmpiricalMaps.wdl
@@ -17,10 +17,10 @@ workflow Maps {
Dataset dataset
File? gatk_vcf_multi
String gatk_mchap
File gatk_vcf
File freebayes_vcf
File gatk_vcf_bam_counts
File freebayes_vcf_bam_counts
File? gatk_vcf
File? freebayes_vcf
File? gatk_vcf_bam_counts
File? freebayes_vcf_bam_counts
String? filters
Int max_cores
}
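
Changing these inputs from `File` to `File?` makes them optional in WDL 1.0, which lets the joint workflow pass only the VCFs that were actually produced (e.g. when only one of the SNP callers is run). A hedged sketch of the idiom a consumer of such optional inputs can use — the workflow, task, and variable names here are illustrative, not this repository's code:

```wdl
version 1.0

workflow OptionalVcfSketch {
    input {
        File? gatk_vcf   # may be absent when the GATK branch was skipped
    }

    # Guard the downstream call so it only runs when the optional file exists
    if (defined(gatk_vcf)) {
        call CountVariants {
            input:
                vcf = select_first([gatk_vcf])   # coerce File? to File
        }
    }

    output {
        Int? n_gatk_variants = CountVariants.n
    }
}

task CountVariants {
    input {
        File vcf
    }
    command <<<
        # Count non-header records; prints 0 for an empty VCF body
        zcat -f ~{vcf} | awk '!/^#/ {n++} END {print n+0}'
    >>>
    output {
        Int n = read_int(stdout())
    }
    runtime {
        docker: "ubuntu:20.04"
    }
}
```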
32 changes: 32 additions & 0 deletions pipelines/EmpiricalReads2Map/EmpiricalReads2Map.changelog.md
@@ -0,0 +1,32 @@
# 1.0.0

Initial release

This workflow builds linkage maps from genotyping-by-sequencing (GBS) data. The GBS samples are split into chunks to be run on different nodes and optimize the analyses (a short sketch of this chunking pattern follows the list below). Set the number of samples per chunk in the 'chunk_size' input. Use 'max_cores' to define the number of cores to be used in each node. The workflow runs the following combinations:

* SNP calling: GATK and Freebayes
* Dosage/genotype calling: updog, polyRAD and SuperMASSA
* Linkage map building software: OneMap 3.0 and GUSMap
* Using genotype probabilities from GATK, Freebayes, updog, polyRAD, and SuperMASSA, and global error rates of 5% and 0.001% in the OneMap HMM

Resulting in 34 linkage maps.
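
A minimal sketch of the chunking pattern described above, under the assumption that `samples_info` lists one sample per line — the workflow and task names are hypothetical, not the actual implementation:

```wdl
version 1.0

workflow ChunkSamplesSketch {
    input {
        File samples_info   # one sample per line
        Int chunk_size = 5
        Int max_cores = 4
    }

    call SplitSamples {
        input:
            samples_info = samples_info,
            chunk_size = chunk_size
    }

    # Each chunk becomes an independent shard the backend can place on its own node
    scatter (chunk in SplitSamples.chunks) {
        call ProcessChunk {
            input:
                chunk = chunk,
                max_cores = max_cores
        }
    }
}

task SplitSamples {
    input {
        File samples_info
        Int chunk_size
    }
    command <<<
        # Write one file per group of 'chunk_size' samples
        split -l ~{chunk_size} ~{samples_info} chunk_
    >>>
    output {
        Array[File] chunks = glob("chunk_*")
    }
    runtime {
        docker: "ubuntu:20.04"
    }
}

task ProcessChunk {
    input {
        File chunk
        Int max_cores
    }
    command <<<
        # Placeholder for the per-chunk alignment/SNP-calling work
        awk 'END {print NR}' ~{chunk}
    >>>
    output {
        Int n_samples = read_int(stdout())
    }
    runtime {
        docker: "ubuntu:20.04"
        cpu: max_cores
    }
}
```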

The workflow includes options to:

* Remove read duplicates or keep them
* Perform hard filtering on the GATK results
* Replace the VCF AD format field with counts from BAM files
* Run the MCHap software to build haplotypes based on GATK-called markers
* Include or exclude multiallelic (MNP) markers
* Apply filters using VCFtools

This workflow requires:

* Diploid bi-parental F1 population
* Single-end reads
* A reference genome
* Genomic positions for marker ordering
* Selection of a single chromosome from a reference genome to build the linkage maps



61 changes: 61 additions & 0 deletions pipelines/EmpiricalReads2Map/EmpiricalReads2Map.wdl
@@ -0,0 +1,61 @@
version 1.0

import "../../structs/empirical_maps_structs.wdl"
import "../../structs/dna_seq_structs.wdl"

import "../../pipelines/EmpiricalSNPCalling/EmpiricalSNPCalling.wdl" as snpcalling
import "../../pipelines/EmpiricalMaps/EmpiricalMaps.wdl" as maps

workflow EmpiricalReads {

    input {
        File samples_info
        ReferenceFasta references
        Dataset dataset
        Int max_cores
        Int chunk_size
        Boolean rm_dupli = true
        Boolean gatk_mchap = false
        Boolean hardfilters = true
        Boolean replaceAD = true
        Boolean run_gatk = true
        Boolean run_freebayes = true
        Int ploidy = 2
        Int n_chrom
        String? filters
    }

    call snpcalling.SNPCalling {
        input:
            samples_info = samples_info,
            references = references,
            max_cores = max_cores,
            rm_dupli = rm_dupli,
            P1 = dataset.parent1,
            P2 = dataset.parent2,
            gatk_mchap = gatk_mchap,
            hardfilters = hardfilters,
            replaceAD = replaceAD,
            run_gatk = run_gatk,
            run_freebayes = run_freebayes,
            ploidy = ploidy,
            n_chrom = n_chrom
    }

    call maps.Maps {
        input:
            dataset = dataset,
            gatk_vcf_multi = SNPCalling.gatk_multi_vcf,
            gatk_mchap = gatk_mchap,
            gatk_vcf = SNPCalling.gatk_vcf,
            freebayes_vcf = SNPCalling.freebayes_vcf,
            gatk_vcf_bam_counts = SNPCalling.gatk_vcf_bam_count,
            freebayes_vcf_bam_counts = SNPCalling.freebayes_vcf_bam_count,
            filters = filters,
            max_cores = max_cores
    }

    output {
        File EmpiricalReads_results = Maps.EmpiricalReads_results
    }
}
File renamed without changes.
4 changes: 2 additions & 2 deletions tasks/utils.wdl
@@ -100,8 +100,8 @@ task GenerateSampleNames { # TODO: probably a name like 'ReadSamplesNamesInVcf'
task ApplyRandomFilters {
input{
File gatk_vcf
File freebayes_vcf
File? gatk_vcf
File? freebayes_vcf
File? gatk_vcf_bam_counts
File? freebayes_vcf_bam_counts
String? filters
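
Since the VCF inputs and the `filters` string are now all optional here, the task's command has to tolerate unset values. A hedged sketch of the usual WDL 1.0 idioms for that — the vcftools flags, file names, and task name below are illustrative, not the task's real command:

```wdl
version 1.0

task ApplyFiltersSketch {
    input {
        File? gatk_vcf
        String? filters   # e.g. "--max-missing 0.75 --minQ 30"
    }
    command <<<
        # ~{default="" ...} substitutes an empty string when the optional value is unset
        if [ -n "~{default="" gatk_vcf}" ]; then
            vcftools --vcf ~{default="" gatk_vcf} ~{default="" filters} --recode --stdout > gatk_filtered.vcf
        else
            touch gatk_filtered.vcf   # keep the declared output present even when GATK was skipped
        fi
    >>>
    output {
        File gatk_filtered = "gatk_filtered.vcf"
    }
    runtime {
        docker: "cristaniguti/split_markers:0.0.1"
    }
}
```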
