-
Notifications
You must be signed in to change notification settings - Fork 1
LepMap3 Details
The general workflow of LepWrap is:
Step | Module | Description | Directory |
---|---|---|---|
1 | ParentCall2 | creates LepMap3 data file | 1_ParentCall |
2 | Filtering2 | filter LM3 data | 2_Filtering |
3 | SeparateChromosomes2 | clusters markers into possible LGs | 3_SeparateChromosomes |
4 | JoinSingles2ALL | clusters remaining markers into defined linkage groups | project directory |
5 | OrderMarkers2 | orders markers into linkage groups | 4_OrderMarkers |
6 | Trim ends | removes potential error and artifacts from edge clusters | 5_Trim |
7 | OrderMarkers2 | reorders trimmed markers | 6_OrderMarkers |
8 | Calculate Distances/Intervals | final maps and intervals |
7_Distances* 7_Intervals
|
However, with Snakemake involved, there are a lot more rules that each do specific things. So, let's dive in to each rule and what it does:
Merges your pedigree file with your filtered VCF file to create a data.call.gz
file.
Filters the data from ParentCall2
with some kind of data tolerance value. This value is set in the config.yaml
, and if you set it to 0
, it will skip this step, instead creating data_f.call.gz
as a relative soft link to data.call.gz
.
Create maps over a range of LOD score values that you have specified in the config.yaml
file. By default, it's set to distortionLod=1
. Maps will be output into the 3_SeparateChromosomes
folder.
A summary file all.maps.summary
will be created combining the summaries of all the maps into a single file text file for easy assessment. The first column is the linkage group number, and the remaining columns correspond to the summary of markers per linkage group for each LOD limit that SeparateChromosomes2
was performed with.
Depending on how your data looks at this point, you may consider cutting off the pipeline and refining a particular map before continuing. For example, if you expect 23 chromosomes/LG's but your map only has 22 and it seems like LG-1 and LG-2 have a lot more markers than the other groups, then you might want to try splitting out those over-clustered LG's. Use /scripts/refine_map.sh
to iteratively test LOD scores to try splitting target LG's. You will have to play with the parameters a bit, such as setting a size limit to avoid accidentally creating too many LG's. If you perform this manual intervention, you will need to manually run JoinSingles2All
(optional) or just copy your best map from that into the main directory as map.master
.
You are prompted in the console for the map you would like to use for the remainder of the pipeline. The parameters for this rule can be modified in the config.yaml
(lodLimit=10
,lodDifference=2
). This map will be ran through joinSingles2ALL
with those parameters and output as map.master
in the project directory. A log of your choice will be in 3_SeparateChromosomes/chosen.map
.
Order your markers into positions among your linkage groups based on the map you chose. You will need to specify the the number of linkage groups and number of iterations in the config.yaml
. I would recommend 100 iterations or more, as my own work sometimes revealed that the 100th iteration had the best likelihood. Ordering 25,000 markers across 24 linkage groups 100x on 24 processors should take between 1 and 2 days. That's an oddly specific example, but that's the data it was tested on.
Pulls out all the likelihoods from every iteration of every linkage group and sorts them by group into likelihoods.summary
. Also runs an R script to provide some summary information on the recombination information into recombinations.summary
Using the likelihoods isolated from order_summary
, creates a file best.likelihoods
that lists the best iteration for each linkage group.
Uses LepMak3rQA.r
to trim the best ordered maps. The script will install dplyr
locally if it's not found on the system, and will examine the first and last X% of markers along a linkage group and remove marker clusters that are greater than NcM away from the next markers (both X and N need to be configured in the config.yaml
, but 15 and 10 are reasonable values). It's a mechanism to identify and remove clusters of ambiguously ordered markers. A filtered map for each linkage group will be created, along with a file containing the names of the markers removed, and a pdf with a visualization of the mapping and which markers were removed. Here is an example of one of those plots:
These are fairly simple plots, but they feature the name of the linkage group and iteration on the Y-axis (ordered.linkage.iteration), and the markers that were removed from the male and female maps appearing as red crossed-out circles.
Collates all the removed markers into a summary file trim.summary
Uses the trimmed linkage groups to reorder the markers. Uses the same configurations (#linkage groups, #iterations, distance method) as initial ordering.
Summarizes reordering the same way as in the initial ordering
Identifies the best candidates from reordering, just like it did with the initial ordering.
The final product of the pipeline is in 3 folders: distances
, distances_sexaverage
, and intervals
. The distances
folder contains a copy of the best likelihoods from reordering. The distances_sexaverage
folder contains the sex-averaged distances using OrderMarkers2 evaluateOrder=bestLG sexAveraged=1
where bestLG
is the best candidates per linkage group identified in the previous step. The intervals
folder contains the intervals for each marker in a linkage group using OrderMarkers2 calculateIntervals=
.