-
Notifications
You must be signed in to change notification settings - Fork 1
LepMap3 Details
The general workflow of LepWrap is:
Step | Module | Description | Directory |
---|---|---|---|
1 | ParentCall2 | creates LepMap3 data file | 1_ParentCall |
2 | Filtering2 | filter LM3 data | 2_Filtering |
3 | SeparateChromosomes2 | clusters markers into possible LGs | 3_SeparateChromosomes |
4 | JoinSingles2ALL | clusters remaining markers into defined linkage groups | project directory |
5 | OrderMarkers2 | orders markers into linkage groups | 4_OrderMarkers |
6 | Trim ends | removes potential error and artifacts from edge clusters | 5_Trim |
7 | OrderMarkers2 | reorders trimmed markers | 6_OrderMarkers |
8 | Calculate Distances/Intervals | final maps and intervals |
7_Distances* 7_Intervals
|
However, with Snakemake involved, there are a lot more rules that each do specific things. So, let's dive in to each rule and what it does:
Merges your pedigree file with your filtered VCF file to create a data.call.gz
file.
Filters the data from ParentCall2
with some kind of data tolerance value. This value is set in the config.yaml
, and if you set it to 0
, it will skip this step, instead creating data_f.call.gz
as a relative soft link to data.call.gz
.
Create maps over a range of LOD score values that you have specified in the config.yaml
file. By default, it's set to distortionLod=1
. Maps will be output into the 3_SeparateChromosomes
folder.
A summary file all.maps.summary
will be created combining the summaries of all the maps into a single file text file for easy assessment. The first column is the linkage group number, and the remaining columns correspond to the summary of markers per linkage group for each LOD limit that SeparateChromosomes2
was performed with.
Depending on how your data looks at this point, you may consider cutting off the pipeline and refining a particular map before continuing. For example, if you expect 23 chromosomes/LG's but your map only has 22 and it seems like LG-1 and LG-2 have a lot more markers than the other groups, then you might want to try splitting out those over-clustered LG's. Use /scripts/refine_map.sh
to iteratively test LOD scores to try splitting target LG's. You will have to play with the parameters a bit, such as setting a size limit to avoid accidentally creating too many LG's. If you perform this manual intervention, you will need to manually run JoinSingles2All
(optional) or just copy your best map from that into the main directory as map.master
.
You are prompted in the console for the map you would like to use for the remainder of the pipeline. The parameters for this rule can be modified in the config.yaml
(lodLimit=10
,lodDifference=2
). If not disabled in the config.yaml
, this map will be ran through joinSingles2ALL
with those parameters and output as map.master
in the project directory. A log of your choice will be in 3_SeparateChromosomes/chosen.map
.
Order your markers into positions among your linkage groups based on the map you chose. You will need to specify the the number of linkage groups and number of iterations in the config.yaml
. I would recommend 100 iterations or more, as my own work sometimes revealed that the 100th iteration had the best likelihood. Ordering 25,000 markers across 24 linkage groups 100x on 24 processors should take between 1 and 2 days. That's an oddly specific example, but that's the data it was tested on. Another thing to note is that your linkage groups may be processed at different rates due to the way the rules are configured and how Snakemake dispatches tasks. It's likely the pipeline will process some linkage groups to completion before others, so don't expect LepWrap to wait for all linkage groups to finish initial ordering before moving on to further steps. This is a feature (not a bug) for Snakemake to always try to use available cores to run something when it can.
Runs an R script to provide some summary information on the recombination information into recombinations.summary
Uses LepWrapTrim.r
to trim the ordered maps. It's a mechanism to identify and remove clusters of ambiguously ordered markers. A filtered map for each linkage group will be created, where the removed markers are commented out, along with a file containing the names of the markers removed and a pdf visualization of which markers were removed. These aren't actual Marey maps (physical position vs. genetic position) because the X axis is marker number sorted by genetic position, so I've come to refer to them as "Sequential Plots".
This step examines the first and last X% of markers along a linkage group and remove marker clusters that are greater than N%cM away from the next markers (both X and N need to be configured in the config.yaml
, but 15 and 10 are reasonable values). The centiMorgan distance is relative to the length of that linkage group for that sex, meaning a 10% threshold (N = 10) will have a separate threshold for the female map length(female) x 0.1
and male map length(male) x 0.1
. The percent of markers will look at that many markers from each side, meaning X = 10 will look at the first 10% and last 10% of markers for each sex for each linkage group. Here is an example of one of the output plots:
Collates all the removed markers into a summary files trim.summary
(counts per lg) and trim.details
(which markers per lg).
Uses the trimmed linkage groups to reorder the markers. Uses the same configurations (#linkage groups, #iterations) as initial ordering.
Summarizes reordering the same way as in the initial ordering
The final product of the pipeline is in 3 folders: distances
, distances_sexaverage
, and intervals
. The distances
folder contains a copy of the best likelihoods from reordering. The distances_sexaverage
folder contains the sex-averaged distances using OrderMarkers2 evaluateOrder=LG sexAveraged=1
where LG
is each linkage group from the previous step. The intervals
folder contains the intervals for each marker in a linkage group using OrderMarkers2 calculateIntervals=
.