Skip to content

9. Filtering of Possible Paralogs

George Pacheco edited this page Aug 4, 2021 · 3 revisions

Given the difficulties to remove possible paralogs from GBSed data, we decided to take advantage of our Re-Seqed data for doing so. Thus, we first run ANGSD--v0.929 on the complete Re-Reqed data in order to identify loci that would have a coverage considerably higher than average and then exclude these loci from the downstream analyses.

Gets list of samples (ALL ReSeq SAMPLES - 52 SAMPLES):
find ~/data/Pigeons/Analysis/PaleoMix_Re-Sequencing/*.bam > ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--AllRe-SeqedSamples--Article--Ultra.list
Runs ANGSD in order to get the Global Depth per SITE:
xsbatch -c 64 --mem-per-cpu 7800 -J ppAllRe-Seqed --time 10-00:00 --force -- $SCRIPTS/scripts/wrapper_angsd.sh -debug 2 -nThreads 64 -ref ~/data/Pigeons/Reference/DanishTumbler_Dovetail_ReRun.fasta -bam ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--GoodRe-SeqedSamples--Article.list -sites ~/data/Pigeons/Reference/PBGP_FinalRun.EcoT22I_Extended_Merged.pos -rf ~/data/Pigeons/Reference/DanishTumbler_Dovetail_ReRun_ChrGreater1kb.id -remove_bads 1 -uniqueOnly 1 -baq 1 -C 50 -minMapQ 30 -minQ 20 -doQsDist 1 -doCounts 1 -doDepth 1 -dumpCounts 2 -maxDepth $((50*1000)) -out ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--ANGSDRuns/PossibleParalogTest/PBGP--OnlyRe-Seqed_RemovedBadLoci_100Ind_PossibleParalogs--Article--Ultra.depth
Creates a .mean file containing the Mean Global Depth of each LOCI:
zcat ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--ANGSDRuns/PossibleParalogTest/PBGP--OnlyRe-Seqed_RemovedBadLoci_100Ind_PossibleParalogs--Article--Ultra.depth.pos.gz | awk 'NR>1 {print $1"\t"$2-1"\t"$2"\t"$3}' | bedtools intersect -a - -b ~/data/Pigeons/Reference/PBGP_FinalRun.EcoT22I_Extended_Merged.bed -wb | bedtools groupby -g 8 -c 4 -o mean > ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--ANGSDRuns/PossibleParalogTest/PBGP--OnlyRe-Seqed_100Ind_PossibleParalogs_IntersectedWithMerged--Article--Ultra.mean
Fourth, we run the line below to create lists containing the POSSIBLE PARALOG LOCI to be eliminated following different cutoffs:
cat ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--ANGSDRuns/PossibleParalogTest/PBGP--OnlyRe-Seqed_100Ind_PossibleParalogs_IntersectedWithMerged--Article--Ultra.mean | awk '$2>800 {print $1}' > ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--ANGSDRuns/PossibleParalogTest/PBGP--PossibleParalogLociToBeEliminated-g800--Article--Ultra.list
Extracts from the "FPGP_FinalRun.EcoT22I_Extended_Merged_RemovedBadLoci.bed" file those POSSIBLE PARALOG LOCI, creating an updated .bed file:
fgrep -v -f ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--ANGSDRuns/PossibleParalogTest/PBGP--PossibleParalogLociToBeEliminated-g800--Article--Ultra.list ~/data/Pigeons/Reference/PBGP_FinalRun.EcoT22I_Extended_Merged.bed > ~/data/Pigeons/Reference/PBGP_FinalRun.EcoT22I_Extended_Merged_RemovedPossibleParalogs-g800--Article--Ultra.bed
Creates a .pos file based on the .bed file above:
awk '{print $1"\t"($2+1)"\t"$3}' ~/data/Pigeons/Reference/PBGP_FinalRun.EcoT22I_Extended_Merged_RemovedPossibleParalogs-g800--Article--Ultra.bed > ~/data/Pigeons/Reference/PBGP_FinalRun.EcoT22I_Extended_Merged_RemovedPossibleParalogs-g800--Article--Ultra.pos
Indexes the just created .pos file usings ANGSD:
angsd sites index ~/data/Pigeons/Reference/PBGP_FinalRun.EcoT22I_Extended_Merged_RemovedPossibleParalogs-g800--Article--Ultra.pos
Calculates what is the percentage of the Pigeon Genome covered by this .pos file:
cat ~/data/Pigeons/Reference/PBGP_FinalRun.EcoT22I_Extended_Merged_RemovedPossibleParalogs-g800--Article--Ultra.pos | awk '{sum+=($3-$2)*100/1111661097} END {print sum"%"}'
PBGP_FinalRun.EcoT22I_Extended_Merged_RemovedPossibleParalogs-g800: 5.87555%

Clone this wiki locally