13. Loci Information

George Pacheco edited this page Aug 4, 2021 · 3 revisions

We calculated some statistics based on both SITES (Dataset I) & SNPs (Dataset II).

13.1. SITES

Here we calculate the number of scaffolds with at least one SITE reported:
zcat /groups/hologenomics/pacheco/data/Pigeons/PBGP/PBGP--Analyses/PBGP--ANGSDRuns/PBGP--GoodSamples_WithAllWGS-GBSPairs--Article--Ultra.mafs.gz | tail -n +2 | sort -u -k 1,1 | wc -l
PBGP--GoodSamples_WithAllWGS-GBSPairs--Article--Ultra: 298 scaffolds.
Calculates the SITES density (number of SITES per scaffold) from the .mafs file:
zcat /groups/hologenomics/pacheco/data/Pigeons/PBGP/PBGP--Analyses/PBGP--ANGSDRuns/PBGP--GoodSamples_WithAllWGS-GBSPairs--Article--Ultra.mafs.gz | tail -n +2 | cut -f1 | sort | uniq -c | awk '{print $2"\t"$1}' | sort -n -k 2,2 > ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--Miscellaneous/SNPInfo/PBGP--GoodSamples_WithAllWGS-GBSPairs--Article--Ultra.SITESDensity.txt
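The per-scaffold count can be illustrated on a toy table. This is only a minimal sketch: the scaffold names and positions are made up, and the counting is done in a single awk pass rather than with sort | uniq -c, which should give the same counts without sorting every SITE first.

```shell
# Toy stand-in for a decompressed .mafs table (header + scaffold/position columns).
# All scaffold names and positions are hypothetical.
mafs=$(mktemp)
printf 'chromo\tposition\n' > "$mafs"
printf 'scaffoldA\t10\nscaffoldA\t25\nscaffoldB\t7\nscaffoldA\t90\n' >> "$mafs"

# Count SITES per scaffold in one awk pass, then order scaffolds by their count.
density=$(tail -n +2 "$mafs" | awk '{n[$1]++} END {for (s in n) print s"\t"n[s]}' | sort -k 2,2n)

# Each output line is one scaffold, so the line count is the number of
# scaffolds with at least one SITE.
n_scaffolds=$(printf '%s\n' "$density" | wc -l | tr -d ' ')

printf '%s\n' "$density"
echo "Scaffolds with SITES: $n_scaffolds"
rm -f "$mafs"
```

On the toy input this lists scaffoldB with 1 SITE and scaffoldA with 3 SITES, and reports 2 scaffolds.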
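Expands the result above by adding the length of each scaffold from the reference index (.fai). Only scaffolds longer than 1,000 bp are kept, and scaffolds without any SITES receive a count of 0. The output columns are scaffold, length and number of SITES: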
awk 'BEGIN{OFS="\t"} NR==FNR{x[$1]=$2} NR!=FNR && $2>1000{if(!x[$1])x[$1]=0; print $1,$2,x[$1]}' ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--Miscellaneous/SNPInfo/PBGP--GoodSamples_WithAllWGS-GBSPairs--Article--Ultra.SITESDensity.txt ~/data/Pigeons/Reference/DanishTumbler_Dovetail_ReRun.fasta.fai | sort -n -k 2,2 > ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--Miscellaneous/SNPInfo/PBGP--GoodSamples_WithAllWGS-GBSPairs--Article--Ultra.ScaffoldInfo.txt
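The two-file awk idiom used above (NR==FNR is true only while the first file is being read) can be sketched on toy data. The file contents below are hypothetical, and x[$1]+0 is used here as a shorter way to turn a missing count into 0; it is not the exact expression from the command above.

```shell
counts=$(mktemp); fai=$(mktemp)
# scaffold <TAB> number of SITES (hypothetical values)
printf 'scaffoldA\t3\nscaffoldB\t1\n' > "$counts"
# .fai-like index: name <TAB> length <TAB> further columns (ignored here)
printf 'scaffoldA\t5000\t0\nscaffoldB\t800\t0\nscaffoldC\t2000\t0\n' > "$fai"

info=$(awk 'BEGIN {OFS="\t"}
  NR==FNR {x[$1]=$2; next}          # first file: remember SITES per scaffold
  $2>1000 {print $1, $2, x[$1]+0}   # second file: keep scaffolds > 1000 bp, missing counts become 0
' "$counts" "$fai" | sort -k 2,2n)

printf '%s\n' "$info"
rm -f "$counts" "$fai"
```

Here scaffoldB is dropped because it is shorter than 1,000 bp, and scaffoldC is kept with 0 SITES.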
Restricts to only those scaffolds with at least one SITE:
awk '{if ($3!=0) print;}' ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--Miscellaneous/SNPInfo/PBGP--GoodSamples_WithAllWGS-GBSPairs--Article--Ultra.ScaffoldInfo.txt > ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--Miscellaneous/SNPInfo/PBGP--GoodSamples_WithAllWGS-GBSPairs--Article--Ultra.ScaffoldInfo_OnlyWithSites.txt
These results were plotted using the Rscript below:

13.2. SNPs

Gets the average distance between SNPs, computed as the mean length of the SNP-free intervals returned by bedtools complement:
zcat ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--ANGSDRuns/PBGP--GoodSamples_WithWGSs_NoCrupestris_SNPCalling--Article--Ultra.mafs.gz | cut -f1,2 | tail -n +2 | awk '{print $1"\t"$2-1"\t"$2}' | bedtools merge -i - | bedtools complement -i - -g ~/data/Pigeons/Reference/SamToolsIndex/DanishTumbler_Dovetail_ReRun.Cut.fasta.fai | sort -k 1,1r -k 2,2nr | awk '{sum+=($3-$2)} END {print "Average SNP Distance: " sum/NR}'
Average SNP Distance: 17,227.4bp
Distances between SNPs, dropping the first and last SNP-free interval of each scaffold (usually the flanking regions). The commented-out tail either averages the distances or saves them to a file:
zcat ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--ANGSDRuns/PBGP--GoodSamples_WithWGSs_NoCrupestris_SNPCalling--Article--Ultra.mafs.gz | cut -f1,2 | tail -n +2 | awk '{print $1"\t"$2-1"\t"$2}' | bedtools merge -i - | bedtools complement -i - -g ~/data/Pigeons/Reference/SamToolsIndex/DanishTumbler_Dovetail_ReRun.Cut.fasta.fai | sort -k 1,1r -k 2,2nr | awk 'BEGIN{pre=""; safe=""}{if($1!=pre){safe=""}else{if(safe!=""){print safe}safe=$3-$2}pre=$1}' # | awk '{sum+=$1} END { print "Average = ",sum/NR}' # > ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--Miscellaneous/SNPInfo/PBGP--GoodSamples_WithWGSs_NoCrupestris_SNPCalling--Article--Ultra.SNPDistances.txt

Alternatively, the number of bases between consecutive SNPs on the same scaffold is computed directly from the .mafs file:
zcat ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--ANGSDRuns/PBGP--GoodSamples_WithWGSs_NoCrupestris_SNPCalling--Article--Ultra.mafs.gz | tail -n +2 | awk '$1 == pc{print $1,$2-pp-1} {pc=$1; pp=$2}' > ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--Miscellaneous/SNPInfo/PBGP--GoodSamples_WithWGSs_NoCrupestris_SNPCalling--Article--Ultra.SNPDistances.txt
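The consecutive-SNP logic can be checked on a small example. This is a sketch on made-up positions (sorted by scaffold and position, as in a .mafs file); the averaging step is an illustrative addition and assumes the mean of the per-pair distances is the statistic of interest.

```shell
# Hypothetical SNP positions: scaffold <TAB> position
snps=$(mktemp)
printf 'scaffoldA\t10\nscaffoldA\t25\nscaffoldA\t90\nscaffoldB\t7\nscaffoldB\t12\n' > "$snps"

# Bases strictly between consecutive SNPs on the same scaffold.
# The first SNP of each scaffold only updates pc/pp and prints nothing.
distances=$(awk '$1 == pc {print $1, $2-pp-1} {pc=$1; pp=$2}' "$snps")

# Mean distance, skipping empty lines and guarding against division by zero.
mean=$(printf '%s\n' "$distances" | awk 'NF {sum+=$2; n++} END {if (n) printf "%.2f", sum/n}')

printf '%s\n' "$distances"
echo "Average SNP distance: $mean"
rm -f "$snps"
```

The toy input yields distances of 14 and 64 on scaffoldA and 4 on scaffoldB, for a mean of 27.33; no distance is reported across the scaffold boundary.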
Average distance between CUTSITES:
cat ~/data/Pigeons/Reference/PBGP_FinalRun.EcoT22I.bed | awk '$1 == pc{print $1,$2-pp} {pc=$1; pp=$3}' > ~/data/Pigeons/Reference/PBGP_FinalRun.EcoT22I--Article--Ultra.CutSiteDistances.txt

wc -l ~/data/Pigeons/Reference/PBGP_FinalRun.EcoT22I--Article--Ultra.CutSiteDistances.txt
Number of LOCI: 386,630
awk '{sum+=($2)} END {print "Average: " sum/NR}' ~/data/Pigeons/Reference/PBGP_FinalRun.EcoT22I--Article--Ultra.CutSiteDistances.txt
Average: 2,785.51
awk '$2 > 500' ~/data/Pigeons/Reference/PBGP_FinalRun.EcoT22I--Article--Ultra.CutSiteDistances.txt | wc -l
Number of LOCI longer than 500 bp: 311,430
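The three CUTSITES statistics (number of LOCI, average length, and number longer than 500 bp) can also be obtained in a single pass over the BED file. The sketch below uses a made-up BED file with hypothetical coordinates and assumes, as in the commands above, that a LOCUS is the interval from the end of one cut site to the start of the next on the same scaffold.

```shell
# Hypothetical cut-site BED: scaffold <TAB> start <TAB> end
bed=$(mktemp)
printf 'scaffoldA\t100\t106\nscaffoldA\t900\t906\nscaffoldA\t1200\t1206\n' > "$bed"
printf 'scaffoldB\t50\t56\nscaffoldB\t3056\t3062\n' >> "$bed"

summary=$(awk '
  $1 == pc {d = $2 - pp; n++; sum += d; if (d > 500) big++}   # gap to the previous cut site
  {pc = $1; pp = $3}                                          # remember scaffold and end
  END {if (n) printf "loci=%d mean=%.1f longer_than_500=%d", n, sum/n, big}
' "$bed")

echo "$summary"
rm -f "$bed"
```

For the toy file the gaps are 794, 294 and 3000 bp, so the summary reads loci=3 mean=1362.7 longer_than_500=2.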
Sums the coverage of each LOCUS across all GBS samples (rows containing "WGS" or "Blank" are excluded):
grep -v "WGS" Loci_Merged.coverage.tsv | grep -v "Blank" | tail -n +2 | cut -f 2- | awk '{for(i=1; i<=NF; i++)x[i]+=$i} END{for(i in x)print x[i]}' > ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--CoverageHeatMap/Loci_Merged.coverage.cutsitesmath

awk '$1==0{cnt++} END{print cnt}' ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--CoverageHeatMap/Loci_Merged.coverage.cutsitesmath
Number of LOCI with no data for ALL: 288,319
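The column-sum step can be sketched on a toy coverage table. The sample and locus names are hypothetical, and the layout (one row per sample, one column per LOCUS, header in the first row) is an assumption inferred from the commands above. Unlike for(i in x), whose iteration order is unspecified in awk, the indexed loop below prints the sums in column order.

```shell
# Hypothetical coverage table: one row per sample, one column per LOCUS
cov=$(mktemp)
printf 'Sample\tlocus1\tlocus2\tlocus3\n' > "$cov"
printf 'GBS_01\t4\t0\t0\nGBS_02\t2\t0\t1\nWGS_01\t9\t5\t7\nBlank_01\t0\t0\t0\n' >> "$cov"

# Drop WGS and Blank rows, drop the header and the sample column,
# then sum each remaining column and print the sums in column order.
sums=$(grep -v -e "WGS" -e "Blank" "$cov" | tail -n +2 | cut -f 2- |
  awk '{for (i = 1; i <= NF; i++) x[i] += $i; if (NF > n) n = NF}
       END {for (i = 1; i <= n; i++) print x[i] + 0}')

# LOCI whose summed coverage is zero, i.e. no data in any retained sample.
empty=$(printf '%s\n' "$sums" | awk '$1 == 0 {cnt++} END {print cnt + 0}')

printf '%s\n' "$sums"
echo "LOCI with no data: $empty"
rm -f "$cov"
```

In this example only locus2 has no data in the GBS samples, so the count is 1; printing cnt + 0 also keeps the output numeric when no LOCUS is empty.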