
8. Running Stats & Filtering of Bad Samples

George Pacheco edited this page Aug 4, 2021 · 7 revisions

We used the outputs of PaleoMix--v1.2.5 to create a summary file with the mapping statistics of each sample. We also used scripts to produce heatmap plots that help identify bad SAMPLES, and to create auxiliary files based on these plots.

Gets statistics and creates an absence/presence heatmap:
xsbatch -c 30 --mem-per-cpu 13000 -J HeatMap --time 5-00 -- "$SCRIPTS/scripts/paleomix_summary2tsv.sh -t 30 -n 10 -k 300 -i ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--AllSamples--Article.labels ~/data/Pigeons/Analysis/PaleoMix_Re-Sequencing/ ~/data/Pigeons/Analysis/PaleoMix_GBS/ > ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--CoverageHeatMap/Stats_PBGP--Article--Ultra.txt"
These results were plotted using the Rscript below:
Gets cutsites information:
grep -v "WGS" ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--CoverageHeatMap/Loci_Merged.coverage.tsv | grep -v "Blank" | tail -n +2 | cut -f 2- | awk '{for(i=1; i<=NF; i++)x[i]+=$i} END{for(i in x)print x[i]}' > ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--CoverageHeatMap/Loci_Merged.coverage.cutsitesmath
awk '$1==0{cnt++} END{print cnt+0}' ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--CoverageHeatMap/Loci_Merged.coverage.cutsitesmath
Number of LOCI with No data for ALL: 288,319
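The logic of the two commands above can be checked on a small table. The sketch below is only an illustration: the table contents and sample names are made up, and it assumes a tab-separated layout with a header row and the sample name in the first column, as the `tail -n +2` and `cut -f 2-` steps suggest. Unlike the original command, it prints the column sums in column order, which does not change the final count.

```shell
# Toy coverage table (hypothetical data): header row, sample name in column 1,
# one coverage value per locus in the remaining columns.
toy=$(mktemp)
printf 'Sample\tLocus1\tLocus2\tLocus3\n'  > "$toy"
printf 'GBS_A\t3\t0\t0\n'                 >> "$toy"
printf 'GBS_B\t1\t0\t2\n'                 >> "$toy"
printf 'WGS_C\t9\t9\t9\n'                 >> "$toy"
printf 'Blank_1\t0\t0\t0\n'               >> "$toy"

# Drop WGS and Blank rows, skip the header, drop the sample column,
# then sum each remaining column and print the sums in column order.
sums=$(grep -v "WGS" "$toy" | grep -v "Blank" | tail -n +2 | cut -f 2- |
  awk '{for (i = 1; i <= NF; i++) x[i] += $i; n = NF}
       END {for (i = 1; i <= n; i++) print x[i]}')

# Count the loci whose summed coverage is zero.
# cnt + 0 makes awk print 0 instead of an empty line when nothing matches.
zero_loci=$(printf '%s\n' "$sums" | awk '$1 == 0 {cnt++} END {print cnt + 0}')
echo "$zero_loci"

rm -f "$toy"
```

In this toy table only Locus2 has no data in the retained samples, so the count is 1.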
We manually created a list of the SAMPLES to be excluded (6 BAD GBS SAMPLES and 2 BLANKS, highlighted on the Coverage HeatMap):
~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--BadSamples--Article.list
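This page does not show how the exclusion list is applied downstream. One possible way to use such a list, sketched here with made-up sample names and assuming both files hold one sample name per line, is to remove exact matches with `grep`:

```shell
# Hypothetical sample lists, one name per line (not the real PBGP files).
all=$(mktemp)
bad=$(mktemp)
printf 'Sample_01\nSample_02\nSample_03\nBlank_01\n' > "$all"
printf 'Sample_02\nBlank_01\n' > "$bad"

# -f reads the patterns from the bad-samples list, -F treats them as fixed
# strings, -x requires a whole-line match, and -v keeps the non-matching lines.
kept=$(grep -v -x -F -f "$bad" "$all")
echo "$kept"

rm -f "$all" "$bad"
```

The `-x` flag matters when one sample name is a prefix of another: without it, a bad sample named `Sample_1` would also remove `Sample_10`.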
Creates an ID file containing only scaffolds longer than 1Kb:
awk '$2 > 1000 {print $1":"}' ~/data/Pigeons/Reference/DanishTumbler_Dovetail_ReRun.fasta.fai > ~/data/Pigeons/Reference/DanishTumbler_Dovetail_ReRun_ChrGreater1kb.id
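To see what this filter keeps, it can be run on a small index. The sketch below uses a made-up `.fai` file; in the `.fai` format the first column is the sequence name and the second is its length in bases. The trailing colon is kept exactly as in the command above. It presumably lets downstream tools read each line as a whole-scaffold region, but that purpose is an assumption and is not stated on this page.

```shell
# Hypothetical .fai index: name, length, offset, bases per line, bytes per line.
fai=$(mktemp)
printf 'scaffold_1\t5000\t12\t60\t61\n'    > "$fai"
printf 'scaffold_2\t1000\t5100\t60\t61\n' >> "$fai"
printf 'scaffold_3\t250\t6200\t60\t61\n'  >> "$fai"

# Keep only scaffolds strictly longer than 1000 bases and append ":" to each name.
ids=$(awk '$2 > 1000 {print $1":"}' "$fai")
echo "$ids"

rm -f "$fai"
```

Because the comparison is strict, a scaffold of exactly 1000 bases (`scaffold_2` here) is excluded and only `scaffold_1:` is printed.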
