Skip to content

Latest commit

 

History

History
89 lines (63 loc) · 7.01 KB

7.diversity_divergece_admixture.md

File metadata and controls

89 lines (63 loc) · 7.01 KB

Genetic diversity and divergence

In this section, we will calculate genetic divergence between species and diversity within species. Our pipeline relies on scripts from the Genomics General toolkit by Simon Martin. Genomics General was adequate for the type of data we were working with because it accepts (among other formats) fasta alignments as input. To better take advantage of the toolkit, we provided whole genome chromosome alignments with multiple individuals, to preserve genomic coordinates (important, for example, for genome scans). However, we applied a missing data filter which resulted in an alignment roughly the size of our exome capture. In the next sections, I added "(from GG)" to scripts from the Genomics General toolkit to distinguish them from my own scripts.

Generating the Genomics General input file

The first step was to convert individual fasta alignments. I did this with a custom script named create_GG_alignments.sh that loops through the exome target chromosomes. This script will create a whole chromosome multi individual fasta alignment with all the samples in samples.txt and the European reference. Use seqToGeno.py (from GG) to create a geno file and filterGenotypes.py (from GG) to filter sites that contain missing data for more than 30% of the individuals.

create_GG_alignments.sh requires:

  1. Whole chromosome consensus fasta sequences for each individual generated previously;
  2. faidx;
  3. seqToGeno.py, from Genomics General;
  4. filterGenotypes.py, from Genomics General.
  5. samples.txt which lists the sample IDs to use.

Run it for a group of chromosomes in parallel.

parallel -j 10 'create_GG_alignments.sh {1}' :::: target_chromosomes.txt

After I generated the geno files per chromosome, I concatenated them into a single whole "genome" alignment. This works well for this dataset because we don't actually have information for entire chromosomes and the resulting whole genome geno is relatively small. (Note: I just noticed there is a specific Genomics General script to merge geno files mergeGeno.py).

for i in $(ls chr_*/*_filtered.geno); do tail -n+2 $i; done > wg_filtered.geno
# Get the header from one of your files.
head -n1 chr_X/chr_X_filtered.geno > head.txt
# Concatenate the header file and the geno file.
cat head.txt wg_filtered.geno > wg_filtered_wHead.geno

Genetic divergence (dxy) among species and nucleotide diversity (pi) within species

Using popgenWindows.py, I calculated genetic divergence between species and nucleotide diversity within species assigning individuals to species in non overlapping 1 Megabase windows and then averaged values across windows to obtain a single estimate per species.

/usr/bin/python ~/my_programs/genomics_general/popgenWindows.py -p Lam LAM_A0604_WA,LAM_A0965_WA,LAM_Allab3_Bo,LAM_NBerg67_Rockies -p Lcalif LCF_CAL_1954,LCF_TUC_2060 -p Lcallotis LCL_1956 -p Lcapensis LCP_SAF_1903,LCP_TAN_3023 -p Lcastroviejoi LCR_ITA_1957,LCR_ITA_1958 -p Lcorsicanus LCS_1891,LCS_1894 -p Leuropaeus LER_PYR_1546,LER_VIE_1639 -p Lfagani LFG_FAG114,LFG_FAG96 -p Lgranatensis LGR_CRE_2553,LGR_SEV_1163 -p Lhabessinicus LHB_HAB35,LHB_HAB68 -p Lmand LMS_PRI_2460,LMS_PRI_2461 -p Lothus LOT_2109,LOT_2110 -p Lstarki LST_STA65,LST_STA89 -p Ltownsendii LTW_JRRK_3,LTW_MTA_3280 -p Ltimidus LTM_AFR_3108,LTM_CAT_2012,LTM_MAG_1862 --windType coordinate -w 1000000 -s 1000000 -m 100 -g wg_filtered_wHead.geno.gz -o wg_filtered_wHead.dxy -f diplo -T 10

Use flag --analysis popDist to restrict results to just nucleotide diversities.

python ~/my_programs/genomics_general/popgenWindows.py -p Lam LAM_A0604_WA,LAM_A0965_WA,LAM_Allab3_Bo,LAM_NBerg67_Rockies -p Lcalif LCF_CAL_1954,LCF_TUC_2060 -p Lcallotis LCL_1956 -p Lcapensis LCP_SAF_1903,LCP_TAN_3023 -p Lcastroviejoi LCR_ITA_1957,LCR_ITA_1958 -p Lcorsicanus LCS_1891,LCS_1894 -p Leuropaeus LER_PYR_1546,LER_VIE_1639 -p Lfagani LFG_FAG114,LFG_FAG96 -p Lgranatensis LGR_CRE_2553,LGR_SEV_1163 -p Lhabessinicus LHB_HAB35,LHB_HAB68 -p Lmand LMS_PRI_2460,LMS_PRI_2461 -p Lothus LOT_2109,LOT_2110 -p Lstarki LST_STA65,LST_STA89 -p Ltownsendii LTW_JRRK_3,LTW_MTA_3280 -p Ltimidus LTM_AFR_3108,LTM_CAT_2012,LTM_MAG_1862 --windType coordinate -w 1000000 -m 100 -g wg_filtered_wHead.geno.gz -o wg_filtered_wHead.pi -f diplo --analysis popDist -T 10

Heterozygosity and shared het sites across species.

The consensus fasta sequences and geno file contain heterozygous sites coded as IUPAC codes. Therefore, it is really simple to identify heterozygous positions using the IUPAC codes. For each individual in the dataset, I wanted to obtain the absolute number of heterozygous positions and how many of those are shared with individuals of other species. To obtain these values, I wrote two scripts.

calculate_hetz_from_geno.py

First, I wrote calculate_hetz_from_geno.py to output the absolute number of heterozygous sites per individual in the geno file. It will also output the total length of the alignment and the number of missing positions.

This script will use python packages pandas, collections and decimal.

calculate_hetz_from_geno.py wg_filtered_wHead.geno het_and_missing_data_info.txt > het_and_missing_data_info.out

count_hetz_singletons_by_species.py

Second, I wrote count_hetz_singletons_by_species.py to count the number of heterozygous positions that each individual in the dataset shares with members of other species. We assign individuals to species using a species_map.txt. The format of this file is as follows:

Column1 Column2
Species1 SampleID1
Species1 SampleID2
Species2 SampleID1
Species2 SampleID2
etc...

Run this script on a geno file that only has SNPs and by chromosome. Unfortunately, count_hetz_singletons_by_species.py is not very efficient, so this will speed the calculations quite a bit.

Use filterGenotypes.py to retain polymorphic positions, using --minAlleles 2.

python ~/my_programs/genomics_general/filterGenotypes.py -i {1}/{1}_filtered.geno -o {1}/{1}_SNPs.geno -t 4 -if diplo -of diplo --minAlleles 2' ::: $(ls -d chr_*)

And then run count_hetz_singletons_by_species.py by chromosome. This script will use python packages pandas, collections and decimal. The output will have five columns:

  • column 1 = no. of heterozygous positions private to that individual.
  • column 2 = no. of heterozygous positions shared between individual and members of other species.
  • column 3 = column 1 / column 5
  • column 4 = column 2 /column 5
  • column 5 = total no. heterozygous sites (column 1 + column 2)
parallel -j 10 'count_hetz_singletons_by_species.py {1}/{1}_SNPs.geno species_map.txt {1}/{1}_singletons_by_species.results' ::: $(ls -d chr_*)

Go back to main page