Skip to content

Latest commit

 

History

History
executable file
·
1523 lines (1086 loc) · 45.2 KB

README.md

File metadata and controls

executable file
·
1523 lines (1086 loc) · 45.2 KB

gwas_scripts

GWAS codebook (Coleman et al, 2015, Briefings in Functional Genomics), version 1.0.0
Please address questions, comments and improvements to my google group
If you use the scripts and advice herein, please consider citing our paper, the full text of which is available on the publisher's website:
Quality control, imputation and analysis of genome-wide genotyping data from the Illumina HumanCoreExome microarray. Jonathan R. I. Coleman; Jack Euesden; Hamel Patel; Amos A. Folarin; Stephen Newhouse; Gerome Breen. Briefings in Functional Genomics 2016; doi:10.1093/bfgp/elv037

The scripts in this repo are referenced in the publication referenced above, which provides a straight-forward guide to the quality control, imputation and analysis of genome-wide genotype data. Scripts can be tested using the toy PLINK dataset kindly provided by Shaun Purcell on the PLINK 1.07 website: example.zip.

This pipeline is designed to provide a useful resource for using genome-wide data from low-coverage arrays and smaller projects. As projects grow larger and more complex, it may be valuable to consult software creators' websites to seek more sophisticated analysis methods. A brief list of these is provided at the end of this document - I will gladly consider suggested inclusions.

For the quality control, imputation and analysis of large scale genome-wide genotype data, it is highly recommended to look at Ricopili, the pipeline of the Psychiatric Genomics Consortium, which is currently being deposited in this repo and is documented here. All credit for Ricopili goes to its creators.

Within this protocol, the following software is used:

PLINK / PLINK2

R

EIGENSOFT

IMPUTE

The protocol runs in a UNIX environment, and makes use of some of the basic software of the UNIX operating system. It should run on a Mac, but not in Windows. An exception to this is the GCTA MLMA GWAS analyses described at the end of the protocol - such analyses are only implemented in the Linux version of GCTA. Most sections of this protocol are designed to be usable simply by pasting into the command line – variables are set when each command is run, and should be straight-forward to modify.

Procedure

Recalling and rare-variant calling

Not covered by this protocol, see this protocol, which presents best-practice for recalling the raw genotype data using Illumina GenomeStudio, and https://github.com/KHP-Informatics/chip_gt, which implements and compares the results of ZCall and Opticall. This Nature Protocols paper is also very good.

Reformat of data from the rare caller pipeline

The Human Core Exome array contains some SNPs called "SNP…" In order to make ZCall run effectively, it is necessary to change the name of these SNPs, e.g. to "xxx…" This can be done using the UNIX program sed

sed 's/SNP/xxx/g' < rootname.report > rootname.updated.report

Following the implementation of the rare caller pipeline, it is recommended to review the concordance between ZCall and Opticall − concordance is expected to be high (>99%).

Define names and locations of important files and software:
printf "root=/path/to/rootname
pheno=/path/to/external_pheno.phe
covar=/path/to/covariates.cov
genders=/path/to/external_genders.txt
names=/path/to/external_individual_names.txt
keeps=/path/to/samples_to_keep.txt 
excludes=/path/to/samples_to_exclude.txt
insnps=/path/to/SNPs_to_keep.txt
outsnps=/path/to/SNPs_to_exclude.txt
plink=/path/to/plink2
R=/path/to/R" > Config.conf 

File formats are the PLINK file formats.

"rootname" is the prefix of the PLINK binary files obtained from the Exome-chip pipeline (i.e. the .bed file from the ZCall branch has the name "rootname_filt_Zcall_UA.bed"), and "/path/to/" is the location of these files on the computer.

NB: not all of these files may be relevant to your study.

Review the PLINK binary (.bed, .bim, .fam) files from Exome-chip pipeline

Check individuals

less $root.fam

Check SNPs

less $root.bim
Update files

Phenotypes, individual names, genders, or SNP alleles may be lost in preparatory steps. These can be updated using external files.

Update phenotype

$plink \
--bfile $root \
--pheno $pheno \
--make-bed \
--out $root.updated_pheno

Update genders

$plink \
--bfile $root \
--update-sex $genders \
--make-bed \
--out $root.updated_genders

Update sample names

$plink \
--bfile $root \
--update-ids $names \
--make-bed \
--out $root.updated_names

Select individuals for analysis

$plink \
--bfile $root \
--keep $keeps \
--make-bed \
--out $root.kept_names

Or:

$plink \
--bfile $root \
--remove $excludes \
--make-bed \
--out $root.kept_names

Select SNPs for analysis

$plink \
--bfile $root \
--extract $insnps \
--make-bed \
--out $root.kept_samples

Or:

$plink \
--bfile $root \
--exclude $outsnps \
--make-bed \
--out $root.kept_samples
Filter for common SNPs
$plink \
--bfile $root \
--maf 0.01 \
--make-bed \
--out $root.common

This assumes no updates were made, otherwise modify the --bfile command to point to that file (e.g. $root.updated_names).

Filter for call rate iteratively
sh ./Iterative_Missingness.sh [begin] [final] [steps] 

Removes SNPs then samples at increasingly high cut-offs. E.g. To remove at 90% to 99%, in steps of 1%:

sh ./Iterative_Missingness.sh 90 99 1 
Review call rates to ensure all missing SNPs and individuals have been dropped

Generate files for individual call rates and variant vall rates.

$plink \
--bfile $root.filtered \
--missing \
--out $root.filtered_missing

Examine the lowest call rates for variants:

sort -k 5 -gr $root.filtered_missing.lmiss | head

Check no variants above threshold remain in column 5 (proportion missing).

Examine the lowest call rates for individuals:

sort -k 6 -gr $root.filtered_missing.imiss | head

Check no individuals above threshold remain in column 6 (proportion missing).

Assess SNPs for deviation from Hardy-Weinberg Equilibrium

--hardy calculates HWE test p-values:

$plink \
--bfile $root.filtered \
--hardy \
--out $root.hw_p_values

--hwe removes deviant SNPs past a given threshold, 1x10^-5 below:

$plink \
--bfile $root.filtered \
--hwe 0.00001 \
--make-bed \
--out  $root.hw_dropped

NB: in case-control datasets, the default behaviour of hwe is to work on controls only

Prune data file for linkage disequilibrium

Using a window of 1500 variants and a shift of 150 variants between windows, with an r2 cut-off of 0.2:

$plink \
--bfile $root.hw_dropped \
--indep-pairwise 1500 150 0.2 \
--out $root.LD_one

Extract pruned-in SNPs

$plink \
--bfile $root.hw_dropped \
--extract $root.LD_one.prune.in \
--make-bed \
--out $root.LD_two

Exclude high-LD and non-autosomal regions from the pruned file (see Mike Weale's website)

awk -f highLDregions4bim_b37.awk $root.LD_two.bim > highLDexcludes
awk '($1 < 1) || ($1 > 22) {print $2}' $root.LD_two.bim > autosomeexcludes
cat highLDexcludes autosomeexcludes > highLD_and_autosomal_excludes  
$plink \
--bfile $root.LD_two \
--exclude highLD_and_autosomal_excludes \
--make-bed \
--out $root.LD_three
Add phenotype to differentiate groups

E.g. Add site of collection ("Site") from an external pheotype file:

$plink \
--bfile $root.LD_three \
--pheno $pheno \
--pheno-name Site \
--make-bed \
--out  $root.LD_four
Compare genotypic and phenotypic gender

Ensure there is a separate XY region for the pseudoautosomal region on X:

Most chips have the pseudoautosomal region mapped separately already. Requires entry of genome build, below this is hg37 ("b37").

$plink \
--bfile $root.LD_two \
--split-x b37 \
--make-bed \
--out $root.LD_split

Compare phenotypic gender to X chromosome heterogeneity and Y chromosome SNP count:

$plink \
--bfile $root.LD_split \
--check-sex ycount 0.2 0.8 0 1 \
--out $root.sex_check

IDs identified as discordant (not the phenotypic gender) or for which F is between 0.2 and 0.8 (not assigned a gender by PLINK), should be reviewed with the collection site where possible. This command also takes into account the number of Y chromosome SNPs present, to counteract the unreliable nature of the F statistic in assigning female gender. The number of Y SNPs with calls in females can be set as part of ycount (above females have a maximum of 0, and males a maximum of 1), and will depend on the recalling method used and sample size. An additional check can be made by assessing whole-genome heterogeneity for all samples (see below) at this point – discordant gender may be the result of unusual heterogeneity

Remove discordant IDs that cannot be resolved:

This command assumes a PLINK-format file of IDs for discordant individuals called "discordant_individuals.txt".

$plink \
--bfile $root.LD_four \
--remove discordant_individuals.txt \
--make-bed \
--out $root.LD_five
	
$plink \
--bfile $root.hw_dropped \
--remove discordant_individuals.txt \
--make-bed \
--out $root.sexcheck_cleaned
Pairwise identical-by-descent (IBD) check
$plink \
--bfile $root.LD_five \
--genome \
--make-bed \
--out $root.IBD

Remove one sample from each pair with pi-hat (% IBD) above threshold (0.1875 below):

awk '$10 >= 0.1875 {print $1, $2}' $root.IBD.genome > $root.IBD_outliers.txt 
$plink \
--bfile $root.IBD \
--remove $root.IBD_outliers.txt \
--make-bed \
--out $root.no_close_relatives

Calculate average IBD per individual using R, output outliers (defined as more than sigma standard deviations above the mean, as provided by the user):

$R --file=IndividualIBD.R --args $root [sigma]

Exclude outliers from both LD-stripped and all SNP binary files

$plink \
--bfile $root.LD_five \
--remove $root.IBD_INDIV_outliers.txt \
--make-bed \
--out $root.LD_IBD

$plink \
--bfile $root.sexcheck_cleaned \
--remove $root.IBD_INDIV_outliers.txt \
--make-bed \
--out $root.IBD_cleaned
Population stratification by principal component analysis in EIGENSOFT

Consult [https://sites.google.com/site/mikeweale/software/eigensoftplus].

Run EIGENSOFT using LD-pruned binary

Convert files to EIGENSOFT format using CONVERTF

Requires par file to convert from packedped format to eigenstrat format

convertf -p <(printf "genotypename: "$root".LD_IBD.bed
snpname: "$root".LD_IBD.bim
indivname: "$root".LD_IBD.fam
outputformat: EIGENSTRAT
genotypeoutname: "$root".pop_strat.eigenstratgeno
snpoutname: "$root".pop_strat.snp
indivoutname: "$root".pop_strat.ind")

Run SmartPCA, removing no outliers

Produces 100 PCs

smartpca.perl \
-i $root.pop_strat.eigenstratgeno \
-a $root.pop_strat.snp \
-b $root.pop_strat.ind \
-o $root.pop_strat.pca \
-p $root.pop_strat.plot \
-e $root.pop_strat.eval \
-l $root.pop_strat_smartpca.log \
-m 0 \
-t 100 \
-k 100 \
-s 6

Note that the order of the inputs is important.

Inputs explained:

-i  is the genotype file
-a is the SNP names
-b is the individual names
-o is the output eigenvectors ( $root.pop_strat.pca.evec)
-p plots the output file. This is only activated if gnuplot is installed, but is a necessary inclusion for smartpca to run. If gnuplot is not installed, this does not affect the running of smartpca. If gnuplot is installed, this produces a plot of the first component on the second.
-e is the output eigenvalues
-l is the log, including a list of individuals defined as outliers. 
-m sets the number of outlier removal iterations. This is initially set to 0, so no outliers are removed.
-t sets the number of components from which outliers should be removed. If -m is 0, this value has no effect.
-k is the number of components to be output
-s defines the minimum number of standard deviations from the mean of each component an individual must be to be counted as an outlier.

Minor edit to allow import into R

Remove leading tab and split ID into two columns.

sed -i -e 's/^[ \t]*//' -e 's/:/ /g' $root.pop_strat.pca.evec

Calculate association between PCs and outcome measure in R

Both scripts require the same IDs to be in $root.pca.evec and $pheno, and look at 100 PCs by default.

Short version (outputs the variance explained by each component and its significance when added to a model including the previous components):

$R --file=PC-VS-OUTCOME_IN_R_SHORT.R --args $root.pop_strat $pheno

Long version (outputs the full results of the linear model, adding each component in turn):

$R --file= PC-VS-OUTCOME_IN_R_FULL.R --args $root.pop_strat $pheno

Run SmartPCA again to remove outliers

Run as above ("Run SmartPCA, removing no outliers"), but change $root suffix to pop_strat_outliers. Set –m 5 and –t x (where x is the number of PCs significantly associated with the outcome measure)

smartpca.perl \
-i $root.pop_strat.eigenstratgeno \
-a $root.pop_strat.snp \
-b $root.pop_strat.ind \
-o $root.pop_strat_outliers.pca \
-p $root.pop_strat_outliers.plot \
-e $root.pop_strat_outliers.eval \
-l $root.pop_strat_outliers_smartpca.log \
-m 5 \
-t x \
-k 100 \
-s 6

Plot principal components in R

Plot before and after outlier exclusion to allow visual inspection of which samples are dropped

Plot first component against second in R and colour by phenotype - this requires ggplot2 to be installed.

sed -i -e 's/^[ \t]*//' -e 's/:/ /g' $root.pop_strat_outliers.pca.evec
$R --file=PlotPCs.R --args $root.pop_strat 1 2

$R --file=PlotPCs.R --args $root.pop_strat_outliers 1 2

This script can be modified to plot any of the first 100 components against each other by changing 1 and 2 above. The design of the plot is extremely modifiable - see [http://docs.ggplot2.org/current/].

Extract outliers

sh ./ExtractAncestryOutliers.sh
$plink \
--bfile $root.LD_IBD \
--remove $root.pop_strat_outliers.outliers \
--make-bed \
--out $root.LD_pop_strat

$plink \
--bfile $root.IBD_cleaned \
--remove $root.pop_strat_outliers.outliers \
--make-bed \
--out $root.pop_strat

Re-run to assess which components to include as covariates in the final analysis

Run ConvertF:

convertf -p <(printf "genotypename: $root.LD_pop_strat.bed
snpname: $root.LD_pop_strat.bim
indivname: $root.LD_pop_strat.fam
outputformat: EIGENSTRAT
genotypeoutname: $root.PCS_for_covariates.eigenstratgeno
snpoutname: $root.PCS_for_covariates.snp
indivoutname: $root.PCS_for_covariates.ind")

Run SmartPCA:

smartpca.perl \
-i $root.PCS_for_covariates.eigenstratgeno \
-a $root.PCS_for_covariates.snp \
-b $root.PCS_for_covariates.ind \
-o $root.PCS_for_covariates.pca \
-p $root.PCS_for_covariates.plot \
-e $root.PCS_for_covariates.eval \
-l $root.PCS_for_covariates_smartpca.log \
-m 0 \
-t 100 \
-k 100 \
-s 6 \

Calculate association (short version):

$R --file=PC-VS-OUTCOME_IN_R_SHORT.R --args $root.PCS_for_covariates

Include components significantly associated with outcome as covariates in the final analysis, or add PCs in turn until inflation falls to an accepted level (lambda ≈ 1).

Optional (but useful): plot individuals on components drawn from the HapMap reference populations to assess likely ancestry groupings.

Details of this procedure can be found at Timothee Flutre's OpenWetWare.

Note that the http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/2010-08_phaseII+III/forward/ domain referenced by Dr Flutre has since been retired. The HapMap samples are available at ftp://ftp.ncbi.nlm.nih.gov/hapmap/genotypes/2010-08_phaseII+III/forward/.

Manually extract HapMap and own cohort individual names

sh ./MakeKeepIds.sh

Use keepids.txt at this section:

for pop in {CEU,CHB,JPT,YRI}; do echo ${pop}; \
hapmap2impute.py -i genotypes_CHR_${pop}_r28_nr.b36_fwd.txt.gz -n keepids.txt -o genotypes_hapmap_r28_b37_${pop}.impute.gz -b snps_hapmap_r28_nr_b37.bed.gz -s list_snps_redundant.txt; done
zcat genotypes_hapmap_r28_b37_CEU.impute.gz | wc -l
3907899
zcat genotypes_hapmap_r28_b37_CHB.impute.gz | wc -l
3933013
zcat genotypes_hapmap_r28_b37_JPT.impute.gz | wc -l
3931282
zcat genotypes_hapmap_r28_b37_YRI.impute.gz | wc -l
3862842

More populations now exist than those listed in Flutre’s script; these can be obtained in the same manner.

Alternative - Use 1000 Genomes Phase 1 data to achieve the same

Much the same process can be used to assess sample ethnicity by projecting on PCs from the 1000 Genomes samples.

From $root.IBD_cleaned:

Obtain 1KG Phase 1 data from PLINK2 website

WARNING: FILE > 1GB

wget https://www.dropbox.com/s/k9ptc4kep9hmvz5/1kg_phase1_all.tar.gz?dl=1

Note code below creates numerical phenotypes for the 1KG populations. CHANGE THESE IF THEY WILL OVERLAP WITH YOUR PHENOTYPE DATA!

Obtain 1KG Population info from 1KG

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped

grep -f <(awk '{print $2}' 1kg_phase1_all.fam) <(awk 'NR > 1 {print 0, $2, $7}' 20130606_g1k.ped) > 1KG_Phenos.txt

sed -i -e 's/ASW/3/g' -e 's/CEU/4/g' -e 's/CHB/5/g' -e 's/CHS/6/g' -e 's/CLM/7/g' -e 's/FIN/8/g' -e 's/GBR/10/g' -e 's/IBS/11/g' -e 's/JPT/12/g' -e 's/LWK/13/g' -e 's/MXL/14/g' -e 's/PUR/15/g' -e 's/TSI/16/g' -e 's/YRI/17/g' 1KG_Phenos.txt 

Provided your data has sufficient common variants (as with most microarrays), you can be fairly brutal with selecting variants for ancestry estimation. Ultimately, good estimation can be achieved with ~20K variants (see Price et al, 2006).

Limit files to SNPs with rs IDs

fgrep rs $root.IBD_cleaned.bim > $root.IBD_cleaned.rsids.txt

Get rs ID variant names

awk '{print $2}' $root.IBD_cleaned.rsids.txt > $root.IBD_cleaned.rsid_names.txt

Extract rs IDs from root

$plink \
--bfile $root.IBD_cleaned \
--extract $root.IBD_cleaned.rsid_names.txt \
--chr 1-22 \
--make-bed \
--out $root.IBD_cleaned.rsids.autosomal

Extract rs IDs from 1KG (and add phenotypes)

$plink \
--bfile 1kg_phase1_all \
--extract $root.IBD_cleaned.rsid_names.txt \
--pheno 1KG_Phenos.txt \
--make-bed \
--out 1kg_phase1_all.rsids.autosomal

Obtain SNPs present in both files

awk '{print $2}' 1kg_phase1_all.rsids.autosomal.bim > 1kg_phase1_all.rsids_names.txt

Extract 1KG SNPs from root

$plink \
--bfile $root.IBD_cleaned.rsids.autosomal \
--extract 1kg_phase1_all.rsids_names.txt \
--make-bed \
--out $root.IBD_cleaned.intersection

Dry run bmerge to identify SNPs PLINK will fail on

$plink \
--bfile $root.IBD_cleaned.intersection \
--bmerge 1kg_phase1_all.rsids.autosomal \
--merge-mode 6 \
--out $root.1KG.IBD_cleaned_failures

Add variants with multiple positions to missnp

fgrep \'rs $root.1KG.IBD_cleaned_failures.log |\
awk '{print $7}' |\
sed -e "s/'//g" -e "s/.//g" > $root.1KG.IBD_cleaned_failures.multiple.positions.txt

cat $root.1KG.IBD_cleaned_failures.missnp $root.1KG.IBD_cleaned_failures.multiple.positions.txt > $root.1KG.IBD_cleaned_failures.multiple.positions.missnp

Exclude mismatched SNPs and variants with multiple positions

$plink \
--bfile $root.IBD_cleaned.intersection \
--exclude $root.1KG.IBD_cleaned_failures.multiple.positions.missnp \
--make-bed \
--out $root.IBD_cleaned.intersection_for_merge

Merge root and 1KG

$plink \
--bfile $root.IBD_cleaned.intersection_for_merge \
--bmerge 1kg_phase1_all.rsids.autosomal \
--out $root.1kg.pop_strat

Filter missing variants, rare variants and HWE

$plink \
--bfile $root.1kg.pop_strat \
--geno 0.01 \
--maf 0.01 \
--hwe 0.0001 \
--make-bed \
--out $root.1kg.pop_strat.for_prune

LD Pruning

$plink \
--bfile $root.1kg.pop_strat.for_prune \
--indep-pairwise 1500 150 0.2 \
--out $root.1kg.pop_strat.prune
$plink \
--bfile $root.1kg.pop_strat.for_prune \
--extract $root.1kg.pop_strat.prune.prune.in \
--make-bed \
--out $root.1kg.LD_pop_strat

Run convertf to make EIGENSTRAT file

convertf -p <(printf "genotypename: $root.1kg.LD_pop_strat.bed
             snpname: $root.1kg.LD_pop_strat.bim
             indivname: $root.1kg.LD_pop_strat.fam
             outputformat: EIGENSTRAT
             genotypeoutname: $root.1kg.LD_pop_strat.eigenstratgeno
             snpoutname: $root.1kg.LD_pop_strat.snp
             indivoutname: $root.1kg.LD_pop_strat.ind")

Generate poplist for projection

awk '{print $3}' 1KG_Phenos.txt | sort | uniq > $root.1kg.LD_poplist.txt

Run Smartpca, projecting on 1KG samples only

smartpca.perl \
-i $root.1kg.LD_pop_strat.eigenstratgeno \
-a $root.1kg.LD_pop_strat.snp \
-b $root.1kg.LD_pop_strat.ind \
-o $root.1kg.LD_pop_strat.pca \
-p $root.1kg.LD_pop_strat.plot \
-e $root.1kg.LD_pop_strat.eigenvalues \
-l $root.1kg.LD_pop_strat.log \
-w $root.1kg.LD_poplist.txt \
-m 0

Note that the command below relabels the phenotype column as xCHANGE, where x is the phenotype, and then relabels the 1KG populations with their names for graphing. Modify the sed command to allow your samples to be labelled usefully!

Modify $root.1kg.LD_pop_strat.pca.evec for R

awk 'NR > 1 {print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12"CHANGE"}' $root.1kg.LD_pop_strat.pca.evec > $root.1kg.LD_pop_strat.pca.evec_RENAMED

sed -i -e 's/13CHANGE/LWK/g' -e 's/14CHANGE/MXL/g' -e 's/15CHANGE/PUR/g' -e 's/16CHANGE/TSI/g' -e 's/17CHANGE/YRI/g' -e 's/3CHANGE/ASW/g' -e 's/4CHANGE/CEU/g' -e 's/5CHANGE/CHB/g' -e 's/6CHANGE/CHS/g' -e 's/7CHANGE/CLM/g' -e 's/8CHANGE/FIN/g' -e 's/10CHANGE/GBR/g' -e 's/11CHANGE/IBS/g' -e 's/12CHANGE/JPT/g'  $root.1kg.LD_pop_strat.pca.evec_RENAMED

Plot PCs

RScript PC_Plot_1KG.R $root
Heterozygosity Test

Test for unusual patterns of genome-wide heterogeneity in LD-pruned data

$plink \
--bfile $root.LD_pop_strat \
--ibc \
--out $root.het

Exclude samples identified as outliers

R --file=IdHets.R --args $root.het
$plink \
--bfile $root.LD_pop_strat \
--remove $root.het.LD_het_outliers_sample_exclude \
--make-bed \
--out $root.LD_het_cleaned

$plink \
--bfile $root.pop_strat \
--remove $root.het.LD_het_outliers_sample_exclude \
--make-bed \
--out $root.het_cleaned
Imputation

THIS CODE SHOULD BE CONSIDERED ARCHIVAL - IT IS RECOMMENDED THAT YOU NOW PERFORM IMPUTATION BY SUBMITTING YOUR DATA TO THE TOPMED, MICHIGAN OR SANGER IMPUTATION SERVERS, WHICH ARE FASTER AND USE MORE UP-TO-DATE REFERENCE PANELS.

Consult [http://genome.sph.umich.edu/wiki/IMPUTE2:_1000_Genomes_Imputation_Cookbook] and [https://mathgen.stats.ox.ac.uk/impute/prephasing_and_imputation_with_impute2.tgz]

Download reference files from [http://mathgen.stats.ox.ac.uk/impute/impute_v2.html]

Copy impute2_examples folder from [https://mathgen.stats.ox.ac.uk/impute/prephasing_and_imputation_with_impute2.tgz] to work folder

Download relevant strand file from [http://www.well.ox.ac.uk/~wrayner/strand/] and split by chromosome

awk '{print $3, $5 > "$root."$2".strand"}' HumanCoreExome-12v1-0_B-b37.strand

Convert PLINK binary to GEN files (IMPUTE2 input)

$plink \
--bfile $root.het_cleaned \
--recode oxford \ 
--out $root.for_impute

Split whole--genome .gen into chromosome .gen

awk '{print > "Chr"$1".gen"}' $root.for_impute.gen

Check split has proceeded correctly – total line number of all chromosome .gen files should total $root.for_impute.gen

wc -l *.gen

Generate chunk files for each chromosome

sh ./MakeChunks.sh

This makes two sets of files:

Chunks_chr[1-23].txt

30000001 3.5e+07 875 
35000001 4e+07 500
40000001 4.5e+07 85
45000001 5e+07 424
50000001 5.5e+07 693

These files list the base positions of the edges of each chromosome chunk and the number of SNPs in each chunk. Consult this file and merge chunks with few SNPs (e.g. less than 100) with neighbouring chunks.

analysis_chunks_5Mb_chr[1-23].txt

30000001 3.5e+07
35000001 4e+07
40000001 5e+07
50000001 5.5e+07

These files are the input for IMPUTE2

Modify the submit_impute2_jobs_to_cluster.R script (from impute2_examples) to accept chunk files without headers

From:
  # read in file with chunk boundary definitions
  chunk.file <- paste(data.dir,"analysis_chunks_",chunk.size,"Mb_chr",chr,".txt", sep="")
  chunks <- read.table(chunk.file, head=T, as.is=T) 

To:

  # read in file with chunk boundary definitions
  chunk.file <- paste(data.dir,"analysis_chunks_",chunk.size,"Mb_chr",chr,".txt", sep="")
  chunks <- read.table(chunk.file, head=F, as.is=T). 

Modify scripts in impute2_examples folder (prototype_imputation_job_posterior_sampled_haps.sh master_imputation_script_posterior_sampled_haps.sh and submit_impute_jobs_to_cluster.R) to fit personal needs.

Likely to need to limit number of jobs submitted to remain within local SunGridEngine rules – liaise with local system administrator to establish local best practice. Also likely to need to amend script to allow 5Mb jobs to run - for example, to increase virtual memory allowance to 15Gb, add the following to the header of the prototype_imputation_job_posterior_sampled_haps.sh script:

#$ -l h_vmem=15G

Submit jobs

NB – this runs over 600 jobs on your cluster if not controlled!

sh ./master_imputation_script_posterior_sampled_haps.sh 

Adapt scripts for imputing X chromosome (running the different X map and legend files), and run

Consult [http://mathgen.stats.ox.ac.uk/impute/impute_v2.html]

Merge imputed chunks together (.impute2 and .impute2_info) to form a file for each chromosome

sh ./MergeImputedChunks.sh 

Add chromosome number to each SNP in each chromosome.impute2 file

sh ./AddChromosomeNumber.sh

Merge by-chromosome info files to form a file for the whole genome

cat results-directory/*.impute2_info > path/to/results-directory/$root.whole_genome.impute2_info

For December 2013 release of reference data (Phase1 Integrated), there are several aspects that require clean-up - these do not appear to apply to the Phase 3 release. Steps marked in bold are required for the Phase1 Integrated release, but may not be needed for Phase3.

Exomic variants are named "." It is necessary to make these unique (as chr:position)

sh ./ReplaceDots.sh

Filter imputed data (.impute2 files) by info metric

sh ./FilterByInfoAll.sh [threshold]

Merge filtered by-chromosome .impute2 files to make a single whole-genome file

cat results-directory/*_New_filtered.impute2 > \
/results-directory/$root.whole_genome_filtered.impute2

Remove duplicate SNPs from .impute2 file

awk '{print $2}' $root.whole_genome_filtered.impute2 | \
sort | uniq –c | awk '$1 !=1 {print $0}' > Duplicates

awk '{print $2}' $root.whole_genome_filtered.impute2 | sort | uniq -d > Duplicates_cleaned

These produce two files called Duplicates and Duplicates_cleaned that list the duplicated SNPs in the file with and without the number of instances respectively

grep -vwF -f Duplicates_cleaned $root.whole_genome_filtered.impute2 > Temp1

Removes all lines with an instance of a duplicated rs# from $root.whole_genome_filtered.impute2 and outputs to Temp1:

awk '{print $2}' Temp1 | sort | uniq –d > DuplicatesRemoved

Repeats the check for duplicates – this file should now be empty; check with

less DuplicatesRemoved 

Compare file lengths; the length of Temp1 should be the length of $root.whole_genome_filtered.impute2 minus the number of duplicated SNPs removed

wc -l Temp1 $root.whole_genome_filtered.impute2
mv Temp1 $root.whole_genome_filtered_cleaned.impute2

Convert IMPUTE2 to hard-called PLINK format

$plink \
--gen $root.whole_genome_filtered_cleaned.impute2 \
--sample $root.for_impute.sample \
--hard-call-threshold 0.8 \
--make-bed \
--out $root.post_imputation

NB: if SNP does not pass threshold, it is set as missing!

At this point, it is recommended to gzip all IMPUTE2 files - note that this will be a large job.

gzip *impute2* 
Post-imputation quality control

Remove rare SNPs depending on sample size and dataset characteristics

$plink \
--bfile $root.post_imputation \
--maf 0.01 \
--make-bed \
--out $root.post_imputation_common

Remove missing SNPs, including those set as missing above

$plink \
--bfile $root.post_imputation_common \
--geno 0.02 \
--make-bed \
--out $root.post_imputation_updated

Drop duplicated variants from imputation

sh ./DropDuplicatedSNPs.sh
$plink \
--bfile $root.post_imputation_updated \
--exclude $root.post_imputation_updated_duplicated_IDs \
--make-bed \
--out $root.post_imputation_final

Convert imputed rs IDs back to rs… format

sh ./Relabel_rs.sh

Some rs IDs are imperfectly mapped, resulting in duplications with imputed IDs, so remove these accidental duplicates.

sh ./DropDuplicatedPositions.sh 
Association testing in PLINK/PLINK2

Generate covariates file, merging $covar and $root.dataname_pop_strat_includes.pca.evec (output from SMARTPCA) files

$-R --file=Get_Covariates.R --args $root $covar

Relabels header and adds additional covariates (.pca.evec contains all PCs included in the SmartPCA analysis) Script assumes a covariate file with the same column names for IDs (FID and IID), but no shared column names with the .pca.evec file, which is assumed to contain 100 PCs).

Run association against phenotype

Phenotype here assumed to be in $pheno as the only phenotype (otherwise use --mpheno [column number]) and called "Outcome".

$plink \
--bfile $root.post_imputation_final \
--logistic/--linear (depending whether phenotype of interest is dichotomous or continuous) \
--pheno $pheno  \
--pheno-name Outcome \
--covar $covar
--covar-number 1-10
--hide-covar
--parameters 1-11
--out $root.post_imputation_conc_analysis

Consider coding of phenotype – may require the use of --1 as an option if coding is in 0,1 format (rather than 1,2 format)

     --covar-number indicates which covariates to include. --covar-name can also be used for this 
     --hide-covar hides results of association tests between phenotype and covariates
     --parameters specifies models to include in the analysis (see www.cog-genomics.org/plink2)
       1.	Allelic dosage additive effect (or homozygous minor dummy variable)
       2.	Dominance deviation, if present
       3.	--condition{-list} covariate(s), if present
       4.	--covar covariate(s), if present
       5.	Genotype x non-sex covariate 'interaction' terms, if present
       6.	Sex, if present
       7.	Sex-genotype interaction(s), if present 

Investigate further any SNP that is highly associated with the phenotype, and exclude from analysis if justified

Run BLAT, available on the UCSC Genome Browser on the probe sequence (available from the array manifest) for all highly associated genotyped SNPs as a test of how well mapped/unique the sequence is, particularly with regards to similarity to sequences on the sex chromosomes. Discard any associated SNP that does not map uniquely.

All association details here assume an additive model – see PLINK website to implement other models (but see Knight and Lewis, 2012 for discussion of statistical issues of performing tests using multiple models). More association tests are available in PLINK and PLINK2.

Using GCTA for Genomic-relatedness-matrix Restricted Maximum Likelihood (GREML) and Mixed Linear Model Association (MLMA)

Make GRM

Thresholds below: MAF 1%, IBD 0.025

./gcta \ 
--bfile $root.post_imputation_final \
--autosome \
--maf 0.01 \
--grm-cutoff 0.025 \
--make-grm \
--out $root.post_imputation_final_grm

GRM is created here from imputed data - see text for discussion of the benefits of this.

Generate principal components

./gcta \
--grm $root.post_imputation_final_grm \
--pca \
--out $root.post_imputation_final_pca

Univariate GREML, including principal components as continuous covariates

./gcta \
--grm $root.post_imputation_final_grm \
--pheno $pheno \
--covar $covar \ 
--qcovar $root.post_imputation_final_pca \
--reml \
--out $root.post_imputation_final_greml

The number of principal components generated can be varied to assess the effect of their inclusion - if components are included as covariates for population stratification in GWAS, it is suggested to include the same number in GREML.

This script assumes the covariates file contains only discrete covariates – if there are continuous covariates in the covariates file, these should be removed from the $covar file and added to the $root.post imputation_final_pca file.

Run MLMA-LOCO for autosomes

./gcta \
--bfile $root.post_imputation_final \
--pheno $pheno \ 
--covar $covar \
--qcovar $root.post_imputation_final_pca \ 
--mlma-loco \
--out $root.post_imputation_final_mlma_analysis

Run MLMA for X chromosome

./plink \
--bfile $root.post_imputation_final \
--chr X \
--make-bed \
--out $root.post_imputation_final_X
./gcta \
--grm $root.post_imputation_final_grm \
--bfile $root.post_imputation_final_X \ 
--pheno $pheno \
--covar $covar \
--qcovar $root.post_imputation_final_pca \ 
--mlma \
--out $root.post_imputation_final_mlma_analysis_X

Merge results files together

sed -i '1d' $root.post_imputation_final_mlma_analysis_X.mlma
cat $root.post_imputation_final_mlma_analysis.mlmaloco $root.post_imputation_final_mlma_analysis_X.mlma >  $root.post_imputation_final_mlma_analysis_combined.mlmaloco
SNP Clumping to identify independent hits

Limit associations to lowest p-value in each region of linkage disequilibrium

$plink \
--bfile $root.post_imputation_final \
--clump $root.post_imputation_final_analysis.assoc.logistic \
--clump-p1 1 \
--clump-p2 1 \
--clump-r2 0.25 \
--clump-kb 250 \
--out $root.post_imputation_final_analysis_clumped
--clump-p1 is the p-value threshold below which to consider SNPs for inclusion as the reported SNP from the clump
--clump-p2 is the p-value threshold below which to consider SNPs for inclusion in the clump
--clump-r2 is the LD R2 threshold above which SNPs must be to be included in the same clump 
--clump-kb is the maximum distance a clump SNP can be from the reported SNP

The options given here will generate clumps of all SNPs in LD (above R2 = 0.25), with a maximum size of 500kb, considering all SNPs regardless of p-value

Annotation of Results

Download all RefSeq genes from UCSC

Go to Table Browser

     1.	Pick Group: Genes and Gene Prediction Tracks
     2.	Pick Track: RefSeq Genes
     3.	Pick Table: refGene 
     4.	Pick Region: genome
     5.	Pick Output Format: Selected fields…
     6.	Click Get Output
     7.	Tick Chrom, cdsStart, cdsEnd and name2
     8.	Click GetOutput

Transfer output to file GeneList.txt

Slight reformat of gene list, then make glist_hg19

sed -i 's/#chrom/Chrom/g' GeneList.txt 
sed -i 's/chr//g' GeneList.txt
sh ./Make_glist.sh GeneList.txt glist_hg19 

Annotation in PLINK/PLINK2

Annotates variants with genes within 250kb

$plink \
--annotate $root.post_imputation_final_analysis_clumped.clumped \
ranges=glist-hg19 \
--border 250 \
--out $root.post_imputation_final_analysis_annotated

Alternatively, export results to a web tool such as [http://jjwanglab.org/gwasrap]

Plot Manhattan and QQ plots

Select top million hits for Manhattan plot

head -1000001 $root.post_imputation_final_analysis.assoc.logisitic >  $root.post_imputation_final_analysis_for_MP

Run Manhattan plot and QQ plot scripts in R

$R --file ManhattanPlotinR.R --args $root
$R --file QQPlotinR.R --args $root

QQ plot currently plots top 10% of the data - this can be altered by changing the "frac" option Both of these plots can be output in different graphic file formats (.jpeg, .tiff, .png) - please refer to the R documentation

Files in this GitHub repo

README.md: This file!

Quality Control

Iterative_Missingness.sh: Remove SNPs missing in more than 10% of samples, then samples missing more than 10% of SNPs, then repeat for 5% and 1%.

Usage: Iterative_Missingness.sh

Iterative_Missingness.sh 

highLDregions4bim_b37.awk: Awk script to remove regions of high-LD from LD-pruned files. Original script from M.Weale, adapted using Ensembl.

Usage: highLDregions4bim_b37.awk

awk –f highLDregions4bim_b37.awk input.file > output.file

IndividualIBD.R: In R, calculate and print average identity-by-descent relatedness, and print outliers at > 6 SD from mean

Usage: IndividualIBD.R

R --file=IndividualIBD.R

parfile.par: Provide parameters to the Convertf programme from the EIGENSOFT suite, to convert files from PLNIK format to EIGENSOFT format.

Usage: parfile.par

convertf -p parfile.par

PC-VS-OUTCOME_IN_R_FULL.R: Regress 100 PCs step-wise on outcome, print full results of each regression to file

Usage: PC-VS-OUTCOME_IN_R_FULL.R

R --file=PC-VS_OUTCOME_IN_R_FULL.R

PC-VS-OUTCOME_IN_R_SHORT.R: Regress 100 PCs step-wise on outcome, print variance explained for each prinicipal component when added to model, and the p-value for this variance explained, to file

Usage: PC-VS-OUTCOME_IN_R_SHORT.R

R --file=PC-VS_OUTCOME_IN_R_SHORT.R

PlotPCs.R: Use qplot option from ggplot2 to plot samples on first two principal components from PCA

Usage: PlotPCs.R

R --file=PlotPCs.R

ExtractAncestryOutliers.sh: Pull out outlying samples from PCA (as defined by EIGENSOFT) for removal from PLINK binary.

Usage: ExtractAncestryOutliers.sh

ExtractAncestryOutliers.sh

MakeKeepIDs.sh: Make keepids.txt file for use with T.Flutre's HapMap3 OpenWetWare cookbook

Usage: MakeKeepIDs.sh

MakeKeepIDs.sh

IdHets.R: Identify samples with unusual genome-wide heterozygosity (> or < 3SD from mean). Credit: Amos Folarin.

Usage: IDHets.R

R --file=Id_hets.R

Imputation in Impute2

MakeChunks.sh: Generate two sets of 5Mb chunk files for imputation: Chunks_chr... prints a list of chunks for that chromosome, with SNP number; analysis_chunks... prints the input chunk file for Impute2 for that chromosome.

Usage: MakeChunks.sh

MakeChunks.sh

Master_imputation_script_posterior_sampled_haps.sh: Script to control submission of posterior-sampling imputation jobs to a SGE-based cluster. Modified from scripts provided with Impute2

Modified_submit_impute2_jobs_to_cluster.R: R script for submitting Impute2 jobs to a SGE-controlled cluster. Modified from scripts provided with Impute2.

Prototype_imputation_job_posterior_sampled_haps.sh: Posterior-sampling imputation job script for Impute2. Modified from scripts provided with Impute2

Usage Master_imputation_script_posterior_sampled_haps.sh

Master_imputation_script_posterior_sampled_haps.sh

This runs multiple instances of Prototype_imputation_job_posterior_sampled_haps.sh

Prototype_imputation_job_posterior_sampled_haps.sh <Chr> <Start BP> <End BP>

Post-imputation quality control

MergeImputedChunks.sh: Script to move imputed 5Mb chunks to chromosome-specific folders, and merge them into by-chromosome .impute2 and .impute2_info files.

Usage: MergeImputedChunks.sh

MergeImputedChunks.sh

AddChromosomeNumber.sh:** Post-imputation clean-up script to add chromosome number to by-chromosome .impute2 files

Usage: AddChromosomeNumber.sh

AddChromosomeNumber.sh /path/to/results_directory

ReplaceDots.sh: Post-imputation clean-up file for Phase1_Integrated 1KG reference. Converts exon variants called "." to "chr/position"

Usage: ReplaceDots.sh

ReplaceDots.sh

FilterByInfoAll.sh: Post-imputation QC script to filter whole-genome impute2_info file by info score, and then keep only these variants from the by-chromosome impute2_info files

Usage: FilterByInfoAll.sh

FilterByInfo.sh

DropDuplicatedSNPs.sh: Post-imputation QC script to remove duplicated positions in the post-imputation file (which are usually indels or multiallelic variants)

Usage: DropDuplicatedSNPs.sh

DropDuplicatedSNPs.sh

Relabel_rs.sh: Relabel imputed SNPs with an rs id with the rs id only

Usage: Relabel_rs.sh

Relabel_rs.sh

DropDuplicatedPositions.sh: Some rs IDs are imperfectly mapped, resulting in duplications with imputed IDs, so remove these accidental genotyped duplicates.

Usage: DropDuplicatedPositions.sh

DropDuplicatedPositions.sh

Association testing

Get_Covariates.R: Script to create covariate file for association analyses in PLINK from the principal components from PCA and an external covariates file.

Usage: Get_Covariates.sh

R --file=Get_Covariates.R

Post-GWAS

Make_glist.sh: Script to make a glist-file from the GeneList file downloaded from UCSC Table Browser, for use in annotation in PLINK

Usage: Make_glist.sh

Make_glist.sh <GeneList> <output.file>

ManhattanPlotinR.R: Wrapper script for running Mike Weale's manhattan_v2.R script to generate Manhattan plots from association result

manhattan_v2.R: Mike Weale's Manhattan plot script

Usage: ManhattanPlotinR.R

R --file=ManhattanPlotinR.R

QQPlotinR.R: Wrapper script to generate QQ plots from association results, using Mike Weale's qq_plot_v7.R

qq_plot_v7.R: Mike Weale's QQ-plot script

Usage: QQPlotinR.R

R --file=QQPlotinR.R

ID_Build.py: Python script for quickly identifying the genome build of an unknown PLINK binary.

Usage: ID_Build.py

$plink \
--bfile $root \
--chr 6 \
--make-bed \
--out $root.chr6

python ID_Build.py $root.chr6.bim

Valuable web resources

Genotype recalling pipeline

Rare variant recalling pipeline

Alternative recalling and quality control pipeline

PLINK 1.07

PLINK 1.9

R

EIGENSOFT, EAGLE, HAPLOSNP, LDScore... JUST SO MUCH GOOD STUFF

IMPUTE2

IMPUTE2 Cookbook

William Rayner's strand files for microarrays

Mike Weale's Page

Tim Flutre's OpenWetWare script for HapMap3 PC plot

GWASRAP - Post-GWAS annotation

PRSice - Polygenic Risk Scoring software

GCTA - Software suite for mixed linear modelling and heritability analyses

Acknowledgements

Thank you to the following people for their advice and input on the contents of this cookbook:

  • Gerome Breen
  • Steve Newhouse
  • Amos Folarin
  • Richard Dobson
  • Cass Johnston
  • Hamel Patel
  • Jack Euesden
  • Jemma Walker
  • Niamh Mullins
  • Cathryn Lewis
  • Paul O'Reilly
  • Mike Weale

In addition, thank you to the authors of the resources listed above, and the genetics community for their comments.