The SHAPEIT2-phased, MAF 0.01-filtered GTEx v8 WGS VCF was converted to hg19 using LiftOver for compatibility with the GRCh37 HapMap genetic map and hg19 1000 Genomes (liftOver-hg38Tohg19.sh
). This VCF was then split into chromosome-level VCFs, each of which was then filtered to include only 117AX samples and MAC > 10 using vcftools
:
vcfin=genotypes-hg19
vcfdir=admixed/geno
for chr in {1..22}; do
vcf_unfilt=${vcfin}/gtex.MAF01.phased.chr${chr}.vcf.gz
vcf_out=${vcfdir}/gtex.admixed.MAC10.phased.chr${chr}
vcftools \
--gzvcf ${vcf_unfilt} \
--keep ${admix_ids} \
--recode-INFO-all \
--mac 10 \
--max-alleles 2 \
--min-alleles 2 \
--out ${vcf_out} \
--recode
done
gzip ${vcfdir}/gtex.admixed.MAC10.phased.chr*
prepare_covariates.sh
concatenates covariates provided by GTEx and new PEER factors. It requires paths to two GTEx v8 files:
${tissue}.v8.normalized_expression.bed.gz
(hg38 coordinates now)${tissue}.v8.covariates.txt
It also calls 3 other scripts:
filter_expression_admixed.R
concat_cov.R
run_PEER.R
(not in this repository)
It has a few outputs:
${tissue}.v8.normalized_expression.admixed_subset.bed
: expression in 117AX (however many have samples in this tissue)other_covariates.txt
: necessary covariates extracted from the GTEx v8 covariates filepeer
folder with 3 outputs, including${tissue}.PEER_covariates.txt
, which are the PEER covariates calculated using only 117AX samples${tissue}.all_covariates.txt
: combined covariates. This file will be used for eQTL calling.
batch_eqtl_localaa_globalaa.sh
splits the expression file for a tissue into chunks of 50 genes (line 49) and runs eqtl_localaa_globalaa.R
on each chunk of genes per chromosome.
eqtl_localaa_globalaa.R
requires several inputs (see arguments):
chr
: Chromosome numbertissue
: Tissue specified with GTEx file prefix, e.g.Muscle_Skeletal
exprfile
: Path to GTEx v8 expression file for this tissueglobalcov
: Path to${tissue}.all_covariates.txt
from the previous stepgeno
: Path to filtered VCF for this chromosome (from this step)localcov
: Path to local ancestry covariates for this chromosome from this stepout
: Output file. Results from each test are added to this file as they are generated. This file is analogous to GTEx's "allpairs" files
batch_eqtl_localaa_globalaa.sh
was written to parallelize eQTL calling on a cluster without a job submission system. As written, it restricts processes to 9 cores. If you have access to a HPC with a job submission system like SGE
or SLURM
, I highly recommend parallelizing this process as much as possible. See scripts in eqtl/slurm for an example of how to do this with SLURM
.
eqtl_localaa_globalaa.R
generates a progress bar for its corresponding chunk for your sanity. Once all chunks are done running, concatenate the results for each tissue (see the end of batch_eqtl_localaa_globalaa.sh
).
Now let's assume that, for each tissue, you have a master allpairs file called ${tissue}-LocalAA-GlobalAA-allpairs-merged.tsv.gz
. These files are very large. Some of the downstream steps are facilitated by having a filtered allpairs file that only includes tests where either the GlobalAA or LocalAA nominal p-value is less than 0.05. Generate this filtered allpairs file as follows:
zcat zcat ${tissue}-LocalAA-GlobalAA-allpairs-merged.tsv.gz | head -1 > ${tissue}-LocalAA-GlobalAA-allpairs-merged-filt.tsv
zcat ${tissue}-LocalAA-GlobalAA-allpairs-merged.tsv.gz | \
awk -F "\t" '{ if(($6 < 0.05) || ($9 < 0.05)) { print } }' \
>> ${tissue}-LocalAA-GlobalAA-allpairs-merged-filt.tsv
gzip ${tissue}-LocalAA-GlobalAA-allpairs-merged-filt.tsv
extract-egenes.py
identifies the most significant SNP(s) for each tested gene. If multiple SNPs have the same smallest p-value, all tied lead SNPs are reported. No significance threshold is applied; the lead SNP(s) is reported for all tested genes.
The input should be the unfiltered allpairs file for a given tissue. There are two outputs:
gtex.admix.global.egenes.tied.txt.gz
: Lead SNPs according to GlobalAAgtex.admix.lava.egenes.tied.txt.gz
: Lead SNPs according to LocalAA
get_uniq_egenes.R
generates a few files, a few of which are specific to a particular nominal p-value cutoff:
egenes_master.RData
: Lead SNPs for all tested genes; the union ofgtex.admix.global.egenes.tied.txt.gz
andgtex.admix.lava.egenes.tied.txt.gz
for all tissues. One line per gene-SNP pair per ancestry adjustment methodallp_master.RData
: Any gene-SNP pair with a nominal p-value < 0.05 (either LocalAA or GlobalAA)
Then a few files that are specific to a given nominal p-value cutoff, e.g. 1e-06:
uniq_egene_${cutoff}.RData
: For each tissue, genes that have a nominal eQTL p-value less than the cutoff for only one ancestry adjustment method. Results for all tissues are concatenateddiff_lead_${cutoff}.RData
: For each tissue, genes that have a nominal eQTL p-value less than the cutoff for BOTH ancestry adjustment methods but have different lead SNPs. Results for all tissues are concatenatedsame_egene_diff_lead_${cutoff}.txt
: Unique genes fromdiff_lead_${cutoff}.RData
; used to filter loci for colocalization analyses later on