Skip to content

2. Real example for X‐Wing

Jiacheng Miao edited this page Feb 2, 2023 · 3 revisions

Here we provide a concrete example to run X-Wing. We use BMI European GWAS summary statistics from UKBB (N = 359,983) and East Asian GWAS summary statistics from BBJ (N = 158,284) as the training data and 1000G as the reference panel. The goal in is to calculate the BMI PRS for the target East Asian sample.

A detailed explanation for the available flags can be found in X-Wing.

Step 0: Download the required data and LD from reference panel for the example.

After downloading X-Wing, you can download all the example data and required LD from reference panel using

#!/bin/bash
cd X-Wing
bash download_example.sh

Step 1. Obtain the annotation based on cross-population local genetic correlation using LOGODetect

Applying LOGODetect

#!/bin/bash

cd example

mkdir ./results/LOGODetect

Rscript /X-Wing/LOGODetect.R \
--sumstats ./data/sumstats/BMI_EUR.txt,./data/sumstats/BMI_EAS.txt \
--n_gwas 359983,158284 \
--ref_dir ./data/LOGODetect_1kg_ref \
--pop EUR,EAS \
--block_partition /X-Wing/block_partition.txt \
--gc_snp /X-Wing/1kg_hm3_snp.txt \
--out_dir ./results/LOGODetect \
--n_cores 20 \
--target_pop EAS \
--n_topregion 1000
  • The first few line of GWAS summary statistics is
    head ./data/sumstats/BMI_EUR.txt
    
    CHR	  SNP	          BP	       A1	A2	BETA	                P
    1	  rs12238997	  693731	G	A	-0.276816514205779	0.781921
    1	  rs58276399	  731718	C	T	-0.412794747579417	0.679757
    1	  rs61770163	  732032	C	A	-0.056528239021207	0.954921
  • The output files include the regions with significant local genetic correlation and the annotation based on local genetic correlation as shown below
    head ./results/LOGODetect/LOGODetect_regions.txt
    
    chr    begin_pos    stop_pos     stat                pval                   qval
    1      32159588     32175927     13.9007303828433    0.00479904019196161    0.0107599532725034
    1      74992278     75006328     23.659452286243     0.0001999600079984     0.00107756226532471
    1      107872936    107899979    13.5101005165975    0.0017996400719856     0.00492533072332902
    
    head ./results/LOGODetect/annot_EUR.txt
    
    CHR    SNP           A1    A2    Anno
    1      rs12238997    G     A     0
    1      rs58276399    C     T     0
    1      rs61770163    C     A     0

Step 2. Calculate the SNPs posterior effect size by incorporating the portability-informed annotation into PANTHER

Fit PANTHER PRS model

Here, we fit the PANTHER PRS model using European and East Asian GWAS summary statistics, and incorporate the cross-population local genetic correlation annotation derived from Step1 into PRS modeling; We specify that our target population is East Asian and use 1000G data as reference panel; We specify that our target sample plink bim file is located in ./data/test.bim.

#!/bin/bash

cd example

mkdir ./results/PANTHER; mkdir ./results/PANTHER/post; mkdir ./results/PANTHER/post_collect

for chr in {1..22};do
  for pst_pop in {EUR EAS};do
    python \
    /X-Wing/PANTHER.py \
    --ref_dir ./data/PANTHER_1kg_ref \
    --bim_prefix ./data/test \
    --sumstats ./data/sumstats/BMI_EUR.txt,./data/sumstats/BMI_EAS.txt \
    --n_gwas 359983,158284 \
    --anno_file ./results/LOGODetect/annot_EUR.txt,./results/LOGODetect/annot_EAS.txt \
    --chrom ${chr} \
    --pop EUR,EAS \
    --target_pop EAS \
    --pst_pop ${pst_pop} \
    --out_name BMI \
    --seed 3 \
    --out_dir ./results/PANTHER/post
  done
done
  • The format of the input GWAS summary statistics is the same as LOGODetect.
  • The format of the input annotation file is the same as the output annotation file for LOGODetect.
  • We used for loop in this example, but in practice it can be time-consuming. We highly recommend parallel computation for the 22 autosomes and each posterior populations.
  • The output file of step2 is
    head ./results/PANTHER/post/BMI_EAS_pst_eff_phiauto_chr1.txt
    
    1	  rs4475691	  846808	T	C	7.715496e-05
    1	  rs7537756	  854250	G	A	-5.351412e-05
    1	  rs3748592	  880238	A	G	-2.837376e-04

where the columns in order are chromosome, SNP rs ID, BP, A1, A2, and posterior effect size estimate. Each row is a SNP.

  • PANTHER fit the model for each chromosome separately. One can merge each output file for 22 autosomes into one file by:

    cat ./results/PANTHER/post/BMI_EUR_pst_eff_phiauto_chr*.txt >     ./results/PANTHER/post_collect/BMI_EUR_Post.txt
    sed  -i '1iCHR\tSNP\tBP\tA1\tA2\tBETA' ./results/PANTHER/post_collect/BMI_EUR_Post.txt
    
    cat ./results/PANTHER/post/BMI_EAS_pst_eff_phiauto_chr*.txt > ./results/PANTHER/post_collect/BMI_EAS_Post.txt
    sed  -i '1iCHR\tSNP\tBP\tA1\tA2\tBETA' ./results/PANTHER/post_collect/BMI_EAS_Post.txt

Step 3: Estimate the linear combination weights for PRS from multiple populations using LEOPARD

Step 3.1: Subsample GWAS summary statistics for training and validation sets.

Here, given the full GWAS summary statistics from East Asian population, we try to subsample the GWAS summary statistics for the training and validation set. We set the proportion of the training set is 0.75 and replicate the subsampling 4 times:

cd example

mkdir ./results/LEOPARD/sampled_sumstats

Rscript /X-Wing/LEOPARD_Sim.R \
--sumstats ./data/sumstats/BMI_EAS.txt \
--n_gwas 158284 \
--train_prop 0.75 \
--ref_prefix ./data/LEOPARD_1kg_ref/EAS/EAS_part1 \
--seed 42 \
--rep 4 \
--out_prefix ./results/LEOPARD/sampled_sumstats/BMI_EAS
  • The format of the input GWAS summary statistics is the same as LOGODetect.
  • The output for Step 3.1 contains 3 files for each replication with 4 replications in total. For example. BMI_rep1_train.txt, BMI_rep1_valid.txt, and BMI_rep1_train_valid_N.txt represent the GWAS summary statistics for the training, validation set, and the sample size for training and validation set, respectively.

Step 3.2: Fit the PRS model using sampled target population GWAS summary statistics from the training set.

Next, we fit the PANTHER PRS model using the full European GWAS summary statistics and subsampled East Asian GWAS summary statistics for training set:

cd example

mkdir ./results/LEOPARD/post; mkdir ./results/LEOPARD/post_collect

for index in {1..4}
do
  for chr in {1..22};do
    for pst_pop in {EUR EAS};do
        python /X-Wing/PANTHER.py \
        --ref_dir ./data/PANTHER_LEOPARD_1kg_ref \
        --bim_prefix ./data/test \
        --sumstats ./data/sumstats/BMI_EUR.txt,./results/LEOPARD/sampled_sumstats/BMI_EAS_rep${index}_train.txt \
        --n_gwas 359983,118713 \
        --anno_file ./results/LOGODetect/annot_EUR.txt,./results/LOGODetect/annot_EAS.txt \
        --chrom ${chr} \
        --pop EUR,EAS \
        --target_pop EAS \
        --pst_pop ${pst_pop} \
        --out_name BMI_${index} \
        --seed 3 \
        --out_dir ./results/LEOPARD/post
      done
    done
done
  • We used for loop in this example, but in practice it could be time-consuming. We highly recommend parallel computation for the 22 autosomes * 4 replicates * 2 populations = 176 jobs.
  • Since PANTHER fit the model and outputs the results for each chromosome separately. We merge each output file for 22 autosomes into one file by:
    for index in {1..4}
    do
      cat ./results/LEOPARD/post/BMI_${index}_EUR_pst_eff_phiauto_chr*.txt > ./results/LEOPARD/post_collect/BMI_${index}_EUR_Post.txt
      sed  -i '1iCHR\tSNP\tBP\tA1\tA2\tBETA' ./results/LEOPARD/post_collect/BMI_${index}_EUR_Post.txt
    
      cat ./results/LEOPARD/post/BMI_${index}_EAS_pst_eff_phiauto_chr*.txt > ./results/LEOPARD/post_collect/BMI_${index}_EAS_Post.txt
      sed  -i '1iCHR\tSNP\tBP\tA1\tA2\tBETA' ./results/LEOPARD/post_collect/BMI_${index}_EAS_Post.txt
    done

Step 3.3: Estimate the linear combination weights.

At last, we estimate the linear combination weights using the estimated SNP posterior mean effects from Step 3.2 and subsampled East Asian GWAS summary statistics for validation set:

cd example
mkdir ./results/LEOPARD/weights

# Estimating the linear combination weights
for index in {1..4}
do
    Rscript /X-Wing/LEOPARD_Weights.R \
    --beta_file ./results/LEOPARD/post_collect/BMI_${index}_EUR_Post.txt,./results/LEOPARD/post_collect/BMI_${index}_EAS_Post.txt \
    --valid_file ./results/LEOPARD/sampled_sumstats/BMI_EAS_rep${index}_valid.txt \
    --n_valid 39571 \
    --ref_prefix ./data/LEOPARD_1kg_ref/EAS/EAS_part3 \
    --out ./results/LEOPARD/weights/BMI_LEOPARD_weights_rep${index}.txt
done
  • The output of Step 3.3 is the linear combination weights for multiple population-specific PRS. Here is the result for replication 1:

    head ./results/LEOPARD/weights/BMI_LEOPARD_weights_rep1.txt
    
    Path	                                                  Weights
    ./results/LEOPARD/Post_collect/BMI_1_EUR_Post.txt	  95.3952930409908
    ./results/LEOPARD/Post_collect/BMI_1_EAS_Post.txt	  202.295367124474

    where columns in order are the PATH to the file contains SNP posterior effects and linear combination weights. Each row is the result for each PRS included.

  • The final linear combination weights is the average of the relative weights across 4 replications. We used LEOPARD_avg.R to produce the it:

    Rscript /X-Wing/LEOPARD_Avg.R \
        --weights_prefix ./results/LEOPARD/weights/BMI_LEOPARD_weights_rep \
        --rep 4 \
        --out ./results/LEOPARD/weights/BMI_LEOPARD_weights_avg.txt

    The results is

    head ./results/LEOPARD/weights/BMI_LEOPARD_weights_avg.txt
    
    Path	                                                Weights
    ./results/LEOPARD/post_collect/BMI_1_EUR_Post.txt	0.411651331616073
    ./results/LEOPARD/post_collect/BMI_1_EAS_Post.txt	0.588348668383927
    
  • Thus, the final linearly combined PRS is 0.412 x EUR_PRS + 0.588 x EAS_PRS. We recommend centering each population-specific PRS in the target dataset when calculating the final linearly combined PRS.