-
Notifications
You must be signed in to change notification settings - Fork 23
4. Polygenic Localization with PolyLoc
PolyLoc takes an input file with posterior means and standard deviations of causal effect sizes (estimated by run_finemapper
). PolyLoc uses this file to partition SNPs into bins of similar posterior per-SNP heritability, and estimates the heritability causally explained by each bin. PolyLoc consists of three stages:
- Partition SNPs into bins of similar posterior per-SNP heritability
- Compute LD-scores for the SNP bins
- Estimate the heritability casaully explained by each bin. This stage requires summary statistics based on different data than the data used to run PolyFun (see Weissbrod et al. 2019 bioRxiv for details).
PolyLoc and PolyFun have similar input files and they share many command-line arguments. You can see all the PolyLoc options by typing `python polyloc.py --help`. We now describe each of the stages of PolyLoc in detail.
This stage requires a file with posterior causal effect sizes of SNPs (ideally all genome-wide SNPs, or at least all the ones in genome-wide significant loci), as created by run_finemapper.py
. Here is an example file (seen via cat example_data/posterior_betas.gz | zcat | head
):
CHR SNP BP A1 A2 Z N BETA_MEAN BETA_SD
1 rs3748597 888659 T C 0.85233 383290 0.00021 0.00005
1 rs6661956 2465912 C T 0.98513 383290 0.00161 0.00006
1 rs2376821 2974852 T C -0.73402 383290 -0.00186 0.00111
1 rs10797386 3168280 A G -0.59635 383290 -0.00057 0.00006
1 rs7548372 4943523 C A 2.60293 383290 0.00643 0.00014
1 rs6678838 5003147 G T 1.10294 383290 0.00135 0.00008
1 rs6675995 5226752 G A -1.22154 383290 -0.00027 0.00034
1 rs2012852 5541205 C A 0.30166 383290 0.00022 0.00027
1 rs10864271 7135268 A G 2.03717 383290 0.00471 0.00031
The column BETA_MEAN
contains the posterior means of causal effect sizes (as estimated by PolyFun), and the column BETA_SD
contains their posterior standard deviation. The other required columns are CHR
, SNP
, BP
, A1
, A2
.
Here is an example command that uses this file:
mkdir -p output
python polyloc.py \
--compute-partitions \
--output-prefix output/polyloc_test \
--posterior example_data/posterior_betas.gz \
--bfile-chr example_data/reference.
The parameters we provided are the following:
-
--compute-partitions
- this tells PolyLoc to partition SNPs into bins of similar posterior per-SNP heritability -
--output-prefix
- this specifies the prefix of all PolyLoc output file names -
--posterior
- this specifies the input file with posterior means and standard deviations of causal effect sizes. This file should ideally include information for all SNPs. Every SNP not included in this file will be treated as if its posterior causal effect size is approximately zero, potentially leading to suboptimal polygenic localization if it's an important SNP. As in PolyFun, this file can either be a text file (possibly gzipped) or a parquet file, which allows faster loading. -
--bfile-chr
(Optional parameter) - this is the prefix of plink .bim files that PolyLoc will use to assign SNPs not reported in the--posterior
file into bins. As in PolyFun, there must be one plink file for each chromosome. This is an optional parameter, but we strongly recommend providing it. If you do not provide it, PolyLoc will only perform polygenic localization across SNPs that were fine-mapped. Please note that this argument only requires .bim files, so you can create such files even if you don't have access to individual-level data.
Additional optional parameters are:
-
--skip-Ckmedian
- This tells PolyLoc to partition SNPs into bins using scikits-learn instead of Ckmedian. This is a suboptimal clustering, so Ckmedian is preferable. You should only specify this argument if rpy2 and/or Ckmeans.1d.dp are not installed on your machine or you can't get them to run. -
-num-bins <K>
- this specifies the number of bins to partition SNPs into. By default PolyLoc will try to estimate this number. You should specify this number if either (a) you specified--skip-Ckmedian
(because scikits-learn cannot estimate the number of bins) or (b) the estimation is too slow.
This stage is similar to the LD-score computation stage in PolyFun. In this stage PolyLoc will compute LD-scores for each SNP bin. This is the most computationally intensive part of PolyLoc. PolyLoc can compute LD-scores for all chromosomes in the same run, or for only one chromosome at a time. The second option is recommended if you can run multiple jobs on a cluster. These LD-scores can be based on either genotyping data from a reference panel (encoded in plink binary files) or it can be based on our pre-computed UK Biobank LD matrices.
Here is an example that computes LD-scores for only chromosome 1, using genotypes from the 1000-genomes reference panel:
python polyloc.py \
--output-prefix output/polyloc_test \
--compute-ldscores \
--bfile-chr example_data/reference. \
--chr 1
Here is an example that computes LD-scores for only chromosome 1 by downloading our precomputed UK Biobank LD matrices:
mkdir -p LD_cache
python polyloc.py \
--output-prefix output/polyloc_test \
--compute-ldscores \
--ld-ukb \
--ld-dir LD_cache \
--chr 1
- PolyLoc accepts the same parameters as PolyFun for LD-scores computations.
- You must specify the same
--output-prefix
argument that you provided in stage 1, because PolyLoc requires intermediate files that were created in stage 1. - If you remove the flag
--chr
, PolyLoc will iterate over all chromosomes and compute LD-scores for all of them, which may take a long time. - The genotype or LD data should be population-matched to your study. Ideally this data should come from your study directly. In the first example we used a small subset of SNPs of European-ancestry individuals from the 1000 genomes project. In the second example we used precomputed summary LD information from British-ancestry individuals from the UK Biobank.
- If you compute LD from individual-level data, there are various parameters that you can use to control the LD-score computations, analogue to the respective parameters in the ldsc package. Please type
python polyloc.py --help
to see all available parameters. The parameter--keep <keep file>
can be especially useful if you have a very large reference panel and would like to speed-up the computations by using only a subset of individuals. - The
--ld-dir
flag specifies a directory to which precomputed UK Biobank LD files will be downloaded. If you don't specify this, PolyLoc will use a temporary directory that may be deleted at the end of the analysis. - You can run stages 1 and 2 together by invoking
polyloc.py
with both the flags--compute-partitions
and--compute-ldscores
.
This stage requires LD-score weights files and a summary statistics file that is different from the one used to compute posterior causal effect sizes (to prevent biased estimates due to winner's curse). Here is an example command:
python polyloc.py \
--output-prefix output/polyloc_test \
--compute-polyloc \
--w-ld-chr example_data/weights. \
--sumstats example_data/sumstats2.parquet
This commands creates two output files: output/polyloc_test.bin_h2
and output/polyloc_test.Mp
. The first output file shows estimates of the heritability causally explained by the SNPs in each bin. The output for our example (seen by typing cat output/polyloc_test.bin_h2
) is:
BIN BIN_SIZE SUM_BINSIZE %H2 %H2_STDERR SUM_%H2 SUM_%H2_STDERR
1 10 10 0.17250 0.01015 0.17250 0.01015
2 32 42 0.13510 0.01878 0.30760 0.01970
3 70 112 0.12648 0.01779 0.43408 0.02313
4 110 222 0.13666 0.01394 0.57074 0.02198
5 193 415 0.15129 0.01394 0.72203 0.02134
6 255 670 0.09629 0.01070 0.81832 0.01907
7 315 985 0.07411 0.00857 0.89243 0.01611
8 410 1395 0.05337 0.00840 0.94580 0.01297
9 492 1887 0.05057 0.00877 0.99637 0.00807
10 639 2526 0.00363 0.00807 1.00000 0.00000
11 949 3475 0.00000 0.00000 1.00000 0.00000
The output shows that PolyLoc partitioned SNPs into 11 bins of similar posterior per-SNP heritability. The first bin includes 10 SNPs that jointly explain 17.25% of the total SNP heritability, the second bin includes 32 SNPs that jointly explain 13.5% of the total SNP heritability, and so on. The identities of the SNPs in each bin are the SNPs in the posterior effect sizes file, ranked according to their posterior per-SNP heritability estimates (i.e., the sum of their squared posterior mean and their squared posterior standard deviation). That is, the SNPs in bin 1 are the 10 SNPs with the largest posterior per-SNP heritability, the SNPs in bin 2 are the next top ranked 32 SNPs, and so on.
The second output file shows estimates of the the minimal number of SNPs that causally explain different proportions p
of the SNP heritability (denoted Mp
). Here are the first 50 lines of the output for our example (seen by typing head -51 output/polyloc_test.Mp
):
p Mp Mp_STDERR
1 1 0.00000
2 2 0.00000
3 2 0.00000
4 3 0.00000
5 3 0.00000
6 4 0.00000
7 5 1.40360
8 5 0.00000
9 6 0.00000
10 6 0.00000
11 7 0.00000
12 7 2.76434
13 8 0.00000
14 9 1.40360
15 9 0.00000
16 10 0.99500
17 10 0.99500
18 12 3.12584
19 15 4.54874
20 17 1.72627
21 19 4.76246
22 22 3.54006
23 24 2.97164
24 26 7.10397
25 29 3.14172
26 31 4.25020
27 34 6.50135
28 36 4.22261
29 38 5.38279
30 41 6.23894
31 44 10.83559
32 49 11.50175
33 55 11.07154
34 60 11.79390
35 66 10.80248
36 72 11.79495
37 77 10.92612
38 83 12.09894
39 88 11.54147
40 94 12.26475
41 99 11.86560
42 105 12.71486
43 110 13.26258
44 117 18.60806
45 125 18.93879
46 133 18.51573
47 141 18.18059
48 149 17.72502
49 158 17.44208
50 166 17.29313
Each line shows the number of top-ranked SNPs (with respect to posterior per-SNP heritability estimated by PolyFun) that causally explain proportion p%
of SNP heritability. For example, the last line shows that PolyLoc estimates that 166 top-ranked SNPs are required to causally explain 50% of SNP heritability, with a standard error of 17.29.
- The summary statistics that you provide to stage 3 of PolyLoc must be based on a different dataset than the dataset that was to run PolyFun to compute per-SNP heritabilities. Otherwise, your results are subject to winner's curse and could be extremely biased!
- As in PolyFun, you can run jointly multiple stages of PolyLoc by using several mode parameters (e.g.
python polyloc.py ----compute-partitions --compute-ldscores --compute-polyloc
). - If you use
run_finemapper
to fine-map multiple loci, you can concatenate the output files of these loci (making sure there aren't duplicate SNPs that appear twice) and provide the concatenated file as an input posterior file. - PolyLoc estimates per-bin heritability with respect to all SNPs in the plink files. If you would like to exclude to e.g. only common SNPs (as done in the PolyFun paper), please create plink files with only common SNPs and subset your posterior file accordingly (no need to subset the sumstats file).