5. RegionalAssociationPlots.Rmd

---
title: "Regional association plotting of 11 loci associated with CAC."
author: "[Sander W. van der Laan, PhD](https://vanderlaan.science) | s.w.vanderlaan@gmail.com"
date: "`r Sys.Date()`"
output:
  html_notebook:
    cache: yes
    code_folding: hide
    collapse: yes
    df_print: paged
    fig.align: center
    fig_caption: yes
    fig_height: 6
    fig_retina: 2
    fig_width: 7
    highlight: tango
    theme: lumen
    toc: yes
    toc_float:
      collapsed: no
      smooth_scroll: yes
mainfont: Arial
subtitle: "A 'druggable-MI-targets' project"
editor_options:
  chunk_output_type: inline
---

# Setup
We will clean the environment, set up the locations, define colors, and create a datestamp.

_Clean the environment._
```{r echo = FALSE}
# rm(list = ls())

```

_Set locations and working directories..._
```{r LocalSystem, echo = FALSE}
source("scripts/local.system.R")

```

_... a package-installation function ..._
```{r}
source("scripts/functions.R")
```


_... and load those packages._
```{r loading_packages, message=FALSE, warning=FALSE}
source("scripts/packages05.R")

```

_We will create a datestamp and define the Utrecht Science Park Colour Scheme_.
```{r Setting: Colors}

Today = format(as.Date(as.POSIXlt(Sys.time())), "%Y%m%d")
Today.Report = format(as.Date(as.POSIXlt(Sys.time())), "%A, %B %d, %Y")

source("scripts/colors.R")

```
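
The datestamp is reused further down to build the output file names. A minimal sketch of that naming pattern follows; the directory, locus number, and rsID are made-up placeholders for illustration, not values from this project.

```r
# Datestamp in the same YYYYMMDD format as `Today` above.
today <- format(Sys.Date(), "%Y%m%d")

# Hypothetical placeholders, for illustration only.
out_dir  <- tempdir()
locus_nr <- 1
rsid     <- "rs0000001"

# Compose one file name per output format.
formats <- c("png", "pdf", "eps")
plot_files <- file.path(
  out_dir,
  paste0(locus_nr, ".", today, ".", rsid, ".regional_assoc.", formats)
)

basename(plot_files)
```

Keeping the date in the file name means that re-running the notebook on another day does not overwrite earlier figures.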

```{r global_options, include = FALSE}
# further define some knitr-options.
knitr::opts_chunk$set(fig.width = 12, fig.height = 8, fig.path = 'Figures/', 
                      warning = TRUE, # show warnings during codebook generation
                      message = TRUE, # show messages during codebook generation
                      error = TRUE,   # do not interrupt codebook generation in case of errors,
                                      # usually better for debugging
                      echo = TRUE,    # show R code
                      eval = TRUE)
ggplot2::theme_set(ggplot2::theme_minimal())
pander::panderOptions("table.split.table", Inf)
```


# Introduction

We will parse the data to create regional association plots for each of the 11 loci. 

# Setting the NPG colors

Here we simply display the colors of the NPG palette as swatches.
```{r}
library("scales") # provides show_col()
library("ggsci")  # provides pal_npg()
pal_npg("nrc")(10)
show_col(pal_npg("nrc")(10))

# show_col(pal_npg("nrc", alpha = 0.6)(10))

```


# Regional association plotting: EU-AA-ancestry

## Top 11 loci

We are interested in 11 top loci. We will plot these using the EU-AA-ancestry data.

```{r}
library(openxlsx)
variant_list <- read.xlsx(paste0(TARGET_loc, "/Variants.xlsx"), sheet = "TopLoci")

head(variant_list)

```


### All loci
We first collect the rsIDs of all loci that we want to plot.

```{r}
variants_of_interest <- c(variant_list$rsID)
variants_of_interest
length(variants_of_interest)
```

## Load data European and African-American

We need to load the meta-analysis summary statistics from the European - African-American ancestry analysis first.
```{r}

gwas_sumstats_racer_EA_AA <- readRDS(file = paste0(OUT_loc, "/gwas_sumstats_complete_racer.EA_AA.rds"))

```


```{r}
library(RACER)
# Make directory for plots
if (!dir.exists(file.path(PROJECT_loc, "RACER"))) {
  dir.create(file.path(PROJECT_loc, "RACER"))
}
RACER_loc = paste0(PROJECT_loc,"/RACER")

variants_of_interest_fewgenes <- c("rs10899970") # "rs9349379", "rs3844006", "rs2854746", "rs4977575", "rs10899970", "rs9633535", "rs10762577", "rs11063120", "rs9515203", "rs7182103", "rs7412"

for(VARIANT in variants_of_interest){
  cat(paste0("Getting data for ", VARIANT,".\n"))

  tempCHR <- subset(variant_list, rsID == VARIANT)[,5]
  tempSTART <- subset(variant_list, rsID == VARIANT)[,18]
  tempEND <- subset(variant_list, rsID == VARIANT)[,19]
  tempVARIANTnr <- subset(variant_list, rsID == VARIANT)[,1]

  cat("\nSubset required data.\n")
  temp <- subset(gwas_sumstats_racer_EA_AA, Chr == tempCHR & (Position >= tempSTART & Position <= tempEND))
  
  cat("\nFormatting association data.\n")
  temp_f = RACER::formatRACER(assoc_data = temp, chr_col = 3, pos_col = 4, p_col = 5)

  cat("\nGetting LD data.\n")
  temp_f_ld = 
    data.table::setorder( # this fixes an issue where the SNPs with LD = NA are plotted last and it appears many SNPs are not present in 1000G.
    RACER::ldRACER(assoc_data = temp_f, rs_col = 2, pops = "EUR", lead_snp = VARIANT), 
    LD)
  
  cat(paste0("\nPlotting region surrounding ", VARIANT," on ",tempCHR,":",tempSTART,"-",tempEND,".\n"))
  p1 <- singlePlotRACER2(assoc_data = temp_f_ld, 
                         chr = tempCHR, build = "hg19", 
                         plotby = "coord", snp_plot = VARIANT,
                         start_plot = tempSTART, end_plot = tempEND,
                         label_lead = TRUE, gene_track_h = 2, gene_name_s = 1.75)
  
  print(p1)
  cat(paste0("Saving image for ", VARIANT,".\n"))
  ggsave(filename = paste0(RACER_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_assoc.png"), plot = last_plot())
  ggsave(filename = paste0(RACER_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_assoc.pdf"), plot = last_plot())
  ggsave(filename = paste0(RACER_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_assoc.eps"), plot = last_plot())

  rm(temp, p1,
     temp_f, temp_f_ld,
     tempCHR, tempSTART, tempEND,
     VARIANT, tempVARIANTnr)
  
}


```
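
The loop above looks up the chromosome and plotting window of each variant by column position and then subsets the summary statistics to that window. The same step can be sketched with named columns on toy data; the column names `Start` and `End` and all values below are assumptions for illustration, not the layout of `Variants.xlsx`.

```r
# Toy variant list; column names are hypothetical stand-ins for the
# columns that the loop accesses by position.
variant_list_toy <- data.frame(
  rsID  = c("rsA", "rsB"),
  Chr   = c(1, 2),
  Start = c(100, 500),
  End   = c(300, 900),
  stringsAsFactors = FALSE
)

# Toy summary statistics.
sumstats_toy <- data.frame(
  rsID     = paste0("snp", 1:6),
  Chr      = c(1, 1, 1, 2, 2, 2),
  Position = c(50, 150, 250, 400, 600, 950),
  Pvalue   = c(0.5, 1e-8, 0.01, 0.2, 1e-6, 0.3),
  stringsAsFactors = FALSE
)

# Return all SNPs that fall inside the plotting window of one variant.
get_region <- function(variant, variants, sumstats) {
  v <- variants[variants$rsID == variant, ]
  stopifnot(nrow(v) == 1) # exactly one row per variant is expected
  sumstats[sumstats$Chr == v$Chr &
             sumstats$Position >= v$Start &
             sumstats$Position <= v$End, ]
}

region_a <- get_region("rsA", variant_list_toy, sumstats_toy)
region_a$rsID
```

Looking columns up by name rather than by index keeps the subsetting robust if columns are added to or reordered in the spreadsheet.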

### Loci with many genes
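These loci contain many genes, so we plot them around the lead SNP (`plotby = "snp"`) with a more compact gene track (`gene_track_h = 0.75`).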

```{r}

variants_of_interest_manygenes <- c("rs7412", "rs10762577")

for(VARIANT in variants_of_interest_manygenes){
  cat(paste0("Getting data for ", VARIANT,".\n"))

  tempCHR <- subset(variant_list, rsID == VARIANT)[,5]
  tempSTART <- subset(variant_list, rsID == VARIANT)[,18]
  tempEND <- subset(variant_list, rsID == VARIANT)[,19]
  tempVARIANTnr <- subset(variant_list, rsID == VARIANT)[,1]

  cat("\nSubset required data.\n")
  temp <- subset(gwas_sumstats_racer_EA_AA, Chr == tempCHR & (Position >= tempSTART & Position <= tempEND))
  
  cat("\nFormatting association data.\n")
  temp_f = RACER::formatRACER(assoc_data = temp, chr_col = 3, pos_col = 4, p_col = 5)

  cat("\nGetting LD data.\n")
  temp_f_ld = 
    data.table::setorder( # this fixes an issue where the SNPs with LD = NA are plotted last and it appears many SNPs are not present in 1000G.
    RACER::ldRACER(assoc_data = temp_f, rs_col = 2, pops = "EUR", lead_snp = VARIANT), 
    LD)
  
  cat(paste0("\nPlotting region surrounding ", VARIANT," on ",tempCHR,":",tempSTART,"-",tempEND,".\n"))
  p1 <- singlePlotRACER2(assoc_data = temp_f_ld, 
                               chr = tempCHR, build = "hg19", 
                               plotby = "snp", snp_plot = VARIANT,
                               label_lead = TRUE, gene_track_h = 0.75, gene_name_s = 1.75)
  
  print(p1)
  cat(paste0("Saving image for ", VARIANT,".\n"))
  ggsave(filename = paste0(RACER_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_assoc.png"), plot = last_plot())
  ggsave(filename = paste0(RACER_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_assoc.pdf"), plot = last_plot())
  ggsave(filename = paste0(RACER_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_assoc.eps"), plot = last_plot())
  
  rm(temp, p1,
     temp_f, temp_f_ld,
     tempCHR, tempSTART, tempEND,
     VARIANT, tempVARIANTnr)
  
}
```
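
Each loop sorts the LD-annotated data by `LD` before plotting. `data.table::setorder()` places `NA` values first by default, so SNPs without LD information are drawn first and the LD-colored SNPs end up on top. The same ordering can be sketched in base R with toy values (the data below are invented):

```r
# Toy LD values; NA means the SNP has no LD information.
ld_toy <- data.frame(
  rsID = c("snp1", "snp2", "snp3", "snp4"),
  LD   = c(0.9, NA, 0.2, NA),
  stringsAsFactors = FALSE
)

# Sort ascending by LD with NA rows first, so that rows with LD values
# come last and are drawn on top of the NA rows.
ld_sorted <- ld_toy[order(ld_toy$LD, na.last = FALSE), ]

ld_sorted
```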

### CXCL12

The _CXCL12_ genetic locus.

```{r}
variants_of_interest_cxcl12 <- c("rs10899970")

for(VARIANT in variants_of_interest_cxcl12){
  cat(paste0("Getting data for ", VARIANT,".\n"))

  tempCHR <- subset(variant_list, rsID == VARIANT)[,5]
  tempSTART <- subset(variant_list, rsID == VARIANT)[,18]
  tempEND <- subset(variant_list, rsID == VARIANT)[,19]
  tempVARIANTnr <- subset(variant_list, rsID == VARIANT)[,1]

  cat("\nSubset required data.\n")
  temp <- subset(gwas_sumstats_racer_EA_AA, Chr == tempCHR & (Position >= tempSTART & Position <= tempEND))
  
  cat("\nFormatting association data.\n")
  temp_f = RACER::formatRACER(assoc_data = temp, chr_col = 3, pos_col = 4, p_col = 5)

  cat("\nGetting LD data.\n")
  temp_f_ld = 
    data.table::setorder( # this fixes an issue where the SNPs with LD = NA are plotted last and it appears many SNPs are not present in 1000G.
    RACER::ldRACER(assoc_data = temp_f, rs_col = 2, pops = "EUR", lead_snp = VARIANT), 
    LD)
  
  cat(paste0("\nPlotting region surrounding ", VARIANT," on ",tempCHR,":",tempSTART,"-",tempEND,".\n"))
  p1 <- singlePlotRACER2(assoc_data = temp_f_ld, 
                               chr = tempCHR, build = "hg19", set = "all",
                               plotby = "snp", snp_plot = VARIANT,
                               label_lead = TRUE, gene_track_h = 0.75, gene_name_s = 1.75)
  
  print(p1)
  cat(paste0("Saving image for ", VARIANT,".\n"))
  ggsave(filename = paste0(RACER_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_assoc.png"), plot = last_plot())
  ggsave(filename = paste0(RACER_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_assoc.pdf"), plot = last_plot())
  ggsave(filename = paste0(RACER_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_assoc.eps"), plot = last_plot())
  
  rm(temp, p1,
     temp_f, temp_f_ld,
     tempCHR, tempSTART, tempEND,
     VARIANT, tempVARIANTnr)
  
}
```

## Additional regional plots

### Listing regions of interest

We want to create some regional association plots to combine with the UCSC browser tracks, so we need exactly the same regions. 

```{r}
library(openxlsx)
add_list <- read.xlsx(paste0(TARGET_loc, "/Variants.xlsx"), sheet = "AdditionalPlots")

DT::datatable(add_list)

```

### Credible Sets

We want to color the credible sets, which we load here.

```{r}
credset <- as_tibble(fread(paste0(PROJECT_loc, "/CredibleSets/CAC_EUR_AFR_cred_set_all_loci_50kb.txt")))

credset
```

### Combining GWAS with Credible Set

We want to add the posterior probabilities and make a variable to color by.
```{r}

gwas_sumstats_racer_credset <- merge(gwas_sumstats_racer_EA_AA, 
                                     credset %>% select(RSID, Posterior_Prob), 
                                     sort = FALSE,
                                     by.x = "rsID", by.y = "RSID", all.x = TRUE) %>%
  # mutate(., Posterior_Prob = ifelse(is.na(Posterior_Prob), 0, Posterior_Prob)) %>%
  mutate(CredSet = case_when(Posterior_Prob > 0 ~ '95% credible set',
                             TRUE ~ 'not in credible set'))

head(gwas_sumstats_racer_credset)

table(gwas_sumstats_racer_credset$CredSet)

summary(gwas_sumstats_racer_credset$Posterior_Prob)

```
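
The merge above is a left join: every GWAS variant is kept, and variants outside the credible set get `NA` as posterior probability and therefore the fallback label. A small base-R sketch with invented data shows the same logic (`ifelse()` stands in for `case_when()` here):

```r
# Toy GWAS results and a toy credible set (values are invented).
gwas_toy <- data.frame(
  rsID   = c("rs1", "rs2", "rs3", "rs4"),
  Pvalue = c(1e-9, 1e-7, 0.02, 0.4),
  stringsAsFactors = FALSE
)
credset_toy <- data.frame(
  RSID           = c("rs1", "rs2"),
  Posterior_Prob = c(0.85, 0.12),
  stringsAsFactors = FALSE
)

# Left join: all GWAS rows are kept, unmatched rows get NA.
merged_toy <- merge(gwas_toy, credset_toy,
                    by.x = "rsID", by.y = "RSID",
                    all.x = TRUE, sort = FALSE)

# Label variants; NA posterior probabilities fall into the second group.
merged_toy$CredSet <- ifelse(
  !is.na(merged_toy$Posterior_Prob) & merged_toy$Posterior_Prob > 0,
  "95% credible set",
  "not in credible set"
)

table(merged_toy$CredSet)
```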

### Plotting

```{r}
library(RACER)
# library(plotly)

# Make directory for plots
if (!dir.exists(file.path(PROJECT_loc, "RACER"))) {
  dir.create(file.path(PROJECT_loc, "RACER"))
}
RACER_loc = paste0(PROJECT_loc,"/RACER")

variants_of_interest <- c(add_list$rsID)


for(VARIANT in variants_of_interest){
  cat(paste0("Getting data for ", VARIANT,".\n"))

  tempCHR <- subset(add_list, rsID == VARIANT)[,4]
  tempSTART <- subset(add_list, rsID == VARIANT)[,5]
  tempEND <- subset(add_list, rsID == VARIANT)[,6]
  tempNAME <- subset(add_list, rsID == VARIANT)[,3]

  cat("\nSubset required data.\n")
  temp <- subset(gwas_sumstats_racer_credset, Chr == tempCHR & (Position >= tempSTART & Position <= tempEND))
  
  cat("\nFormatting association data.\n")
  temp_f = RACER::formatRACER(assoc_data = temp, chr_col = 3, pos_col = 4, p_col = 5)

  cat("\nGetting LD data.\n")
  # temp_f_ld = RACER::ldRACER(assoc_data = temp_f, rs_col = 2, pops = "EUR", lead_snp = VARIANT)
  
  cat(paste0("\nPlotting region surrounding ", VARIANT," on ",tempCHR,":",tempSTART,"-",tempEND,".\n"))
  
  p1 <- singlePlotRACER2(assoc_data = temp_f, 
                         chr = tempCHR, build = "hg19", 
                         plotby = "coord", snp_plot = VARIANT,
                         start_plot = tempSTART, end_plot = tempEND,
                         label_lead = FALSE, 
                         grey_colors = FALSE, 
                         cred_set = TRUE, 
                         gene_track_h = 3, gene_name_s = 1.75)
  
  print(p1)
  
  cat(paste0("Saving image for ", VARIANT,".\n"))
  ggsave(filename = paste0(RACER_loc, "/", tempNAME, ".", Today, ".",VARIANT,".",tempSTART,".",tempEND,".regional_assoc.credset.png"), plot = p1)
  ggsave(filename = paste0(RACER_loc, "/", tempNAME, ".", Today, ".",VARIANT,".",tempSTART,".",tempEND,".regional_assoc.credset.pdf"), plot = p1)
  ggsave(filename = paste0(RACER_loc, "/", tempNAME, ".", Today, ".",VARIANT,".",tempSTART,".",tempEND,".regional_assoc.credset.eps"), plot = p1)

  # print(ggplotly(p1))
  rm(temp, p1,
     temp_f,
     tempCHR, tempSTART, tempEND,
     VARIANT, tempNAME)
}
```


# Regional association plots in African-American

Note that here we plot the same regions as before; the plots are not centered on the lead variants of the EA-AA analysis.

## Load data African-American-only

We need to load the meta-analysis summary statistics from the African-American-only ancestry analysis.
```{r}
# gwas_sumstats_unfiltered_AA <- fread(paste0(GWAS_loc,"/CAC1000G_AA_FINAL_FUMA.unfiltered.txt.gz"),
#                          showProgress = TRUE)
# saveRDS(gwas_sumstats_unfiltered_AA, file = paste0(OUT_loc, "/gwas_sumstats_unfiltered.AA.rds"))

# gwas_sumstats_racer_unfiltered_AA <- subset(gwas_sumstats_unfiltered_AA,
#                               select = c("MarkerName", "rsID", "Chr", "Position", "Pvalue"))
# 
# saveRDS(gwas_sumstats_racer_unfiltered_AA, file = paste0(OUT_loc, "/gwas_sumstats_unfiltered_racer.AA.rds"))
# 
# gwas_sumstats_racer_unfiltered_AA <- readRDS(file = paste0(OUT_loc, "/gwas_sumstats_unfiltered_racer.AA.rds"))

# 
# gwas_sumstats_AA <- fread(paste0(GWAS_loc,"/CAC1000G_AA_FINAL_FUMA.filtered.txt.gz"),
#                          showProgress = TRUE)
# saveRDS(gwas_sumstats_AA, file = paste0(OUT_loc, "/gwas_sumstats.AA.rds"))
# 
# gwas_sumstats_racer_AA <- subset(gwas_sumstats_AA,
#                               select = c("MarkerName", "rsID", "Chr", "Position", "Pvalue"))
# 
# saveRDS(gwas_sumstats_racer_AA, file = paste0(OUT_loc, "/gwas_sumstats_racer.AA.rds"))
# rm(gwas_sumstats_AA)
gwas_sumstats_racer_AA <- readRDS(file = paste0(OUT_loc, "/gwas_sumstats_racer.AA.rds"))

```

## Plotting 

```{r}
library(RACER)
# Make directory for plots
if (!dir.exists(file.path(PROJECT_loc, "RACER_AA"))) {
  dir.create(file.path(PROJECT_loc, "RACER_AA"))
}
RACER_AA_loc = paste0(PROJECT_loc,"/RACER_AA")

# Plotting is handled a bit differently
# "rs3844006", # throws an error which I don't understand immediately - could be that the variant is not present in AA 1000G data
# "rs9633535", # this one throws an LD error
#  
variants_of_interest_fewgenes_aa <- c("rs9349379", "rs3844006", "rs2854746", "rs10899970", "rs9633535", "rs10762577", "rs11063120", "rs9515203", "rs7182103", "rs7412")

for(VARIANT in variants_of_interest_fewgenes_aa){
  cat(paste0("Getting data for ", VARIANT,".\n"))

  tempCHR <- subset(variant_list, rsID == VARIANT)[,5]
  tempSTART <- subset(variant_list, rsID == VARIANT)[,18]
  tempEND <- subset(variant_list, rsID == VARIANT)[,19]
  tempVARIANTnr <- subset(variant_list, rsID == VARIANT)[,1]

  cat("\nSubset required data.\n")
  temp <- subset(gwas_sumstats_racer_AA, Chr == tempCHR & (Position >= tempSTART & Position <= tempEND))

  cat("\nFormatting association data.\n")
  temp_f = RACER::formatRACER(assoc_data = temp, chr_col = 3, pos_col = 4, p_col = 5)
  
  # cat("\nGetting LD data.\n")
  temp_f_ld = 
    data.table::setorder( # this fixes an issue where the SNPs with LD = NA are plotted last and it appears many SNPs are not present in 1000G.
    RACER::ldRACER(assoc_data = temp_f, rs_col = 2, pops = "AFR", 
                   lead_snp = "rs4576508", # NB: fixed LD reference SNP (the 9p21 proxy, see below) for every locus in this loop
                   auto_snp = FALSE), 
    LD)

  cat(paste0("\nPlotting region surrounding ", VARIANT," on ",tempCHR,":",tempSTART,"-",tempEND,".\n"))

  p1 <- singlePlotRACER2(assoc_data = temp_f_ld, 
                         chr = tempCHR, build = "hg19", 
                         plotby = "coord", 
                         snp_plot = VARIANT,
                         start_plot = tempSTART, end_plot = tempEND,
                         label_lead = TRUE, gene_track_h = 2, gene_name_s = 1.75)
  
  print(p1)
  cat(paste0("Saving image for ", VARIANT,".\n"))
  ggsave(filename = paste0(RACER_AA_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_assoc.AA.png"), plot = last_plot())
  ggsave(filename = paste0(RACER_AA_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_assoc.AA.pdf"), plot = last_plot())
  ggsave(filename = paste0(RACER_AA_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_assoc.AA.eps"), plot = last_plot())

  rm(temp, p1,
     temp_f, temp_f_ld,
     tempCHR, tempSTART, tempEND,
     VARIANT, tempVARIANTnr)
  
}

# chr9-rs4977575: rs7470682 is the most significant but not present in LDlink; alternative rs4576508 is present and plotted
variants_of_interest_fewgenes_aa_9p21 <- c("rs4977575")

for(VARIANT in variants_of_interest_fewgenes_aa_9p21){
  cat(paste0("Getting data for ", VARIANT,".\n"))

  tempCHR <- subset(variant_list, rsID == VARIANT)[,5]
  tempSTART <- subset(variant_list, rsID == VARIANT)[,18]
  tempEND <- subset(variant_list, rsID == VARIANT)[,19]
  tempVARIANTnr <- subset(variant_list, rsID == VARIANT)[,1]

  cat("\nSubset required data.\n")
  temp <- subset(gwas_sumstats_racer_AA, Chr == tempCHR & (Position >= tempSTART & Position <= tempEND))

  cat("\nFormatting association data.\n")
  temp_f = RACER::formatRACER(assoc_data = temp, chr_col = 3, pos_col = 4, p_col = 5)
  
  # cat("\nGetting LD data.\n")
  temp_f_ld = 
    data.table::setorder( # this fixes an issue where the SNPs with LD = NA are plotted last and it appears many SNPs are not present in 1000G.
    RACER::ldRACER(assoc_data = temp_f, rs_col = 2, pops = "AFR", 
                   lead_snp = "rs4576508",
                   auto_snp = FALSE), 
    LD)

  cat(paste0("\nPlotting region surrounding ", VARIANT," on ",tempCHR,":",tempSTART,"-",tempEND,".\n"))

  p1 <- singlePlotRACER2(assoc_data = temp_f_ld, 
                         chr = tempCHR, build = "hg19", 
                         plotby = "coord", 
                         snp_plot = VARIANT,
                         start_plot = tempSTART, end_plot = tempEND,
                         label_lead = TRUE, gene_track_h = 2, gene_name_s = 1.75)
  
  print(p1)
  cat(paste0("Saving image for ", VARIANT,".\n"))
  ggsave(filename = paste0(RACER_AA_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_assoc.AA.png"), plot = last_plot())
  ggsave(filename = paste0(RACER_AA_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_assoc.AA.pdf"), plot = last_plot())
  ggsave(filename = paste0(RACER_AA_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_assoc.AA.eps"), plot = last_plot())

  rm(temp, p1,
     temp_f, temp_f_ld,
     tempCHR, tempSTART, tempEND,
     VARIANT, tempVARIANTnr)
  
}


```
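
For 9p21 the most significant variant is not available in the LD reference, so an alternative lead SNP is used. One way to pick such a proxy is to take the most significant SNP among those that are present in the reference. The sketch below uses invented data, and the `in_reference` flag is a hypothetical column, not something RACER or LDlink returns.

```r
# Toy region; `in_reference` is a hypothetical flag marking whether a SNP
# is available in the LD reference panel.
region_toy <- data.frame(
  rsID         = c("snpA", "snpB", "snpC"),
  Pvalue       = c(1e-12, 1e-10, 1e-3),
  in_reference = c(FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Most significant SNP among those available in the reference panel.
pick_lead <- function(df) {
  candidates <- df[df$in_reference, ]
  candidates$rsID[which.min(candidates$Pvalue)]
}

pick_lead(region_toy)
```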

# Approximate Bayes Factor colocalisation analyses

The idea behind the *Approximate Bayes Factor (ABF)* analysis is that the association of each trait with SNPs in a region may be summarized by a vector of 0s and at most a single 1, with the 1 indicating the causal SNP (so, assuming a single causal SNP for each trait). 

The posterior probability of each possible configuration can be calculated and so, crucially, can the posterior probabilities that the traits share their configurations. This allows us to estimate the support for the following cases, i.e. hypotheses:

- 𝐻0: neither trait has a genetic association in the region
- 𝐻1: only trait 1 has a genetic association in the region
- 𝐻2: only trait 2 has a genetic association in the region
- 𝐻3: both traits are associated, but with different causal variants
- 𝐻4: both traits are associated and share a single causal variant

To what extent do the loci between European and African-American ancestries overlap? We are working on the assumption that the 11 loci are _the_ loci and will test whether these overlap, _i.e._ colocalize.
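
To make the five hypotheses concrete, the sketch below computes Wakefield-style log approximate Bayes factors per SNP and combines them into posterior probabilities for 𝐻0 to 𝐻4 under the single-causal-variant assumption. It is an illustration on invented data, not the `coloc` implementation; the prior values (`prior_sd`, `p1`, `p2`, `p12`) and all function names are assumptions chosen for the example.

```r
# log(sum(exp(x))), computed stably.
logsum <- function(x) {
  m <- max(x)
  m + log(sum(exp(x - m)))
}

# log(exp(a) - exp(b)) for a > b, computed stably.
logdiff <- function(a, b) {
  a + log1p(-exp(b - a))
}

# Per-SNP log approximate Bayes factor from effect size and its variance.
# prior_sd is an assumed prior standard deviation of the true effect.
log_abf <- function(beta, varbeta, prior_sd = 0.15) {
  z <- beta / sqrt(varbeta)
  r <- prior_sd^2 / (prior_sd^2 + varbeta)
  0.5 * (log(1 - r) + r * z^2)
}

# Posterior probabilities of H0..H4 from two vectors of log ABFs
# (same SNPs, same order). Prior probabilities are assumed values.
coloc_posteriors <- function(l1, l2, p1 = 1e-4, p2 = 1e-4, p12 = 1e-5) {
  lsum <- l1 + l2
  lH <- c(
    H0 = 0,
    H1 = log(p1) + logsum(l1),
    H2 = log(p2) + logsum(l2),
    # all pairs of causal SNPs minus the pairs where both are the same SNP
    H3 = log(p1) + log(p2) +
      logdiff(logsum(l1) + logsum(l2), logsum(lsum)),
    H4 = log(p12) + logsum(lsum)
  )
  exp(lH - logsum(lH))
}

# Toy region of five SNPs where both traits peak at the third SNP.
l1 <- log_abf(beta = c(0.01, 0.02, 0.30, 0.02, 0.01), varbeta = rep(0.03^2, 5))
l2 <- log_abf(beta = c(0.00, 0.01, 0.25, 0.03, 0.01), varbeta = rep(0.04^2, 5))

post <- coloc_posteriors(l1, l2)
round(post, 4)
```

Because both toy traits have their strongest signal at the same SNP, almost all posterior mass ends up on 𝐻4; moving one of the peaks to a different SNP shifts the support towards 𝐻3.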

## Preparation

Let's make sure we have `remotes` and `coloc` installed. 
```{r}
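if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes") # if necessary
}
# only install coloc when it is not yet available
if (!requireNamespace("coloc", quietly = TRUE)) {
  remotes::install_github("chr1swallace/coloc@main",
                          build_vignettes = TRUE)
}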

library(coloc)

```

## Load data European-only

We need to load the meta-analysis summary statistics from the European-only ancestry analysis.
```{r}

# gwas_sumstats_EA <- fread(paste0(GWAS_loc,"/CAC1000G_EA_FINAL_FUMA.txt.gz"),
#                          showProgress = TRUE)
# names(gwas_sumstats_EA)[names(gwas_sumstats_EA) == "Pos"] <- "Position"
# saveRDS(gwas_sumstats_EA, file = paste0(OUT_loc, "/gwas_sumstats.EA.rds"))
# 
# gwas_sumstats_racer_EA <- subset(gwas_sumstats_EA,
#                               select = c("MarkerName", "rsID", "Chr", "Position", "Pvalue"))
# 
# saveRDS(gwas_sumstats_racer_EA, file = paste0(OUT_loc, "/gwas_sumstats_racer.EA.rds"))
# rm(gwas_sumstats_EA)

gwas_sumstats_racer_EA <- readRDS(file = paste0(OUT_loc, "/gwas_sumstats_racer.EA.rds"))

```


## Visualization

We can create a mirror plot and a scatter plot for each region. 

```{r}

library(RACER)
# Make directory for plots
if (!dir.exists(file.path(PROJECT_loc, "RACER_EA_vs_AA"))) {
  dir.create(file.path(PROJECT_loc, "RACER_EA_vs_AA"))
}
RACER_EA_vs_AA_loc = paste0(PROJECT_loc,"/RACER_EA_vs_AA")

variants_of_interest <- c(variant_list$rsID)

# variants_of_interest_fewgenes <- c("rs9349379", 
#                                    "rs3844006", # throws an error which I don't understand immediately - could be that the variant is not present in AA 1000G data
#                                    "rs2854746", "rs4977575", 
#                                    "rs10899970",
#                                    "rs9633535", 
#                                    "rs10762577",
#                                    "rs11063120", "rs9515203", "rs7182103",
#                                    "rs7412")
# "rs9349379", "rs3844006", "rs2854746", 
variants_of_interest_fewgenes_aa <- c("rs10899970", "rs9633535", "rs10762577", "rs11063120", "rs9515203", "rs7182103", "rs7412")

for(VARIANT in variants_of_interest_fewgenes_aa){
  cat(paste0("Getting data for ", VARIANT,".\n"))

  tempCHR <- subset(variant_list, rsID == VARIANT)[,5]
  tempSTART <- subset(variant_list, rsID == VARIANT)[,18]
  tempEND <- subset(variant_list, rsID == VARIANT)[,19]
  tempVARIANTnr <- subset(variant_list, rsID == VARIANT)[,1]

  cat("\nSubset required data.\n")
  temp1 <- subset(gwas_sumstats_racer_EA, Chr == tempCHR & (Position >= tempSTART & Position <= tempEND))
  temp2 <- subset(gwas_sumstats_racer_AA, Chr == tempCHR & (Position >= tempSTART & Position <= tempEND))

  cat("\nFormatting association data.\n")
  temp_f1 = RACER::formatRACER(assoc_data = temp1, chr_col = 3, pos_col = 4, p_col = 5)
  temp_f2 = RACER::formatRACER(assoc_data = temp2, chr_col = 3, pos_col = 4, p_col = 5)
  
  # cat("\nGetting LD data.\n")
  temp_f_ld1 = 
    data.table::setorder( # this fixes an issue where the SNPs with LD = NA are plotted last and it appears many SNPs are not present in 1000G.
    RACER::ldRACER(assoc_data = temp_f1, rs_col = 2, pops = "EUR", 
                   # lead_snp = VARIANT,
                   auto_snp = TRUE), 
    LD)
  
  temp_f_ld2 = 
    data.table::setorder( # this fixes an issue where the SNPs with LD = NA are plotted last and it appears many SNPs are not present in 1000G.
    RACER::ldRACER(assoc_data = temp_f2, rs_col = 2, pops = "AFR", 
                   # lead_snp = VARIANT,
                   auto_snp = TRUE), 
    LD)
  cat(paste0("\nPlotting region surrounding ", VARIANT," on ",tempCHR,":",tempSTART,"-",tempEND,".\n"))

  p1 <- mirrorPlotRACER(assoc_data1 = temp_f_ld1, 
                        assoc_data2 = temp_f_ld2, 
                        chr = tempCHR, 
                        name1 = "European ancestry", 
                        name2 = "African-American ancestry", 
                        plotby = "coord", 
                        start_plot = tempSTART, end_plot = tempEND,
                        label_lead = TRUE)
  
  print(p1)
  
  p2 <- scatterPlotRACER(assoc_data1 = temp_f_ld1, 
                         assoc_data2 = temp_f_ld2, 
                         chr = tempCHR, 
                         name1 = "European ancestry", 
                         name2 = "African-American ancestry", 
                         region_start = tempSTART, 
                         region_end = tempEND, 
                         ld_df = 1, 
                         label = TRUE)
  
  print(p2)
  
  cat(paste0("Saving image for ", VARIANT,".\n"))
  ggsave(filename = paste0(RACER_EA_vs_AA_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_mirror.EA_vs_AA.png"), plot = p1)
  ggsave(filename = paste0(RACER_EA_vs_AA_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_mirror.EA_vs_AA.pdf"), plot = p1)
  ggsave(filename = paste0(RACER_EA_vs_AA_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_mirror.EA_vs_AA.eps"), plot = p1)

  cat(paste0("Saving image for ", VARIANT,".\n"))
  ggsave(filename = paste0(RACER_EA_vs_AA_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_scatter.EA_vs_AA.png"), plot = p2)
  ggsave(filename = paste0(RACER_EA_vs_AA_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_scatter.EA_vs_AA.pdf"), plot = p2)
  ggsave(filename = paste0(RACER_EA_vs_AA_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_scatter.EA_vs_AA.eps"), plot = p2)


  rm(temp1, temp2,
     p1,p2,
     temp_f1, temp_f_ld1,
     temp_f2, temp_f_ld2,
     tempCHR, tempSTART, tempEND,
     VARIANT, tempVARIANTnr)
  
}

```
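
Conceptually, the scatter plot compares the association strength of SNPs that are present in both ancestries. A rough base-R sketch of that comparison on invented data is given below; this is not how RACER implements its plot.

```r
# Toy p-values per ancestry (invented values).
ea_toy <- data.frame(rsID = c("rs1", "rs2", "rs3", "rs4"),
                     Pvalue = c(1e-8, 1e-4, 0.05, 0.5),
                     stringsAsFactors = FALSE)
aa_toy <- data.frame(rsID = c("rs2", "rs3", "rs4", "rs5"),
                     Pvalue = c(1e-3, 0.01, 0.6, 0.2),
                     stringsAsFactors = FALSE)

# Keep only SNPs present in both data sets.
shared <- merge(ea_toy, aa_toy, by = "rsID", suffixes = c(".EA", ".AA"))

# Compare association strength on the -log10(p) scale.
shared$logP.EA <- -log10(shared$Pvalue.EA)
shared$logP.AA <- -log10(shared$Pvalue.AA)

cor(shared$logP.EA, shared$logP.AA)
```

A high correlation suggests similar association patterns in the region, but it is only descriptive; the formal test is the colocalization analysis.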

### 9p21 

```{r}

# chr9-rs4977575: rs7470682 is the most significant but not present in LDlink; alternative rs4576508 is present and plotted for AA ONLY!!!
variants_of_interest_fewgenes_aa_9p21 <- c("rs4977575")

for(VARIANT in variants_of_interest_fewgenes_aa_9p21){
  cat(paste0("Getting data for ", VARIANT,".\n"))

  tempCHR <- subset(variant_list, rsID == VARIANT)[,5]
  tempSTART <- subset(variant_list, rsID == VARIANT)[,18]
  tempEND <- subset(variant_list, rsID == VARIANT)[,19]
  tempVARIANTnr <- subset(variant_list, rsID == VARIANT)[,1]

  cat("\nSubset required data.\n")
  temp1 <- subset(gwas_sumstats_racer_EA, Chr == tempCHR & (Position >= tempSTART & Position <= tempEND))
  temp2 <- subset(gwas_sumstats_racer_AA, Chr == tempCHR & (Position >= tempSTART & Position <= tempEND))

  cat("\nFormatting association data.\n")
  temp_f1 = RACER::formatRACER(assoc_data = temp1, chr_col = 3, pos_col = 4, p_col = 5)
  temp_f2 = RACER::formatRACER(assoc_data = temp2, chr_col = 3, pos_col = 4, p_col = 5)

  # cat("\nGetting LD data.\n")
  temp_f_ld1 =
    data.table::setorder( # this fixes an issue where the SNPs with LD = NA are plotted last and it appears many SNPs are not present in 1000G.
    RACER::ldRACER(assoc_data = temp_f1, rs_col = 2, pops = "EUR",
                   # lead_snp = VARIANT,
                   auto_snp = TRUE),
    LD)

  temp_f_ld2 =
    data.table::setorder( # this fixes an issue where the SNPs with LD = NA are plotted last and it appears many SNPs are not present in 1000G.
    RACER::ldRACER(assoc_data = temp_f2, rs_col = 2, pops = "AFR",
                   lead_snp = "rs4576508",
                   auto_snp = FALSE),
    LD)
  cat(paste0("\nPlotting region surrounding ", VARIANT," on ",tempCHR,":",tempSTART,"-",tempEND,".\n"))

  p1 <- mirrorPlotRACER(assoc_data1 = temp_f_ld1,
                        assoc_data2 = temp_f_ld2,
                        chr = tempCHR,
                        name1 = "European ancestry",
                        name2 = "African-American ancestry",
                        plotby = "coord",
                        start_plot = tempSTART, end_plot = tempEND,
                        label_lead = TRUE)

  print(p1)

  p2 <- scatterPlotRACER(assoc_data1 = temp_f_ld1,
                         assoc_data2 = temp_f_ld2,
                         chr = tempCHR,
                         name1 = "European ancestry",
                         name2 = "African-American ancestry",
                         region_start = tempSTART,
                         region_end = tempEND,
                         ld_df = 1,
                         label = TRUE)

  print(p2)

  cat(paste0("Saving image for ", VARIANT,".\n"))
  ggsave(filename = paste0(RACER_EA_vs_AA_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_mirror.EA_vs_AA.png"), plot = p1)
  ggsave(filename = paste0(RACER_EA_vs_AA_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_mirror.EA_vs_AA.pdf"), plot = p1)
  ggsave(filename = paste0(RACER_EA_vs_AA_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_mirror.EA_vs_AA.eps"), plot = p1)

  cat(paste0("Saving image for ", VARIANT,".\n"))
  ggsave(filename = paste0(RACER_EA_vs_AA_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_scatter.EA_vs_AA.png"), plot = p2)
  ggsave(filename = paste0(RACER_EA_vs_AA_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_scatter.EA_vs_AA.pdf"), plot = p2)
  ggsave(filename = paste0(RACER_EA_vs_AA_loc, "/", tempVARIANTnr, ".", Today, ".",VARIANT,".regional_scatter.EA_vs_AA.eps"), plot = p2)


  rm(temp1, temp2,
     p1,p2,
     temp_f1, temp_f_ld1,
     temp_f2, temp_f_ld2,
     tempCHR, tempSTART, tempEND,
     VARIANT, tempVARIANTnr)

}
```


## Quantify colocalization

We want to quantify the overlap. 

Setting up an output directory for `coloc`.

```{r}
if (!dir.exists(file.path(PROJECT_loc, "COLOC_EA_vs_AA"))) {
  dir.create(file.path(PROJECT_loc, "COLOC_EA_vs_AA"))
}
COLOC_EA_vs_AA_loc = paste0(PROJECT_loc,"/COLOC_EA_vs_AA")

```
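
The same create-if-missing pattern is used for every output directory in this notebook. It could be wrapped in a small helper, sketched here; `ensure_dir()` is a hypothetical name and not a function from `scripts/functions.R`.

```r
# Create a directory if it does not exist yet and return its path invisibly.
ensure_dir <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  invisible(path)
}

# Demonstration in a temporary location.
demo_dir <- file.path(tempdir(), "COLOC_demo")
ensure_dir(demo_dir)
ensure_dir(demo_dir) # a second call does nothing

dir.exists(demo_dir)
```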


### Data collection

Preparing data for `coloc`: we require beta and standard errors for `coloc`.

First the European ancestry data, next the African-American ancestry data.

```{r}
# EA
# gwas_sumstats_EA <- fread(paste0(GWAS_loc,"/EA/CAC1000G_EA_FINAL_FUMA.txt.gz"),
#                          showProgress = TRUE)
# names(gwas_sumstats_EA)[names(gwas_sumstats_EA) == "Pos"] <- "Position"
# saveRDS(gwas_sumstats_EA, file = paste0(OUT_loc, "/gwas_sumstats.EA.rds"))
# gwas_sumstats_EA <- readRDS(file = paste0(OUT_loc, "/gwas_sumstats.EA.rds"))
# gwas_sumstats_coloc_EA <- subset(gwas_sumstats_EA,
#                               select = c("MarkerName", "rsID", "Chr", "Position", 
#                                          "Effect", "StdErr",
#                                          "Pvalue",
#                                          "N"))
# rm(gwas_sumstats_EA)
# saveRDS(gwas_sumstats_coloc_EA, file = paste0(OUT_loc, "/gwas_sumstats_coloc.EA.rds"))
gwas_sumstats_coloc_EA <- readRDS(file = paste0(OUT_loc, "/gwas_sumstats_coloc.EA.rds"))

# AA
# gwas_sumstats_AA <- fread(paste0(GWAS_loc,"/AA/CAC1000G_AA_FINAL_FUMA.txt.gz"),
#                          showProgress = TRUE)
# names(gwas_sumstats_AA)[names(gwas_sumstats_AA) == "Pos"] <- "Position"
# saveRDS(gwas_sumstats_AA, file = paste0(OUT_loc, "/gwas_sumstats.AA.rds"))
# gwas_sumstats_AA <- readRDS(file = paste0(OUT_loc, "/gwas_sumstats.AA.rds"))
# names(gwas_sumstats_AA)[names(gwas_sumstats_AA) == "SE"] <- "StdErr"
# gwas_sumstats_coloc_AA <- subset(gwas_sumstats_AA,
#                               select = c("MarkerName", "rsID", "Chr", "Position", 
#                                          "Effect", "StdErr",
#                                          "Pvalue",
#                                          "N"))
# rm(gwas_sumstats_AA)
# saveRDS(gwas_sumstats_coloc_AA, file = paste0(OUT_loc, "/gwas_sumstats_coloc.AA.rds"))
gwas_sumstats_coloc_AA <- readRDS(file = paste0(OUT_loc, "/gwas_sumstats_coloc.AA.rds"))

```
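
`coloc` works on a list per dataset rather than on a data frame, with the effect sizes and their variances (the squared standard errors) as separate elements. A minimal sketch of that reshaping on invented data follows; `to_coloc_list()` is a hypothetical helper, and `sdY = 1` is an assumption about the scale of the trait.

```r
# Toy summary statistics using the column names from this notebook.
sumstats_toy <- data.frame(
  rsID     = c("rs1", "rs2", "rs3"),
  Position = c(100, 200, 300),
  Effect   = c(0.02, -0.10, 0.05),
  StdErr   = c(0.03, 0.02, 0.04),
  stringsAsFactors = FALSE
)

# Reshape a data frame into a coloc-style list for a quantitative trait.
to_coloc_list <- function(df, type = "quant", sdY = 1) {
  list(
    snp      = df$rsID,
    position = df$Position,
    beta     = df$Effect,
    varbeta  = df$StdErr^2, # variance of beta = squared standard error
    type     = type,
    sdY      = sdY          # assumed standard deviation of the trait
  )
}

dataset_toy <- to_coloc_list(sumstats_toy)
str(dataset_toy)
```

A list built this way could then be inspected with `coloc::check_dataset()` before running the analysis.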

### Colocalization

Now we are ready to formally test for colocalization at each locus. Note that `trait 1` = European ancestry; `trait 2` = African-American ancestry.

```{r}
for(VARIANT in variants_of_interest){
  cat(paste0("Getting data for ", VARIANT,".\n"))

  tempCHR <- subset(variant_list, rsID == VARIANT)[,5]
  tempSTART <- subset(variant_list, rsID == VARIANT)[,18]
  tempEND <- subset(variant_list, rsID == VARIANT)[,19]
  tempVARIANTnr <- subset(variant_list, rsID == VARIANT)[,1]

  cat("\nSubset required data.\n")
  temp1 <- subset(gwas_sumstats_coloc_EA, Chr == tempCHR & (Position >= tempSTART & Position <= tempEND))
  temp2 <- subset(gwas_sumstats_coloc_AA, Chr == tempCHR & (Position >= tempSTART & Position <= tempEND))

  cat("\nCheck temp1 data.\n")
  temp1 <- rename_with(temp1, tolower)
  
  # correcting column names
  temp1 <- rename(temp1, beta = effect)
  temp1 <- rename(temp1, se = stderr)
  temp1 <- rename(temp1, snp = rsid)
  
  # calculating things
  temp1$varbeta <- temp1$se^2
  
  temp1 <- as.list(temp1) # critical, as coloc expects a list of variables
  
  temp1$type <- "quant"
  temp1$sdY <- 1

  coloc::check_dataset(temp1, warn.minp = 1e-10)
  
  cat("\nCheck temp2 data.\n")
  temp2 <- rename_with(temp2, tolower)
  
  # correcting column names
  temp2 <- rename(temp2, beta = effect)
  temp2 <- rename(temp2, se = stderr)
  temp2 <- rename(temp2, snp = rsid)
  
  # calculating things
  temp2$varbeta <- temp2$se^2
  
  temp2 <- as.list(temp2) # critical, as coloc expects a list of variables
  
  temp2$type <- "quant"
  temp2$sdY <- 1

  coloc::check_dataset(temp2, warn.minp = 1e-10)
  
  cat("\nPlot required data.\n")
  coloc::plot_dataset(temp1)
  coloc::plot_dataset(temp2)

  # announce the step before running it, so the log reads in order
  cat("\nColocalization.\n")
  res_temp1_vs_temp2_single <- coloc::coloc.abf(dataset1 = temp1, 
                                                dataset2 = temp2)

  print(res_temp1_vs_temp2_single)
  write_lines(res_temp1_vs_temp2_single, file = paste0(COLOC_EA_vs_AA_loc, "/res_EA_vs_AA_single.",
                                                       tempVARIANTnr,".",tempCHR,"_",tempSTART,"_",tempEND,".txt"))

  coloc::sensitivity(res_temp1_vs_temp2_single, "H4 > 0.9")
  
  # Step 1: Call the pdf command to start the plot
  pdf(file = paste0(COLOC_EA_vs_AA_loc, "/res_EA_vs_AA_single.",
                    tempVARIANTnr,".",tempCHR,"_",tempSTART,"_",tempEND,".pdf"))   # The directory you want to save the file in
  # Step 2: Create the plot with R code
  coloc::sensitivity(res_temp1_vs_temp2_single, "H4 > 0.9")

  # Step 3: Run dev.off() to create the file!
  dev.off()
  
  rm(temp1, temp2,
     # p1,p2,
     # temp_f1, temp_f_ld1,
     # temp_f2, temp_f_ld2,
     tempCHR, tempSTART, tempEND,
     VARIANT, tempVARIANTnr)
  
}

```
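
The loop above prepares `temp1` and `temp2` with identical steps (rename columns, square the standard error, convert to a list, set `type` and `sdY`). As a minimal sketch, those steps could be wrapped in one helper. The function name `prepare_coloc_dataset` and the toy data are hypothetical and not part of the original notebook; `sdY = 1` is kept from the loop above and assumes a standardized trait.

```r
# Hypothetical helper (not in the original notebook): convert a slice of GWAS
# summary statistics into the list layout that the loop above builds by hand.
prepare_coloc_dataset <- function(sumstats, type = "quant", sdY = 1) {
  sumstats <- as.data.frame(sumstats)  # also accepts a data.table from fread()
  required <- c("rsID", "Position", "Effect", "StdErr", "N")
  missing_cols <- setdiff(required, names(sumstats))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  list(
    snp      = as.character(sumstats$rsID),
    position = sumstats$Position,
    beta     = sumstats$Effect,
    varbeta  = sumstats$StdErr^2,  # variance of the effect estimate = SE^2
    N        = sumstats$N,
    type     = type,
    sdY      = sdY                 # assumption carried over from the loop
  )
}

# Toy input, invented purely for illustration
toy <- data.frame(
  rsID     = c("rs1", "rs2", "rs3"),
  Chr      = 6,
  Position = c(100, 200, 300),
  Effect   = c(0.10, -0.05, 0.02),
  StdErr   = c(0.02, 0.03, 0.01),
  N        = 1000
)
toy_dataset <- prepare_coloc_dataset(toy)
str(toy_dataset)
```

With such a helper, each dataset in the loop would reduce to one call followed by `coloc::check_dataset()`, which keeps the two ancestries from drifting apart if the preparation ever changes.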

### Summary colocalization
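
Each result file in this section holds the five posterior probabilities that coloc reports. With trait 1 = European ancestry and trait 2 = African-American ancestry, they can be read as in the sketch below. The names `coloc_hypotheses` and `describe_coloc` are illustrative and not part of the original notebook; the example vector is copied from the first locus summarized in this section.

```r
# Meaning of the five coloc hypotheses in this cross-ancestry comparison
# (trait 1 = European ancestry, trait 2 = African-American ancestry).
coloc_hypotheses <- c(
  PP.H0.abf = "no association in either ancestry",
  PP.H1.abf = "association in European ancestry only",
  PP.H2.abf = "association in African-American ancestry only",
  PP.H3.abf = "association in both ancestries, different causal variants",
  PP.H4.abf = "association in both ancestries, shared causal variant"
)

# Hypothetical helper: express each posterior probability as a rounded percentage.
describe_coloc <- function(pp) {
  pp <- pp[names(coloc_hypotheses)]  # drops nsnps, keeps H0..H4 in order
  data.frame(
    hypothesis = names(pp),
    meaning    = unname(coloc_hypotheses),
    percent    = round(100 * unname(pp), 1),
    row.names  = NULL
  )
}

# Summary vector of the ENPP1/ENPP3 locus
enpp1 <- c(nsnps = 891, PP.H0.abf = 0.212726180436945,
           PP.H1.abf = 0.0690743673677815, PP.H2.abf = 0.0214723764600455,
           PP.H3.abf = 0.00628185511784578, PP.H4.abf = 0.690445220617383)
enpp1_table <- describe_coloc(enpp1)
print(enpp1_table)
```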

_ENPP1_/_ENPP3_ locus

File: res_EA_vs_AA_single.1.6_131595002_132595002.txt

c(nsnps = 891, PP.H0.abf = 0.212726180436945, PP.H1.abf = 0.0690743673677815, PP.H2.abf = 0.0214723764600455, PP.H3.abf = 0.00628185511784578, PP.H4.abf = 0.690445220617383)

**Conclusion: 69.0% probability that the locus is shared between both ancestries and includes the same causal variant.**


_IGFBP3_ locus

File: res_EA_vs_AA_single.2.7_45460645_46460645.txt

c(nsnps = 1147, PP.H0.abf = 0.598769555553722, PP.H1.abf = 0.108030020950978, PP.H2.abf = 0.0685581712636257, PP.H3.abf = 0.0121567818199419, PP.H4.abf = 0.212485470411732)

**Conclusion: 59.9% probability that the locus is not associated in either ancestry, 10.8% probability that it is European-specific, and 21.2% probability that it is shared between both ancestries and includes the same causal variant.**


_CXCL12_ locus

File: res_EA_vs_AA_single.3.10_44015716_45334720.txt

c(nsnps = 1641, PP.H0.abf = 0.00131956785769608, PP.H1.abf = 0.847689205708678, PP.H2.abf = 0.000166925872986017, PP.H3.abf = 0.107189395862512, PP.H4.abf = 0.0436349046981281)

**Conclusion: 84.8% probability that the locus is European-specific, but 4.4% (same causal variant) to 10.7% (different causal variants) probability that the locus is shared between both ancestries.**


_ARID5B_ locus

File: res_EA_vs_AA_single.4.10_63336088_64336088.txt

c(nsnps = 830, PP.H0.abf = 0.153686217856268, PP.H1.abf = 0.52399742579072, PP.H2.abf = 0.0155386299541433, PP.H3.abf = 0.0527253377600635, PP.H4.abf = 0.254052388638805)

**Conclusion: 52.4% probability that the locus is European-specific, but 25.4% probability that the locus is shared between both ancestries and includes the same causal variant.**


_ADK_ locus

File: res_EA_vs_AA_single.5.10_75417431_76417431.txt

c(nsnps = 263, PP.H0.abf = 0.0607706367611815, PP.H1.abf = 0.85213066483438, PP.H2.abf = 0.00166979761323014, PP.H3.abf = 0.0233519569734565, PP.H4.abf = 0.0620769438177522)

**Conclusion: 85.2% probability that the locus is European-specific, but 2.3% (different causal variants) to 6.2% (same causal variant) probability that the locus is shared between both ancestries.**


_FGF23_ locus

File: res_EA_vs_AA_single.6.12_3986618_4986618.txt

c(nsnps = 916, PP.H0.abf = 0.00446472426354749, PP.H1.abf = 0.706936091086602, PP.H2.abf = 0.00104796032447378, PP.H3.abf = 0.165810337258493, PP.H4.abf = 0.121740887066883)

**Conclusion: 70.7% probability that the locus is European-specific, but 12.2% (same causal variant) to 16.6% (different causal variants) probability that the locus is shared between both ancestries.**


_COL4A1_/_COL4A2_ locus

File: res_EA_vs_AA_single.7.13_110549623_111549623.txt

c(nsnps = 1401, PP.H0.abf = 0.10787176608552, PP.H1.abf = 0.739004164794562, PP.H2.abf = 0.0141999398787617, PP.H3.abf = 0.0972387717170162, PP.H4.abf = 0.0416853575241402)

**Conclusion: 73.9% probability that the locus is European-specific, but 4.2% (same causal variant) to 9.7% (different causal variants) probability that the locus is shared between both ancestries.**


_MORF4L_ locus

File: res_EA_vs_AA_single.8.15_78623946_79623946.txt

c(nsnps = 984, PP.H0.abf = 9.92862156724414e-05, PP.H1.abf = 0.811395804353311, PP.H2.abf = 1.82061468585998e-05, PP.H3.abf = 0.148746181818894, PP.H4.abf = 0.0397405214652622)

**Conclusion: 81.1% probability that the locus is European-specific, but 14.9% probability that the locus is shared between both ancestries and includes different causal variants.**


_PHACTR1_ locus

File: res_EA_vs_AA_single.9.6_12403957_13403957.txt

c(nsnps = 1082, PP.H0.abf = 2.64349916808974e-21, PP.H1.abf = 0.440110764390371, PP.H2.abf = 1.85706249346447e-22, PP.H3.abf = 0.0303883523691947, PP.H4.abf = 0.529500883240437)

**Conclusion: 44.0% probability that the locus is European-specific, but 53.0% probability that the locus is shared between both ancestries and includes the same causal variant.**


_CDKN2A_/_CDKN2B_ locus

File: res_EA_vs_AA_single.10.9_21624744_22624744.txt

c(nsnps = 1175, PP.H0.abf = 3.8194848005248e-38, PP.H1.abf = 0.40370772152903, PP.H2.abf = 3.70349960775562e-39, PP.H3.abf = 0.0385871394254695, PP.H4.abf = 0.557705139045494)

**Conclusion: 40.4% probability that the locus is European-specific, but 55.8% probability that the locus is shared between both ancestries and includes the same causal variant.**


_APOE_ locus

File: res_EA_vs_AA_single.11.19_44912079_45912079.txt

c(nsnps = 958, PP.H0.abf = 0.0109451663563229, PP.H1.abf = 0.649124092491782, PP.H2.abf = 0.00401834569685491, PP.H3.abf = 0.238218007619045, PP.H4.abf = 0.0976943878359947)

**Conclusion: 64.9% probability that the locus is European-specific, but 23.8% probability that the locus is shared between both ancestries and includes different causal variants.**

Overarching conclusions:

1. The European-specific analysis is better powered than the African-American-specific analysis.
2. Taking into account the issues of power, it is more likely that the majority of loci are shared between ancestries. The _ADK_ and _COL4A1/COL4A2_ loci are the least shared between ancestries, and both analyses are underpowered for the _IGFBP3_ locus. 
3. With larger and more balanced sample sizes across ancestries, firmer conclusions could be drawn; applying Occam's razor, however, it is more likely that these loci are shared between ancestries than not.
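
To see the pattern behind these conclusions at a glance, the per-locus posterior probabilities can be tabulated by their most probable hypothesis. This is a minimal sketch: the values are copied from the summaries above and rounded to four decimals, and the object names `coloc_pp` and `top_hypothesis` are illustrative rather than part of the original notebook.

```r
# Posterior probabilities per locus (rows) and hypothesis (columns),
# copied from the result summaries above and rounded to four decimals.
coloc_pp <- matrix(
  c(0.2127, 0.0691, 0.0215, 0.0063, 0.6904,   # ENPP1/ENPP3
    0.5988, 0.1080, 0.0686, 0.0122, 0.2125,   # IGFBP3
    0.0013, 0.8477, 0.0002, 0.1072, 0.0436,   # CXCL12
    0.1537, 0.5240, 0.0155, 0.0527, 0.2541,   # ARID5B
    0.0608, 0.8521, 0.0017, 0.0234, 0.0621,   # ADK
    0.0045, 0.7069, 0.0010, 0.1658, 0.1217,   # FGF23
    0.1079, 0.7390, 0.0142, 0.0972, 0.0417,   # COL4A1/COL4A2
    0.0001, 0.8114, 0.0000, 0.1487, 0.0397,   # MORF4L
    0.0000, 0.4401, 0.0000, 0.0304, 0.5295,   # PHACTR1
    0.0000, 0.4037, 0.0000, 0.0386, 0.5577,   # CDKN2A/CDKN2B
    0.0109, 0.6491, 0.0040, 0.2382, 0.0977),  # APOE
  ncol = 5, byrow = TRUE,
  dimnames = list(
    c("ENPP1/ENPP3", "IGFBP3", "CXCL12", "ARID5B", "ADK", "FGF23",
      "COL4A1/COL4A2", "MORF4L", "PHACTR1", "CDKN2A/CDKN2B", "APOE"),
    c("H0", "H1", "H2", "H3", "H4")
  )
)

# Most probable hypothesis per locus
top_hypothesis <- colnames(coloc_pp)[max.col(coloc_pp, ties.method = "first")]
names(top_hypothesis) <- rownames(coloc_pp)
print(top_hypothesis)

# Number of loci per most probable hypothesis
top_counts <- table(top_hypothesis)
print(top_counts)
```

A peak at H1 does not by itself separate a truly European-specific locus from an underpowered African-American analysis, which is the caveat raised in the first point above.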

### Cleanup

```{r}
rm(gwas_sumstats_coloc_AA,
   gwas_sumstats_racer_AA,
   gwas_sumstats_coloc_EA,
   gwas_sumstats_racer_EA)
```


# Session information

------

    Version:      v1.5.0
    Last update:  2023-05-04
    Written by:   Sander W. van der Laan (s.w.vanderlaan-2[at]umcutrecht.nl).
    Description:  Script to create regional association plots.
    Minimum requirements: R version 3.4.3 (2017-06-30) -- 'Single Candle', Mac OS X El Capitan
    
    Changes log
    * v1.5.0 Fixed an issue with LD r2 plotting.
    * v1.4.2 Added formal quantification of colocalization between ancestries.
    * v1.4.1 Added mirror plots and scatter plots.
    * v1.4.0 Update with AA data.
    * v1.3.0 Added the credible sets to the additional regions.
    * v1.2.0 Added in additional regions.
    * v1.1.0 Created PNG and PDF of top loci regions.
    * v1.0.0 Initial version. 

------

```{r eval = TRUE}
sessionInfo()

```


# Saving environment
```{r Saving}
save.image(paste0(PROJECT_loc, "/",Today,".",PROJECTNAME,".RegionalAssociationPlots.RData"))

```


------
<sup>&copy; 1979-2023 Sander W. van der Laan | s.w.vanderlaan[at]gmail.com | [swvanderlaan.github.io](https://vanderlaan.science).</sup>
------