README.Rmd

---
title: "A sparklyr extension for Hail"
output:
  github_document:
    fig_width: 9
    fig_height: 5
---

[Hail](https://hail.is/) is an open-source, general-purpose, Python-based data analysis tool with additional data types and methods for working with genomic data. Hail is built to scale and has first-class support for multi-dimensional structured data, like the genomic data in a genome-wide association study (GWAS). Hail is exposed as a Python library, using primitives for distributed queries and linear algebra implemented in Scala, Spark, and increasingly C++.

The `sparkhail` is a R extension using [sparklyr](https://spark.rstudio.com/) package. The idea is to help R users to use Hail functionalities with the well know [tidyverse](https://www.tidyverse.org/) sintax. In this README we are going to reproduce the [GWAS tutorial](https://hail.is/docs/0.2/tutorials/01-genome-wide-association-study.html) using `sparkhail`, `sparklyr`, `dplyr` and `ggplot2`.     

## Installation

You can install the sparkhail package from CRAN as follows:

```{r eval=FALSE}
install.packages("sparkhail")
```

To upgrade to the latest version of sparkhail, run the following command and restart your R session:

```{r eval=FALSE}
install.packages("devtools")
devtools::install_github("r-spark/sparkhail")
```

You can install Hail manually or using `hail_install()`.

```{r eval=FALSE}
sparkhail::hail_install()
```

## Read a matrix table

The data in Hail is naturally represented as a Hail [MatrixTable](https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable). The `sparkhail` converts the MatrixTable to dataframe, in this way is easier to manipulate the data using `dplyr`. 

```{r read_matrix}
library(sparkhail)
library(sparklyr)

sc <- spark_connect(master = "local", version = "2.4", config = hail_config())

hl <- hail_context(sc)
mt <- hail_read_matrix(hl, system.file("extdata/1kg.mt", package = "sparkhail"))
```

Convert to spark Data Frame as follows

```{r}
df <- hail_dataframe(mt)
```

## Getting to know our data

You can see the data structure using `glimpse()`.

```{r message=FALSE}
library(dplyr)
glimpse(df)
```

It’s important to have easy ways to slice, dice, query, and summarize a dataset. The conversion to dataframe is a good way to use `dplyr` verbs. Let's see some examples.

```{r select_row}
df %>% 
  dplyr::select(locus, alleles) %>% 
  head(5)
```

Here is how to peek at the first few sample IDs:

```{r get_s}
s <- hail_ids(mt)
s
```

The genotype calls are in `entries` column and we can see it using `hail_entries()` function. This [function](https://github.com/samuelmacedo83/sparkhail/blob/master/R/hail_entries.R) selects and explodes the data frame using `sparklyr.nested`.
```{r}
hail_entries(df)
```

## Adding column fields

A Hail MatrixTable can have any number of row fields and column fields for storing data associated with each row and column. Annotations are usually a critical part of any genetic study. Column fields are where you’ll store information about sample phenotypes, ancestry, sex, and covariates. Row fields can be used to store information like gene membership and functional impact for use in QC or analysis.

The file provided contains the sample ID, the population and “super-population” designations, the sample sex, and two simulated phenotypes (one binary, one discrete).

This file is a standard text file and can be imported using `sparklyr`.

```{r read_annotations}
annotations <- spark_read_csv(sc, "table", 
                              path = system.file("extdata/1kg_annotations.txt",
                                                 package = "sparkhail"),
                              overwrite = TRUE, 
                              delimiter = "\t")
```

A good way to peek at the structure of a Table is to look at its schema.
```{r annotations_schema}
glimpse(annotations)
```

Now we’ll use this table to add sample annotations to our dataset. To merge these data we can use joins. 

```{r annotations_sample}
annotations_sample <- inner_join(s, annotations, by = c("s" = "Sample"))
```

## Query functions

We will start by looking at some statistics of the information in our data. We can aggregate using `group_by()` and count the number of occurrences using `tally()`. 

```{r}
annotations %>%
  group_by(SuperPopulation) %>%
  tally()
```

We can use `sdf_describe()` to see the summary statistics of the data. 

```{r}
sdf_describe(annotations)
```

However, these metrics aren’t perfectly representative of the samples in our dataset. Here’s why:

```{r}
sdf_nrow(annotations)
sdf_nrow(annotations_sample)
```

Since there are fewer samples in our dataset than in the full thousand genomes cohort, we need to look at sample annotations on the dataset. 

```{r}
annotations_sample %>%
  group_by(SuperPopulation) %>%
  tally()

sdf_describe(annotations)
```

Let's see another example, now we are going to calculate the counts of each of the 12 possible unique SNPs (4 choices for the reference base * 3 choices for the alternate base). To do this, we need to get the alternate allele of each variant and then count the occurences of each unique ref/alt pair. The alleles column is nested, because of this, we need to separete this column using `sdf_separate_column()`.

```{r alleles_count}
df %>% 
  sdf_separate_column("alleles") %>% 
  group_by(alleles_1, alleles_2) %>% 
  tally() %>% 
  arrange(-n)
```

It’s nice to see that we can actually uncover something biological from this small dataset: we see that these frequencies come in pairs. C/T and G/A are actually the same mutation, just viewed from from opposite strands. Likewise, T/A and A/T are the same mutation on opposite strands. There’s a 30x difference between the frequency of C/T and A/T SNPs. Why?

The last example, what about genotypes? Hail can query the collection of all genotypes in the dataset, and this is getting large even for our tiny dataset. Our 284 samples and 10,000 variants produce 10 million unique genotypes. Let's plot this using the `ggplot2` and `dbplot` package.

```{r dp_plot, message=FALSE, warning=FALSE}
library(dbplot)
library(ggplot2)
library(sparklyr.nested) # to access DP nested in info column

df %>% 
  sdf_select(DP = info.DP) %>% 
  dbplot_histogram(DP) +
  labs(title = "Histogram for DP", y = "Frequency")
```

## Cleanup

Then disconenct from Spark and Hail,

```{r, eval=FALSE}
spark_disconnect(sc)
```