09_1_Iranges_Granges_intro.Rmd

---
title: "9_1_Iranges_GRanges_basics"
author: "JR"
date: "12/8/2020"
output: html_document
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(echo = TRUE)
library(GenomicRanges)
library(tidyverse)
library(Gviz)
library(IRanges)
source("util/_setup.R")
source("util/intersect_functions.R")
source("util/plotting_functions.R")
```

Goal: to understand the basic usage of Iranges and Granges. These are core to almost
all analyses we will perform so it's worth taking sometime to review the fundementals.

? Iranges
? GRanges

These two packages both keep track and can be operated on are: 
start
end
width

You really only need two of these to infer the third. In our case the start and stop
could be a long list of the start and stop of all ChiP-peaks for a given DBP.

Let's make a list of ranges and see what this looks like.
* Note plotRanges is a common function to add to plotting_functions.R

```{r}
# Let's use an example of 2 ChiP-peak-files each with three peaks. We can use
# this simple example to construct and intersect ranges -- a common procedure

ir1 <- IRanges(start = c(1,200, 1000), end = c(30, 299, 1200))
ir1
start(ir1)
end(ir1)
width(ir1)
plotRanges(ir1)

# Cool let's make a second set of peak ranges.

ir2 <- IRanges(start = c(17,100, 1100), end = c(49, 199, 1201))
ir2 
plotRanges(ir2)


# nice, so let's concatonate these and plot:
ir3 <- c(ir1, ir2)
plotRanges(ir3)


# The nice thing about IRanges as we can track values associated with each range
# for example we may want to name these replicate 1 and replicate 2
# let's give it a try

names(ir1) <- paste("replicate_1", 1:3, sep = " ")
ir1
# Nice we can see we didn't loose any infomation but we now have a new "level"
# to index that is the name of the range or @NAMES in environment. Let's do the 
# same for ir2.

names(ir2) <- paste("replicate_2", 1:3, sep = " ")


# ok we can see the first and third peak are overlapping but not the second.
# lets find the overlaps between the peaks using "reduce" which will keep the 
# overlapping regions as a range. "findOverlaps" is really powerful but we will
# loose some information as we will only get the index of query and hit overlaps
# for each entry in the two ranges. 

# let's find the overlaps using the 'reduce' function in IRanges. This will
# reduce the overlaps in two IRanges to the largest length of start and end values.

# union

ov1 <- union(ir1, ir2)
plotRanges(ov1)
ov1
# This will merge all the data from the two ranges into 1 IRange. This is not
# exactly what we want for our purposes but good to know you can merge all the
# data with union


# intersect

ov3 <- intersect(ir1, ir2)
plotRanges(ov3)

# cool, so this is typically what we want! This now taking the intersection of the
# two ranges or similar to what we will do to create "consensus peaks" from 
# replicate ChiP-files.
ov3
# we see that the two overlapping ranges are from 17-30 and 1,100-1,200.

# findOverlaps
?finOverlaps
args(findOverlaps)


ov4 <- findOverlaps(ir1, ir2)
plotRanges(ov4)
# ERROR! The output of findOverlaps is a matrix output. It is a matrix of the
# the indices that overlapped in the query and subject ranges.
# let's look closer
ov4
# we see that the first and third entries of ir1 and ir2 are overlapping.
# but that is all the info we get, but these indexes are very powerful.


# lets take a look at ways of accessing findOverlaps output.

countOverlaps(ir1, ir2)
# This is pretty handy and includes out meta-data replicate
# Note that if there were multiple overlaps for a given peak we would see multiple
# enteries. For example let's say there were two peak in ir1 that overlap the 
# third entry we would see a two instead of 1.


# That's it for the very basics of what IRanges is doing and how it would be 
# applicable to our chip-peak files.

```

The lesson above is great if there is only one chromosome in the genome :) 
However, there are multiple chromosomes in the genome and there are two strands
of DNA that we need to keep track of.

The GenomicRanges package was developed to include the addition information of chr1 and other aspects associated with a given interval in IRanges. Essentially GenomicRanges is a genome specific version of IRanges. 

Let's create a GRange using IRange logic. What we need is:

Chromosome
Strand
Ranges

```{r}

# what we need is chromosome, strand, and Ranges.
# let's put it together:

# ?GRanges

gr <- GRanges(seqnames = c("chrX"), strand = c("+", "-", "*"), ranges = IRanges(start = c(1,200, 1000), end = c(30, 299, 1200)))

# let's see what we got here
gr
# the first thing to note is we see a unique identifier for the X chromosome as
# 'seqnames' , 'ranges' , 'strand' These are all indexable as we can see in the
# environment. At the core is IRanges. We can index into GRanges with several
# commands but here is a useful one 'seqinfo'

seqinfo(gr)
# we can see some additional place holders we have not changed such as the 
# genome version or if the chromosome is circular. Seqlenght is the lenght of 
# the chromosome. let's add this stuff !

seqlengths(gr) <- c("chrX" = 100000)
seqinfo(gr)
# Now we have the length of the chromosome.

seqlevels(gr)
seqlengths(gr)
# we see that we only have one 'level' here for ChrX, soon we will add another
# chromosome and see we can add many levels that are indexable.
# let's add more levels!

seqlevels(gr) <- c("chrX", "chrY")
seqlevels(gr)
# now we can see that we have a level that contains both the X nad Y chromosome.
gr
# but we don't have any values for the Y chromosome :) let's add some.

# let's add genome:

genome(gr) <- c("hg38")
gr
# now we have a genome, chromsome and strand associated with each feature - that 
# will stay associated no matter how we intersect etc different GRanges.
```