-
Notifications
You must be signed in to change notification settings - Fork 11
Compare the enrichment of 1 histone mark in 2 lists of transcription start sites (TSS)
There are 2 kind of experimental designs that can be used for a metagene
analysis. This document will introduce the first kind, which is the comparison
of two groups of regions from the same ChIP-Seq sample.
For this vignette, we will compare the enrichment of the H3K27ac histone mark in two groups of transcription start sites (TSS): active and bivalent.
http://egg2.wustl.edu/roadmap/web_portal/chr_state_learning.html#exp_18state
We will use the TSS as defined by the Roadmap Consortium (TODO: add ref), for more details on how the data were downloaded, please see the R/get_regions.R file.
#data(TssA) # Active TSS
#data(TssBiv) # Bivalent TSS
load("../data/TssA.RData") # Active TSS
load("../data/TssBiv.RData") # Bivalent TSS
# We will make sure that all the regions have the same size to avoid having to
# scale them during the metagene analysis
TssA <- GenomicRanges::resize(TssA, 500, fix = "center")
TssBiv <- GenomicRanges::resize(TssBiv, 500, fix = "center")
regions <- GenomicRanges::GRangesList(TssA, TssBiv)
names(regions) <- c("TssA", "TssBiv")
# For memory and speed consideration, we will only use a subset of the regions
regions <- lapply(regions, function(x) x[sample(seq_along(x), 1000)])
regions <- GenomicRanges::GRangesList(regions)
regions
## GRangesList object of length 2:
## $TssA
## GRanges object with 1000 ranges and 1 metadata column:
## seqnames ranges strand | name
## <Rle> <IRanges> <Rle> | <character>
## [1] chr12 [ 9102151, 9102650] * | 1_TssA
## [2] chr19 [ 54371651, 54372150] * | 1_TssA
## [3] chr7 [115857251, 115857750] * | 1_TssA
## [4] chr3 [ 53553451, 53553950] * | 1_TssA
## [5] chr7 [102988251, 102988750] * | 1_TssA
## ... ... ... ... ... ...
## [996] chr11 [123102051, 123102550] * | 1_TssA
## [997] chr9 [100985451, 100985950] * | 1_TssA
## [998] chr20 [ 60719151, 60719650] * | 1_TssA
## [999] chr7 [ 1084251, 1084750] * | 1_TssA
## [1000] chr16 [ 19421951, 19422450] * | 1_TssA
##
## ...
## <1 more element>
## -------
## seqinfo: 25 sequences from an unspecified genome; no seqlengths
For the ChIP-Seq of the H3K27ac histone mark, we will use the [Add accessions]
file from ENCODE. The ENCFF000AKF.bam
and ENCFF000AKI.bam
files are two
replicates from the same experiment and ENCFF000AHC.bam
and ENCFF000AHD.bam
are the recommanded control files.
bam_files <- c("../inst/extdata/ENCFF000AKF.bam",
"../inst/extdata/ENCFF000AKI.bam",
"../inst/extdata/ENCFF000AHC.bam",
"../inst/extdata/ENCFF000AHD.bam")
bam_files
## [1] "../inst/extdata/ENCFF000AKF.bam" "../inst/extdata/ENCFF000AKI.bam"
## [3] "../inst/extdata/ENCFF000AHC.bam" "../inst/extdata/ENCFF000AHD.bam"
Since multiple samples should be included in the same profile (we need to
combine the replicates and remove the control), we have to produce a design so
that metagene
can know how to deal correctly with each file.
design <- data.frame(samples = bam_files, H3K27ac = c(1,1,2,2))
design
## samples H3K27ac
## 1 ../inst/extdata/ENCFF000AKF.bam 1
## 2 ../inst/extdata/ENCFF000AKI.bam 1
## 3 ../inst/extdata/ENCFF000AHC.bam 2
## 4 ../inst/extdata/ENCFF000AHD.bam 2
The design is not used when we call the constructor of metagene (using
metagene$new
). The goal of this first step is to extract the coverage from
the bam_files and to do the normalisation of the signal. Since this step can
be computationally intensive, we do not want to do it every time we want to
experiment with a new design. Thus, it is a good idea to extract the coverage
from every bam file in a single step, save the result in RData
format and
then explore different design.
mg <- metagene$new(regions = regions, bam_files = bam_files, cores = 2)
# We could save this object to avoid re-doing this computationally intensive
# step:
#save(mg, file = "mg.RData")
The plot
function allows you to create the metagene
plot. By default, the
plot will contain a profile for every combination of group of regions and bam
file. In this vignette, we have 2 groups of regions (TssA
and TssBiv
) and
4 bam files (2 samples and 2 controls).
Since in this case, we only want to produce a profile for the TssA
and
TssBiv
groups of regions, we will have to use the design we produced
previously. Other than the samples
column that contains the names of the bam
files, there is only one column in the design
object. This means that there
will only be 1 profile plotted for each group of regions (2 profiles in this
example).
df <- mg$plot(design = design, bin_size = 10)
## [1] "H3K27ac_TssA"
## [1] "H3K27ac_TssBiv"
save(mg, file = "mg.RData")
If we only wanted to plot the profile for one group of region, we could have
specified it using the regions_group
parameter. This parameter is a vector of
the names of the regions we want to keep for the current metagene
plot.
df <- mg$plot(design = design, regions_group = "TssA")
## [1] "H3K27ac_TssA"
Note: In the case that we used a GRangesList
for the group of regions when
the new
function was used, we have to use the same name as the one from this
object. If we specified a vector of filename, metagene will automatically name
the region using the filename without the directory path and without the
extension.