-
Notifications
You must be signed in to change notification settings - Fork 11
Compare the enrichment of 2 histone marks in 1 list of active transcription start sites (TSS)
There are 2 kind of experimental designs that can be used for a metagene
analysis. This document will introduce the second kind, which is the comparison
of the same group of regions between 2 ChIP-Seq experiment.
For this vignette, we will compare the enrichment of the H3K27ac and H4K27me3 histone mark in active transcription start sites (TSS).
http://egg2.wustl.edu/roadmap/web_portal/chr_state_learning.html#exp_18state
We will use the TSS as defined by the Roadmap Consortium (TODO: add ref), for more details on how the data were downloaded, please see the R/get_regions.R file.
#data(TssA) # Active TSS
load("../data/TssA.RData") # Active TSS
# We will make sure that all the regions have the same size to avoid having to
# scale them during the metagene analysis
TssA <- GenomicRanges::resize(TssA, 1000, fix = "center")
regions <- GenomicRanges::GRangesList(TssA)
names(regions) <- c("TssA")
# We need to remove values from chrY that are not present in some bam files
regions <- lapply(regions, function(x) x[seqnames(x) != "chrY"])
# For memory and speed consideration, we will only use a subset of the regions
regions <- lapply(regions, function(x) x[sample(seq_along(x), 1000)])
regions <- GenomicRanges::GRangesList(regions)
regions
## GRangesList object of length 1:
## $TssA
## GRanges object with 1000 ranges and 1 metadata column:
## seqnames ranges strand | name
## <Rle> <IRanges> <Rle> | <character>
## [1] chr18 [ 45455501, 45456500] * | 1_TssA
## [2] chr17 [ 74022701, 74023700] * | 1_TssA
## [3] chr16 [ 75299201, 75300200] * | 1_TssA
## [4] chr11 [ 32916001, 32917000] * | 1_TssA
## [5] chr2 [192544001, 192545000] * | 1_TssA
## ... ... ... ... ... ...
## [996] chr7 [33168501, 33169500] * | 1_TssA
## [997] chr6 [17394901, 17395900] * | 1_TssA
## [998] chr8 [68256001, 68257000] * | 1_TssA
## [999] chr17 [49198801, 49199800] * | 1_TssA
## [1000] chr19 [41641201, 41642200] * | 1_TssA
##
## -------
## seqinfo: 25 sequences from an unspecified genome; no seqlengths
For the ChIP-Seq of the H3K27ac histone mark, we will use the [Add accessions]
file from ENCODE. The ENCFF000AKF.bam
and ENCFF000AKI.bam
files are two
replicates from the same experiment and ENCFF000AHC.bam
and ENCFF000AHD.bam
are the recommanded control files.
bam_files <- c("../inst/extdata/ENCFF000AKF.bam",
"../inst/extdata/ENCFF000AKI.bam",
"../inst/extdata/ENCFF000VGB.bam",
"../inst/extdata/ENCFF000VFS.bam")
bam_files
## [1] "../inst/extdata/ENCFF000AKF.bam" "../inst/extdata/ENCFF000AKI.bam"
## [3] "../inst/extdata/ENCFF000VGB.bam" "../inst/extdata/ENCFF000VFS.bam"
Since multiple samples should be included in the same profile (we need to
combine the replicates), we have to produce a design so that metagene
can
know how to deal correctly with each file.
design <- data.frame(samples = bam_files,
H3K27ac = c(1,1,0,0),
H3K27me3 = c(0,0,1,1))
design
## samples H3K27ac H3K27me3
## 1 ../inst/extdata/ENCFF000AKF.bam 1 0
## 2 ../inst/extdata/ENCFF000AKI.bam 1 0
## 3 ../inst/extdata/ENCFF000VGB.bam 0 1
## 4 ../inst/extdata/ENCFF000VFS.bam 0 1
The design is not used when we call the constructor of metagene (using
metagene$new
). The goal of this first step is to extract the coverage from
the bam_files and to do the normalisation of the signal. Since this step can
be computationally intensive, we do not want to do it every time we want to
experiment with a new design. Thus, it is a good idea to extract the coverage
from every bam file in a single step, save the result in RData
format and
then explore different design.
#mg <- metagene$new(regions = regions, bam_files = bam_files, cores = 2)
mg <- metagene$new(regions = regions, bam_files = bam_files)
# We could save this object to avoid re-doing this computationally intensive
# step:
#save(mg, file = "mg.RData")
The plot
function allows you to create the metagene
plot. By default, the
plot will contain a profile for every combination of group of regions and bam
file. In this vignette, we have 4 bam files (2 replicates for H3K27ac and 2
replicates for H3K27me3) and one group of regions (TssA
). Which means that if
we do not use a design, metagene will produce 4 profiles.
In this case, we only want to produce a profile for each histone mark (i.e. we want to combine the replicates), we have defined the groups of bam files in the design.
df <- mg$plot(design = design, bin_size = 10)
## [1] "H3K27ac_TssA"
## [1] "H3K27me3_TssA"
save(mg, file = "mg.RData")