-
Notifications
You must be signed in to change notification settings - Fork 269
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
59 changed files
with
2,492 additions
and
2,535 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,291 @@ | ||
# Advanced Step-by-step peak calling using MACS3 commands | ||
|
||
Over the years, I have got many emails from users asking if they can | ||
analyze their X-Seq (not ChIP-Seq) data using MACS, or if they can | ||
turn on or off some features in `callpeak` for their special needs. In | ||
most of cases, I would simply reply that they may have to find more | ||
dedicated tool for the type of your data, because the `callpeak` | ||
module is specifically designed and tuned for ChIP-Seq data. However, | ||
MACS3 in fact contains a suite of subcommands and if you can design a | ||
pipeline to combine them, you can control every single step and | ||
analyze your data in a highly customized way. In this tutorial, I show | ||
how the MACS3 main function `callpeak` can be decomposed into a | ||
pipeline containing MACS3 subcommands, | ||
including `filterdup`, `predictd`, `pileup`, `bdgcmp`, `bdgopt`, | ||
and `bdgpeakcall` (or `bdgbroadcall` in case of broad mark). To | ||
analyze your special data in a special way, you may need to skip some | ||
of the steps or tweak some of the parameters of certain steps. Now | ||
let\'s suppose we are dealing with the two testing | ||
files `CTCF_ChIP_200K.bed.gz` and `CTCF_Control_200K.bed.gz`, that you | ||
can find in MACS3 github repository. | ||
|
||
*Note, currently this tutorial is for single-end datasets. Please | ||
modify the command line for paired-end data by yourself.* | ||
|
||
## Step 1: Filter duplicates | ||
|
||
In the first step of ChIP-Seq analysis by `callpeak`, ChIP and control | ||
data need to be read and the redundant reads at each genomic loci have | ||
to be removed. I won\'t go over the rationale, but just tell you how | ||
this can be done by `filterdup` subcommand. By default, the maximum | ||
number of allowed duplicated reads is 1, or `--keep-dup=1` for | ||
`callpeak`. To simulate this behavior, do the following: | ||
|
||
`$ macs3 filterdup -i CTCF_ChIP_200K.bed.gz --keep-dup=1 -o CTCF_ChIP_200K_filterdup.bed` | ||
`$ macs3 filterdup -i CTCF_Control_200K.bed.gz --keep-dup=1 -o CTCF_Control_200K_filterdup.bed` | ||
|
||
You can set different number for `--keep-dup` or let MACS3 | ||
automatically decide the maximum allowed duplicated reads for each | ||
genomic loci for ChIP and control separately. Check `macs3 filterdup | ||
-h` for detail, and remember if you go with auto way, the genome size | ||
should be set accordingly. *Note*, in the output, MACS3 will give | ||
you the final number of reads kept after filtering, you\'d better | ||
write the numbers down since we need them when we have to scale the | ||
ChIP and control signals to the same depth. In this case, the number | ||
is 199583 for ChIP and 199867 for control, and the ratio between them | ||
is 199583/199867=.99858 | ||
|
||
## Step 2: Decide the fragment length `d` | ||
|
||
This is an important step for MACS3 to analyze ChIP-Seq and also for | ||
other types of data since the location of sequenced read may only tell | ||
you the end of a DNA fragment that you are interested in (such as TFBS | ||
or DNA hypersensitive regions), and you have to estimate how long this | ||
DNA fragment is in order to recover the actual enrichment. You can also | ||
regard this as a data smoothing technic. You know that `macs3 callpeak` | ||
will output something like, it can identify certain number of pairs of | ||
peaks and it can predict the fragment length, or `d` in MACS3 | ||
terminology, using cross-correlation. In fact, you can also do this | ||
using `predictd` module. Normally, we only need to do this for ChIP | ||
data: | ||
|
||
` | ||
$ macs3 predictd -i CTCF_ChIP_200K_filterdup.bed -g hs -m 5 50 | ||
` | ||
|
||
Here the `-g` (the genome size) need to be set according to your sample, | ||
and the mfold parameters have to be set reasonably. To simulate the | ||
default behavior of `macs3 callpeak`, set `-m 5 50`. Of course, you can | ||
tweak it. The output from `predictd` will tell you the fragment length | ||
`d`, and in this example, it is *254*. Write the number down on your | ||
notebook since we will need it in the next step. Of course, if you do | ||
not want to extend the reads or you have a better estimation on fragment | ||
length, you can simply skip this step. | ||
|
||
## Step 3: Extend ChIP sample to get ChIP coverage track | ||
|
||
Now you have estimated the fragment length, next, we can use MACS3 | ||
`pileup` subcommand to generate a pileup track in BEDGRAPH format for | ||
ChIP sample. Since we are dealing with ChIP-Seq data in this tutorial, | ||
we need to extend reads in 5\' to 3\' direction which is the default | ||
behavior of `pileup` function. If you are dealing with some DNAse-Seq | ||
data or you think the cutting site, that is detected by short read | ||
sequencing, is just in the `middle` of the fragment you are interested | ||
in, you need to use `-B` option to extend the read in both direction. | ||
Here is the command to simulate `callpeak` behavior: | ||
|
||
` | ||
$ macs3 pileup -i CTCF_ChIP_200K_filterdup.bed -o CTCF_ChIP_200K_filterdup.pileup.bdg --extsize 254 | ||
` | ||
|
||
The file `CTCF_ChIP_200K_filterdup.pileup.bdg` now contains the | ||
fragment pileup signals for ChIP sample. | ||
|
||
## Step 4: Build local bias track from control | ||
|
||
By default, MACS3 `callpeak` function computes the local bias by taking | ||
the maximum bias from surrounding 1kb (set by `--slocal`), 10kb (set by | ||
`--llocal`), the size of fragment length `d` (predicted as what you got | ||
from `predictd`), and the whole genome background. Here I show you how | ||
each of the bias is calculated and how they can be combined using the | ||
subcommands. | ||
|
||
### The `d` background | ||
|
||
Basically, to create the background noise track, you need to extend the | ||
control read to both sides (-B option) using `pileup` function. The idea | ||
is that the cutting site from control sample contains the noise | ||
representing a region surrounding it. To do this, take half of `d` you | ||
got from `predictd`, 127 (1/2 of 254) for our example, then: | ||
|
||
` | ||
$ macs3 pileup -i CTCF_Control_200K_filterdup.bed -B --extsize 127 -o d_bg.bdg | ||
` | ||
|
||
The file `d_bg.bdg` contains the `d` background from control. | ||
|
||
### The slocal background | ||
|
||
Next, you can create a background noise track of slocal local window, or | ||
1kb window by default. Simply imagine that each cutting site (sequenced | ||
read) represent a 1kb (default, you can tweak it) surrounding noise. So: | ||
|
||
` | ||
$ macs3 pileup -i CTCF_Control_200K_filterdup.bed -B --extsize 500 -o 1k_bg.bdg | ||
` | ||
|
||
Note, here 500 is the 1/2 of 1k. Because the ChIP signal track was built | ||
by extending reads into `d` size fragments, we have to normalize the 1kb | ||
noise by multiplying the values by `d/slocal`, which is 254/1000=0.254 | ||
in our example. To do so, use the `bdgopt` subcommand: | ||
|
||
` | ||
$ macs3 bdgopt -i 1k_bg.bdg -m multiply -p 0.254 -o 1k_bg_norm.bdg | ||
` | ||
|
||
The file`1k_bg_norm.bdg` contains the slocal background from control. | ||
Note, we don\'t have to do this for `d` background because the | ||
multiplier is simply 1. | ||
|
||
### The llocal background | ||
|
||
The background noise from larger region can be generated in the same way | ||
as slocal backgound. The only difference is the size for extension. | ||
MACS3 `callpeak` by default asks for 10kb (you can change this value) | ||
surrounding window, so: | ||
|
||
` | ||
$ macs3 pileup -i CTCF_Control_200K_filterdup.bed -B --extsize 5000 -o 10k_bg.bdg | ||
` | ||
|
||
The extsize has to be set as 1/2 of llocal. Then, the multiplier now is | ||
`d/llocal`, or 0.0254 in our example. | ||
|
||
` | ||
$ macs3 bdgopt -i 10k_bg.bdg -m multiply -p 0.0254 -o 10k_bg_norm.bdg | ||
` | ||
|
||
The file `10k_bg_norm.bdg` now contains the slocal background from | ||
control. | ||
|
||
### The genome background | ||
|
||
The whole genome background can be calculated as | ||
`the_number_of_control_reads\fragment_length/genome_size`, and in our | ||
example, it is $199867*254/2700000000 ~= .0188023$. You don\'t need to | ||
run subcommands to build a genome background track since it\'s just a | ||
single value. | ||
|
||
### Combine and generate the maximum background noise | ||
|
||
Now all the above background noises have to be combined and the maximum | ||
bias for each genomic location need be computed. This is the default | ||
behavior of MACS3 `callpeak`, but you can have your own pipeline to | ||
include some of them or even make more noise (such as 5k or 50k | ||
background) then include more tracks. Here is the way to combine and get | ||
the maximum bias. | ||
|
||
Take the maximum between slocal (1k) and llocal (10k) background: | ||
|
||
` | ||
macs3 bdgcmp -m max -t 1k_bg_norm.bdg -c 10k_bg_norm.bdg -o 1k_10k_bg_norm.bdg | ||
` | ||
|
||
Then, take the maximum then by comparing with `d` background: | ||
|
||
` | ||
macs3 bdgcmp -m max -t 1k_10k_bg_norm.bdg -c d_bg.bdg -o d_1k_10k_bg_norm.bdg | ||
` | ||
|
||
Finally, combine with the genome wide background using `bdgopt` | ||
subcommand | ||
|
||
` | ||
macs3 bdgopt -i d_1k_10k_bg_norm.bdg -m max -p .0188023 -o local_bias_raw.bdg | ||
` | ||
|
||
Now the file `local_bias_raw.bdg` is a BEDGRAPH file containing the | ||
raw local bias from control data. | ||
|
||
## Step 5: Scale the ChIP and control to the same sequencing depth | ||
|
||
In order to compare ChIP and control signals, the ChIP pileup and | ||
control lambda have to be scaled to the same sequencing depth. The | ||
default behavior in `callpeak` module is to scale down the larger sample | ||
to the smaller one. This will make sure the noise won\'t be amplified | ||
through scaling and improve the specificity of the final results. In our | ||
example, the final number of reads for ChIP and control are 199583 and | ||
199867 after filtering duplicates, so the control bias have to be scaled | ||
down by multiplying with the ratio between ChIP and control which is | ||
199583/199867=.99858. To do so: | ||
|
||
` | ||
$ macs3 bdgopt -i local_bias_raw.bdg -m multiply -p .99858 -o local_lambda.bdg | ||
` | ||
|
||
Now, I name the output file as `local_lambda.bdg` since the values in | ||
the file can be regarded as the lambda (or expected value) and can be | ||
compared with ChIP signals using the local Poisson test. | ||
|
||
## Step 6: Compare ChIP and local lambda to get the scores in pvalue or qvalue | ||
|
||
Next, to find enriched regions and predict the so-called \'peaks\', | ||
the ChIP signals and local lambda stored in BEDGRAPH file have to be | ||
compared using certain statistic model. To do so, you need to use | ||
`bdgcmp` module, which will output score for each basepair in the | ||
genome. You may wonder it may need a huge file to save score for each | ||
basepair in the genome, however the format BEDGRAPH can deal with the | ||
problem by merging nearby regions with the same score. So | ||
theoratically, the size of the output file for scores depends on the | ||
complexity of your data, and the maximum number of data points, if | ||
`d`, `slocal`, and `llocal` background are all used, is the minimum | ||
value of the genome size and approximately | ||
`(#read_ChIP+#reads_control\*3)\*2`, in our case about 1.6 million. | ||
The command to generate score tracks is | ||
|
||
` | ||
$ macs3 bdgcmp -t CTCF_ChIP_200K_filterdup.pileup.bdg -c local_lambda.bdg -m qpois -o CTCF_ChIP_200K_qvalue.bdg | ||
` | ||
or | ||
|
||
` | ||
$ macs3 bdgcmp -t CTCF_ChIP_200K_filterdup.pileup.bdg -c local_bias.bdg -m ppois -o CTCF_ChIP_200K_pvalue.bdg | ||
` | ||
|
||
The `CTCF_ChIP_200K_pvalue.bdg` or `CTCF_ChIP_200K_qvalue.bdg` file | ||
contains the -log10(p-value)s or -log10(q-value)s for each basepair | ||
through local Poisson test, which means the ChIP signal at each basepair | ||
will be tested against the corresponding local lambda derived from | ||
control with Poisson model. *Note*, if you follow this tutorial, then | ||
you won\'t get any `0` in the local lambda track because the smallest | ||
value is the whole genome background; however if you do not include | ||
genome background, you will see many `0`s in local lambda which will | ||
crash the Poisson test. In this case, you need to set the | ||
`pseudocount` for `bdgcmp` through `-p` option. The pseudocount will | ||
be added to both ChIP and local lambda values before Poisson test. The | ||
choice of pseudocount is mainly arbitrary and you can search on the web | ||
to see some discussion. But in general, higher the pseudocount, higher | ||
the specificity and lower the sensitivity. | ||
|
||
## Step 7: Call peaks on score track using a cutoff | ||
|
||
The final task of peak calling is to just take the scores and call those | ||
regions higher than certain cutoff. We can use the `bdgpeakcall` | ||
function for narrow peak calling and `bdgrroadcall` for broad peak | ||
calling, and I will only cover the usage of `bdgpeakcall` in this | ||
tutorial. It looks simple however there are extra two parameters for the | ||
task. First, if two nearby regions are both above cutoff but the region | ||
in-between is lower, and if the region in-between is small enough, we | ||
should merge the two nearby regions together into a bigger one and | ||
tolerate the fluctuation. This value is set as the read length in MACS3 | ||
`callpeak` function since the read length represent the resolution of | ||
the dataset. As for `bdgpeakcall`, you need to get the read length | ||
information from Step 1 or by checking the raw fastq file and set the | ||
`-g` option. Secondly, we don\'t want to call too many small `peaks` | ||
so we need to have a minimum length for the peak. MACS3 `callpeak` | ||
function automatically uses the fragment size `d` as the parameter of | ||
minimum length of peak, and as for `bdgpeakcall`, you need to set the | ||
`-l` option as the `d` you got from Step 2. Last, you have to set the | ||
cutoff value. Remember the scores in the output from `bdgcmp` are in | ||
-log10 form, so if you need the cutoff as 0.05, the -log10 value is | ||
about 1.3. The final command is like: | ||
|
||
` | ||
$ macs3 bdgpeakcall -i CTCF_ChIP_200K_qvalue.bdg -c 1.301 -l 245 -g 100 -o CTCF_ChIP_200K_peaks.bed | ||
` | ||
|
||
The output is in fact a narrowPeak format file (a type of BED file) | ||
which contains locations of peaks and the summit location in the last | ||
column. | ||
|
||
|
Oops, something went wrong.