Skip to content

Commit

Permalink
update docs
Browse files Browse the repository at this point in the history
  • Loading branch information
taoliu committed Apr 26, 2024
1 parent 295c64d commit fa84fa2
Show file tree
Hide file tree
Showing 59 changed files with 2,492 additions and 2,535 deletions.
2 changes: 0 additions & 2 deletions .github/workflows/build-and-test-MACS3-macos.yml
Original file line number Diff line number Diff line change
Expand Up @@ -8,14 +8,12 @@ on:
paths-ignore:
- 'docs/**'
- '**.md'
- 'ChangeLog'
- '.github/workflows/**'
- '!.github/workflows/build-and-test-MACS3-macos.yml'
pull_request:
paths-ignore:
- 'docs/**'
- '**.md'
- 'ChangeLog'
- '.github/workflows/**'
- '!.github/workflows/build-and-test-MACS3-macos.yml'

Expand Down
2 changes: 0 additions & 2 deletions .github/workflows/build-and-test-MACS3-non-x64.yml
Original file line number Diff line number Diff line change
Expand Up @@ -8,14 +8,12 @@ on:
paths-ignore:
- 'docs/**'
- '**.md'
- 'ChangeLog'
- '.github/workflows/**'
- '!.github/workflows/build-and-test-MACS3-non-x64.yml'
pull_request:
paths-ignore:
- 'docs/**'
- '**.md'
- 'ChangeLog'
- '.github/workflows/**'
- '!.github/workflows/build-and-test-MACS3-non-x64.yml'

Expand Down
2 changes: 0 additions & 2 deletions .github/workflows/build-and-test-MACS3-x64.yml
Original file line number Diff line number Diff line change
Expand Up @@ -8,14 +8,12 @@ on:
paths-ignore:
- 'docs/**'
- '**.md'
- 'ChangeLog'
- '.github/workflows/**'
- '!.github/workflows/build-and-test-MACS3-x64.yml'
pull_request:
paths-ignore:
- 'docs/**'
- '**.md'
- 'ChangeLog'
- '.github/workflows/**'
- '!.github/workflows/build-and-test-MACS3-x64.yml'

Expand Down
4 changes: 3 additions & 1 deletion MACS3/Utilities/OptValidator.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Time-stamp: <2024-04-19 15:11:59 Tao Liu>
# Time-stamp: <2023-07-28 12:17:28 Tao Liu>

"""Module Description
Expand Down Expand Up @@ -32,6 +32,8 @@

import logging
import MACS3.Utilities.Logger
logger = logging.getLogger(__name__)

# ------------------------------------
# Misc functions
# ------------------------------------
Expand Down
291 changes: 291 additions & 0 deletions docs/Advanced_Step-by-step_Peak_Calling.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,291 @@
# Advanced Step-by-step peak calling using MACS3 commands

Over the years, I have got many emails from users asking if they can
analyze their X-Seq (not ChIP-Seq) data using MACS, or if they can
turn on or off some features in `callpeak` for their special needs. In
most of cases, I would simply reply that they may have to find more
dedicated tool for the type of your data, because the `callpeak`
module is specifically designed and tuned for ChIP-Seq data. However,
MACS3 in fact contains a suite of subcommands and if you can design a
pipeline to combine them, you can control every single step and
analyze your data in a highly customized way. In this tutorial, I show
how the MACS3 main function `callpeak` can be decomposed into a
pipeline containing MACS3 subcommands,
including `filterdup`, `predictd`, `pileup`, `bdgcmp`, `bdgopt`,
and `bdgpeakcall` (or `bdgbroadcall` in case of broad mark). To
analyze your special data in a special way, you may need to skip some
of the steps or tweak some of the parameters of certain steps. Now
let\'s suppose we are dealing with the two testing
files `CTCF_ChIP_200K.bed.gz` and `CTCF_Control_200K.bed.gz`, that you
can find in MACS3 github repository.

*Note, currently this tutorial is for single-end datasets. Please
modify the command line for paired-end data by yourself.*

## Step 1: Filter duplicates

In the first step of ChIP-Seq analysis by `callpeak`, ChIP and control
data need to be read and the redundant reads at each genomic loci have
to be removed. I won\'t go over the rationale, but just tell you how
this can be done by `filterdup` subcommand. By default, the maximum
number of allowed duplicated reads is 1, or `--keep-dup=1` for
`callpeak`. To simulate this behavior, do the following:

`$ macs3 filterdup -i CTCF_ChIP_200K.bed.gz --keep-dup=1 -o CTCF_ChIP_200K_filterdup.bed`
`$ macs3 filterdup -i CTCF_Control_200K.bed.gz --keep-dup=1 -o CTCF_Control_200K_filterdup.bed`

You can set different number for `--keep-dup` or let MACS3
automatically decide the maximum allowed duplicated reads for each
genomic loci for ChIP and control separately. Check `macs3 filterdup
-h` for detail, and remember if you go with auto way, the genome size
should be set accordingly. *Note*, in the output, MACS3 will give
you the final number of reads kept after filtering, you\'d better
write the numbers down since we need them when we have to scale the
ChIP and control signals to the same depth. In this case, the number
is 199583 for ChIP and 199867 for control, and the ratio between them
is 199583/199867=.99858

## Step 2: Decide the fragment length `d`

This is an important step for MACS3 to analyze ChIP-Seq and also for
other types of data since the location of sequenced read may only tell
you the end of a DNA fragment that you are interested in (such as TFBS
or DNA hypersensitive regions), and you have to estimate how long this
DNA fragment is in order to recover the actual enrichment. You can also
regard this as a data smoothing technic. You know that `macs3 callpeak`
will output something like, it can identify certain number of pairs of
peaks and it can predict the fragment length, or `d` in MACS3
terminology, using cross-correlation. In fact, you can also do this
using `predictd` module. Normally, we only need to do this for ChIP
data:

`
$ macs3 predictd -i CTCF_ChIP_200K_filterdup.bed -g hs -m 5 50
`

Here the `-g` (the genome size) need to be set according to your sample,
and the mfold parameters have to be set reasonably. To simulate the
default behavior of `macs3 callpeak`, set `-m 5 50`. Of course, you can
tweak it. The output from `predictd` will tell you the fragment length
`d`, and in this example, it is *254*. Write the number down on your
notebook since we will need it in the next step. Of course, if you do
not want to extend the reads or you have a better estimation on fragment
length, you can simply skip this step.

## Step 3: Extend ChIP sample to get ChIP coverage track

Now you have estimated the fragment length, next, we can use MACS3
`pileup` subcommand to generate a pileup track in BEDGRAPH format for
ChIP sample. Since we are dealing with ChIP-Seq data in this tutorial,
we need to extend reads in 5\' to 3\' direction which is the default
behavior of `pileup` function. If you are dealing with some DNAse-Seq
data or you think the cutting site, that is detected by short read
sequencing, is just in the `middle` of the fragment you are interested
in, you need to use `-B` option to extend the read in both direction.
Here is the command to simulate `callpeak` behavior:

`
$ macs3 pileup -i CTCF_ChIP_200K_filterdup.bed -o CTCF_ChIP_200K_filterdup.pileup.bdg --extsize 254
`

The file `CTCF_ChIP_200K_filterdup.pileup.bdg` now contains the
fragment pileup signals for ChIP sample.

## Step 4: Build local bias track from control

By default, MACS3 `callpeak` function computes the local bias by taking
the maximum bias from surrounding 1kb (set by `--slocal`), 10kb (set by
`--llocal`), the size of fragment length `d` (predicted as what you got
from `predictd`), and the whole genome background. Here I show you how
each of the bias is calculated and how they can be combined using the
subcommands.

### The `d` background

Basically, to create the background noise track, you need to extend the
control read to both sides (-B option) using `pileup` function. The idea
is that the cutting site from control sample contains the noise
representing a region surrounding it. To do this, take half of `d` you
got from `predictd`, 127 (1/2 of 254) for our example, then:

`
$ macs3 pileup -i CTCF_Control_200K_filterdup.bed -B --extsize 127 -o d_bg.bdg
`

The file `d_bg.bdg` contains the `d` background from control.

### The slocal background

Next, you can create a background noise track of slocal local window, or
1kb window by default. Simply imagine that each cutting site (sequenced
read) represent a 1kb (default, you can tweak it) surrounding noise. So:

`
$ macs3 pileup -i CTCF_Control_200K_filterdup.bed -B --extsize 500 -o 1k_bg.bdg
`

Note, here 500 is the 1/2 of 1k. Because the ChIP signal track was built
by extending reads into `d` size fragments, we have to normalize the 1kb
noise by multiplying the values by `d/slocal`, which is 254/1000=0.254
in our example. To do so, use the `bdgopt` subcommand:

`
$ macs3 bdgopt -i 1k_bg.bdg -m multiply -p 0.254 -o 1k_bg_norm.bdg
`

The file`1k_bg_norm.bdg` contains the slocal background from control.
Note, we don\'t have to do this for `d` background because the
multiplier is simply 1.

### The llocal background

The background noise from larger region can be generated in the same way
as slocal backgound. The only difference is the size for extension.
MACS3 `callpeak` by default asks for 10kb (you can change this value)
surrounding window, so:

`
$ macs3 pileup -i CTCF_Control_200K_filterdup.bed -B --extsize 5000 -o 10k_bg.bdg
`

The extsize has to be set as 1/2 of llocal. Then, the multiplier now is
`d/llocal`, or 0.0254 in our example.

`
$ macs3 bdgopt -i 10k_bg.bdg -m multiply -p 0.0254 -o 10k_bg_norm.bdg
`

The file `10k_bg_norm.bdg` now contains the slocal background from
control.

### The genome background

The whole genome background can be calculated as
`the_number_of_control_reads\fragment_length/genome_size`, and in our
example, it is $199867*254/2700000000 ~= .0188023$. You don\'t need to
run subcommands to build a genome background track since it\'s just a
single value.

### Combine and generate the maximum background noise

Now all the above background noises have to be combined and the maximum
bias for each genomic location need be computed. This is the default
behavior of MACS3 `callpeak`, but you can have your own pipeline to
include some of them or even make more noise (such as 5k or 50k
background) then include more tracks. Here is the way to combine and get
the maximum bias.

Take the maximum between slocal (1k) and llocal (10k) background:

`
macs3 bdgcmp -m max -t 1k_bg_norm.bdg -c 10k_bg_norm.bdg -o 1k_10k_bg_norm.bdg
`

Then, take the maximum then by comparing with `d` background:

`
macs3 bdgcmp -m max -t 1k_10k_bg_norm.bdg -c d_bg.bdg -o d_1k_10k_bg_norm.bdg
`

Finally, combine with the genome wide background using `bdgopt`
subcommand

`
macs3 bdgopt -i d_1k_10k_bg_norm.bdg -m max -p .0188023 -o local_bias_raw.bdg
`

Now the file `local_bias_raw.bdg` is a BEDGRAPH file containing the
raw local bias from control data.

## Step 5: Scale the ChIP and control to the same sequencing depth

In order to compare ChIP and control signals, the ChIP pileup and
control lambda have to be scaled to the same sequencing depth. The
default behavior in `callpeak` module is to scale down the larger sample
to the smaller one. This will make sure the noise won\'t be amplified
through scaling and improve the specificity of the final results. In our
example, the final number of reads for ChIP and control are 199583 and
199867 after filtering duplicates, so the control bias have to be scaled
down by multiplying with the ratio between ChIP and control which is
199583/199867=.99858. To do so:

`
$ macs3 bdgopt -i local_bias_raw.bdg -m multiply -p .99858 -o local_lambda.bdg
`

Now, I name the output file as `local_lambda.bdg` since the values in
the file can be regarded as the lambda (or expected value) and can be
compared with ChIP signals using the local Poisson test.

## Step 6: Compare ChIP and local lambda to get the scores in pvalue or qvalue

Next, to find enriched regions and predict the so-called \'peaks\',
the ChIP signals and local lambda stored in BEDGRAPH file have to be
compared using certain statistic model. To do so, you need to use
`bdgcmp` module, which will output score for each basepair in the
genome. You may wonder it may need a huge file to save score for each
basepair in the genome, however the format BEDGRAPH can deal with the
problem by merging nearby regions with the same score. So
theoratically, the size of the output file for scores depends on the
complexity of your data, and the maximum number of data points, if
`d`, `slocal`, and `llocal` background are all used, is the minimum
value of the genome size and approximately
`(#read_ChIP+#reads_control\*3)\*2`, in our case about 1.6 million.
The command to generate score tracks is

`
$ macs3 bdgcmp -t CTCF_ChIP_200K_filterdup.pileup.bdg -c local_lambda.bdg -m qpois -o CTCF_ChIP_200K_qvalue.bdg
`
or

`
$ macs3 bdgcmp -t CTCF_ChIP_200K_filterdup.pileup.bdg -c local_bias.bdg -m ppois -o CTCF_ChIP_200K_pvalue.bdg
`

The `CTCF_ChIP_200K_pvalue.bdg` or `CTCF_ChIP_200K_qvalue.bdg` file
contains the -log10(p-value)s or -log10(q-value)s for each basepair
through local Poisson test, which means the ChIP signal at each basepair
will be tested against the corresponding local lambda derived from
control with Poisson model. *Note*, if you follow this tutorial, then
you won\'t get any `0` in the local lambda track because the smallest
value is the whole genome background; however if you do not include
genome background, you will see many `0`s in local lambda which will
crash the Poisson test. In this case, you need to set the
`pseudocount` for `bdgcmp` through `-p` option. The pseudocount will
be added to both ChIP and local lambda values before Poisson test. The
choice of pseudocount is mainly arbitrary and you can search on the web
to see some discussion. But in general, higher the pseudocount, higher
the specificity and lower the sensitivity.

## Step 7: Call peaks on score track using a cutoff

The final task of peak calling is to just take the scores and call those
regions higher than certain cutoff. We can use the `bdgpeakcall`
function for narrow peak calling and `bdgrroadcall` for broad peak
calling, and I will only cover the usage of `bdgpeakcall` in this
tutorial. It looks simple however there are extra two parameters for the
task. First, if two nearby regions are both above cutoff but the region
in-between is lower, and if the region in-between is small enough, we
should merge the two nearby regions together into a bigger one and
tolerate the fluctuation. This value is set as the read length in MACS3
`callpeak` function since the read length represent the resolution of
the dataset. As for `bdgpeakcall`, you need to get the read length
information from Step 1 or by checking the raw fastq file and set the
`-g` option. Secondly, we don\'t want to call too many small `peaks`
so we need to have a minimum length for the peak. MACS3 `callpeak`
function automatically uses the fragment size `d` as the parameter of
minimum length of peak, and as for `bdgpeakcall`, you need to set the
`-l` option as the `d` you got from Step 2. Last, you have to set the
cutoff value. Remember the scores in the output from `bdgcmp` are in
-log10 form, so if you need the cutoff as 0.05, the -log10 value is
about 1.3. The final command is like:

`
$ macs3 bdgpeakcall -i CTCF_ChIP_200K_qvalue.bdg -c 1.301 -l 245 -g 100 -o CTCF_ChIP_200K_peaks.bed
`

The output is in fact a narrowPeak format file (a type of BED file)
which contains locations of peaks and the summit location in the last
column.


Loading

0 comments on commit fa84fa2

Please sign in to comment.