Devloped by : Sourya Bhattacharyya
Supervisors: Dr. Ferhat Ay and Dr. Pandurangan Vijayanand
La Jolla Institute for Allergy and Immunology
La Jolla, San Diego, CA 92037, USA
#######################
ChIPLine is a pipeline to analysis ChIP-seq data, starting from input Fastq/BAM files and generating alignment summary, various quality statistics, peak calling, and BigWig formatted tracks ready for visualization in UCSC genome browser. It also performs IDR analysis between a set of peak files or even a set of BAM alignment files (in which case, peaks are estimated first) corresponding to a set of biological or technical ChIP-seq replicates.
#######################
User can check the following papers or links for understanding ChIP-seq QCs:
-
https://github.com/crazyhottommy/ChIP-seq-analysis (very useful)
-
https://sites.google.com/site/anshulkundaje/projects/idr#TOC-Latest-pipeline (for IDR analysis)
When executing basic ChIP-seq pipeline, user should install following packages / libraries in the system:
-
Bowtie2 (we have used version 2.3.3.1) http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
-
samtools (we have used version 1.6) http://samtools.sourceforge.net/
-
PICARD tools (we have used 2.7.1 version) https://broadinstitute.github.io/picard/
-
Package phantompeakqualtools (Developed by Kundaje et al., for analyzing ChIP-seq quality) https://code.google.com/archive/p/phantompeakqualtools/
-
Utilities "bedGraphToBigWig", "bedSort", "bigBedToBed", "hubCheck" and "fetchChromSizes" downloaded from UCSC repository. Executables corresponding to the linux system, for example, is provided in this link: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/
-
deepTools (we have used version 3.5.1; requires Python 3.7.12) https://deeptools.readthedocs.io/en/develop/
-
MACS2 (we have used version 2.2.9.1; requires Python 3.7.12) https://github.com/taoliu/MACS
-
HOMER (we recommend using the latest version) http://homer.ucsd.edu/homer/
-
R environment (we have used 3.4.3)
User should include the PATH of above mentioned libraries / packages inside their SYSTEM PATH variable. Some of these PATHS are also to be mentioned in a separate configuration file (mentioned below).
In addition, when user requires to execute the IDR code, following packages / libraries are to be installed in the system:
-
sambamba (we have used version 0.6.7) http://lomereiter.github.io/sambamba/
-
The package IDRCode (available in https://drive.google.com/file/d/0B_ssVVyXv8ZSX3luT0xhV3ZQNWc/view?usp=sharing). Unzip the archieve and store in convenient location. Path of this archieve is to be provided for executing IDR code.
The package phantompeakqualtools (https://github.com/kundajelab/phantompeakqualtools) to be installed. You may check this webpage for installing this package along with its dependencies.
git clone https://github.com/kundajelab/phantompeakqualtools
Then, install the R packages: snow (for parallel processing), snowfall, bitops, caTools, spp
Also install the bioconductor package Rsamtools
Current package includes a sample script file "pipeline_exec.sh". It conains sample commands required to invoke the main executable named "pipeline.sh", which is provided within the folder "bin".
In general, ChIP-seq pipeline (the executable "pipeline.sh") involves following command line options:
Options:
Mandatory parameters:
-C ConfigFile
A configuration file to be separately provided. Mandatory parameter.
Current package includes a sample configuration file named "configfile".
Details of the entries in this file are mentioned later.
-f FASTQ1
Read 1 (or forward strand) of paired-end sequencing data [.fq|.gz|.bz2].
Or, even an aligned genome (.bam file) can be provided.
-r FASTQ2
R2 of pair-end sequencing data [.fq|.gz|.bz2]. If not provided, and the -f parameter
is not a BAM file, the input is assumed to be single ended.
-n PREFIX
Prefix string of output files. For example, -n "TEST" means that the
output filenames start with the string "TEST".
-g BOWTIE2_GENOME
Bowtie2 indexed reference genome. Basically, the folder containing
the bwt2 indices are to be provided.
Mandatory parameter if user provides fastq files as input (-f and -r options).
If user provides .bam files as input (-f option) then no need to provide this value.
-d OutDir
Output directory which will contain all the results.
-c CONTROLBAM
Control file(s) used for peak calling using MACS2. One or more
alignment files can be provided to be used
as a control. It may not be specified at all, in which
case MACS2 operates without any control.
Control file can be either in BAM or in (tagalign.gz) format.
If multiple control files are provided, user needs to ensure that all of the
control files follow the same format (i.e. either all BAM or all TAGAlign).
Example: -c control1.bam -c control2.bam puts two control files for using in MACS2.
Conversion from BAM to TagAlign.gz format can be done using the script "TagAlign.sh"
provided within the folder "bin".
-w BigWigGenome
Reference genome which is used to convert BAM file to a BigWig file.
Used for visualization track creation purpose.
If -g option is enabled (i.e. the Bowtie2 index genome is provided)
then this option is not required.
Otherwise, this is a mandatory parameter. Allowed values are 'hg19'
(default), 'mm9', 'hg38', and 'mm10'.
-T Tagmentation
If 1, means that Tagmentation was used for ChIP file creation.
Then, forward and reverse strands
of the current ChIP signal are shifted by the transposon
length, and a tagAlign file is generated.
Peaks are called from this tagAlign file. Similar to the ATAC seq principle.
Applicable for the ChIPMentation technique (Christian Schmidl et. al.,
ChIPmentation: fast, robust, low-input ChIP-seq for histones and transcription factors,
Nature Methods volume 12, pages 963–965, 2015). Default value of this parameter is 0.
-D DEBUG_TXT
Binary variable. If 1 (recommended), different statistics corresponding to
quality metrics and reads are printed. Useful when a summary of a large set
of ChIP-seq samples are to be generated.
-q MAPQ_THR
Quality value threshold, below which the mapped reads are removed (Default 30).
-p PEAKCALLGENOMESIZE
genome size parameter for MACS2 peak calling ("hs", "mm", "ce", "dm": default "hs")
Optional parameters:
-O Overwrite
Binary variable. If 1, overwrites the existing files (if any). Default = 0.
-t NUMTHREADS
Number of sorting, Bowtie2 mapping THREADS [Default = 1]. For parallel processing of Bowtie2,
user should specify > 1 value such as 4 ot 8.
-m MAX_MEM
Set max memory used for PICARD duplication removal [Default = 8G].
-a ALIGNVALIDMAX
Set the number of (max) valid alignments which will be searched [Default = 4]
for Bowtie2.
-l MAXFRAGLEN
Set the maximum fragment length to be used for Bowtie2 alignment [Default = 2000]
The configuration file follows the format parameter=value
And is to be filled with the following entries:
sppexec=
executable of SPP R code (from the package phantompeakqualtools)
after installation.
For example, /home/sourya/packages/phantompeakqualtools/run_spp.R.
picardexec=
Path of Picard tool executable
Example: /home/sourya/packages/picard-tools/picard-tools-2.7.1/picard.jar
HOMERPath=
Path of HOMER (after installation)
Example: /home/sourya/packages/HOMER/bin/
DeepToolsDir=
Path of deepTools executable
Example: /home/sourya/packages/deepTools/deepTools2.0/bin/
RPackageExec=
Installed R package directory.
Example: /home/sourya/R-3.4.3/bin/Rscript
If left as blank, default Rscript installed in the system will be invoked.
NarrowPeakASFile=
file (SQL) required to convert the narrowPeak file to the bigBed format
Download the file from this link (and save):
https://genome-source.gi.ucsc.edu/gitlist/kent.git/blob/master/src/hg/lib/encode/narrowPeak.as
Specify the location of this downloaded file:
Example: /home/sourya/genomes/chrsize/narrowPeak.as
BigNarrowPeakASFile=
file (SQL) required to convert the bignarrowPeak file to the bigBed format
Download the file from this link (and save):
https://genome.ucsc.edu/goldenPath/help/examples/bigNarrowPeak.as
Specify the location of this downloaded file:
Example: /home/sourya/genomes/chrsize/bigNarrowPeak.as
BroadPeakASFile=
file (SQL) required to convert the broadPeak file to the bigBed format
Download the file from this link (and save):
https://genome-source.gi.ucsc.edu/gitlist/kent.git/blob/master/src/hg/lib/encode/broadPeak.as
Specify the location of this downloaded file:
Example: /home/sourya/genomes/chrsize/broadPeak.as
RefChrSizeFile=
files containing chromosome size information
two column file storing the size of individual chromosomes
Downloaded from the link (depends on the reference Chromosome employed):
For example, the hg38.chrom.sizes file for the hg38 database is located at
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes.
Alternatively, Use the "fetchChromSizes" script from the UCSC repository
to get the appropriate chromosome size file.
Specify the location of this downloaded file:
Example: /home/sourya/genomes/chrsize/hg38.chrom.sizes
RefChrFastaFile=
Fasta file of the reference Chromosome.
Can be downloaded from the link:
http://hgdownload.cse.ucsc.edu/downloads.html
Example: /home/sourya/genomes/Complete_Genome/hg38/hg38.fa
RefChrAnnotFile=
file containing reference UCSC annotation (.gtf format)
corresponding to the reference Chromosome.
Can be downloaded from the link:
http://hgdownload.cse.ucsc.edu/downloads.html
Example: /home/sourya/genomes/Annotation/hg38/UCSC/hg38_UCSC_Annotation.gtf
Within the folder "OutDir" (base directory containing all the outputs of current ChIP-seq analysis, following files (f) and folders (F) exist):
F1: Alignment_MAPQ${MAPQ_THR}
f1: UniqMappedRead.bam
Initial alignment after unique mapping.
f2: Bowtie2_del_Random.bam
Alignment after deleting random reads.
f3: Bowtie2_del_Mitch.bam:
After deleting mitochondrial reads.
f4: ${PREFIX}.align.sort.MAPQ${MAPQ_THR}.bam
Sorted, and MAPQ thresholded alignment.
f5: ${PREFIX}.align.sort.MAPQ${MAPQ_THR}.bam.bai
Corresponding index.
f6: ${PREFIX}.align.sort.MAPQ${MAPQ_THR}.rmdup.bam
De-duplicated alignment (used for subsequent operations)
f7: ${PREFIX}.align.sort.MAPQ${MAPQ_THR}.picard_metrics.txt
Corresponding PICARD metrics log file.
F2: Out_BigWig
f1: ${PREFIX}.bw
bigwig file for track visualization.
F3: Out_BigWig_NormCov:
f1: ${PREFIX}_NormCov.bw
bigwig file for track visualization (after normalizing the coverage).
F4: MACS2_Default_*
Contains peaks employing MACS2 with default parameters.
f1: *.narrowPeak: narowpeak formatted output with P-value threshold of 0.01
f2: *.narrowPeak_Q0.05filt: peaks with Q-value threshold of 0.05
f3: *.narrowPeak_Q0.01filt: peaks with Q-value threshold of 0.01
f4: *.broadPeak: broadpeak formatted output with P-value threshold of 0.01
f5: *.broadPeak_Q0.05filt: peaks with Q-value threshold of 0.05
f6: *.broadPeak_Q0.01filt: peaks with Q-value threshold of 0.01
f7: out_FRiP.txt: FRIP statistics for the narrow and broad peaks.
f8: Peak_Statistics.txt: number of peaks
F9: Peak_Annotate_Q*:
For Q-value thresholds of either 0.01 or 0.05, contains the
HOMER based annotations.
In addition, files *.bb denote corresponding big-bd formatted peaks,
useful for USCSC track visualization.
F5: MACS2_Ext_*
Contains peaks employing MACS2 with the parameters:
--nomodel --nolambda --shift 0 --extsize 200
File structure is similar as above.
F6: Peak_Overlap_Statistics
Overlapping peaks corresponding to the settings F4 and F5
f7: chipSampleMaster.tagAlign.pdf
Quality metric display using cross correlation analysis.
f8: out_NRF_MAPQ${MAPQ_THR}.txt
Metric NRF
f9: Read_Count_Stat.txt
Read count statistics.
Suppose, a directory "/home/sourya/Results" contain within it, the following folders: 1, 2, 3, 4, ... Each corresponds to the output results for individual ChIP-seq samples.
To get a summarized list of performance metrics for these samples, use the script "ResSummary2.r".
Rscript ResSummary2.r [positional_arguments]:
1) OutBaseDir:
Directory under which results of all the different samples are stored
2) BAMRead:
Boolean variable, indicating if the BAM files were used for
analysis (1) or fastq (0). Default 0
3) Tagmentation:
Binary variable. If 1, denotes that the data samples
have been under ChIPMentation method. Default 0.
4) OldMethod:
Use 0 for this parameter.
5) ControlPeak:
Variable with value of either 0, 1 or 2.
If 1, control samples has been used for peak calling.
If 0, no control sample is used.
If 2, peaks using control and not using control both exist.
Default 0.
Sample execution command:
Rscript ResSummary2.r /home/sourya/ChIPResults/ 0 1 0 2
Which means that
OutBaseDir=/home/sourya/ChIPResults/
BAMRead=0
Tagmentation=1
OldMethod=0
ControlPeak=2
Current pipeline supports IDR analysis between either a list of ChIP-seq peak files or between a list of alignment (BAM) files. In the second case, first the BAM files are analyzed and subsampled to contain equal number of reads (minimum number of reads contained in the inputs), and subsequently, peaks are estimated from these (subsampled) BAM files using MACS2. These peaks are then applied for IDR analysis.
The script "sample_IDR_Script.sh" included within this package shows calling following two functions (both are included within the folder "IDR_Codes"):
1) IDRMain.sh
2) IDR_SubSampleBAM_Main.sh
The first script, IDRMain.sh, performs IDR between two or more
input peak files (we have used peaks estimated from MACS2). The parameters
corresponding to this script are as follows:
-I InpFile
A list of input peak files (obtained from MACS2 - in .narrowPeak or .narrowPeak.gz format).
At least two peak files are required.
-P PathIDRCode
Path of the IDRCode package (Kundaje et. al. after its installation).
Please check the "Required packages" section for the details.
-d OutDir
Output directory (absolute path preferred) which will store the IDR results.
-n PREFIX
Prefix of output files. Default 'IDR_ChIP'.
A sample execution of this script is as follows:
./IDRMain.sh -I peak1.narrowPeak -I peak2.narrowPeak -I peak3.narrowPeak -P /home/sourya/packages/idrCode/ -d /home/sourya/OutDir_IDR -n 'IDR_test'
The second script, IDR_SubSampleBAM_Main.sh, takes input of two or more BAM files,
estimates peaks from these BAM files, and then performs IDR analysis. The parameters
corresponding to this script are as follows:
-I InpFile
A list of input BAM files. At least two BAM files are required.
-P PathIDRCode
Path of the IDRCode package (Kundaje et. al. after its installation).
Please check the "Required packages" section for the details.
-d OutDir
Output directory (absolute path preferred) which will store the IDR results.
-n PREFIX
Prefix of output files. Default 'IDR_ChIP'.
-c CountPeak
No of peaks in both replicates that will be compared for IDR analysis.
Default 25000.
-T Tagmentation
Binary variable. If 1, the input is a ChiPMentation data
where the TAG Align files are created by
shifting the strands a bit. Default 0.
Tag align files are used for estimating peaks using MACS2.
-C CONTROLBAM
Control file (in eiher .BAM or tagalign file in .gz format)
used to estimate the peaks from MACS2. User may leave this field
blank if no control file is available.
A sample execution of this script is as follows:
./IDR_SubSampleBAM_Main.sh -I inpfile1.bam -I inpfile2.bam -P /home/sourya/packages/idrCode/ -d /home/sourya/OutDir_IDR -n 'IDR_test' -c 25000 -T 1 -C control.bam
In the specified output directory "OutDir" mentioned in the IDR script, following files (f) and folders (F) exist:
F1: Folders of the name $i$_and_$j$ where 0 <= i < N and 1 <= j <= N, where N is
the number of replicates analyzed. Individual folders contain results for
pairwise IDR analysis. For example, folder 0_and_1 contain IDR analysis
for the sample 0 (first replicate) and the sample 1 (second replicate).
f1 : "Replicate_Names.txt" : names of the replicate samples used for IDR analysis.
f2: Input_Peak_Statistics.txt: number of peaks and the peak containing replicates.
f3: IDR_Batch_Plot-plot.pdf: final IDR plot. Here individual pairs (whose results
are stored in the above mentioned folders) are numbered 1, 2, ...
Consideing N = 3, the number of pairs possible is also 3. Here,
the number 1 denotes the folder (pair) 0_and_1,
2 denotes the folder (pair) 0_and_2, and 3 denotes the
folder (pair) 1_and_2.
For any queries, please e-mail:
Sourya Bhattacharyya (sourya@lji.org)
Ferhat Ay (ferhatay@lji.org)
Pandurangan Vijayanand (vijay@lji.org)