Commit 61f77dd

Merge pull request #108 from ENCODE-DCC/dev
v1.3.4
2 parents 345811a + 365ff5d commit 61f77dd

8 files changed: 394 additions and 99 deletions

README.md

Lines changed: 13 additions & 1 deletion
@@ -35,7 +35,6 @@ This ChIP-Seq pipeline is based off the ENCODE (phase-3) transcription factor an
 4) Follow [Caper's README](https://github.com/ENCODE-DCC/caper) carefully. Find an instruction for your platform.
 > **IMPORTANT**: Configure your Caper configuration file `~/.caper/default.conf` correctly for your platform.
 
-
 ## Test input JSON file
 
 Use `https://storage.googleapis.com/encode-pipeline-test-samples/encode-chip-seq-pipeline/ENCSR000DYI_subsampled_chr19_only_caper.json` as `[INPUT_JSON]` in Caper's documentation.
@@ -64,3 +63,16 @@ Install [Croo](https://github.com/ENCODE-DCC/croo#installation). **You can skip
 $ pip install croo
 $ croo [METADATA_JSON_FILE]
 ```
+
+## How to make a spreadsheet of QC metrics
+
+Install [qc2tsv](https://github.com/ENCODE-DCC/qc2tsv#installation). Make sure that you have python3(> 3.4.1) installed on your system.
+
+Once you have [organized output with Croo](#how-to-organize-outputs), you will be able to find pipeline's final output file `qc/qc.json` which has all QC metrics in it. Simply feed `qc2tsv` with multiple `qc.json` files. It can take various URIs like local path, `gs://` and `s3://`.
+
+```bash
+$ pip install qc2tsv
+$ qc2tsv /sample1/qc.json gs://sample2/qc.json s3://sample3/qc.json ... > spreadsheet.tsv
+```
+
+QC metrics for each experiment (`qc.json`) will be split into multiple rows (1 for overall experiment + 1 for each bio replicate) in a spreadsheet.

chip.croo.json

Lines changed: 348 additions & 62 deletions
Large diffs are not rendered by default.

chip.wdl

Lines changed: 6 additions & 6 deletions
@@ -1,12 +1,12 @@
 # ENCODE TF/Histone ChIP-Seq pipeline
 # Author: Jin Lee (leepc12@gmail.com)
 
-#CAPER docker quay.io/encode-dcc/chip-seq-pipeline:v1.3.3
-#CAPER singularity docker://quay.io/encode-dcc/chip-seq-pipeline:v1.3.3
+#CAPER docker quay.io/encode-dcc/chip-seq-pipeline:v1.3.4
+#CAPER singularity docker://quay.io/encode-dcc/chip-seq-pipeline:v1.3.4
 #CROO out_def https://storage.googleapis.com/encode-pipeline-output-definition/chip.croo.json
 
 workflow chip {
-    String pipeline_ver = 'v1.3.3'
+    String pipeline_ver = 'v1.3.4'
     ### sample name, description
     String title = 'Untitled'
     String description = 'No description'
@@ -120,7 +120,7 @@ workflow chip {
 
     Int macs2_signal_track_mem_mb = 16000
     Int macs2_signal_track_time_hr = 24
-    String macs2_signal_track_disks = 'local-disk 200 HDD'
+    String macs2_signal_track_disks = 'local-disk 400 HDD'
 
     Int call_peak_cpu = 2
     Int call_peak_mem_mb = 16000
@@ -1184,8 +1184,6 @@ task align {
     Int? multimapping
     File? custom_align_py
     File? idx_tar # reference index tar
-    File? fastq_R1 # [read_end_id]
-    File? fastq_R2
     Boolean paired_end
     Boolean use_bwa_mem_for_pe
 
@@ -1597,6 +1595,7 @@ task call_peak {
         memory : '${mem_mb} MB'
         time : time_hr
         disks : disks
+        preemptible: 0
     }
 }
 
@@ -1629,6 +1628,7 @@ task macs2_signal_track {
         memory : '${mem_mb} MB'
         time : time_hr
         disks : disks
+        preemptible: 0
     }
 }

docs/input.md

Lines changed: 3 additions & 1 deletion
@@ -4,6 +4,8 @@ An input JSON file includes all genomic data files, parameters and metadata for
 
 Please read through the following step-by-step instruction to compose a input JSON file.
 
+>**IMPORTANT**: ALWAYS USE ABSOLUTE PATHS.
+
 ## Pipeline metadata
 
 Parameter|Description
@@ -252,7 +254,7 @@ Parameter|Default
 ---------|-------
 `chip.macs2_signal_track_mem_mb` | 16000
 `chip.macs2_signal_track_time_hr` | 24
-`chip.macs2_signal_track_disks` | `local-disk 200 HDD`
+`chip.macs2_signal_track_disks` | `local-disk 400 HDD`
 
 > **IMPORTANT**: If you see Java memory errors, check the following resource parameters.
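
The "ALWAYS USE ABSOLUTE PATHS" note added in this file can be checked mechanically before a run is submitted. The following is a minimal sketch, not part of the pipeline: the function name `find_relative_paths`, the file-extension heuristic, and the example values `/data/rep2.bam` and `gs://bucket/rep3.bam` are all assumptions made for illustration. It also assumes that values carrying a URI scheme (such as `gs://` or `s3://`) are exempt from the rule.

```python
import os
import re

# Hypothetical heuristic: a string counts as a file path if it ends with one of
# these extensions. The real pipeline may accept other file types.
PATH_LIKE = re.compile(
    r"\.(fastq|fq|bam|tsv|tar|bed|tagAlign|narrowPeak)(\.gz)?$", re.IGNORECASE
)
URI_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def _iter_strings(value):
    """Yield every string found in a value, descending into nested lists."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, list):
        for item in value:
            yield from _iter_strings(item)


def find_relative_paths(input_json):
    """Return (key, value) pairs whose path-like value is not an absolute path."""
    offenders = []
    for key, value in input_json.items():
        for text in _iter_strings(value):
            if URI_SCHEME.match(text):
                continue  # assumed exempt: remote URIs are not local paths
            if PATH_LIKE.search(text) and not os.path.isabs(text):
                offenders.append((key, text))
    return offenders


# Illustrative input only; the second and third BAM values are made up.
example = {
    "chip.paired_end": False,
    "chip.genome_tsv": "/path_to_genome_data/hg38/hg38.tsv",
    "chip.bams": ["rep1.bam", "/data/rep2.bam", "gs://bucket/rep3.bam"],
}
print(find_relative_paths(example))
```

On a POSIX system this flags only `rep1.bam`. A check like this is deliberately conservative: it reports suspicious values and leaves the decision to the user instead of rewriting the input JSON.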

docs/input_short.md

Lines changed: 5 additions & 3 deletions
@@ -1,6 +1,8 @@
 # Input JSON
 
-An input JSON file includes all genomic data files, parameters and metadata for running pipelines. Our pipeline will use default values if they are not defined in an input JSON file. We provide a set of template JSON files: [minimum](../example_input_json/template.json) and [full](../example_input_json/template.full.json). We recommend to use a minimum template instead of full one. A full template includes all parameters of the pipeline with default values defined.
+An input JSON file is a file which must include all the information needed to run this pipeline. Hence, it must include the absolute paths to all the control and experimental fastq files; paths to all the genomic data files needed for this pipeline, and it must also specify the parameters and the metadata needed for running this pipeline. If the parameters are not specified in an input JSON file, default values will be used. We provide a set of template JSON files: [minimum](../example_input_json/template.json) and [full](../example_input_json/template.full.json). We recommend to use a minimum template instead of full one. A full template includes all parameters of the pipeline with default values defined.
+
+>**IMPORTANT**: ALWAYS USE ABSOLUTE PATHS.
 
 # Checklist
 
@@ -79,7 +81,7 @@ Pipeline can start from any of the following data types (FASTQ, BAM, NODUP_BAM a
 * Define a BAM for each replicate. Our pipeline does not determine read endedness from a BAM file. You need to explicitly define read endedness.
 * Example of 3 singled-ended replicates.
 ```javascript
-{
+{
     "chip.paired_end" : false,
     "chip.bams" : ["rep1.bam", "rep2.bam", "rep3.bam"]
 }
@@ -230,7 +232,7 @@ Parameter|Default
 ---------|-------
 `chip.macs2_signal_track_mem_mb` | 16000
 `chip.macs2_signal_track_time_hr` | 24
-`chip.macs2_signal_track_disks` | `local-disk 200 HDD`
+`chip.macs2_signal_track_disks` | `local-disk 400 HDD`
 
 > **IMPORTANT**: If you see Java memory errors, check the following resource parameters.

docs/install_conda.md

Lines changed: 5 additions & 0 deletions
@@ -4,6 +4,11 @@
 
 > **WARNING**: DO NOT SKIP ANY OF THE FOLLOWING STEPS OR PIPELINE'S ENVIRONMENT WILL BE MESSED UP WITH YOUR LOCAL PYTHON/GLOBAL CONDA.
 
+0) MacOS users: **MAKE SURE THAT YOU HAVE GNU `grep` INSTALLED ON YOUR SYSTEM**. Check if your `grep` has a `-P` parameter.
+```bash
+$ grep --help # check if a parameter "-P" exists
+```
+
 1) Download [Miniconda installer](https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh). Use default answers to all questions except for the first and last.
 ```bash
 $ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

example_input_json/template.full.json

Lines changed: 4 additions & 8 deletions
@@ -10,10 +10,7 @@
     "chip.genome_tsv" : "/path_to_genome_data/hg38/hg38.tsv",
 
     "chip.paired_end" : true,
-    "chip.ctl_paired_end" : [true, true],
-
-    "chip.paired_ends" : true,
-    "chip.ctl_paired_ends" : [true, true],
+    "chip.ctl_paired_end" : true,
 
     "chip.fastqs_rep1_R1" : [ "rep1_R1_L1.fastq.gz", "rep1_R1_L2.fastq.gz", "rep1_R1_L3.fastq.gz" ],
     "chip.fastqs_rep1_R2" : [ "rep1_R2_L1.fastq.gz", "rep1_R2_L2.fastq.gz", "rep1_R2_L3.fastq.gz" ],
@@ -40,15 +37,14 @@
     "chip.ctl_depth_ratio" : 1.2,
 
     "chip.peak_caller" : null,
-    "chip.macs2_cap_num_peak" : 500000,
+    "chip.cap_num_peak_macs2" : 500000,
     "chip.pval_thresh" : 0.01,
     "chip.idr_thresh" : 0.05,
-    "chip.spp_cap_num_peak" : 300000,
+    "chip.cap_num_peak_spp" : 300000,
 
     "chip.enable_jsd" : true,
     "chip.enable_gc_bias" : true,
     "chip.enable_count_signal_track" : false,
-    "chip.keep_irregular_chr_in_bfilt_peak" : false,
 
     "chip.filter_chrs" : [],
 
@@ -86,7 +82,7 @@
 
     "chip.macs2_signal_track_mem_mb" : 16000,
     "chip.macs2_signal_track_time_hr" : 24,
-    "chip.macs2_signal_track_disks" : "local-disk 200 HDD",
+    "chip.macs2_signal_track_disks" : "local-disk 400 HDD",
 
     "chip.filter_picard_java_heap" : "4G",
     "chip.gc_bias_picard_java_heap" : "6G"
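
Input JSON files copied from the previous full template may still carry the two key names that this diff replaces. The sketch below shows one way to rename them in bulk. It assumes that the template edits reflect the key names the v1.3.4 workflow expects, which this diff does not confirm on its own; `migrate_template_keys` is a hypothetical helper, not something shipped with the pipeline. Keys that were removed from the template (for example `chip.paired_ends`) are deliberately left untouched, because the diff does not say whether the workflow still accepts them.

```python
# Key renames visible in the template diff (old name -> new name).
RENAMED_KEYS = {
    "chip.macs2_cap_num_peak": "chip.cap_num_peak_macs2",
    "chip.spp_cap_num_peak": "chip.cap_num_peak_spp",
}


def migrate_template_keys(input_json):
    """Return a copy of input_json with old key names replaced by new ones.

    If both the old and the new name are present, the value stored under
    the new name wins.
    """
    migrated = {}
    for key, value in input_json.items():
        new_key = RENAMED_KEYS.get(key, key)
        if key != new_key and new_key in migrated:
            continue  # new-style key already seen; keep its value
        migrated[new_key] = value
    return migrated


old_style = {
    "chip.macs2_cap_num_peak": 500000,
    "chip.pval_thresh": 0.01,
    "chip.spp_cap_num_peak": 300000,
}
print(migrate_template_keys(old_style))
```

The helper returns a new dictionary and preserves key order, so the migrated file can be written back with `json.dump` and compared against the original.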

src/encode_task_qc_report.py

Lines changed: 10 additions & 18 deletions
@@ -191,21 +191,11 @@ def parse_arguments():
         if isinstance(value, list):
             setattr(args, a, split_entries_and_extend(value))
 
-    if args.paired_ends is None:
-        if args.paired_end:
-            args.paired_ends = [True]*20
-        else:
-            args.paired_ends = [False]*20
-    else:
+    if args.paired_ends is not None:
         for i, _ in enumerate(args.paired_ends):
             args.paired_ends[i] = str2bool(args.paired_ends[i])
 
-    if args.ctl_paired_ends is None:
-        if args.ctl_paired_end:
-            args.ctl_paired_ends = [True]*20
-        else:
-            args.ctl_paired_ends = args.paired_ends
-    else:
+    if args.ctl_paired_ends is not None:
         for i, _ in enumerate(args.ctl_paired_ends):
             args.ctl_paired_ends[i] = str2bool(args.ctl_paired_ends[i])
 
@@ -250,8 +240,7 @@ def str_ctl(i):
     'aligner': 'Aligner',
     'peak_caller': 'Peak caller',
     'genome': 'Genome',
-    'paired_end': 'Paired-end per replicate',
-    'ctl_paired_end': 'Control paired-end per replicate',
+    'seq_endedness': 'Sequencing endedness'
 }
 
 
@@ -270,13 +259,16 @@ def make_cat_root(args):
         ('pipeline_ver', args.pipeline_ver),
         ('pipeline_type', args.pipeline_type),
         ('genome', args.genome),
-        ('paired_end', args.paired_ends),
         ('aligner', args.aligner),
+        ('seq_endedness', OrderedDict()),
         ('peak_caller', args.peak_caller),
     ])
-    if args.ctl_paired_ends \
-            and args.pipeline_type not in ('atac', 'dnase'):
-        d_general['ctl_paired_end'] = args.ctl_paired_ends
+    if args.paired_ends is not None:
+        for i, paired_end in enumerate(args.paired_ends):
+            d_general['seq_endedness']['rep{}'.format(i + 1)] = {'paired_end': paired_end}
+    if args.ctl_paired_ends is not None:
+        for i, paired_end in enumerate(args.ctl_paired_ends):
+            d_general['seq_endedness']['ctl{}'.format(i + 1)] = {'paired_end': paired_end}
     cat_root.add_log(d_general, key='general')
 
     return cat_root
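
To make the reshaped report entry concrete, the sketch below reproduces the loops added to `make_cat_root` as a standalone function and folds in the string-to-boolean conversion from `parse_arguments`. The name `build_seq_endedness` is hypothetical, and `str2bool` here is a simplified stand-in, since the real helper's body is not shown in this diff.

```python
from collections import OrderedDict


def str2bool(value):
    """Simplified stand-in for the script's str2bool (its body is not in the diff)."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in ("true", "t", "yes", "1")


def build_seq_endedness(paired_ends=None, ctl_paired_ends=None):
    """Build the 'seq_endedness' mapping: one entry per replicate and per control."""
    seq_endedness = OrderedDict()
    if paired_ends is not None:
        for i, paired_end in enumerate(paired_ends):
            seq_endedness['rep{}'.format(i + 1)] = {'paired_end': str2bool(paired_end)}
    if ctl_paired_ends is not None:
        for i, paired_end in enumerate(ctl_paired_ends):
            seq_endedness['ctl{}'.format(i + 1)] = {'paired_end': str2bool(paired_end)}
    return seq_endedness


result = build_seq_endedness(['true', 'false'], ['true'])
print(dict(result))
```

With no endedness arguments the mapping stays empty, which mirrors the diff: the old code padded the lists with 20 default values, while the new code records only what was actually passed in.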
