Merge branch 'master' of github.com:CGATOxford/UMI-tools_pipelines
Tom Smith committed Dec 21, 2016
2 parents d79db45 + 29a09aa commit 9370794
Showing 12 changed files with 620 additions and 250 deletions.
137 changes: 135 additions & 2 deletions README.rst
@@ -1,5 +1,6 @@
Tools for dealing with Unique Molecular Identifiers
====================================================

This repository contains:

* pipelines for re-analysis of iCLIP and scRNA-Seq data for UMI-tools publication
@@ -9,7 +10,12 @@ This repository contains:
* ipython notebooks to generate publication figures


To run the scRNA-Seq pipeline, create a new directory and copy the files in the 'data' directory to the new directory, along with the configuration files:
Single Cell Analysis
---------------------

To run the scRNA-Seq pipeline, create a new directory and copy the
files in the 'data' directory to the new directory, along with the
configuration files:

.. code:: bash
@@ -22,5 +28,132 @@ Then run:

.. code:: bash

   python [UMI-Tool_pipelines git directory]/PipelineScRNASeq.py -v10 make full
   python [UMI-tools_pipelines git directory]/pipeline_ScRNASeq.py -v10 make full

iCLIP Analysis
---------------

To run the iCLIP analysis with the default configuration, create a new
directory, download the data files and copy the configuration:

.. code:: bash

   mkdir iCLIP_analysis
   cd iCLIP_analysis
   wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR205/004/SRR2057564/SRR2057564.fastq.gz
   . .
   . .
   . .
   wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR205/004/SRR2057598/SRR2057598.fastq.gz
   cp [UMI-tools_pipelines git directory]/pipeline_iCLIP/SRSF_pipeline.ini pipeline.ini

For the complete analysis of the SRSF data, download the following SRR
records from ENA: SRR2057564, SRR2057565, SRR2057566, SRR2057567,
SRR2057568, SRR2057569, SRR2057570, SRR2057571, SRR2057572,
SRR2057573, SRR2057574, SRR2057575, SRR2057576, SRR2057577,
SRR2057578, SRR2057579, SRR2057580, SRR2057581, SRR2057582,
SRR2057583, SRR2057584, SRR2057585, SRR2057586, SRR2057587,
SRR2057588, SRR2057589, SRR2057590, SRR2057591, SRR2057592,
SRR2057593, SRR2057594, SRR2057595, SRR2057596, SRR2057597,
SRR2057598
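
Because the records listed above form a contiguous range of accessions,
the download URLs can be generated instead of typed by hand. The
following Python sketch is not part of the pipelines: `srsf_accessions`
and `ena_fastq_url` are hypothetical helpers, and the sub-directory rule
(the last digit of the accession, zero-padded to three characters) is an
assumption about ENA's FTP layout that reproduces the first URL shown
above. Check the generated URLs against ENA before relying on them.

```python
def srsf_accessions(first=2057564, last=2057598):
    """Return the SRR accessions listed above as a contiguous range."""
    return ["SRR{}".format(n) for n in range(first, last + 1)]


def ena_fastq_url(accession):
    """Build an ENA FTP URL for a single-end run.

    Assumption: the sub-directory is the last digit of the accession,
    zero-padded to three characters (e.g. "004" for SRR2057564).
    """
    prefix = accession[:6]            # e.g. "SRR205"
    subdir = accession[-1].zfill(3)   # e.g. "004"
    return "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/{}/{}/{}/{}.fastq.gz".format(
        prefix, subdir, accession, accession)


if __name__ == "__main__":
    # Print one URL per line for all 35 runs.
    for acc in srsf_accessions():
        print(ena_fastq_url(acc))
```

The printed URLs can be saved to a file and passed to `wget -i`.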

The pipeline requires a genome sequence (in FASTA format) and a bowtie
index that share the same prefix and sit in the same directory. Edit the
`pipeline.ini` file to point to these files. For example, the following
section of the ini file::

    [bowtie]

    # The location of the genome and indexes for use with bowtie
    # The directory should contain a fasta genome with the .fa
    # extension and a bowtie index with the same prefix.
    # For the SRSF data set this should be an mm9 genome/index
    # for the TDP dataset this should be a hg19 genome/index
    index_dir=~/genomes/bowtie/
    genome=mm9

makes the pipeline look for the genome in the `~/genomes/bowtie` directory.
It will use the mm9 genome and expect `mm9.fa`, `mm9.1.ebwt`, `mm9.2.ebwt`,
etc. to be present in that directory.
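
Before starting a long run, it can help to confirm that the genome and
index files are where the pipeline expects them. The following is a
minimal sketch, assuming a standard (small) bowtie 1 index made of six
`.ebwt` files; `check_bowtie_genome` is a hypothetical helper and not
part of the pipelines.

```python
import os


def check_bowtie_genome(index_dir, genome):
    """Return the expected genome/index files that are missing.

    Assumed layout: <genome>.fa plus the six files of a small
    bowtie 1 index, all sharing the same prefix in index_dir.
    """
    index_dir = os.path.expanduser(index_dir)
    suffixes = [".fa", ".1.ebwt", ".2.ebwt", ".3.ebwt", ".4.ebwt",
                ".rev.1.ebwt", ".rev.2.ebwt"]
    expected = [os.path.join(index_dir, genome + s) for s in suffixes]
    return [path for path in expected if not os.path.exists(path)]


if __name__ == "__main__":
    # Values taken from the example [bowtie] section.
    missing = check_bowtie_genome("~/genomes/bowtie/", "mm9")
    if missing:
        print("Missing files:\n" + "\n".join(missing))
    else:
        print("Genome and index files found.")
```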

Now run the pipeline with:

.. code:: bash

   python [UMI-tools_pipelines git directory]/pipeline_iCLIP.py make full

To run the TDP analysis, copy `TDP_pipeline.ini` rather than
`SRSF_pipeline.ini`, set the bowtie index to an hg19 index and run:

.. code:: bash

   python [UMI-tools_pipelines git directory]/pipeline_iCLIP.py make runNotebooks1

Running locally or running on a cluster
----------------------------------------

Pipelines will run either locally or on a cluster. The default cluster
configuration is to use an SGE cluster manager with the `all.q` queue,
the `dedicated` pe environment and `mem_free`. These defaults can be
altered by changing the following settings in the `[cluster]` section
of the pipeline.ini files:

* `queue`: The default queue to submit jobs to. Leave as NONE to have your
  cluster manager decide. (default: all.q)
* `parallel_environment`: The SGE parallel environment to request when
  submitting multi-process jobs. (default: dedicated)
* `memory_resource`: The resource name to use when requesting a certain
  amount of memory for a job. (default: mem_free)
* `pe_queue`: If this variable is set, a different queue is used
  when submitting parallel jobs. (no default)
* `options`: Any other options to pass to the queue manager.

The pipelines are also compatible with the `SLURM` and `torque`
cluster managers. To use one of them, set the `manager` parameter in
the `[cluster]` section of the ini file.
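
As an illustration of how these settings relate to their defaults, the
sketch below merges a `[cluster]` section over the documented default
values. It is not how the pipelines read their configuration:
`read_cluster_settings`, the queue name `short.q` and the resource name
`h_vmem` are hypothetical, and the code assumes Python 3 (`configparser`).

```python
import configparser

# Defaults as documented above; pe_queue has no default.
CLUSTER_DEFAULTS = {
    "queue": "all.q",
    "parallel_environment": "dedicated",
    "memory_resource": "mem_free",
}

# Hypothetical ini fragment overriding two of the defaults.
EXAMPLE_INI = """
[cluster]
queue=short.q
memory_resource=h_vmem
"""


def read_cluster_settings(ini_text):
    """Merge a [cluster] section over the documented defaults."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    settings = dict(CLUSTER_DEFAULTS)
    if parser.has_section("cluster"):
        settings.update(dict(parser.items("cluster")))
    return settings


if __name__ == "__main__":
    for key, value in sorted(read_cluster_settings(EXAMPLE_INI).items()):
        print("{}={}".format(key, value))
```

Settings left out of the section, such as `parallel_environment` here,
keep their default values.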

It is also possible to run the pipelines locally by adding
`--no-cluster` to the command. However, this will take a very long
time: running the iCLIP pipeline on our cluster with 100 parallel jobs
(each possibly using multiple cores) takes around 50 hours, and we
estimate that running locally would take many weeks.

Dependencies
-------------

These pipelines require the following dependencies:

+--------------------+-------------------+------------------------------------------------+
|*Program*           |*Version*          |*Purpose*                                       |
+--------------------+-------------------+------------------------------------------------+
|CGATPipelines       |e6bb3be            |Pipelining infrastructure, mapping pipeline     |
|                    |                   |(http://github.com/CGATOxford/CGATPipelines)    |
+--------------------+-------------------+------------------------------------------------+
|CGAT                |0.2.4              |Various                                         |
|                    |                   |(http://github.com/CGATOxford/cgat)             |
+--------------------+-------------------+------------------------------------------------+
|Bowtie              |1.1.2              |Mapping iCLIP reads                             |
+--------------------+-------------------+------------------------------------------------+
|BWA                 |0.7.12-r1039       |Mapping scRNA-seq reads                         |
+--------------------+-------------------+------------------------------------------------+
|FastQC              |0.11.2             |Quality Control of demuxed reads                |
+--------------------+-------------------+------------------------------------------------+
|bedtools            |2.22.0             |Interval manipulation                           |
+--------------------+-------------------+------------------------------------------------+
|samtools            |1.3.1              |Read manipulation                               |
+--------------------+-------------------+------------------------------------------------+
|UMI-tools           |0.0.2              |UMI manipulation                                |
+--------------------+-------------------+------------------------------------------------+
|reaper              |13-100             |Used for demuxing and clipping reads            |
+--------------------+-------------------+------------------------------------------------+
|trimmomatic         |0.32               |Trimming reads for scRNA-seq                    |
+--------------------+-------------------+------------------------------------------------+
|SRA toolkit         |2.8.0              |Extracting data from SRA files                  |
+--------------------+-------------------+------------------------------------------------+
|R                   |3.2.1              |Figure creation (packages ggplot2, reshape,     |
|                    |                   |plyr, grid, gplots, Biobase, RColorBrewer)      |
+--------------------+-------------------+------------------------------------------------+
|jupyter             |4.1                |Running the statistical analysis and generating |
|                    |                   |figures                                         |
+--------------------+-------------------+------------------------------------------------+
93 changes: 93 additions & 0 deletions data/hg19_ensembl75_contigs.tsv
@@ -0,0 +1,93 @@
chr19_gl000208_random 92689
chr21_gl000210_random 27682
chr6_apd_hap1 4622290
chr13 115169878
chr12 133851895
chr11 135006516
chr10 135534747
chr17 81195210
chr16 90354753
chr15 102531392
chr14 107349540
chr19 59128983
chr18 78077248
chr9_gl000198_random 90085
chrUn_gl000239 33824
chrUn_gl000238 39939
chrUn_gl000233 45941
chrUn_gl000232 40652
chrUn_gl000231 27386
chrUn_gl000230 43691
chrUn_gl000237 45867
chrUn_gl000236 41934
chrUn_gl000235 34474
chrUn_gl000234 40531
chr6_qbl_hap6 4611984
chr11_gl000202_random 40103
chr17_gl000206_random 41001
chr6_cox_hap2 4795371
chr4_gl000193_random 189789
chrUn_gl000248 39786
chrUn_gl000249 38502
chrUn_gl000246 38154
chrUn_gl000247 36422
chrUn_gl000244 39929
chrUn_gl000245 36651
chrUn_gl000242 43523
chrUn_gl000243 43341
chrUn_gl000240 41933
chrUn_gl000241 42152
chr17_gl000204_random 81310
chr17_ctg5_hap1 1680828
chr17_gl000205_random 174588
chr9_gl000199_random 169874
chr9_gl000201_random 36148
chr6_ssto_hap7 4928567
chr8_gl000197_random 37175
chr6_dbb_hap3 4610396
chr7_gl000195_random 182896
chr1_gl000191_random 106433
chr4_ctg9_hap1 590426
chr3 198022430
chr2 243199373
chr1_gl000192_random 547496
chrUn_gl000223 180455
chr17_gl000203_random 37498
chr4_gl000194_random 191469
chrY 59373566
chrX 155270560
chr9_gl000200_random 187035
chrUn_gl000222 186861
chrM 16571
chr8_gl000196_random 38914
chr6_mann_hap4 4683263
chrUn_gl000211 166566
chrUn_gl000213 164239
chrUn_gl000212 186858
chrUn_gl000215 172545
chrUn_gl000214 137718
chrUn_gl000217 172149
chrUn_gl000216 172294
chrUn_gl000219 179198
chrUn_gl000218 161147
chr19_gl000209_random 159169
chr22 51304566
chr20 63025520
chr21 48129895
chr6_mcf_hap5 4833398
chr7 159138663
chr6 171115067
chr5 180915260
chr4 191154276
chrUn_gl000228 129120
chrUn_gl000229 19913
chr1 249250621
chrUn_gl000224 179693
chrUn_gl000225 211173
chrUn_gl000226 15008
chrUn_gl000227 128374
chrUn_gl000220 161802
chrUn_gl000221 155397
chr9 141213431
chr8 146364022
chr18_gl000207_random 4262
Binary file added data/hg19_ensembl75_geneset.gtf.gz
Binary file not shown.
35 changes: 35 additions & 0 deletions data/mm9_ensembl67_contigs.tsv
@@ -0,0 +1,35 @@
chrY_random 58682461
chr8_random 849593
chrY 15902555
chrX 166650296
chr13 120284312
chr12 121257530
chr11 121843856
chr5_random 357350
chr17 95272651
chr16 98319150
chr15 103494974
chr14 125194864
chr19 61342430
chr18 90772031
chrM 16299
chr1_random 1231697
chr13_random 400311
chr3_random 41899
chr9_random 449403
chrUn_random 5900358
chr4_random 160594
chr7 152524553
chr6 149517037
chr5 152537259
chr4 155630120
chr3 159599783
chr2 181748087
chr1 197195432
chr7_random 362490
chrX_random 1785075
chr9 124076172
chr8 131738871
chr16_random 3994
chr10 129993255
chr17_random 628739
Binary file added data/mm9_ensembl67_geneset.gtf.gz
Binary file not shown.
35 changes: 35 additions & 0 deletions data/srsf_sample_table.tsv
@@ -0,0 +1,35 @@
NNNGGTCNN GGTC SRSF1-GFP-R1 SRR2057564
NNNCCACNN CCAC SRSF1-GFP-R2 SRR2057565
NNNGGTCNN GGTC SRSF1-GFP-R3 SRR2057566
NNNCCACNN CCAC SRSF1-GFP-R4 SRR2057567
NNNGGCGNN GGCG SRSF2-GFP-R1 SRR2057568
NNNGGCGNN GGCG SRSF2-GFP-R2 SRR2057569
NNNCCGGNN CCGG SRSF2-GFP-R3 SRR2057570
NNNGGTTNN GGTT SRSF2-GFP-R4 SRR2057571
NNNCAATNN CAAT SRSF2-GFP-R5 SRR2057572
NNNGGTCNN GGTC SRSF3-GFP-R1 SRR2057573
NNNGGCGNN GGCG SRSF3-GFP-R2 SRR2057574
NNNCCGGNN CCGG SRSF3-GFP-R3 SRR2057575
NNNCCACNN CCAC SRSF4-GFP-R1 SRR2057576
NNNTGGCNN TGGC SRSF4-GFP-R2 SRR2057577
NNNGGTCNN GGTC SRSF4-GFP-R3 SRR2057578
NNNCAATNN CAAT SRSF5-GFP-R1 SRR2057579
NNNTGGCNN TGGC SRSF5-GFP-R2 SRR2057580
NNNGGTTNN GGTT SRSF5-GFP-R3 SRR2057581
NNNGGCGNN GGCG SRSF5-GFP-R4 SRR2057582
NNNCAATNN CAAT SRSF6-GFP-R1 SRR2057583
NNNGGCGNN GGCG SRSF6-GFP-R2 SRR2057584
NNNTTGTNN TTGT SRSF6-GFP-R3 SRR2057585
NNNTGGCNN TGGC SRSF7-GFP-R1 SRR2057586
NNNGGTTNN GGTT SRSF7-GFP-R2 SRR2057587
NNNCAATNN CAAT SRSF7-GFP-R3 SRR2057588
NNNCAATNN CAAT SRSF7-GFP-R4 SRR2057589
NNNTGGCNN TGGC SRSF7-GFP-R5 SRR2057590
NNNCCACNN CCAC SRSF7-GFP-R6 SRR2057591
NNNGGTTNN GGTT Nxf1-GFP-R1 SRR2057592
NNNGGCGNN GGCG Nxf1-GFP-R2 SRR2057593
NNNGGTCNN GGTC Nxf1-GFP-R3 SRR2057594
NNNGGTTNN GGTT Control-GFP-R1 SRR2057595
NNNTTGTNN TTGT Control-GFP-R2 SRR2057596
NNNCCGGNN CCGG Control-GFP-R3 SRR2057597
NNNGGCGNN GGCG Control-GFP-R4 SRR2057598
1 change: 1 addition & 0 deletions data/tdp_sample_table.tsv
@@ -0,0 +1 @@
NNNGGTTNN GGTT TDP43-FLAG-R1 ERR039854
11 changes: 6 additions & 5 deletions notebooks/Examine_indels.ipynb
@@ -54,7 +54,7 @@
},
"outputs": [],
"source": [
"infile = \"/ifs/projects/ians/umisdeduping/iCLIP_deduping/SR_iCLIP_test3/mapping.dir/Nxf1-GFP-R1.bam\""
"infile = \"mapping.dir/Nxf1-GFP-R1.bam\""
]
},
{
@@ -380,7 +380,7 @@
"source": [
"import glob\n",
"import os\n",
"infiles = pd.Series(glob.glob(\"/ifs/projects/ians/umisdeduping/iCLIP_deduping/SR_iCLIP_test3/mapping.dir/*R1.bam\"))\n",
"infiles = pd.Series(glob.glob(\"mapping.dir/*R1.bam\"))\n",
"infiles.index = infiles.apply(os.path.basename)\n",
"infiles"
]
@@ -884,7 +884,7 @@
],
"source": [
"edit_distance = pd.read_csv(\n",
" \"/ifs/projects/ians/umisdeduping/iCLIP_deduping/SR_iCLIP_test3/dedup_directional.dir/Nxf1-GFP-R1_edit_distance.tsv\", sep=\"\\t\")\n",
" \"dedup_directional.dir/Nxf1-GFP-R1_edit_distance.tsv\", sep=\"\\t\")\n",
"edit_distance = edit_distance.set_index(\"edit_distance\")\n",
"edit_distance\n"
]
@@ -940,8 +940,9 @@
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 2",
"display_name": "Python [default]",
"language": "python",
"name": "python2"
},
@@ -955,7 +956,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
"version": "2.7.12"
}
},
"nbformat": 4,