Merge branch 'master' of github.com:CGATOxford/UMI-tools_pipelines

CGATOxford · Dec 21, 2016 · 9370794 · 9370794
2 parents d79db45 + 29a09aa
commit 9370794
Showing 12 changed files with 620 additions and 250 deletions.
diff --git a/README.rst b/README.rst
@@ -1,5 +1,6 @@
 Tools for dealing with Unique Molecular Identifiers
 ====================================================
+
 This repository contains:
 
 * pipelines for re-analysis of iCLIP and scRNA-Seq data for UMI-tools publication
@@ -9,7 +10,12 @@ This repository contains:
 * ipython notebooks to generate publication figures
 
 
-To run the scRNA-Seq pipeline, create a new directory and copy the files in the 'data' directory to the new directory, along with the configuration files:
+Single Cell Analysis
+---------------------
+
+To run the scRNA-Seq pipeline, create a new directory and copy the
+files in the 'data' directory to the new directory, along with the
+configuration files:
 
 .. code:: bash
 
@@ -22,5 +28,132 @@ Then run:
 
 .. code:: bash
 
-   python [UMI-Tool_pipelines git directory]/PipelineScRNASeq.py -v10 make full
+   python [UMI-Tools_pipelines git directory]/pipeline_ScRNASeq.py -v10 make full
+
+iCLIP Analysis
+---------------
+
+To run the iCLIP analysis with the default configuration, create a new
+directory, download the data files, and copy the configuration:
+
+.. code:: bash
+
+	  mkdir iCLIP_analysis
+	  cd iCLIP_analysis
+	  wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR205/004/SRR2057564/SRR2057564.fastq.gz
+	  .   .
+	  .   .
+	  .   .
+	  wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR205/008/SRR2057598/SRR2057598.fastq.gz
+	  cp [UMI-Tools_pipelines git directory]/pipeline_iCLIP/SRSF_pipeline.ini pipeline.ini
+
+For the complete analysis of the SRSF data download the following SRR
+records from ENA: SRR2057564, SRR2057565, SRR2057566, SRR2057567,
+SRR2057568, SRR2057569, SRR2057570, SRR2057571, SRR2057572,
+SRR2057573, SRR2057574, SRR2057575, SRR2057576, SRR2057577,
+SRR2057578, SRR2057579, SRR2057580, SRR2057581, SRR2057582,
+SRR2057583, SRR2057584, SRR2057585, SRR2057586, SRR2057587,
+SRR2057588, SRR2057589, SRR2057590, SRR2057591, SRR2057592,
+SRR2057593, SRR2057594, SRR2057595, SRR2057596, SRR2057597,
+SRR2057598
+
+The pipeline requires a genome sequence (in fasta format) and a bowtie 
+index, named the same way and in the same directory. The `pipeline.ini`
+file must be edited to point to these files, so for example in the ini file::
+
+	[bowtie]
+
+	# The location of the genome and indexes for use with bowtie
+	# The directory should contain a fasta genome with the .fa
+	# extension and a bowtie index with the same prefix.
+	# For the SRSF data set this should be an mm9 genome/index
+	# For the TDP dataset this should be a hg19 genome/index
+	index_dir=~/genomes/bowtie/
+	genome=mm9
+
+This will make the pipeline look for the genome in the `~/genomes/bowtie` directory.
+It will use the mm9 genome and expect `mm9.fa` and `mm9.1.ebwt`, `mm9.2.ebwt`, etc. to be
+present in that directory.
+
+Now run the pipeline with:
+
+.. code:: bash
+
+	  python [UMI-Tools_pipelines git directory]/pipeline_iCLIP.py make full
+
+To run the TDP analysis, copy `TDP_pipeline.ini` rather than
+`SRSF_pipeline.ini`, set the bowtie index to a hg19 index and run:
+
+.. code:: bash
+
+	   python [UMI-Tools_pipelines git directory]/pipeline_iCLIP.py make runNotebooks1
+
+Running locally or running on a cluster
+----------------------------------------
+
+Pipelines will run either locally or on a cluster. The default cluster
+configuration uses an SGE cluster manager with the `all.q` queue,
+the `dedicated` parallel environment and the `mem_free` memory resource.
+These defaults can be altered by changing the following settings in the
+`[cluster]` section of the pipeline.ini files:
+
+* `queue`: The default queue to submit jobs to. Leave as NONE to have your
+  cluster manager decide. (default: all.q)
+* `parallel_environment`: The SGE parallel environment to request when
+  submitting multi-process jobs (default: dedicated)
+* `memory_resource`: The resource name to use when requesting a certain
+  amount of memory for a job.  (default: mem_free)
+* `pe_queue`: If this variable is set then a different queue is used
+  when submitting parallel jobs. (no default)
+* `options`: any other options to pass to the queue manager.
+
+The pipelines are also compatible with the `SLURM` and `torque`
+cluster managers. To use one of these, set the `manager` parameter in
+the `[cluster]` section of the ini file.
+
+It is also possible to run the pipelines locally by adding
+`--no-cluster` to the command.  However, this will take a very long
+time. Running the iCLIP pipeline on our cluster with 100 parallel jobs
+(each possibly using multiple cores) takes around 50 hours. We
+estimate running locally would take many weeks.
+
+Dependencies
+-------------
+
+These pipelines require the following dependencies:
 
++--------------------+-------------------+------------------------------------------------+
+|*Program*           |*Version*          |*Purpose*                                       |
++--------------------+-------------------+------------------------------------------------+
+|CGATPipelines       | e6bb3be           |Pipelining infrastructure, mapping pipeline     |
+|                    |                   |(http://github.com/CGATOxford/CGATPipelines)    |
++--------------------+-------------------+------------------------------------------------+
+|CGAT                | 0.2.4             |Various                                         |
+|                    |                   |(http://github.com/CGATOxford/cgat)             |
++--------------------+-------------------+------------------------------------------------+
+|Bowtie              | 1.1.2             |Mapping iCLIP reads                             |
++--------------------+-------------------+------------------------------------------------+
+|BWA                 | 0.7.12-r1039      |Mapping scRNA-seq reads                         |
++--------------------+-------------------+------------------------------------------------+
+|FastQC              | 0.11.2            |Quality Control of demuxed reads                |
++--------------------+-------------------+------------------------------------------------+
+|bedtools            | 2.22.0            |Interval manipulation                           |
++--------------------+-------------------+------------------------------------------------+
+|samtools            | 1.3.1             |Read manipulation                               |
++--------------------+-------------------+------------------------------------------------+
+|UMI-tools           | 0.0.2             |UMI manipulation                                |
++--------------------+-------------------+------------------------------------------------+
+|reaper              | 13-100            |Used for demuxing and clipping reads            |
++--------------------+-------------------+------------------------------------------------+
+|trimmomatic         | 0.32              |Trimming reads for scRNA-seq                    |
+|                    |                   |                                                |
++--------------------+-------------------+------------------------------------------------+
+|SRA toolkit         | 2.8.0             |Extracting data from SRA files                  |
+|                    |                   |                                                |
++--------------------+-------------------+------------------------------------------------+
+|R                   | 3.2.1             |Figure creation (packages ggplot2, reshape,     |
+|                    |                   |plyr, grid, gplots, Biobase, RColorBrewer)      |
++--------------------+-------------------+------------------------------------------------+
+|jupyter             | 4.1               |Running the statistical analysis and generating |
+|                    |                   |figures                                         |
++--------------------+-------------------+------------------------------------------------+
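Note on the `[bowtie]` settings in the README above: the genome and index layout it describes can be checked before starting a long run. The following is a minimal sketch, not part of the pipelines. It assumes Python 3 (`configparser`), the standard Bowtie 1 small-index suffixes (large indexes use `.ebwtl` instead), and hypothetical helper names `expected_bowtie_files` and `missing_bowtie_files`.

```python
import configparser
import os

# Suffixes of a standard (small) Bowtie 1 index; an assumption for this
# sketch, since the README only names mm9.1.ebwt and mm9.2.ebwt.
BOWTIE1_SUFFIXES = (".1.ebwt", ".2.ebwt", ".3.ebwt", ".4.ebwt",
                    ".rev.1.ebwt", ".rev.2.ebwt")

# Same values as the README example.
EXAMPLE_INI = """
[bowtie]
index_dir=~/genomes/bowtie/
genome=mm9
"""


def expected_bowtie_files(ini_text):
    """Return the fasta and index paths implied by the [bowtie] section."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    index_dir = os.path.expanduser(parser["bowtie"]["index_dir"])
    genome = parser["bowtie"]["genome"]
    names = [genome + ".fa"] + [genome + suffix for suffix in BOWTIE1_SUFFIXES]
    return [os.path.join(index_dir, name) for name in names]


def missing_bowtie_files(ini_text):
    """Return the expected paths that do not exist on this machine."""
    return [path for path in expected_bowtie_files(ini_text)
            if not os.path.exists(path)]


if __name__ == "__main__":
    for path in missing_bowtie_files(EXAMPLE_INI):
        print("missing:", path)
```

Running the script prints one line per missing file, so an empty output suggests the directory matches what the README expects.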
diff --git a/data/hg19_ensembl75_contigs.tsv b/data/hg19_ensembl75_contigs.tsv
@@ -0,0 +1,93 @@
+chr19_gl000208_random	92689
+chr21_gl000210_random	27682
+chr6_apd_hap1	4622290
+chr13	115169878
+chr12	133851895
+chr11	135006516
+chr10	135534747
+chr17	81195210
+chr16	90354753
+chr15	102531392
+chr14	107349540
+chr19	59128983
+chr18	78077248
+chr9_gl000198_random	90085
+chrUn_gl000239	33824
+chrUn_gl000238	39939
+chrUn_gl000233	45941
+chrUn_gl000232	40652
+chrUn_gl000231	27386
+chrUn_gl000230	43691
+chrUn_gl000237	45867
+chrUn_gl000236	41934
+chrUn_gl000235	34474
+chrUn_gl000234	40531
+chr6_qbl_hap6	4611984
+chr11_gl000202_random	40103
+chr17_gl000206_random	41001
+chr6_cox_hap2	4795371
+chr4_gl000193_random	189789
+chrUn_gl000248	39786
+chrUn_gl000249	38502
+chrUn_gl000246	38154
+chrUn_gl000247	36422
+chrUn_gl000244	39929
+chrUn_gl000245	36651
+chrUn_gl000242	43523
+chrUn_gl000243	43341
+chrUn_gl000240	41933
+chrUn_gl000241	42152
+chr17_gl000204_random	81310
+chr17_ctg5_hap1	1680828
+chr17_gl000205_random	174588
+chr9_gl000199_random	169874
+chr9_gl000201_random	36148
+chr6_ssto_hap7	4928567
+chr8_gl000197_random	37175
+chr6_dbb_hap3	4610396
+chr7_gl000195_random	182896
+chr1_gl000191_random	106433
+chr4_ctg9_hap1	590426
+chr3	198022430
+chr2	243199373
+chr1_gl000192_random	547496
+chrUn_gl000223	180455
+chr17_gl000203_random	37498
+chr4_gl000194_random	191469
+chrY	59373566
+chrX	155270560
+chr9_gl000200_random	187035
+chrUn_gl000222	186861
+chrM	16571
+chr8_gl000196_random	38914
+chr6_mann_hap4	4683263
+chrUn_gl000211	166566
+chrUn_gl000213	164239
+chrUn_gl000212	186858
+chrUn_gl000215	172545
+chrUn_gl000214	137718
+chrUn_gl000217	172149
+chrUn_gl000216	172294
+chrUn_gl000219	179198
+chrUn_gl000218	161147
+chr19_gl000209_random	159169
+chr22	51304566
+chr20	63025520
+chr21	48129895
+chr6_mcf_hap5	4833398
+chr7	159138663
+chr6	171115067
+chr5	180915260
+chr4	191154276
+chrUn_gl000228	129120
+chrUn_gl000229	19913
+chr1	249250621
+chrUn_gl000224	179693
+chrUn_gl000225	211173
+chrUn_gl000226	15008
+chrUn_gl000227	128374
+chrUn_gl000220	161802
+chrUn_gl000221	155397
+chr9	141213431
+chr8	146364022
+chr18_gl000207_random	4262
diff --git a/data/hg19_ensembl75_geneset.gtf.gz b/data/hg19_ensembl75_geneset.gtf.gz
diff --git a/data/mm9_ensembl67_contigs.tsv b/data/mm9_ensembl67_contigs.tsv
@@ -0,0 +1,35 @@
+chrY_random	58682461
+chr8_random	849593
+chrY	15902555
+chrX	166650296
+chr13	120284312
+chr12	121257530
+chr11	121843856
+chr5_random	357350
+chr17	95272651
+chr16	98319150
+chr15	103494974
+chr14	125194864
+chr19	61342430
+chr18	90772031
+chrM	16299
+chr1_random	1231697
+chr13_random	400311
+chr3_random	41899
+chr9_random	449403
+chrUn_random	5900358
+chr4_random	160594
+chr7	152524553
+chr6	149517037
+chr5	152537259
+chr4	155630120
+chr3	159599783
+chr2	181748087
+chr1	197195432
+chr7_random	362490
+chrX_random	1785075
+chr9	124076172
+chr8	131738871
+chr16_random	3994
+chr10	129993255
+chr17_random	628739
diff --git a/data/mm9_ensembl67_geneset.gtf.gz b/data/mm9_ensembl67_geneset.gtf.gz
diff --git a/data/srsf_sample_table.tsv b/data/srsf_sample_table.tsv
@@ -0,0 +1,35 @@
+NNNGGTCNN	GGTC	SRSF1-GFP-R1	SRR2057564
+NNNCCACNN	CCAC	SRSF1-GFP-R2	SRR2057565
+NNNGGTCNN	GGTC	SRSF1-GFP-R3	SRR2057566
+NNNCCACNN	CCAC	SRSF1-GFP-R4	SRR2057567
+NNNGGCGNN	GGCG	SRSF2-GFP-R1	SRR2057568
+NNNGGCGNN	GGCG	SRSF2-GFP-R2	SRR2057569
+NNNCCGGNN	CCGG	SRSF2-GFP-R3	SRR2057570
+NNNGGTTNN	GGTT	SRSF2-GFP-R4	SRR2057571
+NNNCAATNN	CAAT	SRSF2-GFP-R5	SRR2057572
+NNNGGTCNN	GGTC	SRSF3-GFP-R1	SRR2057573
+NNNGGCGNN	GGCG	SRSF3-GFP-R2	SRR2057574
+NNNCCGGNN	CCGG	SRSF3-GFP-R3	SRR2057575
+NNNCCACNN	CCAC	SRSF4-GFP-R1	SRR2057576
+NNNTGGCNN	TGGC	SRSF4-GFP-R2	SRR2057577
+NNNGGTCNN	GGTC	SRSF4-GFP-R3	SRR2057578
+NNNCAATNN	CAAT	SRSF5-GFP-R1	SRR2057579
+NNNTGGCNN	TGGC	SRSF5-GFP-R2	SRR2057580
+NNNGGTTNN	GGTT	SRSF5-GFP-R3	SRR2057581
+NNNGGCGNN	GGCG	SRSF5-GFP-R4	SRR2057582
+NNNCAATNN	CAAT	SRSF6-GFP-R1	SRR2057583
+NNNGGCGNN	GGCG	SRSF6-GFP-R2	SRR2057584
+NNNTTGTNN	TTGT	SRSF6-GFP-R3	SRR2057585
+NNNTGGCNN	TGGC	SRSF7-GFP-R1	SRR2057586
+NNNGGTTNN	GGTT	SRSF7-GFP-R2	SRR2057587
+NNNCAATNN	CAAT	SRSF7-GFP-R3	SRR2057588
+NNNCAATNN	CAAT	SRSF7-GFP-R4	SRR2057589
+NNNTGGCNN	TGGC	SRSF7-GFP-R5	SRR2057590
+NNNCCACNN	CCAC	SRSF7-GFP-R6	SRR2057591
+NNNGGTTNN	GGTT	Nxf1-GFP-R1	SRR2057592
+NNNGGCGNN	GGCG	Nxf1-GFP-R2	SRR2057593
+NNNGGTCNN	GGTC	Nxf1-GFP-R3	SRR2057594
+NNNGGTTNN	GGTT	Control-GFP-R1	SRR2057595
+NNNTTGTNN	TTGT	Control-GFP-R2	SRR2057596
+NNNCCGGNN	CCGG	Control-GFP-R3	SRR2057597
+NNNGGCGNN	GGCG	Control-GFP-R4	SRR2057598
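Note on the sample table above: the four tab-separated columns appear to be a read-prefix pattern, the fixed sample barcode, the sample name and the run accession. That reading is an assumption, as the commit does not document the columns. Under that assumption, and taking `N` positions as random (UMI) bases, a minimal Python sketch of how a read prefix could be split is shown below. The names `Sample`, `read_sample_table` and `split_read_prefix` are hypothetical, and the read sequence is invented for illustration.

```python
import csv
import io
from collections import namedtuple

# Column meanings are assumed, not documented in the commit.
Sample = namedtuple("Sample", "pattern barcode name accession")

# Two rows copied from data/srsf_sample_table.tsv.
EXAMPLE_ROWS = (
    "NNNGGTCNN\tGGTC\tSRSF1-GFP-R1\tSRR2057564\n"
    "NNNCCACNN\tCCAC\tSRSF1-GFP-R2\tSRR2057565\n"
)


def read_sample_table(handle):
    """Parse a header-less, tab-separated sample table into Sample tuples."""
    reader = csv.reader(handle, delimiter="\t")
    return [Sample(*row) for row in reader if row]


def split_read_prefix(read, pattern):
    """Split a read into (umi, barcode, remainder) using the pattern.

    Bases under 'N' in the pattern are collected as the UMI; bases under
    any other character are collected as the sample barcode.
    """
    if len(read) < len(pattern):
        raise ValueError("read is shorter than the pattern")
    prefix = read[:len(pattern)]
    umi = "".join(b for b, p in zip(prefix, pattern) if p == "N")
    barcode = "".join(b for b, p in zip(prefix, pattern) if p != "N")
    return umi, barcode, read[len(pattern):]


samples = read_sample_table(io.StringIO(EXAMPLE_ROWS))
# Invented read: 3 random bases, GGTC, 2 random bases, then the insert.
umi, barcode, insert = split_read_prefix("ACGGGTCTTAAGGCC", samples[0].pattern)
print(samples[0].name, umi, barcode, insert)
```

The extracted barcode can then be compared with the second column of the table to assign the read to a sample. The pipeline itself delegates demultiplexing to the tools listed in the README's dependency table, so this only illustrates the table layout.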
diff --git a/data/tdp_sample_table.tsv b/data/tdp_sample_table.tsv
@@ -0,0 +1 @@
+NNNGGTTNN	GGTT	TDP43-FLAG-R1	ERR039854
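Note on the accession column of the two sample tables: the README elides most of the `wget` commands, and the accessions listed here could be turned into download URLs programmatically. The sketch below is a hedged illustration only. It assumes the ENA FTP layout in which a six-digit accession has no extra sub-directory and longer accessions get a zero-padded sub-directory built from the trailing digits, and it assumes single-end files named `<accession>.fastq.gz`. The function name `ena_fastq_url` is hypothetical. Check the URLs against ENA before relying on them.

```python
ENA_FASTQ_ROOT = "ftp://ftp.sra.ebi.ac.uk/vol1/fastq"

# The 35 SRSF runs listed in the README (SRR2057564 to SRR2057598).
SRSF_ACCESSIONS = ["SRR%d" % number for number in range(2057564, 2057599)]


def ena_fastq_url(accession):
    """Build a single-end fastq URL for a run accession (assumed layout)."""
    digits = accession[3:]
    if not digits.isdigit() or len(digits) < 6:
        raise ValueError("unexpected accession format: %r" % accession)
    extra = len(digits) - 6
    # Six digits: no sub-directory. Seven or more: trailing digits,
    # zero-padded to three characters (for example '004').
    subdir = "" if extra == 0 else digits[-extra:].zfill(3) + "/"
    return "{root}/{prefix}/{subdir}{acc}/{acc}.fastq.gz".format(
        root=ENA_FASTQ_ROOT, prefix=accession[:6], subdir=subdir, acc=accession)


if __name__ == "__main__":
    for accession in SRSF_ACCESSIONS + ["ERR039854"]:
        print(ena_fastq_url(accession))
```

For `SRR2057564` this reproduces the first URL given in the README; the printed list could be saved to a file and passed to `wget -i`.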
diff --git a/notebooks/Examine_indels.ipynb b/notebooks/Examine_indels.ipynb
@@ -54,7 +54,7 @@
    },
    "outputs": [],
    "source": [
-    "infile = \"/ifs/projects/ians/umisdeduping/iCLIP_deduping/SR_iCLIP_test3/mapping.dir/Nxf1-GFP-R1.bam\""
+    "infile = \"mapping.dir/Nxf1-GFP-R1.bam\""
    ]
   },
   {
@@ -380,7 +380,7 @@
    "source": [
     "import glob\n",
     "import os\n",
-    "infiles = pd.Series(glob.glob(\"/ifs/projects/ians/umisdeduping/iCLIP_deduping/SR_iCLIP_test3/mapping.dir/*R1.bam\"))\n",
+    "infiles = pd.Series(glob.glob(\"mapping.dir/*R1.bam\"))\n",
     "infiles.index = infiles.apply(os.path.basename)\n",
     "infiles"
    ]
@@ -884,7 +884,7 @@
    ],
    "source": [
     "edit_distance = pd.read_csv(\n",
-    "    \"/ifs/projects/ians/umisdeduping/iCLIP_deduping/SR_iCLIP_test3/dedup_directional.dir/Nxf1-GFP-R1_edit_distance.tsv\", sep=\"\\t\")\n",
+    "    \"dedup_directional.dir/Nxf1-GFP-R1_edit_distance.tsv\", sep=\"\\t\")\n",
     "edit_distance = edit_distance.set_index(\"edit_distance\")\n",
     "edit_distance\n"
    ]
@@ -940,8 +940,9 @@
   }
  ],
  "metadata": {
+  "anaconda-cloud": {},
   "kernelspec": {
-   "display_name": "Python 2",
+   "display_name": "Python [default]",
    "language": "python",
    "name": "python2"
   },
@@ -955,7 +956,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython2",
-   "version": "2.7.11"
+   "version": "2.7.12"
   }
  },
  "nbformat": 4,