diff --git a/conf.py b/conf.py index 05d8c7c..2377699 100644 --- a/conf.py +++ b/conf.py @@ -45,7 +45,7 @@ # General information about the project. project = u'khmer-protocols' -copyright = u'2013, C. Titus Brown et al.' +copyright = u'2015, C. Titus Brown et al.' # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the @@ -54,7 +54,7 @@ # The short X.Y version. version = '0.8' # The full version, including alpha/beta/rc tags. -release = '0.8.4' +release = '0.8.5' # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. diff --git a/index.txt b/index.txt index 1c4ad8e..4b2e804 100644 --- a/index.txt +++ b/index.txt @@ -51,12 +51,20 @@ This is a protocol for assembling low- and medium-diversity metagenomes. Marine sediment and soil data sets may not be assemblable in the cloud just yet. +Reference based RNAseq assembly: the refTrans Protocol +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:doc:`refTrans/index` + +This is a protocol for reference based transcriptome assembly. +Materials are based on two workshops given at MSU by Titus Brown and Matt Scholz on Dec 4/5 ad Dec 10/11. + Additional information ---------------------- Need help? Either post comments on the bottom of each page, OR `sign up for the mailing list `__. - + Have you used these protocols in a scientific publication? We'll have citation instructions up soon. diff --git a/refTrans/data-snapshot.txt b/refTrans/data-snapshot.txt new file mode 100644 index 0000000..430b9bd --- /dev/null +++ b/refTrans/data-snapshot.txt @@ -0,0 +1,201 @@ +Create a data snapshot on Amazon +================================ + +.. shell start + +Essentially you create a disk; attach it; format it; +and then copy things to and from it. + + +Getting started: lunch EC2 instance with extra EBS volume +--------------------------------------------------------- +We will follow up the main events and highlight the critical points: + +1. log into your AWS account + +2. In the AWS managmenet consol: + 1. click EC2 (Amazon Elastic Compute Cloud) to open the EC2 Dashboard + 2. Change your location on the top right corner of the page to be US East (N. Virginia) + 3. press “Launch Instance” (midway down the page) + 4. Choose an Amazon Machine Image (AMI): Ubuntu Server 14.04 LTS (HVM), SSD Volume Type - ami-9a562df2 + 5. Choose an Instance Type: for now I am using the free tier t2.micro instance + 6. Next: Configure Instance Details: Keep the defaults + 7. **Next: Add Storage**: Keep the defualt storage with 8 GiB as it is and add new EBS volume of 1 GiB general purpose SSD type (to create data snapshot on it). Make sure it does not delete after termination + 8. Next: Tag instance: give a name to your instance (I called mine refTrans_instance) + 9. Next: Configure Security Group: Create a new security group and adjust your security rules to enable ports 22, 80, and 443 (SSH, HTTP, and HTTPS). + 10. click "review and lunch" the click "lunch" + 11. from the top drop-down menu, choose an existing key pair or create a new one and download to your disk + 12. click lunch instance + 13. scroll down and click view instance + 14. copy the Public DNS in your clipboard + +Logging into your new instance "in the cloud" (Windows version) +-------------------------------------------------------------- + 1. Download PuTTY and PuTTYgen from: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html + 2. 
Run PuTTYgen, Find and load your '.pem' file, and Now, "save private key" + 3. Run PuTTY, paste the public DNS in the host name & expand SSH, click Auth (do not expand), find your private key and click open + 4. click yes if probmbted then Log in as "ubuntu" + +Getting the data +---------------- + +We'll be using a few RNAseq data sets from Fagerberg et al., `Analysis +of the Human Tissue-specific Expression by Genome-wide Integration of +Transcriptomics and Antibody-based Proteomics +`__. + +You can get this data from the European Nucleotide Archive under +`ERP003613 `. +All samples in this project are `paired end +`__ . +So each sample is represented by 2 files. These files are in +`FASTQ Format `__ . + + +In this tutorial we will work with two tissues: `salivary gland +`__ and `lung +`__. Note that each +tissue has two replicates, and each replicate has two files for the paired end reads + +make a directory to download the data to the Amazon cloud +:: + + mkdir refTransData + cd refTransData + +Download the sequencing files of the salivary gland +:: + + wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR315/ERR315325/* + wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR315/ERR315382/* + +Download the sequencing files of the lung tissue +:: + + wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR315/ERR315326/* + wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR315/ERR315424/* + + +Now do:: + + ls -la refTransData/ + +You'll see something like :: + + -r--r--r-- 1 mscholz common-data 3714262571 Dec 4 08:44 ERR315325_1.fastq.gz + -r--r--r-- 1 mscholz common-data 3714262571 Dec 4 08:44 ERR315325_2.fastq.gz + ... + +which tells you that this file is about 900 MB. You can go a head and use these files +for the reset of the protocl. But for the sake of time (& memory), we will run our demo +on a subset of this dat + + +Prepare the working data on the EBS volume +------------------------------------------ + +1. Use the lsblk command to view your available disk devices and their mount points + :: + + lsblk + + + You should see something like this :: + + NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT + xvda 202:0 0 8G 0 disk + └─xvda1 202:1 0 8G 0 part / + xvdf 202:16 0 1G 0 disk + + + | /dev/xvda1 is mounted as the root device (note the MOUNTPOINT is listed as /, the root of the Linux file system hierarchy), + | and /dev/xvdf is attached, but it has not been mounted yet. + +2. New volume does not have a file system. Use the sudo file -s device command to double check + :: + + sudo file -s /dev/xvdf + + + If the output of the previous command shows simply data for the device, then there is no file system on the device and you need to create one. + +3. Create an ext4 file system on the volume + :: + + sudo mkfs -t ext4 /dev/xvdf + +4. Create a mounting point directory for the volume. The mount point is where the volume is located in the file system tree and where you read and write files to after you mount the volume + :: + + sudo mkdir refTransData/dataSubset + +5. Mount the volume at the location of the project folder + :: + + sudo mount /dev/xvdf refTransData/dataSubset + +6. 
Copy in a subset of the data (100,000 reads) + :: + + sudo sh -c 'gunzip -c ERR315325_1.fastq.gz | head -400000 | gzip > dataSubset/salivary_repl1_R1.fq.gz' + sudo sh -c 'gunzip -c ERR315325_2.fastq.gz | head -400000 | gzip > dataSubset/salivary_repl1_R2.fq.gz' + sudo sh -c 'gunzip -c ERR315382_1.fastq.gz | head -400000 | gzip > dataSubset/salivary_repl2_R1.fq.gz' + sudo sh -c 'gunzip -c ERR315382_2.fastq.gz | head -400000 | gzip > dataSubset/salivary_repl2_R2.fq.gz' + + + and do the same for the lung samples + :: + + sudo sh -c 'gunzip -c ERR315326_1.fastq.gz | head -400000 | gzip > dataSubset/lung_repl1_R1.fq.gz' + sudo sh -c 'gunzip -c ERR315326_2.fastq.gz | head -400000 | gzip > dataSubset/lung_repl1_R2.fq.gz' + sudo sh -c 'gunzip -c ERR315424_1.fastq.gz | head -400000 | gzip > dataSubset/lung_repl2_R1.fq.gz' + sudo sh -c 'gunzip -c ERR315424_2.fastq.gz | head -400000 | gzip > dataSubset/lung_repl2_R2.fq.gz' + + +Getting Set Up with the AWS Command Line Interface and make the snapshot +------------------------------------------------------------------------ +The AWS Command Line Interface is a unified tool to manage your AWS services. It will help us making a snapshot of our data and a lot more +Follow the `Amazon documenation `__. Briefly: + +1. Get your access key ID and secret access key through `IAM console `__. Do not forget to add adminstrative access role + +2. install AWS Command Line Interface + :: + + cd ~ + sudo apt-get update + sudo apt-get install awscli + +3. configure credentials by running + :: + + aws configure + + The AWS CLI will prompt you for four pieces of information:: + + AWS Access Key ID [None]: You got it from IAM console, it should be something like AKIAIOSFODNN7EXAMPLE + AWS Secret Access Key [None]: You got it from IAM console, it should be something like wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY + Default region name [None]: This is the region of your bucket (in our case it is us-east-1) + Default output format [None]: json + + +4. Create a snapshot (get the volume-id from the EC2 dashboard in the volums tab) + :: + + sudo umount -d /dev/xvdf + aws ec2 create-snapshot --volume-id vol-90ab2b8b --description "refTrans sample data" + + The snapshot has an ID: snap-6a5ddae5 + +5. Now we can share this snapshot but we need to modify its the permissions + :: + + aws ec2 modify-snapshot-attribute --snapshot-id snap-6a5ddae5 --attribute createVolumePermission --operation-type add + +Now you can stop your instance and even delete the EBS divers. + +.. shell stop + +---- + +Back: :doc:`m-dataNsoftware_amazon` diff --git a/refTrans/files/model-rnaseq-pipeline.png b/refTrans/files/model-rnaseq-pipeline.png new file mode 100644 index 0000000..8f476c7 Binary files /dev/null and b/refTrans/files/model-rnaseq-pipeline.png differ diff --git a/refTrans/igenome-backup.txt b/refTrans/igenome-backup.txt new file mode 100644 index 0000000..697153e --- /dev/null +++ b/refTrans/igenome-backup.txt @@ -0,0 +1,70 @@ +Storage of my iGenome tarball in the Amazon S3 +============================================== + +Connect to MSU HPC and get a copy of the Bowtie2Index, genome and gtf annotation +-------------------------------------------------------------------------------- +After I logged into my AWS instance, I opened sftp connection with MSU HPC +and got the file I already prepared there in this way:: + + sftp (username)@hpc.msu.edu + get /path/to/Homo_sapiens_UCSC_hg19_small.tar.gz + exit + +Create S3 bucket +---------------- +In the AWS managmenet consol: + 1. 
Click S3 (Amazon simple storage service) + 2. Create Bucket + 3. Define the Bucket Name: reftransdata (Choose the Region: US Standard) + +Getting Set Up with the AWS Command Line Interface and copy a backup to S3 +-------------------------------------------------------------------------- +The AWS Command Line Interface is a unified tool to manage your AWS services. It will help us making a snapshot of our data and a lot more +Follow the `Amazon documenation `__. Briefly: + +1. Get your access key ID and secret access key through `IAM console `__. +Do not forget to add adminstrative access role + +2. install AWS Command Line Interface +:: + + cd ~ + sudo apt-get install awscli + +3. configure credentials by running +:: + + aws configure + +| The AWS CLI will prompt you for four pieces of information:: + + AWS Access Key ID [None]: You got it from IAM console, it should be something like AKIAIOSFODNN7EXAMPLE + AWS Secret Access Key [None]: You got it from IAM console, it should be something like wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY + Default region name [None]: This is the region of your bucket (in our case it is us-east-1) + Default output format [None]: json + + +4. copy the tarball to your S3 space +:: + + aws s3 cp $workingPath/Homo_sapiens_UCSC_hg19_small.tar.gz s3://reftransdata/Homo_sapiens_UCSC_hg19_small.tar.gz + +.. Go to S3 consol and right click the file and choose "make it public" + + +Get a copy back from your S3 to the current EC2 +----------------------------------------------- +Install and configure the AWS Command Line Interface then copy the file to the current working directory + +:: + + aws s3 cp s3://reftransdata/Homo_sapiens_UCSC_hg19_small.tar.gz + + +open the file and change name to match the one obtained from igenome +-------------------------------------------------------------------- +:: + + tar -zxvf Homo_sapiens_UCSC_hg19_small.tar.gz + mv Homo_sapiens2 Homo_sapiens + diff --git a/refTrans/index.txt b/refTrans/index.txt new file mode 100644 index 0000000..75bfeea --- /dev/null +++ b/refTrans/index.txt @@ -0,0 +1,31 @@ +=================================================== +Reference based RNAseq assembly (refTrans) protocol +=================================================== + +:author: Tamer A. Mansour and Titus Brown + +This is a protocol for reference based transcriptome assembly. +Materials are based on and extending two workshops given at MSU by Titus Brown and Matt Scholz on Dec 4/5 ad Dec 10/11. +Both workshops were sponsored by `iCER `__ +and made use of the `MSU High Performance Compute Center `__ . + +.. figure:: files/model-rnaseq-pipeline.png + +Tutorials: + +.. toctree:: + :maxdepth: 2 + + m-resources + m-quality + m-tophat + m-count + m-data-analysis + m-func-analysis + m-advice + +Reference material +------------------ +:doc:`../docs/command-line` + +.. :doc:`../amazon/index` \ No newline at end of file diff --git a/refTrans/m-HPC-cuffdiff.txt b/refTrans/m-HPC-cuffdiff.txt new file mode 100644 index 0000000..81f25ec --- /dev/null +++ b/refTrans/m-HPC-cuffdiff.txt @@ -0,0 +1,62 @@ +Submitting cuffdiff job to the MSU HPC queue +============================================ + +.. 
shell start + + +We should start by specifying the resources required for our job and load the needed softwares +:: + + echo "#PBS -l walltime=04:00:00,nodes=1:ppn=8,mem=12Gb" > run_cuffdiff.sh + echo "module load cufflinks/2.1.1" >> run_cuffdiff.sh + +Define the groups you want to compare (According to our sample naming conventions, +it should be exactly matching the stating name of your samples) +:: + + echo "groups=(lung salivary)" >> run_cuffdiff.sh + +We will prepare the replicates as comma separated lists to pass them to the shell scripts +:: + + gp_replicates=() + for gp in "${groups[@]}"; do + replicates=() + for rep in $gp*_tophat; do + replicates+=("$workingPath/$rep/accepted_hits.bam") + done + gp_replicates+=($(echo ${replicates[*]} | tr ' ' ',')) + done + + echo " + cuffdiff \ + --output-dir "$workingPath/cuffdiff_out" \ + --num-threads 8 \ + --multi-read-correct \ + --frag-bias-correct ${homo_sapiens_genome} \ + --labels ${groups[0]},${groups[1]} \ + $GTF_file ${gp_replicates[0]} ${gp_replicates[1]}" >> run_cuffdiff.sh + + +you can see your script:: + + less run_cuffdiff.sh + +(use 'q' to exit). + + +Now submit your job +:: + + qsub run_cuffdiff.sh + + +.. note:: You will need to adjust the required wall time, memory and number of nodes(processor) required for your job + +You can use 'qstat | grep ' to check on your jobs' status. + +.. shell stop + +---- + +Back: :doc:`m-cuffdiff` diff --git a/refTrans/m-HPC-tophat.txt b/refTrans/m-HPC-tophat.txt new file mode 100644 index 0000000..956bb59 --- /dev/null +++ b/refTrans/m-HPC-tophat.txt @@ -0,0 +1,44 @@ +Submitting TopHat jobs to the MSU HPC queue +=========================================== + +.. shell start + +In :doc:`m-more-tophat`, we showed you a *shell script*, which was a way +of telling the computer to do multiple things in a row. We discussed +several advantages to scripting -- + +1. It automates long-running processes; +2. It tracks what you did, and you can edit it (to modify analyses) and + also copy it (to do a family of analyses); +3. You can also provide it as part of your Methods in your paper; + +There is also a fourth advantage, or really a necessity, to scripting: +it's how you use the MSU HPC to run your computation. Briefly, to run +a job on the HPC, you create a shell script and then run 'qsub'. + +This code will automate making the scripts for each sample files and run the qsub for these scripts +:: + + mkdir qsub_scripts + for f in *R1.pe*; do + input_R1="$f" + input_R2=$(echo "$f" | sed s/R1/R2/) + input_R=$(echo "$f" | sed s/R1.pe/R.se/) + output=$(basename "$f" | cut -f 1,2 -d "_")_tophat + script="./qsub_scripts/"$output"_script.sh" + echo "#PBS -l walltime=00:30:00,nodes=1:ppn=4,mem=4Gb" >> $script + echo "module load TopHat2/2.0.12" >> $script + echo "cd \$PBS_O_WORKDIR" >> $script + echo "tophat -p 4 -G "$GTF_file" --transcriptome-index "$workingPath/trans_Index/transcriptome" -o $output $Bowtie2Index $input_R1 $input_R2,$input_R" >> $script + qsub $script + done + +.. note:: You will need to adjust the required wall time, memory and number of nodes(processor) required for your job + +You can use 'qstat | grep ' to check on your jobs' status. + +.. shell stop + +---- + +Back: :doc:`m-tophat` diff --git a/refTrans/m-advice.txt b/refTrans/m-advice.txt new file mode 100644 index 0000000..4c22420 --- /dev/null +++ b/refTrans/m-advice.txt @@ -0,0 +1,71 @@ +Miscellaneous advice +==================== + +Sequencing depth and number of samples +-------------------------------------- + +`Hart et al. 
(2013) +`__ provides a +nice description and a set of tools for estimating your needed +sequencing depth and number of samples. They provide an `Excel based +calculator +`__ +for calculating number of samples. Their numbers are surprisingly +large to me ;). + +In a proposal for an exploratory effort to discover differentially +expressed genes, I would suggest 3-5 biological replicates with 30-50 +million reads each. More reads is usually cheaper than more replicates, +so 50-100m reads may give you more power to resolve smaller fold changes. + +Developing your own pipeline +---------------------------- + +Even if all you plan to do is change the filenames you're operating on, +you'll need to develop your own analysis pipeline. Here are some tips. + +1. Start with someone else's approach; don't design your own. There + are lots of partly done examples that you can find on the Web, + including in this tutorial. + +2. Generate a data subset (the first few 100k reads, for example). + +2. Run commands interactively on an HPC dev node until you get all of + the commands basically working; track all of your commands in a + Word document or some such. + +3. Once you have a set of commands that seems to work on small data, + write a script. Run the script on the small data again; make sure + that works. + +4. Turn it into a qsub script (making sure you're in the right + directory, have the modules loaded, etc.) + +5. Make sure the qsub script works on your same small data. + +6. Scale up to a big test data set. + +7. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, + edit it to work on all your data sets (you may want to make subsets + again, as much as possible). + +8. Provide your scripts and raw counts files as part of any publication + or thesis, perhaps via `figshare `__. + + +Informational resources +---------------------- + +`UT (Austin) Sequencing Core prices `__ - costs and yields for sequencing. + +`ANGUS - summer NGS course `__ - lots of resources and materials and book reference + +`Data Carpentry `__ - intro to R, etc. + +`Software Carpentry `__ - more scripting, Python, etc. + + +Places to share data, scripts, and results files +------------------------------------------------ + +`Figshare `__. diff --git a/refTrans/m-count.txt b/refTrans/m-count.txt new file mode 100644 index 0000000..fdff133 --- /dev/null +++ b/refTrans/m-count.txt @@ -0,0 +1,67 @@ +Counting the reads that map to each feature +=========================================== + +After mapping of reads back to the transcriptome. Now it is time to figure out +the expression level of each expression unit. There are two main flavours of this process: + + +Counting the reads that map to each gene +---------------------------------------- +Simply count all the reads that map **uniquelly** to all exons of the gene. There are several +software packages that can do this e.g. +`htseq-count `__. +and `Featurecount `__. +Both programs require the user to provide a gene model (Gene annotation file; GTF). +Annotaions file of model oraganisms are avaliable on all gene banks (e.g. +`NCBI `__, +`UCSC `__, and +`Ensemble `__). +Also they are avaliable through other genomic software projects e.g. +`Illumina iGenomes project `__. + +Go for: +:doc:`./m-htseq` + + +Counting the reads that map to each transcript +---------------------------------------------- +Mapping the reads to transcripts (gene isoforms) rather than the whole genes will enable +the identification of changes in alterative splicaing machinary. 
Alternative splicing is believed to be
+one of the most powerful regulatory mechanisms in several biological conditions, including
+`cancer `__,
+`development `__,
+`immunology `__,
+`fertility `__, and many others.
+
+`Cufflinks `__ is the canonical software
+package for this task, but there are other packages for the analysis of differential
+splicing events (exon skipping, intron retention, etc.),
+e.g. `MATS `__ and
+`MISO `__.
+
+Cufflinks also
+pools the results of all transcripts per gene to give an answer similar to the one obtained from
+htseq-like packages. One comparative feature that should be considered is the tendency of Cufflinks to
+use ambiguously aligned reads (versus htseq, which uses only the uniquely mapped reads).
+This feature may explain many of the differences between the output of the two programs.
+
+Cufflinks can do the mapping based on the available annotation files, which
+makes Cufflinks operate on the known transcripts only. However, Cufflinks can also build
+its own gene model. This feature has two important applications:
+
+* Prediction of de novo transcripts of known genes, which is a valid possibility even in well-annotated organisms.
+* Prediction of novel genes. This feature is very helpful for semi-model organisms that do not have a well-annotated transcriptome.
+
+On the other hand, mapping to transcripts requires the implementation of error-prone modeling
+approaches and thus requires rigorous laboratory confirmation.
+
+
+Go for:
+:doc:`./m-cufflinks`
+
+
+----
+
+Links:
+
+* `Trapnell et al. 2013 Differential analysis of gene regulation at transcript resolution with RNA-seq `__
diff --git a/refTrans/m-cuffdiff.txt b/refTrans/m-cuffdiff.txt
new file mode 100644
index 0000000..0c0612c
--- /dev/null
+++ b/refTrans/m-cuffdiff.txt
@@ -0,0 +1,61 @@
+Differential expression by Cuffdiff
+===================================
+
+.. shell start
+
+We'll be using `Cuffdiff
+`__
+to do the differential expression analysis based on the Cufflinks output.
+
+Define the groups you want to compare (according to our sample naming conventions,
+these should exactly match the starting names of your samples)
+::
+
+    groups=(lung salivary)
+
+Cuffdiff receives the replicates as a comma-separated list for each condition.
+You can write those lists yourself, but we can also create them
+and store them in an array.
+::
+
+    gp_replicates=()
+    for gp in "${groups[@]}"; do
+        replicates=()
+        for rep in $gp*_tophat; do
+            replicates+=("$workingPath/$rep/accepted_hits.bam")
+        done
+        gp_replicates+=($(echo ${replicates[*]} | tr ' ' ','))
+    done
+
+You can see what these comma-separated lists look like with::
+
+    declare -p gp_replicates
+    #printf "${gp_replicates[*]}"
+
+Now we can run Cuffdiff
+
+::
+
+    cuffdiff \
+        --output-dir "$workingPath/cuffdiff_out" \
+        --num-threads 2 \
+        --multi-read-correct \
+        --frag-bias-correct ${homo_sapiens_genome} \
+        --labels ${groups[0]},${groups[1]} \
+        $GTF_file ${gp_replicates[0]} ${gp_replicates[1]}
+
+Cuffdiff usually requires a lot of memory and runs faster if you increase the number of threads.
+Check your computer's resources before running it.
+
+Cuffdiff will write a folder named cuffdiff_out in the working directory. The folder contains a lot of files;
+you can read about them `here `__.
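+
+As a quick first look at the results (a minimal sketch; it assumes the run above finished and that
+cuffdiff_out contains the standard gene_exp.diff table, with the q-value in column 13 and the
+yes/no significance flag in the last column), you can pull out the genes Cuffdiff calls significant
+and sort them by q-value::
+
+    # keep the header line, then the rows flagged "yes" in the significant column,
+    # sorted numerically by q-value (column 13)
+    head -1 "$workingPath/cuffdiff_out/gene_exp.diff" > significant_genes.txt
+    awk '$NF == "yes"' "$workingPath/cuffdiff_out/gene_exp.diff" | sort -g -k 13,13 >> significant_genes.txt
+    less significant_genes.txt
+
+(use 'q' to exit).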
+
+.. shell stop
+
+----
+
+Next: :doc:`m-func-analysis`
diff --git a/refTrans/m-cufflinks.txt b/refTrans/m-cufflinks.txt
new file mode 100644
index 0000000..0e3af71
--- /dev/null
+++ b/refTrans/m-cufflinks.txt
@@ -0,0 +1,82 @@
+Counting the reads that map to each transcript
+==============================================
+
+.. shell start
+
+Now that we know which reads go with which gene, we'll use
+`Cufflinks `__ to allocate those reads
+to the possible transcripts of every gene.
+
+.. note:: This section requires cufflinks/2.1.1 to be installed
+
+You can ask Cufflinks to use only annotated transcripts with the argument (--GTF).
+However, there is another argument (--GTF-guide) which tells Cufflinks to use the
+supplied reference annotation to guide reference-annotation-based transcript assembly.
+The output will include all reference transcripts as well as any novel genes and isoforms that are assembled.
+
+As we discussed before, Cufflinks uses the ambiguously aligned reads in its calculations.
+To do this more conservatively, there are several arguments that you can pass to Cufflinks:
+
+1. --multi-read-correct tells Cufflinks to do an initial estimation procedure to more accurately weight reads mapping to multiple locations in the genome.
+2. --frag-bias-correct provides Cufflinks with a genome reference to run a bias detection and correction algorithm, which improves the accuracy of transcript abundance estimates.
+
+However, it is worth mentioning that applying these arguments will cause Cufflinks to take much longer to finish the job.
+For example, with our data set, Cufflinks needs ~10 hours to finish, compared to less than 2 hours without these arguments.
+
+::
+
+    for f in *_tophat; do
+        cufflinks \
+            --GTF-guide "$GTF_file" \
+            --multi-read-correct \
+            --frag-bias-correct ${homo_sapiens_genome} \
+            --num-threads 2 \
+            --output-dir $f/cufflink_out \
+            $f/accepted_hits.bam
+    done
+
+For each sample, we directed the output to a folder called cufflink_out inside the TopHat output folder. The cufflink_out folder will have four files::
+
+    transcripts.gtf        : This GTF file contains Cufflinks' assembled isoforms.
+    isoforms.fpkm_tracking : This file contains the estimated isoform-level expression values.
+    genes.fpkm_tracking    : This file contains the estimated gene-level expression values.
+    skipped.gtf            : List of skipped loci. Cufflinks skips a locus if more than 1,000,000 fragments
+                             map to it. You can change this through the (--max-bundle-frags) option.
+
+Note that Cufflinks expresses the expression values in **FPKM** (Fragments Per Kilobase of exon per Million reads mapped).
+This unit corrects for the gene size and the depth of sequencing. So if the expression of a given gene is 100,
+this means that for every one million fragments (a read if single-ended, or a pair if paired-ended)
+mapping to the whole transcriptome, 100 fragments of this million map to one kilobase of this transcript.
+
+----
+
+Because we have multiple RNA-Seq libraries, it is recommended that you merge these assemblies into a master transcriptome.
+This step is required for a differential expression analysis of the new transcripts we've assembled.
+`Cuffmerge `__ performs this merge step.
+
+The first step is to specify all the input GTF files in a **manifest** file listing the full paths to those files.
+:: + + for f in *_tophat; do + echo $workingPath/$f/cufflink_out/transcripts.gtf >> assembly_GTF_list.txt + done + +Now we can run the cuffmerge software +:: + + cuffmerge \ + --ref-sequence ${homo_sapiens_genome} \ + --ref-gtf $GTF_file \ + --num-threads 2 \ + assembly_GTF_list.txt + +The cuffmerge will output a folder called "merged_asm" which contain "merged.gtf" file +containing all the transcripts that cufflinks had assembled. Cuffmrege has an optional argument +(--ref-gtf) which allow you to pass the reference annotation. Therefore, the merged.gtf has cufflinks +transcripts with their matching reference transcripts. + + +.. shell stop + +---- + +Next: :doc:`m-Cuffdif` + diff --git a/refTrans/m-data-analysis.txt b/refTrans/m-data-analysis.txt new file mode 100644 index 0000000..60773f0 --- /dev/null +++ b/refTrans/m-data-analysis.txt @@ -0,0 +1,17 @@ +Differential expression +======================= + +Now we have tried two different methods to summarize our expression: +* The HTseq has summarized the raw reads per genes and now this ready for further analysis by Packages like DEseq and edgeR. +* Cufflinks has summarized the expression values as FPKM and now this is ready for further analysis by software like Cuffdif. + +So now we are going to disscuss the differential expression analysis by these softwares: + +.. toctree:: + :maxdepth: 2 + + m-edgeR + m-cuffdiff + + +---- \ No newline at end of file diff --git a/refTrans/m-dataNsoftware_HPC.txt b/refTrans/m-dataNsoftware_HPC.txt new file mode 100644 index 0000000..f7ab3da --- /dev/null +++ b/refTrans/m-dataNsoftware_HPC.txt @@ -0,0 +1,172 @@ +High Performance Computaters +============================ + +.. shell start + +Log into HPC +------------ + +Log into the HPC with SSH:: + + ssh @hpc.msu.edu + +use your MSU NetID and log into the machine 'gateway.hpcc.msu.edu'. + +Then log into a compute node e.g. dev-intel14-phi:: + + ssh dev-intel14-phi + + +Assign the working directory +---------------------------- +Assign the path of your working directory to a ageneric variable:: + + workingPath=$"" + export workingPath + +.. :: + + workingPath=$"/mnt/ls12/Tamer/refTransProject/" + export workingPath + + +Getting the data +---------------- + +We'll be using a few RNAseq data sets from Fagerberg et al., `Analysis +of the Human Tissue-specific Expression by Genome-wide Integration of +Transcriptomics and Antibody-based Proteomics +`__. + +You can get this data from the European Nucleotide Archive under +`ERP003613 `. +All samples in this project are `paired end +`__ . +So each sample is represented by 2 files. These files are in +`FASTQ Format `__ . + + +In this tutorial we will work with two tissues: `salivary gland +`__ and `lung +`__. Note that each +tissue has two replicates, and each replicate has two files for the paired end reads + +.. check if the data directory exists +.. :: + + if [ ! -d "$workingPath/refTransData" ]; then + +make a directory to host the data on the HPC +:: + + mkdir $workingPath/refTransData + cd $workingPath/refTransData + +Download the sequencing files of the salivary gland +:: + + wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR315/ERR315325/* + wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR315/ERR315382/* + +Download the sequencing files of the lung tissue +:: + + wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR315/ERR315326/* + wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR315/ERR315424/* + + +.. 
:: + + fi + +Now do:: + + ls -la $workingPath/refTransData/ + +You'll see something like :: + + -r--r--r-- 1 mscholz common-data 3714262571 Dec 4 08:44 ERR315325_1.fastq.gz + -r--r--r-- 1 mscholz common-data 3714262571 Dec 4 08:44 ERR315325_2.fastq.gz + ... + +which tells you that this file is about 900 MB. You can go a head and use these files +for the reset of the protocl. But for the sake of time (& memory), we will run our demo +on a subset of this dat + +Copying in some data to work with +--------------------------------- +These data files are too big and will take us hours and hourse for analysis. So let us make it simple. +We will create smaller files but getting only the first 100000 reads. + +.. check if the subset data directory exists +.. :: + + if [ ! -d "./dataSubset" ]; then + +First, make a directory +:: + + mkdir ./dataSubset + +Copy in a subset of the data (100,000 reads) +:: + + gunzip -c ERR315325_1.fastq.gz | head -400000 | gzip > dataSubset/salivary_repl1_R1.fq.gz + gunzip -c ERR315325_2.fastq.gz | head -400000 | gzip > dataSubset/salivary_repl1_R2.fq.gz + gunzip -c ERR315382_1.fastq.gz | head -400000 | gzip > dataSubset/salivary_repl2_R1.fq.gz + gunzip -c ERR315382_2.fastq.gz | head -400000 | gzip > dataSubset/salivary_repl2_R2.fq.gz + +and do the same for the lung samples +:: + + gunzip -c ERR315326_1.fastq.gz | head -400000 | gzip > dataSubset/lung_repl1_R1.fq.gz + gunzip -c ERR315326_2.fastq.gz | head -400000 | gzip > dataSubset/lung_repl1_R2.fq.gz + gunzip -c ERR315424_1.fastq.gz | head -400000 | gzip > dataSubset/lung_repl2_R1.fq.gz + gunzip -c ERR315424_2.fastq.gz | head -400000 | gzip > dataSubset/lung_repl2_R2.fq.gz + +.. :: + + fi + +Link your data into a working directory +--------------------------------------- +Rather than copying the files into the working directory, let's just link them in. +this creates a reference so that UNIX knows where to find them but doesn't need to actually move them around. +:: + ln -fs $workingPath/refTransData/dataSubset/*.fq.gz $workingPath + +.. note:: + + You can use your own data instead. The code can be applied to any paired end data set whatever the number of the studied + groups and whatever the number of replicates. To minimize the code manipulation, you would rename your samples according to match + our conventions: _repl#_R(1or2).fq.qz + + +install/load software +--------------------- +Most of softwares used through this protocol require to be installed on the root dircteory of HPC. +This action requires adminstrative privilages. The IT supervisors of the HPC install software packages +in the form modules that you can simply load by the command: module load + +Now let us load all the modules reuired to go through this protocol +:: + + module load FastQC/0.11.2 + module load Trimmomatic/0.32 + module load TopHat2/2.0.12 + module load PySAM/0.6 + module load HTSeq/0.6.1 + module load R + module load cufflinks/2.1.1 + + +.. note:: Loading TopHat2 on MSU HPC includes loading Bowtie/2.2.3.0 and Samtools/0.1.19.0 + +We will elaborate more on each software when we use it + + +.. shell stop + +---- + +Next: :doc:`m-quality` diff --git a/refTrans/m-dataNsoftware_amazon.txt b/refTrans/m-dataNsoftware_amazon.txt new file mode 100644 index 0000000..ad7ef4a --- /dev/null +++ b/refTrans/m-dataNsoftware_amazon.txt @@ -0,0 +1,258 @@ +Amazon cloud computer +===================== + +.. 
shell start
+
+Cloud computing involves deploying groups of remote servers and software
+networks that allow centralized data storage and online access to computer
+services or resources (`wikipedia `__).
+
+`Amazon Web Services `__ offers a broad set of cloud
+computing services. In this protocol we will implement our whole pipeline on an
+Amazon Elastic Compute Cloud (EC2) instance.
+
+This does mean that the first thing you need to do is get your data
+over to the cloud.
+
+The basics
+----------
+
+... Amazon is happy to rent disk space to you, in addition to compute time.
+They'll rent you disk space in a few different ways, but the way that's
+most useful for us is through what's called Elastic Block Store. This
+is essentially a hard-disk rental service.
+
+There are two basic concepts -- "volume" and "snapshot". A "volume" can
+be thought of as a plug-in hard drive: you create an empty volume of
+a given size, attach it to a running instance, and voila! You have extra
+hard disk space. Volume-based hard disks have two problems, however:
+first, they cannot be used outside of the "availability zone" they've
+been created in, which means that you need to be careful to put them
+in the same zone that your instance is running in; and second, they can't be
+shared among people.
+
+Snapshots, the second concept, are the solution to transporting and
+sharing the data on volumes. A "snapshot" is essentially a frozen
+copy of your volume; you can copy a volume into a snapshot, and a
+snapshot into a volume.
+
+Run through :doc:`../amazon/index` once, to get the hang of the mechanics.
+You can also read :doc:`./data-snapshot` to see how I created my data snapshot.
+
+Launch a data snapshot
+----------------------
+Go ahead and open the EC2 service to launch a suitable instance (or instances).
+Change your location in the top right corner of the page to US East (N. Virginia).
+Push the **Launch Instance** button, then follow these instructions:
+
+1. On "Step 1: Choose an Amazon Machine Image (AMI)": the protocol was prepared to run on "Ubuntu Server 14.04 LTS (HVM)".
+2. On "Step 2: Choose an Instance Type": you can run this protocol on a **t2.medium instance** (2 vCPU and 4 GiB memory), but when you start running your own data you can scale this up or down according to your needs.
+3. On "Step 3: Configure Instance Details": we will accept the default settings.
+4. On "Step 4: Add Storage":
+   a. Start the default volume with 20 GiB.
+   b. Add a new EBS volume of 1 GiB, general purpose SSD type.
+   c. Use our snapshot (snap-6a5ddae5) to create the additional volume. You can search for it by the snapshot description "refTrans sample data".
+   d. Set the device name to /dev/sdf (be careful about naming your device; for a Linux instance, the recommended device names are '/dev/sd[f-p]'. `Source: AWS Block Device Mapping Documentation `__).
+   e. Mark the volume with **delete after termination** (you always have the snapshot to make new volumes).
+5. On "Step 5: Tag Instance": give your instance a name.
+6. On "Step 6: Configure Security Group": create a new security group and add security rules to enable ports 22, 80, and 443 (SSH, HTTP, and HTTPS).
+7. On "Step 7: Review Instance Launch": review the information for your instance. You will see a warning about the security of your instance. This is OK; once you click launch, you will be able to make a key pair (or use yours if you already have one).
+8. After you launch your instance, the confirmation page will show your instance ID.
Click the instance ID to watch your instance status. In the lower half of the page, you will description info of your instance. Copy the puplic IP address to use in the next step. + +Logging into your new cloud instance (Windows version) +------------------------------------------------------ +You need an SSH client to conect from you own computer to the Amazon instance you just started in the cloud: + +1. Download PuTTY and PuTTYgen from: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html +2. Run PuTTYgen, Find and load your '.pem' file, and Now, "save private key" +3. Run PuTTY, paste the public DNS in the host name. In the left category panel, expand the SSH entry, select Auth, find your private key and click open +4. click yes if probmbted then Log in as "ubuntu" +5. create a folder to contain all the files of this expermint and define this path + :: + + mkdir refTransProject + workingPath=$"/home/ubuntu/refTransProject" + export workingPath + +6. Check for the directory structure :: + + lsblk + + + You should see something like this:: + + NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT + xvda 202:0 0 8G 0 disk + └─xvda1 202:1 0 8G 0 part / + xvdf 202:16 0 1G 0 disk + + + | /dev/xvda1 is mounted as the root device (note the MOUNTPOINT is listed as /, the root of the Linux file system hierarchy), and + | /dev/xvdf is attached, but it has not been mounted yet. + + +7. Create a mounting point and mount the new volume + :: + + mkdir $workingPath/refTransData + sudo mount /dev/xvdf $workingPath/refTransData + + + | Check using lsblk again to make sure everything is ok. + +Now, link the data to the working directory. +This creates a reference so that UNIX knows where to find them but doesn't need to actually move them around. +:: + + ln -fs $workingPath/refTransData/*.fq.gz $workingPath + + +Now if you typed:: + + ls refTransProject/ + +You should see the 8 fastq files in your folder + +.. note:: + + You can use your own data instead. The code can be applied to any paired end data set whatever the number of the studied + groups and whatever the number of replicates. To minimize the code manipulation, you would rename your samples according to match + our conventions: _repl#_R(1or2).fq.qz + +install software and link to the working directory +-------------------------------------------------- + +.. note:: + During the installation, copy the commands to your prompt one line + at a time to make sure that every line pass safelly and because some + commands are interactive requiring your confirmation to continue + +install prerequisites +:: + + cd ~ + sudo apt-get update + sudo apt-get install gcc g++ pkg-config wget make + sudo apt-get install unzip + sudo add-apt-repository ppa:webupd8team/java + sudo apt-get update + sudo apt-get install oracle-java7-installer + +.. 
Source http://www.webupd8.org/2012/01/install-oracle-java-jdk-7-in-ubuntu-via.html + +install FastQC/0.11.2 +:: + + cd /usr/src/ + sudo wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.2.zip + sudo unzip fastqc_v0.11.2.zip + cd FastQC + sudo chmod 755 fastqc + sudo ln -s /usr/src/FastQC/fastqc /usr/local/bin/fastqc + + +install Trimmomatic/0.32 +:: + + cd /usr/src/ + sudo wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.32.zip + sudo unzip Trimmomatic-0.32.zip + sudo mv /usr/src/Trimmomatic-0.32/trimmomatic-0.32.jar /usr/src/Trimmomatic-0.32/trimmomatic + TRIM=$"/usr/src/Trimmomatic-0.32" + export TRIM + + +install Bowtie/2.2.3.0 +:: + + cd /usr/src/ + sudo wget http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.3/bowtie2-2.2.3-source.zip + sudo unzip bowtie2-2.2.3-source.zip + cd bowtie2-2.2.3 + sudo make + PATH=$PATH:/usr/src/bowtie2-2.2.3 + export PATH + +install TopHat2/2.0.13 +:: + + cd /usr/src/ + sudo wget http://ccb.jhu.edu/software/tophat/downloads/tophat-2.0.13.Linux_x86_64.tar.gz + sudo tar -zxvf tophat-2.0.13.Linux_x86_64.tar.gz + cd /usr/bin + sudo ln -s /usr/src/tophat-2.0.13.Linux_x86_64/tophat2 ./tophat + +install samtools/0.1.19-1 +:: + + cd ~ + sudo apt-get install samtools + +install HTSeq/0.6.1 +:: + + cd /usr/src/ + sudo apt-get install build-essential python2.7-dev python-numpy python-matplotlib + sudo wget https://pypi.python.org/packages/source/H/HTSeq/HTSeq-0.6.1.tar.gz + sudo tar -zxvf HTSeq-0.6.1.tar.gz + cd HTSeq-0.6.1 + sudo python setup.py install --user + PATH=$PATH:/home/ubuntu/.local/bin + export PATH + sudo chmod -R 755 ~/.local + + +install PySam +:: + + cd ~ + sudo apt-get install python-pip + sudo pip install pysam + +Install GNU R core and the edgeR package +:: + + cd ~ + sudo apt-get install r-base-core + sudo Rscript -e "source('http://bioconductor.org/biocLite.R'); biocLite('edgeR');" + +install cufflinks/2.1.1 +:: + + cd /usr/src/ + sudo wget http://cole-trapnell-lab.github.io/cufflinks/assets/downloads/cufflinks-2.1.1.Linux_x86_64.tar.gz + sudo tar -zxvf cufflinks-2.1.1.Linux_x86_64.tar.gz + cd /usr/bin + sudo ln -s /usr/src/cufflinks-2.1.1.Linux_x86_64/cufflinks . + sudo ln -s /usr/src/cufflinks-2.1.1.Linux_x86_64/cuffmerge . + sudo ln -s /usr/src/cufflinks-2.1.1.Linux_x86_64/cuffcompare . + sudo ln -s /usr/src/cufflinks-2.1.1.Linux_x86_64/cuffdiff . + sudo ln -s /usr/src/cufflinks-2.1.1.Linux_x86_64/gffread . + sudo ln -s /usr/src/cufflinks-2.1.1.Linux_x86_64/gtf_to_sam . + +install tmux +:: + + cd ~ + sudo apt-get install tmux + +.. Install FastX/0.0.13.2:: + cd ~ + sudo apt-get install gcc g++ pkg-config wget make + wget http://hannonlab.cshl.edu/fastx_toolkit/libgtextutils-0.6.1.tar.bz2 + tar -xjf libgtextutils-0.6.1.tar.bz2 + cd libgtextutils-0.6.1 + sudo sh -c "./configure && make && make install" + cd ~ + wget http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit-0.0.13.2.tar.bz2 + tar xjf fastx_toolkit-0.0.13.2.tar.bz2 + cd fastx_toolkit-0.0.13.2/ + sudo sh -c "./configure && make && make install" + export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH + + +.. shell stop + +---- + +Next: :doc:`m-quality` diff --git a/refTrans/m-edgeR.txt b/refTrans/m-edgeR.txt new file mode 100644 index 0000000..d5bba7c --- /dev/null +++ b/refTrans/m-edgeR.txt @@ -0,0 +1,126 @@ +Differential expression by EdgeR +================================ + +.. shell start + +We'll be using `edgeR +`__ +to do the basic differential expression analysis of our counts. + +.. 
note:: This section reuires GNU R core installed with edgeR package installed + +To run edgeR, you need to write a data loading and manipulation script in R. +This script will load in all the studied samples, execute an MA plot, +and provide a spreadsheet with differential expression of the studied tissues. +We will go togther step by step to understand and write this scrip through the command line + +First, We need to combine all the count files of the studied samples into one dataset. +Also we need to classfy these samples into groups to compare them. In this case we will have +two replicates for each studied tissue. This code this going to deals with your +samples whatever the no of replicates and the studied groups. +:: + + echo " + First_sample <- TRUE + columnID <- 2 + group <- c() + for (tophat_folder in dir(pattern='*_tophat')){ + file.name <- paste(tophat_folder, '/total_count.txt', sep='') + sample.name <- sub('_tophat','',tophat_folder) + if (First_sample){ + all.data <- read.delim(file.name, col.names=c('gene', sample.name), sep='\t', colClasses=c('character', 'numeric'), row.names=1) + First_sample <- FALSE + } else { + temp <- read.delim(file.name, col.names=c('gene', sample.name), sep='\t', colClasses=c('character', 'numeric'), row.names=1) + #all.data$sample.name <- temp$sample.name + all.data[,columnID] <- temp[,1] + colnames(all.data)[columnID] = sample.name + columnID = columnID + 1 + } + group_label <- sub('_repl[1-2]','', sample.name) + group <- c(group,group_label) + }" > edgeR_difExp.R + +The count output of HTseqcount ends with five lines of summary statistics about the mapped reads. +So we need to the eliminate the last five rows in our dataset. +:: + + echo "all.data <- all.data[1:(nrow(all.data)-5),]" >> edgeR_difExp.R + +We now have a data frame with all the data in it. let us have an additional +line to show us the first few lines of our dataset:: + + echo "head(all.data)" >> edgeR_difExp.R + +It is time to load edgR package into memory o start working with +:: + + echo "library('edgeR')" >> edgeR_difExp.R + +Now, it is time to calculte the differention expression. +`edgeR `__ +has different functions for calculation of differential expression. +We are using the 2 group comparison model. If you have more than 2 groups to compare, +you will have to select the appropriate functions according to the edgR manual +:: + + echo " + dge = DGEList(counts=all.data, group=group) + dge <- estimateCommonDisp(dge) + dge <- estimateTagwiseDisp(dge) + et <- exactTest(dge) + etp <- topTags(et, n=100000)" >> edgeR_difExp.R + +Let's add a line to have a quick view of the results :: + + echo "summary(etp\$table)" >> edgeR_difExp.R + +We also can plot our results +:: + + echo " + etp\$table\$logFC = -etp\$table\$logFC + pdf('edgeR-MA-plot.pdf') + plot( + etp\$table\$logCPM, + etp\$table\$logFC, + xlim=c(-3, 20), ylim=c(-12, 12), pch=20, cex=.3, + col = ifelse( etp\$table\$FDR < .1, 'red', 'black' ) ) + dev.off()" >> edgeR_difExp.R + +Finally we will write a spreadsheet with differential expression of the studied tissues +:: + + echo "write.csv(etp\$table, 'edgeR-lung-vs-salivary.csv')" >> edgeR_difExp.R + +.. 
Note:: + + you can copy the script from this link and save it in the working directory + under the name edgeR_difExp.R + + :doc:`./m-edgeR_difExp` + + +Now we can run the R script to do the analysis +:: + + Rscript edgeR_difExp.R + +This will produce two files, `edgeR-MA-plot.pdf +`__ +and `edgeR-lung-vs-salivary.csv +`__; +The last file consists of five columns: gene name, log fold change, P-value, and +FDR-adjusted P-value. + +.. shell stop + +---- + +Links: + +* `edgeR tutorial from UT Austin `__ + +---- + +Next: :doc:`m-func-analysis` diff --git a/refTrans/m-edgeR_difExp.txt b/refTrans/m-edgeR_difExp.txt new file mode 100644 index 0000000..b6d7caf --- /dev/null +++ b/refTrans/m-edgeR_difExp.txt @@ -0,0 +1,63 @@ +edgeR_difExp.R +============== +You can copy the script from here and save it in the working directory under the name edgeR_difExp.R:: + + # First, combine all the count files into one dataset + # and create a list of group labels where all the replicates from the same group have the same label + First_sample <- TRUE + columnID <- 2 + group <- c() + for (tophat_folder in dir(pattern="*_tophat")){ + file.name <- paste(tophat_folder, "/total_count.txt", sep="") + sample.name <- sub("_tophat","",tophat_folder) + if (First_sample){ + all.data <- read.delim(file.name, col.names=c("gene", sample.name), sep="\t", colClasses=c("character", "numeric"), row.names=1) + First_sample <- FALSE + } else { + temp <- read.delim(file.name, col.names=c("gene", sample.name), sep="\t", colClasses=c("character", "numeric"), row.names=1) + #all.data$sample.name <- temp$sample.name + all.data[,columnID] <- temp[,1] + colnames(all.data)[columnID] = sample.name + columnID = columnID + 1 + } + group_label <- sub("_repl[1-2]","", sample.name) + group <- c(group,group_label) + } + + # eliminate the last five rows + all.data <- all.data[1:(nrow(all.data)-5),] + + # We now have a data frame with all the data in it! + head(all.data) + + ### + + # now, do differential expression w/edgeR + library("edgeR") + + # calculate differential expression and dispersion + dge = DGEList(counts=all.data, group=group) + dge <- estimateCommonDisp(dge) + dge <- estimateTagwiseDisp(dge) + + # plot! + et <- exactTest(dge) + etp <- topTags(et, n=100000) + + summary(etp$table) + + etp$table$logFC = -etp$table$logFC + pdf("edgeR-MA-plot.pdf") + plot( + etp$table$logCPM, + etp$table$logFC, + xlim=c(-3, 20), ylim=c(-12, 12), pch=20, cex=.3, + col = ifelse( etp$table$FDR < .1, "red", "black" ) ) + dev.off() + + # output CSV + write.csv(etp$table, "edgeR-lung-vs-salivary.csv") + +---- + +Back: :doc:`m-data-analysis` diff --git a/refTrans/m-func-analysis.txt b/refTrans/m-func-analysis.txt new file mode 100644 index 0000000..9da065a --- /dev/null +++ b/refTrans/m-func-analysis.txt @@ -0,0 +1,50 @@ +Functional and network analysis on differentially expressed genes +================================================================= + +There are a number of sites that let you analyze gene lists, including +`DAVID `__. I have gotten a number of +recommendations for `PANTHER `__. PANTHER +lets you upload gene lists and explore their functional categories +interactively or in a gene list/annotation format. + +More specifically, you can explore your RNAseq data with + +* functional classifications, in pie chart or in list; +* tests for statistical overrepresentation; +* tests for statistical enrichment based on associated fold change. + +You need to crop and transform the data a little bit before using the +functional classification. 
The steps are: + +1. Download `the CSV file `__. If you've produced your own, copy it over from the HPC. + +2. Open it in Excel. + +3. Choose an FDR cutoff (suggest FDR < 0.05 or lower) and delete all the rows after that (or, copy the rows into a new spreadsheet -- might be quicker). + +4. Save as a "Tab-delimited text." (Note, on Mac OS X, you may need to save this as "Windows formatted text" instead.) + +This is now a file you can upload to PANTHER. + +(`You can download my copy of this here `__) + +To do the analysis, go to http://www.pantherdb.org/. In box 1, select "Choose file" +and find the CSV file you want to upload to PANTHER. Nothing else in box +one should be changed. + +In box 2, select "Homo sapiens." + +In box 3, select either "Functional analysis classification viewed in +gene list" or "Functional analysis classification viewed in pie +chart." + +Click submit. + +Now you can explore the results! + +---- + +Links: + +* `PANTHER database `__ + diff --git a/refTrans/m-htseq.txt b/refTrans/m-htseq.txt new file mode 100644 index 0000000..9900808 --- /dev/null +++ b/refTrans/m-htseq.txt @@ -0,0 +1,81 @@ +Counting the reads that map to each gene +======================================== + +.. shell start + +Now that we know which reads go with which gene, we'll use +`htseq-count `__. + +.. note:: This section reuires PySAM/0.6 and HTSeq/0.6.1 installed + +HTSeq can not deal with BAM files containing mix of paired end and single end reads. +So we need to split each BAM file according to the type of reads. `Samtools +`__ can do this job for us using +the view command. Paire end reads are labeled with FLAG=1 so we can separate the +paired reads using the (-f) option which will *keep* only reads labeled with FALG=1. +Also we can get the single end reads in a smilliar way but using (-F) option +which will *exclude* reads labeled with FALG=1. + +Let us do this for all the BAM files +:: + + for f in *_tophat; do + printf $f"\n" + samtools view -f 1 -bh -o "$f"/accepted_hits_PE.bam "$f"/accepted_hits.bam + samtools view -F 1 -bh -o "$f"/accepted_hits_SE.bam "$f"/accepted_hits.bam + done + +Second preparatory step is to sort the paired end BAM files by name. You do not have to +but HTSeq might through an error if you are using a huge BAM file. +:: + + for f in *_tophat; do + printf $f"\n" + samtools sort -n $f/accepted_hits_PE.bam $f/accepted_hits_PE2 + done + +And next, run HTSeq +:: + + for f in *_tophat; do + output_PE=$(basename "$f" | cut -f 1,2 -d "_")_PEcounts.txt + printf $output_PE"\n" + + htseq-count --format=bam --stranded=no --order=name \ + $f/accepted_hits_PE2.bam \ + $GTF_file > $f/$output_PE + + output_SE=$(basename "$f" | cut -f 1,2 -d "_")_SEcounts.txt + printf $output_SE"\n" + htseq-count --format=bam --stranded=no --order=name \ + $f/accepted_hits_SE.bam \ + $GTF_file > $f/$output_SE + done + +When this is done, let us view one of the output file:: + + less salivary_repl1_tophat/salivary_repl1_PEcounts.txt + +(use 'q' to exit). These are your gene counts. + +Final step will be summing the reads from both PE and SE analysis +:: + + for f in *_tophat; do + paste $f/*counts.txt | cut -f1,2,4 > $f/combined_temp.txt + cat $f/combined_temp.txt | awk '{print $2+$3}' > $f/total_temp.txt + paste $f/combined_temp.txt $f/total_temp.txt | cut -f1,4 > $f/total_count.txt + rm $f/*_temp.txt + done + +Note, these are *raw* gene counts - the number of reads that map to +each feature (gene, in this case). They are not normalized by length +of gene. 
According to `this post on seqanswers +`__, both +DEseq and edgeR want exactly this kind of information! + +.. shell stop + +---- + +Next: :doc:`m-edgeR` diff --git a/refTrans/m-quality.txt b/refTrans/m-quality.txt new file mode 100644 index 0000000..f71cc1b --- /dev/null +++ b/refTrans/m-quality.txt @@ -0,0 +1,138 @@ +Short read quality and trimming +=============================== + +.. shell start + +It is time to run your Tmux session and it is a good trend to do this from your root directory:: + + cd ~ + tmux + +Now, the remote computer will sustain your jobs even your ssh connection was interrupted. +You can always restore your tmux session by going to your root directory and type "tmux attach" + + +Assessment of sequence quality (FastQC) +--------------------------------------- + +We're going to use `FastQC `__ to summarize the data. +FastQC operates on Fastq files as well as GZip compressed Fastq (.gz) files. + +.. note:: This section reuires FastQC/0.11.2 installed + +Now run FastQC on your sequencing file +:: + + cd $workingPath + mkdir rawData_fastqc + fastqc *.fq.gz --outdir=$workingPath/rawData_fastqc + +.. .. note:: When you pass multiple files to FastQC, it operates one file at a time give it a seprate output. + + +Now type 'ls' in the output directory to explore the result files:: + + ls $workingPath/rawData_fastqc + +and you will see :: + + salivary_repl1_R1_fastqc.html + salivary_repl1_R1_fastqc.zip + ... + +In the root of your terminal, you can copy these files to your laptop and open them in a browser + +For HPCC:: + + scp :/rawData_fastqc/salivary*fastqc.* /tmp + +For Amazon:: + + scp -i ubuntu@:/path/to/file . + +and then open the html files in your browser. + +Links: + +* `FastQC tutorial video `__ + +Trimming of low quality bases (Trimmomatic) +------------------------------------------- + +Now we're going to do some trimming to remove any leftover adaptor sequences and low quality trailing edges. +We'll be using `Trimmomatic `__. +For a discussion of optimal RNAseq trimming strategies, see `MacManes, 2014 +`__. + +.. note:: This section reuires Trimmomatic/0.32 installed + +To run Trimmomatic +:: + + cd $workingPath + for f in *R1*.gz; do + input_R1="$f" + input_R2=$(echo "$f" | sed s/R1/R2/) + output_R1_PE=$(echo "$f" | sed s/R1/R1.pe/) + output_R1_SE=$(echo "$f" | sed s/R1/R1.se/) + output_R2_PE=$(echo "$f" | sed s/R1/R2.pe/) + output_R2_SE=$(echo "$f" | sed s/R1/R2.se/) + java -jar $TRIM/trimmomatic PE $input_R1 $input_R2 \ + $output_R1_PE $output_R1_SE $output_R2_PE $output_R2_SE \ + ILLUMINACLIP:$TRIM/adapters/TruSeq3-PE.fa:2:40:15 \ + LEADING:2 TRAILING:2 \ + SLIDINGWINDOW:4:15 \ + MINLEN:25 &>> log.txt + done + +.. # Titus workshop: ILLUMINACLIP:$TRIM/adapters/TruSeq3-PE.fa:2:40:15 LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 +.. # my parameters: ILLUMINACLIP:$TRIM/adapters/TruSeq3-PE.fa:2:30:10:1 SLIDINGWINDOW:4:15 + + +Now type 'ls' in the output directory to explore the result files:: + + ls *.fq.gz + +and you will see :: + + salivary_repl1_R1.fq.gz + salivary_repl1_R1.pe.fq.gz + salivary_repl1_R1.se.fq.gz + salivary_repl1_R2.fq.gz + salivary_repl1_R2.pe.fq.gz + salivary_repl1_R2.se.fq.gz + ... + +Trimming of paired end reads in four output files for each pair input files. +The two output files labeled with (.pe) represnt the reads that are still in the pair end format. +The other two files labled with (.se) represent the unpaired reads remaining from one input file +after dropping the corresponding read from the second input file. 
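+
+If you want to see how the reads were distributed between the paired and orphaned files, a quick check
+(a minimal sketch, assuming the trimmed files follow the naming convention above) is to count the FASTQ
+records in each output file::
+
+    for f in *.pe.fq.gz *.se.fq.gz; do
+        # each FASTQ record is exactly four lines, so divide the line count by 4
+        lines=$(gunzip -c "$f" | wc -l)
+        echo "$f: $((lines / 4)) reads"
+    done
+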
+
+For a better understanding of the Trimmomatic parameters, you should check
+the `Trimmomatic web page `__. You may need to tweak these parameters for
+your own data.
+
+.. * Are parameters different for RNAseq and genomic?
+.. * What do we do with the single-ended files (s1_se and s2_se?)
+
+FastQC again
+------------
+
+Run FastQC again
+::
+
+    mkdir trimmedData_fastqc
+    fastqc *.pe.fq.gz --outdir=$workingPath/trimmedData_fastqc
+
+
+Now you can download the output files and check how the Trimmomatic treatment
+changed the reports.
+
+.. * Does it matter that you still have adapters!?
+
+.. shell stop
+
+----
+
+Next: :doc:`m-tophat`
diff --git a/refTrans/m-resources.txt b/refTrans/m-resources.txt
new file mode 100644
index 0000000..aad6649
--- /dev/null
+++ b/refTrans/m-resources.txt
@@ -0,0 +1,26 @@
+High Performance Computation
+============================
+
+Over the last few years, next-generation sequencing (NGS) has become widely
+available and cost-effective. The tremendous amounts of data generated by NGS
+are fundamentally changing how research is done. However, these huge amounts
+of data require dedicated software and substantial computational resources.
+
+We have adapted this protocol to run on two computational platforms:
+
+.. toctree::
+   :maxdepth: 2
+
+   m-dataNsoftware_HPC
+   m-dataNsoftware_amazon
+
+.. note::
+
+    On either system, you work from your local machine but your jobs run on a
+    remote server; we use an `SSH client `__ for this communication.
+    One problem to be aware of: when you log out of the session, whatever job
+    you are running on the remote computer will stop. Some of our jobs take
+    hours, maybe days, and it would be too frustrating to start over each time
+    you lose the connection to the server. So we need a way to keep our jobs
+    running even when we are not logged in. A couple of programs can do that,
+    for example `tmux `__. tmux lets us begin a session, detach from it, and
+    let it keep running by itself.
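+
+    As a minimal sketch of that workflow (the session name ``rnaseq`` is just
+    an example)::
+
+        tmux new -s rnaseq       # start a named session and work inside it
+        # ... launch a long-running job ...
+        # press Ctrl-b then d to detach; the job keeps running on the server
+        tmux attach -t rnaseq    # reattach later, even from a new SSH login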
-d "$workingPath/Homo_sapiens" ]; then + +:: + + cd $workingPath + wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Homo_sapiens/UCSC/hg19/Homo_sapiens_UCSC_hg19.tar.gz + tar -zxvf Homo_sapiens_UCSC_hg19.tar.gz + +.. :: + + fi + +.. note:: + + The homo sapians iGenome tarball is huge (~20 GiB). I have prepared a smaller version containing the minimal required files. + This file is stored in Amazon S3. To use this file replace the last 3 lines of code with:: + + cd $workingPath + wget https://s3.amazonaws.com/reftransdata/Homo_sapiens_UCSC_hg19_small.tar.gz + tar -zxvf Homo_sapiens_UCSC_hg19_small.tar.gz + mv Homo_sapiens2 Homo_sapiens + +.. for more details about creating and using this file, check this file: :doc:`./igenome-backup` + + +Define the path of the genome and genome_index_base which is the name of any +of the index files up to but not including the first period. + +:: + + homo_sapiens_genome="$workingPath/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa" + Bowtie2Index="$workingPath/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome" + + +Making use of gene model annotations +------------------------------------ +Model organisms usually have a good annotation of their transcriptome. Annotation +information is typically shared in GTF files. Illumina iGenomes indexes come +with GTF transcript annotation files (Congratualtion, this means you have downloaded them already) +Let us define the path of the required GTF in the downloaded iGenome package +:: + + GTF_file="$workingPath/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf" + +.. Note:: + The human genome is compased of multi fasta entries for chr1-22, M, X,and Y. + genes.gtf has some non canonical chromosomal headers e.g. chr17_ctg5_hap1, chr17_gl000205_random, + chrUn_gl000228, and many more. You can learn more about these header from this + `discussion `__ . These header will cause some software to + through some warning messages. Do not worry about it :) + +We can pass the annotation file to TopHat through the optional argument (--GTF). +TopHat will first extract the transcript sequences and use Bowtie to align reads to this virtual transcriptome first. +Only the reads that do not fully map to the transcriptome will then be mapped on the genome. + +For TopHat to make use of GTF file, it has to create a Bowtie index for the transcriptome +sequences which could be time consuming. If you are aligning multiple samples with TopHat, +we can create these indexes once and reused for multiple and even *simultaneous* TopHat runs. +To create the transcriptome-index, we can inovke TopHat without input reads +:: + + tophat \ + --num-threads 1 \ + --GTF "$GTF_file" \ + --transcriptome-index "$workingPath/trans_Index/transcriptome" \ + $Bowtie2Index + + +Prepare the input reads +----------------------- +You can pass these two PE files directly to the TopHat to do the pair end orinted alignment. +TopHat allows the use of additional unpaired reads but it has to pass as a single file. 
+
+Prepare the input reads
+-----------------------
+You can pass the two PE files of each sample directly to TopHat for
+paired-end oriented alignment. TopHat also allows the use of additional
+unpaired reads, but they have to be passed as a single file.
+
+So we need to merge the two SE files belonging to the same sample
+::
+
+    for f in *R1.se*; do
+        input_R1="$f"
+        input_R2=$(echo "$f" | sed s/R1/R2/)
+        output=$(echo "$f" | sed s/R1/R/)
+        cat $input_R1 $input_R2 > $output
+    done
+
+Mapping reads
+-------------
+
+And now run TopHat on one sample::
+
+    tophat \
+        --num-threads 2 \
+        --GTF "$GTF_file" \
+        --transcriptome-index "$workingPath/trans_Index/transcriptome" \
+        --output-dir salivary_repl1_tophat \
+        $Bowtie2Index \
+        salivary_repl1_R1.pe.fq.gz salivary_repl1_R2.pe.fq.gz,salivary_repl1_R.se.fq.gz
+
+This will take about 10-15 minutes. We can run all the samples sequentially
+using a for loop.
+::
+
+    for f in *R1.pe*; do
+        input_R1="$f"
+        input_R2=$(echo "$f" | sed s/R1/R2/)
+        input_R=$(echo "$f" | sed s/R1.pe/R.se/)
+        output=$(basename "$f" | cut -f 1,2 -d "_")_tophat
+        tophat \
+            --num-threads 1 \
+            --GTF "$GTF_file" \
+            --transcriptome-index "$workingPath/trans_Index/transcriptome" \
+            --output-dir $output \
+            $Bowtie2Index \
+            $input_R1 $input_R2,$input_R
+    done
+
+.. Note::
+
+    With large input files, TopHat will take hours, maybe days. You can
+    increase the number of threads, but make sure that you have the required
+    underlying resources. It is more efficient to run TopHat on all samples as
+    parallel jobs: on Amazon AWS you can run multiple instances, and on the
+    HPC you can submit multiple jobs simultaneously.
+
+    :doc:`./m-HPC-tophat`
+
+
+The output will be a folder for each sample, named like
+``salivary_repl1_tophat``. Each folder contains multiple files:
+
+1. accepted_hits.bam: The main output file, which contains the read alignments in `SAM `__ format.
+2. unmapped.bam: A file in the same format, containing only the unmapped reads.
+3. junctions.bed, insertions.bed, deletions.bed: `UCSC BED tracks `__ of the junctions, insertions, and deletions reported by TopHat.
+4. align_summary.txt: a text file with summary statistics about the mapped reads.
+
+
+Counting mapped reads percentage
+--------------------------------
+
+You can open the align_summary file in any output folder::
+
+    vim salivary_repl1_tophat/align_summary.txt
+
+We can also calculate these statistics ourselves. Let's ask samtools for the
+total number of reads that mapped: we count all the alignments in the BAM
+file but exclude unmapped reads (-F 4) and non-primary alignments (-F 256)::
+
+    samtools view -c -F 4 -F 256 salivary_repl1_tophat/accepted_hits.bam
+
+You should get around 170,849.
+
+We can count the reads in our input files::
+
+    R1_pe="$(zcat salivary_repl1_R1.pe.fq.gz | echo $((`wc -l`/4)))"
+    R2_pe="$(zcat salivary_repl1_R2.pe.fq.gz | echo $((`wc -l`/4)))"
+    R_se="$(zcat salivary_repl1_R.se.fq.gz | echo $((`wc -l`/4)))"
+    echo $((R1_pe + R2_pe + R_se))
+
+You should get around 189,150.
+
+So you have 170,849 mapped reads out of 189,150 -- about 90.3%. That's pretty good!
+
+.. shell stop
+
+----
+
+Next: :doc:`m-count`