diff --git a/.gitignore b/.gitignore index 7c5239d..2fa0d0f 100644 --- a/.gitignore +++ b/.gitignore @@ -4,7 +4,6 @@ *.egg *.egg-info dist -build eggs parts bin diff --git a/amazon/custom-ami.txt b/amazon/custom-ami.txt deleted file mode 100644 index ef36e48..0000000 --- a/amazon/custom-ami.txt +++ /dev/null @@ -1,125 +0,0 @@ -Creating a Custom AMI -===================== - -When you find yourself continually starting EC2 instances and instantly having -to modify the AMI before each project, you may consider creating a custom AMI. -Below are the steps used to create ami-7f340c16. - -Launch base image ------------------ - -The ubuntu to provides at excellent base for our needs. As such we will begin -with ami-8f2718e6 (see :doc:`start-up-an-ec2-instance`). NOTE: It is important -to make sure the base image you choose is a community ami with public -permissions if want your ami to be available to the public. - -Log in to the machine via the command line using:: - - ssh -i ubuntu@ - -Now you can start to add custom packages. - -Customizing the instance ------------------------- - -Since our basic instance is a long-term release of Ubuntu, first we need to sure -everything is up to date:: - - sudo –i - apt–get –y update - apt–get –y dist–upgrade - -Next we need to make sure we have installed all packages commonly used in our -programs:: - - apt-get -y install mercurial less python-matplotlib unzip bzip2 \ - zlib1g-dev ncurses-dev python-dev g++ screen pkg-config python-pip \ - curl git sysstat libgtextutils-dev make openjdk-7-jre-headless - -The next few steps go through the process of installing software our lab uses -regularly. - -Installing Software -------------------- - -Install `ipython `__:: - - cd /tmp - git clone https://github.com/ipython/ipython.git - cd ipython - python setup.py install - -Next the software needs to be configured:: - - mkdir ~/.ipython - cd ~/.ipython - -Then, build a custom certificate for use with https:: - - openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout ~/.ipython/mycert.pem -out ~/.ipython/mycert.pem - -Finally, ipython needs to be configured:: - - ipython profile create - cd profile_default/ - -Now you need to modify the configuration file so ipython can be accessed from -the end user's local machine:: - - c.NotebookApp.certfile = u'/root/.ipython/mycert.pem' - c.NotebookApp.ip = '*' - c.NotebookApp.open_browser = False - c.NotebookApp.password = u'sha1:6b543b7dcc9f:33d3890c64538eab5cad799d30605af314b6cb89' - c.NotebookApp.port = 443 - -This basically sets things up to run under HTTPS on a public port, without -running a local browser on your EC2 instance, using the password ‘ngs’. See the -notebook docs for more information. - -Enabling root access --------------------- - -Since most of the lab's scripts require root access, it is a good idea to -permanently enable this feature:: - - perl -i -pe 's/disable_root: 1/disable_root: 0/' /etc/cloud/cloud.cfg - perl -i -pe 's/#PermitRootLogin .*/PermitRootLogin without-password/' /etc/ssh/sshd_config - perl -i -pe 's/.*(ssh-rsa .*)/\1/' /root/.ssh/authorized_keys - -Next we want to make sure the image boots correctly. To do this simply type:: - - reboot - -Wait a minute or so to allow the machine to reboot. Then login using root:: - - ssh -i root@ - -Saving the new AMI ------------------- - -Now we know that the images working we can create a new AMI. in the next steps -we will use the Amazon EC to console management page. 
- -On the navigation pane click on "Instances" and this will take you to a list of -all your running machines. Right-click on the image you would like to make an -AMI. From the drop-down menu choose "Create Image". - -.. image:: images/ec2-create-img1.png - :width: 50 - -This will open a pop-up window, where we will need to fill in the image name -and description. If you are making the image on behalf of GED, the naming -pattern is /dragon/. For example, ami-7f340c16 is named -khmer-protocols/dragon/2013.01.30. - -.. image:: images/ec2-create-img2.png - :width: 50 - -Now press "create image". - -Generally it will take around five minutes for Amazon to process this request. -When the new AMI has been created, you will be able to find it by using your -navigation pane in selecting AMIs. - -For more information visit `documentation on Amazon -`_. diff --git a/amazon/images/amazon-1.png b/amazon/images/amazon-1.png new file mode 100644 index 0000000..8a8f905 Binary files /dev/null and b/amazon/images/amazon-1.png differ diff --git a/amazon/images/amazon-2.png b/amazon/images/amazon-2.png new file mode 100644 index 0000000..e820692 Binary files /dev/null and b/amazon/images/amazon-2.png differ diff --git a/amazon/images/amazon-3.png b/amazon/images/amazon-3.png new file mode 100644 index 0000000..abb57bd Binary files /dev/null and b/amazon/images/amazon-3.png differ diff --git a/amazon/images/amazon-4.png b/amazon/images/amazon-4.png new file mode 100644 index 0000000..61948a3 Binary files /dev/null and b/amazon/images/amazon-4.png differ diff --git a/amazon/images/amazon-5.png b/amazon/images/amazon-5.png new file mode 100644 index 0000000..da52ebe Binary files /dev/null and b/amazon/images/amazon-5.png differ diff --git a/amazon/images/amazon-6.png b/amazon/images/amazon-6.png new file mode 100644 index 0000000..e333b3e Binary files /dev/null and b/amazon/images/amazon-6.png differ diff --git a/amazon/images/amazon-7.png b/amazon/images/amazon-7.png new file mode 100644 index 0000000..9ef391a Binary files /dev/null and b/amazon/images/amazon-7.png differ diff --git a/amazon/images/amazon-8.png b/amazon/images/amazon-8.png new file mode 100644 index 0000000..746f281 Binary files /dev/null and b/amazon/images/amazon-8.png differ diff --git a/amazon/images/amazon-9.png b/amazon/images/amazon-9.png new file mode 100644 index 0000000..c369d86 Binary files /dev/null and b/amazon/images/amazon-9.png differ diff --git a/amazon/images/amazon-diff-ami.png b/amazon/images/amazon-diff-ami.png new file mode 100644 index 0000000..592b734 Binary files /dev/null and b/amazon/images/amazon-diff-ami.png differ diff --git a/amazon/images/ec2-1.png b/amazon/images/ec2-1.png deleted file mode 100644 index 7b01ff4..0000000 Binary files a/amazon/images/ec2-1.png and /dev/null differ diff --git a/amazon/images/ec2-2.png b/amazon/images/ec2-2.png deleted file mode 100644 index b24ab18..0000000 Binary files a/amazon/images/ec2-2.png and /dev/null differ diff --git a/amazon/images/ec2-3.png b/amazon/images/ec2-3.png deleted file mode 100644 index 015cc44..0000000 Binary files a/amazon/images/ec2-3.png and /dev/null differ diff --git a/amazon/images/ec2-4.png b/amazon/images/ec2-4.png deleted file mode 100644 index 5a4b864..0000000 Binary files a/amazon/images/ec2-4.png and /dev/null differ diff --git a/amazon/images/ec2-5.png b/amazon/images/ec2-5.png deleted file mode 100644 index fe7f953..0000000 Binary files a/amazon/images/ec2-5.png and /dev/null differ diff --git a/amazon/images/ec2-6.png b/amazon/images/ec2-6.png 
deleted file mode 100644 index a7d0533..0000000 Binary files a/amazon/images/ec2-6.png and /dev/null differ diff --git a/amazon/images/ec2-7.png b/amazon/images/ec2-7.png deleted file mode 100644 index b6ff797..0000000 Binary files a/amazon/images/ec2-7.png and /dev/null differ diff --git a/amazon/images/ec2-8.png b/amazon/images/ec2-8.png deleted file mode 100644 index d9cffc9..0000000 Binary files a/amazon/images/ec2-8.png and /dev/null differ diff --git a/amazon/images/ec2-9.png b/amazon/images/ec2-9.png deleted file mode 100644 index 09da2b7..0000000 Binary files a/amazon/images/ec2-9.png and /dev/null differ diff --git a/amazon/images/ec2-create-img1.png b/amazon/images/ec2-create-img1.png deleted file mode 100644 index 597e7bf..0000000 Binary files a/amazon/images/ec2-create-img1.png and /dev/null differ diff --git a/amazon/images/ec2-create-img2.png b/amazon/images/ec2-create-img2.png deleted file mode 100644 index eb661a5..0000000 Binary files a/amazon/images/ec2-create-img2.png and /dev/null differ diff --git a/amazon/images/ec2-dashboard-instance-name.png b/amazon/images/ec2-dashboard-instance-name.png deleted file mode 100644 index f59610c..0000000 Binary files a/amazon/images/ec2-dashboard-instance-name.png and /dev/null differ diff --git a/amazon/images/ec2-security.png b/amazon/images/ec2-security.png deleted file mode 100644 index 42945dd..0000000 Binary files a/amazon/images/ec2-security.png and /dev/null differ diff --git a/amazon/images/terminate-1.png b/amazon/images/terminate-1.png new file mode 100644 index 0000000..b45314b Binary files /dev/null and b/amazon/images/terminate-1.png differ diff --git a/amazon/images/terminate-2.png b/amazon/images/terminate-2.png new file mode 100644 index 0000000..a7daaeb Binary files /dev/null and b/amazon/images/terminate-2.png differ diff --git a/amazon/images/win-putty-2.png b/amazon/images/win-putty-2.png index e8a3b8d..b1e6c41 100644 Binary files a/amazon/images/win-putty-2.png and b/amazon/images/win-putty-2.png differ diff --git a/amazon/images/win-putty-3.png b/amazon/images/win-putty-3.png index a34682b..eefbc21 100644 Binary files a/amazon/images/win-putty-3.png and b/amazon/images/win-putty-3.png differ diff --git a/amazon/images/win-putty-4.png b/amazon/images/win-putty-4.png index 64bacc2..7913bc3 100644 Binary files a/amazon/images/win-putty-4.png and b/amazon/images/win-putty-4.png differ diff --git a/amazon/images/win-puttygen-2.png b/amazon/images/win-puttygen-2.png index a279a25..98a495b 100644 Binary files a/amazon/images/win-puttygen-2.png and b/amazon/images/win-puttygen-2.png differ diff --git a/amazon/index.txt b/amazon/index.txt index bdfe21b..8ce524e 100644 --- a/amazon/index.txt +++ b/amazon/index.txt @@ -1,26 +1,15 @@ -=============================== -Getting started with Amazon EC2 -=============================== +Amazon Web Services instructions +================================ .. 
toctree:: - :maxdepth: 2 + :maxdepth: 5 start-up-an-ec2-instance - log-in-with-ssh-win log-in-with-ssh-mac - + log-in-with-ssh-win installing-dropbox + terminating-your-instance.txt -A final checklist: - -- EC2 instance is running; -- used AMI ami-7f340c16; -- NOT micro instance (m1.large, or bigger); -- SSH, HTTP, HTTPS are enabled on the security group; - -Amazon Web Services reference material --------------------------------------- - -`Instance types `__ - -`Instance costs `__ + starting-up-a-custom-ami + technical-guide + using-screen diff --git a/amazon/installing-dropbox.txt b/amazon/installing-dropbox.txt index 5719f8a..f7f85bc 100644 --- a/amazon/installing-dropbox.txt +++ b/amazon/installing-dropbox.txt @@ -21,11 +21,13 @@ Make the Dropbox directory on /mnt and link it in:: mkdir /mnt/Dropbox ln -fs /mnt/Dropbox /root - ln -fs /mnt/Dropbox /home/ubuntu -and then run it:: +and then run it, configuring it to put stuff in /mnt:: - /root/.dropbox-dist/dropboxd & + HOME=/mnt /root/.dropbox-dist/dropboxd & When you get a message saying "this client is not linked to any account", copy/paste the URL into browser and go log in. Voila! + +Your directory '/root/Dropbox', or, equivalently, '/mnt/Dropbox', is now +linked to your Dropbox account! diff --git a/amazon/log-in-with-ssh-mac.txt b/amazon/log-in-with-ssh-mac.txt index 519adbb..a155363 100644 --- a/amazon/log-in-with-ssh-mac.txt +++ b/amazon/log-in-with-ssh-mac.txt @@ -24,24 +24,32 @@ to set the permissions on the private key file to "closed to all evildoers". Then type:: - ssh -i ~/Desktop/amazon.pem root@ec2-???-???-???-???.compute-1.amazonaws.com + ssh -i ~/Desktop/amazon.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com +(but you have to replace the stuff after the '@' sign with the name of the host). -Here, you're logging in as user 'root' to the machine +Here, you're logging in as user 'ubuntu' to the machine 'ec2-174-129-122-189.compute-1.amazonaws.com' using the authentication key located in 'amazon.pem' on your Desktop. +You should now see a text line that starts with something like +``ubuntu@ip-10-235-34-223:~$``. You're in! Now type:: -Note, you have to replace the stuff after the '@' sign with the name -of the host; see the red circle in: + sudo bash + cd /root -.. image:: images/ec2-dashboard-instance-name.png - :width: 50% +to switch into superuser mode (see: http://xkcd.com/149/) and go to your +home directory. ----- +This is where the rest of the tutorials will start! -At the end you should see text and a prompt that look like this: +If you have Dropbox, you should now visit :doc:`installing-dropbox`. +You might also want to read about :doc:`terminating-your-instance`. -.. image:: images/win-putty-4.png - :width: 50% +To log out, type:: + + exit + logout + +or just close the window. diff --git a/amazon/log-in-with-ssh-win.txt b/amazon/log-in-with-ssh-win.txt index f6d9656..c2281a7 100644 --- a/amazon/log-in-with-ssh-win.txt +++ b/amazon/log-in-with-ssh-win.txt @@ -7,7 +7,7 @@ Download Putty and Puttygen from here: http://www.chiark.greenend.org.uk/~sgtath Generate a ppk file from your pem file ====================================== -(You only need to do this once for each key!) +(You only need to do this once!) Open puttygen; select "Load". @@ -30,9 +30,6 @@ Now, "save private key". Put it somewhere easy to find. .. image:: images/win-puttygen-4.png :width: 50% -Now that you've generated your PPK file from your PEM file, you can log -in. To do that... 
- 
 Logging into your EC2 instance with Putty
 =========================================
 
@@ -47,7 +44,7 @@ by puttygen). Then select 'Open'.
 
 .. image:: images/win-putty-2.png
    :width: 50%
 
-Log in as "root".
+Log in as "ubuntu".
 
 .. image:: images/win-putty-3.png
    :width: 50%
 
@@ -57,3 +54,28 @@ Declare victory!
 
 .. image:: images/win-putty-4.png
    :width: 50%
 
+Here, you're logging in as user 'ubuntu' to the machine
+'ec2-174-129-122-189.compute-1.amazonaws.com' using the authentication
+key located in 'amazon.pem' on your Desktop.
+
+You should now see a text line that starts with something like
+``ubuntu@ip-10-235-34-223:~$``. You're in! Now type::
+
+    sudo bash
+    cd /root
+
+to switch into superuser mode (see: http://xkcd.com/149/) and go to your
+home directory.
+
+This is where the rest of the tutorials will start!
+
+If you have Dropbox, you should now visit :doc:`installing-dropbox`.
+
+You might also want to read about :doc:`terminating-your-instance`.
+
+To log out, type::
+
+    exit
+    logout
+
+or just close the window.
diff --git a/amazon/security-rules.txt b/amazon/security-rules.txt
deleted file mode 100644
index 68d9090..0000000
--- a/amazon/security-rules.txt
+++ /dev/null
@@ -1,15 +0,0 @@
-========================
-Adjusting security rules
-========================
-
-Before continuing, you'll need to adjust your security rules so that you
-can access your new instance properly. To do that, go over to Security
-Groups on the dashboard, select 'default', and then adjust your security
-rules to enable ports 22, 80, and 443 (SSH, HTTP, and HTTPS).
-
-.. image:: images/ec2-security.png
-   :width: 90%
-
-Make sure you "Apply rule changes" afterwards.
-
-Then, go to :doc:`log-in-with-ssh-win` or :doc:`log-in-with-ssh-mac`
diff --git a/amazon/start-up-an-ec2-instance.txt b/amazon/start-up-an-ec2-instance.txt
index 3c1fff6..aba08a6 100644
--- a/amazon/start-up-an-ec2-instance.txt
+++ b/amazon/start-up-an-ec2-instance.txt
@@ -1,8 +1,10 @@
 Start up an EC2 instance
 ========================
 
-Log in
-~~~~~~
+Here, we're going to start up an Amazon Web Services (AWS) Elastic
+Compute Cloud (EC2) "instance", or computer.
+
+----
 
 Go to 'https://aws.amazon.com' in a Web browser.
 
@@ -10,75 +12,63 @@ Select 'My Account/Console' menu option 'AWS Management Console.'
 
 Log in with your username & password.
 
-Click on EC2 (upper left).
-
-.. image:: images/ec2-1.png
-   :width: 80%
-
-Select your zone
-~~~~~~~~~~~~~~~~
-
-Many of the resources that we use are hosted by Amazon on the East coast.
-Make sure that your dashboard has 'N. Virginia' on the upper right.
-
-Then click on Launch Instance.
-
-.. image:: images/ec2-2.png
-   :width: 80%
+Make sure it says N. Virginia in the upper right, then select EC2
+(upper left).
 
-Select the machine operating system to boot
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. image:: images/amazon-1.png
+   :width: 50%
 
-Choose 'Community AMIs', enter 'beacon' into the search box, and
-click on 'Select'.
+Select "Launch Instance" (midway down the page).
 
-.. image:: images/ec2-3.png
-   :width: 80%
+.. image:: images/amazon-2.png
+   :width: 50%
 
-Choose the machine size
-~~~~~~~~~~~~~~~~~~~~~~~
+Next, scroll down the list of operating system types until you find
+Ubuntu 14.04 LTS (PV) -- it should be at the very bottom. Click 'select'.
+(See :doc:`starting-up-a-custom-ami` if you want to start up a custom
+operating system instead of Ubuntu 14.04.)
 
-Select 'General purpose', 'm1.xlarge', and then 'Review and Launch'.
+.. 
image:: images/amazon-3.png
+   :width: 50%
 
-Enter 'ami-7f340c16' into the search box and click "search". Select
-it, and hit Continue.
+Scroll down the list of instance types until you find "m1.xlarge". Select
+the box to the left, and then click "Review and Launch."
 
-.. image:: images/ec2-4.png
-   :width: 80%
+.. image:: images/amazon-4.png
+   :width: 50%
 
-Confirm and launch
-~~~~~~~~~~~~~~~~~~
+Ignore the warning, check that it says "Ubuntu 14.04 LTS (PV)", and click
+"Launch".
 
-Review the details (ignore the warnings!) and click on Launch.
+.. image:: images/amazon-5.png
+   :width: 50%
 
-.. image:: images/ec2-5.png
-   :width: 80%
+The *first* time through, you will have to "create a new key pair", which
+you must then name (something like 'amazon') and download.
 
-(First time through) generate a new key pair
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+After this first time, you will be able to select an existing key pair.
 
-If you don't have any key pairs, enter a key pair name and
-then download a key pair. Then click Launch Instance.
+.. image:: images/amazon-6.png
+   :width: 50%
 
-.. image:: images/ec2-6.png
-   :width: 80%
+Select "Launch Instance."
 
-(Next times through) select an existing key pair
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. image:: images/amazon-7.png
+   :width: 50%
 
-Select a key pair and click 'Launch'.
+Select "View Instances" and you should see a "pending" line in the
+menu.
 
-.. image:: images/ec2-7.png
-   :width: 80%
+.. image:: images/amazon-8.png
+   :width: 50%
 
-Click on View Instances
-~~~~~~~~~~~~~~~~~~~~~~~
+Wait until it turns green, then make a note of the "Public DNS" (we
+suggest copying and pasting it into a text notepad somewhere). This
+is your machine name, which you will need for logging in.
 
-.. image:: images/ec2-8.png
-   :width: 80%
+.. image:: images/amazon-9.png
+   :width: 50%
 
-Select the public DNS name for later use
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Then, go to :doc:`log-in-with-ssh-win` or :doc:`log-in-with-ssh-mac`.
 
-.. image:: images/ec2-9.png
-   :width: 80%
+You might also want to read about :doc:`terminating-your-instance`.
diff --git a/amazon/starting-up-a-custom-ami.txt b/amazon/starting-up-a-custom-ami.txt
new file mode 100644
index 0000000..7447b05
--- /dev/null
+++ b/amazon/starting-up-a-custom-ami.txt
@@ -0,0 +1,21 @@
+Starting up a custom operating system
+=====================================
+
+The instructions in :doc:`start-up-an-ec2-instance` tell you how to
+start up a machine with Ubuntu Linux version 14.04 on it, but that
+machine comes with very little software installed. For anything
+where you are executing actual analyses, you're going to want to have
+a bunch of basic software installed.
+
+Therefore, we also make custom versions of Ubuntu available that come
+with some software pre-installed. (See :doc:`technical-guide`
+for a list of the packages installed on the ANGUS custom AMI.)
+
+To boot these, go to EC2/Launch and select "Community AMIs" instead of
+the default Quick Start; then type in the AMI number or name given to
+you in the tutorial. Below is a screenshot of an example for
+'ami-7606d01e'. Then proceed with the rest of
+:doc:`start-up-an-ec2-instance`.
+
+.. 
image:: images/amazon-diff-ami.png + :width: 50% diff --git a/amazon/technical-guide.txt b/amazon/technical-guide.txt new file mode 100644 index 0000000..43ea485 --- /dev/null +++ b/amazon/technical-guide.txt @@ -0,0 +1,12 @@ +Technical guide to the ANGUS course +=================================== + +Packages we install +------------------- + +Install:: + + apt-get update + apt-get -y install screen git curl gcc make g++ python-dev unzip \ + default-jre pkg-config libncurses5-dev r-base-core \ + r-cran-gplots python-matplotlib sysstat diff --git a/amazon/terminating-your-instance.txt b/amazon/terminating-your-instance.txt new file mode 100644 index 0000000..ae3f38f --- /dev/null +++ b/amazon/terminating-your-instance.txt @@ -0,0 +1,19 @@ +Terminating (shutting down) your EC2 instance +============================================= + +While your instance is running, Amazon will happily charge you on a per-hour +basis -- `check out the pricing `__ for more +information. In general, you will want to shut down your instance when +you're done with it; to do that, go to your EC2 console and find your +running instances (in green). + +.. image:: images/terminate-1.png + :width: 50% + +Then, select one or all of them, and go to the 'Actions...' menu, and +then select 'Terminate'. Agree. + +After a minute or two, the console should show the instance as "terminated". + +.. image:: images/terminate-2.png + :width: 50% diff --git a/conf.py b/conf.py index 5c57d90..05d8c7c 100644 --- a/conf.py +++ b/conf.py @@ -68,7 +68,7 @@ # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. -exclude_patterns = ['_build'] +exclude_patterns = ['_build','env'] # The reST default role (used for this markup: `text`) to use for all documents. #default_role = None diff --git a/metagenomics/5-mapping-and-quantitation.txt b/metagenomics/5-mapping-and-quantitation.txt index 8d6a9ac..813a21b 100644 --- a/metagenomics/5-mapping-and-quantitation.txt +++ b/metagenomics/5-mapping-and-quantitation.txt @@ -1,6 +1,6 @@ ===================================== 5. Mapping and abundance quantitation -====================================== +===================================== .. shell start diff --git a/misc/variant.txt b/misc/variant.txt index 5abe2e5..fcc66ac 100644 --- a/misc/variant.txt +++ b/misc/variant.txt @@ -1,6 +1,5 @@ Variant calling -############### - +================ We are using Podar dataset from ... @@ -21,7 +20,7 @@ First, we need to install the `BWA aligner cp bwa /usr/local/bin -.. We also need a new version of `samtools `__:: +We also need a new version of `samtools `__:: cd /root curl -O -L http://sourceforge.net/projects/samtools/files/samtools/0.1.19/samtools-0.1.19.tar.bz2 @@ -34,7 +33,7 @@ First, we need to install the `BWA aligner cp *.pl maq2sam-long maq2sam-short md5fa md5sum-lite wgsim /usr/local/bin/ Download data -~~~~~~~~~~~~~ +============= Download the reference genome and the resequencing reads:: diff --git a/mrnaseq/0-download-and-save.txt b/mrnaseq/0-download-and-save.txt index 519d8c2..f05410f 100644 --- a/mrnaseq/0-download-and-save.txt +++ b/mrnaseq/0-download-and-save.txt @@ -44,8 +44,8 @@ Downloading and saving your data to a volume -------------------------------------------- There are *many* different ways of getting big sequence files to and -from Amazon. The two that I mostly use are 'curl', which downloads -files from a Web site URL; and 'ncftp', which is a robust FTP client +from Amazon. 
The two that I mostly use are ``curl``, which downloads
+files from a Web site URL; and ``ncftp``, which is a robust FTP client
 that lets you get files from an FTP site. Sequencing centers almost
 always make their data available in one of these two ways.
 
@@ -61,20 +61,20 @@ like::
 
    cd /mnt
    ncftp -u ftp://path/to/FTP/site
 
-use 'cd' to find the right directory, and then::
+use ``cd`` to find the right directory, and then::
 
    >> mget *
 
 to download the files. Then type 'quit'.
 
-You can also use 'curl' to download files one at a time from Web or FTP sites.
+You can also use ``curl`` to download files one at a time from Web or FTP sites.
 For example, to save a file from a website, you could use::
 
    cd /mnt
    curl -O http://path/to/file/on/website
 
-Once you have the files, figure out their size using 'du -sk' (e.g. after the
-above, 'du -sk /mnt' will tell you how much data you have saved under /mnt),
+Once you have the files, figure out their size using ``du -sh`` (e.g. after the
+above, ``du -sh /mnt`` will tell you how much data you have saved under /mnt),
 and go create and attach a volume (see :doc:`../amazon/index`).
 
 Any files in the '/mnt' directory will be lost when the instance is stopped or
diff --git a/mrnaseq/1-quality.txt b/mrnaseq/1-quality.txt
index c638514..ef83c5f 100644
--- a/mrnaseq/1-quality.txt
+++ b/mrnaseq/1-quality.txt
@@ -10,15 +10,6 @@ will be enough to complete the assembly of the Nematostella data
 set. If you are using your own data, be aware of your space requirements
 and obtain an appropriately sized machine ("instance") and storage
 ("volume").
 
-
-On the new machine, run the following commands to update the base
-software and reboot the machine::
-
-    apt-get update
-    apt-get -y install screen git curl gcc make g++ python-dev unzip default-jre \
-        pkg-config libncurses5-dev r-base-core r-cran-gplots python-matplotlib\
-        sysstat && shutdown -r now
-
 .. note::
 
    The raw data for this tutorial is available as public snapshot
 
@@ -27,91 +18,80 @@ Install software
 ----------------
 
-.. clean up previous installs if we're re-running this...
+.. Rackspace instructions
+   15 GB I/O v1
+
+   mkfs -t ext4 /dev/xvde
+   mkdir /data
+   mount /dev/xvde /data
+   cd /data
+   chmod -R a+rw /data
+   mkdir nemo
+   cd nemo
+   curl -O http://athyra.idyll.org/~t/mrnaseq-subset.tar
+   tar xvf mrnaseq-subset.tar
 
+On the new machine, run the following commands to update the base
+software and install the packages we'll need::
+
+    apt-get update
+    apt-get -y install screen git curl gcc make g++ python-dev unzip default-jre \
+        pkg-config libncurses5-dev r-base-core r-cran-gplots \
+        python-matplotlib python-pip python-virtualenv sysstat fastqc \
+        trimmomatic fastx-toolkit bowtie samtools blast2
 
.. ::
 
    set -x
    set -e
-   echo Removing previous installs, if any.
-   rm -fr /usr/local/share/khmer
-   rm -fr /root/Trimmomatic-*
-   rm -f /root/libgtextutils-*.bz2
-   rm -f /root/fastx_toolkit-*.bz2
-
-.. 
::
 
    echo Clearing times.out
-   mv -f /root/times.out /root/times.out.bak
-   echo 1-quality INSTALL `date` >> /root/times.out
+   mv -f ${HOME}/times.out ${HOME}/times.out.bak
+   echo 1-quality INSTALL `date` >> ${HOME}/times.out
 
-Install `khmer `__ :
-::
-
-   cd /usr/local/share
-   git clone https://github.com/ged-lab/khmer.git
-   cd khmer
-   git checkout v1.1
-   make install
-
-Install `FastQC `__::
-
-   cd /usr/local/share
-   curl -O http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.10.1.zip
-   unzip fastqc_v0.10.1.zip
-   chmod +x FastQC/fastqc
-
-Install `Trimmomatic `__ :
-::
-
-   cd /root
-   curl -O http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.30.zip
-   unzip Trimmomatic-0.30.zip
-   cd Trimmomatic-0.30/
-   cp trimmomatic-0.30.jar /usr/local/bin
-   cp -r adapters /usr/local/share/adapters
+Now switch to the non-root user.
 
-Install `libgtextutils and fastx `__ :
-::
+Since we have four CPUs on this machine, we'll set a variable that we will use
+elsewhere so that our programs make use of all the CPUs::
 
+   THREADS=4
 
-   cd /root
-   curl -O http://hannonlab.cshl.edu/fastx_toolkit/libgtextutils-0.6.1.tar.bz2
-   tar xjf libgtextutils-0.6.1.tar.bz2
-   cd libgtextutils-0.6.1/
-   ./configure && make && make install
-
-   cd /root
-   curl -O http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit-0.0.13.2.tar.bz2
-   tar xjf fastx_toolkit-0.0.13.2.tar.bz2
-   cd fastx_toolkit-0.0.13.2/
-   ./configure && make && make install
-
-In each of these cases, we're downloading the software -- you can use
-google to figure out what each package is and does if we don't discuss
-it below. We're then unpacking it, sometimes compiling it (which we
-can discuss later), and then installing it for general use.
+Install `khmer `_. We need some files from the
+sandbox so we will download the full source repository instead of using
+Python's pip command::
 
+   cd ${HOME}
+   mkdir -p projects/eelpond
+   python2.7 -m virtualenv projects/eelpond/env
+   source ${HOME}/projects/eelpond/env/bin/activate
+   mkdir -p src
+   cd src
+   git clone --branch v1.3 https://github.com/ged-lab/khmer.git
+   cd khmer
+   make install
 
+The use of ``virtualenv`` allows us to install Python software without having
+root access. If you come back to this protocol in a different terminal session
+you will need to rerun ``source ${HOME}/projects/eelpond/env/bin/activate``
+again.
 
 Find your data
 --------------
 
 Either load in your own data (as in :doc:`0-download-and-save`) or
-create a volume from snapshot snap-f5a9dea7 and mount it as ``/data``
-(again, this is the data from `Tulin et al., 2013
-`__).
+create a volume from snapshot snap-f5a9dea7 and mount it as
+``${HOME}/data/nemo`` or another appropriate place. (Again, this is the data
+from `Tulin et al., 2013 `__.)
 
 Check::
 
-   ls /data
+   ls ${HOME}/data/nemo
 
 If you see all the files you think you should, good! Otherwise, debug.
 
 If you're using the Tulin et al. data provided in the snapshot above,
 you should see a bunch of files like::
 
-   /data/0Hour_ATCACG_L002_R1_001.fastq.gz
+   0Hour_ATCACG_L002_R1_001.fastq.gz
 
 Link your data into a working directory
 ---------------------------------------
 
@@ -121,17 +101,18 @@ Rather than *copying* the files into the working directory, let's just
 find them but doesn't need to actually move them around. :
 ::
 
-   cd /mnt
-   mkdir -p work
-   cd work
+   cd ${HOME}
+   mkdir -p projects/eelpond/raw
+   cd projects/eelpond/raw
 
-   ln -fs /data/*.fastq.gz .
+   ln -fs ${HOME}/data/nemo/*.fastq.gz .
 
-(The 'ln' command does the linking.)
+(The ``ln`` command does the linking.) 
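+You can double-check where a link points with ``ls -l`` or ``readlink``
+(both standard Linux commands). For example, with one of the Tulin et al.
+files::
+
+   ls -l 0Hour_ATCACG_L002_R1_001.fastq.gz
+   readlink -f 0Hour_ATCACG_L002_R1_001.fastq.gz   # prints the real location
+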
 
-Now, do an 'ls' to list the files. If you see only one entry, ``*.fastq.gz``,
+Now, do an ``ls`` to list the files. If you see only one entry, ``*.fastq.gz``,
 then the ln command above didn't work properly. One possibility is that
-your files aren't in /data; another is that they're not named *.fastq.gz.
+your files aren't in ``${HOME}/data/nemo``; another is that their names don't
+end with ``.fastq.gz``.
 
 .. note::
 
   If you are using the example data, you can work with a subset of it by
   running this command instead of the `ln -fs` command above::
 
-      for file in /data/*.fastq.gz
+      cd ${HOME}/projects/eelpond
+      mkdir -p extract
+      for file in raw/*.fastq.gz
       do
-         gunzip -c $file | head -400000 | gzip > $(basename $file)
+         base=${file##*/}
+         gunzip -c ${file} | head -400000 | gzip \
+             > extract/${base%%.fastq.gz}.extract.fastq.gz
       done
 
   This will pull out the first 100,000 reads of each file (4 lines per record)
-   and put them in the current directory, which should be /mnt/work.
+   and put them in the new ``${HOME}/projects/eelpond/extract`` directory.
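+
+   As a quick sanity check: FASTQ records are four lines each, so counting
+   the lines in one of the new files should print 400000::
+
+      gunzip -c extract/0Hour_ATCACG_L002_R1_001.extract.fastq.gz | wc -l
+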
- rm -r trim + # clear the temporary files + rm * # make it hard to delete the files you just created - chmod u-w *.pe.fq.gz *.se.fq.gz + cd ../trimmed + chmod u-w * To get a basic idea of what's going on, please read the '#' comments above, but, briefly, this set of commands: -* creates a temporary directory, 'trim/' +* creates a temporary directory, 'trimming_temp' * runs 'Trimmomatic' in that directory to trim off the adapters, and then puts remaining pairs (most of them!) in s1_pe and s2_pe, and any orphaned @@ -246,11 +235,10 @@ Automating things a bit OK, once you've done this once or twice, it gets kind of tedious, doesn't it? I've written a script to write these commands out automatically. Run it -like so : -:: +like so :: - cd /mnt/work - python /usr/local/share/khmer/sandbox/write-trimmomatic.py > trim.sh + cd ${HOME}/projects/eelpond/raw + python ${HOME}/src/khmer/sandbox/write-trimmomatic.py > trim.sh Run this, and then look at 'trim.sh' using the 'more' command --:: @@ -258,7 +246,7 @@ Run this, and then look at 'trim.sh' using the 'more' command --:: .. :: - echo 1-quality TRIM `date` >> /root/times.out + echo 1-quality TRIM `date` >> ${HOME}/times.out If it looks like it contains the right commands, you can run it by doing : :: @@ -269,7 +257,7 @@ If it looks like it contains the right commands, you can run it by doing : This is a prime example of scripting to make your life much easier and less error prone. Take a look at this file sometime -- - 'more /usr/local/share/khmer/sandbox/write-trimmomatic.py' -- to get + ``more ${HOME}/src/khmer/sandbox/write-trimmomatic.py`` -- to get some idea of how this works. @@ -284,13 +272,14 @@ sequences. Next, for each of these files, run :: - gunzip -c | fastq_quality_filter -Q33 -q 30 -p 50 | gzip -9c > .qc.fq.gz + gunzip -c | fastq_quality_filter -Q33 -q 30 -p 50 | gzip -9c \ + > .qc.fq.gz This uncompresses each file, removes poor-quality sequences, and then recompresses it. Note that (following `Short-read quality evaluation -`__) -you can also trim to a specific length by putting in a 'fastx_trimmer --Q33 -l 70 |' into the mix. +`_) +you can also trim to a specific length by putting in a ``fastx_trimmer +-Q33 -l 70 |`` into the mix. If fastq_quality_filter complains about invalid quality scores, try removing the -Q33 in the command; Illumina has blessed us with multiple @@ -301,29 +290,41 @@ Automating this step .. :: - echo 1-quality FILTER `date` >> /root/times.out + echo 1-quality FILTER `date` >> ${HOME}/times.out This step can be automated with a 'for' loop at the shell prompt. Try : :: - for file in *.pe.fq.gz *.se.fq.gz + cd ${HOME}/projects/eelpond/ + mkdir filtered + cd trimmed + for file in * do - echo working with $file - newfile="$(basename $file .fq.gz)" - gunzip -c $file | fastq_quality_filter -Q33 -q 30 -p 50 | gzip -9c \ - > "${newfile}.qc.fq.gz" + echo working with ${file} + newfile=${file%%.fq.gz}.qc.fq.gz + gunzip -c ${file} | fastq_quality_filter -Q33 -q 30 -p 50 | gzip -9c \ + > ../filtered/${newfile} done What this loop does is: -* for every file ending in pe.fq.gz and se.fq.gz, +* for every file in our ``trimmed`` directory * print out a message with the filename, -* construct a name 'newfile' that omits the trailing .fq.gz +* construct a name 'newfile' that omits the trailing ``.fq.gz`` and add + ``.qc.fq.gz`` * uncompresses the original file, passes it through fastq, recompresses it, - and saves it as 'newfile'.qc.fq.gz + and saves it to the ``filtered`` directory with the new filename. + +.. 
parallel-version:: + + ls * | parallel 'export file={}; \ + echo working with ${file}; + newfile=${file%%.fq.gz}.qc.fq.gz; + gunzip -c ${file} | fastq_quality_filter -Q33 -q 30 -p 50 | gzip -9c \ + > ../filtered/${newfile}' Extracting paired ends from the interleaved files ------------------------------------------------- @@ -342,19 +343,30 @@ respectively. .. :: - echo 1-quality EXTRACT `date` >> /root/times.out + echo 1-quality EXTRACT `date` >> ${HOME}/times.out To run it on all of the pe qc files, do : :: - - for file in *.pe.qc.fq.gz + + cd ${HOME}/projects/eelpond/ + mkdir filtered-pairs + cd filtered-pairs + for file in ../filtered/*.pe.qc.fq.gz do - extract-paired-reads.py $file + extract-paired-reads.py ${file} done +.. gnu-parallel-version: + + cd ${HOME}/projects/eelpond/ + mkdir filtered-pairs + cd filtered-pairs + ls ../filtered/*.pe.qc.fq.gz | parallel extract-paired-reads.py + + .. :: - echo 1-quality DONE `date` >> /root/times.out + echo 1-quality DONE `date` >> ${HOME}/times.out Finishing up ------------ @@ -362,40 +374,40 @@ Finishing up You should now have a whole mess of files. For example, in the Nematostella data, for *each* of the original input files, you'll have:: - 24HourB_GCCAAT_L002_R1_001.fastq.gz - the original data - 24HourB_GCCAAT_L002_R2_001.fastq.gz - 24HourB_GCCAAT_L002_R1_001.pe.fq.gz - adapter trimmed pe - 24HourB_GCCAAT_L002_R1_001.pe.qc.fq.gz - FASTX filtered - 24HourB_GCCAAT_L002_R1_001.pe.qc.fq.gz.pe - FASTX filtered PE - 24HourB_GCCAAT_L002_R1_001.pe.qc.fq.gz.se - FASTX filtered SE - 24HourB_GCCAAT_L002_R1_001.se.fq.gz - adapter trimmed orphans - 24HourB_GCCAAT_L002_R1_001.se.qc.fq.gz - FASTX filtered orphans + raw/24HourB_GCCAAT_L002_R1_001.fastq.gz - the original data + raw/24HourB_GCCAAT_L002_R2_001.fastq.gz + trimmed/24HourB_GCCAAT_L002_R1_001.pe.fq.gz - adapter trimmed pe + trimmed/24HourB_GCCAAT_L002_R1_001.se.fq.gz - adapter trimmed orphans + filtered/24HourB_GCCAAT_L002_R1_001.pe.qc.fq.gz - FASTX filtered + filtered/24HourB_GCCAAT_L002_R1_001.se.qc.fq.gz - FASTX filtered orphans + filtered-pairs/24HourB_GCCAAT_L002_R1_001.pe.qc.fq.gz.pe - FASTX filtered PE + filtered-pairs/24HourB_GCCAAT_L002_R1_001.pe.qc.fq.gz.se - FASTX filtered SE Yikes! What to do? Well, first, you can get rid of the original data. You already have it on a -disk somewhere, right? : -:: - - rm *.fastq.gz - -Next, you can get rid of the 'pe.fq.gz' and 'se.fq.gz' files, since you -only want the QC files. So : -:: +disk somewhere, right? :: + + rm raw/* + rmdir raw - rm *.pe.fq.gz *.se.fq.gz +Next, you can get rid of the trimmed files, since you only want the QC files. +So :: -And, finally, you can toss the pe.qc.fq.gz files, because you've turned *those* -into .pe and .se files. 
: 
::
 
-   rm *.pe.qc.fq.gz
+And, finally, you can toss the filtered files, because you've turned *those*
+into ``*.pe`` and ``*.se`` files::
+
+   rm filtered/*
+   rmdir filtered
 
 So now you should be left with only three files for each sample::
 
-   24HourB_GCCAAT_L002_R1_001.pe.qc.fq.gz.pe  - FASTX filtered PE
-   24HourB_GCCAAT_L002_R1_001.pe.qc.fq.gz.se  - FASTX filtered SE
-   24HourB_GCCAAT_L002_R1_001.se.qc.fq.gz     - FASTX filtered orphans
+   filtered-pairs/24HourB_GCCAAT_L002_R1_001.pe.qc.fq.gz.pe - FASTX filtered PE
+   filtered-pairs/24HourB_GCCAAT_L002_R1_001.pe.qc.fq.gz.se - FASTX filtered SE
+   filtered-pairs/24HourB_GCCAAT_L002_R1_001.se.qc.fq.gz - FASTX filtered orphans
 
 Things to think about
 ~~~~~~~~~~~~~~~~~~~~~
 
@@ -403,40 +415,37 @@ Things to think about
 
 Note that the filenames, while ugly, are conveniently structured with
 the history of what you've done. This is a good idea.
 
-Also note that we've conveniently named the files so that we can remove
-the unwanted ones en masse. This is a good idea, too.
+Also note that we've conveniently used a new directory for each step so that we
+can remove unwanted files easily. This is a good idea, too.
 
 Renaming files
 --------------
 
 I'm a fan of keeping the files named somewhat sensibly, and keeping them
-compressed. Let's do some mass renaming :
-::
-
-   for file in *.pe.qc.fq.gz.pe
+compressed. Rename and compress the paired-end files ::
+
+   for file in filtered-pairs/*.pe
    do
-      newfile="$(basename $file .pe.qc.fq.gz.pe).pe.qc.fq"
+      newfile=${file%%.pe.qc.fq.gz.pe}.pe.qc.fq
      mv $file $newfile
      gzip $newfile
   done
 
-and also some mass combining :
-::
+likewise with the single-end files ::
 
-   for $file in *.pe.qc.fq.gz.se
+   for file in filtered-pairs/*.se
   do
-      otherfile="$(basename $file .pe.qc.fq.gz.se).se.qc.fq.gz"
-      gunzip -c $otherfile > combine
-      cat $file >> combine
-      gzip -c combine > $otherfile
-      rm $file combine
+      otherfile=${file%%.pe.qc.fq.gz.se}.se.qc.fq.gz  # the orphans
+      gunzip -c ${otherfile} > combine
+      cat ${file} >> combine
+      gzip -c combine > ${otherfile}  # now all the single reads together
+      rm ${file} combine
   done
 
-and finally, make the end product files read-only :
-::
+and finally, make the end product files read-only ::
 
-   chmod u-w *.qc.fq.gz
+   chmod u-w filtered-pairs/*
 
 to make sure you don't accidentally delete something.
 
 OPTIONAL: Evaluate the quality of your files with FastQC again
 
 If you installed Dropbox, we can once again use FastQC to look at the
 quality of your newly-trimmed sequences::
 
-   mkdir /home/ubuntu/Dropbox/fastqc
-   /usr/local/share/FastQC/fastqc *001.extract.pe.qc.fq.gz --outdir=/home/ubuntu/Dropbox/fastqc
+   mkdir -p ${HOME}/Dropbox/fastqc
+   cd ${HOME}/projects/eelpond/filtered-pairs/
+   fastqc --threads ${THREADS:-1} *.pe.qc.fq.gz \
+       --outdir=${HOME}/Dropbox/fastqc
 
 Again, the output will be placed under the 'fastqc' directory in your
 Dropbox on your local computer; look for the fastqc_report.html files,
 
 ways: first, for assembly; and second, for mapping, to do quantitation and
 ultimately comparative expression analysis. You can save them by doing this::
 
-   mkdir save
-   mv *.qc.fq.gz save
-   du -sk save
-
-If you are running with a data subset for a workshop, do
-::
-
-   cp /mnt/work/*.qc.fq.gz /data
-
-to save the QC files for later use.
-
-.. @CTB fix ^^^
+   du -sh ${HOME}/projects/eelpond/filtered-pairs
 
 .. shell stop
 
-This puts the data you want to save into a subdirectory named 'save', and
-calculates the size.
-
-Now, create a volume of the given size -- divide by a thousand to get
-gigabytes, multiply by 1.1 to make sure you have enough room, and then
-follow the instructions in :doc:`../amazon/index`. Once
-you've mounted it properly (I would suggest mounting it on /save
-instead of /data!), then do ::
+This calculates the size of your data.
+
+Now, create a volume of the given size (multiply by 1.1 to make sure you have
+enough room), and then follow the instructions in :doc:`../amazon/index`. Once
+you've mounted it properly (I would suggest mounting it on ${HOME}/save
+instead of ${HOME}/data!), then do ::
 
-   rsync -av save /save
+   rsync -av filtered-pairs ${HOME}/save
 
-which will copy all of the files over from the ./save directory onto the
-'/save' disk. Then 'umount /save' and voila, you've got a copy of the files!
+which will copy all of the files over from the ``filtered-pairs`` directory
+onto the ``${HOME}/save`` disk. Then ``umount ${HOME}/save`` and voila, you've
+got a copy of the files!
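+
+One optional way to reassure yourself that everything made it over is rsync's
+dry-run mode; if it lists nothing left to transfer, the copy is complete::
+
+   rsync -avn filtered-pairs ${HOME}/save   # -n = dry run, changes nothing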
+This calculates the size of your data. -Now, create a volume of the given size -- divide by a thousand to get -gigabytes, multiply by 1.1 to make sure you have enough room, and then -follow the instructions in :doc:`../amazon/index`. Once -you've mounted it properly (I would suggest mounting it on /save -instead of /data!), then do :: +Now, create a volume of the given size (multiply by 1.1 to make sure you have +enough room, and then follow the instructions in :doc:`../amazon/index`. Once +you've mounted it properly (I would suggest mounting it on ${HOME}/save +instead of ${HOME}/data!), then do :: - rsync -av save /save + rsync -av filtered-pairs ${HOME}/save -which will copy all of the files over from the ./save directory onto the -'/save' disk. Then 'umount /save' and voila, you've got a copy of the files! +which will copy all of the files over from the ``filtered-pairs`` directory +onto the ``${HOME}/save`` disk. Then ``umount ${HOME}/save`` and voila, you've +got a copy of the files! Next stop: :doc:`2-diginorm`. diff --git a/mrnaseq/2-diginorm.txt b/mrnaseq/2-diginorm.txt index cfd4933..f19265e 100644 --- a/mrnaseq/2-diginorm.txt +++ b/mrnaseq/2-diginorm.txt @@ -21,69 +21,69 @@ Link in your data ----------------- -Make sure your data is in /mnt/work/. If you've loaded it onto /data, -you can do:: +Make sure your data is in ``${HOME}/projects/eelpond/filtered-pairs``:: - cd /mnt - mkdir -p work - cd /mnt/work - ln -fs /data/*.qc.fq.gz . + ls ${HOME}/projects/eelpond/filtered-pairs -Run digital normalization -------------------------- - -.. Fix the working directory location; clean up some stuff +If you've loaded it onto ``${HOME}/data/``, you can do:: -.. :: + mkdir -p ${HOME}/projects/eelpond/filtered-pairs + ln -fs ${HOME}/data/*.qc.fq.gz ${HOME}/projects/eelpond/filtered-pairs/ - cd /mnt/work - rm -f *.abundfilt.pe.gz - rm -f *.keep.abundfilt.fq.gz +Run digital normalization +------------------------- .. :: - echo 2-diginorm normalize1-pe `date` >> /root/times.out + echo 2-diginorm normalize1-pe `date` >> ${HOME}/times.out -Apply digital normalization to the paired-end reads: -:: +Apply digital normalization to the paired-end reads :: - normalize-by-median.py -p -k 20 -C 20 -N 4 -x 3e8 --savetable normC20k20.kh *.pe.qc.fq.gz + cd ${HOME}/projects/eelpond/ + mkdir diginorm + cd diginorm + normalize-by-median.py --paired -ksize 20 --cutoff 20 -n_tables 4 \ + --min-tablesize 3e8 --savetable normC20k20.ct \ + ../filtered-pairs/*.pe.qc.fq.gz .. :: - echo 2-diginorm normalize1-se `date` >> /root/times.out + echo 2-diginorm normalize1-se `date` >> ${HOME}/times.out -and then to the single-end reads: -:: +and then to the single-end reads:: - normalize-by-median.py -C 20 --loadtable normC20k20.kh --savetable normC20k20.kh *.se.qc.fq.gz + normalize-by-median.py --cutoff 20 --loadtable normC20k20.ct \ + --savetable normC20k20.ct ../filtered-pairs/*.se.qc.fq.gz -Note the '-p' in the first normalize-by-median command -- when run on +Note the ``--paired`` in the first normalize-by-median command -- when run on PE data, that ensures that no paired ends are orphaned. However, it will complain on single-ended data, so you have to give the data to it separately. -Also note the '-N' and '-x' parameters. These specify how much memory -diginorm should use. The product of these should be less than the -memory size of the machine you selected. The maximum needed for *any* -transcriptome should be in the ~60 GB range, e.g. -N 4 -x 15e9; for -only a few hundred million reads, 16 GB should be plenty. 
 
 Trim off likely erroneous k-mers
--------------------------------
+--------------------------------
 
 .. ::
 
-   echo 2-diginorm filter-abund `date` >> /root/times.out
+   echo 2-diginorm filter-abund `date` >> ${HOME}/times.out
 
 Now, run through all the reads and trim off low-abundance parts of
-high-coverage reads:
-::
+high-coverage reads::
 
-   filter-abund.py -V normC20k20.kh *.keep
+   cd ${HOME}/projects/eelpond
+   mkdir abundfilt
+   cd abundfilt
+   filter-abund.py --variable-coverage ../diginorm/normC20k20.ct \
+       --threads ${THREADS:-1} ../diginorm/*.keep
 
 This will turn some reads into orphans, but that's ok -- their partner
 read was bad.
 
 Rename files
 ------------
 
 You'll have a bunch of 'keep.abundfilt' files -- let's make things prettier.
 
 .. ::
 
-   echo 2-diginorm extract `date` >> /root/times.out
+   echo 2-diginorm extract `date` >> ${HOME}/times.out
 
-First, let's break out the orphaned and still-paired reads:
-::
-
-   for i in *.pe.*.abundfilt;
+First, let's break out the orphaned and still-paired reads::
+
+   cd ${HOME}/projects/eelpond
+   mkdir digiresult
+   cd digiresult
+   for file in ../abundfilt/*.pe.*.abundfilt
    do
-      extract-paired-reads.py $i
+      extract-paired-reads.py ${file}
   done
 
+..
+   # parallel version
+   cd ${HOME}/projects/eelpond
+   mkdir digiresult
+   cd digiresult
+   ls ../abundfilt/*.pe* | parallel extract-paired-reads.py
+
-We can combine the orphaned reads into a single file:
-::
+We can combine the orphaned reads into a single file::
 
-   for i in *.se.qc.fq.gz.keep.abundfilt
+   cd ${HOME}/projects/eelpond/abundfilt
+   for file in *.se.qc.fq.gz.keep.abundfilt
   do
-      pe_orphans=$(basename $i .se.qc.fq.gz.keep.abundfilt).pe.qc.fq.gz.keep.abundfilt.se
-      newfile=$(basename $i .se.qc.fq.gz.keep.abundfilt).se.qc.keep.abundfilt.fq.gz
-      cat $i $pe_orphans | gzip -c > $newfile
+      pe_orphans=${file%%.se.qc.fq.gz.keep.abundfilt}.pe.qc.fq.gz.keep.abundfilt.se
+      newfile=${file%%.se.qc.fq.gz.keep.abundfilt}.se.qc.keep.abundfilt.fq.gz
+      cat ${file} ../digiresult/${pe_orphans} | gzip -c > ../digiresult/${newfile}
+      rm ../digiresult/${pe_orphans}
   done
 
+..
+   # parallel version
+   cd ${HOME}/projects/eelpond/abundfilt
+   ls *.se.qc.fq.gz.keep.abundfilt | parallel \
+      'file={};
+       pe_orphans=${file%%.se.qc.fq.gz.keep.abundfilt}.pe.qc.fq.gz.keep.abundfilt.se;
+       newfile=${file%%.se.qc.fq.gz.keep.abundfilt}.se.qc.keep.abundfilt.fq.gz;
+       cat ${file} ../digiresult/${pe_orphans} \
+           | gzip -c > ../digiresult/${newfile};
+       rm ../digiresult/${pe_orphans}'
 
-We can also rename the remaining PE reads & compress those files:
-::
-
-   for i in *.abundfilt.pe
+We can also rename the remaining PE reads & compress those files::
+
+   cd ${HOME}/projects/eelpond/digiresult
+   for file in *.abundfilt.pe
   do
-      newfile=$(basename $i .fq.gz.keep.abundfilt.pe).keep.abundfilt.fq
-      mv $i $newfile
-      gzip $newfile
+      newfile=${file%%.fq.gz.keep.abundfilt.pe}.keep.abundfilt.fq
+      mv ${file} ${newfile}
+      gzip ${newfile}
   done
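+
+The ``${file%%pattern}`` idiom used above is standard bash "strip matching
+suffix" parameter expansion -- no external tools involved. For example::
+
+   file=6Hour_CGATGT_L002_R1_005.pe.qc.fq.gz.keep.abundfilt.pe
+   echo ${file%%.fq.gz.keep.abundfilt.pe}.keep.abundfilt.fq
+   # prints 6Hour_CGATGT_L002_R1_005.pe.qc.keep.abundfilt.fq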
 
..
   # parallel version
   cd ${HOME}/projects/eelpond/digiresult
   ls *.abundfilt.pe | parallel \
      'file={};
       newfile=${file%%.fq.gz.keep.abundfilt.pe}.keep.abundfilt.fq;
       mv ${file} ${newfile};
       gzip ${newfile}'
 
 This leaves you with a whole passel o' files, most of which you want
 to go away! ::
 
-   6Hour_CGATGT_L002_R1_005.pe.qc.fq.gz
-   6Hour_CGATGT_L002_R1_005.pe.qc.fq.gz.keep
-   6Hour_CGATGT_L002_R1_005.pe.qc.fq.gz.keep.abundfilt
-   6Hour_CGATGT_L002_R1_005.pe.qc.fq.gz.keep.abundfilt.se
-   6Hour_CGATGT_L002_R1_005.se.qc.fq.gz
-   6Hour_CGATGT_L002_R1_005.se.qc.fq.gz.keep
-   6Hour_CGATGT_L002_R1_005.se.qc.fq.gz.keep.abundfilt
+   filtered-pairs/6Hour_CGATGT_L002_R1_005.pe.qc.fq.gz
+   filtered-pairs/6Hour_CGATGT_L002_R1_005.se.qc.fq.gz
+   diginorm/6Hour_CGATGT_L002_R1_005.pe.qc.fq.gz.keep
+   diginorm/6Hour_CGATGT_L002_R1_005.se.qc.fq.gz.keep
+   diginorm/normC20k20.ct
+   abundfilt/6Hour_CGATGT_L002_R1_005.pe.qc.fq.gz.keep.abundfilt
+   abundfilt/6Hour_CGATGT_L002_R1_005.se.qc.fq.gz.keep.abundfilt
+   digiresult/6Hour_CGATGT_L002_R1_005.pe.qc.keep.abundfilt.fq.gz
+   digiresult/6Hour_CGATGT_L002_R1_005.se.qc.keep.abundfilt.fq.gz
 
 .. ::
 
-   echo 2-diginorm DONE `date` >> /root/times.out
+   echo 2-diginorm DONE `date` >> ${HOME}/times.out
 
-So, finally, let's get rid of a lot of the old files :
-::
+So, finally, let's get rid of a lot of the old files ::
 
-   rm *.se.qc.fq.gz.keep.abundfilt
-   rm *.pe.qc.fq.gz.keep.abundfilt.se
-   rm *.keep
-   rm *.abundfilt
-   rm *.kh
+   rm filtered-pairs/*
+   rm diginorm/*
+   rm abundfilt/*
+   rmdir filtered-pairs diginorm abundfilt
 
-.. rm *.qc.fq.gz @CTB should we do?
 
 Gut check
 ~~~~~~~~~
 
-You should now have the following files in your directory (after typing 'ls')::
+You should now have the following files in your directory (after typing
+``ls digiresult``)::
 
-   6Hour_CGATGT_L002_R1_005.pe.qc.keep.abundfilt.fq.gz
-   6Hour_CGATGT_L002_R1_005.se.qc.keep.abundfilt.fq.gz
+   digiresult/6Hour_CGATGT_L002_R1_005.pe.qc.keep.abundfilt.fq.gz
+   digiresult/6Hour_CGATGT_L002_R1_005.se.qc.keep.abundfilt.fq.gz
 
 These files are, respectively, the paired (pe) quality-filtered (qc)
 digitally normalized (keep) abundance-trimmed (abundfilt) FASTQ (fq)
diff --git a/mrnaseq/3-big-assembly.txt b/mrnaseq/3-big-assembly.txt
index 5d40139..c368e94 100644
--- a/mrnaseq/3-big-assembly.txt
+++ b/mrnaseq/3-big-assembly.txt
@@ -8,7 +8,7 @@
 All of the below should be run in screen, probably...
 
 You will want at least 15 GB of RAM, maybe more.
 
 (If you start up a new machine, you'll need to go to
-:doc:`1-quality` and install khmer and screed.)
+:doc:`1-quality` and go through the Install Software section.)
 
 .. note::
 
 Installing Trinity
 ------------------
 
.. 
::
 
    set -x
    set -e
-   rm -fr /root/trinity* /root/bowtie* /root/samtools*
-   rm -fr /mnt/work/trinity_out_dir
-   rm -f /mnt/work/*.1 /mnt/work/*.2
-   echo 3-big-assembly compileTrinity `date` >> /root/times.out
+   echo 3-big-assembly compileTrinity `date` >> ${HOME}/times.out
 
 To install Trinity:
 ::
 
-   cd /root
+   mkdir -p ${HOME}/src
+   cd ${HOME}/src
 
-   curl -L http://sourceforge.net/projects/trinityrnaseq/files/latest/download?source=files > trinity.tar.gz
+   wget https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.4.tar.gz \
+       -O trinity.tar.gz
    tar xzf trinity.tar.gz
    cd trinityrnaseq*/
-   export FORCE_UNSAFE_CONFIGURE=1
-   make 2>1& > trinity-build.log
-
-Install bowtie
---------------
-
-.. ::
-
-   echo 3-big-assembly installBowtie `date` >> /root/times.out
-
-Download and install bowtie:
-::
-
-   cd /root
-   curl -O -L http://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.7/bowtie-0.12.7-linux-x86_64.zip
-   unzip bowtie-0.12.7-linux-x86_64.zip
-   cd bowtie-0.12.7
-   cp bowtie bowtie-build bowtie-inspect /usr/local/bin
-
-Install samtools
-----------------
-
-.. ::
-
-   echo 3-big-assembly installSamtools `date` >> /root/times.out
-
-Download and install samtools:
-::
-
-   cd /root
-   curl -L http://sourceforge.net/projects/samtools/files/latest/download?source=files >samtools.tar.bz2
-   tar xjf samtools.tar.bz2
-   mv samtools-* samtools-latest
-   cd samtools-latest/
-   make 2>&1 > samtools-build.log
-   cp samtools misc/* /usr/local/bin
+   make |& tee trinity-build.log
 
 Build the files to assemble
 ---------------------------
 
 .. ::
 
-   echo 3-big-assembly extractReads `date` >> /root/times.out
+   echo 3-big-assembly extractReads `date` >> ${HOME}/times.out
 
 For paired-end data, Trinity expects two files, 'left' and 'right';
 there can be orphan sequences present, however. So, below, we split
 all of our interleaved pair files in two, and then add the single-ended
 seqs to one of 'em. :
 ::
 
-   cd /mnt/work
-   for i in *.pe.qc.keep.abundfilt.fq.gz
+   cd ${HOME}/projects/eelpond/digiresult
+   for file in *.pe.qc.keep.abundfilt.fq.gz
   do
-      split-paired-reads.py $i
+      split-paired-reads.py ${file}
   done
 
    cat *.1 > left.fq
   cat *.2 > right.fq
 
    gunzip -c *.se.qc.keep.abundfilt.fq.gz >> left.fq
 
+..
+   # parallel version
+   cd ${HOME}/projects/eelpond/digiresult
+   ls *.pe.qc.keep.abundfilt.fq.gz | parallel split-paired-reads.py
+   cat *.1 > left.fq; cat *.2 > right.fq
+   gunzip -c *.se.qc.keep.abundfilt.fq.gz >> left.fq
 
 Assembling with Trinity
 -----------------------
 
 .. ::
 
-   echo 3-big-assembly assemble `date` >> /root/times.out
+   echo 3-big-assembly assemble `date` >> ${HOME}/times.out
 
-Run the assembler! :
-::
+Run the assembler! ::
 
-   /root/trinityrnaseq*/Trinity --left left.fq --right right.fq --seqType fq -JM 10G
+   cd ${HOME}/projects/eelpond
+   ${HOME}/src/trinity*/Trinity --left digiresult/left.fq \
+       --right digiresult/right.fq --seqType fq --max_memory 14G \
+       --CPU ${THREADS:-2}
 
-Note that this last bit (10G) is the maximum amount of memory to use. You
-can increase (or decrease) it based on what machine you rented. This size
-works for the m1.xlarge machines.
+Note that the last two options (``--max_memory 14G --CPU ${THREADS:-2}``) set
+the maximum amount of memory and the number of CPUs to use. You can increase
+(or decrease) them based on what machine you rented. This size works for the
+m1.xlarge machines.
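+
+If you're not sure what the machine you rented actually has, you can check
+before picking these values (both are standard Linux commands)::
+
+   free -h    # total and available memory
+   nproc      # number of CPUs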

.. shell stop
diff --git a/mrnaseq/5-building-transcript-families.txt b/mrnaseq/5-building-transcript-families.txt
index 5b4dacf..e1d167b 100644
--- a/mrnaseq/5-building-transcript-families.txt
+++ b/mrnaseq/5-building-transcript-families.txt
@@ -12,33 +12,34 @@ m1.xlarge machine.

	set -x
	set -e
-	rm -fr /usr/local/share/eel-pond
-	rm -f /mnt/trinity-nematostella-raw.fa.gz
-	echo 5-building-transcript-families install `date` >> /root/times.out
+	echo 5-building-transcript-families install `date` >> ${HOME}/times.out

-You'll also need to install some eel-pond scripts:
-::
+You'll also need to set up a personal directory for program binaries::

-	cd /usr/local/share
-	git clone https://github.com/ctb/eel-pond.git
-	cd eel-pond
-	git checkout protocols-v0.8.3
+	mkdir -p ${HOME}/bin
+	export PATH=${PATH}:${HOME}/bin
+	echo 'export PATH=${PATH}:${HOME}/bin' >> ${HOME}/.bashrc
+
+Then install the renaming script used below::
+
+	cd ${HOME}/bin
+	wget https://raw.githubusercontent.com/ctb/eel-pond/protocols-v0.8.3/rename-with-partitions.py
+	chmod u+x rename-with-partitions.py

Copy in your data
=================

You need your assembled transcriptome (from
-e.g. :doc:`3-big-assembly`).  Put it in /mnt as
-'trinity-nematostella-raw.fa.gz':
-::
+e.g. :doc:`3-big-assembly`).  Put it in the project directory as
+'trinity-nematostella-raw.fa.gz'::

-	cd /mnt
-	gzip -c work/trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
+	cd ${HOME}/projects/eelpond
+	gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz

For the purposes of your first run through, I suggest just grabbing my
copy of the Nematostella assembly::

-	cd /mnt
+	cd ${HOME}/projects/eelpond/
	curl -O https://s3.amazonaws.com/public.ged.msu.edu/trinity-nematostella-raw.fa.gz

Run khmer partitioning
@@ -46,35 +47,37 @@ Run khmer partitioning
======================

.. ::

-	echo 5-building-transcript-families partition `date` >> /root/times.out
+	echo 5-building-transcript-families partition `date` >> ${HOME}/times.out

Partitioning runs a de Bruijn graph-based clustering algorithm that will
cluster your transcripts by transitive sequence overlap.  That is, it will
-group transcripts into transcript families based on shared sequence.
-::
+group transcripts into transcript families based on shared sequence. ::

-	do-partition.py -x 1e9 -N 4 --threads 4 nema trinity-nematostella-raw.fa.gz
+	cd ${HOME}/projects/eelpond
+	mkdir -p partitions
+	cd partitions
+	do-partition.py -x 1e9 -N 4 --threads ${THREADS:-1} nema \
+		../trinity-nematostella-raw.fa.gz

.. ::

-	echo 5-building-transcript-families rename `date` >> /root/times.out
+	echo 5-building-transcript-families rename `date` >> ${HOME}/times.out

This should take about 15 minutes, and outputs a file ending in '.part'
that contains the partition assignments.  Now, group and rename the
-sequences:
-::
-
-	python /usr/local/share/eel-pond/rename-with-partitions.py nema trinity-nematostella-raw.fa.gz.part
-	mv trinity-nematostella-raw.fa.gz.part.renamed.fasta.gz trinity-nematostella.renamed.fa.gz
+sequences::

-.. tail n.dist
-.. (warning)
+	cd ${HOME}/projects/eelpond/partitions
+	rename-with-partitions.py nema trinity-nematostella-raw.fa.gz.part
+	mv trinity-nematostella-raw.fa.gz.part.renamed.fasta.gz \
+		trinity-nematostella.renamed.fa.gz

Looking at the renamed sequences
================================

Let's look at the renamed sequences::

+	cd ${HOME}/projects/eelpond/partitions
	gunzip -c trinity-nematostella.renamed.fa.gz | head

You'll see that each sequence name looks like this::
@@ -83,24 +86,24 @@ You'll see that each sequence name looks like this::

Some explanation:

-* 'nema' is the prefix that you gave the rename script, above; modify
+* ``nema`` is the prefix that you gave the rename script, above; modify
  accordingly for your own organism.
  It's best to change it each time you do an assembly, just to keep
  things straight.

-* 'idN' is the unique ID for this sequence; it will never be repeated in this
+* ``idN`` is the unique ID for this sequence; it will never be repeated in this
  file.

-* 'trN' is the transcript family, which may contain one or more transcripts.
+* ``trN`` is the transcript family, which may contain one or more transcripts.

-* '1_of_1_in_tr16001' tells you that this transcript family has only
+* ``1_of_1_in_tr16001`` tells you that this transcript family has only
  one transcript in it (this one!)  Other transcript families may (will)
  have more.

-* 'len' is the sequence length.
+* ``len`` is the sequence length.

.. ::

-	echo 5-building-transcript-families DONE `date` >> /root/times.out
+	echo 5-building-transcript-families DONE `date` >> ${HOME}/times.out

.. shell stop
diff --git a/mrnaseq/6-annotating-transcript-families.txt b/mrnaseq/6-annotating-transcript-families.txt
index 8e0554f..3ddd8fe 100644
--- a/mrnaseq/6-annotating-transcript-families.txt
+++ b/mrnaseq/6-annotating-transcript-families.txt
@@ -6,27 +6,19 @@

.. ::

-	echo 6-annotating-transcript-families START `date` >> /root/times.out
+	echo 6-annotating-transcript-families START `date` >> ${HOME}/times.out

.. ::

	set -x
	set -e
-	rm -fr /root/blast-2.2.*
-	rm -f /mnt/mouse.protein.faa
-	rm -f /mnt/trinity-nematostella.renamed.fa
-	rm -f /mnt/nema.x.mouse
-	rm -f /mnt/mouse.x.nema
-	rm -f /mnt/nema.x.mouse.homol
-	rm -f /mnt/nema.x.mouse.ortho
-	rm -f /mnt/mouse.protein.faa_screed
-	echo 6-annotating-transcript-families start `date` >> /root/times.out
+	echo 6-annotating-transcript-families start `date` >> ${HOME}/times.out

You can start with the 'trinity-nematostella.renamed.fa.gz' file from
the previous page (:doc:`5-building-transcript-families`) _or_ download
a precomputed one::

-	cd /mnt
+	cd ${HOME}/projects/eelpond/partitions
	curl -O http://public.ged.msu.edu.s3.amazonaws.com/trinity-nematostella.renamed.fa.gz

.. note::

   The BLASTs below will take a *long* time, like 24-36 hours.  If you
   want to work with canned BLASTs, do::

-	cd /mnt
+	mkdir -p ${HOME}/projects/eelpond/annotation
+	cd ${HOME}/projects/eelpond/annotation
	curl -O http://public.ged.msu.edu.s3.amazonaws.com/nema.x.mouse.gz
	curl -O http://public.ged.msu.edu.s3.amazonaws.com/mouse.x.nema.gz
	gunzip nema.x.mouse.gz
@@ -43,43 +35,34 @@ a precomputed one::

However, if you built your own transcript families, you'll need to
rerun these BLASTs.

-Install BLAST
-=============
-
-Make sure you've updated BLAST, as in :doc:`installing-blastkit`:
-::
-
-	cd /root
-
-	curl -O ftp://ftp.ncbi.nih.gov/blast/executables/release/2.2.24/blast-2.2.24-x64-linux.tar.gz
-	tar xzf blast-2.2.24-x64-linux.tar.gz
-	cp blast-2.2.24/bin/* /usr/local/bin
-	cp -r blast-2.2.24/data /usr/local/blast-data
-
Doing a preliminary annotation against mouse
============================================

Now let's assign putative homology & orthology to these transcripts,
by doing BLASTs & reciprocal best hit analysis.
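+
+.. note::
+
+   The steps below also call several more eel-pond scripts
+   (``make-uni-best-hits.py``, ``make-reciprocal-best-hits.py``,
+   ``make-namedb.py``, and ``annotate-seqs.py``).  If you set up
+   ``${HOME}/bin`` as in :doc:`5-building-transcript-families`, you can
+   fetch them the same way as ``rename-with-partitions.py``; this is a
+   sketch, assuming all four scripts exist at that same tag::
+
+      cd ${HOME}/bin
+      for script in make-uni-best-hits.py make-reciprocal-best-hits.py \
+         make-namedb.py annotate-seqs.py
+      do
+         wget https://raw.githubusercontent.com/ctb/eel-pond/protocols-v0.8.3/${script}
+         chmod u+x ${script}
+      done
+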
First, uncompress your
-transcripts file:
-::
+transcripts file::

-	cd /mnt
-	gunzip trinity-nematostella.renamed.fa.gz
+	cd ${HOME}/projects/eelpond/
+	mkdir -p annotation
+	cd annotation
+	gunzip -c ../partitions/trinity-nematostella.renamed.fa.gz \
+		> trinity-nematostella.renamed.fa

-Now, grab the latest mouse RefSeq:
-::
+Now, grab the latest mouse RefSeq::

-	for file in mouse.1.protein.faa.gz mouse.2.protein.faa.gz mouse.3.protein.faa.gz
+	cd ${HOME}/projects/eelpond/annotation
+	for file in mouse.1.protein.faa.gz mouse.2.protein.faa.gz
	do
	curl -O ftp://ftp.ncbi.nih.gov/refseq/M_musculus/mRNA_Prot/${file}
	done
	gunzip mouse.[123].protein.faa.gz
	cat mouse.[123].protein.faa > mouse.protein.faa
-	echo 6-annotating-transcript-families blast `date` >> /root/times.out

-Format both as BLAST databases:
-::
+.. ::
+
+	echo 6-annotating-transcript-families blast `date` >> ${HOME}/times.out
+
+Format both as BLAST databases::

	formatdb -i mouse.protein.faa -o T -p T
	formatdb -i trinity-nematostella.renamed.fa -o T -p F
@@ -89,33 +72,35 @@ BLAST in both directions.  Note, this may take ~24 hours or longer; you
probably want to run it in screen:
::

-	blastall -i trinity-nematostella.renamed.fa -d mouse.protein.faa -e 1e-3 -p blastx -o nema.x.mouse -a 8 -v 4 -b 4
-	blastall -i mouse.protein.faa -d trinity-nematostella.renamed.fa -e 1e-3 -p tblastn -o mouse.x.nema -a 8 -v 4 -b 4
+	blastx -db mouse.protein.faa -query trinity-nematostella.renamed.fa \
+		-evalue 1e-3 -num_threads 8 -num_descriptions 4 -num_alignments 4 \
+		-out nema.x.mouse
+	tblastn -db trinity-nematostella.renamed.fa -query mouse.protein.faa \
+		-evalue 1e-3 -num_threads 8 -num_descriptions 4 -num_alignments 4 \
+		-out mouse.x.nema

Assigning names to sequences
============================

.. ::

-	echo 6-annotating-transcript-families homolortho `date` >> /root/times.out
+	echo 6-annotating-transcript-families homolortho `date` >> ${HOME}/times.out

Now, calculate putative homology (best BLAST hit) and orthology
-(reciprocal best hits):
-::
+(reciprocal best hits)::

-	python /usr/local/share/eel-pond/make-uni-best-hits.py nema.x.mouse nema.x.mouse.homol
-	python /usr/local/share/eel-pond/make-reciprocal-best-hits.py nema.x.mouse mouse.x.nema nema.x.mouse.ortho
+	make-uni-best-hits.py nema.x.mouse nema.x.mouse.homol
+	make-reciprocal-best-hits.py nema.x.mouse mouse.x.nema nema.x.mouse.ortho

-Prepare some of the mouse info:
-::
+Prepare some of the mouse info::

-	python /usr/local/share/eel-pond/make-namedb.py mouse.protein.faa mouse.namedb
+	make-namedb.py mouse.protein.faa mouse.namedb
	python -m screed.fadbm mouse.protein.faa

-And, finally, annotate the sequences:
-::
+And, finally, annotate the sequences::

-	python /usr/local/share/eel-pond/annotate-seqs.py trinity-nematostella.renamed.fa nema.x.mouse.ortho nema.x.mouse.homol
+	annotate-seqs.py trinity-nematostella.renamed.fa nema.x.mouse.ortho \
+		nema.x.mouse.homol

After this last, you should see::

@@ -134,8 +119,7 @@ will have sequences that look like this::

	>nematostella.id1.tr115222 h=43% => suppressor of tumorigenicity 7 protein isoform 2 [Mus musculus] 1_of_7_in_tr115222 len=1635 id=1 tr=115222 1_of_7_in_tr115222 len=1635 id=1 tr=115222

I suggest renaming this file to 'nematostella.fa' and using it for
-BLASTs (see :doc:`installing-blastkit`). :
-::
+BLASTs (see :doc:`installing-blastkit`). 
::

	cp trinity-nematostella.renamed.fa.annot nematostella.fa
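+
+For a rough sense of how many transcripts picked up a putative homolog,
+you can compare named headers to the total (a sketch; it assumes the
+``=>`` marker appears only on annotated sequences)::
+
+	grep -c '=>' nematostella.fa
+	grep -c '^>' nematostella.fa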
@@ -147,14 +131,14 @@ contains all of the same information as in the first but *also*
contains all of the actual DNA sequence in the last column.  (Some
spreadsheet programs may not be able to open it.)  You can do::

-	cp *.csv /root/Dropbox
+	cp *.csv ${HOME}/Dropbox

to copy them locally, if you have set up Dropbox (see:
:doc:`../amazon/installing-dropbox`).

.. ::

-	echo 6-annotating-transcript-families DONE `date` >> /root/times.out
+	echo 6-annotating-transcript-families DONE `date` >> ${HOME}/times.out

.. shell stop
diff --git a/mrnaseq/8-differential-expression.txt b/mrnaseq/8-differential-expression.txt
index 3005746..68bbda3 100644
--- a/mrnaseq/8-differential-expression.txt
+++ b/mrnaseq/8-differential-expression.txt
@@ -29,7 +29,7 @@
.. note::

   If all you have is a genes.results file, you can recover the
-   *.fq.genes.results files by doing the following::
+   \*.fq.genes.results files by doing the following::

      python /usr/local/share/eel-pond/unpack-genes-matrix.py genes.results

diff --git a/mrnaseq/acceptance-3-big-assembly.txt b/mrnaseq/acceptance-3-big-assembly.txt
index bf3bc55..5345bbf 100644
--- a/mrnaseq/acceptance-3-big-assembly.txt
+++ b/mrnaseq/acceptance-3-big-assembly.txt
@@ -31,4 +31,4 @@ Acceptance tests for khmer
	cat mouse.[123].protein.faa > mouse.protein.faa
	formatdb -i mouse.protein.faa -o T -p T
	cp ../work/trinity_out_dir/Trinity.fasta .
-	blastall -i Trinity.fasta -d mouse.protein.faa -p blastx -e 1e-6 -o trinity.x.mouse
+	blastx -db mouse.protein.faa -query Trinity.fasta -evalue 1e-6 -out trinity.x.mouse
diff --git a/mrnaseq/installing-blastkit.txt b/mrnaseq/installing-blastkit.txt
index e7330fe..aa4f0c9 100644
--- a/mrnaseq/installing-blastkit.txt
+++ b/mrnaseq/installing-blastkit.txt
@@ -1,7 +1,8 @@
===============================
4. BLASTing your assembled data
===============================
-.. shell:: start
+.. shell start
+
One thing everyone wants to do is BLAST sequence data, right? Here's a
simple way to set up a stylish little BLAST server that lets you
search your newly assembled sequences.
@@ -12,13 +13,10 @@ Installing blastkit

Installing some prerequisites:
::

-	pip install pygr
-	pip install whoosh
-	pip install Pillow
-	pip install Jinja2
-	pip install git+https://github.com/ctb/pygr-draw.git
-	pip install git+https://github.com/ged-lab/screed.git
-	apt-get -y install lighttpd
+	pip install pygr whoosh Pillow Jinja2 \
+		git+https://github.com/ctb/pygr-draw.git screed
+	apt-get -y install lighttpd blast2
+	ln -s /usr/bin/blastall /usr/local/bin/

and configure them:
::

@@ -30,17 +28,7 @@ and configure them:

	/etc/init.d/lighttpd restart

-Next, install BLAST:
-::
-
-	cd /root
-
-	curl -O ftp://ftp.ncbi.nih.gov/blast/executables/release/2.2.24/blast-2.2.24-x64-linux.tar.gz
-	tar xzf blast-2.2.24-x64-linux.tar.gz
-	cp blast-2.2.24/bin/* /usr/local/bin
-	cp -r blast-2.2.24/data /usr/local/blast-data
-
-And put in blastkit:
+Next, install blastkit:
::

	cd /root

@@ -67,6 +55,7 @@ If you've just finished a transcriptome assembly (:doc:`3-big-assembly`)
then you can do this to copy your newly generated assembly into the
right place:
::

+	cd ${HOME}/projects/eelpond
	cp trinity_out_dir/Trinity.fasta /root/blastkit/db/db.fa

Alternatively, you can grab my version of the assembly (from running this