title	author	contact	organization	date
Working with the SRA and the SRA Toolkit	Nathaniel Maki	nmaki@mdibl.org	MDIBL Computational Core	May 12th, 2020

add images to improve guide

What is SRA?

The Sequence Read Archive, or SRA, is an online archive for raw sequence data, generated from next generation sequencing tech such as Illumina, PacBio, and IonTorrent.

It is also the National Institute of Health's (NIH) primary repository for high throughput sequencing data.

What is the SRA Toolkit?

The SRA Toolkit is a suite of compiled binaries and source code for tools that download, manipulate, and validate next generation HTS data stored inthe SRA archive.

Learning Objectives

Become comfortable working with the SRA, navigating through the site, and acquiring SRA data
Explore the SRA Toolkits' various functions and utilities
Using an SRA accession file, download corresponding experimental fastq data from archive

Installing SRA Toolkit

Lets get this out of the way first:

Potentially just point to the NCBI Github page: https://github.com/ncbi/sra-tools/wiki/02.-Installing-SRA-Toolkit
Click this link view the Download page
Compiled binaries exist for macOS and RHEL/Ubuntu based Linux distributions
If running Windows (follow up with guide to install WSL Ubuntu 20.04)
- The above is only needed if you want to use the newest binaries (as of date, 2.10.5)
- Otherwise, just use Toolkit version 2.9.6-1
Unzip the tar.gz file
- if using macOS or Windows, either double click or right click and select extract
- for Linux use tar -xzf sratoolkit.version-ubuntu_linux64.tar.gz
If you are working from the command line, you can utilize the NCBI FTP site and download the Toolkit via wget https://ftp.ncbi.nlm.nih.gov/sra/sdk/2.10.5/sratoolkit.2.10.5-ubuntu64.tar.gz

Alternative: Using Brew

It took me some time to install sratools through NCBI documentation, so here is another, easier, method. Instead of going through the process manually, you can install it using Homebrew, an (imo fantastic) package manager for macOS and Linux

To install Homebrew, just past the following into your terminal: /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)" Follow up on WSL/Linux Installation, running into issues with perl
Once homebrew is installed, you need to add the tap brewsci/bio
- This contains tons of bioinformatics packages, with one included being sratoolkit
- Simply type brew tap brewsci/bio in the command line
- Once added, follow up with brew install sratoolkit
Installing sratoolkit in this fashion adds it to your PATH (which normally you have to do manually), and you can immediatly progress with the next component of this tutorial
In a nutshell, adding a program to PATH allows you to use it anywhere in the filesystem, not just in the location in which it's located.
- Instead of having to give a direct path to fastq-dump like this: ~/sra-toolkit/bin/fastq-dump.2.10.5 -X
- You can immedially use it like so: fastq-dump -X on the command line

Alternative Alternative: Use Docker

For macOS and Windows, install Docker Desktop
Run the following docker pull inutano/sra-toolkit
Linux is a bit more involved, but the documentation does a pretty goopd job of providing a walkthrough
- post doc here

Downloading data from SRA

Include image for below The Sequence Read Archive can be broken into four main levels, each with their own specific accession IDs

Study: DRP, ERP, SRP accessions
Sample: DRS, ERS, SRS accessions
Analysis: DRZ, ERZ, SRZ accessions
Run: DRR, ERR, SRR accessions

will introduce GEO before SRA

The smallest unit able to published in the SRA is an Analysis, but most of the neat stuff can be found in Studies.

We'll be using this one GEO summary page as an example.

To find the SRA link, just scroll down from the top of the page:

And look near the bottom, and under Relations select the SRA accession link

The accession link brings you to a page that holds all biological samples related to this analysis. Selecting, for example, this Link loads that specific run, and the files associated with it.

However, we want to view all files with every sample in an analysis. To do this:

Click Send to, located below the runs and across from the summary button
Choose Run Selector as your Destination
Click Go to send the entire recordset to Run Selector

The SRA Run Selector page should open upo in another tab, and lookc like this:

Under Common Fields you'll find a ton of information describing the samples from the study, including:

Assay Type (RNA-seq, CHiP-seq, etc)
Instrument (Sequencing platform utilized)
Library Layout (Single or Paired End sequencing)
Organism
as well as a plethora of other relevant information

Under Select you have the option to download either all Metadata or an Accession List, or just those that are selected from the Found X Items

The Metadata option downloads metadata for all runs as a table, or as a list of accession numbers
The Accession option downloads a text file with all of the SRA accession IDs from the study
For additional information about many of the fields on this page, click the Question mark icon at the top of the page

Downloading Experimental Data from SRA using fastq-dump

Contact

If you have questions about the information in this workshop document, please contact:

Nathaniel Maki
Bioinformatics Research Training Specialist
MDI Biological Laboratory
nmaki[at]mdibl.org

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dataset_downloads_sra.md

dataset_downloads_sra.md

What is SRA?

What is the SRA Toolkit?

Learning Objectives

Installing SRA Toolkit

Alternative: Using Brew

Alternative Alternative: Use Docker

Downloading data from SRA

Downloading Experimental Data from SRA using fastq-dump

Contact

Files

dataset_downloads_sra.md

Latest commit

History

dataset_downloads_sra.md

File metadata and controls

What is SRA?

What is the SRA Toolkit?

Learning Objectives

Installing SRA Toolkit

Alternative: Using Brew

Alternative Alternative: Use Docker

Downloading data from SRA

Downloading Experimental Data from SRA using fastq-dump

Contact