Steichen et al. - Data Analysis

This repository contains instructions for obtaining the Next-Generation Sequencing (NGS) raw reads, as well as the processed and formatted sequencing data, used in J.M. Steichen et al., Science 10.1126/aax4380 (2019).

In addition, this repository includes an example Zeppelin notebook with PySpark code used to execute searches on the sequencing data.

Raw Sequencing Reads

Paired reads for the NextSeq and HiSeq sequencing data obtained in this study are available for download from our publicly available Amazon Web Services (AWS) S3 data bucket found here. Paired reads from the HiSeq sequencing data from Briney et al., Nature 2019 that were used in Steichen et al. are also available in our S3 data bucket via the above link.
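For scripted downloads rather than the browser link, the public bucket can also be accessed anonymously with boto3. This is a sketch, assuming the steichenetalpublicdata bucket named later in this README; the prefix and object key shown are placeholders, not the actual raw-read layout:

# download_reads_sketch.py - a sketch for anonymous S3 access with boto3;
# the prefix and key below are placeholders, not the actual raw-read layout
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) client, since the bucket is publicly readable
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

bucket = "steichenetalpublicdata"  # bucket name taken from the CSV path below

# List the first objects in the bucket to discover the layout
resp = s3.list_objects_v2(Bucket=bucket, Prefix="")
for obj in resp.get("Contents", [])[:10]:
    print(obj["Key"], obj["Size"])

# Download a single object by key (this key is hypothetical)
# s3.download_file(bucket, "raw_reads/example_R1.fastq.gz", "example_R1.fastq.gz")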

Joined, Annotated, and Clustered Sequences

Beyond the raw reads, we are also making available the joined, annotated, and clustered sequences as a compressed CSV file. These are the sequences that were analyzed in the manuscript. You may use your own database framework to analyze the clustered sequences in the CSV file.

The compressed CSV file is available for download here.

Caution: the compressed CSV file is 100 GB, and the uncompressed file is over 700 GB.
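If you prefer to work outside Spark, the compressed CSV can be streamed in chunks rather than loaded whole. The sketch below assumes the file has been downloaded locally as AllDataMerged.csv.gz; the column names used come from the field tables that follow:

# read_csv_chunks.py - a minimal sketch for a non-Spark workflow;
# column names are taken from the field tables in this README
import pandas as pd

# pandas streams the gzip-compressed CSV without decompressing to disk first
chunks = pd.read_csv(
    "AllDataMerged.csv.gz",   # assumes the file was downloaded locally
    compression="gzip",
    usecols=["donor", "v_gene", "cdr3_aa"],  # restrict columns to save memory
    chunksize=1_000_000,      # rows per chunk; tune to available RAM
)

# Example: count IGHV1-69 sequences per donor without loading 700 GB at once
counts = {}
for chunk in chunks:
    hits = chunk[chunk["v_gene"] == "IGHV1-69"]
    for donor, n in hits["donor"].value_counts().items():
        counts[donor] = counts.get(donor, 0) + n
print(counts)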

In addition, we are providing all the sequences in the CSV file as a converted Parquet file, which is more convenient for those who wish to carry out analyses using Amazon EMR (see below), as we did in this study. The compressed Parquet file can be found here. Instructions for loading and querying Parquet files are found below.

The fields in the CSV file include the following annotations from AbStar as well as several clustering and other metadata fields.

AbStar Annotations

| Name | Description |
| --- | --- |
| _id | Hash ID of the sequence |
| chain | Heavy or light |
| cdr3_aa | CDR3 amino acid sequence, IMGT definition |
| cdr3_nt | CDR3 nucleotide sequence, IMGT definition |
| junct_nt | Junctional nucleotide sequence as defined by IMGT |
| isotype | Antibody isotype, e.g. IgM, IgG, etc. |
| vdj_aa | The V, D, and J gene segment amino acid sequences |
| vdj_nt | The V, D, and J gene segment nucleotide sequences |
| v_full | The full V gene segment name, e.g. IGHV1-69*01 |
| v_fam | The V gene segment family, e.g. IGHV1 |
| v_gene | The V gene segment, e.g. IGHV1-69 |
| d_full | The full D gene segment name, e.g. IGHD3-3*01 |
| d_fam | The D gene segment family, e.g. IGHD3 |
| d_gene | The D gene segment, e.g. IGHD3-3 |
| j_full | The full J gene segment name, e.g. IGHJ6*01 |
| j_gene | The J gene segment, e.g. IGHJ6 |

Clustering Metadata

| Name | Description |
| --- | --- |
| donor | The de-identified donor ID, e.g. donor_316188 |
| collection | The demultiplexed technical replicate collection specification |
| generation | The sequencing generation, see below |
| method | The sequencing platform used, e.g. HiSeq, see below |
| ez_donor | An integer-based designation of the donor |
| unique_donors | How many unique donors this sequence cluster was found in, see below |
| unique_mRNA | How many mRNA transcripts were found across replicates, see below |
| original_cursor | Legacy field, not needed |
| original_collection | Legacy field, not needed |

Method

The method designation is HiSeq, NextSeq, or clustered HiSeq (cHiSeq).

Generation

Generations correspond to the sequencing generation as obtained during the course of the study. Generation 1 comprised the first four donors, which were run through multiple independent HiSeq and NextSeq runs. Generation 2 comprised the next 8 donors, in which a unique molecular identifier (UMI) was used to control for PCR bias before HiSeq sequencing. Generation 3 comprised the last two donors, also using HiSeq and UMIs.

Unique mRNA

The unique_mRNA field corresponds to the number of times a sequence was found in the same donor but in multiple collections. Each collection corresponds to an individual PCR reaction from a different mRNA prep.
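To make the definition concrete, here is a rough PySpark sketch of how such a per-donor collection count could be computed. It is illustrative only: the grouping key (donor plus vdj_nt) is an assumption, not the study's actual clustering, and the Parquet path refers to the conversion example further below:

# unique_mrna_sketch.py - illustrative only, not the pipeline used in the study
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("UniqueMRNASketch").getOrCreate()

# Parquet copy of the dataset (see the CSV-to-Parquet example below)
df = spark.read.parquet("myoutputpath.parquet")

# For each sequence within a donor, count the distinct collections
# (i.e., independent PCR reactions) it was observed in. Grouping on
# (donor, vdj_nt) is an assumed stand-in for the study's clustering.
(df.groupBy("donor", "vdj_nt")
   .agg(F.countDistinct("collection").alias("n_collections"))
   .show(5))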

Querying Datasets with PySpark/Zeppelin

While we have provided the sequencing data as a formatted CSV file and as raw sequencing reads so that users can employ their own sequence analysis pipelines, we found that a managed Hadoop cluster running Spark was an effective database implementation that could handle these large datasets with relatively low overhead.

We recommend the Amazon Web Services Elastic MapReduce (EMR) service, which has the option to automatically set up a managed Spark cluster. EMR has the added convenience of allowing the user to designate the number of worker nodes; more nodes mean faster queries but higher cost.

Finally, while one can run any given Spark job by writing a custom application in Scala, we found it easiest to interact with Spark through a notebook implementation called Zeppelin (analogous to Jupyter notebooks for Python). Zeppelin automatically configures the user's Spark application and distributes queries across all available nodes.

Converting CSV to the Parquet File Format

The Hadoop ecosystem uses a column-based storage format called Parquet. PySpark can be used to read in CSV files and store them as Parquet. Optionally, you can read CSV files directly into your Spark DataFrame, but this may take more time.

# CSV2Parquet.py
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Start with a Spark context
sc = SparkContext(appName="CSV2Parquet")

# Wrap it in an SQL context to get DataFrame support
sql = SQLContext(sc)

# S3 bucket link to the merged CSV
csv_file = "s3://steichenetalpublicdata/analyzed_sequences/AllDataMerged.csv.gz"

# Read the CSV into a DataFrame; the first row holds the column names.
# Note: gzip files are not splittable, so this initial read is single-threaded.
df = (sql.read
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .load(csv_file))

# Write out as Parquet; the output path can also be an S3 bucket
p_path = "myoutputpath.parquet"
df.write.parquet(p_path, mode='overwrite')
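After conversion, the Parquet copy can be loaded back and queried far more efficiently than the CSV, since Parquet only scans the columns a query touches. A minimal sketch follows; the aggregation shown is illustrative, not a query from the paper:

# QueryParquet.py - a minimal sketch; assumes CSV2Parquet.py above has been run
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("QueryParquet").getOrCreate()

# Load the Parquet output written by CSV2Parquet.py
df = spark.read.parquet("myoutputpath.parquet")

# Example: number of sequences and unique donors per V gene family
(df.groupBy("v_fam")
   .agg(F.count("*").alias("n_sequences"),
        F.countDistinct("donor").alias("n_donors"))
   .show())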

Querying Donors with Zeppelin

In the Zeppelin notebook folder, we have provided an example notebook for querying our dataset. This includes an example showing how to load the Parquet data, as well as an example query using the BG18 germline definitions used in Steichen et al. Please note that in order to view the Zeppelin notebook, you will need to upload the Zeppelin JSON file to a Zeppelin notebook server. In addition, you must untar the parquet.tgz file and load the folder into your Zeppelin environment.
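For a feel of what such a query looks like without opening the notebook, the sketch below filters on germline assignment and CDR3 length in the same spirit. The v_gene value and the CDR3-length cutoff are placeholders, not the actual BG18 germline definitions (those are in the notebook):

# germline_query_sketch.py - illustrative placeholders only; the real BG18
# germline criteria are defined in the provided Zeppelin notebook
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("GermlineQuerySketch").getOrCreate()
df = spark.read.parquet("myoutputpath.parquet")

candidates = df.filter(
    (F.col("chain") == "heavy") &
    (F.col("v_gene") == "IGHV4-34") &     # placeholder V gene, not BG18's
    (F.length(F.col("cdr3_aa")) >= 20)    # placeholder CDR3-length cutoff
)

# How many donors contain candidate sequences?
candidates.agg(F.countDistinct("donor").alias("n_donors")).show()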
