SparkRMA: Robust Multi-array Average (RMA) In Apache Spark

michaeltneylon/spark_rma

SparkRMA: a scalable implementation of Robust Multi-array Average (RMA) in Apache Spark

Read the documentation for background information and a tutorial.

See the subdirectories for more detailed READMEs, including usage of each program.

Contents

  • spark_rma
  • helper

spark_rma

This directory contains the code for running RMA analysis. This includes steps for:

  • annotation and background correction
  • quantile normalization
  • median polish

Annotation and Background Correction

Annotation maps perfect match (PM) probes on the array to their targets. Background correction removes artifacts and preprocesses the raw CEL files for analysis. This step is not done in Spark: it is an embarrassingly parallel problem, handled independently for each sample, and it is completed in R.

Quantile Normalization

Quantile normalization removes array effects by normalizing each array against all others.
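
As a rough illustration of the idea (not the Spark implementation in this repository), the sketch below quantile-normalizes a few small in-memory arrays in plain Python. The function name `quantile_normalize` and the dict-of-lists input are assumptions made for this example, and ties are broken by sort order rather than averaged.

```python
from statistics import mean


def quantile_normalize(arrays):
    """Quantile-normalize equal-length arrays.

    arrays: dict mapping a sample name to a list of probe intensities.
    Returns a dict of the same shape with normalized values.
    (Hypothetical helper for illustration only.)
    """
    n = len(next(iter(arrays.values())))
    if any(len(values) != n for values in arrays.values()):
        raise ValueError("all arrays must contain the same number of probes")

    # Reference distribution: the mean of the k-th smallest value across arrays.
    sorted_arrays = [sorted(values) for values in arrays.values()]
    reference = [mean(col[k] for col in sorted_arrays) for k in range(n)]

    normalized = {}
    for sample, values in arrays.items():
        # Indices of this array's probes, ordered from lowest to highest value.
        order = sorted(range(n), key=values.__getitem__)
        out = [0.0] * n
        for rank, idx in enumerate(order):
            # Replace each value with the reference value at the same rank.
            out[idx] = reference[rank]
        normalized[sample] = out
    return normalized
```

After normalization every array holds the same set of values, so the arrays share one distribution while each probe keeps its rank within its own array.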

Median Polish

Tukey's median polish is used to summarize the values of multiple probes mapping within the same transcript cluster or probeset.
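
A minimal sketch of median polish on a single probeset, written in plain Python rather than Spark, may help make the summarization concrete. It assumes rows are probes and columns are samples (log2 intensities); the function name `median_polish`, the iteration limit, and the tolerance are illustrative choices, not values taken from this repository.

```python
from statistics import median


def median_polish(matrix, max_iter=10, tol=1e-6):
    """Summarize a probes-by-samples matrix into one value per sample.

    Alternately sweeps out row (probe) and column (sample) medians and
    returns overall effect + column effect for each sample.
    (Hypothetical helper for illustration only.)
    """
    resid = [list(row) for row in matrix]  # work on a copy
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols

    for _ in range(max_iter):
        # Row sweep: move each probe's median from the residuals to row_eff.
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            resid[i] = [value - row_med[i] for value in resid[i]]
            row_eff[i] += row_med[i]
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift

        # Column sweep: move each sample's median from the residuals to col_eff.
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for j in range(n_cols):
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
            col_eff[j] += col_med[j]
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift

        # Stop once a full sweep no longer changes the residuals.
        if max(abs(m) for m in row_med + col_med) < tol:
            break

    return [overall + c for c in col_eff]
```

For a purely additive matrix such as `[[10, 12], [11, 13], [12, 14]]` the procedure converges on the second pass, and each sample's summary is its own effect plus the median probe effect.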

Helper Scripts

Parquet Converter

convert_to_parquet.py

This script converts flat files (CSV and TSV are currently supported) to Parquet format with SNAPPY compression, without requiring a JVM.

HTA 2.0 Annotation

hta_annotation.R

Many of the examples use HTA 2.0. The annotation and background correction step requires an input file specific to the array type. This script generates that file for HTA 2.0 from Bioconductor.
