Scalable and reproducible metabolomics preprocessing workflow powered by Pachyderm

jonandernovella/MTBLS233-Pachyderm

 
 

MTBLS233-Pachyderm

On this page we introduce a metabolomics preprocessing workflow that you can run using Pachyderm, a distributed data-processing tool built on software containers that enables scalable and reproducible pipelines.

Introduction

The main goal of the study performed on MTBLS233 was to obtain quantitative information on the highest possible number of reliable features in untargeted metabolomics. To that end, several approaches to tuning the mass spectrometric acquisition parameters were tested to maximize the number of spectral features.

We aimed to recreate the workflow used in the MTBLS233 study in a distributed manner using Pachyderm. The workflow was originally implemented in OpenMS v. 1.1.1, followed by downstream analysis in KNIME. Here we run the preprocessing pipeline with Pachyderm, a tool built on top of Kubernetes that processes the data in a distributed fashion and keeps track of the input and output data of every stage of the pipeline (think “git for data”), so that it is possible to track the provenance of results and accurately reproduce scientific workflows.

Run the preprocessing workflow

Start by installing the Pachyderm client:

# For OSX:
brew tap pachyderm/tap && brew install pachyderm/tap/pachctl@1.3
# For Linux (64 bit):
curl -o /tmp/pachctl.deb -L https://github.com/pachyderm/pachyderm/releases/download/v1.3.17/pachctl_1.3.17_amd64.deb && sudo dpkg -i /tmp/pachctl.deb
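
Before moving on, it can help to confirm that the client actually landed on your PATH. The following is a minimal sketch; the helper name `require_cmd` is our own invention and is not part of Pachyderm.

```shell
# Hypothetical helper (not part of Pachyderm): fail early if a tool is missing.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found $1 at $(command -v "$1")"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Usage (run after installing the client):
#   require_cmd pachctl
```

The same helper can be used for `kubectl` and `wget`, which the later steps also rely on.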

Ingest the MTBLS233 dataset from MetaboLights

MetaboLights offers an FTP service, so we can ingest the MTBLS233 dataset in a terminal.

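  1. First open a terminal in your working directory and create a directory to store the data.
  2. Ingest the dataset using wget. The URL is quoted so that the shell passes the * wildcard on to wget instead of trying to expand it locally:
# Create a directory for the dataset and download the files into it
mkdir dataset
cd dataset
wget "ftp://ftp.ebi.ac.uk/pub/databases/metabolights/studies/public/MTBLS233/*alternate_pos_low_mr.mzML"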
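
Once the download finishes, a quick count of the mzML files makes it easier to spot an interrupted transfer. This is only a sketch: `count_mzml` is a hypothetical helper, and the expected number of files is not stated here, so compare the result against the MetaboLights listing for the study.

```shell
# Hypothetical helper: count the .mzML files directly inside a directory
# (defaults to the current directory).
count_mzml() {
  find "${1:-.}" -maxdepth 1 -type f -name '*.mzML' | wc -l | tr -d ' '
}

# Usage, from the directory that contains "dataset":
#   count_mzml dataset
```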

Deploy Pachyderm on Kubernetes

We assume that you have a Kubernetes cluster up and running. Here we describe a simple single-node deployment using Minikube. Deploy Pachyderm using:

pachctl deploy local

It may take a while for the pachd pods to start running because their container images have to be pulled from Docker Hub. Once they are up, set up port forwarding so that you can reach Pachyderm:

pachctl port-forward &
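
Because the deployment can take a while, you may prefer to poll until the cluster answers rather than guess when it is ready. The sketch below defines a generic `retry` helper (our own name, not a Pachyderm command). Using `pachctl list-repo` as the readiness probe is an assumption: it presumes the command exits with a non-zero status while pachd is unreachable.

```shell
# Hypothetical helper: run a command until it succeeds or the attempts run out.
#   $1 = maximum number of attempts, $2 = seconds to wait between attempts,
#   remaining arguments = the command to run.
retry() {
  retry_attempts=$1
  retry_delay=$2
  shift 2
  retry_i=0
  while [ "$retry_i" -lt "$retry_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    retry_i=$((retry_i + 1))
    sleep "$retry_delay"
  done
  return 1
}

# Usage (assumption: list-repo fails until pachd is reachable):
#   retry 30 10 pachctl list-repo && echo "Pachyderm is reachable"
```

You can also watch the pods directly with `kubectl get pods`.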

It is also possible to deploy Pachyderm in the cloud and to configure it to use custom object storage options. In our particular use case, we made a custom deployment with GlusterFS and Minio as the storage backend in an OpenStack environment. More information can be found here.

Add the MTBLS233 dataset to Pachyderm

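Create a repo called metabrepo and push the dataset into it. The repo name has to match the input repo named in the pipeline specifications. In the put-file command, -r uploads the directory recursively, -p 10 uploads up to ten files in parallel, and -c wraps the upload in a new commit on the master branch.

pachctl create-repo metabrepo
pachctl put-file metabrepo master -c -r -p 10 -f ./path/to/dataset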

Process the data

Now that the data is in the repository, it’s time to execute the pipeline. The pipeline consists of four stages, whose specifications can be found in the ./pipelines directory.

pachctl create-pipeline -f ./path/to/pipelines/PeakPickerHiRes.json
pachctl create-pipeline -f ./path/to/pipelines/FeatureFinderMetabo.json
pachctl create-pipeline -f ./path/to/pipelines/FeatureLinkerUnlabeledQT.json
pachctl create-pipeline -f ./path/to/pipelines/TextExporter.json
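
The four commands above can also be wrapped in a small function that checks for the specification files before registering anything. This is a sketch under assumptions: `create_stages` is a hypothetical helper, and it presumes that the stages should be created in the order listed above, so that each stage's input repository already exists when the next one is registered.

```shell
# Hypothetical helper: register the four stages in workflow order.
#   $1 = directory containing the pipeline specification files.
create_stages() {
  spec_dir=$1
  # Check every spec first so a missing file does not leave a half-built workflow.
  for stage in PeakPickerHiRes FeatureFinderMetabo FeatureLinkerUnlabeledQT TextExporter; do
    if [ ! -f "$spec_dir/$stage.json" ]; then
      echo "missing spec: $spec_dir/$stage.json" >&2
      return 1
    fi
  done
  # Create the pipelines one by one, stopping at the first failure.
  for stage in PeakPickerHiRes FeatureFinderMetabo FeatureLinkerUnlabeledQT TextExporter; do
    pachctl create-pipeline -f "$spec_dir/$stage.json" || return 1
  done
}

# Usage:
#   create_stages ./path/to/pipelines
```

While the workflow runs, `pachctl list-pipeline` and `pachctl list-job` show the registered pipelines and the state of their jobs.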

After the whole workflow has been successfully executed, the resulting CSV file generated by the TextExporter in OpenMS will be saved in the TextExporter repository. You can download the file with:

pachctl get-file TextExporter <commit-id> <path-to-file>

You can find the <commit-id> by listing the commits in the TextExporter repository and picking the most recent one:

pachctl list-commit TextExporter

Likewise, you can find the <path-to-file> by listing the files written to the TextExporter repository in a given commit:

pachctl list-file TextExporter <commit-id>
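
Once you know the commit ID and the file path, the download can be wrapped in a helper that saves the result locally and checks that it is not empty. This is a sketch: `fetch_result` and the default output name `result.csv` are hypothetical, and it assumes that `pachctl get-file` writes the file contents to standard output.

```shell
# Hypothetical helper: download one file from the TextExporter repository.
#   $1 = commit ID, $2 = path of the file in the repo,
#   $3 = local output file (defaults to result.csv).
fetch_result() {
  commit_id=$1
  src_path=$2
  out_file=${3:-result.csv}
  # Assumption: get-file prints the file contents to stdout.
  pachctl get-file TextExporter "$commit_id" "$src_path" > "$out_file" || return 1
  if [ ! -s "$out_file" ]; then
    echo "downloaded file is empty: $out_file" >&2
    return 1
  fi
  echo "saved $out_file"
}

# Usage:
#   fetch_result <commit-id> <path-to-file> features.csv
```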
