In this page we introduce a metabolomics preprocessing workflow that you can run using Pachyderm, a distributed data-processing tool built on software containers that enables scalable and reproducible pipelines.
The main goal of the study performed on MTBLS233 was to obtain quantitative information for the largest possible number of reliable features in untargeted metabolomics. To do so, diverse mass spectrometric acquisition parameter settings were tested in order to maximize the number of spectral features.
We aimed to recreate the workflow used in the MTBLS233 study in a distributed manner using Pachyderm. The workflow was originally implemented in OpenMS v. 1.1.1, followed by downstream analysis in KNIME. Here we run the preprocessing pipeline with Pachyderm, a tool built on top of Kubernetes that lets us process the data in a distributed fashion and keep track of the input/output data from every stage of the pipeline (think “git for data”), so that the provenance of results can be tracked and scientific workflows accurately reproduced.
Start by installing the Pachyderm client:
# For OSX:
brew tap pachyderm/tap && brew install pachyderm/tap/pachctl@1.3

# For Linux (64 bit):
curl -o /tmp/pachctl.deb -L https://github.com/pachyderm/pachyderm/releases/download/v1.3.17/pachctl_1.3.17_amd64.deb && sudo dpkg -i /tmp/pachctl.deb
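To check that the client is installed correctly, you can print its version (once a cluster is deployed, the same command also reports the pachd server version):

pachctl version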
MetaboLights offers an FTP service, so we can ingest the MTBLS233 dataset in a terminal.
- First, open a terminal in your working directory and create a directory to store the data.
- Ingest the dataset using wget:
# Download the MTBLS233 dataset
mkdir dataset
cd dataset
wget ftp://ftp.ebi.ac.uk/pub/databases/metabolights/studies/public/MTBLS233/*alternate_pos_low_mr.mzML
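Once the transfer finishes, a quick sanity check (a plain shell listing, not part of the original instructions) confirms that the mzML files are in place:

# List the downloaded files and count them
ls -lh *.mzML
ls *.mzML | wc -l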
Here we assume that you have a Kubernetes cluster up and running. In this case, we describe a simple single-node deployment using Minikube. Deploy Pachyderm using:
pachctl deploy local
It may take a while for the pachd pods to start running because they need to pull container images from Docker Hub. Also, set up port forwarding so that you can reach Pachyderm:
pachctl port-forward &
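To verify the deployment, you can check that the Pachyderm pods are running and that the client can reach the cluster (standard kubectl and pachctl commands, not part of the original instructions):

# Check that the pachd pod is in the Running state
kubectl get pods
# Confirm that pachctl can talk to the cluster
pachctl version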
It is also possible to deploy Pachyderm in the cloud and to configure it to use custom object storage options. In our particular use case, we made a custom deployment with GlusterFS and Minio as the storage backend in an OpenStack environment. More information can be found here.
Create a repo called metabrepo and push the dataset into it.
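The repository needs to exist before data can be pushed into it; it can be created with the standard pachctl command (the name metabrepo matches the put-file call below):

pachctl create-repo metabrepo
# Confirm the repo exists
pachctl list-repo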
pachctl put-file metabrepo master -c -r -p 10 -f ./path/to/dataset
Now that the data is in the repository, it is time to execute the pipeline. The pipeline is composed of four stages, whose specifications can be found in the ./pipelines directory.
pachctl create-pipeline -f ./path/to/pipelines/PeakPickerHiRes.json
pachctl create-pipeline -f ./path/to/pipelines/FeatureFinderMetabo.json
pachctl create-pipeline -f ./path/to/pipelines/FeatureLinkerUnlabeledQT.json
pachctl create-pipeline -f ./path/to/pipelines/TextExporter.json
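Once the pipelines are created, Pachyderm starts jobs as soon as input data is available; their progress can be followed with the standard listing commands:

# List the registered pipelines and their state
pachctl list-pipeline
# List the jobs spawned by the pipelines
pachctl list-job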
After the whole workflow has been successfully executed, the resulting CSV file generated by the TextExporter tool in OpenMS will be saved in the TextExporter repository. You can download the file using:
pachctl get-file TextExporter <commit-id> <path-to-file>
The <commit-id> is easily obtained by listing the most recent commits in the TextExporter repository using:
pachctl list-commit TextExporter
Similarly, the <path-to-file> can be obtained by listing the files output to the TextExporter repository in a given commit:
pachctl list-file TextExporter <commit-id>
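Putting the retrieval steps together, a typical session might look like the following (the commit ID and file path are placeholders; in pachctl 1.x, get-file writes to stdout, so the output is redirected to a local file of your choosing):

pachctl list-commit TextExporter
pachctl list-file TextExporter <commit-id>
# Download the exported feature table to a local CSV file
pachctl get-file TextExporter <commit-id> <path-to-file> > features.csv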