Default preprocessing pipeline for MSI data in raw ASCII format, as used by the Data Mining Group at the Silesian University of Technology.
The packaged pipeline consists of the following steps:

- Find the common m/z range
- Find a resampled m/z axis that will be common for all datasets
- Resample all datasets to the common m/z axis
- Remove the baseline using the adaptive window method (as proposed by Katarzyna Frątczak)
- Detect outliers in the data with respect to the TIC value
- Align spectra to the average spectrum with the PAFFT method
- Normalize spectra to a common TIC
- Build a Gaussian Mixture Model of the average spectrum
- Remove outlier components of the GMM
- Compute convolutions of the spectra and the GMM
- Merge multiple GMM components that resemble a single peak
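Two of these steps are easy to picture in code. Below is a minimal sketch of resampling onto a common m/z axis and TIC normalization, assuming each spectrum is a pair of numpy arrays; the function names and synthetic data are hypothetical illustrations, not the package's actual API:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two toy spectra on slightly different m/z axes, standing in for real data.
raw_spectra = [
    (np.linspace(700.0, 3496.7, 900), rng.poisson(2.0, 900).astype(float)),
    (np.linspace(700.1, 3496.6, 950), rng.poisson(2.0, 950).astype(float)),
]

def resample(mz, intensities, common_mz):
    """Interpolate a spectrum onto the shared m/z axis."""
    return np.interp(common_mz, mz, intensities)

def normalize_tic(intensities, target_tic):
    """Scale a spectrum so its total ion current (TIC) matches the target."""
    return intensities * (target_tic / intensities.sum())

# A common axis spanning the m/z range shared by all spectra.
common_mz = np.linspace(700.1, 3496.6, 1000)
resampled = [resample(mz, y, common_mz) for mz, y in raw_spectra]

# Normalize every resampled spectrum to the median TIC.
target_tic = np.median([s.sum() for s in resampled])
normalized = [normalize_tic(s, target_tic) for s in resampled]
```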
The preferred installation is via Docker. Having Docker installed, you can just pull the image:

```
docker pull gmrukwa/msi-preprocessing
```
You need to prepare your data for processing:

- Create an empty directory `/mydata`
- Create a directory `/mydata/raw` - this is where the pipeline expects your original data
- Each dataset should be contained in its own subdirectory:

```
/mydata
|- raw
   |- my-dataset1
   |- my-dataset2
   |- ...
```
- Each subdirectory should contain ASCII files organized as provided by Bruker, e.g.:

```
/mydata
|- raw
   |- my-dataset1
      |- my-dataset1_0_R00X309Y111_1.txt
      |- my-dataset1_0_R00X309Y112_1.txt
      |- my-dataset1_0_R00X309Y113_1.txt
      |- my-dataset1_0_R00X309Y114_1.txt
      |- my-dataset1_0_R00X309Y115_1.txt
      |- ...
```
Note: File names are important, since the `R`, `X` and `Y` values are parsed as metadata! If you put broken values there, the spatial dependencies between spectra will be lost.
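For illustration only (the pipeline has its own parser), the naming convention above can be read with a simple regular expression; interpreting `R`, `X` and `Y` as the region and spatial coordinates is an assumption based on the note above:

```python
import re

# Hypothetical parser for Bruker-style names such as
# "my-dataset1_0_R00X309Y111_1.txt" -- not the pipeline's actual code.
COORDS = re.compile(r"R(\d+)X(\d+)Y(\d+)")

def parse_coordinates(filename):
    """Extract the R/X/Y metadata embedded in a spectrum file name."""
    match = COORDS.search(filename)
    if match is None:
        raise ValueError(f"no R/X/Y metadata in {filename!r}")
    return tuple(int(group) for group in match.groups())

print(parse_coordinates("my-dataset1_0_R00X309Y111_1.txt"))  # (0, 309, 111)
```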
- Each ASCII file should be in the format as provided by Bruker, e.g.:

```
700,043096125457 2
700,051503297599 2
700,059910520559 1
700,068317794335 0
...
<another-mz-value> <another-ions-count>
...
3496,66186447226 1
3496,68071341296 3
3496,69956240485 2
```

Both `.` and `,` are supported as the decimal separator.
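Handling both separators boils down to a substitution before parsing. A minimal reader sketch under that assumption (not the pipeline's actual loader):

```python
import numpy as np

def load_spectrum(path):
    """Read one Bruker-style ASCII spectrum, tolerating ',' or '.' decimals."""
    mz, counts = [], []
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            value, count = line.split()
            mz.append(float(value.replace(",", ".")))
            counts.append(int(count))
    return np.array(mz), np.array(counts)
```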
An example of the expected structure can be found in `sample-data`.
You can launch preprocessing via:

```
docker run -v /mydata:/data -p 8082:8082 gmrukwa/msi-preprocessing '["my-dataset1","my-dataset2"]'
```

Results will appear in the `/mydata` directory as soon as they are available. You can track the progress live at `localhost:8082`.
If you also need the output data as `.csv` files (not only binary numpy `.npy` files), you can simply add the `--export-csv` switch:

```
docker run --rm -ti -v /mydata:/data -p 8082:8082 gmrukwa/msi-preprocessing '["my-dataset1","my-dataset2"]' --export-csv
```
Note: There is no space between dataset names.

Note: The `--export-csv` switch must appear right after the datasets (due to the way Docker handles arguments).
If you want to review the time needed for each task to process, you can prevent the scheduler from being stopped with the `--keep-alive` switch:

```
docker run --rm -ti -v /mydata:/data -p 8082:8082 gmrukwa/msi-preprocessing '["my-dataset1","my-dataset2"]' --keep-alive
```

Note: The `--keep-alive` switch must always come last.
To try the pipeline on the sample data:

- Download the `sample-data` directory
- Run `docker run -v sample-data:/data -p 8082:8082 gmrukwa/msi-preprocessing '["my-dataset1","my-dataset2"]'`
- Track the progress at `localhost:8082`
Building the GMM takes a long time (at least an hour), so be patient.
It may happen that the data is too big to be copied across all your CPU cores. In that case it may be useful to limit the number of cores used. You can do this via the additional `--pool-size` switch. By default all cores are used (or a single one, when detection was impossible).
Example:

```
docker run --rm -ti -v /mydata:/data -p 8082:8082 gmrukwa/msi-preprocessing '["my-dataset1","my-dataset2"]' --pool-size 2 --keep-alive
```
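Conceptually, the switch caps the size of a worker pool. A rough illustration of that idea (an assumption about the mechanism, not the pipeline's actual code):

```python
from multiprocessing import Pool

def preprocess(path):
    """Stand-in for the per-spectrum work the pipeline performs."""
    return path.upper()

if __name__ == "__main__":
    pool_size = 2  # analogous to passing --pool-size 2
    with Pool(processes=pool_size) as pool:
        print(pool.map(preprocess, ["a.txt", "b.txt", "c.txt"]))
```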
You can simply add e-mail notifications to your configuration. They will provide you with failure messages and a notification when the pipeline completes successfully. Two methods are supported: via SendGrid and via an SMTP server.
To set up notifications via SendGrid:

- Create an API key on your SendGrid account (try here)
- Download [`template.env`](./template.env) as `.env`
- In the `.env` file, set the following values (preserve the rest of the content intact):

```
LUIGI_EMAIL_METHOD=sendgrid
LUIGI_EMAIL_RECIPIENT=<your-email-here>
LUIGI_SENDGRID_APIKEY=<your-api-key-here>
```

- When launching processing with Docker, use the additional `--env-file .env` switch:

```
docker run --rm -ti -v /mydata:/data --env-file .env gmrukwa/msi-preprocessing '["my-dataset1","my-dataset2"]'
```
To set up notifications via an SMTP server:

- For your e-mail provider, get the mail program configuration (like here)
- Download [`template.env`](./template.env) as `.env`
- In the `.env` file, set the following values (preserve the rest of the content intact):

```
LUIGI_EMAIL_METHOD=smtp
LUIGI_EMAIL_RECIPIENT=<your-email-here>
LUIGI_EMAIL_SENDER=<your-email-here>
LUIGI_SMTP_HOST=<smtp-host-of-your-provider>
LUIGI_SMTP_PORT=<smtp-port-of-your-provider>
LUIGI_SMTP_NO_TLS=<False-if-your-provider-uses-TLS-True-otherwise>
LUIGI_SMTP_SSL=<False-if-your-provider-uses-TLS-True-otherwise>
LUIGI_SMTP_PASSWORD=<password-to-your-email-account>
LUIGI_SMTP_USERNAME=<login-to-your-email-account>
```

- When launching processing with Docker, use the additional `--env-file .env` switch:

```
docker run --rm -ti -v /mydata:/data --env-file .env gmrukwa/msi-preprocessing '["my-dataset1","my-dataset2"]'
```
Task history is collected into an SQLite database. If you want to persist the database, you need to mount the `/luigi` directory. This can be done via:

```
docker run --rm -ti -v tasks-history:/luigi -v /mydata:/data gmrukwa/msi-preprocessing
```