Scripts used to run the experiments presented in the paper:
"Liquid-Chromatography Retention Order Prediction for Metabolite Identification",
Eric Bach, Sandor Szedmak, Celine Brouard, Sebastian Böcker and Juho Rousu, 2018
Summary of the results shown in the paper (File needs to be downloaded and opened with a web-browser.).
There is no further installation required. The scripts run out of the box, if all the package dependencies are sattisfied. All the source code in this repository is under the MIT License.
The order predictor, e.g. RankSVM, and evaluation scripts are implemented in Python. The code has been tested with Python 3.5 and 3.6. The following packages are required:
- scipy >= 0.19.1
- json >= 2.0.9
- numpy >= 1.13.1
- joblib >= 0.11
- pandas >= 0.20.3
- sklearn >= 0.19.0
- networkx >= 2.0
- matplotlib >= 2.1 (optional)
The data pre-processing scripts as well as the script to reproduce the results shown in the paper are written in R. For the development R version 3.4 was used. The following packages are required:
- Reproduction of results: ECCB2018.Rmd
- data.table
- ggplot2
- knitr
- Reproduction of data pre-processing:
- Matrix
- obabel2R
- rcdk (used for fingerprint calculation)
- fingerprint
Furthermore, the OpenBabel (>= 2.3.2)
command line tool obabel
must be installed only if the data
pre-processing needs to be repeated.
The rcdkTools package allows the computation of several counting fingerprints through the Chemical Development Kit (CDK).
All experiments of the paper can be reproduced by using the evaluation_scenarios_main.py script with the proper parameters:
usage: evaluation_scenarios_main.py <ESTIMATOR> <SCENARIO> <SYSSET> <TSYSIDX> <PATH/TO/CONFIG.JSON> <NJOBS> <DEBUG>
ESTIMATOR: {'ranksvm', 'svr'}, which order predictor to use.
SCENARIO: {'baseline', 'baseline_single', 'baseline_single_perc', 'all_on_one', 'all_on_one_perc', 'met_ident_perf_GS_BS'}, which experiment to run.
SYSSET: {10, imp, 10_imp}, which set of systems to train on.
TSYSIDX: {-1, 0, ..., |sysset| - 1}, which target system to use for evaluation.
PATH/TO/CONFIG.JSON: configuration file, e.g. PredRet/v2/config.json
NJOBS: How many jobs should run in parallel for hyper-parameter estimation?
DEBUG: {True, False}, should we run a smoke test.
SCENARIO | Description | Reference in the Paper |
---|---|---|
baseline_single |
Single system used as training and target | Table 3, Table 4 (first two columns) |
baseline_single_perc |
Single system used as training and target. Different percentage of data used for trainging. | Figure 4 (stroked lines) |
all_on_one |
All systems used for training. Single system used as target. Target system in training (LTSO): True & False | Table 4, LTSO = True 3. & 4. column, LTSO = False 5. & 6. column |
all_on_one_perc |
All systems used for training. Single system used as target. Varying percentage of target system data used for training | Figure 4 (solid lines) |
The following function calls are need:
MACCS counting fingerprints:
python src/evaluation_scenarios_main.py ranksvm baseline_single 10 -1 results/raw/PredRet/v2/config.json 2 False
baseline_single
: Single system used for training and testing.10
: Use "Eawag_XBridgeC18", "FEM_long", "RIKEN", "UFZ_Phenomenex", "LIFE_old" for training and testing.-1
: By setting TSYSIDX to -1, we run all target systems in a single job. This parameter can be used for parallelization.results/raw/PredRet/v2/config.json
: Configuration of the experiment, e.g. molecular features and kernels.2
: Number of jobs/cpus used for the hyper-parameter search.False
: Not running in debug-mode. Results will be stored in the final directory.
The results will be stored into:
results/PredRet/v2
└── final
└── ranksvm_slacktype=on_pairs
└── allow_overlap=True_d_lower=0_d_upper=16_ireverse=False_type=order_graph
└── difference
└── maccsCount_f2dcf0b3
└── minmax
└── baseline_single
MACCS binary fingerprints:
Modify the results/raw/PredRet/v2/config.json
configuration file:
"molecule_representation": {
"kernel": "minmax",
"predictor": ["maccsCount_f2dcf0b3"],
"feature_scaler": "noscaling",
"poly_feature_exp": false
}
becomes
"molecule_representation": {
"kernel": "tanimoto",
"predictor": ["maccs"],
"feature_scaler": "noscaling",
"poly_feature_exp": false
}
Then run:
python src/evaluation_scenarios_main.py ranksvm baseline_single 10 -1 results/raw/PredRet/v2/config.json 2 False
The results will be stored into:
results/PredRet/v2
└── final
└── ranksvm_slacktype=on_pairs
└── allow_overlap=True_d_lower=0_d_upper=16_ireverse=False_type=order_graph
└── difference
└── maccs
└── tanimoto
└── baseline_single
How the results can be loaded and visualized is described here.
To refer the original publication please use:
@article{doi:10.1093/bioinformatics/bty590,
author = {Bach, Eric and Szedmak, Sandor and Brouard, Céline and Böcker, Sebastian and Rousu, Juho},
title = {Liquid-chromatography retention order prediction for metabolite identification},
journal = {Bioinformatics},
volume = {34},
number = {17},
pages = {i875-i883},
year = {2018},
doi = {10.1093/bioinformatics/bty590},
URL = {http://dx.doi.org/10.1093/bioinformatics/bty590},
eprint = {/oup/backfile/content_public/journal/bioinformatics/34/17/10.1093_bioinformatics_bty590/2/bty590.pdf}
}