OpenRetro

An open source library for benchmarking one-step retrosynthesis. This README is mainly for one-click benchmarking on existing/new reaction datasets. For serving retrosynthesis models, please refer to README_serving.md. All included models are under the MIT license, except LocalRetro which is under Apache 2.0 (compatible with the MIT license).

Selected Results

We used the RetroXpert version of USPTO datasets (https://github.com/uta-smile/RetroXpert/tree/main/data). We mostly followed the recommended or default hyperparameters without tuning, which we found to be sufficiently robust. These are probably not universally optimal for any dataset, and we leave the turning to the users.

USPTO_50k without reaction type

Accuracy (%)	Top-1	Top-2	Top-3	Top-5	Top-10	Top-20	Top-50
NeuralSym	45.5	59.7	67.1	74.6	81.6	84.9	85.7
GLN	51.8	62.5	68.8	75.9	83.4	89.3	92.3
Transformer	43.4	53.9	58.5	63.0	67.1	69.4	-
RetroXpert	45.4	55.5	59.8	64.1	68.8	72.0	-
RetroComposer	49.3	64.6	71.1	78.5	85.7	88.2	88.6
LocalRetro	52.5	68.5	75.7	84.0	91.0	94.5	96.3

USPTO_full without reaction type

Accuracy (%)	Top-1	Top-2	Top-3	Top-5	Top-10	Top-20	Top-50
NeuralSym	43.6	54.8	59.8	64.6	68.9	71.4	72.2
Transformer	44.5	55.6	60.3	65.1	69.6	72.1	-
RetroXpert	39.7	47.2	49.9	52.3	54.9	56.9	-

Environment Setup

Building the Docker for benchmarking requires GPU support, which will speed up training with any models in general. First follow the instruction on https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html to install the NVIDIA Container Toolkit (a.k.a. nvidia-docker2). Then run

docker build -f Dockerfile_gpu -t openretro:gpu .

Benchmarking in Docker

Note: please rebuild the Docker before running if there is any change in code.

Step 1/3

Prepare the raw atom-mapped .csv files for train, validation and test. The required columns are "class", "id" and "rxn_smiles". See data/USPTO_50k/{train,val,test}.csv for sample data format. Atom mapping is required for GLN, LocalRetro, RetroComposer, RetroXpert and NeuralSym; behaviour of these models without atom mapping is undefined.

It is possible to run benchmarking with non atom-mapped reactions, with template-free models only. Currently, the only template-free model supported is Transformer.

Step 2/3

Configure variables in ./scripts/config_metadata.sh, especially the paths to point to the absolute path of raw files and desired output paths. Only the variables related to benchmarked models need to be configured. Once the changes have been made, execute the configuration script with

source scripts/config_metadata.sh

Step 3/3

Run benchmarking on a machine with GPU using

bash scripts/benchmark_in_docker.sh MODEL_NAME1 MODEL_NAME2 ...

This will run the preprocessing, training, predicting and scoring with specified models, with Top-n accuracies up to n=50 as the final outputs. Progress and result logs will be saved under ./logs

Currently, we support 6 models as MODEL_NAME, namely,

gln, adapted from original GLN (https://github.com/Hanjun-Dai/GLN)
localretro, adapted from original LocalRetro (https://github.com/kaist-amsg/LocalRetro)
retroxpert, adapted from original RetroXpert (https://github.com/uta-smile/RetroXpert)
retrocomposer, adapted from original RetroComposer (https://github.com/uta-smile/RetroComposer)
neuralsym, adapted from Min Htoo's re-implementation (https://github.com/linminhtoo/neuralsym)
transformer, adapted from Augmented Transformer (https://github.com/bigchem/synthesis) and Molecular Transformer (https://github.com/pschwllr/MolecularTransformer). We (trivially) re-implemented the Transformer using models from OpenNMT, which gave cleaner and more modularized codes.

For example, to benchmark with selected 4 models, run

bash scripts/benchmark_in_docker.sh gln retroxpert neuralsym transformer

To benchmark with only Transformer (e.g. in case there is no atom mapping information), run

bash scripts/benchmark_in_docker.sh transformer

By default, the benchmarking engine will canonicalize the reactions and subsequently re-number the atoms. This is done to avoid the information leak that RetroXpert (https://github.com/uta-smile/RetroXpert/blob/canonical_product/readme.md) and GraphRetro (https://arxiv.org/pdf/2006.07038.pdf) have suffered from. Without re-calibrating the atom mapping, the test accuracies would be higher than expected since the original numbering might hint at where the reaction center is.

The estimated running time for benchmarking the USPTO_50k dataset on a 20-core machine with 1 RTX3090 GPU is

GLN: ~1 hr preprocessing, ~2 hrs training, ~3 hrs testing
RetroXpert: ~5 mins stage 1 preprocessing, ~2 hrs stage 1 training, ~10 mins stage 2 preprocessing, ~12 hrs stage 2 training, ~20 mins testing
Transformer: ~1 min preprocessing, ~16 hrs training, ~10 mins testing
NeuralSym: ~15 mins preprocessing, ~5 mins training, ~2 mins testing

Name		Name	Last commit message	Last commit date
Latest commit History 170 Commits
base		base
checkpoints		checkpoints
data		data
logs		logs
mars		mars
models		models
results		results
scripts		scripts
scripts_serving		scripts_serving
utils		utils
.dockerignore		.dockerignore
.gitignore		.gitignore
Dockerfile_gpu		Dockerfile_gpu
Dockerfile_serving		Dockerfile_serving
LICENSE		LICENSE
README.md		README.md
README_serving.md		README_serving.md
config.properties		config.properties
predict.py		predict.py
preprocess.py		preprocess.py
score.py		score.py
train.py		train.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

OpenRetro

Selected Results

USPTO_50k without reaction type

USPTO_full without reaction type

Environment Setup

Benchmarking in Docker

Step 1/3

Step 2/3

Step 3/3

About

Releases

Packages

Languages

License

coleygroup/openretro

Folders and files

Latest commit

History

Repository files navigation

OpenRetro

Selected Results

USPTO_50k without reaction type

USPTO_full without reaction type

Environment Setup

Benchmarking in Docker

Step 1/3

Step 2/3

Step 3/3

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages