This repo complements the "Predicting Enzymatic Reactions with a Molecular Transformer" publication. The following versions were used:
- Python: 3.6.10
- Torch: 1.5.1
- TorchText: 0.6.0
- OpenNMT: 1.1.1
- RDKit: 2017.09.1
conda create -n enztrans_test python=3.6
conda activate enztrans_test
conda install -c rdkit rdkit=2017.09.1 -y
conda install -c pytorch pytorch=1.5.1 -y
git clone https://github.com/reymond-group/OpenNMT-py.git
cd OpenNMT-py
git checkout Enzymatic_Transformer
pip install -e .
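As an optional sanity check that the environment is set up correctly (this one-liner is not part of the repo; it simply imports the installed packages and prints the Torch version):

python -c "import torch, rdkit, onmt; print(torch.__version__)"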
The training and evaluation were performed using OpenNMT-py. The full OpenNMT documentation can be found here.
The reaction SMILES are tokenized using the tokenization function available from the Molecular Transformer here.
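For reference, that tokenization function, as published with the Molecular Transformer, is:

import re

def smi_tokenizer(smi):
    '''
    Tokenize a SMILES molecule or reaction
    '''
    pattern = "(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
    regex = re.compile(pattern)
    tokens = [token for token in regex.findall(smi)]
    assert smi == ''.join(tokens)
    return ' '.join(tokens)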
Enzyme sentences are tokenized using the Hugging Face tokenizers library available here. The custom tokenizer can be built from a file containing the list of sentences using the following commands:
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors
# Initialize a tokenizer
tokenizer2 = Tokenizer(models.BPE())
# Customize pre-tokenization and decoding
tokenizer2.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer2.decoder = decoders.ByteLevel()
tokenizer2.post_processor = processors.ByteLevel(trim_offsets=True)
# And then train
trainer = trainers.BpeTrainer(vocab_size=9000, min_frequency=2, limit_alphabet=55, special_tokens=['ase', 'hydro', 'mono', 'cyclo', 'thermo', 'im'])
tokenizer2.train(trainer, ["list_of_sentences.txt"])
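Note that recent versions of the tokenizers library reversed the argument order of train (tokenizer2.train(["list_of_sentences.txt"], trainer)). Once trained, the tokenizer can be saved and reloaded so it does not have to be retrained each time; a minimal sketch using the library's save/from_file API (the filename is illustrative):

# Persist the trained tokenizer to a single JSON file (filename is illustrative)
tokenizer2.save("enzyme_tokenizer.json")

# Reload it later without retraining
from tokenizers import Tokenizer
tokenizer2 = Tokenizer.from_file("enzyme_tokenizer.json")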
Then, sentences of the dataset are tokenized using the following function:
def enzyme_sentence_tokenizer(sentence):
    '''
    Tokenize a sentence, optimized for enzyme-like descriptions & names
    '''
    encoded = tokenizer2.encode(sentence)
    # Drop standalone 'Ġ' tokens, then mark word boundaries with '_'
    tokens = [item for item in encoded.tokens if item != 'Ġ']
    tokens = [item.replace('Ġ', '_') for item in tokens]
    return ' '.join(tokens)
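As an illustration, a call might look like the following; the exact subword splits depend on the trained BPE vocabulary, so the output shown is hypothetical:

# Output is illustrative; actual subwords depend on the trained vocabulary
print(enzyme_sentence_tokenizer("alcohol dehydrogenase from baker's yeast"))
# e.g. _alcohol _dehydrogen ase _from _baker 's _yeast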
DATASET=data/uspto_dataset
DATASET_TRANSFER=data/transfer_dataset
python preprocess.py -train_ids ENZR ST_sep_aug \
-train_src $DATASET/src_train.txt $DATASET_TRANSFER/src-train.txt \
-train_tgt $DATASET/tgt_train.txt $DATASET_TRANSFER/tgt-train.txt \
-valid_src $DATASET/src_val.txt -valid_tgt $DATASET_TRANSFER/multi_task/tgt_val.txt \
-save_data $DATASET/Preprocessed \
-src_seq_length 3000 -tgt_seq_length 3000 \
-src_vocab_size 3000 -tgt_vocab_size 3000 \
-share_vocab -lower
The Enzymatic Transformer was trained using the following parameters:
Multi-task transfer learning:
WEIGHT1=1
WEIGHT2=9
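These weights are passed to -data_weights below, so OpenNMT-py draws training data from the two corpora (ENZR and ST_sep_aug) in a 1:9 ratio.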
python train.py -data $DATASET/Preprocessed \
-save_model ENZR_MTL -seed 42 -train_steps 200000 -param_init 0 \
-param_init_glorot -max_generator_batches 32 -batch_size 6144 \
-batch_type tokens -normalization tokens -max_grad_norm 0 -accum_count 4 \
-optim adam -adam_beta1 0.9 -adam_beta2 0.998 -decay_method noam \
-warmup_steps 8000 -learning_rate 4 -label_smoothing 0.0 -layers 4 \
-rnn_size 384 -word_vec_size 384 \
-encoder_type transformer -decoder_type transformer \
-dropout 0.1 -position_encoding -global_attention general \
-global_attention_function softmax -self_attn_type scaled-dot \
-heads 8 -transformer_ff 2048 \
-data_ids ENZR ST_sep_aug -data_weights $WEIGHT1 $WEIGHT2 \
-valid_steps 5000 -valid_batch_size 4 -early_stopping_criteria accuracy
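As written, the command above trains on CPU; to use a GPU, OpenNMT-py 1.x expects device flags to be appended, e.g. for a single GPU (values shown are an assumption for a single-device setup):

-world_size 1 -gpu_ranks 0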
A reaction can be predicted after tokenization using the following command:
python translate.py -model model_uspto_ENZR_multitask.pt \
-src $DATASET/src_test.txt \
-output predictions.txt \
-batch_size 64 -replace_unk -max_length 1000 \
-log_probs -beam_size 5 -n_best 5
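The predictions are written as space-separated tokenized SMILES. A minimal sketch for detokenizing them and canonicalizing with RDKit so they can be compared to the ground truth (the helper name is an assumption, not part of the repo):

from rdkit import Chem

def canonicalize_prediction(tokenized_smiles):
    # Undo the tokenization by removing the inserted spaces
    smi = ''.join(tokenized_smiles.split())
    mol = Chem.MolFromSmiles(smi)
    # Return the canonical SMILES, or None if the prediction is invalid
    return Chem.MolToSmiles(mol) if mol is not None else None

with open('predictions.txt') as f:
    predictions = [canonicalize_prediction(line.strip()) for line in f]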
If you use this work, please cite:

@article{kreutter_predicting_2020,
title = {Predicting {Enzymatic} {Reactions} with a {Molecular} {Transformer}},
author = {Kreutter, David and Schwaller, Philippe and Reymond, Jean-Louis},
url = {https://doi.org/10.26434/chemrxiv.13161359.v1},
doi = {10.26434/chemrxiv.13161359.v1},
urldate = {2020-10-30},
month = oct,
year = {2020},
note = {Publisher: ChemRxiv}
}
If you reuse this code, please also cite the underlying code framework:
OpenNMT: Neural Machine Translation Toolkit
@inproceedings{opennmt,
author = {Guillaume Klein and
Yoon Kim and
Yuntian Deng and
Jean Senellart and
Alexander M. Rush},
title = {Open{NMT}: Open-Source Toolkit for Neural Machine Translation},
booktitle = {Proc. ACL},
year = {2017},
url = {https://doi.org/10.18653/v1/P17-4012},
doi = {10.18653/v1/P17-4012}
}