Unifying Molecular and Textual Representations via Multi-task Language Modelling
Dimitrios Christofidellis*, Giorgio Giannone*, Jannis Born, Ole Winther, Teodoro Laino, Matteo Manica
International Conference on Machine Learning (ICML), 2023
[paper] [gradio app] [code]
The recent advances in neural language models have also been successfully applied to the field of chemistry, offering generative solutions for classical problems in molecular design and synthesis planning. These new methods have the potential to fuel a new era of data-driven automation in scientific discovery. However, specialized models are still typically required for each task, leading to the need for problem-specific fine-tuning and neglecting task interrelations. The main obstacle in this field is the lack of a unified representation between natural language and chemical representations, complicating and limiting human-machine interaction. Here, we propose the first multi-domain, multi-task language model that can solve a wide range of tasks in both the chemical and natural language domains. Our model can handle chemical and natural language concurrently, without requiring expensive pre-training on single domains or task-specific models. Interestingly, sharing weights across domains remarkably improves our model when benchmarked against state-of-the-art baselines on single-domain and cross-domain tasks. In particular, sharing information across domains and tasks gives rise to large improvements in cross-domain tasks, the magnitude of which increases with scale, as measured by more than a dozen relevant metrics. Our work suggests that such models can robustly and efficiently accelerate discovery in physical sciences by superseding problem-specific fine-tuning and enhancing human-model interactions.
Install requirements:
pip install -r requirements.txt
Create a dedicated kernel:
python -m ipykernel install --name text_chem_t5_demo
Good to go 🚀
The training process is carried out with the language-modeling trainer from the GT4SD library (Manica et al., 2022), which builds on Hugging Face transformers (Wolf et al., 2020) and PyTorch Lightning (Falcon and The PyTorch Lightning team, 2019). To reproduce the training, you first need to install the GT4SD library; for installation instructions, see its project page. Once GT4SD is installed, you can launch our training with the following command. Note that the provided splits in the dataset-sample directory contain only a small subset of our actual dataset splits. To regenerate the full training dataset, we refer the interested reader to the respective section of our paper and the references provided there.
gt4sd-trainer --training_pipeline_name language-modeling-trainer \
--model_name_or_path t5-base \
--lr 6e-4 \
--lr_decay 0.99 \
--batch_size 8 \
--train_file dataset-sample/train.jsonl \
--validation_file dataset-sample/valid.jsonl \
--default_root_dir text_chem_t5_base \
--type cgm \
--val_check_interval 20000 \
--max_epochs 1 \
--limit_val_batches 500 \
--accumulate_grad_batches 4 \
--log_every_n_steps 5000 \
--monitor val_loss \
--save_top_k 1 \
--mode min \
--every_n_train_steps 20000 \
--accelerator 'ddp'
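For orientation, each training example pairs a prompted input string with its expected output string. The snippet below is a minimal, illustrative sketch of how such JSON Lines entries could be assembled from the prompt templates listed further below; the `source`/`target` field names are an assumption on our part, so please check them against the GT4SD language-modeling trainer documentation before building a dataset.

```python
import json

# Illustrative only: the "source"/"target" field names are assumptions,
# not a confirmed specification of the GT4SD language-modeling trainer.
examples = [
    {
        "source": "Predict the product of the following reaction: CC(=O)O.CCO",
        "target": "CCOC(C)=O",
    },
    {
        "source": "Caption the following SMILES: CCO",
        "target": "Ethanol is a primary alcohol in which one hydrogen of ethane is replaced by a hydroxy group.",
    },
]

# Write one JSON object per line (hypothetical file name).
with open("my_train.jsonl", "w") as fp:
    for example in examples:
        fp.write(json.dumps(example) + "\n")
```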
The prompt templates used for the five tasks are listed in the following table, where `<input>` denotes the actual input for each task.

Task | Template |
---|---|
Forward prediction | Predict the product of the following reaction: `<input>` |
Retrosynthesis | Predict the reaction that produces the following product: `<input>` |
Paragraph-to-actions | Which actions are described in the following paragraph: `<input>` |
Description-to-SMILES | Write in SMILES the described molecule: `<input>` |
SMILES-to-caption | Caption the following SMILES: `<input>` |
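The templates can also be applied programmatically. The snippet below simply mirrors the table above; the dictionary keys and the `build_prompt` helper are our own illustrative names, not part of the released code.

```python
# Prompt templates mirroring the table above; keys and helper name are illustrative.
TEMPLATES = {
    "forward_prediction": "Predict the product of the following reaction: {}",
    "retrosynthesis": "Predict the reaction that produces the following product: {}",
    "paragraph_to_actions": "Which actions are described in the following paragraph: {}",
    "description_to_smiles": "Write in SMILES the described molecule: {}",
    "smiles_to_caption": "Caption the following SMILES: {}",
}


def build_prompt(task: str, task_input: str) -> str:
    """Fill the task-specific template with the actual input."""
    return TEMPLATES[task].format(task_input)


print(build_prompt("smiles_to_caption", "CC(=O)Nc1ccc(O)cc1"))
# -> Caption the following SMILES: CC(=O)Nc1ccc(O)cc1
```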
The four variants of our model are available on the Hugging Face Hub via the following links:
- multitask-text-and-chemistry-t5-small-standard
- multitask-text-and-chemistry-t5-small-augm
- multitask-text-and-chemistry-t5-base-standard
- multitask-text-and-chemistry-t5-base-augm
In the provided notebook (demo.ipynb), we present examples of how the model can be used for the five tasks.
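For a quick start outside the notebook, a checkpoint can also be loaded directly with Hugging Face transformers. The sketch below assumes the models are hosted under the GT4SD organization on the Hub; adjust the repository id to the variant you want to use.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Assumed Hub repository id; replace with any of the four variants listed above.
model_name = "GT4SD/multitask-text-and-chemistry-t5-base-augm"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# SMILES-to-caption example using the corresponding prompt template.
prompt = "Caption the following SMILES: CC(=O)Nc1ccc(O)cc1"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=256, num_beams=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```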
@inproceedings{christofidellis2023unifying,
title = {Unifying Molecular and Textual Representations via Multi-task Language Modelling},
author = {Christofidellis, Dimitrios and Giannone, Giorgio and Born, Jannis and Winther, Ole and Laino, Teodoro and Manica, Matteo},
booktitle = {Proceedings of the 40th International Conference on Machine Learning},
pages = {6140--6157},
year = {2023},
volume = {202},
series = {Proceedings of Machine Learning Research},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v202/christofidellis23a/christofidellis23a.pdf},
url = {https://proceedings.mlr.press/v202/christofidellis23a.html},
}