DeepAcceptor

Computational design and screening of acceptor materials for organic solar cells

Motivation

Developing affordable, high-performance organic photovoltaic materials is time-consuming and costly. Reliable computational methods for predicting the power conversion efficiency (PCE) are crucial for triaging unpromising molecules in large-scale databases and accelerating material discovery. In this study, a deep learning-based framework (DeepAcceptor) was built to design and discover high-efficiency small-molecule acceptor materials. Specifically, an experimental dataset was constructed by collecting data from publications. Then, a BERT-based model was customized to predict PCEs by taking full advantage of the atom, bond, and connection information in the molecular structures of acceptors; this customized architecture is termed abcBERT. The molecular graph was used as the input, and the computational molecules and experimental molecules were used to pre-train and fine-tune the model, respectively. In sum, DeepAcceptor is a promising method for predicting the PCE and speeding up the discovery of high-performance acceptor materials.

Depends

We recommend using conda and pip.

The environment.yml file installs all the required packages.

    git clone --depth=1 https://github.com/jinysun/deepacceptor.git
    cd deepacceptor
    conda env create -f environment.yml
    conda activate deepacceptor

Usage

The code of abcBERT is organized as follows.
-- pretrain: contains the code for the masked atom prediction pre-training task
-- regression: contains the code for fine-tuning on specified tasks
-- dataset: contains the code for building datasets for pre-training and fine-tuning
-- utils: contains the code for converting molecules to graphs
-- predict: contains the code for predicting properties
-- Demo: contains the code showing how the model works

Data pre-processing

abcBERT predicts PCE from the molecular graph, so SMILES strings must first be converted to graphs. The conversion code is in deepacceptor/utils.py.

First, put the test file in the directory data/reg/.

Then run utils.py as follows.

    import utils

    utils.pretrainprocess()  # data for pre-training
    utils.processtrain()     # training set
    utils.processtest()      # test set
    utils.processtval()      # validation set

or use the command line as follows

    cd abcBERT
    python utils.py
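
To make the term "molecular graph" concrete, here is a minimal sketch of the kind of structure such a conversion produces: a list of atoms, a list of bonds, and a connection (adjacency) matrix. The graph is written by hand for ethanol (SMILES `CCO`); the variable names and the `adjacency_matrix` helper are illustrative assumptions and are not taken from utils.py, which may store its graphs differently.

```python
# Hand-written graph for ethanol (SMILES "CCO"); hydrogens are left implicit.
atoms = ["C", "C", "O"]                       # node labels, indexed 0..2
bonds = [(0, 1, "SINGLE"), (1, 2, "SINGLE")]  # (begin atom, end atom, bond type)


def adjacency_matrix(num_atoms, bonds):
    """Return a symmetric 0/1 matrix marking which atoms are bonded."""
    adj = [[0] * num_atoms for _ in range(num_atoms)]
    for begin, end, _bond_type in bonds:
        adj[begin][end] = 1
        adj[end][begin] = 1
    return adj


adj = adjacency_matrix(len(atoms), bonds)
for symbol, row in zip(atoms, adj):
    print(symbol, row)
```

Each row shows the neighbours of one atom: the middle carbon is connected to both the first carbon and the oxygen.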

Model training

Pre-train the model

The pre-training process can be completed after pre-processing the data.

    import pretrain
    pretrain.main()

or use the command line as follows

    cd abcBERT
    # pre-process the data for pre-training
    python -c "import utils; utils.pretrainprocess()"

    # pre-training
    python pretrain.py
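
The pre-training task is masked atom prediction: some atoms are hidden and the model learns to recover them. The following is a minimal sketch of the masking step only, not the code in pretrain.py. The `[MASK]` token name, the `mask_atoms` helper, and the 15% masking rate (the usual BERT convention) are assumptions made for illustration.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical token name, chosen for this sketch


def mask_atoms(atom_tokens, mask_rate=0.15, seed=None):
    """Hide a random subset of atoms and return (masked_tokens, targets).

    targets maps each masked position to the original atom symbol, which is
    what a model would be trained to recover.
    """
    if not atom_tokens:
        return [], {}
    rng = random.Random(seed)
    # Always mask at least one atom so every molecule yields a training signal.
    num_to_mask = max(1, round(len(atom_tokens) * mask_rate))
    positions = rng.sample(range(len(atom_tokens)), num_to_mask)
    masked = list(atom_tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK_TOKEN
    return masked, targets


atoms = ["C", "C", "S", "C", "N", "C", "O", "C", "C", "F"]
masked, targets = mask_atoms(atoms, mask_rate=0.15, seed=0)
print(masked)
print(targets)
```

During pre-training, the loss would be computed only at the masked positions, comparing the predicted atom type with the entry stored in `targets`.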

Fine-tune the model

Fine-tuning can be run after pre-processing the training/test/validation sets and pre-training the model.

```
from regression import main

seed = 12  # random seed passed to the fine-tuning run
r2, prediction_val, prediction_test = main(seed)
```

or use the command line as follows

    cd abcBERT
    #pre-process the data for training/test/validation
    python -c "import utils; utils.processtrain()"
    python -c "import utils; utils.processtest()"
    python -c "import utils; utils.processtval()"
    
    # Fine-tuning the model
    python regression.py
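
Because a single seed can give an unrepresentative score, it may be useful to repeat fine-tuning over several seeds and report the mean and spread of R2. The sketch below assumes that the training function returns `(r2, prediction_val, prediction_test)` with `r2` as a plain number, as the call to `main(seed)` above suggests. `run_seeds` and `fake_main` are hypothetical names; `fake_main` is a stand-in so the example runs without training a model.

```python
import statistics


def run_seeds(train_fn, seeds):
    """Call train_fn(seed) for each seed and summarise the returned R2 values.

    At least two seeds are needed to compute a standard deviation.
    """
    r2_list = []
    for seed in seeds:
        r2, _prediction_val, _prediction_test = train_fn(seed)
        r2_list.append(r2)
    return statistics.mean(r2_list), statistics.stdev(r2_list), r2_list


def fake_main(seed):
    # Stand-in for regression.main so the sketch runs without training.
    return 0.70 + 0.01 * seed, [], []


mean_r2, std_r2, r2_list = run_seeds(fake_main, seeds=[1, 2, 3])
print(f"R2 = {mean_r2:.3f} +/- {std_r2:.3f}")
```

With the real model, `fake_main` would be replaced by `regression.main`, provided its return value matches the assumption above.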

Predicting PCE of large-scale database

The PCE prediction is obtained by feeding the processed molecules into the trained abcBERT model with predict.py.

    # Pre-process the test data
    import utils
    utils.processtest()

    # Prediction on large-scale dataset
    import sys
    import numpy as np
    from predict import main

    np.set_printoptions(threshold=sys.maxsize)  # print full arrays without truncation
    prediction_val = main()

or use the command line as follows

    cd abcBERT
    
    # pre-process the data
    python -c "import utils; utils.processtest()"
    
    # prediction on large-scale dataset
    python predict.py
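
Once predictions are available for a large database, the usual next step is to keep only the most promising candidates. The sketch below ranks molecules by predicted PCE and returns the top k. It assumes the predictions form a flat sequence in the same order as the input molecules, which is not confirmed by predict.py; `top_k_by_pce` is a hypothetical helper, and the identifiers and values are made up for illustration.

```python
def top_k_by_pce(smiles_list, predicted_pce, k=3):
    """Return the k (smiles, pce) pairs with the highest predicted PCE."""
    if len(smiles_list) != len(predicted_pce):
        raise ValueError("smiles_list and predicted_pce must have the same length")
    ranked = sorted(
        zip(smiles_list, predicted_pce),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


# Placeholder identifiers and made-up values, for illustration only.
smiles_list = ["mol_A", "mol_B", "mol_C", "mol_D"]
predicted_pce = [8.2, 12.5, 10.1, 6.7]

for smiles, pce in top_k_by_pce(smiles_list, predicted_pce, k=2):
    print(f"{smiles}\t{pce:.2f}")
```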

Predicting PCE of single molecule

    from predictbysmiles import main

    # prediction without any pre-process; the raw string keeps the backslash in the SMILES intact
    prediction_val = main(r'CCCCCCCCC1=CC=C(C2(C3=CC=C(CCCCCCCC)C=C3)C3=CC4=C(C=C3C3=C2C2=C(C=C(C5=CC=C(/C=C6/C(=O)C7=C(C=CC=C7)C6=C(C#N)C#N)C6=NSN=C56)S2)S3)C(C2=CC=C(CCCCCCCC)C=C2)(C2=CC=C(CCCCCCCC)C=C2)C2=C4SC3=C2SC(C2=CC=C(/C=C4\C(=O)C5=C(C=CC=C5)C4=C(C#N)C#N)C4=NSN=C24)=C3)C=C1')

or use the command line as follows

    cd abcBERT
    # prediction without any pre-process
    python -c "import predictbysmiles; predictbysmiles.main(r'CCCCCCCCC1=CC=C(C2(C3=CC=C(CCCCCCCC)C=C3)C3=CC4=C(C=C3C3=C2C2=C(C=C(C5=CC=C(/C=C6/C(=O)C7=C(C=CC=C7)C6=C(C#N)C#N)C6=NSN=C56)S2)S3)C(C2=CC=C(CCCCCCCC)C=C2)(C2=CC=C(CCCCCCCC)C=C2)C2=C4SC3=C2SC(C2=CC=C(/C=C4\C(=O)C5=C(C=CC=C5)C4=C(C#N)C#N)C4=NSN=C24)=C3)C=C1')"

Example code for prediction is included in test.ipynb.

Demo

example.ipynb shows the whole abcBERT workflow. The files in Demo are used to test that the code works. The parameters (such as the number of epochs and the dataset size) are set to small values to show how abcBERT works.

Designing and Screening

Molecular generation

BRICS+VAE: A fragment-based molecule design framework was built using the breaking of retrosynthetically interesting chemical substructures (BRICS) algorithm and a variational autoencoder (VAE) to obtain a database with specific potential molecular properties.

Basic properties

Basic properties: The Gen database was screened by basic properties such as molecular size, logP, the numbers of H-bond acceptors and donors, and the number of rotatable bonds. These properties were calculated using RDKit.

HOMO & LUMO matching

A GNN was trained on a non-fullerene acceptor (NFA) dataset with HOMO and LUMO levels computed by DFT. The dataset of 51000 NFAs was split randomly at a ratio of 8:1:1. The MAE and R2 of the predicted HOMO are 0.052 and 0.972, respectively.
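
For readers who want to reproduce this kind of evaluation, the sketch below shows a random 8:1:1 split and the two reported metrics in plain Python. It is not the authors' code. Treating the three parts as training, validation, and test sets is an assumption, the function names are hypothetical, and the HOMO values in the example are made up.

```python
import random


def split_indices(n, ratios=(8, 1, 1), seed=0):
    """Shuffle range(n) and cut it into three parts in the given ratio."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    total = sum(ratios)
    first_cut = n * ratios[0] // total
    second_cut = first_cut + n * ratios[1] // total
    return indices[:first_cut], indices[first_cut:second_cut], indices[second_cut:]


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


train_idx, val_idx, test_idx = split_indices(51000)
print(len(train_idx), len(val_idx), len(test_idx))  # 40800 5100 5100

# Made-up values, for illustration only.
y_true = [-5.60, -5.50, -5.70, -5.40]
y_pred = [-5.55, -5.52, -5.66, -5.45]
print(f"MAE = {mae(y_true, y_pred):.3f}, R2 = {r2_score(y_true, y_pred):.3f}")
```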

SAscore

SAscore was used to evaluate synthetic accessibility and complexity.

Molecular polarities and charge distribution

Properties related to molecular polarity and charge distribution were calculated by RDKit.

Discussion

The Discussion folder contains the scripts for evaluating PCE prediction performance. We compared several common methods widely used in molecular property prediction, such as MolCLR GNN, RF, ANN, and QDF.

Cite

Sun, J., Li, D., Zou, J. et al. Accelerating the discovery of acceptor materials for organic solar cells by deep learning. npj Comput Mater 10, 181 (2024). https://doi.org/10.1038/s41524-024-01367-7

Contact

Jinyu Sun. E-mail: jinyusun@csu.edu.cn