MonoSeqCP is a multimodal, monomer-level, transformer-based framework for predicting the membrane permeability of cyclic peptides. The model integrates multiple monomer-level representations, including physicochemical descriptors, fingerprint-based features, and connectivity information, and explicitly accounts for cyclic rotational invariance.
This repository contains the code and analysis pipelines used in an ongoing research project at Molecular AI, AstraZeneca, Gothenburg, Sweden.
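To illustrate what cyclic rotational invariance means at the monomer level, here is a minimal sketch. It is not the MonoSeqCP implementation. A cyclic peptide has no natural first residue, so every rotation of its monomer sequence describes the same ring. The helper names `cyclic_rotations` and `canonical_rotation` and the example monomer labels are hypothetical, and how the model itself handles the invariance is not shown here.

```python
from typing import List, Sequence, Tuple


def cyclic_rotations(monomers: Sequence[str]) -> List[Tuple[str, ...]]:
    """Return every rotation of a non-empty monomer sequence (hypothetical helper)."""
    n = len(monomers)
    return [tuple(monomers[i:]) + tuple(monomers[:i]) for i in range(n)]


def canonical_rotation(monomers: Sequence[str]) -> Tuple[str, ...]:
    """Pick the lexicographically smallest rotation as a start-independent key."""
    return min(cyclic_rotations(monomers))


# Illustrative monomer labels only; not taken from the project's data.
ring = ["Ala", "meLeu", "Pro", "Phe"]
shifted = ring[2:] + ring[:2]  # same ring, read from a different start

# Both readings of the ring map to the same canonical key.
assert canonical_rotation(ring) == canonical_rotation(shifted)
```

A canonical key like this is one way to make any downstream computation independent of the chosen start position. Another common option is to train on all rotations as data augmentation.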
- 'scripts/' – Command-line scripts for feature generation, training, evaluation, and saliency analysis
- 'notebooks/' – Data preprocessing and splitting, result visualization, and saliency analysis
- 'data/' – Dataset location (not included in the repository)
- 'results/' – Generated outputs (not included)
- 'results/plots/' – Generated figures (not included)
git clone https://github.com/MolecularAI/MonoSeqCP
cd MonoSeqCP
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
PyTorch is installed separately to support different CPU/GPU configurations.
pip install torch torchvision torchaudio
On many HPC systems, CUDA is provided via environment modules. Load the CUDA module recommended for your cluster (the version shown here is an example), then install PyTorch.
module load CUDA/12.1.1
pip install torch torchvision torchaudio
Verify installation:
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.version.cuda)"
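For a slightly more detailed check than the one-liner above, the following sketch collects the same information in a small script and does not fail if PyTorch is missing. The function name `torch_summary` is hypothetical. The script uses only standard PyTorch attributes and reports the CUDA device name only when a GPU is visible.

```python
import importlib.util


def torch_summary() -> dict:
    """Collect basic PyTorch facts; returns {'installed': False} if torch is missing."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch

    info = {
        "installed": True,
        "version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
    }
    if info["cuda_available"]:
        # Name of the first visible GPU.
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


print(torch_summary())
```

If `cuda_available` is false on a GPU node, the usual causes are a CPU-only wheel or a CUDA module that was not loaded.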
An exact snapshot of the author environment is provided in
requirements-author-freeze.txt. This file is primarily for reference and may be
system- or HPC-specific.
Author torch/CUDA setup:
torch 2.4.0 (CUDA 12.1)
Installed using:
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 \
--index-url https://download.pytorch.org/whl/cu121
Due to licensing restrictions, the datasets used in this project are not included in the repository.
The required datasets can be downloaded from the CycPeptMPDB database and from an
external benchmark repository. Detailed download instructions, including which
subsets to select, are provided in data/README.md.
All scripts are executed from the repository root.
High-level workflow:
- Preprocess data and generate dataset splits using:
  - 'notebooks/dataset.ipynb'
  - 'notebooks/dataset_bench.ipynb'
- Generate input features using 'scripts/input_features2.py'
- Train models using 'scripts/model_training2.py'
- Evaluate trained models using 'scripts/eval1.py'
Optional:
- Generate plots using 'notebooks/plots.ipynb'
- Perform saliency analysis using 'scripts/saliency.py'
- Analyse saliency results using 'notebooks/saliency.ipynb'
Exact command-line arguments for the scripts are described in the script docstrings, together with instructions on how to change specific variable values.
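Since the arguments are documented in the script docstrings, one convenient way to read them is to print each docstring without running the script. The sketch below assumes the documentation is a module-level docstring at the top of each file, and that it is run from the repository root. The helper name `script_docstring` is hypothetical.

```python
import ast
from pathlib import Path
from typing import Optional

# Script paths as listed in the workflow; run from the repository root.
SCRIPTS = [
    "scripts/input_features2.py",
    "scripts/model_training2.py",
    "scripts/eval1.py",
    "scripts/saliency.py",
]


def script_docstring(path: str) -> Optional[str]:
    """Return the module docstring of a script without executing it."""
    file = Path(path)
    if not file.is_file():
        return None
    # Parse the source only; nothing in the script is imported or run.
    return ast.get_docstring(ast.parse(file.read_text(encoding="utf-8")))


for script in SCRIPTS:
    doc = script_docstring(script)
    print(f"=== {script} ===")
    print(doc if doc is not None else "(file not found or no module docstring)")
```

Parsing with `ast` avoids importing the scripts, so no training or feature generation is triggered just by reading the documentation.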
If you use this code, please cite the repository using the information provided in 'CITATION.cff'.
If you use data from the CycPeptMPDB database, please also cite:
Li J, Yanagisawa K, Sugita M, Fujie T, Ohue M, Akiyama Y.
CycPeptMPDB: A Comprehensive Database of Membrane Permeability of Cyclic Peptides.
Journal of Chemical Information and Modeling, 63(7):2240–2250, 2023.
https://doi.org/10.1021/acs.jcim.2c01573
If you use the benchmark splits from the external repository, please also cite:
Liu W, Li J, Verma CS, Lee HK. Code for systematic benchmarking of 13 AI methods for cyclic peptide permeability. GitHub, https://github.com/Gobliu/BenchmarkCycPeptMP