Kinodata-3D dataset and models

This repository contains a PyG-based (PyTorch Geometric) interface to the Kinodata-3D dataset, together with the code used to train and evaluate the models presented in the Kinodata-3D publication.

Installation

We currently only support installation from source.

(1) Clone this repo

(2) Set up Python environment

Use mamba (or conda) to set up a Python environment,

mamba env create -f environment.yml
mamba activate kinodata

and install this package in editable/develop mode

pip install -e .
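After the editable install, a quick import check can confirm that the environment is usable. The following is a minimal sketch, not part of this repository: the helper name `missing_modules` is hypothetical, and the import name `kinodata` is an assumption based on the environment name above.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # "kinodata" is an assumed import name; adjust it if the package is named differently.
    missing = missing_modules(["torch", "torch_geometric", "kinodata"])
    print("missing:", missing or "none")
```

An empty result suggests that PyTorch, PyG and the package itself are all visible to the active interpreter.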

(3) Obtain raw data

The raw data (docked poses and kinase PDB files) can be obtained from Zenodo. After downloading the archives, extract them into the root directory of this repository.

cd PATH_TO_REPO
unzip ...
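If `unzip` is not available, the same step can be sketched with Python's standard library. This is an illustrative helper under stated assumptions: the function name `extract_archives` and the idea of a separate download directory are hypothetical, and the sketch simply extracts whichever `.zip` files it finds there rather than naming specific archives.

```python
import zipfile
from pathlib import Path


def extract_archives(download_dir, repo_root):
    """Extract every .zip file in download_dir into repo_root.

    Returns the names of the archives that were extracted, in sorted order.
    """
    repo_root = Path(repo_root)
    extracted = []
    for archive in sorted(Path(download_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            # Paths stored in the archive are recreated relative to repo_root.
            zf.extractall(repo_root)
        extracted.append(archive.name)
    return extracted
```

For example, `extract_archives("downloads", "PATH_TO_REPO")` would unpack all archives from a hypothetical `downloads` folder into the repository root.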

See the Kinodata-3D repo for more information and the code used to generate the raw data.

General usage

Reproducing results

(1) Acquire exact dataset and data split versions

If you intend to reproduce our results, we strongly recommend that you use our preprocessed version of the dataset and corresponding data splits.

(2) Model training and evaluation

You can use the shell script condor/train_generic.sh to train and test a model in one run on one particular split. If you want to sync results to Weights & Biases, create a file wandb_api_key in the root directory of this repository and paste your wandb API key into it. Otherwise, run the command wandb disabled in a terminal with the conda environment activated before training.
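When launching training from Python rather than from a terminal, the two logging options above can be expressed through environment variables that wandb recognises (`WANDB_API_KEY` and `WANDB_MODE`). The sketch below is an assumption about one way to do this; it does not describe how condor/train_generic.sh itself reads the key file, and the helper name `wandb_environment` is hypothetical.

```python
import os
from pathlib import Path


def wandb_environment(repo_root, base_env=None):
    """Return an environment dict that syncs to W&B if a key file exists, else disables it."""
    env = dict(os.environ if base_env is None else base_env)
    key_file = Path(repo_root) / "wandb_api_key"
    key = key_file.read_text().strip() if key_file.is_file() else ""
    if key:
        # Key found: expose it to wandb through its standard environment variable.
        env["WANDB_API_KEY"] = key
    else:
        # No key: turn wandb logging into a no-op for the child process.
        env["WANDB_MODE"] = "disabled"
    return env
```

The returned dict can be passed as `env=` to `subprocess.run` when starting a training process.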

The script condor/train_generic.sh requires the following positional arguments, in this order:

  1. Base Python script, one of "train_dti_baseline", "train_sparse_transformer".
  2. Split type, one of "scaffold-k-fold", "random-k-fold", "pocket-k-fold".
  3. Integer RMSD cutoff for the dataset, e.g. 2, 4, or 6 as used in the publication.
  4. A .yaml file that contains additional configuration parameters, e.g. model hyperparameters.
  5. The integer index of the cross-validation fold used for testing.

For instance,

./condor/train_generic.sh train_dti_baseline scaffold-k-fold 2 dti.yaml 0

trains and tests the DTI baseline on the scaffold 5-fold split (k defaults to 5) of the dataset containing all complexes with predicted RMSD <= 2 Angstroms. Folds 1-4 are used for training and fold 0 for testing.
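A full cross-validation run repeats this call once per test fold. The sketch below assembles the five command lines from the positional arguments described above and checks them before anything is launched. The helper names `build_command` and `cross_validation_commands` are hypothetical, and `k=5` only mirrors the default mentioned above; how k is configured in the repository is not shown here.

```python
import shlex

SCRIPTS = ("train_dti_baseline", "train_sparse_transformer")
SPLITS = ("scaffold-k-fold", "random-k-fold", "pocket-k-fold")


def build_command(script, split, rmsd_cutoff, config, fold, k=5):
    """Validate the five positional arguments and return the argv list for one fold."""
    if script not in SCRIPTS:
        raise ValueError(f"unknown base script: {script!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split type: {split!r}")
    if not isinstance(rmsd_cutoff, int) or rmsd_cutoff <= 0:
        raise ValueError("RMSD cutoff must be a positive integer")
    if not 0 <= fold < k:
        raise ValueError(f"fold index must be in [0, {k})")
    return ["./condor/train_generic.sh", script, split,
            str(rmsd_cutoff), config, str(fold)]


def cross_validation_commands(script, split, rmsd_cutoff, config, k=5):
    """Return one command per test fold, covering folds 0 .. k-1."""
    return [build_command(script, split, rmsd_cutoff, config, fold, k)
            for fold in range(k)]


if __name__ == "__main__":
    for cmd in cross_validation_commands(
            "train_dti_baseline", "scaffold-k-fold", 2, "dti.yaml"):
        # Print only; pass cmd to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to inspect them or hand them to a job scheduler before committing to a full run.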