
# OTMTD

## Directory Structure

```
├── example.ipynb                  # Example of OTMTD
├── otmtd_cal.ipynb                # OTMTD metric calculation
├── emprical_relate.ipynb          # Analysis of the correlation between OTMTD and empirical results
├── ot_baselines_cal.ipynb         # OT-based baseline metrics
├── non_ot_baselines_cal.ipynb     # Non-OT-based baseline metrics
├── metrics_perf_comp.ipynb        # Comparison of different metrics
├── requirements.txt               # Environment dependencies
├── README.md                      # Readme file
├── otmtd/                         # Definition files for OTMTD
├── otdd/                          # Definition files for OTDD
├── otce/                          # Definition files for OTCE
├── represent/                     # Represent protein tasks using MASSA
├── processed_data/                # Processed protein downstream task text data
├── protein_embeddings_MultiTasks/ # Pre-trained or downstream task embeddings
└── cv_emb/                        # Represent CV tasks as embeddings
```

## Requirements

Python 3.9.1 and torch 1.8.1.

```shell
conda create -n otmtd python==3.9.1
conda activate otmtd
cd OTMTD
pip install -r requirements.txt
```

## Example

example.ipynb demonstrates the transferability calculation from pre-training to the Fluorescence task. Note that the embeddings used in the example must be downloaded from
https://drive.google.com/drive/folders/1RTphom46oGlJlnw52NSABMNQurWldhJi?usp=sharing.

## 1. Data Processing

Raw data for the protein downstream tasks can be downloaded from https://drive.google.com/drive/folders/1BYzf2RJFcMnT_8Cf_F0Gu_ZWGvM7Z0eY?usp=sharing. The data formats and dataset sizes are as follows:

- Without uniprot id

  | task | seq | label |
  | --- | --- | --- |
  | Stability | DQSVRKLV... | -0.2099 |
  | Fluorescence | SKGEELFT... | 3.7107 |
  | Remote Homology | PKKVLTGV... | 51 |
  | Secondary Structure | MNDKRLQF... | 22222000... |
  | Signal Peptide | MLGMIRNS... | 0 |
  | Fold Classes | MSPFTGSA... | c |

- With uniprot id

  - PDBBind

    | uniprot_id | seq | smiles | rdkit_smiles | label | dataset_type |
    | --- | --- | --- | --- | --- | --- |
    | 11gs | PYTVVYFP... | OC(=O)c1cc... | O=C(O)c1cc... | 4.62 | train |

  - Kinase

    | molecule | uniprot_id | seq | label |
    | --- | --- | --- | --- |
    | COC1C(N(C)C(C)=O)... | P05129 | MAGLGPGV... | 1 |

- Size of datasets

  |       | Stability | Fluorescence | Remote Homology | Secondary Structure | Signal Peptide | Fold Classes | Pdbbind | Kinase |
  | ----- | --------- | ------------ | --------------- | ------------------- | -------------- | ------------ | ------- | ------ |
  | Train | 53614     | 21446        | 12312           | 8678                | 16606          | 15680        | 11906   | 91552  |
  | Valid | 2512      | 5362         | 736             | 2170                | /              | /            | 1000    | /      |
  | Test  | 12851     | 27217        | 718             | 513                 | 4152           | 3921         | 290     | 19685  |

Then, each entry in a dataset without uniprot ids is manually assigned one following the template `<task>_<dataset_type>_<number>`, e.g., `fluo_train_17878`.
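The template above can be sketched as a one-line helper; the function name and the `fluo` abbreviation are illustrative, not taken from the repo's code.

```python
# Minimal sketch of the <task>_<dataset_type>_<number> id template.
def make_uniprot_id(task_abbrev, dataset_type, index):
    """Build a placeholder id like 'fluo_train_17878'."""
    return f"{task_abbrev}_{dataset_type}_{index}"

ids = [make_uniprot_id("fluo", "train", i) for i in range(3)]
```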

Next, the corresponding Gene Ontology (GO) terms are retrieved from idmapping_selected.tab according to the uniprot id, and "No goterm" is recorded if no GO term is found. The command is as follows:

```shell
grep -w <uniprot_id> idmapping_selected.tab -m 1
```

The reference code for data processing, including GO retrieval and label processing, can be found in processing_data.ipynb.
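The grep lookup above can also be sketched in Python. This is an assumption-laden sketch, not the repo's code: it takes the GO annotation from the 7th tab-separated column, which matches UniProt's documented idmapping_selected.tab layout but should be verified against the actual file.

```python
import tempfile

def retrieve_go(uniprot_id, mapping_path):
    """Return the GO column of the first row matching uniprot_id, else 'No goterm'."""
    with open(mapping_path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if fields and fields[0] == uniprot_id:
                go = fields[6] if len(fields) > 6 else ""
                return go if go else "No goterm"
    return "No goterm"

# Demo with a tiny mock mapping file (empty columns padded up to the GO column).
with tempfile.NamedTemporaryFile("w", suffix=".tab", delete=False) as fh:
    fh.write("P05129\tKPCG_HUMAN\t\t\t\t\tGO:0004672; GO:0004674\n")
    mock_path = fh.name
```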

Processed data for protein pre-training and downstream tasks should be placed in the processed_data directory. The data format is as follows:

| task | uniprot_id | seq | GO | label |
| --- | --- | --- | --- | --- |
| Stability | stab_train_0 | DQSVRKLV... | No goterm | 2 |
| Fluorescence | fluo_train_17878 | SKGEELFT... | No goterm | 0 |
| Remote Homology | remo_train_0 | PKKVLTGV... | No goterm | 0 |
| Secondary Structure | secstruc_train_0 | MNDKRLQF... | No goterm | 1 |
| Signal Peptide | sign_train_0 | MLGMIRNS... | No goterm | 0 |
| Fold Classes | fold_train_0 | MSPFTGSA... | No goterm | 2 |
| PDBBind | 11gs | PYTVVYFP... | GO:0005737;GO:0005829;... | 2 |
| Kinase | P05129 | MAGLGPGV... | GO:0004672;GO:0004674;... | 1 |
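A record in this layout can be read with the standard csv module. A minimal sketch, assuming the files are tab-separated (the "No goterm" value contains a space, so whitespace splitting would break); the sample row is abbreviated from the table above.

```python
import csv
import io

# One processed-data record; columns assumed to be (task, uniprot_id, seq, GO, label).
sample = "Stability\tstab_train_0\tDQSVRKLV\tNo goterm\t2\n"
reader = csv.DictReader(
    io.StringIO(sample),
    fieldnames=["task", "uniprot_id", "seq", "GO", "label"],
    delimiter="\t",
)
row = next(reader)
```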

## 2. Embeddings Generation

Use represent/model_interpreter_multi.py to represent protein tasks, and modify represent/config.yaml to configure the downstream task paths. For example:

```shell
python model_interpreter_multi.py --batch_size=32 --gpu=0 --ft=multi
```

The generated embeddings have the following format:

|   | pro_id | pro_seq | pro_emb |
| --- | --- | --- | --- |
| 0 | fluo_train_0 | SKGEELFT... | [-0.5087447166442871, -2.313387870788574, -0.1... |
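If the pro_emb column is stored as a Python-style list string, as the sample row suggests, it can be recovered with ast.literal_eval. The record below is a hypothetical example: the full embedding values are illustrative, since the row above is truncated.

```python
import ast

# Hypothetical record in the embedding-file layout; pro_emb is a list
# serialized as a string.
record = {
    "pro_id": "fluo_train_0",
    "pro_seq": "SKGEELFT",
    "pro_emb": "[-0.5087447166442871, -2.313387870788574, -0.1]",
}
embedding = ast.literal_eval(record["pro_emb"])
```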

Additionally, the pretrained and finetuned weights used in generating embeddings come from our previous work MASSA. The hyperparameters of the experiments are as follows:

|   | Pretrain | Stability | Fluorescence | Remote Homology | Secondary Structure | Pdbbind | Kinase | Skempi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| epoch | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 |
| batch size | 4 | 8 | 32 | 4 | 8 | 8 | 4 | 8 |
| lr (learning rate) | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 1e-4 |
| weight decay | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 1e-4 |
| gradient accumulation | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
| optimizer | RAdam | RAdam | RAdam | RAdam | RAdam | RAdam | RAdam | RAdam |
| Loss | CrossEntropy Loss | MSELoss | MSELoss | Equalized Focal Loss | CrossEntropy Loss | MSELoss | CrossEntropy Loss | MSELoss |

## 3. OTMTD Calculation

Run otmtd_cal.ipynb to calculate the transferability metrics from multi-modal multi-task pre-training to downstream tasks.

## Acknowledgement

The software may be used for teaching or not-for-profit research purposes only. Permission is required for any commercial use of the software.