Commit

updated with generated seqs
nityathakkar committed Sep 8, 2023
1 parent 0f172f7 commit e91b5bd
Showing 1 changed file with 62 additions and 5 deletions.
67 changes: 62 additions & 5 deletions README.md
@@ -14,6 +14,7 @@ We evaluate our sequence and MSA models – EvoDiff-Seq and EvoDiff-MSA, respect
- [Table of contents](#table-of-contents)
- [Installation](#installation)
- [Data](#data)
- [Generated sequences and MSAs](#generated-sequences-and-msas)
- [Loading pretrained models](#loading-pretrained-models)
- [Provided notebook with examples](#provided-notebook-with-examples)
- [Conditional sequence generation](#conditional-sequence-generation)
@@ -53,18 +54,74 @@ Our downstream analysis scripts make use of a variety of tools we do not include
Please follow the setup instructions outlined by the authors of those tools.

## Data
We obtain sequences from the [UniRef50 dataset](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4375400/), which contains approximately 45 million protein sequences. The Multiple Sequence Alignments (MSAs) are from the [OpenFold dataset](https://www.biorxiv.org/content/10.1101/2022.11.20.517210v2), which contains 401,381 MSAs for 140,000 unique Protein Data Bank (PDB) chains and 16,000,000 UniClust30 clusters. The intrinsically disordered regions (IDR) data was obtained from the [Reverse Homology GitHub](https://github.com/alexxijielu/reverse_homology/).

To access the sequences described in table S1 of the paper, use the following code:
For the scaffolding structural motifs task, we provide the PDB files used for conditionally generating sequences in the [examples/scaffolding-pdbs](https://github.com/microsoft/evodiff/tree/main/examples/scaffolding-pdbs) folder. We also provide the PDB files used for conditionally generating MSAs in the [examples/scaffolding-msas](https://github.com/microsoft/evodiff/tree/main/examples/scaffolding-msas) folder.

## Generated sequences and MSAs

To access the UniRef50 test sequences, use the following code:

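```
# Assumes the `sequence_models` package, which provides UniRefDataset
from sequence_models.datasets import UniRefDataset

test_data = UniRefDataset('data/uniref50/', 'rtest', structure=False) # To access the test sequences
# curl -O ...(TODO) # To access the generated sequences
```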

All generated sequences and MSAs are available on the [EvoDiff Zenodo](https://zenodo.org/record/8329165) as six CSV files with the following columns:
* `esmif_predictions_unconditional_structure_generations.csv`
    * `sequence`: predicted protein sequence from protein structure (using ESM-IF1 model)
    * `seq len`: length of generated sequence
    * `model`: 'foldingdiff' or 'rfdiffusion'
* `idr_conditional_generations.csv`
    * `sequence`: subsampled sequence that contains IDR
    * `seq len`: length of generated sequence
    * `gen_idrs`: the generated IDR sequence
    * `original_idrs`: the original IDR sequence
    * `start_idxs`: indices corresponding to start of motif
    * `end_idxs`: indices corresponding to end of motif
    * `model`: model type used for generations
* `msa_evolution_conditional_generations.csv`
    * `sequence`: generated query sequences
    * `seq len`: length of generated sequence
    * `model`: model type used for generations
* `msa_scaffold.csv` (generations made using the EvoDiff-MSA model)
    * `pdb`: PDB code used for task
    * `seqs`: generated motif
    * `start_idxs`: indices corresponding to start of motif
    * `end_idxs`: indices corresponding to end of motif
    * `seq len`: length of generated sequence
    * `scores`: average predicted local distance difference test (pLDDT) of sequence
    * `rmsd`: RMSD between predicted motif coordinates and desired motif coordinates
    * `model`: model type used for generations
* `seq_scaffold.csv` (generations made using the EvoDiff-Seq model)
    * `pdb`: PDB code used for task
    * `seqs`: generated motif
    * `start_idxs`: indices corresponding to start of motif
    * `end_idxs`: indices corresponding to end of motif
    * `seq len`: length of generated sequence
    * `scores`: average predicted local distance difference test (pLDDT) of sequence
    * `rmsd`: RMSD between predicted motif coordinates and desired motif coordinates
    * `model`: model type used for generations
* `unconditional_generations.csv`
    * `sequence`: generated sequence
    * `min hamming dist`: minimum Hamming distance between generated sequence and all training sequences
    * `seq len`: length of generated sequence
    * `model`: model type used for generations
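
The `min hamming dist` column in `unconditional_generations.csv` can be illustrated with a minimal sketch. This is not the code used to produce the published values: the function names are hypothetical, the sequences are toy strings, and it assumes that only training sequences of the same length as the generated sequence are compared, which the text above does not specify.

```python
from typing import List, Optional


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def min_hamming_distance(generated: str, training: List[str]) -> Optional[int]:
    """Smallest Hamming distance to any same-length training sequence.

    Assumption: training sequences of a different length are skipped.
    Returns None when no training sequence has a matching length.
    """
    distances = [
        hamming_distance(generated, t)
        for t in training
        if len(t) == len(generated)
    ]
    return min(distances) if distances else None


# Toy sequences for illustration only (not taken from the dataset).
training_seqs = ["MKTAYIAK", "MKTAYLAK", "MKV"]
print(min_hamming_distance("MKTGYLAK", training_seqs))  # prints 1
```

Here the generated sequence differs from `MKTAYLAK` at a single position, so the minimum distance is 1; the three-residue training sequence is ignored under the same-length assumption.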

Here is an example of downloading the `unconditional_generations.csv` file:

```
curl -L -o unconditional_generations.csv "https://zenodo.org/record/8329165/files/unconditional_generations.csv?download=1"
```

To extract all unconditionally generated sequences created using the EvoDiff-Seq `oadm-38M` model, run the following code:

```
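import pandas as pd

# The first CSV column holds the row index
df = pd.read_csv('unconditional_generations.csv', index_col=0)
# Keep only the rows generated by the EvoDiff-Seq OADM 38M model
subset = df.loc[df['model'] == 'evodiff_oadm_38M']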
```
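
As a further sketch, the same file can be summarized per model using only the Python standard library. The column names come from the list above. The rows in `SAMPLE_CSV`, the model name `hypothetical_other_model`, and the helper `summarize_by_model` are illustrative assumptions, not real entries or part of EvoDiff.

```python
import csv
import io
from collections import defaultdict

# Toy rows for illustration only; not real entries from the Zenodo file.
# The unnamed first column mirrors the index column implied by index_col=0.
SAMPLE_CSV = """,sequence,min hamming dist,seq len,model
0,MKTAYIAK,3,8,evodiff_oadm_38M
1,MKV,1,3,evodiff_oadm_38M
2,MSTLLQ,2,6,hypothetical_other_model
"""


def summarize_by_model(handle):
    """Return {model: (row count, mean 'seq len')} for an open CSV handle."""
    lengths = defaultdict(list)
    for row in csv.DictReader(handle):
        lengths[row["model"]].append(float(row["seq len"]))
    return {m: (len(v), sum(v) / len(v)) for m, v in lengths.items()}


summary = summarize_by_model(io.StringIO(SAMPLE_CSV))
for model, (count, mean_len) in summary.items():
    print(f"{model}: {count} sequences, mean length {mean_len:.1f}")
```

To run this on the downloaded file instead of the toy data, pass an open file handle, for example `open('unconditional_generations.csv', newline='')`, assuming the header names match those listed above.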

## Loading pretrained models
To load a model: