Commit 0f172f7

updated card

nityathakkar committed Sep 6, 2023
1 parent 1c5170d commit 0f172f7
Showing 2 changed files with 91 additions and 47 deletions.
136 changes: 90 additions & 46 deletions EvoDiff_modelcard.md
@@ -28,11 +28,10 @@ Generation of protein sequences and evolutionary alignments via discrete diffusion

<!-- Provide a longer summary of what this model is. -->

In this work, we introduce a general-purpose diffusion framework, EvoDiff, that combines evolutionary-scale data with the distinct conditioning capabilities of diffusion models for controllable protein generation in sequence space. EvoDiff generates high-fidelity, diverse, and structurally-plausible proteins that cover natural sequence and functional space. Critically, EvoDiff can generate proteins inaccessible to structure-based models, such as those with disordered regions, while maintaining the ability to design scaffolds for functional structural motifs, demonstrating the universality of our sequence-based formulation. We envision that EvoDiff will expand capabilities in protein engineering beyond the structure-function paradigm toward programmable, sequence-first design.


- **Developed by:** Sarah Alamdari, Nitya Thakkar, Rianne van den Berg, Alex X. Lu, Nicolo Fusi, Ava P. Amini, Kevin K. Yang
- **Shared by:** Microsoft Research New England
- **Model type:** Diffusion-based protein sequence generation
- **License:** MIT License

@@ -51,7 +50,33 @@

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

This model is intended for research use. It can be used directly to generate protein sequences and alignments. We provide checkpoints for all our models so users can run our unconditional and conditional generation scripts.

We provide a notebook with guidance in [examples/evodiff.ipynb](https://github.com/microsoft/evodiff/tree/main/examples/evodiff.ipynb). It includes installation instructions as well as examples of how to generate a small number of sequences and MSAs with our models. We recommend following this notebook if you would like to use our models to generate proteins.

To load a model:
```
from evodiff.pretrained import OADM_38M
model, collater, tokenizer, scheme = OADM_38M()
```
Available models are:
* ``` D3PM_BLOSUM_640M() ```
* ``` D3PM_BLOSUM_38M() ```
* ``` D3PM_UNIFORM_640M() ```
* ``` D3PM_UNIFORM_38M() ```
* ``` OADM_640M() ```
* ``` OADM_38M() ```
* ``` LR_AR_640M() ```
* ``` LR_AR_38M() ```
* ``` MSA_D3PM_BLOSUM() ```
* ``` MSA_D3PM_UNIFORM() ```
* ``` MSA_D3PM_OADM_RANDSUB() ```
* ``` MSA_D3PM_OADM_MAXSUB() ```

Note: if you want to use a `BLOSUM` model, you will first need to download [data/blosum62-special-MSA.mat](https://github.com/microsoft/evodiff/blob/main/data/blosum62-special-MSA.mat).
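
If you prefer to fetch that matrix programmatically, here is a small sketch; the raw-file URL is inferred from the repository path above and is an assumption:
```
import os
import urllib.request

# Assumed raw-file URL, derived from the repository path linked above.
url = ("https://raw.githubusercontent.com/microsoft/evodiff/main/"
       "data/blosum62-special-MSA.mat")
os.makedirs("data", exist_ok=True)  # place the file under data/, matching the path in the note above
urllib.request.urlretrieve(url, "data/blosum62-special-MSA.mat")
```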

Please view our [README.md](https://github.com/microsoft/evodiff/blob/main/README.md) for detailed instructions on how to generate sequences and multiple sequence alignments (MSAs) both unconditionally and conditionally.
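
The supported way to generate sequences is through the scripts and notebook above. Purely to illustrate what order-agnostic autoregressive sampling does, here is a minimal, hypothetical sketch; the model call signature and the tokenizer's `mask_id` and `untokenize` attributes are assumptions and may not match the actual evodiff API:
```
import torch
from evodiff.pretrained import OADM_38M

model, collater, tokenizer, scheme = OADM_38M()
model.eval()

seq_len = 100
# Start from a fully masked sequence (mask_id assumed to be the mask-token index).
x = torch.full((1, seq_len), tokenizer.mask_id, dtype=torch.long)

with torch.no_grad():
    for pos in torch.randperm(seq_len):          # visit positions in a random order
        logits = model(x)                        # (1, seq_len, vocab); assumed call signature
        aa_logits = logits[0, pos, :20]          # assumes the first 20 tokens are the canonical amino acids
        probs = torch.softmax(aa_logits, dim=-1)
        x[0, pos] = torch.multinomial(probs, 1).item()  # sample an amino acid for this position

print(tokenizer.untokenize(x[0]))                # assumed helper to decode back to a string
```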

<!-- ### Downstream Use [optional] -->

@@ -61,13 +86,13 @@

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

This model is intended for use on protein sequences. It is not meant for other biological sequences, such as DNA sequences, or natural language.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

This model will not perform well at generating anything other than proteins, such as other biological sequences (e.g., DNA) or natural language. It performs best on data from its training distribution: protein sequences and multiple sequence alignments (MSAs).

<!-- ### Recommendations -->

@@ -77,50 +102,48 @@

## How to Get Started with the Model

To set up a working environment, we recommend creating a clean conda environment with Python `3.8.5`. After installing Anaconda, run:
```
conda create --name evodiff python=3.8.5
conda activate evodiff
```

In that new environment, to download our code, run:
```
pip install evodiff
pip install git+https://github.com/microsoft/evodiff.git # bleeding edge, current repo main branch
```

You will also need to install PyTorch (we tested our models on `v2.0.1`), PyTorch Geometric, and PyTorch Scatter.

Our downstream analysis scripts make use of a variety of tools we do not include in our package. To run the scripts, please download the following packages first:
* [TM score](https://zhanggroup.org/TM-score/)
* [Omegafold](https://github.com/HeliXonProtein/OmegaFold)
* [ProteinMPNN](https://github.com/dauparas/ProteinMPNN)
* [ESM-IF1](https://github.com/facebookresearch/esm/tree/main/esm/inverse_folding); see this [Jupyter notebook](https://colab.research.google.com/github/facebookresearch/esm/blob/main/examples/inverse_folding/notebook.ipynb) for setup details.
* [PGP](https://github.com/hefeda/PGP)
* [DISOPRED3](https://github.com/psipred/disopred)
* [DR-BERT](https://github.com/maslov-group/DR-BERT)

Please follow the setup instructions outlined by the authors of those tools.

## Training Details

### Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

We obtain sequences from the [Uniref50 dataset](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4375400/), which contains approximately 45 million protein sequences. The Multiple Sequence Alignments (MSAs) are from the [OpenFold dataset](https://www.biorxiv.org/content/10.1101/2022.11.20.517210v2), which contains 401,381 MSAs for 140,000 unique Protein Data Bank (PDB) chains and 16,000,000 UniClust30 clusters.

To access the sequences described in table S1 of the paper, use the following code:

```
test_data = UniRefDataset('data/uniref50/', 'rtest', structure=False) # To access the test sequences
curl -O ...(TODO) # To access the generated sequences
```

For the scaffolding structural motifs task, we provide pdb files used for conditionally generating sequences in the [examples/scaffolding-pdbs](https://github.com/microsoft/evodiff/tree/main/examples/scaffolding-pdbs) folder. We also provide pdb files used for conditionally generating MSAs in the [examples/scaffolding-msas](https://github.com/microsoft/evodiff/tree/main/examples/scaffolding-msas) folder.

<!-- ### Training Procedure -->

@@ -151,31 +174,52 @@

<!-- This should link to a Data Card if possible. -->

To access the sequences described in table S1 of the paper, use the following code:

```
test_data = UniRefDataset('data/uniref50/', 'rtest', structure=False) # To access the test sequences
curl -O ...(TODO) # To access the generated sequences
```

For the scaffolding structural motifs task, we provide pdb files used for conditionally generating sequences in the [examples/scaffolding-pdbs](https://github.com/microsoft/evodiff/tree/main/examples/scaffolding-pdbs) folder. We also provide pdb files used for conditionally generating MSAs in the [examples/scaffolding-msas](https://github.com/microsoft/evodiff/tree/main/examples/scaffolding-msas) folder.

<!-- #### Factors -->

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

#### Metrics

To analyze the quality of the generations, we look at:
* amino acid KL divergence ([aa_reconstruction_parity_plot](https://github.com/microsoft/evodiff/blob/main/evodiff/plot.py))
* secondary structure KL divergence ([evodiff/analysis/calc_kl_ss.py](https://github.com/microsoft/evodiff/blob/main/analysis/calc_kl_ss.py))
* model perplexity for sequences ([evodiff/analysis/sequence_perp.py](https://github.com/microsoft/evodiff/blob/main/analysis/sequence_perp.py))
* model perplexity for MSAs ([evodiff/analysis/msa_perp.py](https://github.com/microsoft/evodiff/blob/main/analysis/msa_perp.py))
* Fréchet inception distance ([evodiff/analysis/calc_fid.py](https://github.com/microsoft/evodiff/blob/main/analysis/calc_fid.py))
* Hamming distance ([evodiff/analysis/calc_nearestseq_hamming.py](https://github.com/microsoft/evodiff/blob/main/analysis/calc_nearestseq_hamming.py))

We also compute the self-consistency perplexity to evaluate the foldability of generated sequences. To do so, we make use of various tools:
* [TM score](https://zhanggroup.org/TM-score/)
* [Omegafold](https://github.com/HeliXonProtein/OmegaFold)
* [ProteinMPNN](https://github.com/dauparas/ProteinMPNN)
* [ESM-IF1](https://github.com/facebookresearch/esm/tree/main/esm/inverse_folding); see this [Jupyter notebook](https://colab.research.google.com/github/facebookresearch/esm/blob/main/examples/inverse_folding/notebook.ipynb) for setup details.
* [PGP](https://github.com/hefeda/PGP)
* [DISOPRED3](https://github.com/psipred/disopred)
* [DR-BERT](https://github.com/maslov-group/DR-BERT)

Please follow the setup instructions outlined by the authors of those tools.

Our analysis scripts for iterating over these tools are in the [evodiff/analysis/downstream_bash_scripts](https://github.com/microsoft/evodiff/tree/main/analysis/downstream_bash_scripts) folder. After running the scripts in this folder, we analyze the results in [self_consistency_analysis.py](https://github.com/microsoft/evodiff/blob/main/analysis/self_consistency_analysis.py).
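
For intuition about the first metric, here is a small, self-contained sketch of an amino-acid KL-divergence computation; it is a toy illustration, not the repository's implementation, and the example sequences are placeholders:
```
from collections import Counter

import numpy as np
from scipy.special import rel_entr

AAS = "ACDEFGHIKLMNPQRSTVWY"

def aa_distribution(seqs, pseudocount=1e-6):
    """Normalized amino-acid frequency vector over the 20 canonical residues."""
    counts = Counter(aa for s in seqs for aa in s if aa in AAS)
    freqs = np.array([counts[aa] for aa in AAS], dtype=float) + pseudocount
    return freqs / freqs.sum()

generated = ["MKTAYIAKQR", "GGSLQELVNK"]   # placeholder generated sequences
reference = ["MKVLAAGIEE", "ASTPLKQRVN"]   # placeholder natural (test-set) sequences

p, q = aa_distribution(generated), aa_distribution(reference)
kl = rel_entr(p, q).sum()                  # KL(generated || reference)
print(f"amino-acid KL divergence: {kl:.4f}")
```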

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

<!-- ### Results -->

<!-- {{ results | default("[More Information Needed]", true)}} -->

#### Summary

We present EvoDiff, a diffusion modeling framework capable of generating high-fidelity, diverse, and novel proteins with the option of conditioning according to sequence constraints. Because it operates in the universal protein design space, EvoDiff can unconditionally sample diverse structurally-plausible proteins, generate intrinsically disordered regions, and scaffold structural motifs using only sequence information, challenging a paradigm in structure-based protein design.

<!-- ## Model Examination [optional] -->

@@ -189,7 +233,7 @@

<!-- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). -->

- **Hardware Type:** 38 × `32GB NVIDIA V100` GPUs
- **Hours used:** 4,128 (14 days per sequence model, 10 days per MSA model)
- **Cloud Provider:** Azure
- **Compute Region:** East US
@@ -213,7 +257,7 @@

<!-- {{ software | default("[More Information Needed]", true)}} -->

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

2 changes: 1 addition & 1 deletion README.md
@@ -53,7 +53,7 @@
Please follow the setup instructions outlined by the authors of those tools.

## Data
We obtain sequences from the [Uniref50 dataset](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4375400/), which contains approximately 45 million protein sequences. The Multiple Sequence Alignments (MSAs) are from the [OpenFold dataset](https://www.biorxiv.org/content/10.1101/2022.11.20.517210v2), which contains 401,381 MSAs for 140,000 unique Protein Data Bank (PDB) chains and 16,000,000 UniClust30 clusters.

To access the sequences described in table S1 of the paper, use the following code:

