Commit 0f172f7

updated card

nityathakkar committed Sep 6, 2023
1 parent 1c5170d commit 0f172f7
Showing 2 changed files with 91 additions and 47 deletions.
136 changes: 90 additions & 46 deletions EvoDiff_modelcard.md
@@ -28,11 +28,10 @@ Generation of protein sequences and evolutionary alignments via discrete diffusion

<!-- Provide a longer summary of what this model is. -->

In this work, we introduce a general-purpose diffusion framework, EvoDiff, that combines evolutionary-scale data with the distinct conditioning capabilities of diffusion models for controllable protein generation in sequence space. EvoDiff generates high-fidelity, diverse, and structurally-plausible proteins that cover natural sequence and functional space. Critically, EvoDiff can generate proteins inaccessible to structure-based models, such as those with disordered regions, while maintaining the ability to design scaffolds for functional structural motifs, demonstrating the universality of our sequence-based formulation. We envision that EvoDiff will expand capabilities in protein engineering beyond the structure-function paradigm toward programmable, sequence-first design.


- **Developed by:** Sarah Alamdari, Nitya Thakkar, Rianne van den Berg, Alex X. Lu, Nicolo Fusi, Ava P. Amini, Kevin K. Yang
- **Shared by:** Microsoft Research New England
- **Model type:** Diffusion-based protein sequence generation
- **License:** MIT License

@@ -51,7 +50,33 @@

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

This model is intended for research use. It can be used directly to generate protein sequences and alignments. We provide checkpoints for all our models so users can run our unconditional and conditional generation scripts.

We provide a notebook with guidance in [examples/evodiff.ipynb](https://github.com/microsoft/evodiff/tree/main/examples/evodiff.ipynb). It includes installation instructions as well as examples of how to generate a small number of sequences and MSAs with our models. We recommend following this notebook if you would like to use our models to generate proteins.

To load a model:
```
from evodiff.pretrained import OADM_38M
model, collater, tokenizer, scheme = OADM_38M()
```
Available models are:
* ``` D3PM_BLOSUM_640M() ```
* ``` D3PM_BLOSUM_38M() ```
* ``` D3PM_UNIFORM_640M() ```
* ``` D3PM_UNIFORM_38M() ```
* ``` OADM_640M() ```
* ``` OADM_38M() ```
* ``` LR_AR_640M() ```
* ``` LR_AR_38M() ```
* ``` MSA_D3PM_BLOSUM() ```
* ``` MSA_D3PM_UNIFORM() ```
* ``` MSA_D3PM_OADM_RANDSUB() ```
* ``` MSA_D3PM_OADM_MAXSUB() ```

Note: if you want to use a `BLOSUM` model, you will first need to download [data/blosum62-special-MSA.mat](https://github.com/microsoft/evodiff/blob/main/data/blosum62-special-MSA.mat).
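
If you prefer to fetch that matrix programmatically, here is a small sketch; the raw-file URL is inferred from the repository path above and is an assumption:
```
import os
import urllib.request

# Assumed raw-file URL, derived from the repository path linked above.
url = ("https://raw.githubusercontent.com/microsoft/evodiff/main/"
       "data/blosum62-special-MSA.mat")
os.makedirs("data", exist_ok=True)  # place the file under data/, matching the path in the note above
urllib.request.urlretrieve(url, "data/blosum62-special-MSA.mat")
```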

Please view our [README.md](https://github.com/microsoft/evodiff/blob/main/README.md) for detailed instructions on how to generate sequences and multiple sequence alignments (MSAs) both unconditionally and conditionally.
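
The supported way to generate sequences is through the scripts and notebook above. Purely to illustrate what order-agnostic autoregressive sampling does, here is a minimal, hypothetical sketch; the model call signature and the tokenizer's `mask_id` and `untokenize` attributes are assumptions and may not match the actual evodiff API:
```
import torch
from evodiff.pretrained import OADM_38M

model, collater, tokenizer, scheme = OADM_38M()
model.eval()

seq_len = 100
# Start from a fully masked sequence (mask_id assumed to be the mask-token index).
x = torch.full((1, seq_len), tokenizer.mask_id, dtype=torch.long)

with torch.no_grad():
    for pos in torch.randperm(seq_len):          # visit positions in a random order
        logits = model(x)                        # (1, seq_len, vocab); assumed call signature
        aa_logits = logits[0, pos, :20]          # assumes the first 20 tokens are the canonical amino acids
        probs = torch.softmax(aa_logits, dim=-1)
        x[0, pos] = torch.multinomial(probs, 1).item()  # sample an amino acid for this position

print(tokenizer.untokenize(x[0]))                # assumed helper to decode back to a string
```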

<!-- ### Downstream Use [optional] -->

@@ -61,13 +86,13 @@

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

This model is intended for use on protein sequences. It is not meant for other biological sequences, such as DNA sequences, or natural language.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

This model will not perform well at generating anything other than proteins, such as other biological sequences (e.g., DNA) or natural language. It performs best on data from its training distribution: protein sequences and multiple sequence alignments (MSAs).

<!-- ### Recommendations -->

@@ -77,50 +102,48 @@

## How to Get Started with the Model

To set up a working environment, we recommend creating a clean conda environment with Python `3.8.5`. After installing Anaconda, run:
```
conda create --name evodiff python=3.8.5
conda activate evodiff
```

In that new environment, to download our code, run:
```
pip install evodiff
pip install git+https://github.com/microsoft/evodiff.git # bleeding edge, current repo main branch
```

You will also need to install PyTorch (we tested our models on `v2.0.1`), PyTorch Geometric, and PyTorch Scatter.

Our downstream analysis scripts make use of a variety of tools we do not include in our package. To run the scripts, please download the following packages first:
* [TM score](https://zhanggroup.org/TM-score/)
* [Omegafold](https://github.com/HeliXonProtein/OmegaFold)
* [ProteinMPNN](https://github.com/dauparas/ProteinMPNN)
* [ESM-IF1](https://github.com/facebookresearch/esm/tree/main/esm/inverse_folding); see this [Jupyter notebook](https://colab.research.google.com/github/facebookresearch/esm/blob/main/examples/inverse_folding/notebook.ipynb) for setup details.
* [PGP](https://github.com/hefeda/PGP)
* [DISOPRED3](https://github.com/psipred/disopred)
* [DR-BERT](https://github.com/maslov-group/DR-BERT)

Please follow the setup instructions outlined by the authors of those tools.

## Training Details

### Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

We obtain sequences from the [Uniref50 dataset](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4375400/), which contains approximately 45 million protein sequences. The Multiple Sequence Alignments (MSAs) are from the [OpenFold dataset](https://www.biorxiv.org/content/10.1101/2022.11.20.517210v2), which contains 401,381 MSAs for 140,000 unique Protein Data Bank (PDB) chains and 16,000,000 UniClust30 clusters.

To access the sequences described in table S1 of the paper, use the following code:

```
test_data = UniRefDataset('data/uniref50/', 'rtest', structure=False) # To access the test sequences
curl -O ...(TODO) # To access the generated sequences
```

For the scaffolding structural motifs task, we provide pdb files used for conditionally generating sequences in the [examples/scaffolding-pdbs](https://github.com/microsoft/evodiff/tree/main/examples/scaffolding-pdbs) folder. We also provide pdb files used for conditionally generating MSAs in the [examples/scaffolding-msas](https://github.com/microsoft/evodiff/tree/main/examples/scaffolding-msas) folder.

<!-- ### Training Procedure -->

@@ -151,31 +174,52 @@

<!-- This should link to a Data Card if possible. -->

To access the sequences described in table S1 of the paper, use the following code:

```
test_data = UniRefDataset('data/uniref50/', 'rtest', structure=False) # To access the test sequences
curl -O ...(TODO) # To access the generated sequences
```

For the scaffolding structural motifs task, we provide pdb files used for conditionally generating sequences in the [examples/scaffolding-pdbs](https://github.com/microsoft/evodiff/tree/main/examples/scaffolding-pdbs) folder. We also provide pdb files used for conditionally generating MSAs in the [examples/scaffolding-msas](https://github.com/microsoft/evodiff/tree/main/examples/scaffolding-msas) folder.

<!-- #### Factors -->

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

#### Metrics

To analyze the quality of the generations, we look at:
* amino acid KL divergence ([aa_reconstruction_parity_plot](https://github.com/microsoft/evodiff/blob/main/evodiff/plot.py))
* secondary structure KL divergence ([evodiff/analysis/calc_kl_ss.py](https://github.com/microsoft/evodiff/blob/main/analysis/calc_kl_ss.py))
* model perplexity for sequences ([evodiff/analysis/sequence_perp.py](https://github.com/microsoft/evodiff/blob/main/analysis/sequence_perp.py))
* model perplexity for MSAs ([evodiff/analysis/msa_perp.py](https://github.com/microsoft/evodiff/blob/main/analysis/msa_perp.py))
* Fréchet inception distance ([evodiff/analysis/calc_fid.py](https://github.com/microsoft/evodiff/blob/main/analysis/calc_fid.py))
* Hamming distance ([evodiff/analysis/calc_nearestseq_hamming.py](https://github.com/microsoft/evodiff/blob/main/analysis/calc_nearestseq_hamming.py))

We also compute the self-consistency perplexity to evaluate the foldability of generated sequences. To do so, we make use of various tools:
* [TM score](https://zhanggroup.org/TM-score/)
* [Omegafold](https://github.com/HeliXonProtein/OmegaFold)
* [ProteinMPNN](https://github.com/dauparas/ProteinMPNN)
* [ESM-IF1](https://github.com/facebookresearch/esm/tree/main/esm/inverse_folding); see this [Jupyter notebook](https://colab.research.google.com/github/facebookresearch/esm/blob/main/examples/inverse_folding/notebook.ipynb) for setup details.
* [PGP](https://github.com/hefeda/PGP)
* [DISOPRED3](https://github.com/psipred/disopred)
* [DR-BERT](https://github.com/maslov-group/DR-BERT)

Please follow the setup instructions outlined by the authors of those tools.

Our analysis scripts for iterating over these tools are in the [evodiff/analysis/downstream_bash_scripts](https://github.com/microsoft/evodiff/tree/main/analysis/downstream_bash_scripts) folder. After running the scripts in this folder, we analyze the results in [self_consistency_analysis.py](https://github.com/microsoft/evodiff/blob/main/analysis/self_consistency_analysis.py).
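
For intuition about the first metric, here is a small, self-contained sketch of an amino-acid KL-divergence computation; it is a toy illustration, not the repository's implementation, and the example sequences are placeholders:
```
from collections import Counter

import numpy as np
from scipy.special import rel_entr

AAS = "ACDEFGHIKLMNPQRSTVWY"

def aa_distribution(seqs, pseudocount=1e-6):
    """Normalized amino-acid frequency vector over the 20 canonical residues."""
    counts = Counter(aa for s in seqs for aa in s if aa in AAS)
    freqs = np.array([counts[aa] for aa in AAS], dtype=float) + pseudocount
    return freqs / freqs.sum()

generated = ["MKTAYIAKQR", "GGSLQELVNK"]   # placeholder generated sequences
reference = ["MKVLAAGIEE", "ASTPLKQRVN"]   # placeholder natural (test-set) sequences

p, q = aa_distribution(generated), aa_distribution(reference)
kl = rel_entr(p, q).sum()                  # KL(generated || reference)
print(f"amino-acid KL divergence: {kl:.4f}")
```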

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

<!-- ### Results -->

<!-- {{ results | default("[More Information Needed]", true)}} -->

#### Summary

We present EvoDiff, a diffusion modeling framework capable of generating high-fidelity, diverse, and novel proteins with the option of conditioning according to sequence constraints. Because it operates in the universal protein design space, EvoDiff can unconditionally sample diverse structurally-plausible proteins, generate intrinsically disordered regions, and scaffold structural motifs using only sequence information, challenging a paradigm in structure-based protein design.

<!-- ## Model Examination [optional] -->

@@ -189,7 +233,7 @@

<!-- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). -->

- **Hardware Type:** 38 × `32GB NVIDIA V100` GPUs
- **Hours used:** 4,128 (14 days per sequence model, 10 days per MSA model)
- **Cloud Provider:** Azure
- **Compute Region:** East US
@@ -213,7 +257,7 @@

<!-- {{ software | default("[More Information Needed]", true)}} -->

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

2 changes: 1 addition & 1 deletion README.md
@@ -53,7 +53,7 @@
Please follow the setup instructions outlined by the authors of those tools.

## Data
We obtain sequences from the [Uniref50 dataset](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4375400/), which contains approximately 45 million protein sequences. The Multiple Sequence Alignments (MSAs) are from the [OpenFold dataset](https://www.biorxiv.org/content/10.1101/2022.11.20.517210v2), which contains 401,381 MSAs for 140,000 unique Protein Data Bank (PDB) chains and 16,000,000 UniClust30 clusters.

To access the sequences described in table S1 of the paper, use the following code:

