Update to README.md to make sure everything is good for V3.
ryanemenecker committed Nov 5, 2024
1 parent 8e1af00 commit 20de53d
Showing 1 changed file with 18 additions and 98 deletions.
116 changes: 18 additions & 98 deletions README.md
@@ -2,21 +2,22 @@

### Last updated November 2024

## Current version: metapredict V3
The current recommended and default version of metapredict is metapredict V3 (version 3.0). Small increments (3.0.x) may be made as bug fixes or feature enhancements.
## Current default version: V3
In November 2024, we changed the default version of metapredict from V2 to V3. Small increments (3.0.x) may be made as bug fixes or feature enhancements.

For context, V3 provides major improvements to V2. Metapredict V3 uses a **new network to predict disorder** that in our benchmarks is the most accurate version of metapredict to date. In addition, *metapredict V3 is backwards compatible with V2* and can be used as a drop-in replacement for V2. Although the Python API has been improved to massively simplify how you can use metapredict, we have **for the time being** updated it such that all previously created functions *should still work*. If they do not, please raise an issue and we will fix the problem ASAP!
For context, V3 provides major improvements to V2. Metapredict V3 uses a **new network to predict disorder** that in our benchmarks is the most accurate version to date. In addition, *V3 is backwards compatible with V2* and can be used as a drop-in replacement for V2. Although the Python API has been improved to massively simplify how you can use metapredict, we have **for the time being** updated it such that all previously created functions *should still work*. If they do not, please raise an issue and we will fix the problem ASAP!

## What are the major changes for V3?

1. **A new disorder prediction network**: Metapredict V3 uses a new (more accurate) network for disorder prediction. V1 and V2 are still available!
2. **A new pLDDT prediction network**: metapredict used to rely on an external package called [alphaPredict](https://github.com/ryanemenecker/alphaPredict) for pLDDT prediction. This same network is still available in metapredict when using ``meta.predict_pLDDT()`` by setting ``version='V1'``. However, the default V2 network is by all metrics better for pLDDT prediction, so we recommend using the default!
3. **Easier batch predictions**: V2 previously required you to use ``predict_disorder_batch()`` to take advantage of the 10-100x improvement in prediction speed on CPUs and GPUs. However, you can now use a single function - ``predict_disorder()`` - on individual sequences, lists of sequences, and dictionaries of sequences, and metapredict will automatically take care of the rest for you while automatically doing batch predictions if more than 1 sequence is present.
2. **A new pLDDT prediction network**: metapredict used to rely on an external package called [alphaPredict](https://github.com/ryanemenecker/alphaPredict) for pLDDT prediction. This same network is still available in metapredict when using ``meta.predict_pLDDT()`` by setting ``pLDDT_version=1``. However, the default V2 network is by all metrics better for pLDDT prediction, so we recommend using V2!
3. **Easier batch predictions**: V2 previously required you to use ``predict_disorder_batch()`` to take advantage of the 10-100x improvement in prediction speed on CPUs and GPUs. However, you can now use a single function - ``predict_disorder()`` - on individual sequences, lists of sequences, and dictionaries of sequences, and metapredict will automatically take care of the rest for you including running batch predictions if you input more than 1 sequence.
4. **Easier access to DisorderObject**. You can now return the ``DisorderObject`` by setting ``return_domains=True`` when using ``predict_disorder()``.
5. **Batch prediction for all**: Previously, batch predictions were only available for the V2 disorder prediction network of metapredict. Now, you can do batch predictions using all of the disorder prediction networks - v1 (previously called legacy), v2, and v3!
5. **Batch prediction for all**: Previously, batch predictions were only available for the V2 disorder prediction network of metapredict. Now, you can do batch predictions using all of the disorder prediction networks - V1 (previously called legacy), V2, and V3!
6. **Batch pLDDT predictions**: Batch predictions (and therefore the massive increases in prediction speed) are now available for pLDDT predictions using the `predict_pLDDT()` function.
7. **More robust device selection**: Newer versions of Torch (>2.0) support MacOS GPU utilization through the Metal Performance Shaders (MPS) framework, so you can now choose to use *mps* on MacOS. In addition, if you try to specify using a CUDA-enabled GPU and it does not work, metapredict will not automatically fall back to CPU.
8. We updated metapredict-uniprot to work with the new version of [getSequence](https://github.com/ryanemenecker/getSequence). This allows for getting different protein isoforms if specified.
7. **More device selection**: Newer versions of Torch (>2.0) support MacOS GPU utilization through the Metal Performance Shaders (MPS) framework, so you can now choose to use *mps* on MacOS.
8. **Clearer device selection**: Metapredict used to fall back to the CPU for predictions if it failed to use a GPU for whatever reason. This was well intentioned but made troubleshooting GPU usage very tricky. Now, if you request a specific device and it does not work, metapredict will not automatically fall back to CPU.
9. **Ability to get protein isoforms from UniProt**: We updated ``metapredict-uniprot`` to work with the new version of [getSequence](https://github.com/ryanemenecker/getSequence), which enables you to input a valid UniProt ID including designations for different protein isoforms. If you want to predict a sequence from the CLI using the name of the protein and the organism name (optional but recommended), please use ``metapredict-name`` as **``metapredict-uniprot`` will only work with valid UniProt accession numbers**.
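
To illustrate the "single function for every input type" idea from item 3, here is a minimal, self-contained sketch of how one entry point can accept a single sequence, a list of sequences, or a dictionary of sequences and always route the work through one batch call. This is not metapredict's implementation: `predict_any`, `toy_batch_scorer`, and the return shapes are hypothetical names and choices made up for this example, and the toy scorer stands in for a real network.

```python
from typing import Callable, Dict, List, Union

Scores = List[float]
Inputs = Union[str, List[str], Dict[str, str]]


def toy_batch_scorer(sequences: List[str]) -> List[Scores]:
    """Hypothetical stand-in for a network: 1.0 for G or S, 0.0 otherwise."""
    return [[1.0 if aa in "GS" else 0.0 for aa in seq] for seq in sequences]


def predict_any(
    inputs: Inputs,
    batch_scorer: Callable[[List[str]], List[Scores]] = toy_batch_scorer,
) -> Union[Scores, List[Scores], Dict[str, Scores]]:
    """Dispatch on the input type, but always score through one batch call."""
    if isinstance(inputs, str):
        # A single sequence is treated as a batch of one and unwrapped again.
        return batch_scorer([inputs])[0]
    if isinstance(inputs, dict):
        # Keep the names so the result can be keyed the same way as the input.
        names = list(inputs)
        scored = batch_scorer([inputs[name] for name in names])
        return dict(zip(names, scored))
    if isinstance(inputs, (list, tuple)):
        return batch_scorer(list(inputs))
    raise TypeError("inputs must be a str, a list of str, or a dict of str")


print(predict_any("GA"))
print(predict_any(["GA", "S"]))
print(predict_any({"protein_a": "AS"}))
```

The design choice shown here is that the caller never has to pick a "batch" function: the wrapper normalizes every input into a list, so the batched code path is the only code path.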


## Installation
@@ -88,127 +89,46 @@ Documentation for metapredict V3 automatically builds from the `/doc` directory
In brief, metapredict provides both command-line tools and a set of user-facing functions from the metapredict Python module. Both sets of tools are fully documented online.

## How can I use metapredict?
Metapredict can be used in four different ways:
Metapredict can be used in five different ways:

1. As a stand-alone command-line tool (installable via pip - the code in this repository).
2. As a Python library for integrating into your favorite bioinformatics pipeline (installable via pip - the code in this repository).
3. As a web-server for examining disorder predictions on individual sequences found at [https://metapredict.net/](https://metapredict.net/).
4. *NEW as of August 2022:* as a Google Colab notebook for batch-predicting disorder scores for larger numbers of sequences: [**LINK HERE**](https://colab.research.google.com/github/idptools/metapredict/blob/master/colab/metapredict_colab.ipynb). Performance-wise, batch mode can predict the entire yeast proteome in ~1.5 min.
5. *NEW as of May 2023:* as part of the [ALBATROSS paper](https://www.biorxiv.org/content/10.1101/2023.05.08.539824), we provide a colab notebook for predicting IDRs on a proteome-wide scale [**LINK HERE**](https://colab.research.google.com/github/holehouse-lab/ALBATROSS-colab/blob/main/idrome_constructor/idrome_constructor.ipynb).
4. *NEW as of August 2022:* as a Google Colab notebook for batch-predicting disorder scores for larger numbers of sequences: [**LINK HERE**](https://colab.research.google.com/github/idptools/metapredict/blob/master/colab/metapredict_colab.ipynb). Performance-wise, batch mode can predict the entire yeast proteome in ~1.5 min using the Colab Notebook and much faster if using a local GPU.
5. *NEW as of May 2023:* as part of the [ALBATROSS paper](https://www.nature.com/articles/s41592-023-02159-5), we provide a colab notebook for predicting IDRs on a proteome-wide scale [**LINK HERE**](https://colab.research.google.com/github/holehouse-lab/ALBATROSS-colab/blob/main/idrome_constructor/idrome_constructor.ipynb).

## How to cite

If you use metapredict for your work, please cite the metapredict paper:

Emenecker, R. J., Griffith, D. & Holehouse, A. S. Metapredict: a fast, accurate, and easy-to-use predictor of consensus disorder and structure. Biophys. J. 120, 4312–4319 (2021).

Note that in addition to the original paper, there's a V2 preprint; HOWEVER, we ask you only cite the original paper and describe the version being used (V1, V2, V2-FF, or V3).
Note that in addition to the [original paper](https://www.cell.com/biophysj/fulltext/S0006-3495(21)00725-6), there's a [V2 preprint](https://www.biorxiv.org/content/10.1101/2022.06.06.494887v2); HOWEVER, we ask you only cite the original paper and describe the version being used (V1, V2, V2-FF, or V3).

We are hoping to get a paper out for V3 in the near future (if we already have, then we just forgot to delete this sentence)...
We are hoping to get a paper out for V3 in the near future (we will update this section once the V3 paper is available)...

Emenecker, R. J., Griffith, D. & Holehouse, A. S. Metapredict V2: An update to metapredict, a fast, accurate, and easy-to-use predictor of consensus disorder and structure. bioRxiv 2022.06.06.494887 (2022). doi:10.1101/2022.06.06.494887

## Changes

For changes see the `changelog.md` file in this directory.
For changes, see the `changelog.md` file in this directory or view it on GitHub [here](https://github.com/idptools/metapredict/blob/master/changelog.md).

## Running tests
Note that to run tests you must compile the Cython code in place. We suggest doing this by running the following set of commands:

```bash
pip uninstall metapredict; rm -rf build dist *.egg-info; python -m build; pip install .

```
## Acknowledgements

PARROT, created by Dan Griffith, was used to generate the network used for metapredict. See [https://pypi.org/project/idptools-parrot/](https://pypi.org/project/idptools-parrot/) for some very cool machine learning stuff.
A modified version of PARROT, created by Dan Griffith, was used to generate the network used for metapredict V3. The original implementation of PARROT was used to generate the V1 and V2 networks. See [https://pypi.org/project/idptools-parrot/](https://pypi.org/project/idptools-parrot/) for some very cool machine learning stuff. You can also check out the [PARROT paper](https://elifesciences.org/articles/70576).

In addition to using Dan Griffith's tool for creating metapredict, the original code for `encode_sequence.py` was written by Dan.

We would like to thank the **DeepMind** team for developing AlphaFold2 and EBI/UniProt for making these data so readily available.

We would also like to thank the team at MobiDB for creating the database that was used to train metapredict V1. Check out their awesome stuff at [https://mobidb.bio.unipd.it](https://mobidb.bio.unipd.it)

## Running metapredict for CAID competition predictions

Metapredict can run predictions on a .fasta formatted file and write one 'CAID compliant' formatted file for each sequence in that file.

#### CAID formatted predictions from Python

To get CAID formatted predictions in Python, use the `predict_disorder_caid()` function. This function takes in the path to a .fasta formatted file of sequences and writes one 'CAID compliant' formatted file for each sequence in the fasta file. The generated files are in .caid format: the first line is the sequence header, and each following line for that sequence is tab separated and contains:
1. The amino acid number
2. The amino acid letter
3. The metapredict disorder score
4. The binarized metapredict score where 1=disordered and 0=not disordered.


To use this function, first import metapredict:

```python
import metapredict as meta
```

The function takes in three arguments:
1. `input_fasta` - the path to the .fasta file
2. `output_path` - the path of where to save each CAID formatted prediction file. This should be a directory. Each sequence in the .fasta file will generate a file in this directory where the name of the file will be the sequence header and the file extension will be .caid.
3. `version` - the version of metapredict to use.

The disorder cutoff values are handled automatically (0.42 for V1 and 0.5 for V2/V3).
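
As a rough illustration of the file layout described above, the sketch below builds one CAID-style record from a header, a sequence, and per-residue scores, using the cutoffs quoted in the text (0.42 for V1, 0.5 for V2/V3). It is not the code behind `predict_disorder_caid()`: the function name `format_caid_entry`, the `>` prefix on the header line, the three-decimal score formatting, and treating a score equal to the cutoff as disordered are all assumptions made for this example.

```python
from typing import Sequence

# Cutoffs quoted in the text: 0.42 for V1 and 0.5 for V2/V3.
CUTOFFS = {"v1": 0.42, "v2": 0.5, "v3": 0.5}


def format_caid_entry(
    header: str,
    sequence: str,
    scores: Sequence[float],
    version: str = "v3",
) -> str:
    """Return one record: a header line, then one tab-separated line per residue."""
    if len(sequence) != len(scores):
        raise ValueError("need exactly one score per residue")
    cutoff = CUTOFFS[version.lower()]
    # Assumption: the header line uses a FASTA-style '>' prefix.
    lines = [f">{header}"]
    for position, (residue, score) in enumerate(zip(sequence, scores), start=1):
        # Assumption: a score at or above the cutoff counts as disordered (1).
        binary = 1 if score >= cutoff else 0
        # Columns: residue number, residue letter, score, binarized score.
        lines.append(f"{position}\t{residue}\t{score:.3f}\t{binary}")
    return "\n".join(lines) + "\n"


example = format_caid_entry("my_protein", "MKD", [0.91, 0.47, 0.12], version="v1")
print(example, end="")
```

Note how the same score of 0.47 is binarized as disordered under the V1 cutoff but would be ordered under the V2/V3 cutoff, which is why the version has to be known when the file is written.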

**Examples**

**V1, AKA metapredict legacy**

```python
path_to_fasta='/Users/thisUser/Desktop/myCaidSeqs.fasta'
```

```python
meta.predict_disorder_caid(path_to_fasta, '/Users/thisUser/Desktop/CaidPredictions/metapredictV1', version='v1')
```

**V2**

```python
path_to_fasta='/Users/thisUser/Desktop/myCaidSeqs.fasta'
```

```python
# The output directory string must be closed before the version keyword.
meta.predict_disorder_caid(path_to_fasta, '/Users/thisUser/Desktop/CaidPredictions/metapredictV2', version='v2')
```

**V3 (the new default, so `version` does not need to be specified)**

```python
path_to_fasta='/Users/thisUser/Desktop/myCaidSeqs.fasta'
```

```python
# The output directory string must be closed before the version keyword.
meta.predict_disorder_caid(path_to_fasta, '/Users/thisUser/Desktop/CaidPredictions/metapredictV3', version='v3')
```

#### CAID formatted predictions from the command-line

To run metapredict for the CAID competition and get CAID-formatted files out, see the following examples:

**V1:**


```bash
metapredict-caid /Users/thisUser/Desktop/myCaidSeqs.fasta /Users/thisUser/Desktop/CaidPredictions/metapredictV1 v1
```

**V2:**

```bash
metapredict-caid /Users/thisUser/Desktop/myCaidSeqs.fasta /Users/thisUser/Desktop/CaidPredictions/metapredictV2 v2
```

**V3:**

```bash
metapredict-caid /Users/thisUser/Desktop/myCaidSeqs.fasta /Users/thisUser/Desktop/CaidPredictions/metapredictV3 v3
```


## Copyright
Copyright (c) 2020-2024, Holehouse Lab - Washington University School of Medicine


