updated pLDDT network and docs

idptools · Sep 7, 2021 · a1a4fdf
1 parent a11b9f7

Showing 7 changed files with 55 additions and 4 deletions.
diff --git a/README.md b/README.md
@@ -8,7 +8,7 @@
 
 ### In addition to predicting disorder, metapredict also can predict AlphaFold2 pLDDT confidence scores
 
-In addition, metapredict offers predicted pLDDT confidence scores from AlphaFold2. These predicted scores use a bidirectional recurrent neural network (BRNN) trained on the per residue pLDDT (predicted IDDT-Ca) confidence scores generated by AlphaFold2 (AF2). The confidence scores from 9 proteomes (151,970 total proteins) were used to train the BRNN used to generate these scores. The confidence scores from the proteomes of *Rattus norvegicus*, *Danio rerio*, *Dictyostelium discoideum*, *Drosophila melanogaster*, *Mus musculus*, *Saccharomyces cerevisiae*, *Arabidopsis thaliana*, *Homo sapiens*, and *Escherichia coli* were used to generate the BRNN. These pLDDT scores measure the local confidence that AlphaFold2 has in its predicted structure. The scores go from 0-100 where 0 represents low confidence and 100 represents high confidence. For more information, please see: *Highly accurate protein structure prediction with AlphaFold* https://doi.org/10.1038/s41586-021-03819-2. In describing these scores, the team states that regions with pLDDT scores of less than 50 should not be interpreted except as *possible* disordered regions.
+In addition, metapredict offers predictions of AlphaFold2 pLDDT confidence scores. These predictions come from a bidirectional recurrent neural network (BRNN) trained on the per-residue pLDDT (predicted lDDT-Cα) confidence scores generated by AlphaFold2 (AF2). The pLDDT scores from the proteomes of *Danio rerio*, *Candida albicans*, *Mus musculus*, *Escherichia coli*, *Drosophila melanogaster*, *Methanocaldococcus jannaschii*, *Plasmodium falciparum*, *Mycobacterium tuberculosis*, *Caenorhabditis elegans*, *Dictyostelium discoideum*, *Trypanosoma cruzi*, *Saccharomyces cerevisiae*, *Schizosaccharomyces pombe*, *Rattus norvegicus*, *Homo sapiens*, *Arabidopsis thaliana*, *Zea mays*, *Leishmania infantum*, *Staphylococcus aureus*, *Glycine max*, and *Oryza sativa* were used to train the BRNN. pLDDT scores measure the local confidence that AlphaFold2 has in its predicted structure. The scores range from 0 to 100, where 0 represents low confidence and 100 represents high confidence. For more information, please see: *Highly accurate protein structure prediction with AlphaFold* https://doi.org/10.1038/s41586-021-03819-2. In describing these scores, the AlphaFold2 team states that regions with pLDDT scores below 50 should not be interpreted except as *possible* disordered regions.
 
 
 ### What might the predicted pLDDT scores from AlphaFold2 be used for?
@@ -651,6 +651,13 @@ Example data that can be used with metapredict can be found in the metapredict/d
 
 This section is a log of recent changes with metapredict. My hope is that as I change things, this section can help you figure out why a change was made and if it will break any of your current workflows. The first major changes were made for the 0.56 release, so tracking will start there. Reasons are not provided for bug fixes because the reason can be assumed to be fixing the bug...
 
+
+#### V1.51
+
+Changes:
+Updated to require V1.0 of alphaPredict for pLDDT scores. This reduces the error in predicted pLDDT scores from over 9% per residue to about 8% per residue. The documentation was updated to reflect this change.
+
+
 #### V1.5
 
 Changes:

diff --git a/docs/changes.rst b/docs/changes.rst
@@ -6,6 +6,13 @@ About
 
 This section is a log of recent changes with metapredict. My hope is that as I change things, this section can help you figure out why a change was made and if it will break any of your current work flows. The first major changes were made for the 0.56 release, so tracking will start there.
 
+V1.51
+-----
+Changes:
+Updated to require V1.0 of alphaPredict for pLDDT scores. This reduces the error in predicted pLDDT scores from over 9% per residue to about 8% per residue. The documentation was updated to reflect this change.
+
+
+
 V1.5
 -----
 Changes:

diff --git a/docs/getting_started.rst b/docs/getting_started.rst
@@ -20,7 +20,7 @@ How does metapredict work?
 
 **metapredict** is a deep-learning-based predictor trained on consensus disorder data from 8 different predictors, as pre-computed and provided by `MobiDB <https://mobidb.bio.unipd.it/>`_. Functionally, this means each residue is assigned a score between 0 and 1 which reflects the confidence we have that the residue is disordered (or not). If the score was 0.5, this means half of the predictors predict that residue to be disordered. In this way, **metapredict** can help you quickly determine the likelihood that residues are disordered by giving you an approximation of what other predictors would predict (things got pretty 'meta' there, hence the name **metapredict**).
 
-In addition, metapredict offers predicted confidence scores from AlphaFold2. These predicted scores use a bidirectional recurrent neural network (BRNN) trained on the per residue pLDDT (predicted IDDT-Ca) confidence scores generated by AlphaFold2 (AF2). The confidence scores from 9 proteomes (151,970 total proteins) were used to train the BRNN used to generate these scores. The confidence scores from the proteomes of *Rattus norvegicus*, *Danio rerio*, *Dictyostelium discoideum*, *Drosophila melanogaster*, *Mus musculus*, *Saccharomyces cerevisiae*, *Arabidopsis thaliana*, *Homo sapiens*, and *Escherichia coli* were used to generate the BRNN. These confidence scores measure the local confidence that AlphaFold2 has in its predicted structure. The scores go from 0-100 where 0 represents low confidence and 100 represents high confidence. For more information, please see: *Highly accurate protein structure prediction with AlphaFold* https://doi.org/10.1038/s41586-021-03819-2. In describing these scores, the team states that regions with pLDDT scores of less than 50 should not be interpreted except as *possible* disordered regions.
+In addition, metapredict offers predictions of AlphaFold2 confidence scores. These predictions come from a bidirectional recurrent neural network (BRNN) trained on the per-residue pLDDT (predicted lDDT-Cα) confidence scores generated by AlphaFold2 (AF2). The pLDDT scores from the proteomes of *Danio rerio*, *Candida albicans*, *Mus musculus*, *Escherichia coli*, *Drosophila melanogaster*, *Methanocaldococcus jannaschii*, *Plasmodium falciparum*, *Mycobacterium tuberculosis*, *Caenorhabditis elegans*, *Dictyostelium discoideum*, *Trypanosoma cruzi*, *Saccharomyces cerevisiae*, *Schizosaccharomyces pombe*, *Rattus norvegicus*, *Homo sapiens*, *Arabidopsis thaliana*, *Zea mays*, *Leishmania infantum*, *Staphylococcus aureus*, *Glycine max*, and *Oryza sativa* were used to train the BRNN. These confidence scores measure the local confidence that AlphaFold2 has in its predicted structure. The scores range from 0 to 100, where 0 represents low confidence and 100 represents high confidence. For more information, please see: *Highly accurate protein structure prediction with AlphaFold* https://doi.org/10.1038/s41586-021-03819-2. In describing these scores, the AlphaFold2 team states that regions with pLDDT scores below 50 should not be interpreted except as *possible* disordered regions.
 
 
 What might the predicted confidence scores from AlphaFold2 be used for?

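The documentation above notes that regions with pLDDT scores below 50 should be read only as *possible* disordered regions. As a minimal sketch of how that threshold might be applied to a list of per-residue scores, the function below groups consecutive low-scoring residues into half-open index ranges. The function name `low_confidence_regions`, its parameters, and the example scores are all hypothetical and made up for illustration; none of them are part of metapredict or alphaPredict.

```python
from typing import List, Sequence, Tuple


def low_confidence_regions(
    scores: Sequence[float],
    threshold: float = 50.0,
    min_length: int = 1,
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges where scores < threshold.

    Hypothetical helper: it only post-processes a list of numbers and makes
    no calls into metapredict or alphaPredict.
    """
    regions = []
    start = None
    for i, score in enumerate(scores):
        if score < threshold:
            # Open a new region if we are not already inside one.
            if start is None:
                start = i
        elif start is not None:
            # The region ended at the previous residue.
            if i - start >= min_length:
                regions.append((start, i))
            start = None
    # Close a region that runs to the end of the sequence.
    if start is not None and len(scores) - start >= min_length:
        regions.append((start, len(scores)))
    return regions


# Made-up scores for illustration only.
example_scores = [92.1, 88.0, 47.5, 41.2, 39.9, 71.3, 49.0]
print(low_confidence_regions(example_scores))  # [(2, 5), (6, 7)]
```

Raising `min_length` would drop short stretches such as the single trailing residue in the example, which may be preferable when only longer candidate regions are of interest.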
diff --git a/docs/requirements.txt b/docs/requirements.txt
@@ -9,8 +9,8 @@ matplotlib
 protfasta
 scipy
 urllib3
-alphaPredict
 #
 ###### Requirements with Version Specifiers ######
 #   See https://www.python.org/dev/peps/pep-0440/#version-specifiers
 #
+alphaPredict == 1.0
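The pin above means an environment with a different alphaPredict release no longer satisfies the requirements. The following is a rough sketch of how an installed version could be checked at runtime using the standard-library `importlib.metadata` module (Python 3.8+). The helper names `release_tuple`, `matches_pin`, and `installed_version` are hypothetical, and the comparison only handles plain numeric versions such as `1.0` or `1.0.0`; it is not a full PEP 440 implementation.

```python
from importlib import metadata
from typing import Optional, Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Parse a plain numeric version like '1.0.0' into (1,).

    Trailing zeros are dropped so that '1.0' and '1.0.0' compare equal,
    mirroring the zero-padding rule PEP 440 uses for '=='.
    Raises ValueError for anything that is not dot-separated integers.
    """
    parts = [int(p) for p in version.split(".")]
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def matches_pin(installed: str, required: str = "1.0") -> bool:
    """Return True if the installed version equals the pinned version."""
    return release_tuple(installed) == release_tuple(required)


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


found = installed_version("alphaPredict")
if found is None:
    print("alphaPredict is not installed")
else:
    print("alphaPredict", found, "matches pin:", matches_pin(found))
```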
diff --git a/metapredict/backend/meta_predict_disorder.py b/metapredict/backend/meta_predict_disorder.py
@@ -24,8 +24,14 @@
 PATH = os.path.dirname(os.path.realpath(__file__))
 
 # Setting predictor equal to location of weighted values.
+
+# original network
 predictor = "{}/networks/meta_predict_disorder_100e_v1.pt".format(PATH)
 
+# The V2 network shows a slight increase in accuracy but is still undergoing testing.
+# So far, a 0.5% increase in accuracy has been seen consistently. V1 is the published
+# network, though, so it stays the default for the time being.
+# predictor = "{}/networks/metapredict_network_v2_200epochs_nl1_hs20.pt".format(PATH)
 
 ##################################################################################################
 # hyperparameters used when metapredict was trained. Manually setting them here for clarity.
@@ -34,6 +40,36 @@
 #
 
 
+'''
+meta_predict_disorder_100e_v1 parameters
+# original published network!
+
+
+device = 'cpu'
+hidden_size = 5
+num_layers = 1
+dtype = 'residues'
+num_classes = 1
+encoding_scheme = 'onehot'
+input_size = 20
+problem_type = 'regression'
+
+
+# metapredict_network_v2_200epochs_nl1_hs20 parameters
+# To use the V2 network, move this code out of the
+# commented-out section and delete the similar code below.
+
+device = 'cpu'
+hidden_size = 20
+num_layers = 1
+dtype = 'residues'
+num_classes = 1
+encoding_scheme = 'onehot'
+input_size = 20
+problem_type = 'regression'
+'''
+
+
 device = 'cpu'
 hidden_size = 5
 num_layers = 1
@@ -43,6 +79,7 @@
 input_size = 20
 problem_type = 'regression'
 
+
 # set location of saved_weights for load_state_dict
 saved_weights = predictor
 

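The change above keeps both hyperparameter sets in a commented-out string and switches networks by editing the file by hand. One possible alternative, sketched here purely as an illustration, is to keep each network's settings in a dictionary keyed by a short name. The names `SHARED`, `NETWORK_CONFIGS`, and `get_network_config` are hypothetical and do not exist in metapredict; the file names and hyperparameter values are copied from the diff.

```python
import os

# Hyperparameters common to both networks, as listed in the diff.
SHARED = {
    "device": "cpu",
    "num_layers": 1,
    "dtype": "residues",
    "num_classes": 1,
    "encoding_scheme": "onehot",
    "input_size": 20,
    "problem_type": "regression",
}

# Per-network settings: only the weights file and hidden size differ.
NETWORK_CONFIGS = {
    "v1": {"weights": "meta_predict_disorder_100e_v1.pt", "hidden_size": 5},
    "v2": {
        "weights": "metapredict_network_v2_200epochs_nl1_hs20.pt",
        "hidden_size": 20,
    },
}


def get_network_config(name: str, base_path: str) -> dict:
    """Return the full hyperparameter dict for a named network.

    Hypothetical helper: 'base_path' plays the role of PATH in the module,
    and the weights path is resolved under its 'networks' directory.
    """
    if name not in NETWORK_CONFIGS:
        raise ValueError(
            "unknown network {!r}; expected one of {}".format(
                name, sorted(NETWORK_CONFIGS)
            )
        )
    config = dict(SHARED)
    config.update(NETWORK_CONFIGS[name])
    config["weights"] = os.path.join(base_path, "networks", config["weights"])
    return config


# The published V1 network stays the default in this sketch.
config = get_network_config("v1", os.getcwd())
print(config["hidden_size"], config["weights"])
```

Keeping the weights file and `hidden_size` in one entry means they cannot drift apart, which is the main risk when the path and the hyperparameters are edited in separate places.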
diff --git a/metapredict/backend/networks/metapredict_network_v2_200epochs_nl1_hs20.pt b/metapredict/backend/networks/metapredict_network_v2_200epochs_nl1_hs20.pt
diff --git a/setup.py b/setup.py
@@ -64,7 +64,7 @@
             'protfasta',
             'scipy',
             'urllib3',
-            'alphaPredict'],              # Required packages, pulls from pip if needed; do not use for Conda deployment
+            'alphaPredict==1.0'],              # Required packages, pulls from pip if needed; do not use for Conda deployment
     # platforms=['Linux',
     #            'Mac OS-X',
     #            'Unix',