Merge pull request #7 from greener-group/chainsaw

Domain splitting
greener-group · Aug 23, 2024 · 95ad8fd
2 parents d8f3eec + 7e16898

Showing 36 changed files with 5,809 additions and 85 deletions.
diff --git a/.github/workflows/CI.yml b/.github/workflows/CI.yml
@@ -31,6 +31,8 @@ jobs:
       run: conda install pytorch==1.11 faiss-cpu -c pytorch
     - name: Install PyTorch Scatter and PyTorch Geometric
       run: conda install pytorch-scatter pyg -c pyg
+    - name: Install STRIDE
+      run: conda install kimlab::stride
     - name: Test install
       run: pip install -e .
     - name: Test help
@@ -41,14 +43,22 @@ jobs:
       run: |
         wget https://files.rcsb.org/view/1CRN.pdb
         wget https://files.rcsb.org/view/1SSU.cif
+        wget https://alphafold.ebi.ac.uk/files/AF-P31434-F1-model_v4.pdb
     - name: Test search
       run: time python bin/progres search -q 1CRN.pdb -t scope95
+    - name: Test domain split
+      run: time python bin/progres search -q AF-P31434-F1-model_v4.pdb -t cath40 -c
     - name: Test score
       run: time python bin/progres score 1CRN.pdb 1SSU.cif > score.txt
     - name: Check score
       run: |
         sc=$(cat score.txt)
         if [ ${sc:0:7} == "0.72652" ]; then echo "Correct score"; else echo "Wrong score, score is $sc"; exit 1; fi
+    - name: Test database embedding
+      run: |
+        cd data
+        time python ../bin/progres embed -l filepaths.txt -o out.pt
+        time python ../bin/progres search -q query.pdb -t out.pt
   test_pypi:
     runs-on: ubuntu-latest
     strategy:

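Review note on the `Check score` step above: the workflow compares the first seven characters of the printed score against the string `0.72652`. A tolerance-based comparison is one possible alternative, sketched below. It assumes that `score.txt` holds a single floating-point number. The helper name `score_matches` and the tolerance of `1e-4` are hypothetical and are not part of Progres or its CI.

```python
import math


def score_matches(text: str, expected: float = 0.72652, tol: float = 1e-4) -> bool:
    """Return True if the number in `text` is within `tol` of `expected`.

    `text` is assumed to be the full contents of score.txt (an assumption,
    not something the workflow guarantees).
    """
    try:
        value = float(text.strip())
    except ValueError:
        # Anything that is not a single number counts as a failed check.
        return False
    # Absolute tolerance only, so the check does not scale with the score.
    return math.isclose(value, expected, abs_tol=tol, rel_tol=0.0)
```

Unlike the string prefix check, this accepts values slightly below the expected score as well as slightly above it, so the tolerance should be chosen with that in mind.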
diff --git a/.gitignore b/.gitignore
@@ -4,5 +4,6 @@ build
 dist
 progres/trained_models/v*
 progres/databases/v*
+progres/chainsaw/model_v3/weights*
 .vs
 .vscode
diff --git a/LICENSE b/LICENSE
@@ -71,3 +71,27 @@ SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
 CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+Code is included from Chainsaw (https://github.com/JudeWells/chainsaw):
+
+MIT License
+
+Copyright (c) 2024 Jude Wells
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/NEWS.md b/NEWS.md
@@ -0,0 +1,53 @@
+# Progres release notes
+
+## v0.2.5 - Aug 2024
+
+- Structures can now be split into domains with Chainsaw before searching, with each domain searched separately. This makes Progres suitable for use with multi-domain structures.
+- The whole PDB, split into domains with Chainsaw, is made available to search against.
+- Hetero atoms are now ignored during file reading.
+- Example files are added for searching and database embedding.
+
+## v0.2.4 - Jul 2024
+
+- The `score` mode is added to calculate the Progres score between two structures.
+
+## v0.2.3 - May 2024
+
+- Incomplete downloads are handled during setup.
+
+## v0.2.2 - Apr 2024
+
+- The environment variable `PROGRES_DATA_DIR` can be used to change where the downloaded data is stored.
+- A Docker file is added.
+- Searching on GPU is made more memory efficient.
+- Bugs when running on Windows are fixed.
+
+## v0.2.1 - Apr 2024
+
+- The AlphaFold database TED domains are made available to search against, with FAISS used for fast searching.
+- Pre-embedded databases are stored as Float16 to reduce disk usage.
+- Datasets and scripts for benchmarking (including for other methods), FAISS index generation and training are made available.
+
+## v0.2.0 - Mar 2023
+
+- Change model architecture to use 6 EGNN layers and tau torsion angles, making it faster and SE(3)-invariant rather than E(3)-invariant.
+- The AlphaFold models for 21 model organisms are made available to search against.
+- The trained model and pre-embedded databases are downloaded from Zenodo rather than GitHub when first running the software.
+
+## v0.1.3 - Nov 2022
+
+- Fix data download.
+
+## v0.1.2 - Nov 2022
+
+- Add ECOD database.
+- Use versioned model directory.
+
+## v0.1.1 - Nov 2022
+
+- Add einops dependency.
+- Add code for ECOD database.
+
+## v0.1.0 - Nov 2022
+
+Initial release of the `progres` Python package for fast protein structure searching using structure graph embeddings.
diff --git a/README.md b/README.md
@@ -9,20 +9,23 @@ This repository contains the method from the pre-print:
 It provides the `progres` Python package that lets you search structures against pre-embedded structural databases, score pairs of structures and pre-embed datasets for searching against.
 Searching typically takes 1-2 s and is much faster for multiple queries.
 For the AlphaFold database, initial data loading takes around a minute but subsequent searching takes a tenth of a second per query.
-Currently [SCOPe](https://scop.berkeley.edu), [CATH](http://cathdb.info), [ECOD](http://prodata.swmed.edu/ecod), the [AlphaFold structures for 21 model organisms](https://doi.org/10.1093/nar/gkab1061) and the [AlphaFold database TED domains](https://www.biorxiv.org/content/10.1101/2024.03.18.585509) are provided for searching against.
+
+Currently [SCOPe](https://scop.berkeley.edu), [CATH](http://cathdb.info), [ECOD](http://prodata.swmed.edu/ecod), the whole [PDB](https://www.rcsb.org), the [AlphaFold structures for 21 model organisms](https://doi.org/10.1093/nar/gkab1061) and the [AlphaFold database TED domains](https://www.biorxiv.org/content/10.1101/2024.03.18.585509) are provided for searching against.
+Searching is done by domain but [Chainsaw](https://github.com/JudeWells/chainsaw) can be used to automatically split query structures into domains.
 
 ## Installation
 
 1. Python 3.8 or later is required. The software is OS-independent.
-2. Install [PyTorch](https://pytorch.org) 1.11 or later, [PyTorch Scatter](https://github.com/rusty1s/pytorch_scatter), [PyTorch Geometric](https://github.com/pyg-team/pytorch_geometric) and [FAISS](https://github.com/facebookresearch/faiss) as appropriate for your system. A GPU is not required but may provide speedup in certain situations. Example commands:
+2. Install [PyTorch](https://pytorch.org) 1.11 or later, [PyTorch Scatter](https://github.com/rusty1s/pytorch_scatter), [PyTorch Geometric](https://github.com/pyg-team/pytorch_geometric), [FAISS](https://github.com/facebookresearch/faiss) and [STRIDE](https://webclu.bio.wzw.tum.de/stride) as appropriate for your system. A GPU is not required but may provide speedup in certain situations. Example commands:
 ```bash
 conda create -n prog python=3.9
 conda activate prog
 conda install pytorch=1.11 faiss-cpu -c pytorch
 conda install pytorch-scatter pyg -c pyg
+conda install kimlab::stride
 ```
-3. Run `pip install progres`, which will also install [Biopython](https://biopython.org), [mmtf-python](https://github.com/rcsb/mmtf-python) and [einops](https://github.com/arogozhnikov/einops) if they are not already present.
-4. The first time you search with the software the trained model and pre-embedded databases (~220 MB) will be downloaded to the package directory from [Zenodo](https://zenodo.org/record/7782088), which requires an internet connection. This can take a few minutes. You can set the environmental variable `PROGRES_DATA_DIR` to change where this data is stored, for example if you cannot write to the package directory. Remember to keep it set the next time you run Progres.
+3. Run `pip install progres`, which will also install [Biopython](https://biopython.org), [mmtf-python](https://github.com/rcsb/mmtf-python), [einops](https://github.com/arogozhnikov/einops) and [pydantic](https://github.com/pydantic/pydantic) if they are not already present.
+4. The first time you search with the software the trained model and pre-embedded databases (~660 MB) will be downloaded to the package directory from [Zenodo](https://zenodo.org/record/7782088), which requires an internet connection. This can take a few minutes. You can set the environmental variable `PROGRES_DATA_DIR` to change where this data is stored, for example if you cannot write to the package directory. Remember to keep it set the next time you run Progres.
 5. The first time you search against the AlphaFold database TED domains the pre-embedded database (~33 GB) will be downloaded similarly. This can take a while. Make sure you have enough disk space!
 
 Alternatively, a Docker file is available in the `docker` directory.
@@ -34,20 +37,22 @@ On Windows you can call the `bin/progres` script with python if you can't access
 
 Run `progres -h` to see the help text and `progres {mode} -h` to see the help text for each mode.
 The modes are described below but there are other options outlined in the help text.
-For example the `-d` flag sets the device to run on; this is `cpu` by default since this is often fastest for searching, but `cuda` may be faster when searching many queries or embedding a dataset.
+For example the `-d` flag sets the device to run on; this is `cpu` by default since this is often fastest for searching, but `cuda` will likely be faster when splitting domains with Chainsaw, searching many queries or embedding a dataset.
+Try both if performance is important.
 
 ## Search a structure against a database
 
-To search a PDB file `query.pdb` against domains in the SCOPe database and print output:
+To search a PDB file `query.pdb` (which can be found in the `data` directory) against domains in the SCOPe database and print output:
 ```bash
 progres search -q query.pdb -t scope95
 ```
 ```
 # QUERY_NUM: 1
 # QUERY: query.pdb
-# QUERY_SIZE: 150 residues
+# DOMAIN_NUM: 1
+# DOMAIN_SIZE: 150 residues (1-150)
 # DATABASE: scope95
-# PARAMETERS: minsimilarity 0.8, maxhits 100, progres v0.2.4
+# PARAMETERS: minsimilarity 0.8, maxhits 100, chainsaw no, faiss no, progres v0.2.5
 # HIT_N  DOMAIN   HIT_NRES  SIMILARITY  NOTES
       1  d1a6ja_       150      1.0000  d.112.1.1 - Nitrogen regulatory bacterial protein IIa-ntr {Escherichia coli [TaxId: 562]}
       2  d2a0ja_       146      0.9988  d.112.1.0 - automated matches {Neisseria meningitidis [TaxId: 122586]}
@@ -56,20 +61,22 @@ progres search -q query.pdb -t scope95
       5  d3oxpa1       147      0.9968  d.112.1.0 - automated matches {Yersinia pestis [TaxId: 214092]}
 ...
 ```
-- `-q` is the path to the query structure file. Alternatively, `-l` is a text file with one query file path per line and each result will be printed in turn. This is considerably faster for multiple queries since setup only occurs once and multiple workers can be used.
+- `-q` is the path to the query structure file. Alternatively, `-l` is a text file with one query file path per line and each result will be printed in turn. This is considerably faster for multiple queries since setup only occurs once and multiple workers can be used. Only the first chain in each file is considered.
 - `-t` is the pre-embedded database to search against. Currently this must be either one of the databases listed below or the file path to a pre-embedded dataset generated with `progres embed`.
 - `-f` determines the file format of the query structure (`guess`, `pdb`, `mmcif`, `mmtf` or `coords`). By default this is guessed from the file extension, with `pdb` chosen if a guess can't be made. `coords` refers to a text file with the coordinates of a Cα atom separated by white space on each line.
-- `-s` is the minimum similarity threshold above which to return hits, default 0.8. As discussed in the paper, 0.8 indicates the same fold.
+- `-s` is the Progres score threshold (between 0 and 1) above which hits are returned, default 0.8. As discussed in the paper, 0.8 indicates the same fold.
 - `-m` is the maximum number of hits to return, default 100.
+- `-c` splits the query structure(s) into domains with Chainsaw and searches with each domain separately. If Chainsaw finds no domains, no results are returned. Only the first chain in each file is considered. Running Chainsaw may take a few seconds.
 
-Query structures should be a single protein domain, though it can be discontinuous (chain IDs are ignored).
-Tools such as [Merizo](https://github.com/psipred/Merizo), [SWORD2](https://www.dsimb.inserm.fr/SWORD2) and [Chainsaw](https://github.com/JudeWells/chainsaw) can be used to predict domains from a larger structure.
+Other tools for splitting query structures into domains include [Merizo](https://github.com/psipred/Merizo) and [SWORD2](https://www.dsimb.inserm.fr/SWORD2).
 You can also slice out domains manually using software such as the `pdb_selres` command from [pdb-tools](http://www.bonvinlab.org/pdb-tools).
 
 Interpreting the hit descriptions depends on the database being searched.
 The domain name often includes a reference to the corresponding PDB file, for example d1a6ja_ refers to PDB ID 1A6J chain A, and this can be opened in the [RCSB PDB structure view](https://www.rcsb.org/3d-view/1A6J/1) to get a quick look.
 For the AlphaFold database TED domains, files can be downloaded from [links such as this](https://alphafold.ebi.ac.uk/files/AF-A0A6J8EXE6-F1-model_v4.pdb) where `AF-A0A6J8EXE6-F1` is the first part of the hit notes and is followed by the residue range of the domain.
 
+### Available databases
+
 The available pre-embedded databases are:
 
 | Name      | Description                                                                                                                                                                                | Number of domains | Search time (1 query)      | Search time (100 queries)  |
@@ -78,6 +85,7 @@ The available pre-embedded databases are:
 | `scope40` | ASTRAL set of [SCOPe](https://scop.berkeley.edu) 2.08 domains clustered at 40% seq ID                                                                                                      | 15,127            | 1.32 s                     | 2.36 s                     |
 | `cath40`  | S40 non-redundant domains from [CATH](http://cathdb.info) 23/11/22                                                                                                                         | 31,884            | 1.38 s                     | 2.79 s                     |
 | `ecod70`  | F70 representative domains from [ECOD](http://prodata.swmed.edu/ecod) develop287                                                                                                           | 71,635            | 1.46 s                     | 3.82 s                     |
+| `pdb100`  | All [PDB](https://www.rcsb.org) protein chains as of 02/08/24 split into domains with Chainsaw                                                                                             | 1,177,152         | 2.90 s                     | 27.3 s                     |
 | `af21org` | [AlphaFold](https://alphafold.ebi.ac.uk) structures for 21 model organisms split into domains by [CATH-Assign](https://doi.org/10.1038/s42003-023-04488-9)                                 | 338,258           | 2.21 s                     | 11.0 s                     |
 | `afted`   | [AlphaFold database](https://alphafold.ebi.ac.uk) structures split into domains by [TED](https://www.biorxiv.org/content/10.1101/2024.03.18.585509) and clustered at 50% sequence identity | 53,344,209        | 67.7 s                     | 73.1 s                     |
 
@@ -113,6 +121,8 @@ progres embed -l filepaths.txt -o searchdb.pt
 Again, the structures should correspond to single protein domains.
 The embeddings are stored as Float16, which has no noticeable effect on search performance.
 
+As an example, you can run the above command from the `data` directory to generate a database with two structures.
+
 ## Python library
 
 `progres` can also be used in Python, allowing it to be integrated into other methods:
@@ -172,6 +182,7 @@ The trained model and pre-embedded databases are available on [Zenodo](https://z
 ## Notes
 
 The implementation of the E(n)-equivariant GNN uses [EGNN PyTorch](https://github.com/lucidrains/egnn-pytorch).
+We also include code from [SupContrast](https://github.com/HobbitLong/SupContrast) and [Chainsaw](https://github.com/JudeWells/chainsaw).
 
 Please open issues or [get in touch](http://jgreener64.github.io) with any feedback.
 Contributions via pull requests are welcome.
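
Review note on the README text in this diff, which says that a hit such as `d1a6ja_` refers to PDB ID 1A6J chain A. A minimal sketch of that mapping follows. It assumes the seven-character SCOPe identifier layout seen in the example output: `d`, a four-character PDB ID, one chain character and one domain character. The names `ScopeDomain` and `parse_scope_sid` are hypothetical and are not part of the `progres` package. Upper-casing the chain character is also an assumption that matches the README example but may not hold for every identifier.

```python
from typing import NamedTuple


class ScopeDomain(NamedTuple):
    pdb_id: str  # four-character PDB ID, upper-cased
    chain: str   # chain character from the identifier, upper-cased (assumption)
    domain: str  # trailing domain character, e.g. "_" or "1" in the example hits


def parse_scope_sid(sid: str) -> ScopeDomain:
    """Split a SCOPe-style domain identifier such as 'd1a6ja_' into its parts."""
    if len(sid) != 7 or not sid.startswith("d"):
        raise ValueError(f"unexpected domain identifier: {sid!r}")
    return ScopeDomain(pdb_id=sid[1:5].upper(), chain=sid[5].upper(), domain=sid[6])


if __name__ == "__main__":
    print(parse_scope_sid("d1a6ja_"))
```

Identifiers from other databases (for example CATH, ECOD or the AlphaFold TED domains) follow different conventions, so this helper would reject or misread them.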