a-r-j · a-r-j · Dec 28, 2023 · Sep 1, 2023 · Sep 16, 2023 · Sep 17, 2023
diff --git a/.github/workflows/code-tests.yaml b/.github/workflows/code-tests.yaml
@@ -19,16 +19,20 @@ jobs:
     strategy:
       matrix:
         platform: [ubuntu-latest, macos-latest, windows-latest]
-        python-version: [3.9, "3.10", 3.11]
+        python-version: [3.9, "3.10"]
     runs-on: ubuntu-latest
 
     steps:
       - name: Checkout repository
         uses: actions/checkout@v3
-      - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v2.3.1
+      - name: Setup miniconda
+        uses: conda-incubator/setup-miniconda@v2
         with:
+          auto-update-conda: true
+          miniforge-variant: Mambaforge
+          channels: "conda-forge, pytorch, pyg"
           python-version: ${{ matrix.python-version }}
+          use-mamba: true
       - id: cache-dependencies
         name: Cache dependencies
         uses: actions/cache@v2.1.7

diff --git a/.gitignore b/.gitignore
@@ -143,12 +143,12 @@ proteinworkshop/data/*
 !proteinworkshop/data/.gitkeep
 
 logs/
+ProteinWorkshop/
 wandb/
 .DS_Store
 .env
 
+# Explanations
+explanations/
 # Visualisations
 visualisations/
-
-# Explanations
-explanations/
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -0,0 +1,25 @@
+repos:
+- repo: https://github.com/pre-commit/pre-commit-hooks
+  rev: v4.5.0
+  hooks:
+    - id: trailing-whitespace
+    - id: end-of-file-fixer
+    - id: check-yaml
+    - id: check-added-large-files
+- repo: https://github.com/ambv/black
+  rev: 23.9.1
+  hooks:
+    - id: black
+- repo: https://github.com/jsh9/pydoclint
+  # pydoclint version.
+  rev: 0.3.3
+  hooks:
+    - id: pydoclint
+      args:
+        - "--config=pyproject.toml"
+- repo: https://github.com/astral-sh/ruff-pre-commit
+  # Ruff version.
+  rev: v0.1.1
+  hooks:
+    - id: ruff
+      args: [--fix, --exit-non-zero-on-fix]
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,3 +1,33 @@
+### 0.2.6 (Unreleased)
+
+
+
+#### Datasets
+
+* Adds two antibody-specific datasets using the IGFold corpuses for paired OAS and Jaffe 2022 [#53](https://github.com/a-r-j/ProteinWorkshop/pull/53/)
+* Set `in_memory=True` as default for most (small) datasets for improved performance [#53](https://github.com/a-r-j/ProteinWorkshop/pull/53/)
+* Fix `num_classes` for GO datamodules and set `in_memory=True` as default for most (downstream) datasets for improved performance [#53](https://github.com/a-r-j/ProteinWorkshop/pull/53/)
+* Fixes GO labelling [#53](https://github.com/a-r-j/ProteinWorkshop/pull/53/)
+
+
+#### Features
+* Improves positional encoding performance by adding a `seq_pos` attribute on `Data/Protein` objects in the base dataset getter. [#53](https://github.com/a-r-j/ProteinWorkshop/pull/53/)
+
+#### Models
+* Adds CDConv implementation [#53](https://github.com/a-r-j/ProteinWorkshop/pull/53/)
+* Adds tuned hparams for models [#53](https://github.com/a-r-j/ProteinWorkshop/pull/53/)
+
+#### Framework
+* Refactors beartype/jaxtyping to use latest recommended syntax [#53](https://github.com/a-r-j/ProteinWorkshop/pull/53/)
+* Adds explainability module for performing attribution on a trained model [#53](https://github.com/a-r-j/ProteinWorkshop/pull/53/)
+* Change default finetuning features in config: `ca_base` -> `ca_seq` [#53](https://github.com/a-r-j/ProteinWorkshop/pull/53/)
+* Add optional hparam entry point to finetuning config [#53](https://github.com/a-r-j/ProteinWorkshop/pull/53/)
+* Fixes GPU memory accumulation for some metrics [#53](https://github.com/a-r-j/ProteinWorkshop/pull/53/)
+* Updates zenodo URL for processed datasets to reflect upstream API change [#53](https://github.com/a-r-j/ProteinWorkshop/pull/53/)
+* Adds multi-hot label encoding transform [#53](https://github.com/a-r-j/ProteinWorkshop/pull/53/)
+* Fixes auto PyG install for `torch>2.1.0` [#53](https://github.com/a-r-j/ProteinWorkshop/pull/53/)
+* Adds `proteinworkshop.model_io` containing utils for loading trained models [#53](https://github.com/a-r-j/ProteinWorkshop/pull/53/)
+
 ### 0.2.5 (25/09/2024)
 
 * Implement ESM embedding encoder ([#33](https://github.com/a-r-j/ProteinWorkshop/pull/33), [#41](https://github.com/a-r-j/ProteinWorkshop/pull/41))

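The changelog above mentions a multi-hot label encoding transform, which matters for multi-label tasks such as GO term prediction. For orientation, here is a minimal sketch of the general technique. It is not the `proteinworkshop` implementation: `multi_hot` is a hypothetical helper name, and plain Python lists stand in for tensors.

```python
from typing import Iterable, List


def multi_hot(label_indices: Iterable[int], num_classes: int) -> List[int]:
    """Return a 0/1 vector with ones at every index in `label_indices`."""
    encoding = [0] * num_classes
    for index in label_indices:
        if not 0 <= index < num_classes:
            raise ValueError(f"label {index} outside [0, {num_classes})")
        encoding[index] = 1  # repeated labels simply stay at 1
    return encoding


# A protein annotated with classes 0 and 3 out of 5 possible classes.
example = multi_hot([0, 3], num_classes=5)
print(example)  # [1, 0, 0, 1, 0]
```

Note that the vector length depends on `num_classes`, so a wrong class count changes the shape of every encoded label.
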
diff --git a/README.md b/README.md
@@ -37,6 +37,7 @@ Configuration files to run the experiments described in the manuscript are provi
     - [Running a sweep/experiment](#running-a-sweepexperiment)
     - [Embedding a dataset](#embedding-a-dataset)
     - [Visualising a dataset's embeddings](#visualising-pre-trained-model-embeddings-for-a-given-dataset)
+    - [Performing attribution of a pre-trained model](#performing-attribution-of-a-pre-trained-model)
     - [Verifying a config](#verifying-a-config)
     - [Using `proteinworkshop` modules functionally](#using-proteinworkshop-modules-functionally)
   - [Models](#models)
@@ -67,14 +68,11 @@ Below, we outline how one may set up a virtual environment for `proteinworkshop`
 
 ### From PyPI
 
-`proteinworkshop` is available for install [from PyPI](https://pypi.org/project/proteinworkshop/). This enables training of specific configurations via the CLI **or** using individual components from the benchmark, such as datasets, featurisers, or transforms, as drop-ins to other projects. Make sure to install [PyTorch](https://pytorch.org/) (specifically version `2.0.0`) using its official `pip` installation instructions, with CUDA support as desired.
+`proteinworkshop` is available for install [from PyPI](https://pypi.org/project/proteinworkshop/). This enables training of specific configurations via the CLI **or** using individual components from the benchmark, such as datasets, featurisers, or transforms, as drop-ins to other projects. Make sure to install [PyTorch](https://pytorch.org/) (version `2.1.2` or newer) using its official `pip` installation instructions, with CUDA support as desired.
 
 ```bash
 # install `proteinworkshop` from PyPI
-pip install proteinworkshop --no-cache-dir
-
-# e.g., install PyTorch with CUDA 11.8 support on Linux
-pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118 --no-cache-dir
+pip install proteinworkshop
 
 # install PyTorch Geometric using the (now-installed) CLI
 workshop install pyg
@@ -86,7 +84,7 @@ export DATA_PATH="where/you/want/data/" # e.g., `export DATA_PATH="proteinworksh
 However, for full exploration we recommend cloning the repository and building from source.
 
 ### Building from source
-With a local virtual environment activated (e.g., one created with `conda create -n proteinworkshop python=3.9`):
+With a local virtual environment activated (e.g., one created with `conda create -n proteinworkshop python=3.10`):
 1. Clone and install the project
 
     ```bash
@@ -95,11 +93,11 @@ With a local virtual environment activated (e.g., one created with `conda create
     pip install -e .
     ```
 
-2. Install [PyTorch](https://pytorch.org/) (specifically version `2.0.0`) using its official `pip` installation instructions, with CUDA support as desired (N.B. make sure to add `--no-cache-dir` to the end of the `pip` installation command)
+2. Install [PyTorch](https://pytorch.org/) (version `2.1.2` or newer) using its official `pip` installation instructions, with CUDA support as desired
 
     ```bash
     # e.g., to install PyTorch with CUDA 11.8 support on Linux:
-    pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118 --no-cache-dir
+    pip install torch==2.1.2+cu118 torchvision==0.16.2+cu118 torchaudio==2.1.2+cu118 --index-url https://download.pytorch.org/whl/cu118
     ```
 
 3. Then use the newly-installed `proteinworkshop` CLI to install [PyTorch Geometric](https://pyg.org/)
@@ -252,6 +250,21 @@ python proteinworkshop/visualise.py ckpt_path=PATH/TO/CHECKPOINT plot_filepath=V
 ```
 See the `visualise` section of `proteinworkshop/config/visualise.yaml` for additional parameters.
 
+### Performing attribution of a pre-trained model
+
+We provide a utility in `proteinworkshop/explain.py` for performing attribution of a pre-trained model using integrated gradients.
+
+For a supervised task, this writes a PDB file for every structure in the dataset, with residue-level attributions stored in the `b_factor` column. To visualise the attributions, we recommend using the [Protein Viewer VSCode extension](https://marketplace.visualstudio.com/items?itemName=ArianJamasb.protein-viewer) and changing the 3D representation to colour by `Uncertainty/Disorder`.
+
+To run the attribution:
+
+```bash
+python proteinworkshop/explain.py ckpt_path=PATH/TO/CHECKPOINT output_dir=ATTRIBUTION/DIRECTORY
+```
+
+See the `explain` section of `proteinworkshop/config/explain.yaml` for additional parameters.
+
+
 ### Verifying a config
 
 ```bash
@@ -309,6 +322,7 @@ Read [the docs](https://www.proteins.sh) for a full list of modules available in
 | `GearNet`| [Zhang et al.](https://arxiv.org/abs/2203.06125) | ✓
 | `DimeNet++`   | [Gasteiger et al.](https://arxiv.org/abs/2011.14115) | ✗
 | `SchNet`   | [Schütt et al.](https://arxiv.org/abs/1706.08566) | ✗
+| `CDConv`   | [Fan et al.](https://openreview.net/forum?id=P5Z-Zl9XJ7) | ✓
 
 ### Equivariant Graph Encoders
 
@@ -361,8 +375,11 @@ Pre-training corpuses (with the exception of `pdb`, `cath`, and `astral`) are pr
 | `esmatlas` | [ESMAtlas](https://esmatlas.com/) predictions  (full)     | [Kim et al.](https://academic.oup.com/bioinformatics/article/39/4/btad153/7085592) | | 1 Tb | [GPL-3.0](https://github.com/steineggerlab/foldcomp/blob/master/LICENSE.txt) / [CC-BY 4.0](https://esmatlas.com/about)
 | `esmatlas_v2023_02`| [ESMAtlas](https://esmatlas.com/) predictions (v2023_02 release)      | [Kim et al.](https://academic.oup.com/bioinformatics/article/39/4/btad153/7085592)       | | 137 Gb| [GPL-3.0](https://github.com/steineggerlab/foldcomp/blob/master/LICENSE.txt) / [CC-BY 4.0](https://esmatlas.com/about)
 | `highquality_clust30`| [ESMAtlas](https://esmatlas.com/) High Quality predictions       |  [Kim et al.](https://academic.oup.com/bioinformatics/article/39/4/btad153/7085592)      | 37M Chains | 114 Gb |  [GPL-3.0](https://github.com/steineggerlab/foldcomp/blob/master/LICENSE.txt) / [CC-BY 4.0](https://esmatlas.com/about)
+| `igfold_paired_oas` | IGFold Predictions for [Paired OAS](https://journals.aai.org/jimmunol/article/201/8/2502/107069/Observed-Antibody-Space-A-Resource-for-Data-Mining) | [Ruffolo et al.](https://www.nature.com/articles/s41467-023-38063-x) | 104,994 paired Ab chains | | [CC-BY 4.0](https://www.nature.com/articles/s41467-023-38063-x#rightslink)
+| `igfold_jaffe` | IGFold predictions for [Jaffe2022](https://www.nature.com/articles/s41586-022-05371-z) data | [Ruffolo et al.](https://www.nature.com/articles/s41467-023-38063-x) | 1,340,180 paired Ab chains   | | [CC-BY 4.0](https://www.nature.com/articles/s41467-023-38063-x#rightslink)
 | `pdb`| Experimental structures deposited in the [RCSB Protein Data Bank](https://www.rcsb.org/)       |  [wwPDB consortium](https://academic.oup.com/nar/article/47/D1/D520/5144142)      | ~800k Chains |23 Gb | [CC0 1.0](https://www.rcsb.org/news/feature/611e8d97ef055f03d1f222c6) |
 
+
 <details>
   <summary>Additionally, we provide several species-specific compilations (mostly reference species)</summary>
 
@@ -528,8 +545,8 @@ We use `poetry` to manage the project's underlying dependencies and to push upda
 To keep with the code style for the `proteinworkshop` repository, using the following lines, please format your commits before opening a pull request:
 ```bash
 # assuming you are located in the `ProteinWorkshop` top-level directory
-isort . 
-autoflake -r --in-place --remove-unused-variables --remove-all-unused-imports --ignore-init-module-imports . 
+isort .
+autoflake -r --in-place --remove-unused-variables --remove-all-unused-imports --ignore-init-module-imports .
 black --config=pyproject.toml .
 ```
 

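The README hunk above says the attribution utility uses integrated gradients. As a rough illustration of that technique only, the sketch below approximates the path integral with a midpoint Riemann sum on a toy quadratic function. It does not reproduce `proteinworkshop/explain.py`: the names `integrated_gradients`, `model`, `model_grad` and `WEIGHTS` are hypothetical, and the gradient is supplied analytically rather than by autograd.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def integrated_gradients(
    grad_fn: Callable[[Sequence[float]], Vector],
    inputs: Sequence[float],
    baseline: Sequence[float],
    steps: int = 50,
) -> Vector:
    """Approximate integrated gradients with a midpoint Riemann sum."""
    totals = [0.0] * len(inputs)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th interval on [0, 1]
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (x - b) for x, b in zip(inputs, baseline)]
        for i, g in enumerate(grad_fn(point)):
            totals[i] += g
    # Average gradient along the path, scaled by the input displacement.
    return [(x - b) * t / steps for x, b, t in zip(inputs, baseline, totals)]


# Toy "model": f(x) = sum_i w_i * x_i^2, with an analytic gradient.
WEIGHTS = [1.0, -2.0, 0.5]


def model(x: Sequence[float]) -> float:
    return sum(w * v * v for w, v in zip(WEIGHTS, x))


def model_grad(x: Sequence[float]) -> Vector:
    return [2.0 * w * v for w, v in zip(WEIGHTS, x)]


x = [1.0, 2.0, -3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(model_grad, x, baseline)

# Completeness: attributions sum to f(x) - f(baseline), up to numerical error.
assert abs(sum(attributions) - (model(x) - model(baseline))) < 1e-9
```

The completeness check at the end is a useful sanity test for any integrated-gradients implementation: if the per-feature attributions do not sum to the change in model output, the step count is likely too small or the baseline is mismatched.
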
diff --git a/docs/source/conf.py b/docs/source/conf.py
@@ -23,6 +23,7 @@
     "sphinx.ext.autosummary",
     "sphinx.ext.intersphinx",
     "sphinx.ext.viewcode",
+    "sphinx.ext.doctest",
     "sphinx_copybutton",
     "sphinx_inline_tabs",
     "sphinxcontrib.gtagjs",
@@ -32,7 +33,7 @@
     "nbsphinx_link",
     "sphinx.ext.napoleon",
     "sphinx_codeautolink",
-    "sphinxcontrib.jquery"
+    "sphinxcontrib.jquery",
     # "sphinx_autorun",
 ]
 
@@ -109,7 +110,6 @@
             "vu": "\\mathbf{u}",
             "vv": "\\mathbf{v}",
             "vw": "\\mathbf{w}",
-            "vx": "\\mathbf{x}",
             "vy": "\\mathbf{y}",
             "vz": "\\mathbf{z}",
         }

diff --git a/docs/source/configs/dataset.rst b/docs/source/configs/dataset.rst
@@ -31,8 +31,8 @@ Unlabelled Datasets
 
 
 .. mdinclude:: ../../../README.md
-    :start-line: 331
-    :end-line: 373
+    :start-line: 361
+    :end-line: 406
 
 
 :py:class:`ASTRAL <proteinworkshop.datasets.astral.AstralDataModule>` (``astral``)
@@ -116,7 +116,7 @@ This is a dataset of approximately 3 million protein structures from the AlphaFo
 
 Species-Specific Datasets
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-TODO
+Stay tuned!
 
 
 Graph-level Datasets

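Several hunks in this PR only shift `:start-line:` / `:end-line:` offsets on `mdinclude` directives because the README grew. A small helper can show which README lines a given pair of offsets selects before the docs are rebuilt. This is a sketch under one assumption: that `mdinclude` follows the docutils `include` convention of a 0-based `start-line` and an exclusive `end-line`. The name `included_lines` is hypothetical.

```python
from typing import List, Optional


def included_lines(
    text: str, start_line: Optional[int], end_line: Optional[int]
) -> List[str]:
    """Return the lines selected by a 0-based start and an exclusive end."""
    return text.splitlines()[start_line:end_line]


# Stand-in for the README contents; in practice, read the file from disk.
readme = "\n".join(f"line {n}" for n in range(10))
print(included_lines(readme, 2, 5))  # ['line 2', 'line 3', 'line 4']
```

Printing the first and last selected lines for each directive makes it easy to spot an offset that now lands mid-table or mid-section.
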
diff --git a/docs/source/configs/features.rst b/docs/source/configs/features.rst
@@ -7,8 +7,8 @@ Features
   :width: 400
 
 .. mdinclude:: ../../../README.md
-    :start-line: 426
-    :end-line: 475
+    :start-line: 459
+    :end-line: 508
 
 
 Default Features

diff --git a/docs/source/configs/framework_components/env.rst b/docs/source/configs/framework_components/env.rst
@@ -2,8 +2,8 @@ Environment
 ------------
 
 .. mdinclude:: ../../../../README.md
-    :start-line: 109
-    :end-line: 111
+    :start-line: 108
+    :end-line: 110
 
 .. literalinclude:: ../../../../.env.example
     :language: bash

diff --git a/docs/source/configs/model.rst b/docs/source/configs/model.rst
@@ -34,14 +34,14 @@ Invariant Encoders
 =============================
 
 .. mdinclude:: ../../../README.md
-    :start-line: 295
-    :end-line: 302
+    :start-line: 319
+    :end-line: 326
 
 :py:class:`SchNet <proteinworkshop.models.graph_encoders.schnet.SchNetModel>` (``schnet``)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 SchNet is one of the most popular and simplest instantiations of E(3) invariant message passing GNNs. SchNet constructs messages through element-wise multiplication of scalar features modulated by a radial filter conditioned on the pairwise distance :math:`\Vert \vec{\vx}_{ij} \Vert` between two neighbours.
-Scalar features are update from iteration :math:`t`` to :math:`t+1` via:
+Scalar features are updated from iteration :math:`t` to :math:`t+1` via:
 
 .. math::
     \begin{align}
@@ -113,12 +113,25 @@ where :math:`\mathrm{FC(\cdot)}` denotes a linear transformation upon the messag
     :caption: config/encoder/gear_net_edge.yaml
 
 
+
+:py:class:`CDConv <proteinworkshop.models.graph_encoders.cdconv.CDConvModel>` (``cdconv``)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+CDConv is an SE(3) invariant architecture that uses independent learnable weights for sequential displacement, whilst directly encoding geometric displacements.
+
+As a result of the downsampling procedures, this architecture is only suitable for graph-level prediction tasks.
+
+.. literalinclude:: ../../../proteinworkshop/config/encoder/cdconv.yaml
+    :language: yaml
+    :caption: config/encoder/cdconv.yaml
+
+
 Vector-Equivariant Encoders
 =============================
 
 .. mdinclude:: ../../../README.md
-    :start-line: 306
-    :end-line: 312
+    :start-line: 330
+    :end-line: 336
 
 :py:class:`EGNN <proteinworkshop.models.graph_encoders.egnn.EGNNModel>` (``egnn``)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -170,8 +183,8 @@ Tensor-Equivariant Encoders
 =============================
 
 .. mdinclude:: ../../../README.md
-    :start-line: 314
-    :end-line: 319
+    :start-line: 338
+    :end-line: 343
 
 
 :py:class:`Tensor Field Networks <proteinworkshop.models.graph_encoders.tfn.TensorProductModel>` (``tfn``)
@@ -200,7 +213,7 @@ where the weights :math:`\vw` of the tensor product are computed via a learnt ra
 
 MACE (Batatia et al., 2022) is a higher order E(3) or SE(3) equivariant GNN originally developed for molecular dynamics simulations.
 MACE provides an efficient approach to computing high body order equivariant features in the Tensor Field Network framework via Atomic Cluster Expansion:
-They first aggregate neighbourhood features analogous to the node update equation for TFN above (the :math:`A` functions in Batatia et al. (2022) (eq.9)) and then take :math:`k-1` repeated self-tensor products of these neighbourhood features. 
+They first aggregate neighbourhood features analogous to the node update equation for TFN above (the :math:`A` functions in Batatia et al. (2022) (eq.9)) and then take :math:`k-1` repeated self-tensor products of these neighbourhood features.
 In our formalism, this corresponds to:
 
 .. math::
@@ -214,6 +227,25 @@ In our formalism, this corresponds to:
     :caption: config/encoder/mace.yaml
 
 
+Sequence-Based Encoders
+=============================
+
+.. mdinclude:: ../../../README.md
+    :start-line: 345
+    :end-line: 349
+
+
+:py:class:`Evolutionary Scale Modeling <proteinworkshop.models.graph_encoders.esm_embeddings.EvolutionaryScaleModeling>` (``esm``)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Evolutionary Scale Modeling is a series of Transformer-based protein sequence encoders (Vaswani et al., 2017) that has been successfully used in protein structure prediction (Lin et al., 2023), protein design (Verkuil et al., 2022), and beyond.
+This model class has commonly been used as a baseline for protein-related representation learning tasks, and we included it in our benchmark for this reason.
+
+.. literalinclude:: ../../../proteinworkshop/config/encoder/esm.yaml
+    :language: yaml
+    :caption: config/encoder/esm.yaml
+
+
 Decoder Models
 =============================
 

diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -42,6 +42,7 @@ Welcome to Protein Workshop's documentation!
    configs/task
    configs/features
    configs/transforms
+   configs/metrics
    framework
    ml_components
 
@@ -57,9 +58,9 @@ Welcome to Protein Workshop's documentation!
    modules/proteinworkshop.tasks
    modules/proteinworkshop.features
    modules/proteinworkshop.utils
-   modules/protein_workshop.constants
+   modules/proteinworkshop.constants
    modules/proteinworkshop.types
-
+   modules/proteinworkshop.metrics
 
 Indices and tables
 ==================

diff --git a/docs/source/installation.rst b/docs/source/installation.rst
@@ -5,5 +5,5 @@ Installation
     :doc:`/configs/framework_components/env`
 
 .. mdinclude:: ../../README.md
-    :start-line: 64
-    :end-line: 109
+    :start-line: 66
+    :end-line: 108
diff --git a/docs/source/modules/proteinworkshop.metrics.rst b/docs/source/modules/proteinworkshop.metrics.rst
@@ -0,0 +1,4 @@
+proteinworkshop.metrics
+-------------------------
+
+Stay tuned!