ProteinGym

Overview

Important

This is the workspace for proteingym-base and proteingym-benchmark, two repositories that aim to bring more structure to the ProteinGym project. The original ProteinGym repository can be found here.

ProteinGym has become a widely used resource for comparing how well models predict the effects of protein mutations. ProteinGym Base is an effort to bring more structure to the datasets, so that they become easier to work with and to distribute, and to the models in the benchmark, so that they become easier to install and run. That means fewer undocumented CSV files and more formal schemas; fewer scripts with hardcoded paths and more Dockerfiles and precise requirements.

Mission

To advance protein engineering and computational biology by providing:

- Structured, standardized datasets for protein variant effect prediction
- Reproducible benchmarking frameworks for model evaluation
- Open-source tools that facilitate research and development in protein science

Core Repositories

🧬 proteingym-base

Infrastructure for protein variant effect prediction datasets

Purpose: Provides structured dataset classes and tools to facilitate benchmarking
Key Features:
- Formal schemas for protein data (sequences, structures, MSAs, assays)
- Dataset manifest system with TOML configuration
- Archive/persistence capabilities for easy data sharing
- Support for multiple data formats (FASTA, PDB, CIF, A3M, CSV)
- BioPython integration for biological data handling
Target Users: Researchers working with protein datasets, ML practitioners
Tech Stack: Python 3.12+, Pydantic, BioPython, Polars, TOML

🏆 proteingym-benchmark

Standardized benchmarking framework for protein prediction models

Purpose: Enables reproducible evaluation of protein variant effect prediction models
Key Features:
- Two benchmark games: supervised learning and zero-shot prediction
- Model repository with standardized model cards
- DVC-based pipeline orchestration
- Local (HPC) and AWS cloud execution environments
- Automated metric calculation and comparison
- Docker containerization for reproducibility
Target Users: Model developers, researchers comparing prediction methods
Tech Stack: Python, DVC, Docker, AWS SageMaker, Polars
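
As an illustration of what automated metric calculation can look like, the sketch below computes a Spearman rank correlation between measured assay scores and model predictions. The choice of Spearman correlation is an assumption made for this example (it is a common metric for variant effect prediction), and the function names and sample values are hypothetical; none of this is taken from the proteingym-benchmark code.

```python
from math import sqrt


def average_ranks(values: list[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up values for illustration, not real assay data.
measured = [0.10, 0.40, 0.35, 0.80, 0.55]
predicted = [-2.0, -1.0, -0.5, 0.5, -1.5]
print(f"Spearman correlation: {spearman(measured, predicted):.3f}")
```

A rank-based metric fits the zero-shot setting because model scores and assay measurements are usually on different scales, so only their ordering is comparable.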

Architecture & Integration

┌─────────────────┐    uses datasets     ┌──────────────────┐
│   pg-benchmark  │ ◄─────────────────── │    pg-base       │
│                 │                      │                  │
│ • Model repos   │                      │ • Dataset class  │
│ • Benchmarks    │                      │ • Manifests      │
│ • Metrics       │                      │ • Archives       │
│ • Pipelines     │                      │ • Schemas        │
└─────────────────┘                      └──────────────────┘

Getting Started

Start by reading the tutorial notebooks in proteingym-base. They explain how to create datasets and how to work with them.

Technical Requirements

Python: 3.12+
Cloud: AWS (optional, for large-scale benchmarking)
Containerization: Docker
Data Versioning: DVC
Package Management: uv/pip
