ProteinGym

Overview

Important

This is the workspace for ProteinGym-base and ProteinGym-benchmark, which aim to bring more structure to the ProteinGym project. The original ProteinGym repository can be found here

DOI PyPI version License: MIT

ProteinGym has become a widely used resource for comparing the ability of models to predict the effects of protein mutations. ProteinGym Base is an effort to bring more structure to the datasets, so that they become easier to work with and to distribute, and to the models in the benchmark, so that they become easier to install and run. That means fewer undocumented CSV files and more formal schemas; fewer scripts with hardcoded paths and more Dockerfiles and precise requirements.

Mission

To advance protein engineering and computational biology by providing:

  • Structured, standardized datasets for protein variant effect prediction
  • Reproducible benchmarking frameworks for model evaluation
  • Open-source tools that facilitate research and development in protein science

Core Repositories

proteingym-base: Infrastructure for protein variant effect prediction datasets

  • Purpose: Provides structured dataset classes and tools to facilitate benchmarking
  • Key Features:
    • Formal schemas for protein data (sequences, structures, MSAs, assays)
    • Dataset manifest system with TOML configuration
    • Archive/persistence capabilities for easy data sharing
    • Support for multiple data formats (FASTA, PDB, CIF, A3M, CSV)
    • BioPython integration for biological data handling
  • Target Users: Researchers working with protein datasets, ML practitioners
  • Tech Stack: Python 3.12+, Pydantic, BioPython, Polars, TOML

proteingym-benchmark: Standardized benchmarking framework for protein prediction models

  • Purpose: Enables reproducible evaluation of protein variant effect prediction models
  • Key Features:
    • Two benchmark games: supervised learning and zero-shot prediction
    • Model repository with standardized model cards
    • DVC-based pipeline orchestration
    • Local (HPC) and AWS cloud execution environments
    • Automated metric calculation and comparison
    • Docker containerization for reproducibility
  • Target Users: Model developers, researchers comparing prediction methods
  • Tech Stack: Python, DVC, Docker, AWS SageMaker, Polars
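
As an illustration of what an automated metric calculation for a zero-shot benchmark can look like, the sketch below computes the Spearman rank correlation between model scores and measured assay values. Spearman correlation is a common choice for variant effect prediction, but this text does not state which metrics proteingym-benchmark computes, so treat the metric and the function names (`average_ranks`, `pearson`, `spearman`) as assumptions.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(predicted, measured):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(predicted), average_ranks(measured))


# Toy scores for four variants (made-up numbers, not real assay data).
predicted = [0.1, 0.4, 0.35, 0.8]
measured = [1.0, 2.0, 3.0, 4.0]
rho = spearman(predicted, measured)
```

A rank-based metric suits the zero-shot setting because model scores and assay measurements usually live on different scales; only the ordering of variants is compared.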

Architecture & Integration

┌─────────────────┐    uses datasets     ┌──────────────────┐
│   pg-benchmark  │ ◄─────────────────── │    pg-base       │
│                 │                      │                  │
│ • Model repos   │                      │ • Dataset class  │
│ • Benchmarks    │                      │ • Manifests      │
│ • Metrics       │                      │ • Archives       │
│ • Pipelines     │                      │ • Schemas        │
└─────────────────┘                      └──────────────────┘

Getting Started

Start by reading the tutorial notebooks in proteingym-base. They explain how to create datasets and how to work with them.

Technical Requirements

  • Python: 3.12+
  • Cloud: AWS (optional, for large-scale benchmarking)
  • Containerization: Docker
  • Data Versioning: DVC
  • Package Management: uv or pip
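
A quick way to check a machine against the requirements above is a small script like the following. It is only a sketch: it assumes the tools are installed under their usual executable names (`docker`, `dvc`, `uv`, `pip`) and only checks that they are on the `PATH`, not that they are configured correctly. The `check_environment` helper is hypothetical and not part of either repository.

```python
import shutil
import sys

# Minimum Python version taken from the requirements list.
REQUIRED_PYTHON = (3, 12)
# Assumed executable names for the listed tools.
TOOLS = ("docker", "dvc", "uv", "pip")


def check_environment(min_python=REQUIRED_PYTHON, tools=TOOLS):
    """Return a dict mapping each requirement to True if it is satisfied."""
    report = {"python": sys.version_info[:2] >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name:8s} {'found' if ok else 'missing'}")
```

Only one of `uv` or `pip` is needed, and AWS access is optional, so a "missing" line is not necessarily a problem.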
