Important
This is the workspace for ProteinGym-base and ProteinGym-benchmark, two projects that aim to bring more structure to the ProteinGym project. The original ProteinGym repository can be found here.
ProteinGym has become a widely used resource for comparing the ability of models to predict the effects of protein mutations. ProteinGym Base is an effort to bring more structure to the datasets, so that they become easier to work with and to distribute, and to the models in the benchmark, so that they become easier to install and run. That means fewer undocumented CSVs and more formal schemas; fewer scripts with hardcoded paths and more Dockerfiles and precise requirements.
The aim is to advance protein engineering and computational biology by providing:
- Structured, standardized datasets for protein variant effect prediction
- Reproducible benchmarking frameworks for model evaluation
- Open-source tools that facilitate research and development in protein science
🧬 proteingym-base
Infrastructure for protein variant effect prediction datasets
- Purpose: Provides structured dataset classes and tools to facilitate benchmarking
- Key Features:
  - Formal schemas for protein data (sequences, structures, MSAs, assays)
  - Dataset manifest system with TOML configuration
  - Archive/persistence capabilities for easy data sharing
  - Support for multiple data formats (FASTA, PDB, CIF, A3M, CSV)
  - BioPython integration for biological data handling
- Target Users: Researchers working with protein datasets, ML practitioners
- Tech Stack: Python 3.12+, Pydantic, BioPython, Polars, TOML
proteingym-benchmark
Standardized benchmarking framework for protein prediction models
- Purpose: Enables reproducible evaluation of protein variant effect prediction models
- Key Features:
  - Two benchmark games: supervised learning and zero-shot prediction
  - Model repository with standardized model cards
  - DVC-based pipeline orchestration
  - Local (HPC) and AWS cloud execution environments
  - Automated metric calculation and comparison
  - Docker containerization for reproducibility
- Target Users: Model developers, researchers comparing prediction methods
- Tech Stack: Python, DVC, Docker, AWS SageMaker, Polars
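As an illustration of what automated metric calculation and comparison can look like, the sketch below scores two models against measured assay values and ranks them. The choice of Spearman rank correlation is an assumption (this overview does not say which metrics proteingym-benchmark computes), and the model names, scores, and helper functions are hypothetical. The code is pure Python so that it runs without the project's dependencies.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(predicted, measured):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(predicted) != len(measured):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(predicted), average_ranks(measured))


# Toy data: measured assay scores and predictions from two hypothetical models.
measured = [0.1, 0.4, 0.35, 0.8]
model_scores = {
    "model_a": [-2.0, -1.0, -1.5, 0.5],
    "model_b": [0.5, -1.0, -1.5, -2.0],
}

# Compare models by sorting on the metric, best first.
ranking = sorted(
    ((name, spearman(pred, measured)) for name, pred in model_scores.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A rank-based metric is convenient here because model outputs and assay measurements are often on different scales; only the ordering of variants needs to agree.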
┌─────────────────┐    uses datasets     ┌──────────────────┐
│  pg-benchmark   │ ───────────────────▶ │     pg-base      │
│                 │                      │                  │
│ • Model repos   │                      │ • Dataset class  │
│ • Benchmarks    │                      │ • Manifests      │
│ • Metrics       │                      │ • Archives       │
│ • Pipelines     │                      │ • Schemas        │
└─────────────────┘                      └──────────────────┘
Start by reading the tutorial notebooks in proteingym-base, which explain how to create datasets and how to work with them.
- Python: 3.12+
- Cloud: AWS (optional, for large-scale benchmarking)
- Containerization: Docker
- Data Versioning: DVC
- Package Management: uv or pip