# Protein Language Model Embedding Library
PLMEmbedder is a standalone library for computing protein sequence embeddings using state-of-the-art protein language models (PLMs). It provides a clean API and CLI for embedding proteins, caching results, and generating decoy sequences.
PLMEmbedder takes protein sequences in FASTA format and:
- Computes PLM embeddings for each amino acid using protein language models (ESM2, ESM1b, ProtBert, ProtT5)
- Caches embeddings for reuse across multiple analyses
- Generates decoy sequences using various methods (reversed, permuted, shuffled blocks)
- Provides a clean API for integration with downstream analysis pipelines
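The three decoy strategies listed above can be sketched in plain Python. This is an illustration of the general techniques, not the library's actual implementation; the block size and seeding behavior here are assumptions:

```python
import random

def reversed_decoy(seq: str) -> str:
    # Reverse the sequence end-to-end.
    return seq[::-1]

def permuted_decoy(seq: str, seed: int = 42) -> str:
    # Shuffle all residues; preserves amino acid composition.
    rng = random.Random(seed)
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

def shuffled_blocks_decoy(seq: str, block_size: int = 5, seed: int = 42) -> str:
    # Split into fixed-size blocks and shuffle block order,
    # preserving local residue context within each block.
    rng = random.Random(seed)
    blocks = [seq[i:i + block_size] for i in range(0, len(seq), block_size)]
    rng.shuffle(blocks)
    return "".join(blocks)

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
print(reversed_decoy(seq))
```

All three methods preserve sequence length and (for permuted and shuffled blocks) amino acid composition, which is what makes them useful as negative controls.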
## Installation

```bash
pip install plmembedder
```

Or install from source:

```bash
git clone https://github.com/biodataganache/plmembedder.git
cd plmembedder
pip install -e .
```

Model backends are available as optional extras:

```bash
# ESM models only
pip install plmembedder[esm]

# ProtBert models only
pip install plmembedder[protbert]

# ProtT5 models only
pip install plmembedder[prott5]

# All PLM support
pip install plmembedder[all]
```

## Requirements

Minimum:

- Python 3.8 or higher
- 8 GB RAM
- 10 GB free disk space

Recommended:

- Python 3.9 or 3.10
- 16+ GB RAM
- NVIDIA GPU with 8+ GB VRAM (for GPU acceleration)
- CUDA 11.0+ and cuDNN 8.0+
- 20+ GB free disk space
## Quick Start

### Command Line

```bash
# Basic embedding
plmembedder proteins.fasta -o results/
# With caching (recommended for large datasets)
plmembedder proteins.fasta --cache-embeddings -c cache/ -o results/
# Use cached embeddings for subsequent runs
plmembedder proteins.fasta --cache-embeddings -c cache/ -o results2/
# Generate decoy sequences
plmembedder proteins.fasta --n-decoys 1 --decoy-type reversed -o results/
# Use a smaller/faster model
plmembedder proteins.fasta --model esm2_t6_8M_UR50D -o results/
# CPU only (no GPU)
plmembedder proteins.fasta --device cpu -o results/
```

### Python API

```python
from plmembedder import EmbeddingPipeline, PipelineConfig, CacheConfig
# Simple usage with defaults
pipeline = EmbeddingPipeline()
embeddings = pipeline.run("proteins.fasta")
# With caching enabled
config = PipelineConfig(
    cache=CacheConfig(enabled=True, cache_dir="cache/")
)
pipeline = EmbeddingPipeline(config)
embeddings = pipeline.run("proteins.fasta", output_dir="results/")
# Step-by-step for more control
pipeline = EmbeddingPipeline(config)
pipeline.load_sequences("proteins.fasta")
pipeline.generate_decoys() # If configured
pipeline.compute_embeddings()
# Access individual embeddings
for seq_id, emb_result in pipeline.iterate_embeddings():
    print(f"{seq_id}: shape {emb_result.embeddings.shape}")
    # emb_result.embeddings is shape (seq_len, embedding_dim)
```

## CLI Reference

```
usage: plmembedder [-h] [-o OUTPUT] [--model MODEL] [--model-type {esm2,esm1b,protbert,prott5}]
                   [--device DEVICE] [--batch-size BATCH_SIZE] [--layer LAYER]
                   [--max-length MAX_LENGTH] [--cache-embeddings] [-c CACHE_DIR]
                   [--n-decoys N_DECOYS] [--decoy-type {reversed,permuted,shuffled_blocks}]
                   [--decoy-prefix DECOY_PREFIX] [--decoy-seed DECOY_SEED]
                   [--decoys-only] [--write-decoy-fasta WRITE_DECOY_FASTA]
                   [--max-sequences MAX_SEQUENCES] [--no-validate] [-v] [--embed-only]
                   fasta

Protein Language Model Embedding Pipeline

positional arguments:
  fasta                 Input FASTA file

options:
  -h, --help            show this help message and exit
  -o, --output OUTPUT   Output directory (default: output)
  -v, --verbose         Verbose output

Model Parameters:
  --model MODEL         Model name (default: esm2_t33_650M_UR50D)
  --model-type          Model type: esm2, esm1b, protbert, prott5 (default: esm2)
  --device DEVICE       Device: cuda or cpu (default: cuda)
  --batch-size          Batch size for embedding (default: 4)
  --layer LAYER         Model layer to extract from (default: -1, last layer)
  --max-length          Maximum sequence length (default: 1024)

Caching Parameters:
  --cache-embeddings    Enable embedding caching
  -c, --cache-dir       Cache directory (default: embeddings_cache)

Decoy Parameters:
  --n-decoys N_DECOYS   Number of decoys per sequence (default: 0)
  --decoy-type          Decoy method: reversed, permuted, shuffled_blocks
  --decoy-prefix        Prefix for decoy IDs (default: DECOY_)
  --decoy-seed          Random seed for decoy generation
  --decoys-only         Embed only decoy sequences
  --write-decoy-fasta   Write decoys to FASTA file

Input Options:
  --max-sequences       Maximum sequences to process
  --no-validate         Skip sequence validation

Output Options:
  --embed-only          Only compute embeddings, skip saving consolidated output
```
## Supported Models

### ESM2

| Model | Parameters | Embedding Dim | Memory |
|---|---|---|---|
| esm2_t33_650M_UR50D | 650M | 1280 | ~3GB |
| esm2_t30_150M_UR50D | 150M | 640 | ~1GB |
| esm2_t12_35M_UR50D | 35M | 480 | ~500MB |
| esm2_t6_8M_UR50D | 8M | 320 | ~200MB |

### ESM-1b

| Model | Parameters | Embedding Dim |
|---|---|---|
| esm1b_t33_650M_UR50S | 650M | 1280 |

### ProtBert

| Model | Embedding Dim |
|---|---|
| Rostlab/prot_bert | 1024 |
| Rostlab/prot_bert_bfd | 1024 |

### ProtT5

| Model | Embedding Dim |
|---|---|
| Rostlab/prot_t5_xl_half_uniref50-enc | 1024 |
| Rostlab/prot_t5_xl_uniref50 | 1024 |
| Rostlab/prot_t5_base_mt_uniref50 | 768 |
## API Reference

### EmbeddingPipeline

Main pipeline for computing protein embeddings.

```python
from plmembedder import EmbeddingPipeline, PipelineConfig
pipeline = EmbeddingPipeline(config: PipelineConfig = None)
# Methods
pipeline.load_sequences(fasta_path: str) -> List[Tuple[str, str]]
pipeline.generate_decoys() -> List[DecoyResult]
pipeline.compute_embeddings() -> Dict[str, EmbeddingResult]
pipeline.run(fasta_path: str, output_dir: str = None) -> Dict[str, EmbeddingResult]
pipeline.get_embedding(sequence_id: str) -> EmbeddingResult
pipeline.iterate_embeddings() -> Iterator[Tuple[str, EmbeddingResult]]
pipeline.save_embeddings(output_dir: str) -> None
```

### EmbeddingResult

Container for embedding results.

```python
from plmembedder import EmbeddingResult
result.sequence_id: str # Sequence identifier
result.sequence: str # Amino acid sequence
result.embeddings: np.ndarray # Shape: (seq_length, embedding_dim)
result.attention_weights: np.ndarray  # Optional attention weights
```

### EmbeddingCache

Manage cached embeddings.

```python
from plmembedder import EmbeddingCache
cache = EmbeddingCache(cache_dir: str, embedder_config: EmbedderConfig)
cache.get(sequence_id: str, sequence: str) -> Optional[EmbeddingResult]
cache.save(result: EmbeddingResult) -> None
cache.clear() -> None
```

### DecoyGenerator

Generate decoy sequences.

```python
from plmembedder import DecoyGenerator, DecoyConfig, DecoyType
config = DecoyConfig(
    n_decoys=1,
    decoy_type=DecoyType.REVERSED,
    decoy_prefix="DECOY_"
)
generator = DecoyGenerator(config)
decoys = generator.generate_all(sequences: List[Tuple[str, str]])
```

### Configuration

```python
from plmembedder import (
    PipelineConfig,
    EmbedderConfig,
    CacheConfig,
    DecoyConfig,
    OutputConfig,
    EmbedderType,
    DecoyType,
)
# Full configuration example
config = PipelineConfig(
    embedder=EmbedderConfig(
        embedder_type=EmbedderType.ESM2,
        model_name="esm2_t33_650M_UR50D",
        device="cuda",
        batch_size=4,
        layer=-1,
        max_sequence_length=1024,
    ),
    cache=CacheConfig(
        enabled=True,
        cache_dir="embeddings_cache",
    ),
    decoy=DecoyConfig(
        n_decoys=1,
        decoy_type=DecoyType.REVERSED,
        decoy_prefix="DECOY_",
        random_seed=42,
        decoys_only=False,
    ),
    output=OutputConfig(
        output_dir="output",
        save_embeddings=True,
        save_format="npz",
    ),
    max_sequences=None,
    validate_sequences=True,
)
```

## Integration with snaikmer

PLMEmbedder is designed to work seamlessly with snaikmer for k-mer embedding analysis:

```python
from plmembedder import EmbeddingPipeline, PipelineConfig, CacheConfig
from snaikmer import KmerEmbeddingPipeline
# Step 1: Compute embeddings with plmembedder
embed_config = PipelineConfig(
    cache=CacheConfig(enabled=True, cache_dir="cache/")
)
embed_pipeline = EmbeddingPipeline(embed_config)
embed_pipeline.load_sequences("proteins.fasta")
sequence_embeddings = embed_pipeline.compute_embeddings()
# Step 2: Pass to snaikmer for k-mer analysis
kmer_pipeline = KmerEmbeddingPipeline(kmer_config)  # kmer_config built via snaikmer's own config API
kmer_pipeline.analyze_with_embeddings(
    sequences=embed_pipeline.sequences,
    embeddings=sequence_embeddings
)
```

## Batch Processing on HPC

For large-scale embedding on HPC systems:

```bash
# Pre-compute and cache embeddings (no downstream analysis)
plmembedder proteins.fasta --embed-only --cache-embeddings -c /scratch/cache/
# Generate decoys and write to FASTA for external tools
plmembedder proteins.fasta --n-decoys 1 --write-decoy-fasta decoys.fasta --embed-only
```

## Output Format

Embeddings are saved in NumPy's compressed .npz format:

```
output/
├── embeddings.npz     # All embeddings (sequence_id -> embedding array)
└── metadata.json      # Sequence metadata and shapes
```
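For reference, a sketch of how such an archive can be produced with plain NumPy; the file names and key layout mirror the structure above, but the sequence IDs and shapes are illustrative:

```python
import json
import os
import tempfile

import numpy as np

# Toy per-residue embeddings for two hypothetical sequence IDs.
embeddings = {
    "seqA": np.zeros((10, 1280), dtype=np.float32),
    "seqB": np.ones((25, 1280), dtype=np.float32),
}

with tempfile.TemporaryDirectory() as outdir:
    # embeddings.npz: one named array per sequence ID.
    np.savez_compressed(os.path.join(outdir, "embeddings.npz"), **embeddings)

    # metadata.json: sequence IDs and array shapes.
    metadata = {sid: {"shape": list(a.shape)} for sid, a in embeddings.items()}
    with open(os.path.join(outdir, "metadata.json"), "w") as f:
        json.dump(metadata, f, indent=2)

    # Round-trip: keys and shapes survive. Read while the NpzFile's backing
    # file still exists (np.load is lazy).
    data = np.load(os.path.join(outdir, "embeddings.npz"))
    loaded_ids = sorted(data.files)
    shapes = {sid: data[sid].shape for sid in loaded_ids}

print(loaded_ids)  # ['seqA', 'seqB']
```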
Loading saved embeddings:

```python
import numpy as np
import json
# Load embeddings
data = np.load("output/embeddings.npz")
for seq_id in data.files:
    embedding = data[seq_id]  # Shape: (seq_len, embedding_dim)
# Load metadata
with open("output/metadata.json") as f:
    metadata = json.load(f)
```

## Troubleshooting

### Out of memory

- Reduce batch size: `--batch-size 1`
- Use a smaller model: `--model esm2_t6_8M_UR50D`
- Use CPU: `--device cpu`
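When falling back to CPU, it can help to choose the device programmatically rather than hard-coding it. A defensive sketch, assuming the PLM backends run on PyTorch (which the ESM and transformers dependencies suggest):

```python
# Pick "cuda" only when PyTorch is installed and a GPU is actually visible;
# otherwise fall back to "cpu". Pass the result as --device / device=...
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    device = "cpu"

print(device)
```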
### Pre-downloading models

Models are downloaded automatically on first use. To pre-download:

```python
# ESM2
import esm
esm.pretrained.esm2_t33_650M_UR50D()
# ProtBert/ProtT5
from transformers import AutoModel
AutoModel.from_pretrained("Rostlab/prot_bert")
```

### Sequence validation

By default, invalid amino acid characters are removed. To skip validation:

```bash
plmembedder proteins.fasta --no-validate -o results/
```

## License

MIT License - see LICENSE for details.
## Contributing

Contributions are welcome! Please open an issue or submit a pull request.
## Citation

If you use PLMEmbedder in your research, please cite the underlying models:
- ESM2: Lin et al. (2022). "Language models of protein sequences at the scale of evolution enable accurate structure prediction."
- ProtBert/ProtT5: Elnaggar et al. (2021). "ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing."