GP2Vec is a PyTorch implementation of Wav2Vec2-style self-supervised learning adapted for seismic waveform data. It learns robust representations from continuous seismic data hosted on S3, with optional conditioning on station metadata.
- Wav2Vec2-inspired architecture adapted for 3-component seismic data (Z, N, E)
- S3-native data pipeline for scalable access to EarthScope/SCEDC data
- Station metadata conditioning using FDSN web services
- WebDataset streaming for efficient training on large datasets
- PyTorch Lightning integration with distributed training support
- Hydra configuration for flexible experiment management
- Production-ready with comprehensive logging, monitoring, and checkpointing
GP2Vec follows the Wav2Vec2 architecture with adaptations for seismic data:
- Feature Encoder: 1D CNN that processes raw waveforms into latent representations
- Vector Quantizer: Learns discrete codebook representations (Gumbel or EMA-based)
- Context Encoder: Transformer that models temporal dependencies
- Metadata Fusion: Optional conditioning on station coordinates and instrument metadata
- Contrastive Learning: InfoNCE loss for self-supervised pretraining
- Transfer Learning: Initialize with pre-trained Wav2Vec2 weights for faster training
Waveform (3, 3000) → CNN → Features (768, T) → VQ → Quantized → Transformer → Contextual Features
                              ↑                                                      │
Station Metadata ─────────────┘                                                      ▼
                                                                             Contrastive Loss
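For orientation, here is a self-contained PyTorch sketch of how those pieces compose. Layer sizes and the additive metadata fusion are illustrative assumptions, and the quantizer and contrastive loss are omitted (they are sketched later in this document); this is not the repository's `GP2Vec` class:

import torch
import torch.nn as nn

class GP2VecSketch(nn.Module):
    """Illustrative composition of the components above; not the repo's class."""

    def __init__(self, d_model=768):
        super().__init__()
        # Feature encoder: strided 1D convs, (B, 3, 3000) -> (B, d_model, T)
        self.feature_encoder = nn.Sequential(
            nn.Conv1d(3, 512, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(512, d_model, kernel_size=8, stride=4), nn.GELU(),
        )
        # Context encoder: Transformer over latent frames
        layer = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
        self.context_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Metadata fusion: additive conditioning from a small MLP (assumed form)
        self.meta_mlp = nn.Sequential(nn.Linear(3, 64), nn.GELU(), nn.Linear(64, d_model))

    def forward(self, waveforms, meta=None):
        z = self.feature_encoder(waveforms).transpose(1, 2)  # (B, T, d_model)
        if meta is not None:
            z = z + self.meta_mlp(meta).unsqueeze(1)         # broadcast over time
        return self.context_encoder(z)                       # contextual features

model = GP2VecSketch()
out = model(torch.randn(2, 3, 3000), meta=torch.randn(2, 3))
print(out.shape)  # (2, T, 768)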
- Python ≥ 3.11
- PyTorch ≥ 2.4
- CUDA (optional, for GPU training)
Quick Setup with environment.yml:
# Clone repository
git clone https://github.com/Denolle-Lab/gp2vec.git
cd gp2vec
# Create environment from file (includes all dependencies)
conda env create -f environment.yml
conda activate gp2vec
# Install GP2Vec package in development mode
pip install -e .
# Register as Jupyter kernel (for notebook usage)
python -m ipykernel install --user --name=gp2vec --display-name="Python (gp2vec)"

Manual Setup:
# Clone repository
git clone https://github.com/Denolle-Lab/gp2vec.git
cd gp2vec
# Create conda environment with Python 3.11
conda create -n gp2vec python=3.11 -y
conda activate gp2vec
# Install PyTorch (choose appropriate version for your system)
# For CUDA 11.8:
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
# For CUDA 12.1:
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
# For CPU only:
conda install pytorch torchvision torchaudio cpuonly -c pytorch
# Install GP2Vec package and dependencies
pip install -e .
# Register as Jupyter kernel (for notebook usage)
python -m ipykernel install --user --name=gp2vec --display-name="Python (gp2vec)"
# Optional: Install development dependencies
pip install -e ".[dev]"
# Optional: Install transformers for Wav2Vec2 weight transfer
pip install transformers

# Clone repository
git clone https://github.com/Denolle-Lab/gp2vec.git
cd gp2vec
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install package
pip install -e .
# Optional: Install development dependencies
pip install -e ".[dev]"

Core dependencies are automatically installed:
- torch, torchaudio - Deep learning framework
- pytorch-lightning - Training framework
- hydra-core - Configuration management
- obspy - Seismic data processing
- s3fs, boto3 - S3 data access
- webdataset - Streaming datasets
- pandas, pyarrow - Data manipulation
GP2Vec can be initialized with pre-trained Wav2Vec2 weights for faster convergence:
# Extract Wav2Vec2 weights
python scripts/extract_wav2vec_weights.py \
--model facebook/wav2vec2-base-960h \
--output weights/wav2vec2_base.pth \
--create-dir
# Run full demonstration
python examples/wav2vec_transfer_demo.py

# Use in Python code
from gp2vec.models.gp2vec import create_gp2vec_model
# Create model
model = create_gp2vec_model("base", input_channels=3)
# Load pre-trained weights (requires transformers library)
stats = model.load_wav2vec_weights("weights/wav2vec2_base.pth")
print(f"Transferred {stats['update_ratio']:.1%} of model parameters")
# Model is now ready for seismic data training!

# Train with default configuration
python -m gp2vec.train.train
# Debug training (small model, limited data)
python -m gp2vec.train.train --config-name=experiment/debug
# Production training (large model, full dataset)
python -m gp2vec.train.train --config-name=experiment/production

# Full pipeline: data preparation + training
./scripts/pretrain.sh
# With custom configuration
./scripts/pretrain.sh configs/experiment/debug.yaml
# Override specific parameters
./scripts/pretrain.sh --data.batch_size=32 --model.embed_dim=512

from torch.utils.data import DataLoader
from gp2vec.data.datapipes import SeismicDataPipeline
from gp2vec.data.metadata import StationMetadataManager
# Set up data pipeline
pipeline = SeismicDataPipeline(
    manifest_path="cache/manifest.parquet",
    metadata_manager=StationMetadataManager(),
    target_sampling_rate=100.0,
    window_length=30.0,
)
# Create streaming dataset
dataset = pipeline.create_webdataset(shard_size=10000)
dataloader = DataLoader(dataset, batch_size=32, num_workers=4)

import torch
from gp2vec.models.gp2vec import GP2Vec
# Create model
model = GP2Vec()
# Example waveform data (batch_size=4, channels=3, time=3000)
waveforms = torch.randn(4, 3, 3000)
# Station metadata
metadata = {
    'latitude': torch.tensor([34.0, 34.1, 34.2, 34.3]),
    'longitude': torch.tensor([-118.2, -118.3, -118.4, -118.5]),
    'elevation': torch.tensor([100.0, 150.0, 200.0, 250.0]),
}
# Extract features
model.eval()
with torch.no_grad():
    features = model.encode(waveforms, metadata)  # (4, T, 768)

- EarthScope/SCEDC: Primary data source via S3 (s3://scedc-pds/continuous_waveforms/)
- Custom S3 buckets: Any S3-compatible storage with miniSEED files
- Local files: Direct file system access for smaller datasets
Load real seismic data from SCEDC S3 bucket:
from torch.utils.data import DataLoader
from gp2vec.data.s3_manifest import SCEDCSeismicDataset
# Create dataset - direct S3 access (no credentials needed for public bucket)
dataset = SCEDCSeismicDataset(
    start_date="2023-01-01",
    num_days=7,
    networks=["CI"],                 # Southern California Seismic Network
    stations=["ADE", "ADO", "BAR"],  # Select stations
    channels=["BHE", "BHN", "BHZ"],  # 3-component broadband
    sample_length_sec=30.0,
    sample_rate=100.0,
    samples_per_day=10,
)
# Create DataLoader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)
# Iterate through batches
for batch in dataloader:
    waveforms = batch['waveform']    # (batch_size, 3, 3000) - 3 components, 30 s @ 100 Hz
    metadata = batch['metadata']     # (batch_size, 4) - lat, lon, elev, timestamp
    station_ids = batch['station_id']
    # Your training code here...
    break

For large-scale training with WebDataset shards:
- Build Manifest:
python scripts/make_manifest.py \
--bucket scedc-pds \
--prefix continuous_waveforms/ \
--output cache/manifest.parquet \
--networks CI AZ US TA

- Fetch Metadata:
python scripts/fetch_metadata.py \
--client IRIS \
--output cache/metadata/stations_iris.parquet \
--extract-features

- Preprocessing: Detrending, demeaning, normalization, quality control (sketched after this list)
- Windowing: Configurable window length with overlap
- Augmentation: Time shifts, amplitude scaling, noise injection, filtering
- Streaming: WebDataset-based pipeline for scalable training
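As a rough illustration of the preprocessing and windowing steps above, here is a minimal NumPy sketch; the helper name and the exact normalization choices are assumptions, not the `gp2vec.data` implementation:

import numpy as np

def preprocess_and_window(trace, sampling_rate=100.0, window_length=30.0, overlap=0.5):
    """Demean, detrend, normalize, then cut into overlapping windows (illustrative)."""
    x = np.asarray(trace, dtype=np.float32)
    x = x - x.mean()                                   # demean
    x = x - np.linspace(x[0], x[-1], x.size)           # crude linear detrend
    x = x / (np.abs(x).max() + 1e-8)                   # peak-amplitude normalization

    win = int(window_length * sampling_rate)           # samples per window (3000)
    hop = int(win * (1.0 - overlap))                   # hop size for 50% overlap
    windows = [x[i:i + win] for i in range(0, x.size - win + 1, hop)]
    return np.stack(windows) if windows else np.empty((0, win), dtype=np.float32)

# 60 s of synthetic data -> three 30 s windows at 50% overlap
segments = preprocess_and_window(np.random.randn(6000))
print(segments.shape)  # (3, 3000)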
GP2Vec uses Hydra for configuration management. Configurations are organized as:
configs/
├── config.yaml          # Main config file
├── data/
│   ├── default.yaml     # Default data settings
│   └── small.yaml       # Small dataset for testing
├── model/
│   ├── default.yaml     # Default model architecture
│   └── small.yaml       # Small model for debugging
├── train/
│   ├── default.yaml     # Default training settings
│   └── debug.yaml       # Debug training settings
└── experiment/
    ├── debug.yaml       # Complete debug experiment
    └── production.yaml  # Production experiment
# data/default.yaml
processing:
  target_sampling_rate: 100.0
  window_length: 30.0
  overlap: 0.5
augmentation:
  enabled: true
  probability: 0.8
  time_shift:
    max_shift: 0.1
metadata:
  enabled: true
  fdsn_client: IRIS

# model/default.yaml
feature_encoder:
  conv_layers:
    - {channels: 64, kernel_size: 10, stride: 5}
    - {channels: 128, kernel_size: 8, stride: 4}
    # ...
context_encoder:
  embed_dim: 768
  num_heads: 12
  num_layers: 12
quantizer:
  type: gumbel
  codebook_size: 320
  num_codebooks: 2

# train/default.yaml
optimizer:
  name: adamw
  lr: 1e-4
  weight_decay: 0.01
lr_scheduler:
  name: cosine_with_warmup
  warmup_steps: 10000
trainer:
  max_epochs: 100
  devices: auto
  strategy: ddp
  precision: 16-mixed

# Override from command line
python -m gp2vec.train.train \
data.batch_size=64 \
model.embed_dim=1024 \
train.max_epochs=200
# Use different configs
python -m gp2vec.train.train \
--config-path=configs \
--config-name=experiment/production

GP2Vec includes tools for evaluating learned representations on downstream tasks:
from gp2vec.train.evaluate_downstream import DownstreamEvaluator
# Load pretrained model
evaluator = DownstreamEvaluator(model)
# Evaluate on phase picking
results = evaluator.evaluate_phase_picking(
    waveforms, pick_labels, metadata
)

# Evaluate on tremor detection
results = evaluator.evaluate_tremor_detection(
    waveforms, tremor_labels, metadata
)

- Phase Picking: Binary classification of P/S wave arrivals
- Tremor Detection: Binary classification of tremor vs. normal signals
- Magnitude Estimation: Multi-class classification of earthquake magnitude bins
- Linear Probing: Freeze backbone, train linear classifier
- Sklearn Evaluation: Extract features, train sklearn models (see the sketch after this list)
- Fine-tuning: End-to-end fine-tuning (planned)
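A minimal sketch of the linear probing / sklearn protocol, assuming a pretrained `model` exposing the `encode` method shown earlier and pre-loaded `waveforms`, `metadata`, and binary `labels`:

import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Pool contextual features into one vector per window using the frozen backbone
model.eval()
with torch.no_grad():
    feats = model.encode(waveforms, metadata).mean(dim=1).cpu().numpy()  # (N, 768)

# Linear probing: the backbone stays frozen; only the linear head is trained
X_tr, X_te, y_tr, y_te = train_test_split(feats, labels, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))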
GP2Vec supports various distributed training strategies:
# Multi-GPU training
train:
  trainer:
    devices: 4
    strategy: ddp

# Multi-node training
train:
  trainer:
    devices: 8
    num_nodes: 4
    strategy: ddp

- DDP (DistributedDataParallel): Standard multi-GPU training
- FSDP (FullyShardedDataParallel): Memory-efficient training for large models
- DeepSpeed: Advanced optimization strategies
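The same strategies can also be set directly on a PyTorch Lightning `Trainer`. A sketch assuming an existing `LightningModule` (`lit_model`) and a training `dataloader`:

import pytorch_lightning as pl

# lit_model: your LightningModule wrapping GP2Vec; dataloader: training data
trainer = pl.Trainer(
    devices=4,              # GPUs per node
    num_nodes=1,            # raise for multi-node jobs
    strategy="ddp",         # or "fsdp" / "deepspeed" for very large models
    precision="16-mixed",   # mixed-precision training
    max_epochs=100,
)
trainer.fit(lit_model, dataloader)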
wandb:
  enabled: true
  project: gp2vec
  entity: your-team
  tags: [self-supervised, seismic]

loggers:
  tensorboard:
    enabled: true
    log_graph: true

Interactive progress bars with training metrics, GPU utilization, and throughput monitoring.
See the examples/ directory for complete usage examples:
- basic_training.py: Simple training script
- data_pipeline.py: Data loading and processing
- model_usage.py: Model inference and feature extraction
gp2vec/
├── src/gp2vec/
│   ├── data/        # Data loading and processing
│   ├── models/      # Model architectures
│   ├── train/       # Training and evaluation
│   └── utils/       # Utilities
├── configs/         # Hydra configurations
├── scripts/         # Operational scripts
├── examples/        # Usage examples
└── tests/           # Unit tests (planned)
The project uses modern Python tooling:
- Black: Code formatting
- Ruff: Fast linting and formatting
- MyPy: Static type checking
- Pytest: Unit testing framework
# Format code
black src/ examples/ scripts/
# Lint code
ruff check src/ examples/ scripts/
# Type checking
mypy src/gp2vec

- Fork the repository
- Create a feature branch
- Make your changes with tests
- Run quality checks
- Submit a pull request
- RAM: 16GB+ (32GB+ recommended for large datasets)
- Storage: 100GB+ for data caching
- GPU: 8GB+ VRAM (16GB+ recommended)
GP2Vec is designed for cloud deployment with:
- AWS S3 integration for data access
- Kubernetes deployment support
- Docker containerization
- Multi-node distributed training
If you use GP2Vec in your research, please cite:
@software{gp2vec2025,
  title={GP2Vec: Self-Supervised Learning for Geophysical Waveform Representation},
  author={Marine Denolle},
  year={2025},
  url={https://github.com/Denolle-Lab/gp2vec}
}

This project is licensed under the GNU General Public License v3.0 (GPL-3.0).
GP2Vec uses GPL-3.0 to ensure compatibility with key dependencies:
- SeisBench: GPL-3.0 (seismic deep learning benchmarks)
- ObsPy: LGPL-3.0 (seismological data processing)
- Wav2Vec2/Fairseq: MIT (compatible with GPL-3.0)
The GPL-3.0 license ensures that:
- ✅ All modifications and derivative works remain open source
- ✅ Full compatibility with the seismological research software ecosystem
- ✅ Community contributions are preserved for scientific progress
- ✅ Users receive complete source code and modification rights
See the LICENSE file for the complete terms.
This project incorporates or builds upon:
- Wav2Vec2 (Meta AI/Facebook): MIT License
- ObsPy: LGPL v3.0
- SeisBench: GPL v3.0
- PyTorch: BSD-style License
- PyTorch Lightning: Apache 2.0
All third-party licenses are compatible with GPL-3.0.
- Documentation: Full documentation (planned)
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Wav2Vec2-style architecture for seismic data
- S3-native data pipeline with WebDataset streaming
- Station metadata conditioning
- PyTorch Lightning training framework
- Hydra configuration system
- Distributed training support
- Downstream evaluation tools
- Additional downstream tasks (magnitude estimation, source characterization)
- Pre-trained model zoo
- Advanced augmentation strategies
- Real-time inference capabilities
- Integration with seismological workflows
- Comprehensive documentation and tutorials
- Multi-modal learning (waveforms + spectrograms + metadata)
- Federated learning across seismic networks
- Integration with physics-informed neural networks
- Uncertainty quantification in representations
GP2Vec - Advancing seismology through self-supervised representation learning
This document outlines an end-to-end plan to build a Wav2Vec 2.0-style self-supervised foundation model for seismology. It covers data access from EarthScope S3, station metadata integration, model architecture, training strategy, and repository scaffolding suitable for hand-off to an engineering team or code-generation system.
- Adopt a Wav2Vec 2.0 objective on continuous seismic waveforms: mask latent spans, predict quantized targets, and reuse the encoder for downstream tasks such as phase picking, tremor characterization, magnitude estimation, and slow slip proxies.[^1]
- Train on long, contiguous segments, masking contiguous time spans and sampling negatives from other temporal windows or channels; support multi-component inputs (Z/N/E) and targeted volcano or fault scenarios as in the reference study.[^1]
Two supported approaches:
- Exchange EarthScope credentials for temporary AWS keys using the SDK, then use `boto3` to list and fetch miniSEED objects.[^2]
- Follow the `earthscope-s3-direct-access` examples to read data with `s3fs` or `boto3`, respecting the documented bucket and prefix structure.[^3]
Python sketch:
import io
import s3fs
from obspy import read
fs = s3fs.S3FileSystem(anon=True) # or provide credentials from EarthScope SDK / boto3
with fs.open("s3://<earthscope-bucket>/<prefix>/<file>.mseed", "rb") as f:
    st = read(io.BytesIO(f.read()))  # ObsPy accepts file-like objects

Adjust bucket and prefix paths per the EarthScope documentation.[^2]
- Retrieve StationXML via `obspy.clients.fdsn.Client.get_stations`, requesting `level="response"` to obtain instrument metadata.[^4]
- Extract key fields (lat/lon/elev, instrument/datalogger descriptions, azimuth, dip, sampling rate, response gain) using ObsPy `Inventory` APIs; optionally leverage wrappers like `fdsn_station_info` for CSV exports.[^5]
Python sketch:
from obspy import UTCDateTime
from obspy.clients.fdsn import Client

t0, t1 = UTCDateTime("2023-01-01"), UTCDateTime("2023-01-02")
inv = Client("IRIS").get_stations(
    network="IU", station="ANMO", level="response", starttime=t0, endtime=t1
)
net = inv.networks[0]
sta = net.stations[0]
ch = sta.channels[0]
meta = {
    "network": net.code,
    "station": sta.code,
    "loc": ch.location_code,
    "chan": ch.code,
    "latitude": sta.latitude,
    "longitude": sta.longitude,
    "elevation": sta.elevation,
    "azimuth": ch.azimuth,
    "dip": ch.dip,
    "sample_rate": ch.sample_rate,
    "response_gain": (
        ch.response.instrument_sensitivity.value if ch.response else None
    ),
    "sensor": ch.sensor.description if ch.sensor else None,
    "datalogger": ch.data_logger.description if ch.data_logger else None,
}

- Categorical variables (network, station, location, channel, sensor, datalogger): map to learnable embeddings (16–64 dimensions each).
- Continuous variables (latitude, longitude, elevation, azimuth, dip, sample rate, gain): normalize and feed through a small MLP to produce dense features.
- Response curves: summarize via scalar features or sample log-amplitude response over fixed frequencies, then encode with a 1D convolution.
- Concatenate and project the metadata embedding to the model dimension, fusing with waveform tokens via additive conditioning, cross-attention, or FiLM-style scaling.
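A minimal PyTorch sketch of such a metadata encoder; the field set, embedding widths, and class name are illustrative assumptions rather than a prescribed implementation:

import torch
import torch.nn as nn

class MetadataEncoder(nn.Module):
    """Embed categorical + continuous station metadata into the model dimension."""

    def __init__(self, num_networks, num_sensors, d_model=768):
        super().__init__()
        self.network_emb = nn.Embedding(num_networks, 32)   # categorical fields
        self.sensor_emb = nn.Embedding(num_sensors, 32)
        self.cont_mlp = nn.Sequential(                      # 7 normalized continuous
            nn.Linear(7, 64), nn.GELU(), nn.Linear(64, 64)  # fields: lat, lon, elev,
        )                                                   # azimuth, dip, rate, gain
        self.proj = nn.Linear(32 + 32 + 64, d_model)        # project to model dim

    def forward(self, network_id, sensor_id, continuous):
        parts = [
            self.network_emb(network_id),  # (B, 32)
            self.sensor_emb(sensor_id),    # (B, 32)
            self.cont_mlp(continuous),     # (B, 64)
        ]
        return self.proj(torch.cat(parts, dim=-1))  # (B, d_model)

# Additive fusion: tokens = tokens + meta_vec.unsqueeze(1)
# FiLM-style: gamma, beta = film_head(meta_vec).chunk(2, -1); tokens = gamma * tokens + beta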
- Feature encoder: 1D convolution stack over single or tri-axial channels with 10–25 ms kernels; downsample to a ~50–100 Hz latent frame rate.
- Vector quantizer: Gumbel-softmax or EMA k-means codebook to produce discrete targets with diversity regularization.
- Context network: Transformer encoder (8–24 layers, 256–768 hidden units) with random masking of contiguous latent spans (covering roughly 30–60% of steps).
- Loss: Contrastive InfoNCE objective between masked positions and quantized targets plus a codebook diversity loss (see the sketch after this list).
- Augmentations: Random bandpass filtering, amplitude jitter, Gaussian noise, time shifts, channel dropout, and optional instrument response removal using metadata inventories.
- Optional multitask heads: Predict sampling-rate bins, channel families, or station regions from masked contexts to encourage invariance.
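To make the span masking and contrastive objective concrete, here is a self-contained sketch; the masking probability, span length, and cosine-similarity logits approximate the Wav2Vec 2.0 recipe, and the codebook diversity term is left out:

import torch
import torch.nn.functional as F

def span_mask(batch, time_steps, mask_prob=0.065, span=10):
    """Sample contiguous masked spans over latent frames (Wav2Vec 2.0 style)."""
    mask = torch.zeros(batch, time_steps, dtype=torch.bool)
    starts = torch.rand(batch, time_steps) < mask_prob   # candidate span starts
    for b, t in starts.nonzero(as_tuple=False):
        mask[b, t:t + span] = True                       # spans may overlap
    return mask

def info_nce(context, targets, negatives, temperature=0.1):
    """Contrastive loss between masked-position contexts and quantized targets.

    context, targets: (N, D); negatives: (N, K, D) sampled from other positions.
    """
    pos = F.cosine_similarity(context, targets, dim=-1)                 # (N,)
    neg = F.cosine_similarity(context.unsqueeze(1), negatives, dim=-1)  # (N, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1) / temperature    # (N, 1+K)
    # The true (quantized) target sits at index 0 of every row
    return F.cross_entropy(logits, torch.zeros(len(logits), dtype=torch.long))

# Roughly 30-60% of frames end up masked with these defaults
mask = span_mask(batch=4, time_steps=150)
loss = info_nce(torch.randn(8, 768), torch.randn(8, 768), torch.randn(8, 100, 768))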
- Catalog: List S3 keys (organized by network/station/year/day) via `s3fs` or `boto3`; store a Parquet manifest containing object key, time bounds, sampling rate, and byte size (a minimal sketch follows this list).[^3]
- Sharding: Assemble WebDataset or torchdata shards (1–2 GB) comprising 20–60 second windows and context negatives.
- Streaming: Use torchdata/DataPipes or WebDataset to stream from S3, decode miniSEED with ObsPy, resample, optionally remove instrument response, and cache as needed.
- Metadata join: Merge station/channel IDs with pre-fetched StationXML features and attach metadata embeddings to each sample.[^4]
- Distributed training: Train via PyTorch Lightning with FSDP or DeepSpeed, mixed precision, EMA, and periodic checkpoints written back to S3.
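As referenced in the Catalog item above, a minimal manifest-building sketch using `s3fs` and pandas; the SCEDC prefix and file extensions are assumptions, and in practice the listing should be narrowed to a network/day prefix:

import os
import pandas as pd
import s3fs

# Anonymous access to the public SCEDC bucket; prefix, year, and the .ms/.mseed
# extensions are assumptions - adjust to the documented bucket layout.
fs = s3fs.S3FileSystem(anon=True)
listing = fs.find("scedc-pds/continuous_waveforms/2023", detail=True)

rows = [
    {"key": key, "size_bytes": info["size"]}
    for key, info in listing.items()
    if key.endswith((".ms", ".mseed"))
]

os.makedirs("cache", exist_ok=True)
pd.DataFrame(rows).to_parquet("cache/manifest.parquet", index=False)
print(f"Indexed {len(rows)} miniSEED objects")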
Quick Start: Load SCEDC Real Data (No Setup Required)
# Simple script to load real seismic data from SCEDC S3
from torch.utils.data import DataLoader
from gp2vec.data.s3_manifest import SCEDCSeismicDataset
# Create dataset - works out of the box with anonymous S3 access
dataset = SCEDCSeismicDataset(
    start_date="2023-01-01",
    num_days=3,
    networks=["CI"],
    stations=["ADE", "ADO"],
    channels=["BHE", "BHN", "BHZ"],
    sample_length_sec=30.0,
    sample_rate=100.0,
)

# Load data
loader = DataLoader(dataset, batch_size=16, num_workers=2)
for batch in loader:
    waveforms = batch['waveform']  # (16, 3, 3000)
    print(f"Loaded {waveforms.shape} from {batch['station_id']}")
    break

List and stream from EarthScope S3:
import io
import random
import obspy
import s3fs
fs = s3fs.S3FileSystem(anon=True) # or authenticate via key/secret/token
keys = fs.ls("s3://<earthscope-bucket>/<miniSEED-prefix>/")
key = random.choice([k for k in keys if k.endswith(".mseed")])
with fs.open(key, "rb") as f:
    st = obspy.read(io.BytesIO(f.read()))
st.merge(fill_value="interpolate")
st.detrend("linear")
st.taper(0.05)

Match the bucket and prefix naming per the EarthScope documentation.[^2]
Instrument correction (optional):
st.remove_response(inventory=inv, output="VEL")  # requires matching Inventory metadata
st.filter("bandpass", freqmin=0.1, freqmax=20.0, corners=4, zerophase=True)

- Johnson, B., et al. (2025). Automatic speech recognition predicts contemporaneous earthquake fault slip and tremor. Nature Communications.[^1]
- EarthScope SDK: Direct S3 Access to SAGE miniSEED data repository.[^2]
- earthscope-s3-direct-access tutorial repository.[^3]
- ObsPy Inventory and FDSN documentation.[^4]
- fdsn_station_info ObsPy wrapper.[^5]
Footnotes

[^1]: Johnson, B., et al. (2025). Automatic speech recognition predicts contemporaneous earthquake fault slip and tremor. Nature Communications. https://www.nature.com/articles/s41467-025-55994-9.pdf
[^2]: EarthScope SDK Documentation. Direct S3 Access to SAGE miniSEED data repository. https://docs.earthscope.org/projects/SDK/en/stable/content/s3_direct_access_tutorial.html
[^3]: Niyiyu, Y. earthscope-s3-direct-access (GitHub repository). https://github.com/niyiyu/earthscope-s3-direct-access
[^4]: ObsPy Documentation. obspy.core.inventory. https://docs.obspy.org/packages/autogen/obspy.core.inventory.html
[^5]: Flyrok. fdsn_station_info (GitHub repository). https://github.com/flyrok/fdsn_station_info