GP2Vec is a PyTorch implementation of Wav2Vec2-style self-supervised learning adapted for seismic waveform data. It learns robust representations from continuous seismic data hosted on S3, with optional conditioning on station metadata.
- Wav2Vec2-inspired architecture adapted for 3-component seismic data (Z, N, E)
- S3-native data pipeline for scalable access to EarthScope/SCEDC data
- Station metadata conditioning using FDSN web services
- WebDataset streaming for efficient training on large datasets
- PyTorch Lightning integration with distributed training support
- Hydra configuration for flexible experiment management
- Production-ready with comprehensive logging, monitoring, and checkpointing
GP2Vec follows the Wav2Vec2 architecture with adaptations for seismic data:
- Feature Encoder: 1D CNN that processes raw waveforms into latent representations
- Vector Quantizer: Learns discrete codebook representations (Gumbel or EMA-based)
- Context Encoder: Transformer that models temporal dependencies
- Metadata Fusion: Optional conditioning on station coordinates and instrument metadata
- Contrastive Learning: InfoNCE loss for self-supervised pretraining
- Transfer Learning: Initialize with pre-trained Wav2Vec2 weights for faster training
Waveform (3, 3000) → CNN → Features (768, T) → VQ → Quantized → Transformer → Contextual Features
                              ↑                                                      │
Station Metadata ─────────────┘                                                      ▼
                                                                             Contrastive Loss
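For orientation, here is a self-contained PyTorch sketch of how those pieces compose. Layer sizes and the additive metadata fusion are illustrative assumptions, and the quantizer and contrastive loss are omitted (they are sketched later in this document); this is not the repository's `GP2Vec` class:

import torch
import torch.nn as nn

class GP2VecSketch(nn.Module):
    """Illustrative composition of the components above; not the repo's class."""

    def __init__(self, d_model=768):
        super().__init__()
        # Feature encoder: strided 1D convs, (B, 3, 3000) -> (B, d_model, T)
        self.feature_encoder = nn.Sequential(
            nn.Conv1d(3, 512, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(512, d_model, kernel_size=8, stride=4), nn.GELU(),
        )
        # Context encoder: Transformer over latent frames
        layer = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
        self.context_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Metadata fusion: additive conditioning from a small MLP (assumed form)
        self.meta_mlp = nn.Sequential(nn.Linear(3, 64), nn.GELU(), nn.Linear(64, d_model))

    def forward(self, waveforms, meta=None):
        z = self.feature_encoder(waveforms).transpose(1, 2)  # (B, T, d_model)
        if meta is not None:
            z = z + self.meta_mlp(meta).unsqueeze(1)         # broadcast over time
        return self.context_encoder(z)                       # contextual features

model = GP2VecSketch()
out = model(torch.randn(2, 3, 3000), meta=torch.randn(2, 3))
print(out.shape)  # (2, T, 768)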
- Python ≥ 3.11
- PyTorch ≥ 2.4
- CUDA (optional, for GPU training)
Quick Setup with environment.yml:
# Clone repository
git clone https://github.com/Denolle-Lab/gp2vec.git
cd gp2vec
# Create environment from file (includes all dependencies)
conda env create -f environment.yml
conda activate gp2vec
# Install GP2Vec package in development mode
pip install -e .
# Register as Jupyter kernel (for notebook usage)
python -m ipykernel install --user --name=gp2vec --display-name="Python (gp2vec)"

Manual Setup:
# Clone repository
git clone https://github.com/Denolle-Lab/gp2vec.git
cd gp2vec
# Create conda environment with Python 3.11
conda create -n gp2vec python=3.11 -y
conda activate gp2vec
# Install PyTorch (choose appropriate version for your system)
# For CUDA 11.8:
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
# For CUDA 12.1:
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
# For CPU only:
conda install pytorch torchvision torchaudio cpuonly -c pytorch
# Install GP2Vec package and dependencies
pip install -e .
# Register as Jupyter kernel (for notebook usage)
python -m ipykernel install --user --name=gp2vec --display-name="Python (gp2vec)"
# Optional: Install development dependencies
pip install -e ".[dev]"
# Optional: Install transformers for Wav2Vec2 weight transfer
pip install transformers

# Clone repository
git clone https://github.com/Denolle-Lab/gp2vec.git
cd gp2vec
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install package
pip install -e .
# Optional: Install development dependencies
pip install -e ".[dev]"

Core dependencies are automatically installed:
- torch, torchaudio - Deep learning framework
- pytorch-lightning - Training framework
- hydra-core - Configuration management
- obspy - Seismic data processing
- s3fs, boto3 - S3 data access
- webdataset - Streaming datasets
- pandas, pyarrow - Data manipulation
GP2Vec can be initialized with pre-trained Wav2Vec2 weights for faster convergence:
# Extract Wav2Vec2 weights
python scripts/extract_wav2vec_weights.py \
--model facebook/wav2vec2-base-960h \
--output weights/wav2vec2_base.pth \
--create-dir
# Run full demonstration
python examples/wav2vec_transfer_demo.py

# Use in Python code
from gp2vec.models.gp2vec import create_gp2vec_model
# Create model
model = create_gp2vec_model("base", input_channels=3)
# Load pre-trained weights (requires transformers library)
stats = model.load_wav2vec_weights("weights/wav2vec2_base.pth")
print(f"Transferred {stats['update_ratio']:.1%} of model parameters")
# Model is now ready for seismic data training!

# Train with default configuration
python -m gp2vec.train.train
# Debug training (small model, limited data)
python -m gp2vec.train.train --config-name=experiment/debug
# Production training (large model, full dataset)
python -m gp2vec.train.train --config-name=experiment/production

# Full pipeline: data preparation + training
./scripts/pretrain.sh
# With custom configuration
./scripts/pretrain.sh configs/experiment/debug.yaml
# Override specific parameters
./scripts/pretrain.sh --data.batch_size=32 --model.embed_dim=512

from torch.utils.data import DataLoader
from gp2vec.data.datapipes import SeismicDataPipeline
from gp2vec.data.metadata import StationMetadataManager
# Set up data pipeline
pipeline = SeismicDataPipeline(
    manifest_path="cache/manifest.parquet",
    metadata_manager=StationMetadataManager(),
    target_sampling_rate=100.0,
    window_length=30.0,
)
# Create streaming dataset
dataset = pipeline.create_webdataset(shard_size=10000)
dataloader = DataLoader(dataset, batch_size=32, num_workers=4)

import torch
from gp2vec.models.gp2vec import GP2Vec
# Create model
model = GP2Vec()
# Example waveform data (batch_size=4, channels=3, time=3000)
waveforms = torch.randn(4, 3, 3000)
# Station metadata
metadata = {
    'latitude': torch.tensor([34.0, 34.1, 34.2, 34.3]),
    'longitude': torch.tensor([-118.2, -118.3, -118.4, -118.5]),
    'elevation': torch.tensor([100.0, 150.0, 200.0, 250.0]),
}
# Extract features
model.eval()
with torch.no_grad():
    features = model.encode(waveforms, metadata)  # (4, T, 768)

- EarthScope/SCEDC: Primary data source via S3 (s3://scedc-pds/continuous_waveforms/)
- Custom S3 buckets: Any S3-compatible storage with miniSEED files
- Local files: Direct file system access for smaller datasets
Load real seismic data from SCEDC S3 bucket:
from torch.utils.data import DataLoader
from gp2vec.data.s3_manifest import SCEDCSeismicDataset
# Create dataset - direct S3 access (no credentials needed for public bucket)
dataset = SCEDCSeismicDataset(
    start_date="2023-01-01",
    num_days=7,
    networks=["CI"],                 # Southern California Seismic Network
    stations=["ADE", "ADO", "BAR"],  # Select stations
    channels=["BHE", "BHN", "BHZ"],  # 3-component broadband
    sample_length_sec=30.0,
    sample_rate=100.0,
    samples_per_day=10,
)
# Create DataLoader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)
# Iterate through batches
for batch in dataloader:
    waveforms = batch['waveform']    # (batch_size, 3, 3000) - 3 components, 30 s @ 100 Hz
    metadata = batch['metadata']     # (batch_size, 4) - lat, lon, elev, timestamp
    station_ids = batch['station_id']
    # Your training code here...
    break

For large-scale training with WebDataset shards:
- Build Manifest:
python scripts/make_manifest.py \
--bucket scedc-pds \
--prefix continuous_waveforms/ \
--output cache/manifest.parquet \
--networks CI AZ US TA

- Fetch Metadata:
python scripts/fetch_metadata.py \
--client IRIS \
--output cache/metadata/stations_iris.parquet \
--extract-features

- Preprocessing: Detrending, demeaning, normalization, quality control (sketched after this list)
- Windowing: Configurable window length with overlap
- Augmentation: Time shifts, amplitude scaling, noise injection, filtering
- Streaming: WebDataset-based pipeline for scalable training
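As a rough illustration of the preprocessing and windowing steps above, here is a minimal NumPy sketch; the helper name and the exact normalization choices are assumptions, not the `gp2vec.data` implementation:

import numpy as np

def preprocess_and_window(trace, sampling_rate=100.0, window_length=30.0, overlap=0.5):
    """Demean, detrend, normalize, then cut into overlapping windows (illustrative)."""
    x = np.asarray(trace, dtype=np.float32)
    x = x - x.mean()                                   # demean
    x = x - np.linspace(x[0], x[-1], x.size)           # crude linear detrend
    x = x / (np.abs(x).max() + 1e-8)                   # peak-amplitude normalization

    win = int(window_length * sampling_rate)           # samples per window (3000)
    hop = int(win * (1.0 - overlap))                   # hop size for 50% overlap
    windows = [x[i:i + win] for i in range(0, x.size - win + 1, hop)]
    return np.stack(windows) if windows else np.empty((0, win), dtype=np.float32)

# 60 s of synthetic data -> three 30 s windows at 50% overlap
segments = preprocess_and_window(np.random.randn(6000))
print(segments.shape)  # (3, 3000)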
GP2Vec uses Hydra for configuration management. Configurations are organized as:
configs/
├── config.yaml          # Main config file
├── data/
│   ├── default.yaml     # Default data settings
│   └── small.yaml       # Small dataset for testing
├── model/
│   ├── default.yaml     # Default model architecture
│   └── small.yaml       # Small model for debugging
├── train/
│   ├── default.yaml     # Default training settings
│   └── debug.yaml       # Debug training settings
└── experiment/
    ├── debug.yaml       # Complete debug experiment
    └── production.yaml  # Production experiment
# data/default.yaml
processing:
  target_sampling_rate: 100.0
  window_length: 30.0
  overlap: 0.5
augmentation:
  enabled: true
  probability: 0.8
  time_shift:
    max_shift: 0.1
metadata:
  enabled: true
  fdsn_client: IRIS

# model/default.yaml
feature_encoder:
  conv_layers:
    - {channels: 64, kernel_size: 10, stride: 5}
    - {channels: 128, kernel_size: 8, stride: 4}
    # ...
context_encoder:
  embed_dim: 768
  num_heads: 12
  num_layers: 12
quantizer:
  type: gumbel
  codebook_size: 320
  num_codebooks: 2

# train/default.yaml
optimizer:
  name: adamw
  lr: 1e-4
  weight_decay: 0.01
lr_scheduler:
  name: cosine_with_warmup
  warmup_steps: 10000
trainer:
  max_epochs: 100
  devices: auto
  strategy: ddp
  precision: 16-mixed

# Override from command line
python -m gp2vec.train.train \
data.batch_size=64 \
model.embed_dim=1024 \
train.max_epochs=200
# Use different configs
python -m gp2vec.train.train \
--config-path=configs \
--config-name=experiment/production

GP2Vec includes tools for evaluating learned representations on downstream tasks:
from gp2vec.train.evaluate_downstream import DownstreamEvaluator
# Load pretrained model
evaluator = DownstreamEvaluator(model)
# Evaluate on phase picking
results = evaluator.evaluate_phase_picking(
    waveforms, pick_labels, metadata
)

# Evaluate on tremor detection
results = evaluator.evaluate_tremor_detection(
    waveforms, tremor_labels, metadata
)

- Phase Picking: Binary classification of P/S wave arrivals
- Tremor Detection: Binary classification of tremor vs. normal signals
- Magnitude Estimation: Multi-class classification of earthquake magnitude bins
- Linear Probing: Freeze backbone, train linear classifier
- Sklearn Evaluation: Extract features, train sklearn models (see the sketch after this list)
- Fine-tuning: End-to-end fine-tuning (planned)
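A minimal sketch of the linear probing / sklearn protocol, assuming a pretrained `model` exposing the `encode` method shown earlier and pre-loaded `waveforms`, `metadata`, and binary `labels`:

import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Pool contextual features into one vector per window using the frozen backbone
model.eval()
with torch.no_grad():
    feats = model.encode(waveforms, metadata).mean(dim=1).cpu().numpy()  # (N, 768)

# Linear probing: the backbone stays frozen; only the linear head is trained
X_tr, X_te, y_tr, y_te = train_test_split(feats, labels, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))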
GP2Vec supports various distributed training strategies:
# Multi-GPU training
train:
  trainer:
    devices: 4
    strategy: ddp

# Multi-node training
train:
  trainer:
    devices: 8
    num_nodes: 4
    strategy: ddp

- DDP (DistributedDataParallel): Standard multi-GPU training
- FSDP (FullyShardedDataParallel): Memory-efficient training for large models
- DeepSpeed: Advanced optimization strategies
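The same strategies can also be set directly on a PyTorch Lightning `Trainer`. A sketch assuming an existing `LightningModule` (`lit_model`) and a training `dataloader`:

import pytorch_lightning as pl

# lit_model: your LightningModule wrapping GP2Vec; dataloader: training data
trainer = pl.Trainer(
    devices=4,              # GPUs per node
    num_nodes=1,            # raise for multi-node jobs
    strategy="ddp",         # or "fsdp" / "deepspeed" for very large models
    precision="16-mixed",   # mixed-precision training
    max_epochs=100,
)
trainer.fit(lit_model, dataloader)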
wandb:
  enabled: true
  project: gp2vec
  entity: your-team
  tags: [self-supervised, seismic]

loggers:
  tensorboard:
    enabled: true
    log_graph: true

Interactive progress bars with training metrics, GPU utilization, and throughput monitoring.
See the examples/ directory for complete usage examples:
- basic_training.py: Simple training script
- data_pipeline.py: Data loading and processing
- model_usage.py: Model inference and feature extraction
gp2vec/
├── src/gp2vec/
│   ├── data/        # Data loading and processing
│   ├── models/      # Model architectures
│   ├── train/       # Training and evaluation
│   └── utils/       # Utilities
├── configs/         # Hydra configurations
├── scripts/         # Operational scripts
├── examples/        # Usage examples
└── tests/           # Unit tests (planned)
The project uses modern Python tooling:
- Black: Code formatting
- Ruff: Fast linting and formatting
- MyPy: Static type checking
- Pytest: Unit testing framework
# Format code
black src/ examples/ scripts/
# Lint code
ruff check src/ examples/ scripts/
# Type checking
mypy src/gp2vec

- Fork the repository
- Create a feature branch
- Make your changes with tests
- Run quality checks
- Submit a pull request
- RAM: 16GB+ (32GB+ recommended for large datasets)
- Storage: 100GB+ for data caching
- GPU: 8GB+ VRAM (16GB+ recommended)
GP2Vec is designed for cloud deployment with:
- AWS S3 integration for data access
- Kubernetes deployment support
- Docker containerization
- Multi-node distributed training
If you use GP2Vec in your research, please cite:
@software{gp2vec2025,
  title={GP2Vec: Self-Supervised Learning for Geophysical Waveform Representation},
  author={Marine Denolle},
  year={2025},
  url={https://github.com/Denolle-Lab/gp2vec}
}

This project is licensed under the GNU General Public License v3.0 (GPL-3.0).
GP2Vec uses GPL-3.0 to ensure compatibility with key dependencies:
- SeisBench: GPL-3.0 (seismic deep learning benchmarks)
- ObsPy: LGPL-3.0 (seismological data processing)
- Wav2Vec2/Fairseq: MIT (compatible with GPL-3.0)
The GPL-3.0 license ensures that:
- ✅ All modifications and derivative works remain open source
- ✅ Full compatibility with the seismological research software ecosystem
- ✅ Community contributions are preserved for scientific progress
- ✅ Users receive complete source code and modification rights
See the LICENSE file for the complete terms.
This project incorporates or builds upon:
- Wav2Vec2 (Meta AI/Facebook): MIT License
- ObsPy: LGPL v3.0
- SeisBench: GPL v3.0
- PyTorch: BSD-style License
- PyTorch Lightning: Apache 2.0
All third-party licenses are compatible with GPL-3.0.
- Documentation: Full documentation (planned)
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Wav2Vec2-style architecture for seismic data
- S3-native data pipeline with WebDataset streaming
- Station metadata conditioning
- PyTorch Lightning training framework
- Hydra configuration system
- Distributed training support
- Downstream evaluation tools
- Additional downstream tasks (magnitude estimation, source characterization)
- Pre-trained model zoo
- Advanced augmentation strategies
- Real-time inference capabilities
- Integration with seismological workflows
- Comprehensive documentation and tutorials
- Multi-modal learning (waveforms + spectrograms + metadata)
- Federated learning across seismic networks
- Integration with physics-informed neural networks
- Uncertainty quantification in representations
GP2Vec - Advancing seismology through self-supervised representation learning
This document outlines an end-to-end plan to build a Wav2Vec 2.0-style self-supervised foundation model for seismology. It covers data access from EarthScope S3, station metadata integration, model architecture, training strategy, and repository scaffolding suitable for hand-off to an engineering team or code-generation system.
- Adopt a Wav2Vec 2.0 objective on continuous seismic waveforms: mask latent spans, predict quantized targets, and reuse the encoder for downstream tasks such as phase picking, tremor characterization, magnitude estimation, and slow slip proxies.[^1]
- Train on long, contiguous segments, masking contiguous time spans and sampling negatives from other temporal windows or channels; support multi-component inputs (Z/N/E) and targeted volcano or fault scenarios as in the reference study.[^1]
Two supported approaches:
- Exchange EarthScope credentials for temporary AWS keys using the SDK, then use `boto3` to list and fetch miniSEED objects.[^2]
- Follow the `earthscope-s3-direct-access` examples to read data with `s3fs` or `boto3`, respecting the documented bucket and prefix structure.[^3]
Python sketch:
import io
import s3fs
from obspy import read
fs = s3fs.S3FileSystem(anon=True) # or provide credentials from EarthScope SDK / boto3
with fs.open("s3://<earthscope-bucket>/<prefix>/<file>.mseed", "rb") as f:
    st = read(io.BytesIO(f.read()))  # ObsPy accepts file-like objects

Adjust bucket and prefix paths per the EarthScope documentation.[^2]
- Retrieve StationXML via `obspy.clients.fdsn.Client.get_stations`, requesting `level="response"` to obtain instrument metadata.[^4]
- Extract key fields (lat/lon/elev, instrument/datalogger descriptions, azimuth, dip, sampling rate, response gain) using ObsPy `Inventory` APIs; optionally leverage wrappers like `fdsn_station_info` for CSV exports.[^5]
Python sketch:
from obspy import UTCDateTime
from obspy.clients.fdsn import Client

t0, t1 = UTCDateTime("2023-01-01"), UTCDateTime("2023-01-02")
inv = Client("IRIS").get_stations(
    network="IU", station="ANMO", level="response", starttime=t0, endtime=t1
)
net = inv.networks[0]
sta = net.stations[0]
ch = sta.channels[0]
meta = {
    "network": net.code,
    "station": sta.code,
    "loc": ch.location_code,
    "chan": ch.code,
    "latitude": sta.latitude,
    "longitude": sta.longitude,
    "elevation": sta.elevation,
    "azimuth": ch.azimuth,
    "dip": ch.dip,
    "sample_rate": ch.sample_rate,
    "response_gain": (
        ch.response.instrument_sensitivity.value if ch.response else None
    ),
    "sensor": ch.sensor.description if ch.sensor else None,
    "datalogger": ch.data_logger.description if ch.data_logger else None,
}

- Categorical variables (network, station, location, channel, sensor, datalogger): map to learnable embeddings (16–64 dimensions each).
- Continuous variables (latitude, longitude, elevation, azimuth, dip, sample rate, gain): normalize and feed through a small MLP to produce dense features.
- Response curves: summarize via scalar features or sample log-amplitude response over fixed frequencies, then encode with a 1D convolution.
- Concatenate and project the metadata embedding to the model dimension, fusing with waveform tokens via additive conditioning, cross-attention, or FiLM-style scaling.
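A minimal PyTorch sketch of such a metadata encoder; the field set, embedding widths, and class name are illustrative assumptions rather than a prescribed implementation:

import torch
import torch.nn as nn

class MetadataEncoder(nn.Module):
    """Embed categorical + continuous station metadata into the model dimension."""

    def __init__(self, num_networks, num_sensors, d_model=768):
        super().__init__()
        self.network_emb = nn.Embedding(num_networks, 32)   # categorical fields
        self.sensor_emb = nn.Embedding(num_sensors, 32)
        self.cont_mlp = nn.Sequential(                      # 7 normalized continuous
            nn.Linear(7, 64), nn.GELU(), nn.Linear(64, 64)  # fields: lat, lon, elev,
        )                                                   # azimuth, dip, rate, gain
        self.proj = nn.Linear(32 + 32 + 64, d_model)        # project to model dim

    def forward(self, network_id, sensor_id, continuous):
        parts = [
            self.network_emb(network_id),  # (B, 32)
            self.sensor_emb(sensor_id),    # (B, 32)
            self.cont_mlp(continuous),     # (B, 64)
        ]
        return self.proj(torch.cat(parts, dim=-1))  # (B, d_model)

# Additive fusion: tokens = tokens + meta_vec.unsqueeze(1)
# FiLM-style: gamma, beta = film_head(meta_vec).chunk(2, -1); tokens = gamma * tokens + beta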
- Feature encoder: 1D convolution stack over single or tri-axial channels with 10–25 ms kernels; downsample to a ~50–100 Hz latent frame rate.
- Vector quantizer: Gumbel-softmax or EMA k-means codebook to produce discrete targets with diversity regularization.
- Context network: Transformer encoder (8–24 layers, 256–768 hidden units) with random masking of contiguous latent spans (covering roughly 30–60% of steps).
- Loss: Contrastive InfoNCE objective between masked positions and quantized targets plus a codebook diversity loss (see the sketch after this list).
- Augmentations: Random bandpass filtering, amplitude jitter, Gaussian noise, time shifts, channel dropout, and optional instrument response removal using metadata inventories.
- Optional multitask heads: Predict sampling-rate bins, channel families, or station regions from masked contexts to encourage invariance.
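To make the span masking and contrastive objective concrete, here is a self-contained sketch; the masking probability, span length, and cosine-similarity logits approximate the Wav2Vec 2.0 recipe, and the codebook diversity term is left out:

import torch
import torch.nn.functional as F

def span_mask(batch, time_steps, mask_prob=0.065, span=10):
    """Sample contiguous masked spans over latent frames (Wav2Vec 2.0 style)."""
    mask = torch.zeros(batch, time_steps, dtype=torch.bool)
    starts = torch.rand(batch, time_steps) < mask_prob   # candidate span starts
    for b, t in starts.nonzero(as_tuple=False):
        mask[b, t:t + span] = True                       # spans may overlap
    return mask

def info_nce(context, targets, negatives, temperature=0.1):
    """Contrastive loss between masked-position contexts and quantized targets.

    context, targets: (N, D); negatives: (N, K, D) sampled from other positions.
    """
    pos = F.cosine_similarity(context, targets, dim=-1)                 # (N,)
    neg = F.cosine_similarity(context.unsqueeze(1), negatives, dim=-1)  # (N, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1) / temperature    # (N, 1+K)
    # The true (quantized) target sits at index 0 of every row
    return F.cross_entropy(logits, torch.zeros(len(logits), dtype=torch.long))

# Roughly 30-60% of frames end up masked with these defaults
mask = span_mask(batch=4, time_steps=150)
loss = info_nce(torch.randn(8, 768), torch.randn(8, 768), torch.randn(8, 100, 768))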
- Catalog: List S3 keys (organized by network/station/year/day) via `s3fs` or `boto3`; store a Parquet manifest containing object key, time bounds, sampling rate, and byte size (a minimal sketch follows this list).[^3]
- Sharding: Assemble WebDataset or torchdata shards (1–2 GB) comprising 20–60 second windows and context negatives.
- Streaming: Use torchdata/DataPipes or WebDataset to stream from S3, decode miniSEED with ObsPy, resample, optionally remove instrument response, and cache as needed.
- Metadata join: Merge station/channel IDs with pre-fetched StationXML features and attach metadata embeddings to each sample.[^4]
- Distributed training: Train via PyTorch Lightning with FSDP or DeepSpeed, mixed precision, EMA, and periodic checkpoints written back to S3.
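As referenced in the Catalog item above, a minimal manifest-building sketch using `s3fs` and pandas; the SCEDC prefix and file extensions are assumptions, and in practice the listing should be narrowed to a network/day prefix:

import os
import pandas as pd
import s3fs

# Anonymous access to the public SCEDC bucket; prefix, year, and the .ms/.mseed
# extensions are assumptions - adjust to the documented bucket layout.
fs = s3fs.S3FileSystem(anon=True)
listing = fs.find("scedc-pds/continuous_waveforms/2023", detail=True)

rows = [
    {"key": key, "size_bytes": info["size"]}
    for key, info in listing.items()
    if key.endswith((".ms", ".mseed"))
]

os.makedirs("cache", exist_ok=True)
pd.DataFrame(rows).to_parquet("cache/manifest.parquet", index=False)
print(f"Indexed {len(rows)} miniSEED objects")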
Quick Start: Load SCEDC Real Data (No Setup Required)
# Simple script to load real seismic data from SCEDC S3
from torch.utils.data import DataLoader
from gp2vec.data.s3_manifest import SCEDCSeismicDataset
# Create dataset - works out of the box with anonymous S3 access
dataset = SCEDCSeismicDataset(
    start_date="2023-01-01",
    num_days=3,
    networks=["CI"],
    stations=["ADE", "ADO"],
    channels=["BHE", "BHN", "BHZ"],
    sample_length_sec=30.0,
    sample_rate=100.0,
)

# Load data
loader = DataLoader(dataset, batch_size=16, num_workers=2)
for batch in loader:
    waveforms = batch['waveform']  # (16, 3, 3000)
    print(f"Loaded {waveforms.shape} from {batch['station_id']}")
    break

List and stream from EarthScope S3:
import io
import random
import obspy
import s3fs
fs = s3fs.S3FileSystem(anon=True) # or authenticate via key/secret/token
keys = fs.ls("s3://<earthscope-bucket>/<miniSEED-prefix>/")
key = random.choice([k for k in keys if k.endswith(".mseed")])
with fs.open(key, "rb") as f:
    st = obspy.read(io.BytesIO(f.read()))
st.merge(fill_value="interpolate")
st.detrend("linear")
st.taper(0.05)

Match the bucket and prefix naming per the EarthScope documentation.[^2]
Instrument correction (optional):
st.remove_response(inventory=inv, output="VEL")  # requires matching Inventory metadata
st.filter("bandpass", freqmin=0.1, freqmax=20.0, corners=4, zerophase=True)

- Johnson, B., et al. (2025). Automatic speech recognition predicts contemporaneous earthquake fault slip and tremor. Nature Communications.[^1]
- EarthScope SDK: Direct S3 Access to SAGE miniSEED data repository.[^2]
- earthscope-s3-direct-access tutorial repository.[^3]
- ObsPy Inventory and FDSN documentation.[^4]
- fdsn_station_info ObsPy wrapper.[^5]
Footnotes

[^1]: Johnson, B., et al. (2025). Automatic speech recognition predicts contemporaneous earthquake fault slip and tremor. Nature Communications. https://www.nature.com/articles/s41467-025-55994-9.pdf
[^2]: EarthScope SDK Documentation. Direct S3 Access to SAGE miniSEED data repository. https://docs.earthscope.org/projects/SDK/en/stable/content/s3_direct_access_tutorial.html
[^3]: Niyiyu, Y. earthscope-s3-direct-access (GitHub repository). https://github.com/niyiyu/earthscope-s3-direct-access
[^4]: ObsPy Documentation. obspy.core.inventory. https://docs.obspy.org/packages/autogen/obspy.core.inventory.html
[^5]: Flyrok. fdsn_station_info (GitHub repository). https://github.com/flyrok/fdsn_station_info