A PyTorch implementation of 4D self-supervised learning for autonomous driving, enabling 3D scene understanding and motion prediction from monocular video sequences.
This project implements a self-supervised learning framework that learns 3D geometry, object segmentation, and motion dynamics from unlabeled driving videos. Core capabilities (a hypothetical inference sketch follows the list):
- 3D Scene Reconstruction: Generate depth maps and 3D point clouds from monocular video
- Object Segmentation: Identify vehicles, pedestrians, bicycles, road signs, and traffic lights
- Motion Prediction: Track dynamic objects and predict their 3D motion trajectories
- Temporal Understanding: Learn consistent 4D representations across video sequences
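To make the capability list concrete, here is a hypothetical end-to-end inference sketch; the import path, checkpoint layout, and output keys are assumptions for illustration, not the project's exact API:

```python
import torch

from models import Pi3  # assumed import path

model = Pi3().eval().cuda()
model.load_state_dict(torch.load("checkpoints/latest.pt"))  # assumed checkpoint layout

# A clip of 8 RGB frames: (batch, time, channels, height, width)
frames = torch.rand(1, 8, 3, 224, 224, device="cuda")

with torch.no_grad():
    out = model(frames)  # assumed to return a dict of per-frame predictions

depth = out["depth"]      # e.g. (1, 8, H, W)    depth maps
points = out["points"]    # e.g. (1, 8, H, W, 3) dense 3D points
seg = out["seg_logits"]   # e.g. (1, 8, 6, H, W) segmentation logits
```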
- Multi-Model Architecture: Supports Pi3, AutoregressivePi3, and AutonomyPi3 variants
- Large-Scale Training: Optimized for YouTube driving dataset (58M+ samples)
- Advanced Segmentation: 6-class segmentation with GSAM2 integration
- Motion Analysis: Dynamic vs static object classification with CoTracker
- Distributed Training: Multi-GPU support with Accelerate
- Cloud Integration: S3 dataset streaming and checkpoint management
Documentation:
- Installation Guide - Complete setup instructions for dependencies and environment
- Configuration Guide - Detailed explanation of configuration options
- Training Guide - Step-by-step training instructions
- Loss Functions Documentation - Details on loss computation and optimization
Requirements:
- Python 3.10+
- CUDA 12.1+
- GPU with 24 GB+ VRAM (RTX 4090 or A100 recommended)
Installation:

```bash
# Clone repository
git clone https://github.com/matthew-strong-ai/4d-ssl.git
cd 4d-ssl

# Create environment
conda create -n 4d-ssl python=3.10
conda activate 4d-ssl

# Install dependencies
pip install -r requirements.txt

# Initialize submodules
git submodule update --init --recursive
```

Quick start:

```bash
# Basic training
python train_cluster.py
# With custom config
python train_cluster.py --config config.yaml
# Resume from checkpoint
python train_cluster.py --resume checkpoints/latest.pt
```

The project implements three main architectures (a selection sketch follows the list):
Pi3 (base model):
- 3D point prediction from video sequences
- Camera pose estimation
- Depth and normal map generation

AutoregressivePi3:
- Transformer-based temporal modeling
- Future frame prediction
- Enhanced motion understanding

AutonomyPi3:
- Extended Pi3 with detection capabilities
- Traffic light and road sign detection
- Multi-task learning framework
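Which variant runs is controlled by the `MODEL.ARCHITECTURE` config key (see the `MODEL` block later in this README). A minimal sketch of registry-style selection; the import path and constructor keywords are assumptions, not the project's confirmed API:

```python
# Map the MODEL.ARCHITECTURE config value to a constructor.
# Import path and keyword arguments are illustrative assumptions.
from models import Pi3, AutoregressivePi3, AutonomyPi3  # assumed module

ARCHITECTURES = {
    "Pi3": Pi3,                              # base 3D prediction
    "AutoregressivePi3": AutoregressivePi3,  # temporal / future-frame variant
    "AutonomyPi3": AutonomyPi3,              # detection-extended variant
}

def build_model(cfg):
    """Instantiate the architecture named in cfg.MODEL.ARCHITECTURE."""
    cls = ARCHITECTURES[cfg.MODEL.ARCHITECTURE]
    return cls(
        use_motion_head=cfg.MODEL.USE_MOTION_HEAD,
        use_segmentation_head=cfg.MODEL.USE_SEGMENTATION_HEAD,
    )
```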
The model is trained on large-scale driving video datasets (a streaming sketch follows the list):
- YouTube Driving Dataset: 58M+ frames from 200+ cities worldwide
- Custom S3 Dataset: Support for proprietary driving data
- Local Dataset: For testing and development
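For the S3-backed datasets, "streaming" means fetching frames on demand rather than staging everything locally. A minimal sketch assuming boto3 and a PyTorch `IterableDataset`; `S3FrameDataset` and the bucket layout are illustrative, not the project's actual classes:

```python
import io

import boto3
import numpy as np
import torch
from PIL import Image
from torch.utils.data import IterableDataset

class S3FrameDataset(IterableDataset):
    """Illustrative sketch: stream JPEG frames from S3 one object at a time."""

    def __init__(self, bucket, prefix="openDV-YouTube/full_images/"):
        self.bucket = bucket
        self.prefix = prefix
        self.s3 = boto3.client("s3")

    def __iter__(self):
        paginator = self.s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=self.bucket, Prefix=self.prefix):
            for obj in page.get("Contents", []):
                body = self.s3.get_object(Bucket=self.bucket, Key=obj["Key"])["Body"]
                img = Image.open(io.BytesIO(body.read())).convert("RGB")
                # HWC uint8 -> CHW float in [0, 1]
                yield torch.from_numpy(np.array(img)).permute(2, 0, 1).float() / 255.0
```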
Example dataset configuration:

```yaml
DATASET:
  USE_YOUTUBE: True
  YOUTUBE_ROOT_PREFIX: "openDV-YouTube/full_images/"
  BATCH_SIZE: 1
  MAX_SAMPLES: -1  # Use full dataset
```

Each training step runs through the following pipeline (see the sketch after this list):
- Data Loading: Streaming from S3/YouTube dataset
- GSAM2 Processing: Object detection and segmentation
- CoTracker Integration: Point tracking across frames
- Model Forward Pass: 3D prediction and motion estimation
- Loss Computation: Multi-task losses with class weighting
- Optimization: AdamW with cosine annealing
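A compact, runnable stand-in for one training step; `Pi3Stub` and the random pseudo-labels replace the real model and the GSAM2/CoTracker outputs:

```python
import torch
from torch import nn

class Pi3Stub(nn.Module):
    """Placeholder for a Pi3 variant with a 6-class segmentation head."""
    def __init__(self):
        super().__init__()
        self.head = nn.Conv2d(3, 6, kernel_size=1)

    def forward(self, x):            # x: (B, 3, H, W)
        return {"seg_logits": self.head(x)}

model = Pi3Stub()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)

for step in range(3):                              # stand-in for the dataloader
    frames = torch.rand(1, 3, 64, 64)              # 1. streamed frames
    masks = torch.randint(0, 6, (1, 64, 64))       # 2-3. GSAM2/CoTracker pseudo-labels
    preds = model(frames)                          # 4. forward pass
    loss = nn.functional.cross_entropy(            # 5. one of the multi-task losses
        preds["seg_logits"], masks)
    loss.backward()
    optimizer.step()                               # 6. AdamW + cosine annealing
    scheduler.step()
    optimizer.zero_grad()
```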
Segmentation classes:
0. Background
1. Vehicle (cars, trucks, buses, motorcycles)
2. Bicycle
3. Person
4. Road Sign
5. Traffic Light

Training combines several loss terms (see the sketch after this list):
- Point Cloud Loss: 3D geometry supervision
- Segmentation Loss: Cross-entropy with class weights
- Motion Loss: 3D motion field prediction
- Camera Pose Loss: SE(3) pose estimation
- Confidence Loss: Prediction uncertainty
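These terms are typically combined as a weighted sum; the class weights and task weights below are illustrative assumptions, not the project's tuned values:

```python
import torch
import torch.nn.functional as F

# Illustrative weighting; values are assumptions, not the project's numbers.
CLASS_WEIGHTS = torch.tensor([0.1, 1.0, 2.0, 2.0, 1.5, 1.5])
#               background, vehicle, bicycle, person, sign, light
TASK_WEIGHTS = {"points": 1.0, "seg": 1.0, "motion": 0.5}

def compute_losses(preds, targets):
    """Weighted sum of per-task losses (pose/confidence terms are analogous)."""
    losses = {
        "points": F.l1_loss(preds["points"], targets["points"]),
        "seg": F.cross_entropy(preds["seg_logits"], targets["seg"],
                               weight=CLASS_WEIGHTS),
        "motion": F.l1_loss(preds["motion"], targets["motion"]),
    }
    return sum(TASK_WEIGHTS[k] * v for k, v in losses.items())
```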
The model produces the following outputs (assumed tensor shapes are sketched after the list):
- 3D Point Clouds: Dense depth estimation
- Segmentation Masks: Per-pixel class predictions
- Motion Vectors: 3D motion flow fields
- Dynamic Masks: Moving vs static object classification
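Concretely, for a clip of T frames, the prediction dictionary might be laid out as follows; the key names and exact shapes are assumptions, not the confirmed API:

```python
# Assumed output layout for a (B, T, 3, H, W) input clip.
out = {
    "points":       ...,  # (B, T, H, W, 3)  dense 3D points
    "seg_logits":   ...,  # (B, T, 6, H, W)  per-pixel class scores
    "motion":       ...,  # (B, T, H, W, 3)  3D motion flow
    "dynamic_mask": ...,  # (B, T, H, W)     moving vs. static
}
```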
```bash
# Set up W&B
wandb login
export WANDB_API_KEY="your_key"
```

Training metrics are logged to W&B (example call after the list), including:
- Loss curves
- Validation metrics
- Sample visualizations
- Model checkpoints
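The logging calls themselves might look like this minimal sketch; the project name and metric keys are illustrative:

```python
import wandb

wandb.init(project="4d-ssl")  # assumed project name
for step in range(3):
    # Metric keys are illustrative; real runs log the losses listed above.
    wandb.log({"loss/total": 1.0 / (step + 1), "lr": 1e-4}, step=step)
wandb.finish()
```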
To view local TensorBoard logs:

```bash
tensorboard --logdir runs/
```

```bash
# Using Accelerate
accelerate launch train_cluster.py

# Distributed training
torchrun --nproc_per_node=4 train_cluster.py
```

Example model configuration:

```yaml
MODEL:
  ARCHITECTURE: "AutoregressivePi3"
  USE_MOTION_HEAD: True
  USE_SEGMENTATION_HEAD: True
  FREEZE_DECODERS: True
```

See hyperparameter_search.yaml for sweep configurations.
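The uppercase dotted keys (e.g. `DATASET.BATCH_SIZE`) suggest a yacs-style config; here is a sketch of loading and overriding values under that assumption (the defaults mirror the snippet above):

```python
from yacs.config import CfgNode as CN

# Assumed yacs-style defaults mirroring the MODEL block above; this is a
# guess at the project's config system, not a confirmed API.
cfg = CN()
cfg.MODEL = CN()
cfg.MODEL.ARCHITECTURE = "AutoregressivePi3"
cfg.MODEL.USE_MOTION_HEAD = True
cfg.MODEL.USE_SEGMENTATION_HEAD = True
cfg.MODEL.FREEZE_DECODERS = True

cfg.merge_from_file("config.yaml")                       # apply a custom config
cfg.merge_from_list(["MODEL.FREEZE_DECODERS", "False"])  # command-line style override
```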
Common issues and solutions:

- CUDA Out of Memory: set `DATASET.BATCH_SIZE: 1` and `TRAINING.GRAD_ACCUM_STEPS: 4` to trade batch size for gradient accumulation
- S3 Access Issues: run `export AWS_REQUEST_CHECKSUM_CALCULATION="WHEN_SUPPORTED"` and configure credentials with `aws configure`
- Missing Dependencies: `pip install -e git+https://github.com/IDEA-Research/Grounded-SAM-2.git#egg=SAM-2`
To contribute:
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
If you use this code in your research, please cite:
```bibtex
@software{4d-ssl,
  author = {Matthew Strong},
  title = {4D-SSL: Self-Supervised Learning from In-the-Wild Driving Videos},
  year = {2025},
  url = {https://github.com/matthew-strong-ai/4d-ssl}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgements:
- Pi3 Model - Base 3D prediction architecture
- Grounded-SAM-2 - Segmentation backend
- CoTracker - Point tracking
- DINOv3 - Feature extraction
For questions and feedback:
- Create an issue on GitHub
- Email: [your-email@example.com]
Note: This is an active research project. APIs and features may change.