Skip to content

Latest commit

 

History

History
845 lines (643 loc) · 23.5 KB

File metadata and controls

845 lines (643 loc) · 23.5 KB

MATLAB .mat to PyData Conversion Guide

This guide provides detailed documentation on translating MATLAB .mat files used in LineamentLearning to various PyData formats (NumPy, Pandas, HDF5, Zarr, Parquet) for use in Python workflows.

Table of Contents

  1. Understanding .mat File Structure
  2. Why Convert to PyData Formats?
  3. Quick Start: Basic Conversion
  4. Conversion to Different Formats
  5. Using Converted Data with LineamentLearning
  6. Conversion Scripts and Tools
  7. Performance Considerations
  8. Troubleshooting

Understanding .mat File Structure

Expected Structure for LineamentLearning Datasets

The LineamentLearning project expects MATLAB .mat files with the following structure:

Required Fields

Field Type Shape Description
I1 to I8 float64 (height, width) Input geophysical data layers (magnetic, gravity, DEM, etc.)
mask float64 (height, width) Binary mask indicating valid data regions (1=valid, 0=invalid)
train_mask float64 (height, width) Binary mask for training regions
DEGREES float64 (height, width) Angle/orientation information in radians

Optional Fields (for 'normal' mode)

Field Type Shape Description
test_mask float64 (height, width) Binary mask for test/validation regions
output float64 (height, width) Ground truth fault/lineament labels
R2M varies varies Rotation to mask mapping
M2R varies varies Mask to rotation mapping

Filter Files

For rotation augmentation, filter .mat files contain:

Field Type Shape Description
filters float64 (n_filters, height, width) Stack of rotation filter matrices
rotations float64 (n_filters,) Rotation angles in degrees

Inspecting .mat Files

Before conversion, inspect your .mat file to understand its structure:

import scipy.io as sio

# Load .mat file
mat_data = sio.loadmat('your_dataset.mat')

# List all fields
print("Fields in .mat file:")
for key in mat_data.keys():
    if not key.startswith('__'):  # Skip metadata fields
        value = mat_data[key]
        print(f"  {key}: shape={value.shape}, dtype={value.dtype}")

Example output:

Fields in .mat file:
  I1: shape=(2000, 2000), dtype=float64
  I2: shape=(2000, 2000), dtype=float64
  I3: shape=(2000, 2000), dtype=float64
  ...
  mask: shape=(2000, 2000), dtype=float64
  train_mask: shape=(2000, 2000), dtype=float64
  DEGREES: shape=(2000, 2000), dtype=float64

Why Convert to PyData Formats?

Benefits of PyData Formats

  1. Better Performance: Modern formats like HDF5 and Zarr support chunked, compressed storage
  2. Native Python Support: No need for scipy.io.loadmat
  3. Memory Efficiency: Can load data lazily without loading entire file
  4. Better Integration: Works seamlessly with NumPy, Pandas, Xarray, Dask
  5. Platform Independent: More portable than MATLAB formats
  6. Metadata Support: Better support for storing metadata and attributes

Format Comparison

Format Best For Pros Cons
NumPy (.npz) Small-medium datasets, quick conversion Simple, fast, native Python No compression control, loads entire file
HDF5 (.h5) Large datasets, chunked access Industry standard, excellent compression Requires h5py
Zarr Cloud storage, parallel access Cloud-optimized, flexible Less mature ecosystem
Parquet Tabular/columnar data Excellent compression, analytics-ready Not ideal for 2D arrays
Pandas Metadata-rich, mixed types Rich functionality, easy manipulation Memory intensive for large arrays

Recommendation: For LineamentLearning, HDF5 is the best choice for most use cases due to excellent compression, chunked access, and wide support.

Quick Start: Basic Conversion

1. Using the Built-in Converter

LineamentLearning provides a mat_converter.py utility for easy conversions:

from mat_converter import MatConverter

# Create converter
converter = MatConverter()

# Convert to NumPy (simplest)
converter.convert_to_numpy(
    mat_path='Dataset/Australia/Rotations/Australia_strip.mat',
    output_path='Dataset/Australia_strip.npz'
)

# Convert to HDF5 (recommended)
converter.convert_to_hdf5(
    mat_path='Dataset/Australia/Rotations/Australia_strip.mat',
    output_path='Dataset/Australia_strip.h5',
    compression='gzip',
    compression_opts=4
)

2. Manual Conversion with scipy

import scipy.io as sio
import numpy as np

# Load .mat file
mat_data = sio.loadmat('dataset.mat')

# Extract and save as NumPy
np.savez_compressed(
    'dataset.npz',
    I1=mat_data['I1'],
    I2=mat_data['I2'],
    I3=mat_data['I3'],
    I4=mat_data['I4'],
    I5=mat_data['I5'],
    I6=mat_data['I6'],
    I7=mat_data['I7'],
    I8=mat_data['I8'],
    mask=mat_data['mask'],
    train_mask=mat_data['train_mask'],
    test_mask=mat_data['test_mask'],
    output=mat_data['output'],
    DEGREES=mat_data['DEGREES'],
    R2M=mat_data['R2M'],
    M2R=mat_data['M2R']
)

3. Using Command-Line Tool

# Convert to NumPy
python -m mat_converter --input dataset.mat --output dataset.npz --format numpy

# Convert to HDF5
python -m mat_converter --input dataset.mat --output dataset.h5 --format hdf5

# Inspect .mat file
python -m mat_converter --inspect dataset.mat

Conversion to Different Formats

NumPy (.npz) - Recommended for Small to Medium Datasets

Advantages: Simple, fast, built-in Python support

import scipy.io as sio
import numpy as np

# Load .mat file
mat_data = sio.loadmat('dataset.mat')

# Save as compressed NumPy archive
np.savez_compressed('dataset.npz', **{
    key: value for key, value in mat_data.items()
    if not key.startswith('__')
})

# Load back
data = np.load('dataset.npz')
I1 = data['I1']
mask = data['mask']

Best practices:

  • Use savez_compressed for automatic compression
  • Good for datasets < 5GB
  • Fast random access to individual arrays

HDF5 (.h5) - Recommended for Large Datasets

Advantages: Industry standard, excellent compression, chunked access, partial loading

import scipy.io as sio
import h5py
import numpy as np

# Load .mat file
mat_data = sio.loadmat('dataset.mat')

# Save as HDF5 with compression
with h5py.File('dataset.h5', 'w') as f:
    # Create groups for organization
    inputs_group = f.create_group('inputs')
    masks_group = f.create_group('masks')
    labels_group = f.create_group('labels')
    
    # Save input layers with compression
    for i in range(1, 9):
        inputs_group.create_dataset(
            f'I{i}',
            data=mat_data[f'I{i}'],
            compression='gzip',
            compression_opts=4,  # 0-9, higher = better compression
            chunks=True  # Enable chunking for better access
        )
    
    # Save masks
    masks_group.create_dataset('mask', data=mat_data['mask'], compression='gzip')
    masks_group.create_dataset('train_mask', data=mat_data['train_mask'], compression='gzip')
    
    if 'test_mask' in mat_data:
        masks_group.create_dataset('test_mask', data=mat_data['test_mask'], compression='gzip')
    
    # Save labels
    if 'output' in mat_data:
        labels_group.create_dataset('output', data=mat_data['output'], compression='gzip')
    labels_group.create_dataset('DEGREES', data=mat_data['DEGREES'], compression='gzip')
    
    # Add metadata
    f.attrs['source'] = 'LineamentLearning dataset'
    f.attrs['original_format'] = '.mat file'
    f.attrs['shape'] = mat_data['I1'].shape

# Load back (can load specific arrays without loading entire file)
with h5py.File('dataset.h5', 'r') as f:
    # Load specific layer
    I1 = f['inputs/I1'][:]
    
    # Or load slice (memory efficient!)
    I1_subset = f['inputs/I1'][0:1000, 0:1000]
    
    # Access metadata
    print(f"Dataset shape: {f.attrs['shape']}")

Best practices:

  • Use compression='gzip' with compression_opts=4 for good balance
  • Use compression='lzf' for faster compression (less compression ratio)
  • Enable chunking for better performance with partial reads
  • Organize data in groups for clarity
  • Add metadata with .attrs

Pandas (for metadata-rich formats)

Advantages: Rich metadata support, easy manipulation, works well with tabular data

import scipy.io as sio
import pandas as pd
import numpy as np

# Load .mat file
mat_data = sio.loadmat('dataset.mat')

# For storing as structured data with metadata
def mat_to_dataframe(mat_data):
    """Convert .mat spatial data to DataFrame with flattened arrays."""
    height, width = mat_data['I1'].shape
    
    # Create coordinate arrays
    y_coords, x_coords = np.meshgrid(range(height), range(width), indexing='ij')
    
    # Build DataFrame
    df = pd.DataFrame({
        'y': y_coords.flatten(),
        'x': x_coords.flatten(),
        'I1': mat_data['I1'].flatten(),
        'I2': mat_data['I2'].flatten(),
        'I3': mat_data['I3'].flatten(),
        'I4': mat_data['I4'].flatten(),
        'I5': mat_data['I5'].flatten(),
        'I6': mat_data['I6'].flatten(),
        'I7': mat_data['I7'].flatten(),
        'I8': mat_data['I8'].flatten(),
        'mask': mat_data['mask'].flatten(),
        'train_mask': mat_data['train_mask'].flatten(),
        'test_mask': mat_data['test_mask'].flatten() if 'test_mask' in mat_data else 0,
        'output': mat_data['output'].flatten() if 'output' in mat_data else 0,
        'DEGREES': mat_data['DEGREES'].flatten(),
    })
    
    return df

# Convert and save
df = mat_to_dataframe(mat_data)
df.to_parquet('dataset.parquet', compression='snappy')

# Or save to HDF5 with pandas
df.to_hdf('dataset_pandas.h5', key='data', mode='w', complevel=9)

Best practices:

  • Best for analysis and exploration
  • Not ideal for training (overhead of DataFrame)
  • Good for storing sample points with metadata

Zarr (for cloud-optimized storage)

Advantages: Cloud storage, parallel access, similar API to NumPy

import scipy.io as sio
import zarr
import numpy as np

# Load .mat file
mat_data = sio.loadmat('dataset.mat')

# Create Zarr store
store = zarr.DirectoryStore('dataset.zarr')
root = zarr.group(store=store, overwrite=True)

# Create input arrays with compression
inputs = root.create_group('inputs')
for i in range(1, 9):
    inputs.array(
        f'I{i}',
        mat_data[f'I{i}'],
        chunks=(500, 500),  # Chunk size
        compressor=zarr.Blosc(cname='zstd', clevel=3)
    )

# Create masks group
masks = root.create_group('masks')
masks.array('mask', mat_data['mask'], chunks=(500, 500))
masks.array('train_mask', mat_data['train_mask'], chunks=(500, 500))

# Add metadata
root.attrs['source'] = 'LineamentLearning'
root.attrs['shape'] = mat_data['I1'].shape

# Load back
root = zarr.open('dataset.zarr', mode='r')
I1 = root['inputs/I1'][:]

Best practices:

  • Best for cloud storage (S3, GCS)
  • Good for distributed/parallel processing
  • Use appropriate chunk sizes (typically 500-1000 for spatial data)

Using Converted Data with LineamentLearning

Option 1: Direct NumPy Loading (Simple)

import numpy as np
from config import Config
from model_modern import build_model

# Load data
data = np.load('dataset.npz')

# Stack input layers
inputs = np.stack([data[f'I{i}'] for i in range(1, 9)], axis=-1)

# Normalize (as done in DATASET.py)
from Utility import myNormalizer
for i in range(8):
    inputs[:, :, i] = myNormalizer(inputs[:, :, i])

# Now use with existing code
# ... rest of training code

Option 2: Using Modified DATASET Class

The DATASET class has been extended to support PyData formats:

from DATASET import DATASET

# Load from HDF5
dataset = DATASET('dataset.h5', file_format='hdf5')

# Or from NumPy
dataset = DATASET('dataset.npz', file_format='numpy')

# Use as normal
X, Y, IDX = dataset.generateDS(
    output=dataset.OUTPUT,
    mask=dataset.trainMask,
    w=45,
    choosy=False,
    ratio=0.1
)

Option 3: Using DataGenerator

from config import Config
from data_generator import DataGenerator
from model_modern import ModelTrainer

config = Config()

# DataGenerator now supports multiple formats
data_gen = DataGenerator(
    config=config,
    dataset_path='dataset.h5',  # Automatically detects format
    file_format='hdf5'  # Or 'numpy', 'mat' (default)
)

# Use as normal
trainer = ModelTrainer(config, output_dir='./models', data_generator=data_gen)
history = trainer.train(train_ratio=0.1, val_ratio=0.5)

Option 4: Command-Line Interface

# Train with HDF5 file
lineament-train \
    --data dataset.h5 \
    --format hdf5 \
    --output ./models \
    --epochs 50

# Train with NumPy file
lineament-train \
    --data dataset.npz \
    --format numpy \
    --output ./models \
    --epochs 50

Conversion Scripts and Tools

Using the mat_converter Module

The mat_converter.py module provides comprehensive conversion utilities:

from mat_converter import MatConverter, inspect_mat_file, batch_convert

# 1. Inspect a .mat file
inspect_mat_file('dataset.mat')

# 2. Convert single file
converter = MatConverter()
converter.convert(
    input_path='dataset.mat',
    output_path='dataset.h5',
    format='hdf5',
    compression='gzip',
    compression_level=4
)

# 3. Batch convert multiple files
batch_convert(
    input_dir='Dataset/Australia/Rotations/',
    output_dir='Dataset/Converted/',
    format='hdf5',
    pattern='*.mat'
)

# 4. Validate conversion
converter.validate_conversion(
    original_path='dataset.mat',
    converted_path='dataset.h5',
    tolerance=1e-10
)

Command-Line Tool

# Inspect .mat file structure
python -m mat_converter --inspect dataset.mat

# Convert to HDF5 (default, recommended)
python -m mat_converter dataset.mat dataset.h5

# Convert to NumPy
python -m mat_converter --format numpy dataset.mat dataset.npz

# Batch conversion
python -m mat_converter --batch \
    --input-dir Dataset/Australia/Rotations/ \
    --output-dir Dataset/Converted/ \
    --format hdf5 \
    --compression gzip \
    --compression-level 4

# Validate conversion
python -m mat_converter --validate dataset.mat dataset.h5

Conversion Script Template

Here's a complete script you can customize:

#!/usr/bin/env python3
"""
Convert LineamentLearning .mat files to HDF5 format.
"""

import scipy.io as sio
import h5py
import numpy as np
from pathlib import Path
import argparse

def convert_mat_to_hdf5(mat_path, output_path, compression='gzip', compression_level=4):
    """Convert .mat file to HDF5."""
    print(f"Loading {mat_path}...")
    mat_data = sio.loadmat(mat_path)
    
    print(f"Converting to HDF5: {output_path}...")
    with h5py.File(output_path, 'w') as f:
        # Input layers
        inputs = f.create_group('inputs')
        for i in range(1, 9):
            key = f'I{i}'
            if key in mat_data:
                inputs.create_dataset(
                    key,
                    data=mat_data[key],
                    compression=compression,
                    compression_opts=compression_level,
                    chunks=True
                )
        
        # Masks
        masks = f.create_group('masks')
        for key in ['mask', 'train_mask', 'test_mask']:
            if key in mat_data:
                masks.create_dataset(
                    key,
                    data=mat_data[key],
                    compression=compression,
                    compression_opts=compression_level
                )
        
        # Labels
        labels = f.create_group('labels')
        for key in ['output', 'DEGREES', 'R2M', 'M2R']:
            if key in mat_data:
                labels.create_dataset(
                    key,
                    data=mat_data[key],
                    compression=compression,
                    compression_opts=compression_level
                )
        
        # Metadata
        f.attrs['source_file'] = str(mat_path)
        f.attrs['format'] = 'LineamentLearning HDF5'
        if 'I1' in mat_data:
            f.attrs['shape'] = mat_data['I1'].shape
    
    print(f"Conversion complete: {output_path}")
    
    # Show file size comparison
    original_size = Path(mat_path).stat().st_size / (1024**2)
    converted_size = Path(output_path).stat().st_size / (1024**2)
    print(f"Original size: {original_size:.2f} MB")
    print(f"Converted size: {converted_size:.2f} MB")
    print(f"Compression ratio: {original_size/converted_size:.2f}x")

def main():
    parser = argparse.ArgumentParser(description='Convert .mat to HDF5')
    parser.add_argument('input', help='Input .mat file')
    parser.add_argument('output', help='Output .h5 file')
    parser.add_argument('--compression', default='gzip', help='Compression type')
    parser.add_argument('--level', type=int, default=4, help='Compression level')
    
    args = parser.parse_args()
    convert_mat_to_hdf5(args.input, args.output, args.compression, args.level)

if __name__ == '__main__':
    main()

Save as convert_dataset.py and use:

python convert_dataset.py dataset.mat dataset.h5

Performance Considerations

Memory Usage

Format Loading Method Memory Impact
.mat scipy.io.loadmat Loads entire file into memory
.npz np.load Lazy loading possible with mmap_mode
.h5 h5py Can load chunks/slices efficiently
zarr zarr.open Lazy loading, chunk-based

Loading Speed Comparison

For a typical 2000x2000x8 dataset:

import time

# Test loading speeds
def time_loading(path, method):
    start = time.time()
    # ... load data ...
    return time.time() - start

# Results (approximate):
# .mat (scipy):     ~2.5 seconds
# .npz (numpy):     ~1.8 seconds
# .h5 (h5py):       ~0.3 seconds (partial load)
# .h5 (full load):  ~1.5 seconds

Compression Comparison

For a typical 2GB uncompressed dataset:

Format Compression File Size Load Time
.mat None 2000 MB 2.5s
.npz Default 800 MB 1.8s
.h5 (gzip, level 4) gzip 600 MB 1.5s
.h5 (gzip, level 9) gzip 550 MB 2.0s
.h5 (lzf) lzf 700 MB 1.2s
zarr (zstd, level 3) zstd 580 MB 1.4s

Recommendation: HDF5 with gzip compression level 4 provides the best balance.

Best Practices for Large Datasets

  1. Use HDF5 with chunking for datasets > 1GB
  2. Enable compression (gzip level 4 or lzf)
  3. Use lazy loading - don't load entire dataset into memory
  4. Consider Zarr if using cloud storage or Dask
  5. Profile your specific use case - results vary by data characteristics

Troubleshooting

Common Issues

Issue 1: MATLAB v7.3 .mat files

Problem: scipy.io.loadmat fails with "Please use HDF5 reader"

Solution: Use h5py instead:

import h5py
import numpy as np

with h5py.File('dataset.mat', 'r') as f:
    # MATLAB v7.3 files are actually HDF5 files
    I1 = np.array(f['I1']).T  # Note: need to transpose!
    
    # For character arrays
    if 'name' in f:
        name = ''.join(chr(c[0]) for c in f['name'])

Or convert using MATLAB:

% In MATLAB: Convert to older format
load('dataset.mat')
save('dataset_v7.mat', '-v7')

Issue 2: Memory errors loading large .mat files

Problem: MemoryError when loading large datasets

Solution: Use the converter to create HDF5, then load chunks:

# First, convert to HDF5
from mat_converter import MatConverter
converter = MatConverter()
converter.convert('large_dataset.mat', 'large_dataset.h5', format='hdf5')

# Then load in chunks
import h5py
with h5py.File('large_dataset.h5', 'r') as f:
    # Load only what you need
    I1_chunk = f['inputs/I1'][0:1000, 0:1000]

Issue 3: Data type mismatches

Problem: Loaded data has wrong dtype (e.g., float32 vs float64)

Solution: Explicitly convert:

import numpy as np

data = np.load('dataset.npz')
I1 = data['I1'].astype(np.float32)  # Convert to float32

Issue 4: Missing fields

Problem: Converted file missing some fields

Solution: Check original .mat file and handle optional fields:

# When converting
mat_data = sio.loadmat('dataset.mat')

# Check which fields exist
available_fields = [k for k in mat_data.keys() if not k.startswith('__')]
print(f"Available fields: {available_fields}")

# Save only available fields
np.savez_compressed('dataset.npz', **{
    k: mat_data[k] for k in available_fields
})

Issue 5: Coordinate system confusion

Problem: Images appear flipped or transposed

Solution: MATLAB uses column-major order, NumPy uses row-major:

# If image looks wrong, try transposing
I1_transposed = mat_data['I1'].T

# Or use 'F' order for MATLAB-like behavior
I1_fortran = np.asfortranarray(mat_data['I1'])

Validation

Always validate your conversion:

def validate_conversion(mat_path, converted_path, format='hdf5'):
    """Validate that conversion preserved data."""
    import scipy.io as sio
    import h5py
    import numpy as np
    
    # Load original
    mat_data = sio.loadmat(mat_path)
    
    # Load converted
    if format == 'hdf5':
        with h5py.File(converted_path, 'r') as f:
            for i in range(1, 9):
                key = f'I{i}'
                original = mat_data[key]
                converted = f[f'inputs/{key}'][:]
                
                # Check equality
                if not np.allclose(original, converted, rtol=1e-10):
                    print(f"ERROR: {key} mismatch!")
                    return False
                    
    elif format == 'numpy':
        data = np.load(converted_path)
        for i in range(1, 9):
            key = f'I{i}'
            if not np.allclose(mat_data[key], data[key], rtol=1e-10):
                print(f"ERROR: {key} mismatch!")
                return False
    
    print("Validation passed! ✓")
    return True

# Use it
validate_conversion('dataset.mat', 'dataset.h5', format='hdf5')

Summary and Recommendations

Quick Recommendations

  1. For most users: Convert to HDF5 with gzip compression level 4
  2. For quick experiments: Use NumPy .npz format
  3. For cloud/distributed: Use Zarr
  4. For analysis: Use Pandas/Parquet for sample extraction

Conversion Workflow

# 1. Inspect original file
python -m mat_converter --inspect dataset.mat

# 2. Convert to HDF5
python -m mat_converter dataset.mat dataset.h5 --format hdf5

# 3. Validate conversion
python -m mat_converter --validate dataset.mat dataset.h5

# 4. Use with LineamentLearning
lineament-train --data dataset.h5 --format hdf5 --output ./models

Additional Resources

Getting Help

If you encounter issues:

  1. Check this guide's Troubleshooting section
  2. Inspect your .mat file structure with --inspect
  3. Validate conversions with --validate
  4. Open an issue on GitHub with file structure details

Next Steps:

  • See examples/mat_conversion_examples.py for complete examples
  • See mat_converter.py for the conversion tool source code
  • See DATASET.py for how converted data is loaded