This guide provides detailed documentation on translating MATLAB .mat files used in LineamentLearning to various PyData formats (NumPy, Pandas, HDF5, Zarr, Parquet) for use in Python workflows.
- Understanding .mat File Structure
- Why Convert to PyData Formats?
- Quick Start: Basic Conversion
- Conversion to Different Formats
- Using Converted Data with LineamentLearning
- Conversion Scripts and Tools
- Performance Considerations
- Troubleshooting
The LineamentLearning project expects MATLAB .mat files with the following structure:
| Field | Type | Shape | Description |
|---|---|---|---|
| `I1` to `I8` | float64 | (height, width) | Input geophysical data layers (magnetic, gravity, DEM, etc.) |
| `mask` | float64 | (height, width) | Binary mask indicating valid data regions (1 = valid, 0 = invalid) |
| `train_mask` | float64 | (height, width) | Binary mask for training regions |
| `DEGREES` | float64 | (height, width) | Angle/orientation information in radians |
The following fields are optional:

| Field | Type | Shape | Description |
|---|---|---|---|
| `test_mask` | float64 | (height, width) | Binary mask for test/validation regions |
| `output` | float64 | (height, width) | Ground truth fault/lineament labels |
| `R2M` | varies | varies | Rotation-to-mask mapping |
| `M2R` | varies | varies | Mask-to-rotation mapping |
For rotation augmentation, filter .mat files contain:
| Field | Type | Shape | Description |
|---|---|---|---|
| `filters` | float64 | (n_filters, height, width) | Stack of rotation filter matrices |
| `rotations` | float64 | (n_filters,) | Rotation angles in degrees |
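If you want to try the conversion examples below without a real dataset, the following sketch writes a small synthetic .mat file with the required layout; the shapes and random values are purely illustrative:

```python
import numpy as np
import scipy.io as sio

# Build a small synthetic dataset matching the expected field layout
h, w = 200, 200
rng = np.random.default_rng(0)
data = {f'I{i}': rng.normal(size=(h, w)) for i in range(1, 9)}
data['mask'] = np.ones((h, w))            # all pixels valid
data['train_mask'] = np.zeros((h, w))
data['train_mask'][:, :w // 2] = 1        # train on the left half
data['DEGREES'] = rng.uniform(0, np.pi, size=(h, w))

sio.savemat('synthetic_dataset.mat', data)
```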
Before conversion, inspect your .mat file to understand its structure:
```python
import scipy.io as sio

# Load .mat file
mat_data = sio.loadmat('your_dataset.mat')

# List all fields
print("Fields in .mat file:")
for key in mat_data.keys():
    if not key.startswith('__'):  # Skip metadata fields
        value = mat_data[key]
        print(f"  {key}: shape={value.shape}, dtype={value.dtype}")
```

Example output:

```
Fields in .mat file:
I1: shape=(2000, 2000), dtype=float64
I2: shape=(2000, 2000), dtype=float64
I3: shape=(2000, 2000), dtype=float64
...
mask: shape=(2000, 2000), dtype=float64
train_mask: shape=(2000, 2000), dtype=float64
DEGREES: shape=(2000, 2000), dtype=float64
```
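Two `loadmat` options often save cleanup work downstream: `squeeze_me` collapses MATLAB's 1x1 and singleton dimensions, and `struct_as_record` changes how MATLAB structs are exposed. For example:

```python
import scipy.io as sio

# squeeze_me=True drops singleton dimensions (MATLAB scalars arrive as 1x1 arrays);
# struct_as_record=False exposes MATLAB structs as objects with attribute access
mat_data = sio.loadmat('your_dataset.mat', squeeze_me=True, struct_as_record=False)
```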
Converting to PyData formats offers several advantages:

- Better Performance: Modern formats like HDF5 and Zarr support chunked, compressed storage
- Native Python Support: No need for `scipy.io.loadmat`
- Memory Efficiency: Data can be loaded lazily instead of reading the whole file (see the sketch after this list)
- Better Integration: Works seamlessly with NumPy, Pandas, Xarray, Dask
- Platform Independent: More portable than MATLAB formats
- Metadata Support: Better support for storing metadata and attributes
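To make the memory point concrete, here is a minimal sketch contrasting the two access patterns, assuming a dataset already converted to the grouped HDF5 layout used throughout this guide:

```python
import scipy.io as sio
import h5py

# .mat: loadmat materializes every array in RAM up front
mat_data = sio.loadmat('dataset.mat')  # whole file in memory

# HDF5: open lazily, then read only the window you need
with h5py.File('dataset.h5', 'r') as f:
    tile = f['inputs/I1'][0:512, 0:512]  # only this slice is read from disk
```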
| Format | Best For | Pros | Cons |
|---|---|---|---|
| NumPy (.npz) | Small-medium datasets, quick conversion | Simple, fast, native Python | Limited compression control; each array loads fully on access |
| HDF5 (.h5) | Large datasets, chunked access | Industry standard, excellent compression | Requires h5py |
| Zarr | Cloud storage, parallel access | Cloud-optimized, flexible | Less mature ecosystem |
| Parquet | Tabular/columnar data | Excellent compression, analytics-ready | Not ideal for 2D arrays |
| Pandas | Metadata-rich, mixed types | Rich functionality, easy manipulation | Memory intensive for large arrays |
Recommendation: For LineamentLearning, HDF5 is the best choice for most use cases due to excellent compression, chunked access, and wide support.
LineamentLearning provides a `mat_converter.py` utility for easy conversions:
```python
from mat_converter import MatConverter

# Create converter
converter = MatConverter()

# Convert to NumPy (simplest)
converter.convert_to_numpy(
    mat_path='Dataset/Australia/Rotations/Australia_strip.mat',
    output_path='Dataset/Australia_strip.npz'
)

# Convert to HDF5 (recommended)
converter.convert_to_hdf5(
    mat_path='Dataset/Australia/Rotations/Australia_strip.mat',
    output_path='Dataset/Australia_strip.h5',
    compression='gzip',
    compression_opts=4
)
```

To convert manually with SciPy and NumPy:

```python
import scipy.io as sio
import numpy as np

# Load .mat file
mat_data = sio.loadmat('dataset.mat')

# Extract and save as NumPy
np.savez_compressed(
    'dataset.npz',
    I1=mat_data['I1'],
    I2=mat_data['I2'],
    I3=mat_data['I3'],
    I4=mat_data['I4'],
    I5=mat_data['I5'],
    I6=mat_data['I6'],
    I7=mat_data['I7'],
    I8=mat_data['I8'],
    mask=mat_data['mask'],
    train_mask=mat_data['train_mask'],
    test_mask=mat_data['test_mask'],
    output=mat_data['output'],
    DEGREES=mat_data['DEGREES'],
    R2M=mat_data['R2M'],
    M2R=mat_data['M2R']
)
```

Or use the command-line interface:

```bash
# Convert to NumPy
python -m mat_converter --input dataset.mat --output dataset.npz --format numpy
# Convert to HDF5
python -m mat_converter --input dataset.mat --output dataset.h5 --format hdf5
# Inspect .mat file
python -m mat_converter --inspect dataset.mat
```

NumPy (.npz) advantages: Simple, fast, built-in Python support
```python
import scipy.io as sio
import numpy as np

# Load .mat file
mat_data = sio.loadmat('dataset.mat')

# Save as compressed NumPy archive
np.savez_compressed('dataset.npz', **{
    key: value for key, value in mat_data.items()
    if not key.startswith('__')
})

# Load back
data = np.load('dataset.npz')
I1 = data['I1']
mask = data['mask']
```

Best practices:
- Use `savez_compressed` for automatic compression
- Good for datasets < 5 GB
- Fast random access to individual arrays (see the note below on lazy access)
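One detail worth knowing about .npz: `np.load` returns an `NpzFile` that lists its members without decompressing anything, and each array is only decompressed when indexed:

```python
import numpy as np

data = np.load('dataset.npz')
print(data.files)        # member names, nothing decompressed yet
I1 = data['I1']          # this array is decompressed on access
data.close()
```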
HDF5 (.h5) advantages: Industry standard, excellent compression, chunked access, partial loading
```python
import scipy.io as sio
import h5py
import numpy as np

# Load .mat file
mat_data = sio.loadmat('dataset.mat')

# Save as HDF5 with compression
with h5py.File('dataset.h5', 'w') as f:
    # Create groups for organization
    inputs_group = f.create_group('inputs')
    masks_group = f.create_group('masks')
    labels_group = f.create_group('labels')

    # Save input layers with compression
    for i in range(1, 9):
        inputs_group.create_dataset(
            f'I{i}',
            data=mat_data[f'I{i}'],
            compression='gzip',
            compression_opts=4,  # 0-9, higher = better compression
            chunks=True  # Enable chunking for better access
        )

    # Save masks
    masks_group.create_dataset('mask', data=mat_data['mask'], compression='gzip')
    masks_group.create_dataset('train_mask', data=mat_data['train_mask'], compression='gzip')
    if 'test_mask' in mat_data:
        masks_group.create_dataset('test_mask', data=mat_data['test_mask'], compression='gzip')

    # Save labels
    if 'output' in mat_data:
        labels_group.create_dataset('output', data=mat_data['output'], compression='gzip')
    labels_group.create_dataset('DEGREES', data=mat_data['DEGREES'], compression='gzip')

    # Add metadata
    f.attrs['source'] = 'LineamentLearning dataset'
    f.attrs['original_format'] = '.mat file'
    f.attrs['shape'] = mat_data['I1'].shape

# Load back (specific arrays can be read without loading the entire file)
with h5py.File('dataset.h5', 'r') as f:
    # Load specific layer
    I1 = f['inputs/I1'][:]

    # Or load a slice (memory efficient!)
    I1_subset = f['inputs/I1'][0:1000, 0:1000]

    # Access metadata
    print(f"Dataset shape: {f.attrs['shape']}")
```

Best practices:
- Use `compression='gzip'` with `compression_opts=4` for a good balance
- Use `compression='lzf'` for faster compression (lower compression ratio)
- Enable chunking for better performance with partial reads (see the sketch after this list)
- Organize data in groups for clarity
- Add metadata with `.attrs`
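When the read pattern is known, an explicit chunk shape often beats the `chunks=True` default. A small sketch; the 512x512 tile size is an assumption to tune against your own access pattern, not a measured optimum:

```python
import h5py
import numpy as np

arr = np.random.rand(2000, 2000)

with h5py.File('chunked.h5', 'w') as f:
    # Explicit chunk shape aligned with how the data will be read back
    f.create_dataset(
        'I1',
        data=arr,
        chunks=(512, 512),  # square tiles suit window-based reads
        compression='lzf'   # faster than gzip, lighter compression
    )
```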
Pandas/Parquet advantages: Rich metadata support, easy manipulation, works well with tabular data
```python
import scipy.io as sio
import pandas as pd
import numpy as np

# Load .mat file
mat_data = sio.loadmat('dataset.mat')

# For storing as structured data with metadata
def mat_to_dataframe(mat_data):
    """Convert .mat spatial data to a DataFrame with flattened arrays."""
    height, width = mat_data['I1'].shape

    # Create coordinate arrays
    y_coords, x_coords = np.meshgrid(range(height), range(width), indexing='ij')

    # Build DataFrame
    df = pd.DataFrame({
        'y': y_coords.flatten(),
        'x': x_coords.flatten(),
        'I1': mat_data['I1'].flatten(),
        'I2': mat_data['I2'].flatten(),
        'I3': mat_data['I3'].flatten(),
        'I4': mat_data['I4'].flatten(),
        'I5': mat_data['I5'].flatten(),
        'I6': mat_data['I6'].flatten(),
        'I7': mat_data['I7'].flatten(),
        'I8': mat_data['I8'].flatten(),
        'mask': mat_data['mask'].flatten(),
        'train_mask': mat_data['train_mask'].flatten(),
        'test_mask': mat_data['test_mask'].flatten() if 'test_mask' in mat_data else 0,
        'output': mat_data['output'].flatten() if 'output' in mat_data else 0,
        'DEGREES': mat_data['DEGREES'].flatten(),
    })
    return df

# Convert and save
df = mat_to_dataframe(mat_data)
df.to_parquet('dataset.parquet', compression='snappy')

# Or save to HDF5 with pandas
df.to_hdf('dataset_pandas.h5', key='data', mode='w', complevel=9)
```

Best practices:
- Best for analysis and exploration
- Not ideal for training (overhead of DataFrame)
- Good for storing sample points with metadata (see the filtering sketch below)
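As an example of the analysis side, the Parquet file written above can be queried column-wise without reconstructing the image grids; here, extracting only the valid training pixels:

```python
import pandas as pd

# Read only the columns needed for this query
df = pd.read_parquet('dataset.parquet', columns=['y', 'x', 'I1', 'mask', 'train_mask'])

# Keep valid pixels that fall in the training region
train_pixels = df[(df['mask'] == 1) & (df['train_mask'] == 1)]
print(f"{len(train_pixels)} training samples")
```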
Zarr advantages: Cloud storage, parallel access, similar API to NumPy
```python
import scipy.io as sio
import zarr
import numpy as np

# Load .mat file
mat_data = sio.loadmat('dataset.mat')

# Create Zarr store
store = zarr.DirectoryStore('dataset.zarr')
root = zarr.group(store=store, overwrite=True)

# Create input arrays with compression
inputs = root.create_group('inputs')
for i in range(1, 9):
    inputs.array(
        f'I{i}',
        mat_data[f'I{i}'],
        chunks=(500, 500),  # Chunk size
        compressor=zarr.Blosc(cname='zstd', clevel=3)
    )

# Create masks group
masks = root.create_group('masks')
masks.array('mask', mat_data['mask'], chunks=(500, 500))
masks.array('train_mask', mat_data['train_mask'], chunks=(500, 500))

# Add metadata
root.attrs['source'] = 'LineamentLearning'
root.attrs['shape'] = mat_data['I1'].shape

# Load back
root = zarr.open('dataset.zarr', mode='r')
I1 = root['inputs/I1'][:]
```

Best practices:
- Best for cloud storage (S3, GCS)
- Good for distributed/parallel processing (see the Dask sketch after this list)
- Use appropriate chunk sizes (typically 500-1000 elements per dimension for spatial data)
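For the parallel-processing case, chunked Zarr arrays map directly onto Dask arrays, so work proceeds chunk by chunk instead of in one big allocation. A minimal sketch, assuming dask is installed and the store created above:

```python
import dask.array as da

# Wrap the Zarr array lazily; each 500x500 chunk becomes a Dask task
I1 = da.from_zarr('dataset.zarr', component='inputs/I1')

# Computation is deferred until .compute(); only chunk-sized pieces
# are held in memory at any one time
layer_mean = I1.mean().compute()
```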
To load a converted NumPy archive and feed it through the existing pipeline manually:

```python
import numpy as np

from config import Config
from model_modern import build_model
from Utility import myNormalizer

# Load data
data = np.load('dataset.npz')

# Stack input layers into (height, width, 8)
inputs = np.stack([data[f'I{i}'] for i in range(1, 9)], axis=-1)

# Normalize (as done in DATASET.py)
for i in range(8):
    inputs[:, :, i] = myNormalizer(inputs[:, :, i])

# Now use with existing code
# ... rest of training code
```

The DATASET class has been extended to support PyData formats:
```python
from DATASET import DATASET

# Load from HDF5
dataset = DATASET('dataset.h5', file_format='hdf5')

# Or from NumPy
dataset = DATASET('dataset.npz', file_format='numpy')

# Use as normal
X, Y, IDX = dataset.generateDS(
    output=dataset.OUTPUT,
    mask=dataset.trainMask,
    w=45,
    choosy=False,
    ratio=0.1
)
```

The modernized training pipeline accepts converted files as well:

```python
from config import Config
from data_generator import DataGenerator
from model_modern import ModelTrainer

config = Config()

# DataGenerator now supports multiple formats
data_gen = DataGenerator(
    config=config,
    dataset_path='dataset.h5',  # Automatically detects format
    file_format='hdf5'  # Or 'numpy', 'mat' (default)
)

# Use as normal
trainer = ModelTrainer(config, output_dir='./models', data_generator=data_gen)
history = trainer.train(train_ratio=0.1, val_ratio=0.5)
```

From the command line:

```bash
# Train with HDF5 file
lineament-train \
--data dataset.h5 \
--format hdf5 \
--output ./models \
--epochs 50
# Train with NumPy file
lineament-train \
--data dataset.npz \
--format numpy \
--output ./models \
--epochs 50
```

The `mat_converter.py` module provides comprehensive conversion utilities:
```python
from mat_converter import MatConverter, inspect_mat_file, batch_convert

# 1. Inspect a .mat file
inspect_mat_file('dataset.mat')

# 2. Convert single file
converter = MatConverter()
converter.convert(
    input_path='dataset.mat',
    output_path='dataset.h5',
    format='hdf5',
    compression='gzip',
    compression_level=4
)

# 3. Batch convert multiple files
batch_convert(
    input_dir='Dataset/Australia/Rotations/',
    output_dir='Dataset/Converted/',
    format='hdf5',
    pattern='*.mat'
)

# 4. Validate conversion
converter.validate_conversion(
    original_path='dataset.mat',
    converted_path='dataset.h5',
    tolerance=1e-10
)
```

The same operations are available from the command line:

```bash
# Inspect .mat file structure
python -m mat_converter --inspect dataset.mat
# Convert to HDF5 (default, recommended)
python -m mat_converter dataset.mat dataset.h5
# Convert to NumPy
python -m mat_converter --format numpy dataset.mat dataset.npz
# Batch conversion
python -m mat_converter --batch \
--input-dir Dataset/Australia/Rotations/ \
--output-dir Dataset/Converted/ \
--format hdf5 \
--compression gzip \
--compression-level 4
# Validate conversion
python -m mat_converter --validate dataset.mat dataset.h5
```

Here's a complete script you can customize:
```python
#!/usr/bin/env python3
"""
Convert LineamentLearning .mat files to HDF5 format.
"""
import argparse
from pathlib import Path

import h5py
import numpy as np
import scipy.io as sio


def convert_mat_to_hdf5(mat_path, output_path, compression='gzip', compression_level=4):
    """Convert .mat file to HDF5."""
    print(f"Loading {mat_path}...")
    mat_data = sio.loadmat(mat_path)

    print(f"Converting to HDF5: {output_path}...")
    with h5py.File(output_path, 'w') as f:
        # Input layers
        inputs = f.create_group('inputs')
        for i in range(1, 9):
            key = f'I{i}'
            if key in mat_data:
                inputs.create_dataset(
                    key,
                    data=mat_data[key],
                    compression=compression,
                    compression_opts=compression_level,
                    chunks=True
                )

        # Masks
        masks = f.create_group('masks')
        for key in ['mask', 'train_mask', 'test_mask']:
            if key in mat_data:
                masks.create_dataset(
                    key,
                    data=mat_data[key],
                    compression=compression,
                    compression_opts=compression_level
                )

        # Labels
        labels = f.create_group('labels')
        for key in ['output', 'DEGREES', 'R2M', 'M2R']:
            if key in mat_data:
                labels.create_dataset(
                    key,
                    data=mat_data[key],
                    compression=compression,
                    compression_opts=compression_level
                )

        # Metadata
        f.attrs['source_file'] = str(mat_path)
        f.attrs['format'] = 'LineamentLearning HDF5'
        if 'I1' in mat_data:
            f.attrs['shape'] = mat_data['I1'].shape

    print(f"Conversion complete: {output_path}")

    # Show file size comparison
    original_size = Path(mat_path).stat().st_size / (1024**2)
    converted_size = Path(output_path).stat().st_size / (1024**2)
    print(f"Original size: {original_size:.2f} MB")
    print(f"Converted size: {converted_size:.2f} MB")
    print(f"Compression ratio: {original_size/converted_size:.2f}x")


def main():
    parser = argparse.ArgumentParser(description='Convert .mat to HDF5')
    parser.add_argument('input', help='Input .mat file')
    parser.add_argument('output', help='Output .h5 file')
    parser.add_argument('--compression', default='gzip', help='Compression type')
    parser.add_argument('--level', type=int, default=4, help='Compression level')
    args = parser.parse_args()

    convert_mat_to_hdf5(args.input, args.output, args.compression, args.level)


if __name__ == '__main__':
    main()
```

Save as `convert_dataset.py` and use:
```bash
python convert_dataset.py dataset.mat dataset.h5
```

| Format | Loading Method | Memory Impact |
|---|---|---|
| .mat | `scipy.io.loadmat` | Loads the entire file into memory |
| .npz | `np.load` | Per-array lazy: each array is read only when accessed |
| .h5 | `h5py` | Can load chunks/slices efficiently |
| zarr | `zarr.open` | Lazy loading, chunk-based |
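Where these differences matter most in LineamentLearning is window-based training. A hypothetical patch iterator (the `iter_patches` name, patch size, and stride are illustrative, not project API) can stream patches from the grouped HDF5 layout without ever holding a full layer in memory:

```python
import h5py
import numpy as np

def iter_patches(h5_path, patch=45, step=500):
    """Yield (row, col, patch_stack) without loading full layers."""
    with h5py.File(h5_path, 'r') as f:
        height, width = f['inputs/I1'].shape
        for r in range(0, height - patch, step):
            for c in range(0, width - patch, step):
                # Read just this window from each of the 8 layers
                stack = np.stack(
                    [f[f'inputs/I{i}'][r:r + patch, c:c + patch] for i in range(1, 9)],
                    axis=-1
                )
                yield r, c, stack

for r, c, x in iter_patches('dataset.h5'):
    pass  # feed x, shape (patch, patch, 8), to the model
```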
For a typical 2000x2000x8 dataset:
```python
import time

# Test loading speeds
def time_loading(path, load_fn):
    """Time loading `path` with a given loader, e.g. sio.loadmat or np.load."""
    start = time.perf_counter()
    load_fn(path)
    return time.perf_counter() - start

# Results (approximate):
# .mat (scipy):     ~2.5 seconds
# .npz (numpy):     ~1.8 seconds
# .h5 (h5py):       ~0.3 seconds (partial load)
# .h5 (full load):  ~1.5 seconds
```

For a typical 2 GB uncompressed dataset:
| Format | Compression | File Size | Load Time |
|---|---|---|---|
| .mat | None | 2000 MB | 2.5s |
| .npz | Default | 800 MB | 1.8s |
| .h5 (gzip, level 4) | gzip | 600 MB | 1.5s |
| .h5 (gzip, level 9) | gzip | 550 MB | 2.0s |
| .h5 (lzf) | lzf | 700 MB | 1.2s |
| zarr (zstd, level 3) | zstd | 580 MB | 1.4s |
Recommendation: HDF5 with gzip compression level 4 provides the best balance.
- Use HDF5 with chunking for datasets > 1 GB
- Enable compression (gzip level 4 or lzf)
- Use lazy loading: don't pull the entire dataset into memory
- Consider Zarr if using cloud storage or Dask
- Profile your specific use case; results vary by data characteristics (see the sketch after this list)
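As a starting point for that profiling, the `time_loading` helper defined earlier can compare formats on your own data. The h5py loader here reads just two layers from the grouped layout, so its number is not directly comparable to the full loads:

```python
import numpy as np
import scipy.io as sio
import h5py

# Reuses time_loading(path, load_fn) from the snippet above
t_mat = time_loading('dataset.mat', sio.loadmat)
t_npz = time_loading('dataset.npz', lambda p: dict(np.load(p)))  # force full read

def load_two_layers(p):
    with h5py.File(p, 'r') as f:
        return [f[f'inputs/I{i}'][:] for i in (1, 2)]

t_h5 = time_loading('dataset.h5', load_two_layers)
print(f".mat: {t_mat:.2f}s  .npz: {t_npz:.2f}s  .h5 (2 layers): {t_h5:.2f}s")
```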
Problem: scipy.io.loadmat fails with "Please use HDF5 reader"
Solution: Use h5py instead:
```python
import h5py
import numpy as np

# MATLAB v7.3 files are actually HDF5 files
with h5py.File('dataset.mat', 'r') as f:
    I1 = np.array(f['I1']).T  # Note: need to transpose!

    # For character arrays
    if 'name' in f:
        name = ''.join(chr(c[0]) for c in f['name'])
```
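If a pipeline has to accept both pre-v7.3 and v7.3 files, a small wrapper can hide the difference. This is a sketch, not part of `mat_converter`, and the h5py branch assumes the file contains only plain numeric arrays:

```python
import h5py
import numpy as np
import scipy.io as sio

def load_any_mat(path):
    """Load a .mat file regardless of version, returning a dict of arrays."""
    try:
        data = sio.loadmat(path)
        return {k: v for k, v in data.items() if not k.startswith('__')}
    except NotImplementedError:
        # scipy raises NotImplementedError for v7.3 (HDF5-based) files
        with h5py.File(path, 'r') as f:
            # Transpose to match loadmat's row-major orientation
            return {k: np.array(f[k]).T for k in f.keys()}
```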
Or convert using MATLAB:

```matlab
% In MATLAB: Convert to older format
load('dataset.mat')
save('dataset_v7.mat', '-v7')
```

Problem: MemoryError when loading large datasets
Solution: Use the converter to create HDF5, then load chunks:
```python
# First, convert to HDF5
from mat_converter import MatConverter

converter = MatConverter()
converter.convert('large_dataset.mat', 'large_dataset.h5', format='hdf5')

# Then load in chunks
import h5py

with h5py.File('large_dataset.h5', 'r') as f:
    # Load only what you need
    I1_chunk = f['inputs/I1'][0:1000, 0:1000]
```

Problem: Loaded data has the wrong dtype (e.g., float32 vs float64)
Solution: Explicitly convert:
```python
import numpy as np

data = np.load('dataset.npz')
I1 = data['I1'].astype(np.float32)  # Convert to float32
```

Problem: Converted file is missing some fields
Solution: Check the original .mat file and handle optional fields:
```python
import scipy.io as sio
import numpy as np

# When converting
mat_data = sio.loadmat('dataset.mat')

# Check which fields exist
available_fields = [k for k in mat_data.keys() if not k.startswith('__')]
print(f"Available fields: {available_fields}")

# Save only available fields
np.savez_compressed('dataset.npz', **{
    k: mat_data[k] for k in available_fields
})
```

Problem: Images appear flipped or transposed
Solution: MATLAB stores arrays in column-major order while NumPy defaults to row-major, so transposing usually fixes the orientation:
```python
import numpy as np

# If the image looks wrong, try transposing
I1_transposed = mat_data['I1'].T

# Or use Fortran ('F', column-major) order for MATLAB-like memory layout
I1_fortran = np.asfortranarray(mat_data['I1'])
```

Always validate your conversion:
```python
def validate_conversion(mat_path, converted_path, format='hdf5'):
    """Validate that conversion preserved data."""
    import scipy.io as sio
    import h5py
    import numpy as np

    # Load original
    mat_data = sio.loadmat(mat_path)

    # Load converted
    if format == 'hdf5':
        with h5py.File(converted_path, 'r') as f:
            for i in range(1, 9):
                key = f'I{i}'
                original = mat_data[key]
                converted = f[f'inputs/{key}'][:]

                # Check equality
                if not np.allclose(original, converted, rtol=1e-10):
                    print(f"ERROR: {key} mismatch!")
                    return False
    elif format == 'numpy':
        data = np.load(converted_path)
        for i in range(1, 9):
            key = f'I{i}'
            if not np.allclose(mat_data[key], data[key], rtol=1e-10):
                print(f"ERROR: {key} mismatch!")
                return False

    print("Validation passed! ✓")
    return True

# Use it
validate_conversion('dataset.mat', 'dataset.h5', format='hdf5')
```

Recommendations:

- For most users: Convert to HDF5 with gzip compression level 4
- For quick experiments: Use NumPy .npz format
- For cloud/distributed: Use Zarr
- For analysis: Use Pandas/Parquet for sample extraction
A typical end-to-end workflow:

```bash
# 1. Inspect original file
python -m mat_converter --inspect dataset.mat

# 2. Convert to HDF5
python -m mat_converter dataset.mat dataset.h5 --format hdf5

# 3. Validate conversion
python -m mat_converter --validate dataset.mat dataset.h5

# 4. Use with LineamentLearning
lineament-train --data dataset.h5 --format hdf5 --output ./models
```

Additional resources:

- HDF5 Documentation: https://docs.h5py.org/
- Zarr Documentation: https://zarr.readthedocs.io/
- NumPy I/O: https://numpy.org/doc/stable/reference/routines.io.html
- SciPy MATLAB I/O: https://docs.scipy.org/doc/scipy/reference/io.html
If you encounter issues:
- Check this guide's Troubleshooting section
- Inspect your .mat file structure with `--inspect`
- Validate conversions with `--validate`
- Open an issue on GitHub with file structure details
Next Steps:
- See `examples/mat_conversion_examples.py` for complete examples
- See `mat_converter.py` for the conversion tool source code
- See `DATASET.py` for how converted data is loaded