Authors (of original research concept): Esraa E. Abouelmaaty, Obed Omane Okyere, José Manuel Echevarría Rubio
Date (of original research): April 8th, 2025
Script Author/Maintainer: [b10nics]
This project implements a seabed classification workflow using integrated multibeam echosounder (MBES) data and machine learning algorithms.
The primary goal is to classify the seabed into distinct categories based on bathymetry, backscatter, and derived terrain features. This script automates the following processes:
- Terrain Analysis: Calculation of morphological features (Slope, Aspect, TRI, TPI, Roughness, Hillshade) from a bathymetry GeoTIFF.
- Data Alignment: Optional alignment of a backscatter GeoTIFF to the bathymetry grid.
- Feature Engineering: Stacking of bathymetry, (aligned) backscatter, and terrain features into a multi-band GeoTIFF.
- Ground Truth Integration: Extraction of feature values at ground truth point locations (from a CSV file), including reprojection to match the raster CRS.
- Supervised Classification: Training and application of a Random Forest classifier, including hyperparameter tuning via GridSearchCV.
- Unsupervised Classification: Application of K-Means clustering for an exploratory perspective.
- Output Generation: Creation of GeoTIFF classification maps for both Random Forest and K-Means results.
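As a rough illustration of the terrain-analysis step, the sketch below derives slope and a simple TPI from a small depth grid using NumPy only. It is not the script's implementation, which operates on GeoTIFFs. The function names `slope_degrees` and `tpi_3x3`, the 3x3 window, and the synthetic grid are assumptions made for this example.

```python
import numpy as np


def slope_degrees(depth: np.ndarray, cell_size: float) -> np.ndarray:
    """Slope in degrees from finite differences (illustrative, not Horn's method)."""
    dz_dy, dz_dx = np.gradient(depth.astype(float), cell_size)
    return np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))


def tpi_3x3(depth: np.ndarray) -> np.ndarray:
    """TPI: cell value minus the mean of its 8 neighbours (edges padded by replication)."""
    d = depth.astype(float)
    padded = np.pad(d, 1, mode="edge")
    rows, cols = d.shape
    neighbour_sum = np.zeros_like(d)
    for dr in (0, 1, 2):
        for dc in (0, 1, 2):
            if dr == 1 and dc == 1:
                continue  # skip the centre cell itself
            neighbour_sum += padded[dr:dr + rows, dc:dc + cols]
    return d - neighbour_sum / 8.0


# Hypothetical 10 m grid: a plane that deepens by 1 m per cell towards the east.
plane = np.tile(np.arange(5, dtype=float) * -1.0, (5, 1))
slope = slope_degrees(plane, cell_size=10.0)
tpi = tpi_3x3(plane)
```

On a uniformly dipping plane the slope is constant and the interior TPI is zero, which makes this a convenient sanity check before applying the same idea to real bathymetry. Edge cells are affected by the padding choice and should be interpreted with care.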
The script successfully processed the example dataset (bathy_cube_10_filled_5x5.tiff, back_10_filled_5x5.tiff, ground_truth_samples_removed.csv):
- Input Data:
  - Bathymetry: 3299x1907 pixels, 10 m resolution, UTM zone 15S.
  - Backscatter: Successfully aligned to the bathymetry grid.
  - Ground Truth: 292 points loaded and reprojected from EPSG:4326 to EPSG:32715.
- Feature Generation:
  - All terrain derivatives (Slope, Aspect, TRI, TPI, Roughness, Hillshade) were successfully generated.
  - The final stacked raster for classification contained 7 features: Depth, Backscatter, Slope, Aspect, TRI, TPI, and Roughness.
- Training Data:
  - Feature values were extracted for all 292 ground truth points; no points were dropped due to NoData values.
  - Classes were mapped to integers (e.g., 'Biogenic mat': 0, 'Lava flows': 4).
- Random Forest Classification:
  - The model was trained on 204 samples and tested on 88 samples.
  - Best Parameters (GridSearchCV): `{'class_weight': 'balanced', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}`
  - Test Set Performance:
    - Overall Accuracy: 77.27%
    - Kappa Coefficient: 0.727
  - Feature Importances (Top 3):
    - Depth: ~29.22%
    - TPI: ~14.32%
    - Backscatter: ~12.46%
  - The full classification map (`classification_rf.tif`) was generated for 485,932 valid pixels.
- K-Means Clustering:
  - Data was scaled, and K-Means (k=7) was successfully applied.
  - The K-Means classification map (`classification_kmeans.tif`) was generated.
- Execution Time: Approximately 17-19 seconds.
- Known Issues (from log):
  - Plotting of the final classification maps encountered an error: "This method only works with the ScalarFormatter." This is a Matplotlib issue with the current colorbar formatter for discrete integer maps and does not affect the GeoTIFF output.
  - A UserWarning from scikit-learn (`X does not have valid feature names, but RandomForestClassifier was fitted with feature names`) was observed during prediction. This is generally benign if the order and number of features are consistent, but ideally, prediction data should also be a DataFrame with matching column names.
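The feature-name warning noted above can usually be avoided by wrapping the prediction array in a DataFrame whose columns match those used for training. The following is a minimal sketch under stated assumptions: the data is synthetic (not the project's ground truth), and the `FEATURES` list and variable names are illustrative rather than taken from the script.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Feature order as listed for the stacked raster in this README.
FEATURES = ["Depth", "Backscatter", "Slope", "Aspect", "TRI", "TPI", "Roughness"]

# Synthetic stand-in data, only to make the sketch runnable.
rng = np.random.default_rng(0)
train_df = pd.DataFrame(rng.normal(size=(40, len(FEATURES))), columns=FEATURES)
labels = rng.integers(0, 3, size=40)

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(train_df, labels)  # fitted WITH feature names

# Pixels read from a raster usually arrive as a plain (n_pixels, n_features) array.
pixels = rng.normal(size=(5, len(FEATURES)))

# Wrapping the array in a DataFrame with the same columns avoids the warning.
pixels_df = pd.DataFrame(pixels, columns=FEATURES)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    predicted = model.predict(pixels_df)

feature_name_warnings = [w for w in caught if "feature names" in str(w.message)]
```

An alternative with the same effect is to train on a plain NumPy array (for example `train_df.to_numpy()`), so that neither fitting nor prediction carries feature names. Either way, the column order must match between training and prediction.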
The script expects the following input files to be placed in the input_data directory (relative to the script's location):
- Bathymetry File (`bathy_file_name`):
  - Format: GeoTIFF (`.tiff`, `.tif`)
  - Example: `bathy_cube_10_filled_5x5.tiff`
- Backscatter File (`backscatter_file_name`): (Optional)
  - Format: GeoTIFF (`.tiff`, `.tif`)
  - Example: `back_10_filled_5x5.tiff`
- Ground Truth CSV File (`ground_truth_csv_name`):
  - Format: CSV (`.csv`)
  - Required Columns: `Longitude`, `Latitude`, `Class`
  - Example: `ground_truth_samples_removed.csv`
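Before running the full workflow, it can help to confirm that the ground truth CSV carries the three required columns. The sketch below uses only the standard library; the helper name `load_ground_truth` and the sample rows are made up for illustration and are not part of the script or the example dataset.

```python
import csv
import io

REQUIRED_COLUMNS = {"Longitude", "Latitude", "Class"}


def load_ground_truth(handle):
    """Read ground truth rows and fail early if a required column is missing."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Ground truth CSV is missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        rows.append({
            "Longitude": float(row["Longitude"]),
            "Latitude": float(row["Latitude"]),
            "Class": row["Class"].strip(),
        })
    return rows


# Made-up sample rows for illustration only.
sample = io.StringIO(
    "Longitude,Latitude,Class\n"
    "-91.10,-0.50,Lava flows\n"
    "-91.20,-0.60,Biogenic mat\n"
)
points = load_ground_truth(sample)
```

For a real file, the same function can be called with `open("input_data/ground_truth_samples_removed.csv", newline="")` in place of the in-memory sample.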
The script generates output files in the Outputs_SeabedClassification directory:
- Terrain Feature Rasters: `slope.tif`, `aspect.tif`, `tri.tif`, `tpi.tif`, `roughness.tif`, `hillshade.tif`.
- Processed Backscatter: `backscatter_aligned_to_bathy.tif`.
- Stacked Features: `stacked_features_for_classification.tif`.
- Classification Maps: `classification_rf.tif`, `classification_kmeans.tif`.
- Intermediate VRT files may also be present.
Python 3 with the following major libraries: GDAL/OGR, Rasterio, GeoPandas, Pandas, NumPy, Scikit-learn, Matplotlib.
(See requirements.txt for a more detailed list).
- Clone the repository.
- Install GDAL: System-wide or via Conda is recommended (e.g., `sudo apt install gdal-bin libgdal-dev python3-gdal` or `conda install -c conda-forge gdal`).
- Create a Python Virtual Environment: `python3 -m venv venv`, then `source venv/bin/activate`.
- Install Python Dependencies: `pip install -r requirements.txt` (Ensure `requirements.txt` reflects the necessary packages).
- PROJ_LIB Environment Variable: The script attempts to set this. If CRS errors persist, ensure it points to your PROJ data directory (e.g., `/usr/share/proj`, `/opt/conda/envs/your_env/share/proj`).
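If CRS errors persist, one way to set `PROJ_LIB` from Python is to pick the first candidate directory that exists. This is only a sketch of the idea, not the logic the script actually uses; the helper name `ensure_proj_lib` is hypothetical, and the candidate paths are simply the examples listed above and will differ between systems.

```python
import os
from pathlib import Path

# Candidate locations are the examples given above; adjust for your system.
CANDIDATE_PROJ_DIRS = [
    "/usr/share/proj",
    "/opt/conda/envs/your_env/share/proj",
]


def ensure_proj_lib(candidates=CANDIDATE_PROJ_DIRS, environ=os.environ):
    """Return the PROJ data directory in use, setting PROJ_LIB if it is unset or invalid."""
    current = environ.get("PROJ_LIB")
    if current and Path(current).is_dir():
        return current  # respect a valid existing setting
    for candidate in candidates:
        if Path(candidate).is_dir():
            environ["PROJ_LIB"] = candidate
            return candidate
    return None  # nothing found; leave the environment untouched
```

A call such as `ensure_proj_lib()` would normally be placed before importing libraries that load PROJ, so that the variable is already set when they initialise.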
- Prepare Data: Place input files in the `input_data` sub-directory.
- Configure Script:
  - Open `Seabed_classification_mod.py`.
  - Verify `data_dir_relative` and `output_dir_relative` if your project structure differs.
  - Adjust `bathy_file_name`, `backscatter_file_name`, `ground_truth_csv_name` if needed.
  - Review other parameters like `n_kmeans_clusters`, `rf_cv_folds`, etc.
- Execute: `python Seabed_classification_mod.py`
- Check Outputs: In the `Outputs_SeabedClassification` sub-directory.
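The final check can be automated with a few lines of standard-library Python. The sketch below lists which of the documented output files are absent from a directory; the helper name `missing_outputs` is hypothetical and not part of the script.

```python
from pathlib import Path

# File names as listed in the outputs section of this README.
EXPECTED_OUTPUTS = [
    "slope.tif", "aspect.tif", "tri.tif", "tpi.tif", "roughness.tif", "hillshade.tif",
    "backscatter_aligned_to_bathy.tif",
    "stacked_features_for_classification.tif",
    "classification_rf.tif", "classification_kmeans.tif",
]


def missing_outputs(output_dir, expected=EXPECTED_OUTPUTS):
    """Return the expected file names that are absent from output_dir."""
    output_dir = Path(output_dir)
    return [name for name in expected if not (output_dir / name).is_file()]
```

For example, `missing_outputs("Outputs_SeabedClassification")` returns an empty list when every file is present. Because the backscatter input is optional, `backscatter_aligned_to_bathy.tif` may legitimately be missing when no backscatter file was supplied.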