An empirical study and comparison of deterministic, statistical, and machine learning (ML) algorithms for spatial modeling of significant wave height values collected by buoys and sea monitoring stations managed by the United States' National Data Buoy Center (NDBC), located near the coasts of the southern Atlantic regions of the United States, including the Gulf of Mexico and parts of the Caribbean.
- Deterministic methods: linear barycentric interpolation, Inverse Distance Weighting (IDW), and Radial Basis Function (RBF) interpolation.
- Statistical methods: Kriging Interpolation (Gaussian Process Regression).
- Machine Learning methods: LightGBM and Random Forests.
- The experimental study was conducted on a large dataset of hourly wave and meteorological measurements over the 2010-2022 period, collected by buoys moored in the mid-Atlantic near the southeast coast of the continental United States.
- Data were downloaded directly from the NDBC's historical standard meteorological archive. The location of each targeted buoy was obtained by scraping the webpage that lists its individual information (e.g: https://www.ndbc.noaa.gov/station_page.php?station=44008).
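As a minimal illustration of the simplest deterministic baseline listed above, the sketch below implements IDW in plain NumPy: each query point receives a weighted average of the known station values, with weights proportional to the inverse of the distance raised to a power. The station coordinates and wave heights are toy values, not real NDBC data.

```python
import numpy as np

def idw_interpolate(xy_known, z_known, xy_query, power=2.0, eps=1e-12):
    """Inverse Distance Weighting: each query point is a weighted
    average of known values, with weights 1 / distance**power."""
    # Pairwise distances between query points and known stations.
    d = np.linalg.norm(xy_query[:, None, :] - xy_known[None, :, :], axis=2)
    w = 1.0 / (d + eps) ** power
    return (w @ z_known) / w.sum(axis=1)

# Toy example: three stations and one query point at their centroid.
stations = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
heights = np.array([1.0, 2.0, 3.0])  # significant wave height (m)
query = np.array([[1 / 3, 1 / 3]])
print(idw_interpolate(stations, heights, query))
```

Note that IDW estimates are always bounded by the min and max of the observed values, which is one reason purely deterministic schemes struggle with extrapolation outside the station network.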
Timeseries of wave height measurements from buoy #42019
- General preprocessing was done by defining a Kedro pipeline to detect and parse missing values, format the columns, and convert the data to the GeoParquet format (GeoPandas was used for read/write operations and to work with the measurements as geospatial data).
- The data was then split into training and test sets. The test set itself consisted of several subsets of selected data, each of which was used to evaluate the performance of the algorithms based on the specific spatial configuration of the buoys available in each set.
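The missing-value step of the preprocessing above can be sketched in a few lines of pandas. NDBC standard meteorological files mark missing measurements with numeric sentinel codes (values such as 99.0 and 999.0, depending on the column); the exact sentinel set and the `WVHT`/`WSPD` column names below follow the NDBC archive format, but the frame itself is a toy stand-in, not part of the actual Kedro pipeline.

```python
import numpy as np
import pandas as pd

# Column-dependent sentinel codes used by NDBC for missing values.
# This set is an assumption; check the archive's column descriptions.
MISSING_SENTINELS = [99.0, 999.0, 9999.0]

def clean_ndbc_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Replace sentinel codes with NaN and drop rows lacking wave height."""
    out = df.replace(MISSING_SENTINELS, np.nan)
    return out.dropna(subset=["WVHT"])

raw = pd.DataFrame({
    "WVHT": [1.2, 99.0, 2.5],   # significant wave height (m)
    "WSPD": [5.0, 7.1, 999.0],  # wind speed (m/s)
})
cleaned = clean_ndbc_frame(raw)
print(cleaned)
```

In the real pipeline the cleaned frame would then be joined with the scraped buoy coordinates and written out with GeoPandas (e.g. `GeoDataFrame.to_parquet`), which is what makes the GeoParquet files usable as geospatial data downstream.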
Test subsets evaluated in this area. Buoys circled in red were not available in the training set of the corresponding period.
- Evaluation was conducted by defining individual MLflow experiments for each algorithm, which were then executed in parallel on each subset of the test data (see the experiments/ directory for examples).
- The results of the experiments were then analyzed by comparing the performance of the algorithms on the test sets.
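The per-subset comparison boils down to computing an error metric per algorithm on each test subset. The sketch below shows the RMSE calculation on hypothetical predictions; in the study these values are logged to MLflow runs rather than printed, and the prediction arrays here are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical predictions from two algorithms on one test subset.
y_true = [1.0, 2.0, 3.0]
preds = {"idw": [1.1, 2.2, 2.7], "lightgbm": [1.0, 2.1, 3.1]}
scores = {name: rmse(y_true, p) for name, p in preds.items()}
print(scores)
```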
The results of the study favour the use of ML algorithms over the other methods when paired with a strong feature set that captures the spatial distribution of the data well. While they achieve error similar to the other algorithms on sets that test interpolation inside the convex hull of the data (sets A, B, C), they are much better on points that require extrapolation outside the convex hull (sets D, E, F).
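The interpolation/extrapolation distinction above can be made concrete by checking whether each test point falls inside the convex hull of the training stations; `scipy.spatial.Delaunay.find_simplex` returns -1 for points outside the triangulation. The coordinates below are hypothetical, not the actual buoy positions used in sets A-F.

```python
import numpy as np
from scipy.spatial import Delaunay

def inside_convex_hull(train_xy, test_xy):
    """Boolean mask: True where a test point lies inside the convex
    hull of the training stations (find_simplex returns -1 outside)."""
    tri = Delaunay(train_xy)
    return tri.find_simplex(test_xy) >= 0

# Hypothetical layout: four training buoys forming a square, one
# interior test point (interpolation) and one exterior (extrapolation).
train = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
test = np.array([[0.5, 0.5], [2.0, 2.0]])
print(inside_convex_hull(train, test))  # [ True False]
```

A check like this is what separates the "interpolation" subsets from the "extrapolation" ones when attributing the error gap between the ML models and the deterministic baselines.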
Overall error metrics | Avg RMSE per test set
Visual results per evaluated technique
Of the two ML methods, gradient boosting (LightGBM) turned out to be the most successful, not only in accuracy but also in inference time, running about 3x faster than Random Forest.