Skip to content

Add ClassificationLayer for Raster Classification #668

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

vschaffn
Copy link
Contributor

@vschaffn vschaffn commented Mar 25, 2025

Resolves GlacioHack/xdem#692.

Context:

This PR introduces the ClassificationLayer functionality within the geoutils library. The goal is to provide tools for classifying raster data into discrete classes, whether based on numeric intervals (e.g., binning elevation ranges) or categorical classes (e.g., land cover types). This enhancement improves geospatial workflows by allowing users to easily classify raster data and compute statistics for each class.

Features:

  • ClassificationLayer Abstract Base Class:

    • Serves as the foundation for implementing raster classification workflows.
    • Contains methods for applying classification (apply_classification), computing statistics (get_stats, based on Raster.get_stats), and saving results (save).
  • Binning Class:

    • Extends ClassificationLayer for classifying continuous raster data using custom binning.
    • Automatically generates class masks based on user-defined bin edges.
  • Segmentation Class:

    • Classify from pre-defined segmentation masks.
  • Fusion Class:

    • Combines multiple classification masks into fused classes.
  • Save/Export Functionality:

    • Exports classification masks as a multi-band GeoTIFF file.
    • Saves class names as a JSON file.
    • Outputs computed statistics as a CSV file.

Tests:

Unit tests for Binning to verify:

  • Correct creation of class masks based on the given bin ranges.
  • Proper computation of statistics for classified raster pixels.
  • Coverage for saving/exporting classification results (GeoTIFF, JSON, CSV).

Unit tests for Segmentation.
Unit tests for Fusion.

Documentation:

  • Adds a new documentation page for classification.

Example:

# load raster (note, the raster parameter can also be a file path)
raster_file = gu.examples.get_path("exploradores_aster_dem")
raster = gu.Raster(raster_file)

# Define bins
bins = [0, 1000, 2000, 3000, np.inf]

# Create Binning object
binning = gu.raster.Binning(raster, "elevation", bins)

# Apply binning
binning.apply()

# Compute statistics
# if `stats` is None, all stats are calculated by default
# if `classes` is None, stats are computed for all classes by default
binning.get_stats(stats=["mean", "std"], classes=["[0, 1000)", "[3000, inf)"])

# Save results
binning.save("elevation_binning")

@adebardo
Copy link
Contributor

Thanks, this PR will be used as support for 28/03/2025 and debate with the all team in person :)

Copy link
Contributor

@adebardo adebardo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not a real review, I just look a it for a minute, thanks for work again

@vschaffn vschaffn force-pushed the 692-classification_layer branch 2 times, most recently from 1a10549 to d687665 Compare April 17, 2025 11:43
@vschaffn vschaffn force-pushed the 692-classification_layer branch from 86a9747 to 9c59796 Compare April 24, 2025 08:25
@vschaffn vschaffn force-pushed the 692-classification_layer branch from 77fe41a to bf7fd44 Compare April 24, 2025 15:44
:raise ValueError: if req_stats_classes are not class names.
"""
if self.classification_masks is None or self.class_names is None:
raise ValueError("Classification has not been applied yet. Call apply_classification() first.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace "apply_classification" -> apply

@adehecq
Copy link
Member

adehecq commented Apr 28, 2025

Thanks @vschaffn for the implementation. This will certainly be useful for mainy users!
I summarize here my suggestions following the meeting from this morning:

  • I find it a bit difficult to grasp the differences between the different classes (ClassificationLayer, Binning, Segmentation, Fusion). I would suggest simplifying to a single class, with different methods/inputs to handle different cases. In short, the Segmentation class could take 1) a single raster with discrete elements (for example a land cover classification containing values 1, 2, 3 etc...) 2) multiple binary rasters (Masks) for each classification 3) a continuous raster (dem, slope...) and a binning strategy for discretizing the values. Then the get_stats method would loop through the different values or layers in the input raster.
  • in case the previous point is not implemented, I would suggest removing the "shortcut" to geoutils.rasters.classification, to clarify for example that Fusion means fusion of classifiers.
  • I would add a method read that would do the reverse of `save
  • I suggest adding a landcover dataset in the example datasets for documenting and testing these features. For example, the Corine land cover. Unfortunately, I don't think this extends outside Europe, so not possible to have it with the aster_exploradores dataset...
  • I would add an issue to remember to remove redundancy of similar functionalities in xdem (e.g., binning)

@rhugonnet
Copy link
Member

rhugonnet commented Apr 28, 2025

I was about to write my feedback in the issue following previous comments there (GlacioHack/xdem#692 (comment)), but seeing @adehecq's comment I'll follow up here as well.
I already detailed a lot of ideas in these previous comments, notably on relying on SciPy or Pandas/Xarray directly, so I refer to those previous comments and I'm only thinking further ahead here.

Taking stock of existing tools:
I did some additional reading on this topic (as things have been evolving in the last years, in Pandas and Xarray), so that we don't re-invent the wheel on this.
I think the best resource about this topic is the following Xarray page that we should all read: https://docs.xarray.dev/en/stable/user-guide/groupby.html#groupby-group-and-bin-data. The related pandas doc is here: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html.
In short, the utility GroupBy allows to group categorical values with a UniqueGrouper (what we call "segmentation" here) or to group continuous values with a BinGrouper (what we call "binning" here), and supports multiple indexing (what we call "fusion" here). So it basically includes all of the functionalities we need, and is accelerated natively in Xarray and through the recent development of Flox: https://xarray.dev/blog/flox.

So I think the big question is: What's different in our case that would justify developping a new functionality?

There are two aspects:
1. Technical aspect: Can we better optimize this for rasters? We are not using generic rectilinear N-D labelled arrays like Xarray, but regular-grid 2D array that are georeferenced. Does this have any impact on the operation? In this case, I don't think so: Masking values based on groups (categorical or bins) is the same no matter the shape of the array, or the nature of the grid (we can flatten everything into 1D arrays without georeferencing, and it would work exactly the same). So nothing to optimize here.
2. Convenience aspect: Can we make it more easy to use for rasters? Here, there are definitely things to do, because groupby() is meant to support any N-D input, that are not necessarily georeferenced, so it can be overly complex to approach. We could wrap it to return easier-to-understand objects.

So, in my view, it's only 2 that is relevant. But then, I don't think we want to create a completely new API when there is already a good abstraction covering all of our use cases that exists in Pandas/Xarray.

So I'd be in favour of adding a geoutils/stats/group.py and a Raster.groupby() function that wraps Pandas/Xarray groupby(), potentially with additional options specific to rasters (return_masks for instance).
This could probably support saving masks and stats fairly easily, such as:

stats, masks = dem.groupby(slope, slope_bins, return_masks=True)
stats.to_csv(out_csv)
masks.save(out_tif)

If that API is still deemed too difficult by CNES, we could maintain the classes above to wrap up the operations, but it would duplicate things into a second API.

@rhugonnet
Copy link
Member

Thinking a bit more about this, a couple additional points:

  1. To bin in Pandas, one always need to combine pd.groupby() with pd.cut(). In Xarray, they simplified that by adding a groupby_bins function which repeats all the argument of pd.cut()/BinGrouper to streamline things for the user (https://docs.xarray.dev/en/stable/generated/xarray.Dataset.groupby_bins.html#xarray.Dataset.groupby_bins). We could do the same for Raster. We could also think of accepting bins=None to indicate a categorical variable, this way we could combine the API in a single function.
  2. A groupby() call doesn't return the reduced result directly, but an iterator of (bin/category, groups), which then can be used to apply the desired reducer function. The difference in Xarray is that the coordinates are preserved in the grouping by returning a DataArray with the right dimensions and subset coordinates. For a Raster, this is not possible anyway, so we can leave it aside.
  3. Finally, I realized our main structural difference with Xarray/Pandas is that we have a single variable being grouped (if we consider a single-band Raster, which is 95% of the time). Because of this, having the Pandas logic described in point 2 is less useful, because one of the main purposes of the "Grouped" dataframe object of Pandas is being able to then use different reducers on the different variables that were grouped later on, without re-running the grouping.

So, to simplify the API for a Raster and avoid the more complex logic of Pandas, we could immediately have a higher-level function grouped_statistics that already calls both pd.cut() where necessary, and .apply()/.agg() on the grouped output of self.data, because that's the only one we can compute anyway. So it would look like this:

class Raster
    def grouped_statistic(self, band=1, vars, bins, apply_funcs):
         df_self = pd.DataFrame([self.data[band-1].ravel(), *(v.ravel() for v in vars)])
         return df_self.groupby([var1, pd.cut(var1, bins1, ...)], [...]).agg(apply_funcs)

For this reason (and why it's like that in xDEM's nd_binning now that I look back on this), we could use SciPy's binned_statistic_dd, we would only need to use a small trick for defining the bins of the categorical data. It should be equivalent.
Both support binning and multi-indexing otherwise.

@rhugonnet
Copy link
Member

So the question we need to answer to move forward is: What would be missing in the functionalities of such a Raster.grouped_statistic() method to justify building a class wrapper on top?

For getting/saving the georeferenced masks, there could be an optional dictionary output using the same group name logic coded in this PR, with group key copied from the output dataframe, and all could be saved doing:

stats, mask_dict = dem.grouped_statistics(slope, slope_bins, ...)
for name, mask in mask_dict.items():
    mask.save("mask_"+name+".tif")

But, if we wanted to have tailored plotting functions (to visualize the stats of 1D/2D groups), it'd be something hard to fit as an output of a function call (and we use it a lot in xDEM), and a class would be really useful.

Finally, having a CLI functionality that wraps this (reading + binning + plotting + saving) all at once might also simplify things for end-users?

These are all my thoughts! 😄

@adehecq
Copy link
Member

adehecq commented Apr 29, 2025

Just a note on my suggestion of adding a landcover classification. I think we could use the ESA worldcover. An example for accessing the data through the AWS bucket is shown here: https://github.com/ESA-WorldCover/esa-worldcover-datasets.

@belletva
Copy link
Contributor

Thank you @rhugonnet and @adehecq for your comments. It is nice to see that the topic is generating a lot of interest. However, I suggest that these ideas should correspond to new issues. It might be interesting to merge if there are no blocking points and propose new issues to be addressed later.

@rhugonnet
Copy link
Member

@belletva Thanks for wrapping things up! 🙂
It made me realize I need to clarify further: All my points above are not about proposing enhancements, they relate to the core question "Do we need a class structure for this feature?". So it is blocking in the sense that we don't want to add a class structure now to then remove it entirely in next PRs because we actually don't need it (and because it won't be backwards-compatible). We need to decide ahead what is the right structure for this feature.

From my point of view, based on existing functionalities in SciPy/Pandas described above that already cover all use cases, I think we should have a wrapper method Raster.grouped_statistics(), rather than a class system. (Note that these other packages also chose to define a method there, even SciPy that is a package that loves class implementations)
This is mainly because a pd.DataFrame object can already store all attributes we need after a single function call. And, in our case for rasters, associated masks can easily be returned/saved in a one-liner as examplified above, which covers all functionalities. Finally, there doesn't seem to be a need to develop additional methods on this class object (except plotting?), which is usually what a class structure would be useful for.

In short: If there is a need for a class structure, I haven't really grasped it yet. This rather fits a method-type functionality (which would be very fast to implement, as simply wrapping Pandas/SciPy).
I'm happy to discuss more to understand if you see the need for a class structure! 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[POC] classif step 1 : Create a base class for classification layers
5 participants