
Polygonize potential improvements #27

Closed
7 of 13 tasks
cholmes opened this issue Oct 22, 2024 · 19 comments · Fixed by #31, #51 or #50

Comments

@cholmes
Member

cholmes commented Oct 22, 2024

These should perhaps break out into their own issues, but a number of these could likely be done in one chunk of work for the 'polygonize' command proposed in #19 (these suggestions are currently based on #25 where it is its own command).

@m-mohr m-mohr linked a pull request Oct 25, 2024 that will close this issue
@cholmes
Member Author

cholmes commented Oct 28, 2024

@m-mohr - could you take on a chunk of these? Then we can make additional issues for the ones you don't do.

The top one to me is fiboa / GeoParquet output. We could also do JSON in fiboa, and then make the GeoPackage output 'fiboa-like' by default. And make it so that if someone says -o fields.parquet, it just automatically does fiboa and GeoParquet.

@m-mohr m-mohr reopened this Oct 28, 2024
@m-mohr
Collaborator

m-mohr commented Oct 29, 2024

@cholmes How long should this take following the README instructions?

The top one for sure should probably be the algorithm. This thing is so slow due to O(n³) complexity that I can't even get it to properly print the status to the terminal on my machine. It always stops printing the status at some point and the machine stays busy forever. That's also the most complex issue, but it doesn't make sense to do anything else before this works better.

@calebrob6
Collaborator

What takes O(n³) complexity?

@m-mohr
Collaborator

m-mohr commented Oct 29, 2024

The three for loops that are nested in the code :-)

            for y in range(0, input_height, polygonization_stride):
                for x in range(0, input_width, polygonization_stride):
                    ...
                    for geom, val in rasterio.features.shapes(mask_window, transform=new_transform):
                        ...

@calebrob6
Collaborator

calebrob6 commented Oct 29, 2024

Hmmm, these aren't all n -- the first two for loops iterate over chunks of the input raster, and the third, nested for loop iterates over the contiguous blocks of 1's in the chunk. The whole thing takes time that depends on the number of contiguous regions of 1's.

rasterio's shapes(..) says -- https://rasterio.readthedocs.io/en/latest/api/rasterio.features.html#rasterio.features.shapes

The amount of memory used by this algorithm is proportional to the number and complexity of polygons produced. This algorithm is most appropriate for simple thematic data. Data with high pixel-to-pixel variability, such as imagery, may produce one polygon per pixel and consume large amounts of memory.

Instead of running rasterio.features.shapes(...) on the entire input at one time (which just blocks for a long time), I split it up so that the user at least gets status updates.
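That chunking idea can be sketched roughly as follows (a hypothetical helper of my own, not the script's actual code; per window you would read just that block of the mask and call `rasterio.features.shapes` on it, as in the snippet above):

```python
def iter_windows(height, width, stride):
    """Yield (row_off, col_off, win_height, win_width) tiles that exactly
    cover a raster of the given size, clamping the last row/column."""
    for y in range(0, height, stride):
        for x in range(0, width, stride):
            yield y, x, min(stride, height - y), min(stride, width - x)

# Example dimensions (a Sentinel-2 tile is 10980x10980 pixels); per window:
#   mask_window = mask[y:y + h, x:x + w]
#   for geom, val in rasterio.features.shapes(mask_window, transform=...):
#       ...
windows = list(iter_windows(10980, 10980, 2048))
for i, (y, x, h, w) in enumerate(windows, 1):
    pass  # polygonize this window, then report f"{i}/{len(windows)} windows done"
```

The point is that progress can be reported per window, instead of one long blocking call over the whole raster.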

@m-mohr
Collaborator

m-mohr commented Oct 29, 2024

You are right, my bad. Nevertheless, it's slow ;-) And for whatever reason, at some point the CLI progress bar stops updating for me (before 100%). So I'm trying to fix/optimize it - not sure yet how...

@m-mohr
Collaborator

m-mohr commented Oct 29, 2024

@calebrob6 I just looked at one of the polygonize outputs. Are these lines that split the boundaries a result of the optimizations for progress reporting? That doesn't seem ideal. They appear pretty regularly in the file.

[screenshot: polygon output with field boundaries split along regularly spaced lines]

@calebrob6
Collaborator

Yep -- I agree not ideal

@cholmes
Member Author

cholmes commented Oct 29, 2024

We talked about it in Slack and it's on the list above (just edited it to make it a bit clearer, it was a bit buried) - we should merge those back.

As for performance, we can remove the reprojection, as @calebrob6 pointed out that we shouldn't need it for area calculations if the projections are UTM, and at this point we can assume everything is Sentinel-2 input. The reprojection definitely made things slower.

But I didn't find it excessively slow, and it would always complete for me, so it sounds like something else may be going wrong.

@m-mohr
Collaborator

m-mohr commented Oct 30, 2024

I don't think you can easily merge them back. You can already see some examples above where it's not really clear whether they are separate fields or were split due to the window. Also, some shapes still look different even when merged...

I think we need to get rid of the windows and just work on the full thing, unfortunately. That will mean no progress bar for the first few minutes, but that's not a good enough reason to return suboptimal geometries, I think.

I have an alternative algorithm in mind; I'll try that out, and I'll also try to optimize the number of operations per shape. Working in #49

@calebrob6
Collaborator

I think the main optimization will be to sieve / preprocess the raster input (i.e. model output) that goes into the polygonization step. A 3x3 or 5x5 mode filter may help (but may also merge nearby fields that are separated by a single pixel).
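For a binary mask, a 3x3 mode (majority) filter can be sketched in plain NumPy (a hypothetical helper, not from this repo; `rasterio.features.sieve` is another option for dropping small regions):

```python
import numpy as np

def mode_filter3(mask):
    """3x3 majority filter for a binary {0,1} mask: a pixel becomes 1 iff
    at least 5 of the 9 pixels in its neighborhood are 1 (edge-replicated)."""
    padded = np.pad(mask.astype(np.uint8), 1, mode="edge")
    # Sum the 9 shifted copies to count the 1s in each 3x3 neighborhood.
    counts = sum(
        padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
        for dy in range(3) for dx in range(3)
    )
    return (counts >= 5).astype(np.uint8)

noisy = np.zeros((7, 7), dtype=np.uint8)
noisy[3, 3] = 1          # a lone pixel, likely noise
noisy[0:3, 0:3] = 1      # a 3x3 solid block
cleaned = mode_filter3(noisy)  # lone pixel removed; block interior survives
```

This also illustrates the caveat above: a one-pixel gap between two fields has a majority of 1s in its neighborhood and would get filled in, merging the fields.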

Another thought I had (and reason why I chunked it in the first place) was to run rasterio.features.shapes in parallel for different input chunks.

To partially offset the boundary issues we are seeing with the chunked approach, we could buffer each chunk by X pixels (i.e. run on overlapping windows) then merge the results intelligently before simplifying. I haven't thought too closely, but any polygons that cross the border between chunks (disregarding the buffer in this case) should be mergeable. You could do this very quickly, and it would offset having to do a large, expensive dissolve.

@calebrob6
Collaborator

calebrob6 commented Oct 30, 2024

[hand-drawn example: one field polygon split by a horizontal chunk boundary, with the overlapping top-chunk result in yellow and the bottom-chunk result in blue]

Here's a crudely drawn example. In this case we can see a horizontal line pretty clearly that indicates a boundary from polygonizing two chunks. In my proposed method, the "top chunk" in this example would extend, say, 100 pixels down into the bottom chunk and vice versa. I've drawn what a resulting polygon would look like in yellow for the top chunk and blue for the bottom chunk.

I'm claiming that any pair of overlapping polygons between the set from the top and the set from the bottom can be merged (dissolved) to fix the issue. You can also quickly tell which polygons meet this criterion, because they will cross the horizontal line that divides the unbuffered chunks (or the vertical line in other cases). If this works it'd actually be nice to put in gdal_polygonize :) (FYI, this script is basically a gdal_polygonize wrapper at the moment, so it might be worth benchmarking against!)

@m-mohr
Collaborator

m-mohr commented Oct 30, 2024

Let's take this in multiple steps.

I've started to speed up the code without removing the windowed approach in #50 - See the PR for details. This should help with further work steps as it's just speedier.

I've started to realize that the potential issues that I saw in the files were due to the simplification, but if we resolve the split before simplification, I guess it could be fine. I'm not sure about the buffer approach as it may lead to other issues down the line, e.g. merging fields that don't belong together.

@calebrob6
Collaborator

I'm not sure about the buffer approach as it may lead to other issues down the line, e.g. merging fields that don't belong together.

I don't think so -- you are not buffering any of the polygons themselves, just the windows over which you are running polygonization.

I've started to realize that the potential issues that I saw in the files were due to the simplification, but if we resolve the split before simplification, I guess it could be fine

Simplification will cause the fields that are split up by the windowing to not be touching anymore.

@m-mohr
Collaborator

m-mohr commented Oct 30, 2024

Oh, I didn't realize that you meant to buffer the chunks only. Makes more sense then.

Yes, that's what I meant to say with regards to simplification.

@cholmes What does the following mean?

Logic for warnings specific to when you want to polygonize

This was linked to pull requests Oct 30, 2024
@cholmes
Member Author

cholmes commented Oct 30, 2024

@cholmes What does the following mean?

Logic for warnings specific to when you want to polygonize

That was copied from a note @calebrob6 put in Slack. I think it likely comes sorta naturally - originally polygonize was a flag on top of inference, so if you had an error in polygonize it would just bubble up with the 'inference' ones. So I take this item to mean just 'raise warnings for users'. Not sure if @calebrob6 had some particular things he wanted to warn about, but we can likely just check it off.

@m-mohr m-mohr reopened this Oct 30, 2024
@cholmes
Member Author

cholmes commented Oct 30, 2024

Let's close this out when #51 merges, and just break the remaining check marks out into their own issues.

@m-mohr
Collaborator

m-mohr commented Oct 30, 2024

@cholmes I've opened separate issues for most items, but I'm not sure whether we really need

@cholmes
Member Author

cholmes commented Oct 30, 2024

Alternative polygonization algorithms -> Isn't fixing #54 enough? We only really need another algorithm if we can't solve #54, I think.

I think the idea was that there could be more specialized approaches that are optimized for field boundaries, or even for how this dataset is produced. Indeed, there are likely even approaches that get polygons directly from inference / use ML in generating the polygons. I don't think we need to do anything now, just file an issue saying we may explore others in the future. @calebrob6 may have some specific ideas here.

Option to reproject? Like to put the output into EPSG:4326 -> Is this really needed or can we just provide a snippet that uses ogr2ogr for CRS transformations?

Yeah, probably not really needed. Having an example and just calling out that the output is going to be in a UTM projection would be good though.
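For the docs, a snippet along these lines would probably do (file names are placeholders; ogr2ogr takes the destination before the source):

```shell
# Reproject the polygonize output from its UTM zone to EPSG:4326 using GDAL's ogr2ogr
ogr2ogr -t_srs EPSG:4326 fields_4326.gpkg fields.gpkg
```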
