Polygonize potential improvements #27
@m-mohr - could you take on a chunk of these, and then we can make additional issues for the ones you don't do? Top one to me is fiboa / GeoParquet output. We could do JSON too in fiboa, and then make GeoPackage 'fiboa-like' by default. And make it so that if someone says `-o fields.parquet` then it just automatically does fiboa and GeoParquet.
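For illustration, that extension-based dispatch could look something like the following - a hypothetical sketch, not the project's actual CLI code; the function name and format labels are made up:

```python
# Hypothetical sketch: infer the output format from the -o file extension.
from pathlib import Path

def infer_output_format(path: str) -> str:
    suffix = Path(path).suffix.lower()
    return {
        ".parquet": "fiboa-geoparquet",  # e.g. `-o fields.parquet`
        ".json": "fiboa-json",
        ".geojson": "fiboa-json",
        ".gpkg": "geopackage",           # 'fiboa-like' attributes by default
    }.get(suffix, "geopackage")
```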
@cholmes How long should this take following the README instructions? The top one for sure should probably be the algorithm. This thing is so slow due to O(n³) complexity that I can't even get it to properly print the status to the terminal on my machine. It always stops printing the status at some point and the machine is busy forever. That's also the most complex issue, but it doesn't make sense to do anything else before this works better.
What takes O(n³) complexity?
The three for loops that are nested in the code :-)
Hmmm, these aren't all n -- the first two for loops iterate over chunks of the input raster, and the third nested for loop iterates over the number of contiguous blocks of 1s in the chunk. The whole thing takes time that depends on the number of contiguous 1s. rasterio's shapes(..) docs say as much -- https://rasterio.readthedocs.io/en/latest/api/rasterio.features.html#rasterio.features.shapes -- "Instead of running …"
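For context, a minimal sketch of the windowed polygonization pattern being discussed - assuming a single-band binary mask; the file name and window size are illustrative, not the script's actual values:

```python
# Minimal sketch of windowed polygonization with rasterio.features.shapes.
import rasterio
import rasterio.features
import rasterio.windows
from shapely.geometry import shape

polygons = []
with rasterio.open("inference_mask.tif") as src:  # hypothetical file name
    size = 2048  # illustrative window edge length in pixels
    for row_off in range(0, src.height, size):      # loop 1: window rows
        for col_off in range(0, src.width, size):   # loop 2: window columns
            window = rasterio.windows.Window(
                col_off, row_off,
                min(size, src.width - col_off),
                min(size, src.height - row_off),
            )
            data = src.read(1, window=window)
            transform = rasterio.windows.transform(window, src.transform)
            # loop 3: one geometry per contiguous region in this window, so
            # its length depends on the data, not on the raster size n
            for geom, value in rasterio.features.shapes(data, transform=transform):
                if value == 1:
                    polygons.append(shape(geom))
```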
You are right, my bad. Nevertheless, it's slow ;-) And for whatever reason, at some point the CLI progress bar stops updating for me (before 100%). So I'm trying to fix/optimize it - not sure how yet...
@calebrob6 I just looked at one of the polygonize outputs. Are these lines that split the boundaries a result of the optimizations for the progress reporting? That seems not ideal. They appear pretty regularly in the file.
Yep -- I agree, not ideal.
We talked about it in Slack and it's on the list above (I just edited it to make it a bit clearer, it was a bit buried) - we should merge those back. As for performance, we can remove the reprojection: @calebrob6 pointed out that we shouldn't need it for area calculations if projections are UTM, and at this point we can assume everything is S2 input. The reprojection definitely made things slower. But I didn't find it excessively slow, and it would always complete for me, so it sounds like something else may be going wrong.
I don't think you can easily merge them back. You already see some examples above where it's not really clear whether they are separate fields or were split due to the window. Also, some shapes still look different even when merged... I think we unfortunately need to get rid of the windows and just work on the full thing. That will remove the progress bar for the first minutes, but that's not a good enough reason to return suboptimal geometries, I think. I have an alternative algorithm in mind, will try that out, and I'll also try to optimize the number of operations for each shape. Working on this in #49
I think the main optimization will be to sieve / preprocess the raster input (i.e. the model output) that goes into the polygonization step. A 3x3 or 5x5 mode filter may help (but may also merge nearby fields that are separated by a single pixel). Another thought I had (and the reason why I chunked it in the first place) was to run rasterio.features.shapes in parallel for different input chunks. To partially offset the boundary issues we are seeing with the chunked approach, we could buffer each chunk by X pixels (i.e. run on overlapping windows), then merge the results intelligently before simplifying. I haven't thought about it too closely, but any polygons that cross the border between chunks (disregarding the buffer in this case) should be mergeable. You could do this very quickly, and it would offset having to do a large, expensive dissolve.
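A rough sketch of the preprocessing idea, assuming a 2D uint8 mask of 0s and 1s; the majority threshold and sieve size are illustrative, not tuned values from this project:

```python
# Sketch: 3x3 majority (mode) filter plus rasterio's sieve to drop speckle
# before polygonizing. Thresholds here are illustrative only.
import numpy as np
import rasterio.features
from scipy import ndimage

def clean_mask(mask: np.ndarray) -> np.ndarray:
    """mask: 2D uint8 array of 0s and 1s (model output)."""
    # 3x3 majority vote: a pixel becomes 1 if >= 5 of the 9 pixels around it are 1.
    counts = ndimage.convolve(
        mask.astype(np.uint8), np.ones((3, 3), dtype=np.uint8), mode="nearest"
    )
    smoothed = (counts >= 5).astype(np.uint8)
    # Drop connected regions smaller than ~10 pixels (the trade-off noted above:
    # this can also erase or merge very small fields).
    return rasterio.features.sieve(smoothed, size=10)
```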
Here's a crudely drawn example. In this case we can see a horizontal line pretty clearly that indicates a boundary from polygonizing two chunks. In my proposed method, the "top chunk" in this example would extend, say, 100 pixels down into the bottom chunk and vice versa. I've drawn what a resulting polygon would look like in yellow for the top chunk and blue for the bottom chunk. I'm claiming that any pair of overlapping polygons between the set from the top and the set from the bottom can be merged (dissolved) to fix the issue. You further know which polygons fit this criterion quickly, because they will cross the horizontal line that divides the unbuffered chunks (or the vertical line in other cases). If this works, it'd actually be nice to put it in gdal_polygonize :) (FYI, this script is basically a gdal_polygonize wrapper at the moment, so it might be worth benchmarking against!)
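A hedged sketch of that overlap-and-merge step, assuming shapely geometries from two vertically adjacent, mutually overlapping windows; all names here are hypothetical:

```python
from shapely.geometry import LineString
from shapely.ops import unary_union

def merge_across_seam(top_polys, bottom_polys, seam_y, x_min, x_max):
    """Merge polygons from two overlapping chunks that cross the seam between them."""
    seam = LineString([(x_min, seam_y), (x_max, seam_y)])
    # Polygons that don't touch the seam are unaffected by the split; keep them.
    # (Polygons lying wholly inside the overlap strip would still need
    # de-duplication; omitted in this sketch.)
    keep = [p for p in top_polys + bottom_polys if not p.intersects(seam)]
    # Any polygon crossing the seam overlaps its counterpart from the other
    # chunk (thanks to the buffer), so dissolving them pairs them back up.
    crossing = [p for p in top_polys + bottom_polys if p.intersects(seam)]
    merged = []
    if crossing:
        dissolved = unary_union(crossing)
        merged = (list(dissolved.geoms)
                  if dissolved.geom_type == "MultiPolygon" else [dissolved])
    return keep + merged
```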
Let's take this in multiple steps. I've started to speed up the code without removing the windowed approach in #50 - see the PR for details. This should help with further work, as it's just speedier. I've started to realize that the potential issues I saw in the files were due to the simplification, but if we resolve the split before simplification, I guess it could be fine. I'm not sure about the buffer approach, as it may lead to other issues down the line, e.g. merging fields that don't belong together.
I don't think so -- you are not buffering any of the polygons themselves, just the windows over which you are running polygonization. Simplification will cause the fields that are split up by the windowing to not be touching anymore.
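In other words, the dissolve has to happen before simplification. A tiny shapely illustration with a hypothetical pair of split halves:

```python
from shapely.geometry import box
from shapely.ops import unary_union

# Two halves of one field, split along x = 10 by the windowing (hypothetical).
left_half = box(0, 0, 10, 10)
right_half = box(10, 0, 20, 10)

# Dissolve first, then simplify: the seam is removed before any vertex moves.
good = unary_union([left_half, right_half]).simplify(1.0)

# Simplify first, then dissolve: with real, jagged boundaries the shared edge
# can be simplified differently on each side, so the halves no longer touch
# and the union fails to produce a single polygon.
risky = unary_union([left_half.simplify(1.0), right_half.simplify(1.0)])
```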
Oh, I didn't realize that you meant to buffer the chunks only. That makes more sense then. Yes, that's what I meant to say with regards to simplification. @cholmes What does the following mean?
That was copied from a note @calebrob6 put in Slack. I think it likely comes sorta naturally - originally polygonize was a flag on top of inference, so it meant that if you had an error in polygonize it would just bubble up with the 'inference' ones. So I take this item to mean just 'raise warnings for users'. Not sure if @calebrob6 had particular things he wanted to warn about, but we can likely just check.
Let's close this out when #51 merges, and just break the remaining check marks out into their own issues. |
@cholmes I've opened separate issues for most items, but I'm not sure whether we really need …
I think the idea was that there could be more specialized approaches that are optimized towards field boundaries, or even towards how this dataset is produced. Indeed, there are likely even approaches that try to get polygons directly from inference / use ML in the generation of the polygons. I don't think we need to do anything now, just open an issue to say that we may explore others in the future. @calebrob6 may have some specific ideas here.
Yeah, probably not really needed. Having an example and just calling out that the output is going to be in a UTM projection would be good though. |
These should perhaps be broken out into their own issues, but a number of them could likely be done in one chunk of work for the 'polygonize' command proposed in #19 (these suggestions are currently based on #25, where it is its own command).
- [ ] Fiboa output - determination_method = auto-imagery, plus ideally pick up determination_datetime from input files (depends on "include time information in the input raster format for inference" #26) -> Polygonize: Create fiboa compliant output #53
- [ ] Whether to use fiboa or just write minimal attribute fields (default to using fiboa) -> Polygonize: Create fiboa compliant output #53
- [ ] Control over the window size for the grid? -> Polygonize: Geometries split due to windowed approach #54
- [ ] Get rid of the "arbitrarily split" polygons from the grid step (i.e. dissolve and split into single parts as a very final step). Potentially expose as an option. -> Polygonize: Geometries split due to windowed approach #54
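For reference, a loose sketch of what writing the two fiboa attributes named above could look like with GeoPandas - only determination_method and determination_datetime come from this thread; real fiboa output also needs the spec's collection metadata, which is omitted here, and the CRS, geometry, and datetime values below are placeholders:

```python
import geopandas as gpd
from shapely.geometry import box

# Placeholder: stands in for the field geometries produced by polygonization.
polygons = [box(500000, 5200000, 500100, 5200100)]  # UTM metres, hypothetical

gdf = gpd.GeoDataFrame(
    {
        "determination_method": ["auto-imagery"] * len(polygons),
        # Ideally read from the input raster's time metadata (see #26);
        # placeholder value here.
        "determination_datetime": ["2023-06-01T00:00:00Z"] * len(polygons),
    },
    geometry=polygons,
    crs="EPSG:32632",  # hypothetical UTM zone; output stays in UTM per the thread
)
gdf.to_parquet("fields.parquet")  # requires pyarrow
```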