Dissolve using dask-geopandas #313

Open
emaildipen opened this issue Oct 21, 2024 · 4 comments

@emaildipen

emaildipen commented Oct 21, 2024

I have around 200k polygons in a shapefile, and I want to dissolve the polygons that are connected to each other. ArcGIS offers simple techniques to achieve this, but I was wondering if there are quicker ways to do it. I’ve tried the following but it took ages to execute.

import dask_geopandas as dd

# Read the shapefile
ddf = dd.read_file(input_shapefile, npartitions=10)


# Dissolve polygons that are connected with each other
ddf['dissolve'] = 1  # Create a dummy column for dissolving
dissolved_gdf = ddf.dissolve('dissolve', split_out=11, sort=False)
    
# Explode the dissolved multipolygon into individual polygons and reset index
dissolved_gdf = dissolved_gdf.explode().reset_index(drop=True)
    
# Add an index column
dissolved_gdf['index'] = dissolved_gdf.index
    
dissolved_gdf.compute().to_file(output_shapefile_filled, use_arrow=True)

@martinfleis
Member

Do you need dask-geopandas? Because if you are fine with vanilla geopandas, it will be much easier. And 200k should be perfectly fine.

You need to identify connected components and dissolve by a component label. That is tricky in a distributed setting, but in a single GeoDataFrame it is easy with the help of libpysal (or scipy alone).

from libpysal import graph

comp_label = graph.Graph.build_contiguity(gdf, rook=False).component_labels

gdf.dissolve(comp_label)

If you know that you have a correct polygonal coverage, you can even use the much faster coverage union.

gdf.dissolve(comp_label, method="coverage")
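
For the scipy-only route mentioned above, a minimal sketch might look like the following. It assumes geopandas 1.0 or later (where `sindex.query` accepts an array of geometries) and treats any two intersecting polygons as connected. The helper name `label_connected_polygons`, the `component` column, and the toy boxes are illustrative and not part of the thread.

```python
import geopandas as gpd
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from shapely.geometry import box


def label_connected_polygons(gdf):
    """Return one integer label per row; intersecting polygons share a label."""
    n = len(gdf)
    # Positional index pairs (i, j) where geometry i intersects geometry j.
    # Self-pairs are included, which is harmless for component labelling.
    left, right = gdf.sindex.query(gdf.geometry, predicate="intersects")
    adjacency = coo_matrix((np.ones(len(left)), (left, right)), shape=(n, n))
    _, labels = connected_components(adjacency, directed=False)
    return labels


# Toy data (hypothetical): two boxes sharing an edge plus one isolated box.
gdf = gpd.GeoDataFrame(geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(5, 5, 6, 6)])
labels = label_connected_polygons(gdf)

# Dissolve by the component label, as in the libpysal version.
dissolved = gdf.assign(component=labels).dissolve("component")
print(len(dissolved))
```

Using `predicate="intersects"` roughly corresponds to queen contiguity (`rook=False`), but it also links overlapping polygons, so the labels may differ from the libpysal result on data that is not a clean coverage.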

@emaildipen
Author

Thanks! Yes, I do need Dask since I’ll be processing millions of polygons. I added map_partitions to my function, and it worked. However, now the problem is that it’s taking a long time to transfer it to a GeoPandas DataFrame.

@martinfleis
Member

map_partitions will work only if you ensure that a single component always lies within a single partition. If a component stretches across multiple partitions, the approach will not work.
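
To illustrate the caveat, here is a small sketch that simulates two partitions with plain `iloc` slices instead of a real dask-geopandas frame (an assumption made to keep the example self-contained). The helper `dissolve_touching` is hypothetical; it unions everything and splits the result into parts, similar to the dissolve-then-explode approach in the original post.

```python
import geopandas as gpd
import shapely
from shapely.geometry import box


def dissolve_touching(part):
    """Union all geometries in `part` and split the result into single parts."""
    merged = shapely.union_all(part.geometry.to_numpy())
    return gpd.GeoDataFrame(geometry=list(shapely.get_parts(merged)), crs=part.crs)


# Four unit boxes in a row: together they form one connected component.
gdf = gpd.GeoDataFrame(geometry=[box(i, 0, i + 1, 1) for i in range(4)])

# Dissolving the whole frame yields a single polygon.
whole = dissolve_touching(gdf)

# Dissolving each simulated partition separately cuts the component in two.
pieces = [dissolve_touching(gdf.iloc[:2]), dissolve_touching(gdf.iloc[2:])]
n_split = sum(len(p) for p in pieces)

print(len(whole), n_split)
```

The per-partition result contains more polygons than the global one because the component straddles the partition boundary, which is the situation described above.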

@phofl

phofl commented Jan 2, 2025

Re your runtime (can't comment on what needs to be in a single partition):

Could you try creating a cluster before you call compute? That should help with parallelising things, i.e.

from distributed import Client

client = Client(n_workers=your_number_of_cores)
print(client.dashboard_link)

That will properly parallelize things, and the URL is helpful for observing what's going on.
