Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi-threaded raster strategy #161

Open
wants to merge 24 commits into
base: master
Choose a base branch
from

Conversation

shortcutman
Copy link

Adds a new strategy to exact extract named raster-parallel.

The strategy utilises oneAPI TBB to setup a parallel pipeline for finding intersecting features, reading raster data, performing zonal stats and merging stats for final output. Number of 'tokens' (TBB terminology, essentially maximum parallel tasks in flight) is controlled with the --tokens [number] command line argument.

Implementation

Prior to the parallel pathway, logic is the same as raster-sequential where all features are read in and an STR tree is created for doing intersection.

For the parallel pipeline:

  • A context per a subdivision of each raster utilised in an operation is created. Context starts with the subdivision and pointer to source Raster. This is a serial stage (as required for the first stage of a parallel pipeline in TBB).
  • For each of the above, performs query against the STR tree to find all features that intersect with the subdivision. This is a parallel stage for each context.
  • For each context that has intersecting features, reads the raster data in. This is a serial stage.
  • Performs zonal stat calculations on features and the raster data writing the results into its own StatsRegistry. This is a parallel stage for each context.
  • Merges context provided StatRegistry data into Processor owned StatRegistry. This is a serial stage.

Finally all features are written out with the same implementation as raster-sequential.

Parallel considerations

  • All raster read is done sequential but not necessarily the same thread. In testing this hasn't created any observed issues.
  • Tree intersection and zonal stats calculations require their own GEOS contexts in order to be parallelised. Incorrect stats were observed before this was fixed.
  • I haven't yet looked at the new GDAL raster read thread safety API and what other improvements that may provide.

Performance

Performing mean and count on ~1.6M polygons of Western Australia cadastral boundaries against ~25m square pixels of Australian agricultural land use data (with national coverage) done on Ryzen 7700 (8c/16t) with 32GB RAM:

  • feature-sequential elapsed time: 7m 42s, maximum memory usage: 2.224 GB
  • raster-sequential elapsed time: 2m 52s, maximum memory usage: 4.639 GB
  • raster-parallel with 4 simultaneous tokens, elapsed time: 52s, max memory usage: 5.653 GB
  • raster-parallel with 8 simultaneous tokens, elapsed time: 37s, max memory usage: 5.823 GB
  • raster-parallel with 12 simultaneous tokens, elapsed time: 36s, max memory usage: 5.961 GB

Other notes

All results were tested against raster-sequential outputs to control for any parallel bugs. Nothing major was observed other than occasional floating point errors at the end of its precision. I did not that multiple raster input to raster-sequential doesn't look like it is working as intended, but out of scope for this PR.

I didn't make any changes to the Python bindings or libs at this stage. I note that the actions continually fail on them but not sure why.

Welcome any comments, hope this is something that helps!

Copy link

codecov bot commented Jan 13, 2025

Codecov Report

Attention: Patch coverage is 72.41379% with 24 lines in your changes missing coverage. Please review.

Project coverage is 90.47%. Comparing base (18c2c68) to head (40f8512).
Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
src/stats_registry.cpp 0.00% 16 Missing ⚠️
src/raster_stats.h 79.31% 3 Missing and 3 partials ⚠️
src/exactextract.cpp 50.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #161      +/-   ##
==========================================
- Coverage   90.71%   90.47%   -0.24%     
==========================================
  Files          85       85              
  Lines        6664     6751      +87     
  Branches      628      636       +8     
==========================================
+ Hits         6045     6108      +63     
- Misses        587      608      +21     
- Partials       32       35       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dbaston
Copy link
Member

dbaston commented Jan 21, 2025

Thanks for this contribution! I've tested it with some other data locally and am seeing similar performance gains. I have a few questions/notes:

  • Should I open a PR against your fork to add some tests, etc? Or commit directly to your branch?

  • It looks like this doesn't handle weighted calculations, at least not in a thread-safe way. For now I think it makes sense to simply error out if a user requests weighted stats. Do you agree?

I did not that multiple raster input to raster-sequential doesn't look like it is working as intended, but out of scope for this PR.

Any details on this? Are you referring to weighted operations, or something else?

Number of 'tokens' (TBB terminology, essentially maximum parallel tasks in flight) is controlled with the --tokens [number] command line argument.

Any issues with calling this --threads ?

Performs zonal stat calculations on features and the raster data writing the results into its own StatsRegistry.

I don't expect it to be a major bottleneck, but do you see a reason not to have a StatsRegistry for each thread, instead of for each work unit?

Tree intersection and zonal stats calculations require their own GEOS contexts in order to be parallelised. Incorrect stats were observed before this was fixed.

Any chance you're using GEOS < 3.10?

I haven't yet looked at the new GDAL raster read thread safety API and what other improvements that may provide.

It's essentially handling the locking of the raster dataset to prevent multiple threads from reading at once. But it still may provide benefits here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants