
Vectorize coordinates._construct_face_centroids() #1117

Merged
erogluorhan merged 12 commits into main from optimize_face_centroids on Feb 5, 2025

Conversation

erogluorhan
Member

@erogluorhan erogluorhan commented Dec 21, 2024

Closes #1116.

Overview

Optimizes coordinates.py/_construct_face_centroids() by replacing the for-loop with a vectorized implementation built around a masking method, following an older suggestion from @hongyuchen1030.

Further optimization of the code is possible, but that can be handled in future PRs.

FYI @philipc2: The masking method this PR uses might be useful for cases we previously dealt with via partitioning, e.g. the nodal_averaging problem we looked into last year.

FYI @rajeeja: This is where the Cartesian-based centroid calculations are being fixed. Welzl is a bit different, but you may still want to look into it.
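
For reference, here is a rough sketch of what such a mask-based vectorized centroid computation can look like; the function name, the FILL_VALUE constant, and the exact masking details are illustrative assumptions, not necessarily the code in this PR:

import numpy as np

FILL_VALUE = -1

def face_centroids_masked(node_x, node_y, node_z, face_node_connectivity):
    # Valid entries are everything that is not the fill value.
    valid = face_node_connectivity != FILL_VALUE
    # Replace fill values with index 0 so fancy indexing stays in bounds;
    # the mask zeroes out their contribution afterwards.
    safe_conn = np.where(valid, face_node_connectivity, 0)
    counts = valid.sum(axis=1)

    def masked_mean(coord):
        return (coord[safe_conn] * valid).sum(axis=1) / counts

    cx, cy, cz = masked_mean(node_x), masked_mean(node_y), masked_mean(node_z)
    # Project the averaged points back onto the unit sphere.
    norm = np.sqrt(cx**2 + cy**2 + cz**2)
    return cx / norm, cy / norm, cz / norm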

Results

#1116 shows the profiling results for the existing code on a 4 GB SCREAM dataset, where the execution time is around 5 minutes.

The optimized code runs in around 6 seconds; see below:

(Screenshot: profiling output showing ~6-second execution time)

PR Checklist

General

  • An issue is created and linked
  • Add appropriate labels
  • Filled out Overview and Expected Usage (if applicable) sections

Testing

  • Tests cover all possible logical paths in your function

@rajeeja
Contributor

rajeeja commented Dec 21, 2024

Good fix; I will test Welzl and look into an appropriate fix there.

@philipc2 philipc2 added the run-benchmark Run ASV benchmark workflow label Dec 21, 2024
@philipc2 philipc2 added this to the Scalability & Performance milestone Dec 21, 2024

github-actions bot commented Dec 21, 2024

ASV Benchmarking

Benchmark Comparison Results

Benchmarks that have improved:

Change Before [fe4cae1] After [3b3d8e2] Ratio Benchmark (Parameter)
- 556M 445M 0.8 face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/geoflow-small/grid.nc'))
- 747M 441M 0.59 face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/quad-hexagon/grid.nc'))
- 7.70±0.08ms 9.31±0.8μs 0 face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/scrip/outCSne8/outCSne8.nc'))
- 6.51±0.1μs 5.67±0.04μs 0.87 mpas_ocean.ConnectivityConstruction.time_n_nodes_per_face('120km')
- 4.87±0.1ms 4.31±0.05ms 0.88 mpas_ocean.ConstructFaceLatLon.time_cartesian_averaging('120km')
- 55.2±0.7ms 1.27±0.01ms 0.02 mpas_ocean.GeoDataFrame.time_to_geodataframe('120km', True)
- 535M 422M 0.79 mpas_ocean.Integrate.peakmem_integrate('480km')

Benchmarks that have stayed the same:

Change Before [fe4cae1] After [3b3d8e2] Ratio Benchmark (Parameter)
453M 441M 0.97 face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/mpas/QU/oQU480.231010.nc'))
483M 470M 0.97 face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/scrip/outCSne8/outCSne8.nc'))
19.1±0.2ms 19.0±0.1ms 0.99 face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/mpas/QU/oQU480.231010.nc'))
44.3±0.6ms 44.3±0.5ms 1.00 face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/geoflow-small/grid.nc'))
4.16±0.1ms 4.01±0.2ms 0.96 face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/quad-hexagon/grid.nc'))
12.1±0.1s 12.3±0.1s 1.01 import.Imports.timeraw_import_uxarray
679±4μs 691±10μs 1.02 mpas_ocean.CheckNorm.time_check_norm('120km')
452±10μs 451±3μs 1.00 mpas_ocean.CheckNorm.time_check_norm('480km')
667±4ms 689±4ms 1.03 mpas_ocean.ConnectivityConstruction.time_face_face_connectivity('120km')
42.3±0.5ms 43.3±1ms 1.02 mpas_ocean.ConnectivityConstruction.time_face_face_connectivity('480km')
5.61±0.03μs 5.62±0.1μs 1.00 mpas_ocean.ConnectivityConstruction.time_n_nodes_per_face('480km')
3.01±0.09ms 3.00±0.03ms 1.00 mpas_ocean.ConstructFaceLatLon.time_cartesian_averaging('480km')
3.53±0.02s 3.57±0.03s 1.01 mpas_ocean.ConstructFaceLatLon.time_welzl('120km')
225±0.9ms 225±1ms 1.00 mpas_ocean.ConstructFaceLatLon.time_welzl('480km')
1.22±0μs 1.30±0.02μs 1.07 mpas_ocean.ConstructTreeStructures.time_ball_tree('120km')
304±2ns 312±5ns 1.03 mpas_ocean.ConstructTreeStructures.time_ball_tree('480km')
293±4ns 292±1ns 1.00 mpas_ocean.ConstructTreeStructures.time_kd_tree('480km')
failed failed n/a mpas_ocean.CrossSections.time_const_lat('120km', 1)
failed failed n/a mpas_ocean.CrossSections.time_const_lat('120km', 2)
failed failed n/a mpas_ocean.CrossSections.time_const_lat('120km', 4)
failed failed n/a mpas_ocean.CrossSections.time_const_lat('120km', 8)
failed failed n/a mpas_ocean.CrossSections.time_const_lat('480km', 1)
failed failed n/a mpas_ocean.CrossSections.time_const_lat('480km', 2)
failed failed n/a mpas_ocean.CrossSections.time_const_lat('480km', 4)
failed failed n/a mpas_ocean.CrossSections.time_const_lat('480km', 8)
failed failed n/a mpas_ocean.CrossSections.time_constant_lat_fast('120km', 1)
failed failed n/a mpas_ocean.CrossSections.time_constant_lat_fast('120km', 2)
failed failed n/a mpas_ocean.CrossSections.time_constant_lat_fast('120km', 4)
failed failed n/a mpas_ocean.CrossSections.time_constant_lat_fast('120km', 8)
failed failed n/a mpas_ocean.CrossSections.time_constant_lat_fast('480km', 1)
failed failed n/a mpas_ocean.CrossSections.time_constant_lat_fast('480km', 2)
failed failed n/a mpas_ocean.CrossSections.time_constant_lat_fast('480km', 4)
failed failed n/a mpas_ocean.CrossSections.time_constant_lat_fast('480km', 8)
123±0.4ms 123±1ms 1.00 mpas_ocean.DualMesh.time_dual_mesh_construction('120km')
8.88±0.4ms 8.76±0.3ms 0.99 mpas_ocean.DualMesh.time_dual_mesh_construction('480km')
1.08±0.01s 1.08±0.01s 1.00 mpas_ocean.GeoDataFrame.time_to_geodataframe('120km', False)
86.8±0.9ms 87.9±0.9ms 1.01 mpas_ocean.GeoDataFrame.time_to_geodataframe('480km', False)
5.62±0.1ms 5.85±0.2ms 1.04 mpas_ocean.GeoDataFrame.time_to_geodataframe('480km', True)
433M 433M 1.00 mpas_ocean.Gradient.peakmem_gradient('120km')
411M 411M 1.00 mpas_ocean.Gradient.peakmem_gradient('480km')
2.80±0.05ms 2.84±0.02ms 1.01 mpas_ocean.Gradient.time_gradient('120km')
315±2μs 325±3μs 1.03 mpas_ocean.Gradient.time_gradient('480km')
227±20μs 244±3μs 1.07 mpas_ocean.HoleEdgeIndices.time_construct_hole_edge_indices('120km')
119±2μs 120±2μs 1.01 mpas_ocean.HoleEdgeIndices.time_construct_hole_edge_indices('480km')
453M 444M 0.98 mpas_ocean.Integrate.peakmem_integrate('120km')
158±2ms 155±0.9ms 0.98 mpas_ocean.Integrate.time_integrate('120km')
11.2±0.2ms 11.5±0.07ms 1.03 mpas_ocean.Integrate.time_integrate('480km')
358±3ms 360±1ms 1.01 mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('120km', 'exclude')
363±4ms 356±2ms 0.98 mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('120km', 'include')
357±2ms 357±4ms 1.00 mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('120km', 'split')
23.7±0.3ms 24.0±0.6ms 1.01 mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('480km', 'exclude')
24.0±0.2ms 24.4±0.4ms 1.01 mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('480km', 'include')
24.0±0.4ms 23.9±0.5ms 0.99 mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('480km', 'split')
55.9±0.2ms 56.3±0.4ms 1.01 mpas_ocean.RemapDownsample.time_inverse_distance_weighted_remapping
45.4±0.2ms 45.8±0.4ms 1.01 mpas_ocean.RemapDownsample.time_nearest_neighbor_remapping
357±0.9ms 358±2ms 1.00 mpas_ocean.RemapUpsample.time_inverse_distance_weighted_remapping
261±0.7ms 261±0.7ms 1.00 mpas_ocean.RemapUpsample.time_nearest_neighbor_remapping
25.6±0.8ms 26.0±0.7ms 1.02 mpas_ocean.ZonalAverage.time_zonal_average('120km')
4.96±0.02ms 4.71±0.1ms 0.95 mpas_ocean.ZonalAverage.time_zonal_average('480km')
408M 407M 1.00 quad_hexagon.QuadHexagon.peakmem_open_dataset
408M 407M 1.00 quad_hexagon.QuadHexagon.peakmem_open_grid
6.74±0.2ms 6.59±0.01ms 0.98 quad_hexagon.QuadHexagon.time_open_dataset
5.64±0.04ms 5.50±0.06ms 0.97 quad_hexagon.QuadHexagon.time_open_grid

Benchmarks that have got worse:

Change Before [fe4cae1] After [3b3d8e2] Ratio Benchmark (Parameter)
+ 811±7ns 928±80ns 1.14 mpas_ocean.ConstructTreeStructures.time_kd_tree('120km')

@philipc2 philipc2 removed the run-benchmark Run ASV benchmark workflow label Dec 21, 2024
@philipc2
Member

Nice work @erogluorhan

It appears that a few ASV benchmarks are failing (which may or may not be due to the changes in this PR). I can look deeper into it on Monday.

@erogluorhan
Member Author

All sounds good, thanks to you both, @rajeeja @philipc2!

@erogluorhan
Member Author

I can't see any failing benchmarks though! @philipc2

@philipc2
Member

I can't see any failing benchmarks though! @philipc2

If you take a look at the GitHub bot report above, it shows which ones failed. If they failed both before and after, it means something was wrong before this PR.

Looks like this PR didn't break anything; it must have been an issue with how those benchmarks were written before.

@erogluorhan
Member Author

Ok, I see what you mean now, @philipc2 !

Also, regarding benchmarking, I'd expect mpas_ocean.ConstructFaceLatLon.time_cartesian_averaging('120km') to show the difference this PR makes with the optimization, but it does not. I think the dataset is pretty small for that, which makes me think benchmarking with larger data will be important in the future.

@philipc2 philipc2 added the run-benchmark Run ASV benchmark workflow label Dec 23, 2024
@philipc2
Member

Here is a minimal example that fails. Feel free to create a test case from this:

import numpy as np
import uxarray as ux

# Two faces of different sizes: a triangle padded with the fill value -1,
# and a quadrilateral.
node_lon = np.array([-20.0, 0.0, 20.0, -20, -40])
node_lat = np.array([-10.0, 10.0, -10.0, 10, -10])
face_node_connectivity = np.array([[0, 1, 2, -1], [0, 1, 3, 4]])

uxgrid = ux.Grid.from_topology(
    node_lon=node_lon,
    node_lat=node_lat,
    face_node_connectivity=face_node_connectivity,
    fill_value=-1,
)

uxgrid.construct_face_centers()
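
A possible pytest sketch built from this example; the face_lon / face_lat attribute names and the finiteness checks are assumptions on my part rather than something stated in this thread:

import numpy as np
import uxarray as ux

def test_face_centers_with_fill_values():
    # Mixed-size faces: a triangle padded with the fill value -1 and a quad.
    node_lon = np.array([-20.0, 0.0, 20.0, -20.0, -40.0])
    node_lat = np.array([-10.0, 10.0, -10.0, 10.0, -10.0])
    face_node_connectivity = np.array([[0, 1, 2, -1], [0, 1, 3, 4]])

    uxgrid = ux.Grid.from_topology(
        node_lon=node_lon,
        node_lat=node_lat,
        face_node_connectivity=face_node_connectivity,
        fill_value=-1,
    )
    uxgrid.construct_face_centers()

    # Assumed: face centers are exposed as face_lon / face_lat, one per face.
    assert uxgrid.face_lon.shape == (2,)
    assert np.all(np.isfinite(uxgrid.face_lon.values))
    assert np.all(np.isfinite(uxgrid.face_lat.values))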

@philipc2
Member

Consider this implementation with Numba.

import numpy as np
from numba import njit, prange

# _normalize_xyz_scalar is an existing helper in coordinates.py
# (it normalizes the averaged point back onto the unit sphere).

@njit(cache=True, parallel=True)
def _construct_face_centroids(node_x, node_y, node_z, face_nodes, n_nodes_per_face):
    centroid_x = np.zeros((face_nodes.shape[0]), dtype=np.float64)
    centroid_y = np.zeros((face_nodes.shape[0]), dtype=np.float64)
    centroid_z = np.zeros((face_nodes.shape[0]), dtype=np.float64)
    n_face = n_nodes_per_face.shape[0]

    # Each face averages only its first n_nodes_per_face[i_face] node indices,
    # which skips the fill values used to pad the connectivity array.
    for i_face in prange(n_face):
        n_max_nodes = n_nodes_per_face[i_face]

        x = np.mean(node_x[face_nodes[i_face, 0:n_max_nodes]])
        y = np.mean(node_y[face_nodes[i_face, 0:n_max_nodes]])
        z = np.mean(node_z[face_nodes[i_face, 0:n_max_nodes]])

        x, y, z = _normalize_xyz_scalar(x, y, z)

        centroid_x[i_face] = x
        centroid_y[i_face] = y
        centroid_z[i_face] = z
    return centroid_x, centroid_y, centroid_z

Without partitioning our arrays, it's really difficult to find an elegant vectorized solution directly using NumPy.

@rajeeja
Contributor

rajeeja commented Dec 23, 2024

Consider this implementation with Numba. [code quoted above] Without partitioning our arrays, it's really difficult to find an elegant vectorized solution directly using NumPy.

Yes, it was a sticky one; I left it for a later PR to optimize this.

@rajeeja
Contributor

rajeeja commented Jan 10, 2025

This also might use a bit more memory due to the creation of the mask array, but that should be fine even for larger grids, and the speedup should be pretty good given that it uses vectorized operations and broadcasting. It seems to work, and I'm okay with approving this.

@philipc2
Member

@erogluorhan

Let me know if you have any thoughts about the Numba implementation that I shared.

@rajeeja
Contributor

rajeeja commented Jan 11, 2025

@erogluorhan

Let me know if you have any thoughts about the Numba implementation that I shared.

I'm fine with the Numba suggestion as well; we'd have to modify this PR to use it, since we use Numba all over the project.

@philipc2
Member

@erogluorhan @rajeeja

With the parallel Numba implementation, I was able to obtain the following times on my local machine for a 3.75 km MPAS grid (84 million nodes, 42 million faces):

(Screenshot: timing results for the 3.75 km MPAS grid)

@erogluorhan
Member Author

erogluorhan commented Jan 16, 2025

@philipc2 thanks very much for catching the original issue in this PR and suggesting the Numba solution!

Also thank you and @rajeeja for the convo you kept running here!

Please see some considerations of mine below and let me know your thoughts on each of them, mostly for the future. For the scope of this PR, I am fine with the current Numba version; please feel free to review it as is:

  1. Even the ~5 seconds with the method I tried, as well as the current execution time of ~8 seconds (on Philip's local machine), is sub-ideal for the long-term performance of this function. We need to bring it to a better spot (say around a second or two) for this setup (4 GB SCREAM), because when the user works with time-series data with several time steps, perhaps on an even finer-resolution grid, the performance should still be acceptable.

  2. Speaking of time series, the user will likely use chunking schemes, e.g. each time step being a separate chunk. In such scenarios, we need to be sure Numba and Dask play well together.

  3. While the current Numba implementation seems to have decent performance for now, we saw that vectorization would give us even better performance with my original attempt (if only we didn't have the masking issue with the nonuniform connectivity array), and I thought further parts of the code could be optimized that way in the future. That said, partitioning the connectivity as Philip attempted in DRAFT: API for accessing partitioned connectivity, reworked topological aggregations #978 still seems like the best bet to me for achieving vectorized performance. @philipc2, was there any difficulty in that code that made you stop and close that PR?

Thoughts?

@philipc2
Member

Hi @erogluorhan

Even the ~5 seconds with the method I tried, as well as the current execution time of ~8 seconds (on Philip's local machine), is sub-ideal for the long-term performance of this function. We need to bring it to a better spot (say around a second or two) for this setup (4 GB SCREAM). [...]

This is a one-time execution, so I personally think a couple of seconds is acceptable, especially for grids with millions of elements. The comparison is also not direct, as I ran my benchmark on a 3.75km MPAS grid (~84 million nodes). Do you recall the size of the SCREAM grid?

Speaking of time series, the user will likely use chunking schemes, e.g. each time step being a separate chunk. In such scenarios, we need to be sure Numba and Dask play well together.

The centroid computation here is independent of any time-series data, since this is tied to the Grid and not a UxDataset.

While the current Numba implementation seems to have decent performance for now, we saw that vectorization would give us even better performance with my original attempt (if only we didn't have the masking issue with the nonuniform connectivity array), and I thought further parts of the code could be optimized that way in the future.

While I agree the speed looks significantly better with the masked approach, it only appears that way because of the fixed dimension. While partitioning, as initially explored in #978, may improve our execution time, that PR has been stale for a while and I plan to revisit it in the near future. However, for our needs, especially with the face bounds optimization, Numba has proven to be an excellent choice for obtaining significant performance gains.

There was some discussion about Numba in #1124

Pinging @Huite if he has anything he'd like to comment on.

@Huite

Huite commented Jan 16, 2025

Happy to comment!

You can easily circumvent the fill values for floating point data by utilizing NaNs and the relevant numpy methods.
E.g. index with the -1 value, then overwrite those values with NaN, then use the appropriate NaN-aware function to filter.

Vectorized

# Indexing with -1 picks up the last node, but those entries are overwritten
# with NaN and then ignored by nanmean (node_x must be a float array).
face_nodes_x = node_x[face_node_connectivity]
face_nodes_x[face_node_connectivity == -1] = np.nan
centroid_x = np.nanmean(face_nodes_x, axis=1)

This allocates two intermediate arrays (face_nodes_x and the boolean mask). That isn't too bad, although I'm fairly confident numpy doesn't parallelize these operations.

Numba

With regards to numba, you're probably leaving quite some performance on the table:

 x = np.mean(node_x[face_nodes[i_face, 0:n_max_nodes]])

Note that node_x[face_nodes[i_face, 0:n_max_nodes]] continuously allocates two small (heap-allocated) arrays. My guess is that this adds quite a bit of overhead, because it's right in the hottest part of the code. Ideally you pre-allocate and reuse the memory instead. The parallelization with prange makes this slightly trickier, though.

You can always write your own non-allocating mean, e.g. something like:

def mean(face, node_x):
    # Sum the node coordinates of one face, stopping at the fill value.
    accum = 0.0
    count = 0
    for index in face:
        if index == -1:
            break
        accum += node_x[index]
        count += 1
    return accum / count

This might not have optimal performance either, because memory is accessed all over the place with node_x[index], which probably causes cache misses (depending on the size of node_x).

You want to do indexing and computation of mean in parallel. The mean ideally operates over a contiguous piece of memory.
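
A minimal sketch of that idea with Numba, assuming a padded face_nodes connectivity array with -1 as the fill value (the function name and loop structure here are illustrative, not the PR's code):

import numpy as np
from numba import njit, prange

@njit(cache=True, parallel=True)
def _face_means(node_x, face_nodes):
    # Accumulate each face's mean in scalars so no per-face arrays are allocated.
    n_face = face_nodes.shape[0]
    out = np.zeros(n_face, dtype=np.float64)
    for i_face in prange(n_face):
        accum = 0.0
        count = 0
        for j in range(face_nodes.shape[1]):
            index = face_nodes[i_face, j]
            if index == -1:
                break
            accum += node_x[index]
            count += 1
        out[i_face] = accum / count
    return out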

Centroid versus center-of-mass

But something else to consider is whether the mean is really what you're after.

I compute a mean for triangles, but use a center of mass approach otherwise. These are normally equivalent for convex polygons. The reason (which I should've commented in the code...!) is that there are quite a number of grids where there's stuff like hanging nodes (in quadtree meshes e.g.).

See this sketch in paint.

(Sketch: a quadrilateral face with a hanging node 3 on one edge; the desired center is marked with an x)

Node 3 is hanging. My expectation is that the user wants to get the spot marked by the x (which is the center). But if you compute the mean of the vertices, node 3 is going to "pull" the centroid towards the right, which is undesirable.

You can avoid this by using a center of mass approach instead, as I do:

https://github.com/Deltares/xugrid/blob/fc0bdcb0eaeff62887f1e62fadd61957629dff61/xugrid/ugrid/connectivity.py#L584

To deal with the fill values, I "close" the faces with close_polygons (as shapely calls it, I add another column to the face nodes, and I replace the fill values by the first node). The center of mass can then be computed easily, because the fill values will result in zero-area weights (thereby discounting their value).

My method generates relatively many temporary numpy arrays, but you could easily avoid those with fully scalar operations in numba.
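
A rough 2D vectorized sketch of that center-of-mass idea; the close_polygons helper and the triangle-fan area weighting below are my own reading of the description above, not the xugrid code itself:

import numpy as np

FILL_VALUE = -1

def close_polygons(face_nodes):
    # Append the first column and replace fill values by each face's first
    # node, so the padded edges collapse to zero length.
    closed = np.column_stack([face_nodes, face_nodes[:, 0]])
    first = np.broadcast_to(closed[:, [0]], closed.shape)
    return np.where(closed == FILL_VALUE, first, closed)

def face_center_of_mass(node_x, node_y, face_nodes):
    closed = close_polygons(face_nodes)
    x = node_x[closed]  # shape: (n_face, n_max_nodes + 1)
    y = node_y[closed]

    # Triangle fan from each face's first vertex: signed areas weight the
    # triangle centroids, and degenerate (fill-value) triangles weigh zero.
    x0, y0 = x[:, [0]], y[:, [0]]
    xa, ya = x[:, 1:-1] - x0, y[:, 1:-1] - y0
    xb, yb = x[:, 2:] - x0, y[:, 2:] - y0
    area = 0.5 * (xa * yb - xb * ya)
    cx = x0 + (xa + xb) / 3.0
    cy = y0 + (ya + yb) / 3.0

    total = area.sum(axis=1)
    return (area * cx).sum(axis=1) / total, (area * cy).sum(axis=1) / total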

My general philosophy is to rely on a vectorized solution first and to use numba only when there's no other way, primarily to keep JIT latency down. Even though caching limits the compilation cost to the first use, CI runners, for example, will not be able to utilize the caches. But it's debatable at which point numba becomes superior; I have 16 cores on my laptop, so numba's prange will give huge speed-ups compared to the vectorized approach...

@erogluorhan
Member Author

erogluorhan commented Feb 5, 2025

Hi @philipc2 @Huite thanks for all the contributions, and sorry for my delayed response!

I went over and profiled all the options last night, and the Numba-decorated for-loop seems to perform the best. On my M2 laptop, it takes only ~2.8 seconds, the vectorized option (which we couldn't go with anyway, given the issue @philipc2 pointed out earlier) takes ~3.7 seconds, and the np.nanmean option takes the longest at ~4.7 seconds.

I'd like to move forward with the Numba option (at least for now, until we potentially get improvements from storing connectivity as partitioned arrays in #978), but I also want to respond to a few comments above:

The centroid computation here is independent of any time-series data, since this is tied to the Grid and not a UxDataset.

@philipc2, of course! I think I confused myself because I was simultaneously looking into a time-series problem at the time.

Centroid versus center-of-mass

But something else to consider is whether the mean is really what you're after.

I compute a mean for triangles, but use a center of mass approach otherwise. These are normally equivalent for convex polygons. The reason (which I should've commented in the code...!) is that there are quite a number of grids where there's stuff like hanging nodes (in quadtree meshes e.g.).

See this sketch in paint.

@Huite, this is a great suggestion, thanks! Let's consider creating an issue to address this sometime after this PR. Please let me know if you'd like to create that yourself; otherwise, I am happy to do so.

@erogluorhan erogluorhan requested a review from philipc2 February 5, 2025 16:29
@erogluorhan
Member Author

@rajeeja you had already approved this before for the previous plan, and now that we've switched to Numba, please feel free to re-review or not.

@philipc2
Member

philipc2 commented Feb 5, 2025

Below are some profiling results for a 7.5 km MPAS grid on my local M1 MacBook. Excellent work!

(Screenshot: profiling results for the 7.5 km MPAS grid)

@philipc2 philipc2 left a comment
Member

Nice work!

@erogluorhan
Member Author

Thanks a lot for your contributions!

@erogluorhan erogluorhan merged commit f5f5a1b into main Feb 5, 2025
16 checks passed
@erogluorhan erogluorhan deleted the optimize_face_centroids branch February 5, 2025 21:15
Labels
run-benchmark Run ASV benchmark workflow
Development

Successfully merging this pull request may close these issues.

Optimize Face Centroid Calculations
5 participants