Vectorize coordinates._construct_face_centroids()
#1117
Conversation
Good fix; I will test Welzl and figure out the appropriate fix there.
ASV Benchmarking: Benchmark Comparison Results

Benchmarks that have improved:
Benchmarks that have stayed the same:
Benchmarks that have got worse:
Nice work @erogluorhan! It appears that a few ASV benchmarks are failing (which may or may not be due to the changes in this PR). I can look into it more deeply on Monday.
I can't see any failing benchmarks though! @philipc2
If you take a look at the GitHub bot above, it shows which ones failed. If they failed both before and after, it means something was wrong before this PR. It looks like this PR didn't break anything; it must have been an issue with how the benchmark was written before.
Ok, I see what you mean now, @philipc2! Also, re benchmarking, I'd expect
Here is a minimal example that fails. Feel free to create a test case from this:

```python
import numpy as np
import uxarray as ux

node_lon = np.array([-20.0, 0.0, 20.0, -20, -40])
node_lat = np.array([-10.0, 10.0, -10.0, 10, -10])
face_node_connectivity = np.array([[0, 1, 2, -1], [0, 1, 3, 4]])

uxgrid = ux.Grid.from_topology(
    node_lon=node_lon,
    node_lat=node_lat,
    face_node_connectivity=face_node_connectivity,
    fill_value=-1,
)

uxgrid.construct_face_centers()
```
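For context on why padded connectivity needs care when vectorizing, here is a small self-contained sketch (my own illustration with made-up coordinates, not necessarily the exact failure in this PR): plain fancy indexing treats the fill value -1 as a real index (the last node), so a padded face silently averages in a node it does not contain.

```python
import numpy as np

node_x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
# The first face is a triangle, padded with the fill value -1 in its last slot.
face_node_connectivity = np.array([[0, 1, 2, -1], [0, 1, 3, 4]])

# NumPy interprets -1 as "last element", so the padded face picks up node 4.
face_nodes_x = node_x[face_node_connectivity]
print(face_nodes_x)               # [[0. 1. 2. 4.] [0. 1. 3. 4.]]
print(face_nodes_x.mean(axis=1))  # first mean is 1.75 instead of the true 1.0
```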
Consider this implementation with Numba:

```python
@njit(cache=True, parallel=True)
def _construct_face_centroids(node_x, node_y, node_z, face_nodes, n_nodes_per_face):
    centroid_x = np.zeros((face_nodes.shape[0]), dtype=np.float64)
    centroid_y = np.zeros((face_nodes.shape[0]), dtype=np.float64)
    centroid_z = np.zeros((face_nodes.shape[0]), dtype=np.float64)
    n_face = n_nodes_per_face.shape[0]

    for i_face in prange(n_face):
        n_max_nodes = n_nodes_per_face[i_face]

        x = np.mean(node_x[face_nodes[i_face, 0:n_max_nodes]])
        y = np.mean(node_y[face_nodes[i_face, 0:n_max_nodes]])
        z = np.mean(node_z[face_nodes[i_face, 0:n_max_nodes]])

        x, y, z = _normalize_xyz_scalar(x, y, z)

        centroid_x[i_face] = x
        centroid_y[i_face] = y
        centroid_z[i_face] = z

    return centroid_x, centroid_y, centroid_z
```

Without partitioning our arrays, it's really difficult to find an elegant vectorized solution directly using NumPy.
Yes, it was a sticky one; I left optimizing this for a later PR.
This also might use a bit more memory due to the creation of the mask array, but that should be fine for larger grids, and the speedup would be pretty good considering it uses vectorized operations and broadcasting. Seems to work, and I'm okay with approving this.
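For illustration, here is a minimal sketch of the kind of masked, broadcast-based averaging being described; the function name, the -1 fill value, and the final re-normalization step are assumptions on my part, and the actual PR code may differ.

```python
import numpy as np

def masked_face_centroids(node_x, node_y, node_z, face_node_connectivity, fill_value=-1):
    # Gather per-face node coordinates, then blank out padded slots with NaN
    # so that nanmean ignores them when averaging each face.
    mask = face_node_connectivity == fill_value
    fx = np.where(mask, np.nan, node_x[face_node_connectivity])
    fy = np.where(mask, np.nan, node_y[face_node_connectivity])
    fz = np.where(mask, np.nan, node_z[face_node_connectivity])

    cx = np.nanmean(fx, axis=1)
    cy = np.nanmean(fy, axis=1)
    cz = np.nanmean(fz, axis=1)

    # Project the averaged points back onto the unit sphere (assumed convention here).
    norm = np.sqrt(cx**2 + cy**2 + cz**2)
    return cx / norm, cy / norm, cz / norm
```

The boolean mask and the gathered coordinate arrays are the extra allocations mentioned above; the trade-off is memory for fully vectorized NumPy operations.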
Let me know if you have any thoughts about the Numba implementation that I shared.
I'm fine with the Numba suggestion as well; we can modify this PR to use that. We use Numba all over the project.
With the parallel Numba implementation, I was able to obtain the following times on my local machine for a 3.75 km MPAS grid (84 million nodes, 42 million faces):
@philipc2 thanks very much for catching the original issue in this PR and suggesting the Numba solution! Also, thank you and @rajeeja for the conversation you kept running here! Please see some considerations of mine below and let me know your thoughts about each of them; they are mostly for the future. At least for the scope of this PR, I am fine with the current Numba version, so you all feel free to review this version:
Thoughts?
Hi @erogluorhan,

This is a one-time execution, so I personally think a couple of seconds is acceptable, especially for grids with millions of elements. The comparison is also not direct, as I ran my benchmark on a 3.75 km MPAS grid (~84 million nodes). Do you recall the size of the SCREAM grid?

The centroid computation here is independent of any time-series data, since this is tied to the

While I agree the speed is significantly better in the masked approach, it only appears that way due to the fixed dimension. While partitioning as initially explored in #978 may improve our execution time, that PR has been stale for a while and I plan to revisit it in the near future. However, for our needs, especially with the face bounds optimization, Numba has proven to be an excellent choice for obtaining significant performance gains. There was some discussion about Numba in #1124. Pinging @Huite in case he has anything he'd like to comment on.
Happy to comment! You can easily circumvent the fill values for floating-point data by utilizing NaNs and the relevant numpy methods.

Vectorized

```python
face_nodes_x = node_x[face_node_connectivity]
face_nodes_x[face_node_connectivity == -1] = np.nan
centroid_x = np.nanmean(face_nodes_x, axis=1)
```

This allocates two intermediate arrays (the face_nodes_x and the boolean mask). That isn't too bad, although I'm fairly confident numpy doesn't parallelize these operations.

Numba

With regards to numba, you're probably leaving quite some performance on the table:

```python
x = np.mean(node_x[face_nodes[i_face, 0:n_max_nodes]])
```

Note that the fancy indexing here allocates a temporary array for every face. You can always write your own non-allocating mean:

```python
def mean(face, node_x):
    accum = 0.0
    count = 0
    for index in face:
        if index == -1:
            break
        accum += node_x[index]
        count += 1
    return accum / count
```

This might not have optimal performance either, because memory is accessed all over the place with the indirect indexing. You want to do the indexing and the computation of the mean in parallel. The mean ideally operates over a contiguous piece of memory.

Centroid versus center-of-mass

But something else to consider is whether the mean is really what you're after. I compute a mean for triangles, but use a center-of-mass approach otherwise. These are normally equivalent for convex polygons. The reason (which I should've commented in the code...!) is that there are quite a number of grids where there's stuff like hanging nodes (in quadtree meshes, e.g.). See this sketch in paint. Node 3 is hanging. My expectation is that the user wants to get the spot marked by the x (which is the center). But if you compute the mean of the vertices, node 3 is going to "pull" the centroid towards the right, which is undesirable. You can avoid this by using a center-of-mass approach instead, as I do:

To deal with the fill values, I "close" the faces with

My method generates relatively many temporary numpy arrays, but you could easily avoid those with fully scalar operations in numba. My general philosophy is to rely on a vectorized solution first, and primarily use numba when there's no other way, primarily to keep JIT latency down. Even though caching means the JIT cost is only paid on first use, e.g. CI runners will not be able to utilize the caches. But it's debatable at which point numba becomes superior; I have 16 cores on my laptop, so numba's
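To make the center-of-mass idea concrete, here is a minimal sketch of my own (not the commenter's actual code) for a single planar polygon: fan the polygon into triangles from its first vertex and average the triangle centroids weighted by signed area. Because a hanging node that sits on a straight edge adds no area, it does not shift the result, unlike the plain vertex mean.

```python
import numpy as np

def center_of_mass_2d(xy):
    """Area-weighted centroid of one planar polygon given as an (n, 2) vertex array."""
    x0, y0 = xy[0]
    total_area = 0.0
    cx = 0.0
    cy = 0.0
    # Fan triangulation: triangles (v0, v_i, v_{i+1}) for i = 1 .. n-2.
    for i in range(1, len(xy) - 1):
        x1, y1 = xy[i]
        x2, y2 = xy[i + 1]
        # Signed area of the triangle, and its centroid, accumulated area-weighted.
        area = 0.5 * ((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))
        cx += area * (x0 + x1 + x2) / 3.0
        cy += area * (y0 + y1 + y2) / 3.0
        total_area += area
    return cx / total_area, cy / total_area

# A unit square with a hanging node halfway along its right edge: the vertex mean
# is pulled to the right, while the center of mass stays at the geometric center.
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 0.5], [1.0, 1.0], [0.0, 1.0]])
print(square.mean(axis=0))        # [0.6 0.5]: biased by the hanging node
print(center_of_mass_2d(square))  # (0.5, 0.5)
```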
Hi @philipc2 @Huite, thanks for all the contributions, and sorry for my delayed response! I went over and profiled all the options last night, and the Numba-decorated for-loop seems to be performing the best. On my M2 laptop, it takes only ~2.8 seconds, while the vectorized option (which we couldn't go with anyway due to the clarification @philipc2 made before) takes ~3.7 seconds. I'd like to move forward with the Numba option, at least until we might get improvements from storing connectivity as partitioned arrays in #978, but I also want to respond to a few comments above:
@philipc2, of course! I think I confused myself because I was simultaneously looking into a time-series problem at the time.
@Huite, this is a great suggestion, thanks! Let's consider creating an issue to address this sometime after this PR. Please let me know if you'd like to create that yourself; otherwise, I am happy to do so.
@rajeeja, you had already approved this for the previous plan; now that we've switched to Numba, please feel free to re-review if you'd like.
Nice work!
Thanks a lot for your contributions!
Closes #1116.
Overview
Optimizes coordinates.py/_construct_face_centroids() by replacing the for-loop with vectorization, using a masking method thanks to @hongyuchen1030's older suggestion. Further optimization of the code would be possible, but that can be handled in future PRs.
FYI @philipc2: The masking method this PR uses might be useful for cases we previously dealt with via partitioning, e.g. the nodal_averaging problem we looked into last year.
FYI @rajeeja: This is where the Cartesian-based centroid calculations are being fixed. Welzl is a bit different, but you may still want to look into it.
Results
#1116 shows the profiling results for the existing code for a 4GB SCREAM dataset where the execution time is around 5 mins.
The optimized code runs in around 6 seconds; see below:
PR Checklist
General
Testing