Extend and simplify API for calculation of range-based rolling window offsets #17807
Description
In both cudf.pandas and cudf-polars we would like to be able to use the existing libcudf functionality for computing the `preceding` and `following` columns for range-based (grouped) rolling windows. The current functionality is designed with Spark in mind, and Spark's requirements differ slightly from those of pandas and polars. In this PR, we unify the construction of these offset column calculations to satisfy all uses.
Specifically:

- We add a new `BOUNDED_OPEN` window type. This is a range-based window where the endpoint of the range is not included.
- The proposed interface for describing the requested window type going forward is strongly typed, using `std::variant`. This removes the need for default-constructed scalars when specifying `UNBOUNDED` and `CURRENT_ROW` windows.

Performance improvements
Spark permits nulls in the `orderby` column. In the grouped-rolling case, these nulls must be sorted to one end of each group. The current implementation finds the partition point in each group using `thrust::for_each` over the groups and `thrust::partition_point` to find the break. For low-cardinality grouped rolling aggregations this can be very slow, since only a single thread searches each group. We replace this with a segmented sum using CUB to find the per-group null count (and hence the break).

The dispatched functor for computing the bound for a given row value has constexpr dispatch for the case that the `orderby` column has no nulls; this is the common case, and the only one required for pandas and polars. This shaves some milliseconds.

In the case that the `orderby` column does have nulls, we extend the interface so that the caller (who must have arranged that the column is sorted correctly) tells us whether nulls were sorted `BEFORE` or `AFTER`. This avoids multiple kernel launches to deduce the null order, saving some milliseconds and, in the grouped case, memory footprint. We polyfill this deduction so that the existing interface still works until we can deprecate and remove it.

Guide for review
The core logic is implemented in `range_utils.cuh`, specifically the `range_window_clamper`, dispatched to by `make_range_window`. To keep compile times under control, the template instantiations for computing preceding and following windows are moved into separate translation units: locally, compilation takes ~10 minutes per translation unit. Things could be split out further if deemed necessary.

Checklist