-
Notifications
You must be signed in to change notification settings - Fork 1.7k
docs: refine AggregateUDFImpl::is_ordered_set_aggregate
documentation
#17805
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// calculation performed by these functions is dependent on the specific | ||
/// sequence of the input rows, unlike other aggregate functions like `SUM` | ||
/// `AVG`, or `COUNT`. If explit order is specified then a default order | ||
/// of ascending is assumed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically we don't enforce the default order; our only ordered set aggregate functions internally use ascending as the default, so I'm not sure if we should instead say its implementation dependent or try to enforce it somehow?
/// Note that setting this to `true` does not guarantee input sort order to | ||
/// the aggregate function; it instead gives the function full control over | ||
/// the sorting process and they are expected to handle order of input values | ||
/// themselves. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope I'm correct in this; some reading I used for reference: https://paquier.xyz/postgresql-2/postgres-9-4-feature-highlight-within-group/
In my mind this is mostly for backwards compatibility reasons -- #13511 basically broke a bunch of our existing user queries, so I wanted to revert the unnecessarily strict interpretation
Indeed -- and both of these functions have the property that many times their argument will be the same as the Though allowing different arguments means you can write expressions like
I suggest we hold off unless someone explicitly asks about it, though I am not opposed to it either |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Jefffrey -- this seems like an improvement to me
/// An example of an ordered-set aggregate function is `percentile_cont` | ||
/// which computes a specific percentile value from a sorted list of values, and | ||
/// is only meaningful when the input data is ordered. | ||
/// Note that setting this to `true` does not guarantee input sort order to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a good clarification
If DataFusion ever supports more ordered set aggregation functions, we may want to revisit this
In addition to saying what this setting doesn't do, maybe we could also say what setting it to true
does do? Specifically, it seems like it only affects the output display somehow 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to saying what this setting doesn't do, maybe we could also say what setting it to
true
does do?
This is a good point 🤔
Let me look into the code a bit more to clarify my understanding and I'll update the doc accordingly.
I might raise a separate issue to track keeping the SQL API & DataFrame API in parity in regards to this; especially for when we consider adding more ordered set aggregate functions, if we should enforce
I'm a bit confused by this example; is this just a hypothetical or something that is feasible with ordered set aggregate functions? I thought they would expected one column/expression which is the same as the |
I am clearly a little confused myself. Now I am not sure if there is some example where the arguments differ 🤔 |
I'll move this PR back to draft and do some more research to update the docs, since it seems we're both still confused about this 😅 |
Got around to doing some research. Expectation (aka Postgres)
I'm not familiar with postgres internals but it reads like it expects the aggregate functions to keep the state themselves and do their own sort; They also make a point about "direct" arguments:
For example, the percentile value like DuckDB DuckDB names some functions differently ( D select quantile_cont(col0, 1) from values (1), (2) t;
┌────────────────────────┐
│ quantile_cont(col0, 1) │
│ double │
├────────────────────────┤
│ 2.0 │
└────────────────────────┘ However you can use the D select percentile_cont(1) within group (order by col0) from values (1), (2) t;
┌────────────────────────────────┐
│ quantile_cont(1 ORDER BY col0) │
│ double │
├────────────────────────────────┤
│ 2.0 │
└────────────────────────────────┘
D select quantile_cont(1) within group (order by col0) from values (1), (2) t;
Parser Error:
Unknown ordered aggregate "quantile_cont".
D select percentile_cont(col0, 1) from values (1), (2) t;
Catalog Error:
Scalar Function with name percentile_cont does not exist!
Did you mean "pi"?
LINE 1: select percentile_cont(col0, 1) from values (1), (2) t;
However DuckDB allows specifying order by for D select quantile_cont(1 order by col0) from values (1), (2) t;
┌────────────────────────────────┐
│ quantile_cont(1 ORDER BY col0) │
│ double │
├────────────────────────────────┤
│ 2.0 │
└────────────────────────────────┘ But it doesn't allow using both: D select quantile_cont(1 order by col0) within group (order by col0) from values (1), (2) t;
Parser Error:
cannot use multiple ORDER BY clauses with WITHIN GROUP
LINE 1: select quantile_cont(1 order by col0) within group (order by col0) from values (1), (2) t;
^ Actual (whats currently implemented in DataFusion) (I'm ignoring anything related to schema name or unparser, focus only on functionality of aggregate). It seems all datafusion/datafusion/sql/src/expr/function.rs Lines 450 to 469 in c84e3cf
This is all we currently do from what I see; we don't disallow using
References Postgres git commit of ordered set aggregates
Postgres documentation:
DuckDB: |
What I propose:
// Currently
pub fn approx_percentile_cont(
order_by: Sort,
percentile: Expr,
centroids: Option<Expr>,
) -> Expr {
todo!()
}
// Proposed: much more explicit instead of needing to extract `expression` from `Sort`
pub fn approx_percentile_cont(
expression: Expr,
percentile: Expr,
ascending: bool,
centroids: Option<Expr>,
) -> Expr {
todo!()
} Thoughts @alamb ? |
Thank you for the crazy thorough research ❤️
I agree
I agree (and I think this is consistent with the current behavior too)
What do you mean "strictly apply" ? As in return an error for queries with
If this is consistent with what currently happens (I think it is) then it sounds good to me --
Rather than just
I have one more question, which is "what should we do with just ORDER BY clause in the argument? For example SELECT approx_percentile_cont(expr ORDER BY time, 0.5) FROM ... Would we treat that |
Yes, since that syntax is essentially ignored in that case, so I think it's better to explicit error if users try do this.
Ordered-set aggregates actually ignore nulls completely in both postgres & duckdb, so we wouldn't need to consider null orders.
This made me realise though that we have datafusion/datafusion/expr/src/udaf.rs Lines 743 to 747 in 93f136c
And this would be tied to
Ok I didn't realize we supported this syntax 😅 Time for another rabbit hole. Reference: DuckDB Base case: D select quantile_cont(col0, 0.75) from values (1, 3), (2, 2), (3, 1) t;
┌───────────────────────────┐
│ quantile_cont(col0, 0.75) │
│ double │
├───────────────────────────┤
│ 2.5 │
└───────────────────────────┘ DuckDB allows an order by after the arguments: D select quantile_cont(col0, 0.75 order by col0) from values (1, 3), (2, 2), (3, 1) t;
┌─────────────────────────────────────────┐
│ quantile_cont(col0, 0.75 ORDER BY col0) │
│ double │
├─────────────────────────────────────────┤
│ 2.5 │
└─────────────────────────────────────────┘
D select quantile_cont(col0, 0.75 order by col0 asc) from values (1, 3), (2, 2), (3, 1) t;
┌─────────────────────────────────────────────┐
│ quantile_cont(col0, 0.75 ORDER BY col0 ASC) │
│ double │
├─────────────────────────────────────────────┤
│ 2.5 │
└─────────────────────────────────────────────┘
D select quantile_cont(col0, 0.75 order by col0 desc) from values (1, 3), (2, 2), (3, 1) t;
┌──────────────────────────────────────────────┐
│ quantile_cont(col0, 0.75 ORDER BY col0 DESC) │
│ double │
├──────────────────────────────────────────────┤
│ 1.5 │
└──────────────────────────────────────────────┘ Interestingly, if you try order by a separate column ( D select quantile_cont(col0, 0.75 order by col1 asc) from values (1, 3), (2, 2), (3, 1) t;
┌─────────────────────────────────────────────┐
│ quantile_cont(col0, 0.75 ORDER BY col1 ASC) │
│ double │
├─────────────────────────────────────────────┤
│ 2.5 │
└─────────────────────────────────────────────┘
D select quantile_cont(col0, 0.75 order by col1 desc) from values (1, 3), (2, 2), (3, 1) t;
┌──────────────────────────────────────────────┐
│ quantile_cont(col0, 0.75 ORDER BY col1 DESC) │
│ double │
├──────────────────────────────────────────────┤
│ 1.5 │
└──────────────────────────────────────────────┘ See how This syntax is also supported: D select quantile_cont(0.75 order by col0) from values (1, 3), (2, 2), (3, 1) t;
┌───────────────────────────────────┐
│ quantile_cont(0.75 ORDER BY col0) │
│ double │
├───────────────────────────────────┤
│ 2.5 │
└───────────────────────────────────┘
D select quantile_cont(0.75 order by col0 asc) from values (1, 3), (2, 2), (3, 1) t;
┌───────────────────────────────────────┐
│ quantile_cont(0.75 ORDER BY col0 ASC) │
│ double │
├───────────────────────────────────────┤
│ 2.5 │
└───────────────────────────────────────┘
D select quantile_cont(0.75 order by col0 desc) from values (1, 3), (2, 2), (3, 1) t;
┌────────────────────────────────────────┐
│ quantile_cont(0.75 ORDER BY col0 DESC) │
│ double │
├────────────────────────────────────────┤
│ 1.5 │
└────────────────────────────────────────┘ It is similar to Lastly, it is a syntax error if you try D select quantile_cont(col0 order by col0, 0.75) from values (1, 3), (2, 2), (3, 1) t;
Binder Error:
ORDER BY non-integer literal has no effect.
* SET order_by_non_integer_literal=true to allow this behavior.
Perhaps you misplaced ORDER BY; ORDER BY must appear after all regular arguments of the aggregate.
LINE 1: select quantile_cont(col0 order by col0, 0.75) from values (1, 3), (2, 2), (3, 1) t; DataFusion We support having the > select quantile_cont(col0, 0.75 order by col0) from values (1, 3), (2, 2), (3, 1) t(col0, col1);
+-------------------------------------+
| quantile_cont(t.col0,Float64(0.75)) |
+-------------------------------------+
| 2.5 |
+-------------------------------------+
1 row(s) fetched.
Elapsed 0.009 seconds.
> select quantile_cont(col0, 0.75 order by col0 desc) from values (1, 3), (2, 2), (3, 1) t(col0, col1);
+-------------------------------------+
| quantile_cont(t.col0,Float64(0.75)) |
+-------------------------------------+
| 2.5 |
+-------------------------------------+
1 row(s) fetched.
Elapsed 0.011 seconds.
This doesn't work either: > select quantile_cont(0.75 order by col0) from values (1, 3), (2, 2), (3, 1) t(col0, col1);
Error during planning: Failed to coerce arguments to satisfy a call to 'percentile_cont' function: coercion from Float64 to the signature OneOf([Exact([Int8, Float64]), Exact([Int16, Float64]), Exact([Int32, Float64]), Exact([Int64, Float64]), Exact([UInt8, Float64]), Exact([UInt16, Float64]), Exact([UInt32, Float64]), Exact([UInt64, Float64]), Exact([Float32, Float64]), Exact([Float64, Float64])]) failed No function matches the given name and argument types 'percentile_cont(Float64)'. You might need to add explicit type casts.
Candidate functions:
percentile_cont(Int8, Float64)
percentile_cont(Int16, Float64)
percentile_cont(Int32, Float64)
percentile_cont(Int64, Float64)
percentile_cont(UInt8, Float64)
percentile_cont(UInt16, Float64)
percentile_cont(UInt32, Float64)
percentile_cont(UInt64, Float64)
percentile_cont(Float32, Float64)
percentile_cont(Float64, Float64)
Same if you put the order by after the first argument: > select quantile_cont(col0 order by col0, 0.75) from values (1, 3), (2, 2), (3, 1) t(col0, col1);
Error during planning: Failed to coerce arguments to satisfy a call to 'percentile_cont' function: coercion from Int64 to the signature OneOf([Exact([Int8, Float64]), Exact([Int16, Float64]), Exact([Int32, Float64]), Exact([Int64, Float64]), Exact([UInt8, Float64]), Exact([UInt16, Float64]), Exact([UInt32, Float64]), Exact([UInt64, Float64]), Exact([Float32, Float64]), Exact([Float64, Float64])]) failed No function matches the given name and argument types 'percentile_cont(Int64)'. You might need to add explicit type casts.
Candidate functions:
percentile_cont(Int8, Float64)
percentile_cont(Int16, Float64)
percentile_cont(Int32, Float64)
percentile_cont(Int64, Float64)
percentile_cont(UInt8, Float64)
percentile_cont(UInt16, Float64)
percentile_cont(UInt32, Float64)
percentile_cont(UInt64, Float64)
percentile_cont(Float32, Float64)
percentile_cont(Float64, Float64)
And if we try using it with > select quantile_cont(col0, 0.75 order by col0 asc) within group (order by col0) from values (1, 3), (2, 2), (3, 1) t(col0, col1);
Error during planning: ORDER BY and WITHIN GROUP clauses cannot be used together in the same aggregate function
> select quantile_cont(0.75 order by col0 asc) within group (order by col0) from values (1, 3), (2, 2), (3, 1) t(col0, col1);
Error during planning: ORDER BY and WITHIN GROUP clauses cannot be used together in the same aggregate function
> select quantile_cont(col0 order by col0, 0.75) within group (order by col0) from values (1, 3), (2, 2), (3, 1) t(col0, col1);
Error during planning: ORDER BY and WITHIN GROUP clauses cannot be used together in the same aggregate function We get a nice SQL error. What to support I think we should explicitly disallow So action items:
> select quantile_cont(col0, 0.75 order by col0 asc) within group (order by col0) from values (1, 3), (2, 2), (3, 1) t(col0, col1);
Error during planning: ORDER BY and WITHIN GROUP clauses cannot be used together in the same aggregate function
> select quantile_cont(0.75 order by col0 asc) within group (order by col0) from values (1, 3), (2, 2), (3, 1) t(col0, col1);
Error during planning: ORDER BY and WITHIN GROUP clauses cannot be used together in the same aggregate function
> select quantile_cont(col0 order by col0, 0.75) within group (order by col0) from values (1, 3), (2, 2), (3, 1) t(col0, col1);
Error during planning: ORDER BY and WITHIN GROUP clauses cannot be used together in the same aggregate function Also note to self, see if we have test for this: > select quantile_cont(0.75) within group (order by col0, col1 desc) from values (1, 3), (2, 2), (3, 1) t(col0, col1);
This feature is not implemented: Only a single ordering expression is permitted in a WITHIN GROUP clause
postgres=# select percentile_cont(0.75) within group (order by salary, employee_id) from employees;
ERROR: function percentile_cont(numeric, numeric, integer) does not exist
LINE 1: select percentile_cont(0.75) within group (order by salary, ...
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts. |
Going through some tickets related to ordered set aggregates and got a little confused on DataFusion's support for them.
As I understand it, #13511 made
WITHIN GROUP
mandatory for ordered set aggregate functions, of which we support only two so far:approx_percentile_cont
approx_median
shares some internals withapprox_percentile_cont
but itself isn't an ordered set aggregationapprox_percentile_cont_with_weight
(which usesapprox_percentile_cont
internally)This was then amended in #16999 to make it optional, at least via the SQL API; it is still mandatory on the DataFrame API:
datafusion/datafusion/functions-aggregate/src/approx_percentile_cont.rs
Lines 53 to 58 in bbb5cc7
I'm updating the doc here to try clarify things to my understanding, as a followup to the original doc update: #17744