docs: refine `AggregateUDFImpl::is_ordered_set_aggregate` documentation #17805

Jefffrey · 2025-09-27T02:19:55Z

Going through some tickets related to ordered set aggregates and got a little confused on DataFusion's support for them.

As I understand it, #13511 made WITHIN GROUP mandatory for ordered set aggregate functions, of which we support only two so far:

approx_percentile_cont
- Technically approx_median shares some internals with approx_percentile_cont but itself isn't an ordered set aggregation
approx_percentile_cont_with_weight (which uses approx_percentile_cont internally)

This was then amended in #16999 to make it optional, at least via the SQL API; it is still mandatory on the DataFrame API:

datafusion/datafusion/functions-aggregate/src/approx_percentile_cont.rs

Lines 53 to 58 in bbb5cc7

    
    /// Computes the approximate percentile continuous of a set of numbers
    pub fn approx_percentile_cont(
        order_by: Sort,
        percentile: Expr,
        centroids: Option<Expr>,
    ) -> Expr {

I'm updating the doc here to try to clarify things according to my understanding, as a follow-up to the original doc update: #17744

Jefffrey

A question I have is if we should loosen the DataFrame API to allow omitting the sort, as #16999 did for the SQL API?

cc @alamb

Jefffrey · 2025-09-27T02:20:39Z

datafusion/expr/src/udaf.rs

+    /// calculation performed by these functions is dependent on the specific
+    /// sequence of the input rows, unlike other aggregate functions like `SUM`,
+    /// `AVG`, or `COUNT`. If no explicit order is specified then a default order
+    /// of ascending is assumed.


Technically we don't enforce the default order; our only ordered set aggregate functions internally use ascending as the default, so I'm not sure if we should instead say it's implementation-dependent or try to enforce it somehow?

Jefffrey · 2025-09-27T02:21:00Z

datafusion/expr/src/udaf.rs

+    /// Note that setting this to `true` does not guarantee input sort order to
+    /// the aggregate function; it instead gives the function full control over
+    /// the sorting process and they are expected to handle order of input values
+    /// themselves.


I hope I'm correct in this; some reading I used for reference: https://paquier.xyz/postgresql-2/postgres-9-4-feature-highlight-within-group/

alamb · 2025-09-29T17:11:47Z

This was then amended in #16999 to make it optional, at least via the SQL API; it is still mandatory on the DataFrame API:

In my mind this is mostly for backwards compatibility reasons -- #13511 basically broke a bunch of our existing user queries, so I wanted to revert the unnecessarily strict interpretation

As I understand it, #13511 made WITHIN GROUP mandatory for ordered set aggregate functions, of which we support only two so far:

Indeed -- and both of these functions have the property that their argument will often be the same as the ORDER BY expression in WITHIN GROUP -- for example, computing approx_median(x) implicitly means approx_median(x ORDER BY x WITHIN GROUP)

Though allowing different arguments means you can write expressions like approx_median(first_name ORDER BY salary WITHIN GROUP) and save yourself a subquery

A question I have is if we should loosen the DataFrame API to allow omitting the sort, as #16999 did for the SQL API?

cc @alamb

I suggest we hold off unless someone explicitly asks about it, though I am not opposed to it either

alamb

Thank you @Jefffrey -- this seems like an improvement to me

alamb · 2025-09-29T17:15:20Z

datafusion/expr/src/udaf.rs

-    /// An example of an ordered-set aggregate function is `percentile_cont`
-    /// which computes a specific percentile value from a sorted list of values, and
-    /// is only meaningful when the input data is ordered.
+    /// Note that setting this to `true` does not guarantee input sort order to


this is a good clarification

If DataFusion ever supports more ordered set aggregation functions, we may want to revisit this

In addition to saying what this setting doesn't do, maybe we could also say what setting it to true does do? Specifically, it seems like it only affects the output display somehow 🤔

In addition to saying what this setting doesn't do, maybe we could also say what setting it to true does do?

This is a good point 🤔

Let me look into the code a bit more to clarify my understanding and I'll update the doc accordingly.

Jefffrey · 2025-09-30T08:26:09Z

In my mind this is mostly for backwards compatibility reasons -- #13511 basically broke a bunch of our existing user queries, so I wanted to revert the unnecessarily strict interpretation

I suggest we hold off unless someone explicitly asks about it, though I am not opposed to it either

I might raise a separate issue to track keeping the SQL API & DataFrame API in parity in regards to this; especially for when we consider adding more ordered set aggregate functions, if we should enforce WITHIN GROUP for those but not existing ones.

Though allowing different arguments means you can write expressions like approx_median(first_name ORDER BY salary WITHIN GROUP) and save yourself a subquery

I'm a bit confused by this example; is this just a hypothetical or something that is feasible with ordered set aggregate functions? I thought they would expect one column/expression which is the same as the ORDER BY in the WITHIN GROUP 🤔

alamb · 2025-09-30T18:30:03Z

I'm a bit confused by this example; is this just a hypothetical or something that is feasible with ordered set aggregate functions? I thought they would expect one column/expression which is the same as the ORDER BY in the WITHIN GROUP

I am clearly a little confused myself. Now I am not sure if there is some example where the arguments differ 🤔

Jefffrey · 2025-10-01T04:37:18Z

I'm a bit confused by this example; is this just a hypothetical or something that is feasible with ordered set aggregate functions? I thought they would expect one column/expression which is the same as the ORDER BY in the WITHIN GROUP

I am clearly a little confused myself. Now I am not sure if there is some example where the arguments differ 🤔

I'll move this PR back to draft and do some more research to update the docs, since it seems we're both still confused about this 😅

Jefffrey · 2025-10-17T05:40:08Z

Got around to doing some research.

Expectation (aka Postgres)

The WITHIN GROUP clause is required for ordered set aggregate functions; it can't be omitted, so an ORDER BY must always be specified. The functions themselves (e.g. percentile_cont) do the sort; see this description from the Postgres git commit:

Unlike the case for normal aggregates, the sorting of input rows for an ordered-set aggregate is not done behind the scenes, but is the responsibility of the aggregate's support functions. The typical implementation approach is to keep a reference to a tuplesort object in the aggregate's state value, feed the incoming rows into that object, and then complete the sorting and read out the data in the final function. This design allows the final function to perform special operations such as injecting additional hypothetical rows into the data to be sorted.

I'm not familiar with postgres internals but it reads like it expects the aggregate functions to keep the state themselves and do their own sort; WITHIN GROUP doesn't guarantee anything about input order to the function.
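To make the "buffer rows, then sort in the final function" idea concrete, here is a minimal sketch in plain Rust. It assumes nothing about DataFusion or Postgres internals: `PercentileContAcc` and its methods are hypothetical names, and the linear interpolation is the usual percentile_cont definition rather than code taken from either project.

```rust
/// Hypothetical stand-in for an ordered-set accumulator.
/// This is NOT DataFusion's `Accumulator` trait, just an illustration.
pub struct PercentileContAcc {
    /// Fixed for the whole aggregation (the 0.5 in `percentile_cont(0.5)`).
    percentile: f64,
    /// Direction taken from `WITHIN GROUP (ORDER BY ...)`.
    ascending: bool,
    /// Buffered input, kept in whatever order the rows arrived.
    values: Vec<f64>,
}

impl PercentileContAcc {
    pub fn new(percentile: f64, ascending: bool) -> Self {
        assert!((0.0..=1.0).contains(&percentile), "percentile must be in [0, 1]");
        Self { percentile, ascending, values: Vec::new() }
    }

    /// No assumption is made about input order here.
    pub fn update(&mut self, value: f64) {
        self.values.push(value);
    }

    /// The sort happens here, inside the aggregate, before interpolating
    /// linearly between the two nearest ranks.
    pub fn evaluate(mut self) -> Option<f64> {
        if self.values.is_empty() {
            return None;
        }
        self.values.sort_by(|a, b| a.total_cmp(b));
        if !self.ascending {
            self.values.reverse();
        }
        let pos = self.percentile * (self.values.len() - 1) as f64;
        let lo = pos.floor() as usize;
        let hi = pos.ceil() as usize;
        let frac = pos - lo as f64;
        Some(self.values[lo] + frac * (self.values[hi] - self.values[lo]))
    }
}
```

Feeding 3, 1, 2 in that (unsorted) order with a percentile of 0.75 evaluates to 2.5 ascending and 1.5 descending, because the accumulator does its own sort.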

They also make a point about "direct" arguments:

The aggregates we have been describing so far are "normal" aggregates. PostgreSQL also supports ordered-set aggregates, which differ from normal aggregates in two key ways. First, in addition to ordinary aggregated arguments that are evaluated once per input row, an ordered-set aggregate can have "direct" arguments that are evaluated only once per aggregation operation. Second, the syntax for the ordinary aggregated arguments specifies a sort ordering for them explicitly. An ordered-set aggregate is usually used to implement a computation that depends on a specific row ordering, for instance rank or percentile, so that the sort ordering is a required aspect of any call.

For example, the percentile value like 0.5 in percentile_cont would be a direct argument.

DuckDB

DuckDB names some functions differently (quantile_cont instead of percentile_cont) and it doesn't require the WITHIN GROUP syntax:

D select quantile_cont(col0, 1) from values (1), (2) t;
┌────────────────────────┐
│ quantile_cont(col0, 1) │
│         double         │
├────────────────────────┤
│          2.0           │
└────────────────────────┘

However, you can use the WITHIN GROUP syntax with the percentile_cont name:

D select percentile_cont(1) within group (order by col0) from values (1), (2) t;
┌────────────────────────────────┐
│ quantile_cont(1 ORDER BY col0) │
│             double             │
├────────────────────────────────┤
│              2.0               │
└────────────────────────────────┘
D select quantile_cont(1) within group (order by col0) from values (1), (2) t;
Parser Error:
Unknown ordered aggregate "quantile_cont".
D select percentile_cont(col0, 1) from values (1), (2) t;
Catalog Error:
Scalar Function with name percentile_cont does not exist!
Did you mean "pi"?

LINE 1: select percentile_cont(col0, 1) from values (1), (2) t;

Note how they don't allow quantile_cont for WITHIN GROUP, only percentile_cont which is also locked to WITHIN GROUP (it cannot be used like quantile_cont)

However DuckDB allows specifying order by for quantile_cont without WITHIN GROUP:

D select quantile_cont(1 order by col0) from values (1), (2) t;
┌────────────────────────────────┐
│ quantile_cont(1 ORDER BY col0) │
│             double             │
├────────────────────────────────┤
│              2.0               │
└────────────────────────────────┘

But it doesn't allow using both:

D select quantile_cont(1 order by col0) within group (order by col0) from values (1), (2) t;
Parser Error:
cannot use multiple ORDER BY clauses with WITHIN GROUP

LINE 1: select quantile_cont(1 order by col0) within group (order by col0) from values (1), (2) t;
                                              ^

Actual (what's currently implemented in DataFusion)

(I'm ignoring anything related to schema names or the unparser and focusing only on the functionality of the aggregate.)

It seems all AggregateUDFImpl::is_ordered_set_aggregate() does is tell the SQL layer, when it converts the function call into an expression, to put the WITHIN GROUP ORDER BY expression at the front of the aggregate function's argument list:

datafusion/datafusion/sql/src/expr/function.rs

Lines 450 to 469 in c84e3cf

    
    let order_by = if fm.is_ordered_set_aggregate() {
        let within_group = self.order_by_to_sort_expr(
            within_group,
            schema,
            planner_context,
            false,
            None,
        )?;
        // Add the WITHIN GROUP ordering expressions to the front of the argument list
        // So function(arg) WITHIN GROUP (ORDER BY x) becomes function(x, arg)
        if !within_group.is_empty() {
            args = within_group
                .iter()
                .map(|sort| sort.expr.clone())
                .chain(args)
                .collect::<Vec<_>>();
        }
        within_group
    } else {

i.e. it essentially rewrites percentile_cont(0.5) within group (order by col0) to percentile_cont(col0, 0.5) and passes along the sort order too.
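The argument rewrite can be reproduced in isolation. The sketch below uses plain strings in place of expressions; `SortSpec` and `prepend_within_group` are hypothetical names for illustration and are not part of DataFusion.

```rust
/// Stand-in for a sort expression (hypothetical; not DataFusion's `Sort`).
#[derive(Debug, Clone, PartialEq)]
pub struct SortSpec {
    pub expr: String,
    pub asc: bool,
}

/// Put the WITHIN GROUP ordering expressions in front of the existing
/// arguments, so `f(arg) WITHIN GROUP (ORDER BY x)` is seen as `f(x, arg)`.
/// With an empty `within_group` the arguments come back unchanged.
pub fn prepend_within_group(args: Vec<String>, within_group: &[SortSpec]) -> Vec<String> {
    within_group
        .iter()
        .map(|sort| sort.expr.clone())
        .chain(args)
        .collect()
}
```

Only the expression is moved into the argument list; the direction (`asc`) stays on the sort specification, which is what "passes along the sort order too" refers to.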

This is all we currently do from what I see; we don't disallow using WITHIN GROUP on other aggregate functions (see #18109). Though our DataFrame API functions for the ordered set aggregate functions do require an explicit sort order as I mentioned before:

datafusion/datafusion/functions-aggregate/src/approx_percentile_cont.rs

Lines 55 to 60 in c84e3cf

    
    /// Computes the approximate percentile continuous of a set of numbers
    pub fn approx_percentile_cont(
        order_by: Sort,
        percentile: Expr,
        centroids: Option<Expr>,
    ) -> Expr {

References

Postgres git commit of ordered set aggregates

Also a nice article explaining the commit a bit

Postgres documentation:

DuckDB:

https://duckdb.org/docs/stable/sql/functions/aggregates#ordered-set-aggregate-functions

Jefffrey · 2025-10-17T05:48:55Z

What I propose:

We should keep WITHIN GROUP optional, since we had users using approx_percentile_cont without it (so follow DuckDB here). For other ordered set aggregate functions we follow suit (already did so for percentile_cont here feat: Add percentile_cont aggregate function #17988)
- If WITHIN GROUP not specified then it is implementation dependent what the default order is, though we should say it is ascending (we can't control this yet via code, only comments)
Make WITHIN GROUP strictly apply only to aggregate functions which return true for AggregateUDFImpl::is_ordered_set_aggregate() -> WITHIN GROUP needs to be more strict #18109
AggregateUDFImpl::is_ordered_set_aggregate() itself doesn't apply any guarantees to input order, like postgres; we won't insert a sort or anything, it'll be up to the aggregate function itself to take the "hint" from WITHIN GROUP (not really a hint but you get the point) and do its ordering internally accordingly
Refactor DataFrame APIs a bit to make it easier to use, for example:

// Currently
pub fn approx_percentile_cont(
    order_by: Sort,
    percentile: Expr,
    centroids: Option<Expr>,
) -> Expr {
    todo!()
}

// Proposed: much more explicit instead of needing to extract `expression` from `Sort`
pub fn approx_percentile_cont(
    expression: Expr,
    percentile: Expr,
    ascending: bool,
    centroids: Option<Expr>,
) -> Expr {
    todo!()
}
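As a rough sketch of how the two shapes relate, the following uses stand-in types (`Expr`, `Sort`, `Call`) and hypothetical function names rather than the real DataFusion ones. It assumes the current function takes the expression out of the `Sort` and uses it as the first argument, which is how the comment above reads.

```rust
// Stand-in types for illustration only; these are not DataFusion's.
#[derive(Debug, Clone, PartialEq)]
pub struct Expr(pub String);

#[derive(Debug, Clone, PartialEq)]
pub struct Sort {
    pub expr: Expr,
    pub asc: bool,
}

/// What the aggregate call ends up carrying: arguments plus an ordering.
#[derive(Debug, Clone, PartialEq)]
pub struct Call {
    pub args: Vec<Expr>,
    pub order_by: Sort,
}

/// Current shape: the caller builds a `Sort`, and the expression is
/// extracted from it to become the first argument (an assumption).
pub fn current_style(order_by: Sort, percentile: Expr, centroids: Option<Expr>) -> Call {
    let mut args = vec![order_by.expr.clone(), percentile];
    args.extend(centroids);
    Call { args, order_by }
}

/// Proposed shape: the caller passes the expression and direction directly,
/// and the `Sort` is assembled internally.
pub fn proposed_style(
    expression: Expr,
    percentile: Expr,
    ascending: bool,
    centroids: Option<Expr>,
) -> Call {
    let order_by = Sort { expr: expression, asc: ascending };
    current_style(order_by, percentile, centroids)
}
```

Under these assumptions both entry points produce the same call; the proposal only changes what the caller has to construct.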

Thoughts @alamb ?

alamb · 2025-10-18T11:33:04Z

Thoughts @alamb ?

Thank you for the crazy thorough research ❤️

What I propose:

We should keep WITHIN GROUP optional, since we had users using approx_percentile_cont without it (so follow DuckDB here). For other ordered set aggregate functions we follow suit (already did so for percentile_cont here feat: Add percentile_cont aggregate function #17988)

I agree

If WITHIN GROUP not specified then it is implementation dependent what the default order is, though we should say it is ascending (we can't control this yet via code, only comments)

I agree (and I think this is consistent with the current behavior too)

Make WITHIN GROUP strictly apply only to aggregate functions which return true for AggregateUDFImpl::is_ordered_set_aggregate() -> WITHIN GROUP needs to be more strict #18109

What do you mean "strictly apply" ? As in return an error for queries with WITHIN GROUP for an aggregate that AggregateUDFImpl::is_ordered_set_aggregate() --> false?

AggregateUDFImpl::is_ordered_set_aggregate() itself doesn't apply any guarantees to input order, like postgres; we won't insert a sort or anything, it'll be up to the aggregate function itself to take the "hint" from WITHIN GROUP (not really a hint but you get the point) and do its ordering internally accordingly

If this is consistent with what currently happens (I think it is) then it sounds good to me --

Refactor DataFrame APIs a bit to make it easier to use, for example:

Rather than just ascending how about taking in expression: SortExpr , which would also allow specifying NULLS FIRST and/or NULLS LAST ?

// Proposed: much more explicit instead of needing to extract `expression` from `Sort`
pub fn approx_percentile_cont(
    expression: Expr,
    percentile: Expr,
    ascending: bool,
    centroids: Option<Expr>,
) -> Expr {
    todo!()
}

I have one more question: what should we do with just an ORDER BY clause in the argument list? For example:

SELECT approx_percentile_cont(expr ORDER BY time, 0.5) FROM ...

Would we treat that ORDER BY the same as if it were specified in the WITHIN GROUP clause?

Jefffrey · 2025-10-18T14:42:16Z

What do you mean "strictly apply" ? As in return an error for queries with WITHIN GROUP for an aggregate that AggregateUDFImpl::is_ordered_set_aggregate() --> false?

Yes, since that syntax is essentially ignored in that case, I think it's better to return an explicit error if users try to do this.
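A minimal sketch of that stricter check, written as a free function over two booleans. `PlanError` and `check_within_group` are hypothetical names; the real change tracked in #18109 may look quite different.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum PlanError {
    WithinGroupNotSupported { function: String },
}

/// Reject WITHIN GROUP on aggregates that do not report themselves as
/// ordered-set aggregates, instead of silently ignoring the clause.
pub fn check_within_group(
    function: &str,
    is_ordered_set_aggregate: bool,
    has_within_group: bool,
) -> Result<(), PlanError> {
    if has_within_group && !is_ordered_set_aggregate {
        return Err(PlanError::WithinGroupNotSupported {
            function: function.to_string(),
        });
    }
    Ok(())
}
```

Omitting WITHIN GROUP stays valid for every function, which keeps the clause optional for ordered-set aggregates as proposed.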

Rather than just ascending how about taking in expression: SortExpr, which would also allow specifying NULLS FIRST and/or NULLS LAST ?

Ordered-set aggregates actually ignore nulls completely in both postgres & duckdb, so we wouldn't need to consider null orders.

Apparently hypothetical-set aggregate functions in postgres do respect nulls, but that is a separate feature

This made me realise though that we have supports_null_handling_clause:

datafusion/datafusion/expr/src/udaf.rs

Lines 743 to 747 in 93f136c

    
    /// If this function supports `[IGNORE NULLS | RESPECT NULLS]` clause, return true
    /// If the function does not, return false
    fn supports_null_handling_clause(&self) -> bool {
        true
    }

And this would be tied to is_ordered_set_aggregate as it doesn't make sense to have both be true; I'll look more into this separately.
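One way to picture the relationship is a simple consistency check over the two flags. The trait below is a hypothetical stand-in, not `AggregateUDFImpl`: the `true` default mirrors the snippet above, while the `false` default for `is_ordered_set_aggregate` is an assumption made for this sketch.

```rust
/// Hypothetical stand-in trait carrying just the two flags under discussion.
pub trait AggFlags {
    /// Assumed default for this sketch.
    fn is_ordered_set_aggregate(&self) -> bool {
        false
    }
    /// Default taken from the snippet quoted above.
    fn supports_null_handling_clause(&self) -> bool {
        true
    }
}

/// The two flags are treated as mutually exclusive: an ordered-set aggregate
/// should not also accept `IGNORE NULLS` / `RESPECT NULLS`.
pub fn flags_are_consistent(f: &dyn AggFlags) -> bool {
    !(f.is_ordered_set_aggregate() && f.supports_null_handling_clause())
}

/// Ordered-set aggregate that opts out of the null handling clause.
pub struct PercentileLike;
impl AggFlags for PercentileLike {
    fn is_ordered_set_aggregate(&self) -> bool {
        true
    }
    fn supports_null_handling_clause(&self) -> bool {
        false
    }
}

/// Ordered-set aggregate that forgot to override the `true` default.
pub struct Misconfigured;
impl AggFlags for Misconfigured {
    fn is_ordered_set_aggregate(&self) -> bool {
        true
    }
}

/// Ordinary aggregate relying on both defaults.
pub struct Plain;
impl AggFlags for Plain {}
```

The `Misconfigured` case shows why a `true` default is easy to trip over: overriding only one method leaves both flags set.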

I have one more question: what should we do with just an ORDER BY clause in the argument list? For example:
SELECT approx_percentile_cont(expr ORDER BY time, 0.5) FROM ...
Would we treat that ORDER BY the same as if it were specified in the WITHIN GROUP clause?

Ok I didn't realize we supported this syntax 😅

Time for another rabbit hole.

Reference: DuckDB

Base case:

D select quantile_cont(col0, 0.75) from values (1, 3), (2, 2), (3, 1) t;
┌───────────────────────────┐
│ quantile_cont(col0, 0.75) │
│          double           │
├───────────────────────────┤
│            2.5            │
└───────────────────────────┘

DuckDB allows an order by after the arguments:

D select quantile_cont(col0, 0.75 order by col0) from values (1, 3), (2, 2), (3, 1) t;
┌─────────────────────────────────────────┐
│ quantile_cont(col0, 0.75 ORDER BY col0) │
│                 double                  │
├─────────────────────────────────────────┤
│                   2.5                   │
└─────────────────────────────────────────┘
D select quantile_cont(col0, 0.75 order by col0 asc) from values (1, 3), (2, 2), (3, 1) t;
┌─────────────────────────────────────────────┐
│ quantile_cont(col0, 0.75 ORDER BY col0 ASC) │
│                   double                    │
├─────────────────────────────────────────────┤
│                     2.5                     │
└─────────────────────────────────────────────┘
D select quantile_cont(col0, 0.75 order by col0 desc) from values (1, 3), (2, 2), (3, 1) t;
┌──────────────────────────────────────────────┐
│ quantile_cont(col0, 0.75 ORDER BY col0 DESC) │
│                    double                    │
├──────────────────────────────────────────────┤
│                     1.5                      │
└──────────────────────────────────────────────┘

Interestingly, if you try to order by a separate column (col1) it behaves as if you're ordering by col0 instead:

D select quantile_cont(col0, 0.75 order by col1 asc) from values (1, 3), (2, 2), (3, 1) t;
┌─────────────────────────────────────────────┐
│ quantile_cont(col0, 0.75 ORDER BY col1 ASC) │
│                   double                    │
├─────────────────────────────────────────────┤
│                     2.5                     │
└─────────────────────────────────────────────┘
D select quantile_cont(col0, 0.75 order by col1 desc) from values (1, 3), (2, 2), (3, 1) t;
┌──────────────────────────────────────────────┐
│ quantile_cont(col0, 0.75 ORDER BY col1 DESC) │
│                    double                    │
├──────────────────────────────────────────────┤
│                     1.5                      │
└──────────────────────────────────────────────┘

See how col0 and col1 have values in reverse of each other; we'd expect quantile_cont(col0, 0.75 order by col0 asc) to be the same as quantile_cont(col0, 0.75 order by col1 desc) but this is not the case. So we can surmise that it just looks at the asc/desc part and ignores the actual column in the ORDER BY, considering only the column in the first argument.
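The two candidate interpretations can be compared on the same three rows. This is a sketch that only reproduces the numbers observed above; it says nothing about how DuckDB is actually implemented, and all function names are made up for the comparison.

```rust
/// Linear-interpolated quantile over values that are already in the desired
/// order. Panics on an empty slice; callers must pass at least one value.
fn quantile_cont(ordered: &[f64], q: f64) -> f64 {
    let pos = q * (ordered.len() - 1) as f64;
    let (lo, hi) = (pos.floor() as usize, pos.ceil() as usize);
    ordered[lo] + (pos - lo as f64) * (ordered[hi] - ordered[lo])
}

/// Interpretation A: sort the rows by the ORDER BY column (col1), then read
/// the argument column (col0) in that row order.
pub fn by_order_column(rows: &[(f64, f64)], q: f64, asc: bool) -> f64 {
    let mut rows = rows.to_vec();
    rows.sort_by(|a, b| a.1.total_cmp(&b.1));
    if !asc {
        rows.reverse();
    }
    let col0: Vec<f64> = rows.iter().map(|r| r.0).collect();
    quantile_cont(&col0, q)
}

/// Interpretation B: ignore the ORDER BY column and apply only its
/// direction (asc/desc) to the argument column (col0).
pub fn direction_only(rows: &[(f64, f64)], q: f64, asc: bool) -> f64 {
    let mut col0: Vec<f64> = rows.iter().map(|r| r.0).collect();
    col0.sort_by(|a, b| a.total_cmp(b));
    if !asc {
        col0.reverse();
    }
    quantile_cont(&col0, q)
}
```

For the rows (1, 3), (2, 2), (3, 1) and a quantile of 0.75, interpretation B gives 2.5 for `order by col1 asc` and 1.5 for `order by col1 desc`, matching the DuckDB output, whereas interpretation A gives the opposite pair.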

This syntax is also supported:

D select quantile_cont(0.75 order by col0) from values (1, 3), (2, 2), (3, 1) t;
┌───────────────────────────────────┐
│ quantile_cont(0.75 ORDER BY col0) │
│              double               │
├───────────────────────────────────┤
│                2.5                │
└───────────────────────────────────┘
D select quantile_cont(0.75 order by col0 asc) from values (1, 3), (2, 2), (3, 1) t;
┌───────────────────────────────────────┐
│ quantile_cont(0.75 ORDER BY col0 ASC) │
│                double                 │
├───────────────────────────────────────┤
│                  2.5                  │
└───────────────────────────────────────┘
D select quantile_cont(0.75 order by col0 desc) from values (1, 3), (2, 2), (3, 1) t;
┌────────────────────────────────────────┐
│ quantile_cont(0.75 ORDER BY col0 DESC) │
│                 double                 │
├────────────────────────────────────────┤
│                  1.5                   │
└────────────────────────────────────────┘

It is similar to WITHIN GROUP in that it implicitly takes the column you are ordering by as the column you want to compute on.

Lastly, it is an error (reported by the binder rather than the parser) if you try to put ORDER BY before the last argument:

D select quantile_cont(col0 order by col0, 0.75) from values (1, 3), (2, 2), (3, 1) t;
Binder Error:
ORDER BY non-integer literal has no effect.
* SET order_by_non_integer_literal=true to allow this behavior.

Perhaps you misplaced ORDER BY; ORDER BY must appear after all regular arguments of the aggregate.

LINE 1: select quantile_cont(col0 order by col0, 0.75) from values (1, 3), (2, 2), (3, 1) t;

DataFusion

We accept an ORDER BY inside the parentheses, after the arguments, but it doesn't seem to do anything:

> select quantile_cont(col0, 0.75 order by col0) from values (1, 3), (2, 2), (3, 1) t(col0, col1);
+-------------------------------------+
| quantile_cont(t.col0,Float64(0.75)) |
+-------------------------------------+
| 2.5                                 |
+-------------------------------------+
1 row(s) fetched.
Elapsed 0.009 seconds.

> select quantile_cont(col0, 0.75 order by col0 desc) from values (1, 3), (2, 2), (3, 1) t(col0, col1);
+-------------------------------------+
| quantile_cont(t.col0,Float64(0.75)) |
+-------------------------------------+
| 2.5                                 |
+-------------------------------------+
1 row(s) fetched.
Elapsed 0.011 seconds.

I'd expect the output to change here since the sort order changed.

This doesn't work either:

> select quantile_cont(0.75 order by col0) from values (1, 3), (2, 2), (3, 1) t(col0, col1);
Error during planning: Failed to coerce arguments to satisfy a call to 'percentile_cont' function: coercion from Float64 to the signature OneOf([Exact([Int8, Float64]), Exact([Int16, Float64]), Exact([Int32, Float64]), Exact([Int64, Float64]), Exact([UInt8, Float64]), Exact([UInt16, Float64]), Exact([UInt32, Float64]), Exact([UInt64, Float64]), Exact([Float32, Float64]), Exact([Float64, Float64])]) failed No function matches the given name and argument types 'percentile_cont(Float64)'. You might need to add explicit type casts.
        Candidate functions:
        percentile_cont(Int8, Float64)
        percentile_cont(Int16, Float64)
        percentile_cont(Int32, Float64)
        percentile_cont(Int64, Float64)
        percentile_cont(UInt8, Float64)
        percentile_cont(UInt16, Float64)
        percentile_cont(UInt32, Float64)
        percentile_cont(UInt64, Float64)
        percentile_cont(Float32, Float64)
        percentile_cont(Float64, Float64)

Looks like it pretty much ignores the order by col0 as it sees the input as 'percentile_cont(Float64)'

Same if you put the order by after the first argument:

> select quantile_cont(col0 order by col0, 0.75) from values (1, 3), (2, 2), (3, 1) t(col0, col1);
Error during planning: Failed to coerce arguments to satisfy a call to 'percentile_cont' function: coercion from Int64 to the signature OneOf([Exact([Int8, Float64]), Exact([Int16, Float64]), Exact([Int32, Float64]), Exact([Int64, Float64]), Exact([UInt8, Float64]), Exact([UInt16, Float64]), Exact([UInt32, Float64]), Exact([UInt64, Float64]), Exact([Float32, Float64]), Exact([Float64, Float64])]) failed No function matches the given name and argument types 'percentile_cont(Int64)'. You might need to add explicit type casts.
        Candidate functions:
        percentile_cont(Int8, Float64)
        percentile_cont(Int16, Float64)
        percentile_cont(Int32, Float64)
        percentile_cont(Int64, Float64)
        percentile_cont(UInt8, Float64)
        percentile_cont(UInt16, Float64)
        percentile_cont(UInt32, Float64)
        percentile_cont(UInt64, Float64)
        percentile_cont(Float32, Float64)
        percentile_cont(Float64, Float64)

This time it sees 'percentile_cont(Int64)' which means it ignores the 0.75 fraction?

And if we try using it with WITHIN GROUP:

> select quantile_cont(col0, 0.75 order by col0 asc) within group (order by col0) from values (1, 3), (2, 2), (3, 1) t(col0, col1);
Error during planning: ORDER BY and WITHIN GROUP clauses cannot be used together in the same aggregate function
> select quantile_cont(0.75 order by col0 asc) within group (order by col0) from values (1, 3), (2, 2), (3, 1) t(col0, col1);
Error during planning: ORDER BY and WITHIN GROUP clauses cannot be used together in the same aggregate function
> select quantile_cont(col0 order by col0, 0.75) within group (order by col0) from values (1, 3), (2, 2), (3, 1) t(col0, col1);
Error during planning: ORDER BY and WITHIN GROUP clauses cannot be used together in the same aggregate function

We get a nice SQL error.

What to support

I think we should explicitly disallow ORDER BY within the aggregate arguments entirely; it allows too much flexibility and frankly it's doing my head in a bit (even without considering SQL planner code). Let's just stick with Postgres, and if anyone needs/wants the DuckDB way we can implement that later.

So action items:

Raise issue to look into way to relate supports_null_handling_clause and is_ordered_set_aggregate
Raise issue to disallow ORDER BY within ordered-set aggregate functions argument lists
- Add tests to lock in this behaviour
Check if we have test for these already:

> select quantile_cont(col0, 0.75 order by col0 asc) within group (order by col0) from values (1, 3), (2, 2), (3, 1) t(col0, col1);
Error during planning: ORDER BY and WITHIN GROUP clauses cannot be used together in the same aggregate function
> select quantile_cont(0.75 order by col0 asc) within group (order by col0) from values (1, 3), (2, 2), (3, 1) t(col0, col1);
Error during planning: ORDER BY and WITHIN GROUP clauses cannot be used together in the same aggregate function
> select quantile_cont(col0 order by col0, 0.75) within group (order by col0) from values (1, 3), (2, 2), (3, 1) t(col0, col1);
Error during planning: ORDER BY and WITHIN GROUP clauses cannot be used together in the same aggregate function

Also note to self, see if we have test for this:

> select quantile_cont(0.75) within group (order by col0, col1 desc) from values (1, 3), (2, 2), (3, 1) t(col0, col1);
This feature is not implemented: Only a single ordering expression is permitted in a WITHIN GROUP clause

We might wanna change this from NotImplemented error as it shouldn't be supported at all, see postgres:

postgres=# select percentile_cont(0.75) within group (order by salary, employee_id) from employees;
ERROR:  function percentile_cont(numeric, numeric, integer) does not exist
LINE 1: select percentile_cont(0.75) within group (order by salary, ...
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
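The clause rules discussed in this comment could be collected into one validation step. The sketch below is only an illustration of those rules for an ordered-set aggregate call: the function and error names are hypothetical, and treating multiple WITHIN GROUP expressions as a hard error (rather than "not implemented") is the suggestion made above, not current behaviour.

```rust
/// Hypothetical error type listing the three rejected shapes.
#[derive(Debug, PartialEq)]
pub enum ClauseError {
    /// `f(x ORDER BY y) WITHIN GROUP (ORDER BY z)`
    OrderByAndWithinGroup,
    /// `f(x, 0.75 ORDER BY y)` on an ordered-set aggregate
    OrderByInArguments,
    /// `f(0.75) WITHIN GROUP (ORDER BY a, b)`
    MultipleWithinGroupExprs,
}

/// Validate the ordering clauses of an ordered-set aggregate call, given how
/// many ORDER BY expressions appear inside the argument list and how many
/// appear inside WITHIN GROUP.
pub fn validate_ordered_set_call(
    order_by_in_args: usize,
    within_group_exprs: usize,
) -> Result<(), ClauseError> {
    if order_by_in_args > 0 && within_group_exprs > 0 {
        return Err(ClauseError::OrderByAndWithinGroup);
    }
    if order_by_in_args > 0 {
        return Err(ClauseError::OrderByInArguments);
    }
    if within_group_exprs > 1 {
        return Err(ClauseError::MultipleWithinGroupExprs);
    }
    Ok(())
}
```

A call with no ordering clause at all still passes, which matches keeping WITHIN GROUP optional.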

docs: refine AggregateUDFImpl::is_ordered_set_aggregate documentation

440ab7b

github-actions bot added the logical-expr Logical plan and expressions label Sep 27, 2025

Jefffrey commented Sep 27, 2025


I am very good at the English language

cb8facd

Jefffrey marked this pull request as ready for review September 27, 2025 02:36

alamb approved these changes Sep 29, 2025


Jefffrey marked this pull request as draft October 1, 2025 04:37

Merge branch 'main' into update-ordered-set-aggregate-doc

ceebea0

Jefffrey mentioned this pull request Oct 17, 2025

WITHIN GROUP needs to be more strict #18109

Open

Jefffrey mentioned this pull request Oct 18, 2025

Epic: Ordered Set Aggregate Functions #12824

Open

10 tasks
