
Add string_agg function, partition_by and order_by for analytic functions, and partition_by and limit for string_agg #1568


Merged: christopherswenson merged 30 commits into main from string_agg on Jan 31, 2024


@christopherswenson commented Jan 4, 2024

Changes:

  • Adds a string_agg function with two overloads: string_agg(field) and string_agg(field, separator). If no separator is specified, it defaults to a comma (","). Also adds string_agg_distinct, which behaves like string_agg but aggregates only distinct values.
  • Adds the ability to specify an order_by and a limit for an aggregate function (currently usable only with string_agg), similar to the way you specify filters for an aggregate expression: string_agg(field, sep) { order_by: field, limit: 10 }. These are under the experiments function_order_by and aggregate_limit respectively. (These names were chosen because order_by also applies to window functions, whereas limit only applies to aggregates.)
  • Adds, for all window functions, the ability to specify order_by and partition_by in the same way as above. This is under the experiment partition_by.
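
As a rough illustration of the semantics described in the list above, here is a small Python model. It is not Malloy and not the code in this PR. It treats string_agg as: skip nulls, optionally sort by an input-space expression, optionally truncate to limit, then join with the separator. The function signature, the rows-as-dicts representation, and the null-skipping behaviour are assumptions made for the example.

```python
from typing import Any, Callable, Iterable, Optional

Row = dict  # hypothetical row representation: one dict per input row


def string_agg(
    rows: Iterable[Row],
    value: Callable[[Row], Any],
    separator: str = ",",  # default separator, as stated in the PR description
    order_by: Optional[Callable[[Row], Any]] = None,
    descending: bool = False,
    limit: Optional[int] = None,
) -> str:
    """Toy model of string_agg(value, separator) { order_by: ..., limit: ... }."""
    # Assumption: rows whose aggregated value is null are skipped.
    selected = [r for r in rows if value(r) is not None]
    if order_by is not None:
        # order_by is evaluated per input row, so it may be any row expression.
        selected = sorted(selected, key=order_by, reverse=descending)
    if limit is not None:
        # limit truncates the ordered inputs before they are joined.
        selected = selected[:limit]
    return separator.join(str(value(r)) for r in selected)


rows = [
    {"name": "b", "n": 2},
    {"name": "a", "n": 1},
    {"name": None, "n": 0},
    {"name": "c", "n": 3},
]
print(string_agg(rows, lambda r: r["name"]))  # b,a,c
print(string_agg(rows, lambda r: r["name"], ", ", order_by=lambda r: r["n"], limit=2))  # a, b
```

The second call corresponds roughly to string_agg(name, ', ') { order_by: n, limit: 2 } in the syntax introduced by this PR.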

For both aggregates and window functions, order_by is a list of expressions, each optionally followed by asc or desc, with commas required between items. This differs from a query's order_by, which is a list of field names or integers (also with asc/desc) in which commas are optional. The difference exists because order_by in a query must refer to a field in the output space, whereas order_by in an aggregate or window function can be an expression.

Specifically, order_by in an aggregate (i.e. string_agg) can be any input-space expression, and order_by in an analytic function must either be an aggregate expression or an output field reference. These requirements are correctly detected by the parser.

partition_by is a list of non-dotted identifiers (referring to fields in the output space), with commas required between items. The comma is required because, for select views, expressions should eventually be allowed in partition_by, so commas will be needed down the line anyway.
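
To make the roles of partition_by and order_by in an analytic function concrete, here is a minimal Python model of a row_number-style function. The choice of row_number as the stand-in, the function signature, and the rows-as-dicts representation are all assumptions for illustration; this is not how the PR compiles window functions.

```python
from collections import defaultdict
from typing import Any, Callable, List, Sequence

Row = dict  # hypothetical output-space row: one dict per result row


def row_number(
    rows: Sequence[Row],
    partition_by: Sequence[str],
    order_by: Callable[[Row], Any],
    descending: bool = False,
) -> List[int]:
    """Number rows 1..n within each partition, following order_by."""
    # partition_by is a list of plain field names, mirroring the
    # "non-dotted identifiers" restriction described above.
    partitions = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(row[name] for name in partition_by)
        partitions[key].append(index)

    numbers = [0] * len(rows)
    for indexes in partitions.values():
        # order_by may be an arbitrary expression over the row.
        ordered = sorted(indexes, key=lambda i: order_by(rows[i]), reverse=descending)
        for position, index in enumerate(ordered, start=1):
            numbers[index] = position
    return numbers


rows = [
    {"state": "CA", "total": 5},
    {"state": "NY", "total": 7},
    {"state": "CA", "total": 9},
]
print(row_number(rows, ["state"], lambda r: r["total"], descending=True))  # [2, 1, 1]
```

In the syntax from this PR, that call would look something like row_number() { partition_by: state, order_by: total desc }.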

limit must be a literal integer.

Known limitations:

  • When using string_agg_distinct with an order_by, the order_by expression must be exactly the same as the first parameter. Violating this currently produces a SQL error that Malloy does not catch.
  • A problem for future me and Michael: limit is currently only allowed for BigQuery (BQ), and there's some funkiness in the way functions are parsed/compiled to make sure the other dialects get a good error message. There is an existing, similar issue: if an overload doesn't exist for one dialect, it gets a gross internal error. We need to give this some more thought.

Known bugs:

  • string_agg's order_by currently doesn't get added to the array agg unnest, meaning order_by doesn't work properly when there's a fanout

Future changes:

  • Change the string_agg functions to use a custom fragment to produce a custom SQL compile with the following rules:
    • A regular string_agg with no order_by and no fanout just uses the STRING_AGG function
    • If a string_agg has a fanout or an order_by, the ordering expression and/or the fanout distinct key are wrapped into concat("MAGIC_PREFIX", ordering_expression, "MAGIC_SEPARATOR", distinct_key, "MAGIC_SUFFIX", thing_you_wanted_to_string_agg). That expression is then aggregated with STRING_AGG(DISTINCT ...), and the MAGIC_PREFIX.*?MAGIC_SUFFIX spans are removed with REGEXP_REPLACE. This lets BigQuery handle string_agg calls under fanout and fixes the bug where string_agg's order_by is not added to the array agg unnest.
  • Add the syntax { order_by: asc } (and { order_by: desc }) for string_agg_distinct to solve the issue of "the order by must be exactly the same as the first parameter". For string_agg and analytic functions, order_by: asc is an error, and for string_agg_distinct, order_by: anything_else is an error.
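
The prefix/suffix trick proposed above can be modelled in a few lines of Python. This is only a sketch of the idea, not the planned SQL: the marker strings, the helper names, and the row layout are hypothetical, and it assumes the ordering expression has already been encoded so that it sorts correctly as text (zero-padded here), since the distinct aggregation can only order by the tagged string itself.

```python
import re

# Hypothetical marker strings standing in for MAGIC_PREFIX / MAGIC_SEPARATOR /
# MAGIC_SUFFIX; they must never occur in real data.
PREFIX = "\u0001P"
SEPARATOR = "\u0001S"
SUFFIX = "\u0001E"


def tag(order_key: str, distinct_key: object, value: str) -> str:
    """Model of concat(PREFIX, ordering_expression, SEPARATOR, distinct_key, SUFFIX, value)."""
    return f"{PREFIX}{order_key}{SEPARATOR}{distinct_key}{SUFFIX}{value}"


def fanout_safe_string_agg(rows, separator=","):
    # Step 1: tag every row. Rows duplicated by a join fanout share the same
    # distinct key, so they produce identical tagged strings.
    tagged = [tag(r["order_key"], r["distinct_key"], r["value"]) for r in rows]
    # Step 2: model STRING_AGG(DISTINCT ...) ordered by the tagged expression.
    # Because the tag starts with the ordering key, text order is the
    # requested order.
    joined = separator.join(sorted(set(tagged)))
    # Step 3: model REGEXP_REPLACE(..., 'PREFIX.*?SUFFIX', '') to strip the tags.
    pattern = re.escape(PREFIX) + ".*?" + re.escape(SUFFIX)
    return re.sub(pattern, "", joined, flags=re.DOTALL)


rows = [
    {"distinct_key": 1, "order_key": "002", "value": "pear"},
    {"distinct_key": 1, "order_key": "002", "value": "pear"},  # fanout duplicate
    {"distinct_key": 2, "order_key": "001", "value": "apple"},
    {"distinct_key": 3, "order_key": "003", "value": "pear"},  # a different source row
]
print(fanout_safe_string_agg(rows))  # apple,pear,pear
```

The fanout duplicate collapses because it shares a distinct key with its twin, while the third "pear" survives because it comes from a different source row. That is the difference between this approach and a plain distinct on the value.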

Fixes #1557
Fixes #1560

@christopherswenson marked this pull request as ready for review January 4, 2024 21:25
@christopherswenson
Contributor Author

Probably will not merge without implementing WN-0011, assuming it is accepted.

@christopherswenson
Contributor Author

Fixes #1560

@christopherswenson changed the title from "Add string_agg function" to "Add string_agg function, partition_by and order_by for analytic functions, and partition_by and limit for string_agg" on Jan 30, 2024
Christopher Swenson added 3 commits January 30, 2024 16:21
@christopherswenson merged commit 39856de into main Jan 31, 2024
@christopherswenson deleted the string_agg branch January 31, 2024 22:20
Successfully merging this pull request may close these issues.

  • Support something similar to PARTITION BY in SQL window functions
  • Support STRING_AGG aggregation function