Background
After some offline discussion, we identified the need for a groupBy capability in ActionbaseQuery. When running multi-hop queries, the raw result is a flat list of edges. In many use cases (e.g. "for each friend, collect their wishlist items"), the caller needs to group and aggregate those results by a key — currently this has to be done client-side.
This proposal adds a top-level groupBy field to the v3 query request that groups the output of a specified hop and collects values into per-group lists.
Why a Top-Level Field (Not Aggregator / PostProcessor)?
Both Aggregator and PostProcessor currently follow a "flat DataFrame in → flat DataFrame out, cells are primitives" contract. groupBy produces a nested structure — cells containing arrays — which breaks this contract.
- Aggregator — designed for scalar reductions (Count, Sum). Supporting collect_list would either break the return type (Mono&lt;DataFrame&gt;) and the applyAggregators chain, or require a new DataType.ARRAY that ripples through DataFrame / Row / StructType / toJsonFormat / PostProcessor — classic YAGNI.
- PostProcessor — designed for row-wise reshaping (JsonObject: 1 row → 1 row, SplitExplode: 1 row → N rows). groupBy is row-reduction (N rows → fewer rows), a fundamentally different semantic.
- Top-level field — applied at the final boundary of ActionbaseQueryExecutor.query(), so the "flat DataFrame" invariant is preserved throughout the pipeline. The nested output never feeds into a subsequent hop, keeping the internal contract intact.
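To make the "final boundary" argument concrete, here is a minimal runnable sketch of where the grouping would hook in. All names here (ExecutorBoundarySketch, query, the row representation) are hypothetical stand-ins, not the real executor API; the point is only that grouping runs once, after the pipeline, so everything upstream still sees flat rows.

```java
import java.util.*;
import java.util.function.*;

// Hypothetical stand-in for ActionbaseQueryExecutor; illustrates the
// placement of groupBy, not the real types or signatures.
public class ExecutorBoundarySketch {
    // The pipeline (hops, aggregators, post-processors) only ever sees
    // flat rows, modeled here as List<Map<String, Object>>.
    public static List<Map<String, Object>> query(
            List<Map<String, Object>> flatRows,
            Function<List<Map<String, Object>>, List<Map<String, Object>>> groupBy) {
        // ... hops, aggregators, and post-processors all run on flat data ...
        // Grouping is applied exactly once, at the response boundary;
        // its nested output never feeds back into the pipeline.
        return groupBy == null ? flatRows : groupBy.apply(flatRows);
    }
}
```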
Proposed Design
Setup
hop1: SCAN follows (index: created_at_desc)

| source | target | createdAt |
|--------|--------|-----------|
| 1000   | 2001   | 200       |
| 1000   | 2000   | 100       |
hop2: CACHE wishlist (cache: recent_wishlist)

| source | target | createdAt |
|--------|--------|-----------|
| 2001   | 5002   | 500       |
| 2001   | 5001   | 400       |
| 2000   | 5000   | 300       |
Request
curl -X POST http://localhost:8080/graph/v3/query \
-H 'Content-Type: application/json' \
-d '{
"query": [
{
"type": "SCAN",
"name": "hop1",
"database": "social",
"table": "follows",
"source": {"type": "VALUE", "value": [1000]},
"direction": "OUT",
"index": "created_at_desc"
},
{
"type": "CACHE",
"name": "hop2",
"database": "commerce",
"table": "wishlist",
"source": {"type": "REF", "ref": "hop1", "field": "target"},
"direction": "OUT",
"cache": "recent_wishlist",
"limit": 10,
"include": true
}
],
"groupBy": {
"target": "hop2",
"keys": ["source"],
"collect": [
{"field": "target", "alias": "wishes"}
]
}
}'
| Field | Description |
|-------|-------------|
| target | The name of the query item to apply grouping to. The item must have include: true. |
| keys | List of field names to group by. |
| collect | Fields whose values are accumulated into per-group lists. |
| collect[].field | Column name to collect values from. |
| collect[].alias | (optional, defaults to field) Output name for the collected list. |
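The intended semantics can be sketched in plain Java. This is an illustrative model, not the actual implementation: rows are represented as Map&lt;String, Object&gt;, and the method name groupRows is hypothetical. It groups by the key fields and appends each collect field's value to a per-group list, preserving the hop's row order (so hop2's createdAt-descending order carries into each wishes list).

```java
import java.util.*;
import java.util.stream.*;

// Illustrative sketch of the groupBy semantics; names are hypothetical.
public class GroupBySketch {
    /**
     * Groups rows by the given key fields and collects each "collect"
     * field's values into a per-group list (field -> output alias).
     */
    public static List<Map<String, Object>> groupRows(
            List<Map<String, Object>> rows,
            List<String> keys,
            Map<String, String> collect) {
        // LinkedHashMap keeps groups in first-seen order, so the
        // hop's original sort order is preserved across groups.
        Map<List<Object>, Map<String, Object>> groups = new LinkedHashMap<>();
        for (Map<String, Object> row : rows) {
            List<Object> groupKey =
                    keys.stream().map(row::get).collect(Collectors.toList());
            Map<String, Object> out = groups.computeIfAbsent(groupKey, k -> {
                Map<String, Object> g = new LinkedHashMap<>();
                for (String key : keys) g.put(key, row.get(key));
                for (String alias : collect.values()) g.put(alias, new ArrayList<Object>());
                return g;
            });
            // Append this row's value to each collected list.
            for (Map.Entry<String, String> c : collect.entrySet()) {
                ((List<Object>) out.get(c.getValue())).add(row.get(c.getKey()));
            }
        }
        return new ArrayList<>(groups.values());
    }
}
```

Running this over the hop2 rows above with keys=["source"] and collect target as wishes yields exactly the two grouped rows shown in the Response below.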
Response
{
"items": [
{
"name": "hop2",
"data": [
{"source": 2001, "wishes": [5002, 5001]},
{"source": 2000, "wishes": [5000]}
],
"rows": 2,
"stats": [],
"offset": null,
"hasNext": false
}
]
}
offset and hasNext are not meaningful for grouped output, but NamedJsonFormat requires them.
Feedback welcome on the schema shape and semantics before we start implementation.