groupBy support for multi-hop query #246

@zipdoki

Background

After some offline discussion, we identified the need for a groupBy capability in ActionbaseQuery. When running multi-hop queries, the raw result is a flat list of edges. In many use cases (e.g. "for each friend, collect their wishlist items"), the caller needs to group and aggregate those results by a key — currently this has to be done client-side.

This proposal adds a top-level groupBy field to the v3 query request that groups the output of a specified hop and collects values into per-group lists.
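For context, the client-side grouping that callers have to write today might look roughly like the sketch below. The row values are borrowed from the hop2 table in the Setup section; the function name `group_targets_by_source` is hypothetical and not part of any existing client.

```python
from collections import defaultdict

# Flat edge rows, shaped like the hop2 example in the Setup section.
edges = [
    {"source": 2001, "target": 5002, "createdAt": 500},
    {"source": 2001, "target": 5001, "createdAt": 400},
    {"source": 2000, "target": 5000, "createdAt": 300},
]


def group_targets_by_source(rows):
    """Collect each row's target into a list keyed by the row's source."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["source"]].append(row["target"])
    return dict(grouped)


# Hypothetical helper: this is the work the proposal moves to the server.
wishlists = group_targets_by_source(edges)
```

Every caller that needs per-friend lists ends up repeating a loop like this, which is the duplication the top-level groupBy field is meant to remove.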

Why a Top-Level Field (Not Aggregator / PostProcessor)?

Both Aggregator and PostProcessor currently follow a "flat DataFrame in → flat DataFrame out, cells are primitives" contract. groupBy produces a nested structure — cells containing arrays — which breaks this contract.

  • Aggregator — designed for scalar reductions (Count, Sum). Supporting collect_list would either break the return type (Mono<DataFrame>) and the applyAggregators chain, or require a new DataType.ARRAY that ripples through DataFrame / Row / StructType / toJsonFormat / PostProcessor — classic YAGNI.
  • PostProcessor — designed for row-wise reshaping (JsonObject: 1 row → 1 row, SplitExplode: 1 row → N rows). groupBy is row-reduction (N rows → fewer rows), a fundamentally different semantic.
  • Top-level field — applied at the final boundary of ActionbaseQueryExecutor.query(), so the "flat DataFrame" invariant is preserved throughout the pipeline. The nested output never feeds into a subsequent hop, keeping the internal contract intact.
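The argument in the list above can be illustrated with a small sketch: intermediate stages keep every cell primitive, and grouping runs only once at the end, so the nested result never re-enters the pipeline. All names here (`is_flat`, `run_pipeline`, the sample stages) are hypothetical stand-ins and do not mirror the real ActionbaseQueryExecutor code.

```python
PRIMITIVES = (int, float, str, bool, type(None))


def is_flat(rows):
    """True when every cell is a primitive (no lists or dicts)."""
    return all(isinstance(v, PRIMITIVES) for row in rows for v in row.values())


def run_pipeline(rows, stages, group_by=None):
    """Apply flat stages in order, then the optional grouping step last."""
    for stage in stages:
        rows = stage(rows)
        # Every intermediate result must still honor the flat contract.
        assert is_flat(rows), "stage broke the flat contract"
    return group_by(rows) if group_by else rows


def drop_created_at(rows):
    """Row-wise reshaping: 1 row in, 1 row out."""
    return [{k: v for k, v in r.items() if k != "createdAt"} for r in rows]


def collect_targets(rows):
    """Row reduction: N rows in, one row per source out, with list cells."""
    groups = {}
    for r in rows:
        groups.setdefault(r["source"], []).append(r["target"])
    return [{"source": k, "wishes": v} for k, v in groups.items()]


flat_rows = [
    {"source": 2001, "target": 5002, "createdAt": 500},
    {"source": 2001, "target": 5001, "createdAt": 400},
    {"source": 2000, "target": 5000, "createdAt": 300},
]
result = run_pipeline(flat_rows, [drop_created_at], group_by=collect_targets)
```

Here `is_flat(result)` is false, which is exactly why the grouping step cannot sit in the middle of the chain without a new array type.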

Proposed Design

Setup

hop1: SCAN follows (index: created_at_desc)

| source | target | createdAt |
|--------|--------|-----------|
| 1000 | 2001 | 200 |
| 1000 | 2000 | 100 |

hop2: CACHE wishlist (cache: recent_wishlist)

| source | target | createdAt |
|--------|--------|-----------|
| 2001 | 5002 | 500 |
| 2001 | 5001 | 400 |
| 2000 | 5000 | 300 |

Request

curl -X POST http://localhost:8080/graph/v3/query \
  -H 'Content-Type: application/json' \
  -d '{
    "query": [
      {
        "type": "SCAN",
        "name": "hop1",
        "database": "social",
        "table": "follows",
        "source": {"type": "VALUE", "value": [1000]},
        "direction": "OUT",
        "index": "created_at_desc"
      },
      {
        "type": "CACHE",
        "name": "hop2",
        "database": "commerce",
        "table": "wishlist",
        "source": {"type": "REF", "ref": "hop1", "field": "target"},
        "direction": "OUT",
        "cache": "recent_wishlist",
        "limit": 10,
        "include": true
      }
    ],
    "groupBy": {
      "target": "hop2",
      "keys": ["source"],
      "collect": [
        {"field": "target", "alias": "wishes"}
      ]
    }
  }'

| Field | Description |
|-------|-------------|
| target | The name of the query item to apply grouping to. The item must have include: true. |
| keys | List of field names to group by. |
| collect | Fields whose values are accumulated into per-group lists. |
| collect[].field | Column name to collect values from. |
| collect[].alias | Optional; defaults to field. Output name for the collected list. |
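The field rules above could be checked with something like the following sketch. It is only an illustration of the described semantics: `normalize_group_by` is a hypothetical helper, and treating a missing include as "not included" is an assumption rather than documented behavior.

```python
def normalize_group_by(query_items, group_by):
    """Validate a groupBy spec against the query items and fill alias defaults."""
    items = {item["name"]: item for item in query_items}
    target = items.get(group_by["target"])
    if target is None:
        raise ValueError(f"unknown groupBy target: {group_by['target']}")
    # Assumption: an item without an explicit include is treated as not included.
    if target.get("include") is not True:
        raise ValueError("groupBy target must have include: true")
    collect = [
        # alias is optional and defaults to the field name.
        {"field": c["field"], "alias": c.get("alias", c["field"])}
        for c in group_by.get("collect", [])
    ]
    return {
        "target": group_by["target"],
        "keys": list(group_by["keys"]),
        "collect": collect,
    }


query_items = [
    {"type": "SCAN", "name": "hop1"},
    {"type": "CACHE", "name": "hop2", "include": True},
]
spec = normalize_group_by(
    query_items,
    {"target": "hop2", "keys": ["source"], "collect": [{"field": "target"}]},
)
```

With no alias given, the collected list would be emitted under the original column name, here `target`.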

Response

{
  "items": [
    {
      "name": "hop2",
      "data": [
        {"source": 2001, "wishes": [5002, 5001]},
        {"source": 2000, "wishes": [5000]}
      ],
      "rows": 2,
      "stats": [],
      "offset": null,
      "hasNext": false
    }
  ]
}
  • offset and hasNext are not meaningful for grouped output, but NamedJsonFormat requires them, so the example returns them as null and false.
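As a sanity check on the intended semantics, the sketch below applies the example groupBy spec to the hop2 rows from the Setup section and rebuilds the response item. It assumes groups and collected values keep first-seen order and that rows counts groups rather than raw edges; both are inferred from the example, not confirmed behavior. `apply_group_by` is a hypothetical name.

```python
def apply_group_by(rows, keys, collect):
    """Group rows by the key fields and collect the requested fields into lists."""
    groups = {}  # dicts preserve insertion order, so groups stay in first-seen order
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key not in groups:
            group = {k: row[k] for k in keys}
            for c in collect:
                group[c.get("alias", c["field"])] = []
            groups[key] = group
        for c in collect:
            groups[key][c.get("alias", c["field"])].append(row[c["field"]])
    return list(groups.values())


hop2_rows = [
    {"source": 2001, "target": 5002, "createdAt": 500},
    {"source": 2001, "target": 5001, "createdAt": 400},
    {"source": 2000, "target": 5000, "createdAt": 300},
]
data = apply_group_by(hop2_rows, ["source"], [{"field": "target", "alias": "wishes"}])

# Assumption: rows reports the number of groups, as in the example response.
response_item = {
    "name": "hop2",
    "data": data,
    "rows": len(data),
    "stats": [],
    "offset": None,
    "hasNext": False,
}
```

Using a tuple of key values as the dictionary key lets the same function handle more than one entry in keys without changes.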

Feedback welcome on the schema shape and semantics before we start implementation.

Labels: enhancement (New feature or request)
