Add support for MSQ CLUSTERED BY expressions to be preserved in the segment shard spec as virtual columns by clintropolis · Pull Request #19061 · apache/druid

clintropolis · 2026-02-26T18:06:11Z

Description

changes:

* `ShardSpec` interface has a new method, `getDomainVirtualColumns`, to provide the virtual column information for pruning
* `DimensionRangeShardSpec` stores `VirtualColumns` in segment metadata so they can be compared to query expressions and be used for pruning
* `FilterSegmentPruner` is virtual column aware for segment pruning using the new methods
* `ClusterBy` now contains a map of key column to `VirtualColumn` alongside key columns, to support key columns being virtual columns
* `ControllerImpl` persists clustering virtual columns in compaction state in the transform spec
* `MSQCompactionRunner` handles virtual columns in order-by/cluster-by for compaction
* opportunistically tidied up several calls to `VirtualColumns.create` across the codebase, and relaxed the constraints on one of the methods to use `Collection` instead of `List` to provide more flexibility
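As a rough illustration of the pruning idea in the list above, here is a minimal, self-contained sketch. It is not the actual `FilterSegmentPruner` code: `VirtualCol`, `Shard`, `queryNameFor`, and `canPrune` are hypothetical stand-ins, virtual columns are reduced to a name plus an expression string, and the half-open `[start, end)` range is an assumption made for the example. The point it shows is that the shard's virtual column is matched to the query's by expression rather than by name.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-ins for illustration; none of these types are Druid classes. */
final class VirtualColumnPruningSketch
{
  /** A virtual column reduced to an output name plus the expression that defines it. */
  record VirtualCol(String name, String expression) {}

  /** A range shard: one clustering dimension, its virtual columns, and assumed [start, end) bounds. */
  record Shard(String dimension, Map<String, VirtualCol> virtualColumns, String start, String end) {}

  /** Finds the query-side column name that is equivalent to the shard's clustering dimension. */
  static Optional<String> queryNameFor(Shard shard, List<VirtualCol> queryVirtualColumns)
  {
    VirtualCol shardVc = shard.virtualColumns().get(shard.dimension());
    if (shardVc == null) {
      // Plain physical column: the query refers to it by the same name.
      return Optional.of(shard.dimension());
    }
    // Virtual column: match on the expression, since the query may name it anything.
    return queryVirtualColumns.stream()
                              .filter(vc -> vc.expression().equals(shardVc.expression()))
                              .map(VirtualCol::name)
                              .findFirst();
  }

  /** True if an equality filter "filterColumn == value" cannot match any row in the shard. */
  static boolean canPrune(Shard shard, List<VirtualCol> queryVcs, String filterColumn, String value)
  {
    Optional<String> name = queryNameFor(shard, queryVcs);
    if (name.isEmpty() || !name.get().equals(filterColumn)) {
      return false; // the filter is on something the shard knows nothing about; keep the segment
    }
    boolean belowStart = shard.start() != null && value.compareTo(shard.start()) < 0;
    boolean atOrAboveEnd = shard.end() != null && value.compareTo(shard.end()) >= 0;
    return belowStart || atOrAboveEnd;
  }

  public static void main(String[] args)
  {
    VirtualCol shardVc = new VirtualCol("v0", "concat(dim1, 'foo')");
    Shard shard = new Shard("v0", Map.of("v0", shardVc), "m", null);
    // Same expression as the shard's virtual column, but under a different name.
    List<VirtualCol> queryVcs = List.of(new VirtualCol("v10", "concat(dim1, 'foo')"));

    System.out.println(canPrune(shard, queryVcs, "v10", "aaa")); // true: "aaa" sorts before "m"
    System.out.println(canPrune(shard, queryVcs, "v10", "zzz")); // false: "zzz" falls inside the range
  }
}
```

Comparing expression strings is the simplest possible equivalence check; a real implementation would presumably compare the virtual column objects themselves.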


clintropolis · 2026-02-26T23:44:35Z

pom.xml

                    <failsOnError>true</failsOnError>
                    <excludes>
-                        *com/fasterxml/jackson/databind/*,**/NestedDataFormatsTest.java
+                        *com/fasterxml/jackson/databind/*,**/NestedDataFormatsTest.java,**/CompactionSupervisorTest.java,**/MultiStageQueryTest.java


i'm going to upgrade checkstyle in a follow-up so we can remove this stuff now that #18977 is merged

capistrant · 2026-02-26T23:57:50Z

I am wondering if we have to push up a minor refactor of the cascading reindexing stuff from #18939 to support this for that type of MSQ compaction supervisor? Not saying it needs to be in this PR. In fact it should be able to be done in parallel. The VCs in the tuning config rule that I talk about below would just be meaningless without the context of this PR

off the top of my head:

add a VirtualColumns virtualColumns field to the ReindexingTuningConfigRule. Then we'd have to change the ReindexingConfigBuilder to cleanly handle creating the underlying config. Right now the deletion rules just post withTransformSpec(...), but we'd have to either change the underlying InlineSchemaDataSourceCompactionConfig.Builder to incrementally build the transform spec, or accumulate it all in ReindexingConfigBuilder and post once with the VCs from both the deletion rules and the tuning config.

also we'd have to make sure the config optimizer doesn't nuke the VC for partitioning just because no dim filters in the deletes reference it.
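The concern about the optimizer can be sketched as follows. This is only an illustration under assumptions: `retainReferenced` is a hypothetical helper, not part of the reindexing config code, and virtual columns are modelled as a plain name-to-expression map. The idea is that a virtual column should survive if either a filter or a partitioning dimension references it.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration; not part of the Druid reindexing code. */
final class VirtualColumnRetentionSketch
{
  /**
   * Keeps only the virtual columns that something still references.
   * Keys are virtual column names, values are their expressions.
   */
  static Map<String, String> retainReferenced(
      Map<String, String> virtualColumns,
      Set<String> filterColumns,
      Set<String> partitionDimensions
  )
  {
    // A column is "referenced" if a filter OR the partitioning uses it.
    Set<String> referenced = new LinkedHashSet<>(filterColumns);
    referenced.addAll(partitionDimensions);

    Map<String, String> kept = new LinkedHashMap<>();
    for (Map.Entry<String, String> entry : virtualColumns.entrySet()) {
      if (referenced.contains(entry.getKey())) {
        kept.put(entry.getKey(), entry.getValue());
      }
    }
    return kept;
  }

  public static void main(String[] args)
  {
    Map<String, String> virtualColumns = new LinkedHashMap<>();
    virtualColumns.put("v0", "concat(dim1, 'foo')");
    virtualColumns.put("v1", "upper(dim2)");

    // No delete filter references v0, but the partitioning does, so v0 must be kept.
    Map<String, String> kept = retainReferenced(virtualColumns, Set.of(), Set.of("v0"));
    System.out.println(kept.keySet()); // [v0]
  }
}
```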

clintropolis · 2026-02-27T00:28:03Z

I am wondering if we have to push up a minor refactor of the cascading reindexing stuff from #18939 to support this for that type of MSQ compaction supervisor? Not saying it needs to be in this PR. In fact it should be able to be done in parallel. The VCs in the tuning config rule that I talk about below would just be meaningless without the context of this PR

Yeah, there is some work to do to support the reindexing stuff; I haven't decided yet what the best way to do it is. Part of me thinks it would be kind of nice if all of the partitioning stuff was part of the same config, like maybe a ReindexingPartitioningRule or something, so both segment granularity and the partition spec stuff are all in the same place, and also to try to move further away from directly matching the current compaction config stuff (tuning config in particular is a bit bloated imo, with a lot of stuff covering a variety of concerns). But I suppose the ReindexingTuningConfigRule approach would work too if we don't want to get too disruptive.

capistrant

I've got no objections if you want to merge this now. Regarding cascading reindexing template support, I hope to open a gh issue soon and will call out how we need to adapt that to work nicely with this before D37

gianm · 2026-03-05T21:50:20Z

processing/src/main/java/org/apache/druid/frame/key/ClusterBy.java

  @JsonCreator
  public ClusterBy(
      @JsonProperty("columns") List<KeyColumn> columns,
+      @JsonProperty("virtualColumnMap") @Nullable Map<String, VirtualColumn> virtualColumnMap,


Why does this need to be on the clusterBy? It seems to me like the wrong place to put it, since clusterBy is an MSQ framework concept and virtual columns are an ingestion & query concept.

I couldn't really find a clean way to get them to create shardspec or compaction state, while the clusterby is created from the query, so it seemed by far the easiest and least disruptive to just add additional information if those cluster keys were created from virtual columns by the query creating the object.

Am i missing some clean way i could get the virtual columns at the time we are translating the clusterby into shardspec/compaction state?

gianm · 2026-03-06T00:15:04Z

processing/src/main/java/org/apache/druid/timeline/partition/DimensionRangeShardSpec.java

  @JsonCreator
  public DimensionRangeShardSpec(
      @JsonProperty("dimensions") List<String> dimensions,
+      @JsonProperty("virtualColumns") @Nullable VirtualColumns virtualColumns,


Are there going to be issues with deserializing virtual columns on server types that haven't had to deal with them before (like the Coordinator)? I wonder if all expressions are registered there or if some modules have more narrow scopes.

Good question. I am not aware of any modules which conditionally register expressions or custom virtual column implementations, but I guess they could exist... All of the built-in expressions seemed fine at least, since everything has a macro table from the expression module, and using the json virtual column worked fine too.

imo if someone is trying to partition by something that makes the coordinator explode, then maybe we should fix that thing so that it doesn't explode and can load on the coordinator?

Yeah, fair enough.

gianm · 2026-03-06T00:19:58Z

multi-stage-query/src/main/java/org/apache/druid/msq/exec/ControllerImpl.java

+    } else {
+      transformSpec = new CompactionTransformSpec(
+          dataSchema.getTransformSpec().getFilter(),
+          VirtualColumns.create(clusterBy.getVirtualColumnMap().values())


Won't adding the virtual columns to the transformSpec make them become real columns? I don't think that's what we want.

Won't adding the virtual columns to the transformSpec make them become real columns? I don't think that's what we want.

I don't think so, depending on what you mean by real columns. Virtual columns being here on compaction transform config is a new MSQ compaction only thing that was added a couple of weeks ago for reindexing templates to support filters on virtual columns (for the deletion rules). Native compaction explodes if they exist.

I would agree that it is kind of an odd and confusing place to define virtual columns available for MSQ compaction, especially since there is no such thing on the actual TransformSpec. They were added there in that PR, I believe, because there wasn't really a better existing place on the compaction config.

The virtual columns here are added as part of building the synthetic virtual columns like time_floor and mv_to_array stuff, which are then passed into the build query methods which can add them as appropriate depending on how they are used, https://github.com/clintropolis/druid/blob/d5d63c753cf5c3216081ebd0dc9797deb8c72876/multi-stage-query/src/main/java/org/apache/druid/msq/indexing/MSQCompactionRunner.java#L276.

By real columns I mean actual physically stored columns. I didn't catch the change that recently added virtualColumns to CompactionTransformSpec. I suppose I assumed they worked the same way as transforms on the regular TransformSpec, in that they actually create columns. It would be good to have javadocs on CompactionTransformSpec that explain that the virtualColumns are just for use by the filter.

But I guess in this patch you're adding a new use for them? How does the new use work?

the new use allows them to be used as intermediary columns to aid in the sorting/clustering, but not saved in the final segment, basically the MSQ compaction equivalent of writing SQL replace queries like in this test https://github.com/apache/druid/pull/19061/changes#diff-207e886c7791d20d886d23425945203683100878b45eb070547d23ff9ed516deR172
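A minimal sketch of that "intermediary column" idea, with hypothetical names (`Row`, `clusterBy`) and plain Java collections standing in for the MSQ machinery: the rows are ordered by a computed key, but the key itself is never written into the output rows.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical illustration; not Druid code. */
final class ClusterByExpressionSketch
{
  /** The stored row shape: note that it has no field for the clustering expression. */
  record Row(String dim1, long metric) {}

  /** Sorts rows by a computed ("virtual") key without adding that key to the rows. */
  static List<Row> clusterBy(List<Row> rows, Function<Row, String> virtualKey)
  {
    return rows.stream()
               .sorted(Comparator.comparing(virtualKey))
               .collect(Collectors.toList());
  }

  public static void main(String[] args)
  {
    List<Row> rows = List.of(new Row("b", 1), new Row("C", 2), new Row("a", 3));

    // Cluster by lower(dim1); ordering by the raw dim1 would put "C" first instead.
    List<Row> clustered = clusterBy(rows, row -> row.dim1().toLowerCase(Locale.ROOT));

    for (Row row : clustered) {
      System.out.println(row.dim1() + " " + row.metric()); // prints "a 3", "b 1", "C 2"
    }
  }
}
```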


gianm · 2026-03-15T18:47:43Z

multi-stage-query/src/main/java/org/apache/druid/msq/indexing/MSQCompactionRunner.java

        .filters(dataSchema.getTransformSpec().getFilter())
+        .virtualColumns(VirtualColumns.create(inputColToVirtualCol.values()))
+        .columns(columns)
+        .columnTypes(rowSignatureWithOrderByBuilder.build().getColumnTypes())


columnTypes, columns, and virtualColumns appear twice in this list

oops, made a mistake resolving merge conflicts

gianm · 2026-03-15T18:48:33Z

...ry/src/main/java/org/apache/druid/msq/indexing/processor/SegmentGeneratorStageProcessor.java

  public SegmentGeneratorStageProcessor(
      @JsonProperty("dataSchema") final DataSchema dataSchema,
      @JsonProperty("columnMappings") final ColumnMappings columnMappings,
+      @JsonProperty("clusterByVirtualColumnsMappings") @Nullable final Map<String, VirtualColumn> clusterByVirtualColumnMappings,


Does not match getClusterByVirtualColumnMappings() (Columns vs Column). Please add a serde test.

fixed and added test
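To make the mismatch concrete, here is a small sketch that imitates a serialize/deserialize round trip with plain maps instead of Jackson. Everything here is a simplification: `propertyNameOf`, `write`, and `read` are hypothetical helpers, and the sketch assumes the written property name is derived bean-style from the getter while the read name comes from the constructor annotation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical round-trip imitation; it does not use Jackson. */
final class PropertyNameMismatchSketch
{
  /** Bean-style name for a getter: strip "get" and lower-case the first letter. */
  static String propertyNameOf(String getterName)
  {
    String bare = getterName.substring("get".length());
    return Character.toLowerCase(bare.charAt(0)) + bare.substring(1);
  }

  /** "Serializes" by writing the value under the getter-derived name. */
  static Map<String, Object> write(String getterName, Object value)
  {
    Map<String, Object> json = new HashMap<>();
    json.put(propertyNameOf(getterName), value);
    return json;
  }

  /** "Deserializes" by reading the name declared on the constructor parameter. */
  static Object read(Map<String, Object> json, String constructorPropertyName)
  {
    return json.get(constructorPropertyName);
  }

  public static void main(String[] args)
  {
    Map<String, Object> json = write("getClusterByVirtualColumnMappings", Map.of("v0", "expr"));

    // Constructor name with the extra "s": the value written by the getter is not found.
    System.out.println(read(json, "clusterByVirtualColumnsMappings")); // null
    // Matching name: the value survives the round trip.
    System.out.println(read(json, "clusterByVirtualColumnMappings")); // {v0=expr}
  }
}
```

A serde test that writes an object, reads it back, and compares the two would catch this kind of silent loss.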

gianm · 2026-03-15T18:52:04Z

...ry/src/main/java/org/apache/druid/msq/indexing/processor/SegmentGeneratorStageProcessor.java

 {
  private final DataSchema dataSchema;
  private final ColumnMappings columnMappings;
+  private final Map<String, VirtualColumn> clusterByVirtualColumnMappings;


This new field is missing from equals and hashCode. Please add an EqualsVerifier test.

This must not have mattered all that much, since DataSchema also didn't implement equals and hashCode. After adding it I see why it was skipped: it was kind of wonky with some lazy initialization of stuff. But between #19109 and #19166 we will soon be able to drop the parser stuff from it completely, so I went ahead and changed it to be eager for now and added EqualsVerifier tests for both DataSchema and SegmentGeneratorStageProcessor. I may revert this if it has too much trouble in CI, since there were quite a lot of failures just in DataSchemaTest.
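For reference, the shape of the fix being asked for looks roughly like the sketch below. `ProcessorSketch` is a hypothetical two-field stand-in, not the real `SegmentGeneratorStageProcessor`; the only point is that a newly added field has to take part in both `equals` and `hashCode`.

```java
import java.util.Map;
import java.util.Objects;

/** Hypothetical two-field stand-in for illustration; not the real processor class. */
final class ProcessorSketch
{
  private final String dataSchema;
  private final Map<String, String> clusterByVirtualColumnMappings;

  ProcessorSketch(String dataSchema, Map<String, String> clusterByVirtualColumnMappings)
  {
    this.dataSchema = dataSchema;
    this.clusterByVirtualColumnMappings = clusterByVirtualColumnMappings;
  }

  @Override
  public boolean equals(Object o)
  {
    if (this == o) {
      return true;
    }
    if (o == null || getClass() != o.getClass()) {
      return false;
    }
    ProcessorSketch that = (ProcessorSketch) o;
    // The newly added field must be compared too, or objects differing only in it look equal.
    return Objects.equals(dataSchema, that.dataSchema)
           && Objects.equals(clusterByVirtualColumnMappings, that.clusterByVirtualColumnMappings);
  }

  @Override
  public int hashCode()
  {
    // Keep hashCode consistent with equals by hashing the same set of fields.
    return Objects.hash(dataSchema, clusterByVirtualColumnMappings);
  }

  public static void main(String[] args)
  {
    ProcessorSketch a = new ProcessorSketch("schema", Map.of("v0", "concat(dim1, 'foo')"));
    ProcessorSketch b = new ProcessorSketch("schema", Map.of("v0", "concat(dim1, 'foo')"));
    ProcessorSketch c = new ProcessorSketch("schema", Map.of("v0", "upper(dim1)"));

    System.out.println(a.equals(b)); // true
    System.out.println(a.equals(c)); // false: only the virtual column mapping differs
  }
}
```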

gianm · 2026-03-15T18:54:36Z

processing/src/main/java/org/apache/druid/query/groupby/orderby/DefaultLimitSpec.java

          sortingNeeded = true;
          break;
        }
+        if (query.getVirtualColumns().getVirtualColumn(columnSpec.getDimension()) != null) {


This doesn't seem right. The OrderByColumnSpec refers to dimension and aggregator output names. Virtual columns would potentially contain the names of input fields to dimensions and aggregators, but wouldn't contain the output names. What was the check needed for?

oops, this was not meant to be here i think, and was some experiment i was doing much earlier on, removed

gianm · 2026-03-15T18:55:28Z

multi-stage-query/src/main/java/org/apache/druid/msq/exec/ControllerImpl.java

+
+    // if the clustered by requires virtual columns, preserve them here so that we can rebuild during compaction
+    CompactionTransformSpec transformSpec;
+    // this is true if we are in here


what is true if we are in here?

oops, a stale comment i didn't remove

gianm · 2026-03-15T18:57:43Z

processing/src/main/java/org/apache/druid/frame/key/ClusterBy.java


    // Key must be 100% sortable or 100% nonsortable. If empty, call it sortable.
    boolean sortable = true;
-


The changes in this file have become formatting-only, how about reverting it to match master?

gianm · 2026-03-15T18:58:44Z

processing/src/main/java/org/apache/druid/timeline/partition/DimensionRangeShardSpec.java

  }

+  @Override
+  public VirtualColumns getDomainVirtualColumns()


Javadoc please. I don't think it will be immediately obvious what this means.

gianm · 2026-03-15T19:21:36Z

processing/src/main/java/org/apache/druid/query/filter/FilterSegmentPruner.java

        List<String> dimensions = shard.getDomainDimensions();
        for (String dimension : dimensions) {
-          if (filterFields == null || filterFields.contains(dimension)) {
+          final VirtualColumn shardVirtualColumn = shard.getDomainVirtualColumns().getVirtualColumn(dimension);


Consider adding test cases that verify pruning still works if the query-time virtual column doesn't have the same name as the one in the shard spec. Maybe it's there but I didn't see one.


capistrant

Few minor test comments. app code looks good to me. Since Gian is deeper in the review, I think I will defer approval to him once he reviews your responses to his comments

capistrant · 2026-03-18T21:21:43Z

processing/src/test/java/org/apache/druid/query/filter/FilterSegmentPrunerTest.java

+    // same expression, different name
+    queryVirtualColumns = VirtualColumns.create(
+        new ExpressionVirtualColumn("v0", "concat(dim1, 'foo')", ColumnType.STRING, TestExprMacroTable.INSTANCE)
+    );
+    range_a = new RangeFilter("v0", ColumnType.STRING, null, "aaa", null, null, null);
+    prunerRange = new FilterSegmentPruner(range_a, null, queryVirtualColumns);
+    prunerEmptyFields = new FilterSegmentPruner(range_a, Collections.emptySet(), queryVirtualColumns);
+
+    Assertions.assertEquals(Set.of(seg1), prunerRange.prune(segs, Function.identity()));
+    Assertions.assertEquals(Set.copyOf(segs), prunerEmptyFields.prune(segs, Function.identity()));
+
+    // same expression, different name
+    queryVirtualColumns = VirtualColumns.create(
+        new ExpressionVirtualColumn("v10", "concat(dim1, 'foo')", ColumnType.STRING, TestExprMacroTable.INSTANCE)
+    );
+    range_a = new RangeFilter("v10", ColumnType.STRING, null, "aaa", null, null, null);
+    prunerRange = new FilterSegmentPruner(range_a, null, queryVirtualColumns);
+    prunerEmptyFields = new FilterSegmentPruner(range_a, Collections.emptySet(), queryVirtualColumns);
+
+    Assertions.assertEquals(Set.of(seg1), prunerRange.prune(segs, Function.identity()));
+    Assertions.assertEquals(Set.copyOf(segs), prunerEmptyFields.prune(segs, Function.identity()));


are these testing different things? they look the same to me, just different vc names with neither matching the shard vc

ah no they are not really different, removed this redundant one but added some additional stuff like mixing different names from the segments too and calling prune twice to ensure cache is hit

capistrant · 2026-03-18T21:29:53Z

...-tests/src/test/java/org/apache/druid/testing/embedded/compact/CompactionSupervisorTest.java

+                )
+            )
+            .withTuningConfig(
+                new UserCompactionTaskQueryTuningConfig(


a recent pr added a builder for this

capistrant · 2026-03-18T21:30:31Z

...-tests/src/test/java/org/apache/druid/testing/embedded/compact/CompactionSupervisorTest.java

+                )
+            )
+            .withTuningConfig(
+                new UserCompactionTaskQueryTuningConfig(


same builder nit

github-actions bot added the Area - Batch Ingestion, Area - Querying, Area - Segment Format and Ser/De, Area - Ingestion, and Area - MSQ (For multi stage queries - https://github.com/apache/druid/issues/12262) labels Feb 26, 2026

clintropolis force-pushed the clustered-by-virtual-columns branch from 3bf2b31 to 814b4ec February 26, 2026 18:13

fix lost line in refactor, suppress checkstyle for now (1712243)

github-actions bot added the Area - Dependencies label Feb 26, 2026

fixes (d5d63c7)

clintropolis commented Feb 26, 2026

capistrant approved these changes Mar 4, 2026

capistrant mentioned this pull request Mar 5, 2026: CascadingReindexingTemplate Production Readiness #19092 (Open)

gianm reviewed Mar 6, 2026

Merge remote-tracking branch 'upstream/master' into clustered-by-virtual-columns (c465492)

capistrant mentioned this pull request Mar 6, 2026: CascadingReindexingTemplate Refactoring #19106 (Open, 9 tasks)

clintropolis added the WIP label Mar 12, 2026

clintropolis added 5 commits March 11, 2026 21:36

Merge remote-tracking branch 'upstream/master' into clustered-by-virtual-columns (9fe4dd7)
rework stuff, virtual column map now lives on segment generator processor (e205419)
tidy up (c744f2d)
fix style (0aaaf61)
fix test (0a7dca0)

clintropolis removed the WIP label Mar 14, 2026

gianm reviewed Mar 15, 2026

capistrant self-requested a review March 16, 2026 20:56

clintropolis added 2 commits March 16, 2026 14:40

review stuff (dc7587c)
fix up test (e43a26b)

github-advanced-security bot found potential problems Mar 16, 2026 (server/src/test/java/org/apache/druid/segment/indexing/DataSchemaTest.java, fixed)

clintropolis mentioned this pull request Mar 18, 2026: allow Collection for VirtualColumns.create and tidy up callers #19174 (Merged)

clintropolis added 2 commits March 18, 2026 11:39

Merge remote-tracking branch 'upstream/master' into clustered-by-virtual-columns (33b4bd7)
restore some formatting on now unrelated files, fix unused imports (a9c0673)

capistrant reviewed Mar 18, 2026

gianm approved these changes Mar 18, 2026

clintropolis added 2 commits March 18, 2026 16:26

use builder, adjust test to more useful (583d2cb)
more better (614ba60)

