Skip to content

[core] Introduce 'pk-clustering-override' to clustering by non-primary key fields#7426

Open
JingsongLi wants to merge 3 commits intoapache:masterfrom
JingsongLi:pk-clustering-override
Open

[core] Introduce 'pk-clustering-override' to clustering by non-primary key fields#7426
JingsongLi wants to merge 3 commits intoapache:masterfrom
JingsongLi:pk-clustering-override

Conversation

@JingsongLi
Copy link
Contributor

@JingsongLi JingsongLi commented Mar 15, 2026

PK Clustering Override

By default, data files in a primary key table are physically sorted by the primary key. This is optimal for point
lookups but can hurt scan performance when queries filter on non-primary-key columns.

PK Clustering Override mode changes the physical sort order of data files from the primary key to user-specified
clustering columns. This significantly improves scan performance for queries that filter or group by clustering columns,
while still maintaining primary key uniqueness through deletion vectors.

Quick Start

CREATE TABLE my_table (
    id BIGINT,
    dt STRING,
    city STRING,
    amount DOUBLE,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'pk-clustering-override' = 'true',
    'clustering.columns' = 'city',
    'deletion-vectors.enabled' = 'true',
    'bucket' = '4'
);

After this, data files within each bucket will be physically sorted by city instead of id. Queries like
SELECT * FROM my_table WHERE city = 'Beijing' can skip irrelevant data files by checking their min/max statistics
on the clustering column.

How It Works

PK Clustering Override replaces the default LSM compaction with a two-phase clustering compaction:

Phase 1 — Sort by Clustering Columns: Newly flushed (level 0) files are read, sorted by the configured clustering
columns, and rewritten as sorted (level 1) files. A key index tracks each primary key's file and row position to
maintain uniqueness.

Phase 2 — Merge Overlapping Sections: Sorted files are grouped into sections based on clustering column range
overlap. Overlapping sections are merged together. Adjacent small sections are also consolidated to reduce file count
and IO amplification. Non-overlapping large files are left untouched.

During both phases, deduplication is handled via deletion vectors:

  • Deduplicate mode: When a key already exists in an older file, the old row is marked as deleted.
  • First-row mode: When a key already exists, the new row is marked as deleted, keeping the first-seen value.

When the number of files to merge exceeds sort-spill-threshold, smaller files are first spilled to row-based
temporary files to reduce memory consumption, preventing OOM during multi-way merge.

Requirements

Option Requirement
pk-clustering-override true
clustering.columns Must be set (one or more non-primary-key columns)
deletion-vectors.enabled Must be true
merge-engine deduplicate (default) or first-row only
sequence.fields Must not be set
record-level.expire-time Must not be set

Related Options

Option Default Description
clustering.columns (none) Comma-separated column names used as the physical sort order for data files.
sort-spill-threshold (auto) When the number of merge readers exceeds this value, smaller files are spilled to row-based temp files to reduce memory usage.
sort-spill-buffer-size 64 mb Buffer size used for external sort during Phase 1 rewrite.

When to Use

PK Clustering Override is beneficial when:

  • Analytical queries frequently filter or aggregate on non-primary-key columns (e.g., WHERE city = 'Beijing').
  • The table uses deduplicate or first-row merge engine.
  • You want data files physically co-located by a business dimension rather than the primary key.

It is not suitable when:

  • Point lookups by primary key are the dominant access pattern (default LSM sort is already optimal).
  • You need partial-update or aggregation merge engine.
  • sequence.fields or record-level.expire-time is required.

@JingsongLi JingsongLi changed the title [WIP][core] Introduce 'pk-clustering-override' to clustering by non-primar… [WIP][core] Introduce 'pk-clustering-override' to clustering by non-primary key fields Mar 15, 2026
@JingsongLi JingsongLi changed the title [WIP][core] Introduce 'pk-clustering-override' to clustering by non-primary key fields [core] Introduce 'pk-clustering-override' to clustering by non-primary key fields Mar 16, 2026
@JingsongLi JingsongLi requested a review from Copilot March 16, 2026 02:19
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Introduces a new PK Clustering Override mode for primary-key tables, allowing data files to be physically sorted by user-selected clustering columns (instead of the primary key) to improve scan performance on non-PK filters/aggregations, while maintaining PK uniqueness via deletion vectors.

Changes:

  • Adds pk-clustering-override core option, documentation, and schema validation adjustments.
  • Introduces a new clustering compaction path (manager + file tracking) and a dedicated clustering data-file writer that stores clustering min/max keys.
  • Improves temp-dir selection via IOManager.pickRandomTempDir() and updates related call sites; adds extensive integration-style tests for clustering behavior (including spill and first-row mode).

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
paimon-core/src/test/java/org/apache/paimon/separated/ClusteringTableTest.java Adds end-to-end tests covering deduplicate/first-row behavior, overlap section merging, and spill scenarios under pk clustering override.
paimon-core/src/main/java/org/apache/paimon/schema/SchemaValidation.java Adjusts DV validation to allow FIRST_ROW + deletion vectors when pk clustering override is enabled.
paimon-core/src/main/java/org/apache/paimon/operation/commit/ConflictDetection.java Skips PK key-range conflict checks when files are no longer clustered by PK.
paimon-core/src/main/java/org/apache/paimon/mergetree/compact/clustering/ClusteringFiles.java New flat file tracker separating unsorted (L0) vs sorted (L1) files for clustering compaction.
paimon-core/src/main/java/org/apache/paimon/mergetree/compact/clustering/ClusteringCompactManagerFactory.java New compaction manager factory validating required options for pk clustering override.
paimon-core/src/main/java/org/apache/paimon/mergetree/compact/clustering/ClusteringCompactManager.java Implements two-phase clustering compaction (rewrite L0 by clustering cols + section-based merging) with DV-backed dedup/first-row and spill support.
paimon-core/src/main/java/org/apache/paimon/mergetree/compact/KvCompactionManagerFactory.java Routes pk clustering override tables to the new clustering compaction manager factory.
paimon-core/src/main/java/org/apache/paimon/mergetree/Levels.java Refactors level grouping to reuse DataFileMeta.groupByLevel.
paimon-core/src/main/java/org/apache/paimon/io/KeyValueFileWriterFactory.java Adds createRollingClusteringFileWriter() to write clustering-sorted level-1 files.
paimon-core/src/main/java/org/apache/paimon/io/KeyValueFileReaderFactory.java Adds copyWithoutValue() builder variant for key-only reads used by key index bootstrap.
paimon-core/src/main/java/org/apache/paimon/io/KeyValueClusteringFileWriter.java New writer that records clustering-field min/max into DataFileMeta instead of PK min/max.
paimon-core/src/main/java/org/apache/paimon/io/DataFileMeta.java Adds groupByLevel(...) helper.
paimon-core/src/main/java/org/apache/paimon/disk/IOManagerImpl.java Implements pickRandomTempDir() convenience API.
paimon-core/src/main/java/org/apache/paimon/disk/IOManager.java Adds pickRandomTempDir() to the IOManager interface.
paimon-core/src/main/java/org/apache/paimon/crosspartition/GlobalIndexAssigner.java Switches temp-dir selection to ioManager.pickRandomTempDir().
paimon-core/src/main/java/org/apache/paimon/AbstractFileStore.java Wires pk clustering override flag into conflict detection construction.
paimon-common/src/test/java/org/apache/paimon/lookup/sort/db/SimpleLsmKvDbTest.java Updates tests for the new cache manager builder API.
paimon-common/src/main/java/org/apache/paimon/utils/VarLengthIntUtils.java Adds varint encode/decode helpers for InputStream/OutputStream.
paimon-common/src/main/java/org/apache/paimon/lookup/sort/db/SimpleLsmKvDb.java Replaces cacheSize(...) with cacheManager(...) and introduces a default cache manager if unset.
paimon-api/src/main/java/org/apache/paimon/CoreOptions.java Adds PK_CLUSTERING_OVERRIDE config option and accessor.
docs/layouts/shortcodes/generated/core_configuration.html Adds generated config doc entry for pk-clustering-override.
docs/content/primary-key-table/pk-clustering-override.md New user documentation page describing the feature, requirements, and related options.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants