CNDB-15946: Revamp SAI metrics for fetched/returned keys/partitions/rows/tombstones #2132

adelapena · 2025-11-19T13:16:38Z

What is the issue

The metrics partitionsRead, rowsFiltered, rowsPreFiltered and shadowedPrimaryKeyCount don't always work as expected. Details are in https://github.com/riptano/cndb/issues/15946.

What does this PR fix and why was it fixed

Replace those metrics with:

keysFetched: Number of partition/row keys that will be used to fetch rows from the base table.
partitionsFetched: Number of live partitions fetched from the base table, before post-filtering and sorting. Note that currently ANN fetches one partition per index key, without any grouping of same-partition clusterings.
partitionsReturned: Number of live partitions returned to the coordinator, after post-filtering and sorting.
partitionTombstonesFetched: Number of deleted partitions found when reading the base table.
rowsFetched: Number of live rows fetched from the base table, before post-filtering and sorting.
rowsReturned: Number of live rows returned to the coordinator, after post-filtering and sorting.
rowTombstonesFetched: Number of deleted rows or row ranges found when reading the base table.
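
As an illustration of the ANN note on partitionsFetched, here is a minimal sketch of the two ways of counting. The class `PartitionFetchCounting` and its methods are hypothetical names invented for this example; this is not the SAI code, and it only assumes that each index key carries the partition it belongs to.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the SAI implementation.
final class PartitionFetchCounting
{
    // One fetch per index key, with no grouping of same-partition clusterings:
    // every key counts as its own partition fetch.
    static int perKeyFetches(List<String> partitionOfEachKey)
    {
        return partitionOfEachKey.size();
    }

    // Grouping same-partition clusterings: one fetch per distinct partition.
    static int groupedFetches(List<String> partitionOfEachKey)
    {
        Set<String> distinctPartitions = new HashSet<>(partitionOfEachKey);
        return distinctPartitions.size();
    }
}
```

With four index keys where three share a partition, the per-key count is 4 while the grouped count is 2, which is why partitionsFetched can look higher than the number of distinct partitions for ANN queries.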

StorageAttachedIndexSearcher is modified to use the command timestamp, rather than getting the current time every time a query timestamp is needed, which was possibly buggy and inefficient.
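
The following sketch shows one way that repeated clock reads could give inconsistent answers within a single query, for example when checking whether an expiring cell is still live. The class `QueryTimestampSketch`, its methods, and the expiration scenario are assumptions made for illustration; they are not taken from StorageAttachedIndexSearcher.

```java
import java.util.function.IntSupplier;

// Hypothetical illustration only; not the StorageAttachedIndexSearcher code.
final class QueryTimestampSketch
{
    // True if a cell that expires at expiresAtSec is still live at nowInSec.
    static boolean isLive(int expiresAtSec, int nowInSec)
    {
        return nowInSec < expiresAtSec;
    }

    // Reads the clock on every check, so two checks made by the same query
    // can disagree if the clock advances between them.
    static boolean[] checkTwiceWithClock(int expiresAtSec, IntSupplier clock)
    {
        boolean first = isLive(expiresAtSec, clock.getAsInt());
        boolean second = isLive(expiresAtSec, clock.getAsInt());
        return new boolean[]{ first, second };
    }

    // Uses a single timestamp captured once for the whole command,
    // so every check in the query sees the same notion of "now".
    static boolean[] checkTwiceWithCommandTimestamp(int expiresAtSec, int commandNowInSec)
    {
        boolean first = isLive(expiresAtSec, commandNowInSec);
        boolean second = isLive(expiresAtSec, commandNowInSec);
        return new boolean[]{ first, second };
    }
}
```

Passing one captured timestamp also avoids a clock read per check, which matches the efficiency concern mentioned above.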

Here are some examples of how to interpret these metrics for the AA partition-aware disk format:

keysFetched > partitionsFetched: The base table data has changed since it was indexed, and there have been deletions or updates.
partitionsFetched < rowsFetched: The partitions are wide. If the difference is large, queries with high within-partition selectivity might behave poorly compared to later disk formats, and vice versa.
rowsFetched > rowsReturned: The queries have some ALLOW FILTERING restriction or the partition-aware nature of the index is producing filtering.

And for row-aware disk formats later than AA:

keysFetched > rowsFetched: The base table data has changed since it was indexed, and there have been deletions or updates.
partitionsFetched < rowsFetched: The partitions are wide. If the difference is large, queries with low within-partition selectivity might behave worse than with AA, and vice versa.
rowsFetched > rowsReturned: The queries have some ALLOW FILTERING restriction that is producing filtering.
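
The rules of thumb above can be written down as a small helper. This is only a sketch: `SaiMetricsHints`, its `interpret` method, and the hint strings are hypothetical names, and the code does not account for the ANN behaviour of partitionsFetched.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these hints restate the interpretation rules above.
final class SaiMetricsHints
{
    static List<String> interpret(boolean rowAware,
                                  long keysFetched,
                                  long partitionsFetched,
                                  long rowsFetched,
                                  long rowsReturned)
    {
        List<String> hints = new ArrayList<>();

        // AA indexes partitions, later formats index rows, so the fetched
        // unit compared against keysFetched depends on the disk format.
        long fetchedUnits = rowAware ? rowsFetched : partitionsFetched;
        if (keysFetched > fetchedUnits)
            hints.add("stale index entries");

        if (partitionsFetched < rowsFetched)
            hints.add("wide partitions");

        if (rowsFetched > rowsReturned)
            hints.add("post-filtering");

        return hints;
    }
}
```

For example, `interpret(false, 10, 8, 40, 40)` reports stale index entries and wide partitions for an AA index, while `interpret(true, 10, 2, 10, 7)` reports wide partitions and post-filtering for a row-aware index.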

github-actions · 2025-11-19T13:16:52Z

adelapena · 2025-11-20T15:12:01Z

The three test failures above are junit timeouts, probably unrelated since those tests take quite long in main and we are having trouble with timeouts.

ekaterinadimitrova2

I still need to finish reviewing some of the tests. You added so many, that's great! But I wanted to push my first comments and questions for consideration. I am a bit nervous about the changes to the transformations (I haven't done a lot in that area of the code), so I am trying to reason very carefully about all the cases.

src/java/org/apache/cassandra/index/sai/plan/StorageAttachedIndexSearcher.java

src/java/org/apache/cassandra/index/sai/plan/CountFetchedRowsTransformation.java

src/java/org/apache/cassandra/index/sai/plan/CountFetchedTransformation.java

src/java/org/apache/cassandra/index/sai/plan/CountFetchedRowsTransformation.java

pkolaczk

Great job! Those new metric names make a lot more sense.
Much more readable and complete than before!
Thank you! :)

adelapena · 2025-11-24T17:24:11Z

Looks like the only new CI failures are unrelated timeouts.

ekaterinadimitrova2

Between QueryContextTest and SlowSAIQueryLoggerTest, I was wondering what other scenarios we may want to add. The only thing I can think of are collections and TTL. WDYT? Does it make sense to add a few small tests?

Also, I think it is quite surprising, given the current documentation and naming, how partitionsFetched works with ANN queries. It is confusing for users. I guess changing the metric for ANN queries is hard, but then should we change the name or document it better? Thoughts?

CC @pkolaczk

The rest looks great!

src/java/org/apache/cassandra/index/sai/plan/CountFetchedTransformation.java

test/unit/org/apache/cassandra/index/sai/QueryContextTest.java

adelapena · 2025-11-25T13:50:37Z

> Between QueryContextTest and SlowSAIQueryLoggerTest, I was wondering what other scenarios we may want to add. The only thing I can think of are collections and TTL. WDYT? Does it make sense to add a few small tests?

I don't see a lot of value, but why not. I have added tests to QueryContextTest only, because SlowSAIQueryLoggerTest and QueryMetricsTest only need to check that the QueryContext metrics get published in logs and query metrics. There is no need to test there again the various ways in which those metrics can get their values.

ekaterinadimitrova2

Great job fixing and testing these metrics, thanks.
I see CI errored out, so the merge is pending on clean CI and approval of the other PR.

…ows/tombstones

The metrics partitionsRead, rowsFiltered, rowsPreFiltered and shadowedPrimaryKeyCount, which didn't always work as expected, are replaced by these metrics:
* keysFetched: Number of partition/row keys that will be used to fetch rows from the base table.
* partitionsFetched: Number of live partitions fetched from the base table, before post-filtering and sorting. Note that currently ANN fetches one partition per index key, without any grouping of same-partition clusterings.
* partitionsReturned: Number of live partitions returned to the coordinator, after post-filtering and sorting.
* partitionTombstonesFetched: Number of deleted partitions found when reading the base table.
* rowsFetched: Number of live rows fetched from the base table, before post-filtering and sorting.
* rowsReturned: Number of live rows returned to the coordinator, after post-filtering and sorting.
* rowTombstonesFetched: Number of deleted rows or row ranges found when reading the base table.

StorageAttachedIndexSearcher is modified to use the command timestamp, rather than getting the current time every time a query timestamp is needed, which was possibly buggy and inefficient.

sonarqubecloud · 2025-11-26T15:50:35Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
97.7% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

cassci-bot · 2025-11-26T15:54:55Z

❌ Build ds-cassandra-pr-gate/PR-2132 rejected by Butler

2 regressions found
See build details here

Found 2 new test failures

Test	Explanation	Runs	Upstream
o.a.c.index.sai.cql.VectorCompaction100dTest.testCompactionWithEnoughRowsForPQAndDeleteARow[eb false]	REGRESSION	🔴⚪	0 / 18
o.a.c.index.sai.cql.VectorSiftSmallTest.testSiftSmall[dc false]	REGRESSION	🔴⚪	0 / 18

Found 4 known test failures

adelapena · 2025-11-26T16:28:41Z

The only new test failures are unrelated junit timeouts, so CI looks good to me.

adelapena self-assigned this Nov 19, 2025

adelapena force-pushed the CNDB-15946-main-row-metrics branch from 457f399 to 2b837e7 Compare November 19, 2025 15:38

k-rus mentioned this pull request Nov 20, 2025

CNDB-15608 fix lint issues in affected files #2131

Merged

ekaterinadimitrova2 reviewed Nov 21, 2025


adelapena force-pushed the CNDB-15946-main-row-metrics branch from 2b837e7 to 69e1d77 Compare November 21, 2025 13:57

pkolaczk approved these changes Nov 21, 2025


ekaterinadimitrova2 reviewed Nov 25, 2025

src/java/org/apache/cassandra/index/sai/plan/CountFetchedTransformation.java

test/unit/org/apache/cassandra/index/sai/QueryContextTest.java

ekaterinadimitrova2 approved these changes Nov 25, 2025


adelapena force-pushed the CNDB-15946-main-row-metrics branch from 5ef6014 to 6affe01 Compare November 26, 2025 10:28

Empty-Commit to trigger CI

ae53699

adelapena merged commit 4fc5df6 into main Nov 26, 2025
479 of 498 checks passed

adelapena deleted the CNDB-15946-main-row-metrics branch November 26, 2025 16:29

5 participants
