Skip to content

Conversation

@adelapena
Copy link

What is the issue

The slow query logger is one of the most useful tools when dealing with performance-related incidents. Currently it simply prints the CQL representation of the slow query and its running time. So while it can allow us to identify problematic SAI queries, we have to go to the aggregated metrics where it's difficult to see what matches the problematic query.

I think it would be useful if the slow query logger also printed the SAI query metrics and plan. That way we could examine log records at the time of the incident to see what's going on with individual slow queries, without needing to extract conclusions from aggregated metrics.

What does this PR fix and why was it fixed

This PR adds SAI-specific information to the log messages produced when slow queries are detected. That information consists on the metrics in the QueryContext and the tree view of the query Plan. There is a cassandra.cassandra.sai.slow_query_log.details_enabled property to disable this extended logging.

This is how slow queries are currently logged:

DEBUG [node1_isolatedExecutor:1] node1 2025-10-09 11:14:57,354 MonitoringTask.java:175 - 2 operations were slow in the last 131 msecs:
<SELECT s FROM distributed_test_keyspace.t WHERE (s = s_2 OR s = s_3) AND  ALLOW FILTERING>, time 107 msec - slow timeout 100 msec
<SELECT n FROM distributed_test_keyspace.t WHERE n > 1 AND  ALLOW FILTERING>, was slow 2 times: avg/min/max 104/104/105 msec - slow timeout 100 msec

And this is how they would look like with SAI details:

DEBUG [node1_isolatedExecutor:1] node1 2025-10-09 11:14:57,354 MonitoringTask.java:175 - 2 operations were slow in the last 131 msecs:
<SELECT * FROM distributed_test_keyspace.t WHERE (s = s_2 OR s = s_3) AND  ALLOW FILTERING>, time 111 msec - slow timeout 100 msec
  SAI slow query metrics:
    sstablesHit: 2
    segmentsHit: 2
    partitionsRead: 2
    rowsFiltered: 2
    rowsPreFiltered: 0
    trieSegmentsHit: 2
    bkdPostingListsHit: 0
    bkdSegmentsHit: 0
    bkdPostingsSkips: 0
    bkdPostingsDecodes: 0
    triePostingsSkips: 0
    triePostingsDecodes: 2
    annGraphSearchLatencyNanos: 0
    shadowedPrimaryKeyCount: 0
  SAI slow query plan:
    Limit 2147483647 (rows: 1.8, cost/row: 101.5, cost: 3000.0..3177.6)
     └─ Filter (s = s_2 OR s = s_3) (sel: 1.000000000) (rows: 1.8, cost/row: 101.5, cost: 3000.0..3177.6)
         └─ Fetch (rows: 1.8, cost/row: 101.5, cost: 3000.0..3177.6)
             └─ Union (keys: 1.8, cost/key: 0.1, cost: 3000.0..3000.2)
                 ├─ LiteralIndexScan of t_s_idx (sel: 0.250000000, step: 1.0) (keys: 1.0, cost/key: 0.1, cost: 1500.0..1500.1)
                 │  predicate: Expression{name: s, op: EQ, lower: (s_2, true), upper: (s_2, true), exclusions: []}
                 └─ LiteralIndexScan of t_s_idx (sel: 0.250000000, step: 1.0) (keys: 1.0, cost/key: 0.1, cost: 1500.0..1500.1)
                    predicate: Expression{name: s, op: EQ, lower: (s_3, true), upper: (s_3, true), exclusions: []}
<SELECT * FROM distributed_test_keyspace.t WHERE n > 1 AND  ALLOW FILTERING>, was slow 2 times: avg/min/max 258/103/413 msec - slow timeout 100 msec
  SAI slowest query metrics:
    sstablesHit: 3
    segmentsHit: 3
    partitionsRead: 3
    rowsFiltered: 6
    rowsPreFiltered: 0
    trieSegmentsHit: 0
    bkdPostingListsHit: 3
    bkdSegmentsHit: 3
    bkdPostingsSkips: 0
    bkdPostingsDecodes: 7
    triePostingsSkips: 0
    triePostingsDecodes: 0
    annGraphSearchLatencyNanos: 0
    shadowedPrimaryKeyCount: 0
  SAI slowest query plan:
    Limit 2147483647 (rows: 6.0, cost/row: 101.1, cost: 4500.0..5106.8)
     └─ Filter n > 1 (sel: 1.000000000) (rows: 6.0, cost/row: 101.1, cost: 4500.0..5106.8)
         └─ Fetch (rows: 6.0, cost/row: 101.1, cost: 4500.0..5106.8)
             └─ NumericIndexScan of t_n_idx (sel: 0.857142857, step: 1.0) (keys: 6.0, cost/key: 0.1, cost: 4500.0..4500.6)
                predicate: Expression{name: n, op: RANGE, lower: (1, false), upper: (null, false), exclusions: []}

Note that the queries are printed as invalid CQL and include user data. That is being dealt with in CNDB-15280, which is under review.

@adelapena adelapena self-assigned this Oct 9, 2025
@github-actions
Copy link

github-actions bot commented Oct 9, 2025

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

Copy link

@ekaterinadimitrova2 ekaterinadimitrova2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did first round of review and left some questions and suggestions, mostly trivial.
I am wondering how good of an idea is to have to restart a node to disable it if we sense too much noise or performance degradation once enabled?

Sonar didn't run but we can check it with the next CI run, it didn't prevent me from doing initial review.

public Index.Searcher indexSearcher()
{
return indexQueryPlan == null ? null : indexQueryPlan.searcherFor(this);
if (indexQueryPlan == null)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not see tests that show that things look ok if we have queries involving non-sai indexes. Can you add some, please?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, and also for regular queries.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems solar is still complaining on the test coverage for this one and the next check. Can you take a look, please?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added an artificial test for the sake of increasing coverage from the current 96.6%.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why artificial? I see queries and real test case?
for the sake of increasing coverage from the current 96.6%.
This was not about the number at all. There are valid cases as unreachable checks which would exhibit a bug, but this wasn't one of them. Thanks for the added test.

Copy link

@ekaterinadimitrova2 ekaterinadimitrova2 Oct 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw by looking into the new test - this was probably just missing queries in the already existing test which would have hit those paths. As this is about testing the slow query logger, they should be checking probably whether we log or not, as expected.

@adelapena
Copy link
Author

I am wondering how good of an idea is to have to restart a node to disable it if we sense too much noise or performance degradation once enabled?

We don't need to restart the node with the current approach. The system property that toggles logging SAI execution details is read every time we detect a slow query, and the system property can be changed at any time. Relevant bits here:

public Monitorable.ExecutionInfo monitorableExecutionInfo()
{
return CassandraRelevantProperties.SAI_SLOW_QUERY_LOG_EXECUTION_INFO_ENABLED.getBoolean()
? new QueryMonitorableExecutionInfo(queryContext, controller.buildPlan())
: Monitorable.ExecutionInfo.EMPTY;
}

Copy link

@ekaterinadimitrova2 ekaterinadimitrova2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are almost ready.
A few small comments plus complaint on test coverage from the tool. I feel it is worth to check it as even before the tool I mentioned we have to test those if clauses. Thanks
The test failures seem to be known to me flakies not related to what we do here.

About the below comment

I am wondering how good of an idea is to have to restart a node to disable it if we sense too much noise or performance degradation once enabled?

We don't need to restart the node with the current approach. The system property that toggles logging SAI execution details is read every time we detect a slow query, and the system property can be changed at any time. Relevant bits here:

I am clear how we check it to decide whether to log or not, but how do you set that property in runtime? I don't see any setter or anything? What did I miss?

public Index.Searcher indexSearcher()
{
return indexQueryPlan == null ? null : indexQueryPlan.searcherFor(this);
if (indexQueryPlan == null)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems solar is still complaining on the test coverage for this one and the next check. Can you take a look, please?

@adelapena
Copy link
Author

I am clear how we check it to decide whether to log or not, but how do you set that property in runtime? I don't see any setter or anything? What did I miss?

You didn't miss anything, I did. I have added a new mbean just for SAI and added the new config property there. Hopefully we can incrementally keep adding other similar past and future properties to it.

I have named it StorageAttachedIndexConfig because of a lack of imagination, but any suggestions are welcome.

@adelapena
Copy link
Author

After this Slack discussion, I'm removing the SAI mbean since we have other ways to manipulate CassandraRelevantProperties.

@ekaterinadimitrova2 ekaterinadimitrova2 changed the title CNDB-15260: Add SAI-specific details to slow query logger CNDB-15260: Add SAI-specific execution info to slow query logger Oct 24, 2025
@sonarqubecloud
Copy link

@cassci-bot
Copy link

❌ Build ds-cassandra-pr-gate/PR-2044 rejected by Butler


2 regressions found
See build details here


Found 2 new test failures

Test Explanation Runs Upstream
o.a.c.db.commitlog.BatchCommitLogTest.testBatchCLSyncImmediately[6] (compression) REGRESSION 🔴🔵 0 / 12
o.a.c.index.sai.cql.LuceneAnalyzerTest.testStopFilterNoFormat (compression) REGRESSION 🔴🔵 0 / 12

Found 3 known test failures

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants