-
Couldn't load subscription status.
- Fork 21
CNDB-15260: Add SAI-specific execution info to slow query logger #2044
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Checklist before you submit for review
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did first round of review and left some questions and suggestions, mostly trivial.
I am wondering how good of an idea is to have to restart a node to disable it if we sense too much noise or performance degradation once enabled?
Sonar didn't run but we can check it with the next CI run, it didn't prevent me from doing initial review.
src/java/org/apache/cassandra/config/CassandraRelevantProperties.java
Outdated
Show resolved
Hide resolved
src/java/org/apache/cassandra/index/sai/plan/QueryMonitorableDetails.java
Outdated
Show resolved
Hide resolved
src/java/org/apache/cassandra/index/sai/plan/StorageAttachedIndexSearcher.java
Outdated
Show resolved
Hide resolved
test/distributed/org/apache/cassandra/distributed/test/sai/SlowSAIQueryLoggerTest.java
Outdated
Show resolved
Hide resolved
src/java/org/apache/cassandra/config/CassandraRelevantProperties.java
Outdated
Show resolved
Hide resolved
| public Index.Searcher indexSearcher() | ||
| { | ||
| return indexQueryPlan == null ? null : indexQueryPlan.searcherFor(this); | ||
| if (indexQueryPlan == null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not see tests that show that things look ok if we have queries involving non-sai indexes. Can you add some, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added, and also for regular queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems solar is still complaining on the test coverage for this one and the next check. Can you take a look, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added an artificial test for the sake of increasing coverage from the current 96.6%.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why artificial? I see queries and real test case?
for the sake of increasing coverage from the current 96.6%.
This was not about the number at all. There are valid cases as unreachable checks which would exhibit a bug, but this wasn't one of them. Thanks for the added test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw by looking into the new test - this was probably just missing queries in the already existing test which would have hit those paths. As this is about testing the slow query logger, they should be checking probably whether we log or not, as expected.
d742940 to
5b0a230
Compare
We don't need to restart the node with the current approach. The system property that toggles logging SAI execution details is read every time we detect a slow query, and the system property can be changed at any time. Relevant bits here: cassandra/src/java/org/apache/cassandra/index/sai/plan/StorageAttachedIndexSearcher.java Lines 205 to 210 in 5c33fd2
|
5c33fd2 to
13bf6ad
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are almost ready.
A few small comments plus complaint on test coverage from the tool. I feel it is worth to check it as even before the tool I mentioned we have to test those if clauses. Thanks
The test failures seem to be known to me flakies not related to what we do here.
About the below comment
I am wondering how good of an idea is to have to restart a node to disable it if we sense too much noise or performance degradation once enabled?
We don't need to restart the node with the current approach. The system property that toggles logging SAI execution details is read every time we detect a slow query, and the system property can be changed at any time. Relevant bits here:
I am clear how we check it to decide whether to log or not, but how do you set that property in runtime? I don't see any setter or anything? What did I miss?
src/java/org/apache/cassandra/db/monitoring/MonitoringTask.java
Outdated
Show resolved
Hide resolved
test/distributed/org/apache/cassandra/distributed/test/sai/SlowSAIQueryLoggerTest.java
Show resolved
Hide resolved
| public Index.Searcher indexSearcher() | ||
| { | ||
| return indexQueryPlan == null ? null : indexQueryPlan.searcherFor(this); | ||
| if (indexQueryPlan == null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems solar is still complaining on the test coverage for this one and the next check. Can you take a look, please?
You didn't miss anything, I did. I have added a new mbean just for SAI and added the new config property there. Hopefully we can incrementally keep adding other similar past and future properties to it. I have named it |
|
After this Slack discussion, I'm removing the SAI mbean since we have other ways to manipulate |
b47f01b to
b271894
Compare
Adds the metrics in IndexContext and the tree view of the query's Plan to the log messages produced for slow SAI queries.
…tails Adds a new org.apache.cassandra.index.sai:type=StorageAttachedIndexConfig mbean
…query details" This reverts commit b793093.
36077f6 to
cb8ed72
Compare
|
❌ Build ds-cassandra-pr-gate/PR-2044 rejected by Butler2 regressions found Found 2 new test failures
Found 3 known test failures |



What is the issue
The slow query logger is one of the most useful tools when dealing with performance-related incidents. Currently it simply prints the CQL representation of the slow query and its running time. So while it can allow us to identify problematic SAI queries, we have to go to the aggregated metrics where it's difficult to see what matches the problematic query.
I think it would be useful if the slow query logger also printed the SAI query metrics and plan. That way we could examine log records at the time of the incident to see what's going on with individual slow queries, without needing to extract conclusions from aggregated metrics.
What does this PR fix and why was it fixed
This PR adds SAI-specific information to the log messages produced when slow queries are detected. That information consists on the metrics in the
QueryContextand the tree view of the queryPlan. There is acassandra.cassandra.sai.slow_query_log.details_enabledproperty to disable this extended logging.This is how slow queries are currently logged:
And this is how they would look like with SAI details:
Note that the queries are printed as invalid CQL and include user data. That is being dealt with in CNDB-15280, which is under review.