CNDB-15280: Remove user data from AbstractReadQuery.toCQLString #2038

adelapena · 2025-10-07T10:18:34Z

The method AbstractReadQuery.toCQLString prints commands as CQL queries, including any column values. This includes the queried values in the WHERE clause of a SELECT statement or the written values in INSERT and UPDATE statements. This method is used at least by the slow query logger, which prints user data into the logs.

This PR modifies AbstractReadQuery.toCQLString so it doesn't include column values. There is a boolean flag to opt out of redaction, since seeing the queried values can be useful while debugging.
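As a rough illustration of the idea, here is a minimal standalone sketch of rendering a query with and without redaction. It is not the code in this PR: the class `RedactedCqlSketch`, its `toCQLString(table, restrictions, redact)` signature, and the map-based restrictions are all hypothetical stand-ins for the internal read command structures.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Illustrative only: a hypothetical stand-in, not the Cassandra implementation. */
final class RedactedCqlSketch
{
    /**
     * Renders a SELECT with equality restrictions. When {@code redact} is true,
     * every column value is replaced by '?', so no user data reaches the output.
     */
    static String toCQLString(String table, Map<String, Object> restrictions, boolean redact)
    {
        StringJoiner where = new StringJoiner(" AND ");
        for (Map.Entry<String, Object> e : restrictions.entrySet())
        {
            String value = redact ? "?" : format(e.getValue());
            where.add(e.getKey() + " = " + value);
        }
        String cql = "SELECT * FROM " + table;
        return restrictions.isEmpty() ? cql : cql + " WHERE " + where;
    }

    private static String format(Object value)
    {
        // Quote text values, doubling embedded single quotes as CQL string literals do
        if (value instanceof String)
            return "'" + ((String) value).replace("'", "''") + "'";
        return String.valueOf(value);
    }

    public static void main(String[] args)
    {
        Map<String, Object> restrictions = new LinkedHashMap<>();
        restrictions.put("k", 42);
        restrictions.put("name", "alice");
        // Redacted form, suitable for logs
        System.out.println(toCQLString("ks.t", restrictions, true));
        // Unredacted form, for debugging only
        System.out.println(toCQLString("ks.t", restrictions, false));
    }
}
```

With redaction on, the sketch renders `SELECT * FROM ks.t WHERE k = ? AND name = ?`. The column names and the shape of the query stay visible, which is usually enough to identify a slow query without exposing the values.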

The criteria for what should be redacted are:

* Needs redaction: Messages that go to external monitoring systems, such as JMX, diagnostic events, etc.
* Doesn't need redaction: User-facing exceptions such as InvalidRequestException, query tracing (Tracing.trace) and generic Object#toString() methods.
* Ideally should use redaction: Things printed in logs. We treat logs as sensitive data and there is plenty of user data that is printed there. I think we should gradually move towards logs free of user data, and this PR does that for AbstractReadQuery.toCQLString, which is used for example by the slow query logger. However, there are still plenty of other things that print user data, for example partition keys. Discussion here: https://datastax.slack.com/archives/C05LHP4HX5J/p1757687570882049?thread_ts=1757533116.788859&cid=C05LHP4HX5J

At the reviewer's request, this PR adds redaction separately, on top of the closely related changes to the toCQLString methods made in this other PR. That PR originally combined both changes in separate commits, and it had already received multiple review comments on changes that are now in this PR.

github-actions · 2025-10-07T10:18:50Z

sonarqubecloud · 2025-10-15T16:51:42Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
88.9% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

… current() depending on a keyspace (#2041) There is a new cassandra.sai.version.selector.class system property that allows providing an implementation of the o.a.c.index.sai.disk.format.Version.Selector interface to specify which version of the SAI on-disk index format should be used for each keyspace.
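A minimal sketch of how a selector class named by a system property might be loaded and consulted per keyspace. Only the property name comes from the commit message above. The `Selector` interface here, its `versionFor` method, the two implementations, and the version strings are hypothetical placeholders, not the actual `Version.Selector` API.

```java
/** Illustrative only: hypothetical stand-ins, not the SAI Version.Selector API. */
final class SelectorLoadingSketch
{
    // Property name taken from the commit message
    static final String PROPERTY = "cassandra.sai.version.selector.class";

    /** Hypothetical selector: maps a keyspace name to an index format version label. */
    interface Selector
    {
        String versionFor(String keyspace);
    }

    /** Used when the property is not set: same version for every keyspace. */
    static final class DefaultSelector implements Selector
    {
        @Override
        public String versionFor(String keyspace)
        {
            return "default-version";
        }
    }

    /** Example custom selector: one keyspace is pinned to a different version. */
    static final class PerKeyspaceSelector implements Selector
    {
        @Override
        public String versionFor(String keyspace)
        {
            return "legacy_ks".equals(keyspace) ? "legacy-version" : "default-version";
        }
    }

    /** Instantiates the class named by the system property, or falls back to the default. */
    static Selector load()
    {
        String className = System.getProperty(PROPERTY);
        if (className == null)
            return new DefaultSelector();
        try
        {
            return Class.forName(className)
                        .asSubclass(Selector.class)
                        .getDeclaredConstructor()
                        .newInstance();
        }
        catch (ReflectiveOperationException e)
        {
            throw new IllegalStateException("Cannot instantiate selector " + className, e);
        }
    }

    public static void main(String[] args)
    {
        System.setProperty(PROPERTY, PerKeyspaceSelector.class.getName());
        Selector selector = load();
        System.out.println(selector.versionFor("legacy_ks"));
        System.out.println(selector.versionFor("other_ks"));
    }
}
```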

### What is the issue
...
We need that knowledge for CNDB

### What does this PR fix and why was it fixed
...
It exposes `containsDateRangeTypeColumn` methods

---------

Co-authored-by: Massimiliano Tomassi <max.tomassi@datastax.com>

… CA (#2071) Creating vector indexes if version is earlier than CA would usually fail in the asynchronous build. This patch makes them fail synchronously at CREATE INDEX depending on the local index version. If the local node has the right version but any of the remotes doesn't, the failure will remain asynchronous.

…es (#2066) When row-aware and non-row-aware indexes are mixed, we now check the clustering index filter for all the keys that have clustering information, i.e. keys coming from the row-aware indexes. Earlier that check was accidentally disabled if at least one non-row-aware index was used by the query. That could cause retrieving rows that do not match the clustering condition of the query.

### What is the issue
Fixes: riptano/cndb#15640

### What does this PR fix and why was it fixed
In order to lay the groundwork for Fused ADC, I want to refactor some of the PQ/BQ logic. The unit length computation needs to move, so I decided to move it out to its own PR.

The core idea is that:
* some models are documented to provide unit length vectors, and in those cases, we should skip the computational check
* otherwise, we should check at runtime until we hit a non-unit length vector, and then we can skip the check and configure the `writePQ` method as needed

### Embedding normalization notes
(I asked ChatGPT to provide proof for the config changes proposed in this PR. Here is its generated description.)

Quick rundown of which models spit out normalized vectors (so cosine == dot product, etc.):
* **OpenAI (ada-002, v3-small, v3-large)** → already normalized. [OpenAI FAQ](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings) literally says embeddings are unit-length.
* **BERT** → depends. The SBERT “-cos-” models add a [`Normalize` layer](https://www.sbert.net/docs/package_reference/layers.html#normalize) so they’re fine; vanilla BERT doesn’t.
* **Google Gecko** → normalized out of the box per [Vertex AI docs](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings).
* **NVIDIA QA-4** → nothing in the [NVIDIA NIM model card](https://docs.api.nvidia.com/nim/reference/nvidia-embed-qa-4) about normalization, so assume *not* normalized and handle it yourself.
* **Cohere v3** → not explicitly in their [API docs](https://docs.cohere.com/docs/cohere-embed)

TL;DR: OpenAI + Gecko are definitely safe, Cohere/BERT/NV need manual normalization due to lack of documentation.
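The two-branch idea in the commit message above (trust models documented as normalized, otherwise check at runtime until the first non-unit vector) could look roughly like the following sketch. The class `UnitLengthTracker`, its methods, and the tolerance value are assumptions made for illustration. They are not taken from the PR, and the sketch does not show how `writePQ` is configured.

```java
/** Illustrative only: hypothetical names and tolerance, not the code from the PR. */
final class UnitLengthTracker
{
    // Assumed tolerance on the squared norm; the real threshold may differ
    private static final double TOLERANCE = 1e-4;

    private final boolean trustedUnitLength; // model is documented to emit unit vectors
    private boolean sawNonUnit = false;
    private int checksPerformed = 0;

    UnitLengthTracker(boolean trustedUnitLength)
    {
        this.trustedUnitLength = trustedUnitLength;
    }

    /** Returns true if the squared L2 norm of the vector is within TOLERANCE of 1. */
    static boolean isUnitLength(float[] vector)
    {
        double sumOfSquares = 0;
        for (float x : vector)
            sumOfSquares += (double) x * x;
        return Math.abs(sumOfSquares - 1.0) <= TOLERANCE;
    }

    /** Checks a vector unless the model is trusted or a non-unit vector was already seen. */
    void observe(float[] vector)
    {
        if (trustedUnitLength || sawNonUnit)
            return; // nothing left to learn, skip the computation
        checksPerformed++;
        if (!isUnitLength(vector))
            sawNonUnit = true;
    }

    /** True while every vector can still be assumed to have unit length. */
    boolean allUnitLength()
    {
        return trustedUnitLength || !sawNonUnit;
    }

    int checksPerformed()
    {
        return checksPerformed;
    }

    public static void main(String[] args)
    {
        UnitLengthTracker tracker = new UnitLengthTracker(false);
        tracker.observe(new float[]{ 1f, 0f });
        tracker.observe(new float[]{ 2f, 0f }); // first non-unit vector
        tracker.observe(new float[]{ 1f, 0f }); // skipped, no check performed
        System.out.println(tracker.allUnitLength() + " after " + tracker.checksPerformed() + " checks");
    }
}
```

Once a non-unit vector has been seen the answer cannot change, so further checks are pure overhead. That is the reason for short-circuiting in `observe`.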

Fixes for toCQLString mainly coming from CASSANDRA-16510, and removal of code duplication.

Replace column values with '?' when converting internal read queries to CQL, so user data doesn't end up in logs or any other unprotected place.

# Conflicts:
#	src/java/org/apache/cassandra/db/Clustering.java
#	src/java/org/apache/cassandra/db/Slices.java

cassci-bot · 2025-10-21T18:48:55Z

✔️ Build ds-cassandra-pr-gate/PR-2038 approved by Butler


adelapena requested a review from k-rus October 7, 2025 10:18

adelapena self-assigned this Oct 7, 2025

adelapena mentioned this pull request Oct 7, 2025

CNDB-15280: Fix AbstractReadQuery.toCQLString to produce correct CQL when possible #1985

Closed

adelapena force-pushed the CNDB-15280-main branch from ae827cd to c2702c7 Compare October 15, 2025 14:52

adelapena force-pushed the CNDB-15280-main-redaction branch from c5339f8 to 2c07844 Compare October 15, 2025 16:08

eolivelli and others added 4 commits October 16, 2025 17:50

adelapena changed the title ~~CNDB-15280: Remove user data from AbstractReadQuery.toCQLString (redaction)~~ CNDB-15280: Remove user data from AbstractReadQuery.toCQLString Oct 20, 2025

michaeljmarshall and others added 5 commits October 20, 2025 11:49

CNDB-15760: Fix AbstractReadQuery.toCQLString

ee2c990

Fixes for toCQLString mainly coming from CASSANDRA-16510, and removal of code duplication.

CNDB-15280: Remove user data from Plan.toString

18760ee

CNDB-15280: Address review feedback on redaction

c73a499

adelapena force-pushed the CNDB-15280-main-redaction branch from 2c07844 to c73a499 Compare October 21, 2025 16:02


6 participants


github-actions bot commented Oct 7, 2025 • edited by adelapena

Checklist before you submit for review
