forked from apache/cassandra
-
Notifications
You must be signed in to change notification settings - Fork 21
CNDB-15280: Remove user data from AbstractReadQuery.toCQLString #2038
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
adelapena
wants to merge
9
commits into
CNDB-15280-main
Choose a base branch
from
CNDB-15280-main-redaction
base: CNDB-15280-main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+2,687
−872
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Checklist before you submit for review
|
ae827cd
to
c2702c7
Compare
c5339f8
to
2c07844
Compare
|
… current() depending on a keyspace (#2041) There is a new cassandra.sai.version.selector.class system property allowing to provide an implementation of the o.a.c.index.sai.disk.format.Version.Selector interface to specify that version of the SAI on-disk index format should be used for each keyspace.
### What is the issue ... We need that knowledge for CNDB ### What does this PR fix and why was it fixed ... It exposes `containsDateRangeTypeColumn` methods --------- Co-authored-by: Massimiliano Tomassi <max.tomassi@datastax.com>
… CA (#2071) Creating vector indexes if version is earlier than CA would usually fail in the asynchronous build. This patch makes them fail synchronously at CREATE INDEX depending on the local index version. If the local node has the right version but any of the remotes doesn't, the failure will remain asynchronous.
…es (#2066) When row-aware and non-row-aware indexes are mixed, we now check the clustering index filter for all the keys that have clustering information, i.e. keys coming from the row-aware indexes. Earlier that check was accidentally disabled if at least one non-row-aware index was used by the query. That could cause retrieving rows that do not match the clustering condition of the query.
### What is the issue Fixes: riptano/cndb#15640 ### What does this PR fix and why was it fixed In order to lay the ground work for Fused ADC, I want to refactor some of the PQ/BQ logic. The unit length computation needs to move, so I decided to move it out to its own PR. The core idea is that: * some models are documented to provide unit length vectors, and in those cases, we should skip the computational check * otherwise, we should check at runtime until we hit a non-unit length vector, and then we can skip the check and configure the `writePQ` method as needed ### Embedding normalization notes (I asked chat gpt to provide proof for the config changes proposed in this PR. Here is it's generated description.) Quick rundown of which models spit out normalized vectors (so cosine == dot product, etc.): * **OpenAI (ada-002, v3-small, v3-large)** → already normalized. [OpenAI FAQ](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings) literally says embeddings are unit-length. * **BERT** → depends. The SBERT “-cos-” models add a [`Normalize` layer](https://www.sbert.net/docs/package_reference/layers.html#normalize) so they’re fine; vanilla BERT doesn’t. * **Google Gecko** → normalized out of the box per [Vertex AI docs](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings). * **NVIDIA QA-4** → nothing in the [NVIDIA NIM model card](https://docs.api.nvidia.com/nim/reference/nvidia-embed-qa-4) about normalization, so assume *not* normalized and handle it yourself. * **Cohere v3** → not explicitly in their [API docs](https://docs.cohere.com/docs/cohere-embed) TL;DR: OpenAI + Gecko are definitely safe, Cohere/BERT/NV need manual normalization due to lack of documentation.
Fixes for toCQLString mainly coming from CASSANDRA-16510, and removal of code duplication.
Replace column values by '?' when converting internal read queries to CQL, so user data don't end up in logs or any other unprotected place. # Conflicts: # src/java/org/apache/cassandra/db/Clustering.java # src/java/org/apache/cassandra/db/Slices.java
2c07844
to
c73a499
Compare
✔️ Build ds-cassandra-pr-gate/PR-2038 approved by ButlerApproved by Butler |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The method
AbstractReadQuery.toCQLString
prints commands as CQL queries including any column values. This includes the queried values in theWHERE
part of aSELECT
statement or the written values onINSERT
andUPDATE
statement. This method is used at least by the slow query logger, printing user data into the logs.This PR modifies
AbstractReadQuery.toCQLString
so it doesn't include column values. There is a boolean flag to opt-out from redaction, since seeing the queried values can be useful while debugging.The criteria for what should be redacted is:
InvalidRequestException
, query tracing (Tracing.trace
) and genericObject#toString()
methods.AbstractReadQuery.toCQLString
, which is used for example by the slow query logger. However, there are still plenty of other things that print user data, for example partition keys. Discussion here: https://datastax.slack.com/archives/C05LHP4HX5J/p1757687570882049?thread_ts=1757533116.788859&cid=C05LHP4HX5JAt reviewer's request, this PR separately adds redaction over the tightly related changes in
toCQLString
methods done by this other PR. That PR originally combined both things in separate commits, and it already had multiple review comments regarding changes that now are in this PR.