From 0dbadc85055b56f1219e9a1a9c72c80485f4b97a Mon Sep 17 00:00:00 2001 From: April M <36110273+aimurphy@users.noreply.github.com> Date: Thu, 13 Nov 2025 09:06:38 -0800 Subject: [PATCH 01/11] combine and remove troubleshooting scenarios page --- modules/ROOT/nav.adoc | 1 - .../ROOT/pages/feasibility-checklists.adoc | 2 +- .../ROOT/pages/manage-proxy-instances.adoc | 1 - .../ROOT/pages/troubleshooting-scenarios.adoc | 480 ----------------- modules/ROOT/pages/troubleshooting-tips.adoc | 490 +++++++++++++++++- 5 files changed, 486 insertions(+), 488 deletions(-) delete mode 100644 modules/ROOT/pages/troubleshooting-scenarios.adoc diff --git a/modules/ROOT/nav.adoc b/modules/ROOT/nav.adoc index 15df5024..62b009b9 100644 --- a/modules/ROOT/nav.adoc +++ b/modules/ROOT/nav.adoc @@ -34,7 +34,6 @@ ** xref:ROOT:connect-clients-to-target.adoc[] * Support ** xref:ROOT:troubleshooting-tips.adoc[] -** xref:ROOT:troubleshooting-scenarios.adoc[] ** xref:ROOT:faqs.adoc[] ** xref:ROOT:glossary.adoc[] * Release notes diff --git a/modules/ROOT/pages/feasibility-checklists.adoc b/modules/ROOT/pages/feasibility-checklists.adoc index 6174545d..e5be3ba7 100644 --- a/modules/ROOT/pages/feasibility-checklists.adoc +++ b/modules/ROOT/pages/feasibility-checklists.adoc @@ -84,7 +84,7 @@ For upgrade instructions, see xref:ROOT:manage-proxy-instances.adoc#_upgrade_the ==== //TODO: combine the below 2 sections to only use 2.1.0 or later. -//Reconcile with troubleshooting-scenarios.adoc in case this issue is also described there. +//Reconcile with troubleshooting-tips.adoc in case this issue is also described there. ==== Versions older than 2.1.0 If a client application only sends `SELECT` statements to a database connection then you may find that {product-proxy} terminates these read-only connections periodically, which may result in request errors if the driver is not configured to retry these requests in these conditions. 
diff --git a/modules/ROOT/pages/manage-proxy-instances.adoc b/modules/ROOT/pages/manage-proxy-instances.adoc index 9e36d1b5..2a266e9c 100644 --- a/modules/ROOT/pages/manage-proxy-instances.adoc +++ b/modules/ROOT/pages/manage-proxy-instances.adoc @@ -374,5 +374,4 @@ Additionally, if you need to restart {product-proxy} instances, and there is onl == See also * xref:ROOT:troubleshooting-tips.adoc[] -* xref:ROOT:troubleshooting-scenarios.adoc[] * xref:deploy-proxy-monitoring.adoc#_indications_of_success_on_origin_and_target_clusters[Indications of success on origin and target clusters] \ No newline at end of file diff --git a/modules/ROOT/pages/troubleshooting-scenarios.adoc b/modules/ROOT/pages/troubleshooting-scenarios.adoc deleted file mode 100644 index 9595f7ad..00000000 --- a/modules/ROOT/pages/troubleshooting-scenarios.adoc +++ /dev/null @@ -1,480 +0,0 @@ -= Troubleshooting scenarios - -//TODO: use same format as driver troubleshooting. -//TODO: Remove or hide issues that have been resolved by a later release. - -This page provides troubleshooting advice for specific issues or error messages related to {product}. - -Each section includes symptoms, causes, and suggested solutions or workarounds. - -== Configuration changes are not being applied by the automation - -=== Symptoms - -You changed the values of some configuration variables in the automation and then rolled them out using the `rolling_update_zdm_proxy.yml` playbook, but these changes are not taking effect on your {product-proxy} instances. - -=== Cause - -The {product-proxy} configuration comprises a number of variables, but only a subset of these can be changed on an existing deployment in a rolling fashion. -The variables that can be changed with a rolling update are listed xref:manage-proxy-instances.adoc#change-mutable-config-variable[here]. - -All other configuration variables excluded from the list above are considered immutable and can only be changed by a redeployment. 
-This is by design: immutable configuration variables should not be changed after finalizing the deployment prior to starting the migration, so allowing them to be changed through a rolling update would risk accidentally propagating some misconfiguration that could compromise the deployment's integrity. - -=== Solution or Workaround - -To change the value of configuration variables that are considered immutable, simply run the `deploy_zdm_proxy.yml` playbook again. -This playbook can be run as many times as necessary and will just recreate the entire {product-proxy} deployment from scratch with the provided configuration. -This doesn't happen in a rolling fashion: the existing {product-proxy} instances are torn down all at the same time prior to being recreated, resulting in a brief window in which the whole {product-proxy} deployment will become unavailable. - - -== Unsupported protocol version error on the client application - -=== Symptoms - -In the logs for the Java driver 4.x series, the following issues can manifest during session initialization, or after initialization. - -[source,log] ----- -[s0|/10.169.241.224:9042] Fatal error while initializing pool, forcing the node down (UnsupportedProtocolVersionException: [/10.169.241.224:9042] Host does not support protocol version DSE_V2) - -[s0|/10.169.241.24:9042] Fatal error while initializing pool, forcing the node down (UnsupportedProtocolVersionException: [/10.169.241.24:9042] Host does not support protocol version DSE_V2) - -[s0|/10.169.241.251:9042] Fatal error while initializing pool, forcing the node down (UnsupportedProtocolVersionException: [/10.169.241.251:9042] Host does not support protocol version DSE_V2) - -[s0] Failed to connect with protocol DSE_V1, retrying with V4 - -[s0] Failed to connect with protocol DSE_V2, retrying with DSE_V1 ----- - -=== Cause - -https://datastax-oss.atlassian.net/browse/JAVA-2905[JAVA-2905] is a driver bug that manifests itself in this way. 
It affects Java driver 4.x, and was fixed on the 4.10.0 release. - -=== Solution or Workaround - -If you are using spring boot and/or spring-data-cassandra then an upgrade of these dependencies will be necessary to a version that has the java driver fix. - -Alternatively, you can force the protocol version on the driver to the max supported version by both clusters. -V4 is a good recommendation that usually fits all but if the user is migrating from {dse-short} to {dse-short} then DSE_V1 should be used for {dse-short} 5.x and DSE_V2 should be used for {dse-short} 6.x. - -To force the protocol version on the Java driver, see the documentation for your version of the Java driver: - -* https://apache.github.io/cassandra-java-driver/4.19.0/core/native_protocol/?h=controlling#controlling-the-protocol-version[{cass-reg} Java driver 4.18 and later: Controlling the protocol version] -* https://docs.datastax.com/en/developer/java-driver/latest/manual/core/native_protocol/index.html#controlling-the-protocol-version[{company} Java driver 4.17 and earlier: Controlling the protocol version] - -== Protocol errors in the proxy logs but clients can connect successfully - -=== Symptoms - -{product-proxy} logs contain: - -[source,log] ----- -{"log":"time=\"2022-10-01T12:02:12Z\" level=debug msg=\"[TARGET-CONNECTOR] Protocol v5 detected while decoding a frame. -Returning a protocol error to the client to force a downgrade: ERROR PROTOCOL ERROR (code=ErrorCode ProtocolError [0x0000000A], -msg=Invalid or unsupported protocol version (5)).\"\n","stream":"stderr","time":"2022-07-20T12:02:12.379287735Z"} ----- - -=== Cause - -Protocol errors like these are a normal part of the handshake process where the protocol version is being negotiated. -These protocol version downgrades happen when either {product-proxy} or at least one of the clusters doesn't support the version requested by the client. 
- -V5 downgrades are enforced by {product-proxy} but any other downgrade is requested by one of the clusters when they don't support the version that the client requested. -The proxy supports V3, V4, DSE_V1 and DSE_V2. - -=== Solution or Workaround - -These log messages are informative only (log level `DEBUG`). - -If you find one of these messages with a higher log level (especially `level=error`) then there might be a bug. -At that point the issue will need to be investigated by the {product-short} team. -This log message with a log level of `ERROR` means that the protocol error occurred after the handshake, and this is a fatal unexpected error that results in a disconnect for that particular connection. - -== Error during proxy startup: `Invalid or unsupported protocol version: 3` - -If the {product-proxy} logs contain the following type of output, it indicates that one of the origin clusters doesn't support at least V3 (e.g. {cass-short} 2.0, {dse-short} 4.6), and {product-short} cannot be used for that migration. - -[source,log] ----- -time="2022-10-01T19:58:15+01:00" level=info msg="Starting proxy..." 
-time="2022-10-01T19:58:15+01:00" level=info msg="Parsed Topology Config: TopologyConfig{VirtualizationEnabled=false, Addresses=[127.0.0.1], Count=1, Index=0, NumTokens=8}" -time="2022-10-01T19:58:15+01:00" level=info msg="Parsed Origin contact points: [127.0.0.1]" -time="2022-10-01T19:58:15+01:00" level=info msg="Parsed Target contact points: [127.0.0.1]" -time="2022-10-01T19:58:15+01:00" level=info msg="TLS was not configured for Origin" -time="2022-10-01T19:58:15+01:00" level=info msg="TLS was not configured for Target" -time="2022-10-01T19:58:15+01:00" level=info msg="[openTCPConnection] Opening connection to 127.0.0.1:9042" -time="2022-10-01T19:58:15+01:00" level=info msg="[openTCPConnection] Successfully established connection with 127.0.0.1:9042" -time="2022-10-01T19:58:15+01:00" level=debug msg="performing handshake" -time="2022-10-01T19:58:15+01:00" level=error msg="cqlConn{conn: 127.0.0.1:9042}: handshake failed: expected AUTHENTICATE or READY, got ERROR PROTOCOL ERROR (code=ErrorCode ProtocolError [0x0000000A], msg=Invalid or unsupported protocol version: 3)" -time="2022-10-01T19:58:15+01:00" level=warning msg="Error while initializing a new cql connection for the control connection of ORIGIN: failed to perform handshake: expected AUTHENTICATE or READY, got ERROR PROTOCOL ERROR (code=ErrorCode ProtocolError [0x0000000A], msg=Invalid or unsupported protocol version: 3)" -time="2022-10-01T19:58:15+01:00" level=debug msg="Shutting down request loop on cqlConn{conn: 127.0.0.1:9042}" -time="2022-10-01T19:58:15+01:00" level=debug msg="Shutting down response loop on cqlConn{conn: 127.0.0.1:9042}." -time="2022-10-01T19:58:15+01:00" level=debug msg="Shutting down event loop on cqlConn{conn: 127.0.0.1:9042}." -time="2022-10-01T19:58:15+01:00" level=error msg="Couldn't start proxy: failed to initialize origin control connection: could not open control connection to ORIGIN, tried endpoints: [127.0.0.1:9042]." 
-time="2022-10-01T19:58:15+01:00" level=info msg="Initiating proxy shutdown..." -time="2022-10-01T19:58:15+01:00" level=debug msg="Requesting shutdown of the client listener..." -time="2022-10-01T19:58:15+01:00" level=debug msg="Requesting shutdown of the client handlers..." -time="2022-10-01T19:58:15+01:00" level=debug msg="Waiting until all client handlers are done..." -time="2022-10-01T19:58:15+01:00" level=debug msg="Requesting shutdown of the control connections..." -time="2022-10-01T19:58:15+01:00" level=debug msg="Waiting until control connections done..." -time="2022-10-01T19:58:15+01:00" level=debug msg="Shutting down the schedulers and metrics handler..." -time="2022-10-01T19:58:15+01:00" level=info msg="Proxy shutdown complete." -time="2022-10-01T19:58:15+01:00" level=error msg="Couldn't start proxy, retrying in 2.229151525s: failed to initialize origin control connection: could not open control connection to ORIGIN, tried endpoints: [127.0.0.1:9042]." ----- - -The control connections of {product-proxy} don't perform protocol version negotiation, they only attempt to use protocol version 3. - -== Authentication errors - -Authentication errors indicate that credentials are incorrect or have insufficient permissions. - -There are three sets of credentials used with {product-proxy}: - -* Target: credentials that you set in the proxy configuration through the `ZDM_TARGET_USERNAME` and `ZDM_TARGET_PASSWORD` settings. - -* Origin: credentials that you set in the proxy configuration through the `ZDM_ORIGIN_USERNAME` and `ZDM_ORIGIN_PASSWORD` settings. - -* Client: credentials that the client application sends to the proxy during the connection handshake, these are set in the application configuration, not the proxy configuration. - -Authentication errors mean that at least one of these three sets of credentials is incorrect or has insufficient permissions. 
- -If the authentication error is preventing the proxy from starting then it's either the origin or target credentials that are incorrect or have insufficient permissions. -The log message shows whether it is the origin or target handshake that is failing. - -If the proxy is able to start up, and you can see the following message in the logs: `Proxy started. Waiting for SIGINT/SIGTERM to shutdown`, then the authentication error is happening when a client application tries to open a connection to the proxy. -In this case, the issue is with the client credentials. -The application itself is using invalid credentials, such as an incorrect username/password, expired token, or insufficient permissions. - -Note that the proxy startup message has log level `INFO`, so if the configured log level on the proxy is `warning` or `error`, you must rely on other ways to know whether {product-proxy} started correctly. -You can check if the docker container is running (or process if docker isn't being used) or if there is a log message similar to `Error launching proxy`. - -== {product-proxy} listens on a custom port, and all applications are able to connect to one proxy instance only - -=== Symptoms - -{product-proxy} is listening on a custom port (not 9042) and: - -* The Grafana dashboard shows only one proxy instance receiving all the connections from the application. -* Only one proxy instance has log messages such as `level=info msg="Accepted connection from 10.4.77.210:39458"`. - -=== Cause - -The application is specifying the custom port as part of the contact points using the format -`:`. - -For example, using the Java driver, if the {product-proxy} instances were listening on port 14035, this would look like: - -`.addContactPoints("172.18.10.36:14035", "172.18.11.48:14035", "172.18.12.61:14035")` - -The contact point is used as the first point of contact to the cluster, but the driver discovers the rest of the nodes via CQL queries. 
-However, this discovery process doesn't discover the ports, just the addresses so the driver uses the addresses it discovers with the port that is configured at startup. - -As a result, port 14035 will only be used for the contact point initially discovered, while for all other nodes the driver will attempt to use the default 9042 port. - -=== Solution or Workaround - -In the application, ensure that the custom port is explicitly indicated using the `.withPort()` API. In the above example: - -[source,java] ----- -.addContactPoints("172.18.10.36", "172.18.11.48", "172.18.12.61") -.withPort(14035) ----- - - -== Syntax error "no viable alternative at input 'CALL'" in proxy logs - -=== Symptoms - -{product-proxy} logs contain: - -[source,log] ----- -{"log":"time=\"2022-10-01T13:10:47Z\" level=debug msg=\"Recording TARGET-CONNECTOR other error: -ERROR SYNTAX ERROR (code=ErrorCode SyntaxError [0x00002000], msg=line 1:0 no viable alternative -at input 'CALL' ([CALL]...))\"\n","stream":"stderr","time":"2022-07-20T13:10:47.322882877Z"} ----- - -=== Cause - -The log message indicates that the server doesn't recognize the word “CALL” in the query string which most likely means that it is an RPC (remote procedure call). -From the proxy logs alone, it is not possible to see what method is being called by the query but it's very likely the RPC that the drivers use to send {dse-short} Insights data to the server. - -Most {company}-compatible drivers have {dse-short} Insights reporting enabled by default when they detect a server version that supports it (regardless of whether the feature is enabled on the server side or not). -The driver might also have it enabled for {astra-db} depending on what server version {astra-db} is returning for queries involving the `system.local` and `system.peers` tables. - -=== Solution or Workaround - -These log messages are harmless, but if you need to remove them, you can disable {dse-short} Insights in the driver configuration. 
-For example, in the Java driver, you can set `https://github.com/apache/cassandra-java-driver/blob/4.x/core/src/main/resources/reference.conf#L1365[advanced.monitor-reporting]` to `false`. - -== Default Grafana credentials don't work - -=== Symptoms - -Consider a case where you deploy the metrics component of our {product-automation}, a Grafana instance is deployed but you cannot login using the usual default `admin/admin` credentials. - -=== Cause - -{product-automation} specifies a custom set of credentials instead of relying on the `admin/admin` ones that are typically the default for Grafana deployments. - -=== Solution or Workaround - -Check the credentials that are being used by looking up the `vars/zdm_monitoring_config.yml` file on the {product-automation} directory. -These credentials can also be modified before deploying the metrics stack. - -== Proxy starts but client cannot connect (connection timeout/closed) - -=== Symptoms - -{product-proxy} log contains: - -[source] ----- -INFO[0000] [openTCPConnection] Opening connection to 10.0.63.163:9042 -INFO[0000] [openTCPConnection] Successfully established connection with 10.0.63.163:9042 -INFO[0000] [openTLSConnection] Opening TLS connection to 10.0.63.163:9042 using underlying TCP connection -INFO[0000] [openTLSConnection] Successfully established connection with 10.0.63.163:9042 -INFO[0000] Successfully opened control connection to ORIGIN using endpoint 10.0.63.163:9042. 
-INFO[0000] [openTCPConnection] Opening connection to 5bc479c2-c3d0-45be-bfba-25388f2caff7-us-east-1.db.astra.datastax.com:29042 -INFO[0000] [openTCPConnection] Successfully established connection with 54.84.75.118:29042 -INFO[0000] [openTLSConnection] Opening TLS connection to 211d66bf-de8d-48ac-a25b-bd57d504bd7c using underlying TCP connection -INFO[0000] [openTLSConnection] Successfully established connection with 211d66bf-de8d-48ac-a25b-bd57d504bd7 -INFO[0000] Successfully opened control connection to TARGET using endpoint 5bc479c2-c3d0-45be-bfba-25388f2caff7-us-east-1.db.astra.datastax.com:29042-211d66bf-de8d-48ac-a25b-bd57d504bd7c. -INFO[0000] Proxy connected and ready to accept queries on 0.0.0.0:9042 -INFO[0000] Proxy started. Waiting for SIGINT/SIGTERM to shutdown. -INFO[0043] Accepted connection from 10.0.62.255:33808 -INFO[0043] [ORIGIN-CONNECTOR] Opening request connection to ORIGIN (10.0.63.20:9042). -ERRO[0043] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 100ms... -ERRO[0043] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 200ms... -ERRO[0043] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 400ms... -ERRO[0043] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 800ms... -ERRO[0044] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 1.6s... -ERRO[0046] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 3.2s... -ERRO[0049] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 6.4s... -ERRO[0056] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 10s... -ERRO[0066] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 10s... 
-ERRO[0076] Client Handler could not be created: ORIGIN-CONNECTOR context timed out or cancelled while opening connection to ORIGIN: context deadline exceeded ----- - -=== Cause - -{product-proxy} has connectivity only to a subset of the nodes. - -The control connection (during {product-proxy} startup) cycles through the nodes until it finds one that can be connected to. -For client connections, each proxy instance cycles through its "assigned nodes" only. -_(The "assigned nodes" are a different subset of the cluster nodes for each proxy instance, generally non-overlapping between proxy instances so as to avoid any interference with the load balancing already in place at client-side driver level. -The assigned nodes are not necessarily contact points: even discovered nodes undergo assignment to proxy instances.)_ - -In the example above, {product-proxy} doesn't have connectivity to 10.0.63.20, which was chosen as the origin node for the incoming client connection, but it connected to 10.0.63.163 during startup. - -=== Solution or Workaround - -Ensure that network connectivity exists and is stable between the {product-proxy} instances and all {cass-short} / {dse-short} nodes of the local datacenter. - -== Client application driver takes too long to reconnect to a proxy instance - -=== Symptoms - -After a {product-proxy} instance has been unavailable for some time and it gets back up, the client application takes too long to reconnect. - -There should never be a reason to stop a {product-proxy} instance other than a configuration change, but maybe the proxy crashed or the user tried to do a configuration change and took a long time to get the {product-proxy} instance back up. - -=== Cause - -{product-proxy} does not send topology events to the client applications, so the reconnection policy determines the time required for the driver to reconnect to a {product-proxy} instance. 
- -=== Solution or Workaround - -Restart the client application to force an immediate reconnect. - -If you expect {product-proxy} instances to go down frequently, change the reconnection policy on the driver so that the interval between reconnection attempts has a shorter limit. - -== Error with {astra} DevOps API when using {product-automation} - -=== Symptoms - -{product-automation}'s logs: - -[source,log] ----- -fatal: [10.255.13.6]: FAILED! => {"changed": false, "elapsed": 0, "msg": "Status code was -1 and not [200]: -Connection failure: Remote end closed connection without response", "redirected": false, "status": -1, "url": -"https://api.astra.datastax.com/v2/databases/REDACTED/secureBundleURL"} ----- - -=== Cause - -The {astra} DevOps API is likely temporarily unavailable. - -=== Solution or Workaround - -xref:astra-db-serverless:databases:secure-connect-bundle.adoc[Download the {astra-db} {scb}] manually and provide its path in the xref:deploy-proxy-monitoring.adoc#_core_configuration[{product-automation} configuration]. - -== Metadata service returned not successful status code 4xx or 5xx - -=== Symptoms - -{product-proxy} doesn't start and the following appears on the proxy logs: - -[source,log] ----- -Couldn't start proxy: error initializing the connection configuration or control connection for Target: -metadata service (Astra) returned not successful status code ----- - -=== Cause - -There are two possible causes for this: - -* The credentials that {product-proxy} is using for {astra-db} don't have sufficient permissions. -* The {astra-db} database is hibernated or otherwise unavailable. - -=== Solution or Workaround - -In the {astra-ui}, check the xref:astra-db-serverless:databases:database-statuses.adoc[database status]. - -If the database is not in *Active* status, you might need to take action or wait for the database to return to active status. 
-For example, if the database is hibernated, xref:astra-db-serverless:databases:database-statuses.adoc#hibernated[reactivate the database]. -When the database is active again, retry the connection. - -If the database is in *Active* status, then the issue is likely due to the credentials permissions. -Try using an xref:astra-db-serverless:administration:manage-application-tokens.adoc[application token scoped to a database], specifically a token with the *Database Administrator* role for your target database. - -[[_async_read_timeouts_stream_id_map_exhausted]] -== Async read timeouts / stream id map exhausted - -//Supposedly resolved in 2.1.0 release? - -=== Symptoms - -Dual reads are enabled and the following messages are found in the {product-proxy} logs: - -[source,log] ----- -{"log":"\u001b[33mWARN\u001b[0m[430352] Async Request (OpCode EXECUTE [0x0A]) timed out after 10000 ms. \r\n","stream":"stdout","time":"2022-10-03T17:29:42.548941854Z"} - -{"log":"\u001b[33mWARN\u001b[0m[430368] Could not find async request context for stream id 331 received from async connector. It either timed out or a protocol error occurred. \r\n","stream":"stdout","time":"2022-10-03T17:29:58.378080933Z"} - -{"log":"\u001b[33mWARN\u001b[0m[431533] Could not send async request due to an error while storing the request state: stream id map ran out of stream ids: channel was empty. \r\n","stream":"stdout","time":"2022-10-03T17:49:23.786335428Z"} ----- - -=== Cause - -The last log message is logged when the async connection runs out of stream ids. -The async connection is a connection dedicated to the async reads (asynchronous dual reads feature). -This can be caused by timeouts (first log message) or the connection not being able to keep up with the load. - -If the log files are being spammed with these messages then it is likely that an outage occurred which caused all responses to arrive after requests timed out (second log message). 
-In this case the async connection might not be able to recover. - -=== Solution or Workaround - -Keep in mind that any errors in the async request path (dual reads) will not affect the client application so these log messages might be useful to predict what may happen when the reads are switched over to the TARGET cluster but async read errors/warnings by themselves do not cause any impact to the client. - -Starting in version 2.1.0, you can now tune the maximum number of stream ids available per connection, which by default is 2048. -You can increase it to match your driver configuration through the xref:manage-proxy-instances.adoc#zdm_proxy_max_stream_ids[zdm_proxy_max_stream_ids] property. - -If these errors are being constantly written to the log files (for minutes or even hours) then it is likely that only an application OR {product-proxy} restart will fix it. -If you find an issue like this, submit a {product-proxy-repo}/issues[GitHub issue]. - -== Client application closed connection errors every 10 minutes when migrating to {astra-db} - -//TODO: Remove - resolved in 2.1.0 -[NOTE] -==== -This issue is fixed in {product-proxy} 2.1.0. See the Fix section below. -==== - -=== Symptoms - -Every 10 minutes a message is logged in the {product-proxy} logs showing a disconnect that was caused by {astra-db}: - -[source,log] ----- -{"log":"\u001b[36mINFO\u001b[0m[426871] [TARGET-CONNECTOR] REDACTED disconnected \r\n","stream":"stdout","time":"2022-10-01T16:31:41.48598498Z"} ----- - -=== Cause - -{astra-db} terminates idle connections after 10 minutes of inactivity. -If a client application only sends reads through a connection then the target cluster, which is an {astra-db} database in this example, then the connection won't get any traffic because {product-short} forwards all reads to the origin connection. - -=== Solution or Workaround - -This issue has been fixed in {product-proxy} 2.1.0. -We encourage you to upgrade to that version or greater. 
-By default, {product-proxy} now sends heartbeats after 30 seconds of inactivity on a cluster connection, to keep it alive. -You can tune the heartbeat interval with the Ansible configuration variable `heartbeat_insterval_ms`, or by directly setting the `ZDM_HEARTBEAT_INTERVAL_MS` environment variable if you do not use {product-automation}. - -== Performance degradation with {product-proxy} - -=== Symptoms - -Consider a case where a user runs separate benchmarks against: - -* {astra-db} directly -* Origin directly -* {product-proxy} (with {astra-db} and the origin cluster) - -The results of these tests show latency/throughput values are worse with {product-proxy} than when connecting to {astra-db} or origin cluster directly. - -=== Cause - -{product-short} always increases latency and, depending on the nature of the test, reduces throughput. -Whether this performance hit is expected or not depends on the difference between the {product-short} test results and the test results with the cluster that performed the worst. - -Writes in {product-short} require successful acknowledgement from both clusters, while reads only require the result from the primary cluster, which is typically the origin cluster. -This means that if the origin cluster has better performance than the target cluster, then {product-short} will have worse write performance. - -It is typical for latency to increase with {product-proxy}. -To minimize performance degradation with {product-proxy}, note the following: - -* Make sure your {product-proxy} infrastructure or configuration doesn't unnecessarily increase latency. -For example, make sure your {product-proxy} instances are in the same availability zone (AZ) as your origin cluster or application instances. -* Understand the impact of simple and batch statements on latency, as compared to typical prepared statements. -+ -Avoid simple statements with {product-proxy} because they require significant time for {product-proxy} to parse the queries. 
-+ -In contrast, prepared statements are parsed once, and then reused on subsequent requests, if repreparation isn't required. - -=== Solution or Workaround - -If you are using simple statements, consider using prepared statements as the best first step. - -Increasing the number of proxies might help, but only if the VMs resources (CPU, RAM or network IO) are near capacity. -{product-proxy} doesn't use a lot of RAM, but it uses a lot of CPU and network IO. - -Deploying the proxy instances on VMs with faster CPUs and faster network IO might help, but only your own tests will reveal whether it helps, because it depends on the workload type and details about your environment such as network/VPC configurations, hardware, and so on. - -== `InsightsRpc` related permissions errors - -=== Symptoms - -{product-proxy} logs contain: - -[source,log] ----- -time="2023-05-05T19:14:31Z" level=debug msg="Recording ORIGIN-CONNECTOR other error: ERROR UNAUTHORIZED (code=ErrorCode Unauthorized [0x00002100], msg=User my_user has no EXECUTE permission on or any of its parents)" -time="2023-05-05T19:14:31Z" level=debug msg="Recording TARGET-CONNECTOR other error: ERROR SERVER ERROR (code=ErrorCode ServerError [0x00000000], msg=Unexpected persistence error: Unable to authorize statement com.datastax.bdp.cassandra.cql3.RpcCallStatement)" ----- - -=== Cause - -This could be the case if the origin ({dse-short}) cluster has Metrics Collector enabled to report metrics for {company} drivers and `my_user` does not have the required permissions. -{product-proxy} simply passes through these. - -=== Solution or Workaround - -There are two options to get this fixed. 
-==== Option 1: Disable {dse-short} Metrics Collector
-
-* On the origin {dse-short} cluster, run `dsetool insights_config --mode DISABLED`
-* Run `dsetool insights_config --show_config` and ensure the `mode` has a value of `DISABLED`
-
-==== Option 2: Use this option if disabling metrics collector is not an option
-
-* Using a superuser role, grant the appropriate permissions to `my_user` role by running `GRANT EXECUTE ON REMOTE OBJECT InsightsRpc TO my_user;`
\ No newline at end of file
diff --git a/modules/ROOT/pages/troubleshooting-tips.adoc b/modules/ROOT/pages/troubleshooting-tips.adoc
index 308d1210..7be7ae3b 100644
--- a/modules/ROOT/pages/troubleshooting-tips.adoc
+++ b/modules/ROOT/pages/troubleshooting-tips.adoc
@@ -1,12 +1,11 @@
 = Troubleshooting tips
-:page-aliases: ROOT:troubleshooting.adoc
+:page-aliases: ROOT:troubleshooting.adoc, ROOT:troubleshooting-scenarios.adoc
 :description: Get help with {product}.
 
 This page provides general troubleshooting advice and describes some common issues you might encounter with {product}.
 
-For specific error messages, see xref:troubleshooting-scenarios.adoc[].
-
-You can also contact your {company} account representative or {support-url}[{company} Support], if you have an IBM Elite Support for {cass} contract.
+For additional assistance, you can review the troubleshooting scenarios on this page, contact your {company} account representative, or contact {support-url}[{company} Support].
 
 [#proxy-logs]
 == {product-proxy} logs
@@ -148,7 +147,7 @@ This flag will prevent you from accessing the logs when {product-proxy} stops or
 
 Querying `system.peers` and `system.local` can help you investigate {product-proxy} configuration issues:
 
-. xref:ROOT:connect-clients-to-proxy.adoc#connect-the-cql-shell-to-zdm-proxy[Connect cqlsh to a {product-proxy} instance.]
+. xref:ROOT:connect-clients-to-proxy.adoc#connect-the-cql-shell-to-zdm-proxy[Connect cqlsh to a {product-proxy} instance].
 .
Query `system.peers`:
 +
@@ -172,6 +171,488 @@ Because `system.peers` and `system.local` reflect the local {product-proxy} inst
 +
 For example, you might compare `cluster_name` to ensure that all instances are connected to the same cluster, rather than mixing contact points from different clusters.
 
+== Troubleshooting scenarios
+
+//TODO: use same format as driver troubleshooting.
+//TODO: Remove or hide issues that have been resolved by a later release.
+
+This section provides troubleshooting advice for specific issues or error messages related to {product}.
+
+Each scenario includes symptoms, causes, and suggested solutions or workarounds.
+
+=== Configuration changes are not being applied by the automation
+
+==== Symptoms
+
+You changed the values of some configuration variables in the automation and then rolled them out using the `rolling_update_zdm_proxy.yml` playbook, but the changes are not taking effect on your {product-proxy} instances.
+
+==== Cause
+
+The {product-proxy} configuration comprises many variables, but only a subset of them can be changed on an existing deployment in a rolling fashion.
+For the list of variables that you can change with a rolling update, see xref:manage-proxy-instances.adoc#change-mutable-config-variable[Change a mutable configuration variable].
+
+All other configuration variables are considered immutable and can only be changed by a redeployment.
+This is by design: immutable configuration variables should not be changed after you finalize the deployment prior to starting the migration, so allowing them to be changed through a rolling update would risk propagating a misconfiguration that could compromise the deployment's integrity.
+
+==== Solution or Workaround
+
+To change the value of an immutable configuration variable, run the `deploy_zdm_proxy.yml` playbook again.
+You can run this playbook as many times as necessary; it recreates the entire {product-proxy} deployment from scratch with the provided configuration.
+This doesn't happen in a rolling fashion: the existing {product-proxy} instances are all torn down at the same time before being recreated, resulting in a brief window in which the whole {product-proxy} deployment is unavailable.
+
+=== Unsupported protocol version error on the client application
+
+==== Symptoms
+
+In the logs for the Java driver 4.x series, the following issues can manifest during or after session initialization.
+
+[source,log]
+----
+[s0|/10.169.241.224:9042] Fatal error while initializing pool, forcing the node down (UnsupportedProtocolVersionException: [/10.169.241.224:9042] Host does not support protocol version DSE_V2)
+
+[s0|/10.169.241.24:9042] Fatal error while initializing pool, forcing the node down (UnsupportedProtocolVersionException: [/10.169.241.24:9042] Host does not support protocol version DSE_V2)
+
+[s0|/10.169.241.251:9042] Fatal error while initializing pool, forcing the node down (UnsupportedProtocolVersionException: [/10.169.241.251:9042] Host does not support protocol version DSE_V2)
+
+[s0] Failed to connect with protocol DSE_V1, retrying with V4
+
+[s0] Failed to connect with protocol DSE_V2, retrying with DSE_V1
+----
+
+==== Cause
+
+https://datastax-oss.atlassian.net/browse/JAVA-2905[JAVA-2905] is a driver bug that manifests in this way.
+It affects the Java driver 4.x series and was fixed in the 4.10.0 release.
+
+==== Solution or Workaround
+
+If you are using Spring Boot or Spring Data Cassandra, upgrade these dependencies to a version that includes the Java driver fix.
+
+Alternatively, you can force the driver to use the highest protocol version supported by both clusters.
+Protocol version V4 is usually a safe choice.
+However, if you are migrating from {dse-short} to {dse-short}, use DSE_V1 for {dse-short} 5.x and DSE_V2 for {dse-short} 6.x.
+
+To force the protocol version on the Java driver, see the documentation for your version of the Java driver:
+
+* https://apache.github.io/cassandra-java-driver/4.19.0/core/native_protocol/?h=controlling#controlling-the-protocol-version[{cass-reg} Java driver 4.18 and later: Controlling the protocol version]
+* https://docs.datastax.com/en/developer/java-driver/latest/manual/core/native_protocol/index.html#controlling-the-protocol-version[{company} Java driver 4.17 and earlier: Controlling the protocol version]
+
+=== Protocol errors in the proxy logs but clients can connect successfully
+
+==== Symptoms
+
+{product-proxy} logs contain:
+
+[source,log]
+----
+{"log":"time=\"2022-10-01T12:02:12Z\" level=debug msg=\"[TARGET-CONNECTOR] Protocol v5 detected while decoding a frame.
+Returning a protocol error to the client to force a downgrade: ERROR PROTOCOL ERROR (code=ErrorCode ProtocolError [0x0000000A],
+msg=Invalid or unsupported protocol version (5)).\"\n","stream":"stderr","time":"2022-07-20T12:02:12.379287735Z"}
+----
+
+==== Cause
+
+Protocol errors like these are a normal part of the handshake process where the protocol version is negotiated.
+These protocol version downgrades happen when either {product-proxy} or at least one of the clusters doesn't support the version requested by the client.
+
+V5 downgrades are enforced by {product-proxy}, but any other downgrade is requested by one of the clusters when it doesn't support the version that the client requested.
+The proxy supports V3, V4, DSE_V1, and DSE_V2.
+
+==== Solution or Workaround
+
+These log messages are informative only (log level `DEBUG`).
+
+If you find one of these messages with a higher log level (especially `level=error`), then there might be a bug.
+At that point, the issue needs to be investigated by the {product-short} team.
+A log message like this with a log level of `ERROR` means that the protocol error occurred after the handshake, which is a fatal, unexpected error that results in a disconnect for that particular connection.
+
+=== Error during proxy startup: `Invalid or unsupported protocol version: 3`
+
+If the {product-proxy} logs contain the following type of output, it indicates that the origin cluster doesn't support at least protocol version V3 (for example, {cass-short} 2.0 or {dse-short} 4.6), and {product-short} cannot be used for that migration.
+
+[source,log]
+----
+time="2022-10-01T19:58:15+01:00" level=info msg="Starting proxy..."
+time="2022-10-01T19:58:15+01:00" level=info msg="Parsed Topology Config: TopologyConfig{VirtualizationEnabled=false, Addresses=[127.0.0.1], Count=1, Index=0, NumTokens=8}"
+time="2022-10-01T19:58:15+01:00" level=info msg="Parsed Origin contact points: [127.0.0.1]"
+time="2022-10-01T19:58:15+01:00" level=info msg="Parsed Target contact points: [127.0.0.1]"
+time="2022-10-01T19:58:15+01:00" level=info msg="TLS was not configured for Origin"
+time="2022-10-01T19:58:15+01:00" level=info msg="TLS was not configured for Target"
+time="2022-10-01T19:58:15+01:00" level=info msg="[openTCPConnection] Opening connection to 127.0.0.1:9042"
+time="2022-10-01T19:58:15+01:00" level=info msg="[openTCPConnection] Successfully established connection with 127.0.0.1:9042"
+time="2022-10-01T19:58:15+01:00" level=debug msg="performing handshake"
+time="2022-10-01T19:58:15+01:00" level=error msg="cqlConn{conn: 127.0.0.1:9042}: handshake failed: expected AUTHENTICATE or READY, got ERROR PROTOCOL ERROR (code=ErrorCode ProtocolError [0x0000000A], msg=Invalid or unsupported protocol version: 3)"
+time="2022-10-01T19:58:15+01:00" level=warning msg="Error while initializing a new cql connection for the control connection of ORIGIN: failed to perform handshake: expected AUTHENTICATE or READY, got ERROR PROTOCOL ERROR (code=ErrorCode ProtocolError [0x0000000A], msg=Invalid or unsupported protocol version: 3)"
+time="2022-10-01T19:58:15+01:00" level=debug msg="Shutting down request loop on cqlConn{conn: 127.0.0.1:9042}"
+time="2022-10-01T19:58:15+01:00" level=debug msg="Shutting down response loop on cqlConn{conn: 127.0.0.1:9042}."
+time="2022-10-01T19:58:15+01:00" level=debug msg="Shutting down event loop on cqlConn{conn: 127.0.0.1:9042}."
+time="2022-10-01T19:58:15+01:00" level=error msg="Couldn't start proxy: failed to initialize origin control connection: could not open control connection to ORIGIN, tried endpoints: [127.0.0.1:9042]."
+time="2022-10-01T19:58:15+01:00" level=info msg="Initiating proxy shutdown..."
+time="2022-10-01T19:58:15+01:00" level=debug msg="Requesting shutdown of the client listener..."
+time="2022-10-01T19:58:15+01:00" level=debug msg="Requesting shutdown of the client handlers..."
+time="2022-10-01T19:58:15+01:00" level=debug msg="Waiting until all client handlers are done..."
+time="2022-10-01T19:58:15+01:00" level=debug msg="Requesting shutdown of the control connections..."
+time="2022-10-01T19:58:15+01:00" level=debug msg="Waiting until control connections done..."
+time="2022-10-01T19:58:15+01:00" level=debug msg="Shutting down the schedulers and metrics handler..."
+time="2022-10-01T19:58:15+01:00" level=info msg="Proxy shutdown complete."
+time="2022-10-01T19:58:15+01:00" level=error msg="Couldn't start proxy, retrying in 2.229151525s: failed to initialize origin control connection: could not open control connection to ORIGIN, tried endpoints: [127.0.0.1:9042]."
+----
+
+The control connections of {product-proxy} don't perform protocol version negotiation; they only attempt to use protocol version 3.
+
+=== Authentication errors
+
+Authentication errors indicate that credentials are incorrect or have insufficient permissions.
+
+{product-proxy} uses three sets of credentials:
+
+* Target: credentials that you set in the proxy configuration through the `ZDM_TARGET_USERNAME` and `ZDM_TARGET_PASSWORD` settings.
+
+* Origin: credentials that you set in the proxy configuration through the `ZDM_ORIGIN_USERNAME` and `ZDM_ORIGIN_PASSWORD` settings.
+
+* Client: credentials that the client application sends to the proxy during the connection handshake. These are set in the application configuration, not the proxy configuration.
+
+An authentication error means that at least one of these three sets of credentials is incorrect or has insufficient permissions.
+
+If the authentication error prevents the proxy from starting, then the origin or target credentials are incorrect or have insufficient permissions.
+The log message shows whether the origin or target handshake is failing.
+
+If the proxy is able to start, and the logs contain the message `Proxy started. Waiting for SIGINT/SIGTERM to shutdown`, then the authentication error happens when a client application tries to open a connection to the proxy.
+In this case, the issue is with the client credentials: the application is using invalid credentials, such as an incorrect username or password, an expired token, or credentials with insufficient permissions.
+
+Note that the proxy startup message has log level `INFO`, so if the configured log level on the proxy is `warning` or `error`, you must rely on other ways to know whether {product-proxy} started correctly.
+For example, you can check whether the Docker container (or process, if Docker isn't used) is running, or whether the logs contain a message similar to `Error launching proxy`.
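+To isolate which set of credentials is failing, you can bypass {product-proxy} and test credentials directly against the corresponding cluster with cqlsh.
+The following command is a hypothetical sketch: the address, username, and password are placeholders, and it assumes the origin cluster accepts plain username/password authentication.
+
+[source,bash]
+----
+# Verify the credentials set in ZDM_ORIGIN_USERNAME and ZDM_ORIGIN_PASSWORD
+# by connecting to an origin node directly, bypassing the proxy.
+cqlsh 172.18.10.36 9042 -u origin_user -p origin_password \
+  -e "SELECT cluster_name FROM system.local;"
+----
+
+If this command fails with an authentication error, the problem is in the credentials themselves rather than in {product-proxy}.
+You can run the same check with the target credentials against the target cluster.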
+
+=== {product-proxy} listens on a custom port, and all applications are able to connect to one proxy instance only
+
+==== Symptoms
+
+{product-proxy} is listening on a custom port (not 9042) and:
+
+* The Grafana dashboard shows only one proxy instance receiving all the connections from the application.
+* Only one proxy instance has log messages such as `level=info msg="Accepted connection from 10.4.77.210:39458"`.
+
+==== Cause
+
+The application is specifying the custom port as part of the contact points, using the `hostname:port` format.
+
+For example, using the Java driver, if the {product-proxy} instances were listening on port 14035, this would look like:
+
+`.addContactPoints("172.18.10.36:14035", "172.18.11.48:14035", "172.18.12.61:14035")`
+
+The contact point is used as the first point of contact to the cluster, but the driver discovers the rest of the nodes via CQL queries.
+However, this discovery process only discovers addresses, not ports, so the driver uses the addresses it discovers with the port that was configured at startup.
+
+As a result, port 14035 is used only for the initially discovered contact point, while for all other nodes the driver attempts to use the default port 9042.
+
+==== Solution or Workaround
+
+In the application, ensure that the custom port is explicitly indicated using the `.withPort()` API.
In the above example:
+
+[source,java]
+----
+.addContactPoints("172.18.10.36", "172.18.11.48", "172.18.12.61")
+.withPort(14035)
+----
+
+=== Syntax error "no viable alternative at input 'CALL'" in proxy logs
+
+==== Symptoms
+
+{product-proxy} logs contain:
+
+[source,log]
+----
+{"log":"time=\"2022-10-01T13:10:47Z\" level=debug msg=\"Recording TARGET-CONNECTOR other error:
+ERROR SYNTAX ERROR (code=ErrorCode SyntaxError [0x00002000], msg=line 1:0 no viable alternative
+at input 'CALL' ([CALL]...))\"\n","stream":"stderr","time":"2022-07-20T13:10:47.322882877Z"}
+----
+
+==== Cause
+
+The log message indicates that the server doesn't recognize the word `CALL` in the query string, which most likely means that the query is an RPC (remote procedure call).
+From the proxy logs alone, it is not possible to see which method the query calls, but it is most likely the RPC that drivers use to send {dse-short} Insights data to the server.
+
+Most {company}-compatible drivers have {dse-short} Insights reporting enabled by default when they detect a server version that supports it (regardless of whether the feature is enabled on the server side).
+The driver might also enable it for {astra-db}, depending on the server version that {astra-db} returns for queries involving the `system.local` and `system.peers` tables.
+
+==== Solution or Workaround
+
+These log messages are harmless, but if you need to remove them, you can disable {dse-short} Insights in the driver configuration.
+For example, in the Java driver, you can set `https://github.com/apache/cassandra-java-driver/blob/4.x/core/src/main/resources/reference.conf#L1365[advanced.monitor-reporting]` to `false`.
+
+=== Default Grafana credentials don't work
+
+==== Symptoms
+
+When you deploy the metrics component of {product-automation}, a Grafana instance is deployed, but you cannot log in with the usual default `admin/admin` credentials.
+
+==== Cause
+
+{product-automation} specifies a custom set of credentials instead of relying on the `admin/admin` credentials that are typically the default for Grafana deployments.
+
+==== Solution or Workaround
+
+Check the credentials that are being used by looking up the `vars/zdm_monitoring_config.yml` file in the {product-automation} directory.
+You can also modify these credentials before deploying the metrics stack.
+
+=== Proxy starts but client cannot connect (connection timeout/closed)
+
+==== Symptoms
+
+{product-proxy} logs contain:
+
+[source,log]
+----
+INFO[0000] [openTCPConnection] Opening connection to 10.0.63.163:9042
+INFO[0000] [openTCPConnection] Successfully established connection with 10.0.63.163:9042
+INFO[0000] [openTLSConnection] Opening TLS connection to 10.0.63.163:9042 using underlying TCP connection
+INFO[0000] [openTLSConnection] Successfully established connection with 10.0.63.163:9042
+INFO[0000] Successfully opened control connection to ORIGIN using endpoint 10.0.63.163:9042.
+INFO[0000] [openTCPConnection] Opening connection to 5bc479c2-c3d0-45be-bfba-25388f2caff7-us-east-1.db.astra.datastax.com:29042
+INFO[0000] [openTCPConnection] Successfully established connection with 54.84.75.118:29042
+INFO[0000] [openTLSConnection] Opening TLS connection to 211d66bf-de8d-48ac-a25b-bd57d504bd7c using underlying TCP connection
+INFO[0000] [openTLSConnection] Successfully established connection with 211d66bf-de8d-48ac-a25b-bd57d504bd7
+INFO[0000] Successfully opened control connection to TARGET using endpoint 5bc479c2-c3d0-45be-bfba-25388f2caff7-us-east-1.db.astra.datastax.com:29042-211d66bf-de8d-48ac-a25b-bd57d504bd7c.
+INFO[0000] Proxy connected and ready to accept queries on 0.0.0.0:9042
+INFO[0000] Proxy started. Waiting for SIGINT/SIGTERM to shutdown.
+INFO[0043] Accepted connection from 10.0.62.255:33808
+INFO[0043] [ORIGIN-CONNECTOR] Opening request connection to ORIGIN (10.0.63.20:9042).
+ERRO[0043] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 100ms... +ERRO[0043] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 200ms... +ERRO[0043] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 400ms... +ERRO[0043] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 800ms... +ERRO[0044] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 1.6s... +ERRO[0046] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 3.2s... +ERRO[0049] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 6.4s... +ERRO[0056] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 10s... +ERRO[0066] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 10s... +ERRO[0076] Client Handler could not be created: ORIGIN-CONNECTOR context timed out or cancelled while opening connection to ORIGIN: context deadline exceeded +---- + +==== Cause + +{product-proxy} has connectivity only to a subset of the nodes. + +The control connection (during {product-proxy} startup) cycles through the nodes until it finds one that can be connected to. +For client connections, each proxy instance cycles through its "assigned nodes" only. +_(The "assigned nodes" are a different subset of the cluster nodes for each proxy instance, generally non-overlapping between proxy instances so as to avoid any interference with the load balancing already in place at client-side driver level. +The assigned nodes are not necessarily contact points: even discovered nodes undergo assignment to proxy instances.)_ + +In the example above, {product-proxy} doesn't have connectivity to 10.0.63.20, which was chosen as the origin node for the incoming client connection, but it connected to 10.0.63.163 during startup. 
+
+==== Solution or Workaround
+
+Ensure that network connectivity exists and is stable between the {product-proxy} instances and all {cass-short}/{dse-short} nodes of the local datacenter.
+
+=== Client application driver takes too long to reconnect to a proxy instance
+
+==== Symptoms
+
+After a {product-proxy} instance has been unavailable for some time and then comes back up, the client application takes a long time to reconnect.
+
+Normally, the only reason to stop a {product-proxy} instance is a configuration change, but the proxy might have crashed, or a configuration change might have kept an instance down for a long time.
+
+==== Cause
+
+{product-proxy} does not send topology events to the client applications, so the reconnection policy determines the time required for the driver to reconnect to a {product-proxy} instance.
+
+==== Solution or Workaround
+
+Restart the client application to force an immediate reconnect.
+
+If you expect {product-proxy} instances to go down frequently, change the reconnection policy on the driver so that the interval between reconnection attempts has a shorter limit.
+
+=== Error with {astra} DevOps API when using {product-automation}
+
+==== Symptoms
+
+The {product-automation} logs contain:
+
+[source,log]
+----
+fatal: [10.255.13.6]: FAILED! => {"changed": false, "elapsed": 0, "msg": "Status code was -1 and not [200]:
+Connection failure: Remote end closed connection without response", "redirected": false, "status": -1, "url":
+"https://api.astra.datastax.com/v2/databases/REDACTED/secureBundleURL"}
+----
+
+==== Cause
+
+The {astra} DevOps API is likely temporarily unavailable.
+
+==== Solution or Workaround
+
+xref:astra-db-serverless:databases:secure-connect-bundle.adoc[Download the {astra-db} {scb}] manually and provide its path in the xref:deploy-proxy-monitoring.adoc#_core_configuration[{product-automation} configuration].
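+Before rerunning the playbook, you can check whether the DevOps API endpoint shown in the error log is reachable again.
+The following command is a hypothetical sketch: it assumes an {astra} application token in the `ASTRA_TOKEN` environment variable and your database ID in `DB_ID`.
+
+[source,bash]
+----
+# Request a temporary download URL for the secure connect bundle,
+# the same DevOps API call that the playbook makes.
+curl -sS -X POST "https://api.astra.datastax.com/v2/databases/${DB_ID}/secureBundleURL" \
+  -H "Authorization: Bearer ${ASTRA_TOKEN}"
+----
+
+A JSON response containing a download URL indicates that the API is available again and you can rerun the playbook instead of providing the bundle manually.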
+ +=== Metadata service returned not successful status code 4xx or 5xx + +==== Symptoms + +{product-proxy} doesn't start and the following appears on the proxy logs: + +[source,log] +---- +Couldn't start proxy: error initializing the connection configuration or control connection for Target: +metadata service (Astra) returned not successful status code +---- + +==== Cause + +There are two possible causes for this: + +* The credentials that {product-proxy} is using for {astra-db} don't have sufficient permissions. +* The {astra-db} database is hibernated or otherwise unavailable. + +==== Solution or Workaround + +In the {astra-ui}, check the xref:astra-db-serverless:databases:database-statuses.adoc[database status]. + +If the database is not in *Active* status, you might need to take action or wait for the database to return to active status. +For example, if the database is hibernated, xref:astra-db-serverless:databases:database-statuses.adoc#hibernated[reactivate the database]. +When the database is active again, retry the connection. + +If the database is in *Active* status, then the issue is likely due to the credentials permissions. +Try using an xref:astra-db-serverless:administration:manage-application-tokens.adoc[application token scoped to a database], specifically a token with the *Database Administrator* role for your target database. + +[[_async_read_timeouts_stream_id_map_exhausted]] +=== Async read timeouts / stream id map exhausted + +//Supposedly resolved in 2.1.0 release? + +==== Symptoms + +Dual reads are enabled and the following messages are found in the {product-proxy} logs: + +[source,log] +---- +{"log":"\u001b[33mWARN\u001b[0m[430352] Async Request (OpCode EXECUTE [0x0A]) timed out after 10000 ms. \r\n","stream":"stdout","time":"2022-10-03T17:29:42.548941854Z"} + +{"log":"\u001b[33mWARN\u001b[0m[430368] Could not find async request context for stream id 331 received from async connector. It either timed out or a protocol error occurred. 
\r\n","stream":"stdout","time":"2022-10-03T17:29:58.378080933Z"}
+
+{"log":"\u001b[33mWARN\u001b[0m[431533] Could not send async request due to an error while storing the request state: stream id map ran out of stream ids: channel was empty. \r\n","stream":"stdout","time":"2022-10-03T17:49:23.786335428Z"}
+----
+
+==== Cause
+
+The last log message is logged when the async connection runs out of stream ids.
+The async connection is a connection dedicated to async reads (the asynchronous dual reads feature).
+Running out of stream ids can be caused by timeouts (the first log message) or by the connection not being able to keep up with the load.
+
+If the log files are flooded with these messages, then it is likely that an outage caused all responses to arrive after their requests timed out (the second log message).
+In this case, the async connection might not be able to recover.
+
+==== Solution or Workaround
+
+Keep in mind that errors in the async request path (dual reads) don't affect the client application.
+These log messages can be useful to predict what might happen when reads are switched over to the target cluster, but async read errors and warnings by themselves don't impact the client.
+
+Starting in version 2.1.0, you can tune the maximum number of stream ids available per connection, which is 2048 by default.
+You can increase it to match your driver configuration through the xref:manage-proxy-instances.adoc#zdm_proxy_max_stream_ids[zdm_proxy_max_stream_ids] property.
+
+If these errors are written to the log files constantly (for minutes or even hours), then it is likely that only an application or {product-proxy} restart will fix the problem.
+If you find an issue like this, submit a {product-proxy-repo}/issues[GitHub issue].
+
+=== Client application closed connection errors every 10 minutes when migrating to {astra-db}
+
+//TODO: Remove - resolved in 2.1.0
+[NOTE]
+====
+This issue is fixed in {product-proxy} 2.1.0. See the solution below.
+====
+
+==== Symptoms
+
+Every 10 minutes, a message is logged in the {product-proxy} logs showing a disconnect that was caused by {astra-db}:
+
+[source,log]
+----
+{"log":"\u001b[36mINFO\u001b[0m[426871] [TARGET-CONNECTOR] REDACTED disconnected \r\n","stream":"stdout","time":"2022-10-01T16:31:41.48598498Z"}
+----
+
+==== Cause
+
+{astra-db} terminates idle connections after 10 minutes of inactivity.
+If a client application sends only reads through a connection to the target cluster, which is an {astra-db} database in this example, then that connection won't get any traffic because {product-short} forwards all reads to the origin connection.
+
+==== Solution or Workaround
+
+This issue is fixed in {product-proxy} 2.1.0, and we encourage you to upgrade to that version or later.
+By default, {product-proxy} now sends heartbeats after 30 seconds of inactivity on a cluster connection to keep it alive.
+You can tune the heartbeat interval with the Ansible configuration variable `heartbeat_interval_ms`, or by directly setting the `ZDM_HEARTBEAT_INTERVAL_MS` environment variable if you don't use {product-automation}.
+
+=== Performance degradation with {product-proxy}
+
+==== Symptoms
+
+Consider a case where a user runs separate benchmarks against:
+
+* {astra-db} directly
+* The origin cluster directly
+* {product-proxy} (with {astra-db} and the origin cluster)
+
+The results show that latency and throughput are worse with {product-proxy} than when connecting to {astra-db} or the origin cluster directly.
+
+==== Cause
+
+{product-short} always increases latency and, depending on the nature of the test, reduces throughput.
+Whether this performance hit is expected depends on the difference between the {product-short} test results and the test results for the cluster that performed the worst.
+
+Writes in {product-short} require successful acknowledgement from both clusters, while reads only require the result from the primary cluster, which is typically the origin cluster.
+This means that if the origin cluster has better performance than the target cluster, then {product-short} has worse write performance.
+
+It is typical for latency to increase with {product-proxy}.
+To minimize performance degradation with {product-proxy}, note the following:
+
+* Make sure your {product-proxy} infrastructure or configuration doesn't unnecessarily increase latency.
+For example, make sure your {product-proxy} instances are in the same availability zone (AZ) as your origin cluster or application instances.
+* Understand the impact of simple and batch statements on latency, as compared to typical prepared statements.
++
+Avoid simple statements with {product-proxy} because they require significant time for {product-proxy} to parse the queries.
++
+In contrast, prepared statements are parsed once and then reused on subsequent requests, if repreparation isn't required.
+
+==== Solution or Workaround
+
+If you are using simple statements, switching to prepared statements is the best first step.
+
+Increasing the number of proxies might help, but only if the VM resources (CPU, RAM, or network I/O) are near capacity.
+{product-proxy} doesn't use a lot of RAM, but it uses a lot of CPU and network I/O.
+
+Deploying the proxy instances on VMs with faster CPUs and faster network I/O might also help, but only your own tests will reveal whether it does, because the benefit depends on the workload type and details of your environment, such as network/VPC configurations and hardware.
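+The difference between simple and prepared statements can be sketched with the Java driver 4.x API.
+This is a minimal illustration, not the only way to do it: the `users` table, `first_name` column, and `userId` variable are hypothetical, and an existing `CqlSession` named `session` is assumed.
+
+[source,java]
+----
+import com.datastax.oss.driver.api.core.cql.PreparedStatement;
+import com.datastax.oss.driver.api.core.cql.SimpleStatement;
+
+// Simple statement: the full query string is sent and parsed on every
+// request, which adds parsing overhead in the proxy.
+session.execute(SimpleStatement.newInstance(
+    "SELECT first_name FROM users WHERE id = ?", userId));
+
+// Prepared statement: the query is parsed once; subsequent executions
+// reuse the prepared statement, so the proxy doesn't reparse the CQL.
+PreparedStatement prepared =
+    session.prepare("SELECT first_name FROM users WHERE id = ?");
+session.execute(prepared.bind(userId));
+----
+
+As a general practice, prepare each statement once at application startup and reuse the `PreparedStatement` object for all subsequent requests.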
+
+=== `InsightsRpc`-related permissions errors
+
+==== Symptoms
+
+{product-proxy} logs contain:
+
+[source,log]
+----
+time="2023-05-05T19:14:31Z" level=debug msg="Recording ORIGIN-CONNECTOR other error: ERROR UNAUTHORIZED (code=ErrorCode Unauthorized [0x00002100], msg=User my_user has no EXECUTE permission on or any of its parents)"
+time="2023-05-05T19:14:31Z" level=debug msg="Recording TARGET-CONNECTOR other error: ERROR SERVER ERROR (code=ErrorCode ServerError [0x00000000], msg=Unexpected persistence error: Unable to authorize statement com.datastax.bdp.cassandra.cql3.RpcCallStatement)"
+----
+
+==== Cause
+
+This can happen if the origin ({dse-short}) cluster has the Metrics Collector enabled to report metrics for {company} drivers, and `my_user` doesn't have the required permissions.
+{product-proxy} simply passes these requests through.
+
+==== Solution or Workaround
+
+There are two ways to fix this.
+
+===== Option 1: Disable {dse-short} Metrics Collector
+
+* On the origin {dse-short} cluster, run `dsetool insights_config --mode DISABLED`.
+* Run `dsetool insights_config --show_config` and ensure that `mode` has a value of `DISABLED`.
+
+===== Option 2: Grant the required permissions
+
+If disabling the Metrics Collector is not an option, use a superuser role to grant the appropriate permissions to the `my_user` role by running `GRANT EXECUTE ON REMOTE OBJECT InsightsRpc TO my_user;`.
+
+[#report-an-issue]
 == Report an issue
 
 To report an issue or get additional support, submit an issue in the {product-short} component GitHub repositories:
@@ -226,5 +707,4 @@ To determine if this feature is enabled, check the following variables:
 
 == See also
 
-* xref:ROOT:troubleshooting-scenarios.adoc[]
 * xref:ROOT:metrics.adoc[]
\ No newline at end of file

From e903ac4f68202cc57c7ccc60c1f36838e4268592 Mon Sep 17 00:00:00 2001
From: April M <36110273+aimurphy@users.noreply.github.com>
Date: Thu, 13 Nov 2025 14:14:22 -0800
Subject: [PATCH 02/11] resolve dup content from manage-proxy-instances
--- .../ROOT/pages/manage-proxy-instances.adoc | 171 +++++++----------- modules/ROOT/pages/troubleshooting-tips.adoc | 110 +++++++++-- 2 files changed, 154 insertions(+), 127 deletions(-) diff --git a/modules/ROOT/pages/manage-proxy-instances.adoc b/modules/ROOT/pages/manage-proxy-instances.adoc index 2a266e9c..0d9f9261 100644 --- a/modules/ROOT/pages/manage-proxy-instances.adoc +++ b/modules/ROOT/pages/manage-proxy-instances.adoc @@ -64,69 +64,10 @@ To avoid downtime, wait for each instance to fully restart and begin receiving t -- ====== -[#access-the-proxy-logs] -== Access the proxy logs +== Inspect {product-proxy} logs -To confirm that the {product-proxy} instances are operating normally, or investigate any issue, you can view or collect their logs. - -You can view the logs for a single proxy instance, or you can use a playbook to systematically retrieve logs from all instances and package them in a zip archive for later inspection. - -=== View the logs - -{product-proxy} runs as a Docker container on each proxy host. -Its logs can be viewed by connecting to a proxy host and running the following command. - -[source,bash] ----- -docker container logs zdm-proxy-container ----- - -To leave the logs open and continuously output the latest log messages, append the `--follow` (or `-f`) option to the command above. - -=== Collect the logs - -{product-automation} has a dedicated playbook, `collect_zdm_proxy_logs.yml`, that you can use to collect logs for all {product-proxy} instances in a deployment. - -You can view the playbook's configuration in `vars/zdm_proxy_log_collection_config.yml`, but no changes are required to run it. - -. Connect to the Ansible Control Host Docker container. -You can do this from the jumphost machine by running the following command: -+ -[source,bash] ----- -docker exec -it zdm-ansible-container bash ----- -+ -.Result -[%collapsible] -==== -[source,bash] ----- -ubuntu@52772568517c:~$ ----- -==== - -. 
Run the log collection playbook:
-+
-[source,bash]
-----
-ansible-playbook collect_zdm_proxy_logs.yml -i zdm_ansible_inventory
-----
-+
-This playbook creates a single zip file, `zdm_proxy_logs_**TIMESTAMP**.zip`, that contains the logs from all proxy instances.
-This archive is stored on the Ansible Control Host Docker container at `/home/ubuntu/zdm_proxy_archived_logs`.
-
-. To copy the archive from the container to the jumphost, open a shell on the jumphost, and then run the following command:
-+
-[source,bash,subs="+quotes"]
-----
-docker cp zdm-ansible-container:/home/ubuntu/zdm_proxy_archived_logs/zdm_proxy_logs_**TIMESTAMP**.zip **DESTINATION_DIRECTORY_ON_JUMPHOST**
-----
-+
-Replace the following:
-+
-* `**TIMESTAMP**`: The timestamp from the name of your log file archive
-* `**DESTINATION_DIRECTORY_ON_JUMPHOST**`: The path to the directory where you want to copy the archive
+{product-proxy} logs can help you verify that your {product-proxy} instances are operating normally, investigate how processes are executed, and troubleshoot issues.
+For information about configuring, retrieving, and interpreting {product-proxy} logs, see xref:ROOT:troubleshooting-tips.adoc#proxy-logs[Viewing and interpreting {product-proxy} logs].
 
 [[change-mutable-config-variable]]
 == Change a mutable configuration variable
@@ -249,59 +190,51 @@ Be aware that running the `deploy_zdm_proxy.yml` playbook results in a brief win
 [[_upgrade_the_proxy_version]]
 == Upgrade the proxy version
 
-The {product-proxy} version is displayed at startup, in a message such as `Starting {product-proxy} version ...`.
-It can also be retrieved at any time by using the `version` option as in the following command.
+The same playbook that you use for configuration changes can also be used to upgrade the {product-proxy} version in a rolling fashion.
+All containers are recreated with the given image version.
+The same behavior and observations noted in <<change-mutable-config-variable,Change a mutable configuration variable>> also apply to {product-proxy} image upgrades.
-Example: +To check your current {product-proxy} version, see xref:ROOT:troubleshooting-tips.adoc#check-version[Check your {product-proxy} version]. -[source,bash] ----- -docker run --rm datastax/zdm-proxy: -version ----- - -Here's an example for {product-proxy} 2.1.x: - -[source,bash] ----- -docker run --rm datastax/zdm-proxy:2.1.x -version ----- - -The playbook for configuration changes can also be used to upgrade the {product-proxy} version in a rolling fashion. -All containers will be recreated with the image of the specified version. -The same behavior and observations as above apply here. - -To perform an upgrade, change the version tag number to the desired version in `vars/zdm_proxy_container.yml`: - -[source,bash] +. In `vars/zdm_proxy_container.yml`, set `zdm_proxy_image` to the desired tag. +For available tags, see the https://hub.docker.com/r/datastax/zdm-proxy/tags[{product-proxy} Docker Hub repository]. ++ +[source,yaml,subs="+quotes"] ---- -zdm_proxy_image: datastax/zdm-proxy:x.y.z +zdm_proxy_image: datastax/zdm-proxy:**TAG** ---- - -Replace `x.y.z` with the version you would like to upgrade to. - -{product-proxy} example: - -[source,bash] ++ +For example: ++ +[source,yaml] ---- -zdm_proxy_image: datastax/zdm-proxy:2.1.0 +zdm_proxy_image: datastax/zdm-proxy:2.3.4 ---- -Then run the same playbook as above, with the following command: - +. Run the `rolling_update_zdm_proxy.yml` playbook: ++ [source,bash] ---- ansible-playbook rolling_update_zdm_proxy.yml -i zdm_ansible_inventory ---- -== Scale operations with {product-automation} +== Scale {product-proxy} instances +[tabs] +====== +Scale with {product-automation}:: ++ +-- {product-automation} doesn't provide a way to scale operations up or down in a rolling fashion. 
-If you are using {product-automation} and you need a larger {product-proxy} deployment, you have two options:
+If you are using {product-automation} and you need a larger {product-proxy} deployment, you can create a new deployment, or you can add instances to an existing deployment.

-Recommended: Create a new deployment::
-This is the recommended way to scale your {product-proxy} deployment because it requires no downtime.
+[tabs]
+====
+Create a new deployment (recommended)::
++
+This option is the recommended way to scale your {product-proxy} deployment because it requires no downtime.
 +
-With this option, you create a new {product-proxy} deployment, and then move your client application to it:
+Create a new {product-proxy} deployment, and then reconfigure your client application to use the new deployment:
 +
 . xref:ROOT:setup-ansible-playbooks.adoc[Create a new {product-proxy} deployment] with the desired topology on a new set of machines.
 . Change the contact points in the application configuration so that the application instances point to the new {product-proxy} deployment.
+ For example, if you want to add three nodes to a deployment with six nodes, then the amended inventory file must contain nine total IPs, including the six existing IPs and the three new IPs. - ++ . Run the `deploy_zdm_proxy.yml` playbook to apply the change and start the new instances. + Rerunning the playbook stops the existing instances, destroys them, and then creates and starts a new deployment with new instances based on the amended inventory. This results in a brief interruption of service for your entire {product-proxy} deployment. +==== +-- -== Scale {product-proxy} without {product-automation} - -If you aren't using {product-automation}, you can still add and remove {product-proxy} instances. +Scale without {product-automation}:: ++ +-- +If you aren't using {product-automation}, use these steps to add, change, or remove {product-proxy} instances. -[#add-an-instance] +[tabs] +==== Add an instance:: ++ . Prepare and configure the new {product-proxy} instances appropriately based on your other instances. + Make sure the new instance's configuration references all planned {product-proxy} cluster nodes. ++ . On all {product-proxy} instances, add the new instance's address to the `ZDM_PROXY_TOPOLOGY_ADDRESSES` environment variable. + Make sure to include all new nodes. ++ . On the new {product-proxy} instance, set the `ZDM_PROXY_TOPOLOGY_INDEX` to the next sequential integer after the greatest one in your existing deployment. ++ . Perform a rolling restart of all {product-proxy} instances, one at a time. Vertically scale existing instances:: ++ Use these steps to increase or decrease resources for existing {product-proxy} instances, such as CPU or memory. To avoid downtime, perform the following steps on one instance at a time: + . Stop the first {product-proxy} instance that you want to modify. ++ . Modify the instance's resources as required. + Make sure the instance's IP address remains the same. -If the IP address changes, you need to <>. 
+If the IP address changes, you must treat it as a new instance; follow the steps on the **Add an instance** tab. ++ . Restart the modified {product-proxy} instance. ++ . Wait until the instance starts, and then confirm that it is receiving traffic. ++ . Repeat these steps to modify each additional instance, one at a time. Remove an instance:: ++ . On all {product-proxy} instances, remove the unused instance's address from the `ZDM_PROXY_TOPOLOGY_ADDRESSES` environment variable. . Perform a rolling restart of all remaining {product-proxy} instances. . Clean up resources used by the removed instance, such as the container or VM. +==== +-- +====== -== Purpose of proxy topology addresses +== Proxy topology addresses enable failover and high availability When you configure a {product-proxy} deployment, either through {product-automation} or manually-managed {product-proxy} instances, you specify the addresses of your instances. These are populated in the `ZDM_PROXY_TOPOLOGY_ADDRESSES` variable, either manually or automatically depending on how you manage your instances. @@ -366,7 +317,7 @@ These are populated in the `ZDM_PROXY_TOPOLOGY_ADDRESSES` variable, either manua {cass-short} drivers look up nodes on a cluster by querying the `system.peers` table. {product-proxy} uses the topology addresses to effectively respond to the driver's request for connection nodes. If there are no topology addresses specified, {product-proxy} defaults to a single-instance configuration. -This means that driver connections will use only that one {product-proxy} instance, rather than all instances in your {product-proxy} deployment. +This means that driver connections use only that one {product-proxy} instance rather than all instances in your {product-proxy} deployment. If that one instance goes down, {product-proxy} won't know that there are other instances available, and your application can experience an outage. 
Additionally, if you need to restart {product-proxy} instances, and there is only one instance specified in the topology addresses, your migration will have downtime while that one instance restarts. diff --git a/modules/ROOT/pages/troubleshooting-tips.adoc b/modules/ROOT/pages/troubleshooting-tips.adoc index 7be7ae3b..ceac050a 100644 --- a/modules/ROOT/pages/troubleshooting-tips.adoc +++ b/modules/ROOT/pages/troubleshooting-tips.adoc @@ -1,16 +1,16 @@ -= Troubleshooting tips -:page-aliases: ROOT:troubleshooting.adoc += Troubleshoot {product} +:navtitle: Troubleshoot {product-short} +:page-aliases: ROOT:troubleshooting.adoc, ROOT:troubleshooting-scenarios.adoc :description: Get help with {product}. -:page-aliases: ROOT:troubleshooting-scenarios.adoc -This page provides general troubleshooting advice and describes some common issues you might encounter with {product}. +This page provides general troubleshooting advice and describes some common issues you might encounter with {product} ({product-short}). For additional assistance, you can <>, contact your {company} account representative, or contact {support-url}[{company} Support]. [#proxy-logs] -== {product-proxy} logs +== Check {product-proxy} logs -{product-proxy} logs can help you troubleshoot issues with {product}. +{product-proxy} logs can help you verify that your {product-proxy} instances are operating normally, investigate how processes are executed, and troubleshoot issues. === Set the {product-proxy} log level @@ -30,12 +30,81 @@ For more information, see xref:manage-proxy-instances.adoc#change-mutable-config * If you didn't use {product-automation} to deploy {product-proxy}, set the `ZDM_LOG_LEVEL` environment variable on each proxy instance and then restart each instance. -=== Retrieve the {product-proxy} log files - -//TODO: Reconcile with manage-proxy-instance.adoc content. 
+=== Get {product-proxy} log files If you used {product-automation} to deploy {product-proxy}, then you can get logs for a single proxy instance, and you can use a playbook to retrieve logs for all instances. -For instructions and more information, see xref:ROOT:manage-proxy-instances.adoc#access-the-proxy-logs[Access the proxy logs]. + +[tabs] +====== +Single-instance logs:: ++ +-- +-- + +Multi-instance logs:: ++ +-- +-- +====== + +//// +=== View the logs + +{product-proxy} runs as a Docker container on each proxy host. +Its logs can be viewed by connecting to a proxy host and running the following command. + +[source,bash] +---- +docker container logs zdm-proxy-container +---- + +To leave the logs open and continuously output the latest log messages, append the `--follow` (or `-f`) option to the command above. + +=== Collect the logs + +{product-automation} has a dedicated playbook, `collect_zdm_proxy_logs.yml`, that you can use to collect logs for all {product-proxy} instances in a deployment. + +You can view the playbook's configuration in `vars/zdm_proxy_log_collection_config.yml`, but no changes are required to run it. + +. Connect to the Ansible Control Host Docker container. +You can do this from the jumphost machine by running the following command: ++ +[source,bash] +---- +docker exec -it zdm-ansible-container bash +---- ++ +.Result +[%collapsible] +==== +[source,bash] +---- +ubuntu@52772568517c:~$ +---- +==== + +. Run the log collection playbook: ++ +[source,bash] +---- +ansible-playbook collect_zdm_proxy_logs.yml -i zdm_ansible_inventory +---- ++ +This playbook creates a single zip file, `zdm_proxy_logs_**TIMESTAMP**.zip`, that contains the logs from all proxy instances. +This archive is stored on the Ansible Control Host Docker container at `/home/ubuntu/zdm_proxy_archived_logs`. + +. 
To copy the archive from the container to the jumphost, open a shell on the jumphost, and then run the following command: ++ +[source,bash,subs="+quotes"] +---- +docker cp zdm-ansible-container:/home/ubuntu/zdm_proxy_archived_logs/zdm_proxy_logs_**TIMESTAMP**.zip **DESTINATION_DIRECTORY_ON_JUMPHOST** +---- ++ +Replace the following: ++ +* `**TIMESTAMP**`: The timestamp from the name of your log file archive +* `**DESTINATION_DIRECTORY_ON_JUMPHOST**`: The path to the directory where you want to copy the archive +//// If you did not use {product-automation} to deploy {product-proxy}, you might have to access the logs another way. For example, if you used Docker, you can use the following command to export a container's logs to a `log.txt` file: @@ -117,8 +186,7 @@ Instead, this is a normal part of protocol version negotiation (handshake) durin [#check-version] == Check your {product-proxy} version -//TODO: Possibly duplicated on manage-proxy-instances.html#_upgrade_the_proxy_version -In the {product-proxy} logs, the first message contains the version string: +The {product-proxy} version is printed at startup, and it is the first message in the logs, immediately before the long `Parsed configuration` string: [source,console] ---- @@ -126,16 +194,24 @@ time="2023-01-13T13:37:28+01:00" level=info msg="Starting ZDM proxy version 2.1. time="2023-01-13T13:37:28+01:00" level=info msg="Parsed configuration: ..." ---- -This message is logged immediately before the long `Parsed configuration` string. - You can also pass the `-version` flag to {product-proxy} to print the version. 
-For example, you can use the following Docker command: +For example, you can use the following Docker command, replacing `**TAG**` with the `zdm_proxy_image` tag set in `vars/zdm_proxy_container.yml`: -[source,bash] +[source,bash,subs="+quotes"] +---- +docker run --rm datastax/zdm-proxy:**TAG** -version +---- + +.Result +[%collapsible] +==== +The output shows the binary version of {product-proxy} that is currently running: + +[source,console] ---- -docker run --rm datastax/zdm-proxy:2.x -version ZDM proxy version 2.1.0 ---- +==== [IMPORTANT] ==== From 435b7e965e9ca1017190c1bf76d6968a3ab7fa9a Mon Sep 17 00:00:00 2001 From: April M <36110273+aimurphy@users.noreply.github.com> Date: Thu, 13 Nov 2025 16:35:20 -0800 Subject: [PATCH 03/11] revising troubleshooting advice --- modules/ROOT/pages/troubleshooting-tips.adoc | 356 ++++++++----------- 1 file changed, 153 insertions(+), 203 deletions(-) diff --git a/modules/ROOT/pages/troubleshooting-tips.adoc b/modules/ROOT/pages/troubleshooting-tips.adoc index ceac050a..e2a31693 100644 --- a/modules/ROOT/pages/troubleshooting-tips.adoc +++ b/modules/ROOT/pages/troubleshooting-tips.adoc @@ -5,7 +5,7 @@ This page provides general troubleshooting advice and describes some common issues you might encounter with {product} ({product-short}). -For additional assistance, you can <>, contact your {company} account representative, or contact {support-url}[{company} Support]. +For additional assistance, you can <>, contact your {company} account representative, or contact {support-url}[{company} Support]. [#proxy-logs] == Check {product-proxy} logs @@ -36,32 +36,31 @@ If you used {product-automation} to deploy {product-proxy}, then you can get log [tabs] ====== -Single-instance logs:: +View or tail logs for one instance:: + -- --- - -Multi-instance logs:: -+ --- --- -====== - -//// -=== View the logs - {product-proxy} runs as a Docker container on each proxy host. 
-Its logs can be viewed by connecting to a proxy host and running the following command. + +To view the logs for a single {product-proxy} instance, connect to a proxy host, and then run the following command: [source,bash] ---- docker container logs zdm-proxy-container ---- -To leave the logs open and continuously output the latest log messages, append the `--follow` (or `-f`) option to the command above. +To tail (stream) the logs as they are written, use the `--follow` (`-f`) option: -=== Collect the logs +[source,bash] +---- +docker container logs zdm-proxy-container -f +---- + +Keep in mind that Docker logs are deleted if the container is recreated. +-- +Collect logs for multiple instances:: ++ +-- {product-automation} has a dedicated playbook, `collect_zdm_proxy_logs.yml`, that you can use to collect logs for all {product-proxy} instances in a deployment. You can view the playbook's configuration in `vars/zdm_proxy_log_collection_config.yml`, but no changes are required to run it. @@ -104,9 +103,13 @@ Replace the following: + * `**TIMESTAMP**`: The timestamp from the name of your log file archive * `**DESTINATION_DIRECTORY_ON_JUMPHOST**`: The path to the directory where you want to copy the archive -//// +-- + +Get logs for deployments that don't use {product-automation}:: ++ +-- +If you didn't use {product-automation} to deploy {product-proxy}, you must access the logs another way, depending on your deployment configuration and infrastructure. -If you did not use {product-automation} to deploy {product-proxy}, you might have to access the logs another way. For example, if you used Docker, you can use the following command to export a container's logs to a `log.txt` file: [source,bash] @@ -115,31 +118,33 @@ docker logs my-container > log.txt ---- Keep in mind that Docker logs are deleted if the container is recreated. +-- +====== === Message levels -Some log messages contain text that sounds like an error, but they are not errors. 
-The message's `level` typically indicates severity:
+Some log messages contain text that seems like an error, but they aren't errors.
+Instead, the message's `level` indicates severity:

-* `level=debug` and `level=info`: Expected and normal messages that are typically not errors.
-However, if you enable `DEBUG` logging, `debug` messages can help you find the source of a problem.
+* `level=debug` and `level=info`: Expected and normal messages that typically aren't errors.
++
+If you enable `DEBUG` logging, the `debug` messages can help you find the source of a problem by providing information about the environment and conditions when the error occurred.

-* `level=warn`: Reports an event that wasn't fatal to the overall process, but could indicate an issue with an individual request or connection.
+* `level=warn`: Reports an event that wasn't fatal to the overall process but might indicate an issue with an individual request or connection.

 * `level=error`: Indicates an issue with {product-proxy}, the client application, or the clusters.
 These messages require further examination.

-If the meaning of a `warn` or `error` message isn't clear, you can submit an issue in the {product-proxy-repo}/issues[{product-proxy} GitHub repository].
+If the meaning of a `warn` or `error` message isn't clear, you can <>.

 === Common log messages

-Here are the most common messages in the {product-proxy} logs.
-
-==== {product-proxy} startup message
+Here are some of the most common messages in the {product-proxy} logs.

+{product-proxy} startup message::
 If the log level doesn't filter out `info` entries, you can look for a `Proxy started` log message to verify that {product-proxy} started correctly.
 For example:
-
++
 [source,json]
 ----
 {"log":"time=\"2023-01-13T11:50:48Z\" level=info
@@ -147,12 +152,11 @@ msg=\"Proxy started. Waiting for SIGINT/SIGTERM to shutdown.
\"\n","stream":"stderr","time":"2023-01-13T11:50:48.522097083Z"} ---- -==== {product-proxy} configuration message - +{product-proxy} configuration message:: If the log level doesn't filter out `info` entries, the first few lines of a {product-proxy} log file contain all configuration variables and values in a long JSON string. - -For example, this log message has been truncated for readability: - ++ +The following example log message is truncated for readability: ++ [source,json] ---- {"log":"time=\"2023-01-13T11:50:48Z\" level=info @@ -160,18 +164,17 @@ msg=\"Parsed configuration: {\\\"ProxyIndex\\\":1,\\\"ProxyAddresses\\\":"...", ...TRUNCATED... ","stream":"stderr","time":"2023-01-13T11:50:48.339225051Z"} ---- - ++ Configuration settings can help with troubleshooting. - ++ To make this message easier to read, pass it through a JSON formatter or paste it into a text editor that can reformat JSON. -==== Protocol log messages - +Protocol log messages:: There are cases where protocol errors are fatal, and they will kill an active connection that was being used to serve requests. However, it is also possible to get normal protocol log messages that contain wording that sounds like an error. - ++ For example, the following `DEBUG` message contains the phrases `force a downgrade` and `unsupported protocol version`, which can sound like errors: - ++ [source,json] ---- {"log":"time=\"2023-01-13T12:02:12Z\" level=debug msg=\"[TARGET-CONNECTOR] @@ -179,7 +182,7 @@ Protocol v5 detected while decoding a frame. Returning a protocol message to the client to force a downgrade: PROTOCOL (code=Code Protocol [0x0000000A], msg=Invalid or unsupported protocol version (5)).\"\n","stream":"stderr","time":"2023-01-13T12:02:12.379287735Z"} ---- - ++ However, `level=debug` indicates that this is not an error. Instead, this is a normal part of protocol version negotiation (handshake) during connection initialization. 
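The `Parsed configuration` payload described above can also be unescaped and pretty-printed with a few lines of script instead of a manual paste into an editor. The following is a minimal illustrative sketch, not part of the {product} tooling; it assumes a log line in the `time=... level=info msg="Parsed configuration: {...}"` format shown earlier, with quotes inside `msg` escaped as `\"`:

```python
import json

def format_parsed_config(log_line):
    """Extract and pretty-print the JSON payload of a 'Parsed configuration' log line."""
    marker = "Parsed configuration: "
    start = log_line.index(marker) + len(marker)
    payload = log_line[start:].rstrip().rstrip('"')  # drop the closing quote of msg="..."
    payload = payload.replace('\\"', '"')            # undo the msg-level quote escaping
    return json.dumps(json.loads(payload), indent=2)

# Hypothetical log line in the format shown above, truncated to two settings
line = 'time="2023-01-13T11:50:48Z" level=info msg="Parsed configuration: {\\"ProxyIndex\\":1,\\"ProxyAddresses\\":\\"10.0.0.1\\"}"'
print(format_parsed_config(line))
```

The same approach works on lines copied from `docker container logs` output; only the `marker` lookup and the unescaping step depend on the exact log format.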
@@ -219,11 +222,11 @@ Don't use `--rm` when you launch the {product-proxy} container. This flag will prevent you from accessing the logs when {product-proxy} stops or crashes. ==== -== Query system.peers and system.local to check for {product-proxy} configuration issues +== Query system.peers and system.local to check for configuration issues Querying `system.peers` and `system.local` can help you investigate {product-proxy} configuration issues: -. xref:ROOT:connect-clients-to-proxy.adoc#connect-the-cql-shell-to-zdm-proxy[Connect cqlsh to a {product-proxy} instance]. +. xref:ROOT:connect-clients-to-proxy.adoc#connect-cqlsh-to-zdm-proxy[Connect cqlsh to a {product-proxy} instance]. . Query `system.peers`: + @@ -241,47 +244,38 @@ SELECT * FROM system.local . Repeat for each of your {product-proxy} instances. + -Because `system.peers` and `system.local` reflect the local {product-proxy} instance's configuration, you need to query all instances to get all information and identify potential misconfigurations. +Because `system.peers` and `system.local` reflect the local {product-proxy} instance's configuration, you must query all instances to get all information and identify potential misconfigurations. -. Inspect the results for values related to an error that you are troubleshooting, such as IP addresses or tokens. +. Compare the results from each instance by searching for values related to an error that you are troubleshooting, such as IP addresses or tokens. + -For example, you might compare `cluster_name` to ensure that all instances are connected to the same cluster, rather than mixing contact points from different clusters. +For example, you might compare `cluster_name` to ensure that all instances are connected to the same cluster rather than mixing contact points from different clusters. == Troubleshooting scenarios -//TODO: use same format as driver troubleshooting. -//TODO: Remove or hide issues that have been resolved by a later release. 
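Once you've collected the query results from every instance, the comparison in the final step can be scripted rather than done by eye. The following is a minimal sketch with hypothetical instance IPs and cluster names; it isn't part of {product-automation}, and it assumes you've already recorded each instance's `cluster_name` from `system.local`:

```python
from collections import Counter

def find_mismatches(cluster_name_by_instance):
    """Return instances whose system.local cluster_name differs from the majority value."""
    counts = Counter(cluster_name_by_instance.values())
    expected, _ = counts.most_common(1)[0]
    return {ip: name for ip, name in cluster_name_by_instance.items() if name != expected}

# Hypothetical values gathered by running SELECT cluster_name FROM system.local
# against each proxy instance.
results = {
    "172.18.10.1": "origin_cluster",
    "172.18.10.2": "origin_cluster",
    "172.18.10.3": "target_cluster",  # likely misconfigured contact points
}
print(find_mismatches(results))  # {'172.18.10.3': 'target_cluster'}
```

The same pattern applies to any other per-instance value you want to compare, such as tokens or listed peer addresses.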
-
-This page provides troubleshooting advice for specific issues or error messages related to {product}.
-
-Each section includes symptoms, causes, and suggested solutions or workarounds.
+The following sections provide troubleshooting advice for specific issues or error messages related to {product}.

-=== Configuration changes are not being applied by the automation
+=== Configuration changes aren't applied by {product-automation}

-==== Symptoms
-
-You changed the values of some configuration variables in the automation and then rolled them out using the `rolling_update_zdm_proxy.yml` playbook, but these changes are not taking effect on your {product-proxy} instances.
-
-==== Cause
+If you change some configuration variables and then perform a rolling restart with the `rolling_update_zdm_proxy.yml` playbook, you might notice that some changes aren't applied to your {product-proxy} instances.

-The {product-proxy} configuration comprises a number of variables, but only a subset of these can be changed on an existing deployment in a rolling fashion.
-The variables that can be changed with a rolling update are listed xref:manage-proxy-instances.adoc#change-mutable-config-variable[here].
+Typically, this happens because you modified an immutable configuration variable.

-All other configuration variables excluded from the list above are considered immutable and can only be changed by a redeployment.
-This is by design: immutable configuration variables should not be changed after finalizing the deployment prior to starting the migration, so allowing them to be changed through a rolling update would risk accidentally propagating some misconfiguration that could compromise the deployment's integrity.
+Not all {product-proxy} configuration variables can be changed after deployment, with or without a rolling restart.
+For a list of variables that you can change on a live deployment, see xref:manage-proxy-instances.adoc#change-mutable-config-variable[Change a mutable configuration variable].

-==== Solution or Workaround
-
-To change the value of configuration variables that are considered immutable, simply run the `deploy_zdm_proxy.yml` playbook again.
-This playbook can be run as many times as necessary and will just recreate the entire {product-proxy} deployment from scratch with the provided configuration.
-This doesn't happen in a rolling fashion: the existing {product-proxy} instances are torn down all at the same time prior to being recreated, resulting in a brief window in which the whole {product-proxy} deployment will become unavailable.
+Any configuration variables excluded from the mutable variables list are considered _immutable_, and you must fully redeploy your instances to change them.
+This is by design because immutable configuration variables store values that must not change between the time that you finalize the deployment and start the migration.
+Allowing these values to change from a rolling restart could propagate a misconfiguration and compromise the deployment's integrity.

+If you change the value of an immutable configuration variable, you must run the `deploy_zdm_proxy.yml` playbook again.
+You can run this playbook as many times as needed.
+Each time, {product-automation} recreates your entire {product-proxy} deployment with the new configuration.
+However, this doesn't happen in a rolling fashion: The existing {product-proxy} instances are torn down simultaneously, and then they are recreated.
+This results in a brief period of downtime where the entire {product-proxy} deployment is unavailable.
-=== Unsupported protocol version error on the client application
-
-==== Symptoms
+=== Client application throws unsupported protocol version error

-In the logs for the Java driver 4.x series, the following issues can manifest during session initialization, or after initialization.
+If you are running version 4.0 to 4.9 of the {cass-short} Java driver, the following errors can occur during or after session initialization:

 [source,log]
 ----
@@ -296,27 +290,34 @@ In the logs for the Java driver 4.x series, the following issues can manifest du
 [s0] Failed to connect with protocol DSE_V2, retrying with DSE_V1
 ----

-==== Cause
+These errors are caused by a Java driver bug that was resolved in version 4.10.0.

-https://datastax-oss.atlassian.net/browse/JAVA-2905[JAVA-2905] is a driver bug that manifests itself in this way. It affects Java driver 4.x, and was fixed on the 4.10.0 release.

-==== Solution or Workaround
+To resolve this issue, do one of the following:

-If you are using spring boot and/or spring-data-cassandra then an upgrade of these dependencies will be necessary to a version that has the java driver fix.
+* If your application uses any dependency that includes a version of the Java driver, such as Spring Boot or `spring-data-cassandra`, you must upgrade these dependencies to a version that uses Java driver 4.10.0 or later.

-Alternatively, you can force the protocol version on the driver to the max supported version by both clusters.
+* If you are using the Java driver directly, upgrade to version 4.10.0 or later, provided that version is compatible with both your source and target clusters.

-V4 is a good recommendation that usually fits all but if the user is migrating from {dse-short} to {dse-short} then DSE_V1 should be used for {dse-short} 5.x and DSE_V2 should be used for {dse-short} 6.x.
+* Force the protocol version on the driver to the highest version that is supported by both your source and target clusters.
+Typically, `V4` is broadly supported.
+However, if you are migrating from {dse-short} to {dse-short}, then use `DSE_V1` for {dse-short} 5.x migrations, and `DSE_V2` for {dse-short} 6.x migrations.
++
+For more information, see the documentation for your version of the Java driver:
++
+** https://apache.github.io/cassandra-java-driver/4.19.0/core/native_protocol/?h=controlling#controlling-the-protocol-version[{cass-reg} Java driver 4.18 and later: Controlling the protocol version]
+** https://docs.datastax.com/en/developer/java-driver/latest/manual/core/native_protocol/index.html#controlling-the-protocol-version[{company} Java driver 4.17 and earlier: Controlling the protocol version]

-To force the protocol version on the Java driver, see the documentation for your version of the Java driver:
+=== Logs report protocol errors but clients connect successfully

-* https://apache.github.io/cassandra-java-driver/4.19.0/core/native_protocol/?h=controlling#controlling-the-protocol-version[{cass-reg} Java driver 4.18 and later: Controlling the protocol version]
-* https://docs.datastax.com/en/developer/java-driver/latest/manual/core/native_protocol/index.html#controlling-the-protocol-version[{company} Java driver 4.17 and earlier: Controlling the protocol version]
+`PROTOCOL ERROR` messages in {product-proxy} logs are a normal part of the handshake process while the protocol version is being negotiated.

-=== Protocol errors in the proxy logs but clients can connect successfully
+These messages indicate that a protocol version downgrade happened because {product-proxy} or one of the clusters doesn't support the version requested by the client.

-==== Symptoms
+`V5` downgrades are enforced by {product-proxy}.
+Any other downgrade results from a request by a cluster that doesn't support the version that the client requested.
+{product-proxy} supports `V3`, `V4`, `DSE_V1`, and `DSE_V2`.
-{product-proxy} logs contain: +In the following example, notice that the `PROTOCOL ERROR` message is introduced by `level=debug`, indicating that it isn't a true error: [source,log] ---- @@ -325,26 +326,17 @@ Returning a protocol error to the client to force a downgrade: ERROR PROTOCOL ER msg=Invalid or unsupported protocol version (5)).\"\n","stream":"stderr","time":"2022-07-20T12:02:12.379287735Z"} ---- -==== Cause - -Protocol errors like these are a normal part of the handshake process where the protocol version is being negotiated. -These protocol version downgrades happen when either {product-proxy} or at least one of the clusters doesn't support the version requested by the client. - -V5 downgrades are enforced by {product-proxy} but any other downgrade is requested by one of the clusters when they don't support the version that the client requested. -The proxy supports V3, V4, DSE_V1 and DSE_V2. - -==== Solution or Workaround - -These log messages are informative only (log level `DEBUG`). - -If you find one of these messages with a higher log level (especially `level=error`) then there might be a bug. -At that point the issue will need to be investigated by the {product-short} team. -This log message with a log level of `ERROR` means that the protocol error occurred after the handshake, and this is a fatal unexpected error that results in a disconnect for that particular connection. +`PROTOCOL ERROR` messages recorded at a higher log level, especially `level=error`, might indicate a bug because this means that the error occurred outside of the handshake process. +This is a fatal unexpected error that terminates the connection. +If you observe this behavior in your logs, <> so the issue can be investigated by the {product-short} team. 
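When reviewing collected logs for messages like these, you can tally entries by `level` instead of reading line by line, which makes stray `level=error` entries easy to spot. The following is a minimal illustrative sketch, not part of the {product} tooling; it assumes Docker's default json-file log format, where each line is a JSON object whose `log` field holds the proxy's `level=...` message, and the sample lines are hypothetical:

```python
import json
import re

LEVEL_RE = re.compile(r"level=(\w+)")

def count_levels(lines):
    """Tally log levels in Docker json-file formatted proxy logs."""
    counts = {}
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or non-JSON lines
        match = LEVEL_RE.search(record.get("log", ""))
        if match:
            level = match.group(1)
            counts[level] = counts.get(level, 0) + 1
    return counts

# Hypothetical json-file log lines
sample = [
    '{"log":"time=\\"2023-01-13T11:50:48Z\\" level=info msg=\\"Proxy started.\\"","stream":"stderr"}',
    '{"log":"time=\\"2023-01-13T12:02:12Z\\" level=debug msg=\\"Protocol v5 detected\\"","stream":"stderr"}',
    '{"log":"time=\\"2023-01-13T12:05:00Z\\" level=error msg=\\"could not open control connection\\"","stream":"stderr"}',
]
print(count_levels(sample))  # {'info': 1, 'debug': 1, 'error': 1}
```

Any nonzero `error` count is worth investigating, and a large `warn` count can point to recurring per-request issues.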
-=== Error during proxy startup: `Invalid or unsupported protocol version: 3`
+=== Proxy fails to start due to invalid or unsupported protocol version

-If the {product-proxy} logs contain the following type of output, it indicates that one of the origin clusters doesn't support at least V3 (e.g. {cass-short} 2.0, {dse-short} 4.6), and {product-short} cannot be used for that migration.
+If the {product-proxy} logs contain `debug` messages with `Invalid or unsupported protocol version: 3`, this means that the origin cluster doesn't support protocol version `V3` or later.

+.Invalid or unsupported protocol version logs
+[%collapsible]
+====
 [source,log]
 ----
 time="2022-10-01T19:58:15+01:00" level=info msg="Starting proxy..."
@@ -372,8 +364,10 @@ time="2022-10-01T19:58:15+01:00" level=debug msg="Shutting down the schedulers a
 time="2022-10-01T19:58:15+01:00" level=info msg="Proxy shutdown complete."
 time="2022-10-01T19:58:15+01:00" level=error msg="Couldn't start proxy, retrying in 2.229151525s: failed to initialize origin control connection: could not open control connection to ORIGIN, tried endpoints: [127.0.0.1:9042]."
 ----
+====

-The control connections of {product-proxy} don't perform protocol version negotiation, they only attempt to use protocol version 3.
+Specifically, this happens with {cass-short} 2.0 and {dse-short} 4.6.
+{product-short} cannot be used for these migrations because the {product-proxy} control connections don't perform protocol version negotiation; they only attempt to use `V3`.
+* Target cluster: Credentials that you set in the {product-proxy} configuration through the `ZDM_TARGET_USERNAME` and `ZDM_TARGET_PASSWORD` settings. -* Origin: credentials that you set in the proxy configuration through the `ZDM_ORIGIN_USERNAME` and `ZDM_ORIGIN_PASSWORD` settings. +* Origin cluster: Credentials that you set in the {product-proxy} configuration through the `ZDM_ORIGIN_USERNAME` and `ZDM_ORIGIN_PASSWORD` settings. -* Client: credentials that the client application sends to the proxy during the connection handshake, these are set in the application configuration, not the proxy configuration. +* Client application: Credentials that the client application sends to the proxy during the connection handshake. +These are set in the application configuration, not the proxy configuration. Authentication errors mean that at least one of these three sets of credentials is incorrect or has insufficient permissions. -If the authentication error is preventing the proxy from starting then it's either the origin or target credentials that are incorrect or have insufficient permissions. -The log message shows whether it is the origin or target handshake that is failing. +If the authentication error prevents {product-proxy} from starting, then the issue is in the origin or target cluster credentials. +The log message shows whether the origin or target handshake failed. -If the proxy is able to start up, and you can see the following message in the logs: `Proxy started. Waiting for SIGINT/SIGTERM to shutdown`, then the authentication error is happening when a client application tries to open a connection to the proxy. -In this case, the issue is with the client credentials. -The application itself is using invalid credentials, such as an incorrect username/password, expired token, or insufficient permissions. +If {product-proxy} starts but the logs contain a message like `Proxy started. 
Waiting for SIGINT/SIGTERM to shutdown`, then the authentication error occurs when a client application tries to open a connection to the proxy. +This means that the client application itself has invalid credentials, such as an incorrect username/password, expired token, or insufficient permissions. -Note that the proxy startup message has log level `INFO`, so if the configured log level on the proxy is `warning` or `error`, you must rely on other ways to know whether {product-proxy} started correctly. -You can check if the docker container is running (or process if docker isn't being used) or if there is a log message similar to `Error launching proxy`. +[TIP] +==== +Proxy startup messages are reported at `level=info`. +If your configured log level is `warning` or `error`, you won't see these messages in the logs, and you must use another method to determine if {product-proxy} started correctly. +For example, you can check if the {product-proxy} process or Docker container is running, or check for log messages like `Error launching proxy`. +==== -=== {product-proxy} listens on a custom port, and all applications are able to connect to one proxy instance only +=== {product-proxy} listens on a custom port, and all applications connect to one proxy instance only -==== Symptoms - -{product-proxy} is listening on a custom port (not 9042) and: +If {product-proxy} is listening on a custom port (not 9042), you might see either of the following issues: * The Grafana dashboard shows only one proxy instance receiving all the connections from the application. -* Only one proxy instance has log messages such as `level=info msg="Accepted connection from 10.4.77.210:39458"`. - -==== Cause +* Only one proxy instance has log messages like `level=info msg="Accepted connection from 10.4.77.210:39458"`. -The application is specifying the custom port as part of the contact points using the format -`:`. 
+This happens because the application specifies the custom port as part of the contact points using the format +`**PROXY_IP_ADDRESS**:**CUSTOM_PORT**`. -For example, using the Java driver, if the {product-proxy} instances were listening on port 14035, this would look like: +For example, if the {product-proxy} instances were listening on port 14035, the contact points for the {cass-short} Java driver might be specified as `.addContactPoints("172.18.10.36:14035", "172.18.11.48:14035", "172.18.12.61:14035")`. -`.addContactPoints("172.18.10.36:14035", "172.18.11.48:14035", "172.18.12.61:14035")` +The contact point is the first point of contact to the cluster, but the driver discovers the rest of the nodes through CQL queries. +However, this discovery process finds the addresses only, not the ports. +The driver uses the addresses it discovers with the port that is configured at startup. +As a result, the custom port is used for the initial contact point only, and the default port is used with all other nodes. -The contact point is used as the first point of contact to the cluster, but the driver discovers the rest of the nodes via CQL queries. -However, this discovery process doesn't discover the ports, just the addresses so the driver uses the addresses it discovers with the port that is configured at startup. - -As a result, port 14035 will only be used for the contact point initially discovered, while for all other nodes the driver will attempt to use the default 9042 port. - -==== Solution or Workaround - -In the application, ensure that the custom port is explicitly indicated using the `.withPort()` API. In the above example: +To resolve this issue, ensure that the custom port is explicitly set in your application. +The way that you do this depends on your driver language and version. 
+For example, for the Java driver, use `.withPort(**CUSTOM_PORT**)`: [source,java] ---- @@ -432,12 +424,9 @@ In the application, ensure that the custom port is explicitly indicated using th .withPort(14035) ---- +=== Proxy logs contain SyntaxError no viable alternative at input 'CALL' -=== Syntax error "no viable alternative at input 'CALL'" in proxy logs - -==== Symptoms - -{product-proxy} logs contain: +{product-proxy} log messages such as the following indicate that the server doesn't recognize the word "CALL" in the query string, which typically means that it is a remote procedure call (RPC): [source,log] ---- @@ -446,41 +435,38 @@ ERROR SYNTAX ERROR (code=ErrorCode SyntaxError [0x00002000], msg=line 1:0 no via at input 'CALL' ([CALL]...))\"\n","stream":"stderr","time":"2022-07-20T13:10:47.322882877Z"} ---- -==== Cause - -The log message indicates that the server doesn't recognize the word “CALL” in the query string which most likely means that it is an RPC (remote procedure call). -From the proxy logs alone, it is not possible to see what method is being called by the query but it's very likely the RPC that the drivers use to send {dse-short} Insights data to the server. +From the proxy logs alone, you cannot determine which method was called by the query, but it's typically the RPC that {cass-short} drivers use to send {dse-short} Insights data to the server. -Most {company}-compatible drivers have {dse-short} Insights reporting enabled by default when they detect a server version that supports it (regardless of whether the feature is enabled on the server side or not). +Most {company}-compatible drivers have {dse-short} Insights reporting enabled by default when they detect a server version that supports it, even if the feature is disabled on the server side. The driver might also have it enabled for {astra-db} depending on what server version {astra-db} is returning for queries involving the `system.local` and `system.peers` tables. 
-==== Solution or Workaround - These log messages are harmless, but if you need to remove them, you can disable {dse-short} Insights in the driver configuration. For example, in the Java driver, you can set `https://github.com/apache/cassandra-java-driver/blob/4.x/core/src/main/resources/reference.conf#L1365[advanced.monitor-reporting]` to `false`. === Default Grafana credentials don't work -==== Symptoms - -Consider a case where you deploy the metrics component of our {product-automation}, a Grafana instance is deployed but you cannot login using the usual default `admin/admin` credentials. +When you deploy the {product-automation} metrics component, a Grafana instance is deployed that _doesn't_ use Grafana's default `admin`/`admin` credentials. -==== Cause +Instead, {product-automation} specifies a custom set of credentials. -{product-automation} specifies a custom set of credentials instead of relying on the `admin/admin` ones that are typically the default for Grafana deployments. +You can find the credentials for your {product-automation} Grafana instance in the `vars/zdm_monitoring_config.yml` file in the {product-automation} directory. +You can also modify these credentials before deploying the metrics stack. -==== Solution or Workaround +=== Proxy starts but client cannot connect (connection timed out or connection closed) -Check the credentials that are being used by looking up the `vars/zdm_monitoring_config.yml` file on the {product-automation} directory. -These credentials can also be modified before deploying the metrics stack. +If the {product-proxy} logs contain messages like `Couldn't connect to`, `context timed out or cancelled while opening connection`, and `context deadline exceeded`, it can indicate that the {product-proxy} couldn't establish a connection with a particular node. -=== Proxy starts but client cannot connect (connection timeout/closed) +This can happen because {product-proxy} has connectivity to a specific subset of the nodes. 
+The control connection, which is established during {product-proxy} startup, cycles through the nodes until it finds one that it can connect to successfully. -==== Symptoms For client connections, each proxy instance cycles through its assigned nodes only. +Each proxy instance has a different group of _assigned nodes_, which are a subset of the cluster nodes. +Generally, these are unique for each proxy instance to avoid interference with load balancing that is already in place at the client-side driver level. +The assigned nodes aren't necessarily contact points: Even discovered nodes undergo assignment to proxy instances. -{product-proxy} log contains: +In the following example, {product-proxy} doesn't have connectivity to `10.0.63.20`, which was chosen as the origin node for the incoming client connection, but it connected to `10.0.63.163` during startup: -[source] +[source,log] ---- INFO[0000] [openTCPConnection] Opening connection to 10.0.63.163:9042 INFO[0000] [openTCPConnection] Successfully established connection with 10.0.63.163:9042 @@ -508,44 +494,21 @@ ERRO[0066] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, r ERRO[0076] Client Handler could not be created: ORIGIN-CONNECTOR context timed out or cancelled while opening connection to ORIGIN: context deadline exceeded ---- -==== Cause -{product-proxy} has connectivity only to a subset of the nodes. -The control connection (during {product-proxy} startup) cycles through the nodes until it finds one that can be connected to. -For client connections, each proxy instance cycles through its "assigned nodes" only. -_(The "assigned nodes" are a different subset of the cluster nodes for each proxy instance, generally non-overlapping between proxy instances so as to avoid any interference with the load balancing already in place at client-side driver level.
-The assigned nodes are not necessarily contact points: even discovered nodes undergo assignment to proxy instances.)_ -In the example above, {product-proxy} doesn't have connectivity to 10.0.63.20, which was chosen as the origin node for the incoming client connection, but it connected to 10.0.63.163 during startup. -==== Solution or Workaround -Ensure that network connectivity exists and is stable between the {product-proxy} instances and all {cass-short} / {dse-short} nodes of the local datacenter. +To avoid this issue, ensure that a stable network connection exists between the {product-proxy} instances and all nodes of your origin and target clusters in the client application's local datacenter. === Client application driver takes too long to reconnect to a proxy instance -==== Symptoms -After a {product-proxy} instance has been unavailable for some time and it gets back up, the client application takes too long to reconnect. -There should never be a reason to stop a {product-proxy} instance other than a configuration change, but maybe the proxy crashed or the user tried to do a configuration change and took a long time to get the {product-proxy} instance back up. -==== Cause -{product-proxy} does not send topology events to the client applications, so the reconnection policy determines the time required for the driver to reconnect to a {product-proxy} instance. -==== Solution or Workaround +When a {product-proxy} instance comes back online after being unavailable for some time, your client application might take too long to reconnect. -Restart the client application to force an immediate reconnect. +{product-proxy} doesn't send topology events to the client application, so the driver's reconnection policy determines the time required for the driver to reconnect to the {product-proxy} instance.
-If you expect {product-proxy} instances to go down frequently, change the reconnection policy on the driver so that the interval between reconnection attempts has a shorter limit. +You can restart the client application to force an immediate reconnection attempt. -=== Error with {astra} DevOps API when using {product-automation} +If you expect {product-proxy} instances to go down frequently, change the driver's reconnection policy to shorten the interval between reconnection attempts. -==== Symptoms +=== {astra} DevOps API errors when using {product-automation} -{product-automation}'s logs: +{product-automation} logs might report errors that contain your {astra} DevOps API endpoint: [source,log] ---- @@ -554,19 +517,12 @@ Connection failure: Remote end closed connection without response", "redirected" "https://api.astra.datastax.com/v2/databases/REDACTED/secureBundleURL"} ---- -==== Cause -The {astra} DevOps API is likely temporarily unavailable. -==== Solution or Workaround +This can indicate that the {astra} DevOps API is temporarily unavailable. +You can either wait and then retry the operation, or you can xref:astra-db-serverless:databases:secure-connect-bundle.adoc[download your database's {scb} from the {astra-ui}], and then provide the path to the {scb-short} zip file in the xref:deploy-proxy-monitoring.adoc#_core_configuration[{product-automation} configuration]. -xref:astra-db-serverless:databases:secure-connect-bundle.adoc[Download the {astra-db} {scb}] manually and provide its path in the xref:deploy-proxy-monitoring.adoc#_core_configuration[{product-automation} configuration].
+=== Metadata service returned not successful status code (4xx or 5xx) -=== Metadata service returned not successful status code 4xx or 5xx -==== Symptoms -{product-proxy} doesn't start and the following appears on the proxy logs: +If {product-proxy} doesn't start, the logs might contain messages with `not successful status code`: [source,log] ---- @@ -574,22 +530,16 @@ Couldn't start proxy: error initializing the connection configuration or control metadata service (Astra) returned not successful status code ---- -==== Cause There are two possible causes for this: * The credentials that {product-proxy} is using for {astra-db} don't have sufficient permissions. * The {astra-db} database is hibernated or otherwise unavailable. -==== Solution or Workaround -In the {astra-ui}, check the xref:astra-db-serverless:databases:database-statuses.adoc[database status]. +To resolve this issue, sign in to the {astra-ui}, and then check the xref:astra-db-serverless:databases:database-statuses.adoc[database status]. -If the database is not in *Active* status, you might need to take action or wait for the database to return to active status. -For example, if the database is hibernated, xref:astra-db-serverless:databases:database-statuses.adoc#hibernated[reactivate the database]. -When the database is active again, retry the connection. +If the database isn't in *Active* status, xref:astra-db-serverless:databases:database-statuses.adoc#hibernated[reactivate the database] or wait for it to return to *Active* status, and then retry the connection. -If the database is in *Active* status, then the issue is likely due to the credentials permissions. +If the database is in *Active* status, then the credentials likely have insufficient permissions. Try using an xref:astra-db-serverless:administration:manage-application-tokens.adoc[application token scoped to a database], specifically a token with the *Database Administrator* role for your target database.
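
+As an alternative to the {astra-ui}, you can check the status programmatically through the {astra} DevOps API, which is useful when you diagnose proxy startup failures from a terminal.
+The following is a sketch, assuming your application token is exported as `ASTRA_TOKEN` and your database ID as `DB_ID` (both hypothetical variable names):
+
```shell
# Fetch the database record; the "status" field reports values
# such as ACTIVE or HIBERNATED.
curl -s \
  -H "Authorization: Bearer ${ASTRA_TOKEN}" \
  "https://api.astra.datastax.com/v2/databases/${DB_ID}" | jq -r '.status'
```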
[[_async_read_timeouts_stream_id_map_exhausted]] From 127557ddbf677be4830c8f84c4a40c3c8032cac2 Mon Sep 17 00:00:00 2001 From: April M <36110273+aimurphy@users.noreply.github.com> Date: Thu, 13 Nov 2025 17:11:51 -0800 Subject: [PATCH 04/11] finish revising troubleshooting --- modules/ROOT/pages/troubleshooting-tips.adoc | 147 +++++++++---------- 1 file changed, 67 insertions(+), 80 deletions(-) diff --git a/modules/ROOT/pages/troubleshooting-tips.adoc b/modules/ROOT/pages/troubleshooting-tips.adoc index e2a31693..edbde2e7 100644 --- a/modules/ROOT/pages/troubleshooting-tips.adoc +++ b/modules/ROOT/pages/troubleshooting-tips.adoc @@ -545,11 +545,7 @@ Try using an xref:astra-db-serverless:administration:manage-application-tokens.a [[_async_read_timeouts_stream_id_map_exhausted]] === Async read timeouts / stream id map exhausted -//Supposedly resolved in 2.1.0 release? - -==== Symptoms - -Dual reads are enabled and the following messages are found in the {product-proxy} logs: +When dual reads are enabled, you might find the following messages in the {product-proxy} logs: [source,log] ---- @@ -560,99 +556,68 @@ Dual reads are enabled and the following messages are found in the {product-prox {"log":"\u001b[33mWARN\u001b[0m[431533] Could not send async request due to an error while storing the request state: stream id map ran out of stream ids: channel was empty. \r\n","stream":"stdout","time":"2022-10-03T17:49:23.786335428Z"} ---- -==== Cause - -The last log message is logged when the async connection runs out of stream ids. -The async connection is a connection dedicated to the async reads (asynchronous dual reads feature). -This can be caused by timeouts (first log message) or the connection not being able to keep up with the load. +The last message is logged when the async connection runs out of stream IDs. +The async connection is a connection dedicated to the asynchronous dual reads feature. 
+This can be caused by timeouts, as indicated in the first log message, or the connection being unable to keep up with the load. -If the log files are being spammed with these messages then it is likely that an outage occurred which caused all responses to arrive after requests timed out (second log message). -In this case the async connection might not be able to recover. +Errors in the async request path (dual reads) don't affect the client application. +These log messages can be useful to predict what could happen when reads are switched over to the target cluster permanently, but async read errors and warnings don't, by themselves, have any impact on the client. -==== Solution or Workaround +If you find many of these messages in the logs, it is likely that an outage occurred. +This causes all responses to arrive after requests timed out, as reported in the second log message. +In this case the async connection might not recover. -Keep in mind that any errors in the async request path (dual reads) will not affect the client application so these log messages might be useful to predict what may happen when the reads are switched over to the TARGET cluster but async read errors/warnings by themselves do not cause any impact to the client. +Starting in version 2.1.0, you can tune the maximum number of stream IDs available per connection. +The default is 2048, and you can increase it to match your driver configuration with the `xref:manage-proxy-instances.adoc#zdm_proxy_max_stream_ids[zdm_proxy_max_stream_ids]` property. +If you are running a version prior to 2.1.0, upgrade {product-proxy}. -Starting in version 2.1.0, you can now tune the maximum number of stream ids available per connection, which by default is 2048. -You can increase it to match your driver configuration through the xref:manage-proxy-instances.adoc#zdm_proxy_max_stream_ids[zdm_proxy_max_stream_ids] property. 
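
+With {product-automation}, that tuning is a single variable in the proxy configuration.
+The following is a sketch; the exact vars file that holds this variable depends on your automation version:
+
```yaml
# Raise the per-connection stream id limit from the default of 2048.
# Match this to the driver's maximum requests per connection.
zdm_proxy_max_stream_ids: 4096
```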
-
-If these errors are being constantly written to the log files (for minutes or even hours) then it is likely that only an application OR {product-proxy} restart will fix it. -If you find an issue like this, submit a {product-proxy-repo}/issues[GitHub issue]. +If these errors are constantly written to the log files over a period of minutes or hours, then you likely need to restart the client application _or_ {product-proxy} to fix the issue. +If you find an error like this, <<report-an-issue,report an issue>> so the {product-short} team can investigate it. === Client application closed connection errors every 10 minutes when migrating to {astra-db} -//TODO: Remove - resolved in 2.1.0 -[NOTE] -==== -This issue is fixed in {product-proxy} 2.1.0. See the Fix section below. -==== - -==== Symptoms - -Every 10 minutes a message is logged in the {product-proxy} logs showing a disconnect that was caused by {astra-db}: - -[source,log] ----- -{"log":"\u001b[36mINFO\u001b[0m[426871] [TARGET-CONNECTOR] REDACTED disconnected \r\n","stream":"stdout","time":"2022-10-01T16:31:41.48598498Z"} ----- - -==== Cause +This issue is fixed in {product-proxy} 2.1.0. -{astra-db} terminates idle connections after 10 minutes of inactivity. -If a client application only sends reads through a connection then the target cluster, which is an {astra-db} database in this example, then the connection won't get any traffic because {product-short} forwards all reads to the origin connection. +If you are running an earlier version and the logs report that the {astra-db} `TARGET-CONNECTOR` is disconnected every 10 minutes, upgrade your {product-proxy} instances to 2.1.0 or later to resolve this issue. -==== Solution or Workaround +This issue occurred because {astra-db} terminates idle connections after 10 minutes of inactivity. +In the absence of asynchronous dual reads, the target cluster won't get any traffic if the client application sends only read requests because {product-short} forwards all reads to the origin cluster only.
-This issue has been fixed in {product-proxy} 2.1.0. -We encourage you to upgrade to that version or greater. -By default, {product-proxy} now sends heartbeats after 30 seconds of inactivity on a cluster connection, to keep it alive. -You can tune the heartbeat interval with the Ansible configuration variable `heartbeat_insterval_ms`, or by directly setting the `ZDM_HEARTBEAT_INTERVAL_MS` environment variable if you do not use {product-automation}. +This issue is fixed in {product-proxy} 2.1.0, which sends heartbeats after 30 seconds of inactivity on a cluster connection to keep it alive. +You can tune the heartbeat interval with the Ansible configuration variable `heartbeat_interval_ms`, or by directly setting the `ZDM_HEARTBEAT_INTERVAL_MS` environment variable if you aren't using {product-automation}. === Performance degradation with {product-proxy} -==== Symptoms -Consider a case where a user runs separate benchmarks against: -* {astra-db} directly -* Origin directly -* {product-proxy} (with {astra-db} and the origin cluster) -The results of these tests show latency/throughput values are worse with {product-proxy} than when connecting to {astra-db} or origin cluster directly. -==== Cause +If you run separate benchmarks against {astra-db} directly, the origin cluster directly, and {product-proxy} with both {astra-db} and the origin cluster, then the results of these tests might show that latency or throughput is worse with {product-proxy} than when connecting to {astra-db} or the origin cluster directly. -{product-short} always increases latency and, depending on the nature of the test, reduces throughput. +This is observed because {product-short} always increases latency and, depending on the nature of the test, reduces throughput. Whether this performance hit is expected or not depends on the difference between the {product-short} test results and the test results with the cluster that performed the worst.
-Writes in {product-short} require successful acknowledgement from both clusters, while reads only require the result from the primary cluster, which is typically the origin cluster. -This means that if the origin cluster has better performance than the target cluster, then {product-short} will have worse write performance. +Writes through {product-short} require successful acknowledgement from both clusters, whereas reads require only the result from the primary cluster, which is typically the origin cluster. +This means that if the origin cluster has better performance than the target cluster, then {product-short} will inevitably have worse write performance than the target cluster alone. -It is typical for latency to increase with {product-proxy}. -To minimize performance degradation with {product-proxy}, note the following: +Although it is typical for latency to increase with {product-proxy}, you can minimize the performance degradation: * Make sure your {product-proxy} infrastructure or configuration doesn't unnecessarily increase latency. ++ For example, make sure your {product-proxy} instances are in the same availability zone (AZ) as your origin cluster or application instances. + * Understand the impact of simple and batch statements on latency, as compared to typical prepared statements. + Avoid simple statements with {product-proxy} because they require significant time for {product-proxy} to parse the queries. + -In contrast, prepared statements are parsed once, and then reused on subsequent requests, if repreparation isn't required. +As an alternative, use prepared statements, which are parsed once, and then reused on subsequent requests if repreparation isn't required. +That said, inefficient use of prepared statements can degrade performance further, although this occurs even without {product-proxy}. -==== Solution or Workaround -If you are using simple statements, consider using prepared statements as the best first step.
-
-Increasing the number of proxies might help, but only if the VMs resources (CPU, RAM or network IO) are near capacity. +* Increase the number of proxies only if the VM's resources (CPU, RAM, or network IO) are near capacity. {product-proxy} doesn't use a lot of RAM, but it uses a lot of CPU and network IO. +Deploying the proxy instances on VMs with faster CPUs and faster network IO might help, but there is no standardized approach to scaling these resources for {product-proxy}. +This is because the ideal balance of resources depends on the workload type and your environment, such as network/VPC configurations and hardware. +If you choose to adjust the infrastructure, you must repeat your tests to determine if there was any benefit. -Deploying the proxy instances on VMs with faster CPUs and faster network IO might help, but only your own tests will reveal whether it helps, because it depends on the workload type and details about your environment such as network/VPC configurations, hardware, and so on. -=== `InsightsRpc` related permissions errors +=== Permission errors related to InsightsRpc -==== Symptoms -{product-proxy} logs contain: +If the {product-proxy} logs contain messages such as the following, it's likely that you have an origin {dse-short} cluster where Metrics Collector is enabled, and the user named in the logs doesn't have sufficient permissions to report Insights data: [source,log] ---- @@ -660,23 +625,45 @@ time="2023-05-05T19:14:31Z" level=debug msg="Recording ORIGIN-CONNECTOR other er time="2023-05-05T19:14:31Z" level=debug msg="Recording TARGET-CONNECTOR other error: ERROR SERVER ERROR (code=ErrorCode ServerError [0x00000000], msg=Unexpected persistence error: Unable to authorize statement com.datastax.bdp.cassandra.cql3.RpcCallStatement)" ---- -==== Cause +These are reported as `level=debug`, so {product-proxy} isn't affected by them.
-This could be the case if the origin ({dse-short}) cluster has Metrics Collector enabled to report metrics for {company} drivers and `my_user` does not have the required permissions. -{product-proxy} simply passes through these. +There are two ways to resolve this issue: -==== Solution or Workaround +[tabs] +====== +Disable {dse-short} Metrics Collector (recommended):: ++ +-- +. On the origin {dse-short} cluster, disable Metrics Collector: ++ +[source,bash] +---- +dsetool insights_config --mode DISABLED +---- -There are two options to get this fixed. +. Run the following command to verify that `mode` is set to `DISABLED`: ++ +[source,bash] +---- +dsetool insights_config --show_config +---- +-- -===== Option 1: Disable {dse-short} Metrics Collector +Grant InsightsRpc permissions:: ++ +-- +Only use this option if you cannot disable Metrics Collector. -* On the origin {dse-short} cluster, run `dsetool insights_config --mode DISABLED` -* Run `dsetool insights_config --show_config` and ensure the `mode` has a value of `DISABLED` +Using a superuser role, grant the appropriate permissions to the user named in the logs: -===== Option 2: Use this option if disabling metrics collector is not an option +[source,bash,subs="+quotes"] +---- +GRANT EXECUTE ON REMOTE OBJECT InsightsRpc TO **USER**; +---- -* Using a superuser role, grant the appropriate permissions to `my_user` role by running `GRANT EXECUTE ON REMOTE OBJECT InsightsRpc TO my_user;` +Replace **USER** with the actual username given in the logs. 
+-- +====== [#report-an-issue] == Report an issue From ec6b92296e7ff5d3a676b53b14503c2eedf21a3c Mon Sep 17 00:00:00 2001 From: April M <36110273+aimurphy@users.noreply.github.com> Date: Fri, 14 Nov 2025 06:00:35 -0800 Subject: [PATCH 05/11] add some more origin/target primary/secondary defs --- modules/ROOT/pages/components.adoc | 7 +++++-- modules/ROOT/pages/enable-async-dual-reads.adoc | 4 ++-- modules/ROOT/pages/troubleshooting-tips.adoc | 2 +- 3 files changed, 8 insertions(+), 5 deletions(-) diff --git a/modules/ROOT/pages/components.adoc b/modules/ROOT/pages/components.adoc index 5396b05e..b26c845b 100644 --- a/modules/ROOT/pages/components.adoc +++ b/modules/ROOT/pages/components.adoc @@ -20,7 +20,7 @@ This tool is open-source software. It doesn't perform data migrations and it doesn't have awareness of ongoing migrations. Instead, you use a <> to perform the data migration and validate migrated data. -{product-proxy} reduces risks to upgrades and migrations by decoupling the origin cluster from the target cluster and maintaining consistency between both clusters. +{product-proxy} reduces risks to upgrades and migrations by decoupling the origin (source) cluster from the target (destination) cluster and maintaining consistency between both clusters. You decide when you want to switch permanently to the target cluster. After migrating your data, changes to your application code are usually minimal, depending on your client's compatibility with the origin and target clusters. @@ -34,7 +34,10 @@ These clusters can be any CQL-compatible data store, such as {cass-reg}, {dse}, During the migration process, you designate one cluster as the _primary cluster_, which serves as the source of truth for reads. For the majority of the migration process, this is typically the origin cluster. -Towards the end of the migration process, when you are ready to read from your target cluster, you set the target cluster as the primary cluster. 
+Towards the end of the migration process, when you are ready to read exclusively from your target cluster, you set the target cluster as the primary cluster. + +The other cluster is referred to as the _secondary cluster_. +While {product-proxy} is active, write requests are sent to both clusters to ensure data consistency, but only the primary cluster serves read requests. ==== Writes diff --git a/modules/ROOT/pages/enable-async-dual-reads.adoc b/modules/ROOT/pages/enable-async-dual-reads.adoc index 980ee9c4..8d8057f1 100644 --- a/modules/ROOT/pages/enable-async-dual-reads.adoc +++ b/modules/ROOT/pages/enable-async-dual-reads.adoc @@ -17,7 +17,7 @@ This allows you to assess the target cluster's performance and make any adjustme == Response and error handling with asynchronous dual reads With or without asynchronous dual reads, the client application only receives results from synchronous reads on the primary cluster. -The client never receives results from asynchronous reads on the secondary cluster because these results are used only for {product-proxy}'s asynchronous dual read metrics. +The client never receives results from asynchronous reads on the secondary cluster because these results are used only for {product-proxy}'s asynchronous dual read metrics and testing purposes. By design, if an asynchronous read fails or times out, it has no impact on client operations and the client application doesn't receive an error. However, the increased workload from read requests can cause write requests to fail or time out on the secondary cluster. @@ -25,7 +25,7 @@ With or without asynchronous dual reads, a failed write on either cluster return This functionality is intentional so you can simulate production-scale read traffic on the secondary cluster, in addition to the existing write traffic from {product-proxy}'s xref:components.adoc#how-zdm-proxy-handles-reads-and-writes[dual writes], with the least impact to your applications. 
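The dual-write and asynchronous dual-read behavior described above can be sketched conceptually. This is an illustrative Python sketch of the routing rules only (all function names here are invented for the example; it is not {product-proxy}'s actual implementation):

```python
import concurrent.futures

pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def handle_write(origin_write, target_write):
    # Dual writes: the request goes to both clusters concurrently,
    # and the client sees success only if BOTH clusters acknowledge.
    futures = [pool.submit(origin_write), pool.submit(target_write)]
    return all(f.result() for f in futures)

def handle_read(primary_read, secondary_read=None):
    # Reads are served only by the primary cluster. With asynchronous
    # dual reads enabled, the same read is also fired at the secondary
    # cluster, but its outcome is used for metrics only and never
    # reaches the client -- not even as an error.
    if secondary_read is not None:
        future = pool.submit(secondary_read)
        future.add_done_callback(lambda f: f.exception())  # swallow errors
    return primary_read()
```

This mirrors the text: a failed write on either cluster surfaces to the client, while a failed asynchronous read on the secondary cluster does not.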
-To avoid unnecessary failures due to unmigrated data, enable asynchronous dual reads only after you migrate, validate, and reconcile all data from the origin cluster to the target cluster.
+To avoid unnecessary failures due to data that hasn't been migrated yet, don't enable asynchronous dual reads until you migrate, validate, and reconcile _all_ data from the origin cluster to the target cluster.

[#configure-asynchronous-dual-reads]
== Configure asynchronous dual reads
diff --git a/modules/ROOT/pages/troubleshooting-tips.adoc b/modules/ROOT/pages/troubleshooting-tips.adoc
index edbde2e7..5950b941 100644
--- a/modules/ROOT/pages/troubleshooting-tips.adoc
+++ b/modules/ROOT/pages/troubleshooting-tips.adoc
@@ -269,7 +269,7 @@ Allowing these values to change from a rolling restart could propagate a misconf
If you change the value of an immutable configuration variable, you must run the `deploy_zdm_proxy.yml` playbook again.
You can run this playbook as many times as needed.
-Each time, {product-proxy-automation} recreates your entire {product-proxy} deployment with the new configuration.
+Each time, {product-automation} recreates your entire {product-proxy} deployment with the new configuration.
However, this doesn't happen in a rolling fashion: The existing {product-proxy} instances are torn down simultaneously, and then they are recreated.
This results in a brief period of downtime where the entire {product-proxy} deployment is unavailable.
From bb7c1a7edc36d168b85ba56ee34a0a4d335ac61d Mon Sep 17 00:00:00 2001 From: April M <36110273+aimurphy@users.noreply.github.com> Date: Fri, 14 Nov 2025 08:15:12 -0800 Subject: [PATCH 06/11] working on faq --- modules/ROOT/pages/components.adoc | 43 +++-- .../ROOT/pages/connect-clients-to-proxy.adoc | 2 +- .../ROOT/pages/deployment-infrastructure.adoc | 13 ++ modules/ROOT/pages/dse-migration-paths.adoc | 2 +- modules/ROOT/pages/faqs.adoc | 165 +++++++----------- .../ROOT/pages/feasibility-checklists.adoc | 2 +- .../ROOT/pages/migrate-and-validate-data.adoc | 2 +- modules/ROOT/pages/tls.adoc | 17 +- modules/ROOT/pages/troubleshooting-tips.adoc | 4 +- .../ROOT/pages/zdm-proxy-migration-paths.adoc | 21 ++- 10 files changed, 139 insertions(+), 132 deletions(-) diff --git a/modules/ROOT/pages/components.adoc b/modules/ROOT/pages/components.adoc index b26c845b..0cdce771 100644 --- a/modules/ROOT/pages/components.adoc +++ b/modules/ROOT/pages/components.adoc @@ -83,22 +83,23 @@ For simplicity, you can use {product-utility} and {product-automation} to set up == {product-utility} and {product-automation} -You can use {product-automation-repo}[{product-utility} and {product-automation}] to set up and run Ansible playbooks that deploy and manage {product-proxy} and the associated monitoring stack. +You can use {product-automation-repo}[{product-utility} and {product-automation}] to set up and run Ansible playbooks that deploy and manage multiple {product-proxy} instances and the associated monitoring stack (Prometheus metrics and associated Grafana visualizations). -https://www.ansible.com/[Ansible] is a suite of software tools that enables infrastructure as code. -It is open source and its capabilities include software provisioning, configuration management, and application deployment functionality. -The Ansible automation for {product-short} is organized into playbooks, each implementing a specific operation. 
-The machine from which the playbooks are run is known as the Ansible Control Host.
-In {product-short}, the Ansible Control Host runs as a Docker container.
+https://www.ansible.com/[Ansible] is a suite of software tools that enables infrastructure as code.
+It is open source, and its capabilities include software provisioning, configuration management, and application deployment.

-You use {product-utility} to set up Ansible in a Docker container, and then you use {product-automation} to run the Ansible playbooks from the Docker container created by {product-utility}.
+Ansible playbooks streamline and automate the deployment and management of {product-proxy} instances and their monitoring components.
+Playbooks are YAML files that define a series of tasks to be executed on one or more remote machines, including installing software, configuring settings, and managing services.
+They are repeatable and reusable, and they simplify deployment and configuration management because each playbook focuses on a specific operation, such as rolling restarts.

-{product-utility} creates the Docker container acting as the Ansible Control Host, from which {product-automation} allows you to deploy and manage the {product-proxy} instances and the associated monitoring stack, which includes Prometheus metrics and Grafana visualizations of the metrics data.
+You run playbooks from a centralized machine known as the Ansible Control Host.
+{product-utility}, which is included with {product-automation}, creates the Docker container that acts as the Ansible Control Host.

-To use {product-utility} and {product-automation}, you must prepare the recommended infrastructure, as explained in xref:deployment-infrastructure.adoc[].
+To use {product-utility} and {product-automation}, you must xref:deployment-infrastructure.adoc[prepare the recommended infrastructure].

-For more information, see xref:setup-ansible-playbooks.adoc[] and xref:deploy-proxy-monitoring.adoc[].
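As a purely illustrative sketch of the playbook structure described above (the host group, file paths, and service name below are invented for this example and are not taken from an actual {product-automation} playbook):

```yaml
# Hypothetical example of an Ansible playbook's shape -- not a real
# {product-automation} playbook.
- name: Roll out an updated proxy configuration
  hosts: proxies          # inventory group of proxy machines
  become: true
  serial: 1               # one host at a time, i.e. a rolling operation
  tasks:
    - name: Copy the updated configuration file
      ansible.builtin.copy:
        src: files/proxy.env
        dest: /etc/proxy/proxy.env

    - name: Restart the proxy service
      ansible.builtin.service:
        name: proxy
        state: restarted
```

The `serial` play keyword is what lets a playbook proceed host by host, which is how an operation such as a rolling restart avoids taking every instance down at once.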
+For more information about the role of Ansible and Ansible playbooks in the {product-short} process, see xref:setup-ansible-playbooks.adoc[] and xref:deploy-proxy-monitoring.adoc[]. +[#data-migration-tools] == Data migration tools You use data migration tools to move data between clusters and validate the migrated data. @@ -133,7 +134,7 @@ For more information, see xref:ROOT:dsbulk-migrator.adoc[]. === Other data migration processes -Depending on your source and target databases, there might be other data migration tools available for your migration. +Depending on your origin and target databases, there might be other data migration tools available for your migration. For example, if you want to write your own custom data migration processes, you can use a tool like Apache Spark(TM). To use a data migration tool with {product-proxy}, it must meet the following requirements: @@ -145,5 +146,21 @@ To use a data migration tool with {product-proxy}, it must meet the following re Because {product-proxy} requires that both databases can successfully process the same read/write statements, migrations that perform significant data transformations might not be compatible with {product-proxy}. The impact of data transformations depends on your specific data model, database platforms, and the scale of your migration. -For data-only migrations that aren't concerned with live application traffic or minimizing downtime, your chosen tool depends on your source and target databases, the compatibility of the data models, and the scale of your migration. -Describing the full range of these tools is beyond the scope of this document, which focuses on full-scale platform migrations with the {product-short} tools and verified {product-short}-compatible data migration tools. 
\ No newline at end of file
+For data-only migrations that aren't concerned with live application traffic or minimizing downtime, your chosen tool depends on your origin and target databases, the compatibility of the data models, and the scale of your migration.
+Describing the full range of these tools is beyond the scope of this document, which focuses on full-scale platform migrations with the {product-short} tools and verified {product-short}-compatible data migration tools.
+
+== In-place migrations
+
+[WARNING]
+====
+In-place migrations carry a higher risk of data loss or corruption, require progressive manual reconfiguration of the cluster, and are more cumbersome to roll back compared to the {product-short} process.
+
+Whenever possible, {company} recommends using the {product-short} process to orchestrate live migrations between separate clusters, which eliminates the need for progressive configuration changes, and allows you to seamlessly xref:ROOT:rollback.adoc[roll back to your origin cluster] if there is a problem during the migration.
+====
+
+For certain migration paths, it is possible to perform in-place database platform replacements on the same cluster where your data already exists.
+Supported paths for in-place migrations include xref:6.9@dse:planning:migrate-cassandra-to-dse.adoc[{cass} to {dse-short}] and xref:1.2@hcd:migrate:dse-68-to-hcd-12.adoc[{dse-short} to {hcd-short}].
+ +== See also + +* xref:ROOT:zdm-proxy-migration-paths.adoc#incompatible-clusters-and-migrations-with-some-downtime[Incompatible clusters and migrations with some downtime] \ No newline at end of file diff --git a/modules/ROOT/pages/connect-clients-to-proxy.adoc b/modules/ROOT/pages/connect-clients-to-proxy.adoc index e5341d38..2bf4934e 100644 --- a/modules/ROOT/pages/connect-clients-to-proxy.adoc +++ b/modules/ROOT/pages/connect-clients-to-proxy.adoc @@ -166,7 +166,7 @@ This is disabled by default in all drivers, but if it was enabled in your client Token-aware routing isn't enforced when connecting through {product-proxy} because these instances don't hold actual token ranges in the same way as database nodes. Instead, each {product-proxy} instance has a unique, non-overlapping set of synthetic tokens that simulate token ownership and enable balanced load distribution across the instances. -Upon receiving a request, a {product-proxy} instance routes the request to appropriate source and target database nodes, independent of token ownership. +Upon receiving a request, a {product-proxy} instance routes the request to appropriate origin and target database nodes, independent of token ownership. If your clients have token-aware routing enabled, you don't need to disable this behavior while using {product-proxy}. Clients can continue to operate with token-aware routing enabled without negative impacts to functionality or performance. diff --git a/modules/ROOT/pages/deployment-infrastructure.adoc b/modules/ROOT/pages/deployment-infrastructure.adoc index 86b9c6bb..efcbd114 100644 --- a/modules/ROOT/pages/deployment-infrastructure.adoc +++ b/modules/ROOT/pages/deployment-infrastructure.adoc @@ -19,6 +19,19 @@ Here's a typical deployment showing connectivity between client applications, {p image::zdm-during-migration3.png[Connectivity between client applications, proxy instances, and clusters.] 
+=== Don't deploy {product-proxy} as a sidecar + +Don't deploy {product-proxy} as a sidecar because it was designed to mimic communication with a {cass-short}-based cluster. +For this reason, {company} recommends deploying multiple {product-proxy} instances, each running on a dedicated machine, instance, or VM. + +For best performance, deploy your {product-proxy} instances as close as possible to your client applications, ideally on the same local network, but don't co-deploy them on the same machines as the client applications. +This way, each client application instance can connect to all {product-proxy} instances, just as it would connect to all nodes in a {cass-short}-based cluster or datacenter. + +This deployment model provides maximum resilience and failure tolerance guarantees, and it allows the client application driver to continue using the same load balancing and retry mechanisms that it would normally use. + +Conversely, deploying a single {product-proxy} instance undermines this resilience mechanism and creates a single point of failure, which can affect client applications if one or more nodes of the underlying origin or target clusters go offline. +In a sidecar deployment, each client application instance would be connecting to a single {product-proxy} instance, and would, therefore, be exposed to this risk. + == Infrastructure requirements To deploy {product-proxy} and its companion monitoring stack, you must provision infrastructure that meets the following requirements. 
diff --git a/modules/ROOT/pages/dse-migration-paths.adoc b/modules/ROOT/pages/dse-migration-paths.adoc index 253dd952..b6247d27 100644 --- a/modules/ROOT/pages/dse-migration-paths.adoc +++ b/modules/ROOT/pages/dse-migration-paths.adoc @@ -40,7 +40,7 @@ Migrate data from {dse-short}:: When migrating _from_ {dse-short} to another {cass-short}-based database, follow the migration guidance for your target database to determine cluster compatibility, migration options, and recommendations. For example, for {astra-db}, see xref:ROOT:astra-migration-paths.adoc[], and for {hcd-short}, see xref:ROOT:hcd-migration-paths.adoc[]. -For information about source and target clusters that are supported by the {product-short} tools, see xref:ROOT:zdm-proxy-migration-paths.adoc[]. +For information about origin and target clusters that are supported by the {product-short} tools, see xref:ROOT:zdm-proxy-migration-paths.adoc[]. If your target database isn't directly compatible with a migration from {dse-short}, you might need to take interim steps to prepare your data for migration, such as upgrading your {dse-short} version, modifying the data in your existing database to be compatible with the target database, or running an extract, transform, load (ETL) pipeline. -- diff --git a/modules/ROOT/pages/faqs.adoc b/modules/ROOT/pages/faqs.adoc index ce81fafe..9f0fdf9c 100644 --- a/modules/ROOT/pages/faqs.adoc +++ b/modules/ROOT/pages/faqs.adoc @@ -2,79 +2,95 @@ :navtitle: FAQs :page-aliases: ROOT:contributions.adoc -If you're new to the {company} {product} features, these FAQs are for you. +This page includes common questions about the {company} {product} tools. //TODO: Eliminate redundancies in these FAQs and the Glossary. //FAQs in ZDM-proxy repo: https://github.com/datastax/zdm-proxy/blob/main/faq.md#what-versions-of-apache-cassandra-or-cql-compatible-data-stores-does-the-zdm-proxy-support -== What is meant by {product}? +== What is a zero-downtime migration? 
-{product} ({product-short}) means the ability for you to reliably migrate client applications and data between CQL clusters with no interruption of service. +A zero-downtime migration with the {company} {product} ({product-short}) tools means you can reliably migrate your client applications and data between CQL clusters with no interruption of service. -{product-short} lets you accomplish migrations without the need to change your client application code, and with only minimal configuration changes. While in some cases you may need to make some minor changes at the client application level, these changes will be minimal and non-invasive, especially if your client application uses an externalized property configuration for contact points. +== Which platforms and versions are supported by the {product-short} tools? -The suite of {product-short} tools enables you to migrate the real-time activity generated by your client applications, as well as transfer your existing data, always with a simple rollback strategy that does not require any downtime. +See xref:ROOT:zdm-proxy-migration-paths.adoc[]. -It is important to note that the {product} process requires you to be able to perform rolling restarts of your client applications during the migration. +== Why should I use the {product} ({product-short}) tools for my migration? -[TIP] -==== -In the context of migrating between clusters (client applications and data), the examples in this guide sometimes refer to the migration to our cloud-native database environment, {astra-db}. +There are several benefits to using the {product-short} tools for your migration: -However, it is important to emphasize that {product-proxy} can be freely used to support migrations without downtime between any combination of CQL clusters of any type. In addition to {astra-db}, examples include {cass-reg} or {dse}. 
-==== +* Minimal client code changes: Depending on cluster compatibility, the {product-short} tools help you migrate to a new or upgraded database platform with minimal changes to your client application code. +In some cases, you only need to change the connection string to point to the new cluster at the end of the migration process. +Typically, these changes are minimal and non-invasive, especially if your client application uses an externalized property configuration for contact points. -== Can you illustrate the overall workflow and phases of a migration? +* Real-time data consistency: {product-proxy} orchestrates real-time activity generated by your client applications, ensuring data consistency while you replicate, validate, and test your existing data on the new cluster. +Once you set up {product-proxy}, the dual-writes feature ensures that new writes are sent to both the origin and target clusters, so you can focus on migrating the data that was present before initializing {product-proxy}. -See the diagrams of the {product-short} xref:introduction.adoc#_migration_phases[migration phases]. +* Safely test the new cluster under full production workloads: In addition to the dual-writes feature, you can optionally enable asynchronous dual-reads to test the target cluster's ability to handle a production workload before you permanently switch to the target cluster at the end of the migration process. ++ +Client applications aren't interrupted by errors or latency spikes on the new, target cluster. +Although these errors and metrics are received by {product-proxy} for monitoring and performance benchmarking purposes, they aren't propagated back to the client applications. ++ +From the client side, traffic is seamless and uninterrupted during the entire migration process. -== Do you have a demo of {product-short}? 
+* Seamless rollback without data loss: If there is a problem during the migration, you can xref:ROOT:rollback.adoc[roll back to the original cluster] without any data loss or interruption of service.
+You can allow {product-proxy} to continue orchestrating dual-writes, or redirect your client applications back to the origin cluster and disable {product-proxy}.

-Yes, you can use the {product-short} interactive lab to see how the migration process works.
+* Unlimited validation and testing time: Because your client applications remain fully operational during the migration, and your clusters are kept in sync by {product-proxy}, you can take as much time as you need to validate and test the target cluster before switching over permanently.

-For more information, see xref:ROOT:introduction.adoc#lab[{product} interactive lab].
+* Migrate to a different platform or perform major version upgrades: The {product-short} tools support migrations between different CQL-based platforms, such as open-source {cass-reg} to {astra-db}, as well as major version upgrades of the same platform, such as {dse-short} 5.0 to {dse-short} 6.9.

== What are the requirements for true zero-downtime migrations?

-To support live migrations, you can use {product-proxy}, {product-utility}, and {product-automation}.
+See xref:ROOT:feasibility-checklists.adoc[].

-For data migration with or without downtime, you can use {sstable-sideloader}, {cass-migrator}, {dsbulk-migrator}, or custom data migration scripts.

== Is there a summary of the migration process?

-For more information, see xref:ROOT:components.adoc[].
+Yes, see xref:ROOT:introduction.adoc[].

-== What exactly is {product-proxy}?

== Is there a {product-short} demo?

-{product-proxy} is a component designed to seamlessly handle the real-time client application activity while a migration is in progress. See xref:introduction.adoc#_role_of_zdm_proxy[here] for an overview.
+Yes, you can use the xref:ROOT:introduction.adoc#lab[{product-short} interactive lab] to see how the migration process works. -== What are the benefits of {product-proxy} and its use cases? +== What {product-short} tools are available? -Migrating client applications between clusters is a need that arises in many scenarios. For example, you may want to: +The {product-short} tools are {product-proxy}, {product-utility}, and {product-automation}. +These tools orchestrate the traffic between your client applications and the origin and target clusters during the migration process. -* Move to a cloud-native, managed service such as {astra-db}. -* Migrate your client application to a brand new cluster, on a more recent version and perhaps on new infrastructure, or even a different CQL database entirely, without intermediate upgrade steps and ensuring that you always have an easy way to roll back in case of issues. -* Separate out a client application from a shared cluster to a dedicated one. -* Consolidate client applications, currently running on separate clusters, into fewer clusters or even a single one. +For the actual data migration, there are many tools you can use, such as {sstable-sideloader}, {cass-migrator}, {dsbulk-migrator}, and custom data migration scripts. -Bottom line: You want to migrate your critical database infrastructure without risk or concern that your users' experiences will be affected. +For more information, see xref:ROOT:components.adoc[]. -== Which releases of {cass-short} or {dse-short} are supported for migrations? +== What is {product-proxy}? -See xref:ROOT:zdm-proxy-migration-paths.adoc[]. +Generally speaking, a proxy is a software class functioning as an interface to any other component, connection, or resource, such as a network connection, a server, a large object in memory, or a file. +The proxy is a wrapper or agent object that the client calls to access the real object served through that proxy. 
-== Does the {product} process migrate clusters?
+In the context of {product-short}, the {product-proxy} is an open-source component designed to seamlessly handle real-time client application activity while a migration is in progress.
+For more information, see xref:ROOT:components.adoc[].
+
+== What are {product-automation} and {product-utility}?
+
+{product-automation} is an Ansible-based tool that allows you to deploy and manage multiple {product-proxy} instances and the associated monitoring stack.
+To simplify the setup, the {product-automation} suite includes {product-utility}, which is an interactive utility that creates a Docker container to act as the Ansible Control Host.
+For more information, see xref:ROOT:components.adoc[].

-The {product} ({product-short}) process doesn't directly migrate clusters.
-Instead, it migrates data and applications between clusters.

== Does the {product-short} process migrate clusters?

-At the end of the migration process, your application runs exclusively on your new cluster, which was populated with data from the original cluster.
+The {product-short} process doesn't directly migrate clusters.
+Instead, the {product-short} tools orchestrate live traffic between your existing cluster and a new cluster while you use a data migration tool to replicate and validate data on the new cluster.
{product-proxy} handles real-time requests generated by your client applications during the migration process, and keeps both clusters in sync through dual writes.
+If there is a problem during the migration, you can confidently roll back to the original cluster without data loss or interruption of service.
+
+At the end of the migration process, your client application connects exclusively to your new cluster, and then you decommission {product-proxy} and the old cluster.
-Before {company} {product} was available, migrating client applications between clusters involved granular and intrusive client application code changes, extensive migration preparation, and a window of downtime to the client application's end users. +Before the {product-short} tools were available, migrating client applications between clusters involved granular and intrusive client application code changes, extensive migration preparation, and a window of downtime for the client application's end users. -{product-short} allows you to leverage mature migration tools that have been used with large scale enterprises and applications to make migrations easy and transparent to end users. +With the {product-short} tools, you can migrate your client applications and data between CQL clusters with minimal code changes and no interruption of service. +You can have the confidence that you are using tools designed specifically to handle the complexities of live traffic during large enterprise migrations. == What is the pricing model? @@ -82,81 +98,30 @@ Before {company} {product} was available, migrating client applications between {sstable-sideloader} is part of an {astra-db} *Enterprise* subscription plan, and it incurs costs based on usage. -== Is there support available if I have questions or issues during our migration? +== Where can I get help with my migration? -{product-proxy} and related software tools in the migration suite include technical assistance by {support-url}[{company} Support] for {dse-short} users, https://www.ibm.com/docs/en/esfac[IBM Elite Support for {cass}] subscribers, and {astra} organizations on an Enterprise plan. +Technical assistance with the {product-short} process is available from {support-url}[{company} Support] for {dse-short} users, https://www.ibm.com/docs/en/esfac[IBM Elite Support for {cass}] subscribers, and {astra} organizations on an **Enterprise** plan. 
-For any observed problems with {product-proxy}, submit a {product-proxy-repo}/issues[GitHub Issue] in the {product-proxy} GitHub repo. +For any observed problems with {product-proxy} or the other open-source {product-short} and data migration tools, you can report an issue in their respective GitHub repositories: -Additional examples serve as templates, from which you can learn about migrations. -{company} does not assume responsibility for making the templates work for specific use cases. +* {product-proxy-repo}[{product-proxy} repository] +* {product-automation-repo}[{product-automation} repository] (includes {product-automation} and {product-utility}) +* {cass-migrator-repo}[{cass-migrator} repository] +* {dsbulk-migrator-repo}[{dsbulk-migrator} repository] == Can I contribute to {product-proxy}? -Yes. -See https://github.com/datastax/zdm-proxy/blob/main/CONTRIBUTING.md[CONTRIBUTING.md]. - -== Where are the public GitHub repos? - -* {product-proxy-repo}[{product-proxy}] repo. - -* {product-automation-repo}[{product-automation}] repo for the Ansible-based {product-automation}, which includes {product-utility}. - -* {cass-migrator-repo}[cassandra-data-migrator] repo for the tool that supports migrating larger data quantities as well as detailed verifications and reconciliation options. - -* {dsbulk-migrator-repo}[dsbulk-migrator] repo for the tool that allows simple data migrations without validation and reconciliation capabilities. +Yes, see `https://github.com/datastax/zdm-proxy/blob/main/CONTRIBUTING.md[CONTRIBUTING.md]`. == Does {product-proxy} support Transport Layer Security (TLS)? -Yes, and here's a summary: - -* For application-to-proxy TLS, the application is the TLS client and {product-proxy} is the TLS server. -One-way TLS and Mutual TLS are both supported. -* For proxy-to-cluster TLS, {product-proxy} acts as the TLS client and the cluster as the TLS server. -One-way TLS and Mutual TLS are both supported. 
-* When {product-proxy} connects to {astra-db} clusters, it always implicitly uses Mutual TLS. -This is done through the {scb} and does not require any extra configuration. - -For TLS details, see xref:tls.adoc[]. +Yes, see xref:tls.adoc[]. == How does {product-proxy} handle Lightweight Transactions (LWTs)? -//TODO: Compare and replace with link to LWT section on feasibility-checklists.adoc - -{product-proxy} handles LWTs as write operations. -The proxy sends the LWT to the origin and target clusters concurrently, and waits for a response from both. -{product-proxy} will return a `success` status to the client if both the origin and target clusters send successful acknowledgements. -Otherwise, it will return a `failure` status if one or both do not return an acknowledgement. - -What sets LWTs apart from regular writes is that they are conditional. For important details, including the client context for a returned `applied` flag, see xref:feasibility-checklists.adoc#_lightweight_transactions_and_the_applied_flag[Lightweight transactions and the applied flag]. +See xref:feasibility-checklists.adoc#_lightweight_transactions_and_the_applied_flag[Lightweight transactions and the applied flag]. == Can {product-proxy} be deployed as a sidecar? -{product-proxy} should not be deployed as a sidecar. - -{product-proxy} was designed to mimic a {cass-short} cluster. -For this reason, we recommend deploying multiple {product-proxy} instances, each running on a dedicated machine, instance, or VM. - -For best performance, this deployment should be close to the client applications (ideally on the same local network) but not co-deployed on the same machines as the client applications. - -This way, each client application instance can connect to all {product-proxy} instances, just as it would connect to all nodes in a {cass-short} cluster (or datacenter). 
- -This deployment model gives maximum resilience and failure tolerance guarantees and allows the client application driver to continue using the same load balancing and retry mechanisms that it would normally use. - -Conversely, deploying a single {product-proxy} instance would undermine this resilience mechanism and create a single point of failure, which could affect the client applications if one or more nodes of the underlying origin or target clusters go offline. -In a sidecar deployment, each client application instance would be connecting to a single {product-proxy} instance, and would therefore be exposed to this risk. - -For more information, see xref:deployment-infrastructure.adoc#_choosing_where_to_deploy_the_proxy[Choosing where to deploy the proxy]. - -== What are the benefits of using a cloud-native database? - -When moving your client applications and data from on-premise {cass-short} Query Language (CQL) based data stores ({cass-short} or {dse-short}) to a cloud-native database (CNDB) like {astra-db}, it's important to acknowledge the fundamental differences ahead. - -With on-premise infrastructure, you have total control of the datacenter's physical infrastructure, software configurations, and your custom procedures. -At the same time, with on-premise clusters you take on the cost of infrastructure resources, maintenance, operations, and personnel. - -Ranging from large enterprises to small teams, IT managers, operators, and developers are realizing that the Total Cost of Ownership (TCO) with cloud solutions is much lower than continuing to run on-prem physical data centers. - -A CNDB like {astra-db} is a different environment. -Running on proven cloud providers like AWS, Google Cloud, and Azure, {astra-db} greatly reduces complexity and increases convenience by surfacing a subset of configurable settings. -It provides a UI (the {astra-ui}), APIs, and CLI tools to interact with your {astra-db} organizations and databases. 
\ No newline at end of file +Don't deploy {product-proxy} as a sidecar. +For more information, see xref:deployment-infrastructure.adoc#_choosing_where_to_deploy_the_proxy[Choosing where to deploy the proxy]. \ No newline at end of file diff --git a/modules/ROOT/pages/feasibility-checklists.adoc b/modules/ROOT/pages/feasibility-checklists.adoc index e5be3ba7..4a439870 100644 --- a/modules/ROOT/pages/feasibility-checklists.adoc +++ b/modules/ROOT/pages/feasibility-checklists.adoc @@ -1,7 +1,7 @@ = Feasibility checks :page-aliases: ROOT:preliminary-steps.adoc -Before starting your migration, refer to the following considerations to ensure that your client application workload and xref:glossary.adoc#origin[**Origin**] are suitable for this {product} process. +Before starting your migration, review the following information to ensure that your client application workload and origin (source) cluster are suitable for the {product} process. True zero downtime migration is only possible if your database meets the minimum requirements described on this page. If your database doesn't meet these requirements, you can still complete the migration, but downtime might be necessary to finish the migration. diff --git a/modules/ROOT/pages/migrate-and-validate-data.adoc b/modules/ROOT/pages/migrate-and-validate-data.adoc index 2bdc2018..e2a8e2e6 100644 --- a/modules/ROOT/pages/migrate-and-validate-data.adoc +++ b/modules/ROOT/pages/migrate-and-validate-data.adoc @@ -38,7 +38,7 @@ For more information, see xref:ROOT:dsbulk-migrator.adoc[]. == Other data migration processes -Depending on your source and target databases, there might be other {product-short}-compatible data migration tools available, or you can write your own custom data migration processes with a tool like Apache Spark(TM). 
+Depending on your origin and target databases, there might be other {product-short}-compatible data migration tools available, or you can write your own custom data migration processes with a tool like Apache Spark(TM). To use a data migration tool with {product-proxy}, it must meet the following requirements: diff --git a/modules/ROOT/pages/tls.adoc b/modules/ROOT/pages/tls.adoc index 7f32cdc4..84e99ecf 100644 --- a/modules/ROOT/pages/tls.adoc +++ b/modules/ROOT/pages/tls.adoc @@ -5,19 +5,20 @@ The TLS configuration is an optional part of the initial {product-proxy} configuration, which includes xref:setup-ansible-playbooks.adoc[] and xref:deploy-proxy-monitoring.adoc[]. -== Introduction +You can enable TLS between {product-proxy} and any cluster that requires it. +You can also enable it between your client application and {product-proxy}, if required. -* All TLS configuration is optional. Enable TLS between {product-proxy} and any cluster that requires it, and/or between your client application and {product-proxy}, if required. - -* Proxy-to-cluster TLS can be configured between {product-proxy} and either or both the origin and target clusters, as desired. -Each set of configurations is independent of the other. When using proxy-to-cluster TLS, {product-proxy} acts as the TLS client and the cluster as the TLS server. -One-way TLS and Mutual TLS are both supported and can be enabled depending on each cluster's requirements. +* Proxy-to-cluster TLS can be configured between {product-proxy} and either or both the origin and target clusters, as needed. +Each set of configurations is independent of the other. ++ +When using proxy-to-cluster TLS, {product-proxy} acts as the TLS client, and the cluster acts as the TLS server. +One-way TLS and Mutual TLS are both supported, and they can be enabled as needed for each cluster's requirements. -* When using application-to-proxy TLS, your client application is the TLS client and {product-proxy} is the TLS server. 
+* When using application-to-proxy TLS, your client application is the TLS client, and {product-proxy} is the TLS server. One-way TLS and Mutual TLS are both supported. * When {product-proxy} connects to {astra-db}, it always implicitly uses Mutual TLS. -This is done through the {scb} and does not require any extra configuration. +This is done through the xref:astra-db-serverless:databases:secure-connect-bundle.adoc[{scb}] and does not require any extra configuration. [[_retrieving_files_from_a_jks_keystore]] == Retrieving files from a JKS keystore diff --git a/modules/ROOT/pages/troubleshooting-tips.adoc b/modules/ROOT/pages/troubleshooting-tips.adoc index 5950b941..56564b6b 100644 --- a/modules/ROOT/pages/troubleshooting-tips.adoc +++ b/modules/ROOT/pages/troubleshooting-tips.adoc @@ -296,9 +296,9 @@ To resolve this issue, do one of the following: * If your application uses any dependency that includes a version of the Java driver, such as Spring Boot or `spring-data-cassandra`, you must upgrade these dependencies to a version that uses Java driver 4.10.0 or later. -* If your are using the Java driver directly, upgrade to version 4.10.0 or later, if these versions are compatible with both your source and target clusters. +* If you are using the Java driver directly, upgrade to version 4.10.0 or later, if these versions are compatible with both your origin and target clusters. -* Force the protocol version on the driver to the highest version that is supported by both your source and target clusters. +* Force the protocol version on the driver to the highest version that is supported by both your origin and target clusters. Typically, `V4` is broadly supported. However, if you are migrating from {dse-short} to {dse-short}, then use `DSE_V1` for {dse-short} 5.x migrations, and `DSE_V2` for {dse-short} 6.x migrations.
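The protocol-version guidance above (`V4` as the broad fallback, `DSE_V1` for {dse-short} 5.x, `DSE_V2` for {dse-short} 6.x) can be sketched as a small selection rule. This is an illustration only, not driver or {product-proxy} code; the platform names and version table are assumptions for the example.

```python
# Illustrative sketch: pick the highest native-protocol version supported by
# both the origin and target clusters, per the guidance in the text above.
# The platform keys and version table below are assumptions for illustration.
SUPPORTED = {
    "cassandra-3": ["V3", "V4"],
    "dse-5": ["V3", "V4", "DSE_V1"],
    "dse-6": ["V3", "V4", "DSE_V1", "DSE_V2"],
}

ORDER = ["V3", "V4", "DSE_V1", "DSE_V2"]  # ascending capability

def highest_common_protocol(origin: str, target: str) -> str:
    """Return the highest protocol version both clusters support."""
    common = set(SUPPORTED[origin]) & set(SUPPORTED[target])
    return max(common, key=ORDER.index)

print(highest_common_protocol("dse-5", "dse-5"))        # DSE_V1
print(highest_common_protocol("dse-6", "cassandra-3"))  # V4
```

You would then force the chosen version through your driver's protocol-version setting; consult your driver's documentation for the exact option name.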
+ diff --git a/modules/ROOT/pages/zdm-proxy-migration-paths.adoc b/modules/ROOT/pages/zdm-proxy-migration-paths.adoc index a42b90c9..56e56433 100644 --- a/modules/ROOT/pages/zdm-proxy-migration-paths.adoc +++ b/modules/ROOT/pages/zdm-proxy-migration-paths.adoc @@ -1,20 +1,31 @@ = Cluster compatibility for {product} :description: Learn which sources and targets are eligible for {product}. -True zero downtime migration is only possible if your database meets the minimum requirements described in xref:ROOT:feasibility-checklists.adoc[], including compatibility of the source and target clusters. +True zero downtime migration is only possible if your database meets the minimum requirements described in xref:ROOT:feasibility-checklists.adoc[], including compatibility of the origin (source) and target (destination) clusters. -== Compatible source and target clusters for migrations with zero downtime +== Compatible origin and target clusters for migrations with zero downtime include::ROOT:partial$migration-scenarios.adoc[] +[TIP] +==== +You can use {product-short} to support major version upgrades for your current database platform, such as upgrades from {dse-short} 5.0 to {dse-short} 6.9. +Using {product-short} reduces the risk of data loss or corruption due to breaking changes between versions, provides a seamless rollback option, and streamlines the upgrade process, eliminating the need for interim upgrades and progressive manual reconfiguration. +==== + +[#incompatible-clusters-and-migrations-with-some-downtime] == Incompatible clusters and migrations with some downtime If you don't want to use {product-proxy} or your databases don't meet the zero-downtime requirements, you can still complete the migration, but some downtime might be necessary to finish the migration. -If your clusters are incompatible, you might be able to use data migration tools such as xref:ROOT:dsbulk-migrator-overview.adoc[{dsbulk-migrator}] or a custom data migration script. 
+If your origin cluster is incompatible with {product-proxy}, {product-utility}, and {product-automation}, you might be able to use standalone xref:ROOT:components.adoc#data-migration-tools[data migration tools] such as {dsbulk-migrator} or a custom data migration script. Make sure you transform or prepare the data to comply with the target cluster's schema. +For more complex migrations, such as RDBMS-to-NoSQL migrations, it is likely that your migration will require downtime for additional processing, such as extract, transform, and load (ETL) operations. +For example, see the data modeling and compatibility considerations for xref:6.9@dse:managing:operations/migrate-data.adoc[migrating to {dse-short}]. + +{company} recommends that you contact your {company} account representative or {support-url}[{company} Support] for guidance on incompatible or partially compatible migrations. + == See also -* xref:ROOT:components.adoc[] -* xref:ROOT:feasibility-checklists.adoc[] \ No newline at end of file +* xref:ROOT:components.adoc[] \ No newline at end of file From bd722d322623bb8871d5cec6396fde5d154d8a45 Mon Sep 17 00:00:00 2001 From: April M <36110273+aimurphy@users.noreply.github.com> Date: Fri, 14 Nov 2025 08:56:12 -0800 Subject: [PATCH 07/11] finish faq and glossary pages --- modules/ROOT/nav.adoc | 6 +- modules/ROOT/pages/components.adoc | 4 +- .../ROOT/pages/connect-clients-to-proxy.adoc | 18 ++- .../ROOT/pages/deployment-infrastructure.adoc | 2 +- modules/ROOT/pages/faqs.adoc | 42 +++++-- modules/ROOT/pages/glossary.adoc | 105 ------------------ modules/ROOT/pages/introduction.adoc | 4 +- .../ROOT/pages/manage-proxy-instances.adoc | 2 +- modules/ROOT/pages/metrics.adoc | 2 +- 9 files changed, 50 insertions(+), 135 deletions(-) delete mode 100644 modules/ROOT/pages/glossary.adoc diff --git a/modules/ROOT/nav.adoc b/modules/ROOT/nav.adoc index 62b009b9..db0137ca 100644 --- a/modules/ROOT/nav.adoc +++ b/modules/ROOT/nav.adoc @@ -32,10 +32,8 @@ ** 
xref:ROOT:change-read-routing.adoc[] * Phase 5 ** xref:ROOT:connect-clients-to-target.adoc[] -* Support -** xref:ROOT:troubleshooting-tips.adoc[] -** xref:ROOT:faqs.adoc[] -** xref:ROOT:glossary.adoc[] +* xref:ROOT:troubleshooting-tips.adoc[] +* xref:ROOT:faqs.adoc[] * Release notes ** {product-proxy-repo}/releases[{product-proxy} release notes] ** {product-automation-repo}/releases[{product-automation} release notes] diff --git a/modules/ROOT/pages/components.adoc b/modules/ROOT/pages/components.adoc index 0cdce771..33f409cc 100644 --- a/modules/ROOT/pages/components.adoc +++ b/modules/ROOT/pages/components.adoc @@ -30,7 +30,7 @@ Typically, you only need to update the connection string. === How {product-proxy} handles reads and writes {company} created {product-proxy} to orchestrate requests between a client application and both the origin and target clusters. -These clusters can be any CQL-compatible data store, such as {cass-reg}, {dse}, and {astra-db}. +These clusters can be any xref:cql:ROOT:index.adoc[CQL]-compatible data store, such as {cass-reg}, {dse}, and {astra-db}. During the migration process, you designate one cluster as the _primary cluster_, which serves as the source of truth for reads. For the majority of the migration process, this is typically the origin cluster. @@ -39,7 +39,7 @@ Towards the end of the migration process, when you are ready to read exclusively The other cluster is referred to as the _secondary cluster_. While {product-proxy} is active, write requests are sent to both clusters to ensure data consistency, but only the primary cluster serves read requests. 
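The dual-write rule described above (writes fan out to both clusters; only the primary serves reads; a write succeeds only if both clusters acknowledge it) can be sketched minimally as follows. The stub send functions are assumptions for illustration, not {product-proxy} source code.

```python
# Illustrative sketch of ZDM Proxy's dual-write rule: forward the write to
# both clusters concurrently and report success only if both acknowledge.
from concurrent.futures import ThreadPoolExecutor

def dual_write(write_to_origin, write_to_target, statement):
    """Return True only if both clusters acknowledge the write."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        origin_ack = pool.submit(write_to_origin, statement)
        target_ack = pool.submit(write_to_target, statement)
        return origin_ack.result() and target_ack.result()

# Stubs standing in for real cluster connections:
ok = lambda stmt: True
fail = lambda stmt: False

print(dual_write(ok, ok, "INSERT ..."))    # True: both clusters acknowledged
print(dual_write(ok, fail, "INSERT ..."))  # False: target did not acknowledge
```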
-==== Writes +==== Writes (dual-write logic) {product-proxy} sends every write operation (`INSERT`, `UPDATE`, `DELETE`) synchronously to both clusters at the requested consistency level: diff --git a/modules/ROOT/pages/connect-clients-to-proxy.adoc b/modules/ROOT/pages/connect-clients-to-proxy.adoc index 2bf4934e..80c542ea 100644 --- a/modules/ROOT/pages/connect-clients-to-proxy.adoc +++ b/modules/ROOT/pages/connect-clients-to-proxy.adoc @@ -1,18 +1,16 @@ = Connect your client applications to {product-proxy} :navtitle: Connect client applications to {product-proxy} -{product-proxy} is designed to be similar to a conventional {cass-reg} cluster. -You communicate with it using the CQL query language used in your existing client applications. -It understands the same messaging protocols used by {cass-short}, {dse}, and {astra-db}. -As a result, most of your client applications won't be able to distinguish between connecting to {product-proxy} and connecting directly to your {cass-short} cluster. +{product-proxy} is designed to mimic communication with a typical cluster based on {cass-reg}. +This means that your client applications connect to {product-proxy} in the same way that they already connect to your existing {cass-short}-based clusters. -On this page, we explain how to connect your client applications to a {cass-short} cluster. -We then move on to discuss how this process changes when connecting to a {product-proxy}. -We conclude by describing two sample client applications that serve as real-world examples of how to build a client application that works effectively with {product-proxy}. +You can communicate with {product-proxy} using the same xref:cql:ROOT:index.adoc[CQL] statements used in your existing client applications. +It understands the same messaging protocols used by {cass-short}, {dse-short}, {hcd-short}, and {astra-db}. 
-You can use the provided sample client applications, in addition to your own, as a quick way to validate that the deployed {product-proxy} is reading and writing data from the expected origin and target clusters. +As a result, most client applications won't be able to distinguish between connections to {product-proxy} and direct connections to a {cass-short}-based cluster. -This topic also explains how to connect `cqlsh` to {product-proxy}. +This page explains how to connect your client applications to a {cass-short}-based cluster, compares this process to connections to {product-proxy}, provides realistic examples of client applications that work effectively with {product-proxy}, and, finally, explains how to connect `cqlsh` to {product-proxy}. +You can use the provided sample client applications, in addition to your own, as a quick way to validate that the deployed {product-proxy} is reading and writing data from the expected origin and target clusters. == {company}-compatible drivers @@ -209,7 +207,7 @@ The configuration logic as well as the cluster and session management code have == Connect cqlsh to {product-proxy} -`cqlsh` is a command-line tool that you can use to send {cass-short} Query Language (CQL) statements to your {cass-short}-based clusters, including {astra-db}, {dse-short}, {hcd-short}, and {cass} databases. +`cqlsh` is a command-line tool that you can use to send CQL statements and `cqlsh`-specific commands to your {cass-short}-based clusters, including {astra-db}, {dse-short}, {hcd-short}, and {cass} databases. You can use your database's included version of `cqlsh`, or you can download and run a standalone `cqlsh`. 
diff --git a/modules/ROOT/pages/deployment-infrastructure.adoc b/modules/ROOT/pages/deployment-infrastructure.adoc index efcbd114..10bdcb45 100644 --- a/modules/ROOT/pages/deployment-infrastructure.adoc +++ b/modules/ROOT/pages/deployment-infrastructure.adoc @@ -100,7 +100,7 @@ The only direct access to these machines should be from the jumphost. The {product-proxy} machines must be able to connect to the origin and target cluster nodes: * For self-managed clusters ({cass} or {dse-short}), connectivity is needed to the {cass-short} native protocol port (typically 9042). -* For {astra-db}, you will need to ensure outbound connectivity to the {astra} endpoint indicated in the {scb}. +* For {astra-db}, you will need to ensure outbound connectivity to the {astra} endpoint indicated in the xref:astra-db-serverless:databases:secure-connect-bundle.adoc[{scb}]. Connectivity over Private Link is also supported. The connectivity requirements for the jumphost / monitoring machine are: diff --git a/modules/ROOT/pages/faqs.adoc b/modules/ROOT/pages/faqs.adoc index 9f0fdf9c..58ef0289 100644 --- a/modules/ROOT/pages/faqs.adoc +++ b/modules/ROOT/pages/faqs.adoc @@ -1,21 +1,20 @@ -= Frequently Asked Questions -:navtitle: FAQs -:page-aliases: ROOT:contributions.adoc += {product} frequently asked questions +:navtitle: {product-short} FAQs +:page-aliases: ROOT:contributions.adoc, ROOT:glossary.adoc -This page includes common questions about the {company} {product} tools. +This page includes common questions about the {company} {product} ({product-short}) tools. -//TODO: Eliminate redundancies in these FAQs and the Glossary. //FAQs in ZDM-proxy repo: https://github.com/datastax/zdm-proxy/blob/main/faq.md#what-versions-of-apache-cassandra-or-cql-compatible-data-stores-does-the-zdm-proxy-support == What is a zero-downtime migration? 
-A zero-downtime migration with the {company} {product} ({product-short}) tools means you can reliably migrate your client applications and data between CQL clusters with no interruption of service. +A zero-downtime migration with the {company} {product-short} tools means you can reliably migrate your client applications and data between CQL clusters with no interruption of service. == Which platforms and versions are supported by the {product-short} tools? See xref:ROOT:zdm-proxy-migration-paths.adoc[]. -== Why should I use the {product} ({product-short}) tools for my migration? +== Why should I use the {product-short} tools for my migration? There are several benefits to using the {product-short} tools for your migration: @@ -28,7 +27,7 @@ Once you set up {product-proxy}, the dual-writes feature ensures that new writes * Safely test the new cluster under full production workloads: In addition to the dual-writes feature, you can optionally enable asynchronous dual-reads to test the target cluster's ability to handle a production workload before you permanently switch to the target cluster at the end of the migration process. + -Client applications aren't interrupted by errors or latency spikes on the new, target cluster. +Client applications aren't interrupted by read errors or latency spikes on the new target cluster. Although these errors and metrics are received by {product-proxy} for monitoring and performance benchmarking purposes, they aren't propagated back to the client applications. + From the client side, traffic is seamless and uninterrupted during the entire migration process. @@ -124,4 +123,29 @@ See xref:feasibility-checklists.adoc#_lightweight_transactions_and_the_applied_f == Can {product-proxy} be deployed as a sidecar? Don't deploy {product-proxy} as a sidecar. -For more information, see xref:deployment-infrastructure.adoc#_choosing_where_to_deploy_the_proxy[Choosing where to deploy the proxy].

\ No newline at end of file +For more information, see xref:deployment-infrastructure.adoc#_choosing_where_to_deploy_the_proxy[Choosing where to deploy the proxy]. + +[#what-are-origin-target-primary-and-secondary-clusters] +== What are origin, target, primary, and secondary clusters? + +These terms refer to where your data is located during the migration process, and where read and write requests are sent by {product-proxy}: + +Origin and target:: +The _origin cluster_ is your existing database that you are migrating away from. +The _target cluster_ is your new database that you are migrating to. ++ +Origin and target cluster credentials are provided to {product-proxy} so it can establish connections and send requests to both clusters. + +Primary and secondary:: +The _primary cluster_ is the database that is designated as the source of truth for read requests. +It receives all read requests by default, and the responses from these read requests are returned to the client application. +The primary cluster is set by {product-automation} through the `primary_cluster` variable, or you can set it directly through the {product-proxy} `ZDM_PRIMARY_CLUSTER` environment variable. ++ +The other database is the _secondary cluster_. +It doesn't receive read requests unless you enable asynchronous dual-reads. ++ +For the majority of the migration process, the origin cluster is the primary cluster, and the target cluster is the secondary cluster. +Near the end of the migration, when you have validated that all pre-existing data has been replicated to the target cluster, you set the target cluster as the primary cluster. ++ +Throughout the entire migration, until you decommission {product-proxy}, both clusters receive all write requests through the dual writes feature. +For more information, see xref:components.adoc#how-zdm-proxy-handles-reads-and-writes[How {product-proxy} handles reads and writes]. 
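The `ZDM_PRIMARY_CLUSTER` environment variable mentioned above accepts the values `ORIGIN` and `TARGET`. A minimal sketch of how a deployment wrapper script might read and validate that setting is shown below; the helper function is an assumption for illustration, not part of {product-proxy} itself.

```python
# Illustrative sketch (not ZDM Proxy source): validate the primary-cluster
# setting against the documented values ORIGIN and TARGET.
import os

VALID_PRIMARY = {"ORIGIN", "TARGET"}

def read_primary_cluster(env=None):
    """Return the configured primary cluster, or raise on an invalid value."""
    env = os.environ if env is None else env
    value = env.get("ZDM_PRIMARY_CLUSTER", "").upper()
    if value not in VALID_PRIMARY:
        raise ValueError(
            f"ZDM_PRIMARY_CLUSTER must be ORIGIN or TARGET, got {value!r}"
        )
    return value

# Start of the migration: the origin is the source of truth.
print(read_primary_cluster({"ZDM_PRIMARY_CLUSTER": "ORIGIN"}))  # ORIGIN
# After data validation near the end of the migration: switch to the target.
print(read_primary_cluster({"ZDM_PRIMARY_CLUSTER": "target"}))  # TARGET
```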
\ No newline at end of file diff --git a/modules/ROOT/pages/glossary.adoc b/modules/ROOT/pages/glossary.adoc deleted file mode 100644 index fd245662..00000000 --- a/modules/ROOT/pages/glossary.adoc +++ /dev/null @@ -1,105 +0,0 @@ -= Glossary - -//TODO: Determine which terms are actually needed. Convert to partials if the definitions need to be repeated, otherwise replace links to this page with links to more useful and complete information. - -Here are a few terms used throughout the {company} {product} documentation and code. - -[[_ansible_playbooks]] -== Ansible playbooks - -A repeatable, re-usable, simple configuration management and multi-machine deployment system, one that is well suited to deploying complex applications. -For details about the playbooks available in {product-automation}, see: - -* xref:setup-ansible-playbooks.adoc[]. -* xref:deploy-proxy-monitoring.adoc[]. - -[[_asynchronous_dual_reads]] -== Asynchronous dual reads - -An optional feature that is designed to test the target cluster's ability to handle a production workload before you permanently switch to the target cluster at the end of the migration process. - -When enabled, {product-proxy} sends asynchronous read requests to the secondary cluster (typically the target cluster) in addition to the synchronous read requests that are sent to the primary cluster by default. - -For more information, see xref:ROOT:enable-async-dual-reads.adoc[]. - -== CQL - -{cass-short} Query Language (CQL) is a query language for the {cass-short} database. -It includes DDL and DML statements. -For details, see the xref:cql:ROOT:index.adoc[{cass-short} Query Language documentation]. - -== Dual-write logic - -{product-proxy} handles your client application's real-time write requests and forwards them to two {cass-short}-based origin and target clusters simultaneously. 
-The dual-write logic in {product-proxy} means that you do not need to modify your client application to perform dual writes manually during a migration: {product-proxy} takes care of it for you. -See the diagram in the xref:introduction.adoc#migration-workflow[workflow introduction]. - -[[origin]] -== Origin - -Your existing {cass-short}-based database that you are migrating away from. -It is the opposite of the <>. - -[[_primary_cluster]] -== Primary cluster - -The database that is designated as the source of truth for read requests. -It is the opposite of the <>. - -The primary cluster is set by {product-automation} through the `primary_cluster` variable, or you can set it directly through the `ZDM_PRIMARY_CLUSTER` environment variable for {product-proxy}. - -For the majority of the migration process, the <> is typically the primary cluster. -Near the end of the migration, you shift the primary cluster to the <>. - -For information about which cluster receives reads and writes during the migration process, see xref:components.adoc#how-zdm-proxy-handles-reads-and-writes[How {product-proxy} handles reads and writes]. - -== Playbooks - -See xref:glossary.adoc#_ansible_playbooks[Ansible playbooks]. - -== Proxy - -Generally speaking, a proxy is a software class functioning as an interface to something else. -The proxy could interface to anything: a network connection, a large object in memory, a file, or some other resource. -A proxy is a wrapper or agent object that is being called by the client to access the real serving object behind the scenes. -In our context here, see <>. - -== Read mirroring - -See <<_asynchronous_dual_reads>>. - -[[secondary-cluster]] -== Secondary cluster - -The database that isn't designated as the source of truth for read requests. -It is the opposite of the <<_primary_cluster>>. - -For the majority of the migration process, the secondary cluster is the <>. 
-Near the end of the migration, the target database becomes the <<_primary_cluster>>, and then the <> becomes the secondary cluster. - -For information about which cluster receives reads and writes during the migration process, see xref:components.adoc#how-zdm-proxy-handles-reads-and-writes[How {product-proxy} handles reads and writes]. - -[[_secure_connect_bundle_scb]] -== {scb} - -A ZIP file that contains connection metadata and TLS encryption certificates (but not the database credentials) for your {astra-db} database. -For more information, see xref:astra-db-serverless:databases:secure-connect-bundle.adoc[]. - -[[target]] -== Target - -The database to which you are migrating your data and applications. -It is the opposite of the <>. - -[[zdm-automation]] -== {product-automation} - -An Ansible-based tool that allows you to deploy and manage the {product-proxy} instances and associated monitoring stack. -To simplify its setup, the suite includes {product-utility}. -This interactive utility creates a Docker container acting as the Ansible Control Host. -The Ansible playbooks constitute {product-automation}. - -[[zdm-proxy]] -== {product-proxy} - -An open-source component designed to seamlessly handle the real-time client application activity while a migration is in progress. diff --git a/modules/ROOT/pages/introduction.adoc b/modules/ROOT/pages/introduction.adoc index 5d0afb98..63ad2bda 100644 --- a/modules/ROOT/pages/introduction.adoc +++ b/modules/ROOT/pages/introduction.adoc @@ -22,7 +22,7 @@ For example, you might move from self-managed clusters to a cloud-based Database {product-short} is comprised of {product-proxy}, {product-utility}, and {product-automation}, which orchestrate activity-in-transition on your databases. To move and validate data, you use {sstable-sideloader}, {cass-migrator}, or {dsbulk-migrator}. 
-{product-proxy} keeps your databases in sync at all times by a dual-write logic configuration, which means you can stop the migration or xref:rollback.adoc[roll back] at any point. +{product-proxy} keeps your databases in sync at all times through the dual-writes feature, which means you can stop the migration or xref:rollback.adoc[roll back] at any point. For more information about these tools, see xref:ROOT:components.adoc[]. When the migration is complete, the data is present in the new database, and you can update your client applications to connect exclusively to the new database. @@ -46,7 +46,7 @@ The _target_ is your new {cass-short}-based environment where you want to migrate === Migration planning -Before you begin a migration, your client applications perform read/write operations with your existing CQL-compatible database, such as {cass}, {dse-short}, {hcd-short}, or {astra-db}. +Before you begin a migration, your client applications perform read/write operations with your existing xref:cql:ROOT:index.adoc[CQL]-compatible database, such as {cass}, {dse-short}, {hcd-short}, or {astra-db}. image:pre-migration0ra.png["Pre-migration environment."] diff --git a/modules/ROOT/pages/manage-proxy-instances.adoc b/modules/ROOT/pages/manage-proxy-instances.adoc index 0d9f9261..8582fc31 100644 --- a/modules/ROOT/pages/manage-proxy-instances.adoc +++ b/modules/ROOT/pages/manage-proxy-instances.adoc @@ -77,7 +77,7 @@ The following configuration variables are considered mutable and can be changed Commonly changed variables, located in `vars/zdm_proxy_core_config.yml`: * `primary_cluster`: -** This variable determines which cluster is currently considered the xref:glossary.adoc#_primary_cluster[primary cluster]. +** This variable determines which cluster is currently considered the xref:ROOT:faqs.adoc#what-are-origin-target-primary-and-secondary-clusters[primary cluster].
At the start of the migration, the primary cluster is the origin cluster because it contains all of the data. In Phase 4 of the migration, once all the existing data has been transferred and any validation/reconciliation step has been successfully executed, you can switch the primary cluster to be the target cluster. ** Valid values: `ORIGIN`, `TARGET`. diff --git a/modules/ROOT/pages/metrics.adoc b/modules/ROOT/pages/metrics.adoc index 397eeb84..509fab44 100644 --- a/modules/ROOT/pages/metrics.adoc +++ b/modules/ROOT/pages/metrics.adoc @@ -35,7 +35,7 @@ image::zdm-grafana-proxy-dashboard1.png[Grafana dashboard shows three categories + ** Read Latency: Total latency measured by {product-proxy} per read request, including post-processing, such as response aggregation. This metric has two labels: `reads_origin` and `reads_target`. -The label that has data depends on which cluster is receiving the reads, which is the current xref:glossary.adoc#_primary_cluster[primary cluster]. +The label that has data depends on which cluster is receiving the reads, which is the current xref:ROOT:faqs.adoc#what-are-origin-target-primary-and-secondary-clusters[primary cluster]. ** Write Latency: Total latency measured by {product-proxy} per write request, including post-processing, such as response aggregation. This metric is measured as the total latency across both clusters for a single xref:ROOT:components.adoc#how-zdm-proxy-handles-reads-and-writes[bifurcated write request]. 
From d7c5adc73b58788a715322eb360a2fa32d4b38fa Mon Sep 17 00:00:00 2001 From: April M <36110273+aimurphy@users.noreply.github.com> Date: Fri, 14 Nov 2025 09:03:41 -0800 Subject: [PATCH 08/11] finish faq page --- modules/ROOT/pages/components.adoc | 2 +- modules/ROOT/pages/faqs.adoc | 12 +++++++++--- 2 files changed, 10 insertions(+), 4 deletions(-) diff --git a/modules/ROOT/pages/components.adoc b/modules/ROOT/pages/components.adoc index 33f409cc..1ff5c791 100644 --- a/modules/ROOT/pages/components.adoc +++ b/modules/ROOT/pages/components.adoc @@ -159,7 +159,7 @@ Whenever possible, {company} recommends using the {product-short} process to orc ==== For certain migration paths, it is possible to perform in-place database platform replacements on the same cluster where you data already exists. -Supported paths for in-place migrations include xref:6.9@dse:planning:migrate-cassandra-to-dse.adoc[{cass} to {dse-short}] and xref:1.2@hcd:migrate:dse-68-to-hcd-12.adoc[{dse-short} to {hcd-short}]. +Supported paths for in-place migrations include xref:6.9@dse:planning:migrate-cassandra-to-dse.adoc[{cass} to {dse-short}] and xref:1.2@hyper-converged-database:migrate:dse-68-to-hcd-12.adoc[{dse-short} to {hcd-short}]. == See also diff --git a/modules/ROOT/pages/faqs.adoc b/modules/ROOT/pages/faqs.adoc index 58ef0289..ba2774b5 100644 --- a/modules/ROOT/pages/faqs.adoc +++ b/modules/ROOT/pages/faqs.adoc @@ -4,8 +4,6 @@ This page includes common questions about the {company} {product} ({product-short}) tools. -//FAQs in ZDM-proxy repo: https://github.com/datastax/zdm-proxy/blob/main/faq.md#what-versions-of-apache-cassandra-or-cql-compatible-data-stores-does-the-zdm-proxy-support - == What is a zero-downtime migration? A zero-downtime migration with the {company} {product-short} tools means you can reliably migrate your client applications and data between CQL clusters with no interruption of service. 
@@ -148,4 +146,12 @@ For the majority of the migration process, the origin cluster is the primary clu Near the end of the migration, when you have validated that all pre-existing data has been replicated to the target cluster, you set the target cluster as the primary cluster. + Throughout the entire migration, until you decommission {product-proxy}, both clusters receive all write requests through the dual writes feature. -For more information, see xref:components.adoc#how-zdm-proxy-handles-reads-and-writes[How {product-proxy} handles reads and writes]. \ No newline at end of file +For more information, see xref:components.adoc#how-zdm-proxy-handles-reads-and-writes[How {product-proxy} handles reads and writes]. + +== Is there a difference between cluster and database? + +In the context of the {product-short} process, the terms _cluster_ and _database_ are used interchangeably to refer to the source and destination for the data that you are moving during your migration. + +== See also + +* https://github.com/datastax/zdm-proxy/blob/main/faq.md[{product-proxy} FAQ on GitHub] \ No newline at end of file From 3f1f00a1623bfbd08fcc78d08afe897e494ae96a Mon Sep 17 00:00:00 2001 From: April M <36110273+aimurphy@users.noreply.github.com> Date: Fri, 14 Nov 2025 11:54:17 -0800 Subject: [PATCH 09/11] unifying some redundant stuff --- .../ROOT/pages/feasibility-checklists.adoc | 63 ++--- .../ROOT/pages/manage-proxy-instances.adoc | 256 ++++++++++-------- modules/ROOT/pages/troubleshooting-tips.adoc | 25 +- 3 files changed, 189 insertions(+), 155 deletions(-) diff --git a/modules/ROOT/pages/feasibility-checklists.adoc b/modules/ROOT/pages/feasibility-checklists.adoc index 4a439870..3d9e6967 100644 --- a/modules/ROOT/pages/feasibility-checklists.adoc +++ b/modules/ROOT/pages/feasibility-checklists.adoc @@ -74,33 +74,16 @@ It is also highly recommended to perform tests and benchmarks when connected dir [[_read_only_applications]] === Read-only applications -Read-only applications 
require special handling only if you are using {product-proxy} versions older than 2.1.0. +In versions 2.1.0 and later, {product-proxy} sends periodic heartbeats to keep idle cluster connections alive. +The default interval is 30,000 milliseconds, and it can be configured with the `xref:ROOT:manage-proxy-instances.adoc#change-mutable-config-variable[heartbeat_interval_ms]` variable, or by directly setting the `ZDM_HEARTBEAT_INTERVAL_MS` environment variable if you aren't using {product-automation}. -[TIP] -==== -If you have an existing {product-proxy} deployment, you can xref:ROOT:troubleshooting-tips.adoc#check-version[check your {product-proxy} version]. +In {product-proxy} versions earlier than 2.1.0, read-only applications require special handling to avoid connection termination due to inactivity. +{company} recommends that you upgrade to version 2.1.0 or later to benefit from the heartbeat feature. +If you have an existing {product-proxy} deployment, you can xref:ROOT:troubleshooting-tips.adoc#check-version[check your {product-proxy} version]. For upgrade instructions, see xref:ROOT:manage-proxy-instances.adoc#_upgrade_the_proxy_version[Upgrade the proxy version]. -==== - -//TODO: combine the below 2 sections to only use 2.1.0 or later. -//Reconcile with troubleshooting-tips.adoc in case this issue is also described there. -==== Versions older than 2.1.0 - -If a client application only sends `SELECT` statements to a database connection then you may find that {product-proxy} terminates these read-only connections periodically, which may result in request errors if the driver is not configured to retry these requests in these conditions. - -This happens because {astra-db} terminates idle connections after some inactivity period (usually around 10 minutes). 
-If {astra-db} is your target, and a client connection is only sending read requests to {product-proxy}, then the {astra-db} connection that is paired to that client connection will remain idle and will be eventually terminated. - -A potential workaround is to not connect these read-only client applications to {product-proxy}, but you need to ensure that these client applications switch reads to the target at any point after all the data has been migrated and all validation and reconciliation has completed. -Another work around is to implement a mechanism in your client application that creates a new `Session` periodically to avoid the {astra-db} inactivity timeout. -You can also implement some kind of meaningless write request that the application sends periodically to make sure the {astra-db} connection doesn't idle. - -==== Version 2.1.0 and newer - -This issue is solved in version 2.1.0 of {product-proxy}, which introduces periodic heartbeats to keep alive idle cluster connections. -We strongly recommend using version 2.1.0 (or newer) to benefit from this improvement, especially if you have a read-only workload. +If you cannot upgrade to version 2.1.0 or later, see the alternatives described in xref:ROOT:troubleshooting-tips.adoc#client-application-closed-connection-errors-every-10-minutes-when-migrating-to-astra-db[Client application closed connection errors every 10 minutes when migrating to {astra-db}]. [[non-idempotent-operations]] == Lightweight Transactions and other non-idempotent operations @@ -110,9 +93,7 @@ Examples of non-idempotent operations in CQL are: * Lightweight Transactions (LWTs) * Counter updates * Collection updates with `+=` and `-=` operators -* Non-deterministic functions like `now()` and `uuid()` - -For more information on how to handle non-deterministic functions, see <>. +* Non-deterministic functions like `now()` and `uuid()` (see <>) Given that there are two separate clusters involved, the state of each cluster may be different. 
For conditional writes, this may create a divergent state for a time. @@ -209,26 +190,28 @@ The authentication configuration on each cluster can be different between the or [[cql-function-replacement]] == Server-side non-deterministic functions in the primary key -Statements with functions like `now()` and `uuid()` will result in data inconsistency between the origin and target clusters because the values are computed at the cluster level. +Statements with xref:dse-6.9@cql:reference:uuid.adoc[UUID and timeuuid functions], like `now()` and `uuid()`, create data inconsistencies between the origin and target clusters because the values are computed at the cluster level. -If these functions are used for columns that are not part of the primary key, you may find it acceptable to have different values in the two clusters depending on your application business logic. -However, if these columns are part of the primary key, the data migration phase will not be successful as there will be data inconsistencies between the two clusters and they will never be in sync. +If these functions are used for regular non-primary key columns, you must determine if it is acceptable to have different values in the two clusters depending on your application business logic. +However, if these functions are used in any primary key column, then your data migration phase will fail because of data inconsistencies between the two clusters. +Effectively, the clusters will never be truly in sync from a programmatic perspective. -[NOTE] -==== -{product-short} does not support the `uuid()` function currently. +{product-proxy} has an option to replace `now()` with a timeUUID calculated at the proxy level to ensure that these records write the same value to both clusters. + +To enable this feature, set `replace_cql_functions` to `true`. +For more information, see xref:manage-proxy-instances.adoc#change-mutable-config-variable[Change a mutable configuration variable]. 
+ +[IMPORTANT] ==== +The `replace_cql_functions` option only replaces the `now()` function. -{product-proxy} is able to compute timestamps and replace `now()` function references with such timestamps in CQL statements at proxy level to ensure that these parameters will have the same value when these statements are sent to both clusters. -However, this feature is disabled by default because it might result in performance degradation. -We highly recommend that you test this properly before using it in production. -Also keep in mind that this feature is only supported for `now()` functions at the moment. -To enable this feature, set the configuration variable `replace_cql_function` to `true`. -For more, see xref:manage-proxy-instances.adoc#change-mutable-config-variable[Change a mutable configuration variable]. +This feature is disabled by default because it has a noticeable impact on performance. +{company} recommends that you test this feature extensively before using it in production. +==== -If you find that the performance is not acceptable when this feature is enabled, or the feature doesn't cover a particular function that your client application is using, then you will have to make a change to your client application so that the value is computed locally (at client application level) before the statement is sent to the database. +If the performance impact is unacceptable for your application, or you are using functions other than `now()`, then you must change your client application to use values calculated locally at the client-level before the statement is sent to the database. Most drivers have utility methods that help you compute these values locally. -For more information, see your driver's documentation. +For more information, see your driver's documentation and xref:datastax-drivers:developing:query-timestamps.adoc[Query timestamps in {cass-short} drivers]. 
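The client-side alternative described above can be sketched in Python. This is an illustrative sketch only, not part of {product-proxy}: it uses the standard library's `uuid.uuid1()` to produce a version-1 (time-based) UUID locally, so both clusters receive an identical value. The table, column, and `session` names in the comments are hypothetical, and {cass-short} drivers provide comparable helpers (for example, `cassandra.util.uuid_from_time` in the Python driver).

```python
import uuid

def new_timeuuid() -> uuid.UUID:
    """Compute a version-1 (time-based) UUID on the client instead of
    calling the server-side now() function in CQL."""
    return uuid.uuid1()

# Instead of:  INSERT INTO events (id, payload) VALUES (now(), ?)
# bind the locally computed value (hypothetical table/session names):
#   session.execute("INSERT INTO events (id, payload) VALUES (?, ?)",
#                   [new_timeuuid(), payload])
```

Because the value is fixed before the statement leaves the application, the same primary key reaches the origin and target, and no proxy-level rewriting is needed for that statement.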
== Driver retry policy and query idempotence diff --git a/modules/ROOT/pages/manage-proxy-instances.adoc b/modules/ROOT/pages/manage-proxy-instances.adoc index 8582fc31..31e8c093 100644 --- a/modules/ROOT/pages/manage-proxy-instances.adoc +++ b/modules/ROOT/pages/manage-proxy-instances.adoc @@ -1,13 +1,8 @@ = Manage your {product-proxy} instances -In this topic, we'll learn how to perform simple operations on your {product-proxy} deployment with no interruption to its availability: +After you deploy {product-proxy} instances, you might need to perform various management operations, such as rolling restarts, configuration changes, log inspection, version upgrades, and infrastructure changes. -* Do a simple rolling restart of the {product-proxy} instances -* View or collect the logs of all {product-proxy} instances -* Change a mutable configuration variable -* Upgrade the {product-proxy} version - -With {product-automation}, you can use Ansible playbooks for all of these operations. +If you are using {product-automation}, you can use Ansible playbooks for all of these operations. == Perform a rolling restart of the proxies @@ -70,122 +65,171 @@ To avoid downtime, wait for each instance to fully restart and begin receiving t For information about configuring, retrieving, and interpreting {product-proxy} logs, see xref:ROOT:troubleshooting-tips.adoc#proxy-logs[Viewing and interpreting {product-proxy} logs]. [[change-mutable-config-variable]] -== Change a mutable configuration variable +== Change mutable configuration variables + +Some, but not all, configuration variables can be changed after you deploy a {product-proxy} instance. -The following configuration variables are considered mutable and can be changed in a rolling fashion on an existing {product-proxy} deployment. +This section lists the _mutable_ configuration variables that you can change on an existing {product-proxy} deployment using the rolling restart playbook. 
-Commonly changed variables, located in `vars/zdm_proxy_core_config.yml`: +=== Mutable variables in `vars/zdm_proxy_core_config.yml` -* `primary_cluster`: -** This variable determines which cluster is currently considered the xref:ROOT:faqs.adoc#what-are-origin-target-primary-and-secondary-clusters[primary cluster]. +* `primary_cluster`: Determines which cluster is currently considered the xref:ROOT:faqs.adoc#what-are-origin-target-primary-and-secondary-clusters[primary cluster], either `ORIGIN` or `TARGET`. ++ At the start of the migration, the primary cluster is the origin cluster because it contains all of the data. -In Phase 4 of the migration, once all the existing data has been transferred and any validation/reconciliation step has been successfully executed, you can switch the primary cluster to be the target cluster. -** Valid values: `ORIGIN`, `TARGET`. -* `read_mode`: -** This variable determines how reads are handled by {product-proxy}. -** Valid values: -*** `PRIMARY_ONLY`: reads are only sent synchronously to the primary cluster. This is the default behavior. -*** `DUAL_ASYNC_ON_SECONDARY`: reads are sent synchronously to the primary cluster and also asynchronously to the secondary cluster. +After all the existing data has been transferred and validated/reconciled on the new cluster, you can switch the primary cluster to the target cluster. + +* `read_mode`: Determines how reads are handled by {product-proxy}: ++ +** `PRIMARY_ONLY` (default): Reads are sent synchronously to the primary cluster only. +** `DUAL_ASYNC_ON_SECONDARY`: Reads are sent synchronously to the primary cluster, and also asynchronously to the secondary cluster. See xref:enable-async-dual-reads.adoc[]. -** Typically, when choosing `DUAL_ASYNC_ON_SECONDARY` you will want to ensure that `primary_cluster` is still set to `ORIGIN`. -When you are ready to use the target cluster as the primary cluster, revert `read_mode` to `PRIMARY_ONLY`. -* `log_level`: -** Defaults to `INFO`. 
-** Only set to `DEBUG` if necessary and revert to `INFO` as soon as possible, as the extra logging can have a slight performance impact. - -Other, rarely changed variables: - -* Origin username/password in `vars/zdm_proxy_cluster_config.yml` -* Target username/password in `vars/zdm_proxy_cluster_config.yml` -* Advanced configuration variables in `vars/zdm_proxy_advanced_config.yml`: -** `zdm_proxy_max_clients_connections`: -*** Maximum number of client connections that {product-proxy} should accept. -Each client connection results in additional cluster connections and causes the allocation of several in-memory structures, so this variable can be tweaked to cap the total number on each instance. -A high number of client connections per proxy instance may cause some performance degradation, especially at high throughput. -*** Defaults to `1000`. -** `replace_cql_functions`: -*** Whether {product-proxy} should replace standard CQL function calls in write requests with a value computed at proxy level. -*** Currently, only the replacement of `now()` is supported. -*** Boolean value. -Disabled by default. -Enabling this will have a noticeable performance impact. -** `zdm_proxy_request_timeout_ms`: -*** Global timeout (in ms) of a request at proxy level. -*** This variable determines how long {product-proxy} will wait for one cluster (in case of reads) or both clusters (in case of writes) to reply to a request. -If this timeout is reached, {product-proxy} will abandon that request and no longer consider it as pending, thus freeing up the corresponding internal resources. -Note that, in this case, {product-proxy} will not return any result or error: when the client application's own timeout is reached, the driver will time out the request on its side. -*** Defaults to `10000` ms. -If your client application has a higher client-side timeout because it is expected to generate requests that take longer to complete, you need to increase this timeout accordingly. 
-** `origin_connection_timeout_ms` and `target_connection_timeout_ms`: -*** Timeout (in ms) when attempting to establish a connection from the proxy to the origin or the target. -*** Defaults to `30000` ms. -** `async_handshake_timeout_ms`: -*** Timeout (in ms) when performing the initialization (handshake) of a proxy-to-secondary cluster connection that will be used solely for asynchronous dual reads. -*** If this timeout occurs, the asynchronous reads will not be sent. -This has no impact on the handling of synchronous requests: {product-proxy} will continue to handle all synchronous reads and writes normally. -*** Defaults to `4000` ms. -** `heartbeat_interval_ms`: -*** Frequency (in ms) with which heartbeats will be sent on cluster connections (i.e. all control and request connections to the origin and the target). -Heartbeats keep idle connections alive. -*** Defaults to `30000` ms. -** `metrics_enabled`: -*** Whether metrics collection should be enabled. -*** Boolean value. -Defaults to `true`, but can be set to `false` to completely disable metrics collection. -This is not recommended. - -** [[zdm_proxy_max_stream_ids]]`zdm_proxy_max_stream_ids`: -*** In the CQL protocol every request has a unique id, named stream id. -This variable allows you to tune the maximum pool size of the available stream ids managed by {product-proxy} per client connection. -In the application client, the stream ids are managed internally by the driver, and in most drivers the max number is 2048 (the same default value used in the proxy). -If you have a custom driver configuration with a higher value, you should change this property accordingly. -*** Defaults to `2048`. - -Deprecated variables, which will be removed in a future {product-proxy} release: - -* `forward_client_credentials_to_origin`: -** Whether the credentials provided by the client application are for the origin cluster. -** Boolean value. 
-Defaults to `false` (the client application is expected to pass the target credentials), can be set to `true` if the client passes credentials for the origin cluster instead. - -To change any of these variables, edit the desired values in `vars/zdm_proxy_core_config.yml`, `vars/zdm_proxy_cluster_config.yml` (credentials only) and/or `vars/zdm_proxy_advanced_config.yml` (mutable variables only, as listed above). - -To apply the configuration changes to the {product-proxy} instances in a rolling fashion, run the following command: ++ +Typically, you only set `read_mode` to `DUAL_ASYNC_ON_SECONDARY` if the `primary_cluster` variable is set to `ORIGIN`. +This is because asynchronous dual reads are primarily intended to help test production workloads against the target cluster near the end of the migration. +When you are ready to switch `primary_cluster` to `TARGET`, revert `read_mode` to `PRIMARY_ONLY` because there is no need to send reads to both clusters at that point in the migration. -[source,bash] ---- -ansible-playbook rolling_update_zdm_proxy.yml -i zdm_ansible_inventory ---- +* `log_level`: Set the {product-proxy} log level as `INFO` (default) or `DEBUG`. ++ +Only use `DEBUG` while temporarily troubleshooting an issue. +Revert to `INFO` as soon as possible because the extra logging can impact performance slightly. ++ +For more information, see xref:ROOT:troubleshooting-tips.adoc#proxy-logs[Check {product-proxy} logs]. + +=== Mutable variables in `vars/zdm_proxy_cluster_config.yml` + +* Origin username and password + +* Target username and password + +=== Mutable variables in `vars/zdm_proxy_advanced_config.yml` + +* `zdm_proxy_max_clients_connections`: The maximum number of client connections that {product-proxy} can accept. +Each client connection results in additional cluster connections and causes the allocation of several in-memory structures.
+A high number of client connections per proxy instance can cause performance degradation, especially at high throughput. +Adjust this variable to limit the total number of connections on each instance. ++ +Default: `1000` + +* `replace_cql_functions`: Whether {product-proxy} replaces standard `now()` CQL function calls in write requests with an explicit timeUUID value computed at proxy level. ++ +If `false` (default), replacement of `now()` is disabled. +If `true`, {product-proxy} replaces instances of `now()` in write requests with an explicit timeUUID value before sending the write to each cluster. ++ +[IMPORTANT] +==== +Enabling `replace_cql_functions` has a noticeable performance impact because the proxy must do more extensive parsing and manipulation of the statements before sending the modified statement to each cluster. +Only enable this variable if required, and implement proper performance testing to quantify and prepare for the performance impact. + +If you use `now()` to populate a regular (non-primary key) column, consider if you can pragmatically accept a slight discrepancy in the values between the origin and target cluster for these writes. +This depends on your application, and whether it can tolerate a potential difference of a few milliseconds. + +However, if you use `now()` to populate a primary key column, differences between the origin and target values result in different primary keys. +This means that the same row on the origin and target are technically considered different records, and this will cause problems with duplicate entries that aren't caught by validation (because the primary keys are different). +If `now()` is used in any of your primary key columns, {company} recommends enabling `replace_cql_functions`, regardless of the performance impact. + +For more information, see xref:ROOT:feasibility-checklists.adoc#cql-function-replacement[Server-side non-deterministic functions in the primary key]. 
+==== + +* `zdm_proxy_request_timeout_ms`: Global timeout in milliseconds of a request at proxy level. +Determines how long {product-proxy} waits for one cluster (for reads) or both clusters (for writes) to reply to a request. +Upon reaching the timeout limit, {product-proxy} abandons the request and no longer considers it pending, which frees up internal resources to process other requests. +When a request is abandoned due to a timeout, {product-proxy} doesn't return any result or error. +A timeout warning or error is only returned when the client application's own timeout is reached and the request is expired on the driver side. ++ +Make sure `zdm_proxy_request_timeout_ms` is always greater than your client application's client-side timeout. +If the client has an especially high timeout because it routinely generates long-running requests, you must increase the `zdm_proxy_request_timeout_ms` timeout accordingly so that the {product-proxy} doesn't abandon requests prematurely. ++ +Default: `10000` + +* `origin_connection_timeout_ms` and `target_connection_timeout_ms`: Timeout in milliseconds for establishing a connection from the proxy to the origin or target cluster, respectively. ++ +Default: `30000` + +* `async_handshake_timeout_ms`: Timeout in milliseconds for the initialization (handshake) of the connection that is used solely for asynchronous dual reads between the proxy and the secondary cluster. ++ +Upon reaching the timeout limit, the asynchronous reads aren't sent because the connection failed to be established. +This has no impact on the handling of synchronous requests: {product-proxy} continues to handle all synchronous reads and writes as normal against the primary cluster.
++ +Default: `4000` + +* `heartbeat_interval_ms`: The interval in milliseconds that heartbeats are sent to keep idle cluster connections alive. +This includes all control and request connections to the origin and the target clusters. ++ +Default: `30000` + +* `metrics_enabled`: Whether to enable metrics collection. +The default is `true` (enabled). ++ +If `false`, {product-proxy} metrics collection is completely disabled. +This isn't recommended. + +[[zdm_proxy_max_stream_ids]] +* `zdm_proxy_max_stream_ids`: Set the maximum pool size of available stream IDs managed by {product-proxy} per client connection. +Use the same value as your driver's maximum stream IDs configuration. ++ +In the CQL protocol, every request has a unique stream ID. +However, if there are a lot of requests in a given amount of time, errors can occur due to xref:datastax-drivers:developing:speculative-retry.adoc#stream-id-exhaustion[stream ID exhaustion]. ++ +In the client application, the stream IDs are managed internally by the driver, and, in most drivers, the max number is 2048, which is the same default value used by {product-proxy}. +If you have a custom driver configuration with a higher value, make sure `zdm_proxy_max_stream_ids` matches your driver's maximum stream IDs. ++ +Default: `2048` + +=== Deprecated mutable variables + +Deprecated variables will be removed in a future {product-proxy} release. +Replace them with their recommended alternatives as soon as possible. + +* `forward_client_credentials_to_origin`: Whether to use the credentials provided by the client application to connect to the origin cluster. +If `false` (default), the credentials from the client application were used to connect to the target cluster. +If `true`, the credentials from the client application were used to connect to the origin cluster. ++ +This deprecated variable is no longer functional. +Instead, the expected credentials are based on the authentication requirements of the origin and target clusters.
+For more information, see xref:ROOT:connect-clients-to-proxy.adoc#_client_application_credentials[Client application credentials]. + +=== Apply mutable configuration changes -. It stops one container gracefully, waiting for it to shut down. -. It recreates the container and starts it up. +. Edit mutable variables in their corresponding configuration files: `vars/zdm_proxy_core_config.yml`, `vars/zdm_proxy_cluster_config.yml`, or `vars/zdm_proxy_advanced_config.yml`. + +. Apply the configuration changes to your {product-proxy} instances using the rolling restart playbook. + [IMPORTANT] ==== -A configuration change is a destructive action because containers are considered immutable. -Note that this will remove the previous container and its logs. -Make sure you collect the logs prior to this operation if you want to keep them. +A configuration change is a destructive action because running this playbook removes the previous container and its logs, replacing it with a new container and the new configuration. +xref:ROOT:troubleshooting-tips.adoc#proxy-logs[Collect the logs] before you run the playbook if you want to keep them. ==== -. It checks that the container has come up successfully by checking the readiness endpoint: -.. If unsuccessful, it repeats the check for six times at 5-second intervals and eventually interrupts the whole process if the check still fails. -.. If successful, it waits for 10 seconds and then moves on to the next container. ++ +[source,bash] +---- +ansible-playbook rolling_update_zdm_proxy.yml -i zdm_ansible_inventory +---- +The rolling restart playbook recreates each {product-proxy} container, one by one, with the updated configuration files. +The {product-proxy} deployment remains available at all times, and you can safely use it throughout this operation. + +The playbook performs the following actions automatically: + +. Stop one container gracefully, and then wait for it to shut down. +. Recreate the container, and then start it. +. 
Check that the container started successfully by checking the readiness endpoint: ++ +* If unsuccessful, repeat the check up to six times at 5-second intervals. +If it still fails, interrupt the entire rolling restart process. +* If successful, wait 10 seconds (default), and then move on to the next container. ++ The pause between the restart of each {product-proxy} instance defaults to 10 seconds. To change this value, you can set the desired number of seconds in `zdm-proxy-automation/ansible/vars/zdm_playbook_internal_config.yml`. -[NOTE] -==== -All configuration variables that are not listed in this section are considered immutable and can only be changed by recreating the deployment. +== Change immutable configuration variables -If you wish to change any of the immutable configuration variables on an existing deployment, you will need to re-run the deployment playbook (`deploy_zdm_proxy.yml`, as documented in xref:deploy-proxy-monitoring.adoc[this page]). -This playbook can be run as many times as necessary. +All configuration variables not listed in <> are _immutable_ and can only be changed by recreating the deployment with the xref:ROOT:deploy-proxy-monitoring.adoc[initial deployment playbook] (`deploy_zdm_proxy.yml`). -Be aware that running the `deploy_zdm_proxy.yml` playbook results in a brief window of unavailability of the whole {product-proxy} deployment while all the {product-proxy} instances are torn down and recreated. -==== +You can re-run the deployment playbook as many times as necessary. +However, this playbook decommissions and recreates _all_ {product-proxy} instances simultaneously. +This results in a brief period of time where the entire {product-proxy} deployment is offline because no instances are available. + +For more information, see xref:ROOT:troubleshooting-tips.adoc#configuration-changes-arent-applied-by-zdm-automation[Configuration changes aren't applied by {product-automation}]. 
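The readiness-check loop in the steps above can be modeled as follows. This is an illustrative Python sketch of the playbook's behavior (repeated attempts at 5-second intervals), not the playbook's actual implementation; `check_readiness` is a hypothetical stand-in for an HTTP probe of the proxy's readiness endpoint.

```python
import time

def wait_until_ready(check_readiness, attempts=6, interval_s=5):
    """Poll a readiness probe, mirroring the rolling restart behavior:
    return True as soon as the probe succeeds, or False once all
    attempts are exhausted (the point at which the playbook would
    interrupt the whole rolling restart)."""
    for _ in range(attempts):
        if check_readiness():
            return True
        time.sleep(interval_s)
    return False
```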
[[_upgrade_the_proxy_version]] == Upgrade the proxy version @@ -309,7 +353,7 @@ Remove an instance:: -- ====== -== Proxy topology addresses enable failover and high availability +=== Proxy topology addresses enable failover and high availability When you configure a {product-proxy} deployment, either through {product-automation} or manually-managed {product-proxy} instances, you specify the addresses of your instances. These are populated in the `ZDM_PROXY_TOPOLOGY_ADDRESSES` variable, either manually or automatically depending on how you manage your instances. diff --git a/modules/ROOT/pages/troubleshooting-tips.adoc b/modules/ROOT/pages/troubleshooting-tips.adoc index 56564b6b..7fccc2c5 100644 --- a/modules/ROOT/pages/troubleshooting-tips.adoc +++ b/modules/ROOT/pages/troubleshooting-tips.adoc @@ -23,12 +23,10 @@ However, this can slightly degrade performance, and {company} recommends that yo How you set the log level depends on how you deployed {product-proxy}: -* If you used {product-automation} to deploy {product-proxy}, set `log_level` in `vars/zdm_proxy_core_config.yml`. -+ -You can change this value in a rolling fashion by editing the variable and running the `rolling_update_zdm_proxy.yml` playbook. +* If you used {product-automation} to deploy {product-proxy}, set `log_level` in `vars/zdm_proxy_core_config.yml`, and then run the `rolling_update_zdm_proxy.yml` playbook. For more information, see xref:manage-proxy-instances.adoc#change-mutable-config-variable[Change a mutable configuration variable]. -* If you didn't use {product-automation} to deploy {product-proxy}, set the `ZDM_LOG_LEVEL` environment variable on each proxy instance and then restart each instance. +* If you didn't use {product-automation} to deploy {product-proxy}, set the `ZDM_LOG_LEVEL` environment variable on each proxy instance, and then restart each instance. 
=== Get {product-proxy} log files @@ -254,6 +252,7 @@ For example, you might compare `cluster_name` to ensure that all instances are c The following sections provide troubleshooting advice for specific issues or error messages related to {product}. +[#configuration-changes-arent-applied-by-zdm-automation] === Configuration changes aren't applied by {product-automation} If you change some configuration variables, and then performing a rolling restart with the `rolling_update_zdm_proxy.yml` playbook, you might notice that some changes aren't applied to your {product-proxy} instances. @@ -574,17 +573,25 @@ If you are running a version prior to 2.1.0, upgrade {product-proxy}. If these errors are constantly written to the log files over a period of minutes or hours, then you likely need to restart the client application _or_ {product-proxy} to fix the issue. If you find an error like this, <> so the {product-short} team can investigate it. +[#client-application-closed-connection-errors-every-10-minutes-when-migrating-to-astra-db] === Client application closed connection errors every 10 minutes when migrating to {astra-db} This issue is fixed in {product-proxy} 2.1.0. -If you are running an earlier version, and the logs report that the {astra-db} `TARGET-CONNECTOR` is disconnected every 10 minutes, upgrade your {product-proxy} instances to 2.1.0 or later to resolve this issue. +In {product-proxy} versions earlier than 2.1.0, the logs can report that the {astra-db} `TARGET-CONNECTOR` is disconnected every 10 minutes. +This happens because {astra-db} terminates idle connections after 10 minutes of inactivity. +In the absence of asynchronous dual reads, the target cluster won't get any traffic when the client application produces only read requests because {product-short} forwards all reads to the origin cluster only. 
+ +To resolve this issue, {company} recommends that you upgrade your {product-proxy} instances to 2.1.0 or later to take advantage of the heartbeats feature, which keeps the connection alive during periods of inactivity. +You can tune the heartbeat interval with the `xref:ROOT:manage-proxy-instances.adoc#change-mutable-config-variable[heartbeat_interval_ms]` variable, or by directly setting the `ZDM_HEARTBEAT_INTERVAL_MS` environment variable if you aren't using {product-automation}. + +If upgrading is impossible, you can try the following alternatives: + +* Don't connect the read-only client applications to {product-proxy}, and then manually ensure that these client applications switch reads to the target at any point after all the data has been migrated, validated, and reconciled on the target cluster. -This issue occurred because {astra-db} terminates idle connections after 10 minutes of inactivity. -In the absence of asynchronous dual reads, the target cluster won't get any traffic if the client application sends only read requests because {product-short} forwards all reads to the origin cluster only. +* Implement a mechanism in your client application that creates a new `Session` periodically to avoid the {astra-db} inactivity timeout. -This issue is fixed in {product-proxy} 2.1.0, which sends heartbeats after 30 seconds of inactivity on a cluster connection to keep it alive. -You can tune the heartbeat interval with the Ansible configuration variable `heartbeat_insterval_ms`, or by directly setting the `ZDM_HEARTBEAT_INTERVAL_MS` environment variable if you aren't using {product-automation}. +* Implement a mechanism in your client application to issue a periodic meaningless write request to prevent the {astra-db} connection from becoming idle. 
=== Performance degradation with {product-proxy} From 4cdfc9d1004e835b81b6822fe829aed8992f1087 Mon Sep 17 00:00:00 2001 From: April M <36110273+aimurphy@users.noreply.github.com> Date: Fri, 14 Nov 2025 12:14:53 -0800 Subject: [PATCH 10/11] finish manage proxy instances --- .../ROOT/pages/feasibility-checklists.adoc | 107 +++++++++--------- .../ROOT/pages/manage-proxy-instances.adoc | 43 +++++-- 2 files changed, 85 insertions(+), 65 deletions(-) diff --git a/modules/ROOT/pages/feasibility-checklists.adoc b/modules/ROOT/pages/feasibility-checklists.adoc index 3d9e6967..27e6a201 100644 --- a/modules/ROOT/pages/feasibility-checklists.adoc +++ b/modules/ROOT/pages/feasibility-checklists.adoc @@ -90,22 +90,23 @@ If you cannot upgrade to version 2.1.0 or later, see the alternatives described Examples of non-idempotent operations in CQL are: -* Lightweight Transactions (LWTs) +* Lightweight Transactions (LWTs) (see <<_lightweight_transactions_and_the_applied_flag>>) * Counter updates * Collection updates with `+=` and `-=` operators * Non-deterministic functions like `now()` and `uuid()` (see <>) -Given that there are two separate clusters involved, the state of each cluster may be different. -For conditional writes, this may create a divergent state for a time. +Given that there are two separate clusters involved, the state of each cluster can be different. +For conditional writes, this can create a temporary divergent state. -If non-idempotent operations are used, {company} recommends adding a reconciliation phase to your migration before and after Phase 4, where you switch reads to the target. +If you use non-idempotent operations, {company} recommends adding a reconciliation phase to your migration before and after Phase 4 (where you switch reads to the target). +This allows you additional opportunities to resolve any data inconsistencies that are produced by non-idempotent operations. 
-For details about using the {cass-migrator}, see xref:migrate-and-validate-data.adoc[]. +The {cass-migrator} is ideal for detecting and reconciling these types of inconsistencies. +For more information, see xref:migrate-and-validate-data.adoc[]. -[TIP] -==== -Some application workloads can tolerate inconsistent data in some cases (especially for counter values) in which case you may not need to do anything special to handle those non-idempotent operations. -==== +If your application workloads can tolerate inconsistencies produced by LWTs and non-idempotent operations, you might not need to perform any additional validation or reconciliation steps. +This depends entirely on your application business logic and requirements. +It is your responsibility to determine whether your workloads can tolerate these inconsistencies and to what extent. [[_lightweight_transactions_and_the_applied_flag]] === Lightweight transactions and the applied flag //TODO: Align with the write request language on components.adoc //// The ZDM proxy can bifurcate lightweight transactions to the ORIGIN and TARGET clusters. However, it only returns the applied flag from one cluster, whichever cluster is the source of truth. Given that there are two separate clusters involved, the state of each cluster may be different. For conditional writes, this may create a divergent state for a time. Up to that point, an LWT's condition can be evaluated differently on each side, The response that a cluster sends after executing an LWT includes a flag called `applied`. This flag tells the client whether the LWT update was actually applied. The status depends on the condition, which in turn depends on the state of the data. -When {product-proxy} receives a response from both the origin and target, each response would have its own `applied` flag. +When {product-proxy} receives a response from both the origin and target, each response has its own `applied` flag. However, {product-proxy} can only return a *single response* to the client. Recall that the client has no knowledge that there are two clusters behind the proxy. @@ -145,50 +146,8 @@ If your client has logic that depends on the `applied` flag, be aware that durin To reiterate, {product-proxy} only returns the `applied` value from the primary cluster, which is the cluster from where read results are returned to the client application.
By default, this is the origin cluster. This means that when you set the target cluster as your primary cluster, then the `applied` value returned to the client application will come from the target cluster. -== Advanced workloads ({dse-short}) - -=== Graph - -{product-proxy} handles all {dse-short} Graph requests as write requests even if the traversals are read-only. There is no special handling for these requests, so you need to take a look at the traversals that your client application sends and determine whether the traversals are idempotent. If the traversals are non-idempotent then the reconciliation step is needed. - -Keep in mind that our recommended tools for data migration and reconciliation are CQL-based, so they can be used for migrations where the origin cluster is a database that uses the new {dse-short} Graph engine released with {dse-short} 6.8, but *cannot be used for the old Graph engine* that older {dse-short} versions relied on. -See <> for more information about non-idempotent operations. - -=== Search - -Read-only {dse-short} Search workloads can be moved directly from the origin to the target without {product-proxy} being involved. -If your client application uses Search and also issues writes, or if you need the read routing capabilities from {product-proxy}, then you can connect your Search workloads to it as long as you are using xref:datastax-drivers:compatibility:driver-matrix.adoc[{company}-compatible drivers] to submit these queries. -This approach means the queries are regular CQL `SELECT` statements, so {product-proxy} handles them as regular read requests. - -If you use the HTTP API then you can either modify your applications to use the CQL API instead or you will have to move those applications directly from the origin to the target when the migration is complete if that is acceptable. 
- -== Client compression - -The binary protocol used by {cass-short}, {dse-short}, {hcd-short}, and {astra-db} supports optional compression of transport-level requests and responses that reduces network traffic at the cost of CPU overhead. - -When establishing connections from client applications, {product-proxy} responds with a list of compression algorithms supported by both clusters. -The compression algorithm configured in your {company}-compatible driver must match any item from the common list, or CQL request compression must be disabled completely. -{product-proxy} cannot decompress and recompress CQL requests using different compression algorithms. - -This isn't related to storage compression, which you can configure on specific tables with the `compression` table property. -Storage/table compression doesn't affect the client application or {product-proxy} in any way. - -== Authenticator and Authorizer configuration - -{product-proxy} supports the following cluster authenticator configurations: - -* No authenticator -* `PasswordAuthenticator` -* `DseAuthenticator` with `internal` or `ldap` scheme - -{product-proxy} does *not* support `DseAuthenticator` with `kerberos` scheme. - -While the authenticator has to be supported, the *authorizer* does not affect client applications or {product-proxy} so you should be able to use any kind of authorizer configuration on both of your clusters. - -The authentication configuration on each cluster can be different between the origin and target clustesr, as {product-proxy} treats them independently. - [[cql-function-replacement]] -== Server-side non-deterministic functions in the primary key +=== Server-side non-deterministic functions in the primary key Statements with xref:dse-6.9@cql:reference:uuid.adoc[UUID and timeuuid functions], like `now()` and `uuid()`, create data inconsistencies between the origin and target clusters because the values are computed at the cluster level. 
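The divergence described above can be demonstrated with a toy model in which each dual-written statement is evaluated independently per cluster, as happens with server-side `now()` and `uuid()`. This is a sketch of the behavior, not {product-proxy} code; generating the value once on the client side is the usual remedy.

```python
import uuid

def dual_write(make_value):
    """Simulate a dual write where each cluster evaluates the statement's
    values independently, as with server-side now()/uuid()."""
    origin_value = make_value()   # evaluated on the origin cluster
    target_value = make_value()   # evaluated again on the target cluster
    return origin_value, target_value

# Server-side generation: each cluster computes its own uuid(), so the
# "same" row lands under two different primary keys.
origin_pk, target_pk = dual_write(uuid.uuid4)
assert origin_pk != target_pk

# Client-side generation: compute the value once and bind it as a literal,
# so both clusters receive the identical primary key.
pk = uuid.uuid4()
origin_pk, target_pk = dual_write(lambda: pk)
assert origin_pk == target_pk
```

The same reasoning applies to `now()`: a timeuuid computed by each cluster differs in both its timestamp and node components, so compute it client-side and bind the result.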
@@ -236,6 +195,48 @@ For more information, see the following driver documentation: * xref:datastax-drivers:developing:query-idempotence.adoc[] * xref:datastax-drivers:connecting:retry-policies.adoc[] +== {dse-short} advanced workloads + +Graph:: +{product-proxy} handles all {dse-short} Graph requests as write requests, even if the traversals are read-only. There is no special handling for these requests, so you need to review the traversals that your client application sends and determine whether they are idempotent. If the traversals are non-idempotent, then the reconciliation step is required. ++ +Keep in mind that our recommended tools for data migration and reconciliation are CQL-based, so they can be used for migrations where the origin cluster is a database that uses the new {dse-short} Graph engine released with {dse-short} 6.8, but *cannot be used for the old Graph engine* that older {dse-short} versions relied on. +See <> for more information about non-idempotent operations. + +Search:: +Read-only {dse-short} Search workloads can be moved directly from the origin to the target without {product-proxy} being involved. +If your client application uses Search and also issues writes, or if you need the read routing capabilities from {product-proxy}, then you can connect your Search workloads to it as long as you are using xref:datastax-drivers:compatibility:driver-matrix.adoc[{company}-compatible drivers] to submit these queries. +This approach means the queries are regular CQL `SELECT` statements, so {product-proxy} handles them as regular read requests. ++ +If you use the HTTP API, you can either modify your applications to use the CQL API instead or, if that is acceptable, move those applications directly from the origin to the target when the migration is complete.
+ +== Client compression + +The binary protocol used by {cass-short}, {dse-short}, {hcd-short}, and {astra-db} supports optional compression of transport-level requests and responses that reduces network traffic at the cost of CPU overhead. + +When establishing connections from client applications, {product-proxy} responds with a list of compression algorithms supported by both clusters. +The compression algorithm configured in your {company}-compatible driver must match any item from the common list, or CQL request compression must be disabled completely. +{product-proxy} cannot decompress and recompress CQL requests using different compression algorithms. + +This isn't related to storage compression, which you can configure on specific tables with the `compression` table property. +Storage/table compression doesn't affect the client application or {product-proxy} in any way. + +== Authenticator and authorizer configuration + +A cluster's _authorizer_ doesn't affect client applications or {product-proxy}, which means that you can use any kind of authorizer configuration on your clusters, and they can use different authorizers. + +In contrast, a cluster's _authenticator_ must be compatible with {product-proxy}. + +{product-proxy} supports the following cluster authenticator configurations: + +* No authenticator +* `PasswordAuthenticator` +* `DseAuthenticator` with `internal` or `ldap` scheme + +{product-proxy} _doesn't_ support `DseAuthenticator` with `kerberos` scheme. + +The origin and target clusters can have different authentication configurations because {product-proxy} treats them independently. 
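The authenticator support matrix above condenses into a small compatibility check. This is an illustrative sketch, not part of any {product-short} tooling; the function name and argument shapes are invented for the example.

```python
def proxy_supports_authenticator(authenticator, scheme=None):
    """Return True if ZDM Proxy supports the cluster's authenticator
    configuration, per the support matrix above. None means no authenticator."""
    if authenticator is None or authenticator == "PasswordAuthenticator":
        return True
    if authenticator == "DseAuthenticator":
        # Only the internal and ldap schemes are supported; kerberos is not.
        return scheme in ("internal", "ldap")
    return False

# The origin and target are checked independently and may use
# different (supported) configurations.
assert proxy_supports_authenticator(None)
assert proxy_supports_authenticator("PasswordAuthenticator")
assert proxy_supports_authenticator("DseAuthenticator", scheme="internal")
assert not proxy_supports_authenticator("DseAuthenticator", scheme="kerberos")
```

Run a check like this against both clusters before planning the migration; the authorizer needs no equivalent check because it never affects {product-proxy}.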
+ == Next steps * xref:ROOT:deployment-infrastructure.adoc[] \ No newline at end of file diff --git a/modules/ROOT/pages/manage-proxy-instances.adoc b/modules/ROOT/pages/manage-proxy-instances.adoc index 31e8c093..70e9a3a3 100644 --- a/modules/ROOT/pages/manage-proxy-instances.adoc +++ b/modules/ROOT/pages/manage-proxy-instances.adoc @@ -190,16 +190,14 @@ For more information, see xref:ROOT:connect-clients-to-proxy.adoc#_client_applic === Apply mutable configuration changes -. Edit mutable variables in their corresponding configuration files: `vars/zdm_proxy_core_config.yml`, `vars/zdm_proxy_cluster_config.yml`, or `vars/zdm_proxy_advanced_config.yml`. +Edit mutable variables in their corresponding configuration files (`vars/zdm_proxy_core_config.yml`, `vars/zdm_proxy_cluster_config.yml`, or `vars/zdm_proxy_advanced_config.yml`), and then apply the configuration changes to your {product-proxy} instances using the rolling restart playbook. -. Apply the configuration changes to your {product-proxy} instances using the rolling restart playbook. -+ [IMPORTANT] ==== -A configuration change is a destructive action because running this playbook removes the previous container and its logs, replacing it with a new container and the new configuration. +A configuration change is a destructive action because the rolling restart playbook removes the previous containers and their logs, replacing them with new containers and the new configuration. xref:ROOT:troubleshooting-tips.adoc#proxy-logs[Collect the logs] before you run the playbook if you want to keep them. ==== -+ + [source,bash] ---- ansible-playbook rolling_update_zdm_proxy.yml -i zdm_ansible_inventory @@ -210,13 +208,13 @@ The {product-proxy} deployment remains available at all times, and you can safel The playbook performs the following actions automatically: -. Stop one container gracefully, and then wait for it to shut down. -. Recreate the container, and then start it. -. 
Check that the container started successfully by checking the readiness endpoint: +. {product-automation} stops one container gracefully, and then waits for it to shut down. +. {product-automation} recreates the container, and then starts it. +. {product-automation} verifies that the container started successfully by checking the readiness endpoint: + -* If unsuccessful, repeat the check up to six times at 5-second intervals. -If it still fails, interrupt the entire rolling restart process. -* If successful, wait 10 seconds (default), and then move on to the next container. +* If unsuccessful, {product-automation} repeats the check up to six times at 5-second intervals. +If it still fails, {product-automation} interrupts the entire rolling restart process. +* If successful, {product-automation} waits 10 seconds (default), and then moves on to the next container. + The pause between the restart of each {product-proxy} instance defaults to 10 seconds. To change this value, you can set the desired number of seconds in `zdm-proxy-automation/ansible/vars/zdm_playbook_internal_config.yml`. @@ -236,7 +234,12 @@ For more information, see xref:ROOT:troubleshooting-tips.adoc#configuration-chan The same playbook that you use for configuration changes can also be used to upgrade the {product-proxy} version in a rolling fashion. All containers are recreated with the given image version. -The same behavior and observations noted in <> also apply to {product-proxy} image upgrades. + +[IMPORTANT] +==== +A version change is a destructive action because the rolling restart playbook removes the previous containers and their logs, replacing them with new containers using the new image. +xref:ROOT:troubleshooting-tips.adoc#proxy-logs[Collect the logs] before you run the playbook if you want to keep them. +==== To check your current {product-proxy} version, see xref:ROOT:troubleshooting-tips.adoc#check-version[Check your {product-proxy} version].
@@ -261,6 +264,22 @@ zdm_proxy_image: datastax/zdm-proxy:2.3.4 ---- ansible-playbook rolling_update_zdm_proxy.yml -i zdm_ansible_inventory ---- ++ +The rolling restart playbook recreates each {product-proxy} container, one by one, with the new image. +The {product-proxy} deployment remains available at all times, and you can safely use it throughout this operation. ++ +The playbook performs the following actions automatically: ++ +.. {product-automation} stops one container gracefully, and then waits for it to shut down. +.. {product-automation} recreates the container, and then starts it. +.. {product-automation} verifies that the container started successfully by checking the readiness endpoint: ++ +** If unsuccessful, {product-automation} repeats the check up to six times at 5-second intervals. +If it still fails, {product-automation} interrupts the entire rolling restart process. +** If successful, {product-automation} waits 10 seconds (default), and then moves on to the next container. ++ +The pause between the restart of each {product-proxy} instance defaults to 10 seconds. +To change this value, you can set the desired number of seconds in `zdm-proxy-automation/ansible/vars/zdm_playbook_internal_config.yml`.
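The readiness check described above amounts to a bounded retry loop. The sketch below mirrors that logic with the documented values (six attempts, 5-second intervals); `check_ready` is a stand-in for an HTTP call to the instance's readiness endpoint, not a real {product-automation} function.

```python
import time

def wait_until_ready(check_ready, attempts=6, interval_s=5.0):
    """Poll the readiness endpoint up to `attempts` times.
    Returns False if the instance never becomes ready, in which case the
    caller interrupts the entire rolling restart."""
    for attempt in range(1, attempts + 1):
        if check_ready():
            return True
        if attempt < attempts:
            time.sleep(interval_s)
    return False

# Usage sketch: a fake endpoint that becomes ready on the third poll.
polls = {"count": 0}

def fake_check():
    polls["count"] += 1
    return polls["count"] >= 3

assert wait_until_ready(fake_check, interval_s=0)  # succeeds on the third attempt
```

Aborting the whole rolling restart on a failed check is what keeps the deployment available: the remaining instances are left untouched instead of being cycled onto a configuration that doesn't come up healthy.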
== Scale {product-proxy} instances From 8e4acf0a5d5588108d9bcb798212888dbaf34fdc Mon Sep 17 00:00:00 2001 From: April M <36110273+aimurphy@users.noreply.github.com> Date: Fri, 14 Nov 2025 15:40:49 -0800 Subject: [PATCH 11/11] peer review --- modules/ROOT/pages/components.adoc | 2 +- modules/ROOT/pages/faqs.adoc | 20 ++++++++++++------- .../ROOT/pages/feasibility-checklists.adoc | 6 +++--- .../ROOT/pages/manage-proxy-instances.adoc | 5 +++-- modules/ROOT/pages/troubleshooting-tips.adoc | 8 ++++++-- 5 files changed, 26 insertions(+), 15 deletions(-) diff --git a/modules/ROOT/pages/components.adoc b/modules/ROOT/pages/components.adoc index 1ff5c791..510fff29 100644 --- a/modules/ROOT/pages/components.adoc +++ b/modules/ROOT/pages/components.adoc @@ -49,7 +49,7 @@ The client can then retry the request, if appropriate, based on the client's ret This design ensures that new data is always written to both clusters, and that any failure on either cluster is always made visible to the client application. -For information about how {product-proxy} handles lightweight transactions (LWTs), see xref:feasibility-checklists.adoc#_lightweight_transactions_and_the_applied_flag[Lightweight Transactions and the applied flag]. +For information about how {product-proxy} handles Lightweight Transactions (LWTs), see xref:feasibility-checklists.adoc#_lightweight_transactions_and_the_applied_flag[Lightweight Transactions and the applied flag]. ==== Reads diff --git a/modules/ROOT/pages/faqs.adoc b/modules/ROOT/pages/faqs.adoc index ba2774b5..fbf9a3d8 100644 --- a/modules/ROOT/pages/faqs.adoc +++ b/modules/ROOT/pages/faqs.adoc @@ -16,26 +16,32 @@ See xref:ROOT:zdm-proxy-migration-paths.adoc[]. There are several benefits to using the {product-short} tools for your migration: -* Minimal client code changes: Depending on cluster compatibility, the {product-short} tools help you migrate to a new or upgraded database platform with minimal changes to your client application code. 
+Minimal client code changes:: +Depending on cluster compatibility, the {product-short} tools help you migrate to a new or upgraded database platform with minimal changes to your client application code. In some cases, you only need to change the connection string to point to the new cluster at the end of the migration process. Typically, these changes are minimal and non-invasive, especially if your client application uses an externalized property configuration for contact points. -* Real-time data consistency: {product-proxy} orchestrates real-time activity generated by your client applications, ensuring data consistency while you replicate, validate, and test your existing data on the new cluster. +Real-time data consistency:: +{product-proxy} orchestrates real-time activity generated by your client applications, ensuring data consistency while you replicate, validate, and test your existing data on the new cluster. Once you set up {product-proxy}, the dual-writes feature ensures that new writes are sent to both the origin and target clusters, so you can focus on migrating the data that was present before initializing {product-proxy}. -* Safely test the new cluster under full production workloads: In addition to the dual-writes feature, you can optionally enable asynchronous dual-reads to test the target cluster's ability to handle a production workload before you permanently switch to the target cluster at the end of the migration process. +Safely test the new cluster under full production workloads:: +In addition to the dual-writes feature, you can optionally enable asynchronous dual-reads to test the target cluster's ability to handle a production workload before you permanently switch to the target cluster at the end of the migration process. + Client applications aren't interrupted by read errors or latency spikes on the new, target cluster. 
Although these errors and metrics are received by {product-proxy} for monitoring and performance benchmarking purposes, they aren't propagated back to the client applications. + From the client side, traffic is seamless and uninterrupted during the entire migration process. -* Seamless rollback without data loss: If there is a problem during the migration, you can xref:ROOT:rollback.adoc[rollback to the original cluster] without any data loss or interruption of service. +Seamless rollback without data loss:: +If there is a problem during the migration, you can xref:ROOT:rollback.adoc[roll back to the original cluster] without any data loss or interruption of service. You can allow {product-proxy} to continue orchestrating dual-writes, or redirect your client applications back to the origin cluster and disable {product-proxy}. -* Endless validation and testing time: Because your client applications remain fully operational during the migration, and your clusters are kept in sync by {product-proxy}, you can take as much time as you need to validate and test the target cluster before switching over permanently. +Endless validation and testing time:: +Because your client applications remain fully operational during the migration, and your clusters are kept in sync by {product-proxy}, you can take as much time as you need to validate and test the target cluster before switching over permanently. -* Migrate to a different platform or perform major version upgrades: The {product-short} tools support migrations between different CQL-based platforms, such as open-source {cass-reg} to {astra-db}, as well as major version upgrades of the same platform, such as {dse-short} 5.0 to {dse-short} 6.9. 
+Migrate to a different platform or perform major version upgrades:: +The {product-short} tools support migrations between different CQL-based platforms, such as open-source {cass-reg} to {astra-db}, as well as major version upgrades of the same platform, such as {dse-short} 5.0 to {dse-short} 6.9. == What are the requirements for true zero-downtime migrations? @@ -116,7 +122,7 @@ Yes, see xref:tls.adoc[]. == How does {product-proxy} handle Lightweight Transactions (LWTs)? -See xref:feasibility-checklists.adoc#_lightweight_transactions_and_the_applied_flag[Lightweight transactions and the applied flag]. +See xref:feasibility-checklists.adoc#_lightweight_transactions_and_the_applied_flag[Lightweight Transactions and the applied flag]. == Can {product-proxy} be deployed as a sidecar? diff --git a/modules/ROOT/pages/feasibility-checklists.adoc b/modules/ROOT/pages/feasibility-checklists.adoc index 27e6a201..42fabb83 100644 --- a/modules/ROOT/pages/feasibility-checklists.adoc +++ b/modules/ROOT/pages/feasibility-checklists.adoc @@ -109,16 +109,16 @@ This depends entirely on your application business logic and requirements. It is your responsibility to determine whether your workloads can tolerate these inconsistencies and to what extent. [[_lightweight_transactions_and_the_applied_flag]] -=== Lightweight transactions and the applied flag +=== Lightweight Transactions and the applied flag //TODO: Align with the write request language on components.adoc //// -The ZDM proxy can bifurcate lightweight transactions to the ORIGIN and TARGET clusters. +The ZDM proxy can bifurcate Lightweight Transactions to the ORIGIN and TARGET clusters. However, it only returns the applied flag from one cluster, whichever cluster is the source of truth. Given that there are two separate clusters involved, the state of each cluster may be different. For conditional writes, this may create a divergent state for a time. 
-It may not make a difference in many cases, but if lightweight transactions are used, we would recommend a reconciliation phase in the migration before switching reads to rely on the TARGET cluster. +It may not make a difference in many cases, but if Lightweight Transactions are used, we would recommend a reconciliation phase in the migration before switching reads to rely on the TARGET cluster. //// {product-proxy} handles LWTs as write operations. diff --git a/modules/ROOT/pages/manage-proxy-instances.adoc b/modules/ROOT/pages/manage-proxy-instances.adoc index 70e9a3a3..12fbe247 100644 --- a/modules/ROOT/pages/manage-proxy-instances.adoc +++ b/modules/ROOT/pages/manage-proxy-instances.adoc @@ -158,10 +158,11 @@ This includes all control and request connections to the origin and the target c Default: `30000` * `metrics_enabled`: Whether to enable metrics collection. -The default is `true` (enabled). + If `false`, {product-proxy} metrics collection is completely disabled. This isn't recommended. ++ +Default: `true` (enabled) [[zdm_proxy_max_stream_ids]] * `zdm_proxy_max_stream_ids`: Set the maximum pool size of available stream IDs managed by {product-proxy} per client connection. @@ -173,7 +174,7 @@ However, if there are a lot of requests in a given amount of time, errors can oc In the client application, the stream IDs are managed internally by the driver, and, in most drivers, the max number is 2048, which is the same default value used by {product-proxy}. If you have a custom driver configuration with a higher value, make sure `zdm_proxy_max_stream_ids` matches your driver's maximum stream IDs. 
+ -Defaults: `2048` +Default: `2048` === Deprecated mutable variables diff --git a/modules/ROOT/pages/troubleshooting-tips.adoc b/modules/ROOT/pages/troubleshooting-tips.adoc index 7fccc2c5..cafdd09a 100644 --- a/modules/ROOT/pages/troubleshooting-tips.adoc +++ b/modules/ROOT/pages/troubleshooting-tips.adoc @@ -12,6 +12,7 @@ For additional assistance, you can <>, contact {product-proxy} logs can help you verify that your {product-proxy} instances are operating normally, investigate how processes are executed, and troubleshoot issues. +[#set-the-zdm-proxy-log-level] === Set the {product-proxy} log level Set the {product-proxy} log level to print the messages that you need. @@ -124,9 +125,12 @@ Keep in mind that Docker logs are deleted if the container is recreated. Some log messages contain text that seems like an error but they aren't errors. Instead, the message's `level` indicates severity: -* `level=debug` and `level=info`: Expected and normal messages that typically aren't errors. +* `level=info`: Expected and normal messages that typically aren't errors. + +* `level=debug`: Expected and normal messages that typically aren't errors. +However, they can help you find the source of a problem by providing information about the environment and conditions when the error occurred. + -If you enable `DEBUG` logging, the `debug` messages can help you find the source of a problem by providing information about the environment and conditions when the error occurred. +`debug` messages are only recorded if you <>. * `level=warn`: Reports an event that wasn't fatal to the overall process but might indicate an issue with an individual request or connection.
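When scanning collected logs against the severity guidance above, it can help to bucket lines by their `level` field first so warn-level lines surface before info and debug noise. A minimal sketch; the sample lines are illustrative, not verbatim {product-proxy} output.

```python
import re

LEVEL_RE = re.compile(r"level=(\w+)")

def bucket_by_level(log_lines):
    """Group log lines by the severity in their `level` field so that
    warn-level lines can be reviewed before info/debug noise."""
    buckets = {}
    for line in log_lines:
        match = LEVEL_RE.search(line)
        level = match.group(1) if match else "unknown"
        buckets.setdefault(level, []).append(line)
    return buckets

sample = [
    'time="..." level=info msg="Proxy connected to origin"',
    'time="..." level=debug msg="opening control connection"',
    'time="..." level=warn msg="request timed out on target"',
]
buckets = bucket_by_level(sample)
print(sorted(buckets))  # ['debug', 'info', 'warn']
```

Reviewing the `warn` bucket first, with the `debug` bucket kept for context, matches the severity ordering described in the list above.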