
Ensure python-driver handles zero-token nodes properly #352

Closed · dkropachev opened this issue Aug 4, 2024 · 9 comments

dkropachev (Collaborator) commented Aug 4, 2024

PR #19684 brings the possibility of having coordinator-only nodes (or zero-token nodes).
These types of nodes are going to be supported only with Raft.

Such nodes, despite being registered in the cluster, do not handle any queries and should be excluded from query routing.
This feature is already present in Cassandra but not merged into Scylla yet, so we might want to start testing it on our drivers with Cassandra first.

Difference between the Cassandra and Scylla implementations

The major difference is that these nodes are absent from system.peers and system.peers_v2 in Cassandra, while in the Scylla implementation they are going to be present there.

Because of this, we will need to test the Apache and DataStax drivers against Scylla as well.
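
A quick way to observe this difference from the driver side is to query system.peers directly. A minimal sketch, assuming a cluster reachable at 127.0.0.1 and a zero-token node at 127.0.0.4 (both addresses are placeholders):

```python
from cassandra.cluster import Cluster

cluster = Cluster(contact_points=["127.0.0.1"])
session = cluster.connect()

peers = {str(row.peer) for row in session.execute("SELECT peer FROM system.peers")}
# On Cassandra the zero-token node should be absent from this set;
# on Scylla it is expected to be listed.
print("zero-token node visible in system.peers:", "127.0.0.4" in peers)

cluster.shutdown()
```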

Approximate testing plan

Regular cluster

  1. Spin up a cluster with 3 nodes.
  2. Join one additional node in zero-token mode by setting join_ring to false in its configuration, or by adding -Dcassandra.join_ring=false to the CLI (Cassandra only).
  3. Make sure that the driver works as expected and does not throw any errors while reading the schema with this node in the cluster.
  4. Make sure that the driver works as expected and does not throw any errors while processing topology events (if such events are issued) when the node joins/leaves the cluster.
  5. Make sure that the zero-token node does not participate in routing.
  6. Test that the driver works properly if the only contact point provided is a zero-token node.
  7. Ensure that at no point does the driver throw an error or warning caused by the presence of the zero-token node.
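
A driver-side smoke check for steps 3-7 could look roughly like this (a sketch; the node addresses, including the zero-token node at 127.0.0.4, are placeholders):

```python
from cassandra.cluster import Cluster

cluster = Cluster(contact_points=["127.0.0.1", "127.0.0.2", "127.0.0.3"])
session = cluster.connect()

# Step 3: reading schema/metadata must not raise with the zero-token node present.
print([(h.address, h.datacenter) for h in cluster.metadata.all_hosts()])

# Step 5: the zero-token node must not own any tokens, i.e. it is excluded from routing.
owners = set(cluster.metadata.token_map.token_to_host_owner.values())
assert all(h.address != "127.0.0.4" for h in owners)

# Steps 4 and 7 are verified by the absence of errors and warnings while
# topology changes happen and regular queries keep working.
session.execute("SELECT release_version FROM system.local")

cluster.shutdown()
```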

Cluster that starts with a zero-token node (DROPPED)

  1. Start a single-node cluster with join_ring=false.
  2. Connect to it to make sure that a driver session is created and every query ends up with a no-host-available error.
  3. Populate the cluster with 3 more nodes.
  4. Make sure that the driver can execute queries.
  5. Ensure that at no point does the driver throw an error or warning.
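
A sketch of the expectation in step 2 (the address is a placeholder; this exact behavior is what the scenario was meant to verify):

```python
from cassandra.cluster import Cluster, NoHostAvailable

cluster = Cluster(contact_points=["127.0.0.1"])
session = cluster.connect()  # the session itself should still be created

try:
    session.execute("SELECT release_version FROM system.local")
except NoHostAvailable:
    # Expected while the cluster contains only the zero-token node.
    print("no token-owning hosts available yet")

cluster.shutdown()
```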

Zero-token Datacenter

Repeat this scenario for the following policies:

  1. DCAwareRoundRobinPolicy
  2. TokenAwareHostPolicy(DCAwareRoundRobinPolicy())
  3. TokenAwareHostPolicy(RoundRobinHostPolicy())

For DCAwareRoundRobinPolicy use three variants:

  1. Target the first DC, which has real nodes.
  2. Target the second DC, which has only zero-token nodes.
  3. (For drivers that support it; gocql does not) Do not target any DC and make sure that the policy won't pick a datacenter with no real nodes.

Steps:

  1. Start a cluster of 2 nodes in 1 DC.
  2. Provision 2 more nodes into a 2nd DC in join_ring=false mode.
  3. Connect to the cluster using the policy, making sure that a driver session is created and that every query is scheduled to regular nodes and executed successfully. In cases where the zero-token DC is targeted, queries are supposed to fail with a no-host-available error.
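
A sketch of these policy combinations from the python-driver side, using TokenAwarePolicy / RoundRobinPolicy as the equivalents of the gocql names above. The DC names 'dc1' (real nodes) and 'dc2' (zero-token nodes) and the contact point are assumptions:

```python
from cassandra.cluster import (Cluster, ExecutionProfile,
                               EXEC_PROFILE_DEFAULT, NoHostAvailable)
from cassandra.policies import (DCAwareRoundRobinPolicy, RoundRobinPolicy,
                                TokenAwarePolicy)

POLICIES = {
    "dc-aware, real DC": DCAwareRoundRobinPolicy(local_dc="dc1"),
    "token-aware + dc-aware, zero-token DC":
        TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="dc2")),
    "token-aware + round-robin": TokenAwarePolicy(RoundRobinPolicy()),
}

for name, policy in POLICIES.items():
    profile = ExecutionProfile(load_balancing_policy=policy)
    cluster = Cluster(contact_points=["127.0.0.1"],
                      execution_profiles={EXEC_PROFILE_DEFAULT: profile})
    try:
        session = cluster.connect()
        session.execute("SELECT release_version FROM system.local")
        print(name, "-> query executed on a regular node")
    except NoHostAvailable:
        # Expected only when the zero-token DC is targeted; depending on the
        # driver version the failure may surface at connect() or at execute().
        print(name, "-> no host available, as expected for the zero-token DC")
    finally:
        cluster.shutdown()
```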

Links

Original umbrella issue in scylladb/scylladb repo: scylladb/scylladb#19693
Core issue to bring the join_ring option into Scylla: scylladb/scylladb#6527
PR that brings this feature: scylladb/scylladb#19684

sylwiaszunejko (Collaborator) commented

> (For drivers that support it; gocql does not) Do not target any DC and make sure that the policy won't pick a datacenter with no real nodes.

Not sure how to achieve that for python-driver: local_dc is set during the process of creating the policy (https://github.com/scylladb/python-driver/blob/master/cassandra/policies.py#L245), and AFAIK at that moment we have no information about hosts, so there is no way to validate whether the datacenter has any "real" nodes. We would have to check that during populate, but I am also not sure whether that is a good idea. And what if we pick a datacenter with both standard and zero-token nodes, but then all the standard nodes are removed: should local_dc be changed to another DC that has real nodes? All of that seems a little unpredictable to me. Maybe we should just require the user to specify the datacenter if they use zero-token nodes as contact points? @dkropachev
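
For reference, if an explicit datacenter is required, the user-facing side would be the existing local_dc parameter rather than anything new. A minimal sketch (the DC name 'dc1' and the zero-token contact point address are placeholders):

```python
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# Explicit local_dc avoids inferring the DC from a zero-token contact point.
profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="dc1")))

cluster = Cluster(contact_points=["127.0.0.5"],  # zero-token node as the only contact point
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()
```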

sylwiaszunejko (Collaborator) commented

@dkropachev @Lorak-mmk what is your opinion on the correct behavior?

dkropachev (Collaborator, Author) commented

I think we need to fix ControlConnection._connect_host_in_lbp to connect via the LBP first and, when it is exhausted, connect to Cluster.endpoints_resolved.
We should also stop calling self.profile_manager.populate and self.load_balancing_policy.populate in Cluster._add_resolved_hosts; it looks like we can safely stop calling .populate at all, or move it to ControlConnection._refresh_node_list_and_token_map to be called when the original node_list is empty.
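
A rough sketch of that fallback, not the actual driver code (the _try_connect* helpers are hypothetical placeholders for the existing connection logic):

```python
def _connect_host_in_lbp(self):
    errors = {}
    # First, hosts known to the load balancing policy (under this proposal the
    # raw contact points are never populated into it).
    for host in self._cluster.load_balancing_policy.make_query_plan():
        connection = self._try_connect(host, errors)                # hypothetical helper
        if connection is not None:
            return connection, errors
    # Fallback: the resolved contact points, which may include zero-token nodes.
    for endpoint in self._cluster.endpoints_resolved:
        connection = self._try_connect_endpoint(endpoint, errors)   # hypothetical helper
        if connection is not None:
            return connection, errors
    return None, errors
```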

sylwiaszunejko (Collaborator) commented Dec 12, 2024

> I think we need to fix ControlConnection._connect_host_in_lbp to connect via the LBP first and, when it is exhausted, connect to Cluster.endpoints_resolved. We should also stop calling self.profile_manager.populate and self.load_balancing_policy.populate in Cluster._add_resolved_hosts; it looks like we can safely stop calling .populate at all, or move it to ControlConnection._refresh_node_list_and_token_map to be called when the original node_list is empty.

@dkropachev I don't quite see how all these changes answer the problem of "what to do if local_dc is not specified and the first host among Cluster.contact_points is a zero-token node".

dkropachev (Collaborator, Author) commented

> > I think we need to fix ControlConnection._connect_host_in_lbp to connect via the LBP first and, when it is exhausted, connect to Cluster.endpoints_resolved. We should also stop calling self.profile_manager.populate and self.load_balancing_policy.populate in Cluster._add_resolved_hosts; it looks like we can safely stop calling .populate at all, or move it to ControlConnection._refresh_node_list_and_token_map to be called when the original node_list is empty.
>
> @dkropachev I don't quite see how all these changes answer the problem of "what to do if local_dc is not specified and the first host among Cluster.contact_points is a zero-token node".

Supposedly, if we do so, Cluster.contact_points are not added to the load balancing policy and therefore it never learns local_dc from them.
Only when a node is valid and has tokens is on_up called on it, and the LoadBalancingPolicy gets to learn local_dc.
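
A simplified sketch of the flow described above (not the actual DCAwareRoundRobinPolicy code): the policy starts without local_dc and only learns it from on_up, which under this proposal would be invoked for valid, token-owning hosts only.

```python
class DcLearningPolicySketch:
    def __init__(self, local_dc=None):
        self.local_dc = local_dc
        self.live_hosts = set()

    def on_up(self, host):
        # Under the proposal, on_up is never driven by a raw contact point,
        # so a zero-token node cannot become the source of local_dc.
        if self.local_dc is None and host.datacenter:
            self.local_dc = host.datacenter
        self.live_hosts.add(host)
```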

sylwiaszunejko (Collaborator) commented

> > > I think we need to fix ControlConnection._connect_host_in_lbp to connect via the LBP first and, when it is exhausted, connect to Cluster.endpoints_resolved. We should also stop calling self.profile_manager.populate and self.load_balancing_policy.populate in Cluster._add_resolved_hosts; it looks like we can safely stop calling .populate at all, or move it to ControlConnection._refresh_node_list_and_token_map to be called when the original node_list is empty.
> >
> > @dkropachev I don't quite see how all these changes answer the problem of "what to do if local_dc is not specified and the first host among Cluster.contact_points is a zero-token node".
>
> Supposedly, if we do so, Cluster.contact_points are not added to the load balancing policy and therefore it never learns local_dc from them. Only when a node is valid and has tokens is on_up called on it, and the LoadBalancingPolicy gets to learn local_dc.

I think there is one problem with this approach: it assumes that on_up is not called on a zero-token node when it is a contact point, which is not true, at least in the current implementation. I don't see how we can change that while keeping in mind this requirement:

> Test that the driver works properly if the only contact point provided is a zero-token node.

mykaul commented Feb 17, 2025

@sylwiaszunejko, @dkropachev - what's the status of this issue?

dkropachev (Collaborator, Author) commented

It is done; the remaining part of the problem relates to advanced scenarios and should be addressed separately in https://github.com/scylladb/scylla-drivers/issues/39.

dkropachev (Collaborator, Author) commented

Fixed by #389
