
Ensure python-driver handles zero-token nodes properly #352

Closed · dkropachev opened this issue Aug 4, 2024 · 9 comments

dkropachev (Collaborator) commented Aug 4, 2024

PR #19684 brings the possibility of having coordinator-only nodes (or zero-token nodes).
These types of nodes are going to be supported only with Raft.

Such nodes, despite being registered in the cluster, do not handle any queries and should be excluded from query routing.
This feature is already present in Cassandra but not merged into Scylla yet, so we might want to start testing it on our drivers with Cassandra first.

Difference between the Cassandra and Scylla implementations

The major difference is that these nodes are absent from system.peers and system.peers_v2 in Cassandra, while in the Scylla implementation they are going to be present there.

Because of this, we will need to test the Apache and DataStax drivers against Scylla as well.
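
A quick way to observe this difference from the driver side is to query system.peers directly. A minimal sketch, assuming a cluster reachable at 127.0.0.1 and a zero-token node at 127.0.0.4 (both addresses are placeholders):

```python
from cassandra.cluster import Cluster

cluster = Cluster(contact_points=["127.0.0.1"])
session = cluster.connect()

peers = {str(row.peer) for row in session.execute("SELECT peer FROM system.peers")}
# On Cassandra the zero-token node should be absent from this set;
# on Scylla it is expected to be listed.
print("zero-token node visible in system.peers:", "127.0.0.4" in peers)

cluster.shutdown()
```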

Approximate testing plan

Regular cluster

  1. Spin up a cluster with 3 nodes.
  2. Join one additional node in zero-token mode by setting join_ring to false in its configuration, or by adding -Dcassandra.join_ring=false to the CLI (Cassandra only).
  3. Make sure that the driver works as expected and does not throw any errors while reading the schema with this node in the cluster.
  4. Make sure that the driver works as expected and does not throw any errors while processing topology events (if such events are issued) when the node joins/leaves the cluster.
  5. Make sure that the zero-token node does not participate in routing.
  6. Test that the driver works properly if the only contact point provided is a zero-token node.
  7. Ensure that at no point does the driver throw an error or warning caused by the presence of the zero-token node.
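
A driver-side smoke check for steps 3-7 could look roughly like this (a sketch; the node addresses, including the zero-token node at 127.0.0.4, are placeholders):

```python
from cassandra.cluster import Cluster

cluster = Cluster(contact_points=["127.0.0.1", "127.0.0.2", "127.0.0.3"])
session = cluster.connect()

# Step 3: reading schema/metadata must not raise with the zero-token node present.
print([(h.address, h.datacenter) for h in cluster.metadata.all_hosts()])

# Step 5: the zero-token node must not own any tokens, i.e. it is excluded from routing.
owners = set(cluster.metadata.token_map.token_to_host_owner.values())
assert all(h.address != "127.0.0.4" for h in owners)

# Steps 4 and 7 are verified by the absence of errors and warnings while
# topology changes happen and regular queries keep working.
session.execute("SELECT release_version FROM system.local")

cluster.shutdown()
```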

Cluster that starts with a zero-token node (DROPPED)

  1. Start a single-node cluster with join_ring=false.
  2. Connect to it to make sure that a driver session is created and every query ends up with a no-host-available error.
  3. Populate the cluster with 3 more nodes.
  4. Make sure that the driver can execute queries.
  5. Ensure that at no point does the driver throw an error or warning.
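
A sketch of the expectation in step 2 (the address is a placeholder; this exact behavior is what the scenario was meant to verify):

```python
from cassandra.cluster import Cluster, NoHostAvailable

cluster = Cluster(contact_points=["127.0.0.1"])
session = cluster.connect()  # the session itself should still be created

try:
    session.execute("SELECT release_version FROM system.local")
except NoHostAvailable:
    # Expected while the cluster contains only the zero-token node.
    print("no token-owning hosts available yet")

cluster.shutdown()
```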

Zero-token Datacenter

Repeat this scenario for the following policies:

  1. DCAwareRoundRobinPolicy
  2. TokenAwareHostPolicy(DCAwareRoundRobinPolicy())
  3. TokenAwareHostPolicy(RoundRobinHostPolicy())

For DCAwareRoundRobinPolicy use three variants:

  1. Target the first DC, which has real nodes.
  2. Target the second DC, which has only zero-token nodes.
  3. (For drivers that support it; gocql does not) Do not target any DC and make sure that the policy won't pick a datacenter with no real nodes.

Steps:

  1. Start a cluster of 2 nodes in 1 DC.
  2. Provision 2 more nodes into a 2nd DC in join_ring=false mode.
  3. Connect to the cluster using the policy, making sure that a driver session is created and that every query is scheduled to regular nodes and executed successfully. In cases where the zero-token DC is targeted, queries are supposed to fail with a no-host-available error.
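
A sketch of these policy combinations from the python-driver side, using TokenAwarePolicy / RoundRobinPolicy as the equivalents of the gocql names above. The DC names 'dc1' (real nodes) and 'dc2' (zero-token nodes) and the contact point are assumptions:

```python
from cassandra.cluster import (Cluster, ExecutionProfile,
                               EXEC_PROFILE_DEFAULT, NoHostAvailable)
from cassandra.policies import (DCAwareRoundRobinPolicy, RoundRobinPolicy,
                                TokenAwarePolicy)

POLICIES = {
    "dc-aware, real DC": DCAwareRoundRobinPolicy(local_dc="dc1"),
    "token-aware + dc-aware, zero-token DC":
        TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="dc2")),
    "token-aware + round-robin": TokenAwarePolicy(RoundRobinPolicy()),
}

for name, policy in POLICIES.items():
    profile = ExecutionProfile(load_balancing_policy=policy)
    cluster = Cluster(contact_points=["127.0.0.1"],
                      execution_profiles={EXEC_PROFILE_DEFAULT: profile})
    try:
        session = cluster.connect()
        session.execute("SELECT release_version FROM system.local")
        print(name, "-> query executed on a regular node")
    except NoHostAvailable:
        # Expected only when the zero-token DC is targeted; depending on the
        # driver version the failure may surface at connect() or at execute().
        print(name, "-> no host available, as expected for the zero-token DC")
    finally:
        cluster.shutdown()
```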

Links

Original umbrella issue in scylladb/scylladb repo: scylladb/scylladb#19693
Core issue to bring the join_ring option into Scylla: scylladb/scylladb#6527
PR that brings this feature: scylladb/scylladb#19684

sylwiaszunejko (Collaborator) commented

> (For drivers that support it; gocql does not) Do not target any DC and make sure that the policy won't pick a datacenter with no real nodes.

Not sure how to achieve that for python-driver: local_dc is set during the process of creating the policy (https://github.com/scylladb/python-driver/blob/master/cassandra/policies.py#L245), and AFAIK at that moment we have no information about hosts, so there is no way to validate whether the datacenter has any "real" nodes. We would have to check that during populate, but I am also not sure whether that is a good idea. And what if we pick a datacenter with both standard and zero-token nodes, but then all the standard nodes are removed: should local_dc be changed to another DC that has real nodes? All of that seems a little unpredictable to me. Maybe we should just require the user to specify the datacenter if they use zero-token nodes as contact points? @dkropachev
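
For reference, if an explicit datacenter is required, the user-facing side would be the existing local_dc parameter rather than anything new. A minimal sketch (the DC name 'dc1' and the zero-token contact point address are placeholders):

```python
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# Explicit local_dc avoids inferring the DC from a zero-token contact point.
profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="dc1")))

cluster = Cluster(contact_points=["127.0.0.5"],  # zero-token node as the only contact point
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()
```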

sylwiaszunejko (Collaborator) commented

@dkropachev @Lorak-mmk what is your opinion on the correct behavior?

dkropachev (Collaborator, Author) commented

I think we need to fix ControlConnection._connect_host_in_lbp to connect via the LBP first and, when it is exhausted, connect to Cluster.endpoints_resolved.
We should also stop calling self.profile_manager.populate and self.load_balancing_policy.populate in Cluster._add_resolved_hosts; it looks like we can safely stop calling .populate at all, or move it to ControlConnection._refresh_node_list_and_token_map to be called when the original node_list is empty.
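
A rough sketch of that fallback, not the actual driver code (the _try_connect* helpers are hypothetical placeholders for the existing connection logic):

```python
def _connect_host_in_lbp(self):
    errors = {}
    # First, hosts known to the load balancing policy (under this proposal the
    # raw contact points are never populated into it).
    for host in self._cluster.load_balancing_policy.make_query_plan():
        connection = self._try_connect(host, errors)                # hypothetical helper
        if connection is not None:
            return connection, errors
    # Fallback: the resolved contact points, which may include zero-token nodes.
    for endpoint in self._cluster.endpoints_resolved:
        connection = self._try_connect_endpoint(endpoint, errors)   # hypothetical helper
        if connection is not None:
            return connection, errors
    return None, errors
```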

sylwiaszunejko (Collaborator) commented Dec 12, 2024

> I think we need to fix ControlConnection._connect_host_in_lbp to connect via the LBP first and, when it is exhausted, connect to Cluster.endpoints_resolved. We should also stop calling self.profile_manager.populate and self.load_balancing_policy.populate in Cluster._add_resolved_hosts; it looks like we can safely stop calling .populate at all, or move it to ControlConnection._refresh_node_list_and_token_map to be called when the original node_list is empty.

@dkropachev I don't quite see how all these changes answer the problem of "what to do if local_dc is not specified and the first host among Cluster.contact_points is a zero-token node".

dkropachev (Collaborator, Author) commented

> > I think we need to fix ControlConnection._connect_host_in_lbp to connect via the LBP first and, when it is exhausted, connect to Cluster.endpoints_resolved. We should also stop calling self.profile_manager.populate and self.load_balancing_policy.populate in Cluster._add_resolved_hosts; it looks like we can safely stop calling .populate at all, or move it to ControlConnection._refresh_node_list_and_token_map to be called when the original node_list is empty.
>
> @dkropachev I don't quite see how all these changes answer the problem of "what to do if local_dc is not specified and the first host among Cluster.contact_points is a zero-token node".

Supposedly, if we do so, Cluster.contact_points are not added to the load balancing policy and therefore it never learns local_dc from them.
Only when a node is valid and has tokens is on_up called on it, and the LoadBalancingPolicy gets to learn local_dc.
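
A simplified sketch of the flow described above (not the actual DCAwareRoundRobinPolicy code): the policy starts without local_dc and only learns it from on_up, which under this proposal would be invoked for valid, token-owning hosts only.

```python
class DcLearningPolicySketch:
    def __init__(self, local_dc=None):
        self.local_dc = local_dc
        self.live_hosts = set()

    def on_up(self, host):
        # Under the proposal, on_up is never driven by a raw contact point,
        # so a zero-token node cannot become the source of local_dc.
        if self.local_dc is None and host.datacenter:
            self.local_dc = host.datacenter
        self.live_hosts.add(host)
```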

sylwiaszunejko (Collaborator) commented

> > > I think we need to fix ControlConnection._connect_host_in_lbp to connect via the LBP first and, when it is exhausted, connect to Cluster.endpoints_resolved. We should also stop calling self.profile_manager.populate and self.load_balancing_policy.populate in Cluster._add_resolved_hosts; it looks like we can safely stop calling .populate at all, or move it to ControlConnection._refresh_node_list_and_token_map to be called when the original node_list is empty.
> >
> > @dkropachev I don't quite see how all these changes answer the problem of "what to do if local_dc is not specified and the first host among Cluster.contact_points is a zero-token node".
>
> Supposedly, if we do so, Cluster.contact_points are not added to the load balancing policy and therefore it never learns local_dc from them. Only when a node is valid and has tokens is on_up called on it, and the LoadBalancingPolicy gets to learn local_dc.

I think there is one problem with this approach: it assumes that on_up is not called on a zero-token node when it is a contact point, which is not true, at least in the current implementation. I don't see how we can change that while keeping in mind this requirement:

> Test that the driver works properly if the only contact point provided is a zero-token node.

mykaul commented Feb 17, 2025

@sylwiaszunejko, @dkropachev - what's the status of this issue?

dkropachev (Collaborator, Author) commented

It is done; the remaining part of the problem relates to advanced scenarios and should be addressed separately in https://github.com/scylladb/scylla-drivers/issues/39.

dkropachev (Collaborator, Author) commented

Fixed by #389
