@cjqzhao (Contributor) commented Aug 27, 2025

Add round robin load balancing policy

@dfawley @easwars

Commits: delete; round robin; store changes; save changes; saving; Save changes again lol; Save changes; add round robin; save changes; some changes; delete; store changes; save changes; saving; Save changes again lol; Save changes; add round robin

@easwars (Contributor) left a comment:
Haven't looked at the tests yet.

@easwars (Contributor) left a comment:

Still have to look at a few more tests.

let lb_policy = lb_policy.as_mut();
let tcc = tcc.as_mut();

let endpoints = create_n_endpoints_with_k_addresses(2, 3);

Contributor:

Given that the whole point of this test is to ensure that the policy moves to TF on receiving empty addresses, you can simplify this test a little by only sending one endpoint with one address initially, instead of 2 endpoints with 3 addresses each.
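
For instance, a minimal version of the setup (reusing the create_n_endpoints_with_k_addresses helper shown above) would be:

let endpoints = create_n_endpoints_with_k_addresses(1, 1);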

Contributor Author:

Ok! Done

Contributor:

Resolved.

Comment on lines 1112 to 1116
let subchannels = verify_subchannel_creation_from_policy(&mut rx_events, 1).await;

let second_subchannels = verify_subchannel_creation_from_policy(&mut rx_events, 1).await;
let mut all_subchannels = subchannels.clone();
all_subchannels.extend(second_subchannels.clone());

Contributor:

Could this be replaced with:

let subchannels = verify_subchannel_creation_from_policy(&mut rx_events, 2).await

Contributor Author:

Done

Contributor Author:

Done

Contributor:

Resolved.

Comment on lines +1167 to +1184
let new_picker = verify_roundrobin_ready_picker_from_policy(&mut rx_events).await;

let req = test_utils::new_request();
let mut picked = Vec::new();
for _ in 0..4 {
    match new_picker.pick(&req) {
        PickResult::Pick(pick) => {
            println!("picked subchannel is {}", pick.subchannel);
            picked.push(pick.subchannel.clone())
        }
        other => panic!("unexpected pick result {}", other),
    }
}

assert_eq!(&picked[0], &picked[2]);
assert_eq!(&picked[1], &picked[3]);
assert!(picked.contains(&subchannels[0]));
assert!(!picked.contains(&subchannel_being_removed));

Contributor:

Can you use the other helper function here instead: verify_ready_picker_from_policy since you expect the pick to return the single Ready subchannel all the time.

Contributor Author:

This picker would have two subchannels, so it might be easier to do this and make sure that it alternates.

Contributor:

Resolved.

// should not be a part of its picks anymore and should be removed. It should
// then round-robin across the endpoints it still has and the new one.
#[tokio::test]
async fn roundrobin_pick_after_resolved_updated_hosts() {

Contributor:

IMO, the variable names are making the test a little harder to follow. Especially when you call something removed_xxx and then go ahead and add it :)

Instead, if you just name them subchannel_1, subchannel_2, and subchannel_3 and state in the comments that we start off with subchannels 1 & 2, and then get an update from the resolver that removes subchannel 1 but adds subchannel 3, I think that would make it easier to read.

Contributor Author:

Done

Contributor Author:

Done

Contributor Author:

Done

Contributor:

Resolved.

Comment on lines 1362 to 1363
let subchannels = verify_subchannel_creation_from_policy(&mut rx_events, 1).await;
let second_subchannels = verify_subchannel_creation_from_policy(&mut rx_events, 1).await;

Contributor:

Here as well, can't we just wait for two subchannels and have them in a single vector, instead of two different variables, which are again vectors?
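
i.e., presumably something like (same helper as above, just requesting two subchannels in one call):

let subchannels = verify_subchannel_creation_from_policy(&mut rx_events, 2).await;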

Contributor Author:

Done

Contributor:

Resolved.

let lb_policy = lb_policy.as_mut();
let tcc = tcc.as_mut();

let endpoint = create_endpoint_with_n_addresses(2);

Contributor:

Just have one address.

Contributor Author:

Done

Contributor:

Resolved.

// once the connection succeeds, move to READY state with a picker that returns
// that subchannel.
#[tokio::test]
async fn roundrobin_with_one_backend() {

Contributor:

This seems to be very similar to roundrobin_simple_test where you create a single endpoint with two addresses, but here a single endpoint with a single address. I think they both are testing the same scenario. Correct me if I'm wrong.

Contributor Author:

Yes, deleted.

Contributor:

Resolved.

// to connect to them in order, until a connection succeeds, at which point it
// should move to READY state with a picker that returns that subchannel.
#[tokio::test]
async fn roundrobin_with_multiple_backends_first_backend_is_ready() {

Contributor:

Same with this test. I don't see what exactly this is testing that is different from the above test and the simple test. Also, we are not actually verifying that the LB policy is attempting to connect. Given that starting connections is pick_first functionality, we ideally shouldn't be testing that here (and if we do want to test it, we should use the real pick_first; even then, an e2e test would be better).

So, let me know what scenario this one is testing that is different from the above-mentioned tests. Thanks.

Contributor Author:

Connecting isn't tested here. I can combine all of these into one test to make things simpler.

Contributor Author (@cjqzhao, Sep 2, 2025):

I ended up keeping only roundrobin_with_multiple_backends_first_backend_is_ready.

Contributor:

Resolved.

send_resolver_update_to_policy(lb_policy, vec![endpoints.clone()], tcc);
let subchannels = verify_subchannel_creation_from_policy(&mut rx_events, 4).await;
lb_policy.subchannel_update(subchannels[0].clone(), &SubchannelState::default(), tcc);
verify_connecting_picker_from_policy(&mut rx_events).await;

Contributor:

It is interesting that we have to move to Connecting here given that the update contains the previously connected address.

Contributor Author:

A lot of this is based on Pick First functionality, which I believe isn't included in my stub balancer.

Contributor:

Add a TODO that we wouldn't expect this state transition once we start using the actual pick_first LB policy?
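
For example, something roughly like this (illustrative wording only):

// TODO: Remove this expectation once the real pick_first child policy is used;
// with it, an update that still contains the previously connected address
// should not push the channel back to CONNECTING.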

self.last_resolver_error =
    Some("received no endpoints from the name resolver".to_string());
self.move_to_transient_failure(channel_controller);
return Err("received no endpoints from the name resolver".into());

Collaborator:

We need to call into the child manager at some point here. Otherwise, if we had endpoints before, they'll stick around.

I was expecting RR's methods to all immediately call into child_manager, and then update the picker if the child_manager reported any changes.
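
A rough sketch of that shape, using method names that appear elsewhere in this PR (the ResolverUpdate type and exact signatures are assumptions here):

fn resolver_update(
    &mut self,
    update: ResolverUpdate,
    config: Option<&LbConfig>,
    channel_controller: &mut dyn ChannelController,
) -> Result<(), Box<dyn Error + Send + Sync>> {
    // Always forward to the child manager first, even for an empty endpoint
    // list, so that previously created children are torn down.
    let result = self
        .child_manager
        .resolver_update(update, config, channel_controller);
    // Then react to whatever changes the children reported.
    if self.child_manager.has_updated() {
        self.move_children_from_idle(channel_controller);
        self.send_aggregate_picker(channel_controller);
    }
    result
}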

Contributor Author:

I don't think that the Child Manager handles error cases like this.

Collaborator:

Having zero children shouldn't be an error case, should it? Is something breaking if you do this?


fn move_children_from_idle(&mut self, channel_controller: &mut dyn ChannelController) {
    let mut should_exit_idle = false;
    for (_, state) in self.child_manager.child_states() {

Collaborator:

Use child_states().any() instead?

Or skip it, always call exit_idle, and put any filtering in child_manager instead; that's fine too.
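
The first option might look roughly like this (a sketch; it assumes child_states() returns an iterator of (identifier, state) pairs and that the state exposes a connectivity_state field), with the rest of the function unchanged:

let should_exit_idle = self
    .child_manager
    .child_states()
    .any(|(_, state)| state.connectivity_state == ConnectivityState::Idle);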

Contributor Author:

Putting it in child_manager makes more sense! I will do that.

Contributor Author:

I added filtering logic in the child manager: it calls exit_idle on all children if at least one child is Idle. Let me know if you think it's better to call exit_idle only on the idle children.

let result = self
    .child_manager
    .resolver_update(update, config, channel_controller);
self.move_children_from_idle(channel_controller);

Collaborator:

This should be done inside the has_updated() check (everywhere). If children haven't updated, there's no way their states could have changed to become idle.

Contributor Author:

That makes sense. Will fix!

Comment on lines 218 to 221
self.move_children_from_idle(channel_controller);
if self.child_manager.has_updated() {
    self.send_aggregate_picker(channel_controller);
}

Collaborator:

Actually, since you follow this exact pattern many times, it would be better to put it in a function:

fn resolve_child_updates(&mut self, channel_controller: &mut dyn ChannelController) {
  if !self.child_manager.has_updated() {
    return;
  }
  self.move_from_idle(channel_controller);
  self.send_aggregate_picker(channel_controller);
  // or just do the above things inline; that's probably fine too...
}

Contributor Author:

Done

@@ -234,7 +244,7 @@ impl LbPolicyBuilder for StubPolicyBuilder {
         &self,
         _config: &ParsedJsonLbConfig,
     ) -> Result<Option<LbConfig>, Box<dyn Error + Send + Sync>> {
-        todo!("Implement parse_config in StubPolicyBuilder")
+        todo!("Implement parse_config in StubPolicy");

Contributor:

Hmm ... do we need this change, given that parse_config is a method on the builder?

@dfawley (Collaborator) left a comment:

It seems I wrote some comments a while back and didn't send them. I'm pretty sure I didn't get all the way through, sorry, but I'll send these for now anyway.

@@ -386,6 +376,14 @@ impl<T: PartialEq + Hash + Eq + Send + Sync + 'static> LbPolicy for ChildManager
    }

    fn exit_idle(&mut self, channel_controller: &mut dyn ChannelController) {
        let has_idle = self

Collaborator:

I'm pretty sure this pre-processing step is actually going to be less performant than simply removing it.

You can just loop through the children and exit_idle any of the ones whose state is currently idle.

(With this in place you will have to loop over all elements of the vector twice, potentially, and without it you will do exactly one full pass no matter what.)
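
A sketch of the single-pass version (the children_mut() accessor and field names here are hypothetical; ChildManager's internals may expose this differently):

fn exit_idle(&mut self, channel_controller: &mut dyn ChannelController) {
    // One pass over the children: forward exit_idle only to those currently idle.
    for child in self.children_mut() {
        if child.state.connectivity_state == ConnectivityState::Idle {
            child.policy.exit_idle(channel_controller);
        }
    }
}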

@@ -571,20 +558,20 @@ mod test {
     // Defines the functions resolver_update and subchannel_update to test
     // aggregate_states.
     fn create_verifying_funcs_for_aggregate_tests() -> StubPolicyFuncs {
-        let data = StubPolicyData::default();
+        let _data = StubPolicyData::default();

Collaborator:

Can you delete this since it seems to be unused?
