This repository has been archived by the owner on Feb 18, 2021. It is now read-only.

round robin peer selection #309

Open
Raynos opened this issue Oct 1, 2016 · 2 comments

Comments

@Raynos
Contributor

Raynos commented Oct 1, 2016

From a flame graph I've observed that some workers / services are really struggling with peer selection

[image: flame graph]

We could implement a random peer selection strategy and add a flipr that lets us change the peer selection strategy per serviceName.

We already have boolean logic to enable / disable the peer heap per serviceName.

Round robin peer selection would reduce CPU utilization, at the cost of slightly degraded load balancing due to increased variance.

If round robin turns out to be involved to implement, we can also just implement random peer selection.
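
For illustration, here is a minimal TypeScript sketch of the two cheap strategies plus a per-serviceName override. Every name in it (`Peer`, `PeerSelector`, `RoundRobinSelector`, `RandomSelector`, `strategyFor`) is hypothetical and not taken from the codebase, and the `overrides` object only stands in for whatever flipr lookup would actually be used.

```typescript
// Hypothetical shapes for illustration; the real peer objects are not shown in this thread.
interface Peer {
    hostPort: string;
}

interface PeerSelector {
    choosePeer(peers: Peer[]): Peer | null;
}

// Round robin: keeps a single cursor and walks the peer list in order.
class RoundRobinSelector implements PeerSelector {
    private next = 0;

    choosePeer(peers: Peer[]): Peer | null {
        if (peers.length === 0) {
            return null;
        }
        // The modulo keeps the cursor valid if the peer list shrank since the last call.
        const peer = peers[this.next % peers.length];
        this.next = (this.next + 1) % peers.length;
        return peer;
    }
}

// Random: stateless, picks a uniformly random index.
class RandomSelector implements PeerSelector {
    private random: () => number;

    // The random source is injectable so the selector can be tested deterministically.
    constructor(random: () => number = Math.random) {
        this.random = random;
    }

    choosePeer(peers: Peer[]): Peer | null {
        if (peers.length === 0) {
            return null;
        }
        return peers[Math.floor(this.random() * peers.length)];
    }
}

type StrategyName = "heap" | "roundRobin" | "random";

// Assumed stand-in for the per-serviceName flag: services without an override keep the heap.
function strategyFor(overrides: Record<string, StrategyName>, serviceName: string): StrategyName {
    if (Object.prototype.hasOwnProperty.call(overrides, serviceName)) {
        return overrides[serviceName];
    }
    return "heap";
}
```

Both selectors do constant work per call and ignore how loaded each peer is, which is where the extra variance in load balancing comes from.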

@Raynos
Contributor Author

Raynos commented Oct 1, 2016

Figuring out which serviceNames to enable with this "degraded load balancing" strategy is going to be involved.

One strategy is to:

  • look at the top 10 workers by average CPU
  • flamegraph them to see whether choosePeer() dominates the CPU
  • identify which serviceNames dominate each of those workers
  • turn on "dumb load balancing" for those serviceNames
  • repeat

In theory, there should only be ~10 serviceNames that have both "high QPS" and "high number of peers", which is the combination that causes choosePeer() to dominate the flamegraph.
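
As a rough sketch, the "high QPS and high number of peers" criterion could be used to shortlist serviceNames before flamegraphing. The `ServiceStats` shape, the function name, and the thresholds are all assumptions made for illustration; where the QPS and peer counts would come from is not specified here.

```typescript
// Hypothetical per-service summary; field names are assumptions.
interface ServiceStats {
    serviceName: string;
    qps: number;
    peerCount: number;
}

// Returns up to `limit` serviceNames that exceed both thresholds, busiest first.
function candidatesForSimpleBalancing(
    stats: ServiceStats[],
    minQps: number,
    minPeers: number,
    limit: number
): string[] {
    return stats
        .filter((s) => s.qps >= minQps && s.peerCount >= minPeers) // both conditions must hold
        .sort((a, b) => b.qps - a.qps) // filter() returned a copy, so the input is not mutated
        .slice(0, limit)
        .map((s) => s.serviceName);
}
```

A shortlist like this only narrows down where to look; the flamegraph is still what confirms that choosePeer() is the actual cost.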

@rf
Contributor

rf commented Oct 3, 2016

we could also add a timing stat that's cluster-wide and only tagged by service name
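
something like this, as a minimal sketch. `TimingSink`, `timeChoosePeer`, and the stat name are made up for illustration and are not an existing stats API; the clock is passed in so any high-resolution timer can be used.

```typescript
// Hypothetical stats interface; the real stats client is not shown in this thread.
interface TimingSink {
    timing(name: string, durationMs: number, tags: { [key: string]: string }): void;
}

// Times one peer selection call and emits a stat tagged only by serviceName
// (no host or worker tag), so the series aggregates across the cluster.
function timeChoosePeer<T>(
    sink: TimingSink,
    now: () => number,
    serviceName: string,
    choose: () => T
): T {
    const start = now();
    try {
        return choose();
    } finally {
        // Runs whether choose() returned or threw, so failures are timed too.
        sink.timing("peer-selection.choose-peer", now() - start, { serviceName: serviceName });
    }
}
```

that would show which service names spend the most time in peer selection without flamegraphing individual workers.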
