feat: horizontal scalability #759

Closed
benthomasson wants to merge 10 commits from horizontal_scaling


benthomasson
Collaborator

@benthomasson benthomasson commented Mar 19, 2024

This PR allows for running rulebooks on multiple worker nodes.

It does this by tagging each rulebook process with the worker that started it. A monitor process then runs periodically and updates only the rulebook processes that it has local access to.

This PR works with any number of worker nodes, and worker nodes can be added and removed dynamically.
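
The tagging and filtering described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `RulebookProcess`, `processes_for_worker`, `monitor_once`, and the `probe` callable are hypothetical names, and a plain list stands in for the database.

```python
from dataclasses import dataclass


@dataclass
class RulebookProcess:
    """Hypothetical stand-in for a rulebook process record."""

    id: int
    worker: str  # identifier of the worker that started the process
    status: str = "running"


def processes_for_worker(processes, worker):
    """Return only the processes tagged with the given worker."""
    return [p for p in processes if p.worker == worker]


def monitor_once(processes, worker, probe):
    """Refresh the status of locally owned processes and skip the rest.

    `probe` is a hypothetical callable that inspects a local process and
    returns its current status string.
    """
    updated = []
    for process in processes_for_worker(processes, worker):
        process.status = probe(process)
        updated.append(process.id)
    return updated


processes = [
    RulebookProcess(id=1, worker="node-a"),
    RulebookProcess(id=2, worker="node-b"),
    RulebookProcess(id=3, worker="node-a"),
]
# The monitor on node-a touches processes 1 and 3 and leaves 2 alone.
updated_ids = monitor_once(processes, "node-a", probe=lambda p: "completed")
```

Because each monitor filters on its own tag, adding or removing a worker does not change which processes the other monitors update.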

I am still working on the scenario where a worker node is removed while a rulebook process is running.

@benthomasson benthomasson requested a review from a team as a code owner March 19, 2024 12:52
@benthomasson benthomasson marked this pull request as draft March 19, 2024 12:52
@benthomasson benthomasson force-pushed the horizontal_scaling branch 3 times, most recently from e8eebe3 to 0b00da2 Compare March 19, 2024 13:03
@@ -1046,6 +1047,7 @@ def _create_activation_instance(self):
            self._set_activation_status(ActivationStatus.PENDING, msg)
            raise exceptions.MaxRunningProcessesError
        args = {
            "worker": os.environ["HOSTNAME"],
Contributor

@mkanoor mkanoor Mar 19, 2024

@benthomasson If there are 5 activation workers running on a node, does this specify a single worker or all workers on that node? Is this a node/host name rather than a worker name?

Collaborator Author

This could be any value. I am just using hostname right now.

Collaborator

There will be multiple workers within a node. This is our current setup and we want it for concurrency and redundancy.

Collaborator Author

This could be a node name, worker name, or worker group name.
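
Since the identifier can be any value, one way to choose it is sketched below. Note that `os.environ["HOSTNAME"]` raises `KeyError` when the variable is not set in the process environment, so this sketch falls back to `socket.gethostname()`. The `EDA_WORKER_ID` variable and the `resolve_worker_id` helper are hypothetical and are not part of this PR.

```python
import os
import socket


def resolve_worker_id(environ=None):
    """Pick a worker identifier without raising when HOSTNAME is unset.

    EDA_WORKER_ID is a hypothetical override that would allow a worker
    name or worker group name; it is not a setting defined by this PR.
    """
    env = os.environ if environ is None else environ
    return (
        env.get("EDA_WORKER_ID")
        or env.get("HOSTNAME")
        or socket.gethostname()
    )


# An explicit override wins over the hostname.
group_id = resolve_worker_id({"EDA_WORKER_ID": "group-1", "HOSTNAME": "node-a"})
# Without an override, the exported HOSTNAME is used.
host_id = resolve_worker_id({"HOSTNAME": "node-a"})
```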

Collaborator Author

Multiple workers on a node would work fine. They would monitor the rulebooks that they started.

Collaborator Author

If a worker is lost, we need to hand off monitoring to another worker, and that is not yet done.
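
One possible shape for such a handoff is a heartbeat timeout, sketched below. This is only an illustration of the idea and not something this PR implements: `reassign_orphans`, the `owners` and `heartbeats` dictionaries, and the timeout value are all assumptions.

```python
def reassign_orphans(owners, heartbeats, now, timeout, live_worker):
    """Hand off processes whose owner has not sent a heartbeat in time.

    owners: dict mapping process id -> worker identifier (updated in place)
    heartbeats: dict mapping worker identifier -> last heartbeat, in seconds
    Returns the sorted ids of the processes that were handed off.
    """
    moved = []
    for process_id, worker in list(owners.items()):
        last_seen = heartbeats.get(worker)
        # A worker with no heartbeat at all is treated as lost.
        if last_seen is None or now - last_seen > timeout:
            owners[process_id] = live_worker
            moved.append(process_id)
    return sorted(moved)


owners = {1: "node-a", 2: "node-b", 3: "node-a"}
heartbeats = {"node-a": 100.0, "node-b": 158.0}
# node-a was last seen 60 seconds ago, which exceeds the 30 second timeout.
moved = reassign_orphans(
    owners, heartbeats, now=160.0, timeout=30.0, live_worker="node-b"
)
```

This only moves the bookkeeping. A process that ran on the lost node may not be reachable from the new owner, so the new owner would probably also need to decide whether to restart it.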

@@ -317,10 +317,6 @@ def _get_secret_key() -> str:

RQ_STARTUP_JOBS = []
RQ_PERIODIC_JOBS = [
    {
Contributor

@benthomasson If this gets removed, how would the periodic updates happen? The activations run as detached processes and have no affinity to a worker; they have affinity to the node where they are running. Any worker on that node can get the logs and status from an activation and take actions such as stop or restart.

Collaborator Author

@benthomasson benthomasson Mar 19, 2024

There are actions that are not yet implemented in this PR.

Collaborator Author

Monitor processes run instead of the RQ workers on the nodes. A monitor process can monitor any number of activations.
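
As a rough illustration of a monitor process that polls on an interval instead of relying on a periodic RQ job, consider the sketch below. `run_monitor`, its parameters, and the `check` callable are hypothetical; the actual monitor in this PR may be structured differently.

```python
import threading


def run_monitor(check, interval, stop, max_cycles=None):
    """Call `check()` every `interval` seconds until `stop` is set.

    `check` would update every activation this monitor owns, so a single
    loop can cover any number of activations. `max_cycles` bounds the
    loop, which is mainly useful in tests.
    Returns the number of completed cycles.
    """
    cycles = 0
    while not stop.is_set():
        check()
        cycles += 1
        if max_cycles is not None and cycles >= max_cycles:
            break
        # Event.wait doubles as a sleep that ends early on shutdown.
        stop.wait(interval)
    return cycles


seen = []
stop = threading.Event()
completed = run_monitor(
    lambda: seen.append("tick"), interval=0.01, stop=stop, max_cycles=3
)
```

Setting the event from a signal handler or another thread ends the loop without waiting for the full interval.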

@mkanoor mkanoor mentioned this pull request Mar 19, 2024
@benthomasson
Collaborator Author

Closed in favor of #701

3 participants