Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Worker Slots interface and Resource Based Autotuner First Cut #719

Merged
merged 30 commits into from
May 13, 2024

Conversation

Sushisource
Copy link
Member

@Sushisource Sushisource commented Apr 16, 2024

Implements the first portions of the slot management proposal and an initial take on resource-based auto-tuning.

Here the auto-tuning is done based on PID controllers which target specified CPU and RAM usage levels. This will very likely need more tweaking as we proceed and do more testing, but initial tests are quite promising for your typical sorts of relatively-low-resource-usage activities.

I have already successfully made use of the resource-based tuner in Typescript without much challenge. The lang layer needs little adjustment to use it. Passing through the callbacks to allow users to implement the interface with their own implementation will be more involved, and we've decided to do that after shipping the auto-tuner and letting users try that out.

@Sushisource Sushisource force-pushed the resource-slots-poc branch 2 times, most recently from d9821b2 to add670f Compare April 30, 2024 22:15
@Sushisource Sushisource changed the title [DRAFT] Resource slots POC Worker Slots interface and Resource Based Autotuner First Cut May 1, 2024
@Sushisource Sushisource force-pushed the resource-slots-poc branch from 341d3c3 to eb23321 Compare May 1, 2024 02:19
@Sushisource Sushisource marked this pull request as ready for review May 1, 2024 02:23
@Sushisource Sushisource requested a review from a team as a code owner May 1, 2024 02:23
Comment on lines 128 to 129
// TODO: Real release reason
supp_c_c.release_slot(SlotReleaseReason::TaskComplete);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This turns out to be... quite annoying to deal with. Since permits simply being droped causes them to be released, it's incredibly annoying to find every place that might ever happen and attach a reason.

My inclination is to simply not bother with this for now, possibly never, unless users ask for it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may not be understanding, but can you make permit release a more explicit step where it occurs instead of relying on drop?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can, but it's worse from a safety perspective (could still use it as a backup though, with some default reason), and it can potentially happen in many, many places. It's just a lot of gruntwork for something I don't know that users will actually care about yet. Hence my preference to just defer it for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Can we just get rid of reason for now then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that was what I was going to do based on discussion

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we remove this from Java then?

Copy link
Member Author

@Sushisource Sushisource May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah probably for consistencies' sake (I'll ask specifically for feedback on this one)

Copy link
Member

@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM, though I am only judging from the pub API and behavior POV (with the knowledge that this is not truly public to users, so langs can clarify, not use abbreviations, etc).

Comment on lines 36 to 46
/// Set a [SlotSupplier] for workflow tasks.
#[builder(setter(into = false))]
pub workflow_task_slot_supplier:
Arc<dyn SlotSupplier<SlotKind = WorkflowSlotKind> + Send + Sync>,
/// Set a [SlotSupplier] for activity tasks.
#[builder(setter(into = false))]
pub activity_task_slot_supplier:
Arc<dyn SlotSupplier<SlotKind = ActivitySlotKind> + Send + Sync>,
/// Set a [SlotSupplier] for local activity tasks.
#[builder(setter(into = false))]
pub local_activity_task_slot_supplier:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think, as I did on the Java SDK, that these should be one worker tuner object that a user can provide (even if the thing is just these three). It is better for callers. Granted this is more of a comment about lang than core.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are tradeoffs with both. I did encounter a couple patterns where having just one would have made things slightly easier - but there are just as many situations where having them separate was really useful. Ultimately I think this is just a matter of taste.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It matters from a caller POV too. I suspect many callers will configure workers with tuners created by someone else (same with interceptors). Setting a single resource-based tuner is easier than three resource-based slot suppliers (same with interceptors). This also makes instance reuse across workers easier since it's a single instance shared instead of three instances shared (same with interceptors).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm gonna try it out before merging it just to see if it's much simpler. If I like the end result we'll go that way but if it ends up uglier I'll keep the three options.

My main concern is what happens when you want to use different kinds of suppliers for different types. EX: you want fixed size for workflow and resource based for activity. Obviously you can just stuff those inside your own implementation, but in the (I would guess fairly common) case that you want to do exactly that, you now have to make your own implementation where before you could just set ours - or we have to provide some combinators.

Copy link
Member

@cretz cretz May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 FWIW your main concern also applies to interceptors and data converters in many langs (interceptors are just combinations of client, activity, and worker interceptors, and data converters are just combinations of payload codecs, payload converters, and failure converters). So there is a precedent for this type of combining. And many languages already have these combinators assuming you provide a simple 3-item struct instead of just the interface (e.g. with keyword in C# or dataclasses.replace in Python or .. in Rust or ... in JS). But I think we should optimize for what I think will be the common use case of users wanting to just provide a "tuner".

core-api/src/worker.rs Outdated Show resolved Hide resolved
Comment on lines 248 to 250
/// Core will call this at worker initialization time, allowing the implementation to hook up to
/// metrics if any are configured. If not, it will not be called.
fn attach_metrics(&self, metrics: TemporalMeter);
Copy link
Member

@cretz cretz May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrmm. I know everywhere else implementers want to use metrics, they access via runtime (e.g. .telemetry().get_metric_meter()). Do we need this here? Can't the user wire up metrics themselves? Is there a general purpose initialization method that may be needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least the way TS is structured it's not really possible, and is likely to be annoying in other langs. The problem is you need to have initialized the runtime before you construct worker options to be able to provide the telemetry instance when constructing your slot supplier - but since the runtime is often constructed implicitly when a user initializes a worker, the worker config is already created before the runtime.

Copy link
Member

@cretz cretz May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but since the runtime is often constructed implicitly when a user initializes a worker

Hrmm, is this true? It seems from the TS bridge code I see that the client already has the runtime before worker is created. I would expect it can provide the metric meter to the slot supplier instances it creates at worker creation time.

For TS-side implementations, when they get around to temporalio/sdk-typescript#1229, should provide an accessor to the metric meter off of the runtime object (and if that object is created implicitly, so be it).

Copy link
Member Author

@Sushisource Sushisource May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah fair point, it's not impossible, I actually had it kinda like this originally but it just made for a bit of a mess in TS's bridge because of type erasure stuff, where you only had access to the erased slot supplier, so the trait needed this way to attach the metrics without changing things such that the slot suppliers are passed separately from the worker config which would've been annoying.

The nice thing about this is that lang doesn't need to do anything, this is all handled inside Core and there's going to have to be some kind of LangWrapper implementation of SlotSupplier anyway, and lang can provide the attached meter to the user's implementation in a context object or however else it feels like

Copy link
Member

@cretz cretz May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lang can provide the attached meter to the user's implementation in a context object or however else it feels like

I think lang may want to ask users to do this themselves and not provide them anything in this abstraction specifically. At least in .NET and Python, any user that wants to emit metrics from their custom tuner implementation can of course use their own metrics impl, or use ours the same way they'd do it anywhere else outside of workflows/activities (using the runtime metric meter).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, they can still do that just fine, as well as we can provide it to them for access whenever they like outside of that. For now, within core, this makes the most sense and doesn't have any impact on lang.

core-api/src/worker.rs Outdated Show resolved Hide resolved
Comment on lines 128 to 129
// TODO: Real release reason
supp_c_c.release_slot(SlotReleaseReason::TaskComplete);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may not be understanding, but can you make permit release a more explicit step where it occurs instead of relying on drop?

@@ -395,6 +406,7 @@ pub(super) const ACT_SCHED_TO_START_LATENCY_NAME: &str = "activity_schedule_to_s
pub(super) const ACT_EXEC_LATENCY_NAME: &str = "activity_execution_latency";
pub(super) const NUM_POLLERS_NAME: &str = "num_pollers";
pub(super) const TASK_SLOTS_AVAILABLE_NAME: &str = "worker_task_slots_available";
pub(super) const TASK_SLOTS_USED_NAME: &str = "worker_task_slots_used";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this metric, can we make an issue to make sure all SDKs get it? (or ignore if this is implicitly assumed as part of the general tuning project)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this all should get this as part of this work

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm is Java supposed to have this metric? I don't recall it getting added as part of your PR then.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, but I'll add it


/// Implements [SlotSupplier] and attempts to maintain certain levels of resource usage when
/// under load.
pub struct ResourceBasedSlots<MI> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once you get the algorithm settled, I think we should document the exact algorithm so that all languages can get the same (or at least real close)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added more to the docstring here, but of course we might still come back and decide to change the algo later

core/src/worker/slot_supplier/resource_based.rs Outdated Show resolved Hide resolved
core/src/worker/slot_supplier/resource_based.rs Outdated Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have an issue for a resource based slot supplier in java?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one in Jira at least

@Sushisource Sushisource force-pushed the resource-slots-poc branch from eaeb4cc to b650e3d Compare May 3, 2024 00:48
@Sushisource Sushisource force-pushed the resource-slots-poc branch from b650e3d to aadb636 Compare May 3, 2024 00:48
@Sushisource Sushisource force-pushed the resource-slots-poc branch 2 times, most recently from 7ad344b to e934332 Compare May 4, 2024 00:38
Copy link
Member

@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I like the tuner abstraction. I can already see this will have a good interface look in langs.

In Python/TS we can have a tuner be FixedTunerConfig | ResourceBasedTunerConfig | CustomTuner (the existing 3 fixed values are just shortcuts for the first one, the latter is just an interface), but in .NET I can't really make unions or sum types, so there'll probably be TunerOptions and 3 mutually exclusive properties of Fixed, ResourceBased, and Custom.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arguably could change the dir/package from slot_suppler to tuner, but meh. Could also have a public FixedSizeTuner in this package and just remove those 3 worker options and force everyone to create tuners if you want, but no prob leaving as is too.

@Sushisource Sushisource force-pushed the resource-slots-poc branch from e934332 to 0e0544e Compare May 13, 2024 17:08
@Sushisource Sushisource force-pushed the resource-slots-poc branch from 0e0544e to 42140a6 Compare May 13, 2024 17:21
@Sushisource Sushisource merged commit 84e10bf into temporalio:master May 13, 2024
6 checks passed
@Sushisource Sushisource deleted the resource-slots-poc branch May 13, 2024 17:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants