Worker Slots interface and Resource Based Autotuner First Cut #719

Sushisource · 2024-04-16T18:47:11Z

Implements the first portions of the slot management proposal and an initial take on resource-based auto-tuning.

Here the auto-tuning is done based on PID controllers which target specified CPU and RAM usage levels. This will very likely need more tweaking as we proceed and do more testing, but initial tests are quite promising for your typical sorts of relatively-low-resource-usage activities.
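For readers unfamiliar with the approach, here is a minimal sketch of how a PID controller can gate slot issuance against a target usage level. It is not the code in this PR: the `Pid` type, the `should_issue_slot` helper, the gains, and the rule of requiring headroom on both memory and CPU are assumptions made for illustration. Usage values are assumed to be fractions between 0 and 1.

```rust
/// A textbook discrete PID controller. Names and gains are illustrative
/// assumptions, not the types or defaults used in this PR.
#[derive(Debug, Clone)]
pub struct Pid {
    kp: f64,
    ki: f64,
    kd: f64,
    setpoint: f64,
    integral: f64,
    prev_error: Option<f64>,
}

impl Pid {
    pub fn new(setpoint: f64, kp: f64, ki: f64, kd: f64) -> Self {
        Self { kp, ki, kd, setpoint, integral: 0.0, prev_error: None }
    }

    /// Feed one measurement taken `dt` seconds after the previous one.
    /// A positive output means the measurement is below the target.
    pub fn update(&mut self, measured: f64, dt: f64) -> f64 {
        let error = self.setpoint - measured;
        self.integral += error * dt;
        // No derivative term on the first sample or when no time has passed.
        let derivative = match self.prev_error {
            Some(prev) if dt > 0.0 => (error - prev) / dt,
            _ => 0.0,
        };
        self.prev_error = Some(error);
        self.kp * error + self.ki * self.integral + self.kd * derivative
    }
}

/// Hypothetical decision helper: always allow slots up to a minimum, and
/// beyond that only when both controllers report headroom.
pub fn should_issue_slot(
    mem_pid: &mut Pid,
    cpu_pid: &mut Pid,
    mem_used: f64,
    cpu_used: f64,
    dt: f64,
    slots_in_use: usize,
    minimum_slots: usize,
) -> bool {
    let mem_out = mem_pid.update(mem_used, dt);
    let cpu_out = cpu_pid.update(cpu_used, dt);
    slots_in_use < minimum_slots || (mem_out > 0.0 && cpu_out > 0.0)
}
```

With a target memory usage of 0.8 and a measured usage of 0.95, the memory controller's proportional term goes negative and the helper stops handing out slots once the minimum is met.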

I have already successfully made use of the resource-based tuner in TypeScript without much challenge; the lang layer needs little adjustment to use it. Passing the callbacks through so that users can supply their own implementation of the interface will be more involved, and we've decided to do that after shipping the auto-tuner and letting users try it out.

Sushisource · 2024-05-01T02:25:58Z

core/src/abstractions.rs

+                // TODO: Real release reason
+                supp_c_c.release_slot(SlotReleaseReason::TaskComplete);


This turns out to be... quite annoying to deal with. Since simply dropping a permit releases it, it's incredibly annoying to find every place where that might happen and attach a reason.

My inclination is to simply not bother with this for now, possibly never, unless users ask for it.

I may not be understanding, but can you make permit release a more explicit step where it occurs instead of relying on drop?

I can, but it's worse from a safety perspective (could still use it as a backup though, with some default reason), and it can potentially happen in many, many places. It's just a lot of gruntwork for something I don't know that users will actually care about yet. Hence my preference to just defer it for now.

👍 Can we just get rid of reason for now then?

Yeah that was what I was going to do based on discussion

should we remove this from Java then?

Yeah, probably, for consistency's sake (I'll ask specifically for feedback on this one).
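To make the tradeoff in this thread concrete, here is a minimal sketch of a permit that supports an explicit release with a reason while keeping `Drop` as the safety net with a default reason. It is an illustration only: `CountingSupplier`, `Permit`, and the `ReleaseReason` variants are hypothetical stand-ins, and the PR ended up removing the release reason entirely.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical reason enum, used here only for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReleaseReason {
    TaskComplete,
    Unspecified,
}

/// Stand-in supplier that counts outstanding slots and remembers the
/// most recent release reason.
#[derive(Default)]
pub struct CountingSupplier {
    in_use: AtomicUsize,
    last_reason: Mutex<Option<ReleaseReason>>,
}

impl CountingSupplier {
    pub fn reserve(self: &Arc<Self>) -> Permit {
        self.in_use.fetch_add(1, Ordering::SeqCst);
        Permit { supplier: Arc::clone(self), released: false }
    }

    fn release_slot(&self, reason: ReleaseReason) {
        self.in_use.fetch_sub(1, Ordering::SeqCst);
        *self.last_reason.lock().unwrap() = Some(reason);
    }

    pub fn in_use(&self) -> usize {
        self.in_use.load(Ordering::SeqCst)
    }

    pub fn last_reason(&self) -> Option<ReleaseReason> {
        *self.last_reason.lock().unwrap()
    }
}

pub struct Permit {
    supplier: Arc<CountingSupplier>,
    released: bool,
}

impl Permit {
    /// Explicit release: consumes the permit and records the given reason.
    pub fn release(mut self, reason: ReleaseReason) {
        self.released = true;
        self.supplier.release_slot(reason);
    }
}

impl Drop for Permit {
    /// Backup path: a permit that is simply dropped is still released,
    /// with a default reason, so a slot can never leak.
    fn drop(&mut self) {
        if !self.released {
            self.supplier.release_slot(ReleaseReason::Unspecified);
        }
    }
}
```

The gruntwork described above is the first path: every call site that currently lets a permit fall out of scope would need to call `release` with the right reason.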

cretz

Mostly LGTM, though I am only judging from the pub API and behavior POV (with the knowledge that this is not truly public to users, so langs can clarify, not use abbreviations, etc).

cretz · 2024-05-01T12:18:32Z

core-api/src/worker.rs

+    /// Set a [SlotSupplier] for workflow tasks.
+    #[builder(setter(into = false))]
+    pub workflow_task_slot_supplier:
+        Arc<dyn SlotSupplier<SlotKind = WorkflowSlotKind> + Send + Sync>,
+    /// Set a [SlotSupplier] for activity tasks.
+    #[builder(setter(into = false))]
+    pub activity_task_slot_supplier:
+        Arc<dyn SlotSupplier<SlotKind = ActivitySlotKind> + Send + Sync>,
+    /// Set a [SlotSupplier] for local activity tasks.
+    #[builder(setter(into = false))]
+    pub local_activity_task_slot_supplier:


I still think, as I did on the Java SDK, that these should be one worker tuner object that a user can provide (even if the thing is just these three). It is better for callers. Granted this is more of a comment about lang than core.

There are tradeoffs with both. I did encounter a couple patterns where having just one would have made things slightly easier - but there are just as many situations where having them separate was really useful. Ultimately I think this is just a matter of taste.

It matters from a caller POV too. I suspect many callers will configure workers with tuners created by someone else (same with interceptors). Setting a single resource-based tuner is easier than three resource-based slot suppliers (same with interceptors). This also makes instance reuse across workers easier since it's a single instance shared instead of three instances shared (same with interceptors).

I'm gonna try it out before merging it just to see if it's much simpler. If I like the end result we'll go that way but if it ends up uglier I'll keep the three options.

My main concern is what happens when you want to use different kinds of suppliers for different types. For example: you want fixed size for workflow and resource based for activity. Obviously you can just stuff those inside your own implementation, but in the (I would guess fairly common) case that you want to do exactly that, you now have to make your own implementation where before you could just set ours, or we have to provide some combinators.

👍 FWIW your main concern also applies to interceptors and data converters in many langs (interceptors are just combinations of client, activity, and worker interceptors, and data converters are just combinations of payload codecs, payload converters, and failure converters). So there is a precedent for this type of combining. And many languages already have these combinators, assuming you provide a simple 3-item struct instead of just the interface (e.g. the with keyword in C#, dataclasses.replace in Python, struct update syntax (..) in Rust, or spread syntax (...) in JS). But I think we should optimize for what I think will be the common use case of users wanting to just provide a "tuner".
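As a rough illustration of the "simple 3-item struct" idea, the sketch below shows how Rust's struct update syntax could serve as the combinator for the mixed case (fixed size for workflows, resource based for everything else). The `Tuner` struct, the simplified `SlotSupplier` trait, and the `FixedSize` and `ResourceBased` types are hypothetical stand-ins; the real trait in this PR is generic over a slot kind and has more methods.

```rust
use std::sync::Arc;

/// Simplified stand-in for a slot supplier.
pub trait SlotSupplier: Send + Sync {
    fn name(&self) -> &'static str;
}

pub struct FixedSize(pub usize);
impl SlotSupplier for FixedSize {
    fn name(&self) -> &'static str {
        "fixed-size"
    }
}

pub struct ResourceBased;
impl SlotSupplier for ResourceBased {
    fn name(&self) -> &'static str {
        "resource-based"
    }
}

/// Hypothetical tuner: just the three suppliers bundled together.
#[derive(Clone)]
pub struct Tuner {
    pub workflow: Arc<dyn SlotSupplier>,
    pub activity: Arc<dyn SlotSupplier>,
    pub local_activity: Arc<dyn SlotSupplier>,
}

impl Tuner {
    /// The common case: one shared resource-based supplier for all three.
    pub fn resource_based() -> Self {
        let shared: Arc<dyn SlotSupplier> = Arc::new(ResourceBased);
        Self {
            workflow: shared.clone(),
            activity: shared.clone(),
            local_activity: shared,
        }
    }
}

/// The mixed case: override one field, take the rest from a base tuner.
pub fn mixed_tuner() -> Tuner {
    Tuner {
        workflow: Arc::new(FixedSize(100)),
        ..Tuner::resource_based()
    }
}
```

Callers who only want "a tuner" pass `Tuner::resource_based()`, and callers who want to mix kinds do not need to write their own implementation.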

cretz · 2024-05-01T12:27:00Z

core-api/src/worker.rs

+    /// Core will call this at worker initialization time, allowing the implementation to hook up to
+    /// metrics if any are configured. If not, it will not be called.
+    fn attach_metrics(&self, metrics: TemporalMeter);


Hrmm. I know everywhere else implementers want to use metrics, they access via runtime (e.g. .telemetry().get_metric_meter()). Do we need this here? Can't the user wire up metrics themselves? Is there a general purpose initialization method that may be needed?

At least the way TS is structured it's not really possible, and is likely to be annoying in other langs. The problem is you need to have initialized the runtime before you construct worker options to be able to provide the telemetry instance when constructing your slot supplier - but since the runtime is often constructed implicitly when a user initializes a worker, the worker config is already created before the runtime.

> but since the runtime is often constructed implicitly when a user initializes a worker

Hrmm, is this true? It seems from the TS bridge code I see that the client already has the runtime before worker is created. I would expect it can provide the metric meter to the slot supplier instances it creates at worker creation time.

For TS-side implementations, when they get around to temporalio/sdk-typescript#1229, they should provide an accessor to the metric meter off of the runtime object (and if that object is created implicitly, so be it).

Yeah, fair point, it's not impossible. I actually had it kinda like this originally, but it made for a bit of a mess in TS's bridge because of type erasure: you only had access to the erased slot supplier, so the trait needed this method to attach the metrics. The alternative was passing the slot suppliers separately from the worker config, which would've been annoying.

The nice thing about this is that lang doesn't need to do anything: this is all handled inside Core. There's going to have to be some kind of LangWrapper implementation of SlotSupplier anyway, and lang can provide the attached meter to the user's implementation in a context object or however else it feels like.

> lang can provide the attached meter to the user's implementation in a context object or however else it feels like

I think lang may want to ask users to do this themselves and not provide them anything in this abstraction specifically. At least in .NET and Python, any user that wants to emit metrics from their custom tuner implementation can of course use their own metrics impl, or use ours the same way they'd do it anywhere else outside of workflows/activities (using the runtime metric meter).

Yeah, they can still do that just fine, as well as we can provide it to them for access whenever they like outside of that. For now, within core, this makes the most sense and doesn't have any impact on lang.
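The type-erasure point in this thread can be sketched as follows: once a supplier is stored as a trait object, a method on the trait is the only way for worker initialization to hand it a meter. This is a simplified illustration, not the PR's code. `Meter` stands in for `TemporalMeter`, and `MeteredSupplier`, `init_worker`, and the default no-op method body are assumptions.

```rust
use std::sync::{Arc, OnceLock};

/// Stand-in for the real meter handle.
#[derive(Clone, Debug, PartialEq)]
pub struct Meter(pub String);

/// Simplified supplier trait with only the metrics hook.
pub trait SlotSupplier: Send + Sync {
    /// Called at worker initialization if metrics are configured.
    /// The no-op default is an assumption of this sketch.
    fn attach_metrics(&self, _metrics: Meter) {}
}

/// Hypothetical supplier that keeps the meter it was given.
#[derive(Default)]
pub struct MeteredSupplier {
    meter: OnceLock<Meter>,
}

impl SlotSupplier for MeteredSupplier {
    fn attach_metrics(&self, metrics: Meter) {
        // Only the first attach wins; later calls are ignored.
        let _ = self.meter.set(metrics);
    }
}

impl MeteredSupplier {
    pub fn meter(&self) -> Option<&Meter> {
        self.meter.get()
    }
}

/// Hypothetical worker initialization: it only sees the erased supplier,
/// and skips the call entirely when no metrics are configured.
pub fn init_worker(supplier: &Arc<dyn SlotSupplier>, configured: Option<Meter>) {
    if let Some(meter) = configured {
        supplier.attach_metrics(meter);
    }
}
```

Because the hook takes `&self`, the supplier can be constructed and placed in the worker config before any runtime or meter exists.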

cretz · 2024-05-01T12:36:32Z

core/src/telemetry/metrics.rs

@@ -395,6 +406,7 @@ pub(super) const ACT_SCHED_TO_START_LATENCY_NAME: &str = "activity_schedule_to_s
 pub(super) const ACT_EXEC_LATENCY_NAME: &str = "activity_execution_latency";
 pub(super) const NUM_POLLERS_NAME: &str = "num_pollers";
 pub(super) const TASK_SLOTS_AVAILABLE_NAME: &str = "worker_task_slots_available";
+pub(super) const TASK_SLOTS_USED_NAME: &str = "worker_task_slots_used";


I like this metric, can we make an issue to make sure all SDKs get it? (or ignore if this is implicitly assumed as part of the general tuning project)

Yes, all SDKs should get this as part of this work.

Hm is Java supposed to have this metric? I don't recall it getting added as part of your PR then.

Nope, but I'll add it

cretz · 2024-05-01T12:41:48Z

core/src/worker/slot_supplier/resource_based.rs

+
+/// Implements [SlotSupplier] and attempts to maintain certain levels of resource usage when
+/// under load.
+pub struct ResourceBasedSlots<MI> {


Once you get the algorithm settled, I think we should document the exact algorithm so that all languages can get the same (or at least real close)

I've added more to the docstring here, but of course we might still come back and decide to change the algo later

Quinn-With-Two-Ns · 2024-05-02T22:43:26Z

core/src/worker/slot_supplier/resource_based.rs

Do we have an issue for a resource based slot supplier in java?

There is one in Jira at least

cretz

👍 I like the tuner abstraction. I can already see this will have a good interface look in langs.

In Python/TS we can have a tuner be FixedTunerConfig | ResourceBasedTunerConfig | CustomTuner (the existing 3 fixed values are just shortcuts for the first one, the latter is just an interface), but in .NET I can't really make unions or sum types, so there'll probably be TunerOptions and 3 mutually exclusive properties of Fixed, ResourceBased, and Custom.
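For comparison with the union described above, this is roughly what the same three-way choice might look like as a Rust sum type. It is only a sketch of the shape: the `TunerConfig` enum, its field names, and the `CustomTuner` trait are hypothetical and are not part of this PR.

```rust
/// Hypothetical interface for a user-provided tuner.
pub trait CustomTuner: Send + Sync {
    fn name(&self) -> String;
}

/// Hypothetical sum type mirroring
/// FixedTunerConfig | ResourceBasedTunerConfig | CustomTuner.
pub enum TunerConfig {
    Fixed {
        workflow_slots: usize,
        activity_slots: usize,
        local_activity_slots: usize,
    },
    ResourceBased {
        target_memory_usage: f64,
        target_cpu_usage: f64,
    },
    Custom(Box<dyn CustomTuner>),
}

impl TunerConfig {
    /// Exactly one variant is active, so no mutual-exclusion checks are needed.
    pub fn describe(&self) -> String {
        match self {
            TunerConfig::Fixed { workflow_slots, activity_slots, local_activity_slots } => {
                format!("fixed({workflow_slots}/{activity_slots}/{local_activity_slots})")
            }
            TunerConfig::ResourceBased { target_memory_usage, target_cpu_usage } => {
                format!("resource-based(mem={target_memory_usage}, cpu={target_cpu_usage})")
            }
            TunerConfig::Custom(tuner) => format!("custom({})", tuner.name()),
        }
    }
}
```

A language without sum types would need the three mutually exclusive properties mentioned above, plus a runtime check that only one of them is set.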

cretz · 2024-05-06T15:19:06Z

core/src/worker/slot_supplier/resource_based.rs

Arguably could change the dir/package from slot_supplier to tuner, but meh. Could also have a public FixedSizeTuner in this package and just remove those 3 worker options and force everyone to create tuners if you want, but no prob leaving as is too.

Sushisource added 11 commits April 23, 2024 16:35

Super basic implementation for workflow task slots

6175849

Mostly there to the interface changes.

5ba52c4

Pausing while removing wf state inputs

Add slot suppliers to config

dba422c

Unit tests passing

cf871f0

Integ tests passing

b0b2fd3

Implement resource-based basics & preliminary testing

07ad7e6

Added generics to permits

06e1d45

Added a heavier activity test, and basic multi-slot supplier

8055d98

Semi-reasonable but simple algorithm which performs decently

b6b91c9

PID controller w/ mem & cpu

5886007

Respect minimum / remove dbg

d848b99

Sushisource force-pushed the resource-slots-poc branch from 9042276 to d848b99 Compare April 23, 2024 23:36

Sushisource added 7 commits April 25, 2024 15:49

Metrics emission for load testing

adda8d9

Fix cpu being out of 100 instead of 1

611b1f2

Add missing docstrings

e57b354

Add tests for slots used metric

7bd7c5a

Fix unit tests / lints / other cleanup

935248a

Make sure we don't try to reserve slot if it would exceed WF cache

b719ecc

Merge branch 'master' into resource-slots-poc

93dfcea

Sushisource force-pushed the resource-slots-poc branch 2 times, most recently from d9821b2 to add670f Compare April 30, 2024 22:15

Fix a handful of integ test problems or sensitivity to new server

7d6e505

Sushisource force-pushed the resource-slots-poc branch from add670f to 7d6e505 Compare April 30, 2024 22:28

Sushisource changed the title from "[DRAFT] Resource slots POC" to "Worker Slots interface and Resource Based Autotuner First Cut" May 1, 2024

Address todos

eb23321

Sushisource force-pushed the resource-slots-poc branch from 341d3c3 to eb23321 Compare May 1, 2024 02:19

Sushisource marked this pull request as ready for review May 1, 2024 02:23

Sushisource requested a review from a team as a code owner May 1, 2024 02:23

Sushisource commented May 1, 2024

Sushisource commented May 1, 2024

cretz reviewed May 1, 2024

Sushisource added 3 commits May 1, 2024 10:13

Docstring / naming fixes from review comments

c9e80c9

Fix periodic metric emission

a459380

Merge branch 'master' into resource-slots-poc

5d136c9

Quinn-With-Two-Ns reviewed May 2, 2024

Quinn-With-Two-Ns approved these changes May 2, 2024

Sushisource added 2 commits May 2, 2024 16:39

Default available_slots implementation to None

daaf52a

Remove release reason for now

5eea7df

Sushisource force-pushed the resource-slots-poc branch from eaeb4cc to b650e3d Compare May 3, 2024 00:48

Make all the PID options fully configurable

aadb636

Sushisource force-pushed the resource-slots-poc branch from b650e3d to aadb636 Compare May 3, 2024 00:48

Sushisource added 2 commits May 2, 2024 17:52

Add docstring about algorithm

cdcdd33

Fix possible underflow when recording metrics

5d567fb

Sushisource force-pushed the resource-slots-poc branch 2 times, most recently from 7ad344b to e934332 Compare May 4, 2024 00:38

cretz approved these changes May 6, 2024

Sushisource force-pushed the resource-slots-poc branch from e934332 to 0e0544e Compare May 13, 2024 17:08

Add overall WorkerTuner trait to bring together suppliers

42140a6

Sushisource force-pushed the resource-slots-poc branch from 0e0544e to 42140a6 Compare May 13, 2024 17:21

Package rename / pub fixed size

97741c2

Sushisource merged commit 84e10bf into temporalio:master May 13, 2024
6 checks passed

Sushisource deleted the resource-slots-poc branch May 13, 2024 17:53
