Worker Autotuning #88

Merged: 9 commits, merged Apr 2, 2024
88 changes: 44 additions & 44 deletions in all-sdk/autotuning.md
Expand Up @@ -93,77 +93,70 @@ trait SlotSupplier {
/// Blocks until a slot is available. In languages with explicit cancel mechanisms, this should be cancellable and
/// return a boolean indicating whether a slot was actually obtained or not. In Rust, the future can simply
/// be dropped if the reservation is no longer desired.
-async fn reserve_slot(&self);
+async fn reserve_slot(&self, ctx: &dyn SlotReservationContext<Self::SlotKind>);

Member:
I am now wondering, for future-proofing, if we should let implementers return a slot and have that same object be given back on mark_slot_used and release_slot. I can see use cases where users want to put state on the slot. That would also change the try_reserve_slot response to an Option instead of a bool.

Member Author:

This is what I originally had, and is what I have in Rust because internally I store a struct that auto-releases on drop.

As for user data though, if we're saying elsewhere we'd prefer to not have something until there's a demonstrated need for it, I think that's probably even more applicable here. I don't know what the immediate use case is for this, but I do have known use cases for the used_slots on context.

Member Author:

Any language with real destructors, yeah, definitely.

@cretz (Member, Mar 20, 2024):

> As for user data though, if we're saying elsewhere we'd prefer to not have something until there's a demonstrated need for it, I think that's probably even more applicable here

Less applicable, because this is future-proofing. We cannot easily add a return type to this later. The immediate use case would be something like etcd lease info. But just an opaque object that we give back to them is nice.

Can we at least make sure that a user can uniquely identify the slot they reserved at mark-used/release time? I don't think there's a way today.

@Sushisource (Member Author, Mar 20, 2024):

I can add it back. I'm just a little worried about how extensive the resulting changes might be in Go/Java, where there is no permit object like in Core. We'd have to track one everywhere like Core does, which ultimately is nice, but could be quite a lot of code churn.

I think any kind of uniquely identifying a particular slot/permit would have the same consequence. It's useful though, in the abstract.

@Quinn-With-Two-Ns what do you think?
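
For concreteness, a rough sketch of the permit-style alternative being discussed, reusing the proposal's SlotKind/SlotReleaseReason types; every other name here is hypothetical, and this is not the proposed API:

```rust
// Hypothetical permit-returning variant: reserve_slot hands back an opaque,
// implementer-defined permit that the SDK threads through to the later calls,
// so users can attach state (e.g. etcd lease info) and uniquely identify slots.
trait PermitSlotSupplier {
    type SlotKind: SlotKind;
    /// Implementer-defined state carried with each reserved slot.
    type Permit: Send + 'static;

    async fn reserve_slot(&self) -> Self::Permit;

    /// With permits, the non-blocking variant naturally returns an Option
    /// instead of a bool.
    fn try_reserve_slot(&self) -> Option<Self::Permit>;

    fn mark_slot_used(&self, permit: &Self::Permit);

    fn release_slot(&self, permit: Self::Permit, reason: SlotReleaseReason);
}
```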


Member:

Can we add a cancellation mechanism to this? (I forget if Rust implicitly supports cancellation of async like Python, but this at least needs to be called out for other langs)

Member Author:

The comment addresses exactly that. There's no need for an explicit cancel in Rust.
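
A small sketch of the drop-to-cancel point, assuming tokio; the helper and shutdown signal are illustrative only:

```rust
// Race a reservation future against a shutdown signal. If the shutdown branch
// wins, `reserve` is dropped, which cancels the pending reservation; no
// explicit cancel API is needed in Rust, unlike in some other languages.
async fn reserve_or_shutdown<F: std::future::Future>(
    reserve: F,
    mut shutdown: tokio::sync::watch::Receiver<bool>,
) {
    tokio::select! {
        _ = reserve => { /* slot obtained */ }
        _ = shutdown.changed() => { /* reservation future dropped => cancelled */ }
    }
}
```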


/// Tries to immediately reserve a slot, returning true if a slot is available.
-fn try_reserve_slot(&self) -> bool;
+fn try_reserve_slot(&self, ctx: &dyn SlotReservationContext<Self::SlotKind>) -> bool;

Member:

If people are allowed to use remote tooling like zookeeper/etcd, this should be async


/// Marks a slot as actually now being used. This is separate from reserving one because the pollers need to
/// reserve a slot before they have actually obtained work from server. Once that task is obtained (and validated)
/// then the slot can actually be used to work on the task.
///
/// Users' implementation of this can choose to emit metrics, or otherwise leverage the information provided by the
/// `info` parameter to be better able to make future decisions about whether a slot should be handed out.
///
-/// `info` may not be provided if the slot was never used
-/// `error` may be provided if an error was encountered at any point during processing
-/// TODO: Error type should maybe also be generic and bound to slot type
-fn mark_slot_used(&self, info: Option<&Self::SlotKind::Info>, error: Option<&anyhow::Error>);
+fn mark_slot_used(&self, info: <Self::SlotKind as SlotKind>::Info<'_>);

Member:

After looking at the "neat example" below, it's clear this can be a long-blocking call. It should be documented as such, made async, allow cancellation/interruption, and maybe change the name. "marking" isn't usually considered a blocking operation. Maybe apply_slot or start_slot or something.

Member Author:

Well, I say it could be, but I also explain why that's not necessary and you can defer the async-ness to the reservation. Ideally I think you'd keep this as nonblocking. Places where it gets used really, really shouldn't have to sit and wait.

@cretz (Member, Mar 20, 2024):

Hrmm, yeah I first thought the "neat example" blocked in this call, but I see it doesn't. It means slot reservation may be unnecessarily expensive for things that never actually use the slot. Basically it means remote semaphores are based on poll instead of usage. But this is just a product of how we obtain work from server. We should at least document that people should not block in "mark used" since we don't give them async to do so.

Though now I wonder if someone would want to separate their remote polling semaphores from their usage semaphores. That makes sense in one-activity-running-anywhere-globally cases. So maybe the API should allow that w/ async and just document that the "mark used" is subject to task timeout. Then again, I guess that can be in the activity. Now I'm starting to think about providing contextual data from slot supplier to the activity impl; I'm overthinking for sure.

Member Author:

I feel pretty strongly that marking used can't ever block at all (and yeah I can mention that in the comment).

The reasoning is that it happened, and the implementer has no choice but to accept that fact. If that then means, later, that they take a long time to hand out the next permit, that's acceptable; but they can't block on mark used, because a) it gums things up, and b) they can't change anything anyway, it's too late.

@cretz (Member, Mar 20, 2024):

Works for me (can consider this resolved but can leave discussion open for others if you want). I do think release should be allowed to be async though.
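
A minimal sketch of the shape agreed on here, assuming tokio and using a local semaphore as a stand-in for remote coordination: all potentially slow work lives in reserve_slot, and mark_slot_used stays strictly non-blocking:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

struct RemoteCoordinatedSupplier {
    // Stand-in for a remote semaphore (an etcd/zookeeper lease, etc.).
    permits: tokio::sync::Semaphore,
    in_use: AtomicUsize,
}

impl RemoteCoordinatedSupplier {
    async fn reserve_slot(&self) {
        // May wait arbitrarily long; dropping this future abandons the wait.
        self.permits.acquire().await.expect("semaphore closed").forget();
    }

    fn mark_slot_used(&self) {
        // Never blocks: the task has already been accepted, so there is
        // nothing to wait for. Just record that a reserved slot is in use.
        self.in_use.fetch_add(1, Ordering::Relaxed);
    }

    fn release_slot(&self) {
        // Simplified: a real implementation would decrement only if the slot
        // was actually marked used.
        self.in_use.fetch_sub(1, Ordering::Relaxed);
        self.permits.add_permits(1);
    }
}
```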


-/// Frees a slot.
-fn release_slot(&self, info: &SlotReleaseReason<Self::SlotKind::Info>);
-
-/// If this implementation knows how many slots are available at any moment, it should return that here.
-fn available_slots(&self) -> Option<usize>;
+/// Frees a slot. This is always called when a slot is no longer needed, even if it was never marked as used.
+/// EX: If the poller reserves a slot, and then receives an invalid task from the server for whatever reason, this
+/// method would be called with [SlotReleaseReason::Error].
+fn release_slot(&self, info: SlotReleaseReason);

Member:

If people are allowed to use remote tooling like zookeeper/etcd, this should be async

}

-struct WorkflowSlotsInfo {
-used_slots: Vec<WorkflowSlotInfo>,
-/// Current size of the workflow cache.
-num_cached_workflows: usize,
-/// The limit on the size of the cache, if any. This is important for users to know as discussed below in the section
-/// on workflow cache management.
-max_cache_size: Option<usize>,
-// ... Possibly also metric information
-}
-struct ActivitySlotsInfo {
-used_slots: Vec<ActivitySlotInfo>,
+/// This trait lets implementors obtain other information that might be relevant to their decision on whether to hand
+/// out a slot. It's a trait rather than a struct because the most up-to-date information should be available even
+/// if waiting for some time in the blocking version of `reserve_slot`.
+pub trait SlotReservationContext<SK: SlotKind>: Send + Sync {
+/// Returns information about currently in-use slots
+fn used_slots(&self) -> &[SK::Info];
@Quinn-With-Two-Ns (Mar 19, 2024):

Shouldn't the slot supplier know this?

Member Author:

It can know it, but I don't really see a reason to force the implementor to track it themselves when we already have to track it ourselves, at least in some capacity, as discussed here: #88 (comment)

Certainly in Core this made dramatically more sense to provide automatically, because the SlotSupplier gets wrapped by a struct that has to track all this stuff anyway, and I anticipate creating a similar wrapper in Go/Java.

If it turns out to be a huge pain to provide as I'm working through Java, it can be left off - but I imagine it's going to be pretty straightforward to do.

Reply:

I think just because we can provide something doesn't mean we should. We can always add information based on user feedback; it is much harder to remove. My main concern is the overhead of keeping all this data updated from potentially 10,000s of goroutines in Go when a user doesn't care about the information.


Member:

Same, I prefer the lightest API surface needed to do an acceptable job (maybe only slot kind and task queue name atm).

@Sushisource (Member Author, Mar 20, 2024):

I definitely have known reasons to want this though, and the Stripe guys even explicitly said they would want and use this information inside their implementation. I'm pretty clear on it having value.

The overhead is tiny. It's just keeping things in a list or a map; even if that's 10k things, that's not hard, since there's no meaningful compute going on. It already has to be tracked somewhere anyway in order to provide activity/wf context to users, etc.

@Sushisource (Member Author, Mar 20, 2024):

Well, sure, but this isn't custom information - it's information that's extracted from the task which we'll be storing anyway. It's just making it available to them rather than asking them to do that work themselves. I don't think there's any strong reason they should record it over us. It's just making their lives easier for something that's very likely to come up in many or most implementations.

Reply:

> Well, sure, but this isn't custom information

Right, but I think users will also want to track custom information for each slot if they already want this level of detail.

> I don't think there's any strong reason they should record it over us

Because it adds overhead to every implementation that doesn't need it, like our implementations, and it adds more complexity to the SDK and the API. I am not sure this is very likely to come up in real implementations, since the ones we have proposed (fixed, resource-based, pausable) do not need it. I'm totally open to adding a wrapper that keeps track of it; otherwise I think we need benchmarking to show this doesn't increase latency or memory overhead. I apologize for being such a stickler here, but since we have more low-latency use cases coming into Temporal I am very paranoid about introducing potential performance regressions.


Member Author:

The resource-based one does need it. It needs to know how many slots are used, since you need to know that you're not going to hand out more slots than there is room for in the cache (or even just to define a configurable hard limit).

This is zero-cost in Core, for example; the info is all tracked already. Fetching system resources like memory usage for that implementation is dramatically more expensive by comparison.

We can benchmark... but I feel that's a little odd here given there are virtually zero changes we do actually benchmark and I don't see any reason to believe this would add substantial overhead compared to anything else we do. I mean it's literally just stuffing some things (which are likely just references to objects that already exist elsewhere) in a collection, which isn't very expensive to do.

Reply:

> The resource-based one does need it. It needs to know how many slots are used, since you need to know that you're not going to hand out more slots than there is room for in the cache (or even just to define a configurable hard limit).

The resource-based one just needs to know the number; it doesn't need a whole map of every slot and details about it.

> but I feel that's a little odd here given there are virtually zero changes we do actually benchmark

I get that for Core it is zero-cost, but for Go and Java this is new. I thought we already talked in Slack about benchmarking these changes, so I didn't think this was a new requirement? I assumed that as part of benchmarking we would test the old SDK implementation against the new fixed slot supplier implementation?

@Sushisource (Member Author, Mar 20, 2024):

Yeah, of course, but isolating this particular function will require its own independent round of testing - which is fine, but further than we have gone before for any micro-benchmark kind of thing.

> The resource-based one just needs to know the number; it doesn't need a whole map of every slot and details about it.

Yep - but this is more generally useful and (I would bet quite readily) not meaningfully more costly.

In any case, none of this is really a hard blocker either way on this proposal. We need the context object and it'll have basic stuff like get_task_queue. Whether or not this specific function lives on it we can decide after seeing how much effort and overhead it costs, though I'm pretty sure those are both small.
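
To make the disagreement concrete, a sketch of a resource-based supplier leaning on SDK-tracked counts; the context trait below is a simplified, hypothetical stand-in for SlotReservationContext:

```rust
// Simplified stand-in for the proposal's SlotReservationContext (the real
// trait is generic over slot kind, elided here for brevity).
trait ReservationContextLike {
    /// Number of slots currently in use, tracked by the SDK.
    fn used_slot_count(&self) -> usize;
}

struct ResourceBasedSupplier {
    // E.g. chosen so handed-out slots can never outnumber cache capacity.
    hard_limit: usize,
}

impl ResourceBasedSupplier {
    fn try_reserve_slot(&self, ctx: &dyn ReservationContextLike) -> bool {
        // The in-use count comes from the context, so the supplier does no
        // bookkeeping of its own.
        ctx.used_slot_count() < self.hard_limit && system_has_headroom()
    }
}

/// Hypothetical stand-in for a real memory/CPU probe.
fn system_has_headroom() -> bool {
    true
}
```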


-// ... anything else users end up wanting here
-}
-struct LocalActivitySlotsInfo {
-used_slots: Vec<LocalActivitySlotInfo>,
+}
+
+pub enum SlotReleaseReason {

Comment:

If the worker is shutting down, is that an Error or NeverUsed?

Member Author:

I would say NeverUsed, since if it is used, then we're waiting until that task completes before finishing shutdown. (or, if we're not, then it'd be an Error... but we should be waiting for tasks to finish)

@Quinn-With-Two-Ns (Mar 20, 2024):

Fine with me; might be worth documenting so we are consistent across SDKs.

+TaskComplete,
+NeverUsed,
+Error(anyhow::Error), // Or possibly something specific to the slot kind, but unlikely.
+}

-struct WorkflowSlotInfo {
-workflow_type: String,
-// task queue, worker id, etc...
+pub struct WorkflowSlotInfo<'a> {
+pub workflow_type: &'a str,
+// etc...
}
-struct ActivitySlotInfo {
-activity_type: String,
+pub struct ActivitySlotInfo<'a> {
+pub activity_type: &'a str,
+// etc...
}
-struct LocalActivitySlotInfo {
-activity_type: String,
+pub struct LocalActivitySlotInfo<'a> {
+pub activity_type: &'a str,
+// etc...
}

struct WorkflowSlotKind {}
struct ActivitySlotKind {}
struct LocalActivitySlotKind {}
-trait SlotKind {
-type Info;
+pub trait SlotKind {
+type Info<'a>;
}
impl SlotKind for WorkflowSlotKind {
-type Info = WorkflowSlotInfo;
+type Info<'a> = WorkflowSlotInfo<'a>;
}
impl SlotKind for ActivitySlotKind {
-type Info = ActivitySlotInfo;
+type Info<'a> = ActivitySlotInfo<'a>;
}
impl SlotKind for LocalActivitySlotKind {
-type Info = LocalActivitySlotInfo;
+type Info<'a> = LocalActivitySlotInfo<'a>;
}
trait WorkflowTaskSlotSupplier: SlotSupplier<SlotKind=WorkflowSlotKind> {}
trait ActivityTaskSlotSupplier: SlotSupplier<SlotKind=ActivitySlotKind> {}
trait LocalActivityTaskSlotSupplier: SlotSupplier<SlotKind=LocalActivitySlotKind> {}
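
// (Illustrative sketch, not part of this diff.) Consuming the borrowed
// Info<'a> GAT defined above: the info borrows from task data the SDK already
// holds, so reading it at mark-used time needs no allocation. The function
// name here is hypothetical.
fn example_mark_used(info: <WorkflowSlotKind as SlotKind>::Info<'_>) {
    // For WorkflowSlotKind, Info<'_> is WorkflowSlotInfo<'_>, per the impl above.
    println!("slot used by workflow type: {}", info.workflow_type);
}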

/// Users might want to be able to pause the handing-out of slots as an effective way of pausing their workers.
/// We can provide an implementation for this that wraps their implementation, or one of the defaults we provide.
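
// (Illustrative sketch, not part of this diff; assumes tokio.) One possible
// shape for that pausing wrapper: reservation waits while a pause flag is set,
// so a paused worker stops taking new work without touching the wrapped supplier.
struct PausableSlotSupplier<S> {
    inner: S,
    paused: tokio::sync::watch::Receiver<bool>,
}

impl<S> PausableSlotSupplier<S> {
    async fn wait_while_paused(&self) {
        let mut rx = self.paused.clone();
        while *rx.borrow() {
            // Wakes whenever the pause flag is flipped by the sender side.
            if rx.changed().await.is_err() {
                break; // pause controller dropped; treat as unpaused
            }
        }
    }
}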
@@ -224,6 +217,16 @@ trait WorkflowCacheSizer {
///
/// I'm a bit at a loss for a good name
trait WorkflowSlotAndCacheManager: WorkflowTaskSlotSupplier + WorkflowCacheSizer {}

+struct WorkflowSlotsInfo {
+used_slots: Vec<WorkflowSlotInfo>,
+/// Current size of the workflow cache.
+num_cached_workflows: usize,
+/// The limit on the size of the cache, if any. This is important for users to know as discussed below in the section
+/// on workflow cache management.
+max_cache_size: Option<usize>,
+// ... Possibly also metric information
+}
```

## Flow
@@ -277,8 +280,7 @@ We should provide a few default implementations:
Users can add their own metrics behind their implementations now, which is great - but we can also provide some out of
the box. We can always provide `slots_in_use` by type. Right now we emit the inverse of this,
`worker_task_slots_available` - we can keep emitting that with a bundled implementation that replicates the old
-fixed-number based implementation. In fact, any implementation which returns a value from `available_slots` we can make
-emit the metric automatically.
+fixed-number based implementation.

We can also of course keep emitting the current number of cached workflows.
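
As a sketch of that bundled-metrics idea (the gauge emitter is a placeholder for whatever metrics backend gets wired in, not an existing SDK API):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Derives a `slots_in_use` gauge for any inner supplier by counting
// mark-used/release transitions.
struct MetricsEmittingSupplier<S> {
    inner: S,
    in_use: AtomicUsize,
}

impl<S> MetricsEmittingSupplier<S> {
    fn on_slot_used(&self) {
        let now = self.in_use.fetch_add(1, Ordering::Relaxed) + 1;
        emit_gauge("slots_in_use", now);
    }

    fn on_slot_released(&self) {
        let now = self.in_use.fetch_sub(1, Ordering::Relaxed).saturating_sub(1);
        emit_gauge("slots_in_use", now);
    }
}

/// Placeholder metrics hook.
fn emit_gauge(_name: &str, _value: usize) {}
```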

@@ -366,8 +368,6 @@ impl ActivityTaskSlotSupplier for MyCoordinatingSlotSupplier {
}
self.per_worker_allowed.release();
}

-fn available_slots(&self) -> Option<usize> { Some(self.per_worker_allowed.available_permits()) }
}
```
