Handle dropped actors without crashing the RPC thread #4769
Conversation
linera-core/src/worker.rs (outdated)

        .expect("`ChainWorkerActor` stopped executing unexpectedly");
    if let Err(e) = chain_actor.send((request_builder(callback), tracing::Span::current())) {
        // The actor endpoint was dropped. Let's clear it from the cache.
        self.clear_chain_worker_endpoint(chain_id);
It may be possible at this point that some future somewhere still has a reference to the shared chain state view.
It's maybe not a blocker for now, but we should make sure soon that the shared views are only used for GraphQL.
I see. Good point.
linera-core/src/worker.rs (outdated)

    }

    /// Clear any endpoint to a [`ChainWorkerActor`] from the cache for the given chain.
    #[instrument(level = "trace", target = "telemetry_only", skip(self), fields(
Why should this be `telemetry_only`?
I think the default configuration for `instrument` should just be `level = "debug"`. Remember these don't cause new log lines unless execution tracing is explicitly enabled with `RUST_LOG_SPAN_EVENTS`, but they add potentially helpful debug information to events that occur within the span.
As @ndr-ds explained to me:

> Basically if you have `telemetry_only` it just means that you're gonna send the traces to Tempo for sure, even if you're below the log level. But if you're at or above the log level, it will still also print to the screen.

So we can have both `debug` and `telemetry_only`.
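For illustration, here is a minimal, hypothetical sketch of combining the two on the method from the diff above (the `Worker` struct, the `ChainId` stand-in, and the body are invented for the example; only the attribute shape comes from the code under review):

```rust
use tracing::instrument;

type ChainId = u64; // hypothetical stand-in for the real `ChainId` type

struct Worker;

impl Worker {
    /// A DEBUG-level span that is also tagged for telemetry export.
    ///
    /// The span itself prints no extra log lines unless span events are enabled
    /// (e.g. via `RUST_LOG_SPAN_EVENTS`), but events emitted inside it inherit its
    /// fields, and the `telemetry_only` target marks it for export to Tempo.
    #[instrument(level = "debug", target = "telemetry_only", skip(self))]
    fn clear_chain_worker_endpoint(&self, chain_id: ChainId) {
        tracing::debug!("clearing cached chain worker endpoint");
    }
}
```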
Yes, sorry, I need to change the name of this to `telemetry` or something like that, the `_only` was supposed to have been removed and I missed it 🤦🏻♂️
Gotcha :) Shouldn't we just always send all TRACEs (and above) to Tempo though?
There were some of these `instrument` annotations with TRACE, for example. Since we run with log level INFO by default, if I sent spans to Tempo just based on the log level, it wouldn't send those spans. And we technically want all the spans 😅 so we have as much information as possible.
So I needed a way to decouple the log level from whether a span is sent to Tempo, and that's why I did the target thing and added `telemetry_only` (soon to be `tempo`, maybe) everywhere: to guarantee that even if we don't match the log level, we still send the spans to Tempo.
Hopefully the explanation makes sense 😅
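As a rough, hypothetical illustration of that mechanism (not the project's actual telemetry setup), a per-layer filter in `tracing-subscriber` can accept the `telemetry_only` target regardless of level while the console keeps its normal level filter; the second `fmt` layer below merely stands in for an OTLP/Tempo exporter:

```rust
use tracing_subscriber::{filter, prelude::*};

fn init_tracing() {
    // Console output keeps the usual level-based filtering.
    let console_layer = tracing_subscriber::fmt::layer().with_filter(filter::LevelFilter::INFO);

    // "Tempo" layer: accept anything tagged `telemetry_only`, plus everything at
    // INFO or a more severe level. (A real setup would use an OTLP exporter layer.)
    let tempo_layer = tracing_subscriber::fmt::layer().with_filter(filter::filter_fn(|metadata| {
        metadata.target() == "telemetry_only" || *metadata.level() <= tracing::Level::INFO
    }));

    tracing_subscriber::registry()
        .with(console_layer)
        .with(tempo_layer)
        .init();
}
```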
OTOH I could just always send to Tempo regardless of level 😅 but I wanted a way for people to opt out of sending to Tempo, in case some spans are just too noisy (for example spans in the microseconds that are very frequent and pollute the Tempo visualization)
I think I'd rather have a ‘don't send to Tempo’ annotation than a ‘do send to Tempo’ annotation 🤔
I did start the convention of `instrument(level = "trace")`, but on reflection I regret it: I think they should basically always be DEBUG. The only reason to make them TRACE is so that they get compiled out when we use `tracing/release_max_level_debug`, but after your experimentation it seems that there's no point doing that, and in exchange we lose potentially useful context from our DEBUG logs.
Here's my attempt at making this less confusing: #4771
linera-core/src/worker.rs (outdated)

    self.clear_chain_worker_endpoint(chain_id);
    Err(WorkerError::ChainActorRecvError {
        chain_id,
        error: e.to_string(),
Please never convert errors to strings as it loses the entire backtrace, and then debugging gets hard :) If you need a dynamic error, which I don't think you do here, you can use `Box<dyn std::error::Error + Send>`.
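A minimal sketch of that suggestion, assuming a `thiserror`-style error enum (the `ChainId` alias and the exact variant shape are illustrative, not the crate's actual definitions):

```rust
use std::error::Error;
use thiserror::Error as ThisError;

type ChainId = u64; // hypothetical stand-in for the real `ChainId` type

#[derive(Debug, ThisError)]
enum WorkerError {
    /// Keeping the error as a trait object (or as the concrete channel error type)
    /// preserves the source chain for debugging, unlike flattening it with `e.to_string()`.
    #[error("chain worker actor for chain {chain_id} stopped responding")]
    ChainActorRecvError {
        chain_id: ChainId,
        #[source]
        error: Box<dyn Error + Send>,
    },
}
```

Swapping the boxed trait object for the concrete channel error type (as suggested below) works the same way; the point is to keep the error value rather than its string rendering.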
I had an issue with a generic type. Let me try again.
I wouldn't recommend making it generic — I think you can either use the concrete type from oneshot or else use a trait object.
linera-core/src/worker.rs (outdated)

    let (response, receiver) = {
        let mut chain_workers = self.chain_workers.lock().unwrap();

        let (sender, new_receiver) = if let Some(endpoint) = chain_workers.remove(&chain_id) {
The problem is that this will be the case both when the chain worker has crashed (and needs to be restarted) and when someone else is using the chain worker to process a query, in which case we must not start a new one.
Maybe rather than an `Option` this should be an `async::Mutex<Option<_>>`, and we treat `None` the same as not present (i.e. start a new one)?
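A rough sketch of that idea, with hypothetical stand-in types (`Endpoint`, `ChainId`) rather than the actual `linera-core` definitions:

```rust
use std::{collections::HashMap, sync::Arc};
use tokio::sync::Mutex as AsyncMutex;

type ChainId = u64; // hypothetical stand-in
struct Endpoint;    // hypothetical stand-in for the chain worker actor endpoint

/// The outer sync mutex only guards the map; each slot holds an async mutex so a
/// request keeps the slot locked while it talks to the actor, and `None` (a crashed
/// or never-started actor) is treated the same as "not present".
type ChainWorkers = Arc<std::sync::Mutex<HashMap<ChainId, Arc<AsyncMutex<Option<Endpoint>>>>>>;

async fn with_chain_worker(workers: &ChainWorkers, chain_id: ChainId) {
    // Briefly lock the map to fetch or create the per-chain slot.
    let slot = workers
        .lock()
        .unwrap()
        .entry(chain_id)
        .or_insert_with(|| Arc::new(AsyncMutex::new(None)))
        .clone();

    // Hold the async lock for the whole request, so no other task can decide to
    // spawn a second actor for the same chain in the meantime.
    let mut endpoint = slot.lock().await;
    if endpoint.is_none() {
        *endpoint = Some(Endpoint); // (re)start the chain worker actor here
    }
    // ... send the request through the endpoint and await the response ...
}
```

This is also what the exchange below is about: the sync lock protects the test-and-remove itself, but once it is released another request can observe the missing endpoint and spawn a duplicate actor.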
Doesn't the `self.chain_workers.lock()` around this line ensure that we atomically use & test the endpoint?
Yes, the test+use here is fine, but then we drop (free) the lock at the end of the block, and another request can come in and find the endpoint missing (and decide to spawn another one). This task then won't see that until it goes to put the endpoint back and finds there's already one there (and/or gets its storage corrupted…)
Don't we always lock the entire map? Then what you're describing doesn't seem possible 🤔 but maybe I'm missing something
## Motivation
Crashing the RPC thread was a bad idea because it doesn't crash the process itself. Let's try another approach.
## Proposal
* Fail the RPC query
* Remove the dropped actor from the actor cache
## Test Plan
CI + TBD?
## Release Plan
- These changes should be backported to the latest `devnet` branch, then
- be released in a new SDK.
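For completeness, here is a simplified, hypothetical sketch of the proposal above (the real code in `linera-core/src/worker.rs` differs in types and details): on a failed send to the actor, the worker evicts the cached endpoint and fails the query with a `WorkerError` instead of panicking the RPC thread.

```rust
use std::{
    collections::HashMap,
    sync::{mpsc, Arc, Mutex},
};

type ChainId = u64;                     // hypothetical stand-in
struct Request;                         // hypothetical request type
struct Endpoint(mpsc::Sender<Request>); // stand-in for the chain worker actor endpoint

#[derive(Debug)]
enum WorkerError {
    ChainActorRecvError { chain_id: ChainId },
}

struct WorkerState {
    chain_workers: Arc<Mutex<HashMap<ChainId, Endpoint>>>,
}

impl WorkerState {
    /// Instead of `expect`-ing that the actor is alive (which only kills the RPC
    /// thread, not the process), fail the query and drop the stale endpoint so the
    /// next request can start a fresh actor.
    fn send_request(&self, chain_id: ChainId, request: Request) -> Result<(), WorkerError> {
        let chain_workers = self.chain_workers.lock().unwrap();
        let Some(endpoint) = chain_workers.get(&chain_id) else {
            // A real implementation would start a new chain worker actor here.
            return Ok(());
        };
        if endpoint.0.send(request).is_err() {
            // The actor endpoint was dropped. Clear it from the cache.
            drop(chain_workers); // release the map lock before re-locking in the helper
            self.clear_chain_worker_endpoint(chain_id);
            return Err(WorkerError::ChainActorRecvError { chain_id });
        }
        Ok(())
    }

    /// Clear any endpoint to the chain worker actor from the cache for the given chain.
    fn clear_chain_worker_endpoint(&self, chain_id: ChainId) {
        self.chain_workers.lock().unwrap().remove(&chain_id);
    }
}
```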
## Motivation

With the exception of a GraphQL query, the chain state view should only be accessed by the corresponding chain worker. GraphQL is why the worker can return a shared chain state view behind a lock, but we are using that in many other places, too. After #4769, this is even more dangerous, as an actor could be dropped and restarted while shared views are still being used.

## Proposal

Replace several `chain_state_view()` calls with chain info or new chain worker requests. Also, extend a comment. (See #4793 (comment).)

## Test Plan

CI

## Release Plan

- These changes should be backported to `testnet_conway`, then
- be released in a new SDK.

## Links

- [reviewer checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist)

…4797) (#4798) Backport of #4797, with the same motivation, proposal, and test plan as above. Release Plan: these changes should be released in a new SDK. Links: PR to main: #4797; [reviewer checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist).