Conversation

@ma2bd (Contributor) commented Oct 9, 2025

Motivation

Crashing the RPC thread was a bad idea because it doesn't crash the process itself. Let's try another approach.

Proposal

  • Fail the RPC query
  • Remove the dropped actor from the actor cache

Test Plan

CI + TBD?

Release Plan

  • These changes should be backported to the latest `devnet` branch, then
    • be released in a new SDK.

@ma2bd ma2bd requested review from Twey and afck October 9, 2025 16:44
@ma2bd ma2bd marked this pull request as draft October 9, 2025 16:47
.expect("`ChainWorkerActor` stopped executing unexpectedly");
if let Err(e) = chain_actor.send((request_builder(callback), tracing::Span::current())) {
    // The actor endpoint was dropped. Let's clear it from the cache.
    self.clear_chain_worker_endpoint(chain_id);
Contributor

It may be possible at this point that some future somewhere still holds a reference to the shared chain state view.

This may not be a blocker for now, but we should make sure soon that the shared views are only used for GraphQL.

Contributor Author

I see. Good point.

}

/// Clear any endpoint to a [`ChainWorkerActor`] from the cache for the given chain.
#[instrument(level = "trace", target = "telemetry_only", skip(self), fields(
Contributor

Why should this be telemetry_only?

I think the default configuration for `instrument` should just be `level = "debug"`. Remember that these spans don't cause new log lines unless span events are explicitly enabled with `RUST_LOG_SPAN_EVENTS`, but they add potentially helpful debug information to events that occur within the span.
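
For illustration, this is the kind of annotation being suggested (a minimal sketch with a made-up free function, not the actual method in this PR):

```rust
use tracing::{debug, instrument};

// Sketch only: a plain DEBUG-level span. By itself it produces no extra log
// lines unless span events are enabled (e.g. RUST_LOG_SPAN_EVENTS=enter,exit),
// but any event emitted inside the function is attached to this span and
// carries its fields (here, the captured `chain_id` argument).
#[instrument(level = "debug")]
fn clear_chain_worker_endpoint(chain_id: u64) {
    debug!("clearing cached chain worker endpoint");
}
```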

Contributor

As @ndr-ds explained to me:

> Basically if you have telemetry_only it just means that you're gonna send the traces to Tempo for sure. Even if you're below the log level. But if you're in or above the log level, it will still also print to the screen

So we can have both `debug` and `telemetry_only`.

Contributor

Yes, sorry, I need to change the name of this to telemetry or something like that, the _only was supposed to have been removed and I missed it 🤦🏻‍♂️

Contributor

Gotcha :) Shouldn't we just always send all TRACEs (and above) to Tempo though?

Contributor

There were some of these instrument annotations with TRACE, for example. Since we run with log level INFO by default, if I sent spans to Tempo just based on the log level, it wouldn't send those. And we technically want all the spans 😅 so we have as much information as possible.
So I needed a way to dissociate the log level from whether a span gets sent to Tempo, and that's why I did the target thing and added telemetry_only (soon to be tempo maybe) everywhere. This guarantees that even if we don't match the log level, we still send the spans to Tempo.
Hopefully the explanation makes sense 😅
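
As a rough sketch of that mechanism (assuming `tracing-subscriber`; the `fmt` layer below is only a stand-in for whatever layer actually exports spans to Tempo), a per-layer filter can let spans with the `telemetry_only` target through regardless of level, while everything else still has to meet the normal `INFO` threshold:

```rust
use tracing::Level;
use tracing_subscriber::{filter, prelude::*};

fn init_tracing() {
    // Stand-in for the exporter layer that ships spans to Tempo.
    let telemetry_layer = tracing_subscriber::fmt::layer();

    // Spans/events with target "telemetry_only" always pass; anything else
    // only passes at INFO or less verbose.
    let telemetry_filter = filter::filter_fn(|metadata| {
        metadata.target() == "telemetry_only" || *metadata.level() <= Level::INFO
    });

    tracing_subscriber::registry()
        .with(telemetry_layer.with_filter(telemetry_filter))
        .init();
}
```

An opt-out marker target (the "don't send to Tempo" annotation discussed below) would just flip the first condition.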

Contributor

OTOH I could just always send to Tempo regardless of level 😅 but I wanted a way for people to opt out of sending to Tempo, in case some spans are just too noisy (for example, microsecond-long spans that are very frequent and pollute the Tempo visualization).

Contributor

I think I'd rather have a ‘don't send to Tempo’ annotation than a ‘do send to Tempo’ annotation 🤔

Contributor
@Twey Twey Oct 9, 2025

I did start the convention of instrument(level = "trace"), but on reflection I regret it: I think they should basically always be DEBUG. The only reason to make them TRACE is so that they get compiled out when we use tracing/release_max_level_debug, but after your experimentation it seems that there's no point doing that, and in exchange we lose potentially useful context from our DEBUG logs.

Contributor

Here's my attempt at making this less confusing: #4771

self.clear_chain_worker_endpoint(chain_id);
Err(WorkerError::ChainActorRecvError {
    chain_id,
    error: e.to_string(),
Contributor
@Twey Twey Oct 9, 2025

Please never convert errors to strings as it loses the entire backtrace, and then debugging gets hard :) If you need a dynamic error, which I don't think you do here, you can use `Box<dyn std::error::Error + Send>`.
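
A minimal sketch of that suggestion (illustrative names; assuming a `thiserror`-style error enum, which the real `WorkerError` may or may not be):

```rust
use thiserror::Error;

type ChainId = u64; // stand-in for the real `ChainId`

#[derive(Debug, Error)]
enum WorkerError {
    #[error("the chain actor for chain {chain_id} is no longer reachable")]
    ChainActorRecvError {
        chain_id: ChainId,
        // Keeping the boxed source error preserves the `source()` chain for
        // debugging, unlike flattening it with `e.to_string()`.
        #[source]
        error: Box<dyn std::error::Error + Send + Sync>,
    },
}

// The construction site would then box the concrete error instead of stringifying it:
fn recv_error(chain_id: ChainId, e: impl std::error::Error + Send + Sync + 'static) -> WorkerError {
    WorkerError::ChainActorRecvError {
        chain_id,
        error: Box::new(e),
    }
}
```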

Contributor Author

I had an issue with a generic type. Let me try again.

Contributor

I wouldn't recommend making it generic — I think you can either use the concrete type from oneshot or else use a trait object.

let (response, receiver) = {
    let mut chain_workers = self.chain_workers.lock().unwrap();

    let (sender, new_receiver) = if let Some(endpoint) = chain_workers.remove(&chain_id) {
Contributor

The problem is that this will be the case both when the chain worker has crashed (and needs to be restarted) and when someone else is using the chain worker to process a query, in which case we must not start a new one.

Maybe rather than an `Option` this should be an `async::Mutex<Option>`, and we treat `None` the same as not present (i.e. start a new one)?
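
Roughly what that could look like (a sketch with illustrative names, not the actual `chain_workers` type): the outer map lock is only held long enough to find or create a per-chain slot, the slot's own async lock is held while the endpoint is in use, and `None` inside the slot unambiguously means the previous actor was dropped:

```rust
use std::{collections::HashMap, sync::Arc};

type ChainId = u64; // stand-in for the real `ChainId`

struct ChainWorkers<Endpoint> {
    slots: std::sync::Mutex<HashMap<ChainId, Arc<tokio::sync::Mutex<Option<Endpoint>>>>>,
}

impl<Endpoint> ChainWorkers<Endpoint> {
    /// Find or create the slot for `chain_id`. The caller then locks the slot
    /// (an async lock, so it can be held across `.await`) while it uses the
    /// endpoint, which prevents a concurrent request from mistaking
    /// "temporarily in use" for "crashed" and spawning a second actor.
    fn slot(&self, chain_id: ChainId) -> Arc<tokio::sync::Mutex<Option<Endpoint>>> {
        self.slots
            .lock()
            .unwrap()
            .entry(chain_id)
            .or_default()
            .clone()
    }
}
```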

Contributor Author

Doesn't the `self.chain_workers.lock()` around this line ensure that we atomically use & test the endpoint?

Contributor
@Twey Twey Oct 9, 2025

Yes, the test+use here is fine, but then we drop (free) the lock at the end of the block, and another request can come in and find the endpoint missing (and decide to spawn another one). This task then won't see that until it goes to put the endpoint back and finds there's already one there (and/or gets its storage corrupted…)

Contributor

Don't we always lock the entire map? Then what you're describing doesn't seem possible 🤔 but maybe I'm missing something

@ma2bd ma2bd marked this pull request as ready for review October 9, 2025 17:56
@ma2bd ma2bd changed the title from "handle dropped actors without crashing the RPC thread" to "Handle dropped actors without crashing the RPC thread" Oct 9, 2025
@ma2bd ma2bd added this pull request to the merge queue Oct 9, 2025
Merged via the queue into linera-io:main with commit 264b7ec Oct 9, 2025
31 checks passed
@ma2bd ma2bd deleted the dropped_actors branch October 9, 2025 19:21
ma2bd added a commit to ma2bd/linera-protocol that referenced this pull request Oct 9, 2025
## Motivation

Crashing the RPC thread was a bad idea because it doesn't crash the
process itself. Let's try another approach

## Proposal

* Fail the RPC query
* Remove the dropped actor from the actor cache

## Test Plan

CI + TBD?

## Release Plan

- These changes should be backported to the latest `devnet` branch, then
    - be released in a new SDK,
ma2bd added a commit to ma2bd/linera-protocol that referenced this pull request Oct 9, 2025
Crashing the RPC thread was a bad idea because it doesn't crash the
process itself. Let's try another approach

* Fail the RPC query
* Remove the dropped actor from the actor cache

CI
ma2bd added a commit that referenced this pull request Oct 9, 2025
## Motivation

Avoid crashing the RPC thread (but apparently not the process) when an
actor crashes.

## Proposal

backport #4769

## Test Plan

CI
github-merge-queue bot pushed a commit that referenced this pull request Oct 14, 2025
## Motivation

With the exception of a GraphQL query, the chain state view should only
be accessed by the corresponding chain worker. GraphQL is why the worker
can return a shared chain state view behind a lock, but we are using
that in many other places, too. After
#4769, this is even
more dangerous, as an actor could be dropped and restarted while shared
views are still being used.

## Proposal

Replace several `chain_state_view()` calls with chain info or new chain
worker requests.

Also, extend a comment. (See
#4793 (comment).)

## Test Plan

CI

## Release Plan

- These changes should be backported to `testnet_conway`, then
    - be released in a new SDK.

## Links

- [reviewer
checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist)
afck added a commit that referenced this pull request Oct 14, 2025
…4797) (#4798)

Backport of #4797.

## Motivation

With the exception of a GraphQL query, the chain state view should only
be accessed by the corresponding chain worker. GraphQL is why the worker
can return a shared chain state view behind a lock, but we are using
that in many other places, too. After
#4769, this is even
more dangerous, as an actor could be dropped and restarted while shared
views are still being used.

## Proposal

Replace several `chain_state_view()` calls with chain info or new chain
worker requests.

Also, extend a comment. (See
#4793 (comment).)

## Test Plan

CI

## Release Plan

- These changes should be released in a new SDK.

## Links

- PR to main: #4797
- [reviewer
checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist)