Conversation

@ma2bd (Contributor) commented Oct 9, 2025

Motivation

Crashing the RPC thread was a bad idea because it doesn't crash the process itself. Let's try another approach.

Proposal

  • Fail the RPC query
  • Remove the dropped actor from the actor cache

Test Plan

CI + TBD?

Release Plan

  • These changes should be backported to the latest `devnet` branch, then
    • be released in a new SDK.

@ma2bd ma2bd requested review from Twey and afck October 9, 2025 16:44
@ma2bd ma2bd marked this pull request as draft October 9, 2025 16:47
.expect("`ChainWorkerActor` stopped executing unexpectedly");
if let Err(e) = chain_actor.send((request_builder(callback), tracing::Span::current())) {
    // The actor endpoint was dropped. Let's clear it from the cache.
    self.clear_chain_worker_endpoint(chain_id);
Contributor

It may be possible at this point that some future somewhere still holds a reference to the shared chain state view.

This may not be a blocker for now, but we should make sure soon that the shared views are only used for GraphQL.

Contributor Author

I see. Good point.

}

/// Clear any endpoint to a [`ChainWorkerActor`] from the cache for the given chain.
#[instrument(level = "trace", target = "telemetry_only", skip(self), fields(
Contributor

Why should this be telemetry_only?

I think the default configuration for `instrument` should just be `level = "debug"`. Remember that these spans don't cause new log lines unless span events are explicitly enabled with `RUST_LOG_SPAN_EVENTS`, but they add potentially helpful debug information to events that occur within the span.
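
For illustration, this is the kind of annotation being suggested (a minimal sketch with a made-up free function, not the actual method in this PR):

```rust
use tracing::{debug, instrument};

// Sketch only: a plain DEBUG-level span. By itself it produces no extra log
// lines unless span events are enabled (e.g. RUST_LOG_SPAN_EVENTS=enter,exit),
// but any event emitted inside the function is attached to this span and
// carries its fields (here, the captured `chain_id` argument).
#[instrument(level = "debug")]
fn clear_chain_worker_endpoint(chain_id: u64) {
    debug!("clearing cached chain worker endpoint");
}
```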

Contributor

As @ndr-ds explained to me:

> Basically if you have telemetry_only it just means that you're gonna send the traces to Tempo for sure. Even if you're below the log level. But if you're in or above the log level, it will still also print to the screen

So we can have both `debug` and `telemetry_only`.

Contributor

Yes, sorry, I need to change the name of this to telemetry or something like that, the _only was supposed to have been removed and I missed it 🤦🏻‍♂️

Contributor

Gotcha :) Shouldn't we just always send all TRACEs (and above) to Tempo though?

Contributor

There were some of these instrument annotations with TRACE, for example. Since we run with log level INFO by default, if I sent spans to Tempo just based on the log level, it wouldn't send those. And we technically want all the spans 😅 so we have as much information as possible.
So I needed a way to dissociate the log level from whether a span gets sent to Tempo, and that's why I did the target thing and added telemetry_only (soon to be tempo maybe) everywhere. This guarantees that even if we don't match the log level, we still send the spans to Tempo.
Hopefully the explanation makes sense 😅
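
As a rough sketch of that mechanism (assuming `tracing-subscriber`; the `fmt` layer below is only a stand-in for whatever layer actually exports spans to Tempo), a per-layer filter can let spans with the `telemetry_only` target through regardless of level, while everything else still has to meet the normal `INFO` threshold:

```rust
use tracing::Level;
use tracing_subscriber::{filter, prelude::*};

fn init_tracing() {
    // Stand-in for the exporter layer that ships spans to Tempo.
    let telemetry_layer = tracing_subscriber::fmt::layer();

    // Spans/events with target "telemetry_only" always pass; anything else
    // only passes at INFO or less verbose.
    let telemetry_filter = filter::filter_fn(|metadata| {
        metadata.target() == "telemetry_only" || *metadata.level() <= Level::INFO
    });

    tracing_subscriber::registry()
        .with(telemetry_layer.with_filter(telemetry_filter))
        .init();
}
```

An opt-out marker target (the "don't send to Tempo" annotation discussed below) would just flip the first condition.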

Contributor

OTOH I could just always send to Tempo regardless of level 😅 but I wanted a way for people to opt out of sending to Tempo, in case some spans are just too noisy (for example, microsecond-long spans that are very frequent and pollute the Tempo visualization).

Contributor

I think I'd rather have a ‘don't send to Tempo’ annotation than a ‘do send to Tempo’ annotation 🤔

Contributor
@Twey Twey Oct 9, 2025

I did start the convention of instrument(level = "trace"), but on reflection I regret it: I think they should basically always be DEBUG. The only reason to make them TRACE is so that they get compiled out when we use tracing/release_max_level_debug, but after your experimentation it seems that there's no point doing that, and in exchange we lose potentially useful context from our DEBUG logs.

Contributor

Here's my attempt at making this less confusing: #4771

self.clear_chain_worker_endpoint(chain_id);
Err(WorkerError::ChainActorRecvError {
    chain_id,
    error: e.to_string(),
Contributor
@Twey Twey Oct 9, 2025

Please never convert errors to strings as it loses the entire backtrace, and then debugging gets hard :) If you need a dynamic error, which I don't think you do here, you can use `Box<dyn std::error::Error + Send>`.
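
A minimal sketch of that suggestion (illustrative names; assuming a `thiserror`-style error enum, which the real `WorkerError` may or may not be):

```rust
use thiserror::Error;

type ChainId = u64; // stand-in for the real `ChainId`

#[derive(Debug, Error)]
enum WorkerError {
    #[error("the chain actor for chain {chain_id} is no longer reachable")]
    ChainActorRecvError {
        chain_id: ChainId,
        // Keeping the boxed source error preserves the `source()` chain for
        // debugging, unlike flattening it with `e.to_string()`.
        #[source]
        error: Box<dyn std::error::Error + Send + Sync>,
    },
}

// The construction site would then box the concrete error instead of stringifying it:
fn recv_error(chain_id: ChainId, e: impl std::error::Error + Send + Sync + 'static) -> WorkerError {
    WorkerError::ChainActorRecvError {
        chain_id,
        error: Box::new(e),
    }
}
```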

Contributor Author

I had an issue with a generic type. Let me try again.

Contributor

I wouldn't recommend making it generic — I think you can either use the concrete type from oneshot or else use a trait object.

let (response, receiver) = {
    let mut chain_workers = self.chain_workers.lock().unwrap();

    let (sender, new_receiver) = if let Some(endpoint) = chain_workers.remove(&chain_id) {
Contributor

The problem is that this will be the case both when the chain worker has crashed (and needs to be restarted) and when someone else is using the chain worker to process a query, in which case we must not start a new one.

Maybe rather than an `Option` this should be an `async::Mutex<Option>`, and we treat `None` the same as not present (i.e. start a new one)?
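
Roughly what that could look like (a sketch with illustrative names, not the actual `chain_workers` type): the outer map lock is only held long enough to find or create a per-chain slot, the slot's own async lock is held while the endpoint is in use, and `None` inside the slot unambiguously means the previous actor was dropped:

```rust
use std::{collections::HashMap, sync::Arc};

type ChainId = u64; // stand-in for the real `ChainId`

struct ChainWorkers<Endpoint> {
    slots: std::sync::Mutex<HashMap<ChainId, Arc<tokio::sync::Mutex<Option<Endpoint>>>>>,
}

impl<Endpoint> ChainWorkers<Endpoint> {
    /// Find or create the slot for `chain_id`. The caller then locks the slot
    /// (an async lock, so it can be held across `.await`) while it uses the
    /// endpoint, which prevents a concurrent request from mistaking
    /// "temporarily in use" for "crashed" and spawning a second actor.
    fn slot(&self, chain_id: ChainId) -> Arc<tokio::sync::Mutex<Option<Endpoint>>> {
        self.slots
            .lock()
            .unwrap()
            .entry(chain_id)
            .or_default()
            .clone()
    }
}
```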

Contributor Author

Doesn't the `self.chain_workers.lock()` around this line ensure that we atomically use & test the endpoint?

Contributor
@Twey Twey Oct 9, 2025

Yes, the test+use here is fine, but then we drop (free) the lock at the end of the block, and another request can come in and find the endpoint missing (and decide to spawn another one). This task then won't see that until it goes to put the endpoint back and finds there's already one there (and/or gets its storage corrupted…)

Contributor

Don't we always lock the entire map? Then what you're describing doesn't seem possible 🤔 but maybe I'm missing something

@ma2bd ma2bd marked this pull request as ready for review October 9, 2025 17:56
@ma2bd ma2bd changed the title from "handle dropped actors without crashing the RPC thread" to "Handle dropped actors without crashing the RPC thread" Oct 9, 2025
@ma2bd ma2bd added this pull request to the merge queue Oct 9, 2025
Merged via the queue into linera-io:main with commit 264b7ec Oct 9, 2025
31 checks passed
@ma2bd ma2bd deleted the dropped_actors branch October 9, 2025 19:21
ma2bd added a commit to ma2bd/linera-protocol that referenced this pull request Oct 9, 2025
## Motivation

Crashing the RPC thread was a bad idea because it doesn't crash the
process itself. Let's try another approach

## Proposal

* Fail the RPC query
* Remove the dropped actor from the actor cache

## Test Plan

CI + TBD?

## Release Plan

- These changes should be backported to the latest `devnet` branch, then
    - be released in a new SDK,
ma2bd added a commit to ma2bd/linera-protocol that referenced this pull request Oct 9, 2025
Crashing the RPC thread was a bad idea because it doesn't crash the
process itself. Let's try another approach

* Fail the RPC query
* Remove the dropped actor from the actor cache

CI
ma2bd added a commit that referenced this pull request Oct 9, 2025
## Motivation

Avoid crashing the RPC thread (but apparently not the process) when an
actor crashes.

## Proposal

backport #4769

## Test Plan

CI
github-merge-queue bot pushed a commit that referenced this pull request Oct 14, 2025
## Motivation

With the exception of a GraphQL query, the chain state view should only
be accessed by the corresponding chain worker. GraphQL is why the worker
can return a shared chain state view behind a lock, but we are using
that in many other places, too. After
#4769, this is even
more dangerous, as an actor could be dropped and restarted while shared
views are still being used.

## Proposal

Replace several `chain_state_view()` calls with chain info or new chain
worker requests.

Also, extend a comment. (See
#4793 (comment).)

## Test Plan

CI

## Release Plan

- These changes should be backported to `testnet_conway`, then
    - be released in a new SDK.

## Links

- [reviewer
checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist)
afck added a commit that referenced this pull request Oct 14, 2025
…4797) (#4798)

Backport of #4797.

## Motivation

With the exception of a GraphQL query, the chain state view should only
be accessed by the corresponding chain worker. GraphQL is why the worker
can return a shared chain state view behind a lock, but we are using
that in many other places, too. After
#4769, this is even
more dangerous, as an actor could be dropped and restarted while shared
views are still being used.

## Proposal

Replace several `chain_state_view()` calls with chain info or new chain
worker requests.

Also, extend a comment. (See
#4793 (comment).)

## Test Plan

CI

## Release Plan

- These changes should be released in a new SDK.

## Links

- PR to main: #4797
- [reviewer
checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist)