Sqlite-backed DOs: Allow calling deleteAll() inside an alarm handler #3018

jclee · 2024-10-28T16:39:39Z

And also fix a potential issue with alarm exception propagation.

Should address #2993

SRS actors: Ensure that deleteAll() always has a commit scheduled during reset

...particularly, in the case where an alarm is set. Should fix errors seen when calling deleteAll() during an alarm handler.
SRS actors: Don't attempt alarm deletion at end of handler if the output gate is already broken

May fix cases where a broken output gate could cause an exception to propagate beyond the regular alarm output gate handling.

jclee · 2024-10-28T16:41:34Z

src/workerd/io/actor-sqlite.c++

+  // We want to ensure a commit is scheduled during the deleteAll() call, but if we end up doing a
+  // write to restore alarm state afterward, we can skip the original scheduled commit, since the
+  // write will queue its own commit.
+  kj::Maybe<kj::Own<bool>> maybeCanSkipDeleteAllCommit;


I'm not sure if cancelling the scheduled commit is strictly necessary, or if there isn't a simpler way to do it that preserves the desired invariants in all cases (exceptions, cancellation, etc.). But without the cancellation, I noticed via the tests that there was otherwise a redundant commit call.

jclee · 2024-10-28T16:58:11Z

src/workerd/io/actor-sqlite.c++

+  // write will queue its own commit.
+  kj::Maybe<kj::Own<bool>> maybeCanSkipDeleteAllCommit;
+
+  if (!deleteAllCommitScheduled) {


In case it isn't apparent: the main fix here is removal of the localAlarmState == kj::none check, to ensure deleteAllCommitScheduled is always true by the time we reach db.clear(), which in turn ensures that isCommitScheduled() is true, since it's used to check an invariant in a storage relay callback.

The localAlarmState == kj::none check was originally put in place to avoid having a redundant commit, since the write to restore the alarmState also adds a commit. But of course there's no way for the storage relay callback to know via the current isCommitScheduled() implementation that we're planning to add the commit after calling db.clear() (and maybe it isn't even correct to rely solely on committing after db.clear(), if it's important that the output lock is synchronously started before calling db.clear()?).

Hmm, I think you're overestimating the harm of a "redundant commit". StorageRelayClient::commit() already detects if there's nothing to commit (if (!needCommit at the beginning), so it's not that big a deal to schedule an extra one here.

And deleteAll() is not commonly called so I don't think it's worth optimizing away some promises.

Let's instead simplify this code to just always schedule (and apply) the deleteAll commit.

Sure, makes sense... I'll try to simplify it.

Hmm... Thinking about it a bit more while responding to Justin's comment... Is there a potential problem if we queue a commit task for the db.reset() and another (via onWrite() -> commitImpl()) for the alarm state reset? If the first commit task succeeds but the second somehow fails, wouldn't that leave the db alarm state empty, when the current expectation may be that the committed alarm state remains in place regardless of success or failure of the deleteAll() operation?

Or is it perhaps OK to assume the alarm state reset isn't crucial, if deleteAll() is being called?

Or, I suppose, is some small window of alarm state loss unavoidable, due to the nature of deleting and reinstating the sqlite database?

I guess what you're getting at is that the reset commit task doesn't actually do the same set of work as a regular write commit -- it doesn't actually commit a transaction and whatnot. Yeah, I agree it would be a problem if we committed the reset without actually committing the transaction to restore the alarm time.

So perhaps the reset commit task needs to actually skip calling commitCallback in the case that it knows that a further commit is coming for the metadata. Perhaps it can just be like:

commitTasks.add(outputGate.lockWhile(kj::evalLater([this]() mutable -> kj::Promise<void> { // Don't commit if shutdown() has been called. requireNotBroken(); deleteAllCommitScheduled = false; if (currentTxn.is<ImplicitTxn>()) { // An implicit transaction is already scheduled, so we'll count on it to perform a commit when it's // done. This is particularly important for the case where deleteAll() was called while an alarm // is outstanding; we'll immediately begin an implicit transaction that restores the alarm. We don't // want to commit the deletion without that transaction. return kj::READY_NOW; } else { return commitCallback(); } }))); deleteAllCommitScheduled = true;

Another, maybe cleaner approach would be to unify the two commit tasks into a single one which checks both currentTxn.is<ImplicitTxn>() and deleteAllCommitScheduled to decide what actual work needs to be performed before committing?

I was also curious... I assume it's OK/moot that we call requireNotBroken() prior to clearing deleteAllCommitScheduled, since once the output gate is broken, we expect all subsequent user operations to fail and continue failing, such that the return value of isCommitScheduled() no longer matters?

Yeah nothing matters once broken.

I tried combining the deleteAll() and onWrite() commit tasks, but it was turning out to be a more complicated change than your suggestion of just checking if (currentTxn.is<ImplicitTxn*>()) {...} in deleteAll(), so I went with that.

One thing I wasn't quite sure of, though, is the case involving explicit transactions:

user calls deleteAll() while alarm state is present.

deleteAll() adds its kj::evalLater() commit task to the event queue.

deleteAll() calls db.reset() to clear the database.

deleteAll() calls setAlarm() to restore the post-delete alarm state, which fires onWrite(), which sets up the implicit transaction and adds its kj::evalLater() commit task to the event queue to run immediately after.

before the event queue processes either commit task, it runs an event where user code starts an explicit transaction.

the explicit transaction constructor runs the implicit transaction's commit(), then updates currentTxn to point to the explicit transaction.

the user code maybe makes a few additional db changes within the explicit transaction.

When the deleteAll() commit task runs, it will see that currentTxn is no longer an implicit transaction, so it's going to just run commitCallback(). Is that OK?

I assume the commitCallback() would end up persisting the interim state of the explicit transaction. But presumably that's OK, because if the system goes down at that point, we recover from that by rolling back any explicit transaction savepoints we see in the persisted DB when reloading? (If not, I think a similar sequence of events is already possible just using implicit and explicit transactions together, without calling deleteAll().)

And in this case, we'd also be running a commitCallback() that effectively updates alarm state without also running the alarm rescheduling commit code, but maybe that's OK, since the very next event queue item should be the implicit transaction's commit task, which would run it?

justin-mp · 2024-10-28T21:13:59Z

src/workerd/io/actor-sqlite-test.c++

+  test.pollAndExpectCalls({"commit"})[0]->reject(KJ_EXCEPTION(FAILED, "a_rejected_commit"));
+  KJ_EXPECT_THROW_MESSAGE("a_rejected_commit", promise.wait(test.ws));
+  // Ensure taskFailed handler runs and notices brokenness:
+  test.ws.poll();


Up to and including this line, this test is the same as the test above it. Why does it mean the handler succeeds in this case, but fails in the previous one?

During normal alarm processing, if the alarm handler fails (i.e. the application throws an exception), cancelDeferredAlarmDeletion() is invoked.

Right... The intent behind the two tests was to exercise both the case where the caller calls cancelDeferredAlarmDeletion() prior to dropping the deleter and the case where it doesn't. Strictly speaking, only the behavior of the non-cancelled test changes in this commit; the cancelled test was already succeeding because it didn't attempt to reset the alarm after brokenness, but I don't think the interaction between gate brokenness and cancellation was otherwise being covered by other tests.

I can try to clarify the comments.

justin-mp · 2024-10-28T21:17:02Z

src/workerd/io/actor-sqlite-test.c++

-  auto commitFulfiller = kj::mv(test.pollAndExpectCalls({"commit"})[0]);
-  KJ_ASSERT(results.count.wait(test.ws) == 0);
-  KJ_ASSERT(expectSync(test.getAlarm()) == oneMs);
+    auto commitFulfiller = kj::mv(test.pollAndExpectCalls({"commit"})[0]);


For my own education, we only expect to see scheduleRun(1ms) when setAlarm is called, and not on subsequent resets? Presumably, that's because the scheduled run gets scheduled outside of this deletion?

Pretty much, yeah. The scheduleRun() hook is called by ActorSqlite as part of the committing logic -- implemented in startPrecommitAlarmScheduling() and commitImpl() -- to request a change to the alarm manager's scheduled alarm notification time. If ActorSqlite knows the currently scheduled alarm doesn't need to be updated, it doesn't call scheduleRun().

But thinking about this a bit more, I guess the point of confusion might be why we don't see a scheduleRun(kj::none) in the test corresponding to the db.reset() and a subsequent scheduleRun(1ms) corresponding to the alarm state reset? That's because we don't run the committing logic for the db.reset(), only for the state reset. And because we're restoring the original state, the committing logic sees no change, so scheduleRun() is not called.

I haven't yet implemented Kenton's suggestion of simplifying the deleteAll() logic by performing a redundant commit, but I'm pretty sure this won't change the situation, since the promise that deleteAll() queues calls commitCallback() directly, rather than through the alarm commit logic. And I'm pretty sure that's OK? But might need to think about it a bit more to be sure.

jclee · 2024-10-30T20:22:43Z

I applied Kenton's suggested deleteAll() logic simplification and rewrote the two "alarm handler cleanup" tests to be a single test, then squashed the fixup commits:

…ing reset ...particularly, in the case where an alarm is set. Should fix errors seen when calling deleteAll() during an alarm handler.

…vocations

…put gate is already broken May fix cases where a broken output gate could cause an exception to propagate beyond the regular alarm output gate handling.

jclee requested review from kentonv and justin-mp October 28, 2024 16:39

jclee requested review from a team as code owners October 28, 2024 16:39

jclee requested a review from anonrig October 28, 2024 16:39

jclee commented Oct 28, 2024

View reviewed changes

jclee force-pushed the jlee/actor-srs-deleteall-alarms branch from 2a5c8bf to a0e0bc0 Compare October 28, 2024 17:04

justin-mp reviewed Oct 28, 2024

View reviewed changes

jclee force-pushed the jlee/actor-srs-deleteall-alarms branch from 36bd8c0 to cca0a96 Compare October 30, 2024 20:19

jclee added 3 commits October 30, 2024 13:24

SRS actors: Ensure that deleteAll() always has a commit scheduled dur…

6c61da9

…ing reset ...particularly, in the case where an alarm is set. Should fix errors seen when calling deleteAll() during an alarm handler.

SRS actors: Add a little more test coverage for repeat deleteAll() in…

01e81ee

…vocations

SRS actors: Don't attempt alarm deletion at end of handler if the out…

25fa631

…put gate is already broken May fix cases where a broken output gate could cause an exception to propagate beyond the regular alarm output gate handling.

jclee force-pushed the jlee/actor-srs-deleteall-alarms branch from cca0a96 to 25fa631 Compare October 30, 2024 20:25

kentonv approved these changes Nov 1, 2024

View reviewed changes

jclee merged commit ad8580f into main Nov 1, 2024
13 checks passed

jclee deleted the jlee/actor-srs-deleteall-alarms branch November 1, 2024 17:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sqlite-backed DOs: Allow calling deleteAll() inside an alarm handler #3018

Sqlite-backed DOs: Allow calling deleteAll() inside an alarm handler #3018

jclee commented Oct 28, 2024

jclee Oct 28, 2024

jclee Oct 28, 2024

kentonv Oct 29, 2024

jclee Oct 29, 2024 •

edited

Loading

jclee Oct 29, 2024 •

edited

Loading

kentonv Oct 29, 2024

kentonv Oct 29, 2024

jclee Oct 29, 2024

kentonv Oct 30, 2024

jclee Oct 30, 2024 •

edited

Loading

justin-mp Oct 28, 2024

kentonv Oct 29, 2024

jclee Oct 29, 2024 •

edited

Loading

justin-mp Oct 28, 2024

jclee Oct 29, 2024

jclee commented Oct 30, 2024 •

edited

Loading

Sqlite-backed DOs: Allow calling deleteAll() inside an alarm handler #3018

Sqlite-backed DOs: Allow calling deleteAll() inside an alarm handler #3018

Conversation

jclee commented Oct 28, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jclee Oct 29, 2024 • edited Loading

Choose a reason for hiding this comment

jclee Oct 29, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jclee Oct 30, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jclee Oct 29, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jclee commented Oct 30, 2024 • edited Loading

jclee Oct 29, 2024 •

edited

Loading

jclee Oct 29, 2024 •

edited

Loading

jclee Oct 30, 2024 •

edited

Loading

jclee Oct 29, 2024 •

edited

Loading

jclee commented Oct 30, 2024 •

edited

Loading