-
Notifications
You must be signed in to change notification settings - Fork 301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sqlite-backed DOs: Allow calling deleteAll() inside an alarm handler #3018
Conversation
src/workerd/io/actor-sqlite.c++
Outdated
// We want to ensure a commit is scheduled during the deleteAll() call, but if we end up doing a | ||
// write to restore alarm state afterward, we can skip the original scheduled commit, since the | ||
// write will queue its own commit. | ||
kj::Maybe<kj::Own<bool>> maybeCanSkipDeleteAllCommit; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if cancelling the scheduled commit is strictly necessary, or if there isn't a simpler way to do it that preserves the desired invariants in all cases (exceptions, cancellation, etc.). But without the cancellation, I noticed via the tests that there was otherwise a redundant commit call.
// write will queue its own commit. | ||
kj::Maybe<kj::Own<bool>> maybeCanSkipDeleteAllCommit; | ||
|
||
if (!deleteAllCommitScheduled) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case it isn't apparent: the main fix here is removal of the localAlarmState == kj::none
check, to ensure deleteAllCommitScheduled
is always true by the time we reach db.clear()
, which in turn ensures that isCommitScheduled()
is true, since it's used to check an invariant in a storage relay callback.
The localAlarmState == kj::none
check was originally put in place to avoid having a redundant commit, since the write to restore the alarmState also adds a commit. But of course there's no way for the storage relay callback to know via the current isCommitScheduled()
implementation that we're planning to add the commit after calling db.clear()
(and maybe it isn't even correct to rely solely on committing after db.clear()
, if it's important that the output lock is synchronously started before calling db.clear()
?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think you're overestimating the harm of a "redundant commit". StorageRelayClient::commit()
already detects if there's nothing to commit (if (!needCommit
at the beginning), so it's not that big a deal to schedule an extra one here.
And deleteAll()
is not commonly called so I don't think it's worth optimizing away some promises.
Let's instead simplify this code to just always schedule (and apply) the deleteAll commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Sure, makes sense... I'll try to simplify it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... Thinking about it a bit more while responding to Justin's comment... Is there a potential problem if we queue a commit task for the db.reset()
and another (via onWrite()
-> commitImpl()
) for the alarm state reset? If the first commit task succeeds but the second somehow fails, wouldn't that leave the db alarm state empty, when the current expectation may be that the committed alarm state remains in place regardless of success or failure of the deleteAll()
operation?
Or is it perhaps OK to assume the alarm state reset isn't crucial, if deleteAll()
is being called?
Or, I suppose, is some small window of alarm state loss unavoidable, due to the nature of deleting and reinstating the sqlite database?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess what you're getting at is that the reset commit task doesn't actually do the same set of work as a regular write commit -- it doesn't actually commit a transaction and whatnot. Yeah, I agree it would be a problem if we committed the reset without actually committing the transaction to restore the alarm time.
So perhaps the reset commit task needs to actually skip calling commitCallback
in the case that it knows that a further commit is coming for the metadata. Perhaps it can just be like:
commitTasks.add(outputGate.lockWhile(kj::evalLater([this]() mutable -> kj::Promise<void> {
// Don't commit if shutdown() has been called.
requireNotBroken();
deleteAllCommitScheduled = false;
if (currentTxn.is<ImplicitTxn>()) {
// An implicit transaction is already scheduled, so we'll count on it to perform a commit when it's
// done. This is particularly important for the case where deleteAll() was called while an alarm
// is outstanding; we'll immediately begin an implicit transaction that restores the alarm. We don't
// want to commit the deletion without that transaction.
return kj::READY_NOW;
} else {
return commitCallback();
}
})));
deleteAllCommitScheduled = true;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another, maybe cleaner approach would be to unify the two commit tasks into a single one which checks both currentTxn.is<ImplicitTxn>()
and deleteAllCommitScheduled
to decide what actual work needs to be performed before committing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was also curious... I assume it's OK/moot that we call requireNotBroken()
prior to clearing deleteAllCommitScheduled
, since once the output gate is broken, we expect all subsequent user operations to fail and continue failing, such that the return value of isCommitScheduled()
no longer matters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah nothing matters once broken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried combining the deleteAll()
and onWrite()
commit tasks, but it was turning out to be a more complicated change than your suggestion of just checking if (currentTxn.is<ImplicitTxn*>()) {...}
in deleteAll()
, so I went with that.
One thing I wasn't quite sure of, though, is the case involving explicit transactions:
- user calls
deleteAll()
while alarm state is present. deleteAll()
adds itskj::evalLater()
commit task to the event queue.deleteAll()
callsdb.reset()
to clear the database.deleteAll()
callssetAlarm()
to restore the post-delete alarm state, which firesonWrite()
, which sets up the implicit transaction and adds itskj::evalLater()
commit task to the event queue to run immediately after.- before the event queue processes either commit task, it runs an event where user code starts an explicit transaction.
- the explicit transaction constructor runs the implicit transaction's commit(), then updates
currentTxn
to point to the explicit transaction. - the user code maybe makes a few additional db changes within the explicit transaction.
When the deleteAll()
commit task runs, it will see that currentTxn
is no longer an implicit transaction, so it's going to just run commitCallback()
. Is that OK?
I assume the commitCallback()
would end up persisting the interim state of the explicit transaction. But presumably that's OK, because if the system goes down at that point, we recover from that by rolling back any explicit transaction savepoints we see in the persisted DB when reloading? (If not, I think a similar sequence of events is already possible just using implicit and explicit transactions together, without calling deleteAll()
.)
And in this case, we'd also be running a commitCallback()
that effectively updates alarm state without also running the alarm rescheduling commit code, but maybe that's OK, since the very next event queue item should be the implicit transaction's commit task, which would run it?
2a5c8bf
to
a0e0bc0
Compare
src/workerd/io/actor-sqlite-test.c++
Outdated
test.pollAndExpectCalls({"commit"})[0]->reject(KJ_EXCEPTION(FAILED, "a_rejected_commit")); | ||
KJ_EXPECT_THROW_MESSAGE("a_rejected_commit", promise.wait(test.ws)); | ||
// Ensure taskFailed handler runs and notices brokenness: | ||
test.ws.poll(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Up to and including this line, this test is the same as the test above it. Why does it mean the handler succeeds in this case, but fails in the previous one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
During normal alarm processing, if the alarm handler fails (i.e. the application throws an exception), cancelDeferredAlarmDeletion()
is invoked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right... The intent behind the two tests was to exercise both the case where the caller calls cancelDeferredAlarmDeletion()
prior to dropping the deleter and the case where it doesn't. Strictly speaking, only the behavior of the non-cancelled test changes in this commit; the cancelled test was already succeeding because it didn't attempt to reset the alarm after brokenness, but I don't think the interaction between gate brokenness and cancellation was otherwise being covered by other tests.
- I can try to clarify the comments.
auto commitFulfiller = kj::mv(test.pollAndExpectCalls({"commit"})[0]); | ||
KJ_ASSERT(results.count.wait(test.ws) == 0); | ||
KJ_ASSERT(expectSync(test.getAlarm()) == oneMs); | ||
auto commitFulfiller = kj::mv(test.pollAndExpectCalls({"commit"})[0]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my own education, we only expect to see scheduleRun(1ms)
when setAlarm is called, and not on subsequent resets? Presumably, that's because the scheduled run gets scheduled outside of this deletion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty much, yeah. The scheduleRun()
hook is called by ActorSqlite as part of the committing logic -- implemented in startPrecommitAlarmScheduling()
and commitImpl()
-- to request a change to the alarm manager's scheduled alarm notification time. If ActorSqlite knows the currently scheduled alarm doesn't need to be updated, it doesn't call scheduleRun()
.
But thinking about this a bit more, I guess the point of confusion might be why we don't see a scheduleRun(kj::none)
in the test corresponding to the db.reset()
and a subsequent scheduleRun(1ms)
corresponding to the alarm state reset? That's because we don't run the committing logic for the db.reset()
, only for the state reset. And because we're restoring the original state, the committing logic sees no change, so scheduleRun()
is not called.
I haven't yet implemented Kenton's suggestion of simplifying the deleteAll() logic by performing a redundant commit, but I'm pretty sure this won't change the situation, since the promise that deleteAll() queues calls commitCallback()
directly, rather than through the alarm commit logic. And I'm pretty sure that's OK? But might need to think about it a bit more to be sure.
36bd8c0
to
cca0a96
Compare
…ing reset ...particularly, in the case where an alarm is set. Should fix errors seen when calling deleteAll() during an alarm handler.
…put gate is already broken May fix cases where a broken output gate could cause an exception to propagate beyond the regular alarm output gate handling.
cca0a96
to
25fa631
Compare
And also fix a potential issue with alarm exception propagation.
Should address #2993
SRS actors: Ensure that deleteAll() always has a commit scheduled during reset
...particularly, in the case where an alarm is set. Should fix errors seen when calling deleteAll() during an alarm handler.
SRS actors: Don't attempt alarm deletion at end of handler if the output gate is already broken
May fix cases where a broken output gate could cause an exception to propagate beyond the regular alarm output gate handling.