Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sqlite-backed DOs: Allow calling deleteAll() inside an alarm handler #3018

Merged
merged 3 commits into from
Nov 1, 2024

Conversation

jclee
Copy link
Contributor

@jclee jclee commented Oct 28, 2024

And also fix a potential issue with alarm exception propagation.

Should address #2993


  • SRS actors: Ensure that deleteAll() always has a commit scheduled during reset

    ...particularly, in the case where an alarm is set. Should fix errors seen when calling deleteAll() during an alarm handler.

  • SRS actors: Don't attempt alarm deletion at end of handler if the output gate is already broken

    May fix cases where a broken output gate could cause an exception to propagate beyond the regular alarm output gate handling.

@jclee jclee requested review from a team as code owners October 28, 2024 16:39
@jclee jclee requested a review from anonrig October 28, 2024 16:39
// We want to ensure a commit is scheduled during the deleteAll() call, but if we end up doing a
// write to restore alarm state afterward, we can skip the original scheduled commit, since the
// write will queue its own commit.
kj::Maybe<kj::Own<bool>> maybeCanSkipDeleteAllCommit;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if cancelling the scheduled commit is strictly necessary, or if there isn't a simpler way to do it that preserves the desired invariants in all cases (exceptions, cancellation, etc.). But without the cancellation, I noticed via the tests that there was otherwise a redundant commit call.

// write will queue its own commit.
kj::Maybe<kj::Own<bool>> maybeCanSkipDeleteAllCommit;

if (!deleteAllCommitScheduled) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case it isn't apparent: the main fix here is removal of the localAlarmState == kj::none check, to ensure deleteAllCommitScheduled is always true by the time we reach db.clear(), which in turn ensures that isCommitScheduled() is true, since it's used to check an invariant in a storage relay callback.

The localAlarmState == kj::none check was originally put in place to avoid having a redundant commit, since the write to restore the alarmState also adds a commit. But of course there's no way for the storage relay callback to know via the current isCommitScheduled() implementation that we're planning to add the commit after calling db.clear() (and maybe it isn't even correct to rely solely on committing after db.clear(), if it's important that the output lock is synchronously started before calling db.clear()?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think you're overestimating the harm of a "redundant commit". StorageRelayClient::commit() already detects if there's nothing to commit (if (!needCommit at the beginning), so it's not that big a deal to schedule an extra one here.

And deleteAll() is not commonly called so I don't think it's worth optimizing away some promises.

Let's instead simplify this code to just always schedule (and apply) the deleteAll commit.

Copy link
Contributor Author

@jclee jclee Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Sure, makes sense... I'll try to simplify it.

Copy link
Contributor Author

@jclee jclee Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... Thinking about it a bit more while responding to Justin's comment... Is there a potential problem if we queue a commit task for the db.reset() and another (via onWrite() -> commitImpl()) for the alarm state reset? If the first commit task succeeds but the second somehow fails, wouldn't that leave the db alarm state empty, when the current expectation may be that the committed alarm state remains in place regardless of success or failure of the deleteAll() operation?

Or is it perhaps OK to assume the alarm state reset isn't crucial, if deleteAll() is being called?

Or, I suppose, is some small window of alarm state loss unavoidable, due to the nature of deleting and reinstating the sqlite database?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess what you're getting at is that the reset commit task doesn't actually do the same set of work as a regular write commit -- it doesn't actually commit a transaction and whatnot. Yeah, I agree it would be a problem if we committed the reset without actually committing the transaction to restore the alarm time.

So perhaps the reset commit task needs to actually skip calling commitCallback in the case that it knows that a further commit is coming for the metadata. Perhaps it can just be like:

    commitTasks.add(outputGate.lockWhile(kj::evalLater([this]() mutable -> kj::Promise<void> {
      // Don't commit if shutdown() has been called.
      requireNotBroken();

      deleteAllCommitScheduled = false;
      if (currentTxn.is<ImplicitTxn>()) {
        // An implicit transaction is already scheduled, so we'll count on it to perform a commit when it's
        // done. This is particularly important for the case where deleteAll() was called while an alarm
        // is outstanding; we'll immediately begin an implicit transaction that restores the alarm. We don't
        // want to commit the deletion without that transaction.
        return kj::READY_NOW;
      } else {
        return commitCallback();
      }
    })));
    deleteAllCommitScheduled = true;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another, maybe cleaner approach would be to unify the two commit tasks into a single one which checks both currentTxn.is<ImplicitTxn>() and deleteAllCommitScheduled to decide what actual work needs to be performed before committing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also curious... I assume it's OK/moot that we call requireNotBroken() prior to clearing deleteAllCommitScheduled, since once the output gate is broken, we expect all subsequent user operations to fail and continue failing, such that the return value of isCommitScheduled() no longer matters?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah nothing matters once broken.

Copy link
Contributor Author

@jclee jclee Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried combining the deleteAll() and onWrite() commit tasks, but it was turning out to be a more complicated change than your suggestion of just checking if (currentTxn.is<ImplicitTxn*>()) {...} in deleteAll(), so I went with that.

One thing I wasn't quite sure of, though, is the case involving explicit transactions:

  • user calls deleteAll() while alarm state is present.
  • deleteAll() adds its kj::evalLater() commit task to the event queue.
  • deleteAll() calls db.reset() to clear the database.
  • deleteAll() calls setAlarm() to restore the post-delete alarm state, which fires onWrite(), which sets up the implicit transaction and adds its kj::evalLater() commit task to the event queue to run immediately after.
  • before the event queue processes either commit task, it runs an event where user code starts an explicit transaction.
  • the explicit transaction constructor runs the implicit transaction's commit(), then updates currentTxn to point to the explicit transaction.
  • the user code maybe makes a few additional db changes within the explicit transaction.

When the deleteAll() commit task runs, it will see that currentTxn is no longer an implicit transaction, so it's going to just run commitCallback(). Is that OK?

I assume the commitCallback() would end up persisting the interim state of the explicit transaction. But presumably that's OK, because if the system goes down at that point, we recover from that by rolling back any explicit transaction savepoints we see in the persisted DB when reloading? (If not, I think a similar sequence of events is already possible just using implicit and explicit transactions together, without calling deleteAll().)

And in this case, we'd also be running a commitCallback() that effectively updates alarm state without also running the alarm rescheduling commit code, but maybe that's OK, since the very next event queue item should be the implicit transaction's commit task, which would run it?

@jclee jclee force-pushed the jlee/actor-srs-deleteall-alarms branch from 2a5c8bf to a0e0bc0 Compare October 28, 2024 17:04
test.pollAndExpectCalls({"commit"})[0]->reject(KJ_EXCEPTION(FAILED, "a_rejected_commit"));
KJ_EXPECT_THROW_MESSAGE("a_rejected_commit", promise.wait(test.ws));
// Ensure taskFailed handler runs and notices brokenness:
test.ws.poll();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to and including this line, this test is the same as the test above it. Why does it mean the handler succeeds in this case, but fails in the previous one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During normal alarm processing, if the alarm handler fails (i.e. the application throws an exception), cancelDeferredAlarmDeletion() is invoked.

Copy link
Contributor Author

@jclee jclee Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right... The intent behind the two tests was to exercise both the case where the caller calls cancelDeferredAlarmDeletion() prior to dropping the deleter and the case where it doesn't. Strictly speaking, only the behavior of the non-cancelled test changes in this commit; the cancelled test was already succeeding because it didn't attempt to reset the alarm after brokenness, but I don't think the interaction between gate brokenness and cancellation was otherwise being covered by other tests.

  • I can try to clarify the comments.

auto commitFulfiller = kj::mv(test.pollAndExpectCalls({"commit"})[0]);
KJ_ASSERT(results.count.wait(test.ws) == 0);
KJ_ASSERT(expectSync(test.getAlarm()) == oneMs);
auto commitFulfiller = kj::mv(test.pollAndExpectCalls({"commit"})[0]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own education, we only expect to see scheduleRun(1ms) when setAlarm is called, and not on subsequent resets? Presumably, that's because the scheduled run gets scheduled outside of this deletion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty much, yeah. The scheduleRun() hook is called by ActorSqlite as part of the committing logic -- implemented in startPrecommitAlarmScheduling() and commitImpl() -- to request a change to the alarm manager's scheduled alarm notification time. If ActorSqlite knows the currently scheduled alarm doesn't need to be updated, it doesn't call scheduleRun().

But thinking about this a bit more, I guess the point of confusion might be why we don't see a scheduleRun(kj::none) in the test corresponding to the db.reset() and a subsequent scheduleRun(1ms) corresponding to the alarm state reset? That's because we don't run the committing logic for the db.reset(), only for the state reset. And because we're restoring the original state, the committing logic sees no change, so scheduleRun() is not called.

I haven't yet implemented Kenton's suggestion of simplifying the deleteAll() logic by performing a redundant commit, but I'm pretty sure this won't change the situation, since the promise that deleteAll() queues calls commitCallback() directly, rather than through the alarm commit logic. And I'm pretty sure that's OK? But might need to think about it a bit more to be sure.

@jclee jclee force-pushed the jlee/actor-srs-deleteall-alarms branch from 36bd8c0 to cca0a96 Compare October 30, 2024 20:19
@jclee
Copy link
Contributor Author

jclee commented Oct 30, 2024

I applied Kenton's suggested deleteAll() logic simplification and rewrote the two "alarm handler cleanup" tests to be a single test, then squashed the fixup commits:

…ing reset

...particularly, in the case where an alarm is set.  Should fix errors seen
when calling deleteAll() during an alarm handler.
…put gate is already broken

May fix cases where a broken output gate could cause an exception to propagate
beyond the regular alarm output gate handling.
@jclee jclee force-pushed the jlee/actor-srs-deleteall-alarms branch from cca0a96 to 25fa631 Compare October 30, 2024 20:25
@jclee jclee merged commit ad8580f into main Nov 1, 2024
13 checks passed
@jclee jclee deleted the jlee/actor-srs-deleteall-alarms branch November 1, 2024 17:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants