Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CA-386316 Fix race condition between sr_detach and GC #664

Merged
merged 1 commit into from
Feb 15, 2024

Conversation

rdn32
Copy link
Contributor

@rdn32 rdn32 commented Dec 21, 2023

The problem showed up with NFSSR, but other SR types work the same way: the detach method was causing any active GC for an SR to abort, before going on to the rest of work of detaching; but there was nothing preventing a new GC starting in between. The change is to prevent the GC lock for an SR being acquired (as the GC need to do in order to become active) whilst something else (such as an sr_detach operation) is holding the SR lock; and (in case the GC does this just after the SR has been detached) it avoids doing anything if the SR is no longer plugged in.

It might have been sufficient to guard the acquisition of the GC lock like this in _gcLoop, but I decided to wrap the GC lock so that any acquisition of it requires the SR lock momentarily be acquired.

Copy link
Contributor

@edwintorok edwintorok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know whether you already use read-write locks or not, but this seems like a good use-case for a read-lock: while the GC is running hold an SR read lock (occasionally upgrading it to write as needed): this'd prevent trying to unplug it while the GC is active, and instead the unplug operation would block until the GC completes.
(and a read-lock wouldn't prevent other read-only operations, it'd only prevent write operations, where unplugging would have to be considered a "write")

I think currently an unplug operation may fail if a GC is active, which can be problematic because the GC is not triggered by the user, not visible to the user through APIs, and not cancelable by the user through APIs and can take a very long time to complete sometimes.

If SM doesn't already use read-write locks and only use purely binary locks, then ignore my suggestion, it'd probably be far too much work to convert SMAPIv1 to use those and probably something best delayed for SMAPIv3.

@rdn32
Copy link
Contributor Author

rdn32 commented Dec 22, 2023

I don't know whether you already use read-write locks or not

The sm code implements distinct WriteLock and ReadLock classes, but as far as I can see it never uses the latter.

I think currently an unplug operation may fail if a GC is active

I thought that was the point of aborting GC as part of detaching. (That said, I've not yet tried to understand the mechanism used to do this, and I have no idea how reliable it is.)

@MarkSymsCtx
Copy link
Contributor

I think currently an unplug operation may fail if a GC is active

I thought that was the point of aborting GC as part of detaching. (That said, I've not yet tried to understand the mechanism used to do this, and I have no idea how reliable it is.)

Indeed, the expectation is that an SR detach/unplug shall terminate any in progress GC activity, which the code attempted to do but didn't hold the relevant locks for long enough to prevent things kicking off while the detach was happening.

@MarkSymsCtx
Copy link
Contributor

The change to cleanup is fine.

I'm less happy about the unit test changes. It would be preferable I think to avoid needing to write a script out to storage and then execute it but i'm not sure if even using a fork as per https://github.com/xapi-project/sm/blob/master/drivers/cleanup.py#L2867 and then having the two processes play with locks would be sufficiently reliable and safe?

@MarkSymsCtx MarkSymsCtx self-requested a review January 9, 2024 10:19
@rdn32
Copy link
Contributor Author

rdn32 commented Jan 9, 2024

I'm less happy about the unit test changes. It would be preferable I think to avoid needing to write a script out to storage and then execute it

I wrote it like that, rather than using mocking, was that I thought the most likely source of bugs was me misunderstanding what other bits of code, specifically Lock, were doing. Now I know it passes in this form, I'm happy to rework it to be more straightforward.

The problem showed up with NFSSR, but other SR types work the same way: the
detach method was causing any active GC for an SR to abort, before going on to
the rest of work of detaching; but there was nothing preventing a new GC
starting in between. The change is to prevent the GC lock for an SR being acquired (as
the GC need to do in order to become active) whilst something else (such as an
sr_detach operation) is holding the SR lock; and (in case the GC does this
just after the SR has been detached) it avoids doing anything if the SR is no
longer plugged in.

It might have been sufficient to guard the acquisition of the GC lock like
this in _gcLoop, but I decided to wrap the GC lock so that any acquisition of
it requires the SR lock momentarily be acquired.

Signed-off-by: Robin Newton <robin.newton@cloud.com>
@rdn32 rdn32 force-pushed the private/rnewton/CA-386316 branch from bb9a3a2 to 904ba27 Compare January 10, 2024 11:11
return -1

def setUp(self):
tmp_dir = TemporaryDirectory()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like we actually use this as a temporary directory anymore just as a source for a named directory to inject into a mock? If so then we can simplify it down to just a statically defined literal path?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lock will be creating directories and lock files in that directory, so I'd rather it were a temporary directory that gets clean up.

@rdn32 rdn32 requested a review from MarkSymsCtx February 1, 2024 10:26
@MarkSymsCtx MarkSymsCtx merged commit b7b90c8 into xapi-project:master Feb 15, 2024
2 checks passed
@rdn32 rdn32 deleted the private/rnewton/CA-386316 branch February 15, 2024 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants