-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CA-386316 Fix race condition between sr_detach and GC #664
CA-386316 Fix race condition between sr_detach and GC #664
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know whether you already use read-write locks or not, but this seems like a good use-case for a read-lock: while the GC is running hold an SR read lock (occasionally upgrading it to write as needed): this'd prevent trying to unplug it while the GC is active, and instead the unplug operation would block until the GC completes.
(and a read-lock wouldn't prevent other read-only operations, it'd only prevent write operations, where unplugging would have to be considered a "write")
I think currently an unplug operation may fail if a GC is active, which can be problematic because the GC is not triggered by the user, not visible to the user through APIs, and not cancelable by the user through APIs and can take a very long time to complete sometimes.
If SM doesn't already use read-write locks and only use purely binary locks, then ignore my suggestion, it'd probably be far too much work to convert SMAPIv1 to use those and probably something best delayed for SMAPIv3.
The sm code implements distinct WriteLock and ReadLock classes, but as far as I can see it never uses the latter.
I thought that was the point of aborting GC as part of detaching. (That said, I've not yet tried to understand the mechanism used to do this, and I have no idea how reliable it is.) |
Indeed, the expectation is that an SR detach/unplug shall terminate any in progress GC activity, which the code attempted to do but didn't hold the relevant locks for long enough to prevent things kicking off while the detach was happening. |
The change to cleanup is fine. I'm less happy about the unit test changes. It would be preferable I think to avoid needing to write a script out to storage and then execute it but i'm not sure if even using a fork as per https://github.com/xapi-project/sm/blob/master/drivers/cleanup.py#L2867 and then having the two processes play with locks would be sufficiently reliable and safe? |
I wrote it like that, rather than using mocking, was that I thought the most likely source of bugs was me misunderstanding what other bits of code, specifically Lock, were doing. Now I know it passes in this form, I'm happy to rework it to be more straightforward. |
The problem showed up with NFSSR, but other SR types work the same way: the detach method was causing any active GC for an SR to abort, before going on to the rest of work of detaching; but there was nothing preventing a new GC starting in between. The change is to prevent the GC lock for an SR being acquired (as the GC need to do in order to become active) whilst something else (such as an sr_detach operation) is holding the SR lock; and (in case the GC does this just after the SR has been detached) it avoids doing anything if the SR is no longer plugged in. It might have been sufficient to guard the acquisition of the GC lock like this in _gcLoop, but I decided to wrap the GC lock so that any acquisition of it requires the SR lock momentarily be acquired. Signed-off-by: Robin Newton <robin.newton@cloud.com>
bb9a3a2
to
904ba27
Compare
return -1 | ||
|
||
def setUp(self): | ||
tmp_dir = TemporaryDirectory() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look like we actually use this as a temporary directory anymore just as a source for a named directory to inject into a mock? If so then we can simplify it down to just a statically defined literal path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lock will be creating directories and lock files in that directory, so I'd rather it were a temporary directory that gets clean up.
The problem showed up with NFSSR, but other SR types work the same way: the detach method was causing any active GC for an SR to abort, before going on to the rest of work of detaching; but there was nothing preventing a new GC starting in between. The change is to prevent the GC lock for an SR being acquired (as the GC need to do in order to become active) whilst something else (such as an sr_detach operation) is holding the SR lock; and (in case the GC does this just after the SR has been detached) it avoids doing anything if the SR is no longer plugged in.
It might have been sufficient to guard the acquisition of the GC lock like this in _gcLoop, but I decided to wrap the GC lock so that any acquisition of it requires the SR lock momentarily be acquired.