
CA-388933: rework GC Active lock to ensure GC starts #679

Merged: 1 commit merged into xapi-project:master from CA-388933 on Mar 14, 2024

Conversation

MarkSymsCtx (Contributor):

Commit b7b90c8 addressed a race condition between the Garbage Collector and the SR detach operation: because there was no mutual exclusion between the two, the GC could start while the SR detach was in progress. To fix this, the code was changed to obtain the gc_active lock only while holding the SR lock, which the detach operation also holds. This prevents a starting GC process from acquiring the active lock until the detach has completed, at which point the SR is detached and the GC exits.

That commit had a problem: it used a non-blocking acquireNoblock to take the SR lock and, if that failed, assumed the GC was already running. But the GC is typically kicked as the result of a VDI delete (or manually via an SR scan), in which case the SR lock is already held by the caller. This makes the GC lock acquisition racy, dependent on how long the GC process takes to fork and daemonise. Under some conditions the GC process never starts, cleanup never occurs, and new snapshots cannot be taken once the maximum chain length is exceeded. It should instead use a blocking acquire, which waits until the current holder releases the lock. This commit applies that change.
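A minimal sketch of the change, in the spirit of the diff below. Only the switch from a non-blocking to a blocking SR-lock acquire is the point; everything else (including the exact method bodies) is an assumption for illustration:

    # Before (racy): if the SR lock was busy -- e.g. still held by the
    # VDI delete or SR scan that kicked the GC -- the new GC process
    # wrongly concluded that a GC was already running and gave up.
    def acquireNoblock(self):
        if not self._srLock.acquireNoblock():
            return False
        try:
            return self._lock.acquireNoblock()
        finally:
            self._srLock.release()

    # After: block until the SR lock is free, then test the GC active
    # lock without blocking.
    def acquireNoblock(self):
        self._srLock.acquire()
        try:
            return self._lock.acquireNoblock()
        finally:
            self._srLock.release()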

@@ -3103,8 +3103,7 @@ def __init__(self, srUuid):
self._srLock = lock.Lock(vhdutil.LOCK_TYPE_SR, srUuid)

def acquireNoblock(self):
Contributor:

The name of this method is now a bit misleading, although I can't think of a pithy, accurate replacement.

MarkSymsCtx (Contributor, Author):

No, it's not: acquireNoblock is the API method of this object type's super-class, lock.Lock, and we are still doing an acquireNoblock on the GC active lock. The extra caveat is that the SR lock must be held before the GC active lock can be checked.
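To illustrate the point, a hypothetical caller (names assumed, not the actual cleanup.py identifiers) still treats this as a non-blocking attempt on the GC active lock:

    # Hypothetical GC startup path: the attempt on the GC active lock
    # is still non-blocking, even though the SR lock is now acquired
    # with a blocking call inside acquireNoblock().
    lockActive = LockActive(srUuid)
    if not lockActive.acquireNoblock():
        return  # another GC process already holds the active lock
    try:
        runGarbageCollection(srUuid)  # assumed helper, for illustration
    finally:
        lockActive.release()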

MarkSymsCtx merged commit fc6e4d3 into xapi-project:master on Mar 14, 2024 (2 checks passed).
MarkSymsCtx deleted the CA-388933 branch on March 14, 2024, 14:03.