
CA-388933: rework GC Active lock to ensure GC starts #679

Merged: 1 commit merged into xapi-project:master from CA-388933 on Mar 14, 2024

Conversation

MarkSymsCtx (Contributor):

Commit b7b90c8 addressed a race condition between the Garbage Collector and the SR detach operation: because there was no mutual exclusion between the two, the GC could start while the SR detach was in progress. To fix this, the code was changed to obtain the gc_active lock only while holding the SR lock, which the detach operation also holds. This prevents a starting GC process from acquiring the active lock until the detach has completed, at which point the SR is detached and the GC exits.

That commit had a problem: it used a non-blocking acquireNoblock to take the SR lock and, if that failed, assumed the GC was already running. But the GC is typically kicked as the result of a VDI delete (or manually via an SR scan), in which case the SR lock is already held by the caller. This makes the GC lock acquisition racy, dependent on how long the GC process takes to fork and daemonise. Under some conditions the GC process never starts, cleanup never occurs, and new snapshots cannot be taken once the maximum chain length is exceeded. It should instead use a blocking acquire, which waits until the current holder releases the lock. This commit applies that change.
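A minimal sketch of the change, in the spirit of the diff below. Only the switch from a non-blocking to a blocking SR-lock acquire is the point; everything else (including the exact method bodies) is an assumption for illustration:

    # Before (racy): if the SR lock was busy -- e.g. still held by the
    # VDI delete or SR scan that kicked the GC -- the new GC process
    # wrongly concluded that a GC was already running and gave up.
    def acquireNoblock(self):
        if not self._srLock.acquireNoblock():
            return False
        try:
            return self._lock.acquireNoblock()
        finally:
            self._srLock.release()

    # After: block until the SR lock is free, then test the GC active
    # lock without blocking.
    def acquireNoblock(self):
        self._srLock.acquire()
        try:
            return self._lock.acquireNoblock()
        finally:
            self._srLock.release()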

@@ -3103,8 +3103,7 @@ def __init__(self, srUuid):
self._srLock = lock.Lock(vhdutil.LOCK_TYPE_SR, srUuid)

def acquireNoblock(self):
Contributor:

The name of this method is now a bit misleading, although I can't think of a pithy, accurate replacement.

MarkSymsCtx (Contributor, Author):

No, it's not: acquireNoblock is the API method of this object type's super-class, lock.Lock, and we are still doing an acquireNoblock on the GC active lock. The extra caveat is that the SR lock must be held before the GC active lock can be checked.
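To illustrate the point, a hypothetical caller (names assumed, not the actual cleanup.py identifiers) still treats this as a non-blocking attempt on the GC active lock:

    # Hypothetical GC startup path: the attempt on the GC active lock
    # is still non-blocking, even though the SR lock is now acquired
    # with a blocking call inside acquireNoblock().
    lockActive = LockActive(srUuid)
    if not lockActive.acquireNoblock():
        return  # another GC process already holds the active lock
    try:
        runGarbageCollection(srUuid)  # assumed helper, for illustration
    finally:
        lockActive.release()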

MarkSymsCtx merged commit fc6e4d3 into xapi-project:master on Mar 14, 2024 (2 checks passed).
MarkSymsCtx deleted the CA-388933 branch on March 14, 2024, 14:03.