
Conversation

@Kesavaraja67

Fixes #634

Problem

When fetch_and_notify_on_changes fails (e.g., network timeout), the repository path (directory or symbolic link) is left in a corrupted state, causing subsequent fetches to fail indefinitely.

Solution

  • Robust Cleanup: Wrapped the fetch logic in a try...except block whose failure path removes a broken symlink with os.unlink and an invalid clone directory with shutil.rmtree (a minimal sketch follows this list).
  • Cache Clearing: Ensures the internal GitPolicyFetcher.repos cache is cleared on failure.
  • Cross-Platform Test: Added a regression test (packages/opal-server/opal_server/tests/test_git_fetcher_cleanup.py) that mocks the failure and verifies filesystem cleanup. (Includes a Windows compatibility check for local dev.)
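
For illustration, the cleanup path described above might look roughly like this. It is a minimal sketch, not the exact code in git_fetcher.py; the function name and the repos_cache parameter are placeholders standing in for the fetcher's internals:

import shutil
from pathlib import Path


def cleanup_failed_fetch(repo_path: Path, repos_cache: dict) -> None:
    """Remove whatever is left at the repo path so the next fetch starts clean."""
    if repo_path.is_symlink():
        repo_path.unlink()  # stale or broken symbolic link
    elif repo_path.exists():
        shutil.rmtree(repo_path, ignore_errors=True)  # invalid clone directory
    # Drop the stale entry mirroring GitPolicyFetcher.repos so it is rebuilt later
    repos_cache.pop(str(repo_path), None)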

Verification

I have added a new test file. Run it with:
pytest packages/opal-server/opal_server/tests/test_git_fetcher_cleanup.py

/claim #634

@netlify

netlify bot commented Dec 23, 2025

Deploy Preview for opal-docs canceled.

🔨 Latest commit: d6729be
🔍 Latest deploy log: https://app.netlify.com/projects/opal-docs/deploys/6952284033aa6200088fa607

@Kesavaraja67
Author

git_fetcher.py.-.opal.-.Visual.Studio.Code.2025-12-23.23-16-58.mp4

Here is a short demo video verifying the fix.

I ran a simulation script (demo_proof.py) that creates a "broken" git state to mimic the issue. The video confirms that the new cleanup logic successfully detects and removes the leftover files/links.

Note: Recorded on Windows, verifying that the fix works robustly even in environments with symlink restrictions.

Contributor

@omer9564 left a comment


Few comments:

  1. You catch a very broad exception, and no matter which step failed you still run the full set of existence checks, deletions, etc. A better approach is to extract the lower-level part of the code that needs to be "atomic" into its own function, try/except the specific exceptions inside that function, and do a more dedicated "rollback" there.
  2. I'm not sure I agree with deleting the repo directory entirely. As scale increases and the repo grows, cloning it from zero can be a very expensive operation. I'm not familiar with the corrupted state that is left behind, but I'd say a better approach would be to retry (wouldn't it just work on the next fetch? see the sketch below) or to fix the corrupted state, instead of deleting the repo directory that the process worked so hard to clone.

Perhaps if you explain more about the corrupted state the git repo stays in, I'll understand the issue better and can help think about a solution.
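
For illustration only, a retry along the lines of point 2 might look like this; it is not code from this PR, and fetch_once stands in for the lower-level fetch step:

import asyncio


async def fetch_with_retries(fetch_once, attempts: int = 3, delay: float = 5.0):
    """Retry a failed fetch a few times before giving up, instead of deleting the clone."""
    last_error = None
    for _ in range(attempts):
        try:
            return await fetch_once()
        except Exception as exc:  # real code would catch the specific git/network errors
            last_error = exc
            await asyncio.sleep(delay)
    raise last_error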

@Kesavaraja67
Author

Thanks for the thoughtful review, @omer9564! I see your point about broad exceptions and the high cost of deleting large repos. I’m going to refactor the logic to:

  1. Isolate the atomic fetch operations into a helper function with specific error handling.

  2. Implement a 'soft cleanup' (like removing git locks or fixing symlinks) instead of a full rmtree whenever possible.

I'll update the PR shortly with these improvements!

@Kesavaraja67
Author

Kesavaraja67 commented Dec 29, 2025

Hi @omer9564, I have refactored the implementation to address your feedback:

Atomic Extraction: The core sync logic is now isolated in _attempt_atomic_sync. This cleanly separates the primary 'work' from the recovery logic.

Soft Recovery: Instead of a full shutil.rmtree, the fetcher now performs a 'soft cleanup' by targeting stale Git lock files (index.lock, shallow.lock, etc.).

Scalability: This approach fixes the corrupted state without forcing an expensive full re-clone, preserving data and performance for large repositories.

Narrowed Exceptions: I narrowed the exception handling to catch specific pygit2 and subprocess errors to trigger this dedicated recovery path.
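
Roughly, the recovery path now looks like the sketch below. The lock-file list and the remove_stale_git_locks helper are illustrative rather than the exact code, and it assumes _attempt_atomic_sync is awaitable:

import subprocess
from pathlib import Path

import pygit2

GIT_LOCK_FILES = ("index.lock", "shallow.lock", "HEAD.lock")


def remove_stale_git_locks(repo_path: Path) -> None:
    """Soft cleanup: delete leftover Git lock files instead of the whole clone."""
    git_dir = repo_path / ".git"
    for name in GIT_LOCK_FILES:
        lock = git_dir / name
        if lock.exists():
            lock.unlink()


async def sync_with_recovery(fetcher, repo_path: Path) -> None:
    try:
        await fetcher._attempt_atomic_sync()  # the isolated "work" described above
    except (pygit2.GitError, subprocess.SubprocessError):
        remove_stale_git_locks(repo_path)  # repair in place; avoid an expensive re-clone
        raise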

Demo: Attached is a video demonstrating the soft recovery. I manually created an index.lock in the .git folder; the fetcher successfully identified and removed the lock while keeping the repository directory intact.

git_fetcher.py.-.opal.-.Visual.Studio.Code.2025-12-29.10-23-19.mp4

@omer9564 requested a review from Copilot December 31, 2025 14:14
@omer9564
Contributor

Hey @Kesavaraja67, thanks for the quick fixes addressing my review.
I won't be able to re-review it by EOW; I'll do it early next week. In the meanwhile, I'm pinging @danyi1212 @orweis, who might be available to do so before I am, and adding Copilot as a reviewer.


Copilot AI left a comment


Pull request overview

This PR addresses Issue #634 by implementing robust cleanup of repository paths when fetch_and_notify_on_changes fails due to network or Git errors. The solution prevents corrupted symlinks and directories from blocking subsequent fetch operations.

Key changes:

  • Added comprehensive error handling with targeted cleanup strategies for different failure scenarios
  • Implemented soft cleanup for recoverable errors (stale locks, broken symlinks) and full cleanup as a last resort
  • Added regression test to verify cleanup behavior on fetch failures

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

  • packages/opal-server/opal_server/git_fetcher.py: Refactored fetch logic into an atomic sync function with multi-tier error handling and cleanup strategies
  • packages/opal-server/opal_server/tests/test_git_fetcher_cleanup.py: Added a regression test verifying repository cleanup on simulated network failures


Comment on lines +15 to +23
from pathlib import Path

# Ensure we can import the server package relative to this test file
current_dir = Path(__file__).parent
server_package_path = current_dir.parent.parent
sys.path.insert(0, str(server_package_path))

from opal_server.git_fetcher import GitPolicyFetcher


Copilot AI Dec 31, 2025


Manual sys.path manipulation in tests suggests the package structure may not follow standard Python conventions. Tests should be able to import the package without path modifications. Consider restructuring the package to use proper relative imports or installing it in editable mode for testing.

Suggested change
- from pathlib import Path
- # Ensure we can import the server package relative to this test file
- current_dir = Path(__file__).parent
- server_package_path = current_dir.parent.parent
- sys.path.insert(0, str(server_package_path))
- from opal_server.git_fetcher import GitPolicyFetcher
+ from ..git_fetcher import GitPolicyFetcher

Comment on lines +37 to +41
fetcher = MagicMock(spec=GitPolicyFetcher)
fetcher._repo_path = fake_repo_path

# Mock the internal cache to ensure we test the dictionary cleanup too
GitPolicyFetcher.repos = {str(fake_repo_path): "stale_object"}

Copilot AI Dec 31, 2025


This test doesn't actually exercise the code changes in git_fetcher.py. It manually replicates the cleanup logic instead of testing the actual fetch_and_notify_on_changes method. The test should mock the git operations to fail and verify that the actual error handling code performs the cleanup correctly.

    await self._notify_on_changes(repo)
    return
else:
    raise pygit2.GitError("Invalid repository metadata")

Copilot AI Dec 31, 2025


The error message "Invalid repository metadata" is vague. Include the repository path in the error message to aid debugging, such as f"Invalid repository metadata at {repo_path}".

Suggested change
- raise pygit2.GitError("Invalid repository metadata")
+ raise pygit2.GitError(f"Invalid repository metadata at {repo_path}")

Comment on lines +241 to +245
if self._repo_path.exists() or os.path.islink(self._repo_path):
    if self._repo_path.is_symlink():
        self._repo_path.unlink()
    else:
        # repo dir exists but invalid -> we must delete the directory
        logger.warning(
            "Deleting invalid repo: {path}", path=self._repo_path
        )
        shutil.rmtree(self._repo_path)
else:
    logger.info("Repo not found at {path}", path=self._repo_path)
    shutil.rmtree(self._repo_path, ignore_errors=True)

Copilot AI Dec 31, 2025


The condition self._repo_path.exists() or os.path.islink(self._repo_path) is checked before calling is_symlink(), but is_symlink() should be checked first since broken symlinks return False for exists(). Restructure to check is_symlink() in the outer condition to properly handle broken symlinks.

Suggested change
- if self._repo_path.exists() or os.path.islink(self._repo_path):
-     if self._repo_path.is_symlink():
-         self._repo_path.unlink()
-     else:
-         # repo dir exists but invalid -> we must delete the directory
-         logger.warning(
-             "Deleting invalid repo: {path}", path=self._repo_path
-         )
-         shutil.rmtree(self._repo_path)
- else:
-     logger.info("Repo not found at {path}", path=self._repo_path)
-     shutil.rmtree(self._repo_path, ignore_errors=True)
+ if self._repo_path.is_symlink():
+     self._repo_path.unlink()
+ elif self._repo_path.exists() or os.path.islink(self._repo_path):
+     shutil.rmtree(self._repo_path, ignore_errors=True)
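
A quick demonstration of the behavior Copilot is pointing at (shown for POSIX; creating symlinks on Windows may need extra privileges): Path.exists() follows the link target, so a broken symlink reports False even though the link itself still needs cleanup.

import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    link = Path(tmp) / "repo"
    os.symlink(Path(tmp) / "missing-target", link)  # target never existed
    print(link.exists())         # False - exists() follows the dangling link
    print(link.is_symlink())     # True  - the link itself is still on disk
    print(os.path.islink(link))  # True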

Comment on lines +242 to 244
if self._repo_path.is_symlink():
    self._repo_path.unlink()
else:

Copilot AI Dec 31, 2025


This cleanup logic duplicates the symlink handling already present in _perform_soft_cleanup. Consider calling _perform_soft_cleanup here followed by shutil.rmtree for any remaining directories, to avoid code duplication and ensure consistent cleanup behavior.

Suggested change
- if self._repo_path.is_symlink():
-     self._repo_path.unlink()
- else:
+ # Reuse soft cleanup for consistent symlink and lock-file handling
+ self._perform_soft_cleanup(self._repo_path)
+ # Remove any remaining directory tree as a last resort
+ if self._repo_path.exists():

@Kesavaraja67 closed this by deleting the head repository Jan 16, 2026


Development

Successfully merging this pull request may close these issues.

OPAL Server doesn't clean up symbolic links when github is down
