
Conversation

@Kesavaraja67

Fixes #634

Problem

When fetch_and_notify_on_changes fails (e.g., network timeout), the repository path (directory or symbolic link) is left in a corrupted state, causing subsequent fetches to fail indefinitely.

Solution

  • Robust Cleanup: Wrapped the fetch logic in a try...except block whose failure path removes a broken symlink with os.unlink and an invalid clone directory with shutil.rmtree (a minimal sketch follows this list).
  • Cache Clearing: Ensures the internal GitPolicyFetcher.repos cache is cleared on failure.
  • Cross-Platform Test: Added a regression test (packages/opal-server/opal_server/tests/test_git_fetcher_cleanup.py) that mocks the failure and verifies filesystem cleanup. (Includes a Windows compatibility check for local dev.)
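
For illustration, the cleanup path described above might look roughly like this. It is a minimal sketch, not the exact code in git_fetcher.py; the function name and the repos_cache parameter are placeholders standing in for the fetcher's internals:

import shutil
from pathlib import Path


def cleanup_failed_fetch(repo_path: Path, repos_cache: dict) -> None:
    """Remove whatever is left at the repo path so the next fetch starts clean."""
    if repo_path.is_symlink():
        repo_path.unlink()  # stale or broken symbolic link
    elif repo_path.exists():
        shutil.rmtree(repo_path, ignore_errors=True)  # invalid clone directory
    # Drop the stale entry mirroring GitPolicyFetcher.repos so it is rebuilt later
    repos_cache.pop(str(repo_path), None)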

Verification

I have added a new test file. Run it with:
pytest packages/opal-server/opal_server/tests/test_git_fetcher_cleanup.py

/claim #634

@netlify

netlify bot commented Dec 23, 2025

Deploy Preview for opal-docs canceled.

🔨 Latest commit: d6729be
🔍 Latest deploy log: https://app.netlify.com/projects/opal-docs/deploys/6952284033aa6200088fa607

@Kesavaraja67
Author

git_fetcher.py.-.opal.-.Visual.Studio.Code.2025-12-23.23-16-58.mp4

Here is a short demo video verifying the fix.

I ran a simulation script (demo_proof.py) that creates a "broken" git state to mimic the issue. The video confirms that the new cleanup logic successfully detects and removes the leftover files/links.

Note: Recorded on Windows, verifying that the fix works robustly even in environments with symlink restrictions.

Contributor

@omer9564 left a comment


Few comments:

  1. You catch a very broad exception, and no matter which step failed you still run the full set of existence checks, deletions, etc. A better approach is to extract the lower-level part of the code that needs to be "atomic" into its own function, try/except the specific exceptions inside that function, and do a more dedicated "rollback" there.
  2. I'm not sure I agree with deleting the repo directory entirely. As scale increases and the repo grows, cloning it from zero can be a very expensive operation. I'm not familiar with the corrupted state that is left behind, but I'd say a better approach would be to retry (wouldn't it just work on the next fetch? see the sketch below) or to fix the corrupted state, instead of deleting the repo directory that the process worked so hard to clone.

Perhaps if you explain more about the corrupted state the git repo stays in, I'll understand the issue better and can help think about a solution.
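
For illustration only, a retry along the lines of point 2 might look like this; it is not code from this PR, and fetch_once stands in for the lower-level fetch step:

import asyncio


async def fetch_with_retries(fetch_once, attempts: int = 3, delay: float = 5.0):
    """Retry a failed fetch a few times before giving up, instead of deleting the clone."""
    last_error = None
    for _ in range(attempts):
        try:
            return await fetch_once()
        except Exception as exc:  # real code would catch the specific git/network errors
            last_error = exc
            await asyncio.sleep(delay)
    raise last_error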

@Kesavaraja67
Author

Thanks for the thoughtful review, @omer9564! I see your point about broad exceptions and the high cost of deleting large repos. I’m going to refactor the logic to:

  1. Isolate the atomic fetch operations into a helper function with specific error handling.

  2. Implement a 'soft cleanup' (like removing git locks or fixing symlinks) instead of a full rmtree whenever possible.

I'll update the PR shortly with these improvements!

@Kesavaraja67
Author

Kesavaraja67 commented Dec 29, 2025

Hi @omer9564, I have refactored the implementation to address your feedback:

Atomic Extraction: The core sync logic is now isolated in _attempt_atomic_sync. This cleanly separates the primary 'work' from the recovery logic.

Soft Recovery: Instead of a full shutil.rmtree, the fetcher now performs a 'soft cleanup' by targeting stale Git lock files (index.lock, shallow.lock, etc.).

Scalability: This approach fixes the corrupted state without forcing an expensive full re-clone, preserving data and performance for large repositories.

Narrowed Exceptions: I narrowed the exception handling to catch specific pygit2 and subprocess errors to trigger this dedicated recovery path.
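
Roughly, the recovery path now looks like the sketch below. The lock-file list and the remove_stale_git_locks helper are illustrative rather than the exact code, and it assumes _attempt_atomic_sync is awaitable:

import subprocess
from pathlib import Path

import pygit2

GIT_LOCK_FILES = ("index.lock", "shallow.lock", "HEAD.lock")


def remove_stale_git_locks(repo_path: Path) -> None:
    """Soft cleanup: delete leftover Git lock files instead of the whole clone."""
    git_dir = repo_path / ".git"
    for name in GIT_LOCK_FILES:
        lock = git_dir / name
        if lock.exists():
            lock.unlink()


async def sync_with_recovery(fetcher, repo_path: Path) -> None:
    try:
        await fetcher._attempt_atomic_sync()  # the isolated "work" described above
    except (pygit2.GitError, subprocess.SubprocessError):
        remove_stale_git_locks(repo_path)  # repair in place; avoid an expensive re-clone
        raise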

Demo: Attached is a video demonstrating the soft recovery. I manually created an index.lock in the .git folder; the fetcher successfully identified and removed the lock while keeping the repository directory intact.

git_fetcher.py.-.opal.-.Visual.Studio.Code.2025-12-29.10-23-19.mp4

@omer9564 requested a review from Copilot December 31, 2025 14:14
@omer9564
Contributor

Hey @Kesavaraja67, thanks for the quick fixes addressing my review.
I won't be able to re-review it by EOW; I'll do it early next week. In the meanwhile, I'm pinging @danyi1212 @orweis, who might be available to do so before I am, and adding Copilot as a reviewer.


Copilot AI left a comment


Pull request overview

This PR addresses Issue #634 by implementing robust cleanup of repository paths when fetch_and_notify_on_changes fails due to network or Git errors. The solution prevents corrupted symlinks and directories from blocking subsequent fetch operations.

Key changes:

  • Added comprehensive error handling with targeted cleanup strategies for different failure scenarios
  • Implemented soft cleanup for recoverable errors (stale locks, broken symlinks) and full cleanup as a last resort
  • Added regression test to verify cleanup behavior on fetch failures

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

  • packages/opal-server/opal_server/git_fetcher.py: Refactored fetch logic into an atomic sync function with multi-tier error handling and cleanup strategies
  • packages/opal-server/opal_server/tests/test_git_fetcher_cleanup.py: Added a regression test verifying repository cleanup on simulated network failures


Comment on lines +15 to +23
from pathlib import Path

# Ensure we can import the server package relative to this test file
current_dir = Path(__file__).parent
server_package_path = current_dir.parent.parent
sys.path.insert(0, str(server_package_path))

from opal_server.git_fetcher import GitPolicyFetcher


Copilot AI Dec 31, 2025


Manual sys.path manipulation in tests suggests the package structure may not follow standard Python conventions. Tests should be able to import the package without path modifications. Consider restructuring the package to use proper relative imports or installing it in editable mode for testing.

Suggested change
- from pathlib import Path
- # Ensure we can import the server package relative to this test file
- current_dir = Path(__file__).parent
- server_package_path = current_dir.parent.parent
- sys.path.insert(0, str(server_package_path))
- from opal_server.git_fetcher import GitPolicyFetcher
+ from ..git_fetcher import GitPolicyFetcher

Comment on lines +37 to +41
fetcher = MagicMock(spec=GitPolicyFetcher)
fetcher._repo_path = fake_repo_path

# Mock the internal cache to ensure we test the dictionary cleanup too
GitPolicyFetcher.repos = {str(fake_repo_path): "stale_object"}

Copilot AI Dec 31, 2025


This test doesn't actually exercise the code changes in git_fetcher.py. It manually replicates the cleanup logic instead of testing the actual fetch_and_notify_on_changes method. The test should mock the git operations to fail and verify that the actual error handling code performs the cleanup correctly.

    await self._notify_on_changes(repo)
    return
else:
    raise pygit2.GitError("Invalid repository metadata")

Copilot AI Dec 31, 2025


The error message "Invalid repository metadata" is vague. Include the repository path in the error message to aid debugging, such as f"Invalid repository metadata at {repo_path}".

Suggested change
- raise pygit2.GitError("Invalid repository metadata")
+ raise pygit2.GitError(f"Invalid repository metadata at {repo_path}")

Comment on lines +241 to +245
if self._repo_path.exists() or os.path.islink(self._repo_path):
    if self._repo_path.is_symlink():
        self._repo_path.unlink()
    else:
        # repo dir exists but invalid -> we must delete the directory
        logger.warning(
            "Deleting invalid repo: {path}", path=self._repo_path
        )
        shutil.rmtree(self._repo_path)
else:
    logger.info("Repo not found at {path}", path=self._repo_path)
    shutil.rmtree(self._repo_path, ignore_errors=True)

Copilot AI Dec 31, 2025


The condition self._repo_path.exists() or os.path.islink(self._repo_path) is checked before calling is_symlink(), but is_symlink() should be checked first since broken symlinks return False for exists(). Restructure to check is_symlink() in the outer condition to properly handle broken symlinks.

Suggested change
- if self._repo_path.exists() or os.path.islink(self._repo_path):
-     if self._repo_path.is_symlink():
-         self._repo_path.unlink()
-     else:
-         # repo dir exists but invalid -> we must delete the directory
-         logger.warning(
-             "Deleting invalid repo: {path}", path=self._repo_path
-         )
-         shutil.rmtree(self._repo_path)
- else:
-     logger.info("Repo not found at {path}", path=self._repo_path)
-     shutil.rmtree(self._repo_path, ignore_errors=True)
+ if self._repo_path.is_symlink():
+     self._repo_path.unlink()
+ elif self._repo_path.exists() or os.path.islink(self._repo_path):
+     shutil.rmtree(self._repo_path, ignore_errors=True)
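
A quick demonstration of the behavior Copilot is pointing at (shown for POSIX; creating symlinks on Windows may need extra privileges): Path.exists() follows the link target, so a broken symlink reports False even though the link itself still needs cleanup.

import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    link = Path(tmp) / "repo"
    os.symlink(Path(tmp) / "missing-target", link)  # target never existed
    print(link.exists())         # False - exists() follows the dangling link
    print(link.is_symlink())     # True  - the link itself is still on disk
    print(os.path.islink(link))  # True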

Comment on lines +242 to 244
if self._repo_path.is_symlink():
    self._repo_path.unlink()
else:

Copilot AI Dec 31, 2025


This cleanup logic duplicates the symlink handling already present in _perform_soft_cleanup. Consider calling _perform_soft_cleanup here followed by shutil.rmtree for any remaining directories, to avoid code duplication and ensure consistent cleanup behavior.

Suggested change
- if self._repo_path.is_symlink():
-     self._repo_path.unlink()
- else:
+ # Reuse soft cleanup for consistent symlink and lock-file handling
+ self._perform_soft_cleanup(self._repo_path)
+ # Remove any remaining directory tree as a last resort
+ if self._repo_path.exists():

@Kesavaraja67 closed this by deleting the head repository Jan 16, 2026


Development

Successfully merging this pull request may close these issues.

OPAL Server doesn't clean up symbolic links when github is down
