Skip to content

Fix #634: Clean up pygit2 Repository objects to prevent symlink accumulation in /proc#885

Closed
tboy1337 wants to merge 2 commits intopermitio:masterfrom
tboy1337:symbolic-links
Closed

Fix #634: Clean up pygit2 Repository objects to prevent symlink accumulation in /proc#885
tboy1337 wants to merge 2 commits intopermitio:masterfrom
tboy1337:symbolic-links

Conversation

@tboy1337
Copy link

/claim #634

Fixes Issue

Closes #634

Problem

When OPAL Server has GitHub policy sources configured and GitHub becomes unavailable (returns 500 errors), pygit2.Repository objects remain in the class-level GitPolicyFetcher.repos cache. These objects hold native C-level file descriptors that appear as symbolic links in /proc/{pid}/fd/, making the server look like it has spawned zombie processes. Over time, this leads to file descriptor exhaustion.

Root Cause

The existing code never explicitly releases pygit2.Repository objects when Git operations fail. Simply deleting references from the cache dictionary is insufficient because:

  1. Python's garbage collector timing is unpredictable
  2. The C-level resources (file descriptors) held by pygit2 may not be released promptly
  3. Failed fetch/clone operations leave corrupted Repository objects in the cache

Solution

Added a _cleanup_repo_from_cache() method that:

  1. Removes the repository from GitPolicyFetcher.repos cache
  2. Calls repo.free() to immediately release native C resources
  3. Handles errors gracefully during cleanup

Applied cleanup at 4 critical failure points:

  • Fetch failure (main scenario from OPAL Server doesn't clean up symbolic links when github is down #634): Wrapped fetch in try/except, cleanup on pygit2.GitError
  • Clone failure: Cleanup cache + remove partial repository directories with shutil.rmtree(ignore_errors=True)
  • Invalid repository: Cleanup before returning None from _get_valid_repo()
  • Repository directory removal: Cleanup before shutil.rmtree() when deleting invalid repos

Key Changes

File: packages/opal-server/opal_server/git_fetcher.py

  • Added _cleanup_repo_from_cache() method with explicit repo.free() call
  • Enhanced fetch error handling with try/except and cleanup
  • Enhanced clone error handling with cleanup and partial directory removal
  • Added cleanup calls before invalid repository directory deletion
  • Added cleanup in _get_valid_repo() for corrupted repositories

File: packages/opal-server/opal_server/tests/test_git_fetcher_cleanup.py

  • 7 comprehensive test cases covering:
    • Repository cache removal and free() call verification
    • Idempotent cleanup behavior
    • Error handling during free()
    • Partial clone directory cleanup
    • Fetch failure cleanup (main issue scenario)
    • Invalid repository cleanup
    • Cache cleanup before directory deletion

Demo Video

[TO BE RECORDED: The video will demonstrate:]

  1. OPAL Server running with GitHub policy source configured
  2. Simulating GitHub downtime (network failure/500 error)
  3. Monitoring /proc/{pid}/fd showing stable symlink count (no accumulation)
  4. Monitoring cache size remaining stable after failures
  5. Logs showing "Cleaned up cached repository" messages
  6. Before/After comparison showing the fix prevents FD leaks

Why This Approach

  • Deterministic resource release: Calling repo.free() immediately releases native C resources instead of relying on garbage collection
  • Handles all failure modes: Covers fetch, clone, invalid repo, and directory deletion scenarios
  • Behavior-compatible: Preserves existing error propagation and retry logic
  • Minimal changes: Focused fixes without refactoring existing logic
  • Well-tested: Comprehensive test coverage for all cleanup paths

Testing

All 7 tests verify:

  • Repository objects are removed from cache
  • repo.free() is called to release file descriptors
  • Partial directories are cleaned up
  • Multiple cleanup calls are safe (idempotent)
  • Error handling during cleanup doesn't break the flow

Running tests:

cd packages/opal-server
pytest opal_server/tests/test_git_fetcher_cleanup.py -v

Verification

To verify the fix in production:

  1. Monitor cache size: len(GitPolicyFetcher.repos) should not grow after failures
  2. Monitor /proc symlinks (Linux): ls -la /proc/{pid}/fd | wc -l should remain stable
  3. Check logs for cleanup messages: "Cleaned up cached repository: {path}"
  4. Simulate GitHub downtime: Block GitHub access and verify no FD accumulation

Check List

  • I sign off on contributing this submission to open-source
  • My code follows the code style of this project
  • My change requires changes to the documentation
  • I have updated the documentation accordingly (inline comments)
  • All new and existing tests passed
  • This PR does not contain plagiarized content
  • The title of my pull request is a short description of the requested changes
  • Demo video will be provided showing the fix in action

Related PRs

This PR builds on insights from previous attempts (#850, #855, #864, #881) and combines the best aspects: explicit resource cleanup (repo.free()), comprehensive error handling, and thorough test coverage.

- Added error handling for fetch and clone operations to log failures and clean up cached repositories.
- Introduced a new method `_cleanup_repo_from_cache` to release file descriptors held by pygit2 Repository objects during failures.
- Ensured that partial clones are cleaned up to prevent resource leaks.
- Changed early exit to raise an exception in the GitPolicyFetcher to ensure proper error propagation.
- Updated test to assert that the correct exception is raised during fetch operations, improving test coverage for error scenarios.
@netlify
Copy link

netlify bot commented Feb 17, 2026

Deploy Preview for opal-docs canceled.

Name Link
🔨 Latest commit 0d39a40
🔍 Latest deploy log https://app.netlify.com/projects/opal-docs/deploys/69947221ae3eba0008c81175

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OPAL Server doesn't clean up symbolic links when github is down

1 participant