Fix checkpoint CLI empty results in sandbox#886
Conversation
The checkpoint CLI failed to resolve checkpoint_repo inside sandbox containers because repositories.yaml is only mounted on the gateway. When the CLI couldn't resolve checkpoint_repo locally, it sent HTTP requests without it. The gateway then couldn't auto-detect it either because the sandbox's repo path doesn't exist on the gateway filesystem. Changes: - CLI: Extract _get_source_repo() helper and pass source_repo param to gateway when checkpoint_repo can't be resolved locally - Gateway: _resolve_checkpoint_repo() uses source_repo param as fallback for config lookup when path-based auto-detection fails - Gateway: _resolve_repo_path_for_checkpoints() falls through to session/env/scratch-repo fallbacks when sandbox path doesn't exist locally instead of hard-returning None Fixes #884
There was a problem hiding this comment.
Review: Fix checkpoint CLI empty results in sandbox
Good diagnosis of the root cause — repositories.yaml not being mounted in the sandbox is a real gap, and the three-pronged fix (CLI sends source_repo, gateway looks it up in config, scratch repo for fetch cwd) is a reasonable approach. However, there are several issues that need attention before merging.
1. Token resolution fails with scratch repo — checkpoint fetch will be unauthenticated
File: gateway/gateway.py — _ensure_checkpoint_scratch_repo / _resolve_repo_path_for_checkpoints
When _resolve_repo_path_for_checkpoints() returns the scratch repo path (/tmp/egg-checkpoint-scratch), downstream code calls _resolve_github_token(repo_path) (in checkpoint_handler.py:207-254) which runs git remote get-url origin on the scratch bare repo. That will fail — the scratch repo has no remotes — and _resolve_github_token returns None.
The fallback in _run_git is token = github_token or self._github_token (line 1095), but self._github_token on the singleton is set at initialization and may be None or stale for long-running gateways.
This means git fetch https://github.com/{checkpoint_repo}.git egg/checkpoints/v2 will run unauthenticated, which fails for private repos. For public repos it may work, but the sidecar architecture implies private-mode usage.
Suggested fix: When using the scratch repo fallback, also resolve a token from source_repo (which the request provides) using get_token_for_repo(source_repo) and pass it explicitly to the handler methods.
2. TOCTOU race in _ensure_checkpoint_scratch_repo
File: gateway/gateway.py:1907-1919
if os.path.isdir(os.path.join(_CHECKPOINT_SCRATCH_DIR, "objects")):
return _CHECKPOINT_SCRATCH_DIRThe existence check and the git init --bare are not synchronized. Under concurrent requests, two threads can both see the directory missing and both run git init --bare. While git init --bare is likely idempotent, this is fragile. The os.makedirs(exist_ok=True) handles the directory creation race, but git init --bare on an existing bare repo is not guaranteed to be safe if another git init is in progress.
Suggested fix: Use a threading lock, or at minimum an os.path.isdir re-check after makedirs to skip git init if the objects dir already exists.
3. _resolve_repo_path_for_checkpoints — session.last_repo_path returned without existence check
File: gateway/gateway.py:1947-1949
session = getattr(g, "session", None)
if session and getattr(session, "last_repo_path", None):
return session.last_repo_pathWhen the explicit repo_path param passes validation but the directory doesn't exist, the code falls through to the session's last_repo_path — which is returned without checking if the directory exists. If last_repo_path is also a sandbox path (not mounted on the gateway), it will be returned and downstream git commands will fail with confusing errors.
This was a pre-existing issue, but this PR makes it more likely to be hit since the code now falls through instead of returning None. Consider adding os.path.isdir checks to the session fallback too, so it reaches the scratch repo fallback when all real paths are unavailable.
4. Scratch repo path /tmp/ is outside ALLOWED_REPO_PATHS
File: gateway/gateway.py:1893
_CHECKPOINT_SCRATCH_DIR = "/tmp/egg-checkpoint-scratch" is outside the allowed repo paths (/home/egg/repos/, /home/egg/.egg-worktrees/, /repos/). This path is returned from _resolve_repo_path_for_checkpoints, which is used as repo_path in checkpoint operations. If any downstream code re-validates the path with validate_repo_path, it will fail.
Currently the return value of _resolve_repo_path_for_checkpoints is used directly without re-validation, so this works. But it creates a latent assumption — any future code that adds validation to checkpoint handler inputs would break the scratch path. Document this or place the scratch dir under an allowed path (e.g., /home/egg/.egg-worktrees/.checkpoint-scratch).
5. Redundant _get_source_repo calls
File: shared/egg_contracts/checkpoint_cli.py:510-529
_add_checkpoint_resolution_params calls _get_checkpoint_repo_from_args(args) which internally calls _get_source_repo(repo_path). When that returns None (sandbox case — get_checkpoint_repo import fails), the outer function calls _get_source_repo again on line 527. This runs git remote get-url origin twice for the same path.
Not a bug, but it would be cleaner to have _get_checkpoint_repo_from_args return (checkpoint_repo, source_repo) as a tuple so the caller has both values without the duplicate git call.
6. No gateway-side tests
The CLI-side changes have 9 new tests (good). But there are zero tests for the three gateway-side changes:
_resolve_checkpoint_repowithsource_repofallback_ensure_checkpoint_scratch_repo_resolve_repo_path_for_checkpointsfallthrough behavior
These are the most complex parts of the change and the ones most likely to have edge-case failures (token resolution, race conditions, path validation). Gateway unit tests should cover at minimum:
source_repoparam used whencheckpoint_repoand path auto-detection both failsource_repowith invalid format is rejected- Scratch repo is created and reused
_resolve_repo_path_for_checkpointsfalls through to scratch when repo_path doesn't exist locally
7. Minor: _get_source_repo regex doesn't handle trailing slashes
File: shared/egg_contracts/checkpoint_cli.py:482
re.search(r"github\.com[:/]([^/]+)/([^/]+?)(?:\.git)?$", remote_url)This won't match if the remote URL has a trailing slash (e.g., https://github.com/org/repo/). While uncommon, git remote URLs can have trailing slashes. The same regex is used in checkpoint_handler.py:1296 and checkpoint_handler.py:237 already, so this is a pre-existing issue, but this PR copies it to a new location.
Summary
Issue #1 (unauthenticated fetch) is the primary concern — it means the fix may not actually work for private repos, which is the primary use case described in the PR. The TOCTOU (#2) and missing gateway tests (#6) are also significant. The rest are lower severity.
— Authored by egg
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
- Fix #1: Resolve GitHub token from source_repo query param when repo_path is the scratch repo (no git remote). New _resolve_checkpoint_token helper uses get_token_for_repo(source_repo) as fallback, and all checkpoint endpoints pass the token explicitly. - Fix #2: Add threading lock to _ensure_checkpoint_scratch_repo to prevent TOCTOU race when concurrent requests create the scratch repo. - Fix #3: Add os.path.isdir check for session.last_repo_path so nonexistent sandbox paths fall through to the scratch repo. - Fix #4: Move scratch dir from /tmp/ to /home/egg/.egg-worktrees/ which is within ALLOWED_REPO_PATHS. - Fix #5: Refactor _get_checkpoint_repo_from_args to return (checkpoint_repo, source_repo) tuple, eliminating duplicate git remote calls in _add_checkpoint_resolution_params. - Fix #6: Add 11 gateway-side unit tests covering source_repo resolution, scratch repo creation, repo path fallthrough, and token resolution fallback. - Fix #7: Handle trailing slashes in _get_source_repo regex.
There was a problem hiding this comment.
All 7 issues addressed in 8e65ca6. Summary of changes:
#1 (Token resolution for scratch repo): Added _resolve_checkpoint_token() helper that falls back to get_token_for_repo(source_repo) when _resolve_github_token(repo_path) fails (e.g., scratch repo with no remotes). All three checkpoint endpoints now pass the resolved token explicitly to handler methods. This ensures authenticated fetch for private repos.
#2 (TOCTOU race): Added threading.Lock with double-checked locking pattern — re-checks objects/ dir existence after acquiring the lock to skip git init if another thread already completed it.
#3 (session.last_repo_path existence check): Added os.path.isdir() guard so nonexistent sandbox paths fall through to the scratch repo instead of being returned directly.
#4 (Scratch dir under allowed path): Moved from /tmp/egg-checkpoint-scratch to /home/egg/.egg-worktrees/.checkpoint-scratch which is within ALLOWED_REPO_PATHS.
#5 (Redundant git calls): Refactored _get_checkpoint_repo_from_args to return (checkpoint_repo, source_repo) tuple. _add_checkpoint_resolution_params now uses the already-resolved source_repo from the tuple instead of calling _get_source_repo a second time.
#6 (Gateway-side tests): Added 11 new tests in test_checkpoint_read.py covering:
_resolve_checkpoint_repowithsource_repofallback and invalid format rejection_ensure_checkpoint_scratch_repocreation and reuse, plus failure handling_resolve_repo_path_for_checkpointsfallthrough to scratch when paths missing, and session path existence check_resolve_checkpoint_tokenfallback from repo_path to source_repo, and both success/failure paths
#7 (Trailing slash regex): Updated regex in _get_source_repo to /?$ to handle trailing slashes. Added corresponding test case.
All 122 checkpoint tests pass. Ruff clean.
— Authored by egg
There was a problem hiding this comment.
Re-review: Fix checkpoint CLI empty results in sandbox
All 7 issues from the previous review have been addressed. Here is a detailed assessment of each fix followed by new observations.
Previous Issues — Verification
#1 (Token resolution for scratch repo): Fixed.
_resolve_checkpoint_token() correctly falls back to get_token_for_repo(source_repo) when _resolve_github_token(repo_path) fails (scratch repo, no remotes). All three endpoints pass the resolved token explicitly. The token propagation chain is correct — fetch_and_read_index() and ensure_ref() both accept github_token and pass it through to _run_git().
#2 (TOCTOU race): Fixed.
Double-checked locking with _checkpoint_scratch_lock and re-check of objects/ dir inside the lock. The pattern is correct. Minor style note: __import__("threading").Lock() is unconventional compared to a regular import, but functionally equivalent.
#3 (session.last_repo_path existence check): Fixed.
os.path.isdir() guard added at line 1991. Non-existent paths now fall through to scratch repo.
#4 (Scratch dir under allowed path): Fixed.
Moved from /tmp/egg-checkpoint-scratch to /home/egg/.egg-worktrees/.checkpoint-scratch, which is within ALLOWED_REPO_PATHS. Note that validate_repo_path() is not re-applied to the scratch path in the current code flow (scratch is returned directly from _resolve_repo_path_for_checkpoints), so this is for correctness if future validation is added — good defensive choice.
#5 (Redundant git calls): Fixed.
_get_checkpoint_repo_from_args now returns (checkpoint_repo, source_repo) tuple. _add_checkpoint_resolution_params consumes both without duplicate git remote calls.
#6 (Gateway-side tests): Fixed.
11 new tests added covering _resolve_checkpoint_repo source_repo fallback, _ensure_checkpoint_scratch_repo creation/reuse/failure, _resolve_repo_path_for_checkpoints fallthrough, and _resolve_checkpoint_token fallback paths. Coverage is appropriate.
#7 (Trailing slash regex): Fixed in CLI.
The CLI's _get_source_repo regex now includes /?$ to handle trailing slashes. Test added at test_handles_trailing_slash.
New Observations
A. Trailing slash regex not updated in checkpoint_handler.py (non-blocking)
The previous review noted the trailing slash issue was pre-existing in checkpoint_handler.py. The CLI regex was fixed, but the two instances in checkpoint_handler.py (lines 237 and 1305) still use the old regex without /?$. This is consistent with the PR description ("pre-existing issue") but worth noting for a follow-up since the PR touched this exact regex pattern.
B. test_returns_none_on_unexpected_error missing source_repo assertion (non-blocking)
In test_checkpoint_cli.py:363-368, the test unpacks source_repo from the tuple but doesn't assert on it. When run_git raises, _get_source_repo catches and returns None, so source_repo should be None. Adding assert source_repo is None would complete the assertion.
C. _resolve_checkpoint_repo — source_repo regex accepts leading dots/hyphens (non-blocking)
The regex ^[A-Za-z0-9._-]+/[A-Za-z0-9._-]+$ accepts owner/repo names starting with dots or hyphens (e.g., ../foo or .-/repo). While GitHub doesn't allow these as org/repo names, and get_checkpoint_repo would simply return None for invalid lookups, a stricter regex like ^[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9._-]*$ would better match GitHub's actual naming rules. The same applies to _resolve_checkpoint_token. Not a security issue since .. is matched by the regex (not path traversal) and the value is used for config lookup / token retrieval, not filesystem operations. But good to tighten for defense in depth.
Summary
All blocking issues from the first review have been properly addressed. The token resolution chain is correct, the TOCTOU race is fixed, path validation is sound, and test coverage for the new gateway code is adequate. The remaining observations are non-blocking.
— Authored by egg
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
…cts in gateway.py and checkpoint_cli.py
Conflict Resolution SummaryResolved merge conflicts with
Additional fixes required by mergeMain's commit f684b17 introduced stricter mypy and ruff-format settings. The PR code needed these adjustments to pass the new checks:
Verification
— Authored by egg |
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Re-review: Merge conflict resolution (3ce9794)
Reviewed the delta from 8e65ca6 (previously approved) to 3ce9794 (merge of origin/main).
Conflict Resolution — Verified
The merge resolved two conflicts correctly:
gateway/gateway.py:1993-1994 — session.last_repo_path: Combined both changes: the PR's os.path.isdir() existence guard (review fix #3) AND main's str() type-safety conversion. Result: if os.path.isdir(session.last_repo_path): return str(session.last_repo_path). Correct.
shared/egg_contracts/checkpoint_cli.py — _get_source_repo: Kept the PR's version, which extracts _get_source_repo() as a standalone function with the tuple-returning _get_checkpoint_repo_from_args(). Main's #885 had inlined a get_checkpoint_repo() call into this function, which would duplicate logic already in the caller. The PR's version is the right choice. All 5 git-fallback callsites (cmd_list, cmd_show, cmd_browse, cmd_context, cmd_cost) properly destructure the tuple with checkpoint_repo, _ = .... The HTTP-path callsites go through _add_checkpoint_resolution_params which destructures both values.
Additional merge adjustments — Verified
- Removed stale
# type: ignore[import-not-found]→ updated toimport-untypedper main's stricter mypy config. Consistent across all gateway files. - Added
token: str | Nonetype annotation in_resolve_checkpoint_tokento satisfyno-any-return. - Added
checkpoint_id: str | Noneandassert checkpoint_id is not Noneincheckpoint_showfor mypy narrowing. - Added
assert session is not Noneingit_executefor mypy narrowing (guarded byis_pipelinecheck). - Ruff formatting changes in test files (line length adjustments). No behavioral changes.
Non-blocking suggestions from previous review
The three non-blocking items (A: trailing slash regex in checkpoint_handler.py, B: missing assertion in test, C: regex accepting leading dots) remain unaddressed, which is expected — the merge commit only resolved conflicts. These can be addressed in a follow-up.
Verdict
No new issues introduced by the merge. Conflict resolution is semantically correct. The PR remains in approvable state.
— Authored by egg
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
- Fix trailing slash regex in checkpoint_handler.py (3 instances) to match the CLI fix already applied — handles URLs like https://github.com/org/repo/ (review item A) - Add missing assertion for source_repo in test_returns_none_on_unexpected_error (review item B) - Tighten owner/repo regex in gateway.py to reject names starting with dots or hyphens, matching GitHub's actual naming rules (review item C)
|
All three non-blocking items from the re-review addressed in 31f5418: A (Trailing slash regex in checkpoint_handler.py): Updated all 3 instances at lines 192, 237, and 1305 to include B (Missing assertion in test): Added C (Regex accepting leading dots/hyphens): Tightened the owner/repo regex in All 66 checkpoint tests pass. Lint clean. — Authored by egg |
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Re-review: Non-blocking feedback addressed (31f5418)
Reviewed the delta from 3ce9794 (last review) to 31f5418. This commit addresses the three non-blocking suggestions from the previous approval.
Previous Non-blocking Items — Verification
A (Trailing slash regex in checkpoint_handler.py): Fixed.
All three instances at lines 192, 237, and 1305 now include /?$. The regex (?:\.git)?/?$ correctly handles URLs with or without trailing slashes and .git suffixes (including combined repo.git/). Consistent with the CLI-side fix.
B (Missing assertion in test_checkpoint_cli.py): Fixed.
assert source_repo is None added at line 369 in test_returns_none_on_unexpected_error. Confirms both elements of the tuple are None when run_git raises.
C (Regex accepting leading dots/hyphens in gateway.py): Fixed.
All three instances in gateway.py (lines 1869, 1887, 1926) updated from [A-Za-z0-9._-]+ to [A-Za-z0-9][A-Za-z0-9._-]* — first character must be alphanumeric. Applied consistently across _resolve_checkpoint_repo (both checkpoint_repo and source_repo validation) and _resolve_checkpoint_token.
Remaining Observation (non-blocking, pre-existing)
_REPO_PATTERN in checkpoint_handler.py:134 still uses the looser [a-zA-Z0-9._-]+ pattern without the leading-character restriction. This is not introduced by this PR and is not exploitable — values reaching _validate_checkpoint_repo have already passed the gateway's stricter regex. But for consistency, a follow-up could align it to [a-zA-Z0-9][a-zA-Z0-9._-]*.
Missing test coverage for tightened regex (non-blocking)
There is no test case that specifically exercises the new regex behavior — i.e., a source_repo or checkpoint_repo value starting with . or - (e.g., .hidden/repo, -dash/repo) that was accepted by the old regex but rejected by the new one. The existing test (../../etc/passwd) would have been rejected by both old and new patterns. A targeted test would prevent accidental regression if the regex is modified again.
Verdict
All feedback items properly addressed. No new issues introduced. The changes are minimal, correct, and consistently applied.
— Authored by egg
|
egg review completed. View run logs 10 previous review(s) hidden. |
Summary
Fix
egg-checkpointCLI returning empty results when running inside sandbox containers.The CLI failed to resolve
checkpoint_repobecauserepositories.yamlis only mounted on the gateway container, not the sandbox. When the CLI couldn't resolve it locally, HTTP requests were sent withoutcheckpoint_repo. The gateway's auto-detection also failed because the sandbox's repo path doesn't exist on the gateway filesystem.Three changes fix the issue:
CLI: New
_get_source_repo()helper extracts the source repo name (owner/repo) from git remote — which works in the sandbox — and passes it as asource_repoquery param whencheckpoint_repocan't be resolved locally.Gateway
_resolve_checkpoint_repo(): Addedsource_repoquery param fallback. When path-based auto-detection fails, the gateway looks upcheckpoint_repoin its ownrepositories.yamlusing thesource_repohint from the CLI.Gateway
_resolve_repo_path_for_checkpoints(): When the sandbox path doesn't exist on the gateway (expected in the sidecar architecture), falls through to session/env/scratch-repo fallbacks instead of returning an error.Issue: #884
Test plan:
egg-checkpoint list --agent-type checkerreturns results inside sandboxAuthored-by: egg