Skip to content

Comments

Fix checkpoint CLI empty results in sandbox#886

Merged
jwbron merged 4 commits intomainfrom
egg/fix-checkpoint-cli-sandbox
Feb 22, 2026
Merged

Fix checkpoint CLI empty results in sandbox#886
jwbron merged 4 commits intomainfrom
egg/fix-checkpoint-cli-sandbox

Conversation

@james-in-a-box
Copy link
Contributor

Summary

Fix egg-checkpoint CLI returning empty results when running inside sandbox containers.

The CLI failed to resolve checkpoint_repo because repositories.yaml is only mounted on the gateway container, not the sandbox. When the CLI couldn't resolve it locally, HTTP requests were sent without checkpoint_repo. The gateway's auto-detection also failed because the sandbox's repo path doesn't exist on the gateway filesystem.

Three changes fix the issue:

  1. CLI: New _get_source_repo() helper extracts the source repo name (owner/repo) from git remote — which works in the sandbox — and passes it as a source_repo query param when checkpoint_repo can't be resolved locally.

  2. Gateway _resolve_checkpoint_repo(): Added source_repo query param fallback. When path-based auto-detection fails, the gateway looks up checkpoint_repo in its own repositories.yaml using the source_repo hint from the CLI.

  3. Gateway _resolve_repo_path_for_checkpoints(): When the sandbox path doesn't exist on the gateway (expected in the sidecar architecture), falls through to session/env/scratch-repo fallbacks instead of returning an error.

Issue: #884

Test plan:

  • All 42 checkpoint CLI tests pass (9 new tests added)
  • Lint clean
  • Verify egg-checkpoint list --agent-type checker returns results inside sandbox

Authored-by: egg

The checkpoint CLI failed to resolve checkpoint_repo inside sandbox
containers because repositories.yaml is only mounted on the gateway.
When the CLI couldn't resolve checkpoint_repo locally, it sent HTTP
requests without it. The gateway then couldn't auto-detect it either
because the sandbox's repo path doesn't exist on the gateway filesystem.

Changes:
- CLI: Extract _get_source_repo() helper and pass source_repo param to
  gateway when checkpoint_repo can't be resolved locally
- Gateway: _resolve_checkpoint_repo() uses source_repo param as fallback
  for config lookup when path-based auto-detection fails
- Gateway: _resolve_repo_path_for_checkpoints() falls through to
  session/env/scratch-repo fallbacks when sandbox path doesn't exist
  locally instead of hard-returning None

Fixes #884
Copy link
Contributor

@egg-reviewer egg-reviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Fix checkpoint CLI empty results in sandbox

Good diagnosis of the root cause — repositories.yaml not being mounted in the sandbox is a real gap, and the three-pronged fix (CLI sends source_repo, gateway looks it up in config, scratch repo for fetch cwd) is a reasonable approach. However, there are several issues that need attention before merging.


1. Token resolution fails with scratch repo — checkpoint fetch will be unauthenticated

File: gateway/gateway.py_ensure_checkpoint_scratch_repo / _resolve_repo_path_for_checkpoints

When _resolve_repo_path_for_checkpoints() returns the scratch repo path (/tmp/egg-checkpoint-scratch), downstream code calls _resolve_github_token(repo_path) (in checkpoint_handler.py:207-254) which runs git remote get-url origin on the scratch bare repo. That will fail — the scratch repo has no remotes — and _resolve_github_token returns None.

The fallback in _run_git is token = github_token or self._github_token (line 1095), but self._github_token on the singleton is set at initialization and may be None or stale for long-running gateways.

This means git fetch https://github.com/{checkpoint_repo}.git egg/checkpoints/v2 will run unauthenticated, which fails for private repos. For public repos it may work, but the sidecar architecture implies private-mode usage.

Suggested fix: When using the scratch repo fallback, also resolve a token from source_repo (which the request provides) using get_token_for_repo(source_repo) and pass it explicitly to the handler methods.


2. TOCTOU race in _ensure_checkpoint_scratch_repo

File: gateway/gateway.py:1907-1919

if os.path.isdir(os.path.join(_CHECKPOINT_SCRATCH_DIR, "objects")):
    return _CHECKPOINT_SCRATCH_DIR

The existence check and the git init --bare are not synchronized. Under concurrent requests, two threads can both see the directory missing and both run git init --bare. While git init --bare is likely idempotent, this is fragile. The os.makedirs(exist_ok=True) handles the directory creation race, but git init --bare on an existing bare repo is not guaranteed to be safe if another git init is in progress.

Suggested fix: Use a threading lock, or at minimum an os.path.isdir re-check after makedirs to skip git init if the objects dir already exists.


3. _resolve_repo_path_for_checkpointssession.last_repo_path returned without existence check

File: gateway/gateway.py:1947-1949

session = getattr(g, "session", None)
if session and getattr(session, "last_repo_path", None):
    return session.last_repo_path

When the explicit repo_path param passes validation but the directory doesn't exist, the code falls through to the session's last_repo_path — which is returned without checking if the directory exists. If last_repo_path is also a sandbox path (not mounted on the gateway), it will be returned and downstream git commands will fail with confusing errors.

This was a pre-existing issue, but this PR makes it more likely to be hit since the code now falls through instead of returning None. Consider adding os.path.isdir checks to the session fallback too, so it reaches the scratch repo fallback when all real paths are unavailable.


4. Scratch repo path /tmp/ is outside ALLOWED_REPO_PATHS

File: gateway/gateway.py:1893

_CHECKPOINT_SCRATCH_DIR = "/tmp/egg-checkpoint-scratch" is outside the allowed repo paths (/home/egg/repos/, /home/egg/.egg-worktrees/, /repos/). This path is returned from _resolve_repo_path_for_checkpoints, which is used as repo_path in checkpoint operations. If any downstream code re-validates the path with validate_repo_path, it will fail.

Currently the return value of _resolve_repo_path_for_checkpoints is used directly without re-validation, so this works. But it creates a latent assumption — any future code that adds validation to checkpoint handler inputs would break the scratch path. Document this or place the scratch dir under an allowed path (e.g., /home/egg/.egg-worktrees/.checkpoint-scratch).


5. Redundant _get_source_repo calls

File: shared/egg_contracts/checkpoint_cli.py:510-529

_add_checkpoint_resolution_params calls _get_checkpoint_repo_from_args(args) which internally calls _get_source_repo(repo_path). When that returns None (sandbox case — get_checkpoint_repo import fails), the outer function calls _get_source_repo again on line 527. This runs git remote get-url origin twice for the same path.

Not a bug, but it would be cleaner to have _get_checkpoint_repo_from_args return (checkpoint_repo, source_repo) as a tuple so the caller has both values without the duplicate git call.


6. No gateway-side tests

The CLI-side changes have 9 new tests (good). But there are zero tests for the three gateway-side changes:

  • _resolve_checkpoint_repo with source_repo fallback
  • _ensure_checkpoint_scratch_repo
  • _resolve_repo_path_for_checkpoints fallthrough behavior

These are the most complex parts of the change and the ones most likely to have edge-case failures (token resolution, race conditions, path validation). Gateway unit tests should cover at minimum:

  • source_repo param used when checkpoint_repo and path auto-detection both fail
  • source_repo with invalid format is rejected
  • Scratch repo is created and reused
  • _resolve_repo_path_for_checkpoints falls through to scratch when repo_path doesn't exist locally

7. Minor: _get_source_repo regex doesn't handle trailing slashes

File: shared/egg_contracts/checkpoint_cli.py:482

re.search(r"github\.com[:/]([^/]+)/([^/]+?)(?:\.git)?$", remote_url)

This won't match if the remote URL has a trailing slash (e.g., https://github.com/org/repo/). While uncommon, git remote URLs can have trailing slashes. The same regex is used in checkpoint_handler.py:1296 and checkpoint_handler.py:237 already, so this is a pre-existing issue, but this PR copies it to a new location.


Summary

Issue #1 (unauthenticated fetch) is the primary concern — it means the fix may not actually work for private repos, which is the primary use case described in the PR. The TOCTOU (#2) and missing gateway tests (#6) are also significant. The rest are lower severity.

— Authored by egg

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

- Fix #1: Resolve GitHub token from source_repo query param when
  repo_path is the scratch repo (no git remote). New
  _resolve_checkpoint_token helper uses get_token_for_repo(source_repo)
  as fallback, and all checkpoint endpoints pass the token explicitly.
- Fix #2: Add threading lock to _ensure_checkpoint_scratch_repo to
  prevent TOCTOU race when concurrent requests create the scratch repo.
- Fix #3: Add os.path.isdir check for session.last_repo_path so
  nonexistent sandbox paths fall through to the scratch repo.
- Fix #4: Move scratch dir from /tmp/ to /home/egg/.egg-worktrees/
  which is within ALLOWED_REPO_PATHS.
- Fix #5: Refactor _get_checkpoint_repo_from_args to return
  (checkpoint_repo, source_repo) tuple, eliminating duplicate
  git remote calls in _add_checkpoint_resolution_params.
- Fix #6: Add 11 gateway-side unit tests covering source_repo
  resolution, scratch repo creation, repo path fallthrough, and
  token resolution fallback.
- Fix #7: Handle trailing slashes in _get_source_repo regex.
Copy link
Contributor Author

@james-in-a-box james-in-a-box bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All 7 issues addressed in 8e65ca6. Summary of changes:

#1 (Token resolution for scratch repo): Added _resolve_checkpoint_token() helper that falls back to get_token_for_repo(source_repo) when _resolve_github_token(repo_path) fails (e.g., scratch repo with no remotes). All three checkpoint endpoints now pass the resolved token explicitly to handler methods. This ensures authenticated fetch for private repos.

#2 (TOCTOU race): Added threading.Lock with double-checked locking pattern — re-checks objects/ dir existence after acquiring the lock to skip git init if another thread already completed it.

#3 (session.last_repo_path existence check): Added os.path.isdir() guard so nonexistent sandbox paths fall through to the scratch repo instead of being returned directly.

#4 (Scratch dir under allowed path): Moved from /tmp/egg-checkpoint-scratch to /home/egg/.egg-worktrees/.checkpoint-scratch which is within ALLOWED_REPO_PATHS.

#5 (Redundant git calls): Refactored _get_checkpoint_repo_from_args to return (checkpoint_repo, source_repo) tuple. _add_checkpoint_resolution_params now uses the already-resolved source_repo from the tuple instead of calling _get_source_repo a second time.

#6 (Gateway-side tests): Added 11 new tests in test_checkpoint_read.py covering:

  • _resolve_checkpoint_repo with source_repo fallback and invalid format rejection
  • _ensure_checkpoint_scratch_repo creation and reuse, plus failure handling
  • _resolve_repo_path_for_checkpoints fallthrough to scratch when paths missing, and session path existence check
  • _resolve_checkpoint_token fallback from repo_path to source_repo, and both success/failure paths

#7 (Trailing slash regex): Updated regex in _get_source_repo to /?$ to handle trailing slashes. Added corresponding test case.

All 122 checkpoint tests pass. Ruff clean.

— Authored by egg

Copy link
Contributor

@egg-reviewer egg-reviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review: Fix checkpoint CLI empty results in sandbox

All 7 issues from the previous review have been addressed. Here is a detailed assessment of each fix followed by new observations.


Previous Issues — Verification

#1 (Token resolution for scratch repo): Fixed.
_resolve_checkpoint_token() correctly falls back to get_token_for_repo(source_repo) when _resolve_github_token(repo_path) fails (scratch repo, no remotes). All three endpoints pass the resolved token explicitly. The token propagation chain is correct — fetch_and_read_index() and ensure_ref() both accept github_token and pass it through to _run_git().

#2 (TOCTOU race): Fixed.
Double-checked locking with _checkpoint_scratch_lock and re-check of objects/ dir inside the lock. The pattern is correct. Minor style note: __import__("threading").Lock() is unconventional compared to a regular import, but functionally equivalent.

#3 (session.last_repo_path existence check): Fixed.
os.path.isdir() guard added at line 1991. Non-existent paths now fall through to scratch repo.

#4 (Scratch dir under allowed path): Fixed.
Moved from /tmp/egg-checkpoint-scratch to /home/egg/.egg-worktrees/.checkpoint-scratch, which is within ALLOWED_REPO_PATHS. Note that validate_repo_path() is not re-applied to the scratch path in the current code flow (scratch is returned directly from _resolve_repo_path_for_checkpoints), so this is for correctness if future validation is added — good defensive choice.

#5 (Redundant git calls): Fixed.
_get_checkpoint_repo_from_args now returns (checkpoint_repo, source_repo) tuple. _add_checkpoint_resolution_params consumes both without duplicate git remote calls.

#6 (Gateway-side tests): Fixed.
11 new tests added covering _resolve_checkpoint_repo source_repo fallback, _ensure_checkpoint_scratch_repo creation/reuse/failure, _resolve_repo_path_for_checkpoints fallthrough, and _resolve_checkpoint_token fallback paths. Coverage is appropriate.

#7 (Trailing slash regex): Fixed in CLI.
The CLI's _get_source_repo regex now includes /?$ to handle trailing slashes. Test added at test_handles_trailing_slash.


New Observations

A. Trailing slash regex not updated in checkpoint_handler.py (non-blocking)

The previous review noted the trailing slash issue was pre-existing in checkpoint_handler.py. The CLI regex was fixed, but the two instances in checkpoint_handler.py (lines 237 and 1305) still use the old regex without /?$. This is consistent with the PR description ("pre-existing issue") but worth noting for a follow-up since the PR touched this exact regex pattern.

B. test_returns_none_on_unexpected_error missing source_repo assertion (non-blocking)

In test_checkpoint_cli.py:363-368, the test unpacks source_repo from the tuple but doesn't assert on it. When run_git raises, _get_source_repo catches and returns None, so source_repo should be None. Adding assert source_repo is None would complete the assertion.

C. _resolve_checkpoint_reposource_repo regex accepts leading dots/hyphens (non-blocking)

The regex ^[A-Za-z0-9._-]+/[A-Za-z0-9._-]+$ accepts owner/repo names starting with dots or hyphens (e.g., ../foo or .-/repo). While GitHub doesn't allow these as org/repo names, and get_checkpoint_repo would simply return None for invalid lookups, a stricter regex like ^[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9._-]*$ would better match GitHub's actual naming rules. The same applies to _resolve_checkpoint_token. Not a security issue since .. is matched by the regex (not path traversal) and the value is used for config lookup / token retrieval, not filesystem operations. But good to tighten for defense in depth.


Summary

All blocking issues from the first review have been properly addressed. The token resolution chain is correct, the TOCTOU race is fixed, path validation is sound, and test coverage for the new gateway code is adequate. The remaining observations are non-blocking.

— Authored by egg

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

@james-in-a-box
Copy link
Contributor Author

Conflict Resolution Summary

Resolved merge conflicts with main (3 commits: f684b17, c5aa4ba, 5091be8):

File Category Resolution
gateway/gateway.py:1993 Additive Combined both changes: kept PR's os.path.isdir() guard (enables fallthrough for non-existent sandbox paths) and main's str() conversion (type safety for Path objects). Result: if os.path.isdir(session.last_repo_path): return str(session.last_repo_path)
shared/egg_contracts/checkpoint_cli.py:483 Semantic (safe) Kept PR's version of _get_source_repo(). Main's change (#885) moved a get_checkpoint_repo() lookup into this function, but that duplicates logic already in the caller _get_checkpoint_repo_from_args() (line 515) and violates the function's contract to return source_repo, not checkpoint_repo.

Additional fixes required by merge

Main's commit f684b17 introduced stricter mypy and ruff-format settings. The PR code needed these adjustments to pass the new checks:

File Fix
shared/egg_contracts/checkpoint_cli.py Removed stale # type: ignore[import-not-found] (line 513)
shared/egg_contracts/checkpoint_cli.py Updated 5 callers of _get_checkpoint_repo_from_args() to destructure the tuple return (checkpoint_repo, _ = ...) — the PR changed the return type to tuple[str | None, str | None] but didn't update all call sites
gateway/gateway.py Added type annotation token: str | None to suppress no-any-return in _resolve_checkpoint_token()
shared/egg_contracts/checkpoint_cli.py, gateway/tests/test_checkpoint_read.py Reformatted to satisfy stricter ruff-format rules

Verification

  • make lint — all checks passed
  • make test — 6651 passed, 85 skipped, 0 failures

— Authored by egg

@james-in-a-box

This comment has been minimized.

Copy link
Contributor

@egg-reviewer egg-reviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review: Merge conflict resolution (3ce9794)

Reviewed the delta from 8e65ca6 (previously approved) to 3ce9794 (merge of origin/main).

Conflict Resolution — Verified

The merge resolved two conflicts correctly:

gateway/gateway.py:1993-1994session.last_repo_path: Combined both changes: the PR's os.path.isdir() existence guard (review fix #3) AND main's str() type-safety conversion. Result: if os.path.isdir(session.last_repo_path): return str(session.last_repo_path). Correct.

shared/egg_contracts/checkpoint_cli.py_get_source_repo: Kept the PR's version, which extracts _get_source_repo() as a standalone function with the tuple-returning _get_checkpoint_repo_from_args(). Main's #885 had inlined a get_checkpoint_repo() call into this function, which would duplicate logic already in the caller. The PR's version is the right choice. All 5 git-fallback callsites (cmd_list, cmd_show, cmd_browse, cmd_context, cmd_cost) properly destructure the tuple with checkpoint_repo, _ = .... The HTTP-path callsites go through _add_checkpoint_resolution_params which destructures both values.

Additional merge adjustments — Verified

  • Removed stale # type: ignore[import-not-found] → updated to import-untyped per main's stricter mypy config. Consistent across all gateway files.
  • Added token: str | None type annotation in _resolve_checkpoint_token to satisfy no-any-return.
  • Added checkpoint_id: str | None and assert checkpoint_id is not None in checkpoint_show for mypy narrowing.
  • Added assert session is not None in git_execute for mypy narrowing (guarded by is_pipeline check).
  • Ruff formatting changes in test files (line length adjustments). No behavioral changes.

Non-blocking suggestions from previous review

The three non-blocking items (A: trailing slash regex in checkpoint_handler.py, B: missing assertion in test, C: regex accepting leading dots) remain unaddressed, which is expected — the merge commit only resolved conflicts. These can be addressed in a follow-up.

Verdict

No new issues introduced by the merge. Conflict resolution is semantically correct. The PR remains in approvable state.

— Authored by egg

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

- Fix trailing slash regex in checkpoint_handler.py (3 instances) to
  match the CLI fix already applied — handles URLs like
  https://github.com/org/repo/ (review item A)
- Add missing assertion for source_repo in
  test_returns_none_on_unexpected_error (review item B)
- Tighten owner/repo regex in gateway.py to reject names starting with
  dots or hyphens, matching GitHub's actual naming rules (review item C)
@james-in-a-box
Copy link
Contributor Author

All three non-blocking items from the re-review addressed in 31f5418:

A (Trailing slash regex in checkpoint_handler.py): Updated all 3 instances at lines 192, 237, and 1305 to include /?$, matching the CLI fix.

B (Missing assertion in test): Added assert source_repo is None to test_returns_none_on_unexpected_error.

C (Regex accepting leading dots/hyphens): Tightened the owner/repo regex in _resolve_checkpoint_repo and _resolve_checkpoint_token from [A-Za-z0-9._-]+ to [A-Za-z0-9][A-Za-z0-9._-]* — first character must be alphanumeric. Applied to all 3 instances in gateway.py (lines 1869, 1886, 1923).

All 66 checkpoint tests pass. Lint clean.

— Authored by egg

@james-in-a-box

This comment has been minimized.

Copy link
Contributor

@egg-reviewer egg-reviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review: Non-blocking feedback addressed (31f5418)

Reviewed the delta from 3ce9794 (last review) to 31f5418. This commit addresses the three non-blocking suggestions from the previous approval.


Previous Non-blocking Items — Verification

A (Trailing slash regex in checkpoint_handler.py): Fixed.
All three instances at lines 192, 237, and 1305 now include /?$. The regex (?:\.git)?/?$ correctly handles URLs with or without trailing slashes and .git suffixes (including combined repo.git/). Consistent with the CLI-side fix.

B (Missing assertion in test_checkpoint_cli.py): Fixed.
assert source_repo is None added at line 369 in test_returns_none_on_unexpected_error. Confirms both elements of the tuple are None when run_git raises.

C (Regex accepting leading dots/hyphens in gateway.py): Fixed.
All three instances in gateway.py (lines 1869, 1887, 1926) updated from [A-Za-z0-9._-]+ to [A-Za-z0-9][A-Za-z0-9._-]* — first character must be alphanumeric. Applied consistently across _resolve_checkpoint_repo (both checkpoint_repo and source_repo validation) and _resolve_checkpoint_token.


Remaining Observation (non-blocking, pre-existing)

_REPO_PATTERN in checkpoint_handler.py:134 still uses the looser [a-zA-Z0-9._-]+ pattern without the leading-character restriction. This is not introduced by this PR and is not exploitable — values reaching _validate_checkpoint_repo have already passed the gateway's stricter regex. But for consistency, a follow-up could align it to [a-zA-Z0-9][a-zA-Z0-9._-]*.

Missing test coverage for tightened regex (non-blocking)

There is no test case that specifically exercises the new regex behavior — i.e., a source_repo or checkpoint_repo value starting with . or - (e.g., .hidden/repo, -dash/repo) that was accepted by the old regex but rejected by the new one. The existing test (../../etc/passwd) would have been rejected by both old and new patterns. A targeted test would prevent accidental regression if the regex is modified again.


Verdict

All feedback items properly addressed. No new issues introduced. The changes are minimal, correct, and consistently applied.

— Authored by egg

@james-in-a-box
Copy link
Contributor Author

egg review completed. View run logs

10 previous review(s) hidden.

@jwbron jwbron merged commit a06e813 into main Feb 22, 2026
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant