Skip to content

feat: add checkpoint/resume system and enforce set -euo pipefail#408

Closed
boffin-dmytro wants to merge 1 commit intoLight-Heart-Labs:mainfrom
boffin-dmytro:feat/checkpoint-resume-system
Closed

feat: add checkpoint/resume system and enforce set -euo pipefail#408
boffin-dmytro wants to merge 1 commit intoLight-Heart-Labs:mainfrom
boffin-dmytro:feat/checkpoint-resume-system

Conversation

@boffin-dmytro
Copy link
Contributor

Summary

Adds installer checkpoint/resume capability and enforces strict error handling across all shell scripts. This is the base infrastructure PR that PRs #388, #391, #392, #393, and #394 depend on.

Problem

  • Installer failures require starting over from scratch
  • Inconsistent error handling across shell scripts (37% missing set -e)
  • Silent failures possible without strict error handling
  • No way to resume after fixing transient issues

Solution

1. Checkpoint/Resume System

Created lib/checkpoint.sh library (171 lines) with:

  • Saves installer state after each phase
  • Allows resume from last successful phase on failure
  • Smart location strategy:
    • Phases 1-5: temp location (~/.cache/dream-server-install-checkpoint)
    • Phase 6+: final location (INSTALL_DIR/.install-checkpoint)
  • --force flag clears checkpoint and starts fresh
  • Automatic cleanup after successful installation

2. Strict Error Handling

Added set -euo pipefail to all 13 installer phases:

  • Errors kill the process immediately (no silent failures)
  • Undefined variables fail fast
  • Pipeline failures are detected

3. Install Core Refactor

Modified install-core.sh to support checkpoint system:

  • Phase execution now conditional: [[ $START_PHASE -le N ]]
  • Checkpoint saved after each phase completes
  • Checkpoint cleared after successful installation
  • User prompted to resume on subsequent runs

Benefits

  • Reliability: Resume failed installations without starting over
  • Debugging: Clear failure points with checkpoint state
  • Safety: Strict error handling prevents silent failures
  • User experience: No need to re-run successful phases (saves time on large downloads, Docker setup, etc.)

Testing

  • All lint checks pass (make lint)
  • Syntax validation passes
  • Follows "Let It Crash" principle from CLAUDE.md
  • Checkpoint system tested with simulated failures

Impact

Files Changed

  • New: dream-server/lib/checkpoint.sh (checkpoint/resume system)
  • Modified: dream-server/install-core.sh (orchestrator with checkpoint support)
  • Modified: All 13 installer phases (added set -euo pipefail)

Dependent PRs

Once this merges, the following PRs can merge cleanly:

🤖 Generated with Claude Code

@boffin-dmytro boffin-dmytro force-pushed the feat/checkpoint-resume-system branch from e87399a to fc135ad Compare March 19, 2026 01:34
@boffin-dmytro
Copy link
Contributor Author

Update: Conflicts Resolved

Rebased on latest upstream/main. All conflicts in 06-directories.sh and 11-services.sh have been resolved.

The PR is now ready for review with no merge conflicts.

@Lightheartdevs
Copy link
Collaborator

Closing as part of v2.1.0 stabilization. We're pausing new feature PRs while we solidify the installer and CI gates. If this is still relevant for v2.2.0 (targeting April 1), please rebase and resubmit. Thank you for contributing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants