@zktommy zktommy commented Dec 11, 2025

Overview

This PR adds a faster retry mechanism for running ethproofs and improves Docker container management performance through parallelization.

Changes

  • Fast retry mechanism: Added a `docker-multi-fast-retry.sh` script that skips log saving and leaves `CHUNK_SIZE` unchanged for faster recovery. It is integrated into the Rust proving client behind the `fast-retry` feature flag, which lowers the default timeout from 120s to 30s.

  • Concurrent container management: Modified `docker-common.sh` to run container operations (stop, start, cleanup, status, logs, env updates) in parallel across multiple machines using the `run_parallel_workers()` and `run_parallel_all()` helpers.

  • Configurable proving timeout: Added support for the `PROVING_TIMEOUT_SECONDS` environment variable (default: 30s with `fast-retry`, 120s otherwise).
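
To illustrate the parallelization idea, here is a minimal sketch of fanning one operation out across workers with background jobs and `wait`. It is not the code from `docker-common.sh`: `parallel_each` and `demo_stop` are hypothetical names, and the real `run_parallel_workers()` / `run_parallel_all()` helpers may be structured differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; not the helpers added in this PR.

# Run function "$1" once per remaining argument, each as a background job,
# then wait for every job. The return status is the number of failed workers
# (shell exit statuses are limited to 0-255).
parallel_each() {
  local fn="$1"; shift
  local pids=() failed=0 pid worker
  for worker in "$@"; do
    "$fn" "$worker" &
    pids+=("$!")
  done
  for pid in "${pids[@]}"; do
    # `wait PID` returns that job's exit status.
    wait "$pid" || failed=$((failed + 1))
  done
  return "$failed"
}

# Stand-in for a real container operation such as stopping a worker.
demo_stop() {
  [ "$1" != "worker-2" ]   # pretend worker-2 fails
}

rc=0
parallel_each demo_stop worker-1 worker-2 worker-3 || rc=$?
echo "failed workers: $rc"
```

Waiting on each PID individually, rather than calling a bare `wait`, is what lets the caller learn which operations failed instead of only knowing that all of them finished.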

- Added functions to run worker operations in parallel, improving efficiency for starting, stopping, cleaning up, and verifying worker containers.
- Updated existing worker management functions to utilize parallel execution, enhancing performance during bulk operations.
- Introduced internal wrappers for parallel execution of various worker tasks, including deployment, environment variable updates, and log saving.
- Introduced a new `fast-retry` feature that enables quicker retries without log saving or changes to `CHUNK_SIZE`, allowing for faster recovery from proving failures.
- Added a `proving_timeout_seconds` configuration option to set custom timeout durations for proving operations, defaulting to 30 seconds for fast-retry and 120 seconds otherwise.
- Updated the proving client logic to handle the new timeout settings and retry mechanisms, improving overall efficiency and error handling during proving operations.
- Enhanced Docker control scripts to support the new fast-retry functionality.
- Improved the `run_parallel_workers` and `run_parallel_all` functions to include better error handling and logging for worker operations.
- Added checks for successful creation of temporary directories and files, ensuring robust execution.
- Updated output handling to include messages for workers with no output, enhancing clarity in logs.
- Exported additional internal worker functions for better modularity and reusability across scripts.
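
The temporary-directory check and the "no output" message described in the list above can be sketched as follows. This is an assumption about the general pattern, not the PR's implementation: `parallel_collect` and `demo_status` are hypothetical names, and worker names are assumed to be simple (no `/` characters).

```shell
#!/usr/bin/env bash
# Hypothetical sketch; not the helpers added in this PR.

# Run function "$1" per worker in parallel, buffering each worker's output in
# its own file so that lines from different workers do not interleave.
parallel_collect() {
  local fn="$1"; shift
  local tmpdir worker
  # Fail early if the temporary directory cannot be created.
  tmpdir="$(mktemp -d)" || { echo "error: could not create temp dir" >&2; return 1; }

  for worker in "$@"; do
    "$fn" "$worker" >"$tmpdir/$worker.out" 2>&1 &
  done
  wait   # block until every background job has finished

  for worker in "$@"; do
    if [ -s "$tmpdir/$worker.out" ]; then
      # Prefix each captured line with the worker name.
      sed "s/^/[$worker] /" "$tmpdir/$worker.out"
    else
      # Make silent workers visible in the log.
      echo "[$worker] (no output)"
    fi
  done
  rm -rf "$tmpdir"
}

# Stand-in for a real status query; only worker-1 prints anything.
demo_status() {
  if [ "$1" = "worker-1" ]; then echo "running"; fi
}

parallel_collect demo_status worker-1 worker-2
```

Printing the buffered files in argument order after `wait` keeps the log deterministic even though the workers finish in arbitrary order.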
@zktommy zktommy requested a review from eason1981 December 11, 2025 04:02
@eason1981 eason1981 merged commit 3ff2c1d into main Dec 11, 2025
2 checks passed
@eason1981 eason1981 deleted the tommy/faster-retry branch December 11, 2025 06:01

3 participants