Skip to content

Conversation

yongkangc
Copy link
Member

@yongkangc yongkangc commented Oct 10, 2025

Context:

  • As part of our performance work to reduce overhead and improve scheduling, we added worker pooling for multiproof generation.
  • This PR aims to perform cleanup and remove ProofTaskManager as an abstraction as we can now directly dispatch the proofs jobs to workers.

closes: #18801

impact:

Change Impact
Remove run() loop thread -1 thread, -1 channel hop
Direct channel sends ~some time saved per task
Eliminate enum wrapping ~2 allocations saved per task

reference PRs:

- Removed the Factory type parameter from the ParallelProof struct, streamlining its definition and implementation.
- Updated the constructor and related methods to reflect this change, enhancing code clarity and maintainability.
- Eliminated unused PhantomData field, reducing complexity in the struct's design.
- Eliminated the Factory type definition from the proof tests, simplifying the code structure.
- This change contributes to improved clarity and maintainability of the test implementation.
- Removed the generic Factory type from MultiProofConfig and related structs, streamlining their definitions and improving code clarity.
- Updated methods to reflect the removal of the Factory type, enhancing maintainability.
- Adjusted the implementation of PendingMultiproofTask and its associated methods to eliminate unnecessary type parameters, simplifying the codebase.
- Replaced the ProofTaskManager with a new spawn_proof_workers function for better clarity and maintainability.
- Updated related code to utilize the new function, simplifying the worker spawning process.
- Enhanced metrics tracking for storage and account proof requests, ensuring thread-safe operations.
- Improved error handling and code structure across proof task implementations.
@github-project-automation github-project-automation bot moved this to Backlog in Reth Tracker Oct 10, 2025
@yongkangc yongkangc self-assigned this Oct 10, 2025
@yongkangc yongkangc moved this from Backlog to In Progress in Reth Tracker Oct 10, 2025
- Added a constant `MIN_WORKER_COUNT` to enforce a minimum number of workers for storage and account proof tasks.
- Updated `default_storage_worker_count` and `default_account_worker_count` functions to utilize the new minimum constraint.
- Enhanced setter methods in `TreeConfig` to ensure worker counts do not fall below the minimum.
- Modified command-line argument parsing to validate worker counts against the minimum requirement.
- Added a debug assertion to ensure active_handles does not underflow when dropping a ProofTaskManagerHandle.
- Implemented metrics recording to flush before exit when the last handle is dropped, enhancing monitoring capabilities.
@yongkangc yongkangc requested a review from Copilot October 10, 2025 10:36
@yongkangc yongkangc changed the title refactor: remove proof task manager refactor(trie): remove proof task manager Oct 10, 2025
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the proof task management by removing the ProofTaskManager abstraction and replacing it with direct worker pool spawning. The change eliminates the routing thread overhead by providing direct channel access to storage and account worker pools, simplifying the architecture while maintaining the same worker pool functionality.

Key changes:

  • Replaced ProofTaskManager with spawn_proof_workers function for direct worker spawning
  • Converted ProofTaskManagerHandle to provide type-safe queue methods with direct channel access
  • Updated metrics to use lock-free atomic counters for thread-safe operations

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
crates/trie/parallel/src/proof_task_metrics.rs Converts metrics fields to atomic counters for lock-free thread safety
crates/trie/parallel/src/proof_task.rs Replaces ProofTaskManager with spawn_proof_workers function and updates handle interface
crates/trie/parallel/src/proof.rs Updates proof generation to use new direct queue methods
crates/node/core/src/args/engine.rs Adds minimum worker count validation to CLI arguments
crates/engine/tree/src/tree/payload_validator.rs Updates error message to reflect new spawning approach
crates/engine/tree/src/tree/payload_processor/multiproof.rs Updates multiproof manager to use new queue methods
crates/engine/tree/src/tree/payload_processor/mod.rs Replaces ProofTaskManager instantiation with spawn_proof_workers
crates/engine/primitives/src/config.rs Adds MIN_WORKER_COUNT constant and enforces minimum worker limits

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@yongkangc yongkangc requested a review from Copilot October 10, 2025 10:38
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

- Introduced helper functions to streamline error conversion from ProviderError and channel receive errors to SparseTrieError.
- Enhanced readability and maintainability of the trie_node method by reducing repetitive error handling code.
@yongkangc yongkangc requested a review from Copilot October 10, 2025 10:51
@yongkangc yongkangc marked this pull request as ready for review October 10, 2025 10:51
- Updated the error conversion helper function in ProofTaskTrieNodeProvider to directly wrap the ProviderError, enhancing clarity and maintainability.
- This change simplifies the error handling logic within the trie_node method.
@yongkangc
Copy link
Member Author

@shekhirin thanks for the review, just addressed all your comments in the commits

Comment on lines 1083 to 1091
/// Helper to convert `ProviderError` to `SparseTrieError`
fn provider_err_to_trie_err(e: ProviderError) -> SparseTrieError {
SparseTrieErrorKind::Other(Box::new(e)).into()
}

/// Helper to convert channel recv error to `SparseTrieError`
fn recv_err_to_trie_err(_: std::sync::mpsc::RecvError) -> SparseTrieError {
SparseTrieErrorKind::Other(Box::new(std::io::Error::other("channel closed"))).into()
}
Copy link
Collaborator

@shekhirin shekhirin Oct 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just do this without helper functions?

.map_err(|error| SparseTrieErrorKind::Other(Box::new(error)))?;

RecvError already implements Error, so should work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup we could, initially wanted helper because it looked ugly with all the map_err

Copy link
Member Author

@yongkangc yongkangc Oct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, initially wanted helper because it looked ugly with all the map_err

done in 2698cc8

/// Clamps the worker count to the minimum allowed value.
///
/// Ensures that the worker count is at least [`MIN_WORKER_COUNT`].
const fn clamp_worker_count(count: usize) -> usize {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this just .max(MIN_WORKER_COUNT)? Let's move this to with_*_worker_count functions, no need to have a separate helper fn for this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much more concise, done in c68eb7a

- Removed redundant helper functions for error conversion in ProofTaskTrieNodeProvider.
- Simplified error handling by directly mapping errors to SparseTrieError, improving code clarity and maintainability.
- Removed the `clamp_worker_count` function and replaced its logic with direct usage of `max` in the setter methods for storage and account worker counts.
- This change enhances code clarity and reduces unnecessary function overhead while ensuring the minimum worker count is enforced.
@yongkangc
Copy link
Member Author

@mediocregopher @shekhirin bumping up for final review again :)

@yongkangc yongkangc requested a review from klkvr as a code owner October 13, 2025 05:07
Base automatically changed from refactor/remove-factory-generic to yk/worker_pool_acc October 13, 2025 05:13
/// Count of blinded storage node requests.
pub storage_nodes: usize,
/// Count of storage proof requests (lock-free).
pub storage_proofs: Arc<AtomicU64>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of ProofTaskMetrics is to allow us to lazily record metrics at the end of block processing, so that during block processing we can just increment raw ints and not incur synchronization overhead. By using atomics here we are negating that.

If the issue is in sharing the ProofTaskMetrics across workers then we could clone the ProofTaskMetrics into each worker on startup, and have each worker record prior to termination.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point on the synchronization overhead, addressed here bc4ecf5

/// - `task_ctx`: Shared context with trie updates and prefix sets
/// - `storage_worker_count`: Number of storage workers to spawn
/// - `account_worker_count`: Number of account workers to spawn
pub fn spawn_proof_workers<Factory>(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this would make more sense as ProofTaskManagerHandler::new I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats much nicer api, addressed f9e167e

@yongkangc yongkangc requested a review from shekhirin October 13, 2025 08:53
- Updated the code to utilize ProofTaskManagerHandle for spawning proof workers instead of the deprecated spawn_proof_workers function.
- This change enhances code clarity and maintainability by consolidating the worker management logic within the ProofTaskManagerHandle struct.
- Removed lock-free atomic counters from ProofTaskMetrics and replaced them with direct method calls for recording blinded node counts.
- Updated storage and account worker loops to utilize the new metrics recording methods, enhancing clarity and maintainability.
- Simplified the ProofTaskManagerHandle by removing unnecessary metrics fields, streamlining the overall structure.
Copy link
Member Author

@yongkangc yongkangc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mediocregopher just addressed your comments

@yongkangc yongkangc requested a review from Copilot October 13, 2025 09:58
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link
Collaborator

@mattsse mattsse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

left some suggestions, iirc additional changes to the worker pool are planned anyway and we could tackle those separately

let _ = self.proof_task_handle.queue_task(ProofTaskKind::StorageProof(input, sender));
receiver
self.proof_task_handle
.queue_storage_proof(input)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we rename these after we merge, because I find queue very confusing here because this only sends

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats true -> should rename it as send_task

Comment on lines +882 to +883
for worker_id in 0..storage_worker_count {
let provider_ro = view.provider_ro()?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is still something we pay for upfront, meaning this isn't done in the background

this does mean it currently takes more time to set this up if we bump the worker count?

I think ideally we return the channels right away and do this setup in the background so that we don't block here:

let proof_handle = match ProofTaskManagerHandle::new(

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes sense, good idea for us to do in bg

Comment on lines +909 to +910
for worker_id in 0..account_worker_count {
let provider_ro = view.provider_ro()?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

5 participants