refactor(trie): remove proof task manager #18934
base: yk/worker_pool_acc
Conversation
- Removed the `Factory` type parameter from the `ParallelProof` struct, streamlining its definition and implementation.
- Updated the constructor and related methods to reflect this change, enhancing code clarity and maintainability.
- Eliminated the unused `PhantomData` field, reducing complexity in the struct's design.

- Eliminated the `Factory` type definition from the proof tests, simplifying the code structure.
- This change contributes to improved clarity and maintainability of the test implementation.

- Removed the generic `Factory` type from `MultiProofConfig` and related structs, streamlining their definitions and improving code clarity.
- Updated methods to reflect the removal of the `Factory` type, enhancing maintainability.
- Adjusted the implementation of `PendingMultiproofTask` and its associated methods to eliminate unnecessary type parameters, simplifying the codebase.

- Replaced the `ProofTaskManager` with a new `spawn_proof_workers` function for better clarity and maintainability.
- Updated related code to utilize the new function, simplifying the worker spawning process.
- Enhanced metrics tracking for storage and account proof requests, ensuring thread-safe operations.
- Improved error handling and code structure across proof task implementations.

- Added a constant `MIN_WORKER_COUNT` to enforce a minimum number of workers for storage and account proof tasks.
- Updated `default_storage_worker_count` and `default_account_worker_count` functions to utilize the new minimum constraint.
- Enhanced setter methods in `TreeConfig` to ensure worker counts do not fall below the minimum.
- Modified command-line argument parsing to validate worker counts against the minimum requirement.

- Added a debug assertion to ensure `active_handles` does not underflow when dropping a `ProofTaskManagerHandle`.
- Implemented metrics recording to flush before exit when the last handle is dropped, enhancing monitoring capabilities (sketched below).
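As an illustration of the handle-counting scheme in the last commit above, here is a minimal, self-contained sketch. The names `ProofTaskManagerHandle` and `active_handles` follow the commit message; the field layout, `ProofTaskMetrics` struct, and `record` stub are simplified assumptions, not reth's actual definitions:

```rust
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};

/// Simplified stand-in for the real metrics type.
#[derive(Default)]
struct ProofTaskMetrics {
    storage_proofs: usize,
}

impl ProofTaskMetrics {
    /// Flush accumulated counts to the metrics backend (stubbed here).
    fn record(&self) {
        println!("storage_proofs = {}", self.storage_proofs);
    }
}

struct ProofTaskManagerHandle {
    active_handles: Arc<AtomicUsize>,
    metrics: Arc<ProofTaskMetrics>,
}

impl Clone for ProofTaskManagerHandle {
    fn clone(&self) -> Self {
        self.active_handles.fetch_add(1, Ordering::SeqCst);
        Self {
            active_handles: Arc::clone(&self.active_handles),
            metrics: Arc::clone(&self.metrics),
        }
    }
}

impl Drop for ProofTaskManagerHandle {
    fn drop(&mut self) {
        let prev = self.active_handles.fetch_sub(1, Ordering::SeqCst);
        // Debug assertion against underflow, as added in the commit above.
        debug_assert!(prev > 0, "active_handles underflow");
        if prev == 1 {
            // Last handle dropped: flush metrics before exit.
            self.metrics.record();
        }
    }
}
```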
Pull Request Overview
This PR refactors proof task management by removing the `ProofTaskManager` abstraction and replacing it with direct worker pool spawning. The change eliminates the routing-thread overhead by providing direct channel access to the storage and account worker pools, simplifying the architecture while maintaining the same worker pool functionality.
Key changes:
- Replaced `ProofTaskManager` with a `spawn_proof_workers` function for direct worker spawning (see the sketch after this list)
- Converted `ProofTaskManagerHandle` to provide type-safe queue methods with direct channel access
- Updated metrics to use lock-free atomic counters for thread-safe operations
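Schematically, direct worker spawning with channel-based handles looks something like the sketch below. Everything here (`StorageProofInput`, `ProofWorkerHandles`, the `Mutex`-wrapped receiver) is an illustrative stand-in for the actual reth types, not the real implementation:

```rust
use std::{
    sync::{mpsc, Arc, Mutex},
    thread,
};

/// Illustrative placeholder for a storage proof job.
struct StorageProofInput(u64);

/// Handle holding a direct sender into the storage worker pool;
/// no routing thread sits in between.
struct ProofWorkerHandles {
    storage_tx: mpsc::Sender<StorageProofInput>,
}

fn spawn_proof_workers(storage_worker_count: usize) -> ProofWorkerHandles {
    let (storage_tx, storage_rx) = mpsc::channel::<StorageProofInput>();
    // Workers share one receiver; a Mutex-wrapped mpsc receiver stands in
    // for a multi-consumer channel here.
    let storage_rx = Arc::new(Mutex::new(storage_rx));
    for worker_id in 0..storage_worker_count {
        let rx = Arc::clone(&storage_rx);
        thread::spawn(move || loop {
            // Take one job at a time, releasing the lock before working on it.
            let job = {
                let guard = rx.lock().unwrap();
                guard.recv()
            };
            let Ok(job) = job else { break };
            let _ = (worker_id, job.0); // ... compute the storage proof ...
        });
    }
    ProofWorkerHandles { storage_tx }
}
```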
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
crates/trie/parallel/src/proof_task_metrics.rs | Converts metrics fields to atomic counters for lock-free thread safety |
crates/trie/parallel/src/proof_task.rs | Replaces ProofTaskManager with spawn_proof_workers function and updates handle interface |
crates/trie/parallel/src/proof.rs | Updates proof generation to use new direct queue methods |
crates/node/core/src/args/engine.rs | Adds minimum worker count validation to CLI arguments |
crates/engine/tree/src/tree/payload_validator.rs | Updates error message to reflect new spawning approach |
crates/engine/tree/src/tree/payload_processor/multiproof.rs | Updates multiproof manager to use new queue methods |
crates/engine/tree/src/tree/payload_processor/mod.rs | Replaces ProofTaskManager instantiation with spawn_proof_workers |
crates/engine/primitives/src/config.rs | Adds MIN_WORKER_COUNT constant and enforces minimum worker limits |
Pull Request Overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Force-pushed from 2c025cd to 2b90133
- Introduced helper functions to streamline error conversion from `ProviderError` and channel receive errors to `SparseTrieError`.
- Enhanced readability and maintainability of the `trie_node` method by reducing repetitive error handling code.

- Updated the error conversion helper function in `ProofTaskTrieNodeProvider` to directly wrap the `ProviderError`, enhancing clarity and maintainability.
- This change simplifies the error handling logic within the `trie_node` method.
@shekhirin thanks for the review, just addressed all your comments in the commits
```rust
/// Helper to convert `ProviderError` to `SparseTrieError`
fn provider_err_to_trie_err(e: ProviderError) -> SparseTrieError {
    SparseTrieErrorKind::Other(Box::new(e)).into()
}

/// Helper to convert channel recv error to `SparseTrieError`
fn recv_err_to_trie_err(_: std::sync::mpsc::RecvError) -> SparseTrieError {
    SparseTrieErrorKind::Other(Box::new(std::io::Error::other("channel closed"))).into()
}
```
can we just do this without helper functions?

`.map_err(|error| SparseTrieErrorKind::Other(Box::new(error)))?;`

`RecvError` already implements `Error`, so it should work?
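In isolation, the direct `map_err` shape the reviewer suggests looks like this. The error enum below is a simplified stand-in for reth's `SparseTrieErrorKind`, and `recv_node` is a hypothetical function used only for illustration:

```rust
use std::sync::mpsc;

/// Simplified stand-in for reth's `SparseTrieErrorKind`.
#[derive(Debug)]
enum SparseTrieErrorKind {
    Other(Box<dyn std::error::Error + Send>),
}

/// Hypothetical receive step: `RecvError` implements `std::error::Error`,
/// so it can be boxed directly without a dedicated helper function.
fn recv_node(rx: &mpsc::Receiver<Vec<u8>>) -> Result<Vec<u8>, SparseTrieErrorKind> {
    rx.recv().map_err(|error| SparseTrieErrorKind::Other(Box::new(error)))
}
```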
yup we could, initially wanted a helper because it looked ugly with all the `map_err` calls
done in 2698cc8
```rust
/// Clamps the worker count to the minimum allowed value.
///
/// Ensures that the worker count is at least [`MIN_WORKER_COUNT`].
const fn clamp_worker_count(count: usize) -> usize {
```
is this just `.max(MIN_WORKER_COUNT)`? Let's move this to the `with_*_worker_count` functions, no need to have a separate helper fn for this
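i.e. roughly the following, as a sketch. `TreeConfig` is reduced to a single field, and the `MIN_WORKER_COUNT` value shown is a placeholder, not the constant's real value:

```rust
/// Placeholder value; the real constant lives in
/// crates/engine/primitives/src/config.rs and its value is not shown here.
const MIN_WORKER_COUNT: usize = 2;

struct TreeConfig {
    storage_worker_count: usize,
}

impl TreeConfig {
    /// Setter enforcing the minimum inline via `max`, with no separate helper.
    fn with_storage_worker_count(mut self, count: usize) -> Self {
        self.storage_worker_count = count.max(MIN_WORKER_COUNT);
        self
    }
}
```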
much more concise, done in c68eb7a
- Removed redundant helper functions for error conversion in `ProofTaskTrieNodeProvider`.
- Simplified error handling by directly mapping errors to `SparseTrieError`, improving code clarity and maintainability.

- Removed the `clamp_worker_count` function and replaced its logic with direct usage of `max` in the setter methods for storage and account worker counts.
- This change enhances code clarity and reduces unnecessary function overhead while ensuring the minimum worker count is enforced.
@mediocregopher @shekhirin bumping up for final review again :)
```rust
/// Count of blinded storage node requests.
pub storage_nodes: usize,
/// Count of storage proof requests (lock-free).
pub storage_proofs: Arc<AtomicU64>,
```
The purpose of `ProofTaskMetrics` is to allow us to lazily record metrics at the end of block processing, so that during block processing we can just increment raw ints and not incur synchronization overhead. By using atomics here we are negating that.

If the issue is in sharing the `ProofTaskMetrics` across workers, then we could clone the `ProofTaskMetrics` into each worker on startup, and have each worker record prior to termination.
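A minimal sketch of the per-worker scheme described here, assuming simplified types: each worker owns a plain counter and records once on exit, so the hot path has no atomic operations:

```rust
use std::{sync::mpsc, thread};

/// Per-worker metrics with a raw counter; no synchronization on the hot path.
#[derive(Clone, Default)]
struct ProofTaskMetrics {
    storage_proofs: u64,
}

impl ProofTaskMetrics {
    /// Flush to the metrics backend (stubbed here).
    fn record(&self) {
        println!("storage_proofs = {}", self.storage_proofs);
    }
}

fn spawn_worker(mut metrics: ProofTaskMetrics, jobs: mpsc::Receiver<u64>) {
    thread::spawn(move || {
        while let Ok(_job) = jobs.recv() {
            // Plain integer increment during block processing.
            metrics.storage_proofs += 1;
        }
        // Record once, right before the worker terminates.
        metrics.record();
    });
}
```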
good point on the synchronization overhead, addressed in bc4ecf5
```rust
/// - `task_ctx`: Shared context with trie updates and prefix sets
/// - `storage_worker_count`: Number of storage workers to spawn
/// - `account_worker_count`: Number of account workers to spawn
pub fn spawn_proof_workers<Factory>(
```
nit: this would make more sense as `ProofTaskManagerHandle::new`, I think.
that's a much nicer API, addressed in f9e167e
- Updated the code to utilize `ProofTaskManagerHandle` for spawning proof workers instead of the deprecated `spawn_proof_workers` function.
- This change enhances code clarity and maintainability by consolidating the worker management logic within the `ProofTaskManagerHandle` struct.

- Removed lock-free atomic counters from `ProofTaskMetrics` and replaced them with direct method calls for recording blinded node counts.
- Updated storage and account worker loops to utilize the new metrics recording methods, enhancing clarity and maintainability.
- Simplified the `ProofTaskManagerHandle` by removing unnecessary metrics fields, streamlining the overall structure.
@mediocregopher just addressed your comments
Pull Request Overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.
lgtm
left some suggestions, iirc additional changes to the worker pool are planned anyway and we could tackle those separately
```diff
-let _ = self.proof_task_handle.queue_task(ProofTaskKind::StorageProof(input, sender));
-receiver
+self.proof_task_handle
+    .queue_storage_proof(input)
```
can we rename these after we merge? I find `queue` very confusing here because this only sends
that's true -> we should rename it to `send_task`
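For reference, the send-and-return-receiver pattern behind `queue_storage_proof` can be sketched as follows, with hypothetical simplified request and result types standing in for reth's:

```rust
use std::sync::mpsc;

/// Hypothetical request/response types for illustration.
struct StorageProofInput(u64);
struct StorageProofResult(u64);

struct ProofTaskHandle {
    storage_tx: mpsc::Sender<(StorageProofInput, mpsc::Sender<StorageProofResult>)>,
}

impl ProofTaskHandle {
    /// Sends the job to the storage worker pool together with a response
    /// channel, and hands the receiver back to the caller. As noted above,
    /// a `send_`-style name may describe this better, since it only sends.
    fn queue_storage_proof(
        &self,
        input: StorageProofInput,
    ) -> Option<mpsc::Receiver<StorageProofResult>> {
        let (tx, rx) = mpsc::channel();
        self.storage_tx.send((input, tx)).ok()?;
        Some(rx)
    }
}
```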
```rust
for worker_id in 0..storage_worker_count {
    let provider_ro = view.provider_ro()?;
```
I believe this is still something we pay for upfront, meaning this isn't done in the background. Does this mean it currently takes more time to set this up if we bump the worker count?

I think ideally we return the channels right away and do this setup in the background so that we don't block here:

`let proof_handle = match ProofTaskManagerHandle::new(`
this makes sense, good idea for us to do this in the background
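One possible shape for that, as a sketch with placeholder types (`View` and `Job` are hypothetical, and a real constructor would return the full handle rather than a bare sender): the caller gets the channel immediately, while provider setup and worker spawning happen on a background thread.

```rust
use std::{
    sync::{mpsc, Arc, Mutex},
    thread,
};

struct Job(u64);

/// Placeholder for the consistent-view factory that opens read-only providers.
struct View;
impl View {
    fn provider_ro(&self) -> Result<(), &'static str> {
        Ok(()) // stand-in for a real provider handle
    }
}

fn spawn_proof_workers_in_background(view: View, worker_count: usize) -> mpsc::Sender<Job> {
    let (tx, rx) = mpsc::channel::<Job>();
    // Hand the sender back immediately; opening providers and spawning the
    // workers happens off the caller's thread, so a larger worker count no
    // longer delays setup.
    thread::spawn(move || {
        let rx = Arc::new(Mutex::new(rx));
        for worker_id in 0..worker_count {
            let Ok(provider_ro) = view.provider_ro() else { return };
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                let job = {
                    let guard = rx.lock().unwrap();
                    guard.recv()
                };
                let Ok(job) = job else { break };
                // ... serve the proof request with `provider_ro` ...
                let _ = (worker_id, &provider_ro, job.0);
            });
        }
    });
    tx
}
```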
```rust
for worker_id in 0..account_worker_count {
    let provider_ro = view.provider_ro()?;
```
same here
Context:
closes: #18801
impact: removes the `run()` loop thread
reference PRs: