Skip to content

Comments

Timeout copy batches that take too long and copy tables in parallel#5918

Closed
lutter wants to merge 0 commit intomasterfrom
lutter/copy-timeout
Closed

Timeout copy batches that take too long and copy tables in parallel#5918
lutter wants to merge 0 commit intomasterfrom
lutter/copy-timeout

Conversation

@lutter
Copy link
Collaborator

@lutter lutter commented Mar 28, 2025

Our estimation of batch sizes is generally good and stays within the prescribed bounds, but there are cases where proper estimation of the batch size is nearly impossible since the size of the rows in the table jumps sharply at some point that is hard to predict. This mechanism ensures that if our estimation is wrong, the consequences aren't too severe.

That's what the first three commits do; the rest of this PR changes how we copy so that we can copy the tables for a deployment in parallel. The copying parallelizes opportunistically, i.e., it will always copy at least one table, and more if there are database connections available and the configuration allows copying more than one table

@lutter lutter requested review from Copilot and zorancv March 28, 2025 16:50
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements a mechanism to handle long-running copy batch operations by timing out queries and retrying with a smaller batch size if necessary. Key changes include:

  • Adding a setter method for batch size in VidBatcher.
  • Wrapping batch copy transactions in a loop that sets a local statement timeout and resets the batch size on a timeout.
  • Introducing a new environment variable and error variant (StatementTimeout) to support batch timeout functionality.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
store/postgres/src/vid_batcher.rs Added a new method to adjust the batch size dynamically.
store/postgres/src/copy.rs Implemented timeout checking for copy batches with retry logic and a timeout.
graph/src/env/store.rs Updated EnvVarsStore to include batch_timeout with a constraint check.
graph/src/env/mod.rs Modified from_env to use try_into for proper error handling.
graph/src/components/store/err.rs Added a new StatementTimeout error variant and refactored error conversion logic.
docs/environment-variables.md Updated documentation to describe the new GRAPH_STORE_BATCH_TIMEOUT variable.

Comment on lines 737 to 738

let status = self.transaction(|conn| table.copy_batch(conn))?;
let status = {
Copy link

Copilot AI Mar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider introducing a maximum retry limit in this loop to avoid potential infinite retries in the event of persistent statement timeouts.

Copilot uses AI. Check for mistakes.
@lutter lutter changed the title Timeout copy batches that take too long Timeout copy batches that take too long and copy tables in parallel Mar 29, 2025
@lutter lutter force-pushed the lutter/copy-timeout branch from c456c0a to 417ee08 Compare March 29, 2025 22:30
pub batch_timeout: Option<Duration>,

/// The number of workers to use for batch operations. If there are idle
/// connectiosn, each subgraph copy operation will use up to this many
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// connectiosn, each subgraph copy operation will use up to this many
/// connections, each subgraph copy operation will use up to this many

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the typo

if let Some(worker) = self.default_worker(&mut state, &progress)? {
workers.push(worker);
}
loop {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably question of taste but I would move the loop inside the above if. Current way is correct too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That wouldn't be correct - we can only call self.default_worker when self.conn.is_some(), i.e., once per while loop. The loop { .. } is about trying to get more workers than just the one we always have.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noticed that I misread your comment - yes, putting the loop inside the if would also have been possible, but as you said, more a matter of taste

Copy link
Contributor

@zorancv zorancv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hope it helps the copies.

@lutter lutter closed this Mar 31, 2025
@lutter lutter force-pushed the lutter/copy-timeout branch from a39eaf1 to b3543bb Compare March 31, 2025 23:42
@lutter
Copy link
Collaborator Author

lutter commented Mar 31, 2025

Messed up my git commands, this was merged at b3543bb

@lutter lutter deleted the lutter/copy-timeout branch April 9, 2025 21:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants