VReplication: Gather source positions once we know all writes are done during traffic switch #16572

Merged
merged 6 commits into from
Aug 19, 2024

@mattlord mattlord commented Aug 9, 2024

Description

This PR addresses a logic bug when switching traffic:

  • Previously, we gathered the replication positions on the migration sources immediately after stopping writes there. Only then did we stop the other streams, run rounds of LOCK TABLES to ensure that there were no in-flight writes on the tables being migrated, and finally ensure that each migration target's replication position was at least where its migration source had been.
  • Logically, we should gather the source replication positions AFTER all of that other work is done, just before we ensure that the targets have caught up to those positions. Otherwise an in-flight write that completes after the positions are gathered would not be covered by the catch-up check.

I have not been able to trigger any actual bug/issue (lost writes during a traffic switch) through extensive manual testing and nobody has seen or reported an issue around this, but it's clearly a logical bug.

We address this here in a way that causes no additional work or overhead by simply moving the position gathering work from the stopSourceWrites step/function to its own step/function that we call immediately before the waitForCatchup step/function.
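The ordering problem can be illustrated with a toy model. This is only a sketch under stated assumptions, not the Vitess implementation: the `source` type, its integer `pos` (real positions are GTID sets), and the functions `switchOldOrder` and `switchNewOrder` are all hypothetical names invented for illustration. It assumes that writes accepted before the stop can still land while the LOCK TABLES rounds run.

```go
package main

import "fmt"

// source is a hypothetical stand-in for a migration source.
// pos models the replication position as a simple counter.
type source struct {
	pos           int
	inFlight      int // writes accepted before the stop, not yet applied
	writesStopped bool
}

// stopWrites models denying new writes; in-flight writes stay pending.
func (s *source) stopWrites() { s.writesStopped = true }

// drainInFlight models the LOCK TABLES rounds: every pending write
// is applied, which advances the position.
func (s *source) drainInFlight() {
	s.pos += s.inFlight
	s.inFlight = 0
}

// switchOldOrder gathers the position right after stopping writes,
// before the remaining writes are known to be done.
func switchOldOrder(s *source) int {
	s.stopWrites()
	captured := s.pos
	s.drainInFlight()
	return captured
}

// switchNewOrder gathers the position only after all writes are done,
// immediately before the (not modeled) catch-up wait.
func switchNewOrder(s *source) int {
	s.stopWrites()
	s.drainInFlight()
	return s.pos
}

func main() {
	a := &source{pos: 100, inFlight: 2}
	b := &source{pos: 100, inFlight: 2}
	oldPos := switchOldOrder(a)
	newPos := switchNewOrder(b)
	fmt.Printf("old order: captured=%d final=%d\n", oldPos, a.pos)
	fmt.Printf("new order: captured=%d final=%d\n", newPos, b.pos)
}
```

In this model the old order captures a position that is behind the source's final position, so a target could satisfy the catch-up check without having applied the last writes. The new order captures the final position at no extra cost, since the same steps run and only their sequence changes.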

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Signed-off-by: Matt Lord <mattalord@gmail.com>
vitess-bot bot commented Aug 9, 2024

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes). New features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test. Enhancements and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added the following labels Aug 9, 2024: NeedsBackportReason (If backport labels have been applied to a PR, a justification is required), NeedsDescriptionUpdate (The description is not clear or comprehensive enough, and needs work), NeedsIssue (A linked issue is missing for this Pull Request), NeedsWebsiteDocsUpdate (What it says)
@github-actions github-actions bot added this to the v21.0.0 milestone Aug 9, 2024
@mattlord mattlord removed the following labels Aug 9, 2024: NeedsWebsiteDocsUpdate (What it says), NeedsIssue (A linked issue is missing for this Pull Request)
Signed-off-by: Matt Lord <mattalord@gmail.com>
codecov bot commented Aug 16, 2024

Codecov Report

Attention: Patch coverage is 57.89474% with 8 lines in your changes missing coverage. Please review.

Project coverage is 68.83%. Comparing base (127b9ae) to head (fcdc8bd).
Report is 4 commits behind head on main.

Files                                      Patch %   Lines
go/vt/vtctl/workflow/traffic_switcher.go   71.42%    4 Missing ⚠️
go/vt/vtctl/workflow/server.go             33.33%    2 Missing ⚠️
go/vt/vtctl/workflow/utils.go              0.00%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16572      +/-   ##
==========================================
- Coverage   68.84%   68.83%   -0.02%     
==========================================
  Files        1558     1558              
  Lines      200025   200048      +23     
==========================================
- Hits       137714   137694      -20     
- Misses      62311    62354      +43     


@mattlord mattlord force-pushed the switch_traffic_positions branch 3 times, most recently from 0e100c7 to 2f8f73d Compare August 16, 2024 14:40
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord force-pushed the switch_traffic_positions branch from 2f8f73d to a76bf9a Compare August 16, 2024 14:58
@mattlord mattlord removed the following labels Aug 16, 2024: NeedsBackportReason (If backport labels have been applied to a PR, a justification is required), NeedsDescriptionUpdate (The description is not clear or comprehensive enough, and needs work)
@mattlord mattlord force-pushed the switch_traffic_positions branch 3 times, most recently from a0a027f to dbb1ee5 Compare August 16, 2024 15:41
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord force-pushed the switch_traffic_positions branch 2 times, most recently from f8675c1 to 8b030db Compare August 16, 2024 18:34
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord force-pushed the switch_traffic_positions branch from 8b030db to fcdc8bd Compare August 16, 2024 18:36
@mattlord mattlord marked this pull request as ready for review August 16, 2024 19:52
@deepthi deepthi left a comment

Can you amend the last paragraph of the PR description, given that you have included a unit test?
Rest LGTM

@deepthi deepthi added the NeedsDescriptionUpdate (The description is not clear or comprehensive enough, and needs work) label Aug 16, 2024
@mattlord mattlord changed the title VReplication: Update source positions once we know all writes are done during traffic switch VReplication: Gather source positions once we know all writes are done during traffic switch Aug 19, 2024
@mattlord mattlord removed the NeedsDescriptionUpdate (The description is not clear or comprehensive enough, and needs work) label Aug 19, 2024
@mattlord mattlord merged commit fae7540 into vitessio:main Aug 19, 2024
130 of 131 checks passed
@mattlord mattlord deleted the switch_traffic_positions branch August 19, 2024 13:05
GrahamCampbell commented Aug 19, 2024

I think this issue may have been introduced by #13015, which is why many people have not run into this yet, as it only affects v18 and later. I didn't dig into this in much detail at all (and I could be totally wrong), but if I'm right, is this PR worth backporting?

@mattlord (Contributor, Author) commented

> I think this issue may have been introduced by #13015

@GrahamCampbell no, this aspect is the same in the wrangler package which vtctlclient uses. I didn't change the wrangler because it's been deprecated — via the vtctlclient or "legacy client" deprecation — since v18 and we want to remove it entirely soon.

> which is why many people have not run into this yet, as it only affects v18 and later.

Exactly nobody has reported it. It's not that "many people have not". The issue reported was from going through the code for the distributed transactions work, not because somebody encountered a problem.

Successfully merging this pull request may close these issues.

BugReport: Incorrect order of steps in SwitchTraffic