Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix RemoveTablet during TabletExternallyReparented causing connection issues #16371

Merged
merged 9 commits into from
Aug 9, 2024
22 changes: 21 additions & 1 deletion go/vt/discovery/healthcheck.go
Original file line number Diff line number Diff line change
Expand Up @@ -470,7 +470,20 @@ func (hc *HealthCheckImpl) deleteTablet(tablet *topodata.Tablet) {
// delete from healthy list
healthy, ok := hc.healthy[key]
if ok && len(healthy) > 0 {
hc.recomputeHealthy(key)
if tabletType == topodata.TabletType_PRIMARY {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this some more..

  • we are deleting a tablet. if that tablet's type is not PRIMARY, we should not even "recompute the healthy primary"
  • even if that tablet's type is PRIMARY, why should we iterate over the healthData for PRIMARY? we know that at any point there should only be one "healthy" primary. i.e healthy[PRIMARY] should have a size of either 0 or 1. If the tablet that we are deleting is THE current healthy primary, we should just set healthy[PRIMARY] to an empty slice. If it is not, we should do nothing.

It seems to me that the problem with the original implementation of this defer func was that it was calling recomputeHealthy for PRIMARY type (which was never done before this defer func was added) instead of handling PRIMARY separately. This caused it to violate the assumption that there is at most one healthy PRIMARY at any point in time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're absolutely right. I simplified this changeset by simply performing the same logic as in updateHealth for PRIMARY tablet types, and calling recomputeHealth for the other types.

// If the deleted tablet was a primary,
// and it matches what we think is the current active primary,
// clear the healthy list for the primary.
//
// See the logic in `updateHealth` for more details.
alias := tabletAliasString(topoproto.TabletAliasString(healthy[0].Tablet.Alias))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check and handle cases where healthy has a length > 1 just to be safe? I'm not sure, but the thought came up.

Copy link
Contributor Author

@arthurschreiber arthurschreiber Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maintaining the invariant that healthy only ever contains a single entry for PRIMARY (and verifying this in the tests) seems "good enough" to me? I'm also not sure what the correct behaviour would be if we detect more than one entry in the list, because that means the invariant is broken and all guarantees are invalid.

if alias == tabletAlias {
hc.healthy[key] = []*TabletHealth{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why we do this rather than removing the key? Not implying it's somehow wrong. 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just mirroring what happens in updateHealth. 🤷

}
} else {
// Simply recompute the list of healthy tablets for all other tablet types.
hc.recomputeHealthy(key)
}
}
}
}()
Expand Down Expand Up @@ -586,6 +599,13 @@ func (hc *HealthCheckImpl) updateHealth(th *TabletHealth, prevTarget *query.Targ
hc.broadcast(th)
}

// Recomputes the healthy tablets for the given key.
arthurschreiber marked this conversation as resolved.
Show resolved Hide resolved
//
// This filters out tablets that might be healthy, but are not part of the current
// cell or cell alias. It also performs filtering of tablets based on replication lag,
// if configured to do so.
//
// This should not be called for primary tablets.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we not allow it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately key is a string that concats a bunch of values, so checking whether the type is PRIMARY is not straightforward.

@deepthi and I discussed various options for refactoring this, but I feel like that shouldn't be part of this PR.

func (hc *HealthCheckImpl) recomputeHealthy(key KeyspaceShardTabletType) {
all := hc.healthData[key]
allArray := make([]*TabletHealth, 0, len(all))
Expand Down
121 changes: 121 additions & 0 deletions go/vt/discovery/healthcheck_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -784,6 +784,127 @@ func TestRemoveTablet(t *testing.T) {
assert.Empty(t, a, "wrong result, expected empty list")
}

// When an external primary failover is performed,
// the demoted primary will advertise itself as a `PRIMARY`
// tablet until it recognizes that it was demoted,
// and until all in-flight operations have either finished
// (successfully or unsuccessfully, see `--shutdown_grace_period` flag).
//
// During this time, operations like `RemoveTablet` should not lead
// to multiple tablets becoming valid targets for `PRIMARY`.
func TestRemoveTabletDuringExternalReparenting(t *testing.T) {
ctx := utils.LeakCheckContext(t)

// reset error counters
hcErrorCounters.ResetAll()
ts := memorytopo.NewServer(ctx, "cell")
defer ts.Close()
hc := createTestHc(ctx, ts)
// close healthcheck
defer hc.Close()

firstTablet := createTestTablet(0, "cell", "a")
firstTablet.Type = topodatapb.TabletType_PRIMARY

secondTablet := createTestTablet(1, "cell", "b")
secondTablet.Type = topodatapb.TabletType_REPLICA

thirdTablet := createTestTablet(2, "cell", "c")
thirdTablet.Type = topodatapb.TabletType_REPLICA

firstTabletHealthStream := make(chan *querypb.StreamHealthResponse)
firstTabletConn := createFakeConn(firstTablet, firstTabletHealthStream)
firstTabletConn.errCh = make(chan error)

secondTabletHealthStream := make(chan *querypb.StreamHealthResponse)
secondTabletConn := createFakeConn(secondTablet, secondTabletHealthStream)
secondTabletConn.errCh = make(chan error)

thirdTabletHealthStream := make(chan *querypb.StreamHealthResponse)
thirdTabletConn := createFakeConn(thirdTablet, thirdTabletHealthStream)
thirdTabletConn.errCh = make(chan error)

resultChan := hc.Subscribe()

hc.AddTablet(firstTablet)
<-resultChan

hc.AddTablet(secondTablet)
<-resultChan

hc.AddTablet(thirdTablet)
<-resultChan

arthurschreiber marked this conversation as resolved.
Show resolved Hide resolved
firstTabletPrimaryTermStartTimestamp := time.Now().Unix() - 10

firstTabletHealthStream <- &querypb.StreamHealthResponse{
TabletAlias: firstTablet.Alias,
Target: &querypb.Target{Keyspace: "k", Shard: "s", TabletType: topodatapb.TabletType_PRIMARY},
Serving: true,

PrimaryTermStartTimestamp: firstTabletPrimaryTermStartTimestamp,
RealtimeStats: &querypb.RealtimeStats{ReplicationLagSeconds: 0, CpuUsage: 0.5},
}
<-resultChan

secondTabletHealthStream <- &querypb.StreamHealthResponse{
TabletAlias: secondTablet.Alias,
Target: &querypb.Target{Keyspace: "k", Shard: "s", TabletType: topodatapb.TabletType_REPLICA},
Serving: true,

PrimaryTermStartTimestamp: 0,
RealtimeStats: &querypb.RealtimeStats{ReplicationLagSeconds: 1, CpuUsage: 0.5},
}
<-resultChan

thirdTabletHealthStream <- &querypb.StreamHealthResponse{
TabletAlias: thirdTablet.Alias,
Target: &querypb.Target{Keyspace: "k", Shard: "s", TabletType: topodatapb.TabletType_REPLICA},
Serving: true,

PrimaryTermStartTimestamp: 0,
RealtimeStats: &querypb.RealtimeStats{ReplicationLagSeconds: 1, CpuUsage: 0.5},
}
<-resultChan

secondTabletPrimaryTermStartTimestamp := time.Now().Unix()

// Simulate a failover
firstTabletHealthStream <- &querypb.StreamHealthResponse{
TabletAlias: firstTablet.Alias,
Target: &querypb.Target{Keyspace: "k", Shard: "s", TabletType: topodatapb.TabletType_PRIMARY},
Serving: true,

PrimaryTermStartTimestamp: firstTabletPrimaryTermStartTimestamp,
RealtimeStats: &querypb.RealtimeStats{ReplicationLagSeconds: 0, CpuUsage: 0.5},
}
<-resultChan

secondTabletHealthStream <- &querypb.StreamHealthResponse{
TabletAlias: secondTablet.Alias,
Target: &querypb.Target{Keyspace: "k", Shard: "s", TabletType: topodatapb.TabletType_PRIMARY},
Serving: true,

PrimaryTermStartTimestamp: secondTabletPrimaryTermStartTimestamp,
RealtimeStats: &querypb.RealtimeStats{ReplicationLagSeconds: 0, CpuUsage: 0.5},
}
<-resultChan

hc.RemoveTablet(thirdTablet)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does removing an unrelated tablet trigger the bug?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see this

When RemoveTablet was called, the list of healthy tablets was recomputed, but the logic did not account for the fact that there can only be a single entry in the list of healthy primary tablets.


// `secondTablet` should be the primary now
expectedTabletStats := []*TabletHealth{{
Tablet: secondTablet,
Target: &querypb.Target{Keyspace: "k", Shard: "s", TabletType: topodatapb.TabletType_PRIMARY},
Serving: true,
Stats: &querypb.RealtimeStats{ReplicationLagSeconds: 0, CpuUsage: 0.5},
PrimaryTermStartTime: secondTabletPrimaryTermStartTimestamp,
}}

actualTabletStats := hc.GetHealthyTabletStats(&querypb.Target{Keyspace: "k", Shard: "s", TabletType: topodatapb.TabletType_PRIMARY})
mustMatch(t, expectedTabletStats, actualTabletStats, "unexpected result")
}

// TestGetHealthyTablets tests the functionality of GetHealthyTabletStats.
func TestGetHealthyTablets(t *testing.T) {
ctx := utils.LeakCheckContext(t)
Expand Down
Loading