Support adaptive update interval for low resolution ts #1484

Merged

MyonKeminta
Contributor

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
@ti-chi-bot ti-chi-bot bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has signed the dco. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 1, 2024
@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 5, 2024
@MyonKeminta MyonKeminta force-pushed the m/adaptive-low-resolution-tso-update-interval branch from 56c7714 to 19c6546 Compare November 6, 2024 09:41
@MyonKeminta MyonKeminta changed the title [WIP] Support adaptive update interval for low resolution ts Support adaptive update interval for low resolution ts Nov 6, 2024
@MyonKeminta MyonKeminta marked this pull request as ready for review November 6, 2024 09:41
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 6, 2024
// time.
// WARNING: This method does not guarantee whether the generated timestamp is legal for accessing the data.
// Neither is it safe to use it for verifying the legality of another calculated timestamp.
// Be sure to validate the timestamp before using it to access the data.

Contributor

How about deprecating usages of this interface in the future, or integrating the ValidateSnapshotReadTS check into it?

It's more robust to return a verified timestamp.

Contributor Author

I've considered this. It makes sense, but it can't be done easily given the current usage in the TiDB repo.

Contributor

Can we modify GetStaleTimestamp to return the latest fetched timestamp, without adding the arrival duration?

Contributor

After discussion with @MyonKeminta, the refactor can be done in another PR, as it requires a lot of work.

Contributor Author

> Can we modify GetStaleTimestamp to return the latest fetched timestamp, without adding the arrival duration?

What you describe is exactly what GetLowResolutionTSO does. If we deprecate the GetStaleTimestamp method, the change can be complicated and difficult given the current usage in the TiDB repo. Simply changing all usages of GetStaleTimestamp to GetLowResolutionTSO may significantly reduce the precision of user-specified staleness.

oracle/oracles/pd.go (outdated)

adaptiveUpdateIntervalState struct {
	// The mutex to avoid racing between updateTS goroutine and SetLowResolutionTimestampUpdateInterval.
	mu sync.Mutex

Contributor

How about moving the fields to be protected into mu, like this:

mu {
    var1
    var2
}

Contributor Author

But it's actually not protected. There are also some accesses without locking.🤔

	o.adaptiveLastTSUpdateInterval.Store(int64(configuredInterval))
}
if o.adaptiveUpdateIntervalState.state != adaptiveUpdateTSIntervalStateUnadjustable {
	logutil.Logger(context.Background()).Info("update low resolution ts interval is not being adaptive because the configured interval is too short",

Contributor

Suggested change
- logutil.Logger(context.Background()).Info("update low resolution ts interval is not being adaptive because the configured interval is too short",
+ logutil.BgLogger().Info("update low resolution ts interval is not being adaptive because the configured interval is too short",

prevAdaptiveInterval := currentAdaptiveInterval
currentAdaptiveInterval = max(requiredStaleness-adaptiveUpdateTSIntervalShrinkingPreserve, minAllowedAdaptiveUpdateTSInterval)
o.adaptiveLastTSUpdateInterval.Store(int64(currentAdaptiveInterval))
logutil.Logger(context.Background()).Info("shrink low resolution ts update interval immediately",

Contributor

ditto.

Comment on lines 384 to 385
zap.Duration("requestedStaleness", requiredStaleness),
zap.Duration("prevAdaptiveUpdateInterval", prevAdaptiveInterval),

Contributor

Why don't these log field names match the variable names?


func (o *pdOracle) getCurrentTSForValidation(ctx context.Context, opt *oracle.Option) (uint64, error) {
	ch := o.tsForValidation.DoChan(opt.TxnScope, func() (interface{}, error) {
		//metrics.ValidateReadTSFromPDCount.Inc()

Contributor

debug code?

Contributor Author

Ah, there should be a metric here.

// current ctx.
res, err := o.GetTimestamp(context.Background(), opt)
// After finishing the current call, allow the next call to trigger fetching a new TS.
o.tsForValidation.Forget(opt.TxnScope)

Contributor

No need to call Forget?

Contributor Author

No, it's not needed. I just realized that I had misunderstood how singleflight is used.

// If the call that triggers the execution of this function is canceled by the context, other calls that are
// waiting for reusing the same result should not be canceled. So pass context.Background() instead of the
// current ctx.
res, err := o.GetTimestamp(context.Background(), opt)

Contributor

Suggested change
- res, err := o.GetTimestamp(context.Background(), opt)
+ res, err := o.GetTimestamp(ctx, opt)

Contributor Author

Using context.Background() here is expected, as said in the above comments.

Comment on lines +652 to +658
estimatedCurrentTS, err := o.getStaleTimestamp(opt.TxnScope, 0)
if err != nil {
	logutil.Logger(ctx).Warn("failed to estimate current ts by getSlateTimestamp for auto-adjusting update low resolution ts interval",
		zap.Error(err), zap.Uint64("readTS", readTS), zap.String("txnScope", opt.TxnScope))
} else {
	o.adjustUpdateLowResolutionTSIntervalWithRequestedStaleness(readTS, estimatedCurrentTS, time.Now())
}

Contributor

I'm sorry, I don't understand why we need this logic.

Contributor Author

We estimate the gap between the current time and the readTS to see whether it is shorter than the update interval; if it is, we need to shrink the update interval.


configuredInterval := time.Duration(o.lastTSUpdateInterval.Load())
currentAdaptiveInterval := time.Duration(o.adaptiveLastTSUpdateInterval.Load())
if configuredInterval <= minAllowedAdaptiveUpdateTSInterval {

Contributor

If configuredInterval <= minAllowedAdaptiveUpdateTSInterval is true, then adaptiveLastTSUpdateInterval can be set to configuredInterval, so minAllowedAdaptiveUpdateTSInterval doesn't behave as its name suggests.

How about defining a minUpdateTSInterval, with both lastTSUpdateInterval and adaptiveLastTSUpdateInterval required to be no less than minUpdateTSInterval?

Contributor Author

minAllowedAdaptiveUpdateTSInterval is meant to limit the adaptive update ts interval, so that the mechanism does not automatically perform updates too frequently. If the user intentionally configures a short interval, we adopt the user's choice.

Contributor Author

@MyonKeminta MyonKeminta Nov 8, 2024

By the way, I tried to refine the code of the nextUpdateInterval function by extracting each branch into a small closure and unifying the log printing. Please see if it's better than the previous version.

const (
	// minAllowedAdaptiveUpdateTSInterval is the lower bound of the adaptive update ts interval for avoiding an abnormal
	// read operation causing the update interval to be too short.
	minAllowedAdaptiveUpdateTSInterval = 500 * time.Millisecond

Contributor

The minimum value of the TiDB variable tidb_low_resolution_tso_update_interval is 10ms, so why is minAllowedAdaptiveUpdateTSInterval 500ms?

Contributor Author

To prevent the adaptive update interval mechanism from automatically and silently choosing an interval so low that it fetches ts from PD too frequently. But we do not reject a low update interval if the user configures it explicitly and it is just what the user expects.

nextState := func(checkFuncs ...func() (adaptiveUpdateTSIntervalState, time.Duration)) time.Duration {
	for _, f := range checkFuncs {
		state, newInterval := f()
		if state == none {

Contributor

Should it return here if the state is adaptiveUpdateTSIntervalStateUnadjustable, instead of continuing processing?

Contributor Author

If state is not none, it should return at line 454

Contributor

@cfzjywxk cfzjywxk left a comment

Rest LGTM

@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Nov 8, 2024
latestTS, err := o.GetLowResolutionTimestamp(ctx, opt)
// If we fail to get latestTS or the readTS exceeds it, get a timestamp from PD to double-check.
// But we don't need to strictly fetch the latest TS. So if there are already concurrent calls to this function
// loading the latest TS, we can just reuse the same result to avoid too many concurrent GetTS calls.

Contributor

The current implementation is enough I think. This is just a minor suggestion.

Using singleflight to reduce GetTS calls is reasonable, but a valid ts may be judged invalid in this case:

  1. Thread 1 validates current_ts and sends a GetTS request; PD allocates its latest TSO, but the response is slow.
  2. Thread 2 validates current_ts, and singleflight reuses thread 1's TSO.
  3. Thread 1's GetTS request receives its response.

Then thread 2 can get a stale TSO from PD, which increases the possibility of failure.

A possible enhancement is to resend a singleflight GetTS request if readTS > latestTS, to recheck the timestamp in this case.

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Nov 11, 2024
ti-chi-bot bot commented Nov 11, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cfzjywxk, crazycs520

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cfzjywxk,crazycs520]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot bot commented Nov 11, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-11-08 11:13:25.792549727 +0000 UTC m=+9167.983418723: ☑️ agreed by cfzjywxk.
  • 2024-11-11 07:45:51.012984421 +0000 UTC m=+255913.203853417: ☑️ agreed by crazycs520.

@ti-chi-bot ti-chi-bot bot merged commit 23531ad into tikv:master Nov 11, 2024
12 checks passed
@MyonKeminta MyonKeminta deleted the m/adaptive-low-resolution-tso-update-interval branch November 11, 2024 09:31
MyonKeminta added a commit to MyonKeminta/client-go that referenced this pull request Nov 11, 2024
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
ti-chi-bot bot pushed a commit that referenced this pull request Nov 11, 2024
 

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Labels
approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.