update ring with new ip address when instance is lost, rejoins, but heartbeat is disabled #6271

CharlieTLe · 2024-10-16T02:54:00Z

What this PR does:
Updates the ring with the instance's new IP address when an instance is lost and then rejoins while heartbeat is disabled.

An instance can become lost, for example, when it exits without announcing that it is leaving the ring. If a new instance is then created with the same name but a new IP address, it rejoins the ring and reclaims the tokens it once had. However, it does not update the ring with the new IP address at which it can be reached. This causes problems for services that use the ring to look up the address of a ring member.

Which issue(s) this PR fixes:
Fixes #

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]


yeya24 · 2024-10-16T19:04:16Z

@CharlieTLe Is there a specific use case or component you are looking at for this feature?
We can persist tokens to a file (stored in a PVC) so that instances rejoin the ring with the same tokens by reusing that token file.

CharlieTLe · 2024-10-16T20:07:46Z

@yeya24

Yeah, it's to handle the case where an ingester changes its IP address but doesn't go through the typical lifecycle of leaving the ring while doing so. Then there's a discrepancy between the ring description and the actual description of the ingester.

This specific issue isn't about reclaiming the tokens.

Here's what can happen:

1. An ingester (id=ingester-0, addr=1.1.1.1) has heartbeat_period=0 and is in the memberlist ring with state=ACTIVE.
2. The node that the ingester is running on is abruptly terminated.
3. A new node is created for the ingester (id=ingester-0, addr=2.2.2.2) to run on.
4. The ingester (id=ingester-0, addr=2.2.2.2) joins the ring and reclaims its tokens.
5. The ring description is still using the ingester's old address information (id=ingester-0, addr=1.1.1.1).

A symptom of the problem is the distributor logging a warning that it cannot reach the ingester, because it is still using the ingester's old address:

{
  "addr": "1.1.1.1:9095",
  "caller": "pool.go:184",
  "level": "warn",
  "msg": "removing ingester failing healthcheck",
  "reason": "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
}
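To make the sequence above concrete, here is a minimal, self-contained sketch of the reconciliation step. It is not the diff in this PR and does not use the actual lifecycler types: `InstanceDesc`, `Ring`, and `Rejoin` are hypothetical, simplified stand-ins, and the token value is made up for illustration. The idea is only that a rejoining instance keeps its stored tokens but overwrites a stale address with the one it is reachable at now.

```go
package main

import "fmt"

// InstanceDesc is a hypothetical, simplified stand-in for a ring entry.
type InstanceDesc struct {
	Addr   string
	State  string
	Tokens []uint32
}

// Ring maps an instance ID to its description (hypothetical simplification).
type Ring map[string]InstanceDesc

// Rejoin models an instance that finds its own ID already in the ring.
// The stored tokens are left untouched (reclaimed as-is). If the stored
// address differs from the address the instance is reachable at now,
// the stale address is overwritten. It reports whether the entry changed.
func (r Ring) Rejoin(id, currentAddr string) bool {
	desc, ok := r[id]
	if !ok {
		// Unknown instance: nothing to reconcile in this sketch.
		return false
	}
	if desc.Addr == currentAddr {
		// Address already matches: no ring update needed.
		return false
	}
	desc.Addr = currentAddr
	r[id] = desc
	return true
}

func main() {
	// Ring state left behind by the terminated node (old address).
	ring := Ring{
		"ingester-0": {Addr: "1.1.1.1:9095", State: "ACTIVE", Tokens: []uint32{42}},
	}

	// The replacement instance comes up with the same ID and a new address.
	changed := ring.Rejoin("ingester-0", "2.2.2.2:9095")
	fmt.Println(changed, ring["ingester-0"].Addr)
}
```

With the address check skipped, the entry would keep `1.1.1.1:9095` indefinitely when heartbeats are disabled, which matches the distributor warning shown above.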

update ring with new ip address when instance is lost, rejoins, but h…

e14b66d

…eartbeat is disabled Signed-off-by: Charlie Le <charlie_le@apple.com>

pull-request-size bot added the size/M label Oct 16, 2024

dosubot bot added the component/ring label Oct 16, 2024

CharlieTLe added 2 commits October 16, 2024 10:45

Fix lint error (ineffassign)

86d0e5a

```
pkg/ring/lifecycler_test.go:728:2: ineffectual assignment to l2 (ineffassign)
	l2 = startIngesterAndWaitActive("ing2", "0.0.0.0")
	^
```

Signed-off-by: Charlie Le <charlie_le@apple.com>

Update changelog

90a4c67

Signed-off-by: Charlie Le <charlie_le@apple.com>

yeya24 requested a review from alanprot October 25, 2024 05:45
