[CELEBORN-932] Fix worker register after graceful restart
### What changes were proposed in this pull request?

In HA mode, a worker's first registration attempt fails after the worker restarts gracefully. The worker only becomes registered after one heartbeat.

<img width="889" alt="image" src="https://github.com/apache/incubator-celeborn/assets/19429353/371aa0e0-b2e9-4c1f-9e40-276dc1460219">

This happens because the master reuses the same `requestId` when submitting requests, so the second request is not processed correctly because of the Ratis `RetryCache`.

The master logs look like this:

- (worker gracefully stops) Master: Receive ReportNodeFailure
- (worker starts) Master: Received RegisterWorker request
- Master: Received heartbeat from unknown worker
- Master: Registered worker

This PR therefore improves `AbstractMetaManager#updateRegisterWorkerMeta` to cover the `WorkerRemove` logic. For backward compatibility, and to avoid possible inconsistencies during a rolling upgrade, it temporarily fixes the duplicate `requestId` and keeps the remove function. The `WorkerRemove` logic can be removed in a future version.

### Why are the changes needed?

Ditto

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Cluster test

Closes #1863 from onebox-li/fix-restart-register.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
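
For readers unfamiliar with retry caches, the following is a minimal sketch of the failure mode described above. It is a toy model, not Celeborn or Ratis code: `ToyMaster`, `submit`, the string operations, and the request IDs are all hypothetical names chosen for illustration. It assumes only that a retry cache answers a request whose ID it has already seen from the cache, without applying it again.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RetryCacheSketch {

    // Hypothetical stand-in for a replicated state machine guarded by a retry cache.
    static class ToyMaster {
        private final Set<String> workers = new HashSet<>();
        // Maps a request ID to the reply produced the first time it was applied.
        private final Map<String, String> retryCache = new HashMap<>();

        String submit(String requestId, String op, String worker) {
            String cached = retryCache.get(requestId);
            if (cached != null) {
                // Same ID seen before: treated as a retry, so the operation is NOT applied.
                return cached;
            }
            if (op.equals("REMOVE")) {
                workers.remove(worker);
            } else if (op.equals("REGISTER")) {
                workers.add(worker);
            } else {
                throw new IllegalArgumentException("unknown op: " + op);
            }
            String reply = op + " applied";
            retryCache.put(requestId, reply);
            return reply;
        }

        boolean isRegistered(String worker) {
            return workers.contains(worker);
        }
    }

    // Remove and register share one request ID: the register is swallowed by the cache.
    static boolean registerWithSharedId() {
        ToyMaster master = new ToyMaster();
        master.submit("req-0", "REGISTER", "worker-1"); // registration before the restart
        master.submit("req-1", "REMOVE", "worker-1");
        master.submit("req-1", "REGISTER", "worker-1"); // answered from the cache
        return master.isRegistered("worker-1");
    }

    // Each request gets its own ID: both operations are applied.
    static boolean registerWithDistinctIds() {
        ToyMaster master = new ToyMaster();
        master.submit("req-0", "REGISTER", "worker-1");
        master.submit("req-1", "REMOVE", "worker-1");
        master.submit("req-2", "REGISTER", "worker-1");
        return master.isRegistered("worker-1");
    }

    public static void main(String[] args) {
        System.out.println("shared id registered:   " + registerWithSharedId());
        System.out.println("distinct ids registered: " + registerWithDistinctIds());
    }
}
```

In this model the worker stays unregistered when both requests carry the same ID, and is registered when the IDs differ, which matches the symptom in the logs: the worker is unknown to the master until a later request goes through.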