Add timeout for backup/restore expose #6472
Merged
For backup expose, the exposer waits for the snapshot to be ready, creates a volume from the snapshot and a pod to consume it, and then the Velero data mover waits for the pod to reach running status.
For restore expose, the exposer dynamically provisions a volume and a pod to consume it, and then the Velero data mover waits for the pod to reach running status.
It is possible that, due to an unsatisfied condition, the volume creation from the snapshot or the dynamic volume provisioning hangs, so the pod never reaches running status.
One example is that the information in the storage class is wrong; as a result, the dynamic volume provisioning never finishes.
For both backup expose and restore expose, if this problem happens, the DataUpload/DataDownload will hang until a 4-hour timeout.
This PR adds a mechanism to track the time of the backup/restore expose and sets a timeout value; if the timeout is reached, the DataUpload/DataDownload will be marked as failed and any intermediate resources will be cleaned up.
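As an illustration only (a minimal sketch, not the code of this PR; `waitForExpose` and `podIsRunning` are made-up names), the idea is to bound the wait for the exposed pod with a timeout and, on expiry, fail the operation and clean up:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// podIsRunning stands in for checking the exposed pod's phase; here it always
// reports "not running" to mimic a hung volume provisioning.
func podIsRunning() bool { return false }

// waitForExpose polls until the pod reaches running status or the context,
// which carries the expose timeout, expires.
func waitForExpose(ctx context.Context) error {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return fmt.Errorf("expose not ready: %w", ctx.Err())
		case <-ticker.C:
			if podIsRunning() {
				return nil
			}
		}
	}
}

func main() {
	// The default described below would be 30 minutes; a short value is used
	// here so the sketch finishes quickly.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	if err := waitForExpose(ctx); err != nil && errors.Is(err, context.DeadlineExceeded) {
		// At this point the DataUpload/DataDownload would be marked as failed
		// and the intermediate resources (volume, consumer pod) cleaned up.
		fmt.Println(err)
	}
}
```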
At present, we set the timeout value to 30 minutes, and it is configurable by specifying a node-agent server parameter.

This PR also fixes a problem in the node-agent restart scenario. If node-agent restarts while a backup exposer is waiting for the snapshot to become ready, after the restart node-agent doesn't know which DataUploads are affected and therefore cannot cancel them. This mechanism backs the node-agent server in that case: any orphaned DataUploads that the node-agent server cannot cancel will fall into this timeout mechanism.
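For illustration, the configurable timeout could be wired as a node-agent server flag along these lines; this is a hedged sketch, and the flag name `data-mover-prepare-timeout` is an assumption rather than a confirmed parameter name from this PR.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

func main() {
	// Hypothetical flag name; the actual node-agent server parameter may differ.
	prepareTimeout := flag.Duration(
		"data-mover-prepare-timeout",
		30*time.Minute, // default mentioned above
		"how long to wait for the backup/restore expose (volume and pod) to become ready",
	)
	flag.Parse()

	fmt.Println("expose timeout:", *prepareTimeout)
	// This value would bound the context used while waiting for the exposed
	// pod to reach running status, as in the earlier sketch.
}
```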