Add timeout for backup/restore expose #6472
Merged
For backup expose, the exposer waits for the snapshot to be ready, creates a volume from the snapshot and a pod to consume it, and then the Velero data mover waits for the pod to reach running status.
For restore expose, the exposer dynamically provisions a volume and a pod to consume it, and then the Velero data mover waits for the pod to reach running status.
It is possible that, due to an unsatisfied condition, the volume creation from the snapshot or the dynamic volume provisioning hangs, so the pod never reaches running status.
One example is that the information in the storage class is wrong; as a result, the dynamic volume provisioning never finishes.
For both backup expose and restore expose, if this problem happens, the DataUpload/DataDownload will hang until a 4-hour timeout.
This PR adds a mechanism to track the time of the backup/restore expose and sets a timeout value; if the timeout is reached, the DataUpload/DataDownload will be marked as failed and any intermediate resources will be cleaned up.
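As an illustration only (a minimal sketch, not the code of this PR; `waitForExpose` and `podIsRunning` are made-up names), the idea is to bound the wait for the exposed pod with a timeout and, on expiry, fail the operation and clean up:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// podIsRunning stands in for checking the exposed pod's phase; here it always
// reports "not running" to mimic a hung volume provisioning.
func podIsRunning() bool { return false }

// waitForExpose polls until the pod reaches running status or the context,
// which carries the expose timeout, expires.
func waitForExpose(ctx context.Context) error {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return fmt.Errorf("expose not ready: %w", ctx.Err())
		case <-ticker.C:
			if podIsRunning() {
				return nil
			}
		}
	}
}

func main() {
	// The default described below would be 30 minutes; a short value is used
	// here so the sketch finishes quickly.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	if err := waitForExpose(ctx); err != nil && errors.Is(err, context.DeadlineExceeded) {
		// At this point the DataUpload/DataDownload would be marked as failed
		// and the intermediate resources (volume, consumer pod) cleaned up.
		fmt.Println(err)
	}
}
```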
At present, we set the timeout value to 30 minutes, and it is configurable by specifying a node-agent server parameter.

This PR also fixes a problem in the node-agent restart scenario. If node-agent restarts while a backup exposer is waiting for the snapshot to become ready, after the restart node-agent doesn't know which DataUploads are affected and therefore cannot cancel them. This mechanism backs the node-agent server in that case: any orphaned DataUploads that the node-agent server cannot cancel will fall into this timeout mechanism.
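For illustration, the configurable timeout could be wired as a node-agent server flag along these lines; this is a hedged sketch, and the flag name `data-mover-prepare-timeout` is an assumption rather than a confirmed parameter name from this PR.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

func main() {
	// Hypothetical flag name; the actual node-agent server parameter may differ.
	prepareTimeout := flag.Duration(
		"data-mover-prepare-timeout",
		30*time.Minute, // default mentioned above
		"how long to wait for the backup/restore expose (volume and pod) to become ready",
	)
	flag.Parse()

	fmt.Println("expose timeout:", *prepareTimeout)
	// This value would bound the context used while waiting for the exposed
	// pod to reach running status, as in the earlier sketch.
}
```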