DataUpload isn't canceled even though the Backup is marked as "Failed" when the Velero pod restarts #7230

Open
ywk253100 opened this issue Dec 19, 2023 · 12 comments
Labels: 2024 Q1 reviewed; Icebox (We see the value, but it is not slated for the next couple releases.)

Comments

@ywk253100
Contributor

ywk253100 commented Dec 19, 2023

Restart the Velero pod while the Backup CR is in the InProgress status (and the DataUpload CR is in the Accepted status). When the Velero pod starts up again, the Backup CR is marked as Failed, but the DataUpload CR isn't canceled, and after a while the Backup CR is marked as WaitingForPluginOperations and then Completed.

Here is the final status of the backup: the phase is Completed, but the failureReason is found a backup with status "InProgress" during the server starting, mark it as "Failed":

status:
  backupItemOperationsAttempted: 1
  backupItemOperationsCompleted: 1
  completionTimestamp: "2023-12-19T07:50:44Z"
  expiration: "2024-01-18T07:49:14Z"
  failureReason: found a backup with status "InProgress" during the server starting,
    mark it as "Failed"
  formatVersion: 1.1.0
  hookStatus:
    hooksAttempted: 1
  phase: Completed
  progress:
    itemsBackedUp: 32
    totalItems: 32
  startTimestamp: "2023-12-19T07:49:15Z"
  version: 1
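
Below is a minimal sketch of the behavior this issue asks for: when the restarted server marks an InProgress backup as Failed, it could also request cancellation of that backup's outstanding DataUploads. This is not Velero's actual implementation; the group/version, the velero.io/backup-name label, the spec.cancel field, and the helper name are assumptions based on the public Velero CRDs.

```go
// Hypothetical helper the startup logic could call right after marking the
// backup as Failed, so the uploads are aborted instead of continuing and
// flipping the backup back to WaitingForPluginOperations.
package startupfix

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

var dataUploadListGVK = schema.GroupVersionKind{
	Group:   "velero.io",
	Version: "v2alpha1",
	Kind:    "DataUploadList",
}

func cancelDataUploadsForBackup(ctx context.Context, c client.Client, ns, backup string) error {
	duList := &unstructured.UnstructuredList{}
	duList.SetGroupVersionKind(dataUploadListGVK)
	if err := c.List(ctx, duList,
		client.InNamespace(ns),
		client.MatchingLabels{"velero.io/backup-name": backup}, // assumed label
	); err != nil {
		return err
	}
	for i := range duList.Items {
		du := &duList.Items[i]
		phase, _, _ := unstructured.NestedString(du.Object, "status", "phase")
		// Skip DataUploads that are already in a terminal phase.
		if phase == "Completed" || phase == "Failed" || phase == "Canceled" {
			continue
		}
		// spec.cancel is observed by the node agent, which then aborts the upload.
		if err := unstructured.SetNestedField(du.Object, true, "spec", "cancel"); err != nil {
			return err
		}
		if err := c.Update(ctx, du); err != nil {
			return err
		}
	}
	return nil
}
```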

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@ywk253100 ywk253100 added this to the v1.13 milestone Dec 19, 2023
@ywk253100 ywk253100 removed this from the v1.13 milestone Dec 20, 2023
@qiuming-best
Contributor

It occurs during the Velero deployment's termination grace period:

the terminating Velero server pod is still running the data upload flow while the newly started Velero server pod is simultaneously marking the DataUpload CR as "Failed".

This corner case wouldn't happen if the Velero pod were OOM-killed, so we decided to treat the fix as low priority and postpone it.

@reasonerjt
Contributor

As @qiuming-best explained in this comment, this is not likely to happen in real usage scenarios.
A possible solution to this problem is to introduce a leader election mechanism so that two Velero servers never run at the same time.
This may be put into the backlog, but it is not very important for v1.14.
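
For illustration only, here is a minimal sketch of that idea using controller-runtime's built-in leader election. The Options fields are real controller-runtime settings; the lease name and the wiring into the Velero server are assumptions.

```go
// Minimal sketch: with LeaderElection enabled, the manager acquires a lease
// before starting its controllers, so a terminating Velero pod and its
// replacement can never reconcile the same CRs at the same time.
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "velero-server-lock", // hypothetical lease name
		LeaderElectionNamespace: "velero",
	})
	if err != nil {
		panic(err)
	}
	// Backup/DataUpload controllers would be registered with mgr here.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```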

@reasonerjt reasonerjt added the Icebox label (We see the value, but it is not slated for the next couple releases.) and removed the defer-candidate, backlog, and 1.14-candidate labels Mar 13, 2024

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

@kaovilai
Contributor

unstale

@github-actions github-actions bot removed the staled label May 15, 2024

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

@kaovilai
Contributor

unstale

@github-actions github-actions bot removed the staled label Jul 16, 2024

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

@blackpiglet
Contributor

unstale

@github-actions github-actions bot removed the staled label Sep 19, 2024
@rkashasl

rkashasl commented Oct 9, 2024

I have the same issue; restarting doesn't help, and it fails every time with the same error:

Name:         velero-daily-20241009130824
Namespace:    velero
Labels:       app.kubernetes.io/instance=velero
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=velero
              helm.sh/chart=velero-5.4.1
              helm.toolkit.fluxcd.io/name=velero
              helm.toolkit.fluxcd.io/namespace=velero
              velero.io/schedule-name=velero-daily
              velero.io/storage-location=default
Annotations:  meta.helm.sh/release-name: velero
              meta.helm.sh/release-namespace: velero
              velero.io/resource-timeout: 10m0s
              velero.io/source-cluster-k8s-gitversion: v1.29.8-eks-a737599
              velero.io/source-cluster-k8s-major-version: 1
              velero.io/source-cluster-k8s-minor-version: 29+
API Version:  velero.io/v1
Kind:         Backup
Metadata:
  Creation Timestamp:  2024-10-09T13:08:24Z
  Generation:          3
  Resource Version:    264000873
  UID:                 f8aeda37-a39d-42ad-848e-7752c2667927
Spec:
  Csi Snapshot Timeout:          10m0s
  Default Volumes To Fs Backup:  false
  Hooks:
  Include Cluster Resources:  true
  Item Operation Timeout:     4h0m0s
  Metadata:
  Snapshot Move Data:  false
  Snapshot Volumes:    true
  Storage Location:    default
  Ttl:                 168h0m0s
  Volume Snapshot Locations:
    default
Status:
  Completion Timestamp:  2024-10-09T13:09:23Z
  Expiration:            2024-10-16T13:08:24Z
  Failure Reason:        found a backup with status "InProgress" during the server starting, mark it as "Failed"
  Format Version:        1.1.0
  Phase:                 Failed
  Start Timestamp:       2024-10-09T13:08:24Z
  Version:               1
Events:                  <none>

@blackpiglet
Contributor

@rkashasl
I think your error is not related to this issue.
Please try increasing the Velero deployment's resource requests and limits to resolve it.

@rkashasl

@rkashasl I think your error is not related to this issue. Please try increasing the Velero deployment's resource requests and limits to resolve it.

Resources are fine
(screenshot omitted)

However, when I completely removed Velero from the cluster, including all CRDs, and then reconciled Flux to get it back, all backups after provisioning completed successfully. But then I ran the command
velero backup create --from-schedule velero-daily
and when I checked the backup status, it went from InProgress to Failed with the same error as before:

Name:         velero-daily-20241010070123
Namespace:    velero
Labels:       app.kubernetes.io/instance=velero
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=velero
              helm.sh/chart=velero-5.4.1
              helm.toolkit.fluxcd.io/name=velero
              helm.toolkit.fluxcd.io/namespace=velero
              velero.io/schedule-name=velero-daily
              velero.io/storage-location=default
Annotations:  meta.helm.sh/release-name: velero
              meta.helm.sh/release-namespace: velero
              velero.io/resource-timeout: 10m0s
              velero.io/source-cluster-k8s-gitversion: v1.29.8-eks-a737599
              velero.io/source-cluster-k8s-major-version: 1
              velero.io/source-cluster-k8s-minor-version: 29+
API Version:  velero.io/v1
Kind:         Backup
Metadata:
  Creation Timestamp:  2024-10-10T07:01:23Z
  Generation:          3
  Resource Version:    264911127
  UID:                 273a47b5-9b1a-4d01-a9c5-bc510e3b5c47
Spec:
  Csi Snapshot Timeout:          10m0s
  Default Volumes To Fs Backup:  false
  Hooks:
  Include Cluster Resources:  true
  Item Operation Timeout:     4h0m0s
  Metadata:
  Snapshot Move Data:  false
  Snapshot Volumes:    true
  Storage Location:    default
  Ttl:                 168h0m0s
  Volume Snapshot Locations:
    default
Status:
  Completion Timestamp:  2024-10-10T07:02:04Z
  Expiration:            2024-10-17T07:01:23Z
  Failure Reason:        found a backup with status "InProgress" during the server starting, mark it as "Failed"
  Format Version:        1.1.0
  Phase:                 Failed
  Start Timestamp:       2024-10-10T07:01:24Z
  Version:               1
Events:                  <none>

@rkashasl

I increased the memory requests to 1Gi and the limits to 2Gi, and also adjusted the CPU requests to 250m, and everything started to work as it should.
Btw, before I did that I noticed in the pod logs that the Velero server restarted during the backup procedure.
However, I don't think this gives a clear picture of what the problem is, since in Grafana I don't see CPU or memory utilization above 30% of the requests; maybe you can take this into account.
