DataUpload isn't canceled even though the Backup is marked as "Failed" when the Velero pod restarts #7230

Open
ywk253100 opened this issue Dec 19, 2023 · 12 comments
Labels: 2024 Q1 reviewed; Icebox (We see the value, but it is not slated for the next couple releases.)

Comments

@ywk253100
Contributor

ywk253100 commented Dec 19, 2023

Restart the Velero pod while the Backup CR is in the InProgress status (and the DataUpload CR is in the Accepted status). When the Velero pod starts up again, the Backup CR is marked as Failed, but the DataUpload CR isn't canceled, and after a while the Backup CR is marked as WaitingForPluginOperations and then Completed.

Here is the final status of the backup: the phase is Completed, but the failureReason is found a backup with status "InProgress" during the server starting, mark it as "Failed":

status:
  backupItemOperationsAttempted: 1
  backupItemOperationsCompleted: 1
  completionTimestamp: "2023-12-19T07:50:44Z"
  expiration: "2024-01-18T07:49:14Z"
  failureReason: found a backup with status "InProgress" during the server starting,
    mark it as "Failed"
  formatVersion: 1.1.0
  hookStatus:
    hooksAttempted: 1
  phase: Completed
  progress:
    itemsBackedUp: 32
    totalItems: 32
  startTimestamp: "2023-12-19T07:49:15Z"
  version: 1
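
Below is a minimal sketch of the behavior this issue asks for: when the restarted server marks an InProgress backup as Failed, it could also request cancellation of that backup's outstanding DataUploads. This is not Velero's actual implementation; the group/version, the velero.io/backup-name label, the spec.cancel field, and the helper name are assumptions based on the public Velero CRDs.

```go
// Hypothetical helper the startup logic could call right after marking the
// backup as Failed, so the uploads are aborted instead of continuing and
// flipping the backup back to WaitingForPluginOperations.
package startupfix

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

var dataUploadListGVK = schema.GroupVersionKind{
	Group:   "velero.io",
	Version: "v2alpha1",
	Kind:    "DataUploadList",
}

func cancelDataUploadsForBackup(ctx context.Context, c client.Client, ns, backup string) error {
	duList := &unstructured.UnstructuredList{}
	duList.SetGroupVersionKind(dataUploadListGVK)
	if err := c.List(ctx, duList,
		client.InNamespace(ns),
		client.MatchingLabels{"velero.io/backup-name": backup}, // assumed label
	); err != nil {
		return err
	}
	for i := range duList.Items {
		du := &duList.Items[i]
		phase, _, _ := unstructured.NestedString(du.Object, "status", "phase")
		// Skip DataUploads that are already in a terminal phase.
		if phase == "Completed" || phase == "Failed" || phase == "Canceled" {
			continue
		}
		// spec.cancel is observed by the node agent, which then aborts the upload.
		if err := unstructured.SetNestedField(du.Object, true, "spec", "cancel"); err != nil {
			return err
		}
		if err := c.Update(ctx, du); err != nil {
			return err
		}
	}
	return nil
}
```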

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@ywk253100 ywk253100 added this to the v1.13 milestone Dec 19, 2023
@ywk253100 ywk253100 removed this from the v1.13 milestone Dec 20, 2023
@qiuming-best
Contributor

It occurs during the Velero deployment's termination grace period:

the terminating Velero server pod is still running the data upload flow while the newly started Velero server pod is simultaneously marking the DataUpload CR as "Failed".

This corner case wouldn't happen if the Velero pod were OOM-killed, so we decided to treat the fix as low priority and postpone it.

@reasonerjt
Contributor

As @qiuming-best explained in this comment, this is not likely to happen in real usage scenarios.
A possible solution to this problem is to introduce a leader election mechanism so that two Velero servers never run at the same time.
This may be put into the backlog, but it is not very important for v1.14.
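
For illustration only, here is a minimal sketch of that idea using controller-runtime's built-in leader election. The Options fields are real controller-runtime settings; the lease name and the wiring into the Velero server are assumptions.

```go
// Minimal sketch: with LeaderElection enabled, the manager acquires a lease
// before starting its controllers, so a terminating Velero pod and its
// replacement can never reconcile the same CRs at the same time.
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "velero-server-lock", // hypothetical lease name
		LeaderElectionNamespace: "velero",
	})
	if err != nil {
		panic(err)
	}
	// Backup/DataUpload controllers would be registered with mgr here.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```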

@reasonerjt reasonerjt added the Icebox label (We see the value, but it is not slated for the next couple releases.) and removed the defer-candidate, backlog, and 1.14-candidate labels Mar 13, 2024

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

@kaovilai
Contributor

unstale

@github-actions github-actions bot removed the staled label May 15, 2024

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

@kaovilai
Contributor

unstale

@github-actions github-actions bot removed the staled label Jul 16, 2024

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

@blackpiglet
Contributor

unstale

@github-actions github-actions bot removed the staled label Sep 19, 2024
@rkashasl

rkashasl commented Oct 9, 2024

I have the same issue; restarting doesn't help, and it fails every time with the same error:

Name:         velero-daily-20241009130824
Namespace:    velero
Labels:       app.kubernetes.io/instance=velero
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=velero
              helm.sh/chart=velero-5.4.1
              helm.toolkit.fluxcd.io/name=velero
              helm.toolkit.fluxcd.io/namespace=velero
              velero.io/schedule-name=velero-daily
              velero.io/storage-location=default
Annotations:  meta.helm.sh/release-name: velero
              meta.helm.sh/release-namespace: velero
              velero.io/resource-timeout: 10m0s
              velero.io/source-cluster-k8s-gitversion: v1.29.8-eks-a737599
              velero.io/source-cluster-k8s-major-version: 1
              velero.io/source-cluster-k8s-minor-version: 29+
API Version:  velero.io/v1
Kind:         Backup
Metadata:
  Creation Timestamp:  2024-10-09T13:08:24Z
  Generation:          3
  Resource Version:    264000873
  UID:                 f8aeda37-a39d-42ad-848e-7752c2667927
Spec:
  Csi Snapshot Timeout:          10m0s
  Default Volumes To Fs Backup:  false
  Hooks:
  Include Cluster Resources:  true
  Item Operation Timeout:     4h0m0s
  Metadata:
  Snapshot Move Data:  false
  Snapshot Volumes:    true
  Storage Location:    default
  Ttl:                 168h0m0s
  Volume Snapshot Locations:
    default
Status:
  Completion Timestamp:  2024-10-09T13:09:23Z
  Expiration:            2024-10-16T13:08:24Z
  Failure Reason:        found a backup with status "InProgress" during the server starting, mark it as "Failed"
  Format Version:        1.1.0
  Phase:                 Failed
  Start Timestamp:       2024-10-09T13:08:24Z
  Version:               1
Events:                  <none>

@blackpiglet
Contributor

@rkashasl
I think your error is not related to this issue.
Please try increasing the Velero deployment's resource requests and limits to resolve it.

@rkashasl

@rkashasl I think your error is not related to this issue. Please try increasing the Velero deployment's resource requests and limits to resolve it.

Resources are fine
(screenshot omitted)

However, when I completely removed Velero from the cluster, including all CRDs, and then reconciled Flux to get it back, all backups after provisioning completed successfully. But then I ran the command
velero backup create --from-schedule velero-daily
and when I checked the backup status, it went from InProgress to Failed with the same error as before:

Name:         velero-daily-20241010070123
Namespace:    velero
Labels:       app.kubernetes.io/instance=velero
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=velero
              helm.sh/chart=velero-5.4.1
              helm.toolkit.fluxcd.io/name=velero
              helm.toolkit.fluxcd.io/namespace=velero
              velero.io/schedule-name=velero-daily
              velero.io/storage-location=default
Annotations:  meta.helm.sh/release-name: velero
              meta.helm.sh/release-namespace: velero
              velero.io/resource-timeout: 10m0s
              velero.io/source-cluster-k8s-gitversion: v1.29.8-eks-a737599
              velero.io/source-cluster-k8s-major-version: 1
              velero.io/source-cluster-k8s-minor-version: 29+
API Version:  velero.io/v1
Kind:         Backup
Metadata:
  Creation Timestamp:  2024-10-10T07:01:23Z
  Generation:          3
  Resource Version:    264911127
  UID:                 273a47b5-9b1a-4d01-a9c5-bc510e3b5c47
Spec:
  Csi Snapshot Timeout:          10m0s
  Default Volumes To Fs Backup:  false
  Hooks:
  Include Cluster Resources:  true
  Item Operation Timeout:     4h0m0s
  Metadata:
  Snapshot Move Data:  false
  Snapshot Volumes:    true
  Storage Location:    default
  Ttl:                 168h0m0s
  Volume Snapshot Locations:
    default
Status:
  Completion Timestamp:  2024-10-10T07:02:04Z
  Expiration:            2024-10-17T07:01:23Z
  Failure Reason:        found a backup with status "InProgress" during the server starting, mark it as "Failed"
  Format Version:        1.1.0
  Phase:                 Failed
  Start Timestamp:       2024-10-10T07:01:24Z
  Version:               1
Events:                  <none>

@rkashasl

I increased the memory requests to 1Gi and the limits to 2Gi, and also adjusted the CPU requests to 250m, and everything started to work as it should.
Btw, before I did that I noticed in the pod logs that the Velero server restarted during the backup procedure.
However, I don't think this gives a clear picture of what the problem is, since in Grafana I don't see CPU or memory utilization above 30% of the requests; maybe you can take this into account.
