
velero_backup_success_total metric keeps being reported for deleted schedules #1333

Closed
multi-io opened this issue Mar 29, 2019 · 29 comments · Fixed by #6715
Labels
Metrics (Related to prometheus metrics), Reviewed Q2 2021, staled

Comments

@multi-io

What steps did you take and what happened:

Create a schedule, let it perform a few backups, then delete the schedule.

The /metrics endpoint keeps reporting the velero_backup_success_total{schedule=""} metric, even after all the backups have expired, until velero itself is restarted.

What did you expect to happen:

The metric should vanish when the schedule is deleted or after all of its backups have expired, or it should stay around indefinitely (which I wouldn't prefer). In any case, the fact that it goes away when the velero pod is restarted suggests that this is an artifact of the implementation rather than desired behaviour, and it makes it harder to implement certain alerting rules.
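
The restart behaviour follows from how client_golang works: a labelled series on a CounterVec lives in the process's in-memory registry until it is explicitly deleted. A minimal, self-contained sketch (not Velero code; the metric and label names merely mirror the ones above):

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func main() {
	// Stand-in for the real velero_backup_success_total vector.
	backupSuccess := prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "velero_backup_success_total", Help: "demo counter"},
		[]string{"schedule"},
	)

	backupSuccess.WithLabelValues("my-schedule").Inc()
	fmt.Println(testutil.CollectAndCount(backupSuccess)) // 1: the series is exposed

	// Deleting the Schedule object in Kubernetes does nothing to this in-memory
	// state; only an explicit DeleteLabelValues (or a process restart) drops it.
	backupSuccess.DeleteLabelValues("my-schedule")
	fmt.Println(testutil.CollectAndCount(backupSuccess)) // 0: the series is gone
}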

@maberny

maberny commented Apr 4, 2019

Same thing happens with the ark_backup_failure_total metric.

@skriss skriss added the Metrics Related to prometheus metrics label Sep 12, 2019
@h4wkmoon

Hi,
I really love Velero. It does a really great job at its principal purpose: backups and restores.

But exposing nothing at all is better than incorrect metrics.
Could you either:

  • advertise that your metrics are not entirely reliable, or
  • correct or remove those which are not?

Thanks.

@stale

stale bot commented Jul 8, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the staled label Jul 8, 2021
@stale

stale bot commented Jul 22, 2021

Closing the stale issue.

@stale stale bot closed this as completed Jul 22, 2021
@h4wkmoon

Velero still has this issue. I think it should be reopened.

@sbkg0002

sbkg0002 commented Jan 26, 2022

The issue is still valid for the latest version!

@eleanor-millman can you re-open?

@eleanor-millman
Contributor

Sure!

@stale stale bot removed the staled label Feb 11, 2022
@stale

stale bot commented Apr 13, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the staled label Apr 13, 2022
@sbkg0002

Bump since this is still an issue.

@stale stale bot removed the staled label Apr 14, 2022
@guillaumefenollar

Really interested in a fix for this issue as well, and have been for years now. I'm considering writing a new exporter, but that would be counterproductive, and I lack the Golang skills to properly improve Velero's existing one with confidence.

@stale

stale bot commented Jul 6, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the staled label Jul 6, 2022
@sbkg0002

sbkg0002 commented Jul 7, 2022

Is there anyone with a possible fix? This keeps spamming our monitoring.

@h4wkmoon

h4wkmoon commented Jul 7, 2022

The workaround is still to delete the velero pods. Be sure to do it when no backups are running.

@stale stale bot removed the staled label Jul 7, 2022
@LuckySB

LuckySB commented Aug 5, 2022

up

@KlavsKlavsen
Contributor

Still a problem :(

@stale

stale bot commented Nov 26, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the staled label Nov 26, 2022
@stale

stale bot commented Dec 11, 2022

Closing the stale issue.

@stale stale bot closed this as completed Dec 11, 2022
@KlavsKlavsen
Contributor

@eleanor-millman this issue is still an open bug - can you re-open it and ask the stale bot not to close it just because no one has solved it yet?

@cwrau

cwrau commented May 26, 2023

We have the same problem, please reopen

@lorenzomorandini

Just faced the same issue. Deleting the pod works, but it's only a workaround. Can this be reopened?

@Neurobion

Currently, it is impossible to effectively monitor the status of backups. This is an important thing that should be fixed.

@draghuram
Contributor

@Neurobion, since the original problem was reported quite a while ago, would you mind describing it again after testing with the latest Velero? We at CloudCasa will see if we can provide a fix if the problem still exists.

@Neurobion

Neurobion commented Jun 29, 2023

I use tag v1.11.0 + velero-plugin-for-microsoft-azure:v1.6.0 and deploy through Helm. Everything works as it should, but after redeploying or deleting a specific schedule, this metric still reports it, which is not correct for monitoring purposes. I only need backup information for existing schedules.

Jiris-MacBook-Pro:velero xx$ velero get backup
NAME                                                        STATUS      ERRORS   WARNINGS   CREATED                          EXPIRES   STORAGE LOCATION   SELECTOR
man-backup                                                  Completed   0        0          2023-06-21 15:51:17 +0200 CEST   22d       default            <none>
velero-1687355039-XX-20230627103052   Completed   0        0          2023-06-27 12:30:52 +0200 CEST   28d       default            <none>
velero-1687355039-XX-20230626103051   Completed   0        0          2023-06-26 12:30:51 +0200 CEST   27d       default            <none>
velero-XX-pv-20230628103002              Completed   0        0          2023-06-28 12:30:02 +0200 CEST   29d       default            <none>

Jiris-MacBook-Pro:velero xx$ velero get schedules
NAME                              STATUS    CREATED                          SCHEDULE      BACKUP TTL   LAST BACKUP   SELECTOR   PAUSED
velero-XX-pv   Enabled   2023-06-28 09:32:51 +0200 CEST   30 10 * * *   720h0m0s     22h ago       <none>     false

velero_backup_last_successful_timestamp returns:

velero_backup_last_successful_timestamp{endpoint="http-monitoring", instance="10.142.118.80:8085", job="velero", namespace="velero", pod="velero-69d5544f8f-pbwcc", schedule="velero-1687355039-XX", service="velero"} | 1687861964
velero_backup_last_successful_timestamp{endpoint="http-monitoring", instance="10.142.118.80:8085", job="velero", namespace="velero", pod="velero-69d5544f8f-pbwcc", schedule="velero-XX-pv", service="velero"} | 1687948312

In my opinion, velero-1687355039-XX has no business being there because the schedule no longer exists.

And with alerting set up as ((time() - velero_backup_last_successful_timestamp) / 60 / 60 > 24), it doesn't work as it should.

PS: If that is not the goal of this metric, then another metric would be useful that only takes into account backups belonging to the current list of schedules.

@jmuleiro

jmuleiro commented Jul 4, 2023

Agree with @Neurobion. I recently set up alerts for Velero backups, only to find out that they don't work as expected. We need a fix, as these metrics are not reliable for monitoring purposes.

@jmuleiro

jmuleiro commented Jul 4, 2023

I've been looking for a workaround to this issue for hours. I checked the code (be aware I can barely understand Golang) and AFAIK this problem can be attributed to the backup finalizer controller, in particular these lines:

backupScheduleName := backupRequest.GetLabels()[velerov1api.ScheduleNameLabel]
switch backup.Status.Phase {
case velerov1api.BackupPhaseFinalizing:
	backup.Status.Phase = velerov1api.BackupPhaseCompleted
	r.metrics.RegisterBackupSuccess(backupScheduleName)
	r.metrics.RegisterBackupLastStatus(backupScheduleName, metrics.BackupLastStatusSucc)
case velerov1api.BackupPhaseFinalizingPartiallyFailed:
	backup.Status.Phase = velerov1api.BackupPhasePartiallyFailed
	r.metrics.RegisterBackupPartialFailure(backupScheduleName)
	r.metrics.RegisterBackupLastStatus(backupScheduleName, metrics.BackupLastStatusFailure)
}
backup.Status.CompletionTimestamp = &metav1.Time{Time: r.clock.Now()}
recordBackupMetrics(log, backup, outBackupFile, r.metrics, true)

It seems that the finalizer controller is the starting point when it comes to backup deletion. Backup CRDs get marked with a finalizer and subsequently a deletionTimestamp in Kubernetes, and then Velero starts the backup deletion process. The backup deletion itself could still be unsuccessful or get stuck somehow - this happened to me just yesterday - but the controller will still mark what is actually a deletion attempt as a successful backup. It doesn't make sense.

It may very well be that they did this instead of implementing new metrics and updating the Kubernetes CRDs to support these new backup status phases. Nonetheless, because this is the current behavior, the metrics being updated by the finalizer controller are rendered essentially useless.

In my opinion, it would be great if the team could release a hotfix deleting the lines pointed out above, as most of us desperately need this fixed to be able to set up alerts based on these metrics. If they are accepting pull requests, some of us would be willing to contribute to get this fixed as soon as possible.
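
For illustration only, and only if that reading is right: instead of removing those lines outright, a hypothetical guard in the same reconciler could skip metric registration for backups already marked for deletion. This is a sketch, not a reviewed patch; the variable names follow the quoted snippet, and the return shape is assumed from a standard controller-runtime Reconcile.

	// Hypothetical guard, assuming the reconciler is also entered for backups
	// that are being deleted: skip success/failure registration so a deletion
	// attempt is never counted as a completed backup.
	if backup.GetDeletionTimestamp() != nil {
		log.Debug("backup is marked for deletion; skipping backup metrics")
		return ctrl.Result{}, nil
	}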

@draghuram
Contributor

Thanks @Neurobion. We (at CloudCasa) will try to fix the issue.

@nilesh-akhade
Contributor

Currently, Velero does not clear the Prometheus counters after a schedule gets deleted. In the schedule's reconciler, we can add the following code to delete the counters, but that's not sufficient, because these counters are also updated from the backup reconcilers.

if c, ok := m.metrics[backupAttemptTotal].(*prometheus.CounterVec); ok {
    c.DeleteLabelValues(scheduleName)
}
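
To sketch that idea a bit further: a helper the schedule reconciler could call on deletion to drop every series labelled with that schedule. This is hypothetical code; the map keys, the GaugeVec case, and the ServerMetrics receiver are assumptions extrapolated from the snippet above, not the actual names in pkg/metrics, and m.metrics is assumed to hold prometheus collectors keyed by metric name. As noted, a backup reconciler finishing afterwards would still re-create the series, so this alone isn't a complete fix.

// RemoveSchedule drops every series carrying the deleted schedule's label.
// Key names and vector types are illustrative assumptions.
func (m *ServerMetrics) RemoveSchedule(scheduleName string) {
	for _, key := range []string{
		backupAttemptTotal,
		backupSuccessTotal,
		backupFailureTotal,
		backupLastSuccessfulTimestamp,
	} {
		switch v := m.metrics[key].(type) {
		case *prometheus.CounterVec:
			v.DeleteLabelValues(scheduleName)
		case *prometheus.GaugeVec:
			v.DeleteLabelValues(scheduleName)
		}
	}
}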

@jmuleiro

Hello @draghuram, any news on this issue?

@draghuram
Contributor

@nilesh-akhade created a PR that we are internally reviewing before submitting to Velero. Will post here when it is ready.
