
velero_backup_success_total metric keeps being reported for deleted schedules #1333

Closed
multi-io opened this issue Mar 29, 2019 · 29 comments · Fixed by #6715
Labels
Metrics (Related to prometheus metrics), Reviewed Q2 2021, staled

Comments

@multi-io

What steps did you take and what happened:

Create a schedule, let it perform a few backups, then delete the schedule.

The /metrics endpoint keeps reporting the velero_backup_success_total{schedule=""} metric, even after all the backups have expired, until velero itself is restarted.

What did you expect to happen:

The metric should vanish when the schedule is deleted or after all of its backups have expired, or it should stay around indefinitely (which I wouldn't prefer). In any case, the fact that it goes away when the velero pod is restarted suggests that this is an artifact of the implementation rather than desired behaviour, and it makes it harder to implement certain alerting rules.
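
The restart behaviour follows from how client_golang works: a labelled series on a CounterVec lives in the process's in-memory registry until it is explicitly deleted. A minimal, self-contained sketch (not Velero code; the metric and label names merely mirror the ones above):

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func main() {
	// Stand-in for the real velero_backup_success_total vector.
	backupSuccess := prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "velero_backup_success_total", Help: "demo counter"},
		[]string{"schedule"},
	)

	backupSuccess.WithLabelValues("my-schedule").Inc()
	fmt.Println(testutil.CollectAndCount(backupSuccess)) // 1: the series is exposed

	// Deleting the Schedule object in Kubernetes does nothing to this in-memory
	// state; only an explicit DeleteLabelValues (or a process restart) drops it.
	backupSuccess.DeleteLabelValues("my-schedule")
	fmt.Println(testutil.CollectAndCount(backupSuccess)) // 0: the series is gone
}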

@maberny

maberny commented Apr 4, 2019

Same thing happens with the ark_backup_failure_total metric.

@skriss skriss added the Metrics Related to prometheus metrics label Sep 12, 2019
@h4wkmoon

Hi,
I really love Velero. It does a really great job at its principal purpose: backups and restores.

But exposing nothing at all is better than incorrect metrics.
Could you either:

  • advertise that your metrics are not entirely reliable, or
  • correct or remove those which are not?

Thanks.

@stale

stale bot commented Jul 8, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the staled label Jul 8, 2021
@stale

stale bot commented Jul 22, 2021

Closing the stale issue.

@stale stale bot closed this as completed Jul 22, 2021
@h4wkmoon

Velero still has this issue. I think it should be reopened.

@sbkg0002

sbkg0002 commented Jan 26, 2022

The issue is still valid for the latest version!

@eleanor-millman can you re-open?

@eleanor-millman
Contributor

Sure!

@stale stale bot removed the staled label Feb 11, 2022
@stale

stale bot commented Apr 13, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the staled label Apr 13, 2022
@sbkg0002

Bump since this is still an issue.

@stale stale bot removed the staled label Apr 14, 2022
@guillaumefenollar

Really interested in a fix for this issue as well, and have been for years now. I'm considering writing a new exporter, but that would be counterproductive, and I lack the Golang skills to properly improve Velero's existing one with confidence.

@stale

stale bot commented Jul 6, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the staled label Jul 6, 2022
@sbkg0002

sbkg0002 commented Jul 7, 2022

Is there anyone with a possible fix? This keeps spamming our monitoring.

@h4wkmoon

h4wkmoon commented Jul 7, 2022

The workaround is still to delete the velero pods. Be sure to do it when no backups are running.

@stale stale bot removed the staled label Jul 7, 2022
@LuckySB

LuckySB commented Aug 5, 2022

up

@KlavsKlavsen
Contributor

Still a problem :(

@stale

stale bot commented Nov 26, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the staled label Nov 26, 2022
@stale

stale bot commented Dec 11, 2022

Closing the stale issue.

@stale stale bot closed this as completed Dec 11, 2022
@KlavsKlavsen
Contributor

@eleanor-millman this issue is still an open bug - can you re-open it and ask the stale bot not to close it just because no one has solved it yet?

@cwrau

cwrau commented May 26, 2023

We have the same problem, please reopen

@lorenzomorandini

Just faced the same issue. Deleting the pod works, but it's only a workaround. Can this be reopened?

@Neurobion

Currently, it is impossible to effectively monitor the status of backups. This is an important thing that should be fixed.

@draghuram
Contributor

@Neurobion, since the original problem was reported quite a while ago, would you mind describing it again after testing with the latest Velero? We at CloudCasa will see if we can provide a fix if the problem still exists.

@Neurobion

Neurobion commented Jun 29, 2023

I use tag v1.11.0 + velero-plugin-for-microsoft-azure:v1.6.0 and deploy through Helm. Everything works as it should, but after redeploying or deleting a specific schedule, this metric still reports it, which is not correct for monitoring purposes. I only need backup information for existing schedules.

Jiris-MacBook-Pro:velero xx$ velero get backup
NAME                                                        STATUS      ERRORS   WARNINGS   CREATED                          EXPIRES   STORAGE LOCATION   SELECTOR
man-backup                                                  Completed   0        0          2023-06-21 15:51:17 +0200 CEST   22d       default            <none>
velero-1687355039-XX-20230627103052   Completed   0        0          2023-06-27 12:30:52 +0200 CEST   28d       default            <none>
velero-1687355039-XX-20230626103051   Completed   0        0          2023-06-26 12:30:51 +0200 CEST   27d       default            <none>
velero-XX-pv-20230628103002              Completed   0        0          2023-06-28 12:30:02 +0200 CEST   29d       default            <none>

Jiris-MacBook-Pro:velero xx$ velero get schedules
NAME                              STATUS    CREATED                          SCHEDULE      BACKUP TTL   LAST BACKUP   SELECTOR   PAUSED
velero-XX-pv   Enabled   2023-06-28 09:32:51 +0200 CEST   30 10 * * *   720h0m0s     22h ago       <none>     false

velero_backup_last_successful_timestamp returns:

velero_backup_last_successful_timestamp{endpoint="http-monitoring", instance="10.142.118.80:8085", job="velero", namespace="velero", pod="velero-69d5544f8f-pbwcc", schedule="velero-1687355039-XX", service="velero"} | 1687861964
velero_backup_last_successful_timestamp{endpoint="http-monitoring", instance="10.142.118.80:8085", job="velero", namespace="velero", pod="velero-69d5544f8f-pbwcc", schedule="velero-XX-pv", service="velero"} | 1687948312

In my opinion, velero-1687355039-XX has no business being there because the schedule no longer exists.

And with alerting set up as ((time() - velero_backup_last_successful_timestamp) / 60 / 60 > 24), it doesn't work as it should.

PS: If that is not the goal of this metric, then another metric would be useful that only takes into account backups belonging to the current list of schedules.

@jmuleiro

jmuleiro commented Jul 4, 2023

Agree with @Neurobion. I recently set up alerts for Velero backups, only to find out that they don't work as expected. We need a fix, as these metrics are not reliable for monitoring purposes.

@jmuleiro

jmuleiro commented Jul 4, 2023

I've been looking for a workaround to this issue for hours. I checked the code (be aware I can barely understand Golang) and AFAIK this problem can be attributed to the backup finalizer controller, in particular these lines:

backupScheduleName := backupRequest.GetLabels()[velerov1api.ScheduleNameLabel]
switch backup.Status.Phase {
case velerov1api.BackupPhaseFinalizing:
	backup.Status.Phase = velerov1api.BackupPhaseCompleted
	r.metrics.RegisterBackupSuccess(backupScheduleName)
	r.metrics.RegisterBackupLastStatus(backupScheduleName, metrics.BackupLastStatusSucc)
case velerov1api.BackupPhaseFinalizingPartiallyFailed:
	backup.Status.Phase = velerov1api.BackupPhasePartiallyFailed
	r.metrics.RegisterBackupPartialFailure(backupScheduleName)
	r.metrics.RegisterBackupLastStatus(backupScheduleName, metrics.BackupLastStatusFailure)
}
backup.Status.CompletionTimestamp = &metav1.Time{Time: r.clock.Now()}
recordBackupMetrics(log, backup, outBackupFile, r.metrics, true)

It seems that the finalizer controller is the starting point when it comes to backup deletion. Backup CRDs get marked with a finalizer and subsequently a deletionTimestamp in Kubernetes, and then Velero starts the backup deletion process. The backup deletion itself could still be unsuccessful or get stuck somehow - this happened to me just yesterday - but the controller will still mark what is actually a deletion attempt as a successful backup. It doesn't make sense.

It may very well be that they did this instead of implementing new metrics and updating the Kubernetes CRDs to support these new backup status phases. Nonetheless, because this is the current behavior, the metrics being updated by the finalizer controller are rendered essentially useless.

In my opinion, it would be great if the team could release a hotfix deleting the lines pointed out above, as most of us desperately need this fixed to be able to set up alerts based on these metrics. If they are accepting pull requests, some of us would be willing to contribute to get this fixed as soon as possible.
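
For illustration only, and only if that reading is right: instead of removing those lines outright, a hypothetical guard in the same reconciler could skip metric registration for backups already marked for deletion. This is a sketch, not a reviewed patch; the variable names follow the quoted snippet, and the return shape is assumed from a standard controller-runtime Reconcile.

	// Hypothetical guard, assuming the reconciler is also entered for backups
	// that are being deleted: skip success/failure registration so a deletion
	// attempt is never counted as a completed backup.
	if backup.GetDeletionTimestamp() != nil {
		log.Debug("backup is marked for deletion; skipping backup metrics")
		return ctrl.Result{}, nil
	}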

@draghuram
Contributor

Thanks @Neurobion. We (at CloudCasa) will try to fix the issue.

@nilesh-akhade
Contributor

Currently, Velero does not clear the Prometheus counters after a schedule gets deleted. In the schedule's reconciler, we can add the following code to delete the counters, but that's not sufficient, because these counters are also updated from the backup reconcilers.

if c, ok := m.metrics[backupAttemptTotal].(*prometheus.CounterVec); ok {
    c.DeleteLabelValues(scheduleName)
}
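
To sketch that idea a bit further: a helper the schedule reconciler could call on deletion to drop every series labelled with that schedule. This is hypothetical code; the map keys, the GaugeVec case, and the ServerMetrics receiver are assumptions extrapolated from the snippet above, not the actual names in pkg/metrics, and m.metrics is assumed to hold prometheus collectors keyed by metric name. As noted, a backup reconciler finishing afterwards would still re-create the series, so this alone isn't a complete fix.

// RemoveSchedule drops every series carrying the deleted schedule's label.
// Key names and vector types are illustrative assumptions.
func (m *ServerMetrics) RemoveSchedule(scheduleName string) {
	for _, key := range []string{
		backupAttemptTotal,
		backupSuccessTotal,
		backupFailureTotal,
		backupLastSuccessfulTimestamp,
	} {
		switch v := m.metrics[key].(type) {
		case *prometheus.CounterVec:
			v.DeleteLabelValues(scheduleName)
		case *prometheus.GaugeVec:
			v.DeleteLabelValues(scheduleName)
		}
	}
}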

@jmuleiro

Hello @draghuram, any news on this issue?

@draghuram
Contributor

@nilesh-akhade created a PR that we are internally reviewing before submitting to Velero. Will post here when it is ready.
