feat(backup): delete backup in the backupstore asynchronously #3038

Open: wants to merge 1 commit into base: master

ChanYiLin
Contributor

ref: longhorn/longhorn#8746

Follows the LEP in longhorn/longhorn#9152 to implement:

  • Make backup deletion asynchronous
  • Add a Deleting state to Backup
    • Add an in-memory map of in-progress deletions so that the goroutine can pass a command failure to the controller.
    • Add a backoff to delay execution of the deletion command.
    • While a Backup is being deleted, it should follow the diagram below
  • Disallow backup creation while any backup is being deleted.
# Normal Case
Completed => Deleting
=> finalizer removed (CR is gone)

# Command failure Case
Completed => Deleting
=> Error (found the error message in the map)
=> Deleting (Retry the command) => finalizer removed (CR is gone)

# Controller crashes
Completed => Deleting
=> Error (failed to find the record in the map)
=> Deleting (Retry the command) => finalizer removed (CR is gone)

@ChanYiLin ChanYiLin self-assigned this Aug 1, 2024
@ChanYiLin ChanYiLin force-pushed the LH8746_make_backup_deletion_async_and_add_deleting_state branch 2 times, most recently from c809fd8 to 707f44c Compare August 1, 2024 14:04
@ChanYiLin
Contributor Author

ChanYiLin commented Aug 1, 2024

Test Plan

Normal Case

  1. Create a Volume
  2. Write small data and then create a BackupA
  3. Write large data (~2G) and then create a BackupB
  4. Write small data and then create a Snapshot
  5. Delete BackupB (large data) and, at the same time, create BackupC from the Snapshot (you can do this from the UI)
  6. BackupC will be in the Pending state with a message (waiting for BackupB to be deleted)
  7. After BackupB is deleted, BackupC should be in progress.

Error Case (use nfs)

  1. Create a Volume
  2. Write some data and then create a BackupA
  3. Write some data and then create a Snapshot
  4. Exec into the backupstore pod and make the backup.cfg immutable, for example:
    $ chattr +i backups/backup_backup-5640dfd33a054f98.cfg
    
  5. Delete BackupA and, at the same time, create BackupB from the Snapshot (you can do this from the UI)
  6. BackupA will alternate between the Deleting and Error states as the deletion is retried. In the Error state, it shows a permission-related error message
  7. BackupB will be InProgress when BackupA is in the Deleting state. BackupB should complete after a while.
  8. Remove the immutable attribute. After a while, BackupA should be in the Deleting state again and should be deleted successfully.

Controller Crashes Case (use nfs)

  1. Create a Volume
  2. Write some data and then create a BackupA
  3. Exec into the backupstore pod and make the backup.cfg immutable, for example:
    $ chattr +i backups/backup_backup-5640dfd33a054f98.cfg
    
  4. Delete BackupA
  5. BackupA will alternate between the Deleting and Error states as the deletion is retried. In the Error state, it shows a permission-related error message
  6. While BackupA is in the Deleting state, delete the longhorn manager pod directly. (You can find the one performing the deletion with Backup.Status.OwnerID)
  7. After the longhorn manager pod is recreated, BackupA should turn into the Error state with the message "No deletion in progress record, retry the deletion command"
  8. After a while, BackupA should be in the Deleting state again and should be deleted successfully once the immutable attribute is removed.

@ChanYiLin ChanYiLin force-pushed the LH8746_make_backup_deletion_async_and_add_deleting_state branch from 707f44c to 4e74975 Compare August 1, 2024 14:12
@ChanYiLin ChanYiLin marked this pull request as draft August 2, 2024 09:53
@ChanYiLin
Contributor Author

There is a regression
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/7389/console
test_backup_status_for_unavailable_replicas
I know the root cause of it, fixing...

@ChanYiLin ChanYiLin force-pushed the LH8746_make_backup_deletion_async_and_add_deleting_state branch from 4e74975 to 5e21ded Compare August 5, 2024 10:00
@ChanYiLin
Contributor Author

There is a regression https://ci.longhorn.io/job/private/job/longhorn-tests-regression/7389/console test_backup_status_for_unavailable_replicas I know the root cause of it, fixing...

Fixed.
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/7392/

@ChanYiLin ChanYiLin marked this pull request as ready for review August 5, 2024 10:49
@ChanYiLin ChanYiLin force-pushed the LH8746_make_backup_deletion_async_and_add_deleting_state branch from 5e21ded to cf4419c Compare August 9, 2024 07:52
@ChanYiLin
Contributor Author

Changed the backoff time to:
min: 1 minute -> max: 24 hours (one attempt per day)

@innobead innobead requested a review from shuo-wu August 11, 2024 16:14
@derekbit
Member

derekbit commented Sep 2, 2024

@ChanYiLin Is it ready for review?

@ChanYiLin
Contributor Author

Yes, it is ready for review.

@ChanYiLin
Contributor Author

Hi @derekbit
This is ready for review, thanks

Comment on lines +318 to 319
if !backupDeleted {
	return nil
}
Member


The PR introduces deletingBackoff to handle deletion backoff. handleErr() can also provide backoff if we return an error while the backup has not been deleted yet. Can we remove deletingBackoff and leverage handleErr() instead?

Contributor Author


I would prefer not to do that; it is not an error, after all.
Besides, handleErr logs "Failed to sync Longhorn backup",
which I think might confuse users.

We already use this kind of backoff a lot in our controllers.

Contributor Author

@ChanYiLin ChanYiLin Sep 24, 2024


Another point is that a dedicated backoff lets us control the backoff period.
With the k8s workqueue, we can't give this case a backoff time that differs from the other error cases.

ref: longhorn/longhorn 8746

Signed-off-by: Jack Lin <jack.lin@suse.com>
@ChanYiLin ChanYiLin force-pushed the LH8746_make_backup_deletion_async_and_add_deleting_state branch from cf4419c to 25dcfe8 Compare September 24, 2024 07:08
@ChanYiLin
Contributor Author

Hi @derekbit
I have updated the PR, please take a look, thanks!
Let me know if you have any concerns about the last comment.

Member

@derekbit derekbit left a comment


In general, LGTM

deletingMapLock: &sync.Mutex{},
inProgressDeletingMap: map[string]*DeletingStatus{},

deletingBackoff: flowcontrol.NewBackOff(time.Minute*1, time.Hour*24),
Member


Make time.Minute*1 and time.Hour*24 constants.

// We should consider the backup exists even the backupInfo is nil when it is in progress.
backupInfo, err := backupTargetClient.BackupGet(backupURL, backupTargetClient.Credential)
if err != nil && !types.ErrorIsNotFound(err) && !types.ErrorIsInProgress(err) {
log.WithError(err).Debugf("failed to check backup %v in the backupstore", backup.Name)
Member


Suggested change
- log.WithError(err).Debugf("failed to check backup %v in the backupstore", backup.Name)
+ log.WithError(err).Debugf("Failed to check backup %v in the backupstore", backup.Name)

return false
}
if _, err := bc.ds.UpdateBackupStatus(backup); err != nil {
log.WithError(err).Debugf("Backup %v update status error", backup.Name)
Member


Suggested change
- log.WithError(err).Debugf("Backup %v update status error", backup.Name)
+ log.WithError(err).Errorf("Backup %v update status error", backup.Name)
