Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ItemBlock model and phase 1 (single-thread) workflow changes #8102

Merged
merged 1 commit into from
Sep 9, 2024

Conversation

sseago
Copy link
Collaborator

@sseago sseago commented Aug 9, 2024

Thank you for contributing to Velero!

Please add a summary of your change

This PR adds the ItemBlock model and introduces the workflow changes for (single-threaded) ItemBlock functionality. ItemBlocks are created based on:

  1. ItemBlockAction plugins for an item which return other related items are placed in the same ItemBlock
  2. OrderedResources in the backup spec (ordered resources for a particular GroupResource are placed in the same ItemBlock to guarantee their relative backup order)
  3. Items of the same GroupResource+Namespace+Name but different APIVersions (only happens if EnableAPIGroupVersions feature flag is set) are placed in the same ItemBlock

ItemBlocks are then backed up together using the new kubernetesBackupper.backupItemBlock func which

  1. runs pre hooks for any pods in the ItemBlock
  2. calls BackupItem on each item in the ItemBlock
  3. runs post hooks for any pods in the ItemBlock

This change does not yet introduce worker threads to back up ItemBlocks in parallel -- that is out of scope for Velero 1.15

Does your change fix a particular issue?

Third PR for #7900

Please indicate you've done the following:

  • [ x] Accepted the DCO. Commits without the DCO will delay acceptance.
  • [ x] Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • [ x] Updated the corresponding documentation in site/content/docs/main.

@sseago
Copy link
Collaborator Author

sseago commented Aug 9, 2024

Converting to Draft for now until:

  1. Iba plugins #8054 is merged
  2. I get some more testing of actual backups and restores with this refactoring.

Copy link

codecov bot commented Aug 9, 2024

Codecov Report

Attention: Patch coverage is 80.14981% with 53 lines in your changes missing coverage. Please review.

Project coverage is 59.23%. Comparing base (8ae667e) to head (9d6f4d2).
Report is 15 commits behind head on main.

Files with missing lines Patch % Lines
pkg/backup/backup.go 83.22% 17 Missing and 9 partials ⚠️
pkg/backup/item_backupper.go 80.00% 6 Missing and 5 partials ⚠️
pkg/backup/itemblock.go 71.42% 4 Missing and 2 partials ⚠️
pkg/itemblock/itemblock.go 64.70% 5 Missing and 1 partial ⚠️
pkg/controller/backup_controller.go 50.00% 1 Missing and 2 partials ⚠️
internal/hook/wait_exec_hook_handler.go 66.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8102      +/-   ##
==========================================
+ Coverage   59.09%   59.23%   +0.14%     
==========================================
  Files         364      366       +2     
  Lines       30340    30520     +180     
==========================================
+ Hits        17929    18080     +151     
- Misses      10968    10986      +18     
- Partials     1443     1454      +11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@sseago sseago force-pushed the itemblock-workflow branch 2 times, most recently from e28c959 to 68cee28 Compare August 13, 2024 13:32
@sseago sseago marked this pull request as ready for review August 21, 2024 20:00
pkg/backup/backup.go Outdated Show resolved Hide resolved
pkg/backup/backup.go Outdated Show resolved Hide resolved
pkg/backup/backup.go Show resolved Hide resolved
pkg/backup/backup.go Outdated Show resolved Hide resolved
pkg/backup/backup.go Outdated Show resolved Hide resolved
pkg/backup/backup.go Outdated Show resolved Hide resolved
itemBlock.Log.WithError(err).WithField("name", item.Item.GetName()).Error("Error running hooks for pod")
// if pre hook fails, flag pod as backed-up and move on
itemBlock.itemBackupper.backupRequest.BackedUpItems[key] = struct{}{}
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we skip the pod, what about the related items in the same itemBlock? i.e., PVCs, secects, etc.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pod hook API's "fail on error" only calls for the pod itself to fail. Current Velero logic won't skip PVC, secrets, etc. either.

Copy link
Contributor

@Lyndon-Li Lyndon-Li Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. We do see some problems here though they also happen with the existing code:

  • The hook may be used to quiescence the application or prepare the backup data to the volume
  • When the hook fails and the pod is skipped, but the PVC may be still backed up
  • The consequence is that users either get a corrupted backup data or empty backup data

Do you think we can enhance this in future based on the new itemBlock implementation? Or do you think it is even harder to fix it with the new implementation?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Lyndon-Li I think we should defer this to the future, since we'll need to think carefully about the design. Do we always want to not restore items in the block that were returned by any IBA plugins for the pod if the pre-hook fails, or should this be part of the IBA plugin return value? If it's plugin-decided rather than optional, then we'll need to modify the IBA plugin return (with a v2 plugin). If we decide that we always want to skip related items for a pod if a hook fails, then we won't need that. But also, the model here doesn't track which item's IBA plugin (if any) was responsible for pulling the item into the block, so we'd need to add that too.

Also, this could introduce as many problems as it solves. If there are 2 pods in the block, either because of a shared RWX volume, or (in the future) due to the pods having volumes in the same VolumeGroup, then should one failed pod pre-hook cause all of the pods to be skipped in the backup? I think this should be a separate design discussion post-1.15.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be sufficient to just fail the pod -- the user will see the PartiallyFailed backup, note that the error was on a stateful pod and therefore assume the backup is bad anyway.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would agree that if we decide to make a fix here in the future, it will be easier to do with the ItemBlock implementation than without.

pkg/backup/backup.go Outdated Show resolved Hide resolved
pkg/backup/backup.go Outdated Show resolved Hide resolved
pkg/backup/backup.go Show resolved Hide resolved
itemGV, err := schema.ParseGroupVersion(item.Item.GetAPIVersion())
if err == nil && item.PreferredGVR.GroupVersion() == itemGV {
returnList = append(returnList, item)
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic does not look right.
The error branch will append the item even if the error was reported.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blackpiglet this func returns a list of items matching the GR (could be more than one if EnableAPIGroupVersions is set). The purpose of the PreferredGVR comparison is that we want to return the item corresponding to the preferred GVR first, followed by the rest. So we loop over the list and when we find an item that's the preferred GVR, we add it to returnList, else we add to itemList -- then at the end, we append itemList to preferredList. So after filtering out the non-matching items (first if), there are 3 possibilities:

  1. We parse preferredGVR, compare it to item GVR and it's a match, so put the item at the front of the list (if branch)
  2. We parse preferredGVR, compare it to item GVR and it's not a match, so put the item in the "other items" list (else branch)
  3. We fail to parse preferredGVR, so we just put the item in the "other items" list (else branch).

Basically if we ever fail to parse preferred GVR, then we just return the list of items in the order we have them. If we parse it and there's a match (there should be), then that one goes first. We really don't expect this to ever fail, since the preferred GVR string was determined by the item collector, so there's no reason it would fail parsing -- but since ParseGroupVersion has the possibility of returning an error, we need to handle that unlikely case -- but we don't want to error out if it happened, since it would be a non-fatal error. Maybe I should add a debug log here in the case that error is not nil and separate that if from the comparison test to make the logic clearer, though.

"namespace": namespace,
})

func (ib *itemBackupper) itemInclusionChecks(log logrus.FieldLogger, mustInclude bool, metadata metav1.Object, obj runtime.Unstructured, groupResource schema.GroupResource) bool {
if mustInclude {
log.Infof("Skipping the exclusion checks for this resource")
} else {
if metadata.GetLabels()[velerov1api.ExcludeFromBackupLabel] == "true" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to check the GroupResource here.
Only the PVC and PV need to go through the PV skip tracking logic.
I know it's legacy code, but the change could save some condition checks.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. So the tracking logic is only needed for PV/PC, but the exclusion checks overall apply to all types. i.e. any item is excludable, but the skipped PV tracking is only relevant for PVC/PV -- but that func already checks inside it. ExcludeFromBackupLabel, excluded namespace or resource type, etc. All of those checks are needed for even non-PV/PVC resources.

@blackpiglet
Copy link
Contributor

Maybe it's better to add a future enhancement, but it's better to know how the backed-up resources are organized with the ItemBlock, for example it can be displayed in the velero backup describe --details output.

@sseago
Copy link
Collaborator Author

sseago commented Aug 30, 2024

@blackpiglet Hmm. Yeah, we could definitely do that as a future enhancement. I guess we'd need to generate some sort of yaml file to upload to object storage -- basically when we generate an ItemBlock, before passing it on to start backup we could append yaml metadata, and at the end, we could upload this yaml to object store. I think it's fine to do in the future, since it won't really affect much of this code and we're too late in the 1.15 cycle for new design, etc.

Signed-off-by: Scott Seago <sseago@redhat.com>
@sseago
Copy link
Collaborator Author

sseago commented Sep 3, 2024

Updated PR based on feedback.

@shubham-pampattiwar shubham-pampattiwar merged commit b92143d into vmware-tanzu:main Sep 9, 2024
45 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants