Implement the Repo maintenance Job configuration design.
Remove the resource parameters from the velero server CLI.

Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
blackpiglet committed Sep 1, 2024
1 parent 3408ffe commit 87fb132
Showing 21 changed files with 779 additions and 379 deletions.
1 change: 1 addition & 0 deletions changelogs/unreleased/8145-blackpiglet
@@ -0,0 +1 @@
Implement the Repo maintenance Job configuration.
114 changes: 67 additions & 47 deletions design/repo_maintenance_job_config.md
@@ -1,7 +1,7 @@
# Repository maintenance job configuration design

## Abstract
This design lets the repository maintenance job read its configuration from a dedicated ConfigMap and makes the necessary parts of the Job configurable, e.g. `PodSpec.Affinity` and `PodSpec.resources`.
This design lets the repository maintenance job read its configuration from a dedicated ConfigMap and makes the necessary parts of the Job configurable, e.g. `PodSpec.Affinity` and `PodSpec.Resources`.

## Background
Repository maintenance was split out of the Velero server into a k8s Job in v1.14 by the [repository maintenance job](Implemented/repository-maintenance.md) design.
@@ -49,46 +49,41 @@ velero server \
### Structure
The data structure for ```repo-maintenance-job-config``` is as below:
```go
type MaintenanceConfigMap map[string]Configs

type Configs struct {
// LoadAffinity is the config for data path load affinity.
LoadAffinity []*LoadAffinity `json:"loadAffinity,omitempty"`

// Resources is the config for the CPU and memory resources setting.
Resource Resources `json:"resources,omitempty"`
// PodResources is the config for the CPU and memory resources setting.
PodResources *kube.PodResources `json:"podResources,omitempty"`
}

type LoadAffinity struct {
// NodeSelector specifies the label selector to match nodes
NodeSelector metav1.LabelSelector `json:"nodeSelector"`
}

type Resources struct {
// The repository maintenance job CPU request setting
CPURequest string `json:"cpuRequest,omitempty"`

// The repository maintenance job memory request setting
MemRequest string `json:"memRequest,omitempty"`

// The repository maintenance job CPU limit setting
CPULimit string `json:"cpuLimit,omitempty"`

// The repository maintenance job memory limit setting
MemLimit string `json:"memLimit,omitempty"`
type PodResources struct {
CPURequest string `json:"cpuRequest,omitempty"`
MemoryRequest string `json:"memoryRequest,omitempty"`
CPULimit string `json:"cpuLimit,omitempty"`
MemoryLimit string `json:"memoryLimit,omitempty"`
}
```
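
As a rough illustration of how the structure above could be consumed, the sketch below decodes a JSON payload into `map[string]Configs`. It assumes the whole map arrives as one JSON document, as in the sample file in this design. `parseMaintenanceConfig` is a hypothetical helper, and `LabelSelector` is a simplified stand-in for `metav1.LabelSelector` so that the example needs only the standard library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LabelSelector is a simplified stand-in (assumption) for metav1.LabelSelector.
type LabelSelector struct {
	MatchLabels map[string]string `json:"matchLabels,omitempty"`
}

// LoadAffinity mirrors the design's LoadAffinity with the simplified selector.
type LoadAffinity struct {
	NodeSelector LabelSelector `json:"nodeSelector"`
}

// PodResources mirrors the design's PodResources fields and JSON tags.
type PodResources struct {
	CPURequest    string `json:"cpuRequest,omitempty"`
	MemoryRequest string `json:"memoryRequest,omitempty"`
	CPULimit      string `json:"cpuLimit,omitempty"`
	MemoryLimit   string `json:"memoryLimit,omitempty"`
}

// Configs mirrors the per-key value of the configuration map.
type Configs struct {
	LoadAffinity []*LoadAffinity `json:"loadAffinity,omitempty"`
	PodResources *PodResources   `json:"podResources,omitempty"`
}

// parseMaintenanceConfig (hypothetical) decodes the JSON document into a
// map keyed by "global" or by a BackupRepository key.
func parseMaintenanceConfig(data []byte) (map[string]Configs, error) {
	var cfg map[string]Configs
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, fmt.Errorf("decode maintenance job config: %w", err)
	}
	return cfg, nil
}

func main() {
	raw := []byte(`{"global": {"podResources": {"cpuRequest": "100m", "memoryLimit": "200Mi"}}}`)
	cfg, err := parseMaintenanceConfig(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg["global"].PodResources.CPURequest)
}
```

Because `PodResources` is a pointer and `LoadAffinity` is a slice, a key that omits either field decodes to `nil`, which lets a caller tell "not set" apart from "set to empty values".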

The ConfigMap content is a map.
If there is a key value as `global` in the map, the key's value is applied to all BackupRepositories maintenance jobs that don't their own specific configuration in the ConfigMap.
If there is a key value as `global` in the map, the key's value is applied to all BackupRepositories maintenance jobs that cannot find their own specific configuration in the ConfigMap.
Each of the other keys in the map is a combination of three elements of a BackupRepository:
* The namespace in which BackupRepository backs up volume data
* The BackupRepository referenced BackupStorageLocation's name
* The BackupRepository's type. Possible values are `kopia` and `restic`
* The namespace in which BackupRepository backs up volume data.
* The BackupRepository referenced BackupStorageLocation's name.
* The BackupRepository's type. Possible values are `kopia` and `restic`.

Together, these three elements identify a [unique BackupRepository](https://github.com/vmware-tanzu/velero/blob/2fc6300f2239f250b40b0488c35feae59520f2d3/pkg/repository/backup_repo_op.go#L32-L37).
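
The key composition described above can be sketched as follows. `repoKey` is a hypothetical helper, not a function from the Velero code base; it only assumes the hyphen-joined order of namespace, BackupStorageLocation name, and repository type that the examples in this design use.

```go
package main

import "fmt"

// repoKey (hypothetical) joins the three identifying elements of a
// BackupRepository into the ConfigMap key form used in this design.
func repoKey(volumeNamespace, bslName, repoType string) string {
	return fmt.Sprintf("%s-%s-%s", volumeNamespace, bslName, repoType)
}

func main() {
	// Namespace "test", BackupStorageLocation "default", type "kopia".
	fmt.Println(repoKey("test", "default", "kopia")) // prints "test-default-kopia"
}
```

Note that Kubernetes names may themselves contain hyphens, so a key built this way is meant to be looked up, not split back into its parts.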

If a key matches a BackupRepository, that key's value is applied to the BackupRepository's maintenance jobs.
This way, users can provide the configuration before the BackupRepository is created.
This is especially convenient for administrators who configure it during the Velero installation.
For example, the following BackupRepository's key should be `test-default-kopia`
For example, the following BackupRepository's key should be `test-default-kopia`.

``` yaml
- apiVersion: velero.io/v1
kind: BackupRepository
@@ -119,11 +114,11 @@ A sample of the ```repo-maintenance-job-config``` ConfigMap is as below:
cat <<EOF > repo-maintenance-job-config.json
{
"global": {
"resources": {
"podResources": {
"cpuRequest": "100m",
"cpuLimit": "200m",
"memRequest": "100Mi",
"memLimit": "200Mi"
"memoryRequest": "100Mi",
"memoryLimit": "200Mi"
},
"loadAffinity": [
{
@@ -177,18 +172,18 @@ config := Configs {
LoadAffinity: nil,
// Resources is the config for the CPU and memory resources setting.
Resources: Resources{
PodResources: &kube.PodResources{
// The repository maintenance job CPU request setting
CPURequest: "0m",
// The repository maintenance job memory request setting
MemRequest: "0Mi",
MemoryRequest: "0Mi",
// The repository maintenance job CPU limit setting
CPULimit: "0m",
// The repository maintenance job memory limit setting
MemLimit: "0Mi",
MemoryLimit: "0Mi",
},
}
```
@@ -204,17 +199,32 @@ For example, the ConfigMap content has two elements.
``` json
{
"global": {
"resources": {
"loadAffinity": [
{
"nodeSelector": {
"matchExpressions": [
{
"key": "cloud.google.com/machine-family",
"operator": "In",
"values": [
"e2"
]
}
]
}
}
],
"podResources": {
"cpuRequest": "100m",
"cpuLimit": "200m",
"memRequest": "100Mi",
"memLimit": "200Mi"
"memoryRequest": "100Mi",
"memoryLimit": "200Mi"
}
},
"ns1-default-kopia": {
"resources": {
"memRequest": "400Mi",
"memLimit": "800Mi"
"podResources": {
"memoryRequest": "400Mi",
"memoryLimit": "800Mi"
}
}
}
@@ -223,19 +233,29 @@ The config value used for BackupRepository backing up volume data in namespace `
``` go
config := Configs {
// LoadAffinity is the config for data path load affinity.
LoadAffinity: nil,
// The repository maintenance job CPU request setting
CPURequest: "100m",
// The repository maintenance job memory request setting
MemRequest: "400Mi",
// The repository maintenance job CPU limit setting
CPULimit: "200m",
// The repository maintenance job memory limit setting
MemLimit: "800Mi",
LoadAffinity: []*kube.LoadAffinity{
{
NodeSelector: metav1.LabelSelector{
MatchExpressions: []metav1.LabelSelectorRequirement{
{
Key: "cloud.google.com/machine-family",
Operator: metav1.LabelSelectorOpIn,
Values: []string{"e2"},
},
},
},
},
},
PodResources: &kube.PodResources{
// The repository maintenance job CPU request setting
CPURequest: "",
// The repository maintenance job memory request setting
MemoryRequest: "400Mi",
// The repository maintenance job CPU limit setting
CPULimit: "",
// The repository maintenance job memory limit setting
MemoryLimit: "800Mi",
},
}
```
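
One way to read the example above is that a repository-specific entry overrides the `global` entry field by field at the top level: `LoadAffinity` falls back to `global` when the specific entry omits it, while a specific `PodResources` replaces the global one as a whole. The sketch below implements that reading under stated assumptions. `resolveConfig` is a hypothetical helper, the types are simplified stand-ins, and the actual merge logic in Velero may differ.

```go
package main

import "fmt"

// Simplified stand-ins (assumptions) for the design's types.
type LoadAffinity struct {
	MatchLabels map[string]string
}

type PodResources struct {
	CPURequest, MemoryRequest, CPULimit, MemoryLimit string
}

type Configs struct {
	LoadAffinity []*LoadAffinity
	PodResources *PodResources
}

// resolveConfig (hypothetical) starts from the "global" entry and lets each
// non-nil top-level field of the repository-specific entry replace it.
func resolveConfig(all map[string]Configs, key string) Configs {
	result := all["global"] // zero value when no global entry exists
	specific, ok := all[key]
	if !ok {
		return result
	}
	if specific.LoadAffinity != nil {
		result.LoadAffinity = specific.LoadAffinity
	}
	if specific.PodResources != nil {
		result.PodResources = specific.PodResources
	}
	return result
}

func main() {
	all := map[string]Configs{
		"global": {
			LoadAffinity: []*LoadAffinity{
				{MatchLabels: map[string]string{"cloud.google.com/machine-family": "e2"}},
			},
			PodResources: &PodResources{
				CPURequest: "100m", CPULimit: "200m",
				MemoryRequest: "100Mi", MemoryLimit: "200Mi",
			},
		},
		"ns1-default-kopia": {
			PodResources: &PodResources{MemoryRequest: "400Mi", MemoryLimit: "800Mi"},
		},
	}
	cfg := resolveConfig(all, "ns1-default-kopia")
	fmt.Printf("affinities=%d resources=%+v\n", len(cfg.LoadAffinity), *cfg.PodResources)
}
```

Under this reading the CPU settings of the specific repository end up empty, because the global `PodResources` is replaced rather than merged value by value, which matches the `CPURequest: ""` and `CPULimit: ""` values in the example.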

8 changes: 0 additions & 8 deletions pkg/cmd/cli/install/install.go
@@ -23,7 +23,6 @@ import (
"strings"
"time"

"github.com/vmware-tanzu/velero/pkg/repository"
"github.com/vmware-tanzu/velero/pkg/uploader"

"github.com/pkg/errors"
@@ -85,7 +84,6 @@ type Options struct {
DefaultSnapshotMoveData bool
DisableInformerCache bool
ScheduleSkipImmediately bool
MaintenanceCfg repository.MaintenanceConfig
}

// BindFlags adds command line values to the options struct.
@@ -130,11 +128,6 @@ func (o *Options) BindFlags(flags *pflag.FlagSet) {
flags.BoolVar(&o.DefaultSnapshotMoveData, "default-snapshot-move-data", o.DefaultSnapshotMoveData, "Bool flag to configure Velero server to move data by default for all snapshots supporting data movement. Optional.")
flags.BoolVar(&o.DisableInformerCache, "disable-informer-cache", o.DisableInformerCache, "Disable informer cache for Get calls on restore. With this enabled, it will speed up restore in cases where there are backup resources which already exist in the cluster, but for very large clusters this will increase velero memory usage. Default is false (don't disable). Optional.")
flags.BoolVar(&o.ScheduleSkipImmediately, "schedule-skip-immediately", o.ScheduleSkipImmediately, "Skip the first scheduled backup immediately after creating a schedule. Default is false (don't skip).")
flags.IntVar(&o.MaintenanceCfg.KeepLatestMaitenanceJobs, "keep-latest-maintenance-jobs", o.MaintenanceCfg.KeepLatestMaitenanceJobs, "Number of latest maintenance jobs to keep each repository. Optional.")
flags.StringVar(&o.MaintenanceCfg.CPURequest, "maintenance-job-cpu-request", o.MaintenanceCfg.CPURequest, "CPU request for maintenance jobs. Default is no limit.")
flags.StringVar(&o.MaintenanceCfg.MemRequest, "maintenance-job-mem-request", o.MaintenanceCfg.MemRequest, "Memory request for maintenance jobs. Default is no limit.")
flags.StringVar(&o.MaintenanceCfg.CPULimit, "maintenance-job-cpu-limit", o.MaintenanceCfg.CPULimit, "CPU limit for maintenance jobs. Default is no limit.")
flags.StringVar(&o.MaintenanceCfg.MemLimit, "maintenance-job-mem-limit", o.MaintenanceCfg.MemLimit, "Memory limit for maintenance jobs. Default is no limit.")
}

// NewInstallOptions instantiates a new, default InstallOptions struct.
@@ -231,7 +224,6 @@ func (o *Options) AsVeleroOptions() (*install.VeleroOptions, error) {
DefaultSnapshotMoveData: o.DefaultSnapshotMoveData,
DisableInformerCache: o.DisableInformerCache,
ScheduleSkipImmediately: o.ScheduleSkipImmediately,
MaintenanceCfg: o.MaintenanceCfg,
}, nil
}

18 changes: 16 additions & 2 deletions pkg/cmd/cli/nodeagent/server.go
@@ -293,7 +293,7 @@ func (s *nodeAgentServer) run() {
s.logger.WithError(err).Fatal("Unable to create the pod volume restore controller")
}

var loadAffinity *nodeagent.LoadAffinity
var loadAffinity *kube.LoadAffinity
if s.dataPathConfigs != nil && len(s.dataPathConfigs.LoadAffinity) > 0 {
loadAffinity = s.dataPathConfigs.LoadAffinity[0]
s.logger.Infof("Using customized loadAffinity %v", loadAffinity)
@@ -315,7 +315,21 @@ func (s *nodeAgentServer) run() {
}
}

dataUploadReconciler := controller.NewDataUploadReconciler(s.mgr.GetClient(), s.mgr, s.kubeClient, s.csiSnapshotClient.SnapshotV1(), s.dataPathMgr, loadAffinity, backupPVCConfig, podResources, clock.RealClock{}, s.nodeName, s.config.dataMoverPrepareTimeout, s.logger, s.metrics)
dataUploadReconciler := controller.NewDataUploadReconciler(
s.mgr.GetClient(),
s.mgr,
s.kubeClient,
s.csiSnapshotClient.SnapshotV1(),
s.dataPathMgr,
loadAffinity,
backupPVCConfig,
podResources,
clock.RealClock{},
s.nodeName,
s.config.dataMoverPrepareTimeout,
s.logger,
s.metrics,
)
if err = dataUploadReconciler.SetupWithManager(s.mgr); err != nil {
s.logger.WithError(err).Fatal("Unable to create the data upload controller")
}
50 changes: 34 additions & 16 deletions pkg/cmd/server/server.go
@@ -139,8 +139,8 @@ type serverConfig struct {
defaultSnapshotMoveData bool
disableInformerCache bool
scheduleSkipImmediately bool
maintenanceCfg repository.MaintenanceConfig
backukpRepoConfig string
backupRepoConfig string
repoMaintenanceJobConfig string
}

func NewCommand(f client.Factory) *cobra.Command {
@@ -172,9 +172,6 @@ func NewCommand(f client.Factory) *cobra.Command {
defaultSnapshotMoveData: false,
disableInformerCache: defaultDisableInformerCache,
scheduleSkipImmediately: false,
maintenanceCfg: repository.MaintenanceConfig{
KeepLatestMaitenanceJobs: repository.DefaultKeepLatestMaitenanceJobs,
},
}
)

@@ -248,17 +245,20 @@ func NewCommand(f client.Factory) *cobra.Command {
command.Flags().BoolVar(&config.defaultSnapshotMoveData, "default-snapshot-move-data", config.defaultSnapshotMoveData, "Move data by default for all snapshots supporting data movement.")
command.Flags().BoolVar(&config.disableInformerCache, "disable-informer-cache", config.disableInformerCache, "Disable informer cache for Get calls on restore. With this enabled, it will speed up restore in cases where there are backup resources which already exist in the cluster, but for very large clusters this will increase velero memory usage. Default is false (don't disable).")
command.Flags().BoolVar(&config.scheduleSkipImmediately, "schedule-skip-immediately", config.scheduleSkipImmediately, "Skip the first scheduled backup immediately after creating a schedule. Default is false (don't skip).")
command.Flags().IntVar(&config.maintenanceCfg.KeepLatestMaitenanceJobs, "keep-latest-maintenance-jobs", config.maintenanceCfg.KeepLatestMaitenanceJobs, "Number of latest maintenance jobs to keep each repository. Optional.")
command.Flags().StringVar(&config.maintenanceCfg.CPURequest, "maintenance-job-cpu-request", config.maintenanceCfg.CPURequest, "CPU request for maintenance job. Default is no limit.")
command.Flags().StringVar(&config.maintenanceCfg.MemRequest, "maintenance-job-mem-request", config.maintenanceCfg.MemRequest, "Memory request for maintenance job. Default is no limit.")
command.Flags().StringVar(&config.maintenanceCfg.CPULimit, "maintenance-job-cpu-limit", config.maintenanceCfg.CPULimit, "CPU limit for maintenance job. Default is no limit.")
command.Flags().StringVar(&config.maintenanceCfg.MemLimit, "maintenance-job-mem-limit", config.maintenanceCfg.MemLimit, "Memory limit for maintenance job. Default is no limit.")

command.Flags().StringVar(&config.backukpRepoConfig, "backup-repository-config", config.backukpRepoConfig, "The name of configMap containing backup repository configurations.")
command.Flags().StringVar(
&config.backupRepoConfig,
"backup-repository-config",
config.backupRepoConfig,
"The name of configMap containing backup repository configurations.",
)
command.Flags().StringVar(
&config.repoMaintenanceJobConfig,
"repo-maintenance-job-config",
config.repoMaintenanceJobConfig,
"The name of ConfigMap containing repository maintenance Job configurations.",
)

// maintenance job log setting inherited from velero server
config.maintenanceCfg.FormatFlag = config.formatFlag
config.maintenanceCfg.LogLevelFlag = logLevelFlag
return command
}

@@ -667,7 +667,18 @@ func (s *server) initRepoManager() error {
s.repoLocker = repository.NewRepoLocker()
s.repoEnsurer = repository.NewEnsurer(s.mgr.GetClient(), s.logger, s.config.resourceTimeout)

s.repoManager = repository.NewManager(s.namespace, s.mgr.GetClient(), s.repoLocker, s.repoEnsurer, s.credentialFileStore, s.credentialSecretStore, s.config.maintenanceCfg, s.logger)
s.repoManager = repository.NewManager(
s.namespace,
s.mgr.GetClient(),
s.repoLocker,
s.repoEnsurer,
s.credentialFileStore,
s.credentialSecretStore,
s.config.repoMaintenanceJobConfig,
s.logger,
s.logLevel,
s.config.formatFlag,
)

return nil
}
@@ -881,7 +892,14 @@ func (s *server) runControllers(defaultVolumeSnapshotLocations map[string]string
}

if _, ok := enabledRuntimeControllers[controller.BackupRepo]; ok {
if err := controller.NewBackupRepoReconciler(s.namespace, s.logger, s.mgr.GetClient(), s.config.repoMaintenanceFrequency, s.config.backukpRepoConfig, s.repoManager).SetupWithManager(s.mgr); err != nil {
if err := controller.NewBackupRepoReconciler(
s.namespace,
s.logger,
s.mgr.GetClient(),
s.config.repoMaintenanceFrequency,
s.config.backupRepoConfig,
s.repoManager,
).SetupWithManager(s.mgr); err != nil {
s.logger.Fatal(err, "unable to create controller", "controller", controller.BackupRepo)
}
}
10 changes: 5 additions & 5 deletions pkg/controller/backup_repository_controller.go
@@ -55,19 +55,19 @@ type BackupRepoReconciler struct {
logger logrus.FieldLogger
clock clocks.WithTickerAndDelayedExecution
maintenanceFrequency time.Duration
backukpRepoConfig string
backupRepoConfig string
repositoryManager repository.Manager
}

func NewBackupRepoReconciler(namespace string, logger logrus.FieldLogger, client client.Client,
maintenanceFrequency time.Duration, backukpRepoConfig string, repositoryManager repository.Manager) *BackupRepoReconciler {
maintenanceFrequency time.Duration, backupRepoConfig string, repositoryManager repository.Manager) *BackupRepoReconciler {
c := &BackupRepoReconciler{
client,
namespace,
logger,
clocks.RealClock{},
maintenanceFrequency,
backukpRepoConfig,
backupRepoConfig,
repositoryManager,
}

@@ -229,7 +229,7 @@ func (r *BackupRepoReconciler) getIdentiferByBSL(ctx context.Context, req *veler
}

func (r *BackupRepoReconciler) initializeRepo(ctx context.Context, req *velerov1api.BackupRepository, log logrus.FieldLogger) error {
log.WithField("repoConfig", r.backukpRepoConfig).Info("Initializing backup repository")
log.WithField("repoConfig", r.backupRepoConfig).Info("Initializing backup repository")

// confirm the repo's BackupStorageLocation is valid
repoIdentifier, err := r.getIdentiferByBSL(ctx, req)
@@ -244,7 +244,7 @@ func (r *BackupRepoReconciler) initializeRepo(ctx context.Context, req *velerov1
})
}

config, err := getBackupRepositoryConfig(ctx, r, r.backukpRepoConfig, r.namespace, req.Name, req.Spec.RepositoryType, log)
config, err := getBackupRepositoryConfig(ctx, r, r.backupRepoConfig, r.namespace, req.Name, req.Spec.RepositoryType, log)
if err != nil {
log.WithError(err).Warn("Failed to get repo config, repo config is ignored")
} else if config != nil {