
CLOUPD-193335: Deletion Protection Serverless Private Endpoints #1093

Merged
merged 1 commit into main from CLOUDP-193335/deletion-protection-spe on Sep 1, 2023

Conversation

josvazg
Collaborator

@josvazg josvazg commented Aug 16, 2023

All Submissions:

  • Have you signed our CLA?
  • Put closes #XXXX in your comment to auto-close the issue that your PR fixes (if there is one).
  • Update docs/release-notes/release-notes.md if your changes should be included in the release notes for the next release.

@josvazg josvazg marked this pull request as draft August 16, 2023 10:21
@github-actions
Contributor

github-actions bot commented Aug 16, 2023

@josvazg josvazg force-pushed the CLOUDP-193335/deletion-protection-spe branch 3 times, most recently from 4439b5c to aafeb2d Compare August 17, 2023 13:27
@josvazg josvazg changed the title Cloudp 193335/deletion protection spe CLOUPD-193335/deletion protection spe Aug 17, 2023
@josvazg josvazg changed the title CLOUPD-193335/deletion protection spe CLOUPD-193335: Selection Protection Serverless Private Endpoints Aug 17, 2023
@josvazg
Collaborator Author

josvazg commented Aug 18, 2023

Still a draft, as it should go in after bugfix #1082

@josvazg josvazg force-pushed the CLOUDP-193335/deletion-protection-spe branch 5 times, most recently from da2ed29 to 4911f95 Compare August 23, 2023 07:53
@josvazg josvazg marked this pull request as ready for review August 23, 2023 08:00
@josvazg josvazg force-pushed the CLOUDP-193335/deletion-protection-spe branch 2 times, most recently from 2fe9b4e to 35045d4 Compare August 23, 2023 09:46
@josvazg josvazg force-pushed the CLOUDP-193335/deletion-protection-spe branch 3 times, most recently from 24cc966 to 6fd303c Compare August 24, 2023 07:20
@josvazg
Collaborator Author

josvazg commented Aug 25, 2023

debugging

@josvazg josvazg force-pushed the CLOUDP-193335/deletion-protection-spe branch from 6fd303c to f593f2a Compare August 25, 2023 07:04
@josvazg
Collaborator Author

josvazg commented Aug 25, 2023

Even after a rebase the issues persist. Seems main is not affected.
Debugging the database user integration test failure first.

@josvazg josvazg force-pushed the CLOUDP-193335/deletion-protection-spe branch from f593f2a to 88c6e6b Compare August 25, 2023 12:27
@josvazg
Collaborator Author

josvazg commented Aug 25, 2023

Debugging some panics in CI tests.

@josvazg josvazg force-pushed the CLOUDP-193335/deletion-protection-spe branch from 88c6e6b to c7e1aaa Compare August 25, 2023 15:57
Review comment on test/e2e/actions/actions.go (outdated, resolved)
@josvazg
Collaborator Author

josvazg commented Aug 25, 2023

I think there are more usages of .Spec.DeploymentSpec leading to flaky CI failures, and there seems to be a persistent Helm upgrade issue. I still do not understand how my changes are related to that, but I will keep debugging and sending fixes in this branch.

@josvazg josvazg force-pushed the CLOUDP-193335/deletion-protection-spe branch from c7e1aaa to 838e7ae Compare August 28, 2023 06:43
@josvazg
Collaborator Author

josvazg commented Aug 28, 2023

@igor-karpukhin gave me the right hint here.
Somehow my code must be saving the converted deployment to Kubernetes, changing the original spec. We do the conversion here:
https://github.com/mongodb/mongodb-atlas-kubernetes/blob/CLOUDP-193335/deletion-protection-spe/pkg/controller/atlasdeployment/atlasdeployment_controller.go#L165

```go
if deployment.IsLegacyDeployment() {
	if err := ConvertLegacyDeployment(&deployment.Spec); err != nil {
		result = workflow.Terminate(workflow.Internal, err.Error())
		log.Errorw("failed to convert legacy deployment", "error", err)
		return result.ReconcileResult(), nil
	}
	deployment.Spec.DeploymentSpec = nil
}
```

But the key point is that the conversion must never be visible in Kubernetes. It is ONLY used in memory, for convenience, so that the rest of the code only ever deals with serverless and advanced deployments, as legacy deployments get converted to their advanced equivalent.

My code must be saving the converted spec to Kubernetes somewhere, so the rest of the test code, and even Helm, fail: they assume a default (legacy) configuration that is now missing, because it was converted to advanced. That conversion should never have been persisted.

@josvazg josvazg force-pushed the CLOUDP-193335/deletion-protection-spe branch from 838e7ae to 58560b7 Compare August 28, 2023 09:54
@josvazg
Collaborator Author

josvazg commented Aug 28, 2023

Seems all issues, including the Helm error and the panics, were due to ApplyLastConfigApplied adding the annotation and updating the converted deployment in Kubernetes, instead of the original, unconverted deployment. This update was not expected, and all tests broke accordingly.

The fix is to make sure ApplyLastConfigApplied is called with the original (non-converted) deployment object.
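A minimal sketch of that shape (the variable names, the `ctx` and `r.Client` handles, and the log message below are illustrative assumptions, not the exact controller code): the annotation update is done on the resource as it was read from Kubernetes, while any converted representation is used only in memory.

```go
// Illustrative sketch only: persist the last-applied annotation on the
// resource as read from Kubernetes, never on the in-memory converted copy.
original := deployment // the AtlasDeployment fetched by the reconciler

// ... conversion and the rest of reconciliation work on a separate object ...

if err := customresource.ApplyLastConfigApplied(ctx, original, r.Client); err != nil {
	result = workflow.Terminate(workflow.Internal, err.Error())
	log.Errorw("failed to apply last applied configuration", "error", err)
	return result.ReconcileResult(), nil
}
```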

@josvazg
Collaborator Author

josvazg commented Aug 28, 2023

2 errors remain that might not be flaky:

Summarizing 2 Failures:
  [FAIL] AtlasDeployment Create the advanced deployment with enabled autoscaling [It] Should Succeed [int, AtlasDeployment, deployment-non-backups]
  /home/runner/work/mongodb-atlas-kubernetes/mongodb-atlas-kubernetes/test/int/deployment_test.go:209
  [FAIL] AtlasDeployment Create the advanced deployment & change the InstanceSize [It] Should Succeed [int, AtlasDeployment, deployment-non-backups]
  /home/runner/work/mongodb-atlas-kubernetes/mongodb-atlas-kubernetes/test/int/deployment_test.go:868

Debugging those now.

@josvazg josvazg force-pushed the CLOUDP-193335/deletion-protection-spe branch from 58560b7 to dab8a24 Compare August 29, 2023 05:06
@josvazg
Collaborator Author

josvazg commented Aug 29, 2023

> 2 errors remain that might not be flaky:
>
> Summarizing 2 Failures:
>   [FAIL] AtlasDeployment Create the advanced deployment with enabled autoscaling [It] Should Succeed [int, AtlasDeployment, deployment-non-backups]
>   /home/runner/work/mongodb-atlas-kubernetes/mongodb-atlas-kubernetes/test/int/deployment_test.go:209
>   [FAIL] AtlasDeployment Create the advanced deployment & change the InstanceSize [It] Should Succeed [int, AtlasDeployment, deployment-non-backups]
>   /home/runner/work/mongodb-atlas-kubernetes/mongodb-atlas-kubernetes/test/int/deployment_test.go:868
>
> Debugging those now.

The problem was due to handleAutoscaling making changes to the Kubernetes spec for advanced deployments here:
https://github.com/mongodb/mongodb-atlas-kubernetes/blob/main/pkg/controller/atlasdeployment/advanced_deployment.go#L192

```go
func handleAutoscaling(ctx *workflow.Context, desiredDeployment *mdbv1.AdvancedDeploymentSpec, currentDeployment *mongodbatlas.AdvancedCluster) error {
	...
	syncInstanceSize := func(s *mdbv1.Specs, as *mdbv1.AdvancedAutoScalingSpec) error {
		if s != nil {
			...

			if isInstanceSizeTheSame(currentDeployment, size) {
				size = ""
			}

			s.InstanceSize = size
```
Those changes were being saved to the Kubernetes state when updating the deployment after (re)setting the previous config here:
https://github.com/mongodb/mongodb-atlas-kubernetes/blob/main/pkg/controller/customresource/protection.go#L50

```go
func ApplyLastConfigApplied(ctx context.Context, resource mdbv1.AtlasCustomResource, k8sClient client.Client) error {
	...
	return k8sClient.Update(ctx, resource, &client.UpdateOptions{})
}
```

This was only happening for advanced clusters, because I was NOT making a copy of the Kubernetes deployment structure unless the conversion was actually needed.

The fix was simply to always make the converted deployment variable a separate copy. That way the original deployment remains unchanged, and no changes made during the reconcile loop are reflected back into the Kubernetes (etcd) state.
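To make the aliasing pitfall concrete, here is a small, self-contained Go illustration; the types are simplified stand-ins for the real CRD structs, and the DeepCopy here merely mimics the generated deep-copy method.

```go
package main

import "fmt"

// Simplified stand-ins for the real CRD types, for illustration only.
type DeploymentSpec struct{ InstanceSize string }
type AtlasDeployment struct{ Spec DeploymentSpec }

// DeepCopy mimics the generated deep-copy: it returns an independent object,
// so mutations on the copy do not leak back into the original.
func (d *AtlasDeployment) DeepCopy() *AtlasDeployment {
	out := *d
	return &out
}

func main() {
	original := &AtlasDeployment{Spec: DeploymentSpec{InstanceSize: "M10"}}

	// Buggy pattern: "converted" aliases the original, so an ephemeral tweak
	// also changes the object that later gets written back to Kubernetes.
	converted := original
	converted.Spec.InstanceSize = ""
	fmt.Println(original.Spec.InstanceSize == "") // true: the original was mutated

	// Fixed pattern: always work on a separate copy; the original stays intact.
	original = &AtlasDeployment{Spec: DeploymentSpec{InstanceSize: "M10"}}
	workingCopy := original.DeepCopy()
	workingCopy.Spec.InstanceSize = ""
	fmt.Println(original.Spec.InstanceSize) // "M10": the original is unchanged
}
```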

A separate matter is whether or not those ephemeral changes to the deployment struct make actual sense, but that is out of the scope of this PR.

@josvazg
Collaborator Author

josvazg commented Aug 29, 2023

The current persistent failures are in encryption-at-rest, a well-known flaky test that, AFAIK, is totally unrelated to the changes here:

[FAIL] Encryption at REST test Encryption at rest for AWS, GCP, Azure [It] Test[encryption-at-rest-aws]: Can add Encryption at Rest to AWS project [encryption-at-rest, encryption-at-rest-aws]

Will be debugging it now.

@josvazg
Collaborator Author

josvazg commented Aug 29, 2023

> The current persistent failures are in encryption-at-rest, a well-known flaky test that, AFAIK, is totally unrelated to the changes here:
>
> [FAIL] Encryption at REST test Encryption at rest for AWS, GCP, Azure [It] Test[encryption-at-rest-aws]: Can add Encryption at Rest to AWS project [encryption-at-rest, encryption-at-rest-aws]
>
> Will be debugging it now.

Created a follow-up issue for that.

Now that the tests pass, we are ready for review.

@igor-karpukhin
Collaborator

@josvazg this is done because of the way we operate with autoscaling. For example, if you set instanceSize to M10 and enable autoscaling from M10 to M40, then later, after the cluster has autoscaled to M30, if you try to modify the AtlasDeployment resource, the instanceSize is still going to be M10, and we would send that M10 to Atlas, which is wrong. That's why, if the instance size is equal to the one in Atlas and autoscaling is enabled, we remove this field before sending the PATCH request.

@josvazg
Collaborator Author

josvazg commented Aug 29, 2023

> @josvazg this is done because of the way we operate with autoscaling. For example, if you set instanceSize to M10 and enable autoscaling from M10 to M40, then later, after the cluster has autoscaled to M30, if you try to modify the AtlasDeployment resource, the instanceSize is still going to be M10, and we would send that M10 to Atlas, which is wrong. That's why, if the instance size is equal to the one in Atlas and autoscaling is enabled, we remove this field before sending the PATCH request.

Thanks for the explanation. So the ephemeral changes to the k8s struct act as an intermediate representation to ease syncing with Atlas. Makes sense.
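For illustration, a self-contained sketch of that rationale; the types and the normalizeInstanceSize helper below are hypothetical stand-ins, not the operator's actual API, which does this (and more) inside handleAutoscaling on the real mdbv1 types.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the operator's types.
type specs struct{ InstanceSize string }
type autoScaling struct{ ComputeEnabled bool }

// normalizeInstanceSize mirrors the rationale above: when compute autoscaling
// is enabled and the desired instance size already matches what Atlas runs,
// blank the field so the PATCH request does not fight the autoscaler.
func normalizeInstanceSize(desired *specs, as autoScaling, currentAtlasSize string) {
	if desired == nil || !as.ComputeEnabled {
		return
	}
	if desired.InstanceSize == currentAtlasSize {
		desired.InstanceSize = ""
	}
}

func main() {
	// Atlas already runs M30 and the spec also says M30: the field is blanked,
	// leaving the instance size under the autoscaler's control.
	d := &specs{InstanceSize: "M30"}
	normalizeInstanceSize(d, autoScaling{ComputeEnabled: true}, "M30")
	fmt.Printf("instanceSize sent to Atlas: %q\n", d.InstanceSize) // ""
}
```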

@igor-karpukhin igor-karpukhin changed the title CLOUPD-193335: Selection Protection Serverless Private Endpoints CLOUPD-193335: Deletion Protection Serverless Private Endpoints Aug 29, 2023
@josvazg josvazg force-pushed the CLOUDP-193335/deletion-protection-spe branch from dab8a24 to 8dabad8 Compare August 31, 2023 13:15
@josvazg
Collaborator Author

josvazg commented Aug 31, 2023

Rebased

@josvazg josvazg force-pushed the CLOUDP-193335/deletion-protection-spe branch 3 times, most recently from a3dee69 to f1977e4 Compare September 1, 2023 06:08
@josvazg josvazg force-pushed the CLOUDP-193335/deletion-protection-spe branch from f1977e4 to a9a544f Compare September 1, 2023 06:39
Collaborator

@helderjs helderjs left a comment

I noticed something strange with your rebase. Changes already in the main branch are showing as new in your PR; I pointed out 2, but there are more, even entire files.

@josvazg
Collaborator Author

josvazg commented Sep 1, 2023

> I noticed something strange with your rebase. Changes already in the main branch are showing as new in your PR; I pointed out 2, but there are more, even entire files.

Thanks for the heads up, checking...

Signed-off-by: Jose Vazquez <jose.vazquez@mongodb.com>
@josvazg josvazg force-pushed the CLOUDP-193335/deletion-protection-spe branch from a9a544f to 1a5c121 Compare September 1, 2023 09:45
@josvazg josvazg merged commit 597141c into main Sep 1, 2023
40 checks passed
@josvazg josvazg deleted the CLOUDP-193335/deletion-protection-spe branch September 1, 2023 15:34