
Add initial openstack magnum terraform config #5518

Draft
wants to merge 5 commits into
base: main

Conversation

GeorgianaElena
Member

This is a work in progress for #5455. Everything is in one file for simplicity, and the nodegroups are hardcoded, also for simplicity, until the create command passes (a rough sketch of the shape of the config is included at the end of this description).

  • Min count of nodegroups cannot be 0 :(

  • Nodegroup creation currently fails with:

| status             | CREATE_FAILED                                                                                                                                          |
| status_reason      | Unexpected error while running command.                                                                                                                                |
|                    | Command: helm upgrade test-cluster-jde45zh7dcud openstack-cluster --history-max 10 --install --output json --timeout 5m --values - --namespace                         |
|                    | magnum-390542082bd74fa6abcde82f8c7ded89 --repo https://azimuth-cloud.github.io/capi-helm-charts --version 0.4.0                                                        |
|                    | Exit code: 1                                                                                                                                                           |
|                    | Stdout: ''                                                                                                                                                             |
|                    | Stderr: 'Error: UPGRADE FAILED: release: already exists\n'      

Also, when redeploying, the failed nodegroups cannot be deleted; they just hang, and the error is the same.
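For illustration, here is a minimal sketch of what a hardcoded cluster + nodegroup definition looks like, assuming the openstack_containerinfra_cluster_v1 and openstack_containerinfra_nodegroup_v1 resources from terraform-provider-openstack (names, flavors, and the template variable are placeholders; the actual file in this PR may differ):

```hcl
# Illustrative sketch only; not the exact config in this PR.
variable "cluster_template_id" {
  type        = string
  description = "UUID of the Magnum cluster template to use (placeholder)"
}

resource "openstack_containerinfra_cluster_v1" "test_cluster" {
  name                = "test-cluster"
  cluster_template_id = var.cluster_template_id
  master_count        = 1
  node_count          = 1
}

resource "openstack_containerinfra_nodegroup_v1" "core" {
  name           = "core"
  cluster_id     = openstack_containerinfra_cluster_v1.test_cluster.id
  flavor_id      = "m3.medium" # placeholder flavor
  node_count     = 1
  min_node_count = 1 # min count cannot be 0, per the note above
}
```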

@GeorgianaElena
Member Author

@julianpistorius, do you have any ideas about why this error might be showing up when creating the nodepools? 🤔

The cluster is created successfully, but the nodegroups fail with the error above.

@julianpistorius

Hmm... No idea. I'll have to dig a bit.

@julianpistorius

While I dig, do you mind trying again? We had a networking problem this morning, possibly around the same time you ran into this problem.

@julianpistorius

We are still seeing intermittent problems with networking. Please stand by.

@julianpistorius

Never mind. The networking problems have been resolved, so you should be able to try again.

@GeorgianaElena
Member Author

GeorgianaElena commented Feb 8, 2025

Thanks for looking into it @julianpistorius! I've tried again today and it fails with an error which I think indicates that communication with the management cluster failed?

I see the error both when I try to re-apply the terraform and when I try to delete the cluster or nodegroups with openstack coe cluster ...

| stack_id             | test-cluster-b6ppkxavrpgq                                                                                            |
| status_reason        | Unexpected error while running command.                                                                              |
|                      | Command: helm uninstall test-cluster-b6ppkxavrpgq --timeout 5m --namespace magnum-390542082bd74fa6abcde82f8c7ded89   |
|                      | Exit code: 1                                                                                                         |
|                      | Stdout: ''                                                                                                           |
|                      | Stderr: 'Error: Kubernetes cluster unreachable: Get "https://149.165.173.80:6443/version": dial tcp                  |
|                      | 149.165.173.80:6443: i/o timeout\n'                                                                                  |

@GeorgianaElena
Member Author

GeorgianaElena commented Feb 10, 2025

It looks like creating only one nodepool via terraform works, but creating more than one results in this release: already exists error.

It might be some kind of race condition somewhere.

@GeorgianaElena
Member Author

GeorgianaElena commented Feb 10, 2025

It looks like I can force the nodegroups to fire create requests in sequence with depends_on.

Update:
depends_on only accepts static references, so we'd have to define each nodegroup one by one for this to work, and it will take more time because nodegroup creation won't happen in parallel.
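To make that concrete, a simplified sketch of the serialised version (resource and nodegroup names are placeholders, and the cluster resource is assumed to be defined as in the sketch in the PR description):

```hcl
resource "openstack_containerinfra_nodegroup_v1" "core" {
  name       = "core"
  cluster_id = openstack_containerinfra_cluster_v1.test_cluster.id
  node_count = 1
}

resource "openstack_containerinfra_nodegroup_v1" "user" {
  name       = "user"
  cluster_id = openstack_containerinfra_cluster_v1.test_cluster.id
  node_count = 1

  # depends_on only takes static references, so every nodegroup has to be
  # written out explicitly, and creation now happens one at a time instead
  # of in parallel.
  depends_on = [openstack_containerinfra_nodegroup_v1.core]
}
```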

@GeorgianaElena
Member Author

It might be some kind of race condition somewhere.

@julianpistorius, given the namespace in the command below, which is run by terraform when trying to create a new nodegroup and fails:

helm upgrade test-cluster-tnsrkhcvtj6e openstack-cluster --history-max 10 --install --output json --timeout 5m --values - --namespace magnum-390542082bd74fa6abcde82f8c7ded89 --repo https://azimuth-cloud.github.io/capi-helm-charts --version 0.4.0

Stderr: 'Error: UPGRADE FAILED: release: already exists\n'  

It makes me think that the command is run against the management cluster, right?

Do you have any ideas where there might be a race condition that could cause this? I found this old issue from when helm still used tiller, helm/helm#7319, which suggests some form of etcd corruption.

@julianpistorius

Hi @GeorgianaElena. I'll ask somebody who should know and get back to you.

@GeorgianaElena
Member Author

Thank you @julianpistorius!

For now, I've learnt that I can unblock myself by telling terraform to disable parallel resource creation with -parallelism=1.
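(For reference, this is the standard Terraform CLI flag, passed at apply time rather than set in the config:)

```sh
# Limit Terraform to one concurrent resource operation for this run
# (the default is 10), so the nodegroup requests fire sequentially.
terraform apply -parallelism=1
```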

@julianpistorius

julianpistorius commented Feb 11, 2025

Great! So according to Scott from StackHPC (@sd109) you have apparently hit a known bug in Magnum:

I think this is probably the same as the bug which was raised by another one of our clients which Stig has recorded as an upstream bug here: https://bugs.launchpad.net/magnum/+bug/2097946

They'll work on fixing it. In the meantime your workaround will get you by.

@GeorgianaElena
Member Author

Thank you @julianpistorius! I think I came across that bug report today too: #5455 (comment)

Now I'm battling a different issue: the labels I set on the nodegroups through terraform are not propagated to the actual node instances, and scheduling fails.

@julianpistorius

Have you been able to set the labels manually using the openstack coe nodegroup update command? Because this is something I'm currently also struggling with.

@julianpistorius

Any ideas @sd109?

@julianpistorius

@GeorgianaElena

The labels I set on the nodegroups through terraform are not propagated to the actual node instances and scheduling fails.

Is the terraform trying to set the labels at the time of creating the nodegroups? Or afterwards on existing nodegroups?

@GeorgianaElena
Member Author

@julianpistorius, I believe it does it during creation. Based on the code, it's all done in one request, i.e. the labels are passed in the POST request that creates the nodegroup.

For context, I can see the labels on the actual nodegroups as labels_added with:

openstack coe nodegroup show js-cluster core-js
[Screenshot of the openstack coe nodegroup show output, showing the custom labels under labels_added]

But then when I run k describe node on a node in that core-js nodegroup, I don't see these custom labels.
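For reference, this is roughly how the labels are being set on the nodegroup (a hedged sketch assuming the openstack_containerinfra_nodegroup_v1 resource; the label key/value is a placeholder, not the exact one used here):

```hcl
resource "openstack_containerinfra_nodegroup_v1" "core_js" {
  name       = "core-js"
  cluster_id = openstack_containerinfra_cluster_v1.test_cluster.id
  node_count = 1

  # Placeholder label: it shows up under labels_added on the Magnum
  # nodegroup (openstack coe nodegroup show), but does not appear on the
  # Kubernetes node objects, so pods selecting on it fail to schedule.
  labels = {
    "node-purpose" = "core"
  }
}
```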

@julianpistorius

Interesting. I'm going to see if I can reproduce this using the OpenStack Magnum CLI.

@julianpistorius

This is indeed a bug (with a workaround), as I mentioned in #5455 (comment).
