
Add initial openstack magnum terraform config #5518

Draft
wants to merge 5 commits into
base: main

Conversation

GeorgianaElena
Member

This is a work in progress for #5455. Everything is in one file for simplicity, and the nodegroups are hardcoded, also for simplicity, until the create command passes (a rough sketch of the shape of the config is included at the end of this description).

  • Min count of nodegroups cannot be 0 :(

  • Nodegroup creation currently fails with:

| status             | CREATE_FAILED                                                                                                                                          |
| status_reason      | Unexpected error while running command.                                                                                                                                |
|                    | Command: helm upgrade test-cluster-jde45zh7dcud openstack-cluster --history-max 10 --install --output json --timeout 5m --values - --namespace                         |
|                    | magnum-390542082bd74fa6abcde82f8c7ded89 --repo https://azimuth-cloud.github.io/capi-helm-charts --version 0.4.0                                                        |
|                    | Exit code: 1                                                                                                                                                           |
|                    | Stdout: ''                                                                                                                                                             |
|                    | Stderr: 'Error: UPGRADE FAILED: release: already exists\n'      

Also, when redeploying, the failed nodegroups cannot be deleted; they just hang, and the error is the same.
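For illustration, here is a minimal sketch of what a hardcoded cluster + nodegroup definition looks like, assuming the openstack_containerinfra_cluster_v1 and openstack_containerinfra_nodegroup_v1 resources from terraform-provider-openstack (names, flavors, and the template variable are placeholders; the actual file in this PR may differ):

```hcl
# Illustrative sketch only; not the exact config in this PR.
variable "cluster_template_id" {
  type        = string
  description = "UUID of the Magnum cluster template to use (placeholder)"
}

resource "openstack_containerinfra_cluster_v1" "test_cluster" {
  name                = "test-cluster"
  cluster_template_id = var.cluster_template_id
  master_count        = 1
  node_count          = 1
}

resource "openstack_containerinfra_nodegroup_v1" "core" {
  name           = "core"
  cluster_id     = openstack_containerinfra_cluster_v1.test_cluster.id
  flavor_id      = "m3.medium" # placeholder flavor
  node_count     = 1
  min_node_count = 1 # min count cannot be 0, per the note above
}
```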

@GeorgianaElena
Member Author

@julianpistorius, do you have any ideas about why this error might be showing up when creating the nodepools? 🤔

The cluster is created successfully, but the nodegroups fail with the error above.

@julianpistorius

Hmm... No idea. I'll have to dig a bit.

@julianpistorius

While I dig, do you mind trying again? We had a networking problem this morning, possibly around the same time you ran into this problem.

@julianpistorius

We are still seeing intermittent problems with networking. Please stand by.

@julianpistorius

Never mind. The networking problems have been resolved, so you should be able to try again.

@GeorgianaElena
Member Author

GeorgianaElena commented Feb 8, 2025

Thanks for looking into it @julianpistorius! I've tried again today and it fails with an error which I think indicates that communication with the management cluster failed?

I see the error both when I try to re-apply the terraform and when I try to delete the cluster or nodegroups with openstack coe cluster ...

| stack_id             | test-cluster-b6ppkxavrpgq                                                                                            |
| status_reason        | Unexpected error while running command.                                                                              |
|                      | Command: helm uninstall test-cluster-b6ppkxavrpgq --timeout 5m --namespace magnum-390542082bd74fa6abcde82f8c7ded89   |
|                      | Exit code: 1                                                                                                         |
|                      | Stdout: ''                                                                                                           |
|                      | Stderr: 'Error: Kubernetes cluster unreachable: Get "https://149.165.173.80:6443/version": dial tcp                  |
|                      | 149.165.173.80:6443: i/o timeout\n'                                                                                  |

@GeorgianaElena
Member Author

GeorgianaElena commented Feb 10, 2025

It looks like creating only one nodepool via terraform works, but creating more than one results in this release: already exists error.

It might be some kind of race condition somewhere.

@GeorgianaElena
Member Author

GeorgianaElena commented Feb 10, 2025

It looks like I can force the nodegroups to fire create requests in sequence with depends_on.

Update:
depends_on only accepts static references, so we'd have to define each nodegroup one by one for this to work, and it will take more time because nodegroup creation won't happen in parallel.
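To make that concrete, a simplified sketch of the serialised version (resource and nodegroup names are placeholders, and the cluster resource is assumed to be defined as in the sketch in the PR description):

```hcl
resource "openstack_containerinfra_nodegroup_v1" "core" {
  name       = "core"
  cluster_id = openstack_containerinfra_cluster_v1.test_cluster.id
  node_count = 1
}

resource "openstack_containerinfra_nodegroup_v1" "user" {
  name       = "user"
  cluster_id = openstack_containerinfra_cluster_v1.test_cluster.id
  node_count = 1

  # depends_on only takes static references, so every nodegroup has to be
  # written out explicitly, and creation now happens one at a time instead
  # of in parallel.
  depends_on = [openstack_containerinfra_nodegroup_v1.core]
}
```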

@GeorgianaElena
Member Author

It might be some kind of race condition somewhere.

@julianpistorius, given the namespace in the command below, which is run by terraform when trying to create a new nodegroup and fails:

helm upgrade test-cluster-tnsrkhcvtj6e openstack-cluster --history-max 10 --install --output json --timeout 5m --values - --namespace magnum-390542082bd74fa6abcde82f8c7ded89 --repo https://azimuth-cloud.github.io/capi-helm-charts --version 0.4.0

Stderr: 'Error: UPGRADE FAILED: release: already exists\n'  

It makes me think that the command is run against the management cluster, right?

Do you have any ideas where there might be a race condition that could cause this? I found this old issue from when helm still used tiller, helm/helm#7319, which suggests some form of etcd corruption.

@julianpistorius

Hi @GeorgianaElena. I'll ask somebody who should know and get back to you.

@GeorgianaElena
Member Author

Thank you @julianpistorius!

For now, I've learnt that I can unblock myself by telling terraform to disable parallel resource creation with -parallelism=1.
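(For reference, this is the standard Terraform CLI flag, passed at apply time rather than set in the config:)

```sh
# Limit Terraform to one concurrent resource operation for this run
# (the default is 10), so the nodegroup requests fire sequentially.
terraform apply -parallelism=1
```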

@julianpistorius

julianpistorius commented Feb 11, 2025

Great! So according to Scott from StackHPC (@sd109) you have apparently hit a known bug in Magnum:

I think this is probably the same as the bug which was raised by another one of our clients which Stig has recorded as an upstream bug here: https://bugs.launchpad.net/magnum/+bug/2097946

They'll work on fixing it. In the meantime your workaround will get you by.

@GeorgianaElena
Member Author

Thank you @julianpistorius! I think I came across that bug report today too: #5455 (comment)

Now I'm battling a different issue: the labels I set on the nodegroups through terraform are not propagated to the actual node instances, and scheduling fails.

@julianpistorius

Have you been able to set the labels manually using the openstack coe nodegroup update command? Because this is something I'm currently also struggling with.

@julianpistorius

Any ideas @sd109?

@julianpistorius

@GeorgianaElena

The labels I set on the nodegroups through terraform are not propagated to the actual node instances and scheduling fails.

Is the terraform trying to set the labels at the time of creating the nodegroups? Or afterwards on existing nodegroups?

@GeorgianaElena
Member Author

@julianpistorius, I believe it does it during creation. Based on the code, it's all done in one request, i.e. the labels are passed in the POST request that creates the nodegroup.

For context, I can see the labels on the actual nodegroups as labels_added with:

openstack coe nodegroup show js-cluster core-js
[Screenshot of the openstack coe nodegroup show output, showing the custom labels under labels_added]

But then when I run k describe node on a node in that core-js nodegroup, I don't see these custom labels.
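For reference, this is roughly how the labels are being set on the nodegroup (a hedged sketch assuming the openstack_containerinfra_nodegroup_v1 resource; the label key/value is a placeholder, not the exact one used here):

```hcl
resource "openstack_containerinfra_nodegroup_v1" "core_js" {
  name       = "core-js"
  cluster_id = openstack_containerinfra_cluster_v1.test_cluster.id
  node_count = 1

  # Placeholder label: it shows up under labels_added on the Magnum
  # nodegroup (openstack coe nodegroup show), but does not appear on the
  # Kubernetes node objects, so pods selecting on it fail to schedule.
  labels = {
    "node-purpose" = "core"
  }
}
```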

@julianpistorius

Interesting. I'm going to see if I can reproduce this using the OpenStack Magnum CLI.

@julianpistorius

This is indeed a bug (with a workaround), as I mentioned in #5455 (comment).
