[BUGFIX] Fix ironic conductor for post-adoption baremetal provisioning #1302
Draft
imatza-rh wants to merge 11 commits into openstack-k8s-operators:downstream from
[APPROVALNOTIFIER] This PR is NOT APPROVED
32956ab to e6d940f
e6d940f to aed358b
ecd0956 to d7434d0
post-adoption baremetal provisioning

The ironic conductor's dnsmasq container had no dhcp-range configured, preventing BMaaS VMs from PXE booting IPA during deploy/clean operations. This caused post-adoption baremetal provisioning (test-with-ironic) to time out waiting for IPA heartbeat.

The ironicInspector section already had dhcpRanges (190-199), but the ironicConductors section was missing them. Without DHCP, nodes power on via sushy/redfish but never receive an IP to chainload iPXE and boot IPA.

Adds conductor dhcpRanges (240-249) on the ironic provisioning subnet, avoiding overlap with:
- Inspector DHCP range (190-199)
- TripleO allocation pool (150-200, dead post-adoption)
- OSP 17.1 inspector subnets (210-239, dead post-adoption)

Assisted-By: Claude Code
Signed-off-by: Itay Matza <imatza@redhat.com>
The ironic-operator conductor dnsmasq template is missing UEFI iPXE boot directives that the inspector template has. This causes post-adoption baremetal provisioning to fail: UEFI PXE clients get DHCPOFFER with no bootfile and loop indefinitely. Additionally, the IPA ramdisk gets ipa-api-url set to the Kubernetes internal hostname, which is unreachable from the provisioning network.

Fix both issues:
- Add endpoint_override to conductor customServiceConfig so IPA uses the MetalLB VIP (reachable via hypervisor routing) instead of K8s-internal hostname
- Add post-deployment task to patch conductor dnsmasq with UEFI iPXE boot chain directives (snponly.efi for first-stage PXE, boot.ipxe for second-stage iPXE)

The dnsmasq fix is gated by ironic_conductor_dnsmasq_uefi_fix (default: true) and should be disabled once the ironic-operator template is fixed upstream.

Both fixes validated on live environment (titan35): node cleaning completed in ~80s vs 31-min timeouts.

Assisted-By: Claude Code
Signed-off-by: Itay Matza <imatza@redhat.com>
The verification grep checked /etc/dnsmasq.conf but the UEFI boot directives are written to /var/lib/ironic/dnsmasq.conf.

Assisted-By: Claude Code
Signed-off-by: Itay Matza <imatza@redhat.com>
The multi-line python3 -c one-liner had leading whitespace on the print() continuation line, causing IndentationError when executed inside the YAML block scalar. Put the entire expression on a single line.

Assisted-By: Claude Code
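The failure mode described in this commit can be reproduced outside Ansible. The following is a minimal sketch with a hypothetical two-statement snippet (the actual expression from the task is not shown on this page); it only illustrates why a stray indent on the second line is rejected while the single-line form is accepted.

```python
# Hypothetical snippet: the second line carries leading whitespace, as it
# would after being indented inside a YAML block scalar.
broken = "import json\n    print(json.dumps({'ok': True}))"

# The same statements placed on one line, in the spirit of the fix.
fixed = "import json; print(json.dumps({'ok': True}))"


def compiles(source: str) -> bool:
    """Return True if Python accepts the source, False on IndentationError."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except IndentationError:
        return False


broken_ok = compiles(broken)  # the indented continuation line is rejected
fixed_ok = compiles(fixed)    # the single-line form parses cleanly
```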
After pcp_cleanup reverts source VMs to snapshot, TripleO services (Pacemaker, HAProxy, Keystone) need time to start up. Without a readiness check, pre-launch scripts immediately hit HTTP 503 from Keystone at the source VIP and fail.

Add a retry loop (up to 5 min) at the start of the development_environment role that waits for a successful token issue before running any pre-launch scripts.

Assisted-By: Claude Code
Signed-off-by: Itay Matza <imatza@redhat.com>
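The readiness check in this commit lives in an Ansible role, which is not shown here. As an illustration of the same idea only, the sketch below polls a caller-supplied check until it succeeds or a 5-minute budget is exhausted. The function name, the 10-second interval, and the injectable `sleep`/`clock` parameters are assumptions made for this example.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    timeout: float = 300.0,   # "up to 5 min" from the commit message
    interval: float = 10.0,   # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll check() until it returns True or the timeout budget is spent."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        # Stop if another sleep would push us past the deadline.
        if clock() + interval > deadline:
            return False
        sleep(interval)
```

In practice `check` might wrap a command such as `openstack token issue` and return whether it exited successfully; that wiring is left out because the actual task is not part of this page.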
When nova-compute pods register with a different hostname than the original TripleO services (e.g. .localdomain vs .example.com), old service entries persist in the DB with the old version. The nova_ffu version convergence check fails because it finds mismatched versions across all non-deleted nova-compute services.

Add a defensive cleanup step before the version check that soft-deletes any nova-compute services with a version lower than the current maximum. This handles hostname mismatches during adoption without requiring exact hostname alignment.

Assisted-By: Claude Code
Signed-off-by: Itay Matza <imatza@redhat.com>
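The selection rule behind the cleanup step can be sketched in a few lines. This is not the query used by the PR; it assumes a simplified record shape (`host`, `binary`, `version`, `deleted`) and hypothetical host names and version numbers, and only shows which rows the rule "version lower than the current maximum" would pick.

```python
def stale_compute_services(services):
    """Return non-deleted nova-compute records whose version is below the maximum."""
    live = [
        s for s in services
        if s["binary"] == "nova-compute" and not s["deleted"]
    ]
    if not live:
        return []
    newest = max(s["version"] for s in live)
    return [s for s in live if s["version"] < newest]


# Hypothetical rows: the same compute node registered under two host names.
rows = [
    {"host": "compute-0.localdomain", "binary": "nova-compute", "version": 61, "deleted": False},
    {"host": "compute-0.example.com", "binary": "nova-compute", "version": 66, "deleted": False},
    {"host": "controller-0", "binary": "nova-scheduler", "version": 61, "deleted": False},
]
stale_hosts = [s["host"] for s in stale_compute_services(rows)]
```

Only the older nova-compute entry is selected; other binaries and already-deleted rows are ignored, so the version check afterwards sees a single version.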
After adoption, nodes retain old TripleO deploy_kernel and deploy_ramdisk references. The existing reset task cleared these to empty, expecting centrally configured defaults in ironic.conf, but no global [conductor]deploy_kernel or deploy_ramdisk is set by the ironic_patch.

Discover the conductor IP on the baremetal network and set each node's deploy images to the httpboot-served IPA URLs (port 8088), matching the pattern used by the UEFI fix.

Without this fix, nodes enter 'clean failed' state because the IPA ramdisk cannot boot, causing any subsequent nova instance creation to hang in BUILD/spawning indefinitely.

Signed-off-by: Itay Matza <imatza@redhat.com>
Assisted-By: Claude Code
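As an illustration of how per-node deploy image URLs could be derived from a discovered conductor IP, here is a minimal sketch. Port 8088 comes from the commit message; the file names, the function name, and the example addresses are assumptions, since the actual paths served by httpboot are not shown here.

```python
from ipaddress import ip_address


def deploy_image_urls(
    conductor_ip: str,
    port: int = 8088,                               # from the commit message
    kernel: str = "ironic-python-agent.kernel",     # assumed file name
    ramdisk: str = "ironic-python-agent.initramfs", # assumed file name
) -> dict:
    """Build deploy_kernel/deploy_ramdisk URLs for a conductor address."""
    addr = ip_address(conductor_ip)
    # IPv6 literals must be bracketed inside a URL.
    host = f"[{addr}]" if addr.version == 6 else str(addr)
    base = f"http://{host}:{port}"
    return {
        "deploy_kernel": f"{base}/{kernel}",
        "deploy_ramdisk": f"{base}/{ramdisk}",
    }


urls_v4 = deploy_image_urls("172.20.1.5")  # hypothetical conductor IP
urls_v6 = deploy_image_urls("fd00::5")     # hypothetical IPv6 conductor IP
```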
Add UEFI iPXE boot chain and DHCP host entry workarounds to nova_verify.yaml. The conductor dnsmasq template is missing UEFI boot directives, and assigns pool IPs instead of neutron-assigned IPs to deployed baremetal instances (OVN doesn't serve vif_type=other).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Itay Matza <imatza@redhat.com>
Post-restart verification greps were checking /etc/dnsmasq.conf instead of /var/lib/ironic/dnsmasq.conf where the config was written. With set -e active, this causes the task to fail even when the fix was correctly applied.

Signed-off-by: Itay Matza <imatza@redhat.com>
The "Provision new instance on ironic" task uses complex bash with function definitions, escaped quotes, and arithmetic that triggers Ansible's parse_kv() argument splitter error:

ERROR! failed at splitting arguments, either an unbalanced jinja2 block or quotes

Fix: use explicit `cmd:` parameter instead of shell shorthand. This bypasses parse_kv() and passes the content directly to the shell module. Also simplify echo messages (remove escaped quotes) and quote the jq expression on the final ping line.

Relates: OSPRH-26289
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Itay Matza <imatza@redhat.com>
Three bugs in the pre-launch ironic test script:
1. wait_node_state used -c "Provisioning\ State" column name which
breaks when the openstack command runs through SSH (quotes consumed
by local shell, column name splits into two arguments).
Fix: use --provision-state filter + count instead of column parsing.
2. wait_image_active hardcoded "Fedora-Cloud-Base-38" instead of using
the $image_name parameter. Fix: use "${image_name}".
3. ACTIVE_NODES check used -c "Provisioning State" with same SSH
quoting issue - always returned 0 regardless of actual node states.
Fix: use --provision-state active filter + wc -l.
These bugs were latent on first runs (nodes in available state) but
caused failures on re-runs where nodes had active instances.
Signed-off-by: Itay Matza <imatza@redhat.com>
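The quoting problem in bugs 1 and 3 can be modelled with `shlex`, which follows POSIX shell word-splitting rules. The sketch below is an approximation, not the script from the PR: it assumes a placeholder host name and treats ssh as joining its trailing arguments with spaces before the remote shell splits them again.

```python
import shlex

# What gets typed locally ("host" is a placeholder).
typed = 'ssh host openstack baremetal node list -c "Provisioning State"'

# Step 1: the local shell parses the line and consumes the quotes.
local_argv = shlex.split(typed)

# Step 2: ssh joins everything after the host with spaces and hands the
# string to the remote shell, which splits it a second time.
remote_argv = shlex.split(" ".join(local_argv[2:]))

# Alternative in the spirit of the fix: a server-side filter whose
# arguments contain no spaces, so the second split cannot change them.
safe = "openstack baremetal node list --provision-state active"
safe_remote_argv = shlex.split(" ".join(shlex.split(safe)))
```

After the second split, the column name arrives as two separate arguments, which matches the behaviour described above. The `--provision-state` form survives both splits unchanged.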
d7434d0 to d0c5105
Summary
Fixes three blockers preventing post-adoption baremetal provisioning (`test-with-ironic`) from working. All were discovered and validated live on titan35 (shiftstack adoption pipeline).
1. Missing conductor DHCP ranges
The ironic conductor's dnsmasq had no `dhcp-range` configured. Without DHCP, nodes power on via sushy/redfish but never receive an IP to chainload iPXE and boot IPA. The `ironicInspector` section already had `dhcpRanges` (190-199), but `ironicConductors` was missing them entirely.

Fix: Add conductor `dhcpRanges` (240-249) on the ironic provisioning subnet, matching the existing inspector pattern (IPv4/IPv6).
2. IPA heartbeat uses unreachable K8s hostname
After PXE boot succeeds, IPA calls `get_ironic_api_url()`, which resolves the ironic endpoint via the Keystone catalog, returning `http://ironic-internal.openstack.svc:6385` (K8s internal hostname). BMaaS VMs on the provisioning network (172.20.1.x) cannot resolve this.

Fix: Set `endpoint_override=http://<MetalLB-VIP>:6385` in conductor `[service_catalog]` config. This bypasses Keystone lookup and returns a routable IP. Routing works: BMaaS VM → 172.20.1.1 (hypervisor gateway) → 192.168.122.80 (MetalLB VIP). Reuses existing `dns_server_provisioning_ip` variable.
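To make the shape of the override concrete, the sketch below renders the `[service_catalog]` section described above using the standard-library `configparser`. The function name is hypothetical and the VIP is the one quoted in the routing description; how the PR templates this into `customServiceConfig` is not shown here.

```python
import configparser
import io


def render_service_catalog_override(vip: str, port: int = 6385) -> str:
    """Render an ini fragment that pins the ironic API endpoint to a VIP."""
    config = configparser.ConfigParser()
    config["service_catalog"] = {
        "endpoint_override": f"http://{vip}:{port}",
    }
    buffer = io.StringIO()
    # Write "key=value" without spaces, matching the form quoted in the text.
    config.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


snippet = render_service_catalog_override("192.168.122.80")
```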
3. Conductor dnsmasq missing UEFI iPXE boot directives
The ironic-operator's conductor dnsmasq template is missing UEFI boot chain directives (`dhcp-boot=tag:efi,...`). The inspector template has them, the conductor doesn't; this is an upstream operator bug.

Fix (workaround): Post-adoption task appends the two `dhcp-boot` lines to dnsmasq.conf and restarts dnsmasq. Gated by `ironic_conductor_dnsmasq_uefi_fix: true` (disable once operator is fixed upstream). Idempotent: skips if already applied.
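The "skips if already applied" behaviour can be sketched as an append that first checks for each line. This is an illustration rather than the task from the PR: the helper name is hypothetical, and the two directive strings are placeholders because the full `dhcp-boot` lines are not reproduced on this page.

```python
import tempfile
from pathlib import Path

# Placeholders only; the real dhcp-boot directives are not shown in this PR text.
PLACEHOLDER_LINES = [
    "dhcp-boot=PLACEHOLDER_FIRST_STAGE",
    "dhcp-boot=PLACEHOLDER_SECOND_STAGE",
]


def append_missing_lines(path, lines) -> bool:
    """Append any lines not already present; return True if the file changed."""
    conf = Path(path)
    text = conf.read_text() if conf.exists() else ""
    existing = text.splitlines()
    missing = [line for line in lines if line not in existing]
    if not missing:
        return False  # nothing to do, so no restart would be needed either
    separator = "" if text == "" or text.endswith("\n") else "\n"
    conf.write_text(text + separator + "\n".join(missing) + "\n")
    return True


# Demonstration against a temporary file standing in for dnsmasq.conf.
with tempfile.TemporaryDirectory() as tmp:
    demo_conf = Path(tmp) / "dnsmasq.conf"
    demo_conf.write_text("# existing configuration\n")
    first_run = append_missing_lines(demo_conf, PLACEHOLDER_LINES)
    second_run = append_missing_lines(demo_conf, PLACEHOLDER_LINES)
    final_text = demo_conf.read_text()
```

Returning a changed flag lets a caller restart dnsmasq only when the file was actually modified, which keeps repeated runs cheap.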
Root cause
Confirmed via live debugging on titan35:
dhcp-range)
ipa-api-url used K8s hostname)
dhcp-boot lines + restarting dnsmasq enabled full PXE chain
endpoint_override fix: node cleaned in ~80s vs 31-min timeouts
IP range allocation on 172.20.1.0/24
No overlaps.
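The no-overlap claim for the new conductor range can be checked mechanically. The sketch below uses the last-octet ranges listed in the first commit message and assumes they all refer to 172.20.1.0/24; the helper names and dictionary keys are made up for this example.

```python
from ipaddress import IPv4Address, IPv4Network

SUBNET = IPv4Network("172.20.1.0/24")


def host_range(first: int, last: int):
    """Return (start, end) addresses for a last-octet range on SUBNET."""
    base = int(SUBNET.network_address)
    return IPv4Address(base + first), IPv4Address(base + last)


def overlaps(a, b) -> bool:
    """True if two inclusive (start, end) ranges share at least one address."""
    return a[0] <= b[1] and b[0] <= a[1]


CONDUCTOR = host_range(240, 249)
OTHERS = {
    "inspector": host_range(190, 199),
    "tripleo_pool": host_range(150, 200),      # dead post-adoption
    "osp171_inspector": host_range(210, 239),  # dead post-adoption
}

# Names of any existing ranges that collide with the conductor range.
conflicts = [name for name, rng in OTHERS.items() if overlaps(CONDUCTOR, rng)]
```

With these inputs `conflicts` is empty. The inspector range does sit inside the TripleO allocation pool, which the commit message describes as dead post-adoption.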
Affected jobs
Only jobs using `dpa_test_suite: 'test-with-ironic'`:
- uni01alpha adoption
- shiftstack adoption
Related
boot directives (to be filed)
Test plan
Depends-On)
dhcp-range and dhcp-boot after adoption