[BUGFIX] Fix ironic conductor for post-adoption baremetal provisioning by imatza-rh · Pull Request #1302 · openstack-k8s-operators/data-plane-adoption

imatza-rh · 2026-03-10T18:14:35Z

Summary

Fixes two blockers preventing post-adoption baremetal provisioning
(test-with-ironic) from working. Both were discovered and validated
live on titan35 (shiftstack adoption pipeline).

1. Missing conductor DHCP ranges

The ironic conductor's dnsmasq had no dhcp-range configured. Without
DHCP, nodes power on via sushy/redfish but never receive an IP to
chainload iPXE and boot IPA. The ironicInspector section already had
dhcpRanges (190-199), but ironicConductors was missing them entirely.

Fix: Add conductor dhcpRanges (240-249) on the ironic provisioning
subnet, matching the existing inspector pattern (IPv4/IPv6).

2. IPA heartbeat uses unreachable K8s hostname

After PXE boot succeeds, IPA calls get_ironic_api_url() which resolves
the ironic endpoint via Keystone catalog — returning
http://ironic-internal.openstack.svc:6385 (K8s internal hostname).
BMaaS VMs on the provisioning network (172.20.1.x) cannot resolve this.

Fix: Set endpoint_override=http://<MetalLB-VIP>:6385 in conductor
[service_catalog] config. This bypasses Keystone lookup and returns a
routable IP. Routing works: BMaaS VM → 172.20.1.1 (hypervisor gateway) →
192.168.122.80 (MetalLB VIP). Reuses existing dns_server_provisioning_ip
variable.

3. Conductor dnsmasq missing UEFI iPXE boot directives

The ironic-operator's conductor dnsmasq template is missing UEFI boot
chain directives (dhcp-boot=tag:efi,...). The inspector template has
them, the conductor doesn't — this is an upstream operator bug.

Fix (workaround): Post-adoption task appends the two dhcp-boot
lines to dnsmasq.conf and restarts dnsmasq. Gated by
ironic_conductor_dnsmasq_uefi_fix: true (disable once operator is
fixed upstream). Idempotent — skips if already applied.

Root cause

Confirmed via live debugging on titan35:

Zero DHCP activity in conductor dnsmasq logs (no dhcp-range)
IPA booted but couldn't heartbeat (ipa-api-url used K8s hostname)
Adding dhcp-boot lines + restarting dnsmasq enabled full PXE chain
endpoint_override fix: node cleaned in ~80s vs 31-min timeouts

IP range allocation on 172.20.1.0/24

Range	Usage
.1	Gateway
.99	VIP
.150-.200	TripleO allocation pool
.190-.199	Inspector DHCP (existing)
.210-.239	OSP 17.1 inspector
.240-.249	Conductor DHCP (this PR)

No overlaps.

Affected jobs

Only jobs using dpa_test_suite: 'test-with-ironic':

uni01alpha adoption
shiftstack adoption

Test plan

Rerun shiftstack adoption pipeline with this fix (via Depends-On)
Verify conductor dnsmasq has dhcp-range and dhcp-boot after adoption
Verify IPA heartbeat succeeds (endpoint_override returns routable IP)
Verify new baremetal instance provisions successfully

openshift-ci · 2026-03-10T18:14:40Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sathlan for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

post-adoption baremetal provisioning The ironic conductor's dnsmasq container had no dhcp-range configured, preventing BMaaS VMs from PXE booting IPA during deploy/clean operations. This caused post-adoption baremetal provisioning (test-with-ironic) to timeout waiting for IPA heartbeat. The ironicInspector section already had dhcpRanges (190-199), but the ironicConductors section was missing them. Without DHCP, nodes power on via sushy/redfish but never receive an IP to chainload iPXE and boot IPA. Adds conductor dhcpRanges (240-249) on the ironic provisioning subnet, avoiding overlap with: - Inspector DHCP range (190-199) - TripleO allocation pool (150-200, dead post-adoption) - OSP 17.1 inspector subnets (210-239, dead post-adoption) Assisted-By: Claude Code Signed-off-by: Itay Matza <imatza@redhat.com>

The ironic-operator conductor dnsmasq template is missing UEFI iPXE boot directives that the inspector template has. This causes post-adoption baremetal provisioning to fail: UEFI PXE clients get DHCPOFFER with no bootfile and loop indefinitely. Additionally, the IPA ramdisk gets ipa-api-url set to the Kubernetes internal hostname which is unreachable from the provisioning network. Fix both issues: - Add endpoint_override to conductor customServiceConfig so IPA uses the MetalLB VIP (reachable via hypervisor routing) instead of K8s-internal hostname - Add post-deployment task to patch conductor dnsmasq with UEFI iPXE boot chain directives (snponly.efi for first-stage PXE, boot.ipxe for second-stage iPXE) The dnsmasq fix is gated by ironic_conductor_dnsmasq_uefi_fix (default: true) and should be disabled once the ironic-operator template is fixed upstream. Both fixes validated on live environment (titan35): node cleaning completed in ~80s vs 31-min timeouts. Assisted-By: Claude Code Signed-off-by: Itay Matza <imatza@redhat.com>

The verification grep checked /etc/dnsmasq.conf but the UEFI boot directives are written to /var/lib/ironic/dnsmasq.conf. Assisted-By: Claude Code Signed-off-by: Itay Matza <imatza@redhat.com>

The multi-line python3 -c one-liner had leading whitespace on the print() continuation line, causing IndentationError when executed inside the YAML block scalar. Put the entire expression on a single line. Assisted-By: Claude Code

After pcp_cleanup reverts source VMs to snapshot, TripleO services (Pacemaker, HAProxy, Keystone) need time to start up. Without a readiness check, pre-launch scripts immediately hit HTTP 503 from Keystone at the source VIP and fail. Add a retry loop (up to 5 min) at the start of the development_environment role that waits for a successful token issue before running any pre-launch scripts. Assisted-By: Claude Code Signed-off-by: Itay Matza <imatza@redhat.com>

When nova-compute pods register with a different hostname than the original TripleO services (e.g. .localdomain vs .example.com), old service entries persist in the DB with the old version. The nova_ffu version convergence check fails because it finds mismatched versions across all non-deleted nova-compute services. Add a defensive cleanup step before the version check that soft-deletes any nova-compute services with a version lower than the current maximum. This handles hostname mismatches during adoption without requiring exact hostname alignment. Assisted-By: Claude Code Signed-off-by: Itay Matza <imatza@redhat.com>

After adoption, nodes retain old TripleO deploy_kernel and deploy_ramdisk references. The existing reset task cleared these to empty, expecting centrally configured defaults in ironic.conf — but no global [conductor]deploy_kernel or deploy_ramdisk is set by the ironic_patch. Discover the conductor IP on the baremetal network and set each node's deploy images to the httpboot-served IPA URLs (port 8088), matching the pattern used by the UEFI fix. Without this fix, nodes enter 'clean failed' state because the IPA ramdisk cannot boot, causing any subsequent nova instance creation to hang in BUILD/spawning indefinitely. Signed-off-by: Itay Matza <imatza@redhat.com> Assisted-By: Claude Code

Add UEFI iPXE boot chain and DHCP host entry workarounds to nova_verify.yaml. The conductor dnsmasq template is missing UEFI boot directives, and assigns pool IPs instead of neutron-assigned IPs to deployed baremetal instances (OVN doesn't serve vif_type=other). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Itay Matza <imatza@redhat.com>

Post-restart verification greps were checking /etc/dnsmasq.conf instead of /var/lib/ironic/dnsmasq.conf where the config was written. With set -e active, this causes the task to fail even when the fix was correctly applied. Signed-off-by: Itay Matza <imatza@redhat.com>

The "Provision new instance on ironic" task uses complex bash with function definitions, escaped quotes, and arithmetic that triggers Ansible's parse_kv() argument splitter error: ERROR! failed at splitting arguments, either an unbalanced jinja2 block or quotes Fix: use explicit `cmd:` parameter instead of shell shorthand. This bypasses parse_kv() and passes the content directly to the shell module. Also simplify echo messages (remove escaped quotes) and quote the jq expression on the final ping line. Relates: OSPRH-26289 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Itay Matza <imatza@redhat.com>

Three bugs in the pre-launch ironic test script: 1. wait_node_state used -c "Provisioning\ State" column name which breaks when the openstack command runs through SSH (quotes consumed by local shell, column name splits into two arguments). Fix: use --provision-state filter + count instead of column parsing. 2. wait_image_active hardcoded "Fedora-Cloud-Base-38" instead of using the $image_name parameter. Fix: use "${image_name}". 3. ACTIVE_NODES check used -c "Provisioning State" with same SSH quoting issue - always returned 0 regardless of actual node states. Fix: use --provision-state active filter + wc -l. These bugs were latent on first runs (nodes in available state) but caused failures on re-runs where nodes had active instances. Signed-off-by: Itay Matza <imatza@redhat.com>

imatza-rh marked this pull request as draft March 10, 2026 19:15

openshift-ci bot added the do-not-merge/work-in-progress label Mar 10, 2026

imatza-rh changed the title ~~[BUGFIX] Add dhcpRanges to ironicConductors~~ [BUGFIX] Fix ironic conductor for post-adoption baremetal provisioning Mar 11, 2026

imatza-rh force-pushed the add-conductor-dhcp-ranges branch 5 times, most recently from 32956ab to e6d940f Compare March 16, 2026 19:46

openshift-merge-robot added the needs-rebase label Mar 16, 2026

imatza-rh force-pushed the add-conductor-dhcp-ranges branch from e6d940f to aed358b Compare March 16, 2026 20:47

openshift-merge-robot removed the needs-rebase label Mar 16, 2026

imatza-rh force-pushed the add-conductor-dhcp-ranges branch from ecd0956 to d7434d0 Compare March 19, 2026 13:35

imatza-rh and others added 11 commits March 21, 2026 13:04

[BUGFIX] Fix dnsmasq UEFI verification path

c36bac0

The verification grep checked /etc/dnsmasq.conf but the UEFI boot directives are written to /var/lib/ironic/dnsmasq.conf. Assisted-By: Claude Code Signed-off-by: Itay Matza <imatza@redhat.com>

[BUGFIX] Fix conductor IP Python indent

29bebe3

The multi-line python3 -c one-liner had leading whitespace on the print() continuation line, causing IndentationError when executed inside the YAML block scalar. Put the entire expression on a single line. Assisted-By: Claude Code

imatza-rh force-pushed the add-conductor-dhcp-ranges branch from d7434d0 to d0c5105 Compare March 21, 2026 11:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUGFIX] Fix ironic conductor for post-adoption baremetal provisioning#1302

[BUGFIX] Fix ironic conductor for post-adoption baremetal provisioning#1302
imatza-rh wants to merge 11 commits intoopenstack-k8s-operators:downstreamfrom
imatza-rh:add-conductor-dhcp-ranges

imatza-rh commented Mar 10, 2026 •

edited by openshift-ci bot

Loading

Uh oh!

openshift-ci bot commented Mar 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

imatza-rh commented Mar 10, 2026 • edited by openshift-ci bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

1. Missing conductor DHCP ranges

2. IPA heartbeat uses unreachable K8s hostname

3. Conductor dnsmasq missing UEFI iPXE boot directives

Root cause

IP range allocation on 172.20.1.0/24

Affected jobs

Related

Test plan

Uh oh!

openshift-ci bot commented Mar 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

imatza-rh commented Mar 10, 2026 •

edited by openshift-ci bot

Loading