[BUGFIX] Fix ironic conductor for post-adoption baremetal provisioning #1302

Draft
imatza-rh wants to merge 11 commits into openstack-k8s-operators:downstream from imatza-rh:add-conductor-dhcp-ranges

imatza-rh commented Mar 10, 2026

Summary

Fixes three blockers preventing post-adoption baremetal provisioning
(test-with-ironic) from working. All three were discovered and validated
live on titan35 (shiftstack adoption pipeline).

1. Missing conductor DHCP ranges

The ironic conductor's dnsmasq had no dhcp-range configured. Without
DHCP, nodes power on via sushy/redfish but never receive an IP to
chainload iPXE and boot IPA. The ironicInspector section already had
dhcpRanges (190-199), but ironicConductors was missing them entirely.

Fix: Add conductor dhcpRanges (240-249) on the ironic provisioning
subnet, matching the existing inspector pattern (IPv4/IPv6).

2. IPA heartbeat uses unreachable K8s hostname

After PXE boot succeeds, IPA calls get_ironic_api_url() which resolves
the ironic endpoint via Keystone catalog — returning
http://ironic-internal.openstack.svc:6385 (K8s internal hostname).
BMaaS VMs on the provisioning network (172.20.1.x) cannot resolve this.

Fix: Set endpoint_override=http://<MetalLB-VIP>:6385 in conductor
[service_catalog] config. This bypasses Keystone lookup and returns a
routable IP. Routing works: BMaaS VM → 172.20.1.1 (hypervisor gateway) →
192.168.122.80 (MetalLB VIP). Reuses existing dns_server_provisioning_ip
variable.
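
The effect of the override can be pictured as a simple precedence rule: if `endpoint_override` is present in `[service_catalog]`, it is returned instead of the catalog entry. The Python sketch below only illustrates that rule and is not Ironic's code; `resolve_ironic_api_url` is a hypothetical helper, and the two URLs are the ones quoted above.

```python
import configparser

# Value the Keystone catalog returns, per the description above.
CATALOG_URL = "http://ironic-internal.openstack.svc:6385"


def resolve_ironic_api_url(conf_text: str, catalog_url: str = CATALOG_URL) -> str:
    """Hypothetical helper: prefer [service_catalog]endpoint_override if set."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    override = parser.get("service_catalog", "endpoint_override", fallback=None)
    # An unset or empty override falls back to the catalog lookup.
    return override or catalog_url


# Shape of the conductor config fragment, using the MetalLB VIP from this PR.
WITH_OVERRIDE = """
[service_catalog]
endpoint_override = http://192.168.122.80:6385
"""

if __name__ == "__main__":
    print(resolve_ironic_api_url(WITH_OVERRIDE))  # routable VIP
    print(resolve_ironic_api_url(""))             # K8s-internal hostname
```

With the override in place, the URL handed to IPA is an IP that the provisioning network can route to, so no DNS resolution of the K8s service name is needed.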

3. Conductor dnsmasq missing UEFI iPXE boot directives

The ironic-operator's conductor dnsmasq template is missing UEFI boot
chain directives (dhcp-boot=tag:efi,...). The inspector template has
them, the conductor doesn't — this is an upstream operator bug.

Fix (workaround): Post-adoption task appends the two dhcp-boot
lines to dnsmasq.conf and restarts dnsmasq. Gated by
ironic_conductor_dnsmasq_uefi_fix: true (disable once operator is
fixed upstream). Idempotent — skips if already applied.
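
The "append only if missing" behaviour can be sketched in Python as follows. This is an illustration of the idempotency check, not the Ansible task itself; `append_missing_lines` is a hypothetical name, and the two directives are placeholders because the exact `dhcp-boot` values are not spelled out in this description.

```python
# Placeholder directives: the real bootfile arguments are not given here.
PLACEHOLDER_DIRECTIVES = [
    "dhcp-boot=tag:efi,<first-stage placeholder>",
    "dhcp-boot=<second-stage placeholder>",
]


def append_missing_lines(conf_text: str, directives: list[str]) -> tuple[str, bool]:
    """Return (new_text, changed); append only directives not already present."""
    existing = {line.strip() for line in conf_text.splitlines()}
    missing = [d for d in directives if d.strip() not in existing]
    if not missing:
        # Already applied: leave the file alone and signal "no restart needed".
        return conf_text, False
    # Make sure the existing content ends with a newline before appending.
    body = conf_text if (not conf_text or conf_text.endswith("\n")) else conf_text + "\n"
    return body + "\n".join(missing) + "\n", True


if __name__ == "__main__":
    text, changed = append_missing_lines("# existing config\n", PLACEHOLDER_DIRECTIVES)
    print(changed)   # first run appends
    text, changed = append_missing_lines(text, PLACEHOLDER_DIRECTIVES)
    print(changed)   # second run is a no-op
```

The `changed` flag is what would decide whether dnsmasq needs a restart, so a second run neither rewrites the file nor bounces the service.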

Root cause

Confirmed via live debugging on titan35:

  • Zero DHCP activity in conductor dnsmasq logs (no dhcp-range)
  • IPA booted but couldn't heartbeat (ipa-api-url used K8s hostname)
  • Adding dhcp-boot lines + restarting dnsmasq enabled full PXE chain
  • endpoint_override fix: node cleaned in ~80s vs 31-min timeouts

IP range allocation on 172.20.1.0/24

Range        Usage
.1           Gateway
.99          VIP
.150-.200    TripleO allocation pool
.190-.199    Inspector DHCP (existing)
.210-.239    OSP 17.1 inspector
.240-.249    Conductor DHCP (this PR)

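The new conductor range (.240-.249) overlaps no other range. The
inspector range (.190-.199) falls inside the TripleO allocation pool
(.150-.200), but that pool is dead post-adoption.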
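
A quick way to double-check the allocation is to compare every pair of ranges. The Python sketch below uses the last-octet values from the table; the helper names are illustrative. For this PR the relevant result is that the conductor range intersects nothing. The only pair reported is the inspector range sitting inside the TripleO allocation pool, which the commit message describes as dead post-adoption.

```python
from ipaddress import IPv4Address
from itertools import combinations

SUBNET_PREFIX = "172.20.1."

# Last-octet bounds taken from the table above (inclusive).
RANGES = {
    "Gateway": (1, 1),
    "VIP": (99, 99),
    "TripleO allocation pool": (150, 200),
    "Inspector DHCP": (190, 199),
    "OSP 17.1 inspector": (210, 239),
    "Conductor DHCP": (240, 249),
}


def to_addresses(first: int, last: int) -> tuple[IPv4Address, IPv4Address]:
    return IPv4Address(f"{SUBNET_PREFIX}{first}"), IPv4Address(f"{SUBNET_PREFIX}{last}")


def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """Two inclusive ranges intersect if each starts before the other ends."""
    a_start, a_end = to_addresses(*a)
    b_start, b_end = to_addresses(*b)
    return a_start <= b_end and b_start <= a_end


def overlapping_pairs(ranges: dict[str, tuple[int, int]]) -> list[tuple[str, str]]:
    return [(x, y) for x, y in combinations(ranges, 2) if overlaps(ranges[x], ranges[y])]


if __name__ == "__main__":
    for pair in overlapping_pairs(RANGES):
        print("overlap:", pair)
```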

Affected jobs

Only jobs using dpa_test_suite: 'test-with-ironic':

  • uni01alpha adoption
  • shiftstack adoption

Related

  • OSPRH-26289 — shiftstack adoption CI pipeline
  • ironic-operator upstream bug: conductor dnsmasq template missing UEFI
    boot directives (to be filed)

Test plan

  • Rerun shiftstack adoption pipeline with this fix (via Depends-On)
  • Verify conductor dnsmasq has dhcp-range and dhcp-boot after adoption
  • Verify IPA heartbeat succeeds (endpoint_override returns routable IP)
  • Verify new baremetal instance provisions successfully

openshift-ci bot commented Mar 10, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sathlan for approval. For more information see the Code Review Process.


imatza-rh marked this pull request as draft on March 10, 2026 19:15
imatza-rh changed the title from "[BUGFIX] Add dhcpRanges to ironicConductors" to "[BUGFIX] Fix ironic conductor for post-adoption baremetal provisioning" on Mar 11, 2026
imatza-rh force-pushed the add-conductor-dhcp-ranges branch 5 times, most recently from 32956ab to e6d940f, on March 16, 2026 19:46
imatza-rh force-pushed the add-conductor-dhcp-ranges branch from e6d940f to aed358b on March 16, 2026 20:47
imatza-rh force-pushed the add-conductor-dhcp-ranges branch from ecd0956 to d7434d0 on March 19, 2026 13:35
imatza-rh and others added 11 commits March 21, 2026 13:04
post-adoption baremetal provisioning

The ironic conductor's dnsmasq container had no
dhcp-range configured, preventing BMaaS VMs from
PXE booting IPA during deploy/clean operations.
This caused post-adoption baremetal provisioning
(test-with-ironic) to time out waiting for IPA
heartbeat.

The ironicInspector section already had dhcpRanges
(190-199), but the ironicConductors section was
missing them. Without DHCP, nodes power on via
sushy/redfish but never receive an IP to chainload
iPXE and boot IPA.

Adds conductor dhcpRanges (240-249) on the ironic
provisioning subnet, avoiding overlap with:
- Inspector DHCP range (190-199)
- TripleO allocation pool (150-200, dead
  post-adoption)
- OSP 17.1 inspector subnets (210-239, dead
  post-adoption)

Assisted-By: Claude Code
Signed-off-by: Itay Matza <imatza@redhat.com>

The ironic-operator conductor dnsmasq template is
missing UEFI iPXE boot directives that the inspector
template has. This causes post-adoption baremetal
provisioning to fail: UEFI PXE clients get DHCPOFFER
with no bootfile and loop indefinitely.

Additionally, the IPA ramdisk gets ipa-api-url set to
the Kubernetes internal hostname which is unreachable
from the provisioning network.

Fix both issues:
- Add endpoint_override to conductor customServiceConfig
  so IPA uses the MetalLB VIP (reachable via hypervisor
  routing) instead of K8s-internal hostname
- Add post-deployment task to patch conductor dnsmasq
  with UEFI iPXE boot chain directives (snponly.efi for
  first-stage PXE, boot.ipxe for second-stage iPXE)

The dnsmasq fix is gated by ironic_conductor_dnsmasq_uefi_fix
(default: true) and should be disabled once the
ironic-operator template is fixed upstream.

Both fixes validated on live environment (titan35):
node cleaning completed in ~80s vs 31-min timeouts.

Assisted-By: Claude Code
Signed-off-by: Itay Matza <imatza@redhat.com>

The verification grep checked /etc/dnsmasq.conf but the UEFI
boot directives are written to /var/lib/ironic/dnsmasq.conf.

Assisted-By: Claude Code
Signed-off-by: Itay Matza <imatza@redhat.com>

The multi-line python3 -c one-liner had leading whitespace on
the print() continuation line, causing IndentationError when
executed inside the YAML block scalar. Put the entire expression
on a single line.
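
The failure mode is easy to reproduce outside Ansible. In the Python sketch below, `BROKEN` and `FIXED` are stand-in snippets, not the expression from the actual task; `compile()` is used to show that an indented continuation line is rejected while the single-line form is accepted.

```python
# Stand-in for a python3 -c payload whose second line kept its YAML indentation.
BROKEN = "import json\n    print(json.dumps({'ok': True}))"

# Same statements joined on one line, as the fix does.
FIXED = "import json; print(json.dumps({'ok': True}))"


def compiles(source: str) -> bool:
    """Return True if the source parses, False on IndentationError."""
    try:
        compile(source, "<python3 -c>", "exec")
    except IndentationError:
        return False
    return True


if __name__ == "__main__":
    print("broken compiles:", compiles(BROKEN))
    print("fixed compiles:", compiles(FIXED))
```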

Assisted-By: Claude Code

After pcp_cleanup reverts source VMs to snapshot,
TripleO services (Pacemaker, HAProxy, Keystone)
need time to start up. Without a readiness check,
pre-launch scripts immediately hit HTTP 503 from
Keystone at the source VIP and fail.

Add a retry loop (up to 5 min) at the start of the
development_environment role that waits for a
successful token issue before running any pre-launch
scripts.
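
The readiness check follows the usual poll-until-deadline pattern. A minimal Python sketch, assuming a caller-supplied `check` callable that returns True once a token can be issued; the 300 s timeout mirrors the "up to 5 min" above, while the 10 s interval and all names are assumptions made for illustration.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float = 300.0,   # "up to 5 min"
    interval_s: float = 10.0,   # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll check() until it succeeds or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    attempts = []

    def keystone_ready() -> bool:
        # Stand-in for a successful token issue; succeeds on the third try.
        attempts.append(1)
        return len(attempts) >= 3

    print(wait_until_ready(keystone_ready, interval_s=0.01))
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting.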

Assisted-By: Claude Code
Signed-off-by: Itay Matza <imatza@redhat.com>

When nova-compute pods register with a different hostname than the
original TripleO services (e.g. .localdomain vs .example.com), old
service entries persist in the DB with the old version. The
nova_ffu version convergence check fails because it finds
mismatched versions across all non-deleted nova-compute services.

Add a defensive cleanup step before the version check that
soft-deletes any nova-compute services with a version lower than
the current maximum. This handles hostname mismatches during
adoption without requiring exact hostname alignment.
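
The selection rule can be sketched in Python as below. This models the logic only and does not touch the Nova database; `ComputeService`, the helper names, and the version numbers in the usage example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ComputeService:
    host: str
    version: int
    deleted: bool = False


def stale_compute_services(services: list[ComputeService]) -> list[ComputeService]:
    """Non-deleted services whose version is below the current maximum."""
    live = [s for s in services if not s.deleted]
    if not live:
        return []
    newest = max(s.version for s in live)
    return [s for s in live if s.version < newest]


def soft_delete_stale(services: list[ComputeService]) -> list[ComputeService]:
    """Mark stale services as deleted and return the ones that were changed."""
    stale = stale_compute_services(services)
    for service in stale:
        service.deleted = True
    return stale


if __name__ == "__main__":
    services = [
        ComputeService("compute-0.example.com", version=1),   # old TripleO entry
        ComputeService("compute-0.localdomain", version=2),   # adopted pod
    ]
    print([s.host for s in soft_delete_stale(services)])
```

After the cleanup, every remaining non-deleted service reports the same version, which is the condition the convergence check looks for.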

Assisted-By: Claude Code
Signed-off-by: Itay Matza <imatza@redhat.com>

After adoption, nodes retain old TripleO deploy_kernel and
deploy_ramdisk references. The existing reset task cleared
these to empty, expecting centrally configured defaults in
ironic.conf — but no global [conductor]deploy_kernel or
deploy_ramdisk is set by the ironic_patch.

Discover the conductor IP on the baremetal network and set
each node's deploy images to the httpboot-served IPA URLs
(port 8088), matching the pattern used by the UEFI fix.

Without this fix, nodes enter 'clean failed' state because
the IPA ramdisk cannot boot, causing any subsequent nova
instance creation to hang in BUILD/spawning indefinitely.

Signed-off-by: Itay Matza <imatza@redhat.com>
Assisted-By: Claude Code

Add UEFI iPXE boot chain and DHCP host entry workarounds
to nova_verify.yaml. The conductor dnsmasq template is
missing UEFI boot directives, and the conductor dnsmasq assigns
pool IPs instead of neutron-assigned IPs to deployed baremetal
instances (OVN doesn't serve vif_type=other).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Itay Matza <imatza@redhat.com>

Post-restart verification greps were checking /etc/dnsmasq.conf
instead of /var/lib/ironic/dnsmasq.conf where the config was
written. With set -e active, this causes the task to fail even
when the fix was correctly applied.

Signed-off-by: Itay Matza <imatza@redhat.com>

The "Provision new instance on ironic" task uses complex bash with
function definitions, escaped quotes, and arithmetic that triggers
Ansible's parse_kv() argument splitter error:

  ERROR! failed at splitting arguments, either an unbalanced
  jinja2 block or quotes

Fix: use explicit `cmd:` parameter instead of shell shorthand.
This bypasses parse_kv() and passes the content directly to the
shell module.

Also simplify echo messages (remove escaped quotes) and quote
the jq expression on the final ping line.

Relates: OSPRH-26289

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Itay Matza <imatza@redhat.com>

Three bugs in the pre-launch ironic test script:

1. wait_node_state used -c "Provisioning\ State" column name which
   breaks when the openstack command runs through SSH (quotes consumed
   by local shell, column name splits into two arguments).
   Fix: use --provision-state filter + count instead of column parsing.

2. wait_image_active hardcoded "Fedora-Cloud-Base-38" instead of using
   the $image_name parameter. Fix: use "${image_name}".

3. ACTIVE_NODES check used -c "Provisioning State" with same SSH
   quoting issue - always returned 0 regardless of actual node states.
   Fix: use --provision-state active filter + wc -l.

These bugs were latent on first runs (nodes in available state) but
caused failures on re-runs where nodes had active instances.
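
The quoting problem in bugs 1 and 3 comes from ssh joining its command arguments with spaces and letting the remote shell split them again. The Python sketch below approximates that with `shlex`, which follows POSIX-style word splitting; the helper names are illustrative. The fix in this commit sidesteps the issue by using the `--provision-state` filter, so no argument containing a space is needed.

```python
import shlex

# Local argument vector: the column name is a single argument here.
LOCAL_ARGV = ["openstack", "baremetal", "node", "list", "-c", "Provisioning State"]


def remote_argv_unquoted(local_argv: list[str]) -> list[str]:
    """Approximate what the remote shell sees when arguments are joined by spaces."""
    return shlex.split(" ".join(local_argv))


def remote_argv_quoted(local_argv: list[str]) -> list[str]:
    """Same round trip, but with each argument shell-quoted before joining."""
    return shlex.split(shlex.join(local_argv))


if __name__ == "__main__":
    print(remote_argv_unquoted(LOCAL_ARGV))  # column name split into two words
    print(remote_argv_quoted(LOCAL_ARGV))    # column name preserved
```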

Signed-off-by: Itay Matza <imatza@redhat.com>
imatza-rh force-pushed the add-conductor-dhcp-ranges branch from d7434d0 to d0c5105 on March 21, 2026 11:04