feat(dpf): delegate internal state handling to DPF Operator by FrankSpitulski · Pull Request #401 · NVIDIA/bare-metal-manager-core

FrankSpitulski · 2026-02-27T01:21:36Z

Description

Part two of the DPF SDK refactor. This moves internal state handling largely to the DPF Operator and lets the SDK trigger events when the state handler loop should act on DPU state changes. It still uses a custom BFB config and preloaded systemd services, and adds DTS as the first CRD-managed DPU service. The other DPU services (present in another branch) are held back until we can decide how they should be configured: they are untested, and the goal is not to break existing functionality from milestone 1. Once all services are moved over, we should be able to remove the (backported) tera config.

Type of Change

Add - New feature or capability
Change - Changes in existing functionality
Fix - Bug fixes
Remove - Removed features or deprecated functionality
Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Closes FORGE-7959

Breaking Changes

This PR contains breaking changes

Testing

Unit tests added/updated
Integration tests added/updated
Manual testing performed
No testing required (docs, internal refactor, etc.)

Additional Notes

abvarshney-nv · 2026-03-03T05:05:57Z

crates/api/src/state_controller/machine/handler/dpf.rs

+                host_bmc_ip: bmc_ip(&state.host_snapshot)?.to_string(),
+                dpu_device_names: dpu_ids,
+            };
+            dpf_sdk.register_dpu_node(node_info).await?;


Shouldn't we check whether the DPUDevice CR reached the Ready state? It can fail to get there for various reasons, such as an unreachable BMC IP, a serial number mismatch, or other DPF-related issues. Without that check it will be difficult for the team to debug.

I don't think we need to wait, no. We can declare everything up front and let the DPF operator figure it out. We just need to check on the status afterwards (which may be an error).
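A minimal sketch of the "declare up front, check status afterwards" idea from the reply above. All names here (`DeviceStatus`, `Outcome`, `check_after_declare`) are hypothetical and are not the SDK's or the operator's API; the sketch only assumes that a later pass of the state handler can read a per-device status that may be pending, ready, or an error.

```rust
/// Hypothetical, simplified view of what the operator reports for one device.
#[derive(Debug, Clone, PartialEq, Eq)]
enum DeviceStatus {
    /// The resource was declared but the operator has not finished with it.
    Pending,
    /// The operator reports the device as ready.
    Ready,
    /// The operator reported a problem, with its message.
    Error(String),
}

/// What the state handler decides on this pass of the loop (hypothetical).
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    Wait,
    Transition,
    Fail(String),
}

/// Runs on a later pass, after every resource has already been declared.
fn check_after_declare(statuses: &[(String, DeviceStatus)]) -> Outcome {
    // Surface the first error together with the device name so it is debuggable.
    for (name, status) in statuses {
        if let DeviceStatus::Error(msg) = status {
            return Outcome::Fail(format!("{}: {}", name, msg));
        }
    }
    // Only move on when there is at least one device and all are ready.
    if !statuses.is_empty() && statuses.iter().all(|(_, s)| *s == DeviceStatus::Ready) {
        Outcome::Transition
    } else {
        Outcome::Wait
    }
}
```

In this shape the debugging concern is handled by carrying the operator's error message into the failure outcome instead of blocking registration on readiness.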

abvarshney-nv · 2026-03-03T05:08:49Z

crates/api/files/bf.cfg

+ilog "==================================="
+# move the files out of the installer and into the installed OS
+mkdir -p /mnt/opt/forge
+mv /forge-scout /mnt/opt/forge


This file won't work, as no Carbide-related software is bundled into the vanilla BFB.

For this one we're still going to use the custom BFB. I removed the DPU services except DTS.

abvarshney-nv · 2026-03-03T07:00:46Z

crates/api-model/src/machine/mod.rs

+        /// Current DPU phase from DPF operator (for debugging/observability only).
+        /// Carbide should not care about non actionable DPF internal phases.
+        #[serde(default)]
+        phase: Option<String>,


Can we change this String to a Rust type? I see that you are also using a string to represent the phase in the SDK:

DpuStatusPhase::Initializing => Self::Provisioning("Initializing".into()),
DpuStatusPhase::NodeEffect => Self::NodeEffect,
DpuStatusPhase::Pending => Self::Provisioning("Pending".into()),
DpuStatusPhase::ConfigFwParameters => Self::Provisioning("ConfigFwParameters".into()),
DpuStatusPhase::PrepareBfb => Self::Provisioning("PrepareBfb".into()),
DpuStatusPhase::OsInstalling => Self::Provisioning("OsInstalling".into()),
DpuStatusPhase::DpuClusterConfig => Self::Provisioning("DpuClusterConfig".into()),
DpuStatusPhase::HostNetworkConfiguration => {
    Self::Provisioning("HostNetworkConfiguration".into())
}

These can also be converted to a Rust type.

This is already a compromise from a discussion with @chet and @wminckler. The string is meant to encourage use only as an informational note by Carbide SRE. It was originally not included, and removing it means less complexity and fewer DB writes. I'd be happy to remove it entirely.
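For comparison, here is a rough sketch of what the reviewer's typed alternative could look like. This is not what the PR does: the enum name `ProvisioningPhase` and its methods are hypothetical, and only the variant names are taken from the match arms quoted above.

```rust
use std::fmt;
use std::str::FromStr;

/// Hypothetical typed stand-in for the informational phase string.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ProvisioningPhase {
    Initializing,
    Pending,
    ConfigFwParameters,
    PrepareBfb,
    OsInstalling,
    DpuClusterConfig,
    HostNetworkConfiguration,
}

impl ProvisioningPhase {
    /// Every variant, used for parsing.
    const ALL: [ProvisioningPhase; 7] = [
        ProvisioningPhase::Initializing,
        ProvisioningPhase::Pending,
        ProvisioningPhase::ConfigFwParameters,
        ProvisioningPhase::PrepareBfb,
        ProvisioningPhase::OsInstalling,
        ProvisioningPhase::DpuClusterConfig,
        ProvisioningPhase::HostNetworkConfiguration,
    ];

    /// The same text the string-based version stores.
    fn as_str(self) -> &'static str {
        match self {
            ProvisioningPhase::Initializing => "Initializing",
            ProvisioningPhase::Pending => "Pending",
            ProvisioningPhase::ConfigFwParameters => "ConfigFwParameters",
            ProvisioningPhase::PrepareBfb => "PrepareBfb",
            ProvisioningPhase::OsInstalling => "OsInstalling",
            ProvisioningPhase::DpuClusterConfig => "DpuClusterConfig",
            ProvisioningPhase::HostNetworkConfiguration => "HostNetworkConfiguration",
        }
    }
}

impl fmt::Display for ProvisioningPhase {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.as_str())
    }
}

impl FromStr for ProvisioningPhase {
    type Err = String;

    /// Parses the stored text back into a variant; unknown text is an error.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        Self::ALL
            .iter()
            .copied()
            .find(|p| p.as_str() == s)
            .ok_or_else(|| format!("unknown phase: {}", s))
    }
}
```

The trade-off raised in the reply still applies: a typed enum makes it easy to match on operator-internal phases, which is exactly what the doc comment on the field says Carbide should not do, and any phase the operator adds later would fail to parse until the enum is updated.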

abvarshney-nv · 2026-03-03T07:05:25Z

crates/api/src/state_controller/machine/handler/dpf.rs

+            Ok(StateHandlerOutcome::transition(next))
+        }
+        DpfState::WaitingForReady { phase } => {
+            let node_name = dpu_node_name(&state.host_snapshot.id.to_string());


This seems like a big function doing many things. Can we break this state into multiple smaller states?

abvarshney-nv · 2026-03-03T07:18:41Z

crates/api/src/dpf.rs

+    }
+
+    fn node_labels(&self) -> BTreeMap<String, String> {
+        BTreeMap::from([(


This can be destructive: all the DPUs provisioned with Milestone 1 will trigger provisioning immediately because of the DPUDeployment change. Also, this is the same label used for DPUSet. It is better to use a different label here, something like: carbide.nvidia.com/controlled.node.v2
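A small sketch of why a distinct label key isolates the two flows, assuming plain equality-based label matching. The key `carbide.nvidia.com/controlled.node.v2` comes from the comment above; the value `"true"`, the free function `node_labels`, the helper `selector_matches`, and the legacy key used in the usage note are assumptions made for illustration only.

```rust
use std::collections::BTreeMap;

/// Label key suggested in the review comment.
const NODE_LABEL_V2: &str = "carbide.nvidia.com/controlled.node.v2";

/// Labels for nodes managed by the new flow. The value "true" is an assumption.
fn node_labels() -> BTreeMap<String, String> {
    BTreeMap::from([(NODE_LABEL_V2.to_string(), "true".to_string())])
}

/// Returns true when every selector entry is present on the node with the
/// same value (equality-based label selection).
fn selector_matches(
    selector: &BTreeMap<String, String>,
    node: &BTreeMap<String, String>,
) -> bool {
    selector.iter().all(|(key, value)| node.get(key) == Some(value))
}
```

With this shape, a node that carries only some other key (for example a placeholder such as `example.com/legacy-label`) is not selected by `node_labels()`, so nodes provisioned under the older flow are left untouched until the new label is applied to them.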

abvarshney-nv · 2026-03-03T07:22:56Z

crates/api/src/state_controller/machine/handler/dpf.rs

+            let node_name = dpu_node_name(&state.host_snapshot.id.to_string());
+            for dpu in &state.dpu_snapshots {
+                dpf_sdk
+                    .reprovision_dpu(&dpu.id.to_string(), &node_name)


Deleting DPU CRs won't move currently provisioned DPUs to the new DPUDeployment-based mode; Carbide will still contain DPUSet-based CRs. We need new logic, something like the following:

If the old label exists:

1. Remove the old label. This deletes the DPU CR.

2. Apply the new label. This triggers provisioning.

3. Continue with the state machine.

If the new label is already applied, just delete the DPU CR.
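The proposed logic can be sketched as a pure decision function over a node's labels. Everything here is hypothetical: the `MigrationStep` enum, the `plan_reprovision` name, and passing the label keys as parameters are illustration choices, and the case where neither label is present is not covered by the comment, so the sketch returns no steps for it.

```rust
use std::collections::BTreeMap;

/// Steps named in the review comment, as a hypothetical enum.
#[derive(Debug, PartialEq, Eq)]
enum MigrationStep {
    RemoveOldLabel,
    ApplyNewLabel,
    ContinueStateMachine,
    DeleteDpuCr,
}

/// Decides which steps to run for one node, based only on its labels.
fn plan_reprovision(
    labels: &BTreeMap<String, String>,
    old_key: &str,
    new_key: &str,
) -> Vec<MigrationStep> {
    if labels.contains_key(old_key) {
        // Old flow: removing the old label deletes the DPU CR, applying the
        // new label triggers provisioning, then the state machine continues.
        vec![
            MigrationStep::RemoveOldLabel,
            MigrationStep::ApplyNewLabel,
            MigrationStep::ContinueStateMachine,
        ]
    } else if labels.contains_key(new_key) {
        // Already on the new flow: deleting the DPU CR is enough.
        vec![MigrationStep::DeleteDpuCr]
    } else {
        // Not described in the comment; left as a no-op in this sketch.
        Vec::new()
    }
}
```

Keeping the decision separate from the Kubernetes calls makes it easy to unit test, but whether existing DPUs should be migrated at all is the open question in the reply that follows.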

I thought we were leaving all existing DPUs alone?

abvarshney-nv · 2026-03-03T07:24:53Z

crates/api/src/state_controller/machine/handler/dpf.rs

-            state: DpfState::UpdateNodeEffectAnnotation,
+        DpfState::Reprovisioning => {
+            let node_name = dpu_node_name(&state.host_snapshot.id.to_string());
+            for dpu in &state.dpu_snapshots {


Carbide supports a model where only ONE DPU can be provisioned. In that case, more validation and logic will be needed.

What is the expected behaviour? Reprovisioning requires host maintenance and reboots anyway.

abvarshney-nv · 2026-03-03T07:28:47Z

crates/api/src/state_controller/machine/handler/dpf.rs

+    dpus: &[Machine],
+    dpf_sdk: &dyn DpfOperations,
+) -> Result<bool, StateHandlerError> {
+    FuturesUnordered::from_iter(dpus.iter().map(|dpu| {


Usually we use a sync state to verify this kind of thing in Carbide. All DPUs are moved to a sync state when their processing is done. Once all DPUs reach the sync state, Carbide takes the next action, moves all DPUs to the next state, and continues processing.

Added a new state and synced on that.
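A minimal sketch of the sync-state pattern described above, with hypothetical state names (`Provisioning`, `WaitingForSync`, `Ready`) that are not taken from the PR. The point is only the barrier: nothing advances until every DPU has reached the sync state, and then all of them advance together.

```rust
/// Hypothetical per-DPU states; the real state names in Carbide differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DpuState {
    Provisioning,
    WaitingForSync,
    Ready,
}

/// Moves every DPU to the next state only when all of them have reached the
/// sync state. Returns whether the transition happened.
fn advance_if_synced(dpus: &mut [DpuState]) -> bool {
    let all_synced =
        !dpus.is_empty() && dpus.iter().all(|s| *s == DpuState::WaitingForSync);
    if !all_synced {
        // At least one DPU is still being processed: leave everything as is.
        return false;
    }
    for state in dpus.iter_mut() {
        *state = DpuState::Ready;
    }
    true
}
```

Compared with polling every DPU concurrently inside a single state, the barrier keeps each handler pass small and makes the "waiting on the slowest DPU" condition visible in the persisted state.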

abvarshney-nv · 2026-03-03T07:33:53Z

crates/api/src/state_controller/machine/handler/dpf.rs

+                    return Ok(StateHandlerOutcome::transition(updated)
+                        .in_transaction(&pool, move |txn| {
+                            async move {
+                                db::machine::insert_health_report_override(


What is the benefit of adding this health override? Usually, when a machine is ingested, we don't need a health override (because the machine is not usable during ingestion), and when a machine is re-provisioned while it is allocated to a tenant, Carbide already applies an override (Host-Update something).

The DPU can indicate failure as well.

Closes FORGE-7959 Signed-off-by: fspitulski <fspitulski@nvidia.com>

FrankSpitulski requested a review from a team as a code owner February 27, 2026 01:21

FrankSpitulski force-pushed the feat/dpf-operator/use-sdk-no-charts branch 4 times, most recently from a290d8c to 2060bdd, on February 28, 2026 02:56

abvarshney-nv requested changes Mar 3, 2026

feat(dpf): delegate internal state handling to DPF Operator

ffe9f26

FrankSpitulski force-pushed the feat/dpf-operator/use-sdk-no-charts branch from 2060bdd to ffe9f26, on March 4, 2026 01:34

2 participants