feat(dpf): delegate internal state handling to DPF Operator #401

Open
FrankSpitulski wants to merge 1 commit into NVIDIA:main from FrankSpitulski:feat/dpf-operator/use-sdk-no-charts

@FrankSpitulski
Contributor

Description

Part two of the DPF SDK refactor. This moves internal state handling largely to the DPF Operator and lets the SDK trigger events when the state handler loop should act on DPU state changes. We are still using a custom BFB config and preloaded systemd services. This adds DTS as the first CRD-managed DPU service. I'm holding back on using the other DPU services (present in another branch) until we can decide how they should be configured; the reasoning is to avoid breaking existing Milestone 1 functionality, since those DPU services are untested. Once all services are moved over, we should be able to remove the (backported) tera config.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Closes FORGE-7959

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@FrankSpitulski FrankSpitulski requested a review from a team as a code owner February 27, 2026 01:21
@FrankSpitulski FrankSpitulski force-pushed the feat/dpf-operator/use-sdk-no-charts branch 4 times, most recently from a290d8c to 2060bdd Compare February 28, 2026 02:56
host_bmc_ip: bmc_ip(&state.host_snapshot)?.to_string(),
dpu_device_names: dpu_ids,
};
dpf_sdk.register_dpu_node(node_info).await?;
Contributor

Shouldn't we check whether the DPUDevice CR has reached the Ready state? It can fail to become Ready for various reasons, such as an unreachable BMC IP, a serial number mismatch, or other DPF-related issues. Without that check, failures will be difficult for the team to debug.

Contributor Author

I don't think we need to wait, no. We can declare everything up front and let the DPF Operator figure it out. We just need to check on the status afterwards (which may be an error).
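
A minimal sketch of the "declare up front, check status afterwards" approach described in this reply. The `DeviceStatus` and `Outcome` types and the `check_after_register` function are hypothetical names for illustration only, not part of the PR or the DPF SDK; the point is only that an error status is surfaced rather than waited on.

```rust
/// Hypothetical summary of a DPUDevice CR's status as seen by Carbide.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DeviceStatus {
    NotReady,
    Ready,
    Error(String),
}

/// Hypothetical result of one state handler iteration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Outcome {
    /// Nothing actionable yet; check again on the next iteration.
    Wait,
    /// The device is ready; move to the next state.
    Transition,
    /// The operator reported a failure; carry the reason for debugging.
    Failed(String),
}

/// Maps the status observed after registration to a handler outcome.
/// Registration itself does not block on readiness.
pub fn check_after_register(status: &DeviceStatus) -> Outcome {
    match status {
        DeviceStatus::NotReady => Outcome::Wait,
        DeviceStatus::Ready => Outcome::Transition,
        DeviceStatus::Error(reason) => Outcome::Failed(reason.clone()),
    }
}
```

Under this assumption, the reviewer's debugging concern would be addressed by the `Failed` reason (for example an unreachable BMC IP) rather than by blocking in the registration step.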

ilog "==================================="
# move the files out of the installer and into the installed OS
mkdir -p /mnt/opt/forge
mv /forge-scout /mnt/opt/forge
Contributor

This file won't work, because no Carbide-related software is bundled into the vanilla BFB.

Contributor Author

For this one we're still going to use the custom BFB. I removed the DPU services except DTS.

/// Current DPU phase from DPF operator (for debugging/observability only).
/// Carbide should not care about non actionable DPF internal phases.
#[serde(default)]
phase: Option<String>,
Contributor

Can we change this String to a Rust type? I see that you are also using a string to represent the phase in the SDK:

DpuStatusPhase::Initializing => Self::Provisioning("Initializing".into()),
DpuStatusPhase::NodeEffect => Self::NodeEffect,
DpuStatusPhase::Pending => Self::Provisioning("Pending".into()),
DpuStatusPhase::ConfigFwParameters => Self::Provisioning("ConfigFwParameters".into()),
DpuStatusPhase::PrepareBfb => Self::Provisioning("PrepareBfb".into()),
DpuStatusPhase::OsInstalling => Self::Provisioning("OsInstalling".into()),
DpuStatusPhase::DpuClusterConfig => Self::Provisioning("DpuClusterConfig".into()),
DpuStatusPhase::HostNetworkConfiguration => {
    Self::Provisioning("HostNetworkConfiguration".into())
}

These can also be converted to a Rust type.
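
As a rough illustration of what this suggestion could look like, the sketch below swaps the `String` payload for an enum. `DpuStatusPhase` here is a stand-in that lists only the variants quoted above (the real SDK type may have more), and `ProvisioningStep` and `DpuPhase` are hypothetical names, not types from the PR.

```rust
use std::fmt;

/// Stand-in for the SDK's phase enum; only the variants quoted in the
/// review comment are listed here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DpuStatusPhase {
    Initializing,
    NodeEffect,
    Pending,
    ConfigFwParameters,
    PrepareBfb,
    OsInstalling,
    DpuClusterConfig,
    HostNetworkConfiguration,
}

/// Hypothetical typed replacement for the `String` carried by `Provisioning`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProvisioningStep {
    Initializing,
    Pending,
    ConfigFwParameters,
    PrepareBfb,
    OsInstalling,
    DpuClusterConfig,
    HostNetworkConfiguration,
}

impl fmt::Display for ProvisioningStep {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // The derived Debug output is the variant name, which doubles as
        // the human-readable label.
        write!(f, "{:?}", self)
    }
}

/// Hypothetical Carbide-side view of the phase.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DpuPhase {
    Provisioning(ProvisioningStep),
    NodeEffect,
}

impl From<DpuStatusPhase> for DpuPhase {
    fn from(phase: DpuStatusPhase) -> Self {
        use ProvisioningStep as Step;
        match phase {
            DpuStatusPhase::Initializing => Self::Provisioning(Step::Initializing),
            DpuStatusPhase::NodeEffect => Self::NodeEffect,
            DpuStatusPhase::Pending => Self::Provisioning(Step::Pending),
            DpuStatusPhase::ConfigFwParameters => Self::Provisioning(Step::ConfigFwParameters),
            DpuStatusPhase::PrepareBfb => Self::Provisioning(Step::PrepareBfb),
            DpuStatusPhase::OsInstalling => Self::Provisioning(Step::OsInstalling),
            DpuStatusPhase::DpuClusterConfig => Self::Provisioning(Step::DpuClusterConfig),
            DpuStatusPhase::HostNetworkConfiguration => {
                Self::Provisioning(Step::HostNetworkConfiguration)
            }
        }
    }
}
```

The trade-off is that a typed phase invites matching on it in Carbide, whereas a plain string signals that the value is informational only.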

Contributor Author

@FrankSpitulski FrankSpitulski Mar 3, 2026

This is already a compromise from a discussion with @chet and @wminckler. The string is meant to encourage its use only as an informational note for Carbide SREs. It was originally not included, and removing it means less complexity and fewer DB writes. I'd be happy to remove it entirely.

Ok(StateHandlerOutcome::transition(next))
}
DpfState::WaitingForReady { phase } => {
let node_name = dpu_node_name(&state.host_snapshot.id.to_string());
Contributor

This seems like a big function doing many things. Can we break this state into multiple smaller states?

Contributor Author

split

}

fn node_labels(&self) -> BTreeMap<String, String> {
BTreeMap::from([(
Contributor

This can be destructive: all DPUs provisioned with Milestone 1 will trigger provisioning immediately because of the DPUDeployment change. This is also the same label used for the DPUSet. It is better to use a different label here, something like: carbide.nvidia.com/controlled.node.v2

Contributor Author

updated

let node_name = dpu_node_name(&state.host_snapshot.id.to_string());
for dpu in &state.dpu_snapshots {
dpf_sdk
.reprovision_dpu(&dpu.id.to_string(), &node_name)
Contributor

Deleting DPU CRs won't move currently provisioned DPUs to the new DPUDeployment-based mode. Carbide will still contain DPUSet-based CRs. We need new logic, something like the following:

If the old label exists:

  1. Remove the old label. This will delete the DPU CR.
  2. Apply the new label. This will trigger provisioning.
  3. Continue with the state machine.

If the new label is already applied, just delete the DPU CR.
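
The branching proposed above could be sketched as a pure planning function. `MigrationStep` and `plan_reprovision` are hypothetical names, the label keys are passed in as parameters because the old label's value is not given in this thread, and the "neither label present" case returning no steps is an assumption, since the comment does not cover it.

```rust
use std::collections::BTreeMap;

/// Hypothetical actions the state handler would carry out, in order.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MigrationStep {
    RemoveLabel(String),
    ApplyLabel(String),
    DeleteDpuCr,
    ContinueStateMachine,
}

/// Plans reprovisioning for a node based on which label it carries.
pub fn plan_reprovision(
    node_labels: &BTreeMap<String, String>,
    old_label: &str,
    new_label: &str,
) -> Vec<MigrationStep> {
    if node_labels.contains_key(old_label) {
        // Old (DPUSet-based) node: removing the old label deletes the DPU CR,
        // applying the new label triggers provisioning.
        vec![
            MigrationStep::RemoveLabel(old_label.to_string()),
            MigrationStep::ApplyLabel(new_label.to_string()),
            MigrationStep::ContinueStateMachine,
        ]
    } else if node_labels.contains_key(new_label) {
        // Already on the new label: deleting the DPU CR is enough.
        vec![MigrationStep::DeleteDpuCr]
    } else {
        // Assumption: an unlabeled node needs no migration steps.
        Vec::new()
    }
}
```

Keeping the decision separate from the Kubernetes calls makes the branching easy to unit test without a cluster.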

Contributor Author

I thought we were leaving all existing dpus alone?

state: DpfState::UpdateNodeEffectAnnotation,
DpfState::Reprovisioning => {
let node_name = dpu_node_name(&state.host_snapshot.id.to_string());
for dpu in &state.dpu_snapshots {
Contributor

Carbide supports the model where only ONE DPU can be provisioned. In this case more validations and logic will be needed.

Contributor Author

what is the expected behaviour? reprovisioning requires host maintenance and reboots anyway.

dpus: &[Machine],
dpf_sdk: &dyn DpfOperations,
) -> Result<bool, StateHandlerError> {
FuturesUnordered::from_iter(dpus.iter().map(|dpu| {
Contributor

Usually we use a sync state to verify this kind of thing in Carbide. Each DPU is moved to a sync state when its processing is done. Once all DPUs have reached the sync state, Carbide takes the next action, moves all DPUs to the next state, and continues processing.
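
A minimal sketch of the sync-state pattern described here, assuming a simplified per-DPU state. `DpuState` and `advance_if_all_synced` are hypothetical names for illustration and do not reflect Carbide's actual state types.

```rust
/// Hypothetical, simplified per-DPU state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DpuState {
    /// The DPU is still being processed.
    Processing,
    /// The DPU has finished and is waiting for its siblings.
    Synced,
    /// The state entered once every DPU has synced.
    Next,
}

/// Moves every DPU to `Next` only when all of them have reached `Synced`.
/// Returns true if the group advanced, false if at least one DPU is still
/// processing (or there are no DPUs at all).
pub fn advance_if_all_synced(states: &mut [DpuState]) -> bool {
    let all_synced = !states.is_empty() && states.iter().all(|s| *s == DpuState::Synced);
    if all_synced {
        for state in states.iter_mut() {
            *state = DpuState::Next;
        }
    }
    all_synced
}
```

The sync state acts as a barrier: no DPU advances until the slowest one has caught up.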

Contributor Author

added new state and synced on that

return Ok(StateHandlerOutcome::transition(updated)
.in_transaction(&pool, move |txn| {
async move {
db::machine::insert_health_report_override(
Contributor

What is the benefit of adding this health override? Usually, when a machine is ingested, we don't need a health override (because the machine is not usable during ingestion), and when a machine is reprovisioned while it is allocated to a tenant, Carbide already applies an override (Host-Update something).

Contributor Author

the dpu can indicate failure as well

Closes FORGE-7959

Signed-off-by: fspitulski <fspitulski@nvidia.com>
@FrankSpitulski FrankSpitulski force-pushed the feat/dpf-operator/use-sdk-no-charts branch from 2060bdd to ffe9f26 Compare March 4, 2026 01:34
2 participants