feat(dpf): delegate internal state handling to DPF Operator #401

FrankSpitulski wants to merge 1 commit into NVIDIA:main
Conversation
        host_bmc_ip: bmc_ip(&state.host_snapshot)?.to_string(),
        dpu_device_names: dpu_ids,
    };
    dpf_sdk.register_dpu_node(node_info).await?;
Shouldn't we check that the DPUDevice CR came into the Ready state? A failure can be due to various reasons, like the BMC IP not being reachable, a serial number mismatch, or other DPF-related issues. It will be difficult for the team to debug.
I don't think we need to wait, no. We can declare everything up front and let the DPF operator figure it out. We just need to check on the status afterwards (which may be an error).
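A minimal sketch of that "declare up front, check status afterwards" flow. The `DpuStatus`/`Outcome` types, field names, and the `"Ready"` phase string are all illustrative assumptions, not the real SDK types:

```rust
// Hypothetical sketch: registration is fire-and-forget; a later state
// inspects the reported status (which may carry an error) and maps it to
// an actionable outcome, so DPF-side failures stay debuggable.
#[derive(Debug)]
struct DpuStatus {
    phase: Option<String>,
    error: Option<String>,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    KeepWaiting,
    Failed(String),
    Ready,
}

fn check_status(status: &DpuStatus) -> Outcome {
    if let Some(err) = &status.error {
        // Surface DPF-side failures (unreachable BMC IP, serial number
        // mismatch, ...) instead of blocking registration on them.
        Outcome::Failed(err.clone())
    } else if status.phase.as_deref() == Some("Ready") {
        Outcome::Ready
    } else {
        Outcome::KeepWaiting
    }
}
```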
    ilog "==================================="
    # move the files out of the installer and into the installed OS
    mkdir -p /mnt/opt/forge
    mv /forge-scout /mnt/opt/forge
This file won't work, as no carbide-related software is bundled into the vanilla BFB.
For this one we're still going to use the custom BFB. I removed the DPU services except DTS.
    /// Current DPU phase from DPF operator (for debugging/observability only).
    /// Carbide should not care about non actionable DPF internal phases.
    #[serde(default)]
    phase: Option<String>,
Can we change this String to a Rust type? I see that you are using a string to represent the phase in the SDK as well:

    DpuStatusPhase::Initializing => Self::Provisioning("Initializing".into()),
    DpuStatusPhase::NodeEffect => Self::NodeEffect,
    DpuStatusPhase::Pending => Self::Provisioning("Pending".into()),
    DpuStatusPhase::ConfigFwParameters => Self::Provisioning("ConfigFwParameters".into()),
    DpuStatusPhase::PrepareBfb => Self::Provisioning("PrepareBfb".into()),
    DpuStatusPhase::OsInstalling => Self::Provisioning("OsInstalling".into()),
    DpuStatusPhase::DpuClusterConfig => Self::Provisioning("DpuClusterConfig".into()),
    DpuStatusPhase::HostNetworkConfiguration => {
        Self::Provisioning("HostNetworkConfiguration".into())
    }

These could also be converted to a Rust type.
This is already a compromise from discussion with @chet and @wminckler. The string is to encourage use only as an informational note by carbide SRE. It originally was not included, and there is less complexity and fewer DB writes when removed. I'd be happy to remove it entirely.
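For reference, a typed version of the mapping quoted above might look like the sketch below. The variant names come from the quoted match arms; `CarbidePhase` is an illustrative stand-in for the carbide-side type, which in the real code wraps a `String` rather than the enum:

```rust
// Hypothetical sketch: a typed DPF phase instead of a free-form String.
// Only `NodeEffect` is actionable; every other phase collapses into a
// generic Provisioning state, matching the match arms in the review.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DpuStatusPhase {
    Initializing,
    NodeEffect,
    Pending,
    ConfigFwParameters,
    PrepareBfb,
    OsInstalling,
    DpuClusterConfig,
    HostNetworkConfiguration,
}

#[derive(Debug, PartialEq, Eq)]
enum CarbidePhase {
    Provisioning(DpuStatusPhase),
    NodeEffect,
}

impl From<DpuStatusPhase> for CarbidePhase {
    fn from(phase: DpuStatusPhase) -> Self {
        match phase {
            DpuStatusPhase::NodeEffect => CarbidePhase::NodeEffect,
            // All non-actionable DPF-internal phases are informational only.
            other => CarbidePhase::Provisioning(other),
        }
    }
}
```

The trade-off discussed above still applies: the enum gives exhaustiveness checking, while the string version tolerates new DPF phases without a code change.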
        Ok(StateHandlerOutcome::transition(next))
    }
    DpfState::WaitingForReady { phase } => {
        let node_name = dpu_node_name(&state.host_snapshot.id.to_string());
This seems like a big function doing many things. Can we break this state into multiple smaller states?
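One way to split the large arm into smaller states, purely as an illustration. The sub-state names and the single-check-per-tick shape are assumptions, not the real `DpfState` variants:

```rust
// Hypothetical sketch: each sub-step of waiting-for-ready becomes its own
// state, so the handler loop performs one check per tick and the big
// function shrinks to a transition table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WaitPhase {
    CheckDpuDeviceReady,
    CheckDpuReady,
    CheckNodeReady,
    Done,
}

fn advance(phase: WaitPhase, check_passed: bool) -> WaitPhase {
    if !check_passed {
        return phase; // stay in the current state and retry next tick
    }
    match phase {
        WaitPhase::CheckDpuDeviceReady => WaitPhase::CheckDpuReady,
        WaitPhase::CheckDpuReady => WaitPhase::CheckNodeReady,
        WaitPhase::CheckNodeReady => WaitPhase::Done,
        WaitPhase::Done => WaitPhase::Done,
    }
}
```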
| } | ||
|
|
||
| fn node_labels(&self) -> BTreeMap<String, String> { | ||
| BTreeMap::from([( |
This can be destructive: all the DPUs provisioned with Milestone 1 will trigger provisioning immediately due to the DPUDeployment change. Also, this is the same label used for the DPUSet. It is better to use a different label here, something like carbide.nvidia.com/controlled.node.v2.
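The suggestion could look roughly like this. The v2 key is taken from the comment above; the legacy key name and the `"true"` value are hypothetical placeholders:

```rust
// Sketch of the versioned-label idea: only the new v2 label is applied,
// so DPUSet-managed Milestone 1 nodes are not matched by the new
// DPUDeployment selector and are left alone.
use std::collections::BTreeMap;

// Hypothetical legacy key used by the Milestone 1 DPUSet selector.
const LEGACY_LABEL: &str = "carbide.nvidia.com/controlled.node";
// New key suggested in the review for DPUDeployment-based provisioning.
const V2_LABEL: &str = "carbide.nvidia.com/controlled.node.v2";

fn node_labels() -> BTreeMap<String, String> {
    BTreeMap::from([(V2_LABEL.to_string(), "true".to_string())])
}
```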
    let node_name = dpu_node_name(&state.host_snapshot.id.to_string());
    for dpu in &state.dpu_snapshots {
        dpf_sdk
            .reprovision_dpu(&dpu.id.to_string(), &node_name)
Deleting DPU CRs won't shift currently provisioned DPUs to the new DPUDeployment-based mode; carbide will still contain DPUSet-based CRs. We need new logic, something like the following:

If the old label exists:
- Remove the old label. This will delete the DPU CR.
- Apply the new label. This will trigger provisioning.
- Continue with the state machine.

If the new label is already applied, just delete the DPU CR.
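The branching the reviewer describes can be sketched as a small decision function. The type and names are illustrative; the actual label/CR operations would live in the state handler:

```rust
// Hypothetical sketch of the suggested migration logic: which action the
// reprovision path should take depends on whether the node still carries
// the old DPUSet-era label.
#[derive(Debug, PartialEq, Eq)]
enum MigrationStep {
    /// Remove old label (deletes the DPU CR), apply the new label
    /// (triggers provisioning), then continue with the state machine.
    RemoveOldLabelThenApplyNew,
    /// Node is already on the new label: just delete the DPU CR.
    DeleteDpuCr,
}

fn reprovision_step(has_old_label: bool) -> MigrationStep {
    if has_old_label {
        MigrationStep::RemoveOldLabelThenApplyNew
    } else {
        MigrationStep::DeleteDpuCr
    }
}
```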
I thought we were leaving all existing DPUs alone?
        state: DpfState::UpdateNodeEffectAnnotation,
    DpfState::Reprovisioning => {
        let node_name = dpu_node_name(&state.host_snapshot.id.to_string());
        for dpu in &state.dpu_snapshots {
Carbide supports a model where only ONE DPU can be provisioned. In that case, more validation and logic will be needed.
What is the expected behaviour? Reprovisioning requires host maintenance and reboots anyway.
    dpus: &[Machine],
    dpf_sdk: &dyn DpfOperations,
) -> Result<bool, StateHandlerError> {
    FuturesUnordered::from_iter(dpus.iter().map(|dpu| {
Usually we use a sync state to verify this kind of thing in carbide. All DPUs are moved to a sync state when their processing is done. Once all DPUs reach the sync state, carbide takes the next action, moves all DPUs to the next state, and continues processing.
added new state and synced on that
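The sync-state pattern boils down to a barrier: the host-level machine only advances once every DPU reports the sync state. A minimal sketch, with illustrative state names:

```rust
// Hypothetical sketch of the sync-state barrier described above: each DPU
// is moved to `Synced` when its own processing finishes, and the host
// state machine advances only when every DPU reports `Synced`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DpuSyncState {
    Processing,
    Synced,
    Failed,
}

fn all_synced(dpus: &[DpuSyncState]) -> bool {
    // Note: vacuously true for an empty slice; callers with a
    // zero-DPU host would need to decide if that is acceptable.
    dpus.iter().all(|s| *s == DpuSyncState::Synced)
}
```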
    return Ok(StateHandlerOutcome::transition(updated)
        .in_transaction(&pool, move |txn| {
            async move {
                db::machine::insert_health_report_override(
What is the benefit of adding this health override? Usually when a machine is ingested we don't need a health override (because the machine is not usable during this), and when the machine is re-provisioned while allocated to a tenant, carbide already applies an override (Host-Update something).
The DPU can indicate failure as well.
Closes FORGE-7959
Signed-off-by: fspitulski <fspitulski@nvidia.com>
Description
Part two of the DPF SDK refactor. This moves internal state handling largely to the DPF operator and lets the SDK trigger events when the state handler loop should act on DPU state changes. We are still using a custom BFB config and preloaded systemd services. Adds DTS as the first CRD-managed DPU service. Holding back on using the other DPU services (present in another branch) until we can decide how those should be configured; the reasoning is not to break existing functionality from Milestone 1, as those DPU services are untested. When all services are moved over, we should be able to remove the (backported) tera config.
Type of Change
Related Issues (Optional)
Closes FORGE-7959
Breaking Changes
Testing
Additional Notes