feat: integrate mlxconfig profile management into the dpa workflow#371
feat: integrate mlxconfig profile management into the dpa workflow#371chet wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
This builds on the [firmware management work](NVIDIA#323) (and `ApplyFirmware`) to additionally implement `ApplyProfile` within the DPA provisioning workflow (it has been stubbed out with placeholders). The `ApplyProfile` state now handles `mlxconfig` profile management -- resetting the device's `mlxconfig` parameters to factory defaults between tenancies, and then optionally applying a named `MlxConfigProfile` if one is configured for the interface. This behavior of reset + apply updated values is the recommended guidance from NBU. High level changes include: 1. New `mlxconfig_profile` column on `dpa_interfaces` -- an optional profile name that maps into `carbide-api`'s `mlxconfig_profiles` config map. 2. Reworking the `OpCode::ApplyProfile` variant to carry an `Option<SerializableProfile>` (mirroring how `ApplyFirmware` carries a `FirmwareFlasherProfile`). 3. `carbide-api`-side config lookup + serialization in `build_apply_profile_command`. 4. `scout`-side implementation in `mlx_device::apply_profile()`. 5. Corresponding State Controller updates to handle both the reset-only and reset + profile sync workflows. In this workflow: 1. We check the interface's `mlxconfig_profile` field. 2. If `None`, we send `ApplyProfile { serialized_profile: None }`, and `scout` will reset to factory defaults (to prepare for the next tenant) and report success. 3. If set, we look it up in the `runtime_config.mlxconfig_profiles` map, serialize it via `SerializableProfile::from_profile()`, and send it down to `scout`. 4. If the profile name is set, but can't be found in config, we return an error rather than sending `None` (which would silently reset without applying any intended profile(s)). 5. `scout` always resets mlxconfig to factory defaults first, then applies the profile if one was provided, and reports back via `MlxObservation`. The `ApplyProfile` state handler was also broken out into its own `handle_apply_profile()` function, making it independently testable without needing the full async state controller scaffolding. I need to go back and do this in a few other pre-existing places. Existing tests updated as needed, and new tests introduced. Signed-off-by: Chet Nichols III <chetn@nvidia.com>
wminckler
left a comment
There was a problem hiding this comment.
Mostly nits/thoughts.
If the version thing doesn't apply, then its good
| -- device during the ApplyProfile state. A null/empty value | ||
| -- means just reset to the card defaults (and don't apply | ||
| -- anything else beyond that). | ||
| ALTER TABLE dpa_interfaces ADD COLUMN IF NOT EXISTS mlxconfig_profile TEXT; |
There was a problem hiding this comment.
should the name of a profile be unbounded? (TEXT vs VARCHAR). In reality, probably doesn't matter...l
There was a problem hiding this comment.
Do we need this? I don't see any changes in api-db/src to update this. Also, we already save this in the card_state field in the DB.
| %machine_id, %pci_name, %profile_name, | ||
| "mlxconfig_profile not found in config" | ||
| ); | ||
| CarbideError::GenericErrorFromReport(eyre!( |
There was a problem hiding this comment.
I don't understand when we use "GenericError" vs making a specific error enum. Add to that, why do we use eyre and CarbideError together. Not really an issue for this PR, but having a generic error in an error enum seems contradictory
There was a problem hiding this comment.
since this isn't actually returned from the API, it seems legit ;)
| /// | ||
| /// In both cases, profile_synced=Some(true) is the signal that | ||
| /// the workflow completed successfully, and it's safe to transition | ||
| /// to the next state. |
There was a problem hiding this comment.
- typically we use a version to identify that the correct thing was sync'd.
- comment says that a profile_name is sent back from scout, but that's not checked, so the api doesn't actually know what's sync'd (and as above, we normally use a version).
- can a profile config change? then you really should have a version and not rely on the name
Description
This builds on the firmware management work (and
ApplyFirmware) to additionally implementApplyProfilewithin the DPA provisioning workflow (it has been stubbed out with placeholders).The
ApplyProfilestate now handlesmlxconfigprofile management -- resetting the device'smlxconfigparameters to factory defaults between tenancies, and then optionally applying a namedMlxConfigProfileif one is configured for the interface. This behavior of reset + apply updated values is the recommended guidance from NBU.High level changes include:
mlxconfig_profilecolumn ondpa_interfaces-- an optional profile name that maps intocarbide-api'smlxconfig_profilesconfig map.OpCode::ApplyProfilevariant to carry anOption<SerializableProfile>(mirroring howApplyFirmwarecarries aFirmwareFlasherProfile).carbide-api-side config lookup + serialization inbuild_apply_profile_command.scout-side implementation inmlx_device::apply_profile().In this workflow:
mlxconfig_profilefield.None, we sendApplyProfile { serialized_profile: None }, andscoutwill reset to factory defaults (to prepare for the next tenant) and report success.runtime_config.mlxconfig_profilesmap, serialize it viaSerializableProfile::from_profile(), and send it down toscout.None(which would silently reset without applying any intended profile(s)).scoutalways resets mlxconfig to factory defaults first, then applies the profile if one was provided, and reports back viaMlxObservation.The
ApplyProfilestate handler was also broken out into its ownhandle_apply_profile()function, making it independently testable without needing the full async state controller scaffolding. I need to go back and do this in a few other pre-existing places.Existing tests updated as needed, and new tests introduced.
Signed-off-by: Chet Nichols III chetn@nvidia.com
Type of Change
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes