feat: integrate mlxconfig profile management into the dpa workflow by chet · Pull Request #371 · NVIDIA/bare-metal-manager-core

chet · 2026-02-24T05:41:21Z

Description

This builds on the firmware management work (NVIDIA#323) and ApplyFirmware to additionally implement ApplyProfile within the DPA provisioning workflow, where it had previously been stubbed out with placeholders.

The ApplyProfile state now handles mlxconfig profile management -- resetting the device's mlxconfig parameters to factory defaults between tenancies, and then optionally applying a named MlxConfigProfile if one is configured for the interface. This reset-then-apply sequence is the recommended guidance from NBU.

High level changes include:

1. New mlxconfig_profile column on dpa_interfaces -- an optional profile name that maps into carbide-api's mlxconfig_profiles config map.
2. Reworking the OpCode::ApplyProfile variant to carry an Option<SerializableProfile> (mirroring how ApplyFirmware carries a FirmwareFlasherProfile).
3. carbide-api-side config lookup + serialization in build_apply_profile_command.
4. scout-side implementation in mlx_device::apply_profile().
5. Corresponding State Controller updates to handle both the reset-only and reset + profile sync workflows.
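
As a rough illustration of items 2 and 4, the sketch below shows one way the reworked variant and scout's reset-then-apply handling could fit together. Only `OpCode::ApplyProfile`, its `serialized_profile` field, and the ordering (always reset, then optionally apply) come from this description. The fields of `SerializableProfile`, the `MlxDevice` trait, `RecordingDevice`, and `handle_opcode` are assumptions invented for the example, not the actual code in this PR.

```rust
// Hypothetical stand-in for the real SerializableProfile; the fields are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct SerializableProfile {
    pub name: String,
    pub settings: Vec<(String, String)>,
}

// Only the ApplyProfile variant is sketched here.
#[derive(Debug, Clone, PartialEq)]
pub enum OpCode {
    ApplyProfile {
        serialized_profile: Option<SerializableProfile>,
    },
}

// Hypothetical device abstraction so the flow can be exercised without hardware.
pub trait MlxDevice {
    fn reset_to_defaults(&mut self) -> Result<(), String>;
    fn set_param(&mut self, key: &str, value: &str) -> Result<(), String>;
}

// Always reset first; apply the profile's settings only if one was provided.
pub fn apply_profile(
    dev: &mut dyn MlxDevice,
    profile: Option<&SerializableProfile>,
) -> Result<(), String> {
    dev.reset_to_defaults()?;
    if let Some(p) = profile {
        for (key, value) in &p.settings {
            dev.set_param(key, value)?;
        }
    }
    Ok(())
}

// Dispatches an opcode to the matching device operation.
pub fn handle_opcode(dev: &mut dyn MlxDevice, op: &OpCode) -> Result<(), String> {
    match op {
        OpCode::ApplyProfile { serialized_profile } => {
            apply_profile(dev, serialized_profile.as_ref())
        }
    }
}

// Test double that records the order of operations instead of touching a device.
#[derive(Default)]
pub struct RecordingDevice {
    pub log: Vec<String>,
}

impl MlxDevice for RecordingDevice {
    fn reset_to_defaults(&mut self) -> Result<(), String> {
        self.log.push("reset".to_string());
        Ok(())
    }

    fn set_param(&mut self, key: &str, value: &str) -> Result<(), String> {
        self.log.push(format!("set {key}={value}"));
        Ok(())
    }
}
```

With a `RecordingDevice`, sending `ApplyProfile { serialized_profile: None }` records only a reset, while a populated profile records the reset followed by each setting.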

In this workflow:

1. We check the interface's mlxconfig_profile field.
2. If None, we send ApplyProfile { serialized_profile: None }, and scout will reset to factory defaults (to prepare for the next tenant) and report success.
3. If set, we look it up in the runtime_config.mlxconfig_profiles map, serialize it via SerializableProfile::from_profile(), and send it down to scout.
4. If the profile name is set, but can't be found in config, we return an error rather than sending None (which would silently reset without applying any intended profile(s)).
5. scout always resets mlxconfig to factory defaults first, then applies the profile if one was provided, and reports back via MlxObservation.
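
The carbide-api side of these steps can be sketched roughly as follows. This is a simplified model, not the body of `build_apply_profile_command`: the `resolve_profile` helper, the `ProfileLookupError` type, the struct fields, and the two-argument signature given to `from_profile` are all assumptions made for illustration. What it aims to show is the three-way outcome: no profile configured, profile found, and profile named but missing from config.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins; the real types are defined elsewhere in the repo.
#[derive(Debug, Clone, PartialEq)]
pub struct MlxConfigProfile {
    pub settings: Vec<(String, String)>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct SerializableProfile {
    pub name: String,
    pub settings: Vec<(String, String)>,
}

impl SerializableProfile {
    // Assumed signature; the real from_profile() may take different arguments.
    pub fn from_profile(name: &str, profile: &MlxConfigProfile) -> Self {
        SerializableProfile {
            name: name.to_string(),
            settings: profile.settings.clone(),
        }
    }
}

// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum ProfileLookupError {
    NotFound(String),
}

// Resolves the interface's optional profile name against the config map.
// - No name configured: Ok(None), meaning "reset only".
// - Name configured and found: Ok(Some(serialized profile)).
// - Name configured but missing: Err, so we never silently fall back to reset-only.
pub fn resolve_profile(
    profile_name: Option<&str>,
    profiles: &HashMap<String, MlxConfigProfile>,
) -> Result<Option<SerializableProfile>, ProfileLookupError> {
    match profile_name {
        None => Ok(None),
        Some(name) => profiles
            .get(name)
            .map(|p| Some(SerializableProfile::from_profile(name, p)))
            .ok_or_else(|| ProfileLookupError::NotFound(name.to_string())),
    }
}
```

Returning `Result<Option<_>, _>` keeps "nothing configured" and "configured but missing" as distinct outcomes, which is the distinction step 4 depends on.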

The ApplyProfile state handler was also broken out into its own handle_apply_profile() function, making it independently testable without needing the full async state controller scaffolding. I need to go back and do this in a few other pre-existing places.
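
To illustrate why pulling the handler out helps with testing, here is a deliberately small sketch of a synchronous decision function. The real signature of `handle_apply_profile()` is not shown in this description, so `next_action`, the `Action` enum, and the treatment of `Some(false)` and `None` are assumptions. The one grounded detail is that `profile_synced=Some(true)` signals that it is safe to move to the next state.

```rust
// Hypothetical outcome type for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    // Ask scout to (re)run the reset + optional profile apply.
    SendApplyProfile,
    // The workflow completed; move on to the next state.
    Transition,
}

// Pure function: no async runtime, database, or controller scaffolding needed.
// Assumption: anything other than Some(true) means the command should be sent.
pub fn next_action(profile_synced: Option<bool>) -> Action {
    match profile_synced {
        Some(true) => Action::Transition,
        Some(false) | None => Action::SendApplyProfile,
    }
}
```

Because the function takes plain values and returns a plain value, each branch can be covered by an ordinary unit test.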

Existing tests updated as needed, and new tests introduced.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

Type of Change

Add - New feature or capability
Change - Changes in existing functionality
Fix - Bug fixes
Remove - Removed features or deprecated functionality
Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

This PR contains breaking changes

Testing

Unit tests added/updated
Integration tests added/updated
Manual testing performed
No testing required (docs, internal refactor, etc.)

Additional Notes


wminckler

Mostly nits/thoughts.

If the version thing doesn't apply, then it's good.

wminckler · 2026-02-25T12:58:10Z

crates/api-db/migrations/20260224045500_dpa_mlxconfig_profile.sql

+-- device during the ApplyProfile state. A null/empty value
+-- means just reset to the card defaults (and don't apply
+-- anything else beyond that).
+ALTER TABLE dpa_interfaces ADD COLUMN IF NOT EXISTS mlxconfig_profile TEXT;


Should the name of a profile be unbounded (TEXT vs. VARCHAR)? In reality, it probably doesn't matter...

Do we need this? I don't see any changes in api-db/src to update this. Also, we already save this in the card_state field in the DB.

wminckler · 2026-02-25T13:07:10Z

crates/api/src/handlers/dpa.rs

+                %machine_id, %pci_name, %profile_name,
+                "mlxconfig_profile not found in config"
+            );
+            CarbideError::GenericErrorFromReport(eyre!(


I don't understand when we use "GenericError" vs. making a specific error enum. Add to that, why do we use eyre and CarbideError together? Not really an issue for this PR, but having a generic error in an error enum seems contradictory.

since this isn't actually returned from the API, it seems legit ;)

wminckler · 2026-02-25T13:18:04Z

crates/api/src/state_controller/dpa_interface/handler.rs

+///
+/// In both cases, profile_synced=Some(true) is the signal that
+/// the workflow completed successfully, and it's safe to transition
+/// to the next state.


typically we use a version to identify that the correct thing was sync'd.

comment says that a profile_name is sent back from scout, but that's not checked, so the api doesn't actually know what's sync'd (and as above, we normally use a version).

can a profile config change? then you really should have a version and not rely on the name
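
One way to picture the version-based check suggested here is sketched below. None of these names exist in the PR: `ProfileVersion`, its `version` field, and `is_synced` are hypothetical, and the sketch assumes scout would report back the name and version it actually applied.

```rust
// Hypothetical identifier for "which profile contents were applied".
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ProfileVersion {
    pub name: String,
    // Assumed to change whenever the profile's contents change.
    pub version: u64,
}

// The API considers the device synced only if scout reports exactly what was expected.
// None on both sides models the reset-only case.
pub fn is_synced(expected: Option<&ProfileVersion>, reported: Option<&ProfileVersion>) -> bool {
    expected == reported
}
```

Under this scheme, a profile that keeps its name but changes contents would carry a new version, so a stale report from scout would no longer count as synced.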

chet requested a review from a team as a code owner February 24, 2026 05:41

wminckler approved these changes Feb 25, 2026

chet wants to merge 1 commit into NVIDIA:main from chet:supernic_integrate_apply_nvconfig_profiles
