Skip to content

feat: integrate mlxconfig profile management into the dpa workflow#371

Open
chet wants to merge 1 commit intoNVIDIA:mainfrom
chet:supernic_integrate_apply_nvconfig_profiles
Open

feat: integrate mlxconfig profile management into the dpa workflow#371
chet wants to merge 1 commit intoNVIDIA:mainfrom
chet:supernic_integrate_apply_nvconfig_profiles

Conversation

@chet
Copy link
Contributor

@chet chet commented Feb 24, 2026

Description

This builds on the firmware management work (and ApplyFirmware) to additionally implement ApplyProfile within the DPA provisioning workflow (it has been stubbed out with placeholders).

The ApplyProfile state now handles mlxconfig profile management -- resetting the device's mlxconfig parameters to factory defaults between tenancies, and then optionally applying a named MlxConfigProfile if one is configured for the interface. This behavior of reset + apply updated values is the recommended guidance from NBU.

High level changes include:

  1. New mlxconfig_profile column on dpa_interfaces -- an optional profile name that maps into carbide-api's mlxconfig_profiles config map.
  2. Reworking the OpCode::ApplyProfile variant to carry an Option<SerializableProfile> (mirroring how ApplyFirmware carries a FirmwareFlasherProfile).
  3. carbide-api-side config lookup + serialization in build_apply_profile_command.
  4. scout-side implementation in mlx_device::apply_profile().
  5. Corresponding State Controller updates to handle both the reset-only and reset + profile sync workflows.

In this workflow:

  1. We check the interface's mlxconfig_profile field.
  2. If None, we send ApplyProfile { serialized_profile: None }, and scout will reset to factory defaults (to prepare for the next tenant) and report success.
  3. If set, we look it up in the runtime_config.mlxconfig_profiles map, serialize it via SerializableProfile::from_profile(), and send it down to scout.
  4. If the profile name is set, but can't be found in config, we return an error rather than sending None (which would silently reset without applying any intended profile(s)).
  5. scout always resets mlxconfig to factory defaults first, then applies the profile if one was provided, and reports back via MlxObservation.

The ApplyProfile state handler was also broken out into its own handle_apply_profile() function, making it independently testable without needing the full async state controller scaffolding. I need to go back and do this in a few other pre-existing places.

Existing tests updated as needed, and new tests introduced.

Signed-off-by: Chet Nichols III chetn@nvidia.com

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

This builds on the [firmware management work](NVIDIA#323) (and `ApplyFirmware`) to additionally implement `ApplyProfile` within the DPA provisioning workflow (it has been stubbed out with placeholders).

The `ApplyProfile` state now handles `mlxconfig` profile management -- resetting the device's `mlxconfig` parameters to factory defaults between tenancies, and then optionally applying a named `MlxConfigProfile` if one is configured for the interface. This behavior of reset + apply updated values is the recommended guidance from NBU.

High level changes include:
1. New `mlxconfig_profile` column on `dpa_interfaces` -- an optional profile name that maps into `carbide-api`'s `mlxconfig_profiles` config map.
2. Reworking the `OpCode::ApplyProfile` variant to carry an `Option<SerializableProfile>` (mirroring how `ApplyFirmware` carries a `FirmwareFlasherProfile`).
3. `carbide-api`-side config lookup + serialization in `build_apply_profile_command`.
4. `scout`-side implementation in `mlx_device::apply_profile()`.
5. Corresponding State Controller updates to handle both the reset-only and reset + profile sync workflows.

In this workflow:
1. We check the interface's `mlxconfig_profile` field.
2. If `None`, we send `ApplyProfile { serialized_profile: None }`, and `scout` will reset to factory defaults (to prepare for the next tenant) and report success.
3. If set, we look it up in the `runtime_config.mlxconfig_profiles` map, serialize it via `SerializableProfile::from_profile()`, and send it down to `scout`.
4. If the profile name is set, but can't be found in config, we return an error rather than sending `None` (which would silently reset without applying any intended profile(s)).
5. `scout` always resets mlxconfig to factory defaults first, then applies the profile if one was provided, and reports back via `MlxObservation`.

The `ApplyProfile` state handler was also broken out into its own `handle_apply_profile()` function, making it independently testable without needing the full async state controller scaffolding. I need to go back and do this in a few other pre-existing places.

Existing tests updated as needed, and new tests introduced.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
@chet chet requested a review from a team as a code owner February 24, 2026 05:41
Copy link
Contributor

@wminckler wminckler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly nits/thoughts.

If the version thing doesn't apply, then its good

-- device during the ApplyProfile state. A null/empty value
-- means just reset to the card defaults (and don't apply
-- anything else beyond that).
ALTER TABLE dpa_interfaces ADD COLUMN IF NOT EXISTS mlxconfig_profile TEXT;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the name of a profile be unbounded? (TEXT vs VARCHAR). In reality, probably doesn't matter...l

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this? I don't see any changes in api-db/src to update this. Also, we already save this in the card_state field in the DB.

%machine_id, %pci_name, %profile_name,
"mlxconfig_profile not found in config"
);
CarbideError::GenericErrorFromReport(eyre!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand when we use "GenericError" vs making a specific error enum. Add to that, why do we use eyre and CarbideError together. Not really an issue for this PR, but having a generic error in an error enum seems contradictory

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this isn't actually returned from the API, it seems legit ;)

///
/// In both cases, profile_synced=Some(true) is the signal that
/// the workflow completed successfully, and it's safe to transition
/// to the next state.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • typically we use a version to identify that the correct thing was sync'd.
  • comment says that a profile_name is sent back from scout, but that's not checked, so the api doesn't actually know what's sync'd (and as above, we normally use a version).
  • can a profile config change? then you really should have a version and not rely on the name

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants