DMS - A proposal and implementation of a design change to address many open issues #292

Open · wants to merge 4 commits into master from nextgen

Conversation

@schaefi (Collaborator) commented Sep 23, 2024

Hi, I was looking at the DMS: the open bugs, the ongoing reports about non-working migrations, the large number of limitations in the current design, and the lack of knowledge on the support side to allow proper debugging when issues occur. All of this drove me to suggest a design change and an implementation of how I see a maintainable future for the DMS. If you would like to review the changes, please do so based on the list of commits; they also fix some bugs in the current implementation that could be merged independently of the design change.

I marked the PR with the workshop label as I'd like to have a conversation/demo on the topic to gather feedback and to find out whether the proposed change is considered an acceptable solution. There are presentation slides that provide a high-level description of the proposal, attached here:

Distribution Migration System

The slide deck also shows a list of defects that I believe will be solved if we go that route.

Long story short: if you would like to test an upgrade from SLE12 to SLE15-SP5, which is not possible with the current DMS, you can do so with the changed implementation as follows:

  1. Start an SLE12 (x86_64) OnDemand instance in some cloud and SSH into it. From there:

sudo zypper ar https://download.opensuse.org/repositories/home:/marcus.schaefer:/dms/SLE_12_SP5 Migration
sudo zypper install suse-migration
sudo migrate --reboot

... a few notes:

  • For a final release, the packages currently installed from my home: project should be made available in the SLE12 Public Cloud module, so that adding the extra repository is no longer needed. We already have this concept in place.
  • You can continue to use your instance while it upgrades.
  • The new migrate tool is the user-facing entry point. There is no reboot/kexec during the process, except for the final reboot after the upgrade, which activates the new kernel and the instance registration. This is controlled via the --reboot flag so that users can decide on the time of the reboot themselves. However, I strongly recommend rebooting immediately, which is why the example above uses --reboot (a sketch of the deferred variant follows this list).
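
For illustration only, a minimal sketch of the deferred-reboot variant mentioned in the last item; the behavior of migrate without --reboot is inferred from the description above and not verified against the implementation:

# Run the migration but defer the reboot (assumed behavior when --reboot is omitted):
sudo migrate
# ... keep using the instance while the upgrade runs, then activate the new
# kernel and complete the instance registration at a time of your choosing:
sudo reboot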

I believe there will be many questions. Let's discuss this during the workshop with the implementation at hand.

Tasks not done yet:

  • Update documentation
  • Plan package delivery

Thanks

@schaefi self-assigned this Sep 23, 2024
@rjschwei (Member)

As stated when the idea about the containerization came up, a container needs a place to run, i.e. it needs a running system. If that container runs on the system to be upgraded, that system is no longer offline, and online migration across major releases is not supported. As such, a conversation about how this is expected to work is definitely necessary.

@schaefi (Collaborator, Author) commented Sep 23, 2024

As stated when the idea about the containerization came up, a container needs a place to run, i.e. it needs a running system. If that container runs on the system to be upgraded, that system is no longer offline, and online migration across major releases is not supported. As such, a conversation about how this is expected to work is definitely necessary.

The simplest way to see how (and that) this works is to try the three example commands above on an SLE12 system. I have tested both the zypper migration plugin and the zypper dup operation.

If that container runs on the system to be upgraded, that system is no longer offline, and online migration across major releases is not supported.

I believe the word offline is misinterpreted in this context. We both know that SUSE explicitly documents that running a major release upgrade with any zypper operation, whether through the migration plugin or a direct zypper dup call, is not supported on the SAME system. The point here is the SAME system. By moving the process into a container of the migration target you are no longer on the same system and therefore fulfill the requirement named offline.
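
To make the idea concrete, here is a rough sketch of what "moving the process into a container of the migration target" could look like; the image reference and the exact podman options are my assumptions for illustration, not the actual DMS implementation:

# Hypothetical sketch only: run a SLE15 migration container on the live SLE12 host,
# sharing the host root filesystem and host network as described above.
sudo podman run --privileged --network host \
    --volume /:/system-root \
    registry.example.org/suse/sle15-migration:latest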

Of course I was concerned about the following aspects:

  • Is a shared mount consistent across the entire upgrade process?

    The answer is yes. The shared mount of / into the container as /system-root is an overlay mount in RAM and is just as consistent as the mount setup we have in the live ISO.

  • Is a host network connection consistent across the entire upgrade process?

    The answer is yes. Any network interface is configured in the kernel and holds its data and stack in RAM until it gets re-configured. The installation of packages is not expected to re-configure the network; if it did, it would be broken no matter whether you run the process in a container or on a live ISO. In all my tests the network, including the bridge setup into the container, stayed persistent and active. As I said, the instance stays responsive during the whole process, a big plus in my eyes.

  • Is the functionality of the running container instance harmed by the installation of a new kernel/modules?

    The answer here is no. What you need from the kernel is an active network (network drivers and bridge) and an overlay mount. All of this is either provided as modules already in RAM or compiled into the kernel, and all of it is present before you start the migration. So there is no feature on the running system whose update would kill the container instance. The container storage on the SLE12 system lives in /var/lib/containers/storage/ and is wiped at the end of a successful migration to get rid of any artifacts (see the sketch after this list). The data at this storage location is not touched by a zypper-based upgrade, not even when podman or any other container-related package gets upgraded. I ran a very intrusive test for this case by deleting everything from system-root except /var/lib/containers/storage from inside a running container instance attached to a shell. The deletion went through and killed the complete SLE12 system, but the container instance as well as my shell stayed alive ;)
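
As a follow-up to the last item, a hedged sketch of the cleanup that wipes the container storage after a successful migration; the exact commands are assumptions based on the description above, not taken from the code:

# Assumed post-migration cleanup (illustrative only):
sudo podman rm --all --force                 # remove the finished migration container
sudo rm -rf /var/lib/containers/storage/*    # wipe the container storage artifacts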

So I knew there would be tough questions like this :) but I believe I have done quite a few tests and wrapped my head around several cases. That doesn't mean I could not have forgotten something important, but that is exactly why I created this PR and also packaged everything up such that you can experiment as you see fit. It doesn't take much of your time, and you can do it in five minutes on AWS.

@rjschwei (Member)

I believe the word offline is misinterpreted in this context.

I do not understand how it can be misinterpreted; the documentation is very clear and explicit about this. From [1]:

"""
Upgrading offline implies that the operating system to be upgraded is not running (system down state).
"""

When the kernel of the system to be upgraded is running, the system is not in a "system down" state. I do not see where there is room for interpretation.

We both know that SUSE explicitly documents that running a major release upgrade with any zypper operation, whether through the migration plugin or a direct zypper dup call, is not supported on the SAME system. The point here is the SAME system. By moving the process into a container of the migration target you are no longer on the same system and therefore fulfill the requirement named offline.

No, the requirement for the migration across major distributions is "system down" for the system that is to be upgraded [1]. This requirement is also why the SUMa process for migration across major versions still depends on AutoYaST [2]; they also make sure that no part of the system to be upgraded is running. All migration tooling at SUSE is built around this "system down" requirement/assumption, except for Micro, where the update goes into a new snapshot and all of these shenanigans are thus avoided.

[1] https://documentation.suse.com/sles/15-SP5/html/SLES-all/cha-upgrade-paths.html#sec-upgrade-paths-supported
[2] https://documentation.suse.com/suma/5.0/en/suse-manager/common-workflows/workflow-inplace-sles-upgrade.html

Of course I was concerned about the following aspects:

  • Is a shared mount consistent across the entire upgrade process?
    The answer is yes. The shared mount of / into the container as /system-root is an overlay mount in RAM and is just as consistent as the mount setup we have in the live ISO.
  • Is a host network connection consistent across the entire upgrade process?
    The answer is yes. Any network interface is configured in the kernel and holds its data and stack in RAM until it gets re-configured. The installation of packages is not expected to re-configure the network; if it did, it would be broken no matter whether you run the process in a container or on a live ISO. In all my tests the network, including the bridge setup into the container, stayed persistent and active. As I said, the instance stays responsive during the whole process, a big plus in my eyes.
  • Is the functionality of the running container instance harmed by the installation of a new kernel/modules?
    The answer here is no. What you need from the kernel is an active network (network drivers and bridge) and an overlay mount. All of this is either provided as modules already in RAM or compiled into the kernel, and all of it is present before you start the migration. So there is no feature on the running system whose update would kill the container instance. The container storage on the SLE12 system lives in /var/lib/containers/storage/ and is wiped at the end of a successful migration to get rid of any artifacts. The data at this storage location is not touched by a zypper-based upgrade, not even when podman or any other container-related package gets upgraded. I ran a very intrusive test for this case by deleting everything from system-root except /var/lib/containers/storage from inside a running container instance attached to a shell. The deletion went through and killed the complete SLE12 system, but the container instance as well as my shell stayed alive ;)

So I knew there would be tough questions like this :) but I believe I have done quite a few tests and wrapped my head around several cases. That doesn't mean I could not have forgotten something important, but that is exactly why I created this PR and also packaged everything up such that you can experiment as you see fit. It doesn't take much of your time, and you can do it in five minutes on AWS.

It is not about whether or not it functions. It is about what support we get from the rest of the organization. If we create a system that does not meet the documented requirements, other teams, especially support, can state that the requirements for major distribution migration are not met. And they will be correct. That in turn means we sign up for supporting the whole stack instead of only the mechanism that drives the migration. This is something we cannot do, and we would risk leading the customer into an unsupported scenario.

That said, there are of course arguments to be made that what we have today might be too strict. That means we have to have conversations with technical leadership in BCL and include the PMs as well, although I think PM agreement on a scenario that promises to make major distribution migration simpler is a given.

@schaefi (Collaborator, Author) commented Sep 24, 2024

It is not about whether or not it functions.

To be honest, for me that's the part that counts, compared to a system that barely works right now and fails in the majority of customer use cases.

The proposed solution does not replace the existing live-ISO-based concept; it adds another, in my eyes better, alternative for many customer use cases. Shouldn't that be a driver to improve?

And I have a question: if a zypper dup or zypper migration from a set of official SUSE repositories completes successfully, is that result then still considered unsupported depending on the environment the call was made from? I doubt it. The documentation stating that a major upgrade can only be done in offline mode exists because that was the solution to the technical issues of the tools, not because SUSE doesn't want it. If a solution is available that overcomes this problem, why is it not worth looking at?

Offline
Upgrading offline implies that the operating system to be upgraded is not running (system down state). Instead, the installer for the target operating system is booted (for example, from the installation media, via network or via local boot loader), and performs the upgrade.

If you take this as given, our current live-ISO-based system, which runs the zypper migration plugin, is also unsupported, simply because the live ISO is not the installer for the target operating system (that is still YaST on SLE15), and what performs the upgrade in our case is not the YaST upgrade code but the migration plugin. So what we already do is outside of any support, but if the installation via zypper completes, we consider it good to go?

When the kernel of the system to be upgraded is running, the system is not in a "system down" state.

I don't see where the documentation mentions that.

Looking at SUMa: this is a SUSE product with its own setup and upgrade tools, its own documentation, and its own repository handling. The DMS was not created to upgrade products that have their own upgrade stack. The DMS can upgrade packages from repositories via zypper, period. That part will also work for any SLE12-based SUMa instance, but it will by far not be enough. This scenario comes with the same set of issues no matter whether you migrate with the live ISO or with a container.

@schaefi (Collaborator, Author) commented Sep 25, 2024

It is about what support we get from the rest of the organization. If we create a system that does not meet the documented requirements, other teams, especially support, can state that the requirements for major distribution migration are not met. And they will be correct. That in turn means we sign up for supporting the whole stack instead of only the mechanism that drives the migration. This is something we cannot do, and we would risk leading the customer into an unsupported scenario.

I understand that, and you are right. However, I do believe we are already in this situation and have partly become responsible for issues that exist in e.g. the migration plugin, for which support can say this is outside the documented procedure, unless there was another agreement with support in regard to public cloud upgrades?

From my perspective it's still possible to offer an "experimental" stack, based on newer technology now available, that customers could use at their own risk and which at least allows us to provide a potential solution to issues that customers have and that we have ignored so far. If the solution turns out to be useful, it can still be moved from an "experimental" stack into something more official, including the PMs and follow-up tasks such as documentation. To be honest, I believe it's very unlikely that PM will do anything new for SLE12 other than maintaining it for the rest of the LTSS lifetime. Thus I was under the impression that whatever we need to do to make customers happy will live in the SLE12 Public Cloud module, no?

I just find our current way of ignoring customer issues, with the excuse that the scope of the DMS is limited, not really user friendly. Actually, I think the scope within which the DMS functions today is really small. That's why I was looking for a solution that is technically better suited for cloud instances and also easier to handle and to debug. For me personally this is a lesson learned from my first DMS design.

I guess you have a different view on this topic, and that's fine. I just wanted to make you understand my motivation, and I hope something like this is still welcome in PCT, as I don't want to become the reason for a dispute. As you can see from the initial comment, the solution is there for testing in my home project and could easily stay there or be moved into an experimental stage. The relevant code changes live in a branch named nextgen and can also stay there. I just propose to merge the fixes for the real bugs I found, because they affect both DMS variants. I can turn that into extra PRs though.

All the rest is up for the workshop. Let's see if we make progress.

Thanks

@schaefi force-pushed the nextgen branch 5 times, most recently from 41a0efd to dbffb9e on October 19, 2024 at 18:55
@schaefi (Collaborator, Author) commented Oct 19, 2024

I'm going to split this PR into several PRs. Not all of them can be merged at this time, because the changes that reflect the container-based workflow first need a decision on where to publish the SLE12 container stack, which I built here:
https://build.opensuse.org/project/show/home:marcus.schaefer:dms

The idea is that it can live on Package Hub, which allows us to offer an alternative, not officially supported, online upgrade process for SLE12 systems.

Nevertheless, some of the proposed changes in this PR are generic and fix issues that I have wanted to address for a long time. Thus I am splitting them up now that the DMS is a PCT responsibility again.

@schaefi force-pushed the nextgen branch 2 times, most recently from 1e96235 to 96922b7 on October 21, 2024 at 09:37
The four commit messages in this PR:

  • The current suse-migration-services package also packaged the Python code for the services. This is not compliant with the Python packaging refactoring at SUSE, which allows multi-Python-version packages. For this reason the package was changed into a proper python-migration package, which builds suse-migration-services as a sub-package containing only the systemd service files. python-migration itself builds for Python 3.11 and therefore also moves the migration code to the latest Python version supported on SLE15 onwards.
  • Inherited from the live ISO description, but now available as an OCI container.
  • The new design of the DMS is no longer based on a reboot procedure to start the migration. Instead, we provide a tool that starts the migration container-based while we stay on the system to be migrated.
  • Do not change spec file names, to allow easier submission. Also make sure the package can build for SLE15 and SLE12. Consolidate the new migrate tool as a sub-package of suse-migration-services.