-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Change Proposal] Add package_policy_upgrade_strategy
field to support Fleet upgrade behavior
#244
Comments
Really nice writeup Kyle, especially the table of behaviors - thanks! I think the enum values here make sense. WRT to the field naming, I think we have an opportunity here to introduce fields that affect various aspects of Fleet package management, not just package policy upgrades. For example, I know that we are making efforts to support Stack-aligned packages by bundling them in Kibana, but maybe there will be non-Stack package that would benefit from being kept to date automatically? (to clarify: not the policies, but the actual packages themselves) Another recent use case is the notion of "preserve assets" on upgrade that was needed for ML packages: https://github.com/elastic/kibana/issues/115035 With that in mind, we could introduce a nested structure like this: fleet_management: {
package_policy_upgrade_strategy?: 'always_upgrade' | 'upgrade',
package_upgrade_strategy?: <TBD, probably similar to above>,
package_assets_strategy?: 'preserve'
} For the What do you think? |
100% agree here - it'd be great to move anything we can out of Fleet's hardcoded package lists and into the package manifests for those packages instead. So, rather than maintaining a list of I wonder if it'd be possible to capture things like our
I'm not sure about our default packages, since we need to install those by default during setup. It seems to make more sense to have Fleet maintain that list of packages rather than combing the whole registry for default packages and installing whatever we find. Sort of like a |
Great point, let's use a structure that we can easily extend with related options in the future. One question I have is about timing. I'd like consider other upgrade strategy options separately from the work planned for 8.0 in order to avoid scope creep. What I'm not sure about is what the expectations are for Kibana to support changes added to the package spec. If we add these additional options now, is it acceptable for Kibana to not support them until later? Should we instead just proceed with adding the option we need now, but in a more future-proof structure like the one @jen-huang proposed above?
+1 on getting rid of our special case hard-coded constants in Kibana. This makes testing this behavior in a generic way difficult. Specifically for the
Since this option only really applies after the base package has been upgraded, IMO we should just respect whatever the newly installed version of the package specifies. If it's |
Nice idea, Kyle! Let me challenge this: I'm unfamiliar with the hardcoded actions, but I'm wondering if the package can describe with some scripting how should the update be performed? I'm afraid that even with 10 different enum values (upgrade procedure types) we'll need to introduce this 11th special case. |
Thanks for the proposal, it looks good, but after thinking a bit about the feature I am wondering if we should do it. Having options to decide what to upgrade when upgrading a package is going to lead to situations where users have mixed versions of packages and policies. If we think on having even more options, such as Imagine for example these situations:
So I would propose to try to walk in the direction of having a single strategy for all packages. And this strategy would probably be
Having more flexibility not always provides more value. In this case I think that by taking the decision of only supporting |
Thanks for your input, @mtojek - much appreciated.
I think this is a very valid concern, and it does feel like we may have an ever-growing list of "special cases" to capture various classifications of packages and their behavior. For reference, these hardcoded lists live in the Fleet codebase here: We have 3 general classifications of packages hard-coded here today:
The decision that we're faced with moving forward is whether to add an additional hard-coded list here to capture "Packages whose policies will be automatically upgraded during setup", and so that's where this change request has come from. In terms of whether we could capture these specific traits/behaviors via some kind of scripting, I think this would be possible but I'd love to see examples if we're doing this elsewhere. So I have a few follow-up questions:
@jsoriano thanks for your input as well, I'll try to address your concerns below.
I don't think it's realistic for us to plan around an assumption that users will always strive to run the most up-to-date versions of their packages or to pursue policy upgrades for potentially many in-flight policies. From a product perspective, I think it's respectful to allow users the choice to embrace the "if it's not broken, don't fix it" mentality with their agents and policies. Perhaps it's worth crystalizing that concept in terms of support, though. e.g. if a user is running a version of a package greater than X releases old, support will always recommend they pursue an upgrade regardless of their issue. Maybe making X = 1 release here is also an option, so we never commit to supporting an out-of-date package at all? We already have plenty of cases today where users have mixed package versions and policy versions since until 7.15 the only way to upgrade an existing policy was to completely recreate it once a newer version of the package was installed.
Fleet only recently started removing package assets when a new version of a package is installed in 7.16: elastic/kibana#112644. We still support maintaining assets for versions of a package that an existing policy depends on, however. So, this is already the case today in that we maintain assets for multiple versions of a package in the out-of-sync package version case described above.
This is correct, and we do have some existing logic around rolling back package versions in case of a failure in the Kibana upgrade process. I think in failure cases like this, the potential package spec flags around upgrade behavior would not be honored, and we'd always return packages/policies/assets to their previous state prior to the upgrade. In cases of an intentional downgrade, I don't think we plan to support a first-class concept of downgrading through Fleet in this way, but it's still worth pointing out that we have additional logical considerations to make if that changes.
Upgrading all packages and their policies by default feels very heavy-handed and seems to ignore breaking changes between package versions. In cases where a variable/input is restructured or moved, we'll have situations where we're potentially throwing away now-deprecated configuration from the previous version of a package. Today, we do have some fairly nuanced conflict detection logic that will fail the upgrade in many cases like this, but then we wind up in a state that you've observed above where we have divergent package/policy versions because the automatic upgrade process failed. In an ideal world, it'd be great to guarantee that all packages and policies in use by Fleet/Agent are running the latest version of their respective packages, but it seems to me that would never be possible so long as breaking changes between package versions are possible, as we'll always have potential conflicts/failures in the upgrade process. Perhaps we could implement semver-based rules around this So, in short, I hesitate to accept enforcing the |
Thank you for the reply!
No, not yet. The only scripts I remind are part of ingesting pipelines. It was just an idea to discuss another strategy, maybe more flexible.
... but yes, I was thinking specifically about the painless scripting :) |
@kpollich thanks for your thoughtful explanation. I completely agree that we shouldn't expect users to have the most updated version of every package, when I was referring to My concern is not so much about always upgrading or not, it is that we may be giving too many options to package developers and users and this may lead to a confusing product. Even in this thread where we are all involved in the project we have felt the need to clarify at least twice when we were referring to automatic upgrades of packages or of policies. If we make a decision on the strategies to use, these blurry areas disappear. Every toggle we add is potentially going to require explanations to users, customers and developers, apart of potentially complicating development itself, so we have to be sure that we need it. Focusing on a single option will make the product easier to understand and use, and will help us to focus on providing a better experience for this option. For example if we decide to go in the Regarding assets, similar thing. I guess that the ideal is to keep all the assets that are used by something, and "garbage collect" them when they belong to an old version and are not used by anything else. Maybe we are not in a position to know where all assets are used, but till we reach this situation, if possible, I would prefer not to add toggles that we have to include in the spec, in APIs and UIs and long-term support. I think that taking this kind of decisions gives more value to the product, even when they somehow limit it. |
Thanks for clarifying @jsoriano - that makes a lot of sense. It would be beneficial, certainly, to expand the concept of "auto upgrading" to include both the base package and its policies. This would definitely lower the complexity of the product while still providing value to users around certain "managed" packages having their upgrade processes fully managed by Fleet. I think from a UI perspective we'd want to make some changes that make it clear to users that when they're installing a package with We also allow users to configure a There's also a decision point to make about whether we allow users to override this setting, or if it's an immutable value from the package spec. I think based on this thread so far we'd lean towards disallowing user override for packages w/ this flag set. |
A conversation with @ruflin about this helped me to see policies more like configuration. Then in principle it made more sense to me to don't update them automatically. Now, from the POV of policies seen as configuration, I wonder if policy upgrades could be like this:
Though, looking at the changes in templates, they don't happen frequently, but in packages where they happen more frequently, it uses to be because the config includes processors or some kind of advanced mapping. In these cases the developer probably wants the policy to be always upgraded. If not, pipelines or dashboards in the new version may not work correctly because they may be missing some new field. These packages would benefit from Then I wonder if instead we should make a clear separation in packages between configuration and processors/mappings, then Fleet could still see if the configuration part is equivalent, and decide to upgrade automatically if it is. And the processors/mappings part could be always upgraded, so pipelines and dashboards have the fields they expect. Then I see that we already have a clear separation between actual user configuration and the policy. The user configuration is not the policy, but the variables. Then going back to the beginning of the comment, the strategy could be always like this:
Managed packages could be more strict on the changes in variables, and do them always in backwards compatible ways, so their policies are always upgraded accordingly to the strategy above. Then Fleet doesn't need any special logic for the upgrade of these packages, nor the package spec needs to include any field to select the upgrade strategy. If we limit the possibility of breaking changes for policies to the variables used, we could consider adding deprecation fields for variables to provide a better experience or even automated migration paths, in line to what we are discussing for package deprecation in #227 There may be also breaking changes caused by the use of new settings in policies in old agents. To help on this we could add conditional logic in the templates based on the agent version (perhaps this is already possible?), or add support for the I see this relies a lot on trusting on the lack of breaking changes, but I think this is a desiderable aim given the make it minor initiatives, and we could also help package developers with tooling (for example with elastic/elastic-package#579). Does it make sense? |
👍
Please shed more light on this. I can't see the reason why the above rule can't be enabled except for safety (don't know future effects). |
Look at this as configuration files included in packages in linux distributions. Usually they are not upgraded automatically when the packages are upgraded, unless the user didn't modify them. This would be still coherent with the above rule, but here the check could be a bit different, the "config" can be upgraded if the user has only set variables that are compatible with the new version. |
Well, following this logic would mean that we can always update (try to update) the package, but never adjust configuration. Linux distributions may not reflect this situation well as they also contain libraries or programs. In this case there are only policies. Anyway, thanks for explaining. |
We can always update the policies during the upgrade of a package, and I think this is fine. If you want to adjust configuration you should be able to do it later in most of the cases. The configuration here would be the variables, I wouldn't consider the rest of the policies as configuration, because they can include advanced mappings, local processing and other things that may be needed later in ingest pipelines or dashboards. |
For smaller deployments, I think this makes sense. We could change the flow here to detect those breaking changes (aka 'conflicts') before doing the base package upgrade itself and then just upgrade everything at once. I think this would greatly simplify the UX and matrix of situations we need to support. One motivation I can imagine is allowing admins to test out a package upgrade on a "test" or "canary" agent policy before rolling out to the wider fleet. We in fact do suggest this to users in our documentation on integration upgrades: https://www.elastic.co/guide/en/fleet/current/integrations.html#update-integration. I think this may be a valid use case, but would like to learn more from @mostlyjason on this. Reasons I can imagine this is important for some users:
That said, I wonder if a package rollback would be a better alternative solution to the problem than a more complex upgrade UX. In other words, did we optimize for a non-typical use case and make the typical use case more complicated? |
Regarding test/canaries, I agree that this is a valid use case, but I am not sure if we have a good story now for this. When you upgrade a package now, all its assets are upgraded, so even if there are agents using older policies, they will be using the new ingest pipelines, and data will be visualized in the new dashboards. There can be only a version of a package installed now, and there is no way to have a set of agents completely using a different version of a package. While this happens, while agents ingest data that is going to be managed always by assets of the only installed version, I think that we should do our best to have the policies aligned with this version too. To avoid misunderstandings and tricky support and development situations. |
++ to what Josh said about testing and canary releases. In large enterprises with tens of thousands of agents or clusters with critical infrastructure, they need to test before rolling changes out to production. They can't afford to have an integration upgrade take down their servers by increasing load, cause security or compliance problems by shipping sensitive data, or break data ingestion. Their solution is to test it on a limited set of agents, then promote the change to a larger set of agents to limit the blast radius of problems. In some organizations, a test deployment and a production deployment have to be approved by separate people, and there can be compliance requirements for auditing these steps. I realize we don't currently have a way to test ES assets like ingest pipelines but they are supposed to be compatible with earlier versions of the integration. We can somewhat decouple updating ES assets from updating the agent policies. At least we can lower the blast radius to the cluster and limit breaking changes on endpoints. I like the idea of simplifying the UX by auto-upgrading integrations and integration policies at the same time for smaller or less critical clusters. However, I'd hate to do it at the expense of our enterprise and mission-critical clusters by taking away their control and ability to manage risk. The It seems like the discussion of defaulting to
You mean we'd provide rollback instead of the ability to test upgrades before deploying them? If so, that doesn't provide a solution for use cases like preventing sensitive data from being ingested, limiting load changes, etc.
We may end up removing the need for Default and Unremovable as part of this issue elastic/kibana#108456. It might be better to remove them from Kibana instead of adding them to the package spec. |
On the other hand, you have just described a new feature. Apart from the upgrade strategy, the fleet could provide a canary mode to try new deployment (new policy) on a small set of hosts and then (if succeeded), scale the deployment to the rest. It looks like it will become a control panel for DevOps, so maybe introduce more options there? Deployment calendars (go/no-go days, no deployment Fridays), office hours, time windows, canary percentage, etc. |
Thanks for your comments, I have a better understanding on the kind of flexibility we want to give to users, I see that there are still limitations, as the fact that other resources included in packages such as ingest pipelines are always upgraded, but this could be evolved in the future if needed. Going back to my original concern about giving too many options, I still wonder if we want to give this flexibility to package developers. It seems that we prefer This seems to be also coherent with current experience (for most packages?), this is what I see by default when upgrading the apache and system packages in 7.16. Given that, if we allow package developers to use I notice that the Maybe another approach to this decision would be to answer the question: Why would a package developer opt for
++ |
Yes these will be stack-aligned packages that are shipped with Kibana and upgraded with Kibana. There are some dependencies between these package versions and the code in Kibana. They also tend to be deployed to agents running on centralized servers as opposed to endpoints, so the impact of updating these is smaller. The teams owning these packages believe the convenience of auto upgrades outweighs the need to manage deployment risk. Organizationally, I think we need to carefully review which packages make use of this feature since it limits our end user's control. I don't have a preference for where that happens, either in Kibana or the package definition. Do you know if there a standard place to document these settings where we could put a warning for future package developers? |
If this is the case, could this logic be implemented as part of this? Packages shipped and upgraded with Kibana have their policies always upgraded. Packages installed from the registry have their policies upgraded by default, but users can disable it.
We cannot rely only on reviews, because we want to open package development to more and more teams, eventually also from outside of Elastic. If we add this field to the spec, we could document this in the description, discouraging its use, but if it is there, any package developer is going to be able to use it even if we discourage it. |
Those are good points @jsoriano and I see how adding it to the spec could increase the number of packages developers who misuse it.
What do you think of this idea @kpollich? I'm unsure why System is treated differently. It would save some clicks if we think most of the time users will upgrade the policies as well. Can you think of any integration that should not have upgraded policies by default (with the option to opt out)? |
Apologies for the delay here. Been a busy week or so 😅 .
In general, there aren't any good examples of specific integrations that I can say we'd prefer not to upgrade policies automatically for. We'd always prefer to save the user some clicks and shorten the path they need to take to get their integrations and agents up to date. However, I do think it'd be wise to not opt users into this behavior for experimental/beta integrations. We've had experimental packages like AWS go through substantial reimaginings during their development, so I think it might be a good idea to include that additional semver type check when we determine whether to opt-in to auto-upgrades or not. Another concern I have is potentially confusing UX around this option. It feels like it might be difficult to convey to users why some of their packages opt-in to this behavior and others don't. I think we need to take stock of our messaging on the integration settings page and determine the best way to convey these pieces of logic to the user. We've already sort of run up against this w/ packages like APM where we require auto-upgrades and had to introduce some additional wording to clarify the behavior: |
@kpollich Do we still have a need for this? Should we consider closing this for now and revisiting later if it comes up again? |
@joshdover Thanks for the ping. We don't need this for now so I'll close for the time being. |
Ref elastic/kibana#111858
In 7.15 and 7.16, Fleet introduced behavior that allows our users to more easily manage upgrades for the Elastic Agent integrations and associated policies. The policy upgrade process, in particular, is now much more streamlined, and no longer requires policies to be deleted and recreated against a new version of an integration.
As part of the Fleet team's continuing effort to improve usability and ergonomics around integration upgrades, we'll be moving Fleet's various setup operations to Kibana's boot lifecycle in elastic/kibana#111858. As part of this change, we'd like to solidify our implementation for integration/policy upgrades and allow packages to control whether policies should be updated during this setup process.
Currently, Fleet maintains a hardcoded list of packages with varying behaviors around upgrades. What we'd like to do is push some of this hardcoded logic out of Fleet and capture the behavior as a value in the package spec, instead.
What we're proposing here is an optional
package_policy_upgrade_strategy
enum value in the package spec that allows Fleet to conditionally perform policy upgrades during boot, and additionally controls some UI presentation logic. See the matrix below:always_upgrade
synthetics
,apm
upgrade
true
system
null/undefined
false
.The Fleet UI setting in question appears on the integration settings screen:
The actual names of the enum values are not final, and could likely be made more clear (would love some input here) based on the intended behavior for each. I do think that an enum is likely the best choice here, rather than a
boolean
value so we can remain flexible for future cases in which more specific upgrade logic might be required.The text was updated successfully, but these errors were encountered: