[Change Proposal] Add package_policy_upgrade_strategy field to support Fleet upgrade behavior #244

Closed
kpollich opened this issue Nov 3, 2021 · 26 comments
Labels: discuss, Team:Ecosystem, Team:Fleet, Team:Integrations

Comments

@kpollich
Member

kpollich commented Nov 3, 2021

Ref elastic/kibana#111858

In 7.15 and 7.16, Fleet introduced behavior that allows our users to more easily manage upgrades for the Elastic Agent integrations and associated policies. The policy upgrade process, in particular, is now much more streamlined, and no longer requires policies to be deleted and recreated against a new version of an integration.

As part of the Fleet team's continuing effort to improve usability and ergonomics around integration upgrades, we'll be moving Fleet's various setup operations to Kibana's boot lifecycle in elastic/kibana#111858. As part of this change, we'd like to solidify our implementation for integration/policy upgrades and allow packages to control whether policies should be updated during this setup process.

Currently, Fleet maintains a hardcoded list of packages with varying behaviors around upgrades. What we'd like to do is push some of this hardcoded logic out of Fleet and capture the behavior as a value in the package spec, instead.

What we're proposing here is an optional package_policy_upgrade_strategy enum value in the package spec that allows Fleet to conditionally perform policy upgrades during boot, and additionally controls some UI presentation logic. See the matrix below:

| Enum Value | Kibana boot behavior | Fleet UI behavior | Potential use cases |
| --- | --- | --- | --- |
| always_upgrade | Fleet will attempt to upgrade all package policies for this package | User is unable to configure "Keep policies up to date" setting - setting has no effect | synthetics, apm |
| upgrade | Fleet will attempt to upgrade all package policies for this package | User is able to toggle the "Keep policies up to date" setting - defaults to true | system |
| null/undefined | Fleet will not attempt to upgrade package policies for this package | User is able to toggle the "Keep policies up to date" - defaults to false. | Most packages |
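
As a rough illustration, the matrix above can be read as a small lookup from the manifest value to Fleet's behavior. This is only a sketch: `UpgradeBehavior` and `resolveUpgradeBehavior` are hypothetical names, not existing Fleet code, and the enum values are the proposed (non-final) ones.

```typescript
// Proposed (non-final) enum values from the matrix above.
type PackagePolicyUpgradeStrategy = 'always_upgrade' | 'upgrade';

// Hypothetical summary of one row of the matrix.
interface UpgradeBehavior {
  // Whether Fleet attempts policy upgrades during Kibana boot by default.
  upgradeOnBoot: boolean;
  // Whether the user may toggle "Keep policies up to date" in the Fleet UI.
  userCanToggle: boolean;
  // Default value of the "Keep policies up to date" setting.
  keepUpToDateDefault: boolean;
}

function resolveUpgradeBehavior(
  strategy?: PackagePolicyUpgradeStrategy | null
): UpgradeBehavior {
  switch (strategy) {
    case 'always_upgrade':
      // The setting has no effect here; this sketch treats it as always on (assumption).
      return { upgradeOnBoot: true, userCanToggle: false, keepUpToDateDefault: true };
    case 'upgrade':
      return { upgradeOnBoot: true, userCanToggle: true, keepUpToDateDefault: true };
    default:
      // null/undefined: most packages.
      return { upgradeOnBoot: false, userCanToggle: true, keepUpToDateDefault: false };
  }
}
```

Keeping the mapping in one function would mean the boot logic and the UI read the same source of truth, whatever the final enum names turn out to be.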

The Fleet UI setting in question appears on the integration settings screen:

[Screenshot: integration settings screen showing the "Keep policies up to date" setting]

The actual names of the enum values are not final, and could likely be made more clear (would love some input here) based on the intended behavior for each. I do think that an enum is likely the best choice here, rather than a boolean value so we can remain flexible for future cases in which more specific upgrade logic might be required.

@kpollich kpollich added discuss Issue needs discussion Team:Fleet Label for the Fleet team Team:Integrations Label for the Integrations team labels Nov 3, 2021
@kpollich kpollich self-assigned this Nov 3, 2021
@kpollich
Member Author

kpollich commented Nov 3, 2021

fyi @joshdover @jen-huang @mostlyjason

@jen-huang
Contributor

Really nice writeup Kyle, especially the table of behaviors - thanks!

I think the enum values here make sense. WRT the field naming, I think we have an opportunity here to introduce fields that affect various aspects of Fleet package management, not just package policy upgrades. For example, I know that we are making efforts to support Stack-aligned packages by bundling them in Kibana, but maybe there will be non-Stack packages that would benefit from being kept up to date automatically? (to clarify: not the policies, but the actual packages themselves)

Another recent use case is the notion of "preserve assets" on upgrade that was needed for ML packages: https://github.com/elastic/kibana/issues/115035

With that in mind, we could introduce a nested structure like this:

fleet_management: {
  package_policy_upgrade_strategy?: 'always_upgrade' | 'upgrade',
  package_upgrade_strategy?: <TBD, probably similar to above>,
  package_assets_strategy?: 'preserve'
}
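
To make the shape above concrete, here is a minimal sketch of how it could be typed and validated when reading a manifest. All names (`FleetManagement`, `readFleetManagement`, and the helper functions) are hypothetical, and `package_upgrade_strategy` is passed through untyped because its values are still TBD.

```typescript
// Hypothetical shape of the nested block sketched above; not part of the package spec.
interface FleetManagement {
  package_policy_upgrade_strategy?: 'always_upgrade' | 'upgrade';
  // Values are still TBD in this thread, so this sketch passes them through untyped.
  package_upgrade_strategy?: unknown;
  package_assets_strategy?: 'preserve';
}

// Validates one field and returns it as a literal type; throws on unknown values.
function parsePolicyStrategy(
  value: unknown
): FleetManagement['package_policy_upgrade_strategy'] {
  if (value === undefined || value === null) return undefined;
  if (value === 'always_upgrade') return 'always_upgrade';
  if (value === 'upgrade') return 'upgrade';
  throw new Error(`Unsupported package_policy_upgrade_strategy: ${String(value)}`);
}

function parseAssetsStrategy(
  value: unknown
): FleetManagement['package_assets_strategy'] {
  if (value === undefined || value === null) return undefined;
  if (value === 'preserve') return 'preserve';
  throw new Error(`Unsupported package_assets_strategy: ${String(value)}`);
}

// `manifest` is assumed to be an already-parsed package manifest object.
function readFleetManagement(manifest: {
  fleet_management?: Record<string, unknown>;
}): FleetManagement {
  const raw: Record<string, unknown> = manifest.fleet_management || {};
  return {
    package_policy_upgrade_strategy: parsePolicyStrategy(raw['package_policy_upgrade_strategy']),
    package_upgrade_strategy: raw['package_upgrade_strategy'],
    package_assets_strategy: parseAssetsStrategy(raw['package_assets_strategy']),
  };
}
```

Rejecting unknown values early is one possible design choice; silently ignoring them would be the more lenient alternative if older Kibana versions need to tolerate newer manifests.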

For the package_policy_upgrade_strategy specifically, I think we will also want to think about what happens if this enum value changes between package versions and how we might resolve potential differences between user configuration and package manifest.

What do you think?

@kpollich
Member Author

kpollich commented Nov 3, 2021

> I think we have an opportunity here to introduce fields that affect various aspects of Fleet package management, not just package policy upgrades.

100% agree here - it'd be great to move anything we can out of Fleet's hardcoded package lists and into the package manifests for those packages instead. So, rather than maintaining a list of AUTO_UPDATE_PACKAGES in Fleet, we'd have packages w/ package_upgrade_strategy: upgrade (or something along those lines) and check on that field when Fleet's setup process runs.

https://github.com/elastic/kibana/blob/b609b1e4500c80e2ea617b973bc7b22ce639bc9c/x-pack/plugins/fleet/common/constants/epm.ts#L23-L50

I wonder if it'd be possible to capture things like our default and unremovable packages through package spec values as well, and if that'd be worth pursuing as part of this change, e.g.

fleet_management: { 
  package_upgrade_strategy: "upgrade",
  package_policy_upgrade_strategy: "always_upgrade",
  package_assets_strategy: "preserve",
  
  // Capture "unremoveable" packages?
  uninstall_strategy: "disallow",
}

I'm not sure about our default packages, since we need to install those by default during setup. It seems to make more sense to have Fleet maintain that list of packages rather than combing the whole registry for default packages and installing whatever we find. Sort of like a requirements.txt or package.json file for Fleet's setup.
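
A minimal sketch of that split, assuming the field names from the example above: per-package behavior is derived from the manifest, while the default list stays in Fleet. `ManifestExcerpt`, `FLEET_DEFAULT_PACKAGES`, and the three functions are hypothetical names, and the list contents are illustrative only.

```typescript
// Hypothetical manifest excerpt; field names follow the example above.
interface ManifestExcerpt {
  name: string;
  fleet_management?: {
    package_upgrade_strategy?: string;
    uninstall_strategy?: string;
  };
}

// Illustrative only: the default list would stay in Fleet, like a requirements file.
const FLEET_DEFAULT_PACKAGES: string[] = ['fleet_server'];

function isDefaultPackage(pkg: ManifestExcerpt): boolean {
  return FLEET_DEFAULT_PACKAGES.indexOf(pkg.name) !== -1;
}

// Would replace a lookup in a hardcoded auto-update list.
function isAutoUpdated(pkg: ManifestExcerpt): boolean {
  return pkg.fleet_management?.package_upgrade_strategy === 'upgrade';
}

// Would replace a lookup in a hardcoded "unremovable" list.
function canUninstall(pkg: ManifestExcerpt): boolean {
  return pkg.fleet_management?.uninstall_strategy !== 'disallow';
}
```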

@joshdover
Contributor

joshdover commented Nov 4, 2021

> I think we have an opportunity here to introduce fields that affect various aspects of Fleet package management, not just package policy upgrades.

Great point, let's use a structure that we can easily extend with related options in the future.

One question I have is about timing. I'd like to consider other upgrade strategy options separately from the work planned for 8.0 in order to avoid scope creep. What I'm not sure about is what the expectations are for Kibana to support changes added to the package spec. If we add these additional options now, is it acceptable for Kibana to not support them until later? Should we instead just proceed with adding the option we need now, but in a more future-proof structure like the one @jen-huang proposed above?

> I wonder if it'd be possible to capture things like our default and unremovable packages through package spec values as well, and if that'd be worth pursuing as part of this change, e.g.

+1 on getting rid of our special case hard-coded constants in Kibana. This makes testing this behavior in a generic way difficult.

Specifically for the default case, I don't think we should plan to support packages in the registry getting this behavior unless they're also bundled with Kibana. For instance, I don't want us to be making a query to EPR during Kibana startup that asks for the list of default packages to be installed. We should only be respecting this field from packages bundled with Kibana so that we can be sure that Kibana's boot sequence is not adversely affected after a Stack release by a rogue package specifying this field. Whenever we add this option, this should be part of this option's documentation.

> For the package_policy_upgrade_strategy specifically, I think we will also want to think about what happens if this enum value changes between package versions and how we might resolve potential differences between user configuration and package manifest.

Since this option only really applies after the base package has been upgraded, IMO we should just respect whatever the newly installed version of the package specifies. If it's always_upgrade we ignore any user settings, but if it's upgrade we would respect the user's setting if they've ever set one. Nit: maybe we should rename upgrade to upgrade_preferred to make this distinction more explicit?
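
The resolution rule described here can be sketched in a few lines. This is an illustration under the assumptions stated in the comment, with hypothetical names (`shouldKeepPoliciesUpToDate`, `userSetting`); it is not existing Fleet code.

```typescript
// Strategy declared by the newly installed package version (proposed values).
type NewVersionStrategy = 'always_upgrade' | 'upgrade' | undefined;

// `userSetting` is undefined when the user has never set "Keep policies up to date".
function shouldKeepPoliciesUpToDate(
  strategy: NewVersionStrategy,
  userSetting?: boolean
): boolean {
  // always_upgrade ignores any user setting.
  if (strategy === 'always_upgrade') return true;
  // Otherwise an explicit user choice wins.
  if (userSetting !== undefined) return userSetting;
  // No user choice: 'upgrade' defaults to true, no strategy defaults to false.
  return strategy === 'upgrade';
}
```

Because only the new version's manifest is consulted, a package that switches from `upgrade` to `always_upgrade` would start overriding a user's earlier opt-out, which may be worth calling out in documentation.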

@ruflin ruflin added the Team:Ecosystem Label for the Packages Ecosystem team label Nov 4, 2021
@mtojek
Contributor

mtojek commented Nov 4, 2021

> Currently, Fleet maintains a hardcoded list of packages with varying behaviors around upgrades. What we'd like to do is push some of this hardcoded logic out of Fleet and capture the behavior as a value in the package spec, instead.

Nice idea, Kyle! Let me challenge this: I'm unfamiliar with the hardcoded actions, but I'm wondering if the package can describe with some scripting how the update should be performed? I'm afraid that even with 10 different enum values (upgrade procedure types) we'll need to introduce this 11th special case.

@jsoriano
Member

jsoriano commented Nov 9, 2021

Thanks for the proposal, it looks good, but after thinking a bit about the feature I am wondering if we should do it.

Having options to decide what to upgrade when upgrading a package is going to lead to situations where users have mixed versions of packages and policies. If we think about having even more options, such as package_assets_strategy, the combinations explode, which can greatly complicate support and development.

Imagine for example these situations:

  • A user asks for support because a dashboard is not showing data for some nodes. Support engineers need to start the conversation asking about the versions used in packages, policies and potentially, assets (apart from agent, kibana and so on). The user may not even know about these details. If everything is upgraded as part of the same process, the only relevant versions are more clear: package, agent, kibana.
  • A developer is implementing a new visualization for a dashboard of a package, that needs a new field added by an ingest pipeline. She needs to test the dashboard with combinations of the new dashboard and old policies. Support matrix in CI explodes. And there may be even more combinations if we open the gates to options like package_assets_strategy.
  • Fleet/EPM may need in the future to implement some kind of upgrade reversal, or installation of specific versions, to help users that have problems after upgrading. In these cases, implementation gets complicated if there can be different combinations of updated policies (and/or assets).

So I would propose to try to walk in the direction of having a single strategy for all packages. And this strategy would probably be always_upgrade. The reasons:

  • It has proven to be the preferred strategy in cases where upgrades are needed more frequently (apm, synthetics, that are released with each version of the stack).
  • Versions are consistent: packages and their policies (and assets) are all aligned. This helps with support and maintainability of packages.
  • Fleet has more control over the versions used, so it will be easier to do advanced management of fleets of agents (for example controlled rollover and automatic revert of upgrades if they fail).

Having more flexibility does not always provide more value. In this case I think that by deciding to support only always_upgrade we limit the product a bit, but we can actually provide more value, as we can focus on a single experience for all packages.

@kpollich
Member Author

kpollich commented Nov 9, 2021

Thanks for your input, @mtojek - much appreciated.

> Nice idea, Kyle! Let me challenge this: I'm unfamiliar with the hardcoded actions, but I'm wondering if the package can describe with some scripting how the update should be performed? I'm afraid that even with 10 different enum values (upgrade procedure types) we'll need to introduce this 11th special case.

I think this is a very valid concern, and it does feel like we may have an ever-growing list of "special cases" to capture various classifications of packages and their behavior.

For reference, these hardcoded lists live in the Fleet codebase here:

https://github.com/elastic/kibana/blob/29148d3ed7f2169e2aff702432fe64fef1d9b04f/x-pack/plugins/fleet/common/constants/epm.ts

We have 3 general classifications of packages hard-coded here today:

  1. Default - Packages that are installed during Fleet setup and required for Fleet to function, such as fleet_server.
  2. Auto Update - Packages for which the latest available version is always installed during setup IF the user has manually installed the package. Today, this only includes endpoint.
  3. Unremoveable - Packages for which we do not expose an "Uninstall" option in the UI

The decision that we're faced with moving forward is whether to add an additional hard-coded list here to capture "Packages whose policies will be automatically upgraded during setup", and so that's where this change request has come from.

In terms of whether we could capture these specific traits/behaviors via some kind of scripting, I think this would be possible but I'd love to see examples if we're doing this elsewhere. So I have a few follow-up questions:

  1. Do you know of any similar fields that support scripting in this way? Either in the package spec or elsewhere.
  2. Would this likely be Painless or something else?

@jsoriano thanks for your input as well, I'll try to address your concerns below.

> Having options to decide what to upgrade when upgrading a package is going to lead to situations where users have mixed versions of packages and policies. If we think about having even more options, such as package_assets_strategy, the combinations explode, which can greatly complicate support and development.

I don't think it's realistic for us to plan around an assumption that users will always strive to run the most up-to-date versions of their packages or to pursue policy upgrades for potentially many in-flight policies. From a product perspective, I think it's respectful to allow users the choice to embrace the "if it's not broken, don't fix it" mentality with their agents and policies. Perhaps it's worth crystallizing that concept in terms of support, though. e.g. if a user is running a version of a package more than X releases old, support will always recommend they pursue an upgrade regardless of their issue. Maybe making X = 1 release here is also an option, so we never commit to supporting an out-of-date package at all?

We already have plenty of cases today where users have mixed package versions and policy versions since until 7.15 the only way to upgrade an existing policy was to completely recreate it once a newer version of the package was installed.

> A user asks for support because a dashboard is not showing data for some nodes. Support engineers need to start the conversation asking about the versions used in packages, policies and potentially, assets (apart from agent, kibana and so on). The user may not even know about these details. If everything is upgraded as part of the same process, the only relevant versions are more clear: package, agent, kibana.

Fleet only recently started removing package assets when a new version of a package is installed in 7.16: elastic/kibana#112644. We still support maintaining assets for versions of a package that an existing policy depends on, however. So, this is already the case today in that we maintain assets for multiple versions of a package in the out-of-sync package version case described above.

> Fleet/EPM may need in the future to implement some kind of upgrade reversal, or installation of specific versions, to help users that have problems after upgrading. In these cases, implementation gets complicated if there can be different combinations of updated policies (and/or assets).

This is correct, and we do have some existing logic around rolling back package versions in case of a failure in the Kibana upgrade process. I think in failure cases like this, the potential package spec flags around upgrade behavior would not be honored, and we'd always return packages/policies/assets to their previous state prior to the upgrade.

In cases of an intentional downgrade, I don't think we plan to support a first-class concept of downgrading through Fleet in this way, but it's still worth pointing out that we have additional logical considerations to make if that changes.

> So I would propose to try to walk in the direction of having a single strategy for all packages. And this strategy would probably be always_upgrade. The reasons:
>
>   • It has proven to be the preferred strategy in cases where upgrades are needed more frequently (apm, synthetics, that are released with each version of the stack).
>   • Versions are consistent: packages and their policies (and assets) are all aligned. This helps with support and maintainability of packages.
>   • Fleet has more control over the versions used, so it will be easier to do advanced management of fleets of agents (for example controlled rollover and automatic revert of upgrades if they fail).

Upgrading all packages and their policies by default feels very heavy-handed and seems to ignore breaking changes between package versions. In cases where a variable/input is restructured or moved, we'll have situations where we're potentially throwing away now-deprecated configuration from the previous version of a package. Today, we do have some fairly nuanced conflict detection logic that will fail the upgrade in many cases like this, but then we wind up in a state that you've observed above where we have divergent package/policy versions because the automatic upgrade process failed.

In an ideal world, it'd be great to guarantee that all packages and policies in use by Fleet/Agent are running the latest version of their respective packages, but it seems to me that would never be possible so long as breaking changes between package versions are possible, as we'll always have potential conflicts/failures in the upgrade process. Perhaps we could implement semver-based rules around this always_upgrade strategy and only attempt upgrades for non-experimental packages to avoid this?
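
One way such a semver-based rule might look is sketched below. The exact rule is an assumption made for illustration: automatic upgrades are attempted only when neither version is experimental and the major version does not change. `PackageVersionInfo`, its `experimental` flag, and `isSafeToAutoUpgrade` are hypothetical names.

```typescript
// Hypothetical view of a package version; `experimental` is an assumed flag.
interface PackageVersionInfo {
  version: string; // e.g. "1.4.2"
  experimental: boolean;
}

// Extracts the major component of a dotted version string.
function majorOf(version: string): number {
  const [first = ''] = version.split('.');
  const major = parseInt(first, 10);
  if (isNaN(major)) {
    throw new Error(`Invalid version: ${version}`);
  }
  return major;
}

// Assumed rule: skip experimental packages and treat a major bump as potentially breaking.
function isSafeToAutoUpgrade(from: PackageVersionInfo, to: PackageVersionInfo): boolean {
  if (from.experimental || to.experimental) return false;
  return majorOf(from.version) === majorOf(to.version);
}
```

A rule like this only helps if package authors actually reserve breaking changes for major versions, which is the same trust assumption discussed above.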

So, in short, I hesitate to accept enforcing the always_upgrade strategy as the default for all packages, as it feels too heavy-handed. If only a subset of packages expects this behavior, they can architect their release ideology around it and avoid breaking changes/conflicts between versions as much as possible. To ask that every package embrace that same ideology feels a bit unrealistic.

@mtojek
Contributor

mtojek commented Nov 10, 2021

Thank you for the reply!

> In terms of whether we could capture these specific traits/behaviors via some kind of scripting, I think this would be possible but I'd love to see examples if we're doing this elsewhere. So I have a few follow-up questions:
> Do you know of any similar fields that support scripting in this way?

No, not yet. The only scripts I recall are part of ingest pipelines.

It was just an idea to discuss another strategy, maybe more flexible.

> Either in the package spec or elsewhere. Would this likely be Painless or something else?

... but yes, I was thinking specifically about the painless scripting :)

@jsoriano
Member

@kpollich thanks for your thoughtful explanation.

I completely agree that we shouldn't expect users to have the most up-to-date version of every package. When I was referring to always_upgrade it was in the context of this issue: to always upgrade everything included in a package when a package is upgraded. The user can still decide when to upgrade the package; if an old package works for them, perfect, no need to touch it. We are also making efforts to try to make packages work with a bigger range of stack versions.

My concern is not so much about always upgrading or not, it is that we may be giving too many options to package developers and users and this may lead to a confusing product. Even in this thread where we are all involved in the project we have felt the need to clarify at least twice when we were referring to automatic upgrades of packages or of policies. If we make a decision on the strategies to use, these blurry areas disappear. Every toggle we add is potentially going to require explanations to users, customers and developers, apart from potentially complicating development itself, so we have to be sure that we need it.

Focusing on a single option will make the product easier to understand and use, and will help us to focus on providing a better experience for this option.

For example if we decide to go in the always_upgrade direction, Fleet can take care of updating everything when a package is upgraded, and provide a unified experience when something in the process fails. It can be to automatically revert or to let the user know that some agents couldn't be updated and let them know about how to follow up. No advanced toggles needed, and in any case most of the time the upgrade should work, or we are doing something wrong.
If we decide to go in the "never upgrade" direction, something in the UI can tell the user about what to do right after upgrading a package. But we only need to support one strategy.

Regarding assets, similar thing. I guess that the ideal is to keep all the assets that are used by something, and "garbage collect" them when they belong to an old version and are not used by anything else. Maybe we are not in a position to know where all assets are used, but till we reach this situation, if possible, I would prefer not to add toggles that we have to include in the spec, in APIs and UIs and long-term support.

I think that making this kind of decision gives more value to the product, even when it somehow limits it.

@kpollich
Member Author

Thanks for clarifying @jsoriano - that makes a lot of sense. It would be beneficial, certainly, to expand the concept of "auto upgrading" to include both the base package and its policies. This would definitely lower the complexity of the product while still providing value to users around certain "managed" packages having their upgrade processes fully managed by Fleet.

I think from a UI perspective we'd want to make some changes that make it clear to users that when they're installing a package with always_upgrade: true in its manifest they won't be able to control its upgrade process. I think some kind of "Note: This package is managed by Fleet, and will automatically update when a new version is available. Your integration policies will also be updated" message displayed before installation and on the package's settings page would be necessary. We have something similar for our hardcoded "managed" packages today, but it would be good to expand it to cover policies as well:

[Screenshot: settings page of a managed package, including the "Keep policies up to date" setting]

We also allow users to configure a keep_policies_up_to_date setting for these packages as of 7.16, as you can see above. We'd likely need to remove this functionality altogether in favor of allowing packages to opt into this behavior instead. This feels like a bit of a product decision, so @mostlyjason would you be able to offer any insight from the product side about upgrades for these kinds of managed packages?

There's also a decision point to make about whether we allow users to override this setting, or if it's an immutable value from the package spec. I think based on this thread so far we'd lean towards disallowing user override for packages w/ this flag set.

@jsoriano
Member

jsoriano commented Nov 11, 2021

A conversation with @ruflin about this helped me to see policies more like configuration. Then in principle it made more sense to me not to update them automatically.

Now, from the POV of policies seen as configuration, I wonder if policy upgrades could be like this:

  • If the upgraded policy is equivalent to the previous one, it is upgraded automatically (it should be an effective noop). This would be most of the cases, I see few changes in the templates in the integrations repo.
  • If it is not equivalent, the user should be clearly informed about this. Maybe even before installing the new package.

Though, looking at the changes in templates, they don't happen frequently, but in packages where they happen more frequently, it tends to be because the config includes processors or some kind of advanced mapping. In these cases the developer probably wants the policy to be always upgraded. If not, pipelines or dashboards in the new version may not work correctly because they may be missing some new field. These packages would benefit from package_policy_upgrade_strategy: always_upgrade, but this would give a different experience to users between different packages, and can be error prone for developers, who may forget to use this flag when needed or overuse it when not.

Then I wonder if instead we should make a clear separation in packages between configuration and processors/mappings, then Fleet could still see if the configuration part is equivalent, and decide to upgrade automatically if it is. And the processors/mappings part could be always upgraded, so pipelines and dashboards have the fields they expect.

Then I see that we already have a clear separation between actual user configuration and the policy. The user configuration is not the policy, but the variables. Then going back to the beginning of the comment, the strategy could be always like this:

  • If the new version has the same set of variables, or only new variables with defaults, then the policy is upgraded automatically, for any package. If this fails, this is a bug in the package, and Fleet should revert.
  • If the new package has incompatible changes in variables, the user is clearly informed about this. An action is needed to ensure that everything works as expected. This may be along the lines of @kpollich's comment about the conflict detection logic. I wonder if the UI to upgrade an integration could be shown to the user before, or as part of, the process of upgrading the package.
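
The two rules above could be sketched as a comparison of variable definitions between package versions. This is a simplified illustration, not Fleet's actual conflict detection: `VarDef`, `findVariableConflicts`, and `canAutoUpgradePolicy` are hypothetical names, and only removed variables and new variables without defaults are treated as incompatible.

```typescript
// Hypothetical, simplified variable definition from a package manifest.
interface VarDef {
  name: string;
  default?: unknown;
}

// Returns human-readable conflicts; an empty array means the upgrade is compatible.
function findVariableConflicts(oldVars: VarDef[], newVars: VarDef[]): string[] {
  const oldNames = oldVars.map((v) => v.name);
  const newNames = newVars.map((v) => v.name);

  // A variable the user may have configured no longer exists.
  const removed = oldNames
    .filter((n) => newNames.indexOf(n) === -1)
    .map((n) => `removed variable: ${n}`);

  // A new variable has no default, so the user must provide a value.
  const addedWithoutDefault = newVars
    .filter((v) => oldNames.indexOf(v.name) === -1 && v.default === undefined)
    .map((v) => `new variable without default: ${v.name}`);

  return removed.concat(addedWithoutDefault);
}

function canAutoUpgradePolicy(oldVars: VarDef[], newVars: VarDef[]): boolean {
  return findVariableConflicts(oldVars, newVars).length === 0;
}
```

Returning the list of conflicts rather than a bare boolean would let the UI tell the user exactly which variables need attention.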

Managed packages could be more strict on the changes in variables, and do them always in backwards compatible ways, so their policies are always upgraded according to the strategy above. Then Fleet doesn't need any special logic for the upgrade of these packages, nor does the package spec need to include any field to select the upgrade strategy.

If we limit the possibility of breaking changes for policies to the variables used, we could consider adding deprecation fields for variables to provide a better experience or even automated migration paths, in line with what we are discussing for package deprecation in #227

There may also be breaking changes caused by the use of new settings in policies in old agents. To help with this we could add conditional logic in the templates based on the agent version (perhaps this is already possible?), or add support for the agent.version constraint.

I see this relies a lot on trusting that there are no breaking changes, but I think this is a desirable aim given the make it minor initiatives, and we could also help package developers with tooling (for example with elastic/elastic-package#579).

Does it make sense?

@mtojek
Contributor

mtojek commented Nov 15, 2021

> If the new version has the same set of variables, or only new variables with defaults, then the policy is upgraded automatically, for any package. If this fails, this is a bug in the package, and Fleet should revert.

👍

> A conversation with @ruflin about this helped me to see policies more like configuration. Then in principle it made more sense to me not to update them automatically.

Please shed more light on this. I can't see the reason why the above rule can't be enabled except for safety (don't know future effects).

@jsoriano
Member

> > A conversation with @ruflin about this helped me to see policies more like configuration. Then in principle it made more sense to me not to update them automatically.
>
> Please shed more light on this. I can't see the reason why the above rule can't be enabled except for safety (don't know future effects).

Look at this as configuration files included in packages in Linux distributions. Usually they are not upgraded automatically when the packages are upgraded, unless the user didn't modify them. This would still be coherent with the above rule, but here the check could be a bit different: the "config" can be upgraded if the user has only set variables that are compatible with the new version.

@mtojek
Contributor

mtojek commented Nov 15, 2021

Well, following this logic would mean that we can always update (try to update) the package, but never adjust configuration. Linux distributions may not reflect this situation well as they also contain libraries or programs. In this case there are only policies. Anyway, thanks for explaining.

@jsoriano
Member

> Well, following this logic would mean that we can always update (try to update) the package, but never adjust configuration.

We can always update the policies during the upgrade of a package, and I think this is fine. If you want to adjust configuration you should be able to do it later in most cases. The configuration here would be the variables, I wouldn't consider the rest of the policies as configuration, because they can include advanced mappings, local processing and other things that may be needed later in ingest pipelines or dashboards.

@joshdover
Contributor

> If the new version has the same set of variables, or only new variables with defaults, then the policy is upgraded automatically, for any package. If this fails, this is a bug in the package, and Fleet should revert.

For smaller deployments, I think this makes sense. We could change the flow here to detect those breaking changes (aka 'conflicts') before doing the base package upgrade itself and then just upgrade everything at once. I think this would greatly simplify the UX and matrix of situations we need to support.
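
A rough sketch of that flow: check every policy for conflicts first, and only proceed with the base package and all policies together when none are found. The names (`PolicyRef`, `ConflictCheck`, `planPackageUpgrade`) are hypothetical, and the conflict check itself is passed in as a function rather than implemented here.

```typescript
// Hypothetical reference to a package policy.
interface PolicyRef {
  id: string;
}

// Assumed callback: returns the conflicts a policy would hit on the new package version.
type ConflictCheck = (policy: PolicyRef) => string[];

interface UpgradePlan {
  // True when the package and all of its policies can be upgraded in one step.
  proceed: boolean;
  // Conflicts keyed by policy id, for display to the user.
  conflictsByPolicy: Record<string, string[]>;
}

// Runs the conflict check for every policy before anything is upgraded.
function planPackageUpgrade(policies: PolicyRef[], check: ConflictCheck): UpgradePlan {
  const conflictsByPolicy: Record<string, string[]> = {};
  for (const policy of policies) {
    const conflicts = check(policy);
    if (conflicts.length > 0) {
      conflictsByPolicy[policy.id] = conflicts;
    }
  }
  return {
    proceed: Object.keys(conflictsByPolicy).length === 0,
    conflictsByPolicy,
  };
}
```

Since nothing is modified while planning, a failed plan leaves the installed package and its policies on the same version, avoiding the divergent state discussed earlier in the thread.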

One motivation I can imagine is allowing admins to test out a package upgrade on a "test" or "canary" agent policy before rolling out to the wider fleet. We in fact do suggest this to users in our documentation on integration upgrades: https://www.elastic.co/guide/en/fleet/current/integrations.html#update-integration. I think this may be a valid use case, but would like to learn more from @mostlyjason on this. Reasons I can imagine this is important for some users:

  • New policies could turn on new features or change settings that change the CPU or memory footprint of Agent (well, really the underlying Beats).
  • New policies could ship new data, data that potentially contains sensitive information that wasn't previously shipped.

That said, I wonder if a package rollback would be a better alternative solution to the problem than a more complex upgrade UX. In other words, did we optimize for a non-typical use case and make the typical use case more complicated?

@jsoriano
Member

Regarding test/canaries, I agree that this is a valid use case, but I am not sure if we have a good story now for this. When you upgrade a package now, all its assets are upgraded, so even if there are agents using older policies, they will be using the new ingest pipelines, and data will be visualized in the new dashboards. There can be only one version of a package installed now, and there is no way to have a set of agents completely using a different version of a package.

While this is the case, and agents ingest data that is always handled by the assets of the only installed version, I think that we should do our best to keep the policies aligned with this version too, to avoid misunderstandings and tricky support and development situations.

@mostlyjason

++ to what Josh said about testing and canary releases. In large enterprises with tens of thousands of agents or clusters with critical infrastructure, they need to test before rolling changes out to production. They can't afford to have an integration upgrade take down their servers by increasing load, cause security or compliance problems by shipping sensitive data, or break data ingestion. Their solution is to test it on a limited set of agents, then promote the change to a larger set of agents to limit the blast radius of problems. In some organizations, a test deployment and a production deployment have to be approved by separate people, and there can be compliance requirements for auditing these steps.

I realize we don't currently have a way to test ES assets like ingest pipelines, but they are supposed to be compatible with earlier versions of the integration. We can somewhat decouple updating ES assets from updating the agent policies. At least we can lower the blast radius to the cluster and limit breaking changes on endpoints.

I like the idea of simplifying the UX by auto-upgrading integrations and integration policies at the same time for smaller or less critical clusters. However, I'd hate to do it at the expense of our enterprise and mission-critical clusters by taking away their control and ability to manage risk. The upgrade option provides users with the convenience of auto-upgrades without taking away their control.

It seems like the discussion of defaulting to always_upgrade is tangential to the package spec changes. If anyone wishes to continue discussing that, would it make sense to move that to another thread or issue?

I wonder if a package rollback would be a better alternative solution to the problem than a more complex upgrade UX.

You mean we'd provide rollback instead of the ability to test upgrades before deploying them? If so, that doesn't provide a solution for use cases like preventing sensitive data from being ingested, limiting load changes, etc.

We have 3 general classifications of packages hard-coded here today: Default ... Auto Update .... Unremoveable

We may end up removing the need for Default and Unremovable as part of this issue elastic/kibana#108456. It might be better to remove them from Kibana instead of adding them to the package spec.

@mtojek
Contributor

mtojek commented Nov 17, 2021

++ to what Josh said about testing and canary releases. In large enterprises with tens of thousands of agents or clusters with critical infrastructure, they need to test before rolling changes out to production. They can't afford to have an integration upgrade take down their servers by increasing load, cause security or compliance problems by shipping sensitive data, or break data ingestion. Their solution is to test it on a limited set of agents, then promote the change to a larger set of agents to limit the blast radius of problems. In some organizations, a test deployment and a production deployment have to be approved by separate people, and there can be compliance requirements for auditing these steps.

On the other hand, you have just described a new feature.

Apart from the upgrade strategy, Fleet could provide a canary mode to try a new deployment (new policy) on a small set of hosts and then, if it succeeds, scale the deployment to the rest. It looks like it will become a control panel for DevOps, so maybe introduce more options there? Deployment calendars (go/no-go days, no-deployment Fridays), office hours, time windows, canary percentage, etc.
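To make the canary idea a bit more concrete, here is a minimal sketch of two of the options listed above: a canary percentage and a "no deployment on Fridays" rule. Everything in it (`splitCanary`, `isDeploymentAllowed`, the choice of UTC) is a hypothetical illustration, not an existing Fleet API.

```typescript
// Hypothetical helper: split agent IDs into a canary group and the rest.
// The first N% of the list (rounded up) becomes the canary group.
function splitCanary(
  agentIds: string[],
  canaryPercent: number
): { canary: string[]; rest: string[] } {
  if (canaryPercent < 0 || canaryPercent > 100) {
    throw new RangeError("canaryPercent must be between 0 and 100");
  }
  const count = Math.ceil((agentIds.length * canaryPercent) / 100);
  return { canary: agentIds.slice(0, count), rest: agentIds.slice(count) };
}

// Hypothetical go/no-go rule: block deployments on Fridays.
// Assumption: the day is evaluated in UTC (getUTCDay() returns 5 for Friday).
function isDeploymentAllowed(date: Date): boolean {
  return date.getUTCDay() !== 5;
}
```

A rollout would first push the new policy to `canary`, and only continue to `rest` if the canary agents stay healthy and `isDeploymentAllowed` returns true for the current date.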

@jsoriano
Member

Thanks for your comments. I now have a better understanding of the kind of flexibility we want to give users. I see that there are still limitations, such as the fact that other resources included in packages, like ingest pipelines, are always upgraded, but this could evolve in the future if needed.

Going back to my original concern about giving too many options, I still wonder if we want to give this flexibility to package developers. It seems that we prefer package_policy_upgrade_strategy: upgrade plus some kind of conflict detection, so that for initial or simpler deployments users need to care about fewer things, while more complex deployments or experienced users have more control over policy upgrades when upgrading packages.

This also seems consistent with the current experience (for most packages?): it is what I see by default when upgrading the apache and system packages in 7.16.

Given that, if we allow package developers to use package_policy_upgrade_strategy: always_upgrade, we are allowing any package to take this ability to manage risk away from users. Is this something we want? It can lead to situations where users with complex, large, or mission-critical deployments can decide when and how to upgrade the policies of some packages but not others, which can be a confusing experience.

I notice that the always_upgrade strategy is proposed for synthetics and apm. I wonder why we don't want to provide the same level of control for these packages. If it is because these packages are more managed, or more coupled to the stack, maybe it should be expected that Kibana needs some special code for these special packages. Or if specific Kibana plugins need specific packages and policies, maybe the plugins should take care of managing these packages.

Maybe another approach to this decision would be to answer the question: why would a package developer opt for always_upgrade over the upgrade strategy? If there is no good reason for that, apart from the package being more managed or coupled to Kibana, then perhaps we shouldn't give this option to package developers.

We have 3 general classifications of packages hard-coded here today: Default ... Auto Update .... Unremoveable

We may end up removing the need for Default and Unremovable as part of this issue elastic/kibana#108456. It might be better to remove them from Kibana instead of adding them to the package spec.

++

@mostlyjason

If it is because these packages are more managed, or more coupled to the stack, maybe it should be expected that Kibana needs some special code for these special packages.

Yes, these will be stack-aligned packages that are shipped with Kibana and upgraded with Kibana. There are some dependencies between these package versions and the code in Kibana. They also tend to be deployed to agents running on centralized servers as opposed to endpoints, so the impact of updating these is smaller. The teams owning these packages believe the convenience of auto-upgrades outweighs the need to manage deployment risk.

Organizationally, I think we need to carefully review which packages make use of this feature, since it limits our end users' control. I don't have a preference for where that happens, either in Kibana or the package definition. Do you know if there is a standard place to document these settings where we could put a warning for future package developers?

@jsoriano
Member

Yes, these will be stack-aligned packages that are shipped with Kibana and upgraded with Kibana. There are some dependencies between these package versions and the code in Kibana.

If this is the case, could this logic be implemented as part of this? Packages shipped and upgraded with Kibana have their policies always upgraded. Packages installed from the registry have their policies upgraded by default, but users can disable it.
Then no field is needed in package definitions, and there is no chance of package developers misusing this.
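As a rough sketch of the rule described above, assuming Kibana can tell bundled packages from registry packages and stores a per-package opt-out for the user: all names here (`PackageSource`, `InstalledPackage`, `UserSettings`, `shouldUpgradePolicies`) are hypothetical and do not come from Fleet's codebase.

```typescript
// Hypothetical: where the installed package came from.
type PackageSource = "bundled" | "registry";

interface InstalledPackage {
  name: string;
  source: PackageSource;
}

interface UserSettings {
  // Hypothetical: names of packages for which the user has disabled
  // automatic policy upgrades.
  autoUpgradeOptOut: Set<string>;
}

// Packages shipped with Kibana always have their policies upgraded.
// Registry packages are upgraded by default unless the user opted out.
function shouldUpgradePolicies(
  pkg: InstalledPackage,
  settings: UserSettings
): boolean {
  if (pkg.source === "bundled") {
    return true;
  }
  return !settings.autoUpgradeOptOut.has(pkg.name);
}
```

With this shape the decision lives entirely in Kibana and in user settings, so the package manifest would not need to carry an upgrade strategy.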

Organizationally, I think we need to carefully review which packages make use of this feature, since it limits our end users' control. I don't have a preference for where that happens, either in Kibana or the package definition. Do you know if there is a standard place to document these settings where we could put a warning for future package developers?

We cannot rely only on reviews, because we want to open package development to more and more teams, eventually also from outside of Elastic. If we add this field to the spec, we could document this in the description, discouraging its use, but if it is there, any package developer is going to be able to use it even if we discourage it.

@mostlyjason

Those are good points, @jsoriano, and I see how adding it to the spec could increase the number of package developers who misuse it.

Packages installed from the registry have their policies upgraded by default, but users can disable it.

What do you think of this idea @kpollich? I'm unsure why System is treated differently. It would save some clicks if we think most of the time users will upgrade the policies as well. Can you think of any integration that should not have upgraded policies by default (with the option to opt out)?

@kpollich
Member Author

kpollich commented Dec 2, 2021

Apologies for the delay here. Been a busy week or so 😅.

What do you think of this idea @kpollich? I'm unsure why System is treated differently. It would save some clicks if we think most of the time users will upgrade the policies as well. Can you think of any integration that should not have upgraded policies by default (with the option to opt out)?

In general, I can't think of any specific integrations for which we'd prefer not to upgrade policies automatically. We'd always prefer to save the user some clicks and shorten the path they need to take to get their integrations and agents up to date. However, I do think it'd be wise not to opt users into this behavior for experimental/beta integrations. We've had experimental packages like AWS go through substantial reimaginings during their development, so I think it might be a good idea to include that additional semver type check when we determine whether to opt in to auto-upgrades or not.
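A minimal sketch of what such a semver check could look like. The assumption here, which is mine and not something settled in this thread, is that a package counts as experimental/beta if its major version is 0 or its version carries a semver prerelease tag. The names `isPrerelease` and `autoUpgradeByDefault` are hypothetical.

```typescript
// Assumption: "experimental/beta" means major version 0 or a semver
// prerelease tag such as "1.2.0-beta.1". Build metadata ("+...") is ignored.
function isPrerelease(version: string): boolean {
  const match =
    /^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?(?:\+[0-9A-Za-z.-]+)?$/.exec(version);
  if (match === null) {
    throw new Error(`not a semver version: ${version}`);
  }
  const major = Number(match[1]);
  const prereleaseTag = match[4];
  return major === 0 || prereleaseTag !== undefined;
}

// Hypothetical default: only stable packages are opted in to policy
// auto-upgrades; experimental/beta packages need an explicit opt-in.
function autoUpgradeByDefault(version: string): boolean {
  return !isPrerelease(version);
}
```

For example, `autoUpgradeByDefault("1.2.0")` returns true, while `autoUpgradeByDefault("0.9.0")` and `autoUpgradeByDefault("1.2.0-beta.1")` return false.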

Another concern I have is potentially confusing UX around this option. It feels like it might be difficult to convey to users why some of their packages opt in to this behavior and others don't. I think we need to take stock of our messaging on the integration settings page and determine the best way to convey these pieces of logic to the user. We've already sort of run up against this with packages like APM, where we require auto-upgrades and had to introduce some additional wording to clarify the behavior:

image

@joshdover
Contributor

@kpollich Do we still have a need for this? Should we consider closing this for now and revisiting later if it comes up again?

@kpollich
Member Author

@joshdover Thanks for the ping. We don't need this for now so I'll close for the time being.
