
[Fleet] Evaluate Fleet page load performance and steps to improve #118751

Closed · mostlyjason opened this issue Nov 16, 2021 · 16 comments

Labels: performance, Team:Fleet (Team label for Observability Data Collection Fleet team)

@mostlyjason
Contributor

mostlyjason commented Nov 16, 2021

Problem
Several people have noticed slow page load performance in the Fleet and Integrations apps. When users start a trial in Elastic Cloud they expect good performance as part of a good user experience. Needing to wait 5+ seconds for a page to load makes the application feel sluggish, especially in the absence of UI affordances like loading indicators. The getting started working group sees this as a high-priority area to investigate and improve in order to lower our trial churn rate.

Evaluation
I'd like us to evaluate the end-to-end performance of Fleet as the user starts a cloud trial, views the integration browse page, adds an integration, and adds an agent. Trying it out in my own browser, I found these results:

  1. Open Integrations Browse page 3.6s
  2. Open Elastic APM Integration detail view 2.4s
  3. Add Elastic APM Integration first time 9.6s (/api/fleet/setup call is 7s)
  4. Add Elastic APM Integration second time 1.8s
  5. Save and continue 20s (/api/fleet/package_policies was 18s)
  6. Add agent dialog 1.4s

The slowness seems to happen inconsistently, but three places stand out: the Fleet setup call, adding a package policy, and the initial data load from EPR.
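For reference, per-request timings like these can be pulled from the browser's Resource Timing entries in the devtools console (illustrative snippet only, not part of any Fleet tooling):

```ts
// Illustrative only: list the Fleet API calls made by the current page along
// with how long each took, using the browser's Resource Timing API.
const fleetTimings = performance
  .getEntriesByType('resource')
  .filter((entry) => entry.name.includes('/api/fleet/'))
  .map((entry) => ({ url: entry.name, durationMs: Math.round(entry.duration) }));

console.table(fleetTimings);
```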

Questions

  1. How can we evaluate the performance of Fleet? Should we do manual or automated testing? How can we identify inconsistent performance issues? Can we get APM data on oblt or staging clusters?
  2. What can we do to improve the performance of any steps taking longer than 5 seconds?
mostlyjason added the Team:Fleet (Team label for Observability Data Collection Fleet team) label on Nov 16, 2021
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@joshdover
Contributor

@mostlyjason just to keep track, which version did you do this testing on?

  1. Add Elastic APM Integration first time 9.6s (/api/fleet/setup call is 7s)

We have plans to optimize this in 8.0 as part of #111858

  1. Save and continue 20s (/api/fleet/package_policies was 18s)

Details about why this is slow can be found in #110500. tl;dr we need a way to write Elasticsearch assets in bulk to reduce the number of cluster state updates needed. The ES team has preferred that instead of a bulk API we spend effort towards a generic "package install" API as part of making packages a first-class concept across the Stack.

@mostlyjason
Contributor Author

Thanks Josh! I tested on 7.16-snapshot in a cluster that was created about a week ago. It looks like the APM package installs 41 ingest node pipelines so that might explain why it's slow.
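For illustration, the cost pattern here is roughly one PUT per ingest pipeline, and each PUT triggers its own cluster state update. A rough sketch of that shape (not the actual Fleet install code; the call follows the 7.x elasticsearch-js client API):

```ts
import { Client } from '@elastic/elasticsearch';

// Rough sketch: installing a package's pipelines one by one means one cluster
// state update per pipeline, so a package with 41 pipelines pays that cost 41
// times. A bulk "package install" API would collapse this into fewer updates.
async function installPipelines(
  client: Client,
  pipelines: Array<{ id: string; body: Record<string, unknown> }>
) {
  for (const pipeline of pipelines) {
    await client.ingest.putPipeline({ id: pipeline.id, body: pipeline.body });
  }
}
```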

@nchaulet
Member

nchaulet commented Jan 6, 2022

I started looking at this by testing Fleet performance with some agents and policies, and it looks like we have a few issues:

  • n+1 problem when we do bulk operations that touch a bunch of policies (package policy upgrade, verifying enrollment tokens during setup)
  • Installing ES assets is slow.
  • Web performance: some React components to improve (the integration grid, for example)

I will edit this comment as I find more things, and I am going to create or link to existing issues for each problem and try to add a trace/profile where I can.

n+1 problems

During Fleet setup each time

  • Verifying existing enrollment tokens (could we remove that check and only do it for preconfigured policies?)

During Fleet setup one time

  • Package policy auto upgrade

Bulk action on agents:

  • force unenroll / bulk upgrade is extremely slow with a large number of actions (see the batching sketch below)
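A minimal sketch of the batching idea behind fixing these n+1 spots, with hypothetical helper names (the real fix would likely be a single filtered saved-objects query rather than one query per policy):

```ts
// Hypothetical sketch: check which policies are missing an enrollment token
// with ONE bulk lookup instead of one lookup per policy (the n+1 pattern).
interface EnrollmentKey {
  policyId: string;
}

async function findPoliciesMissingTokens(
  policyIds: string[],
  // Stands in for a single bulk query (e.g. one saved-objects `find` with an OR filter).
  fetchEnrollmentKeysForPolicies: (ids: string[]) => Promise<EnrollmentKey[]>
): Promise<string[]> {
  const keys = await fetchEnrollmentKeysForPolicies(policyIds);
  const covered = new Set(keys.map((k) => k.policyId));
  // One round trip total, instead of `policyIds.length` round trips.
  return policyIds.filter((id) => !covered.has(id));
}
```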

Installing ES assets

Webperf issues

Loading the package list is slow (I captured a CPU profile with 5s of JS to render the package grid); we should investigate how to fix that, maybe by virtualizing the list.

@joshdover
Contributor

@nchaulet some good finds here 🎉

During Fleet setup one time

  • Package policy auto upgrade

After 8.0 ships (which moves setup to Kibana boot) and we've gotten some feedback from users/support, I think we should consider removing this setup API call from the UI. It definitely shouldn't be necessary once #120616 is in, since Kibana won't even start up if there's a setup issue. We only left it in for now as a hacky/cheap "retry" option, but once we block Kibana boot we can be sure that setup already completed before the UI is ever served up. Also related is #121639.

Curious if there's any improvement we can make to the Integration details page load. I've noticed in the past that this can be quite slow, especially on Cloud for some reason. For example, would it be advantageous to avoid loading the entire package to show this page and instead only load the manifest and screenshots?

@nchaulet
Member

nchaulet commented Jan 6, 2022

Curious if there's any improvement we can make to the Integration details page load. I've noticed in the past that this can be quite slow, especially on Cloud for some reason. For example, would it be advantageous to avoid loading the entire package to show this page and instead only load the manifest and screenshots?

Yes, I think we can optimize both the details page and the integration list page. I need to dig more into the details page, but on the integration list page we spend a lot of time rendering the grid (I have a CPU profile where it took 5s of blocking JS to render that list). We definitely need to optimize this; maybe we can virtualize the list and render only the visible items, or maybe there are obvious things here that are not performant. I need to dig into it more.
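As a rough sketch of the virtualization idea (using react-window purely for illustration; the component name, card dimensions, and column count are made up and not the actual Fleet grid):

```tsx
import React from 'react';
import { FixedSizeGrid } from 'react-window';

interface PackageListItem {
  title: string;
}

const COLUMNS = 3;

// Only the cells currently in view are rendered, so the initial render cost no
// longer grows with the total number of packages.
export const VirtualizedPackageGrid: React.FC<{ packages: PackageListItem[] }> = ({ packages }) => (
  <FixedSizeGrid
    columnCount={COLUMNS}
    columnWidth={320}
    rowCount={Math.ceil(packages.length / COLUMNS)}
    rowHeight={180}
    width={1000}
    height={600}
  >
    {({ columnIndex, rowIndex, style }) => {
      const pkg = packages[rowIndex * COLUMNS + columnIndex];
      // Cells past the end of the list render nothing.
      return pkg ? <div style={style}>{pkg.title}</div> : null;
    }}
  </FixedSizeGrid>
);
```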

@jen-huang
Contributor

Thanks for the investigation @nchaulet, good stuff here. WRT the integration list and details, I thought a few times that maybe we can cache package info on the client-side via React Context (or similar) so that we can instantly load the information if it's already been fetched before. This would be in addition to any optimizations we do on the actual package info endpoints you've identified in #122560. WDYT?

@nchaulet
Member

Thanks for the investigation @nchaulet, good stuff here. WRT the integration list and details, I thought a few times that maybe we can cache package info on the client-side via React Context (or similar) so that we can instantly load the information if it's already been fetched before. This would be in addition to any optimizations we do on the actual package info endpoints you've identified in #122560. WDYT?

Actually this call is only slow the first time; after that, the package info is already cached server side (in memory without expiration, so this could be problematic at some point), so I do not think caching client side will make a huge difference here.

@mostlyjason
Contributor Author

@jen-huang should we pass any of these to the journey team, since they own unified integrations UI now?

@joshdover
Contributor

WRT the integration list and details, I thought a few times that maybe we can cache package info on the client-side via React Context (or similar) so that we can instantly load the information if it's already been fetched before.

IMO for any client-side caching we should be leveraging the built-in features of the browser, ie. Cache-Control headers. This eliminates any need for complex and hard-to-maintain caching logic in our application and provides the same benefits.

Though as @nchaulet pointed out in #122560, it seems the main issue is that we download the whole package contents rather than just the manifest + screenshots.

@kpollich
Member

IMO for any client-side caching we should be leveraging the built-in features of the browser, ie. Cache-Control headers. This eliminates any need for complex and hard-to-maintain caching logic in our application and provides the same benefits.

100% agree that we should lean on browser cache controls and the fact that the EPR is served via a CDN rather than doing our own custom caching here. We're duplicating a lot of effort in terms of caching here right now, and I think we're also needlessly relying on the "archive" endpoints (e.g. https://epr.elastic.co/epr/nginx/nginx-1.2.1.zip) when we could be relying on the "plain JSON" endpoint (e.g. https://epr.elastic.co/package/nginx/1.2.1/) for each package instead. The production EPR currently responds with cache-control: max-age=600,public for the "plain JSON" endpoint, so these JSON responses are cached on disk for 10 minutes. If we need to alter the caching logic here we'd need to work with the team that maintains EPR, which could add some churn.

Relying on HTTP caching directives would mean moving our interactions with EPR into the client and off of the server, though, as server-side requests won't honor HTTP cache headers the way the browser does. It would almost certainly be preferable to our current workflow, which downloads, unpacks, and caches a .zip archive for each package. We'd also be able to remove a huge chunk of somewhat-legacy code for all the in-memory caching that Fleet currently does.
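To illustrate the direction (the function name is hypothetical, but the URL shape matches the "plain JSON" endpoint above): a plain browser fetch is enough, because the browser's HTTP cache honors EPR's cache-control: max-age=600,public on its own.

```ts
// Hypothetical sketch: fetch package metadata directly from the browser and let
// the HTTP cache handle expiry (repeat calls within the max-age window are
// served from cache without hitting EPR again).
async function fetchPackageInfo(name: string, version: string): Promise<unknown> {
  const response = await fetch(`https://epr.elastic.co/package/${name}/${version}/`);
  if (!response.ok) {
    throw new Error(`EPR responded with ${response.status}`);
  }
  return response.json();
}
```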

We could probably flesh out #122560 to capture this larger refactor in pursuit of performance gains, or we could try and separate a few of these points out into distinct issues. Eager to hear others' thoughts here.

@ruflin brought up some concerns about package size in this context in an offline email chain earlier, so I'll loop him in here as well.

@nchaulet
Member

Relying on HTTP caching directives would mean moving our interactions with EPR into the client and off of the server, though, as server-side requests won't honor HTTP cache headers the way the browser does. It would almost certainly be preferable to our current workflow, which downloads, unpacks, and caches a .zip archive for each package. We'd also be able to remove a huge chunk of somewhat-legacy code for all the in-memory caching that Fleet currently does.

Yes, it would be a lot better. Also, the cache we have in memory does not have any expiration; it's just a plain object hash map, so if the number of packages grows too much in the future it could be an issue.
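For illustration, this is the kind of small TTL wrapper (names made up) that would give an in-memory cache an expiration instead of letting it grow without bound:

```ts
// Hypothetical sketch: a tiny in-memory cache with per-entry expiration, as an
// alternative to a plain object hash map that never evicts anything.
interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

class TtlCache<T> {
  private entries = new Map<string, CacheEntry<T>>();

  constructor(private ttlMs: number) {}

  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      // Evict stale entries lazily on read.
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```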

@ruflin
Contributor

ruflin commented Jan 20, 2022

The production EPR currently responds with cache-control: max-age=600,public for the "plain JSON" endpoint, so these JSON responses are cached on disk for 10 minutes. If we need to alter the caching logic here we'd need to work with the team that maintains EPR, which could add some churn.

This would be a super simple and quick change.

@joshdover
Contributor

@nchaulet Can we close this issue now and use the remaining tickets you opened?

@nchaulet
Member

Yes I am going to close that issue 👍
