
[Fleet] Evaluate Fleet page load performance and steps to improve #118751

Closed · mostlyjason opened this issue Nov 16, 2021 · 16 comments

Labels: performance, Team:Fleet (Team label for Observability Data Collection Fleet team)

@mostlyjason
Contributor

mostlyjason commented Nov 16, 2021

Problem
Several people have noticed slow page load performance in the Fleet and Integrations apps. When users start a trial in Elastic Cloud they expect good performance as part of a good user experience. Needing to wait 5+ seconds for a page to load makes the application feel sluggish, especially in the absence of UI affordances like loading indicators. The getting started working group sees this as a high-priority area to investigate and improve in order to lower our trial churn rate.

Evaluation
I'd like us to evaluate the end-to-end performance of Fleet as the user starts a cloud trial, views the integration browse page, adds an integration, and adds an agent. Trying it out in my own browser, I found these results:

  1. Open Integrations Browse page 3.6s
  2. Open Elastic APM Integration detail view 2.4s
  3. Add Elastic APM Integration first time 9.6s (/api/fleet/setup call is 7s)
  4. Add Elastic APM Integration second time 1.8s
  5. Save and continue 20s (/api/fleet/package_policies was 18s)
  6. Add agent dialog 1.4s

The slowness seems to happen inconsistently, but three places stand out: the Fleet setup call, adding a package policy, and the initial data load from EPR.
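For reference, per-request timings like these can be pulled from the browser's Resource Timing entries in the devtools console (illustrative snippet only, not part of any Fleet tooling):

```ts
// Illustrative only: list the Fleet API calls made by the current page along
// with how long each took, using the browser's Resource Timing API.
const fleetTimings = performance
  .getEntriesByType('resource')
  .filter((entry) => entry.name.includes('/api/fleet/'))
  .map((entry) => ({ url: entry.name, durationMs: Math.round(entry.duration) }));

console.table(fleetTimings);
```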

Questions

  1. How can we evaluate the performance of Fleet? Should we do manual or automated testing? How can we identify inconsistent performance issues? Can we get APM data on oblt or staging clusters?
  2. What can we do to improve the performance of any steps taking longer than 5 seconds?
mostlyjason added the Team:Fleet (Team label for Observability Data Collection Fleet team) label on Nov 16, 2021
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@joshdover
Contributor

@mostlyjason just to keep track, which version did you do this testing on?

  1. Add Elastic APM Integration first time 9.6s (/api/fleet/setup call is 7s)

We have plans to optimize this in 8.0 as part of #111858

  1. Save and continue 20s (/api/fleet/package_policies was 18s)

Details about why this is slow can be found in #110500. tl;dr we need a way to write Elasticsearch assets in bulk to reduce the number of cluster state updates needed. The ES team has preferred that instead of a bulk API we spend effort towards a generic "package install" API as part of making packages a first-class concept across the Stack.

@mostlyjason
Contributor Author

Thanks Josh! I tested on 7.16-snapshot in a cluster that was created about a week ago. It looks like the APM package installs 41 ingest node pipelines so that might explain why it's slow.
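For illustration, the cost pattern here is roughly one PUT per ingest pipeline, and each PUT triggers its own cluster state update. A rough sketch of that shape (not the actual Fleet install code; the call follows the 7.x elasticsearch-js client API):

```ts
import { Client } from '@elastic/elasticsearch';

// Rough sketch: installing a package's pipelines one by one means one cluster
// state update per pipeline, so a package with 41 pipelines pays that cost 41
// times. A bulk "package install" API would collapse this into fewer updates.
async function installPipelines(
  client: Client,
  pipelines: Array<{ id: string; body: Record<string, unknown> }>
) {
  for (const pipeline of pipelines) {
    await client.ingest.putPipeline({ id: pipeline.id, body: pipeline.body });
  }
}
```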

@nchaulet
Member

nchaulet commented Jan 6, 2022

I started looking at this by testing Fleet performance with some agents and policies, and it looks like we have a few issues:

  • n+1 problem when we do bulk operations that touch a bunch of policies (package policy upgrade, verifying enrollment tokens during setup)
  • Installing ES assets is slow.
  • Web performance: some React components to improve (the integration grid, for example)

I will edit this comment as I find more things, and I am going to create or link to existing issues for each problem and try to add a trace/profile where I can.

n+1 problems

During Fleet setup each time

  • Verifying existing enrollment tokens (could we remove that check and only do it for preconfigured policies?)

During Fleet setup one time

  • Package policy auto upgrade

Bulk action on agents:

  • force unenroll / bulk upgrade is extremely slow with a large number of actions (see the batching sketch below)
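A minimal sketch of the batching idea behind fixing these n+1 spots, with hypothetical helper names (the real fix would likely be a single filtered saved-objects query rather than one query per policy):

```ts
// Hypothetical sketch: check which policies are missing an enrollment token
// with ONE bulk lookup instead of one lookup per policy (the n+1 pattern).
interface EnrollmentKey {
  policyId: string;
}

async function findPoliciesMissingTokens(
  policyIds: string[],
  // Stands in for a single bulk query (e.g. one saved-objects `find` with an OR filter).
  fetchEnrollmentKeysForPolicies: (ids: string[]) => Promise<EnrollmentKey[]>
): Promise<string[]> {
  const keys = await fetchEnrollmentKeysForPolicies(policyIds);
  const covered = new Set(keys.map((k) => k.policyId));
  // One round trip total, instead of `policyIds.length` round trips.
  return policyIds.filter((id) => !covered.has(id));
}
```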

Installing ES assets

Webperf issues

Loading the package list is slow (I captured a CPU profile with 5s of JS to render the package grid); we should investigate how to fix that, maybe by virtualizing the list.

@joshdover
Contributor

@nchaulet some good finds here 🎉

During Fleet setup one time

  • Package policy auto upgrade

After 8.0 ships (which moves setup to Kibana boot) and we've gotten some feedback from users/support, I think we should consider removing this setup API call from the UI. It definitely shouldn't be necessary once #120616 is in, since Kibana won't even start up if there's a setup issue. We only left it in for now as a hacky/cheap "retry" option, but once we block Kibana boot we can be sure that setup already completed before the UI is ever served up. Also related is #121639.

Curious if there's any improvement we can make to the Integration details page load. I've noticed in the past that this can be quite slow, especially on Cloud for some reason. For example, would it be advantageous to avoid loading the entire package to show this page and instead only load the manifest and screenshots?

@nchaulet
Member

nchaulet commented Jan 6, 2022

Curious if there's any improvement we can make to the Integration details page load. I've noticed in the past that this can be quite slow, especially on Cloud for some reason. For example, would it be advantageous to avoid loading the entire package to show this page and instead only load the manifest and screenshots?

Yes, I think we can optimize both the details page and the integration list page. I need to dig more into the details page, but on the integration list page we spend a lot of time rendering the grid (I have a CPU profile where it took 5s of blocking JS to render that list). We definitely need to optimize this; maybe we can virtualize the list and render only the visible items, or maybe there are obvious things here that are not performant. I need to dig into it more.
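As a rough sketch of the virtualization idea (using react-window purely for illustration; the component name, card dimensions, and column count are made up and not the actual Fleet grid):

```tsx
import React from 'react';
import { FixedSizeGrid } from 'react-window';

interface PackageListItem {
  title: string;
}

const COLUMNS = 3;

// Only the cells currently in view are rendered, so the initial render cost no
// longer grows with the total number of packages.
export const VirtualizedPackageGrid: React.FC<{ packages: PackageListItem[] }> = ({ packages }) => (
  <FixedSizeGrid
    columnCount={COLUMNS}
    columnWidth={320}
    rowCount={Math.ceil(packages.length / COLUMNS)}
    rowHeight={180}
    width={1000}
    height={600}
  >
    {({ columnIndex, rowIndex, style }) => {
      const pkg = packages[rowIndex * COLUMNS + columnIndex];
      // Cells past the end of the list render nothing.
      return pkg ? <div style={style}>{pkg.title}</div> : null;
    }}
  </FixedSizeGrid>
);
```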

@jen-huang
Contributor

Thanks for the investigation @nchaulet, good stuff here. WRT the integration list and details, I thought a few times that maybe we can cache package info on the client-side via React Context (or similar) so that we can instantly load the information if it's already been fetched before. This would be in addition to any optimizations we do on the actual package info endpoints you've identified in #122560. WDYT?

@nchaulet
Member

Thanks for the investigation @nchaulet, good stuff here. WRT the integration list and details, I thought a few times that maybe we can cache package info on the client-side via React Context (or similar) so that we can instantly load the information if it's already been fetched before. This would be in addition to any optimizations we do on the actual package info endpoints you've identified in #122560. WDYT?

Actually this call is only slow the first time; after that, the package info is already cached server side (in memory without expiration, so this could be problematic at some point), so I do not think caching client side will make a huge difference here.

@mostlyjason
Contributor Author

@jen-huang should we pass any of these to the journey team, since they own unified integrations UI now?

@joshdover
Contributor

WRT the integration list and details, I thought a few times that maybe we can cache package info on the client-side via React Context (or similar) so that we can instantly load the information if it's already been fetched before.

IMO for any client-side caching we should be leveraging the built-in features of the browser, ie. Cache-Control headers. This eliminates any need for complex and hard-to-maintain caching logic in our application and provides the same benefits.

Though as @nchaulet pointed out in #122560, it seems the main issue is that we download the whole package contents rather than just the manifest + screenshots.

@kpollich
Member

IMO for any client-side caching we should be leveraging the built-in features of the browser, ie. Cache-Control headers. This eliminates any need for complex and hard-to-maintain caching logic in our application and provides the same benefits.

100% agree that we should lean on browser cache controls and the fact that the EPR is served via a CDN rather than doing our own custom caching here. We're duplicating a lot of effort in terms of caching here right now, and I think we're also needlessly relying on the "archive" endpoints (e.g. https://epr.elastic.co/epr/nginx/nginx-1.2.1.zip) when we could be relying on the "plain JSON" endpoint (e.g. https://epr.elastic.co/package/nginx/1.2.1/) for each package instead. The production EPR currently responds with cache-control: max-age=600,public for the "plain JSON" endpoint, so these JSON responses are cached on disk for 10 minutes. If we need to alter the caching logic here we'd need to work with the team that maintains EPR, which could add some churn.

Relying on HTTP caching directives would mean moving our interactions with EPR into the client and off of the server, though, as server-side requests won't honor HTTP cache headers the way the browser does. It would almost certainly be preferable to our current workflow, which downloads, unpacks, and caches a .zip archive for each package. We'd also be able to remove a huge chunk of somewhat-legacy code for all the in-memory caching that Fleet currently does.
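To illustrate the direction (the function name is hypothetical, but the URL shape matches the "plain JSON" endpoint above): a plain browser fetch is enough, because the browser's HTTP cache honors EPR's cache-control: max-age=600,public on its own.

```ts
// Hypothetical sketch: fetch package metadata directly from the browser and let
// the HTTP cache handle expiry (repeat calls within the max-age window are
// served from cache without hitting EPR again).
async function fetchPackageInfo(name: string, version: string): Promise<unknown> {
  const response = await fetch(`https://epr.elastic.co/package/${name}/${version}/`);
  if (!response.ok) {
    throw new Error(`EPR responded with ${response.status}`);
  }
  return response.json();
}
```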

We could probably flesh out #122560 to capture this larger refactor in pursuit of performance gains, or we could try and separate a few of these points out into distinct issues. Eager to hear others' thoughts here.

@ruflin brought up some concerns about package size in this context in an offline email chain earlier, so I'll loop him in here as well.

@nchaulet
Member

Relying on HTTP caching directives would mean moving our interactions with EPR into the client and off of the server, though, as server-side requests won't honor HTTP cache headers the way the browser does. It would almost certainly be preferable to our current workflow, which downloads, unpacks, and caches a .zip archive for each package. We'd also be able to remove a huge chunk of somewhat-legacy code for all the in-memory caching that Fleet currently does.

Yes, it would be a lot better. Also, the cache we have in memory does not have any expiration; it's just a plain object hash map, so if the number of packages grows too much in the future it could be an issue.
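For illustration, this is the kind of small TTL wrapper (names made up) that would give an in-memory cache an expiration instead of letting it grow without bound:

```ts
// Hypothetical sketch: a tiny in-memory cache with per-entry expiration, as an
// alternative to a plain object hash map that never evicts anything.
interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

class TtlCache<T> {
  private entries = new Map<string, CacheEntry<T>>();

  constructor(private ttlMs: number) {}

  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      // Evict stale entries lazily on read.
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```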

@ruflin
Contributor

ruflin commented Jan 20, 2022

The production EPR currently responds with cache-control: max-age=600,public for the "plain JSON" endpoint, so these JSON responses are cached on disk for 10 minutes. If we need to alter the caching logic here we'd need to work with the team that maintains EPR, which could add some churn.

This would be a super simple and quick change.

@joshdover
Contributor

@nchaulet Can we close this issue now and use the remaining tickets you opened?

@nchaulet
Member

Yes I am going to close that issue 👍
