Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Organize and Update Metrics #7098

Open
matthewkeil opened this issue Sep 20, 2024 · 8 comments
Open

Organize and Update Metrics #7098

matthewkeil opened this issue Sep 20, 2024 · 8 comments
Assignees
Labels
meta-technical-debt Issues introducing or resolving technical debts. prio-medium Resolve this some time soon (tm). scope-metrics All issues with regards to the exposed metrics.

Comments

@matthewkeil
Copy link
Member

matthewkeil commented Sep 20, 2024

Problem description

There are a few goals that this issue is meant to discuss/cover.

Our collection of metrics has grown over time and the seems to be a lot of technical debt that needs to be addressed. The big thing is the organization of the metrics we currently collect. We have two giant files that are not very well organized so it is becoming harder to maintain. I would like to propose cleaning up the organization by shuffling the metrics into a few separate files so that they are easier to maintain.

There are also a number of metrics that are prescribed in ethereum/beacon-metrics that we are not collected as part of our metrics program. Ideally these will also get added.

There is also another creeping issue that I think should be addressed. We are starting to collect metrics is folders other than beacon-node and having the Metrics type there creates a circular dependency.

Solution description

Creating packages/metrics Folder

I would like to suggest that we move the metrics out of the beacon-node package and into its own package. This will make it easier to import the Metrics interface into other packages. I also am thinking we can export a singleton metrics object from the package. The package could export a getter function that obtains the instance of the singleton metrics object for addition to the chain object as it is today. But this would open the possibility of getting the metrics object in any help function or the other packages without needing to pass the chain object around. Not sure if this is strictly necessary but might make our lives easier going forward.

As an example we have metrics in reqresp and in state-transition that could easily be centralized.

Nesting Metrics

I am not sure if added a lot of nested domains will make using the metrics object easier or more complicated. This is the first thing that should be decided. Is it better to nest similar metrics within domains, under the metrics object or should we keep a flat structure so its easier to find what one is looking for using intellisense

metrics.forkChoice.findHeadSeconds vs metrics.findHeadSeconds

My gut tells me that some minimal nesting will be nice due to the large number of metrics that we collect but using a deeply nested structure may become a pain to deal with as some metrics tend to cross domain boundaries and it will be challenging to decide "where" to put things.

metrics.chain.bls.multithread.aggregatedPublicKeysCount() vs metrics.bls.aggregatedPublicKeys

I actually kinda flip-flopped on this while writing this up so its worth discussing further as I'm not really sure which will be best myself.

File Structure

I think we really need to split up the two giant files into a number of smaller files so its easier to maintain. Seems like an easy win.

lodestar Metric Prefix

This seems a bit overkill to add to basically every single metric. We only collect metrics on "lodestar" which is a "beacon" node so separating the two seems silly. The only reason would be to differentiate "ethereum standard" metrics from our lodestar specifics but I am not sure that is necessary. I do think however that we should be prefixing with the "domain" a metric represents and having that domain match the filename seems like a nice to have so its clear where it is without having to do a global file search.

Additional context

No response

@matthewkeil matthewkeil added meta-feature-request Issues to track feature requests. meta-discussion Indicates a topic that requires input from various developers. scope-metrics All issues with regards to the exposed metrics. meta-technical-debt Issues introducing or resolving technical debts. and removed meta-feature-request Issues to track feature requests. labels Sep 20, 2024
@matthewkeil matthewkeil self-assigned this Sep 20, 2024
@nflaig
Copy link
Member

nflaig commented Sep 20, 2024

There are also a number of metrics that are prescribed in ethereum/beacon-metrics that we are not collected as part of our metrics program. Ideally these will also get added.

We should be implementing at least the interop metrics which seems like we do. Other metrics I would probably not add if nobody is asking for it and we don't deem them useful.

metrics.forkChoice.findHeadSeconds vs metrics.findHeadSeconds

nesting seems useful to differentiate metrics and add more context imo

This seems a bit overkill to add to basically every single metric. We only collect metrics on "lodestar" which is a "beacon" node so separating the two seems silly.

I think we should prefix non-spec'd metrics with lodestar. I don't think the prefix is overkill, it's a good practice to use the software name as prefix to avoid clashes if users collect metrics from a lot of different services.

Creating packages/metrics Folder

was thinking about that too at one point but there is really not that much code which is why I decided to add the metric types required by a lot of packages to packages/utils and the code to run the metrics server, create registry, etc. seemed fine to keep in beacon-node and wire it up in packages/cli.

We are starting to collect metrics is folders other than beacon-node and having the Metrics type there creates a circular dependency.

Where is this the case? Last time I looked at this #6201 and refactored / moved the code around, I removed all duplications and types should only be imported from utils package.

File Structure

Seems reasonable to split it up into multiple files although I don't think the reason the metrics grow over time is because of using a single file. I think the problem is rather that we don't remove metrics unless the code is removed, but we could re-evaluate some metrics added during initial implementation and remove them if the data does not seem necessary, especially if those metrics are not even part of any panel.

@matthewkeil
Copy link
Member Author

There are also a number of metrics that are prescribed in ethereum/beacon-metrics that we are not collected as part of our metrics program. Ideally these will also get added.

We should be implementing at least the interop metrics which seems like we do. Other metrics I would probably not add if nobody is asking for it and we don't deem them useful.

I tend to disagree. I think we should meet the ethereum/beacon-metrics spec just like the other specs. We cannot be sure who or why is expecting those so it is best to include them. The cost to do so is minimal and its the right thing to do as good stewards of the protocol

metrics.forkChoice.findHeadSeconds vs metrics.findHeadSeconds

nesting seems useful to differentiate metrics and add more context imo

👍

This seems a bit overkill to add to basically every single metric. We only collect metrics on "lodestar" which is a "beacon" node so separating the two seems silly.

I think we should prefix non-spec'd metrics with lodestar. I don't think the prefix is overkill, it's a good practice to use the software name as prefix to avoid clashes if users collect metrics from a lot of different services.

I think we should use "beacon" similar to the spec'd metrics. There is no reason to differentiate ones that are "lodestar specific beacon metrics" vs "beacon metrics." Those feel like one and the same. But this is not a hill i would die on and am open to suggestions.

Creating packages/metrics Folder

was thinking about that too at one point but there is really not that much code which is why I decided to add the metric types required by a lot of packages to packages/utils and the code to run the metrics server, create registry, etc. seemed fine to keep in beacon-node and wire it up in packages/cli.

I think creating a separation of concerns is a nice to have here. In particular with regards to the next response about circular dependency.

We are starting to collect metrics is folders other than beacon-node and having the Metrics type there creates a circular dependency.

Where is this the case? Last time I looked at this #6201 and refactored / moved the code around, I removed all duplications and types should only be imported from utils package.

At a minimum I know there are metrics in reqresp and in state-transition. I found it frustrating to add metrics for async shuffling because that happens in state-transition and can imagine as the codebase grows there will be other instances where it is nice to add metrics places, other than in beacon-node.. As a note, to accommodate the metrics for async shuffling I needed to make an interface that meets the metrics object that gets passed in so that they were available. This is not a great idea as there are now two places that those definitions exist (interface in state-transition and implementation in beacon-node). It would have been much better to be able to just pass in the Metrics type in state-transition instead of duplicating it.

File Structure

Seems reasonable to split it up into multiple files although I don't think the reason the metrics grow over time is because of using a single file. I think the problem is rather that we don't remove metrics unless the code is removed, but we could re-evaluate some metrics added during initial implementation and remove them if the data does not seem necessary, especially if those metrics are not even part of any panel.

Agreed. Seems like a nice to have so its more maintainable than 1000+ line files... Will also make it more meaningful to maintain (ie remove unused metrics) if its easier to grok what is cooking.

@nflaig
Copy link
Member

nflaig commented Sep 20, 2024

The cost to do so is minimal and its the right thing to do as good stewards of the protocol

The problem with the beacon-metrics seems that it is not really well maintained, we might wanna clean up there first, see which metrics are actually implemented by other clients at this point.

There is no reason to differentiate ones that are "lodestar specific beacon metrics" vs "beacon metrics."

It's good to know as a consumer, assuming all clients implement the standard beacon metrics you know which ones are interchangeable and which ones are client specific.

It would have been much better to be able to just pass in the Metrics type in state-transition instead of duplicating it.

you mean similar to what we do in reqresp?

export function getMetrics(register: MetricsRegister) {

@matthewkeil
Copy link
Member Author

matthewkeil commented Sep 23, 2024

The cost to do so is minimal and its the right thing to do as good stewards of the protocol

The problem with the beacon-metrics seems that it is not really well maintained, we might wanna clean up there first, see which metrics are actually implemented by other clients at this point.

Correct. That is the heart of this issue/effort.

There is no reason to differentiate ones that are "lodestar specific beacon metrics" vs "beacon metrics."

It's good to know as a consumer, assuming all clients implement the standard beacon metrics you know which ones are interchangeable and which ones are client specific.

I suppose I see both sides of this.

The only consumers of these metrics should be dashboards and in an idea world the dashboards should look the similar across clients. It would be great to be able to standardize metrics methodology across all clients and then for implementation specific dashboards the data would just not show up.

IE GC and event loop would not propagate on lighthouse but all the fork choice ones should be similar and plug-and-play

Prefixing with lodestar only seem relevant for NodeJs like stuff that would never be relevant on Lighthouse but for all others that "could" show up on any beacon node i think we should stick to prefixing with the domain that is represented.

This will help to push the standardization effort that should be a thing for all Ethereum. We could be thought leaders in this space if-so... I have started to add context and am interested in working more on the ethereum/beacon-metrics repo toward this end.

It would have been much better to be able to just pass in the Metrics type in state-transition instead of duplicating it.

you mean similar to what we do in reqresp?

export function getMetrics(register: MetricsRegister) {

This is exactly the issue that I think should be solved. All metrics should be in one place to avoid confusion and duplication. IE we have metrics definitions in several places:

In particular the most egregious is the state-transition interface that requires duplication for type/interface and beacon-node based implementation.

The other thing that stands out to me is that we have gossipsub metrics in both. This should not be true, they should be together for both organizational and maintenance reasons.

The stuff I am also not sure how to handle, but think should also be dealt with is the worker metrics. Those will need a separate instance of the RegistryMetricCreator but having the actual function that does the individual register.*({}) should all be together with the other metrics that are collected. We do not separate the different dashboards and in similar fashion I do not think that we should be segregating the metrics either.

@nflaig
Copy link
Member

nflaig commented Sep 23, 2024

All metrics should be in one place to avoid confusion and duplication. IE we have metrics definitions in several places:

Not sure I agree here, e.g. in the monitoring service it makes much more sense to define the metrics in the constructor instead of somewhere central. This makes it much easier to see where they are used and only adds them if the monitoring service is enabled, I like this pattern.

@matthewkeil
Copy link
Member Author

matthewkeil commented Sep 23, 2024

All metrics should be in one place to avoid confusion and duplication. IE we have metrics definitions in several places:

Not sure I agree here, e.g. in the monitoring service it makes much more sense to define the metrics in the constructor instead of somewhere central. This makes it much easier to see where they are used and only adds them if the monitoring service is enabled, I like this pattern.

I suppose monitoring service could be separate, but not sure why it should be. Its just as easy to have a few exports from packages/metrics and import the correct one into the correct process. Not a hill im willing to die on though for the 5 metrics used by that service.

But big picture is that we have a unified place "where metrics live" kind of like we have a unified place "that dashboards live"

@matthewkeil
Copy link
Member Author

matthewkeil commented Sep 23, 2024

I think the bigger picture though is for something like gossipsup metrics. I am trying to add the ones requested for peerDas and having a hard time with it. Need to dig through how all the metrics are added and still not entirely sure but I know that they need to be added here:

https://github.com/ChainSafe/lodestar/blob/mkeil/standardized-peerdas-metrics/packages/beacon-node/src/network/core/networkCore.ts#L339

However gossip lives both on the worker thread and on the main thread depending on what stage of the process one is attempting to get telemetry from.

Adding the data collectors and then getting access to them is proving a digging exercise.... Having a unified place where they live will also lend itself to a README.md that could explain how/where/why to add metrics so things wire together correctly.

@philknows
Copy link
Member

I think we should discuss this tomorrow as well with additional perspectives, but to summarize from what I'm seeing so far with the proposed solutions:

Creating packages/metrics Folder

Arguments For:

  • Separation of Concerns: Moving metrics to a dedicated package could reduce circular dependencies and improve modularity, allowing easier access to the Metrics interface across different packages.
  • Centralization: A centralized location for metrics could prevent duplication and make it easier to manage and update metrics across the codebase.

Arguments Against:

  • Code Complexity: There isn't enough code to justify creating a separate package, suggesting that existing structures (like packages/utils) are sufficient.
  • Local Definition Benefits: Others argue that defining metrics locally within specific services or components (e.g., monitoring service) makes it easier to see where they are used and ensures they are only added when necessary.

Questions:

  • This is the first time we've been coordinated to include/enforce standardized metrics across all clients, do we want to be meeting ethereum/beacon-metrics implementation for more than just PeerDAS?
    • Do we expect more coordination here in the future? (Unpredictable, this is the first time beacon-metrics has been used in years)
    • Do we want to be stewards or push for more of this standardization in client teams? (Additional work)

Nesting Metrics

Arguments For:

  • Contextual Clarity: Nesting metrics can help differentiate them and add context, making it easier to understand their purpose and usage.
    Organization: Minimal nesting is seen as beneficial due to the large number of metrics, helping organize them logically.

Arguments Against:

  • Complexity in Structure: There are concerns that excessive nesting could complicate the structure, especially when metrics cross domain boundaries, making it challenging to decide where to place them.

Questions

  • Do others feel like the prefix is overkill or makes it more complex than necessary? Or does it provide more clarity/value that it's worth doing so? (Not all of them are also equal in terms of complexity, so it becomes interesting if we decide to change some and not others)

File Structure

Arguments For:

  • Maintainability: Splitting large files into smaller ones is seen as an easy win for maintainability, making it easier to manage and update individual metrics.
  • Clarity: Smaller files could enhance clarity, making it easier to identify unused metrics and maintain a cleaner codebase.

Arguments Against:

  • Growth Not Solely Due to File Size: There is an argument that the growth of metrics isn't solely due to file size but rather because metrics aren't removed unless their associated code is deleted. Thus, re-evaluating the necessity of certain metrics might be more impactful.

Questions

  • How would we like to split the files into smaller ones?
  • Which metrics/code should be removed? (Seems like a good collaborative digging session to have during offsite and propose this as part of Lodestar v2.0 Roadmap #7011 )

Lodestar Metric Prefix

Arguments For:

  • Avoiding Clashes: Prefixing non-spec'd metrics with "lodestar" could prevent naming clashes when users collect metrics from multiple services.
  • Standardization Efforts: Using a consistent prefix aligns with efforts to standardize metric collection across Ethereum clients.

Arguments Against:

  • Redundancy Concerns: Some see prefixing every metric with "lodestar" as redundant since all collected metrics pertain to a "beacon" node.
  • Domain-Based Prefixing Preferred: There is a preference for using domain-based prefixes for metrics that could appear on any beacon node, reserving "lodestar" for Node.js-specific metrics.

Questions:

  • Is there really an indication that using only "beacon" would create conflicts in people's setups? Especially those that are running multiple clients? (I think there is a benefit to being able to define lodestar specific beacon node metrics vs. generic beacon node specific metrics IF it helps non-Lodestar users/devs - otherwise I don't have a strong preference)
  • Do we realistically see other teams using a standard set of plug and play metrics? Perhaps other teams may leverage off other's work if someone were to have done it already? If yes, it pushes forward the argument of having lodestar BN specific and generic beacon node specific metrics

@philknows philknows added prio-medium Resolve this some time soon (tm). and removed meta-discussion Indicates a topic that requires input from various developers. labels Nov 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
meta-technical-debt Issues introducing or resolving technical debts. prio-medium Resolve this some time soon (tm). scope-metrics All issues with regards to the exposed metrics.
Projects
None yet
Development

No branches or pull requests

3 participants