Enhancing Prometheus metrics for BW consumption #1077

Closed
staheri14 opened this issue Sep 5, 2023 · 8 comments

@staheri14
Contributor

staheri14 commented Sep 5, 2023

As part of celestiaorg/celestia-app#2197, in order to calculate the breakdown of BW consumption per chain ID and reactor, we would like to use the existing Prometheus metrics in celestia-core. However, it is not clear whether the current metrics are sufficient or whether we may require additional metrics to accomplish the task.
This issue intends to track the progress of this effort.

@staheri14
Contributor Author

Currently, Prometheus collects the sent and received BW; however, the metrics lack two major labels: the channel ID and the reactor. We need to integrate these two.
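
For illustration, here is a minimal sketch of what a bytes counter carrying a channel ID label could look like. It uses prometheus/client_golang directly rather than celestia-core's actual metrics wrappers, and the metric namespace, label names, and `recordSend` helper are assumptions for the sketch, not the code in the repository:

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical counter: bytes sent to a peer, broken down by channel ID.
// Metric and label names are illustrative, not the ones celestia-core uses internally.
var peerSendBytes = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "p2p",
		Name:      "peer_send_bytes_total",
		Help:      "Number of bytes sent to a given peer, per channel.",
	},
	[]string{"peer_id", "chID"},
)

// recordSend would be called wherever the peer connection writes a packet.
func recordSend(peerID string, chID byte, n int) {
	peerSendBytes.WithLabelValues(peerID, fmt.Sprintf("%#x", chID)).Add(float64(n))
}

func main() {
	recordSend("cb7adca6fbaabc5336c5eebbc1312390a2bb9d2d", 0x30, 512) // example observation
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":26660", nil)
}
```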

staheri14 self-assigned this Sep 5, 2023
@staheri14
Contributor Author

In the existing implementation, there is a single reactor associated with each channel ID. Therefore, we can deduce the reactor based solely on the channel ID label. Given this, our approach will involve appending only the channel ID label to each metric (a sketch of the resulting mapping follows the references below).
References:

celestia-core/node/node.go

Lines 633 to 637 in 2f93fc8

sw.AddReactor("MEMPOOL", mempoolReactor)
sw.AddReactor("BLOCKCHAIN", bcReactor)
sw.AddReactor("CONSENSUS", consensusReactor)
sw.AddReactor("EVIDENCE", evidenceReactor)
sw.AddReactor("STATESYNC", stateSyncReactor)

sw.AddReactor("PEX", pexReactor)

celestia-core/node/node.go

Lines 153 to 158 in 2f93fc8

// - MEMPOOL
// - BLOCKCHAIN
// - CONSENSUS
// - EVIDENCE
// - PEX
// - STATESYNC
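
Because of this 1:1 mapping, the reactor can be recovered from the chID label in post-processing. A minimal sketch, assuming illustrative channel ID values; the authoritative assignments live in each reactor's channel descriptors, not in this table:

```go
package main

import "fmt"

// Hypothetical helper: because each channel ID belongs to exactly one reactor,
// a dashboard or post-processing script can map the chID label back to a
// reactor name. The channel ID values below are illustrative.
var reactorByChannel = map[byte]string{
	0x00: "PEX",
	0x20: "CONSENSUS", 0x21: "CONSENSUS", 0x22: "CONSENSUS", 0x23: "CONSENSUS",
	0x30: "MEMPOOL",
	0x38: "EVIDENCE",
	0x40: "BLOCKCHAIN",
	0x60: "STATESYNC", 0x61: "STATESYNC",
}

func reactorForChannel(chID byte) string {
	if r, ok := reactorByChannel[chID]; ok {
		return r
	}
	return "UNKNOWN"
}

func main() {
	fmt.Println(reactorForChannel(0x40)) // BLOCKCHAIN
}
```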

@rootulp
Collaborator

rootulp commented Sep 6, 2023

In the existing implementation, there is a single reactor associated with each channel ID.

Is this guaranteed for future modifications to CometBFT? Is there a downside to including two labels?

@staheri14
Contributor Author

staheri14 commented Sep 6, 2023

Is this guaranteed for future modifications to CometBFT?

The 1:1 correspondence between reactors and channel IDs is not strictly enforced by the current implementation, as custom reactors can be added to switches on any channel ID (please refer to #1077 (comment)). However, based on past trends, it is likely that the one-to-one mapping will continue to hold in the future, although this is not guaranteed.

Is there a downside to including two labels?

Thank you for your suggestion. Our current implementation, which tracks the channel ID, adequately addresses the requirements of this issue. Adding an extra label could introduce redundancy and complexity without a significant improvement in observability.
Our initial goal was to minimize modifications to the existing Prometheus metrics. Currently, we monitor sent and received bytes at the peer level, the central point of communication with peers. Extending this observability to specific reactors would require changes at higher levels, increasing the scope significantly. The reason is that at the peer level there is no visibility into the reactor to which, or from which, a message is sent; identifying the exact reactor would have to happen in higher-level components, and the change surface would be much larger.

While enhancing observability by identifying reactors is possible, the current approach was chosen to maintain simplicity and efficiency while meeting the objectives of this issue.

@staheri14
Contributor Author

Apparently, no two reactors can share the same channel ID, as suggested in:

// No two reactors can share the same channel.

@evan-forbes
Member

if I recall correctly, I've been able to determine the bandwidth used before via channel ID, but would have to dig through the metrics. Will hopefully be able to do that later today depending on if I can get the network tests working again

@staheri14
Contributor Author

if I recall correctly, I've been able to determine the bandwidth used before via channel ID, but would have to dig through the metrics. Will hopefully be able to do that later today depending on if I can get the network tests working again

Reiterating the findings from our discussion in #1078: the metrics in question are peer_send_bytes_total and peer_receive_bytes_total. Although these metrics include channel IDs, they currently lack a message type label, which makes them insufficient for the desired level of granularity. Nevertheless, the decision reached was that we can use them to cross-check results extracted from message_receive_bytes_total and message_send_bytes_total.
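
For reference, a minimal sketch of such a cross-check, querying Prometheus over its HTTP API with the prometheus/client_golang API client. The Prometheus address and the fully qualified metric names (inferred from the cometbft_p2p_ prefix in the sample output later in this thread) are assumptions:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// Hypothetical cross-check: compare bytes per channel as reported by the
// peer-level and message-level counters. The two sums should roughly agree.
func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	queries := []string{
		`sum by (chID) (cometbft_p2p_peer_send_bytes_total)`,    // assumed metric name
		`sum by (chID) (cometbft_p2p_message_send_bytes_total)`, // assumed metric name
	}
	for _, q := range queries {
		result, warnings, err := promAPI.Query(ctx, q, time.Now())
		if err != nil {
			panic(err)
		}
		if len(warnings) > 0 {
			fmt.Println("warnings:", warnings)
		}
		fmt.Printf("%s =>\n%v\n", q, result)
	}
}
```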

staheri14 added a commit that referenced this issue Sep 19, 2023
…otal and message_send_bytes_total metrics (#1078)

Part of #1077
staheri14 added a commit that referenced this issue Sep 22, 2023
…ssageReceiveBytesTotal Prometheus metrics (#1086)

In line with #1077

Our past experimentation showed that none of the current Prometheus traffic-related metrics encompass all the information regarding the message type, peer ID, and channel ID. This deficiency can be addressed by incorporating peer IDs into `message_receive_bytes_total` and `message_send_bytes_total`. This PR provides that feature.

I tested it by executing a local validator node and examining the
Prometheus metrics endpoint. Here's an example of the output:

```
cometbft_p2p_message_send_bytes_total{chID="0x40", chain_id="mocha-4", message_type="blockchain_StatusResponse", peer_id="cb7adca6fbaabc5336c5eebbc1312390a2bb9d2d", version="1.0.0-rc0-278-g4c452c5f4"} 8
```
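
For illustration, here is a minimal sketch of how a message-level counter could carry the message type, channel ID, and peer ID labels that produce output like the sample above. It uses prometheus/client_golang directly, and `valueToLabel`, `recordMessageSend`, and the stand-in `StatusResponse` type are hypothetical; this is not the exact mechanism used in the PR:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical message-level counter with message type, channel ID, and peer ID labels.
var messageSendBytes = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "cometbft",
		Subsystem: "p2p",
		Name:      "message_send_bytes_total",
		Help:      "Number of bytes of each message type sent.",
	},
	[]string{"message_type", "chID", "peer_id"},
)

// valueToLabel is a hypothetical helper that turns a decoded message into a
// label such as "main_StatusResponse" based on its Go type.
func valueToLabel(msg interface{}) string {
	typeName := strings.TrimPrefix(fmt.Sprintf("%T", msg), "*") // e.g. "main.StatusResponse"
	return strings.ReplaceAll(typeName, ".", "_")
}

// recordMessageSend would be called after a message is marshalled and sent.
func recordMessageSend(msg interface{}, chID byte, peerID string, size int) {
	messageSendBytes.
		WithLabelValues(valueToLabel(msg), fmt.Sprintf("%#x", chID), peerID).
		Add(float64(size))
}

// StatusResponse is a stand-in for a real reactor message type.
type StatusResponse struct{ Height int64 }

func main() {
	recordMessageSend(&StatusResponse{Height: 1}, 0x40, "cb7adca6fbaabc5336c5eebbc1312390a2bb9d2d", 8)
}
```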
staheri14 added a commit that referenced this issue Sep 26, 2023
Closes #1077

I have verified the outcome of this pull request, and the block time
gauge now correctly appears in the Prometheus endpoint:

```
# HELP cometbft_consensus_block_time_seconds Duration between this block and the preceding one.
# TYPE cometbft_consensus_block_time_seconds gauge
cometbft_consensus_block_time_seconds{chain_id="mocha-4",version="1.0.0-rc16"} 11.623671263
```
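
As a side note on how such a gauge can be produced, here is a minimal sketch: the gauge is simply set to the difference between consecutive block timestamps. The package layout, metric wiring, and `recordBlockTime` helper are assumptions, not the actual implementation:

```go
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical gauge mirroring cometbft_consensus_block_time_seconds:
// the duration between the current block and the preceding one.
var blockTimeSeconds = promauto.NewGauge(prometheus.GaugeOpts{
	Namespace: "cometbft",
	Subsystem: "consensus",
	Name:      "block_time_seconds",
	Help:      "Duration between this block and the preceding one.",
})

// recordBlockTime would be called once per committed block, with timestamps
// taken from the current and previous block headers.
func recordBlockTime(prevBlockTime, curBlockTime time.Time) {
	blockTimeSeconds.Set(curBlockTime.Sub(prevBlockTime).Seconds())
}

func main() {
	prev := time.Now().Add(-11623 * time.Millisecond)
	recordBlockTime(prev, time.Now())
	fmt.Println("block time gauge set")
}
```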
@staheri14
Contributor Author

closed by #1091
