This repository was archived by the owner on Nov 15, 2023. It is now read-only.

Telemetry spam #7960

Closed
arkpar opened this issue Jan 23, 2021 · 15 comments
Labels
A5-stale Pull request did not receive any updates in a long time. No review needed at this stage. Close it. I3-bug The node fails to follow expected behavior.

Comments

@arkpar
Member

arkpar commented Jan 23, 2021

When syncing the blockchain at a high enough speed (~600 bps), the console is spammed with Telemetry buffer overflowed! messages. I see ~50 of these per second. It also appears when running the import-blocks command. There are a few things wrong here:

  1. The telemetry server should not be spammed with that many events.
  2. Error messages should be throttled if the buffer overflows because of a slow connection, for example.
  3. The error message should be printed with the log crate.
  4. Local commands, such as import-blocks, should have telemetry disabled by default, as well as other network-related services.
@arkpar arkpar added the I3-bug The node fails to follow expected behavior. label Jan 23, 2021
@bkchr
Member

bkchr commented Jan 23, 2021

@cecton did you change the default buffer size with your PR, or was it 16 before as well?

Edit: Yeah, it was 16 before. Maybe the new worker is too slow?

@dvdplm
Contributor

dvdplm commented Jan 25, 2021

@arkpar Do you know if this problem is recent, i.e. if it worked better before #7463?

Item 4 seems like the lowest-hanging fruit; shall we make that the first priority on the list? Item 2 seems pretty hard to do properly.

cc @maciejhirsz

@arkpar
Member Author

arkpar commented Jan 25, 2021

@dvdplm Not sure. It wasn't there a couple of weeks ago, but I can't be sure at this point.
IMO, when doing a full sync, telemetry events should not be generated for each block, but maybe once per batch of 100 or 1000 blocks.
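
A minimal sketch of that suggestion, under the assumption of a hypothetical `should_send_block_telemetry` helper and `TELEMETRY_SYNC_BATCH` constant (neither exists in Substrate); the idea is simply to skip per-block telemetry events while a major sync is in progress:

```rust
/// Illustrative batch size, not a real Substrate constant.
const TELEMETRY_SYNC_BATCH: u64 = 1_000;

/// During a major sync, only one block in every `TELEMETRY_SYNC_BATCH`
/// produces a telemetry event; otherwise every imported block is reported.
fn should_send_block_telemetry(block_number: u64, is_major_sync: bool) -> bool {
    !is_major_sync || block_number % TELEMETRY_SYNC_BATCH == 0
}
```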

@maciejhirsz
Contributor

I don't think it's recent (the spam, that is, not the error reporting). Telemetry has built-in throttling for block updates during sync for exactly that purpose, so it doesn't affect the server that much, although we could definitely do without the extra network traffic and JSON parsing.

@cecton
Contributor

cecton commented Jan 27, 2021

Regarding point 3: it is not possible to log at that place, which is why it is an eprintln. That being said, I have two ideas:

  1. Maybe we could print only when debug_assertions is on.
  2. Maybe I can make a channel to report the overflow; that would give finer control: I could throttle the messages and log normally. But that means a new background task just for that.

@bkchr
Member

bkchr commented Jan 27, 2021

> 2. Maybe I can make a channel to report the overflow; that would give finer control: I could throttle the messages and log normally. But that means a new background task just for that.

Can we not handle this in the tracing layer we have for the telemetry? Check when we have printed this message the last time?
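
A minimal sketch of that suggestion, assuming the telemetry layer can hold a bit of extra state; `OverflowLog`, `last_printed` and `min_interval` are made-up names for illustration, not part of the actual implementation:

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// State the telemetry layer could keep to rate-limit the overflow message.
struct OverflowLog {
    last_printed: Mutex<Option<Instant>>,
    min_interval: Duration,
}

impl OverflowLog {
    /// Print the message only if `min_interval` has elapsed since the last
    /// print; still uses `eprintln!` because `log` cannot be used here.
    fn report(&self) {
        let mut last = self.last_printed.lock().expect("mutex not poisoned");
        let now = Instant::now();
        if last.map_or(true, |t| now.duration_since(t) >= self.min_interval) {
            eprintln!("Telemetry buffer overflowed!");
            *last = Some(now);
        }
    }
}
```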

@cecton
Contributor

cecton commented Jan 27, 2021

> Can we not handle this in the tracing layer we have for the telemetry? Check when we have printed this message the last time?

Yes, we have state there, so it's possible to throttle. But it will still use eprintln then; to make it print using log, we need to log outside the layer. OK, that's option 3.

@bkchr
Member

bkchr commented Jan 27, 2021

Yeah, I understand that we must use eprintln! there.

@bkchr
Member

bkchr commented Jan 27, 2021

Or could we push the error message to the layers below us? :D So that they will just print this? :P

@cecton
Contributor

cecton commented Jan 29, 2021

> Or could we push the error message to the layers below us? :D So that they will just print this? :P

I won't try to use an internal API, if that's what you had in mind 😁

But I think the solution with the async task in the background is the most maintainable. I'm not sure what the impact on performance would be.

To clarify my idea (a sketch follows after the list):

  • I would make a future that just loops over a channel's messages and reports "buffer overflow" using log::warn!
  • In the layer, I would just send () to the channel
  • Spawn the future in the background and throttle the channel stream
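
A rough sketch of that idea, not the actual implementation; `overflow_reporter`, `spawn_reporter` and `OVERFLOW_LOG_INTERVAL` are hypothetical names, and the throttling is done with a simple timestamp check instead of a stream combinator:

```rust
use std::time::{Duration, Instant};

use futures::channel::mpsc;
use futures::StreamExt;

/// How often, at most, the warning is logged. Purely illustrative.
const OVERFLOW_LOG_INTERVAL: Duration = Duration::from_secs(10);

/// Background future: drains overflow notifications sent by the telemetry
/// layer and emits a throttled `log::warn!` instead of an `eprintln!`.
async fn overflow_reporter(mut rx: mpsc::Receiver<()>) {
    let mut last_logged: Option<Instant> = None;
    let mut suppressed: u64 = 0;

    while rx.next().await.is_some() {
        suppressed += 1;
        let now = Instant::now();
        let due = last_logged
            .map_or(true, |t| now.duration_since(t) >= OVERFLOW_LOG_INTERVAL);
        if due {
            log::warn!(
                target: "telemetry",
                "Telemetry buffer overflowed! ({} occurrence(s) since last report)",
                suppressed,
            );
            suppressed = 0;
            last_logged = Some(now);
        }
    }
}

/// Create the channel and spawn the reporter; `tokio::spawn` stands in for
/// whatever executor the node actually uses.
fn spawn_reporter() -> mpsc::Sender<()> {
    let (tx, rx) = mpsc::channel(16);
    tokio::spawn(overflow_reporter(rx));
    tx
}

// In the tracing layer, reporting an overflow then becomes a non-blocking
// `let _ = overflow_tx.clone().try_send(());` instead of an `eprintln!`.
```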

@jun0tpyrc

jun0tpyrc commented Feb 25, 2021

Another case: I'm seeing my testnet validator keep synchronising but not finalising.

After a restart it does finalize, but at the same time it shows this Telemetry buffer overflowed! message too; not sure if that is related.

It might be that it stopped at

```
2021-02-25 04:04:22  👴 Applying authority set change scheduled at block #4445222
2021-02-25 04:04:25  👴 Applying GRANDPA set change to new set ....
```

in such a case.

@2075
Contributor

2075 commented Mar 26, 2021

Hmm, I still get it as well. I also saw that the telemetry page seems to be overloaded in general?

```
Telemetry buffer overflowed!
Telemetry buffer overflowed!
Telemetry buffer overflowed!
Telemetry buffer overflowed!
Telemetry buffer overflowed!
Telemetry buffer overflowed!
Telemetry buffer overflowed!
Telemetry buffer overflowed!
Telemetry buffer overflowed!
Telemetry buffer overflowed!
Telemetry buffer overflowed!
2021-03-26 09:22:41  ⚙️  Syncing 35.8 bps, target=#1309541 (5 peers), best: #35440 (0xd09b…343b), finalized #35328 (0x87a9…89a7), ⬇ 0.7kiB/s ⬆ 0
2021-03-26 09:22:46  ⚙️  Syncing 39.3 bps, target=#1309543 (5 peers), best: #35637 (0xfcc4…1f68), finalized #35328 (0x87a9…89a7), ⬇ 1.0kiB/s ⬆ 24 B/s
2021-03-26 09:22:51  ⚙️  Syncing 40.3 bps, target=#1309545 (5 peers), best: #35839 (0x0676…3822), finalized #35328 (0x87a9…89a7), ⬇ 1.7MiB/s ⬆ 0.5kiB/s
2021-03-26 09:22:56  ⚙️  Syncing 36.2 bps, target=#1309546 (5 peers), best: #36020 (0x6c94…5860), finalized #35940 (0x8bb0…c35f), ⬇ 1.4MiB/s ⬆ 0.2kiB/s
```

Is there anything that can be done to avoid this, or is it only related to the Polkadot telemetry itself?

@cecton
Contributor

cecton commented Mar 26, 2021

The message has been logged at trace level since 6ac86d5, so you shouldn't see it anymore, @2075.

To summarize the current status compared to the points raised by @arkpar:

> 1. The telemetry server should not be spammed with that many events.

It still sends everything it can. Should we do something about this, like setting a maximum number of messages per minute? That might lead to other issues where too many messages get lost.

> 2. Error messages should be throttled if the buffer overflows because of a slow connection, for example.

The message is not throttled, but it is now logged at the trace level with target telemetry. I can still look into implementing a throttle if it is still that much of a problem.

> 3. The error message should be printed with the log crate.

This has been fixed.

> 4. Local commands, such as import-blocks, should have telemetry disabled by default, as well as other network-related services.

I could look into it. cc @bkchr

@2075
Contributor

2075 commented Mar 26, 2021

Thanks for the heads-up, @cecton, I will check again when we merge Substrate into our dev branch. Telemetry itself seems a bit overwhelmed recently, though.

@stale

stale bot commented Jul 7, 2021

Hey, is anyone still working on this? Due to the inactivity this issue has been automatically marked as stale. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the A5-stale Pull request did not receive any updates in a long time. No review needed at this stage. Close it. label Jul 7, 2021
@bkchr bkchr closed this as completed Jul 19, 2021