This repository was archived by the owner on Nov 15, 2023. It is now read-only.

Telemetry spam #7960

Closed
arkpar opened this issue Jan 23, 2021 · 15 comments
Labels
A5-stale Pull request did not receive any updates in a long time. No review needed at this stage. Close it. I3-bug The node fails to follow expected behavior.

Comments

@arkpar
Member

arkpar commented Jan 23, 2021

When syncing the blockchain at a high enough speed (~600 bps), the console is spammed with Telemetry buffer overflowed! messages. I see ~50 of these per second. It also appears when running the import-blocks command. There are a few things wrong here:

  1. The telemetry server should not be spammed with that many events.
  2. Error messages should be throttled if the buffer overflows because of a slow connection, for example.
  3. The error message should be printed with the log crate.
  4. Local commands, such as import-blocks, should have telemetry disabled by default, as well as other network-related services.
@arkpar arkpar added the I3-bug The node fails to follow expected behavior. label Jan 23, 2021
@bkchr
Member

bkchr commented Jan 23, 2021

@cecton did you change the default buffer size with your PR, or was it 16 before as well?

Edit: Yeah, it was 16 before. Maybe the new worker is too slow?

@dvdplm
Contributor

dvdplm commented Jan 25, 2021

@arkpar Do you know if this problem is recent, i.e. if it worked better before #7463?

Item 4 seems like the lowest-hanging fruit; shall we make that the first priority on the list? Item 2 seems pretty hard to do properly.

cc @maciejhirsz

@arkpar
Member Author

arkpar commented Jan 25, 2021

@dvdplm Not sure. It wasn't there a couple of weeks ago, but I can't be sure at this point.
IMO, when doing a full sync, telemetry events should not be generated for each block, but maybe once per batch of 100 or 1000 blocks.
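
A minimal sketch of that suggestion, under the assumption of a hypothetical `should_send_block_telemetry` helper and `TELEMETRY_SYNC_BATCH` constant (neither exists in Substrate); the idea is simply to skip per-block telemetry events while a major sync is in progress:

```rust
/// Illustrative batch size, not a real Substrate constant.
const TELEMETRY_SYNC_BATCH: u64 = 1_000;

/// During a major sync, only one block in every `TELEMETRY_SYNC_BATCH`
/// produces a telemetry event; otherwise every imported block is reported.
fn should_send_block_telemetry(block_number: u64, is_major_sync: bool) -> bool {
    !is_major_sync || block_number % TELEMETRY_SYNC_BATCH == 0
}
```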

@maciejhirsz
Contributor

I don't think it's recent (the spam, that is, not the error reporting). Telemetry has built-in throttling for block updates during sync for exactly that purpose, so it doesn't affect the server that much, although we could definitely do without the extra network traffic and JSON parsing.

@cecton
Contributor

cecton commented Jan 27, 2021

Regarding point 3: it is not possible to log at that place, which is why it is an eprintln. That being said, I have two ideas:

  1. Maybe we could print only when debug_assertions is on.
  2. Maybe I can make a channel to report the overflow; that would give finer control: I could throttle the messages and log normally. But that means a new background task just for that.

@bkchr
Member

bkchr commented Jan 27, 2021

> 2. Maybe I can make a channel to report the overflow; that would give finer control: I could throttle the messages and log normally. But that means a new background task just for that.

Can we not handle this in the tracing layer we have for the telemetry? Check when we have printed this message the last time?
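
A minimal sketch of that suggestion, assuming the telemetry layer can hold a bit of extra state; `OverflowLog`, `last_printed` and `min_interval` are made-up names for illustration, not part of the actual implementation:

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// State the telemetry layer could keep to rate-limit the overflow message.
struct OverflowLog {
    last_printed: Mutex<Option<Instant>>,
    min_interval: Duration,
}

impl OverflowLog {
    /// Print the message only if `min_interval` has elapsed since the last
    /// print; still uses `eprintln!` because `log` cannot be used here.
    fn report(&self) {
        let mut last = self.last_printed.lock().expect("mutex not poisoned");
        let now = Instant::now();
        if last.map_or(true, |t| now.duration_since(t) >= self.min_interval) {
            eprintln!("Telemetry buffer overflowed!");
            *last = Some(now);
        }
    }
}
```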

@cecton
Contributor

cecton commented Jan 27, 2021

> Can we not handle this in the tracing layer we have for the telemetry? Check when we have printed this message the last time?

Yes, we have state there, so it's possible to throttle. But it will still use eprintln then; to make it print using log, we need to log outside the layer. OK, that's option 3.

@bkchr
Member

bkchr commented Jan 27, 2021

Yeah, I understand that we must use eprintln! there.

@bkchr
Member

bkchr commented Jan 27, 2021

Or could we push the error message to the layers below us? :D So that they will just print this? :P

@cecton
Contributor

cecton commented Jan 29, 2021

> Or could we push the error message to the layers below us? :D So that they will just print this? :P

I won't try to use an internal API, if that's what you had in mind 😁

But I think the solution with the async task in the background is the most maintainable. I'm not sure what the impact on performance would be.

To clarify my idea (a sketch follows after the list):

  • I would make a future that just loops over a channel's messages and reports "buffer overflow" using log::warn!
  • In the layer, I would just send () to the channel
  • Spawn the future in the background and throttle the channel stream
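
A rough sketch of that idea, not the actual implementation; `overflow_reporter`, `spawn_reporter` and `OVERFLOW_LOG_INTERVAL` are hypothetical names, and the throttling is done with a simple timestamp check instead of a stream combinator:

```rust
use std::time::{Duration, Instant};

use futures::channel::mpsc;
use futures::StreamExt;

/// How often, at most, the warning is logged. Purely illustrative.
const OVERFLOW_LOG_INTERVAL: Duration = Duration::from_secs(10);

/// Background future: drains overflow notifications sent by the telemetry
/// layer and emits a throttled `log::warn!` instead of an `eprintln!`.
async fn overflow_reporter(mut rx: mpsc::Receiver<()>) {
    let mut last_logged: Option<Instant> = None;
    let mut suppressed: u64 = 0;

    while rx.next().await.is_some() {
        suppressed += 1;
        let now = Instant::now();
        let due = last_logged
            .map_or(true, |t| now.duration_since(t) >= OVERFLOW_LOG_INTERVAL);
        if due {
            log::warn!(
                target: "telemetry",
                "Telemetry buffer overflowed! ({} occurrence(s) since last report)",
                suppressed,
            );
            suppressed = 0;
            last_logged = Some(now);
        }
    }
}

/// Create the channel and spawn the reporter; `tokio::spawn` stands in for
/// whatever executor the node actually uses.
fn spawn_reporter() -> mpsc::Sender<()> {
    let (tx, rx) = mpsc::channel(16);
    tokio::spawn(overflow_reporter(rx));
    tx
}

// In the tracing layer, reporting an overflow then becomes a non-blocking
// `let _ = overflow_tx.clone().try_send(());` instead of an `eprintln!`.
```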

@jun0tpyrc

jun0tpyrc commented Feb 25, 2021

Another case: I'm seeing my testnet validator keep synchronising but not finalising.

After a restart it does finalize, but at the same time it shows this Telemetry buffer overflowed! message too; not sure if that is related.

It might be that it stopped at

```
2021-02-25 04:04:22  👴 Applying authority set change scheduled at block #4445222
2021-02-25 04:04:25  👴 Applying GRANDPA set change to new set ....
```

in such a case.

@2075
Contributor

2075 commented Mar 26, 2021

Hmm, I still get it as well. I also saw that the telemetry page seems to be overloaded in general?

```
Telemetry buffer overflowed!
Telemetry buffer overflowed!
Telemetry buffer overflowed!
Telemetry buffer overflowed!
Telemetry buffer overflowed!
Telemetry buffer overflowed!
Telemetry buffer overflowed!
Telemetry buffer overflowed!
Telemetry buffer overflowed!
Telemetry buffer overflowed!
Telemetry buffer overflowed!
2021-03-26 09:22:41  ⚙️  Syncing 35.8 bps, target=#1309541 (5 peers), best: #35440 (0xd09b…343b), finalized #35328 (0x87a9…89a7), ⬇ 0.7kiB/s ⬆ 0
2021-03-26 09:22:46  ⚙️  Syncing 39.3 bps, target=#1309543 (5 peers), best: #35637 (0xfcc4…1f68), finalized #35328 (0x87a9…89a7), ⬇ 1.0kiB/s ⬆ 24 B/s
2021-03-26 09:22:51  ⚙️  Syncing 40.3 bps, target=#1309545 (5 peers), best: #35839 (0x0676…3822), finalized #35328 (0x87a9…89a7), ⬇ 1.7MiB/s ⬆ 0.5kiB/s
2021-03-26 09:22:56  ⚙️  Syncing 36.2 bps, target=#1309546 (5 peers), best: #36020 (0x6c94…5860), finalized #35940 (0x8bb0…c35f), ⬇ 1.4MiB/s ⬆ 0.2kiB/s
```

Is there anything that can be done to avoid this, or is it only related to the Polkadot telemetry itself?

@cecton
Contributor

cecton commented Mar 26, 2021

The message has been logged at trace level since 6ac86d5, so you shouldn't see it anymore, @2075.

To summarize the current status compared to the points raised by @arkpar:

> 1. The telemetry server should not be spammed with that many events.

It still sends everything it can. Should we do something about this, like setting a maximum number of messages per minute? That might lead to other issues where too many messages get lost.

> 2. Error messages should be throttled if the buffer overflows because of a slow connection, for example.

The message is not throttled, but it is now logged at the trace level with target telemetry. I can still look into implementing a throttle if it is still that much of a problem.

> 3. The error message should be printed with the log crate.

This has been fixed.

> 4. Local commands, such as import-blocks, should have telemetry disabled by default, as well as other network-related services.

I could look into it. cc @bkchr

@2075
Contributor

2075 commented Mar 26, 2021

Thanks for the heads-up, @cecton, I will check again when we merge Substrate into our dev branch. Telemetry itself seems a bit overwhelmed recently, though.

@stale

stale bot commented Jul 7, 2021

Hey, is anyone still working on this? Due to the inactivity this issue has been automatically marked as stale. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the A5-stale Pull request did not receive any updates in a long time. No review needed at this stage. Close it. label Jul 7, 2021
@bkchr bkchr closed this as completed Jul 19, 2021