Statsd seems to be dropping metrics probably because of UDP #2044

Closed
mstoykov opened this issue May 28, 2021 · 20 comments
Labels
bug help wanted metrics-output Issues related to the metrics output (a.k.a. just output) statsd Something related to the statsd

Comments

@mstoykov
Contributor

mstoykov commented May 28, 2021

Discussed in forum thread.

My last comment confirms that:

  1. running between Docker containers is worse
  2. fixing the MTU in Docker helps
  3. lowering the push interval (effectively always pushing smaller packets) helps more and effectively fixes it in my case, although this shouldn't be needed as the library should be doing the same 🤷
  4. updating datadog-go makes it slightly worse 🤦
  5. k6 v0.29.0 (a random older version I tried) is even worse. 🤦

Possible actions:

  1. lower the push interval (see the example below) - this seems like a good, quick solution: since we are sending metrics as UDP packets and they have a small size limit, k6 not flushing them more frequently doesn't really help anything IMO
  2. investigate this further; it might not be UDP-related at all, in which case switching StatsD libraries to something with TCP support as well might not even help.
  3. investigate datadog-go + aggregation in datadog-go ... maybe it will help 🤷
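
For reference, lowering the interval is just a matter of the statsd output's environment variables. Roughly (a sketch; double-check the option names and defaults against your k6 version's docs):

# flush every 100ms instead of the default 1s
K6_STATSD_ADDR=localhost:8125 \
K6_STATSD_PUSH_INTERVAL=100ms \
k6 run -o statsd script.js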
@shekharshukla

Is there any update on this?

@mstoykov
Contributor Author

If there is an update, we will update the issue. Do you want to work on it?

@rodrigotheodoropsv

I'm having the same problem with that :(

K6 output:
[screenshot]

DataDog:
[screenshot]

@Idosegal

Any updates on this issue?
I'm seeing the same behavior.

@DanielCastronovo

Same here

@rodrigotheodoropsv

I've changed where I run the datadog-agent and changed some OS settings, and it worked for me.
Now I run the agent on the same machine where I run k6, so the metrics go from k6 to the datadog-agent over the loopback interface, and the agent forwards them to Datadog normally.

@bgallagher88

@rodrigotheodoropsv - That's great that you got it working. Can you tell me what settings were changed to get it to work?

@rodrigotheodoropsv

We use a VM with Ubuntu, but I don't know the version or more details about it, because our DevOps team takes care of it.

P.S.: There are a lot of settings here that we use for other things.

The settings are:

net.core.somaxconn | 65535
net.ipv4.tcp_max_tw_buckets | 1440000
net.ipv4.tcp_max_syn_backlog | 20000
net.ipv4.ip_local_port_range | 1024 65535
net.ipv4.tcp_fin_timeout | 2
net.ipv4.tcp_slow_start_after_idle | 0
vm.swappiness | 1
vm.overcommit_memory | 1
net.ipv4.tcp_wmem | 4096 12582912 16777216
net.ipv4.tcp_rmem | 4096 12582912 16777216
net.ipv4.tcp_retries2 | 5
net.ipv4.tcp_tw_reuse | 1
net.ipv6.conf.all.disable_ipv6 | 1
net.ipv6.conf.default.disable_ipv6 | 1
vm.max_map_count | 999999
fs.nr_open | 100000000
fs.file-max | 100000000
kernel.pid_max | 3999999
vm.dirty_expire_centisecs | 500
vm.dirty_writeback_centisecs | 100
vm.dirty_ratio | 15
vm.dirty_background_ratio | 5
net.ipv4.tcp_keepalive_intvl | 10
net.ipv4.tcp_keepalive_time | 30
net.ipv4.tcp_keepalive_probes | 4
net.core.rmem_default | 33554432
net.core.wmem_default | 33554432
net.core.rmem_max | 33554432
net.core.wmem_max | 33554432
net.core.netdev_max_backlog | 2500
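
For anyone who only wants to try a subset of these: for UDP drops, the receive-buffer settings (net.core.rmem_*) and net.core.netdev_max_backlog are probably the most relevant ones. A rough sketch of applying them (values copied from the list above; the file name is just an example):

# apply at runtime
sudo sysctl -w net.core.rmem_max=33554432
sudo sysctl -w net.core.rmem_default=33554432
sudo sysctl -w net.core.netdev_max_backlog=2500

# persist across reboots (example file name)
printf 'net.core.rmem_max = 33554432\nnet.core.rmem_default = 33554432\nnet.core.netdev_max_backlog = 2500\n' | sudo tee /etc/sysctl.d/99-statsd-udp.conf
sudo sysctl --system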

@Idosegal

Has anyone experienced the same in a containerized environment?

@kullanici0606

I tried to investigate the issue, and my initial hunch is that the problem is not related to UDP itself. It seems that the StatsD server (statsd_exporter in my setup) could not handle incoming packets fast enough.

I first used netcat to verify that packets are delivered to the statsd_exporter host (neither tcpdump nor a package manager exists on the statsd-exporter image).

I changed the statsd address to port 9127 in the k6 container definition:

K6_STATSD_ADDR: "statsdex:9127"

I also added a sleep to the command so that k6 waits until nc is started on statsd-exporter.

On statsd-exporter, I started nc:
nc -l -u -p 9127 > packets.txt

After the end of the test, the request count reported by k6 matches the "k6.http_reqs" count in packets.txt:

$ grep -c k6.http_reqs out2.txt 
13786

k6_1 | http_reqs......................: 13786 229.537003/s

On the other hand, when I use statsd_exporter as the output address, its own UDP receive metrics report fewer packets than the ones captured by tcpdump on the k6 host:
UDP packets from pcap (udp.dstport == 9125): 21535

$ curl -s http://localhost:9102/metrics | grep udp
# HELP statsd_exporter_udp_packets_total The total number of StatsD packets received over UDP.
# TYPE statsd_exporter_udp_packets_total counter
statsd_exporter_udp_packets_total 21272
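
(For anyone reproducing this, the pcap-side count above can be obtained with something like the following; the interface name is just an example:)

# on the k6 host, capture the outgoing statsd traffic
tcpdump -i eth0 -w k6_statsd.pcap udp and dst port 9125

# count the captured packets afterwards
tshark -r k6_statsd.pcap -Y "udp.dstport == 9125" | wc -l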

When I checked the statsd_exporter code (https://github.com/prometheus/statsd_exporter/blob/871e2d8df1da51d4eed27738f4bc079271a09c61/pkg/listener/listener.go#L54-L82), it seems that instead of processing the packet in a separate goroutine/thread, it first parses the packet into events and then sends them to a channel, all in the UDP-handling goroutine/thread. The parse operation might take a long time depending on the metrics delivered, and this may cause packet drops.
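
To illustrate the alternative, here is a minimal, hypothetical sketch (not statsd_exporter's actual code) where the socket is drained in one goroutine and raw packets are handed to a parser goroutine over a buffered channel, so a slow parse doesn't block reads:

package main

import (
	"log"
	"net"
)

func main() {
	conn, err := net.ListenPacket("udp", ":9125")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// A buffered channel decouples the read loop from the (potentially slow) parsing.
	packets := make(chan []byte, 4096)

	// Parsing runs in its own goroutine, so the reader never waits on it.
	go func() {
		for p := range packets {
			parseStatsdLines(p)
		}
	}()

	buf := make([]byte, 65535)
	for {
		n, _, err := conn.ReadFrom(buf)
		if err != nil {
			log.Fatal(err)
		}
		// Copy, because buf is reused by the next ReadFrom.
		p := make([]byte, n)
		copy(p, buf[:n])
		select {
		case packets <- p:
		default:
			// Queue full: drop explicitly instead of stalling the socket.
		}
	}
}

// parseStatsdLines is a stand-in for the actual statsd line parsing.
func parseStatsdLines(p []byte) { _ = p }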

It seems that k6 sends a large UDP packet:

k6.vus:15.000000|g k6.vus_max:50.000000|g k6.http_reqs:1|c|#name:https://test.k6.io/,tls_version:tls1.3,scenario:default,url:https://test.k6.io/,method:GET,status:200,proto:HTTP/1.1,expected_response:true k6.http_req_duration:146.138629|ms|#name:https://test.k6.io/,tls_version:tls1.3,proto:HTTP/1.1,scenario:default,expected_response:true,url:https://test.k6.io/,method:GET,status:200 k6.http_req_blocked:3677.270817|ms|#expected_response:true,scenario:default,name:https://test.k6.io/,url:https://test.k6.io/,method:GET,status:200,proto:HTTP/1.1,tls_version:tls1.3 k6.http_req_connecting:12.663258|ms|#status:200,expected_response:true,url:https://test.k6.io/,name:https://test.k6.io/,scenario:default,proto:HTTP/1.1,method:GET,tls_version:tls1.3 k6.http_req_tls_handshaking:3607.250707|ms|#expected_response:true,scenario:default,name:https://test.k6.io/,tls_version:tls1.3,url:https://test.k6.io/,method:GET,status:200,proto:HTTP/1.1 k6.http_req_sending:0.178607|ms|#url:https://test.k6.io/,proto:HTTP/1.1,expected_response:true,tls_version:tls1.3,method:GET,status:200,scenario:default,name:https://test.k6.io/ k6.http_req_waiting:145.692124|ms|#name:https://test.k6.io/,proto:HTTP/1.1,status:200,url:https://test.k6.io/,method:GET,tls_version:tls1.3,scenario:default,expected_response:true

This is my initial investigation; it may not be 100% correct though :)

@kullanici0606

Lowering the flush interval helps not because the packets are smaller (tcpdump shows similar packet sizes, i.e. around 1400 bytes) but because the packets are not sent in bursts when the interval is smaller.

When I compared network captures for the 1 s and 100 ms intervals, they show this (summarizing):

1 second interval: 564 packets in 40 ms -> 14.1 packets/ms
100 ms interval: 23 packets in 5 ms -> 4.6 packets/ms

So a higher flush interval means more packets for statsd_exporter to process in less time; if it cannot catch up, it drops packets.

Before going further, could you please tell me which direction to pursue?

The newer StatsD (datadog-go) client has a feature called "ClientSideAggregation", which reduces the number of packets sent to StatsD by aggregating metrics on the client side. However, k6's dependency policy explicitly states that updating the datadog-go dependency is not preferred, so I don't think going in that direction is feasible.
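
For completeness, enabling that feature would look roughly like this, assuming datadog-go v5's options API (a sketch of the feature only, not something k6 can adopt under the current policy):

package main

import (
	"log"

	"github.com/DataDog/datadog-go/v5/statsd"
)

func main() {
	// Client-side aggregation merges counts/gauges/sets before sending; the
	// "extended" variant also aggregates histograms, timings and distributions.
	client, err := statsd.New("127.0.0.1:8125",
		statsd.WithClientSideAggregation(),
		statsd.WithExtendedClientSideAggregation(),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Many increments end up as far fewer UDP packets on the wire.
	for i := 0; i < 1000; i++ {
		_ = client.Incr("k6.http_reqs", []string{"status:200"}, 1)
	}
}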

@kullanici0606

I missed the referenced pull request (govuk-one-login/performance-testing#122), which also lowers the flush interval.

I also tried changing the statsd_exporter logic to process UDP packets in a different goroutine, which helped a lot (no packet drops occurred over several attempts):

https://github.com/kullanici0606/statsd_exporter/tree/enqueu_udp_packets

I think lowering the flush interval to 100 ms, increasing the buffer size, and changing the statsd_exporter logic a little will be the easiest and best path to a solution.

@kullanici0606

kullanici0606 commented Sep 20, 2023

The version of k6 in the Dockerfile on the forum is v0.32.0, and in v0.33.0 some tags are blocklisted / blocked by default so that UDP packets are smaller:

#2048

https://github.com/grafana/k6/blob/3655065227b09ee368c29fb3dde7650761e22f68/release%20notes/v0.33.0.md#breaking-changes

So upgrading k6 to v0.33.0 or newer will definitely help. To sum up:

  • upgrading to v0.33.0 or later
  • setting K6_STATSD_PUSH_INTERVAL to 100ms or less (e.g. K6_STATSD_PUSH_INTERVAL: 100ms in docker-compose.yml, as in the snippet below)

will reduce the probability of packet drops.
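
(Roughly, in docker-compose.yml terms; the image tag, service names, and values are only examples, check them against your setup:)

  k6:
    image: grafana/k6:latest   # any version >= v0.33.0
    command: run -o statsd /scripts/script.js
    environment:
      K6_STATSD_ADDR: "statsdex:9125"
      K6_STATSD_PUSH_INTERVAL: "100ms"   # flush more often, avoiding big bursts
      K6_STATSD_BUFFER_SIZE: "20"        # metrics buffered per UDP packet; tune if needed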

@mstoykov Do you think this issue needs further investigation?

I tried implementing part of the client-side aggregation in k6 by decorating / wrapping the statsd client, but it seems that most of the metrics are trends / timing information, which cannot be aggregated easily. Counters are easy, but there is only one counter, so it helped only a little when I tried aggregating it (a rough sketch of that wrapper idea follows the code reference below).

k6/metrics/builtin.go

Lines 87 to 95 in 3655065

HTTPReqs: registry.MustNewMetric(HTTPReqsName, Counter),
HTTPReqFailed: registry.MustNewMetric(HTTPReqFailedName, Rate),
HTTPReqDuration: registry.MustNewMetric(HTTPReqDurationName, Trend, Time),
HTTPReqBlocked: registry.MustNewMetric(HTTPReqBlockedName, Trend, Time),
HTTPReqConnecting: registry.MustNewMetric(HTTPReqConnectingName, Trend, Time),
HTTPReqTLSHandshaking: registry.MustNewMetric(HTTPReqTLSHandshakingName, Trend, Time),
HTTPReqSending: registry.MustNewMetric(HTTPReqSendingName, Trend, Time),
HTTPReqWaiting: registry.MustNewMetric(HTTPReqWaitingName, Trend, Time),
HTTPReqReceiving: registry.MustNewMetric(HTTPReqReceivingName, Trend, Time),
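
To make the decorating / wrapping idea above concrete, here is a hypothetical sketch of aggregating only counters in front of a datadog-go-style client; the names and the flush strategy are illustrative, not actual k6 code:

package main

import (
	"log"
	"strings"
	"sync"
	"time"

	"github.com/DataDog/datadog-go/statsd"
)

// pendingCount accumulates all Count() calls for one (metric name, tag set) pair.
type pendingCount struct {
	name string
	tags []string
	sum  int64
}

// countingClient wraps a datadog-go client and only forwards counter sums on a
// timer, so many increments become a single statsd line per flush.
type countingClient struct {
	inner   *statsd.Client
	mu      sync.Mutex
	pending map[string]*pendingCount
}

func newCountingClient(inner *statsd.Client, flushEvery time.Duration) *countingClient {
	c := &countingClient{inner: inner, pending: map[string]*pendingCount{}}
	go func() {
		for range time.Tick(flushEvery) {
			c.flush()
		}
	}()
	return c
}

func (c *countingClient) Count(name string, value int64, tags []string, _ float64) error {
	// Assumes tags arrive in a stable order; a real implementation would sort them.
	key := name + "|" + strings.Join(tags, ",")
	c.mu.Lock()
	p, ok := c.pending[key]
	if !ok {
		p = &pendingCount{name: name, tags: tags}
		c.pending[key] = p
	}
	p.sum += value
	c.mu.Unlock()
	return nil
}

func (c *countingClient) flush() {
	c.mu.Lock()
	batch := c.pending
	c.pending = map[string]*pendingCount{}
	c.mu.Unlock()
	for _, p := range batch {
		_ = c.inner.Count(p.name, p.sum, p.tags, 1) // send the aggregated sum
	}
}

func main() {
	inner, err := statsd.New("localhost:8125")
	if err != nil {
		log.Fatal(err)
	}
	client := newCountingClient(inner, 100*time.Millisecond)
	for i := 0; i < 1000; i++ {
		_ = client.Count("k6.http_reqs", 1, []string{"status:200"}, 1)
	}
	time.Sleep(200 * time.Millisecond) // let the ticker flush once before exit
}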

@mstoykov
Contributor Author

@kullanici0606 This issue will likely be moved, along with all the other statsd issues, "soon" ™️ to https://github.com/LeonAdato/xk6-output-statsd as part of #2982.

I guess v0.33.0 helped a lot, but from what I remember when I last tested it, this will still happen with enough requests.

@kullanici0606

Yes, even with v0.33.0 the problem sometimes occurs, so I also tried to make some improvements on the statsd_exporter side:
prometheus/statsd_exporter#511

@mstoykov should I stop working on this issue then?

@mstoykov
Contributor Author

should I stop working on this issue then?

I would recommend either waiting for the final move of issues or opening an issue with the new repo.

The k6 core team will in practice have no connection with this, and as part of that I will not be telling anyone how and what to do with the extension :)

@wvikum

wvikum commented Jun 27, 2024

We have encountered a similar issue where there is a discrepancy of around 30,000 requests in Datadog metrics, which is quite frustrating. Do we have a proper solution or workaround for this issue?

[screenshot]

@codebien
Contributor

Hey @wvikum,
in the past few weeks we released a new extension for OpenTelemetry #2557 (comment).

We intend to merge it into k6 core in the upcoming releases, and it will probably be the best solution for streaming metrics to a third-party metrics database.

Please check whether it works for you and with your Datadog integration.

@olegbespalov added the metrics-output and statsd labels on Jul 12, 2024
@LeonAdato

Per @olegbespalov and @javaducky, this issue should probably be part of the StatsD project. Feel free to transfer it here:

https://github.com/LeonAdato/xk6-output-statsd

@olegbespalov
Contributor

Closing in favor of LeonAdato/xk6-output-statsd#29

@olegbespalov closed this as not planned on Aug 30, 2024
@olegbespalov removed their assignment on Aug 30, 2024