Statsd seems to be dropping metrics probably because of UDP #2044

Closed
mstoykov opened this issue May 28, 2021 · 20 comments
Labels
bug help wanted metrics-output Issues related to the metrics output (a.k.a. just output) statsd Something related to the statsd

Comments

@mstoykov
Contributor

mstoykov commented May 28, 2021

Discussed in forum thread.

My last comment confirms that:

  1. running between Docker containers is worse
  2. fixing the MTU in Docker helps
  3. lowering the push interval (effectively always pushing smaller packets) helps more and effectively fixes it in my case, although this shouldn't be needed as the library should be doing the same 🤷
  4. updating datadog-go makes it slightly worse 🤦
  5. k6 v0.29.0 (a random older version I tried) is even worse. 🤦

Possible actions:

  1. lower the push interval (see the example below) - this seems like a good, quick solution: since we are sending metrics as UDP packets and they have a small size limit, k6 not flushing them more frequently doesn't really help anything IMO
  2. investigate this further; it might not be UDP-related at all, in which case switching StatsD libraries to something with TCP support as well might not even help.
  3. investigate datadog-go + aggregation in datadog-go ... maybe it will help 🤷
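
For reference, lowering the interval is just a matter of the statsd output's environment variables. Roughly (a sketch; double-check the option names and defaults against your k6 version's docs):

# flush every 100ms instead of the default 1s
K6_STATSD_ADDR=localhost:8125 \
K6_STATSD_PUSH_INTERVAL=100ms \
k6 run -o statsd script.js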
@shekharshukla

Is there any update on this?

@mstoykov
Contributor Author

If there is an update, we will update the issue. Do you want to work on it?

@rodrigotheodoropsv

I'm having the same problem with that :(

K6 output:
[screenshot]

DataDog:
[screenshot]

@Idosegal

Any updates on this issue?
I'm seeing the same behavior.

@DanielCastronovo

Same here

@rodrigotheodoropsv

I've changed where I run the datadog-agent and changed some OS settings, and it worked for me.
Now I run the agent on the same machine where I run k6, so the metrics go from k6 to the datadog-agent over the loopback interface, and the agent forwards them to Datadog normally.

@bgallagher88

@rodrigotheodoropsv - That's great that you got it working. Can you tell me what settings were changed to get it to work?

@rodrigotheodoropsv

We use a VM with Ubuntu, but I don't know the version or more details about it, because our DevOps team takes care of it.

P.S.: There are a lot of settings here that we use for other things.

The settings are:

net.core.somaxconn | 65535
net.ipv4.tcp_max_tw_buckets | 1440000
net.ipv4.tcp_max_syn_backlog | 20000
net.ipv4.ip_local_port_range | 1024 65535
net.ipv4.tcp_fin_timeout | 2
net.ipv4.tcp_slow_start_after_idle | 0
vm.swappiness | 1
vm.overcommit_memory | 1
net.ipv4.tcp_wmem | 4096 12582912 16777216
net.ipv4.tcp_rmem | 4096 12582912 16777216
net.ipv4.tcp_retries2 | 5
net.ipv4.tcp_tw_reuse | 1
net.ipv6.conf.all.disable_ipv6 | 1
net.ipv6.conf.default.disable_ipv6 | 1
vm.max_map_count | 999999
fs.nr_open | 100000000
fs.file-max | 100000000
kernel.pid_max | 3999999
vm.dirty_expire_centisecs | 500
vm.dirty_writeback_centisecs | 100
vm.dirty_ratio | 15
vm.dirty_background_ratio | 5
net.ipv4.tcp_keepalive_intvl | 10
net.ipv4.tcp_keepalive_time | 30
net.ipv4.tcp_keepalive_probes | 4
net.core.rmem_default | 33554432
net.core.wmem_default | 33554432
net.core.rmem_max | 33554432
net.core.wmem_max | 33554432
net.core.netdev_max_backlog | 2500
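
For anyone who only wants to try a subset of these: for UDP drops, the receive-buffer settings (net.core.rmem_*) and net.core.netdev_max_backlog are probably the most relevant ones. A rough sketch of applying them (values copied from the list above; the file name is just an example):

# apply at runtime
sudo sysctl -w net.core.rmem_max=33554432
sudo sysctl -w net.core.rmem_default=33554432
sudo sysctl -w net.core.netdev_max_backlog=2500

# persist across reboots (example file name)
printf 'net.core.rmem_max = 33554432\nnet.core.rmem_default = 33554432\nnet.core.netdev_max_backlog = 2500\n' | sudo tee /etc/sysctl.d/99-statsd-udp.conf
sudo sysctl --system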

@Idosegal

Has anyone experienced the same in a containerized environment?

@kullanici0606

I tried to investigate the issue, and my initial hunch is that the problem is not related to UDP itself. It seems that the StatsD server (statsd_exporter in my setup) could not handle incoming packets fast enough.

I first used netcat to verify that packets are delivered to the statsd_exporter host (neither tcpdump nor a package manager exists on the statsd-exporter image).

I changed the statsd address to port 9127 in the k6 container definition:

K6_STATSD_ADDR: "statsdex:9127"

I also added a sleep to the command so that k6 waits until nc is started on statsd-exporter.

On statsd-exporter, I started nc:
nc -l -u -p 9127 > packets.txt

After the end of the test, the request count reported by k6 matches the "k6.http_reqs" count in packets.txt:

$ grep -c k6.http_reqs out2.txt 
13786

k6_1 | http_reqs......................: 13786 229.537003/s

On the other hand, when I use statsd_exporter as the output address, its own UDP receive metrics report fewer packets than the ones captured by tcpdump on the k6 host:
UDP packets from pcap (udp.dstport == 9125): 21535

$ curl -s http://localhost:9102/metrics | grep udp
# HELP statsd_exporter_udp_packets_total The total number of StatsD packets received over UDP.
# TYPE statsd_exporter_udp_packets_total counter
statsd_exporter_udp_packets_total 21272
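
(For anyone reproducing this, the pcap-side count above can be obtained with something like the following; the interface name is just an example:)

# on the k6 host, capture the outgoing statsd traffic
tcpdump -i eth0 -w k6_statsd.pcap udp and dst port 9125

# count the captured packets afterwards
tshark -r k6_statsd.pcap -Y "udp.dstport == 9125" | wc -l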

When I checked the statsd_exporter code (https://github.com/prometheus/statsd_exporter/blob/871e2d8df1da51d4eed27738f4bc079271a09c61/pkg/listener/listener.go#L54-L82), it seems that instead of processing the packet in a separate goroutine/thread, it first parses the packet into events and then sends them to a channel, all in the UDP-handling goroutine/thread. The parse operation might take a long time depending on the metrics delivered, and this may cause packet drops.
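
To illustrate the alternative, here is a minimal, hypothetical sketch (not statsd_exporter's actual code) where the socket is drained in one goroutine and raw packets are handed to a parser goroutine over a buffered channel, so a slow parse doesn't block reads:

package main

import (
	"log"
	"net"
)

func main() {
	conn, err := net.ListenPacket("udp", ":9125")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// A buffered channel decouples the read loop from the (potentially slow) parsing.
	packets := make(chan []byte, 4096)

	// Parsing runs in its own goroutine, so the reader never waits on it.
	go func() {
		for p := range packets {
			parseStatsdLines(p)
		}
	}()

	buf := make([]byte, 65535)
	for {
		n, _, err := conn.ReadFrom(buf)
		if err != nil {
			log.Fatal(err)
		}
		// Copy, because buf is reused by the next ReadFrom.
		p := make([]byte, n)
		copy(p, buf[:n])
		select {
		case packets <- p:
		default:
			// Queue full: drop explicitly instead of stalling the socket.
		}
	}
}

// parseStatsdLines is a stand-in for the actual statsd line parsing.
func parseStatsdLines(p []byte) { _ = p }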

It seems that k6 sends a large UDP packet:

k6.vus:15.000000|g k6.vus_max:50.000000|g k6.http_reqs:1|c|#name:https://test.k6.io/,tls_version:tls1.3,scenario:default,url:https://test.k6.io/,method:GET,status:200,proto:HTTP/1.1,expected_response:true k6.http_req_duration:146.138629|ms|#name:https://test.k6.io/,tls_version:tls1.3,proto:HTTP/1.1,scenario:default,expected_response:true,url:https://test.k6.io/,method:GET,status:200 k6.http_req_blocked:3677.270817|ms|#expected_response:true,scenario:default,name:https://test.k6.io/,url:https://test.k6.io/,method:GET,status:200,proto:HTTP/1.1,tls_version:tls1.3 k6.http_req_connecting:12.663258|ms|#status:200,expected_response:true,url:https://test.k6.io/,name:https://test.k6.io/,scenario:default,proto:HTTP/1.1,method:GET,tls_version:tls1.3 k6.http_req_tls_handshaking:3607.250707|ms|#expected_response:true,scenario:default,name:https://test.k6.io/,tls_version:tls1.3,url:https://test.k6.io/,method:GET,status:200,proto:HTTP/1.1 k6.http_req_sending:0.178607|ms|#url:https://test.k6.io/,proto:HTTP/1.1,expected_response:true,tls_version:tls1.3,method:GET,status:200,scenario:default,name:https://test.k6.io/ k6.http_req_waiting:145.692124|ms|#name:https://test.k6.io/,proto:HTTP/1.1,status:200,url:https://test.k6.io/,method:GET,tls_version:tls1.3,scenario:default,expected_response:true

This is my initial investigation; it may not be 100% correct though :)

@kullanici0606

Lowering the flush interval helps not because the packets are smaller (tcpdump shows similar packet sizes, i.e. around 1400 bytes) but because the packets are not sent in bursts when the interval is smaller.

When I compared network captures for the 1 s and 100 ms intervals, they show this (summarizing):

1 second interval: 564 packets in 40 ms -> 14.1 packets/ms
100 ms interval: 23 packets in 5 ms -> 4.6 packets/ms

So a higher flush interval means more packets for statsd_exporter to process in less time; if it cannot catch up, it drops packets.

Before going further, could you please tell me which direction to pursue?

The newer StatsD (datadog-go) client has a feature called "ClientSideAggregation", which reduces the number of packets sent to StatsD by aggregating metrics on the client side. However, k6's dependency policy explicitly states that updating the datadog-go dependency is not preferred, so I don't think going in that direction is feasible.
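
For completeness, enabling that feature would look roughly like this, assuming datadog-go v5's options API (a sketch of the feature only, not something k6 can adopt under the current policy):

package main

import (
	"log"

	"github.com/DataDog/datadog-go/v5/statsd"
)

func main() {
	// Client-side aggregation merges counts/gauges/sets before sending; the
	// "extended" variant also aggregates histograms, timings and distributions.
	client, err := statsd.New("127.0.0.1:8125",
		statsd.WithClientSideAggregation(),
		statsd.WithExtendedClientSideAggregation(),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Many increments end up as far fewer UDP packets on the wire.
	for i := 0; i < 1000; i++ {
		_ = client.Incr("k6.http_reqs", []string{"status:200"}, 1)
	}
}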

@kullanici0606

I missed the referenced pull request (govuk-one-login/performance-testing#122), which also lowers the flush interval.

I also tried changing the statsd_exporter logic to process UDP packets in a different goroutine, which helped a lot (no packet drops occurred over several attempts):

https://github.com/kullanici0606/statsd_exporter/tree/enqueu_udp_packets

I think lowering the flush interval to 100 ms, increasing the buffer size, and changing the statsd_exporter logic a little will be the easiest and best path to a solution.

@kullanici0606

kullanici0606 commented Sep 20, 2023

The version of k6 in the Dockerfile on the forum is v0.32.0, and in v0.33.0 some tags are blocklisted / blocked by default so that UDP packets are smaller:

#2048

https://github.com/grafana/k6/blob/3655065227b09ee368c29fb3dde7650761e22f68/release%20notes/v0.33.0.md#breaking-changes

So upgrading k6 to v0.33.0 or newer will definitely help. To sum up:

  • upgrading to v0.33.0 or later
  • setting K6_STATSD_PUSH_INTERVAL to 100ms or less (e.g. K6_STATSD_PUSH_INTERVAL: 100ms in docker-compose.yml, as in the snippet below)

will reduce the probability of packet drops.
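
(Roughly, in docker-compose.yml terms; the image tag, service names, and values are only examples, check them against your setup:)

  k6:
    image: grafana/k6:latest   # any version >= v0.33.0
    command: run -o statsd /scripts/script.js
    environment:
      K6_STATSD_ADDR: "statsdex:9125"
      K6_STATSD_PUSH_INTERVAL: "100ms"   # flush more often, avoiding big bursts
      K6_STATSD_BUFFER_SIZE: "20"        # metrics buffered per UDP packet; tune if needed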

@mstoykov Do you think this issue needs further investigation?

I tried implementing part of the client-side aggregation in k6 by decorating / wrapping the statsd client, but it seems that most of the metrics are trends / timing information, which cannot be aggregated easily. Counters are easy, but there is only one counter, so it helped only a little when I tried aggregating it (a rough sketch of that wrapper idea follows the code reference below).

k6/metrics/builtin.go

Lines 87 to 95 in 3655065

HTTPReqs: registry.MustNewMetric(HTTPReqsName, Counter),
HTTPReqFailed: registry.MustNewMetric(HTTPReqFailedName, Rate),
HTTPReqDuration: registry.MustNewMetric(HTTPReqDurationName, Trend, Time),
HTTPReqBlocked: registry.MustNewMetric(HTTPReqBlockedName, Trend, Time),
HTTPReqConnecting: registry.MustNewMetric(HTTPReqConnectingName, Trend, Time),
HTTPReqTLSHandshaking: registry.MustNewMetric(HTTPReqTLSHandshakingName, Trend, Time),
HTTPReqSending: registry.MustNewMetric(HTTPReqSendingName, Trend, Time),
HTTPReqWaiting: registry.MustNewMetric(HTTPReqWaitingName, Trend, Time),
HTTPReqReceiving: registry.MustNewMetric(HTTPReqReceivingName, Trend, Time),
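
To make the decorating / wrapping idea above concrete, here is a hypothetical sketch of aggregating only counters in front of a datadog-go-style client; the names and the flush strategy are illustrative, not actual k6 code:

package main

import (
	"log"
	"strings"
	"sync"
	"time"

	"github.com/DataDog/datadog-go/statsd"
)

// pendingCount accumulates all Count() calls for one (metric name, tag set) pair.
type pendingCount struct {
	name string
	tags []string
	sum  int64
}

// countingClient wraps a datadog-go client and only forwards counter sums on a
// timer, so many increments become a single statsd line per flush.
type countingClient struct {
	inner   *statsd.Client
	mu      sync.Mutex
	pending map[string]*pendingCount
}

func newCountingClient(inner *statsd.Client, flushEvery time.Duration) *countingClient {
	c := &countingClient{inner: inner, pending: map[string]*pendingCount{}}
	go func() {
		for range time.Tick(flushEvery) {
			c.flush()
		}
	}()
	return c
}

func (c *countingClient) Count(name string, value int64, tags []string, _ float64) error {
	// Assumes tags arrive in a stable order; a real implementation would sort them.
	key := name + "|" + strings.Join(tags, ",")
	c.mu.Lock()
	p, ok := c.pending[key]
	if !ok {
		p = &pendingCount{name: name, tags: tags}
		c.pending[key] = p
	}
	p.sum += value
	c.mu.Unlock()
	return nil
}

func (c *countingClient) flush() {
	c.mu.Lock()
	batch := c.pending
	c.pending = map[string]*pendingCount{}
	c.mu.Unlock()
	for _, p := range batch {
		_ = c.inner.Count(p.name, p.sum, p.tags, 1) // send the aggregated sum
	}
}

func main() {
	inner, err := statsd.New("localhost:8125")
	if err != nil {
		log.Fatal(err)
	}
	client := newCountingClient(inner, 100*time.Millisecond)
	for i := 0; i < 1000; i++ {
		_ = client.Count("k6.http_reqs", 1, []string{"status:200"}, 1)
	}
	time.Sleep(200 * time.Millisecond) // let the ticker flush once before exit
}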

@mstoykov
Contributor Author

@kullanici0606 This issue will likely be moved, along with all the other statsd issues, "soon" ™️ to https://github.com/LeonAdato/xk6-output-statsd as part of #2982.

I guess v0.33.0 helped a lot, but from what I remember when I last tested it, this will still happen with enough requests.

@kullanici0606

Yes, even with v0.33.0 the problem sometimes occurs, so I also tried to make some improvements on the statsd_exporter side:
prometheus/statsd_exporter#511

@mstoykov should I stop working on this issue then?

@mstoykov
Contributor Author

should I stop working on this issue then?

I would recommend either waiting for the final move of issues or opening an issue with the new repo.

The k6 core team will in practice have no connection with this, and as part of that I will not be telling anyone how and what to do with the extension :)

@wvikum

wvikum commented Jun 27, 2024

We have encountered a similar issue where there is a discrepancy of around 30,000 requests in Datadog metrics, which is quite frustrating. Do we have a proper solution or workaround for this issue?

[screenshot]

@codebien
Contributor

Hey @wvikum,
in the past few weeks we released a new extension for OpenTelemetry #2557 (comment).

We intend to merge it into k6 core in the upcoming releases, and it will probably be the best solution for streaming metrics to a third-party metrics database.

Please check whether it works for you and with your Datadog integration.

@olegbespalov added the metrics-output and statsd labels on Jul 12, 2024
@LeonAdato

Per @olegbespalov and @javaducky, this issue should probably be part of the StatsD project. Feel free to transfer it here:

https://github.com/LeonAdato/xk6-output-statsd

@olegbespalov
Contributor

Closing in favor of LeonAdato/xk6-output-statsd#29

@olegbespalov closed this as not planned on Aug 30, 2024
@olegbespalov removed their assignment on Aug 30, 2024