noop http benchmark - what am I doing wrong? #1311
-
Background
I've created a simple network server. It is just a noop HTTP benchmark, to find out how much I can get out of liburing.
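For reference, a minimal single-ring liburing noop HTTP server along these lines might look roughly like the sketch below. This is not the actual server code; names like `struct conn` and `queue_recv`, the queue depth, and the buffer sizes are illustrative, and error handling, SQ-full handling, and multishot-accept re-arming are omitted.

```c
/* Sketch only: a single-ring liburing noop HTTP server --
 * accept, recv, reply with a canned 200, repeat. */
#include <liburing.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <stdlib.h>
#include <unistd.h>

enum { OP_ACCEPT, OP_RECV, OP_SEND };

struct conn { int fd; int op; char buf[2048]; };   /* illustrative per-connection state */

static const char resp[] =
    "HTTP/1.1 200 OK\r\nContent-Length: 0\r\nConnection: keep-alive\r\n\r\n";

static void queue_recv(struct io_uring *ring, struct conn *c)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_recv(sqe, c->fd, c->buf, sizeof(c->buf), 0);
    c->op = OP_RECV;
    io_uring_sqe_set_data(sqe, c);
}

int main(void)
{
    struct io_uring ring;
    io_uring_queue_init(4096, &ring, 0);

    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(7777),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, SOMAXCONN);

    /* One multishot accept keeps producing a CQE per new connection. */
    struct conn lconn = { .fd = lfd, .op = OP_ACCEPT };
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_multishot_accept(sqe, lfd, NULL, NULL, 0);
    io_uring_sqe_set_data(sqe, &lconn);
    io_uring_submit(&ring);

    for (;;) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        struct conn *c = io_uring_cqe_get_data(cqe);

        if (c->op == OP_ACCEPT && cqe->res >= 0) {
            struct conn *nc = calloc(1, sizeof(*nc));
            nc->fd = cqe->res;
            queue_recv(&ring, nc);
        } else if (c->op == OP_RECV && cqe->res > 0) {
            /* "noop": ignore the request bytes, answer with the canned 200 */
            struct io_uring_sqe *s = io_uring_get_sqe(&ring);
            io_uring_prep_send(s, c->fd, resp, sizeof(resp) - 1, 0);
            c->op = OP_SEND;
            io_uring_sqe_set_data(s, c);
        } else if (c->op == OP_SEND && cqe->res > 0) {
            queue_recv(&ring, c);      /* keep-alive: wait for the next request */
        } else if (c->op != OP_ACCEPT) {
            close(c->fd);              /* error or EOF: drop the connection */
            free(c);
        }
        io_uring_cqe_seen(&ring, cqe);
        io_uring_submit(&ring);
    }
}
```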
Clients
I've mostly used wrk2 as the load-generation tool, but have also occasionally used other clients.

Machine details
uname -a
Linux ryzen5700x 6.8.0-51-generic #52-Ubuntu SMP PREEMPT_DYNAMIC Thu Dec 5 13:09:44 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Model name is

cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04.1 LTS"

Benchmark and Results
I am using wrk2 to simulate 20480 simultaneous connections and pump a load of 102400 RPS at this server over localhost.

./wrk -c 20480 -t 6 -d 300s http://localhost:7777 -R 102400 -L
Running 5m test @ http://localhost:7777
6 threads and 20480 connections
...(removed for verbosity)
Thread Stats Avg Stdev Max +/- Stdev
Latency 651.79ms 4.15s 49.97s 97.94%
Req/Sec 17.09k 391.18 18.54k 69.22%
Latency Distribution (HdrHistogram - Recorded Latency)
50.000% 106.24ms
75.000% 109.18ms
90.000% 112.32ms
99.000% 26.64s
99.900% 46.73s
99.990% 49.09s
99.999% 49.74s
100.000% 50.00s
(...removed detailed percentile spectrum)
#[Mean = 651.787, StdDeviation = 4150.317]
#[Max = 49971.200, Total count = 27913044]
#[Buckets = 27, SubBuckets = 2048]
----------------------------------------------------------
29784131 requests in 5.00m, 2.58GB read
Socket errors: connect 0, read 0, write 0, timeout 6045
Requests/sec: 99276.24
Transfer/sec: 8.80MB

Observations
I am mostly interested in single-core performance. I've also tried

Questions
More Questions (unrelated - feel free to ignore)
Replies: 2 comments 16 replies
-
That's quite a good report with all details!
25s and 50s for the 99% and 99.9% percentiles sound insanely high. I'm highly suspicious of that.
It's pretty easy to saturate a CPU, so I wouldn't worry too much about it unless performance is also low.
The generic (non-io_uring) tx path here is pretty expensive, around 60% CPU. The good news is that it's because of loopback, so it includes cycles spent processing the rx path of the other end (~30%). You also have maybe ~10% of cycles consumed by apparmor; usually people don't care much about it, and it's there just because of the stock kernel and distro defaults.

And then you'll be left with the fact that the net stack is heavy. If each request drives only a handful of bytes, you won't get anywhere close to NIC speeds. It'd need multiplexing instead of one send+recv pair per connection, on top of which you can additionally put multishot requests and so on. I also see some accounting overhead; if you regenerate the profile for a new kernel (~6.13), I can take another look.
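For illustration, the multishot part might look roughly like the sketch below: a single armed recv fed from a provided-buffer ring, so the app stops submitting a fresh recv for every request on a hot connection. This is not code from the thread; it assumes liburing >= 2.4 and a recent kernel, and `arm_multishot_recv`, `BGID`, `BUF_COUNT`, and `BUF_SIZE` are made-up names and sizes.

```c
#include <liburing.h>
#include <stdlib.h>

#define BGID      1
#define BUF_COUNT 256     /* must be a power of two */
#define BUF_SIZE  2048

/* Arm a multishot recv on conn_fd, fed from a provided-buffer ring. */
static int arm_multishot_recv(struct io_uring *ring, int conn_fd)
{
    int err = 0;
    char *bufs = malloc((size_t)BUF_COUNT * BUF_SIZE);
    struct io_uring_buf_ring *br =
        io_uring_setup_buf_ring(ring, BUF_COUNT, BGID, 0, &err);
    if (!br)
        return err;

    /* Hand all buffers to the kernel; completions reference them by buffer id. */
    for (int i = 0; i < BUF_COUNT; i++)
        io_uring_buf_ring_add(br, bufs + (size_t)i * BUF_SIZE, BUF_SIZE, i,
                              io_uring_buf_ring_mask(BUF_COUNT), i);
    io_uring_buf_ring_advance(br, BUF_COUNT);

    /* One SQE keeps generating a CQE per received chunk until it's cancelled
     * or runs out of provided buffers. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_recv_multishot(sqe, conn_fd, NULL, 0, 0);
    sqe->buf_group = BGID;
    sqe->flags |= IOSQE_BUFFER_SELECT;
    return io_uring_submit(ring);
}
```

The design point is simply that one armed request keeps producing completions, so the per-request submit+recv round-trip disappears for hot connections.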
-
Both are arbitrary limits the kernel enforces. Do you really need that much? If you get 64K completions sitting in the ring, then your app is doing something very wrong; the latency of processing them will be insanely high. And a ring (single CPU) won't be able to handle 1M hot connections, so most of them are probably cold, and you don't really need fixed files at all unless you get some wins from
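If fixed files do turn out to be worth it, one pattern is to register a sparse table and only install hot connections into it, roughly like the sketch below. This is not from the thread; `HOT_SLOTS` and `setup_fixed_files` are made-up names, and it assumes a liburing/kernel new enough for sparse registration and direct multishot accept.

```c
#include <liburing.h>

#define HOT_SLOTS 4096   /* illustrative table size, not a recommendation */

static int setup_fixed_files(struct io_uring *ring, int listen_fd)
{
    /* Reserve a sparse fixed-file table; slots are filled as connections arrive. */
    int ret = io_uring_register_files_sparse(ring, HOT_SLOTS);
    if (ret < 0)
        return ret;

    /* Multishot accept that installs each new socket directly into a free slot
     * of the fixed table; completions then reference the allocated slot rather
     * than a regular fd. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_multishot_accept_direct(sqe, listen_fd, NULL, NULL, 0);
    return io_uring_submit(ring);
}

/* Later requests on such a connection address it by slot index, e.g.:
 *   io_uring_prep_send(sqe, slot, buf, len, 0);
 *   sqe->flags |= IOSQE_FIXED_FILE;
 */
```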
Right, it'll be there regardless of io_uring. With a real network card, you could say that overhead will be moved to the rx path of the machine where the other end is.