noop http benchmark - what am I doing wrong? #1311
-
Background
I've created a simple network server. It is just a noop HTTP benchmark, to find out how much I can get out of liburing.
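For reference, a minimal single-ring liburing noop HTTP server along these lines might look roughly like the sketch below. This is not the actual server code; names like `struct conn` and `queue_recv`, the queue depth, and the buffer sizes are illustrative, and error handling, SQ-full handling, and multishot-accept re-arming are omitted.

```c
/* Sketch only: a single-ring liburing noop HTTP server --
 * accept, recv, reply with a canned 200, repeat. */
#include <liburing.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <stdlib.h>
#include <unistd.h>

enum { OP_ACCEPT, OP_RECV, OP_SEND };

struct conn { int fd; int op; char buf[2048]; };   /* illustrative per-connection state */

static const char resp[] =
    "HTTP/1.1 200 OK\r\nContent-Length: 0\r\nConnection: keep-alive\r\n\r\n";

static void queue_recv(struct io_uring *ring, struct conn *c)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_recv(sqe, c->fd, c->buf, sizeof(c->buf), 0);
    c->op = OP_RECV;
    io_uring_sqe_set_data(sqe, c);
}

int main(void)
{
    struct io_uring ring;
    io_uring_queue_init(4096, &ring, 0);

    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(7777),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, SOMAXCONN);

    /* One multishot accept keeps producing a CQE per new connection. */
    struct conn lconn = { .fd = lfd, .op = OP_ACCEPT };
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_multishot_accept(sqe, lfd, NULL, NULL, 0);
    io_uring_sqe_set_data(sqe, &lconn);
    io_uring_submit(&ring);

    for (;;) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        struct conn *c = io_uring_cqe_get_data(cqe);

        if (c->op == OP_ACCEPT && cqe->res >= 0) {
            struct conn *nc = calloc(1, sizeof(*nc));
            nc->fd = cqe->res;
            queue_recv(&ring, nc);
        } else if (c->op == OP_RECV && cqe->res > 0) {
            /* "noop": ignore the request bytes, answer with the canned 200 */
            struct io_uring_sqe *s = io_uring_get_sqe(&ring);
            io_uring_prep_send(s, c->fd, resp, sizeof(resp) - 1, 0);
            c->op = OP_SEND;
            io_uring_sqe_set_data(s, c);
        } else if (c->op == OP_SEND && cqe->res > 0) {
            queue_recv(&ring, c);      /* keep-alive: wait for the next request */
        } else if (c->op != OP_ACCEPT) {
            close(c->fd);              /* error or EOF: drop the connection */
            free(c);
        }
        io_uring_cqe_seen(&ring, cqe);
        io_uring_submit(&ring);
    }
}
```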
Clients
I've mostly used wrk2 as the load-generation tool, but have also occasionally used other clients.

Machine details
uname -a
Linux ryzen5700x 6.8.0-51-generic #52-Ubuntu SMP PREEMPT_DYNAMIC Thu Dec 5 13:09:44 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Model name is

cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04.1 LTS"

Benchmark and Results
I am using wrk2 to simulate 20480 simultaneous connections and pump a load of 102400 RPS at this server over localhost.

./wrk -c 20480 -t 6 -d 300s http://localhost:7777 -R 102400 -L
Running 5m test @ http://localhost:7777
6 threads and 20480 connections
...(removed for verbosity)
Thread Stats Avg Stdev Max +/- Stdev
Latency 651.79ms 4.15s 49.97s 97.94%
Req/Sec 17.09k 391.18 18.54k 69.22%
Latency Distribution (HdrHistogram - Recorded Latency)
50.000% 106.24ms
75.000% 109.18ms
90.000% 112.32ms
99.000% 26.64s
99.900% 46.73s
99.990% 49.09s
99.999% 49.74s
100.000% 50.00s
(...removed detailed percentile spectrum)
#[Mean = 651.787, StdDeviation = 4150.317]
#[Max = 49971.200, Total count = 27913044]
#[Buckets = 27, SubBuckets = 2048]
----------------------------------------------------------
29784131 requests in 5.00m, 2.58GB read
Socket errors: connect 0, read 0, write 0, timeout 6045
Requests/sec: 99276.24
Transfer/sec: 8.80MB

Observations
I am mostly interested in single-core performance. I've also tried

Questions
More Questions (unrelated - feel free to ignore)
Replies: 2 comments 16 replies
-
That's quite a good report with all details!
25s and 50s for the 99% and 99.9% percentiles sound insanely high. I'm highly suspicious of that.
It's pretty easy to saturate a CPU, so I wouldn't worry too much about it unless performance is also low.
The generic (non-io_uring) tx path here is pretty expensive, around 60% CPU. The good news is that it's because of loopback, so it includes cycles spent processing the rx path of the other end (~30%). You also have maybe ~10% of cycles consumed by apparmor; usually people don't care much about it, and it's there just because of the stock kernel and distro defaults.

And then you'll be left with the fact that the net stack is heavy. If each request drives only a handful of bytes, you won't get anywhere close to NIC speeds. It'd need multiplexing instead of one send+recv pair per connection, on top of which you can additionally put multishot requests and so on. I also see some accounting overhead; if you regenerate the profile for a new kernel (~6.13), I can take another look.
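For illustration, the multishot part might look roughly like the sketch below: a single armed recv fed from a provided-buffer ring, so the app stops submitting a fresh recv for every request on a hot connection. This is not code from the thread; it assumes liburing >= 2.4 and a recent kernel, and `arm_multishot_recv`, `BGID`, `BUF_COUNT`, and `BUF_SIZE` are made-up names and sizes.

```c
#include <liburing.h>
#include <stdlib.h>

#define BGID      1
#define BUF_COUNT 256     /* must be a power of two */
#define BUF_SIZE  2048

/* Arm a multishot recv on conn_fd, fed from a provided-buffer ring. */
static int arm_multishot_recv(struct io_uring *ring, int conn_fd)
{
    int err = 0;
    char *bufs = malloc((size_t)BUF_COUNT * BUF_SIZE);
    struct io_uring_buf_ring *br =
        io_uring_setup_buf_ring(ring, BUF_COUNT, BGID, 0, &err);
    if (!br)
        return err;

    /* Hand all buffers to the kernel; completions reference them by buffer id. */
    for (int i = 0; i < BUF_COUNT; i++)
        io_uring_buf_ring_add(br, bufs + (size_t)i * BUF_SIZE, BUF_SIZE, i,
                              io_uring_buf_ring_mask(BUF_COUNT), i);
    io_uring_buf_ring_advance(br, BUF_COUNT);

    /* One SQE keeps generating a CQE per received chunk until it's cancelled
     * or runs out of provided buffers. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_recv_multishot(sqe, conn_fd, NULL, 0, 0);
    sqe->buf_group = BGID;
    sqe->flags |= IOSQE_BUFFER_SELECT;
    return io_uring_submit(ring);
}
```

The design point is simply that one armed request keeps producing completions, so the per-request submit+recv round-trip disappears for hot connections.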
-
Both are arbitrary limits the kernel enforces. Do you really need that much? If you get 64K completions sitting in the ring, then your app is doing something very wrong; the latency of processing them will be insanely high. And a ring (single CPU) won't be able to handle 1M hot connections, so most of them are probably cold, and you don't really need fixed files at all unless you get some wins from
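If fixed files do turn out to be worth it, one pattern is to register a sparse table and only install hot connections into it, roughly like the sketch below. This is not from the thread; `HOT_SLOTS` and `setup_fixed_files` are made-up names, and it assumes a liburing/kernel new enough for sparse registration and direct multishot accept.

```c
#include <liburing.h>

#define HOT_SLOTS 4096   /* illustrative table size, not a recommendation */

static int setup_fixed_files(struct io_uring *ring, int listen_fd)
{
    /* Reserve a sparse fixed-file table; slots are filled as connections arrive. */
    int ret = io_uring_register_files_sparse(ring, HOT_SLOTS);
    if (ret < 0)
        return ret;

    /* Multishot accept that installs each new socket directly into a free slot
     * of the fixed table; completions then reference the allocated slot rather
     * than a regular fd. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_multishot_accept_direct(sqe, listen_fd, NULL, NULL, 0);
    return io_uring_submit(ring);
}

/* Later requests on such a connection address it by slot index, e.g.:
 *   io_uring_prep_send(sqe, slot, buf, len, 0);
 *   sqe->flags |= IOSQE_FIXED_FILE;
 */
```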
Right, it'll be there regardless of io_uring. With a real network card, you could say that overhead will be moved to the rx path of the machine where the other end is.