The `loadtest` module has impressive performance, and it has gotten better over the years as the Node.js core has improved. Heavily inspired by `autocannon`, the author has tried using raw TCP sockets to improve performance: the `net` module instead of the `http` module. This is the story of how it went. Spoiler: just add `--tcp` to your command line for a speed boost!
The keep-alive option (`-k`) makes a huge difference in performance: instead of opening a new socket for every request, the same connection is reused, which is usually much faster. We need to run the measurements with and without it to see how each factor is affected.
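As an illustration of what keep-alive means in Node's own `http` client (this is only an illustration, not `loadtest`'s internal code), sockets are reused by going through an `Agent` with `keepAlive` enabled:

// Illustration only: with a keep-alive Agent, the same TCP socket is reused
// for consecutive requests instead of opening a new one every time.
import http from 'node:http'

const agent = new http.Agent({keepAlive: true, maxSockets: 10})

function sendRequest() {
  return new Promise((resolve, reject) => {
    const request = http.request('http://localhost:7357/', {agent}, response => {
      response.resume()
      response.on('end', resolve)
    })
    request.on('error', reject)
    request.end()
  })
}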
The following tables summarize all comparisons; the fastest option in each is shown in bold. Results are given with one core (or worker, or thread) and with three cores for the load tester. Detailed explanations follow.
First without keep-alive, one-core load tester against 3-core test server:
package | krps |
---|---|
loadtest | 6 |
tcp barebones | 10 |
loadtest tcp | 9 |
**ab** | **20** |
autocannon | 8 |
Now with keep-alive, also one-core load tester against 3-core test server:
package | krps |
---|---|
loadtest | 21 |
**tcp barebones** | **80** |
loadtest tcp | 68 |
autocannon | 57 |
wrk | 73 |
With keep-alive, 3-core load tester against 3-core test server:
package | krps |
---|---|
loadtest | 54 |
loadtest tcp | 115 |
autocannon | 107 |
**wrk** | **118** |
With keep-alive, 1-core load tester against Nginx:
package | krps |
---|---|
loadtest | 19 |
loadtest tcp | 61 |
autocannon | 40 |
**wrk** | **111** |
Finally with keep-alive, 3-core load tester against Nginx:
package | krps |
---|---|
loadtest | 49 |
loadtest tcp | 111 |
autocannon | 80 |
**wrk** | **122** |
All measurements are against the test server using 3 cores (the default configuration for our six-core machine), unless specified otherwise:
$ node bin/testserver.js
Note that the leading `$` is the console prompt.
Tests run on an Intel Core i5-12400T processor with 6 cores, running Ubuntu 22.04.3 LTS (Xubuntu, actually). Performance numbers are shown in bold and as thousands of requests per second (krps): **80 krps**.
We compare a few packages on the test machine. Keep in mind that `ab` does not use keep-alive while `autocannon` does, so their numbers should not be compared directly.
The first performance target is Apache's `ab`:
$ ab -V
Version 2.3 <$Revision: 1879490 $>
With 10 concurrent connections and without keep-alive:
$ ab -t 10 -c 10 http://localhost:7357/
[...]
Requests per second: 20395.83 [#/sec] (mean)
Results are around 20 krps.
Keep-alive cannot be used with `ab`, as far as the author knows.
Next we try out `autocannon`, the package that actually inspired this approach. By default `autocannon` uses 10 concurrent connections with keep-alive enabled:
$ autocannon --version
autocannon v7.12.0
node v18.17.1
$ autocannon http://localhost:7357/
[...]
┌───────────┬─────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼─────────┼─────────┼─────────┼─────────┼──────────┼─────────┼─────────┤
│ Req/Sec │ 51295 │ 51295 │ 57343 │ 59103 │ 56798.55 │ 2226.35 │ 51285 │
├───────────┼─────────┼─────────┼─────────┼─────────┼──────────┼─────────┼─────────┤
│ Bytes/Sec │ 6.36 MB │ 6.36 MB │ 7.11 MB │ 7.33 MB │ 7.04 MB │ 276 kB │ 6.36 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴──────────┴─────────┴─────────┘
We will look at the median rate (reported as 50%), so results are around 57 krps. Keep-alive cannot be disabled with an option, but it can be changed directly in the code by setting the header `Connection: close`. Performance then drops to near 8 krps:
$ npx autocannon http://localhost:7357/
[...]
┌───────────┬────────┬────────┬────────┬────────┬────────┬─────────┬────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼────────┼────────┼────────┼────────┼────────┼─────────┼────────┤
│ Req/Sec │ 5831 │ 5831 │ 7703 │ 8735 │ 7674.4 │ 753.53 │ 5828 │
├───────────┼────────┼────────┼────────┼────────┼────────┼─────────┼────────┤
│ Bytes/Sec │ 560 kB │ 560 kB │ 739 kB │ 839 kB │ 737 kB │ 72.4 kB │ 559 kB │
└───────────┴────────┴────────┴────────┴────────┴────────┴─────────┴────────┘
To complete the set we try `wrk`:
$ wrk -v
wrk debian/4.1.0-3build1 [epoll]
With a single thread (core), for a fair comparison, we get almost 73 krps:
$ wrk http://localhost:7357/ -t 1
[...]
Requests/sec: 72639.52
The baseline is the existing `http` implementation in `loadtest` 7.1.1, running on one core. Without keep-alive it gets close to 6 krps:
$ node bin/loadtest.js http://localhost:7357 --cores 1
[...]
Effective rps: 6342
Very far from the 20 krps given by `ab`.
With keep-alive:
$ node bin/loadtest.js http://localhost:7357 --cores 1 -k
[...]
Effective rps: 20490
We are around 20 krps.
Again quite far from the 57 krps of `autocannon`; close to `ab`, but `ab` doesn't use keep-alive so the comparison is meaningless.
For the first implementation we want to learn if a bare-sockets implementation is worth the time. In this naïve barebones version we open the socket and send a short canned request, without taking any parameters or headers into account:
this.params.request = `${this.params.method} ${this.params.path} HTTP/1.1\r\n\r\n`
We don't parse the result either; we just assume that it is received as one packet and disregard it. The result is almost 80 krps:
$ node bin/loadtest.js http://localhost:7357 --cores 1 --tcp
[...]
Effective rps: 79997
A very promising start! Obviously this only works properly with GET requests without a body, so it is only useful as a benchmark: we want to make sure we don't lose too much performance when adding all the functionality.
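A minimal sketch of the barebones idea, using the `net` module (simplified, not the exact `loadtest` code): open one socket, write the canned request, count every data event as a response and fire the next request.

// Simplified sketch: one socket, one canned request, no parsing at all.
import net from 'node:net'

const request = 'GET / HTTP/1.1\r\n\r\n'
let received = 0

const socket = net.connect(7357, 'localhost', () => socket.write(request))
socket.on('data', () => {
  // assume each data event is one complete response and fire the next request
  received += 1
  socket.write(request)
})
socket.on('error', console.error)

setTimeout(() => {
  console.log(`received ${received} responses in 10 seconds`)
  socket.end()
}, 10000)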
We can also do a barebones implementation without keep-alive, creating a new socket for every request. The result is around 10 krps, still far from Apache `ab`. But there is not much we can do here: apparently writing sockets in C is more efficient than in Node.js, or perhaps `ab` has some tricks up its sleeve, probably some low-level optimizations. In the Node.js code there is not much fat to trim. So from now on we will focus on the keep-alive tests.
First we add the proper headers to the request. This means we are sending more data on each round, but performance doesn't seem to change much: still around 80 krps. The request we are now sending is:
GET / HTTP/1.1
host: localhost:7357
accept: */*
user-agent: loadtest/7.1.0
Connection: keep-alive
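A rough sketch of how such a request string can be assembled (illustrative only: the `params` object and its fields are assumptions, not the actual `loadtest` internals):

// Sketch: assemble the raw HTTP/1.1 request with the headers shown above.
function buildRequest(params) {
  const headers = {
    host: `${params.hostname}:${params.port}`,
    accept: '*/*',
    'user-agent': 'loadtest/7.1.0',
    Connection: 'keep-alive',
  }
  const headerLines = Object.entries(headers)
    .map(([name, value]) => `${name}: ${value}`)
    .join('\r\n')
  return `${params.method} ${params.path} HTTP/1.1\r\n${headerLines}\r\n\r\n`
}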
One interesting bit is that sending the header `connection: keep-alive` does not affect performance; however, sending `connection: close` breaks performance down to 8 requests per second. Probably there are huge inefficiencies in the way sockets are created. This should be investigated in depth at some point, if we ever want to have a proper test without keep-alive.
Now we come to the really critical part: parsing the response, including the content. A very simple implementation just treats the response as a string, reads the first line and extracts the status code. Performance is now down to around 68 krps. Note that we are still assuming that each response arrives as a single packet.
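A sketch of that simple first pass, under the stated single-packet assumption:

// Extract the status code from the first line of a raw response,
// e.g. "HTTP/1.1 200 OK" -> 200.
function parseStatusCode(data) {
  const text = data.toString()
  const firstLine = text.slice(0, text.indexOf('\r\n'))
  return parseInt(firstLine.split(' ')[1], 10)
}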
A sample response from the test server included with `loadtest` can look like this:
HTTP/1.1 200 OK
Date: Fri, 08 Sep 2023 11:04:21 GMT
Connection: keep-alive
Keep-Alive: timeout=5
Content-Length: 2
OK
We can see a very simple HTTP response that fits in one packet. But a response can also arrive in multiple packets, so we need to keep some state between them. This is the next step: making sure that we have received the whole body and not just part of it. The way to do this is to read the `content-length` header and then check that the body we have is that long; only then can we be 100% sure that we have the whole body. Therefore we need to parse all incoming headers, find the content length (in the `content-length` header), and then parse the rest of the packet to check that we have the whole body.
Again, a very simple implementation that parses the content length and checks it against the body length goes down to 63 krps. If the body is not complete we need to keep the partial body and append to it as more data comes in, until we reach the required `content-length`. Keep in mind that even the headers can be so long that they span several packets! In that case even more state needs to be kept between packets. With decent packet parsing, including multi-packet headers and bodies, performance goes down to 60 krps. Most of the time is spent parsing headers, since the body only needs to be checked for length, not parsed.
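A sketch of the kind of stateful parser described here (simplified; the real `loadtest` parser handles more cases):

// Accumulate chunks until the headers and content-length bytes of body
// have been received; only then is the response considered complete.
class ResponseParser {
  constructor() {
    this.buffer = ''
    this.headersParsed = false
    this.contentLength = 0
    this.statusCode = 0
  }

  // returns true when a whole response has been received
  addChunk(chunk) {
    this.buffer += chunk.toString()
    if (!this.headersParsed) {
      const separator = this.buffer.indexOf('\r\n\r\n')
      if (separator === -1) {
        // headers may span several packets; wait for more data
        return false
      }
      const lines = this.buffer.slice(0, separator).split('\r\n')
      this.statusCode = parseInt(lines[0].split(' ')[1], 10)
      for (const line of lines.slice(1)) {
        const [name, value] = line.split(': ')
        if (name.toLowerCase() === 'content-length') {
          this.contentLength = parseInt(value, 10)
        }
      }
      this.buffer = this.buffer.slice(separator + 4)
      this.headersParsed = true
    }
    return this.buffer.length >= this.contentLength
  }
}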
Given that responses tend to be identical in a load test, perhaps with a changing date or serial number, we can apply a trick: when receiving a packet, check whether it is similar enough to one received before, so that we can skip parsing the headers altogether. The algorithm checks the following conditions:
- The received packet is shorter than 1000 bytes.
- Its length is identical to that of a packet received before.
- The lengths of the headers and the body are also identical.
- The status is the same as before.
If all of these apply then the headers in the message are not parsed: we estimate that the packet is complete and we don't need to check the content length (a sketch of the check follows below). Keep in mind that we might be wrong: we might have received a packet with just part of a response that happens to have the same length, status and header length as a previous complete response, and which is also below 1000 bytes. This is, however, extremely unlikely. Using this trick we go back up to 67 krps. Packets of different lengths are stored for comparison, which can cause memory issues when the size varies constantly.
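The check, sketched out (names are illustrative, not the actual `loadtest` code):

// known maps packet length -> {headerLength, statusLine} of previously
// parsed responses; a match means we can skip parsing this packet.
function looksLikeKnownResponse(packet, known) {
  if (packet.length >= 1000) return false
  const previous = known.get(packet.length)
  if (!previous) return false
  const text = packet.toString()
  // same header length (blank line in the same position) and same status line
  return text.indexOf('\r\n\r\n') === previous.headerLength
    && text.startsWith(previous.statusLine)
}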
Now we can go back to using multiple cores:
$ node bin/loadtest.js http://localhost:7357 --cores 3 --tcp
[...]
Effective rps: 115379
In this case we use half the available cores, leaving the rest for the test server. Now we go up to 115 krps!
What about regular `http` connections, without the `--tcp` option? They stay at 54 krps:
$ node bin/loadtest.js http://localhost:7357/ -k --cores 3
[...]
Effective rps: 54432
For comparison we try `autocannon`, also with three workers:
$ autocannon http://localhost:7357/ -w 3 -c 30
[...]
┌───────────┬───────┬───────┬─────────┬─────────┬──────────┬─────────┬───────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼───────┼───────┼─────────┼─────────┼──────────┼─────────┼───────┤
│ Req/Sec │ 88511 │ 88511 │ 107071 │ 110079 │ 105132.8 │ 6148.39 │ 88460 │
├───────────┼───────┼───────┼─────────┼─────────┼──────────┼─────────┼───────┤
│ Bytes/Sec │ 11 MB │ 11 MB │ 13.3 MB │ 13.6 MB │ 13 MB │ 764 kB │ 11 MB │
└───────────┴───────┴───────┴─────────┴─────────┴──────────┴─────────┴───────┘
The median rate (50th percentile) is 107 krps.
Now `wrk`, which yields 118 krps:
$ wrk http://localhost:7357/ -t 3
[...]
Requests/sec: 118164.03
So `loadtest` has managed to get slightly above `autocannon` using multiple tricks, but stays below `wrk`.
We are not done yet. As it happens, the new code is not very precise with connections and clients: in particular it doesn't play nice with our `--rps` feature, which is used to send an exact number of requests per second. We need to do a complete refactoring to have a pool of clients: take one out to fulfill a request and then free it back to the pool.
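The pool idea, sketched (illustrative; the actual refactor is more involved):

// Clients are created lazily, borrowed to send one request and
// returned to the pool when the request finishes.
class ClientPool {
  constructor(createClient) {
    this.createClient = createClient
    this.freeClients = []
  }

  acquire() {
    return this.freeClients.pop() || this.createClient()
  }

  release(client) {
    this.freeClients.push(client)
  }
}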
After the refactoring we get some bad news: performance has dropped back down to 60 krps!
$ node bin/loadtest.js http://localhost:7357/ --tcp --cores 1
[...]
Effective rps: 60331
We need to do the painstaking exercise of getting back to our target performance, and see where our microseconds (µs) are being spent. Every microsecond counts: between 67 krps (about 15 µs per request) and 60 krps (about 16.7 µs per request) the difference is less than two microseconds.
We use the `microprofiler` package, which allows us to instrument the code that sends and receives requests. For instance, the function `makeRequest()` in `lib/tcpClient.js`, which sends out the request:
import microprofiler from 'microprofiler'

[...]

makeRequest() {
  if (!this.running) {
    return
  }
  // first block: connect
  const start1 = microprofiler.start()
  this.connect()
  microprofiler.measureFrom(start1, 'connect', 100000)
  // second block: create parser
  const start2 = microprofiler.start()
  this.parser = new Parser(this.params.method)
  microprofiler.measureFrom(start2, 'create parser', 100000)
  // third block: start measuring latency
  const start3 = microprofiler.start()
  const id = this.latency.begin()
  this.currentId = id
  microprofiler.measureFrom(start3, 'latency begin', 100000)
  // fourth block: write to socket
  const start4 = microprofiler.start()
  this.connection.write(this.params.request)
  microprofiler.measureFrom(start4, 'write', 100000)
}
Each of the four calls is instrumented. When this code runs, the output contains a lot of lines like these:
$ node bin/loadtest.js http://localhost:7357/ --tcp --cores 1
[...]
Profiling connect: 100000 requests, mean time: 1.144 µs, rps: 6948026
Profiling create parser: 100000 requests, mean time: 0.152 µs, rps: 6582446
Profiling latency begin: 100000 requests, mean time: 1.138 µs, rps: 878664
Profiling write: 100000 requests, mean time: 5.669 µs, rps: 176409
Note that the results oscillate by something like 0.3 µs from run to run, so don't pay attention to very small differences. The mean time is the interesting part: from 0.152 µs to create the parser, up to 5.669 µs for the write.
There is not a lot we can do about the `connection.write()` call, since it talks directly to the Node.js core; we can try reducing the message size (not sending all the headers), but it doesn't seem to do much. So we focus on the `this.connect()` call, which we can reduce to less than a µs. Then we repeat the exercise on the `finishRequest()` call to see if we can squeeze out another microsecond there.
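One plausible way to shave time off `connect()` is to bail out early when the socket is already open; this is a guess at the kind of change, not the actual diff, and `this.connection`, `handleData()` and `handleError()` are assumed names:

// Guess at the optimization: skip the whole connect path when the
// socket already exists and has not been destroyed.
connect() {
  if (this.connection && !this.connection.destroyed) {
    // keep-alive: the existing socket can be reused as is
    return
  }
  this.connection = net.connect(this.params.port, this.params.hostname)
  this.connection.on('data', data => this.handleData(data))
  this.connection.on('error', error => this.handleError(error))
}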
After some optimizing and a lot of bug fixing we are back to 68 krps:
$ node bin/loadtest.js http://localhost:7357/ --tcp --cores 1
[...]
Effective rps: 68466
With classic `loadtest` without the `--tcp` option we still get 21 krps:
$ node bin/loadtest.js http://localhost:7357/ -k --cores 1
[...]
Effective rps: 21446
Marginally better than before.
By the way, it would be a good idea to try again without keep-alive. There is currently no option to disable it, but it can be done by hacking the header to `Connection: close`.
We get a bit less performance than the barebones implementation, almost 9 krps:
$ node bin/loadtest.js http://localhost:7357/ --tcp --cores 1
[...]
Effective rps: 8682
The current setup is a bit cumbersome: start the server, then start the load test with the right parameters. We need a reproducible way of getting performance measurements, so we introduce the script `bin/tcp-performance.js`, which starts a test server and then runs a load test with the parameters we have been using.
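A rough sketch of the idea behind such a script (not the shipped code, which drives the package's API directly): start the test server, run the usual load test command, then shut everything down.

// Sketch using child processes only; the real script runs the server in API mode.
import {spawn} from 'node:child_process'

const server = spawn('node', ['bin/testserver.js'], {stdio: 'inherit'})

// crude: give the server a moment to start listening before testing
setTimeout(() => {
  const test = spawn('node',
    ['bin/loadtest.js', 'http://localhost:7357/', '--tcp', '--cores', '1'],
    {stdio: 'inherit'})
  test.on('exit', () => server.kill())
}, 1000)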
Unfortunately the test server only uses one core (being run in API mode) and maxes out quickly at 27 krps:
$ node bin/tcp-performance.js
[...]
Effective rps: 27350
The author has made multiple attempts at getting a multi-core test server running: using the cluster module, running it as a multi-core process, running it as a script with child_process.exec()... They all add too much complexity. So we use the single-core measurements as a benchmark, even if they are not representative of full operation.
By the way, `autocannon` does a bit better in this scenario (single-core test server), reaching 43 krps. How does it do this magic? One part of the puzzle may be that it sends fewer headers, without `user-agent` or `accept`.
So we do a quick trial removing these headers in `loadtest`:
$ node bin/loadtest.js http://localhost:7357/ --tcp --cores 1
[...]
Effective rps: 29694
Performance improves a bit, to almost 30 krps, but not by much. How `autocannon` does this wizardry is not evident.
Our last test is to run `loadtest` against a local Nginx server, which is sure not to max out with only one core: it goes up to 61 krps.
$ node bin/loadtest.js http://localhost:80/ --tcp --cores 1
[...]
Effective rps: 61059
Without `--tcp` we only get 19 krps.
A similar test with `autocannon` yields only 40 krps:
$ autocannon http://localhost:80/
[...]
┌───────────┬─────────┬─────────┬───────┬─────────┬─────────┬─────────┬─────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼─────────┼─────────┼───────┼─────────┼─────────┼─────────┼─────────┤
│ Req/Sec │ 34591 │ 34591 │ 40735 │ 43679 │ 40400 │ 2664.56 │ 34590 │
├───────────┼─────────┼─────────┼───────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 29.7 MB │ 29.7 MB │ 35 MB │ 37.5 MB │ 34.7 MB │ 2.29 MB │ 29.7 MB │
└───────────┴─────────┴─────────┴───────┴─────────┴─────────┴─────────┴─────────┘
It is not evident why it reaches lower performance against Nginx than against our Node.js test server, but the numbers are quite consistent.
Meanwhile `wrk` takes the crown again with 111 krps:
$ wrk http://localhost:80/ -t 1
[...]
Requests/sec: 111176.14
Running `loadtest` again with three cores we get 111 krps:
$ node bin/loadtest.js http://localhost:80/ --tcp --cores 3
[...]
Effective rps: 110858
Without `--tcp` we get 49 krps.
Meanwhile `autocannon` with three workers reaches 80 krps:
$ autocannon http://localhost:80/ -w 3
[...]
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Req/Sec │ 65727 │ 65727 │ 80191 │ 84223 │ 78668.8 │ 5071.38 │ 65676 │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 56.4 MB │ 56.4 MB │ 68.9 MB │ 72.4 MB │ 67.6 MB │ 4.36 MB │ 56.4 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
Consistent with the numbers reached above against a test server with 3 cores.
`wrk` does not go much further with three threads than with one, at 122 krps:
$ wrk http://localhost:80/ -t 3
[...]
Requests/sec: 121991.96
It is good to know that `loadtest` can hold its own against beasts like `ab`, `autocannon` or `wrk`. `ab` and `wrk` are written in C, while `autocannon` is maintained by Matteo Collina, one of the leading Node.js performance gurus.
There are some unexplained effects, like why `autocannon` performs so poorly against Nginx. It would be really interesting to understand this.
Now, with TCP sockets and keep-alive, you can use `loadtest` to go beyond the paltry 6 to 20 krps that we used to get: with multiple cores you can reach 100 krps locally. If you need performance beyond that, you can try some of the other options used here.
Note that there are many options not yet implemented for TCP sockets, such as secure connections over HTTPS. They will come in future releases.