Gauging Performance
Examples of collecting and analysing performance metrics.
What load can QuantumLeap handle? How many requests per second? What's the framework (Gunicorn/Flask) overhead? And Python's? These and other performance-related questions you might have about QuantumLeap aren't easy to answer, and the answers may vary greatly depending on your server environment (hardware/OS stack) as well as the Gunicorn configuration you chose to run QuantumLeap with, not to mention how you tuned the DB backend of your choice---Crate or Timescale. There are quite a number of knobs you can turn to optimise performance and you'll need quality measurements to understand the effect of turning any one knob---e.g. changing a Gunicorn setting or even using another WSGI app server!
We put together some simple, perhaps simplistic, examples to show how you might go about gauging QuantumLeap's performance in your environment by taking advantage of QuantumLeap's built-in telemetry features. The basic idea is to start QuantumLeap with telemetry on to collect duration and runtime metrics time series and then analyse these series in Pandas to try finding an answer to your questions. (If you haven't done it yet, you should read up about QuantumLeap telemetry before reading on.)
We'll try answering some hairy questions:
1. How many requests per second can the framework QuantumLeap runs in (Gunicorn/Flask/Connexion) potentially handle in my set-up?
2. How much overhead does the framework add to the processing of a request?
3. What's the throughput of the `notify` endpoint? And what's the ratio of DB insert time to total processing time?
QuantumLeap telemetry comes in handy when trying to answer questions like these because we need to know what's going on inside the box (e.g. what amount of time each QuantumLeap component actually contributes to the total processing time) whereas just reading the report of a load testing tool like K6 might not tell you the whole story.
So we're going to:
- run QuantumLeap with the usual Gunicorn settings but with telemetry on;
- make Crate provide the DB backend;
- use a Redis cache to avoid unnecessary metadata lookups done when inserting NGSI entity data;
- hammer QuantumLeap with more HTTP requests than it can possibly handle;
- collect the runtime metrics (GC, memory, user/kernel time, etc.) and duration time series files output by the telemetry component;
- load those files into Pandas and analyse the time series.
To keep things simple, we'll run all the involved components locally.
The Docker Compose file in `src/tests/benchmark` defines the QuantumLeap, Crate and Redis services we'll use. The shell scripts (one for each test scenario) take care of building a QuantumLeap Docker image, starting the Docker Compose services, bombarding QuantumLeap with concurrent HTTP requests, and finally stopping the Docker Compose services. The `_monitoring` directory (mounted on the QuantumLeap container) is where all the telemetry files wind up.
Surely it must've already occurred to you that this kind of set-up is far from ideal for a performance benchmark and that the numbers you're going to get won't be representative of real-life scenarios where e.g. you have a distributed pool of QuantumLeap services fronting a dedicated Crate cluster. But it'll do for our purpose of showing how you could use QuantumLeap telemetry to gauge performance in your own environment.
Speaking of which, QuantumLeap telemetry gets configured in `gconfig_telemetry.py`, the Gunicorn config we start QuantumLeap with in all our example test scenarios below. With this configuration, the server times each HTTP request, building a time series of request durations for each combination of HTTP path and verb. (Open up `gconfig_telemetry.py` to see exactly how.) One thing to keep in mind is that each duration in a series tells you how long a Gunicorn worker took to service a certain kind of request, but it doesn't account for the time it took the master process to dispatch the request to the worker. To get a rough estimate of how long Gunicorn takes to accept and dispatch requests, we use the same telemetry module client side too. We've whipped together a simple Python script, `threaded_driver.py`, which our shell scripts call to do all the hammering. The Python script times each HTTP request too, outputting the corresponding time series in the `_monitoring` directory so at the end of a test session we can easily load both server and client data sets into Pandas. Sweet.
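By the way, here's a hedged sketch of what that client-side timing amounts to. The actual `threaded_driver.py` also initialises the telemetry monitor so the collected series get written to `_monitoring`, and it folds the HTTP status code into the series label (that's where the `client:version:200` label you'll see later comes from); the URL below assumes QuantumLeap is listening locally on its default port.
import requests
from server.telemetry.monitor import time_it

QL_BASE_URL = 'http://localhost:8668'   # assumed: QuantumLeap running locally

def timed_version_call():
    # time the whole HTTP round trip, mirroring what the server does for each request
    with time_it(label='client:version'):
        return requests.get(QL_BASE_URL + '/version')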
To be able to reproduce the examples below yourself, you'll first have to run these commands:
$ cd /path/to/ngsi-timeseries-api
$ source setup_dev_env.sh
$ pipenv install --dev
$ cd src/tests/benchmark
To get our feet wet, we'll try answering questions (1) and (2) to find out the limits of the Web server framework (Gunicorn/Flask/Connexion) in our environment. Obviously Gunicorn configuration plays a significant role here, so, a bit more accurately, we should say we're trying to see how the framework performs when Gunicorn is configured with QuantumLeap's default settings of:
- a pre-forked server model with multi-threading (`gthread`);
- a worker pool of `n * 4 + 1` processes where `n` is the number of CPU cores;
- one thread per worker;
- error log level.
In fact, these are the settings you'll find in `gconfig_telemetry.py`, the configuration file we start QuantumLeap with---have a look at the Docker Compose file.
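For orientation, the worker-related part of that configuration boils down to something like the sketch below. The values here are assumed from the list above rather than copied from the file, and the real `gconfig_telemetry.py` also hooks in the telemetry instrumentation, which we leave out.
import multiprocessing

bind = '0.0.0.0:8668'                           # assumed: QuantumLeap's default port
workers = multiprocessing.cpu_count() * 4 + 1   # a pool of n * 4 + 1 worker processes
worker_class = 'gthread'                        # pre-forked server model with threads
threads = 1                                     # one thread per worker
loglevel = 'error'                              # error log level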
To make sure in our tests we measure the framework's performance rather than QuantumLeap's, we should hit an endpoint that's as close as possible to a noop. What better endpoint than `version` then? It just returns a little hard-coded JSON object. We'll also wrap the `version` function with a timer decorator to collect a time series of execution times for it:
from server.telemetry.monitor import time_it

@time_it(label='version()')
def version():
    ...
Now we're ready to drop the bomb. Run
$ ./baseline-load-test.sh
The script builds a QuantumLeap Docker image, starts the Docker Compose services, spawns a Python process to do 10,000 concurrent HTTP `GET`s on the `version` endpoint (using 10 threads), and then stops the Docker Compose services. Once the script is done we can look at the telemetry data by loading it into Pandas.
We whipped together a simple Python script with a couple of convenience functions that string together Pandas calls to help us explore the data we collected. Load it into the Python interpreter
$ python -i analysis.py
If all went well, there should be a `db` variable in the global scope, an instance of `TelemetryDB` (from `pandas_import`) making all the duration and runtime telemetry data found in the monitoring directory available through Pandas frames and series. Let's print a summary of the duration time series we collected
>>> print_measurements_summaries(db.duration())
Time series: /version [GET]
count 10000.000000
mean 0.013604
std 0.015712
min 0.000831
25% 0.003771
50% 0.008827
75% 0.018593
90% 0.031005
95% 0.039584
max 0.477556
Time series: version()
count 10000.000000
mean 0.000142
std 0.001022
min 0.000008
25% 0.000017
50% 0.000021
75% 0.000023
90% 0.000060
95% 0.000191
max 0.025543
Time series: client:version:200
count 10000.000000
mean 0.039894
std 0.024127
min 0.007079
25% 0.025884
50% 0.034068
75% 0.046459
90% 0.065372
95% 0.083069
max 0.522838
and check how requests were distributed among workers
>>> get_version = db.duration().time_series('/version [GET]')
>>> print_series_summary(get_version, by_pid = True)
Time series: /version [GET] | PID 8
count 1159.000000
...
Time series: /version [GET] | PID 10
count 1092.000000
...
Time series: /version [GET] | PID 12
count 1091.000000
...
Time series: /version [GET] | PID 14
count 1075.000000
...
Time series: /version [GET] | PID 16
count 1097.000000
...
Time series: /version [GET] | PID 18
count 1079.000000
...
Time series: /version [GET] | PID 20
count 1094.000000
...
Time series: /version [GET] | PID 22
count 1166.000000
...
Time series: /version [GET] | PID 24
count 1147.000000
...
You can easily do a similar analysis for runtime series: Python GC, RAM, etc. (Have a look at the code to see what's available.) For example, here's how to summarise the memory usage series of each worker process
>>> print_series_summary(db.max_rss(), by_pid = True)
Back to our quest for a performance baseline. From the above summaries we can see we got hit exactly 10,000 times and the three means look like fairly good representatives of their respective data sets:
Time series: /version [GET]
mean 0.013604
Time series: version()
mean 0.000142
Time series: client:version:200
mean 0.039894
Then, as expected, the mean of the `version()` time series tells us the `version` function returns almost instantly. Each duration in the `/version [GET]` series tells you how long a Gunicorn worker took to service a request but it doesn't account for the time it took the master process to dispatch the request to the worker. Hence, we can say that Flask/Connexion overhead per call sits at around 0.013 seconds. On average, the client saw each request complete in about 0.040 secs, so Gunicorn overhead can't cost us more than 0.026s a call---this is an upper bound since it factors in network latency, client I/O and context switches too!
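If you'd rather compute those estimates than eyeball them from the summaries, the same `TelemetryDB` accessors we used above will do (the variable names below are just ours):
srv = db.duration().time_series('/version [GET]').data()          # server-side request durations
fn = db.duration().time_series('version()').data()                # bare version() execution times
client = db.duration().time_series('client:version:200').data()   # client-observed durations

framework_overhead = srv.mean() - fn.mean()         # ~0.013s: Flask/Connexion overhead per call
dispatch_upper_bound = client.mean() - srv.mean()   # ~0.026s: upper bound on Gunicorn dispatch,
                                                    # network latency and client I/O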
Now, how many requests per second can we handle?
>>> get_version = db.duration().time_series('/version [GET]')
>>> rps = measurements_per_second(get_version)
>>> rps.describe()
mean 238.095238
min 119.000000
25% 214.000000
50% 236.500000
75% 277.000000
max 300.000000
>>> rps.plot()
# displays a graph of requests per second over time
Even though the data was collected from within the server, we'd expect the requests per second measured at the client to be pretty much the same since we're on the same host. In fact, that is the case.
>>> get_version_client = db.duration().time_series('client:version:200')
>>> rps = measurements_per_second(get_version_client)
>>> rps.describe()
mean 238.095238
min 118.000000
25% 215.250000
50% 235.000000
75% 273.000000
max 301.000000
So basically both the server and the client observed a throughput of
about 238
requests per second. Plus, server-side work got distributed
evenly and done pretty much in the same amount of millis each time.
We expect the server to have been busy doing network I/O in the kernel
most of the time and in fact
>>> ratio, r_series = sum_by_second_ratio(db.user_time(), db.system_time())
>>> ratio # total user time / kernel time
3.148073238423002
and plotting the ratio series confirms it
>>> r_series.plot()
(Quit the Python interpreter before moving on.)
So we sketched out how you might go about answering questions (1) and (2). Let's move on to question (3): what's the throughput of the `notify` endpoint and what's the ratio of DB insert time to total processing time? We'll need to collect durations for the `notify` endpoint as well as the insert call in the SQL translator. The telemetry config we have already collects duration time series for HTTP endpoints, but we'll also plonk in a couple of timers to collect durations of the insert call in the SQL translator and the Reporter's `notify` method. Here's how:
from server.telemetry.monitor import time_it

# in the Reporter
@time_it(label='notify()')
def notify():
    ...

# in the SQL translator's _insert_entities_of_type method
with time_it(label='db.insert'):
    self.cursor.executemany(stmt, rows)
We're going to use the same 10,000 request bomb from earlier, except this time `threaded_driver.py` will `POST` an NGSI entity to the `notify` endpoint. (Open up the script to see what's going on under the bonnet.)
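For a feel of what each of those requests looks like, here's a hedged sketch; the entity id, type and attribute below are made up for illustration and the actual payload is the one defined in `threaded_driver.py`.
import requests

QL_NOTIFY_URL = 'http://localhost:8668/v2/notify'   # assumed: QuantumLeap running locally

# A minimal NGSIv2 notification body: a list of entities under the 'data' key.
notification = {
    'data': [{
        'id': 'Room:1',                                      # made-up entity
        'type': 'Room',
        'temperature': {'type': 'Number', 'value': 23.5}
    }]
}

response = requests.post(QL_NOTIFY_URL, json=notification)
print(response.status_code)   # 200 if QuantumLeap inserted the entity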
Run
$ ./notify-load-test.sh
...
>>> Count of successful HTTP responses by status code:
HTTP 500: 405
HTTP 200: 9595
...
O deary deary. 405 requests failed with a server error. Ouch. Anyway, just like earlier on, we fire up the Python interpreter to make sense of the telemetry data.
$ python -i analysis.py
Bird's eye view of the three duration series we collected:
>>> print_measurements_summaries(db.duration())
Time series: /v2/notify [POST]
count 10000.000000
mean 0.102399
std 0.166758
min 0.019514
25% 0.074256
50% 0.093793
75% 0.117056
90% 0.140944
95% 0.158496
max 5.933852
Time series: notify()
count 10000.000000
mean 0.083487
std 0.166569
min 0.013632
25% 0.056338
50% 0.073574
75% 0.095656
90% 0.120594
95% 0.139385
max 5.931297
Time series: db.insert
count 9595.000000
mean 0.050841
std 0.032493
min 0.007530
25% 0.027878
50% 0.041102
75% 0.064695
90% 0.096074
95% 0.117234
max 0.328003
Time series: client:notify:200
count 9595.000000
mean 0.140242
std 0.196658
min 0.031160
25% 0.090904
50% 0.117998
75% 0.166469
90% 0.224326
95% 0.254857
max 6.092773
Time series: client:notify:500
count 405.000000
mean 0.187196
std 0.101088
min 0.044285
25% 0.123096
50% 0.157361
75% 0.219076
90% 0.297212
95% 0.340895
max 0.768078
Puzzling numbers. We did get hit 10,000 times but only managed to insert 9,595 entities out of the 10,000 the `notify` method got---we lost about 4% of the incoming data! Insert times were pretty uniform across the board whereas the `notify` durations experienced some huge spikes.
We won't try to understand what went wrong, why we lost data and had those spikes. Let's leave that for another day. But we should select a more uniform subset of the data to get more accurate answers to our questions since some requests took longer than six seconds which could skew our computations.
>>> notify_post = db.duration().time_series('/v2/notify [POST]').data()
>>> notify_fn = db.duration().time_series('notify()').data()
>>> db_insert = db.duration().time_series('db.insert').data()
Plotting the data to visualise those spikes
>>> notify_post.plot()
>>> notify_fn.plot()
shows there was actually only one spike in each series which, in both cases, happened at about ten seconds into the session. Even if we cut off the first ten seconds of each series, we still have loads of data left for analysis, so cut off is what we'll do.
>>> cutoff_ix = notify_post.idxmax() # min index attaining series max
>>> notify_post = notify_post[notify_post.index > cutoff_ix]
>>> notify_fn = notify_fn[notify_fn.index > cutoff_ix]
>>> db_insert = db_insert[db_insert.index > cutoff_ix]
How many requests per second?
>>> notify_post.resample('1S').count().describe()
mean 73.746154
min 12.000000
25% 72.000000
50% 76.000000
75% 80.000000
max 86.000000
Comparing these numbers with the baseline requests per second figures from earlier, we can say we're using our server stack at about 31% of its capacity, hence the framework can't possibly be a limiting factor. Even if e.g. Gunicorn were twice as fast at handling requests, `notify` throughput wouldn't be much higher than it is now. Also, the per-request overhead of Flask/Connexion is 0.018s---basically in the same region as the baseline, which is good news:
>>> notify_post.describe()
count 9587.000000
mean 0.096304
std 0.033002
min 0.019514
25% 0.073530
50% 0.092602
75% 0.115289
max 0.318799
>>> notify_fn.describe()
count 9587.000000
mean 0.078164
std 0.031723
min 0.013632
25% 0.055968
50% 0.072959
75% 0.094788
max 0.315395
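If you want to double-check the 0.018s overhead figure and the 31% capacity claim from the series at hand, a quick sanity check could look like this (238 is the baseline throughput we measured earlier):
overhead = notify_post.mean() - notify_fn.mean()        # ~0.018s Flask/Connexion overhead per call
notify_rps = notify_post.resample('1S').count().mean()  # ~74 notify requests per second
capacity_usage = notify_rps / 238                       # ~0.31 of the baseline throughput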
Another thing we wanted to know is the ratio of DB insert time to total QuantumLeap processing time---the amount of time the `notify` function took to complete vs the time it took to do the DB insert. We know some entities got sucked into the sink of no return, so to be sure to compare apples with apples, no insert loss should've happened in the data subset we chose. Is that so?
>>> notify_fn.size, db_insert.size
(9587, 9587)
Cool, we're in business. Let's compute the ratio series
>>> notify_fn_sum_by_sec = notify_fn.resample('1S').sum()
>>> db_insert_sum_by_sec = db_insert.resample('1S').sum()
>>> r_series = db_insert_sum_by_sec / notify_fn_sum_by_sec # r[k] = d[k]/n[k]
>>> r_series.describe()
mean 0.633801
std 0.136688
min 0.340319
25% 0.534223
50% 0.596363
75% 0.691890
max 0.977415
It looks like the mean is a fairly good representative of this ratio over time. Good, then the insert should account for about 63% of the time of each `notify` call. What the heck did QuantumLeap do in the remaining 37% of the time?! Johnny Perf Sleuth has his work cut out... Anyway, with at least 63% of the time spent doing blocking I/O, it looks like the `notify` endpoint is in desperate need of an async I/O face-lift. The ratio of user time to kernel time should just confirm that:
>>> _, r_series = sum_by_second_ratio(db.user_time(), db.system_time())
>>> r_series.describe()
mean 3.074184
std 0.597316
min 0.766313
25% 3.234516
50% 3.270670
75% 3.284568
max 3.332187
Yoh! Kernel time is more than three times the user time.
Well, it's been one helluva ride. We saw how to use QuantumLeap telemetry and Pandas to get an idea of how many requests per second the Web server can potentially handle, come up with an estimate for framework overhead, see how many entities per second we can shovel down the `notify` endpoint and in what proportion its performance depends on the insert call. Here's what we've found out:
- Baseline throughput. About 238 `version` requests per second.
- Framework overhead. Tends to stay constant across endpoints and below 40 millis. Also, Gunicorn does a pretty good job at splitting up the workload evenly among workers to maximise parallel processing of HTTP requests.
- Notification throughput. About 74 `notify` requests per second. That's about 31% of the server capacity measured in the baseline, so again the server stack wasn't a limiting factor.
- Notification workload distribution inside QuantumLeap. The DB insert takes up only 63% of the `notify` function execution time, something else is eating up the rest.
- Data loss. Under extremely high load, the `notify` endpoint fails to insert about 4% of the incoming NGSI entities into Crate.
- Optimisation potential. As expected, the `notify` endpoint is I/O bound, but notice all the I/O calls are blocking and kernel time is three times greater than user time. Making QuantumLeap do async I/O could definitely boost performance.
Keep in mind throughput figures can vary hugely depending on your test set-up. For example, I could get 50% more requests per second simply by running QuantumLeap outside of Docker. But I've got a sneaking suspicion that, regardless of test environment, the Gunicorn/Flask/Connexion stack isn't a bottleneck for the `notify` endpoint, at least not at this stage.