Gauging Performance
Examples of collecting and analysing performance metrics.
What load can QuantumLeap handle? How many requests per second? What's the framework (Gunicorn/Flask) overhead? And Python's? These and other performance-related questions you might have about QuantumLeap aren't easy to answer, and the answers may vary greatly depending on your server environment (hardware/OS stack) as well as the Gunicorn configuration you chose to run QuantumLeap with, not to mention how you tuned the DB backend of your choice---Crate or Timescale. There are quite a number of knobs you can turn to optimise performance and you'll need quality measurements to understand the effect of turning any one knob---e.g. changing a Gunicorn setting or even using another WSGI app server!
We put together some simple, perhaps simplistic, examples to show how you might go about gauging QuantumLeap's performance in your environment by taking advantage of QuantumLeap's built-in telemetry features. The basic idea is to start QuantumLeap with telemetry on to collect duration and runtime metrics time series and then analyse these series in Pandas to try finding an answer to your questions. (If you haven't done it yet, you should read up about QuantumLeap telemetry before reading on.)
We'll try answering some hairy questions:
1. How many requests per second can the framework QuantumLeap runs in (Gunicorn/Flask/Connexion) potentially handle in my set-up?
2. How much overhead does the framework add to the processing of a request?
3. What's the throughput of the `notify` endpoint? And what's the ratio of DB insert time to total processing time?
QuantumLeap telemetry comes in handy when trying to answer questions like these because we need to know what's going on inside the box (e.g. what amount of time each QuantumLeap component actually contributes to the total processing time) whereas just reading the report of a load testing tool like K6 might not tell you the whole story.
So we're going to:
- run QuantumLeap with the usual Gunicorn settings but with telemetry on;
- make Crate provide the DB backend;
- use a Redis cache to avoid unnecessary metadata lookups done when inserting NGSI entity data;
- hammer QuantumLeap with more HTTP requests than it can possibly handle;
- collect the runtime metrics (GC, memory, user/kernel time, etc.) and duration time series files output by the telemetry component;
- load those files into Pandas and analyse the time series.
To keep things simple, we'll run all the involved components locally.
The Docker Compose file in `src/tests/benchmark` defines the QuantumLeap, Crate and Redis services we'll use. The shell scripts (one for each test scenario) take care of building a QuantumLeap Docker image, starting the Docker Compose services, bombarding QuantumLeap with concurrent HTTP requests, and finally stopping the Docker Compose services. The `_monitoring` directory (mounted on the QuantumLeap container) is where all the telemetry files wind up.
Surely it must've already occurred to you that this kind of set-up is far from ideal for a performance benchmark and that the numbers you're going to get won't be representative of real-life scenarios where e.g. you have a distributed pool of QuantumLeap services fronting a dedicated Crate cluster. But it'll do for our purpose of showing how you could use QuantumLeap telemetry to gauge performance in your own environment.
Speaking of which, QuantumLeap telemetry gets configured in `gconfig_telemetry.py`, the Gunicorn config we start QuantumLeap with in all our example test scenarios below. With this configuration, the server times each HTTP request, building a time series of request durations for each combination of HTTP path and verb. (Open up `gconfig_telemetry.py` to see exactly how.) One thing to keep in mind is that each duration in a series tells you how long a Gunicorn worker took to service a certain kind of request, but it doesn't account for the time it took the master process to dispatch the request to the worker. To get a rough estimate of how long Gunicorn takes to accept and dispatch requests, we use the same telemetry module client side too. We've whipped together a simple Python script, `threaded_driver.py`, which our shell scripts call to do all the hammering. The Python script times each HTTP request too, outputting the corresponding time series in the `_monitoring` directory so at the end of a test session we can easily load both server and client data sets into Pandas. Sweet.
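By the way, here's a hedged sketch of what that client-side timing amounts to. The actual `threaded_driver.py` also initialises the telemetry monitor so the collected series get written to `_monitoring`, and it folds the HTTP status code into the series label (that's where the `client:version:200` label you'll see later comes from); the URL below assumes QuantumLeap is listening locally on its default port.
import requests
from server.telemetry.monitor import time_it

QL_BASE_URL = 'http://localhost:8668'   # assumed: QuantumLeap running locally

def timed_version_call():
    # time the whole HTTP round trip, mirroring what the server does for each request
    with time_it(label='client:version'):
        return requests.get(QL_BASE_URL + '/version')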
To be able to reproduce the examples below yourself, you'll first have to run these commands:
$ cd /path/to/ngsi-timeseries-api
$ source setup_dev_env.sh
$ pipenv install --dev
$ cd src/tests/benchmark
To get our feet wet, we'll try answering questions (1) and (2) to find out the limits of the Web server framework (Gunicorn/Flask/Connexion) in our environment. Obviously Gunicorn configuration plays a significant role here, so, a bit more accurately, we should say we're trying to see how the framework performs when Gunicorn is configured with QuantumLeap's default settings of:
- a pre-forked server model with multi-threading (`gthread`);
- a worker pool of `n * 4 + 1` processes where `n` is the number of CPU cores;
- one thread per worker;
- error log level.
In fact, these are the settings you'll find in `gconfig_telemetry.py`, the configuration file we start QuantumLeap with---have a look at the Docker Compose file.
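For orientation, the worker-related part of that configuration boils down to something like the sketch below. The values here are assumed from the list above rather than copied from the file, and the real `gconfig_telemetry.py` also hooks in the telemetry instrumentation, which we leave out.
import multiprocessing

bind = '0.0.0.0:8668'                           # assumed: QuantumLeap's default port
workers = multiprocessing.cpu_count() * 4 + 1   # a pool of n * 4 + 1 worker processes
worker_class = 'gthread'                        # pre-forked server model with threads
threads = 1                                     # one thread per worker
loglevel = 'error'                              # error log level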
To make sure in our tests we measure the framework's performance rather than QuantumLeap's, we should hit an endpoint that's as close as possible to a noop. What better endpoint than `version` then? It just returns a little hard-coded JSON object. We'll also wrap the `version` function with a timer decorator to collect a time series of execution times for it:
from server.telemetry.monitor import time_it

@time_it(label='version()')
def version():
    ...
Now we're ready to drop the bomb. Run
$ ./baseline-load-test.sh
The script builds a QuantumLeap Docker image, starts the Docker Compose services, spawns a Python process to do 10,000 concurrent HTTP `GET`s on the `version` endpoint (using 10 threads), and then stops the Docker Compose services. Once the script is done we can look at the telemetry data by loading it into Pandas.
We whipped together a simple Python script with a couple of convenience functions that string together Pandas calls to help us explore the data we collected. Load it into the Python interpreter
$ python -i analysis.py
If all went well, there should be a `db` variable in the global scope, an instance of `TelemetryDB` (from `pandas_import`) making all the duration and runtime telemetry data found in the monitoring directory available through Pandas frames and series. Let's print a summary of the duration time series we collected
>>> print_measurements_summaries(db.duration())
Time series: /version [GET]
count 10000.000000
mean 0.013604
std 0.015712
min 0.000831
25% 0.003771
50% 0.008827
75% 0.018593
90% 0.031005
95% 0.039584
max 0.477556
Time series: version()
count 10000.000000
mean 0.000142
std 0.001022
min 0.000008
25% 0.000017
50% 0.000021
75% 0.000023
90% 0.000060
95% 0.000191
max 0.025543
Time series: client:version:200
count 10000.000000
mean 0.039894
std 0.024127
min 0.007079
25% 0.025884
50% 0.034068
75% 0.046459
90% 0.065372
95% 0.083069
max 0.522838
and check how requests were distributed among workers
>>> get_version = db.duration().time_series('/version [GET]')
>>> print_series_summary(get_version, by_pid = True)
Time series: /version [GET] | PID 8
count 1159.000000
...
Time series: /version [GET] | PID 10
count 1092.000000
...
Time series: /version [GET] | PID 12
count 1091.000000
...
Time series: /version [GET] | PID 14
count 1075.000000
...
Time series: /version [GET] | PID 16
count 1097.000000
...
Time series: /version [GET] | PID 18
count 1079.000000
...
Time series: /version [GET] | PID 20
count 1094.000000
...
Time series: /version [GET] | PID 22
count 1166.000000
...
Time series: /version [GET] | PID 24
count 1147.000000
...
You can easily do a similar analysis for runtime series: Python GC, RAM, etc. (Have a look at the code to see what's available.) For example, here's how to summarise the memory usage series of each worker process
>>> print_series_summary(db.max_rss(), by_pid = True)
Back to our quest for a performance baseline. From the above summaries we can see we got hit exactly 10,000 times and the three means look like fairly good representatives of their respective data sets:
Time series: /version [GET]
mean 0.013604
Time series: version()
mean 0.000142
Time series: client:version:200
mean 0.039894
Then, as expected, the mean of the `version()` time series tells us the `version` function returns almost instantly. Each duration in the `/version [GET]` series tells you how long a Gunicorn worker took to service a request but it doesn't account for the time it took the master process to dispatch the request to the worker. Hence, we can say that Flask/Connexion overhead per call sits at around 0.013 seconds. On average, the client saw each request complete in about 0.040 secs, so Gunicorn overhead can't cost us more than 0.026s a call---this is an upper bound since it factors in network latency, client I/O and context switches too!
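If you'd rather compute those estimates than eyeball them from the summaries, the same `TelemetryDB` accessors we used above will do (the variable names below are just ours):
srv = db.duration().time_series('/version [GET]').data()          # server-side request durations
fn = db.duration().time_series('version()').data()                # bare version() execution times
client = db.duration().time_series('client:version:200').data()   # client-observed durations

framework_overhead = srv.mean() - fn.mean()         # ~0.013s: Flask/Connexion overhead per call
dispatch_upper_bound = client.mean() - srv.mean()   # ~0.026s: upper bound on Gunicorn dispatch,
                                                    # network latency and client I/O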
Now, how many requests per second can we handle?
>>> get_version = db.duration().time_series('/version [GET]')
>>> rps = measurements_per_second(get_version)
>>> rps.describe()
mean 238.095238
min 119.000000
25% 214.000000
50% 236.500000
75% 277.000000
max 300.000000
>>> rps.plot()
# displays a graph of requests per second over time
Even though the data was collected from within the server, we'd expect the requests per second measured at the client to be pretty much the same since we're on the same host. In fact, that is the case.
>>> get_version_client = db.duration().time_series('client:version:200')
>>> rps = measurements_per_second(get_version_client)
>>> rps.describe()
mean 238.095238
min 118.000000
25% 215.250000
50% 235.000000
75% 273.000000
max 301.000000
So basically both the server and the client observed a throughput of
about 238
requests per second. Plus, server-side work got distributed
evenly and done pretty much in the same amount of millis each time.
We expect the server to have been busy doing network I/O in the kernel
most of the time and in fact
>>> ratio, r_series = sum_by_second_ratio(db.user_time(), db.system_time())
>>> ratio # total user time / kernel time
3.148073238423002
and plotting the ratio series confirms it
>>> r_series.plot()
(Quit the Python interpreter before moving on.)
So we sketched out how you might go about answering questions (1) and (2). Let's move on to question (3): what's the throughput of the `notify` endpoint and what's the ratio of DB insert time to total processing time? We'll need to collect durations for the `notify` endpoint as well as the insert call in the SQL translator. The telemetry config we have already collects duration time series for HTTP endpoints, but we'll also plonk in a couple of timers to collect durations of the insert call in the SQL translator and the Reporter's `notify` method. Here's how:
from server.telemetry.monitor import time_it

# in the Reporter
@time_it(label='notify()')
def notify():
    ...

# in the SQL translator's _insert_entities_of_type method
with time_it(label='db.insert'):
    self.cursor.executemany(stmt, rows)
We're going to use the same 10,000 request bomb from earlier, except this time `threaded_driver.py` will `POST` an NGSI entity to the `notify` endpoint. (Open up the script to see what's going on under the bonnet.)
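For a feel of what each of those requests looks like, here's a hedged sketch; the entity id, type and attribute below are made up for illustration and the actual payload is the one defined in `threaded_driver.py`.
import requests

QL_NOTIFY_URL = 'http://localhost:8668/v2/notify'   # assumed: QuantumLeap running locally

# A minimal NGSIv2 notification body: a list of entities under the 'data' key.
notification = {
    'data': [{
        'id': 'Room:1',                                      # made-up entity
        'type': 'Room',
        'temperature': {'type': 'Number', 'value': 23.5}
    }]
}

response = requests.post(QL_NOTIFY_URL, json=notification)
print(response.status_code)   # 200 if QuantumLeap inserted the entity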
Run
$ ./notify-load-test.sh
...
>>> Count of successful HTTP responses by status code:
HTTP 500: 405
HTTP 200: 9595
...
O deary deary. 405 requests failed with a server error. Ouch. Anyway, just like earlier on, we fire up the Python interpreter to make sense of the telemetry data.
$ python -i analysis.py
Bird's eye view of the three duration series we collected:
>>> print_measurements_summaries(db.duration())
Time series: /v2/notify [POST]
count 10000.000000
mean 0.102399
std 0.166758
min 0.019514
25% 0.074256
50% 0.093793
75% 0.117056
90% 0.140944
95% 0.158496
max 5.933852
Time series: notify()
count 10000.000000
mean 0.083487
std 0.166569
min 0.013632
25% 0.056338
50% 0.073574
75% 0.095656
90% 0.120594
95% 0.139385
max 5.931297
Time series: db.insert
count 9595.000000
mean 0.050841
std 0.032493
min 0.007530
25% 0.027878
50% 0.041102
75% 0.064695
90% 0.096074
95% 0.117234
max 0.328003
Time series: client:notify:200
count 9595.000000
mean 0.140242
std 0.196658
min 0.031160
25% 0.090904
50% 0.117998
75% 0.166469
90% 0.224326
95% 0.254857
max 6.092773
Time series: client:notify:500
count 405.000000
mean 0.187196
std 0.101088
min 0.044285
25% 0.123096
50% 0.157361
75% 0.219076
90% 0.297212
95% 0.340895
max 0.768078
Puzzling numbers. We did get hit 10,000 times but only managed to insert 9,595 entities out of the 10,000 the `notify` method got---we lost about 4% of the incoming data! Insert times were pretty uniform across the board whereas the `notify` durations experienced some huge spikes.
We won't try to understand what went wrong, why we lost data and had those spikes. Let's leave that for another day. But we should select a more uniform subset of the data to get more accurate answers to our questions since some requests took longer than six seconds which could skew our computations.
>>> notify_post = db.duration().time_series('/v2/notify [POST]').data()
>>> notify_fn = db.duration().time_series('notify()').data()
>>> db_insert = db.duration().time_series('db.insert').data()
Plotting the data to visualise those spikes
>>> notify_post.plot()
>>> notify_fn.plot()
shows there was actually only one spike in each series which, in both cases, happened at about ten seconds into the session. Even if we cut off the first ten seconds of each series, we still have loads of data left for analysis, so cut off is what we'll do.
>>> cutoff_ix = notify_post.idxmax() # min index attaining series max
>>> notify_post = notify_post[notify_post.index > cutoff_ix]
>>> notify_fn = notify_fn[notify_fn.index > cutoff_ix]
>>> db_insert = db_insert[db_insert.index > cutoff_ix]
How many requests per second?
>>> notify_post.resample('1S').count().describe()
mean 73.746154
min 12.000000
25% 72.000000
50% 76.000000
75% 80.000000
max 86.000000
Comparing these numbers with the baseline requests per second figures from earlier, we can say we're using our server stack at about 31% of its capacity, hence the framework can't possibly be a limiting factor. Even if e.g. Gunicorn were twice as fast at handling requests, `notify` throughput wouldn't be much higher than it is now. Also, the per-request overhead of Flask/Connexion is 0.018s---basically in the same region as the baseline, which is good news:
>>> notify_post.describe()
count 9587.000000
mean 0.096304
std 0.033002
min 0.019514
25% 0.073530
50% 0.092602
75% 0.115289
max 0.318799
>>> notify_fn.describe()
count 9587.000000
mean 0.078164
std 0.031723
min 0.013632
25% 0.055968
50% 0.072959
75% 0.094788
max 0.315395
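If you want to double-check the 0.018s overhead figure and the 31% capacity claim from the series at hand, a quick sanity check could look like this (238 is the baseline throughput we measured earlier):
overhead = notify_post.mean() - notify_fn.mean()        # ~0.018s Flask/Connexion overhead per call
notify_rps = notify_post.resample('1S').count().mean()  # ~74 notify requests per second
capacity_usage = notify_rps / 238                       # ~0.31 of the baseline throughput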
Another thing we wanted to know is the ratio of DB insert time to total QuantumLeap processing time---the amount of time the `notify` function took to complete vs the time it took to do the DB insert. We know some entities got sucked into the sink of no return, so to be sure to compare apples with apples, no insert loss should've happened in the data subset we chose. Is that so?
>>> notify_fn.size, db_insert.size
(9587, 9587)
Cool, we're in business. Let's compute the ratio series
>>> notify_fn_sum_by_sec = notify_fn.resample('1S').sum()
>>> db_insert_sum_by_sec = db_insert.resample('1S').sum()
>>> r_series = db_insert_sum_by_sec / notify_fn_sum_by_sec # r[k] = d[k]/n[k]
>>> r_series.describe()
mean 0.633801
std 0.136688
min 0.340319
25% 0.534223
50% 0.596363
75% 0.691890
max 0.977415
It looks like the mean is a fairly good representative of this ratio over time. Good, then the insert should account for about 63% of the time of each `notify` call. What the heck did QuantumLeap do in the remaining 37% of the time?! Johnny Perf Sleuth has his work cut out... Anyway, with at least 63% of the time spent doing blocking I/O, it looks like the `notify` endpoint is in desperate need of an async I/O face-lift. The ratio of user time to kernel time should just confirm that:
>>> _, r_series = sum_by_second_ratio(db.user_time(), db.system_time())
>>> r_series.describe()
mean 3.074184
std 0.597316
min 0.766313
25% 3.234516
50% 3.270670
75% 3.284568
max 3.332187
Yoh! Kernel time is more than three times the user time.
Well, it's been one helluva ride. We saw how to use QuantumLeap telemetry and Pandas to get an idea of how many requests per second the Web server can potentially handle, come up with an estimate for framework overhead, see how many entities per second we can shovel down the `notify` endpoint and in what proportion its performance depends on the insert call. Here's what we've found out:
- Baseline throughput. About 238 `version` requests per second.
- Framework overhead. Tends to stay constant across endpoints and below 40 millis. Also, Gunicorn does a pretty good job at splitting up the workload evenly among workers to maximise parallel processing of HTTP requests.
- Notification throughput. About 74 `notify` requests per second. That's about 31% of the server capacity measured in the baseline, so again the server stack wasn't a limiting factor.
- Notification workload distribution inside QuantumLeap. The DB insert takes up only 63% of the `notify` function execution time, something else is eating up the rest.
- Data loss. Under extremely high load, the `notify` endpoint fails to insert about 4% of the incoming NGSI entities into Crate.
- Optimisation potential. As expected, the `notify` endpoint is I/O bound, but notice all the I/O calls are blocking and kernel time is three times greater than user time. Making QuantumLeap do async I/O could definitely boost performance.
Keep in mind throughput figures can vary hugely depending on your test set-up. For example, I could get 50% more requests per second simply by running QuantumLeap outside of Docker. But I've got a sneaking suspicion that, regardless of test environment, the Gunicorn/Flask/Connexion stack isn't a bottleneck for the `notify` endpoint, at least not at this stage.