Replies: 4 comments 4 replies
-
@viniarck, excellent documentation, thanks for putting so much effort into describing the changes. Other than the unit tests, what kind of impact should we expect for the end-to-end tests in terms of refactoring? The results are impressive and I am ok with it for 2023.1, I just need more data about the effort to make all the changes.
-
Hi @viniarck, have you also considered quart + hypercorn combined with flask?
-
For the record, since …
-
I appreciated everybody's feedback on this discussion. We'll go for it. I'll start to map the related tasks to carry on the implementation. If unexpected major blockers show up, they'll be handled as issues. I'll close this issue since it has stayed open long enough collecting feedback.
-
This discussion is for issue #301.
These are the problems being solved and the proposed solutions:
Problems:
- `SocketIO` should be ready for running in production out of the box #168

Proposed Solutions (to the respective problems):

- Replace `werkzeug` with `uvicorn`, which is well maintained and one of the most battle-tested and widely used ASGI servers in Python land, and it supports being programmatically embedded and shut down.
- Replace `flask` (and `flask-socketio`) with `starlette`, which is the base async framework that FastAPI is built on top of; it's one of the most widely used today in many projects, and is consistently well ranked in benchmarks (as shown in the figure below).
`Flask` 2+ was an important intermediary milestone in Kytos-ng 2022.3 that unlocked `async` routes, but it was still running on the `werkzeug` server, so it wasn't async turtles all the way down; async was bolted on while remaining compatible as a WSGI server. Moving more towards `asyncio` instead of `gevent` or `eventlet` (which `flask-socketio` recommends) is the way to go, since Python upstream is developing heavily towards `async` and most of the well known web/backend libs are also moving in this direction; consequently, our team can leverage well maintained upstream code and libraries.
Initially, I was inclined to propose `FastAPI` instead of `starlette`. `FastAPI` essentially brings `starlette` + `pydantic` + leveraging typing a bit more + OpenAPI generation utilities. However, we have already invested in extensive openapi.yml documentation on NApps, and we already have a plan to move `openapi-core` from `mef_eline` to core to be reused, so we'd only use `pydantic` to validate DB models; and since we're already using `openapi-core`, in the future we can leverage openapi-schema-validator to reuse the schema components/models from the openapi specs to also validate KytosEvent content. Also, we wouldn't benefit much from the auto generated OpenAPI that FastAPI provides, since we've already been maintaining it in a dedicated file instead. `typing` is welcome and we'll continue to use it. Consequently, `starlette` is more suitable: it doesn't introduce too much that could overlap with existing code base functionalities/responsibilities, it doesn't require much extra effort or bring additional liabilities, and `starlette` is also more likely to be around for years to come, since `FastAPI` wouldn't even exist without it anyway. If one day we decide not to maintain external OpenAPI files, we can revisit this part of the discussion.
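To illustrate the openapi-schema-validator idea, here's a minimal sketch of reusing a component schema from an openapi.yml to validate a KytosEvent content dict. The spec path, schema name, and payload below are hypothetical, not taken from an actual NApp:

```python
# Hypothetical sketch: validate a KytosEvent content dict against a reusable
# schema from an openapi.yml (schema name and payload are illustrative).
import yaml
from openapi_schema_validator import validate  # openapi-schema-validator package

with open("openapi.yml") as spec_file:
    spec = yaml.safe_load(spec_file)

# Pick a reusable schema from the spec's components section.
circuit_schema = spec["components"]["schemas"]["NewCircuit"]

event_content = {"name": "evc_1", "enabled": True}
validate(event_content, circuit_schema)  # raises a jsonschema ValidationError if invalid
```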
`uvicorn` with `starlette` won't completely solve every type of instability when a high rate of requests is sent over a long period of time, but `asyncio` Tasks are less resource intensive than Threads, so they can contribute to more throughput and stability for IO-bound parts. Only rate limiting will completely mitigate the original problem, but with asyncio/uvicorn/starlette, as the experiments below will show, our HTTP endpoints will be much more stable and predictable in terms of latency and responses. In fact, the original issue High rate of requests can lead to runtime instability #225 no longer results in instability resetting client connections.

Initial Experiments:
I researched and conducted the following pre-requisite experiments to confirm that the proposed solutions would fit well:
e1) Confirm that threadpools are working well, stress test also with at least 300 req/sec during 1 min

- `uvicorn` handling 500 req/sec on a route similar to `GET topology/v3` without breaking a sweat, with the 95th percentile under 78 ms over 1 minute:
- `werkzeug` trying to handle 500 req/s on `GET topology/v3` (scenario from issue #225):
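For context, here's a minimal sketch of the kind of app that can be stress tested this way; the routes and payload are placeholders, not the actual `topology` NApp code. With `starlette`/`uvicorn`, plain `def` endpoints run in a threadpool while `async def` endpoints run on the event loop:

```python
# Minimal illustrative app (not the actual kytos/topology code): starlette runs
# "def" endpoints in a threadpool and "async def" endpoints on the event loop.
import uvicorn
from starlette.applications import Starlette
from starlette.responses import JSONResponse
from starlette.routing import Route


def get_topology(request):  # sync endpoint -> offloaded to a threadpool
    return JSONResponse({"topology": {"switches": {}, "links": {}}})


async def get_topology_async(request):  # async endpoint -> runs on the event loop
    return JSONResponse({"topology": {"switches": {}, "links": {}}})


app = Starlette(routes=[
    Route("/api/kytos/topology/v3/", get_topology),
    Route("/api/kytos/topology/v3/async", get_topology_async),
])

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8181)
```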
e2) Make sure APM instrumentation is capturing requests/responses as expected

- `starlette` is supported by Elastic APM, I've also confirmed it in practice, and `pymongo` instrumentation still works as expected:
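As a rough sketch of the wiring that was verified (the service name and APM server URL below are placeholders), Elastic APM ships a starlette middleware:

```python
# Hedged sketch: instrumenting a starlette app with Elastic APM
# (service name and server URL are placeholders).
from elasticapm.contrib.starlette import ElasticAPM, make_apm_client
from starlette.applications import Starlette

apm_client = make_apm_client({
    "SERVICE_NAME": "kytos",               # placeholder service name
    "SERVER_URL": "http://localhost:8200",  # placeholder APM server URL
})

app = Starlette(routes=[])
app.add_middleware(ElasticAPM, client=apm_client)
```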
e3) Make sure `uvicorn` won't conflict with `kytosd` console ctrl-d:

- `uvicorn` embedded server shutdown capabilities worked as expected, and it shuts down gracefully, including from the console (that `[uvicorn.error]` log entry is at `INFO` level; the `uvicorn` team picked an unfortunate logger name for some modules):
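For reference, a minimal sketch of embedding `uvicorn` programmatically with a graceful shutdown hook; how `kytosd` actually wires this into its lifecycle is not shown here:

```python
# Hedged sketch of embedding uvicorn programmatically and shutting it down
# gracefully (the actual kytosd integration may differ).
import asyncio

import uvicorn
from starlette.applications import Starlette

app = Starlette(routes=[])


async def main():
    config = uvicorn.Config(app, host="127.0.0.1", port=8181, log_level="info")
    server = uvicorn.Server(config)
    serve_task = asyncio.create_task(server.serve())

    await asyncio.sleep(5)      # stand-in for the controller's lifetime
    server.should_exit = True   # ask uvicorn to shut down gracefully (e.g., on console ctrl-d)
    await serve_task


if __name__ == "__main__":
    asyncio.run(main())
```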
e4) Make sure `openapi-core` openapi.yml validator is still compatible:

- `openapi-core` 0.16+ supports it; we'll need to upgrade this dependency. I've also quickly prototyped it to double confirm: they had some breaking changes in some Python imports, but it works:
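As a small, hedged illustration of the import change (the exact validator wiring in kytos isn't shown here), newer `openapi-core` releases load specs through the `Spec` class instead of the old `create_spec` helper:

```python
# Hedged sketch for openapi-core 0.16+: spec loading moved to the Spec class.
from openapi_core import Spec

spec = Spec.from_file_path("openapi.yml")
# The resulting Spec object is what openapi-core's request/response validators consume.
```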
e5) Adapt `rest` decorator

The `rest` decorator will still be compatible as a drop-in: if the handler is a `coroutine` then it'll run in the asyncio event loop context, otherwise it'll use the `starlette`/`uvicorn` ThreadPool (see the sketch after this experiment), so it's entirely compatible with NApps that still use it synchronously, and NApps can become more async gradually when it makes sense. Other than that, since I haven't adapted the `rest` decorator yet, I ended up temporarily duplicating some endpoints to prototype this faster. I ran these experiments:

- `POST /v2/flowsx` is equivalent to `flow_manager/v2/flows/{dpid}`, except it's being served by uvicorn/starlette. Notice that even with `pymongo` using a blocking driver, in this case the 95th percentile latency ended up being 95 times faster, and the request rate was 100 req/sec over 1 min, which is quite expressive for a real use case networking scenario:
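Here's a rough sketch of that dispatch idea; the decorator name and response handling are illustrative, not the actual kytos implementation:

```python
# Hedged sketch of how a rest-style decorator could dispatch handlers under
# starlette: coroutines run on the event loop, plain functions go to the
# threadpool. Illustrative only.
import asyncio
import functools

from starlette.concurrency import run_in_threadpool
from starlette.responses import JSONResponse


def rest_endpoint(handler):
    """Wrap a NApp handler so both sync and async callables are supported."""

    @functools.wraps(handler)
    async def wrapper(request):
        if asyncio.iscoroutinefunction(handler):
            result = await handler(request)                      # async def: event loop
        else:
            result = await run_in_threadpool(handler, request)   # def: threadpool
        return JSONResponse(result)

    return wrapper
```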
e6) Adapt `kytos/lib/helpers.py` `get_test_client(controller, napp)`

I haven't done this yet, but it should be doable, no surprises expected here. We'd also ship `httpx`, which is recommended and maintained by the same uvicorn/starlette team (encode); `httpx` works both synchronously and asynchronously, providing a very convenient interface. As we've been adopting `async` gradually when it makes sense, `httpx` can also replace `requests`, including providing async capabilities when needed, so NApps can still use `requests`, but as they start leveraging more `async` they can switch over.

Example of `httpx` that I demoed before when showing some of the async capabilities that are supported:
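Something along these lines (the URL and payloads are placeholders; this isn't the original demo snippet):

```python
# Hedged httpx example: the same client library works synchronously and
# asynchronously (URL and endpoints are placeholders).
import asyncio

import httpx


def sync_call():
    # Drop-in replacement style for requests
    response = httpx.get("http://localhost:8181/api/kytos/topology/v3/")
    return response.json()


async def async_calls():
    # Async client: fire several requests concurrently on the event loop
    async with httpx.AsyncClient(base_url="http://localhost:8181") as client:
        responses = await asyncio.gather(
            *(client.get("/api/kytos/topology/v3/") for _ in range(10))
        )
    return [resp.status_code for resp in responses]


if __name__ == "__main__":
    print(sync_call())
    print(asyncio.run(async_calls()))
```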
Adjacent opportunities:

I've also taken the opportunity to experiment with an adjacent related library, `motor`, to also have fully async DB calls; this is the last significant blocking IO part left before we have asyncio everywhere on our platform/NApps. With `motor`, the official async `pymongo` driver, we could potentially provide an optional async client when needed, in addition to maintaining `pymongo`. However, it's been shown in practice that, for the upcoming `2023.1` version, it's not worth it at the moment, for these two main reasons:

- `motor` has been async since `tornado` times; it supports `asyncio`, but it's not a first class turtles-all-the-way-down `asyncio` implementation: it runs executors on top of blocking IO, so there are cases where performance will be better, but the MongoDB Python core team acknowledges that on average the results might still be relatively similar.
- Elastic APM doesn't support `motor` yet, only `pymongo`, so `motor` calls wouldn't show up on charts.

Here's an experiment: I simplified two endpoints to upsert `flows` using `/flow_manager/v2/sync_upsert/{dpid}` (flask/werkzeug/pymongo) and `/v2/async_upsert` (uvicorn/starlette/motor). Notice that in this particular case, with 100 req/sec over 1 min, `motor` ended up having slightly worse overall latency:
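For reference, a hedged sketch of roughly what a `motor`-backed upsert handler could look like; the database/collection names, match key, and response body are made up here, not the prototype's code:

```python
# Hedged sketch of an async upsert handler using motor under starlette
# (names and wiring are illustrative).
from motor.motor_asyncio import AsyncIOMotorClient
from starlette.responses import JSONResponse

client = AsyncIOMotorClient("mongodb://localhost:27017")
flows = client["napps"]["flows"]  # database/collection names are illustrative


async def async_upsert(request):
    flow = await request.json()
    result = await flows.update_one(
        {"flow_id": flow["flow_id"]},  # match key is illustrative
        {"$set": flow},
        upsert=True,
    )
    return JSONResponse({"matched": result.matched_count})
```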
Maybe in a case where the synchronous endpoint is blocking too much, `motor` would outperform, but this confirms that, out of the gate, the current implementation of `motor` won't always bring better latencies, even when there's a significant number of requests. So we'll keep an eye on it, see in a next opportunity how `motor` evolves, and see when Elastic APM also supports it. As of now, synchronous `pymongo` driver calls still benefit from `uvicorn` thread pools and the event loop, as shown in experiment e5, which also leaves us in a great position, as we've always been gradually moving to `async`.

Feedback
Let me know your thoughts, suggestions or concerns. This is being planned to be shipped on `2023.1`.