Review Gunicorn worker type #1062

Open
SpacemanPaul opened this issue Aug 12, 2024 · 5 comments

SpacemanPaul (Contributor) commented Aug 12, 2024

Description

PR #1061 removed the --threads argument to gunicorn as it was incompatible with the chosen worker type (gevent) and was not being read.

Should we switch to the gthread worker type and reinstate the --threads option?
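
For concreteness, switching would look roughly like the config below (a sketch only; the entry-point path and the worker/thread counts are illustrative guesses, not our current deployment values):

```python
# gunicorn.conf.py: hypothetical sketch of the gthread alternative
import multiprocessing

wsgi_app = "datacube_ows.wsgi:application"  # assumed entry point, adjust as needed
worker_class = "gthread"                    # threaded worker, instead of "gevent"
workers = multiprocessing.cpu_count()       # illustrative count
threads = 4                                 # the threads setting only affects the gthread worker
timeout = 120
```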

AFAIAA the only thread-safety issue in OWS is the use of non-re-entrant matplotlib code in automated legend generation, which could easily be wrapped in a thread lock. Given that legend requests make up only a tiny fraction of incoming requests in normal use, a thread lock here shouldn't be a problem.
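
A minimal sketch of that lock (the decorator and the legend function name here are hypothetical, not OWS's actual API):

```python
import threading
from functools import wraps

_legend_lock = threading.Lock()

def serialised(func):
    """Run func under a process-wide lock so non-re-entrant matplotlib calls never overlap."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        with _legend_lock:
            return func(*args, **kwargs)
    return wrapper

@serialised
def make_legend_png(style):
    # hypothetical stand-in for the real legend-generation code path
    ...
```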

omad (Member) commented Aug 12, 2024

100% it was dumb to have --threads with the gevent worker type! I thought I'd removed it months ago, but it must have been in Explorer, not here.

There are issues beyond thread safety: we also need to actually get better concurrency. At the least, anything that's doing blocking IO needs to release the GIL.

We can postulate as much as we want, but the only way to know is empirical testing.

Adding a thread lock to the legend generation sounds wise!

benjimin (Contributor) commented:

We absolutely ought to be profiling OWS in production and/or using OWS benchmarks to decide.

The gevent choice is meant to monkey-patch our code to facilitate asynchronous Python sockets (as async coroutines supposedly work very well for Python networking), but I'm guessing (having not profiled it) that a lot of OWS handler latency is spent on network exchanges performed inside compiled GDAL or postgres libraries (which gevent.monkey might fail to wrap, and instead block on?).

If GDAL (and the postgres driver) are now thread-safe, then native threads (with a lot more than the previous 2 threads per worker, probably limited by raster memory requirements?) might help ensure the pod doesn't waste resources leaving its CPU waiting on IO (when saturated with requests). If not, we could experiment with much higher numbers of simple (e.g. sync) worker processes.
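
As a quick sanity check of what gevent's monkey-patching actually covers, something like the sketch below could help (module names are just examples; pure-Python modules like socket get cooperative replacements, while compiled extensions never appear in gevent's patch list and so would block the event loop):

```python
from gevent import monkey
monkey.patch_all()  # must run before anything else imports socket/ssl/etc.

import socket

print(socket.socket)                          # gevent's cooperative socket class
print(monkey.is_module_patched("socket"))     # True: patched at the Python level
print(monkey.is_module_patched("psycopg2"))   # False: C extensions are untouched,
                                              # so their network I/O blocks the hub
```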

SpacemanPaul (Contributor, Author) commented Aug 13, 2024

The Postgres index driver is thread-safe*. According to rasterio's documentation, GDAL is pretty good about releasing the GIL: https://rasterio.readthedocs.io/en/stable/topics/concurrency.html
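
The pattern from that page looks roughly like this (a sketch only; the file path and window layout are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

import rasterio
from rasterio.windows import Window

def read_window(path, window):
    # Each thread opens its own dataset handle; GDAL releases the GIL while
    # decoding, so the reads can overlap with other Python work.
    with rasterio.open(path) as src:
        return src.read(1, window=window)

if __name__ == "__main__":
    path = "example.tif"  # placeholder
    windows = [Window(col_off, 0, 512, 512) for col_off in range(0, 2048, 512)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        chunks = list(pool.map(lambda w: read_window(path, w), windows))
```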

In OWS 1.8, the index driver has a whole other layer of multi-thread-proofing that does nothing useful (and possibly never did); this is removed in OWS 1.9.

benjimin (Contributor) commented:

Releasing the GIL is one thing, but is it thread-safe? (I thought it used to be impossible to read two different files from different threads, because a lot of GDAL API settings were globals?)

https://gdal.org/user/multithreading.html

Unless otherwise stated, no GDAL public C functions and C++ methods should be assumed to be thread-safe.

benjimin (Contributor) commented Aug 13, 2024

I think we should create a regression test in this repo, with a dummy index in a postgres container and a dummy data collection (maybe hosted by nginx, or maybe some mock S3). The test runner would artificially add significant latency to all packets exchanged during the test (so instead of the test executing instantly, the database query takes 2s and each raster retrieval takes 2s), then hit the OWS container with external async requests so that we can time the concurrent request handling (and fail if it is as slow as handling the requests sequentially).
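
The timing assertion could be roughly shaped like this (a sketch: the endpoint URL, request count and injected delay are placeholders for whatever the dummy stack ends up using):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

OWS_URL = "http://localhost:8000/wms?service=WMS&request=GetCapabilities"  # placeholder
N_REQUESTS = 8
INJECTED_DELAY = 2.0  # seconds of artificial latency per backend exchange

def test_requests_are_handled_concurrently():
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=N_REQUESTS) as pool:
        statuses = list(pool.map(
            lambda _: requests.get(OWS_URL, timeout=60).status_code,
            range(N_REQUESTS)))
    elapsed = time.monotonic() - start

    assert all(status == 200 for status in statuses)
    # Sequential handling would take roughly N_REQUESTS * INJECTED_DELAY;
    # concurrent handling should be much closer to a single request's latency.
    assert elapsed < N_REQUESTS * INJECTED_DELAY / 2
```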

For nginx-hosted assets you could just use an echo_sleep module, but to cover all services we could try something like tc qdisc add dev eth0 root netem delay 2000ms (maybe lo instead of eth0; would need to fiddle more). Actually, here's an example:

https://medium.com/@kazushi/simulate-high-latency-network-using-docker-containerand-tc-commands-a3e503ea4307
