Serves LLaVA inference over HTTP. The server supports batched inference and caches the embeddings of each image, so that generating multiple responses for the same image does not recompute them.
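The caching idea can be sketched roughly as follows. This is a minimal illustration, not the server's actual code: the class name `EmbeddingCache`, the `encode_batch` callable, and the `max_items` limit are all hypothetical. It assumes images arrive as raw bytes and that some batched encoder function is available.

```python
import hashlib
from collections import OrderedDict
from typing import Any, Callable, List, Sequence


class EmbeddingCache:
    """Small LRU cache mapping image bytes to embeddings (hypothetical helper)."""

    def __init__(self, encode_batch: Callable[[List[bytes]], Sequence[Any]],
                 max_items: int = 256) -> None:
        # encode_batch is an assumed stand-in for the model's image encoder.
        self._encode_batch = encode_batch
        self._max_items = max_items
        self._store: "OrderedDict[str, Any]" = OrderedDict()

    @staticmethod
    def _key(image_bytes: bytes) -> str:
        # Identical image content hashes to the same cache key.
        return hashlib.sha256(image_bytes).hexdigest()

    def get_batch(self, images: Sequence[bytes]) -> List[Any]:
        keys = [self._key(img) for img in images]

        # Collect each uncached image once, even if it repeats in the batch.
        missing = {}
        for key, img in zip(keys, images):
            if key not in self._store and key not in missing:
                missing[key] = img

        # Encode all uncached images in a single batched call.
        if missing:
            embeddings = self._encode_batch(list(missing.values()))
            for key, emb in zip(missing, embeddings):
                self._store[key] = emb

        # Read results in request order and mark them as recently used.
        results = []
        for key in keys:
            self._store.move_to_end(key)
            results.append(self._store[key])

        # Evict least recently used entries beyond the size limit.
        while len(self._store) > self._max_items:
            self._store.popitem(last=False)
        return results
```

With a cache like this, a request that asks several questions about one image would trigger the encoder only once for that image; later requests for the same image would skip the encoder entirely until the entry is evicted.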
Start the server with:
gunicorn "app:create_app()"
To change the number of GPUs, edit gunicorn.conf.py.
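For orientation, a gunicorn.conf.py that runs one worker per GPU might look roughly like the sketch below. This is an assumption about the layout, not the project's actual file: `NUM_GPUS`, `gpu_for_worker`, the bind address, and the timeout value are illustrative. `bind`, `workers`, `timeout`, and the `post_fork(server, worker)` hook are standard Gunicorn settings, and `worker.age` is the spawn counter Gunicorn assigns to each worker.

```python
# gunicorn.conf.py (illustrative sketch, not the project's actual file)
import os

NUM_GPUS = 2  # assumption: edit this to match the GPUs available

bind = "127.0.0.1:8000"  # assumption: any free address and port works
workers = NUM_GPUS       # assumption: one worker process per GPU
timeout = 120            # assumption: model loading and generation are slow


def gpu_for_worker(age: int, num_gpus: int = NUM_GPUS) -> int:
    """Map a worker's spawn counter (starting at 1) to a GPU index."""
    return (age - 1) % num_gpus


def post_fork(server, worker):
    # Runs in the worker process right after it is forked, so each worker
    # sees only its assigned GPU before any CUDA library is initialized.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_for_worker(worker.age))
```

One caveat with this mapping: a restarted worker receives a new, higher `age`, so after a crash two live workers could land on the same GPU. A production setup would likely need to track which GPU indices are free.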