A web application to serve models based on the Fastai v1 library. The application is built on the Molten framework due to its built-in support for modern tooling: Swagger, OpenAPI, Prometheus, etc.
I wanted an application which allows me to serve ML models in real production settings. The application should provide:
- Versioned APIs
- Metrics (prometheus)
- Ability to run different versions of models without unnecessary code changes
- User friendly API documentation
- Input data validation
- Authentication
- Inference device flexibility (CPU or GPU)
- Scale up or down instances based on incoming traffic (Kubernetes)
This lets me focus on improving my ML model and not worry about how to serve it in production.
To test the current model, run the following commands to set things up. You will need Python 3.6 or above and the git-lfs package installed.
git clone --depth 1 https://github.com/gurvindersingh/mlapp.git
cd mlapp
virtualenv .ve
source .ve/bin/activate
pip install -r requirements.txt
pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
gunicorn app:app
Now you can visit localhost:8000 to see the Swagger UI and test the APIs out, or you can run curl as follows:
curl -X POST "http://localhost:8000/v1/predict" -F "file=@loadtest/keeshond.jpg"
You can change config.json based on your own model, classes, and architecture to see predictions from your own model (see the config reference below). Make sure your .pth file is placed with the correct name under the models directory.
To run the prebuilt Docker images instead:
CPU: docker run -ti --rm -p 8000:8000 gurvin/mlapp:v0.1.0-cpu
GPU & CPU: docker run -ti --rm -p 8000:8000 gurvin/mlapp:v0.1.0
The application uses config.json to provide various options.
{
"model_name": "v1", -> Specify the model from multiple models in your config as which one you want to run with this instance
"token": "", -> To enable authenticaiton to you APIs
"v1": { -> Model name and its corresponding config. These can be different based on your model
"device": "cpu", -> Device on which to run inference (cpu, cuda)
"url": "", -> URL to fetch the weights file .pth from. If empty app assumes local file under models dir
"arch": "resnet34", -> Model Architecutre
"size": 224, -> Image size
"classes": [] -> Classes used during training from which prediciton will happen
}
}
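To give a sense of how these options are consumed, here is a minimal sketch, assuming the keys shown above, of reading config.json and selecting the active model's settings; the variable names and loading logic are illustrative and not necessarily how the application implements it.

```python
# Illustrative sketch: read config.json and pick the active model's settings.
# Only the keys come from the config shown above; the rest is an assumption.
import json

with open("config.json") as f:
    cfg = json.load(f)

model_cfg = cfg[cfg["model_name"]]  # e.g. the "v1" block
device = model_cfg["device"]        # "cpu" or "cuda"
arch = model_cfg["arch"]            # e.g. "resnet34"
size = model_cfg["size"]            # image size used during training
classes = model_cfg["classes"]      # class labels to predict from
```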
The application design is simple. app.py
sets up the routes and required methods to get data in and out. The following routes are exposed by the application (a minimal sketch of the route wiring follows this list):
/ -> Returns health information and serves as the liveness check for the application
/_docs -> Serves Swagger UI to provide API docs and user friendly API testing
/_schema -> JSON-formatted API spec in the OpenAPI standard
/metrics -> Endpoint to scrape metrics by Prometheus
/v1/predict -> Receives end user data to make prediction on
/v1/feedback -> User can provide feedback on a prediction if there were any errors
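Below is a minimal sketch of how these routes could be wired up with Molten; the handler names and response payloads are illustrative assumptions and may differ from the actual app.py.

```python
# Minimal sketch of route wiring with Molten; handler names and payloads are
# illustrative assumptions, not the project's actual app.py.
from molten import App, Route, UploadedFile, schema


@schema
class ModelData:
    file: UploadedFile


def index() -> dict:
    # Liveness check returning basic health information
    return {"status": "ok"}


def predict(data: ModelData) -> dict:
    # Molten validates the multipart upload against ModelData before this runs;
    # the response payload here is a placeholder
    return {"prediction": "..."}


app = App(routes=[
    Route("/", index),
    Route("/v1/predict", predict, method="POST"),
])
```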
The API is versioned, which makes evolution of our application possible without introducing breaking changes for downstream applications. app.py
validates data according to the schema described in the model.py
file using the Molten framework. The current application serves a simple CNN model, as described in Lesson 1 of the Fastai v1 course, using the schema shown below for the predict
endpoint. For more details see model.py
or the Molten docs.
from molten import UploadedFile, schema

@schema
class ModelData:
    file: UploadedFile
The idea is that, based on a given model, you can update the schema in this class and let Molten take care of validation for you while providing user friendly docs and a test interface, as shown.
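For example, a hypothetical model that also accepts a probability threshold could extend the schema as below; the threshold field and its default are purely illustrative and not part of this repository.

```python
# Hypothetical extension of the predict schema; the `threshold` field and its
# default are illustrative assumptions, not part of this repository.
from typing import Optional

from molten import UploadedFile, field, schema


@schema
class ModelData:
    file: UploadedFile
    threshold: Optional[float] = field(default=0.5)
```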
To test the API, either curl or the Swagger UI can be used.
The application uses Molten's support for Prometheus to export request count, latency, GC time, and a few other metrics.
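A minimal sketch of that wiring, assuming the helpers from molten.contrib.prometheus (which require the prometheus_client package), is shown below; the actual setup in app.py may differ.

```python
# Sketch of exposing Prometheus metrics via Molten's contrib helpers; the
# exact middleware and routes in app.py may differ from this.
from molten import App, ResponseRendererMiddleware, Route
from molten.contrib.prometheus import expose_metrics, prometheus_middleware


def index() -> dict:
    return {"status": "ok"}


app = App(
    middleware=[prometheus_middleware, ResponseRendererMiddleware()],
    routes=[
        Route("/", index),                  # instrumented application route
        Route("/metrics", expose_metrics),  # scraped by Prometheus
    ],
)
```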
A Kubernetes deployment file is provided in the deployment
folder, which also has support for a HorizontalPodAutoscaler to scale the application up based on CPU usage and scale it back down when no longer needed.
- Add support for ONNX