Related to #886, but I wanted to start a more open discussion about this rather than opening an issue.

Use case

Deploying your ML model on the edge can become very cumbersome, especially if you don't want to bind yourself to specific frameworks or pieces of hardware. It would be fantastic if I could use BentoML as a fast and reliable way to create model containers for all sorts of edge AI devices (well, those that support Docker at least).

Challenges

I think the docs somewhere say that BentoML is not primarily focused on edge serving. That makes total sense for the current state of the project, I guess. But are there more fundamental reasons for this? Are there tools better suited for framework-agnostic containerization of edge AI models?
Hi @highvight, great question and a great summary of the challenges of edge serving!

Yes indeed, BentoML currently works well for most model serving deployments in the cloud or data center (including online API serving, offline batch serving, distributed batch serving, and streaming serving), but it is not well suited for edge serving. Here is a bit more context on why this is the current state:

The main benefit of BentoML is 1) providing an abstraction for data scientists to describe how clients interact with their model, and automatically packaging all the code and dependencies required into the BentoML bundle format, and 2) providing a high-performance and flexible runtime to serve this bundle format.

The main difference between BentoML and something like TF Serving or ONNX Runtime is that BentoML keeps the Python runtime and lets users define a prediction service in Python, so extra data fetching, pre-processing, and post-processing code can be bundled and versioned together with the model. This may sound like it adds a lot of performance overhead, but in practice, with our adaptive micro-batching layer, the difference is negligible. When we benchmarked BentoML against TF Serving, the throughput matched TF Serving in most scenarios; the latency is not as good as TF Serving for small, lightweight models under light traffic, but for larger models, or for small models under heavier traffic, the latency is very similar. Meanwhile, by keeping the Python runtime, BentoML gives the user a lot more flexibility. We designed BentoML to strike a balance between a high-performance (but restrictive) runtime like TF Serving and a flexible (but poorly performing) solution built with Flask, Django, or FastAPI.

Back to edge serving: there are currently a number of performance-oriented tools, including TensorFlow Lite, CoreML, PyTorch Mobile, etc. The problem those tools solve is compressing the model, reducing the computation and memory required to run it, and providing a model runtime that supports ARM. But the user is required to rebuild the pre-processing logic in a different language, which drastically increases the complexity of development, testing, and deployment. It is possible for BentoML to wrap the Python runtime around those edge-optimized model runtimes and offer the end user a flexible model deployment workflow that stays consistent across online serving, offline batch serving, CI/CD, and edge serving. That's the vision we have for BentoML.

That being said, we will keep an eye on the development of edge serving model formats and runtimes, and we will build support for them on the BentoML end. As a starter, @guy4261 from Apple just submitted a PR for CoreML support (#939); super excited to see where this is going.
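To make the "prediction service defined in Python" point concrete, here is a rough sketch in the style of the BentoML 0.x API from around the time of this discussion. Treat it as illustrative only: import paths and decorator arguments (e.g. `SklearnModelArtifact`, `auto_pip_dependencies`) changed between 0.x releases, and the pre-processing step is a made-up placeholder.

```python
import bentoml
from bentoml.adapters import DataframeInput
from bentoml.artifact import SklearnModelArtifact


@bentoml.env(auto_pip_dependencies=True)             # infer pip dependencies from the code
@bentoml.artifacts([SklearnModelArtifact("model")])   # the trained model is saved into the bundle
class MyPredictionService(bentoml.BentoService):

    @bentoml.api(input=DataframeInput())
    def predict(self, df):
        # Pre-processing lives in Python and ships inside the bundle,
        # versioned together with the model (placeholder logic for illustration).
        features = df.fillna(0)
        predictions = self.artifacts.model.predict(features)
        # Post-processing is bundled as well.
        return [int(p) for p in predictions]
```

Saving a service like this produces a self-contained bundle that can be containerized and deployed the same way for online or batch serving, which is the consistency argument made above.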
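And to make the "wrap the Python runtime around an edge-optimized model runtime" direction concrete, here is a minimal sketch, not actual BentoML code, of plain Python pre- and post-processing around the TensorFlow Lite interpreter. The model path, input layout, and normalization are assumptions for illustration.

```python
import numpy as np
import tensorflow as tf  # on ARM devices, the lighter tflite_runtime package can stand in for tf.lite

# Load an edge-optimized model; "model.tflite" is a placeholder path.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()


def predict(image: np.ndarray) -> np.ndarray:
    """Pre- and post-processing stay in Python instead of being
    re-implemented in C++/Swift/Kotlin for each target device."""
    x = (image.astype(np.float32) / 255.0)[np.newaxis, ...]  # assumed normalization and batching
    interpreter.set_tensor(input_details[0]["index"], x)
    interpreter.invoke()
    return interpreter.get_tensor(output_details[0]["index"])
```

In principle, wrapping something like this in a prediction service would let the same bundle-and-containerize workflow carry from CI/CD and cloud serving through to a Docker-capable edge device, which is exactly the consistency the reply describes.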