Related to #886, but I wanted to start a more open discussion about this rather than opening an issue.

Use case

Deploying your ML model on the edge can become very cumbersome, especially if you don't want to bind yourself to specific frameworks or pieces of hardware. It would be fantastic if I could use BentoML as a fast and reliable way to create model containers for all sorts of edge AI devices (well, those that support Docker at least).

Challenges

I think the docs somewhere say that BentoML is not primarily focused on edge serving. That makes total sense for the current state of the project, I guess. But are there more fundamental reasons for this? Are there tools better suited for framework-agnostic containerization of edge AI models?
Hi @highvight, great question and a great summary of the challenges of edge serving!

Yes indeed, BentoML currently works well for most model serving deployments in the cloud or data center (including online API serving, offline batch serving, distributed batch serving, and streaming serving), but it is not well suited for edge serving. Here is a bit more context on why this is the current state:

The main benefit of BentoML is 1) providing an abstraction for data scientists to describe how clients interact with their model, and automatically packaging all the code and dependencies required into the BentoML bundle format, and 2) providing a high-performance and flexible runtime to serve this bundle format.

The main difference between BentoML and something like TF Serving or ONNX Runtime is that BentoML keeps the Python runtime and lets users define a prediction service in Python, so extra data fetching, pre-processing, and post-processing code can be bundled and versioned together with the model. This may sound like it adds a lot of performance overhead, but in practice, with our adaptive micro-batching layer, the difference is negligible. When we benchmarked BentoML against TF Serving, the throughput matched TF Serving in most scenarios; the latency is not as good as TF Serving for small, lightweight models under light traffic, but for larger models, or for small models under heavier traffic, the latency is very similar. Meanwhile, by keeping the Python runtime, BentoML gives the user a lot more flexibility. We designed BentoML to strike a balance between a high-performance (but restrictive) runtime like TF Serving and a flexible (but poorly performing) solution built with Flask, Django, or FastAPI.

Back to edge serving: there are currently a number of performance-oriented tools, including TensorFlow Lite, CoreML, PyTorch Mobile, etc. The problem those tools solve is compressing the model, reducing the computation and memory required to run it, and providing a model runtime that supports ARM. But the user is required to rebuild the pre-processing logic in a different language, which drastically increases the complexity of development, testing, and deployment. It is possible for BentoML to wrap the Python runtime around those edge-optimized model runtimes and offer the end user a flexible model deployment workflow that stays consistent across online serving, offline batch serving, CI/CD, and edge serving. That's the vision we have for BentoML.

That being said, we will keep an eye on the development of edge serving model formats and runtimes, and we will build support for them on the BentoML end. As a starter, @guy4261 from Apple just submitted a PR for CoreML support (#939); super excited to see where this is going.
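To make the "prediction service defined in Python" point concrete, here is a rough sketch in the style of the BentoML 0.x API from around the time of this discussion. Treat it as illustrative only: import paths and decorator arguments (e.g. `SklearnModelArtifact`, `auto_pip_dependencies`) changed between 0.x releases, and the pre-processing step is a made-up placeholder.

```python
import bentoml
from bentoml.adapters import DataframeInput
from bentoml.artifact import SklearnModelArtifact


@bentoml.env(auto_pip_dependencies=True)             # infer pip dependencies from the code
@bentoml.artifacts([SklearnModelArtifact("model")])   # the trained model is saved into the bundle
class MyPredictionService(bentoml.BentoService):

    @bentoml.api(input=DataframeInput())
    def predict(self, df):
        # Pre-processing lives in Python and ships inside the bundle,
        # versioned together with the model (placeholder logic for illustration).
        features = df.fillna(0)
        predictions = self.artifacts.model.predict(features)
        # Post-processing is bundled as well.
        return [int(p) for p in predictions]
```

Saving a service like this produces a self-contained bundle that can be containerized and deployed the same way for online or batch serving, which is the consistency argument made above.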
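And to make the "wrap the Python runtime around an edge-optimized model runtime" direction concrete, here is a minimal sketch, not actual BentoML code, of plain Python pre- and post-processing around the TensorFlow Lite interpreter. The model path, input layout, and normalization are assumptions for illustration.

```python
import numpy as np
import tensorflow as tf  # on ARM devices, the lighter tflite_runtime package can stand in for tf.lite

# Load an edge-optimized model; "model.tflite" is a placeholder path.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()


def predict(image: np.ndarray) -> np.ndarray:
    """Pre- and post-processing stay in Python instead of being
    re-implemented in C++/Swift/Kotlin for each target device."""
    x = (image.astype(np.float32) / 255.0)[np.newaxis, ...]  # assumed normalization and batching
    interpreter.set_tensor(input_details[0]["index"], x)
    interpreter.invoke()
    return interpreter.get_tensor(output_details[0]["index"])
```

In principle, wrapping something like this in a prediction service would let the same bundle-and-containerize workflow carry from CI/CD and cloud serving through to a Docker-capable edge device, which is exactly the consistency the reply describes.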