CLIP-ONNX is a simple library that speeds up CLIP inference by up to 3x (measured on a K80 GPU).
Install the clip-onnx module and its requirements first. In a notebook, use these commands:
!pip install git+https://github.com/Lednik7/CLIP-ONNX.git
!pip install git+https://github.com/openai/CLIP.git
!pip install onnxruntime-gpu
- Download the CLIP image from the repo
!wget -c -O CLIP.png https://github.com/openai/CLIP/blob/main/CLIP.png?raw=true
- Load the standard CLIP model, image, and text on the CPU
import clip
from PIL import Image
import numpy as np
# load on CPU: the ONNX conversion does not work with a CUDA model
model, preprocess = clip.load("ViT-B/32", device="cpu", jit=False)
# batch first
image = preprocess(Image.open("CLIP.png")).unsqueeze(0).cpu() # [1, 3, 224, 224]
image_onnx = image.detach().cpu().numpy().astype(np.float32)
# batch first
text = clip.tokenize(["a diagram", "a dog", "a cat"]).cpu() # [3, 77]
text_onnx = text.detach().cpu().numpy().astype(np.int32)
- Create a CLIP-ONNX object to convert the model to ONNX
from clip_onnx import clip_onnx
visual_path = "clip_visual.onnx"
textual_path = "clip_textual.onnx"
onnx_model = clip_onnx(model, visual_path=visual_path, textual_path=textual_path)
onnx_model.convert2onnx(image, text, verbose=True)
# ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
onnx_model.start_sessions(providers=["CPUExecutionProvider"]) # cpu mode
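The provider list passed to `start_sessions` can be derived from what the local ONNX Runtime build supports. The sketch below is only an illustration: `pick_providers` is a hypothetical helper that is not part of clip-onnx, and the `available` list is a hard-coded example. In a real session it would come from `onnxruntime.get_available_providers()`.

```python
# Preference order follows the provider list in the comment above.
PREFERRED = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]

def pick_providers(available, preferred=PREFERRED):
    """Hypothetical helper: keep preferred providers that are available, in order."""
    chosen = [p for p in preferred if p in available]
    # Always keep a CPU fallback at the end of the list.
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen

# In a real session (assumes onnxruntime is installed):
#   import onnxruntime
#   available = onnxruntime.get_available_providers()
available = ["CUDAExecutionProvider", "CPUExecutionProvider"]  # example value
providers = pick_providers(available)
print(providers)  # ['CUDAExecutionProvider', 'CPUExecutionProvider']
```

The resulting list could then be passed as `onnx_model.start_sessions(providers=providers)`.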
- Use it like the standard CLIP API (batch inference)
image_features = onnx_model.encode_image(image_onnx)
text_features = onnx_model.encode_text(text_onnx)
logits_per_image, logits_per_text = onnx_model(image_onnx, text_onnx)
probs = logits_per_image.softmax(dim=-1).detach().cpu().numpy()
print("Label probs:", probs) # prints: [[0.9927937 0.00421067 0.00299571]]
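To make the last step concrete, here is a dependency-free sketch of how a row of logits becomes label probabilities. The logit values are made up for illustration and are not output of the model; only the softmax arithmetic is the point.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)                       # subtract the max to avoid overflow
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for one image against three captions (illustrative only).
logits_per_image = [[25.0, 19.5, 19.2]]
probs = [softmax(row) for row in logits_per_image]

# Index of the most likely caption for the first image.
best = max(range(len(probs[0])), key=lambda i: probs[0][i])
print("Label probs:", probs, "best index:", best)
```

Each row sums to 1, and the largest logit always maps to the largest probability, so the predicted label can be read from either.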
Enjoy the speed
Example for ViT-B/32 from Model Zoo
!wget https://clip-as-service.s3.us-east-2.amazonaws.com/models/onnx/ViT-B-32/visual.onnx
!wget https://clip-as-service.s3.us-east-2.amazonaws.com/models/onnx/ViT-B-32/textual.onnx
onnx_model = clip_onnx(None)
onnx_model.load_onnx(visual_path="visual.onnx",
                     textual_path="textual.onnx",
                     logit_scale=100.0000) # model.logit_scale.exp()
onnx_model.start_sessions(providers=["CPUExecutionProvider"])
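The `logit_scale` argument matters because the original CLIP forward pass multiplies the cosine similarity between normalized image and text features by this value. The following is a minimal pure-Python sketch of that computation, assuming clip-onnx mirrors it. `l2_normalize` and `clip_logits` are illustrative names, and the feature vectors are toy values.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def clip_logits(image_feats, text_feats, logit_scale=100.0):
    """Scaled cosine similarity between every image and every text vector."""
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    return [
        [logit_scale * sum(a * b for a, b in zip(i, t)) for t in txts]
        for i in imgs
    ]

# Toy 2-d features: the image matches the first text and is orthogonal to the second.
image_feats = [[1.0, 0.0]]
text_feats = [[2.0, 0.0], [0.0, 3.0]]
logits = clip_logits(image_feats, text_feats)
print(logits)  # [[100.0, 0.0]]
```

With a scale of 100, cosine similarities in [-1, 1] are stretched to [-100, 100] before the softmax, which is what makes the resulting probabilities sharp.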
Models of the original CLIP can be found on this page.
They are not part of this library but should work correctly.
Sometimes ONNX fails to convert the model on the first attempt; in that case, it is worth running the conversion again.
If that doesn't help, try changing the export settings.
The model export options for ONNX look like this:
DEFAULT_EXPORT = dict(input_names=['input'], output_names=['output'],
                      export_params=True, verbose=False, opset_version=12,
                      do_constant_folding=True,
                      dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}})
You can change them pretty easily.
from clip_onnx.utils import DEFAULT_EXPORT
DEFAULT_EXPORT["opset_version"] = 15
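The "run it again, then change the settings" advice can be automated with a small retry loop. This is a sketch under assumptions: `export_with_fallback` is a hypothetical helper, not part of clip-onnx, and `fake_export` stands in for a real export call so the example runs without the library.

```python
def export_with_fallback(export_fn, opset_versions=(12, 13, 14, 15), attempts_per_opset=2):
    """Try export_fn(opset) a few times per opset; return the first opset that works."""
    errors = []
    for opset in opset_versions:
        for attempt in range(1, attempts_per_opset + 1):
            try:
                export_fn(opset)
                return opset
            except Exception as exc:  # keep the error and move on to the next try
                errors.append((opset, attempt, repr(exc)))
    raise RuntimeError(f"export failed for all settings: {errors}")

# Stand-in for a real export: fails for opset 12, succeeds from 13 on.
calls = []
def fake_export(opset):
    calls.append(opset)
    if opset < 13:
        raise ValueError("unsupported operator")

used = export_with_fallback(fake_export)
print(used, calls)  # 13 [12, 12, 13]
```

With the real library, `export_fn` might set `DEFAULT_EXPORT["opset_version"] = opset` and then call `onnx_model.convert2onnx(image, text)`; that wiring is an assumption and is not tested here.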
Alternative option (change only visual or textual):
from clip_onnx import clip_onnx
from clip_onnx.utils import DEFAULT_EXPORT
visual_path = "clip_visual.onnx"
textual_path = "clip_textual.onnx"
textual_export_params = DEFAULT_EXPORT.copy()
textual_export_params["dynamic_axes"] = {'input': {1: 'batch_size'},
                                         'output': {0: 'batch_size'}}
textual_export_params["opset_version"] = 12
Textual = lambda x: x
onnx_model = clip_onnx(model.cpu(), visual_path=visual_path, textual_path=textual_path)
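# dummy_input_image and dummy_input_text are example inputs,
# e.g. the `image` and `text` tensors prepared earlier
onnx_model.convert2onnx(dummy_input_image, dummy_input_text, verbose=True,
                        textual_wrapper=Textual,
                        textual_export_params=textual_export_params)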
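One detail of the example above is worth spelling out: `dict.copy()` is a shallow copy, so the nested `dynamic_axes` dict is shared with the defaults. Assigning a new dict to the key, as the example does, is safe; editing the nested dict in place would also change the defaults. The sketch below uses a trimmed local stand-in for `DEFAULT_EXPORT` rather than the real one.

```python
import copy

# Local stand-in for clip_onnx.utils.DEFAULT_EXPORT (trimmed to two keys).
DEFAULT_EXPORT = dict(
    opset_version=12,
    dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}},
)

# 1) Shallow copy + rebinding a top-level key: the defaults stay intact.
rebound = DEFAULT_EXPORT.copy()
rebound["dynamic_axes"] = {"input": {1: "batch_size"}, "output": {0: "batch_size"}}
defaults_after_rebind = dict(DEFAULT_EXPORT["dynamic_axes"]["input"])

# 2) Deep copy + in-place edit of the nested dict: the defaults stay intact too.
deep = copy.deepcopy(DEFAULT_EXPORT)
deep["dynamic_axes"]["input"][1] = "sequence"
defaults_after_deep = dict(DEFAULT_EXPORT["dynamic_axes"]["input"])

# 3) Shallow copy + in-place edit of the nested dict: the change leaks into the defaults.
shallow = DEFAULT_EXPORT.copy()
shallow["dynamic_axes"]["input"][1] = "sequence"
defaults_after_leak = dict(DEFAULT_EXPORT["dynamic_axes"]["input"])

print(defaults_after_rebind)  # {0: 'batch_size'}
print(defaults_after_deep)    # {0: 'batch_size'}
print(defaults_after_leak)    # {0: 'batch_size', 1: 'sequence'}
```

If the nested settings need in-place edits, `copy.deepcopy` is the safer starting point.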
See benchmark.md
See examples folder for more details
Some parts of the code were taken from the post. Thank you neverix for this notebook.