
openai-clip-vit-large-patch14

Overview

Description: The CLIP model was developed by OpenAI researchers to learn about what contributes to robustness in computer vision tasks and to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner. The model uses a ViT-L/14 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. It was trained on publicly available image-caption data, which was gathered in a mostly non-interventionist manner. The model is intended as a research output for research communities, and its primary intended users are AI researchers. It has been evaluated on a wide range of benchmarks across a variety of computer vision datasets, but it currently struggles with certain tasks such as fine-grained classification and counting objects. The model also poses issues with regard to fairness and bias, and the specific biases it exhibits can depend significantly on class design and the choices one makes for which categories to include and exclude.

> The above summary was generated using ChatGPT. Review the original model card to understand the data used to train the model, evaluation metrics, license, intended uses, limitations and bias before using the model.

### Inference samples

Inference type|Python sample (Notebook)|CLI with YAML
--|--|--
Real time|zero-shot-image-classification-online-endpoint.ipynb|zero-shot-image-classification-online-endpoint.sh
Batch|zero-shot-image-classification-batch-endpoint.ipynb|zero-shot-image-classification-batch-endpoint.sh

### Sample inputs and outputs (for real-time inference)

#### Sample input

```json
{
    "input_data": {
        "columns": [
            "image",
            "text"
        ],
        "index": [0, 1],
        "data": [
            ["image1", "label1, label2, label3"],
            ["image2"]
        ]
    }
}
```

Note:

- "image1" and "image2" should be publicly accessible URLs or strings in base64 format.
- The text column in the first row determines the labels for image classification. The text column in the other rows is not used and can be blank.

#### Sample output

```json
[
    {
        "probs": [0.95, 0.03, 0.02],
        "labels": ["label1", "label2", "label3"]
    },
    {
        "probs": [0.04, 0.93, 0.03],
        "labels": ["label1", "label2", "label3"]
    }
]
```

#### Model inference - visualization

For a sample image and the label text "credit card payment, contactless payment, cash payment, mobile order":

(zero-shot image classification visualization)
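The request format above can be exercised over REST once the model is deployed. Below is a minimal sketch, assuming the model has already been deployed to an Azure Machine Learning online endpoint; the `scoring_uri`, `api_key`, and local image path are placeholders for illustration, not values taken from this page.

```python
import base64
import json
import requests

# Placeholders -- replace with the scoring URI and key of your own deployed online endpoint.
scoring_uri = "https://<endpoint-name>.<region>.inference.ml.azure.com/score"
api_key = "<api-key>"

# Encode a local image as base64 (a publicly accessible URL would also work).
with open("sample_image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Payload following the sample input format shown above:
# the text column of the first row carries the candidate labels.
payload = {
    "input_data": {
        "columns": ["image", "text"],
        "index": [0],
        "data": [
            [image_b64, "credit card payment, contactless payment, cash payment, mobile order"]
        ],
    }
}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",
}

response = requests.post(scoring_uri, headers=headers, data=json.dumps(payload))
response.raise_for_status()
print(response.json())  # expected shape: [{"probs": [...], "labels": [...]}]
```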

Version: 1

Tags

Preview, license: mit, task: zero-shot-image-classification

View in Studio: https://ml.azure.com/registries/azureml/models/openai-clip-vit-large-patch14/version/1

License: mit

Properties

inference-min-sku-spec: 2|0|7|14

inference-recommended-sku: Standard_DS2_v2, Standard_D2a_v4, Standard_D2as_v4, Standard_DS3_v2, Standard_D4a_v4, Standard_D4as_v4, Standard_DS4_v2, Standard_D8a_v4, Standard_D8as_v4, Standard_DS5_v2, Standard_D16a_v4, Standard_D16as_v4, Standard_D32a_v4, Standard_D32as_v4, Standard_D48a_v4, Standard_D48as_v4, Standard_D64a_v4, Standard_D64as_v4, Standard_D96a_v4, Standard_D96as_v4, Standard_F4s_v2, Standard_FX4mds, Standard_F8s_v2, Standard_FX12mds, Standard_F16s_v2, Standard_F32s_v2, Standard_F48s_v2, Standard_F64s_v2, Standard_F72s_v2, Standard_FX24mds, Standard_FX36mds, Standard_FX48mds, Standard_E2s_v3, Standard_E4s_v3, Standard_E8s_v3, Standard_E16s_v3, Standard_E32s_v3, Standard_E48s_v3, Standard_E64s_v3, Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2

model_id: openai/clip-vit-large-patch14
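Because `model_id` refers to the `openai/clip-vit-large-patch14` checkpoint on the Hugging Face Hub, the same zero-shot classification can also be sketched locally with the `transformers` library. A minimal sketch, assuming `transformers`, `torch`, and `Pillow` are installed; the image URL and candidate labels below are illustrative only.

```python
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

# Illustrative image and candidate labels.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog"]

# Encode the image and label texts together, then compare them in CLIP's shared embedding space.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # image-text similarity as probabilities
print(dict(zip(labels, probs[0].tolist())))
```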
