refactor inference

huggingface · Jul 1, 2024 · f0e69a5
1 parent bb9550b

Showing 2 changed files with 70 additions and 67 deletions.
diff --git a/docs/source/openvino/inference.mdx b/docs/source/openvino/inference.mdx
@@ -11,7 +11,8 @@ specific language governing permissions and limitations under the License.
 
 Optimum Intel can be used to load optimized models from the [Hub](https://huggingface.co/models?library=openvino&sort=downloads) and create pipelines to run inference with OpenVINO Runtime on a variety of Intel processors ([see](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html) the full list of supported devices)
 
-## Transformers models
+
+## Loading
 
 Once [your model was exported](export), you can load it by replacing the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class.
 
@@ -30,38 +31,33 @@ Once [your model was exported](export), you can load it by replacing the `AutoMo
 
 See the [reference documentation](reference) for more information about parameters, and examples for different tasks.
 
-As shown in the table below, each task is associated with a class enabling to automatically load your model.
-
-| Task                                 | Auto Class                           |
-|--------------------------------------|--------------------------------------|
-| `text-classification`                | `OVModelForSequenceClassification`   |
-| `token-classification`               | `OVModelForTokenClassification`      |
-| `question-answering`                 | `OVModelForQuestionAnswering`        |
-| `audio-classification`               | `OVModelForAudioClassification`      |
-| `image-classification`               | `OVModelForImageClassification`      |
-| `feature-extraction`                 | `OVModelForFeatureExtraction`        |
-| `fill-mask`                          | `OVModelForMaskedLM`                 |
-| `image-classification`               | `OVModelForImageClassification`      |
-| `audio-classification`               | `OVModelForAudioClassification`      |
-| `text-generation-with-past`          | `OVModelForCausalLM`                 |
-| `text2text-generation-with-past`     | `OVModelForSeq2SeqLM`                |
-| `automatic-speech-recognition`       | `OVModelForSpeechSeq2Seq`            |
-| `image-to-text`                      | `OVModelForVision2Seq`               |
 
+## Compilation
 
-### Weight-only quantization
+By default the model will be compiled when instantiating an `OVModel`. In the case where the model is reshaped or placed to another device, the model will need to be recompiled again, which will happen by default before the first inference (thus inflating the latency of the first inference). To avoid an unnecessary compilation, you can disable the first compilation by setting `compile=False`. The model can be compiled before the first inference with `model.compile()`.
 
-You can also apply fp16, 8-bit or 4-bit weight compression on the Linear, Convolutional and Embedding layers when loading your model to reduce the memory footprint and inference latency.
-
-For more information on the quantization parameters checkout the [documentation](optimziation#weight-only-quantization).
+```python
+from optimum.intel import OVModelForSequenceClassification
 
-<Tip warning={true}>
+model_id = "distilbert-base-uncased-finetuned-sst-2-english"
+# Load the model and disable the model compilation
+model = OVModelForSequenceClassification.from_pretrained(model_id, export=True, compile=False)
+# Reshape to a static sequence length of 128
+model.reshape(1,128)
+# Compile the model before the first inference
+model.compile()
+```
 
-If not specified, `load_in_8bit` will be set to `True` by default when models larger than 1 billion parameters are exported to the OpenVINO format (with `export=True`). You can disable it with `load_in_8bit=False`.
+To run inference on Intel integrated or discrete GPU, use `.to("gpu")`. On GPU, models run in FP16 precision by default. (See [OpenVINO documentation](https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html) about installing drivers for GPU inference).
 
-</Tip>
+```python
+# Static shapes speed up inference
+model.reshape(1, 9)
+model.to("gpu")
+# Compile the model before the first inference
+model.compile()
+```
 
-It's also possible to apply quantization on both weights and activations using the `OVQuantizer`, more information in the [documentation](optimization#static-quantization).
 
 ### Static shape
 
@@ -77,60 +73,33 @@ model.compile()
 When fixing the shapes with the `reshape()` method, inference cannot be performed with an input of a different shape. When instantiating your pipeline, you can specify the maximum total input sequence length after tokenization in order for shorter sequences to be padded and for longer sequences to be truncated.
 
 ```python
-from datasets import load_dataset
 from transformers import AutoTokenizer, pipeline
-from evaluate import evaluator
-from optimum.intel import OVModelForQuestionAnswering
+from optimum.intel import OVModelForSequenceClassification
 
-model_id = "distilbert-base-cased-distilled-squad"
-model = OVModelForQuestionAnswering.from_pretrained(model_id, export=True)
-model.reshape(1, 384)
+model_id = "helenai/distilbert-base-uncased-finetuned-sst-2-english-ov-fp32"
+model = OVModelForSequenceClassification.from_pretrained(model_id, compile=False)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
-eval_dataset = load_dataset("squad", split="validation").select(range(50))
-task_evaluator = evaluator("question-answering")
-qa_pipe = pipeline(
-    "question-answering",
-    model=model,
-    tokenizer=tokenizer,
-    max_seq_len=384,
-    padding="max_length",
-    truncation=True,
-)
-metric = task_evaluator.compute(model_or_pipeline=qa_pipe, data=eval_dataset, metric="squad")
-```
-
-### Compilation
 
-By default the model will be compiled when instantiating our `OVModel`. In the case where the model is reshaped or placed to another device, the model will need to be recompiled again, which will happen by default before the first inference (thus inflating the latency of the first inference). To avoid an unnecessary compilation, you can disable the first compilation by setting `compile=False`. The model can be compiled before the first inference with `model.compile()`.
+batch_size, seq_len = 1, 10
+model.reshape(batch_size, seq_len)
+inputs = "He's a dreadful magician"
+tokens = tokenizer(inputs, max_length=seq_len, padding="max_length", return_tensors="np")
 
-```python
-from optimum.intel import OVModelForSequenceClassification
+# verifying that the inputs shapes match the defined batch size and sequence length
+print(tokens["input_ids"].shape)
+# (1, 10)
 
-model_id = "distilbert-base-uncased-finetuned-sst-2-english"
-# Load the model and disable the model compilation
-model = OVModelForSequenceClassification.from_pretrained(model_id, export=True, compile=False)
-# Reshape to a static sequence length of 128
-model.reshape(1,128)
-# Compile the model before the first inference
-model.compile()
-```
+outputs = model(**tokens)
 
-To run inference on Intel integrated or discrete GPU, use `.to("gpu")`. On GPU, models run in FP16 precision by default. (See [OpenVINO documentation](https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html) about installing drivers for GPU inference).
 
-```python
-# Static shapes speed up inference
-model.reshape(1, 9)
-model.to("gpu")
-# Compile the model before the first inference
-model.compile()
+pipeline
 ```
 
-### Configuration
 
+### Configuration
 
 It is possible to pass an `ov_config` parameter to `from_pretrained()` with custom OpenVINO configuration values. This can be used for example to enable full precision inference on devices where FP16 or BF16 inference precision is used by default.
 
-
 ```python
 model = OVModelForSequenceClassification.from_pretrained(model_id, ov_config={"INFERENCE_PRECISION_HINT":"f32"})
 ```
@@ -141,7 +110,39 @@ Optimum Intel leverages OpenVINO's model caching to speed up model compiling on
 model = OVModelForSequenceClassification.from_pretrained(model_id, device="GPU", ov_config={"PERFORMANCE_HINT": "LATENCY", "CACHE_DIR":""})
 ```
 
-### Sequence-to-sequence models
+## Transformers models
+
+As shown in the table below, each task is associated with a class enabling to automatically load your model.
+
+| Task                                 | Auto Class                           |
+|--------------------------------------|--------------------------------------|
+| `text-classification`                | `OVModelForSequenceClassification`   |
+| `token-classification`               | `OVModelForTokenClassification`      |
+| `question-answering`                 | `OVModelForQuestionAnswering`        |
+| `audio-classification`               | `OVModelForAudioClassification`      |
+| `image-classification`               | `OVModelForImageClassification`      |
+| `feature-extraction`                 | `OVModelForFeatureExtraction`        |
+| `fill-mask`                          | `OVModelForMaskedLM`                 |
+| `image-classification`               | `OVModelForImageClassification`      |
+| `audio-classification`               | `OVModelForAudioClassification`      |
+| `text-generation-with-past`          | `OVModelForCausalLM`                 |
+| `text2text-generation-with-past`     | `OVModelForSeq2SeqLM`                |
+| `automatic-speech-recognition`       | `OVModelForSpeechSeq2Seq`            |
+| `image-to-text`                      | `OVModelForVision2Seq`               |
+
+
+You can also apply fp16, 8-bit or 4-bit weight compression on the Linear, Convolutional and Embedding layers when loading your model to reduce the memory footprint and inference latency.
+
+For more information on the quantization parameters checkout the [documentation](optimziation#weight-only-quantization).
+
+<Tip warning={true}>
+
+If not specified, `load_in_8bit` will be set to `True` by default when models larger than 1 billion parameters are exported to the OpenVINO format (with `export=True`). You can disable it with `load_in_8bit=False`.
+
+</Tip>
+
+It's also possible to apply quantization on both weights and activations using the `OVQuantizer`, more information in the [documentation](optimization#static-quantization).
+
 
 Sequence-to-sequence (Seq2Seq) models, that generate a new sequence from an input, can also be used when running inference with OpenVINO. When Seq2Seq models are exported to the OpenVINO IR, they are decomposed into two parts : the encoder and the "decoder" (which actually consists of the decoder with the language modeling head), that are later combined during inference.
 To speed up sequential decoding, a cache with pre-computed key/values hidden-states will be used by default. An additional model component will be exported: the "decoder" with pre-computed key/values as one of its inputs.  This specific export comes from the fact that during the first pass, the decoder has no pre-computed key/values hidden-states, while during the rest of the generation past key/values will be used to speed up sequential decoding. To disable this cache, set `use_cache=False` in the `from_pretrained()` method.

diff --git a/docs/source/openvino/models.mdx b/docs/source/openvino/models.mdx
@@ -7,6 +7,8 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
 specific language governing permissions and limitations under the License.
 -->
 
+# Supported models
+
 🤗 Optimum handles the export of models to OpenVINO in the `exporters.openvino` module. It provides classes, functions, and a command line interface to perform the export easily.
 Here is the list of the supported architectures :