Add multi api inference engine #1343

Merged (33 commits) on Nov 19, 2024

Commits
de868ab
Add multi api inference engine
elronbandel Nov 12, 2024
c40d87d
Fix
elronbandel Nov 12, 2024
d53eb69
Set to greedy decoding
elronbandel Nov 12, 2024
e4a0799
Merge branch 'main' into multi-api-engine
elronbandel Nov 12, 2024
e41a0fa
Merge branch 'main' into multi-api-engine
elronbandel Nov 17, 2024
b36b7ab
Some fixes
elronbandel Nov 17, 2024
0593788
Fix consistency and preparation
elronbandel Nov 18, 2024
28bafa2
Update
elronbandel Nov 18, 2024
ccc72ae
Merge branch 'main' into multi-api-engine
elronbandel Nov 18, 2024
06991ca
Merge branch 'main' into multi-api-engine
elronbandel Nov 18, 2024
3c861fb
Fix test
elronbandel Nov 18, 2024
086aae8
Merge branch 'multi-api-engine' of https://github.com/IBM/unitxt into…
elronbandel Nov 18, 2024
f9cd539
Make all args None
elronbandel Nov 18, 2024
4165c78
Try
elronbandel Nov 18, 2024
f202c3a
Fix grammar
elronbandel Nov 18, 2024
bd8e176
Fix
elronbandel Nov 18, 2024
b686f95
Change api to provider
elronbandel Nov 18, 2024
b4dfe3b
Merge branch 'main' into multi-api-engine
elronbandel Nov 18, 2024
4c91d5e
Added support for param renaming.
yoavkatz Nov 18, 2024
eaead52
Fix merge issues
yoavkatz Nov 18, 2024
4c5ba45
Updated to CrossProviderModel
yoavkatz Nov 18, 2024
00dbd30
Update name back to InferenceEngine terminology
elronbandel Nov 18, 2024
a0373f8
Align all examples with chat api and cross provider engines
elronbandel Nov 19, 2024
4fa6f8e
Add vllm inference engine
elronbandel Nov 19, 2024
8115091
Fix blue bench to use cross provider engine
elronbandel Nov 19, 2024
986d268
Merge branch 'main' into multi-api-engine
elronbandel Nov 19, 2024
728fcc3
Added watsonx-sdk to MultiProviderInferenceEngine
yoavkatz Nov 19, 2024
9414f54
Make hf tests deterministic
elronbandel Nov 19, 2024
69388b5
Fix llmaj with chat api
elronbandel Nov 19, 2024
e921b01
Add inference documentation
elronbandel Nov 19, 2024
9946ab6
Fix examples
elronbandel Nov 19, 2024
ececf85
Fix examples
elronbandel Nov 19, 2024
71365e7
Update docs/docs/inference.rst
elronbandel Nov 19, 2024
41 changes: 20 additions & 21 deletions docs/docs/examples.rst
@@ -1,6 +1,6 @@
.. _examples:
==============
Examples
==============

Here you will find complete coding samples showing how to perform different tasks using Unitxt.
@@ -18,7 +18,7 @@ This example demonstrates how to evaluate an existing entailment dataset (wnli)

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_existing_dataset_no_install.py>`_

Related documentation: :ref:`Evaluating datasets <evaluating_datasets>`, :ref:`WNLI dataset card in catalog <catalog.cards.wnli>`, :ref:`Relation template in catalog <catalog.templates.classification.multi_class.relation.default>`.
Related documentation: :ref:`Evaluating datasets <evaluating_datasets>`, :ref:`WNLI dataset card in catalog <catalog.cards.wnli>`, :ref:`Relation template in catalog <catalog.templates.classification.multi_class.relation.default>`, :ref:`Inference Engines <inference>`.

Evaluate an existing dataset from the Unitxt catalog (with Unitxt installation)
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
@@ -28,7 +28,7 @@ This approach is faster than using Huggingface APIs.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_existing_dataset_with_install.py>`_

Related documentation: :ref:`Installation <installation>` , :ref:`WNLI dataset card in catalog <catalog.cards.wnli>`, :ref:`Relation template in catalog <catalog.templates.classification.multi_class.relation.default>`.
Related documentation: :ref:`Installation <installation>` , :ref:`WNLI dataset card in catalog <catalog.cards.wnli>`, :ref:`Relation template in catalog <catalog.templates.classification.multi_class.relation.default>`, :ref:`Inference Engines <inference>`.


Evaluate a custom dataset
@@ -48,7 +48,7 @@ It also shows how to use preprocessing steps to align the raw input of the datas

`Example code <https://github.com/IBM/unitxt/blob/main/examples/qa_evaluation.py>`_

Related documentation: :ref:`Add new dataset tutorial <adding_dataset>`, :ref:`Open QA task in catalog <catalog.tasks.qa.open>`, :ref:`Open QA template in catalog <catalog.templates.qa.open.title>`.
Related documentation: :ref:`Add new dataset tutorial <adding_dataset>`, :ref:`Open QA task in catalog <catalog.tasks.qa.open>`, :ref:`Open QA template in catalog <catalog.templates.qa.open.title>`, :ref:`Inference Engines <inference>`.


Evaluation usecases
@@ -62,7 +62,7 @@ It also shows how to register assets into a local catalog and reuse them.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_different_templates.py>`_

Related documentation: :ref:`Templates tutorial <adding_template>`, :ref:`Formatting tutorial <adding_format>`, :ref:`Using the Catalog <using_catalog>`.
Related documentation: :ref:`Templates tutorial <adding_template>`, :ref:`Formatting tutorial <adding_format>`, :ref:`Using the Catalog <using_catalog>`, :ref:`Inference Engines <inference>`.

Evaluate the impact of different formats and system prompts
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
@@ -92,21 +92,21 @@ This example demonstrates how to evaluate a dataset using a pool of templates an

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_different_templates_num_demos.py>`_

Related documentation: :ref:`Templates tutorial <adding_template>`, :ref:`Formatting tutorial <adding_format>`, :ref:`Using the Catalog <using_catalog>`.
Related documentation: :ref:`Templates tutorial <adding_template>`, :ref:`Formatting tutorial <adding_format>`, :ref:`Using the Catalog <using_catalog>`, :ref:`Inference Engines <inference>`.

Long Context
+++++++++++++++++++++++++++++

This example explores the effect of long context in classification.
It converts a standard multi class classification dataset (sst2 sentiment classification),
where single sentence texts are classified one by one, to a dataset
where multiple sentences are classified using a single LLM call.
It compares the f1_micro in both approaches on two models.
It uses serializers to verbalize an enumerated list of multiple sentences and labels.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_batched_multiclass_classification.py>`_

Related documentation: :ref:`Sst2 dataset card in catalog <catalog.cards.sst2>` :ref:`Types and Serializers Guide <types_and_serializers>`.

Construct a benchmark of multiple datasets and obtain the final score
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
@@ -115,7 +115,7 @@ This example shows how to construct a benchmark that includes multiple datasets,

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_benchmark.py>`_

Related documentation: :ref:`Benchmarks tutorial <adding_benchmark>`, :ref:`Formatting tutorial <adding_format>`, :ref:`Using the Catalog <using_catalog>`.
Related documentation: :ref:`Benchmarks tutorial <adding_benchmark>`, :ref:`Formatting tutorial <adding_format>`, :ref:`Using the Catalog <using_catalog>`, :ref:`Inference Engines <inference>`.

LLM as Judges
--------------
@@ -127,7 +127,7 @@ This example demonstrates how to evaluate an existing QA dataset (squad) using t

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_existing_dataset_by_llm_as_judge.py>`_

Related documentation: :ref:`Evaluating datasets <evaluating_datasets>`, :ref:`LLM as a Judge Metrics Guide <llm_as_judge>`.
Related documentation: :ref:`Evaluating datasets <evaluating_datasets>`, :ref:`LLM as a Judge Metrics Guide <llm_as_judge>`, :ref:`Inference Engines <inference>`.

Evaluate a custom dataset using a custom LLM as Judge
+++++++++++++++++++++++++++++++++++++++++++++++++++++
@@ -160,7 +160,7 @@ while the 70b model performs much better.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_llm_as_judge.py>`_

Related documentation: :ref:`LLM as a Judge Metrics Guide <llm_as_judge>`.
Related documentation: :ref:`LLM as a Judge Metrics Guide <llm_as_judge>`, :ref:`Inference Engines <inference>`.


Evaluate your model on the Arena Hard benchmark using a custom LLMaJ
@@ -170,7 +170,7 @@ This example demonstrates how to evaluate a user model on the Arena Hard benchma

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_a_model_using_arena_hard.py>`_

Related documentation: :ref:`Evaluate a Model on Arena Hard Benchmark <arena_hard_evaluation>`.
Related documentation: :ref:`Evaluate a Model on Arena Hard Benchmark <arena_hard_evaluation>`, :ref:`Inference Engines <inference>`.

Evaluate a judge model performance judging the Arena Hard Benchmark
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
@@ -180,7 +180,7 @@ The model is evaluated on its capability to give a judgment that is in correlati

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_a_judge_model_capabilities_on_arena_hard.py>`_

Related documentation: :ref:`Evaluate a Model on Arena Hard Benchmark <arena_hard_evaluation>`.
Related documentation: :ref:`Evaluate a Model on Arena Hard Benchmark <arena_hard_evaluation>`, :ref:`Inference Engines <inference>`.

Evaluate using ensemble of LLM as a judge metrics
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
@@ -190,7 +190,7 @@ The example shows how to ensemble two judges which use different templates.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_using_metrics_ensemble.py>`_

Related documentation: :ref:`LLM as a Judge Metrics Guide <llm_as_judge>`.
Related documentation: :ref:`LLM as a Judge Metrics Guide <llm_as_judge>`, :ref:`Inference Engines <inference>`.

Evaluate predictions of models using pre-trained ensemble of LLM as judges
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
@@ -206,7 +206,7 @@ Groundedness: Every substantial claim in the response of the model is derivable
IDK: Does the model response say I don't know?
`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_idk_judge.py>`

Related documentation: :ref:`LLM as a Judge Metrics Guide <llm_as_judge>`.
Related documentation: :ref:`LLM as a Judge Metrics Guide <llm_as_judge>`, :ref:`Inference Engines <inference>`.

RAG
---
@@ -222,7 +222,7 @@ and use the existing metrics to evaluate model results.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_rag_response_generation.py>`_

Related documentation: :ref:`RAG Guide <rag_support>`. :ref:`Response generation task <catalog.tasks.rag.response_generation>`.
Related documentation: :ref:`RAG Guide <rag_support>`, :ref:`Response generation task <catalog.tasks.rag.response_generation>`, :ref:`Inference Engines <inference>`.

Multi-Modality
--------------
@@ -243,15 +243,15 @@ This approach can be adapted for various image-text to text tasks, such as image

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_image_text_to_text.py>`_

Related documentation: :ref:`Multi-Modality Guide <multi_modality>`.
Related documentation: :ref:`Multi-Modality Guide <multi_modality>`, :ref:`Inference Engines <inference>`.


Evaluate Image-Text to Text Model With Different Templates
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Evaluate Image-Text to Text Models with different templates and explore the sensitivity of the model to different textual variations.
`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_image_text_to_text_with_different_templates.py>`_

Related documentation: :ref:`Multi-Modality Guide <multi_modality>`.
Related documentation: :ref:`Multi-Modality Guide <multi_modality>`, :ref:`Inference Engines <inference>`.

Types and Serializers
----------------------------
@@ -263,6 +263,5 @@ This example shows how to define new data types as well as the way these data typ

`Example code <https://github.com/IBM/unitxt/blob/main/examples/custom_types.py>`_

Related documentation: :ref:`Types and Serializers Guide <types_and_serializers>`.

Related documentation: :ref:`Types and Serializers Guide <types_and_serializers>`, :ref:`Inference Engines <inference>`.

114 changes: 114 additions & 0 deletions docs/docs/inference.rst
@@ -0,0 +1,114 @@
.. _inference:

==============
Inference
==============

.. note::

This tutorial requires a :ref:`Unitxt installation <install_unitxt>`.

Introduction
------------
Unitxt offers a wide array of :class:`Inference Engines <unitxt.inference>` for running models either locally (using HuggingFace, Ollama, and VLLM) or by making API requests to services like WatsonX, OpenAI, AWS, and Together AI.

Unitxt inference engines serve two main purposes:

1. Running a full end-to-end evaluation pipeline with inference.
2. Using models for intermediate steps, such as evaluating other models (e.g., LLMs as judges) or for data augmentation.

Running Models Locally
-----------------------
You can run models locally with inference engines like:

- :class:`HFPipelineBasedInferenceEngine <unitxt.inference.HFPipelineBasedInferenceEngine>`
- :class:`VLLMInferenceEngine <unitxt.inference.VLLMInferenceEngine>`
- :class:`OllamaInferenceEngine <unitxt.inference.OllamaInferenceEngine>`

To get started, prepare your engine:

.. code-block:: python

engine = HFPipelineBasedInferenceEngine(
model_name="meta-llama/Llama-3.2-1B", max_new_tokens=32
)

Then load the data:

.. code-block:: python

dataset = load_dataset(
card="cards.xsum",
template="templates.summarization.abstractive.formal",
format="formats.chat_api",
metrics=[llm_judge_with_summary_metric],
loader_limit=5,
split="test",
)

Notice that we create the data with ``format="formats.chat_api"``, which produces the data as a list of chat turns:

.. code-block:: python

[
{"role": "system", "content": "Summarize the following Document."},
{"role": "user", "content": "Document: <...>"}
]

Now run inference on the dataset:

.. code-block:: python

predictions = engine.infer(dataset)

Finally, evaluate the predictions and obtain final scores:

.. code-block:: python

evaluate(predictions=predictions, data=dataset)
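
The returned results contain both per-instance and aggregated scores. As a minimal sketch (assuming the usual Unitxt result layout, in which every evaluated instance carries a ``score`` field with ``global`` and ``instance`` entries), you can inspect the aggregated scores like this:

.. code-block:: python

    from unitxt.text_utils import print_dict

    results = evaluate(predictions=predictions, data=dataset)

    # Aggregated scores computed over the whole dataset (the exact keys depend
    # on the metrics attached to the card/task).
    print_dict(results[0]["score"]["global"])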

Calling Models Through APIs
---------------------------
Calling models through an API is even simpler and is primarily done using one class: :class:`CrossProviderInferenceEngine <unitxt.inference.CrossProviderInferenceEngine>`.

You can create a :class:`CrossProviderInferenceEngine` as follows:

.. code-block:: python

engine = CrossProviderInferenceEngine(
model="llama-3-2-1b-instruct", provider="watsonx"
)

This engine supports providers such as ``watsonx``, ``together-ai``, ``open-ai``, ``aws``, ``ollama``, ``bam``, and ``watsonx-sdk``.

It can be used with all supported models listed here: :class:`supported models <unitxt.inference.CrossProviderInferenceEngine>`.
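
For instance, the same model identifier can be routed to a different backend just by changing the ``provider`` argument. A minimal sketch (assuming the model is offered by both providers and that each provider's API credentials are already configured in your environment):

.. code-block:: python

    # Same model name, different backing providers; credentials are assumed to
    # be set via each provider's usual environment variables.
    watsonx_engine = CrossProviderInferenceEngine(
        model="llama-3-2-1b-instruct", provider="watsonx"
    )
    together_engine = CrossProviderInferenceEngine(
        model="llama-3-2-1b-instruct", provider="together-ai"
    )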

Running inference follows the same pattern as before:

.. code-block:: python

predictions = engine.infer(dataset)

Creating a Cross-API Engine
---------------------------
Alternatively, you can create an engine without specifying a provider:

.. code-block:: python

engine = CrossProviderInferenceEngine(
model="llama-3-2-1b-instruct"
)

You can set the provider later in your code:

.. code-block:: python

import unitxt

unitxt.settings.default_provider = "watsonx"

Or set it via an environment variable:

.. code-block:: bash

export UNITXT_DEFAULT_PROVIDER="watsonx"
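
Putting it together, here is a minimal end-to-end sketch of the provider-agnostic flow (assuming a default provider is configured as above and that the provider's credentials are available in the environment):

.. code-block:: python

    import unitxt
    from unitxt import evaluate, load_dataset
    from unitxt.inference import CrossProviderInferenceEngine

    unitxt.settings.default_provider = "watsonx"  # or rely on UNITXT_DEFAULT_PROVIDER

    engine = CrossProviderInferenceEngine(model="llama-3-2-1b-instruct")

    dataset = load_dataset(
        card="cards.xsum",
        template="templates.summarization.abstractive.formal",
        format="formats.chat_api",
        loader_limit=5,
        split="test",
    )

    predictions = engine.infer(dataset)
    results = evaluate(predictions=predictions, data=dataset)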
1 change: 1 addition & 0 deletions docs/docs/tutorials.rst
@@ -19,6 +19,7 @@ Tutorials ✨
multimodality
operators
saving_and_loading_from_catalog
inference
production
debugging
helm
30 changes: 13 additions & 17 deletions examples/evaluate_a_judge_model_capabilities_on_arena_hard.py
@@ -1,32 +1,28 @@
from unitxt import evaluate, load_dataset
from unitxt.inference import MockInferenceEngine
from unitxt.inference import CrossProviderInferenceEngine
from unitxt.text_utils import print_dict

model_id = "meta-llama/llama-3-70b-instruct"
model_format = "formats.llama3_instruct"

"""
We are evaluating only on a small subset (by using "select(range(4)), in order for the example to finish quickly.
We are evaluating only on a small subset (by using `max_test_instances=4`), in order for the example to finish quickly.
The dataset's full size is around 40k examples. You should use around 1k-4k in your evaluations.
"""
dataset = load_dataset(
card="cards.arena_hard.response_assessment.pairwise_comparative_rating.both_games_gpt_4_judge",
template="templates.response_assessment.pairwise_comparative_rating.arena_hard_with_shuffling",
format=model_format,
)["test"].select(range(4))
format="formats.chat_api",
max_test_instances=None,
split="test",
).select(range(5))

inference_model = MockInferenceEngine(model_name=model_id)
inference_model = CrossProviderInferenceEngine(
model="llama-3-2-1b-instruct", provider="watsonx"
)
"""
We are using a mock inference engine (and model) in order for the example to finish quickly.
In real scenarios you can use model from Huggingface, OpenAi, and IBM, using the following:
from unitxt.inference import (HFPipelineBasedInferenceEngine, IbmGenAiInferenceEngine, OpenAiInferenceEngine)
and switch them with the MockInferenceEngine class in the example.
For the arguments these inference engines can receive, please refer to the classes documentation.
We are using a CrossProviderInferenceEngine, which supplies API access to providers such as:
watsonx, bam, openai, azure, aws, and more.

Example of using an IBM model:
from unitxt.inference import (IbmGenAiInferenceEngine, IbmGenAiInferenceEngineParamsMixin)
params = IbmGenAiInferenceEngineParamsMixin(max_new_tokens=1024, random_seed=42)
inference_model = IbmGenAiInferenceEngine(model_name=model_id, parameters=params)
For the arguments these inference engines can receive, please refer to the class documentation or read
about the OpenAI API arguments that CrossProviderInferenceEngine follows.
"""

predictions = inference_model.infer(dataset)
32 changes: 14 additions & 18 deletions examples/evaluate_a_model_using_arena_hard.py
@@ -1,35 +1,31 @@
from unitxt import evaluate, load_dataset
from unitxt.inference import MockInferenceEngine
from unitxt.inference import CrossProviderInferenceEngine
from unitxt.text_utils import print_dict

model_id = "meta-llama/llama-3-70b-instruct"
model_format = "formats.llama3_instruct"

"""
We are evaluating only on a small subset (by using "select(range(4)), in order for the example to finish quickly.
We are evaluating only on a small subset (by using `max_test_instances=4`), in order for the example to finish quickly.
The dataset's full size is around 40k examples. You should use around 1k-4k in your evaluations.
"""
dataset = load_dataset(
card="cards.arena_hard.generation.english_gpt_4_0314_reference",
template="templates.empty",
format=model_format,
format="formats.chat_api",
metrics=[
"metrics.llm_as_judge.pairwise_comparative_rating.llama_3_8b_instruct_ibm_genai_template_arena_hard_with_shuffling"
"metrics.llm_as_judge.pairwise_comparative_rating.llama_3_8b_instruct.template_arena_hard"
],
)["test"].select(range(4))
max_test_instances=4,
split="test",
)

inference_model = MockInferenceEngine(model_name=model_id)
inference_model = CrossProviderInferenceEngine(
model="llama-3-2-1b-instruct", provider="watsonx"
)
"""
We are using a mock inference engine (and model) in order for the example to finish quickly.
In real scenarios you can use model from Huggingface, OpenAi, and IBM, using the following:
from unitxt.inference import (HFPipelineBasedInferenceEngine, IbmGenAiInferenceEngine, OpenAiInferenceEngine)
and switch them with the MockInferenceEngine class in the example.
For the arguments these inference engines can receive, please refer to the classes documentation.
We are using a CrossProviderInferenceEngine, which supplies API access to providers such as:
watsonx, bam, openai, azure, aws, and more.

Example of using an IBM model:
from unitxt.inference import (IbmGenAiInferenceEngine, IbmGenAiInferenceEngineParamsMixin)
params = IbmGenAiInferenceEngineParamsMixin(max_new_tokens=1024, random_seed=42)
inference_model = IbmGenAiInferenceEngine(model_name=model_id, parameters=params)
For the arguments these inference engines can receive, please refer to the class documentation or read
about the OpenAI API arguments that CrossProviderInferenceEngine follows.
"""

predictions = inference_model.infer(dataset)