Add multi api inference engine #1343

Merged (33 commits) on Nov 19, 2024

Commits
de868ab
Add multi api inference engine
elronbandel Nov 12, 2024
c40d87d
Fix
elronbandel Nov 12, 2024
d53eb69
Set to greedy decoding
elronbandel Nov 12, 2024
e4a0799
Merge branch 'main' into multi-api-engine
elronbandel Nov 12, 2024
e41a0fa
Merge branch 'main' into multi-api-engine
elronbandel Nov 17, 2024
b36b7ab
Some fixes
elronbandel Nov 17, 2024
0593788
Fix consistency and preparation
elronbandel Nov 18, 2024
28bafa2
Update
elronbandel Nov 18, 2024
ccc72ae
Merge branch 'main' into multi-api-engine
elronbandel Nov 18, 2024
06991ca
Merge branch 'main' into multi-api-engine
elronbandel Nov 18, 2024
3c861fb
Fix test
elronbandel Nov 18, 2024
086aae8
Merge branch 'multi-api-engine' of https://github.com/IBM/unitxt into…
elronbandel Nov 18, 2024
f9cd539
Make all args None
elronbandel Nov 18, 2024
4165c78
Try
elronbandel Nov 18, 2024
f202c3a
Fix grammar
elronbandel Nov 18, 2024
bd8e176
Fix
elronbandel Nov 18, 2024
b686f95
Change api to provider
elronbandel Nov 18, 2024
b4dfe3b
Merge branch 'main' into multi-api-engine
elronbandel Nov 18, 2024
4c91d5e
Added support for param renaming.
yoavkatz Nov 18, 2024
eaead52
Fix merge issues
yoavkatz Nov 18, 2024
4c5ba45
Updated to CrossProviderModel
yoavkatz Nov 18, 2024
00dbd30
Update name back to InferenceEngine terminology
elronbandel Nov 18, 2024
a0373f8
Align all examples with chat api and cross provider engines
elronbandel Nov 19, 2024
4fa6f8e
Add vllm inference engine
elronbandel Nov 19, 2024
8115091
Fix blue bench to use cross provider engine
elronbandel Nov 19, 2024
986d268
Merge branch 'main' into multi-api-engine
elronbandel Nov 19, 2024
728fcc3
Added watsonx-sdk to MultiProviderInferenceEngine
yoavkatz Nov 19, 2024
9414f54
Make hf tests deterministic
elronbandel Nov 19, 2024
69388b5
Fix llmaj with chat api
elronbandel Nov 19, 2024
e921b01
Add inference documentation
elronbandel Nov 19, 2024
9946ab6
Fix examples
elronbandel Nov 19, 2024
ececf85
Fix examples
elronbandel Nov 19, 2024
71365e7
Update docs/docs/inference.rst
elronbandel Nov 19, 2024
41 changes: 20 additions & 21 deletions docs/docs/examples.rst
@@ -1,6 +1,6 @@
.. _examples:
==============
Examples
==============

Here you will find complete coding samples showing how to perform different tasks using Unitxt.
@@ -18,7 +18,7 @@ This example demonstrates how to evaluate an existing entailment dataset (wnli)

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_existing_dataset_no_install.py>`_

Related documentation: :ref:`Evaluating datasets <evaluating_datasets>`, :ref:`WNLI dataset card in catalog <catalog.cards.wnli>`, :ref:`Relation template in catalog <catalog.templates.classification.multi_class.relation.default>`.
Related documentation: :ref:`Evaluating datasets <evaluating_datasets>`, :ref:`WNLI dataset card in catalog <catalog.cards.wnli>`, :ref:`Relation template in catalog <catalog.templates.classification.multi_class.relation.default>`, :ref:`Inference Engines <inference>`.

Evaluate an existing dataset from the Unitxt catalog (with Unitxt installation)
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
@@ -28,7 +28,7 @@ This approach is faster than using Huggingface APIs.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_existing_dataset_with_install.py>`_

Related documentation: :ref:`Installation <installation>` , :ref:`WNLI dataset card in catalog <catalog.cards.wnli>`, :ref:`Relation template in catalog <catalog.templates.classification.multi_class.relation.default>`.
Related documentation: :ref:`Installation <installation>` , :ref:`WNLI dataset card in catalog <catalog.cards.wnli>`, :ref:`Relation template in catalog <catalog.templates.classification.multi_class.relation.default>`, :ref:`Inference Engines <inference>`.


Evaluate a custom dataset
@@ -48,7 +48,7 @@ It also shows how to use preprocessing steps to align the raw input of the datas

`Example code <https://github.com/IBM/unitxt/blob/main/examples/qa_evaluation.py>`_

Related documentation: :ref:`Add new dataset tutorial <adding_dataset>`, :ref:`Open QA task in catalog <catalog.tasks.qa.open>`, :ref:`Open QA template in catalog <catalog.templates.qa.open.title>`.
Related documentation: :ref:`Add new dataset tutorial <adding_dataset>`, :ref:`Open QA task in catalog <catalog.tasks.qa.open>`, :ref:`Open QA template in catalog <catalog.templates.qa.open.title>`, :ref:`Inference Engines <inference>`.


Evaluation usecases
@@ -62,7 +62,7 @@ It also shows how to register assets into a local catalog and reuse them.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_different_templates.py>`_

Related documentation: :ref:`Templates tutorial <adding_template>`, :ref:`Formatting tutorial <adding_format>`, :ref:`Using the Catalog <using_catalog>`.
Related documentation: :ref:`Templates tutorial <adding_template>`, :ref:`Formatting tutorial <adding_format>`, :ref:`Using the Catalog <using_catalog>`, :ref:`Inference Engines <inference>`.

Evaluate the impact of different formats and system prompts
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
@@ -92,21 +92,21 @@ This example demonstrates how to evaluate a dataset using a pool of templates an

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_different_templates_num_demos.py>`_

Related documentation: :ref:`Templates tutorial <adding_template>`, :ref:`Formatting tutorial <adding_format>`, :ref:`Using the Catalog <using_catalog>`.
Related documentation: :ref:`Templates tutorial <adding_template>`, :ref:`Formatting tutorial <adding_format>`, :ref:`Using the Catalog <using_catalog>`, :ref:`Inference Engines <inference>`.

Long Context
+++++++++++++++++++++++++++++

This example explores the effect of long context in classification.
It converts a standard multi class classification dataset (sst2 sentiment classification),
where single sentence texts are classified one by one, to a dataset
where multiple sentences are classified using a single LLM call.
It compares the f1_micro in both approaches on two models.
It uses serializers to verbalize an enumerated list of multiple sentences and labels.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_batched_multiclass_classification.py>`_

Related documentation: :ref:`Sst2 dataset card in catalog <catalog.cards.sst2>` :ref:`Types and Serializers Guide <types_and_serializers>`.

Construct a benchmark of multiple datasets and obtain the final score
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
@@ -115,7 +115,7 @@ This example shows how to construct a benchmark that includes multiple datasets,

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_benchmark.py>`_

Related documentation: :ref:`Benchmarks tutorial <adding_benchmark>`, :ref:`Formatting tutorial <adding_format>`, :ref:`Using the Catalog <using_catalog>`.
Related documentation: :ref:`Benchmarks tutorial <adding_benchmark>`, :ref:`Formatting tutorial <adding_format>`, :ref:`Using the Catalog <using_catalog>`, :ref:`Inference Engines <inference>`.

LLM as Judges
--------------
@@ -127,7 +127,7 @@ This example demonstrates how to evaluate an existing QA dataset (squad) using t

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_existing_dataset_by_llm_as_judge.py>`_

Related documentation: :ref:`Evaluating datasets <evaluating_datasets>`, :ref:`LLM as a Judge Metrics Guide <llm_as_judge>`.
Related documentation: :ref:`Evaluating datasets <evaluating_datasets>`, :ref:`LLM as a Judge Metrics Guide <llm_as_judge>`, :ref:`Inference Engines <inference>`.

Evaluate a custom dataset using a custom LLM as Judge
+++++++++++++++++++++++++++++++++++++++++++++++++++++
@@ -160,7 +160,7 @@ while the 70b model performs much better.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_llm_as_judge.py>`_

Related documentation: :ref:`LLM as a Judge Metrics Guide <llm_as_judge>`.
Related documentation: :ref:`LLM as a Judge Metrics Guide <llm_as_judge>`, :ref:`Inference Engines <inference>`.


Evaluate your model on the Arena Hard benchmark using a custom LLMaJ
@@ -170,7 +170,7 @@ This example demonstrates how to evaluate a user model on the Arena Hard benchma

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_a_model_using_arena_hard.py>`_

Related documentation: :ref:`Evaluate a Model on Arena Hard Benchmark <arena_hard_evaluation>`.
Related documentation: :ref:`Evaluate a Model on Arena Hard Benchmark <arena_hard_evaluation>`, :ref:`Inference Engines <inference>`.

Evaluate a judge model performance judging the Arena Hard Benchmark
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
@@ -180,7 +180,7 @@ The model is evaluated on its capability to give a judgment that is in correlati

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_a_judge_model_capabilities_on_arena_hard.py>`_

Related documentation: :ref:`Evaluate a Model on Arena Hard Benchmark <arena_hard_evaluation>`.
Related documentation: :ref:`Evaluate a Model on Arena Hard Benchmark <arena_hard_evaluation>`, :ref:`Inference Engines <inference>`.

Evaluate using ensemble of LLM as a judge metrics
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
@@ -190,7 +190,7 @@ The example shows how to ensemble two judges which use different templates.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_using_metrics_ensemble.py>`_

Related documentation: :ref:`LLM as a Judge Metrics Guide <llm_as_judge>`.
Related documentation: :ref:`LLM as a Judge Metrics Guide <llm_as_judge>`, :ref:`Inference Engines <inference>`.

Evaluate predictions of models using pre-trained ensemble of LLM as judges
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
@@ -206,7 +206,7 @@ Groundedness: Every substantial claim in the response of the model is derivable
IDK: Does the model response say I don't know?
`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_idk_judge.py>`

Related documentation: :ref:`LLM as a Judge Metrics Guide <llm_as_judge>`.
Related documentation: :ref:`LLM as a Judge Metrics Guide <llm_as_judge>`, :ref:`Inference Engines <inference>`.

RAG
---
@@ -222,7 +222,7 @@ and use the existing metrics to evaluate model results.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_rag_response_generation.py>`_

Related documentation: :ref:`RAG Guide <rag_support>`. :ref:`Response generation task <catalog.tasks.rag.response_generation>`.
Related documentation: :ref:`RAG Guide <rag_support>`, :ref:`Response generation task <catalog.tasks.rag.response_generation>`, :ref:`Inference Engines <inference>`.

Multi-Modality
--------------
@@ -243,15 +243,15 @@ This approach can be adapted for various image-text to text tasks, such as image

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_image_text_to_text.py>`_

Related documentation: :ref:`Multi-Modality Guide <multi_modality>`.
Related documentation: :ref:`Multi-Modality Guide <multi_modality>`, :ref:`Inference Engines <inference>`.


Evaluate Image-Text to Text Model With Different Templates
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Evaluate Image-Text to Text Models with different templates and explore the sensitivity of the model to different textual variations.
`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_image_text_to_text_with_different_templates.py>`_

Related documentation: :ref:`Multi-Modality Guide <multi_modality>`.
Related documentation: :ref:`Multi-Modality Guide <multi_modality>`, :ref:`Inference Engines <inference>`.

Types and Serializers
----------------------------
@@ -263,6 +263,5 @@ This example shows how to define new data types as well as the way these data typ

`Example code <https://github.com/IBM/unitxt/blob/main/examples/custom_types.py>`_

Related documentation: :ref:`Types and Serializers Guide <types_and_serializers>`.

Related documentation: :ref:`Types and Serializers Guide <types_and_serializers>`, :ref:`Inference Engines <inference>`.

114 changes: 114 additions & 0 deletions docs/docs/inference.rst
@@ -0,0 +1,114 @@
.. _inference:

==============
Inference
==============

.. note::

This tutorial requires a :ref:`Unitxt installation <install_unitxt>`.

Introduction
------------
Unitxt offers a wide array of :class:`Inference Engines <unitxt.inference>` for running models either locally (using HuggingFace, Ollama, and VLLM) or by making API requests to services like WatsonX, OpenAI, AWS, and Together AI.

Unitxt inference engines serve two main purposes:

1. Running a full end-to-end evaluation pipeline with inference.
2. Using models for intermediate steps, such as evaluating other models (e.g., LLMs as judges) or for data augmentation.

Running Models Locally
-----------------------
You can run models locally with inference engines like:

- :class:`HFPipelineBasedInferenceEngine <unitxt.inference.HFPipelineBasedInferenceEngine>`
- :class:`VLLMInferenceEngine <unitxt.inference.VLLMInferenceEngine>`
- :class:`OllamaInferenceEngine <unitxt.inference.OllamaInferenceEngine>`

To get started, prepare your engine:

.. code-block:: python

engine = HFPipelineBasedInferenceEngine(
model_name="meta-llama/Llama-3.2-1B", max_new_tokens=32
)

Then load the data:

.. code-block:: python

dataset = load_dataset(
card="cards.xsum",
template="templates.summarization.abstractive.formal",
format="formats.chat_api",
metrics=[llm_judge_with_summary_metric],
loader_limit=5,
split="test",
)

Notice that we create the data with ``format="formats.chat_api"``, which produces the data as a list of chat turns:

.. code-block:: python

[
{"role": "system", "content": "Summarize the following Document."},
{"role": "user", "content": "Document: <...>"}
]

Now run inference on the dataset:

.. code-block:: python

predictions = engine.infer(dataset)

Finally, evaluate the predictions and obtain final scores:

.. code-block:: python

evaluate(predictions=predictions, data=dataset)
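
The returned results contain both per-instance and aggregated scores. As a minimal sketch (assuming the usual Unitxt result layout, in which every evaluated instance carries a ``score`` field with ``global`` and ``instance`` entries), you can inspect the aggregated scores like this:

.. code-block:: python

    from unitxt.text_utils import print_dict

    results = evaluate(predictions=predictions, data=dataset)

    # Aggregated scores computed over the whole dataset (the exact keys depend
    # on the metrics attached to the card/task).
    print_dict(results[0]["score"]["global"])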

Calling Models Through APIs
---------------------------
Calling models through an API is even simpler and is primarily done using one class: :class:`CrossProviderInferenceEngine <unitxt.inference.CrossProviderInferenceEngine>`.

You can create a :class:`CrossProviderInferenceEngine` as follows:

.. code-block:: python

engine = CrossProviderInferenceEngine(
model="llama-3-2-1b-instruct", provider="watsonx"
)

This engine supports providers such as ``watsonx``, ``together-ai``, ``open-ai``, ``aws``, ``ollama``, ``bam``, and ``watsonx-sdk``.

It can be used with all supported models listed here: :class:`supported models <unitxt.inference.CrossProviderInferenceEngine>`.
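
For instance, the same model identifier can be routed to a different backend just by changing the ``provider`` argument. A minimal sketch (assuming the model is offered by both providers and that each provider's API credentials are already configured in your environment):

.. code-block:: python

    # Same model name, different backing providers; credentials are assumed to
    # be set via each provider's usual environment variables.
    watsonx_engine = CrossProviderInferenceEngine(
        model="llama-3-2-1b-instruct", provider="watsonx"
    )
    together_engine = CrossProviderInferenceEngine(
        model="llama-3-2-1b-instruct", provider="together-ai"
    )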

Running inference follows the same pattern as before:

.. code-block:: python

predictions = engine.infer(dataset)

Creating a Cross-API Engine
---------------------------
Alternatively, you can create an engine without specifying a provider:

.. code-block:: python

engine = CrossProviderInferenceEngine(
model="llama-3-2-1b-instruct"
)

You can set the provider later in your code:

.. code-block:: python

import unitxt

unitxt.settings.default_provider = "watsonx"

Or set it via an environment variable:

.. code-block:: bash

export UNITXT_DEFAULT_PROVIDER="watsonx"
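
Putting it together, here is a minimal end-to-end sketch of the provider-agnostic flow (assuming a default provider is configured as above and that the provider's credentials are available in the environment):

.. code-block:: python

    import unitxt
    from unitxt import evaluate, load_dataset
    from unitxt.inference import CrossProviderInferenceEngine

    unitxt.settings.default_provider = "watsonx"  # or rely on UNITXT_DEFAULT_PROVIDER

    engine = CrossProviderInferenceEngine(model="llama-3-2-1b-instruct")

    dataset = load_dataset(
        card="cards.xsum",
        template="templates.summarization.abstractive.formal",
        format="formats.chat_api",
        loader_limit=5,
        split="test",
    )

    predictions = engine.infer(dataset)
    results = evaluate(predictions=predictions, data=dataset)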
1 change: 1 addition & 0 deletions docs/docs/tutorials.rst
@@ -19,6 +19,7 @@ Tutorials ✨
multimodality
operators
saving_and_loading_from_catalog
inference
production
debugging
helm
30 changes: 13 additions & 17 deletions examples/evaluate_a_judge_model_capabilities_on_arena_hard.py
@@ -1,32 +1,28 @@
from unitxt import evaluate, load_dataset
from unitxt.inference import MockInferenceEngine
from unitxt.inference import CrossProviderInferenceEngine
from unitxt.text_utils import print_dict

model_id = "meta-llama/llama-3-70b-instruct"
model_format = "formats.llama3_instruct"

"""
We are evaluating only on a small subset (by using "select(range(4)), in order for the example to finish quickly.
We are evaluating only on a small subset (by using `max_test_instances=4`), in order for the example to finish quickly.
The dataset's full size is around 40k examples. You should use around 1k-4k in your evaluations.
"""
dataset = load_dataset(
card="cards.arena_hard.response_assessment.pairwise_comparative_rating.both_games_gpt_4_judge",
template="templates.response_assessment.pairwise_comparative_rating.arena_hard_with_shuffling",
format=model_format,
)["test"].select(range(4))
format="formats.chat_api",
max_test_instances=None,
split="test",
).select(range(5))

inference_model = MockInferenceEngine(model_name=model_id)
inference_model = CrossProviderInferenceEngine(
model="llama-3-2-1b-instruct", provider="watsonx"
)
"""
We are using a mock inference engine (and model) in order for the example to finish quickly.
In real scenarios you can use model from Huggingface, OpenAi, and IBM, using the following:
from unitxt.inference import (HFPipelineBasedInferenceEngine, IbmGenAiInferenceEngine, OpenAiInferenceEngine)
and switch them with the MockInferenceEngine class in the example.
For the arguments these inference engines can receive, please refer to the classes documentation.
We are using a CrossProviderInferenceEngine, which supplies API access to providers such as:
watsonx, bam, openai, azure, aws, and more.

Example of using an IBM model:
from unitxt.inference import (IbmGenAiInferenceEngine, IbmGenAiInferenceEngineParamsMixin)
params = IbmGenAiInferenceEngineParamsMixin(max_new_tokens=1024, random_seed=42)
inference_model = IbmGenAiInferenceEngine(model_name=model_id, parameters=params)
For the arguments these inference engines can receive, please refer to the class documentation or read
about the OpenAI API arguments that CrossProviderInferenceEngine follows.
"""

predictions = inference_model.infer(dataset)
32 changes: 14 additions & 18 deletions examples/evaluate_a_model_using_arena_hard.py
@@ -1,35 +1,31 @@
from unitxt import evaluate, load_dataset
from unitxt.inference import MockInferenceEngine
from unitxt.inference import CrossProviderInferenceEngine
from unitxt.text_utils import print_dict

model_id = "meta-llama/llama-3-70b-instruct"
model_format = "formats.llama3_instruct"

"""
We are evaluating only on a small subset (by using "select(range(4)), in order for the example to finish quickly.
We are evaluating only on a small subset (by using `max_test_instances=4`), in order for the example to finish quickly.
The dataset's full size is around 40k examples. You should use around 1k-4k in your evaluations.
"""
dataset = load_dataset(
card="cards.arena_hard.generation.english_gpt_4_0314_reference",
template="templates.empty",
format=model_format,
format="formats.chat_api",
metrics=[
"metrics.llm_as_judge.pairwise_comparative_rating.llama_3_8b_instruct_ibm_genai_template_arena_hard_with_shuffling"
"metrics.llm_as_judge.pairwise_comparative_rating.llama_3_8b_instruct.template_arena_hard"
],
)["test"].select(range(4))
max_test_instances=4,
split="test",
)

inference_model = MockInferenceEngine(model_name=model_id)
inference_model = CrossProviderInferenceEngine(
model="llama-3-2-1b-instruct", provider="watsonx"
)
"""
We are using a mock inference engine (and model) in order for the example to finish quickly.
In real scenarios you can use model from Huggingface, OpenAi, and IBM, using the following:
from unitxt.inference import (HFPipelineBasedInferenceEngine, IbmGenAiInferenceEngine, OpenAiInferenceEngine)
and switch them with the MockInferenceEngine class in the example.
For the arguments these inference engines can receive, please refer to the classes documentation.
We are using a CrossProviderInferenceEngine, which supplies API access to providers such as:
watsonx, bam, openai, azure, aws, and more.

Example of using an IBM model:
from unitxt.inference import (IbmGenAiInferenceEngine, IbmGenAiInferenceEngineParamsMixin)
params = IbmGenAiInferenceEngineParamsMixin(max_new_tokens=1024, random_seed=42)
inference_model = IbmGenAiInferenceEngine(model_name=model_id, parameters=params)
For the arguments these inference engines can receive, please refer to the class documentation or read
about the OpenAI API arguments that CrossProviderInferenceEngine follows.
"""

predictions = inference_model.infer(dataset)