Add multi api inference engine #1343

Open · wants to merge 26 commits into base: main

Commits (26)
de868ab
Add multi api inference engine
elronbandel Nov 12, 2024
c40d87d
Fix
elronbandel Nov 12, 2024
d53eb69
Set to greedy decoding
elronbandel Nov 12, 2024
e4a0799
Merge branch 'main' into multi-api-engine
elronbandel Nov 12, 2024
e41a0fa
Merge branch 'main' into multi-api-engine
elronbandel Nov 17, 2024
b36b7ab
Some fixes
elronbandel Nov 17, 2024
0593788
Fix consistency and preparation
elronbandel Nov 18, 2024
28bafa2
Update
elronbandel Nov 18, 2024
ccc72ae
Merge branch 'main' into multi-api-engine
elronbandel Nov 18, 2024
06991ca
Merge branch 'main' into multi-api-engine
elronbandel Nov 18, 2024
3c861fb
Fix test
elronbandel Nov 18, 2024
086aae8
Merge branch 'multi-api-engine' of https://github.com/IBM/unitxt into…
elronbandel Nov 18, 2024
f9cd539
Make all args None
elronbandel Nov 18, 2024
4165c78
Try
elronbandel Nov 18, 2024
f202c3a
Fix grammar
elronbandel Nov 18, 2024
bd8e176
Fix
elronbandel Nov 18, 2024
b686f95
Change api to provider
elronbandel Nov 18, 2024
b4dfe3b
Merge branch 'main' into multi-api-engine
elronbandel Nov 18, 2024
4c91d5e
Added support for param renaming.
yoavkatz Nov 18, 2024
eaead52
Fix merge issues
yoavkatz Nov 18, 2024
4c5ba45
Updated to CrossProviderModel
yoavkatz Nov 18, 2024
00dbd30
Update name back to InferenceEngine terminology
elronbandel Nov 18, 2024
a0373f8
Align all examples with chat api and cross provider engines
elronbandel Nov 19, 2024
4fa6f8e
Add vllm inference engine
elronbandel Nov 19, 2024
8115091
Fix blue bench to use cross provider engine
elronbandel Nov 19, 2024
986d268
Merge branch 'main' into multi-api-engine
elronbandel Nov 19, 2024
9 changes: 4 additions & 5 deletions docs/docs/examples.rst
@@ -1,6 +1,6 @@
.. _examples:
==============
Examples
Examples
==============

Here you will find complete coding samples showing how to perform different tasks using Unitxt.
@@ -97,16 +97,16 @@ Related documentation: :ref:`Templates tutorial <adding_template>`, :ref:`Format
Long Context
+++++++++++++++++++++++++++++

This example explores the effect of long context in classification.
This example explores the effect of long context in classification.
It converts a standard multi class classification dataset (sst2 sentiment classification),
where single sentence texts are classified one by one, to a dataset
where multiple sentences are classified using a single LLM call.
where multiple sentences are classified using a single LLM call.
It compares the f1_micro in both approaches on two models.
It uses serializers to verbalize an enumerated list of multiple sentences and labels.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_batched_multiclass_classification.py>`_

Related documentation: :ref:`Sst2 dataset card in catalog <catalog.cards.sst2>` :ref:`Types and Serializers Guide <types_and_serializers>`.
Related documentation: :ref:`Sst2 dataset card in catalog <catalog.cards.sst2>` :ref:`Types and Serializers Guide <types_and_serializers>`.

Construct a benchmark of multiple datasets and obtain the final score
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
@@ -265,4 +265,3 @@ This example shows how to define new data types as well as the way these data typ

Related documentation: :ref:`Types and Serializers Guide <types_and_serializers>`.


Empty file added docs/docs/inference.rst
Empty file.
30 changes: 13 additions & 17 deletions examples/evaluate_a_judge_model_capabilities_on_arena_hard.py
@@ -1,32 +1,28 @@
from unitxt import evaluate, load_dataset
from unitxt.inference import MockInferenceEngine
from unitxt.inference import CrossProviderInferenceEngine
from unitxt.text_utils import print_dict

model_id = "meta-llama/llama-3-70b-instruct"
model_format = "formats.llama3_instruct"

"""
We are evaluating only on a small subset (by using "select(range(4)), in order for the example to finish quickly.
We are evaluating only on a small subset (by using `max_test_instances=4`), in order for the example to finish quickly.
The dataset's full size is around 40k examples. You should use around 1k-4k in your evaluations.
"""
dataset = load_dataset(
card="cards.arena_hard.response_assessment.pairwise_comparative_rating.both_games_gpt_4_judge",
template="templates.response_assessment.pairwise_comparative_rating.arena_hard_with_shuffling",
format=model_format,
)["test"].select(range(4))
format="formats.chat_api",
max_test_instances=4,
split="test",
)

inference_model = MockInferenceEngine(model_name=model_id)
inference_model = CrossProviderInferenceEngine(
model="llama-3-2-1b-instruct", provider="watsonx"
)
"""
We are using a mock inference engine (and model) in order for the example to finish quickly.
In real scenarios you can use model from Huggingface, OpenAi, and IBM, using the following:
from unitxt.inference import (HFPipelineBasedInferenceEngine, IbmGenAiInferenceEngine, OpenAiInferenceEngine)
and switch them with the MockInferenceEngine class in the example.
For the arguments these inference engines can receive, please refer to the classes documentation.
We are using a CrossProviderInferenceEngine, which provides API access to providers such as
watsonx, bam, openai, azure, aws, and more.

Example of using an IBM model:
from unitxt.inference import (IbmGenAiInferenceEngine, IbmGenAiInferenceEngineParamsMixin)
params = IbmGenAiInferenceEngineParamsMixin(max_new_tokens=1024, random_seed=42)
inference_model = IbmGenAiInferenceEngine(model_name=model_id, parameters=params)
For the arguments these inference engines can receive, please refer to the classes' documentation or read
about the OpenAI API arguments that CrossProviderInferenceEngine follows.
"""

predictions = inference_model.infer(dataset)
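Taken together, the change above replaces the mock engine with a single construction call that can target any supported provider. The following is a minimal sketch of that pattern, not part of this PR's diff; the provider swap and the max_tokens value are assumptions based on how the engine is used elsewhere in this PR:

from unitxt.inference import CrossProviderInferenceEngine

# Same engine as in the example above; only the provider string would change
# to move between backends. Availability of a given model alias on a given
# provider is an assumption to verify.
inference_model = CrossProviderInferenceEngine(
    model="llama-3-2-1b-instruct",
    provider="watsonx",  # the docstring lists watsonx, bam, openai, azure, aws, and more
    max_tokens=256,      # OpenAI-style argument, as used in the batched-classification example
)
predictions = inference_model.infer(dataset)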
32 changes: 14 additions & 18 deletions examples/evaluate_a_model_using_arena_hard.py
@@ -1,35 +1,31 @@
from unitxt import evaluate, load_dataset
from unitxt.inference import MockInferenceEngine
from unitxt.inference import CrossProviderInferenceEngine
from unitxt.text_utils import print_dict

model_id = "meta-llama/llama-3-70b-instruct"
model_format = "formats.llama3_instruct"

"""
We are evaluating only on a small subset (by using "select(range(4)), in order for the example to finish quickly.
We are evaluating only on a small subset (by using `max_test_instances=4`), in order for the example to finish quickly.
The dataset's full size is around 40k examples. You should use around 1k-4k in your evaluations.
"""
dataset = load_dataset(
card="cards.arena_hard.generation.english_gpt_4_0314_reference",
template="templates.empty",
format=model_format,
format="formats.chat_api",
metrics=[
"metrics.llm_as_judge.pairwise_comparative_rating.llama_3_8b_instruct_ibm_genai_template_arena_hard_with_shuffling"
"metrics.llm_as_judge.pairwise_comparative_rating.llama_3_8b_instruct.template_arena_hard"
],
)["test"].select(range(4))
max_test_instances=4,
split="test",
)

inference_model = MockInferenceEngine(model_name=model_id)
inference_model = CrossProviderInferenceEngine(
model="llama-3-2-1b-instruct", provider="watsonx"
)
"""
We are using a mock inference engine (and model) in order for the example to finish quickly.
In real scenarios you can use model from Huggingface, OpenAi, and IBM, using the following:
from unitxt.inference import (HFPipelineBasedInferenceEngine, IbmGenAiInferenceEngine, OpenAiInferenceEngine)
and switch them with the MockInferenceEngine class in the example.
For the arguments these inference engines can receive, please refer to the classes documentation.
We are using a CrossProviderInferenceEngine, which provides API access to providers such as
watsonx, bam, openai, azure, aws, and more.

Example of using an IBM model:
from unitxt.inference import (IbmGenAiInferenceEngine, IbmGenAiInferenceEngineParamsMixin)
params = IbmGenAiInferenceEngineParamsMixin(max_new_tokens=1024, random_seed=42)
inference_model = IbmGenAiInferenceEngine(model_name=model_id, parameters=params)
For the arguments these inference engines can receive, please refer to the classes' documentation or read
about the OpenAI API arguments that CrossProviderInferenceEngine follows.
"""

predictions = inference_model.infer(dataset)
191 changes: 105 additions & 86 deletions examples/evaluate_batched_multiclass_classification.py
@@ -28,7 +28,7 @@ class ParseEnumeratedList(FieldOperator):
def process_value(self, text: Any) -> Any:
result = []
for x in text.split("\n"):
line_result = re.findall(r"(\d+)\.\s*(\w+)", x)
line_result = re.findall(r"(\d+)\.\s*(.*)", x)
if len(line_result) == 1:
result.append(line_result[0])
return result
@@ -63,96 +63,115 @@ def serialize(self, value: EnumeratedList, instance: Dict[str, Any]) -> str:

template = InputOutputTemplate(
input_format="Classify each of the texts to its corresponding {type_of_class} from one of these options:\n{classes}\nReturn for each index the correspond class in a separate line.\nTexts:\n{texts}",
# target_prefix="Answer:\n",
target_prefix="Answer:\n",
output_format="{labels}",
postprocessors=[PostProcess(ParseEnumeratedList())],
postprocessors=["processors.lower_case", PostProcess(ParseEnumeratedList())],
serializer=MultiTypeSerializer(serializers=[EnumeratedListSerializer()]),
)
df = pd.DataFrame(
columns=["model", "batch_size", "num_instances", "f1_micro", "ci_low", "ci_high"]
columns=[
"provider",
"model",
"batch_size",
"num_instances",
"f1_micro",
"ci_low",
"ci_high",
"hellucinations",
]
)

for model_name in [
"ibm/granite-3-8b-instruct",
"meta-llama/llama-3-8b-instruct",
for provider in [
"watsonx",
"bam",
]:
if model_name.startswith("ibm"):
format = SystemFormat(
demo_format=(
"{instruction}\\N{source}\\N<|end_of_text|>\n"
"<|start_of_role|>assistant<|end_of_role|>{target}\\N<|end_of_text|>\n"
"<|start_of_role|>user<|end_of_role|>"
),
model_input_format=(
"<|start_of_role|>system<|end_of_role|>{system_prompt}<|end_of_text|>\n"
"<|start_of_role|>user<|end_of_role|>{demos}{instruction}\\N{source}\\N<|end_of_text|>\n"
"<|start_of_role|>assistant<|end_of_role|>"
),
)
batch_sizes = [50, 30, 10, 1]

if model_name.startswith("meta-llama"):
format = "formats.llama3_instruct"
batch_sizes = [100, 50, 10, 1]

for batch_size in batch_sizes:
card, _ = fetch_artifact("cards.sst2")
card.preprocess_steps.extend(
[
CollateInstances(batch_size=batch_size),
Rename(field_to_field={"text": "texts", "label": "labels"}),
Copy(field="text_type/0", to_field="text_type"),
Copy(field="classes/0", to_field="classes"),
Copy(field="type_of_class/0", to_field="type_of_class"),
for model_name in [
"granite-3-8b-instruct",
"llama-3-8b-instruct",
]:
batch_sizes = [30, 20, 10, 5, 1]

for batch_size in batch_sizes:
card, _ = fetch_artifact("cards.banking77")
card.preprocess_steps.extend(
[
CollateInstances(batch_size=batch_size),
Rename(field_to_field={"text": "texts", "label": "labels"}),
Copy(field="text_type/0", to_field="text_type"),
Copy(field="classes/0", to_field="classes"),
Copy(field="type_of_class/0", to_field="type_of_class"),
]
)
card.task = task
card.templates = [template]
format = "formats.chat_api"
if provider == "bam" and model_name.startswith("llama"):
format = "formats.llama3_instruct"
if provider == "bam" and model_name.startswith("granite"):
format = SystemFormat(
demo_format=(
"{instruction}\\N{source}\\N<|end_of_text|>\n"
"<|start_of_role|>assistant<|end_of_role|>{target}\\N<|end_of_text|>\n"
"<|start_of_role|>user<|end_of_role|>"
),
model_input_format=(
"<|start_of_role|>system<|end_of_role|>{system_prompt}<|end_of_text|>\n"
"<|start_of_role|>user<|end_of_role|>{demos}{instruction}\\N{source}\\N<|end_of_text|>\n"
"<|start_of_role|>assistant<|end_of_role|>"
),
)

dataset = load_dataset(
card=card,
template_card_index=0,
format=format,
num_demos=1,
demos_pool_size=5,
loader_limit=1000,
max_test_instances=200 / batch_size,
)

test_dataset = dataset["test"]
from unitxt.inference import CrossProviderInferenceEngine

inference_model = CrossProviderInferenceEngine(
model=model_name, max_tokens=1024, provider=provider
)
"""
We are using a CrossProviderInferenceEngine, which provides API access to providers such as
watsonx, bam, openai, azure, aws, and more.

For the arguments these inference engines can receive, please refer to the classes' documentation or read
about the OpenAI API arguments that CrossProviderInferenceEngine follows.
"""
predictions = inference_model.infer(test_dataset)

evaluated_dataset = evaluate(predictions=predictions, data=test_dataset)
# import pandas as pd
# result_df = pd.json_normalize(evaluated_dataset)
# result_df.to_csv(f"output.csv")
# Print results
print_dict(
evaluated_dataset[0],
keys_to_print=[
"source",
"prediction",
"processed_prediction",
"processed_references",
],
)

global_scores = evaluated_dataset[0]["score"]["global"]
df.loc[len(df)] = [
provider,
model_name,
batch_size,
global_scores["num_of_instances"],
global_scores["score"],
global_scores["score_ci_low"],
global_scores["score_ci_high"],
1.0 - global_scores["in_classes_support"],
]
)
card.task = task
card.templates = [template]

dataset = load_dataset(
card=card,
template_card_index=0,
format=format,
num_demos=1,
demos_pool_size=5,
loader_limit=10000,
max_test_instances=1000 / batch_size,
)

test_dataset = dataset["test"]

# inference_model = IbmGenAiInferenceEngine(
# model_name=model_name, max_new_tokens=1024
# )

from unitxt.inference import WMLInferenceEngine

inference_model = WMLInferenceEngine(model_name=model_name, max_new_tokens=1024)

predictions = inference_model.infer(test_dataset)

evaluated_dataset = evaluate(predictions=predictions, data=test_dataset)

# Print results
print_dict(
evaluated_dataset[0],
keys_to_print=[
"source",
"prediction",
"processed_prediction",
"processed_references",
],
)

global_scores = evaluated_dataset[0]["score"]["global"]
df.loc[len(df)] = [
model_name,
batch_size,
global_scores["num_of_instances"],
global_scores["score"],
global_scores["score_ci_low"],
global_scores["score_ci_high"],
]

df = df.round(decimals=2)
logger.info(df.to_markdown())

df = df.round(decimals=2)
logger.info(df.to_markdown())
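A small but consequential detail in the diff above is the ParseEnumeratedList regex: the old "(\w+)" capture stops at the first word of each predicted label, while the new "(.*)" keeps the whole label, which matters once the example moves from sst2 to banking77, whose class names can span several words. A standalone sketch with made-up model output (the label strings are illustrative only, not taken from the dataset):

import re

# Hypothetical batched model output for two instances.
output = "1. card not working\n2. lost or stolen card"

old_style = [re.findall(r"(\d+)\.\s*(\w+)", line) for line in output.split("\n")]
new_style = [re.findall(r"(\d+)\.\s*(.*)", line) for line in output.split("\n")]

print(old_style)  # [[('1', 'card')], [('2', 'lost')]] -- multi-word labels get truncated
print(new_style)  # [[('1', 'card not working')], [('2', 'lost or stolen card')]]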
16 changes: 11 additions & 5 deletions examples/evaluate_benchmark.py
@@ -1,7 +1,7 @@
from unitxt.api import evaluate
from unitxt.benchmark import Benchmark
from unitxt.inference import (
HFPipelineBasedInferenceEngine,
CrossProviderInferenceEngine,
)
from unitxt.standard import StandardRecipe
from unitxt.text_utils import print_dict
@@ -47,11 +47,17 @@
test_dataset = list(benchmark()["test"])


# Infere using flan t5 base using HF API
model_name = "google/flan-t5-base"
inference_model = HFPipelineBasedInferenceEngine(
model_name=model_name, max_new_tokens=32
# Infer using llama-3-2-1b via the Watsonx API
inference_model = CrossProviderInferenceEngine(
model="llama-3-2-1b-instruct", provider="watsonx"
)
"""
We are using a CrossProviderInferenceEngine, which provides API access to providers such as
watsonx, bam, openai, azure, aws, and more.

For the arguments these inference engines can receive, please refer to the classes' documentation or read
about the OpenAI API arguments that CrossProviderInferenceEngine follows.
"""

predictions = inference_model.infer(test_dataset)
evaluated_dataset = evaluate(predictions=predictions, data=test_dataset)
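All of the rewritten examples share the same closing steps: run inference, call evaluate, and read the global scores. A compact recap assembled from the diffs above (no new API; inference_model and test_dataset are constructed as in the examples):

from unitxt import evaluate
from unitxt.text_utils import print_dict

# inference_model and test_dataset are built as shown in the examples above.
predictions = inference_model.infer(test_dataset)
evaluated_dataset = evaluate(predictions=predictions, data=test_dataset)

# Inspect one instance end to end.
print_dict(
    evaluated_dataset[0],
    keys_to_print=["source", "prediction", "processed_prediction", "processed_references"],
)

# Aggregate metrics (with confidence intervals) are attached to every instance.
global_scores = evaluated_dataset[0]["score"]["global"]
print(global_scores["score"], global_scores["score_ci_low"], global_scores["score_ci_high"])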