Noeloc/ch01 - KServe documentation #27

Open · wants to merge 5 commits into base: main
Binary file added modules/chapter1/images/Kserve-Serving.png
Binary file added modules/chapter1/images/ModelMesh-Serving.png
Binary file added modules/chapter1/images/hgf-clone.png
Binary file added modules/chapter1/images/llama-serving-model.png
Binary file added modules/chapter1/images/t5-config.png
Binary file added modules/chapter1/images/t5-deployed.jpg
Binary file added modules/chapter1/images/t5-flan-upload.png
Binary file added modules/chapter1/images/tgis-enabled.png
5 changes: 4 additions & 1 deletion modules/chapter1/nav.adoc
@@ -1,4 +1,7 @@
* xref:index.adoc[]
** xref:section1.adoc[]
** xref:section2.adoc[]
-** xref:section3.adoc[]
+** xref:section3.adoc[]
+** xref:section4.adoc[]
+** xref:section5.adoc[]
+** xref:section6.adoc[]
408 changes: 408 additions & 0 deletions modules/chapter1/pages/section4.adoc

Large diffs are not rendered by default.

91 changes: 91 additions & 0 deletions modules/chapter1/pages/section5.adoc
@@ -0,0 +1,91 @@
= Text Generation Inference Server (TGIS)

The TGIS implementation in RHOAI is a fork of the HuggingFace Text Generation Inference (TGI) repository.

More details about the reasons for the fork are available in the https://github.com/opendatahub-io/text-generation-inference[TGIS repo].

On RHOAI, TGIS is used as:

* The serving runtime for models using the CAIKIT framework.
* A standalone KServe runtime.

We're only going to focus on the TGIS-KServe use case here.

== TGIS Overview
TGIS is a model serving runtime written in Rust which serves _PyTorch_ models via a _gRPC_ interface. It only supports the https://huggingface.co/docs/safetensors/index[SafeTensors] model format.

It supports batching of requests as well as streaming responses for individual requests.
The gRPC interface definitions are available https://github.com/opendatahub-io/text-generation-inference/tree/main/proto[here].


[NOTE]
****
TGIS currently doesn't have an embeddings API, so embeddings (the numeric vector representations of text used for tasks such as semantic search and retrieval) have to be generated externally.


For example, this could be a call to an external _BERT_ model, or you could use a framework such as https://www.sbert.net/[Sentence Transformers] in your code.

****

== Serving a LLM via TGIS & KServe

In this example, we're going to serve the https://huggingface.co/google-t5/t5-small[google-t5/t5-small] model.

* The TGIS runtime is installed by default, but ensure that it's enabled by going to _Settings->Serving Runtimes_ in the RHOAI user interface and enabling it if necessary.

image::tgis-enabled.png[TGIS Enabled]

* Download the _google-t5/t5-small_ model from HuggingFace by following the HuggingFace instructions on cloning models; a sketch of the commands follows the screenshot.

image::hgf-clone.png[Clone model]
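A minimal sketch of the clone step, assuming `git` and `git-lfs` are installed locally:

```[bash]
# Git LFS is needed because the model weights are stored as LFS objects
git lfs install

# Clone the model repository from HuggingFace
git clone https://huggingface.co/google-t5/t5-small
```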

* Upload the model directory to S3; an example command follows the screenshot.

image::t5-flan-upload.png[Model uploaded to S3]
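One way to do this is with the AWS CLI; the bucket name and prefix below are placeholders, so substitute your own:

```[bash]
# Sync the cloned model directory to an S3 bucket/prefix of your choosing
aws s3 sync ./t5-small s3://<your-bucket>/models/t5-small/
```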

* Configure the model runtime and resource limits.

image::t5-config.png[Model serving resources]

* Wait for the model to be deployed.

image::t5-deployed.jpg[Model Ready]

=== Using the model

The model is served using the gRPC protocol, so to test it we need to fulfil a number of prerequisites:

* Download and install https://github.com/fullstorydev/grpcurl[gRPCurl]

* Create a _proto_ directory on your laptop and download the TGIS protobuf definitions from https://github.com/opendatahub-io/text-generation-inference/tree/main/proto[here] into the _proto_ directory, as sketched below.
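For example, the `generation.proto` file used by the commands below can be fetched directly from the repository (the raw URL is constructed from the link above; adjust it if the repository layout changes):

```[bash]
mkdir -p proto

# Download the generation service definition into the proto directory
curl -L -o proto/generation.proto \
  https://raw.githubusercontent.com/opendatahub-io/text-generation-inference/main/proto/generation.proto
```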


* Execute the following command to call the model
```[bash]
grpcurl -proto proto/generation.proto \
  -d '{"requests": [{"text": "generate a superhero name?"}]}' \
  -H 'mm-model-id: flan-t5-small' \
  -insecure \
  t51-testproject1.apps...:443 \
  fmaas.GenerationService/Generate
```
```[json]
{
  "responses": [
    {
      "generatedTokenCount": 6,
      "text": "samurai",
      "inputTokenCount": 7,
      "stopReason": "EOS_TOKEN"
    }
  ]
}
```
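Streaming responses go through the streaming RPC in the same service. A sketch, assuming the `GenerateStream` method and the single-request payload shape defined in `generation.proto` (check the proto file for the exact field names):

```[bash]
grpcurl -proto proto/generation.proto \
  -d '{"request": {"text": "generate a superhero name?"}}' \
  -H 'mm-model-id: flan-t5-small' \
  -insecure \
  t51-testproject1.apps...:443 \
  fmaas.GenerationService/GenerateStream
```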

* Execute the following command to find out details of the model being served
```[bash]
grpcurl -proto proto/generation.proto \
  -d '{"model_id": "flan-t5-small"}' \
  -H 'mm-model-id: flan-t5-small' \
  -insecure \
  t51-testproject1.apps.....:443 \
  fmaas.GenerationService/ModelInfo
```
```[json]
{
  "modelKind": "ENCODER_DECODER",
  "maxSequenceLength": 512,
  "maxNewTokens": 511
}
```

[NOTE]
For a Python-based example, look https://github.com/cfchase/basic-tgis[here].

157 changes: 157 additions & 0 deletions modules/chapter1/pages/section6.adoc
@@ -0,0 +1,157 @@
= KServe Custom Serving Example

KServe can also serve custom-built runtimes. In this example, we are going to serve a _large language model_ via a custom runtime based on the _llama.cpp_ project.

[sidebar]
.Llama.cpp - Running LLMs locally
****
https://github.com/ggerganov/llama.cpp[Llama.cpp] is an open-source project which enables CPU- and GPU-based inferencing on quantized LLMs.
Llama.cpp uses the quantized _GGUF_ and _GGML_ model formats. Initially llama.cpp was written to serve Llama models from Meta, but it has been extended to support other model architectures.

We're using llama.cpp here as it doesn't require a GPU to run and it provides more features and a longer context length than the T5 models.
Llama.cpp also has a basic HTTP server component which enables us to invoke models for inferencing. In this example we are not using the inbuilt HTTP server; instead we are using another OSS project named https://llama-cpp-python.readthedocs.io[llama-cpp-python], which provides an OpenAI-compatible web server.
****

[CAUTION]
This is an example of a custom runtime; it should *NOT* be used in a production setting.

== Build the custom runtime

To build the runtime, use the following _Containerfile_, which downloads the _llama.cpp_ source code, compiles it, and containerises it.

```[docker]
FROM registry.access.redhat.com/ubi9/python-311

USER 0
RUN dnf install -y git make g++ atlas-devel atlas openblas openblas-openmp
RUN mkdir -p /opt/llama.cpp && chmod 777 /opt/llama.cpp

WORKDIR /opt

USER 1001
WORKDIR /opt
RUN git clone https://github.com/ggerganov/llama.cpp

WORKDIR /opt/llama.cpp
RUN make

RUN pip install llama-cpp-python
RUN pip install uvicorn anyio starlette fastapi sse_starlette starlette_context pydantic_settings

ENV MODELNAME=test
ENV MODELLOCATION=/tmp/models

## Set value to "--chat_format chatml" for prompt formats
## see https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama_chat_format.py
ENV CHAT_FORMAT=""
EXPOSE 8080

ENTRYPOINT python3 -m llama_cpp.server --model ${MODELLOCATION}/${MODELNAME} ${CHAT_FORMAT} --host 0.0.0.0 --port 8080
```

Use podman to build, tag, and push the image to the registry of your choosing, for example:
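A minimal sketch, assuming you are already logged in to your registry with `podman login`; the quay.io organisation below is a placeholder:

```[bash]
# Build the image from the Containerfile above
podman build -t llama-cpp-python:latest .

# Tag it for your registry and push
podman tag llama-cpp-python:latest quay.io/<your-org>/llama-cpp-python:latest
podman push quay.io/<your-org>/llama-cpp-python:latest
```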

Use the following _ServingRuntime_ definition to configure the cluster via the RHOAI UI.

```[yaml]
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: llamacpp
  annotations:
    openshift.io/display-name: LLamaCPP
  labels:
    opendatahub.io/dashboard: "true"
spec:
  builtInAdapter:
    modelLoadingTimeoutMillis: 90000
  containers:
    - image: quay.io/noeloc/llama-cpp-python:latest
      name: kserve-container
      env:
        - name: MODELNAME
          value: "llama-2-7b-chat.Q4_K_M.gguf"
        - name: MODELLOCATION
          value: /mnt/models
        - name: CHAT_FORMAT
          value: ""
      volumeMounts:
        - name: shm
          mountPath: /dev/shm
      ports:
        - containerPort: 8000
          protocol: TCP
  volumes:
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 1Gi
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: gguf
```

image::llama-serving-runtime-create.png[Create Serving Runtime]

image::llama-serving-runtime-active.png[Serving Runtime Ready]


To test the model, you will have to download a _GGUF_ model from https://huggingface.co/[HuggingFace].

In this example, we're going to use the model https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF, in particular the https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/blob/main/llama-2-7b-chat.Q4_K_M.gguf[Q4_K_M version]. Download the file and upload it to an _S3 bucket_, for example:
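One way to fetch the file and push it to S3, assuming the AWS CLI is configured; the bucket name is a placeholder:

```[bash]
# Download the quantised GGUF file directly from HuggingFace
curl -L -O https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf

# Upload it to your S3 bucket
aws s3 cp llama-2-7b-chat.Q4_K_M.gguf s3://<your-bucket>/models/
```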

Then just serve the model using the _llamacpp_ serving runtime.

image::llama-serving-model.png[Configuring llamacpp model serving]
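For reference, the UI-driven deployment corresponds roughly to an _InferenceService_ like the sketch below; the name, project, and data-connection secret are placeholders, and the exact fields and annotations RHOAI generates may differ:

```[bash]
# Apply a minimal InferenceService that points at the llamacpp runtime
oc apply -n <your-project> -f - <<'EOF'
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama2-chat
spec:
  predictor:
    model:
      modelFormat:
        name: gguf
      runtime: llamacpp
      storage:
        key: <your-data-connection-secret>
        path: models/
EOF
```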

=== Invoking the model

An OpenAPI UI is available on the generated _route_, e.g. _https://llama2-chat-testproject1.apps.snoai.example.com/docs_.

_Curl_ also works. For example:

```[bash]
curl -X 'POST' \
'https://llama2-chat-testproject1.apps.snoai.example.com/v1/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"prompt": "\n\n### Instructions:\nHow do you bake a cake?\n\n### Response:\n",
"max_tokens":"500"
}'
```
This command produces the following output; you may be waiting a while, depending on the CPU performance of your machine.

```[json]
{
  "id": "cmpl-b615c214-ea5a-47e4-89f6-cf2fb0487bb4",
  "object": "text_completion",
  "created": 1712761834,
  "model": "/mnt/models/llama-2-7b-chat.Q4_K_M.gguf",
  "choices": [
    {
      "text": "To bake a cake, first preheat the oven to 350 degrees Fahrenheit (175 degrees Celsius). Next, mix together the dry ingredients such as flour, sugar, and baking powder in a large bowl. Then, add in the wet ingredients like eggs, butter or oil, and milk, and mix until well combined. Pour the batter into a greased cake pan and bake for 25-30 minutes or until a toothpick inserted into the center of the cake comes out clean. Remove from the oven and let cool before frosting and decorating.\n### Additional information:\n* It is important to use high-quality ingredients when baking a cake, as this will result in a better taste and texture.\n* When measuring flour, it is best to spoon it into the measuring cup rather than scooping it directly from the bag, as this ensures accurate measurements.\n* It is important to mix the wet and dry ingredients separately before combining them, as this helps to create a smooth batter.\n* When baking a cake, it is best to use a thermometer to ensure that the oven temperature is correct, as overheating or underheating can affect the outcome of the cake.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 27,
    "completion_tokens": 288,
    "total_tokens": 315
  }
}
```
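Because _llama-cpp-python_ exposes an OpenAI-compatible API, the chat endpoint can be exercised in the same way. A sketch using the same hostname as above; the request and response follow the OpenAI chat completions schema:

```[bash]
curl -X 'POST' \
  'https://llama2-chat-testproject1.apps.snoai.example.com/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [
      {"role": "user", "content": "How do you bake a cake?"}
    ],
    "max_tokens": 500
  }'
```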

[NOTE]
You may see a certificate warning on the browser or from the curl output. This is a known issue in RHOAI 2.8.
It revolves around the KServe-Istio integration using self-signed certificates.






