Noeloc/ch01 - KServe documentation #27
@@ -1,4 +1,7 @@
* xref:index.adoc[]
** xref:section1.adoc[]
** xref:section2.adoc[]
** xref:section3.adoc[]
** xref:section4.adoc[]
** xref:section5.adoc[]
** xref:section6.adoc[]
@@ -0,0 +1,91 @@
= Text Generation Inference Server (TGIS)

The TGIS implementation in RHOAI is a fork of the HuggingFace Text Generation Inference repository.

More details about the reasons for the fork are available in the https://github.com/opendatahub-io/text-generation-inference[TGIS repo].

On RHOAI, TGIS is used as:

* The serving runtime for models using the Caikit framework.
* A standalone KServe runtime.

We're only going to focus on the TGIS-KServe use case here.

== TGIS Overview

TGIS is a model serving runtime written in Rust which serves _PyTorch_ models via a _gRPC_ interface. It only supports the https://huggingface.co/docs/safetensors/index[_SafeTensors_] model format.
It supports batching of requests as well as streaming responses for individual requests.
The gRPC interface definitions are available in the https://github.com/opendatahub-io/text-generation-inference/tree/main/proto[proto directory] of the repository.

[NOTE]
****
TGIS currently doesn't have an embeddings API, so embeddings (the numeric vector representations of text that NLP systems use for tasks such as semantic search and retrieval) have to be generated externally.

For example, this could be a call to an external _BERT_ model, or you could use a framework such as https://www.sbert.net/[Sentence Transformers] in your code.
****

== Serving an LLM via TGIS & KServe

In this example we're going to serve the https://huggingface.co/google-t5/t5-small[google-t5/t5-small] model.

* The TGIS runtime is installed by default, but ensure that it is enabled by going to _Settings -> Serving Runtimes_ in the RHOAI user interface.

image::tgis-enabled.png[TGIS Enabled]

* Download the _google-t5/t5-small_ model from HuggingFace by following the HuggingFace instructions for cloning models, as sketched below.

image::hgf-clone.png[Clone model]
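
A minimal sketch of the clone step, assuming `git` and Git LFS are installed locally:

```[bash]
# Enable Git LFS once per machine, then clone the model repository from HuggingFace
git lfs install
git clone https://huggingface.co/google-t5/t5-small
```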

* Upload the model directory to S3, for example with the AWS CLI as shown below.

image::t5-flan-upload.png[Model uploaded to S3]
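
A hedged sketch of the upload step; the bucket name is a placeholder and any other S3 client works equally well:

```[bash]
# Copy the cloned model directory to S3, skipping the Git metadata
aws s3 sync ./t5-small s3://<your-bucket>/t5-small --exclude ".git/*"
```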

* Configure the model runtime and resource limits.

image::t5-config.png[Model serving resources]

* Wait for the model to be deployed.

image::t5-deployed.jpg[Model Ready]
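
If you prefer the CLI to the dashboard, you can also watch the deployment through the _InferenceService_ resource; a sketch, with the project name as a placeholder:

```[bash]
# READY reports True once the predictor is up and the model has loaded
oc get inferenceservice -n <your-project>
```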

=== Using the model

The model is served using the gRPC protocol, so to test it we need to fulfil a number of prerequisites:

* Download and install https://github.com/fullstorydev/grpcurl[gRPCurl].

* Create a _proto_ directory on your laptop and download the TGIS protobuf definitions from the https://github.com/opendatahub-io/text-generation-inference/tree/main/proto[TGIS repo] into the _proto_ directory (see the sketch after this list).
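
A hedged sketch of these prerequisites; the gRPCurl version, InferenceService name, and namespace are examples, so adjust them to your environment:

```[bash]
# Download a gRPCurl release binary (check the releases page for the latest version)
curl -sSL https://github.com/fullstorydev/grpcurl/releases/download/v1.9.1/grpcurl_1.9.1_linux_x86_64.tar.gz | tar -xzf - grpcurl

# Fetch the TGIS protobuf definition into a local proto directory
mkdir -p proto
curl -sSL -o proto/generation.proto https://raw.githubusercontent.com/opendatahub-io/text-generation-inference/main/proto/generation.proto

# Look up the external URL of the deployed model to use as the grpcurl target
oc get inferenceservice t51 -n testproject1 -o jsonpath='{.status.url}'
```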

* Execute the following command to call the model:
```[bash]
grpcurl -proto proto/generation.proto -d '{"requests": [{"text": "generate a superhero name?"}]}' -H 'mm-model-id: flan-t5-small' -insecure t51-testproject1.apps...:443 fmaas.GenerationService/Generate
```
```[json]
{
  "responses": [
    {
      "generatedTokenCount": 6,
      "text": "samurai",
      "inputTokenCount": 7,
      "stopReason": "EOS_TOKEN"
    }
  ]
}
```
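
TGIS also exposes a streaming RPC on the same service. A hedged sketch, assuming the `GenerateStream` method and the single-request payload shape defined in `proto/generation.proto`; verify the method and field names against the file you downloaded:

```[bash]
# Stream the generated tokens back as they are produced
grpcurl -proto proto/generation.proto -d '{"request": {"text": "generate a superhero name?"}}' -H 'mm-model-id: flan-t5-small' -insecure t51-testproject1.apps...:443 fmaas.GenerationService/GenerateStream
```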

* Execute the following command to find out the details of the model being served:
```[bash]
grpcurl -proto proto/generation.proto -d '{"model_id": "flan-t5-small" }' -H 'mm-model-id: flan-t5-small' -insecure t51-testproject1.apps.....:443 fmaas.GenerationService/ModelInfo
```
```[json]
{
  "modelKind": "ENCODER_DECODER",
  "maxSequenceLength": 512,
  "maxNewTokens": 511
}
```

[NOTE]
For a Python-based example, see https://github.com/cfchase/basic-tgis[this repository].
@@ -0,0 +1,157 @@
= KServe Custom Serving Example

KServe can also serve custom-built runtimes. In this example we are going to serve a _large language model_ via a custom runtime based on the _llama.cpp_ project.

[sidebar]
.Llama.cpp - Running LLMs locally
****
https://github.com/ggerganov/llama.cpp[Llama.cpp] is an open source project which enables CPU- and GPU-based inferencing on quantized LLMs.
Llama.cpp uses the quantized _GGUF_ and _GGML_ model formats. Initially, llama.cpp was written to serve Llama models from Meta, but it has been extended to support other model architectures.

We're using llama.cpp here as it doesn't require a GPU to run and provides more features and a longer context length than the T5 models.
Llama.cpp also has a basic HTTP server component which enables us to invoke models for inferencing. In this example we are not using the inbuilt HTTP server; instead we use another OSS project named https://llama-cpp-python.readthedocs.io[llama-cpp-python], which provides an OpenAI-compatible web server.
****

[CAUTION]
This is an example of a custom runtime and should *NOT* be used in a production setting.

== Build the custom runtime

To build the runtime, the following _Containerfile_ downloads the _llama.cpp_ source code, compiles it, and packages it together with _llama-cpp-python_ into a container image.

```[docker]
FROM registry.access.redhat.com/ubi9/python-311

# Install the build tools and BLAS libraries needed to compile llama.cpp
USER 0
RUN dnf install -y git make g++ atlas-devel atlas openblas openblas-openmp
RUN mkdir -p /opt/llama.cpp && chmod 777 /opt/llama.cpp

# Build llama.cpp from source as the non-root default user
USER 1001
WORKDIR /opt
RUN git clone https://github.com/ggerganov/llama.cpp

WORKDIR /opt/llama.cpp
RUN make

# llama-cpp-python provides the OpenAI-compatible web server
RUN pip install llama-cpp-python
RUN pip install uvicorn anyio starlette fastapi sse_starlette starlette_context pydantic_settings

ENV MODELNAME=test
ENV MODELLOCATION=/tmp/models

## Set value to "--chat_format chatml" for prompt formats
## see https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama_chat_format.py
ENV CHAT_FORMAT=""
EXPOSE 8080

ENTRYPOINT python3 -m llama_cpp.server --model ${MODELLOCATION}/${MODELNAME} ${CHAT_FORMAT} --host 0.0.0.0 --port 8080
```

Use podman to build, tag, and push the image to a registry of your choosing, for example:
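
A minimal sketch, assuming the definition above is saved as `Containerfile` in the current directory; the image name is a placeholder:

```[bash]
# Build the runtime image and push it to your registry
podman build -t quay.io/<your-org>/llama-cpp-python:latest -f Containerfile .
podman push quay.io/<your-org>/llama-cpp-python:latest
```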

Use the following _ServingRuntime_ definition to configure the cluster via the RHOAI UI.

```[yaml]
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: llamacpp
  annotations:
    openshift.io/display-name: LLamaCPP
  labels:
    opendatahub.io/dashboard: "true"
spec:
  builtInAdapter:
    modelLoadingTimeoutMillis: 90000
  containers:
    - image: quay.io/noeloc/llama-cpp-python:latest
      name: kserve-container
      env:
        - name: MODELNAME
          value: "llama-2-7b-chat.Q4_K_M.gguf"
        - name: MODELLOCATION
          value: /mnt/models
        - name: CHAT_FORMAT
          value: ""
      volumeMounts:
        - name: shm
          mountPath: /dev/shm
      ports:
        # This must match the port the llama-cpp-python server listens on (see the Containerfile)
        - containerPort: 8080
          protocol: TCP
  volumes:
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 1Gi
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: gguf
```

image::llama-serving-runtime-create.png[Create Serving Runtime]

image::llama-serving-runtime-active.png[Serving Runtime Ready]

To test the model you will have to download a _GGUF_ model from https://huggingface.co/[HuggingFace].

In this example we're going to use the https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF[TheBloke/Llama-2-7B-Chat-GGUF] model, in particular the https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/blob/main/llama-2-7b-chat.Q4_K_M.gguf[Q4_K_M version]. Download the file and upload it to an _S3 bucket_, for example:
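
A hedged sketch of that step, fetching the file with `curl` and uploading it with the AWS CLI; the bucket name and prefix are placeholders:

```[bash]
# Download the quantised GGUF file (roughly 4 GB) and copy it to S3
curl -L -o llama-2-7b-chat.Q4_K_M.gguf https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
aws s3 cp llama-2-7b-chat.Q4_K_M.gguf s3://<your-bucket>/llama2/
```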

Then serve the model using the _llamacpp_ serving runtime.

image::llama-serving-model.png[Configuring llamacpp model serving]

=== Invoking the model

An OpenAPI UI is available on the _route_ that is generated, e.g. _https://llama2-chat-testproject1.apps.snoai.example.com/docs_.
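
The hostname comes from the _InferenceService_ that is created for the deployment. If you need to look it up from the CLI, a sketch (the name and namespace are placeholders inferred from the example URL):

```[bash]
# Print the external URL of the deployed model; append /docs for the OpenAPI UI
oc get inferenceservice llama2-chat -n testproject1 -o jsonpath='{.status.url}'
```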

_curl_ also works; for example:

```[bash]
curl -X 'POST' \
  'https://llama2-chat-testproject1.apps.snoai.example.com/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "prompt": "\n\n### Instructions:\nHow do you bake a cake?\n\n### Response:\n",
  "max_tokens": 500
}'
```

This provides the following output; you may be waiting a while depending on the CPU performance of your machine.

```[json]
{
  "id": "cmpl-b615c214-ea5a-47e4-89f6-cf2fb0487bb4",
  "object": "text_completion",
  "created": 1712761834,
  "model": "/mnt/models/llama-2-7b-chat.Q4_K_M.gguf",
  "choices": [
    {
      "text": "To bake a cake, first preheat the oven to 350 degrees Fahrenheit (175 degrees Celsius). Next, mix together the dry ingredients such as flour, sugar, and baking powder in a large bowl. Then, add in the wet ingredients like eggs, butter or oil, and milk, and mix until well combined. Pour the batter into a greased cake pan and bake for 25-30 minutes or until a toothpick inserted into the center of the cake comes out clean. Remove from the oven and let cool before frosting and decorating.\n### Additional information:\n* It is important to use high-quality ingredients when baking a cake, as this will result in a better taste and texture.\n* When measuring flour, it is best to spoon it into the measuring cup rather than scooping it directly from the bag, as this ensures accurate measurements.\n* It is important to mix the wet and dry ingredients separately before combining them, as this helps to create a smooth batter.\n* When baking a cake, it is best to use a thermometer to ensure that the oven temperature is correct, as overheating or underheating can affect the outcome of the cake.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 27,
    "completion_tokens": 288,
    "total_tokens": 315
  }
}
```
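
Because llama-cpp-python exposes an OpenAI-compatible server, the chat endpoint is also available. A hedged sketch; for chat-tuned models such as this one you may need to set the `CHAT_FORMAT` environment variable in the ServingRuntime so the prompt is formatted correctly:

```[bash]
curl -X 'POST' \
  'https://llama2-chat-testproject1.apps.snoai.example.com/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "messages": [{"role": "user", "content": "How do you bake a cake?"}],
  "max_tokens": 200
}'
```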

[NOTE]
You may see a certificate warning in the browser or from the curl output. This is a known issue in RHOAI 2.8; it revolves around the KServe-Istio integration using self-signed certificates.