Add install and build instructions, refactor docs structure #485

Merged

Changes from all commits (22 commits)
ef4a2e5 Add initial files with instructions (yatarkan, Jun 7, 2024)
6552010 Move md files with instructions to docs dir (yatarkan, Jun 7, 2024)
3707686 Update root and src readme files (yatarkan, Jun 7, 2024)
49df483 Add rest of supported models to table with architecture (yatarkan, Jun 7, 2024)
c8ef9bc Remove old list of supported models from src readme (yatarkan, Jun 7, 2024)
85314ea Use git clone recursive command (yatarkan, Jun 10, 2024)
4dae776 Add ubuntu24 command placeholder (yatarkan, Jun 10, 2024)
4fbb080 Switch from install instructions in repo md files to Selector Tool (yatarkan, Jun 10, 2024)
08d49a3 Move install python dependencies from build to usage instructions (yatarkan, Jun 10, 2024)
3f395fc Add build instructions for windows and macos (yatarkan, Jun 10, 2024)
87e992a Add links to text generation samples (yatarkan, Jun 11, 2024)
e6f2d33 Remove Intel CPU from title (yatarkan, Jun 11, 2024)
0e24888 Apply suggestions from code review (yatarkan, Jun 12, 2024)
02e6662 Fix title case (yatarkan, Jun 12, 2024)
0a39166 Merge branch 'releases/2024/2' into yt/build-instructions (yatarkan, Jun 12, 2024)
2c1d2ab Fix typos (yatarkan, Jun 12, 2024)
2182bb1 Update usage instructions (yatarkan, Jun 12, 2024)
0a07b44 Add links to python samples (yatarkan, Jun 12, 2024)
a936145 Move usage instructions to src readme (yatarkan, Jun 12, 2024)
bdb598f Merge branch 'releases/2024/2' into yt/build-instructions (yatarkan, Jun 13, 2024)
6656d7e Switch links to release branch in repo (yatarkan, Jun 13, 2024)
9369be7 Update src/docs/HOW_IT_WORKS.md (yatarkan, Jun 13, 2024)
51 changes: 37 additions & 14 deletions README.md
@@ -1,22 +1,39 @@
## GenAI Pipeline Repository
# OpenVINO™ GenAI

The GenAI repository contains pipelines that implement image and text generation tasks.
The implementation uses OpenVINO capabilities to optimize the pipelines. Each sample covers
a family of models and suggests certain modifications to adapt the code to specific needs.
It includes the following pipelines:
The OpenVINO™ GenAI repository consists of the GenAI library and additional GenAI samples.

1. [Benchmarking script for large language models](./llm_bench/python/)
2. [Text generation C++ samples that support most popular models like LLaMA 2](./text_generation/causal_lm/cpp/)
3. [Stable Diffusion (with LoRA) C++ image generation pipeline](./image_generation/stable_diffusion_1_5/cpp/)
4. [Latent Consistency Model (with LoRA) C++ image generation pipeline](./image_generation/lcm_dreamshaper_v7/cpp/)
## OpenVINO™ GenAI Library

### License
OpenVINO™ GenAI is a flavor of OpenVINO, aiming to simplify running inference of generative AI models.
It hides the complexity of the generation process and minimizes the amount of code required.

The GenAI repository is licensed under [Apache License Version 2.0](LICENSE).
By contributing to the project, you agree to the license and copyright terms therein and release
your contribution under these terms.
For installation and usage instructions, refer to the [GenAI Library README](./src/README.md).

## OpenVINO™ GenAI Samples

The OpenVINO™ GenAI repository contains pipelines that implement image and text generation tasks.
The implementation uses OpenVINO capabilities to optimize the pipelines. Each sample covers
a family of models and suggests certain modifications to adapt the code to specific needs.
It includes the following pipelines:

## Requirements
1. [Benchmarking script for large language models](./llm_bench/python/README.md)
2. Text generation samples that support most popular models like LLaMA 2:
- Python:
1. [beam_search_causal_lm](./samples/python/beam_search_causal_lm/README.md)
2. [chat_sample](./samples/python/chat_sample/README.md)
3. [greedy_causal_lm](./samples/python/greedy_causal_lm/README.md)
4. [multinomial_causal_lm](./samples/python/multinomial_causal_lm/README.md)
- C++:
1. [beam_search_causal_lm](./samples/cpp/beam_search_causal_lm/README.md)
2. [chat_sample](./samples/cpp/chat_sample/README.md)
3. [greedy_causal_lm](./samples/cpp/greedy_causal_lm/README.md)
4. [multinomial_causal_lm](./samples/cpp/multinomial_causal_lm/README.md)
5. [prompt_lookup_decoding_lm](./samples/cpp/prompt_lookup_decoding_lm/README.md)
6. [speculative_decoding_lm](./samples/cpp/speculative_decoding_lm/README.md)
3. [Stable Diffusion (with LoRA) C++ image generation pipeline](./image_generation/stable_diffusion_1_5/cpp/README.md)
4. [Latent Consistency Model (with LoRA) C++ image generation pipeline](./image_generation/lcm_dreamshaper_v7/cpp/README.md)

### Requirements

Requirements may vary for different samples. See respective readme files for more details,
and make sure to install the OpenVINO version listed there. Refer to documentation to see
@@ -25,3 +42,9 @@ and make sure to install the OpenVINO version listed there. Refer to documentation
The supported devices are CPU and GPU including Intel discrete GPU.

See also: https://docs.openvino.ai/2023.3/gen_ai_guide.html.

## License

The OpenVINO™ GenAI repository is licensed under [Apache License Version 2.0](LICENSE).
By contributing to the project, you agree to the license and copyright terms therein and release
your contribution under these terms.
161 changes: 57 additions & 104 deletions src/README.md
Comment thread on src/README.md:

Contributor: if we drop main content from this file, we will have an almost empty page on pypi. Is that OK?

Collaborator: Yes. But the links should be updated to refer to https://github.com/openvinotoolkit/openvino.genai

Contributor: this README describes the repo, not the package.

Collaborator: It doesn't seem like that to me. Anyway, what should this file contain?
@@ -1,30 +1,56 @@
# OpenVINO Generate API
# OpenVINO™ GenAI Library

## Usage
OpenVINO™ GenAI is a flavor of OpenVINO™, aiming to simplify running inference of generative AI models.
It hides the complexity of the generation process and minimizes the amount of code required.

First of all you need to convert your model with optimum-cli
``` sh
optimum-cli export openvino --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --trust-remote-code "TinyLlama-1.1B-Chat-v1.0"
pip install openvino-genai
```
## Install OpenVINO™ GenAI

The OpenVINO™ GenAI flavor is available for installation via Archive and PyPI distributions.
To install OpenVINO™ GenAI, refer to the [Install Guide](https://docs.openvino.ai/2024/get-started/install-openvino.html).

To build OpenVINO™ GenAI library from source, refer to the [Build Instructions](https://github.com/openvinotoolkit/openvino.genai/tree/releases/2024/2/src/docs/BUILD.md).

## Usage

### Prerequisites

1. Installed OpenVINO™ GenAI

> If OpenVINO GenAI is installed via archive distribution or built from source, you will need to install additional python dependencies (e.g. `optimum-cli` for simplified model downloading and exporting):
>
> ```sh
> # (Optional) Clone OpenVINO GenAI repository if it does not exist
> git clone --recursive https://github.com/openvinotoolkit/openvino.genai.git
> cd openvino.genai
> # Install python dependencies
> python -m pip install ./thirdparty/openvino_tokenizers/[transformers] --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/pre-release
> python -m pip install --upgrade-strategy eager -r ./samples/cpp/requirements.txt
> ```

2. A model in OpenVINO IR format

Download and convert a model with `optimum-cli`:
``` sh
optimum-cli export openvino --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --trust-remote-code "TinyLlama-1.1B-Chat-v1.0"
```

`LLMPipeline` is the main object used for decoding. You can construct it straight away from the folder with the converted model. It will automatically load the main model, tokenizer, detokenizer and default generation configuration.

### Python

A minimalist example:
A simple example:
```python
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(model_path, "CPU")
print(pipe.generate("The Sun is yellow bacause"))
print(pipe.generate("The Sun is yellow because"))
```

Calling generate with custom generation config parameters, e.g. config for grouped beam search
Calling generate with custom generation config parameters, e.g. config for grouped beam search:
```python
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(model_path, "CPU")

result = pipe.generate("The Sun is yellow bacause", max_new_tokens=30, num_groups=3, group_size=5, diversity_penalty=1.5)
result = pipe.generate("The Sun is yellow because", max_new_tokens=30, num_groups=3, group_size=5, diversity_penalty=1.5)
print(result)
```

@@ -33,7 +59,7 @@ output:
'it is made up of carbon atoms. The carbon atoms are arranged in a linear pattern, which gives the yellow color. The arrangement of carbon atoms in'
```

A simples chat in python:
A simple chat in Python:
```python
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(model_path)
@@ -55,19 +81,19 @@ Test to compare with Huggingface outputs

### C++

Minimalistc example
A simple example:
```cpp
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>

int main(int argc, char* argv[]) {
std::string model_path = argv[1];
ov::genai::LLMPipeline pipe(model_path, "CPU");
std::cout << pipe.generate("The Sun is yellow bacause");
std::cout << pipe.generate("The Sun is yellow because");
}
```

Using Group Beam Search Decoding
Using group beam search decoding:
```cpp
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
@@ -82,12 +108,12 @@ int main(int argc, char* argv[]) {
config.group_size = 5;
config.diversity_penalty = 1.0f;

std::cout << pipe.generate("The Sun is yellow bacause", config);
std::cout << pipe.generate("The Sun is yellow because", config);
}
```

A simple chat in C++ using grouped beam search decoding
``` cpp
A simple chat in C++ using grouped beam search decoding:
```cpp
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>

@@ -118,8 +144,8 @@ }
}
```

Streaming example with lambda function
``` cpp
Streaming example with lambda function:
```cpp
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>

@@ -128,12 +154,12 @@ int main(int argc, char* argv[]) {
ov::genai::LLMPipeline pipe(model_path, "CPU");

auto streamer = [](std::string word) { std::cout << word << std::flush; };
std::cout << pipe.generate("The Sun is yellow bacause", streamer);
std::cout << pipe.generate("The Sun is yellow because", streamer);
}
```
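
A similar callback-based streamer also works from Python. The following is a minimal sketch, assuming the `openvino_genai` Python binding accepts a callable `streamer` that receives decoded text chunks, as the chat samples do; `model_path` is the folder with the converted model:
```python
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline(model_path, "CPU")

# The callback receives decoded text chunks as they are produced.
# Returning False tells the pipeline to continue generating.
def streamer(subword):
    print(subword, end="", flush=True)
    return False

pipe.generate("The Sun is yellow because", streamer=streamer, max_new_tokens=100)
```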

Streaming with a custom class
``` cpp
Streaming with a custom class:
```cpp
#include "openvino/genai/streamer_base.hpp"
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
@@ -160,87 +186,14 @@

std::string model_path = argv[1];
ov::genai::LLMPipeline pipe(model_path, "CPU");
std::cout << pipe.generate("The Sun is yellow bacause", custom_streamer);
std::cout << pipe.generate("The Sun is yellow because", custom_streamer);
}
```

## How it works

### Stateful LLM

A common LLM inference optimization is the introduction of a past KV (key/value) cache. This cache is represented by the corresponding inputs and outputs in a model originally implemented in a DL framework (e.g. PyTorch models from HuggingFace). To optimize it further and simplify usage, the model is transformed to a stateful form. This transformation improves inference performance and decreases the amount of runtime memory allocated in long-running text generation scenarios. It is achieved by hiding the inputs and outputs of the model that represent past KV-cache tensors and handling them inside the model in a more efficient way. The cache remains accessible through the state API. This contrasts with the stateless model approach, which requires manipulating these inputs and outputs explicitly. An introduction to stateful models can be found at https://docs.openvino.ai/2023.3/openvino_docs_OV_UG_stateful_models_intro.html.

Hiding the KV cache introduces a peculiarity for the beam search algorithm. Beam search requires batched inference of multiple beams. The design described so far would result in generating multiple independent sequences of tokens. The beam search algorithm, on the other hand, requires removing some of the ongoing beams and splitting other beams into multiple branches. Beam removal requires deleting the corresponding KV-cache entry, and beam splitting requires copying the corresponding KV-cache values.

To make beam search possible without accessing the model's internal state, a stateful LLM converted with `optimum-intel` or [llm_bench](../../../llm_bench/python/) introduces an additional 1-dimensional `beam_idx` input. `beam_idx` must contain the indexes of the batch elements to be selected, and it evolves during each beam search iteration. There is only one beam when generation starts; it corresponds to the initial prompt. To keep the initial beam and introduce a copy of it, `beam_idx` must contain `[0, 0]`. Dynamic batch size makes it possible to change the number of beams on the fly. To remove the zeroth sequence and keep only the second beam, `beam_idx` must contain `[1]`.

Assume there are two running beams. To proceed with generating both beams at the next iteration, `beam_idx` values must be `[0, 1]`, pointing to batch elements `0` and `1`. To drop the last beam and split the other beam in two, `beam_idx` must be set to `[0, 0]`. This results in utilizing only the part of the KV cache corresponding to the zeroth element in the batch. The process of selecting the proper entries in the cache is called Cache Reorder.
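
To make Cache Reorder concrete, here is a minimal NumPy sketch of the gather that `beam_idx` drives. It is purely illustrative: the real cache lives inside the model's state and is reordered by the runtime, and the tensor shape below is an assumption:
```python
import numpy as np

# Illustrative KV-cache tensor for one attention layer:
# [batch (beams), heads, seq_len, head_dim]
kv_cache = np.random.rand(2, 4, 10, 64)

# Fork beam 0 into two branches and drop beam 1:
beam_idx = np.array([0, 0])
kv_cache = kv_cache[beam_idx]  # batch rows are gathered by index

# Keep both running beams for the next iteration:
beam_idx = np.array([0, 1])
kv_cache = kv_cache[beam_idx]  # identity gather, both beams continue
```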

![](beam_idx-fork.gif)
![](beam_idx-drop.gif)

The images below represent stateless and stateful LLM pipelines. The model has 4 inputs:
1. `input_ids` contains the next selected token
2. `attention_mask` is filled with `1`
3. `position_ids` encodes the position of the currently generated token in the sequence
4. `beam_idx` selects beams

The model has a single output, `logits`, describing the predicted distribution over the next tokens. In addition, there is the KV-cache state.

![](stateless.jpg)
![](stateful.jpg)
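
Putting the four inputs together, a bare-bones greedy decoding loop against such a stateful model might look as follows. This is a sketch only: it assumes an IR exported by `optimum-intel` with exactly the tensor names listed above, placeholder token IDs, and guessed dtypes; real code should tokenize the prompt and reset the request state between generations:
```python
import numpy as np
import openvino as ov

core = ov.Core()
compiled = core.compile_model("TinyLlama-1.1B-Chat-v1.0/openvino_model.xml", "CPU")
request = compiled.create_infer_request()

input_ids = np.array([[1, 450, 8991, 338]], dtype=np.int64)  # placeholder prompt tokens
attention_mask = np.ones_like(input_ids)
position_ids = np.arange(input_ids.shape[1], dtype=np.int64)[None, :]
beam_idx = np.array([0], dtype=np.int32)  # a single greedy "beam"

for _ in range(10):
    logits = request.infer({
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "position_ids": position_ids,
        "beam_idx": beam_idx,
    })["logits"]
    next_token = int(logits[0, -1].argmax())
    # The hidden KV cache keeps the past, so only the new token is fed on later steps.
    input_ids = np.array([[next_token]], dtype=np.int64)
    attention_mask = np.concatenate([attention_mask, [[1]]], axis=1)
    position_ids = position_ids[:, -1:] + 1
```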

## Supported models

1. chatglm
1. https://huggingface.co/THUDM/chatglm2-6b - refer to
[chatglm2-6b - AttributeError: can't set attribute](../../../llm_bench/python/doc/NOTES.md#chatglm2-6b---attributeerror-cant-set-attribute)
in case of `AttributeError`
2. https://huggingface.co/THUDM/chatglm3-6b
2. LLaMA 2 (requires access request submission on its Hugging Face page to be downloaded)
1. https://huggingface.co/meta-llama/Llama-2-13b-chat-hf
2. https://huggingface.co/meta-llama/Llama-2-13b-hf
3. https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
4. https://huggingface.co/meta-llama/Llama-2-7b-hf
5. https://huggingface.co/meta-llama/Llama-2-70b-chat-hf
6. https://huggingface.co/meta-llama/Llama-2-70b-hf
3. [Llama2-7b-WhoIsHarryPotter](https://huggingface.co/microsoft/Llama2-7b-WhoIsHarryPotter)
4. OpenLLaMA
1. https://huggingface.co/openlm-research/open_llama_13b
2. https://huggingface.co/openlm-research/open_llama_3b
3. https://huggingface.co/openlm-research/open_llama_3b_v2
4. https://huggingface.co/openlm-research/open_llama_7b
5. https://huggingface.co/openlm-research/open_llama_7b_v2
5. [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
6. Qwen
1. https://huggingface.co/Qwen/Qwen-7B-Chat
2. https://huggingface.co/Qwen/Qwen-7B-Chat-Int4 - refer to
[Qwen-7B-Chat-Int4 - Torch not compiled with CUDA enabled](../../../llm_bench/python/doc/NOTES.md#qwen-7b-chat-int4---torch-not-compiled-with-cuda-enabled)
in case of `AssertionError`
3. https://huggingface.co/Qwen/Qwen1.5-7B-Chat
4. https://huggingface.co/Qwen/Qwen1.5-7B-Chat-GPTQ-Int4
7. Dolly
1. https://huggingface.co/databricks/dolly-v2-3b
8. Phi
1. https://huggingface.co/microsoft/phi-2
2. https://huggingface.co/microsoft/phi-1_5
9. [notus-7b-v1](https://huggingface.co/argilla/notus-7b-v1)
10. [zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta)
11. [redpajama-3b-chat](https://huggingface.co/ikala/redpajama-3b-chat)
12. [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
13. [Gemma-2B-it](https://huggingface.co/google/gemma-2b-it)

This pipeline can work with other similar topologies produced by `optimum-intel` with the same model signature. After conversion, the model is required to have the following inputs:
1. `input_ids` contains the tokens
2. `attention_mask` is filled with `1`
3. `beam_idx` selects beams
4. `position_ids` (optional) encodes the position of the currently generated token in the sequence

and a single `logits` output; a quick way to check this signature is sketched below.
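
As a quick signature check, one can inspect the tensor names of the converted IR. A small sketch, with a hypothetical model path:
```python
import openvino as ov

model = ov.Core().read_model("TinyLlama-1.1B-Chat-v1.0/openvino_model.xml")

input_names = {name for port in model.inputs for name in port.get_names()}
output_names = {name for port in model.outputs for name in port.get_names()}

assert {"input_ids", "attention_mask", "beam_idx"} <= input_names
assert "logits" in output_names
print("inputs:", sorted(input_names))  # position_ids is optional
```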

Some models may require access request submission on their Hugging Face page to be downloaded.

If https://huggingface.co/ is down, the conversion step won't be able to download the models.

> [!NOTE]
> Models should belong to the same family and have the same tokenizers.
## How It Works

For information on how OpenVINO™ GenAI works, refer to the [How It Works Section](https://github.com/openvinotoolkit/openvino.genai/tree/releases/2024/2/src/docs/HOW_IT_WORKS.md).

## Supported Models

For a list of supported models, refer to the [Supported Models Section](https://github.com/openvinotoolkit/openvino.genai/tree/releases/2024/2/src/docs/SUPPORTED_MODELS.md).