diff --git a/examples/phoenix/README.md b/examples/phoenix/README.md
index 636205f1..c13c7186 100644
--- a/examples/phoenix/README.md
+++ b/examples/phoenix/README.md
@@ -79,8 +79,6 @@ serving =
   )
 ```

-Also see the [Llama example](../../notebooks/llama.livemd) for more option combinations.
-
 ### User images

 When working with user-given images, the most trivial approach would be to just upload an image as is, in a format like PNG or JPEG. However, this approach has two downsides:
diff --git a/mix.exs b/mix.exs
index 61c98589..142a8886 100644
--- a/mix.exs
+++ b/mix.exs
@@ -61,7 +61,7 @@ defmodule Bumblebee.MixProject do
       extras: [
         "notebooks/examples.livemd",
         "notebooks/stable_diffusion.livemd",
-        "notebooks/llama.livemd",
+        "notebooks/llms.livemd",
         "notebooks/fine_tuning.livemd"
       ],
       extra_section: "GUIDES",
diff --git a/notebooks/llama.livemd b/notebooks/llms.livemd
similarity index 52%
rename from notebooks/llama.livemd
rename to notebooks/llms.livemd
index 8ab9b6f3..363336c1 100644
--- a/notebooks/llama.livemd
+++ b/notebooks/llms.livemd
@@ -1,4 +1,4 @@
-# Llama
+# LLMs

 ```elixir
 Mix.install([
@@ -13,13 +13,19 @@ Nx.global_default_backend({EXLA.Backend, client: :host})

 ## Introduction

-In this notebook we look at running [Meta's Llama](https://ai.meta.com/llama/) model, specifically Llama 2, one of the most powerful open source Large Language Models (LLMs).
+In this notebook we outline the general setup for running a Large Language Model (LLM).
+
+
+
+## Llama 2
+
+In this section we look at running [Meta's Llama](https://ai.meta.com/llama/) model, specifically Llama 2, one of the most powerful open source Large Language Models (LLMs).

-> **Note:** this is a very involved model, so the generation can take a long time if you run it on a CPU. Also, running on the GPU currently requires at least 16GB of VRAM, though at least 30GB is recommended for optimal runtime.
+> **Note:** this is a very involved model, so the generation can take a long time if you run it on a CPU. Also, running on the GPU currently requires at least 16.3GiB of VRAM.

-## Text generation
+

 In order to load Llama 2, you need to ask for access on [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). Once you are granted access, generate a [HuggingFace auth token](https://huggingface.co/settings/tokens) and put it in a `HF_TOKEN` Livebook secret.

@@ -29,10 +35,7 @@ Let's load the model and create a serving for text generation:
 hf_token = System.fetch_env!("LB_HF_TOKEN")
 repo = {:hf, "meta-llama/Llama-2-7b-chat-hf", auth_token: hf_token}

-# Option 1
-# {:ok, model_info} = Bumblebee.load_model(repo, type: :bf16, backend: EXLA.Backend)
-# Option 2 and 3
-{:ok, model_info} = Bumblebee.load_model(repo, type: :bf16)
+{:ok, model_info} = Bumblebee.load_model(repo, type: :bf16, backend: EXLA.Backend)
 {:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
 {:ok, generation_config} = Bumblebee.load_generation_config(repo)

@@ -50,25 +53,16 @@ serving =
   Bumblebee.Text.generation(model_info, tokenizer, generation_config,
     compile: [batch_size: 1, sequence_length: 1028],
     stream: true,
-    # Option 1 and 2
-    # defn_options: [compiler: EXLA]
-    # Option 3
-    defn_options: [compiler: EXLA, lazy_transfers: :always]
+    defn_options: [compiler: EXLA]
   )

 # Should be supervised
 Kino.start_child({Nx.Serving, name: Llama, serving: serving})
 ```

-We adjust the generation config to use a non-deterministic generation strategy. The most interesting part, though, is the combination of serving options. 
-
-First, note that in the Setup cell we set the default backend to `{EXLA.Backend, client: :host}`, which means that by default we load the parameters onto CPU. There are a couple combinations of options related to parameters, trading off memory usage for speed:
+Note that we load the parameters directly onto the GPU with `Bumblebee.load_model(..., backend: EXLA.Backend)`, and with `defn_options: [compiler: EXLA]` we tell the serving to compile and run computations on the GPU as well.

-1. `Bumblebee.load_model(..., backend: EXLA.Backend)`, `defn_options: [compiler: EXLA]` - load all parameters directly onto the GPU. This requires the most memory, but it should provide the fastest inference time. In case you are using multiple GPUs (and a partitioned serving), you still want to load the parameters onto the CPU first and instead use `preallocate_params: true`, so that the parameters are copied onto each of them.
-
-2. `defn_options: [compiler: EXLA]` - copy all parameters to the GPU before each computation and discard afterwards (or more specifically, when no longer needed in the computation). This requires less memory, but the copying increases the inference time.
-
-3. `defn_options: [compiler: EXLA, lazy_transfers: :always]` - lazily copy parameters to the GPU during the computation as needed. This requires the least memory, at the cost of inference time.
+We adjust the generation config to use a non-deterministic generation strategy, so that the model is able to produce a slightly different output every time.

 As for the other options, we specify `:compile` with fixed shapes, so that the model is compiled only once and inputs are always padded to match these shapes. We also enable `:stream` to receive text chunks as the generation is progressing.

@@ -89,3 +83,47 @@ If a question does not make any sense, or is not factually coherent, explain why

 Nx.Serving.batched_run(Llama, prompt) |> Enum.each(&IO.write/1)
 ```
+
+
+
+## Mistral
+
+We can easily test other LLMs; we just need to change the repository and possibly adjust the prompt template. In this example we run the [Mistral](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model.
+
+```elixir
+repo = {:hf, "mistralai/Mistral-7B-Instruct-v0.2"}
+
+{:ok, model_info} = Bumblebee.load_model(repo, type: :bf16, backend: EXLA.Backend)
+{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
+{:ok, generation_config} = Bumblebee.load_generation_config(repo)
+
+:ok
+```
+
+```elixir
+generation_config =
+  Bumblebee.configure(generation_config,
+    max_new_tokens: 256,
+    strategy: %{type: :multinomial_sampling, top_p: 0.6}
+  )
+
+serving =
+  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
+    compile: [batch_size: 1, sequence_length: 1028],
+    stream: true,
+    defn_options: [compiler: EXLA]
+  )
+
+# Should be supervised
+Kino.start_child({Nx.Serving, name: Mistral, serving: serving})
+```
+
+```elixir
+prompt = """
+[INST] What is your favourite condiment? [/INST]
+Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!
+[INST] Do you have mayonnaise recipes? 
[/INST]\ +""" + +Nx.Serving.batched_run(Mistral, prompt) |> Enum.each(&IO.write/1) +``` diff --git a/notebooks/stable_diffusion.livemd b/notebooks/stable_diffusion.livemd index f742e32c..52c41ffa 100644 --- a/notebooks/stable_diffusion.livemd +++ b/notebooks/stable_diffusion.livemd @@ -17,7 +17,7 @@ Stable Diffusion is a latent text-to-image diffusion model, primarily used to ge -> **Note:** Stable Diffusion is a very involved model, so the generation can take a long time if you run it on a CPU. Also, running on the GPU currently requires at least 5GB of VRAM (or 3GB with lower speed, see below). +> **Note:** Stable Diffusion is a very involved model, so the generation can take a long time if you run it on a CPU. Also, running on the GPU currently requires at least 5GiB of VRAM (or 3GiB with lower speed, see below). @@ -27,7 +27,7 @@ Stable Diffusion is composed of several separate models and preprocessors, so we ```elixir repo_id = "CompVis/stable-diffusion-v1-4" -opts = [params_variant: "fp16", type: :bf16] +opts = [params_variant: "fp16", type: :bf16, backend: EXLA.Backend] {:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/clip-vit-large-patch14"}) {:ok, clip} = Bumblebee.load_model({:hf, repo_id, subdir: "text_encoder"}, opts) @@ -55,9 +55,9 @@ serving = safety_checker_featurizer: featurizer, compile: [batch_size: 1, sequence_length: 60], # Option 1 - preallocate_params: true, defn_options: [compiler: EXLA] # Option 2 (reduces GPU usage, but runs noticeably slower) + # Also remove `backend: EXLA.Backend` from the loading options above # defn_options: [compiler: EXLA, lazy_transfers: :always] )
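
If the GPU cannot hold all of the parameters, the notebooks also mention a slower, lower-memory alternative: keep the parameters on the CPU (the `{EXLA.Backend, client: :host}` default from the Setup cell) and pass `lazy_transfers: :always` in the compiler options, as the second option in the Stable Diffusion notebook does. Below is a minimal sketch of that variant for the Llama serving, assuming the same `repo`, `tokenizer`, and `generation_config` as in the Llama 2 section:

```elixir
# Lower-memory variant: parameters stay on the CPU backend set in the Setup cell
# and EXLA copies them to the GPU lazily as the computation needs them.
{:ok, model_info} = Bumblebee.load_model(repo, type: :bf16)

serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 1028],
    stream: true,
    defn_options: [compiler: EXLA, lazy_transfers: :always]
  )

# Should be supervised
Kino.start_child({Nx.Serving, name: Llama, serving: serving})
```

This trades inference speed for a smaller VRAM footprint, since parameters are transferred only when they are needed during the computation.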