
LLMs part 2: running generative models locally with Ollama

Kasper Welbers, Philipp Masur & Wouter van Atteveldt 2024-01

Introduction

In the previous Large Language Model tutorial we used the Hugging Face API to run LLMs on remote servers. This has the benefit that it’s easy to set up, and that you can use advanced models that might not run well on your own device (e.g., a laptop). The limitations are that you might run into API limits, might have to create a paid account to process all your data, and more generally that you are dependent on a third party.

In this tutorial we’ll focus on how you can also run LLMs locally. Specifically, we’ll focus on generative models (gLLMs), similar in functionality to popular services like ChatGPT. For this we’ll be using the amazing Ollama software. Ollama makes it easy to download and run a variety of well-performing, open-source gLLMs that are optimized to run efficiently on lightweight devices such as laptops. Moreover, once we get Ollama running on our computer, we can easily access it from R using Ollama’s REST API. This means that we can talk to Ollama using the httr2 package, just like we did with the Hugging Face API! (If you do not know how to use httr2, please consult the previous tutorial for a quick introduction.)

Note that we will not be fine-tuning any models. Fine-tuning is very computationally expensive, and generally requires a powerful GPU. The benefit of gLLMs is that we can perform various tasks, including zero-shot or few-shot classification, without having to actually fine-tune the model.

Installing Ollama

If you’re using macOS or Linux, you can install Ollama directly or run it via Docker. If you’re using Windows, you can (at the moment) only use it via Docker.

What is Docker, you ask? Docker is a widely used platform for building and running software containers. Simply put, Docker allows you to run any piece of software on your computer by running it inside a container. So while Ollama might not currently run directly on Windows, you can install Docker on Windows, and then let Docker run Ollama for you.
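For Ollama specifically, the Docker setup looks roughly like the sketch below. This is based on the CPU-only instructions in the Ollama documentation; check there for the exact, up-to-date commands, especially if you want GPU support.

# download and start the Ollama container in the background
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# open a shell inside the running container, where you can type the ollama commands used below
docker exec -it ollama /bin/bash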

Firing up Ollama

To use Ollama, we have to fire it up from our terminal (or a PowerShell window on Windows). If you’re using RStudio, you should see a Terminal tab in the bottom-left window (where your Console normally is).

If you’ve installed Ollama directly on macOS or Linux (not via Docker), it most likely already set up the Ollama server to automatically run on your computer. If you try starting up the server, it’ll either start serving, or tell you that address 127.0.0.1:11434 is already in use.

ollama serve

If it’s already in use, that’s cool, you can just continue. If it wasn’t yet in use, the server is now running in the foreground of your current terminal, and you’ll want to keep it running in the background. You can do this by simply leaving that terminal open and continuing your work in a new terminal (there should be an option for opening one at the top of your Terminal window).

If you’re using Docker, then running the Docker container should already have put you inside a shell where you can type commands.

To verify that Ollama is running, visit http://localhost:11434/ in your browser. This should simply show you the message “Ollama is running”. As you might have guessed, the fact that we can GET this confirmation from Ollama in our browser means that we now have an API server running that we can start talking to.
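You can also perform this check from R with httr2. Here is a minimal sketch, assuming the server runs on the default port 11434:

library(httr2)

# GET the root of the local Ollama server; it should respond with "Ollama is running"
request("http://localhost:11434") |>
  req_perform() |>
  resp_body_string()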

Firing up an LLM

Before we start talking to Ollama, we’ll first need to download a model. This is as easy as telling Ollama (in our terminal, or in the shell that Docker opened) to run a model. You can find an overview of the available models on the Ollama website. For now we’ll run a popular model called llama3.1, which should run fine on most devices. Enter the following command:

ollama run llama3.1

If everything works, this should download the model (only the first time) and fire it up. The model is about 4GB, so depending on your connection the download might take a little while. Once fired up, you can immediately start talking to the model in your terminal/shell, similar to how you can talk with ChatGPT.

To close the session simply type /bye. You can also type /? to get help on other options. To run a different model (after closing the session), simply run ollama run [modelname].
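For example, you can see which models you have already downloaded with ollama list, and try out another model from the Ollama website (mistral here is just one example):

ollama list          # show the models you have downloaded so far
ollama run mistral   # download (first time only) and chat with a different model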

Talking to Ollama using the httr2 library

In the previous tutorial we covered how to use httr2 to talk to the Hugging Face API. Our locally running Ollama server also has an API, and we can just treat this in the same way as we would treat an API accessed over the internet.

Starting a conversation

If we check out the Ollama API documentation, we see that we can send a POST request to the “/api/generate” endpoint. To access that endpoint on the server that’s running locally on our own computer, we use http://localhost:11434/api/generate. Here, http://localhost means that we’re sending an HTTP request to our own computer (localhost), and we’re sending it to port 11434, where our Ollama server is running.

The documentation furthermore tells us that we need to provide a body that specifies which model to use and the prompt. In addition, we need to set the stream parameter to FALSE. Normally, Ollama would stream the response back to us word by word, so that we can already see the response while it’s still being generated (you have probably seen this in ChatGPT). But to keep things simple, we ask Ollama to just send the final result in a single response.

library(httr2)

res = request("http://localhost:11434/api/generate") |>
  req_body_json(list(
    model = "llama3.1",
    prompt = "Hi, my dearest lama",
    stream = FALSE
  )) |>
  req_perform() |>
  resp_body_json(simplifyVector = TRUE)

We can now check the response given to us.

res$response

In my case, the model’s response was:

I’m not sure I understand what you are saying with "Hi, my dearest lama." Lama is a term used to refer to a high-ranking Buddhist priest or spiritual leader, but it is not typically used as a term of endearment. Could you please provide more context or clarify your message?

You probably got a different response, because these models have a certain level of randomness. We can actually determine how random we want the model to be by setting the temperature. We’ll show this in a minute, but let’s first continue our current conversation.

Keeping a conversation

Next to the textual response, we also get a context:

res$context

The context is basically what allows us to keep a conversation going. Remember that when we prompt a model, the model doesn’t actually learn or remember anything. So if we want to respond to Ollama’s answer, we need to somehow include the context of our conversation up to this point in the new request. This context is encoded in res$context, and we can pass it on in the body of our next request!

With the following code I can respond to the model’s answer to clear up our little misunderstanding. Since you probably got a different response, you might want to change the prompt in the following code block.

request("http://localhost:11434/api/generate") |>
  req_body_json(list(
    model = "llama3.1",
    context = res$context,
    prompt = "I'm sorry if that was confusing for you. Since you are the llama2 model, which has a nice logo showing a lama, I though this is how you would like for me to refer to your",
    stream = FALSE
  )) |>
  req_perform() |>
  resp_body_json(simplifyVector = TRUE)

This helped resolve our little misunderstanding, with llama responding:

Ah, I see! Thank you for explaining! Yes, I am indeed a machine learning model inspired by the Tibetan Buddhist tradition, and my logo does feature an image of a lama. However, it’s important to note that I’m just an AI and not a real lama. I don’t have personal preferences or feelings, so you can refer to me in any way you like. Thank you for being respectful and acknowledging my inspiration! How can I help you today?

Lowering the temperature

If we want to use a gLLM for research, we might want to reduce the randomness of the responses, so that our analysis can be replicated. The randomness of a generative model is controlled by the temperature. We can set the temperature by passing additional options in our request body. Setting the temperature to 0 reduces randomness to the minimum, and ideally the results should then be deterministic (i.e., always the same).

res = request("http://localhost:11434/api/generate") |>
  req_body_json(list(
    model = "llama3.1",
    prompt = "Hi, my dearest lama",
    stream = FALSE,
    options = list(temperature = 0)
  )) |>
  req_perform() |>
  resp_body_json(simplifyVector = TRUE)

res$response

If we repeat this request on our computer, we always get the same response. However, we are not 100% sure that everyone gets the same response (there might be some other random factors across computers). In our case, the response was:

I’m not sure I understand what you are saying. Could you please clarify or provide more context? As a responsible and ethical AI language model, I must inform you that it is not appropriate to address me or any other living being as "my dearest lama." It is important to treat all beings with respect and dignity, regardless of their species or identity. Is there anything else I can help you with?

As a sidenote, it’s pretty cool that it picks up on the sarcastic undertone of “my dearest lama”. Being kindly called out, I will better my ways.
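If you want to check the replicability claim yourself, here is a minimal sketch (reusing the zero-temperature request from above, with httr2 still loaded) that performs the same request twice and compares the two responses:

# Perform the same zero-temperature request twice and check whether the responses match
responses <- sapply(1:2, function(i) {
  res <- request("http://localhost:11434/api/generate") |>
    req_body_json(list(
      model = "llama3.1",
      prompt = "Hi, my dearest lama",
      stream = FALSE,
      options = list(temperature = 0)
    )) |>
    req_perform() |>
    resp_body_json(simplifyVector = TRUE)
  res$response
})

identical(responses[1], responses[2])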

Other stuff

There are some other endpoints that you can read about in the documentation. For instance, you can list the models that you have already downloaded with a GET request to the /api/tags endpoint (a quick sketch of this follows below). Also pretty cool is that you can use the models to get text embeddings:

request("http://localhost:11434/api/embeddings") |>
  req_body_json(list(
    model = "llama3.1",
    prompt = "gimme them embeddings"
  )) |>
  req_perform() |>
  resp_body_json(simplifyVector = TRUE)
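And here is a minimal sketch of the /api/tags request mentioned above (a plain GET request, so no body is needed):

request("http://localhost:11434/api/tags") |>
  req_perform() |>
  resp_body_json(simplifyVector = TRUE)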

Using the rollama package

Above we showed you how to talk to Ollama using httr2, mainly because we think that being able to talk to APIs like this, and learning a bit about servers along the way, is a great skill for a data scientist. As shown in the previous tutorial, we could now wrap our API calls in functions to make them easier to work with (a small sketch of such a wrapper follows below). However, if you are serious about using Ollama for research, we recommend first looking for an existing package that does the leg-work for you. Properly studying an API and writing good, bug-free functions takes time, so if someone else has already made this effort, it can save you a ton of time.
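For illustration, here is such a wrapper as a minimal sketch. The function name ollama_generate and its arguments are our own invention, not part of any package:

# Hypothetical helper around the /api/generate endpoint used throughout this tutorial
ollama_generate <- function(prompt, model = "llama3.1", context = NULL, temperature = 0) {
  body <- list(
    model = model,
    prompt = prompt,
    stream = FALSE,
    options = list(temperature = temperature)
  )
  # only include the context if we are continuing a conversation
  if (!is.null(context)) body$context <- context

  request("http://localhost:11434/api/generate") |>
    req_body_json(body) |>
    req_perform() |>
    resp_body_json(simplifyVector = TRUE)
}

res <- ollama_generate("Hi, my dearest lama")
res$response

For real projects, though, a well-maintained package saves you from writing and maintaining this kind of code yourself.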

Lucky for us, Johannes Gruber and Maximilian Weber have already created an R package for Ollama, called rollama. It’s still very new, and therefore a bit experimental, but the design and documentation are already really good. You can install the package from CRAN.

install.packages('rollama')

Since you already have the Ollama server running and have already downloaded the llama3.1 model, you can get started right away. If not, you can follow rollama’s instructions for running the Docker container.

library(rollama)
pull_model("llama3.1")
query("Hi. How would you prefer for me to call you?")

Using gLLMs for text classification

One of the nice things about rollama is that it was developed by social scientists who are also interested in using gLLMs for research purposes. A tantalizing application of gLLMs is to use them for zero-shot or few-shot classification. The rollama package also includes some great tutorials for how you might approach this.

This also brings us to a cool feature of R that many people seem to overlook: some packages include vignettes that show how you can use them, and you can open these vignettes directly in R. Here we open rollama’s vignette that explains how you can use Ollama to perform text classification!

vignette("annotation", 'rollama')

This should have opened the vignette in your RStudio Help panel, or in your browser. We greatly recommend going through it! Here is one example from the vignette.

# Create an example dataframe with 5 movie reviews
movie_reviews <- tibble::tibble(
  review_id = 1:5,
  review = c("A stunning visual spectacle with a gripping storyline.",
             "The plot was predictable, but the acting was superb.",
             "An overrated film with underwhelming performances.",
             "A beautiful tale of love and adventure, beautifully shot.",
             "The movie lacked depth, but the special effects were incredible.")
)

# Use rollama's make_query function to help make a prompt for classification
queries <- make_query(
  text = movie_reviews$review,
  prompt = "Categories: positive, neutral, negative",
  template = "{prefix}{text}\n{prompt}",
  system = "Classify the sentiment of the movie review. Answer with just the correct category.",
  prefix = "Text to classify: "
)

# Use the query to perform the annotation
movie_reviews$annotation <- query(queries, screen = FALSE, output = "text")

# Print the annotated dataframe
movie_reviews