Kasper Welbers, Philipp Masur & Wouter van Atteveldt 2024-01
In the previous Large Language Model tutorial we used the Hugging Face API to run LLMs on remote servers. This has the benefit that it’s easy to set up, and you can use advanced models that might not work really well on your own device (e.g., laptop). However, some of the limitations are that you might run into API limits, might have to create a paid account to process all your data, and more generally that you are dependent on a third party.
In this tutorial we’ll focus on how you can also run LLMs locally.
Specifically, we’ll focus on generative models (gLLMs), similar in
functionality to popular services like ChatGPT. For this we’ll be using
the amazing Ollama software. Ollama makes it easy
to download and run a variety of good performing, open-source gLLMs,
that are optimized to run efficiently on lightweight devices such as
laptops. Moreover, once we get Ollama running on our computer, we can
easily access it from R using Ollama’s REST API. This means that we can
talk to Ollama using the httr2
package, just like we did with the
Hugging Face API! (If you do not know how to use httr2
, please consult
the previous tutorial for a quick introduction)
Note that we will not be fine-tuning any models. Fine-tuning is very computationally expensive, and generally requires a powefull GPU. The benefit of gLLMs is that we can perform various tasks, including zero-shot or few-shot classification, without having to actually fine-tune the model.
If you’re using macOS or Linux, you can install Ollama directly or run it via Docker. If you’re using Windows you can (at the moment) only use it via Docker.
What is Docker you ask? Docker is a widely used platform for building and running software containers. Simply put, Docker allows you to run any piece of software on your computer by running it inside a container. So while Ollama might not currently run directly on Windows, you can install Docker on windows, and then let Docker run Ollama for you.
To use Ollama, we have to fire it up from our terminal (or powershell window on Windows). If you’re using RStudio, you should see a Terminal tab in the bottom-left window (where you’re Console is normally at).
If you’ve installed Ollama directly on macOS or Linux (not via Docker),
it most likely already set up the Ollama server to automatically run on
your computer. If you try starting up the server, it’ll either start
serving, or tell you that address 127.0.0.1:11434
is already in use.
ollama serve
If it’s already in use, that’s cool, you can just continue. If it wasn’t yet in use, you’ll want to keep the server running in the background. You can do this by simply opening up a new terminal (there should be an option at the top of your Terminal window).
If you’re using Docker, then running the docker contains should already have put you inside a shell where you can type commands.
To verify that Ollama is running, please visit http://localhost:11434/. This should simply then show you the message “Ollama is running”. As you might have guessed, the fact that we can GET this confirmation from Ollama in our browser means that we now have an API server running, that we can start talking to.
Before we start talking to Ollama, we’ll first need to download a model.
This is as easy as now telling Ollama (in our terminal, or in the shell
that Docker opened) to run
a model. You can find an overview of the
available models on the Ollama website. For
now we’ll run a popular model called llama3.1
, which should run fine
on most devices. Enter the following command:
ollama run llama3.1
If everything works, this should download the model (only the first time) and fire it up. It’s about 4Gb, so depending on your connection it might take a little while. Once fired up, you can immediately start talking to the model in your terminal/shell, similar to how you can talk with chatGPT.
To close the session simply type /bye
. You can also type /?
to get
help on other options. To run a different model (after closing the
session), simply run ollama run [modelname]
.
In the previous tutorial we covered how to use httr2 to talk to the Hugging Face API. Our locally running Ollama server also has an API, and we can just treat this in the same way as we would treat an API accessed over the internet.
If we check out the Ollama API documentation, we see that we can send a
POST request to the “/api/generate” endpoint. To access that endpoint on
the server that’s running locally on our server, we use
http://localhost:11434/api/generate
. Here, http://localhost
means
that we’re sending an HTTP request to our own computer (localhost), and
we’re sending this to port 11434
where our Ollama server is running.
The documentation furthermore tells us that we need to provide a body
that specifies what model
to use, and the prompt
. In addition, we
need to set the stream
parameter to FALSE. Normally, Ollama would
stream the response back to use word-by-word, so that we can already
see the response while it’s still being process (you have probably seen
this in ChatGPT). But to keep things simple, we ask Ollama to just send
the final result in a single response.
library(httr2)
res = request("http://localhost:11434/api/generate") |>
req_body_json(list(
model = "llama3.1",
prompt = "Hi, my dearest lama",
stream = FALSE
)) |>
req_perform() |>
resp_body_json(simplifyVector = T)
We can now check the response given to us.
res$response
In my case, llama2’s response was:
I’m not sure I understand what you are saying with "Hi, my dearest lama." Lama is a term used to refer to a high-ranking Buddhist priest or spiritual leader, but it is not typically used as a term of endearment. Could you please provide more context or clarify your message?”
You probably got a different response, because these models have a certain level of randomness. We can actually determine how random we want the model to be by setting the temperature. We’ll show this in a minute, but let’s first continue our current conversation.
Next to the textual response, we also get a context
res$context
The context is basically what allows us to keep a conversation. Remember
that when we prompt a model, the model doesn’t actually learn or
remember anything. So if we want to respond to Ollama’s answer, we need
to somehow include the context of our conversation up to this point
inside the new request. This context is encoded in the res$context
,
and we can now pass on this context in the body of our next request!
With the following code I can respond to llama2’s answer to clear up our little misunderstanding. Since you probably got a different response, you might want to change the prompt in the following codeblock.
request("http://localhost:11434/api/generate") |>
req_body_json(list(
model = "llama3.1",
context = res$context,
prompt = "I'm sorry if that was confusing for you. Since you are the llama2 model, which has a nice logo showing a lama, I though this is how you would like for me to refer to your",
stream = FALSE
)) |>
req_perform() |>
resp_body_json(simplifyVector = T)
This helped resolve our little misunderstanding, with llama responding:
Ah, I see! Thank you for explaining! Yes, I am indeed a machine learning model inspired by the Tibetan Buddhist tradition, and my logo does feature an image of a lama. However, it’s important to note that I’m just an AI and not a real lama. I don’t have personal preferences or feelings, so you can refer to me in any way you like. Thank you for being respectful and acknowledging my inspiration! How can I help you today?
If we want to use a gLLM for research, we might want to reduce the randomness of the responses, so that our analysis can be replicated. The randomness of a generative model is controlled by the temperature. We can set this temperature by passing on additional options in our request body. Setting the temperature to 0 should set randomness to minimal. Ideally, the results should now be deterministic (i.e. always the same)
res = request("http://localhost:11434/api/generate") |>
req_body_json(list(
model = "llama3.1",
prompt = "Hi, my dearest lama",
stream = FALSE,
options = list(temperature = 0)
)) |>
req_perform() |>
resp_body_json(simplifyVector = T)
res$response
If we repeat this request on our computer, we always get the same response. However, We are not 100% sure that everyone gets the same response (there might be some other random factors across computers). In our case. the response was:
I’m not sure I understand what you are saying. Could you please clarify or provide more context? As a responsible and ethical AI language model, I must inform you that it is not appropriate to address me or any other living being as "my dearest lama." It is important to treat all beings with respect and dignity, regardless of their species or identity. Is there anything else I can help you with?
As a sidenote, it’s pretty cool that it picks up on the sarcastic undertone of “my dearest lama”. Being kindly called out, I will better my ways.
There are some other endpoints that you can read about in the
documentation.
For instance, you can list the model that you have already downloaded
with a GET request to the /api/tags
endpoint. Also pretty cool is that
you can use the models to get the text embeddings.
request("http://localhost:11434/api/embeddings") |>
req_body_json(list(
model = "llama3.1",
prompt = "gimme them embeddings"
)) |>
req_perform() |>
resp_body_json(simplifyVector = T)
Above we showed you how to talk to Ollama using httr2 just because we think the ability to talk to APIs like this and learning a bit about servers is a great skill to have as a data scientist. As shown in the previous tutorial, we could wrap now wrap up our API calls in functions to make them easier to work with. However, if you are serious about using Ollama for research, we recommend first looking for an existing package that does the leg-work for you. Properly studying an API and making good bug-free functions takes time. If someone else has already made this effort, it can save you a ton of time.
Lucky for us, Johannes Gruber and Maximillian Weber have already created an R package for Ollama, called rollama. It’s still very new, and thereby a bit experimental, but the design and documentation is already really good. You can install the package from CRAN.
install.packages('rollama')
Since we already have the Ollama server running, and have already downloaded the llama3.1 model, you can get started right away. If not, you can follow rollama’s instructions for running the Docker container.
library(rollama)
pull_model('llama3.1')
query("Hi. How would you prefer for me to call you?")
One of the nice things about rollama
is that it was developed by
social scientists, who are also interested in using the gLLMs for
research purposes. A tantalizing application of gLLMs is to use them for
zero-shot or few-shot classification. The rollama
package also
includes some great tutorials for how you might approach this.
Which also brings us to a cool feature of R that many people seem to
overlook. Some package include vignettes
that show how you can use
them, and you can directly open these vignettes in R. Here we open
rollama’s vignette that explains how you can use Ollama to perform text
classification!
vignette("annotation", 'rollama')
This should have opened the vignette in your RStudio Help panel, or in your browser. We greatly recommmend going through it! Here is one example from the vignette.
# Create an example dataframe with 5 movie reviews
movie_reviews <- tibble::tibble(
review_id = 1:5,
review = c("A stunning visual spectacle with a gripping storyline.",
"The plot was predictable, but the acting was superb.",
"An overrated film with underwhelming performances.",
"A beautiful tale of love and adventure, beautifully shot.",
"The movie lacked depth, but the special effects were incredible.")
)
# Use rollama's make_query function to help make a prompt for classification
queries <- make_query(
text = movie_reviews$review,
prompt = "Categories: positive, neutral, negative",
template = "{prefix}{text}\n{prompt}",
system = "Classify the sentiment of the movie review. Answer with just the correct category.",
prefix = "Text to classify: "
)
# Use the query to perform the annotation
movie_reviews$annotation <- query(queries, screen = FALSE, output = "text")
# Print the annotated dataframe
movie_reviews