Note
- AWS Lambda uses CPUs, so running generate/chat is a little slow.
- The first deployment takes ~5m while the container is built and models are cached; subsequent deployments take ~1m.
- The first request, while the model loads, takes ~20s; subsequent requests take ~5-20s.
- While this is not production grade, it is a cost-effective way to serve models.
curl https://wm4s6cxkwua4ncx3skpdtdx27a0qzbnd.lambda-url.us-east-1.on.aws/api/generate -d '{
"model": "llama3.2:1b",
"prompt":"Why is the sky blue?"
}'
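The /api/chat endpoint on the same Function URL accepts Ollama's standard chat request format, for example:
curl https://wm4s6cxkwua4ncx3skpdtdx27a0qzbnd.lambda-url.us-east-1.on.aws/api/chat -d '{
"model": "llama3.2:1b",
"messages": [
{ "role": "user", "content": "Why is the sky blue?" }
]
}'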
- Please, please, please don't abuse this endpoint! Scaffoldly is Open Source (a.k.a. cash-strapped) and we're hosting it for demonstration purposes only!
- Please consider donating if you like what Scaffoldly is doing!
- Check out our other examples
- Give our Tooling and Examples repositories a ⭐️ if you like what you see!
Tip
To use a different model than llama3.2:1b, update scaffoldly.json with the desired model(s).
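If you're not sure where the model name lives in your copy, a quick (hypothetical) way to find and swap it from the shell:
# Show every place the default model is referenced
grep -n "llama3.2:1b" scaffoldly.json
# Swap in a different model tag (example choice; GNU sed shown, use `sed -i ''` on macOS)
sed -i 's/llama3.2:1b/llama3.2:3b/g' scaffoldly.json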
- Run the following command to create your own copy of this application:
npx scaffoldly create app --template ollama
- Create an EFS Filesystem in AWS and give it a Name of .cache (to match scaffoldly.json).
- Finally, deploy:
cd my-app
npx scaffoldly deploy
You will see output that looks like:
App framework not detected. Using `scaffoldly.json` for configuration.
✅ Updated Identity: arn:aws:sts::123456789012:assumed-role/aws-examples@scaffold.ly/cnuss
✅ Updated ECR Repository: 123456789012.dkr.ecr.us-east-1.amazonaws.com/ollama
✅ Updated Local Image Digest: sha256:f7ee27705d66c64a250982d6ee8282d5338a4989ae95c5ac4453a15c264efc97
✅ Updated Secret: arn:aws:secretsmanager:us-east-1:123456789012:secret:ollama@ollama-yaVNCp
✅ Updated EFS Access Point: arn:aws:elasticfilesystem:us-east-1:123456789012:access-point/fsap-0b0e5506324efd541
✅ Updated IAM Role: ollama-0447aaae
✅ Updated IAM Role Policy: ollama
✅ Updated Lambda Function: ollama
✅ Updated Function URL: https://wm4s6cxkwua4ncx3skpdtdx27a0qzbnd.lambda-url.us-east-1.on.aws
✅ Updated Schedule Group: ollama-0447aaae
✅ Updated Local Image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/ollama:0.0.0-0-0447aaae
✅ Updated Local Image Digest: sha256:320447c49d08d109c4fc1702acc24768657a9a09e4e0eb90f8b32051500664ba
✅ Updated Secret: arn:aws:secretsmanager:us-east-1:123456789012:secret:ollama@ollama-yaVNCp
✅ Updated Lambda Function: ollama
✅ Updated Function Code: ollama@sha256:320447c49d08d109c4fc1702acc24768657a9a09e4e0eb90f8b32051500664ba
✅ Updated Function Alias: ollama (version: 4)
✅ Updated Function Policies: InvokeFunctionUrl
✅ Updated Function URL: https://wm4s6cxkwua4ncx3skpdtdx27a0qzbnd.lambda-url.us-east-1.on.aws
✅ Updated Network Interface: eni-0dc0e11444fa19715
✅ Created Invocation of `( HOME=$XDG_CACHE_HOME OLLAMA_HOST=$URL ollama pull llama3.2:1b )`:
pulling manifest
==> pulling 74701a8c35f6... 100% ▕████████████████▏ 1.3 GB
==> pulling 966de95ca8a6... 100% ▕████████████████▏ 1.4 KB
==> pulling fcc5a6bec9da... 100% ▕████████████████▏ 7.7 KB
==> pulling a70ff7e570d9... 100% ▕████████████████▏ 6.0 KB
==> pulling 4f659a1e86d7... 100% ▕████████████████▏ 485 B
==> verifying sha256 digest
==> writing manifest
==> success
✅ Updated HTTP GET on https://wm4s6cx...s-east-1.on.aws: 200 OK
Deployment Complete!
App Identity: arn:aws:iam::123456789012:role/ollama-0447aaae
Env Files: .env.ollama, .env.main, .env
Image Size: 4.81 GB
URL: https://wm4s6cxkwua4ncx3skpdtdx27a0qzbnd.lambda-url.us-east-1.on.aws
- The scaffoldly.json is converted into a Multi-Stage Docker Build
- A docker build is pushed to Amazon ECR
- A Lambda Function is created to serve the image
- Models are cached to Amazon EFS
- Requests are proxied to the underlying Ollama server
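Because requests are proxied straight to Ollama, the standard Ollama HTTP API is available on the Function URL. For example, listing the models cached on the EFS volume (demo URL shown; substitute your own):
curl https://wm4s6cxkwua4ncx3skpdtdx27a0qzbnd.lambda-url.us-east-1.on.aws/api/tags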
Tip
This repository also comes with a GitHub Action so that deployments can occur from GitHub instead of being executed manually!
After the project has been created, run npx scaffoldly show dockerfile to see the resultant Dockerfile:
FROM ollama/ollama:0.4.7 AS install-base
WORKDIR /var/task
FROM install-base AS build-base
WORKDIR /var/task
ENV PATH="/var/task:$PATH"
COPY . /var/task/
FROM install-base AS package-base
WORKDIR /var/task
ENV PATH="/var/task:$PATH"
FROM install-base AS runtime
WORKDIR /var/task
ENV PATH="/var/task:$PATH"
COPY --from=scaffoldly/scaffoldly:1 /linux/arm64/awslambda-entrypoint /var/task/.entrypoint
CMD [ "( HOME=$XDG_CACHE_HOME ollama serve )" ]
Running npx scaffoldly deploy will:
- Infer scaffoldly.json into a Multi-Stage Docker Build
- Run the equivalent of docker build (sketched below)
- Set up Amazon ECR
- Create a Lambda Function
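Roughly speaking, the build step behaves like a local build of the generated Dockerfile. A sketch of the equivalent commands (the tag and platform here are illustrative; Scaffoldly drives this for you during deploy):
# Render the Dockerfile that Scaffoldly infers from scaffoldly.json
npx scaffoldly show dockerfile > Dockerfile
# Build it for the architecture used by this example (arm64, per the entrypoint copied in the Dockerfile above)
docker build --platform linux/arm64 -t ollama:local .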
AWS Lambda requires that Docker Images come from Amazon ECR Private Registries, and it can't run public images either.
Running npx scaffoldly deploy will:
- Pull ollama/ollama:0.4.7, re-tag it, and push it to Amazon ECR as a private image
- Create an ECR Repository if it doesn't already exist
- Run the equivalent of docker push (sketched below)
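Under the hood this is the familiar ECR workflow. A minimal sketch with standard Docker and AWS CLI commands (the account ID, region, and tag are illustrative, taken from the sample output above):
# Authenticate Docker to the private ECR registry
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
# Re-tag the public Ollama image as a private ECR image and push it
docker pull ollama/ollama:0.4.7
docker tag ollama/ollama:0.4.7 123456789012.dkr.ecr.us-east-1.amazonaws.com/ollama:0.4.7
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/ollama:0.4.7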
An AWS Lambda Function is created with the configuration in the scaffoldly.json file.
Running npx scaffoldly deploy will:
- Set up Function Environment Variables from .env
- Deploy the Function with a VPC Configuration and EFS Mounts inferred from Amazon EFS
- Create Lambda Versions and Aliases
- Set an ENTRYPOINT which routes AWS Lambda HTTP Requests to Ollama
- Create a Lambda Function URL and expose it to the Function as the URL environment variable
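To confirm what was configured, the standard AWS CLI can read the deployed settings back (a sketch; the function name ollama comes from the sample output above):
# Inspect environment variables, VPC configuration, and EFS mounts on the function
aws lambda get-function-configuration --function-name ollama \
  --query '{Env: Environment.Variables, Vpc: VpcConfig, Efs: FileSystemConfigs}'
# Show the Function URL that was created
aws lambda get-function-url-config --function-name ollama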
Model files are large, so they are cached in Amazon EFS. Using the @immediately option in the schedules directive of scaffoldly.json, the model is pre-downloaded after the deployment.
Running npx scaffoldly deploy will:
- Set the XDG_CACHE_HOME environment variable to the EFS Mount on the Lambda Function
- Use the OLLAMA_HOST=$URL environment variable to trigger a remote download (on itself)
- Use HOME=$XDG_CACHE_HOME to direct Ollama where to store files
- Invoke ollama pull once the AWS Lambda Function has finished deploying (the same pull can be run from your own machine, as shown below)
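Because the pull is driven through the Function URL, the same download can be triggered from any machine with the ollama CLI installed, pointed at the deployed server (a sketch; substitute your own Function URL and model):
# Point the local ollama client at the Lambda-hosted server and pull remotely
export OLLAMA_HOST=https://wm4s6cxkwua4ncx3skpdtdx27a0qzbnd.lambda-url.us-east-1.on.aws
ollama pull llama3.2:1b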
Finally, Scaffoldly uses the start option in the scripts directive of scaffoldly.json to run ollama serve.
Running npx scaffoldly deploy will:
- Copy the awslambda-entrypoint into the image (as /var/task/.entrypoint in the Dockerfile above)
- The awslambda-entrypoint reads the SLY_ROUTES and SLY_SERVE environment variables to start and route requests
- Requests are converted from the AWS Lambda HTTP Request format back into an HTTP Request forwarded to the Ollama Server
- The Ollama Server response is streamed back to the requestor (see the example below)
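You can observe the streaming pass-through directly: Ollama emits newline-delimited JSON chunks by default, and curl's -N flag disables output buffering so the chunks appear as they arrive (demo URL shown; substitute your own):
curl -N https://wm4s6cxkwua4ncx3skpdtdx27a0qzbnd.lambda-url.us-east-1.on.aws/api/generate -d '{
"model": "llama3.2:1b",
"prompt": "Write a haiku about clouds",
"stream": true
}'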
Join our Discussions on GitHub. Join our Community on Discord.
This code is licensed under the Apache-2.0 license.
The scaffoldly toolchain is licensed under the FSL-1.1-Apache-2.0 license.
Copyright 2024 Scaffoldly LLC