Alpha Software - This package is in early development. The API may change between versions without notice.
macOS Only - This package currently only supports macOS with Apple Silicon or Intel processors.
A minimal llama.cpp provider for the Vercel AI SDK, implementing the LanguageModelV3 interface.
This package loads llama.cpp directly into Node.js memory via native C++ bindings, enabling local LLM inference without requiring an external server.
- Native Performance: Direct C++ bindings using node-addon-api (N-API)
- GPU Acceleration: Automatic Metal support on macOS
- Streaming & Non-streaming: Full support for both generateText and streamText
- Structured Output: Generate JSON objects with schema validation using generateObject
- Tool/Function Calling: Support for AI SDK tools with automatic tool call detection
- Chat Templates: Automatic or configurable chat template formatting (llama3, chatml, gemma, etc.)
- ESM Only: Modern ECMAScript modules, no CommonJS
- GGUF Support: Load any GGUF-format model
Before installing, ensure you have the following:
- macOS (Apple Silicon or Intel)
- Node.js >= 18.0.0
- CMake >= 3.15
- Xcode Command Line Tools
# Install Xcode Command Line Tools (includes Clang)
xcode-select --install
# Install CMake via Homebrew
brew install cmake
npm install ai-sdk-llama-cpp
The installation will automatically:
- Detect macOS and verify platform compatibility
- Compile llama.cpp as a static library with Metal support
- Build the native Node.js addon
Note: Installation on Windows or Linux will fail with an error. Only macOS is supported.
import { generateText } from "ai";
import { llamaCpp } from "ai-sdk-llama-cpp";
const model = llamaCpp({
modelPath: "./models/llama-3.2-1b-instruct.Q4_K_M.gguf",
});
try {
const { text } = await generateText({
model,
prompt: "Explain quantum computing in simple terms.",
});
console.log(text);
} finally {
await model.dispose();
}
import { streamText } from "ai";
import { llamaCpp } from "ai-sdk-llama-cpp";
const model = llamaCpp({
modelPath: "./models/llama-3.2-1b-instruct.Q4_K_M.gguf",
});
try {
const result = streamText({
model,
prompt: "Write a haiku about programming.",
});
for await (const chunk of result.textStream) {
process.stdout.write(chunk);
}
} finally {
await model.dispose();
}
Generate type-safe JSON objects that conform to a schema using generateObject:
import { generateObject } from "ai";
import { z } from "zod";
import { llamaCpp } from "ai-sdk-llama-cpp";
const model = llamaCpp({
modelPath: "./models/your-model.gguf",
});
try {
const { object: recipe } = await generateObject({
model,
schema: z.object({
name: z.string(),
ingredients: z.array(
z.object({
name: z.string(),
amount: z.string(),
})
),
steps: z.array(z.string()),
}),
prompt: "Generate a recipe for chocolate chip cookies.",
});
// recipe is fully typed as { name: string, ingredients: {...}[], steps: string[] }
console.log(recipe.name);
console.log(recipe.ingredients);
console.log(recipe.steps);
} finally {
await model.dispose();
}
The structured output feature uses GBNF grammar constraints to ensure the model generates valid JSON that conforms to your schema. This works with:
- Primitive types: string, number, integer, boolean, null
- Objects: With properties, required, and additionalProperties
- Arrays: With items, minItems, maxItems
- Enums and constants: enum, const
- Composition: oneOf, anyOf, allOf
- String constraints: minLength, maxLength, pattern
- Number constraints: minimum, maximum (for integers)
- String formats: date, time, date-time, uuid
- References: Local $ref to $defs/definitions
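As an illustration, here is a hedged sketch of a schema that exercises the enum and integer-bound support from the list above. The model path is a placeholder, and the grammar generated from the schema is an internal detail of the provider:

import { generateObject } from "ai";
import { z } from "zod";
import { llamaCpp } from "ai-sdk-llama-cpp";

const model = llamaCpp({
  modelPath: "./models/your-model.gguf",
});

try {
  const { object: task } = await generateObject({
    model,
    schema: z.object({
      // Enum values constrain the output to one of the listed strings
      priority: z.enum(["low", "medium", "high"]),
      // Integer bounds map to the minimum/maximum constraints listed above
      retries: z.number().int().min(0).max(5),
      summary: z.string(),
    }),
    prompt: "Plan a follow-up task for a failed CI build.",
  });
  console.log(task.priority, task.retries, task.summary);
} finally {
  await model.dispose();
}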
Use AI SDK tools with local models. The model decides when to call tools based on the conversation context:
import { generateText, stepCountIs, tool } from "ai";
import { z } from "zod";
import { llamaCpp } from "ai-sdk-llama-cpp";
const model = llamaCpp({
modelPath: "./models/your-model.gguf",
});
try {
const result = await generateText({
model,
prompt: "What's the weather in Tokyo?",
tools: {
weather: tool({
description: "Get the current weather for a location",
parameters: z.object({
location: z.string().describe("The location to get weather for"),
}),
execute: async ({ location }) => ({
location,
temperature: 72,
}),
}),
},
stopWhen: stepCountIs(3), // Limit steps to prevent infinite loops
});
console.log(result.text);
} finally {
await model.dispose();
}
Tool calling also works with streamText. When tools are provided, the provider automatically detects tool call JSON output and emits proper tool-call events instead of streaming raw JSON as text.
Note: Tool calling quality depends heavily on the model. Models fine-tuned for function calling (e.g., Llama 3.1+, Hermes 2/3, Functionary, Qwen 2.5) work best. Generic models may produce inconsistent results.
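A minimal sketch of consuming those events with streamText follows. The weather tool mirrors the example above; the exact shape of each stream part depends on your AI SDK version, so the loop only switches on part.type:

import { streamText, stepCountIs, tool } from "ai";
import { z } from "zod";
import { llamaCpp } from "ai-sdk-llama-cpp";

const model = llamaCpp({
  modelPath: "./models/your-model.gguf",
});

try {
  const result = streamText({
    model,
    prompt: "What's the weather in Tokyo?",
    tools: {
      weather: tool({
        description: "Get the current weather for a location",
        parameters: z.object({
          location: z.string().describe("The location to get weather for"),
        }),
        execute: async ({ location }) => ({ location, temperature: 72 }),
      }),
    },
    stopWhen: stepCountIs(3), // Limit steps to prevent infinite loops
  });

  // fullStream interleaves text parts with tool-call and tool-result parts
  for await (const part of result.fullStream) {
    if (part.type === "tool-call") {
      console.log("tool call:", part);
    } else if (part.type === "tool-result") {
      console.log("tool result:", part);
    }
  }
  console.log(await result.text);
} finally {
  await model.dispose();
}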
import { embed, embedMany } from "ai";
import { llamaCpp } from "ai-sdk-llama-cpp";
const model = llamaCpp.embedding({
modelPath: "./models/nomic-embed-text-v1.5.Q4_K_M.gguf",
});
try {
const { embedding } = await embed({
model,
value: "Hello, world!",
});
const { embeddings } = await embedMany({
model,
values: ["Hello, world!", "Hello, ▲!"],
});
} finally {
model.dispose();
}
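As a usage sketch, the vectors returned by embedMany can be compared with a small cosine-similarity helper. The helper below is hypothetical and not part of this package:

import { embedMany } from "ai";
import { llamaCpp } from "ai-sdk-llama-cpp";

// Hypothetical helper, not exported by ai-sdk-llama-cpp
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const model = llamaCpp.embedding({
  modelPath: "./models/nomic-embed-text-v1.5.Q4_K_M.gguf",
});

try {
  const { embeddings } = await embedMany({
    model,
    values: ["How do I bake bread?", "Bread baking instructions", "Quantum computing basics"],
  });
  // Higher values mean the sentences are more semantically similar
  console.log(cosineSimilarity(embeddings[0], embeddings[1]));
  console.log(cosineSimilarity(embeddings[0], embeddings[2]));
} finally {
  model.dispose();
}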
const model = llamaCpp({
// Required: Path to the GGUF model file
modelPath: "./models/your-model.gguf",
// Optional: Maximum context size (default: 2048)
contextSize: 4096,
// Optional: Number of layers to offload to GPU
// Default: 99 (all layers). Set to 0 to disable GPU.
gpuLayers: 99,
// Optional: Number of CPU threads (default: 4)
threads: 8,
// Optional: Enable verbose debug output from llama.cpp (default: false)
debug: true,
// Optional: Chat template to use for formatting messages
// - "auto" (default): Use the template embedded in the GGUF model file
// - Template name: Use a specific built-in template (e.g., "llama3", "chatml", "gemma")
chatTemplate: "auto",
});
The chatTemplate option controls how messages are formatted before being sent to the model. Available templates include:
- chatml, llama2, llama2-sys, llama3, llama4
- mistral-v1, mistral-v3, mistral-v7
- phi3, phi4, gemma, falcon3, zephyr
- deepseek, deepseek2, deepseek3, command-r
- And more (see the llama.cpp documentation for the full list)
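For example, to override the template embedded in the GGUF file with a specific built-in format (assuming the model actually expects that format):

const model = llamaCpp({
  modelPath: "./models/llama-3.2-1b-instruct.Q4_K_M.gguf",
  chatTemplate: "llama3", // force the Llama 3 prompt format instead of "auto"
});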
The standard AI SDK generation parameters are supported:
try {
const { text } = await generateText({
model,
prompt: "Hello!",
maxTokens: 256, // Maximum tokens to generate
temperature: 0.7, // Sampling temperature (0-2)
topP: 0.9, // Nucleus sampling threshold
topK: 40, // Top-k sampling
stopSequences: ["\n"], // Stop generation at these sequences
});
} finally {
await model.dispose();
}
You'll need to download GGUF-format models separately. Popular sources:
- Hugging Face - Search for GGUF models
- TheBloke's Models - Popular quantized models
Example download:
# Create models directory
mkdir -p models
# Download a model (example: Llama 3.2 1B)
wget -P models/ https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf
Creates a new llama.cpp language model instance.
Parameters:
- config.modelPath (string, required): Path to the GGUF model file
- config.contextSize (number, optional): Maximum context size. Default: 2048
- config.gpuLayers (number, optional): GPU layers to offload. Default: 99
- config.threads (number, optional): CPU threads. Default: 4
- config.debug (boolean, optional): Enable verbose llama.cpp output. Default: false
- config.chatTemplate (string, optional): Chat template to use for formatting messages. Default: "auto"
Returns: LlamaCppLanguageModel - A language model compatible with the Vercel AI SDK
Implements the LanguageModelV3 interface from @ai-sdk/provider.
Methods:
- doGenerate(options): Non-streaming text generation
- doStream(options): Streaming text generation
- dispose(): Unload the model and free GPU/CPU resources. Always call this when done to prevent memory leaks, especially when loading multiple models
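For example, when running several models in sequence, disposing each one before loading the next keeps GPU and CPU memory bounded. The model paths below are placeholders:

import { generateText } from "ai";
import { llamaCpp } from "ai-sdk-llama-cpp";

// Placeholder paths; substitute your own GGUF files
const modelPaths = ["./models/model-a.gguf", "./models/model-b.gguf"];

for (const modelPath of modelPaths) {
  const model = llamaCpp({ modelPath });
  try {
    const { text } = await generateText({
      model,
      prompt: "Say hello in one sentence.",
    });
    console.log(modelPath, "->", text);
  } finally {
    // Free the native resources before loading the next model
    await model.dispose();
  }
}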
This is a minimal implementation with the following limitations:
- macOS only: Windows and Linux are not supported
- No image inputs: Only text prompts are supported
See CONTRIBUTING.md for development setup and contribution guidelines.
MIT
- llama.cpp - The underlying inference engine
- Vercel AI SDK - The AI SDK framework
- node-addon-api - N-API C++ wrapper