diff --git a/apps/web/app/install/next-steps-dialog.tsx b/apps/web/app/install/next-steps-dialog.tsx
index 6ab24dc..5455ac1 100644
--- a/apps/web/app/install/next-steps-dialog.tsx
+++ b/apps/web/app/install/next-steps-dialog.tsx
@@ -497,9 +497,9 @@ const results = await engine.retrieve({
{state.install.storeAdapter === 'drizzle' && - 'npx drizzle-kit push'} + 'bunx drizzle-kit push'} {state.install.storeAdapter === 'prisma' && - 'npx prisma db push'} + 'bunx prisma db push'} {state.install.storeAdapter === 'raw-sql' && 'psql $DATABASE_URL -f lib/unrag/schema.sql'} diff --git a/apps/web/components/home/hero.tsx b/apps/web/components/home/hero.tsx index 4ea4c2f..5ef0bd3 100644 --- a/apps/web/components/home/hero.tsx +++ b/apps/web/components/home/hero.tsx @@ -11,6 +11,7 @@ import { Text } from '../elements' import {ArrowNarrowRightIcon} from '../icons' +import {Terminal} from '../terminal' import {DraggableTerminal} from './draggable-terminal' function HeroLeftAlignedWithDemo({ @@ -97,8 +98,8 @@ export function HeroSection() { className="py-16" eyebrow={ } @@ -125,7 +126,7 @@ export function HeroSection() { <>
- + The skill is designed with Claude Code in mind, but the format works with other AI assistants that support similar knowledge structures. The underlying content is just markdown files with structured information—any assistant that can read and reason over project files can benefit. + + +## What the skill includes + +The Unrag skill is comprehensive because it's designed to handle everything you might ask about. There's no point in giving the AI partial knowledge—that just leads to confident-sounding answers that are wrong in the details. Here's what the skill covers: + +### Core API knowledge + +The skill includes complete type definitions for the `ContextEngine` class and all its methods. This means the AI knows that `ingest()` accepts a `sourceId`, `content`, optional `metadata`, and optional per-call `chunking` overrides. It knows that `retrieve()` accepts a `query`, `topK` count, and optional `scope` for filtering results. It knows the shape of the objects returned by each method, including fields like `durations` that break down where time was spent. + +The skill also covers `defineUnragConfig()`, the function you use to set up your engine. The AI understands the full configuration schema—defaults for chunking, retrieval settings, embedding provider options, engine configuration, and how everything fits together. + +### All twelve embedding providers + +Embeddings are the foundation of vector search, but each provider has its own configuration quirks. OpenAI uses `OPENAI_API_KEY` and model names like `text-embedding-3-small`. Google uses `GOOGLE_GENERATIVE_AI_API_KEY` and supports task type hints for better embeddings. Cohere has input type parameters. Azure OpenAI requires resource names and API versions. Each provider has its own way of working, and getting the details wrong means your embeddings don't generate. + +The skill documents all twelve providers in complete detail. For each one, it includes the configuration structure, required environment variables, available models, and any provider-specific options. When you ask the AI to set up Voyage embeddings or switch from OpenAI to local Ollama models, it knows exactly how to configure them. + +### Extractors for rich media + +Text is just one kind of content. Real applications need to ingest PDFs, images, audio transcriptions, and video. Unrag includes twelve extractors that handle different content types: + +**PDFs** can be processed in three ways: `pdf-text-layer` extracts text directly from the PDF structure (fast and accurate when the PDF has a text layer), `pdf-llm` uses a language model to read the PDF (better for scanned documents), and `pdf-ocr` runs optical character recognition (works when all else fails). + +**Images** can be processed with `image-ocr` to extract visible text, or with `image-caption-llm` to generate descriptive captions that can be embedded. + +**Audio** uses `audio-transcribe` to convert speech to text via transcription APIs. + +**Video** can be processed with `video-transcribe` (extracts audio and transcribes it) or `video-frames` (samples frames and describes them visually). + +**Files** like Word documents, PowerPoint presentations, and Excel spreadsheets have dedicated extractors: `file-docx`, `file-pptx`, `file-xlsx`, and a generic `file-text` extractor for plain text files. + +The skill knows how to configure each extractor, what environment variables they need, and how to wire them into your `unrag.config.ts`. 
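For example, the skill can walk you through adding extractors with the CLI and then point you at the `assetProcessing` settings they rely on in `unrag.config.ts`. A minimal sketch, using a few of the extractor IDs documented in the CLI reference as examples:

```bash
# add only the extractors you need; each one is vendored into your install directory
bunx unrag@latest add extractor pdf-text-layer
bunx unrag@latest add extractor image-ocr
bunx unrag@latest add extractor audio-transcribe
```

Each `add` copies the extractor's source into your project and records it in `unrag.json`, so the AI can see exactly which extractors your codebase is using when it suggests configuration.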
+ +### Connectors for data sources + +Unrag includes connectors for syncing content from external sources—Notion, Google Drive, OneDrive, and Dropbox. Each connector has its own authentication flow, API quirks, and configuration options. + +The skill understands how to set up each connector, configure OAuth credentials, and use the sync methods. It knows that connectors emit events as they sync, that you can scope syncs to specific folders or databases, and how to handle the assets (PDFs, images) that connectors extract from your source content. + +### Batteries and optional features + +Beyond the core ingest/retrieve flow, Unrag includes optional "batteries" that add functionality: + +**The Reranker** adds two-stage retrieval for better precision. You retrieve a larger candidate set with fast vector search, then rerank the results with a more expensive cross-encoder model. The skill knows how to set this up with Cohere's reranking API. + +**The Eval harness** lets you evaluate your retrieval quality. You define test cases with queries and expected results, then run evaluations to see how well your system performs. The skill knows the eval API and configuration. + +**The Debug panel** provides a real-time terminal UI for watching what Unrag is doing. The skill knows the CLI commands and how to interpret the debug output. + +### Production patterns and recipes + +Knowing the API is only part of the picture. The skill also includes common patterns for using Unrag in production: + +- Building a search endpoint with proper scoping and error handling +- Implementing multi-tenant data isolation using source ID prefixes +- Integrating retrieved chunks into chat prompts for RAG-powered conversations +- Designing ingestion strategies for different content types (static documentation, user-generated content, periodic syncs from external sources) +- Framework integration patterns for Next.js, Express, Hono, and other common setups + +These patterns help the AI suggest architecturally sound approaches, not just syntactically correct code. + +## What this looks like in practice + +When you work with an AI assistant that has the Unrag skill installed, the interaction quality changes noticeably. Here's an example of what the difference feels like. + +**Without the skill**, you might ask: "Help me set up Unrag with Google embeddings and Prisma." The AI produces something that looks plausible—it imports from a package, creates some configuration object, calls some methods. But the import path is invented, the configuration keys are wrong, and the environment variable names don't match what Unrag actually expects. You spend the next ten minutes debugging. + +**With the skill installed**, the same request yields accurate code. The AI knows the correct import path from your generated config file. It knows that Google embeddings use `GOOGLE_GENERATIVE_AI_API_KEY`, not `GOOGLE_API_KEY`. It knows the `defineUnragConfig()` schema and produces configuration that actually matches the types. It knows to mention that you need to run the init command to generate the Prisma adapter files. The code works the first time you try it. + +The difference becomes even more apparent with complex requests. If you ask for help extracting text from PDFs, the AI knows about the three different PDF extractors, understands their tradeoffs (text layer extraction is fast but fails on scanned documents; LLM extraction is slower but handles complex layouts), and can recommend the right one for your situation. 
If you ask about performance tuning, it knows about the `durations` object returned from each operation and can suggest where to look for bottlenecks. + +Debugging also improves. When you report that "searches return empty results," the AI doesn't just guess at common issues. It knows the actual architecture—that content needs to be ingested before it can be retrieved, that scope filters use prefix matching on `sourceId`, that embedding dimension mismatches cause problems if you change models after ingesting data. It can walk you through specific diagnostic steps based on how Unrag actually works. + +## How the skill is structured + +Understanding the skill's structure helps you appreciate what's happening when the AI consults it. The skill is organized into focused reference files: + +``` +skills/unrag/ +├── SKILL.md # Main overview and entry point +└── references/ + ├── api-reference.md # Complete type definitions and method signatures + ├── embedding-providers.md # All 12 providers with configuration details + ├── store-adapters.md # Drizzle, Prisma, and Raw SQL adapters + ├── extractors.md # All 12 extractors for rich media + ├── connectors.md # Notion, Drive, OneDrive, Dropbox + ├── batteries.md # Reranker, Eval, Debug panel + ├── cli-commands.md # init, add, upgrade, doctor, debug + ├── patterns.md # Common recipes and architectures + └── troubleshooting.md # Debug guides and issue resolution +``` + +This organization ensures the AI can quickly find relevant information without overwhelming its context window. When you ask about embedding providers, it loads the provider reference. When you ask about debugging, it loads the troubleshooting guide. The structure mirrors how you'd naturally navigate documentation, making the AI's reasoning more predictable and reliable. + +## Keeping the skill up to date + +The Unrag skill is versioned to track Unrag releases: + +| Component | Version | +|-----------|---------| +| Skill Version | 1.0.0 | +| Unrag CLI Version | 0.3.2 | +| Config Version | 2 | + +When Unrag releases new features—additional embedding providers, new extractors, changed APIs—the skill is updated to match. You should periodically update your installed skill to keep your AI assistant's knowledge current: + +```bash +bunx add-skill betterstacks/unrag --update +``` + +This is especially important if you upgrade your Unrag installation. If you're running Unrag 0.4.0 but your AI has the 0.3.0 skill, it might suggest patterns or options that have changed. Keeping both in sync ensures the AI's advice matches your actual codebase. + +## A new way to work with AI + +Agent Skills represent a shift in how we collaborate with AI coding assistants. Instead of treating the AI as a general-purpose tool that knows a little about everything, skills let us give it deep expertise in the specific tools we're using. + +For Unrag, this means your AI assistant becomes a genuine expert on Unrag—not because it memorized general RAG patterns, but because it has access to the exact types, methods, and configuration options that Unrag uses. It can answer questions about obscure extractors, suggest the right embedding provider for your use case, and help you debug issues based on what Unrag actually does under the hood. + +The practical benefits are significant. You move faster because you're not constantly cross-referencing documentation. You write correct code earlier because the AI isn't hallucinating APIs. 
You learn more because the AI can explain concepts in context, drawing on complete and accurate knowledge. And you build more confidently because the patterns the AI suggests are production-tested approaches, not plausible-sounding inventions. + +If you're building RAG systems with Unrag and using an AI coding assistant, installing the skill is one of the highest-leverage things you can do to improve your workflow. + +## Next steps + + + + Get Unrag running in your project from scratch + + + Explore all 12 embedding providers and pick the right one + + + Create your first production retrieval API + + diff --git a/apps/web/content/docs/(unrag)/getting-started/quickstart.mdx b/apps/web/content/docs/(unrag)/getting-started/quickstart.mdx index 136586d..3c0d7f1 100644 --- a/apps/web/content/docs/(unrag)/getting-started/quickstart.mdx +++ b/apps/web/content/docs/(unrag)/getting-started/quickstart.mdx @@ -203,9 +203,18 @@ This presents a list of extractors to install (PDF, image, audio, video, files) If you prefer to configure things manually, the generated `unrag.config.ts` includes `assetProcessing` settings you can edit directly. -When you're ready to ingest content with embedded PDFs or images, check out: +## Speeding up development with AI + +If you're using an AI coding assistant like Claude Code, Cursor, or Windsurf, you can give it deep knowledge of Unrag's API to help you write correct code faster. Out of the box, AI assistants often hallucinate method names or configuration options because Unrag wasn't heavily represented in their training data. By installing the [Unrag Agent Skill](/docs/ai-assisted-coding), your AI gets access to complete type definitions, all twelve embedding providers, extractors, connectors, and production patterns—so it can produce working code instead of plausible-looking guesses. + +This is especially helpful as you move beyond the basics and start configuring specific embedding providers, adding extractors for different file types, or building production search endpoints. Instead of context-switching between your editor and documentation, you can ask your AI assistant and get accurate answers. + +When you're ready to ingest content with embedded PDFs or images, explore AI-assisted workflows, or dive deeper: + + Install the Agent Skill for your AI coding assistant + How PDFs and images are processed diff --git a/apps/web/content/docs/(unrag)/meta.json b/apps/web/content/docs/(unrag)/meta.json index 75edc0c..80ffe72 100644 --- a/apps/web/content/docs/(unrag)/meta.json +++ b/apps/web/content/docs/(unrag)/meta.json @@ -5,6 +5,7 @@ "pages": [ "index", "changelog", + "ai-assisted-coding", "getting-started", "upgrade", "concepts", @@ -21,4 +22,4 @@ "examples", "reference" ] -} +} \ No newline at end of file diff --git a/skills/unrag/SKILL.md b/skills/unrag/SKILL.md new file mode 100644 index 0000000..e41e65e --- /dev/null +++ b/skills/unrag/SKILL.md @@ -0,0 +1,341 @@ +--- +name: unrag +description: Covers RAG installation, ContextEngine API, embedding providers, store adapters, extractors, connectors, batteries, and CLI commands for the unrag TypeScript library. +version: 1.0.0 +--- + +# Unrag Agent Skill + +This skill provides comprehensive knowledge about **unrag** - a RAG (Retrieval-Augmented Generation) installer for TypeScript that vendors auditable source code directly into your project. + +## What is Unrag + +Unrag takes a deliberately different approach to RAG: instead of being a framework or SDK, it **vendors source files** directly into your repository. 
When you run `unrag init`, you're not adding a dependency that abstracts away the implementation—you're copying source files that are yours to read, modify, and delete. + +### Philosophy + +- **You own your RAG implementation** - The code lives in your repo, appears in PRs, and can be debugged like any other code +- **Primitives over frameworks** - Unrag gives you `ingest()` and `retrieve()`, not routing, agents, or prompt templates +- **Swappable components** - Simple interfaces for embedding providers, store adapters, and extractors +- **Local-first development** - No external services, just code in your codebase + +### Core Operations + +1. **`ingest()`** - Chunk content, generate embeddings, store in Postgres with pgvector +2. **`retrieve()`** - Embed a query and run similarity search +3. **`rerank()`** - Optional second-stage ranking for improved precision +4. **`delete()`** - Remove documents by sourceId or prefix + +## Quick Start + +### Installation + +```bash +# Initialize unrag in your project +bunx unrag@latest init + +# Follow prompts to select: +# - Install directory (default: lib/unrag) +# - Store adapter (Drizzle, Prisma, or Raw SQL) +# - Embedding provider (OpenAI, Google, Cohere, etc.) +# - Rich media extractors (PDF, images, etc.) +``` + +### Minimal Configuration + +```ts +// unrag.config.ts +import { defineUnragConfig } from "./lib/unrag/core"; + +export const unrag = defineUnragConfig({ + embedding: { + provider: "openai", + config: { + model: "text-embedding-3-small", + }, + }, +} as const); +``` + +### First Ingest + +```ts +import { createUnragEngine } from "@unrag/config"; + +const engine = createUnragEngine(); + +await engine.ingest({ + sourceId: "docs:getting-started", + content: "Your document content here...", + metadata: { title: "Getting Started", category: "docs" }, +}); +``` + +### First Retrieval + +```ts +const result = await engine.retrieve({ + query: "how do I get started?", + topK: 8, +}); + +for (const chunk of result.chunks) { + console.log(chunk.content, chunk.score); +} +``` + +## Core Concepts + +### ContextEngine + +The `ContextEngine` class is the main entry point. Create it using `createUnragEngine()` which reads from `unrag.config.ts`: + +```ts +import { createUnragEngine } from "@unrag/config"; + +const engine = createUnragEngine(); +``` + +The engine provides: +- `engine.ingest(input)` - Ingest documents with optional assets +- `engine.retrieve(input)` - Query for relevant chunks +- `engine.rerank(input)` - Rerank retrieved candidates +- `engine.delete(input)` - Delete by sourceId or prefix +- `engine.planIngest(input)` - Dry-run for asset processing +- `engine.runConnectorStream(options)` - Process connector streams + +### Source ID Scoping + +The `sourceId` is a stable identifier for your documents: + +```ts +// Single document +await engine.ingest({ sourceId: "doc:123", content: "..." }); + +// Hierarchical organization +await engine.ingest({ sourceId: "tenant:acme:docs:readme", content: "..." 
}); + +// Retrieve with prefix scope +const result = await engine.retrieve({ + query: "password reset", + scope: { sourceId: "tenant:acme:" }, // Only this tenant's docs +}); +``` + +**Key behaviors:** +- Re-ingesting with the same `sourceId` replaces the previous version +- Delete supports both exact match and prefix deletion +- Retrieval scope uses prefix matching + +### Chunking + +Documents are split into chunks before embedding: + +```ts +// Global defaults in unrag.config.ts +export const unrag = defineUnragConfig({ + defaults: { + chunking: { + chunkSize: 512, // tokens per chunk + chunkOverlap: 50, // overlap between chunks + }, + }, + // ... +}); + +// Per-ingest override +await engine.ingest({ + sourceId: "doc:123", + content: longDocument, + chunking: { chunkSize: 256 }, +}); +``` + +### Asset Processing + +Rich media (PDFs, images, audio, video, files) can be attached to documents: + +```ts +await engine.ingest({ + sourceId: "doc:report", + content: "Quarterly report summary...", + assets: [ + { + assetId: "attachment-1", + kind: "pdf", + data: { kind: "bytes", bytes: pdfBuffer, mediaType: "application/pdf" }, + }, + ], +}); +``` + +Assets are processed by **extractors** that convert them to text for embedding. See [extractors.md](./references/extractors.md). + +## API Quick Reference + +### ingest() + +```ts +const result = await engine.ingest({ + sourceId: string, // Stable document identifier + content: string, // Document text + metadata?: Metadata, // Optional key-value pairs + chunking?: { chunkSize?, chunkOverlap? }, + assets?: AssetInput[], // Optional rich media + assetProcessing?: DeepPartial, +}); + +// Returns: +// { documentId, chunkCount, embeddingModel, warnings, durations } +``` + +### retrieve() + +```ts +const result = await engine.retrieve({ + query: string, // Search query + topK?: number, // Number of results (default: 8) + scope?: { sourceId?: string }, // Prefix filter +}); + +// Returns: +// { chunks: Array, embeddingModel, durations } +``` + +### rerank() + +```ts +const result = await engine.rerank({ + query: string, + candidates: RerankCandidate[], // From retrieve() + topK?: number, + onMissingReranker?: "throw" | "skip", + onMissingText?: "throw" | "skip", + resolveText?: (candidate) => string | Promise, +}); + +// Returns: +// { chunks, ranking, meta, durations, warnings } +``` + +### delete() + +```ts +// Delete single document +await engine.delete({ sourceId: "doc:123" }); + +// Delete by prefix +await engine.delete({ sourceIdPrefix: "tenant:acme:" }); +``` + +### planIngest() + +Dry-run to preview asset processing without calling external services: + +```ts +const plan = await engine.planIngest({ + sourceId: "doc:report", + content: "...", + assets: [/* ... 
*/], +}); + +// Returns which assets would be processed, by which extractors +``` + +### runConnectorStream() + +Process events from a connector: + +```ts +const stream = notionConnector.sync({ pageIds: ["..."] }); + +const result = await engine.runConnectorStream({ + stream, + onProgress: (event) => console.log(event), +}); +``` + +## Configuration + +### defineUnragConfig() + +The main configuration function: + +```ts +import { defineUnragConfig } from "./lib/unrag/core"; + +export const unrag = defineUnragConfig({ + // Embedding provider configuration (required) + embedding: { + provider: "openai", + config: { model: "text-embedding-3-small" }, + }, + + // Default settings + defaults: { + chunking: { chunkSize: 512, chunkOverlap: 50 }, + embedding: { concurrency: 4, batchSize: 100 }, + retrieval: { topK: 8 }, + }, + + // Engine-level configuration + engine: { + extractors: [/* ... */], // Asset extractors + reranker: createCohereReranker(), + storage: { + storeChunkContent: true, + storeDocumentContent: true, + }, + assetProcessing: {/* ... */}, + }, +} as const); +``` + +### Environment Variables + +Common environment variables by provider: + +| Provider | Variables | +|----------|-----------| +| OpenAI | `OPENAI_API_KEY` | +| Google | `GOOGLE_GENERATIVE_AI_API_KEY` | +| Cohere | `COHERE_API_KEY` | +| Azure | `AZURE_OPENAI_API_KEY`, `AZURE_RESOURCE_NAME` | +| Voyage | `VOYAGE_API_KEY` | +| Ollama | (none, runs locally) | + +Database: `DATABASE_URL` + +## Reference File Guide + +This skill includes detailed reference files for specific topics: + +| Reference | When to Consult | +|-----------|-----------------| +| [api-reference.md](./references/api-reference.md) | Full type definitions, method signatures | +| [embedding-providers.md](./references/embedding-providers.md) | Configuring OpenAI, Google, Cohere, Voyage, Ollama, etc. | +| [store-adapters.md](./references/store-adapters.md) | Drizzle, Prisma, Raw SQL setup and schema | +| [extractors.md](./references/extractors.md) | PDF, image, audio, video, file extractors | +| [connectors.md](./references/connectors.md) | Notion, Google Drive, OneDrive, Dropbox | +| [batteries.md](./references/batteries.md) | Reranker, Eval harness, Debug panel | +| [cli-commands.md](./references/cli-commands.md) | init, add, upgrade, doctor, debug | +| [patterns.md](./references/patterns.md) | Search endpoints, multi-tenant, chat integration | +| [troubleshooting.md](./references/troubleshooting.md) | Common issues, debugging, performance | + +## Version Information + +- **Skill Version:** 1.0.0 +- **Unrag CLI Version:** 0.3.2 +- **Config Version:** 2 + +## Key Source Files + +When you need to look at source code: + +| File | Purpose | +|------|---------| +| `packages/unrag/registry/core/types.ts` | All TypeScript types | +| `packages/unrag/registry/core/context-engine.ts` | ContextEngine class | +| `packages/unrag/registry/manifest.json` | Extractors, connectors, batteries metadata | +| `packages/unrag/cli/commands/*.ts` | CLI command implementations | +| `apps/web/content/docs/**/*.mdx` | Documentation pages | diff --git a/skills/unrag/references/api-reference.md b/skills/unrag/references/api-reference.md new file mode 100644 index 0000000..6c77f06 --- /dev/null +++ b/skills/unrag/references/api-reference.md @@ -0,0 +1,481 @@ +# Unrag API Reference + +Complete type definitions and method signatures for the unrag library. 
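Before the type listings, here is a minimal usage sketch showing where these types surface at the call site, assuming the default `@unrag/config` alias that `unrag init` sets up:

```ts
import { createUnragEngine } from "@unrag/config";

const engine = createUnragEngine();

// RetrieveResult: chunks (Chunk & { score }), embeddingModel, durations
const result = await engine.retrieve({
  query: "how do I reset my password?",
  topK: 8,
  scope: { sourceId: "docs:" }, // RetrieveScope: prefix match on sourceId
});

for (const chunk of result.chunks) {
  // Chunk fields are documented under Core Types below
  console.log(chunk.sourceId, chunk.score.toFixed(3), chunk.content.slice(0, 80));
}

console.log(`retrieval: ${result.durations.retrievalMs}ms via ${result.embeddingModel}`);
```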
+ +## Core Types + +### Chunk + +The fundamental unit returned by retrieval: + +```ts +type Chunk = { + id: string; // Unique chunk identifier + documentId: string; // Parent document ID + sourceId: string; // Stable source identifier + index: number; // Position within document + content: string; // Chunk text content + tokenCount: number; // Token count for this chunk + metadata: Metadata; // Key-value metadata + embedding?: number[]; // Vector (usually not returned) + documentContent?: string; // Full document (if stored) +}; +``` + +### Metadata + +Flexible key-value storage: + +```ts +type MetadataValue = string | number | boolean | null; + +type Metadata = Record< + string, + MetadataValue | MetadataValue[] | undefined +>; +``` + +### ChunkingOptions + +Controls how documents are split: + +```ts +type ChunkingOptions = { + chunkSize: number; // Max tokens per chunk + chunkOverlap: number; // Overlap between chunks +}; +``` + +## Ingest Types + +### IngestInput + +```ts +type IngestInput = { + sourceId: string; // Stable document identifier + content: string; // Document text + metadata?: Metadata; // Optional metadata + chunking?: Partial; // Override chunking + assets?: AssetInput[]; // Rich media attachments + assetProcessing?: DeepPartial; +}; +``` + +### IngestResult + +```ts +type IngestResult = { + documentId: string; // Generated or existing document ID + chunkCount: number; // Number of chunks created + embeddingModel: string; // Model used for embeddings + warnings: IngestWarning[]; // Structured warnings + durations: { + totalMs: number; + chunkingMs: number; + embeddingMs: number; + storageMs: number; + }; +}; +``` + +### IngestWarning + +Structured warnings for skipped or failed assets: + +```ts +type IngestWarning = + | { code: "asset_skipped_unsupported_kind"; ... } + | { code: "asset_skipped_extraction_disabled"; ... } + | { code: "asset_skipped_pdf_llm_extraction_disabled"; ... } + | { code: "asset_skipped_image_no_multimodal_and_no_caption"; ... } + | { code: "asset_skipped_pdf_empty_extraction"; ... } + | { code: "asset_skipped_extraction_empty"; ... } + | { code: "asset_processing_error"; stage: "fetch" | "extract" | "embed" | "unknown"; ... 
}; +``` + +### IngestPlanResult + +Returned by `planIngest()` for dry-run: + +```ts +type IngestPlanResult = { + documentId: string; + sourceId: string; + assets: AssetProcessingPlanItem[]; + warnings: IngestWarning[]; +}; + +type AssetProcessingPlanItem = + | { status: "will_process"; extractors: string[]; assetId; kind; uri } + | { status: "will_skip"; reason: string; assetId; kind; uri }; +``` + +## Retrieve Types + +### RetrieveInput + +```ts +type RetrieveInput = { + query: string; // Search query + topK?: number; // Results to return (default: 8) + scope?: RetrieveScope; // Filter results +}; + +type RetrieveScope = { + sourceId?: string; // Prefix filter +}; +``` + +### RetrieveResult + +```ts +type RetrieveResult = { + chunks: Array; + embeddingModel: string; + durations: { + totalMs: number; + embeddingMs: number; + retrievalMs: number; + }; +}; +``` + +## Rerank Types + +### RerankInput + +```ts +type RerankInput = { + query: string; + candidates: RerankCandidate[]; + topK?: number; + onMissingReranker?: RerankPolicy; // "throw" | "skip" + onMissingText?: RerankPolicy; + resolveText?: (candidate: RerankCandidate) => string | Promise; +}; + +type RerankCandidate = Chunk & { score: number }; +type RerankPolicy = "throw" | "skip"; +``` + +### RerankResult + +```ts +type RerankResult = { + chunks: RerankCandidate[]; + ranking: RerankRankingItem[]; + meta: { + rerankerName: string; + model?: string; + }; + durations: { + rerankMs: number; + totalMs: number; + }; + warnings: string[]; +}; + +type RerankRankingItem = { + index: number; // Original index in candidates + rerankScore?: number; // Score from reranker +}; +``` + +## Delete Types + +### DeleteInput + +```ts +type DeleteInput = + | { sourceId: string; sourceIdPrefix?: never } // Exact match + | { sourceId?: never; sourceIdPrefix: string }; // Prefix match +``` + +## Asset Types + +### AssetInput + +```ts +type AssetInput = { + assetId: string; // Stable ID within document + kind: AssetKind; // "image" | "pdf" | "audio" | "video" | "file" + data: AssetData; // URL or bytes + uri?: string; // Display URI (for debugging) + text?: string; // Known caption/alt text + metadata?: Metadata; // Per-asset metadata +}; + +type AssetKind = "image" | "pdf" | "audio" | "video" | "file"; +``` + +### AssetData + +```ts +type AssetData = + | { + kind: "url"; + url: string; + headers?: Record; + mediaType?: string; + filename?: string; + } + | { + kind: "bytes"; + bytes: Uint8Array; + mediaType: string; + filename?: string; + }; +``` + +### AssetProcessingConfig + +Complete asset processing configuration: + +```ts +type AssetProcessingConfig = { + onUnsupportedAsset: "skip" | "fail"; + onError: "skip" | "fail"; + concurrency: number; + hooks?: { onEvent?: (event: AssetProcessingEvent) => void }; + fetch: AssetFetchConfig; + pdf: { + textLayer: PdfTextLayerConfig; + llmExtraction: PdfLlmExtractionConfig; + ocr: PdfOcrConfig; + }; + image: { + ocr: ImageOcrConfig; + captionLlm: ImageCaptionLlmConfig; + }; + audio: { + transcription: AudioTranscriptionConfig; + }; + video: { + transcription: VideoTranscriptionConfig; + frames: VideoFramesConfig; + }; + file: { + text: FileTextConfig; + docx: FileDocxConfig; + pptx: FilePptxConfig; + xlsx: FileXlsxConfig; + }; +}; +``` + +### AssetFetchConfig + +```ts +type AssetFetchConfig = { + enabled: boolean; // Allow fetching from URLs + allowedHosts?: string[]; // Hostname allowlist (SSRF mitigation) + maxBytes: number; // Max file size + timeoutMs: number; // Fetch timeout + headers?: Record; +}; +``` + +## 
Extractor Types + +### AssetExtractor + +Interface for custom extractors: + +```ts +type AssetExtractor = { + name: string; + supports: (args: { asset: AssetInput; ctx: AssetExtractorContext }) => boolean; + extract: (args: { asset: AssetInput; ctx: AssetExtractorContext }) => Promise; +}; + +type AssetExtractorContext = { + sourceId: string; + documentId: string; + documentMetadata: Metadata; + assetProcessing: AssetProcessingConfig; +}; + +type AssetExtractorResult = { + texts: ExtractedTextItem[]; + skipped?: { code: string; message: string }; + metadata?: Metadata; + diagnostics?: { model?: string; tokens?: number; seconds?: number }; +}; + +type ExtractedTextItem = { + label: string; // e.g., "fulltext", "ocr", "transcript" + content: string; // Extracted text + confidence?: number; + pageRange?: [number, number]; + timeRangeSec?: [number, number]; +}; +``` + +## Reranker Types + +### Reranker Interface + +```ts +type Reranker = { + name: string; + rerank: (args: RerankerRerankArgs) => Promise; +}; + +type RerankerRerankArgs = { + query: string; + documents: string[]; +}; + +type RerankerRerankResult = { + order: number[]; // Indices sorted by relevance + scores?: number[]; // Optional relevance scores + model?: string; // Model identifier +}; +``` + +## Store Types + +### VectorStore Interface + +```ts +type VectorStore = { + upsert: (chunks: Chunk[]) => Promise<{ documentId: string }>; + query: (params: { + embedding: number[]; + topK: number; + scope?: RetrieveScope; + }) => Promise>; + delete: (input: DeleteInput) => Promise; +}; +``` + +## Embedding Provider Types + +### EmbeddingProvider Interface + +```ts +type EmbeddingProvider = { + name: string; + dimensions?: number; + embed: (input: EmbeddingInput) => Promise; + embedMany?: (inputs: EmbeddingInput[]) => Promise; + embedImage?: (input: ImageEmbeddingInput) => Promise; +}; + +type EmbeddingInput = { + text: string; + metadata: Metadata; + position: number; + sourceId: string; + documentId: string; +}; + +type ImageEmbeddingInput = { + data: Uint8Array | string; + mediaType?: string; + metadata: Metadata; + position: number; + sourceId: string; + documentId: string; + assetId?: string; +}; +``` + +## Configuration Types + +### DefineUnragConfigInput + +```ts +type DefineUnragConfigInput = { + defaults?: UnragDefaultsConfig; + engine?: UnragEngineConfig; + embedding: UnragEmbeddingConfig; +}; + +type UnragDefaultsConfig = { + chunking?: Partial; + embedding?: Partial; + retrieval?: { topK?: number }; +}; + +type EmbeddingProcessingConfig = { + concurrency: number; // Max concurrent embedding requests + batchSize: number; // Chunks per embedMany batch +}; +``` + +### UnragEmbeddingConfig + +```ts +type UnragEmbeddingConfig = + | { provider: "ai"; config?: AiEmbeddingConfig } + | { provider: "openai"; config?: OpenAiEmbeddingConfig } + | { provider: "google"; config?: GoogleEmbeddingConfig } + | { provider: "openrouter"; config?: OpenRouterEmbeddingConfig } + | { provider: "azure"; config?: AzureEmbeddingConfig } + | { provider: "vertex"; config?: VertexEmbeddingConfig } + | { provider: "bedrock"; config?: BedrockEmbeddingConfig } + | { provider: "cohere"; config?: CohereEmbeddingConfig } + | { provider: "mistral"; config?: MistralEmbeddingConfig } + | { provider: "together"; config?: TogetherEmbeddingConfig } + | { provider: "ollama"; config?: OllamaEmbeddingConfig } + | { provider: "voyage"; config?: VoyageEmbeddingConfig } + | { provider: "custom"; create: () => EmbeddingProvider }; +``` + +### ContextEngineConfig + 
+Low-level engine configuration: + +```ts +type ContextEngineConfig = { + embedding: EmbeddingProvider; + store: VectorStore; + defaults?: Partial; + chunker?: Chunker; + idGenerator?: () => string; + extractors?: AssetExtractor[]; + reranker?: Reranker; + storage?: Partial; + assetProcessing?: DeepPartial; + embeddingProcessing?: DeepPartial; +}; + +type ContentStorageConfig = { + storeChunkContent: boolean; + storeDocumentContent: boolean; +}; +``` + +## Connector Types + +### ConnectorStream + +Async iterable that emits events: + +```ts +type ConnectorStreamEvent = + | { type: "upsert"; sourceId: string; content: string; metadata?: Metadata; assets?: AssetInput[] } + | { type: "delete"; sourceId?: string; sourceIdPrefix?: string } + | { type: "progress"; message: string; progress?: number } + | { type: "warning"; message: string } + | { type: "checkpoint"; checkpoint: TCheckpoint }; +``` + +### RunConnectorStreamOptions + +```ts +type RunConnectorStreamOptions = { + stream: AsyncIterable>; + onProgress?: (event: { type: string; message?: string }) => void; + signal?: AbortSignal; + checkpoint?: TCheckpoint; +}; + +type RunConnectorStreamResult = { + ingestCount: number; + deleteCount: number; + warnings: string[]; + checkpoint?: TCheckpoint; +}; +``` diff --git a/skills/unrag/references/batteries.md b/skills/unrag/references/batteries.md new file mode 100644 index 0000000..a92bbc8 --- /dev/null +++ b/skills/unrag/references/batteries.md @@ -0,0 +1,405 @@ +# Batteries + +Batteries are optional feature modules that extend Unrag's core functionality. Three batteries are available. + +## Available Batteries + +| Battery | Description | Status | +|---------|-------------|--------| +| **Reranker** | Second-stage reranking with Cohere rerank-v3.5 | Available | +| **Eval** | Retrieval evaluation harness with metrics and CI integration | Available | +| **Debug** | Real-time TUI debugger for RAG operations | Available | + +--- + +## Reranker Battery + +Improve retrieval precision with two-stage ranking. + +### Installation + +```bash +bunx unrag@latest add battery reranker +``` + +**Dependencies:** `ai`, `@ai-sdk/cohere` + +**Environment:** +```bash +COHERE_API_KEY="..." +``` + +### Configuration + +```ts +import { defineUnragConfig } from "./lib/unrag/core"; +import { createCohereReranker } from "./lib/unrag/rerank"; + +export const unrag = defineUnragConfig({ + embedding: { /* ... 
*/ }, + engine: { + reranker: createCohereReranker({ + model: "rerank-v3.5", // Default + maxDocuments: 1000, // Max batch size + }), + }, +}); +``` + +### Usage + +```ts +import { createUnragEngine } from "@unrag/config"; + +const engine = createUnragEngine(); + +// Step 1: Retrieve more candidates than needed +const retrieved = await engine.retrieve({ + query: "how do I reset my password?", + topK: 30, +}); + +// Step 2: Rerank to get the best results +const reranked = await engine.rerank({ + query: "how do I reset my password?", + candidates: retrieved.chunks, + topK: 8, +}); + +// Use reranked results +for (const chunk of reranked.chunks) { + console.log(chunk.content, chunk.score); +} +``` + +### Rerank Options + +```ts +const result = await engine.rerank({ + query: string, + candidates: RerankCandidate[], + topK?: number, // Results to return + onMissingReranker?: "throw" | "skip", // If no reranker configured + onMissingText?: "throw" | "skip", // If candidate has no content + resolveText?: (candidate) => string, // Fetch text externally +}); +``` + +### Custom Reranker + +```ts +import { createCustomReranker } from "./lib/unrag/rerank"; + +const myReranker = createCustomReranker({ + name: "my-reranker", + rerank: async ({ query, documents }) => { + const response = await myRerankerApi.rerank({ query, documents }); + return { + order: response.ranking.map(r => r.index), + scores: response.ranking.map(r => r.score), + model: "my-model-v1", + }; + }, +}); +``` + +### When to Use Reranking + +**Good candidates:** +- Complex queries where top result isn't always correct +- Need 10+ highly relevant results +- Quality matters more than latency + +**Skip reranking:** +- Simple, specific queries +- Latency-sensitive applications +- Vector search already gives good results + +--- + +## Eval Battery + +Deterministic retrieval evaluation with metrics, baselines, and CI integration. 
+ +### Installation + +```bash +bunx unrag@latest add battery eval +``` + +Creates: +- `.unrag/eval/datasets/sample.json` - Sample evaluation dataset +- `.unrag/eval/config.json` - Eval configuration +- `scripts/unrag-eval.ts` - Eval runner script + +### Creating Datasets + +```json +// .unrag/eval/datasets/my-dataset.json +{ + "version": 1, + "name": "My Eval Dataset", + "items": [ + { + "query": "How do I reset my password?", + "expected": ["doc:auth:password-reset", "doc:auth:account-recovery"], + "metadata": { "category": "auth" } + }, + { + "query": "What payment methods are supported?", + "expected": ["doc:billing:payment-methods"], + "metadata": { "category": "billing" } + } + ] +} +``` + +**Fields:** +- `query` - The search query +- `expected` - Array of sourceIds that should be retrieved +- `metadata` - Optional metadata for filtering/grouping + +### Running Evals + +```bash +# Run eval with default dataset +bun run eval + +# Run with specific dataset +bun run eval -- --dataset payments +``` + +Or programmatically: + +```ts +import { createEvalRunner } from "./lib/unrag/eval"; +import { createUnragEngine } from "@unrag/config"; + +const engine = createUnragEngine(); +const runner = createEvalRunner({ engine }); + +const results = await runner.run({ + dataset: "my-dataset", + topK: 10, +}); + +console.log(results.metrics); +// { hitAt1: 0.85, hitAt5: 0.95, recall: 0.90, mrr: 0.88 } +``` + +### Metrics + +| Metric | Description | +|--------|-------------| +| `hit@k` | Percentage of queries where at least one expected doc is in top K | +| `recall@k` | Average fraction of expected docs found in top K | +| `precision@k` | Average fraction of top K that are expected docs | +| `mrr` | Mean Reciprocal Rank - average of 1/rank of first relevant result | + +### Configuration + +```json +// .unrag/eval/config.json +{ + "version": 1, + "defaults": { + "topK": 10, + "metrics": ["hit@1", "hit@5", "recall@10", "mrr"] + }, + "thresholds": { + "hit@5": 0.90, + "recall@10": 0.85 + }, + "datasets": { + "default": "sample", + "all": ["sample", "payments", "auth"] + } +} +``` + +### CI Integration + +```yaml +# .github/workflows/eval.yml +name: RAG Evaluation + +on: + pull_request: + paths: + - 'lib/unrag/**' + - 'unrag.config.ts' + +jobs: + eval: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - uses: oven-sh/setup-bun@v1 + + - name: Install dependencies + run: bun install + + - name: Run evaluation + run: bun run eval --ci + env: + DATABASE_URL: ${{ secrets.DATABASE_URL }} + OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} +``` + +The `--ci` flag: +- Outputs machine-readable JSON +- Fails if thresholds not met +- Compares against baseline if present + +### Comparing Runs + +```bash +# Save baseline +bun run eval --save-baseline + +# Compare against baseline +bun run eval --compare +``` + +Output shows metric changes: +``` +Metric Baseline Current Change +hit@5 0.90 0.92 +0.02 ✓ +recall@10 0.85 0.83 -0.02 ✗ +mrr 0.88 0.89 +0.01 ✓ +``` + +--- + +## Debug Battery + +Real-time TUI debugger for RAG operations. + +### Installation + +```bash +bunx unrag@latest add battery debug +``` + +**Dependencies:** `ws` + +### Enabling Debug Mode + +```bash +# Set environment variable +UNRAG_DEBUG=true bun run dev + +# Or in .env +UNRAG_DEBUG=true +``` + +When enabled, the ContextEngine automatically starts a WebSocket server. + +### Launching the TUI + +```bash +bunx unrag@latest debug +``` + +Opens an interactive terminal UI with panels: + +1. **Ingest** - Watch ingestion operations in real-time +2. 
**Retrieve** - See retrieval queries and results +3. **Rerank** - Monitor reranking operations +4. **Doctor** - Health checks and diagnostics +5. **Query** - Interactive query testing +6. **Docs** - Browse indexed documents + +### Panels + +**Ingest Panel:** +- Shows sourceId, chunk count, duration +- Asset processing progress +- Warnings and errors + +**Retrieve Panel:** +- Query text and parameters +- Results with scores +- Timing breakdown + +**Rerank Panel:** +- Original vs reranked order +- Score comparisons +- Timing + +**Doctor Panel:** +- Embedding provider status +- Database connection +- Extractor configuration +- Index statistics + +**Query Panel:** +- Interactive query input +- Live results +- Scope filtering + +**Docs Panel:** +- Browse by sourceId +- View chunks and metadata +- Delete documents + +### Programmatic Usage + +```ts +import { startDebugServer, registerUnragDebug } from "./lib/unrag/debug"; + +// Start server manually +await startDebugServer({ port: 9229 }); + +// Register engine for interactive features +registerUnragDebug({ + engine, + storeInspector: store.inspector, // Optional: for Docs panel +}); +``` + +### Configuration + +```ts +// Environment variables +UNRAG_DEBUG=true // Enable debug mode +UNRAG_DEBUG_PORT=9229 // WebSocket port (default: 9229) +``` + +### Security Note + +The debug server should **never** be enabled in production. It exposes: +- All indexed content +- Query patterns +- Internal metrics + +Use environment-based enabling: + +```ts +// Only in development +if (process.env.NODE_ENV !== "production") { + process.env.UNRAG_DEBUG = "true"; +} +``` + +--- + +## Battery Installation Summary + +```bash +# Install all batteries +bunx unrag@latest add battery reranker +bunx unrag@latest add battery eval +bunx unrag@latest add battery debug + +# Check installed batteries +cat unrag.json | jq '.batteries' +``` + +Installed batteries are tracked in `unrag.json`: + +```json +{ + "batteries": ["reranker", "eval", "debug"] +} +``` diff --git a/skills/unrag/references/cli-commands.md b/skills/unrag/references/cli-commands.md new file mode 100644 index 0000000..ae8b707 --- /dev/null +++ b/skills/unrag/references/cli-commands.md @@ -0,0 +1,403 @@ +# CLI Commands + +The Unrag CLI manages installation, updates, and debugging of your RAG system. + +## Installation + +```bash +# Run without installing +bunx unrag@latest + +# Or install globally +bun add -g unrag +unrag +``` + +--- + +## unrag init + +Initialize Unrag in your project. Copies core files, configures store adapter, selects embedding provider. 
+ +```bash +bunx unrag@latest init [options] +``` + +### Options + +| Option | Description | +|--------|-------------| +| `--yes`, `-y` | Non-interactive mode with defaults | +| `--dir ` | Install directory (default: `lib/unrag`) | +| `--store ` | Store adapter: `drizzle`, `prisma`, `raw-sql` | +| `--alias ` | Import alias (default: `@unrag`) | +| `--provider ` | Embedding provider | +| `--rich-media` | Enable rich media extraction | +| `--no-rich-media` | Disable rich media | +| `--extractors ` | Comma-separated extractor IDs | +| `--preset ` | Load configuration from preset URL | +| `--overwrite ` | `skip` or `force` existing files | +| `--no-install` | Skip dependency installation | +| `--quiet` | Suppress output | +| `--full` | Full scaffold (legacy mode) | +| `--with-docs` | Generate documentation file | + +### Examples + +```bash +# Interactive setup +bunx unrag@latest init + +# Non-interactive with options +bunx unrag@latest init --yes --store drizzle --provider openai + +# Enable rich media with specific extractors +bunx unrag@latest init --rich-media --extractors pdf-text-layer,file-text,image-ocr + +# From preset +bunx unrag@latest init --preset https://example.com/my-preset.json +``` + +### What It Creates + +``` +lib/unrag/ # Vendored source code +├── core/ # Core types and engine +├── store/ # Store adapter +├── embedding/ # Embedding provider +├── extractors/ # Asset extractors (if enabled) +└── ... + +unrag.config.ts # Configuration file +unrag.json # Metadata (version, installed modules) +``` + +### tsconfig.json Updates + +The CLI patches your tsconfig.json to add path aliases: + +```json +{ + "compilerOptions": { + "paths": { + "@unrag/*": ["./lib/unrag/*"], + "@unrag/config": ["./lib/unrag/config"] + } + } +} +``` + +--- + +## unrag add + +Add extractors, connectors, or batteries to an existing installation. + +```bash +bunx unrag@latest add +``` + +### Types + +| Type | Description | +|------|-------------| +| `extractor` | Asset extraction modules | +| `connector` | External service connectors | +| `battery` | Optional feature modules | + +### Examples + +```bash +# Add extractors +bunx unrag@latest add extractor pdf-text-layer +bunx unrag@latest add extractor pdf-llm +bunx unrag@latest add extractor image-ocr +bunx unrag@latest add extractor file-docx + +# Add connectors +bunx unrag@latest add connector notion +bunx unrag@latest add connector google-drive + +# Add batteries +bunx unrag@latest add battery reranker +bunx unrag@latest add battery eval +bunx unrag@latest add battery debug +``` + +### What It Does + +1. Copies module source files to your install directory +2. Adds required dependencies to package.json +3. Updates unrag.json with installed module +4. 
Installs dependencies (unless `--no-install`) + +### Available Modules + +**Extractors:** +- `pdf-text-layer` - PDF text layer extraction +- `pdf-llm` - LLM-based PDF extraction +- `pdf-ocr` - OCR for scanned PDFs (worker-only) +- `image-ocr` - Image text extraction via vision LLM +- `image-caption-llm` - Image captioning +- `audio-transcribe` - Audio transcription +- `video-transcribe` - Video audio transcription +- `video-frames` - Video frame analysis (worker-only) +- `file-text` - txt/md/json/csv extraction +- `file-docx` - Word document extraction +- `file-pptx` - PowerPoint extraction +- `file-xlsx` - Excel extraction + +**Connectors:** +- `notion` - Notion pages and databases +- `google-drive` - Google Drive files +- `onedrive` - Microsoft OneDrive +- `dropbox` - Dropbox files + +**Batteries:** +- `reranker` - Cohere reranking +- `eval` - Evaluation harness +- `debug` - Debug TUI + +--- + +## unrag upgrade + +Upgrade vendored source files to the latest version with three-way merge. + +```bash +bunx unrag@latest upgrade [options] +``` + +### Options + +| Option | Description | +|--------|-------------| +| `--force` | Overwrite local changes | +| `--dry-run` | Preview changes without applying | +| `--no-install` | Skip dependency installation | + +### How It Works + +1. **Detect current version** from unrag.json +2. **Download new version** from registry +3. **Three-way merge**: + - Original (what you installed) + - Current (your modifications) + - New (latest version) +4. **Apply changes** with conflict markers if needed +5. **Update dependencies** in package.json + +### Conflict Resolution + +When conflicts occur: + +```ts +<<<<<<< LOCAL +// Your modification +const chunkSize = 256; +======= +// New version +const chunkSize = 512; +>>>>>>> REMOTE +``` + +Resolve manually and remove conflict markers. + +### Examples + +```bash +# Interactive upgrade +bunx unrag@latest upgrade + +# Preview changes +bunx unrag@latest upgrade --dry-run + +# Force overwrite (lose local changes) +bunx unrag@latest upgrade --force +``` + +--- + +## unrag doctor + +Validate configuration and diagnose issues. + +```bash +bunx unrag@latest doctor [options] +``` + +### Checks Performed + +1. **Configuration** + - unrag.config.ts exists and is valid + - unrag.json is present and current version + +2. **Database** + - DATABASE_URL is set + - Connection successful + - pgvector extension enabled + - Required tables exist + - Indexes are present + +3. **Embedding Provider** + - API key is set + - Test embedding works + +4. **Store Adapter** + - Adapter matches configured type + - Schema is compatible + +5. **Extractors** + - Installed extractors are configured + - Required dependencies present + +### Output + +``` +Unrag Doctor +============ + +✓ Configuration valid +✓ Database connected +✓ pgvector extension enabled +✓ Tables exist (documents, chunks, embeddings) +✓ Embedding provider configured (openai) +✓ Test embedding successful +✓ Store adapter ready (drizzle) +✓ Extractors configured (3) + +All checks passed! +``` + +### Options + +| Option | Description | +|--------|-------------| +| `--json` | Output as JSON | +| `--fix` | Attempt to fix issues | + +--- + +## unrag debug + +Launch the interactive debug TUI. + +```bash +bunx unrag@latest debug [options] +``` + +### Requirements + +1. Debug battery installed (`add battery debug`) +2. Application running with `UNRAG_DEBUG=true` + +### Options + +| Option | Description | +|--------|-------------| +| `--port ` | WebSocket port (default: 9229) | +| `--host
` | WebSocket host (default: localhost) | + +### Panels + +- **Ingest** - Real-time ingestion monitoring +- **Retrieve** - Query and result inspection +- **Rerank** - Reranking operation details +- **Doctor** - Health checks +- **Query** - Interactive query testing +- **Docs** - Browse indexed documents + +### Keyboard Shortcuts + +| Key | Action | +|-----|--------| +| `Tab` | Switch panels | +| `j/k` | Navigate up/down | +| `Enter` | Select/expand | +| `q` | Quit | +| `?` | Help | + +--- + +## Environment Variables + +| Variable | Description | +|----------|-------------| +| `DATABASE_URL` | PostgreSQL connection string | +| `UNRAG_SKIP_INSTALL` | Skip dependency installation (=1) | +| `UNRAG_DEBUG` | Enable debug mode (=true) | +| `UNRAG_DEBUG_PORT` | Debug WebSocket port | + +Provider-specific variables are documented in [embedding-providers.md](./embedding-providers.md). + +--- + +## unrag.json + +Metadata file tracking your installation: + +```json +{ + "installDir": "lib/unrag", + "storeAdapter": "drizzle", + "aliasBase": "@unrag", + "embeddingProvider": "openai", + "version": 2, + "installedFrom": { + "unragVersion": "0.3.2" + }, + "scaffold": { + "mode": "slim", + "withDocs": false + }, + "connectors": ["notion"], + "extractors": ["pdf-text-layer", "file-text"], + "batteries": ["reranker", "debug"], + "managedFiles": [ + "lib/unrag/core/types.ts", + "lib/unrag/core/context-engine.ts", + "..." + ] +} +``` + +**Do not edit manually** - the CLI manages this file. + +--- + +## Troubleshooting CLI + +**Command not found:** +```bash +# Use bunx instead +bunx unrag@latest init + +# Or install globally +bun add -g unrag +``` + +**Permission errors:** +```bash +# Check file permissions +ls -la lib/unrag/ + +# Reset permissions +chmod -R 755 lib/unrag/ +``` + +**Dependency conflicts:** +```bash +# Clean install +rm -rf node_modules bun.lockb +bun install +``` + +**Version mismatch:** +```bash +# Check versions +cat unrag.json | jq '.installedFrom' +bunx unrag@latest --version + +# Upgrade to fix +bunx unrag@latest upgrade +``` diff --git a/skills/unrag/references/connectors.md b/skills/unrag/references/connectors.md new file mode 100644 index 0000000..d80e289 --- /dev/null +++ b/skills/unrag/references/connectors.md @@ -0,0 +1,448 @@ +# Connectors + +Connectors sync content from external services (Notion, Google Drive, OneDrive, Dropbox) into your RAG system. + +## Available Connectors + +| Connector | Status | Description | +|-----------|--------|-------------| +| **Notion** | Available | Pages, databases, blocks | +| **Google Drive** | Available | Docs, Sheets, folders | +| **OneDrive** | Available | Microsoft files, folders | +| **Dropbox** | Available | Files, folders | +| GitHub | Coming Soon | Repos, docs, issues | +| GitLab | Coming Soon | Repos, wiki pages | +| Slack | Coming Soon | Channels, threads | +| Discord | Coming Soon | Server channels | +| Linear | Coming Soon | Issues, projects | +| Microsoft Teams | Coming Soon | Channels, conversations | + +## Installation + +```bash +# Add a connector +bunx unrag@latest add connector notion +bunx unrag@latest add connector google-drive +``` + +--- + +## Notion Connector + +Sync pages, databases, and blocks from Notion workspaces. + +### Setup + +```bash +bunx unrag@latest add connector notion +``` + +**Dependencies:** `@notionhq/client` + +**Environment:** +```bash +NOTION_TOKEN="secret_..." # Internal integration token +``` + +### Creating an Integration + +1. Go to [Notion Integrations](https://www.notion.so/my-integrations) +2. 
Create a new integration +3. Copy the "Internal Integration Token" +4. Share pages/databases with the integration + +### Usage + +```ts +import { createNotionConnector } from "./lib/unrag/connectors/notion"; +import { createUnragEngine } from "@unrag/config"; + +const notion = createNotionConnector({ + auth: process.env.NOTION_TOKEN, +}); + +const engine = createUnragEngine(); + +// Sync specific pages +const stream = notion.syncPages({ + pageIds: ["page-id-1", "page-id-2"], +}); + +await engine.runConnectorStream({ stream }); + +// Sync a database +const dbStream = notion.syncDatabase({ + databaseId: "database-id", + filter: { property: "Status", status: { equals: "Published" } }, +}); + +await engine.runConnectorStream({ stream: dbStream }); +``` + +### Notion Connector Options + +```ts +// syncPages options +{ + pageIds: string[]; + includeChildren?: boolean; // Include child pages +} + +// syncDatabase options +{ + databaseId: string; + filter?: NotionFilter; // Notion API filter + sorts?: NotionSort[]; // Notion API sorts +} +``` + +--- + +## Google Drive Connector + +Sync Docs, Sheets, and folders from Google Drive. + +### Setup + +```bash +bunx unrag@latest add connector google-drive +``` + +**Dependencies:** `googleapis`, `google-auth-library` + +**Environment (Service Account):** +```bash +GOOGLE_SERVICE_ACCOUNT_JSON='{"type":"service_account",...}' +``` + +**Environment (OAuth):** +```bash +GOOGLE_CLIENT_ID="..." +GOOGLE_CLIENT_SECRET="..." +GOOGLE_REDIRECT_URI="http://localhost:3000/auth/callback" +``` + +### Usage + +```ts +import { createGoogleDriveConnector } from "./lib/unrag/connectors/google-drive"; +import { createUnragEngine } from "@unrag/config"; + +// With service account +const drive = createGoogleDriveConnector({ + auth: { + type: "service-account", + credentials: JSON.parse(process.env.GOOGLE_SERVICE_ACCOUNT_JSON!), + }, +}); + +// Or with OAuth tokens +const driveOAuth = createGoogleDriveConnector({ + auth: { + type: "oauth", + clientId: process.env.GOOGLE_CLIENT_ID!, + clientSecret: process.env.GOOGLE_CLIENT_SECRET!, + accessToken: userAccessToken, + refreshToken: userRefreshToken, + }, +}); + +const engine = createUnragEngine(); + +// Sync a folder +const stream = drive.syncFolder({ + folderId: "folder-id", + recursive: true, + mimeTypes: [ + "application/vnd.google-apps.document", + "application/vnd.google-apps.spreadsheet", + ], +}); + +await engine.runConnectorStream({ stream }); + +// Sync specific files +const fileStream = drive.syncFiles({ + fileIds: ["file-id-1", "file-id-2"], +}); + +await engine.runConnectorStream({ stream: fileStream }); +``` + +### Google Drive Connector Options + +```ts +// syncFolder options +{ + folderId: string; + recursive?: boolean; // Include subfolders + mimeTypes?: string[]; // Filter by MIME type + modifiedAfter?: Date; // Only modified after +} + +// syncFiles options +{ + fileIds: string[]; +} +``` + +--- + +## OneDrive Connector + +Sync files and folders from Microsoft OneDrive. + +### Setup + +```bash +bunx unrag@latest add connector onedrive +``` + +**Environment:** +```bash +AZURE_TENANT_ID="..." +AZURE_CLIENT_ID="..." +AZURE_CLIENT_SECRET="..." +``` + +### Azure AD App Setup + +1. Register app in [Azure Portal](https://portal.azure.com) +2. Add API permissions: `Files.Read`, `Files.Read.All` +3. Create client secret +4. 
Configure redirect URIs

### Usage

```ts
import { createOneDriveConnector } from "./lib/unrag/connectors/onedrive";
import { createUnragEngine } from "@unrag/config";

const onedrive = createOneDriveConnector({
  tenantId: process.env.AZURE_TENANT_ID!,
  clientId: process.env.AZURE_CLIENT_ID!,
  clientSecret: process.env.AZURE_CLIENT_SECRET!,
  accessToken: userAccessToken, // From OAuth flow
});

const engine = createUnragEngine();

// Sync a folder
const stream = onedrive.syncFolder({
  driveId: "me", // or specific drive ID
  path: "/Documents/Knowledge Base",
  recursive: true,
});

await engine.runConnectorStream({ stream });
```

---

## Dropbox Connector

Sync files and folders from Dropbox.

### Setup

```bash
bunx unrag@latest add connector dropbox
```

**Environment:**
```bash
DROPBOX_CLIENT_ID="..."
DROPBOX_CLIENT_SECRET="..."
```

### Usage

```ts
import { createDropboxConnector } from "./lib/unrag/connectors/dropbox";
import { createUnragEngine } from "@unrag/config";

const dropbox = createDropboxConnector({
  clientId: process.env.DROPBOX_CLIENT_ID!,
  clientSecret: process.env.DROPBOX_CLIENT_SECRET!,
  accessToken: userAccessToken, // From OAuth flow
});

const engine = createUnragEngine();

// Sync a folder
const stream = dropbox.syncFolder({
  path: "/Knowledge Base",
  recursive: true,
});

await engine.runConnectorStream({ stream });
```

---

## ConnectorStream Pattern

All connectors emit a `ConnectorStream` - an async iterable of events:

```ts
type ConnectorStreamEvent<TCheckpoint = unknown> =
  | { type: "upsert"; sourceId: string; content: string; metadata?: Metadata; assets?: AssetInput[] }
  | { type: "delete"; sourceId?: string; sourceIdPrefix?: string }
  | { type: "progress"; message: string; progress?: number }
  | { type: "warning"; message: string }
  | { type: "checkpoint"; checkpoint: TCheckpoint };
```

### Event Types

**`upsert`** - Ingest or update a document
```ts
{
  type: "upsert",
  sourceId: "notion:page:abc123",
  content: "Page content...",
  metadata: { title: "My Page", url: "https://..." },
  assets: [{ assetId: "img1", kind: "image", data: { kind: "url", url: "..."
} }],
}
```

**`delete`** - Remove a document
```ts
{ type: "delete", sourceId: "notion:page:abc123" }
// or delete by prefix
{ type: "delete", sourceIdPrefix: "notion:page:" }
```

**`progress`** - Progress update
```ts
{ type: "progress", message: "Syncing page 5 of 100", progress: 0.05 }
```

**`warning`** - Non-fatal issue
```ts
{ type: "warning", message: "Could not access page xyz" }
```

**`checkpoint`** - Resume point for incremental sync
```ts
{ type: "checkpoint", checkpoint: { cursor: "abc", timestamp: 1234567890 } }
```

---

## Using runConnectorStream

```ts
const result = await engine.runConnectorStream({
  stream,

  // Progress callback
  onProgress: (event) => {
    console.log(`[${event.type}] ${event.message || ""}`);
  },

  // Abort signal for cancellation
  signal: abortController.signal,

  // Resume from previous checkpoint
  checkpoint: savedCheckpoint,
});

// Result
console.log(`Ingested: ${result.ingestCount}`);
console.log(`Deleted: ${result.deleteCount}`);
console.log(`Warnings: ${result.warnings.length}`);

// Save checkpoint for next sync
await saveCheckpoint(result.checkpoint);
```

---

## Checkpoint-Based Resumption

Connectors support checkpoints for incremental sync and serverless resumption:

```ts
// First sync
const result1 = await engine.runConnectorStream({ stream: notion.syncDatabase({ databaseId }) });
await db.saveCheckpoint("notion-sync", result1.checkpoint);

// Later: resume from checkpoint
const savedCheckpoint = await db.loadCheckpoint("notion-sync");
const result2 = await engine.runConnectorStream({
  stream: notion.syncDatabase({
    databaseId,
    since: savedCheckpoint?.lastModified, // Only changes
  }),
  checkpoint: savedCheckpoint,
});
```

---

## Building Custom Connectors

Implement a connector as an async generator:

```ts
import type { ConnectorStreamEvent, AssetInput, Metadata } from "@unrag/types";

type MyCheckpoint = {
  cursor?: string;
  lastSync: number;
};

async function* myConnector(options: MyOptions): AsyncGenerator<ConnectorStreamEvent<MyCheckpoint>> {
  const client = createMyClient(options);

  yield { type: "progress", message: "Starting sync..."
}; + + const items = await client.listItems({ after: options.cursor }); + + for (const item of items) { + if (item.deleted) { + yield { type: "delete", sourceId: `my-source:${item.id}` }; + } else { + yield { + type: "upsert", + sourceId: `my-source:${item.id}`, + content: item.content, + metadata: { title: item.title, url: item.url }, + assets: item.attachments?.map(a => ({ + assetId: a.id, + kind: detectKind(a.mimeType), + data: { kind: "url", url: a.url, mediaType: a.mimeType }, + })), + }; + } + + // Emit checkpoints periodically + yield { + type: "checkpoint", + checkpoint: { cursor: item.id, lastSync: Date.now() }, + }; + } +} +``` + +--- + +## Source ID Conventions + +Connectors should use consistent sourceId patterns: + +``` +{connector}:{type}:{id} + +Examples: +- notion:page:abc123 +- notion:database:xyz789:row:123 +- gdrive:doc:1234567890 +- dropbox:file:/Documents/report.pdf +``` + +This enables: +- Prefix-based deletion (`sourceIdPrefix: "notion:"`) +- Scoped retrieval (`scope: { sourceId: "gdrive:" }`) +- Easy debugging and tracing diff --git a/skills/unrag/references/embedding-providers.md b/skills/unrag/references/embedding-providers.md new file mode 100644 index 0000000..b9cedf5 --- /dev/null +++ b/skills/unrag/references/embedding-providers.md @@ -0,0 +1,449 @@ +# Embedding Providers + +Unrag supports 12 embedding providers plus a custom option. Configure in `unrag.config.ts`: + +```ts +export const unrag = defineUnragConfig({ + embedding: { + provider: "openai", + config: { model: "text-embedding-3-small" }, + }, + // ... +}); +``` + +## Provider Overview + +| Provider | Default Model | Dimensions | Multimodal | +|----------|--------------|------------|------------| +| AI Gateway | `openai/text-embedding-3-small` | 1536 | No | +| OpenAI | `text-embedding-3-small` | 1536 | No | +| Google | `gemini-embedding-001` | 768 | No | +| Cohere | `embed-english-v3.0` | 1024 | No | +| Azure OpenAI | `text-embedding-3-small` | 1536 | No | +| AWS Bedrock | `amazon.titan-embed-text-v2:0` | 1024 | No | +| Voyage | `voyage-3.5-lite` | 1024 | Yes* | +| Mistral | `mistral-embed` | 1024 | No | +| Together | (varies) | (varies) | No | +| Ollama | `nomic-embed-text` | 768 | No | +| OpenRouter | (varies) | (varies) | No | +| Vertex AI | `text-embedding-004` | 768 | No | + +*Voyage supports multimodal with `voyage-multimodal-3` model. + +--- + +## 1. AI Gateway (Vercel AI SDK) + +Generic wrapper around Vercel AI SDK. Legacy default, kept for backwards compatibility. + +```ts +embedding: { + provider: "ai", + config: { + model: "openai/text-embedding-3-small", // provider/model format + timeoutMs: 15_000, + }, +}, +``` + +**Environment:** +```bash +AI_GATEWAY_API_KEY="..." +AI_GATEWAY_MODEL="openai/text-embedding-3-small" # optional +``` + +**Model string format:** `provider/model-name` (e.g., `openai/text-embedding-3-large`, `cohere/embed-english-v3.0`) + +--- + +## 2. OpenAI + +Recommended for most use cases. Best balance of quality and cost. + +```ts +embedding: { + provider: "openai", + config: { + model: "text-embedding-3-small", // or "text-embedding-3-large" + dimensions: 1536, // optional, can reduce for cost savings + user: "user-123", // optional, for usage tracking + timeoutMs: 15_000, + }, +}, +``` + +**Environment:** +```bash +OPENAI_API_KEY="sk-..." 
+OPENAI_EMBEDDING_MODEL="text-embedding-3-small" # optional override +``` + +**Models:** +- `text-embedding-3-small` - 1536 dimensions, good for most use cases +- `text-embedding-3-large` - 3072 dimensions, higher quality +- `text-embedding-ada-002` - Legacy, 1536 dimensions + +--- + +## 3. Google / Gemini + +Google's embedding models via Generative AI API. + +```ts +embedding: { + provider: "google", + config: { + model: "gemini-embedding-001", + outputDimensionality: 768, // optional + taskType: "RETRIEVAL_DOCUMENT", // or "RETRIEVAL_QUERY" + timeoutMs: 15_000, + }, +}, +``` + +**Environment:** +```bash +GOOGLE_GENERATIVE_AI_API_KEY="..." +GOOGLE_GENERATIVE_AI_EMBEDDING_MODEL="gemini-embedding-001" # optional +``` + +**Task Types:** +- `RETRIEVAL_DOCUMENT` - For indexing documents +- `RETRIEVAL_QUERY` - For search queries +- `SEMANTIC_SIMILARITY` - General similarity +- `CLASSIFICATION` - Classification tasks +- `CLUSTERING` - Clustering tasks + +--- + +## 4. Cohere + +High-quality embeddings with input type optimization. + +```ts +embedding: { + provider: "cohere", + config: { + model: "embed-english-v3.0", + inputType: "search_document", // or "search_query" + truncate: "END", // "NONE" | "START" | "END" + timeoutMs: 15_000, + }, +}, +``` + +**Environment:** +```bash +COHERE_API_KEY="..." +COHERE_EMBEDDING_MODEL="embed-english-v3.0" # optional +``` + +**Input Types:** +- `search_document` - For indexing (ingest) +- `search_query` - For retrieval queries +- `classification` - Classification tasks +- `clustering` - Clustering tasks + +**Models:** +- `embed-english-v3.0` - English, 1024 dims +- `embed-multilingual-v3.0` - 100+ languages + +--- + +## 5. Azure OpenAI + +OpenAI models via Azure. + +```ts +embedding: { + provider: "azure", + config: { + model: "text-embedding-3-small", + dimensions: 1536, + user: "user-123", + timeoutMs: 15_000, + }, +}, +``` + +**Environment:** +```bash +AZURE_OPENAI_API_KEY="..." +AZURE_RESOURCE_NAME="your-resource-name" +AZURE_EMBEDDING_MODEL="text-embedding-3-small" # optional +``` + +--- + +## 6. AWS Bedrock + +Amazon's managed AI service. + +```ts +embedding: { + provider: "bedrock", + config: { + model: "amazon.titan-embed-text-v2:0", + dimensions: 1024, + normalize: true, + timeoutMs: 15_000, + }, +}, +``` + +**Environment:** +```bash +AWS_REGION="us-east-1" +AWS_ACCESS_KEY_ID="..." # when outside AWS +AWS_SECRET_ACCESS_KEY="..." +BEDROCK_EMBEDDING_MODEL="amazon.titan-embed-text-v2:0" # optional +``` + +**Models:** +- `amazon.titan-embed-text-v2:0` - Latest Titan model +- `amazon.titan-embed-text-v1` - Original Titan + +--- + +## 7. Voyage AI + +High-quality embeddings with multimodal support. + +```ts +// Text embeddings +embedding: { + provider: "voyage", + config: { + type: "text", + model: "voyage-3.5-lite", + timeoutMs: 15_000, + }, +}, + +// Multimodal embeddings (text + images) +embedding: { + provider: "voyage", + config: { + type: "multimodal", + model: "voyage-multimodal-3", + timeoutMs: 30_000, + }, +}, +``` + +**Environment:** +```bash +VOYAGE_API_KEY="..." +VOYAGE_MODEL="voyage-3.5-lite" # optional +``` + +**Models:** +- `voyage-3.5-lite` - Fast, cost-effective +- `voyage-3` - Higher quality +- `voyage-multimodal-3` - Text and images + +--- + +## 8. Mistral + +Mistral's embedding model. + +```ts +embedding: { + provider: "mistral", + config: { + model: "mistral-embed", + timeoutMs: 15_000, + }, +}, +``` + +**Environment:** +```bash +MISTRAL_API_KEY="..." +MISTRAL_EMBEDDING_MODEL="mistral-embed" # optional +``` + +--- + +## 9. 
Together AI + +Open-source models on Together's infrastructure. + +```ts +embedding: { + provider: "together", + config: { + model: "togethercomputer/m2-bert-80M-2k-retrieval", + timeoutMs: 15_000, + }, +}, +``` + +**Environment:** +```bash +TOGETHER_AI_API_KEY="..." +TOGETHER_AI_EMBEDDING_MODEL="togethercomputer/m2-bert-80M-2k-retrieval" # optional +``` + +--- + +## 10. Ollama (Local) + +Run embeddings locally with Ollama. + +```ts +embedding: { + provider: "ollama", + config: { + model: "nomic-embed-text", + baseURL: "http://localhost:11434", // optional + headers: {}, // optional + timeoutMs: 30_000, + }, +}, +``` + +**Environment:** +```bash +OLLAMA_EMBEDDING_MODEL="nomic-embed-text" # optional +``` + +**Setup:** +```bash +# Install Ollama and pull model +ollama pull nomic-embed-text +``` + +**Popular Models:** +- `nomic-embed-text` - General purpose +- `mxbai-embed-large` - Larger, higher quality +- `all-minilm` - Smaller, faster + +--- + +## 11. OpenRouter + +Multi-model gateway. + +```ts +embedding: { + provider: "openrouter", + config: { + model: "openai/text-embedding-3-small", + apiKey: "...", // optional, uses env var + baseURL: "https://openrouter.ai/api/v1", + referer: "https://your-app.com", + title: "Your App Name", + timeoutMs: 15_000, + }, +}, +``` + +**Environment:** +```bash +OPENROUTER_API_KEY="..." +OPENROUTER_EMBEDDING_MODEL="openai/text-embedding-3-small" # optional +``` + +--- + +## 12. Vertex AI (Google Cloud) + +Google's enterprise AI platform. + +```ts +embedding: { + provider: "vertex", + config: { + model: "text-embedding-004", + outputDimensionality: 768, + taskType: "RETRIEVAL_DOCUMENT", + title: "Document title", // optional, improves quality + autoTruncate: true, + timeoutMs: 15_000, + }, +}, +``` + +**Environment:** +```bash +GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json" # when outside GCP +GOOGLE_VERTEX_EMBEDDING_MODEL="text-embedding-004" # optional +``` + +--- + +## 13. 
Custom Provider

Bring your own embedding implementation:

```ts
embedding: {
  provider: "custom",
  create: () => ({
    name: "my-embeddings",
    dimensions: 768,
    embed: async (input) => {
      // Your embedding logic
      return await myEmbeddingService.embed(input.text);
    },
    embedMany: async (inputs) => {
      // Optional batch support
      return await myEmbeddingService.embedBatch(inputs.map(i => i.text));
    },
    embedImage: async (input) => {
      // Optional multimodal support
      return await myEmbeddingService.embedImage(input.data);
    },
  }),
},
```

### EmbeddingProvider Interface

```ts
type EmbeddingProvider = {
  name: string;
  dimensions?: number;
  embed: (input: EmbeddingInput) => Promise<number[]>;
  embedMany?: (inputs: EmbeddingInput[]) => Promise<number[][]>;
  embedImage?: (input: ImageEmbeddingInput) => Promise<number[]>;
};
```

## Choosing a Provider

**Best for most use cases:** OpenAI `text-embedding-3-small`
- Good quality, reasonable cost, fast

**Budget-conscious:** Ollama (local) or Together AI
- Free (Ollama) or cheap (Together)

**Enterprise/compliance:** Azure OpenAI or Vertex AI
- Enterprise features, data residency

**Multilingual:** Cohere `embed-multilingual-v3.0`
- 100+ languages

**Multimodal (images + text):** Voyage `voyage-multimodal-3`
- Unified text and image embeddings

## Performance Tuning

Configure concurrency and batch size in defaults:

```ts
export const unrag = defineUnragConfig({
  defaults: {
    embedding: {
      concurrency: 4, // Concurrent API calls
      batchSize: 100, // Chunks per embedMany call
    },
  },
  embedding: { /* ... */ },
});
```

Higher values improve throughput but may hit rate limits.
diff --git a/skills/unrag/references/extractors.md b/skills/unrag/references/extractors.md
new file mode 100644
index 0000000..76494ee
--- /dev/null
+++ b/skills/unrag/references/extractors.md
@@ -0,0 +1,540 @@
# Extractors

Extractors convert rich media assets (PDFs, images, audio, video, files) into text for embedding. Unrag includes 12 built-in extractors.

## Overview

| Extractor | Group | Description | Default | Worker Only |
|-----------|-------|-------------|---------|-------------|
| `pdf-text-layer` | PDF | Fast text layer extraction | Yes | No |
| `pdf-llm` | PDF | LLM-based extraction | No | No |
| `pdf-ocr` | PDF | OCR scanned PDFs | No | Yes |
| `image-ocr` | Image | Extract text via vision LLM | No | No |
| `image-caption-llm` | Image | Generate captions | No | No |
| `audio-transcribe` | Audio | Whisper transcription | No | No |
| `video-transcribe` | Video | Transcribe audio track | No | No |
| `video-frames` | Video | Sample and analyze frames | No | Yes |
| `file-text` | Files | txt/md/json/csv | Yes | No |
| `file-docx` | Files | Word documents | No | No |
| `file-pptx` | Files | PowerPoint slides | No | No |
| `file-xlsx` | Files | Excel spreadsheets | No | No |

## Installation

```bash
# During init, select extractors interactively
bunx unrag@latest init --rich-media

# Or add specific extractors later
bunx unrag@latest add extractor pdf-text-layer
bunx unrag@latest add extractor image-ocr
```

---

## PDF Extractors

### pdf-text-layer (Recommended)

Fast, cheap extraction using the built-in PDF text layer. Works well for digital PDFs but not scanned documents.
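For orientation, here is a minimal ingest call that routes a digital PDF through this extractor. The asset shape mirrors the examples elsewhere in these docs; the file path and IDs are illustrative:

```ts
import { readFile } from "fs/promises";
import { createUnragEngine } from "@unrag/config";

// Illustrative: ingest a digital PDF so pdf-text-layer can pull its text layer
const engine = createUnragEngine();
const bytes = new Uint8Array(await readFile("./handbook.pdf"));

await engine.ingest({
  sourceId: "docs:handbook",
  content: "", // text is supplied by the extracted PDF text layer
  assets: [
    {
      assetId: "handbook-pdf",
      kind: "pdf",
      data: { kind: "bytes", bytes, mediaType: "application/pdf", filename: "handbook.pdf" },
    },
  ],
});
```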
+ +```bash +bunx unrag@latest add extractor pdf-text-layer +``` + +**Dependencies:** `pdfjs-dist` + +**Configuration:** + +```ts +// In assetProcessing config +pdf: { + textLayer: { + enabled: true, + maxBytes: 50_000_000, // 50MB max + maxOutputChars: 500_000, + minChars: 100, // Minimum chars to accept result + maxPages: 500, // Optional page limit + }, +}, +``` + +**When to use:** +- Digital PDFs with selectable text +- High volume processing (cheap and fast) +- As first-pass before falling back to LLM/OCR + +--- + +### pdf-llm + +LLM-based PDF extraction. Higher quality but more expensive. + +```bash +bunx unrag@latest add extractor pdf-llm +``` + +**Dependencies:** `ai` + +**Configuration:** + +```ts +pdf: { + llmExtraction: { + enabled: true, + model: "google/gemini-2.0-flash", // Must support file inputs + prompt: "Extract all text content from this PDF faithfully...", + timeoutMs: 120_000, + maxBytes: 20_000_000, // 20MB max + maxOutputChars: 500_000, + }, +}, +``` + +**When to use:** +- Complex layouts (tables, multi-column) +- When text layer extraction fails +- Higher accuracy requirements + +--- + +### pdf-ocr (Worker Only) + +OCR for scanned PDFs. Requires native binaries (Poppler, Tesseract). + +```bash +bunx unrag@latest add extractor pdf-ocr +``` + +**Dependencies:** System binaries (not npm packages) +- `pdftoppm` (Poppler) +- `tesseract` + +**Configuration:** + +```ts +pdf: { + ocr: { + enabled: true, + maxBytes: 50_000_000, + maxOutputChars: 500_000, + minChars: 50, + maxPages: 100, + pdftoppmPath: "/usr/bin/pdftoppm", // Optional + tesseractPath: "/usr/bin/tesseract", + dpi: 300, // Higher = better OCR + lang: "eng", // Tesseract language + }, +}, +``` + +**When to use:** +- Scanned documents +- Image-only PDFs +- Worker/background job environments + +--- + +## Image Extractors + +### image-ocr + +Extract text from images using a vision-capable LLM. + +```bash +bunx unrag@latest add extractor image-ocr +``` + +**Dependencies:** `ai` + +**Configuration:** + +```ts +image: { + ocr: { + enabled: true, + model: "openai/gpt-4o-mini", // Vision-capable model + prompt: "Extract all visible text from this image...", + timeoutMs: 30_000, + maxBytes: 10_000_000, + maxOutputChars: 50_000, + }, +}, +``` + +**Supported formats:** jpg, png, webp, gif + +--- + +### image-caption-llm + +Generate descriptive captions for images. + +```bash +bunx unrag@latest add extractor image-caption-llm +``` + +**Dependencies:** `ai` + +**Configuration:** + +```ts +image: { + captionLlm: { + enabled: true, + model: "openai/gpt-4o-mini", + prompt: "Describe this image in detail...", + timeoutMs: 30_000, + maxBytes: 10_000_000, + maxOutputChars: 5_000, + }, +}, +``` + +--- + +## Audio Extractors + +### audio-transcribe + +Speech-to-text transcription using Whisper. + +```bash +bunx unrag@latest add extractor audio-transcribe +``` + +**Dependencies:** `ai` + +**Configuration:** + +```ts +audio: { + transcription: { + enabled: true, + model: "openai/whisper-1", + timeoutMs: 300_000, // 5 minutes + maxBytes: 100_000_000, // 100MB + }, +}, +``` + +**Supported formats:** mp3, wav, ogg, m4a + +--- + +## Video Extractors + +### video-transcribe + +Transcribe video audio track. 
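As a rough sketch of how a video reaches this extractor, the asset is passed on ingest much like the PDF and image examples; the `"video"` kind string and the URL here are assumptions for illustration:

```ts
import { createUnragEngine } from "@unrag/config";

// Illustrative: hand the engine a hosted video; the transcription extractor
// (when enabled) turns its audio track into text for embedding.
const engine = createUnragEngine();

await engine.ingest({
  sourceId: "videos:onboarding-intro",
  content: "Onboarding intro video", // short description as base content
  assets: [
    {
      assetId: "video-main",
      kind: "video", // assumed kind string, mirroring the "pdf"/"image" examples
      data: { kind: "url", url: "https://example.com/videos/onboarding.mp4" },
    },
  ],
});
```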
+ +```bash +bunx unrag@latest add extractor video-transcribe +``` + +**Dependencies:** `ai` + +**Configuration:** + +```ts +video: { + transcription: { + enabled: true, + model: "openai/whisper-1", + timeoutMs: 600_000, // 10 minutes + maxBytes: 500_000_000, // 500MB + }, +}, +``` + +**Supported formats:** mp4, webm, mov + +--- + +### video-frames (Worker Only) + +Sample frames and analyze with vision LLM. Requires ffmpeg. + +```bash +bunx unrag@latest add extractor video-frames +``` + +**Dependencies:** `ai`, ffmpeg (system binary) + +**Configuration:** + +```ts +video: { + frames: { + enabled: true, + sampleFps: 0.5, // Sample every 2 seconds + maxFrames: 30, + ffmpegPath: "/usr/bin/ffmpeg", + maxBytes: 500_000_000, + model: "openai/gpt-4o-mini", + prompt: "Describe what you see in this video frame...", + timeoutMs: 30_000, // Per frame + maxOutputChars: 100_000, + }, +}, +``` + +--- + +## File Extractors + +### file-text (Recommended) + +Extract text from common text-based files. + +```bash +bunx unrag@latest add extractor file-text +``` + +**Dependencies:** None + +**Configuration:** + +```ts +file: { + text: { + enabled: true, + maxBytes: 10_000_000, + maxOutputChars: 500_000, + minChars: 10, + }, +}, +``` + +**Supported formats:** txt, md, json, csv, html + +--- + +### file-docx + +Extract text from Word documents. + +```bash +bunx unrag@latest add extractor file-docx +``` + +**Dependencies:** `mammoth` + +**Configuration:** + +```ts +file: { + docx: { + enabled: true, + maxBytes: 50_000_000, + maxOutputChars: 500_000, + minChars: 10, + }, +}, +``` + +--- + +### file-pptx + +Extract text from PowerPoint slides. + +```bash +bunx unrag@latest add extractor file-pptx +``` + +**Dependencies:** `jszip` + +**Configuration:** + +```ts +file: { + pptx: { + enabled: true, + maxBytes: 100_000_000, + maxOutputChars: 500_000, + minChars: 10, + }, +}, +``` + +--- + +### file-xlsx + +Extract tables from Excel spreadsheets. + +```bash +bunx unrag@latest add extractor file-xlsx +``` + +**Dependencies:** `xlsx` + +**Configuration:** + +```ts +file: { + xlsx: { + enabled: true, + maxBytes: 50_000_000, + maxOutputChars: 500_000, + minChars: 10, + }, +}, +``` + +--- + +## Wiring Extractors + +After installation, import and configure in `unrag.config.ts`: + +```ts +import { defineUnragConfig } from "./lib/unrag/core"; +import { createPdfTextLayerExtractor } from "./lib/unrag/extractors/pdf-text-layer"; +import { createFileTextExtractor } from "./lib/unrag/extractors/file-text"; +import { createImageOcrExtractor } from "./lib/unrag/extractors/image-ocr"; + +export const unrag = defineUnragConfig({ + embedding: { /* ... */ }, + engine: { + extractors: [ + createPdfTextLayerExtractor(), + createFileTextExtractor(), + createImageOcrExtractor(), + ], + assetProcessing: { + onUnsupportedAsset: "skip", + onError: "skip", + concurrency: 2, + fetch: { + enabled: true, + maxBytes: 50_000_000, + timeoutMs: 30_000, + }, + pdf: { + textLayer: { enabled: true, maxBytes: 50_000_000, maxOutputChars: 500_000, minChars: 100 }, + llmExtraction: { enabled: false, /* ... */ }, + ocr: { enabled: false, /* ... */ }, + }, + image: { + ocr: { enabled: true, model: "openai/gpt-4o-mini", /* ... */ }, + captionLlm: { enabled: false, /* ... */ }, + }, + // ... 
other asset types
    },
  },
});
```

---

## Custom Extractors

Implement the `AssetExtractor` interface:

```ts
import type { AssetExtractor, AssetInput, AssetExtractorContext, AssetExtractorResult } from "./types";

export function createMyExtractor(): AssetExtractor {
  return {
    name: "my-extractor",

    supports({ asset, ctx }) {
      // Return true if this extractor can handle the asset
      return asset.kind === "file" && asset.data.mediaType === "application/my-format";
    },

    async extract({ asset, ctx }): Promise<AssetExtractorResult> {
      // Fetch bytes if needed
      const bytes = asset.data.kind === "bytes"
        ? asset.data.bytes
        : await fetchAssetBytes(asset.data.url);

      // Extract text
      const text = await myExtractionLogic(bytes);

      return {
        texts: [
          {
            label: "fulltext",
            content: text,
            confidence: 0.95,
          },
        ],
        metadata: {
          extractedBy: "my-extractor",
        },
      };
    },
  };
}
```

### AssetExtractorResult

```ts
type AssetExtractorResult = {
  texts: ExtractedTextItem[];
  skipped?: { code: string; message: string };
  metadata?: Metadata;
  diagnostics?: { model?: string; tokens?: number; seconds?: number };
};

type ExtractedTextItem = {
  label: string; // "fulltext", "ocr", "transcript", etc.
  content: string; // Extracted text
  confidence?: number; // 0-1 confidence score
  pageRange?: [number, number];
  timeRangeSec?: [number, number];
};
```

---

## Extractor Fallback Chain

Extractors are tried in order. First successful extraction wins:

```ts
extractors: [
  createPdfTextLayerExtractor(), // Try text layer first (fast)
  createPdfLlmExtractor(), // Fall back to LLM
  createPdfOcrExtractor(), // Last resort: OCR
],
```

If an extractor returns `{ texts: [], skipped: { code, message } }`, the next extractor is tried.

---

## Asset Processing Events

Monitor extraction with hooks:

```ts
assetProcessing: {
  hooks: {
    onEvent: (event) => {
      switch (event.type) {
        case "asset:start":
          console.log(`Processing ${event.assetKind}: ${event.assetId}`);
          break;
        case "extractor:success":
          console.log(`Extracted ${event.textItemCount} items in ${event.durationMs}ms`);
          break;
        case "extractor:error":
          console.error(`Extraction failed: ${event.errorMessage}`);
          break;
      }
    },
  },
},
```

Event types:
- `asset:start` - Asset processing started
- `asset:skipped` - Asset skipped (with warning)
- `extractor:start` - Extractor attempt started
- `extractor:success` - Extractor succeeded
- `extractor:error` - Extractor failed
diff --git a/skills/unrag/references/patterns.md b/skills/unrag/references/patterns.md
new file mode 100644
index 0000000..ca9d32a
--- /dev/null
+++ b/skills/unrag/references/patterns.md
@@ -0,0 +1,639 @@
# Common Patterns and Recipes

Practical patterns for building with Unrag.

## Search Endpoint

### Basic Next.js Route Handler

```ts
// app/api/search/route.ts
import { createUnragEngine } from "@unrag/config";

export async function GET(request: Request) {
  const { searchParams } = new URL(request.url);
  const query = searchParams.get("q") ??
""; + + if (!query.trim()) { + return Response.json({ error: "Missing query" }, { status: 400 }); + } + + const engine = createUnragEngine(); + const result = await engine.retrieve({ query, topK: 8 }); + + return Response.json({ + results: result.chunks.map((chunk) => ({ + id: chunk.id, + content: chunk.content, + source: chunk.sourceId, + score: chunk.score, + metadata: chunk.metadata, + })), + }); +} +``` + +### With Reranking + +```ts +// app/api/search/route.ts +import { createUnragEngine } from "@unrag/config"; + +export async function GET(request: Request) { + const { searchParams } = new URL(request.url); + const query = searchParams.get("q") ?? ""; + + if (!query.trim()) { + return Response.json({ error: "Missing query" }, { status: 400 }); + } + + const engine = createUnragEngine(); + + // Retrieve more candidates + const retrieved = await engine.retrieve({ query, topK: 30 }); + + // Rerank to top results + const reranked = await engine.rerank({ + query, + candidates: retrieved.chunks, + topK: 8, + onMissingReranker: "skip", // Graceful fallback + }); + + return Response.json({ + results: reranked.chunks.map((chunk) => ({ + id: chunk.id, + content: chunk.content, + source: chunk.sourceId, + score: chunk.score, + })), + meta: { + reranked: reranked.meta.rerankerName !== "none", + timings: { + retrieveMs: retrieved.durations.totalMs, + rerankMs: reranked.durations.rerankMs, + }, + }, + }); +} +``` + +### With Input Validation + +```ts +import { z } from "zod"; + +const searchSchema = z.object({ + q: z.string().min(2).max(500), + collection: z.string().optional(), + topK: z.coerce.number().min(1).max(50).default(8), +}); + +export async function GET(request: Request) { + const { searchParams } = new URL(request.url); + + const parsed = searchSchema.safeParse({ + q: searchParams.get("q"), + collection: searchParams.get("collection"), + topK: searchParams.get("topK"), + }); + + if (!parsed.success) { + return Response.json({ error: parsed.error.issues }, { status: 400 }); + } + + const { q: query, collection, topK } = parsed.data; + const engine = createUnragEngine(); + + const result = await engine.retrieve({ + query, + topK, + scope: collection ? { sourceId: collection } : undefined, + }); + + return Response.json({ results: result.chunks }); +} +``` + +--- + +## Multi-Tenant Scoping + +Use sourceId prefixes to isolate tenant data. 
+ +### Ingestion + +```ts +async function ingestForTenant(tenantId: string, doc: Document) { + const engine = createUnragEngine(); + + await engine.ingest({ + sourceId: `tenant:${tenantId}:${doc.id}`, + content: doc.content, + metadata: { + tenantId, + title: doc.title, + createdAt: doc.createdAt, + }, + }); +} +``` + +### Retrieval + +```ts +async function searchForTenant(tenantId: string, query: string) { + const engine = createUnragEngine(); + + const result = await engine.retrieve({ + query, + topK: 10, + scope: { sourceId: `tenant:${tenantId}:` }, + }); + + return result.chunks; +} +``` + +### Deletion + +```ts +async function deleteAllTenantData(tenantId: string) { + const engine = createUnragEngine(); + + await engine.delete({ + sourceIdPrefix: `tenant:${tenantId}:`, + }); +} +``` + +### Hierarchical Prefixes + +```ts +// Organization → Workspace → Document +const sourceId = `org:${orgId}:ws:${workspaceId}:doc:${docId}`; + +// Search within workspace +scope: { sourceId: `org:${orgId}:ws:${workspaceId}:` } + +// Search across organization +scope: { sourceId: `org:${orgId}:` } +``` + +--- + +## Chat Integration + +Use retrieved chunks as context for LLM responses. + +### Basic RAG Chat + +```ts +import { createUnragEngine } from "@unrag/config"; +import { generateText } from "ai"; + +async function chat(userMessage: string) { + const engine = createUnragEngine(); + + // Retrieve relevant context + const retrieved = await engine.retrieve({ + query: userMessage, + topK: 5, + }); + + // Build context string + const context = retrieved.chunks + .map((chunk) => chunk.content) + .join("\n\n---\n\n"); + + // Generate response with context + const response = await generateText({ + model: "openai/gpt-4o", + messages: [ + { + role: "system", + content: `Answer based on the following context. If the answer isn't in the context, say so. + +Context: +${context}`, + }, + { + role: "user", + content: userMessage, + }, + ], + }); + + return { + answer: response.text, + sources: retrieved.chunks.map((c) => ({ + sourceId: c.sourceId, + content: c.content.slice(0, 200) + "...", + })), + }; +} +``` + +### With Reranking + +```ts +async function chat(userMessage: string) { + const engine = createUnragEngine(); + + // Retrieve more candidates + const retrieved = await engine.retrieve({ + query: userMessage, + topK: 20, + }); + + // Rerank for best context + const reranked = await engine.rerank({ + query: userMessage, + candidates: retrieved.chunks, + topK: 5, + onMissingReranker: "skip", + }); + + const context = reranked.chunks + .map((chunk) => chunk.content) + .join("\n\n---\n\n"); + + // ... 
generate response +} +``` + +### Streaming Response + +```ts +import { streamText } from "ai"; + +export async function POST(request: Request) { + const { message } = await request.json(); + const engine = createUnragEngine(); + + const retrieved = await engine.retrieve({ query: message, topK: 5 }); + const context = retrieved.chunks.map((c) => c.content).join("\n\n"); + + const result = streamText({ + model: "openai/gpt-4o", + messages: [ + { role: "system", content: `Context:\n${context}` }, + { role: "user", content: message }, + ], + }); + + return result.toDataStreamResponse(); +} +``` + +--- + +## Ingestion Patterns + +### Static Content (Build Time) + +```ts +// scripts/ingest-docs.ts +import { createUnragEngine } from "@unrag/config"; +import { glob } from "glob"; +import { readFile } from "fs/promises"; + +async function ingestDocs() { + const engine = createUnragEngine(); + const files = await glob("content/docs/**/*.md"); + + for (const file of files) { + const content = await readFile(file, "utf-8"); + const slug = file.replace("content/docs/", "").replace(".md", ""); + + await engine.ingest({ + sourceId: `docs:${slug}`, + content, + metadata: { path: file, slug }, + }); + + console.log(`Ingested: ${slug}`); + } +} + +ingestDocs(); +``` + +Add to package.json: +```json +{ + "scripts": { + "ingest:docs": "bun run scripts/ingest-docs.ts", + "build": "bun run ingest:docs && next build" + } +} +``` + +### User-Generated Content + +```ts +// app/api/documents/route.ts +import { createUnragEngine } from "@unrag/config"; + +export async function POST(request: Request) { + const { title, content } = await request.json(); + const userId = getUserId(request); // From auth + + // Save to database + const doc = await db.documents.create({ + data: { title, content, userId }, + }); + + // Ingest for search + const engine = createUnragEngine(); + await engine.ingest({ + sourceId: `user:${userId}:doc:${doc.id}`, + content, + metadata: { title, userId, docId: doc.id }, + }); + + return Response.json({ id: doc.id }); +} +``` + +### Periodic Sync (Cron) + +```ts +// scripts/sync-external.ts +import { createUnragEngine } from "@unrag/config"; +import { createNotionConnector } from "./lib/unrag/connectors/notion"; + +async function syncNotion() { + const engine = createUnragEngine(); + const notion = createNotionConnector({ + auth: process.env.NOTION_TOKEN, + }); + + // Load checkpoint from previous sync + const checkpoint = await loadCheckpoint("notion-sync"); + + const stream = notion.syncDatabase({ + databaseId: process.env.NOTION_DATABASE_ID!, + filter: checkpoint ? 
{ last_edited_time: { after: checkpoint.lastSync } } : undefined, + }); + + const result = await engine.runConnectorStream({ + stream, + checkpoint, + onProgress: (e) => console.log(e.message), + }); + + // Save checkpoint for next sync + await saveCheckpoint("notion-sync", result.checkpoint); + + console.log(`Synced ${result.ingestCount} documents`); +} +``` + +Trigger with cron: +```yaml +# .github/workflows/sync.yml +on: + schedule: + - cron: '0 */6 * * *' # Every 6 hours +``` + +### Re-ingestion (Full Refresh) + +```ts +async function reindexAll() { + const engine = createUnragEngine(); + + // Delete all existing + await engine.delete({ sourceIdPrefix: "docs:" }); + + // Re-ingest everything + const docs = await db.documents.findMany(); + + for (const doc of docs) { + await engine.ingest({ + sourceId: `docs:${doc.id}`, + content: doc.content, + metadata: { title: doc.title }, + }); + } +} +``` + +--- + +## Asset Processing Patterns + +### PDF Knowledge Base + +```ts +import { createUnragEngine } from "@unrag/config"; + +async function ingestPdf(pdfBuffer: Buffer, filename: string) { + const engine = createUnragEngine(); + + await engine.ingest({ + sourceId: `pdfs:${filename}`, + content: "", // Text extracted from PDF + assets: [ + { + assetId: "pdf-main", + kind: "pdf", + data: { + kind: "bytes", + bytes: new Uint8Array(pdfBuffer), + mediaType: "application/pdf", + filename, + }, + }, + ], + }); +} +``` + +### Image Gallery with Captions + +```ts +async function ingestImage(imageUrl: string, altText: string, id: string) { + const engine = createUnragEngine(); + + await engine.ingest({ + sourceId: `gallery:${id}`, + content: altText, // Use alt text as base content + assets: [ + { + assetId: "image-main", + kind: "image", + data: { kind: "url", url: imageUrl }, + text: altText, // Provide known text + }, + ], + assetProcessing: { + image: { + captionLlm: { enabled: true }, // Generate additional captions + }, + }, + }); +} +``` + +### Dry-Run Before Expensive Processing + +```ts +async function previewIngestion(doc: Document) { + const engine = createUnragEngine(); + + const plan = await engine.planIngest({ + sourceId: doc.id, + content: doc.content, + assets: doc.assets, + }); + + // Check what would be processed + for (const asset of plan.assets) { + if (asset.status === "will_process") { + console.log(`${asset.assetId}: ${asset.extractors.join(", ")}`); + } else { + console.log(`${asset.assetId}: SKIP - ${asset.reason}`); + } + } + + // Confirm before actual ingestion + if (await confirm("Proceed with ingestion?")) { + await engine.ingest({ + sourceId: doc.id, + content: doc.content, + assets: doc.assets, + }); + } +} +``` + +--- + +## Framework Integration + +### Express + +```ts +import express from "express"; +import { createUnragEngine } from "@unrag/config"; + +const app = express(); + +app.get("/api/search", async (req, res) => { + const query = req.query.q as string; + + if (!query) { + return res.status(400).json({ error: "Missing query" }); + } + + const engine = createUnragEngine(); + const result = await engine.retrieve({ query, topK: 8 }); + + res.json({ results: result.chunks }); +}); +``` + +### Hono + +```ts +import { Hono } from "hono"; +import { createUnragEngine } from "@unrag/config"; + +const app = new Hono(); + +app.get("/api/search", async (c) => { + const query = c.req.query("q"); + + if (!query) { + return c.json({ error: "Missing query" }, 400); + } + + const engine = createUnragEngine(); + const result = await engine.retrieve({ query, topK: 8 }); + + return 
c.json({ results: result.chunks });
});
```

### Node Script

```ts
// scripts/query.ts
import { createUnragEngine } from "@unrag/config";

const query = process.argv[2];

if (!query) {
  console.error("Usage: bun run scripts/query.ts <query>");
  process.exit(1);
}

const engine = createUnragEngine();
const result = await engine.retrieve({ query, topK: 5 });

for (const chunk of result.chunks) {
  console.log(`[${chunk.score.toFixed(3)}] ${chunk.sourceId}`);
  console.log(chunk.content.slice(0, 200) + "...\n");
}
```

---

## Performance Patterns

### Connection Pooling

```ts
// lib/unrag/store/index.ts
import { Pool } from "pg";

// Single pool instance
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 20,
  idleTimeoutMillis: 30000,
});

export const store = createDrizzleVectorStore(drizzle(pool));
```

### Engine Caching

```ts
// lib/unrag/config.ts
let cachedEngine: ContextEngine | null = null;

export function createUnragEngine() {
  if (!cachedEngine) {
    cachedEngine = unrag.createEngine({ store });
  }
  return cachedEngine;
}
```

### Batch Ingestion

```ts
async function batchIngest(documents: Document[]) {
  const engine = createUnragEngine();

  // Process in parallel batches
  const batchSize = 10;
  for (let i = 0; i < documents.length; i += batchSize) {
    const batch = documents.slice(i, i + batchSize);

    await Promise.all(
      batch.map((doc) =>
        engine.ingest({
          sourceId: doc.id,
          content: doc.content,
        })
      )
    );

    console.log(`Processed ${Math.min(i + batchSize, documents.length)}/${documents.length}`);
  }
}
```
diff --git a/skills/unrag/references/store-adapters.md b/skills/unrag/references/store-adapters.md
new file mode 100644
index 0000000..8fa43be
--- /dev/null
+++ b/skills/unrag/references/store-adapters.md
@@ -0,0 +1,392 @@
# Store Adapters

Unrag stores vectors in PostgreSQL using the pgvector extension. Three adapters are available—choose based on your existing ORM.

## Requirements

All adapters require:
1. PostgreSQL database with pgvector extension enabled
2. Database schema created (documents, chunks, embeddings tables)
3. `DATABASE_URL` environment variable

## The Three Adapters

| Adapter | Best For | Dependencies |
|---------|----------|--------------|
| **Drizzle** | Projects using Drizzle ORM | `drizzle-orm`, `drizzle-kit`, `pg` |
| **Prisma** | Projects using Prisma | `@prisma/client` |
| **Raw SQL** | No ORM, minimal deps | `pg` |

All adapters:
- Implement the same `VectorStore` interface
- Produce the same database schema
- Are functionally equivalent
- Can be switched without data migration

## Drizzle Adapter (Recommended)

Type-safe database access with Drizzle ORM.
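To give a sense of what type-safe access looks like against the schema shown below, a similarity query through Drizzle might be sketched like this (illustrative only, not the adapter's actual implementation; it assumes the `chunks`/`embeddings` tables from this section and pgvector's `<=>` cosine-distance operator):

```ts
import { eq, sql } from "drizzle-orm";
import type { NodePgDatabase } from "drizzle-orm/node-postgres";
import { chunks, embeddings } from "./schema";

// Sketch: rank chunks by cosine similarity to a query embedding.
async function similaritySearch(db: NodePgDatabase, embedding: number[], topK: number) {
  const vector = `[${embedding.join(",")}]`; // pgvector literal, e.g. "[0.1,0.2,...]"

  return db
    .select({
      id: chunks.id,
      content: chunks.content,
      score: sql<number>`1 - (${embeddings.embedding} <=> ${vector}::vector)`,
    })
    .from(embeddings)
    .innerJoin(chunks, eq(chunks.id, embeddings.chunkId))
    .orderBy(sql`${embeddings.embedding} <=> ${vector}::vector`)
    .limit(topK);
}
```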
+ +### Setup + +```bash +unrag init --store drizzle +``` + +### Configuration + +```ts +// lib/unrag/store/index.ts (generated) +import { drizzle } from "drizzle-orm/node-postgres"; +import { Pool } from "pg"; +import { createDrizzleVectorStore } from "./drizzle-store"; + +const pool = new Pool({ connectionString: process.env.DATABASE_URL }); +const db = drizzle(pool); + +export const store = createDrizzleVectorStore(db); +``` + +### Schema + +The Drizzle adapter provides typed schema: + +```ts +// lib/unrag/store/schema.ts (generated) +import { pgTable, text, integer, vector, timestamp, index, uniqueIndex } from "drizzle-orm/pg-core"; + +export const documents = pgTable("documents", { + id: text("id").primaryKey(), + sourceId: text("source_id").notNull(), + content: text("content"), + metadata: text("metadata"), + createdAt: timestamp("created_at").defaultNow(), + updatedAt: timestamp("updated_at").defaultNow(), +}, (table) => ({ + sourceIdIdx: uniqueIndex("documents_source_id_idx").on(table.sourceId), +})); + +export const chunks = pgTable("chunks", { + id: text("id").primaryKey(), + documentId: text("document_id").notNull().references(() => documents.id, { onDelete: "cascade" }), + index: integer("index").notNull(), + content: text("content"), + tokenCount: integer("token_count").notNull(), + metadata: text("metadata"), +}, (table) => ({ + documentIdx: index("chunks_document_id_idx").on(table.documentId), +})); + +export const embeddings = pgTable("embeddings", { + id: text("id").primaryKey(), + chunkId: text("chunk_id").notNull().references(() => chunks.id, { onDelete: "cascade" }), + embedding: vector("embedding", { dimensions: 1536 }), +}, (table) => ({ + chunkIdx: uniqueIndex("embeddings_chunk_id_idx").on(table.chunkId), + embeddingIdx: index("embeddings_embedding_idx").using("hnsw", table.embedding.op("vector_cosine_ops")), +})); +``` + +### Migrations + +Use Drizzle Kit: + +```bash +# Generate migration +bunx drizzle-kit generate + +# Run migration +bunx drizzle-kit migrate +``` + +--- + +## Prisma Adapter + +Uses Prisma's connection management with raw SQL for vector operations. 
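To make "raw SQL for vector operations" concrete, the adapter's similarity search can be pictured roughly like the following sketch (illustrative only, not the generated code; it assumes the schema shown below and pgvector's `<=>` cosine-distance operator):

```ts
import { Prisma, PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Sketch: vector search issued through Prisma's raw query API.
async function similaritySearch(embedding: number[], topK: number) {
  const vector = `[${embedding.join(",")}]`; // pgvector literal

  return prisma.$queryRaw(
    Prisma.sql`
      SELECT c.id, c.content, 1 - (e.embedding <=> ${vector}::vector) AS score
      FROM embeddings e
      JOIN chunks c ON c.id = e.chunk_id
      ORDER BY e.embedding <=> ${vector}::vector
      LIMIT ${topK}
    `
  );
}
```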
+ +### Setup + +```bash +unrag init --store prisma +``` + +### Configuration + +```ts +// lib/unrag/store/index.ts (generated) +import { PrismaClient } from "@prisma/client"; +import { createPrismaVectorStore } from "./prisma-store"; + +const prisma = new PrismaClient(); + +export const store = createPrismaVectorStore(prisma); +``` + +### Schema + +Since Prisma doesn't natively support pgvector, use raw SQL in migrations: + +```prisma +// prisma/schema.prisma +generator client { + provider = "prisma-client-js" +} + +datasource db { + provider = "postgresql" + url = env("DATABASE_URL") +} + +// Note: Vector columns are created via raw SQL migration +``` + +Create migration manually: + +```sql +-- migrations/0001_init_unrag.sql +CREATE EXTENSION IF NOT EXISTS vector; + +CREATE TABLE documents ( + id TEXT PRIMARY KEY, + source_id TEXT NOT NULL UNIQUE, + content TEXT, + metadata TEXT, + created_at TIMESTAMP DEFAULT NOW(), + updated_at TIMESTAMP DEFAULT NOW() +); + +CREATE TABLE chunks ( + id TEXT PRIMARY KEY, + document_id TEXT NOT NULL REFERENCES documents(id) ON DELETE CASCADE, + index INTEGER NOT NULL, + content TEXT, + token_count INTEGER NOT NULL, + metadata TEXT +); + +CREATE TABLE embeddings ( + id TEXT PRIMARY KEY, + chunk_id TEXT NOT NULL UNIQUE REFERENCES chunks(id) ON DELETE CASCADE, + embedding vector(1536) +); + +CREATE INDEX chunks_document_id_idx ON chunks(document_id); +CREATE INDEX embeddings_embedding_idx ON embeddings USING hnsw (embedding vector_cosine_ops); +``` + +Run with Prisma: + +```bash +bunx prisma db push +``` + +--- + +## Raw SQL Adapter + +Direct pg driver for minimal dependencies. + +### Setup + +```bash +unrag init --store raw-sql +``` + +### Configuration + +```ts +// lib/unrag/store/index.ts (generated) +import { Pool } from "pg"; +import { createRawSqlVectorStore } from "./raw-sql-store"; + +const pool = new Pool({ connectionString: process.env.DATABASE_URL }); + +export const store = createRawSqlVectorStore(pool); +``` + +### Schema + +Create tables manually: + +```sql +-- Create extension +CREATE EXTENSION IF NOT EXISTS vector; + +-- Documents table +CREATE TABLE IF NOT EXISTS documents ( + id TEXT PRIMARY KEY, + source_id TEXT NOT NULL UNIQUE, + content TEXT, + metadata JSONB, + created_at TIMESTAMPTZ DEFAULT NOW(), + updated_at TIMESTAMPTZ DEFAULT NOW() +); + +-- Chunks table +CREATE TABLE IF NOT EXISTS chunks ( + id TEXT PRIMARY KEY, + document_id TEXT NOT NULL REFERENCES documents(id) ON DELETE CASCADE, + index INTEGER NOT NULL, + content TEXT, + token_count INTEGER NOT NULL, + metadata JSONB +); + +-- Embeddings table +CREATE TABLE IF NOT EXISTS embeddings ( + id TEXT PRIMARY KEY, + chunk_id TEXT NOT NULL UNIQUE REFERENCES chunks(id) ON DELETE CASCADE, + embedding vector(1536) +); + +-- Indexes +CREATE INDEX IF NOT EXISTS chunks_document_id_idx ON chunks(document_id); +CREATE INDEX IF NOT EXISTS embeddings_embedding_idx ON embeddings + USING hnsw (embedding vector_cosine_ops); +``` + +--- + +## Custom Store Adapter + +Implement the `VectorStore` interface for other databases: + +```ts +import type { VectorStore, Chunk, DeleteInput, RetrieveScope } from "./types"; + +export function createCustomStore(client: YourDbClient): VectorStore { + return { + async upsert(chunks: Chunk[]): Promise<{ documentId: string }> { + // 1. Extract sourceId from first chunk + const sourceId = chunks[0].sourceId; + + // 2. Delete existing chunks for this sourceId + // 3. Insert document record + // 4. Insert chunk records + // 5. 
Insert embedding records

      return { documentId: chunks[0].documentId };
    },

    async query(params: {
      embedding: number[];
      topK: number;
      scope?: RetrieveScope;
    }): Promise<Array<Chunk & { score: number }>> {
      // 1. Run similarity search
      // 2. Apply scope filter if provided
      // 3. Return top K chunks with scores

      return results;
    },

    async delete(input: DeleteInput): Promise<void> {
      if (input.sourceId) {
        // Delete by exact sourceId
      } else if (input.sourceIdPrefix) {
        // Delete by sourceId prefix
      }
    },
  };
}
```

### VectorStore Interface

```ts
type VectorStore = {
  upsert: (chunks: Chunk[]) => Promise<{ documentId: string }>;
  query: (params: {
    embedding: number[];
    topK: number;
    scope?: RetrieveScope;
  }) => Promise<Array<Chunk & { score: number }>>;
  delete: (input: DeleteInput) => Promise<void>;
};
```

---

## Database Setup

### Enable pgvector

```sql
-- Requires superuser or extension creation privilege
CREATE EXTENSION IF NOT EXISTS vector;
```

### Verify Setup

```sql
-- Check extension
SELECT * FROM pg_extension WHERE extname = 'vector';

-- Check tables
\dt documents
\dt chunks
\dt embeddings

-- Check indexes
\di embeddings_embedding_idx
```

### Index Tuning

For large datasets, tune the HNSW index:

```sql
-- Drop and recreate with tuned parameters
DROP INDEX IF EXISTS embeddings_embedding_idx;

CREATE INDEX embeddings_embedding_idx ON embeddings
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- For IVFFlat (alternative, faster builds)
CREATE INDEX embeddings_embedding_ivfflat_idx ON embeddings
  USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);
```

**HNSW Parameters:**
- `m` - Number of connections per layer (default: 16)
- `ef_construction` - Size of candidate list during build (default: 64)

**IVFFlat Parameters:**
- `lists` - Number of clusters (rule of thumb: rows / 1000)

---

## Wiring the Store

In `unrag.config.ts`:

```ts
import { defineUnragConfig } from "./lib/unrag/core";

// Import your store
import { store } from "./lib/unrag/store";

export const unrag = defineUnragConfig({
  embedding: { /* ... */ },
  // Store is passed at runtime via createEngine
});

// In your application code
import { unrag } from "./unrag.config";
import { store } from "./lib/unrag/store";

const engine = unrag.createEngine({ store });
```

Or with `createUnragEngine`:

```ts
// lib/unrag/config.ts (generated)
import { unrag } from "../../unrag.config";
import { store } from "./store";

export const createUnragEngine = () => unrag.createEngine({ store });
```
diff --git a/skills/unrag/references/troubleshooting.md b/skills/unrag/references/troubleshooting.md
new file mode 100644
index 0000000..f1d0e84
--- /dev/null
+++ b/skills/unrag/references/troubleshooting.md
@@ -0,0 +1,457 @@
# Troubleshooting

Common issues and solutions when working with Unrag.

## Database Issues

### Connection Failed

**Error:** `ECONNREFUSED` or `connection refused`

**Solutions:**
1. Verify DATABASE_URL is set correctly
   ```bash
   echo $DATABASE_URL
   # Should be: postgresql://user:pass@host:5432/dbname
   ```

2. Check database is running
   ```bash
   pg_isready -h localhost -p 5432
   ```

3. Test connection directly
   ```bash
   psql $DATABASE_URL -c "SELECT 1"
   ```

4.
Check firewall/network rules for remote databases + +--- + +### pgvector Extension Missing + +**Error:** `type "vector" does not exist` + +**Solution:** +```sql +-- Connect as superuser +CREATE EXTENSION IF NOT EXISTS vector; + +-- Verify +SELECT * FROM pg_extension WHERE extname = 'vector'; +``` + +For managed databases (Supabase, Neon, etc.): +- Supabase: Enable in Dashboard → Database → Extensions +- Neon: `CREATE EXTENSION vector;` (auto-enabled) +- AWS RDS: Use pgvector-enabled instance class + +--- + +### Tables Don't Exist + +**Error:** `relation "documents" does not exist` + +**Solution:** +Run the schema migration: + +```bash +# Drizzle +bunx drizzle-kit migrate + +# Prisma +bunx prisma db push + +# Raw SQL - run migration manually +psql $DATABASE_URL -f migrations/0001_init_unrag.sql +``` + +--- + +### Dimension Mismatch + +**Error:** `expected 1536 dimensions, not 768` + +**Cause:** Embedding model changed after data was indexed. + +**Solutions:** +1. Re-index all data with new model + ```ts + await engine.delete({ sourceIdPrefix: "" }); // Delete all + // Re-ingest everything + ``` + +2. Or update vector column dimension + ```sql + ALTER TABLE embeddings + ALTER COLUMN embedding TYPE vector(768); + ``` + +3. Recreate index + ```sql + DROP INDEX embeddings_embedding_idx; + CREATE INDEX embeddings_embedding_idx ON embeddings + USING hnsw (embedding vector_cosine_ops); + ``` + +--- + +## Embedding Issues + +### API Key Invalid + +**Error:** `401 Unauthorized` or `Invalid API key` + +**Solutions:** +1. Check environment variable is set + ```bash + echo $OPENAI_API_KEY # Should not be empty + ``` + +2. Verify key format (no extra spaces/newlines) + ```bash + # In .env + OPENAI_API_KEY=sk-proj-... # No quotes + ``` + +3. Test key directly + ```bash + curl https://api.openai.com/v1/models \ + -H "Authorization: Bearer $OPENAI_API_KEY" + ``` + +--- + +### Rate Limits + +**Error:** `429 Too Many Requests` or `rate_limit_exceeded` + +**Solutions:** +1. Reduce concurrency + ```ts + defaults: { + embedding: { + concurrency: 2, // Lower from default 4 + batchSize: 50, // Smaller batches + }, + }, + ``` + +2. Add retry logic + ```ts + // In custom provider + embed: async (input) => { + for (let i = 0; i < 3; i++) { + try { + return await provider.embed(input); + } catch (e) { + if (e.status === 429) { + await sleep(1000 * (i + 1)); + continue; + } + throw e; + } + } + } + ``` + +3. Upgrade API tier or use different provider + +--- + +### Timeout Errors + +**Error:** `AbortError` or `timeout` + +**Solutions:** +1. Increase timeout in config + ```ts + embedding: { + provider: "openai", + config: { + timeoutMs: 60_000, // 60 seconds + }, + }, + ``` + +2. Check network connectivity + +3. Use regional endpoint if available + +--- + +## Extractor Issues + +### Extractor Not Found + +**Error:** `No extractor found for asset kind: pdf` + +**Solutions:** +1. Add the extractor + ```bash + bunx unrag@latest add extractor pdf-text-layer + ``` + +2. Wire it in config + ```ts + import { createPdfTextLayerExtractor } from "./lib/unrag/extractors/pdf-text-layer"; + + engine: { + extractors: [createPdfTextLayerExtractor()], + }, + ``` + +3. 
Enable in assetProcessing + ```ts + assetProcessing: { + pdf: { + textLayer: { enabled: true }, + }, + }, + ``` + +--- + +### Dependency Missing + +**Error:** `Cannot find module 'pdfjs-dist'` + +**Solution:** +Install missing dependency: +```bash +bun add pdfjs-dist +``` + +Common extractor dependencies: +- `pdf-text-layer`: `pdfjs-dist` +- `file-docx`: `mammoth` +- `file-pptx`: `jszip` +- `file-xlsx`: `xlsx` + +--- + +### Worker-Only Extractor in Serverless + +**Error:** `pdf-ocr requires native binaries` + +**Cause:** Worker-only extractors (`pdf-ocr`, `video-frames`) need native binaries. + +**Solutions:** +1. Use serverless-compatible extractors + - `pdf-text-layer` instead of `pdf-ocr` + - `pdf-llm` for scanned PDFs + +2. Process in worker environment + - Use background job (Trigger.dev, BullMQ) + - Separate worker service + +3. Disable worker-only extractors in serverless + ```ts + assetProcessing: { + pdf: { + ocr: { enabled: false }, + }, + video: { + frames: { enabled: false }, + }, + }, + ``` + +--- + +## Retrieval Issues + +### No Results + +**Query returns empty array** + +**Causes & Solutions:** + +1. **No data ingested** + ```ts + // Check if documents exist + const result = await engine.retrieve({ query: "*", topK: 1 }); + console.log("Has data:", result.chunks.length > 0); + ``` + +2. **Scope too narrow** + ```ts + // Try without scope + const result = await engine.retrieve({ query, topK: 10 }); + // vs with scope + const scoped = await engine.retrieve({ + query, + topK: 10, + scope: { sourceId: "docs:" }, // May be filtering everything + }); + ``` + +3. **Embedding model mismatch** - See Dimension Mismatch above + +4. **Query too short/vague** + ```ts + // Use more specific queries + query: "how to reset password" // Better + query: "reset" // Too vague + ``` + +--- + +### Poor Quality Results + +**Results don't match query well** + +**Solutions:** + +1. **Try reranking** + ```ts + const retrieved = await engine.retrieve({ query, topK: 30 }); + const reranked = await engine.rerank({ + query, + candidates: retrieved.chunks, + topK: 8, + }); + ``` + +2. **Adjust chunk size** + ```ts + defaults: { + chunking: { + chunkSize: 256, // Smaller for precise matching + chunkOverlap: 50, + }, + }, + ``` + +3. **Use better embedding model** + - `text-embedding-3-large` > `text-embedding-3-small` + - Cohere `embed-english-v3.0` for retrieval + +4. **Add metadata for filtering** + ```ts + // At ingest + metadata: { category: "docs", language: "en" } + + // At retrieval - filter client-side + const filtered = result.chunks.filter( + c => c.metadata.category === "docs" + ); + ``` + +--- + +### Slow Retrieval + +**Queries taking > 500ms** + +**Solutions:** + +1. **Check index exists** + ```sql + SELECT indexname FROM pg_indexes + WHERE tablename = 'embeddings'; + + -- Should see: embeddings_embedding_idx + ``` + +2. **Optimize index** (HNSW) + ```sql + -- For faster queries (tradeoff: lower recall) + SET hnsw.ef_search = 40; -- Default: 40 + + -- Rebuild with better parameters + DROP INDEX embeddings_embedding_idx; + CREATE INDEX embeddings_embedding_idx ON embeddings + USING hnsw (embedding vector_cosine_ops) + WITH (m = 24, ef_construction = 200); + ``` + +3. **Use connection pooling** + ```ts + const pool = new Pool({ + connectionString: process.env.DATABASE_URL, + max: 20, + }); + ``` + +4. 
**Reduce topK**
   ```ts
   topK: 10 // Instead of 100
   ```

---

## Debug Tools

### Enable Debug Mode

```bash
UNRAG_DEBUG=true bun run dev
```

### Run Doctor

```bash
bunx unrag@latest doctor
```

### Launch Debug TUI

```bash
bunx unrag@latest debug
```

### Add Logging

```ts
// Wrap engine calls
const originalRetrieve = engine.retrieve.bind(engine);
engine.retrieve = async (input) => {
  console.log("[retrieve]", input.query);
  const start = Date.now();
  const result = await originalRetrieve(input);
  console.log("[retrieve] done", Date.now() - start, "ms");
  return result;
};
```

### Asset Processing Events

```ts
assetProcessing: {
  hooks: {
    onEvent: (event) => {
      console.log(`[${event.type}]`, JSON.stringify(event, null, 2));
    },
  },
},
```

---

## Common Error Messages

| Error | Likely Cause | Quick Fix |
|-------|--------------|-----------|
| `ECONNREFUSED` | Database not running | Start Postgres |
| `type "vector" does not exist` | pgvector not installed | `CREATE EXTENSION vector` |
| `401 Unauthorized` | Bad API key | Check env var |
| `429 Too Many Requests` | Rate limited | Reduce concurrency |
| `dimension mismatch` | Model changed | Re-index data |
| `No extractor found` | Missing extractor | `unrag add extractor` |
| `Cannot find module` | Missing dependency | `bun add <package>` |

---

## Getting Help

1. **Check logs** - Look for stack traces and error codes

2. **Run doctor** - `bunx unrag@latest doctor`

3. **Enable debug** - `UNRAG_DEBUG=true`

4. **Search docs** - Check `/docs` for your specific issue

5. **Check source** - The code is in your repo at `lib/unrag/`