Add local inference service for task summarization #219
Conversation
Adds local GGUF model inference using llama.cpp via yzma for task summarization and branch name generation.

Key components:
- InferenceService: handles model loading and text generation
- ModelDownloader: downloads and caches GGUF models from HuggingFace
- LibraryDownloader: auto-downloads llama.cpp libraries for the current platform
- summarize command: CLI interface for generating summaries
- download command: pre-downloads the model and libraries
- REST API endpoint: POST /v1/inference/summarize

Critical fix: prompts for Gemma models must be tokenized with addSpecial=true to include the BOS token; without this the model produces incorrect outputs (it was echoing examples from the prompt instead of actual summaries).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Force-pushed from f347259 to 8069e87
- Truncate parts slice to max 3 elements before loop
- Add nolint comment for false-positive gosec warning
- Update golangci-lint version to 2.6.2 to match CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Force-pushed from 8069e87 to ec38067
- Implement non-blocking background initialization for the inference service
- Add state management (initializing/ready/failed/disabled) with progress tracking
- Return 503 with status info while the model downloads in the background
- Add retry logic with exponential backoff (3 attempts)
- Use golang.org/x/sys/unix for cross-platform stderr suppression
- Clean up .gitignore (remove models/) and .goreleaser.yml (remove bundled libs)

The inference service now starts immediately and downloads libraries/model in the background. Enable with the CATNIP_INFERENCE=1 environment variable.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Specify stable versions (yarn@4, pnpm@9, npm@10) instead of letting corepack pick dev versions that may not be available.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Force-pushed from 58b9e28 to 1cfccbe
Summary
Adds local GGUF model inference using llama.cpp via yzma, enabling on-device task summarization and git branch name generation with our fine-tuned Gemma 3 270M model.
Key Features
CLI Commands
- `catnip summarize "task description"` - generate a task summary and branch name
- `catnip download` - pre-download the model and llama.cpp libraries

REST API
- `POST /v1/inference/summarize` - inference endpoint for programmatic access
- `GET /v1/inference/status` - check inference service availability

Auto-downloading
- Models are cached under `~/.catnip/models/`, llama.cpp libraries under `~/.catnip/lib/`

Critical Bug Fix
Fixed inference producing incorrect outputs (always returning "Add Dark Mode" from examples instead of actual summaries).
Root cause: Missing BOS (Beginning of Sequence) token when tokenizing prompts for Gemma models.
Fix: Set `addSpecial=true` in the tokenization call to include the required special tokens.

Test plan
- `catnip summarize` produces varied, contextually appropriate outputs

🤖 Generated with Claude Code