
@vanpelt (Collaborator) commented Nov 17, 2025

Summary

Adds local GGUF model inference using llama.cpp via yzma, enabling on-device task summarization and git branch name generation with our fine-tuned Gemma 3 270M model.

Key Features

  • CLI Commands

    • catnip summarize "task description" - Generate task summary and branch name
    • catnip download - Pre-download model and llama.cpp libraries
  • REST API

    • POST /v1/inference/summarize - Inference endpoint for programmatic access
    • GET /v1/inference/status - Check inference service availability
  • Auto-downloading

    • Models cached to ~/.catnip/models/
    • llama.cpp libraries auto-downloaded for current platform to ~/.catnip/lib/

Critical Bug Fix

Fixed inference producing incorrect output: the model always returned "Add Dark Mode", one of the prompt's examples, instead of summarizing the actual task.

Root cause: Missing BOS (Beginning of Sequence) token when tokenizing prompts for Gemma models.

Fix: Set addSpecial=true in the tokenization call so the required special tokens (including BOS) are added.
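To show what the flag changes, here is a toy tokenizer. It is not yzma's or llama.cpp's actual API. The hypothetical `tokenize` function prepends a BOS id only when `addSpecial` is true. The tiny vocabulary and the BOS id of 2 are illustrative assumptions. Real code should read the BOS id from the model's vocabulary.

```go
package main

import (
	"fmt"
	"strings"
)

// bosID is an assumed BOS token id used only for illustration.
const bosID = 2

// toyVocab stands in for a real SentencePiece vocabulary.
var toyVocab = map[string]int{"summarize": 10, "this": 11, "task": 12}

// tokenize prepends BOS when addSpecial is true, mimicking the behavior
// Gemma models depend on. Words missing from toyVocab are skipped, which a
// real tokenizer would not do.
func tokenize(text string, addSpecial bool) []int {
	var ids []int
	if addSpecial {
		ids = append(ids, bosID)
	}
	for _, w := range strings.Fields(strings.ToLower(text)) {
		if id, ok := toyVocab[w]; ok {
			ids = append(ids, id)
		}
	}
	return ids
}

func main() {
	fmt.Println(tokenize("Summarize this task", false)) // [10 11 12]
	fmt.Println(tokenize("Summarize this task", true))  // [2 10 11 12]
}
```

Without the leading BOS token, the model sees input that does not match what it was trained on. That is consistent with the degenerate "echo the example" output described above.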

Test plan

  • Verify catnip summarize produces varied, contextually appropriate outputs
  • Compare output quality with Ollama using same model
  • Test multiple prompts to confirm no example contamination
  • Verify lint passes
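Helpers like the ones below could partly automate the "varied outputs" and "no example contamination" checks. This is a sketch. `echoesExample` and `allDistinct` are hypothetical names, and the outputs in `main` are placeholders, not real model output.

```go
package main

import (
	"fmt"
	"strings"
)

// echoesExample reports whether an output merely repeats one of the few-shot
// examples embedded in the prompt (e.g. "Add Dark Mode"). The comparison
// ignores case and surrounding whitespace.
func echoesExample(output string, examples []string) bool {
	out := strings.TrimSpace(output)
	for _, ex := range examples {
		if strings.EqualFold(out, strings.TrimSpace(ex)) {
			return true
		}
	}
	return false
}

// allDistinct reports whether outputs for different prompts differ from one
// another. Identical outputs across unrelated prompts suggest the model is
// ignoring its input.
func allDistinct(outputs []string) bool {
	seen := make(map[string]bool, len(outputs))
	for _, o := range outputs {
		key := strings.ToLower(strings.TrimSpace(o))
		if seen[key] {
			return false
		}
		seen[key] = true
	}
	return true
}

func main() {
	examples := []string{"Add Dark Mode"}
	// Placeholder outputs; in practice these would come from `catnip summarize`.
	outputs := []string{"Fix login timeout", "Add CSV export"}
	for _, o := range outputs {
		fmt.Printf("%q echoes example: %v\n", o, echoesExample(o, examples))
	}
	fmt.Println("all distinct:", allDistinct(outputs))
}
```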

🤖 Generated with Claude Code

Claude and others added 2 commits November 17, 2025 16:52
Adds local GGUF model inference using llama.cpp via yzma for task
summarization and branch name generation.

Key components:
- InferenceService: Handles model loading and text generation
- ModelDownloader: Downloads and caches GGUF models from HuggingFace
- LibraryDownloader: Auto-downloads llama.cpp libraries for current platform
- summarize command: CLI interface for generating summaries
- download command: Pre-download model and libraries
- REST API endpoint: POST /v1/inference/summarize

Critical fix: Must use addSpecial=true when tokenizing prompts for Gemma
models so the BOS token is included. Without it, the model produces
incorrect output (it echoed examples from the prompt instead of
summarizing the actual task).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
codecov bot commented Nov 17, 2025

@vanpelt changed the title from "feat: Local inference with llama" to "Add local inference service for task summarization" on Nov 17, 2025
socket-security bot commented Nov 17, 2025

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Added golang/github.com/hybridgroup/yzma@v0.9.0 (Supply Chain Security 97, Vulnerability 100, Quality 100, Maintenance 100, License 100)


socket-security bot commented Nov 17, 2025

Warning

Review the following alerts detected in dependencies.

According to your organization's Security Policy, it is recommended to resolve "Warn" alerts. Learn more about Socket for GitHub.

Warn (severity: Medium): Native binaries present: golang github.com/jupiterrider/ffi

Location: Package overview

From: container/go.mod → golang/github.com/hybridgroup/yzma@v0.9.0 → golang/github.com/jupiterrider/ffi@v0.5.1


Next steps: Take a moment to review the security alert above. Review the linked package source code to understand the potential risk. Ensure the package is not malicious before proceeding. If you're unsure how to proceed, reach out to your security team or ask the Socket team for help at support@socket.dev.

Suggestion: Verify that the inclusion of native code is expected and necessary for this package's functionality. If it is unnecessary or unexpected, consider using alternative packages without native code to mitigate potential risks.

Mark the package as acceptable risk. To ignore this alert only in this pull request, reply with the comment @SocketSecurity ignore golang/github.com/jupiterrider/ffi@v0.5.1. You can also ignore all packages with @SocketSecurity ignore-all. To ignore an alert for all future pull requests, use Socket's Dashboard to change the triage state of this alert.


@vanpelt force-pushed the fix/inference-bos-token branch from f347259 to 8069e87 on November 17, 2025 22:02
- Truncate the parts slice to at most 3 elements before the loop
- Add a nolint comment for a false-positive gosec warning
- Update golangci-lint version to 2.6.2 to match CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
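The commit does not say what `parts` contains. Assuming it holds the words of a generated summary that become a branch name, truncating before the loop could look like the hypothetical `branchName` helper below. The slug rules are also assumptions for illustration: lowercase, split on anything that is not a letter or digit, join with hyphens.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// maxBranchParts mirrors the "at most 3 elements" limit from the commit.
const maxBranchParts = 3

// branchName builds a lowercase, hyphen-separated slug from at most
// maxBranchParts words. Truncating the slice before the loop means the loop
// never reads past the limit.
func branchName(summary string) string {
	parts := strings.FieldsFunc(strings.ToLower(summary), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	if len(parts) > maxBranchParts {
		parts = parts[:maxBranchParts]
	}
	var b strings.Builder
	for i, p := range parts {
		if i > 0 {
			b.WriteByte('-')
		}
		b.WriteString(p)
	}
	return b.String()
}

func main() {
	fmt.Println(branchName("Add dark mode toggle to settings page")) // add-dark-mode
}
```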
@vanpelt force-pushed the fix/inference-bos-token branch from 8069e87 to ec38067 on November 17, 2025 22:04
Claude and others added 2 commits November 17, 2025 21:40
- Implement non-blocking background initialization for inference service
- Add state management (initializing/ready/failed/disabled) with progress tracking
- Return 503 with status info while model downloads in background
- Add retry logic with exponential backoff (3 attempts)
- Use golang.org/x/sys/unix for cross-platform stderr suppression
- Clean up .gitignore (remove models/) and .goreleaser.yml (remove bundled libs)

The inference service now starts immediately and downloads the libraries
and model in the background. Enable it by setting the CATNIP_INFERENCE=1
environment variable.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Specify stable versions (yarn@4, pnpm@9, npm@10) instead of letting
corepack pick dev versions that may not be available.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@vanpelt force-pushed the fix/inference-bos-token branch from 58b9e28 to 1cfccbe on November 18, 2025 02:46