A powerful local AI assistant that runs entirely on your laptop. Laptop-GPT provides template-agnostic prompt formatting with profile-based configuration, giving you full control over quantized model interactions without relying on potentially broken chat templates.
- Profile-Based Prompt Control: Easy switching between model formats (DeepSeek, Llama, ChatML, etc.)
- Template-Independent: No reliance on embedded chat templates that may be corrupted or incompatible
- Quantized Model Optimization: Designed specifically for GGUF quantized models with flexible prompt formatting
- Reasoning Control: Toggle between showing/hiding model thinking processes
- 8GB RAM Optimized: All included models tested and optimized for 8GB systems (alpha v0.1.x)
- Offline-First: Complete local operation, no cloud dependencies
Quantized models (GGUF format) often face significant challenges:
- Broken Chat Templates: During quantization, chat templates can become corrupted or incompatible with llama-cpp-python
- Distillation Issues: Models distilled from larger models may inherit incorrect templates from the parent model
- Training Inconsistencies: Fine-tuning processes can introduce template mismatches between the model's training and its embedded metadata
- Framework Compatibility: Templates that work in one framework (like Transformers) may not translate properly to llama-cpp
Instead of relying on embedded templates, Laptop-GPT gives you direct control:
{
"chat_selection": {
"model_id": "deepseek-r1-1.5b", // Model ID from catalog
"profile": "deepseek", // Prompt format profile
"use_recommended": true // Use model's recommended LLM parameters
},
"profiles": {
"deepseek": {
"format": "deepseek",
"stop_tokens": ["Question:", "Answer:", "User:", "Human:", "\n\n"],
"description": "DeepSeek models - direct Q&A format without reasoning"
}
}
}

This approach provides:
- Guaranteed Compatibility: Format works regardless of embedded templates
- Precise Control: Fine-tune stop tokens for optimal responses
- Easy Debugging: See exactly what prompt format is being used
- Model Flexibility: Switch between formats without changing models
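To make this concrete, here is a minimal sketch of what profile-driven prompt formatting can look like in code. The per-format strings mirror the prompt formats documented later in this README; the function itself is illustrative, not Laptop-GPT's actual implementation.

def format_prompt(question, profile):
    """Illustrative sketch: build the raw prompt string a profile implies."""
    fmt = profile.get("format", "generic")
    if fmt == "deepseek":
        return f"Question: {question}\nAnswer:"
    if fmt == "llama":
        return f"[INST] {question} [/INST]"
    if fmt == "chatml":
        return f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"
    return f"### Human: {question}\n### Assistant:"

profile = {"format": "deepseek", "stop_tokens": ["Question:", "Answer:", "User:", "Human:", "\n\n"]}
print(format_prompt("What is AI?", profile))
# Question: What is AI?
# Answer: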
It is highly recommended to use a Python virtual environment for this project. This keeps dependencies isolated and your system clean.
# macOS/Linux
python -m venv .venv
source .venv/bin/activate

# Windows (Command Prompt)
python -m venv .venv
.venv\Scripts\activate

# Windows (PowerShell)
python -m venv .venv
.venv\Scripts\Activate.ps1

Then install the dependencies:

pip install -r requirements.txt

Copy and customize the provided config.json file:
# View available prompt profiles
python main.py --list-profiles
# See configuration help
python main.py --help-config

# See all available models in the catalog
python main.py --list-models

Then edit your config.json to use a specific model:
{
"chat_selection": {
"model_id": "tinyllama-1.1b", // Model ID from catalog
"profile": "llama", // Prompt format profile
"use_recommended": true // Use model's recommended LLM parameters
}
}

About use_recommended: When set to true, Laptop-GPT automatically uses the optimized parameters (n_ctx, n_batch, temperature, etc.) that are pre-configured for each model. This ensures optimal performance and memory usage for your hardware. You can override these by setting use_recommended: false and defining your own custom_llm_params.
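As a rough sketch of what this switch implies (function and key names here are assumptions for illustration, not the actual code), parameter resolution could look like this:

def resolve_llm_params(config):
    """Illustrative sketch: pick recommended or custom LLM parameters."""
    selection = config["chat_selection"]
    model = next(m for m in config.get("available_models", [])
                 if m["id"] == selection["model_id"])
    if selection.get("use_recommended", True):
        # Per-model tuned parameters from the catalog entry
        return dict(model.get("recommended_llm_params", {}))
    # Otherwise fall back to user-defined overrides
    # (custom_llm_params location is assumed here for illustration)
    return dict(config.get("custom_llm_params", {}))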
# List available prompt profiles
python main.py --list-profiles

Select a profile for prompt formatting:
- deepseek - Direct Q&A format without reasoning
- deepseek_reasoning - Hidden reasoning (shows only final answer)
- deepseek_show_reasoning - Shows full thinking process
- llama - Llama/Mistral format
- chatml - ChatML format (GPT-4 style)
- generic - Universal format that works with most models
Then choose a model ID from the catalog using --list-models.
python main.py
# With debug info to see prompt formatting
python main.py --debug
# Exit the chat with :q

For quick testing without interactive mode, you can pipe questions directly:
# Quick test with a simple question
echo "What is 2+2?" | python main.py
# Test and immediately quit
echo -e "What is the capital of France?\n:q" | python main.py
# Debug mode for troubleshooting
echo "Hello AI" | python main.py --debugThis is especially useful for:
- CI/CD testing: Automated testing in pipelines (see the scripted sketch after this list)
- Profile validation: Quick checks when switching models
- Performance benchmarking: Timing response generation
- Troubleshooting: Debugging prompt formatting issues
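For example, a CI job could drive the same piped interface from Python. This is a hedged sketch that assumes main.py reads a question from stdin and exits on :q, exactly as in the echo examples above.

import subprocess
import sys

# Pipe a question plus the quit command into the app and capture its output
result = subprocess.run(
    [sys.executable, "main.py"],
    input="What is 2+2?\n:q\n",
    capture_output=True,
    text=True,
    timeout=300,  # generous limit to allow model loading plus inference
)
assert result.returncode == 0, result.stderr
print(result.stdout)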
MacBook Air 8GB Memory Warning:
- Models may experience degraded performance with limited RAM
- Larger models (7B+) may cause system slowdown or memory pressure
- Consider using smaller models like TinyLlama (1.1B) or Llama 3.2 1B for optimal performance
Recommended Memory Guidelines:
- 8GB RAM: Stick to 1-2B parameter models (TinyLlama, Llama 3.2 1B)
- 16GB RAM: Up to 7B models work well (Mistral 7B, Llama 3.1 8B)
- 32GB+ RAM: All models in catalog supported
Alpha Version Model Selection: The pre-configured models in our catalog are specifically chosen to work well on 8GB RAM systems. While larger models (7B-8B) are included for users with more memory, the focus is on ensuring reliable performance on resource-constrained hardware typical of MacBook Air and similar laptops.
No Conversation History: This alpha version operates in single-shot mode:
- Each question is independent - no memory of previous exchanges
- Models don't maintain conversation context between messages
- Every prompt is treated as a fresh interaction
Why No History in Alpha:
- Memory Optimization: History storage would increase RAM usage significantly
- Simplicity: Single-shot is more predictable and easier to debug
- Performance: Avoids context length limitations and memory bloat
- Testing Focus: Allows validation of core prompt formatting without history complexity
Future Versions: Conversation history and multi-turn dialogue planned for v0.2.x series.
List all pre-configured models:
python main.py --list-models

Example output:
Available Models:
================================================================================
1. DeepSeek R1 Distill Qwen 1.5B (GGUF Q5_K_M)
Format: GGUF | Quantization: Q5_K_M | Context: 4k-8k
Recommended Profile: deepseek
Notes: Current model; balanced speed/quality for Qwen-style chat.
2. TinyLlama 1.1B Chat v1.0
Format: GGUF | Quantization: Q4_K_M | Context: 4k
Recommended Profile: llama
Notes: Ultra-fast, great for lightweight assistants and testing.
Change models by updating your chat selection in config.json:
{
"chat_selection": {
"model_id": "tinyllama-1.1b", // Model ID from catalog
"profile": "llama", // Prompt format profile
"use_recommended": true // Use model's recommended LLM parameters
}
}

8GB RAM Focus: All models in the catalog are selected for compatibility with 8GB systems in this alpha version. While larger models (7B-8B) are included, the primary focus is ensuring reliable performance on resource-constrained hardware.
Each model in the catalog includes:
- Name & Description: Clear identification and use case
- Format: GGUF, MLX, Transformers, etc.
- Quantization: Compression level (Q4_K_M, Q5_K_M, etc.)
- Context Length: Supported conversation history size
- Recommended Profile: Optimal prompt format for the model
- Technical Notes: Performance characteristics and requirements
- 8GB Compatibility: All models tested for 8GB RAM systems in alpha v0.1.x
Extend the catalog in config.json:
{
"available_models": [
{
"id": "my-custom-model",
"name": "My Custom Model",
"model": {
"repo": "username/model-repo",
"file": "model.gguf",
"cache_dir": ".models/custom"
},
"format": "gguf",
"quant": "Q4_K_M",
"context_len_hint": "4k",
"notes": "Custom model for specific use case",
"recommended_profile": "generic"
}
]
}

List all available profiles:
python main.py --list-profiles

{
"chat_selection": {
"model_id": "tinyllama-1.1b", // Model ID from catalog
"profile": "llama", // Prompt format profile
"use_recommended": true // Use model's recommended LLM parameters
},
"profiles": {
"llama": {
"format": "llama", // Prompt format type
"stop_tokens": ["[INST]", "[/INST]"], // When to stop generating
"description": "Llama format for Llama/Mistral models"
}
}
}

Change both model and profile in your chat selection:
{
"chat_selection": {
"model_id": "mistral-7b", // Model ID from catalog
"profile": "llama", // Prompt format profile
"use_recommended": true // Use model's recommended LLM parameters
}
}

Add your own profile for new model types:
{
"profiles": {
"my_custom_model": {
"format": "auto",
"stop_tokens": ["User:", "Bot:", "\n\n"],
"description": "Custom format for my specific model"
}
}
}

Laptop-GPT includes a comprehensive test suite to ensure reliability as you develop new features. Our testing infrastructure has been thoroughly battle-tested through real debugging scenarios.
# Install test dependencies
python run_tests.py --install-deps
# Run all tests
python run_tests.py
# Run with coverage report
python run_tests.py --coverage

The test suite is designed to help you learn testing patterns:
- Unit tests for individual functions
- Integration tests for complete workflows
- Mocking for external dependencies
- Configuration testing for validation
See TESTING.md for detailed testing guide and learning materials.
During development of this project, we encountered and solved several real testing challenges that demonstrate important testing principles:
Issue: Integration tests would hang indefinitely instead of completing.
Root Cause: Mock configuration mismatch between test expectations and actual implementation:
- Tests were mocking builtins.input
- Code actually uses console.input() from the Rich library
- Result: Tests waited forever for input that would never be properly mocked
Solution Process:
- Identified the problem: Used --debug mode and process monitoring
- Traced the mocking: Examined what functions were actually being called
- Fixed systematically: Updated all mock targets to match the actual implementation
Code Fix Example:
# ❌ Wrong - this doesn't work with Rich Console
with patch('builtins.input', side_effect=[':q']):
    result = main.main()

# ✅ Correct - mock the actual Console class and its input method
with patch('main.Console') as mock_console_class:
    mock_console = MagicMock()
    mock_console_class.return_value = mock_console
    mock_console.input.side_effect = [':q']
    result = main.main()

Testing Lesson: Always mock what's actually being called, not what you think should be called.
Issue: Tests expected llm.invoke() but code calls llm() directly.
Investigation Process:
# What we expected (wrong):
mock_llm.invoke.assert_called_once()
# What actually happens (correct):
mock_llm.assert_called_once()

Response Format Learning: The LLM returns a specific structure that must be mocked correctly:
# ✅ Correct mock response format
mock_llm.return_value = {
    "choices": [{"text": "Test response from LLM"}]
}

Testing Lesson: Study the actual API before writing mocks - don't assume the interface.
Issue: Tests couldn't find expected output in assertions.
Discovery Process:
# ❌ Wrong - checking built-in print calls
response_printed = any("Test response" in str(call)
                       for call in mock_print.call_args_list)

# ✅ Correct - checking Rich console.print calls
response_printed = any("Test response" in str(call)
                       for call in mock_console.print.call_args_list)

Testing Lesson: Verify output in the same place the application actually writes it.
Understanding these formulas helps you write better tests and optimize model performance:
KV_Cache_Memory = 2 * n_layers * n_ctx * hidden_size * bytes_per_param
Where:
- n_layers: Number of transformer layers
- n_ctx: Context window size (tokens)
- hidden_size: Model's hidden dimension
- bytes_per_param: 4 bytes (FP32) or 2 bytes (FP16)
Example Calculation:
# TinyLlama 1.1B: 22 layers, 2048 hidden size, 1024 context
tinyllama_kv = 2 * 22 * 1024 * 2048 * 2 # FP16
# = ~185 MB KV cache
# Mistral 7B: 32 layers, 4096 hidden size, 1024 context
mistral_kv = 2 * 32 * 1024 * 4096 * 2 # FP16
# = ~537 MB KV cache

Testing Implication: This explains why 1B models can handle much larger contexts than 7B models on the same hardware.
Total_Memory = Model_Weights + KV_Cache + Overhead
Model_Weights (quantized) = original_size * quantization_ratio
- Q4_K_M: ~0.5x original size
- Q5_K_M: ~0.625x original size
- Q8_0: ~1.0x original size
Overhead = ~200-500MB (OS, Python, llama-cpp)
Testing Strategy: Use these formulas to predict memory usage and set appropriate test limits.
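As a sketch, both formulas can be wrapped in a small helper for quick estimates. The TinyLlama layer and hidden-size figures come from the example above; the ~0.7 GB weight size and 0.4 GB overhead are assumptions for illustration.

def kv_cache_bytes(n_layers, n_ctx, hidden_size, bytes_per_param=2):
    """Keys + values for every layer and context position (FP16 by default)."""
    return 2 * n_layers * n_ctx * hidden_size * bytes_per_param

def estimated_total_gb(weights_gb, n_layers, n_ctx, hidden_size, overhead_gb=0.4):
    """Rough total: quantized weights + KV cache + runtime overhead."""
    return weights_gb + kv_cache_bytes(n_layers, n_ctx, hidden_size) / 1024**3 + overhead_gb

# TinyLlama 1.1B at 1024 context; ~0.7 GB is an assumed Q4_K_M weight size
print(f"~{estimated_total_gb(0.7, 22, 1024, 2048):.1f} GB total")  # roughly 1.3 GB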
The simple_model_test.py script provides standalone model validation functions that are excluded from the main pytest suite to avoid interference.
This file contains utility functions for testing model functionality without the complexity of the full test suite:
# Functions available (renamed from test_* to avoid pytest discovery):
validate_model_loading() # Test basic model loading
validate_model_inference() # Test text generation
validate_memory_usage() # Monitor memory consumption
validate_performance() # Benchmark inference speed

# Run specific validation functions
python -c "from simple_model_test import validate_model_loading; validate_model_loading()"
# Test inference with custom prompt
python -c "
from simple_model_test import validate_model_inference
validate_model_inference('What is artificial intelligence?')
"
# Check memory usage during operation
python -c "from simple_model_test import validate_memory_usage; validate_memory_usage()"Use simple_model_test.py for:
- New Model Validation: Test a new model before adding to catalog
# Edit simple_model_test.py to point to your model
python -c "from simple_model_test import validate_model_loading; validate_model_loading()"- Performance Benchmarking: Measure inference speed
python -c "
from simple_model_test import validate_performance
validate_performance(num_trials=5)
"- Memory Profiling: Monitor resource usage
# Install memory profiler if needed
pip install memory-profiler
# Run with memory monitoring
python -c "
from simple_model_test import validate_memory_usage
validate_memory_usage(verbose=True)
"- Quick Debugging: Test model behavior without full app startup
# Quick inference test
python -c "
from simple_model_test import validate_model_inference
result = validate_model_inference('Test prompt', debug=True)
print(f'Response: {result}')
"- Pytest Exclusion: Functions renamed from
test_*tovalidate_*to avoid automatic discovery - Standalone Operation: Can run without the full test infrastructure
- Resource Intensive: Model loading tests take time and memory
- Development Focus: Designed for manual testing during development
Add your own validation functions to simple_model_test.py:
def validate_custom_scenario():
    """Test a specific use case or model configuration."""
    # Your custom test logic here
    pass

def validate_profile_compatibility(profile_name):
    """Test a specific prompt profile with the current model."""
    # Test profile-specific behavior
    pass

From our debugging experience, here are key testing principles:
- Mock What's Actually Called: Don't assume - verify the actual function calls
- Test at the Right Level: Integration tests should test real workflows
- Understand the Implementation: Study the code before writing tests
- Use Debug Mode: Enable verbose output to understand test failures
- Test Memory Constraints: Include resource usage in your test considerations
- Separate Concerns: Keep heavy model tests separate from fast unit tests
tests/
├── test_cli.py               # Command-line interface tests
├── test_config_utils.py      # Configuration loading tests
├── test_llm_utils.py         # LLM utility function tests
├── test_main_integration.py  # Full application workflow tests
├── conftest.py               # Shared test fixtures
└── fixtures/                 # Test data files
    ├── valid_config.json
    ├── invalid_config.json
    └── incomplete_config.json

simple_model_test.py          # Standalone model validation (excluded from pytest)
This organization separates fast unit tests from resource-intensive model tests, ensuring rapid development iteration while maintaining comprehensive coverage.
Once activated, your terminal prompt will usually change to show (.venv) at the beginning. While the environment is active:
- All Python and pip commands will use the isolated environment.
- To leave the virtual environment, simply run:
deactivate

If you open a new terminal, remember to activate the environment again before running any project commands.
├── main.py            # Main application entry point
├── cli.py             # Command line argument parsing with profile support
├── config_utils.py    # Configuration loading and environment setup
├── llm_utils.py       # LLM setup, model downloading, and flexible prompt formatting
├── config.json        # Configuration file with prompt profiles
├── requirements.txt   # Python dependencies
└── README.md          # This file
Different models expect different conversation formats:
# DeepSeek Q&A format
"Question: What is AI?\nAnswer:"
# DeepSeek Chat format
"<|User|>What is AI?<|Assistant|>"
# Llama/Mistral format
"[INST] What is AI? [/INST]"
# ChatML format
"<|im_start|>user\nWhat is AI?<|im_end|>\n<|im_start|>assistant\n"
# Generic format
"### Human: What is AI?\n### Assistant:"Stop tokens prevent models from continuing conversations indefinitely:
{
"stop_tokens": ["Question:", "Answer:", "User:", "Human:", "\n\n"]
}

How it works:
- Model generates: "Answer: AI is... Question: What about..."
- Stops at "Question:"
- Returns: "Answer: AI is..."
For models that show thinking processes:
{
"show_thinking": false // Hide <think>...</think> content
}

Input:
<think>
The user is asking about AI. I should provide a clear, informative answer.
</think>
AI is the simulation of human intelligence in machines.
Output (thinking hidden):
AI is the simulation of human intelligence in machines.
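A minimal sketch of how that filtering could be done, assuming the model wraps its reasoning in <think> tags as shown above (the function name is illustrative, not Laptop-GPT's actual code):

import re

def strip_thinking(text, show_thinking=False):
    """Remove <think>...</think> blocks when show_thinking is disabled."""
    if show_thinking:
        return text
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

raw = "<think>\nThe user is asking about AI.\n</think>\nAI is the simulation of human intelligence in machines."
print(strip_thinking(raw))
# AI is the simulation of human intelligence in machines.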
python main.py --list-profiles # See available prompt profiles
python main.py --help-config # View configuration help
python main.py --debug # Enable debug mode

Edit config.json to customize:
{
"chat_selection": {
"model_id": "deepseek-r1-1.5b", // Model ID from catalog
"profile": "deepseek_reasoning", // Prompt format profile
"use_recommended": true // Use model's recommended LLM parameters
},
"model": {
"repo": "bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF",
"file": "DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf",
"cache_dir": ".models/deepseek"
},
"llm_params": {
"n_ctx": 2048, // Context window size
"temperature": 0.3, // Response creativity (0.0-1.0)
"max_tokens": 512, // Maximum response length
"n_threads": 6 // CPU threads to use
}
}

- Profile switching: Change prompt formats without code changes
- Stop token tuning: Fine-tune response boundaries
- Reasoning control: Toggle thinking process visibility
- Debug mode: See exactly what prompts are sent to models
python main.py --help # Show all options
python main.py --debug # Enable debug mode with prompt visibility
python main.py --log-level INFO # Set logging level
python main.py --config custom.json # Use custom config file
python main.py --list-profiles # List available prompt profiles

Problem: Model generates weird tokens, loops, or doesn't follow chat format
Solution: Switch to a different profile:
{
"chat_selection": {
"model_id": "tinyllama-1.1b", // Current model ID
"profile": "generic", // Try universal format
"use_recommended": true // Use model's recommended LLM parameters
}
}

Problem: Model shows the thinking process when you don't want it
Solution: Use a non-reasoning profile:
{
"chat_selection": {
"model_id": "deepseek-r1-1.5b", // Current model ID
"profile": "deepseek", // Simple Q&A format
"use_recommended": true // Use model's recommended LLM parameters
}
}

Problem: Model stops responding too early
Solution: Adjust stop tokens:
{
"profiles": {
"my_profile": {
"stop_tokens": ["\n\n"] // Only stop on double newlines
}
}
}

Enable debug mode to see what's happening:
python main.py --debug
# Or test quickly with echo
echo "Test question" | python main.py --debugThis shows:
- Exact prompt format being used
- Active profile and settings
- Stop tokens in effect
- Model loading details
Use echo commands for rapid testing:
# Test model response quality
echo "What is artificial intelligence?" | python main.py
# Validate profile switching
echo "2+2=?" | python main.py --debug
# Check memory performance
echo -e "Simple test\n:q" | python main.pyWhen to use echo testing:
- After changing profiles or models
- When troubleshooting prompt formatting
- For performance benchmarking
- During development and testing
If a profile doesn't exist:
Warning: Profile 'nonexistent' not found, using generic profile
List available profiles:
python main.py --list-profiles

Traditional approach (problematic):
Model File → Embedded Chat Template → llama-cpp → Output
    ↓                 ↓                   ↓
  May be         Often broken        Incompatible
 quantized      or incompatible       with format

Our approach (reliable):

Model File → Laptop-GPT Profile → Custom Format → llama-cpp → Output
    ↓               ↓                   ↓              ↓
 Any GGUF     User-controlled     Works reliably   Predictable
- Quantization Artifacts: GGUF conversion can corrupt embedded templates
- Distillation Issues: Smaller models inherit templates from incompatible parent models
- Framework Differences: Templates designed for Transformers may not work in llama-cpp
- Training Mismatches: Fine-tuning can create discrepancies between model behavior and templates
- Guaranteed Compatibility: Profiles always work regardless of model metadata
- User Control: You decide the exact format, not the model file
- Easy Debugging: See exactly what prompt is sent with --debug
- Flexible: Switch formats without re-downloading models
- Extensible: Add new profiles for new model types easily
- DeepSeek-R1-Distill-Qwen-1.5B: Excellent reasoning capabilities in small package
- GGUF Quantization: Q5_K_M provides optimal quality/size balance
- CPU-First Design: Universal compatibility across hardware
- Memory Efficiency: Runs comfortably on 8GB systems
- Profile-Based Control: Format templates separated from model files
- Template Independence: No reliance on potentially broken embedded templates
- Debug Transparency: Always know what prompt format is being used
- Reasoning Flexibility: Toggle thinking process visibility
- Local-First: Complete offline operation
- Memory-Aware: Optimized for resource-constrained environments
Q: Why can I run a 1B model with a huge context, but not a 7B?
A: Because the hidden size is smaller in 1B models, so each token's KV cache footprint is much smaller. The KV cache grows with both context length and model size - a 7B model's attention states require significantly more memory per token than a 1B model's.
Q: Does quantization help with memory usage?
A: Quantization (like Q4_K_M, Q5_K_M) shrinks model weights significantly, but doesn't touch KV cache memory usage β unless you use KV quantization. In llama-cpp, you can use --cache-type options like q8_0 or q4_0 to quantize the KV cache as well.
Q: What's n_gpu_layers and should I use it?
A: It's how many transformer layers you offload to GPU (Metal on Mac). It can speed up inference significantly but doesn't change KV cache memory usage much. On MacBook with limited unified memory, sometimes keeping everything on CPU is more stable.
Q: What's the safest setting for my MacBook Air 8GB?
A: Keep n_ctx around 1024β1536 for larger models (7B+), and increase only if you have memory headroom. Always set use_mmap: true and use_mlock: false on macOS. Start with 1B-2B models for best experience.
Q: My system gets slow/hot when running larger models. What's happening?
A: Large models can cause memory pressure, forcing macOS to swap to disk. This creates heat and slowdown. Try smaller models (1B-2B parameters) or reduce n_ctx to 512-1024. Monitor Activity Monitor for memory pressure.
Q: What do all these parameters in llm_params actually do?
A: Here's a breakdown:
- n_ctx: Context window size (how much conversation history the model remembers)
- temperature: Creativity/randomness (0.0 = deterministic, 1.0 = very creative)
- top_p: Nucleus sampling (0.9 is usually good - lower = more focused)
- max_tokens: Maximum response length
- n_threads: CPU threads to use (usually 4-8 for good performance)
- use_mmap: Memory mapping for efficient loading (always true on macOS)
- use_mlock: Lock model in RAM (set to false on macOS to avoid issues)
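For orientation, here is a sketch of where these values typically end up when using llama-cpp-python directly. The model path and numbers reuse the example configuration in this README and are illustrative; Laptop-GPT wires them up for you.

from llama_cpp import Llama

# Load-time parameters (values mirror the example config above)
llm = Llama(
    model_path=".models/deepseek/DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf",
    n_ctx=2048,        # context window
    n_threads=6,       # CPU threads
    n_gpu_layers=0,    # layers offloaded to GPU (Metal on Mac)
    use_mmap=True,     # memory-map the model file
    use_mlock=False,   # do not pin the model in RAM on macOS
)

# Generation-time parameters
out = llm(
    "Question: What is AI?\nAnswer:",
    max_tokens=512,
    temperature=0.3,
    top_p=0.9,
    stop=["Question:", "User:", "Human:", "\n\n"],
)
print(out["choices"][0]["text"])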
Q: How do I choose the right temperature value?
A:
- 0.0-0.3: Factual, deterministic responses (good for coding, math)
- 0.3-0.7: Balanced creativity and accuracy (general chat)
- 0.7-1.0: Creative writing, brainstorming
- 1.0+: Very creative but potentially incoherent
Q: What's the difference between n_ctx and max_tokens?
A: n_ctx is the total context window (input + output), while max_tokens is the maximum length of just the response. For example, with n_ctx of 2048 and max_tokens of 512, the prompt itself can use at most about 1536 tokens. Think of n_ctx as the model's "memory" and max_tokens as "how long it can talk."
Q: Which model should I start with?
A: For 8GB RAM systems:
- TinyLlama 1.1B: Ultra-fast, great for testing and simple tasks
- DeepSeek-R1-Distill-Qwen-1.5B: Current default, excellent reasoning
- Llama 3.2 1B: Good balance of capability and speed
For 16GB+ RAM:
- Mistral 7B: Excellent general capability
- Llama 3.1 8B: Strong reasoning and coding
Q: What's the difference between reasoning profiles?
A:
- deepseek: Direct answers only (hides thinking process)
- deepseek_reasoning: Shows final answer but hides internal reasoning
- deepseek_show_reasoning: Shows full thinking process in <think> tags
- generic: Universal format that works with most models
Q: My model keeps generating weird tokens or doesn't stop properly. Help?
A: This is usually a prompt format issue. Try:
- Switch to the generic profile first
- Use --debug to see the exact prompt being sent
- Adjust stop_tokens in your profile
- Some quantized models need specific stop tokens - check the model's documentation
Q: Can I use models not in the catalog?
A: Yes! Add them to available_models in config.json:
{
"available_models": [
{
"id": "my-model",
"name": "My Custom Model",
"model": {
"repo": "username/repo-name",
"file": "model-file.gguf",
"cache_dir": ".models/custom"
},
"format": "gguf",
"quant": "Q4_K_M",
"context_len_hint": "4k",
"notes": "Custom model for specific use case",
"recommended_profile": "generic",
"recommended_llm_params": {
"n_ctx": 1536,
"n_batch": 8,
"n_threads": 4,
"temperature": 0.3,
"max_tokens": 512,
"use_mlock": false,
"use_mmap": true,
"n_gpu_layers": 0
}
}
]
}

Q: The model downloaded but won't load. What's wrong?
A: Common causes:
- Corrupted download: Try deleting the model file and re-downloading
- Wrong format: Ensure you're using GGUF format files
- Insufficient memory: Try a smaller model or reduce n_ctx
- File permissions: Check that the cache directory is writable
Q: Responses are very slow. How can I speed things up?
A: Try these optimizations:
- Increase n_threads (try 4-8)
- Reduce n_ctx to 512-1024
- Use a smaller model (1B instead of 7B)
- On Mac: Enable Metal GPU acceleration with n_gpu_layers
- Ensure use_mmap: true for faster loading
Q: The model gives different answers each time. Is this normal?
A: Yes, unless temperature is 0.0. AI models are probabilistic by nature. To get more consistent responses:
- Lower temperature (try 0.1-0.3)
- Lower top_p (try 0.7-0.8)
- Some randomness is usually desirable for natural conversation
Q: I get "import errors" or "module not found" - what's missing? A: Make sure you:
- Activated your virtual environment:
source .venv/bin/activate - Installed requirements:
pip install -r requirements.txt - Have Python 3.8+ installed
- On macOS, you might need Xcode command line tools:
xcode-select --install
Q: How do I create a custom prompt profile?
A: Add a new profile to config.json:
{
"profiles": {
"my_custom": {
"format": "auto",
"stop_tokens": ["Human:", "AI:", "\n\n"],
"description": "My custom format for specific use case",
"show_thinking": true
}
}
}

Q: Can I run this in a script or automation?
A: Yes! Use echo commands:
echo "What is 2+2?" | python main.py
echo -e "Question 1\n:q" | python main.py --debugQ: How do I contribute new models to the catalog? A: We welcome contributions! Test your model thoroughly on 8GB systems and submit a PR with:
- Model entry in the catalog format
- Performance notes and memory requirements
- Recommended profile and configuration
- Basic functionality validation
Q: Is there conversation history? Why does it forget previous messages?
A: This alpha version (v0.1.x) operates in single-shot mode - each message is independent. This is intentional for:
- Memory optimization on 8GB systems
- Simplicity and reliability
- Testing core functionality
Conversation history is planned for v0.2.x series.