This repository documents how to run CodeGemma 1.1 7B Instruct efficiently on developer laptops using LM Studio, with a focus on the Q5_K_M quantization and sane default sampling parameters.
The goals are:
- Give developers a repeatable setup for a local coding assistant.
- Explain why we choose specific model variants and parameters.
- Help a CTO or tech lead guide the team based on each laptop’s hardware.
Tested on a laptop-class setup:
- Lenovo ThinkBook 16 G7 IML
- Intel® Core™ Ultra 7 155H
- Intel Arc integrated GPU (Meteor Lake-P)
- Intel Meteor Lake NPU (Intel AI Boost)
- Ubuntu Linux
For day‑to‑day coding work we recommend:
CodeGemma 1.1 – 7B Instruct – Q5_K_M
This variant offers:
- ✅ High quality: very close to full‑precision (FP16/BF16) performance.
- ✅ Good speed on modern laptop GPUs with ~12–16 GiB VRAM.
- ✅ Comfortable context length (4k–8k tokens depending on runner) without constantly hitting VRAM limits.
On a laptop with ≈15 GiB of VRAM, this is an ideal sweet spot: the model fits comfortably, you keep a long context window, and you still have room for larger batch sizes and high‑throughput decoding.
Quantization is a trade‑off between:
- Quality (how close we are to the original weights).
- Memory usage (VRAM / RAM).
- Speed (smaller quantized weights move through the GPU faster).
From a quality perspective, for 7B models:
- Q3 → light and fast, but noticeable quality loss.
- Q4 → good compromise; you can feel some degradation vs FP16.
- Q5 → “near‑native” quality, usually indistinguishable for coding tasks.
- Q6/Q8 → almost pure FP16 quality, but much more VRAM.
On a machine with 15.3 GiB VRAM:
- Q5_K_M uses ~6 GiB for the model weights.
- You still have plenty of headroom for:
  - the key/value cache (attention),
  - a long context window,
  - the OS and other GPU workloads.
With Q6 or Q8_0, VRAM usage climbs significantly, leaving less space for context and cache. For coding, context length matters more than the last 1–2% of quality, so Q5 is the better trade‑off.
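If you want to sanity-check these numbers yourself, the back-of-the-envelope sketch below estimates the weight footprint per quantization. The bits-per-weight values and the ~8.5B parameter count (roughly the Gemma 7B architecture, embeddings included) are approximations, not exact GGUF file sizes.

```python
# Back-of-the-envelope estimate for the quantized weights alone
# (excludes the KV cache, activations and runtime overhead).
# Bits-per-weight values are ballpark figures, not exact GGUF sizes.
BITS_PER_WEIGHT = {
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.5,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

def weight_size_gib(n_params: float, quant: str) -> float:
    """Approximate size of the weights in GiB for a given quantization."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 2**30

# Roughly 8.5e9 parameters for the Gemma 7B architecture (approximate).
for quant in ("Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"):
    print(f"{quant}: ~{weight_size_gib(8.5e9, quant):.1f} GiB")
```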
You can use this decision table as a guideline:
| Laptop GPU / VRAM | Recommended quant for 7B | Notes |
|---|---|---|
| ≤ 4 GiB (or iGPU only) | 2B model (Q4/Q5) | 7B is usually too heavy here; use a smaller model. |
| 6–8 GiB | 7B Q4_K_M | Fits, but keep context smaller (2k–4k tokens). |
| 10–12 GiB | 7B Q4_K_M or Q5_K_M | Q5 if you prefer quality; Q4 if you want more speed / parallel loads. |
| 14–16 GiB | 7B Q5_K_M | Recommended default (this repository’s scenario). |
| ≥ 20 GiB | 7B Q6_K or Q8_0 | Only if you really want the last bit of quality and can afford the VRAM. |
For mixed fleets, a CTO can define a baseline (7B Q5) and fallbacks (see the sketch after this list):
- If a dev has a weak GPU → 2B completion‑only model.
- If a dev has a mid‑range GPU (8 GiB) → 7B Q4.
- If a dev has a strong GPU (16–24 GiB) → 7B Q5 or Q6.
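As a minimal sketch, this fallback logic can be encoded in a small helper; the thresholds simply mirror the decision table above and are a starting point, not a hard rule.

```python
def recommend_model(vram_gib: float) -> str:
    """Illustrative helper mirroring the decision table above."""
    if vram_gib <= 4:
        return "2B model (Q4/Q5) - 7B is usually too heavy"
    if vram_gib <= 8:
        return "7B Q4_K_M (keep context at 2k-4k tokens)"
    if vram_gib <= 12:
        return "7B Q4_K_M or Q5_K_M"
    if vram_gib < 20:
        return "7B Q5_K_M (recommended default)"
    return "7B Q6_K or Q8_0 (only if the extra VRAM is spare)"

print(recommend_model(15.3))  # -> 7B Q5_K_M (recommended default)
```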
When using the model as a coding assistant, we care less about raw creativity and more about:
- determinism and correctness,
- few hallucinations,
- a stable style across sessions.
That leads to conservative sampling defaults.
Recommended:
temperature = 0.2–0.3
- Lower temperature makes the model more deterministic and less “chatty”.
- For code, high temperature (e.g. 0.8–1.0) increases:
- hallucinations,
- random API calls,
- inconsistent formatting.
Lower temperature carries no performance penalty on this hardware; it is purely a quality / style choice:
- For pure coding / refactors / tests → use 0.2.
- For architecture discussion / doc writing → raise it to 0.4–0.5 if you want more varied wording.
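For intuition, here is a toy illustration of what temperature does to the next-token distribution (the three logits are invented for the example):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature before softmax:
    lower temperature -> sharper, more deterministic distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 3.0, 1.0]  # made-up scores for three candidate tokens
print(softmax_with_temperature(logits, 0.2))  # top token gets ~99% of the mass
print(softmax_with_temperature(logits, 1.0))  # probability spread more evenly
```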
Recommended:
top_p = 0.9
top_p (nucleus sampling) keeps only the smallest set of tokens whose cumulative probability reaches p.
- top_p = 1.0 → use the full distribution; more randomness.
- top_p = 0.9 → cut off low‑probability tokens that often cause:
  - weird syntax,
  - off‑topic content,
  - very “creative” variable names.
For code, this is a good balance: the model can still choose among several good options, but avoids unlikely choices that usually break compilation.
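A minimal sketch of the top_p (nucleus) cut-off, using made-up token probabilities, shows why the unlikely tail gets dropped:

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p,
    then renormalise. `probs` maps token -> probability."""
    kept, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}

# Invented distribution: the unlikely "flamboyant_var" token is dropped.
print(top_p_filter({"return": 0.6, "result": 0.25, "value": 0.1, "flamboyant_var": 0.05}, p=0.9))
```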
Recommended:
top_k = 40
top_k keeps only the top‑K tokens by probability.
- Very high top_k (e.g. 200–1000) + high top_p → many candidate tokens, more creativity, more risk.
- Very low top_k (e.g. 5–10) → text may become repetitive or stuck.
For coding on a 7B model, top_k = 40 gives enough diversity to restructure code and generate alternatives, while still strongly preferring the top tokens (which usually correspond to syntactically correct, relevant completions).
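The top_k cut-off is even simpler; the token distribution below is again invented, and real samplers typically apply the top_k and top_p filters together before drawing a token:

```python
def top_k_filter(probs, k=40):
    """Keep only the k most likely tokens and renormalise."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {token: prob / total for token, prob in ranked}

# With a toy 4-token distribution and k=2, only the two best candidates survive.
print(top_k_filter({"if": 0.5, "for": 0.3, "while": 0.15, "goto": 0.05}, k=2))
```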
You can also adjust:
- Repetition penalty (repeat_penalty or frequency_penalty):
  - Default: 1.0 (no penalty).
  - If the model loops or repeats long blocks, try 1.05–1.1 (a toy sketch of how the penalty works follows below).
- Max tokens:
  - For short inline completions: 64–128.
  - For full functions / tests / docs: 256–1024, depending on the context limit.
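For intuition, this is a simplified sketch of how a llama.cpp‑style repeat penalty adjusts the logits of recently generated tokens; LM Studio’s exact internals may differ:

```python
def apply_repeat_penalty(logits, recent_tokens, penalty=1.1):
    """Simplified repeat penalty: make recently generated tokens less likely.
    Positive logits are divided by the penalty, negative ones multiplied,
    so a penalised token always loses probability mass."""
    adjusted = dict(logits)
    for token in set(recent_tokens):
        if token in adjusted:
            if adjusted[token] > 0:
                adjusted[token] /= penalty
            else:
                adjusted[token] *= penalty
    return adjusted

logits = {"foo": 2.0, "bar": 1.5, "baz": -0.5}
print(apply_repeat_penalty(logits, recent_tokens=["foo", "baz"], penalty=1.1))
```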
In config/lmstudio_preset_codegemma_7b_it_q5.json you’ll find an example preset you can adapt.
Key points:
- Model: CodeGemma 1.1 – 7B Instruct – Q5_K_M
- Temperature: 0.25
- Top‑p: 0.9
- Top‑k: 40
- Max tokens: 512 (good default for most coding replies)
- Context length: use the model’s max (e.g. 4096 or 8192), adjusted to your VRAM.
Each developer can import this preset and adjust only what they need (e.g. max tokens or temperature) while keeping a consistent baseline across the team.
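If you also drive the model from scripts, the same values map directly onto a request against LM Studio’s OpenAI‑compatible local server. The port below is LM Studio’s usual default and the model identifier is a placeholder; use whatever name your install reports (top‑k and the repeat penalty are normally set in the preset itself):

```python
import requests

# Assumes LM Studio's local server is running on its default port (1234)
# and the CodeGemma 7B Q5_K_M model is loaded. The model name below is a
# placeholder - use the identifier LM Studio shows for your download.
response = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "codegemma-1.1-7b-it-Q5_K_M",
        "messages": [
            {"role": "user", "content": "Write a pytest for a slugify() helper."},
        ],
        "temperature": 0.25,
        "top_p": 0.9,
        "max_tokens": 512,
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```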
You can use this repository as a playbook:
- Inventory developer hardware: collect the GPU model and VRAM for each laptop.
- Assign a tier per machine: map each machine to tier-small (no GPU / ≤ 4 GiB), tier-mid (6–8 GiB), tier-strong (≥ 12–16 GiB), or tier-ultra (32–64 GiB); see the sketch after this list.
- Pick a model + quant per tier: keep a simple table in your internal docs (see docs/model_selection_for_laptops.md) so devs know exactly what to download.
- Standardise sampling defaults: reuse the same preset (temperature, top‑p, top‑k, penalties) so that:
  - prompts are more reproducible,
  - people can share “prompt recipes” reliably.
- Iterate based on feedback: if some teams need more creativity (e.g. product copy, docs), create a second preset with:
  - a higher temperature,
  - a slightly larger top‑p.
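A tiny sketch of the first two steps, assuming a hypothetical inventory (hostnames and VRAM figures are made up); where the tier ranges above leave gaps, pick whatever thresholds fit your fleet:

```python
# Hypothetical inventory: hostname -> VRAM in GiB (values are made up).
FLEET = {
    "dev-laptop-01": 4,
    "dev-laptop-02": 8,
    "dev-laptop-03": 15.3,
    "workstation-01": 48,
}

def assign_tier(vram_gib: float) -> str:
    """Map VRAM to the tiers above; the 8-12 and 16-32 GiB gaps are
    resolved here with arbitrary cut-offs - adjust to taste."""
    if vram_gib <= 4:
        return "tier-small"
    if vram_gib <= 8:
        return "tier-mid"
    if vram_gib < 32:
        return "tier-strong"
    return "tier-ultra"

for host, vram in FLEET.items():
    print(f"{host}: {assign_tier(vram)}")
```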
- README.md → this file, high‑level overview and reasoning.
- docs/model_selection_for_laptops.md → more detailed guidance and tables.
- config/lmstudio_preset_codegemma_7b_it_q5.json → example LM Studio preset for CodeGemma 7B Q5.
Feel free to fork / rename to match your company’s internal standards.