Local Coding Assistant – CodeGemma on Laptops (v2)

This repository documents how to run CodeGemma 1.1 7B Instruct efficiently on developer laptops using LM Studio, with a focus on the Q5_K_M quantization and sane default sampling parameters.

The goals are:

  • Give developers a repeatable setup for a local coding assistant.
  • Explain why we choose specific model variants and parameters.
  • Help a CTO or tech lead guide the team depending on each laptop’s hardware.

Tested on a laptop-class setup:

  • Lenovo ThinkBook 16 G7 IML
  • Intel® Core™ Ultra 7 155H
  • Intel Arc integrated GPU (Meteor Lake-P)
  • Intel Meteor Lake NPU (Intel AI Boost)
  • Ubuntu Linux

1. Recommended model for coding

For day‑to‑day coding work we recommend:

CodeGemma 1.1 – 7B Instruct – Q5_K_M

This variant offers:

  • High quality: very close to full‑precision (FP16/BF16) performance.
  • Good speed on modern laptop GPUs with ~12–16 GiB VRAM.
  • Comfortable context length (4k–8k tokens depending on runner) without constantly hitting VRAM limits.

On a laptop with ≈15 GiB of VRAM, this is an ideal sweet spot: the model fits comfortably, you keep a long context window, and you still have room for larger batch sizes and high‑throughput decoding.


2. Why Q5_K_M for this laptop?

Quantization is a trade‑off between:

  • Quality (how close we are to the original weights).
  • Memory usage (VRAM / RAM).
  • Speed (smaller quantized weights move through the GPU faster).

2.1 Quick mental model

From a quality perspective, for 7B models:

  • Q3 → light and fast, but noticeable quality loss.
  • Q4 → good compromise; you can feel some degradation vs FP16.
  • Q5 → “near‑native” quality, usually indistinguishable for coding tasks.
  • Q6/Q8 → almost pure FP16 quality but much more VRAM.

On a machine with 15.3 GiB VRAM:

  • Q5_K_M uses ~6 GiB for the model weights.
  • You still have plenty of headroom for:
    • key/value cache (attention),
    • a long context window,
    • the OS and other GPU workloads.

With Q6 or Q8_0, VRAM usage climbs significantly, leaving less space for context and cache. For coding, context length matters more than the last 1–2% of quality, so Q5 is the better trade‑off.
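To make the headroom argument concrete, here is a rough back‑of‑envelope sketch (an estimate, not a measurement). The layer count, KV‑head count and head dimension are assumed Gemma‑7B‑like values; actual usage also depends on the runner, batch size, and whether the KV cache itself is quantized.

```python
# Rough VRAM budget: quantized weights + fp16 KV cache.
# Architecture constants are assumptions (Gemma-7B-like); check the model
# card / GGUF metadata for the exact values of your build.
GIB = 1024 ** 3

def kv_cache_bytes(context_tokens: int,
                   n_layers: int = 28,       # assumed layer count
                   n_kv_heads: int = 16,     # assumed number of KV heads
                   head_dim: int = 256,      # assumed head dimension
                   bytes_per_value: int = 2  # fp16 K/V entries
                   ) -> int:
    # One key and one value vector per layer, per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens

def print_budget(weights_gib: float, context_tokens: int, total_vram_gib: float) -> None:
    kv_gib = kv_cache_bytes(context_tokens) / GIB
    print(f"weights ~{weights_gib:.1f} GiB + KV cache ~{kv_gib:.1f} GiB "
          f"= ~{weights_gib + kv_gib:.1f} GiB of {total_vram_gib:.1f} GiB")

# Q5_K_M weights ~6 GiB, 8k context, ~15.3 GiB of (shared) VRAM:
print_budget(weights_gib=6.0, context_tokens=8192, total_vram_gib=15.3)
```

With these assumptions an 8k‑token context adds roughly 3.5 GiB of KV cache on top of the ~6 GiB of weights, which is why Q5_K_M still leaves comfortable headroom on this machine.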

2.2 When to use other quants

You can use this decision table as a guideline:

| Laptop GPU / VRAM | Recommended quant for 7B | Notes |
| --- | --- | --- |
| ≤ 4 GiB (or iGPU only) | 2B model (Q4/Q5) | 7B is usually too heavy here; use a smaller model. |
| 6–8 GiB | 7B Q4_K_M | Fits, but keep context smaller (2k–4k tokens). |
| 10–12 GiB | 7B Q4_K_M or Q5_K_M | Q5 if you prefer quality; Q4 if you want more speed / parallel loads. |
| 14–16 GiB | 7B Q5_K_M | Recommended default (this repository’s scenario). |
| ≥ 20 GiB | 7B Q6_K or Q8_0 | Only if you really want the last bit of quality and can afford the VRAM. |

For mixed fleets, a CTO can define a baseline (7B Q5) and fallbacks (sketched in code after this list):

  • If dev has weak GPU → 2B completion‑only model.
  • If dev has mid GPU (8 GiB) → 7B Q4.
  • If dev has strong GPU (16–24 GiB) → 7B Q5 or Q6.
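As a rough translation of the decision table and fallbacks above into code, something like this (thresholds mirror this README’s guidance, not a hard rule):

```python
def recommend_quant(vram_gib: float) -> str:
    """Suggest a CodeGemma variant for a given amount of laptop VRAM.

    Thresholds mirror the decision table in this README; tune them for your fleet.
    """
    if vram_gib <= 4:
        return "2B model (Q4/Q5), completion-only"
    if vram_gib <= 8:
        return "7B Q4_K_M (keep context at 2k-4k tokens)"
    if vram_gib <= 12:
        return "7B Q4_K_M or Q5_K_M"
    if vram_gib <= 16:
        return "7B Q5_K_M (recommended default)"
    return "7B Q6_K or Q8_0 (only if you can afford the VRAM)"

print(recommend_quant(15.3))  # the laptop used in this repository -> Q5_K_M
```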

3. Sampling parameters and why we use them

When using the model as a coding assistant, we care less about raw creativity and more about:

  • determinism and correctness,
  • low hallucination rates,
  • a stable style across sessions.

That leads to conservative sampling defaults.

3.1 Temperature

Recommended: temperature = 0.2–0.3

  • Lower temperature makes the model more deterministic and less “chatty”.
  • For code, high temperature (e.g. 0.8–1.0) increases:
    • hallucinations,
    • random API calls,
    • inconsistent formatting.

Lower temperature carries no performance penalty; it is purely a quality / style choice (a minimal sketch of temperature scaling follows this list). As a rule of thumb:

  • Pure coding / refactors / tests → use 0.2.
  • Architecture discussion / doc writing → you can raise to 0.4–0.5 if you want more varied wording.
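Under the hood, temperature simply rescales the logits before the softmax. A minimal sketch with made‑up logit values:

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    # Divide logits by T before softmax: T < 1 sharpens the distribution
    # (more deterministic), T > 1 flattens it (more random).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 3.0, 1.0]                        # made-up scores for three tokens
print(softmax_with_temperature(logits, 0.2))    # top token dominates (~99%)
print(softmax_with_temperature(logits, 1.0))    # noticeably broader spread
```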

3.2 Top‑p (nucleus sampling)

Recommended: top_p = 0.9

top_p keeps only the smallest set of tokens whose cumulative probability reaches p, and samples from that reduced set.

  • top_p = 1.0 → use the full distribution; more randomness.
  • top_p = 0.9 → cut off low‑probability tokens that often cause:
    • weird syntax,
    • off‑topic content,
    • very “creative” variable names.

For code, this is a good balance: the model can still choose among several good options, but avoids unlikely choices that usually break compilation.
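A minimal sketch of that cutoff, using a made‑up next‑token distribution:

```python
def nucleus_filter(probs: dict[str, float], top_p: float = 0.9) -> dict[str, float]:
    # Keep the smallest set of tokens whose cumulative probability reaches top_p,
    # then renormalize so the kept probabilities sum to 1.
    kept, cumulative = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

# Made-up distribution for illustration only:
probs = {"return": 0.55, "result": 0.30, "yield": 0.10, "banana": 0.05}
print(nucleus_filter(probs, top_p=0.9))   # the low-probability outlier is dropped
```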

3.3 Top‑k

Recommended: top_k = 40

top_k keeps only the top‑K tokens by probability.

  • Very high top_k (e.g. 200–1000) + high top_p → many candidate tokens, more creativity, more risk.
  • Very low top_k (e.g. 5–10) → text may become repetitive or stuck.

For coding on a 7B model:

  • top_k = 40 gives enough diversity to restructure code and generate alternatives,
    while still strongly preferring the top tokens (which usually correspond to syntactically correct, relevant completions).
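The same idea for top_k, as a companion to the top_p sketch above (real runners combine the two filters, and the exact order can vary by implementation):

```python
def top_k_filter(probs: dict[str, float], top_k: int = 40) -> dict[str, float]:
    # Keep only the K most probable tokens and renormalize; typically applied
    # together with the top_p cutoff shown earlier.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}

# Reusing the made-up distribution from the top_p example:
probs = {"return": 0.55, "result": 0.30, "yield": 0.10, "banana": 0.05}
print(top_k_filter(probs, top_k=2))   # only the two strongest candidates survive
```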

3.4 Other knobs (optional)

You can also adjust:

  • Repetition penalty (repeat_penalty or frequency_penalty):
    • Default: 1.0 (no penalty).
    • If the model loops or repeats long blocks, try 1.05–1.1.
  • Max tokens:
    • For short inline completions: 64–128.
    • For full functions / tests / docs: 256–1024, depending on context limit.
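To wire these defaults into a client, here is a sketch against LM Studio’s OpenAI‑compatible local server (by default on port 1234). The model identifier and prompt are placeholders, and passing top_k via extra_body is an assumption about the local server; if it is ignored, configure top_k in the LM Studio preset instead.

```python
# Requires: pip install openai  (the LM Studio local server speaks the OpenAI API).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="codegemma-1.1-7b-it-q5_k_m",   # placeholder; use the name LM Studio shows
    messages=[
        {"role": "user", "content": "Write a Python function that parses ISO-8601 dates."},
    ],
    temperature=0.25,        # conservative default for coding
    top_p=0.9,
    max_tokens=512,
    frequency_penalty=0.0,   # raise slightly only if the model starts looping
    # top_k is not part of the standard OpenAI schema; whether the server honors
    # it here is an assumption - otherwise set it in the preset.
    extra_body={"top_k": 40},
)
print(response.choices[0].message.content)
```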

4. Example LM Studio preset

In config/lmstudio_preset_codegemma_7b_it_q5.json you’ll find an example preset you can adapt.

Key points:

  • Model: CodeGemma 1.1 – 7B Instruct – Q5_K_M
  • Temperature: 0.25
  • Top‑p: 0.9
  • Top‑k: 40
  • Max tokens: 512 (good default for most coding replies)
  • Context length: use the model’s max (e.g. 4096 or 8192), adjusted to your VRAM.

Each developer can import this preset and adjust only what they need (e.g. max tokens or temperature) while keeping a consistent baseline across the team.
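If you want to generate or sanity‑check a preset programmatically, a simplified sketch of the same values looks like this; the field names are illustrative rather than LM Studio’s exact preset schema, so compare against config/lmstudio_preset_codegemma_7b_it_q5.json before relying on them:

```python
import json

# Simplified, illustrative view of the preset values in this section.
preset = {
    "model": "CodeGemma 1.1 7B Instruct Q5_K_M",
    "temperature": 0.25,
    "top_p": 0.9,
    "top_k": 40,
    "max_tokens": 512,
    "context_length": 8192,   # lower this if you run out of VRAM
}

print(json.dumps(preset, indent=2))
```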


5. Fleet‑wide guidance for a CTO

You can use this repository as a playbook:

  1. Inventory developer hardware
    Collect GPU model and VRAM for each laptop.

  2. Assign a tier per machine
    Map each machine to:

    • tier-small (no GPU / ≤ 4 GiB),
    • tier-mid (6–8 GiB),
    • tier-strong (12–16 GiB),
    • tier-ultra (32–64 GiB).
  3. Pick model + quant per tier
    Keep a simple table in your internal docs (see docs/model_selection_for_laptops.md) so devs know exactly what to download.

  4. Standardise sampling defaults
    Reuse the same preset (temperature, top‑p, top‑k, penalties) so that:

    • prompts are more reproducible,
    • people can share “prompt recipes” reliably.
  5. Iterate based on feedback
    If some teams need more creativity (e.g. product copy, docs), create a second preset with:

    • higher temperature,
    • slightly larger top‑p.
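One way to keep steps 2–4 consistent is a single source of truth in your internal tooling. A minimal sketch using the tiers above (names and thresholds are this playbook’s; the preset path is the one shipped in this repository):

```python
# Tier definitions matching this playbook.
FLEET_TIERS = {
    "tier-small":  {"max_vram_gib": 4,  "model": "CodeGemma 2B (Q4/Q5), completion-only"},
    "tier-mid":    {"max_vram_gib": 8,  "model": "CodeGemma 7B Instruct Q4_K_M"},
    "tier-strong": {"max_vram_gib": 16, "model": "CodeGemma 7B Instruct Q5_K_M",
                    "preset": "config/lmstudio_preset_codegemma_7b_it_q5.json"},
    "tier-ultra":  {"max_vram_gib": 64, "model": "CodeGemma 7B Instruct Q6_K or Q8_0"},
}

def tier_for(vram_gib: float) -> str:
    # Assign the first tier whose VRAM ceiling covers the machine.
    for name, spec in FLEET_TIERS.items():
        if vram_gib <= spec["max_vram_gib"]:
            return name
    return "tier-ultra"

print(tier_for(15.3))  # -> tier-strong
```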

6. Files in this repository

  • README.md → this file, high‑level overview and reasoning.
  • docs/model_selection_for_laptops.md → more detailed guidance and tables.
  • config/lmstudio_preset_codegemma_7b_it_q5.json → example LM Studio preset for CodeGemma 7B Q5.

Feel free to fork / rename to match your company’s internal standards.
