Local Coding Assistant – CodeGemma on Laptops (v2)

This repository documents how to run CodeGemma 1.1 7B Instruct efficiently on developer laptops using LM Studio, with a focus on the Q5_K_M quantization and sane default sampling parameters.

The goals are:

  • Give developers a repeatable setup for a local coding assistant.
  • Explain why we choose specific model variants and parameters.
  • Help a CTO or tech lead guide the team depending on each laptop’s hardware.

Tested on a laptop-class setup:

  • Lenovo ThinkBook 16 G7 IML
  • Intel® Core™ Ultra 7 155H
  • Intel Arc integrated GPU (Meteor Lake-P)
  • Intel Meteor Lake NPU (Intel AI Boost)
  • Ubuntu Linux

1. Recommended model for coding

For day‑to‑day coding work we recommend:

CodeGemma 1.1 – 7B Instruct – Q5_K_M

This variant offers:

  • High quality: very close to full‑precision (FP16/BF16) performance.
  • Good speed on modern laptop GPUs with ~12–16 GiB VRAM.
  • Comfortable context length (4k–8k tokens depending on runner) without constantly hitting VRAM limits.

On a laptop with ≈15 GiB of VRAM, this is an ideal sweet spot: the model fits comfortably, you keep a long context window, and you still have room for larger batch sizes and high‑throughput decoding.


2. Why Q5_K_M for this laptop?

Quantization is a trade‑off between:

  • Quality (how close we are to the original weights).
  • Memory usage (VRAM / RAM).
  • Speed (smaller quantized weights move through the GPU faster).

2.1 Quick mental model

From a quality perspective, for 7B models:

  • Q3 → light and fast, but noticeable quality loss.
  • Q4 → good compromise; you can feel some degradation vs FP16.
  • Q5 → “near‑native” quality, usually indistinguishable for coding tasks.
  • Q6/Q8 → almost pure FP16 quality but much more VRAM.

On a machine with 15.3 GiB VRAM:

  • Q5_K_M uses ~6 GiB for the model weights.
  • You still have plenty of headroom for:
    • key/value cache (attention),
    • a long context window,
    • the OS and other GPU workloads.

With Q6 or Q8_0, VRAM usage climbs significantly, leaving less space for context and cache. For coding, context length matters more than the last 1–2% of quality, so Q5 is the better trade‑off.
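To make the headroom argument concrete, here is a rough back‑of‑envelope sketch (an estimate, not a measurement). The layer count, KV‑head count and head dimension are assumed Gemma‑7B‑like values; actual usage also depends on the runner, batch size, and whether the KV cache itself is quantized.

```python
# Rough VRAM budget: quantized weights + fp16 KV cache.
# Architecture constants are assumptions (Gemma-7B-like); check the model
# card / GGUF metadata for the exact values of your build.
GIB = 1024 ** 3

def kv_cache_bytes(context_tokens: int,
                   n_layers: int = 28,       # assumed layer count
                   n_kv_heads: int = 16,     # assumed number of KV heads
                   head_dim: int = 256,      # assumed head dimension
                   bytes_per_value: int = 2  # fp16 K/V entries
                   ) -> int:
    # One key and one value vector per layer, per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens

def print_budget(weights_gib: float, context_tokens: int, total_vram_gib: float) -> None:
    kv_gib = kv_cache_bytes(context_tokens) / GIB
    print(f"weights ~{weights_gib:.1f} GiB + KV cache ~{kv_gib:.1f} GiB "
          f"= ~{weights_gib + kv_gib:.1f} GiB of {total_vram_gib:.1f} GiB")

# Q5_K_M weights ~6 GiB, 8k context, ~15.3 GiB of (shared) VRAM:
print_budget(weights_gib=6.0, context_tokens=8192, total_vram_gib=15.3)
```

With these assumptions an 8k‑token context adds roughly 3.5 GiB of KV cache on top of the ~6 GiB of weights, which is why Q5_K_M still leaves comfortable headroom on this machine.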

2.2 When to use other quants

You can use this decision table as a guideline:

| Laptop GPU / VRAM | Recommended quant for 7B | Notes |
| --- | --- | --- |
| ≤ 4 GiB (or iGPU only) | 2B model (Q4/Q5) | 7B is usually too heavy here; use a smaller model. |
| 6–8 GiB | 7B Q4_K_M | Fits, but keep context smaller (2k–4k tokens). |
| 10–12 GiB | 7B Q4_K_M or Q5_K_M | Q5 if you prefer quality; Q4 if you want more speed / parallel loads. |
| 14–16 GiB | 7B Q5_K_M | Recommended default (this repository’s scenario). |
| ≥ 20 GiB | 7B Q6_K or Q8_0 | Only if you really want the last bit of quality and can afford the VRAM. |

For mixed fleets, a CTO can define a baseline (7B Q5) and fallbacks (sketched in code after this list):

  • If dev has weak GPU → 2B completion‑only model.
  • If dev has mid GPU (8 GiB) → 7B Q4.
  • If dev has strong GPU (16–24 GiB) → 7B Q5 or Q6.
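As a rough translation of the decision table and fallbacks above into code, something like this (thresholds mirror this README’s guidance, not a hard rule):

```python
def recommend_quant(vram_gib: float) -> str:
    """Suggest a CodeGemma variant for a given amount of laptop VRAM.

    Thresholds mirror the decision table in this README; tune them for your fleet.
    """
    if vram_gib <= 4:
        return "2B model (Q4/Q5), completion-only"
    if vram_gib <= 8:
        return "7B Q4_K_M (keep context at 2k-4k tokens)"
    if vram_gib <= 12:
        return "7B Q4_K_M or Q5_K_M"
    if vram_gib <= 16:
        return "7B Q5_K_M (recommended default)"
    return "7B Q6_K or Q8_0 (only if you can afford the VRAM)"

print(recommend_quant(15.3))  # the laptop used in this repository -> Q5_K_M
```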

3. Sampling parameters and why we use them

When using the model as a coding assistant, we care less about raw creativity and more about:

  • determinism and correctness,
  • low hallucination rates,
  • a stable style across sessions.

That leads to conservative sampling defaults.

3.1 Temperature

Recommended: temperature = 0.2–0.3

  • Lower temperature makes the model more deterministic and less “chatty”.
  • For code, high temperature (e.g. 0.8–1.0) increases:
    • hallucinations,
    • random API calls,
    • inconsistent formatting.

Lower temperature carries no performance penalty; it is purely a quality / style choice (a minimal sketch of temperature scaling follows this list). As a rule of thumb:

  • Pure coding / refactors / tests → use 0.2.
  • Architecture discussion / doc writing → you can raise to 0.4–0.5 if you want more varied wording.
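Under the hood, temperature simply rescales the logits before the softmax. A minimal sketch with made‑up logit values:

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    # Divide logits by T before softmax: T < 1 sharpens the distribution
    # (more deterministic), T > 1 flattens it (more random).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 3.0, 1.0]                        # made-up scores for three tokens
print(softmax_with_temperature(logits, 0.2))    # top token dominates (~99%)
print(softmax_with_temperature(logits, 1.0))    # noticeably broader spread
```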

3.2 Top‑p (nucleus sampling)

Recommended: top_p = 0.9

top_p keeps only the smallest set of tokens whose cumulative probability reaches p, and samples from that reduced set.

  • top_p = 1.0 → use the full distribution; more randomness.
  • top_p = 0.9 → cut off low‑probability tokens that often cause:
    • weird syntax,
    • off‑topic content,
    • very “creative” variable names.

For code, this is a good balance: the model can still choose among several good options, but avoids unlikely choices that usually break compilation.
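A minimal sketch of that cutoff, using a made‑up next‑token distribution:

```python
def nucleus_filter(probs: dict[str, float], top_p: float = 0.9) -> dict[str, float]:
    # Keep the smallest set of tokens whose cumulative probability reaches top_p,
    # then renormalize so the kept probabilities sum to 1.
    kept, cumulative = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

# Made-up distribution for illustration only:
probs = {"return": 0.55, "result": 0.30, "yield": 0.10, "banana": 0.05}
print(nucleus_filter(probs, top_p=0.9))   # the low-probability outlier is dropped
```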

3.3 Top‑k

Recommended: top_k = 40

top_k keeps only the top‑K tokens by probability.

  • Very high top_k (e.g. 200–1000) + high top_p → many candidate tokens, more creativity, more risk.
  • Very low top_k (e.g. 5–10) → text may become repetitive or stuck.

For coding on a 7B model:

  • top_k = 40 gives enough diversity to restructure code and generate alternatives,
    while still strongly preferring the top tokens (which usually correspond to syntactically correct, relevant completions).
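The same idea for top_k, as a companion to the top_p sketch above (real runners combine the two filters, and the exact order can vary by implementation):

```python
def top_k_filter(probs: dict[str, float], top_k: int = 40) -> dict[str, float]:
    # Keep only the K most probable tokens and renormalize; typically applied
    # together with the top_p cutoff shown earlier.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}

# Reusing the made-up distribution from the top_p example:
probs = {"return": 0.55, "result": 0.30, "yield": 0.10, "banana": 0.05}
print(top_k_filter(probs, top_k=2))   # only the two strongest candidates survive
```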

3.4 Other knobs (optional)

You can also adjust:

  • Repetition penalty (repeat_penalty or frequency_penalty):
    • Default: 1.0 (no penalty).
    • If the model loops or repeats long blocks, try 1.05–1.1.
  • Max tokens:
    • For short inline completions: 64–128.
    • For full functions / tests / docs: 256–1024, depending on context limit.
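To wire these defaults into a client, here is a sketch against LM Studio’s OpenAI‑compatible local server (by default on port 1234). The model identifier and prompt are placeholders, and passing top_k via extra_body is an assumption about the local server; if it is ignored, configure top_k in the LM Studio preset instead.

```python
# Requires: pip install openai  (the LM Studio local server speaks the OpenAI API).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="codegemma-1.1-7b-it-q5_k_m",   # placeholder; use the name LM Studio shows
    messages=[
        {"role": "user", "content": "Write a Python function that parses ISO-8601 dates."},
    ],
    temperature=0.25,        # conservative default for coding
    top_p=0.9,
    max_tokens=512,
    frequency_penalty=0.0,   # raise slightly only if the model starts looping
    # top_k is not part of the standard OpenAI schema; whether the server honors
    # it here is an assumption - otherwise set it in the preset.
    extra_body={"top_k": 40},
)
print(response.choices[0].message.content)
```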

4. Example LM Studio preset

In config/lmstudio_preset_codegemma_7b_it_q5.json you’ll find an example preset you can adapt.

Key points:

  • Model: CodeGemma 1.1 – 7B Instruct – Q5_K_M
  • Temperature: 0.25
  • Top‑p: 0.9
  • Top‑k: 40
  • Max tokens: 512 (good default for most coding replies)
  • Context length: use the model’s max (e.g. 4096 or 8192), adjusted to your VRAM.

Each developer can import this preset and adjust only what they need (e.g. max tokens or temperature) while keeping a consistent baseline across the team.
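If you want to generate or sanity‑check a preset programmatically, a simplified sketch of the same values looks like this; the field names are illustrative rather than LM Studio’s exact preset schema, so compare against config/lmstudio_preset_codegemma_7b_it_q5.json before relying on them:

```python
import json

# Simplified, illustrative view of the preset values in this section.
preset = {
    "model": "CodeGemma 1.1 7B Instruct Q5_K_M",
    "temperature": 0.25,
    "top_p": 0.9,
    "top_k": 40,
    "max_tokens": 512,
    "context_length": 8192,   # lower this if you run out of VRAM
}

print(json.dumps(preset, indent=2))
```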


5. Fleet‑wide guidance for a CTO

You can use this repository as a playbook:

  1. Inventory developer hardware
    Collect GPU model and VRAM for each laptop.

  2. Assign a tier per machine
    Map each machine to:

    • tier-small (no GPU / ≤ 4 GiB),
    • tier-mid (6–8 GiB),
    • tier-strong (12–16 GiB),
    • tier-ultra (32–64 GiB).
  3. Pick model + quant per tier
    Keep a simple table in your internal docs (see docs/model_selection_for_laptops.md) so devs know exactly what to download.

  4. Standardise sampling defaults
    Reuse the same preset (temperature, top‑p, top‑k, penalties) so that:

    • prompts are more reproducible,
    • people can share “prompt recipes” reliably.
  5. Iterate based on feedback
    If some teams need more creativity (e.g. product copy, docs), create a second preset with:

    • higher temperature,
    • slightly larger top‑p.
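One way to keep steps 2–4 consistent is a single source of truth in your internal tooling. A minimal sketch using the tiers above (names and thresholds are this playbook’s; the preset path is the one shipped in this repository):

```python
# Tier definitions matching this playbook.
FLEET_TIERS = {
    "tier-small":  {"max_vram_gib": 4,  "model": "CodeGemma 2B (Q4/Q5), completion-only"},
    "tier-mid":    {"max_vram_gib": 8,  "model": "CodeGemma 7B Instruct Q4_K_M"},
    "tier-strong": {"max_vram_gib": 16, "model": "CodeGemma 7B Instruct Q5_K_M",
                    "preset": "config/lmstudio_preset_codegemma_7b_it_q5.json"},
    "tier-ultra":  {"max_vram_gib": 64, "model": "CodeGemma 7B Instruct Q6_K or Q8_0"},
}

def tier_for(vram_gib: float) -> str:
    # Assign the first tier whose VRAM ceiling covers the machine.
    for name, spec in FLEET_TIERS.items():
        if vram_gib <= spec["max_vram_gib"]:
            return name
    return "tier-ultra"

print(tier_for(15.3))  # -> tier-strong
```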

6. Files in this repository

  • README.md → this file, high‑level overview and reasoning.
  • docs/model_selection_for_laptops.md → more detailed guidance and tables.
  • config/lmstudio_preset_codegemma_7b_it_q5.json → example LM Studio preset for CodeGemma 7B Q5.

Feel free to fork / rename to match your company’s internal standards.
