@ServeurpersoCom ServeurpersoCom commented Dec 8, 2025

Router per-model config

This PR implements INI-based per-model configuration for llama-server router mode, as discussed in #17850.

Summary

Advanced users can define custom configurations using an .ini file. Each model can have its own preset with custom parameters while inheriting router defaults for unspecified options.

Motivation

Multi-model inference servers for small/medium teams need declarative configuration with zero operational friction. Operators should be able to set global defaults via router CLI and override specific parameters per model in a simple text file.

Key Features

  1. INI-based presets - Define model-specific configurations in a simple .ini file
  2. Three independent model sources - Cached models (LLAMA_CACHE), local GGUF files (--models-dir), and custom-path models (--models-preset, with paths defined in the INI file)
  3. Flexible argument formats - Accepts long args (ctx-size), short args (c), and env vars (LLAMA_ARG_CTX_SIZE) as INI keys
  4. Inheritance system - Preset args are merged with router base args before spawning child processes
  5. Custom model paths - Define models with absolute paths directly in presets without filesystem scanning

Usage

llama-server --models-preset ./presets.ini

The router can combine multiple sources:

llama-server --models-dir ./local-models --models-preset ./custom-configs.ini -ngl 999 -fa

INI Format

Section names define model identifiers. Keys correspond to CLI arguments without leading dashes.

Supported key formats:

  • Long form: n-gpu-layers = 123
  • Short form: c = 4096
  • Env var: LLAMA_ARG_CACHE_RAM = 0

All three formats are equivalent and can be mixed in the same file.

Example presets.ini:

version = 1

; Preset for cached HuggingFace model TODO testing
[ggml-org/gemma-3-27b-it-GGUF:Q6_K]
chat-template = chatml
ngl = 123
jinja = on
ctx-size = 131072

; Custom local model with absolute path
[my-custom-model]
m = /absolute/path/to/model.gguf
mmproj = /absolute/path/to/mmproj.gguf
ctx-size = 65536
temp = 0.7
top-p = 0.8

; MoE model with specific settings
[MoE-Qwen3-30B-A3B-Thinking]
m = /models/Qwen3-30B-A3B-Thinking-Q6_K.gguf
n-cpu-moe = 30
temp = 0.6
top-p = 0.95
ctx-size = 32768

How Models Are Loaded

The router discovers models from three sources:

  1. Cached models - Scanned from LLAMA_CACHE (typically ~/.cache/llama.cpp) <- TODO testing
  2. Local directory - Scanned from --models-dir (non-recursive, direct children only)
  3. Preset definitions - Custom models defined in --models-preset with explicit paths

Model names from presets can match cached or local models to apply custom configurations, or define entirely new models with custom paths.

Argument Inheritance

When spawning a child process for a model, arguments are merged in this order:

  1. Start with preset args from INI (model-specific settings)
  2. Add router base args for any missing keys (global defaults from router CLI)
  3. Force control args (port, host, alias - always overridden by router)

Priority order (highest to lowest):

  • Control args (port, host, alias, model path) - managed by router, cannot be overridden
  • Preset args (from INI) - model-specific overrides
  • Router base args (inherited from router CLI) - fill in missing preset keys

Control args automatically managed by router:

  • --port, --host, --alias
  • --api-key
  • --model, --mmproj, --hf-repo
  • --models-dir, --models-max, --models-preset

If a preset contains control args, they are removed with a warning.
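
For illustration, here is a minimal sketch of this merge using plain string maps (the real implementation works on common_arg objects in common/arg.cpp; the names below are hypothetical):

#include <map>
#include <string>

using arg_map = std::map<std::string, std::string>;

// Hypothetical merge mirroring the priority described above:
// control args > preset args > router base args.
static arg_map merge_child_args(const arg_map & preset_args,
                                const arg_map & router_base_args,
                                const arg_map & control_args) {
    arg_map merged = preset_args;                 // 1. start from the preset
    for (const auto & [key, value] : router_base_args) {
        merged.emplace(key, value);               // 2. fill in only the missing keys
    }
    for (const auto & [key, value] : control_args) {
        merged[key] = value;                      // 3. router-managed args always win
    }
    return merged;
}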

Changes

New files:

  • common/preset.cpp - INI parser using PEG grammar and preset management
  • common/preset.h - Preset structures and API

Modified files:

  • common/arg.cpp/h - Added common_params_parse() for map output, is_truthy/is_falsey/is_autoy utilities
  • common/common.h - Added models_preset parameter
  • common/CMakeLists.txt - Added preset.cpp/h to build
  • tools/server/server-models.cpp/h - Integrated preset system with model loading and spawning
  • tools/server/README.md - Added preset documentation and examples

Technical Details

INI parsing:

  • Uses existing PEG parser from common/peg-parser.h (grammar by @aldehir)
  • Line-oriented parsing handles comments, blank lines, and standard INI sections
  • Whitespace and inline comments properly handled

Argument mapping:

  • Three key formats (long, short, env) all map to same common_arg via lookup table
  • Deduplication handled automatically (short -c and long --ctx-size are the same arg); see the sketch below
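
A rough sketch of such a lookup table, assuming a simplified common_arg-like record (the real structure in common/arg.cpp differs):

#include <map>
#include <string>
#include <vector>

// Simplified stand-in for common_arg: one canonical option and all of its spellings.
struct arg_def {
    std::string long_name;   // e.g. "ctx-size"
    std::string short_name;  // e.g. "c"
    std::string env_name;    // e.g. "LLAMA_ARG_CTX_SIZE"
};

// Map every accepted INI key spelling to the same definition, so "c", "ctx-size"
// and "LLAMA_ARG_CTX_SIZE" all resolve (and deduplicate) to one argument.
static std::map<std::string, const arg_def *> build_key_lookup(const std::vector<arg_def> & defs) {
    std::map<std::string, const arg_def *> lookup;
    for (const auto & def : defs) {
        if (!def.long_name.empty())  lookup[def.long_name]  = &def;
        if (!def.short_name.empty()) lookup[def.short_name] = &def;
        if (!def.env_name.empty())   lookup[def.env_name]   = &def;
    }
    return lookup;
}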

Child process spawning:

  • Child servers listen on 127.0.0.1 (not inherited hostname) to avoid conflicts when router runs on 0.0.0.0
  • Arguments passed as CLI args (not environment variables)
  • Router port exported as LLAMA_SERVER_ROUTER_PORT env var for child processes (see the sketch below)
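
As a small illustration of the last point (a hypothetical helper; the actual spawning code lives in tools/server/server-models.cpp):

#include <string>
#include <vector>

// Append the router's own port to the child's environment block so spawned
// instances know which router launched them (variable name from this PR).
static void add_router_port_env(std::vector<std::string> & child_env, int router_port) {
    child_env.push_back("LLAMA_SERVER_ROUTER_PORT=" + std::to_string(router_port));
}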

Use Case Example

Development team runs inference server with multiple models:

llama-server --port 8082 -ngl 999 -ctk q8_0 -ctv q8_0 -fa on --mlock -np 4 -kvu --jinja --models-preset config.ini

# You can combine --models-preset with:
# --models-max N         for router server, maximum number of models to load simultaneously
# --models-dir PATH      directory containing models for the router server (default: disabled)

The preset file (config.ini here) defines per-model overrides:

; Minimal setup
[MyModel]
m = /path/to/model.gguf ; or relative path from current working directory

; For this model we want precise sampling parameters and more context
[MoE-Qwen3-Coder-30B-A3B-Instruct]
m = /path/to/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-Q6_K.gguf
temp = 0.7
top-p = 0.8
top-k = 20
ctx-size = 131072

; Disable flash attention for this model
[Problematic-model]
m = /path/to/problematic-Q8_0.gguf
fa = off

; Large MoE models don't fit in VRAM, so we use n-cpu-moe = 18
[MoE-Qwen3-Next-80B-A3B-Instruct]
m = /path/to/unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF/Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf
n-cpu-moe = 18
ctx-size = 32768

Global defaults (-ngl 999 -fa -ctk q8_0 etc.) apply to all models, but each preset can override specific parameters. The router automatically manages ports, aliases, and model paths.

Testing Status

Tested configurations:

  • TODO Multiple cached HuggingFace models with various quantizations
  • Local GGUF files with mmproj auto-detection
  • Custom path models defined in presets
  • Mixed sources (local + preset) in single router instance
  • Argument inheritance and override behavior

Notes

  • File paths in INI can be absolute or relative to server working directory
  • --models-dir and --models-preset are independent and can be used together
  • Presets are logged at startup with * prefix indicating custom configuration
  • /v1/models endpoint includes preset configuration in response for debugging
  • Boolean flags can use on/off, enabled/disabled, true/false, or 1/0 as values (see the sketch below)
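
A plausible shape for the is_truthy / is_falsey helpers listed under Changes (the actual implementation in common/arg.cpp may normalize input differently):

#include <string>

// Accepted boolean spellings for INI values, per the note above; assumption:
// exact lowercase match, the real helpers may also handle other casings.
static bool is_truthy(const std::string & v) {
    return v == "on" || v == "enabled" || v == "true" || v == "1";
}

static bool is_falsey(const std::string & v) {
    return v == "off" || v == "disabled" || v == "false" || v == "0";
}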

Related Issues

Closes #17850
Related to #17470, #10932

Thanks to

Co-authored-by: aldehir (INI parser PEG grammar)
Co-authored-by: ngxson (llama-server router/child-model integration, preset refactoring, API design, argument system integration, ...)

@ngxson ngxson left a comment

this looks interesting, I will clean this up a bit and push a commit

Collaborator

It would be nice if we could move part of this file into common/preset.cpp|h, so it can be reused by other tools

Comment on lines 502 to 507
if (value == "false") {
continue;
}

if (value == "true" || value.empty()) {
child_env.push_back(key + "=");

Collaborator

I think leaving the original value for bool should be good? We can already handle these values using is_falsey / is_truthy in arg.cpp

Collaborator Author

Good point! I'll simplify the bool handling to pass through the original values (=true/=false) and let is_truthy/is_falsey handle the conversion
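
In other words, a sketch of the simplified handling (variable names follow the quoted diff above; illustrative, not the final code):

#include <string>
#include <vector>

// Pass every key/value through unchanged; booleans keep their original
// "true"/"false" text and are interpreted later by is_truthy()/is_falsey() in arg.cpp.
static void append_env(std::vector<std::string> & child_env,
                       const std::string & key, const std::string & value) {
    child_env.push_back(key + "=" + value);
}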

aldehir commented Dec 8, 2025

@ServeurpersoCom

Here is a line-oriented approach for the parser:

static const auto ini_parser = build_peg_parser([](auto & p) {
    // newline ::= "\r\n" / "\n" / "\r"
    auto newline = p.rule("newline", p.literal("\r\n") | p.literal("\n") | p.literal("\r"));

    // ws ::= [ \t]*
    auto ws = p.rule("ws", p.chars("[ \t]", 0, -1));

    // comment ::= [;#] (!newline .)*
    auto comment = p.rule("comment", p.chars("[;#]", 1, 1) + p.zero_or_more(p.negate(newline) + p.any()));

    // eol ::= ws comment? (newline / EOF)
    auto eol = p.rule("eol", ws + p.optional(comment) + (newline | p.end()));

    // ident ::= [a-zA-Z_] [a-zA-Z0-9_.-]*
    auto ident = p.rule("ident", p.chars("[a-zA-Z_]", 1, 1) + p.chars("[a-zA-Z0-9_.-]", 0, -1));

    // value ::= (!eol-start .)*
    auto eol_start = p.rule("eol-start", ws + (p.chars("[;#]", 1, 1) | newline | p.end()));
    auto value = p.rule("value", p.zero_or_more(p.negate(eol_start) + p.any()));

    // header-line ::= "[" ws ident ws "]" eol
    auto header_line = p.rule("header-line", "[" + ws + p.tag("section-name", p.chars("[^]]")) + ws + "]" + eol);

    // kv-line ::= ident ws "=" ws value eol
    auto kv_line = p.rule("kv-line", p.tag("key", ident) + ws + "=" + ws + p.tag("value", value) + eol);

    // comment-line ::= ws comment (newline / EOF)
    auto comment_line = p.rule("comment-line", ws + comment + (newline | p.end()));

    // blank-line ::= ws (newline / EOF)
    auto blank_line = p.rule("blank-line", ws + (newline | p.end()));

    // line ::= header-line / kv-line / comment-line / blank-line
    auto line = p.rule("line", header_line | kv_line | comment_line | blank_line);

    // ini ::= line* EOF
    auto ini = p.rule("ini", p.zero_or_more(line) + p.end());

    return ini;
});

I assume the changes were because of the weirdness in consuming spaces/comments. This should alleviate those concerns.

And the visitor can really be something as simple as this:

std::map<std::string, std::map<std::string, std::string>> cfg;

std::string current_section = "default";
std::string current_key;

ctx.ast.visit(result, [&](const auto & node) {
    if (node.tag == "section-name") {
        current_section = std::string(node.text);
        cfg[current_section] = {};
    } else if (node.tag == "key") {
        current_key = std::string(node.text);
    } else if (node.tag == "value" && !current_key.empty()) {
        cfg[current_section][current_key] = std::string(node.text);
        current_key.clear();
    }
});

@aldehir aldehir left a comment

Looks good as far as parsing is concerned!

I will need to add an expect() helper to provide helpful error messages to users when they make a mistake. I can do that separately in another PR.

@ServeurpersoCom ServeurpersoCom marked this pull request as draft December 8, 2025 11:57
@ServeurpersoCom

It's now in a basic working state with the new line-based PEG parser. I'm testing it with my entire per-model configuration on the server to cover some edge cases, and then there is the @ngxson refactoring to do.

@ServeurpersoCom ServeurpersoCom marked this pull request as ready for review December 8, 2025 12:34

ServeurpersoCom commented Dec 8, 2025

Missing sampling parameters need .set_env() in common/arg.cpp (--temp, --top-p, --top-k, --min-p have no LLAMA_ARG_ env vars yet). Successfully migrated a llama-swap config (YAML) to config.ini via LLM: llama-server preserved all custom parameters (ctx-size, n-cpu-moe, mmproj, -m ....Q6_K), applied the global CLI defaults (-ngl 999, -fa, --mlock, -ctk/-ctv, etc.) to all models, and automatically reorganized sections/keys alphabetically to maintain a normalized format.


ngxson commented Dec 8, 2025

> Missing sampling parameters need .set_env() in common/arg.cpp (--temp, --top-p, --top-k, --min-p have no LLAMA_ARG_ env vars yet).

Hmm yeah I didn't notice that some env vars are missing. I think it will be cleaner if we default to using the longest arg (for example, --ctx-size instead of -c)

Internally, the parser can accept all 3 forms: env, short arg and long arg; there is no chance that they will collide anyway. I'll push the change for this.

@ServeurpersoCom

Yes, look, it just needs the missing .set_env("LLAMA_ARG_TEMP") calls etc... I'll wait for your change while I run some tests.

llama-server --models-dir ./models_directory

The directory is scanned recursively, so nested vendor/model layouts such as `vendor_name/model_name/*.gguf` are supported. The model name in the router UI matches the relative path inside `--models-dir` (for example, `vendor_name/model_name`).

Collaborator

For visibility, I will remove recursive support from this PR because it's not related to config support - it should be added later via a dedicated PR

@ServeurpersoCom ServeurpersoCom Dec 8, 2025

Yes, no worries (I have to keep it on my side, otherwise it breaks my integration server) -> I'll adapt the configuration on my side to test this feature-atomic PR if necessary


ngxson commented Dec 8, 2025

I moved most of the code inside server-config.cpp to common/preset.cpp

We're now using the term "preset", so I think it's easier to name the file presets.ini now (it can be extended for use outside of the server)

Since I'm now using the same common_arg to handle everything, including parsing and merging args, edge cases like deduplication of the short form -a and the long form --abc are also handled

We don't yet support repeated args or args with 2 values (like --lora-scaled) but it can be added in the future

The /v1/models API endpoint is also extended to include the args and the INI preset, which will be quite useful for debugging

Things that still need to improve:

  • add falsey and truthy check for input from ini
  • add documentation and example


Alternatively, you can also add GGUF based preset (see next section)

### Model presets

Collaborator

@ServeurpersoCom I updated the docs with an example - lmk if this works in your case

first_shard_file = file;
} else {
model_file = file;
std::function<void(const std::string &, const std::string &)> scan_subdir =

@ngxson ngxson Dec 8, 2025

Please remove the recursive implementation - it's unrelated to the current PR, and it's also unsafe as it doesn't handle the case where there's a circular symlink or circular mount points

@ServeurpersoCom ServeurpersoCom Dec 8, 2025

You can retrieve the rest (except the recursion) and push --force; I won't touch the branch before tomorrow/rebase/test.
A two-level browsing system will be perfect for all cases (separate PR)


emjomi commented Dec 8, 2025

Hey guys! Sorry to interrupt, but are the LLAMA_ARG_ prefixes required? I think they make the config a bit noisy.

One more thing: maybe it's better to put the config in ~/.config/llama.cpp/ on Linux, as specified in https://specifications.freedesktop.org/basedir/latest/?

Thank you so much for what you're doing!

@ServeurpersoCom

> Hey guys! Sorry to interrupt, but are the LLAMA_ARG_ prefixes required? I think they make the config a bit noisy.
>
> One more thing: maybe it's better to put the config in ~/.config/llama.cpp/ on Linux, as specified in https://specifications.freedesktop.org/basedir/latest/?
>
> Thank you so much for what you're doing!

No worries! With the last refactor the LLAMA_ARG_ prefixes are optional: you can use the short argument forms (e.g., ngl, c) or the long forms with dashes (e.g., n-gpu-layers, ctx-size) instead. All three formats are supported.

Regarding config location: the preset file path is fully customizable via --models-preset, so you can place it wherever you prefer, including ~/.config/llama.cpp/presets.ini if that fits your workflow better.

This is a WIP; I'll update the first message soon.

@ServeurpersoCom

I'll update the PR documentation with the new implementation today: no more INI auto-generation, deep GGUF tree support without scanning, all 3 variable formats supported, standard Linux binary/INI relative paths, and --models-dir and --models-preset are now independent


ServeurpersoCom commented Dec 10, 2025

Let's get to the practical tests!

CLI

# Model collection is current directory
cd /var/www/ia/models

# System path or absolute path to launch llama-server
llama-server --port 8082 -ngl 999 -ctk q8_0 -ctv q8_0 -fa on --mlock -np 4 -kvu --jinja --models-max 1 --models-dir mradermacher/testing --models-preset config.ini

config.ini

[Dense-OLMo-2-0325-32B-Instruct]
m = unsloth/OLMo-2-0325-32B-Instruct-GGUF/OLMo-2-0325-32B-Instruct-Q6_K.gguf
ctx-size = 4096

[Dense-Vision-Mistral-Small-3.2-24B-Instruct-2506]
m = unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF/Mistral-Small-3.2-24B-Instruct-2506-Q6_K.gguf
mmproj = unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF/mmproj-BF16.gguf
ctx-size = 65536

[Dense-Vision-Magistral-Small-2509]
m = unsloth/Magistral-Small-2509-GGUF/Magistral-Small-2509-Q6_K.gguf
mmproj = unsloth/Magistral-Small-2509-GGUF/mmproj-BF16.gguf
ctx-size = 65536

[Dense-Uncensored-Dolphin-Mistral-24B-Venice-Edition]
m = bartowski/cognitivecomputations_Dolphin-Mistral-24B-Venice-Edition-GGUF/cognitivecomputations_Dolphin-Mistral-24B-Venice-Edition-Q8_0.gguf
ctx-size = 65536

[Dense-Uncensored-BlackSheep-24B]
m = mradermacher/BlackSheep-24B-i1-GGUF/BlackSheep-24B.Q8_0.gguf
ctx-size = 65536

[Dense-Uncensored-XortronCriminalComputingConfig-24B]
m = mradermacher/XortronCriminalComputingConfig-i1-GGUF/XortronCriminalComputingConfig.Q8_0.gguf
ctx-size = 65536

[Dense-RP-Cydonia-24B-v4.1]
m = bartowski/TheDrummer_Cydonia-24B-v4.1-GGUF/TheDrummer_Cydonia-24B-v4.1-Q8_0.gguf
ctx-size = 65536

[Dense-Devstral-Small-24B-2507]
m = unsloth/Devstral-Small-2507-GGUF/Devstral-Small-2507-Q6_K.gguf
ctx-size = 131072

[Dense-Codestral-22B-v0.1]
m = mradermacher/Codestral-22B-v0.1-i1-GGUF/Codestral-22B-v0.1.Q8_0.gguf
ctx-size = 32768

[Dense-Vision-Gemma-3-27B-IT]
m = unsloth/gemma-3-27b-it-GGUF/gemma-3-27b-it-Q6_K.gguf
mmproj = unsloth/gemma-3-27b-it-GGUF/mmproj-BF16.gguf
ctx-size = 131072

[Dense-RP-Big-Tiger-Gemma-27B-v3]
m = bartowski/TheDrummer_Big-Tiger-Gemma-27B-v3-GGUF/TheDrummer_Big-Tiger-Gemma-27B-v3-Q6_K.gguf
ctx-size = 131072

[Dense-Seed-OSS-36B-Instruct]
m = unsloth/Seed-OSS-36B-Instruct-GGUF/Seed-OSS-36B-Instruct-Q5_K_M.gguf
ctx-size = 32768

[Dense-DeepSeek-Coder-33B-Instruct]
m = mradermacher/deepseek-coder-33b-instruct-i1-GGUF/deepseek-coder-33b-instruct.i1-Q6_K.gguf
ctx-size = 32768

[Dense-DeepSeek-R1-Distill-Qwen-32B]
m = unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q6_K.gguf
ctx-size = 32768

[Dense-Aya-Expanse-32B]
m = mradermacher/aya-expanse-32b-i1-GGUF/aya-expanse-32b.i1-Q6_K.gguf
ctx-size = 32768

[Dense-GLM-4-32B-0414]
m = unsloth/GLM-4-32B-0414-GGUF/GLM-4-32B-0414-Q6_K.gguf
ctx-size = 32768

[Dense-GLM-Z1-32B-0414]
m = unsloth/GLM-Z1-32B-0414-GGUF/GLM-Z1-32B-0414-Q6_K.gguf
ctx-size = 32768

[MoE-GLM-4.5-Air-106B]
m = unsloth/GLM-4.5-Air-GGUF/GLM-4.5-Air-Q4_K_M-00001-of-00002.gguf
n-cpu-moe = 30
ctx-size = 32768

[Dense-EXAONE-4.0.1-32B]
m = mradermacher/EXAONE-4.0.1-32B-i1-GGUF/EXAONE-4.0.1-32B.i1-Q6_K.gguf
ctx-size = 131072

[Dense-QwQ-32B]
m = unsloth/QwQ-32B-GGUF/QwQ-32B-Q6_K.gguf
ctx-size = 32768

[Dense-Qwen3-32B]
m = mradermacher/Qwen3-32B-i1-GGUF/Qwen3-32B.i1-Q6_K.gguf
ctx-size = 32768

[Dense-Vision-Qwen2.5-VL-32B-Instruct]
m = unsloth/Qwen2.5-VL-32B-Instruct-GGUF/Qwen2.5-VL-32B-Instruct-Q5_K_M.gguf
mmproj = unsloth/Qwen2.5-VL-32B-Instruct-GGUF/mmproj-BF16.gguf
ctx-size = 32768

[Dense-Vision-Qwen3-VL-32B-Instruct]
m = unsloth/Qwen3-VL-32B-Instruct-GGUF/Qwen3-VL-32B-Instruct-Q5_K_M.gguf
mmproj = unsloth/Qwen3-VL-32B-Instruct-GGUF/mmproj-BF16.gguf
ctx-size = 32768

[MoE-Qwen3-30B-A3B-Instruct-2507]
m = mradermacher/Qwen3-30B-A3B-Instruct-2507-i1-GGUF/Qwen3-30B-A3B-Instruct-2507.i1-Q6_K.gguf
temp = 0.7
top-p = 0.8
top-k = 20
min-p = 0
ctx-size = 32768

[MoE-Qwen3-Coder-30B-A3B-Instruct]
m = unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-Q6_K.gguf
temp = 0.7
top-p = 0.8
top-k = 20
min-p = 0
ctx-size = 131072

[MoE-Qwen3-30B-A3B-Thinking-2507]
m = mradermacher/Qwen3-30B-A3B-Thinking-2507-i1-GGUF/Qwen3-30B-A3B-Thinking-2507.i1-Q6_K.gguf
temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0
ctx-size = 32768

[MoE-Aquif-3.5-Max-42B-A3B]
m = unsloth/aquif-3.5-Max-42B-A3B-GGUF/aquif-3.5-Max-42B-A3B-Q4_K_M.gguf
ctx-size = 65536
chat-template-file = unsloth/aquif-3.5-Max-42B-A3B-GGUF/aquif-3.5-Max-42B-A3B.jinja

[MoE-Aquif-3.5-Plus-30B-A3B]
m = mradermacher/aquif-3.5-Plus-30B-A3B-i1-GGUF/aquif-3.5-Plus-30B-A3B.i1-Q6_K.gguf
ctx-size = 131072

[MoE-MiniMax-M2-230B-A10B]
m = unsloth/MiniMax-M2-GGUF/MiniMax-M2-UD-Q2_K_XL-00001-of-00002.gguf
temp = 1.0
top-p = 0.95
top-k = 40
n-cpu-moe = 50
ctx-size = 65536

[MoE-GPT-OSS-20B]
m = lmstudio-community/gpt-oss-20b-GGUF/gpt-oss-20b-MXFP4.gguf
ctx-size = 65536

[MoE-GPT-OSS-120B]
m = lmstudio-community/gpt-oss-120b-GGUF/gpt-oss-120b-MXFP4-00001-of-00002.gguf
n-cpu-moe = 20
ctx-size = 65536

[Dense-Llama-3_3-Nemotron-Super-49B-v1_5]
m = unsloth/Llama-3_3-Nemotron-Super-49B-v1_5-GGUF/Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ctx-size = 32768

[Dense-RP-Valkyrie-49B-v2]
m = bartowski/TheDrummer_Valkyrie-49B-v2-GGUF/TheDrummer_Valkyrie-49B-v2-IQ4_NL.gguf
ctx-size = 32768

[Dense-K2-Think-32B]
m = mradermacher/K2-Think-i1-GGUF/K2-Think.i1-Q6_K.gguf
ctx-size = 32768

[MoE-Granite-4.0-h-small-32B]
m = unsloth/granite-4.0-h-small-GGUF/granite-4.0-h-small-UD-Q6_K_XL.gguf
ctx-size = 131072

LOG

init: using 31 threads for HTTP server
srv  server_prese: Loaded 35 presets from config.ini
srv   load_models: Available models (37) (*: custom preset)
srv   load_models:   * Dense-Aya-Expanse-32B
srv   load_models:   * Dense-Codestral-22B-v0.1
srv   load_models:   * Dense-DeepSeek-Coder-33B-Instruct
srv   load_models:   * Dense-DeepSeek-R1-Distill-Qwen-32B
srv   load_models:   * Dense-Devstral-Small-24B-2507
srv   load_models:   * Dense-EXAONE-4.0.1-32B
srv   load_models:   * Dense-GLM-4-32B-0414
srv   load_models:   * Dense-GLM-Z1-32B-0414
srv   load_models:   * Dense-K2-Think-32B
srv   load_models:   * Dense-Llama-3_3-Nemotron-Super-49B-v1_5
srv   load_models:   * Dense-OLMo-2-0325-32B-Instruct
srv   load_models:   * Dense-QwQ-32B
srv   load_models:   * Dense-Qwen3-32B
srv   load_models:   * Dense-RP-Big-Tiger-Gemma-27B-v3
srv   load_models:   * Dense-RP-Cydonia-24B-v4.1
srv   load_models:   * Dense-RP-Valkyrie-49B-v2
srv   load_models:   * Dense-Seed-OSS-36B-Instruct
srv   load_models:   * Dense-Uncensored-BlackSheep-24B
srv   load_models:   * Dense-Uncensored-Dolphin-Mistral-24B-Venice-Edition
srv   load_models:   * Dense-Uncensored-XortronCriminalComputingConfig-24B
srv   load_models:   * Dense-Vision-Gemma-3-27B-IT
srv   load_models:   * Dense-Vision-Magistral-Small-2509
srv   load_models:   * Dense-Vision-Mistral-Small-3.2-24B-Instruct-2506
srv   load_models:   * Dense-Vision-Qwen2.5-VL-32B-Instruct
srv   load_models:   * Dense-Vision-Qwen3-VL-32B-Instruct
srv   load_models:   * MoE-Aquif-3.5-Max-42B-A3B
srv   load_models:   * MoE-Aquif-3.5-Plus-30B-A3B
srv   load_models:   * MoE-GLM-4.5-Air-106B
srv   load_models:   * MoE-GPT-OSS-120B
srv   load_models:   * MoE-GPT-OSS-20B
srv   load_models:   * MoE-Granite-4.0-h-small-32B
srv   load_models:   * MoE-MiniMax-M2-230B-A10B
srv   load_models:   * MoE-Qwen3-30B-A3B-Instruct-2507
srv   load_models:   * MoE-Qwen3-30B-A3B-Thinking-2507
srv   load_models:   * MoE-Qwen3-Coder-30B-A3B-Instruct
srv   load_models:     gemma-3-1b-it-i1-GGUF <- Two models from --models-dir mradermacher/testing
srv   load_models:     gemma-3-4b-it-i1-GGUF
main: starting router server, no model will be loaded in this process
start: binding port with default address family
main: router server is listening on http://127.0.0.1:8082
main: NOTE: router mode is experimental
main:       it is not recommended to use this mode in untrusted environments

--models-dir mradermacher/testing: the parameters are inherited from the main command line

srv    operator(): instance name=gemma-3-1b-it-i1-GGUF exited with status 0
srv  log_server_r: request: GET /v1/models 127.0.0.1 200
srv          load: spawning server instance with name=gemma-3-4b-it-i1-GGUF on port 36237
srv          load: spawning server instance with args:
srv          load:   /root/llama.cpp.pascal/build/bin/llama-server
srv          load:   --host
srv          load:   127.0.0.1
srv          load:   --jinja
srv          load:   -kvu
srv          load:   --mlock
srv          load:   --port
srv          load:   36237
srv          load:   --alias
srv          load:   gemma-3-4b-it-i1-GGUF
srv          load:   --cache-type-k
srv          load:   q8_0
srv          load:   --cache-type-v
srv          load:   q8_0
srv          load:   --flash-attn
srv          load:   on
srv          load:   --model
srv          load:   mradermacher/testing/gemma-3-4b-it-i1-GGUF/gemma-3-4b-it.i1-Q6_K.gguf
srv          load:   --n-gpu-layers
srv          load:   999
srv          load:   --parallel
srv          load:   4
srv  log_server_r: request: POST /models/load 127.0.0.1 200

Model from --models-preset config.ini with parameters inherited from the main command line + preset overrides (sampling, ...)

srv          load: spawning server instance with name=MoE-Qwen3-30B-A3B-Instruct-2507 on port 47981
srv          load: spawning server instance with args:
srv          load:   /root/llama.cpp.pascal/build/bin/llama-server
srv          load:   --host
srv          load:   127.0.0.1
srv          load:   --jinja
srv          load:   -kvu
srv          load:   --min-p
srv          load:   0
srv          load:   --mlock
srv          load:   --port
srv          load:   47981
srv          load:   --temp
srv          load:   0.7
srv          load:   --top-k
srv          load:   20
srv          load:   --top-p
srv          load:   0.8
srv          load:   --alias
srv          load:   MoE-Qwen3-30B-A3B-Instruct-2507
srv          load:   --ctx-size
srv          load:   32768
srv          load:   --cache-type-k
srv          load:   q8_0
srv          load:   --cache-type-v
srv          load:   q8_0
srv          load:   --flash-attn
srv          load:   on
srv          load:   --model
srv          load:   mradermacher/Qwen3-30B-A3B-Instruct-2507-i1-GGUF/Qwen3-30B-A3B-Instruct-2507.i1-Q6_K.gguf
srv          load:   --n-gpu-layers
srv          load:   999
srv          load:   --parallel
srv          load:   4
srv  log_server_r: request: POST /models/load 127.0.0.1 200

WebUI

Capture (screenshot)

Looks good! It remains to test HF downloads in .cache/llama.cpp/...

ServeurpersoCom and others added 3 commits December 10, 2025 18:16
Replace flat directory scan with recursive traversal using
std::filesystem::recursive_directory_iterator. Support for
nested vendor/model layouts (e.g. vendor/model/*.gguf).
Model name now reflects the relative path within --models-dir
instead of just the filename. Aggregate files by parent
directory via std::map before constructing local_model.

PEG parser usage improvements:
- Simplify parser instantiation (remove arena indirection)
- Optimize grammar usage (ws instead of zero_or_more, remove optional wrapping)
- Fix last line without newline bug (+ operator instead of <<)
- Remove redundant end position check

Feature scope:
- Remove auto-reload feature (will be separate PR per @ngxson)
- Keep config.ini auto-creation and template generation
- Preserve per-model customization logic

Co-authored-by: aldehir <aldehir@users.noreply.github.com>
Co-authored-by: ngxson <ngxson@users.noreply.github.com>
ServeurpersoCom and others added 16 commits December 10, 2025 18:16
Complete rewrite of INI parser grammar and visitor:
- Use p.chars(), p.negate(), p.any() instead of p.until()
- Support end-of-line comments (key=value # comment)
- Handle EOF without trailing newline correctly
- Strict identifier validation ([a-zA-Z_][a-zA-Z0-9_.-]*)
- Simplified visitor (no pending state, no trim needed)
- Grammar handles whitespace natively via eol rule

Business validation preserved:
- Reject section names starting with LLAMA_ARG_*
- Accept only keys starting with LLAMA_ARG_*
- Require explicit section before key-value pairs

Co-authored-by: aldehir <aldehir@users.noreply.github.com>
Children now receive minimal CLI args (executable, model, port, alias)
instead of inheriting all router args. Global settings pass through
LLAMA_ARG_* environment variables only, eliminating duplicate config
warnings.

Fixes: Router args like -ngl, -fa were passed both via CLI and env,
causing 'will be overwritten' warnings on every child spawn
- Sanitize model names: replace / and \ with _ for display
- Recursive directory scan with relative path storage
- Convert relative paths to absolute when spawning children
- Filter router control args from child processes
- Refresh args after port assignment for correct port value
- Fallback preset lookup for compatibility
- Fix missing argv[0]: store server binary path before base_args parsing
Co-authored-by: aldehir <hello@alde.dev>
@ServeurpersoCom ServeurpersoCom force-pushed the server/router-per-model-config branch from bf2d94c to b36b3fe Compare December 10, 2025 17:16

ServeurpersoCom commented Dec 10, 2025

Dead code removed (~40 lines), rebased on latest master.
Working on:

  • Testing cached HuggingFace models integration
  • Documentation cleanup (fixing remaining TODOs in PR doc)


ngxson commented Dec 10, 2025

Nice, thanks for testing. @ServeurpersoCom unless you have anything else to add, I guess this is good to merge?
