modelsig

Compare LLM architectures without downloading weights.

modelsig extracts a multi-layer structural fingerprint from any HuggingFace model and tells you whether two models are architecturally equivalent — so the smaller one can act as a valid proxy for testing the larger one.

Weekly Validation · Python 3.9+ · License: Apache 2.0


What problem does it solve?

Testing inference engines (vLLM, TensorRT-LLM, SGLang, llama.cpp, ONNX Runtime, etc.) against every large model is prohibitively expensive. modelsig answers:

  • "Can I test Qwen3-72B correctness using Qwen3-7B instead?"
  • "Is Nemotron-120B-FP4 architecturally equivalent to the BF16 variant?"
  • "Does this ONNX export match the original safetensors model?"

It compares structural fingerprints — shape ratios, operator sets, KV cache patterns, layer topology — without ever downloading a single weight tensor.


Key Features

  • Zero weight download — safetensors header via HTTP Range (~20 bytes), ONNX graph-only (no .onnx_data), or config-only fast mode
  • 5-layer fingerprint — static weights, arch config, op types, KV cache pattern, layer-level I/O signatures
  • 3-phase isomorphism comparison — key overlap, substructure, algebraic scaling
  • Substitution verdicts — FULL_SUBSTITUTE / PARTIAL_SUBSTITUTE / NO_SUBSTITUTE
  • 4-level multi-fidelity test plan — maps models to test coverage levels L1–L4
  • Wide model support — dense decoder, GQA, MoE, vision-language, speech, ONNX classification
  • Both HF and local models — supports local:/path/to/model
  • JSON / table / markdown output — CI-friendly JSON, human-readable table, shareable markdown

Installation

From PyPI (recommended)

uv add modelsig           # add to a uv project
# or
uv tool install modelsig  # install as a standalone CLI tool

From source

git clone https://github.com/joe0731/modelsig
cd modelsig
uv sync                   # install all deps + editable package
uv run modelsig --help

Still using pip?

pip install modelsig

Dependencies (all installed by default):

Package          Purpose
requests         HTTP Range fetching for safetensors headers
huggingface_hub  Model file listing, downloads, auth
onnx             ONNX graph parsing (falls back to built-in protobuf if unavailable)
transformers     AutoConfig normalization, layer signature capture
torch            Meta-device forward pass for layer I/O shape collection
safetensors      Local safetensors file parsing

Quick Start

# Analyze a single model (-m / --model flag)
modelsig -m Qwen/Qwen3-7B --output table

# Compare two models (proxy-test decision)
modelsig -m Qwen/Qwen3-7B -m Qwen/Qwen3-72B --compare --output table

# Fast mode for large models (config only, no download)
modelsig -m nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 --fast --output table

# ONNX model
modelsig -m onnx-community/Qwen3.5-0.8B-ONNX --output json

# Skip layer-level I/O signature capture (faster, no torch needed)
modelsig -m Qwen/Qwen3-7B --no-layer-sig --output json

# Private/gated model
modelsig -m org/private-model --token hf_xxx
# or: export HF_TOKEN=hf_xxx

How It Works

Zero-Weight-Download

For safetensors models, only the file header is fetched via HTTP Range requests (~20 bytes per shard). No weights are transferred.

For ONNX models, only the .onnx graph file is downloaded (typically 1–5 MB). The paired .onnx_data weight file (which can be GBs) is never touched.

For fast mode (--fast), only config.json is fetched (a few KB). No tensors at all.
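
The mechanism behind the header-only fetch can be sketched in a few lines. A safetensors file begins with an 8-byte little-endian length prefix followed by a JSON header describing every tensor; modelsig's actual implementation lives in parsers/safetensors.py, so treat this as an illustrative sketch, not the shipped code:

```python
import json
import struct

def header_length(first8: bytes) -> int:
    # A safetensors file starts with a little-endian u64 giving the byte
    # length of the JSON header that follows it.
    (n,) = struct.unpack("<Q", first8)
    return n

def fetch_safetensors_header(url: str, timeout: int = 30) -> dict:
    # Two small Range requests recover every tensor's name, dtype, and
    # shape; no weight bytes are ever transferred.
    import requests  # listed under Dependencies above

    r = requests.get(url, headers={"Range": "bytes=0-7"}, timeout=timeout)
    r.raise_for_status()
    n = header_length(r.content[:8])

    r = requests.get(url, headers={"Range": f"bytes=8-{7 + n}"}, timeout=timeout)
    r.raise_for_status()
    return json.loads(r.content[:n])
```

The returned dict maps tensor names to their dtype, shape, and data offsets — everything the L1 signature needs.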

5-Layer Signature System

  • L1 — Static weight signature: per-tensor {abstract_key → shape, dtype, layer_type}, layer indices normalized to .N  (from safetensors header / ONNX initializers)
  • L2 — Architecture fingerprint: hidden_size, num_hidden_layers, num_attention_heads, num_key_value_heads, intermediate_size, head_dim, MoE config  (from config.json via AutoConfig)
  • L3 — Op type set: canonical operator vocabulary — aten/mm, attention, rms_norm, rope, silu, topk/router  (from tensor key patterns / ONNX opset)
  • L4 — KV cache shape pattern: [batch, num_kv_heads, seq_len, head_dim]  (derived from L2)
  • L5 — Layer I/O signatures: per-module {input: [{dtype, shape}], output: [{dtype, shape}]} on the meta device  (from torch forward hooks; --no-layer-sig to skip)
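
The .N layer-index normalization used by the L1 static signature can be sketched as a single regex pass. This is an illustration of the idea behind norm_key in signature/static.py; the shipped rules may differ:

```python
import re

def norm_key(key: str) -> str:
    # Collapse numeric path segments (layer indices, MoE expert indices)
    # to "N", so models of different depth or expert count share one
    # abstract key set and become directly comparable.
    return re.sub(r"\.\d+(?=\.)", ".N", key)
```

With this, a 28-layer and an 80-layer model of the same family produce identical abstract key sets, which is what makes the phase-1 overlap check meaningful.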

3-Phase Isomorphism Comparison

Phase 1 — Key coverage    : normalized key set overlap ≥ 80%
Phase 2 — Substructure    : attention / FFN / norm submodules match
Phase 3 — Algebraic scale : hidden_size / intermediate_size / head_dim ratios uniform within 20%

Result: ISOMORPHIC / SCALE_ONLY / DIFFERENT_ARCH
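
Phases 1 and 3 reduce to simple set and ratio arithmetic. A minimal sketch (the real logic lives in comparison/phases.py and comparison/ratios.py and may weigh things differently):

```python
def key_coverage(keys_a: set, keys_b: set) -> float:
    # Phase 1: overlap of the normalized abstract key sets, measured
    # against the larger set.
    if not keys_a or not keys_b:
        return 0.0
    return len(keys_a & keys_b) / max(len(keys_a), len(keys_b))

def phase1_pass(keys_a: set, keys_b: set, threshold: float = 0.80) -> bool:
    return key_coverage(keys_a, keys_b) >= threshold

def ratios_uniform(dims_a: dict, dims_b: dict, tol: float = 0.20) -> bool:
    # Phase 3: the shared dimensions (hidden_size, intermediate_size, ...)
    # must scale by roughly one common factor between the two models.
    ratios = [dims_b[k] / dims_a[k] for k in dims_a if k in dims_b]
    if not ratios:
        return False
    mean = sum(ratios) / len(ratios)
    return all(abs(r - mean) / mean <= tol for r in ratios)
```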

Substitution Verdict

Verdict             Meaning
FULL_SUBSTITUTE     All 3 phases pass + shape ratios uniform + layer_type_coverage ≥ 95%
PARTIAL_SUBSTITUTE  Phase 1+2 pass or op coverage ≥ 80%
NO_SUBSTITUTE       Different arch, MoE vs Dense mismatch, or key divergence
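
As a rough sketch, the verdict rules above might translate into code like this (assumption: the real decision logic in comparison/coverage.py may combine these signals differently):

```python
def substitution_verdict(phase1: bool, phase2: bool, phase3: bool,
                         shape_uniform: bool, layer_type_coverage: float,
                         op_coverage: float, moe_mismatch: bool) -> str:
    # MoE vs Dense is disqualifying regardless of any other signal.
    if moe_mismatch:
        return "NO_SUBSTITUTE"
    # All three phases plus uniform scaling plus near-total layer-type
    # coverage -> the small model can stand in for the large one.
    if phase1 and phase2 and phase3 and shape_uniform and layer_type_coverage >= 0.95:
        return "FULL_SUBSTITUTE"
    # Partial structural agreement still buys partial test coverage.
    if (phase1 and phase2) or op_coverage >= 0.80:
        return "PARTIAL_SUBSTITUTE"
    return "NO_SUBSTITUTE"
```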

Quantization Transferability Estimate

When comparing two models, modelsig also computes a structural quantization transferability score:

struct_sim_score   — 1.0 (ISOMORPHIC) / 0.80 (SCALE_ONLY) / 0.20 (DIFFERENT_ARCH)
op_hist_sim        — cosine similarity of operator frequency vectors
layer_type_hist    — Jaccard similarity of layer type sets
shape_uniform      — whether common weight shapes scale uniformly
moe_correction     — ~5% penalty for mixed MoE/Dense pairs
arch_risk_factors  — hidden_size ratio, GQA mismatch, FFN expansion, RoPE theta diff

Output: estimated_transferability score (0–1) with confidence (HIGH/MEDIUM/LOW), recommended_methods (GPTQ/AWQ/mixed-precision/expert-aware), and caveats.

This is a structural pre-filter only. SensCorr and RepAlign require actual calibration data and are the strongest transfer predictors. Use this score to decide whether to attempt transfer at all, not as a final guarantee.
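
The op_hist_sim component above is ordinary cosine similarity over operator counts. A minimal sketch, assuming comparison/quant_transfer.py applies its own weighting on top:

```python
import math
from collections import Counter

def op_hist_sim(ops_a: Counter, ops_b: Counter) -> float:
    # Cosine similarity of the two operator-frequency vectors; 1.0 means
    # the same ops in the same proportions (scale-invariant, so a 7B and
    # a 72B of one family can still score 1.0).
    vocab = set(ops_a) | set(ops_b)
    dot = sum(ops_a[op] * ops_b[op] for op in vocab)
    na = math.sqrt(sum(v * v for v in ops_a.values()))
    nb = math.sqrt(sum(v * v for v in ops_b.values()))
    return dot / (na * nb) if na and nb else 0.0
```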

Multi-Fidelity Test Plan (4 levels)

L1 Structure    — cheapest: model loading, tensor shapes, dtype validation
L2 Numerical    — cosine similarity, perplexity on calibration set
L3 Runtime      — prefill latency, decode throughput, KV cache eviction
L4 Canary       — large/MoE model: peak memory, TP/PP correctness

Usage

Basic — analyze a single model

modelsig -m Qwen/Qwen3-7B --output table
==============================================================================
  modelsig v2.0  |  2026-03-17T10:00:00Z
==============================================================================

   Model: Qwen/Qwen3-7B
  type                   qwen3
  hidden_size            3584
  num_hidden_layers      28
  num_attention_heads    28  (kv: 8)
  intermediate_size      18944
  head_dim               128
  is_moe                 False
  ffn_expansion          5.285714
  gqa_ratio              3.5
  kv_cache_pattern       [batch, 8, seq_len, 128]
  op_types               aten/mm, attention, embedding, rms_norm, rope, silu, swiglu
  layer_types            AttentionLayer, EmbeddingLayer, FFN_SwiGLU, LMHead, RMSNorm
  abstract_keys          14
  source                 safetensors
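
The derived fields in that table (ffn_expansion, gqa_ratio, kv_cache_pattern) follow directly from config.json. A sketch using standard HF config field names — not modelsig's actual code:

```python
def derived_metrics(cfg: dict) -> dict:
    # Reproduce the derived rows of the table above from raw config values.
    heads = cfg["num_attention_heads"]
    # GQA models list fewer KV heads than attention heads; dense MHA
    # configs may omit the field entirely, in which case kv == heads.
    kv_heads = cfg.get("num_key_value_heads", heads)
    head_dim = cfg.get("head_dim", cfg["hidden_size"] // heads)
    return {
        "ffn_expansion": cfg["intermediate_size"] / cfg["hidden_size"],
        "gqa_ratio": heads / kv_heads,
        "kv_cache_pattern": ["batch", kv_heads, "seq_len", head_dim],
    }
```

Feeding in the Qwen3-7B values shown above (hidden_size 3584, 28 heads, 8 KV heads, intermediate_size 18944, head_dim 128) reproduces the table's ffn_expansion, gqa_ratio, and KV cache pattern.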

Compare models (proxy-testing decision)

modelsig -m Qwen/Qwen3-7B -m Qwen/Qwen3-72B --compare --output table

Full analysis with multi-fidelity plan

modelsig \
    -m Qwen/Qwen3-7B -m Qwen/Qwen3-30B-A3B -m Qwen/Qwen3-235B-A22B \
    --compare --multi-fidelity --output markdown --save report.md

ONNX model

modelsig -m onnx-community/Qwen3-4B-ONNX --output json

Config-only fast mode (no safetensors/ONNX fetch, instantaneous)

modelsig -m Qwen/Qwen3-235B-A22B --fast --output table

Local model directory

modelsig local:/path/to/model --output json
modelsig local:/path/to/7b local:/path/to/72b --compare

Private / gated models

modelsig -m org/private-model --token hf_xxx
# or: export HF_TOKEN=hf_xxx

Save report

modelsig -m Qwen/Qwen3-7B -m Qwen/Qwen3-72B \
    --compare --output markdown --save report.md

Models with custom code

# Only use --trust-remote-code for models you trust.
# This allows execution of arbitrary Python code from the model repository.
modelsig -m org/custom-model --trust-remote-code

Scenario Examples

Scenario 1 — Inference Engine Regression Testing

Problem: You want to validate a new vLLM kernel for Qwen3-72B but CI is limited to A10G GPUs (24 GB VRAM).

modelsig -m Qwen/Qwen3-7B -m Qwen/Qwen3-72B --compare --output table

Expected result: ISOMORPHIC / FULL_SUBSTITUTE — same GQA pattern, same op set, uniform scaling. You can run full functional tests on 7B and gate the 72B behind a nightly canary run.


Scenario 2 — MoE vs Dense Compatibility Check

Problem: Does Qwen3-30B-A3B (MoE) behave like a drop-in proxy for Qwen3-235B-A22B?

modelsig -m Qwen/Qwen3-30B-A3B -m Qwen/Qwen3-235B-A22B \
    --compare --multi-fidelity --output markdown

Both are MoE models from the same family → ISOMORPHIC. The multi-fidelity plan shows:

  • L1: use 30B-A3B for structure/conversion tests
  • L2: numerical validation on 30B
  • L4: 235B-A22B as canary for routing correctness and peak memory

Scenario 3 — Cross-Family Sanity Check

Problem: Can Llama-3.1-8B proxy-test a Mistral-7B?

modelsig -m meta-llama/Llama-3.1-8B-Instruct -m mistralai/Mistral-7B-v0.1 \
    --compare --output json

Both are dense GQA decoders with the same op set → ISOMORPHIC / FULL_SUBSTITUTE. Despite different model_type labels, the structural fingerprint matches.


Scenario 4 — ONNX Runtime Compatibility

Problem: You converted GPT-2 to ONNX and want to verify the ONNX version matches the torch version structurally.

modelsig -m openai-community/gpt2 -m onnx-community/gpt2 --compare --output table

The ONNX version is parsed from the .onnx graph file. The safetensors version is parsed from the header. Both share the same abstract key set → ISOMORPHIC.


Scenario 5 — Quantized Model Compatibility

Problem: Will nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (quantized to FP4) behave the same as the BF16 variant?

modelsig \
    -m nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
    -m nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --compare --fast --output table

Both share the same architecture (120B MoE). --fast uses config-only mode, so nothing beyond config.json is fetched. Result: ISOMORPHIC — same layer topology, only dtype differs.


Scenario 6 — Quantization Method Transfer

Problem: You quantized Qwen3-7B with AWQ. Can that config transfer to Qwen3-72B?

modelsig \
    -m Qwen/Qwen3-7B -m Qwen/Qwen3-72B \
    --compare --output json --save qwen3_quant_transfer.json

The quant_transfer block in coverage_matrix gives:

  • estimated_transferability — composite score (0–1) based on structural similarity
  • confidence — HIGH/MEDIUM/LOW
  • recommended_methods — e.g. GPTQ (W4A16), AWQ (W4A16), Mixed-precision
  • arch_risk_factors — e.g. large hidden_size ratio, RoPE theta mismatch
  • caveats — whether activation-aware recalibration is needed

CLI Reference

modelsig [-m MODEL_ID ...] [MODEL_ID ...] [OPTIONS]

Arguments:
  -m / --model MODEL_ID  HF model ID or local:PATH (repeatable, preferred)
  MODEL_ID               positional alternative — same as -m

Options:
  --output              json | table | markdown  (default: json)
  --compare             Compute pairwise coverage for all model pairs
  --save FILE           Save output to file
  --fast                Config-only mode — no safetensors/ONNX download
  --multi-fidelity      Include 4-level multi-fidelity test plan
  --no-layer-sig        Skip per-module I/O dtype+shape capture (faster)
  --token TOKEN         HF Hub token for private/gated models
  --timeout SEC         HTTP timeout (default: 30)
  --no-color            Disable ANSI colors in table output
  --trust-remote-code   Allow trust_remote_code=True for custom model code
                        ⚠ enables arbitrary code execution — use only for trusted models

Module Structure

modelsig/
├── analyze.py              CLI entry point (~190 lines)
├── constants.py            Shared constants: TOOL_NAME, _OP_RULES, _LAYER_TYPE_RULES, …
│
├── hf/
│   └── client.py           HF Hub client: token management, HTTP GET + backoff,
│                           model_info().siblings, hf_hub_download
│
├── parsers/
│   ├── safetensors.py      HTTP Range header fetch + local shard discovery
│   └── config.py           AutoConfig.from_pretrained() + _flatten_config() aliases
│
├── onnx/
│   ├── ops.py              _ONNX_DTYPE map, _ONNX_OP_MAP, canonical op mapping
│   ├── parser.py           onnx.load(load_external_data=False) + protobuf fallback
│   ├── selector.py         Primary .onnx file selection heuristics
│   └── collector.py        Orchestrates HF download → parse pipeline
│
├── torch/
│   └── layer_sig.py        L5: per-module input/output dtype+shape via forward hooks
│
├── signature/
│   ├── static.py           L1: build_static_weight_signature, norm_key, norm_dtype
│   ├── arch.py             L2: build_arch_fingerprint, KV cache pattern, dim ratios
│   ├── template.py         Per-layer canonical submodule template (for phase-2)
│   └── fingerprint.py      ModelFingerprint dataclass + build_fingerprint orchestrator
│
├── comparison/
│   ├── phases.py           Phase 1/2/3 isomorphism tests
│   ├── ratios.py           Shape ratio uniformity analysis
│   ├── quant_transfer.py   Structural quantization transferability estimator
│   ├── coverage.py         Unified compute_coverage + test strategy + quant_transfer
│   └── multifidelity.py    4-level multi-fidelity test plan builder
│
└── output/
    ├── colors.py           ANSI color helpers
    ├── json_fmt.py         JSON formatter + fp_to_dict
    ├── table_fmt.py        ANSI table formatter
    └── markdown_fmt.py     Markdown report formatter

Security

  • No arbitrary code execution by default. trust_remote_code is False unless explicitly set via --trust-remote-code.
  • Token safety. The HF token is passed via HTTP headers only — never embedded in URLs or logged to stderr.
  • No weight download. Only metadata (safetensors header, ONNX graph, config.json) is fetched.

Design Principles

Principle                    Implementation
Zero weight download         HTTP Range (safetensors), graph-only .onnx, config-only fast path
Framework-driven parsing     AutoConfig.from_pretrained() for config normalization; onnx.load() for graph parsing
Graceful degradation         Every heavy dependency is optional — falls back to built-in parsers
Architecture-agnostic        Works on dense decoders, GQA models, MoE, vision-language, speech, classification
Single CLI, composable API   Import any module independently or use the unified CLI
Safe by default              trust_remote_code=False; token in headers, not URLs

Supported Model Families

Validated weekly against 57 models (29 safetensors + 28 ONNX):

Safetensors (full header fetch): Qwen3.5-{0.8B,4B,9B,27B,35B-A3B,397B-A17B}, Qwen2.5-7B-Instruct, Qwen3-Coder-Next, DeepSeek-V3.2, Kimi-K2.5, MiniMax-M2.5, GLM-5, Nemotron-3-{Nano-4B, Super-120B}-{BF16,NVFP4,FP8}, Granite-4.0-1b-speech, BitNet-b1.58-2B-4T, MiroThinker-{1.7,1.7-mini}, Sarvam-{30b,105b}, Reka-edge-2603, LocoTrainer-4B, OmniCoder-9B, Nanbeige4.1-3B, Param2-17B-A2.4B, gpt-oss-20b, all-MiniLM-L6-v2

ONNX (graph-only, no weight download): Qwen3.5-{0.8B,2B,4B}-ONNX, Qwen3-{4B-VL,VL-2B,Reranker-0.6B}-ONNX, Qwen2.5-{0.5B,VL-3B}-ONNX, LFM2-24B-A2B, Olmo-Hybrid-{SFT,DPO,Think}-7B, Voxtral-Mini-4B, Granite-4.0-1b-speech, Nemotron-Nano-4B, BERT-multilingual-NER, chinese-RoBERTa, multilingual-MiniLMv2, CodeT5, Jan-code-4b, Josiefied-Qwen3.5-0.8B, IndoBERT-news-classification, ai-image-detection × 4, vehicle-classification, tmr-text-detector


Contributing

All logic is in the modelsig/ package. Each subdirectory has a single responsibility. Tests live in tests/ and cover 130+ unit and integration scenarios.

git clone https://github.com/joe0731/modelsig
cd modelsig
uv sync --extra dev       # installs all deps + dev tools
uv run pytest tests/ -v

Weekly validation against the full model zoo runs via GitHub Actions (.github/workflows/weekly-validation.yml).


License

Apache 2.0 — see LICENSE.
