PMetal

Powdered Metal — An ML SDK, framework, and application suite for Apple Silicon, written in Rust.

PMetal is a complete machine learning platform for Apple Silicon — from low-level Metal GPU kernels and Apple Neural Engine integration to high-level training APIs, a terminal TUI, and a full desktop GUI. Ship fine-tuned models without leaving the Apple ecosystem.

Use PMetal Your Way

Desktop GUI

(Screenshot: PMetal desktop GUI)

A full Tauri + Svelte desktop application for visual model management, training, and inference.

```bash
cd crates/pmetal-gui
bun install && bun tauri dev
```

10 pages: Dashboard, Models, Datasets, Training, Distillation, GRPO, Inference, Merging, Quantize, and Settings. Download models from HuggingFace, configure LoRA training with live loss metrics, chat with models, merge weights, and quantize — all from the GUI. Training runs in-process with real-time progress updates.

Terminal TUI

(Screenshot: PMetal terminal TUI)

A full-featured terminal control center with 9 tabs.

```bash
pmetal tui
```

| Tab | Description |
| --- | --- |
| Dashboard | Live loss curves (braille), LR schedule, throughput sparklines, timing breakdown gauges |
| Device | GPU/ANE info, Metal feature detection, memory gauge, kernel tuning, UltraFusion topology |
| Models | Browse cached models, HuggingFace Hub search (`S`), memory fit estimation, download |
| Datasets | Scan and preview local datasets (JSONL, Parquet, CSV) with line counts |
| Training | Configure and launch SFT/LoRA/QLoRA training runs with sectioned parameter forms |
| Distillation | Configure knowledge distillation (online, offline, progressive, cross-vocab) |
| GRPO | Configure GRPO/DAPO reasoning training with reward functions and sampling params |
| Inference | Interactive chat interface with markdown rendering and generation settings sidebar |
| Jobs | Training run history with log viewer, status tracking, and metadata |

Keybindings: `Tab`/`Shift+Tab` to switch tabs, `Alt+1`-`9` for direct access, `L` to adjust learning rate mid-run, `q` to quit.

CLI

```bash
# LoRA fine-tuning with sequence packing (default)
pmetal train \
  --model Qwen/Qwen3-0.6B \
  --dataset train.jsonl \
  --output ./output \
  --lora-r 16 --batch-size 4 --learning-rate 2e-4

# Inference with LoRA adapter
pmetal infer \
  --model Qwen/Qwen3-0.6B \
  --lora ./output/lora_weights.safetensors \
  --prompt "Explain quantum entanglement" \
  --chat --show-thinking

# Knowledge distillation
pmetal distill \
  --teacher Qwen/Qwen3-4B \
  --student unsloth/Qwen3.5-0.8B-Base \
  --dataset train.jsonl

# GRPO reasoning training
pmetal grpo \
  --model Qwen/Qwen3-0.6B \
  --dataset reasoning.jsonl \
  --reasoning-rewards

# HuggingFace model search with memory fit
pmetal search "qwen 0.6b" --detailed

# Merge models with SLERP
pmetal merge \
  --models model-a model-b \
  --method slerp --t 0.5

# Quantize to GGUF
pmetal quantize \
  --model ./output \
  --output model.gguf --type q4km

# Fuse LoRA into base model
pmetal fuse \
  --model Qwen/Qwen3-0.6B \
  --lora ./output/lora_weights.safetensors

# Evaluate perplexity
pmetal eval \
  --model Qwen/Qwen3-0.6B \
  --dataset eval.jsonl

# Start OpenAI-compatible server (requires --features serve)
pmetal serve --model Qwen/Qwen3-0.6B --port 8080
```

All CLI Commands

| Command | Description |
| --- | --- |
| `train` | Fine-tune with LoRA/QLoRA/DoRA (SFT) |
| `infer` | Interactive inference with chat, tool use, and thinking mode |
| `distill` | Knowledge distillation (online, offline, progressive) |
| `grpo` | GRPO/DAPO reasoning training |
| `search` | Search HuggingFace Hub with memory fit estimation |
| `download` | Download a model from HuggingFace Hub |
| `merge` | Merge two or more models (12 strategies) |
| `quantize` | GGUF quantization (13 format options) |
| `fuse` | Fuse LoRA adapter weights into base model |
| `eval` | Evaluate model perplexity on a dataset |
| `serve` | OpenAI-compatible inference server (feature-gated) |
| `tui` | Full TUI control center (9 tabs) |
| `dashboard` | Real-time training metrics visualization |
| `dataset` | Dataset utilities: analyze, download, convert |
| `ollama` | Ollama integration: modelfile, create, templates |
| `info` | Show device info (GPU, ANE, bandwidth, NAX) |
| `memory` | Show memory usage and available capacity |
| `init` | Generate a sample configuration file |
| `bench` | Benchmark training performance |
| `bench-gen` | Benchmark generation loop timing |
| `bench-ffi` | Benchmark FFI overhead |

SDK

PMetal is an embeddable SDK — integrate training, inference, and model operations into your own Rust applications. The `easy` module provides high-level builders, while the underlying crates (`pmetal-trainer`, `pmetal-models`, `pmetal-lora`, etc.) offer full control over every pipeline stage.

```rust
use pmetal::easy;

// Fine-tune with LoRA
let result = easy::finetune("Qwen/Qwen3-0.6B", "train.jsonl")
    .lora(16, 32.0)
    .learning_rate(2e-4)
    .epochs(3)
    .output("./output")
    .run()
    .await?;

// DPO preference optimization
let result = easy::dpo("Qwen/Qwen3-0.6B", "preferences.jsonl")
    .dpo_beta(0.1)
    .reference_model("Qwen/Qwen3-0.6B")
    .run()
    .await?;

// Inference
let output = easy::infer("Qwen/Qwen3-0.6B")
    .temperature(0.7)
    .lora("./output/lora_weights.safetensors")
    .generate("What is 2+2?")
    .await?;

// Streaming inference
easy::infer("Qwen/Qwen3-0.6B")
    .generate_streaming("Tell me a story", |delta| {
        print!("{delta}");
        true // return false to stop early
    })
    .await?;
```

Available builders: `easy::finetune()`, `easy::dpo()`, `easy::simpo()`, `easy::orpo()`, `easy::kto()`, `easy::infer()`.

For lower-level control, use the crates directly — `pmetal-trainer::TrainingLoop`, `pmetal-models::DynamicModel`, `pmetal-lora::DynamicLoraModel`, `pmetal-distill::Distiller`, etc. See the `examples/` directory for complete working examples including manual training loop orchestration and ANE-specific workflows.

Python SDK

PMetal exposes a Python extension module via PyO3. Install with `maturin develop` from `crates/pmetal-py`.

Quick Start (Easy API)

```python
import pmetal

# Fine-tune with sensible defaults
result = pmetal.finetune(
    "Qwen/Qwen3-0.6B",
    "train.jsonl",
    lora_r=16,
    learning_rate=2e-4,
    epochs=3,
)
print(f"Loss: {result['final_loss']}, Steps: {result['total_steps']}")

# Inference
text = pmetal.infer("Qwen/Qwen3-0.6B", "What is 2+2?")
print(text)

# Inference with LoRA adapter
text = pmetal.infer(
    "Qwen/Qwen3-0.6B",
    "Explain quantum entanglement",
    lora="./output/lora_weights.safetensors",
)
```

Full Control

```python
import pmetal

# Configure training components
lora_config = pmetal.LoraConfig(r=16, alpha=32.0)
training_config = pmetal.TrainingConfig(
    learning_rate=2e-4,
    num_epochs=3,
    batch_size=4,
    max_seq_len=2048,
)

# Create trainer
trainer = pmetal.Trainer(
    model_id="Qwen/Qwen3-0.6B",
    lora_config=lora_config,
    training_config=training_config,
    dataset_path="train.jsonl",
)
trainer.add_callback(pmetal.ProgressCallback())
result = trainer.train()

# Load model for inference
model = pmetal.Model.load("Qwen/Qwen3-0.6B")
print(model.generate("Hello world", temperature=0.7))
```

Installation

Prebuilt signed binaries are available on the Releases page.

Crates are available on crates.io.

Build from source:

```bash
git clone https://github.com/epistates/pmetal.git && cd pmetal
cargo build --release          # CLI + TUI
cd crates/pmetal-gui && bun install && bun tauri build  # GUI (optional)
```

Hardware Support

PMetal automatically detects Apple Silicon capabilities at startup and tunes kernel parameters accordingly.

| Chip Family | GPU Family | NAX | ANE | UltraFusion | Status |
| --- | --- | --- | --- | --- | --- |
| M1 / Pro / Max / Ultra | Apple7 | - | 16 cores | Ultra: 2-die | Fully supported |
| M2 / Pro / Max / Ultra | Apple8 | - | 16 cores | Ultra: 2-die | Fully supported |
| M3 / Pro / Max / Ultra | Apple9 | - | 16 cores | Ultra: 2-die | Fully supported |
| M4 / Pro / Max / Ultra | Apple9 | - | 16 cores | Ultra: 2-die | Fully supported |
| M5 / Pro / Max / Ultra | Apple10 | Yes | 16 cores | Ultra: 2-die | Fully supported |

Auto-detected features: GPU family, device tier, core counts, memory bandwidth, dynamic caching, mesh shaders, NAX (M5+), UltraFusion topology (via sysctl hw.packages), ANE availability.

Tier-based kernel tuning: Matrix tile sizes, FlashAttention block sizes, fused kernel threadgroup sizes, and batch multipliers are automatically selected based on device tier (Base/Pro/Max/Ultra) and GPU family. See docs/hardware-support.md for the full tuning matrix.

Architecture

PMetal is organized as a Rust workspace with 18 specialized crates:

```text
pmetal/
├── pmetal-core         # Foundation: configs, traits, types, error handling
├── pmetal-metal        # Custom Metal GPU kernels + ANE runtime
├── pmetal-mlx          # MLX backend integration (KV cache, RoPE, etc.)
├── pmetal-models       # LLM architectures (Llama, Qwen, DeepSeek, etc.)
├── pmetal-lora         # LoRA/QLoRA training implementations
├── pmetal-trainer      # Training loops (SFT, DPO, SimPO, ORPO, KTO, GRPO, etc.)
├── pmetal-data         # Dataset loading, chat templates, tokenization
├── pmetal-hub          # HuggingFace Hub integration + model fit estimation
├── pmetal-distill      # Knowledge distillation (online, offline, cross-vocab, TAID)
├── pmetal-merge        # Model merging (14 strategies)
├── pmetal-gguf         # GGUF format with imatrix quantization
├── pmetal-mhc          # Manifold-Constrained Hyper-Connections
├── pmetal-distributed  # Distributed training (mDNS, Ring All-Reduce)
├── pmetal-vocoder      # BigVGAN neural vocoder
├── pmetal-serve        # OpenAI-compatible inference server
├── pmetal-py           # Python bindings (maturin/PyO3)
├── pmetal-cli          # Command-line interface + TUI control center
└── pmetal-gui          # Desktop GUI (Tauri + Svelte + TailwindCSS)
```

The pmetal facade crate re-exports all modules with feature flags and provides the easy API for quick-start usage.

Supported Models

Inference (via DynamicModel dispatcher)

All models below can be loaded from HuggingFace Hub or local safetensors and used for inference via the CLI, TUI, GUI, or SDK.

| Family | Architecture | Variants | `model_type` values |
| --- | --- | --- | --- |
| Llama | Llama | 2, 3, 3.1, 3.2, 3.3 | `llama`, `llama3` |
| Llama 4 | Llama4 | Scout, Maverick | `llama4` |
| Qwen 2 | Qwen2 | 2, 2.5 | `qwen2`, `qwen2_5` |
| Qwen 3 | Qwen3 | 3 | `qwen3` |
| Qwen 3 MoE | Qwen3MoE | 3-MoE | `qwen3_moe` |
| Qwen 3.5 | Qwen3Next | 3.5 (Next) | `qwen3_next`, `qwen3_5` |
| DeepSeek | DeepSeek | V3, V3.2, V3.2-Speciale | `deepseek`, `deepseek_v3` |
| Mistral | Mistral | 7B, Mixtral 8x7B | `mistral`, `mixtral` |
| Gemma | Gemma | 2, 3 | `gemma`, `gemma2`, `gemma3` |
| Phi 3 | Phi | 3, 3.5 | `phi`, `phi3` |
| Phi 4 | Phi4 | 4 | `phi4` |
| Cohere | Cohere | Command R | `cohere`, `command_r` |
| Granite | Granite | 3.0, 3.1, Hybrid MoE | `granite`, `granitehybrid` |
| NemotronH | NemotronH | Hybrid (Mamba+Attention) | `nemotron_h` |
| StarCoder2 | StarCoder2 | 3B, 7B, 15B | `starcoder2` |
| RecurrentGemma | RecurrentGemma | Griffin | `recurrentgemma`, `griffin` |
| Jamba | Jamba | 1.5 | `jamba` |
| Flux | Flux | 1-dev, 1-schnell | `flux` |

LoRA/QLoRA Training Support

LoRA training is supported for models that have implementations in DynamicLoraModel. Architecture detection is automatic — just point pmetal train at a model directory or HuggingFace ID.

| Architecture | LoRA | QLoRA | Notes |
| --- | --- | --- | --- |
| Llama | Yes | Yes | Covers Llama 2, 3, 3.1, 3.2, 3.3. Gradient checkpointing supported. |
| Qwen 2 | Yes | - | Uses Qwen3 LoRA implementation internally. |
| Qwen 3 | Yes | Yes | Gradient checkpointing supported. |
| Qwen 3.5 (Next) | Yes | - | Hybrid architecture with nested text_config handling. |
| Gemma | Yes | Yes | GeGLU activation, special RMSNorm. |
| Mistral | Yes | Yes | Sliding window attention support. |
| Phi 3 | Yes | - | Partial RoPE, fused gate_up projection. |

Architectures not listed above (Llama 4, Qwen 3 MoE, DeepSeek, Cohere, Granite, NemotronH, Phi 4, StarCoder2, RecurrentGemma, Jamba) support inference but do not yet have LoRA training integration via DynamicLoraModel. Contributions welcome.
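For orientation, the update rule that all of the LoRA variants above build on can be sketched in a few lines. This is an illustrative sketch of the math, not PMetal's implementation: the adapted layer computes `W x + (alpha / r) * B(A x)`, where `A` is `r x d_in` and `B` is `d_out x r`, so only `r * (d_in + d_out)` parameters are trained while `W` stays frozen.

```python
# Illustrative sketch of the LoRA update rule (not PMetal's implementation).

def matvec(m, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

def lora_forward(W, A, B, x, alpha, r):
    base = matvec(W, x)              # frozen base projection W x
    delta = matvec(B, matvec(A, x))  # low-rank update B(A x)
    scale = alpha / r                # e.g. r=16, alpha=32 -> scale 2.0
    return [b + scale * d for b, d in zip(base, delta)]

# Rank-1 toy example: W is the 2x2 identity, A = [[1, 0]], B = [[0], [1]]
y = lora_forward([[1, 0], [0, 1]], [[1, 0]], [[0], [1]], [3.0, 4.0], alpha=2.0, r=1)
# base = [3, 4]; delta = [0, 3]; scaled by 2 -> y = [3.0, 10.0]
```

This is why the `--lora-alpha` default of 32.0 is documented as "2x rank": with `r=16` the effective scale on the low-rank update is 2.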

Architecture Modules (Not Yet in Dispatcher)

The following architectures have implementations in pmetal-models but are not wired into the DynamicModel dispatcher and cannot be loaded via the CLI or DynamicModel::load():

| Family | Module | Notes |
| --- | --- | --- |
| GPT-OSS | `gpt_oss` | MoE with Top-4 sigmoid routing, 20B/120B variants |
| Pixtral | `pixtral` | 12B vision-language model |
| Qwen2-VL | `qwen2_vl` | 2B, 7B vision-language model |
| MLlama | `mllama` | Llama 3.2-Vision |
| CLIP | `clip` | ViT-L/14 vision encoder |
| Whisper | `whisper` | Base, Small, Medium, Large speech models |
| T5 | `t5` | Encoder-decoder architecture |

These modules can be used directly via their Rust types (e.g., pmetal_models::architectures::gpt_oss::GptOssForCausalLM) but require manual weight loading.

Diffusion Models

| Family | Variants | Status |
| --- | --- | --- |
| Flux | 1-dev, 1-schnell | Dispatcher + pipeline implemented |

Training Methods

All training methods support callback-based cancellation (should_stop()), metrics JSONL logging, and adaptive learning rate control.

| Method | CLI | GUI | TUI | Library |
| --- | --- | --- | --- | --- |
| SFT (Supervised Fine-Tuning) | `train` | Yes | Yes | `easy::finetune()` |
| LoRA | `train` | Yes | Yes | `easy::finetune()` |
| QLoRA (4-bit) | `train --quantization nf4` | Yes | Yes | `easy::finetune()` |
| DoRA | `train --dora` | Yes | Yes | `easy::finetune()` |
| DPO (Direct Preference) | - | - | - | `easy::dpo()` |
| SimPO (Simple Preference) | - | - | - | `easy::simpo()` |
| ORPO (Odds-Ratio Preference) | - | - | - | `easy::orpo()` |
| KTO (Kahneman-Tversky) | - | - | - | `easy::kto()` |
| GRPO (Reasoning) | `grpo` | Yes | Yes | `GrpoTrainer` |
| DAPO (Decoupled GRPO) | `grpo --dapo` | Yes | Yes | `DapoTrainer` |
| Knowledge Distillation | `distill` | Yes | Yes | `Distiller` |
| TAID (Temporally Adaptive) | - | - | - | `TaidDistiller` |
| ANE Training | `train` (auto) | Yes | - | `AneTrainingLoop` |

Additional methods available via the library only: GSPO (GspoTrainer), PPO (PpoTrainer), Online DPO (OnlineDpoTrainer), Diffusion Training (DiffusionTrainer).
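The cancellation contract shared by these methods (callbacks that can request a stop via `should_stop()`) can be sketched generically. The names below are hypothetical illustrations of the pattern, not PMetal's actual Rust trait:

```python
# Generic sketch of callback-driven cancellation between training steps.

class StopOnLoss:
    """Request cancellation once loss drops below a threshold."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.last_loss = None

    def on_step_end(self, step, loss):
        self.last_loss = loss

    def should_stop(self):
        return self.last_loss is not None and self.last_loss < self.threshold

def train_loop(losses, callbacks):
    steps_run = 0
    for step, loss in enumerate(losses):
        for cb in callbacks:
            cb.on_step_end(step, loss)
        steps_run += 1
        if any(cb.should_stop() for cb in callbacks):
            break  # clean cancellation at a step boundary
    return steps_run

# Stops after the third step, when loss first falls below 1.0
print(train_loop([2.5, 1.8, 0.9, 0.7], [StopOnLoss(1.0)]))
```

Checking `should_stop()` only at step boundaries is what makes cancellation "clean": no partially applied optimizer update is ever left behind.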

Key Features

Metal GPU Optimizations

Custom Metal shaders provide significant speedups:

  • FlashAttention: O(n) memory attention with fused softmax, tier-aware block sizes
  • Fused GDN: Gated Delta Network recurrence kernel (ported from FLA Triton) — single-pass state update with SIMD reductions
  • Fused LoRA: Combined forward pass for adapter layers (~2x speedup with lora-metal-fused feature)
  • Fused Cross-Entropy: Unsloth-style chunked loss computation
  • Fused Linear Cross-Entropy: Skips logits materialization entirely
  • Fused RoPE: Rotary position embeddings in-kernel
  • Fused SwiGLU: Fused gate + activation with tier-tuned threadgroups
  • Fused RMSNorm + LoRA: Combined normalization and adapter projection
  • Fused Sampler: JIT-compiled token sampling
  • Fused MLP: Combined gate/up/down projections
  • Async Scheduler: Double/triple-buffered GPU command scheduling
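As a reference point for what the fused RoPE kernel computes (this sketch shows the math only, not the Metal implementation): each even/odd pair of features is rotated by an angle that depends on the token position and the pair's frequency.

```python
import math

# Sketch of a rotary position embedding (RoPE): rotate each feature pair
# (x[i], x[i+1]) by pos * base**(-i/d). A fused kernel does this in-place.

def rope(x, pos, base=10000.0):
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)   # frequency for this feature pair
        c, s = math.cos(theta), math.sin(theta)
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out

# Position 0 is the identity rotation; rotations preserve the vector norm
assert rope([1.0, 0.0, 2.0, 3.0], pos=0) == [1.0, 0.0, 2.0, 3.0]
```

Because each pair is a pure rotation, RoPE encodes position without changing activation magnitudes, which is why it fuses cheaply into attention kernels.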

ANE (Neural Engine) Pipeline

Native ANE integration for power-efficient training and inference:

  • Dynamic Weight Pipeline: 9 MIL kernels compiled once at startup; weights packed alongside activations in IOSurface spatial dimension
  • Hybrid Inference: ANE prefill + CPU decode with KV cache. Power-of-2 sequence bucketing for optimal kernel compilation
  • CPU RMSNorm: RMSNorm computed in f32 on CPU to avoid fp16 overflow on ANE (saturation arithmetic)
  • IOSurface Zero-Copy: fp32 shared memory surfaces for CPU-ANE data transfer with no serialization overhead
  • M1-M5 Compatibility: Per-matrix weight blobs for M1, single-blob for M3+. CPU FFN fallback for 4B+ models

Training Infrastructure

  • Sequence Packing: Efficiently pack multiple sequences into single batches for 2-5x throughput. Enabled by default
  • Gradient Checkpointing: Trade compute for memory on large models with configurable layer grouping
  • Adaptive LR: EMA-based anomaly detection with spike recovery, plateau reduction, and divergence detection
  • Callback System: TrainingCallback trait with lifecycle hooks (on_step_start, on_step_end, should_stop) for metrics logging, progress reporting, and clean cancellation
  • Checkpoint Management: Save and resume training from checkpoints with best-loss rollback
  • Tool/Function Calling: Chat templates with native tool definitions for Qwen, Llama 3.1+, Mistral v3+, and DeepSeek
  • Schedule-Free Optimizer: Memory-efficient optimizer without learning rate schedules
  • Metal Fused Optimizer: GPU-accelerated AdamW parameter updates
  • 8-bit Adam: Memory-efficient optimizer for large models
  • LoRA+: Differentiated learning rates for LoRA A and B matrices
  • NEFTune: Noise-augmented fine-tuning for improved generation quality
  • Distributed Training: mDNS auto-discovery, Ring All-Reduce with gradient compression
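The sequence-packing idea at the top of this list can be sketched with a greedy first-fit bin packer. This is illustrative only; PMetal's packer also has to track attention boundaries so packed sequences do not attend to each other:

```python
# Greedy first-fit packing: group sequence lengths into bins of at most
# max_seq_len tokens, so each step trains on a full context window
# instead of mostly padding.

def pack_sequences(lengths, max_seq_len):
    bins = []  # each bin: list of lengths summing to <= max_seq_len
    for n in sorted(lengths, reverse=True):
        for b in bins:
            if sum(b) + n <= max_seq_len:
                b.append(n)
                break
        else:
            bins.append([n])
    return bins

# Six short sequences pack into three 1024-token batches instead of six padded ones
print(pack_sequences([900, 600, 400, 300, 100, 100], max_seq_len=1024))
```

The throughput win is roughly the ratio of padded batches to packed bins, which is where the quoted 2-5x range comes from on typical short-sequence datasets.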

Dataset Formats

Auto-detected training data formats:

  • ShareGPT: `{"conversations": [{"from": "human", "value": "..."}, ...]}`
  • Alpaca: `{"instruction": "...", "input": "...", "output": "..."}`
  • OpenAI/Messages: `{"messages": [{"role": "user", "content": "..."}, ...]}`
  • Reasoning: `{"problem": "...", "thinking": "...", "solution": "..."}`
  • Simple: `{"text": "..."}`
  • Parquet: Supports both standard text columns and reasoning formats

The pmetal dataset subcommand provides utilities for analysis, download from HuggingFace, and format conversion (Parquet, JSON, JSONL, CSV, ShareGPT, Alpaca).
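A format auto-detector along the lines described above can be sketched by inspecting the keys of the first record. This is hypothetical logic for illustration; PMetal's actual detector lives in `pmetal-data`:

```python
import json

# Map the keys of a JSONL record to one of the known dataset formats.
def detect_format(record):
    keys = set(record)
    if "conversations" in keys:
        return "sharegpt"
    if {"instruction", "output"} <= keys:
        return "alpaca"
    if "messages" in keys:
        return "openai"
    if {"problem", "solution"} <= keys:
        return "reasoning"
    if "text" in keys:
        return "simple"
    raise ValueError(f"unrecognized dataset record: {sorted(keys)}")

line = '{"instruction": "Add", "input": "2+2", "output": "4"}'
print(detect_format(json.loads(line)))  # -> alpaca
```

Keying on field names rather than file extensions is what lets the same JSONL loader accept all five formats transparently.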

Model Operations

  • HuggingFace Hub Search: pmetal search with memory fit estimation and download

  • Model Merging (16 strategies via library, 12 via CLI):

    | CLI | Library | Description |
    | --- | --- | --- |
    | `linear` | LinearMerge | Simple weighted averaging |
    | `slerp` | SlerpMerge | Spherical linear interpolation |
    | `ties` | TiesMerge | Task arithmetic with sparsification and sign consensus |
    | `dare_ties` | DareMerge | Random pruning with rescaling (TIES variant) |
    | `dare_linear` | DareMerge | Random pruning with rescaling (linear variant) |
    | `task_arithmetic` | TaskArithmeticMerge | Task vector arithmetic |
    | `della` | DellaMerge | Adaptive magnitude-based pruning |
    | `della_linear` | DellaMerge | Adaptive magnitude pruning (linear variant) |
    | `breadcrumbs` | BreadcrumbsMerge | Breadcrumbs merge strategy |
    | `model_stock` | ModelStockMerge | Geometric interpolation based on task vector similarity |
    | `nearswap` | NearswapMerge | Near-swap merge strategy |
    | `passthrough` | PassthroughMerge | Layer passthrough composition |
    | - | RamMerge | RAM merge strategy |
    | - | SouperMerge | Souper merge strategy |
    | - | KarcherMerge | Karcher mean on weight manifold |
    | - | MultiSlerpMerge | Multi-model SLERP |
  • GPU-Accelerated Merging: Metal-based merge operations for large models

  • FP8-Aware Merging: Merge with FP8 quantization for memory efficiency

  • Async Merge Pipeline: Double-buffered streaming merge for large models

  • LoRA Fusing: Merge LoRA adapters into base weights (standard and accurate modes)

  • GGUF Quantization (13 format options):

    | Format | Description |
    | --- | --- |
    | `dynamic` | Auto-select per layer |
    | `q8_0` | 8-bit quantization |
    | `q6k` | 6-bit k-quant |
    | `q5km` | 5-bit k-quant (medium) |
    | `q5ks` | 5-bit k-quant (small) |
    | `q4km` | 4-bit k-quant (medium) |
    | `q4ks` | 4-bit k-quant (small) |
    | `q3km` | 3-bit k-quant (medium) |
    | `q3ks` | 3-bit k-quant (small) |
    | `q3kl` | 3-bit k-quant (large) |
    | `q2k` | 2-bit k-quant |
    | `f16` | Float16 |
    | `f32` | Float32 |

    Supports importance matrix (--imatrix) for improved quantization quality.

  • FP8 Runtime Quantization: Convert to FP8 (E4M3) at inference time for ~2x memory reduction
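The `slerp` merge method listed above interpolates along the great circle between two weight vectors rather than along the straight line, which better preserves magnitude structure. A minimal sketch of the math (not PMetal's GPU implementation):

```python
import math

# Spherical linear interpolation between two weight vectors a and b at t in [0, 1].
def slerp(a, b, t):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    dot = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    omega = math.acos(max(-1.0, min(1.0, dot)))  # angle between the vectors
    if omega < 1e-6:                             # nearly parallel: fall back to lerp
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1 - t) * omega) / math.sin(omega)
    wb = math.sin(t * omega) / math.sin(omega)
    return [wa * x + wb * y for x, y in zip(a, b)]

# t=0.5 between orthogonal unit vectors lands on the 45-degree point
print(slerp([1.0, 0.0], [0.0, 1.0], 0.5))
```

This is what `pmetal merge --method slerp --t 0.5` applies tensor-by-tensor, with the `--t` flag playing the role of `t` here.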

Knowledge Distillation

Multiple distillation methods and loss functions:

  • Methods: Online (live teacher inference), Offline (cached logits with compression), Progressive
  • TAID: Temporally Adaptive Interpolated Distillation (ICLR 2025 SOTA) — TaidDistiller
  • Token-Level Losses: KL Divergence, Jensen-Shannon, Soft Cross-Entropy, TVD, Hinge Ranking, Logistic Ranking
  • Hidden State Losses: MSE, Cosine similarity, L1
  • Reasoning-Aware: Rationale distillation for reasoning models
  • Cross-Vocabulary: Distill between models with different tokenizers
  • Offline Logit Caching: Compressed logit storage for memory-efficient offline distillation
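The token-level KL-divergence loss listed above can be sketched for a single position. This shows the standard temperature-scaled formulation, not PMetal's fused implementation:

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# KL(teacher || student) over softened distributions at one token position.
def kl_distill_loss(teacher_logits, student_logits, temperature=2.0):
    p = softmax(teacher_logits, temperature)   # teacher distribution
    q = softmax(student_logits, temperature)   # student distribution
    # T^2 factor keeps gradient magnitudes comparable across temperatures
    return temperature ** 2 * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical logits give zero loss; divergent logits give a positive loss
assert abs(kl_distill_loss([1.0, 2.0], [1.0, 2.0])) < 1e-12
```

The other token-level losses in the list (Jensen-Shannon, soft cross-entropy, TVD) differ only in how `p` and `q` are compared; the softened-distribution setup is the same.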

Configuration

pmetal train Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| `--lora-r` | 16 | LoRA rank |
| `--lora-alpha` | 32.0 | LoRA scaling factor (2x rank) |
| `--batch-size` | 1 | Micro-batch size |
| `--learning-rate` | 2e-4 | Learning rate |
| `--max-seq-len` | 0 | Max sequence length (0 = auto-detect) |
| `--epochs` | 1 | Number of training epochs |
| `--max-grad-norm` | 1.0 | Gradient clipping |
| `--quantization` | none | QLoRA method (`nf4`, `fp4`, `int8`) |
| `--gradient-accumulation-steps` | 4 | Gradient accumulation steps |
| `--no-ane` | false | Disable ANE training |
| `--embedding-lr` | None | Separate LR for embeddings |
| `--no-metal-fused-optimizer` | false | Disable Metal fused optimizer |
| `--lr-schedule` | cosine | Schedule type (`constant`, `linear`, `cosine`, `cosine_with_restarts`, `polynomial`, `wsd`) |
| `--no-gradient-checkpointing` | false | Disable gradient checkpointing (enabled by default) |
| `--gradient-checkpointing-layers` | 4 | Number of layers per checkpoint block |
| `--warmup-steps` | 100 | Learning rate warmup steps |
| `--weight-decay` | 0.01 | AdamW weight decay coefficient |
| `--no-sequence-packing` | false | Disable sequence packing |
| `--config` | - | Path to YAML configuration file |
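The defaults above (`--lr-schedule cosine` with `--warmup-steps 100`) imply a warmup-then-cosine-decay curve, sketched below. This is the standard formulation for illustration; PMetal's implementation may differ in details such as a minimum LR floor:

```python
import math

# Linear warmup for warmup_steps, then cosine decay to zero over the rest.
def lr_at(step, total_steps, base_lr=2e-4, warmup_steps=100):
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps              # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

# Peak LR right after warmup, roughly half of it midway through decay
print(lr_at(99, 1000), lr_at(100, 1000), lr_at(550, 1000))
```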

pmetal infer Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| `--temperature` | Model default | Sampling temperature |
| `--top-k` | Model default | Top-k sampling |
| `--top-p` | Model default | Nucleus sampling |
| `--min-p` | Model default | Min-p dynamic sampling |
| `--max-tokens` | 256 | Maximum generation length |
| `--repetition-penalty` | 1.0 | Repetition penalty |
| `--frequency-penalty` | 0.0 | Frequency penalty |
| `--presence-penalty` | 0.0 | Presence penalty |
| `--chat` | false | Apply chat template |
| `--show-thinking` | false | Show reasoning content |
| `--fp8` | false | Use FP8 weights (~2x memory reduction) |
| `--compiled` | false | Use JIT-compiled sampling |
| `--no-ane` | false | Disable ANE inference |
| `--ane-max-seq-len` | 1024 | Max ANE kernel sequence length |
| `--tools` | - | Tool/function definitions file (OpenAI format) |
| `--system` | - | System message |
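For intuition on `--top-p` (nucleus sampling): the sampler keeps the smallest set of tokens whose cumulative probability reaches `p`, renormalizes, and samples only from that set. A minimal sketch of the filtering step (illustrative, not PMetal's kernel):

```python
# Keep the highest-probability tokens until their cumulative mass reaches p,
# then renormalize the survivors into a new distribution.
def top_p_filter(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized candidate set

# With p=0.85 the two dominant tokens survive and the long tail is cut
print(top_p_filter([0.6, 0.3, 0.07, 0.03], p=0.85))
```

Unlike `--top-k`, the candidate set size adapts to how peaked the distribution is at each step.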

Feature Flags

| Feature | Default | Crate | Description |
| --- | --- | --- | --- |
| `core` | Yes | pmetal-core | Foundation types, configs, traits |
| `gguf` | Yes | pmetal-gguf | GGUF format support |
| `metal` | Yes | pmetal-metal | Metal GPU kernels |
| `hub` | Yes | pmetal-hub | HuggingFace Hub integration |
| `mlx` | Yes | pmetal-mlx | MLX backend |
| `models` | Yes | pmetal-models | LLM architectures |
| `lora` | Yes | pmetal-lora | LoRA/QLoRA |
| `trainer` | Yes | pmetal-trainer | Training loops (pulls in data, distill) |
| `easy` | Yes | - | High-level builders (pulls in trainer, hub, data) |
| `ane` | Yes | - | Apple Neural Engine |
| `data` | Yes* | pmetal-data | Dataset loading (*default via easy) |
| `distill` | Yes* | pmetal-distill | Knowledge distillation (*default via trainer) |
| `lora-metal-fused` | No | - | ~2x LoRA training speedup via fused Metal kernels |
| `merge` | No | pmetal-merge | Model merging strategies |
| `vocoder` | No | pmetal-vocoder | BigVGAN neural vocoder |
| `distributed` | No | pmetal-distributed | Distributed training |
| `mhc` | No | pmetal-mhc | Manifold-Constrained Hyper-Connections |
| `full` | No | - | All features |

Development

Building

```bash
# Release build (default features: ANE + Dashboard)
cargo build --release

# Build without ANE
cargo build --release --no-default-features --features dashboard

# Run tests (single-threaded for Metal compatibility)
just test

# Build GUI
cd crates/pmetal-gui && bun install && bun tauri build
```

Formal Verification

```bash
# cargo-kani proofs for ring all-reduce and topology
just kani-verify
```

License

Licensed under either of MIT or Apache-2.0.

Acknowledgments

  • MLX - Apple's machine learning framework
  • mlx-rs - Rust bindings for MLX
  • Unsloth - Inspiration for fused kernels
  • Tauri - Desktop application framework
