
feat: add ClassifierStrategy to gate ANE vs CPU-tiled classifier for large-vocab models#16

Open
christopherkarani wants to merge 25 commits into main from feat/classifier-strategy

Conversation

@christopherkarani

Summary

  • Add ClassifierStrategy enum that selects between ANE and CPU-tiled classifier paths based on model vocab/dModel size
  • Gate ANE classifier compilation in ensureHybridCompiledLlama — skips compile for large-vocab models (saves 1 ANE compile)
  • Add CPU-tiled greedy decode path using FP16TiledClassifier.tiledMatvecArgmax with CPU RMSNorm
  • Pre-convert lmHead weights to FP16 at build time (no per-token conversion overhead)
  • Complete UInt16 → TokenID migration across codebase
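
The selection rule described above can be sketched as follows. The enum and `select()` mirror the PR's description, but the exact threshold constant and API shape are assumptions, not Espresso's real code:

```swift
// Sketch of the strategy gate described above. The enum and select() mirror
// the PR's description; the exact threshold constant is an assumption.
enum ClassifierStrategy {
    case ane       // lane-packed ANE classifier head
    case cpuTiled  // FP16TiledClassifier fallback on CPU

    /// ~16M fp16 elements (32 MB of ANE SRAM); larger heads tile on CPU.
    static let elementThreshold = 16 * 1024 * 1024

    static func select(vocabSize: Int, dModel: Int) -> ClassifierStrategy {
        vocabSize * dModel <= elementThreshold ? .ane : .cpuTiled
    }
}

// TinyLlama: 32_000 * 2_048 ≈ 65.5M elements, well past the limit,
// which is why it falls back to the CPU-tiled path.
let tinyLlama = ClassifierStrategy.select(vocabSize: 32_000, dModel: 2_048)
```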

Test plan

  • ClassifierStrategyTests — 7 tests (selection logic + boundary + FP16TiledClassifier correctness)
  • Full test suite — 208/210 passed (2 pre-existing MigrationParityTests failures)
  • Hardware verification: TinyLlama (32K vocab) no longer crashes with statusType=0x9

🤖 Generated with Claude Code

christopherkarani and others added 23 commits March 16, 2026 23:02
- Add llama.cpp comparison row to benchmark table with Metal/CPU baselines
- Add platform compatibility matrix covering M1–M4 SoCs
- Add SPM integration section with 5-line first-inference example
- Add release badge
- Clarify iOS/tvOS entitlement situation in platform matrix
- Tighten quick-start: git clone + 3-line TUI launch first

Co-Authored-By: Paperclip <noreply@paperclip.ing>
…plates

- Add .github/workflows/ci.yml: build + test matrix on Xcode 16.2 and 16.3
  (macos-15, SPM cache, unit tests only — no ANE runner required)
- Add CONTRIBUTING.md: dev setup, project structure, coding standards, TDD guide
- Add .github/ISSUE_TEMPLATE/bug_report.md and feature_request.md
- Add .github/PULL_REQUEST_TEMPLATE.md with benchmark impact section
- Update README badges: CI badge alongside existing ANE matrix badge

Co-Authored-By: Paperclip <noreply@paperclip.ing>
- Target list with 10 Swift/ML community leaders (Panaro, Hollance, Maderix, HF team, MLX, Paul Hudson, Sean Allen)
- 5 personalized ready-to-send outreach messages
- Partnership research for Apple Silicon benchmark projects (ANEMLL, more-ane-transformers, neural-engine)
- Conference talk proposals for Deep Dish Swift, try! Swift Tokyo, WWDC Labs, SwiftConf
- Three talk formats: 40-min technical, 20-min intro, 10-min lightning demo

Related: ESP-10, ESP-11, ESP-12

Co-Authored-By: Paperclip <noreply@paperclip.ing>
…rainingLoop

- Examples/SimpleInference: ~20-line GPT-2 generation using RealModelInferenceEngine.build()
- Examples/BenchmarkSuite: Espresso vs CoreML comparison via espresso bench
- Examples/TrainingLoop: fine-tuning wrapper over espresso-train CLI
- Examples/README.md: setup guide with env vars and local-path override

Each example is a standalone Swift package (macOS 15+, Swift 6.2).

Co-Authored-By: Paperclip <noreply@paperclip.ing>
- ModelRegistry: add llama3_2_1b (16L/32H/8KVH/2048d/8192h) and
  llama3_2_3b (28L/24H/8KVH/3072d/8192h). Both use .llama architecture
  (SwiGLU, RMSNorm, GQA). Offline converter handles GQA head expansion
  and RoPE rotation baking into Wq/Wk weights.
- Tests: add llama3_2_1bConfigIsCorrect, llama3_2_3bConfigIsCorrect,
  update registryContainsAllSixModels. All 17 ModelSupportTests pass.
- Benchmark dashboard: benchmarks/results/latest.json (3.41x over
  CoreML, 519 tok/s on M3 Max); scripts/generate-benchmark-dashboard.sh
  regenerates docs/benchmarks.md from JSON;
  .github/workflows/benchmark-dashboard.yml auto-triggers on JSON changes.
- .gitignore: allow benchmarks/results/latest.json, docs/benchmarks.md,
  scripts/generate-benchmark-dashboard.sh.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
- coreml-vs-espresso-benchmarks.html: benchmark comparison with data tables,
  visual bar chart, architecture explanation, and M1/M2/M3/M4 projections
- gpt2-926-tokens-per-second.html: step-by-step guide covering 4.76x win —
  direct ANE access, 3-layer fusion, recurrent arch, zero-copy argmax
- reverse-engineering-apple-neural-engine.html: internals deep-dive covering
  dlopen bridge, MIL ops, IOSurface memory model, and confirmed dead ends
- blog.html: blog index listing all posts with summaries and tags
- docs/index.html: add Blog nav link + "From the Blog" section for SEO
- .gitignore: allow docs HTML files for GitHub Pages

Targets keywords: "CoreML alternative", "apple neural engine framework",
"swift ml inference", "GPT-2 apple silicon", "ANE reverse engineering"

Co-Authored-By: Paperclip <noreply@paperclip.ing>
…NE inference

Adds EspressoGGUF target that bridges EdgeRunner's GGUF loader into
Espresso's weight format. GGUFModelLoader.prepare() loads a GGUF file,
dequantizes via Metal, transposes per architecture convention, wraps in
BLOBFILE format, and writes to a temp directory compatible with
RealModelInferenceEngine.build().

- Bumps platform to macOS 26 (required by EdgeRunner/Metal 4)
- Bumps swift-tools-version to 6.2
- EdgeRunner added as local package dependency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduces ClassifierStrategy with a 16M-element SRAM threshold (32MB fp16).
Models with vocab*dModel <= 16M use the ANE lane-packed classifier head;
larger models (Stories110M, TinyLlama, Qwen3 0.6B) fall back to
FP16TiledClassifier on CPU. Adds 5 Swift Testing tests covering strategy
selection for small/large/huge vocabs and CPU-tiled argmax correctness.

Also adds Espresso dependency to RealModelInferenceTests for FP16TiledClassifier access.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove unused `import Espresso` from ClassifierStrategy.swift
- Add exactThresholdSelectsANE and oneOverThresholdSelectsCPU tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…PU classifier gating

Integrate ClassifierStrategy into the Llama decode path to gate between ANE
lane-packed classifier and CPU-tiled FP16 classifier based on vocab*dim SRAM fit.

Changes:
- Add lmHeadFP16 field to LlamaTopLevelAssets (pre-converted FP16 weights)
- Add classifierStrategy stored property, initialized via ClassifierStrategy.select()
- Gate ANE greedy norm+classifier compile behind classifierStrategy == .ane
- Add CPU-tiled greedy head branch in generateIncrementalHybridLlama decode loop
  (surface read -> CPU RMSNorm -> FP16TiledClassifier.tiledMatvecArgmax)
- Skip xCur readback when using either ANE or CPU-tiled greedy head

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
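
The surface read -> CPU RMSNorm -> tiledMatvecArgmax pipeline above can be illustrated with a plain fp32 sketch. The real FP16TiledClassifier operates on fp16 tiles; these function names and signatures are illustrative only:

```swift
// fp32 sketch of the CPU greedy head: RMSNorm the hidden state, then take
// the argmax over lmHead row dot products. Names are illustrative.
func rmsNorm(_ x: [Float], weight: [Float], eps: Float = 1e-5) -> [Float] {
    // Root-mean-square normalization, then elementwise scale by the norm weight.
    let meanSq = x.reduce(Float(0)) { $0 + $1 * $1 } / Float(x.count)
    let invRms = 1 / (meanSq + eps).squareRoot()
    return zip(x, weight).map { $0.0 * invRms * $0.1 }
}

func greedyArgmax(hidden: [Float], lmHead: [[Float]]) -> Int {
    var best = 0
    var bestScore = -Float.infinity
    for (token, row) in lmHead.enumerated() {
        // One dot product per vocab row; the real classifier tiles this in fp16.
        let score = zip(row, hidden).reduce(Float(0)) { $0 + $1.0 * $1.1 }
        if score > bestScore { bestScore = score; best = token }
    }
    return best
}
```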
Migrate all token-semantic UInt16 to TokenID (UInt32) in
CPURecurrentGenerationModel and FutureTokenProposingLanguageModel protocol.
Includes OfflineExactAcceptanceEvaluator trace types.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…guard

Migrate all ~40 token-semantic UInt16 occurrences to TokenID (UInt32) in
RealModelInferenceEngine: GenerationResult, GenerationStep, encodePrompt,
sampleToken, selectGreedyToken, all hybrid/speculative generation methods,
and testing helpers. Remove the UInt16 vocab capacity guard that was the
original motivation for this migration.

Preserved: lmHeadFP16: [UInt16] (fp16 weight data, not token IDs).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nversion

Change writeEmbeddingBatchFP16 tokenIDs parameter from UnsafePointer<UInt16>
to UnsafePointer<TokenID>. At the C interop boundary, narrow each TokenID
to UInt16 with an exact check, throwing argumentOutOfRange if the token
exceeds the ANE embedding surface capacity.

Preserved: fp16 channel capacity guards (UInt16.max) and all
MemoryLayout<UInt16> references remain unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
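
The exact-narrowing check at the interop boundary might look like this sketch; `TokenID`, `ANEError`, and the helper name are assumptions based on the commit text:

```swift
// Exact-narrowing at the C interop boundary: reject rather than truncate
// tokens beyond the ANE embedding surface capacity. TokenID, ANEError,
// and the helper name are assumptions based on the commit text.
typealias TokenID = UInt32

enum ANEError: Error { case argumentOutOfRange(String) }

func narrowToUInt16(_ token: TokenID) throws -> UInt16 {
    // UInt16(exactly:) returns nil instead of wrapping on overflow.
    guard let narrowed = UInt16(exactly: token) else {
        throw ANEError.argumentOutOfRange(
            "token \(token) exceeds ANE embedding surface capacity")
    }
    return narrowed
}
```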
- Pre-allocate hidden buffer before decode loop (avoids per-token alloc)
- Use &invRms instead of [invRms] for vDSP_vsmul (matches codebase pattern)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Migrate EspressoTrain, EspressoMultiTokenProbe, and EspressoGenerate
(CLI.swift + GPT2DemoSupport.swift) from UInt16 to TokenID for all
token-semantic variables. Rename validateUInt16Token -> validateToken.
Add ANETypes dependency to EspressoGenerate target.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Migrate LocalTextTokenDatasetBuilder, LocalBigramArtifactBuilder,
LocalRealArtifactPipeline, and MultitokenProbeSupport from UInt16
to TokenID for all token-semantic variables. writeUInt16Dataset
still narrows to UInt16 for on-disk format compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… + pre/postRoPE graphs

TokenID migration (UInt16 → UInt32 / TokenID):
- Tests/CPUOpsTests/CPUOpsTests.swift: fix token/target arrays and crossEntropyReference sig
- Tests/EspressoTests/GenerationHarnessHardwareTests.swift: replace all [UInt16] token arrays
  and scalar UInt16(argmax...) with UInt32; preserve FP16 surface UInt16 uses
- Tests/EspressoTests/GenerationStagedHeadHardwareTests.swift: [UInt16] → [UInt32]
- Tests/RealModelInferenceTests/RealModelInferenceTests.swift: tokens: [UInt32]
- Tests/EspressoGenerateTests/EspressoGenerateTests.swift: add ANETypes import for TokenID
- Sources/Espresso/LocalBigramArtifactBuilder.swift: buildRecurrentWeights/buildFutureSidecar
  take [TokenID]; build() still takes [UInt16]; cast TokenID→UInt16 at bridge

New features (pre-existing work, now compiling and tested):
- Sources/CPUOps/RoPE.swift: add applyDecodeStep(position:theta:nKVHeads:) for single-token
  decode with GQA support and configurable theta
- Sources/ModelSupport/MultiModelConfig.swift: add ropeTheta field (default 10000.0)
- Sources/ModelSupport/ModelRegistry.swift: set ropeTheta=500000 on llama3_2_1b/3b
- Sources/ModelSupport/TransformerLayerGraphBuilder.swift: add preRoPEForwardLayer and
  postRoPEForwardLayer graph builders for hybrid CPU-RoPE + ANE attention path
- Tests/CPUOpsTests/RoPEDecodeStepTests.swift: 4 new tests (decode step parity, position
  offset, custom theta, GQA)
- Tests/ModelSupportTests/TransformerLayerGraphBuilderLlamaTests.swift: 5 new tests for
  pre/postRoPE graph structure, output names, and MIL codegen
- Tests/RealModelInferenceTests/HybridLlamaDecodeStepTests.swift: resolveLlamaTopLevelWeightPaths
  tests (struct roundtrip, real paths, missing-file error, ropeTheta values)
- Sources/EspressoGGUF/GGUFBenchmark.swift, RunGGUF.swift, Sources/EspressoGGUFRunner/main.swift:
  GGUF benchmark runner target

All 171 non-hardware tests pass. Build clean.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
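
As background for the applyDecodeStep work above, a single-position RoPE rotation over one head can be sketched like this. The half-split pairing `(i, i + half)` and the default theta are assumptions; per the earlier commit notes, Espresso actually bakes rotation into the Wq/Wk weights:

```swift
import Foundation

// Illustrative single-token RoPE rotation for one attention head. The
// half-split pairing (i, i + half) and default theta are assumptions;
// Espresso bakes rotation into Wq/Wk differently per the commit notes.
func ropeRotate(_ head: [Float], position: Int, theta: Double = 10_000) -> [Float] {
    let half = head.count / 2
    var out = head
    for i in 0..<half {
        // Per-pair frequency decays with channel index; angle scales with position.
        let freq = pow(theta, -2.0 * Double(i) / Double(head.count))
        let angle = Double(position) * freq
        let (c, s) = (Float(cos(angle)), Float(sin(angle)))
        out[i] = head[i] * c - head[i + half] * s
        out[i + half] = head[i] * s + head[i + half] * c
    }
    return out
}
```

At position 0 the rotation is the identity, which makes a handy parity check against a full prefill implementation.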
Finish the remaining TokenID migration across all affected files:

- EspressoTrain/main.swift: add [TokenID] conversion buffers for
  UInt16 mmap tokens; use them in Embedding.lookup, CrossEntropy, and
  Embedding.backward
- LocalRealArtifactPipeline: convert [UInt16] dataset to [TokenID]
  before LocalBigramArtifactBuilder; update promptToken field to TokenID
- LocalBigramArtifactBuilder: migrate build/mostLikelyNextToken/
  mostLikelyFutureToken and fill helpers to [TokenID: TokenID]
- Tests: update all fake model stubs and test token arrays to TokenID
  in GenerationHarnessTests, RealArtifactPipelineTests,
  StreamingTwoTokenTests, EspressoTests, GenerationHarnessHardwareTests,
  GenerationStagedHeadHardwareTests, CPUOpsTests, EspressoGenerateTests,
  and RealModelInferenceTests

swift build --build-tests: Build complete (0 errors)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace remaining UInt32 with TokenID in RealModelInferenceTests and
EspressoGenerateTests. Add ANETypes dependency to CPUOpsTests,
RealModelInferenceTests, and EspressoGenerateTests targets. Sort
imports alphabetically in RealModelInferenceTests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@christopherkarani

Code review

Found 4 issues:

  1. RoPE hook silently drops error context (CLAUDE.md says "ALWAYS handle errors comprehensively... Never silently swallow errors")

The applyRoPEHook closure catches SurfaceIO errors, discards them into a readOK = false boolean, then throws a generic invalidArguments("RoPE hook surface read failed") with no underlying error details. Every other error handler in this file preserves context via "\(error)" interpolation. This is the only call site that swallows it, making hardware debugging significantly harder.

https://github.com/christopherkarani/Espresso/blob/28142453e779b7813d29eb2cfb25b395f2f7bdfe/Sources/RealModelInference/RealModelInferenceEngine.swift#L2338-L2345
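
The fix this item asks for amounts to propagating the caught error instead of collapsing it into a Bool. A hedged sketch, where `ANEError` and the wrapper shape are illustrative stand-ins:

```swift
// Contrast with the flagged pattern: instead of collapsing the failure into
// `readOK = false`, rethrow with the original error interpolated. ANEError
// here is an illustrative stand-in for the real error type.
enum ANEError: Error { case invalidArguments(String) }

func readSurface(_ read: () throws -> [Float]) throws -> [Float] {
    do {
        return try read()
    } catch {
        // Preserve the underlying SurfaceIO error text for hardware debugging.
        throw ANEError.invalidArguments("RoPE hook surface read failed: \(error)")
    }
}
```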

  2. Llama generation never stops on EOS token (comment at line 2527 acknowledges this but does not implement it)

generateIncrementalHybridLlama only stops when effectiveMaxTokens or maxSeq is reached. The comment says "Llama EOS varies by model -- use vocab-1 as a safe sentinel or check config" but no check is implemented. The GPT-2 path has explicit if nextToken == Self.gpt2EOSToken { break } guards. Generation will produce garbage output past the model's natural stop point.

https://github.com/christopherkarani/Espresso/blob/28142453e779b7813d29eb2cfb25b395f2f7bdfe/Sources/RealModelInference/RealModelInferenceEngine.swift#L2525-L2530
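
The missing guard is the Llama analogue of the GPT-2 `if nextToken == Self.gpt2EOSToken { break }` check. A minimal decode-loop sketch, with `Config` and the closure-based model as illustrative stand-ins:

```swift
// Minimal decode-loop sketch showing the EOS guard this review item asks
// for. `Config`, `eosToken`, and the closure-based model are illustrative
// stand-ins, not the Espresso API.
struct Config { var eosToken: UInt32? }

func generate(next: (UInt32) -> UInt32, prompt: UInt32,
              maxTokens: Int, config: Config) -> [UInt32] {
    var tokens = [prompt]
    for _ in 0..<maxTokens {
        let nextToken = next(tokens.last!)
        tokens.append(nextToken)
        // Stop at the model's EOS instead of running to maxTokens.
        if let eos = config.eosToken, nextToken == eos { break }
    }
    return tokens
}
```

Making `eosToken` optional keeps models without a known EOS (or benchmark runs that want fixed-length output) on the old maxTokens-only behavior.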

  3. GPT-2 greedy path does not check classifierStrategy, will attempt ANE classifier for large-vocab GPT-2

The Llama path correctly gates ANE classifier compilation on classifierStrategy == .ane (line 1659) and adds it to the useANEGreedyHead check (line 2299). The GPT-2 path in ensureHybridCompiled (line 1544) unconditionally compiles the ANE classifier, and generateIncrementalHybrid (line 1739) does not check classifierStrategy. GPT-2-124M has vocab=50257, dModel=768 (38.6M elements > 16M limit), so classifierStrategy would be .cpuTiled, yet the GPT-2 path ignores this and attempts to use the ANE classifier anyway.

https://github.com/christopherkarani/Espresso/blob/28142453e779b7813d29eb2cfb25b395f2f7bdfe/Sources/RealModelInference/RealModelInferenceEngine.swift#L1737-L1743

  4. GGUFModelLoader does not read ropeTheta from GGUF metadata, silently defaults to 10,000

GGUFModelLoader.prepare() constructs MultiModelConfig without setting ropeTheta, so it defaults to 10_000.0. The same PR registers Llama 3.2 models with ropeTheta: 500_000.0 in ModelRegistry. A GGUF-loaded Llama 3.2 model will silently use the wrong RoPE base frequency, producing incorrect positional encodings.

https://github.com/christopherkarani/Espresso/blob/28142453e779b7813d29eb2cfb25b395f2f7bdfe/Sources/EspressoGGUF/GGUFModelLoader.swift#L52-L66


Generated with Claude Code


christopherkarani and others added 2 commits March 17, 2026 13:47
… on classifierStrategy

Three code-review fixes:

1. applyRoPEHook: replace boolean-flag error swallowing with direct
   error propagation — SurfaceIO failures now include the original
   error description in the ANEError.invalidArguments message.

2. Llama generation EOS: add optional `eosToken: TokenID?` to
   MultiModelConfig, wire it into the Llama decode loop so generation
   stops on the model's EOS token. Llama 3.2 1B/3B registry entries
   set eosToken=128001.

3. GPT-2 ensureHybridCompiled: wrap greedy norm+classifier compilation
   in `if classifierStrategy == .ane { }` (matching the Llama path),
   and add the same guard to `useANEGreedyHead` so the CPU-tiled
   strategy is never bypassed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reads `{arch}.rope.freq_base` from GGUF metadata and passes it to
MultiModelConfig. Falls back to 10,000.0 if not present. Fixes silent
wrong positional encoding for Llama 3.2 models loaded via GGUF.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
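
The metadata lookup with fallback reduces to a one-liner. The flat dictionary representation here is an assumption (real GGUF metadata values are typed variants), but the `{arch}.rope.freq_base` key convention is standard GGUF:

```swift
// Metadata lookup with fallback, as described in the fix commit. Real GGUF
// metadata values are typed variants, so this flat [String: Double]
// dictionary is an assumption; the key convention is standard GGUF.
func ropeTheta(metadata: [String: Double], architecture: String) -> Double {
    metadata["\(architecture).rope.freq_base"] ?? 10_000.0
}
```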