
feat: add ClassifierStrategy to gate ANE vs CPU-tiled classifier for large-vocab models#16

Open
christopherkarani wants to merge 25 commits into main from feat/classifier-strategy

Conversation

@christopherkarani

Summary

  • Add ClassifierStrategy enum that selects between ANE and CPU-tiled classifier paths based on model vocab/dModel size
  • Gate ANE classifier compilation in ensureHybridCompiledLlama — skips compile for large-vocab models (saves 1 ANE compile)
  • Add CPU-tiled greedy decode path using FP16TiledClassifier.tiledMatvecArgmax with CPU RMSNorm
  • Pre-convert lmHead weights to FP16 at build time (no per-token conversion overhead)
  • Complete UInt16 → TokenID migration across codebase
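
The selection rule described above can be sketched as follows. The enum and `select()` mirror the PR's description, but the exact threshold constant and API shape are assumptions, not Espresso's real code:

```swift
// Sketch of the strategy gate described above. The enum and select() mirror
// the PR's description; the exact threshold constant is an assumption.
enum ClassifierStrategy {
    case ane       // lane-packed ANE classifier head
    case cpuTiled  // FP16TiledClassifier fallback on CPU

    /// ~16M fp16 elements (32 MB of ANE SRAM); larger heads tile on CPU.
    static let elementThreshold = 16 * 1024 * 1024

    static func select(vocabSize: Int, dModel: Int) -> ClassifierStrategy {
        vocabSize * dModel <= elementThreshold ? .ane : .cpuTiled
    }
}

// TinyLlama: 32_000 * 2_048 ≈ 65.5M elements, well past the limit,
// which is why it falls back to the CPU-tiled path.
let tinyLlama = ClassifierStrategy.select(vocabSize: 32_000, dModel: 2_048)
```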

Test plan

  • ClassifierStrategyTests — 7 tests (selection logic + boundary + FP16TiledClassifier correctness)
  • Full test suite — 208/210 passed (2 pre-existing MigrationParityTests failures)
  • Hardware verification: TinyLlama (32K vocab) no longer crashes with statusType=0x9

🤖 Generated with Claude Code

christopherkarani and others added 23 commits March 16, 2026 23:02
- Add llama.cpp comparison row to benchmark table with Metal/CPU baselines
- Add platform compatibility matrix covering M1–M4 SoCs
- Add SPM integration section with 5-line first-inference example
- Add release badge
- Clarify iOS/tvOS entitlement situation in platform matrix
- Tighten quick-start: git clone + 3-line TUI launch first

Co-Authored-By: Paperclip <noreply@paperclip.ing>
…plates

- Add .github/workflows/ci.yml: build + test matrix on Xcode 16.2 and 16.3
  (macos-15, SPM cache, unit tests only — no ANE runner required)
- Add CONTRIBUTING.md: dev setup, project structure, coding standards, TDD guide
- Add .github/ISSUE_TEMPLATE/bug_report.md and feature_request.md
- Add .github/PULL_REQUEST_TEMPLATE.md with benchmark impact section
- Update README badges: CI badge alongside existing ANE matrix badge

Co-Authored-By: Paperclip <noreply@paperclip.ing>
- Target list with 10 Swift/ML community leaders (Panaro, Hollance, Maderix, HF team, MLX, Paul Hudson, Sean Allen)
- 5 personalized ready-to-send outreach messages
- Partnership research for Apple Silicon benchmark projects (ANEMLL, more-ane-transformers, neural-engine)
- Conference talk proposals for Deep Dish Swift, try! Swift Tokyo, WWDC Labs, SwiftConf
- Three talk formats: 40-min technical, 20-min intro, 10-min lightning demo

Related: ESP-10, ESP-11, ESP-12

Co-Authored-By: Paperclip <noreply@paperclip.ing>
…rainingLoop

- Examples/SimpleInference: ~20-line GPT-2 generation using RealModelInferenceEngine.build()
- Examples/BenchmarkSuite: Espresso vs CoreML comparison via espresso bench
- Examples/TrainingLoop: fine-tuning wrapper over espresso-train CLI
- Examples/README.md: setup guide with env vars and local-path override

Each example is a standalone Swift package (macOS 15+, Swift 6.2).

Co-Authored-By: Paperclip <noreply@paperclip.ing>
- ModelRegistry: add llama3_2_1b (16L/32H/8KVH/2048d/8192h) and
  llama3_2_3b (28L/24H/8KVH/3072d/8192h). Both use .llama architecture
  (SwiGLU, RMSNorm, GQA). Offline converter handles GQA head expansion
  and RoPE rotation baking into Wq/Wk weights.
- Tests: add llama3_2_1bConfigIsCorrect, llama3_2_3bConfigIsCorrect,
  update registryContainsAllSixModels. All 17 ModelSupportTests pass.
- Benchmark dashboard: benchmarks/results/latest.json (3.41x over
  CoreML, 519 tok/s on M3 Max); scripts/generate-benchmark-dashboard.sh
  regenerates docs/benchmarks.md from JSON;
  .github/workflows/benchmark-dashboard.yml auto-triggers on JSON changes.
- .gitignore: allow benchmarks/results/latest.json, docs/benchmarks.md,
  scripts/generate-benchmark-dashboard.sh.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
- coreml-vs-espresso-benchmarks.html: benchmark comparison with data tables,
  visual bar chart, architecture explanation, and M1/M2/M3/M4 projections
- gpt2-926-tokens-per-second.html: step-by-step guide covering 4.76x win —
  direct ANE access, 3-layer fusion, recurrent arch, zero-copy argmax
- reverse-engineering-apple-neural-engine.html: internals deep-dive covering
  dlopen bridge, MIL ops, IOSurface memory model, and confirmed dead ends
- blog.html: blog index listing all posts with summaries and tags
- docs/index.html: add Blog nav link + "From the Blog" section for SEO
- .gitignore: allow docs HTML files for GitHub Pages

Targets keywords: "CoreML alternative", "apple neural engine framework",
"swift ml inference", "GPT-2 apple silicon", "ANE reverse engineering"

Co-Authored-By: Paperclip <noreply@paperclip.ing>
…NE inference

Adds EspressoGGUF target that bridges EdgeRunner's GGUF loader into
Espresso's weight format. GGUFModelLoader.prepare() loads a GGUF file,
dequantizes via Metal, transposes per architecture convention, wraps in
BLOBFILE format, and writes to a temp directory compatible with
RealModelInferenceEngine.build().

- Bumps platform to macOS 26 (required by EdgeRunner/Metal 4)
- Bumps swift-tools-version to 6.2
- EdgeRunner added as local package dependency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduces ClassifierStrategy with a 16M-element SRAM threshold (32MB fp16).
Models with vocab*dModel <= 16M use the ANE lane-packed classifier head;
larger models (Stories110M, TinyLlama, Qwen3 0.6B) fall back to
FP16TiledClassifier on CPU. Adds 5 Swift Testing tests covering strategy
selection for small/large/huge vocabs and CPU-tiled argmax correctness.

Also adds Espresso dependency to RealModelInferenceTests for FP16TiledClassifier access.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove unused `import Espresso` from ClassifierStrategy.swift
- Add exactThresholdSelectsANE and oneOverThresholdSelectsCPU tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…PU classifier gating

Integrate ClassifierStrategy into the Llama decode path to gate between ANE
lane-packed classifier and CPU-tiled FP16 classifier based on vocab*dim SRAM fit.

Changes:
- Add lmHeadFP16 field to LlamaTopLevelAssets (pre-converted FP16 weights)
- Add classifierStrategy stored property, initialized via ClassifierStrategy.select()
- Gate ANE greedy norm+classifier compile behind classifierStrategy == .ane
- Add CPU-tiled greedy head branch in generateIncrementalHybridLlama decode loop
  (surface read -> CPU RMSNorm -> FP16TiledClassifier.tiledMatvecArgmax)
- Skip xCur readback when using either ANE or CPU-tiled greedy head

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
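
The surface read -> CPU RMSNorm -> tiledMatvecArgmax pipeline above can be illustrated with a plain fp32 sketch. The real FP16TiledClassifier operates on fp16 tiles; these function names and signatures are illustrative only:

```swift
// fp32 sketch of the CPU greedy head: RMSNorm the hidden state, then take
// the argmax over lmHead row dot products. Names are illustrative.
func rmsNorm(_ x: [Float], weight: [Float], eps: Float = 1e-5) -> [Float] {
    // Root-mean-square normalization, then elementwise scale by the norm weight.
    let meanSq = x.reduce(Float(0)) { $0 + $1 * $1 } / Float(x.count)
    let invRms = 1 / (meanSq + eps).squareRoot()
    return zip(x, weight).map { $0.0 * invRms * $0.1 }
}

func greedyArgmax(hidden: [Float], lmHead: [[Float]]) -> Int {
    var best = 0
    var bestScore = -Float.infinity
    for (token, row) in lmHead.enumerated() {
        // One dot product per vocab row; the real classifier tiles this in fp16.
        let score = zip(row, hidden).reduce(Float(0)) { $0 + $1.0 * $1.1 }
        if score > bestScore { bestScore = score; best = token }
    }
    return best
}
```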
Migrate all token-semantic UInt16 to TokenID (UInt32) in
CPURecurrentGenerationModel and FutureTokenProposingLanguageModel protocol.
Includes OfflineExactAcceptanceEvaluator trace types.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…guard

Migrate all ~40 token-semantic UInt16 occurrences to TokenID (UInt32) in
RealModelInferenceEngine: GenerationResult, GenerationStep, encodePrompt,
sampleToken, selectGreedyToken, all hybrid/speculative generation methods,
and testing helpers. Remove the UInt16 vocab capacity guard that was the
original motivation for this migration.

Preserved: lmHeadFP16: [UInt16] (fp16 weight data, not token IDs).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nversion

Change writeEmbeddingBatchFP16 tokenIDs parameter from UnsafePointer<UInt16>
to UnsafePointer<TokenID>. At the C interop boundary, narrow each TokenID
to UInt16 with an exact check, throwing argumentOutOfRange if the token
exceeds the ANE embedding surface capacity.

Preserved: fp16 channel capacity guards (UInt16.max) and all
MemoryLayout<UInt16> references remain unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
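
The exact-narrowing check at the interop boundary might look like this sketch; `TokenID`, `ANEError`, and the helper name are assumptions based on the commit text:

```swift
// Exact-narrowing at the C interop boundary: reject rather than truncate
// tokens beyond the ANE embedding surface capacity. TokenID, ANEError,
// and the helper name are assumptions based on the commit text.
typealias TokenID = UInt32

enum ANEError: Error { case argumentOutOfRange(String) }

func narrowToUInt16(_ token: TokenID) throws -> UInt16 {
    // UInt16(exactly:) returns nil instead of wrapping on overflow.
    guard let narrowed = UInt16(exactly: token) else {
        throw ANEError.argumentOutOfRange(
            "token \(token) exceeds ANE embedding surface capacity")
    }
    return narrowed
}
```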
- Pre-allocate hidden buffer before decode loop (avoids per-token alloc)
- Use &invRms instead of [invRms] for vDSP_vsmul (matches codebase pattern)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Migrate EspressoTrain, EspressoMultiTokenProbe, and EspressoGenerate
(CLI.swift + GPT2DemoSupport.swift) from UInt16 to TokenID for all
token-semantic variables. Rename validateUInt16Token -> validateToken.
Add ANETypes dependency to EspressoGenerate target.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Migrate LocalTextTokenDatasetBuilder, LocalBigramArtifactBuilder,
LocalRealArtifactPipeline, and MultitokenProbeSupport from UInt16
to TokenID for all token-semantic variables. writeUInt16Dataset
still narrows to UInt16 for on-disk format compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… + pre/postRoPE graphs

TokenID migration (UInt16 → UInt32 / TokenID):
- Tests/CPUOpsTests/CPUOpsTests.swift: fix token/target arrays and crossEntropyReference sig
- Tests/EspressoTests/GenerationHarnessHardwareTests.swift: replace all [UInt16] token arrays
  and scalar UInt16(argmax...) with UInt32; preserve FP16 surface UInt16 uses
- Tests/EspressoTests/GenerationStagedHeadHardwareTests.swift: [UInt16] → [UInt32]
- Tests/RealModelInferenceTests/RealModelInferenceTests.swift: tokens: [UInt32]
- Tests/EspressoGenerateTests/EspressoGenerateTests.swift: add ANETypes import for TokenID
- Sources/Espresso/LocalBigramArtifactBuilder.swift: buildRecurrentWeights/buildFutureSidecar
  take [TokenID]; build() still takes [UInt16]; cast TokenID→UInt16 at bridge

New features (pre-existing work, now compiling and tested):
- Sources/CPUOps/RoPE.swift: add applyDecodeStep(position:theta:nKVHeads:) for single-token
  decode with GQA support and configurable theta
- Sources/ModelSupport/MultiModelConfig.swift: add ropeTheta field (default 10000.0)
- Sources/ModelSupport/ModelRegistry.swift: set ropeTheta=500000 on llama3_2_1b/3b
- Sources/ModelSupport/TransformerLayerGraphBuilder.swift: add preRoPEForwardLayer and
  postRoPEForwardLayer graph builders for hybrid CPU-RoPE + ANE attention path
- Tests/CPUOpsTests/RoPEDecodeStepTests.swift: 4 new tests (decode step parity, position
  offset, custom theta, GQA)
- Tests/ModelSupportTests/TransformerLayerGraphBuilderLlamaTests.swift: 5 new tests for
  pre/postRoPE graph structure, output names, and MIL codegen
- Tests/RealModelInferenceTests/HybridLlamaDecodeStepTests.swift: resolveLlamaTopLevelWeightPaths
  tests (struct roundtrip, real paths, missing-file error, ropeTheta values)
- Sources/EspressoGGUF/GGUFBenchmark.swift, RunGGUF.swift, Sources/EspressoGGUFRunner/main.swift:
  GGUF benchmark runner target

All 171 non-hardware tests pass. Build clean.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
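
As background for the applyDecodeStep work above, a single-position RoPE rotation over one head can be sketched like this. The half-split pairing `(i, i + half)` and the default theta are assumptions; per the earlier commit notes, Espresso actually bakes rotation into the Wq/Wk weights:

```swift
import Foundation

// Illustrative single-token RoPE rotation for one attention head. The
// half-split pairing (i, i + half) and default theta are assumptions;
// Espresso bakes rotation into Wq/Wk differently per the commit notes.
func ropeRotate(_ head: [Float], position: Int, theta: Double = 10_000) -> [Float] {
    let half = head.count / 2
    var out = head
    for i in 0..<half {
        // Per-pair frequency decays with channel index; angle scales with position.
        let freq = pow(theta, -2.0 * Double(i) / Double(head.count))
        let angle = Double(position) * freq
        let (c, s) = (Float(cos(angle)), Float(sin(angle)))
        out[i] = head[i] * c - head[i + half] * s
        out[i + half] = head[i] * s + head[i + half] * c
    }
    return out
}
```

At position 0 the rotation is the identity, which makes a handy parity check against a full prefill implementation.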
Finish the remaining TokenID migration across all affected files:

- EspressoTrain/main.swift: add [TokenID] conversion buffers for
  UInt16 mmap tokens; use them in Embedding.lookup, CrossEntropy, and
  Embedding.backward
- LocalRealArtifactPipeline: convert [UInt16] dataset to [TokenID]
  before LocalBigramArtifactBuilder; update promptToken field to TokenID
- LocalBigramArtifactBuilder: migrate build/mostLikelyNextToken/
  mostLikelyFutureToken and fill helpers to [TokenID: TokenID]
- Tests: update all fake model stubs and test token arrays to TokenID
  in GenerationHarnessTests, RealArtifactPipelineTests,
  StreamingTwoTokenTests, EspressoTests, GenerationHarnessHardwareTests,
  GenerationStagedHeadHardwareTests, CPUOpsTests, EspressoGenerateTests,
  and RealModelInferenceTests

swift build --build-tests: Build complete (0 errors)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace remaining UInt32 with TokenID in RealModelInferenceTests and
EspressoGenerateTests. Add ANETypes dependency to CPUOpsTests,
RealModelInferenceTests, and EspressoGenerateTests targets. Sort
imports alphabetically in RealModelInferenceTests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@christopherkarani

Code review

Found 4 issues:

  1. RoPE hook silently drops error context (CLAUDE.md says "ALWAYS handle errors comprehensively... Never silently swallow errors")

The applyRoPEHook closure catches SurfaceIO errors, discards them into a readOK = false boolean, then throws a generic invalidArguments("RoPE hook surface read failed") with no underlying error details. Every other error handler in this file preserves context via "\(error)" interpolation. This is the only call site that swallows it, making hardware debugging significantly harder.

https://github.com/christopherkarani/Espresso/blob/28142453e779b7813d29eb2cfb25b395f2f7bdfe/Sources/RealModelInference/RealModelInferenceEngine.swift#L2338-L2345
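
The fix this item asks for amounts to propagating the caught error instead of collapsing it into a Bool. A hedged sketch, where `ANEError` and the wrapper shape are illustrative stand-ins:

```swift
// Contrast with the flagged pattern: instead of collapsing the failure into
// `readOK = false`, rethrow with the original error interpolated. ANEError
// here is an illustrative stand-in for the real error type.
enum ANEError: Error { case invalidArguments(String) }

func readSurface(_ read: () throws -> [Float]) throws -> [Float] {
    do {
        return try read()
    } catch {
        // Preserve the underlying SurfaceIO error text for hardware debugging.
        throw ANEError.invalidArguments("RoPE hook surface read failed: \(error)")
    }
}
```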

  2. Llama generation never stops on EOS token (comment at line 2527 acknowledges this but does not implement it)

generateIncrementalHybridLlama only stops when effectiveMaxTokens or maxSeq is reached. The comment says "Llama EOS varies by model -- use vocab-1 as a safe sentinel or check config" but no check is implemented. The GPT-2 path has explicit if nextToken == Self.gpt2EOSToken { break } guards. Generation will produce garbage output past the model's natural stop point.

https://github.com/christopherkarani/Espresso/blob/28142453e779b7813d29eb2cfb25b395f2f7bdfe/Sources/RealModelInference/RealModelInferenceEngine.swift#L2525-L2530
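
The missing guard is the Llama analogue of the GPT-2 `if nextToken == Self.gpt2EOSToken { break }` check. A minimal decode-loop sketch, with `Config` and the closure-based model as illustrative stand-ins:

```swift
// Minimal decode-loop sketch showing the EOS guard this review item asks
// for. `Config`, `eosToken`, and the closure-based model are illustrative
// stand-ins, not the Espresso API.
struct Config { var eosToken: UInt32? }

func generate(next: (UInt32) -> UInt32, prompt: UInt32,
              maxTokens: Int, config: Config) -> [UInt32] {
    var tokens = [prompt]
    for _ in 0..<maxTokens {
        let nextToken = next(tokens.last!)
        tokens.append(nextToken)
        // Stop at the model's EOS instead of running to maxTokens.
        if let eos = config.eosToken, nextToken == eos { break }
    }
    return tokens
}
```

Making `eosToken` optional keeps models without a known EOS (or benchmark runs that want fixed-length output) on the old maxTokens-only behavior.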

  3. GPT-2 greedy path does not check classifierStrategy, will attempt ANE classifier for large-vocab GPT-2

The Llama path correctly gates ANE classifier compilation on classifierStrategy == .ane (line 1659) and adds it to the useANEGreedyHead check (line 2299). The GPT-2 path in ensureHybridCompiled (line 1544) unconditionally compiles the ANE classifier, and generateIncrementalHybrid (line 1739) does not check classifierStrategy. GPT-2-124M has vocab=50257, dModel=768 (38.6M elements > 16M limit), so classifierStrategy would be .cpuTiled, yet the GPT-2 path ignores this and attempts to use the ANE classifier anyway.

https://github.com/christopherkarani/Espresso/blob/28142453e779b7813d29eb2cfb25b395f2f7bdfe/Sources/RealModelInference/RealModelInferenceEngine.swift#L1737-L1743

  4. GGUFModelLoader does not read ropeTheta from GGUF metadata, silently defaults to 10,000

GGUFModelLoader.prepare() constructs MultiModelConfig without setting ropeTheta, so it defaults to 10_000.0. The same PR registers Llama 3.2 models with ropeTheta: 500_000.0 in ModelRegistry. A GGUF-loaded Llama 3.2 model will silently use the wrong RoPE base frequency, producing incorrect positional encodings.

https://github.com/christopherkarani/Espresso/blob/28142453e779b7813d29eb2cfb25b395f2f7bdfe/Sources/EspressoGGUF/GGUFModelLoader.swift#L52-L66


Generated with Claude Code


christopherkarani and others added 2 commits March 17, 2026 13:47
… on classifierStrategy

Three code-review fixes:

1. applyRoPEHook: replace boolean-flag error swallowing with direct
   error propagation — SurfaceIO failures now include the original
   error description in the ANEError.invalidArguments message.

2. Llama generation EOS: add optional `eosToken: TokenID?` to
   MultiModelConfig, wire it into the Llama decode loop so generation
   stops on the model's EOS token. Llama 3.2 1B/3B registry entries
   set eosToken=128001.

3. GPT-2 ensureHybridCompiled: wrap greedy norm+classifier compilation
   in `if classifierStrategy == .ane { }` (matching the Llama path),
   and add the same guard to `useANEGreedyHead` so the CPU-tiled
   strategy is never bypassed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reads `{arch}.rope.freq_base` from GGUF metadata and passes it to
MultiModelConfig. Falls back to 10,000.0 if not present. Fixes silent
wrong positional encoding for Llama 3.2 models loaded via GGUF.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
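
The metadata lookup with fallback reduces to a one-liner. The flat dictionary representation here is an assumption (real GGUF metadata values are typed variants), but the `{arch}.rope.freq_base` key convention is standard GGUF:

```swift
// Metadata lookup with fallback, as described in the fix commit. Real GGUF
// metadata values are typed variants, so this flat [String: Double]
// dictionary is an assumption; the key convention is standard GGUF.
func ropeTheta(metadata: [String: Double], architecture: String) -> Double {
    metadata["\(architecture).rope.freq_base"] ?? 10_000.0
}
```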