feat: add ClassifierStrategy to gate ANE vs CPU-tiled classifier for large-vocab models #16

christopherkarani wants to merge 25 commits into main
Conversation
- Add llama.cpp comparison row to benchmark table with Metal/CPU baselines
- Add platform compatibility matrix covering M1–M4 SoCs
- Add SPM integration section with 5-line first-inference example
- Add release badge
- Clarify iOS/tvOS entitlement situation in platform matrix
- Tighten quick start: git clone + 3-line TUI launch first

Co-Authored-By: Paperclip <noreply@paperclip.ing>
…plates

- Add .github/workflows/ci.yml: build + test matrix on Xcode 16.2 and 16.3 (macos-15, SPM cache, unit tests only; no ANE runner required)
- Add CONTRIBUTING.md: dev setup, project structure, coding standards, TDD guide
- Add .github/ISSUE_TEMPLATE/bug_report.md and feature_request.md
- Add .github/PULL_REQUEST_TEMPLATE.md with benchmark impact section
- Update README badges: CI badge alongside existing ANE matrix badge

Co-Authored-By: Paperclip <noreply@paperclip.ing>
- Target list with 10 Swift/ML community leaders (Panaro, Hollance, Maderix, HF team, MLX, Paul Hudson, Sean Allen)
- 5 personalized ready-to-send outreach messages
- Partnership research for Apple Silicon benchmark projects (ANEMLL, more-ane-transformers, neural-engine)
- Conference talk proposals for Deep Dish Swift, try! Swift Tokyo, WWDC Labs, SwiftConf
- Three talk formats: 40-min technical, 20-min intro, 10-min lightning demo

Related: ESP-10, ESP-11, ESP-12

Co-Authored-By: Paperclip <noreply@paperclip.ing>
…rainingLoop

- Examples/SimpleInference: ~20-line GPT-2 generation using RealModelInferenceEngine.build()
- Examples/BenchmarkSuite: Espresso vs CoreML comparison via espresso bench
- Examples/TrainingLoop: fine-tuning wrapper over the espresso-train CLI
- Examples/README.md: setup guide with env vars and local-path override

Each example is a standalone Swift package (macOS 15+, Swift 6.2).

Co-Authored-By: Paperclip <noreply@paperclip.ing>
- ModelRegistry: add llama3_2_1b (16L/32H/8KVH/2048d/8192h) and llama3_2_3b (28L/24H/8KVH/3072d/8192h). Both use the .llama architecture (SwiGLU, RMSNorm, GQA). The offline converter handles GQA head expansion and bakes the RoPE rotation into the Wq/Wk weights.
- Tests: add llama3_2_1bConfigIsCorrect and llama3_2_3bConfigIsCorrect, update registryContainsAllSixModels. All 17 ModelSupportTests pass.
- Benchmark dashboard: benchmarks/results/latest.json (3.41x over CoreML, 519 tok/s on M3 Max); scripts/generate-benchmark-dashboard.sh regenerates docs/benchmarks.md from the JSON; .github/workflows/benchmark-dashboard.yml auto-triggers on JSON changes.
- .gitignore: allow benchmarks/results/latest.json, docs/benchmarks.md, scripts/generate-benchmark-dashboard.sh.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
- coreml-vs-espresso-benchmarks.html: benchmark comparison with data tables, visual bar chart, architecture explanation, and M1/M2/M3/M4 projections
- gpt2-926-tokens-per-second.html: step-by-step guide covering the 4.76x win — direct ANE access, 3-layer fusion, recurrent arch, zero-copy argmax
- reverse-engineering-apple-neural-engine.html: internals deep dive covering the dlopen bridge, MIL ops, the IOSurface memory model, and confirmed dead ends
- blog.html: blog index listing all posts with summaries and tags
- docs/index.html: add Blog nav link + "From the Blog" section for SEO
- .gitignore: allow docs HTML files for GitHub Pages

Targets keywords: "CoreML alternative", "apple neural engine framework", "swift ml inference", "GPT-2 apple silicon", "ANE reverse engineering"

Co-Authored-By: Paperclip <noreply@paperclip.ing>
…NE inference

Adds an EspressoGGUF target that bridges EdgeRunner's GGUF loader into Espresso's weight format. GGUFModelLoader.prepare() loads a GGUF file, dequantizes via Metal, transposes per architecture convention, wraps the result in BLOBFILE format, and writes it to a temp directory compatible with RealModelInferenceEngine.build().

- Bumps platform to macOS 26 (required by EdgeRunner/Metal 4)
- Bumps swift-tools-version to 6.2
- Adds EdgeRunner as a local package dependency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduces ClassifierStrategy with a 16M-element SRAM threshold (32 MB fp16). Models with vocab * dModel <= 16M use the ANE lane-packed classifier head; larger models (Stories110M, TinyLlama, Qwen3 0.6B) fall back to FP16TiledClassifier on CPU.

Adds 5 Swift Testing tests covering strategy selection for small/large/huge vocabs and CPU-tiled argmax correctness. Also adds the Espresso dependency to RealModelInferenceTests for FP16TiledClassifier access.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
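The selection rule described above can be sketched in a few lines. This is a hypothetical reconstruction from the commit message, not the PR's actual source; the real ClassifierStrategy may differ in naming and shape.

```swift
// Hypothetical sketch of ClassifierStrategy, reconstructed from the commit
// message: models whose classifier weights fit in ANE SRAM stay on the ANE,
// larger ones fall back to the CPU-tiled fp16 path.
enum ClassifierStrategy: Equatable {
    case ane       // lane-packed classifier head on the Neural Engine
    case cpuTiled  // FP16TiledClassifier fallback on CPU

    /// SRAM budget in fp16 elements: 16M elements * 2 bytes = 32 MB.
    static let sramElementThreshold = 16 * 1024 * 1024

    static func select(vocabSize: Int, dModel: Int) -> ClassifierStrategy {
        vocabSize * dModel <= sramElementThreshold ? .ane : .cpuTiled
    }
}
```

Note that the threshold is inclusive, which matches the exactThresholdSelectsANE / oneOverThresholdSelectsCPU boundary tests added later in this PR.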
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove unused `import Espresso` from ClassifierStrategy.swift
- Add exactThresholdSelectsANE and oneOverThresholdSelectsCPU tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…PU classifier gating

Integrate ClassifierStrategy into the Llama decode path to gate between the ANE lane-packed classifier and the CPU-tiled FP16 classifier based on vocab * dim SRAM fit.

Changes:
- Add lmHeadFP16 field to LlamaTopLevelAssets (pre-converted FP16 weights)
- Add classifierStrategy stored property, initialized via ClassifierStrategy.select()
- Gate the ANE greedy norm+classifier compile behind classifierStrategy == .ane
- Add a CPU-tiled greedy head branch in the generateIncrementalHybridLlama decode loop (surface read -> CPU RMSNorm -> FP16TiledClassifier.tiledMatvecArgmax)
- Skip the xCur readback when using either the ANE or the CPU-tiled greedy head

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
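The tiledMatvecArgmax step in the CPU-tiled branch can be illustrated with a plain-Float reference. This is not the PR's FP16TiledClassifier, which operates on fp16 weights with cache-tuned tile sizes; the sketch only shows the control flow (score vocab rows one tile at a time, track the running argmax).

```swift
// Illustrative CPU reference for a tiled matvec + argmax over vocab rows.
// `rows` holds one classifier weight row per vocab token; the returned index
// is the greedy next-token ID.
func tiledMatvecArgmax(hidden: [Float], rows: [[Float]], tile: Int = 4) -> Int {
    var bestToken = 0
    var bestScore = -Float.infinity
    var start = 0
    while start < rows.count {
        let end = min(start + tile, rows.count)
        for token in start..<end {                 // one tile of vocab rows
            var score: Float = 0
            for i in 0..<hidden.count { score += hidden[i] * rows[token][i] }
            if score > bestScore { (bestScore, bestToken) = (score, token) }
        }
        start = end
    }
    return bestToken
}
```

Because only the argmax is needed, the full logits vector never has to be materialized, which is what lets this branch skip the xCur readback mentioned above.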
Migrate all token-semantic UInt16 to TokenID (UInt32) in CPURecurrentGenerationModel and the FutureTokenProposingLanguageModel protocol. Includes the OfflineExactAcceptanceEvaluator trace types.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…guard

Migrate all ~40 token-semantic UInt16 occurrences to TokenID (UInt32) in RealModelInferenceEngine: GenerationResult, GenerationStep, encodePrompt, sampleToken, selectGreedyToken, all hybrid/speculative generation methods, and testing helpers. Remove the UInt16 vocab capacity guard that was the original motivation for this migration.

Preserved: lmHeadFP16: [UInt16] (fp16 weight data, not token IDs).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nversion

Change the writeEmbeddingBatchFP16 tokenIDs parameter from UnsafePointer<UInt16> to UnsafePointer<TokenID>. At the C interop boundary, narrow each TokenID to UInt16 with an exact check, throwing argumentOutOfRange if the token exceeds the ANE embedding surface capacity.

Preserved: the fp16 channel capacity guards (UInt16.max) and all MemoryLayout<UInt16> references remain unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
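The exact-narrowing check at the boundary amounts to the following. ANEError.argumentOutOfRange stands in for the project's real error case here; its payload and the helper name are assumptions made for illustration.

```swift
// Hedged sketch of narrowing a 32-bit TokenID to the UInt16 the ANE
// embedding surface expects, throwing instead of silently truncating.
typealias TokenID = UInt32

enum ANEError: Error {
    case argumentOutOfRange(String)  // stand-in for the project's error type
}

func narrowToSurfaceToken(_ token: TokenID) throws -> UInt16 {
    // UInt16(exactly:) returns nil rather than wrapping on overflow.
    guard let narrowed = UInt16(exactly: token) else {
        throw ANEError.argumentOutOfRange(
            "token \(token) exceeds the ANE embedding surface capacity (\(UInt16.max))")
    }
    return narrowed
}
```

Using `UInt16(exactly:)` rather than the truncating initializer is what turns an out-of-range token into a diagnosable error instead of a silently wrong embedding lookup.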
- Pre-allocate the hidden buffer before the decode loop (avoids per-token allocation)
- Use &invRms instead of [invRms] for vDSP_vsmul (matches the codebase pattern)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Migrate EspressoTrain, EspressoMultiTokenProbe, and EspressoGenerate (CLI.swift + GPT2DemoSupport.swift) from UInt16 to TokenID for all token-semantic variables. Rename validateUInt16Token -> validateToken. Add the ANETypes dependency to the EspressoGenerate target.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Migrate LocalTextTokenDatasetBuilder, LocalBigramArtifactBuilder, LocalRealArtifactPipeline, and MultitokenProbeSupport from UInt16 to TokenID for all token-semantic variables. writeUInt16Dataset still narrows to UInt16 for on-disk format compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… + pre/postRoPE graphs

TokenID migration (UInt16 → UInt32 / TokenID):
- Tests/CPUOpsTests/CPUOpsTests.swift: fix token/target arrays and the crossEntropyReference signature
- Tests/EspressoTests/GenerationHarnessHardwareTests.swift: replace all [UInt16] token arrays and scalar UInt16(argmax...) with UInt32; preserve FP16 surface UInt16 uses
- Tests/EspressoTests/GenerationStagedHeadHardwareTests.swift: [UInt16] → [UInt32]
- Tests/RealModelInferenceTests/RealModelInferenceTests.swift: tokens: [UInt32]
- Tests/EspressoGenerateTests/EspressoGenerateTests.swift: add ANETypes import for TokenID
- Sources/Espresso/LocalBigramArtifactBuilder.swift: buildRecurrentWeights/buildFutureSidecar take [TokenID]; build() still takes [UInt16]; cast TokenID → UInt16 at the bridge

New features (pre-existing work, now compiling and tested):
- Sources/CPUOps/RoPE.swift: add applyDecodeStep(position:theta:nKVHeads:) for single-token decode with GQA support and configurable theta
- Sources/ModelSupport/MultiModelConfig.swift: add ropeTheta field (default 10000.0)
- Sources/ModelSupport/ModelRegistry.swift: set ropeTheta=500000 on llama3_2_1b/3b
- Sources/ModelSupport/TransformerLayerGraphBuilder.swift: add preRoPEForwardLayer and postRoPEForwardLayer graph builders for the hybrid CPU-RoPE + ANE attention path
- Tests/CPUOpsTests/RoPEDecodeStepTests.swift: 4 new tests (decode step parity, position offset, custom theta, GQA)
- Tests/ModelSupportTests/TransformerLayerGraphBuilderLlamaTests.swift: 5 new tests for pre/postRoPE graph structure, output names, and MIL codegen
- Tests/RealModelInferenceTests/HybridLlamaDecodeStepTests.swift: resolveLlamaTopLevelWeightPaths tests (struct roundtrip, real paths, missing-file error, ropeTheta values)
- Sources/EspressoGGUF/GGUFBenchmark.swift, RunGGUF.swift, Sources/EspressoGGUFRunner/main.swift: GGUF benchmark runner target

All 171 non-hardware tests pass. Build clean.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
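The single-token RoPE decode step with configurable theta can be illustrated as follows. This is not the project's applyDecodeStep(position:theta:nKVHeads:); the consecutive-pair rotation layout and the signature here are assumptions, shown only to make the "configurable theta" change concrete.

```swift
import Foundation

// Illustrative single-position RoPE rotation. Each consecutive pair
// (x[2i], x[2i+1]) is rotated by a position-dependent angle whose
// per-pair frequency is derived from `theta`.
func ropeRotate(_ x: inout [Float], position: Int, theta: Double = 10_000) {
    let dim = x.count
    for i in 0..<(dim / 2) {
        let freq = pow(theta, Double(-2 * i) / Double(dim))
        let angle = Double(position) * freq
        let (c, s) = (Float(cos(angle)), Float(sin(angle)))
        let (a, b) = (x[2 * i], x[2 * i + 1])
        x[2 * i]     = a * c - b * s   // standard 2D rotation of the pair
        x[2 * i + 1] = a * s + b * c
    }
}
```

Raising the default theta of 10,000 to 500,000 (as the registry does for llama3_2_1b/3b) lowers every pair's rotation frequency, which is how Llama 3.2 stretches its usable context.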
Finish the remaining TokenID migration across all affected files:
- EspressoTrain/main.swift: add [TokenID] conversion buffers for UInt16 mmap tokens; use them in Embedding.lookup, CrossEntropy, and Embedding.backward
- LocalRealArtifactPipeline: convert the [UInt16] dataset to [TokenID] before LocalBigramArtifactBuilder; update the promptToken field to TokenID
- LocalBigramArtifactBuilder: migrate build/mostLikelyNextToken/mostLikelyFutureToken and fill helpers to [TokenID: TokenID]
- Tests: update all fake model stubs and test token arrays to TokenID in GenerationHarnessTests, RealArtifactPipelineTests, StreamingTwoTokenTests, EspressoTests, GenerationHarnessHardwareTests, GenerationStagedHeadHardwareTests, CPUOpsTests, EspressoGenerateTests, and RealModelInferenceTests

swift build --build-tests: Build complete (0 errors)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the remaining UInt32 occurrences with TokenID in RealModelInferenceTests and EspressoGenerateTests. Add the ANETypes dependency to the CPUOpsTests, RealModelInferenceTests, and EspressoGenerateTests targets. Sort imports alphabetically in RealModelInferenceTests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Code review

Found 4 issues:
The Llama path correctly gates ANE classifier compilation on
… on classifierStrategy
Three code-review fixes:
1. applyRoPEHook: replace boolean-flag error swallowing with direct
error propagation — SurfaceIO failures now include the original
error description in the ANEError.invalidArguments message.
2. Llama generation EOS: add optional `eosToken: TokenID?` to
MultiModelConfig, wire it into the Llama decode loop so generation
stops on the model's EOS token. Llama 3.2 1B/3B registry entries
set eosToken=128001.
3. GPT-2 ensureHybridCompiled: wrap greedy norm+classifier compilation
in `if classifierStrategy == .ane { }` (matching the Llama path),
and add the same guard to `useANEGreedyHead` so the CPU-tiled
strategy is never bypassed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
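Fix 2 above amounts to an optional stop condition in the decode loop. The MultiModelConfig here is a stand-in struct carrying only the field this fix adds, and the generate shape is illustrative, not the engine's real API.

```swift
// Minimal sketch of EOS-gated generation: a nil eosToken means the loop
// runs until maxTokens, matching models with no registered EOS.
typealias TokenID = UInt32

struct MultiModelConfig {
    var eosToken: TokenID?   // e.g. 128001 for the Llama 3.2 registry entries
}

func generate(maxTokens: Int, config: MultiModelConfig,
              decodeStep: () -> TokenID) -> [TokenID] {
    var tokens: [TokenID] = []
    for _ in 0..<maxTokens {
        let token = decodeStep()
        if let eos = config.eosToken, token == eos { break }  // stop on EOS
        tokens.append(token)
    }
    return tokens
}
```

The EOS token itself is excluded from the returned sequence in this sketch; whether the real decode loop emits it before stopping is not specified by the commit message.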
Reads `{arch}.rope.freq_base` from GGUF metadata and passes it to
MultiModelConfig, falling back to 10,000.0 when the key is absent. Fixes
silently wrong positional encoding for Llama 3.2 models loaded via GGUF.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
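The lookup-with-fallback described above is small enough to sketch directly. The `{arch}.rope.freq_base` key format follows the commit text; the `[String: Any]` dictionary and the function name are assumptions about the loader's parsed metadata form.

```swift
// Hedged sketch of resolving RoPE theta from parsed GGUF metadata.
// Llama 3.2 GGUF files carry llama.rope.freq_base = 500000; older files
// without the key fall back to the classic RoPE base of 10,000.
func ropeTheta(fromGGUFMetadata metadata: [String: Any], arch: String) -> Double {
    if let freqBase = metadata["\(arch).rope.freq_base"] as? Double {
        return freqBase
    }
    return 10_000.0   // default when the key is absent
}
```

Without this, a Llama 3.2 model loaded via GGUF would silently use theta = 10,000 and produce plausible-looking but wrong positional encodings, which is why the commit calls the bug "silent".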
Summary

- `ClassifierStrategy` enum that selects between ANE and CPU-tiled classifier paths based on model vocab/dModel size
- `ensureHybridCompiledLlama` skips the compile for large-vocab models (saves 1 ANE compile)
- `FP16TiledClassifier.tiledMatvecArgmax` with CPU RMSNorm

Test plan

- `ClassifierStrategyTests` — 7 tests (selection logic + boundary + FP16TiledClassifier correctness)

🤖 Generated with Claude Code