@mikepapadim (Member) commented Dec 3, 2025

Summary

Implements fused dequantize-and-compute patterns for quantized matrix-vector operations,
eliminating intermediate memory round-trips during inference.

Changes

  • Fused Dequantization: Dequantize weights directly in registers before compute,
    avoiding the previous dequantize → store → load → compute pipeline
  • Optimized SGEMV Kernels: Improved memory coalescing and compute utilization
    for the memory-bound decode phase
  • SiLU-GLU Fusion: Combined activation and gating into a single kernel pass

Benchmarks (Llama 3.2 1B FP16)

| GPU      | Before (tok/s) | After (tok/s) | Speedup |
|----------|----------------|---------------|---------|
| RTX 3070 | 52             | 62            | +19%    |
| RTX 4090 | 66             | 86            | +30%    |
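The speedup column follows from the throughput figures as (after / before − 1) × 100, rounded to the nearest percent. A small sketch of that arithmetic, with a hypothetical helper name:

```java
// Illustrative helper for checking the speedup column; not part of the PR.
final class SpeedupCheck {

    // Relative throughput gain in percent, rounded to the nearest integer.
    static long speedupPercent(double beforeTokPerSec, double afterTokPerSec) {
        return Math.round((afterTokPerSec / beforeTokPerSec - 1.0) * 100.0);
    }

    public static void main(String[] args) {
        System.out.println("RTX 3070: +" + speedupPercent(52, 62) + "%");
        System.out.println("RTX 4090: +" + speedupPercent(66, 86) + "%");
    }
}
```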

Why This Works

Single-token generation (the decode phase) is dominated by matrix-vector
operations, which are memory-bandwidth bound. Fusing dequantization with
compute hides the dequantization overhead: weights are converted in registers
and consumed immediately, rather than being written back to memory and
reloaded between operations.
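The difference between the two pipelines can be sketched in plain Java. This is not the PR's code. The weight format is an assumption: a hypothetical block-wise int8 layout with one `float` scale per 32 values, chosen only because it is easy to show. The class and method names are also invented, and sequential loops stand in for GPU threads.

```java
// Illustrative sketch only: the int8 block format and all names are hypothetical.
final class FusedDequantSketch {

    static final int BLOCK = 32; // assumed number of weights sharing one scale

    // Unfused: dequantize -> store -> load -> compute.
    // The whole matrix is materialized as floats before the matrix-vector product.
    static float[] dequantizeThenMatVec(byte[] q, float[] scales, float[] x, int rows, int cols) {
        float[] w = new float[rows * cols];
        for (int i = 0; i < w.length; i++) {
            w[i] = q[i] * scales[i / BLOCK];
        }
        float[] y = new float[rows];
        for (int r = 0; r < rows; r++) {
            float acc = 0.0f;
            for (int c = 0; c < cols; c++) {
                acc += w[r * cols + c] * x[c];
            }
            y[r] = acc;
        }
        return y;
    }

    // Fused: each weight is dequantized into a local variable and consumed
    // immediately, so no float copy of the matrix is ever written out.
    static float[] fusedMatVec(byte[] q, float[] scales, float[] x, int rows, int cols) {
        float[] y = new float[rows];
        for (int r = 0; r < rows; r++) {
            float acc = 0.0f;
            for (int c = 0; c < cols; c++) {
                int idx = r * cols + c;
                float w = q[idx] * scales[idx / BLOCK];
                acc += w * x[c];
            }
            y[r] = acc;
        }
        return y;
    }

    public static void main(String[] args) {
        int rows = 2, cols = 32;
        byte[] q = new byte[rows * cols];
        java.util.Arrays.fill(q, 0, cols, (byte) 2);
        java.util.Arrays.fill(q, cols, 2 * cols, (byte) -4);
        float[] scales = {0.5f, 0.25f};
        float[] x = new float[cols];
        java.util.Arrays.fill(x, 1.0f);
        System.out.println(java.util.Arrays.toString(fusedMatVec(q, scales, x, rows, cols)));
    }
}
```

In the unfused version, every weight is written once as a `float` and read back once. The fused version reads only the compact quantized data, which matters when the operation is limited by memory bandwidth rather than arithmetic.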

…optimized matrix-vector kernels, and SiLU-GLU activation
@mikepapadim mikepapadim changed the title Implement FP16 support in TornadoVM by introducing HalfFloat arrays, … Implement deq and compute pattern for SGEEMs Dec 3, 2025
…ce overhead, improve cache utilization, and update task graph setup to integrate fused kernel.
…k graph to integrate `ropeRotationWithCacheCopy` kernel, and remove redundant kernels (`rope` and `copyToCaches`).
…ids, and deprecate redundant tasks in FP16 layer.
…r grid assignments, and enhance attention and FFN block configurations.
…e kernel setup, and enhance FP16 task processing.
…rrays and `mapContextWithQuantizeLogits` kernel, enhancing FP16 computation capabilities
…tailed data flow, task breakdown, and fusion points
…incorporate fused RMS normalization, gate, and up-projection
…FFN task graphs by removing deprecated tasks, consolidating RMS normalization and FFN operations into `rms_ffn_gate_up`.
…FN layers to optimize worker grid configuration.
…tmul`, and `fusedRmsNormQKVMatmul`.

Refactor workers and task graphs to utilize new computations and streamline layer configurations for improved performance and reduced memory transfers.
…te Q/K RMSNorm into a single operation. Cleanup deprecated workers, update task names, and streamline layer configuration.
…e task graphs with fused kernels, reorganize attention and FFN block mapping, and integrate final normalization for non-NVIDIA devices. Add detailed Transformer layer task flow documentation.
…fixes, improved attention computation logic, and optimized handling of large models. Update task graph to revert to `processHeadsFlashAttention` for compatibility.
…it TaskGraph type with `var`, streamline task graph configuration by removing unused temp variables.
…ith fused kernels, update worker grid configurations, and streamline data transfer logic.
…consolidate Q/K/V bias addition into a single operation, and update worker grid configurations. Streamline attention block with optimized task mapping and detailed layer flow documentation.
@mikepapadim mikepapadim requested review from Copilot and orionpapadakis and removed request for Copilot and orionpapadakis December 4, 2025 20:28
…yers and update grid scheduler configuration
…c worker grid, update RoPE task configuration, and streamline layer setup.
…update Phi3 FP16 FFN layers with optimized worker grid configurations, fused workflows for attention and FFN blocks, and detailed task flow documentation.
@mikepapadim mikepapadim changed the title Implement deq and compute pattern for SGEEMs [FP16] Improved performance by fusing dequantize with compute in kernels: 20-30% Inference Speedup Dec 4, 2025
@mikepapadim mikepapadim self-assigned this Dec 4, 2025
@mikepapadim mikepapadim marked this pull request as ready for review December 4, 2025 20:59
…r Phi3 FP16 FFN layers to consolidate QKV projection tasks, and update worker grid/task configurations.
… Phi3 FP16 FFN layers to streamline task configuration and clean up commented code.
…lace `rms_ffn_gate_up` and `gateUpSiLU` tasks with a single fused task, streamline task graph and update documentation.
…ented code, and streamline Phi3 FP16 FFN layer configurations.
…ting line breaks in data transfer logic and disabling formatter for consistent formatting.
…id scheduler logic, and improve readability by adjusting formatting.
2 participants