This commit significantly improves the performance of row vector × matrix multiplication by reorganizing the computation to exploit row-major storage and SIMD acceleration.

## Key Changes

- Rewrote `Matrix.multiplyRowVector` to use a weighted sum of matrix rows
  - Original: column-wise accumulation with strided memory access
  - Optimized: row-wise accumulation with contiguous memory and SIMD

## Performance Improvements

Compared to baseline (from PR #20):

| Size    | Before    | After     | Improvement  |
|---------|-----------|-----------|--------------|
| 10×10   | 84.3 ns   | 55.2 ns   | 34.5% faster |
| 50×50   | 1,958 ns  | 622.6 ns  | 68.2% faster |
| 100×100 | 9,208 ns  | 1,905 ns  | 79.3% faster |

The optimization achieves a 3.5-4.8× speedup for larger matrices by:

1. Eliminating strided column access patterns
2. Enabling SIMD vectorization on contiguous row data
3. Broadcasting vector weights efficiently across SIMD lanes
4. Skipping zero weights to reduce unnecessary computation

## Implementation Details

The new implementation computes:

result = v[0]*row0 + v[1]*row1 + ... + v[n-1]*row(n-1)

This approach:

- Accesses matrix rows contiguously (cache-friendly)
- Broadcasts each weight v[i] to all SIMD lanes
- Accumulates weighted rows directly into the result vector
- Falls back to the original scalar implementation for small matrices

## Testing

- All 132 existing tests pass
- Benchmark infrastructure added (Matrix.fs benchmarks)
- Memory allocations unchanged

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Member
The numbers do seem to check out. I reran the same benchmarks on my laptop to verify. Before:

After:
Contributor
Author
📊 Code Coverage Report

Summary

📈 Coverage Analysis: 🟡 Good Coverage. Your code coverage is above 60%. Consider adding more tests to reach 80%.

🎯 Coverage Goals

📋 What These Numbers Mean

🔗 Detailed Reports: 📋 Download Full Coverage Report (check the 'coverage-report' artifact for the detailed HTML coverage report)

Coverage report generated on 2025-10-14 at 15:36:53 UTC
Member
This is very good. I hadn't thought of this option.
Member
@muehlhaus Interesting, isn't it?
Summary
This PR optimizes row-vector × matrix multiplication (`v × M`), achieving a 3.5-4.8× speedup for typical matrix sizes by reorganizing the computation to exploit row-major storage and SIMD acceleration.

Performance Goal
Goal Selected: Optimize vector × matrix multiplication (Phase 2, Priority: MEDIUM)
Rationale: The research plan from Discussion #11 and benchmarks from PR #20 identified that `VectorMatrixMultiply` (vector × matrix) was 4-5× slower than `MatrixVectorMultiply` (matrix × vector). This asymmetry was caused by column-wise memory access patterns that don't align with row-major storage.

Changes Made
Core Optimization
File Modified: `src/FsMath/Matrix.fs`, `multiplyRowVector` function (lines 581-645)

Original Implementation:

Optimized Implementation:
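The implementation code blocks did not survive the export. A minimal sketch of the weighted row-sum approach described above, assuming a row-major `float[]` backing store; the function and parameter names here are illustrative, not the actual FsMath code:

```fsharp
open System.Numerics

// Sketch: result[j] = sum over i of v[i] * M[i, j], with M stored
// row-major in `data` (element (i, j) at data.[i*cols + j]).
let multiplyRowVectorSketch (v: float[]) (data: float[]) (rows: int) (cols: int) : float[] =
    let result = Array.zeroCreate<float> cols
    let lanes = Vector<float>.Count
    for i in 0 .. rows - 1 do
        let w = v.[i]
        if w <> 0.0 then                          // skip zero weights entirely
            let offset = i * cols
            let wVec = Vector<float>(w)           // broadcast weight to all lanes
            let mutable j = 0
            while j <= cols - lanes do            // SIMD loop over contiguous row data
                let rowVec = Vector<float>(data, offset + j)
                let accVec = Vector<float>(result, j)
                (accVec + wVec * rowVec).CopyTo(result, j)
                j <- j + lanes
            while j < cols do                     // scalar tail for the remainder
                result.[j] <- result.[j] + w * data.[offset + j]
                j <- j + 1
    result
```

For example, `v = [|1.0; 2.0|]` against the 2×3 row-major matrix `[[1;2;3];[4;5;6]]` yields `[|9.0; 12.0; 15.0|]`, i.e. `1*row0 + 2*row1`.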
Benchmark Infrastructure
Added comprehensive matrix operation benchmarks from PR #20:
- `benchmarks/FsMath.Benchmarks/Matrix.fs` (108 lines, 14 benchmarks)
- `FsMath.Benchmarks.fsproj`: added Matrix.fs to the compilation
- `Program.fs`: registered the MatrixBenchmarks class

Approach
Performance Measurements
Test Environment
Results Summary
Detailed Benchmark Results
Key Observations
- Now on par with `MatrixVectorMultiply` performance

Why This Works
The optimization addresses three key bottlenecks:
Memory Access Pattern:
- Strided column access (`data[i*m + j]`): cache-unfriendly
- Contiguous row access (`data[i*m..(i+1)*m]`): cache-friendly

SIMD Utilization:
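The broadcast and zero-skip tricks can be illustrated with `System.Numerics` directly (illustrative values and names, not FsMath code):

```fsharp
open System.Numerics

let row = [| 1.0; 2.0; 3.0; 4.0; 5.0; 6.0; 7.0; 8.0 |]

// Broadcast: Vector<'T>(scalar) fills every SIMD lane with the same weight,
// so a single multiply scales Vector<float>.Count row elements at once.
let wVec = Vector<float>(3.0)
let scaled = wVec * Vector<float>(row, 0)   // lane-wise: 3*1, 3*2, ...

// Zero skip: a zero weight contributes nothing to the weighted sum,
// so the whole row is skipped before any per-element work happens.
let accumulate (acc: float[]) (w: float) (values: float[]) =
    if w <> 0.0 then
        for j in 0 .. values.Length - 1 do
            acc.[j] <- acc.[j] + w * values.[j]

printfn "lanes = %d, scaled.[0] = %f" Vector<float>.Count scaled.[0]
```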
Computational Efficiency:
Replicating the Performance Measurements
To replicate these benchmarks:
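The command snippet was elided in the export. Under standard BenchmarkDotNet conventions, running the benchmark project would look roughly like this (project path taken from this PR; the `--filter` pattern is an assumption):

```shell
# From the repository root; --filter is BenchmarkDotNet's standard
# benchmark-selection option, the exact name pattern is assumed here.
dotnet run -c Release --project benchmarks/FsMath.Benchmarks -- --filter "*Matrix*"
```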
Results are saved to `BenchmarkDotNet.Artifacts/results/` in GitHub MD, HTML, and CSV formats.

Testing
✅ All 132 tests pass
✅ VectorMatrixMultiply benchmarks execute successfully
✅ Memory allocations unchanged
✅ Performance improves 3.5-4.8× for target sizes
✅ Correctness verified across all test cases
Implementation Details
Optimization Techniques Applied
- Express `v × M` as a linear combination of matrix rows
- Use `Numerics.Vector<'T>(weight)` to broadcast a scalar across SIMD lanes
- Skip rows where `v[i] == 0`

Code Quality
Next Steps
This PR establishes parity between vector × matrix and matrix × vector operations. Based on the performance plan, remaining Phase 2 work includes:
- `getCol` still has strided access

Future Optimization Opportunities
From this work, I identified additional optimization targets:
- `getCol`: could use SIMD gather instructions

Related Issues/Discussions
🤖 Generated with Claude Code