Closed
Observation
My env
MacBook Air, M3 (aarch64)
While looking at the matmul benchmark results, I noticed that the performance seems to be lower compared to similar benchmarks in candle.
cpu_graph_matmul_64x64 time: [65.350 µs 68.851 µs 73.700 µs]
change: [-1.5632% +2.9214% +7.4267%] (p = 0.21 > 0.05)
No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
7 (7.00%) high mild
1 (1.00%) high severe
cpu_graph_matmul_128x128
time: [239.03 µs 244.52 µs 250.49 µs]
change: [-15.182% -6.5155% -0.0611%] (p = 0.13 > 0.05)
No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
1 (1.00%) low mild
8 (8.00%) high mild
2 (2.00%) high severe
cpu_graph_matmul_256x256
time: [746.20 µs 765.73 µs 787.60 µs]
change: [+2.0148% +4.3329% +6.8584%] (p = 0.00 < 0.05)
Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
5 (5.00%) high mild
2 (2.00%) high severe
candle_matmul_64x64 time: [6.7214 µs 6.7795 µs 6.8656 µs]
change: [-0.7370% +0.4845% +1.6550%] (p = 0.45 > 0.05)
No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
candle_matmul_128x128 time: [58.920 µs 59.852 µs 60.828 µs]
change: [-0.2394% +1.8737% +3.9358%] (p = 0.09 > 0.05)
No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
4 (4.00%) high mild
1 (1.00%) high severe
candle_matmul_256x256 time: [196.04 µs 198.67 µs 201.45 µs]
change: [-6.4764% -3.0850% -0.1278%] (p = 0.07 > 0.05)
No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
4 (4.00%) high mild
2 (2.00%) high severe
My initial analysis suggests that both constensor and candle might be leveraging the same underlying gemm interface (the gemm lib).
Assuming this is correct, I'm currently finding it a bit puzzling why there would be a noticeable performance difference primarily originating from the matmul operation itself.
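To put a number on the gap, here is a minimal sketch that computes the slowdown ratio and the absolute difference from the median times in the criterion output above. The constant and function names are only illustrative; the values are copied from the benchmark results.

```rust
// Median times in microseconds, copied from the criterion output above.
const SIZES: [usize; 3] = [64, 128, 256];
const CPU_GRAPH_US: [f64; 3] = [68.851, 244.52, 765.73];
const CANDLE_US: [f64; 3] = [6.7795, 59.852, 198.67];

/// How many times slower the graph benchmark is than the candle one.
fn slowdown(graph_us: f64, candle_us: f64) -> f64 {
    graph_us / candle_us
}

/// Absolute difference between the two medians, in microseconds.
fn gap_us(graph_us: f64, candle_us: f64) -> f64 {
    graph_us - candle_us
}

fn main() {
    for i in 0..SIZES.len() {
        let n = SIZES[i];
        println!(
            "{n}x{n}: {:.2}x slower, gap {:.1} µs",
            slowdown(CPU_GRAPH_US[i], CANDLE_US[i]),
            gap_us(CPU_GRAPH_US[i], CANDLE_US[i]),
        );
    }
}
```

With these medians the ratio is roughly 10x at 64x64 and roughly 4x at 128x128 and 256x256, while the absolute gap grows with the matrix size. A purely fixed per-call cost would give a roughly constant gap, so that may be worth keeping in mind when looking for the cause.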
@EricLBuehler I guess this is mainly due to the overhead associated with graph compilation. Do you have any initial thoughts or insights into potential reasons for this performance gap?
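One way to check the compilation-overhead guess would be to time the one-time setup separately from the execution step. The sketch below only shows the measurement pattern: `CompiledGraph`, `compile_graph`, and `run` are hypothetical stand-ins (a naive matmul), not constensor's actual API, and would need to be swapped for the real graph build and run calls.

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-in for a compiled graph; NOT constensor's API.
struct CompiledGraph {
    n: usize,
}

// Hypothetical one-time setup step (graph build / compile / optimize).
fn compile_graph(n: usize) -> CompiledGraph {
    CompiledGraph { n }
}

impl CompiledGraph {
    // Hypothetical execution step: naive n x n row-major matmul.
    fn run(&self, a: &[f32], b: &[f32]) -> Vec<f32> {
        let n = self.n;
        let mut out = vec![0.0f32; n * n];
        for i in 0..n {
            for k in 0..n {
                let a_ik = a[i * n + k];
                for j in 0..n {
                    out[i * n + j] += a_ik * b[k * n + j];
                }
            }
        }
        out
    }
}

// Runs a closure once and returns its result with the elapsed wall time.
fn time_it<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}

fn main() {
    let n = 64;
    let a = vec![1.0f32; n * n];
    let b = vec![1.0f32; n * n];

    // Measure setup and execution as two separate phases.
    let (graph, compile_time) = time_it(|| compile_graph(n));
    let (out, run_time) = time_it(|| graph.run(&a, &b));

    println!("compile: {compile_time:?}, run: {run_time:?}, out[0] = {}", out[0]);
}
```

If the setup phase turns out to sit inside the closure that criterion iterates, moving it outside the measured loop would show how much of the gap it accounts for. A single `Instant` measurement is noisy, so this is only a first check and not a replacement for the criterion runs.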