CPU matmul in constensor is much slower than in candle #28

## Observation

### My env
MacBook Air, M3 (aarch64)


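While looking at the matmul benchmark results, I noticed that constensor's CPU matmul is noticeably slower than the equivalent candle benchmarks: by the median times below, about 10x slower at 64x64 and about 4x slower at 128x128 and 256x256.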

```shell
cpu_graph_matmul_64x64  time:   [65.350 µs 68.851 µs 73.700 µs]
                        change: [-1.5632% +2.9214% +7.4267%] (p = 0.21 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

cpu_graph_matmul_128x128
                        time:   [239.03 µs 244.52 µs 250.49 µs]
                        change: [-15.182% -6.5155% -0.0611%] (p = 0.13 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  8 (8.00%) high mild
  2 (2.00%) high severe

cpu_graph_matmul_256x256
                        time:   [746.20 µs 765.73 µs 787.60 µs]
                        change: [+2.0148% +4.3329% +6.8584%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

candle_matmul_64x64     time:   [6.7214 µs 6.7795 µs 6.8656 µs]
                        change: [-0.7370% +0.4845% +1.6550%] (p = 0.45 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

candle_matmul_128x128   time:   [58.920 µs 59.852 µs 60.828 µs]
                        change: [-0.2394% +1.8737% +3.9358%] (p = 0.09 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

candle_matmul_256x256   time:   [196.04 µs 198.67 µs 201.45 µs]
                        change: [-6.4764% -3.0850% -0.1278%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
```
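
To make the gap easier to read, here is a small sketch that turns the median times from the output above into a slowdown ratio and an absolute difference per call. The constant and function names (`MEDIANS_US`, `slowdown`, `gap_us`) are only illustrative and are not part of either library.

```rust
/// Median times in µs copied from the Criterion output above:
/// (matrix size, constensor, candle).
const MEDIANS_US: [(usize, f64, f64); 3] = [
    (64, 68.851, 6.7795),
    (128, 244.52, 59.852),
    (256, 765.73, 198.67),
];

/// How many times slower constensor is than candle.
fn slowdown(constensor_us: f64, candle_us: f64) -> f64 {
    constensor_us / candle_us
}

/// Extra time per call in µs.
fn gap_us(constensor_us: f64, candle_us: f64) -> f64 {
    constensor_us - candle_us
}

fn main() {
    for (n, ours, theirs) in MEDIANS_US {
        println!(
            "{n}x{n}: {:.1}x slower, +{:.1} µs per call",
            slowdown(ours, theirs),
            gap_us(ours, theirs)
        );
    }
}
```

With these medians the ratio shrinks from roughly 10x at 64x64 to roughly 3.9x at 256x256, while the absolute gap grows from about 62 µs to about 567 µs. This suggests that a fixed per-call cost may not be the whole story, although it does not say where the extra time goes.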

My initial analysis suggests that constensor and candle both rely on the same underlying GEMM implementation (the `gemm` library).
If that is correct, I find it puzzling that a benchmark consisting primarily of the matmul operation itself would show such a noticeable performance difference.
<br>
@EricLBuehler My guess is that this is mainly due to the overhead of graph compilation. Do you have any initial thoughts on what might be causing this performance gap?
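
One way to check the graph-compilation guess would be to time the setup phase and the execution phase separately. The sketch below uses only `std`; `time_phases` and `naive_matmul` are hypothetical helpers, and the two closures are stand-ins for "build and compile the graph" and "run the compiled graph", not actual constensor or candle calls.

```rust
use std::time::{Duration, Instant};

/// Times a one-off setup phase and the average of `iters` executions separately.
/// `build` and `run` are placeholders for graph construction/compilation and
/// graph execution; the real library calls are intentionally not shown.
fn time_phases<G>(
    build: impl FnOnce() -> G,
    mut run: impl FnMut(&G),
    iters: u32,
) -> (Duration, Duration) {
    assert!(iters > 0, "need at least one iteration");

    let t0 = Instant::now();
    let graph = build();
    let build_time = t0.elapsed();

    let t1 = Instant::now();
    for _ in 0..iters {
        run(&graph);
    }
    let run_time = t1.elapsed() / iters;

    (build_time, run_time)
}

/// Naive row-major n x n matmul, used only as a stand-in workload.
fn naive_matmul(a: &[f32], b: &[f32], n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; n * n];
    for i in 0..n {
        for k in 0..n {
            let aik = a[i * n + k];
            for j in 0..n {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
    c
}

fn main() {
    let n = 64;
    let (build, run) = time_phases(
        // Stand-in for "build and compile the graph".
        || (vec![1.0f32; n * n], vec![2.0f32; n * n]),
        // Stand-in for "run the compiled graph".
        |(a, b): &(Vec<f32>, Vec<f32>)| {
            std::hint::black_box(naive_matmul(a, b, n));
        },
        100,
    );
    println!("build: {build:?}, run (avg): {run:?}");
}
```

If the benchmark currently rebuilds or recompiles the graph on every iteration, moving that step into the `build` closure should show how much of the gap it accounts for.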
