cpu-matmul's efficiency of constensor is much lower than that of candle #28

@xiaoniaoyouhuajiang

Observation

My environment

MacBook Air, M3 (aarch64)

While looking at the matmul benchmark results, I noticed that constensor's CPU matmul is noticeably slower than the comparable benchmarks in candle.

cpu_graph_matmul_64x64  time:   [65.350 µs 68.851 µs 73.700 µs]
                        change: [-1.5632% +2.9214% +7.4267%] (p = 0.21 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

cpu_graph_matmul_128x128
                        time:   [239.03 µs 244.52 µs 250.49 µs]
                        change: [-15.182% -6.5155% -0.0611%] (p = 0.13 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  8 (8.00%) high mild
  2 (2.00%) high severe

cpu_graph_matmul_256x256
                        time:   [746.20 µs 765.73 µs 787.60 µs]
                        change: [+2.0148% +4.3329% +6.8584%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

candle_matmul_64x64     time:   [6.7214 µs 6.7795 µs 6.8656 µs]
                        change: [-0.7370% +0.4845% +1.6550%] (p = 0.45 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

candle_matmul_128x128   time:   [58.920 µs 59.852 µs 60.828 µs]
                        change: [-0.2394% +1.8737% +3.9358%] (p = 0.09 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

candle_matmul_256x256   time:   [196.04 µs 198.67 µs 201.45 µs]
                        change: [-6.4764% -3.0850% -0.1278%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

My initial analysis suggests that constensor and candle both use the same underlying GEMM implementation (the gemm library).
If that is correct, I find it puzzling that there is such a noticeable performance difference in the matmul operation itself.
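
To put numbers on the gap, here is a small sketch that takes the median times from the Criterion output above and computes the slowdown and the absolute difference for each size. The helper names `slowdown` and `gap_us` are made up for this illustration. With the medians as reported, it works out to roughly 10.2x, 4.1x and 3.9x slower at 64, 128 and 256, with gaps of about 62 µs, 185 µs and 567 µs.

```rust
/// Median times in microseconds copied from the Criterion output above:
/// (matrix size, constensor cpu_graph_matmul, candle_matmul).
const MEDIANS_US: [(usize, f64, f64); 3] = [
    (64, 68.851, 6.7795),
    (128, 244.52, 59.852),
    (256, 765.73, 198.67),
];

/// How many times slower constensor is than candle for one size.
fn slowdown(constensor_us: f64, candle_us: f64) -> f64 {
    constensor_us / candle_us
}

/// Absolute difference in microseconds for one size.
fn gap_us(constensor_us: f64, candle_us: f64) -> f64 {
    constensor_us - candle_us
}

fn main() {
    for &(n, constensor, candle) in MEDIANS_US.iter() {
        println!(
            "{}x{}: {:.1}x slower, gap {:.1} µs",
            n,
            n,
            slowdown(constensor, candle),
            gap_us(constensor, candle)
        );
    }
}
```

The relative slowdown shrinks as the matrices grow, but the absolute gap keeps growing, so on these numbers a single fixed per-call cost would not account for the whole difference.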


@EricLBuehler My guess is that this is mainly due to the overhead of graph compilation. Do you have any initial thoughts on what might be causing this performance gap?
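
One way to check the overhead guess would be to time the same work twice: once with the per-call setup inside the timed region and once with it hoisted out. The sketch below only illustrates that measurement pattern with the standard library. It does not use the constensor or candle APIs: `setup` is a hypothetical stand-in for whatever happens per call (graph build, compilation, allocation), and `matmul` is a naive loop standing in for the real kernel.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Average wall-clock time of `f` over `iters` calls (`iters` must be > 0).
fn avg_time<F: FnMut()>(iters: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed() / iters
}

/// Naive n x n row-major matmul, used only as a stand-in for the real kernel.
fn matmul(a: &[f32], b: &[f32], n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; n * n];
    for i in 0..n {
        for k in 0..n {
            let aik = a[i * n + k];
            for j in 0..n {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
    c
}

/// Hypothetical stand-in for per-call setup work (here: just allocating inputs).
fn setup(n: usize) -> (Vec<f32>, Vec<f32>) {
    (vec![1.0; n * n], vec![2.0; n * n])
}

fn main() {
    let n = 64;
    let iters = 200;

    // Variant A: setup runs inside the timed closure on every iteration.
    let with_setup = avg_time(iters, || {
        let (a, b) = setup(n);
        black_box(matmul(&a, &b, n));
    });

    // Variant B: setup is hoisted out, so only the kernel is timed.
    let (a, b) = setup(n);
    let kernel_only = avg_time(iters, || {
        black_box(matmul(&a, &b, n));
    });

    println!("with setup:  {:?}", with_setup);
    println!("kernel only: {:?}", kernel_only);
}
```

If the two variants come out close for the real benchmark, per-call setup is probably not the main cost and the kernel path itself is worth a closer look.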
