Replies: 3 comments
-
Oh this is 100% my expectation. The C code we have is just a clean algorithmic reference. The major thing missing is taking advantage of vector instructions (AVX, NEON, etc., depending on the platform). That's a deep rabbit hole you can go down. You'd probably get really far just by doing it for the matmul forward/backward.
-
First bash at matmul in #411 brings it down to just under 4x slower.
-
Just a note proving that you can't really benchmark CPU performance on your local machine...
Not running under X shaves off another 100ms, by the looks of it.
-
Looks like there is still A LOT on the table with regards to CPU performance.
I think I'm starting to understand that PyTorch seems to have been written on and for CPU, at least initially... both its startup and its training are still much faster than train_gpt2.c on CPU (and its CPU startup is MUCH faster than its GPU startup). Some anecdotal benchmarks from posts and PRs:
Looks like the compiler optimizations do a bit more for AMD... or PyTorch is optimized for Intel 😅
CPU improvements go in dev/cpu