SIMD Integration #502
For ffsvm (and SVMs in general, I would argue) most of the CPU time is spent inside one loop inside the kernel. I don't recall the exact numbers, but for our actual production model I think it was in the ballpark of ±80% for this code:

```rust
for (i, sv) in vectors.row_iter().enumerate() {
    let mut sum = f32s::splat(0.0);
    let feature: &[f32s] = &feature;

    for (a, b) in sv.iter().zip(feature) {
        sum += (*a - *b) * (*a - *b);
    }

    output[i] = f64::from((-self.gamma * sum.sum()).exp());
}
```

As a developer, what I would need for ffsvm is roughly this: I think my benchmark baseline would be for the equivalent of

```rust
for (a, b) in sv.iter().zip(feature) {
    sum += (*a - *b) * (*a - *b);
}
```

implemented in nalgebra to have the same performance as if implemented manually via …

Edit: I just tried to implement a naive benchmark and realized it's not that simple, since I'd have to deal with …
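For illustration, here is a hedged sketch of what such an nalgebra baseline for the kernel loop above might look like (the function name `rbf_kernel_row`, the one-support-vector-per-column layout, and the test values are assumptions made for this sketch, not ffsvm or nalgebra code):

```rust
use nalgebra::{DMatrix, DVector};

// Hypothetical baseline: compute exp(-gamma * ||sv - feature||^2) for every
// support vector, with each support vector stored as one column of `vectors`.
fn rbf_kernel_row(vectors: &DMatrix<f32>, feature: &DVector<f32>, gamma: f32) -> Vec<f64> {
    vectors
        .column_iter()
        .map(|sv| {
            // Squared Euclidean distance between this support vector and the feature.
            let d2 = (sv - feature).norm_squared();
            f64::from((-gamma * d2).exp())
        })
        .collect()
}

fn main() {
    let vectors = DMatrix::<f32>::from_columns(&[
        DVector::from_vec(vec![1.0, 2.0, 3.0]),
        DVector::from_vec(vec![0.5, 0.5, 0.5]),
    ]);
    let feature = DVector::from_vec(vec![1.0, 1.0, 1.0]);
    println!("{:?}", rbf_kernel_row(&vectors, &feature, 0.5));
}
```

Whether this reaches the throughput of the hand-written `f32s` loop above is exactly what such a benchmark would have to show.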
Some thoughts:
If std::simd is still a thing, I would probably investigate that one first, as it might come with free cross-platform compatibility.
A thought: nalgebra's matrix types aren't always backed by totally contiguous storage. There aren't any gaps between elements if the Matrix is backed by a …
@jswrenn Note that non-contiguous storage can also be purposely used to improve SIMD performance on small matrices. For example, if you pad matrix rows to a multiple of the SIMD vector length, and are careful not to introduce NaNs and other IEEE 754 silliness in the padding, you can implement a very fast small-matrix multiplication. In general, SIMD is very sensitive to data layout, which makes abstraction design harder.
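To make the padding idea concrete, here is a minimal sketch under assumed conditions: a 3×3 matrix padded to a row stride of 4, zero padding, and a fixed-length scalar inner loop left for the compiler to vectorize. The `PaddedMat3` type is purely illustrative, not nalgebra API.

```rust
const DIM: usize = 3;    // logical matrix dimension
const STRIDE: usize = 4; // assumed SIMD width: each row is padded to 4 f32s

// Row-major 3x3 matrix whose rows are padded with zeros up to STRIDE elements.
struct PaddedMat3 {
    data: [f32; DIM * STRIDE],
}

impl PaddedMat3 {
    fn from_rows(rows: [[f32; DIM]; DIM]) -> Self {
        let mut data = [0.0; DIM * STRIDE];
        for (r, row) in rows.iter().enumerate() {
            data[r * STRIDE..r * STRIDE + DIM].copy_from_slice(row);
        }
        PaddedMat3 { data }
    }

    // Matrix * vector: every inner loop runs over exactly STRIDE lanes with no
    // remainder handling, and the zero padding contributes nothing to the sums.
    fn mul_vec(&self, v: [f32; DIM]) -> [f32; DIM] {
        let mut padded_v = [0.0f32; STRIDE];
        padded_v[..DIM].copy_from_slice(&v);

        let mut out = [0.0f32; DIM];
        for r in 0..DIM {
            let row = &self.data[r * STRIDE..(r + 1) * STRIDE];
            let mut acc = 0.0;
            for lane in 0..STRIDE {
                acc += row[lane] * padded_v[lane];
            }
            out[r] = acc;
        }
        out
    }
}

fn main() {
    let m = PaddedMat3::from_rows([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]);
    println!("{:?}", m.mul_vec([1.0, 0.0, -1.0])); // [-2.0, -2.0, -2.0]
}
```

Because every row covers whole SIMD-width chunks, there is no scalar tail to handle, which is what keeps the hot loop vectorizable.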
I just want to put my two cents in here. One bottleneck in my application seems to be matrix multiplication and, to some extent, addition. SIMD-accelerated matrix multiplication would on its own be amazing. These features alone would make nalgebra a lot more attractive for scientific computing.
SIMD intrinsics for the wasm32 target are accessible via core::arch::wasm32, and I don't think the above crates support this specific target.
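For reference, a hedged sketch of the squared-distance loop from the earlier comment written directly against core::arch::wasm32 (the 4-lane chunking, the padding assumption, and the function name are illustrative; such code is normally built for wasm32 with the simd128 target feature enabled):

```rust
#[cfg(target_arch = "wasm32")]
mod wasm_kernel {
    use core::arch::wasm32::*;

    // Squared Euclidean distance, assuming both slices are padded to a multiple
    // of 4 lanes (the lane count of 4 is an assumption made for this sketch).
    pub fn squared_distance(a: &[f32], b: &[f32]) -> f32 {
        assert_eq!(a.len(), b.len());
        assert_eq!(a.len() % 4, 0);

        let mut acc = f32x4_splat(0.0);
        for (ca, cb) in a.chunks_exact(4).zip(b.chunks_exact(4)) {
            let va = f32x4(ca[0], ca[1], ca[2], ca[3]);
            let vb = f32x4(cb[0], cb[1], cb[2], cb[3]);
            let d = f32x4_sub(va, vb);
            acc = f32x4_add(acc, f32x4_mul(d, d));
        }

        // Horizontal sum of the four accumulator lanes.
        f32x4_extract_lane::<0>(acc)
            + f32x4_extract_lane::<1>(acc)
            + f32x4_extract_lane::<2>(acc)
            + f32x4_extract_lane::<3>(acc)
    }
}
```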
I would really like to have a SIMD Vector3 without having to refactor all the code to use AoSoA. |
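For context, AoSoA ("array of structures of arrays") here means packing a handful of Vector3s lane-wise so that each coordinate maps onto one SIMD register. A rough sketch of the two layouts being contrasted (illustrative types, not nalgebra API):

```rust
// AoS: one Vector3 per element; SIMD has to gather x/y/z from strided memory.
#[allow(dead_code)]
struct Vec3 {
    x: f32,
    y: f32,
    z: f32,
}

// AoSoA: a block of 4 Vector3s stored lane-wise, so x[0..4], y[0..4], z[0..4]
// each map directly onto one 4-lane SIMD register.
struct Vec3x4 {
    x: [f32; 4],
    y: [f32; 4],
    z: [f32; 4],
}

impl Vec3x4 {
    // Adding two blocks touches three contiguous 4-lane arrays; the compiler can
    // lower each small loop to a single SIMD add.
    fn add(&self, rhs: &Vec3x4) -> Vec3x4 {
        let mut out = Vec3x4 { x: [0.0; 4], y: [0.0; 4], z: [0.0; 4] };
        for i in 0..4 {
            out.x[i] = self.x[i] + rhs.x[i];
            out.y[i] = self.y[i] + rhs.y[i];
            out.z[i] = self.z[i] + rhs.z[i];
        }
        out
    }
}

fn main() {
    let a = Vec3x4 { x: [1.0; 4], y: [2.0; 4], z: [3.0; 4] };
    let b = Vec3x4 { x: [0.5; 4], y: [0.5; 4], z: [0.5; 4] };
    println!("{:?}", a.add(&b).x); // [1.5, 1.5, 1.5, 1.5]
}
```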
This is a revival of #27 and #217 and a follow-up to that comment. We need to figure out what we can do to use explicit SIMD in order to improve the overall performance of nalgebra.
Once rusty-machine is integrated into nalgebra (#498), we could use it for benchmarking to see whether our optimizations have any impact on some real-world machine-learning operations.
We should also keep in mind the ffsvm crate as an example of an application requiring peak performance for some linalg operations, and the simd_aligned crate as an example of an interesting design for a SIMD-friendly data structure backing the storage of a matrix or vector.
Here are some tasks we should start with to get some measurements that will serve as references for our optimizations. This will be useful to guide us through our work:
- Evaluate the candidate SIMD crates: faster, SIMDeez, std::simd, packed_simd (see the benchmark sketch below).
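As an illustration of what such a reference measurement could look like, here is a hedged sketch of a criterion benchmark comparing a plain scalar squared-distance loop against nalgebra's `norm_squared` (the crate choice, vector length, and function names are assumptions made for this sketch, not part of this issue):

```rust
// benches/sq_dist.rs -- assumes criterion and nalgebra as dev-dependencies and
// `harness = false` for this bench target in Cargo.toml.
use std::hint::black_box;

use criterion::{criterion_group, criterion_main, Criterion};
use nalgebra::DVector;

// Plain scalar baseline: sum of squared differences.
fn scalar_sq_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

fn bench_sq_dist(c: &mut Criterion) {
    let n = 1024;
    let a: Vec<f32> = (0..n).map(|i| i as f32).collect();
    let b: Vec<f32> = (0..n).map(|i| i as f32 * 0.5).collect();
    let va = DVector::from_vec(a.clone());
    let vb = DVector::from_vec(b.clone());

    c.bench_function("scalar squared distance", |bench| {
        bench.iter(|| scalar_sq_dist(black_box(&a), black_box(&b)))
    });
    c.bench_function("nalgebra squared distance", |bench| {
        bench.iter(|| (black_box(&va) - black_box(&vb)).norm_squared())
    });
}

criterion_group!(benches, bench_sq_dist);
criterion_main!(benches);
```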