Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm
Updated Dec 11, 2024 - C++
Floating-point matrix multiplication implementation (arbitrary precision)
Matrix multiplication on the NPU inside RK3588
ForMatmul - A Fortran library that overloads the matmul function to enable efficient matrix multiplication with or without coarrays.
This project integrates a custom CUDA-based matrix multiplication kernel into a PyTorch deep learning model, leveraging GPU acceleration for matrix operations. The goal is to compare the performance of this custom kernel with PyTorch's built-in matrix multiplication and demonstrate how custom CUDA kernels can optimize compute-intensive operations.
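The comparison methodology that project describes can be sketched portably. The snippet below is an illustrative stand-in, not the repository's code: it uses NumPy instead of CUDA/PyTorch so it runs on any machine, timing a naive triple-loop "custom kernel" against the library's optimized matmul and checking that both agree.

```python
# Sketch of the custom-kernel vs. built-in comparison, using NumPy in place
# of CUDA/PyTorch (an assumption for portability): a naive matmul is checked
# and timed against the library's optimized routine.
import time
import numpy as np

def naive_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Textbook row-by-column matrix multiply (the 'custom kernel' stand-in)."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((n, m), dtype=a.dtype)
    for i in range(n):
        for j in range(m):
            out[i, j] = np.dot(a[i, :], b[:, j])
    return out

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64))
b = rng.standard_normal((64, 64))

t0 = time.perf_counter()
c_naive = naive_matmul(a, b)
t_naive = time.perf_counter() - t0

t0 = time.perf_counter()
c_lib = a @ b
t_lib = time.perf_counter() - t0

# Both paths must produce the same result to floating-point tolerance;
# the optimized routine is typically far faster than the explicit loops.
assert np.allclose(c_naive, c_lib)
print(f"naive: {t_naive:.4f}s  library: {t_lib:.4f}s")
```

The same correctness-first, timing-second structure applies when the "custom" side is a real CUDA kernel launched from PyTorch.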
Raspberry Pi Pico (RP2040) and Adafruit Metro M7 (NXP IMXRT10XX) benchmark
In this project, instruction counts for a C program are measured with Intel Pin and C++.
OpenMP Matrix Multiplication Offloading Playground
A Python script that uses the CuPy library for optimized GPU matrix multiplication. It includes a custom CUDA kernel tuned for performance and energy consumption, using half-precision floating-point numbers (float16) to improve throughput and warp utilization.
📰 This repository contains time measurements of various algorithms on the CPU and GPU using PyCuda: matrix multiplication, Pi computation, and bilateral filtering.
Matrix-matrix multiplication implementations benchmarking