[MLAS/NEON] Add dedicated kernel for depthwise convolution for ARM64 using NEON intrinsics #26688
Conversation
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
You can commit the suggested changes from lintrunner.
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Pull request overview
This PR adds a dedicated NEON-optimized depthwise convolution kernel for ARM64 to improve performance over the current Im2Col + SGemm approach. The implementation targets the most common depthwise convolution configuration: 3x3 filters with stride=1, padding≤1, and dilation=1. According to the description, this provides approximately 3% throughput improvement for a key customer model.
Key changes:
- Added ARM64 NEON depthwise convolution kernel implementation with vectorized inner loops
- Extended depthwise convolution algorithm support from WASM_SCALAR to ARM64 platforms
- Reorganized and renamed NCHWC-related files for clarity
Reviewed changes
Copilot reviewed 7 out of 9 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| onnxruntime/core/mlas/lib/sconv_nchw_kernel_neon.cpp | New file implementing NEON-optimized depthwise 3x3 convolution with vectorized accumulation for ARM64 |
| onnxruntime/core/mlas/lib/spool_nchwc_kernel_neon.cpp | Renamed from spool_kernel_neon.cpp for NCHWC-specific pooling kernels |
| onnxruntime/core/mlas/lib/sconv_nchwc_kernel_neon.h | Renamed from sconv.h; defines convolution kernel flags for NCHWC operations |
| onnxruntime/core/mlas/lib/sconv_nchwc_kernel_neon.cpp | Updated file name reference and include path after renaming |
| onnxruntime/core/mlas/lib/convolve.cpp | Added ARM64 support for the depthwise algorithm with threading and lambda-based stride validation |
| onnxruntime/core/mlas/lib/mlasi.h | Extended depthwise convolution function availability to ARM64 platforms |
| onnxruntime/core/mlas/inc/mlas.h | Extended MlasConvAlgorithmDepthwise enum availability to ARM64 |
| cmake/onnxruntime_mlas.cmake | Updated build configuration with new/renamed files and proper ARM64 source inclusion |
| onnxruntime/test/mlas/bench/bench_sconv.cpp | Added benchmark cases for external customer model depthwise convolutions |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 7 out of 9 changed files in this pull request and generated 1 comment.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Description
Motivation and approach taken:
Add a dedicated depthwise convolution kernel for the most common depthwise convolution configuration (3x3 filter, stride = 1, pad <= 1, dilation = 1) using NEON intrinsics. This performs significantly better than the current approach of Im2Col + SGemm. The Im2Col step extracts convolution patches, which is wasteful, and for a 3x3 filter the SGemm's K dimension is only 9; GEMMs are usually not optimized for such small K values. Hence, a dedicated kernel works much better. Initially, I ported over the Winograd-based NEON-accelerated depthwise convolution kernel from PyTorch, but I found that its performance is not very good, probably because it applies the Winograd transformation to the filter repeatedly. A better approach may be to transform the filter offline; that can be considered later (I reverted the PyTorch Winograd implementation in this commit: 2820a84).
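For intuition, here is a minimal sketch of what a vectorized inner loop for this configuration can look like. This is illustrative only, not the kernel added in this PR: the function and parameter names are made up, the output width is assumed to be a multiple of 4, and padding/edge handling is omitted.

```cpp
// Sketch: 3x3, stride-1 depthwise conv over the interior (non-padded)
// region of one channel, 4 output pixels per iteration. Hypothetical names;
// the actual kernel lives in onnxruntime/core/mlas/lib/sconv_nchw_kernel_neon.cpp.
#include <arm_neon.h>
#include <cstddef>

void DepthwiseConv3x3RowSketch(const float* input,   // input(row, 0) of one channel
                               size_t input_stride,  // elements per input row
                               const float* filter,  // 9 filter taps, row-major
                               float* output,        // output(row, 0)
                               size_t output_width)  // multiple of 4 for brevity
{
    // Broadcast the nine filter taps once, outside the pixel loop.
    const float32x4_t k00 = vdupq_n_f32(filter[0]), k01 = vdupq_n_f32(filter[1]),
                      k02 = vdupq_n_f32(filter[2]), k10 = vdupq_n_f32(filter[3]),
                      k11 = vdupq_n_f32(filter[4]), k12 = vdupq_n_f32(filter[5]),
                      k20 = vdupq_n_f32(filter[6]), k21 = vdupq_n_f32(filter[7]),
                      k22 = vdupq_n_f32(filter[8]);

    for (size_t x = 0; x < output_width; x += 4) {
        const float* r0 = input + x;          // top input row
        const float* r1 = r0 + input_stride;  // middle input row
        const float* r2 = r1 + input_stride;  // bottom input row

        // With stride 1, four adjacent outputs read overlapping windows, so
        // three unaligned loads per row (at offsets 0, 1, 2) feed all taps.
        float32x4_t acc = vmulq_f32(vld1q_f32(r0), k00);
        acc = vfmaq_f32(acc, vld1q_f32(r0 + 1), k01);
        acc = vfmaq_f32(acc, vld1q_f32(r0 + 2), k02);
        acc = vfmaq_f32(acc, vld1q_f32(r1), k10);
        acc = vfmaq_f32(acc, vld1q_f32(r1 + 1), k11);
        acc = vfmaq_f32(acc, vld1q_f32(r1 + 2), k12);
        acc = vfmaq_f32(acc, vld1q_f32(r2), k20);
        acc = vfmaq_f32(acc, vld1q_f32(r2 + 1), k21);
        acc = vfmaq_f32(acc, vld1q_f32(r2 + 2), k22);

        vst1q_f32(output + x, acc);  // store 4 output pixels
    }
}
```

Compared to Im2Col + SGemm, there is no patch-extraction pass and no K=9 GEMM: each output vector is produced directly from nine fused multiply-adds over overlapping input loads.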
The depthwise kernel added in this PR was authored by GPT5.1-Codex; with some minor bug fixes, it is now functionally correct and provides the perf boost we are seeking.
Unit tests:
Depthwise convolution tests already exist in the codebase; I don't see a need for new ones at this point.
Kernel benchmarking:
This is the kernel-level perf improvement from the MLAS Conv benchmarks (about 50% kernel latency improvement):
Motivation and Context
A key customer model has a few depthwise convolution operations, and this change provides a non-negligible ~3% throughput improvement using the customer-provided benchmarking setup.
For those interested, #26654 adds support for the same convolution variant but leverages SME1/SME2 through KleidiAI. This PR is conceptually the same but targets NEON-only platforms.
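For readers skimming the diff, the new path is only taken for the configuration described above. A hypothetical sketch of that gating follows; the struct and function names are stand-ins (the real code uses MLAS_CONV_PARAMETERS, and the actual checks in convolve.cpp may differ in detail).

```cpp
#include <cstddef>

// Stand-in for the conv parameters; not the actual MLAS_CONV_PARAMETERS layout.
struct ConvParamsSketch {
    size_t GroupCount, InputChannels;  // InputChannels = total channels
    size_t KernelH, KernelW;
    size_t StrideH, StrideW;
    size_t PadTop, PadLeft, PadBottom, PadRight;
    size_t DilationH, DilationW;
};

// True only for the configuration this PR targets:
// depthwise, 3x3 filter, stride 1, pad <= 1, dilation 1.
bool CanUseNeonDepthwise3x3(const ConvParamsSketch& p) {
    const bool depthwise = p.GroupCount > 1 && p.InputChannels == p.GroupCount;
    return depthwise &&
           p.KernelH == 3 && p.KernelW == 3 &&
           p.StrideH == 1 && p.StrideW == 1 &&
           p.PadTop <= 1 && p.PadLeft <= 1 &&
           p.PadBottom <= 1 && p.PadRight <= 1 &&
           p.DilationH == 1 && p.DilationW == 1;
}
```

Anything outside this shape falls back to the existing Im2Col + SGemm path.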