[MLAS/NEON] Add dedicated kernel for depthwise convolution for ARM64 using NEON intrinsics #26688
Conversation
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
You can commit the suggested changes from lintrunner.
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Pull request overview
This PR adds a dedicated NEON-optimized depthwise convolution kernel for ARM64 to improve performance over the current Im2Col + SGemm approach. The implementation targets the most common depthwise convolution configuration: 3x3 filters with stride=1, padding≤1, and dilation=1. According to the description, this provides approximately 3% throughput improvement for a key customer model.
Key changes:
- Added ARM64 NEON depthwise convolution kernel implementation with vectorized inner loops
- Extended depthwise convolution algorithm support from WASM_SCALAR to ARM64 platforms
- Reorganized and renamed NCHWC-related files for clarity
Reviewed changes
Copilot reviewed 7 out of 9 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| onnxruntime/core/mlas/lib/sconv_nchw_kernel_neon.cpp | New file implementing NEON-optimized depthwise 3x3 convolution with vectorized accumulation for ARM64 |
| onnxruntime/core/mlas/lib/spool_nchwc_kernel_neon.cpp | Renamed from spool_kernel_neon.cpp for NCHWC-specific pooling kernels |
| onnxruntime/core/mlas/lib/sconv_nchwc_kernel_neon.h | Renamed from sconv.h; defines convolution kernel flags for NCHWC operations |
| onnxruntime/core/mlas/lib/sconv_nchwc_kernel_neon.cpp | Updated file name reference and include path after renaming |
| onnxruntime/core/mlas/lib/convolve.cpp | Added ARM64 support for the depthwise algorithm with threading and lambda-based stride validation |
| onnxruntime/core/mlas/lib/mlasi.h | Extended depthwise convolution function availability to ARM64 platforms |
| onnxruntime/core/mlas/inc/mlas.h | Extended MlasConvAlgorithmDepthwise enum availability to ARM64 |
| cmake/onnxruntime_mlas.cmake | Updated build configuration with new/renamed files and proper ARM64 source inclusion |
| onnxruntime/test/mlas/bench/bench_sconv.cpp | Added benchmark cases for external customer model depthwise convolutions |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 7 out of 9 changed files in this pull request and generated 1 comment.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Description
Motivation and approach taken:
Add a dedicated depthwise convolution kernel for the most common depthwise convolution configuration (3x3 filter, stride = 1, pad <= 1, dilation = 1) using NEON intrinsics. This performs significantly better than the current approach of Im2Col + SGemm. The Im2Col step extracts convolution patches, which is wasteful, and for a 3x3 filter the SGemm's K dimension is only 9; GEMMs are usually not optimized for such small K values. Hence, a dedicated kernel works much better. Initially, I ported over the Winograd-based NEON-accelerated depthwise convolution kernel from PyTorch, but I found that its performance is not very good, probably because it applies the Winograd transformation to the filter repeatedly. A better approach may be to transform the filter offline; that can be considered later (I reverted the PyTorch Winograd implementation in this commit: 2820a84).
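For intuition, here is a minimal sketch of what a vectorized inner loop for this configuration can look like. This is illustrative only, not the kernel added in this PR: the function and parameter names are made up, the output width is assumed to be a multiple of 4, and padding/edge handling is omitted.

```cpp
// Sketch: 3x3, stride-1 depthwise conv over the interior (non-padded)
// region of one channel, 4 output pixels per iteration. Hypothetical names;
// the actual kernel lives in onnxruntime/core/mlas/lib/sconv_nchw_kernel_neon.cpp.
#include <arm_neon.h>
#include <cstddef>

void DepthwiseConv3x3RowSketch(const float* input,   // input(row, 0) of one channel
                               size_t input_stride,  // elements per input row
                               const float* filter,  // 9 filter taps, row-major
                               float* output,        // output(row, 0)
                               size_t output_width)  // multiple of 4 for brevity
{
    // Broadcast the nine filter taps once, outside the pixel loop.
    const float32x4_t k00 = vdupq_n_f32(filter[0]), k01 = vdupq_n_f32(filter[1]),
                      k02 = vdupq_n_f32(filter[2]), k10 = vdupq_n_f32(filter[3]),
                      k11 = vdupq_n_f32(filter[4]), k12 = vdupq_n_f32(filter[5]),
                      k20 = vdupq_n_f32(filter[6]), k21 = vdupq_n_f32(filter[7]),
                      k22 = vdupq_n_f32(filter[8]);

    for (size_t x = 0; x < output_width; x += 4) {
        const float* r0 = input + x;          // top input row
        const float* r1 = r0 + input_stride;  // middle input row
        const float* r2 = r1 + input_stride;  // bottom input row

        // With stride 1, four adjacent outputs read overlapping windows, so
        // three unaligned loads per row (at offsets 0, 1, 2) feed all taps.
        float32x4_t acc = vmulq_f32(vld1q_f32(r0), k00);
        acc = vfmaq_f32(acc, vld1q_f32(r0 + 1), k01);
        acc = vfmaq_f32(acc, vld1q_f32(r0 + 2), k02);
        acc = vfmaq_f32(acc, vld1q_f32(r1), k10);
        acc = vfmaq_f32(acc, vld1q_f32(r1 + 1), k11);
        acc = vfmaq_f32(acc, vld1q_f32(r1 + 2), k12);
        acc = vfmaq_f32(acc, vld1q_f32(r2), k20);
        acc = vfmaq_f32(acc, vld1q_f32(r2 + 1), k21);
        acc = vfmaq_f32(acc, vld1q_f32(r2 + 2), k22);

        vst1q_f32(output + x, acc);  // store 4 output pixels
    }
}
```

Compared to Im2Col + SGemm, there is no patch-extraction pass and no K=9 GEMM: each output vector is produced directly from nine fused multiply-adds over overlapping input loads.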
The depthwise kernel added in this PR was authored by GPT5.1-Codex; with some minor bug fixes, it is now functionally correct and provides the perf boost we are seeking.
Unit tests:
Depthwise convolution tests already exist in the codebase; I don't see a need for new ones at this point.
Kernel benchmarking:
This is the kernel-level perf improvement from the MLAS Conv benchmarks (about 50% kernel latency improvement):
Motivation and Context
A key customer model has a few depthwise convolution operations, and this change provides a non-negligible ~3% throughput improvement using the customer-provided benchmarking setup.
For those interested, #26654 adds support for the same convolution variant but leverages SME1/SME2 through KleidiAI. This PR is conceptually the same but targets NEON-only platforms.
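For readers skimming the diff, the new path is only taken for the configuration described above. A hypothetical sketch of that gating follows; the struct and function names are stand-ins (the real code uses MLAS_CONV_PARAMETERS, and the actual checks in convolve.cpp may differ in detail).

```cpp
#include <cstddef>

// Stand-in for the conv parameters; not the actual MLAS_CONV_PARAMETERS layout.
struct ConvParamsSketch {
    size_t GroupCount, InputChannels;  // InputChannels = total channels
    size_t KernelH, KernelW;
    size_t StrideH, StrideW;
    size_t PadTop, PadLeft, PadBottom, PadRight;
    size_t DilationH, DilationW;
};

// True only for the configuration this PR targets:
// depthwise, 3x3 filter, stride 1, pad <= 1, dilation 1.
bool CanUseNeonDepthwise3x3(const ConvParamsSketch& p) {
    const bool depthwise = p.GroupCount > 1 && p.InputChannels == p.GroupCount;
    return depthwise &&
           p.KernelH == 3 && p.KernelW == 3 &&
           p.StrideH == 1 && p.StrideW == 1 &&
           p.PadTop <= 1 && p.PadLeft <= 1 &&
           p.PadBottom <= 1 && p.PadRight <= 1 &&
           p.DilationH == 1 && p.DilationW == 1;
}
```

Anything outside this shape falls back to the existing Im2Col + SGemm path.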