Conversation

@hariharans29 hariharans29 commented Dec 1, 2025

Description

Motivation and approach taken:

Add a dedicated depthwise convolution kernel for the most common depthwise convolution configuration (3x3 filter, stride = 1, pad <= 1, dilation = 1) using NEON intrinsics. This performs significantly better than the current Im2Col + SGemm approach: the Im2Col step wastefully materializes convolution patches, and for a 3x3 filter the resulting SGemm has K = 9, a shape that GEMM implementations are generally not optimized for. A dedicated kernel therefore works much better.

Initially, I ported over the Winograd-based NEON-accelerated depthwise convolution kernel from PyTorch, but I found that its performance is not very good. Its poor performance is probably due to applying the Winograd transformation to the filter repeatedly. A better approach may be to transform the filter offline; that can be considered later (I reverted the PyTorch Winograd implementation in this commit: 2820a84).

The depthwise kernel added in this PR was authored by GPT5.1-Codex, and with some minor bug fixes it now appears to be functionally correct and provides the perf boost we are seeking.

Unit tests:
Depthwise convolution tests already exist in the codebase; I don't see a need for new ones at this point.

Kernel benchmarking:
This is the kernel-level perf improvement from the MLAS Conv benchmarks (about a 50% kernel latency improvement):

[Benchmark screenshot: MLAS Conv benchmark results]

Motivation and Context

A key customer model had a few depthwise convolution operations, and this change provides a non-negligible ~3% throughput improvement in the customer-provided benchmarking setup.

For those interested, #26654 adds support for the same convolution variant but leverages SME1/SME2 through KleidiAI. This PR is conceptually the same but targets NEON-only platforms.

@hariharans29 hariharans29 changed the title WIP: Conv Expt [DO NOT REVIEW] WIP: Conv Expt Dec 2, 2025
@hariharans29

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).


@github-actions github-actions bot left a comment


You can commit the suggested changes from lintrunner.

@hariharans29 hariharans29 changed the title [DO NOT REVIEW] WIP: Conv Expt [MLAS/NEON] Add dedicated kernel for depthwise convolution for ARM64 using NEON intrinsics Dec 4, 2025
hariharans29 and others added 3 commits December 4, 2025 07:14
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@hariharans29 hariharans29 requested a review from Copilot December 5, 2025 07:08
Copilot finished reviewing on behalf of hariharans29 December 5, 2025 07:12

Copilot AI left a comment


Pull request overview

This PR adds a dedicated NEON-optimized depthwise convolution kernel for ARM64 to improve performance over the current Im2Col + SGemm approach. The implementation targets the most common depthwise convolution configuration: 3x3 filters with stride=1, padding≤1, and dilation=1. According to the description, this provides approximately 3% throughput improvement for a key customer model.

Key changes:

  • Added ARM64 NEON depthwise convolution kernel implementation with vectorized inner loops
  • Extended depthwise convolution algorithm support from WASM_SCALAR to ARM64 platforms
  • Reorganized and renamed NCHWC-related files for clarity

Reviewed changes

Copilot reviewed 7 out of 9 changed files in this pull request and generated 10 comments.

Summary per file:

  • onnxruntime/core/mlas/lib/sconv_nchw_kernel_neon.cpp — New file implementing NEON-optimized depthwise 3x3 convolution with vectorized accumulation for ARM64
  • onnxruntime/core/mlas/lib/spool_nchwc_kernel_neon.cpp — Renamed from spool_kernel_neon.cpp for NCHWC-specific pooling kernels
  • onnxruntime/core/mlas/lib/sconv_nchwc_kernel_neon.h — Renamed from sconv.h, defines convolution kernel flags for NCHWC operations
  • onnxruntime/core/mlas/lib/sconv_nchwc_kernel_neon.cpp — Updated file name reference and include path after renaming
  • onnxruntime/core/mlas/lib/convolve.cpp — Added ARM64 support for depthwise algorithm with threading, lambda-based stride validation
  • onnxruntime/core/mlas/lib/mlasi.h — Extended depthwise convolution function availability to ARM64 platforms
  • onnxruntime/core/mlas/inc/mlas.h — Extended MlasConvAlgorithmDepthwise enum availability to ARM64
  • cmake/onnxruntime_mlas.cmake — Updated build configuration with new/renamed files and proper ARM64 source inclusion
  • onnxruntime/test/mlas/bench/bench_sconv.cpp — Added benchmark cases for external customer model depthwise convolutions


hariharans29 and others added 5 commits December 4, 2025 23:21
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 9 changed files in this pull request and generated 1 comment.


