Conversation

Collaborator

@hunhoffe hunhoffe commented Nov 22, 2025

I am creating this PR so the code does not get left behind.

Update by James:

This PR adds an algorithm library that simplifies creating designs that run tensor transformations on the NPU. The algorithms are compatible with iron.jit. Initial support for C++ kernels is implemented (currently limited to the non-parallel algorithms); these kernels require the specific function signatures shown below. Follow-up work is needed to standardize kernel signatures and expand algorithm compatibility with kernels.

Using C++ Kernels (ExternalFunction)

The kernel signature and the ExternalFunction currently must match the algorithm's expected format. We further assume that array-like parameters in params are inputs that require ObjectFifos, and that scalar-typed parameters are inputs encoded as MLIR constants.

| Algorithm | C++ Kernel Signature | ExternalFunction `arg_types` |
| --- | --- | --- |
| `transform` | `void kernel(T* in, T* out, params...)` | `[tile_ty, tile_ty, *param_types]` |
| `transform_binary` | `void kernel(T* in1, T* in2, T* out, params...)` | `[tile_ty, tile_ty, tile_ty, *param_types]` |
| `for_each` | `void kernel(T* in, T* out, params...)` | `[tile_ty, tile_ty, *param_types]` |
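The assumption stated above (array-like parameters become ObjectFifo inputs, scalar parameters become MLIR constants) can be sketched as a small classifier. This is a hypothetical helper for illustration only, not part of the aie.iron API:

```python
import numpy as np

def classify_params(params):
    """Split extra kernel parameters the way the PR description assumes:
    array-like parameters map to ObjectFifo inputs, scalar parameters
    map to MLIR constants. Illustrative only."""
    fifo_params, const_params = [], []
    for p in params:
        # Anything with a shape (e.g. an ndarray) is treated as array-like.
        if isinstance(p, np.ndarray) or hasattr(p, "shape"):
            fifo_params.append(p)
        else:
            const_params.append(p)
    return fifo_params, const_params

fifos, consts = classify_params([np.zeros(16, dtype=np.int32), 3, 2.5])
print(len(fifos), consts)  # 1 [3, 2.5]
```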

@ypapadop-amd
Collaborator

For a little bikeshedding: for_each / transform are C++-isms (map was already taken). The Pythonic HOFs are map, filter, reduce, etc. https://book.pythontips.com/en/latest/map_filter.html
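For reference, the Pythonic higher-order functions mentioned look like this in plain Python:

```python
from functools import reduce

nums = [1, 2, 3, 4]
doubled = list(map(lambda x: x * 2, nums))        # [2, 4, 6, 8]
evens = list(filter(lambda x: x % 2 == 0, nums))  # [2, 4]
total = reduce(lambda a, b: a + b, nums)          # 10
```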

@hunhoffe hunhoffe force-pushed the parallel/algorithms branch from 5c5e193 to d5bc2bd Compare November 24, 2025 16:57
hunhoffe and others added 18 commits November 24, 2025 09:58
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
hunhoffe and others added 13 commits December 16, 2025 17:18
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
- Removed CoreFunction, consolidating on ExternalFunction, since PRs #2607 and #2628 showed functional JIT kernel usage with ExternalFunction.
- Split vector_scalar_mul into JIT and non-JIT versions to retain regression testing. The non-JIT version can be removed during the transition-to-JIT PR.
- Fixed a bug in the transform and transform_parallel algorithms, which applied the function to only a single element per tile instead of all elements in each tile.
- Fixed a bug when mixing task_group and non-task_group workers in transform_parallel and transform_parallel_binary.
- Fixed a bug with an incorrect TAP for transform_parallel.
- Added unit tests for for_each, transform, transform_binary, transform_parallel, transform_parallel_binary.
- Unit tests leave tile_size as a variable if there is a plan to parametrize it in transform.
…plied through iron.jit().

- Added READMEs to programming_examples for algorithms.
- Added checks for using ExternalFunctions with algorithms due to signature requirements.
- `vector_scalar_mul_jit.py` demonstrates using an algorithm with an ExternalFunction/external kernel.
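The per-tile bug noted above (applying the function to one element per tile instead of all of them) is easy to picture with a plain NumPy sketch of a tiled transform. The names here are illustrative, not the library's actual implementation:

```python
import numpy as np

def tiled_transform(func, inp, out, tile_size):
    """Apply func elementwise, one tile at a time.

    The fixed bug was equivalent to calling func on only the first
    element of each tile; the loop below touches every element."""
    n = inp.size
    assert n % tile_size == 0, "tensor must divide evenly into tiles"
    for start in range(0, n, tile_size):
        tile = inp[start:start + tile_size]
        # Write back the whole tile, not just tile[0].
        out[start:start + tile_size] = func(tile)

x = np.arange(8, dtype=np.int32)
y = np.empty_like(x)
tiled_transform(lambda a: a + 1, x, y, tile_size=4)
print(y)  # [1 2 3 4 5 6 7 8]
```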
Comment on lines +22 to +23
| `transform_parallel` | Parallel `transform` across multiple AIE tiles | No |
| `transform_parallel_binary` | Parallel `transform_binary` across multiple AIE tiles | No |
Collaborator

Is anything blocking extending this to support ExternalFunctions?

Collaborator

No, not necessarily. It's just that the current way of supporting ExternalFunctions is neither final nor ideal, IMO, so I figured I'd show a "flavour" of it in the non-parallel transforms. If you prefer, I can make the same set of assumptions about the ExternalFunction format and apply them to the parallel ones.

@hunhoffe hunhoffe changed the title [WIP] Parallel/algorithms Parallel/algorithms Feb 9, 2026
@yenjames yenjames marked this pull request as ready for review February 9, 2026 23:00
Copilot AI review requested due to automatic review settings February 9, 2026 23:00
Contributor

Copilot AI left a comment

Pull request overview

Adds an aie.iron.algorithms helper library (transform/for_each + parallel variants) intended to simplify building common tensor transformation designs for NPU execution via iron.jit, including initial integration with C++ ExternalFunction kernels.

Changes:

  • Introduces a new aie.iron.algorithms package (transform, transform_binary, for_each, and parallel variants) plus test coverage and runnable examples.
  • Updates JIT invocation to filter out non-runtime arguments (notably scalars / ExternalFunction) before calling the host runtime.
  • Extends ExternalFunction with basic argument count/type validation and adds tests for validation failures.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 16 comments.

| File | Description |
| --- | --- |
| `test/python/npu-xrt/test_jit_extern_functions.py` | Adds tests for ExternalFunction argument validation errors. |
| `test/python/npu-xrt/test_algorithms.py` | New test suite covering algorithm behavior and error handling. |
| `python/utils/jit.py` | Filters runtime args before kernel execution; adjusts the cached execution path. |
| `python/iron/kernel.py` | Adds an ExternalFunction debug flag plus argument validation on `__call__`. |
| `python/iron/algorithms/transform.py` | Implements unary/binary transform plus parallel variants. |
| `python/iron/algorithms/for_each.py` | Implements the in-place algorithm with optional ExternalFunction params. |
| `python/iron/algorithms/__init__.py` | Exposes algorithm entry points from the package. |
| `programming_examples/basic/vector_scalar_mul/vector_scalar_mul_jit.py` | Demonstrates using transform + ExternalFunction under JIT. |
| `programming_examples/basic/vector_scalar_mul/run_jit.lit` | Adds lit coverage for the JIT example. |
| `programming_examples/basic/vector_scalar_mul/README.md` | Documents the new JIT-based variant of the example. |
| `programming_examples/algorithms/transform.py` | Example for unary transform. |
| `programming_examples/algorithms/transform_binary.py` | Example for transform_binary. |
| `programming_examples/algorithms/transform_parallel.py` | Example for transform_parallel. |
| `programming_examples/algorithms/transform_parallel_binary.py` | Example for transform_parallel_binary. |
| `programming_examples/algorithms/for_each.py` | Example for for_each (contains verbose-mode bugs noted in comments). |
| `programming_examples/algorithms/run_jit.lit` | Adds lit coverage to run the new algorithm examples. |
| `programming_examples/algorithms/README.md` | Documents algorithms + ExternalFunction signature expectations (example issue noted in comments). |


```python
# Filter out non-tensor arguments (ExternalFunction, scalars)
# Only tensor args should be passed to the kernel
tensor_args = _filter_tensor_args(args)
return cached_kernel(*tensor_args, **kwargs)
```
Copilot AI Feb 9, 2026

The JIT wrapper returns the cached kernel result on cache hits but returns None on the initial compilation/execution path (no return statement after invoking the kernel). This makes the API behavior depend on whether the kernel is cached. Return the kernel invocation result consistently in both paths (or consistently return None).

Suggested change

```diff
-return cached_kernel(*tensor_args, **kwargs)
+cached_kernel(*tensor_args, **kwargs)
```
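The consistency issue can be sketched in isolation with a toy memoizing wrapper (illustrative only, not the actual jit.py code): whatever the cache state, both paths should return the same thing.

```python
def jit(fn):
    """Toy caching decorator: 'compile' once, then reuse.

    Returning the call result on BOTH the first-call path and the
    cache-hit path keeps the API behavior independent of cache state."""
    cache = {}
    def wrapper(*args):
        if fn not in cache:
            cache[fn] = fn       # stand-in for "compile the kernel"
        return cache[fn](*args)  # same return on hit and miss
    return wrapper

@jit
def add(a, b):
    return a + b

print(add(1, 2), add(3, 4))  # 3 7  (both paths return a value)
```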

```python
arg_types (list[type[np.ndarray] | np.dtype], optional): The type signature of the function. Defaults to [].
include_dirs (list[str], optional): Additional include directories. Defaults to [].
compile_flags (list[str], optional): Additional compilation flags. Defaults to [].
debug (bool, optional): Enable debug logging. Defaults to True.
```
Copilot AI Feb 9, 2026

The debug parameter default is False, but the docstring says "Defaults to True." Please fix the docstring to match the actual default (or change the default if the docstring is correct).

Suggested change

```diff
-debug (bool, optional): Enable debug logging. Defaults to True.
+debug (bool, optional): Enable debug logging. Defaults to False.
```

```python
)

num_inputs = len(inputs)
num_elements = np.size(inputs[0])
```
Copilot AI Feb 9, 2026

Same issue as above: np.size(inputs[0]) may force a tensor materialization/sync. Prefer inputs[0].numel() or int(np.prod(inputs[0].shape)) to avoid host transfers.

Suggested change

```diff
-num_elements = np.size(inputs[0])
+num_elements = int(np.prod(inputs[0].shape))
```

Comment on lines +34 to +41

```python
def test_transform_add():
    """Test transform algorithm with simple add_one operation"""
    input = iron.randint(0, 100, (1024,), dtype=np.int32, device="npu")
    output = iron.zeros_like(input)
    original = input.numpy().copy()
    iron.jit(is_placed=False)(transform)(lambda a: a + 1, input, input)

    assert np.allclose(original + 1, input.numpy())
```
Copilot AI Feb 9, 2026

This test constructs an output tensor but then calls transform(..., input, input), writing the result back into input and leaving output unused. This can mask bugs where transform fails to write to the provided output buffer. Consider passing output to transform and asserting on output, or switch to for_each if the intent is an in-place update.
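The masking effect described here is easy to reproduce with a NumPy stand-in: a buggy transform that never writes its output still passes a test that aliases input and output, but fails one that asserts on a separate output buffer. The transform below is a mock, not the library's implementation:

```python
import numpy as np

def buggy_transform(func, inp, out):
    """Mock transform that (incorrectly) writes into inp and ignores out."""
    inp[:] = func(inp)

x = np.arange(4, dtype=np.int32)
y = np.zeros_like(x)
original = x.copy()
buggy_transform(lambda a: a + 1, x, y)

# Aliased-style check: passes, hiding the bug.
assert np.array_equal(original + 1, x)

# Separate-output check: exposes the bug (y was never written).
assert not np.array_equal(original + 1, y)
```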

@@ -0,0 +1,64 @@
# transform_binary.py -*- Python -*-
Copilot AI Feb 9, 2026

The file header comment says transform_binary.py but this is for_each.py. Please correct the header to avoid confusion when navigating examples.

Suggested change

```diff
-# transform_binary.py -*- Python -*-
+# for_each.py -*- Python -*-
```

```python
for_each(scale, tensor, factor, tile_size)
"""
is_external_func = isinstance(func, iron.ExternalFunction)
num_elements = np.size(tensor)
```
Copilot AI Feb 9, 2026

np.size(tensor) will typically coerce the Iron tensor to a NumPy array (device sync) to compute the element count. Prefer tensor.numel() / int(np.prod(tensor.shape)) to avoid unnecessary synchronization.

Suggested change

```diff
-num_elements = np.size(tensor)
+num_elements = tensor.numel() if hasattr(tensor, "numel") else int(np.prod(tensor.shape))
```
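The shape-based count can be checked against np.size with plain NumPy. On a host array the two agree; the point for device-backed tensors is that the shape is plain metadata, so counting from it needs no data transfer:

```python
import numpy as np

a = np.zeros((4, 8), dtype=np.int32)

# np.size(a) inspects the array object itself; for a device-backed
# tensor this can trigger a host copy. The shape is metadata, so
# computing the element count from it avoids touching the data.
n_from_size = np.size(a)
n_from_shape = int(np.prod(a.shape))
print(n_from_size, n_from_shape)  # 32 32
```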


```python
print(f"{'input':>4} + {'output':>4}")
print("-" * 34)
count = input.numel()
```
Copilot AI Feb 9, 2026

Variable count is not used.

Suggested change

```diff
-count = input.numel()
```


```python
print(f"{'input':>4} + {'output':>4}")
print("-" * 34)
count = input.numel()
```
Copilot AI Feb 9, 2026

Variable count is not used.

Suggested change

```diff
-count = input.numel()
```

```python
vectorized = True

# Define tensor types
tensor_ty = np.ndarray[(tensor_size,), np.dtype[in1_dtype]]
```
Copilot AI Feb 9, 2026

Variable tensor_ty is not used.

Suggested change

```diff
-tensor_ty = np.ndarray[(tensor_size,), np.dtype[in1_dtype]]
```


```python
from aie.iron import ObjectFifo, Program, Runtime, Worker
from aie.iron.placers import SequentialPlacer
from aie.iron.device import NPU1Col1, NPU2Col1, NPU1, NPU2
```
Copilot AI Feb 9, 2026

Import of 'NPU1Col1' is not used.
Import of 'NPU2Col1' is not used.
Import of 'NPU1' is not used.

Suggested change

```diff
-from aie.iron.device import NPU1Col1, NPU2Col1, NPU1, NPU2
+from aie.iron.device import NPU2
```
