Conversation

Collaborator

@hunhoffe hunhoffe commented Nov 22, 2025

I am creating this PR so the code does not get left behind.

Update by James:

This PR adds an algorithm library that simplifies creating designs that run tensor transformations on the NPU. The algorithms are compatible with iron.jit. Initial support for C++ kernels is implemented (currently limited to the non-parallel algorithms); these kernels require the specific function signatures shown below. Follow-up work is needed to standardize kernel signatures and expand algorithm compatibility with kernels.

Using C++ Kernels (ExternalFunction)

The kernel signature and the ExternalFunction currently must match the algorithm's expected format. We further assume that array-like parameters in params are inputs that require ObjectFifos, and that scalar-typed parameters are inputs encoded as MLIR constants.

| Algorithm | C++ Kernel Signature | ExternalFunction `arg_types` |
| --- | --- | --- |
| `transform` | `void kernel(T* in, T* out, params...)` | `[tile_ty, tile_ty, *param_types]` |
| `transform_binary` | `void kernel(T* in1, T* in2, T* out, params...)` | `[tile_ty, tile_ty, tile_ty, *param_types]` |
| `for_each` | `void kernel(T* in, T* out, params...)` | `[tile_ty, tile_ty, *param_types]` |
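The assumption stated above (array-like parameters become ObjectFifo inputs, scalar parameters become MLIR constants) can be sketched as a small classifier. This is a hypothetical helper for illustration only, not part of the aie.iron API:

```python
import numpy as np

def classify_params(params):
    """Split extra kernel parameters the way the PR description assumes:
    array-like parameters map to ObjectFifo inputs, scalar parameters
    map to MLIR constants. Illustrative only."""
    fifo_params, const_params = [], []
    for p in params:
        # Anything with a shape (e.g. an ndarray) is treated as array-like.
        if isinstance(p, np.ndarray) or hasattr(p, "shape"):
            fifo_params.append(p)
        else:
            const_params.append(p)
    return fifo_params, const_params

fifos, consts = classify_params([np.zeros(16, dtype=np.int32), 3, 2.5])
print(len(fifos), consts)  # 1 [3, 2.5]
```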

@ypapadop-amd
Collaborator

For a little bikeshedding: for_each / transform are C++-isms (map was already taken). The Pythonic HOFs are map, filter, reduce, etc. https://book.pythontips.com/en/latest/map_filter.html
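For reference, the Pythonic higher-order functions mentioned look like this in plain Python:

```python
from functools import reduce

nums = [1, 2, 3, 4]
doubled = list(map(lambda x: x * 2, nums))        # [2, 4, 6, 8]
evens = list(filter(lambda x: x % 2 == 0, nums))  # [2, 4]
total = reduce(lambda a, b: a + b, nums)          # 10
```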

@hunhoffe hunhoffe force-pushed the parallel/algorithms branch from 5c5e193 to d5bc2bd Compare November 24, 2025 16:57
hunhoffe and others added 18 commits November 24, 2025 09:58
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
hunhoffe and others added 13 commits December 16, 2025 17:18
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
- Removed CoreFunction, consolidating on ExternalFunction, since PRs #2607 and #2628 showed functional JIT kernel usage with ExternalFunction.
- Split vector_scalar_mul into JIT and non-JIT versions to retain regression testing. The non-JIT version can be removed during the transition-to-JIT PR.
- Fixed a bug in the transform and transform_parallel algorithms, which applied the function to only a single element per tile instead of all elements in each tile.
- Fixed a bug when mixing task_group and non-task_group workers in transform_parallel and transform_parallel_binary.
- Fixed a bug with an incorrect TAP for transform_parallel.
- Added unit tests for for_each, transform, transform_binary, transform_parallel, transform_parallel_binary.
- Unit tests leave tile_size as a variable if there is a plan to parametrize it in transform.
…plied through iron.jit().

- Added READMEs to programming_examples for algorithms.
- Added checks for using ExternalFunctions with algorithms due to signature requirements.
- `vector_scalar_mul_jit.py` demonstrates using an algorithm with an ExternalFunction/external kernel.
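The per-tile bug noted above (applying the function to one element per tile instead of all of them) is easy to picture with a plain NumPy sketch of a tiled transform. The names here are illustrative, not the library's actual implementation:

```python
import numpy as np

def tiled_transform(func, inp, out, tile_size):
    """Apply func elementwise, one tile at a time.

    The fixed bug was equivalent to calling func on only the first
    element of each tile; the loop below touches every element."""
    n = inp.size
    assert n % tile_size == 0, "tensor must divide evenly into tiles"
    for start in range(0, n, tile_size):
        tile = inp[start:start + tile_size]
        # Write back the whole tile, not just tile[0].
        out[start:start + tile_size] = func(tile)

x = np.arange(8, dtype=np.int32)
y = np.empty_like(x)
tiled_transform(lambda a: a + 1, x, y, tile_size=4)
print(y)  # [1 2 3 4 5 6 7 8]
```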
Comment on lines +22 to +23
| `transform_parallel` | Parallel `transform` across multiple AIE tiles | No |
| `transform_parallel_binary` | Parallel `transform_binary` across multiple AIE tiles | No |
Collaborator

Is anything blocking extending this to support ExternalFunctions?

Collaborator

No, not necessarily. It's just that the current way of supporting ExternalFunctions is neither final nor ideal, IMO, so I figured I'd show a "flavour" of it in the non-parallel transforms. If you prefer, I can make the same set of assumptions about the ExternalFunction format and apply them to the parallel ones.

@hunhoffe hunhoffe changed the title [WIP] Parallel/algorithms Parallel/algorithms Feb 9, 2026
@yenjames yenjames marked this pull request as ready for review February 9, 2026 23:00
Copilot AI review requested due to automatic review settings February 9, 2026 23:00
Contributor

Copilot AI left a comment

Pull request overview

Adds an aie.iron.algorithms helper library (transform/for_each + parallel variants) intended to simplify building common tensor transformation designs for NPU execution via iron.jit, including initial integration with C++ ExternalFunction kernels.

Changes:

  • Introduces a new aie.iron.algorithms package (transform, transform_binary, for_each, and parallel variants) plus test coverage and runnable examples.
  • Updates JIT invocation to filter out non-runtime arguments (notably scalars / ExternalFunction) before calling the host runtime.
  • Extends ExternalFunction with basic argument count/type validation and adds tests for validation failures.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 16 comments.

| File | Description |
| --- | --- |
| `test/python/npu-xrt/test_jit_extern_functions.py` | Adds tests for ExternalFunction argument validation errors. |
| `test/python/npu-xrt/test_algorithms.py` | New test suite covering algorithm behavior and error handling. |
| `python/utils/jit.py` | Filters runtime args before kernel execution; adjusts the cached execution path. |
| `python/iron/kernel.py` | Adds an ExternalFunction debug flag plus argument validation on `__call__`. |
| `python/iron/algorithms/transform.py` | Implements unary/binary transform plus parallel variants. |
| `python/iron/algorithms/for_each.py` | Implements the in-place algorithm with optional ExternalFunction params. |
| `python/iron/algorithms/__init__.py` | Exposes algorithm entry points from the package. |
| `programming_examples/basic/vector_scalar_mul/vector_scalar_mul_jit.py` | Demonstrates using transform + ExternalFunction under JIT. |
| `programming_examples/basic/vector_scalar_mul/run_jit.lit` | Adds lit coverage for the JIT example. |
| `programming_examples/basic/vector_scalar_mul/README.md` | Documents the new JIT-based variant of the example. |
| `programming_examples/algorithms/transform.py` | Example for unary transform. |
| `programming_examples/algorithms/transform_binary.py` | Example for transform_binary. |
| `programming_examples/algorithms/transform_parallel.py` | Example for transform_parallel. |
| `programming_examples/algorithms/transform_parallel_binary.py` | Example for transform_parallel_binary. |
| `programming_examples/algorithms/for_each.py` | Example for for_each (contains verbose-mode bugs noted in comments). |
| `programming_examples/algorithms/run_jit.lit` | Adds lit coverage to run the new algorithm examples. |
| `programming_examples/algorithms/README.md` | Documents algorithms + ExternalFunction signature expectations (example issue noted in comments). |


```python
# Filter out non-tensor arguments (ExternalFunction, scalars)
# Only tensor args should be passed to the kernel
tensor_args = _filter_tensor_args(args)
return cached_kernel(*tensor_args, **kwargs)
```
Copilot AI Feb 9, 2026

The JIT wrapper returns the cached kernel result on cache hits but returns None on the initial compilation/execution path (no return statement after invoking the kernel). This makes the API behavior depend on whether the kernel is cached. Return the kernel invocation result consistently in both paths (or consistently return None).

Suggested change

```diff
-return cached_kernel(*tensor_args, **kwargs)
+cached_kernel(*tensor_args, **kwargs)
```
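The consistency issue can be sketched in isolation with a toy memoizing wrapper (illustrative only, not the actual jit.py code): whatever the cache state, both paths should return the same thing.

```python
def jit(fn):
    """Toy caching decorator: 'compile' once, then reuse.

    Returning the call result on BOTH the first-call path and the
    cache-hit path keeps the API behavior independent of cache state."""
    cache = {}
    def wrapper(*args):
        if fn not in cache:
            cache[fn] = fn       # stand-in for "compile the kernel"
        return cache[fn](*args)  # same return on hit and miss
    return wrapper

@jit
def add(a, b):
    return a + b

print(add(1, 2), add(3, 4))  # 3 7  (both paths return a value)
```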

```python
arg_types (list[type[np.ndarray] | np.dtype], optional): The type signature of the function. Defaults to [].
include_dirs (list[str], optional): Additional include directories. Defaults to [].
compile_flags (list[str], optional): Additional compilation flags. Defaults to [].
debug (bool, optional): Enable debug logging. Defaults to True.
```
Copilot AI Feb 9, 2026

The debug parameter default is False, but the docstring says "Defaults to True." Please fix the docstring to match the actual default (or change the default if the docstring is correct).

Suggested change

```diff
-debug (bool, optional): Enable debug logging. Defaults to True.
+debug (bool, optional): Enable debug logging. Defaults to False.
```

```python
)

num_inputs = len(inputs)
num_elements = np.size(inputs[0])
```
Copilot AI Feb 9, 2026

Same issue as above: np.size(inputs[0]) may force a tensor materialization/sync. Prefer inputs[0].numel() or int(np.prod(inputs[0].shape)) to avoid host transfers.

Suggested change

```diff
-num_elements = np.size(inputs[0])
+num_elements = int(np.prod(inputs[0].shape))
```

Comment on lines +34 to +41

```python
def test_transform_add():
    """Test transform algorithm with simple add_one operation"""
    input = iron.randint(0, 100, (1024,), dtype=np.int32, device="npu")
    output = iron.zeros_like(input)
    original = input.numpy().copy()
    iron.jit(is_placed=False)(transform)(lambda a: a + 1, input, input)

    assert np.allclose(original + 1, input.numpy())
```
Copilot AI Feb 9, 2026

This test constructs an output tensor but then calls transform(..., input, input), writing the result back into input and leaving output unused. This can mask bugs where transform fails to write to the provided output buffer. Consider passing output to transform and asserting on output, or switch to for_each if the intent is an in-place update.
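The masking effect described here is easy to reproduce with a NumPy stand-in: a buggy transform that never writes its output still passes a test that aliases input and output, but fails one that asserts on a separate output buffer. The transform below is a mock, not the library's implementation:

```python
import numpy as np

def buggy_transform(func, inp, out):
    """Mock transform that (incorrectly) writes into inp and ignores out."""
    inp[:] = func(inp)

x = np.arange(4, dtype=np.int32)
y = np.zeros_like(x)
original = x.copy()
buggy_transform(lambda a: a + 1, x, y)

# Aliased-style check: passes, hiding the bug.
assert np.array_equal(original + 1, x)

# Separate-output check: exposes the bug (y was never written).
assert not np.array_equal(original + 1, y)
```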

@@ -0,0 +1,64 @@
# transform_binary.py -*- Python -*-
Copilot AI Feb 9, 2026

The file header comment says transform_binary.py but this is for_each.py. Please correct the header to avoid confusion when navigating examples.

Suggested change

```diff
-# transform_binary.py -*- Python -*-
+# for_each.py -*- Python -*-
```

```python
for_each(scale, tensor, factor, tile_size)
"""
is_external_func = isinstance(func, iron.ExternalFunction)
num_elements = np.size(tensor)
```
Copilot AI Feb 9, 2026

np.size(tensor) will typically coerce the Iron tensor to a NumPy array (device sync) to compute the element count. Prefer tensor.numel() / int(np.prod(tensor.shape)) to avoid unnecessary synchronization.

Suggested change

```diff
-num_elements = np.size(tensor)
+num_elements = tensor.numel() if hasattr(tensor, "numel") else int(np.prod(tensor.shape))
```
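The shape-based count can be checked against np.size with plain NumPy. On a host array the two agree; the point for device-backed tensors is that the shape is plain metadata, so counting from it needs no data transfer:

```python
import numpy as np

a = np.zeros((4, 8), dtype=np.int32)

# np.size(a) inspects the array object itself; for a device-backed
# tensor this can trigger a host copy. The shape is metadata, so
# computing the element count from it avoids touching the data.
n_from_size = np.size(a)
n_from_shape = int(np.prod(a.shape))
print(n_from_size, n_from_shape)  # 32 32
```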


```python
print(f"{'input':>4} + {'output':>4}")
print("-" * 34)
count = input.numel()
```
Copilot AI Feb 9, 2026

Variable count is not used.

Suggested change

```diff
-count = input.numel()
```


```python
print(f"{'input':>4} + {'output':>4}")
print("-" * 34)
count = input.numel()
```
Copilot AI Feb 9, 2026

Variable count is not used.

Suggested change

```diff
-count = input.numel()
```

```python
vectorized = True

# Define tensor types
tensor_ty = np.ndarray[(tensor_size,), np.dtype[in1_dtype]]
```
Copilot AI Feb 9, 2026

Variable tensor_ty is not used.

Suggested change

```diff
-tensor_ty = np.ndarray[(tensor_size,), np.dtype[in1_dtype]]
```


```python
from aie.iron import ObjectFifo, Program, Runtime, Worker
from aie.iron.placers import SequentialPlacer
from aie.iron.device import NPU1Col1, NPU2Col1, NPU1, NPU2
```
Copilot AI Feb 9, 2026

Import of 'NPU1Col1' is not used.
Import of 'NPU2Col1' is not used.
Import of 'NPU1' is not used.

Suggested change

```diff
-from aie.iron.device import NPU1Col1, NPU2Col1, NPU1, NPU2
+from aie.iron.device import NPU2
```
