Parallel/algorithms #2730
Conversation
For a little bikeshedding, `for_each` / `transform` are C++-isms (
Signed-off-by: Muhammad Awad <MuhammadAbdelghaffar.Awad@amd.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
- Fixed a bug in the `transform` and `transform_parallel` algorithms, which applied the function to only a single element per tile instead of all elements per tile.
- Fixed a bug with mixing task_group and non-task_group workers in `transform_parallel` and `transform_parallel_binary`.
- Fixed a bug with an incorrect TAP for `transform_parallel`.
- Added unit tests for `for_each`, `transform`, `transform_binary`, `transform_parallel`, `transform_parallel_binary`.
- Unit tests leave `tile_size` as a variable since there is a plan to parametrize it in `transform`.
…num_elements // num_columns elements.
…plied through iron.jit().
- Added READMEs to programming_examples for algorithms.
- Added checks for using ExternalFunctions with algorithms due to signature requirements.
- `vector_scalar_mul_jit.py` demonstrates using an algorithm with an ExternalFunction/external kernel.
| `transform_parallel` | Parallel `transform` across multiple AIE tiles | No |
| `transform_parallel_binary` | Parallel `transform_binary` across multiple AIE tiles | No |
Is anything blocking extending this to support ExternalFunctions?
No, not necessarily. It's just that the current way of supporting ExternalFunctions is neither final nor ideal IMO, so I figured I'd show a "flavour" of it in the non-parallel transforms. If you prefer, I can make the same set of assumptions for the ExternalFunction format and apply them to the parallel ones.
Pull request overview
Adds an aie.iron.algorithms helper library (transform/for_each + parallel variants) intended to simplify building common tensor transformation designs for NPU execution via iron.jit, including initial integration with C++ ExternalFunction kernels.
Changes:
- Introduces a new `aie.iron.algorithms` package (`transform`, `transform_binary`, `for_each`, and parallel variants) plus test coverage and runnable examples.
- Updates JIT invocation to filter out non-runtime arguments (notably scalars / `ExternalFunction`) before calling the host runtime.
- Extends `ExternalFunction` with basic argument count/type validation and adds tests for validation failures.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| test/python/npu-xrt/test_jit_extern_functions.py | Adds tests for ExternalFunction argument validation errors. |
| test/python/npu-xrt/test_algorithms.py | New test suite covering algorithm behavior and error handling. |
| python/utils/jit.py | Filters runtime args before kernel execution; adjusts cached execution path. |
| python/iron/kernel.py | Adds ExternalFunction debug flag + argument validation on __call__. |
| python/iron/algorithms/transform.py | Implements unary/binary transform + parallel variants. |
| python/iron/algorithms/for_each.py | Implements in-place algorithm with optional ExternalFunction params. |
| python/iron/algorithms/__init__.py | Exposes algorithm entry points from the package. |
| programming_examples/basic/vector_scalar_mul/vector_scalar_mul_jit.py | Demonstrates using transform + ExternalFunction under JIT. |
| programming_examples/basic/vector_scalar_mul/run_jit.lit | Adds lit coverage for the JIT example. |
| programming_examples/basic/vector_scalar_mul/README.md | Documents the new JIT-based variant of the example. |
| programming_examples/algorithms/transform.py | Example for unary transform. |
| programming_examples/algorithms/transform_binary.py | Example for transform_binary. |
| programming_examples/algorithms/transform_parallel.py | Example for transform_parallel. |
| programming_examples/algorithms/transform_parallel_binary.py | Example for transform_parallel_binary. |
| programming_examples/algorithms/for_each.py | Example for for_each (contains verbose-mode bugs noted in comments). |
| programming_examples/algorithms/run_jit.lit | Adds lit coverage to run the new algorithm examples. |
| programming_examples/algorithms/README.md | Documents algorithms + ExternalFunction signature expectations (example issue noted in comments). |
```python
# Filter out non-tensor arguments (ExternalFunction, scalars)
# Only tensor args should be passed to the kernel
tensor_args = _filter_tensor_args(args)
return cached_kernel(*tensor_args, **kwargs)
```
Copilot AI · Feb 9, 2026
The JIT wrapper returns the cached kernel result on cache hits but returns None on the initial compilation/execution path (no return statement after invoking the kernel). This makes the API behavior depend on whether the kernel is cached. Return the kernel invocation result consistently in both paths (or consistently return None).
Suggested change:
```diff
- return cached_kernel(*tensor_args, **kwargs)
+ cached_kernel(*tensor_args, **kwargs)
```
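The hot-path/cold-path inconsistency described in the comment can be illustrated with a minimal caching wrapper. This is a generic sketch, not the actual `iron.jit` implementation; `_cache` and `compile_kernel` are hypothetical stand-ins:

```python
# Sketch of the cache-hit vs. cold-path return inconsistency.
# `_cache` and `compile_kernel` are hypothetical stand-ins for the JIT machinery.
_cache = {}

def compile_kernel(fn):
    return fn  # stand-in "compilation": identity

def jit_inconsistent(fn):
    def wrapper(*args):
        if fn in _cache:
            return _cache[fn](*args)   # cache hit: result is returned
        kernel = compile_kernel(fn)    # cold path: compile...
        _cache[fn] = kernel
        kernel(*args)                  # ...but the result is dropped (wrapper returns None)
    return wrapper

def jit_consistent(fn):
    def wrapper(*args):
        if fn not in _cache:
            _cache[fn] = compile_kernel(fn)
        return _cache[fn](*args)       # single return path: same behavior hot or cold
    return wrapper
```

With the consistent variant, callers see the same return value whether or not the kernel was already compiled.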
```python
arg_types (list[type[np.ndarray] | np.dtype], optional): The type signature of the function. Defaults to [].
include_dirs (list[str], optional): Additional include directories. Defaults to [].
compile_flags (list[str], optional): Additional compilation flags. Defaults to [].
debug (bool, optional): Enable debug logging. Defaults to True.
```
Copilot AI · Feb 9, 2026
The debug parameter default is False, but the docstring says "Defaults to True." Please fix the docstring to match the actual default (or change the default if the docstring is correct).
Suggested change:
```diff
- debug (bool, optional): Enable debug logging. Defaults to True.
+ debug (bool, optional): Enable debug logging. Defaults to False.
```
```python
)

num_inputs = len(inputs)
num_elements = np.size(inputs[0])
```
Copilot AI · Feb 9, 2026
Same issue as above: np.size(inputs[0]) may force a tensor materialization/sync. Prefer inputs[0].numel() or int(np.prod(inputs[0].shape)) to avoid host transfers.
Suggested change:
```diff
- num_elements = np.size(inputs[0])
+ num_elements = int(np.prod(inputs[0].shape))
```
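For a plain NumPy array the two expressions agree; the point of the suggestion is that deriving the count from `.shape` uses host-side metadata only, whereas `np.size` on a device-tensor wrapper may coerce it to a NumPy array first. A small sketch of the equivalence (plain NumPy used here as a stand-in for an Iron tensor):

```python
import numpy as np

a = np.zeros((4, 8, 2), dtype=np.int32)

# np.size(x) coerces x with np.asarray; for a device-tensor wrapper
# that coercion can force a host transfer.
count_via_size = np.size(a)

# Computing the count from shape metadata never touches the data buffer.
count_via_shape = int(np.prod(a.shape))
```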
```python
def test_transform_add():
    """Test transform algorithm with simple add_one operation"""
    input = iron.randint(0, 100, (1024,), dtype=np.int32, device="npu")
    output = iron.zeros_like(input)
    original = input.numpy().copy()
    iron.jit(is_placed=False)(transform)(lambda a: a + 1, input, input)

    assert np.allclose(original + 1, input.numpy())
```
Copilot AI · Feb 9, 2026
This test constructs an output tensor but then calls transform(..., input, input), writing the result back into input and leaving output unused. This can mask bugs where transform fails to write to the provided output buffer. Consider passing output to transform and asserting on output, or switch to for_each if the intent is an in-place update.
```diff
@@ -0,0 +1,64 @@
+# transform_binary.py -*- Python -*-
```
Copilot AI · Feb 9, 2026
The file header comment says transform_binary.py but this is for_each.py. Please correct the header to avoid confusion when navigating examples.
Suggested change:
```diff
- # transform_binary.py -*- Python -*-
+ # for_each.py -*- Python -*-
```
```python
    for_each(scale, tensor, factor, tile_size)
    """
    is_external_func = isinstance(func, iron.ExternalFunction)
    num_elements = np.size(tensor)
```
Copilot AI · Feb 9, 2026
np.size(tensor) will typically coerce the Iron tensor to a NumPy array (device sync) to compute the element count. Prefer tensor.numel() / int(np.prod(tensor.shape)) to avoid unnecessary synchronization.
Suggested change:
```diff
- num_elements = np.size(tensor)
+ num_elements = tensor.numel() if hasattr(tensor, "numel") else int(np.prod(tensor.shape))
```
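The suggested one-liner is a duck-typing pattern; if it recurs across the algorithms, it could be factored into a small helper. A sketch under that assumption (the helper name is hypothetical, not part of the PR):

```python
import numpy as np

def element_count(t):
    """Return the number of elements in a tensor-like object.

    Prefers a device-side numel() if the object exposes one; otherwise falls
    back to shape metadata. Either way, the data buffer itself is never read,
    so no host transfer is forced for device tensors.
    """
    if hasattr(t, "numel"):
        return int(t.numel())
    return int(np.prod(t.shape))
```

Plain NumPy arrays take the shape-product path; any wrapper exposing `numel()` (as Iron tensors appear to, per `input.numel()` elsewhere in this review) takes the first branch.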
```python
print(f"{'input':>4} + {'output':>4}")
print("-" * 34)
count = input.numel()
```
Copilot AI · Feb 9, 2026
Variable `count` is not used.
Suggested change:
```diff
- count = input.numel()
```
```python
print(f"{'input':>4} + {'output':>4}")
print("-" * 34)
count = input.numel()
```
Copilot AI · Feb 9, 2026
Variable `count` is not used.
Suggested change:
```diff
- count = input.numel()
```
```python
vectorized = True

# Define tensor types
tensor_ty = np.ndarray[(tensor_size,), np.dtype[in1_dtype]]
```
Copilot AI · Feb 9, 2026
Variable `tensor_ty` is not used.
Suggested change:
```diff
- tensor_ty = np.ndarray[(tensor_size,), np.dtype[in1_dtype]]
```
```python
from aie.iron import ObjectFifo, Program, Runtime, Worker
from aie.iron.placers import SequentialPlacer
from aie.iron.device import NPU1Col1, NPU2Col1, NPU1, NPU2
```
Copilot AI · Feb 9, 2026
Imports of `NPU1Col1`, `NPU2Col1`, and `NPU1` are not used.
Suggested change:
```diff
- from aie.iron.device import NPU1Col1, NPU2Col1, NPU1, NPU2
+ from aie.iron.device import NPU2
```
I am creating this PR so the code does not get left behind.
Update by James:
This PR adds an `algorithm` library to simplify creating a design that runs a tensor transformation on the NPU. The algorithms are compatible with `iron.jit`. It implements initial support for C++ kernels (currently limited to the non-parallel algorithms), which require the specific function signatures shown below. Follow-up work is needed to standardize kernel signatures and expand algorithm compatibility with kernels.

Using C++ Kernels (ExternalFunction)

The kernel signature and `ExternalFunction` currently must match the algorithm's expected format. We further assume array-like parameters in `params` are inputs that will require ObjectFifos, and scalar-type parameters are inputs that will be encoded as MLIR constants.

| Algorithm | Kernel signature | `arg_types` |
|---|---|---|
| `transform` | `void kernel(T* in, T* out, params...)` | `[tile_ty, tile_ty, *param_types]` |
| `transform_binary` | `void kernel(T* in1, T* in2, T* out, params...)` | `[tile_ty, tile_ty, tile_ty, *param_types]` |
| `for_each` | `void kernel(T* in, T* out, params...)` | `[tile_ty, tile_ty, *param_types]` |
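The array-vs-scalar assumption for `params` can be sketched as a small classifier. This is an illustrative stand-in, not the library's actual internals; the helper name and error handling are hypothetical:

```python
import numbers
import numpy as np

def classify_params(params):
    """Split extra kernel parameters per the documented assumption:
    array-like params are inputs needing ObjectFifos, while scalar
    params get encoded as MLIR constants."""
    fifo_params, constant_params = [], []
    for p in params:
        if isinstance(p, np.ndarray) or hasattr(p, "shape"):
            fifo_params.append(p)       # array-like -> ObjectFifo input
        elif isinstance(p, numbers.Number):
            constant_params.append(p)   # scalar -> MLIR constant
        else:
            raise TypeError(f"unsupported parameter type: {type(p).__name__}")
    return fifo_params, constant_params
```

The `hasattr(p, "shape")` check lets non-NumPy tensor wrappers (such as Iron device tensors) fall into the ObjectFifo bucket without importing their type.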