Access L2 Segment Allocation in Herd with Python Example #668
Changes from 4 commits: b8c5869, 70eb71e, bb7251f, a1a7034, c0077fd, 2b6b797
@@ -23,14 +23,18 @@ def build_module():
     ChannelOp("ChanOut")
     ChannelOp("ToSelf")

     # We want to store our data in L1 memory
-    mem_space = IntegerAttr.get(T.i32(), MemorySpace.L1)
+    mem_space_l1 = IntegerAttr.get(T.i32(), MemorySpace.L1)
+    image_type_l1 = MemRefType.get(
+        shape=IMAGE_SIZE,
+        element_type=T.i32(),
+        memory_space=mem_space_l1,
+    )

-    # This is the type definition of the image
-    image_type = MemRefType.get(
+    mem_space_l2 = IntegerAttr.get(T.i32(), MemorySpace.L1)
+    image_type_l2 = MemRefType.get(
         shape=IMAGE_SIZE,
         element_type=T.i32(),
-        memory_space=mem_space,
+        memory_space=mem_space_l2,
     )

     # We will send an image worth of data in and out
@@ -47,51 +51,48 @@ def launch_body(a, b):
         @segment(name="seg")
         def segment_body():

-            # The herd sizes correspond to the dimensions of the contiguous block of cores we are hoping to get.
-            # We just need one compute core, so we ask for a 1x1 herd
-            @herd(name="copyherd", sizes=[1, 1])
-            def herd_body(tx, ty, sx, sy):
+            tensor_in_l2 = AllocOp(image_type_l2, [], [])
+            tensor_out_l2 = AllocOp(image_type_l2, [], [])

-                # We must allocate a buffer of image size for the input/output
-                tensor_in = AllocOp(image_type, [], [])
-                tensor_out = AllocOp(image_type, [], [])
-                tensor_in2 = AllocOp(image_type, [], [])
-                tensor_out2 = AllocOp(image_type, [], [])
+            ChannelGet("ChanIn", tensor_in_l2)
Review comment: I have to do L2->L1 in the herd due to the purpose of this example (see my other comment). Is this not a valid operation?

Reply: It is a valid operation. I didn't understand the intent of the design, but I get it now.
+            ChannelPut("ChanOut", tensor_out_l2)

-                ChannelGet("ChanIn", tensor_in)
+            DeallocOp(tensor_in_l2)
+            DeallocOp(tensor_out_l2)

-                # Access every value in the tile
-                for j in range_(IMAGE_HEIGHT):
-                    for i in range_(IMAGE_WIDTH):
-                        # Load the input value from tile_in
-                        val = load(tensor_in, [i, j])
+            # The herd sizes correspond to the dimensions of the contiguous block of cores we are hoping to get.
+            # We just need one compute core, so we ask for a 1x1 herd
+            @herd(
+                name="copyherd",
+                sizes=[1, 1],
+                operands=[tensor_in_l2, tensor_out_l2],
+            )
+            def herd_body(tx, ty, sx, sy, tensor_in_l2, tensor_out_l2):

-                        # Store the output value in tile_out
-                        store(val, tensor_out, [i, j])
-                        yield_([])
-                    yield_([])
+                # We must allocate a buffer of image size for the input/output
+                tensor_in_l1 = AllocOp(image_type_l1, [], [])
+                tensor_out_l1 = AllocOp(image_type_l1, [], [])

-                ChannelPut("ToSelf", tensor_out)
-                ChannelGet("ToSelf", tensor_in2)
+                ChannelPut("ToSelf", tensor_in_l2)
Review comment: I see. I'd suggest moving this channel put to outside the herd.

Reply: This example is called worker to self because its purpose is to demonstrate whether it is possible to put data into a channel and then fetch it from the same channel from within the same worker/compute core in a herd. So to keep this example, we need both the put and the get inside the herd. The only reason I modified this example in this PR is that it was previously blocked: I was unable to access L2 memory in the herd, and that capability is needed because channels must move data between different memory regions (in this case, L2 and L1).

Reply: I see. This scenario has not been exercised before.
+                ChannelGet("ToSelf", tensor_in_l1)

                 # Access every value in the tile
                 for j in range_(IMAGE_HEIGHT):
                     for i in range_(IMAGE_WIDTH):
                         # Load the input value from tile_in
-                        val = load(tensor_in2, [i, j])
+                        val = load(tensor_in_l1, [i, j])

                         # Store the output value in tile_out
-                        store(val, tensor_out2, [i, j])
+                        store(val, tensor_out_l1, [i, j])
                         yield_([])
                     yield_([])

-                ChannelPut("ChanOut", tensor_out2)
+                ChannelPut("ToSelf", tensor_out_l1)
+                ChannelGet("ToSelf", tensor_out_l2)

                 # Deallocate our L1 buffers
-                DeallocOp(tensor_in)
-                DeallocOp(tensor_out)
-                DeallocOp(tensor_in2)
-                DeallocOp(tensor_out2)
+                DeallocOp(tensor_in_l1)
+                DeallocOp(tensor_out_l1)


 if __name__ == "__main__":
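To summarize the new data movement in this example (as discussed in the review threads above): the segment now owns the L2 buffers, the herd receives them as operands, and the compute core moves data between L2 and L1 by putting into and then getting from the same "ToSelf" channel. A condensed sketch of the new structure, reusing the names from the diff (an illustrative excerpt, not a standalone runnable script):

# Inside segment_body: L2 buffers are allocated at segment scope and
# connected to the external channels.
tensor_in_l2 = AllocOp(image_type_l2, [], [])
tensor_out_l2 = AllocOp(image_type_l2, [], [])
ChannelGet("ChanIn", tensor_in_l2)       # external input -> L2
ChannelPut("ChanOut", tensor_out_l2)     # L2 -> external output

@herd(name="copyherd", sizes=[1, 1], operands=[tensor_in_l2, tensor_out_l2])
def herd_body(tx, ty, sx, sy, tensor_in_l2, tensor_out_l2):
    tensor_in_l1 = AllocOp(image_type_l1, [], [])
    tensor_out_l1 = AllocOp(image_type_l1, [], [])

    # The same worker both produces and consumes "ToSelf":
    ChannelPut("ToSelf", tensor_in_l2)   # L2 -> channel
    ChannelGet("ToSelf", tensor_in_l1)   # channel -> L1

    # ... element-wise copy from tensor_in_l1 to tensor_out_l1 ...

    ChannelPut("ToSelf", tensor_out_l1)  # L1 -> channel
    ChannelGet("ToSelf", tensor_out_l2)  # channel -> L2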
@@ -0,0 +1,12 @@
# (c) Copyright 2024 Advanced Micro Devices, Inc.
# SPDX-License-Identifier: MIT
srcdir := $(shell dirname $(realpath $(firstword $(MAKEFILE_LIST))))

targetname := $(shell basename ${srcdir})

run:
	mkdir -p build
	cd build && ${powershell} python3 ${srcdir}/run.py

clean:
	rm -rf build __pycache__
@@ -0,0 +1,88 @@
# run.py -*- Python -*-
#
# Copyright (C) 2024, Advanced Micro Devices, Inc. All rights reserved.
# SPDX-License-Identifier: MIT
# Copyright (C) 2024, Advanced Micro Devices, Inc.
# SPDX-License-Identifier: MIT

import argparse
import numpy as np
import air.backend.xrt as xrt_backend
import filelock

from segment_alloc import *

INOUT_DATATYPE = np.uint32
INOUT_ELEM_SIZE = np.dtype(INOUT_DATATYPE).itemsize
INOUT_SIZE = IMAGE_SIZE[0] * IMAGE_SIZE[1]
INOUT_SIZE_BYTES = INOUT_SIZE * INOUT_ELEM_SIZE


def main(verbose=False, experimental_passes=False):
    mlir_module = build_module()

    input_a = np.arange(1, INOUT_SIZE + 1, dtype=INOUT_DATATYPE)
    output_b = np.arange(1, INOUT_SIZE + 1, dtype=INOUT_DATATYPE)
    for i in range(INOUT_SIZE):
        input_a[i] = i + 0x1000
        output_b[i] = 0x00DEFACED

    backend = xrt_backend.XRTBackend(
        verbose=verbose,
        experimental_passes=experimental_passes,
        omit_while_true_loop=True,
    )

    # run the module
    with filelock.FileLock("/tmp/npu.lock"):
        mul = backend.compile_and_load(mlir_module)
        (_, output_b) = mul(input_a, output_b)

    backend.unload()

    # check output, should have the top left filled in
    errors = 0
    for i in range(INOUT_SIZE):
        rb = output_b[i]

        row = i // IMAGE_WIDTH
        col = i % IMAGE_WIDTH

        if row < TILE_HEIGHT and col < TILE_WIDTH:
            # value should have been updated
            if not (rb == 0x1000 + i):
                print(f"IM {i} [{col}, {row}] should be 0x{i + 0x1000:x}, is 0x{rb:x}\n")
                errors += 1
        else:
            # value should stay unchanged
            if rb != 0x00DEFACED:
                print(f"IM {i} [{col}, {row}] should be 0xdefaced, is 0x{rb:x}\n")
                errors += 1

    if errors == 0:
        print("PASS!")
        exit(0)
    else:
        print("failed. errors=", errors)
        exit(-1)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        prog="run.py",
        description="Builds, runs, and tests the segment_alloc example",
    )

    parser.add_argument(
        "-v",
        "--verbose",
        action="store_true",
    )
    args = parser.parse_args()
    main(experimental_passes=True, verbose=args.verbose)
@@ -0,0 +1,8 @@
// (c) Copyright 2024 Advanced Micro Devices, Inc.
// SPDX-License-Identifier: MIT
//
// REQUIRES: ryzen_ai
//
// RUN: make -f %S/Makefile clean
// RUN: make -f %S/Makefile run | FileCheck %s
// CHECK: PASS!
@@ -0,0 +1,109 @@
# Copyright (C) 2024, Advanced Micro Devices, Inc.
# SPDX-License-Identifier: MIT

from air.ir import *
from air.dialects.air import *
from air.dialects.memref import AllocOp, DeallocOp, load, store
from air.dialects.func import FuncOp
from air.dialects.scf import for_, yield_

range_ = for_

IMAGE_WIDTH = 32
IMAGE_HEIGHT = 16
IMAGE_SIZE = [IMAGE_WIDTH, IMAGE_HEIGHT]

TILE_WIDTH = 16
TILE_HEIGHT = 8
TILE_SIZE = [TILE_WIDTH, TILE_HEIGHT]


@module_builder
def build_module():
    memrefTyInOut = MemRefType.get(IMAGE_SIZE, T.i32())

    # We will send an image worth of data in and out
    @FuncOp.from_py_func(memrefTyInOut, memrefTyInOut)
    def copy(arg0, arg1):

        # The arguments are the input and output
        @launch(operands=[arg0, arg1])
        def launch_body(a, b):

            # The arguments are still the input and the output
            @segment(name="seg", operands=[a, b])
            def segment_body(arg2, arg3):
                # We want to store our data in L2 memory
                mem_space_l2 = IntegerAttr.get(T.i32(), MemorySpace.L2)

                # This is the type definition of the tile
                tile_type_l2 = MemRefType.get(
                    shape=TILE_SIZE,
                    element_type=T.i32(),
                    memory_space=mem_space_l2,
                )

                # We must allocate a buffer of tile size for the input/output
                tile_in_l2 = AllocOp(tile_type_l2, [], [])

                # The herd sizes correspond to the dimensions of the contiguous block of cores we are hoping to get.
                # We just need one compute core, so we ask for a 1x1 herd
                @herd(name="copyherd", sizes=[1, 1], operands=[arg2, arg3, tile_in_l2])
                def herd_body(tx, ty, sx, sy, a, b, my_l2_tile):

                    # We want to store our data in L1 memory
                    mem_space_l1 = IntegerAttr.get(T.i32(), MemorySpace.L1)

                    # This is the type definition of the tile
                    tile_type_l1 = MemRefType.get(
                        shape=TILE_SIZE,
                        element_type=T.i32(),
                        memory_space=mem_space_l1,
                    )

                    # We must allocate a buffer of tile size for the input/output
                    tile_in_l1 = AllocOp(tile_type_l1, [], [])
                    tile_out_l1 = AllocOp(tile_type_l1, [], [])

                    # Copy a tile from the input image (a) into the L2 memory region (my_l2_tile)
                    dma_memcpy_nd(
                        my_l2_tile,
                        a,
                        src_offsets=[0, 0],
                        src_sizes=[TILE_HEIGHT, TILE_WIDTH],
                        src_strides=[IMAGE_WIDTH, 1],
                    )

                    # Copy the tile from L2 into the L1 memory region (tile_in_l1)
                    dma_memcpy_nd(
                        tile_in_l1,
                        my_l2_tile,
                    )

                    # Access every value in the tile
                    for j in range_(TILE_HEIGHT):
                        for i in range_(TILE_WIDTH):
                            # Load the input value from tile_in
                            val = load(tile_in_l1, [i, j])

                            # Store the output value in tile_out
                            store(val, tile_out_l1, [i, j])
                            yield_([])
                        yield_([])

                    # Copy the output tile into the output
                    dma_memcpy_nd(
                        b,
                        tile_out_l1,
                        dst_offsets=[0, 0],
                        dst_sizes=[TILE_HEIGHT, TILE_WIDTH],
                        dst_strides=[IMAGE_WIDTH, 1],
                    )

                    # Deallocate our L1 buffers
                    DeallocOp(tile_in_l1)
                    DeallocOp(tile_out_l1)


if __name__ == "__main__":
    module = build_module()
    print(module)
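For readers new to the offset/size/stride triples used by dma_memcpy_nd above: the first transfer reads TILE_HEIGHT rows of TILE_WIDTH contiguous elements, stepping one full image row between rows, i.e. the top-left tile of the image. A rough NumPy analogy (illustrative only, not code from this PR):

import numpy as np

IMAGE_WIDTH, IMAGE_HEIGHT = 32, 16
TILE_WIDTH, TILE_HEIGHT = 16, 8

# A row-major image, as the runtime sees the flat input buffer.
a = np.arange(IMAGE_HEIGHT * IMAGE_WIDTH, dtype=np.uint32).reshape(IMAGE_HEIGHT, IMAGE_WIDTH)

# src_offsets=[0, 0], src_sizes=[TILE_HEIGHT, TILE_WIDTH], src_strides=[IMAGE_WIDTH, 1]
# selects the same elements as this slice of the top-left corner:
tile = a[0:TILE_HEIGHT, 0:TILE_WIDTH]
assert tile.shape == (TILE_HEIGHT, TILE_WIDTH)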
@@ -23,6 +23,29 @@ def pyint_to_index(i):
     return arith.ConstantOp.create_index(i) if isinstance(i, int) else i


+def get_region_operand_types(operands):
+    """
+    Utility function to get the type of arguments given to region ops.
+    """
+    operand_types = []
+    for o in operands:
+        if isinstance(o, Value):
+            operand_types.append(o.type)
+        elif isinstance(o, OpView):
+            if len(o.results.types) != 1:
+                raise AttributeError(
+                    f"Operation given to a region op as a parameter ({o}) has more "
+                    f"than one return type ({o.results.types}), which would lead to a mismatch "
+                    "between number of operands and number of operand types"
+                )
+            operand_types += o.results.types
+        else:
+            raise AttributeError(
+                f"Argument {o} is not a Value or an Operation: {type(o).mro()}"
+            )
+    return operand_types
+
+
 class Launch(LaunchOp):
     """Specialization for LaunchOp class."""
@@ -48,7 +71,7 @@ def __init__(
             launch_operands=operands,
             sym_name=name,
         )
-        operand_types = [s.type for s in sizes] * 2 + [o.type for o in operands]
+        operand_types = [s.type for s in sizes] * 2 + get_region_operand_types(operands)
Review comment: I think considering not just Values but also Operations makes sense in the segment and the herd, since one could feasibly create the operation in the outer region. That is what these changes do. However, I also applied these changes to the launch, and I'm not sure whether that makes sense. Is it true that a launch argument should only ever be a Value and not an Operation? If so, I should revert this section of the changes.

Reply: @jgmelber Do you have an opinion?

Reply: I think in the future there could be other operations before or after the launch.

Reply: Okay, I will keep my changes then!

Reply: Maybe I'm missing some subtlety about the Python bindings, but isn't the result of an Operation always one or more Values? Can you share a code snippet where this comes up? This isn't an issue in ordinary MLIR code or when using the C++ APIs.

Reply: Here is the context: #666
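To make the Value-vs-Operation question above concrete: with get_region_operand_types, a single-result operation created in the enclosing region (for example an AllocOp) can be passed straight into operands=..., and its result type is used for the region's block argument. A rough sketch of how the segment_alloc example above exercises this (names follow that example; illustrative only, not new API):

# tile_in_l2 is an OpView (the AllocOp itself), not a Value.
tile_in_l2 = AllocOp(tile_type_l2, [], [])

# Previously the operand list was assumed to contain Values, so block argument
# types were taken from o.type. Passing the OpView now works because its single
# result type is looked up via o.results.types:
@herd(name="copyherd", sizes=[1, 1], operands=[arg2, arg3, tile_in_l2])
def herd_body(tx, ty, sx, sy, a, b, my_l2_tile):
    ...

# Passing the result Value explicitly, e.g. operands=[..., tile_in_l2.result],
# should resolve to the same operand types.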
         self.regions[0].blocks.append(*operand_types)


@@ -74,7 +97,7 @@ def __init__(
             segment_operands=operands,
             sym_name=name,
         )
-        operand_types = [s.type for s in sizes] * 2 + [o.type for o in operands]
+        operand_types = [s.type for s in sizes] * 2 + get_region_operand_types(operands)
         self.regions[0].blocks.append(*operand_types)


@@ -102,7 +125,7 @@ def __init__(
             sym_name=name,
             link_with=link_with,
         )
-        operand_types = [s.type for s in sizes] * 2 + [o.type for o in operands]
+        operand_types = [s.type for s in sizes] * 2 + get_region_operand_types(operands)
         self.regions[0].blocks.append(*operand_types)
Review comment: I attempted to change this file to use the new ability to send L2 allocations made in a segment to a herd. However, this example still does not work. I'm unsure whether the way I've rewritten it is reasonable, but I've made a note in the associated issue (#654).

Reply: The tensor_in_l2 seems to be deallocated before it is used in the air.herd, if the code were interpreted line by line. Maybe move the deallocation to after the herd.
Reply: I fixed a bug (I was accidentally using L1 for both allocation types) and made the change you suggested, as well as simplifying the example a bit. The example is still not successful, although the error changed.
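The reordering suggested in this thread (deallocate the segment-level L2 buffer only after the herd that consumes it) would look roughly as follows. This is a sketch using the names from the segment_alloc example above, not the final working code, since the example reportedly still fails:

@segment(name="seg", operands=[a, b])
def segment_body(arg2, arg3):
    tile_in_l2 = AllocOp(tile_type_l2, [], [])  # L2 buffer owned by the segment

    @herd(name="copyherd", sizes=[1, 1], operands=[arg2, arg3, tile_in_l2])
    def herd_body(tx, ty, sx, sy, a, b, my_l2_tile):
        ...  # my_l2_tile is used here

    # Free the L2 buffer only after the herd that uses it, so the dealloc
    # does not appear before the use when the ops are read top to bottom.
    DeallocOp(tile_in_l2)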