Access L2 Segment Allocation in Herd with Python Example #668

Merged (6 commits, Jul 17, 2024)
Changes from 4 commits
@@ -23,14 +23,18 @@ def build_module():
ChannelOp("ChanOut")
Contributor Author

I attempted to change this file to use the new ability to send L2 allocations made in a segment to a herd. However, this example still does not work. I'm unsure whether the way I've rewritten it is reasonable, but I've made a note in the associated issue (#654).

Collaborator

The tensor_in_l2 seems to be deallocated before it is used in air.herd, if the code is interpreted line by line. Maybe move the deallocation to after the herd.

Contributor Author

I fixed a bug (I was accidentally using L1 for both allocation types) and made the change you suggested, as well as simplifying the example a bit. The example is still not successful, although the error changed:

Running: builtin.module(air-to-aie{emit-while-loop=false row-offset=2 col-offset=0 device=npu1_4col})
Segmentation fault (core dumped)

ChannelOp("ToSelf")

# We want to store our data in L1 memory
mem_space = IntegerAttr.get(T.i32(), MemorySpace.L1)
mem_space_l1 = IntegerAttr.get(T.i32(), MemorySpace.L1)
image_type_l1 = MemRefType.get(
shape=IMAGE_SIZE,
element_type=T.i32(),
memory_space=mem_space_l1,
)

# This is the type definition of the image
image_type = MemRefType.get(
mem_space_l2 = IntegerAttr.get(T.i32(), MemorySpace.L2)
image_type_l2 = MemRefType.get(
shape=IMAGE_SIZE,
element_type=T.i32(),
memory_space=mem_space,
memory_space=mem_space_l2,
)

# We will send an image worth of data in and out
@@ -47,51 +51,48 @@ def launch_body(a, b):
@segment(name="seg")
def segment_body():

# The herd sizes correspond to the dimensions of the contiguous block of cores we are hoping to get.
# We just need one compute core, so we ask for a 1x1 herd
@herd(name="copyherd", sizes=[1, 1])
def herd_body(tx, ty, sx, sy):
tensor_in_l2 = AllocOp(image_type_l2, [], [])
tensor_out_l2 = AllocOp(image_type_l2, [], [])

# We must allocate a buffer of image size for the input/output
tensor_in = AllocOp(image_type, [], [])
tensor_out = AllocOp(image_type, [], [])
tensor_in2 = AllocOp(image_type, [], [])
tensor_out2 = AllocOp(image_type, [], [])
ChannelGet("ChanIn", tensor_in_l2)
Collaborator

After tensor_in_l2 is filled with data from getting ChanIn, I would expect a put to some L2->L1 channel?

Contributor Author

I have to do L2->L1 in the herd due to the purpose of this example (see my other comment).

Is this not a valid operation?

Collaborator

It is a valid operation. I didn't understand the intent of the design, but I get it now.

ChannelPut("ChanOut", tensor_out_l2)

ChannelGet("ChanIn", tensor_in)
DeallocOp(tensor_in_l2)
DeallocOp(tensor_out_l2)

# Access every value in the tile
for j in range_(IMAGE_HEIGHT):
for i in range_(IMAGE_WIDTH):
# Load the input value from tile_in
val = load(tensor_in, [i, j])
# The herd sizes correspond to the dimensions of the contiguous block of cores we are hoping to get.
# We just need one compute core, so we ask for a 1x1 herd
@herd(
name="copyherd",
sizes=[1, 1],
operands=[tensor_in_l2, tensor_out_l2],
)
def herd_body(tx, ty, sx, sy, tensor_in_l2, tensor_out_l2):

# Store the output value in tile_out
store(val, tensor_out, [i, j])
yield_([])
yield_([])
# We must allocate a buffer of image size for the input/output
tensor_in_l1 = AllocOp(image_type_l1, [], [])
tensor_out_l1 = AllocOp(image_type_l1, [], [])

ChannelPut("ToSelf", tensor_out)
ChannelGet("ToSelf", tensor_in2)
ChannelPut("ToSelf", tensor_in_l2)
Collaborator

@erwei-xilinx Jul 16, 2024

I see. I'd suggest moving this channel put to outside the herd.

Contributor Author

This example is called worker-to-self because its purpose is to demonstrate whether it is possible to put data into a channel and then fetch it from the same channel, from within the same worker/compute core in a herd. So to keep this example, we need both ChannelPut("ToSelf", some_mem) and ChannelGet("ToSelf", some_other_mem) within the herd.

The only reason I modified this example in this PR is that it was previously blocked: I was unable to access L2 memory in the herd, and that capability is needed because channels must move data between different memory regions (in this case, L2 and L1).
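The put-then-get-on-the-same-channel pattern described above can be sketched in plain Python. This is only an analogy, not the AIR API: `to_self` is a hypothetical stand-in for the `ToSelf` channel, using `queue.Queue` as the FIFO.

```python
from queue import Queue

# Hypothetical stand-in for the "ToSelf" AIR channel: a simple FIFO.
to_self = Queue()

def worker(tile):
    # The same worker puts data into the channel...
    to_self.put(list(tile))
    # ...and then fetches it back out of the same channel.
    return to_self.get()

result = worker([1, 2, 3])
print(result)  # [1, 2, 3]
```

The point of contention in the real example is only that the two buffers on either side of the channel must live in different memory regions (L2 and L1), which the queue analogy does not capture.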

Collaborator

I see. This scenario has not been exercised before.

ChannelGet("ToSelf", tensor_in_l1)

# Access every value in the tile
for j in range_(IMAGE_HEIGHT):
for i in range_(IMAGE_WIDTH):
# Load the input value from tile_in
val = load(tensor_in2, [i, j])
val = load(tensor_in_l1, [i, j])

# Store the output value in tile_out
store(val, tensor_out2, [i, j])
store(val, tensor_out_l1, [i, j])
yield_([])
yield_([])

ChannelPut("ChanOut", tensor_out2)
ChannelPut("ToSelf", tensor_out_l1)
ChannelGet("ToSelf", tensor_out_l2)

# Deallocate our L1 buffers
DeallocOp(tensor_in)
DeallocOp(tensor_out)
DeallocOp(tensor_in2)
DeallocOp(tensor_out2)
DeallocOp(tensor_in_l1)
DeallocOp(tensor_out_l1)


if __name__ == "__main__":
12 changes: 12 additions & 0 deletions programming_examples/segment_alloc/Makefile
@@ -0,0 +1,12 @@
# (c) Copyright 2024 Advanced Micro Devices, Inc.
# SPDX-License-Identifier: MIT
srcdir := $(shell dirname $(realpath $(firstword $(MAKEFILE_LIST))))

targetname := $(shell basename ${srcdir})

run:
	mkdir -p build
	cd build && ${powershell} python3 ${srcdir}/run.py

clean:
	rm -rf build __pycache__
88 changes: 88 additions & 0 deletions programming_examples/segment_alloc/run.py
@@ -0,0 +1,88 @@
# run.py -*- Python -*-
#
# Copyright (C) 2024, Advanced Micro Devices, Inc. All rights reserved.
# SPDX-License-Identifier: MIT

import argparse
import numpy as np
import air.backend.xrt as xrt_backend
import filelock

from segment_alloc import *

INOUT_DATATYPE = np.uint32
INOUT_ELEM_SIZE = np.dtype(INOUT_DATATYPE).itemsize
INOUT_SIZE = IMAGE_SIZE[0] * IMAGE_SIZE[1]
INOUT_SIZE_BYTES = INOUT_SIZE * INOUT_ELEM_SIZE


def main(verbose=False, experimental_passes=False):
    mlir_module = build_module()

    input_a = np.arange(1, INOUT_SIZE + 1, dtype=INOUT_DATATYPE)
    output_b = np.arange(1, INOUT_SIZE + 1, dtype=INOUT_DATATYPE)
    for i in range(INOUT_SIZE):
        input_a[i] = i + 0x1000
        output_b[i] = 0x00DEFACED

    backend = xrt_backend.XRTBackend(
        verbose=verbose,
        experimental_passes=experimental_passes,
        omit_while_true_loop=True,
    )

    # run the module
    with filelock.FileLock("/tmp/npu.lock"):
        mul = backend.compile_and_load(mlir_module)
        (_, output_b) = mul(input_a, output_b)

    backend.unload()

    # check output, should have the top left filled in
    errors = 0
    for i in range(INOUT_SIZE):
        rb = output_b[i]

        row = i // IMAGE_WIDTH
        col = i % IMAGE_WIDTH

        if row < TILE_HEIGHT and col < TILE_WIDTH:
            # value should have been updated
            if not (rb == 0x1000 + i):
                print(f"IM {i} [{col}, {row}] should be 0x{i + 0x1000:x}, is 0x{rb:x}")
                errors += 1
        else:
            # value should stay unchanged
            if rb != 0x00DEFACED:
                print(f"IM {i} [{col}, {row}] should be 0xdefaced, is 0x{rb:x}")
                errors += 1

    if errors == 0:
        print("PASS!")
        exit(0)
    else:
        print("failed. errors=", errors)
        exit(-1)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        prog="run.py",
        description="Builds, runs, and tests the segment_alloc example",
    )

    parser.add_argument(
        "-v",
        "--verbose",
        action="store_true",
    )
    args = parser.parse_args()
    main(experimental_passes=True, verbose=args.verbose)
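The pass/fail check in run.py hinges on mapping each flat buffer index back to a 2-D coordinate with integer division and modulo, then testing tile membership. A minimal standalone sketch of that logic, using the example's constants:

```python
IMAGE_WIDTH, IMAGE_HEIGHT = 32, 16
TILE_WIDTH, TILE_HEIGHT = 16, 8

def in_top_left_tile(i):
    # Recover (row, col) from the flat index, then test whether the
    # element lies inside the TILE_HEIGHT x TILE_WIDTH corner tile.
    row = i // IMAGE_WIDTH
    col = i % IMAGE_WIDTH
    return row < TILE_HEIGHT and col < TILE_WIDTH

# Exactly TILE_WIDTH * TILE_HEIGHT elements should be updated.
updated = sum(in_top_left_tile(i) for i in range(IMAGE_WIDTH * IMAGE_HEIGHT))
print(updated)  # 128
```

Note that Python's `/` produces a float, so `i // IMAGE_WIDTH` is the correct way to compute the row index.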
8 changes: 8 additions & 0 deletions programming_examples/segment_alloc/run_makefile.lit
@@ -0,0 +1,8 @@
// (c) Copyright 2024 Advanced Micro Devices, Inc.
// SPDX-License-Identifier: MIT
//
// REQUIRES: ryzen_ai
//
// RUN: make -f %S/Makefile clean
// RUN: make -f %S/Makefile run | FileCheck %s
// CHECK: PASS!
109 changes: 109 additions & 0 deletions programming_examples/segment_alloc/segment_alloc.py
@@ -0,0 +1,109 @@
# Copyright (C) 2024, Advanced Micro Devices, Inc.
# SPDX-License-Identifier: MIT

from air.ir import *
from air.dialects.air import *
from air.dialects.memref import AllocOp, DeallocOp, load, store
from air.dialects.func import FuncOp
from air.dialects.scf import for_, yield_

range_ = for_

IMAGE_WIDTH = 32
IMAGE_HEIGHT = 16
IMAGE_SIZE = [IMAGE_WIDTH, IMAGE_HEIGHT]

TILE_WIDTH = 16
TILE_HEIGHT = 8
TILE_SIZE = [TILE_WIDTH, TILE_HEIGHT]


@module_builder
def build_module():
    memrefTyInOut = MemRefType.get(IMAGE_SIZE, T.i32())

    # We will send an image worth of data in and out
    @FuncOp.from_py_func(memrefTyInOut, memrefTyInOut)
    def copy(arg0, arg1):

        # The arguments are the input and output
        @launch(operands=[arg0, arg1])
        def launch_body(a, b):

            # The arguments are still the input and the output
            @segment(name="seg", operands=[a, b])
            def segment_body(arg2, arg3):
                # We want to store this buffer in L2 memory
                mem_space_l2 = IntegerAttr.get(T.i32(), MemorySpace.L2)

                # This is the type definition of the tile
                tile_type_l2 = MemRefType.get(
                    shape=TILE_SIZE,
                    element_type=T.i32(),
                    memory_space=mem_space_l2,
                )

                # We must allocate a buffer of tile size for the input/output
                tile_in_l2 = AllocOp(tile_type_l2, [], [])

                # The herd sizes correspond to the dimensions of the contiguous block of cores we are hoping to get.
                # We just need one compute core, so we ask for a 1x1 herd
                @herd(name="copyherd", sizes=[1, 1], operands=[arg2, arg3, tile_in_l2])
                def herd_body(tx, ty, sx, sy, a, b, my_l2_tile):

                    # We want to store our data in L1 memory
                    mem_space_l1 = IntegerAttr.get(T.i32(), MemorySpace.L1)

                    # This is the type definition of the tile
                    tile_type_l1 = MemRefType.get(
                        shape=TILE_SIZE,
                        element_type=T.i32(),
                        memory_space=mem_space_l1,
                    )

                    # We must allocate a buffer of tile size for the input/output
                    tile_in_l1 = AllocOp(tile_type_l1, [], [])
                    tile_out_l1 = AllocOp(tile_type_l1, [], [])

                    # Copy a tile from the input image (a) into the L2 buffer
                    dma_memcpy_nd(
                        my_l2_tile,
                        a,
                        src_offsets=[0, 0],
                        src_sizes=[TILE_HEIGHT, TILE_WIDTH],
                        src_strides=[IMAGE_WIDTH, 1],
                    )

                    # Copy the tile from L2 into the L1 buffer (tile_in_l1)
                    dma_memcpy_nd(
                        tile_in_l1,
                        my_l2_tile,
                    )

                    # Access every value in the tile
                    for j in range_(TILE_HEIGHT):
                        for i in range_(TILE_WIDTH):
                            # Load the input value from tile_in
                            val = load(tile_in_l1, [i, j])

                            # Store the output value in tile_out
                            store(val, tile_out_l1, [i, j])
                            yield_([])
                        yield_([])

                    # Copy the output tile into the output
                    dma_memcpy_nd(
                        b,
                        tile_out_l1,
                        dst_offsets=[0, 0],
                        dst_sizes=[TILE_HEIGHT, TILE_WIDTH],
                        dst_strides=[IMAGE_WIDTH, 1],
                    )

                    # Deallocate our L1 buffers
                    DeallocOp(tile_in_l1)
                    DeallocOp(tile_out_l1)


if __name__ == "__main__":
    module = build_module()
    print(module)
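The `dma_memcpy_nd` offsets/sizes/strides in segment_alloc.py describe a strided 2-D read out of a flat row-major buffer. A plain-Python sketch of that addressing scheme (`strided_read` is a hypothetical helper for illustration, not the AIR API):

```python
def strided_read(src, offsets, sizes, strides):
    # Element (j, i) of the tile is read from
    # src[base + j * strides[0] + i * strides[1]].
    base = offsets[0] * strides[0] + offsets[1] * strides[1]
    return [
        src[base + j * strides[0] + i * strides[1]]
        for j in range(sizes[0])
        for i in range(sizes[1])
    ]

IMAGE_WIDTH, IMAGE_HEIGHT = 32, 16
image = list(range(IMAGE_WIDTH * IMAGE_HEIGHT))

# Same parameters as the first dma_memcpy_nd: an 8x16 tile from the top-left corner.
tile = strided_read(image, offsets=[0, 0], sizes=[8, 16], strides=[IMAGE_WIDTH, 1])
print(tile[:3], tile[16])  # [0, 1, 2] 32
```

The stride of `IMAGE_WIDTH` between rows is what lets the DMA gather a 16-wide tile out of a 32-wide image.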
29 changes: 26 additions & 3 deletions python/air/dialects/_air_ops_ext.py
@@ -23,6 +23,29 @@ def pyint_to_index(i):
return arith.ConstantOp.create_index(i) if isinstance(i, int) else i


def get_region_operand_types(operands):
"""
Utility function to get the type of arguments given to region ops.
"""
operand_types = []
for o in operands:
if isinstance(o, Value):
operand_types.append(o.type)
elif isinstance(o, OpView):
if len(o.results.types) != 1:
raise AttributeError(
f"Operation given to a region op as a parameter ({o}) does not have "
f"exactly one result type ({o.results.types}), which would lead to a "
"mismatch between the number of operands and the number of operand types"
)
operand_types += o.results.types
else:
raise AttributeError(
f"Argument {o} is not a Value or an Operation: {type(o).mro()}"
)
return operand_types


class Launch(LaunchOp):
"""Specialization for LaunchOp class."""

@@ -48,7 +71,7 @@ def __init__(
launch_operands=operands,
sym_name=name,
)
operand_types = [s.type for s in sizes] * 2 + [o.type for o in operands]
operand_types = [s.type for s in sizes] * 2 + get_region_operand_types(operands)
Contributor Author

@hunhoffe Jul 16, 2024

I think considering not just Values but also Operations makes sense in the segment and herd, since one could feasibly create the operation in the outer region. That is what these changes do.

However, I also applied these changes to the launch, and I'm not sure whether that makes sense. Is it true that a launch argument should only ever be a Value and not an Operation? If so, I should revert this section of the changes.

Contributor Author

@jgmelber Do you have an opinion?

Collaborator

I think in the future there could be other operations before or after launch.

Contributor Author

Okay, I will keep my changes then!

Collaborator

Maybe I'm missing some subtlety about the Python bindings, but isn't the result of an Operation always one or more Values? Can you share a code snippet where this comes up? This isn't an issue in ordinary MLIR code or when using the C++ APIs.

Collaborator

Here is the context: #666
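The Value-versus-Operation distinction being discussed can be illustrated with toy stand-ins for the bindings' classes. These are simplified mock-ups for illustration, not the real MLIR Python API (the real `OpView` exposes results via `o.results.types`):

```python
class Value:
    # A toy SSA value carrying only its type.
    def __init__(self, type_):
        self.type = type_

class OpView:
    # A toy operation, e.g. an AllocOp passed directly as a herd operand.
    def __init__(self, *result_types):
        self.result_types = list(result_types)

def get_region_operand_types(operands):
    # Collect one type per operand, accepting either Values or
    # single-result Operations, mirroring the utility in this PR.
    operand_types = []
    for o in operands:
        if isinstance(o, Value):
            operand_types.append(o.type)
        elif isinstance(o, OpView):
            if len(o.result_types) != 1:
                raise AttributeError("operand count / type count mismatch")
            operand_types += o.result_types
        else:
            raise AttributeError(f"{o!r} is not a Value or an Operation")
    return operand_types

types = get_region_operand_types([Value("memref<L3>"), OpView("memref<L2>")])
print(types)  # ['memref<L3>', 'memref<L2>']
```

The single-result check matters because each region block argument must line up with exactly one operand type.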

self.regions[0].blocks.append(*operand_types)


@@ -74,7 +97,7 @@ def __init__(
segment_operands=operands,
sym_name=name,
)
operand_types = [s.type for s in sizes] * 2 + [o.type for o in operands]
operand_types = [s.type for s in sizes] * 2 + get_region_operand_types(operands)
self.regions[0].blocks.append(*operand_types)


@@ -102,7 +125,7 @@ def __init__(
sym_name=name,
link_with=link_with,
)
operand_types = [s.type for s in sizes] * 2 + [o.type for o in operands]
operand_types = [s.type for s in sizes] * 2 + get_region_operand_types(operands)
self.regions[0].blocks.append(*operand_types)

