This recipe showcases RapidStream's ability to optimize a complex design that combines various source types, such as handcrafted RTL Verilog, solutions generated using High-Level Synthesis (HLS), and Xilinx Core Instance (.xci) IP blocks.
RapidStreamHLS, as used in this recipe, is a set of low-level APIs that provides this flexibility.
To automatically optimize layout hints, resource information is required. We have pre-defined options for common devices: U50, U55c, U250, U280, VHK158, and VCK190. However, you can also use the DeviceFactory to support custom parts and boards.
This example uses the predefined U50 virtual device, which divides the FPGA part into four equal slots, each occupying half of a Super Logic Region (SLR).
The Python snippet below demonstrates how to define the device configuration for the Alveo U50 device. The source file u50.py can be found in your RapidStream installation directory, for example:
<home_dir>/.rapidstream/opt/python3.10/lib/python3.10/site-packages/rapidstream/assets/device_library/u50/u50.py
In this setup, the Alveo U50 device is divided into four slots. The function factory.set_slot_pblock specifies the coordinates and ranges of these slots in units of clock regions. For example, SLOT_X0Y0 encompasses all clock regions from CLOCKREGION_X0Y0 to CLOCKREGION_X3Y3. The function extract_slot_resources calls Vivado to extract the resources within the clock regions defined by set_slot_pblock. Additionally, set_slot_capacity configures the LAGUNA routing resources available for wires that cross SLRs. Since the Alveo U50 contains two SLRs, only the north side of SLOT_X0Y0 and SLOT_X1Y0 and the south side of SLOT_X0Y1 and SLOT_X1Y1 have SLR crossing capacity.
def get_u50_default_device_factory() -> DeviceFactory:
    """Get a U50 default device factory."""
    # 2x2 grid of slots on the U50 part
    factory = DeviceFactory(row=2, col=2, part_num=U50_PART_NAME, board_name=None)

    # Each slot covers a 4x4 block of clock regions
    factory.set_slot_pblock(0, 0, ["-add CLOCKREGION_X0Y0:CLOCKREGION_X3Y3"])
    factory.set_slot_pblock(1, 0, ["-add CLOCKREGION_X4Y0:CLOCKREGION_X7Y3"])
    factory.set_slot_pblock(0, 1, ["-add CLOCKREGION_X0Y4:CLOCKREGION_X3Y7"])
    factory.set_slot_pblock(1, 1, ["-add CLOCKREGION_X4Y4:CLOCKREGION_X7Y7"])

    # Call Vivado to extract the resources inside each slot
    factory.extract_slot_resources()

    # set SLR crossing capacity
    factory.set_slot_capacity(0, 0, north=11520)
    factory.set_slot_capacity(0, 1, south=11520)
    factory.set_slot_capacity(1, 0, north=11520)
    factory.set_slot_capacity(1, 1, south=11520)

    return factory


def get_u50_default_device(output_path: Path | None = None) -> VirtualDevice:
    """Get a U50 default device."""
    factory = get_u50_default_device_factory()
    return factory.generate_virtual_device(output_path)
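The slot ranges passed to set_slot_pblock follow a regular pattern. As a minimal illustration in plain Python, independent of the RapidStream API, the four range strings and the SLR-crossing side of each slot can be derived as follows. The helper names slot_pblock_range and slr_crossing_side are hypothetical, and the sketch assumes that every slot spans 4x4 clock regions and that the single SLR boundary lies between slot rows 0 and 1, as in the U50 layout above.

```python
def slot_pblock_range(col: int, row: int, cr_cols: int = 4, cr_rows: int = 4) -> str:
    """Return the pblock range string for the slot at (col, row).

    Assumption: every slot spans cr_cols x cr_rows clock regions.
    """
    x0, y0 = col * cr_cols, row * cr_rows
    x1, y1 = x0 + cr_cols - 1, y0 + cr_rows - 1
    return f"-add CLOCKREGION_X{x0}Y{y0}:CLOCKREGION_X{x1}Y{y1}"


def slr_crossing_side(row: int) -> str:
    """Return the slot side that faces the SLR boundary.

    Assumption: two slot rows with one SLR boundary between them, so the
    bottom row crosses on its north side and the top row on its south side.
    """
    return "north" if row == 0 else "south"


# Build the same 2x2 layout that the U50 factory code sets up by hand.
slots = {
    f"SLOT_X{col}Y{row}": (slot_pblock_range(col, row), slr_crossing_side(row))
    for row in range(2)
    for col in range(2)
}

for name, (pblock, side) in slots.items():
    print(name, pblock, side)
```

For a custom part, the same arithmetic helps to check that the ranges handed to set_slot_pblock tile the device without gaps or overlaps.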
RapidStream supports Verilog files, whether they are handcrafted or generated by tools. In this tutorial, we recreate the VecAdd example from vitis_source, using source files from different origins.
The VecAddMix design has three types of input sources:
- read_mem_0, read_mem_1, write_mem_0, and kernel_add are generated by compiling HLS C++ code with Vitis_HLS.
- A reset synchronizer module comes from an .xci file from Xilinx.
- The HLS-generated modules above are connected in a top-level Verilog file by streaming FIFOs, which are implemented in a manually designed Verilog file.
RapidStream relies on handshake interfaces to add pipeline registers and improve timing. In this case, RapidStream infers the interface types of read_mem_0, read_mem_1, write_mem_0, and kernel_add from their HLS-generated reports. However, a link can be pipelined only when both its source and target ports meet the handshake criteria. For the streaming FIFO modules, we therefore need to manually specify the port types in the Verilog file, as shown below.
(* RS_HS = "inbound.data" *) input [DSIZE-1:0] din_TDATA;
(* RS_HS = "inbound.valid" *) input din_TVALID;
(* RS_HS = "inbound.ready" *) output din_TREADY;
(* RS_HS = "outbound.data" *) output [DSIZE-1:0] dout_TDATA;
(* RS_HS = "outbound.valid" *) output dout_TVALID;
(* RS_HS = "outbound.ready" *) input dout_TREADY;
(* RS_CLK *) input clk;
(* RS_RST = "ff" *) input rst_n;
In this example, there are:
- Two handshake interfaces (RS_HS):
  - inbound interface: input streaming data
  - outbound interface: output streaming data
- One default clock interface (RS_CLK) named clk
- One reset interface (RS_RST) named rst_n
For the most up-to-date pragma syntax, refer to the RTL Interface Pragmas.
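To make the grouping of pragma-annotated ports concrete, the sketch below parses the RS_HS attributes from the port declarations shown above and groups them by interface name. It is only an illustration of the `<interface>.<role>` naming convention. The regular expression and the function group_handshake_ports are our own and do not reflect how RapidStream parses Verilog internally.

```python
import re
from collections import defaultdict

# Port declarations copied from the streaming FIFO example above.
VERILOG_PORTS = """
(* RS_HS = "inbound.data" *) input [DSIZE-1:0] din_TDATA;
(* RS_HS = "inbound.valid" *) input din_TVALID;
(* RS_HS = "inbound.ready" *) output din_TREADY;
(* RS_HS = "outbound.data" *) output [DSIZE-1:0] dout_TDATA;
(* RS_HS = "outbound.valid" *) output dout_TVALID;
(* RS_HS = "outbound.ready" *) input dout_TREADY;
(* RS_CLK *) input clk;
(* RS_RST = "ff" *) input rst_n;
"""

# Matches: (* RS_HS = "<iface>.<role>" *) <input|output> [optional range] <port>;
HS_PATTERN = re.compile(
    r'\(\*\s*RS_HS\s*=\s*"(?P<iface>\w+)\.(?P<role>\w+)"\s*\*\)\s*'
    r"(?P<dir>input|output)\s+(?:\[[^\]]*\]\s*)?(?P<port>\w+)\s*;"
)


def group_handshake_ports(text: str) -> dict[str, dict[str, str]]:
    """Group RS_HS-annotated ports as {interface: {role: port_name}}."""
    interfaces: dict[str, dict[str, str]] = defaultdict(dict)
    for match in HS_PATTERN.finditer(text):
        interfaces[match["iface"]][match["role"]] = match["port"]
    return dict(interfaces)


handshakes = group_handshake_ports(VERILOG_PORTS)
for name, roles in handshakes.items():
    # A complete handshake interface needs data, valid, and ready ports.
    complete = {"data", "valid", "ready"} <= roles.keys()
    print(name, roles, "complete" if complete else "incomplete")
```

A check like this can catch a missing valid or ready annotation before the design is handed to the tool.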
IP configuration files in the XCI format can be included, as demonstrated in this recipe. For instance, design_1_proc_sys_reset_0_0.xci can be added. The interfaces defined in the XCI files will be analyzed, and the files will be used to determine resource usage and perform evaluations.
RapidStream can integrate modules or systems generated by High-Level Synthesis (HLS) into the project.
Run the commands below to generate the HLS solutions for read_mem, write_mem, and kernel_add.
source <Vitis_install_path>/Vitis/2023.2/settings64.sh
make hls
You can find the HLS-generated Verilog files for the different kernels under build.
The interface information is automatically inferred from the HLS reports, such as
./build/run.py/kernel_add/solution/syn/report/kernel_add_csynth.rpt
This eliminates the need for manual pragma additions to the RTL files.
For instance, stream_* interfaces with the HLS axis protocol will be inferred as handshake interfaces. RapidStream uses .xml files instead of .rpt files for this purpose. The .rpt excerpt below is shown only because it is easier to read:
================================================================
== Interface
================================================================
* Summary:
+-------------------+-----+-----+--------------+--------------+--------------+
| RTL Ports | Dir | Bits| Protocol | Source Object| C Type |
+-------------------+-----+-----+--------------+--------------+--------------+
|ap_clk | in| 1| ap_ctrl_none| kernel_add| return value|
|ap_rst_n | in| 1| ap_ctrl_none| kernel_add| return value|
|stream_in1_TVALID | in| 1| axis| stream_in1| pointer|
|stream_in1_TDATA | in| 32| axis| stream_in1| pointer|
|stream_in1_TREADY | out| 1| axis| stream_in1| pointer|
|stream_in2_TVALID | in| 1| axis| stream_in2| pointer|
|stream_in2_TDATA | in| 32| axis| stream_in2| pointer|
|stream_in2_TREADY | out| 1| axis| stream_in2| pointer|
|stream_out_TREADY | in| 1| axis| stream_out| pointer|
|stream_out_TDATA | out| 32| axis| stream_out| pointer|
|stream_out_TVALID | out| 1| axis| stream_out| pointer|
+-------------------+-----+-----+--------------+--------------+--------------+
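As a rough illustration of the inference described above, the sketch below groups the axis ports from the interface summary into one handshake interface per source object. It works on the readable .rpt text only for demonstration. RapidStream itself reads the .xml files, and the function infer_axis_interfaces is a hypothetical helper, not part of any RapidStream API.

```python
# Rows copied from the interface summary above.
REPORT_ROWS = """\
|ap_clk             |   in|    1|  ap_ctrl_none|    kernel_add|  return value|
|ap_rst_n           |   in|    1|  ap_ctrl_none|    kernel_add|  return value|
|stream_in1_TVALID  |   in|    1|          axis|    stream_in1|       pointer|
|stream_in1_TDATA   |   in|   32|          axis|    stream_in1|       pointer|
|stream_in1_TREADY  |  out|    1|          axis|    stream_in1|       pointer|
|stream_in2_TVALID  |   in|    1|          axis|    stream_in2|       pointer|
|stream_in2_TDATA   |   in|   32|          axis|    stream_in2|       pointer|
|stream_in2_TREADY  |  out|    1|          axis|    stream_in2|       pointer|
|stream_out_TREADY  |   in|    1|          axis|    stream_out|       pointer|
|stream_out_TDATA   |  out|   32|          axis|    stream_out|       pointer|
|stream_out_TVALID  |  out|    1|          axis|    stream_out|       pointer|
"""


def infer_axis_interfaces(table_text: str) -> dict[str, dict[str, tuple[str, int]]]:
    """Return {source_object: {signal: (direction, bits)}} for axis ports."""
    interfaces: dict[str, dict[str, tuple[str, int]]] = {}
    for line in table_text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Skip separators, headers, and ports that do not use the axis protocol.
        if len(cells) != 6 or cells[3] != "axis":
            continue
        port, direction, bits, _protocol, source, _ctype = cells
        signal = port.rsplit("_", 1)[1]  # TDATA, TVALID, or TREADY
        interfaces.setdefault(source, {})[signal] = (direction, int(bits))
    return interfaces


axis_interfaces = infer_axis_interfaces(REPORT_ROWS)
for source, signals in axis_interfaces.items():
    # The data direction tells whether the module consumes or produces the stream.
    kind = "inbound" if signals["TDATA"][0] == "in" else "outbound"
    print(source, kind, signals)
```

Note that ap_clk and ap_rst_n are skipped because their protocol is ap_ctrl_none, so only the three stream_* objects end up as handshake candidates.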
Once the HLS solutions from step 1 are ready, we can use RapidStream to optimize the mixed design. For simplicity, we only generate the Out-of-Context (OoC) implementation results, which keeps the demonstration fast.
The run.py file shows how to use the provided APIs to specify the constraints that RapidStream needs for proper optimization.
For the handcrafted RTL and XCI files, we specify the source directories as follows:
rs.add_vlog_dir(f"{CURR_DIR}/design/rtl")
rs.add_xci_dir(f"{CURR_DIR}/design/xci/ip/design_1_proc_sys_reset_0_0")
For the HLS-generated solutions, we specify the directories as below, so that RapidStream can infer the interfaces from the .rpt or .xml files.
rs.add_hls_solution(f"{CURR_DIR}/build/kernel_add/solution")
rs.add_hls_solution(f"{CURR_DIR}/build/read_mem/solution")
rs.add_hls_solution(f"{CURR_DIR}/build/write_mem/solution")
To optimize manual designs well, we must explicitly define the placement of ports on the device by applying constraints. For instance, to assign ports to a specific region, such as HBM AXI 16-31, we use the appropriate API to allocate the ports to the designated slot, SLOT_X1Y0.
rs.assign_port_to_region(".*", "SLOT_X1Y0:SLOT_X1Y0")
We can use assign_cell_to_region with regular expressions to assign target cells to specific slot regions. To demonstrate this capability, we have intentionally assigned four modules to specific slots, as shown below.
rs.assign_cell_to_region(".*kernel_add.*", "SLOT_X0Y1:SLOT_X0Y1")
rs.assign_cell_to_region(".*fifo_read2kernel0.*", "SLOT_X0Y1:SLOT_X0Y1")
rs.assign_cell_to_region(".*fifo_read2kernel1.*", "SLOT_X0Y0:SLOT_X0Y0")
rs.assign_cell_to_region(".*fifo_kernel2write.*", "SLOT_X1Y1:SLOT_X1Y1")
Finally, run RapidStream to optimize the design with the commands below, or execute make all:
source <Vitis_install_path>/Vitis/2023.2/settings64.sh
rapidstream ./run.py
When execution completes, the target modules have been assigned to their target slots:
+------------------+------------------------+------+------+----------+-----+------+
|name | floorplan | ff | lut | bram_18k | dsp | uram |
+------------------+------------------------+------+------+----------+-----+------+
|fifo_read2kernel1 | SLOT_X0Y0_TO_SLOT_X0Y0 | 20 | 31 | 1 | 0 | 0 |
|read_mem_0 | SLOT_X1Y0_TO_SLOT_X1Y0 | 2509 | 1526 | 15 | 0 | 0 |
|read_mem_1 | SLOT_X1Y0_TO_SLOT_X1Y0 | 2509 | 1526 | 15 | 0 | 0 |
|reset_syncer | SLOT_X1Y0_TO_SLOT_X1Y0 | 40 | 18 | 0 | 0 | 0 |
|write_mem_0 | SLOT_X1Y0_TO_SLOT_X1Y0 | 2359 | 1559 | 16 | 0 | 0 |
|fifo_read2kernel0 | SLOT_X0Y1_TO_SLOT_X0Y1 | 20 | 31 | 1 | 0 | 0 |
|kernel_add_0 | SLOT_X0Y1_TO_SLOT_X0Y1 | 645 | 310 | 0 | 2 | 0 |
|fifo_kernel2write | SLOT_X1Y1_TO_SLOT_X1Y1 | 20 | 31 | 1 | 0 | 0 |
+------------------+------------------------+------+------+----------+-----+------+
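To see how the floorplan distributes resources, the per-module numbers in the table above can be summed per slot. The sketch below does this in plain Python with the rows transcribed by hand. The function usage_per_slot is a hypothetical helper for illustration, not something RapidStream provides.

```python
from collections import defaultdict

FIELDS = ("ff", "lut", "bram_18k", "dsp", "uram")

# (name, floorplan, ff, lut, bram_18k, dsp, uram), transcribed from the table above.
ROWS = [
    ("fifo_read2kernel1", "SLOT_X0Y0_TO_SLOT_X0Y0", 20, 31, 1, 0, 0),
    ("read_mem_0", "SLOT_X1Y0_TO_SLOT_X1Y0", 2509, 1526, 15, 0, 0),
    ("read_mem_1", "SLOT_X1Y0_TO_SLOT_X1Y0", 2509, 1526, 15, 0, 0),
    ("reset_syncer", "SLOT_X1Y0_TO_SLOT_X1Y0", 40, 18, 0, 0, 0),
    ("write_mem_0", "SLOT_X1Y0_TO_SLOT_X1Y0", 2359, 1559, 16, 0, 0),
    ("fifo_read2kernel0", "SLOT_X0Y1_TO_SLOT_X0Y1", 20, 31, 1, 0, 0),
    ("kernel_add_0", "SLOT_X0Y1_TO_SLOT_X0Y1", 645, 310, 0, 2, 0),
    ("fifo_kernel2write", "SLOT_X1Y1_TO_SLOT_X1Y1", 20, 31, 1, 0, 0),
]


def usage_per_slot(rows):
    """Sum each resource type over all modules placed in the same slot."""
    totals = defaultdict(lambda: dict.fromkeys(FIELDS, 0))
    for _name, floorplan, *counts in rows:
        for field, count in zip(FIELDS, counts):
            totals[floorplan][field] += count
    return dict(totals)


slot_usage = usage_per_slot(ROWS)
for slot, usage in sorted(slot_usage.items()):
    print(slot, usage)
```

The totals show that SLOT_X1Y0, which holds the memory-facing modules and the ports, carries most of the logic, while the other three slots hold only the FIFOs and the adder kernel.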
The final OoC implementation layout is as below.
RapidStream mandates a clear distinction between communication and computation within user designs.
- In Group modules, users are tasked solely with defining inter-submodule communication. For those familiar with the Vivado IP Integrator (IPI) flow, crafting a Group module mirrors the process of connecting IPs in IPI. RapidStream subsequently integrates appropriate pipeline registers into these Group modules.
- In Leaf modules, users retain the flexibility to implement diverse computational patterns, as RapidStream leaves these Leaf modules unchanged.
For further details, please consult the code style section in our Documentation.
To generate a report on group types, execute the commands below or run make show_groups:
rapidstream ../../common/util/get_group.py \
-i build/passes/0-imported.json \
-o build/module_types.csv
The module types for your design can be found in build/module_types.csv. Below, we list the Group modules. In this design, VecAddMix serves as a Group module, while the remaining entries are added by RapidStream.
| Module Name | Group Type |
|---|---|
| VecAddMix | grouped_module |
| __rs_ap_ctrl_start_ready_pipeline | grouped_module |
The RapidStream flow performs design space exploration and creates optimized design checkpoint (.dcp) files. If your execution is successful, you should find the post-route DCP located at:
build/run.py/dse/candidate_0/route.dcp
To review the timing results for each generated design point, use this command:
find build/run.py/dse -name timing_summary.rpt
This command helps you locate the relevant files within the ./build/run.py/dse directory so that you can analyze them.
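If you prefer to collect the reports from a script, the find command above has a direct equivalent in Python's standard pathlib module. The sketch below is a minimal example. The function name find_timing_reports is our own, and the directory layout below dse is assumed to match the paths shown in this recipe.

```python
from pathlib import Path


def find_timing_reports(dse_dir: str) -> list[Path]:
    """Recursively collect every timing_summary.rpt under dse_dir."""
    root = Path(dse_dir)
    if not root.is_dir():
        return []  # nothing generated yet, or wrong working directory
    return sorted(root.rglob("timing_summary.rpt"))


# Same search root as the find command above.
for report in find_timing_reports("build/run.py/dse"):
    print(report)
```

Sorting the result keeps the candidate order stable between runs, which makes it easier to compare design points side by side.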