Mutual Information (MI) is a crucial similarity metric widely used in image registration. However, the intensive computational demands of histogram extraction and entropy computation make MI expensive to evaluate. While GPU and FPGA accelerations have seen advancements, AI Engines remain underexplored for this purpose.
This project presents a hardware-accelerated solution leveraging the AI Engine of the ACAP Versal VCK5000 for efficient entropy computation using joint and marginal histograms of two digital images. Our approach focuses on extensive parallelization across multiple tiles and SIMD operations, enhancing performance significantly.
Key highlights of this work include:

- **Novel Vectorized Logarithm Implementation**: We introduce a new vectorized base-two logarithm function, absent from the current AIE API, which outperforms both the Intel i7-4770 CPU and the existing AIE API logarithm implementation. This new function achieves a $1.49\times$ speedup with minimal error margins for values in the range $[0, 100]$.
- **Performance Evaluation**: Through three different AIE graph implementations, we observe substantial speedups in entropy computation for MI, particularly with parallelized kernels, achieving a $9.71\times$ improvement.
- **Broader Applications**: While focused on image registration, our method has potential applications in other areas requiring efficient MI computation, such as feature selection and cryptanalysis.
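The general technique behind vectorized base-two logarithms can be sketched as follows: split the floating-point input into its unbiased exponent and a mantissa in $[1, 2)$, then approximate $\log_2$ of the mantissa with a small polynomial. The scalar Python sketch below illustrates this exponent-plus-polynomial strategy; the coefficients come from a well-known fast-log2 approximation and are illustrative assumptions, not the actual AIE kernel.

```python
import struct

def fast_log2(x: float) -> float:
    """Approximate log2(x) for x > 0.

    Splits the IEEE-754 single-precision representation of x into an
    unbiased exponent and a mantissa m in [1, 2), then evaluates a
    quadratic approximation of log2(m). This mirrors the general
    strategy used by vectorized log implementations; it is NOT the
    actual AIE kernel.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = ((bits >> 23) & 0xFF) - 127                 # unbiased exponent
    m_bits = (bits & 0x007FFFFF) | (127 << 23)             # force exponent to 0
    m = struct.unpack("<f", struct.pack("<I", m_bits))[0]  # m in [1, 2)
    # Quadratic fit of log2(m) on [1, 2); max error roughly 5e-3.
    return exponent + (-0.34484843 * m * m + 2.02466578 * m - 1.67487759)
```

On a SIMD target the bit manipulation and polynomial evaluation are elementwise, so a whole vector of inputs can be processed per instruction, which is what makes this structure attractive for the AIE.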
Before running the project, ensure you have the necessary tools and dependencies installed:

- Xilinx Vivado or Vitis tools, depending on your hardware platform (`xilinx_vck5000_gen4x8_qdma_2_202220_1` in this case).
- The `make` utility installed on your system.
- **Clone the Project Repository**

  ```
  git clone <repository_url>
  cd <project_directory>
  ```

- **Set Environment Variables**

  Ensure that the `TARGET` and `PLATFORM` variables are set correctly in the Makefile:

  - `TARGET`: Specifies the build target (`hw` or `hw_emu`).
  - `PLATFORM`: Specifies the Xilinx platform (`xilinx_vck5000_gen4x8_qdma_2_202220_1` in this case).

  You can adjust these variables directly in the Makefile or pass them as arguments when invoking `make`.

- **Build Hardware (xclbin) Objects**

  ```
  make build_hw TARGET=hw PLATFORM=xilinx_vck5000_gen4x8_qdma_2_202220_1
  ```

  This command compiles the AIE (AI Engine) and data movers and links them into a hardware binary (`overlay_hw.xclbin`). Note that the bitstream is already included in the `hw` directory, so this step can be skipped.

- **Build Software Object**

  ```
  make build_sw
  ```

  This command compiles the software components located in the `sw` directory. Running `./host_overlay.exe` (from the `sw` directory) executes the build.

- **Pack the Build**

  ```
  make pack
  ```

  This command copies the necessary files (`host_overlay.exe` and `overlay_hw.xclbin`) into a `build/hw_build` directory.

- **Run Testbenches with x86 (Optional)**

  ```
  make testbench_all
  ```

  This command compiles the AIE for x86 and sets up testbenches for the joint AIE, the marginal AIE, and the sink from AIE in the `data_movers` directory.

- **Run AIE Kernels with VLIW Architecture (Optional)**

  ```
  make all
  ```

  This command first compiles the AIE for VLIW and then simulates it in the `aie` directory.

- **Clean Up (Optional)**

  ```
  make clean
  ```

  This command removes compiled binaries and temporary files from all directories (`aie`, `data_movers`, `hw`, `sw`).
Further optional commands can be found in the Makefiles of the individual directories.
To accelerate Mutual Information computation with the AI Engine, we developed two different AIE graph configurations:
The first graph features three kernels: one computes the marginal entropies from the two input histograms, another computes the joint entropy from the joint histogram, and the third calculates the Mutual Information by subtracting the joint entropy from the sum of the marginal entropies. The first two kernels operate concurrently, while the third kernel waits for their outputs before proceeding.
The second graph implements a systolic array with 8 kernels organized in a 4x2 matrix for computing the joint entropy, 1 kernel for marginal entropy, and 3 kernels for a reduce-structure to compute the final MI value efficiently.
These configurations were designed to distribute computation evenly among AIE kernels, leveraging parallelism and optimizing throughput. The systolic array arrangement, in particular, enhances efficiency by allowing each kernel to process distinct portions of the histogram concurrently, significantly speeding up the entropy computation process.
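Conceptually, the work split across the kernels in both graphs follows the standard histogram-based MI formula, $MI(X;Y) = H(X) + H(Y) - H(X,Y)$. The minimal host-side reference below is plain Python, not AIE code, and the function names are ours for illustration:

```python
import math

def entropy(hist):
    """Shannon entropy in bits of a histogram (list of bin counts)."""
    total = sum(hist)
    return -sum((c / total) * math.log2(c / total) for c in hist if c > 0)

def mutual_information(hist_x, hist_y, joint_hist):
    """MI(X; Y) = H(X) + H(Y) - H(X, Y), computed from the two
    marginal histograms and the flattened joint histogram."""
    return entropy(hist_x) + entropy(hist_y) - entropy(joint_hist)
```

The per-bin terms of the entropy sums are independent, which is what allows the joint-entropy work to be partitioned across the systolic-array tiles, with the final subtraction handled by the reduce stage.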
The following table presents the performance comparison for the logarithm computation, highlighting the latency, throughput, and speedup of our proposed vectorized base-two logarithm implementation compared to the CPU and AIE utilities.
| Method | Latency (ms) | Throughput ($10^6$ log/s) | Speedup |
|---|---|---|---|
| CPU (Intel i7-4770) | 1.78 | 73.636 | +0% |
| AIE (utils) | 20.6 | 6.363 | -91.37% |
| AIE (ours) | 1.21 | 108.324 | +49.46% |
To test the AMD Versal VCK5000 on the computation of Mutual Information, we created a test bench that computes the MI from the marginal and joint histograms of 11 progressively more aligned image pairs. This allows us to evaluate performance on a test case that is as close as possible to a real scenario.
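For reference, the marginal and joint histograms fed to the accelerator can be built as in the following sketch. It is plain Python under the assumption of equally sized grayscale images given as flat intensity lists; the function names are illustrative, not from the repository.

```python
def marginal_histogram(img, bins=256):
    """Histogram of a flat list of intensities in [0, bins)."""
    hist = [0] * bins
    for v in img:
        hist[v] += 1
    return hist

def joint_histogram(img_a, img_b, bins=256):
    """Flattened bins x bins joint histogram of two equally sized
    images: bin (a, b) counts pixels where img_a has intensity a
    and img_b has intensity b at the same position."""
    hist = [0] * (bins * bins)
    for a, b in zip(img_a, img_b):
        hist[a * bins + b] += 1
    return hist
```

As two images become better aligned, mass in the joint histogram concentrates on fewer bins, lowering the joint entropy and raising the MI, which is why progressively aligned pairs exercise the full range of the computation.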
The following table summarizes the average latency and throughput measurements for MI computation using the CPU and AI Engine (AIE) configurations with 3 and 12 kernels.
| Method | Mean Latency (ms) | Mean Throughput (jobs/ms) |
|---|---|---|
| CPU (Intel i7-4770) | 1.515 | 1.123 |
| AIE (3 kernels) | 0.720 | 1.436 |
| AIE (12 kernels) | 0.156 | 6.684 |
As shown in the performance comparison table, the AIE achieves a mean-latency speedup over the CPU of $2.10\times$ with 3 kernels and $9.71\times$ with 12 kernels.
In future work, we will integrate our technique for computing Mutual Information (MI) into a full image registration pipeline, so that all stages of image registration execute on the AMD ACAP Versal VCK5000. While this study concentrated on Mutual Information for image registration, future research could also investigate accelerating MI computation in other domains, such as feature selection or cryptanalysis.