Kernels written for the Radiance GPU.
You will need a GCC rv32 toolchain built and installed from
riscv-gnu-toolchain. You
might need to set the ABI to ilp32 (not ilp32d). Export the installation path
as the environment variable $(RISCV_TOOLCHAIN_PATH); there should be a directory
called riscv32-unknown-elf under that path. The purpose of this installation
is only to retrieve headers and definitions from the GCC sysroot. The binary
libraries are not usable, and no libraries (e.g. libm, crt) can be linked
against because of ISA differences.
There are two ways to install the Muon LLVM toolchain.
- Prebuilt: run ./scripts/llvm_prebuilt.sh, which will decompress the existing archived LLVM binaries at llvm/llvm-muon.tar.xz. This is compiled on an Ubuntu 24.04 system with GLIBC 2.39, meaning there's a good chance it won't work on older systems. You will also need ZSTD. If for any reason the prebuilt toolchain doesn't work, use the second method.
- Build from scratch: run ./scripts/llvm.sh. This will initialize the submodule located at llvm/llvm-src, which is not cloned by default due to its size. You will need clang, clang++, and ninja installed on your system (more may be required) to build the Muon toolchain.
Once the toolchain is installed, compile the Muon runtime by running make
under lib/. There should now be a lib/libmuonrt.a; if not, revisit the previous
steps. As a further compiler sanity check, you can also look for any
<unknown> entries in the assembly dump lib/libmuonrt.dump; all instructions in
executable sections should be 8 bytes long and properly recognized by the
disassembler.
First initialize the submodule:
git submodule update --init muon-isa-tests
Then compile the tests:
autoconf
./configure
cd isa
make
This should generate a bunch of binaries under isa along with their dumps.
You can run the tests with cyclotron,
or straight from the isa directory by using make run. Note that you'll need
to go into isa/Makefile and change the MUON_SIM variable to point to the
compiled cyclotron binary.
Similar to the ISA tests, go to kernels/<kernel_name> and run make.
Use *.elf binaries for the BINARY= argument of the Chipyard RTL
simulations.
*.soc.elf binaries fuse both the host CPU and device GPU programs
into one, and they should be run on the SoC CONFIGs such as
RadianceTapeoutConfig and RadianceSingleClusterConfig.
*.radiance.elf binaries are GPU-only kernels that should be run on
host-less CONFIGs such as MuonCoreTestConfig.
See RadianceConfigs.scala for the full list of configs.
When placing threadblock barriers (mu_barrier in mu_intrinsic.h) next to
thread-divergent branches, be aware that the compiler may duplicate the
barrier into both branch paths, which results in a deadlock. For example:
if (tid_in_threadblock == 0) {
    // do something
}
mu_barrier(0, NUM_WARPS);
may be transformed to:
if (tid_in_threadblock == 0) {
    // do something
    mu_barrier(0, NUM_WARPS);
} else {
    mu_barrier(0, NUM_WARPS);
}
This may result in a deadlock, since the thread-divergent warp 0 executes
mu_barrier twice due to SIMT branch serialization: once on the if-path and
once on the else-path. Because the other warps are convergent, they execute
the barrier only once, so warp 0 waits forever at its second barrier with no
participation from the other warps.
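The mismatch in barrier counts can be checked with a toy host-side model. This is only a sketch of the reasoning above, not of the hardware: NUM_WARPS, THREADS_PER_WARP, the thread-to-warp mapping, and both helper functions are hypothetical and chosen purely for illustration.

```c
#include <stdbool.h>

#define NUM_WARPS 4         /* hypothetical threadblock shape */
#define THREADS_PER_WARP 8  /* hypothetical warp width */

/* Number of times one warp executes the barrier instruction after the
 * compiler has copied mu_barrier into both the if- and else-paths.
 * Under SIMT serialization a warp runs every path that at least one of
 * its threads takes, so a divergent warp runs both copies. */
static int barrier_execs_duplicated(int warp) {
    bool takes_if = false, takes_else = false;
    for (int lane = 0; lane < THREADS_PER_WARP; lane++) {
        /* Assumed mapping from (warp, lane) to the thread ID. */
        int tid_in_threadblock = warp * THREADS_PER_WARP + lane;
        if (tid_in_threadblock == 0) {
            takes_if = true;
        } else {
            takes_else = true;
        }
    }
    return (int)takes_if + (int)takes_else;
}

/* With the barrier left after the branch, every warp reconverges first
 * and executes the barrier exactly once. */
static int barrier_execs_original(int warp) {
    (void)warp;
    return 1;
}
```

In this model, warp 0 reports two barrier executions in the duplicated form while every other warp reports one, which is the imbalance that leaves warp 0 waiting.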
This problem requires a compiler fix; until then, it can only be worked around, with varying levels of friction.
Putting an explicit nop in the else clause sometimes helps:
if (tid_in_threadblock == 0) {
    // do something
} else {
    asm volatile ("nop");
}
mu_barrier(...);
Placing the barrier further away from divergent branches also helps, e.g. by wrapping the branch inside a function.
Using the -Os compiler flag also seems to keep the compiler from doing aggressive
branch-duplication, albeit with a performance impact.
We added __attribute__((convergent, noinline)) to mu_barrier, but
that doesn't seem to fix this.
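The function-wrapping workaround mentioned above might look like the following sketch. The helper names leader_work and kernel_body are hypothetical, and marking the helper noinline is an assumption about how to keep the branch away from the barrier; the source only says that wrapping the branch in a function helps. The mu_barrier stub and the NUM_WARPS value exist only so the sketch compiles on a host; in a real kernel, delete the stub and include mu_intrinsic.h instead.

```c
/* Host-side stand-in for the real intrinsic, with an assumed two-int
 * signature. Delete this stub and include mu_intrinsic.h on the device. */
static int barrier_calls = 0;
static void mu_barrier(int id, int count) {
    (void)id;
    (void)count;
    barrier_calls++;
}

#define NUM_WARPS 4  /* hypothetical value for illustration */

/* The divergent branch lives in its own function. noinline (an
 * assumption, not confirmed by the source) is meant to stop the compiler
 * from merging the branch back into the caller, where it could copy the
 * barrier into both paths. */
__attribute__((noinline))
static void leader_work(int tid_in_threadblock, int *out) {
    if (tid_in_threadblock == 0) {
        *out = 42;  /* placeholder for "do something" */
    }
}

static void kernel_body(int tid_in_threadblock, int *out) {
    leader_work(tid_in_threadblock, out);
    /* The barrier now follows a call rather than a branch. */
    mu_barrier(0, NUM_WARPS);
}
```

As with the other workarounds, this is not guaranteed to survive every optimization level, so it is still worth checking the generated assembly for a duplicated barrier.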