How to Write Software
A document describing how to port parallel code (OpenMP, DOALL) to the proposed architecture. Each section gives a guideline on how to think about the porting process. If you have a guideline in mind, please add it as a new section.
To run programs on the manycore model, we write software using pthreads (rather than, for example, OpenMP). Pthreads are used for both the baseline manycore code and the vectorized code. Importantly, pthreads let us customize each core's configuration so that it fits nicely in a vector group (elaborated below).
There are a few helper functions and templates that facilitate ease of use. The main function runs only on CPU0 and is responsible for allocating memory for the kernel and launching pthreads on the other cores with pointers to the allocated memory. You can use either malloc or malloc_cache_aligned to allocate memory. malloc_cache_aligned guarantees that the array starts at the beginning of a cache line, which can improve performance, but it is not necessary as long as you know what you're doing when doing vector prefetching (described below). A helper function launch_kernel creates a pthread on each specified core, along with the arguments and kernel that you want to run. The function blocks until all threads have completed.
The recommended template is to create a kernel.h file that implements the following three functions and a struct. Defining these lets you use launch_kernel effectively.
- Kern_Args: a struct to hold all of the kernel arguments
- Kern_Args* construct_args(args): packs all of the kernel args
- void* pthread_kernel(): launches the kernel via pthreads
- void kernel(args): the actual kernel implementation, with a normal function signature
For smaller kernels, there may be an issue with how threads map to cores. In the model, a new thread is assigned to the lowest-numbered core that is not currently running a thread. So if the kernel is very small, threads may keep getting mapped to the same core as it quickly finishes each one. A barrier, i.e., pthread_barrier_wait(), can be helpful in this case. Note that a pthread cannot be allocated on CPU0, so CPU0 starts its work with a normal function call to kernel(). launch_kernel takes care of this nuance.
To access pthreads and the helper functions, include <pthread.h> and pthread_launch.h.
Unconditional jumps and predication are supported in vector cores. TODO: give example usage.
The default place for the stack is in main memory (i.e., mapped to DRAM). In an ordinary CPU with local caches, the stack can be cached in fast-to-access memory. However, in our system this automatic caching policy does not exist because data must be directed into local memory by the programmer. This results in slow accesses to the stack, which would likely lie in an LLC rather than a local cache.
We would like the stack on the local scratchpad for fast access in case of register spills within a hot loop. There are many ways to get the stack onto the local scratchpad. We currently adopt the lowest-effort approach and manually force the stack pointer onto the scratchpad right before launching the critical kernel. The upside is that ONLY the most relevant stack frames get mapped into the scratchpad, and not older, less useful ones. The downside is that you need to manually copy part of the most recent frame onto the scratchpad. This size is not known unless you disassemble the binary and look at which stack offsets are being used.
Consider the following example of moving the stack pointer and then calling the hot kernel. The top two instructions are inline assembly that swap the stack pointer onto the scratchpad and save the old stack pointer. Then, before the kernel function stencil_vector is called, there are multiple accesses through the stack pointer, but these reference data at the OLD stack location, not the empty one we just swapped to. Notice that we only access offsets up to 64 at most, so it suffices to copy up to that offset before moving the stack pointer.
```
--------------
11578: 00010a13    mv    s4,sp
1157c: 00050113    mv    sp,a0
--------------
11580: 03013783    ld    a5,48(sp)
--------------
11588: 04013e03    ld    t3,64(sp)
--------------
115a0: 01513423    sd    s5,8(sp)
115a4: 03813a83    ld    s5,56(sp)
--------------
115b8: 02f13023    sd    a5,32(sp)
115bc: 00d13c23    sd    a3,24(sp)
115c0: 00d13823    sd    a3,16(sp)
115c4: 01613023    sd    s6,0(sp)
--------------
115e8: 03c13823    sd    t3,48(sp)
115ec: be0ff0ef    jal   ra,109cc <stencil_vector>
--------------
11738: 000a0113    mv    sp,s4
--------------
```
An example of inline assembly that copies part of the stack onto the scratchpad before doing the stack swap:
```c
// get address on the scratchpad to place the stack pointer
unsigned long long *spTop = getSpTop(ptid);
spTop -= 9;
unsigned long long stackLoc;
asm volatile (
  // copy part of the stack onto the scratchpad in case there are any
  // loads from the stack right before the function call
  "ld t0, 0(sp)\n\t"
  "sd t0, 0(%[spad])\n\t"
  "ld t0, 8(sp)\n\t"
  "sd t0, 8(%[spad])\n\t"
  "ld t0, 16(sp)\n\t"
  "sd t0, 16(%[spad])\n\t"
  "ld t0, 24(sp)\n\t"
  "sd t0, 24(%[spad])\n\t"
  "ld t0, 32(sp)\n\t"
  "sd t0, 32(%[spad])\n\t"
  "ld t0, 40(sp)\n\t"
  "sd t0, 40(%[spad])\n\t"
  "ld t0, 48(sp)\n\t"
  "sd t0, 48(%[spad])\n\t"
  "ld t0, 56(sp)\n\t"
  "sd t0, 56(%[spad])\n\t"
  "ld t0, 64(sp)\n\t"
  "sd t0, 64(%[spad])\n\t"
  // save the old stack pointer
  "addi %[dest], sp, 0\n\t"
  // overwrite the stack pointer with the scratchpad address
  "addi sp, %[spad], 0\n\t"
  : [dest] "=r" (stackLoc)
  : [spad] "r" (spTop)
);
// run kernel
// restore the stack pointer
asm volatile (
  "addi sp, %[stackTop], 0\n\t" :: [stackTop] "r" (stackLoc)
);
```
Vector groups typically have a characteristic length that they can perform operations on. For example, in a stencil kernel with a filter dimension of three, a characteristic length would be FILTER_DIM * VECTOR_LEN, where FILTER_DIM is the length of the filter and VECTOR_LEN is the number of cores in the vector group. This vector group would be able to process multiples of FILTER_DIM * VECTOR_LEN, but would fail for non-multiples.
An obvious but partial solution is to strip off the part of the problem size that doesn't map onto the vector groups, i.e., with a mod operation. The question then becomes where to do this unmapped work. There are three solutions, enumerated below.
- Do the work sequentially on the host CPU. This might be the right solution for a conventional discrete vector accelerator.
- Do the work in parallel using the manycore after the fact.
- Same as 2, but try to schedule the manycore work on inactive cores in parallel with the main chunk of work.
The proposed architecture is in a somewhat unique position in that we can deconfigure the vector groups and just use the more flexible manycore to finish the computation. This is preferred over running sequentially on a host CPU because the manycore version will be much faster. Where we put this work also might not be a huge issue, because the unmapped part is likely a small fraction of the computation; however, that may not be true for a multidimensional kernel, where you might have overhangs at the end of each row.
To be determined.
The gem5 simulator reports statistics of the simulated run (cycles, llc_misses, etc.) that are useful for comparisons with other simulated systems. These can be found in path/to/results/stats.txt. In many kernels, the startup work of allocating memory and launching pthreads consumes a significant amount of simulation time. We would like to exclude this from our performance statistics, especially for smaller kernels. We use custom instructions to turn statistics monitoring on and off at the boundaries of the critical section of the program. The instructions are encapsulated in the macros stats_on() and stats_off() from bind_defs.h.
Stats should only be turned on and off by a single thread. It's important to make sure the other threads are done, or close to done, when profiling is turned off. A pthread_barrier_wait can be useful in this scenario.
The error gem5.opt: build/RVSP/cpu/io/iew.cc:258: void IEW::doWriteback(): Assertion 0 failed. indicates a segfault in the simulated program.
Sometimes you can attach gdb to the software running on top of gem5, but we haven't gotten this to work with our software yet.
- Run gem5 normally in one shell (running <software binary, built with -g>)
- In another shell:
  - gdb <software binary, built with -g>
  - set remote Z-packet on
  - target remote 127.0.0.1:7000
  - c