
How to Write Software


A document describing how to port parallel code (OpenMP, DOALL) to the proposed architecture. Each section gives guidelines on how to think about the porting process. If you have a guideline in mind, please add it as a new section.

Pthread Programming

To run programs on the manycore model, we write software using pthreads (rather than, for example, OpenMP). Pthreads are used for both baseline manycore code and vectorized code. Importantly, pthreads let us customize each core's configuration so that it fits nicely into a vector group (elaborated below).

There are a few helper functions and templates that facilitate ease of use. The main function runs only on CPU0 and is responsible for allocating memory for the kernel and launching pthreads on the other cores with pointers to the allocated memory. You can use either malloc or malloc_cache_aligned to allocate memory. malloc_cache_aligned guarantees that the array starts at the beginning of a cache line, which can improve performance, but it is not necessary as long as you know what you're doing with vector prefetching (described below). The helper function launch_kernel creates a pthread on each specified core, passing the kernel you want to run along with its arguments. The function blocks until all threads have completed.

The recommended template is to create a kernel.h file that implements the following three functions and a struct. Defining these lets you use launch_kernel effectively.

  • Kern_Args to hold all of the kernel arguments
  • Kern_Args* construct_args(args) to pack all of the kernel args
  • void* pthread_kernel() to launch the kernel via pthreads
  • void kernel(args), the actual kernel implementation, which has a normal function signature
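
A minimal sketch of what such a kernel.h might look like. Only the four names above come from this template; the struct fields and argument lists are illustrative assumptions.

  // kernel.h (sketch)
  #ifndef __KERNEL_H__
  #define __KERNEL_H__

  // holds all of the kernel arguments (fields are illustrative)
  typedef struct Kern_Args {
    float *a;      // input array
    float *b;      // output array
    int n;         // problem size
    int tid;       // id of the thread/core running this copy of the kernel
    int num_cores; // total number of cores participating
  } Kern_Args;

  // pack the kernel args into a heap-allocated struct
  Kern_Args *construct_args(float *a, float *b, int n, int tid, int num_cores);

  // pthread entry point; unpacks the args and calls kernel()
  void *pthread_kernel(void *args);

  // the actual kernel, with a normal function signature
  void kernel(float *a, float *b, int n, int tid, int num_cores);

  #endif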

For smaller kernels, there may be an issue with the mapping of threads to cores. In the model, a new thread is assigned to the lowest-numbered core that is not currently running a thread. So if the kernel is very small, threads may keep getting mapped to the same core as it quickly finishes each one. A barrier can be helpful in this case, i.e., pthread_barrier_wait(). Note that a pthread cannot be allocated on CPU0, so CPU0 just does a normal function call of kernel() to start. launch_kernel takes care of this nuance.

To access pthreads and the helper functions, include <pthread.h> and pthread_launch.h.
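
A sketch of the main-side launch. The exact signatures of malloc_cache_aligned and launch_kernel are assumptions here; consult pthread_launch.h for the real interface.

  #include <stdlib.h>
  #include <pthread.h>
  #include "pthread_launch.h"
  #include "kernel.h"

  int main() {
    int n = 1024;
    int num_cores = 4; // illustrative; the real core count comes from the platform

    // allocate kernel memory on CPU0
    // (malloc_cache_aligned's signature is assumed to mirror malloc)
    float *a = (float*)malloc_cache_aligned(n * sizeof(float));
    float *b = (float*)malloc_cache_aligned(n * sizeof(float));

    // pack per-core arguments
    Kern_Args **args = (Kern_Args**)malloc(num_cores * sizeof(Kern_Args*));
    for (int tid = 0; tid < num_cores; tid++) {
      args[tid] = construct_args(a, b, n, tid, num_cores);
    }

    // create a pthread on each core (CPU0 calls kernel() directly) and
    // block until every thread has completed; signature assumed
    launch_kernel(pthread_kernel, (void**)args, num_cores);

    free(args);
    free(a);
    free(b);
    return 0;
  }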

Declaring Vector Groups

Using Vector Groups

Control Flow

Unconditional jumps and predication are supported on vector cores. TODO: give example usage.

Optimizing for Memory Hierarchy

Prefetching

Declaring a Frame

Using Frames

The Stack

The default place for the stack is main memory (i.e., mapped to DRAM). On an ordinary CPU with local caches, the stack is cached automatically in fast-to-access memory. In our system, however, this automatic caching does not exist because data must be explicitly placed into local memory by the programmer. This results in slow accesses to the stack, which would likely lie in an LLC rather than a local cache.

We would like the stack on the local scratchpad for fast access in case of register spills within a hot loop. There are many ways to get the stack onto the local scratchpad. We currently adopt the lowest-effort approach and manually force the stack pointer onto the scratchpad right before launching the critical kernel. The upside is that ONLY the most relevant stack frames are mapped into the scratchpad, not older, less useful ones. The downside is that you need to manually copy part of the most recent frame onto the scratchpad. The size of this copy is not known unless you disassemble the binary and look at which stack offsets are being used.

Consider the following example of moving the stack pointer and then calling the hot kernel. The top two instructions are inline assembly that swaps the stack pointer onto the scratchpad and saves the old stack pointer. Then, before the kernel function stencil_vector is called, there are multiple accesses through the stack pointer, but these reference data at the OLD stack location, not the empty one we just swapped to. Notice that we only access up to offset 64; it suffices to copy up to that offset before moving the stack pointer.

   --------------
   11578:	00010a13          	mv	s4,sp
   1157c:	00050113          	mv	sp,a0
   --------------
   11580:	03013783          	ld	a5,48(sp)
   --------------
   11588:	04013e03          	ld	t3,64(sp)
   --------------
   115a0:	01513423          	sd	s5,8(sp)
   115a4:	03813a83          	ld	s5,56(sp)
   --------------
   115b8:	02f13023          	sd	a5,32(sp)
   115bc:	00d13c23          	sd	a3,24(sp)
   115c0:	00d13823          	sd	a3,16(sp)
   115c4:	01613023          	sd	s6,0(sp)
   --------------
   115e8:	03c13823          	sd	t3,48(sp)
   115ec:	be0ff0ef          	jal	ra,109cc <stencil_vector>
   --------------
   11738:	000a0113          	mv	sp,s4
   --------------

An example of inline assembly that copies part of the current stack frame onto the scratchpad, saves the old stack pointer, and then does the stack swap.

  // get address on the scratchpad to place the stack pointer
  unsigned long long *spTop = getSpTop(ptid);
  spTop -= 9;

  unsigned long long stackLoc;
  asm volatile (
    // copy part of the stack onto the scratchpad in case there are any loads to scratchpad right before
    // function call
    "ld t0, 0(sp)\n\t"
    "sd t0, 0(%[spad])\n\t"
    "ld t0, 8(sp)\n\t"
    "sd t0, 8(%[spad])\n\t"
    "ld t0, 16(sp)\n\t"
    "sd t0, 16(%[spad])\n\t"
    "ld t0, 24(sp)\n\t"
    "sd t0, 24(%[spad])\n\t"
    "ld t0, 32(sp)\n\t"
    "sd t0, 32(%[spad])\n\t"
    "ld t0, 40(sp)\n\t"
    "sd t0, 40(%[spad])\n\t"
    "ld t0, 48(sp)\n\t"
    "sd t0, 48(%[spad])\n\t"
    "ld t0, 56(sp)\n\t"
    "sd t0, 56(%[spad])\n\t"
    "ld t0, 64(sp)\n\t"
    "sd t0, 64(%[spad])\n\t"
    // save the stack ptr
    "addi %[dest], sp, 0\n\t" 
    // overwrite stack ptr
    "addi sp, %[spad], 0\n\t"
    : [dest] "=r" (stackLoc)
    : [spad] "r" (spTop)
    : "t0", "memory" // t0 is used as a scratch register and memory is written
  );

  // run kernel

  // restore stack pointer
  asm volatile (
    "addi sp, %[stackTop], 0\n\t" :: [stackTop] "r" (stackLoc)
  );

Mapping Problem Sizes to Vector Groups

Vector groups typically have a characteristic software length that they can perform operations on. For example, in a stencil kernel with a filter dimension of three, a characteristic length would be FILTER_DIM * VECTOR_LEN, where FILTER_DIM is the length of the filter and VECTOR_LEN is the number of cores in the vector group. This vector group can process multiples of FILTER_DIM * VECTOR_LEN, but would fail for non-multiples.

An obvious but partial solution is to split off the part of the problem size that doesn't quite map to the vector groups, i.e., with a mod operation (a sketch appears at the end of this section). The question then becomes where to do this unmapped work. There are three solutions, enumerated below.

  1. Do the work sequentially on the host CPU. This might be the right solution for a conventional discrete vector accelerator.
  2. Do the work in parallel using the manycore after the fact.
  3. Same as 2, but try to schedule the manycore work on inactive cores in parallel with the main chunk of work.

The proposed architecture is in a somewhat unique position where we can deconfigure the vector groups and just use the more flexible manycore to finish the computation. This is preferred over running sequentially on a host CPU because the manycore version will be much faster. Where we put this work also might not be a huge issue, because the unmapped part is likely a small fraction of the computation, but that may not be true for a multidimensional kernel, where you might have overhangs at the end of each row.
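
A minimal sketch of splitting the problem size with a mod operation. FILTER_DIM and VECTOR_LEN come from the stencil example above; the function name and the concrete values are illustrative.

  #define FILTER_DIM 3   // filter length (from the stencil example)
  #define VECTOR_LEN 4   // cores per vector group (illustrative)

  // split n into a portion that maps evenly onto the vector groups and a leftover
  void split_work(int n, int *mapped, int *leftover) {
    int chunk = FILTER_DIM * VECTOR_LEN; // characteristic length of one vector-group pass
    *mapped   = (n / chunk) * chunk;     // run the vectorized kernel over [0, mapped)
    *leftover = n % chunk;               // overhang finished by the manycore (options 2/3 above)
  }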

Reductions

To be determined.

Profiling

The gem5 simulator reports statistics of the simulated run (cycles, llc_misses, etc.) that are useful for comparisons with other simulated systems. These can be found in path/to/results/stats.txt. In many kernels, the time spent allocating memory and launching pthreads consumes a significant amount of simulation time. We would like to exclude this from our performance statistics, especially for smaller kernels. We use custom instructions to turn statistics monitoring on and off at the boundaries of the critical section of the program. The instructions are encapsulated in the macros stats_on() and stats_off() from bind_defs.h.

Stats should only be turned on and off by a single thread. It's important to make sure the other threads are done or close to done when profiling is turned off. A pthread_barrier_wait can be useful in this scenario.
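
A sketch of how stats_on()/stats_off() might be placed around the critical section. The macro names come from bind_defs.h, but the barriers and the surrounding structure are illustrative assumptions.

  #include <pthread.h>
  #include "bind_defs.h"   // provides stats_on() / stats_off()

  // assumed to be initialized elsewhere for all participating threads
  extern pthread_barrier_t start_barrier, end_barrier;

  void kernel(int tid /* plus the other kernel args */) {
    // wait until every thread has reached the critical section
    pthread_barrier_wait(&start_barrier);
    if (tid == 0) stats_on();   // a single thread toggles stats

    // ... critical section of the kernel ...

    // make sure the other threads are done (or close to done) before stopping stats
    pthread_barrier_wait(&end_barrier);
    if (tid == 0) stats_off();
  }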

Debugging

Common errors

The error gem5.opt: build/RVSP/cpu/io/iew.cc:258: void IEW::doWriteback(): Assertion 0 failed. indicates a segfault in the simulated program.

Remote GDB

Sometimes you can attach GDB to the software running on top of gem5, but we haven't gotten this to work with our software yet.

  • Run gem5 normally in one shell (running <software binary, built with -g>)

  • In another shell:

  • gdb <software binary, built with -g>

  • set remote Z-packet on

  • target remote 127.0.0.1:7000

  • c
