How to Write Software

A document describing how to port parallel code (OpenMP, DOALL) to the proposed architecture. Each section gives guidelines on how to think about the porting process. If you have a guideline in mind, please add it as a new section.

Pthread Programming

To run programs on the manycore model we write software using pthreads (rather than OpenMP for example). Pthreads are used for both baseline manycore code and vectorized code. Importantly, pthreads let us customize each core configuration so that it can fit nicely in a vector group (elaborated below).

There are a few helper functions and templates that facilitate ease of use. The main function only runs on CPU0 and is responsible for allocating memory for the kernel and launching pthreads on the other cores with pointers to the allocated memory. You can use either malloc or malloc_cache_aligned to allocate memory. malloc_cache_aligned guarantees that the array starts at the beginning of a cache line, which can improve performance, but it is not necessary as long as you know what you're doing with vector prefetching (described below). A helper function launch_kernel creates a pthread on each specified core with the arguments and kernel that you want to run. The function blocks until all threads have completed.

The recommended template is to create a kernel.h file that implements the following struct and three functions. Defining these lets you use launch_kernel effectively.

  • Kern_Args to hold all of the kernel arguments
  • Kern_Args* construct_args(args) to pack all of the kernel args
  • void* pthread_kernel() to launch the kernel via pthreads
  • void kernel(args), the actual kernel implementation with a normal-looking function signature (a sketch of this template is shown below)
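
A minimal sketch of such a kernel.h, assuming a simple element-wise kernel; the fields and argument lists are illustrative placeholders, only the struct and three functions above are required:

  // kernel.h -- illustrative sketch; field and argument names are placeholders
  typedef struct Kern_Args {
    float *a, *b, *c;   // kernel data pointers
    int n;              // problem size
    int tid_x, tid_y;   // this core's tile coordinates
    int dim_x, dim_y;   // mesh dimensions
  } Kern_Args;

  // pack the kernel arguments into a heap-allocated struct
  Kern_Args *construct_args(float *a, float *b, float *c, int n,
      int tid_x, int tid_y, int dim_x, int dim_y);

  // entry point handed to pthread_create; unpacks the args and calls kernel()
  void *pthread_kernel(void *args);

  // the actual kernel with a normal function signature
  void kernel(float *a, float *b, float *c, int n,
      int tid_x, int tid_y, int dim_x, int dim_y);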

For smaller kernels, there may be an issue with the thread-to-core mapping. In the model a new thread is assigned to the lowest-numbered core that is not currently running a thread. So if the kernel is very small, threads may keep getting mapped to the same core as it quickly finishes each one. A barrier can be helpful in this case, i.e., pthread_barrier_wait(). Note that a pthread cannot be allocated on CPU0, so CPU0 starts the kernel with a normal function call to kernel(). launch_kernel takes care of this nuance.

To access pthreads and the helper functions, include <pthread.h> and pthread_launch.h.
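
Putting it together, a main function on CPU0 might look roughly like the following. This is a sketch: the exact signatures of launch_kernel and malloc_cache_aligned are assumptions (check pthread_launch.h for the real ones), and the mesh dimensions are hard-coded placeholders.

  #include <stdlib.h>
  #include <pthread.h>
  #include "pthread_launch.h"
  #include "kernel.h"

  int main(int argc, char *argv[]) {
    int n = 1024;
    // mesh dimensions (placeholders; normally read from the simulation config)
    int dim_x = 8, dim_y = 8;

    // allocate kernel memory on CPU0; cache alignment helps vector prefetching
    // (argument order of malloc_cache_aligned is an assumption)
    float *a = (float*)malloc_cache_aligned(sizeof(float), n);
    float *b = (float*)malloc_cache_aligned(sizeof(float), n);
    float *c = (float*)malloc_cache_aligned(sizeof(float), n);

    // pack per-core arguments
    Kern_Args **args = (Kern_Args**)malloc(sizeof(Kern_Args*) * dim_x * dim_y);
    for (int y = 0; y < dim_y; y++)
      for (int x = 0; x < dim_x; x++)
        args[y * dim_x + x] = construct_args(a, b, c, n, x, y, dim_x, dim_y);

    // launch a pthread running pthread_kernel on every core (CPU0 calls kernel()
    // directly) and block until all threads have completed
    // (exact launch_kernel signature is an assumption)
    launch_kernel(pthread_kernel, (void**)args, dim_x, dim_y);
    return 0;
  }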

Declaring Vector Groups

All cores begin their pthread execution as independent cores. A programmer can choose to configure cores into a vector group before running the critical section of the kernel to gain performance and energy efficiency. The VECTOR_EPOCH macro maps to an instruction that configures the core in vector mode. A mask must be provided to the macro specifying exactly how the current core fits into a vector group. Two helper functions are used to build this mask: vector_group_template_*(), which declares the shape and locations of the vector groups, and getSIMDMask(), which figures out how a core fits into those declared groups. The * refers to a particular pre-developed configuration (currently vector lengths of 4 and 16 are supported in a 64-core simulation). Instead of using a pre-developed configuration, you can manually declare one group at a time using the rect_vector_group() helper function.

In general, the flow might look like the following within a pthread shell where the global core tids and mesh dimensions are known. It is also important to return the core to normal mode after the critical section has completed. This requires the DEVEC macro, which maps to a custom instruction that resets the core configuration.

void pthread_kernel(int ptid_x, int ptid_y, int pdim_x, int pdim_y) {
  // outputs describing where this core falls within a vector group
  int vtid, vtid_x, vtid_y, is_da, orig_x, orig_y, master_x, master_y, unique_id, total_groups;
  // dimensions of each vector group (assumed 2x2 here for vector length 4)
  int vdim_x = 2, vdim_y = 2;

  // determine which vector group this core should fall into (if any)
  int used = vector_group_template_4(ptid_x, ptid_y, pdim_x, pdim_y,
    &vtid, &vtid_x, &vtid_y, &is_da, &orig_x, &orig_y, &master_x, &master_y, &unique_id, &total_groups);

  if (used) {
     // configure the core for vector mode 
     int mask = getSIMDMask(master_x, master_y, orig_x, orig_y, vtid_x, vtid_y, vdim_x, vdim_y, is_da);
     VECTOR_EPOCH(mask);

     // run critical section

     // deconfigure core back to normal mode
     DEVEC(<unique_name>);
  }
}

Using Vector Groups

There are two types of cores per vector group: a single scalar core and some number of vector cores corresponding to the desired vector length. The general execution model is that the scalar core directs control and fetches memory on behalf of the vector cores. Directing control and fetching memory achieve energy efficiency and performance, respectively, but there is some cross-play, which will be described later.

Two Sets of Code

The kernel function to be run on the architecture needs to be written in a separate file. In the function, the programmer must write two sets of code: one that will run on the single scalar core and one that will run on all of the vector cores. Writing two sets of code in the same kernel essentially boils down to using two sets of variables (e.g., x_scalar and x_vector) and using pragma-defined sections to separate the code. We have a compiler pass to help fuse these into a single program binary.

Software hacks:

  1. The first line of the kernel must be VECTOR_EPOCH(mask); this sets the vector and scalar configurations of the core. From this point onwards, the scalar core is responsible for sending instructions to the vector cores.
  2. The first argument to the kernel should be mask. This ensures that it is passed in argument register a0 and there is no stack dependency. A sketch obeying both hacks is shown below.
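
As a hypothetical example (the data pointers, variable names, and the DEVEC label are all illustrative), a vector kernel following both hacks might start like this:

  void kernel(int mask, float *a, float *b, float *c, int n,
      int ptid_x, int ptid_y, int pdim_x, int pdim_y) {
    // hack 1: configure scalar/vector mode before anything else
    VECTOR_EPOCH(mask);

    // two sets of variables: one for the scalar code, one for the vector code
    int i_scalar = 0;
    int i_vector = 0;

    // ... pragma-delimited scalar section (runs on the scalar core) ...
    // ... pragma-delimited vector section (runs on the vector cores) ...

    // return to normal mode after the critical section
    DEVEC(devec_0);
  }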

Directing Control

The scalar core can command the vector cores to execute a chunk of vector core code using the ISSUE_VINST macro. The usage is to specify a label (a.k.a. an eventual PC address) representing a block of code you want the vector cores to run.

  // scalar code (only runs on the scalar core!)
  ISSUE_VINST(fable0);

  // vector code (only runs on vector cores!)
  fable0:
    c = a + b;

This manual linking becomes hard for loop bodies, especially ones within complex loop nests. The compiler pass will help with this.

Fetching Memory

Although the vector cores can fetch their own memory, it is often faster to have the scalar core prefetch the memory ahead of time for the vector cores to use. The VPREFETCH_L macro enables the scalar core to fetch data from global memory into each vector core's local scratchpad. The full usage will be described later, but a simple example is given below.

  // scalar code (only runs on the scalar core!)
  VPREFETCH_L(0, aPtr, 0, VLEN, 0);
  VPREFETCH_L(1, bPtr, 0, VLEN, 0);

  ISSUE_VINST(fable0);

  // vector code (only runs on vector cores!)
  fable0:
    a = scratchpad[0];
    b = scratchpad[1];
    c = a + b;

Performance

It's very easy to have the scalar core bottleneck the execution. Often, a scalar core will run more instructions per iteration than the corresponding vector cores, which means we can't issue vector commands fast enough to keep the vector cores busy. Low utilization is the bane of vector architectures, so the programmer should take steps to avoid a larger code length for the scalar core relative to the vector cores.

When this bottleneck occurs you will see a lot of mesh input stalls. You can confirm it by looking at the distance between vector issue commands in the gem5 trace (enable the trace with the command line argument --debug-flags=Mesh). You can also look at the disassembly to estimate, but this only gives static instruction counts, not dynamic instruction counts (i.e., how many instructions are actually run due to looping). Currently, there is no best practice for quickly identifying this.

Control Flow in Vector Cores

Vector groups cannot run conventional control flow where each core conditionally branches its own way. All cores in a vector group must be running the same code and thus follow exactly the same control flow. In a perfect world there would be no control flow and each core could just run straight-line code. Sadly this is not the case for interesting kernels, so we provide constructs for somewhat interesting control flow in vector cores. Three are listed below. Note that scalar cores can still do all ordinary control flow without any extra modification.

Unconditional Direct Jumps

Function calls are allowed because each core will follow the same call. However, indirect function calls requiring a vtable (like C++ virtual functions) are not allowed because they may jump to different places. You can call a function as you would on a conventional processor.

Convergent Conditional Branches

Conditional branches (i.e., if/else) are allowed as long as the programmer is certain that each core will branch the same way. A local loop iterator has this behavior, for example. All branches are allowed except those that map to beqz, a restriction due to our assembly pass; it might be lifted once a fancier compiler is up and running. Again, you can generally use if/else statements as you would on an ordinary core.
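
For instance, a loop driven by a local iterator is convergent because every core in the group computes the same trip count (a minimal sketch; the variables are illustrative fragments like the other examples here):

  // all cores run the same bounds, so every core takes the backward
  // branch the same number of times
  for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
  }

  // also fine: a condition computed identically on every core
  if (n % 2 == 0) {
    // ...
  }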

Predication

There may be scenarios where your application contains edge cases that will not result in each core following the same control flow. Predication can be used in this scenario: every instruction is fetched and issued, but only the instructions on the effective control path are executed. There are special instructions to turn execution of instructions on and off based on the provided predicate. An example may look like the following.

    PRED_NEQ(vtid, dim - 1);
    c = a + b;
    PRED_EQ(vtid, vtid);

PRED_NEQ and PRED_EQ are macros for the predication instructions that compare the thread id with various values to determine whether c = a + b will be executed. An annoying compiler challenge arises because the compiler assumes c = a + b will always run and will potentially produce incorrect code if the line is not executed. To combat this, we tell the compiler that it is conditional with a volatile conditional branch and then remove the branch in a later compiler pass. The code should then look like the following.

    volatile int temp = 1;
    PRED_NEQ(vtid, dim - 1);
    if (temp) { 
      c = a + b;
    }
    PRED_EQ(vtid, vtid);

Performance

In general, none of these three approaches is as performant as straight-line code. In the jump and branch cases, the vector cores cannot speculatively issue past the branch and must wait for the PC to resolve before fetching instructions. With predication, you fetch more instructions than you need in some cases. Try to reduce the use of these constructs as much as possible for the best performance.

Optimizing for Memory Hierarchy

The memory hierarchy, in descending order of capacity, includes DRAM, the LLC, and a local scratchpad (4KB). The local scratchpad does not automatically cache values loaded by the program. Instead, the programmer must manually store data into the scratchpad to save it for later. Some scratchpad helper functions are included in spad.h. Before accessing the scratchpads, you should initialize them with initScratchpads(); a pointer to the scratchpad memory space can then be obtained using getSpAddr(<scratchpadIdx>, <offset>).

You can read and write your local scratchpad by using your own physical thread id to access the scratchpad pointer. You can also access another core's scratchpad by using its physical thread id, e.g., getSpAddr(ptid + 1, 0) accesses the scratchpad of the neighbor to your right.

Accesses outside the scratchpad region go to the LLC or DRAM by default (depending on whether the address has been cached yet).
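
A minimal sketch of manual scratchpad use, assuming ptid is this core's physical thread id and globalArray points to memory allocated in main:

  #include "spad.h"

  // initialize the scratchpads once before any core uses them
  initScratchpads();

  // pointer to the start of this core's scratchpad
  int *spad = (int*)getSpAddr(ptid, 0);

  // manually "cache" a value from global memory into the local scratchpad
  spad[0] = globalArray[ptid];

  // write into the scratchpad of the neighbor to the right
  int *neighborSpad = (int*)getSpAddr(ptid + 1, 0);
  neighborSpad[0] = spad[0];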

Prefetching

Prefetching can be used by the scalar core to get significant speedups. The programmer can initiate a fetch from a global memory address into a specific scratchpad in the vector group. For certain programs, we can do this prefetching well ahead of when the vector core will actually need the data, and avoid any load-use stalls in the vector cores. Load-use stalls are particularly bad in vector cores because they can also delay the instruction stream between cores, so best to avoid them.

Vector Prefetching

One could perform a single memory access per prefetch instruction, but this would increase the scalar code size linearly with the number of cores in the vector group. As mentioned previously, this large scalar code size may lead to low vector core utilization because we can't issue vector commands fast enough to keep the vector cores busy.

Vector prefetching partially resolves this issue. It allows the scalar core to prefetch multiple elements per prefetch instruction, which reduces the scalar code size and hopefully brings it more in line with the vector core code size. A vector prefetch instruction specifies: the base scratchpad offset, the base global memory address, the base core offset in the vector group, the number of loads to perform, and extra config bits.

  VPREFETCH_L(<Base Scratchpad Offset>, <Base Memory Offset>, <Base Core Offset>, <Num Loads>, <Config>);

The config bits currently specify whether we are doing a 'horizontal' or 'vertical' vector prefetch. A horizontal prefetch fetches consecutive elements to the same scratchpad offset in consecutive cores. A vertical prefetch fetches consecutive elements to consecutive scratchpad offsets in the same core. A combination of these two and single-length prefetches is enough to map to every memory access pattern, with varying performance. We could extend the hardware to allow other vector prefetch patterns in between horizontal and vertical, but we don't due to the potential hardware complexity.

A horizontal and vertical prefetch are shown below. Only the config bit needs to change between them.

  // horizontal fetch one element to each core in a group of 4
  VPREFETCH_L(0, aPtr, 0, 4, 0);
  
  // vertical fetch four elements to a single core
  VPREFETCH_L(0, aPtr, 0, 4, 1);

If you want to fetch four elements to each of four cores, the two variants might look like the following.

  // horizontal, iterate over memory offsets
  for (int i = 0; i < 4; i++) {
    VPREFETCH_L(i, aPtr + i, 0, 4, 0);
  }
  
  // vertical, iterate over core offset
  for (int i = 0; i < 4; i++) {
    VPREFETCH_L(0, aPtr, i, 4, 1);
  }

Vector Prefetching Interaction with Cache Lines

Global memory is generally stored in lines in a cache. If we want to use the cached memory (which we definitely do), we need to abide by the restrictions it imposes. For example, we cannot load more elements than the size of the cache line because then we would be accessing two cache lines with a single request, which is impractical. In our system, the cache line size is 16 elements, and thus the number of loads per prefetch cannot exceed 16.

Another challenge is that we can reach the edge of a cache line even if the prefetch length doesn't exceed the cache line size; this happens when the base offset of the request starts in the middle of a cache line. You can try to align your memory with cache lines (i.e., with malloc_cache_aligned), but you generally can't do this for every program in practice. To resolve this, the programmer can issue two requests: one that fetches from one side of the cache line boundary and one that fetches from the other side. We denote these two as VPREFETCH_L (prefetch left) and VPREFETCH_R (prefetch right).

  VPREFETCH_L(0, aPtr, 0, 16, 0);
  VPREFETCH_R(0, aPtr, 0, 16, 0);

The same arguments can be provided to both, and the memory unit will resolve how to split the vector request around the cache line boundary. Sometimes all of the data falls within the first cache line, in which case the memory unit converts the second prefetch to a nop. This works for both horizontal and vertical prefetching.

Frames

How do the vector cores know when the data prefetched for them has arrived? In an ordinary cache this would be trivial due to the metadata stored for every cache line. A scratchpad does not include this metadata (part of its efficiency), and thus tracking arrival becomes non-trivial.

Instead of keeping ready metadata for each word in the scratchpad, we do coarse-grained tracking of which words are ready. A single counter can track whether x consecutive loads have arrived, which is much cheaper in terms of hardware complexity than precisely tracking when each word has arrived. This limits how quickly we can access the words, but since the data was prefetched well ahead of time anyway, it's likely not an issue.

We call each of these coarsely tracked regions a frame. The programmer can set the desired hardware frame size using the PREFETCH_EPOCH macro as shown below.

  // set the number of frames and the size of each one
  int prefetchMask = (NUM_REGIONS << PREFETCH_NUM_REGION_SHAMT) | (REGION_SIZE << PREFETCH_REGION_SIZE_SHAMT);
  // configure the core with these frame settings
  PREFETCH_EPOCH(prefetchMask);

  // make sure all cores have done this before prefetching
  pthread_barrier_wait(&start_barrier);

Frames in Vector Cores

The scalar core doesn't have to change much for frames (TODO implicit sync). Vector cores must wait for frames to fill up and free them using the FRAME_START and REMEM macros, respectively. An example use is shown below.

    // software frame size, potentially decoupled from hardware frame size?
    int frameSize = 2;

    // wait for 2 consecutive elements to enter the scratchpad
    FRAME_START(frameSize);
    
    // load values from the ready scratchpad frame
    a = scratchpad[0];
    b = scratchpad[1];

    // perform computation
    c = a + b;

    // free the frame so we can prefetch new data into that part of the scratchpad
    REMEM(frameSize);

Currently, the number of elements you wait for (and later free) is decoupled from the hardware frame size, but generally it needs to be related to the hardware frame size declared with PREFETCH_EPOCH to work correctly; it's best if they are the same value.

The Stack

The default place for the stack is in main memory (i.e., mapped to DRAM). In an ordinary CPU with local caches, the stack can be cached in fast-to-access memory. However, in our system this automatic caching policy does not exist because memory must be directed into local memory by the programmer. This results in slow accesses to the stack, which would likely live in the LLC rather than a local cache.

We would like the stack on the local scratchpad for fast access in case of register spills within a hot loop. There are many ways to get the stack onto the local scratchpad. We currently adopt the lowest-effort approach and manually force the stack pointer onto the scratchpad right before launching the critical kernel. The upside is that ONLY the most relevant stack frames get mapped into the scratchpad and not older, less useful ones. The downside is that you need to manually copy part of the most recent frame onto the scratchpad, and its size is not known unless you disassemble the binary and look at which stack offsets are being used.

Consider the following example of moving the stack pointer and then calling the hot kernel. The top two instructions are inline assembly to swap the stack pointer onto the scratchpad and save the old stack pointer. Then, before the kernel function stencil_vector is called, there are multiple accesses through the stack pointer, but these reference data from the OLD stack location, not the empty one we just swapped to. Notice that we only access up to offset 64 at most, so it suffices to copy up to that offset before moving the stack pointer.

   --------------
   11578:	00010a13          	mv	s4,sp
   1157c:	00050113          	mv	sp,a0
   --------------
   11580:	03013783          	ld	a5,48(sp)
   --------------
   11588:	04013e03          	ld	t3,64(sp)
   --------------
   115a0:	01513423          	sd	s5,8(sp)
   115a4:	03813a83          	ld	s5,56(sp)
   --------------
   115b8:	02f13023          	sd	a5,32(sp)
   115bc:	00d13c23          	sd	a3,24(sp)
   115c0:	00d13823          	sd	a3,16(sp)
   115c4:	01613023          	sd	s6,0(sp)
   --------------
   115e8:	03c13823          	sd	t3,48(sp)
   115ec:	be0ff0ef          	jal	ra,109cc <stencil_vector>
   --------------
   11738:	000a0113          	mv	sp,s4
   --------------

An example of the inline assembly to copy part of the stack before doing the stack swap.

  // get address on the scratchpad to place the stack pointer
  unsigned long long *spTop = getSpTop(ptid);
  spTop -= 9;

  unsigned long long stackLoc;
  asm volatile (
    // copy part of the old stack onto the scratchpad in case there are any
    // stack loads right before the function call
    "ld t0, 0(sp)\n\t"
    "sd t0, 0(%[spad])\n\t"
    "ld t0, 8(sp)\n\t"
    "sd t0, 8(%[spad])\n\t"
    "ld t0, 16(sp)\n\t"
    "sd t0, 16(%[spad])\n\t"
    "ld t0, 24(sp)\n\t"
    "sd t0, 24(%[spad])\n\t"
    "ld t0, 32(sp)\n\t"
    "sd t0, 32(%[spad])\n\t"
    "ld t0, 40(sp)\n\t"
    "sd t0, 40(%[spad])\n\t"
    "ld t0, 48(sp)\n\t"
    "sd t0, 48(%[spad])\n\t"
    "ld t0, 56(sp)\n\t"
    "sd t0, 56(%[spad])\n\t"
    "ld t0, 64(sp)\n\t"
    "sd t0, 64(%[spad])\n\t"
    // save the stack ptr
    "addi %[dest], sp, 0\n\t" 
    // overwrite stack ptr
    "addi sp, %[spad], 0\n\t"
    : [dest] "=r" (stackLoc)
    : [spad] "r" (spTop)
  );

  // run kernel

  // restore stack pointer
  asm volatile (
    "addi sp, %[stackTop], 0\n\t" :: [stackTop] "r" (stackLoc)
  );

The above can be done in a loop in cases where the copying might get long:

  unsigned long long *spTop = getSpTop(ptid);
  // guess how much of the remaining frame might be needed
  spTop -= 30;

  unsigned long long stackLoc;
  unsigned long long temp;
  #pragma GCC unroll(30)
  for (int i = 0; i < 30; i++) {
    // copy one doubleword from the old stack onto the scratchpad
    asm volatile("ld t0, %[id](sp)\n\t"
                 "sd t0, %[id](%[spad])\n\t"
                 : "=r"(temp)
                 : [id] "i"(i*8), [spad] "r"(spTop));
  }
  asm volatile (// save the stack ptr
      "addi %[dest], sp, 0\n\t"
      // overwrite stack ptr
      "addi sp, %[spad], 0\n\t"
      : [ dest ] "=r"(stackLoc)
      : [ spad ] "r"(spTop));

  // run kernel

  // restore stack pointer
  asm volatile (
    "addi sp, %[stackTop], 0\n\t" :: [stackTop] "r" (stackLoc)
  );

Future Optimization: Another issue with the current method, apart from having to copy the relevant stack addresses before the function call, is the depth of the copy. A function only needs to access its arguments on the stack, but we might end up copying many more elements from the stack onto the scratchpad. Before calling a function with, say, n arguments, some of them are passed in the argument registers a0-a7 and the rest (n - 8) are pushed onto the stack. To load these arguments the caller might need to reach far back in its stack in some cases, similar to the previous example: ld s5,56(sp). Hence an optimized approach would be to switch the stack pointer only after we have loaded these values and are ready to push them onto the stack for the function call. This could be done by changing the assembly file directly (or incorporating it into the compiler somehow).

Another thing to note: if we are passing an array to a function that is defined in the stack space of the thread (not global), we need to make sure that all of the elements of the array are copied and not just the address of the first element, since the addresses are stack pointer offsets. For an array, say int a[4] on the stack, accessing it might look like add sp off for a[0] and add sp (off+8) for a[1]. Since these addresses are not absolute and are relative to the original stack pointer (which is not on the scratchpad), we can run into a segfault if all elements are not copied. The compiler could count the number of elements and adjust the copy depth accordingly.

Mapping Problem Sizes to Vector Groups

Vector groups typically have a characteristic software length that they can perform operations on. For example, in a stencil kernel with a filter dimension of three, a characteristic length would be FILTER_DIM * VECTOR_LEN, where FILTER_DIM is the length of the filter and VECTOR_LEN is the number of cores that are part of the vector group. This vector group would be able to process multiples of FILTER_DIM * VECTOR_LEN, but would fail for non-multiples.

An obvious but partial solution is to split off the part of the problem size that doesn't quite map to the vector groups, i.e., with a mod operation. The question then becomes where to do this unmapped work. There are three solutions enumerated below.

  1. Do the work sequentially on the host CPU. This might be the right solution for a conventional discrete vector accelerator.
  2. Do the work in parallel using the manycore after the fact.
  3. Same as 2, but try to schedule the manycore work on inactive cores in parallel with the main chunk of work.

The proposed architecture is in a somewhat unique position in that we can deconfigure the vector groups and just use the more flexible manycore to finish the computation. This is preferred over running sequentially on a host CPU because the manycore version will be much faster. Where we put this work also might not be a huge issue, because the unmapped part is likely a small fraction of the computation; that may not hold for a multidimensional kernel, though, where you might have overhangs at the end of each row.
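
A sketch of options 2/3, assuming the characteristic length is FILTER_DIM * VECTOR_LEN and the manycore cleanup uses a simple strided loop (all names are illustrative and reuse the earlier pthread-shell variables):

  // portion of the problem that maps cleanly onto the vector groups
  int chunk    = FILTER_DIM * VECTOR_LEN;
  int mapped_n = (n / chunk) * chunk;   // i.e., n - (n % chunk)

  // phase 1: cores that belong to a vector group process [0, mapped_n)
  if (used) {
    VECTOR_EPOCH(mask);
    // ... vectorized kernel over the mapped portion ...
    DEVEC(devec_0);
  }

  // phase 2: deconfigured manycore cores finish the leftover [mapped_n, n)
  pthread_barrier_wait(&start_barrier);
  for (int i = mapped_n + ptid; i < n; i += pdim_x * pdim_y) {
    // ... manycore version of the kernel body ...
  }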

Parallel Reductions

A reduction is an operation that aggregates disparate data into a single location where some scalar operation can be performed. Reductions are an important computational primitive in parallel architectures; examples include sum and dot product. Reductions are further complicated in architectures that operate with a concept of groups (i.e., GPUs, us, etc.) because these internally synchronized groups need to synchronize globally.

We perform reductions in a multi-phase multi-configuration process.

  1. Perform any local computation in vector mode to increase performance and energy-efficiency. In dot product this would be to do element-wise multiply and accumulate a local/partial sum.
  2. Deconfigure back to manycore mode and have cores forward their data to other cores to do the reduction. This is like a dataflow operation. Each core that receives data combines the incoming partial sums into another partial sum.
  3. The final core to receive data produces the final sum and writes back the result.
  4. Repeat step 1 as needed.

Step 1 follows the typical vector computation workflow. Step 2 requires synchronization along each forwarding path. To do this we use a token queue, which ensures that data has arrived from the producer core before the consumer reads it. Token queues are also circular, so their memory footprint is low.

We support quickly going in and out of vector mode. You might have the following pattern.

  // do vector computation (within this function configures and deconfigures from vector mode)
  int partial_sum = vector_foo();

  // reduction using cores that have partial sums
  if (active_in_reduction)
    manycore_reduction(partial_sum, core);

Token Queues

A token queue represents a synchronized one-way link between two cores, one taking the role of producer and the other of consumer. You should figure out which cores should be linked and initialize the token queues accordingly.

  int core = ptid; // this core's id (placeholder)
  
  // each core has a synchronized input and output link using token queues
  token_queue_t consumer;
  token_queue_t producer;

  // figure out which tid to pair with
  int pair_core = foo(core);

  // put the consumer token queue in scratchpad memory with a specified backing buffer size
  init_token_queue_consumer(consumer_spad_loc, size, core, &consumer);
  // put the producer token queue in scratchpad memory and link it to the consumer queue in the pair_core
  init_token_queue_producer(producer_spad_loc, consumer_spad_loc, size, core, pair_core, &producer);

  // barrier to make sure queues are setup before execution (only have to barrier now and never during execution even among multiple functions)
  pthread_barrier_wait(&start_barrier);

Once the producer and consumer queues have been initialized, you should use the token queue API to safely push and pull values.

Producer pattern:

  // producer, waits for space in consumer, writes tokens (remote stores), and informs consumer that new tokens are available
  int num_tokens_to_write = 1;
  int token_offset = wait_tokens_producer(prod, num_tokens_to_write, core);
  set_token(prod, sum, token_offset, core);
  produce_tokens(prod, num_tokens_to_write, core);

Consumer pattern:

  // consumer, waits for tokens, reads tokens, and frees token
  int num_values_to_read = 1;
  int token_offset = wait_tokens_consumer(consumer, num_values_to_read, core);
  int *data0 = (int*)get_token(consumer, token_offset, core);
  sum += data0[0];
  consume_tokens(consumer, num_values_to_read, core);

You should think carefully about the flow of the reduction. Generally you want each core to send to core coreId / 2. Cores that don't receive from anyone shouldn't read from their consumer queue, and the final core to finish the accumulation shouldn't push to its producer queue and should instead write back the final value.
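
A sketch of that flow, where core c forwards its partial sum to core c / 2 and core 0 finishes the accumulation; the pairing and termination logic (and names like total_cores, sum, and resultPtr) are illustrative:

  // pair with the core that accumulates our partial sum
  int pair_core = core / 2;

  // a core receives tokens only if some other core pairs with it,
  // i.e., 2 * core (or 2 * core + 1) is a valid core id
  int receives_tokens = (2 * core < total_cores);

  if (core == 0) {
    // final core: only consume, then write back the result
    // ... consumer pattern to fold incoming partial sums into sum ...
    resultPtr[0] = sum;
  } else if (receives_tokens) {
    // interior core: consume partial sums, fold them in, then forward to pair_core
    // ... consumer pattern, then producer pattern ...
  } else {
    // leaf core: never read the consumer queue, just forward the partial sum
    // ... producer pattern toward pair_core ...
  }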

Getting Started

Start by making a manycore version of the kernel using pthreads (no vector, no prefetching). This forms a great starting point for the vector version. You will likely need a manycore version anyway, both as a baseline and to handle the part of the kernel that doesn't fit nicely onto the vector groups.

Profiling

The gem5 simulator reports statistics of the simulated run (cycles, llc_misses, etc.) that are useful for comparisons with other simulated systems. These can be found in path/to/results/stats.txt. In many kernels, the startup work of allocating memory and launching pthreads consumes a significant amount of simulation time. We would like to exclude this from our performance statistics, especially for smaller kernels, so we use custom instructions to turn statistic monitoring on and off at the boundaries of the critical section of the program. The instructions are encapsulated in the macros stats_on() and stats_off() from bind_defs.h.

Stats should only be turned on and off by a single thread. It's important to make sure the other threads are done or close to done when profiling is turned off. A pthread_barrier_wait can be useful in this scenario.
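
A sketch of gating the critical section, assuming ptid 0 does the toggling and start_barrier is the barrier already set up at launch:

  #include "bind_defs.h"

  // make sure every thread has reached the critical section before timing starts
  pthread_barrier_wait(&start_barrier);
  if (ptid == 0) stats_on();

  // ... critical section of the kernel ...

  // wait until all threads are done (or nearly done) before stopping the stats
  pthread_barrier_wait(&start_barrier);
  if (ptid == 0) stats_off();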

Performance Tips

Stores

In our CPU microarchitecture, stores to main memory are frequently the bottleneck because we have to wait for the acknowledgement from far-away memory. Ordinarily this wouldn't be a problem because the store would hit a local cache and the ACK would come back quickly; however, the manycore does not have local caches. We provide a STORE_NOACK macro that issues a store instruction that does not wait for an ACK from memory before committing. To ensure that all ACKs have been received, the programmer should insert a memory fence. Generally, you just need to do this for each thread before it exits the kernel to make sure it actually finished all of its work (before potentially moving on to another kernel). A fence can be inserted in the usual way.

  // do computation
  c = a + b;

  // do store and don't wait for ack
  STORE_NOACK(c, cPtr, 0);

  // do other computation

  // ensure that all stores have been acknowledged by respective memories
  asm volatile("fence\n\t");

  // exit kernel
  return;

Early REMEM

Try to move the vector block's REMEM as early as possible (i.e., right after you are done accessing the scratchpad frame data).

A frame cannot start until the previous REMEM has committed; this is how the hardware knows how many tokens are actually available. If REMEM is the last instruction in the vector block, there can be non-trivial stalling of the next frame while the REMEM drains through the pipeline.
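
For example, the frame snippet from earlier can hoist the REMEM to just after the scratchpad reads:

    FRAME_START(frameSize);

    // copy the values we need out of the frame
    a = scratchpad[0];
    b = scratchpad[1];

    // free the frame immediately so the next FRAME_START doesn't stall on this REMEM
    REMEM(frameSize);

    // the computation no longer touches the frame
    c = a + b;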

Debugging

Common errors

  • gem5.opt: build/RVSP/cpu/io/iew.cc:258: void IEW::doWriteback(): Assertion 0 failed. indicates a segfault in the simulated program.
  • gem5.opt: build/RVSP/mem/ruby/common/NetDest.cc:41: void NetDest::add(MachineID): Assertion bitIndex(newElement.num) < m_bits[vecIndex (newElement)].getSize() failed. means that you are trying to access a scratchpad that doesn't exist. Make sure that the number of cores in your gem5 simulation matches the number of cores you compiled your program for.
  • panic: Can't create socket:Too many open files ! means that there are too many remote gdb sockets (either you are running a lot of cores or someone else is running a gem5 simulation using some of the system sockets). You can disable remote gdb with the command line argument --remote-gdb-port=0.

Remote GDB

Sometimes you can gdb into the software running on top of gem5, but we haven't gotten this to work with our software yet.

  • Run gem5 normally in one shell (running <software binary, build with -g>)

  • In another shell:

  • gdb <software binary, built with -g>

  • set remote Z-packet on

  • target remote 127.0.0.1:7000

  • c

If gem5 is running in docker:

  • ip addr show eth0 will provide the ip address of the docker container, docker_ip
  • Use this when connecting from the other terminal: target remote docker_ip:7000