
Automatically Generated Functionality

Julian Kemmerer edited this page Jun 1, 2024 · 87 revisions

This isn't PipelineC++, folks. One step at a time.

I want PipelineC code to be compatible with regular C/C++ compilers for easy functional verification of your hardware. This makes custom C-based "simulations" easy to develop. I mostly don't want to add new language features to standard C, or change languages completely, unless I have a parser and software compiler to support them. Would love some help in this area.

With that said, some concepts just beg to be implemented with template types and so really benefit from being implemented as auto generated C code.


Top Level IO + Modules + Registers + Processes, etc

See this documentation.

u/intN_t Types

  • Any 'N' is supported, ex. uint13_t
  • Generated typedefs in header files "uintN_t.h" and "intN_t.h"
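
A regular C compiler has no native 13-bit integer, so a software test has to carry such a value in a wider standard type and truncate it. The following is a minimal sketch of that idea, assuming the value is stored in the next wider standard type. The names `uint13_model_t`, `wrap_to_width` and `add_uint13` are hypothetical and are not taken from the generated headers.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical stand-in for a generated typedef: a 13-bit value carried in
// the next wider standard type (an assumption, not the generated header).
typedef uint16_t uint13_model_t;

// Truncate a value to its low `width` bits, as N-bit hardware would.
static uint64_t wrap_to_width(uint64_t value, unsigned width)
{
  assert(width >= 1 && width <= 64);
  if (width == 64) return value;
  return value & ((UINT64_C(1) << width) - 1);
}

// 13-bit addition whose result is stored back into 13 bits,
// so it wraps around at 2^13 = 8192.
static uint13_model_t add_uint13(uint13_model_t a, uint13_model_t b)
{
  return (uint13_model_t)wrap_to_width((uint64_t)a + b, 13);
}
```

For example, `add_uint13(8191, 1)` wraps to 0 in this model, which is the behavior a 13-bit register would show.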

Bit Manipulation

Unions are complicated because they can rely on the layout of bytes in memory, and PipelineC has no memory model. So for now there are auto generated bit manipulation functions for built-in types. These functions are implemented as raw VHDL.

  • Bit slice/select
    • Integer-like variables can be used like a function call, where arguments define the bit slice range. Ex.:
      uint32_t x; // x(31 downto 0)
      uint1_t y = x(15); // y = x[15]
      uint16_t z = x(15, 0); // z = x[15:0]
    • The above syntax is sugar for directly calling fully named bit slicing functions:
      • uint<Y-X+1>_t <type_prefix>_Y_X(<type>)
        • Ex. uint16_t uintX_15_0(uintX_t data); // data[15:0], select bits 15 down to 0
  • Bit concatenation
    • uintX_t <type_prefix_0>_<type_prefix_1>(<type0>, <type1>)
    • Result size X is sum of input sizes
    • Ex. uint18_t uint14_uint4(uint14_t x, uint4_t y); // Upper 14 bits and lower 4 bits of a uint18_t
  • Bit duplication
    • uint<Y>_t <type_prefix>_X(<type>)
    • Result width is X times the input width
    • Ex. uint16_t uint4_4(uint4_t x); // Repeat a uint4_t 4 times to form uint16_t
  • Rotate left/right
    • uint<size>_t rot[l|r]<size>_<amount>(uint<size>_t x)
    • Ex. uint64_t rotl64_7(uint64_t x); // Rotate the uint64_t value to the left by 7
  • Bit assignment
    • base_t <base_type_prefix>_<assignment_type_prefix>_X(<base>, <assignment>)
    • Assign data at bit position X in the base value. Result width is same as base.
    • Ex. uint64_t uint64_uint16_2(uint64_t in, uint16_t x); // in[17:2] = x
  • Float SEM construction
    • Only 32b float supported right now, can easily add other widths
    • float float_uint<exponent_bits>_uint<mantissa_bits>(sign, exponent, mantissa)
    • Ex. float float_uint8_uint23(uint1_t sign, uint8_t exponent, uint23_t mantissa);
  • Float uint32_t construction
    • Interpret uint32_t as a 32b float
    • float float_uint32(uint32_t data);
  • Byte swap
    • Swap the byte ordering of an unsigned value
    • uintN_t bswap_<bit_width>(input)
    • Ex. uint32_t bswap_32(uint32_t input);
  • Array to unsigned
    • Concatenate elements of an array to form a single value
    • Element ordering (typically bytes) is either 'big endian' or 'little endian'
    • uintY_t uintX_arrayN_[be|le](uintX_t input[N])
    • Result width is N times input width
    • Ex. uint64_t uint8_array8_be/le(uint8_t x[8]);
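
When checking hardware against a software model, it can help to have plain C equivalents of the bit manipulation functions listed above. The sketch below models several of them with ordinary shifts and masks on standard types. All `model_*` names and `low_mask` are hypothetical helpers for illustration, not the generated functions (which the text says are raw VHDL). The element ordering used for 'be' and 'le' is an assumption based on the usual meaning of the terms.

```c
#include <assert.h>
#include <stdint.h>

// Mask with the low `width` bits set (width 1..64).
static uint64_t low_mask(unsigned width)
{
  assert(width >= 1 && width <= 64);
  return (width == 64) ? UINT64_MAX : ((UINT64_C(1) << width) - 1);
}

// Bit slice: models x(hi, lo), i.e. x[hi:lo].
static uint64_t model_slice(uint64_t x, unsigned hi, unsigned lo)
{
  assert(hi < 64 && lo <= hi);
  return (x >> lo) & low_mask(hi - lo + 1);
}

// Bit concatenation: `upper` sits directly above the `lower_width` bits of `lower`.
static uint64_t model_concat(uint64_t upper, uint64_t lower, unsigned lower_width)
{
  assert(lower_width >= 1 && lower_width < 64);
  return (upper << lower_width) | (lower & low_mask(lower_width));
}

// Bit duplication: repeat a `width`-bit value `times` times.
static uint64_t model_dup(uint64_t x, unsigned width, unsigned times)
{
  assert(width >= 1 && times >= 1 && width * times <= 64);
  uint64_t result = 0;
  for (unsigned i = 0; i < times; i++) {
    result |= (x & low_mask(width)) << (i * width);
  }
  return result;
}

// Rotate left within 64 bits, as in rotl64_<amount>.
static uint64_t model_rotl64(uint64_t x, unsigned amount)
{
  amount %= 64;
  return (amount == 0) ? x : ((x << amount) | (x >> (64 - amount)));
}

// Bit assignment: overwrite `width` bits of `base` starting at bit `pos`.
static uint64_t model_assign(uint64_t base, uint64_t value, unsigned width, unsigned pos)
{
  assert(width >= 1 && pos + width <= 64);
  uint64_t mask = low_mask(width) << pos;
  return (base & ~mask) | ((value << pos) & mask);
}

// Byte swap of a 32-bit value, as in bswap_32.
static uint32_t model_bswap32(uint32_t x)
{
  return (x >> 24) | ((x >> 8) & 0x0000FF00u) | ((x << 8) & 0x00FF0000u) | (x << 24);
}

// Array to unsigned for four bytes. ASSUMPTION: 'be' puts element 0 in the
// most significant byte and 'le' puts it in the least significant byte.
static uint32_t model_uint8_array4(const uint8_t in[4], int big_endian)
{
  uint32_t result = 0;
  for (unsigned i = 0; i < 4; i++) {
    unsigned shift = big_endian ? 8 * (3 - i) : 8 * i;
    result |= (uint32_t)in[i] << shift;
  }
  return result;
}
```

As a usage example, `model_assign(0, 0xFFFF, 16, 2)` mirrors the `uint64_uint16_2` example above and sets bits 17 down to 2.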

Bit Math

Little helper functions for common 'math' operations. These functions are implemented in PipelineC.

  • Absolute value
    • uintN_t <type_prefix>_abs(intN_t input)
    • Ex. uint32_t int32_abs(int32_t input); // Absolute value removes sign bit
  • N->1 Mux
    • Binary tree of multiplexers selecting a single value from N values
    • <type> <type_prefix>_muxN(select, input0, input1, ... input<N-1>);
    • Ex. uint8_t uint8_mux4(uint2_t select, uint8_t input0, uint8_t input1, uint8_t input2, uint8_t input3);
  • Count zeros starting from upper/left bits
    • <count_type> count0s_<type_prefix>(type)
    • Ex. uint3_t count0s_uint7(uint7_t data); // Max zeros possible is 7
  • N->1 Binary Operations
    • Binary tree of binary operations
    • Only AND, OR, SUM(add) supported right now
    • Only float and u/intN_t types supported.
    • <output_type> <type_prefix>_<operation>N(input0, input1, ... input<N-1>);
    • Ex. uint5_t uint3_sum3(uint3_t input0, uint3_t input1, uint3_t input2);
      • uint3 max = 7, 7*3 = 21, stored in uint5_t
    • Arrays are supported as well: Ex. uint5_t uint3_array_sum3(uint3_t input[3]);
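
The helpers above are small enough to mirror in plain C for a software reference model. The sketch below is one possible model, not the PipelineC implementation. The `model_*` names are hypothetical, and "count zeros starting from upper/left bits" is assumed to mean consecutive leading zeros.

```c
#include <assert.h>
#include <stdint.h>

// Absolute value: the unsigned result can hold |INT32_MIN| = 2^31.
static uint32_t model_int32_abs(int32_t x)
{
  // Negate in unsigned arithmetic to avoid signed overflow on INT32_MIN.
  return (x < 0) ? (0u - (uint32_t)x) : (uint32_t)x;
}

// 4->1 mux as a two-level binary tree driven by the two select bits.
static uint8_t model_uint8_mux4(unsigned select, uint8_t in0, uint8_t in1,
                                uint8_t in2, uint8_t in3)
{
  assert(select < 4);
  uint8_t low_pair  = (select & 1u) ? in1 : in0;  // first level
  uint8_t high_pair = (select & 1u) ? in3 : in2;  // first level
  return (select & 2u) ? high_pair : low_pair;    // second level
}

// Count consecutive zeros from the most significant bit of a 7-bit value.
static unsigned model_count0s_uint7(uint8_t data)
{
  assert(data < 128);
  unsigned count = 0;
  for (int bit = 6; bit >= 0 && ((data >> bit) & 1u) == 0; bit--) {
    count++;
  }
  return count;
}

// 3-input sum as a small adder tree. Worst case 7 + 7 + 7 = 21 needs 5 bits.
static unsigned model_uint3_sum3(unsigned in0, unsigned in1, unsigned in2)
{
  assert(in0 < 8 && in1 < 8 && in2 < 8);
  unsigned first_level = in0 + in1;  // up to 14
  return first_level + in2;          // up to 21
}
```

The output widths in the examples above follow from the worst cases: an all-zero uint7_t gives a count of 7, which still fits in uint3_t, and three maximal uint3_t inputs sum to 21, which fits in uint5_t.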

Fixed Size Array Types

Any type can be generated into fixed size arrays of that type by including a specific header file. This is mostly to support returning fixed sized arrays from C functions. Ex.

typedef struct point
{
  uint8_t x;
  uint8_t y;
} point;
#include "point_array_N_t.h"
// Types like this are generated for you
typedef struct point_array_3_t
{
  point data[3];
} point_array_3_t;
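
Since the stated purpose is returning fixed size arrays from functions, a short usage sketch may help. It is written as plain C, with the struct typed out by hand in place of the generated header so that it compiles standalone. The functions `shift_points` and `get_point` are hypothetical examples.

```c
#include <assert.h>
#include <stdint.h>

typedef struct point
{
  uint8_t x;
  uint8_t y;
} point;

// Written out by hand here; per the text above, including
// "point_array_N_t.h" generates this type for you.
typedef struct point_array_3_t
{
  point data[3];
} point_array_3_t;

// Hypothetical example: return three copies of p, each moved further along x.
// C cannot return a bare array, so the array is wrapped in the struct.
point_array_3_t shift_points(point p, uint8_t dx)
{
  point_array_3_t rv;
  for (int i = 0; i < 3; i++) {
    rv.data[i].x = (uint8_t)(p.x + (i + 1) * dx);
    rv.data[i].y = p.y;
  }
  return rv;
}

// Hypothetical bounds-checked element read for software tests.
point get_point(point_array_3_t arr, unsigned index)
{
  assert(index < 3);
  return arr.data[index];
}
```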

Casting to and from bytes

Any type can be converted to and from byte arrays by including a specific header file. This is mostly to support moving C structs from software C to PipelineC buffers. Both PipelineC and regular C code (i.e. using pointers instead of fixed size buffers) are generated. Ex.

typedef struct point ... ;
#include "point_bytes_t.h"

// A header like this is generated for you
#define point_bytes_t uint8_t_array_2_t  // 2 bytes, auto gen fixed size array struct
#define point_size_t uint2_t // 0-2
point_bytes_t point_to_bytes(point x);
point bytes_to_point(point_bytes_t bytes);
// And similar functions using pointers for real C code
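
To make the conversion concrete, here is a hand-written plain C sketch of what such functions could do for the two-byte point struct. The byte layout (fields in declaration order, x first) is an assumption, and `point_to_bytes_ptr` is a hypothetical name for a pointer-based variant, since the generated names for those are not given above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct point
{
  uint8_t x;
  uint8_t y;
} point;

// Hand-written stand-ins for the generated types named in the header above.
typedef struct uint8_t_array_2_t
{
  uint8_t data[2];
} uint8_t_array_2_t;
#define point_bytes_t uint8_t_array_2_t

// ASSUMED layout: fields serialized in declaration order, x first.
point_bytes_t point_to_bytes(point p)
{
  point_bytes_t bytes;
  bytes.data[0] = p.x;
  bytes.data[1] = p.y;
  return bytes;
}

point bytes_to_point(point_bytes_t bytes)
{
  point p;
  p.x = bytes.data[0];
  p.y = bytes.data[1];
  return p;
}

// Hypothetical pointer-based variant for software C.
void point_to_bytes_ptr(const point* p, uint8_t* out, size_t out_len)
{
  assert(p != NULL && out != NULL && out_len >= 2);
  point_bytes_t bytes = point_to_bytes(*p);
  memcpy(out, bytes.data, sizeof(bytes.data));
}
```

A round trip through `point_to_bytes` and `bytes_to_point` should return the original struct, which makes a convenient first check when moving data between software and hardware buffers.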

Clock Crossings

Clock crossing code is generated by including a specifically named header file. Ex.

message_t in_msg;
#include "clock_crossing/in_msg.h"

The READ and WRITE function signatures generated in that file depend on how and where the READ and WRITE functions are used. See clock domain crossing documentation here.

RAMs

  • Described as a state variable (prefer static locals) array with an element type and dimensions
    • Ex. elem_t the_ram[8][8][8][8]; // 4096 elements
      • The four 3b indices/addresses are concatenated to form a 12b address
  • State variables are not used directly; that would infer registers and muxes instead of a RAM.
    // Declare an array
    static uint32_t my_ram[RAM_DEPTH];
    
    // Using array variable directly in code is regs+muxes
    uint32_t rdata = my_ram[addr];
    if(wr_enable) my_ram[addr] = wr_data;
    
    // Calling my_ram_RAM_SP_RF_0 instead does the same thing
    // but lets the synthesis tool infer a proper RAM
    uint32_t rdata = my_ram_RAM_SP_RF_0(addr, wr_data, wr_enable);
  • Access to the RAM takes the form of stateful functions acting on that storage data
    • elem_t <var_name>_<RAM_type>(address0,...,addressN, write_data, write_enable)

    • Input arguments are read/write data+flags and output is read data

    • RAM types:

      • SP_RF_<latency> Single port, read first
      • DP_RF_<latency> Dual port, read first
      • 0 clock latency RAMs are implemented as LUTRAMs for same cycle access
      • 1,2 clock latency RAMs are implemented as either block RAMs or LUTRAMs (synthesis tool decides).
    • Ex. elem_t the_ram_RAM_SP_RF_2(uint3_t addr0, uint3_t addr1, uint3_t addr2, uint3_t addr3, elem_t write_data, uint1_t write_enable);

      • Read data is returned, and writes are completed, on the second iteration/call of the function (the _2 in the name).
    • Dual Port, example 1 (notice the 1 clock write and read latency)

      #include "uintN_t.h"
      #pragma MAIN my_func
      uint32_t my_func()
      {
        static uint32_t my_bram[128];
        static uint32_t waddr = 0;
        static uint32_t wdata = 0;
        static uint32_t raddr = 0;
        uint32_t rdata = my_bram_RAM_DP_RF_1(raddr, waddr, wdata, 1);
        printf("Write: addr=%d,data=%d. Read addr=%d. Read data=%d\n",
               waddr, wdata, raddr, rdata);
        // Test pattern
        if(wdata > 0){
          raddr += 1;
        }
        waddr += 1;
        wdata += 1;
        return rdata; // Dummy
      }
      • Simulating with cocotb and GHDL (--sim --comb --cocotb --ghdl):
      Clock:  0
      Write: addr=0,data=0. Read addr=0. Read data=0
      
      Clock:  1
      Write: addr=1,data=1. Read addr=0. Read data=0
      
      Clock:  2
      Write: addr=2,data=2. Read addr=1. Read data=0
      
      Clock:  3
      Write: addr=3,data=3. Read addr=2. Read data=1
      
      Clock:  4
      Write: addr=4,data=4. Read addr=3. Read data=2
      
      Clock:  5
      Write: addr=5,data=5. Read addr=4. Read data=3
      ...
      
    • Single Port, example 2, using a multidimensional array with power of 2 sizes

      One 36 Kb block RAM primitive fits the 4096 elem_t=uint8_t elements (4 KB, i.e. 32 Kb) of the_ram from the example above:
      Report Cell Usage: 
      +------+-----------+------+
      |      |Cell       |Count |
      +------+-----------+------+
      |1     |BUFG       |     1|
      |2     |RAMB36E1_1 |     1|
      |3     |FDRE       |    29|
      |4     |IBUF       |    22|
      |5     |OBUF       |     8|
      +------+-----------+------+
      
      #include "uintN_t.h"
      #define elem_t uint8_t
      #pragma MAIN_MHZ main 100.0
      elem_t main(uint3_t addr0, uint3_t addr1, uint3_t addr2, uint3_t addr3, elem_t write_data, uint1_t write_enable)
      {
        static elem_t the_ram[8][8][8][8]; // 4096 elements
        return the_ram_RAM_SP_RF_2(addr0, addr1, addr2, addr3, write_data, write_enable);
      }
    • Dual Port

      #include "uintN_t.h"
      #define elem_t uint8_t
      #pragma MAIN_MHZ main 100.0
      elem_t main(
        uint3_t addr_r0, uint3_t addr_r1, uint3_t addr_r2, uint3_t addr_r3, // Read port
        uint3_t addr_w0, uint3_t addr_w1, uint3_t addr_w2, uint3_t addr_w3, elem_t write_data, uint1_t write_enable) // Write port
      {
        static elem_t the_ram[8][8][8][8]; // 4096 elements
        return the_ram_RAM_DP_RF_2(addr_r0, addr_r1, addr_r2, addr_r3,
                addr_w0, addr_w1, addr_w2, addr_w3, write_data, write_enable);
      }
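
The read-first behavior described above can be modeled in plain C for software-side checks. The sketch below is a behavioral model only, not generated code. It assumes that a latency of 1 shows up as the read data being delayed by one call, which matches the simulation log in example 1. The names `model_ram_sp_rf_0`, `model_ram_sp_rf_1` and `ram1_model_t` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define RAM_DEPTH 128

// Single port, read first, 0 latency: return the stored value at addr,
// then apply the write. This mirrors the regs+muxes snippet above.
uint32_t model_ram_sp_rf_0(uint32_t ram[RAM_DEPTH], uint32_t addr,
                           uint32_t wr_data, int wr_enable)
{
  assert(addr < RAM_DEPTH);
  uint32_t rd_data = ram[addr];        // read happens first
  if (wr_enable) ram[addr] = wr_data;  // write lands after the read
  return rd_data;
}

// Storage plus one output register, for the ASSUMED latency-1 behavior.
typedef struct ram1_model_t
{
  uint32_t mem[RAM_DEPTH];
  uint32_t out_reg;
} ram1_model_t;

// Single port, read first, 1 call of latency: each call returns the read
// data that belongs to the previous call's address.
uint32_t model_ram_sp_rf_1(ram1_model_t* ram, uint32_t addr,
                           uint32_t wr_data, int wr_enable)
{
  assert(ram != 0);
  uint32_t previous = ram->out_reg;
  ram->out_reg = model_ram_sp_rf_0(ram->mem, addr, wr_data, wr_enable);
  return previous;
}
```

With the 0-latency model, writing 42 to an empty address returns the old value 0, and the next read of that address returns 42. With the latency-1 model the same values appear one call later.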

DSPs

By default the PipelineC tool will infer pipelined multipliers from the * operator. In VHDL the multiply operator is surrounded by input and output registers as needed, allowing the synthesis tool to infer as much DSP primitive pipelining as possible. This can be changed for the entire design using the --mult command line option. See -h for more information.

To disable inferred multipliers and use only FPGA fabric implementations, use the FUNC_MULT_STYLE pragma to set the fabric multiplier style. Ex.

#pragma FUNC_MULT_STYLE mult fabric  // default: infer
uint32_t mult(uint16_t x, uint16_t y)
{
  return x * y;
}

Operator Overloading

See the operator name constants near the top of C_TO_LOGIC.py. Make sure to use argument names like left, right, and expr. Finally, multiplies are distinguished by whether they use inferred hard DSP blocks or are implemented in FPGA fabric; see the examples below.

// An example user type
typedef struct complex
{
  float re;
  float im;
}complex;

// Override the '+' operator
complex BIN_OP_PLUS_complex_complex(complex left, complex right)
{
  complex rv;
  rv.re = left.re + right.re;
  rv.im = left.im + right.im;
  return rv;
}
complex BIN_OP_PLUS_complex_float(complex left, float right)
{
  complex rv;
  rv.re = left.re + right;
  rv.im = left.im + right;
  return rv;
}
// Override the '*' operator
// Complex product: (a + bi)(c + di) = (ac - bd) + (ad + bc)i
// Implementation to use when inferred DSPs for multiplies are used
complex BIN_OP_INFERRED_MULT_complex_complex(complex left, complex right)
{
  complex rv;
  rv.re = (left.re * right.re) - (left.im * right.im);
  rv.im = (left.re * right.im) + (left.im * right.re);
  return rv;
}
// Implementation to use when multipliers are implemented in logic/FPGA fabric (same here)
complex BIN_OP_MULT_complex_complex(complex left, complex right)
{
  complex rv;
  rv.re = (left.re * right.re) - (left.im * right.im);
  rv.im = (left.re * right.im) + (left.im * right.re);
  return rv;
}
// Override the '!' operator
complex UNARY_OP_NOT_complex(complex expr)
{
  complex rv;
  rv.re = expr.re * -1.0;
  rv.im = expr.im * -1.0;
  return rv;
}
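
The overloads above are ordinary C functions, so they can also be exercised by name from a regular C test. Standard C has no operators on structs, so the sketch below calls the overload directly where PipelineC source would use '+'. The functions `add_three` and `check_plus_overload` are hypothetical test helpers.

```c
#include <assert.h>

// Same user type as above
typedef struct complex
{
  float re;
  float im;
} complex;

// Same '+' overload as above
complex BIN_OP_PLUS_complex_complex(complex left, complex right)
{
  complex rv;
  rv.re = left.re + right.re;
  rv.im = left.im + right.im;
  return rv;
}

// Hypothetical helper: the equivalent of (a + b) + c, with the overload
// called by name so that a regular C compiler accepts it.
complex add_three(complex a, complex b, complex c)
{
  return BIN_OP_PLUS_complex_complex(BIN_OP_PLUS_complex_complex(a, b), c);
}

// Hypothetical software-side check, callable from a C test harness.
// The values are chosen to be exactly representable as floats.
void check_plus_overload(void)
{
  complex a = {1.0f, 2.0f};
  complex b = {0.5f, -2.0f};
  complex sum = BIN_OP_PLUS_complex_complex(a, b);
  assert(sum.re == 1.5f);
  assert(sum.im == 0.0f);
}
```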

Pragmas

See the pragmas page.

VHDL Escape Hatch

Write raw HDL if you must.
