
NITTA internal

Aleksandr Penskoi edited this page Jan 31, 2021 · 14 revisions

This document aims to introduce the reader to the NITTA project's internal structure. Therefore, we will leave out the motivation and go straight to the technical details.

The central points in the project:

  • target processor architecture;
  • intermediate algorithm representation;
  • transport triggered behavior;
  • synthesis process.

Target processor architecture

From the hardware point of view, the target processor looks very simple, as long as we don't consider how the processor units work inside and where the software comes from. A target processor contains:

  • The control unit, which provides control signals for all processor units. The stored software does not need any decoder and contains the control signals as-is (thank you, NISC).

  • A set of processor units (PUs), which process data, including storage and IO (advanced users can develop their own).

  • The signal bus, which transfers control signals from the control unit to all PUs.

  • The value bus, which includes two components: the data and attribute buses. It is not a real on-chip bus; it is emulated with a logical OR (for more detail, see NITTA.Model.ProcessorUnits), so when a PU is not sending data, its output should provide zero.

  • IO ports, which link the target processor environment to processor units with IO.
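The OR-based value bus behavior can be sketched in a few lines (a hypothetical Python model for illustration, not the actual HDL):

```python
# Hypothetical model of the value bus: outputs of all PUs are merged with a
# bitwise OR, so an idle PU must drive zero to keep the bus value intact.
def value_bus(pu_outputs):
    bus = 0
    for out in pu_outputs:
        bus |= out  # idle PUs contribute 0 and don't corrupt the bus
    return bus

# Only one PU drives the bus at any tick; the rest output zero.
print(value_bus([0, 0x2A, 0]))  # → 42
```

If two PUs drove the bus at once, the OR would merge their values into garbage, which is why scheduling must guarantee a single source per tick.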

We will call a specific configuration of the PU set and buses a microarchitecture.

                                                                               || I
+--------------+       signal bus                                          sclk|| O
| Control Unit +--+----+----+-------+-------+------+-----+----+----+       mosi||
+--------------+  |    |    |       |       |      |     |    |    |       miso|| P
                  |wr  |oe  |addr   |reset  |load  |neg  |oe  |wr  |oe       cs|| o
                  |    |    |       |       |      |     |    |    |           || r
                  v    v    v       v       v      v     v    v    v           || t
                 +-------------+   +----------------------+  +------------+    || s
                 | PU1 :: Fram |   |     PU2 :: Accum     |  | SPI :: SPI +====+/
                 +------+------+   +----------+-----------+  +------------+
                        ^                     ^                    ^
                        |                     |                    |
                        v                     v                    v
                     [================================================] value bus

Let's see how it works with an example: a counter. This program counts from zero, increments the value once per computational cycle, and sends it outside the processor via the SPI processor unit (synchronization between the computational and data transfer cycles is beyond the scope of this document).

The following Lua code implements the counter:

function counter(x1)
    send(x1)
    local x2 = x1 + 1
    counter(x2)
end
counter(0)

Each PU is controlled individually and independently of the others. You can see each PU's microprogram in the following table. Each instruction maps directly to control signals. A hardware decoder is not required (pros: simple hardware and full control over PUs without ISA restrictions; cons: large programs with low space efficiency).

| PC | Fram          | Accum              | SPI     |
|----|---------------|--------------------|---------|
| 0  | PrepareRead 1 | Nop                | Nop     |
| 1  | Nop           | ResetAndLoad False | Sending |
| 2  | PrepareRead 0 | Nop                | Nop     |
| 3  | Nop           | Load False         |         |
| 4  |               | Nop                |         |
| 5  |               | Out                |         |
| 6  | Write 1       | Nop                |         |
| 7  | Nop           |                    |         |
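The decoder-less control scheme above can be imitated with a small sketch (Python, with illustrative mnemonics rather than real signal encodings; blank table cells are written as Nop):

```python
# A hedged sketch of NISC-style control: the "software" is just a table of
# per-tick control words, one column per PU, with no instruction decoder.
MICROPROGRAM = [
    # (Fram,             Accum,                SPI)
    ("PrepareRead 1",    "Nop",                "Nop"),
    ("Nop",              "ResetAndLoad False", "Sending"),
    ("PrepareRead 0",    "Nop",                "Nop"),
    ("Nop",              "Load False",         "Nop"),
    ("Nop",              "Nop",                "Nop"),
    ("Nop",              "Out",                "Nop"),
    ("Write 1",          "Nop",                "Nop"),
    ("Nop",              "Nop",                "Nop"),
]

def control_signals(pc):
    # The control unit only reads the word at PC and fans it out to the PUs;
    # wrapping around models the repeating computational cycle.
    return MICROPROGRAM[pc % len(MICROPROGRAM)]

fram, accum, spi = control_signals(1)
```

The price of skipping the decoder is visible here too: every tick stores one full control word for every PU, even when most of them do nothing.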

All instructions are heavily PU-centric and have no interpretation outside of the PU context. An instruction is only a readable version of the PU control signals. However, the program as a whole can organize the computational process in the entire NITTA processor. It works like Leibniz's monadology, where PUs don't know about each other, but when one writes data, other units read it at the same moment.

The second part of the NITTA architecture is the Transport Triggered Architecture, which means that a computational process can be described from a dataflow point of view. We do that in the next section.

Intermediate algorithm representation and execution

Any NITTA program can be represented as a data flow graph (DFG). A DFG represents a computational process by a directed graph, where nodes represent functions for storing, processing, or IO, and edges represent value transfers between functions. No other types of objects are allowed, which means that all values must be produced by functions, including constants.

During the synthesis process, the DFG can be changed several times for:

  • preparing an algorithm to executable form (see NITTA.Model.Problems.Refactor.BreakLoop);
  • an algorithm adaptation for the microarchitecture (see NITTA.Model.Problems.Refactor.OptimizeAccum);
  • an algorithm optimization for the microarchitecture (see NITTA.Model.Problems.Refactor.ResolveDeadlock).

Let's see how it works (you can see such graphs in the NITTA UI in a more rounded form).

Initial cyclic DFG

The initial DFG is the result of Lua source code translation. The transpiler uses the following rules:

  • operators and function calls are translated into DFG nodes (except functions with the debug prefix, which are not synthesizable and are used for debugging);
  • all edges represent value transfers between functions (e.g., the x1 value is transferred twice: as an argument of send and as an operand of +), so we create aliases for them in the style <original-name>#<usage-index> (in our example, x1 becomes x1#1 and x1#2, and x2 becomes x2#1);
  • all constants in the algorithm are translated into constant functions (like const(1) with output 1#1); multiple uses of the same constant are merged (1#1, 1#2, ...);
  • function arguments are represented by a loop function with an initial value, which provides a value at the beginning and consumes a value at the end of each computational cycle.
                /--------------------------------------\
            x2#1|                                      |
                v                                      |
+-------------------------+       x1#1  +------+       |
| loop(x2#1) = x1#1, x1#2 |------+----->| send |       |
+-------------------------+      |      +------+       |
                                 |                     |
                             x1#2|                 x2#1|
                                 |                     |
                                 v                     |
          +----------+ 1#1  +-------------------+      |
          | const(1) |----->| x1#2 + 1#0 = x2#1 |------/
          +----------+      +-------------------+

Usually, such graphs cannot be executed due to cycles.
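The aliasing rule from the transpiler description can be sketched as follows (a simplified illustration, not the actual transpiler code):

```python
# Simplified sketch of the alias rule: each use of a value gets a fresh
# <name>#<usage-index> alias, so every edge in the DFG is unique.
from collections import defaultdict

def alias_uses(uses):
    """uses: variable names in order of use -> list of unique aliases."""
    counter = defaultdict(int)
    aliases = []
    for name in uses:
        counter[name] += 1
        aliases.append(f"{name}#{counter[name]}")
    return aliases

# x1 is used twice (by send and by +), x2 once:
print(alias_uses(["x1", "x1", "x2"]))  # → ['x1#1', 'x1#2', 'x2#1']
```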

Executable acyclic DFG

To make the DFG executable, we need to break all loop functions in the algorithm, making the graph acyclic. Of course, we don't forget the shared history of the resulting LoopBegin and LoopEnd functions. This is performed by the BreakLoop refactoring (we will return to it later).

The result is shown in the following scheme:

+--------------------------+       x1#1  +------+       +-----------------------------------+
| LoopBegin() = x1#1, x1#2 |------+----->| send |   /-->| LoopEnd(x2#1) pair out: x1#1, x1#2 |
+--------------------------+      |      +------+   |   +-----------------------------------+
                                  |                 |
                              x1#2|             x2#1|
                                  |                 |
                                  v                 |
+----------+    1#1    +-------------------+        |
| const(1) |---------->| x1#2 + 1#0 = x2#1 |--------/
+----------+           +-------------------+
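The BreakLoop transformation itself can be sketched like this (an illustrative Python model, not NITTA's Haskell implementation):

```python
# Illustrative sketch of BreakLoop: a cyclic Loop node is split into a pure
# source (LoopBegin) and a pure sink (LoopEnd) that still remember their pair.
def break_loop(loop):
    begin = {"kind": "LoopBegin", "outputs": loop["outputs"], "pair": loop}
    end = {"kind": "LoopEnd", "input": loop["input"], "pair": loop}
    return begin, end

loop = {"initial": 0, "input": "x2#1", "outputs": ["x1#1", "x1#2"]}
begin, end = break_loop(loop)
# The graph becomes acyclic: LoopBegin only produces values, LoopEnd only
# consumes them, but both still point at the same original Loop function.
```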

We can describe the process of one computational cycle from the transport point of view. Let's do it.

| PC | Transport                  | Fram              | Accum       | SPI         |
|----|----------------------------|-------------------|-------------|-------------|
| 0  |                            |                   |             |             |
| 1  | x1: Fram -> { SPI, Accum } | Source x1#1, x1#2 | Target x1#2 | Target x1#1 |
| 2  |                            |                   |             |             |
| 3  | 1#1: Fram -> Accum         | Source 1#1        | Target 1#1  |             |
| 4  |                            |                   |             |             |
| 5  |                            |                   |             |             |
| 6  | x2#1: Accum -> Fram        | Target x2#1       | Source x2#1 |             |
| 7  |                            |                   |             |             |

Now you can compare this table with the table of PU instructions and see their consistency. Each PU receives instructions that realize the necessary computational process from the dataflow point of view. In the next section, we show how the computational process happens across many cycles and how the Loop function works.

Computational process on intermediate level

A target processor should repeat an algorithm an infinite number of times, and usually it is required to transfer data between cycles. The Loop function allows doing that. In the first computational cycle, the Loop function produces an initial value and consumes a new value at the end. In the following cycles, it produces the value from the previous one.

                           Cycle 0                                                  Cycle 1

][----------------------+       x1#1  +------+       +---][----------------------+       x1#1  +------+       +---][------...
][op(x2#1) = x1#1, x1#2 |------+----->| send |   /-->| lo][op(x2#1) = x1#1, x1#2 |------+----->| send |   /-->| lo][op(x2#...
][----------------------+      |      +------+   |   +---][----------------------+      |      +------+   |   +---][------...
][                             |                 |       ][                             |                 |       ][      ...
][                         x1#2|             x2#1|       ][                         x1#2|             x2#1|       ][      ...
][                             |                 |       ][                             |                 |       ][      ...
][                             v                 |       ][                             v                 |       ][      ...
][      +----------+ 1#1  +-------------------+  |       ][      +----------+ 1#1  +-------------------+  |       ][      ...
][      | const(1) |----->| x1#2 + 1#0 = x2#1 |--/       ][      | const(1) |----->| x1#2 + 1#0 = x2#1 |--/       ][      ...
][      +----------+      +-------------------+          ][      +----------+      +-------------------+          ][      ...

In the figure, we show LoopBegin and LoopEnd as the original Loop function. It is implemented:

  • on intermediate level by metadata;
  • on target processor by PU internal implementation (e.g., in FRAM both parts are bound to the same memory cell).

Logical simulation can be executed by the following command: stack build && stack exec nitta -- -l examples/counter.lua.

| Cycle | x1 | x2 |
|-------|----|----|
| 0     | 0  | 1  |
| 1     | 1  | 2  |
| 2     | 2  | 3  |
| 3     | 3  | 4  |
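This table can be reproduced by a tiny functional simulation (a Python sketch, assuming the Loop semantics described above: x2 is carried into the next cycle's x1):

```python
def simulate_counter(cycles, initial=0):
    """Functional simulation of the counter DFG: Loop feeds x2 back as x1."""
    rows = []
    x1 = initial               # LoopBegin produces the carried value
    for cycle in range(cycles):
        x2 = x1 + 1            # x1#2 + 1#1 = x2#1
        rows.append((cycle, x1, x2))
        x1 = x2                # LoopEnd consumes x2 for the next cycle
    return rows

for cycle, x1, x2 in simulate_counter(4):
    print(cycle, x1, x2)
# 0 0 1
# 1 1 2
# 2 2 3
# 3 3 4
```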

Transport triggered behavior at compile time

In this section, we will connect the intermediate algorithm representation and the target processor microarchitecture. Before we start, we need to separate compile-time (CT) and run-time (RT) elements of a NITTA computational process. Everything about the intermediate level happens at compile time. The target processor knows only about hardware and software in the form of control signals (but we show them as mnemonics).

In the figure, you can see the computational process from the transport point of view, which is defined at compile time. You can see computational cycles and the states of the value bus at each processor tick. By * we represent a PU providing (one or more) or consuming (only one per tick) a value.

                             Cycle 0                                   Cycle 1
          ][--0----1----2----3----4----5----6----7--][--0----1----2----3----4----5----6----7--][...
spi out:
fram out:          **        *                               **        *
accum out:         ||        |              *                ||        |              *
                   ||        |              |                ||        |              |
value bus:         ||        |              |                ||        |              |
                   ||        |              |                ||        |              |
accum in:          ||        |              |           x1#1 ||    1#1 |              |
fram in:      x1#1 *|    1#1 *         x2#1 *                *|        *         x2#1 *
spi in:        x1#2 *                                    x1#2 *

So, software generation for a NITTA processor looks "simple". We only need to:

  • modify the intermediate algorithm representation;
  • bind functions of the intermediate representation to available processor units;
  • schedule all needed transports within the PU and algorithm restrictions.

After that, we will get something like what you can see in the table. In the next section, we will try to explain how it happens.

| PC | Transport (CT)             | Fram (CT)         | Fram (RT)     | Accum (CT)  | Accum (RT)         | SPI (CT)    | SPI (RT) |
|----|----------------------------|-------------------|---------------|-------------|--------------------|-------------|----------|
| 0  |                            |                   | PrepareRead 1 |             | Nop                |             | Nop      |
| 1  | x1: Fram -> { SPI, Accum } | Source x1#1, x1#2 | Nop           | Target x1#2 | ResetAndLoad False | Target x1#1 | Sending  |
| 2  |                            |                   | PrepareRead 0 |             | Nop                |             | Nop      |
| 3  | 1#1: Fram -> Accum         | Source 1#1        | Nop           | Target 1#1  | Load False         |             |          |
| 4  |                            |                   |               |             | Nop                |             |          |
| 5  |                            |                   |               |             | Out                |             |          |
| 6  | x2#1: Accum -> Fram        | Target x2#1       | Write 1       | Source x2#1 | Nop                |             |          |
| 7  |                            |                   | Nop           |             |                    |             |          |

Synthesis process

What do we need for a target processor synthesis, including software and hardware?

We need as input data:

  • the algorithm in intermediate representation with all known functions;
  • the microarchitecture of the target processor, which includes all necessary processor units and bus configurations (this item is temporary input data; in the long term, it should be part of the synthesis results).

We need as part of our synthesis tool:

  • Some kind of target system model, sufficient for understanding:
    • which functions can be executed;
    • what control signal sequence is required to do that (software);
    • how to use and, if necessary, generate hardware.
  • A synthesis method, directly implemented as part of the synthesis tool. It should allow scheduling or synthesizing the target computational process description at all levels.
  • A target system generator, which generates target project files from the target system model.
  • A UI, which allows users to inspect the synthesis process or carry it out in a semiautomated manner.
  • Verification subsystems, which allow users to run a functional simulation of the algorithm and compare the expected behavior with a logical simulation at the RTL level.

Target system model

One of the features of NITTA is the complex behavior of processor units. That means we cannot use a simple solution like an equation system to describe a PU. We need something more sophisticated, so we build an imitation model of the target system, which can report to the synthesis method what it can do and apply selected decisions step by step.

                            /-------------------------------------------\
                            V                                           |
I'm the              What can you do?                                   |
 Method      ------------------------------->  +-----+                  |
        O            Possible options          |\\\\\| I'm the          |
      /\ /\  <-------------------------------  |\\\\\| target           |
        V           Execute my decision!       |\\\\\| model            |
             ------------------------------->  +-----+                  |
                            |                                           |
                            |                   repeat                  |
                            \-------------------------------------------/ 
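The dialog loop above can be written schematically (the ToyModel and its methods are hypothetical stand-ins for illustration, not NITTA's real interfaces):

```python
class ToyModel:
    """A toy stand-in for the target system model."""
    def __init__(self, pending):
        self.pending = list(pending)   # decisions still available
        self.log = []
    def options(self):
        return self.pending            # "What can you do?"
    def apply(self, decision):
        # "Execute my decision!" -> a new model state without that option.
        nxt = ToyModel([o for o in self.pending if o != decision])
        nxt.log = self.log + [decision]
        return nxt

def run_synthesis(model, choose):
    # Repeat: ask for options, choose one, apply, until the model is done.
    while model.options():
        model = model.apply(choose(model.options()))
    return model

done = run_synthesis(ToyModel(["bind +", "target x1#2", "source x2#1"]),
                     choose=lambda opts: opts[0])
```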

Target system model includes (see: NITTA.Model.TargetSystem):

  • dataflow graph or intermediate algorithm representation (we need it because the algorithm can be changed during a synthesis process);
  • computational unit model (processor microarchitecture, including processor units and network).

We have already said all we need about dataflow graphs, so we focus on the computational unit model. We will describe it bottom-up: the processor unit model, then the network unit model.

Processor unit model

The main aim of a processor unit model is to teach NITTA how to use the processor unit. For that, we make an imitation model of each PU. Interacting with that model provides everything we need to know. The interaction is implemented in dialog form. Let's see it in a simplified form using the Accum example:

"Hi, fresh Accum model. Can you execute loop function?" the synthesis method asks.

"No," the Accum model says.

"Can you execute const function?"

"No."

"Can you execute x1#2 + 1#0 = x2#1?"

"I was afraid you'd never ask. Yes. Here is my new state, where I have bound the function and prepared for its execution."

"Good, I agree," the method says because it doesn't have any alternative. "What can you do from a transport point?"

"Only two options: 1) I can be a target for x1#2; 2) I can be a target for 1#0. What would you like to select?"

"Okay, be a target for x1#2." Of course, the synthesis method could make a different decision.

"Ok. Here is my new state where I scheduled my computational process for receiving x1#2 from the value bus."

"Good, I agree. What can you do from a transport point of view?"

"Only one option: I can be a target for 1#0. What would you like?"

"Okay, be a target for 1#0."

"Ok. Here is my new state where I scheduled my computational process for receiving 1#0 from the value bus and sum it with the previous one after a short delay."

"Good. What can you do from a transport point of view?"

"Only one option: I can be a source for x2#1. What would you like?"

"Great. Be a source for x2#1."

"Ok. Here is my new state where I scheduled my computational process for sending x2#1 to the value bus."

"Excellent. What can you do?"

"Nothing," the model says and shrugs.

"I think we are done. Can I see your computational process description?"

"Of course, you can see it in my column in the table above. Also, you can see the full description of my computational process, where I note all that I do and mark vertical relations:"

$ stack build --fast nitta:nitta && stack exec nitta -- -t=fx32.32 examples/counter.lua -p=8080
[NOTICE : NITTA.UI] Running NITTA server at http://localhost:8080 ...
# Process steps:
$ http get localhost:8080/node/-3-4-3-2-0-1-0-0/microarchitecture/accum/process | jq '.steps[] | "\(.pID), \(.pDesc)"'
"6, Intermediate: +x1#0 +1@const#0 = x2#0;"
"5, Instruction: Out"
"4, Endpoint: Source x2#0"
"3, Instruction: Load False"
"2, Endpoint: Target 1@const#0"
"1, Instruction: ResetAndLoad False"
"0, Endpoint: Target x1#0"
# Vertical relationships
$ http get localhost:8080/node/-3-4-3-2-0-1-0-0/microarchitecture/accum/process | jq '.relations[] | "up: \(.[0]), down: \(.[1])"'
"up: 6, down: 4"
"up: 6, down: 2"
"up: 6, down: 0"
"up: 4, down: 5"
"up: 2, down: 3"
"up: 0, down: 1"

"And a little explanation about vertical relationships," the model says. "When I was scheduling my computational process, I noted everything I knew about my activities. For example, when you ordered me to be a source of x2#0 (4), I scheduled the Out instruction (5) for it. Also, I marked the relationship between the intermediate level (function) and all my endpoint roles (6 and 4, 2, 0). Am I not a good and pedantic model?"
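The step list and its vertical relations can be modeled with a short sketch (Python, using the data from the jq output above; the structure is simplified):

```python
# Simplified sketch of a PU process description: steps at different levels
# plus "vertical" links from a higher-level step to the steps implementing it.
steps = {
    6: "Intermediate: +x1#0 +1@const#0 = x2#0",
    5: "Instruction: Out",
    4: "Endpoint: Source x2#0",
    3: "Instruction: Load False",
    2: "Endpoint: Target 1@const#0",
    1: "Instruction: ResetAndLoad False",
    0: "Endpoint: Target x1#0",
}
relations = [(6, 4), (6, 2), (6, 0), (4, 5), (2, 3), (0, 1)]  # (up, down)

def implementors(step_id):
    """All steps that directly implement the given step."""
    return [down for up, down in relations if up == step_id]

print(implementors(6))  # the function is implemented by endpoints 4, 2, 0
```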

Let's stop the role play and return to the normal narrative. This is a correct description of the synthesis method's internal process at the conceptual level. Of course, real interaction with a PU model is much more complicated:

  • more questions;
  • more types of questions (we call them problems, see: NITTA.Model.Problems);
  • a lot of timing information (time constraints for options and intervals for decisions);
  • the question sequence is not straightforward (we can bind one function, schedule it, and then bind another one);
  • restrictions on transport sequences;
  • etc.

All of this can be tested by yourself inside the REPL. For more detail, look at NITTA.Model.ProcessorUnits.Multiplier and try it.

Network unit model

A network unit model is very similar to a processor unit model. It can answer the following questions:

  1. Which processor unit can evaluate the function?

    It is a simple merge of all positive answers from the processor units. E.g., we have two fram PUs (fram1 and fram2) in the network and want to bind the Loop function. Both PUs answer that they can evaluate the function. At the network level, that means we have two options for selection: Bind Loop fram1 and Bind Loop fram2. For more details, see: NITTA.Model.Problems.Bind.

  2. What values can be transferred by the network, or, more precisely, which variables with the same value can be transferred?

    This works a little more intricately. The model collects all information about data transfers from the processor units and finds the intersection between possible sources and targets. If something matches, we have an option for a data transfer over the network. E.g., fram can be a source of x1#1 and x1#2, accum can be a target of x1#2, and spi can be a target of x1#1, so we have the following transfer options (for more details, see NITTA.Model.Problems.Dataflow):

    • Fram -> { SPI };
    • Fram -> { Accum };
    • Fram -> { SPI, Accum }.
  3. Refactoring, which allows us to change an algorithm for different purposes. We will return to this question later.

In the UI, select Subforest in the menu to see what options are provided and make a decision by yourself.
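The source/target intersection step can be sketched like this (assumed data shapes for illustration, not the real NITTA.Model.Problems.Dataflow code):

```python
from itertools import chain, combinations

# Possible endpoints reported by the PU models (from the example above).
sources = {"fram": {"x1#1", "x1#2"}}
targets = {"accum": {"x1#2"}, "spi": {"x1#1"}}

def transfer_options(sources, targets):
    """A source PU may serve any non-empty subset of PUs wanting its values."""
    options = []
    for src, provided in sorted(sources.items()):
        matching = sorted(pu for pu, wanted in targets.items()
                          if provided & wanted)
        subsets = chain.from_iterable(
            combinations(matching, n) for n in range(1, len(matching) + 1))
        options += [(src, set(sub)) for sub in subsets]
    return options

opts = transfer_options(sources, targets)
# Three options: fram -> {accum}, fram -> {spi}, fram -> {accum, spi}
```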

Multi-network unit model

Not implemented.

Refactoring

Previously, we spoke about unit model decisions, but we also have refactorings, which affect both the dataflow graph and the unit model. How it works usually depends on the refactoring type and includes:

  • which part of the target system model recognizes the need for the refactoring;
  • how unit models change (if necessary) after the synthesis method makes the decision.

Let's see what happens when we break the Loop function into LoopBegin and LoopEnd. It happens in a few steps:

  • The Loop function was bound to fram and to a specific memory cell inside it. The function cannot be evaluated yet (be a source or a target for the related values).
  • The fram model recognizes the possibility of BreakLoop refactoring and informs the synthesis method about it.
  • The synthesis method decides to break the Loop function, after which:
    • The target system model receives the order and resends it to:
      • the dataflow graph, which replaces the Loop function with the LoopBegin and LoopEnd functions;
      • the network model, which resends it to the fram model, which already knows about the Loop function.
        • The fram model marks in the specific memory cell that Loop was broken into two parts. After that, the model can provide options for sending or receiving the related values.
  • The target system model now contains a new state, where the Loop function is broken and the related values can be transferred.

You can try BreakLoop and other refactorings yourself in our user interface. For other refactorings, see: NITTA.Model.Problems.Refactor.

Synthesis process

As we said earlier, the synthesis method works with the target system model in dialog form. Continuing this metaphor, you can imagine the synthesis process as a dialog tree in game development (but it is a real tree, not a graph). The dialog starts with the target system model, where all functions from the dataflow graph are bound to the computational unit (the network unit model).

After that, the dialog begins. In it, we can step back at any time and select another reply option. Each node of our tree represents a target system model state, edges between them represent our replies (orders or synthesis decisions), and leaves represent the final stage of the synthesis (successful or failed). Selected replies define our future options.

Let's see the dialog thread for the counter.lua example:

# Here we request all nodes from the root to a specific node, identified by the synthesis node id of the leaf.
http get localhost:8080/node/-2-4-0-0-0-0-0-0/history | jq  -r '.[].decision | "\(.)"'
{"function":{"fvHistory":[],"fvFun":"Loop (X 0.000000) (O [x1#0,x1#1]) (I x2#0)"},"tag":"BindDecisionView","pu":"fram1"}
{"tag":"BreakLoopView","value":"0.000000","input":"x2#0","outputs":["x1#0","x1#1"]}
{"function":{"fvHistory":[],"fvFun":"const(1.000000) = 1@const#0"},"tag":"BindDecisionView","pu":"fram1"}
{"function":{"fvHistory":[],"fvFun":"x1#0 + 1@const#0 = x2#0"},"tag":"BindDecisionView","pu":"accum"}
{"function":{"fvHistory":[],"fvFun":"send(x1#1)"},"tag":"BindDecisionView","pu":"spi"}
{"tag":"DataflowDecisionView","targets":{"x1#0":["accum",[1,1]],"x1#1":["spi",[1,1]]},"source":"fram1"}
{"tag":"DataflowDecisionView","targets":{"1@const#0":["accum",[3,3]]},"source":"fram1"}
{"tag":"DataflowDecisionView","targets":{"x2#0":["fram1",[6,6]]},"source":"accum"}

As you can see, lines 1-2 bind and prepare the Loop function. Lines 3-5 bind the Constant, + and Send functions. The remaining lines schedule the computational process as a sequence of data transfers over the network. For any node, we can request all available replies:

$ http get localhost:8080/node/-2-4-0-0-0-0-0-0/subForest | jq ''
[]

As expected, we don't have any options at the leaf node because the synthesis process is finished. But earlier (after four decisions), we have:

$ http get localhost:8080/node/-2-4-0-0/subForest | jq  -r '.[].decision | "\(.)"' 
{"function":{"fvHistory":[],"fvFun":"send(x1#1)"},"tag":"BindDecisionView","pu":"spi"}
{"tag":"DataflowDecisionView","targets":{"x1#0":["accum",[1,1]],"x1#1":null},"source":"fram1"}
{"tag":"DataflowDecisionView","targets":{"1@const#0":["accum",[1,1]]},"source":"fram1"}

In this case, the synthesis method can make decisions of different types: binding or dataflow. It is common to mix them during the synthesis process because we don't want to introduce unnecessary restrictions.

In a more user-friendly way, you can see and try that process in the UI.

Synthesis method

A synthesis method is a method for synthesis tree exploration. It defines:

  • which nodes and in which order should be explored;
  • which nodes should not be explored.

The best synthesis method should find a synthesis tree leaf with the shortest duration of the target computational process while exploring a minimal number of nodes. How is it implemented?

We can calculate many metrics for each node and each edge, depending on the target system model and the specific decision. E.g.:

  • for binding options: how many alternative bindings we have for the function (if we don't have alternatives, we can just bind);
  • for transport options: how long we need to wait before we can transport the value;
  • etc.

Using an objective function, we can find the best decision for each node, and in theory, that decision sequence allows us to find the best leaf in the shortest time. For more detail, see NITTA.Synthesis.Explore and the modules for each type of synthesis decision: NITTA.Synthesis.{Bind, Dataflow, etc.}
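A minimal sketch of such score-guided selection (metric names and weights are invented for illustration; the real metrics live in the NITTA.Synthesis.* modules):

```python
# Hypothetical scoring of synthesis options: reward forced bindings (no
# alternatives) and penalize waiting time before a transport can happen.
def score(option):
    s = 0.0
    if option.get("alternatives", 1) == 1:
        s += 1000                       # no alternatives: a "free" decision
    s -= option.get("wait", 0)          # waiting delays the schedule
    return s

def best_option(options):
    # The synthesis method expands the highest-scoring option first.
    return max(options, key=score)

options = [
    {"name": "bind send to spi", "alternatives": 1},
    {"name": "transfer x1#0",    "alternatives": 2, "wait": 1},
]
print(best_option(options)["name"])  # → bind send to spi
```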

Time at compile-time and run-time

TODO
