
NITTA internal

This document introduces the reader to the internal structure of the NITTA project. Therefore, we leave out the motivation and go straight to the technical details.

The central points of the project:

  • target processor architecture;
  • intermediate algorithm representation;
  • synthesis method.

Target processor architecture

From the hardware point of view, the target processor looks very simple if we don't consider how the processor units work inside and where the software comes from. A target processor contains:

  • The control unit, which provides control signals for all processor units. The stored software does not need any decoder and contains control signals as-is (thank you, NISC).

  • A set of processor units (PUs), which process data, including storage and IO.

  • The signal bus, which transfers control signals from the control unit to all PUs.

  • The value bus, which includes two components: the data bus and the attribute bus. It is not a real bus on the chip but an imitation built on logical OR (for more detail, see NITTA.Model.ProcessorUnits), so when a PU is not sending data, its output should provide zeros.

  • IO ports, which link the target processor environment to processor units with IO support.

We will call the specific configuration of the PU set and buses a microarchitecture.

                                                                               || I
+--------------+       signal bus                                          sclk|| O
| Control Unit +--+----+----+-------+-------+------+-----+----+----+       mosi||
+--------------+  |    |    |       |       |      |     |    |    |       miso|| P
                  |wr  |oe  |addr   |reset  |load  |neg  |oe  |wr  |oe       cs|| o
                  |    |    |       |       |      |     |    |    |           || r
                  v    v    v       v       v      v     v    v    v           || t
                 +-------------+   +----------------------+  +------------+    || s
                 | PU1 :: Fram |   |     PU2 :: Accum     |  | SPI :: SPI +====+/
                 +------+------+   +----------+-----------+  +------------+
                        ^                     ^                    ^
                        |                     |                    |
                        v                     v                    v
                     [================================================] value bus
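The OR-merge trick behind the value bus is easy to show in a few lines of Haskell. This is a toy sketch with hypothetical names, not the NITTA implementation (for the real logic, see NITTA.Model.ProcessorUnits): since an idle PU drives zeros, the "bus" is just a wide OR over all PU outputs.

import Data.Bits ((.|.))
import Data.Word (Word32)

-- Hypothetical PU output: Nothing models a silent PU,
-- which must drive zeros on the bus.
puOutput :: Maybe Word32 -> Word32
puOutput = maybe 0 id

-- The value "bus" is not real wiring but a wide OR over all PU outputs.
valueBus :: [Maybe Word32] -> Word32
valueBus = foldr ((.|.) . puOutput) 0

main :: IO ()
main = print (valueBus [Nothing, Just 42, Nothing]) -- 42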

Let's see how it works on an example: a counter. This program counts from zero, increases the value once per computational cycle, and sends it outside the processor through the SPI processor unit (we will consider the synchronization between computational and data transfer cycles later).

The following Lua code implements the counter:

function counter(x1)
    send(x1)          -- send the current value outside via the SPI PU
    local x2 = x1 + 1 -- increase the counter
    counter(x2)       -- tail call: the next computational cycle
end
counter(0)            -- start counting from zero

Each PU is controlled individually and independently of the others. You can see each PU's microprogram in the following table. Each command maps directly to control signals; a hardware decoder is not required.

PC | Fram          | Accum              | SPI
---+---------------+--------------------+--------
0  | PrepareRead 1 | Nop                | Nop
1  | Nop           | ResetAndLoad False | Sending
2  | PrepareRead 0 | Nop                | Nop
3  | Nop           | Load False         |
4  |               | Nop                |
5  |               | Out                |
6  | Write 1       | Nop                |
7  | Nop           |                    |

All instructions are heavily PU-centric and have no interpretation outside of the PU context. An instruction is only a readable form of the PU control signals. However, the program as a whole organizes the computational process in the entire NITTA processor.
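To make this concrete, the mnemonics from the table can be sketched as plain Haskell types. The names here are hypothetical; the real instruction types are defined per PU in NITTA.Model.ProcessorUnits:

-- A sketch of per-PU instruction mnemonics as Haskell types.
data FramInstruction  = FramNop | PrepareRead Int | Write Int
data AccumInstruction = AccumNop | ResetAndLoad Bool | Load Bool | Out
data SpiInstruction   = SpiNop | Sending

-- One control word per tick; a microprogram is just a list of them,
-- with no decoding between software and control signals.
type ControlWord  = (FramInstruction, AccumInstruction, SpiInstruction)
type Microprogram = [ControlWord]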

The second part of the NITTA architecture is the Transport Triggered Architecture, which means that a computational process can be described from the transport point of view. We do this in the next section.

Intermediate algorithm representation and execution

Any NITTA program can be represented as a dataflow graph (DFG). A DFG represents a computational process by a directed graph, where nodes represent functions for storing, processing, or IO, and edges represent value transfers between functions. No other types of objects are allowed, which means that all values should be represented by functions, including constants.

During the synthesis process, the DFG can be changed several times for:

  • preparing the algorithm for an executable form (see BreakLoop);
  • adapting the algorithm to the microarchitecture (see ResolveDeadLock);
  • optimizing the algorithm for the microarchitecture (see OptimizeAccum).

Let's see how it works (you can see such a graph in the NITTA UI in a more rounded form).

Initial cyclic graph

The initial DFG is the result of Lua source code translation. The transpiler uses the following rules:

  • operators and function calls are translated to nodes in the DFG (except functions with the debug prefix, which are used for debugging purposes);
  • all edges represent value transfers between functions (e.g., the x1 value is transferred twice: as an argument of send and as an operand of +), so we need to create aliases for them in the form <original-name>#<usage-index> (in our example, x1 transforms into x1#1 and x1#2);
  • all constants in the algorithm should be translated into constant functions (like const(1) with output 1#1); multiple usages of a constant should be merged (1#1, 1#2, ...);
  • function arguments should be represented as a loop function with an initial value, which provides a value at the beginning and consumes one at the end of each computational cycle.
                /----------------------------------\
            x2#1|                                  |
                v                                  |
+-------------------------+       x1#1  +------+   |
| loop(x2#1) = x1#1, x1#2 |------+----->| send |   |
+-------------------------+      |      +------+   |
                                 |                 |
                             x1#2|             x2#1|
                                 |                 |
                                 v                 |
          +----------+ 1#1  +-------------------+  |
          | const(1) |----->| x1#2 + 1#0 = x2#1 |--/
          +----------+      +-------------------+
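The same graph can be written down as data. Here is a minimal Haskell sketch with hypothetical node types (the real NITTA representation is richer and lives in the project sources):

type Var = String

-- Hypothetical node type: every value, including the constant,
-- is produced and consumed only by function nodes.
data FNode
  = FLoop Double [Var] Var -- initial value, outputs, input
  | FConst Double Var      -- constant with its output variable
  | FAdd Var Var Var       -- a + b = c
  | FSend Var              -- send a value outside via SPI
  deriving Show

counterDFG :: [FNode]
counterDFG =
  [ FLoop 0 ["x1#1", "x1#2"] "x2#1"
  , FSend "x1#1"
  , FConst 1 "1#1"
  , FAdd "x1#2" "1#1" "x2#1"
  ]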

Executable acyclic graph

To make the DFG executable, we need to break all loop functions in the algorithm, making the graph acyclic. Of course, we don't forget the shared history of the resulting LoopIn and LoopOut functions.

The result is represented in the following scheme:

+------------------------+       x1#1  +------+       +-----------------------------------+
| loopOut() = x1#1, x1#2 |------+----->| send |   /-->| loopIn(x2#1) pair out: x1#1, x1#2 |
+------------------------+      |      +------+   |   +-----------------------------------+
                                |                 |
                            x1#2|             x2#1|
                                |                 |
                                v                 |
+----------+   1#1   +-------------------+        |
| const(1) |-------->| x1#2 + 1#0 = x2#1 |--------/
+----------+         +-------------------+

We can describe the process of one computational cycle from the transport point of view. Let's do it.

PC | Transport                  | Fram              | Accum       | SPI
---+----------------------------+-------------------+-------------+------------
0  |                            |                   |             |
1  | x1: Fram -> { SPI, Accum } | Source x1#1, x1#2 | Target x1#2 | Target x1#1
2  |                            |                   |             |
3  | 1#1: Fram -> Accum         | Source 1#1        | Target 1#1  |
4  |                            |                   |             |
5  |                            |                   |             |
6  | x2#1: Accum -> Fram        | Target x2#1       | Source x2#1 |
7  |                            |                   |             |

Right now, you can compare this table with the table of PU instructions and see that they are consistent. Each PU receives instructions to perform the necessary computational process, described from the transport point of view. In the next section, we show how the computational process happens across many cycles and how the Loop function works.

Computational process on the intermediate level

A target processor should repeat an algorithm an infinite number of times, and usually it is required to transfer data between cycles. The Loop function allows doing that. On the first computational cycle, the Loop function produces an initial value and consumes a new value at the end. On the following cycles, it produces the value from the previous cycle.

                           Cycle 0                                                  Cycle 1

][----------------------+       x1#1  +------+       +---][----------------------+       x1#1  +------+       +---][------...
][op(x2#1) = x1#1, x1#2 |------+----->| send |   /-->| lo][op(x2#1) = x1#1, x1#2 |------+----->| send |   /-->| lo][op(x2#...
][----------------------+      |      +------+   |   +---][----------------------+      |      +------+   |   +---][------...
][                             |                 |       ][                             |                 |       ][      ...
][                         x1#2|             x2#1|       ][                         x1#2|             x2#1|       ][      ...
][                             |                 |       ][                             |                 |       ][      ...
][                             v                 |       ][                             v                 |       ][      ...
][      +----------+ 1#1  +-------------------+  |       ][      +----------+ 1#1  +-------------------+  |       ][      ...
][      | const(1) |----->| x1#2 + 1#0 = x2#1 |--/       ][      | const(1) |----->| x1#2 + 1#0 = x2#1 |--/       ][      ...
][      +----------+      +-------------------+          ][      +----------+      +-------------------+          ][      ...

In the diagram above, we show loopOut and loopIn as the original Loop function. Their connection is implemented:

  • on the intermediate level, by metadata;
  • on the target processor, by the PU internal implementation (e.g., in FRAM both parts are bound to the same address).

A logical simulation can be executed by the following command: stack build && stack exec nitta -- -l examples/counter.lua.

Cycle | x1 | x2
------+----+---
0     | 0  | 1
1     | 1  | 2
2     | 2  | 3
3     | 3  | 4
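The table above can be reproduced with a tiny functional sketch in Haskell. This is not the NITTA simulator, just the counter semantics:

counterCycles :: [(Int, Int)] -- (x1, x2) per computational cycle
counterCycles = go 0
  where go x1 = (x1, x1 + 1) : go (x1 + 1)

main :: IO ()
main = mapM_ print (take 4 counterCycles)
-- (0,1) (1,2) (2,3) (3,4)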

Transport triggered at compile time

In this section, we connect the intermediate algorithm representation with the target processor architecture. Before we start, we need to separate compile-time (CT) and run-time (RT) elements of a NITTA processor. Everything about the intermediate level happens at compile time; the target processor knows only about hardware and software in the form of control signals (but we show their mnemonics).

In the figure, you can see the computational process from the transport point of view, which is planned at compile time. You can see computational cycles and the state of the value bus on each processor tick. By * we denote a PU providing a value (possibly several at once) or consuming a value (only one at a time).

                             Cycle 0                                   Cycle 1
          ][--0----1----2----3----4----5----6----7--][--0----1----2----3----4----5----6----7--][...
spi out
fram out           **        *                               **        *
accum out          ||        |              *                ||        |              *
                   ||        |              |                ||        |              |
value bus:         ||        |              |                ||        |              |
                   ||        |              |                ||        |              |
accum in:          v|        v              v           x1#1 v|    1#1 v              v
fram in:      x1#1 *v    1#1 *         x2#1 *                *v        *         x2#1 *
spi in:        x1#2 *                                    x1#2 *

So, software generation for a NITTA processor looks simple. We only need to:

  • modify the intermediate algorithm representation;
  • bind functions of the intermediate representation to the available processor units;
  • schedule all needed transfers within the PU and algorithm restrictions.

After that, we receive something like the following table. In the next section, we try to explain how it happens.

PC | Transport (CT)             | Fram (CT)         | Fram (RT)     | Accum (CT)  | Accum (RT)         | SPI (CT)    | SPI (RT)
---+----------------------------+-------------------+---------------+-------------+--------------------+-------------+---------
0  |                            |                   | PrepareRead 1 |             | Nop                |             | Nop
1  | x1: Fram -> { SPI, Accum } | Source x1#1, x1#2 | Nop           | Target x1#2 | ResetAndLoad False | Target x1#1 | Sending
2  |                            |                   | PrepareRead 0 |             | Nop                |             | Nop
3  | 1#1: Fram -> Accum         | Source 1#1        | Nop           | Target 1#1  | Load False         |             |
4  |                            |                   |               |             | Nop                |             |
5  |                            |                   |               |             | Out                |             |
6  | x2#1: Accum -> Fram        | Target x2#1       | Write 1       | Source x2#1 | Nop                |             |
7  |                            |                   | Nop           |             |                    |             |

Synthesis process

What do we need to synthesize a target processor, including its software and hardware?

As input data, we need:

  • the algorithm in the intermediate representation with all known functions;
  • the microarchitecture of the target processor, which includes all necessary processor units and bus configurations (this item is temporary input data; in the long term, it should be part of the synthesis results).

As part of our synthesis tool, we need:

  • Some kind of target system model sufficient for understanding:
    • what functions can be executed;
    • what control signal sequences are required to do that (software);
    • how to use and, if necessary, generate the hardware implementation.
  • A synthesis method, directly implemented as part of the synthesis tool. It should allow scheduling or synthesizing the target computational process description.
  • A target system generator, which generates target project files from the target system model.
  • A UI or debugger, which allows our users to inspect the synthesis process or perform it in a semiautomated manner.
  • A verification subsystem, which allows our users to run a functional simulation of the algorithm and compare the expected behavior with a logical simulation at the RTL level.

Target system model

One of the features of NITTA is the complex behavior of processor units. It means that we cannot use a simple solution like an equation system. We need something more complicated, so we build an imitation model of the target system, which can report to the synthesis method what it can do and apply the selected decisions step by step.

                            /-------------------------------------------\   
                            V                                           |
I'm the              What can you do?                                   |
 Method      ------------------------------->  +-----+                  |
        O            Possible options          |\\\\\| I'm the          |
      /\ /\  <-------------------------------  |\\\\\| target           |
        V           Execute my decision!       |\\\\\| model            |
             ------------------------------->  +-----+                  |
                            |                                           |
                            |                   repeat                  |
                            \-------------------------------------------/ 
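In code, this dialog boils down to an options/decision interface. Here is a minimal Haskell sketch with a hypothetical class; the real, much richer interfaces live in NITTA.Model.Problems:

{-# LANGUAGE MultiParamTypeClasses, FunctionalDependencies #-}

-- A model can list its options and apply one decision,
-- returning a new model state.
class Problem m o | m -> o where
  options  :: m -> [o]    -- "What can you do?"
  decision :: m -> o -> m -- "Execute my decision!"

-- A naive synthesis driver that always takes the first offered option.
greedy :: Problem m o => m -> m
greedy m = case options m of
  []      -> m
  (o : _) -> greedy (decision m o)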

The target system model includes (see NITTA.Model.TargetSystem):

  • the dataflow graph, or intermediate algorithm representation (we need it because the algorithm can be changed during the synthesis process);
  • the computational unit model (the processor microarchitecture, including processor units and the network).

We have already said all we need about dataflow graphs, so we focus on the computational unit model. We will describe it bottom-up: the processor unit model, then the network unit model.

Processor unit model

The main aim of a processor unit model is to teach NITTA how to use the processor unit. For that, we make an imitation model of each PU. Interaction with that model provides everything we need to know. The interaction is implemented in dialog form. Let's see it in a simplified form with the Accum example:

"Hi, fresh Accum model. Can you execute loop function?" the synthesis method asks.

"No," the Accum model says.

"Can you execute const function?"

"No."

"Can you execute x1#2 + 1#0 = x2#1?"

"I afraid you don't ask it. Yes. Here is my new state, where I bounded the function and prepared to its execution."

"Good, I agree," the method says because it doesn't have any alternative. "What can you do from a transport point?"

"Only two options: 1) I can be a target for x1#2; 2) I can be a target for 1#0. What would you like to select?"

"Okey, let be a target for x1#2." Of course, the synthesis method can make a different decision too.

"Ok. Here is my new state where I scheduled my computational process for receiving x1#2 from the value bus."

"Good, I agree. What can you do from a transport point of view?"

"Only one option: I can be a target for 1#0. What would you like?"

"Okey, let be a target for 1#0."

"Ok. Here is my new state where I scheduled my computational process for receiving 1#0 from the value bus and sum it with the previous one after a short delay."

"Good. What can you do from a transport point of view?"

"Only one option: I can be a source for x2#1. What would you like?"

"Great. Let be a source for x2#1."

"Ok. Here is my new state where I scheduled my computational process for sending x2#1 to the value bus."

"Excellent. What can you do?"

"Nothing," the model says and squeezes its shoulder.

"I think we are done. Can I see your computational process description?"

"Of course, you can see it in my column in the table above. Also, you can see the full description of my computational process, where I note all that I do and mark vertical relations:"

$ stack build --fast nitta:nitta && stack exec nitta -- -t=fx32.32 examples/counter.lua -p=8080
[NOTICE : NITTA.UI] Running NITTA server at http://localhost:8080 ...
# Process steps:
$ http get localhost:8080/node/-3-4-3-2-0-1-0-0/microarchitecture/accum/process | jq '.steps[] | "\(.pID), \(.pDesc)"'
"6, Intermediate: +x1#0 +1@const#0 = x2#0;"
"5, Instruction: Out"
"4, Endpoint: Source x2#0"
"3, Instruction: Load False"
"2, Endpoint: Target 1@const#0"
"1, Instruction: ResetAndLoad False"
"0, Endpoint: Target x1#0"
# Vertical relationships
$ http get localhost:8080/node/-3-4-3-2-0-1-0-0/microarchitecture/accum/process | jq '.relations[] | "up: \(.[0]), down: \(.[1])"'
"up: 6, down: 4"
"up: 6, down: 2"
"up: 6, down: 0"
"up: 4, down: 5"
"up: 2, down: 3"
"up: 0, down: 1"

"And a little explanation about vertical relationships," the model says, "When I was scheduling my computational process, I note all that I know about my activities. For example, when you order me to be an x2#0 source (4), I schedule instruction Out (5) for it. The endpoint role and the instruction are placed on a different computational process level, so I mark their relationship (6 and 4, 2, 0). Also, I mark the relationship between the intermediate level (function) and all my endpoint roles. Am I not a good and pedantic model?"

Let's stop this role game and return to the normal style. This is a correct description of the synthesis method's internal process at the conceptual level. Of course, real interaction with a model is much more complicated:

  • more questions;
  • more types of questions (we call them problems, see NITTA.Model.Problems);
  • a lot of timing information (time constraints for options and time intervals for decisions);
  • a non-straightforward question sequence (we can bind a function, schedule it, and then bind another one);
  • restrictions on transport sequences;
  • etc.

All of them can be tested by yourself inside the REPL. For more detail, look at NITTA.Model.ProcessorUnits.Multiplier and try it.

Network unit model

A network unit model is very similar to a processor unit model. It can answer the following questions:

  1. Which processor units can evaluate the function?

    It is a simple merge of all positive answers from the processor units. E.g., we have two Fram PUs (fram1 and fram2) in the network and want to bind the Loop function. Both PUs answer that they can evaluate the function. On the network level, that means we have two options for selection: Bind Loop fram1 and Bind Loop fram2. For more details, see NITTA.Model.Problems.Bind.

  2. What values can be transferred by the network or, more precisely, which variables carrying the same value can be transferred?

    This works in a slightly more complicated way (see the sketch after this list). The model collects all information about possible data transfers from the processor units and finds the intersection between possible sources and targets. If something matches, we have an option for a data transfer over the network. E.g., fram can be a source of x1#1 and x1#2, accum can be a target of x1#2, and spi can be a target of x1#1, so we have the following transfer options (for more details, see NITTA.Model.Problems.Dataflow):

    • Fram -> { SPI };
    • Fram -> { Accum };
    • Fram -> { SPI, Accum }.
  3. Refactoring, which allows us to change the algorithm for different purposes. We will return to this question later.
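Here is a self-contained Haskell toy of the first two network-level questions. All names are hypothetical; see NITTA.Model.Problems.Bind and NITTA.Model.Problems.Dataflow for the real code. It is also simplified: it yields one option per variable, while the real network can combine variables of the same value into one transfer (as in Fram -> { SPI, Accum }).

type Fun = String
type Var = String
type PU  = String

data Net = Net
  { pus      :: [PU]
  , canEval  :: PU -> Fun -> Bool
  , sourceOf :: PU -> [Var] -- variables a PU can currently provide
  , targetOf :: PU -> [Var] -- variables a PU can currently consume
  }

-- Question 1: merge positive bind answers from all PUs.
bindOptions :: Net -> Fun -> [(Fun, PU)]
bindOptions net f = [ (f, pu) | pu <- pus net, canEval net pu f ]

-- Question 2: intersect possible sources and targets of each variable.
dataflowOptions :: Net -> [(Var, PU, [PU])]
dataflowOptions net =
  [ (v, src, tgts)
  | src <- pus net
  , v   <- sourceOf net src
  , let tgts = [ pu | pu <- pus net, pu /= src, v `elem` targetOf net pu ]
  , not (null tgts)
  ]

demo :: Net
demo = Net
  { pus      = ["fram", "accum", "spi"]
  , canEval  = \pu f -> (pu, f) `elem` [("fram", "loop"), ("accum", "+"), ("spi", "send")]
  , sourceOf = \pu -> if pu == "fram" then ["x1#1", "x1#2"] else []
  , targetOf = \pu -> case pu of { "accum" -> ["x1#2"]; "spi" -> ["x1#1"]; _ -> [] }
  }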

You can see the dialog between the synthesis method and the network unit model in the UI. For that, run NITTA with the UI, select the Subforest menu, see what options are provided, and make decisions by yourself.

Multi-network unit model

Not implemented.

Refactoring

Previously, we spoke about unit model decisions, but we also have refactoring, which affects both the dataflow graph and the unit model. How it works depends on the refactoring type and includes:

  • which part of the target system model recognizes the refactoring necessity;
  • how the unit models change (if necessary) after the synthesis method makes the decision.
Let's see what happens when we break the Loop function into LoopIn and LoopOut (a toy sketch of the dataflow-graph part follows the list). It happens in a few steps:

  • The Loop function is bound to fram and a specific memory cell inside it. The function cannot be evaluated yet (be a source or a target for the related values).
  • The fram model recognizes the possibility of BreakLoop refactoring and informs the synthesis method about it.
  • The synthesis method decides to break that function, after which:
    • The TargetSystem model receives the order and resends it to:
      • the dataflow graph, which replaces the Loop function with the LoopIn and LoopOut functions;
      • the network model, which resends it to the fram model, which already knows about the Loop function.
        • The fram model marks in the specific memory cell that Loop was broken into two parts. After that, the model can provide options for sending or receiving the related values.
  • The TargetSystem model now contains a new state, where the Loop function is broken.
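A toy Haskell sketch of the dataflow-graph part of BreakLoop (hypothetical types; the real refactoring also updates the fram model, as described above):

type Var = String

-- Toy function node (the real types also carry history and value types).
data F
  = Loop Double [Var] Var -- initial value, outputs, input
  | LoopOut Double [Var]  -- provides values at the cycle start
  | LoopIn Var            -- consumes the value at the cycle end
  deriving (Show, Eq)

-- BreakLoop on the graph: replace one Loop node with
-- a LoopOut/LoopIn pair that keeps its parameters.
breakLoop :: F -> [F] -> [F]
breakLoop f@(Loop x outs i) dfg =
  LoopOut x outs : LoopIn i : filter (/= f) dfg
breakLoop _ dfg = dfg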

You can try BreakLoop and other refactorings yourself in our user interface. For other refactorings, see NITTA.Model.Problems.Refactor.

Synthesis process

As we said earlier, the synthesis method works with the target system model in dialog form. If we continue this metaphor, you can imagine the synthesis process as a dialog tree in game development (but it is a real tree, not a graph). The dialog starts with the target system model, where all functions from the dataflow graph are bound to the computational unit as a whole (the network unit model).

After that, the dialog begins. In it, we can step back at any time and select another reply option. Each node of our tree represents a target system model state, and the edges between them represent our replies (orders, or synthesis decisions). The selected replies define our future options (some dialog threads end successfully, some fail).
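In Haskell terms, such a dialog tree can be sketched like this (hypothetical types; the real synthesis tree in NITTA carries much more bookkeeping):

-- A node is a model state; each edge is a possible decision
-- leading to the next state.
data SynthesisTree m d = SynthesisNode m [(d, SynthesisTree m d)]

-- Lazily expand the whole tree from the option and decision functions.
expand :: (m -> [d]) -> (m -> d -> m) -> m -> SynthesisTree m d
expand opts apply m =
  SynthesisNode m [ (d, expand opts apply (apply m d)) | d <- opts m ]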

Let's look at the dialog thread for the counter.lua example:

# Here we request all nodes from the root to the specific node; -3-4-3-2-0-1-0-0 is the synthesis node id of the leaf.
$ http get localhost:8080/node/-3-4-3-2-0-1-0-0/rootPath | jq  -r '.[].decision | "\(.)"'
{"function":{"fvHistory":[],"fvFun":"const(1.000000) = 1@const#0"},"tag":"BindDecisionView","pu":"fram1"}
{"function":{"fvHistory":[],"fvFun":"Loop (X 0.000000) (O [x1#0,x1#1]) (I x2#0)"},"tag":"BindDecisionView","pu":"fram1"}
{"tag":"BreakLoopView","value":"0.000000","input":"x2#0","outputs":["x1#0","x1#1"]}
{"function":{"fvHistory":[],"fvFun":"x1#0 + 1@const#0 = x2#0"},"tag":"BindDecisionView","pu":"accum"}
{"function":{"fvHistory":[],"fvFun":"send(x1#1)"},"tag":"BindDecisionView","pu":"spi"}
{"tag":"DataflowDecisionView","targets":{"x1#0":["accum",[1,1]],"x1#1":["spi",[1,1]]},"source":"fram1"}
{"tag":"DataflowDecisionView","targets":{"1@const#0":["accum",[3,3]]},"source":"fram1"}
{"tag":"DataflowDecisionView","targets":{"x2#0":["fram1",[6,6]]},"source":"accum"}

As you can see, in lines 1-3 we bind and prepare the Loop and const functions. In lines 4-5, we bind the send and + functions. On the remaining lines, we schedule the computational process as a sequence of data transfers over the network. For any node, we can request all available replies:

$ http get localhost:8080/node/-0-0-2-0-0-1-0-0/subForest | jq ''
[]

As expected, we don't have any options at the leaf node because the synthesis process is finished. But earlier we had:

$ http get localhost:8080/node/-0-0-2-0/subForest | jq -r '.[] | "\(.sid) \(.decision)"' 
-0-0-2-0-0 {"function":{"fvHistory":[],"fvFun":"send(x1#1)"},"tag":"BindDecisionView","pu":"spi"}
-0-0-2-0-1 {"tag":"DataflowDecisionView","targets":{"1@const#0":["accum",[1,1]]},"source":"fram1"}
-0-0-2-0-2 {"tag":"DataflowDecisionView","targets":{"x1#0":["accum",[1,1]],"x1#1":null},"source":"fram1"}

In this case, the synthesis method can make decisions of different types: binding or dataflow. It is common to mix them during the synthesis process because we don't want to introduce unnecessary restrictions.

In a more user-friendly way, you can see and try this process in the UI.
