NITTA internal
This document aims to introduce the reader to the internal structure of the NITTA project. Therefore, we will leave out the motivation and go straight to the technical details.
The central points in the project:
- target processor architecture;
- intermediate algorithm representation;
- synthesis method.
From the hardware point of view, the target processor looks very simple if we don't consider how processor units work inside and where we get software for it. A target processor contains:
- The control unit, which provides control signals for all processor units. The stored software does not need any decoder and contains control signals as is (thank you, NISC).
- A set of processor units (PUs), which process data, including storage and IO.
- The signal bus, which transfers signals from the control unit to all PUs.
- The value bus, which includes two components: the data and attribute buses. It is not a real bus on a chip; it is emulated by a logical OR (for more detail, see `NITTA.Model.ProcessorUnits`), so when a PU is not sending data, its output should provide zero.
- IO ports, which link the target processor environment with the processor units that perform IO.
The specific configuration of the PU set and buses is what we will call the microarchitecture.
|| I
+--------------+ signal bus sclk|| O
| Control Unit +--+----+----+-------+-------+------+-----+----+----+ mosi||
+--------------+ | | | | | | | | | miso|| P
|wr |oe |addr |reset |load |neg |oe |wr |oe cs|| o
| | | | | | | | | || r
v v v v v v v v v || t
+-------------+ +----------------------+ +------------+ || s
| PU1 :: Fram | | PU2 :: Accum | | SPI :: SPI +====+/
+------+------+ +----------+-----------+ +------------+
^ ^ ^
| | |
v v v
[================================================] value bus
Let's see how it works with an example: a counter. This program counts from zero, increases the value once per computational cycle, and sends it outside the processor via the SPI processor unit (we will consider synchronization between computational and data transfer cycles later).
The following Lua code implements the counter:
function counter(x1)
    send(x1)            -- send the current value outside via SPI
    local x2 = x1 + 1   -- increase the value once per cycle
    counter(x2)         -- tail call starts the next computational cycle
end
counter(0)              -- initial value
Each PU is controlled individually and independently of the others. You can see each PU's microprogram in the following table. Each instruction maps directly to control signals; a hardware decoder is not required.
PC | Fram | Accum | SPI |
---|---|---|---|
0 | PrepareRead 1 | Nop | Nop |
1 | Nop | ResetAndLoad False | Sending |
2 | PrepareRead 0 | Nop | Nop |
3 | Nop | Load False | ″ |
4 | ″ | Nop | ″ |
5 | ″ | Out | ″ |
6 | Write 1 | Nop | ″ |
7 | Nop | ″ | ″ |
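Because the architecture is NISC-style, each mnemonic in the table above is just a readable name for a concrete set of control wires. A minimal Python sketch of this idea (the wire names and encoding here are invented for illustration, not NITTA's actual ones):

```python
# Hypothetical control-wire encoding for the Fram PU mnemonics above.
# Real NITTA derives these from the PU's Haskell model; names are invented.
def fram_signals(instruction):
    """Decode a Fram mnemonic into its control wires. There is no hardware
    decoder: the stored microprogram already contains these signal levels."""
    wr, oe, addr = 0, 0, 0
    if instruction == "Nop":
        pass                                          # all wires inactive
    elif instruction.startswith("PrepareRead"):
        oe, addr = 1, int(instruction.split()[1])     # read the given cell
    elif instruction.startswith("Write"):
        wr, addr = 1, int(instruction.split()[1])     # write the given cell
    return {"wr": wr, "oe": oe, "addr": addr}

microprogram = ["PrepareRead 1", "Nop", "PrepareRead 0", "Nop",
                "Nop", "Nop", "Write 1", "Nop"]
for pc, instr in enumerate(microprogram):
    print(pc, fram_signals(instr))
```

The point of the sketch is only that the "instruction" column is a one-to-one rendering of signal levels per tick.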
All instructions are heavily PU-centric and have no interpretation outside of the PU context. An instruction is only a readable representation of PU control signals. However, the program as a whole organizes the computational process in the entire NITTA processor.
The second part of the NITTA architecture is the Transport Triggered Architecture, which means that a computational process can be described from the transport point of view. We do so in the next section.
Any NITTA program can be represented as a data flow graph (DFG). A DFG represents a computational process as a directed graph, where nodes represent functions for storing, processing, or IO, and edges represent value transfers between functions. No other type of object is allowed, which means that all values should be represented by functions, including constants.
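As a toy illustration (not NITTA's actual Haskell types), such a graph can be modeled with functions as nodes and transferred values as edge labels:

```python
# Toy DFG for the counter example: every node is a function (even the
# constant), and every edge is a value transfer. Structure only.
dfg = {
    "nodes": ["loop(x2) = x1", "send(x1)", "const(1) = 1", "x1 + 1 = x2"],
    "edges": [
        ("loop(x2) = x1", "send(x1)", "x1"),      # x1 goes to send ...
        ("loop(x2) = x1", "x1 + 1 = x2", "x1"),   # ... and to the adder
        ("const(1) = 1", "x1 + 1 = x2", "1"),     # the constant is a function
        ("x1 + 1 = x2", "loop(x2) = x1", "x2"),   # back edge closes the cycle
    ],
}

# Only functions appear as nodes; values exist solely on edges.
values = {v for _, _, v in dfg["edges"]}
print(sorted(values))
```

Note the back edge from the adder to `loop`: the graph is cyclic until the `BreakLoop` refactoring described below is applied.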
During a synthesis process, the DFG can be changed several times for:
- preparing the algorithm for an executable form (see `BreakLoop`);
- adapting the algorithm to the microarchitecture (see `ResolveDeadLock`);
- optimizing the algorithm for the microarchitecture (see `OptimizeAccum`).
Let's see how it works (you can see such a graph in the NITTA UI in a more rounded form).
The initial DFG is the result of Lua source code translation. The transpiler uses the following rules:
- Operators and function calls are translated into nodes of the DFG (except functions with the `debug` prefix, which are used for debugging purposes).
- All edges represent value transfers between functions (e.g., the `x1` value is transferred two times: as an argument of `send` and as an operand of `+`), so we need to create aliases for them in the form `<original-name>#<usage-index>` (in our example, `x1` transforms into `x1#1` and `x1#2`).
- All constants in the algorithm should be translated into constant functions (like `const(1)` with output `1#1`); multiple usages of a constant should be merged (`1#1`, `1#2`, ...).
- Function arguments should be represented as a `loop` function with an initial value, which is produced at the beginning and consumed at the end of each computational cycle.
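The aliasing rule can be sketched as a toy transpiler pass (this is an illustration of the `<name>#<index>` naming scheme, not NITTA's actual transpiler):

```python
from collections import defaultdict

def alias_usages(transfers):
    """Give each usage of a value its own <name>#<index> alias, so a value
    consumed by several functions yields distinct edge labels."""
    counters = defaultdict(int)
    aliased = []
    for name in transfers:
        counters[name] += 1
        aliased.append(f"{name}#{counters[name]}")
    return aliased

# x1 is consumed twice (by send and by +), the constant's output once:
print(alias_usages(["x1", "x1", "1"]))  # ['x1#1', 'x1#2', '1#1']
```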
/----------------------------------\
x2#1| |
v |
+-------------------------+ x1#1 +------+ |
| loop(x2#1) = x1#1, x1#2 |------+----->| send | |
+-------------------------+ | +------+ |
| |
x1#2| x2#1|
| |
v |
+----------+ 1#1 +-------------------+ |
| const(1) |----->| x1#2 + 1#0 = x2#1 |--/
+----------+ +-------------------+
To make the DFG executable, we need to break all `loop` functions in the algorithm, making the graph acyclic. Of course, we don't forget the shared history of the resulting `LoopIn` and `LoopOut` functions.
The result is represented in the following scheme:
+------------------------+ x1#1 +------+ +-----------------------------------+
| loopOut() = x1#1, x1#2 |------+----->| send | /-->| loopIn(x2#1) pair out: x1#1, x1#2 |
+------------------------+ | +------+ | +-----------------------------------+
| |
x1#2| x2#1|
| |
v |
+----------+ 1#1 +-------------------+ |
| const(1) |-------->| x1#2 + 1#0 = x2#1 |--------/
+----------+ +-------------------+
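This transformation can be sketched as a small graph rewrite (a simplified model; the real refactoring lives in `NITTA.Model.Problems.Refactor`):

```python
def break_loop(edges, loop_node):
    """Split a loop node into loopOut (produces the value at cycle start)
    and loopIn (consumes it at cycle end), making the graph acyclic."""
    out_node, in_node = f"loopOut<{loop_node}>", f"loopIn<{loop_node}>"
    result = []
    for src, dst, value in edges:
        src = out_node if src == loop_node else src  # outgoing edges -> loopOut
        dst = in_node if dst == loop_node else dst   # incoming edge  -> loopIn
        result.append((src, dst, value))
    return result

# The counter's cyclic edges; breaking the loop removes the cycle:
edges = [("loop", "send", "x1#1"), ("loop", "add", "x1#2"),
         ("add", "loop", "x2#1")]
print(break_loop(edges, "loop"))
```

After the rewrite, no node is both a producer and a consumer of the carried value, so the schedule for one cycle can be built on an acyclic graph.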
We can describe the process of one computational cycle from the transport point of view. Let's do it.
PC | Transport | Fram | Accum | SPI |
---|---|---|---|---|
0 | | | | |
1 | x1 : Fram -> { SPI, Accum } | Source x1#1, x1#2 | Target x1#2 | Target x1#1 |
2 | | | | |
3 | 1#1 : Fram -> Accum | Source 1#1 | Target 1#1 | |
4 | | | | |
5 | | | | |
6 | x2#1 : Accum -> Fram | Target x2#1 | Source x2#1 | |
7 | | | | |
Right now, you can compare this table with the table of PU instructions and see their consistency. Each PU receives the instructions needed to perform the computational process described from the transport point of view. In the next section, we show how the computational process spans many cycles and how the `Loop` function works.
A target processor should repeat an algorithm an infinite number of times, and usually it is required to transfer data between cycles. The `Loop` function allows doing that. On the first computational cycle, the `Loop` function produces an initial value and consumes a new value at the end. On the following cycles, it produces the value from the previous cycle.
Cycle 0 Cycle 1
][----------------------+ x1#1 +------+ +---][----------------------+ x1#1 +------+ +---][------...
][op(x2#1) = x1#1, x1#2 |------+----->| send | /-->| lo][op(x2#1) = x1#1, x1#2 |------+----->| send | /-->| lo][op(x2#...
][----------------------+ | +------+ | +---][----------------------+ | +------+ | +---][------...
][ | | ][ | | ][ ...
][ x1#2| x2#1| ][ x1#2| x2#1| ][ ...
][ | | ][ | | ][ ...
][ v | ][ v | ][ ...
][ +----------+ 1#1 +-------------------+ | ][ +----------+ 1#1 +-------------------+ | ][ ...
][ | const(1) |----->| x1#2 + 1#0 = x2#1 |--/ ][ | const(1) |----->| x1#2 + 1#0 = x2#1 |--/ ][ ...
][ +----------+ +-------------------+ ][ +----------+ +-------------------+ ][ ...
We show `loopOut` and `loopIn` as parts of the original `Loop` function. It is implemented:
- on the intermediate level, by metadata;
- on the target processor, by the PU's internal implementation (e.g., in FRAM both parts are bound to the same address).
Logical simulation can be executed by the following command: `stack build && stack exec nitta -- -l examples/counter.lua`.
Cycle | x1 | x2 |
---|---|---|
0 | 0 | 1 |
1 | 1 | 2 |
2 | 2 | 3 |
3 | 3 | 4 |
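The simulation table above can be reproduced with a tiny functional model of the counter (a sketch, independent of NITTA's simulator):

```python
def simulate_counter(cycles, init=0):
    """One computational cycle: send x1, compute x2 = x1 + 1; the Loop
    function carries x2 into the next cycle as the new x1."""
    rows, x1 = [], init
    for _ in range(cycles):
        x2 = x1 + 1             # x1#2 + 1#1 = x2#1
        rows.append((x1, x2))   # (sent value, value carried forward)
        x1 = x2                 # Loop: consume x2 at cycle end, produce next x1
    return rows

print(simulate_counter(4))  # [(0, 1), (1, 2), (2, 3), (3, 4)]
```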
In this section, we will try to connect the intermediate algorithm representation and the target processor architecture. Before we start, we need to separate compile-time (CT) and run-time (RT) elements of a NITTA processor. Everything about the intermediate level happens at compile time; the target processor knows only about hardware and software in the form of control signals (but we show their mnemonics).
In the figure below, you can see the computational process from the transport point of view, which happens at compile time. You can see computational cycles and the state of the value bus on each processor tick. By `*` we represent a PU providing (one or more at once) or consuming (only one at once) a value.
Cycle 0 Cycle 1
][--0----1----2----3----4----5----6----7--][--0----1----2----3----4----5----6----7--][...
spi out
fram out ** * ** *
accum out || | * || | *
|| | | || | |
value bus: || | | || | |
|| | | || | |
accum in: v| v v x1#1 v| 1#1 v v
fram in: x1#1 *v 1#1 * x2#1 * *v * x2#1 *
spi in: x1#2 * x1#2 *
So, software generation for a NITTA processor looks simple. We only need to:
- modify the intermediate algorithm representation;
- bind functions of the intermediate representation to available processor units;
- schedule all needed transfers within PU and algorithm restrictions.
After that, we will receive something like the following table. In the next section, we will try to explain how it happens.
PC | Transport (CT) | Fram (CT) | Fram (RT) | Accum (CT) | Accum (RT) | SPI (CT) | SPI (RT) |
---|---|---|---|---|---|---|---|
0 | | | PrepareRead 1 | | Nop | | Nop |
1 | x1 : Fram -> { SPI, Accum } | Source x1#1, x1#2 | Nop | Target x1#2 | ResetAndLoad False | Target x1#1 | Sending |
2 | | | PrepareRead 0 | | Nop | | Nop |
3 | 1#1 : Fram -> Accum | Source 1#1 | Nop | Target 1#1 | Load False | | ″ |
4 | | | ″ | | Nop | | ″ |
5 | | | ″ | | Out | | ″ |
6 | x2#1 : Accum -> Fram | Target x2#1 | Write 1 | Source x2#1 | Nop | | ″ |
7 | | | Nop | | ″ | | ″ |
What do we need for target processor synthesis, including software and hardware?
As input data, we need:
- an algorithm in the intermediate representation with all known functions;
- the microarchitecture of the target processor, which includes all necessary processor units and bus configurations (this item is temporary input data; in the long term, it should be part of the synthesis results).
As part of our synthesis tool, we need:
- Some kind of target system model, sufficient for understanding:
  - which functions can be executed;
  - which control signal sequence is required to do that (software);
  - how to use and, if necessary, generate the hardware implementation.
- A synthesis method, directly implemented as part of the synthesis tool. It should allow scheduling or synthesizing the target computational process description.
- A target system generator, which produces target project files from the target system model.
- A UI or debugger, which allows our users to inspect the synthesis process or perform it in a semiautomated manner.
- A verification subsystem, which allows our users to run a functional simulation of the algorithm and compare the expected behavior with a logical simulation at the RTL level.
One of the features of NITTA is the complex behavior of processor units. It means that we are not able to use a simple solution like a system of equations. We need something more complicated, so we build an imitation model of the target system, which can report to the synthesis method what it can do and apply the selected decisions step by step.
/-------------------------------------------\
V |
I'm the What can you do? |
Method -------------------------------> +-----+ |
O Possible options |\\\\\| I'm the |
/\ /\ <------------------------------- |\\\\\| target |
V Execute my decision! |\\\\\| model |
-------------------------------> +-----+ |
| |
| repeat |
\-------------------------------------------/
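The interaction loop in the figure can be sketched as follows (a conceptual model with invented names, not the real NITTA API):

```python
def synthesize(model, choose):
    """Drive a target model to a finished state: repeatedly ask it for
    options and apply one chosen decision until nothing is left."""
    while True:
        options = model.options()                 # "What can you do?"
        if not options:
            return model                          # finished (or stuck)
        model = model.decide(choose(options))     # "Execute my decision!"

class ToyModel:
    """Counts down remaining decisions; stands in for the target model."""
    def __init__(self, steps):
        self.steps = steps
    def options(self):
        return list(range(self.steps))
    def decide(self, _option):
        return ToyModel(self.steps - 1)           # new state after a decision

final = synthesize(ToyModel(3), choose=lambda opts: opts[0])
print(final.steps)  # 0
```

The important property is that each decision returns a new model state, so the synthesis method can also step back and explore alternatives.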
The target system model includes (see `NITTA.Model.TargetSystem`):
- the dataflow graph, or intermediate algorithm representation (we need it because the algorithm can be changed during the synthesis process);
- the computational unit model (processor microarchitecture, including processor units and the network).
We have already said all that we need about dataflow graphs, so we will focus on the computational unit model. We will describe it bottom-up: the processor unit model, then the network unit model.
The main aim of a processor unit model is to teach NITTA how to use the processor unit. For that, we make an imitation model of each PU. Interaction with that model provides all that we need to know. The interaction is implemented in dialog form. Let's see it in simplified form using the `Accum` example:
"Hi, fresh Accum model. Can you execute loop
function?" the synthesis method asks.
"No," the Accum model says.
"Can you execute const
function?"
"No."
"Can you execute x1#2 + 1#0 = x2#1
?"
"I afraid you don't ask it. Yes. Here is my new state, where I bounded the function and prepared to its execution."
"Good, I agree," the method says because it doesn't have any alternative. "What can you do from a transport point?"
"Only two options: 1) I can be a target for x1#2
; 2) I can be a target for 1#0
. What would you like to select?"
"Okey, let be a target for x1#2
." Of course, the synthesis method can make a different decision too.
"Ok. Here is my new state where I scheduled my computational process for receiving x1#2
from the value bus."
"Good, I agree. What can you do from a transport point of view?"
"Only one option: I can be a target for 1#0
. What would you like?"
"Okey, let be a target for 1#0
."
"Ok. Here is my new state where I scheduled my computational process for receiving 1#0
from the value bus and sum it with the previous one after a short delay."
"Good. What can you do from a transport point of view?"
"Only one option: I can be a source for x2#1
. What would you like?"
"Great. Let be a source for x2#1
."
"Ok. Here is my new state where I scheduled my computational process for sending x2#1
to the value bus."
"Excellent. What can you do?"
"Nothing," the model says and squeezes its shoulder.
"I think we are done. Can I see your computational process description?"
"Of course, you can see it in my column in the table above. Also, you can see the full description of my computational process, where I note all that I do and mark vertical relations:"
$ stack build --fast nitta:nitta && stack exec nitta -- -t=fx32.32 examples/counter.lua -p=8080
[NOTICE : NITTA.UI] Running NITTA server at http://localhost:8080 ...
# Process steps:
$ http get localhost:8080/node/-3-4-3-2-0-1-0-0/microarchitecture/accum/process | jq '.steps[] | "\(.pID), \(.pDesc)"'
"6, Intermediate: +x1#0 +1@const#0 = x2#0;"
"5, Instruction: Out"
"4, Endpoint: Source x2#0"
"3, Instruction: Load False"
"2, Endpoint: Target 1@const#0"
"1, Instruction: ResetAndLoad False"
"0, Endpoint: Target x1#0"
# Vertical relationships
$ http get localhost:8080/node/-3-4-3-2-0-1-0-0/microarchitecture/accum/process | jq '.relations[] | "up: \(.[0]), down: \(.[1])"'
"up: 6, down: 4"
"up: 6, down: 2"
"up: 6, down: 0"
"up: 4, down: 5"
"up: 2, down: 3"
"up: 0, down: 1"
"And a little explanation about vertical relationships," the model says, "When I was scheduling my computational process, I note all that I know about my activities. For example, when you order me to be an x2#0
source (4), I schedule instruction Out
(5) for it. The endpoint role and the instruction are placed on a different computational process level, so I mark their relationship (6 and 4, 2, 0). Also, I mark the relationship between the intermediate level (function) and all my endpoint roles. Am I not a good and pedantic model?"
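The process description the model reports can be mirrored with plain data; the step ids and relation pairs below are taken from the listing above, while the representation itself is only an illustration:

```python
# Steps and vertical relations from the accum process listing above.
steps = {
    6: "Intermediate: +x1#0 +1@const#0 = x2#0;",
    5: "Instruction: Out",
    4: "Endpoint: Source x2#0",
    3: "Instruction: Load False",
    2: "Endpoint: Target 1@const#0",
    1: "Instruction: ResetAndLoad False",
    0: "Endpoint: Target x1#0",
}
relations = [(6, 4), (6, 2), (6, 0), (4, 5), (2, 3), (0, 1)]  # (up, down)

# The intermediate step (6) is vertically related to every endpoint role,
# and each endpoint role to the instruction that implements it.
endpoints = {down for up, down in relations if up == 6}
print(sorted(endpoints))  # [0, 2, 4]
```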
Let's stop the role play and return to the normal style. This is a correct description of the synthesis method's internal process at the conceptual level. Of course, a real interaction with a model is much more complicated:
- more questions;
- more types of questions (we call them problems, see `NITTA.Model.Problems`);
- a lot of timing information (time constraints for options and time intervals for decisions);
- a question sequence that is not straightforward (we can bind a function, schedule it, and then bind another one);
- restrictions on transport sequences;
- etc.
All of this can be tested by yourself inside the REPL. For more detail, look at `NITTA.Model.ProcessorUnits.Multiplier` and try it.
A network unit model is very similar to a processor unit model. It can answer the following questions:
- Which processor unit can evaluate a function? This is a simple merge of all positive answers from processor units. E.g., we have two Fram PUs (`fram1` and `fram2`) in the network and want to bind the `Loop` function. Both PUs answer that they can evaluate the function. On the network level, that means we have two options for selection: `Bind Loop fram1` and `Bind Loop fram2`. For more details, see `NITTA.Model.Problems.Bind`.
- Which value can be transferred by the network, or, more correctly, which variables with the same value can be transferred? This works in a slightly more complicated way. The model collects all information about data transfers from processor units and finds the intersection between possible sources and targets. If something matches, we have an option for a data transfer over the network. E.g., `fram` can be a source of `x1#1` and `x1#2`, `accum` can be a target of `x1#2`, and `spi` can be a target of `x1#1`, so we have the following transfer options (for more details, see `NITTA.Model.Problems.Dataflow`):
  - `Fram -> { SPI }`;
  - `Fram -> { Accum }`;
  - `Fram -> { SPI, Accum }`.
- Which refactorings are possible? Refactoring allows us to change the algorithm for different purposes. We will return to this question later.
You can see the dialog between the synthesis method and the network unit model in the UI. For that, run NITTA with the UI, select the Subforest menu, see what options are provided, and make a decision by yourself.
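The source/target intersection can be sketched like this (a simplified model of the idea behind `NITTA.Model.Problems.Dataflow`; the function and its shape are invented for illustration):

```python
from itertools import combinations

def transfer_options(source_vars, target_vars):
    """Intersect what the source can send with what each PU can receive;
    every non-empty subset of matching targets is a transfer option."""
    matches = [(pu, v) for pu, vs in target_vars.items()
               for v in vs if v in source_vars]
    options = []
    for n in range(1, len(matches) + 1):
        for combo in combinations(matches, n):
            options.append(dict(combo))               # {pu: variable}
    return options

# fram offers x1#1 and x1#2; accum wants x1#2, spi wants x1#1:
print(transfer_options({"x1#1", "x1#2"},
                       {"accum": {"x1#2"}, "spi": {"x1#1"}}))
```

For the counter example this yields exactly the three options listed above: `Fram -> { Accum }`, `Fram -> { SPI }`, and `Fram -> { SPI, Accum }`.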
Not implemented.
Previously, we spoke about unit model decisions, but we also have refactorings, which affect both the dataflow graph and the unit models. How a refactoring works depends on its type and includes:
- which part of the target system model recognizes the refactoring necessity;
- how the unit models change (if necessary) after the synthesis method makes the decision.
Let's see what happens when we break the `Loop` function into `LoopIn` and `LoopOut`. It happens in a few steps:
- The `Loop` function is bound to `fram` and to a specific memory cell inside it. The function cannot be evaluated yet (be a source or a target for related values).
- The `fram` model recognizes the possibility of the `BreakLoop` refactoring and informs the synthesis method about it.
- The synthesis method decides to break that function; after that, the TargetSystem model receives the order and resends it to:
  - the dataflow graph, which replaces the `Loop` function with the `LoopIn` and `LoopOut` functions;
  - the network model, which resends it to the `fram` model, which already knows about the `Loop` function. The `fram` model marks in the specific memory cell that `Loop` was broken into two parts. After that, the model can provide options about sending or receiving the related values.
- The TargetSystem model now contains a new state, where the `Loop` function is broken.
You can try `BreakLoop` and other refactorings by yourself in our user interface. For other refactorings, see `NITTA.Model.Problems.Refactor`.
As we said earlier, the synthesis method works with the target system model in dialog form. If we continue this metaphor, you can imagine the synthesis process as a dialog tree in game development (but it is a real tree, not a graph). The dialog starts with the target system model, where all functions from the dataflow graph are bound to the computational unit (the network unit model).
After that, the dialog begins. In it, we can step back at any time and select another reply option. Each node of the tree represents a target system model state, and the edges between them represent our replies (orders, or synthesis decisions). Selected replies define our future options (some dialog threads end successfully, some fail).
Let's see the dialog thread for the `counter.lua` example:
# Here we request all nodes from the root to the specific node; -3-4-3-2-0-1-0-0 is the synthesis node id of the leaf.
$ http get localhost:8080/node/-3-4-3-2-0-1-0-0/rootPath | jq -r '.[].decision | "\(.)"'
{"function":{"fvHistory":[],"fvFun":"const(1.000000) = 1@const#0"},"tag":"BindDecisionView","pu":"fram1"}
{"function":{"fvHistory":[],"fvFun":"Loop (X 0.000000) (O [x1#0,x1#1]) (I x2#0)"},"tag":"BindDecisionView","pu":"fram1"}
{"tag":"BreakLoopView","value":"0.000000","input":"x2#0","outputs":["x1#0","x1#1"]}
{"function":{"fvHistory":[],"fvFun":"x1#0 + 1@const#0 = x2#0"},"tag":"BindDecisionView","pu":"accum"}
{"function":{"fvHistory":[],"fvFun":"send(x1#1)"},"tag":"BindDecisionView","pu":"spi"}
{"tag":"DataflowDecisionView","targets":{"x1#0":["accum",[1,1]],"x1#1":["spi",[1,1]]},"source":"fram1"}
{"tag":"DataflowDecisionView","targets":{"1@const#0":["accum",[3,3]]},"source":"fram1"}
{"tag":"DataflowDecisionView","targets":{"x2#0":["fram1",[6,6]]},"source":"accum"}
As you can see, on lines 1-3 we bind and prepare the `Loop` and `Const` functions. On lines 4-5, we bind the `send` and `+` functions. On the rest of the lines, we schedule the computational process as a sequence of data transfers over the network. For any node, we can request all available replies:
$ http get localhost:8080/node/-0-0-2-0-0-1-0-0/subForest | jq ''
[]
As expected, we don't have any options at the leaf node because the synthesis process is finished. But earlier we had:
$ http get localhost:8080/node/-0-0-2-0/subForest | jq -r '.[] | "\(.sid) \(.decision)"'
-0-0-2-0-0 {"function":{"fvHistory":[],"fvFun":"send(x1#1)"},"tag":"BindDecisionView","pu":"spi"}
-0-0-2-0-1 {"tag":"DataflowDecisionView","targets":{"1@const#0":["accum",[1,1]]},"source":"fram1"}
-0-0-2-0-2 {"tag":"DataflowDecisionView","targets":{"x1#0":["accum",[1,1]],"x1#1":null},"source":"fram1"}
In this case, the synthesis method can make decisions of different types: binding or dataflow. It is common to mix them during the synthesis process because we don't want to introduce unnecessary restrictions.
In a more user-friendly way, you can see and try this process inside the UI.