-
Notifications
You must be signed in to change notification settings - Fork 108
Flow Based Programming Study
The content of this page is slightly outdated since it's a study done by Ivan Briano ivan.briano@intel.com on May 2015.
Based on the difficulties we encountered when trying to write simple examples in Soletta™ Framework and deciding what the behavior of the basic components should be, the need arose to see what other people working with this methodology are doing, both when writing applications and the different framework implementations.
This document is based on some parts of the book “Flow Based Programming, 2nd Edition” by J. Paul Morrison, discussions from the FBP mailing list, and from the sources of other projects, like NoFlo, NodeRed and Micro-Flo.
Some of the key points encountered are:
- Morrison makes a distinction between Classical FBP (what he proposes in his book) and other implementations that don’t always follow all the principles he describes.
- Regardless of how it’s implemented, at its core FBP is a way to model application development so that they can be equally understood by developers and users, and maintained in the long term with minimum effort.
- Applications are made to solve specific problems, and those problems have specific data requirements. As such, the Information Packets (IPs) that travel an FBP network are designed to represent those requirements as much as possible, instead of being coerced to work with a fixed set of data types provided by the framework of choice.
The driving principle behind FBP is that traditional procedural programming is hard and not a good fit for business applications. We are taught that because computers work by processing one instruction at a time, - it hasn’t been true for the last 65 millions years, but you get the point - our programs should be written in the same way.
Morrison proposes that developers should spend less time thinking about the order in which things will be executed (control flow), and more about the data and the transformations that will be applied to it (data flow). This way of focusing on application development leads to a more natural way of designing them, where the flow diagrams that might be used to describe how the application should work to users, managers and developers, translate directly into how it will be coded.
- Processes: The boxes. Each box in the network is an instance of a component, a “black box” that essentially receives some data, does something with it and sends some more data. These may be composed of inner processes that perform more basic functions, until the point is reached where basic components need to be implemented in some form of native, procedural language.
- Connections: Processes are connected one to another through ports defined by their components. Each process will receive the data it needs to work through its INPUT ports and send the result through the OUTPUT ports. It doesn’t need to know anything about what other processes are connected to those ports. All each box in the network knows is itself.
- Information Packets: The data that travels through the network, usually in the form of structured packets or streams of packets.
- Scheduler: Usually part of the framework, decides which processes of the network should run based on the availability of data on their ports.
One very important concept is that of re-usability of components. As more and more complex applications are written using this model, common patterns emerge that can be extracted into a library of pre-made components ready to be used in new applications. As that set of components grows, the need implement those components in code decreases significantly, to the point where a large program will end up being formed of largely pre-made components and only a few, very specific one being implemented in lower level languages, if at all.
Besides the basics describe above, Classical FBP as proposed in the book and the implementations that came out of it has other well defined properties.
The packets (Information Packets, IPs) traveling through the network are treated like physical objects and they have a well defined lifetime. They can be owned by only one process at a time, which will either pass it along to the next process in the network or drop it (free it). An IP can then be either owned by a process or in its way through a connection.
A consequence of this treatment of IPs is that when a process sends a packet through one of its output ports, it’s relinquishing its hold over the information contained in it and can no longer access it. Also, since packets can be owned by a single process at a time, output ports allow only one connection. If it’s necessary to branch the network at any point, a special component will be necessary to duplicate the packet and send it to the two (or more) processes that start each branch, but then those packets will not be the same, they merely contain the same information.
In contrast to the single connections for output ports, an input port may accept any number of connections, and its owning component will process the packets that arrive in it on a first-come, first-served basis. Who sent the packet is of no relevance to the process that will handle them, and so there’s no need to identify the source of the packet or which connection it is.Who sent the packet is of no relevance to the process that will handle them, and so there’s no need to identify the source of the packet or which connection it is.
Which brings us to the concept of stream of packets. In cFBP, a process reads every packet from one input port until it gets an end-of-stream, then moves on to reading everything from the next input port until all ports are over, and then works its magic on the information it got. Depending on the function of that component, it will send its result packets as it processes or send one (or more) once it finishes its function.
The connections between processes are bounded buffers, that means they have a limit on the number of packets that they can hold. As said above, a process will read from its input ports until it receives an end-of-stream, but if there are no packets to be read, and no EOS is received, the process blocks until more packets arrive. Likewise, when sending through its output ports, if the connection on that port is full, the process doing the sending will be blocked too, until the other end of the connection gets back to reading packets from it and makes room for the first process to continue sending. This is called back-pressure and it’s fundamental to prevent that a flood of packets in the network consume all the resources available on the system.
Blocking processes are not a problem because of the asynchronous nature of the model. Since each process in the network is decoupled from all the others, it’s possible, and even desired, that multiple processes run at the same time. This, summed with the fact that different processes will send their packets whenever they feel like it, means that it’s impossible to know when things will happen and that any network that expects the timely arrival of packets from different branches into one process is flawed by design. This is usually not a problem under cFBP, given how the connections are buffered and processes block when reading from their ports until a packet arrives, so packets arriving into one port from one branch of the network will wait there until they are read, even if the process that will read them is still waiting on other packets from a longer/slower branch of the network to arrive. The order the packets arrive in, however, is another matter. cFBP makes no guarantee that the packets originating from one point in the network will arrive in the same order at any other point, the only guarantees made are:
Given A -> B
- Any packet coming out from A will arrive at B only after it has left A.
- The order of the packets going from A to B will be preserved
But the second point only applies to direct connections, given a network
A -> B -> C
C cannot expect that it will receive its packets in the same order in which they left A, since B is free to reorder them while processing, or even drop some of them.
Component re-usability is achieved by designing said components in a way that they perform one function and nothing else. But even that function may need some parameters to know, for example, which file to read from or write to, the maximum length the output string can be or the address where to find a hardware device. To this end, when a process is first executed, it will read the parameters it needs from an OPTIONS port, and then proceed to read from its inputs to do its work. Those parameters can be fed into the process by which Morrison calls Initial Information Packet (IIP), or it can come from another process in the network.
Options coming from another process can be very useful when we consider that a process in a flow is only alive for as long as it has something to do. That is, no process will be running until it receives some packet on its input ports, and once it’s done doing its job and the last packet is sent, the process finishes. With all that, we get that every time new information arrives to a process, it will be executed again and it will, once more, read its options. If a process is long lived and doesn’t support being reconfigured during runtime, it can just ignore anything new that arrives through that port. With those options coming from another process, the same flow or part of it, can run with different configurations during the same run of the entire application.
And to end, a common convention with cFBP is to use array ports to group multiple connections of the same thing into one named port with a meaningful name. That is, a component that takes to numbers and outputs the sum of them will not have two input port IN0 and IN1, preferring to use one array port IN that can take connections into IN[0] and IN[1].
The term Classic FBP is something Morrison came up to compare what he describes in his book and his implementations against the outpouring of new projects like NoFlo, NodeRed and others of the sort. His comparison can be found at his site, and a summary follows.
The most common difference and the one with the most repercussions to the rest of the model is the “reactiveness” of ports. That is, instead of a process being activated when data is available at any one of its input ports, and it deciding where to read, which could potentially be a different port and so block the process again, the framework calls the process and it will perform the action that port requires instantly, possibly sending some result through its output ports even though there may be other packets waiting to be processed and will likely trigger a new process of the same component as soon as execution returns to the framework.
Depending on how the above is implemented, the asynchronous nature of FBP may be broken too, as sending a packet from one process may, instead of going through a buffered connection, trigger a callback in the next process, which in turn will trigger the following until the last component in that chain is reached. Systems working this way retain only the image of FBP, looking like boxes connected one to another and passing information through those connections, while underneath they work in the same way as a traditional, procedural program would.
Packet lifetime also receives a different treatment, usually just being a container to pass values as if they were simple function arguments. This is especially true for frameworks that allow multiple connections on the output port of a component, since the same packet will be replicated to all those connected processes.
The packet distribution also doesn’t necessarily happen through a bounded buffer, as cFBP expects, with the implementations varying between the direct callback chain described above, where no buffering of any sort will happen, or packets being queued to be delivered but with no upper bound. The latter maintains the buffering, but it risks consuming all available resources if one process is continually producing packets at a faster rate than they can be consumed. This lack of back-pressure also means that the sending of packets is non-blocking, and in the same line, there’s no blocking when reading packets because there is no reading packets usually, the process will be called with the packet as a parameter, for each packet it is meant to receive.
The motivation of this research was to find out how other people are working around the problems we encountered when implementing our own FBP framework. As it were, it was easier to find people discussing ideas than actual programs written using any of these models.
Nevertheless, those discussions provided the ideas presented here, one of which is less about implementation of things and more about how FBP is a way to organize application development around the concepts of reusability, decoupling of components and asynchronous and distributed processes.
Not surprisingly then, the people most active in these discussions all have their own implementations following these ideas, and deviating from them when they consider it necessary or fit for the problem space they are trying to solve. Some adhere more closely to cFBP, others prefer the reactive approach to triggering ports, seeing the processes that compose the flow as a components that are always alive, just waiting for something to happen and react to that event, then go back to waiting for the next one.
In all cases, however, the idea that each component is independent of all others is paramount, and once a framework is in place, development of these individual components takes place based on the problem that’s being solved at the time. How granular those components are will depend more on the necessity of that granularity and not on a desire to offer a large library of pre-made components from the get-go.
Return of Investment, as Morrison calls it, is the factor to take into account when a new component is needed. Basically, given the problem that needs to be solved and the need to have specific business logic that’s not already available off the shelf, how likely it is that this new component will be reused, either in the same application or in others? Sometimes it may be easier to write one big block that covers a larger area, other times that block may be split into smaller ones, some probably one-shot, but others may still be useful to other people or even other parts of the same program.
Our own implementation of an FBP framework is closer to the general FBP camp than to cFBP, and even then it has some differences of its own with all the others.
For clarity’s sake, terminology:
- What is usually called a ‘process’, for us is a node.
- Components then are node-types
- IPs are packets
Like the more popular projects, our nodes are reactive. We have one function per port, that will be called when a packet is sent to it and for the most part, our node-types process that packet and send a result immediately. Some exceptions exists for some specific types and for when not every port has received a first packet yet.
Our approach to configure nodes is also quite different. Options are passed to the node when it’s opened, either declared as part of the node declaration or read from configuration file by the resolver, but never from flow itself. Some node-types have parameters that make sense to be changed during the application's lifetime, and for these there are specific ports that can be used to change them.
Packets are created at the time the node sends some value to an output port. The next node will receive that packet and get the internal value, but there’s no concept of ownership transfer. That ownership could be transferred by means of preserving the contents of the packet, depending on the packet type, but since our output ports support multiple connections and a single reference of the packet is sent to all nodes connected to it, we still don’t abide by the principle that a packet is owned by a single process at a time. Instead, once a packet is sent, it’s the scheduler that takes ownership, and it will free it after all it’s recipients have been handled.
Sending of packets is implemented in a way that returns to the mainloop after each send, so we avoid the callback chain mentioned above, but there’s still no back-pressure. Sending of packets will never block and depending on a flag on the output port, packets will be queued indefinitely or just one packet from that node/port will be kept in the queue. Likewise, there’s no reading a buffer by the node to receive packets, the framework will call the specified function each time, for each packet in the queue. We have no concept of streams of packets.
Another difference with most implementations is that our ports are strictly typed, for the most part. A few accept packets of any type and will handle them differently, but if a packet arrives that the node can’t handle, it will likely be silently dropped. Custom packet types can be written by node-type implementations, but these will very likely not be understood by other nodes, except those also written to cooperate with this new node type. The reasoning behind the strict packet types is to be portable and scale from large systems to very small ones, where generic packet types that can accommodate any kind of information might not fit within the constraints of the system.
Probably the biggest problem we ran into when trying to write examples of FBP applications with Soletta is not adapting our thinking to how these applications should be modeled, and instead try to translate C code directly into FBP using the basic node-types we already have. This usually results in very complicated logic that tries to control when things need to happen, abusing node-types to perform operations other than the one they were created to do, which reduces the clarity of program when reading the code, and in some cases the number of nodes and connections necessary to perform a simple operation end up being more than the lines of C code that it would take to do it all by hand.
Even with the delayed sending of packets and with nodes that use the REPLACE_PACKET flag to overwrite the last value sent, depending on how the flow is written, the fact that we receive a packet on a port and immediately send a result ends up with cases where the application needs to be designed around the fact that there might be bogus or partial results arriving to a node. For example, if we expect a node to receive the sum of two values, one of which will travel through a larger branch in the network, with the way things are implemented (and having to consider implementation details is already a problem of its own), one value will reach the Sum node, the other may still be going through some intermediate node in its branch, and so the Sum node will process the value it got and emit a result. Since we will return to the mainloop after that, and before the other value arrives, that partial result will be sent before the Sum node gets the chance to get the next value and perform another send that would have overwritten that packet with the final value we were expecting.
While having strict packet types and nodes that work with those specific types can be handy when we work on small environments, for some kinds of applications it ends up being more a nuisance. Even after writing custom node-types to perform the specific operations needed for the application, if a more complex data type is desired to represent the information our node needs to work with, none of the pre-made node-types available will be useful to work with packets of that type. Instead, the developers is now faced with the option of having to write several utility node-types that will understand this custom packet-type, or have converter types that can take the custom packet, extract whatever basic value needs to operate on and pass that one the nodes available off the shelf, then get the result back and apply it to the original packet to then go on processing it.