JOSHUA HOLBROOK github : twitter

Thoughts on Graphical Dataflow Language Design (posted 12 Aug 2010)


I’ve been teaching classes on programming with MATLAB recently, to cover for my advisor, who has been out of town at a really cool-sounding conference. By classes, I mean two of them.

The first was on programming GUIs in MATLAB. This was all new to me (I usually don’t need or want GUIs at the level of programming that I do), but easy enough. Using callbacks with Node helped with understandment. I also learned enough about objects in MATLAB to know that they’re terrible. I had notes from my prof., so I was just able to follow those directly and everything was fine.

The second class was a little more interesting, in that I got notes on two different subjects—one on numerical methods in MATLAB, and the other on using simulink—with the intention that I would mash them together. This meant that I really got to design everything for this particular class period myself, and as a consequence I got to experiment with Simulink, which got me thinkin’.

Let’s start not with the idea of graphical languages here, but instead the idea of domain specific languages. Why? Because so many dataflow languages that I’m aware of are, in fact, domain-specific (or at least started that way):

Language Domain
Simulink Simulation of dynamic systems
LabVIEW Instrumentation/automation
PureData Algorithmic audio and video
Ladder Logic Relay-based logic
Yahoo Pipes Data mash-ups

(As an aside: The reason MATLAB kinda sucks, in my opinion, is this: It was originally designed as a domain-specific language for applying numerical methods to O-0, O-1 and O-2 tensors, and plotting the results. When MATLAB is being used for these particular applications, it comes across as quite slick, excepting perhaps for the massive size of MATLAB’s disorganized toolbox. However, over time, MATLAB has tried to become more general-purpose, and has felt significant growing pains due to everything more general-purpose being “bolted on” and being poorly designed, at least in part due to the adherence to the original design decisions regarding matrices, which are great for scalars, vectors and matrices, but terrible for just about anything else. But that’s beside the point.)

As far as graphical languages go, the only one I have much experience in is LabVIEW. In fact, I became a Certified LabVIEW Associate Developer by passing a test on the subject. So, while this doesn’t really mean much (I’m really rusty at LabVIEW), perhaps it’s motivating enough to suggest that I’ve experienced most of the important stuff regarding LabVIEW.

I’ve also tried my hand at PureDATA, mostly because I wanted to try out other graphical languages such as LabVIEW (I believe the language for LabVIEW is technically called “G,” but nobody actually calls it that IMO). I couldn’t get into it, but then again I wasn’t all that interested in making music.

Some principles are common through most (all?) variations of these sorts of languages: The idea of using wires to connect functions together. For example, let’s use this image of a Directed Acyclic Graph from Wikipedia for the purpose of discussion:

In this case, each of the nodes is a function, with each edge connecting outputs of functions to inputs of other functions. Some of the so-called functions aren’t truly functions, but are instead I/O–in Simulink terms, “sources” and “sinks.” In the case of this DAG, The top three are sources, and the bottom three are sinks.

Part of the appeal of this view is that you can avoid the explicit assignment of mutable variables–that is, you never have to say, “x=2;” or some such aside from the sources and sinks, because all that stuff (outside the nodes) is passed around through the wires and not in variable read/writes.

The other major aspect that’s attractive to many is the “black box” aspect of the nodes. That is, you don’t really need to know, or honestly care, how the nodes process their input to produce the output. This is why these graphical languages are considered attractive to many—in the ideal simple case, you can just take some inputs, apply these canned methods to them, and grab the output (Of course, this approach doesn’t always cut it). I think this is also why these graphical languages are usually domain-specific; It is within specific domains where one may predict which functions users would most be interested in, and supply them for easy hookings-up. In the general purpose, your blocks have to be very simple, as compared to these DSLs where the blocks can implement an entire numerical method and be useful in its domain.

There is one other part missing from this view, which is that the nodes in the DAG only really have one input and one output, whereas nodes in graphical languages usually have multiple inputs and multiple outputs, aside from the idea of vector-valued inputs or outputs. I see two ways I see to deal with this, in a pure sense:

  • Interpret separate outputs as containing every output in a vector-form, with the inputs implicitly filtering out only the part that they’re interested in

  • Interpret each node as not a node in-and-of-itself, but instead as a collection of black-boxed nodes.

My feeling is that nether one of these alone is completely satisfying, but that the first may save the most headache, pragmatically.

So, back to Simulink, which is what spurred this whole thought process: Simulink is surprisingly cool. Here’s why: Simulink allows for feedback loops—that is, it allows for cyclic graphs, which are usually expressly forbid in this family of languages.

How does Simulink get away with this? It’s because of Simulink’s domain. Simulink is intended only to work with the sort of problems which you would be able to describe in a control block diagram or a signal flow graph—that is, Simulink isn’t really meant for describing a program so much as it’s meant to describe a dynamic system. Sources are always functions of time, and Simulink deals with feedback loops by compiling the entire model into a state-space representation and applying a Runge-Kutta-esque method to the resulting system of equations. Having discovered the power of this approach, I think it would be awesome to have more implementations of this idea. Simulink is really really cool within its domain.

Having seen a graphical language done “right” in many ways (Simulink has some failings–incorporating data from other sources is a huge pain in the butt from what I saw), I started thinking about implementing my own graphical language (whether I’ll get past the thought experiment state, who knows).

So, I actually see a few domains where a graphical data language would be useful:

  1. Time-domain simulation of dynamic systems. This is because many people think of dynamic systems in terms of block diagrams, and the ability to just slap down a block diagram and mash “go” is pretty powerful. In addition, there aren’t very many implementations of this outside Simulink, and nothing open-source and Simulink-esque that I’m aware of. I think something like Simulink for python that utilizes numpy and scipy throughout would be really rad.

  2. A replacement for spreadsheets. See, spreadsheets actually work very similarly to graphical languages such as those already discussed. Each cell is like a node, and the cell addresses given to the formulas are equivalent to wires connecting them. The fact that each cell shows its output just means that each cell has an automatic sink.
    Being able to replace formulas in spreadsheets with a graphical equivalent, and using a spreadsheet proper for the view (and perhaps controller) instead of the model, could really clean things up, and also give one a prettier chart at the end of the process.

  3. Something more network-esque. See, when I started thinking about how to implement such a thing, my first thought was to use events in Node to carry output information to inputs. I realized that you could easily have cyclic graphs in this form, since a block could simply take its most recent inputs on triggering instead of depending on its own output. This could allow for something really neat, but it would also be somewhat mutually exclusive with the time-domain simulation approach. At the least, it could be a cool way to work with making and visualizing RPCs, RMIs (think dNode), Unix-style pipes, and things like that. Maybe it could do something like starflow, once starflow’s ready enough to have docs.

I think what I would like to see is a combination of all these things. I don’t know if that’s a good idea or not, or if it’s even possible. But man, that could be sweet.

Some odds and ends:

  • I’ve noticed a lot of dataflow languages are strongly-typed. LabVIEW, for example, colors its lines based on datatype, and won’t run if connected inputs and outputs emit and expect different datatypes. On the other hand, the languages I’m most comfortable in are dynamically-typed (python and javascript). So, what approach would be best–dynamic or static typing? if it was implemented in js or python, how would you do that? Maybe Traits (if using Python) or something similar? Maybe typecasting on whatever wraps your function? Who knows. Perhaps type inferrence would be the way to go! Type inferrence is cool.

  • Doing something simulink-esque requires that all nodes are seen as functions of simulated time, whereas the evented approach depends on nodes being seen as, in a sense, functions of real time. I think the only way to reconcille the different approaches that come with these two approaches would be to have known non-polymorphic types. That is, if a closed path has an event-based edge, then it should act that way–Otherwise, if it has time-series legs in the right places, it should try to solve the loop if possible. To be honest though, I think the most responsible approach would be to implement both ideas separately. In my opinion, the evented approach and the spreadsheet idea(s) should be combined, with the Simulink-esque idea being a much more constrained tool.

  • I think it’s important that extending the language with your own nodes is made easy. For example, pypes (heavily influence by pipes) is written for stackless python, and writing your own nodes is meant to be pretty easy for a somewhat competent python programmer. But, I think this could be taken farther—Why not include nodes for all sorts of languages? Or, perhaps better, just include a child process node that can pipe IO to stdin and stdout, and extend it to be able to deal with any interesting language. For something like MATLAB, use something like Expect.

  • I really like the idea of RPC nodes. Like, a XML-RPC box, a SOAP box (heh), a JSON-RPC box, a dNode box. In a sense, I think this is the kind of domain that Pipes was meant for anyway, but why stop at there? Run a FFT on the results and throw the results on a table or something!

Anyways.