a moveable beast

Idle ideation. Coarse cognition.

GDD: Graph-Driven Design

Here I describe a recursive method I’ve been using to build small projects and subsystems with some success. The ideas involved are not new (or, at least, not mine), but I have yet to see them all in one place, and this is as good a spot as any.

Major influences include Gary Bernhardt’s Boundaries, Adam Grant’s The Surprising Habits of Original Thinkers, Pieter Hintjens’s CraftConf 2015 Talk, Bob Martin’s Architecture: The Lost Years, J. B. Rainsberger’s Becoming an Accomplished Software Designer, Kenneth Stanley’s Why Greatness Cannot Be Planned, and José Valim’s Mocks and Explicit Contracts. (Of course, if GDD sucks, it has nothing to do with those guys, and I’ve simply made a mess of their otherwise good ideas.)

I warn the reader that this article is way too complicated. If you agree that, when working on a project, you should

  • identify the desired outcomes
  • work backward from outcomes toward what is necessary to effect them
  • work only to solve the problems directly in front of you
  • solve the easiest problems first
  • postpone solving hard problems
  • avoid making decisions that are not immediately relevant
  • produce falsifiable evidence of correct behavior
  • minimize the number of things that need names (because naming is hard)
  • keep like things together
  • define boundaries with data
  • avoid dependency cycles

then I claim part of your workflow already looks similar to GDD, and you should move on to more interesting articles.

If you come to have doubts about the following approach as stated, join the club. Caveat emptor: there is no silver bullet.

What is GDD?

TDD works in isolation because, in part, its aim is to isolate components. The choices of what to build first and what to build next are decisions that necessarily exist outside TDD’s scope. While there are many methods that do address such things, I find they lack the purely mechanical, automatic feel of TDD. GDD is my attempt to fill this perceived gap; it is my current answer to the question, “What should happen at the bounds of TDD?”

Put another way, GDD is yet another strategy for convincing your brain to produce simpler code more easily.

In GDD, graph traversal plays a central role, which is why I say the process is “graph-driven.” The second “D” is “Design” because the decisions made just above the level of TDD carry real weight, impacting how easy or hard it is to complete the rest of a project.

At a high level, GDD is the following.

  1. Describe the system you want to build.
  2. List all outputs of the system, then list what data each output contains or is derived from (i.e. the inputs of the system).
  3. For each output or input not readily expressible in code, replace it with an intermediate representation that is (e.g. “bytecode instruction” instead of “robot moves”, or “points” instead of “line graph”).
  4. Construct a graph based on all the inputs and outputs of the system.
  5. Walk paths in the graph, applying TDD at each node.
  6. If you aren’t ready for TDD, go back to 1 with the node you’re stuck at, thereby refining the graph at that node.
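The graph these steps produce can be sketched as plain data. This is a minimal illustration, not a prescribed API: every name here (the classes, methods, and edge labels) is invented for the example, and the edge convention matches the one described below, where an edge missing a tail is an external input and an edge missing a head is a deliverable.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str

@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)   # name -> Node
    edges: list = field(default_factory=list)   # (tail, head, data label)

    def add_edge(self, tail, head, label):
        # Edges are labeled with the data structure they carry.
        for name in (tail, head):
            if name is not None and name not in self.nodes:
                self.nodes[name] = Node(name)
        self.edges.append((tail, head, label))

    def external_inputs(self):
        # Attached only at the head: external dependencies of the system.
        return [e for e in self.edges if e[0] is None]

    def deliverables(self):
        # Attached only at the tail: deliverables of the system.
        return [e for e in self.edges if e[1] is None]

# A tiny example using the article's "points instead of line graph" idea.
g = Graph()
g.add_edge(None, "load", "raw measurements")   # external input
g.add_edge("load", "plot", "points")           # internal data structure
g.add_edge("plot", None, "line-graph JSON")    # deliverable
```

Nothing about the representation matters much; what matters is that nodes name computations and every edge names a concrete piece of data.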

A simple GDD graph looks like this:

A GDD graph

Nodes represent unique computations and edges represent unique data structures. Edges attached only at the head are external dependencies of the system; edges attached only at the tail are deliverables of the system.

In truth, I’ve never drawn an actual graph to do GDD: I just do it in my head. But for conveying my decision process to others, this is a decent model to work from.

Why GDD?

Because it was the process I naturally arrived at through trial and error.

In time, I’ve found that expanding, contracting, and walking an abstract graph representation of my project helps me iterate quickly, design only what’s in front of me, and code only what I understand. When it works, GDD limits my focus to one subproblem at a time, providing enough structure that I can simply do what GDD tells me but not so much that I can’t readily change tack along the way.

Getting ahead of myself, designing with bad assumptions and inaccurate predictions, has invariably led to backtracking and redesigns—the most egregious of these having arisen when I was especially knowledgeable in the target domain. If you’ve ever run a red light because you only saw the green light three blocks beyond it, you know something about looking too far ahead. If you’ve ever gotten on the highway and had to turn back around for your passport, you understand just how costly retracing your steps can be.

GDD counters this by imposing a “masked horizon,” which has led me to produce effective designs in one pass. By forcing myself to begin with a known good result and ignore the inputs that lead to it rather than starting with what data I have and building up to some output, I always follow the “grain” of the problem, suppressing prior knowledge about parts of the solution until I truly need it. This acts as a kind of drawing restraint—or if you prefer, “strategic ignorance”—which imparts focus.

GDD also helps me leave a trail of tests in my wake by complementing TDD and giving me clear direction on what to test (and, therefore, implement) next.

Ultimately, if you’ve internalized the virtues of avoiding decisions (laziness), knowing nothing beyond your immediate surroundings (naiveté), and writing false implementations (falsifiability), the GDD model is an unnecessarily complicated, disposable ladder. Make a nice, warm fire with it.

If you’re still interested, I recommend you skip the next two sections, read “An abstract example,” then jump back for more detail.

Constructing the graph

Where I’ve written arrow, I clearly mean directed edge, but arrow is simpler.

  1. Given n inputs and outputs, draw a radially symmetric graph having n nodes.
  2. For each input, draw an arrow radiating inward, toward a corresponding node of your choosing. Nodes with arrows pointing inward are input nodes.
  3. For each output, draw an arrow radiating outward, away from a corresponding node of your choosing. Nodes with arrows pointing outward are output nodes.
  4. Label each arrow with a mnemonic to identify the input or output with which it is associated.
  5. For each output node, draw arrows from the corresponding input nodes toward the output node.
  6. Where two or more output nodes attach to identical input nodes, insert an aggregating node to which all the corresponding input and output nodes attach.
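Step 6 is the only mechanical subtlety in the construction, so here is one possible rendering of it in code. This is a sketch under invented assumptions: edges are bare (tail, head) pairs, and the aggregator’s name is fabricated for readability.

```python
from collections import defaultdict

def insert_aggregators(edges):
    """edges: a set of (tail, head) pairs between named nodes.
    Returns a new edge set with an aggregating node inserted wherever
    two or more output nodes attach to identical sets of input nodes."""
    by_output = defaultdict(set)
    for tail, head in edges:
        by_output[head].add(tail)

    # Group output nodes by their exact set of inputs.
    groups = defaultdict(list)
    for out, ins in by_output.items():
        groups[frozenset(ins)].append(out)

    result = set(edges)
    for ins, outs in groups.items():
        if len(outs) < 2 or len(ins) < 2:
            continue  # no redundancy to collapse
        agg = "agg(" + "+".join(sorted(ins)) + ")"  # invented naming scheme
        for out in outs:
            result -= {(i, out) for i in ins}   # drop the redundant edges
            result.add((agg, out))              # aggregator feeds each output
        result |= {(i, agg) for i in ins}       # each input feeds the aggregator
    return result
```

Run on the shape from the abstract example below (two outputs sharing two inputs), this collapses four edges into an aggregating node with two edges in and two edges out.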

The algorithm

Always start with an output node. Beyond that, I recommend you choose an output node having the fewest attached arrows (inputs and outputs).

  1. Obtain sample data for the corresponding output.
  2. Use the sample to write a collaboration test, then make the test pass with a deliberately preposterous implementation.
  3. For each input arrow pointing to this node, guess an appropriate data structure and TDD this node’s implementation for that input.
  4. If you find TDD difficult for any of the inputs (or it takes more than five minutes to write a test), draw a new graph representing just this node and start the algorithm anew.
  5. Refactor if necessary.
  6. Select one of the following.

    • an unvisited input node that points to this node
    • an unvisited output node having the fewest attached arrows
  7. Return to 1.
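The selection rule in step 6 can be stated compactly as code. Again a sketch, not a specification: the graph representation (a list of (tail, head) pairs) and the function name are invented for illustration.

```python
def next_node(current, edges, output_nodes, visited):
    """Pick the next node after finishing `current`:
    prefer an unvisited input node that points to it, otherwise the
    unvisited output node with the fewest attached arrows."""
    feeders = sorted(t for t, h in edges
                     if h == current and t not in visited)
    if feeders:
        return feeders[0]
    degree = lambda n: sum(1 for e in edges if n in e)
    remaining = [n for n in output_nodes if n not in visited]
    return min(remaining, key=degree) if remaining else None
```

When `next_node` returns None, every node has been visited and the walk is over.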

An abstract example

Suppose that in describing my desired system I arrive at three inputs and four outputs. I decide one of the outputs is not easily expressible in code (an image, say), and I therefore replace this output with an intermediate representation (JSON or somesuch). Having done this, I construct the following graph.

The system node

Not very exciting. I refine this graph by dividing the one node into seven nodes, one for each input or output.

Refining the system node

Next I connect the inputs to their corresponding outputs, as determined during my initial analysis. (I apologize for the nodes moving about; Graphviz can be hard.)

Adding dependencies

Here I notice that “Output 3” and “Output 4” take identical inputs. To eliminate the redundant edges, I insert an aggregating node in front of the “Output 3” and “Output 4” nodes.

A graph

To begin, I select the output node “Output 1” and wrangle up a sample of valid output. This node is a good choice because (1) it’s an output and (2) it has low degree, or few connecting edges.

Starting with "Output 1"

Next, I write a test which calls a function and asserts equality with my sample. To make the test pass, I literally paste the sample data into the appropriate function in the appropriate module.
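In code, that first step might look like the following. Everything here is invented for illustration (the sample data, the function name, the module layout); the point is only that the first “implementation” is the pasted sample itself.

```python
# Wrangled sample of valid "Output 1" data, pasted in verbatim.
SAMPLE = {"points": [[0, 0], [1, 2], [2, 4]]}

def output_1():
    # Deliberately preposterous: ignores every input and returns the sample.
    return SAMPLE

def test_output_1_matches_sample():
    assert output_1() == SAMPLE

test_output_1_matches_sample()
```

A test this silly still earns its keep: it pins down the output’s shape before any real logic exists, and it will fail the moment the function grows an argument.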

Testing "Output 1" with input

I observe that “Output 1” takes one input, and I change my function to accept one argument, causing my initial test to fail. I TDD the expanded function until I am no longer able to falsify legitimate output with anything but an ostensibly correct implementation.

Once finished, I move on to “Input 3” and begin the algorithm anew. To write my next test, I take input from my “Output 1” tests and assert equality against the output of my “Input 3” function.

Moving to "Input 3"

I then change my function to accept one argument, causing my test to fail. Again, I TDD the expanded function until I am no longer able to falsify legitimate output with anything but an ostensibly correct implementation.

During this round of TDD, I realize my data model between “Input 3” and “Output 1” is deficient. Another field must be added to handle a specific case that meets a requirement. I therefore update all relevant test fixtures and refactor my implementations to pass all tests once more.

The completed path

Having exhausted this path in the graph, I elect to write an integrated test for the entire path. Once done, I choose “Output 3” as my next node and begin the algorithm anew.

Et cetera.


One might argue that correcting the deficient data structure in the above scenario would be a mistake. Concerns that such early interventions can lead to dead code and other cruft would not be unfounded, but I posit that this is actually a good time to adjust the design. For one, I’m not looking a hundred moves ahead: I’m modifying a data structure I’m handling right now, with all the context necessary to make a decision. Second, if modifying the design fulfills an actual requirement, I would rather craft a slightly impoverished solution now that I can quickly refine later. In short, I don’t want to circle back if I don’t have to, or at least not until the whole program “works” in some generally recognizable sense.

Now suppose that during this scenario I took too long to write a test and that this occurred at the aggregate node linking input nodes 1-3. I would have expanded that node into the following graph.

Another graph

Because any further attempts to expand the individual nodes will result in more of the same, it’s probably only beneficial to produce at most two levels of depth.

As to why even two levels might be useful, I can only say that having a mechanical process guide me in dividing computations and defining boundaries reduces my cognitive load. Clearly I need not divide my code into as many pieces as these graphs may indicate (though the opposite is also true), but I find it easier to expand a structure completely before collapsing any extraneous parts.

Why then should I choose to expand individual nodes only as needed?

I’ve found that the more parts I have in my head, the worse my design becomes. By only expanding nodes as test feedback dictates, I maintain a sharper focus. Also, with more parts come more names, and naming regularly wastes hours of my time.

Reduction in cognitive load is also part of the graph traversal strategy. By walking only one path at a time, I need only consider a linear sequence of data transformations rather than a synthesis of multiple inputs.

Finally, it’s remarkable that every boundary between computations is data and that each edge directly corresponds to at least one collaboration/contract test pair. Moreover, a complete path through the graph corresponds to an integrated test where, at each intermediate computation, all but one of the arguments are “held constant,” so to speak. This in no way reduces the combinatorial explosion of integrated tests required to reach all code, but I find it does make missing tests more obvious.
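To make the edge-to-test correspondence concrete, here is one hedged illustration for a single edge carrying “points” from a producer to a consumer. The functions and data are invented; the pattern is the familiar contract/collaboration pairing.

```python
def make_points():
    # Producer: the computation at the tail of the edge.
    return [(0, 0), (1, 1)]

def render(points):
    # Consumer: the computation at the head of the edge.
    return {"points": points}

def test_contract_points_shape():
    # Contract test: pins the shape of the data the edge carries.
    assert all(len(p) == 2 for p in make_points())

def test_collaboration_render_accepts_points():
    # Collaboration test: feeds the same shape to the consumer.
    assert render([(0, 0), (1, 1)]) == {"points": [(0, 0), (1, 1)]}

test_contract_points_shape()
test_collaboration_render_accepts_points()
```

The pair stays honest as long as both tests describe the same data structure; if the edge’s label changes, both sides change with it.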

About the author: Failed network engineer. Lame programmer. Armchair mathematician. Suspected member of Homo sapiens.