Jun 9, 2013
A new way of testing

There's a combinatorial explosion at the heart of writing tests: the more coarse-grained the test, the more possible code paths to test, and the harder it gets to cover every corner case. In response, conventional wisdom is to test behavior at as fine a granularity as possible. The customary divide between 'unit' and 'integration' tests exists for this reason. Integration tests operate on the external interface to a program, while unit tests directly invoke different sub-components.

But such fine-grained tests have a limitation: they make it harder to move function boundaries around, whether it's splitting a helper out of its original call-site, or coalescing a helper function into its caller. Such transformations quickly outgrow the build/refactor partition that is at the heart of modern test-based development; you end up either creating functions without tests, or throwing away tests for functions that don't exist anymore, or manually stitching tests to a new call-site. All these operations are error-prone and stress-inducing. Does this function need to be test-driven from scratch? Am I losing something valuable in those obsolete tests? In practice, the emphasis on alternating phases of building (writing tests) and refactoring (holding tests unchanged) causes certain kinds of global reorganization to never happen. In the face of gradually shifting requirements and emphasis, codebases sink deeper and deeper into a locally optimum architecture that often has more to do with historical reasons than thoughtful design.

I've been experimenting with a new approach to keep the organization of code more fluid, and to keep tests from ossifying it. Rather than pass in specific inputs and make assertions on the outputs, I modify code to judiciously print to a trace and make assertions on the trace at the end of a run. As a result, tests no longer need call fine-grained helpers directly.

An utterly contrived and simplistic code example and test:

int foo() { return 34; }
void test_foo() { check(foo() == 34); }

With traces, I would write this as:

int foo() {
  trace << "foo: 34";
  return 34;
}

void test_foo() {
  check_trace_contents("foo: 34");
}

The call to trace is conceptually just a print or logging statement. And the call to check_trace_contents ensures that the 'log' for the test contains a specific line of text:

foo: 34

That's the basic flow: create side-effects to check for rather than checking return values directly. At this point it probably seems utterly redundant. Here's a more realistic example, this time from my toy lisp interpreter. Before:

void test_eval_handles_body_keyword_synonym() {
  run("f <- (fn (a b ... body|do) body)");
  cell* result = eval("(f 2 :do 1 3)");
  // result should be (1 3)
  check(car(result) == new_num(1));
  check(car(cdr(result)) == new_num(3));
}

After:

void test_eval_handles_body_keyword_synonym() {
  run("f <- (fn (a b ... body|do) body)");
  run("(f 2 :do 1 3)");
  check_trace_contents("(1 3)");
}

This example shows the key benefit of this approach. Instead of calling eval directly, we're now calling the top-level run function. Since we only care about a side-effect we don't need access to the value returned by eval. If we refactored eval in the future we wouldn't need to change this function at all. We'd just need to ensure that we preserved the tracing to emit the result of evaluation somewhere in the program.

As I've gone through and 'tracified' all my tests, they've taken on a common structure: first I run some preconditions. Then I run the expression I want to test and inspect the trace. Sometimes I'm checking for something that the setup expressions could have emitted and need to clear the trace to avoid contamination. Over time different parts of the program get namespaced with labels to avoid accidental conflict.

check_trace_contents("eval", "=> (1 3)");

This call now says, "look for this line only among lines in the trace tagged with the label eval." Other tests may run the same code but test other aspects of it, such as tokenization, or parsing. Labels allow me to verify behavior of different subsystems in an arbitrarily fine-grained manner without needing to know how to invoke them.

Other codebases will have a different common structure. They may call a different top-level than run, and may pass in inputs differently. But they'll all need labels to isolate design concerns.

The payoff of these changes: all my tests are now oblivious to internal details like tokenization, parsing and evaluation. The trace checks that the program correctly computed a specific fact, while remaining oblivious about how it was computed, whether synchronously or asynchronously, serially or in parallel, whether it was returned in a callback or a global, etc. The hypothesis is that this will make high-level reorganizations easier in future, and therefore more likely to occur.


As I program in this style, I've been keeping a list of anxieties, potentially-fatal objections to it:

  • Are the new tests more brittle? I've had a couple of spurious failures from subtly different whitespace, but they haven't taken long to diagnose. I've also been gradually growing a vocabulary of possible checks on the trace. Even though it's conceptually like logging, the trace doesn't have to be stored in a file on disk. It's a random-access in-memory structure that can be sliced and diced in various ways. I've already switched implementations a couple of times as I added labels to namespace different subsystems/concerns, and a notion of frames for distinguishing recursive calls.

  • Are we testing what we think we're testing? The trace adds a level of indirection, and it takes a little care to avoid false successes. So far it hasn't been more effort than conventional tests.

  • Will they lead to crappier architecture? Arguably the biggest benefit of TDD is that it makes functions more testable all across a large program. Tracing makes it possible to keep such interfaces crappier and more tangled. On the other hand, the complexities of flow control, concurrency and error management often cause interface complexity anyway. My weak sense so far is that tests are like training wheels for inexperienced designers. After some experience, I hope people will continue to design tasteful interfaces even if they aren't forced to do so by their tests.

  • Am I just reinventing mocks? I hope not, because I hate mocks. The big difference to my mind is that traces should output and verify domain-specific knowledge rather than implementation details, and that it's more convenient with traces to selectively check specific states in specific tests, without requiring a lot of setup in each test. Indeed, one way to view this whole approach is as test-specific assertions that can be easily turned on and off from one test to the next.

  • Avoiding side-effects is arguably the most valuable rule we know about good design. Could this whole approach be a dead-end simply because of its extreme use of side-effects? Arguably these side-effects are ok, because they don't break referential transparency. The trace is purely part of the test harness, something the program can be oblivious to in production runs.

The future

I'm going to monitor those worries, but I feel very optimistic about this idea. Traces could enable tests that have so far been hard to create: for performance, fault-tolerance, synchronization, and so on. Traces could be a unifying source of knowledge about a codebase. I've been experimenting with a collapsing interface for rendering traces that would help newcomers visualize a new codebase, or veterans more quickly connect errors to causes. More on these ideas anon.


  • Sae Hirak, 2013-06-15: This is a brilliant idea. The primary reason I don't write unit tests is because... well, unit tests tend to be larger than the code they're actually testing, so it takes a lot of effort to write them. That would be fine if the tests didn't change very often, but when refactoring your code, you have to change the tests!

    So now you not only have to deal with the effort of refactoring code, but *also* refactoring tests! I can't stand that constant overhead of extra effort. But your approach seems like it would allow me to change the unit tests much less frequently, thereby increasing the benefit/cost ratio, thereby making them actually worth it.

    I don't see any connection at all to mocks. The point of mocks is that if you have something that's stateful (like a database or whatever), you don't test the database directly, instead you create a fake database and test that instead.   

  • boxed, 2013-06-19: http://doctestjs.org/ has a mode of operation that is very similar, and it has really good pretty-printing and whitespace normalization code to handle those brittleness problems you talk about.

    One thing I try to do with my tests is assert completeness at the end. So for example, if you trace (in doctest.js parlance "print") three things: "a", {"b": 1} and 4, then if you assert "a", that object is popped from the pile of objects that have been traced. This way you can at the end do: assert len(traces) == 0. This is pretty cool in that you assert both the positive _and negative_. I use this type of thinking a lot.

  • David Barbour, 2013-10-10: I've been pursuing testing from the other side: by externalizing state (even if using linear types to ensure an exclusive writer), it becomes much easier to introduce tests as observers (semantic access to state). But I like the approach you describe here. Data driven approaches like this tend to be much more robust and extensible, i.e. can add a bunch more observers.

    The piece you're missing at the moment: generation of the trace should be automatic, or mostly so. Might be worthwhile to use a preprocessor, and maybe hot-comments, to auto-inject the trace.   

  • Kartik Agaram, 2014-01-31: Without traces, the right way to move function boundaries around is bottom-up: https://practicingruby.com/articles/refactoring-is-not-redesign   
  • Kartik Agaram, 2014-04-29: Perhaps I shouldn't worry about the effect of testing on design. Support from an unexpected source:

    "The design integrity of your system is far more important than being able to test it any particular layer. Stop obsessing about unit tests, embrace backfilling of tests when you're happy with the design, and strive for overall system clarity as your principle pursuit." -- David Heinemeier Hansson, http://david.heinemeierhansson.com/2014/test-induced-design-damage.html   

  • Anonymous, 2014-06-06: Interesting. You are basically creating an additional API, one that is used solely for testing (the trace output). The interesting challenge here is to prove whether this new API is more resistant to breakage because of refactoring when compared to the primary API.
    • Anonymous, 2014-06-06: What I meant, there are multiple ways to accomplish the task even on the business logic level. For example, both of the following snippets are correct:

      void clean_up () {
           sweep_the_floor ();
           wash_the_dishes ();
      }

      void clean_up () {
           wash_the_dishes ();
           sweep_the_floor ();
      }

      Yet the trace describing what has happened would be different.

      To account for this kind of thing, the test would have to do some kind of normalisation of the trace...

    • Kartik Agaram, 2014-06-06: Thanks! Traces would have to focus on domain-specific facts rather than details of incidental complexity. Hopefully that problem is more amenable to good taste. But yes, still an open question.

Comments gratefully appreciated. Please send them to me by any method of your choice and I'll include them here.
