Jun 9, 2013
A new way of testing

There's a combinatorial explosion at the heart of writing tests: the more coarse-grained the test, the more possible code paths to test, and the harder it gets to cover every corner case. In response, conventional wisdom is to test behavior at as fine a granularity as possible. The customary divide between 'unit' and 'integration' tests exists for this reason. Integration tests operate on the external interface to a program, while unit tests directly invoke different sub-components.

But such fine-grained tests have a limitation: they make it harder to move function boundaries around, whether it's splitting a helper out of its original call-site, or coalescing a helper function into its caller. Such transformations quickly outgrow the build/refactor partition that is at the heart of modern test-based development; you end up either creating functions without tests, or throwing away tests for functions that don't exist anymore, or manually stitching tests to a new call-site. All these operations are error-prone and stress-inducing. Does this function need to be test-driven from scratch? Am I losing something valuable in those obsolete tests? In practice, the emphasis on alternating phases of building (writing tests) and refactoring (holding tests unchanged) causes certain kinds of global reorganization to never happen. In the face of gradually shifting requirements and emphasis, codebases sink deeper and deeper into a locally optimum architecture that often has more to do with historical reasons than thoughtful design.

I've been experimenting with a new approach to keep the organization of code more fluid, and to keep tests from ossifying it. Rather than pass in specific inputs and make assertions on the outputs, I modify code to judiciously print to a trace and make assertions on the trace at the end of a run. As a result, tests no longer need call fine-grained helpers directly.

An utterly contrived and simplistic code example and test:

int foo() { return 34; }
void test_foo() { check(foo() == 34); }

With traces, I would write this as:

int foo() {
  trace << "foo: 34";
  return 34;
}
void test_foo() {
  foo();
  check_trace_contents("foo: 34");
}

The call to trace is conceptually just a print or logging statement. And the call to check_trace_contents ensures that the 'log' for the test contains a specific line of text:

foo: 34

That's the basic flow: create side-effects to check for rather than checking return values directly. At this point it probably seems utterly redundant. Here's a more realistic example, this time from my toy lisp interpreter. Before:

void test_eval_handles_body_keyword_synonym() {
  run("f <- (fn (a b ... body|do) body)");
  cell* result = eval("(f 2 :do 1 3)");
  // result should be (1 3)
  check(is_cons(result));
  check(car(result) == new_num(1));
  check(car(cdr(result)) == new_num(3));
}

After:

void test_eval_handles_body_keyword_synonym() {
  run("f <- (fn (a b ... body|do) body)");
  run("(f 2 :do 1 3)");
  check_trace_contents("(1 3)");
}

(The code looks like this.)

This example shows the key benefit of this approach. Instead of calling eval directly, we're now calling the top-level run function. Since we only care about a side-effect we don't need access to the value returned by eval. If we refactored eval in the future we wouldn't need to change this function at all. We'd just need to ensure that we preserved the tracing to emit the result of evaluation somewhere in the program.

As I've gone through and 'tracified' all my tests, they've taken on a common structure: first I run some preconditions. Then I run the expression I want to test and inspect the trace. Sometimes I'm checking for something that the setup expressions could have emitted and need to clear the trace to avoid contamination. Over time different parts of the program get namespaced with labels to avoid accidental conflict.

check_trace_contents("eval", "=> (1 3)");

This call now says, "look for this line only among lines in the trace tagged with the label eval." Other tests may run the same code but test other aspects of it, such as tokenization, or parsing. Labels allow me to verify behavior of different subsystems in an arbitrarily fine-grained manner without needing to know how to invoke them.

Other codebases will have a different common structure. They may call a different top-level than run, and may pass in inputs differently. But they'll all need labels to isolate design concerns.

The payoff of these changes: all my tests are now oblivious to internal details like tokenization, parsing and evaluation. The trace checks that the program correctly computed a specific fact, while remaining oblivious about how it was computed, whether synchronously or asynchronously, serially or in parallel, whether it was returned in a callback or a global, etc. The hypothesis is that this will make high-level reorganizations easier in future, and therefore more likely to occur.

Worries

As I program in this style, I've been keeping a list of anxieties, potentially-fatal objections to it:

  • Are the new tests more brittle? I've had a couple of spurious failures from subtly different whitespace, but they haven't taken long to diagnose. I've also been gradually growing a vocabulary of possible checks on the trace. Even though it's conceptually like logging, the trace doesn't have to be stored in a file on disk. It's a random-access in-memory structure that can be sliced and diced in various ways. I've already switched implementations a couple of times as I added labels to namespace different subsystems/concerns, and a notion of frames for distinguishing recursive calls.

  • Are we testing what we think we're testing? The trace adds a level of indirection, and it takes a little care to avoid false successes. So far it hasn't been more effort than conventional tests.

  • Will they lead to crappier architecture? Arguably the biggest benefit of TDD is that it makes functions more testable all across a large program. Tracing makes it possible to keep such interfaces crappier and more tangled. On the other hand, the complexities of flow control, concurrency and error management often cause interface complexity anyway. My weak sense so far is that tests are like training wheels for inexperienced designers. After some experience, I hope people will continue to design tasteful interfaces even if they aren't forced to do so by their tests.

  • Am I just reinventing mocks? I hope not, because I hate mocks. The big difference to my mind is that traces should output and verify domain-specific knowledge rather than implementation details, and that it's more convenient with traces to selectively check specific states in specific tests, without requiring a lot of setup in each test. Indeed, one way to view this whole approach is as test-specific assertions that can be easily turned on and off from one test to the next.

  • Avoiding side-effects is arguably the most valuable rule we know about good design. Could this whole approach be a dead-end simply because of its extreme use of side-effects? Arguably these side-effects are ok, because they don't break referential transparency. The trace is purely part of the test harness, something the program can be oblivious to in production runs.

The future

I'm going to monitor those worries, but I feel very optimistic about this idea. Traces could enable tests that have so far been hard to create: for performance, fault-tolerance, synchronization, and so on. Traces could be a unifying source of knowledge about a codebase. I've been experimenting with a collapsing interface for rendering traces that would help newcomers visualize a new codebase, or veterans more quickly connect errors to causes. More on these ideas anon.

mission
Making the big picture easy to see, in software and in society at large.
accomplishments
Code (contributions)
Prose (shorter; favorites)
favorite insights
Programming
Social Dynamics
Life
Making
Work
Cognition
Economics
History
Startups
Social Software