Retrofitting a conventional data model with immutability
(For the Future of Coding Slack group)

A couple of years ago I found myself working on a little internal tool: a glorified checklist of things teams had to do to certify that their product was ready to launch externally.

It was basically a CRUD app backed by a database. However, as the company grew, the data model for the app and its checklists had grown complex. There were many tables, and foreign keys proliferated. While the data model was still quite close to normal form, there were path-dependent artifacts; if you were designing the tables today, you wouldn't necessarily organize them quite like this.

There were tables for checklist questions organized in various hierarchies, and separate tables for people's answers to the questions. Various events would trigger a clearing of answers so that teams had to do them all over again (ugh).

At some point one of the senior people with privileges to modify checklist questions did so (a task usually performed by the owner) and didn't realize it would kill everyone's answers.

In response, we decided we needed to snapshot versions of answers so that it was possible in principle to roll back. We also received a feature request for batching up changes to questions so that people's answers were clobbered less often. It also seemed like a good idea to keep showing answers to previous versions of the questions, so that filling out the checklists felt less pointless.

My immediate reaction was to attempt something like Datomic: instead of modifying rows in tables, keep adding new rows, and tweak queries to select the latest version of a row by default.

I was aware that this wouldn't work if rows were modified very often. In our case, though, the database was tiny and all data was entered by hand.

Encouragingly, I discovered a whole body of literature in the field of "data warehousing" called Slowly Changing Dimensions (SCD). The Datomic approach turned out to be just one point in the design space enumerated by SCD, so SCD felt useful.

I started out with SCD type 2, which is most like Datomic. Rows are immutable. Every mutation creates a new row.
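
Concretely, the schema for something like our criteria table ends up looking like the sketch below. The payload columns here are made up; the key structure is what matters, and it matches the queries that follow.

-- SCD type 2 sketch: 'obj_id' names the conceptual object,
-- 'id' is a fresh, increasing key for every version
-- (assume the database assigns it, e.g. autoincrement/serial).
create table criteria (
  id        integer primary key,  -- version key
  obj_id    integer not null,     -- stable identity across versions
  policy_id integer,
  body      text                  -- illustrative payload column
);

-- an "update" is really an insert of a new version
insert into criteria (obj_id, policy_id, body)
  values (nnn, ppp, 'revised wording');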

As I tried to apply these ideas in my domain, however, my queries rapidly got more complex.

In the simple case, a query like this:

select * from criteria where id=nnn;

…turned into this:

select * from criteria where obj_id=nnn
   order by id desc limit 1;

Ok. However, even simple joins got hard to write. How would you rewrite this query if the table had multiple rows for the same conceptual object?

select * from criteria where policy_id=ppp;

(I'd worked it out at some point, but can't find the answer anymore.)
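
For the record, one plausible way to do it (a reconstruction, not necessarily the answer I had) is to restrict each conceptual object to its latest version before applying the original filter:

select * from criteria x
     where x.policy_id = ppp
     and x.id = (select max(id) from criteria y
                   where y.obj_id = x.obj_id);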

Worse, since our tables were not ideally organized, we often had strange schemas. One of our tables had six rows per conceptual object, each with a different value for a specific column (say category_type) that was conceptually an enum. So a query like this would yield six rows, each with a different category_type:

select * from evaluations where id=nnn;

To rewrite it to get the latest version of each of the six rows, you needed this mind-bending self-referential query (which I did take care to save):

select * from evaluations x
     where x.obj_id = nnn
     and x.id = (select max(id) from evaluations y
                   where y.obj_id = nnn
                   and y.category_type = x.category_type);

Writing such queries would be too much of an ongoing burden, for others on the team as well as for me.

Benchmarking showed that performance quickly went down the toilet with any sort of join.


I switched to SCD type 4: a separate history table. Every mutation operates on the main table as normal, and also appends to a separate table of versions.
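
In outline, the layout becomes a live table that looks just like before, plus a history table that only ever grows. (Payload columns are again illustrative.)

-- live table: one row per conceptual object, updated in place
create table criteria (
  id        integer primary key,
  policy_id integer,
  body      text
);

-- history table: one row per version, appended to on every mutation
-- (from application code or a trigger)
create table criteria_versions (
  id        integer primary key,  -- version key, monotonically increasing
  obj_id    integer not null,     -- references criteria.id
  policy_id integer,
  body      text
);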

Now most queries could work as usual. Only when you needed older versions did you run into the complexity and latency of queries like the ones above.

Even here there were challenges. Since we wanted to let people browse 'major' versions, we tracked, for each major version, the most recent row belonging to it in the history table. A query for an older major version then looked like this:

select * from criteria_versions where obj_id=nnn
     and id < (select final_id from criteria_major_versions ...) 
     order by id desc limit 1;

We continued to have poor performance for queries like this (though they were expected to be less frequent than in the previous scenario).
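
For reference, the bookkeeping behind that subquery was a small table recording, for each major version of an object, the last history-table row that belonged to it. Roughly (the real columns may have differed):

-- one row per (object, major version); final_id points at the most
-- recent row in criteria_versions for that major version
create table criteria_major_versions (
  obj_id        integer not null,
  major_version integer not null,
  final_id      integer not null  -- references criteria_versions.id
);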

At this point I started considering more insane alternatives, like swapping in a new history table for every new major version. Or even a new (table, history table) pair.

Then I shook myself awake.


One lesson I learned from this experience was that there are two categories of 'versioned data' that are tempting to conflate but are really quite distinct:

  • User-visible 'major' versions that can be browsed.
  • 'Minor' versions that are purely for recovery and audit trail purposes.

The Datomic model fits minor versions well. While it's tempting to reuse it for major versions, one is better off explicitly accounting for them in the data model.
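
What 'explicitly accounting for them' might look like (purely a sketch, not what we actually shipped): give answers a major-version column of their own, so that showing an old major version is an ordinary query rather than an excavation of the history table.

-- answers record which major version of a question they were given for
create table answers (
  id            integer primary key,
  question_id   integer not null,
  major_version integer not null,
  body          text
);

select * from answers
     where question_id = qqq and major_version = vvv;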

Comments gratefully appreciated. Please send them to me by any method of your choice and I'll include them here.
