Mu: Sketching out a minimal system programming language

now about contact

Oct 15, 2019

Mu: Sketching out a minimal system programming language

In the previous post, I described what my new hobbyist computing stack looks like today, and how the design decisions seemed to accumulate inevitably from a small set of axiomatic goals. In this post I describe some (far more speculative) future plans for Mu and try to articulate how the design process continues to seem pre-ordained.

(Many of the sections below outline constraints before describing a design that mostly fits them. This flow is inspired by Christopher Alexander's “Notes on the synthesis of form”.)

To recap, I plan 3 levels of languages for Mu:

Level 1 (SubX): just the processor's instruction set with some syntax sugar
Level 2 (Mu): a memory-safe language
Level 3 (TBD): an expressive high-level language

So far Mu has just level 1 (described in the prequel). After writing 30k LoC in level 1, the thing I miss most is the safety of a higher-level language. In particular, memory safety. I repeatedly make the following kinds of errors:

I allocate local variables on the stack but forget to clean them up before returning.
I accidentally clobber a register in a function without first saving the caller's version of it.
I accidentally clobber memory outside the bounds of an array.
I accidentally continue to hold on to a heap allocation after I've freed it.

The first two are problems that C solves. The latter two are not. Accordingly, level 2 will have some similarity with C but also major differences. Like C's original goals, it's intended to be easy to translate to native machine code. Unlike C, however:

It is intended to remain easy to translate over time. C compilers have optimization passes that have grown ever more complex and sophisticated over time in a search for higher performance. Mu is intended to never have an optimizer.
It is intended to always be built out of lower level languages. This imposes some tight constraints on the number of features it can have.
Performance is not a priority. I want to make it safe before I make it fast. In particular, to keep the translation simple, I'm willing to use run-time checks for things like out-of-bounds memory access.
To keep the translation simple, I give up mathematical notation. You won't be able to say things like `a + b*c` (that C compilers then need CSE to optimize).

Since the compiler won't optimize code, not just in the initial phase but ever, the language should allow the programmer complete control over performance. It won't let you write utterly unsafe code (bump down to machine code for that), but it shouldn't introduce gratuitous overheads when emitting machine code. C started out simple to translate, but gained declarative keywords like inline and register over time. I want to push much harder than C on being easy to translate. For the most part there should be a 1-to-1 mapping between statements and x86 instructions. Just with more safety.

‘Compiling’ feels too pompous a word for this process. Let's call it just ‘translating’ instead. And it turns out to be surprisingly constraining. If you start out with the goals stated above about the target instruction set and how parsimonious the translation process needs to be, there seems to be exactly one core language you can end up with.

Ingredients

As hinted above, the language has to be exclusively statement-oriented. While I prefer the syntax of purely expression-oriented languages, that's something to build higher up in Mu. Not having to translate expressions to statements avoids the need to detect common sub-expressions and so on.

I said above that I want instructions to map 1:1 to instructions. The x86 instruction set doesn't allow instructions to access more than one memory location. The simplest way to work with this restriction is to make registers explicit in our code, so that they're a constant consideration for the programmer. Mu will be a manually register-allocated language. It will still check your register allocation and flag an error if multiple variables overlap on the same register. But it's up to the programmer to manage registers and spill to memory when necessary. The good news: the programmer can really optimize register use when it matters.

x/eax # variable x is stored in register eax
x # variable x is stored in memory

The language will specify types of variables. The translator need only compare the input and output types of each statement in isolation.

Types allow restrictions to instructions to support memory safety. For example, `add` can be restricted to numbers, `index` to arrays, and `get` to records (structs). All three may emit the same instruction in machine code, but the logical semantics allow us to forbid arbitrary pointer arithmetic.

In addition to types, variables can also specify their storage location: whether they're allocated on a register, on the stack or on the data segment. The skeleton of the translator starts to take shape now: after parsing and type-checking, a simple dispatch loop that decides what instruction to emit based on both the operation and where the operands are stored.

The one feature I miss most when programming in machine code is the ability to declare a struct/record type and access record fields rather than numeric offsets. Mu will have user-defined types.

While most Mu code will be translated 1:1 into binary, there will be a handful of situations where instructions are inserted without any counterpart in the sources. The list so far:

Stack management. When I declare local variables I'd like them to be automatically popped off the stack when exiting scope.
Checking array bounds. Every index into an array compares with its length. To make this check convenient, all array start with their lengths in Mu.
Checking pointers into the heap. Heap allocations can be reclaimed, and we want to immediately flag dereferences to stale pointers after they've been reclaimed.

This is the hazy plan. Still lots of details to work out.

Milestones

Here's an outline of the broad milestones planned for Mu's eponymous level-2 language:

Parse core syntax into some intermediate representation.
Code-generate just instructions with integer operands, for starters.
Variable declarations and scopes.
User-defined types.
Address types. We probably need them. How to make them safe?
Addresses on the heap, and how to detect when they're reclaimed.

I'll sketch out the first three in this post, and then go off to build the language with just integers. Once I'm done with it I'll flesh out the rest.

Core syntax

As I said in the previous post, machine code for any processor consists of linear sequences of instructions. These instructions uniformly consist of an operation and some number of operands. Conceptually:

op o1, o2, …

Different processors impose constraints on these operands. In x86 most instructions can take at most two operands. At most one of them can lie in memory, and at most operand can be written to. The constraints are independent; instructions may write to either memory or registers.

Machine code usually requires multiple primitive instructions to perform a call to a user-defined function. It seems useful for primitive operations and calls to user-defined functions to have a uniform syntax.

It would be nice to be able to see at a glance which operands of an instruction are read or written.

Conventional Assembly languages use order to distinguish the output parameter. However they don't support a syntax for function calls.

One potential syntax may be:

o1, o2, … ← op i1, i2, …

However, this scheme doesn't admit a clean place for operands that are both read and written. (We'll call them ‘inout’ operands, extrapolating from ‘input’ and ‘output’.) Inout operands are common in x86 machine code, where you can have at most two operands:

Add operand o1 to o2;
subtract o1 from o2;
compute the bitwise `AND` of o1 and o2, and store the result in o1;
…and so on.

User-defined functions can also have inout operands. It would be nice to highlight them uniformly.

It would also be nice to see at a glance which operands are in registers, and which are in memory.

However, specifying both input/output and reg/mem properties explicitly for every single operand seems excessive, both for the implementor and for users.

Function calls can pass inputs and outputs either on the stack or in registers. Typically the stack holds only input or inout operands. You could move the return address down to make some extra room to write outputs to. But these stack manipulations take up instructions, and calls would lose a core symmetry: it's easy and useful to check that the amount of data pushed before a call matches the amount of data popped off after.

Function calls can return results in registers, but callers must always know precisely which registers are being written. Separating output registers gives a sense of flexibility that is false.

Finally, registers can't hold types larger than a single word. Larger types need memory, and often they need to allocate this memory. If we assume memory management is manual, it makes functions more self-contained and easy to test if they don't do their own memory allocation. Instead the predominant idiom in C and Assembly is for the caller to be responsible for allocating space for results. The callee simply writes into this space.

That's a lot of constraints, and quite heterogenous! Putting them all together, it seems to me they point toward only one solution:

Give up on separating inputs from outputs in the general case. C is wise here.
Highlight output registers of function calls. These are just for documentation and error-checking. As we saw above, you can't mix and match arbitrary registers into the outputs of a call. But it's still nice for the reader to be able to tell at a glance which registers are being modified where, and for the translator to detect when the wrong output register is assumed in a call.
```
fn factorial n : int → result/eax : int [
   …
]
# call
x/eax ← factorial 20  # ok
x/ecx ← factorial 20  # error
```
Highlight output registers of primitives:
```
x/eax ← copy y
```
Unlike function calls, primitive instructions can usually write to any register.
Put inout registers of primitive instructions only on the output. For example adding y to x when x is a register:
```
x/eax ← add y
```
Other than registers, memory addresses written to are really inout. Never put them on the output side. The above two instructions when the destination is not a register:
```
copy-to x, y/ecx
add-to x, y/ecx
copy-bytes-to x, y # function call
```
The -to suffix helps indicate which operand is written. It also seems to suggest a convention of putting outputs first. You could use a -from suffix and keep outputs clearly at the end. But what if you have multiple inputs? It seems more common to have a single output.

Code generation

Once we've parsed a core syntax we need to emit code for each instruction. Function calls are straightforward, given pre-existing syntax sugar. In Mu:

o1, o2 <- f i1, i2

In machine code with syntax sugar:

(f i1 i2) # output registers are implied

In machine code without syntax sugar:

push i2
push i1
call f
add 8 bytes to esp

Since each operand is in a separate instruction, it is free to live in a register or memory.

Now for primitives. I'll show a quick sketch of how we translate add instructions to x86 opcodes based on the storage type of their operands. Other instructions will follow similar logic.

The x86 instruction set operates on registers, memory or literals. Data in registers is addressed directly. Data in the stack is accessed in offsets of the stack pointer: `*(esp+n)` where `n` is usually a static value. Global variables are accessed as labels (32-bit addresses). We also capitalize global variables by convention.

Each add instruction takes two operands. Here are the opcodes (and optional sub-opcodes) we care about, from the Intel manual (with optional sub-opcode or subop after the slash):

	← Source operand →
Destination Operand ↓	Literal	Register	Local	Global
Register	81 /0	01	03	03
Local	81 /0	01	X	X
Global	81 /0	01	X	X

Alternatively, showing where operands are stored (destination first):

reg += literal ⇒ 81 0/subop %reg literal/imm32
stack += literal ⇒ 81 0/subop *(esp+offset) literal/imm32
Global += literal ⇒ 81 0/subop *Global literal/imm32
reg += reg2 ⇒ 01 %reg reg2/r32
stack += reg2 ⇒ 01 *(esp+offset) reg2/r32
Global += reg2 ⇒ 01 *Global reg2/r32
reg += stack ⇒ 03 *(esp+offset) reg/r32
reg += Global ⇒ 03 *Global reg/r32

Other binary operations get translated similarly. (For details on the metadata after slashes, see the section on “Error-checking” in the previous post and the project Readme.)

Variable declarations

Mu needs to be able to declare friendly names for locations in memory or register. How would they work? Again, we'll start with some goals/constraints:

We'd like to be able to attach types to registers or memory locations, so that errors in writing incompatible values to them can be quickly caught.
Registers can only hold a word-sized type.
We shouldn't have to explicitly deallocate variables when exiting a scope.
We'd like to allocate vars close to their use rather than at the top of a function. Ideally we'd like to support scopes in blocks rather than just for entire function bodies.
We'd like to be able to exit early and automatically clean up.
We'd like to avoid unnecessary shadowing. If a variable isn't used anymore it shouldn't need to be restored.

Here's a declaration for a local variable with a type:

var x : int

This gets translated to (⇒)

subtract 4 from esp

You can also initialize a variable with 0's:

var x : int ← 0

⇒

push 0

Variables on the stack (offsets from esp) have a single type for their entire lifetimes.

If you're allocating a variable to a register, you can also turn any instruction writing to a single register into an initialization:

var x/ecx : int ← copy y

It's also fine to read from a register and write to a conceptually new variable in the same register:

var opcode/ecx : int ← and inst/ecx 0xff

While register variables have a single type for their entire lifetimes, it's totally fine for a register to be used by multiple variables of distinct types in a single function or block. Put another way, register variables tend to have shorter lifetimes than variables on the stack.

Variable declarations interact with Mu's {} blocks. Here's a variable allocated on the stack:

{
var x : int ← 0
…
}

This gets translated to (⇒) the following level-1 pseudocode:

{
push 0
…
add 4 to esp
}

Similarly, a variable allocated to a register:

{
var x/ecx : int ← copy y
…
}

⇒

{
push ecx # spill register
ecx ← copy y
…
pop to ecx # restore register
}

(This approach fits a calling convention where all registers are saved by the callee rather than the caller.)

Variables don't always need to be shadowed; sometimes they're safe to clobber. Rather than taking on responsibilities of analyzing lifetimes, Mu provides a single simple rule: the first variable in a register shadows previous values, and subsequent variables to the same register in the same block clobber:

{
var x/ecx : int ← copy y
var z/ecx : int ← …A…
…B…
}

⇒

{
push ecx
ecx ← copy y
ecx ← …A…
…B…
pop to ecx
}

To shadow, create a new inner block.

Early exits should also clean up any variables in a block.

{
var x/ecx : int ← copy y
break-if …A…
…B…
}

⇒

{
push ecx
ecx ← copy y
{
break-if …A…
…B…
}
pop to ecx
}

Early returns are more complicated, because we may need to unwind multiple blocks. The Mu translator is responsible for tracking the size of the stack frame at any point, and updating the stack appropriately before returning.

{
var x/ecx : int ← copy y
return-if …A…
…B…
}

⇒

{
push ecx # spill register
ecx ← copy y
{
break-unless …A…
«increment esp appropriately»
return
}
…B…
pop to ecx # single restore
}

Wrap up

Putting these ideas together, here's what factorial looks like in a bare-bones but type-safe system programming language targeting x86 (supporting only integers so far):

fn factorial n : int → result/eax : int {
# if (n <= 1) return 1
compare n, 1
{
break-if >
return 1
}
# otherwise return n * factorial(n-1)
{
break-if <=
# var tmp = n-1
var tmp/ebx : int ← copy n
decrement tmp
# return n * factorial(tmp)
var tmp2/eax : int ← factorial tmp
result ← multiply n, tmp2
return result
}
}

And here's the level-1 machine code I plan for it to be translated to:

factorial:
# function prologue
55/push-ebp
89/<- %ebp 4/r32/esp
# if (n <= 1) return 1
81 7/subop/compare *(ebp+8)
{
7f/jump-if-greater break/disp8
b8/copy-to-eax 1/imm32
e9/jump $factorial:end/disp32
}
# otherwise return n * factorial(n-1)
{
7e/jump-if-lesser-or-equal break/disp8
# var tmp = n-1
53/push-ebx # spill
8b/-> *(ebp+8) 3/r32/ebx
4b/decrement-ebx
# return n * factorial(tmp)
(factorial %ebx) # → eax
f7 4/subop/multiply-into-eax *(ebp+8)
5b/pop-to-ebx # restore
e9/jump $factorial:end/disp32
}
$factorial:end:
# function epilogue
89/<- %esp 5/r32/ebp
5d/pop-to-ebp
c3/return

This isn't as efficient as what I wrote by hand at the bottom of the previous post, but seems like a realistic translation based on this post.

Ok, time to get to work building it. I'd love to hear suggestions or feedback based on this design, either in the comments below or over email.

(Thanks Garth Goldwater, Vladimir Gordeev, Edward de Jong, Breck Yunits, Tibor Halter, Remo Dentato and Federico Pereiro for helpful feedback on drafts of this post.)

comments

Peter Van Sandt, 2019-11-22: Interesting to see something that is even more a nicer version of codegen for assembly. I also see how optimizers make programmers less able to reason about the output assembly. Couple of thoughts:

1. Have you considered based in this on SSA or LLVM IR rather than x86 ASM? It might be interesting to see what that would look like if there was a way to write LLVM IR in a way that was designed for humans to write.

2. Mathematical equivalences and fancy bitmasking is pretty hard for humans to come up with for every place that it's needed, and without an optimizer, you lose that. To keep the 1-1 equivalence of code with output, what do you think about an optimizing linter? Something that could complain about areas where your code could be optimized, but not actually do it for you.

Kartik Agaram, 2019-11-22: I like both ideas! An optimizing linter sounds like a great idea, because it goes with the grain of this project: to keep humans in the loop, to encourage them to understand details, and to reward curiosity.

I'd be curious to hear what about LLVM IR you find to be hard for humans to write. It seemed fairly clean when I looked at it. The problem is not so much with the IR syntax itself but with all the myriad design decisions in the implementation that assume it's going to be just compilers emitting the IR.

Anonymous, 2019-12-09: An optimizing linter has the problem of being destructive. It goes like this:

The programmer will write his or her program in a readable way. They'll run it through the compiler, which points out that something can be optimized, the programmer—having already gone through the process of writing the first implementation with all its constraints and other ins and outs fresh in their mind—will slap their head and mutter "of course!", and then replace the original naive implementation with one based on the notes the compiler has given. Chances are high that the result will be less comprehensible to other programmers who come along—or even to the same programmer revisiting their own code 6 months later.

What you really want is something like Knuth's "interactive program-manipulation system".

The first use case could be doing away with inline concerns that `factorial`'s `tmp2` is a register variable. The original program P would elide storage class annotation altogether (which by default would be understood that it resides on the stack), and then your "transformations that make it efficient" would include specifying that `tmp2` can/should actually live in eax.

I've had a draft post on this topic for a while and just now dumped it on keybase.

Also AIUI, "LLVM IR" doesn't actually exist. It's not stable.

Kartik Agaram, 2019-12-09: Very interesting draft, thanks. I ended up favoriting both the Animats comment you linked to as well as its parent thread on Dafny.

Even though I find it more interesting, I don't follow all the intricacies of your verification-side argument (and theirs); I don't have much background in verification yet. But I spent some time thinking about the rewrite system you describe with similarities to Literate Programming and Aspect-Oriented Programming. I haven't been able to come up with a good way to separate a declarative 'spec' from optimizations using these techniques. This is why I'm currently using the hackier approach of a) having lots of tests, and b) gradually building up a complex system across multiple layers. My layers often end up repeating the code of an earlier layer but with greater elaboration. While the repetition is distasteful, it does allow the reader to start with a simpler version at an earlier layer and then separately see how it's optimized. And the tests ensure that the new version continues to pass old constraints in addition to new ones defined alongside it.

Maybe someday I or someone else will be able to replace many of Mu's tests with a verification system of some sort. But for now it's been hard enough to just support these hacks while bootstrapping from machine code.

Comments gratefully appreciated. Please send them to me by any method of your choice and I'll include them here.