Foundations
Over the past year I've built a minimal-dependency hobbyist computing stack called Mu that:
- Is easy to comprehend and modify, and doesn't require much code.
- Is thoroughly tested from the ground up.
- Can be run emulated atop Linux as well as on native hardware.
- Supports time-travel debugging.
- Is self-hosted and can build programs without any dependency—direct or indirect—on C.
- Can package programs up with an OS into bootable disk images that can be uploaded to a cloud server. (The OS does require C. For now.)
(Why did I do this? I've written elsewhere about philosophy and goals.)
The catch: so far the stack must be programmed with machine code. I've built a simple syntax for machine code that can catch many common errors, and the emulation infrastructure makes debugging easier. It's ergonomic enough (thanks to some syntax sugar) that I've written 40k LoC with it, but still. It's machine code.
I'm trying now to build a programming language for Mu that occupies roughly C's niche in modern Unix descendants: statically typed, compiled ahead of time, designed to easily convert to native machine code.
Constraints
My goals diverge from C in a few places:
- Top priority is keeping the compiler implementation tiny and hackable over time.
- I don't care about portability. My machine code is only for Intel x86 processors. I'm willing to stay with that platform, and I don't plan to support processors I can't easily test myself.
- I prioritize memory safety over nice/concise syntax. You won't be able to use math notation like `a + b*c`, but you'll get an error if you ever stride past the end of an array.
- I want to make it safe before I make it fast.
These axioms have a couple of implications:
- To keep the compiler from growing complex over time I'm going to skip the conventional optimization pass.
- To provide memory safety in a small implementation I'm willing to sacrifice some performance for runtime checks.
Basically, when compared to C, I want to trade off performance for safety, and portability for implementation simplicity.
Since the compiler won't optimize code, not just in the initial phase but ever, the language should allow the programmer complete control over performance. It won't let you write utterly unsafe code (bump down to machine code for that), but it shouldn't introduce gratuitous overheads when emitting machine code. C started out simple to translate, but gained declarative keywords like inline and register over time. I want to push much harder than C on being easy to translate. For the most part there should be a 1-to-1 mapping between statements and x86 instructions. Just with more safety.
‘Compiling’ feels too pompous a word for this process. Let's call it just ‘translating’ instead. And it turns out to be surprisingly constraining. If you start out with the goals stated above about the target instruction set and how parsimonious the translation process needs to be, there seems to be exactly one core language you can end up with.
Ingredients
As hinted above, the language has to be exclusively statement-oriented. While I prefer the syntax of purely expression-oriented languages, that's something to build higher up in Mu. Not having to translate expressions to statements avoids the need to detect common sub-expressions and so on.
I said above that I want statements to map 1:1 to instructions. The x86 instruction set doesn't allow instructions to access more than one memory location. The simplest way to work with this restriction is to make registers explicit in our code. Mu will be a manually register-allocated language. It will still check your register allocation and flag an error if multiple variables clobber the same register. But it's up to the programmer to manage registers and spill to memory when necessary. The good news: the programmer can really optimize register use when it matters.
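To make the clobber check concrete, here's a minimal Python sketch of what such a check might look like. The representation and names are invented for illustration; this isn't Mu's actual implementation:

```python
def check_registers(declarations):
    """declarations: list of (variable name, register or None) in scope order.
    Returns a list of error messages for conflicting register uses."""
    errors = []
    live = {}  # register -> variable currently occupying it
    for name, reg in declarations:
        if reg is None:
            continue  # variable lives in memory; nothing to check
        if reg in live:
            errors.append(f"{name} clobbers {live[reg]} in {reg}")
        live[reg] = name
    return errors
```

A real check would also account for variables going out of scope, but the core idea is just a map from register to current occupant.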
I still care about structured control flow. Fortunately I've already proved out a syntax for constructs like `if` and `while` in the previous prototype. The basic idea is to turn this:
if (A) { … }
into this:
var tmp = A
{
  break-unless tmp
  …
}
Basically the `{` and `}` get translated into labels, and `break` gets translated into a jump to the enclosing `}`. Correspondingly, `loop` gets translated to a jump to the enclosing (earlier) `{`. This syntax is surprisingly ergonomic and also surprisingly easy to teach to non-programmers. I wrote 30k LoC in this style in the Mu1 prototype.
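To show the mechanics, here's a hypothetical Python sketch of this lowering, treating statements as strings and generating invented label names (not Mu's actual translator):

```python
def lower_blocks(statements):
    """Lower {, }, break* and loop to labels and jumps."""
    out = []
    stack = []    # labels of enclosing blocks
    counter = 0
    for stmt in statements:
        if stmt == "{":
            counter += 1
            stack.append(counter)
            out.append(f"loop{counter}:")            # target for 'loop'
        elif stmt == "}":
            out.append(f"break{stack.pop()}:")       # target for 'break'
        elif stmt.startswith("break"):
            # e.g. 'break-unless tmp' ⇒ conditional jump to enclosing '}'
            cond = stmt[len("break"):]
            out.append(f"jump{cond} break{stack[-1]}")
        elif stmt == "loop":
            out.append(f"jump loop{stack[-1]}")
        else:
            out.append(stmt)
    return out
```

The stack of open labels is all the state the translator needs; nested blocks fall out for free.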
The language will specify types of variables. The translator need only compare the input and output types of each statement in isolation.
Types let us restrict instructions to support memory safety. `add` can be restricted to numbers, `index` to arrays, and `get` to records (structs). All three may emit the same instruction in machine code, but the logical semantics allow us to forbid arbitrary pointer arithmetic.
In addition to types, variables can also specify their storage location: whether they live in a register, on the stack or in the data segment. The skeleton of the translator starts to take shape now: after parsing and type-checking, a simple dispatch loop that decides what instruction to emit based on both the operation and where the operands are stored.
The one thing I miss most when programming in machine code is the ability to declare a struct/record type and access record fields rather than numeric offsets. Mu will have user-defined types.
One place to loosen the 1:1 restriction is stack management. That's the thing I missed second-most: when I declare local variables I'd like them to be automatically popped off the stack when exiting scope.
We'll introduce runtime checks in two places: checking array bounds and heap lifetimes. Both these will also break the 1:1 restriction. New instructions will get added without any counterpart in the sources.
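As an illustration of the kind of check involved, here's a hedged Python sketch that emits a bounds check before an array access. The pseudo-instructions, the error label, and the assumption that an array's length is stored in its first word are all invented for illustration:

```python
def emit_index_check(array_var, index_reg):
    # The actual address computation would follow these two instructions;
    # one 'index' statement thus emits more than one x86 instruction.
    return [
        f"compare {index_reg}, *{array_var}",    # assume first word of array holds its length
        "jump-if-address-above __bounds_error",  # unsigned compare also catches negative indices
    ]
```

The unsigned-comparison trick folds the `index < 0` and `index >= length` checks into a single conditional jump.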
This is the hazy plan. Still lots of details to work out.
Milestones
I've bitten off more than I can chew in this post. Here's an outline of the broad milestones for this language project:
- Parse core syntax into some intermediate representation.
- Code-generate just instructions with integer operands, for starters.
- Variable declarations and scopes.
- User-defined types.
- Address types. We probably need them. How to make them safe?
- Addresses on the heap, and how to detect when they're reclaimed.
I think I'll sketch out the first three in this post, and then go off to build a language with just integers. Once I'm done with it I'll flesh out the rest.
Core syntax
Machine code consists of instructions with an operation and some number of operands. Conceptually:
op o1, o2, …
There are often constraints on operands of primitive instructions. In x86 most instructions can take at most one memory operand, and most instructions can write to at most one location (whether register or memory).
It seems useful to provide a uniform syntax for primitive and user-defined instructions (function calls).
It would be nice to be able to see at a glance which operands of an instruction are read, written or both. It would also be nice to see at a glance which operands are in registers, and which are in memory. However, specifying both properties explicitly for every single operand seems excessive, both for the implementor and for users. We need some way to show these properties using either position or (yuck) sigils.
Primitive instructions commonly have operands that are both read and written. (We'll call them ‘inout’, extrapolating from ‘input’ and ‘output’.) Inout operands are challenging. It's easy to come up with a syntax for inputs and outputs. For example, Mu1 statements look like this:
out1, out2 ← op in1, in2, …
But I can't think of a clean ‘place’ for inout operands in this scheme. If I had a second dimension for each statement, I might make statements look like triangles or something.
Primitives sometimes write to the memory operand and sometimes to the register operand. Conventional Assembly languages use operand order to distinguish the inout parameter. However, they don't support a syntax for function calls.
Function calls can pass inputs and outputs either on the stack or in registers. Typically the stack can hold only inputs or inout operands. You could move the return address down to make some extra room to write outputs to. But these stack manipulations take up instructions, and calls would lose a core symmetry: it's easy and useful to check that the amount of data pushed before a call matches the amount of data popped off after.
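That symmetry is mechanically checkable. Here's a hypothetical Python sketch that scans a call site and verifies the bytes pushed equal the bytes popped afterward (the instruction strings are invented for illustration):

```python
def check_call_symmetry(instructions):
    """instructions: list like ['push i2', 'push i1', 'call f', 'add 8 bytes to esp'].
    Returns True if the caller pops exactly what it pushed."""
    pushed = 0
    for inst in instructions:
        if inst.startswith("push"):
            pushed += 4            # one word per push on 32-bit x86
        elif inst.startswith("call"):
            continue
        elif inst.startswith("add") and inst.endswith("to esp"):
            popped = int(inst.split()[1])
            return pushed == popped
    return pushed == 0             # nothing pushed, nothing to pop
```

If outputs lived on the stack this invariant would no longer be a straight push/pop comparison, which is part of why I avoid it.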
Function calls can return results in registers, but callers must always know precisely which registers are being written. Separating out output registers gives a false sense of flexibility.
Finally, registers can't hold types larger than a single word. Larger types need memory, and often they need to allocate this memory. If we assume memory management is manual, it makes functions more self-contained and easy to test if they don't do their own memory allocation. Instead the predominant idiom in C and Assembly is for the caller to be responsible for allocating space for results. The callee simply writes into this space.
That's a lot of constraints, and quite heterogeneous! Putting them all together, it seems to me they point toward one obvious syntax:
- Give up on separating inputs from outputs in the general case. Assembly
languages are wise here.
- Allow instructions to show just output registers. These may be
just documentation for function calls. As we saw above, you can't mix and match
arbitrary registers into the outputs of a call. But it's still nice for the
reader to be able to tell at a glance which registers are being modified where.
fn factorial n : int → result/eax : int [
  …
]

# call
x/eax ← factorial 20   # ok
x/ecx ← factorial 20   # error
- Make register outputs of primitives salient:
x/eax ← copy y # translates to opcode 8b because y is stored in memory
Primitive instructions are one case where registers are usually flexible. You can copy a word to any register. Highlighting the output is particularly useful here.
- Put inout registers of primitive instructions only on the output. For
example adding y to x when x is a register:
x/eax ← add y # translates to opcode 03
- Other than registers, memory addresses written to are really inout. Never
put them on the output side. The above two instructions when the destination is
not a register:
copy-to x, y/ecx     # translates to opcode 89 because x is stored in memory
add-to x, y/ecx      # translates to opcode 01 because x is stored in memory
copy-bytes-to x, y   # function call
The -to suffix helps indicate which operand is written. It also seems to suggest a convention of putting outputs first. You could use a -from suffix and keep outputs clearly at the end. But what if you have multiple inputs? It seems more common to have a single output.
Code generation
Once we've parsed a core syntax we need to emit code for each instruction. Function calls are straightforward, given pre-existing syntax sugar. In Mu:
o1, o2 <- f i1, i2
In machine code with syntax sugar:
(f i1 i2) # output registers are implied
In machine code without syntax sugar:
push i2
push i1
call f
add 8 bytes to esp
Since each operand is in a separate instruction there are no constraints on where it can be.
Now for primitives. I'll show a quick sketch of how we translate add instructions to x86 opcodes based on the storage type of their operands. Other instructions will follow similar logic.
The x86 instruction set operates on registers, memory or literals. Data in registers is addressed directly. Data on the stack is accessed at offsets from the stack pointer: `*(esp+n)` where `n` is usually a static value. Global variables are accessed as labels (32-bit addresses). We also capitalize global variables by convention.
Each add instruction takes two operands. Here are the opcodes (and optional sub-opcodes) we care about, from the Intel manual:
| Destination ↓ \ Source → | Literal | Register | Local | Global |
|---|---|---|---|---|
| Register | 81 /0 | 01 | 03 | 03 |
| Local | 81 /0 | 01 | X | X |
| Global | 81 /0 | 01 | X | X |
Alternatively, showing where operands are stored (destination first):
reg    += literal  ⇒  81 0/subop %reg           literal/imm32
stack  += literal  ⇒  81 0/subop *(esp+offset)  literal/imm32
Global += literal  ⇒  81 0/subop *Global        literal/imm32
reg    += reg2     ⇒  01 %reg           reg2/r32
stack  += reg2     ⇒  01 *(esp+offset)  reg2/r32
Global += reg2     ⇒  01 *Global        reg2/r32
reg    += stack    ⇒  03 *(esp+offset)  reg/r32
reg    += Global   ⇒  03 *Global        reg/r32
Other binary operations get translated similarly.
Variable declarations
Mu needs to be able to declare friendly names for locations in memory or registers. How would they work? Again, we'll start with some goals/constraints:
- We'd like to be able to attach types to registers or memory locations, so
that errors in writing incompatible values to them can be quickly caught.
- Registers can only hold a word-sized type.
- We shouldn't have to explicitly deallocate variables when exiting a scope.
- We'd like to allocate vars close to their use rather than at the top of a
function. Ideally we'd like to support scopes in blocks rather than just for
entire function bodies.
- We'd like to be able to exit early and automatically clean up.
- We'd like to avoid unnecessary shadowing. If a variable isn't used anymore it
shouldn't need to be restored.
Here's a declaration for a local variable with a type:
var x : int # turns into a decrement of the stack pointer
You can also initialize a variable with 0's:
var x : int ← 0 # turns into the appropriate number of pushes
Variables on the stack have a single type for their entire lifetimes.
If you're allocating a variable to a register, you can also turn any instruction writing to a single register into an initialization:
var x/ecx : int ← copy y
It's also fine to read from a register and write to a conceptually new variable in the same register:
var opcode/ecx : int ← and inst/ecx 0xff
While register variables have a single type for their entire lifetimes, it's totally fine for a register to be used by multiple variables of distinct types in a single function or block. Put another way, register variables tend to have shorter lifetimes than variables on the stack.
Variable declarations interact with Mu's {} blocks. Here's a variable allocated on the stack:
{
  var x : int ← 0
  …
}
This gets translated to (⇒) the following pseudo machine code (Mu's machine code also supports syntax sugar for {} and structured control flow):
{
  push 0
  …
  increment stack pointer
}
Similarly, a variable allocated to a register:
{
  var x/ecx : int ← copy y
  …
}
⇒
{
  push ecx       # spill register
  ecx ← copy y
  …
  pop to ecx     # restore register
}
Variables don't always need to be shadowed; sometimes they're safe to clobber. Rather than taking on responsibilities of analyzing lifetimes, Mu provides a single simple rule: the first variable in a register shadows previous values, and subsequent variables to the same register in the same block clobber:
{
  var x/ecx : int ← copy y
  var z/ecx : int ← …A…
  …B…
}
⇒
{
  push ecx
  ecx ← copy y
  ecx ← …A…
  …B…
  pop to ecx
}
To shadow, create a new inner block.
Early exits should also clean up any variables in a block.
{
  var x/ecx : int ← copy y
  break-if …A…
  …B…
}
⇒
{
  push ecx
  ecx ← copy y
  {
    break-if …A…
    …B…
  }
  pop to ecx
}
Early returns are more complicated, because we may need to unwind multiple blocks. The Mu translator is responsible for tracking the size of the stack frame at any point, and updating the stack appropriately before returning.
{
  var x/ecx : int ← copy y
  return-if …A…
  …B…
}
⇒
{
  push ecx         # spill register
  ecx ← copy y
  {
    break-unless …A…
    «increment stack appropriately»
    return
  }
  …B…
  pop to ecx       # single restore
}
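Here's a hypothetical Python sketch of the bookkeeping involved: each open block tracks the bytes it has pushed, and an early return discards all of them in one shot before jumping to the function epilogue (names invented for illustration):

```python
class FrameTracker:
    def __init__(self):
        self.blocks = []  # bytes pushed by each currently open block

    def enter_block(self):
        self.blocks.append(0)

    def push_var(self, size):
        self.blocks[-1] += size

    def exit_block(self):
        return self.blocks.pop()   # bytes to pop on normal exit from a block

    def bytes_to_unwind(self):
        return sum(self.blocks)    # bytes to discard on an early return

def emit_early_return(tracker):
    code = []
    n = tracker.bytes_to_unwind()
    if n:
        code.append(f"add {n} bytes to esp")  # discard all locals at once
    code.append("jump $fn:end")
    return code
```

Since every push and pop flows through the translator, this count is always statically known; no frame pointer walking is needed at runtime.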
Wrap up
Putting these ideas together, here's what factorial looks like in a bare-bones but type-safe system programming language targeting x86 (supporting only integers so far):
fn factorial n : int → result/eax : int {
  # if (n <= 1) return 1
  compare n, 1
  {
    break-if >
    return 1
  }
  # otherwise return n * factorial(n-1)
  {
    break-if <=
    # var tmp = n-1
    var tmp/ebx : int ← copy n
    decrement tmp
    # return n * factorial(tmp)
    var tmp2/eax : int ← factorial tmp
    result ← multiply n, tmp2
    return result
  }
}
⇒ (details in the Mu Readme)
factorial:
  # function prologue
  55/push-ebp
  89/<- %ebp 4/r32/esp
  # if (n <= 1) return 1
  81 7/subop/compare *(ebp+8) 1/imm32
  {
    7f/jump-if-greater break/disp8
    b8/copy-to-eax 1/imm32
    e9/jump $factorial:end/disp32
  }
  # otherwise return n * factorial(n-1)
  {
    7e/jump-if-lesser-or-equal break/disp8
    # var tmp = n-1
    53/push-ebx                          # spill
    8b/-> *(ebp+8) 3/r32/ebx
    4b/decrement-ebx
    # return n * factorial(tmp)
    (factorial %ebx)                     # → eax
    f7 4/subop/multiply-into-eax *(ebp+8)
    5b/pop-to-ebx                        # restore
    e9/jump $factorial:end/disp32
  }
$factorial:end:
  # function epilogue
  89/<- %esp 5/r32/ebp
  5d/pop-to-ebp
  c3/return
This isn't as efficient as what I would write by hand, but seems like a realistic translation based on this post.
Ok, time to get to work building it.
Comments gratefully appreciated. Please send them to me by any method of your choice and I'll include them here.